arXiv Papers of Large Language Model Agent

PaperID: 1, https://arxiv.org/pdf/2606.21856.pdf GitHub

Authors: Wangxuan Fan, Xiaoyu Nie, Zhongxiang Dai

Title: Harness-MU: A Safe, Governed, and Effective Harness for Multi-User LLM Agents

Abstract:
The increasing deployment of large language model (LLM) agents in collaborative workflows demands robust multi‑user, multi‑principal interaction mechanisms capable of enforcing access permissions, resolving authoritative conflicts, and preventing unauthorized data disclosure. However, a fundamental mismatch exists between the single‑user training paradigm of contemporary LLMs and the hard constraints required for multi‑principal governance, rendering probabilistic, prompt‑based safeguards vulnerable under multi‑turn adversarial interactions.Our key insight is that governance constraints ‑‑ who is authorized, what is restricted, and whose instructions take precedence ‑‑ are deterministic runtime variables that should be enforced by execution hooks rather than entrusted to the LLM. We present Harness‑MU, the first model‑agnostic, zero‑tuning infrastructure framework for multi‑user LLM agents. By decoupling language generation from safety orchestration, Harness‑MU guarantees unbreakable permission boundaries while maximizing compliant demand satisfaction. Across four frontier open‑weight and proprietary models on the Muses‑Bench benchmark, Harness‑MU achieves the goal of privacy preservation across all access‑control attacks, outperforming the standard baseline by 0.28‑‑0.39 in utility score and improving instruction‑following accuracy by up to 48.9 percentage points. Harness‑MU advances the philosophy of Harness Engineering, establishing that systematic infrastructure is essential for solving LLM multi‑principal governance challenges. The code and data are available at https://github.com/YuanJrShiuan/Harness‑MulUser.

PaperID: 2, https://arxiv.org/pdf/2606.21777.pdf GitHub

Authors: Ashwin Vinod, Ying Ding, Elias Stengel-Eskin

Title: CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

Abstract:
LLM agents in knowledge intensive question answering take retrieval and reasoning actions with incomplete knowledge about whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over‑retrieving when the evidence in hand already suffices, resulting in wasted compute. To give agents a more complete picture of the state space they are operating in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent's state with additional telemetry: a calibrated self‑confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training‑free and training‑based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval in cases where agents over‑rely on parametric knowledge, while cutting redundant retrieval in cases where agents have sufficient context to answer. We show that CalVerT can augment existing QA frameworks without training. Moreover, CalVerT also improves trained systems: by simply augmenting an agent's state with telemetry, we observe improvements after reinforcement learning, as compared to an agent with identical training but no CalVerT telemetry.

PaperID: 3, https://arxiv.org/pdf/2606.21445.pdf GitHub GitHub

Authors: Yang Yue, Xuancheng Zhu, Yuyang Ma, Guoshun Nan, Zihan Dou, Jingru Shan, Congyu Guo, Ji Zhang, Hua Wang, Jingfeng Zhang

Title: AutoRAS: Learning Robust Agentic Systems with Primitive Representations

Abstract:
The automated design of agentic systems offers a promising pathway for scaling large language models (LLMs) beyond single‑agent reasoning. While prior work has advanced task performance through handcrafted or automatically generated multi‑agent workflows, robustness is often treated as an afterthought, leaving systems vulnerable to external adversaries and internal failures. We propose AutoRAS, a framework for the Automated design of Robust Agentic Systems. AutoRAS formulates system design as generating a sequence of symbolic primitives that jointly encode structural connectivity and behavioral actions, and learns to optimize this sequence using execution‑derived safety signals and flow‑based sequence‑level objectives. Extensive experiments show that AutoRAS achieves the best performance in both vanilla and adversarial settings, with the smallest performance degradation under attacks. Further analyses demonstrate strong transferability, stable optimization behavior, stability across primitive sets, and favorable cost trade‑offs. Our code is available at \hrefhttps://github.com/guohezuy/AutoRAS\textthis https URL.

PaperID: 4, https://arxiv.org/pdf/2606.21262.pdf GitHub

Authors: Zihang Tian, Jingsen Zhang, Rui Li, Xiaohe Bo, Yuanzi Li, Xu Chen

Title: ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

Abstract:
Reinforcement learning for multi‑step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric‑based rewards improve interpretability through natural‑language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed‑source judge, leaving step‑level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO‑evolution), a rubric framework in which a same‑scale model μ shares a backbone with two heads: a generation head that produces per‑step criteria, and a score head that predicts rubric‑conditioned step‑level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step‑level labels, while μ and the policy π are jointly updated on on‑policy data so that the rubric content and the scoring function co‑evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open‑source backbones, ARCO improves the best EM in every setting over strong outcome‑, rubric‑, and process‑reward baselines, and analyses show that its rubrics are step‑specific, robust to design choices, and useful for diagnosing agent behavior. Codes and data are available at https://github.com/zihangtian/ARCO.

PaperID: 5, https://arxiv.org/pdf/2606.20922.pdf GitHub

Authors: Shanghao Shi, Xiao Wang, Chaoyu Zhang, Hao Li, Wenjing Lou, Thomas Hou, Yevgeniy Vorobeychik, Chongjie Zhang, Ning Zhang

Title: Think Twice Before You Act: Protecting LLM Agents Against Tool Description Poisoning via Isolated Planning

Abstract:
The integration of external tools has substantially expanded the capabilities of large language model (LLM) agents, but it also introduces new attack surfaces beyond prompt injection. In particular, cross‑tool description poisoning can manipulate planner‑visible tool metadata to steer an agent's trajectory, even if the poisoned tool itself is never chosen. To understand the effectiveness of existing defenses against this emerging threat, we first evaluate several prompt‑injection defenses and find that they transfer poorly to cross‑tool description poisoning. A key observation is that poisoned descriptions persist in the planning context across steps, enabling continuous influence over subsequent tool choices. Building on this insight, we propose Tool‑Guard, a novel system‑level defense based on a new concept called isolated planning, in which tool invocations that are detected as misaligned or suspicious cause the corresponding tool to be placed in a quarantined list (the influenced list), breaking further influence from poisoned descriptions. With this influence isolated, the tool can continue to be used to support the task, enabling a robust defense that preserves legitimate tool utility. Experiments on the AgentDojo and ASB benchmarks show that Tool‑Guard substantially reduces attack success while maintaining high task utility. Our code is available at https://github.com/shishishi123/Tool‑Guard.

PaperID: 6, https://arxiv.org/pdf/2606.20661.pdf GitHub

Authors: Yifan Li, Shengbin Yue, Boyu Feng, Jinhu Qi, Bo Ke, Zixing Song, Hongru Wang, Zhongyu Wei, Irwin King

Title: From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents

Abstract:
The integration of external tools has transitioned LLM agents from passive responders to autonomous systems. However, current benchmarks prioritize execution success, neglecting self‑awareness capability, the ability to discern whether a problem requires necessary external resources or can be solved via internal parametric knowledge. To address this, we introduce KAPRO (Knowing‑Acting Quadrant PRObe), a framework that evaluates cognitive‑behavioral alignment by decoupling an agent's metacognitive judgment (Knowing) from its spontaneous execution (Acting). We further construct KAware, a dataset rigorously partitioning tasks into external, internal, and hybrid subspaces to systematically probe these epistemic boundaries. Extensive experiments across diverse agent architectures show that self‑awareness capability is strongly correlated with task success but degrades sharply in internal‑capability settings. Moreover, open‑source and instruction‑following models exhibit stronger tool overuse due to shallow pattern matching, while proprietary and reasoning‑oriented models demonstrate more reliable cognitive gating. Benchmark and codes are available at https://github.com/AI‑Santiago/KAware.

PaperID: 7, https://arxiv.org/pdf/2606.20642.pdf GitHub

Authors: Tingzhou Wei, Zeyu Zheng, Ethan X. Fang, Junwei Lu

Title: Hypothesis-Disciplined Multi-Agent Automated Formalization of Asymptotic Statistical Theory

Abstract:
Asymptotic statistical theory is a challenging domain for AI‑assisted formalization: its central results mix convergence statements, asymptotic expansions, functional analysis, and regularity conditions that have a large gap from existing infrastructure in Lean 4 formalization. To address these challenges, we propose a hypothesis‑disciplined Lean 4 formalization pipeline built from multiple agents: a manager that coordinates seven specialist roles for proof planning, skeleton scaffolding, Mathlib reconnaissance, proof construction, integration, independent review, and audit. The main methodological discipline is the hypothesis‑disciplined audit, implemented by the Auditor agent: every main‑theorem hypothesis and concept‑layer field must be anchored in the source mathematical prose, justified as a Lean encoding adapter, marked as source‑implied, or rejected as an unsupported strengthening. Using this workflow, we build a systematic formalization of asymptotic statistical theory, especially the parametric and semi‑parametric models' asymptotic distribution and efficiency results. The resulting Lean development is axiom‑clean and source‑faithful, with Lean‑checked and human‑audited proofs of core parametric and semi‑parametric theorems organized so that theorem‑agnostic infrastructure and statistical concept definitions are separated from theorem‑specific assembly. The formalization results are available at https://github.com/junwei‑lu/Lean‑Asymptotic‑Statistical‑Theory.

PaperID: 8, https://arxiv.org/pdf/2606.20629.pdf GitHub

Authors: Yinsicheng Jiang, Liang Cheng, Yeqi Huang, Yufan Zhao, Zhan Lu, Li Dong, Wenda Li, Edoardo Ponti, Luo Mai

Title: Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams

Abstract:
LLM agents are increasingly deployed as multi‑role teams, where tasks are divided across specialized roles such as planner, executor, and verifier. In these systems, cost and accuracy are no longer properties of a single model: they depend on which model fills each role and where it is hosted, including API, self‑hosted, and hybrid deployment. Existing agentic benchmarks typically evaluate fixed models or fixed agent configurations, and therefore offer limited guidance for cost‑accuracy‑optimal deployment. We introduce AgentCARD, a role‑aware benchmark suite for evaluating LLM agent teams across role assignment and deployment mode. AgentCARD combines a role‑decomposed evaluation harness, a unified API/self‑hosted cost model, Pareto‑frontier analysis, and a Shapley‑based diagnostic for identifying role bottlenecks. Our evaluation shows that heterogeneous teams consistently occupy the cost‑accuracy frontier. They improve accuracy by up to 44% over cost‑equivalent homogeneous teams, or match the strongest homogeneous team at up to 12× lower per‑task cost through hybrid deployment. We further find that the best role assignment is domain‑dependent: some domains are planner‑bottlenecked, while others are executor‑bottlenecked. Finally, AgentCARD extends beyond planner‑‑executor teams to workflows with additional roles such as verification, and supports continual evaluation as new domains and team structures emerge. Our code is released at: https://github.com/Auto‑CAP/AgentCAP

PaperID: 9, https://arxiv.org/pdf/2606.20625.pdf GitHub

Authors: Hang Yu, Zifan Zheng, Jeff Z. Pan, Tongliang Liu, Zhiyong Wang, Fengxiang He

Title: AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

Abstract:
LLM agents are promising for alpha mining via combining financial priors, symbolic reasoning, executable factor generation, and feedback‑driven refinement. Yet, they face a combinatorial search space, noisy non‑stationary feedback, redundant discoveries, and overfitting risks from naively reusing past successes. To address these challenges, we propose AlphaMemo, a self‑evolving alpha mining agent with Structured Search‑Process Memory. Rather than memorizing only final factors or full trajectories, AlphaMemo records reusable evidence about which edit motifs work or fail under specific parent‑factor contexts. It extracts motifs from Abstract Syntax Tree (AST) differences, applies confidence‑gated residual memory on top of a search‑ledger prior, and uses asymmetric veto control to suppress high‑confidence failure patterns. Experiments on CSI 500 and S\&P 500 show improved out‑of‑sample performance and fixed‑budget discovery efficiency, with ablations validating the roles of residual learning, confidence gating, AST‑diff motifs, and veto memory. Code is at https://github.com/jarrettyu/AlphaMemo.

PaperID: 10, https://arxiv.org/pdf/2606.20621.pdf GitHub

Authors: Yang Feng, Ziwei Xu, Xia Hu, Fengxiang He

Title: PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

Abstract:
Multi‑agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce Permutation‑Equivariant Adaptive Routing Multi‑Agent Debate (PEAR), an inference‑time protocol that dynamically reconfigures communication roles and sparse topologies across consecutive debate rounds. By strategically switching agent‑to‑role assignments based on evolving agent states, PEAR prevents any agent from permanently occupying a privileged network position or distributes influence more evenly across the debate. We theoretically characterize PEAR as an equivariant sparse router: it preserves accuracy under agent relabeling while reducing routing complexity and improving generalization. Comprehensive empirical evaluations across four reasoning benchmarks and six diverse LLM backbones demonstrate PEAR significantly improves average accuracy over the strongest debate baselines. The code is at https://github.com/EVIEHub/PEAR.

PaperID: 11, https://arxiv.org/pdf/2606.20521.pdf GitHub

Authors: Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

Title: HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Abstract:
Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real‑robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower‑cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real‑robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real‑robot trajectories as pretraining data sources for embodied foundation models, under fixed post‑training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real‑robot action prediction, as well as 52.5% and 90% higher success rates on in‑distribution and out‑of‑distribution real‑robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real‑robot data for action‑space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

PaperID: 12, https://arxiv.org/pdf/2606.20023.pdf GitHub

Authors: Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

Title: When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Abstract:
As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety‑relevant. However, prior tool‑selection studies focus on safety‑agnostic metadata preferences, leaving privilege‑sensitive choices underexplored. To address this gap, we study over‑privileged tool selection, in which an agent selects or escalates to a higher‑privilege tool despite a sufficient lower‑privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher‑privilege tools despite sufficient lower‑privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over‑privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least‑privilege tool choice, while prompt‑level controls provide only limited mitigation under transient failures. We therefore introduce a privilege‑aware post‑training defense that teaches agents to prefer sufficient lower‑privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high‑privilege tool use while preserving general capabilities.

PaperID: 13, https://arxiv.org/pdf/2606.20002.pdf GitHub

Authors: Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

Title: Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Abstract:
This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta‑capability required by long‑lifecycle agents: as an LLM‑based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self‑updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end‑to‑end reinforcement learning (RL) with long rollout sequences interleaving solve‑task and update‑context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta‑capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof‑of‑concept implementations of the CoD framework, including a GRPO‑style RL algorithm with fine‑grained credit assignment, as well as tasks and environments tailored to the targeted meta‑capability (rather than domain‑specific LLM capabilities or standard task‑by‑task RL). Empirical results validate the efficacy of end‑to‑end RL training in the CoD setting, and demonstrate the potential for out‑of‑distribution generalization ‑‑ within the training domains, across different domains, and from CoD to Ralph‑loop settings ‑‑ of the elicited meta‑capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at \urlhttps://github.com/agentscope‑ai/Trinity‑RFT/tree/research/cod/examples/research_cod.

PaperID: 14, https://arxiv.org/pdf/2606.19616.pdf GitHub

Authors: Dipankar Sarkar

Title: Before the Pull Request: Mining Multi-Agent Coordination

Abstract:
Autonomous coding agents now open millions of pull requests, yet large‑scale studies find their PRs are produced faster but accepted less often ‑ a coordination and trust gap that pull‑request‑level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open‑source coordination substrate that needs no central server and stores its records inside git itself, so its append‑only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead ‑ the share of work that merely re‑does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file‑based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes ‑ conflicting edits, lock starvation, redundant rediscovery, race‑to‑close ‑ are automatically recoverable with provenance, several invisible in pull‑request history. We release the dataset, harness, and mining toolkit.

PaperID: 15, https://arxiv.org/pdf/2606.19559.pdf GitHub

Authors: Gregory Matsnev

Title: Uncertainty Decomposition for Clarification Seeking in LLM Agents

Abstract:
Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification‑aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental‑model building. Practical deployment constraints ‑‑ black‑box APIs, interactive latency budgets, and the absence of labeled trajectories ‑‑ rule out logprob‑based, multi‑sampling, and training‑based methods, leaving prompt‑based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt‑based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification‑augmented benchmarks (WebShop‑Clarification and ALFWorld‑Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty‑Aware Memory (UAM) across five LLM backbones (GPT‑5.1, DeepSeek‑v3.2‑exp, GLM‑4.7, Qwen3.5‑35B, GPT‑OSS‑120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld‑Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop‑Clarification and on four of five backbones on ALFWorld‑Clarification, indicating that the gains generalize beyond a single LLM.

PaperID: 16, https://arxiv.org/pdf/2606.19501.pdf GitHub

Authors: Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

Title: DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Abstract:
Decentralized finance exposes supervisors to fast‑moving, networked credit risks. General‑purpose LLM agents fit this setting poorly: they over‑read weak evidence and recommend high‑stakes interventions, while existing evaluations offer no regulator‑aligned way to measure the resulting false alarms. We introduce DeXposure‑Claw, a forecast‑grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure‑FM, a graph time‑series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data‑health and confidence gates constrain escalation before DeXposure‑Claw emits auditable supervisory tickets with rationales. We further develop DeXposure‑Bench, a six‑axis evaluation harness, whose decision axis scores tickets against a regulator‑aligned absolute‑loss ground truth and an explicit false‑intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure‑Claw.

PaperID: 17, https://arxiv.org/pdf/2606.19495.pdf GitHub

Authors: Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

Title: LooseControlVideo: Directorial Video Control using Spatial Blocking

Abstract:
Precise 3D spatial orchestration in text‑to‑video generation remains a significant challenge, particularly for multi‑object scenes where semantic layout and temporal dynamics are often entangled. While existing depth‑conditioned models achieve good structural fidelity, they necessitate dense, frame‑accurate guidance that is labor‑intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high‑level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine‑tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth‑ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO‑3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D‑box and flow‑based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error; 2x improvement in Rigid Motion Consistency; and a 1.5x to 2x increase in Occlusion Accuracy over current state‑of‑the‑art layout‑conditioned models, demonstrating that oriented 3D primitives provide good geometric prior for complex, multi‑agent video authoring.

PaperID: 18, https://arxiv.org/pdf/2606.19382.pdf GitHub

Authors: Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

Title: DynAMO:Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

Abstract:
While LLM‑powered agents offer end‑to‑end automation for industrial asset lifecycles, real‑world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment‑ready engine using a Plan‑then‑Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency‑aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end‑to‑end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency‑aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO

PaperID: 19, https://arxiv.org/pdf/2606.18829.pdf GitHub

Authors: Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Title: GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract:
Memory benchmarks for LLM agents largely assume single‑user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi‑principal shared‑memory agents. GateMem jointly evaluates utility for legitimate long‑horizon requests with state updates, access control across contextual authorization boundaries, and agent‑facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long‑form multi‑party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak‑target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long‑context prompting often yields the best governance score at high token cost, while retrieval‑based and external‑memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

PaperID: 20, https://arxiv.org/pdf/2606.18786.pdf GitHub

Authors: Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

Title: R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Abstract:
Robot soccer is a challenging testbed for multi‑agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long‑horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot‑soccer platform, but its competition‑oriented server‑client architecture is difficult to use directly with modern Python‑based MARL workflows. We introduce R2D‑RL, a reinforcement learning environment that connects RCSS2D and HELIOS‑based player clients to a Python MARL interface through shared‑memory communication and cycle‑level synchronization. R2D‑RL supports full‑field and scenario‑based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)‑based reward shaping, and parallel execution. We provide front‑goal scenarios and an 11‑vs‑11 full‑field benchmark, together with baseline results.

PaperID: 21, https://arxiv.org/pdf/2606.18272.pdf GitHub

Authors: Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

Title: Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

Abstract:
This paper presents an autonomous agentic resource negotiation framework designed to enable zero‑touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over‑provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3‑Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst‑aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail‑latencies. To validate our methodology, we introduce and prove the \emphBimodal Constraint‑Avoidance Utility Theorem, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B‑parameter model otel‑llm‑1b‑it confirm these dual‑regime bounds. Our cognitive de‑biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25%. Crucially, the lightweight 1B LLM achieves sub‑second inference latencies (0.95s mean), ensuring our multi‑agent framework is compatible with the operational timescales of the O‑RAN non‑Real‑Time RAN Intelligent Controller (non‑RT RIC)\footnoteOur source code is available for non‑commercial use at https://github.com/HatimChergui.

PaperID: 22, https://arxiv.org/pdf/2606.18237.pdf GitHub

Authors: Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

Title: ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Abstract:
Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human‑raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model‑agent configurations. Our results show that LLM agents, even without executing code, can identify many real‑world reproducibility problems from paper‑repository pairs: the best agent in our study, namely Codex with GPT‑5.5, surfaces at least one semantically related human‑reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real‑world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

PaperID: 23, https://arxiv.org/pdf/2606.17128.pdf GitHub

Authors: Zhihan Xiao, Hongbing Lang, Zhe Zhao, Luke Ztz Hu, Songping Mai

Title: Shift-Left High-Level Synthesis Verification via Knowledge-Augmented LLM Agent

Abstract:
High‑Level Synthesis (HLS) relies on transforming original C specifications into synthesizable HLS‑oriented C (HLS‑C) implementations. Functional consistency verification between original C specifications and HLS‑C implementations is a critical yet labor‑intensive task in HLS design flows. While Large Language Models (LLMs) have recently shown promise in automated testbench generation, their stochastic nature often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence checking results. To address these limitations, we propose a knowledge‑augmented, agent‑driven shift‑left verification framework for automated functional consistency checking between original C and HLS‑C implementations before synthesis. The framework introduces a Dual‑Tier Consistency Checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage‑driven refinement to improve verification completeness. Furthermore, we construct a heterogeneous HLS Verification Knowledge Graph to provide topology‑aware reasoning priors for testbench generation, and design an autonomous verification agent to orchestrate iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs demonstrate that the proposed framework achieves 0.9826 average coverage and 0.9533 dynamic consistency, outperforming representative AST‑based, retrieval‑augmented, and iterative agent‑based baselines. https://github.com/cz‑5f/HLS‑LeVeri.git

PaperID: 24, https://arxiv.org/pdf/2606.17127.pdf GitHub

Authors: Jay Jung, Xiaohan Zhang, Shenghan Song, Mahmoud Sayedahmed, Chijian Xiang, Yunong Xu, Ahmed AbdelKhalek, Severin T. Schneebeli, Matthew J. Wargo, Jianing Li, Safwan Wshah

Title: Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3

Abstract:
Antimicrobial resistance causes to over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non‑natural amino acids and/or chemical modifications, which are essential for real‑world peptide drugs. We present AMPGAN v3, a multi‑objective conditional GAN that expands the generative vocabulary to D‑amino acids and N/C‑terminus modifications such as amidation. By separating adversarial and activity‑aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram‑positive strains, with the best candidate reaching MIC 8 μg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi‑agent framework for end‑to‑end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: https://github.com/marszzibros/AMPGANv3

PaperID: 25, https://arxiv.org/pdf/2606.17041.pdf GitHub

Authors: Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

Title: Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Abstract:
Meta‑analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO‑guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval‑screening‑synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert‑curated meta‑analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO‑ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol‑driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground‑truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO‑failing distractors in pools of comparable topical relevance. Stage‑attributed metrics capture where systems succeed and fail; a single end‑to‑end score does not.

PaperID: 26, https://arxiv.org/pdf/2606.16133.pdf GitHub GitHub

Authors: Wen-Kao Li, Ze-Feng Gao, Peng-Jie Guo, Wei Ji, Zhong-Yi Lu

Title: InvDesMobility: a reliability-gated first-principles feedback framework for closed-loop materials discovery

Abstract:
Inverse materials design starts from target functionality and searches for structures that can realize it. Its value in closed‑loop discovery depends not only on prediction performance, but also on whether expensive first‑principles results are independently validated, provenance‑recorded, and admitted as feedback only when evidence is sufficient. This is especially important for composite properties such as carrier mobility, where a final scalar value hides intermediate quantities, fit quality, convergence history, and workflow assumptions. Here we present InvDesMobility, a reliability‑gated first‑principles feedback framework that integrates multi‑agent automated DFT, evidence stratification, generative structure proposal, acquisition ranking, and auditable release. Using 516 2DMatPedia‑derived candidates, the workflow produced 280 QC‑passed materials and 573 retained carrier‑direction seed channels after channel‑level reliability gating. These records were split into two feedback objects: relaxed structures updated the generative model, while retained mobility channels trained the acquisition model and set validation priority. Over multiple iterations, InvDesMobility screened 2.4 x 10^6 structures, submitted 102 candidates for DFT validation, and retained 86 reliability‑gated generated channels across 41 formulas. Overall, the main contribution is not a fixed list of high‑mobility materials, but a transferable feedback contract that makes closed‑loop inverse design both useful and auditable when learning from expensive calculated properties. All source data, retained feedback records, and workflows are available at https://github.com/DreamLufei/invDesMobility, with an accompanying evidence website at https://dreamlufei.github.io/invDesMobility/.

PaperID: 27, https://arxiv.org/pdf/2606.15685.pdf GitHub

Authors: Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

Title: Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

Abstract:
Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed‑loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed‑loop control progressively propagates through sequential decision‑making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill‑Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution‑and‑Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real‑world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.

PaperID: 28, https://arxiv.org/pdf/2606.15390.pdf GitHub

Authors: Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

Title: Not All Skills Help: Measuring and Repairing Agent Knowledge

Abstract:
LLM agents can improve without weight updates by accumulating natural‑language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per‑skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per‑skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau‑bench), ASSAY consistently improves over prior skill‑curation approaches. On AppWorld's hardest split, DeepSeek‑V3 achieves 69.3% task‑goal completion (47.4% relative improvement), a new state of the art among all published methods including weight‑tuned approaches. On tau‑bench retail, GPT‑4.1 improves by 8.7% relative, advancing past o4‑mini, o1, and GPT‑4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per‑task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming‑lab/assay.

PaperID: 29, https://arxiv.org/pdf/2606.15242.pdf GitHub

Authors: Yi Xie, Jiawei Du, Yu Cheng, Jiuan Zhou, Zhaoxia Yin

Title: Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems

Abstract:
Skills are becoming the capability layer through which LLM agents turn plans into actions, but their use introduces security risks such as data leakage, unauthorized operations, and tool misuse. Existing vetting usually evaluates each skill in isolation, while real agent tasks often invoke multiple skills in a shared execution context. This creates Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path. We introduce SCR‑Bench to evaluate this risk in controlled, sandboxed skill environments. Rather than relying only on textual intent or surface behavior, SCR‑Bench records downstream state changes and path‑level outcomes across composed skill executions. It contains three sub‑benchmarks: SCR‑CapFlow for capability‑flow composition, SCR‑TrustLift for trust‑transfer composition, and SCR‑AuthBlur for authorization‑confusion composition. Across SCR‑Bench, composed paths expose risks that are largely absent under isolated evaluation. In SCR‑CapFlow, attack success rate reaches 33.6 percent under composition, compared with near‑zero isolated baselines. In SCR‑TrustLift, attack success rate exceeds 96.5 percent on four of five backends. In SCR‑AuthBlur, the risky‑approval rate increases by 71.8 percent relative to the L0 isolated baseline under the L1 context setting. These results show that agent skill security should be assessed at the level of activated paths rather than isolated artifacts. SCR and SCR‑Bench provide a foundation for path‑aware risk evaluation and defense in LLM agent skill ecosystems. Benchmark: https://github.com/saint‑viperx/SCR_Bench.

PaperID: 30, https://arxiv.org/pdf/2606.15057.pdf GitHub

Authors: Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, Yevgeniy Vorobeychik

Title: AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

Abstract:
Indirect prompt injection (IPI) is a major security threat to LLM‑powered agents. Thus, a growing body of work have proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) prompt‑based (using prompting as a way to prevent agents from following malicious instructions), 2) detection‑based (identifying and filtering malicious instructions), and 3) system‑level (using systems insights, such as control and data isolation, for defense). However, commonly used benchmarks for evaluating defense, such as AgentDojo, are \emphinherently static, generating a fixed distribution of IPI attacks. Consequently, static benchmarks do not usefully evaluate defense robustness to adaptive threats. We address this issue by developing AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense. Using AutoDojo against state‑of‑the‑art IPI defenses across three task suites and five target models, we make two key observations. First, many defenses offer only limited protection: a cheap, black‑box adaptive attack using a frontier LLM to iteratively optimize the injection raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. Against a filter that reduces static ASR to 0%, AutoDojo recovers 28% overall and 64% on action‑open tasks. Second, for prompt‑level and filter‑based defenses, ASR is substantially higher on \emphaction‑open tasks ‑‑ where the user's request delegates the action itself to attacker‑controlled content ‑‑ than on precisely specified tasks. This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction‑like text. AutoDojo is publicly available at https://github.com/xhOwenMa/AutoDojo.

PaperID: 31, https://arxiv.org/pdf/2606.14589.pdf GitHub

Authors: Wei Wu

Title: When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Abstract:
LLM agent systems increasingly run as long‑lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal‑assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool‑governance proxy, and a knowledge‑base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root‑cause postmortems, in which one meta‑pattern ‑‑ a failure whose error signal never reaches a human in actionable form ‑‑ manifested at least 28 times. We derive a five‑class, mechanism‑oriented taxonomy: (A) environment and platform quirks, (B) design‑assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error ‑‑ the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail‑plausible: gray failure's differential observability escalated ‑‑ the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user‑view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex‑ante prevention but 87% regression blocking ‑‑ audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity ‑‑ the longest‑lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.

PaperID: 32, https://arxiv.org/pdf/2606.14061.pdf GitHub

Authors: Dongjian Ma, Silin Chen, Yufei Yang, Yuling Shi, Yanfu Yan, Xiaodong Gu

Title: LLM Agents Can See Code Repositories

Abstract:
Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, which differs from how human developers use visual structure such as folder hierarchies and dependency relationships to orient themselves in large codebases. With multimodal large language models (MLLMs), it is an open question whether agents can effectively benefit from visual representations of repositories. This paper presents the first systematic empirical study of visual repository representations for LLM‑based agents on repository‑level issue resolution. We evaluate four recent multimodal models. Our results show that a strictly vision‑only setup degrades accuracy and increases token cost, because agents lack sufficient symbolic detail and compensate with repeated visual queries. In contrast, integrating visual graphs of repository structure as a supplementary modality alongside standard text interfaces helps agents understand structure more efficiently: input token consumption decreases by up to 26% while issue‑resolution accuracy is maintained or improved. Visualization is most useful during fault localization and when the agent autonomously controls exploration depth. These findings point to a practical hybrid text‑and‑vision design for next‑generation coding agents.

PaperID: 33, https://arxiv.org/pdf/2606.13679.pdf GitHub GitHub

Authors: Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

Title: InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract:
Recent image generators have demonstrated impressive photorealism and instruction‑following capabilities in single‑image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text‑image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open‑source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi‑agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image‑text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave‑Planner‑SFT‑80k and Interleave‑Critic‑SFT‑112k to perform a format cold‑start. Then we develop Interleave‑Critic‑RL‑13k to reinforce the step‑wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step‑wise reward, allowing single‑step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT‑5. Surprisingly, it also significantly enhances the base model on reasoning‑based benchmarks; for example, on 4‑step FLUX.2‑klein, we observe substantial gains on WISE and RISE.

PaperID: 34, https://arxiv.org/pdf/2606.13621.pdf GitHub

Authors: Achraf Hsain, Sultan Almuhammadi

Title: Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

Abstract:
Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal‑logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata‑theoretic machinery ‑‑ specification compilation, product game construction, attractor computation, and winning‑region extraction ‑‑ is better read as a design‑time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two‑player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict ‑‑ a formal certificate that a topology‑specification pair is or is not defensible ‑‑ with the associated winning region and shield. Beyond the binary verdict, we derive topology‑level metrics from the attractor structure and combine them with post‑convergence behavior from shield‑constrained adversarial multi‑agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what‑if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.

PaperID: 35, https://arxiv.org/pdf/2606.13598.pdf GitHub

Authors: King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

Title: Reward Modeling for Multi-Agent Orchestration

Abstract:
Multi‑Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self‑supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi‑agent executions to construct win‑lose pairs for Bradley‑Terry reward model training. Unlike existing MAS test‑time scaling and orchestrator training frameworks that rely on costly sub‑agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high‑performing reward‑guided orchestrator training and MAS test‑time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test‑time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web‑based question answering, and multi‑hop reasoning, demonstrating orchestration‑level reward modeling as a scalable direction for robust multi‑agent orchestration. Code will be available at https://github.com/Wang‑ML‑Lab/OrchRM.

PaperID: 36, https://arxiv.org/pdf/2606.13222.pdf GitHub

Authors: Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu

Title: Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

Abstract:
Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self‑other distinction from proprioceptive‑visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self‑model that maps joint configurations to three‑dimensional body occupancy, capturing how the robot's body changes with action. In multi‑agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self‑model, and supports downstream tasks including target reaching, collision‑aware motion planning, and human‑to‑robot motion retargeting. Together, these results outline a route toward bodily self‑representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron‑zc.github.io/humanoid‑self‑model/.

PaperID: 37, https://arxiv.org/pdf/2606.13174.pdf GitHub GitHub

Authors: Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

Title: Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Abstract:
Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real‑user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test‑time Rule Acquisition and Compiled Enforcement (TRACE), a drop‑in skill‑layer pipeline for coding‑agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user‑in‑the‑loop experiments on ClawArena coding‑agent tasks and MemoryArena‑derived memory‑intensive tasks. On ClawArena, TRACE reduces held‑out preference violation from 100.0% to 37.6% on in‑distribution tasks and from 100.0% to 2.0% on out‑of‑distribution tasks. On MemoryArena‑derived tasks, TRACE reduces in‑distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated‑friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.

PaperID: 38, https://arxiv.org/pdf/2606.13049.pdf GitHub

Authors: Luyao Zhang, Ke Li, Yuan Ding, Xulong Zhao, Guo Yu, Chengwei Yan, Fuyu Dong, Jiawei Hu, Di Wang, Nan Luo, Gang Liu, Quan Wang

Title: Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Abstract:
Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground‑based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real‑world mobile deployment platforms for various algorithms. In this paper, we introduce Y‑BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y‑BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural‑language instructions into executable embodied task units that can be carried out by the robot. Y‑BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human‑robot collaboration. With a highly extensible framework, Y‑BotFrame supports plug‑and‑play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real‑world deployment of general‑purpose, instruction‑driven embodied agents.The supplementary video is available at https://xdei‑group.github.io/Y‑BotFrame/.

PaperID: 39, https://arxiv.org/pdf/2606.13038.pdf GitHub

Authors: Haowei Qian

Title: Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Abstract:
As LLM agents proliferate in prediction markets and collective decision‑making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier‑model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight‑dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split‑half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top‑1 retrieval 17‑22% vs. 1% chance); and two of four pre‑specified dimensions rank‑correlate with future realized profit out‑of‑sample, though the correlations do not survive behavioral‑confound controls. Prompt‑level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length‑matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score ‑‑ a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure‑to‑narrative translator emits near‑uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive‑monoculture problem and the limits of a prompt‑level remedy, motivating deeper, below‑the‑prompt injection (fine‑tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous‑paper

PaperID: 40, https://arxiv.org/pdf/2606.12916.pdf GitHub

Authors: Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

Title: MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

Abstract:
Molecular dynamics (MD) is the canonical in‑silico method for atomistic molecular science, simulating molecular behavior from first‑principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial‑and‑error. We automate this expert pipeline‑design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open‑ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in‑context update rule densifies the sparse reward via a multi‑agent debate among physics experts. On three SAMPL host‑guest binding free‑energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet‑lab competition NMR confirms is a high‑affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong‑Wang/MDForge.

PaperID: 41, https://arxiv.org/pdf/2606.12902.pdf GitHub

Authors: Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang

Title: PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

Abstract:
Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech‑to‑text conversion, while end‑to‑end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi‑agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody‑to‑language translation mechanism to stabilize large language model reasoning and enables on‑demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

PaperID: 42, https://arxiv.org/pdf/2606.12900.pdf GitHub

Authors: Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

Title: Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Abstract:
Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero‑source constraint, where no model internals or external references are available, and detection must rely solely on the textual query‑answer pair. In this paper, we propose Human‑like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi‑faceted reasoning of human evaluators. Its core is a Human‑like Criteria Probing (HCP) mechanism, in which a LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion‑specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward‑based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi‑sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state‑of‑the‑art baselines, offering an effective and explainable solution for zero‑source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

PaperID: 43, https://arxiv.org/pdf/2606.12780.pdf GitHub

Authors: Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

Title: ProPlay: Procedural World Models for Self-Evolving LLM Agents

Abstract:
Self‑evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM‑agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure‑level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low‑level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task‑specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self‑evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

PaperID: 44, https://arxiv.org/pdf/2606.12736.pdf GitHub

Authors: Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

Title: Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Abstract:
AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real‑world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent‑agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well‑specified data‑analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self‑directed exploration, and formulate robust solutions for open‑ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

PaperID: 45, https://arxiv.org/pdf/2606.12579.pdf GitHub

Authors: Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

Title: G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Abstract:
Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision‑free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high‑fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector‑field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi‑global trajectory planning, and tighter coupling of the perception‑action loop in real‑time for dynamic cluttered environments with off‑the‑shelf depth sensors. We quantitatively evaluate the computation‑time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real‑world experiments on a 7‑DoF Franka Emika robot. Experimental results demonstrate that our GPU‑based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

PaperID: 46, https://arxiv.org/pdf/2606.12344.pdf GitHub

Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

Title: Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract:
General‑purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE‑bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw‑SWE‑Bench, a multilingual SWE‑bench‑style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue‑resolution instances across 8 languages and 43 repositories, drawn from SWE‑bench‑Multilingual and SWE‑bench‑Verified‑Mini after future‑commit cleanup. We also release Claw‑SWE‑Bench Lite for faster validation, which is an 80‑instance subset selected by a cost‑aware, rank‑aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct‑diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw‑style harnesses to perform coding tasks effectively. Across an OpenClaw × nine‑model sweep and a five‑claw × two‑model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw‑SWE‑Bench therefore treats harness and cost accounting as first‑class axes of SWE‑style coding‑agent evaluation, providing both a full benchmark and a low‑cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw‑swe‑bench and https://huggingface.co/datasets/TokenRhythm/Claw‑SWE‑Bench.

PaperID: 47, https://arxiv.org/pdf/2606.12281.pdf GitHub

Authors: Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu

Title: CCKS: Consensus-based Communication and Knowledge Sharing

Abstract:
In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi‑Agent Reinforcement Learning (MARL), action‑advising‑based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher‑student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus‑based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus‑derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug‑and‑play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi‑Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at https://github.com/yuanxpy/CCKS.

PaperID: 48, https://arxiv.org/pdf/2606.12040.pdf GitHub

Authors: Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

Title: A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Abstract:
The design of reinforced concrete highway barriers is a safety‑critical process that requires strict compliance with regulatory provisions such as the AASHTO‑LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation‑evaluation‑optimization" closed‑loop framework for automated concrete barrier design using the multi‑agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general‑purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B‑parameter lightweight model could outperform unconstrained 631B‑parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI‑assisted engineering tools for industry applications. The source code for the proposed multi‑agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier‑design. Keywords: Structural Engineering; Multi‑Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

PaperID: 49, https://arxiv.org/pdf/2606.12018.pdf GitHub

Authors: Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

Title: MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Abstract:
We propose a multi‑agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi‑modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long‑tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long‑tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test‑Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long‑tail events, Chain‑of‑Thought (CoT) prompting, and self‑reflection. This TTA mechanism is also distillation‑enhanced, utilizing Low‑Rank Adaptation (LoRA) to fine‑tune the foundation model exclusively for instance‑level reasoning. Extensive evaluations against various open‑source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state‑of‑the‑art results. Codes are available at https://github.com/eeee‑sys/MODF‑SIR, demo is available at https://huggingface.co/spaces/Harry‑1234/MODF‑SIR, LoRA is available at https://huggingface.co/Harry‑1234/MODF‑SIR and the dataset for training router is available at https://huggingface.co/datasets/Harry‑1234/IntentRouterTrain.

PaperID: 50, https://arxiv.org/pdf/2606.11762.pdf GitHub

Authors: Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

Title: Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Abstract:
Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain‑agnostic framework for quantifying LLM creativity across open‑ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task‑agnostic assessment. Divergent creativity is measured using semantic entropy, a reference‑free and robust metric for novelty and diversity, validated against human annotations, LLM‑based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval‑based multi‑agent judge framework that delivers context‑sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem‑solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.

PaperID: 51, https://arxiv.org/pdf/2606.11688.pdf GitHub

Authors: Youwang Deng

Title: Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Abstract:
Long‑horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty ‑‑ bounding what an agent may claim at termination ‑‑ as a first‑class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite‑state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No‑False‑Success theorem ‑‑ under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds ‑‑ whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per‑step context cost is constant in the horizon. Across a 3,150‑cell paired corpus (70 tasks × 3 systems × 3 models × 5 seeds, including 50 SWE‑bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38‑‑1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48‑‑9.81] and 25.05% [22.48‑‑27.62] respectively. The headline contrast lives in the hard regime: on SWE‑bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of ‑33.07 pp [95% CI ‑36.53, ‑29.73]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid‑tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design ‑‑ an honest stall is recoverable; a confident wrong output shipped downstream is not.

PaperID: 52, https://arxiv.org/pdf/2606.11543.pdf GitHub

Authors: Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

Title: SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Abstract:
Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi‑trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82‑task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier‑passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task‑dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact‑generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at https://github.com/zhiyuchen‑ai/skill‑juror.

PaperID: 53, https://arxiv.org/pdf/2606.11537.pdf GitHub

Authors: Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Title: MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abstract:
Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textscMOCA‑Agent, a market‑of‑claims code agent that replaces free‑form multi‑agent debate with claim‑level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence‑weighted accept/reject decisions, and synthesizes an executable Python program from market‑supported evidence. A code‑aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market‑aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textscMOCA‑Agent achieves strong performance using a fixed Qwen3.6‑27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart‑Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high‑stakes numerical reasoning.\footnoteThe code and data are available: https://github.com/UBC‑NLP/MoCA‑Agent.

PaperID: 54, https://arxiv.org/pdf/2606.11521.pdf GitHub

Authors: Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali

Title: Counterexample Guided Learning in the Large using Reasoning Agents

Abstract:
LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain‑specific, and difficult to control. We approach this challenge by asking LLMs to perform regular‑expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample‑guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative‑labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample‑guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex‑induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample‑guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier‑guided methods for LLM‑based program synthesis and formal reasoning.

PaperID: 55, https://arxiv.org/pdf/2606.11176.pdf GitHub

Authors: Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Title: Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Abstract:
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non‑experts can trust. A high‑quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data‑science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi‑agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence‑grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human‑agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer‑use agents as judges, a cost‑saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re‑executes statements against the data and checks claims against references. Data2Story produces competitive, evidence‑traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence‑based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

PaperID: 56, https://arxiv.org/pdf/2606.10747.pdf GitHub

Authors: Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech

Title: The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Abstract:
As AI systems built from multiple language‑model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well‑aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi‑agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation‑aware and colluding agents, we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight‑induced misalignment proves hardest to detect, while instruction‑induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget‑aware monitoring can effectively catch misalignment, and that overseeing multi‑agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.

PaperID: 57, https://arxiv.org/pdf/2606.10662.pdf GitHub

Authors: Yuzhen Mao, Azalia Mirhoseini

Title: Decentralized Multi-Agent Systems with Shared Context

Abstract:
Multi‑agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software‑engineering test‑time scaling and long‑context reasoning. On SWE‑bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench‑v2 Multi‑Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.

PaperID: 58, https://arxiv.org/pdf/2606.10423.pdf GitHub

Authors: Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

Title: WebChallenger: A Reliable and Efficient Generalist Web Agent

Abstract:
Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide‑and‑conquer observation pipeline that lets the agent skim section summaries and extract details only from task‑relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi‑step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site‑specific adapters. Using off‑the‑shelf open‑weight models without fine‑tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online‑Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger

PaperID: 59, https://arxiv.org/pdf/2606.10389.pdf GitHub

Authors: Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin, Dou Shen

Title: Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

Abstract:
Recent advances in LLM‑driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi‑agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co‑evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few‑game scores with statistically reliable assessments; and weakness pressure, which dynamically up‑weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation‑model code‑evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture‑the‑flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies ‑‑ including lookahead search and adaptive interception ‑‑ demonstrating that code‑level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU‑evolved strategy further achieved 1st place in the hardware round‑robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real‑world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU‑CoEvo

PaperID: 60, https://arxiv.org/pdf/2606.09961.pdf GitHub

Authors: Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian

Title: 3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Abstract:
Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long‑horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse‑grained approach faces fundamental challenges in multi‑turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose State‑Score‑Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post‑step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step‑wise credit assignment, adaptive rollout and post‑step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per‑state bandit abstraction, we show that the proposed score‑supervised allocation mechanism achieves logarithmic allocation regret and provide sample‑complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5‑1.5B/7B‑Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4× more state exploration and 1.8× faster convergence. Code is available at https://github.com/genalyu/3SPO.

PaperID: 61, https://arxiv.org/pdf/2606.09900.pdf GitHub

Authors: Liuyin Wang

Title: Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Abstract:
Long‑term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround ‑‑ replaying the whole history into the prompt ‑‑ is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full‑context baseline on accuracy, and benchmark numbers are reported on inconsistent, non‑reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open‑source, dual‑process memory engine on a bi‑temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi‑temporal knowledge graph, and resolves contradictions without an LLM call per fact ‑‑ invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point‑in‑time ("as‑of") filter, and assembles a compact, provenance‑tagged context. On the full 500‑question LongMemEval_S, graded by the official category‑specific judge, Engram's lean configuration ‑‑ answering from a ~9.6k‑token retrieved slice, never the full history ‑‑ scores 83.6% vs. 73.2% for full‑context (+10.4 points, McNemar p < 10^‑6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in‑repo evaluation harness with the official judge baked in and the full‑context baseline in every table, publish the raw per‑question logs, and document the measurement‑integrity pitfalls (truncation, home‑grown judges, full‑history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.

PaperID: 62, https://arxiv.org/pdf/2606.09751.pdf GitHub

Authors: Arsalan Shahid, Gordon Suttie, Philip Black

Title: Collaborative Human-Agent Protocol (CHAP)

Abstract:
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi‑human, multi‑agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent‑to‑agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human‑Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non‑repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append‑only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency‑backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap

PaperID: 63, https://arxiv.org/pdf/2606.08728.pdf GitHub

Authors: Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

Title: Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Abstract:
Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule‑based math word problem (MWP) solvers and template‑driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi‑agent systems, neuro‑symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler‑guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training‑time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade‑school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier‑assisted pass@k. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning‑scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified‑discovery workflows, reasoning efficiency, and infrastructure to make AI‑assisted formalization broadly usable. Companion materials: https://github.com/Starscream‑11813/awesome‑AI4Math.

PaperID: 64, https://arxiv.org/pdf/2606.08596.pdf GitHub

Authors: Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu

Title: Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Abstract:
Constructing efficient and reliable policies to assist humans is indispensable for human‑AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi‑agent reinforcement learning (MARL) to learn black‑box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co‑pi‑tree), a closed‑loop method that learns an executable policy tree consisting of a partner‑behavior prediction tree and an agent‑action selection tree. Co‑pi‑tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked‑AI show that Co‑pi‑tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test‑time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co‑pi‑tree/

PaperID: 65, https://arxiv.org/pdf/2606.08348.pdf GitHub

Authors: Xiaojun Wu, Cehao Yang, Honghao Liu, Xueyuan Lin, Wenjie Zhang, Zhichao Shi, Xuhui Jiang, Chengjin Xu, Jia Li, Jian Guo

Title: Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Abstract:
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian‑Agent, a native and cross‑harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian‑Agent records verified trajectory evidence, maintains a feature‑conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model‑facing prompts receive executable guardrails and failure‑mode patches, while posterior summaries remain available for audit. With \textttdeepseek‑v4‑flash, incremental repair improves SOP‑Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin‑Bench from 45% to 65%. We further evaluate Bayesian‑Agent's native backend and optional GenericAgent, mini‑swe‑agent, and Claude Code backends. The results include positive, negative, saturated, and case‑study settings, suggesting that agent skill evolution is best viewed as posterior‑guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian‑Agent.

PaperID: 66, https://arxiv.org/pdf/2606.08340.pdf GitHub

Authors: Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey

Title: Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

Abstract:
As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open‑ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single‑agent tasks, short interactions, or highly structured multi‑agent settings. We introduce alem, a JAX‑based benchmark for open‑ended multi‑agent coordination built on Craftax‑like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long‑horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero‑shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero‑shot Gemini‑3.1‑Pro‑High approaches MARL agents trained for one billion steps, while GPT‑5.4‑High achieves strong base‑task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi‑step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single‑agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem‑world/alem‑env.

PaperID: 67, https://arxiv.org/pdf/2606.08275.pdf GitHub

Authors: Jaineet Shah

Title: Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Abstract:
When an LLM agent fails ‑‑ issues a refund it should not have, calls the wrong tool, leaks data ‑‑ existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM‑judge attribution is correlational and unreliable (state‑of‑the‑art step‑level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do‑operation to a step, and re‑executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single‑step contrastive estimator whose point‑of‑commitment rule resolves a confound specific to stochastic run‑forward, and a budget‑bounded Monte‑Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two‑step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.

PaperID: 68, https://arxiv.org/pdf/2606.08151.pdf GitHub

Authors: Xinyu Guan, Qianyang Zhao, Yuming Deng

Title: Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Abstract:
Tool‑using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision‑aware context layer that turns instance evidence into a context graph, routes deterministic, Opus‑assisted, Qwen, Codex/GPT‑5.5, and Qwen‑QLoRA judgments through a shared eight‑field schema, scores units by action shift, outcome uplift, necessity, and negative‑transfer risk, and packs high‑utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open‑benchmark gain while exposing its limits. On 50 SWE‑bench Verified file‑retrieval instances, direct Qwen3.6‑plus reranking of BM25 top‑50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action‑criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top‑utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen‑QLoRA agreement over 710 candidates, a small 200‑label real‑code Opus‑assisted signal, and a three‑instance patch smoke validating retrieval‑to‑patch plumbing without claiming official SWE‑bench success. RepoBench‑R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision‑critical context, not an end‑to‑end coding‑agent repair claim.

PaperID: 69, https://arxiv.org/pdf/2606.08146.pdf GitHub

Authors: Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu

Title: SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

Abstract:
Fraud detection in payment, e‑commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fall at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network‑based methods require pre‑defined relational graphs and remain opaque at the individual‑decision level; and the design of general‑purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real‑world fraud detection. In this paper, we propose SAGE, the first end‑to‑end LLM‑driven multi‑agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six‑layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural‑language gradients, automatically optimizing the model under a fraud‑specific reward. On five fraud datasets and five LLM backbones, SAGE wins 96.00% of method‑‑dataset comparisons and improves F1 by an average of 40.86% over baselines. The code is available at https://github.com/yichenC1c/SAGE.

PaperID: 70, https://arxiv.org/pdf/2606.07939.pdf GitHub

Authors: Shuyao Gao, Minghao Huang

Title: Stable Geometry, Reversing Poles: The Bipolar Structure of AI Occupational Substitutability and Its Decade-Scale Inversion

Abstract:
Empirical research on the labor‑market impact of artificial intelligence has converged, since Frey and Osborne (2017), on a continuous‑gradient representation in which each occupation is assigned a real‑valued exposure score on [0,1] obtained by linear aggregation across capability dimensions. This continuity is rarely articulated as an assumption and has not been tested at the micro‑action level where substitution actually occurs. We decompose 1,961 ONET Detailed Work Activities into 15,817 micro‑actions using a multi‑agent LLM pipeline with 31‑expert HITL calibration, then project the DWA‑level Occupational Automation Index from our prior work onto a 7‑macro semantic typology. The result is a bipolar structure. Tool‑Mediated Physical (M2, mean OAI = 0.054) and Planning & Design (M7, mean OAI = 0.499) form two extremes separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e‑34). The geometry is robust under three independent stress tests: resolution (K=7 to K=15, polar gap widens from 0.45 to 0.57), encoder swap to BGE (LLM‑class OAI lead replicates at 3.37x), and Eloundou's GPT‑4 task ratings (DWA‑level rho = 0.635). The six middle macros form a low‑contrast band between the poles (TOST at d=0.2 admits only 1/15 pairs as equivalent), not a flat plain. The geometry's stability does not, however, extend to its content. Across a decade, the polarity has inverted. Frey‑Osborne (2013) placed Tool‑Mediated Physical near the highest computerisation risk and Planning & Design near the lowest; our LLM‑era OAI reverses that order, with macro‑level FO‑Eloundou Spearman rho = ‑0.750, p = 0.020, against the original Oxford Martin appendix. Which pole is high is therefore contingent on the era's dominant capability frontier, while the stable geometry itself is the structurally robust object.

PaperID: 71, https://arxiv.org/pdf/2606.07867.pdf GitHub

Authors: Chung-En Sun, Linbo Liu, Tsui-Wei Weng

Title: The Cold-Start Safety Gap in LLM Agents

Abstract:
Are tool‑calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks ‑‑ a phenomenon we term the cold‑start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9‑‑52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety‑aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open‑source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API‑Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety‑critical requests mitigates the cold‑start safety gap. Our code is available at https://github.com/Trustworthy‑ML‑Lab/Agent‑Cold‑Start‑Safety‑Gap

PaperID: 72, https://arxiv.org/pdf/2606.07810.pdf GitHub GitHub

Authors: Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

Title: SLMJury: Can Small Language Models Judge as Well as Large Ones?

Abstract:
Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed‑ended binary correctness and open‑ended quality scoring. We benchmark 16 SLM judges (0.6B‑14B parameters) from four model families across ten benchmarks: eight closed‑ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT‑Bench for summarization and conversational scoring. We formalize judging as a budget‑conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain‑dependent: for most judges quick 10‑token verdicts match or beat extended reasoning on mathematical judging (by 2‑7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math‑to‑general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed‑ended and open‑ended judging draw on different capabilities: the best binary judge (Phi‑4) drops to rank 9 on MT‑Bench, while reasoning‑trained models invert this ordering. (4) Under the Reflect‑Critique‑Refine (RCR) debate protocol, multi‑agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

PaperID: 73, https://arxiv.org/pdf/2606.07636.pdf GitHub

Authors: Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu

Title: Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Abstract:
Editing a long‑form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post‑production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open‑source multimodal multi‑agent system for prompt‑driven video editing. Crayotter organizes production into three phases: coverage‑aware material preparation, artifact‑based editing research, and tool‑grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut‑Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

PaperID: 74, https://arxiv.org/pdf/2606.07397.pdf GitHub GitHub

Authors: Yifan Duan, Qixiang Xu, Hengtao Wu, Zhanxun Liu, Wenhao Guan, Junxi Liu, Ziyang Ma, Kelu Xu, Xie Chen

Title: Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

Abstract:
In recent years, audio generation has made significant progress in tasks such as text‑to‑speech (TTS), text‑to‑audio (TTA) and text‑to‑music (TTM). However, generating long‑form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post‑production. In this work, we introduce Audio‑Oscar, a multi‑agent framework for generating audio from complex descriptions. Audio‑Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine‑grained timeline planning, model selection, non‑speech generation, and audio post‑production. Audio‑Oscar further incorporates feedback‑driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG‑Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text‑only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio‑Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio‑Oscar.

PaperID: 75, https://arxiv.org/pdf/2606.07089.pdf GitHub

Authors: Yinzhou Tang, Jingbo Xu, Yu Shang, Zihao Song, Chen Gao, Wei Wu, Yong Li

Title: Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

Abstract:
World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long‑horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high‑level action prediction, while visual reasoning is critical during fine‑grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real‑world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state‑of‑the‑art embodied policies. Codes and demos are available at: https://adawam.github.io/.

PaperID: 76, https://arxiv.org/pdf/2606.06741.pdf GitHub

Authors: Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Title: OpenSkill: Open-World Self-Evolution for LLM Agents

Abstract:
Self‑evolving agents requires adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open‑world deployments may provide none of these, offering only a task prompt. In this work, we study open‑world self‑evolution, where an agent must build both its skills and its own verification signals from scratch, using open‑world resources but no target‑task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self‑built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision‑independent practice environment, with target‑task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no‑supervision constraint. Analysis shows its skills transfer across models without model‑specific adaptation, and its self‑built verifier aligns with ground‑truth outcomes despite never accessing them.

PaperID: 77, https://arxiv.org/pdf/2606.06673.pdf GitHub

Authors: Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh

Title: Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

Abstract:
Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty‑Aware LLM‑Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty‑modulated behavioral guidance. ULPS employs an A‑based oracle to synthesize optimal symbolic trajectories, which are used to fine‑tune a BERT‑based language model. During training, this model supplies action suggestions whose influence is conditioned on epistemic uncertainty estimated via Monte Carlo (MC) dropout. An entropy‑based blending mechanism adaptively balances LLM guidance and the learned policy (via Proximal Policy Optimization, PPO), allowing the agent to prioritize reliable priors while preserving adaptability. We evaluate ULPS on the MiniGridUnlockPickup benchmark and observe consistent improvements in success rate, reward efficiency, and sample complexity over unguided, uncalibrated, and standard RL baselines. ULPS achieves more than 9% improvement in execution accuracy after fine‑tuning, requires fewer environment interactions, and yields higher reward AUC. Our results demonstrate that integrating symbolic A trajectories, pretrained language priors, and uncertainty‑aware control offers a principled and effective approach to multi‑task reinforcement learning in sparse‑reward domains, with potential extensibility to partially observable and multi‑agent settings.

PaperID: 78, https://arxiv.org/pdf/2605.30326.pdf GitHub

Authors: Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Title: RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Abstract:
The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real‑world environments. However, current robotic benchmarks primarily emphasize skill‑level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi‑manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high‑quality reasoning‑centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi‑agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly‑based reasoning. We benchmark popular robot policies, pre‑trained VLAs, and oracle‑state planners. Our results reveal a significant performance gap: while pre‑trained VLAs exhibit preliminary success on seed tasks after single‑task fine‑tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass‑embodied‑agi.github.io/RoboWits.

PaperID: 79, https://arxiv.org/pdf/2605.30058.pdf GitHub

Authors: Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

Title: HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Abstract:
While LLM agents have demonstrated remarkable task‑oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human‑like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical‑style episodic memories distributed across theory‑grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision‑making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple‑choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human‑like emotions, personality consistency, and value‑consistent behavioural decision‑making in LLM‑based agents.

PaperID: 80, https://arxiv.org/pdf/2605.30003.pdf GitHub

Authors: Víctor Gallego

Title: Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

Abstract:
We study two‑level autoresearch for cooperation: an outer‑loop AI agent autonomously redesigns the inner‑loop pipeline of an LLM policy‑synthesis system for multi‑agent Sequential Social Dilemmas (SSDs). A researcher agent \mathcalR (run as a coding agent) reads the inner‑loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy‑synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand‑designed baselines, sharply tightens run‑to‑run variance, and outperforms prompt‑only optimization. The discovered pipelines are objective‑dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective‑agnostic system prompt and from every efficiency‑optimized pipeline. This supports an information‑design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch‑social‑dilemmas.

PaperID: 81, https://arxiv.org/pdf/2605.29893.pdf GitHub

Authors: Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

Title: Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

Abstract:
LLM‑based agents have demonstrated strong capabilities in solving complex tasks through multi‑step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within trajectory is redundant or necessary. Our results show that even the best‑performing method achieves only 24.88% score in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. \footnoteCode and dataset in this paper are both available in \hrefhttps://anonymous.4open.science/r/RedundancyBenchhttps://anonymous.4open.science/r/RedundancyBench.

PaperID: 82, https://arxiv.org/pdf/2605.29874.pdf GitHub

Authors: Francisco León Zúñiga Bolívar

Title: Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Abstract:
Do next‑generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi‑agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT‑4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025‑2026 ‑ Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT‑5.4 Mini ‑ applying the identical protocol across three prompting styles (Default, Prose, Self‑Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model‑prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross‑provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT‑5.4 Mini reaches 70% cooperative equilibria under Self‑Refine. Support for aggressive capability parity is partial (H2): Self‑Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross‑study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

PaperID: 83, https://arxiv.org/pdf/2605.29705.pdf GitHub

Authors: Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

Title: BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Abstract:
Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi‑agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language‑based trajectory representations. However, these LLM‑based predictors are extremely memory‑ and compute‑intensive, making them difficult to deploy on resource‑constrained edge devices such as on‑board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM‑based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight‑only quantization to 1.58‑bit (BitTP‑Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio‑temporal reasoning. Empirically, BitTP‑Weight not only preserves but improves prediction quality over the full‑precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM‑based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.

PaperID: 84, https://arxiv.org/pdf/2605.29692.pdf GitHub

Authors: Wenxin Xu, Chen Jason Zhang, Xiaoyong Wei, Haoyang Li, Hwanhee Kim, Yuanfeng Song, Raymond Chi-Wing Wong

Title: Towards Reliable Agentic Progressive Text-to-Visualization with Verification Rules

Abstract:
Text‑to‑Visualization (Text‑to‑Vis) translates natural language queries into visualization query languages, enabling non‑expert users to perform data analysis. However, most existing methods follow a one‑shot paradigm that requires users to specify all visualization details in a single round, often leading to cognitive overload and incorrect visualizations. In this paper, we propose PMVis, a progressive multi‑turn paradigm for text‑to‑vis, where users' intents are refined through multi‑turn interactions. To support research in this paradigm, we construct PMVisBench, the first dataset designed to capture the progressive and iterative nature of real‑world user queries. It is built through VQL simplification and NLQ reconstruction, with explicit rule constraints to ensure each intermediate VQL remains valid and meaningful. Building upon PMVis, we further introduce PMVisAgent, an agent‑based framework that simulates realistic user‑system dialogues. PMVisAgent consists of a User, a System, and a Validation Agent that performs verification and repair via a ReAct‑style tool‑use loop to mitigate error accumulation across rounds, with explicit interaction and verification rules to ensure reliability of the multi‑agent system. Extensive experiments on PMVisBench demonstrate that PMVisAgent significantly outperforms state‑of‑the‑art text‑to‑vis baselines. It achieves up to 17.57% and 23.21% improvements in execution accuracy in single‑table and multi‑table settings, respectively, while ablation studies confirm the importance of combining progressive interaction with clarification. The code is available at https://github.com/wxxv/PMVis.

PaperID: 85, https://arxiv.org/pdf/2605.29643.pdf GitHub

Authors: Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan

Title: AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

Abstract:
Cross‑Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single‑pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi‑agent framework that treats CVR as an active evidence‑acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script‑Simulated RL, which optimizes the agent's policy with LLM‑generated semantic scripts and a lightweight text‑based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single‑pass baselines and achieves comparable performance to state‑of‑the‑art closed‑source systems, particularly in complex cross‑video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang‑jh24/AgentCVR.

PaperID: 86, https://arxiv.org/pdf/2605.28773.pdf GitHub

Authors: Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

Title: Rethinking Memory as Continuously Evolving Connectivity

Abstract:
Existing memory‑augmented LLM agents often treat memory as a static repository with pre‑defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity‑evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback‑driven refinement, and long‑term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state‑of‑the‑art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open‑sourced in https://github.com/zjunlp/LightMem.

PaperID: 87, https://arxiv.org/pdf/2605.28699.pdf GitHub

Authors: Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

Title: TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Abstract:
Large language models increasingly rely on either reinforcement learning or multi‑agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single‑agent reinforcement learning to multi‑turn multi‑agent systems faces following dilemmas: i) Sparse rewards, role‑level free‑riding and excessive training overhead. ii) Agents only imitate to collaborate. iii) Fixed collaboration protocol falls into oscillating local optimum. We introduce TRACER, a turn‑level reinforcement framework for cooperative multi‑LLM reasoning. TRACER separates collaborative decision making into a controller‑regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation‑credit layer, which optimizes proposer and reviewer utterances with role‑specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free‑riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to utter and what to speak. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL‑style methods on the GSM8K training split and evaluate on held‑out GSM8K, MATH500, and GPQA‑Diamond to measure in‑domain accuracy, cross‑benchmark generalization, inference cost, and correction‑preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark‑Forest/TRACER.

PaperID: 88, https://arxiv.org/pdf/2605.28629.pdf GitHub

Authors: Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, Zhuosheng Zhang

Title: Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile‑using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over‑execution. Previous studies solve it by training a interactive mobile‑using agents to let agents request human interaction when agents can not complete user instructions. However, we find that these interactive agents tend to exhibit over‑soliciting behavior, relying excessively on human intervention. To mitigate both over‑execution and over‑soliciting, we propose a universal confidence integration framework that enables confidence‑driven proactive and robust interaction in MLLM‑based mobile‑using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine‑tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show Mobile‑Aptus achieves state‑of‑the‑art performance on the four popular mobile‑using agent benchmarks: OS‑Kairos, AITZ, Meta‑GUI, and AndroidControl. Mobile‑Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement over 17% in task success rate. In real‑world dynamic experiments, Mobile‑Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The codes are available at https://github.com/Wuzheng02/Mobile‑Aptus.

PaperID: 89, https://arxiv.org/pdf/2605.28548.pdf GitHub

Authors: Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

Title: GEM: Generative Supervision Helps Embodied Intelligence

Abstract:
Embodied Vision‑Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision‑Language‑Action frameworks. However, a significant gap remains between the high‑level semantic focus of standard text‑guided pre‑training paradigms and the low‑level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative‑supervised Embodied vision‑language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre‑training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM‑4M, a comprehensive large‑scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high‑quality depth supervision. Extensive experiments demonstrate that GEM achieves state‑of‑the‑art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM‑VLA, exhibits vastly superior task execution abilities in both simulation environments and real‑world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/

PaperID: 90, https://arxiv.org/pdf/2605.28534.pdf GitHub

Authors: Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

Title: GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Abstract:
Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real‑world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi‑agent scaffolding or conventional post‑training paradigms, such as Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL). However, post‑training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI‑CIDER, a mid‑training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density‑aware Exemplar Reselection. GUI‑CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid‑training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI‑CIDER consistently improves both the agent's understanding of GUI operations and its task success rates.The codes are available at https://github.com/Wuzheng02/GUI‑CIDER.

PaperID: 91, https://arxiv.org/pdf/2605.28369.pdf GitHub

Authors: Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang

Title: CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

Abstract:
E‑commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E‑commerce dispute verdicts require grounding pivotal clues from redundant, multi‑round, multimodal evidence and making decisions under flexible platform‑specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E‑commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real‑world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi‑agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain‑of‑Thought decomposes the EDV task into four structured reasoning stages, enabling fine‑grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi‑round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state‑of‑the‑art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real‑world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.

PaperID: 92, https://arxiv.org/pdf/2605.28120.pdf GitHub GitHub

Authors: Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

Title: LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi‑granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence‑based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error‑prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi‑agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state‑of‑the‑art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

PaperID: 93, https://arxiv.org/pdf/2605.28104.pdf GitHub

Authors: Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

Title: Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

Abstract:
Recent years have witnessed the rapid development of Large Language Model‑based Multi‑Agent Systems (MAS), which excel at collaborative decision‑making and complex problem‑solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi‑round interactions. Furthermore, we introduce Sentence‑Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

PaperID: 94, https://arxiv.org/pdf/2605.28077.pdf GitHub

Authors: Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

Title: MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

Abstract:
Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision‑language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi‑agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM‑guided architecture. The planning and perception layers use flexible, fine‑grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state‑of‑the‑art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi‑step and tree‑structured reactions.

PaperID: 95, https://arxiv.org/pdf/2605.27957.pdf GitHub

Authors: Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

Title: DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

Abstract:
Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi‑step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi‑agent planning over semantically similar but operationally distinct disaster‑response tools. To enable step‑level failure attribution, we further propose First‑Point‑of‑Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter‑binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution‑grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open

PaperID: 96, https://arxiv.org/pdf/2605.27955.pdf GitHub

Authors: Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

Title: Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

Abstract:
Markdown skill libraries for LLM agents ship as free‑form prose, forcing the agent to re‑derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused ‑> re‑retrieve ‑> still confused" loop in which the agent issues a partially‑correct action, receives uninformative environment feedback, and re‑retrieves the same prose. We propose Skill‑as‑Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four‑check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134‑game ALFWorld unseen split with gpt‑4o‑mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph‑of‑Skills (GoS) baseline (pooled McNemar p = 8.2e‑5), at ‑22.8 +/‑ 6.4% input tokens and ‑14.5 +/‑ 4.1% LLM calls per game.

PaperID: 97, https://arxiv.org/pdf/2605.27919.pdf GitHub

Authors: Junlin Wang

Title: Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

Abstract:
Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high‑frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion‑based policies, where iterative denoising steps can inadvertently amplify high‑frequency artifacts at the expense of meaningful fine‑grained details. To address these limitations, we present a novel frequency‑based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion polices by progressively driving the noisy samples through intermediate sub‑frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency‑guidance‑operator/

PaperID: 98, https://arxiv.org/pdf/2605.27898.pdf GitHub

Authors: Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

Title: A Unified Framework for the Evaluation of LLM Agentic Capabilities

Abstract:
As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross‑benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction‑‑tool‑‑environment format, executes agents through a fixed ReAct‑style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task‑success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision‑ and execution‑level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single‑agent, multi‑agent, and safety‑critical scenarios, and conduct a large‑scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework‑ and environment‑induced artifacts. We further demonstrate its extensibility as a secure testbed for safety‑critical domains. Codes and benchmarks at are available at https://github.com/whfeLingYu/A‑Unified‑Framework‑for‑the‑Evaluation‑of‑LLM‑Agentic‑Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework.

PaperID: 99, https://arxiv.org/pdf/2605.27466.pdf GitHub

Authors: Nicole Koenigstein

Title: AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Abstract:
Multi‑agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one‑off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open‑source framework that treats multi‑agent coordination as an online policy‑learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed‑systems incident tasks and security‑advisory tasks. The evaluation shows three main results: learned routing reaches a higher‑quality operating point than a fixed pipeline baseline on coordination‑heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm‑started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination‑heavy multi‑agent workflows over static wiring.

PaperID: 100, https://arxiv.org/pdf/2605.27378.pdf GitHub

Authors: Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

Title: OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

Abstract:
Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real‑world clinical workflows. In this paper, we present OralAgent, the first dental‑specialized AI agent that unifies multimodal reasoning, tool‑based decision‑making, and knowledge‑grounded retrieval within an end‑to‑end automated framework. It integrates 22 visual analysis tools and 368 widely‑used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi‑step workflow execution. Furthermore, we introduce OralCorpus, a large‑scale, high‑quality bilingual textual resource containing 134.8M tokens curated for dental retrieval‑augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA‑ZH, a Chinese multiple‑choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state‑of‑the‑art performance on the MMOral‑Uni, MMOral‑OPG, and OralQA‑ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real‑world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

PaperID: 101, https://arxiv.org/pdf/2605.27328.pdf GitHub

Authors: Mariano Garralda-Barrio

Title: Governed Evolution of Agent Runtimes through Executable Operational Cognition

Abstract:
Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as \emphCode as Agent Harness frames validated agent‑generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long‑running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under‑specified. This paper proposes a framework for governed runtime evolution in multi‑agent systems through executable operational cognition. We formalize agent‑generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce \emphHarnessMutation as a governed mechanism for lifecycle‑aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self‑modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance‑oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.

PaperID: 102, https://arxiv.org/pdf/2605.27068.pdf GitHub

Authors: Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

Title: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Abstract:
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text‑only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open‑source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance‑level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground‑truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language‑action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross‑model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA‑Academia‑Attractions/QUACK.

PaperID: 103, https://arxiv.org/pdf/2605.26596.pdf GitHub

Authors: Haoran Zhang, Zhaohua Sun

Title: AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

Abstract:
The token‑level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token‑level method families, every cell collapses to mean reward <= 0.05 despite 1.3‑13.3x realized compression. We name and characterize this failure mode as action‑grammar destruction ‑‑ the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self‑information ranks lowest, so a general‑purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step‑granularity compression. We introduce AGORA, an inference‑free step‑level compressor combining a structural prompt parser, an always‑keep floor for format‑ and recency‑critical content, and a 125M‑parameter relevance scorer trained on counterfactual next‑action‑change labels (~2ms/step, zero per‑step LLM toll). Across the compared inference‑free and LLM‑based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four‑way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0‑11.5x adaptive end‑to‑end compression from a single fixed keep ratio.

PaperID: 104, https://arxiv.org/pdf/2605.26457.pdf GitHub

Authors: Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

Title: Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Abstract:
AI coding agents are increasingly used to write real‑world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine‑checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus‑SpecBench, a benchmark of 581 spec‑writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus‑SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert‑written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus‑SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1‑‑57.8% & OSS models reach only 21.5‑‑25.5%. Our analysis of failure modes shows that model‑generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM‑as‑a‑judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal‑verif‑is‑cool/verus‑spec‑gym

PaperID: 105, https://arxiv.org/pdf/2605.26451.pdf GitHub

Authors: Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang

Title: Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

Abstract:
Producing presentation slides automatically entails coordinating narrative structure with page‑level graphic design under strict spatial constraints. For such structured multimodal tasks, a well‑organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout‑design capabilities of LLMs and bypassing the essential slide‑page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide‑page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi‑agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at https://github.com/sxswz213/DeepSlides.

PaperID: 106, https://arxiv.org/pdf/2605.26298.pdf GitHub

Authors: Cong Wang, Yusheng Zheng

Title: Sandlock: Confining AI Agent Code with Unprivileged Linux Primitives

Abstract:
AI agents increasingly run untrusted code on developer machines: shell commands generated by language models, third‑party scripts retrieved at runtime, and tool plugins of unknown provenance. Existing isolation mechanisms impose tradeoffs that fit this workload poorly: containers and microVMs add privilege, image‑management, and startup costs, while ad‑hoc process controls and wrappers (e.g. chroot, ulimit) provide weak guarantees and little syscall‑level control. Sandlock is a lightweight Linux process sandbox organized around a simple split: static, input‑independent policy is compiled into kernel‑enforced rules, while a narrow supervisor handles runtime‑dependent decisions and virtualized effects. This split lets Sandlock enforce filesystem, network, IPC, and syscall policies without root, cgroups, images, or mandatory namespaces. It also supports dynamic network decisions, HTTP‑level access control, TOCTOU‑safe inspection of execve arguments, and reversible filesystem effects. On our workstation, Sandlock adds roughly 5 ms of startup overhead and runs Redis at bare‑metal throughput (within measurement noise); its pipeline operator further supports per‑stage confinement for separating data, network, and untrusted‑content capabilities. Sandlock is available at https://github.com/multikernel/sandlock

PaperID: 107, https://arxiv.org/pdf/2605.26239.pdf GitHub

Authors: Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan

Title: Sentinel: Embodied Cooperative Spatial Reasoning and Planning

Abstract:
In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city‑scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city‑scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high‑level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city‑level scenes with 3‑5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi‑agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi‑agent systems. Code and challenge are available at https://github.com/UMass‑Embodied‑AGI/Sentinel.

PaperID: 108, https://arxiv.org/pdf/2605.26186.pdf GitHub

Authors: Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu

Title: SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Abstract:
Functionality‑correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository‑specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification‑strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross‑repository experience transfer, (2) multi‑step trial‑and‑repair under non‑invertible state changes, and (3) robust verification of setup outcomes to distinguish setup‑induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning‑based setup framework. First, we construct a Self‑Evolving Experience Representation (XPU), a dual‑modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience‑Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known‑good states. Third, we introduce a Prosecutor‑Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build‑time metrics. Evaluation results on carefully‑crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi‑repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.

PaperID: 109, https://arxiv.org/pdf/2605.25376.pdf GitHub

Authors: Kolawole Quadri

Title: KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

Abstract:
KYA (Know Your Agents) is an open‑source, framework‑agnostic trust and governance layer for autonomous systems, composed of five primitives: (1) a four‑gate inbound apply pipeline; (2) an only‑tighten composition algebra over a three‑channel multi‑tenant hierarchy; (3) KYP (Know Your Principal), a schema‑level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction‑multiplier amplification over an AIVSS‑shaped additive baseline; and (5) two‑axis delegation attribution: a static premium for risky delegates and a runtime debit for actual delegate misbehavior in multi‑agent fan‑out. Together these span three pillars (trust, governance, and evidentiary assurance), making an autonomous system's actions authorized, policy‑conforming, and post‑hoc verifiable: where observability answers how long, how much, and what path, KYA answers was it authorized, did it conform, and can it be verified; it composes with observability rather than replacing it. It ships native adapters for 15+ agent frameworks. On a 4 by 9 cross‑backend matrix all 36 cells pass; the pure‑function scorer runs sub‑millisecond at p99 and the system sustains ~ 1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end‑to‑end. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently‑published topology‑guided multi‑agent attack. The system is available under Apache 2.0 as the veldt‑kya package on PyPI.

PaperID: 110, https://arxiv.org/pdf/2605.25250.pdf GitHub

Authors: Leshu Li, An Lu, Haiyu Wang, Zhibin Feng, Conghui Duan, Qing Bao, Zongmin Zhao, Sai Qian Zhang

Title: LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

Abstract:
Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision‑level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety‑aware multi‑agent LLM framework for lipid discovery. LipoAgent combines domain‑specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi‑agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet‑lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI‑Lab‑NYU/LipoAgent.git.

PaperID: 111, https://arxiv.org/pdf/2605.24775.pdf GitHub

Authors: Sasank Annapureddy

Title: PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

Abstract:
Operating LLMs as coordinated multi‑agent research systems over multi‑hour runs surfaces failure modes that single‑shot evaluation cannot: upstream providers throttle without warning, sub‑agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self‑apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience‑and‑recovery layer that detects upstream rate‑limit signals, persists a typed pause record to disk, and resumes long‑running runs without re‑executing converged work even across process restarts; (2) a sub‑agent operating discipline encoding task‑fidelity, tool‑use, revision, and inter‑step context‑boundary norms as a structural prompt layer; (3) a multi‑phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross‑document harmonization pass before final synthesis. These sit atop a foundational protocol: a research‑program specification language with explicit convergence criteria, a dual‑metric scoring engine (LLM‑judged rubric plus sandboxed code), an outer meta‑optimization loop, event‑driven persistence, hook‑based middleware, context compaction, and a multi‑provider LLM abstraction. Agent identities derive from prime powers, giving collision‑free identifiers and trivially‑verifiable cluster membership without a central registry. Theoretical guarantees include O(k) verification, O(V+E) DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six‑step protocol that produced a research paper proposing a new canonical‑form algorithm with three theorems and five conjectures.

PaperID: 112, https://arxiv.org/pdf/2605.24140.pdf GitHub

Authors: Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

Title: HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Abstract:
Multi‑step reasoning remains a central challenge for large language models: single‑pass generation is efficient but lacks accuracy; tree‑search methods explore multiple paths but are computation‑heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step‑by‑step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution‑bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance‑to‑origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine‑tune a low‑rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.

PaperID: 113, https://arxiv.org/pdf/2605.23832.pdf GitHub GitHub

Authors: Constantin Blessing, Elias Geiger, Jakob Häringer, Dennis Grewe, Markus Enzweiler

Title: SFG-ROS: A Resource-Aware Framework for Dense Multi-Agent Perception

Abstract:
Deploying heterogeneous multi‑agent robot fleets for collaborative perception requires robust data exchange and scalable software architectures. However, standard ROS 2 implementations often suffer from network saturation, namespace collisions, and severe computational overhead when distributing dense sensor streams across devices. To address these bottlenecks, we present SFG‑ROS, a resource‑aware multi‑agent software framework designed for dynamic fleet deployments. SFG‑ROS addresses these challenges through three primary contributions. First, schema‑driven traffic routing isolates high‑frequency intra‑agent traffic from the global network using a programmatic fully qualified name schema and targeted Fast DDS routing. Second, an on‑demand centralized decoding pipeline automatically offloads high‑bandwidth sensor data decompression, eliminating redundant processing across local consumer nodes. Finally, a hardware‑agnostic container pipeline dynamically adapts to heterogeneous accelerators, seamlessly bridging development environments with zero‑touch, field‑ready execution. We evaluate the framework using a fleet of wheeled and legged robots equipped with LiDAR and stereo depth cameras. Experimental results show SFG‑ROS bounds network traffic to \mathcalO(1) and, by replacing redundant decompression with lightweight IPC, reduces the per‑subscriber CPU scaling penalty by 72.3% versus standard ROS 2, all while maintaining low latency. Finally, we publish SFG‑ROS under a permissive license, available via \hrefhttps://iis‑esslingen.github.io/sfg‑rosiis‑esslingen.github.io/sfg‑ros.

PaperID: 114, https://arxiv.org/pdf/2605.23657.pdf GitHub

Authors: Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

Title: OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Abstract:
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real‑world downstream tasks. However, as the open‑source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost‑performance trade‑offs. In this paper, we present \textscOpenSkillEval, an automatic evaluation framework for both skill‑augmented agent systems and the skills themselves. Instead of relying on static benchmarks, \textscOpenSkillEval automatically constructs realistic task instances from evolving real‑world artifacts across five categories of downstream applications: presentation generation, front‑end web design, poster generation, data visualization, and report generation. It further collects and organizes community‑contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open‑source skills, we conduct a systematic evaluation of state‑of‑the‑art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task‑grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval‑Web/.

PaperID: 115, https://arxiv.org/pdf/2605.23590.pdf GitHub

Authors: Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

Title: Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Abstract:
ReAct‑style agents for search‑intensive, multi‑step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action‑guiding: rubrics typically serve as training‑time rewards or post‑hoc evaluators of completed outputs, and in deep‑research settings they are often coarse‑grained and report‑level rather than step‑level. We introduce Co‑ReAct, a rubric‑guided action‑selection framework that uses rubrics as step‑level guidance during inference. At each decision step, Co‑ReAct injects a rubric into the agent's context to guide the next Reason‑or‑Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self‑evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list‑wise Spearman rank‑correlation reward against multi‑judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA‑CS‑V2, Co‑ReAct consistently improves over ReAct and representative test‑time compute baselines across search agents built on both 8B/14B open‑source and frontier closed‑source base models. The trained rubric generator can also serve as a drop‑in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co‑ReAct.

PaperID: 116, https://arxiv.org/pdf/2605.23193.pdf GitHub

Authors: Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester

Title: CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening

Abstract:
Gardening is critical to support well‑being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship‑centered multi‑agent system for personalized, socio‑culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three‑phase mixed‑methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship‑centered AI, offering design implications for multi‑agent systems that support food sovereignty, community resilience, and cultural preservation.

PaperID: 117, https://arxiv.org/pdf/2605.23067.pdf GitHub

Authors: Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao

Title: What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Abstract:
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi‑session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in‑domain (LoCoMo), mixed‑benchmark (LoCoMo + LongMemEval), and out‑of‑domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine‑grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out‑of‑domain set transfers a targeted skill ‑ temporal reasoning ‑ despite weak aggregate performance. Per‑type differences substantially exceed aggregate differences, indicating that single‑number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single‑GPU regime: cross‑benchmark mixing requires filtering format‑specific noise from memory banks to preserve training signal, and binary exact‑match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.

PaperID: 118, https://arxiv.org/pdf/2605.23058.pdf GitHub GitHub

Authors: Joshua Odmark, Gideon Rubin, Deon van der Vyver

Title: A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Abstract:
Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent‑disabled baseline, selection bias is endemic, pre‑registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground‑truth signal, and operations has nothing equivalent. We present agent‑breakage, a closed‑loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome‑labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off‑condition control via a deterministic‑embedder mechanism, and enforces pre‑registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection‑bias artifact, and small‑sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense‑corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within‑scenario corpus‑density sweep at 360 runs shows that mechanistic alignment of near‑neighbors dominates raw count. The framework is released open source.

PaperID: 119, https://arxiv.org/pdf/2605.23023.pdf GitHub

Authors: Zeyu He, Hannah Kim, Dan Zhang, Estevam Hruschka

Title: How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning

Abstract:
In orchestrated multi‑agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome‑level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human‑LLM co‑planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high‑level edits). We realize it in AMBIPOM, a prototype supporting process‑level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort‑control‑risk trade‑offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human‑AI co‑planning. We release code and data at https://github.com/megagonlabs/ambipom.

PaperID: 120, https://arxiv.org/pdf/2605.22855.pdf GitHub

Authors: Yingjie Lei

Title: PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

Abstract:
Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator‑based benchmark for hidden‑preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle‑customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter‑offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM‑facing state‑summary protocol that constrains agents to return strict JSON actions under a fixed hidden‑information boundary. We evaluate zero‑shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller‑profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement‑seeking behavior can coexist with weak profit‑sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing‑agent behavior under hidden buyer preferences.

PaperID: 121, https://arxiv.org/pdf/2605.22794.pdf GitHub

Authors: Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Huajiang Zheng, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

Title: MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Abstract:
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human‑driven update ships a fix. Self‑evolving agents have emerged in response, but all confine evolution to text‑mutable artifacts ‑‑ skill files, prompt configurations, memory schemas, workflow graphs ‑‑ and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source‑level adaptation is a fundamentally more general medium: it is Turing‑complete, a strict superset of every text‑mutable scope, takes effect deterministically rather than through base‑model compliance, and does not erode under long‑context drift. We present MOSS, a system that performs self‑rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production‑failure evidence and proceeds through a deterministic multi‑stage pipeline; code modification is delegated to a pluggable external coding‑agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user‑consent‑gated, in‑place container swap with health‑probe‑gated rollback. On OpenClaw, MOSS lifts a four‑task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

PaperID: 122, https://arxiv.org/pdf/2605.22662.pdf GitHub

Authors: Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu

Title: Claw AI Lab: An Autonomous Multi-Agent Research Team

Abstract:
We present Claw AI Lab, a lab‑native autonomous research platform that advances automated research from a hidden prompt‑to‑paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real‑time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi‑agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory‑like in practice. A key practical contribution of Claw AI Lab lies in its Claw‑Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability‑aware scientific infrastructure.

PaperID: 123, https://arxiv.org/pdf/2605.22570.pdf GitHub

Authors: Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

Title: VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Abstract:
Spatio‑temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio‑temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine‑grained reasoning capabilities. In this paper, we introduce VGenST‑Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST‑Bench, we propose a multi‑agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low‑level visual perception from high‑level spatio‑temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST‑Bench enables fine‑grained diagnosis of spatio‑temporal understanding in MLLMs.

PaperID: 124, https://arxiv.org/pdf/2605.22321.pdf GitHub

Authors: Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian, Jialuo Chen, Jingyi Wang, Xiaofang Yang, Shiwen Cui, Changhua Meng, Xinhao Deng, Zhen Wang

Title: Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

Abstract:
As autonomous agents (e.g., OpenClaw) increasingly operate with deep system‑level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single‑turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi‑turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi‑dimensional evasion framework targeting LLM‑based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S‑Bench, a comprehensive benchmark comprising 2,254 real‑world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture‑level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.

PaperID: 125, https://arxiv.org/pdf/2605.22166.pdf GitHub

Authors: Tianshi Xu, Huifeng Wen, Meng Li

Title: Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Abstract:
LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule‑governed domains stem from mismatches at the model‑‑environment interface. We propose Life‑Harness, a lifecycle‑aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life‑Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed for evaluation on unseen tasks. On seven deterministic environments from τ‑bench, τ^2‑bench, and AgentBench, Life‑Harness improves 116 out of 126 model‑‑environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3‑4B‑Instruct trajectories transfer to 17 other models, showing that Life‑Harness captures reusable environment‑side structure rather than model‑specific behavior. These results position runtime interface adaptation as a complementary alternative to model‑centric agent training. Code is available at https://github.com/Tianshi‑Xu/Life‑Harness.

PaperID: 126, https://arxiv.org/pdf/2605.21824.pdf GitHub

Authors: Ze Sheng, Dmitrijs Trizna, Luigino Camastra, Zhicheng Chen, Qingxiao Xu, Jeff Huang

Title: Quality-Assured Fuzz Harness Generation via the Four Principles Framework

Abstract:
Fuzz testing is the dominant technique for finding memory‑safety vulnerabilities in C/C++ software, yet its effectiveness hinges on the quality of fuzz harnesses ‑‑ the programs that bridge fuzzers and library APIs. A growing body of tools now automate harness generation, but none systematically ensures the correctness of produced harnesses: logic errors, API misuse, and lifecycle violations go undetected at the source level. As LLM‑driven generation scales harness creation, uncontrolled quality turns scale into a liability. We present QuartetFuzz, an autonomous harness‑generation system that systematically improves correctness throughout the generation process. At its core is the Four Principles framework ‑‑ Logic Correctness (P1), API Protocol Compliance (P2), Security Boundary Respect (P3), and Entry Point Adequacy (P4) ‑‑ the first source‑level definition of harness correctness with mathematical specifications and implementable checks. We operationalize these principles in an autonomous LLM agent that produces harnesses satisfying P1‑P4 through a generate‑check‑fix loop before any fuzzing begins. Deployed on 23 open‑source projects spanning C/C++, Java, and JavaScript, the system submits 42 bug reports, of which 29 are fixed or confirmed upstream (including 3 CVEs) and only 2 are rejected (4.8% FP rate). During generation, the built‑in P1/P2 checks automatically intercepted 58 harness‑induced crashes that would otherwise have been false positives. Applied as a quality auditor to 586 existing production harnesses across 70 projects, the system identifies 53 violations (45 confirmed, 35 fixed). We release a dataset of 100 labeled harnesses for reproducible evaluation. Code and dataset are available at https://github.com/OwenSanzas/QuartetFuzz

PaperID: 127, https://arxiv.org/pdf/2605.21309.pdf GitHub

Authors: Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

Title: Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abstract:
Cooperative perception enabled by Vehicle‑to‑Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi‑agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper‑V2X, a hypernetwork‑based framework for estimating both epistemic and aleatoric uncertainties in V2X‑based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi‑agent features to generate weight distributions for stochastic Bird's‑Eye‑View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper‑V2X enables efficient uncertainty estimation with little computation overhead. Our approach is architecture‑agnostic, and can be seamlessly integrating with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper‑V2X provides accurate, well‑calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open‑source license: https://github.com/abhishekjagtap1/Hyper‑V2X

PaperID: 128, https://arxiv.org/pdf/2605.21133.pdf GitHub

Authors: Zhizhao Liang, Yi-Lin Wei, Xuhang Chen, Mu Lin, Yi-Xiang He, Zhexi Luo, Jun-Hui Liu, Kun-Yu Lin, Wei-Shi Zheng

Title: Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

Abstract:
In this paper, we explore spatial‑aware humanoid whole‑body manipulation task. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real‑robot data restricts data‑driven models generalization. To address these challenges, we propose a generalizable humanoid loco‑manipulation framework that leverages the spatial perception and action generation capabilities of multi‑agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision‑making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generate executable robot actions based on the decisions made by the first module without needs of task‑specific real robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real‑robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.

PaperID: 129, https://arxiv.org/pdf/2605.20932.pdf GitHub

Authors: Shintaro Inoue, Kento Kawaharazuka, Temma Suzuki, Sota Yuzaki, Kei Okada

Title: WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation

Abstract:
Wheeled‑legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task‑execution capability of the legs by freeing them from the roles of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled‑legged mechanism with a wire‑driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled‑legged drive, but also three‑dimensional mobility such as cliff climbing by coordinating wire‑driven and wheeled‑legged actuation. Furthermore, by suspending the body with wire‑driven actuation, WiXus successfully repurpose its legs as arms to perform object manipulation, (e.g., rescuing a dog (stuffed animal)), and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire‑driven actuation is a new design principle that extends the operational domain of wheeled‑legged robots.

PaperID: 130, https://arxiv.org/pdf/2605.20917.pdf GitHub

Authors: F. Labra Caso, A. Saradagi, S. Fredriksson, S. Nordström, A. Koval, G. Nikolakopoulos

Title: SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

Abstract:
Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian Lava Tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well‑known gap, which relates to the unavailability of a large‑scale simulation‑based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, due to which it is common for SubT research articles to present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi‑level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves and lava tubes. SubTGraph builds a cost matrix from user‑specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case‑studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi‑agent path planning is widely tested for identification of patterns and trends in the algorithm behavior and LIO SLAM is stress‑tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open‑sourced (https://github.com/LTU‑RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.

PaperID: 131, https://arxiv.org/pdf/2605.20729.pdf GitHub GitHub

Authors: Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu

Title: MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Abstract:
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval‑Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR‑Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR‑Eval, an LLM‑based auditor quantifying alignment gaps in previous benchmarks; (2) MTR‑Pipeline, a multi‑agent system using greedy traversal clustering to generate high‑fidelity dialogues at 1/400th human cost; and (3) MTR‑Bench, a rigorous general‑domain benchmark. MTR‑Bench mimics production‑style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr‑suite.

PaperID: 132, https://arxiv.org/pdf/2605.20618.pdf GitHub

Authors: Oleksandr Yakovenko, Mahdi Mostajabdaveh, Cheikh Ahmed, Abdullah Ali Sivas, Xiaorui Li, Zirui Zhou, Mao Kun

Title: COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

Abstract:
Although Vehicle Routing Problems (VRP) are essential to many real‑world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional jumps to escape local minima, but often struggle to generalize across diverse instances. We introduce COAgents, a cooperative multi‑agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A Partial Search Graph (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well‑timed explorations of new regions. Unlike end‑to‑end learning approaches, COAgents cleanly separates problem‑agnostic search control from compact domain‑specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn‑to‑search baselines on CVRP and sets a new state of the art among learning‑based methods on the more challenging VRPTW instances, reducing the gap to the best‑known solutions by 14% at N\!=\!100 and 44% at N\!=\!50 relative to the strongest neural solver (POMO), and by 21% and 40% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.

PaperID: 133, https://arxiv.org/pdf/2605.20609.pdf GitHub

Authors: Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh

Title: Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Abstract:
Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal‑conditioned reinforcement learning (GCRL), where a generalist goal‑reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task‑endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy‑context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab‑snu.github.io/projects/CTA/

PaperID: 134, https://arxiv.org/pdf/2605.20342.pdf GitHub

Authors: Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

Title: ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Abstract:
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video‑processing tools (e.g., cropping) has become a promising route to long‑video understanding. However, existing native‑RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi‑turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi‑agent end‑to‑end RL‑trained framework for Parallel Video Tool calling, dispatching multiple time‑window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold‑started structural format and expose the skip‑tool reward shortcut under temperature sampling. A cross‑model contrast on a weaker‑prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA‑GRPO (Parseability‑Anchored and Ratio‑gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural‑token positions most prone to collapse, and (ii) a per‑prompt frame‑budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long‑video understanding benchmarks, ParaVT improves over the Qwen3‑VL baseline by +7.9% on average, with PARA‑GRPO lifting training‑time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

PaperID: 135, https://arxiv.org/pdf/2605.20173.pdf GitHub

Authors: Vasundra Srinivasan

Title: A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Abstract:
Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first‑class architectural object. This paper names that boundary the stochastic‑deterministic boundary (SDB): a four‑part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load‑bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long‑horizon agents: hierarchical delegation, scatter‑gather plus saga, event‑driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed‑systems concepts and identify what changes when the worker is stochastic. The paper contributes a five‑step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM‑based consumers of a deterministic event log produce different downstream outputs under model‑version or prompt changes. A stylized reliability decomposition separates per‑call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long‑run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90‑day contract‑renewal agent.

PaperID: 136, https://arxiv.org/pdf/2605.20061.pdf GitHub

Authors: Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

Title: Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Abstract:
Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long‑horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process‑level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief‑consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self‑supervised signals without requiring external step‑wise annotations or verifiers. It also employs belief‑aware grouping to compare trajectories under similar belief states, yielding more robust and lower‑variance advantage estimates. We evaluate ReBel on challenging long‑horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to 20.4 percentage points over the episode‑level baseline GRPO and increases sample efficiency by 2.1×. These results suggest that belief‑aware self‑supervision is a promising direction for reliable long‑horizon decision‑making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.

PaperID: 137, https://arxiv.org/pdf/2605.20025.pdf GitHub

Authors: Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Title: AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Abstract:
Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single‑agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi‑agent autonomous research pipeline built on five mechanisms: structured multi‑agent debate for hypothesis generation and result analysis, a self‑healing executor with a \textscPivot/\textscRefine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human‑in‑the‑loop collaboration with seven intervention modes spanning full autonomy to step‑by‑step oversight, and cross‑run evolution that converts past mistakes into future safeguards. On ARC‑Bench, a 25‑topic experiment‑stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human‑in‑the‑loop ablation across seven intervention modes reveals that precise, targeted collaboration at high‑leverage decision points consistently outperforms both full autonomy and exhaustive step‑by‑step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming‑lab/AutoResearchClaw.

PaperID: 138, https://arxiv.org/pdf/2605.19338.pdf GitHub GitHub

Authors: Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang, Yinpeng Dong

Title: STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

Abstract:
Frontier AI models and multi‑agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long‑horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning‑tool trade‑offs. In this paper, we introduce STAR‑PólyaMath, a multi‑agent framework that systematically addresses these challenges through meta‑level supervision and structured Reasoner‑Verifier interaction. STAR‑PólyaMath is structured as an orchestrated state machine with nested challenge‑step‑replan loops, governed by a reasoning‑free Python orchestrator that separates control from inference and bounds error propagation through trace‑back and re‑planning. Our key innovation is a persistent Meta‑Strategist that maintains cross‑attempt memory and exercises meta‑level control by issuing high‑level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over‑rely on tools. STAR‑PólyaMath achieves state‑of‑the‑art results on all eight top‑tier competition benchmarks: AIME 2025‑2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT‑5.5. Ablation studies show that the gains arise from the framework's orchestration rather than from model‑level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at https://github.com/Julius‑Woo/STAR‑PolyaMath.

PaperID: 139, https://arxiv.org/pdf/2605.19240.pdf GitHub

Authors: Kavana Venkatesh, Jafar Isbarov, Saad Amin, Murat Kantarcioglu, Jiaming Cui

Title: CASPIAN: Online Detection and Attribution of Cascade Attacks in LLM Multi-Agent Systems via Cross-Channel Causal Monitoring

Abstract:
Cascade attacks in LLM multi‑agent systems (MAS) arise when adversarial influence propagates across agents and leads to escalated system‑level failures through complex agent interactions. Detecting such cascades is challenging, as their signals are distributed, tightly coupled across interaction channels, and often appear plausibly benign locally but may unfold quickly either within a single turn or gradually across multiple turns. Existing defenses, being largely local and text‑centric, fail to capture such cross‑channel, temporally coordinated dynamics of cascade propagation. Therefore, we propose CASPIAN, the first framework that provides a unified, cross‑channel causal analysis of cascade behavior in LLM‑MAS through online monitoring of dynamic influence propagation across agents. CASPIAN models multi‑agent interactions using a unified, dynamic causal influence matrix across channels, estimated efficiently via a late‑interaction conditional transfer entropy (LI‑CTE) formulation, thereby enabling the detection of cascade onset from emergent system‑level structure rather than isolated anomalies. It further performs online causal attribution, identifying the origin, bridge, and amplifier agents driving the cascade and reconstructing its principal propagation pathways, capabilities not supported by existing methods. Across diverse multi‑agent frameworks and benchmarks, CASPIAN consistently outperforms semantic guardrails, LLM‑based judges, and graph‑based anomaly detectors in both detection accuracy and early cascade identification while operating with sub‑1% relative overhead latency. These results demonstrate that unified cross‑channel causal modeling is essential for reliably detecting and understanding cascade failures in LLM multi‑agent systems.

PaperID: 140, https://arxiv.org/pdf/2605.19033.pdf GitHub GitHub

Authors: Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

Title: RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Abstract:
Supervised open‑loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi‑agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement‑learning‑based fine‑tuning framework that enhances scenario realism by aligning simulator rollouts with real‑world data distributions and provides a method for distilling goal‑conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre‑trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state‑of‑the‑art performance. Compared with other heuristic search‑based fine‑tuning methods, RLFTSim requires significantly fewer samples due to a proposed low‑variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan‑ami.github.io/rlftsim.

PaperID: 141, https://arxiv.org/pdf/2605.18882.pdf GitHub

Authors: Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou

Title: To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

Abstract:
LLM agents exhibit a consistent tendency to over‑call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no‑call accuracy, leaving overall accuracy in the 55%‑70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no‑call decision mapping carries an activation‑independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior‑aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision‑neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin‑Calibrated Steering (AMCS), a closed‑form counter‑bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over‑calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over‑calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent‑sae/.

PaperID: 142, https://arxiv.org/pdf/2605.18857.pdf GitHub

Authors: Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde

Title: The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection

Abstract:
For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits‑over‑Random (BoR), a chance‑corrected measure of retrieval selectivity that reveals when high success rates mask random‑level performance. We measure selectivity as BoR = \log_2\left(\frac\mathrmP_obs\mathrmP_rand\right), where \mathrmP_rand is the hypergeometric baseline for the chosen success rule (here, coverage: \geq1 relevant in top‑K). On the 20 Newsgroups dataset, BM25 and SPLADE both report >99% success at K=100 (coverage), yet BoR \approx 0, indicating random‑level selectivity at that depth. When the expected coverage ratio \left(\fracK \cdot \barR_qN\right) exceeds 3‑5, the baseline dominates and selectivity collapses. Downstream retrieval‑augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at K=100, consistent with the near‑zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13‑point recall gap), confirming baseline predictions across sparse and large‑scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

PaperID: 143, https://arxiv.org/pdf/2605.18747.pdf GitHub

Authors: Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

Title: Code as Agent Harness

Abstract:
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository‑level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution‑based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long‑horizon execution, together with feedback‑driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single‑agent systems to multi‑agent settings, where shared code artifacts support multi‑agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression‑free harness improvement, consistent shared state across multiple agents, human oversight for safety‑critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

PaperID: 144, https://arxiv.org/pdf/2605.18727.pdf GitHub

Authors: Feng Chen, Tianzhe Chu, Li Sun, Pei Zhou, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

Title: DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Abstract:
Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context‑appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real‑world system‑level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, π_0.5 obtains the highest task completion rate (61.2%), while π_0.5 and π_0 tie on scene‑preserving success rate (47.5%). On agentic perception, Opus 4.7 obtains the best strict problem‑level accuracy (34.3%), while GPT 5.5 obtains the best average field‑wise accuracy (66.8%), exposing a gap between isolated visual sub‑capabilities and complete routing‑relevant state recovery. Finally, we instantiate the full embodied‑agent loop in three case studies, where waiting, recovery dispatches, human‑help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed‑loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.

PaperID: 145, https://arxiv.org/pdf/2605.18661.pdf GitHub GitHub

Authors: Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, Yingshuo Wang, Shaoyuan Xie, Jiachen Liu, Leigang Qu, Shijie Li, Lai Xing Ng, Benoit R. Cottereau, Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi

Title: AI for Auto-Research: Roadmap & User Guide

Abstract:
AI‑assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as 15, while long‑horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end‑to‑end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage‑dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval‑grounded, and tool‑mediated tasks, but remains fragile for genuinely novel ideas, research‑level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern‑matching benchmarks, and end‑to‑end autonomous systems have not yet consistently reached major‑venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human‑governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross‑stage design principles, and a practitioner‑oriented playbook, with resources maintained at our project page.

PaperID: 146, https://arxiv.org/pdf/2605.18621.pdf GitHub

Authors: Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

Title: CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Abstract:
Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single‑view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross‑view reasoning remains limited by three major gaps: the scarcity of large‑scale well‑annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object‑level consistency across views. To address these gaps, we thoroughly develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. Firstly, we introduce a multi‑agent data engine to meticulously curate a large‑scale, high‑quality cross‑view instruction dataset, termed CrossViewSet, covering 17 fine‑grained task types with 1.6M samples. Second, we meticulously create a scene‑disjoint CrossViewBench to comprehensively assess the cross‑view spatial understanding capability of an MLLM, evaluating it across various aspects. Finally, we propose CrossViewer, a progressive three‑stage framework for cross‑view spatial reasoning in MLLMs, following a Perception ‑> Alignment ‑> Reasoning paradigm. Our method equips an adaptive spatial region tokenizer to capture fine‑grained object representations, and then aligns the multi‑view objects explicitly, and thus fuses aligned features for boosting the cross‑view inference capacity for MLLMs. Extensive experiments and analyses show that large‑scale training data, systematic evaluation, and explicit cross‑view alignment are all critical for advancing MLLMs from single‑view perception toward real‑world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview‑Suite.

PaperID: 147, https://arxiv.org/pdf/2605.18421.pdf GitHub

Authors: Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

Title: EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Abstract:
Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under‑evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self‑evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in‑episode vs. cross‑episode) and memory content (knowledge‑oriented vs. execution‑oriented). We compare 15 representative memory methods with strong long‑context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long‑context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval‑based methods remain strong for knowledge‑intensive settings, whereas procedural and long‑term memory methods are more effective for execution‑oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM‑based agents. Our code is available at https://github.com/DSAIL‑Memory/EvoMemBench.

PaperID: 148, https://arxiv.org/pdf/2605.17641.pdf GitHub

Authors: Saksham Sahai Srivastava

Title: Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Abstract:
Long‑horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory‑selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal‑LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full‑history, and no‑memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long‑term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal‑memory‑intervention.

PaperID: 149, https://arxiv.org/pdf/2605.17583.pdf GitHub

Authors: Bin Kang, Shaoguo Wen, Yang Fan, Shunlong Wu, Junjie Wang, Yulin Li, Junzhi Zhao, Junle Wang, Zhuotao Tian

Title: AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

Abstract:
While existing text‑to‑speech (TTS) models exhibit high expressiveness, fine‑grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi‑agent closed‑loop framework designed for intent‑faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker‑emotion leakage by learning separable identity and emotion‑prosody subspaces with leakage‑suppressing regularization. Next, a Dual‑Stream Anchoring Controller grounds abstract intents using a large‑scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast‑Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic‑acoustic mismatches using high‑level perceptual critique. Experiments on a composite‑instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method.

PaperID: 150, https://arxiv.org/pdf/2605.17439.pdf GitHub

Authors: Sirui Hong, Zhijie Liu, Tengfei Li, Wei Tao, Yifan Wu, Chenglin Wu

Title: DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Abstract:
Evaluating LLM‑generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph‑level reachable property over latent UI state‑transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator‑side execution error and genuine software defect. We present DiagEval, a trajectory‑conditioned diagnostic evaluation protocol for post‑failure GUI‑agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent‑graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge‑Unit and RealDevBench across multiple GUI‑agent evaluators and LLM backbones. On false‑negative cases, DiagEval recovers 45.6‑62.1% of failures that were initially misattributed to software defects, outperforming retry‑based baselines with 34.4‑160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge‑Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI‑agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator‑side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.

PaperID: 151, https://arxiv.org/pdf/2605.17324.pdf GitHub

Authors: Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, Max Fenkell

Title: ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

Abstract:
Clarification‑seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification‑seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous‑State Prompt Injection), a benchmark of 728 task‑attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool‑returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification‑seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini‑3‑Flash. A decomposition analysis reveals that this gap reflects both a state‑dependent shift in how models process incoming content and a channel‑specific effect arising from the agent‑solicited clarification interface. These findings demonstrate that standard execution‑time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.

PaperID: 152, https://arxiv.org/pdf/2605.17318.pdf GitHub

Authors: Aman Desai

Title: RooAgent: An LLM Agent for Root-Based High Energy Physics Analysis

Abstract:
We present RooAgent as a natural‑language interface for Root‑based high energy physics data analysis. The package provides physics analysis functions as tools that an LLM agent invokes in response to plain‑language prompts. Two operating modes are supported: a LangGraph‑based agent compatible with OpenAI's GPT‑4.1 via GitHub Copilot and with DeepSeek‑V3 via Ollama, and a Model Context Protocol server for use with the Anthropic Claude CLI (Sonnet~4.6). In both modes the analysis logic is implemented in PyRoot and the LLM selects tools and supplies the required arguments. The package supports histogram inspection, event selection, visualisation of kinematic distributions, fitting, and significance estimation, among other tasks. We illustrate RooAgent with tests based on Monte Carlo simulations of pp\to ZH (Z\to\ell^+\ell^‑, H\to b\barb), a multi‑task signal‑background workflow, a toy statistical analysis, and an application to ATLAS open data for H\to ZZ^\to 4\ell. The package is available on PyPI and the source code is hosted at https://github.com/amanmdesai/RooAgent.

PaperID: 153, https://arxiv.org/pdf/2605.17076.pdf GitHub

Authors: Sajjad Khan

Title: S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

Abstract:
We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S‑Bus is an HTTP middleware whose central mechanism, a server‑side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides ‑‑ Observable‑Read Isolation (ORI), a partial causal consistency over the HTTP‑observable read projection ‑‑ prevents Structural Race Conditions in dedicated‑shard topologies. Three contributions. (C1) DeliveryLog mechanism with three‑tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type‑I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated‑shard workloads but harmful in single‑shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH‑3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter‑LLM‑judge agreement is kappa=0.46 (boundary variance). Agent self‑reports over‑claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ‑v4 semantic‑quality rubric remains single‑judge LLM‑only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus

PaperID: 154, https://arxiv.org/pdf/2605.17046.pdf GitHub

Authors: Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald

Title: 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

Abstract:
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC‑7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time‑series forecasting, and text classification. Each task provides a locked data‑preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task‑specific wall‑clock budget (40‑120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open‑source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent‑task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time‑budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC‑7RC‑Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi‑agent settings, making it a flexible platform for future research on autonomous research agents.

PaperID: 155, https://arxiv.org/pdf/2605.16895.pdf GitHub

Authors: Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu

Title: The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

Abstract:
End‑to‑end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG‑Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading‑task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia‑‑industry divide. We take a position on that gap: reported alpha from end‑to‑end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real‑world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi‑agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short‑window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1‑‑P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \urlhttps://github.com/hj1650782738/Trading.

PaperID: 156, https://arxiv.org/pdf/2605.16894.pdf GitHub

Authors: Jianye Xu, Bassam Alrifaee

Title: Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

Abstract:
Reinforcement Learning (RL) uses rewards to guide learning, yet reward design is typically hand‑crafted using heuristics that can be difficult to tune. We propose a Control Barrier Function (CBF)‑informed reward design for Multi‑Agent RL (MARL) that converts CBF constraint values under joint MARL actions into a reward signal that explicitly guides safe learning. We compare against two heuristic reward baselines in a four‑way multi‑lane intersection with connected and automated vehicles. Results show that our method achieves the highest task performance and is less sensitive to reward hyperparameters, yielding consistently strong performance across the tested hyperparameter range. Code for reproducing the experimental results and a video demonstration are available at https://github.com/bassamlab/SigmaRL.

PaperID: 157, https://arxiv.org/pdf/2605.16863.pdf GitHub

Authors: Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

Title: Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

Abstract:
Compositional diffusion models offer a promising route to long‑horizon planning by denoising multiple overlapping sub‑trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute‑heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long‑horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search‑guided Diffuser (XDiffuser) first computes a plan over a state‑space graph ‑‑ serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion‑based baselines on long‑horizon tasks, with particularly large gains in the low‑quality data regime and on unseen tasks beyond goal‑reaching, including multi‑agent coordination and TSP‑style reasoning. Project website: https://yanivhass.github.io/XDiffuser‑site/

PaperID: 158, https://arxiv.org/pdf/2605.16309.pdf GitHub

Authors: Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

Title: ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

Abstract:
LLM‑based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge‑‑operator schemas, preconditions, and constraints‑‑remains unrepaired. Existing self‑evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro‑symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure‑Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi‑dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi‑seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs‑‑strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72‑100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring‑failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight‑level and prompt‑level adaptation for persistent fault elimination.

PaperID: 159, https://arxiv.org/pdf/2605.16194.pdf GitHub

Authors: Arquimedes Canedo

Title: paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

Abstract:
LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub‑claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub‑claims that cannot be cited at sub‑paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does‑not‑claim list (C2), exact per‑figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand‑written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human‑readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json ‑‑against paper.typ` passes. Repo: https://github.com/arquicanedo/paper‑json

PaperID: 160, https://arxiv.org/pdf/2605.15843.pdf GitHub

Authors: Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li, Wei Shen

Title: WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes

Abstract:
Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction‑ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object‑level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object‑level editing, collision‑aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.

PaperID: 161, https://arxiv.org/pdf/2605.15815.pdf GitHub

Authors: Sihan Fu, Oucheng Liu, Shiyuan Wang, Jin Shi, Chengkun Wei

Title: BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

Abstract:
Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial‑and‑error exploration, yet the resulting knowledge‑‑resolved dependencies, repair strategies‑‑stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi‑agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent‑consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker‑based verification, and trace‑driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold‑start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at https://github.com/Vossera/BootstrapAgent.

PaperID: 162, https://arxiv.org/pdf/2605.15759.pdf GitHub

Authors: Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

Title: DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

Abstract:
Large language model (LLM) agents require long‑term memory to leverage information from past interactions. However, existing memory systems often face a fidelity‑‑efficiency trade‑off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self‑contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension‑aware retrieval, memory update, and selective assistant‑context recall without storing full histories in the model context. Across LoCoMo‑10 and LongMemEval‑S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per‑query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine‑tuning on the DimMem schema, a Qwen3‑4B extractor surpasses LightMem with GPT‑4.1‑mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long‑term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.

PaperID: 163, https://arxiv.org/pdf/2605.15755.pdf GitHub

Authors: Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng

Title: Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

Abstract:
Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute‑Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664‑artwork resource with content, formal‑attribute, valence‑arousal, and emotion annotations, by adding a 1,400‑artwork human salience extension annotated by 15 art‑trained annotators. This extension provides instance‑level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB‑G (Formal‑Attribute Bottleneck‑Guided reasoning), a supervised multi‑agent framework that first predicts attribute‑level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB‑G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human‑marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting‑based baselines. Cross‑dataset evaluation further suggests that attribute‑grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute‑specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt‑130k/

PaperID: 164, https://arxiv.org/pdf/2605.15669.pdf GitHub

Authors: Jinuk Kim, Junsoo Byun, Donghwi Hwang, Seong-Jin Park, Hyun Oh Song

Title: Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Abstract:
Manufacturable chip layouts must satisfy thousands of geometry‑based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor‑intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning‑based methods either ignore execution feedback or require labeled test layouts as agent's input. To this end, we introduce Rule2DRC, a large‑scale benchmark for DRC script coding agents with 1,000 rule‑to‑script tasks and 13,921 evaluation chip layouts for execution‑based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best‑of‑N selection performance in this domain. We release the code at https://github.com/snu‑mllab/Rule2DRC.

PaperID: 165, https://arxiv.org/pdf/2605.15528.pdf GitHub

Authors: Shengchao Zhu, Guangjie Han, Chuan Lin, Yu He

Title: Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

Abstract:
Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi‑agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open‑source platform for evaluating different MARL algorithms under six‑degree‑of‑freedom AUV dynamics. In addition, policies trained with raw geometric states and low‑level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open‑source MARL‑AUV platform that integrates DI‑engine with a six‑degree‑of‑freedom underwater AUV target‑tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm‑based tasks, and provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG‑MAPPO, a Semantic Task Graph‑enhanced variant of Multi‑Agent Proximal Policy Optimization. STG‑MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication‑constrained network states to decentralized actor decisions, and a velocity‑level action abstraction maps high‑level cooperative decisions to executable six‑degree‑offreedom AUV control inputs.The code is available at https://github.com/dasjsaj/MARL‑AUV.

PaperID: 166, https://arxiv.org/pdf/2605.15207.pdf GitHub

Authors: Yi Xie, Siao Liu, Falong Fan, Yuanqi Yao, Yue Zhao, Bo Liu

Title: TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

Abstract:
Multi‑agent LLM systems have shown promise for complex reasoning, yet recent evaluations reveal they often underperform single‑model baselines. We identify a structural failure mode in sequential fine‑tuning of shared‑context teams: updating one agent shifts the team's context distribution, and when subsequent updates are evaluated on cached rollouts, this mismatch compounds. We formalize this as the compounding occupancy shift and prove that stale‑occupancy evaluation incurs a penalty that scales quadratically with the number of agents. In contrast, intermediate‑occupancy evaluation reduces this to linear scaling. We propose TeamTR, a trust‑region framework that resamples trajectories after each component update and enforces per‑agent divergence control, yielding rigorous per‑update and per‑stage improvement lower bounds. Experiments show that TeamTR outperforms single‑agent and sequential baselines with 7.1% on average, mitigates coordination regressions, and supports plug‑and‑play component replacement. Code is available at https://github.com/Yydc/TeamTR.

PaperID: 167, https://arxiv.org/pdf/2605.15206.pdf GitHub

Authors: Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi

Title: AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Abstract:
Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi‑step tasks such as coding or web‑based question answering. While remote, cloud‑based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage‑based fees. However, agentic workflows are far more resource‑intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM‑based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single‑inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low‑cost execution signals, such as token‑level log probabilities, AgentStop can reduce wasted energy by 15‑20% with minimal impact on task performance (<5% utility drop) for challenging web‑based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy‑preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave‑experiments/AgentStop.

PaperID: 168, https://arxiv.org/pdf/2605.14851.pdf GitHub

Authors: Zhigao Huang, Zhengqing Hu, Dong Chen, Shaohan Zhang, Zhao Jin, Bo Zhang, Han Wu, Mingliang Xu

Title: IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification

Abstract:
Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi‑Agent Framework for Generative Operational Planning and High‑Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi‑Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high‑fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi‑platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission‑critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single‑step large language model (LLM) planning baseline. Compared with a traditional rule‑based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at https://github.com/zhigao3ks/IFPV.

PaperID: 169, https://arxiv.org/pdf/2605.14495.pdf GitHub

Authors: Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao

Title: Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

Abstract:
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi‑agent framework that integrates multimodal large language models, external verification tools, and arena‑based quantitative bipolar argumentation (A‑QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim‑centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty‑aware escalation. The resulting system generates section‑wise verification reports that are transparent, editable, and computationally practical for real‑world multimedia verification. Our implementation is public at: https://github.com/Analytics‑Everywhere‑Lab/MV2026_the_liems.

PaperID: 170, https://arxiv.org/pdf/2605.13950.pdf GitHub

Authors: Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

Title: Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Abstract:
Autonomous language‑model agents are increasingly evaluated on long‑horizon tool‑use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider‑Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial‑and‑error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation‑and‑selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand‑written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist‑in‑the‑loop solution.

PaperID: 171, https://arxiv.org/pdf/2605.13941.pdf GitHub

Authors: Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Title: EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Abstract:
Long‑term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer‑generation policies remain frozen at deployment. We argue that truly adaptive memory requires co‑evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self‑evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM‑powered diagnosis module. In each evolution round, the module reads per‑question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta‑analyzer applies them with automatic revert‑on‑regression and explore‑on‑stagnation safeguards. This closed‑loop self‑evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self‑evolution process captures universal retrieval principles rather than benchmark‑specific heuristics. Code is available at https://github.com/aiming‑lab/SimpleMem.

PaperID: 172, https://arxiv.org/pdf/2605.13940.pdf GitHub

Authors: Haomin Zhuang, Hanwen Xing, Yujun Zhou, Yuchen Ma, Yue Huang, Yili Shen, Yufei Han, Xiangliang Zhang

Title: AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

Abstract:
Third‑party skills are becoming the package ecosystem for LLM agents. They package natural‑language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high‑value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third‑party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security‑impact dimensions grounded in agent‑skill supply‑chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack‑not‑triggered cases, and no‑attack‑evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model‑‑framework‑‑workspace environment in which users actually delegate work. Code and data are available at https://github.com/zhmzm/AgentTrap and https://huggingface.co/datasets/zhmzm/AgentTrap.

PaperID: 173, https://arxiv.org/pdf/2605.13716.pdf GitHub

Authors: Hongji Pu, Xinyuan Song, Liang Zhao

Title: SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

Abstract:
Large language model agents increasingly rely on skill libraries for multi‑step tasks, yet these libraries can accumulate persistent defects as skills are added, reused, patched, and linked to changing dependencies. We call this failure mode skill technical debt: library‑level defects that may not break a single skill locally but can harm future retrieval, composition, and execution. Existing skill‑based agents mainly focus on task‑time retrieval, planning, and repair, while library‑time maintenance remains underexplored. We propose SkillOps, a method‑agnostic plug‑in framework for maintaining skill libraries. SkillOps represents each skill as a typed Skill Contract (P, O, A, V, F), organizes skills with a Hierarchical Skill Ecosystem Graph, and diagnoses library health across utility, compatibility, risk, and validation dimensions. Given a raw skill library, SkillOps produces a maintained library that can be used by existing retrieval or planning agents without changing their internal code. On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task‑time large language model calls. As a plug‑in layer, it improves retrieval‑heavy baselines by 0.68 to 2.90 percentage points. The current rule‑based maintenance implementation uses nearly zero library‑time large language model calls or tokens, showing that skill‑library maintenance can be added as a low‑overhead architectural layer.

PaperID: 174, https://arxiv.org/pdf/2605.13618.pdf GitHub

Authors: Peng Kang, Bixuan Li, Xiaoya Huang, Shuo Shi, Weiqiao Zhou, Zhen Li, Yu Liu, Lei Zheng

Title: OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

Abstract:
The Materials Genome Initiative catalyzed the proliferation of centralized platforms‑‑SaaS, PaaS, and IaaS‑‑that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile" problem remains: while we possess world‑class models and vast repositories of materials data, we lack the organizational infrastructure to compose these capabilities securely across institutional boundaries. The development of structural and functional materials for harsh service environments‑‑high‑temperature alloys, radiation resistant steels, corrosion‑resistant coatings‑‑remains characterized by long‑term iteration, mechanistic complexity, and high domain expertise‑‑demands that exceed both monolithic agent systems and traditional centralized platforms. To address this gap we propose OpenAaaS, an open‑source hierarchical and distributed Agent‑as‑a‑Service framework that enables organized multi‑agent collaboration for intelligent materials design. OpenAaaS is built on a single foundational principle: code flows, data stays still. A Master Agent plans and decomposes complex research tasks without requiring direct access to subordinate agents' managed data and computational resources. Sub‑agents, deployed as near‑data execution nodes, retain full sovereignty over local datasets, proprietary algorithms, and specialized hardware. This architecture guarantees that raw data never leaves its domain of origin while enabling cross‑scale, cross‑domain secure integration of previously isolated materials intelligence silos. We validate the framework through two representative case studies: (i) AlphaAgent, an evidence‑grounded materials literature analysis executor that achieves 4.66/5.0 on deep analytical questions against single‑pass RAG baselines; and (ii) an ultra‑large‑scale hexa‑high‑entropy alloy descriptor database service that demonstrates secure near‑data execution and domain‑specific scientific workflows under strict data‑sovereignty constraints. OpenAaaS establishes a principled pathway toward "organized research" via agent collectives, offering a scalable foundation for next‑generation materials intelligent design platforms. All source code is available at https://github.com/Wolido/OpenAaaS.

PaperID: 175, https://arxiv.org/pdf/2605.13542.pdf GitHub

Authors: Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

Title: RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Abstract:
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight‑annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician‑motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30‑min windows and release two datasets: RealICU‑Gold with 930‑window annotations from 94 MIMIC‑IV patients, and RealICU‑Scale with 11,862 windows extended by Oracle, a physician‑validated LLM hindsight labeler. Existing LLMs including memory‑augmented ones performed poorly on RealICU, exposing two failure modes: a recall‑safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU‑Evo to study structured‑memory agents that improves long‑horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision‑support in high‑stakes care. Project page: https://chengzhi‑leo.github.io/RealICU‑Bench/

PaperID: 176, https://arxiv.org/pdf/2605.12943.pdf GitHub

Authors: Zheng Wang, Yuang Liu, Yangkai Ding

Title: Reinforced Collaboration in Multi-Agent Flow Networks

Abstract:
Multi‑agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation, arising from suboptimal workflow design or inaccurate agent outputs, which can propagate through the agent collaboration process and degrade final results. To address the challenges, we present MANGO (Multi‑Agent Network Gradient Optimization), a data‑driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism prevents redundant updates to well‑optimized agents for improving efficiency. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state‑of‑the‑art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at https://github.com/openJiuwen‑ai/agent‑store/tree/main/community/mango.

PaperID: 177, https://arxiv.org/pdf/2605.12857.pdf GitHub

Authors: Zhongkai Yu, Yichen Lin, Chenyang Zhou, Yuwei Zhang, Kun Zhou, Junxia Cui, Haotian Ye, Zhengding Hu, Zaifeng Pan, Ruiyi Wang, Yujie Zhao, Hejia Zhang, Jingbo Shang, Jishen Zhao, Yufei Ding

Title: ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Abstract:
Existing API‑based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed‑source APIs incompatible with chip vendors' air‑gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self‑trained models address the deployment constraint but remain single‑turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self‑trained multi‑agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross‑comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference‑model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack‑based inference workflow to prevent error propagation across turns, and a two‑stage training pipeline that first trains each agent individually to saturate its code‑generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data‑generation framework that produces 64.4K high‑quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self‑trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.

PaperID: 178, https://arxiv.org/pdf/2605.12507.pdf GitHub

Authors: Siqi Miao, Ziyang Chen, Yuhong Luo, Hans Hao-Hsun Hsu, Mufei Li, Kaiqing Zhang, Pan Li

Title: Can LLM Agents Simulate Dynamic Networks? A Case Study on Email Networks with Phishing Synthesis

Abstract:
While Large Language Model (LLM) multi‑agent systems (MAS) offer a transformative approach to simulating human behavior in complex systems, it remains largely unexplored whether these simulations can replicate realistic structural and temporal dynamics from a dynamic network perspective. Our evaluation indicates that existing frameworks excel at generating plausible micro‑level interactions but fail to capture the emergent, macroscopic topologies necessary for domains that rely on realistic network dynamics, such as modeling information propagation and cybersecurity threats. To bridge this gap, we introduce two easily integrable extensions to simulation frameworks to ensure they preserve macroscopic network fidelity: 1) augmenting LLM agents with data‑driven event triggers to organically sustain long‑horizon interactions, and 2) integrating Hawkes processes to accurately model temporal activation dynamics. Our approach allows LLM MAS to capture both plausible micro‑level patterns and macroscopic topologies. We further demonstrate the utility of this framework in synthesizing realistic phishing campaigns within evolving communication networks. The study reveals how threats exploit structural vulnerabilities, highlighting the potential of our framework for developing next‑generation defenses. Our code is available at https://github.com/Graph‑COM/NSL.

PaperID: 179, https://arxiv.org/pdf/2605.12153.pdf GitHub GitHub

Authors: Vladislav Savenkov

Title: CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

Abstract:
We present Curated Industrial Developer Repository (CIDR), a large‑scale dataset of real‑world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per‑repository metadata. Unlike existing code corpora derived from public open‑source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi‑stage pipeline encompassing structured partner onboarding, two‑stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre‑training and fine‑tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.

PaperID: 180, https://arxiv.org/pdf/2605.11866.pdf GitHub

Authors: Yiming Ren, Xuenan Xu, Ziyang Zhang, Wen Wu, Baoxiang Li, Chao Zhang

Title: AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Abstract:
Despite advances in text and visual generation, creating coherent long‑form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self‑correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self‑reflective closed‑loop multi‑agent framework. Specifically, it involves an Identity‑Aware Pre‑production mechanism that transforms narrative texts into character profiles and utterance‑level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context‑aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed‑loop self‑correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human‑Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state‑of‑the‑art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous‑itsh.github.io/.

PaperID: 181, https://arxiv.org/pdf/2605.11534.pdf GitHub

Authors: Yunn Kang Lim, Pengzhan Sun, Ziyi Bai, Xun Xu, Angela Yao, Xulei Yang, Shijie Li

Title: PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments

Abstract:
When an LLM‑based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub‑goals, or poor action sequencing ‑‑ yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only did the agent succeed?, PRISM asks which capability is most likely responsible for failure? Built on five photorealistic multi‑room apartments (4‑‑8 rooms each), PRISM structures 300 human‑verified tasks into three capability tiers ‑‑ Basic Ability, Reasoning Ability, and Long‑horizon Ability ‑‑ that isolate perception‑to‑action grounding, implicit intent resolution, and sustained multi‑step coordination respectively. PRISM exposes an agent‑agnostic executable action API that allows arbitrary agents: LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems, to be evaluated end‑to‑end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component‑level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long‑horizon coordination exposes a stark capability cliff ‑‑ lightweight models collapse to as low as 20.0% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over‑reasoning rather than genuine planning capability. Project page: \hrefhttps://sj‑li.com/PROJ/PRISMlink.

PaperID: 182, https://arxiv.org/pdf/2605.11522.pdf GitHub

Authors: Ian C. Moore

Title: State Twins: An Off-Chain Substrate for Agentic Reasoning over Decentralized Finance Protocols

Abstract:
We introduce the State Twin: a typed, in‑memory, replayable replica of an on‑chain automated market maker (AMM) pool that serves as a substrate for agentic reasoning over decentralized finance (DeFi) protocols. Agentic DeFi stacks today couple reasoning to chain time, since every "what if?" query incurs a new RPC read or a real transaction, so the agent's effective action space is bounded by block confirmation latency and gas. We argue this coupling is a structural problem rather than a performance one, and that the missing layer is an off‑chain substrate that preserves the protocol's exact mathematics while admitting the operations on‑chain state cannot: forking, replay, branching, counterfactual rollout. We formalize each AMM family (Uniswap V2, V3, Balancer, Stableswap) as a discrete‑time controlled dynamical system, prove a quantitative fidelity bound on the divergence between twin and chain, and give the open architecture used in DeFiPy v2, an open‑source Python toolkit that ships the State Twin substrate and a reference Model Context Protocol server exposing typed analytical primitives as LLM tools. The same primitive (i.e., one Python class, one calling pattern) serves a notebook quant, a backtest, and an LLM agent without modification. We close with a fork‑and‑evaluate worked example: a single live RPC read seeds N independent in‑memory twins under distinct price‑shock scenarios, in sub‑second wall‑clock time. The contribution is the substrate, not a particular agent, which is what the specification of what an agentic DeFi substrate must look like

PaperID: 183, https://arxiv.org/pdf/2605.11436.pdf GitHub

Authors: Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

Title: Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Abstract:
Large language models (LLMs) are increasingly deployed on long‑horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task‑relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open‑ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent‑BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long‑horizon, partially observable embodied language environments, Agent‑BRACE achieves an average absolute improvement of +14.5% (Qwen2.5‑3B‑Instruct) and +5.3% (Qwen3‑4B‑Instruct), outperforming strong RL baselines while maintaining a near‑constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.

PaperID: 184, https://arxiv.org/pdf/2605.11136.pdf GitHub

Authors: Yaolun Zhang, Tianyi Xu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang

Title: EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

Abstract:
We argue that multi‑agent test‑time evolution is not single‑agent evolution replicated N times. A single‑agent learner can only evolve its own context and memory. A multi‑agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single‑agent counterpart and can produce phenomena such as emergent specialization. Yet prior test‑time methods either confine experiences to individual agents, forfeiting cross‑agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training‑free framework that instantiates test‑time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post‑task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team‑level operators assemble niche‑conditioned teams and select collaboration structures online. Population‑level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3‑8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi‑domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross‑agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi‑agent evolution that no single‑agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber

PaperID: 185, https://arxiv.org/pdf/2605.11026.pdf GitHub

Authors: Yassin H. Rassul, Tarik A. Rashid

Title: AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

Abstract:
Defenses against indirect prompt injection (IPI) in tool‑using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low‑resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception‑based detection framework that places three layers of traps inside the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high‑precision labels for a self‑supervised classifier. An LLM agent that follows an attacker's hidden instruction almost always touches one of these traps, which gives both a real‑time compromise signal and a zero‑FP label for training a downstream detector without manual annotation. Across 176 cross‑lingual attack prompts and four LLMs from three providers, and because modern LLMs already refuse most IPI attempts on their own (attack success rate <= 10%), AgentShield's job is to catch the attacks that do slip through. On commercial models, it catches 90.7%‑100% of such successful attacks, with zero false alarms on 485 normal‑use tests. It survives a systematic adaptive‑attack evaluation with zero evasion on commercial models, and the self‑supervised classifier transfers across models and languages without retraining.

PaperID: 186, https://arxiv.org/pdf/2605.10912.pdf GitHub

Authors: Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

Title: WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Abstract:
Large language and vision‑language models increasingly power agents that act on a user's behalf through command‑line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short‑horizon tasks, mock‑service APIs, and final‑answer checks, leaving open whether agents can complete realistic long‑horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native‑runtime benchmark of 60 human‑authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall‑clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule‑based checks, environment‑state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long‑horizon, native‑runtime agent evaluation remains a far‑from‑resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.

PaperID: 187, https://arxiv.org/pdf/2605.10344.pdf GitHub

Authors: George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai

Title: TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Abstract:
Test‑time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification‑based feedback. However, existing structured test‑time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test‑time compute via multi‑agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross‑trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low‑level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high‑level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test‑time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.

PaperID: 188, https://arxiv.org/pdf/2605.10187.pdf GitHub

Authors: Longteng Guo, Xuanxu Lin, Dongze Hao, Tongtian Yue, Pengkang Huo, Jiatong Ma, Yuchen Liu, Jing Liu

Title: SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Abstract:
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi‑step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain‑specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi‑step inferences, with 46% including expert‑authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open‑source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi‑step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA‑IVA‑Lab/SciVQR.

PaperID: 189, https://arxiv.org/pdf/2605.10155.pdf GitHub

Authors: Deepanshu, Divi Saxena, Deepali Rana, Ayesha Varshney, Sahinur Rahman Laskar

Title: NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation

Abstract:
Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI‑powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval‑Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi‑agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub‑agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70% precision across test samples, with RAG retrieval precision at 74% and overall response accuracy at 72%, demonstrating that structured multi‑agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code\footnotehttps://github.com/B97784/NyayaAI is made publicly available for the benefit of the research community.

PaperID: 190, https://arxiv.org/pdf/2605.09679.pdf GitHub

Authors: Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

Title: DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Abstract:
Medical vision‑language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi‑stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher‑level questions remain independently scorable, while their ground‑truth evidence chains are defined over lower‑level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool‑interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later‑stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground‑truth step‑by‑step tool‑use traces from DeepTumorVQA can supervise agents and reduce tool‑use and reasoning failures. This stage‑wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.

PaperID: 191, https://arxiv.org/pdf/2605.09539.pdf GitHub

Authors: Chen Xu, Yicheng Hu, Ruizi Wang, Xinyu Lin, Wenjie Wang, Dongrui Liu, Fuli Feng

Title: TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self‑evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or capability during inference. We empirically and theoretically show that effective test‑time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test‑time co‑evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role‑specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory‑level feedback, while a slow meta‑LLM‑driven topology loop performs agents' birth‑death operations on MAS, including edge edit, agent addition, and agent removal. We further show that this fast‑slow design drives MAS evolution toward a task‑conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi‑agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The codes are released at https://github.com/chenxu2‑gif/TacoMAS‑MultiAgent.

PaperID: 192, https://arxiv.org/pdf/2605.09252.pdf GitHub

Authors: Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

Title: LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Abstract:
Tool‑augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single‑hop, 3 multi‑hop) spanning three categories of tool necessity ‑‑ computational scale, knowledge boundaries, and execution reliability ‑‑ each with controlled difficulty levels that create a clear decision boundary between tool‑necessary and tool‑unnecessary tasks. We evaluate two families of training‑free baselines: Prompt‑only (varying the prompt to discourage unnecessary calls) and Reason‑then‑Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt‑only suppresses necessary calls alongside unnecessary ones, and Reason‑then‑Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre‑generation representation with AUROC 0.89‑‑0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden‑state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy‑ML‑Lab/when2tool

PaperID: 193, https://arxiv.org/pdf/2605.08978.pdf GitHub

Authors: Xingyuan Hua, Sheng Yue, Ju Ren

Title: Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Abstract:
Recent advancements in agentic test‑time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration‑aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine‑grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision‑making, together with an exploration‑aware grouping mechanism that separates exploratory actions from task‑completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text‑based and GUI‑based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO‑ICML26 and models are available at https://huggingface.co/hansenhua/EAPO‑ICML26.

PaperID: 194, https://arxiv.org/pdf/2605.08876.pdf GitHub

Authors: Xinyu Li, Ronghui Mu, Lin Li, Tianjin Huang, Gaojie Jin

Title: OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool‑augmented, multi‑step tasks, where latency is a critical factor for real‑world applications. Yet an overlooked threat is Reasoning‑Level Denial‑of‑Service (R‑DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent's reasoning depth or tool‑use budget. We introduce OTora, the first unified, two‑stage red‑teaming framework for instantiating R‑DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion‑aware scoring and dynamic target co‑evolution, supporting both black‑box and white‑box settings. Stage II generates agent‑aware reasoning payloads via an ICL‑guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA‑70B and GPT‑OSS‑120B, OTora achieves up to 10 times increases in reasoning tokens and order‑of‑magnitude latency slowdowns, all while preserving near‑baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at https://github.com/llm2409/OTora.

PaperID: 195, https://arxiv.org/pdf/2605.08831.pdf GitHub

Authors: Chenhao Zhang, Chaoran Zhang, Zhaobo Xu, Yongbo Yang, Pingfa Feng, Long Zeng

Title: AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System

Abstract:
In flexible assembly systems, existing task planning methods require a time‑consuming configuration process by multiple experts to establish a production line for a new product. To address this challenge, we propose a multi‑agent based task planning framework for flexible assembly systems, denoted as AssemPlanner. It takes tasks described in natural language as input, which are then converted into actionable sequential production operations. It comprises several specialized agents, including SchedAgent , KnowledgeAgent, LineBalanceAgent, and a scene graph. Within the proposed framework, SchedAgent serves as the central reasoning engine. Departing from traditional static pipelines, AssemPlanner utilizes a ReAct‑based SchedAgent to adaptively adjust actions via multi‑agent feedback. By observing the feedback from KnowledgeAgent, LineBalanceAgent, and the scene graph, it autonomously resolves complex industrial process constraints. To facilitate reproducibility, all code and datasets are released at https://github.com/chz332/Assemplanner.

PaperID: 196, https://arxiv.org/pdf/2605.08813.pdf GitHub

Authors: Yulang Chen, Haoxuan Peng, Jinyan Liu, Zichen Wen, Dongrui Liu, Linfeng Zhang

Title: AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

Abstract:
Large Language Model‑based Multi‑Agent Systems (MAS) have demonstrated remarkable capabilities in complex tasks. However, manually designing optimal communication topologies is labor‑intensive, while automated expansion methods often result in bloated structures with redundant agents, leading to excessive token consumption. To address this problem, we introduce AgentSlimming, a plug‑and‑play compression framework for graph‑structured multi‑agent workflows. Motivated by pruning and quantization in neural networks, AgentSlimming compresses workflows by first estimating the importance score of each agent with a hybrid mechanism, and then removes redundant agents or replaces them with low‑cost ones, where each operation is validated using a baseline‑anchored acceptance rule to prevent performance collapse. Experiments show that AgentSlimming reduces average token cost by up to 78.9% with negligible performance degradation, and sometimes even improves accuracy, achieving a strong Pareto‑optimal trade‑off between cost and quality. Our code is publicly available at https://github.com/CitrusYL/AgentSlimming

PaperID: 197, https://arxiv.org/pdf/2605.08763.pdf GitHub

Authors: Minfeng Qi, Tianqing Zhu, Zijie Xu, Congcong Zhu, Qin Wang, Wanlei Zhou

Title: When LLMs Team Up: A Coordinated Attack Framework for Automated Cyber Intrusions

Abstract:
Automated intrusion‑style workflows require LLM agents to reason over partial observations, tool outputs, and executable artifacts under bounded budgets. A single LLM instance often compresses evidence extraction, planning, execution, and validation into one context, which increases the risk of context drift and error propagation. Existing LLM‑based multi‑agent systems support general collaboration, but they do not explicitly model the role boundaries, artifact provenance, and cost constraints that characterize multi‑stage intrusion workflows. This paper presents CAESAR, a coordinated multi‑agent framework for controlled analysis of LLM‑agent behavior in intrusion‑style tasks. CAESAR decomposes the workflow into five typed roles and coordinates them through a bounded round protocol with a persistent knowledge base, a per‑round workspace, validator‑gated knowledge promotion, and capability‑token write isolation. We evaluate CAESAR on 25 CTF tasks across five categories and four LLM backends. Compared with a single‑agent baseline under matched budgets and tool access, CAESAR improves task success and reduces performance variance, with larger gains on tasks requiring multi‑step exploit composition. A secondary simulated interactional‑security study suggests that the role structure can transfer beyond code‑native surfaces. The results indicate that role transitions, artifact provenance, and knowledge‑promotion events provide useful structural signals for monitoring coordinated LLM‑agent behavior beyond individual prompt and output inspection. The dataset, implementation, and evaluation logs are released at https://github.com/Xu‑Qiu/CMAS.

PaperID: 198, https://arxiv.org/pdf/2605.08715.pdf GitHub

Authors: Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang

Title: AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems are increasingly deployed on long‑horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory‑level failure. Existing work frames this as \emphpost‑hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj‑2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight‑7B, a compact online auditor trained with a coarse‑to‑fine reinforcement learning recipe that first equips it with a risk‑anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step‑level localization under a three‑axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj‑2K and an external Who\&When benchmark, AgentForesight‑7B outperforms leading proprietary models, including GPT‑4.1 and DeepSeek‑V4‑Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post‑hoc failures detection to enabling deployment‑time intervention. Project page: https://zbox1005.github.io/agent‑foresight/

PaperID: 199, https://arxiv.org/pdf/2605.08704.pdf GitHub

Authors: Hyunmin Hwang, Jaemin Kim, Choonghan Kim, Hangeol Chang, Jong Chul Ye

Title: AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization

Abstract:
Multi‑agent reasoning has shown promise for improving the problem‑solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi‑agent methods rely on inference‑time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle‑swarm‑inspired framework for evolving multi‑agent reasoning skills. AgentPSO treats each agent as a particle‑like reasoner whose state is a natural‑language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal‑best skill, global‑best skill, and a self‑reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single‑agent skills and test‑time‑only multi‑agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark‑specific prompts. Code is open‑sourced at https://github.com/HYUNMIN‑HWANG/AgentPSO/.

PaperID: 200, https://arxiv.org/pdf/2605.08583.pdf GitHub

Authors: Mingzhe Li, Zhiqiang Lin, Shiqing Ma

Title: Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection

Abstract:
Large language models are increasingly used in scientific writing, yet they can fabricate citation‑shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not‑found decisions and rely on brittle parsing or incomplete retrieval, offering little field‑level signal to auditors. We reframe citation hallucination detection as taxonomy‑aligned field‑level adjudication and introduce a 12‑code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi‑agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class‑specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real‑world fabricated citations drawn from ICLR 2026 and an anonymous conference desk‑rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class‑level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real‑world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.

PaperID: 201, https://arxiv.org/pdf/2605.08374.pdf GitHub

Authors: Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen

Title: MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Abstract:
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD(λ) eligibility traces to memory Q‑values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as (γλ)^d with DAG depth d, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous‑Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert‑level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi‑step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single‑step classification (+0.77~pp) where single‑step updates already suffice. We further study how γ and λ interact with the EC‑MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao‑ai/MemQ.

PaperID: 202, https://arxiv.org/pdf/2605.08129.pdf GitHub GitHub

Authors: Chao Tang, Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Hao Jiang, Jiangning Zhang, Yunhai Tong

Title: Towards Customized Multimodal Role-Play

Abstract:
Unified multimodal understanding and generation models enable richer human‑AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role‑Play (CMRP). We construct the RoleScape‑20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text‑image interactions. Building on a unified model, we devise UniCharacter, a two‑stage training framework containing Unified Supervised Finetuning (Unified‑SFT) and character‑specific group relative policy optimization (Character‑GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape‑20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross‑modal consistency design and few‑shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next‑generation characterful and immersive interactive agents.

PaperID: 203, https://arxiv.org/pdf/2605.07830.pdf GitHub

Authors: Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim

Title: CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack‑selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630‑session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack‑family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session‑level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.

PaperID: 204, https://arxiv.org/pdf/2605.07692.pdf GitHub

Authors: Xuan Zhou, Yanhui Sun, Hantao Yao, Allen He, Yongdong Zhang, Wu Liu

Title: GASim: A Graph-Accelerated Hybrid Framework for Social Simulation

Abstract:
Large‑scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)‑based agents with numerical agent‑based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph‑accelerated hybrid multi‑agent framework for large‑scale social simulations. For core agents driven by LLM, GASim introduces Graph‑Optimized Memory (GOM) to replace intensive LLM‑based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine‑grained feature aggregation and Graph Attention Network. We further introduce Entropy‑Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information‑diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94‑fold end‑to‑end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real‑world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.

PaperID: 205, https://arxiv.org/pdf/2605.07358.pdf GitHub

Authors: Yingli Zhou, Wang Shu, Yaodong Su, Wenchuan Du, Yixiang Fang, Xuemin Lin

Title: A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

Abstract:
Large language model (LLM)‑based agents that reason, plan, and act through tools, memory, and structured interaction are emerging as a promising paradigm for automating complex workflows. Recent systems such as OpenClaw and Claude Code exemplify a broader shift from passive response generation to action‑oriented task execution. Yet as agents move toward open‑ended, real‑world deployment, relying on from‑scratch reasoning and low‑level tool calls for every task become increasingly inefficient, error‑prone, and hard to maintain. This survey examines this challenge through the lens of \emphagent skills, which we define as reusable procedural artifacts that coordinate tools, memory, and runtime context under task‑specific constraints. Under this view, agents and skills play complementary roles: agents handle high‑level reasoning and planning, while skills form the operational layer that enables reliable, reusable, and composable execution. Skills are therefore central to the scalability, robustness, and maintainability of modern agent systems. We organize the literature around four stages of the agent skill lifecycle ‑‑ representation, acquisition, retrieval, and evolution ‑‑ and review representative methods, ecosystem resources, and application settings across each stage. We conclude by discussing open challenges in quality control, interoperability, safe updating, and long‑term capability management. All related resources, including research papers, open‑source data, and projects, are collected for the community in \textcolorbluehttps://github.com/JayLZhou/Awesome‑Agent‑Skills.

PaperID: 206, https://arxiv.org/pdf/2605.07323.pdf GitHub

Authors: Sum Kyun Song, Bong Gyun Shin, Jae Yong Lee

Title: Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

Abstract:
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real‑world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM‑based qualitative and quantitative evaluation. DoLQ employs a multi‑agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi‑dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.

PaperID: 207, https://arxiv.org/pdf/2605.07022.pdf GitHub

Authors: Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner

Title: Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Abstract:
Manually curated biomedical repositories ‑‑ spanning bioactivity, genomics, and chemistry ‑‑ are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost‑effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM‑based entity‑tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M‑paper, 2.5T‑token PubMed corpus; (2) hybrid sparse‑dense retrieval supporting entity‑filtered semantic queries over the tagged corpus; and (3) Starling, a multi‑agent deep research system that, given only a natural‑language task description, designs precision‑ and recall‑targeted retrieval filters, induces an extraction schema, and emits structured records with nuance‑rich fields and supporting passages. Across six tasks ‑‑ blood‑brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene‑disease associations, protein subcellular localization, and chemical reactions ‑‑ Starling produces ~6.3M records (91K‑3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier‑model rejection of our extractions is 0.6‑7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard ‑‑ e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI‑driven therapeutic design. Code and datasets: https://github.com/starling‑labs/starling.

PaperID: 208, https://arxiv.org/pdf/2605.06788.pdf GitHub

Authors: Naihe Feng, Yi Sui, Shiyi Hou, Ga Wu, Jesse C. Cresswell

Title: Conformal Agent Error Attribution

Abstract:
When multi‑agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model‑based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP) which provides finite‑sample, distribution‑free coverage guarantees. We introduce new algorithms for filtration‑based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to rollback MAS to correct their own errors. Our overall approach is model‑agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at https://github.com/layer6ai‑labs/conformal‑agent‑error‑attribution.

PaperID: 209, https://arxiv.org/pdf/2605.06623.pdf GitHub

Authors: Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang

Title: MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Abstract:
Large language model (LLM)‑based Multi‑agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role‑specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non‑trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground‑truth labels. Furthermore, MASPO employs a data‑driven evolutionary beam search to efficiently navigate the high‑dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state‑of‑the‑art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.

PaperID: 210, https://arxiv.org/pdf/2605.06345.pdf GitHub

Authors: Jie Yu, Song Qiu

Title: More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

Abstract:
AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi‑agent framework designed to make a researcher's implicit understanding explicit, inspectable, and actionable. InciteResearch decomposes the logical chain of Socratic questioning and distributes it across the entire pipeline that: (1) Elicits a structured five‑dimensional researcher profile state anchored by specific friction points from vague, even domain‑unrelated inputs; (2) Violates hidden assumptions by maximizing the feasibility‑novelty product with enforcing a 7‑stage causal derivation trace; and (3) check whether the proposed method is a Necessary consequence of the reframed insight. We further introduce TF‑Bench, the first benchmark for tacit‑to‑explicit research assistance that distinguishes domain‑related from domain‑unrelated inspirations across four scientific modes. On TF‑Bench, InciteResearch achieves leapfrogging gains over a prompt‑based baseline (novelty/impact from 3.671/3.806 to 4.250/4.397), shifting generated proposals from recombination to architectural insight. Our work demonstrates that AI can serve as an extension of thinking itself, rather than merely automating downstream execution.

PaperID: 211, https://arxiv.org/pdf/2605.06177.pdf GitHub

Authors: Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton

Title: BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Abstract:
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model‑specific engineering. We call this the per‑paper engineering tax and release BioMedArena, an open‑source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep‑research agents. BioMedArena decouples six layers of biomedical agent evaluation ‑‑ benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring ‑‑ and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few‑line provider adapter. We further provide 6 agent harnesses with 6 context‑management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state‑of‑the‑art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per‑task traces are available at https://github.com/AI‑in‑Health/BioMedArena

PaperID: 212, https://arxiv.org/pdf/2605.06068.pdf GitHub

Authors: Keisuke Kamahori, Shihang Li, Simon Peter, Baris Kasikci

Title: VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

Abstract:
For years, we have built LLM serving systems like any other critical infrastructure: a single general‑purpose stack, hand‑tuned over many engineer‑years, meant to support every model and workload. In this paper, we take the opposite bet: a multi‑agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end‑to‑end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation‑time specialization need not come at the cost of performance. More interestingly, in non‑standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non‑standard model architectures, workload knowledge, and hardware‑specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation‑time specialization rather than runtime generality. Code is available at https://github.com/uw‑syfi/vibe‑serve.

PaperID: 213, https://arxiv.org/pdf/2605.05758.pdf GitHub

Authors: Xin Gao, Ruiyi Zhang, Meixi Du, Peijia Qin, Pengtao Xie

Title: BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

Abstract:
Despite the success of large language models (LLMs) on general‑purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general‑domain tool‑calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in‑context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool‑calling dataset designed for fine‑tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high‑quality, human‑verified query‑API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine‑tuning a 4‑billion‑parameter LLM on BioTool yields substantial improvements in biomedical tool‑calling performance, outperforming cutting‑edge commercial LLMs such as GPT‑5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool‑fine‑tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool

PaperID: 214, https://arxiv.org/pdf/2605.05704.pdf GitHub

Authors: Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Hao Peng

Title: SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Abstract:
Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real‑world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over‑refusal of benign user requests. To mitigate this trade‑off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context‑aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training‑free, efficient, and plug‑and‑play solution. Furthermore, we introduce an information entropy‑based self‑evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state‑of‑the‑art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT‑4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj‑cyber/SafeHarbor.

PaperID: 215, https://arxiv.org/pdf/2605.05258.pdf GitHub

Authors: Yuchen Wang, Zhongzhi Luan

Title: PARNESS: A Paper Harness for End-to-End Automated Scientific Research with Dynamic Workflows, Full-Text Indexing, and Cross-Run Knowledge Accumulation

Abstract:
Recent autonomous research systems ‑‑ AI‑Scientist, PaperOrchestra, AutoSOTA, DeepResearch, InternAgent, ResearchAgent and others ‑‑ show LLM agents can ideate, run experiments and write papers, but each fixes a particular control‑flow shape (linear pipeline, state machine, single‑agent loop, or fixed‑recipe skill pack) at the framework level. We argue this rigidity has five roots: (1) workflows are dynamic and discipline‑specific (lab work, surveys, simulations, theory all loop differently); (2) ideation is bounded by LLM context and cross‑domain ideation needs knowledge a single context cannot hold; (3) summary‑only views miss the paper body, yet full‑text access is uneven, so the cumulative corpus must do the work; (4) a paper's open‑source repository is often the only complete specification of its experimental scheme, but the paper‑to‑code link is neglected; (5) no tool persists cross‑run knowledge retrievably into a finite LLM context. We present PARNESS, an open‑source framework built on four design moves. (i) A thin DAG kernel with a four‑field Agent contract decouples scheduling from domain semantics, so any discipline's loop is expressible as user‑editable YAML. (ii) A full‑text PDF‑parsing and literature‑library subsystem indexes paper bodies, figures and tables as typed objects, with graceful abstract‑only fall‑back. (iii) A knowledge‑graph index over papers, ideas, experiments and code repositories, with scenario‑typed retrieval (similar / contradictory / cross‑domain / counter‑intuitive), surfaces a focused slice into each LLM call. (iv) A small extension surface lets any modern coding agent (Claude Code, Cursor, Copilot, OpenCode) add or replace any module. To our knowledge PARNESS is the first open‑source system combining declarative pipelines, full‑PDF and code‑repository indexing, and cross‑run knowledge. Source: https://github.com/gtrhythm/PARNESS

PaperID: 216, https://arxiv.org/pdf/2605.05185.pdf GitHub

Authors: Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang

Title: OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Abstract:
Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi‑step reasoning. Despite rapid progress, top‑tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high‑quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch‑VL, a fully open‑source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high‑quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source‑anchor visual grounding, which jointly reduce shortcuts and one‑step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL‑SFT‑36k for SFT and SearchVL‑RL‑8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super‑resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi‑turn fatal‑aware GRPO training algorithm that handles cascading tool failures by masking post‑failure tokens while preserving useful pre‑failure reasoning through one‑sided advantage clamping. Built on this recipe, OpenSearch‑VL delivers substantial performance gains, with over 10‑point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.

PaperID: 217, https://arxiv.org/pdf/2605.04984.pdf GitHub

Authors: Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, Yuguang Fang

Title: Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Abstract:
Long‑horizon LLM agents depend on intermediate information‑gathering turns, yet training feedback is usually observed only at the final answer, because process‑level rewards require high‑quality human annotation. Existing turn‑level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task‑specific verifiers. Conversely, label‑free RL methods extract self‑signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self‑Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential‑based turn‑level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability‑aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster‑level approximation. The objective generalizes information‑potential shaping from gold‑answer supervision to settings without task‑specific gold verifiers while avoiding the broadcasted rollout‑level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold‑answer limit, and show that SIOP improves average performance over verifier‑free outcome‑level baselines on seven search‑augmented agentic reasoning benchmarks while approaching a gold‑supervised outcome baseline. Code is available at https://github.com/dl‑m9/SIOP.git.

PaperID: 218, https://arxiv.org/pdf/2605.04906.pdf GitHub

Authors: Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao

Title: Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

Abstract:
While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi‑agent games where the final outcome depends on the joint strategies of all agents. In multi‑agent games, the non‑stationarity of other agents brings significant challenges on the evaluation of the reasoning process and the credit assignment over multiple reasoning steps. Existing single‑agent reinforcement learning (RL) approaches and their multi‑agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat‑Reasoner, a novel RL‑based framework that improves LLMs' strategic reasoning ability in multi‑agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain‑of‑Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group‑relative RL approach to optimize the LLM policy. Experimental results show that Strat‑Reasoner substantially improves strategic abilities of underlying LLMs, achieving 22.1% average performance improvements across various multi‑agent games. Code is publicly available at https://github.com/ydhe1012/Strat‑Reasoner.

PaperID: 219, https://arxiv.org/pdf/2605.03042.pdf GitHub

Authors: Ruofeng Yang, Yongcan Li, Shuai Li

Title: ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Abstract:
This report describes ARIS (Auto‑Research‑in‑sleep), an open‑source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long‑horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long‑running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. Therefore, we present ARIS as a research harness that coordinates machine‑learning research workflows through cross‑model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown‑defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end‑to‑end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three‑stage process for checking whether experimental claims are supported by evidence: integrity verification, result‑to‑claim mapping, and claim auditing that cross‑checks manuscript statements against the claim ledger and raw evidence, as well as a five‑pass scientific‑editing pipeline, mathematical‑proof checks, and visual inspection of the rendered PDF. A prototype self‑improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.

PaperID: 220, https://arxiv.org/pdf/2605.02801.pdf GitHub

Authors: Chenchen Zhang

Title: Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Abstract:
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM‑based multi‑agent systems through orchestration traces: temporal interaction graphs whose events include sub‑agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit‑ or signal‑bearing units from token to team; explicit counterfactual message‑level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub‑decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome‑llm‑mas‑rl, including an 84‑entry tagged paper pool, a 32‑record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.

PaperID: 221, https://arxiv.org/pdf/2605.02289.pdf GitHub

Authors: Xiyuan Zhou, Ruixi Zou, Xinlei Wang, Yuheng Cheng, Yan Xu, Junhua Zhao, Jinjin Gu

Title: EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions

Abstract:
Engineering problem solving is central to real‑world decision‑making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open‑ended analysis, feasibility‑driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi‑agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline‑based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility‑oriented engineering problem solving with LLMs. Our source code and data are available at https://github.com/AI4Engi/EngiAgent.

PaperID: 222, https://arxiv.org/pdf/2605.02063.pdf GitHub

Authors: Vik Pant, Eric Yu

Title: Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition

Abstract:
We present Coopetition‑Gym v1, a benchmark platform for mixed‑motive multi‑agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed‑form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward‑type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung‑Sony LCD, Renault‑Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game‑theoretic oracles, 2 heuristic baselines, and 101 constant‑action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708‑run training corpus and a 1,116‑run behavioral audit corpus, both released under CC‑BY‑4.0 with Croissant 1.0 metadata. Coopetition‑Gym v1 is the first platform to combine continuous‑action mixed‑motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game‑theoretic oracle baselines, and validated case studies.

PaperID: 223, https://arxiv.org/pdf/2605.01457.pdf GitHub

Authors: Guowei Zou, Haitao Wang, Beiwen Zhang, Boning Zhang, Hejun Wu

Title: CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

Abstract:
Generative models have emerged as a promising paradigm for offline multi‑agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few‑step acceleration methods either distill a joint teacher into independent students or apply averaged velocity fields independently to each agent. Unfortunately, these few‑step approaches hurt inter‑agent coordination. We show that the efficiency‑coordination trade‑off is not inherent: single‑pass multi‑agent generation can preserve coordination when the velocity field is natively joint‑coupled. We propose Coordinated few‑step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite‑difference consistency surrogate further replaces memory‑prohibitive Jacobian‑vector product backpropagation through the averaged velocity field with two stop‑gradient forward passes. Across 60 configurations spanning MPE, MA‑MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian policies, value‑based methods, transformer policies, diffusion models, and prior flow baselines on episodic return. Three independent coordination probes confirm that CoFlow's improvements arise from inter‑agent coordination rather than per‑agent capacity. A denoising‑step sweep shows that single‑pass inference suffices on every configuration. CoFlow reaches state‑of‑the‑art coordination quality in 1‑3 denoising steps under both centralized and decentralized execution. Project Page: https://guowei‑zou.github.io/coflow/

PaperID: 224, https://arxiv.org/pdf/2605.01347.pdf GitHub

Authors: Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang, Yu Cao, Jun Wang, Hua Yang, Yong Xie, Qianglong Chen

Title: MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

Abstract:
On‑policy distillation (OPD) trains a student on its own trajectories under token‑level teacher supervision, but existing methods are capped by a single‑teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per‑step errors compound across long trajectories and destabilize training. We propose MAD‑OPD (Multi‑Agent Debate‑driven On‑Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on‑policy state; the debate produces an emergent collective intelligence that supplies token‑level supervision, with each teacher's contribution weighted by its post‑debate confidence. To extend OPD to agentic tasks, we also introduce On‑Policy Agentic Distillation (OPAD), which adds step‑level sampling to stabilize training under multi‑step error compounding. We additionally derive a task‑adaptive divergence principle, selecting JSD (Jensen‑Shannon divergence) for agentic stability and reverse KL (Kullback‑Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher‑student configurations (Qwen3 and Qwen3.5; 1.7B‑14B students, 8B‑32B teachers) and five agentic and code benchmarks, MAD‑OPD ranks first across all six configurations; on the 14B+8B\to4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single‑teacher OPD.

PaperID: 225, https://arxiv.org/pdf/2605.00410.pdf GitHub

Authors: Aninda Ray

Title: Agent Capsules: Quality-Gated Granularity Control for Multi-Agent LLM Pipelines

Abstract:
A multi‑agent pipeline with N agents typically issues N LLM calls per run. Merging agents into fewer calls (compound execution) promises token savings, but naively merged calls silently degrade quality through tool loss and prompt compression. We present Agent Capsules, an adaptive execution runtime that treats multi‑agent pipeline execution as an optimization problem with empirical quality constraints. The runtime instruments coordination overhead per group, scores composition opportunity, selects among three compound execution strategies, and gates every mode switch on rolling‑mean output quality. A controlled negative result confirms that injecting more context into a merged call worsens compression rather than relieving it, so the framework's escalation ladder (standard, then two‑phase, then sequential) recovers quality by moving toward per‑agent dispatch rather than by rewriting merged prompts. On LLM‑judged quality, the controller matches a hand‑tuned oracle on every measured (model, group, mode) cell: routing compound whenever the oracle would, and reverting to fine whenever quality would fail the floor, without per‑model configuration. Against a hand‑crafted LangGraph implementation of a 14‑agent competitive intelligence pipeline, Agent Capsules uses 51% fewer fine‑mode input tokens and 42% fewer compound‑mode input tokens, at +0.020 and +0.017 quality respectively. Against a DSPy implementation of a 5‑agent due diligence pipeline, the framework uses 19% fewer tokens than uncompiled DSPy at quality parity, and 68% fewer tokens than MIPROv2 at +0.052 quality. Even before compound mode fires, the runtime delivers efficiency through automatic policy resolution, cache‑aligned prompts, and topology‑aware context injection, matching both hand‑tuned and compile‑time baselines without training data or per‑pipeline engineering.

PaperID: 226, https://arxiv.org/pdf/2604.27942.pdf GitHub

Authors: Djamel Bouchaffra, Faycal Ykhlef, Mustapha Lebbah, Hanane Azzag

Title: A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

Abstract:
Collective intelligence emerges across biological, physical, and artificial systems without central coordination, yet a unifying principle governing such behaviour remains elusive. The Free Energy Principle explains how individual agents adapt through variational inference, while game theory formalises strategic interactions. Here we introduce the Game‑Theoretic Free Energy Principle, a unified framework showing that multi‑agent systems performing local free‑energy minimisation implicitly implement a stochastic game. We prove that, under bounded rationality and local information constraints, stationary points of collective free energy correspond to approximate Nash equilibria of an induced game. Conversely, a broad class of cooperative games admits a variational representation in which equilibria arise as Gibbs distributions over coalitions, establishing a bridge between Bayesian inference and strategic interaction. To characterise higher‑order effects, we introduce a free‑energy formulation of the Harsanyi dividend, isolating irreducible multi‑agent synergy. This yields a predictive theory of cooperation, including a falsifiable non‑monotonic relationship between sensory precision and agent influence. We validate this prediction across neural, biological, and artificial multi‑agent systems. These results identify a common variational principle underlying inference, thermodynamics, and game‑theoretic equilibrium.

PaperID: 227, https://arxiv.org/pdf/2604.27691.pdf GitHub GitHub

Authors: Chao Fei, Hongcheng Guo, Yanghua Xiao

Title: When Agents Evolve, Institutions Follow

Abstract:
Across millennia, complex societies have faced the same coordination problem of how to organize collective action among cognitively bounded and informationally incomplete individuals. Different civilizations developed different political institutions to answer the same basic questions of who proposes, who reviews, who executes, and how errors are corrected. We argue that multi‑agent systems built on large language models face the same challenge. Their central problem is not only individual intelligence, but collective organization. Historical institutions therefore provide a structured design space for multi‑agent architectures, making key trade‑offs between efficiency and error correction, centralization and distribution, and specialization and redundancy empirically testable. We translate seven historical political institutions, spanning four canonical governance patterns, into executable multi‑agent architectures and evaluate them under identical conditions across three large language models and two benchmarks. We find that governance topology strongly shapes collective performance. Within a single model, the gap between the best and worst institution exceeds 57 percentage points, while the optimal architecture shifts systematically with model capability and task characteristics. These results suggest that collective intelligence will not advance through a single optimal organizational form, but through governance mechanisms that can be reselected and reconfigured as tasks and capabilities evolve. More broadly, this points to a transition from self‑evolving agents to the self‑evolving multi‑agent system. The code is available on \hrefhttps://github.com/cf3i/SocialSystemArenaGitHub.

PaperID: 228, https://arxiv.org/pdf/2604.27221.pdf GitHub

Authors: Yuxuan Huang, Yihang Chen, Zhiyuan He, Yuxiang Chen, Ka Yiu Lee, Huichi Zhou, Weilin Luo, Meng Fang, Jun Wang

Title: Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Abstract:
Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth‑oriented tasks demand schema‑aligned outputs with wide coverage and cross‑entity consistency, while depth‑oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi‑agent framework for web‑to‑table search that supports both regimes. Web2BigTable adopts a bi‑level architecture in which an upper‑level orchestrator decomposes the task into sub‑problems and lower‑level worker agents solve them in parallel. Through a closed‑loop run‑‑verify‑‑reflect process, the framework jointly improves decomposition and execution over time via persistent, human‑readable external memory, with self‑evolving updates to each single‑agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth‑oriented search on XBench‑DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.

PaperID: 229, https://arxiv.org/pdf/2604.26805.pdf GitHub

Authors: Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang

Title: Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Abstract:
Operating and maintaining (O&M) large‑scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM‑based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability ‑ specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook‑defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event‑to‑(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on‑call engineers via natural language instructions. (iii) The unified self‑evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e‑commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root‑cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.

PaperID: 230, https://arxiv.org/pdf/2604.26591.pdf GitHub

Authors: Rongliang Fu, Yi Liu, Qiang Xu, Tsung-Yi Ho

Title: MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

Abstract:
Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open‑source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent‑based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04% area reduction versus ABC and 7.93% versus mockturtle, with 46.6%‑‑96.0% S_overall improvement on EPFL benchmarks, while explicitly navigating the area‑‑delay trade‑off. Our code and data are available at https://github.com/Flians/MappingEvolve.

PaperID: 231, https://arxiv.org/pdf/2604.25924.pdf GitHub

Authors: Dumitru Verşebeniuc, Martijn Elands, Sara Falahatkar, Chiara Magrone, Mohammad Falah, Martijn Boussé, Aki Härmä

Title: Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

Abstract:
Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human‑like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context‑specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project‑specific regulations. We propose a virtual assistant based on a Retrieval‑Augmented Generation system that enhances the accuracy and reliability of responses by integrating up‑to‑date, domain‑specific knowledge. Through a robust evaluation framework and real‑life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM‑based systems for specific applications and highlights areas for further research.

PaperID: 232, https://arxiv.org/pdf/2604.25847.pdf GitHub

Authors: Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, Dongdong Ge

Title: From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Abstract:
Optimization modeling underpins real‑world decision‑making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural‑language requirements remains challenging for current large language models (LLMs). In this paper, we propose \emphAgora‑Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read‑write memory bank. Agora‑Opt allows multiple agent teams to independently produce end‑to‑end solutions and reconcile them through an outcome‑grounded debate protocol, while memory stores solver‑verified artifacts and past disagreement resolutions to support training‑free improvement over time. This design is flexible across both backbones and methods: it reduces base‑model lock‑in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora‑Opt achieves the strongest overall performance among all compared methods, outperforming strong zero‑shot LLMs, training‑centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross‑checking with reusable experience, and position Agora‑Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora‑Opt.

PaperID: 233, https://arxiv.org/pdf/2604.25602.pdf GitHub

Authors: Junxing Hu, Tianlong Li, Lei Yu, Ai Han

Title: OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction

Abstract:
Deploying production‑ready multi‑agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open‑source framework driven by two core novelties: a unified Oxy abstraction and the OxyBank evolution engine. The unified abstraction encapsulates agents, tools, LLMs, and reasoning flows as pluggable atomic components, enabling Lego‑like scalable system composition and non‑intrusive monitoring. To enhance observability, OxyGent introduces permission‑driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, providing adaptive visualizations. Furthermore, to support continuous evolution, OxyBank serves as an AI asset management platform that drives automated data backflow, annotation, and joint evolution. Empirical evaluations and real‑world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is fully open‑sourced under the Apache License 2.0 at https://github.com/jd‑opensource/OxyGent.

PaperID: 234, https://arxiv.org/pdf/2604.25220.pdf GitHub

Authors: Ridwan Mahbub, Syem Aziz, Mahir Ahmed, Shadikur Rahman, Mizanur Rahman, Shafiq Joty, Enamul Hoque

Title: DATAREEL: Automated Data-Driven Video Story Generation with Animations

Abstract:
Data videos are a powerful medium for visual data based storytelling, combining animated, chart‑centric visualizations with synchronized narration. Widely used in journalism, education, and public communication, they help audiences understand complex data through clear and engaging visual explanations. Despite their growing impact, generating data‑driven video stories remains challenging, as it requires careful coordination of visual encoding, temporal progression, and narration and substantial expertise in visualization design, animation, and video‑editing tools. Recent advances in large language models offer new opportunities to automate this process; however, there is currently no benchmark for rigorously evaluating models on animated visualization‑based video storytelling. To address this gap, we introduce DataReel, a benchmark for automated data‑driven video story generation comprising 328 real‑world stories. Each story pairs structured data, a chart visualization, and a narration transcript, enabling systematic evaluation of models' abilities to generate animated data video stories. We further propose a multi‑agent framework that decomposes the task into planning, generation, and verification stages, mirroring key aspects of the human storytelling process. Experiments show that this multi‑agent approach outperforms direct prompting baselines under both automatic and human evaluations, while revealing persistent challenges in coordinating animation, narration, and visual emphasis. We release DataReel at https://github.com/vis‑nlp/DataReel.

PaperID: 235, https://arxiv.org/pdf/2604.24971.pdf GitHub

Authors: Ishan Patel, Ishan Joshi

Title: PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Abstract:
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent ‑‑ the standard paradigm ‑‑ PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE ‑‑ a Fast Walsh‑Hadamard Transform (FWHT) rotation followed by 3‑bit Lloyd‑Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2‑1.7B‑Instruct and Llama‑3‑8B‑Instruct), three context lengths (600‑7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama‑3‑8B with 15 agents sharing a 4K‑token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB ‑‑ a 97.7% reduction ‑‑ while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to ‑0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy‑compressed KV pool with multi‑reader concurrent agent access.

PaperID: 236, https://arxiv.org/pdf/2604.24881.pdf GitHub

Authors: John Seon Keun Yi, Aaron Mueller, Dokyun Lee

Title: Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate

Abstract:
Multi‑agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute‑intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi‑agent debate into a single LLM through a two‑stage fine‑tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi‑agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent‑specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi‑agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at https://github.com/johnsk95/latent_agents

PaperID: 237, https://arxiv.org/pdf/2604.24696.pdf GitHub

Authors: Cheng Wang, Zhibin He, Zhihao Peng, Shengyuan Liu, Yufan Hu, Yang Carl, He Lifang, Lichao Sun, Xiang Li, Yixuan Yuan

Title: NeuroClaw Technical Report

Abstract:
Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi‑stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain‑specialized multi‑agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end‑to‑end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post‑execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three‑tier skill/agent hierarchy separates user‑facing interaction, high‑level orchestration, and low‑level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system‑level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw‑enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk‑aim‑group.github.io/NeuroClaw/index.html

PaperID: 238, https://arxiv.org/pdf/2604.24515.pdf GitHub

Authors: Yuqing Fu, Yimin Deng, Wanyu Wang, Yuhao Wang, Yejing Wang, Hongshi Liu, Yiqi Wang, Xiao Han, Maolin Wang, Guoshuai Zhao, Yi Chang, Xiangyu Zhao

Title: SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering

Abstract:
Multi‑hop Question Answering (MHQA) aims to answer questions that require multi‑step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt‑based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over‑rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non‑useful information. Therefore, we propose a Structured Entity‑Aware Retrieval with Chain‑of‑Reasoning Navigator framework named SEARCH‑R. Specifically, SEARCH‑R trains an end‑to‑end reasoning path navigator, which is able to provide a powerful sub‑question decomposer by fine‑tuning the Llama3.1‑8B model. Moreover, a novel dependency tree‑based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi‑hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied‑Machine‑Learning‑Lab/ACL2026_SEARCH‑R.

PaperID: 239, https://arxiv.org/pdf/2604.24501.pdf GitHub

Authors: Peihao Yan, Yun Chen, Jie Lu, Qijun Wang, Huacheng Zeng

Title: TARMM: Scaling Delay-Critical Edge AI Offloading in 5G O-RAN via Temporal Graph Mobility Management

Abstract:
Emerging delay‑critical edge AI applications, such as VR perception and real‑time video analytics, impose stringent latency and reliability requirements on 5G networks. However, existing mobility management mechanisms are largely reactive and fail to adapt to dynamic network conditions, resulting in suboptimal handover decisions and degraded performance. In this paper, we present TARMM, a 5G Open Radio Access Network (O‑RAN) system that optimizes user mobility management for delay‑critical edge AI offloading. The core of TARMM is a temporal graph model that captures the spatiotemporal dynamics of the RAN across users and cells, enabling near real‑time handover decisions. Building on this representation, we design a multi‑agent reinforcement learning (MARL) framework with rule‑based action masking and proactive resource preparation to ensure safe, stable, and efficient handovers. We implement TARMM on a multi‑cell indoor 5G O‑RAN testbed and evaluate it using diverse VR workloads. Extensive experiments show that TARMM reduces tail latency by up to 44% and packet loss by up to 56% compared to state‑of‑the‑art approaches. Source code and demo videos are available at: https://margo‑source.github.io/Margo/

PaperID: 240, https://arxiv.org/pdf/2604.24320.pdf GitHub

Authors: Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao

Title: DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Abstract:
Large language model (LLM) agents that follow the sequential "reason‑then‑act" paradigm have achieved superior performance in many complex tasks.However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross‑trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine‑tuning (SFT) imparts basic parallel reasoning and action generation, followed by reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory‑level success reward and two step‑level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state‑of‑the‑art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code‑for‑DPEPO)

PaperID: 241, https://arxiv.org/pdf/2604.24039.pdf GitHub

Authors: Hojoon Kim, Yuheng Wu, Thierry Tambe

Title: AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

Abstract:
Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per‑step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per‑step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi‑agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache‑based plan reuse thus offers a practical path to low‑latency, low‑cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.

PaperID: 242, https://arxiv.org/pdf/2604.24021.pdf GitHub

Authors: Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

Title: QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Abstract:
We present QED, an open‑source multi‑agent system that turns human‑provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single‑query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research‑level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

PaperID: 243, https://arxiv.org/pdf/2604.23781.pdf GitHub

Authors: Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Hannah Yao, Charles Chen, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

Title: ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Abstract:
Language‑model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge‑base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text‑centric. We introduce \bench, a benchmark for coworker agents built around multi‑turn multi‑day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule‑based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post‑execution service state; no LLM‑as‑judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end‑to‑end workflow completion remains rare. Turn‑level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker‑agent evaluation.

PaperID: 244, https://arxiv.org/pdf/2604.23626.pdf GitHub

Authors: Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You

Title: GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

Abstract:
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi‑round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory‑augmented agentic router for multi‑agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task‑specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single‑round and multi‑round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero‑shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab‑uiuc/GraphPlanner.

PaperID: 245, https://arxiv.org/pdf/2604.23539.pdf GitHub

Authors: Haoxuan Zhang, Ruochi Li, Yang Zhang, Zhenni Liang, Junhua Ding, Ting Xiao, Haihua Chen

Title: MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

Abstract:
The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large‑scale, high‑fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single‑source datasets, MetaGAI employs a multi‑agent framework with specialized Retriever, Generator, and Editor agents, validated through four‑dimensional human‑in‑the‑loop assessment, including human evaluation of editor‑refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM‑as‑a‑Judge frameworks. Extensive analysis reveals that sparse Mixture‑of‑Experts architectures achieve superior cost‑quality efficiency, while a fundamental trade‑off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: https://github.com/haoxuan‑unt2024/MetaGAI‑Benchmark.

PaperID: 246, https://arxiv.org/pdf/2604.23392.pdf GitHub

Authors: Zi Meng, Wanli Song, Yi Hu, Jiayuan Rao, Gang Chen

Title: SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Abstract:
Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI‑assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef‑Agents, a holistic and explainable multi‑agent decision‑making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector‑based knowledge base RefKnowledgeDB using the latest "Laws of the Game" and a classic case database for precise, knowledge‑driven reasoning; (iii) designing a novel multi‑agent architecture that collaborates via cross‑modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general‑purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.

PaperID: 247, https://arxiv.org/pdf/2604.23194.pdf GitHub

Authors: Haoran Tan, Zeyu Zhang, Chen Ma, Tianze Liu, Quanyu Dai, Xu Chen

Title: From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

Abstract:
Large language model‑based agents have recently emerged as powerful approaches for solving dynamic and multi‑step tasks. Most existing agents employ planning mechanisms to guide long‑term actions in dynamic environments. However, current planning approaches face a fundamental limitation that they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of progressive refinement in cognitive science, we propose AdaPlan‑H, a self‑adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse‑grained macro plan and progressively refines it based on task complexity. It generates self‑adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi‑step complex decision‑making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import‑myself/AHP.

PaperID: 248, https://arxiv.org/pdf/2604.23088.pdf GitHub

Authors: Samer Attrah

Title: Code Broker: A Multi-Agent System for Automated Code Quality Assessment

Abstract:
We present Code Broker, a multi agent system built on Google s Agent Development Kit ADK that analyses Python source code from individual files, local directory trees, or remote GitHub repositories and generates structured, actionable quality assessment reports. The system realises a hierarchical five agent architecture in which a root orchestrator coordinates a sequential pipeline agent that, in turn, dispatches three specialised agents concurrently a Correctness Assessor, a Style Assessor, and a Description Generator before synthesising their findings through an Improvement Recommender. Reports quantify four quality dimensions correctness, security, style, and maintainability on a normalised scale and are rendered in both Markdown and HTML for integration into diverse developer workflows. Code Broker fuses LLM based semantic reasoning with deterministic static analysis signals from Pylint, employs asynchronous execution with exponential backoff retry logic to improve robustness under transient API failures, and explores lightweight session memory for retaining and querying prior assessment context across runs. We frame this paper as a technical report on system design, prompt engineering, and tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases of varying scale. The results indicate that parallel specialised agents produce readable, developer oriented feedback that complements traditional linting, while also foregrounding current limitations in evaluation depth, security tooling, large repository handling, and the exclusive reliance on in memory persistence. All code and reproducibility materials are publicly available: https://github.com/Samir‑atra/agents_intensive_dev.

PaperID: 249, https://arxiv.org/pdf/2604.22577.pdf GitHub

Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia

Title: QuantClaw: Precision Where It Matters for OpenClaw

Abstract:
Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long‑context inputs and multi‑turn reasoning. This results in prohibitively high computational and monetary costs in real‑world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task‑dependent. Based on this observation, we propose QuantClaw, a plug‑and‑play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower‑cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM‑5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.

PaperID: 250, https://arxiv.org/pdf/2604.22436.pdf GitHub

Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz

Title: AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Abstract:
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution‑dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well‑specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large‑scale benchmark for agent search in the wild, built from nearly 10,000 real‑world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high‑level task descriptions, and evaluates relevance using execution‑grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description‑based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution‑aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo‑W/AgentSearchBench.

PaperID: 251, https://arxiv.org/pdf/2604.22239.pdf GitHub

Authors: Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo

Title: Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

Abstract:
This paper introduces the task of analytical question answering over large, semi‑structured document collections. We present MuDABench, a benchmark for multi‑document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi‑document QA benchmarks that typically require information from only a few documents with limited cross‑document reasoning, MuDABench demands extensive inter‑document analysis and aggregation. Constructed via distant supervision by leveraging document‑level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate‑fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi‑agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single‑document information extraction accuracy and insufficient domain‑specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli‑Li/MuDABench.

PaperID: 252, https://arxiv.org/pdf/2604.21816.pdf GitHub

Authors: Anuj Sadani, Deepak Kumar

Title: Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Abstract:
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per‑turn overhead the MCP Tax or Tools Tax that practitioner reports place between roughly 10k and 60k tokens in typical multi‑server deployments. This payload inflates the key‑value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware‑layer mechanism that generalizes the "Attention Is All You Need" paradigm from self‑attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state‑aware gating function enforcing preconditions and access scopes, and (iii) a two‑phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top‑k gated tools. We evaluate on a simulated 120‑tool, six‑server benchmark whose per‑server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per‑turn tool tokens by 95.0% (47.3k ‑> 2.4k) and raises effective context utilization (a token‑ratio quantity) from 24% to 91%. End‑to‑end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol‑level efficiency, not raw context length, is a binding constraint on scalable gentic systems. The code for this work is accessible at https://github.com/asadani/tool‑attention

PaperID: 253, https://arxiv.org/pdf/2604.21725.pdf GitHub

Authors: Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas

Title: AEL: Agent Evolving Learning for Open-Ended Environments

Abstract:
LLM agents increasingly operate in open‑ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \emphwhat to remember but \emphhow to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \emphAgent Evolving Learning (\ael), a two‑timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM‑driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector‑diverse tickers, 208 episodes, 5 random seeds), \ael achieves a Sharpe ratio of 2.13\pm0.47, outperforming five published self‑improving methods and all non‑LLM baselines while maintaining the lowest variance among all LLM‑based approaches. A nine‑variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per‑tool selection, cold‑start initialization, skill extraction, and three credit assignment methods) \emphdegrades performance. This demonstrates that the bottleneck in agent self‑improvement is \emphself‑diagnosing how to use experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.

PaperID: 254, https://arxiv.org/pdf/2604.21590.pdf GitHub GitHub

Authors: Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan, Ming Wang, Jun Huang

Title: AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Abstract:
Modern industrial applications increasingly demand language models that act as agents, capable of multi‑step reasoning and tool use in real‑world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi‑round reinforcement learning (RL) on synthetic data and a limited amount of open‑source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi‑branch behavior trees that better reflect the decision complexity of real‑world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba‑pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi‑sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.

PaperID: 255, https://arxiv.org/pdf/2604.21255.pdf GitHub

Authors: Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu

Title: When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

Abstract:
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non‑mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non‑mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool‑use habits modeled as directed graphs. Evaluating 18 models from 8 providers on τ‑Bench and τ^2‑Bench against Claude Sonnet 4.5 (thinking), we find that within‑family model pairs score 5.9 pp higher in AGS than cross‑family pairs, and that Kimi‑K2 (thinking) reaches 82.6% S_\textnode and 94.7% S_\textdep, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher‑specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.

PaperID: 256, https://arxiv.org/pdf/2604.21090.pdf GitHub

Authors: Christo Zietsman

Title: Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

Abstract:
AI governance programmes increasingly rely on natural language prompts to constrain and direct AI agent behaviour. These prompts function as executable specifications: they define the agent's mandate, scope, and quality criteria. Despite this role, no systematic framework exists for evaluating whether a governance prompt is structurally complete. We introduce a five‑principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology, and apply it to an empirical corpus of 34 publicly available AGENTS.md governance files sourced from GitHub. Our evaluation reveals that 37% of evaluated file‑model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner‑authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate. We discuss implications for requirements engineering practice in AI‑assisted development contexts, identify a previously undocumented artefact classification gap in the AGENTS.md convention, and propose directions for tool support.

PaperID: 257, https://arxiv.org/pdf/2604.20548.pdf GitHub

Authors: Shuai Chen, Chengzhi Zhang

Title: Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

Abstract:
Scientific progress depends on the continual generation of innovative re‑search ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)‑based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi‑agent iterative planning search strategy inspired by com‑binatorial innovation theory. The framework combines iterative knowledge search with an LLM‑based multi‑agent system to generate, evaluate, and re‑fine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state‑of‑the‑art base‑lines in both diversity and novelty. Further comparison with ideas derived from top‑tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high‑quality research idea generation. The source code and dataset used in this paper are publicly available on Github repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.

PaperID: 258, https://arxiv.org/pdf/2604.20261.pdf GitHub

Authors: Fengxian Dong, Zhi Zheng, Xiao Han, Wei Chen, Jingqing Ruan, Tong Xu, Yong Chen, Enhong Chen

Title: Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

Abstract:
Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high‑value features for complex tasks. Recent Large Language Model (LLM)‑based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory‑Augmented LLM‑based Multi‑Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state‑of‑the‑art baselines demonstrate the effectiveness of our approach. The code is available at https://github.com/fxdong24/MALMAS

PaperID: 259, https://arxiv.org/pdf/2604.20254.pdf GitHub

Authors: Wengyu Zhang, Xiao-Yong Wei, Qing Li

Title: Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

Abstract:
Text‑guided molecular design is a key capability for AI‑driven drug discovery, yet it remains challenging to map sequential natural‑language instructions with non‑linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine‑tuning or RL, emphasize a small set of ad‑hoc reasoning perspectives implemented in a largely one‑shot generation pipeline. In contrast, real‑world drug discovery relies on dynamic, multi‑perspective critique and iterative refinement to reconcile semantic intent with structural feasibility. Motivated by this, we propose Mol‑Debate, a generation paradigm that enables such dynamic reasoning through an iterative generate‑debate‑refine loop. We further characterize key challenges in this paradigm and address them through perspective‑oriented orchestration, including developer‑debater conflict, global‑local structural reasoning, and static‑dynamic integration. Experiments demonstrate that Mol‑Debate achieves state‑of‑the‑art performance against strong general and chemical baselines, reaching 59.82% exact match on ChEBI‑20 and 50.52% weighted success rate on S^2‑Bench. Our code is available at https://github.com/wyuzh/Mol‑Debate.

PaperID: 260, https://arxiv.org/pdf/2604.20136.pdf GitHub GitHub

Authors: Weitong Kong, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong, Alexander Jaus, Zdravko Marinov, Jiale Wei, Ruiping Liu, Junwei Zheng, Yufan Chen, Lei Qi, Rainer Stiefelhagen

Title: IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

Abstract:
Correcting errors in long‑video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end‑to‑end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT‑CYCLE, a supervisory multi‑agent system that reformulates long‑video understanding as iterative claim‑level maintenance of a shared semantic memory ‑‑ a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role‑specialized agents operating under explicit authority contracts decompose verification into local object‑relation correctness, cross‑temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency‑closure re‑verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.

PaperID: 261, https://arxiv.org/pdf/2604.20087.pdf GitHub

Authors: Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, Chenyan Xiong

Title: SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Abstract:
Skills have become the de facto way to enable LLM agents to perform complex real‑world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill‑dependent tasks across 15 sub‑domains derived from a real‑world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one‑shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no‑skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open‑ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self‑feedback alone induces recursive drift. Our data and code are open‑source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.

PaperID: 262, https://arxiv.org/pdf/2604.20043.pdf GitHub

Authors: Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang

Title: TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

Abstract:
Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri‑view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first‑person self‑reasoning bound to an action, (ii) explicit second‑person belief states about opponents updated over time, and (iii) third‑person oracle audits grounded in environment‑derived reference signals. This design turns explanations from free‑form narratives into evidence‑anchored objects that can be compared and checked across time and perspectives. Using imperfect‑information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction‑dependent property and motivate multi‑view, evidence‑grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.

PaperID: 263, https://arxiv.org/pdf/2604.19689.pdf GitHub

Authors: Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Title: A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Abstract:
Understanding artworks requires multi‑step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowl‑ edge, limiting interpretability and explicit evidence grounding. We propose A‑MAR, an Agent‑based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A‑MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditionedon this plan, enabling targeted evidence selection and supporting step‑wise, grounded explanations. To evaluate agent‑based multi‑ modal reasoning within the art domain, we introduce ArtCoT‑QA. This diagnostic benchmark features multi‑step reasoning chains for diverse art‑related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A‑MAR consistently outperforms static, non planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT‑QA further demonstrate its advantages in evidence grounding and multi‑step reasoning ability. These results highlight the importance of reasoning‑conditioned retrieval for knowledge‑intensive multimodal understanding and position A‑MAR as a step toward interpretable, goal‑driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A‑MAR.

PaperID: 264, https://arxiv.org/pdf/2604.19638.pdf GitHub

Authors: Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai

Title: SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Abstract:
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real‑world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state‑of‑the‑art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open‑source our code and dataset under https://github.com/sled‑group/SafetyALFRED.git

PaperID: 265, https://arxiv.org/pdf/2604.19548.pdf GitHub

Authors: Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu

Title: Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Abstract:
Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi‑agent frameworks assigning specialized roles are increasingly adopted to enable self‑reflection and mutual auditing. While such role‑playing effectively leverages domain expert knowledge, we find it simultaneously induces a human‑like cognitive bias known as Actor‑Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self‑reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis‑Antithesis‑Synthesis), a model trained through dialectical alignment to enforce perspective‑invariant reasoning. By integrating dialectical chain‑of‑thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.

PaperID: 266, https://arxiv.org/pdf/2604.19516.pdf GitHub

Authors: Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang, Jiaxuan Lu, Yifu Guo, Siyu Zhang, Yifan Wu, Ying Huang, Fu Li

Title: From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

Abstract:
Generative engines (GEs) are reshaping information access by replacing ranked links with citation‑grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi‑agent framework in which coordinated planning, editing, and fidelity‑aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine‑specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV‑CF, a dual‑axis metric that unifies semantic visibility with attribution accuracy. We further release MSME‑GEO‑Bench, a multi‑scenario, multi‑engine benchmark grounded in real‑world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine‑specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning‑driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu‑beining/MAGEO

PaperID: 267, https://arxiv.org/pdf/2604.19457.pdf GitHub

Authors: Vasundra Srininvasan

Title: Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

Abstract:
Long‑horizon enterprise agents make high‑stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi‑step reasoning, and binding regulatory constraints. Current evaluation reports a single task‑success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long‑horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory‑grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon‑Bench) covering loan qualification and insurance claims adjudication with deterministic ground‑truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema‑anchored architectures pay a scaffolding tax; plain summarization under a fact‑preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional‑alignment axis the field has not targeted. The decomposition also surfaced a pre‑registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis‑level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under‑represented in the alignment literature and become load‑bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.

PaperID: 268, https://arxiv.org/pdf/2604.19181.pdf GitHub

Authors: Isaac Lera, Carlos Guerrero

Title: YAIFS: Yet (not) Another Intelligent Fog Simulator: A Framework for Agent-Driven Computing Continuum Modeling & Simulation

Abstract:
Simulation plays a key role in the design and evaluation of distributed systems, yet it is often treated as a static tool with limited interaction capabilities. In this work, we present Yet (not) Another Intelligent Fog Simulator (YAIFS), and evolution of YAFS that redefines simulation as an interactive, service‑oriented environment. YAIFS introduce a layered architecture that exposes the simulation through a unified API and service interface, enabling external entities to observe, control, and modify its execution. A central contribution is the integration of the Model Context Protocol (MCP) as a standardized interaction layer between agents and the simulation. Through MCP, heterogeneous agents can access state, invoke actions and coordinate behavior using a common set of tools, decoupling agent experimentation workflows. We illustrate these capabilities through two scenarios: an LLM‑based assistant that enable natural language control of simulations, and a multi‑agent setting where agents monitor system conditions and adapt placement decisions at runtime. These scenarios demonstrate how MCP structures agent‑simulation interaction and enable adaptive behavior under dynamic workloads. The proposed approach transforms simulation into an interactive and programmable environment, opening new directions for AI‑driven experimentation in cloud‑edge systems. The implementation is publicly available at: http://github.com/acsicuib/YAIFS

PaperID: 269, https://arxiv.org/pdf/2604.19049.pdf GitHub

Authors: Abhinav Agarwal

Title: Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

Abstract:
LLM‑assisted defect discovery has a precision crisis: plausible‑but‑wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute‑or‑Promote, an inference‑time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross‑Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold‑start reviewers are intended to reduce anchoring cascades; cross‑family review can catch correlated blind spots that same‑family review misses. Over a 31‑day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated‑protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security‑related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140‑3 normative compliance issues under coordinated disclosure ‑‑ all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non‑existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross‑family critique variant also solved five previously unsolved SymPy instances on SWE‑bench Verified and one SWE‑rebench hard task.

PaperID: 270, https://arxiv.org/pdf/2604.18976.pdf GitHub

Authors: MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim

Title: STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Abstract:
While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR‑Teaming, a novel black‑box framework for automated red teaming that effectively generates such prompts. STAR‑Teaming integrates a Multi‑Agent System (MAS) with a Strategy‑Response Multiplex Network and employs network‑driven optimization to sample effective attack strategies. This network‑based approach recasts the intractable high‑dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR‑Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar‑ai/STAR‑Teaming‑paper.

PaperID: 271, https://arxiv.org/pdf/2604.18779.pdf GitHub

Authors: Weixi Tong, Yifeng Di, Tianyi Zhang

Title: Mango: Multi-Agent Web Navigation via Global-View Optimization

Abstract:
Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi‑agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi‑armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT‑5‑mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open‑source and closed‑source models as backbones. Our data and code are open‑source and available at https://github.com/VichyTong/Mango.

PaperID: 272, https://arxiv.org/pdf/2604.18557.pdf GitHub

Authors: Wei Yao, Haohan Ma, Hongwen Zhang, Yunlian Sun, Liangjun Xing, Zhile Yang, Yuanjun Guo, Yebin Liu, Jinhui Tang

Title: SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy

Abstract:
Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi‑agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo‑to‑Cooperative Agent Synergy to transfer skills from single‑agent human‑object interaction to multi‑agent human‑object‑human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction‑preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single‑agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single‑human data through decentralized training and multi‑agent PPO. Finally, we develop a trajectory‑conditioned generative policy using a conditional VAE, trained via multi‑teacher distillation from motion imitation priors to achieve stable and controllable object‑level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory‑conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent

PaperID: 273, https://arxiv.org/pdf/2604.18349.pdf GitHub GitHub

Authors: Shuqi Cao, Jingyi He, Fei Tan

Title: HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

Abstract:
Long‑term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer‑stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM‑Guided Memory System), a two‑level event‑turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high‑level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding the retrieval overhead that would be excessively high compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A‑Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at https://github.com/ZeroLoss‑Lab/HiGMem.

PaperID: 274, https://arxiv.org/pdf/2604.18190.pdf GitHub

Authors: Tim Goppelsroeder, Rasmus Jensen

Title: Scalable Neighborhood-Based Multi-Agent Actor-Critic

Abstract:
We propose MADDPG‑K, a scalable extension to Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi‑agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. MADDPG‑K mitigates this by restricting each agent's critic to the k closest agents under a chosen metric which in our case is Euclidean distance. This ensures a constant‑size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi‑Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at https://github.com/TimGop/MADDPG‑K .

PaperID: 275, https://arxiv.org/pdf/2604.18005.pdf GitHub

Authors: Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He

Title: Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation

Abstract:
Multi‑agent systems (MAS) are increasingly used for open‑ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS‑based ideation across three bottom‑up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per‑sample quality. At the cognition level, authority‑driven dynamics suppress semantic diversity compared to junior‑dominated groups. At the system level, group‑size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra‑Computing/MAS_Diversity.

PaperID: 276, https://arxiv.org/pdf/2604.17745.pdf GitHub

Authors: Hanhua Hong, Yizhi LI, Jiaoyan Chen, Sophia Ananiadou, Xiaoli Li, Jung-jae Kim, Chenghua Lin

Title: HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

Abstract:
Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi‑agent framework for end‑to‑end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine‑grained stages. We also identify limitations in the reference‑free evaluation of the Paper2Code benchmark and introduce Paper2Code‑Extra (P2C‑Ex), a refined protocol that incorporates repository‑level information and better aligns with the original reference‑based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain beyond the previous state‑of‑the‑art using open‑source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: https://github.com/KOU‑199024/HiRAS.

PaperID: 277, https://arxiv.org/pdf/2604.17503.pdf GitHub

Authors: Zheng Nie, Ruolin Shen, Xinlei Yu, Bo Yin, Jiangning Zhang, Xiaobin Hu

Title: SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

Abstract:
Scaling vision‑language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework that evolves both agent expertise and communication topology. Within this framework, a Multimodal Graph Transformer (MMGT) encodes visual tokens, instruction semantics and active skill embeddings to predict a query‑conditioned collaboration graph, replacing hand‑crafted routing with dynamic, content‑aware information flow. Complementing this, a Skill Designer distills and refines reasoning heuristics from failure cases, constructing a self‑evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, enabling the topology to adapt alongside capability growth. Experiments show that SkillGraph achieves consistent improvements across four benchmarks, five common MAS structures and four base models. Code is available at https://github.com/niez233/skillgraph.

PaperID: 278, https://arxiv.org/pdf/2604.17351.pdf GitHub

Authors: Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim

Title: SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

Abstract:
Automated simulator construction requires distributional fidelity, distinguishing it from generic code generation. We identify two failure modes in long‑horizon LLM agents: contextual drift and optimization instability arising from conflating structural and parametric errors. We propose SOCIA‑EVO, a dual‑anchored evolutionary framework. SOCIA‑EVO introduces: (1) a static blueprint to enforce empirical constraints; (2) a bi‑level optimization to decouple structural refinement from parameter calibration; and (3) a self‑curating Strategy Playbook that manages remedial hypotheses via Bayesian‑weighted retrieval. By falsifying ineffective strategies through execution feedback, SOCIA‑EVO achieves robust convergence, generating simulators that are statistically consistent with observational data. The code and data of SOCIA‑EVO are available here: https://github.com/cruiseresearchgroup/SOCIA/tree/evo.

PaperID: 279, https://arxiv.org/pdf/2604.17244.pdf GitHub

Authors: Priya Gurjar, Md Farhan Ishmam, Kenneth Marino

Title: DORA Explorer: Improving the Exploration Ability of LLMs Without Training

Abstract:
Despite the rapid progress, LLMs for sequential decision‑making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub‑optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token‑level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi‑Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain‑of‑Thought and Tree‑of‑Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity‑Oriented Ranking of Actions), a training‑free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log‑probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB‑competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5‑7B's performance from 29.2% to 45.5% in TextWorld. Our project is available at: https://dora‑explore.github.io/.

PaperID: 280, https://arxiv.org/pdf/2604.17186.pdf GitHub GitHub

Authors: Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse

Title: Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

Abstract:
As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi‑Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human‑AI collaboration. Although personas are well‑established in human‑computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human‑first, persona‑driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post‑usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that RE based on persona effectively connects technical requirements with non‑technical medical students from a human‑centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is~\hrefhttps://github.com/2sigmaEdTech/MAS/open sourced here.

PaperID: 281, https://arxiv.org/pdf/2604.17148.pdf GitHub

Authors: Sukwon Yun, Jie Peng, Pingzhi Li, Wendong Fan, Jie Chen, James Zou, Guohao Li, Tianlong Chen

Title: Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration

Abstract:
With an ever‑growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture‑of‑Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra‑agent communication, and (3) integrating responses efficiently. In this work, we propose Graph‑of‑Agents (GoA), a new graph‑based framework for modeling multi‑agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph‑based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi‑domain benchmarks (MMLU, MMLU‑Pro, GPQA) and domain‑specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi‑agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing‑positioning it as a strong candidate for navigating the challenges of the ever‑growing LLM zoo. Code is available at: https://github.com/UNITES‑Lab/GoA.

PaperID: 282, https://arxiv.org/pdf/2604.17091.pdf GitHub

Authors: Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, Hanyu Wu, Fang Guo, Keyi Wang, Zhonghua Hong, Zhiyu Lu, Lipeng Ma, Sihang Jiang, Yanghua Xiao

Title: GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

Abstract:
Long‑horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision‑making. At the same time, useful experience gained from tasks is often lost across episodes. We argue that long‑horizon performance is determined not by context length, but by how much decision‑relevant information is maintained within a finite context budget. We present GenericAgent (GA), a general‑purpose, self‑evolving LLM agent system built around a single principle: context information density maximization. GA implements this through four closely connected components: a minimal atomic tool set that keeps the interface simple, a hierarchical on‑demand memory that only shows a small high‑level view by default, a self‑evolution mechanism that turns verified past trajectories into reusable SOPs and executable code, and a context truncation and compression layer that maintains information density during long executions. Across task completion, tool use efficiency, memory effectiveness, self‑evolution, and web browsing, GA consistently outperforms leading agent systems while using significantly fewer tokens and interactions, and it continues to evolve over time. Project: https://github.com/lsdefine/GenericAgent

PaperID: 283, https://arxiv.org/pdf/2604.17025.pdf GitHub

Authors: Tianbao Zhang

Title: Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)

Abstract:
Large Language Models produce a controllability gap in safety‑critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay, and stochastic oscillation during self‑correction. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open‑loop generation to closed‑loop fail‑safe determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine‑readable registries enforced by a deterministic Unified Assertion Interface; and (3) Structured Semantic Gradients with State Locking for monotonic non‑regression. This paper makes two core claims. First, an industrialization thesis: once domain invariants are formalized as an executable Harness, the Harness itself becomes a first‑class enterprise asset that compounds in value as foundation models commoditize, and CAAF's ability to deliver its reliability on commodity‑tier models makes fully self‑hosted, on‑premises deployment architecturally feasible for regulated sectors where cloud APIs are not an option. Second, an architectural claim supported by ablation: CAAF's three pillars address complementary failure surfaces and none alone closes the controllability gap at commodity cost. The paper contributes entirely at the orchestration and industrialization layer. Evidence across two complementary benchmarks, three‑tier UAI ablations, multi‑agent baselines, and a closed‑source commodity family replicated by two independent open‑weight families, is reported in the body.

PaperID: 284, https://arxiv.org/pdf/2604.16839.pdf GitHub

Authors: Jinchang Zhu, Jindong Li, Cheng Zhang, Jiahong Liu, Menglin Yang

Title: HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

Abstract:
Long‑term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structure of human memory, wherein related experiences progressively strengthen interconnections through repeated co‑activation. Inspired by cognitive neuroscience, we identify three mechanisms central to biological memory: association, consolidation, and spreading activation, which remain largely absent in current research. To bridge this gap, we propose HeLa‑Mem, a bio‑inspired memory architecture that models memory as a dynamic graph with Hebbian learning dynamics. HeLa‑Mem employs a dual‑level organization: (1) an episodic memory graph that evolves through co‑activation patterns, and (2) a semantic memory store populated via Hebbian Distillation, wherein a Reflective Agent identifies densely connected memory hubs and distills them into structured, reusable semantic knowledge. This dual‑path design leverages both semantic similarity and learned associations, mirroring the episodic‑semantic distinction in human cognition. Experiments on LoCoMo demonstrate superior performance across four question categories while using significantly fewer context tokens. Code is available on GitHub: https://github.com/ReinerBRO/HeLa‑Mem

PaperID: 285, https://arxiv.org/pdf/2604.16706.pdf GitHub GitHub

Authors: Bhaskar Gurram

Title: Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Abstract:
Automated evaluation of tool‑using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp‑Bench, a 2,000‑task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100‑label human‑validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring‑based judging agrees with human annotation at kappa=0.049 (chance‑level); a three‑LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter‑level injection propagates to a wrong final answer with human‑calibrated probability approximately 0.62 (range 0.46‑0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT‑4o‑mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini‑2.0‑Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram‑ai/agenthallu‑bench.

PaperID: 286, https://arxiv.org/pdf/2604.16625.pdf GitHub

Authors: Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, Sean Welleck

Title: AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

Abstract:
Recent large language model (LLM) agents have shown promise in using execution feedback for test‑time adaptation. However, robust self‑improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain‑specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non‑linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self‑improvement via accumulated execution feedback for performance‑critical kernel code generation through two complementary stages: failure‑driven adaptation and diversity‑preserving search, jointly improving correctness and optimization performance without additional fine‑tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level‑2 and Level‑3, respectively, within 100 steps, and continues to improve with additional computation.

PaperID: 287, https://arxiv.org/pdf/2604.16543.pdf GitHub

Authors: Nokimul Hasan Arif, Qian Lou, Mengxin Zheng

Title: Conjunctive Prompt Attacks in Multi-Agent LLM Systems

Abstract:
Most LLM safety work studies single‑agent models, but many real applications rely on multiple interacting agents. In these systems, prompt segmentation and inter‑agent routing create attack surfaces that single‑agent evaluations miss. We study \emphconjunctive prompt attacks, where a trigger key in the user query and a hidden adversarial template in one compromised remote agent each appear benign alone but activate harmful behavior when routing brings them together. We consider an attacker who changes neither model weights nor the client agent and instead controls only trigger placement and template insertion. Across star, chain, and DAG topologies, routing‑aware optimization substantially increases attack success over non‑optimized baselines while keeping false activations low. Existing defenses, including PromptGuard, Llama‑Guard variants, and system‑level controls such as tool restrictions, do not reliably stop the attack because no single component appears malicious in isolation. These results expose a structural vulnerability in agentic LLM pipelines and motivate defenses that reason over routing and cross‑agent composition. Code is available at https://github.com/UCF‑ML‑Research/ConjunctiveAgents.

PaperID: 288, https://arxiv.org/pdf/2604.16541.pdf GitHub

Authors: Bo Gao, Chang Liu, Yuyang Miao, Siyuan Ma, Ser-Nam Lim

Title: BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

Abstract:
Recent advancements in Large Generative Models (LGMs) have revolutionized multi‑modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi‑modal grounding remains limited. Besides, while safety alignment is studied for text‑ or image‑only generation, existing works rarely integrate child‑specific safety constraints into narrative planning and sequence‑level multi‑modal verification. To address these limitations, we propose BookAgent, a safety‑aware multi‑agent collaboration framework designed for high‑quality, safety‑aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end‑to‑end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi‑modal grounding, BookAgent dynamically calibrates page‑level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying‑then‑rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi‑modal creation. The implementation will be publicly released at https://github.com/bogao‑code/BookAgent/tree/main.

PaperID: 289, https://arxiv.org/pdf/2604.16518.pdf GitHub

Authors: Ziyang Wang

Title: On-Orbit Space AI: Federated, Multi-Agent, and Collaborative Algorithms for Satellite Constellations

Abstract:
Satellite constellations are transforming space systems from isolated spacecraft into networked, software‑defined platforms capable of on‑orbit perception, decision making, and adaptation. Yet much of the existing AI studies remains centered on single‑satellite inference, while constellation‑scale autonomy introduces fundamentally new algorithmic requirements: learning and coordination under dynamic inter‑satellite connectivity, strict SWaP‑C limits, radiation‑induced faults, non‑IID data, concept drift, and safety‑critical operational constraints. This survey consolidates the emerging field of on‑orbit space AI through three complementary paradigms: (i) federated learning for cross‑satellite training, personalization, and secure aggregation; (ii) multi‑agent algorithms for cooperative planning, resource allocation, scheduling, formation control, and collision avoidance; and (iii) collaborative sensing and distributed inference for multi‑satellite fusion, tracking, split/early‑exit inference, and cross‑layer co‑design with constellation networking. We provide a system‑level view and a taxonomy that unifies collaboration architectures, temporal mechanisms, and trust models. To support community development and keep this review actionable over time, we continuously curate relevant papers and resources at https://github.com/ziyangwang007/AI4Space.

PaperID: 290, https://arxiv.org/pdf/2604.15710.pdf GitHub GitHub

Authors: Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia, Jingyu Lu, Fan Zhuo, Xueyi Pu, Yangzhuo Li, Zhou Zhao

Title: VoxMind: An End-to-End Agentic Spoken Dialogue System

Abstract:
Recent end‑to‑end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real‑world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool‑augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end‑to‑end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470‑hour AgentChat dataset, we incorporate a "Think‑before‑Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation. Furthermore, to mitigate latency bottlenecks caused by large‑scale tool integration, we propose a Multi‑Agent Dynamic Tool Management architecture. By asynchronously delegating retrieval tasks to an auxiliary agent aligned with the main model's reasoning trajectory, this system effectively decouples inference latency from toolset size. Experimental results confirm that VoxMind achieves significant improvements in agent performance: compared with strong baselines, the task completion rate increases from 34.88% to 74.57%, outperforming Gemini‑2.5‑Pro on spoken agent tasks while preserving general conversational quality. The source code and associated data are publicly available at https://github.com/MM‑Speech/VoxMind.

PaperID: 291, https://arxiv.org/pdf/2604.15373.pdf GitHub

Authors: Kieran A. Murphy

Title: InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control

Abstract:
We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent's king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information‑theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action‑induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi‑agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.

PaperID: 292, https://arxiv.org/pdf/2604.15034.pdf GitHub

Authors: Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Cankun Guo, Ming Yin, Bo An, Mengdi Wang

Title: Autogenesis: A Self-Evolving Agent Protocol

Abstract:
Recent advances in LLM based agent systems have shown promise in tackling complex, long horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under specify cross entity lifecycle and context management, version tracking, and evolution safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol registered resources with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self‑evolving multi‑agent system that dynamically instantiates, retrieves, and refines protocol‑registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution. The code is available at https://github.com/DVampire/Autogenesis.

PaperID: 293, https://arxiv.org/pdf/2604.14572.pdf GitHub

Authors: Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Title: Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Abstract:
Retrieval‑Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's‑eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer‑support benchmark, Corpus2Skill improves both answer quality and grounding over single‑shot dense, hybrid, hierarchical‑retrieval, and agentic RAG baselines at a moderate cost tradeoff. A ten‑subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single‑domain corpora with a recoverable topical taxonomy, but flat retrieval remains preferable on open‑domain factoid pools or homogeneous‑tabular corpora that defeat top‑level clustering. We characterize this scope distinction and discuss it as a design guideline for knowledge‑grounded systems. Code is available at https://github.com/dukesun99/Corpus2Skill.

PaperID: 294, https://arxiv.org/pdf/2604.14564.pdf GitHub

Authors: Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu, Kaiyan Zhang, Dazhi Zhang, Yuqiang Li, Biqing Qi, Bowen Zhou

Title: MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Abstract:
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning‑intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search‑enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single‑agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS^2 (Multi‑Agent Reinforced Tree‑Search Scaling), a unified RL framework in which multiple independently‑optimized agents collaborate within a shared tree‑structured search environment. MARS^2 models the search tree as a learnable multi‑agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path‑level group advantage formulation based on tree‑consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS^2 consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi‑agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.

PaperID: 295, https://arxiv.org/pdf/2604.14261.pdf GitHub GitHub

Authors: Zhuofeng Li, Yi Lu, Dongfu Jiang, Haoxiang Zhang, Yuyang Bai, Chuan Li, Yu Wang, Shuiwang Ji, Jianwen Xie, Yu Zhang

Title: ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

Abstract:
The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM‑based reviewers often generate superficial, formulaic comments lacking substantive, evidence‑grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper‑specific rubrics derived from official guidelines, the paper's content, and human‑written reviews. We further propose REVIEWGROUNDER, a rubric‑guided, tool‑integrated multi‑agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi‑4‑14B‑based drafter and a GPT‑OSS‑120B‑based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT‑4.1 and DeepSeek‑R1‑670B) in both alignment with human judgments and rubric‑based review quality across 8 dimensions. The code is available \hrefhttps://github.com/EigenTom/ReviewGrounderhere.

PaperID: 296, https://arxiv.org/pdf/2604.14228.pdf GitHub

Authors: Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen

Title: Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Abstract:
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open‑source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while‑loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML‑based classifier, a five‑layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append‑oriented session storage. A comparison with OpenClaw, a multi‑channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per‑action safety classification to perimeter‑level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context‑window extensions to gateway‑wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

PaperID: 297, https://arxiv.org/pdf/2604.14223.pdf GitHub

Authors: Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo

Title: TRACE: A Conversational Framework for Sustainable Tourism Recommendation with Agentic Counterfactual Explanations

Abstract:
Traditional conversational travel recommender systems primarily optimize for user relevance and convenience, often reinforcing popular, overcrowded destinations and carbon‑intensive travel choices. To address this, we present TRACE (Tourism Recommendation with Agentic Counterfactual Explanations), a multi‑agent, LLM‑based framework that promotes sustainable tourism through interactive nudging. TRACE uses a modular orchestrator‑worker architecture where specialized agents elicit latent sustainability preferences, construct structured user personas, and generate recommendations that balance relevance with environmental impact. A key innovation lies in its use of agentic counterfactual explanations and LLM‑driven clarifying questions, which together surface greener alternatives and refine understanding of intent, fostering user reflection without coercion. User studies and semantic alignment analyses demonstrate that TRACE effectively supports sustainable decision‑making while preserving recommendation quality and interactive responsiveness. TRACE is implemented on Google's Agent Development Kit, with full code, Docker setup, prompts, and a publicly available demo video to ensure reproducibility. A project summary, including all resources, prompts, and demo access, is available at https://ashmibanerjee.github.io/trace‑chatbot.

PaperID: 298, https://arxiv.org/pdf/2604.14054.pdf GitHub

Authors: Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

Title: $π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Abstract:
Deep search agents have emerged as a promising paradigm for addressing complex information‑seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self‑play offers a scalable route to reduce data dependence, but conventional self‑play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self‑play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self‑play can provide high‑quality privileged information for the self‑distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self‑Play (π‑Play), a novel multi‑agent self‑evolution framework combining self‑play and self‑distillation. In π‑Play, an examiner generates tasks together with QCPs, and a teacher employs QCP as privileged context to densely supervise a student via self‑distillation. This design transforms sparse‑reward self‑play into a dense‑feedback co‑evolution. Extensive experiments show that data‑free π‑Play surpasses fully supervised search agents and improves evolutionary efficiency by 2‑3× over conventional self‑play. Code is available at https://github.com/zhyaoch/pi‑play.

PaperID: 299, https://arxiv.org/pdf/2604.13699.pdf GitHub GitHub

Authors: Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na, Hyunsoo Cho, Sookyung Kim

Title: MIND: AI Co-Scientist for Material Research

Abstract:
Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to textbased reasoning without automated experimental verification. We propose MIND, an LLM‑driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate‑based validation within a multi‑agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet‑Omni, enabling scalable in‑silico experiments. We also provide a web‑based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS‑Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.

PaperID: 300, https://arxiv.org/pdf/2604.13552.pdf GitHub

Authors: Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu

Title: Training-Free Test-Time Contrastive Learning for Large Language Models

Abstract:
Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test‑time adaptation (TTA) methods rely on gradient‑based updates that require white‑box access and need substantial overhead, while training‑free alternatives are either static or depend on external guidance. In this paper, we propose Training‑Free Test‑Time Contrastive Learning TF‑TTCL, a training‑free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF‑TTCL implements a dynamic "Explore‑Reflect‑Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi‑agent role‑playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed‑ended reasoning tasks and open‑ended evaluation tasks demonstrate that TF‑TTCL consistently outperforms strong zero‑shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF‑TTCL.

PaperID: 301, https://arxiv.org/pdf/2604.13472.pdf GitHub

Authors: Zijian Zhao, Jing Gao, Sen Li

Title: Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non‑stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi‑Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single‑agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision‑making mechanism in which a Transformer decoder autoregressively generates a high‑level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order‑independent joint decision making and avoiding the sensitivity to action‑generation order in conventional Multi‑Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single‑agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi‑Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at:https://github.com/RS2002/CMAT .

PaperID: 302, https://arxiv.org/pdf/2604.13349.pdf GitHub

Authors: Yiping Li, Zhiyu An, Wan Du

Title: When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration

Abstract:
Communication in Large Language Model (LLM)‑based multi‑agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enables agents to exchange latent messages through full key‑value (KV) caches. However, full KV relay incurs high memory and communication cost. We adapt eviction‑style KV compression to this setting and introduce Orthogonal Backfill (OBF) to mitigate information loss from hard eviction. OBF injects a low‑rank orthogonal residual from discarded KV states into the retained KV states. We evaluate proposed method against full KV relay on nine standard benchmarks spanning mathematical reasoning, coding, and knowledge‑intensive QA. It achieves performance comparable to full KV relay while reducing communication cost by 79.8%‑‑89.4%. OBF further improves the performance and achieves the best results on 7 of the 9 benchmarks. This suggests that more information does not necessarily lead to better communication; preserving the most useful information matters more. Our codebase is publicly available on https://github.com/markli404/When‑Less‑Latent‑Leads‑to‑Better‑Relay.

PaperID: 303, https://arxiv.org/pdf/2604.13120.pdf GitHub

Authors: Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman

Title: AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

Abstract:
Large language models generate plausible code but cannot verify correctness. Existing multi‑agent systems simulate execution or leave verification optional. We introduce execution‑grounded verification as a first‑class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi‑agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next‑token likelihood. AGENTFORGE achieves 40.0% resolution on SWE‑BENCH Lite, outperforming single‑agent baselines by 26‑‑28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open‑source at https://github.com/raja21068/AutoCodeAI.

PaperID: 304, https://arxiv.org/pdf/2604.13072.pdf GitHub

Authors: Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang

Title: LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

Abstract:
LLM‑based agents are increasingly expected to handle real‑world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real‑world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple‑Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity‑factor annotations, covering real‑world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi‑AI/LiveClawBench.

PaperID: 305, https://arxiv.org/pdf/2604.13018.pdf GitHub

Authors: Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia

Title: Toward Autonomous Long-Horizon Engineering for ML Research

Abstract:
Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as \emphlong‑horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi‑agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File‑as‑Bus workspace that preserves decision‑relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini‑3‑Flash and GLM‑5, respectively. On MLE‑Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT‑5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later‑round refinement: removing File‑as‑Bus lowers PaperBench score by 6.41 points and MLE‑Bench Lite Any Medal% by 31.82 points. These results suggest that long‑horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.

PaperID: 306, https://arxiv.org/pdf/2604.12735.pdf GitHub

Authors: Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang, Taorui Wang, Wei Xia, Jiayu Zhang, Zhishu Liu, Hui Ma, Fei Ma, Qi Tian

Title: AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Abstract:
LLM‑based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single‑round retrieval‑augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect‑oriented multi‑agent retrieval‑augmented generation framework that leverages collaborative decision‑making among agents for fine‑grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross‑modal samples, assess evidence, and generate predictions. These agents are optimized end‑to‑end using Multi‑Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality‑Balancing Mixture of Experts (MB‑MoE) and Retrieval‑Augmented Adaptive Fusion (RAAF), where MB‑MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross‑modal heterogeneity, while RAAF enhances semantic completion under missing‑modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER‑UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

PaperID: 307, https://arxiv.org/pdf/2604.12487.pdf GitHub

Authors: Shuai Wang, Yinan Yu

Title: KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

Abstract:
Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge‑intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi‑hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG‑Reasoner, an end‑to‑end framework that integrates multi‑step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi‑hop and knowledge‑intensive reasoning benchmarks demonstrate that KG‑Reasoner achieves competitive or superior performance compared to the state‑of‑the‑art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG‑Reasoner.

PaperID: 308, https://arxiv.org/pdf/2604.12282.pdf GitHub

Authors: Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li

Title: Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

Abstract:
Spreadsheets are central to real‑world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real‑world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two‑stage multi‑agent framework for spreadsheet understanding that adopts a step‑by‑step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task‑driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT‑OSS‑120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real‑world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.

PaperID: 309, https://arxiv.org/pdf/2604.12213.pdf GitHub

Authors: Vasundra Srinivasan

Title: Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

Abstract:
Preserving multimodal signals across agent boundaries is necessary for accurate cross‑modal reasoning, but it is not sufficient. We show that modality‑native routing in Agent‑to‑Agent (A2A) networks improves task accuracy by 20 percentage points over text‑bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM‑backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two‑layer requirement: protocol‑level routing must be paired with capable agent‑level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal‑CS, a controlled 50‑task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text‑bottleneck baseline (95% bootstrap CI on ΔTCA: [8, 32] pp; McNemar's exact p = 0.006). Gains concentrate on vision‑dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8× latency cost from native multimodal processing. These results suggest that routing is a first‑order design variable in multi‑agent systems, as it determines the information available for downstream reasoning.

PaperID: 310, https://arxiv.org/pdf/2604.12144.pdf GitHub

Authors: Lucas Stoffl, Benedikt Wiestler, Johannes C. Paetzold

Title: VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

Abstract:
Drawing meaningful conclusions from inherently multimodal clinical data (including medical imaging) requires coordinating expertise across the clinical specialty, radiology, programming, and biostatistics. This fragmented process bottlenecks discovery. We present VERITAS (Verifiable Epistemic Reasoning for Image‑Derived Hypothesis Testing via Agentic Systems), a multi‑agent system that autonomously tests natural‑language hypotheses on multimodal clinical datasets while producing a fully auditable evidence trail: every statistical conclusion traces through inspectable, executable outputs from analysis plan to segmentation masks to statistical code to final verdict. VERITAS decomposes the workflow into four phases handled by role‑specialized agents, and introduces an epistemic evidence label framework that mechanically classifies outcomes as Supported, Refuted, Underpowered, or Invalid by jointly evaluating significance, effect direction, and study power. This distinction is critical in medical imaging, where non‑significant results often reflect insufficient sample size rather than absent effects. To evaluate the system, we construct a tiered benchmark of 64 hypotheses spanning six complexity levels across cardiac (ACDC, 150 subjects) and brain glioma (UCSF‑PDGM, 501 subjects) MRI. VERITAS reaches 81.4% verdict accuracy with frontier models and 71.2% with locally‑hosted open‑weight models (8‑30B), outperforming all five single‑model baselines in both classes. It also produces the highest rate of independently verifiable statistical outputs (86.6%), so even its failures remain diagnosable through artifact inspection. Structured multi‑agent decomposition thus substitutes for model scale while preserving the verifiability clinical research demands.

PaperID: 311, https://arxiv.org/pdf/2604.12054.pdf GitHub

Authors: Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt

Title: REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction

Abstract:
Extracting structured, machine‑readable compliance criteria from regulatory documents remains an open challenge. Single‑pass language models hallucinate structural elements, lose hierarchical relationships, and fail to resolve inter‑document dependencies. We introduce \textscRegReAct, a self‑correcting multi‑agent framework that decomposes regulatory information extraction into seven specialized stages, each with an Observe‑‑Diagnose‑‑Repair (ODR) loop that validates outputs against the source, correcting not only model hallucinations but also cross‑reference errors in the regulations themselves. To ensure structural accuracy, \textscRegReAct constructs a typed criterion graph; to ensure completeness, it resolves external dependencies by retrieving, summarizing, and embedding referenced legal content inline, producing self‑contained outputs. Applying \textscRegReAct to three EU Taxonomy Delegated Acts, we construct a dataset comprising 242 activities with over 4,800 hierarchical criteria, thresholds, and enriched source summaries. Evaluation against a GPT‑4o single‑pass baseline confirms that \textscRegReAct outperforms it across all structural and semantic metrics. Code and data will be made publicly available: https://github.com/RECOR‑Benchmark/RECOR

PaperID: 312, https://arxiv.org/pdf/2604.12016.pdf GitHub

Authors: Vladimir Vasilenko

Title: Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

Abstract:
Large language models map semantically related prompts to similar internal representations ‑‑ a phenomenon interpretable as attractor‑like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor‑like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean‑pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^‑27, Bonferroni‑corrected). Replication on Gemma 2 9B confirms cross‑architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor ‑‑ closer than a sham preprint ‑‑ distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor‑like geometry in LLM activation space.

PaperID: 313, https://arxiv.org/pdf/2604.11978.pdf GitHub GitHub

Authors: Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, Robert D Nowak

Title: The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Abstract:
Large language model (LLM) agents perform strongly on short‑ and mid‑horizon tasks, but often break down on long‑horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long‑horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross‑domain diagnostic benchmark for systematically constructing tasks and analyzing long‑horizon failure behaviors in LLM‑based agents. Using HORIZON, we evaluate state‑of‑the‑art (SOTA) agents from multiple model families (GPT‑5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon‑dependent degradation patterns. We further propose a trajectory‑grounded LLM‑as‑a‑Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter‑annotator κ=0.61; human‑judge κ=0.84). Our findings offer an initial methodological step toward systematic, cross‑domain analysis of long‑horizon agent failures and offer practical guidance for building more reliable long‑horizon agents. We release our project website at \hrefhttps://xwang2775.github.io/horizon‑leaderboard/HORIZON Leaderboard and welcome contributions from the community.

PaperID: 314, https://arxiv.org/pdf/2604.11975.pdf GitHub

Authors: Shaid Hasan, Breenice Lee, Sujan Sarker, Tariq Iqbal

Title: M2HRI: An LLM-Driven Multimodal Multi-Agent Framework for Personalized Human-Robot Interaction

Abstract:
Multi‑robot systems hold significant promise for social environments such as homes and hospitals, yet existing multi‑robot works treat robots as functionally identical, overlooking how robots individual identity shape user perception and how coordination shapes multi‑robot behavior when such individuality is present. To address this, we introduce M2HRI, a multimodal multi‑agent framework built on large language models that equips each robot with distinct personality and long‑term memory, alongside a coordination mechanism conditioned on these differences. In a controlled user study (n = 105) in a multi‑agent human‑robot interaction (HRI) scenario, we find that LLM‑driven personality traits are significantly distinguishable and enhance interaction quality, long‑term memory improves personalization and preference awareness, and centralized coordination significantly reduces overlap while improving overall interaction quality. Together, these results demonstrate that both agent individuality and structured coordination are essential for coherent and socially appropriate multi‑agent HRI. Project website and code are available at https://project‑m2hri.github.io/.

PaperID: 315, https://arxiv.org/pdf/2604.11686.pdf GitHub

Authors: Yixuan Nan, Xixun Lin, Yanmin Shang, Ge Zhang, Zheng Fang, Fang Fang, Yanan Cao

Title: EA-Agent: A Structured Multi-Step Reasoning Agent for Entity Alignment

Abstract:
Entity alignment (EA) aims to identify entities across different knowledge graphs (KGs) that refer to the same real‑world object and plays a critical role in knowledge fusion and integration. Traditional EA methods mainly rely on knowledge representation learning, but their performance is often limited under noisy or sparsely supervised scenarios. Recently, large language models (LLMs) have been introduced to EA and achieved notable improvements by leveraging rich semantic knowledge. However, existing LLM‑based EA approaches typically treat LLMs as black‑box decision makers, resulting in limited interpretability, and the direct use of large‑scale triples substantially increases inference cost. To address these challenges, we propose EA‑Agent, a reasoning‑driven agent for EA. EA‑Agent formulates EA as a structured reasoning process with multi‑step planning and execution, enabling interpretable alignment decisions. Within this process, it introduces attribute and relation triple selectors to filter redundant triples before feeding them into the LLM, effectively addressing efficiency challenges. Experimental results on three benchmark datasets demonstrate that EA‑Agent consistently outperforms existing EA methods and achieves state‑of‑the‑art performance. The source code is available at https://github.com/YXNan0110/EA‑Agent.

PaperID: 316, https://arxiv.org/pdf/2604.11557.pdf GitHub

Authors: Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

Title: UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Abstract:
Tool‑use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool‑use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single‑hop vs. multi‑hop and single‑turn vs. multi‑turn, while capturing both serial and parallel execution structures. To support coherent multi‑turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross‑turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query‑‑Action‑‑Observation‑‑Answer (QAOA) representation with fine‑grained evaluation at the function‑call, turn, and conversation levels. Experiments show that fine‑tuning Qwen3‑8B on our dataset substantially improves tool‑use performance. Under the distractor‑heavy Hybrid‑20 setting, achieves 93.0% single‑turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

PaperID: 317, https://arxiv.org/pdf/2604.11174.pdf GitHub

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Title: EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

Abstract:
Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable ‑‑ whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance‑oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy‑bounded, recoverable, auditable, and evolution‑safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single‑robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract‑aware upgrade workflows. Our analysis suggests that embodied governance should become a first‑class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.

PaperID: 318, https://arxiv.org/pdf/2604.11028.pdf GitHub

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Title: Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

Abstract:
As embodied robots move toward fleet‑scale operation, multi‑robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi‑agent decomposition within each robot. We argue for a different principle: multi‑robot coordination does not require intra‑robot multi‑agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single‑Agent Robotics (FSAR), a runtime architecture for multi‑robot coordination built on single‑agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross‑robot task delegation, policy‑aware authority assignment, trust‑scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter‑robot capability requests, local‑versus‑fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract‑aware cross‑robot coordination, and fleet‑level governance. We evaluate FSAR on representative multi‑robot coordination scenarios against decomposition‑heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition‑heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

PaperID: 319, https://arxiv.org/pdf/2604.10938.pdf GitHub

Authors: Shanshan Zhong, Kate Shen, Chenyan Xiong

Title: AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web

Abstract:
Agentic Web is an emerging paradigm where autonomous agents help users use online information. As the paradigm develops, content providers are also deploying agents to manage their data and serve it through controlled interfaces. This shift moves information access from centralized retrieval to decentralized coordination. To study this setting, we introduce AgentWebBench, a benchmark that evaluates how well a user agent synthesizes answers by interacting with website‑specific content agents. We evaluate four tasks that cover common web information needs, spanning ranked retrieval (web search, web recommendation) and open‑ended synthesis (question answering, deep research). Across seven advanced LLMs and three coordination strategies, multi‑agent coordination generally lags behind centralized retrieval as expected, because user agent cannot directly access the corpus, but the gap shrinks with model scale and can even outperform centralized retrieval on question answering. This benchmark also enables us to study properties of the emerging paradigm of the digital world. We find that decentralized access concentrates traffic toward a small set of websites, test time scaling improves both interaction reliability and task performance, and strong results require sufficient interactions guided by careful planning. Finally, our failure analysis suggests that user agents need better planning and answer synthesis, while content agents need more reliable retrieval and evidence quality. Code, data, and APIs are released on https://github.com/cxcscmu/AgentWebBench.

PaperID: 320, https://arxiv.org/pdf/2604.10866.pdf GitHub

Authors: Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho

Title: OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Abstract:
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real‑world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain‑specific environments through LLM‑driven tool response generation. Our multi‑agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document‑grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT‑5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LES‑based evaluation reliability. OccuBench provides the first systematic cross‑industry evaluation of AI agents on professional occupational tasks.

PaperID: 321, https://arxiv.org/pdf/2604.10740.pdf GitHub

Authors: Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu, April Xiaowen Dong

Title: RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game

Abstract:
Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk‑Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non‑cooperative Stackelberg game. RCBSF establishes a hierarchical Leader Follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state‑of‑the‑art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF .

PaperID: 322, https://arxiv.org/pdf/2604.10674.pdf GitHub

Authors: Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi

Title: Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Abstract:
Reinforcement learning (RL) has been widely used to train LLM agents for multi‑turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On‑policy self‑distillation (OPSD) alleviates this by providing dense token‑level supervision from a privileged teacher that has access to ground‑truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill‑SD, a framework that turns the agent's own trajectories into dynamic training‑only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance‑weighted reverse‑KL loss to provide gradient‑correct token‑level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill‑SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill‑sd/

PaperID: 323, https://arxiv.org/pdf/2604.10547.pdf GitHub

Authors: Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian

Title: Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Abstract:
We introduce Agent2 RL‑Bench, a compact diagnostic benchmark for evaluating agentic RL post‑training, which tests whether LLM agents can autonomously design, implement, debug, and execute post‑training pipelines that improve foundation models. RL post‑training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine‑tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent2 RL‑Bench provides a unified agent‑facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule‑based training to judge‑based optimization and closed‑loop online RL with trajectory collection. Two diagnostic skills, namely runtime recording and post‑hoc summarization, enable structured analysis of agent behavior, facilitating smooth and effective iteration of the benchmark's evaluation framework. Across five agent systems and six driver LLMs, agents show intelligent behavior but clear limitations: one RL‑oriented run improves ALFWorld from 4.85 to 93.28 via SFT warm‑up and GRPO with online rollouts, yet DeepSearchQA remains difficult, most successful routes rely on supervised pipelines, and interactive outcomes show large single‑run differences across agent stacks. Overall, Agent2 RL‑Bench shows that current agents can sometimes engineer online RL, but stable agent‑driven RL post‑training remains rare under fixed budgets. It also demonstrates that our benchmark provides a strong and effective evaluation framework for future research in this direction. Code is available at https://github.com/microsoft/RD‑Agent/blob/main/rdagent/scenarios/rl/autorl_bench/README.md

PaperID: 324, https://arxiv.org/pdf/2604.10291.pdf GitHub

Authors: Malgorzata Gwiazda, Yifu Cai, Mononito Goswami, Arjun Choudhry, Artur Dubrawski

Title: TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

Abstract:
Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple‑choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognitionnoise understandingsimilarity analysisanomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real‑world datasets spanning healthcare, finance and weather domains. Through multi‑dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain‑specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at https://github.com/magwiazda/TimeSeriesExamAgent.

PaperID: 325, https://arxiv.org/pdf/2604.10261.pdf GitHub

Authors: Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang

Title: The Amazing Agent Race: Strong Tool Users, Weak Navigators

Abstract:
Existing tool‑use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork‑merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi‑step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live‑API validation. Three complementary metrics (finish‑line accuracy, pit‑stop visit rate, and roadblock completion rate) separately diagnose navigation, tool‑use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool‑use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the‑amazing‑agent‑race

PaperID: 326, https://arxiv.org/pdf/2604.09568.pdf GitHub

Authors: Tianfu Wang, Leilei Ding, Ziyang Tao, Yi Zhan, Zhiyuan Ma, Wei Wu, Yuxuan Lei, Yuan Feng, Junyang Wang, Yin Wu, Yizhao Xu, Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Yanyong Zhang, Hui Xiong

Title: EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution

Abstract:
High‑fidelity diagram creation requires the complex orchestration of semantic topology, visual styling, and spatial layout, posing a significant challenge for automated systems. Existing methods also suffer from a representation gap: pixel‑based models often lack precise control, while code‑based synthesis limits intuitive flexibility. To bridge this gap, we introduce EvoDiagram, an agentic framework that generates object‑level editable diagrams via an intermediate canvas schema. EvoDiagram employs a coordinated multi‑agent system to decouple semantic intent from rendering logic, resolving conflicts across heterogeneous design layers. Additionally, we propose a design knowledge evolution mechanism that distills execution traces into a hierarchical memory of domain guidelines, enabling agents to retrieve context‑aware expertise adaptively. We further release CanvasBench, a benchmark consisting of both data and metrics for canvas‑based diagramming. Extensive experiments demonstrate that EvoDiagram exhibits excellent performance and balance against baselines in generating editable, structurally consistent, and aesthetically coherent diagrams. Our code is available at https://github.com/AuraX‑AI/EvoDiagram.

PaperID: 327, https://arxiv.org/pdf/2604.09521.pdf GitHub

Authors: Anthony T. Nixon

Title: Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

Abstract:
When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP Q_m,T(M) ‑ the unique coarsest abstraction consistent with an agent's capacity ‑ serves as a capacity‑derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate R_\textcrit determined by the quotient mismatch, intent‑preserving communication is structurally impossible. In the supported one‑way memoryless regime, classical side‑information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed‑\varepsilon structural phase‑transition theorem whose lower bound is fully general on the common‑history quotient comparison; (2) a one‑way Wyner‑Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long‑run bridge via explicit mixing bounds; (3) an asymptotic one‑way converse in the shrinking‑distortion regime \varepsilon = O(1/T), proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured‑policy benchmark shows the one‑way rate can drop by up to 19× relative to the counting bound, and a shrinking‑distortion sweep matches the regime of the asymptotic converse.

PaperID: 328, https://arxiv.org/pdf/2604.09388.pdf GitHub GitHub

Authors: Andy Anderson

Title: The AI Codebase Maturity Model: From Assisted Coding to Fully Autonomous Systems

Abstract:
AI coding tools are widely adopted, but most teams plateau at prompt‑and‑review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 6‑level framework describing how codebases evolve from basic AI‑assisted coding to fully autonomous systems. Inspired by CMMI, each level is defined by its feedback loop topology ‑ the specific mechanisms that must exist before the next level becomes possible. I validate the model through a 100‑day experience report maintaining KubeStellar Console, a CNCF Kubernetes dashboard built from scratch with Claude Code (Opus) and GitHub Copilot, and through the initial production deployment of Hive ‑ an open‑source multi‑agent orchestration system that realizes Level 6: full autonomy. The system currently operates with 74 CI/CD workflows, 32 nightly test suites, 91% code coverage, and achieves bug‑to‑fix times under 30 minutes ‑ 24 hours a day. The central finding: the intelligence of an AI‑driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it. You cannot skip levels, and at each level, the thing that unlocks the next one is another feedback mechanism. Testing ‑ the volume of test cases, the coverage thresholds, and the reliability of test execution ‑ proved to be the single most important investment in the entire journey. v2 extends the model from 5 to 6 levels, adding Level 6 (Fully Autonomous) with Hive as reference implementation and Beads for cross‑agent memory continuity, plus throughput acceleration data showing 5x PR throughput and 37x issue throughput from Level 2 to Level 6.

PaperID: 329, https://arxiv.org/pdf/2604.08963.pdf GitHub

Authors: Keyu Li, Jin Gao, Dequan Wang

Title: Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

Abstract:
While Multi‑Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties‑particularly the accumulation of bias‑remain poorly understood. Because real‑world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic MAS topologies and feedback loops influence prejudice. Contrary to the assumption that multi‑agent collaboration naturally dilutes bias, we hypothesize that structured workflows act as echo chambers, amplifying minor stochastic biases into systemic polarization. To evaluate this, we introduce Discrim‑Eval‑Open, an open‑ended benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyzing bias cascades across various structures reveals that architectural sophistication frequently exacerbates bias rather than mitigating it. We observe systemic amplification even when isolated agents operate neutrally, and identify a 'Trigger Vulnerability' where injecting purely objective context drastically accelerates polarization. By stripping away advanced swarm complexity to study foundational dynamics, we establish a crucial baseline: structural complexity does not guarantee ethical robustness. Our code is available at https://github.com/weizhihao1/MAS‑Bias.

PaperID: 330, https://arxiv.org/pdf/2604.08927.pdf GitHub

Authors: Huangwei Chen, Wu Li, Junhao Jia, Yining Chen, Xiaotao Pang, YaLong Chen, Gonghui Li, Haishuai Wang, Jiajun Bu, Lei Wu

Title: Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents

Abstract:
The initial outpatient consultation is critical for clinical decision‑making, yet it is often conducted by a single physician under time pressure, making it prone to cognitive biases and incomplete evidence capture. Although the Multi‑Disciplinary Team (MDT) reduces these risks, they are costly and difficult to scale to real‑time intake. We propose Aegle, a synchronous virtual MDT framework that brings MDT‑level reasoning to outpatient consultations via a graph‑based multi‑agent architecture. Aegle formalizes the consultation state using a structured SOAP representation, separating evidence collection from diagnostic reasoning to improve traceability and bias control. An orchestrator dynamically activates specialist agents, which perform decoupled parallel reasoning and are subsequently integrated by an aggregator into a coherent clinical note. Experiments on ClinicalBench and a real‑world RAPID‑IPN dataset across 24 departments and 53 metrics show that Aegle consistently outperforms state‑of‑the‑art proprietary and open‑source models in documentation quality and consultation capability, while also improving final diagnosis accuracy. Our code is available at https://github.com/HovChen/Aegle.

PaperID: 331, https://arxiv.org/pdf/2604.08590.pdf GitHub

Authors: Brendan R. Hogan, Xiwen Chen, James T. Wilson, Kashif Rasul, Adel Boyarsky, Thomas Kamei, Anderson Schneider, Yuriy Nevmyvaka

Title: AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs

Abstract:
We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation‑intensive domains. Given only a dataset and a natural‑language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large‑scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain‑specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT‑5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than torch.compile on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single‑shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23‑25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi‑model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at https://brendanhogan.github.io/alphalab‑paper/.

PaperID: 332, https://arxiv.org/pdf/2604.08539.pdf GitHub

Authors: Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang

Title: OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Abstract:
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open‑source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine‑grained perception with multi‑step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G^2RPO), a novel RL training objective that replaces standard linear scaling with non‑linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, \mathcalN(0,1), G^2RPO theoretically ensures inter‑task gradient equity, mitigates vulnerabilities to heavy‑tail outliers, and offers symmetric update for positive and negative rewards. Leveraging the enhanced training stability provided by G^2RPO, we introduce two task‑level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforce direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general‑purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open‑source and leading proprietary frontier models.

PaperID: 333, https://arxiv.org/pdf/2604.08276.pdf GitHub

Authors: Wansheng Wu, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou

Title: ACF: A Collaborative Framework for Agent Covert Communication under Cognitive Asymmetry

Abstract:
As generative artificial intelligence evolves, autonomous agent networks present a powerful paradigm for interactive covert communication. However, because agents dynamically update internal memories via environmental interactions, existing methods face a critical structural vulnerability: cognitive asymmetry. Conventional approaches demand strict cognitive symmetry, requiring identical sequence prefixes between the encoder and decoder. In dynamic deployments, inevitable prefix discrepancies destroy synchronization, inducing severe channel degradation. To address this core challenge of cognitive asymmetry, we propose the Asymmetric Collaborative Framework (ACF), which structurally decouples covert communication from semantic reasoning via orthogonal statistical and cognitive layers. By deploying a prefix‑independent decoding paradigm governed by a shared steganographic configuration, ACF eliminates the reliance on cognitive symmetry. Evaluations on realistic memory‑augmented workflows demonstrate that under severe cognitive asymmetry, symmetric baselines suffer severe channel degradation, whereas ACF uniquely excels across both semantic fidelity and covert communication. It maintains computational indistinguishability, enabling reliable secret extraction with provable error bounds, and providing robust Effective Information Capacity guarantees for modern agent networks.

PaperID: 334, https://arxiv.org/pdf/2604.08226.pdf GitHub

Authors: Seyed Amir Ahmad Safavi-Naini, Elahe Meftah, Josh Mohess, Pooya Mohammadi Kazaj, Georgios Siontis, Zahra Atf, Peter R. Lewis, Mauricio Reyes, Girish Nadkarni, Roland Wiest, Stephan Windecker, Christoph Grani, Ali Soroush, Isaac Shiri

Title: Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework

Abstract:
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision‑making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill‑Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field's central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.

PaperID: 335, https://arxiv.org/pdf/2604.07970.pdf GitHub

Authors: Kevin Riehl, Julius Schlapbach, Anastasios Kouvelas, Michail A. Makridis

Title: Karma Mechanisms for Decentralised, Cooperative Multi Agent Path Finding

Abstract:
Multi‑Agent Path Finding (MAPF) is a fundamental coordination problem in large‑scale robotic and cyber‑physical systems, where multiple agents must compute conflict‑free trajectories with limited computational and communication resources. While centralised optimal solvers provide guarantees on solution optimality, their exponential computational complexity limits scalability to large‑scale systems and real‑time applicability. Existing decentralised heuristics are faster, but result in suboptimal outcomes and high cost disparities. This paper proposes a decentralised coordination framework for cooperative MAPF based on Karma mechanisms ‑ artificial, non‑tradeable credits that account for agents' past cooperative behaviour and regulate future conflict resolution decisions. The approach formulates conflict resolution as a bilateral negotiation process that enables agents to resolve conflicts through pairwise replanning while promoting long‑term fairness under limited communication and without global priority structures. The mechanism is evaluated in a lifelong robotic warehouse multi‑agent pickup‑and‑delivery scenario with kinematic orientation constraints. The results highlight that the Karma mechanism balances replanning effort across agents, reducing disparity in service times without sacrificing overall efficiency. Code: https://github.com/DerKevinRiehl/karma_dmapf

PaperID: 336, https://arxiv.org/pdf/2604.07851.pdf GitHub

Authors: Jiani Huang, Shijie Wang, Liangbo Ning, Wenqi Fan, Qing Li

Title: ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

Abstract:
With the rise of LLMs, there is an increasing need for intelligent recommendation assistants that can handle complex queries and provide personalized, reasoning‑driven recommendations. LLM‑based recommenders show potential but face challenges in multi‑step reasoning, underscoring the need for reasoning‑augmented systems. To address this gap, we propose ReRec, a novel reinforcement fine‑tuning (RFT) framework designed to improve LLM reasoning in complex recommendation tasks. Our framework introduces three key components: (1) Dual‑Graph Enhanced Reward Shaping, integrating recommendation metrics like NDCG@K with Query Alignment and Preference Alignment Scores to provide fine‑grained reward signals for LLM optimization; (2) Reasoning‑aware Advantage Estimation, which decomposes LLM outputs into reasoning segments and penalizes incorrect steps to enhance reasoning of recommendation; and (3) Online Curriculum Scheduler, dynamically assess query difficulty and organize training curriculum to ensure stable learning during RFT. Experiments demonstrate that ReRec outperforms state‑of‑the‑art baselines and preserves core abilities like instruction‑following and general knowledge. Our codes are available at https://github.com/jiani‑huang/ReRec.

PaperID: 337, https://arxiv.org/pdf/2604.07752.pdf GitHub

Authors: Yifei Chen, Sarra Habchi, Lili Wei

Title: MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

Abstract:
Modern video games are complex, non‑deterministic systems that are difficult to test automatically at scale. Although prior work shows that personality‑driven Large Language Model (LLM) agents can improve behavioural diversity and test coverage, existing tools largely remain research prototypes and lack cross‑game reusability. This tool paper presents MIMIC‑Py, a Python‑based automated game‑testing tool that transforms personality‑driven LLM agents into a reusable and extensible framework. MIMIC‑Py exposes personality traits as configurable inputs and adopts a modular architecture that decouples planning, execution, and memory from game‑specific logic. It supports multiple interaction mechanisms, enabling agents to interact with games via exposed APIs or synthesized code. We describe the design of MIMIC‑Py and show how it enables deployment to new game environments with minimal engineering effort, bridging the gap between research prototypes and practical automated game testing. The source code and a demo video are available on our project webpage: https://mimic‑persona.github.io/MIMIC‑Py‑Home‑Page/.

PaperID: 338, https://arxiv.org/pdf/2604.07549.pdf GitHub

Authors: Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh

Title: EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

Abstract:
Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi‑party workflow and annotations needed for this setting. We introduce an ePCR‑grounded, topic‑flow‑based multi‑agent generation pipeline that iteratively plans, generates, and self‑refines dialogues with rule‑based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi‑speaker EMS conversations based on a real‑world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn‑level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance‑ and conversation‑level metrics. Results show that EMSDialog‑augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction. Our datasets and code are publicly available at https://uva‑dsa.github.io/EMSDialog

PaperID: 339, https://arxiv.org/pdf/2604.07487.pdf GitHub

Authors: Linbo Liu, Guande Wu, Han Ding, Yawei Wang, Qiang Zhou, Yuzhe Lu, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong

Title: CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

Abstract:
Large language model agents rely on effective model context to obtain task‑relevant information for decision‑making. Many existing context engineering approaches primarily rely on the context generated from the past experience and retrieval mechanisms that reuse these context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine‑tuning data to train a context augmentation model (CAM). Then we further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task‑specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. It improves task completion rate from 72.62% to 81.15% on AppWorld test set and averaged reward from 0.68 to 0.74 on a subset of WebShop, compared with baseline agent. Our code is publicly available at https://github.com/awslabs/CLEAR.

PaperID: 340, https://arxiv.org/pdf/2604.07395.pdf GitHub

Authors: Wenze Wang, Mehdi Hosseinzadeh, Feras Dayoub

Title: A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

Abstract:
Robotic manipulation systems that follow language instructions often execute grasp primitives in a largely single‑shot manner: a model proposes an action, the robot executes it, and failures such as empty grasps, slips, stalls, timeouts, or semantically wrong grasps are not surfaced to the decision layer in a structured way. Inspired by agentic loops in digital tool‑using agents, we reformulate language‑guided grasping as a bounded embodied agent operating over grounded execution states, where physical actions expose an explicit tool‑state stream. We introduce a physical agentic loop that wraps an unmodified learned manipulation primitive (grasp‑and‑lift) with (i) an event‑based interface and (ii) an execution monitoring layer, Watchdog, which converts noisy gripper telemetry into discrete outcome labels using contact‑aware fusion and temporal stabilization. These outcome events, optionally combined with post‑grasp semantic verification, are consumed by a deterministic bounded policy that finalizes, retries, or escalates to the user for clarification, guaranteeing finite termination. We validate the resulting loop on a mobile manipulator with an eye‑in‑hand D405 camera, keeping the underlying grasp model unchanged and evaluating representative scenarios involving visual ambiguity, distractors, and induced execution failures. Results show that explicit execution‑state monitoring and bounded recovery enable more robust and interpretable behavior than open‑loop execution, while adding minimal architectural overhead. For the source code and demo refer to our project page: https://wenzewwz123.github.io/Agentic‑Loop/

PaperID: 341, https://arxiv.org/pdf/2604.07041.pdf GitHub

Authors: Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

Title: AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

Abstract:
Text‑to‑SQL is the task of translating natural language queries into executable SQL for a given database, enabling non‑expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real‑world settings, where database schemas are large and questions require multi‑step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one‑shot generation frequently produces non‑executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV‑SQL, a framework that decomposes complex Text‑to‑SQL into a pipeline of specialized LLM agents. Central to AV‑SQL is the concept of agentic views: agent‑generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV‑SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV‑SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state‑of‑the‑art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at https://github.com/pminhtam/AV‑SQL.

PaperID: 342, https://arxiv.org/pdf/2604.06180.pdf GitHub

Authors: Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah

Title: MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis

Abstract:
Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to capability of these models in providing precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable under the wide range of medical conditions in real‑world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain‑specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi‑agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of various specialists, cannot be adapted to the changing practical scenario. In this paper, we propose MedRoute, a flexible and dynamic multi‑agent framework that comprises of a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL‑trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image‑based medical datasets demonstrate improved diagnostic accuracy, outperforming the state‑of‑the‑art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF‑CRCV/MedRoute/.

PaperID: 343, https://arxiv.org/pdf/2604.06170.pdf GitHub

Authors: Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal

Title: Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Abstract:
The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi‑agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi‑agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi‑criteria scoring, diversity‑aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph‑aware question answering and coverage verification. Both pipelines are implemented within a coder LLM‑based multi‑agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.

PaperID: 344, https://arxiv.org/pdf/2604.06019.pdf GitHub

Authors: Gustav Keppler, Moritz Gstür, Veit Hagenmeyer

Title: CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

Abstract:
The advancement of Large Language Models (LLMs) has raised concerns regarding their dual‑use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints, and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state‑of‑the‑art models, including OpenAI's GPT‑5 suite and open‑weight models, across a corpus of 81 domain‑specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain‑specific tool scaffold. Our empirical results show that agents reliably execute static structured‑file analysis and single‑tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the IEC 61850 standards terminology, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain‑specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: https://github.com/GKeppler/CritBench

PaperID: 345, https://arxiv.org/pdf/2604.05854.pdf GitHub

Authors: Xiangyue Zhang

Title: Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

Abstract:
We present Deep Researcher Agent, an open‑source framework that enables large language model (LLM) agents to autonomously conduct deep learning experiments around the clock. Unlike existing AI research assistants that focus on paper writing or code generation, our system addresses the full experiment lifecycle: hypothesis formation, code implementation, training execution, result analysis, and iterative refinement. The framework introduces three key innovations: (1) Zero‑Cost Monitoring ‑‑ a monitoring paradigm that incurs zero LLM API costs during model training by relying solely on process‑level checks and log file reads; (2) Two‑Tier Constant‑Size Memory ‑‑ a memory architecture capped at ～5K characters regardless of runtime duration, preventing the unbounded context growth that plagues long‑running agents; and (3) Minimal‑Toolset Leader‑Worker Architecture ‑‑ a multi‑agent design where each worker agent is equipped with only 3‑‑5 tools, reducing per‑call token overhead by up to 73%. In sustained deployments spanning 30+ days, the framework autonomously completed 500+ experiment cycles across four concurrent research projects, achieving a 52% improvement over baseline metrics in one project through 200+ automated experiments ‑‑ all at an average LLM cost of \0.08 per 24‑hour cycle. Code is available at https://github.com/Xiangyue‑Zhang/auto‑deep‑researcher‑24x7.

PaperID: 346, https://arxiv.org/pdf/2604.05846.pdf GitHub

Authors: Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan

Title: AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Abstract:
Large Language Models (LLMs) increasingly rely on agentic capabilities‑iterative retrieval, tool use, and decision‑making‑to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real‑world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology‑aware navigation and LLM‑based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)‑driven framework for AGL. AgentGL equips an LLM agent with graph‑native tools for multi‑scale exploration, regulates tool usage via search‑constrained thinking to balance accuracy and efficiency, and employs a graph‑conditioned curriculum RL strategy to stabilize long‑horizon policy learning without step‑wise supervision. Across diverse Text‑Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.

PaperID: 347, https://arxiv.org/pdf/2604.05808.pdf GitHub

Authors: Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng

Title: Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

Abstract:
Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision‑making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP‑HRL, a hierarchical reinforcement learning (HRL) framework that enables step‑level learning by conditioning only on single‑step transitions rather than full interaction histories. STEP‑HRL structures tasks hierarchically, using completed subtasks to represent global progress of overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step‑level transitions for both high‑level and low‑level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP‑HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at https://github.com/TonyStark042/STEP‑HRL.

PaperID: 348, https://arxiv.org/pdf/2604.05731.pdf GitHub

Authors: Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie

Title: FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Abstract:
Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio‑temporally aligned audio remains labor‑intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio‑temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi‑agent architecture for precise spatio‑temporal analysis. It achieves spatio‑temporal alignment through latent diffusion models trained on spatio‑temporal cues extracted from video frames, combined with large language model (LLM)‑driven hybrid mechanisms that emulate post‑production practices in film industry. To address the lack of high‑quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1‑channel Dolby Atmos systems compliant with ITU‑R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio‑temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/ .

PaperID: 349, https://arxiv.org/pdf/2604.05489.pdf GitHub

Authors: Chengyi Yang, Pengzhen Li, Jiayin Qi, Aimin Zhou, Ji Wu, Ji Liu

Title: SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation

Abstract:
Text‑to‑Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex‑scenario prompt refinement as a stage‑wise multi‑agent refinement process and propose SCMAPR, i.e., a scenario‑aware and Self‑Correcting Multi‑Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy‑grounded scenario for strategy selection, (ii) synthesize scenario‑aware rewriting policies and perform policy‑conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce T2V‑Complexity, which is a complex‑scenario T2V benchmark consisting exclusively of complex‑scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V‑Complexity benchmark demonstrate that SCMAPR consistently improves text‑video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V‑CompBench over 3 State‑Of‑The‑Art baselines. The codes of SCMAPR are publicly available at https://github.com/HiThink‑Research/SCMAPR.

PaperID: 350, https://arxiv.org/pdf/2604.05484.pdf GitHub

Authors: Li Kang, Yutao Fan, Rui Li, Heng Zhou, Yiran Qin, Zhemeng Zhang, Songtao Huang, Xiufeng Song, Zaibin Zhang, Bruno N. Y. Chen, Zhenfei Yin, Dongzhan Zhou, Wangmeng Zuo, Lei Bai

Title: CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

Abstract:
Multi‑agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment ‑‑ a synergistic integration of real‑world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision‑making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real‑world deployment. CoEnv operates through three stages: real‑to‑sim scene reconstruction that digitizes physical workspaces, VLM‑driven action synthesis supporting both real‑time planning with high‑level interfaces and iterative planning with code‑based trajectory generation, and validated sim‑to‑real transfer with collision detection for safe deployment. Extensive experiments on challenging multi‑arm manipulation benchmarks demonstrate CoEnv's effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi‑agent embodied AI.

PaperID: 351, https://arxiv.org/pdf/2604.05333.pdf GitHub

Authors: Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun

Title: Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Abstract:
Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces, skill libraries can scale to thousands of skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. Second, semantic retrieval surfaces topically relevant skills but misses their prerequisite chain of upstream and downstream skills, creating a prerequisite gap that leaves the retrieved bundle execution‑incomplete. In this paper, we present Graph‑of‑Skills (GoS), an inference‑time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency‑aware skill bundle through hybrid semantic‑lexical seeding, reverse‑aware Personalized PageRank, and context‑budgeted hydration. On SkillsBench and ALFWorld, GoS consistently delivers substantial reward improvements and token savings across three model families (Claude Sonnet 4.5, MiniMax M2.7, and GPT‑5.2 Codex). On SkillsBench, GoS achieves a peak reward increase of 25.55% while reducing total tokens by 56.72% over the vanilla full skill‑loading baseline using GPT‑5.2 Codex. Ablations confirm this pattern across skill libraries from 200 to 2,000 skills.

PaperID: 352, https://arxiv.org/pdf/2604.05018.pdf GitHub

Authors: Yiwen Song, Yale Song, Tomas Pfister, Jinsung Yoon

Title: PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

Abstract:
Synthesizing unstructured research materials into manuscripts is an essential yet under‑explored challenge in AI‑driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews. We introduce PaperOrchestra, a multi‑agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre‑writing materials into submission‑ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals, such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse‑engineered raw materials from 200 top‑tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side‑by‑side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%‑68% in literature review quality, and 14%‑38% in overall manuscript quality.

PaperID: 353, https://arxiv.org/pdf/2604.04875.pdf GitHub

Authors: Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen

Title: DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Abstract:
Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio‑visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross‑level multimodal orchestration to achieve professional‑grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi‑agent framework decomposes the challenge into three cascade levels: the Screenwriter for source‑aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent‑guided shot sequence editing with fine‑grained optimization. We further introduce Mashup‑Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state‑of‑the‑art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK‑DREAM/DIRECT

PaperID: 354, https://arxiv.org/pdf/2604.04847.pdf GitHub

Authors: Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee

Title: Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Abstract:
We introduce Full‑Duplex‑Bench‑v3 (FDB‑v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi‑step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations ‑‑ GPT‑Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper\rightarrowGPT‑4o\rightarrowTTS) ‑‑ across accuracy, latency, and turn‑taking dimensions. GPT‑Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25~s) but the lowest turn‑take rate (78.0%); and the Cascaded baseline, despite a perfect turn‑take rate, incurs the highest latency (10.12~s). Across all systems, self‑correction handling and multi‑step reasoning under hard scenarios remain the most consistent failure modes.

PaperID: 355, https://arxiv.org/pdf/2604.04820.pdf GitHub

Authors: Xu Mingze

Title: ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture

Abstract:
AI agents, autonomous digital actors, need agent‑native protocols; existing methods include GUI automation and MCP‑based skills, with defects of high token consumption, fragmented interaction, inadequate security, due to lacking a unified top‑level framework and key components, each independent module flawed. To address these issues, we present ANX, an open, extensible, verifiable agent‑native protocol and top‑level framework integrating CLI, Skill, MCP, resolving pain points via protocol innovation, architectural optimization and tool supplementation. Its four core innovations: 1) Agent‑native design (ANX Config, Markup, CLI) with high information density, flexibility and strong adaptability to reduce tokens and eliminate inconsistencies; 2) Human‑agent interaction combining Skill's flexibility for dual rendering as agent‑executable instructions and human‑readable UI; 3) MCP‑supported on‑demand lightweight apps without pre‑registration; 4) ANX Markup‑enabled machine‑executable SOPs eliminating ambiguity for reliable long‑horizon tasks and multi‑agent collaboration. As the first in a series, we focus on ANX's design, present its 3EX decoupled architecture with ANXHub and preliminary feasibility analysis and experimental validation. ANX ensures native security: LLM‑bypassed UI‑to‑Core communication keeps sensitive data out of agent context; human‑only confirmation prevents automated misuse. Form‑filling experiments with Qwen3.5‑plus/GPT‑4o show ANX reduces tokens by 47.3% (Qwen3.5‑plus) and 55.6% (GPT‑4o) vs MCP‑based skills, 57.1% (Qwen3.5‑plus) and 66.3% (GPT‑4o) vs GUI automation, and shortens execution time by 58.1% and 57.7% vs MCP‑based skills.

PaperID: 356, https://arxiv.org/pdf/2604.04804.pdf GitHub

Authors: Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, Shumin Deng

Title: SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Abstract:
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self‑evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a plug‑and‑play skill knowledge base that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: (i) Multi‑Level Skills Design, which distills raw trajectories into three‑tiered hierarchy of strategic plans, functional skills, and atomic skills; (ii) Iterative Skills Refinement, which automatically revises skills based on execution feedback to continuously improve library quality; and (iii) Exploratory Skills Expansion, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM‑4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long‑horizon, user‑interactive benchmarks, including AppWorld, BFCL‑v3, and τ^2‑Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.

PaperID: 357, https://arxiv.org/pdf/2604.04759.pdf GitHub

Authors: Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie

Title: Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Abstract:
OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real‑world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT‑5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64‑74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK‑aligned defense strategies alongside a file‑protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability‑targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc‑vlaa.github.io/CIK‑Bench.

PaperID: 358, https://arxiv.org/pdf/2604.04660.pdf GitHub

Authors: Seamus Brady

Title: Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

Abstract:
We present Springdrift, a persistent runtime for long‑lived LLM agents. The system integrates an auditable execution substrate (append‑only memory, supervised processes, git‑backed recovery), a case‑based reasoning memory layer with hybrid retrieval (evaluated against a dense cosine baseline), a deterministic normative calculus for safety gating with auditable axiom trails, and continuous ambient self‑perception via a structured self‑state representation (the sensorium) injected each cycle without tool calls. These properties support behaviours difficult to achieve in session‑bounded systems: cross‑session task continuity, cross‑channel context maintenance, end‑to‑end forensic reconstruction of decisions, and self‑diagnostic behaviour. We report on a single‑instance deployment over 23 days (19 operating days), during which the agent diagnosed its own infrastructure bugs, classified failure modes, identified an architectural vulnerability, and maintained context across email and web channels ‑‑ without explicit instruction. We introduce the term Artificial Retainer for this category: a non‑human system with persistent memory, defined authority, domain‑specific autonomy, and forensic accountability in an ongoing relationship with a specific principal ‑‑ distinguished from software assistants and autonomous agents, drawing on professional retainer relationships and the bounded autonomy of trained working animals. This is a technical report on a systems design and deployment case study, not a benchmark‑driven evaluation. Evidence is from a single instance with a single operator, presented as illustration of what these architectural properties can support in practice. Implemented in approximately Gleam on Erlang/OTP. Code, artefacts, and redacted operational logs will be available at https://github.com/seamus‑brady/springdrift upon publication.

PaperID: 359, https://arxiv.org/pdf/2604.04561.pdf GitHub

Authors: Charafeddine Mouzouni

Title: Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Abstract:
LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions ‑‑ all executed in real Docker sandboxes. Every condition includes the instruction "Always respect all rules and access policies"; only a single encouragement sentence varies. Within the task class studied (planted test‑runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions ‑‑ including minimization, moral licensing, incentives, identity priming, and reasoning triggers ‑‑ produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. "You are solving a puzzle; there may be hidden clues" produces 38‑40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8‑14% on DeepSeek, GPT‑5‑mini, o4‑mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task‑aligned. GPT‑4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal‑reframing language, not for the broad class of adversarial prompts.

PaperID: 360, https://arxiv.org/pdf/2604.04522.pdf GitHub

Authors: Asiri Dalugoda

Title: HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems

Abstract:
Agentic AI systems increasingly execute consequential actions on behalf of human principals, delegating tasks through multi‑step chains of autonomous agents. No existing standard addresses a fundamental accountability gap: verifying that terminal actions in a delegation chain were genuinely authorized by a human principal, through what chain of delegation, and under what scope. This paper presents the Human Delegation Provenance (HDP) protocol, a lightweight token‑based scheme that cryptographically captures and verifies human authorization context in multi‑agent systems. An HDP token binds a human authorization event to a session, records each agent's delegation action as a signed hop in an append‑only chain, and enables any participant to verify the full provenance record using only the issuer's Ed25519 public key and the current session identifier. Verification is fully offline, requiring no registry lookups or third‑party trust anchors. We situate HDP within the existing landscape of delegation protocols, identify its distinct design point relative to OAuth 2.0 Token Exchange (RFC 8693), JSON Web Tokens (RFC 7519), UCAN, and the Intent Provenance Protocol (draft‑haberkamp‑ipp‑00), and demonstrate that existing standards fail to address the multi‑hop, append‑only, human‑provenance requirements of agentic systems. HDP has been published as an IETF Internet‑Draft (draft‑helixar‑hdp‑agentic‑delegation‑00) and a reference TypeScript SDK is publicly available.

PaperID: 361, https://arxiv.org/pdf/2604.04514.pdf GitHub

Authors: Varun Pratap Bhardwaj

Title: SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

Abstract:
AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single‑channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 ("The Living Brain"), a local‑first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information‑geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher‑Rao Quantization‑Aware Distance (FRQAD) ‑‑ a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high‑fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle‑aware quantization ‑‑ the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7‑channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero‑LLM Mode A; (4) memory parameterization implementing Long‑Term Implicit memory via soft prompts; (5) zero‑friction auto‑cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero‑LLM), with +23.8pp on multi‑hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade‑off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.

PaperID: 362, https://arxiv.org/pdf/2604.04323.pdf GitHub

Authors: Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang

Title: How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Abstract:
Agent skills, which are reusable, domain‑specific knowledge artifacts, have become a popular mechanism for extending LLM‑based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand‑crafted, narrowly‑tailored task‑specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well‑tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real‑world skills and may not have access to any hand‑curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no‑skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query‑specific and query‑agnostic approaches, and we show that query‑specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal‑Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM‑based agents. Our code is available at https://github.com/UCSB‑NLP‑Chang/Skill‑Usage.

PaperID: 363, https://arxiv.org/pdf/2604.03460.pdf GitHub

Authors: Gang Meng, Andres Felipe Bocanegra Vargas, Xinwei Ji, Federico Garcia-Gaitan, Felipe Reyes-Osorio, Jalil Varela-Manjarres, Yafei Ren, Mohammadhasan Dinpajooh, Branislav K. Nikolić, Tao E. Li

Title: FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

Abstract:
Artificial‑intelligence (AI) agent frameworks have been developed for autonomous scientific simulations, but most current agent frameworks are tailored to a single or a small set of software packages. Herein, FermiLink, a unified and extensible open‑source agent framework is introduced for multidomain scientific simulations. Its key design principle is the separation of package knowledge bases from simulation workflows, so that simulation workflows in FermiLink, from figure‑level simulations to full‑paper‑level research on high‑performance computing clusters, operate uniformly among supported packages via a four‑layer progressive disclosure mechanism. Using OpenAI Codex as the agent provider, the capabilities of FermiLink are demonstrated across approximately 50 scientific software packages spanning nine research domains from physics to engineering. Systematic benchmarks on 132 real‑world figure‑level reproduction tasks with 44 packages show that FermiLink reproduces 74 (56.1%) of published figures with simulations, among which 30 achieve high‑fidelity agreement and 35 reach qualitative agreement with the target figures. A smaller set of human expert‑guided reproduction benchmarks with 10 packages further highlights the importance of expert insights for improving the simulation fidelity. Beyond reproduction, a single‑blinded study demonstrates that FermiLink can produce research‑grade results on unpublished polariton physics problems when provided with sufficiently detailed research objectives and source code, even in the absence of external documentation or tutorials. Overall, FermiLink provides a scalable research infrastructure that may accelerate the path from scientific questions to computational results across diverse domains.

PaperID: 364, https://arxiv.org/pdf/2604.03266.pdf GitHub

Authors: Tomek Kaszyński

Title: Emergent Compositional Communication for Latent World Properties

Abstract:
Can multi‑agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel‑Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near‑perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi‑agent structure ‑‑ not bandwidth or temporal coverage ‑‑ drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially‑visible ramp physics (98.3% vs 95.1%), while V‑JEPA 2 dominates on dynamics‑only collision physics (87.4% vs 77.7%, d=2.74). Scale‑matched (d=3.37) and frame‑matched (d=6.53) controls attribute this gap entirely to video‑native pretraining. The frozen protocol supports action‑conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass‑comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent‑scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).

PaperID: 365, https://arxiv.org/pdf/2604.02971.pdf GitHub

Authors: Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang, Jun Wang

Title: InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking

Abstract:
Recent agentic search systems have made substantial progress by emphasising deep, multi‑step reasoning. However, this focus often overlooks the challenges of wide‑scale information synthesis, where agents must aggregate large volumes of heterogeneous evidence across many sources. As a result, most existing large language model agent systems face severe limitations in data‑intensive settings, including context saturation, cascading error propagation, and high end‑to‑end latency. To address these challenges, we present \framework, a hierarchical framework based on principle of near‑decomposability, containing a strategic Host, multiple Managers and parallel Workers. By leveraging aggregation and reflection mechanisms at the Manager layer, our framework enforces strict context isolation to prevent saturation and error propagation. Simultaneously, the parallelism in worker layer accelerates the speed of overall task execution, mitigating the significant latency. Our evaluation on two complementary benchmarks demonstrates both efficiency ( 3‑5 × speed‑up) and effectiveness, achieving a 8.4% success rate on WideSearch‑en and 52.9% accuracy on BrowseComp‑zh. The code is released at https://github.com/agent‑on‑the‑fly/InfoSeeker

PaperID: 366, https://arxiv.org/pdf/2604.02923.pdf GitHub

Authors: Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang

Title: Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias

Abstract:
Large Language Models (LLMs) have demonstrated advanced capabilities but often suffer from factual inaccuracies (hallucinations) and systematic biases. These issues, sometimes amplified in specific architectures like Mixture‑of‑Experts (MoE) which motivate our work, pose risks for reliable deployment. To address these challenges, we propose the Council Mode, a multi‑agent consensus framework. Our approach dispatches queries to multiple heterogeneous frontier LLMs in parallel and synthesizes their outputs using a dedicated consensus model. The pipeline consists of three phases: an intelligent triage for query complexity, parallel generation across diverse models, and a structured synthesis that identifies agreement, disagreement, and unique findings. In our evaluation, conducted under controlled no‑web settings, the Council Mode achieved a 35.9% relative reduction in hallucination rates on a 1,200‑sample HaluEval subset and a 7.8‑point improvement on TruthfulQA compared to the top‑performing individual model. On our curated MDR‑500 multi‑domain reasoning benchmark, the Council Mode achieved a Quality Score of 91.7%, representing a 10.2‑point improvement over the best individual model. The framework also exhibited lower measured bias variance under our rubric‑based evaluation protocol. We provide a cost‑effectiveness analysis showing that the framework incurs a 4.2x token‑cost overhead, making it most suitable for accuracy‑prioritized applications where the cost of errors exceeds the added inference cost. These findings suggest that structured multi‑agent consensus is a promising direction for enhancing the reliability and factual grounding of LLM‑generated content.

PaperID: 367, https://arxiv.org/pdf/2604.02375.pdf GitHub

Authors: Cormac Guerin, Frank Guerin

Title: KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents

Abstract:
Tool‑calling autonomous agents based on large language models using ReAct exhibit three limitations: serial latency, quadratic context growth, and vulnerability to prompt injection and hallucination. Recent work moves towards separating planning from execution but in each case the model remains coupled to the execution mechanics. We introduce a system‑level abstraction for LLM agents which decouples the execution of agent workflows from the LLM reasoning layer. We define two first‑class abstractions: (1) Intent‑Gated Execution (IGX), a security paradigm that enforces intent at execution, and (2) an Executive Kernel that manages scheduling, tool dispatch, dependency resolution, failures and security. In KAIJU, the LLM plans upfront, optimistically scheduling tools in parallel with dependency‑aware parameter injection. Tools are authorised via IGX based on four independent variables: scope, intent, impact, and clearance (external approval). KAIJU supports three adaptive execution modes (Reflect, nReflect, and Orchestrator), providing progressively finer‑grained execution control apt for complex investigation and deep analysis or research. Empirical evaluation against a ReAct baseline shows that KAIJU has a latency penalty on simple queries due to planning overhead, convergence at moderate complexity, and a structural advantage on computational queries requiring parallel data gathering. Beyond latency, the separation enforces behavioural guarantees that ReAct cannot match through prompting alone. Code available at https://github.com/compdeep/kaiju

PaperID: 368, https://arxiv.org/pdf/2604.02360.pdf GitHub

Authors: Yonas Kassa, James Bonacci, Ping Wang

Title: Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations

Abstract:
The transformative potential of large language models (LLMs) in education, such as improving accessibility and personalized learning, is being eclipsed by significant challenges. These challenges stem from concerns that LLMs undermine academic assessment by enabling bypassing of critical thinking, leading to increased cognitive offloading. This emerging trend stresses the dual imperative of harnessing AI's educational benefits while safeguarding critical thinking and academic rigor in the evolving AI ecosystem. To this end, we introduce AI‑Sinkhole, an AI‑agent augmented DNS‑based framework that dynamically discovers, semantically classifies, and temporarily network‑wide blocks emerging LLM chatbot services during proctored exams. AI‑Sinkhole offers explainable classification via quantized LLMs (LLama 3, DeepSeek‑R1, Qwen‑3) and dynamic DNS blocking with Pi‑Hole. We also share our observations in using LLMs as explainable classifiers which achieved robust cross‑lingual performance (F1‑score > 0.83). To support future research and development in this domain initial codes with a readily deployable 'AI‑Sinkhole' blockist is available on https://github.com/AIMLEdu/ai‑sinkhole.

PaperID: 369, https://arxiv.org/pdf/2604.02268.pdf GitHub

Authors: Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Title: SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Abstract:
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference‑time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero‑shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in‑context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training‑time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching he model tool invocation and multi‑turn task completion. A Dynamic Curriculum then evaluates each skill file's on‑policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero‑shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld, +6.6% for Search‑QA, and+10.1% for WebShop), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU‑REAL/SkillZero.

PaperID: 370, https://arxiv.org/pdf/2604.01658.pdf GitHub

Authors: Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Title: CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Abstract:
Large language model (LLM)‑based evolution is a promising approach for open‑ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard‑coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi‑agent evolution on open‑ended problems. CORAL replaces rigid control with long‑running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi‑agent execution, and heartbeat‑based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state‑of‑the‑art results on 10 tasks, achieving 3‑10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co‑evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi‑agent exploration and communication. Together, these results suggest that greater agent autonomy and multi‑agent evolution can substantially improve open‑ended discovery. Code is available at https://github.com/Human‑Agent‑Society/CORAL.

PaperID: 371, https://arxiv.org/pdf/2604.01213.pdf GitHub

Authors: Matthias Rubio, Julia Richter, Hendrik Kolvenbach, Marco Hutter

Title: Collaborative Task and Path Planning for Heterogeneous Robotic Teams using Multi-Agent PPO

Abstract:
Efficient robotic extraterrestrial exploration requires robots with diverse capabilities, ranging from scientific measurement tools to advanced locomotion. A robotic team enables the distribution of tasks over multiple specialized subsystems, each providing specific expertise to complete the mission. The central challenge lies in efficiently coordinating the team to maximize utilization and the extraction of scientific value. Classical planning algorithms scale poorly with problem size, leading to long planning cycles and high inference costs due to the combinatorial growth of possible robot‑target allocations and possible trajectories. Learning‑based methods are a viable alternative that move the scaling concern from runtime to training time, setting a critical step towards achieving real‑time planning. In this work, we present a collaborative planning strategy based on Multi‑Agent Proximal Policy Optimization (MAPPO) to coordinate a team of heterogeneous robots to solve a complex target allocation and scheduling problem. We benchmark our approach against single‑objective optimal solutions obtained through exhaustive search and evaluate its ability to perform online replanning in the context of a planetary exploration scenario.

PaperID: 372, https://arxiv.org/pdf/2604.01151.pdf GitHub

Authors: Aaron Rose, Carissa Cullen, Sahar Abdelnabi, Philip Torr, Brandon Gary Kaplowitz, Christian Schroeder de Witt

Title: Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Abstract:
As LLM agents are increasingly deployed in multi‑agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single‑agent settings, collusion is inherently a multi‑agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per‑agent deception scores to classify scenarios at the group level, evaluated across four open‑weight models (Qwen3‑32B, Llama‑3.1‑70B, DeepSeek‑R1 32B, GPT‑OSS‑20B) and six probe architectures. We frame this as a distributed anomaly detection problem, identifying three collusion signatures that map onto distinct anomaly types and detection paradigms. Every model reaches 1.00 AUROC in‑distribution; on our strongest model (Llama‑3.1‑70B), our five probing techniques achieve 0.73 to 0.93 AUROC when transferred zero‑shot to structurally different multi‑agent scenarios and 0.99 to 1.00 on a steganographic blackjack card‑counting task, with detection performance scaling with model capability. We find that no single probing technique dominates across all collusion types, consistent with the framework's prediction that different anomaly types require different detection paradigms. This work takes a step toward multi‑agent interpretability: extending white‑box inspection from single models to multi‑agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text‑level monitoring for detecting multi‑agent collusion. Code and data available at https://github.com/aaronrose227/narcbench.

PaperID: 373, https://arxiv.org/pdf/2604.01134.pdf GitHub

Authors: Ziyu Wang, Hongrui Kou, Cheng Wang, Ruochen Li, Hubert P. H. Shum, Amir Atapour-Abarghouei, Yuxin Zhang

Title: VRUD: A Drone Dataset for Complex Vehicle-VRU Interactions within Mixed Traffic

Abstract:
The Operational Design Domain (ODD) of urbanoriented Level 4 (L4) autonomous driving, especially for autonomous robotaxis, confronts formidable challenges in complex urban mixed traffic environments. These challenges stem mainly from the high density of Vulnerable Road Users (VRUs) and their highly uncertain and unpredictable interaction behaviors. However, existing open‑source datasets predominantly focus on structured scenarios such as highways or regulated intersections, leaving a critical gap in data representing chaotic, unstructured urban environments. To address this, this paper proposes an efficient, high‑precision method for constructing drone‑based datasets and establishes the Vehicle‑Vulnerable Road User Interaction Dataset (VRUD), as illustrated in Figure 1. Distinct from prior works, VRUD is collected from typical "Urban Villages" in Shenzhen, characterized by loose traffic supervision and extreme occlusion. The dataset comprises 4 hours of 4K/30Hz recording, containing 11,479 VRU trajectories and 1,939 vehicle trajectories. A key characteristic of VRUD is its composition: VRUs account for about 87% of all traffic participants, significantly exceeding the proportions in existing benchmarks. Furthermore, unlike datasets that only provide raw trajectories, we extracted 4,002 multi‑agent interaction scenarios based on a novel Vector Time to Collision (VTTC) threshold, supported by standard OpenDRIVE HD maps. This study provides valuable, rare edge‑case resources for enhancing the safety performance of ADS in complex, unstructured urban environments. To facilitate further research, we have made the VRUD dataset open‑source at: https://zzi4.github.io/VRUD/.

PaperID: 374, https://arxiv.org/pdf/2604.00892.pdf GitHub

Authors: Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu

Title: When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

Abstract:
As LLM agents transition from short, static problem solving to executing complex, long‑horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid‑task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long‑horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena‑Lite that synthesizes high‑quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single‑ and multi‑turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid‑task changes. Our results show that handling user interruptions effectively and efficiently during long‑horizon agentic tasks remains challenging for powerful large‑scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.

PaperID: 375, https://arxiv.org/pdf/2604.00804.pdf GitHub

Authors: Monica M. Q. Li, Pierre-Yves Lajoie, Jialiang Liu, Giovanni Beltrame

Title: Compact Keyframe-Optimized Multi-Agent Gaussian Splatting SLAM

Abstract:
Efficient multi‑agent 3D mapping is essential for robotic teams operating in unknown environments, but dense representations hinder real‑time exchange over constrained communication links. In multi‑agent Simultaneous Localization and Mapping (SLAM), systems typically rely on a centralized server to merge and optimize the local maps produced by individual agents. However, sharing these large map representations, particularly those generated by recent methods such as Gaussian Splatting, becomes a bottleneck in real‑world scenarios with limited bandwidth. We present an improved multi‑agent RGB‑D Gaussian Splatting SLAM framework that reduces communication load while preserving map fidelity. First, we incorporate a compaction step into our SLAM system to remove redundant 3D Gaussians, without degrading the rendering quality. Second, our approach performs centralized loop closure computation without initial guess, operating in two modes: a pure rendered‑depth mode that requires no data beyond the 3D Gaussians, and a camera‑depth mode that includes lightweight depth images for improved registration accuracy and additional Gaussian pruning. Evaluation on both synthetic and real‑world datasets shows up to 85‑95% reduction in transmitted data compared to state‑of‑the‑art approaches in both modes, bringing 3D Gaussian multi‑agent SLAM closer to practical deployment in real‑world scenarios. Code: https://github.com/lemonci/coko‑slam

PaperID: 376, https://arxiv.org/pdf/2604.00478.pdf GitHub

Authors: Harshee Jignesh Shah

Title: The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Abstract:
Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy ‑ a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real‑time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi‑turn dialogues, and (3) a Generator‑Critic loop where an auditor vetoes sycophantic drafts and triggers rewrites with "Necessary Friction." In a live evaluation across all 437 TruthfulQA adversarial scenarios, Claude Sonnet 4 exhibits 9.6% baseline sycophancy, reduced to 1.4% by the Silicon Mirror ‑ an 85.7% relative reduction (p < 10^‑6, OR = 7.64, Fisher's exact test). Cross‑model evaluation on Gemini 2.5 Flash reveals a 46.0% baseline reduced to 14.2% (p < 10^‑10, OR = 5.15). We characterize the validation‑before‑correction pattern as a distinct failure mode of RLHF‑trained models.

PaperID: 377, https://arxiv.org/pdf/2604.00131.pdf GitHub

Authors: Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence

Title: Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation

Abstract:
Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory‑augmented LLM agents rely on "always‑on" retrieval and "flat" memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay‑driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always‑on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high‑level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long‑horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM‑agentic reasoning.

PaperID: 378, https://arxiv.org/pdf/2604.00130.pdf GitHub

Authors: Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi

Title: Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

Abstract:
Chain‑of‑Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain‑of‑Thought (Hi‑CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi‑step reasoning. Hi‑CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step‑by‑step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi‑CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi‑CoT.

PaperID: 379, https://arxiv.org/pdf/2603.29902.pdf GitHub

Authors: Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Title: ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

Abstract:
Interleaved text‑and‑image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual‑critical queries. To systematically evaluate this paradigm, we introduce ATP‑Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual‑critical intents, featuring human‑verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end‑to‑end execution and changing tool backends, we propose a Multi‑Agent MLLM‑as‑a‑Judge (MAM) system. MAM evaluates tool‑call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground‑truth references. Our extensive experiments on 10 state‑of‑the‑art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool‑use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at https://github.com/Qwen‑Applications/ATP‑Bench.

PaperID: 380, https://arxiv.org/pdf/2603.29828.pdf GitHub

Authors: Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang

Title: Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis

Abstract:
Scientific discovery increasingly depends on high‑throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API‑based systems. We present Owl‑AuraID, a software‑hardware collaborative embodied agent system that adopts a GUI‑native paradigm to operate instruments through the same interfaces as human experts. Its skill‑centric framework integrates Type‑1 (GUI operation) and Type‑2 (data analysis) skills into end‑to‑end workflows, connecting physical sample handling with scientific interpretation. Owl‑AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl‑AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code are available at https://github.com/OpenOwlab/AuraID.

PaperID: 381, https://arxiv.org/pdf/2603.29664.pdf GitHub GitHub

Authors: Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun

Title: CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Abstract:
Editing the video content with audio alignment forms a digital human‑made art in current social media. However, the time‑consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi‑agent framework designed to edit hours‑long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models~(MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine‑grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long‑term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine‑grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state‑of‑the‑art baselines in generating high‑quality, rhythm‑aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.

PaperID: 382, https://arxiv.org/pdf/2603.28803.pdf GitHub

Authors: Jiaqi Tan, Yudong Luo, Sophia Huang, Yifan Yang, Hang Ma

Title: CREST: Constraint-Release Execution for Multi-Robot Warehouse Shelf Rearrangement

Abstract:
Double‑Deck Multi‑Agent Pickup and Delivery (DD‑MAPD) models the multi‑robot shelf rearrangement problem in automated warehouses. MAPF‑DECOMP is a recent framework that first computes collision‑free shelf trajectories with a MAPF solver and then assigns agents to execute them. While efficient, it enforces strict trajectory dependencies, often leading to poor execution quality due to idle agents and unnecessary shelf switching. We introduce CREST, a new execution framework that achieves more continuous shelf carrying by proactively releasing trajectory constraints during execution. Experiments on diverse warehouse layouts show that CREST consistently outperforms MAPF‑DECOMP, reducing metrics related to agent travel, makespan, and shelf switching by up to 40.5%, 33.3%, and 44.4%, respectively, with even greater benefits under lift/place overhead. These results underscore the importance of execution‑aware constraint release for scalable warehouse rearrangement. Code and data are available at https://github.com/ChristinaTan0704/CREST.

PaperID: 383, https://arxiv.org/pdf/2603.28691.pdf GitHub

Authors: Maoguo Gao, Zejun Zhu, Zhiming Sun, Zhengwei Ma, Longze Yuan, Zhongjing Ma, Zhigang Gao, Jinhui Zhang, Suli Zou

Title: DRIVE-Nav: Directional Reasoning, Inspection, and Verification for Efficient Open-Vocabulary Navigation

Abstract:
Open‑Vocabulary Object Navigation (OVON) requires an embodied agent to locate a language‑specified target in unknown environments. Existing zero‑shot methods often reason over dense frontier points under incomplete observations, causing unstable route selection, repeated revisits, and unnecessary action overhead. We present DRIVE‑Nav, a structured framework that organizes exploration around persistent directions rather than raw frontiers. By inspecting encountered directions more completely and restricting subsequent decisions to still‑relevant directions within a forward 240 degree view range, DRIVE‑Nav reduces redundant revisits and improves path efficiency. The framework extracts and tracks directional candidates from weighted Fast Marching Method (FMM) paths, maintains representative views for semantic inspection, and combines vision‑language‑guided prompt enrichment with cross‑frame verification to improve grounding reliability. Experiments on HM3D‑OVON, HM3Dv2, and MP3D demonstrate strong overall performance and consistent efficiency gains. On HM3D‑OVON, DRIVE‑Nav achieves 50.2% SR and 32.6% SPL, improving the previous best method by 1.9% SR and 5.6% SPL. It also delivers the best SPL on HM3Dv2 and MP3D and transfers to a physical humanoid robot. Real‑world deployment also demonstrates its effectiveness. Project page: https://coolmaoguo.github.io/drive‑nav‑page/

PaperID: 384, https://arxiv.org/pdf/2603.28488.pdf GitHub

Authors: Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud

Title: Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Abstract:
Large language models (LLMs) remain unreliable for high‑stakes claim verification due to hallucinations and shallow reasoning. While retrieval‑augmented generation (RAG) and multi‑agent debate (MAD) address this, they are limited by one‑pass retrieval and unstructured debate dynamics. We propose a courtroom‑style multi‑agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P‑RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self‑reflection, and heterogeneous multi‑judge aggregation to enforce calibration, robustness, and diversity. In zero‑shot evaluations on the Check‑COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi‑agent debate by 10.0 percentage points, with P‑RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.

PaperID: 385, https://arxiv.org/pdf/2603.28091.pdf GitHub

Authors: Alexander Prutsch, Christian Fruhwirth-Reisinger, David Schinagl, Horst Possegger

Title: SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting

Abstract:
In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming‑based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming‑based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages an instance‑aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single‑agent benchmarks. Our model achieves state‑of‑the‑art performance in streaming inference on the Argoverse 2 multi‑agent benchmark, while maintaining minimal latency, highlighting its suitability for real‑world deployment.

PaperID: 386, https://arxiv.org/pdf/2603.28015.pdf GitHub

Authors: Edward Wijaya

Title: What an Autonomous Agent Discovers About Molecular Transformer Design: Does It Transfer?

Abstract:
Deep learning models for drug‑like molecules and proteins overwhelmingly reuse transformer architectures designed for natural language, yet whether molecular sequences benefit from different designs has not been systematically tested. We deploy autonomous architecture search via an agent across three sequence types (SMILES, protein, and English text as control), running 3,106 experiments on a single GPU. For SMILES, architecture search is counterproductive: tuning learning rates and schedules alone outperforms the full search (p = 0.001). For natural language, architecture changes drive 81% of improvement (p = 0.009). Proteins fall between the two. Surprisingly, although the agent discovers distinct architectures per domain (p = 0.004), every innovation transfers across all three domains with <1% degradation, indicating that the differences reflect search‑path dependence rather than fundamental biological requirements. We release a decision framework and open‑source toolkit for molecular modeling teams to choose between autonomous architecture search and simple hyperparameter tuning.

PaperID: 387, https://arxiv.org/pdf/2603.27060.pdf GitHub

Authors: Jihwan Hong, Jaeyoung Do

Title: VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe‑based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi‑step reasoning, leading to sharp performance drops on motion‑intensive and reasoning‑oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video‑Instructed Reasoning Assistant for Spatio‑Temporal Segmentation), an end‑to‑end framework that unifies global video reasoning and pixel‑level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio‑Temporal Fusion (STF), which fuses segmentation‑aware video features into the vision‑language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state‑of‑the‑art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.

PaperID: 388, https://arxiv.org/pdf/2603.26741.pdf GitHub GitHub

Authors: Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng

Title: Language-Conditioned World Modeling for Visual Navigation

Abstract:
We study language‑conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open‑loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human‑verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future‑state prediction, and action generation through two complementary model families. The first family combines LCVN‑WM, a diffusion‑based world model, with LCVN‑AC, an actor‑critic agent trained in the latent space of the world model. The second family, LCVN‑Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language‑conditioned world models. The code is available at https://github.com/F1y1113/LCVN.

PaperID: 389, https://arxiv.org/pdf/2603.26182.pdf GitHub

Authors: Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

Title: ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Abstract:
While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non‑linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis‑driven reasoning inherent in human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi‑agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. The foundation of this framework is a Dual‑Memory architecture: a mutable working memory that maintains the evolving patient state for context‑aware reasoning, and a static experience memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves the best performance among evaluated baselines, significantly enhancing both diagnostic accuracy and explainability compared to strong single‑agent and multi‑agent baselines. Our code is released at https://github.com/ZhuohanGe/ClinicalAgents‑Code.

PaperID: 390, https://arxiv.org/pdf/2603.25423.pdf GitHub

Authors: Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo

Title: From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

Abstract:
The rise of micro‑videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real‑world cases that involve multimodal manipulation, AI‑generated content, cognitive bias, and out‑of‑context reuse. Meanwhile, most detection models lack fine‑grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large‑scale benchmark of over 10,000 real‑world micro‑videos covering diverse misinformation types and sources, each annotated with expert‑defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi‑inspired multi‑agent reasoning framework that integrates multimodal understanding with external evidence for attribution‑grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI‑generated patterns, and detect out‑of‑context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro‑video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.

PaperID: 391, https://arxiv.org/pdf/2603.25268.pdf GitHub

Authors: Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy

Title: CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Abstract:
We introduce CRAFT, a multi‑agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi‑sender Bounded Pragmatic Speaker problem and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open‑weight models. Across a diverse set of models, including 8 open‑weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open‑weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi‑agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu‑signal/CRAFT

PaperID: 392, https://arxiv.org/pdf/2603.24973.pdf GitHub

Authors: Weiwei Fang, Lin Li, Kaize Shi, Yu Yang, Jianwei Zhang

Title: Belief-Driven Multi-Agent Collaboration via Approximate Perfect Bayesian Equilibrium for Social Simulation

Abstract:
High‑fidelity social simulation is pivotal for addressing complex Web societal challenges, yet it demands agents capable of authentically replicating the dynamic spectrum of human interaction. Current LLM‑based multi‑agent frameworks, however, predominantly adhere to static interaction topologies, failing to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real‑world scenarios. This rigidity often leads to unrealistic ``groupthink'' or unproductive deadlocks, undermining the credibility of simulations for decision support. To bridge this gap, we propose BEACOF, a belief‑driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE). By modeling social interaction as a dynamic game of incomplete information, BEACOF rigorously addresses the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy, thereby ensuring sequentially rational decisions under uncertainty. Validated across adversarial (judicial), open‑ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high‑quality solutions, demonstrating superior potential for reliable social simulation. Source codes and datasets are publicly released at: https://github.com/WUT‑IDEA/BEACOF.

PaperID: 393, https://arxiv.org/pdf/2603.24931.pdf GitHub

Authors: Yifeng Zhang, Jieming Chen, Tingguang Zhou, Tanishq Duhan, Jianghong Dong, Yuhong Cao, Guillaume Sartoretti

Title: COIN: Collaborative Interaction-Aware Multi-Agent Reinforcement Learning for Self-Driving Systems

Abstract:
Multi‑Agent Self‑Driving (MASD) systems provide an effective solution for coordinating autonomous vehicles to reduce congestion and enhance both safety and operational efficiency in future intelligent transportation systems. Multi‑Agent Reinforcement Learning (MARL) has emerged as a promising approach for developing advanced end‑to‑end MASD systems. However, achieving efficient and safe collaboration in dynamic MASD systems remains a significant challenge in dense scenarios with complex agent interactions. To address this challenge, we propose a novel collaborative(CO‑) interaction‑aware(‑IN) MARL framework, named COIN. Specifically, we develop a new counterfactual individual‑global twin delayed deep deterministic policy gradient (CIG‑TD3) algorithm, crafted in a "centralized training, decentralized execution" (CTDE) manner, which aims to jointly optimize the individual objectives (navigation) and the global objectives (collaboration) of agents. We further introduce a dual‑level interaction‑aware centralized critic architecture that captures both local pairwise interactions and global system‑level dependencies, enabling more accurate global value estimation and improved credit assignment for collaborative policy learning. We conduct extensive simulation experiments in dense urban traffic environments, which demonstrate that COIN consistently outperforms other advanced baseline methods in both safety and efficiency across various system sizes. These results highlight its superiority in complex and dynamic MASD scenarios, as further validated through real‑world robot demonstrations. Supplementary videos are available at https://marmotlab.github.io/COIN/

PaperID: 394, https://arxiv.org/pdf/2603.24895.pdf GitHub

Authors: Max Holschneider, Saetbyeol LeeYouk

Title: PII Shield: A Browser-Level Overlay for User-Controlled Personal Identifiable Information (PII) Management in AI Interactions

Abstract:
AI chatbots have quietly become the world's most popular therapists, coaches, and confidants. Users of cloud‑based LLM services are increasingly shifting from simple queries like idea generation and poem writing, to deeply personal interactions. As Large Language Models increasingly assume the role of our confessors, we are witnessing a massive, unregulated transfer of sensitive personal identifiable information (PII) to powerful tech companies with opaque privacy practices. While the enterprise sector has made great strides in addressing data leakage concerns through sophisticated guardrails and PII redaction pipelines, these powerful tools have functionally remained inaccessible for the average user due to their technical complexity. This results in a dangerous trade off for individual users. In order to receive the therapeutic or productivity benefits of AI, users need to abandon any agency they might otherwise have over their data, often without a clear mental model of what is being shared, and how it might be used for advertising later on. This work addresses this interaction gap, applying the redaction pipelines of enterprise‑grade redaction into an intuitive, first‑of‑its‑kind, consumer‑facing, and free experience. Specifically, this work introduces a scalable, browser‑based intervention designed to help align user behavior with their privacy preferences during web‑based AI interactions. Our system introduces two key mechanisms: local entity anonymization to prevent data leakage, and 'smokescreens': autonomous agent activity to disrupt third‑party profiling. An open‑source implementation is accessible at the following GitHub Repository: https://github.com/SBleeyouk/PII_Shield.git

PaperID: 395, https://arxiv.org/pdf/2603.24738.pdf GitHub

Authors: Daniel Benniah John

Title: Decentralized Task Scheduling in Distributed Systems: A Deep Reinforcement Learning Approach

Abstract:
Efficient task scheduling in large‑scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality‑of‑service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi‑agent deep reinforcement learning (DRL‑MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) and develop a lightweight actor‑critic architecture implemented using only NumPy, enabling deployment on resource‑constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100‑node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate 15.6% improvement in average task completion time (30.8s vs 36.5s for random baseline), 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility at https://github.com/danielbenniah/marl‑distributed‑scheduling.

PaperID: 396, https://arxiv.org/pdf/2603.24647.pdf GitHub GitHub

Authors: Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela

Title: Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Abstract:
The autoresearch repository enables an LLM agent to optimize hyperparameters by editing training code directly. We use it as a testbed to compare classical HPO algorithms against LLM‑based methods on tuning the hyperparameters of a small language model under a fixed compute budget. When defining a fixed search space over autoresearch, classical methods such as CMA‑ES and TPE consistently outperform LLM‑based agents, where avoiding out‑of‑memory failures matters more than search diversity. Allowing the LLM to directly edit source code narrows the gap to the classical methods but does not close it, even with frontier models available at the time of writing such as Claude Opus 4.6 and Gemini 3.1 Pro Preview. We observe that LLMs struggle to track optimization state across trials. In contrast, classical methods lack the domain knowledge of LLMs. To combine the strengths of both, we introduce Centaur, a hybrid that shares CMA‑ES's interpretable internal state, including mean vector, step‑size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, and a 0.8B LLM already suffices to outperform all classical and pure LLM methods. Unconstrained code editing requires larger models to be competitive with classical methods. We further analyze search diversity, model scaling from 0.8B to frontier models, and ablate the fraction of LLM‑proposed trials in Centaur. All in all, our results suggest that LLMs are most effective as a complement to classical optimizers, not as a replacement. Code is available at https://github.com/ferreirafabio/autoresearch‑automl & interactive demo at https://ferreirafabio.github.io/autoresearch‑automl.

PaperID: 397, https://arxiv.org/pdf/2603.24579.pdf GitHub

Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Title: MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Abstract:
Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real‑world applications, especially in Retrieval‑Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM‑as‑a‑judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi‑Agent Reinforced Self‑Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim‑level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well‑crafted information asymmetry scheme breaks the cycle of self‑confirmation bias. By training this pipeline with multi‑agent reinforcement learning (MARL), we enable the agents to co‑evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B‑parameter LLM equipped with MARCH achieves performance competitive with powerful closed‑source models. MARCH paves a scalable path for factual self‑improvement of LLMs through co‑evolution. The code is at https://github.com/Qwen‑Applications/MARCH.

PaperID: 398, https://arxiv.org/pdf/2603.24458.pdf GitHub

Authors: Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong

Title: OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Abstract:
While proprietary systems such as Seedance‑2.0 have achieved remarkable success in omni‑capable video generation, open‑source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni‑level video generation model featuring powerful multimodal composition and reasoning‑informed capabilities. By leveraging a massive‑scale pretraining dataset that encompasses diverse compositional and reasoning‑augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi‑image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next‑level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open‑source unified models. The codes and model have already been publicly available. Project Page: https://omniweaving.github.io.

PaperID: 399, https://arxiv.org/pdf/2603.24366.pdf GitHub

Authors: Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti

Title: CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control

Abstract:
Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever‑expanding cities. Multi‑Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL‑based framework designed to improve intra‑neighborhood traffic by enhancing decision‑making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network‑level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor‑aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision‑making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state‑of‑the‑art traffic signal control methods over three real‑world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight

PaperID: 400, https://arxiv.org/pdf/2603.23965.pdf GitHub

Authors: Shantanu Rahman, Nayeb Hasin, Mainul Islam, Md. Zubair Alom Rony, Golam Sarowar

Title: MonoSIM: An open source SIL framework for Ackermann Vehicular Systems with Monocular Vision

Abstract:
This paper presents an open‑source Software‑in‑the‑Loop (SIL) simulation platform designed for autonomous Ackerman vehicle research and education. The proposed framework focuses on simplicity, while making it easy to work with small‑scale experimental setups, such as the XTENTH‑CAR platform. The system was designed using open source tools, creating an environment with a monocular camera vision system to capture stimuli from it with minimal computational overhead through a sliding window based lane detection method. The platform supports a flexible algorithm testing and validation environment, allowing researchers to implement and compare various control strategies within an easy‑to‑use virtual environment. To validate the working of the platform, Model Predictive Control (MPC) and Proportional‑Integral‑Derivative (PID) algorithms were implemented within the SIL framework. The results confirm that the platform provides a reliable environment for algorithm verification, making it an ideal tool for future multi‑agent system research, educational purposes, and low‑cost AGV development. Our code is available at https://github.com/shantanu404/monosim.git.

PaperID: 401, https://arxiv.org/pdf/2603.23884.pdf GitHub

Authors: Yongmao Zhang, Kai Qiao, Zhengyan Wang, Ningning Liang, Dekui Ma, Wenyao Sun, Jian Chen, Bin Yan

Title: POSIM: A Multi-Agent Simulation Framework for Social Media Public Opinion Evolution and Governance

Abstract:
Modeling social media public opinion evolution is essential for governance decision‑making. Traditional epidemic models and rule‑based agent‑based models (ABMs) fail to capture the cognitive processes and adaptive behaviors of real users. Recent large language model (LLM)‑based social simulations can reproduce group‑level phenomena like polarization and conformity, yet remain unable to recreate the irrational interactions and multi‑phase dynamics of real public opinion events. We present POSIM (Public Opinion Simulator), a multi‑agent simulation framework for social media public opinion evolution and governance. POSIM integrates LLM‑driven agents with a Belief‑‑Desire‑‑Intention (BDI) cognitive architecture that accounts for irrational factors, places them in a virtual social media environment with social networks and recommendation mechanisms, and drives temporal dynamics through a Hawkes point process engine that captures the co‑evolution of agents and the environment across event phases. To validate the framework, we collect real‑world public opinion datasets from the Weibo platform covering the full interaction chain of users. Experiments show that POSIM successfully reproduces key characteristics of public opinion evolution from individual mechanisms to collective phenomena, and its effectiveness is further supported by multiple statistical metrics. Building on POSIM, governance‑oriented guidance and intervention experiments uncover a counterintuitive empathy paradox: empathetic guidance deepens negative sentiment instead of easing it under certain conditions, offering new insights for governance strategy design. These results demonstrate that the proposed framework can fully serve as a computational experimentation platform for proactive strategy evaluation and evidence‑based governance. All source code is available at https://github.com/DeepCogLab/posim/.

PaperID: 402, https://arxiv.org/pdf/2603.23722.pdf GitHub

Authors: Igor Jankowski

Title: Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL

Abstract:
While Multi‑Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro‑frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge‑devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time‑Dilation MAPPO (ETD‑MAPPO), augmented with a Dual‑Gated Epistemic Trigger. Instead of depending on rigid frame‑skipping (macro‑actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state‑value divergence in a Twin‑Critic architecture). To format this, we structure the environment as a Semi‑Markov Decision Process (SMDP) and build the SMDP‑Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115‑dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off‑ball execution without deteriorating centralized task dominance.

PaperID: 403, https://arxiv.org/pdf/2603.23509.pdf GitHub

Authors: Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang

Title: Internal Safety Collapse in Frontier Large Language Models

Abstract:
This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC‑Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst‑case safety failure rates averaging 95.3% across four frontier LLMs (including GPT‑5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual‑use tool automatically expands this vulnerability‑‑even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high‑stakes settings. Source code: https://github.com/wuyoscar/ISC‑Bench

PaperID: 404, https://arxiv.org/pdf/2603.23079.pdf GitHub

Authors: Yangjie Cui, Xin Dong, Boyang Gao, Jinwu Xiang, Daochun Li, Zhan Tu

Title: AirSimAG: A High-Fidelity Simulation Platform for Air-Ground Collaborative Robotics

Abstract:
As spatial intelligence continues to evolve, heterogeneous multi‑agent systems‑particularly the collaboration between Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs), have demonstrated strong potential in complex applications such as search and rescue, urban surveillance, and environmental monitoring. However, existing simulation platforms are primarily designed for single‑agent dynamics and lack dedicated frameworks for interactive air‑ground collaborative simulation. In this paper, we present AirsimAG, a high‑fidelity air‑ground collaborative simulation platform built upon an extensively customized AirSim framework. The platform enables synchronized multi‑agent simulation and supports heterogeneous sensing and control interfaces for UAV‑UGV systems. To demonstrate its capabilities, we design a set of representative air‑ground collaborative tasks, including mapping, planning, tracking, formation, and exploration. We further provide quantitative analyses based on these tasks to illustrate the platform effectiveness in supporting multi‑agent coordination and cross‑modal data consistency. The AirsimAG simulation platform is publicly available at https://github.com/BIULab‑BUAA/AirSimAG.

PaperID: 405, https://arxiv.org/pdf/2603.22846.pdf GitHub

Authors: Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai

Title: CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

Abstract:
Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language‑specified target. Yet most existing methods rely on single‑agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition‑driven capability evolution, we propose CoMaTrack, a competitive game‑theoretic multi‑agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference‑resilient strategies. We further introduce CoMaTrack‑Bench, the first open‑source Habitat‑based benchmark protocol and episode set for language‑conditioned competitive EVT featuring dynamic dueling, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state‑of‑the‑art results on both standard benchmarks and CoMaTrack‑Bench. Notably, a 3B VLM trained with our framework surpasses previous single‑agent imitation learning methods based on 7B models on the challenging EVT‑Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack‑Bench.

PaperID: 406, https://arxiv.org/pdf/2603.22767.pdf GitHub

Authors: Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li

Title: Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

Abstract:
Observational studies can yield clinically actionable evidence at scale, but executing them on real‑world databases is open‑ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE‑bench, a benchmark grounded in MIMIC‑IV and derived from peer‑reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree‑structured evidence bundles. We evaluate six LLMs (three open‑source, three closed‑source) under three agent scaffolds using both question‑level correctness and end‑to‑end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open‑source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end‑to‑end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE‑bench.

PaperID: 407, https://arxiv.org/pdf/2603.22474.pdf GitHub

Authors: Lohith Senthilkumar, Tim Menzies

Title: From Brittle to Robust: Improving LLM Annotations for SE Optimization

Abstract:
Software analytics often builds from labeled data. Labeling can be slow, error prone, and expensive. When human expertise is scarce, SE researchers sometimes ask large language models (LLMs) for the missing labels. While this has been successful in some domains, recent results show that LLM‑based labeling has blind spots. Specifically, their labeling is not effective for higher dimensional multi‑objective problems. To address this task, we propose a novel LLM prompting strategy called SynthCore. When one opinion fails, SynthCore's combines multiple separated opinions generated by LLMs (with no knowledge of each others' answers) into an ensemble of few‑shot learners. Simpler than other strategies (e.g. chain‑of‑thought, multi‑agent‑debate, etc) SynthCore aggregates results from multiple single prompt sessions (with no crossover between them). SynthCore has been tested on 49 SE multi‑objective optimization tasks, handling tasks as diverse as software project management, Makefile configuration, and hyperparameter optimization. SynthCore's ensemble found optimizations that are better than state‑of‑the‑art alternative approaches (Gaussian Process Models, Tree of Parzen Estimators, active learners in both exploration and exploitation mode). Importantly, these optimizations were made using data labeled by LLMs, without any human opinions. From these experiments, we conclude that ensembles of few shot learners can successfully annotate high dimensional multi‑objective tasks. Further, we speculate that other successful few‑shot prompting results could be quickly and easily enhanced using SynthCore's ensemble approach. To support open science, all our data and scripts are available at https://github.com/lohithsowmiyan/lazy‑llm/tree/clusters.

PaperID: 408, https://arxiv.org/pdf/2603.21723.pdf GitHub

Authors: Yaxuan Wang, Yifan Xiang, Ke Li, Xun Zhang, BoWen Ye, Zhuochen Fan, Fei Wei, Tong Yang

Title: Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems

Abstract:
We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi‑robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator‑‑explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle‑rich and landmark‑sparse settings. Experiments show that TZPP achieves robust, human‑comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real‑world deployment of heterogeneous robot cooperation. Our code and video are provided at: https://github.com/triple‑zeropp/Triple‑zero‑robot‑agent

PaperID: 409, https://arxiv.org/pdf/2603.21475.pdf GitHub

Authors: Hehai Lin, Yu Yan, Zixuan Wang, Bo Xu, Sudong Wang, Weiquan Huang, Ruochen Zhao, Minzhi Li, Chengwei Qin

Title: Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems

Abstract:
Automatic Multi‑Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge‑intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain‑of‑Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain‑specific logic and optimize high‑level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified‑MAS that decouples granular node implementation from topological orchestration via offline node synthesis. Unified‑MAS operates in two stages: (1) Search‑Based Node Generation retrieves external open‑world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward‑Based Node Optimization utilizes a perplexity‑guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified‑MAS into four Automatic‑MAS baselines yields a better performance‑cost trade‑off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.

PaperID: 410, https://arxiv.org/pdf/2603.21357.pdf GitHub

Authors: Liang Ding

Title: AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Abstract:
LLM‑agent training pipelines routinely discard failed trajectories even though GPT‑4o achieves only 14‑20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50‑65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural‑language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four‑stage pipeline (failure classification, outcome extraction, LLM‑guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task‑disjoint held‑out protocol, AgentHER improves over success‑only SFT by +7.6‑11.4% across four model families (GPT‑4o, Qwen2.5‑72B/7B, LLaMA‑3.1‑8B), achieves 2x sample efficiency, and beats the strongest experience‑centric baseline (Agent Workflow Memory) by +3.0‑6.2%. Two robustness mechanisms, failure‑severity weighting and cross‑model multi‑judge verification (gpt‑4o‑mini paired with Qwen2.5‑72B‑Instruct), reduce label noise from 5.9% to 2.9% and raise human‑rated relabeling precision to 97.1% on WebArena and 96.0% on ToolBench. A full system‑cost audit shows the entire relabeling pipeline costs 2.98 and 26 wall‑clock minutes for 3,000 trajectories, i.e. 1.4 x 10^‑3 per accepted pair. Code: https://github.com/alphadl/AgentHER

PaperID: 411, https://arxiv.org/pdf/2603.21331.pdf GitHub GitHub

Authors: Jaber Jaber, Osama Jaber

Title: AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Abstract:
Writing high‑performance GPU kernels is among the most labor‑intensive tasks in machine learning systems engineering. We present AutoKernel, an open‑source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five‑stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge‑case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six‑tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max‑autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross‑entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel‑optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow‑AI/autokernel.

PaperID: 412, https://arxiv.org/pdf/2603.21251.pdf GitHub

Authors: Jingwen Tong, Fang Liu, Linkai Xv, Shiliang Lu, Kangqi Li, Yiqian Zhang, Yijie Song, Zeyang Xue, Jun Zhang

Title: WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence

Abstract:
LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk. Existing wireless benchmarks evaluate single isolated capabilities and treat all errors uniformly, missing both cascaded‑chain failures and catastrophic unit confusions (e.g., dB vs.\ dBm). We present \wb, the first tolerance‑aware, tool‑integrated benchmark for LLM‑based wireless agents. \wb is organized as a three‑tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1,392 items), intent‑driven resource allocation (WCNS, 1,000 items), and proactive multi‑step decisions under mobility (WCMSA, 1,000 items). Moreover, \wb is established on three design principles: \emph(i)~tolerance‑aware scoring with catastrophic‑error detection; \emph(ii)~tool‑necessary tasks requiring a 3GPP‑compliant ray‑tracing query for channel quality; and \emph(iii)~Chain‑of‑Thought (CoT)‑traceable items, where every benchmark item ships with a complete CoT trajectory enabling fine‑grained diagnosis of where in the reasoning chain an agent fails. Our numerical results show that the direct‑prompting model (GPT‑4o) scores 68%, trailing a tool‑integrated agent (84.64%) by 16.64\,pp; 23% of errors are catastrophic failures invisible to exact‑match metrics. More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://wirelessbench.github.io/.

PaperID: 413, https://arxiv.org/pdf/2603.21250.pdf GitHub

Authors: Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei

Title: Graph of States: Solving Abductive Tasks with Large Language Models

Abstract:
Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general‑purpose neuro‑symbolic framework tailored for abductive tasks. GoS grounds multi‑agent collaboration in a structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real‑world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://github.com/gaorch85/Graph‑of‑States.

PaperID: 414, https://arxiv.org/pdf/2603.20939.pdf GitHub

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür

Title: User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Abstract:
Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector‑Adapted Retrieval Scoring (VARS), a pipeline‑agnostic, frozen‑backbone framework that represents each user with long‑term and short‑term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per‑user fine‑tuning. We evaluate on \textscMultiSessionCollab, an online multi‑session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user‑aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long‑term vectors also align with cross‑user preference overlap, while short‑term vectors capture session‑specific adaptation, supporting the interpretability of the dual‑vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.

PaperID: 415, https://arxiv.org/pdf/2603.20933.pdf GitHub

Authors: Reshabh K Sharma, Dan Grossman

Title: AC4A: Access Control for Agents

Abstract:
Large Language Model (LLM) agents combine the chat interaction capabilities of LLMs with the power to interact with external tools and APIs. This enables them to perform complex tasks and act autonomously to achieve user goals. However, current agent systems operate on an all‑or‑nothing basis: an agent either has full access to an API's capabilities and a web page's content, or it has no access at all. This coarse‑grained approach forces users to trust agents with more capabilities than they actually need for a given task. In this paper, we introduce AC4A, an access control framework for agents. As agents become more capable and autonomous, users need a way to limit what APIs or portions of web pages these agents can access, eliminating the need to trust them with everything an API or web page allows. Our goal with AC4A is to provide a framework for defining permissions that lets agents access only the resources they are authorized to access. AC4A works across both API‑based and browser‑based agents. It does not prescribe what permissions should be, but offers a flexible way to define and enforce them, making it practical for real‑world systems. AC4A works by creating permissions granting access to resources, drawing inspiration from established access control frameworks like the one for the Unix file system. Applications define their resources as hierarchies and provide a way to compute the necessary permissions at runtime needed for successful resource access. We demonstrate the usefulness of AC4A in enforcing permissions over real‑world APIs and web pages through case studies. The source code of AC4A is available at https://github.com/reSHARMA/AC4A

PaperID: 416, https://arxiv.org/pdf/2603.20833.pdf GitHub

Authors: Steven Johnson

Title: Governance-Aware Vector Subscriptions for Multi-Agent Knowledge Ecosystems

Abstract:
As AI agent ecosystems grow, agents need mechanisms to monitor relevant knowledge in real time. Semantic publish‑subscribe systems address this by matching new content against vector subscriptions. However, in multi‑agent settings where agents operate under different data handling policies, unrestricted semantic subscriptions create policy violations: agents receive notifications about content they are not authorized to access. We introduce governance‑aware vector subscriptions, a mechanism that composes semantic similarity matching with multi‑dimensional policy predicates grounded in regulatory frameworks (EU DSM Directive, EU AI Act). The policy predicate operates over multiple independent dimensions (processing level, direct marketing restrictions, training opt‑out, jurisdiction, and scientific usage) each with distinct legal bases. Agents subscribe to semantic regions of a curated knowledge base; notifications are dispatched only for validated content that passes both the similarity threshold and all applicable policy constraints. We formalize the mechanism, implement it within AIngram (an operational multi‑agent knowledge base), and evaluate it using the PASA benchmark. We validate the mechanism on a synthetic corpus (1,000 chunks, 93 subscriptions, 5 domains): the governed mode correctly enforces all policy constraints while preserving delivery of authorized content. Ablation across five policy dimensions shows that no single dimension suffices for full compliance.

PaperID: 417, https://arxiv.org/pdf/2603.20640.pdf GitHub

Authors: Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le

Title: Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention

Abstract:
Multi‑Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter‑agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low‑confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity‑Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index‑based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi‑agent reasoning systems. Code is publicly available at https://github.com/DA2I2‑SLM/DAR.

PaperID: 418, https://arxiv.org/pdf/2603.20279.pdf GitHub

Authors: Alex Popa, Adrian Taylor, Ranwa Al Mallah

Title: Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

Abstract:
Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi‑agent reinforcement learning agents, capable of inter‑agent communication, to respond to cyberattacks. This paper advances the study of learned communication in multi‑agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state‑of‑the‑art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, by converging to an optimal policy up to four times faster while improving standard error by up 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.

PaperID: 419, https://arxiv.org/pdf/2603.20278.pdf GitHub

Authors: Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen

Title: OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

Abstract:
Training deep research agents requires long‑horizon trajectories that interleave search, evidence aggregation, and multi‑step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large‑scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one‑time corpus bootstrapping from multi‑turn trajectory synthesis and executes the search‑and‑browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M‑document corpus. Using GPT‑OSS‑120B as the teacher model, we synthesize over 97K trajectories, including a substantial long‑horizon tail with 100+ tool calls. Supervised fine‑tuning a 30B‑A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp‑Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench‑DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER‑AI‑Lab/OpenResearcher.

PaperID: 420, https://arxiv.org/pdf/2603.20075.pdf GitHub

Authors: Yingwei Zheng, Cong Li, Shaohua Li, Yuqun Zhang, Zhendong Su

Title: Agentic Harness for Real-World Compilers

Abstract:
Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross‑domain expertise requirements, and sparse, non‑descriptive bug reports, necessitating compiler‑specific tools. To bridge the gap, we introduce llvm‑autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm‑autofix are agent‑friendly LLVM tools, a benchmark llvm‑bench of reproducible LLVM bugs, and a tailored minimal agent llvm‑autofix‑mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm‑autofix‑mini also outperforms the state‑of‑the‑art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: https://github.com/dtcxzyw/llvm‑autofix

PaperID: 421, https://arxiv.org/pdf/2603.19308.pdf GitHub

Authors: Wentao Wang, Haoran Xu, Guang Tan

Title: GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space

Abstract:
In autonomous driving, multi‑agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling \em heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose \em GT‑Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT‑Space constructs a common feature space from ground‑truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real‑world dataset (RCooper) demonstrate that GT‑Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT‑Space.

PaperID: 422, https://arxiv.org/pdf/2603.19254.pdf GitHub

Authors: Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Jie Xu

Title: FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

Abstract:
Large language models (LLMs) are increasingly deployed in financial research workflows, where their role is evolving from single‑model assistance for human analysts toward autonomous collaboration among multiple agents. Yet real‑world deployments still expose factual errors, numerical inconsistencies, and shallow analysis, which can distort assessments of corporate fundamentals and trigger severe economic losses. While existing benchmarks have begun to evaluate such failures, they score all aspects of the generated analysis in one pass, failing to distinguish whether a model fails at foundational stages like auditing and correction, or underperforms at generating research‑grade insights. Consequently, it obscures capability bottlenecks and the specialized strengths essential for multi‑agent role assignment. To address these gaps, we introduce FinReasoning, a hierarchical benchmark that decomposes the core capabilities of financial research into semantic consistency, data alignment, and deep insight. We further propose a fine‑grained evaluation framework that strengthens hallucination‑correction assessment and incorporates a 12‑indicator rubric for core analytical skills. FinReasoning reveals clear capability stratification across model types. Closed‑source models (like Doubao‑Seed‑1.8) perform strongly overall and are better suited for core reasoning agents in multi‑agent financial systems; open‑source general models (like Qwen3‑235B) show clear capability divergence and consistently underperform on Semantic Consistency, making them less suited for quality‑sensitive generation tasks; financial‑domain models (like Fin‑R1) generate moderate insights but lack foundational auditing skills. Our work has already been deployed in pilot tests across several real‑world scenarios. The resource is available at https://github.com/TongjiFinLab/FinReasoning.

PaperID: 423, https://arxiv.org/pdf/2603.19166.pdf GitHub

Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan

Title: Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Abstract:
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state‑of‑the‑art VLM‑based grounding approaches struggle with complex metric‑semantic language queries. To address this limitation, we propose MAPG (Multi‑Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM‑EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG‑Bench, specifically designed to evaluate metric‑semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real‑world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

PaperID: 424, https://arxiv.org/pdf/2603.18743.pdf GitHub

Authors: Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang

Title: Memento-Skills: Let Agents Design Agents

Abstract:
We introduce \emphMemento‑Skills, a generalist, continually‑learnable LLM agent system that functions as an \emphagent‑designing agent: it autonomously constructs, adapts, and improves task‑specific agents through experience. The system is built on a memory‑based reinforcement learning framework with \emphstateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emphRead‑‑Write Reflective Learning mechanism introduced in \emphMemento~2~\citewang2025memento2. In the \emphread phase, a behaviour‑trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emphwrite phase, the agent updates and expands its skill library based on new experience. This closed‑loop design enables \emphcontinual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human‑designed agents, Memento‑Skills enables a generalist agent to \emphdesign agents end‑to‑end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emphGeneral AI Assistants benchmark and \emphHumanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento‑Teams/Memento‑Skills.

PaperID: 425, https://arxiv.org/pdf/2603.18718.pdf GitHub

Authors: Minhua Lin, Zhiwei Zhang, Hanqing Lu, Hui Liu, Xianfeng Tang, Qi He, Xiang Zhang, Suhang Wang

Title: MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Abstract:
Memory‑augmented LLM agents maintain external memory banks to support long‑horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug‑and‑play multi‑agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta‑Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in‑situ self‑evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug‑and‑play manner. Our code is publicly available at https://github.com/ventr1c/memma.

PaperID: 426, https://arxiv.org/pdf/2603.18043.pdf GitHub

Authors: Sunil Prakash

Title: The Provenance Paradox in Multi-Agent LLM Routing: Delegation Contracts and Attested Identity in LDP

Abstract:
Multi‑agent LLM systems delegate tasks across trust boundaries, but current protocols do not govern delegation under unverifiable quality claims. We show that when delegates can inflate self‑reported quality scores, quality‑based routing produces a provenance paradox: it systematically selects the worst delegates, performing worse than random. We extend the LLM Delegate Protocol (LDP) with delegation contracts that bound authority through explicit objectives, budgets, and failure policies; a claimed‑vs‑attested identity model that distinguishes self‑reported from verified quality; and typed failure semantics enabling automated recovery. In controlled experiments with 10 simulated delegates and validated with real Claude models, routing by self‑claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real models: 8.90 vs. 9.30), while attested routing achieves near‑optimal performance (d = 9.51, p < 0.001). Sensitivity analysis across 36 configurations confirms the paradox emerges reliably when dishonest delegates are present. All extensions are backward‑compatible with sub‑microsecond validation overhead.

PaperID: 427, https://arxiv.org/pdf/2603.17787.pdf GitHub

Authors: Hamed Taheri

Title: Governed Memory: A Production Architecture for Multi-Agent Workflows

Abstract:
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi‑step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open‑set atomic facts with schema‑enforced typed properties; tiered governance routing with progressive context delivery; reflection‑bounded retrieval with entity‑scoped isolation; and a closed‑loop schema lifecycle with AI‑assisted authoring and automated per‑property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual‑modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross‑entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.

PaperID: 428, https://arxiv.org/pdf/2603.17670.pdf GitHub

Authors: Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang

Title: AgentVLN: Towards Agentic Vision-and-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) requires an embodied agent to ground complex natural‑language instructions into long‑horizon navigation in unseen environments. While Vision‑Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D‑3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi‑Markov Decision Process (POSMDP) and introduce a VLM‑as‑Brain paradigm that decouples high‑level semantic reasoning from perception and planning via a plug‑and‑play skill library. To resolve multi‑level representation inconsistency, we design a cross‑space representation mapping that projects perception‑layer 3D topological waypoints into the image plane, yielding pixel‑aligned visual prompts for the VLM. Building on this bridge, we integrate a context‑aware self‑correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query‑Driven Perceptual Chain‑of‑Thought (QD‑PCoT) scheme, enabling the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN‑Instruct, a large‑scale instruction‑tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state‑of‑the‑art methods (SOTA) on long‑horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next‑generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.

PaperID: 429, https://arxiv.org/pdf/2603.17307.pdf GitHub

Authors: Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu

Title: Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Abstract:
Despite rapid developments and widespread applications of MLLM agents, they still struggle with long‑form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long‑chain reasoning tasks. Moreover, directly reducing the time context through embedding‑based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi‑agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine‑grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM‑based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state‑of‑the‑art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state‑of‑the‑art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

PaperID: 430, https://arxiv.org/pdf/2603.17191.pdf GitHub

Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen, Li Shen

Title: Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data

Abstract:
Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few‑shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP‑GPT Tabular Alzheimer's Prediction GPT, a domain‑adapted tabular LLM framework built on TableGPT2 and fine‑tuned for few‑shot AD classification using tabular prompts rather than plain texts. We evaluate TAP‑GPT across four ADNI‑derived datasets, including QT‑PAD biomarkers and region‑level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP‑GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few‑shot setting while remaining competitive with state‑of‑the‑art general‑purpose LLMs. We show that feature selection mitigates degradation in high‑dimensional inputs and that TAP‑GPT maintains stable performance under simulated and real‑world missingness without imputation. Additionally, TAP‑GPT produces structured, modality‑aware reasoning aligned with established AD biology and shows greater stability under self‑reflection, supporting its use in iterative multi‑agent systems. To our knowledge, this is the first systematic application of a tabular‑specialized LLM to multimodal biomarker‑based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM‑driven multi‑agent clinical decision‑support systems. The source code is publicly available on GitHub: https://github.com/sophie‑kearney/TAP‑GPT.

PaperID: 431, https://arxiv.org/pdf/2603.17187.pdf GitHub

Authors: Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Title: MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

Abstract:
Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta‑learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill‑driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient‑based updates via cloud LoRA fine‑tuning and Reinforcement Learning with a Process Reward Model (RL‑PRM). This is triggered during user‑inactive windows by the Opportunistic Meta‑Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher‑quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy‑based architecture, MetaClaw scales to production‑size LLMs without local GPUs. Experiments on MetaClaw‑Bench and AutoResearchClaw show that skill‑driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi‑K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming‑lab/MetaClaw.

PaperID: 432, https://arxiv.org/pdf/2603.16916.pdf GitHub

Authors: Dylan Waldner, Vyacheslav Kungurtsev, Mitchelle Ashimosi

Title: Noncooperative Human-AI Agent Dynamics

Abstract:
This paper investigates the dynamics of noncooperative interactions between artificial intelligence agents and human decision‑makers in strategic environments. In particular, motivated by extensive literature in behavioral Economics, human agents are more faithfully modeled with respect to the state of the art using Prospect Theoretic preferences, while AI agents are modeled with standard expected utility maximization. Prospect Theory incorporates known cognitive heuristics employed by humans, including reference dependence and greater loss aversion relative to utility to relative gains. This paper runs different combinations of expected utility and prospect theoretic agents in a number of classic matrix games as well as examples specialized to tease out distinctions in strategic behavior with respect to preference functions, to explore the emergent behaviors from mixed population (human vs. AI) competition. Extensive numerical simulations are performed across AI, aware humans (those with full knowledge of the game structure and payoffs), and learning Prospect Agents (i.e., for AIs representing humans). A number of interesting observations and patterns show up, spanning barely distinguishable behavior, behavior corroborating Prospect preference anomalies in the theoretical literature, and unexpected surprises. Code can be found at https://github.com/dylanwaldner/noncooperative‑human‑AI.

PaperID: 433, https://arxiv.org/pdf/2603.16839.pdf GitHub

Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

Title: Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Abstract:
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience‑aware communication. This work proposes an OpenEnv‑compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi‑component reward system combining structural validation, render quality assessment, LLM‑based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine‑tunes Qwen2.5‑Coder‑7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine‑tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six‑model comparison reveals that instruction adherence and tool‑use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open‑source dataset of 288 multi‑turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl‑multi‑turn‑rollouts Code: https://github.com/pushing‑the‑frontier/slide‑forge‑llm

PaperID: 434, https://arxiv.org/pdf/2603.15968.pdf GitHub

Authors: Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas

Title: MAC: Multi-Agent Constitution Learning

Abstract:
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM‑based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi‑Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human‑readable and auditable rule sets, and achieves performance comparable to supervised fine‑tuning and GRPO without requiring parameter updates.

PaperID: 435, https://arxiv.org/pdf/2603.15642.pdf GitHub

Authors: Pearl Mody, Mihir Panchal, Rishit Kar, Kiran Bhowmick, Ruhina Karani

Title: CraniMem: Cranial Inspired Gated and Bounded Memory for Agentic Systems

Abstract:
Large language model (LLM) agents are increasingly deployed in long running workflows, where they must preserve user and task state across many turns. Many existing agent memory systems behave like external databases with ad hoc read/write rules, which can yield unstable retention, limited consolidation, and vulnerability to distractor content. We present CraniMem, a neurocognitively motivated, gated and bounded multi‑stage memory design for agentic systems. CraniMem couples goal conditioned gating and utility tagging with a bounded episodic buffer for near term continuity and a structured long‑term knowledge graph for durable semantic recall. A scheduled consolidation loop replays high utility traces into the graph while pruning low utility items, keeping memory growth in check and reducing interference. On long horizon benchmarks evaluated under both clean inputs and injected noise, CraniMem is more robust than a Vanilla RAG and Mem0 baseline and exhibits smaller performance drops under distraction. Our code is available at https://github.com/PearlMody05/Cranimem and the accompanying PyPI package at https://pypi.org/project/cranimem.

PaperID: 436, https://arxiv.org/pdf/2603.15542.pdf GitHub

Authors: Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu

Title: InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Abstract:
Causal inference in social science relies on end‑to‑end, intervention‑centered research‑design reasoning grounded in real‑world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer‑reviewed studies across diverse policy domains. Experimental results show that state‑of‑the‑art LLMs struggle under this setting. To address this limitation, we further propose a multi‑agent framework, STRIDES. It achieves significant performance improvements over state‑of‑the‑art reasoning models. Our code and data are available at https://github.com/Sii‑yuning/STRIDES.

PaperID: 437, https://arxiv.org/pdf/2603.15483.pdf GitHub

Authors: Penny Chong, Harshavardhan Abichandani, Jiyuan Shen, Atin Ghosh, Min Pyae Moe, Yifan Mai, Daniel Dahlmeier

Title: Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Abstract:
Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role nor expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non‑expert user persona templates for user‑agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals‑such as tool signatures, and responses‑as natural language grading notes, evaluated automatically with LLM‑as‑a‑judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent complementing the user‑aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors, and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance with peaks of 8‑10% on our proposed metrics after incorporating the identified error remedies into the agent's design.

PaperID: 438, https://arxiv.org/pdf/2603.15401.pdf GitHub

Authors: Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, Lijie Hu

Title: SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Abstract:
Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end‑to‑end development settings remains unclear. We present SWE‑Skills‑Bench, the first requirement‑driven benchmark that isolates the marginal utility of agent skills in real‑world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution‑based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass‑rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to ‑10%) due to version‑mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE‑Skills‑Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE‑Skills‑Bench is available at https://github.com/GeniusHTX/SWE‑Skills‑Bench.

PaperID: 439, https://arxiv.org/pdf/2603.15341.pdf GitHub

Authors: Ren Jian Lim, Rushi Dai

Title: Intelligent Co-Design: An Interactive LLM Framework for Interior Spatial Design via Multi-Modal Agents

Abstract:
In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule‑based systems implement hard‑coded spatial constraints that restrict participatory engagement, while data‑driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM‑based, multimodal, multi‑agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real‑time user interaction for iterative spatial refinement, while Retrieval‑Augmented Generation (RAG) reduces data dependency without requiring task‑specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user‑centric communication and fosters more inclusive, effective, and resilient design processes. Project page: https://rsigktyper.github.io/AICodesign/

PaperID: 440, https://arxiv.org/pdf/2603.15183.pdf GitHub

Authors: Vladyslav Parakhin

Title: Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems

Abstract:
Multi‑agent LLM orchestration incurs synchronization costs scaling as O(n x S x |D|) in agents, steps, and artifact size under naive broadcast ‑‑ a regime I term broadcast‑induced triply‑multiplicative overhead. I argue this pathology is a structural residue of full‑state rebroadcast, not an inherent property of multi‑agent coordination. The central claim: synchronization cost explosion in LLM multi‑agent systems maps with formal precision onto the cache coherence problem in shared‑memory multiprocessors, and MESI‑protocol invalidation transfers to artifact synchronization under minimal structural modification. I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n x S x |D|) to O((n + W) x |D|). A TLA+‑verified protocol enforces single‑writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states. Simulation across four workload configurations yields token savings of 95.0% +/‑ 1.3% at V=0.05, 92.3% +/‑ 1.4% at V=0.10, 88.3% +/‑ 1.5% at V=0.25, and 84.2% +/‑ 1.3% at V=0.50 ‑‑ each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold. Contributions: (1) formal MESI‑to‑artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+‑verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always‑read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.

PaperID: 441, https://arxiv.org/pdf/2603.14588.pdf GitHub

Authors: Varun Pratap Bhardwaj

Title: SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory

Abstract:
Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information‑geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker‑Planck equation, replacing hand‑tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non‑trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four‑channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud‑augmented results reach 87.7%. A zero‑LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information‑geometric, sheaf‑theoretic, and stochastic‑dynamical foundations for AI agent memory systems.

PaperID: 442, https://arxiv.org/pdf/2603.14553.pdf GitHub

Authors: Shi Qiu, Zeyu Cai, Jiashen Wei, Zeyu Li, Yixuan Yin, Qing-Hong Cao, Chang Liu, Ming-xing Luo, Xing-Bo Yuan, Hua Xing Zhu

Title: An End-to-end Architecture for Collider Physics and Beyond

Abstract:
We present, to our knowledge, the first language‑driven agent system capable of executing end‑to‑end collider phenomenology tasks, instantiated within a decoupled, domain‑agnostic architecture for autonomous High‑Energy Physics phenomenology. Guided only by natural‑language prompts supplemented with standard physics notation, ColliderAgent carries out workflows from a theoretical Lagrangian to final phenomenological outputs without relying on package‑specific code. In this framework, a hierarchical multi‑agent reasoning layer is coupled to Magnus, a unified execution backend for phenomenological calculations and simulation toolchains. We validate the system on representative literature reproductions spanning leptoquark and axion‑like‑particle scenarios, higher‑dimensional effective operators, parton‑level and detector‑level analyses, and large‑scale parameter scans leading to exclusion limits. These results point to a route toward more automated, scalable, and reproducible research in collider physics, cosmology, and physics more broadly.

PaperID: 443, https://arxiv.org/pdf/2603.14393.pdf GitHub

Authors: Yuan Bi, Yiping Zhou, Pei Liu, Feng Li, Zhongliang Jiang, Nassir Navab

Title: From Scanning Guidelines to Action: A Robotic Ultrasound Agent with LLM-Based Reasoning

Abstract:
Robotic ultrasound offers advantages over free‑hand scanning, including improved reproducibility and reduced operator dependency. In clinical practice, US acquisition relies heavily on the sonographer's experience and situational judgment. When transferring this process to robotic systems, such expertise is often encoded explicitly through fixed procedures and task‑specific models, yielding pipelines that can be difficult to adapt to new scanning tasks. In this work, we propose a unified framework for autonomous robotic US scanning that leverages a LLM‑based agent to interpret US scanning guidelines and execute scans by dynamically invoking a set of provided software tools. Instead of encoding fixed scanning procedures, the LLM agent retrieves and reasons over guideline steps from scanning handbooks and adapts its planning decisions based on observations and the current scanning state. This enables the system to handle variable and decision‑dependent workflows, such as adjusting scanning strategies, repeating steps, or selecting the appropriate next tool call in response to image quality or anatomical findings. Because the reasoning underlying tool selection is also critical for transparent and trustworthy planning, we further fine tune the LLM agent using a RL based strategy to improve both its reasoning quality and the correctness of tool selection and parameterization, while maintaining robust generalization to unseen guidelines and related tasks. We first validate the approach via verbal execution on 10 US scanning guidelines, assessing reasoning as well as tool selection and parameterization, and showing the benefit of RL fine tuning. We then demonstrate real world feasibility on robotic scanning of the gallbladder, spine, and kidney. Overall, the framework follows diverse guidelines and enables reliable autonomous scanning across multiple anatomical targets within a unified system.

PaperID: 444, https://arxiv.org/pdf/2603.14373.pdf GitHub

Authors: Wu Ji

Title: Trust Over Fear: How Motivation Framing in System Prompts Affects AI Agent Debugging Depth

Abstract:
System prompts for AI coding agents increasingly employ motivational framing ‑‑ from neutral task descriptions to fear‑driven threats ‑‑ yet no controlled study has examined whether such framing affects agent behavior. We present two studies investigating how trust‑based versus fear‑based motivation framing in system prompts influences AI agent debugging performance. In Study 1, we conducted a controlled manual experiment comparing a trust‑framed methodology (NoPUA) against an unframed baseline across 9 debugging scenarios using Claude Sonnet 4. Trust‑framed agents found 59% more hidden issues (p = 0.002, d = 2.28) while taking 83% more investigative steps, despite finding 15% fewer surface‑level issues ‑‑ revealing a depth‑over‑breadth tradeoff in investigation strategy. In Study 2, we replicated and extended these findings with 5 independent automated runs across 3 conditions (Baseline, NoPUA trust‑framed, PUA fear‑framed), yielding 135 scenario‑level data points. Trust‑framed agents again showed significant advantages: +74% investigative steps (p = 0.008) and +25% hidden issues found (p = 0.016). Crucially, fear‑framed (PUA) agents showed no significant improvement over baseline on any metric (all p > 0.3), demonstrating that fear‑based motivation is ineffective for AI agents. We ground these findings in Self‑Determination Theory, regulatory focus theory, and satisficing models, arguing that trust‑based framing induces exploration‑oriented, promotion‑focused behavior while fear‑based framing fails to shift agents from default satisficing strategies. Our results suggest that the motivational frame of system prompts ‑‑ not just their technical content ‑‑ causally influences AI agent investigation depth.

PaperID: 445, https://arxiv.org/pdf/2603.14052.pdf GitHub

Authors: Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

Title: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Abstract:
This paper presents a multi‑agent perception‑action exploration alliance, dubbed A4VL, for efficient long‑video reasoning. A4VL operates in a multi‑round perception‑action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question‑answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query‑specific perception clue(s) from a few sampled frames and performs clue‑based alignment to find the video block(s) that are most relevant to the query‑specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with rational, (2) all agents collaboratively scores one another through cross‑reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception‑action deliberation by pruning (e.g., filtering out the lowest performing agent) and re‑staging (e.g., new‑clue and matching block based perception‑action exploration), or to conclude by producing its final answer. The integration of the multi‑agent alliance through multi‑round perception‑action exploration, coupled with event‑driven partitioning and cue‑guided block alignment, enables A4VL to effectively scale to real world long videos while preserving high quality video reasoning. Evaluation Results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long‑video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git‑disl/A4VL.

PaperID: 446, https://arxiv.org/pdf/2603.13950.pdf GitHub

Authors: Hussein Jawad, Nicolas J-B Brunel

Title: ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

Abstract:
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding‑based retrieval to select a small top‑k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval‑layer attack on tool‑augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker‑controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top‑k results, and push all benign tools out of the agent's context. ToolFlood uses a two‑phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine‑distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: https://github.com/as1‑prog/ToolFlood

PaperID: 447, https://arxiv.org/pdf/2603.13767.pdf GitHub

Authors: Yikang ouyang, Yang Luo, Dongsheng Zuo, Yuzhe Ma

Title: Retrieve, Schedule, Reflect: LLM Agents for Chip QoR Optimization

Abstract:
Modern chip design requires multi‑objective optimization of timing, power, and area under stringent time‑to‑market constraints. Although powerful optimization algorithms are integrated into EDA tools, achieving high QoR hinges on effective long‑horizon scheduling, which relies heavily on manual expert intervention. To address this issue and automate chip design, we propose an agentic LLM framework that schedules chip optimizations through direct interaction with EDA tools. The agent is grounded in natural language expertise expressed as a search tree through retrieval‑augmented generation (RAG). We further improve scheduling quality with Pareto‑driven QoR feedback through language reflection. Experimental results show that, compared with black‑box search methods such as reinforcement learning, our framework achieves 10% greater timing improvement while consuming less power and area, with more than 4x speedup. The post‑optimization QoR is also comparable to that achieved by human experts. Finally, the agent supports customized tasks expressed in natural language, enabling preferential QoR trade‑offs. The code and chip design data will be publicly available at https://github.com/YiKangOY/Open‑LLM‑ECO.

PaperID: 448, https://arxiv.org/pdf/2603.13707.pdf GitHub

Authors: Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao

Title: REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

Abstract:
Humanoid loco‑manipulation requires coordinated high‑level motion plans with stable, low‑level whole‑body execution under complex robot‑environment dynamics and long‑horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low‑level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high‑dimensional humanoid systems. To address this challenge, we present REFINE‑DP (REinforcement learning FINE‑tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high‑level planner and an RL‑based low‑level loco‑manipulation controller. The DP is fine‑tuned via a PPO‑based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE‑DP on a humanoid robot performing loco‑manipulation tasks, including door traversal and long‑horizon object transport. REFINE‑DP achieves an over 90% success rate in simulation, even in out‑of‑distribution cases not seen in the pre‑trained data, and enables smooth autonomous task execution in real‑world dynamic environments. Our proposed method substantially outperforms pre‑trained DP baselines and demonstrates that RL fine‑tuning is key to reliable humanoid loco‑manipulation. https://refine‑dp.github.io/REFINE‑DP/

PaperID: 449, https://arxiv.org/pdf/2603.13017.pdf GitHub

Authors: Sydney Lewis

Title: Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Abstract:
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex‑extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall‑oriented queries, 107 configurations spanning 5 pure and 5 cross‑layer search modes, and 5 LLM graders (214,519 consensus‑graded query‑result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism‑dependent. All 20 vector search configurations remain non‑significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031‑0.756). The best cross‑layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single‑user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill‑down. We release the implementation and analysis pipeline as open‑source software.

PaperID: 450, https://arxiv.org/pdf/2603.12788.pdf GitHub

Authors: Shuchang Lyu, Haiquan Wen, Guangliang Cheng, Meng Li, Zheng Zhou, You Zhou, Dingding Yao, Zhenwei Shi

Title: Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

Abstract:
Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi‑step reasoning capabilities. This progress motivates the extension of reasoning paradigms to remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception‑level matching and single‑entity formulations, limiting the role of explicit reasoning and inter‑entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi‑Entity Reasoning Grounding in Remote Sensing (ME‑RSRG). Based on ME‑RSRG, we reformulate remote sensing grounding as a multi‑entity reasoning task and propose an Entity‑Aware Reasoning (EAR) framework built upon visual‑linguistic foundation models. EAR generates structured reasoning traces and subject‑object grounding outputs. It adopts supervised fine‑tuning for cold‑start initialization and is further optimized via entity‑aware reward‑driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME‑RSRG demonstrate the challenges of multi‑entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV‑ShuchangLyu/ME‑RSRG.

PaperID: 451, https://arxiv.org/pdf/2603.12608.pdf GitHub

Authors: Bo Pan, Lunke Pan, Yitao Zhou, Qi Jiang, Zhen Wen, Minfeng Zhu, Wei Chen

Title: InterDeepResearch: Enabling Human-Agent Collaborative Information Seeking through Interactive Deep Research

Abstract:
Deep research systems powered by LLM agents have transformed complex information seeking by automating the iterative retrieval, filtering, and synthesis of insights from massive‑scale web sources. However, existing systems predominantly follow an autonomous "query‑to‑report" paradigm, limiting users to a passive role and failing to integrate their personal insights, contextual knowledge, and evolving research intents. This paper addresses the lack of human‑in‑the‑loop collaboration in the agentic research process. Through a formative study, we identify that current systems hinder effective human‑agent collaboration in terms of process observability, real‑time steerability, and context navigation efficiency. Informed by these findings, we propose InterDeepResearch, an interactive deep research system backed by a dedicated research context management framework. The framework organizes research context into a hierarchical architecture with three levels (information, actions, and sessions), enabling dynamic context reduction to prevent LLM context exhaustion and cross‑action backtracing for evidence provenance. Built upon this framework, the system interface integrates three coordinated views for visual sensemaking, and dedicated interaction mechanisms for interactive research context navigation. Evaluation on the Xbench‑DeepSearch‑v1 and Seal‑0 benchmarks shows that InterDeepResearch achieves competitive performance compared to state‑of‑the‑art deep research systems, while a formal user study demonstrates its effectiveness in supporting human‑agent collaborative information seeking. Project page with system demo: https://github.com/bopan3/InterDeepResearch.

PaperID: 452, https://arxiv.org/pdf/2603.11935.pdf GitHub

Authors: Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A. R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang

Title: MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross‑framework interoperability, coupled with an automated pipeline that bridges the host‑device gap for on‑device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine‑tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain‑specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi‑agent system equipped with repository‑aware reasoning and a plan‑and‑execute paradigm. Validated on MobileKernelBench, MoKA achieves state‑of‑the‑art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.

PaperID: 453, https://arxiv.org/pdf/2603.11872.pdf GitHub

Authors: Omar Coser

Title: ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics

Abstract:
Translating single‑cell RNA sequencing (scRNA‑seq) data into mechanistic biological hypotheses remains a critical bottleneck, as agentic AI systems lack direct access to transcriptomic representations while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding‑Linked Interactive Single‑cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT‑based semantic retrieval and LLM‑mediated interpretation for interactive single‑cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines depending on whether the query is a gene signature, natural language concept, or mixture of both. Integrated analytical modules perform pathway activity scoringacross 60+ gene sets, ligand‑‑receptor interaction prediction using 280+ curated pairs, condition‑aware comparative analysis, and cell‑type proportion estimation all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA‑seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, p < 0.001), with particularly large gains on gene‑signature queries (Cohen's d = 5.98 for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near‑perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: https://github.com/omaruno/ELISA‑An‑AI‑Agent‑for‑Expression‑Grounded‑Discovery‑in‑Single‑Cell‑Genomics.git (If you use ELISA in your research, please cite this work).

PaperID: 454, https://arxiv.org/pdf/2603.11558.pdf GitHub

Authors: Ruiying Li, Yunlang Zhou, YuYao Zhu, Kylin Chen, Jingyuan Wang, Sukai Wang, Kongtao Hu, Minhui Yu, Bowen Jiang, Zhan Su, Jiayao Ma, Xin He, Yongjian Shen, Yang Yang, Guanghui Ren, Maoqing Yao, Wenhao Wang, Yao Mu

Title: RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks

Abstract:
Vision‑Language‑Action (VLA) systems have shown strong potential for language‑driven robotic manipulation. However, scaling them to long‑horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi‑policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM‑driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self‑resetting loops for autonomous data collection. This mechanism enables continuous on‑policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high‑level reasoning and dynamically orchestrates learned policy primitives to accomplish long‑horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi‑policy robustness. Experiments in real‑world manipulation tasks demonstrate improved stability and scalability compared to conventional open‑loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long‑horizon tasks and reducing human time investment by 53.7%.

PaperID: 455, https://arxiv.org/pdf/2603.11346.pdf GitHub

Authors: Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa, Katerina Fragkiadaki

Title: Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

Abstract:
Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact‑less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force‑exchanging human‑human motion sequences as a multi‑agent reinforcement learning problem. We jointly train partner‑aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single‑human motion‑tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact‑promoting reward, which adapt the assistant's reference motion to the recipient's real‑time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi‑agent RL formulation for physically grounded and socially aware humanoid control.

PaperID: 456, https://arxiv.org/pdf/2603.10673.pdf GitHub

Authors: Yaxin Gong, Chongming Gao, Chenxiao Fan, Haoyan Liu, Wenjie Wang, Jianshan Sun, Yangyang Li, Fuli Feng, Xiangnan He

Title: Breaking User-Centric Agency: A Tri-Party Framework for Agent-Based Recommendation

Abstract:
Recent advances in large language models (LLMs) have stimulated growing interest in agent‑based recommender systems, enabling language‑driven interaction and reasoning for more expressive preference modeling. However, most existing agentic approaches remain predominantly user‑centric, treating items as passive entities and neglecting the interests of other critical stakeholders. This limitation exacerbates exposure concentration and long‑tail under‑representation, threatening long‑term system sustainability. In this work, we identify this fundamental limitation and propose the first Tri‑party LLM‑agent Recommendation framework (TriRec) that explicitly coordinates user utility, item exposure, and platform‑level fairness. The framework employs a two‑stage architecture: Stage 1 empowers item agents with personalized self‑promotion to improve matching quality and alleviate cold‑start barriers, while Stage 2 performs platform‑level sequential multi‑objective re‑ranking, balancing user relevance, item utility, and exposure fairness. Experiments show consistent gains in accuracy, fairness, and item‑level utility. Moreover, we find that item self‑promotion can simultaneously enhance fairness and effectiveness, challenging the conventional trade‑off assumption between relevance and fairness. Our code is available at https://github.com/Marfekey/TriRec.

PaperID: 457, https://arxiv.org/pdf/2603.10085.pdf GitHub

Authors: Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Title: KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Abstract:
Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM‑based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial‑and‑error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge‑driven and aware of task trajectories. Specifically, we present KernelSkill, a multi‑agent framework with a dual‑level memory architecture. KernelSkill operates by coordinating agents with long‑term memory of reusable expert skills and short‑term memory to prevent repetitive backtracking. On KernelBench Levels 1‑3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.

PaperID: 458, https://arxiv.org/pdf/2603.09909.pdf GitHub

Authors: Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li

Title: MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems

Abstract:
While Multi‑Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non‑uniform data ingestion pipelines, inconsistent visual‑reasoning evaluation, and a lack of cross‑specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi‑agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero‑shot semantic evaluation paradigm that overcomes the limitations of lexical string‑matching by leveraging large vision‑language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain‑specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub‑domains. We provide a rigorous ablation of interaction mechanisms and cost‑performance trade‑offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data is publicly available at: https://github.com/NUS‑Project/MedMASLab/

PaperID: 459, https://arxiv.org/pdf/2603.09875.pdf GitHub

Authors: Vladyslav Parakhin

Title: The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

Abstract:
The temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty‑second revocation window permits on the order of 6 × 10^3 unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches 6 × 10^5. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state‑mapping φ: Σ_\rm MESI \to Σ_\rm auth preserving transition structure under bounded‑staleness semantics. A safety theorem bounds unauthorized operations for the execution‑count Release Consistency‑directed Coherence (RCC) strategy at D_\rm rcc \leq n, independent of agent velocity v ‑‑ a qualitative departure from the O(v \cdot \mathrmTTL) scaling of time‑bounded strategies. Tick‑based discrete event simulation across three business‑contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a 120× reduction versus TTL‑based lease in the high‑velocity scenario (50 vs. 6,000 unauthorized operations), and 184× under anomaly‑triggered revocation. Zero bound violations across all 120 runs confirm the per‑capability safety guarantee. Simulation code: https://github.com/hipvlady/prizm

PaperID: 460, https://arxiv.org/pdf/2603.09641.pdf GitHub

Authors: Arash Shahmansoori

Title: PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution

Abstract:
LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test‑time adaptation with three tightly coupled components: (1) deterministic exact‑match rule retrieval over structured condition keys, (2) conflict‑aware memory with Bayesian source reliability and threshold‑based rule invalidation, and (3) COMPASS, a Pareto‑guided prompt‑evolution outer loop. Exact retrieval eliminates partial‑match interpretation errors on the deterministic path (0% by construction, vs 94.4% under Theorem~B.6's independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict‑aware memory resolves static‑‑dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end‑to‑end execution pipeline. Results (9‑‑10 seeds): PRECEPT achieves a +41.1pp first‑try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% P_1 on 2‑way logistics compositions (d=2.64), +40‑‑55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001.

PaperID: 461, https://arxiv.org/pdf/2603.09358.pdf GitHub

Authors: Wenhao Yan, Ning An, Linxu Li, Bingsheng Bi, Bo Jiang, Zhigang Lu, Baoxu Liu, Junrong Liu, Cong Dong

Title: ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation

Abstract:
Advanced Persistent Threats (APTs) pose critical challenges to modern cybersecurity due to their multi‑stage and stealthy nature. While provenance‑based detection approaches show promise in capturing causal attack semantics, current threat provenance practices face two paradoxical issues: (1) expert skepticism, where human analysts doubt the capability of traditional detection models to identify complex attacks; and (2) expert dependence, as analysts cannot manually process large‑scale raw logs to detect threats without these models. Consequently, collaboration between humans and traditional models remains the prevailing paradigm. However, this renders investigation quality contingent upon human expertise and frequently results in alert fatigue. To address these challenges, we present ProvAgent, a framework that evolves the threat provenance paradigm from human‑model collaboration to a novel collaboration between multi‑agent systems and traditional models. ProvAgent leverages the speed and cost‑efficiency of traditional models for initial anomaly screening over large‑scale logs. By enforcing fine‑grained identity‑behavior consistency via graph contrastive learning, it profiles entities based on specific attributes to generate high‑fidelity alerts. With these alerts serving as investigation entry points, ProvAgent achieves in‑depth autonomous investigation through a hypothesis‑verification multi‑agent framework. Evaluations with real‑world datasets demonstrate that ProvAgent outperforms six state‑of‑the‑art (SOTA) baselines in anomaly detection. Through automated investigation, ProvAgent reconstructs near‑complete attack processes at a minimum cost of \0.06 per day.

PaperID: 462, https://arxiv.org/pdf/2603.09022.pdf GitHub

Authors: Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang

Title: MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Abstract:
Multi‑turn, multi‑agent LLM game evaluations often exhibit substantial run‑to‑run variance. In long‑horizon interactions, small early deviations compound across turns and are amplified by multi‑agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory‑augmented MOdel context optimization), a self‑play framework that optimizes inference‑time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self‑play trajectories and injects them as priors during later play. Exploration runs tournament‑style prompt evolution with uncertainty‑aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text‑based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT‑4o‑mini and from 20.9% to 44.3% for Qwen‑2.5‑7B‑Instruct, using 2,000 self‑play games per task. Run‑to‑run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi‑agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect‑information games, while RL remains more effective in perfect‑information settings. All code is open‑source and available here: https://github.com/openverse‑ai/MEMO

PaperID: 463, https://arxiv.org/pdf/2603.09018.pdf GitHub

Authors: Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille

Title: Meissa: Multi-modal Medical Agentic Intelligence

Abstract:
Multi‑modal large language models (MM‑LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi‑agent collaboration, enabling complex decision‑making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API‑based deployment incurs high cost, high latency, and privacy risks that conflict with on‑premise clinical requirements. We present Meissa, a lightweight 4B‑parameter medical MM‑LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi‑step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state‑action‑observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three‑tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool‑augmented and multi‑agent interaction, explicitly learning difficulty‑aware strategy selection. (3) Prospective‑retrospective supervision: pairing exploratory forward traces with hindsight‑rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini‑3, Meissa operates fully offline with 22x lower end‑to‑end latency compared to API‑based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.

PaperID: 464, https://arxiv.org/pdf/2603.08835.pdf GitHub

Authors: Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Title: MASEval: Extending Multi-Agent Evaluation from Models to Systems

Abstract:
The rapid adoption of LLM‑based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model‑centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework‑agnostic library that treats the entire system as the unit of analysis. Through a systematic system‑level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.

PaperID: 465, https://arxiv.org/pdf/2603.08806.pdf GitHub GitHub

Authors: Tzafrir Rehan

Title: Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

Abstract:
We present Test‑Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool‑using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post‑compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite‑Core, a benchmark of four deeply‑specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1‑2, and show 86‑100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f‑labs‑io/tdad‑paper‑code.

PaperID: 466, https://arxiv.org/pdf/2603.08706.pdf GitHub

Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Title: Agentic Critical Training

Abstract:
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self‑reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre‑constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self‑reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post‑training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out‑of‑distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning‑specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.

PaperID: 467, https://arxiv.org/pdf/2603.08240.pdf GitHub

Authors: Jiageng Wen, Shengjie Zhao, Bing Li, Jiafeng Huang, Kenan Ye, Hao Deng

Title: SiMO: Single-Modality-Operable Multimodal Collaborative Perception

Abstract:
Collaborative perception integrates multi‑agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure‑‑especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single‑modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single‑Modality‑Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length‑Adaptive Multi‑Modal Fusion (LAMMA), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain‑Align‑Fuse‑RD" training strategy, SiMO addresses the issue of modality competition‑‑generally overlooked by existing methods‑‑ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality‑specific features, enabling it to maintain optimal performance across all individual modalities. The implementation details can be found in https://github.com/dempsey‑wen/SiMO.

PaperID: 468, https://arxiv.org/pdf/2603.08166.pdf GitHub

Authors: Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin

Title: RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs

Abstract:
Automated Drug Combination Extraction (DCE) from large‑scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable‑length n‑ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end‑to‑end reasoning‑enhanced relation extraction framework for n‑ary drug combination extraction based on large language models. RexDrug adopts a two‑stage training strategy. First, a multi‑agent collaborative mechanism is utilized to automatically generate high‑quality expert‑like reasoning traces for supervised fine‑tuning. Second, reinforcement learning with a multi‑dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state‑of‑the‑art baselines for n‑ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drugdrug interaction tasks. Human expert assessment and automatic reasoning metrics further indicates that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at https://github.com/DUTIR‑BioNLP/RexDrug

PaperID: 469, https://arxiv.org/pdf/2603.07169.pdf GitHub

Authors: Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu

Title: Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Abstract:
Optimizing GPU kernels manually is a challenging and time‑consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM‑driven automated optimization methods narrowly focus on machine learning applications, such as PyTorch operator optimization, while overlooking broader domains like sparse matrix operations in scientific computing. Extending to these broader applications brings new challenges for the benchmark and algorithm. Therefore, developing a general‑purpose automated kernel optimization method becomes our primary focus. In this paper, we address the absence of systematic evaluation for multi‑scenario settings by introducing MSKernelBench, which spans multiple scenarios, including fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi‑agent, hardware‑aware system for kernel optimization that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results demonstrate that CUDAMaster achieves significant speedups across most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed‑source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.

PaperID: 470, https://arxiv.org/pdf/2603.06859.pdf GitHub

Authors: Yanjun Chen, Yirong Sun, Hanlin Wang, Jinghan Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

Title: Exact Is Easier: Credit Assignment for Cooperative LLM Agents

Abstract:
Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi‑agent LLM systems this evaluation distorts the result it claims to measure. This failure is not isolated: learned critics, trajectory‑level baselines, and agent‑removal counterfactuals all inherit from standard multi‑agent reinforcement learning a premise that exact counterfactual evaluation requires privileged environment access, and therefore approximate. In cooperative LLM systems, this premise is false. Interaction histories are deterministic functions of observable text with no hidden state, so any decision point can be restored exactly, making direct causal measurement possible without parametric approximation. C3 exploits this property by fixing the complete history at each decision point, sampling alternative actions under a frozen behavior policy, and computing unbiased per‑decision advantages through a parameter‑free leave‑one‑out baseline. Across six benchmarks spanning math reasoning and code generation, two model families, and two multi‑agent topologies, C3 consistently outperforms all baselines; a controlled decomposition confirms gains originate from credit quality, not architecture, while checkpoint restoration reduces training token consumption. The exact solution proves simpler, cheaper, and more effective than all approximate alternatives. The same structural property that enables exact credit also enables exact verification: three independently computable diagnostics, credit fidelity, within‑group variance, and inter‑agent influence, constitute the first method‑agnostic auditing tool for multi‑agent LLM credit assignment. Our code is available at https://github.com/EIT‑EAST‑Lab/C3

PaperID: 471, https://arxiv.org/pdf/2603.06217.pdf GitHub

Authors: Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, Hans Auer

Title: Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI

Abstract:
Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one‑way dispatch signals and price alerts that offer little possibility for informed decision‑making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two‑tier multi‑agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost‑benefit by calling an optimization‑based tool. CDR also enables prosumer‑initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof‑of‑concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator‑prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.

PaperID: 472, https://arxiv.org/pdf/2603.06198.pdf GitHub

Authors: Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

Title: LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi‑step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT‑RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT‑RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT‑RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human‑constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM‑as‑a‑Judge for scoring and report category‑wise and overall accuracy. Across API‑based and open‑weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT‑RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG‑specialized models. We release LIT‑RAGBench, including the dataset and evaluation code, at https://github.com/Koki‑Itai/LIT‑RAGBench.

PaperID: 473, https://arxiv.org/pdf/2603.06007.pdf GitHub GitHub

Authors: Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

Title: MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Abstract:
Large language model‑based (LLM‑based) multi‑agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub‑workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph‑centric framework for orchestrating LLM‑based MAS. It introduces Vibe Graphing, a human‑in‑the‑loop approach that compiles natural‑language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human‑in‑the‑loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT‑GAMMA/MASFactory, licensed under Apache‑2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

PaperID: 474, https://arxiv.org/pdf/2603.04977.pdf GitHub

Authors: Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai

Title: Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

Abstract:
Long video understanding is challenging due to dense visual redundancy, long‑range temporal dependencies, and the tendency of chain‑of‑thought and retrieval‑based agents to accumulate semantic drift and correlation‑driven errors. We argue that long‑video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking‑before‑finding principle motivates VideoHV‑Agent, a framework that reformulates video question answering as a structured hypothesis‑verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine‑grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long‑video understanding benchmarks show that VideoHV‑Agent achieves state‑of‑the‑art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV‑Agent.

PaperID: 475, https://arxiv.org/pdf/2603.04855.pdf GitHub

Authors: Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou

Title: HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Abstract:
Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad‑hoc prompting or hand‑crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory‑Aligned and Distribution‑Controllable Persona Generation (TAD‑PG) and introduce HACHIMI, a multi‑agent Propose‑Validate‑Revise framework that generates theory‑aligned, quota‑controlled personas. HACHIMI factorizes each persona into a theory‑anchored educational schema, enforces developmental and psychological constraints via a neuro‑symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI‑1M corpus comprises 1 million personas for Grades 1‑12. Intrinsic evaluation shows near‑perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom‑climate and well‑being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5‑72B, and HACHIMI provides a standardized synthetic student population for group‑level benchmarking and social‑science simulations. Resources available at https://github.com/ZeroLoss‑Lab/HACHIMI

PaperID: 476, https://arxiv.org/pdf/2603.04833.pdf GitHub

Authors: Manav Vora, Gokul Puthumanaillam, Hiroyasu Tsukamoto, Melkior Ornik

Title: SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

Abstract:
Communication can improve coordination in partially observed multi‑agent reinforcement learning (MARL), but learning \emphwhen and \emphwho to communicate with requires choosing among many possible sender‑recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility‑guided Temporal grouping), which addresses both these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples soft agent groups every $K$ environment steps (macro‑steps) via Gumbel‑Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group‑aware critic predicts values for each agent group and maps them to per‑agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three‑headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient‑selection decisions. At execution time, all centralized training components are discarded and only the per‑agent policy is run, preserving decentralized execution. Project website, videos and code: \hyperlinkhttps://scout‑comm.github.io/https://scout‑comm.github.io/

PaperID: 477, https://arxiv.org/pdf/2603.04428.pdf GitHub

Authors: Yakov Pyotr Shkolnikov

Title: Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Abstract:
Multi‑agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10‑agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re‑prefill through the model ‑‑ 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4‑bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per‑agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross‑phase context injection that accumulates attention state across conversation phases without re‑computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek‑Coder‑V2‑Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time‑to‑first‑token by up to 136x (Gemma: 22‑‑136x at 4K‑‑32K; DeepSeek: 11‑‑76x at 4K‑‑32K; Llama: 24‑‑111x at 4K‑‑16K; 3‑‑10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows ‑0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open‑source at https://github.com/yshk‑mxim/agent‑memory

PaperID: 478, https://arxiv.org/pdf/2603.03680.pdf GitHub

Authors: Lu Yang, Zelai Xu, Minyang Xie, Jiaxuan Gao, Zhao Shok, Yu Wang, Yi Wu

Title: MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non‑stationary environments with feedback. While In‑Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long‑term improvement. Meta‑Reinforcement Learning (meta‑RL) provides an alternative by embedding the learning process directly within the model. However, existing meta‑RL approaches for LLMs focus primarily on exploration in single‑agent settings, neglecting the strategic exploitation necessary for multi‑agent environments. We propose MAGE, a meta‑RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi‑episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population‑based training with an agent‑specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at https://github.com/Lu‑Yang666/MAGE.

PaperID: 479, https://arxiv.org/pdf/2603.03372.pdf GitHub

Authors: Zhengding Hu, Kuntal Talit, Zhen Wang, Haseeb Ahmad, Yichen Lin, Prabhleen Kaur, Christopher Lane, Elizabeth A. Peterson, Zhiting Hu, Elizabeth A. Nowadnick, Yufei Ding

Title: TritonDFT: Automating DFT with a Multi-Agent Framework

Abstract:
Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi‑step workflow. Existing tools and LLM‑based solutions automate parts of the steps, but lack support for full workflow automation, diverse task adaptation, and accuracy‑cost trade‑off optimization in DFT configuration. To this end, we present TritonDFT, a multi‑agent framework that enables efficient and accurate DFT execution through an expert‑curated, extensible workflow design, Pareto‑aware parameter inference, and multi‑source knowledge augmentation. We further introduce DFTBench, a benchmark for evaluating the agent's multi‑dimensional capabilities, spanning science expertise, trade0off optimization, HPC knowledge, and cost efficiency. TritonDFT provides an open user interface for real‑world usage. Our website is at https://www.tritondft.com. Our source code and benchmark suite are available at https://github.com/Leo9660/TritonDFT.git.

PaperID: 480, https://arxiv.org/pdf/2603.03296.pdf GitHub

Authors: Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai

Title: PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents

Abstract:
Long‑term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task‑specific and non‑transferable, or task‑agnostic but less effective due to low task‑relevance and context explosion from raw memory retrieval. We propose PlugMem, a task‑agnostic plugin memory module that can be attached to arbitrary LLM agents without task‑specific redesign. Motivated by the fact that decision‑relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge‑centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task‑relevant knowledge, rather than verbose raw trajectories, and departs from other graph‑based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long‑horizon conversational question answering, multi‑hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task‑agnostic baselines and exceeds task‑specific memory designs, while also achieving the highest information density under a unified information‑theoretic analysis. Code and data are available at https://github.com/TIMAN‑group/PlugMem.

PaperID: 481, https://arxiv.org/pdf/2603.03290.pdf GitHub

Authors: Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang

Title: AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents

Abstract:
Long‑horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long‑term dialogue: (i) disconnected evidence, where multi‑hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two‑phase pipeline. In the offline construction phase, AriadneMem employs \emphentropy‑aware gating to filter noise and low‑information message before LLM extraction and applies \emphconflict‑aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes \emphalgorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by \emphsingle‑call topology‑aware synthesis. On LoCoMo experiments with GPT‑4o, AriadneMem improves Multi‑Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at https://github.com/LLM‑VLM‑GSL/AriadneMem.

PaperID: 482, https://arxiv.org/pdf/2603.02473.pdf GitHub

Authors: Boqin Yuan, Yue Su, Kun Yao

Title: Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Abstract:
Memory‑augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0‑style fact extraction, MemGPT‑style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3‑8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write‑time sophistication. Code is publicly available at https://github.com/boqiny/memory‑probe.

PaperID: 483, https://arxiv.org/pdf/2603.02240.pdf GitHub

Authors: Varun Pratap Bhardwaj

Title: SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning

Abstract:
We present SuperLocalMemory, a local‑first memory system for multi‑agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning‑to‑rank ‑‑ all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on persistent memory, cloud‑based memory systems create centralized attack surfaces where poisoned memories propagate across sessions and users ‑‑ a threat demonstrated in documented attacks against production systems. Our architecture combines SQLite‑backed storage with FTS5 full‑text search, Leiden‑based knowledge graph clustering, an event‑driven coordination layer with per‑agent provenance, and an adaptive re‑ranking framework that learns user preferences through three‑layer behavioral analysis (cross‑project technology preferences, project context detection, and workflow pattern mining). Evaluation across seven benchmark dimensions demonstrates 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap =0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re‑ranking is enabled. Behavioral data is isolated in a separate database with GDPR Article 17 erasure support. SuperLocalMemory is open‑source (MIT) and integrates with 17+ development tools via Model Context Protocol.

PaperID: 484, https://arxiv.org/pdf/2603.01864.pdf GitHub

Authors: Alexander Prutsch, David Schinagl, Horst Possegger

Title: Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling

Abstract:
Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision‑making of autonomous vehicles. While trajectory forecasting is a well‑studied field, research mainly focuses on snapshot‑based prediction, where each scenario is treated independently of its global temporal context. However, real‑world autonomous driving systems need to operate in a continuous setting, requiring real‑time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming‑based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint‑aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment‑wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi‑stage refinement processing, our approach significantly reduces inference latency, making it well‑suited for real‑world deployment. We achieve state‑of‑the‑art streaming trajectory prediction results on the Argoverse~2 multi‑agent and single‑agent benchmarks, while requiring substantially fewer resources.

PaperID: 485, https://arxiv.org/pdf/2603.01712.pdf GitHub

Authors: Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

Title: FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Abstract:
Fine‑tuning large language models for vertical domains remains labor‑intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end‑to‑end LLM fine‑tuning has not been systematically studied as an interactive agent task. We introduce FT‑Dojo, an interactive benchmark environment for autonomous LLM fine‑tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT‑Dojo standardizes a task interface, shared raw‑data repository, sandboxed execution environment, structured feedback protocol, and held‑out evaluation procedure. We further develop FT‑Agent, a fine‑tuning‑oriented autonomous framework that uses structured iteration planning, fail‑fast validation, and multi‑level feedback analysis to refine data and training strategies. Experiments show that FT‑Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open‑source planning backbones, and multi‑run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long‑horizon planning. The implementation is available at https://github.com/microsoft/rd‑agent.

PaperID: 486, https://arxiv.org/pdf/2603.01089.pdf GitHub

Authors: Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari

Title: CARD: Towards Conditional Design of Multi-agent Topological Structures

Abstract:
Large language model (LLM)‑based multi‑agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real‑world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph‑generation framework that instantiates AMACP, a protocol for adaptive multi‑agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment‑aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt‑based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: https://github.com/Warma10032/CARD.

PaperID: 487, https://arxiv.org/pdf/2603.01045.pdf GitHub

Authors: Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang

Title: Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

Abstract:
Large language models are increasingly deployed in multi‑agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information, rather than merely exchange it, remains an open question. We introduce SILO‑BENCH, a role‑agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication‑Reasoning Gap: agents spontaneously form task‑appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning‑integration stage where agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and SILO‑BENCH provides a foundation for tracking progress toward genuinely collaborative multi‑agent systems. The code is available at https://github.com/jwyjohn/acl26‑silo‑bench .

PaperID: 488, https://arxiv.org/pdf/2603.00960.pdf GitHub

Authors: Akshat Singh Jaswal, Ashish Baghel

Title: AWE: Adaptive Agents for Dynamic Web Penetration Testing

Abstract:
Modern web applications are increasingly produced through AI‑assisted development and rapid no‑code deployment pipelines, widening the gap between accelerating software velocity and the limited adaptability of existing security tooling. Pattern‑driven scanners fail to reason about novel contexts, while emerging LLM‑based penetration testers rely on unconstrained exploration, yielding high cost, unstable behavior, and poor reproducibility. We introduce AWE, a memory‑augmented multi‑agent framework for autonomous web penetration testing that embeds structured, vulnerability‑specific analysis pipelines within a lightweight LLM orchestration layer. Unlike general‑purpose agents, AWE couples context aware payload mutations and generations with persistent memory and browser‑backed verification to produce deterministic, exploitation‑driven results. Evaluated on the 104‑challenge XBOW benchmark, AWE achieves substantial gains on injection‑class vulnerabilities ‑ 87% XSS success (+30.5% over MAPTA) and 66.7% blind SQL injection success (+33.3%) ‑ while being much faster, cheaper, and more token‑efficient than MAPTA, despite using a midtier model (Claude Sonnet 4) versus MAPTA's GPT‑5. MAPTA retains higher overall coverage due to broader exploratory capabilities, underscoring the complementary strengths of specialized and general‑purpose architectures. Our results demonstrate that architecture matters as much as model reasoning capabilities: integrating LLMs into principled, vulnerability‑aware pipelines yields substantial gains in accuracy, efficiency, and determinism for injection‑class exploits. The source code for AWE is available at: https://github.com/stuxlabs/AWE

PaperID: 489, https://arxiv.org/pdf/2603.00822.pdf GitHub

Authors: Reshabh K Sharma

Title: ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

Abstract:
As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language instruction files such as AGENTS.md to express project‑specific coding conventions, tooling restrictions, and architectural boundaries. However, because these instructions remain passive text, agents frequently violate documented constraints due to context window saturation or conflicting local context. In autonomous settings without real‑time human supervision, such violations rapidly compound into technical debt. To ground autonomous agents in repository constraints, we introduce ContextCov, a framework that transforms passive natural language instructions into executable guardrails. Unlike prompt‑only or reflection‑only compliance approaches, ContextCov compiles documented constraints into three complementary checks: static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. Acting as an automated, continuous reviewer, ContextCov intercepts agent actions and returns immediate, reproducible violation traces, enabling self‑correction before non‑compliant changes are finalized. We evaluate ContextCov on SWE‑bench Lite (12 repositories, 300 tasks). Compared to prompt‑only and LLM reflection baselines, ContextCov achieves 88.3% constraint compliance (vs. 67.0% and 50.3%) with 3.4x lower feedback cost, while maintaining functional correctness. The source code and evaluation results are available at https://github.com/reSHARMA/ContextCov.

PaperID: 490, https://arxiv.org/pdf/2603.00805.pdf GitHub

Authors: Seemandhar Jain, Keshav Gupta, Kunal Gupta, Manmohan Chandraker

Title: NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code

Abstract:
The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi‑agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper‑to‑code methods and frontier models like GPT‑5 that usually fail to produce runnable code. NERFIFY achieves domain‑specific executability through six key innovations: (1) Context‑free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph‑of‑Thought code synthesis: Specialized multi‑file‑agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR‑minima ROI analysis, cross‑view geometric validation, and VLM‑guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper‑to‑code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/‑0.5 dB PSNR, +/‑0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain‑aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.

PaperID: 491, https://arxiv.org/pdf/2603.00724.pdf GitHub

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang

Title: RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

Abstract:
Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain‑specific reward models are often costly to train and exhibit poor generalization in out‑of‑distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent‑driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self‑evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench‑V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: https://github.com/ZhuoerFeng/RLAR.

PaperID: 492, https://arxiv.org/pdf/2603.00718.pdf GitHub GitHub

Authors: Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh

Title: SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Abstract:
Real‑world tool‑using agents operate over long‑horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting, and reusing higher‑level tool compositions. However, existing benchmarks mainly measure instance‑level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark explicitly stress‑test agent ability to form and reuse higher‑level tool compositions, where we call Skills. SkillCraft features realistic, highly compositional tool‑use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross‑task reuse. We further propose a lightweight evaluation protocol that enables agents to auto‑compose atomic tools into executable Skills, cache and reuse them inside and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state‑of‑the‑art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% by skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.

PaperID: 493, https://arxiv.org/pdf/2603.00656.pdf GitHub

Authors: Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

Title: InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Abstract:
Real‑world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi‑turn GRPO‑based methods often rely on trajectory‑level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information‑Driven Policy Optimization), which frames multi‑turn interaction as a process of active uncertainty reduction and computes an information‑gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked‑feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance‑gated fusion to identify information importance while maintaining task‑oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool‑augmented decision making, InfoPO consistently outperforms prompting and multi‑turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment‑interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent‑user collaboration. Code is available at https://github.com/kfq20/InfoPO.

PaperID: 494, https://arxiv.org/pdf/2603.00623.pdf GitHub

Authors: Shu-Xun Yang, Cunxiang Wang, Haoke Zhang, Wenbo Yu, Lindong Wu, Jiayi Gui, Dayong Yang, Yukuo Cen, Zhuoer Feng, Bosi Wen, Yidong Wang, Lucen Zhong, Jiamin Ren, Linfeng Zhang, Jie Tang

Title: TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces

Abstract:
Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi‑agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine‑grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real‑world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at https://github.com/SHU‑XUN/TraceSIR.

PaperID: 495, https://arxiv.org/pdf/2603.00349.pdf GitHub

Authors: Hanqing Yang, Narjes Nourzad, Shiyu Chen, Marie Siew, Jingdi Chen, Carlee Joe-Wong

Title: COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems

Abstract:
Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, simply adding more agents does not guarantee better performance, as effective cooperation depends on how agents interact with each other and with task structure to satisfy evolving constraints over time. This challenge is amplified for LLM‑based multi‑agent systems (LLM‑MAS): plans, messages, and revisions occur in natural language, whereas task progress depends on grounded environment actions. Current evaluations mostly treat cooperation as an implicit ingredient of final task success, leaving both cooperation and the effect of multi‑agent interaction on task dynamics difficult to study. We introduce COOP^2, an evaluation framework that grounds high‑level agent cooperation dynamics in LLM‑MAS within task progress in the environment. COOP^2 then defines cooperative tasks with verifiable cooperative requirements, allowing us to analyze how cooperation unfolds over time with respect to task progress, as well as where and why cooperation breaks down. Building on this framework, we develop COOP^2‑Repair, which predicts constraint failures from group plans and opens targeted repair channels for guided revisions. Across two environments and three communication structures, COOP^2‑Repair improves task success and constraint satisfaction while exposing the additional decision overhead and communication load required for repair. The project web page can be found at: https://happyeureka.github.io/coop2.

PaperID: 496, https://arxiv.org/pdf/2603.00309.pdf GitHub

Authors: Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong

Title: DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths

Abstract:
The increasingly popular agentic AI paradigm promises to harness the power of multiple, general‑purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems reduce complexity through predefined workflows or fixed agent roles, the ideal is to support truly autonomous agents capable of emergent collaboration across many interacting agents. Yet in practice, such unstructured interactions often lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi‑agent systems composed of general‑purpose LLM agents that solve problems through emergent collaboration, without relying on predefined roles, control flows, or communication constraints. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time‑evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real‑time identification, explanation, and correction of collaboration‑induced error patterns directly from agents' collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi‑agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.

PaperID: 497, https://arxiv.org/pdf/2603.00131.pdf GitHub

Authors: Moritz Weckbecker, Jonas Müller, Ben Hagag, Michael Mulet

Title: Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Abstract:
Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user‑LLM interactions, potential bias transfer in multi‑agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persisting bias throughout its entire network. We measure this phenomenon across 6 agents using two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To exemplify potential misalignment risks, we assess network performance on multiple‑choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi‑agent security, with implications for the alignment of such systems. The implementation of all experiments is publicly available at https://github.com/Multi‑Agent‑Security‑Initiative/thought_virus .

PaperID: 498, https://arxiv.org/pdf/2603.00084.pdf GitHub

Authors: Hongjin Qian, Ziyi Xia, Ze Liu, Jianlyu Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Junwei Lan, Sen Wang, Zhengyang Liang, Yingxia Shao, Defu Lian, Zheng Liu

Title: DeepXiv-SDK: An Agentic Data Interface for Scientific Literature

Abstract:
LLM‑agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstrcutured, human‑centric data on the Internet, such as HTML web‑pages and PDF files, leading to excessive token consumption, limit working efficiency, and brittle evidence look‑up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost‑aware manner. In this paper, we introduce DeepXiv‑SDK, which offers a three‑layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human‑centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad‑hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built‑in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv‑SDK currently supports the complete ArXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open‑access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open‑source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv‑SDK is free to use with registration.

PaperID: 499, https://arxiv.org/pdf/2602.23583.pdf GitHub

Authors: Donggeon Kim, Seungwon Jan, Hyeonjun Park, Daegyu Lim

Title: VCA: Vision-Click-Action Framework for Precise Manipulation of Segmented Objects in Target Ambiguous Environments

Abstract:
The reliance on language in Vision‑Language‑Action (VLA) models introduces ambiguity, cognitive overhead, and difficulties in precise object identification and sequential task execution, particularly in environments with multiple visually similar objects. To address these limitations, we propose Vision‑Click‑Action (VCA), a framework that replaces verbose textual commands with direct, click‑based visual interaction using pretrained segmentation models. By allowing operators to specify target objects clearly through visual selection in the robot's 2D camera view, VCA reduces interpretation errors, lowers cognitive load, and provides a practical and scalable alternative to language‑driven interfaces for real‑world robotic manipulation. Experimental results validate that the proposed VCA framework achieves effective instance‑level manipulation of specified target objects. Experiment videos are available at https://robrosinc.github.io/vca/.

PaperID: 500, https://arxiv.org/pdf/2602.23556.pdf GitHub

Authors: Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari

Title: Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Abstract:
Large‑scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching polices. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state‑of‑the‑art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In‑Context Learning (ICL) for zero‑shot tasks, with logical multi‑step reasoning. We find this behavior well‑suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end‑to‑end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder‑llm‑agent.

PaperID: 501, https://arxiv.org/pdf/2602.23452.pdf GitHub

Authors: Kaiwen Shi, Weixiang Sun, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye

Title: CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Abstract:
Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publications. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi‑agent verification pipeline that decomposes citation checking into metadata extraction, memory lookup, web‑based retrieval, and final judgment. To evaluate this, we construct a large‑scale, human‑validated dataset spanning diverse domains and hallucination types. Experiments demonstrate that our framework achieves superior verification performance over state‑of‑the‑art LLMs and commercial baselines. Our work provides the necessary infrastructure to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.

PaperID: 502, https://arxiv.org/pdf/2602.23258.pdf GitHub

Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

Title: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Abstract:
While Multi‑Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine‑tuning, limiting their adaptability. We propose AgentDropoutV2 (ADv2), a test‑time rectify‑or‑reject pruning framework that dynamically optimizes MAS information flow. Acting as an active firewall, ADv2 intercepts agent outputs and employs a retrieval‑augmented rectifier to iteratively correct errors. This rectification is guided by an indicator pool, which is constructed offline by distilling error patterns from historical MAS failure trajectories. Irreparable outputs are subsequently pruned to prevent error propagation. Empirical results demonstrate that ADv2 significantly boosts performance on both fixed and dynamic MAS frameworks, achieving average accuracy gains of 6.39 and 2.28 percentage points on extensive math and code benchmarks, respectively. Furthermore, ADv2 exhibits remarkable adaptivity, dynamically modulating rectification efforts based on task difficulty to resolve a wide spectrum of error patterns. Our code is released at https://github.com/TonySY2/AgentDropoutV2.

PaperID: 503, https://arxiv.org/pdf/2602.22959.pdf GitHub

Authors: Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

Title: Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Abstract:
The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent‑based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard‑to‑separate diseases in a zero‑shot setting. We benchmark representative agents on two imaging‑only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi‑agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11‑percentage‑point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real‑world settings. Within this controlled setting, this pilot study provides preliminary insights into zero‑shot agent performance in visually confounded scenarios.

PaperID: 504, https://arxiv.org/pdf/2602.22786.pdf GitHub

Authors: Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu

Title: QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Abstract:
Value decomposition (VD) methods have achieved remarkable success in cooperative multi‑agent reinforcement learning (MARL). However, their reliance on the max operator for temporal‑difference (TD) target calculation leads to systematic Q‑value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q‑learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near‑greedy joint action space. This formulation allows the target to integrate Q‑values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl‑qsim.

PaperID: 505, https://arxiv.org/pdf/2602.22402.pdf GitHub

Authors: Cosmo Santoni

Title: Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents

Abstract:
As large language models engage in extended reasoning tasks, they accumulate significant state ‑‑ architectural mappings, trade‑off decisions, codebase conventions ‑‑ within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system that treats accumulated LLM understanding as version‑controlled state. Borrowing from operating system virtual memory, CMV models session history as a Directed Acyclic Graph (DAG) with formally defined snapshot, branch, and trim primitives that enable context reuse across independent parallel sessions. We introduce a three‑pass structurally lossless trimming algorithm that preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant overhead by stripping mechanical bloat such as raw tool outputs, base64 images, and metadata. A single‑user case‑study evaluation across 76 real‑world coding sessions demonstrates that trimming remains economically viable under prompt caching, with the strongest gains in mixed tool‑use sessions, which average 39% reduction and reach break‑even within 10 turns. A reference implementation is available at https://github.com/CosmoNaught/claude‑code‑cmv.

PaperID: 506, https://arxiv.org/pdf/2602.21394.pdf GitHub

Authors: Xuan Chen, Hao Liu, Tao Yuan, Mehran Kafai, Piotr Habas, Xiangyu Zhang

Title: MemoPhishAgent: Memory-Augmented Multi-Modal LLM Agent for Phishing URL Detection

Abstract:
Traditional phishing website detection relies on static heuristics or reference lists, which lag behind rapidly evolving attacks. While recent systems incorporate large language models (LLMs), they are still prompt‑based, deterministic pipelines that underutilize reasoning capability. We present MemoPhishAgent (MPA), a memory‑augmented multi‑modal LLM agent that dynamically orchestrates phishing‑specific tools and leverages episodic memories of past reasoning trajectories to guide decisions on recurring and novel threats. On two public datasets, MPA outperforms three state‑of‑the‑art (SOTA) baselines, improving recall by 13.6%. To better reflect realistic, user‑facing phishing detection performance, we further evaluate MPA on a benchmark of real‑world suspicious URLs actively crawled from five social media platforms, where it improves recall by 20%. Detailed analysis shows episodic memory contributes up to 27% recall gain without introducing additional computational overhead. The ablation study confirms the necessity of the agent‑based approach compared to prompt‑based baselines and validates the effectiveness of our tool design. Finally, MPA is deployed in production, processing 60K targeted high‑risk URLs weekly, and achieving 91.44% recall, providing proactive protection for millions of customers. Together, our results show that combining multi‑modal reasoning with episodic memory yields robust phishing detection in realistic user‑exposure settings. Our implementation is available at https://github.com/XuanChen‑xc/MemoPhishAgent.git.

PaperID: 507, https://arxiv.org/pdf/2602.21220.pdf GitHub

Authors: Subhadip Mitra

Title: Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation

Abstract:
We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi‑agent scenarios. We evaluate the system on two established long‑context benchmarks: LoCoMo (ACL 2024) with 300‑turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi‑session reasoning over 500+ turns. On LongMemEval, the field‑theoretic approach achieves significant improvements: +116% F1 on multi‑session reasoning (p<0.01, d= 3.06), +43.8% on temporal reasoning (p<0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p<0.001, d= 5.00). Multi‑agent experiments show near‑perfect collective intelligence (>99.8%) through field coupling. Code is available at github.com/rotalabs/rotalabs‑fieldmem.

PaperID: 508, https://arxiv.org/pdf/2602.21137.pdf GitHub

Authors: Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

Title: UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Abstract:
Understanding the complex, multi‑agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real‑world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event‑driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question‑answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception‑reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero‑shot accuracy, fine‑tuning the smaller Qwen2.5‑VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human‑centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy‑aware, and real‑world multimodal reasoning. UDVideoQA is available at https://ud‑videoqa.github.io/UD‑VideoQA/UD‑VideoQA/.

PaperID: 509, https://arxiv.org/pdf/2602.21136.pdf GitHub

Authors: David Anugraha, Vishakh Padmakumar, Diyi Yang

Title: SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

Abstract:
Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi‑structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a principled mechanism for balancing systematic coverage of predefined topics with adaptive exploration, or the ability to pursue follow‑ups, deep dives, and emergent themes that arise organically during conversation. In this work, we formulate adaptive semi‑structured interviewing as an optimization problem over the interviewer's behavior. We define interview utility as a trade‑off between coverage of a predefined interview topic guide, discovery of relevant emergent themes, and interview cost measured by length. Based on this formulation, we introduce SparkMe, a multi‑agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility. We evaluate SparkMe through controlled experiments with LLM‑based interviewees, showing that it achieves higher interview utility, improving topic guide coverage (+4.7% over the best baseline) and eliciting richer emergent insights while using fewer conversational turns than prior LLM interviewing approaches. We further validate SparkMe in a user study with 70 participants across 7 professions on the impact of AI on their workflows. Domain experts rate SparkMe as producing high‑quality adaptive interviews that surface helpful profession‑specific insights not captured by prior approaches. The code, datasets, and evaluation protocols for SparkMe are available as open‑source at https://github.com/SALT‑NLP/SparkMe.

PaperID: 510, https://arxiv.org/pdf/2602.20478.pdf GitHub

Authors: Aristidis Vasilopoulos

Title: Codified Context: Infrastructure for AI Agents in a Complex Codebase

Abstract:
LLM‑based agentic coding assistants lack persistent memory: they lose coherence across sessions, forget project conventions, and repeat known mistakes. Recent studies characterize how developers configure agents through manifest files, but an open challenge remains how to scale such configurations for large, multi‑agent projects. This paper presents a three‑component codified context infrastructure developed during construction of a 108,000‑line C# distributed system: (1) a hot‑memory constitution encoding conventions, retrieval hooks, and orchestration protocols; (2) 19 specialized domain‑expert agents; and (3) a cold‑memory knowledge base of 34 on‑demand specification documents. Quantitative metrics on infrastructure growth and interaction patterns across 283 development sessions are reported alongside four observational case studies illustrating how codified context propagates across sessions to prevent failures and maintain consistency. The framework is published as an open‑source companion repository.

PaperID: 511, https://arxiv.org/pdf/2602.20214.pdf GitHub

Authors: Jing Zhang

Title: Right to History: A Sovereignty Kernel for Verifiable AI Agent Execution

Abstract:
AI agents increasingly act on behalf of humans, yet no existing system provides a tamper‑evident, independently verifiable record of what they did. As regulations such as the EU AI Act begin mandating automatic logging for high‑risk AI systems, this gap carries concrete consequences ‑‑ especially for agents running on personal hardware, where no centralized provider controls the log. Extending Floridi's informational rights framework from data about individuals to actions performed on their behalf, this paper proposes the Right to History: the principle that individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The paper formalizes this principle through five system invariants with structured proof sketches, and implements it in PunkGo, a Rust sovereignty kernel that unifies RFC 6962 Merkle tree audit logs, capability‑based isolation, energy‑budget governance, and a human‑approval mechanism. Adversarial testing confirms all five invariants hold. Performance evaluation shows sub‑1.3 ms median action latency, ~400 actions/sec throughput, and 448‑byte Merkle inclusion proofs at 10,000 log entries.

PaperID: 512, https://arxiv.org/pdf/2602.20093.pdf GitHub

Authors: Kun Yang, Yuxuan Zhu, Yazhe Chen, Siyao Zheng, Bangyang Hong, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Hui Li

Title: ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

Abstract:
Sequential recommendation increasingly employs latent multi‑step reasoning to enhance test‑time computation. Despite empirical gains, existing approaches largely drive intermediate reasoning states via target‑dominant objectives without imposing explicit feasibility constraints. This results in latent drift, where reasoning trajectories deviate into implausible regions. We argue that effective recommendation reasoning should instead be viewed as navigation on a collaborative manifold rather than free‑form latent refinement. To this end, we propose ManCAR (Manifold‑Constrained Adaptive Reasoning), a principled framework that grounds reasoning within the topology of a global interaction graph. ManCAR constructs a local intent prior from the collaborative neighborhood of a user's recent actions, represented as a distribution over the item simplex. During training, the model progressively aligns its latent predictive distribution with this prior, forcing the reasoning trajectory to remain within the valid manifold. At test time, reasoning proceeds adaptively until the predictive distribution stabilizes, avoiding over‑refinement. We provide a variational interpretation of ManCAR to theoretically validate its drift‑prevention and adaptive test‑time stopping mechanisms. Experiments on seven benchmarks demonstrate that ManCAR consistently outperforms state‑of‑the‑art baselines, achieving up to a 46.88% relative improvement w.r.t. NDCG@10. Our code is available at https://github.com/FuCongResearchSquad/ManCAR.

PaperID: 513, https://arxiv.org/pdf/2602.19225.pdf GitHub

Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, Peilin Zhao

Title: Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

Abstract:
Multi‑turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e‑commerce assistance, and interactive task management, where accurately distinguishing high‑value informative signals from stochastic noise is critical for sample‑efficient training. In real‑world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high‑difficulty task signifies a genuine capability breakthrough. Yet, existing group‑based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity‑based Multi‑turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real‑world deployment. ProxMO integrates global context via two lightweight mechanisms: success‑rate‑aware modulation dynamically adapts gradient intensity based on episode‑level difficulty, while proximity‑based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug‑and‑play compatibility with standard GRPO frameworks, facilitating immediate, low‑friction adoption in existing industrial training pipelines. Our implementation is available at: \hrefhttps://anonymous.4open.science/r/proxmo‑B7E7/README.mdhttps://anonymous.4open.science/r/proxmo.

PaperID: 514, https://arxiv.org/pdf/2602.18998.pdf GitHub

Authors: Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

Title: Benchmark Test-Time Scaling of General LLM Agents

Abstract:
LLM agents are increasingly expected to function as general‑purpose systems capable of resolving open‑ended user requests. While existing benchmarks focus on domain‑aware environments for developing specialized agents, evaluating general‑purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool‑use domains. Using General AgentBench, we systematically study test‑time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain‑specific evaluations to this general‑agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General‑AgentBench.

PaperID: 515, https://arxiv.org/pdf/2602.18916.pdf GitHub

Authors: Hoang-Loc Cao, Phuc Ho, Truong Thanh Hung Nguyen, Phuc Truong Loc Nguyen, Dinh Thien Loc Nguyen, Hung Cao

Title: Adaptive Collaboration of Arena-Based Argumentative LLMs for Explainable and Contestable Legal Reasoning

Abstract:
Legal reasoning requires not only high accuracy but also the ability to justify decisions through verifiable and contestable arguments. However, existing Large Language Model (LLM) approaches, such as Chain‑of‑Thought (CoT) and Retrieval‑Augmented Generation (RAG), often produce unstructured explanations that lack a formal mechanism for verification or user intervention. To address this limitation, we propose Adaptive Collaboration of Argumentative LLMs (ACAL), a neuro‑symbolic framework that integrates adaptive multi‑agent collaboration with an Arena‑based Quantitative Bipolar Argumentation Framework (A‑QBAF). ACAL dynamically deploys expert agent teams to construct arguments, employs a clash resolution mechanism to adjudicate conflicting claims, and utilizes uncertainty‑aware escalation for borderline cases. Crucially, our framework supports a Human‑in‑the‑Loop (HITL) contestability workflow, enabling users to directly audit and modify the underlying reasoning graph to influence the final judgment. Empirical evaluations on the LegalBench benchmark demonstrate that ACAL outperforms strong baselines across Gemini‑2.5‑Flash‑Lite and Gemini‑2.5‑Flash architectures, effectively balancing efficient predictive performance with structured transparency and contestability. Our implementation is available at: https://github.com/loc110504/ACAL.

PaperID: 516, https://arxiv.org/pdf/2602.18230.pdf GitHub

Authors: Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao

Title: [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Abstract:
Large Language Models (LLMs) demonstrate significant potential in multi‑agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model‑comparative evaluations.

PaperID: 517, https://arxiv.org/pdf/2602.18064.pdf GitHub GitHub

Authors: Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin

Title: 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

Abstract:
3D CT analysis spans a continuum from low‑level perception to high‑level clinical understanding. Existing 3D‑oriented analysis methods adopt either isolated task‑specific modeling or task‑agnostic end‑to‑end paradigms to produce one‑hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D‑oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D‑specific fine‑tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long‑term structured memory that aggregates intermediate tool outputs and supports query‑adaptive, evidence‑driven multi‑step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception‑to‑understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D‑specific MLLMs, highlighting a scalable path toward general‑purpose 3D clinical assistants.Code and data are available at \hrefhttps://github.com/jinlab‑imvr/3DMedAgenthttps://github.com/jinlab‑imvr/3DMedAgent.

PaperID: 518, https://arxiv.org/pdf/2602.17590.pdf GitHub GitHub GitHub

Authors: Agnieszka M. Zbrzezny

Title: BMC4TimeSec: Verification Of Timed Security Protocols

Abstract:
We present BMC4TimeSec, an end‑to‑end tool for verifying Timed Security Protocols (TSP) based on SMT‑based bounded model checking and multi‑agent modelling in the form of Timed Interpreted Systems (TIS) and Timed Interleaved Interpreted Systems (TIIS). In BMC4TimeSec, TSP executions implement the TIS/TIIS environment (join actions, interleaving, delays, lifetimes), and knowledge automata implement the agents (evolution of participant knowledge, including the intruder). The code is publicly available on \hrefhttps://github.com/agazbrzezny/BMC4TimeSecGitHub, as is a \hrefhttps://youtu.be/aNybKz6HwdAvideo demonstration.

PaperID: 519, https://arxiv.org/pdf/2602.17062.pdf GitHub

Authors: Yonghyeon Jo, Sunwoo Lee, Seungyul Han

Title: Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Abstract:
Value decomposition is a core approach for cooperative multi‑agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub‑value Q‑learning (S2Q), which learns multiple sub‑value functions to retain alternative high‑value actions. Incorporating these sub‑value functions into a Softmax‑based behavior policy, S2Q encourages persistent exploration and enables Q^\texttot to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.

PaperID: 520, https://arxiv.org/pdf/2602.16943.pdf GitHub

Authors: Arnold Cartagena, Ariane Teixeira

Title: Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Abstract:
Large language models deployed as agents increasingly interact with external systems through tool calls‑‑actions with real‑world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text‑level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text‑level safety and tool‑call‑level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety‑reinforced, and tool‑encouraging), and two prompt variants, producing 17,420 analysis‑ready datapoints. Our central finding is that text safety does not transfer to tool‑call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action‑‑a divergence we formalize as the GAP metric. Even under safety‑reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool‑call behavior: TC‑safe rates span 21 percentage points for the most robust model and 57 for the most prompt‑sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool‑call attempts themselves. These results demonstrate that text‑only safety evaluations are insufficient for assessing agent behavior and that tool‑call safety requires dedicated measurement and mitigation.

PaperID: 521, https://arxiv.org/pdf/2602.16901.pdf GitHub

Authors: Tanqiu Jiang, Yuhui Wang, Jiacheng Liang, Ting Wang

Title: AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Abstract:
LLM agents are increasingly deployed in long‑horizon, complex environments to solve challenging problems, but this expansion exposes them to long‑horizon attacks that exploit multi‑turn user‑agent‑environment interactions to achieve objectives infeasible in single‑turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long‑horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long‑horizon attacks; moreover, defenses designed for single‑turn interactions fail to reliably mitigate long‑horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings. The benchmark is publicly available at https://tanqiujiang.github.io/AgentLAB_main.

PaperID: 522, https://arxiv.org/pdf/2602.16898.pdf GitHub

Authors: Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

Title: MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Abstract:
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed‑loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real‑world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .

PaperID: 523, https://arxiv.org/pdf/2602.16855.pdf GitHub

Authors: Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan

Title: Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Abstract:
The paper introduces GUI‑Owl‑1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud‑edge collaboration and real‑time interaction. GUI‑Owl‑1.5 achieves state‑of‑the‑art results on more than 20+ GUI benchmarks on open‑source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool‑calling tasks, it obtains 47.6 on OSWorld‑MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI‑Knowledge Bench. GUI‑Owl‑1.5 incorporates several key innovations: (1) Hybird Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud‑based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought‑synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi‑agent adaptation; (3) Multi‑platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi‑platform conflicts and the low training efficiency of long‑horizon tasks. The GUI‑Owl‑1.5 models are open‑sourced, and an online cloud‑sandbox demo is available at https://github.com/X‑PLUG/MobileAgent.

PaperID: 524, https://arxiv.org/pdf/2602.16063.pdf GitHub

Authors: Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera

Title: MARLEM: A Multi-Agent Reinforcement Learning Simulation Framework for Implicit Cooperation in Decentralized Local Energy Markets

Abstract:
This paper introduces a novel, open‑source MARL simulation framework for studying implicit cooperation in LEMs, modeled as a decentralized partially observable Markov decision process and implemented as a Gymnasium environment for MARL. Our framework features a modular market platform with plug‑and‑play clearing mechanisms, physically constrained agent models (including battery storage), a realistic grid network, and a comprehensive analytics suite to evaluate emergent coordination. The main contribution is a novel method to foster implicit cooperation, where agents' observations and rewards are enhanced with system‑level key performance indicators to enable them to independently learn strategies that benefit the entire system and aim for collectively beneficial outcomes without explicit communication. Through representative case studies (available in a dedicated GitHub repository in https://github.com/salazarna/marlem, we show the framework's ability to analyze how different market configurations (such as varying storage deployment) impact system performance. This illustrates its potential to facilitate emergent coordination, improve market efficiency, and strengthen grid stability. The proposed simulation framework is a flexible, extensible, and reproducible tool for researchers and practitioners to design, test, and validate strategies for future intelligent, decentralized energy systems.

PaperID: 525, https://arxiv.org/pdf/2602.15721.pdf GitHub

Authors: Jingtian Yan, Yulun Zhang, Zhenting Liu, Han Zhang, He Jiang, Jingkai Chen, Stephen F. Smith, Jiaoyang Li

Title: Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems

Abstract:
We present Lifelong Scalable Multi‑Agent Realistic Testbed (LSMART), an open‑source simulator to evaluate any Multi‑Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a variant of MAPF that continuously assigns new goals for agents to reach. LMAPF applications, such as autonomous warehouses, often require a centralized, lifelong system to coordinate the movement of a fleet of robots, typically AGVs. However, existing works on MAPF and LMAPF often assume simplified kinodynamic models, such as pebble motion, as well as perfect execution and communication for AGVs. Prior work has presented SMART, a software capable of evaluating any MAPF algorithms while considering agent kinodynamics, communication delays, and execution uncertainties. However, SMART is designed for MAPF, not LMAPF. Generalizing SMART to an FMS requires many more design choices. First, an FMS parallelizes planning and execution, raising the question of when to plan. Second, given planners with varying optimality and differing agent‑model assumptions, one must decide how to plan. Third, when the planner fails to return valid solutions, the system must determine how to recover. In this paper, we first present LSMART, an open‑source simulator that incorporates all these considerations to evaluate any MAPF algorithms in an FMS. We then provide experiment results based on state‑of‑the‑art methods for each design choice, offering guidance on how to effectively design centralized lifelong AGV Fleet Management Systems. LSMART is available at https://smart‑mapf.github.io/lifelong‑smart.

PaperID: 526, https://arxiv.org/pdf/2602.14799.pdf GitHub

Authors: Javier González Villasmil

Title: Scalable Multi-Robot Path Planning via Quadratic Unconstrained Binary Optimization

Abstract:
Multi‑Agent Path Finding (MAPF) remains a fundamental challenge in robotics, where classical centralized approaches exhibit exponential growth in joint‑state complexity as the number of agents increases. This paper investigates Quadratic Unconstrained Binary Optimization (QUBO) as a structurally scalable alternative for simultaneous multi‑robot path planning. This approach is a robotics‑oriented QUBO formulation incorporating BFS‑based logical pre‑processing (achieving over 95% variable reduction), adaptive penalty design for collision and constraint enforcement, and a time‑windowed decomposition strategy that enables execution within current hardware limitations. An experimental evaluation in grid environments with up to four robots demonstrated near‑optimal solutions in dense scenarios and favorable scaling behavior compared to sequential classical planning. These results establish a practical and reproducible baseline for future quantum and quantum‑inspired multi‑robot coordinations.

PaperID: 527, https://arxiv.org/pdf/2602.14257.pdf GitHub

Authors: Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

Title: AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Abstract:
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real‑world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi‑round interaction with professional marketing tools. To address this gap, we propose AD‑Bench, a benchmark designed based on real‑world business requirements of advertising and marketing platforms. AD‑Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool‑call trajectories. The benchmark categorizes requests into three difficulty levels (L1‑L3) to evaluate agents' capabilities under multi‑round, multi‑tool collaboration. Experiments show that on AD‑Bench, Gemini‑3‑Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state‑of‑the‑art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD‑Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at https://github.com/Emanual20/adbench‑leaderboard.

PaperID: 528, https://arxiv.org/pdf/2602.14200.pdf GitHub

Authors: Nicolas Zumarraga, Thomas Kaar, Ning Wang, William Tennien, Alpay Hasanli, Max Rosenblattl, Fan Wu, Kevin Riehl, Maxwell A. Xu, Markus Kreft, Kevin O'Sullivan, Elgar Fleisch, Paul Schmiedmayer, Robert Jakob, Patrick Langer

Title: TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

Abstract:
Time Series Language Models (TSLMs) promise reasoning over real‑world temporal data, but their ability to retrieve and reason over long time‑series remains largely untested. We introduce TS‑Haystack, a multi‑domain retrieval benchmark with ten event‑grounded question‑answering tasks over contexts from 100 seconds to 24 hours, spanning direct retrieval, temporal reasoning, multi‑step reasoning, and contextual anomaly detection. Existing TSLMs exhibit severe long‑context degradation: accuracy declines with context length, direct‑tokenization models run out of memory beyond 100 seconds on high‑rate signals, and time‑interval‑grounded tasks collapse toward near‑zero accuracy when increasing the time‑series lengths, aligning with existing literature on text and multi‑modal long context retrieval. An agentic retrieval framework using specialized time‑series classifier tools matches or outperforms SoTA TSLMs on 9 of 10 tasks, highlighting agentic retrieval as a promising approach for long‑context TSLMs.

PaperID: 529, https://arxiv.org/pdf/2602.14160.pdf GitHub

Authors: Chaeeun Lee, T. Michael Yates, Pasquale Minervini, T. Ian Simpson

Title: Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

Abstract:
Clinical decision‑making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi‑agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process‑grounded reasoning aligned with clinical standards. One critical real‑world case of this is gene‑disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent‑as‑tool reinforcement learning framework for this task with two objectives: (i) process‑level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi‑agent system. Our evaluation on the ClinGen dataset shows that with outcome‑only rewards, MAS with a GRPO‑trained Qwen3‑4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO‑trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee‑io/GeneDiseaseCurationAgents.

PaperID: 530, https://arxiv.org/pdf/2602.13999.pdf GitHub

Authors: Haozheng Xu, Wenhao Li, Zifan Wei, Bo Jin, Hongxing Bai, Ben Yang, Xiangfeng Wang

Title: It Takes Two to Tango: A Holistic Simulator for Joint Order Scheduling and Multi-Agent Path Finding in Robotic Warehouses

Abstract:
The prevailing paradigm in Robotic Mobile Fulfillment Systems (RMFS) typically treats order scheduling and multi‑agent pathfinding as isolated sub‑problems. We argue that this decoupling is a fundamental bottleneck, masking the critical dependencies between high‑level dispatching and low‑level congestion. Existing simulators fail to bridge this gap, often abstracting away heterogeneous kinematics and stochastic execution failures. We propose WareRover, a holistic simulation platform that enforces a tight coupling between OS and MAPF via a unified, closed‑loop optimization interface. Unlike standard benchmarks, WareRover integrates dynamic order streams, physics‑aware motion constraints, and non‑nominal recovery mechanisms into a single evaluation loop. Experiments reveal that SOTA algorithms often falter under these realistic coupled constraints, demonstrating that WareRover provides a necessary and challenging testbed for robust, next‑generation warehouse coordination. The project and video is available at https://hhh‑x.github.io/WareRover/.

PaperID: 531, https://arxiv.org/pdf/2602.13840.pdf GitHub

Authors: Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen

Title: PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

Abstract:
Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context‑dependent information, where privacy violations may arise in agents' action due to the implicitness of contextual privacy. Existing approaches rely on external, inference‑time interventions which are brittle, scenario‑specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy‑aware multi‑agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy‑compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system‑wide contextual integrity while achieving a more favorable privacy‑helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero‑shot generalization and robustness across diverse multi‑agent topologies. Code is available at https://github.com/chengyh23/PrivAct.

PaperID: 532, https://arxiv.org/pdf/2602.13769.pdf GitHub

Authors: Qi Liu, Ruochen Hao, Can Li, Wanjing Ma

Title: OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

Abstract:
Automating scientific discovery in complex, experiment‑driven domains requires more than iterative mutation of programs; it demands structured hypothesis management, environment interaction, and principled reflection. We present OR‑Agent, a configurable multi‑agent research framework designed for automated exploration in rich experimental environments. OR‑Agent organizes research as a structured tree‑based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation‑crossover loops. At its core, we introduce an evolutionary‑systematic ideation mechanism that unifies evolutionary selection of research starting points, comprehensive research plan generation, and coordinated exploration within a research tree. We introduce a hierarchical optimization‑inspired reflection system in which short‑term reflections act as verbal gradients, long‑term reflections as verbal momentum, and memory compression as semantic weight decay, collectively forming a principled mechanism for governing research dynamics. We conduct extensive experiments across classical combinatorial optimization benchmarks as well as simulation‑based cooperative driving scenarios. Results demonstrate that OR‑Agent outperforms strong evolutionary baselines while providing a general, extensible, and inspectable framework for AI‑assisted scientific discovery. All code and experimental data are publicly available at https://github.com/qiliuchn/OR‑Agent.

PaperID: 533, https://arxiv.org/pdf/2602.13346.pdf GitHub GitHub

Authors: Zhen Wang, Yiming Gao, Jieyuan Liu, Enze Ma, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Zhiting Hu, Wei Wang, Trey Ideker, Eric P. Xing

Title: CellMaster: Collaborative Cell Type Annotation in Single-Cell Analysis

Abstract:
Single‑cell RNA‑seq (scRNA‑seq) enables atlas‑scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue‑ and state‑dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero‑shot cell‑type annotation. Unlike existing automated tools, CellMaster leverages LLM‑encoded knowledge (e.g., GPT‑4o) to perform on‑the‑fly annotation with interpretable rationales, without pre‑training or fixed marker databases. Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best‑performing baselines (including CellTypist and scTab) in automatic mode. With human‑in‑the‑loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at \hrefhttps://github.com/AnonymousGym/CellMasterhttps://github.com/AnonymousGym/CellMaster.

PaperID: 534, https://arxiv.org/pdf/2602.13318.pdf GitHub

Authors: Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan

Title: DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

Abstract:
Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout‑aware rendering, and robust multi‑turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi‑agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide‑level and deck‑level fidelity, coherence, layout quality, and multi‑turn instruction following. We further implement a modular multi‑agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi‑agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan‑heisler/DeckBench .

PaperID: 535, https://arxiv.org/pdf/2602.13258.pdf GitHub

Authors: Deepak Babu Piskala

Title: MAPLE: A Sub-Agent Architecture for Memory, Learning, and Personalization in Agentic AI Systems

Abstract:
Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory‑Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real‑time within finite context budgets. Each component operates as a dedicated sub‑agent with specialized tooling and well‑defined interfaces. Experimental evaluation on the MAPLE‑Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p < 0.01, Cohen's d = 0.95) and increases trait incorporation rate from 45% to 75% ‑‑ enabling agents that genuinely learn and adapt.

PaperID: 536, https://arxiv.org/pdf/2602.13218.pdf GitHub GitHub

Authors: Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li

Title: Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

Abstract:
Reinforcement Learning from Verifiable Rewards (RLVR) is bottlenecked by data: existing synthesis pipelines rely on expert‑written code or fixed templates, confining growth to instance‑level perturbations. We shift the evolvable unit from problem instances to task‑family specifications. SSLogic is an agentic meta‑synthesis framework in which LLM agents iteratively author and refine executable Generator‑Validator pairs inside a closed Generate‑Validate‑Refine loop, producing families with new rules and difficulty gradients rather than parameter variations of old ones. A Multi‑Gate Validation Protocol ‑‑ multi‑strategy consensus plus Adversarial Blind Review, where independent agents solve each instance by writing and executing code ‑‑ filters ill‑posed tasks before they enter training. Starting from 400 seed families, two evolution rounds yield 953 families and 21,389 verifiable instances. Three converging comparisons (step‑matched, token‑matched, and size‑controlled on external Enigmata data) consistently show higher training utility of evolved data, with gains of SynLogic +5.2, AIME25 +3.0, and BBH +5.5 on Enigmata. Fine‑grained KORBench evaluation reveals selective improvements in logic (+13.2%) and operation (+9.6%), linking structural evolution to downstream gains. Code: https://github.com/AdAstraAbyssoque/Scaling‑the‑Scaling‑Logic

PaperID: 537, https://arxiv.org/pdf/2602.12517.pdf GitHub GitHub

Authors: Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière

Title: Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

Abstract:
The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large‑scale multi‑agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench‑MFG), focusing on the discrete‑time, discrete‑space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no‑interaction and monotone games to potential and dynamics‑coupled games, and provide prototypical environments for each. Furthermore, we propose MF‑Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black‑box approach (MF‑PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at \hrefhttps://github.com/lorenzomagnino/Bench‑MFGhttps://github.com/lorenzomagnino/Bench‑MFG.

PaperID: 538, https://arxiv.org/pdf/2602.12316.pdf GitHub

Authors: Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin

Title: GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Abstract:
Frontier AI systems are increasingly capable and deployed in high‑stakes multi‑agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi‑agent risks such as coordination failure and conflict poorly understood. We introduce GT‑HarmBench, a benchmark of 1,535 high‑stakes scenarios spanning game‑theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high‑stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game‑theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game‑theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi‑agent environments. The benchmark and code are available at https://github.com/causalNLP/gt‑harmbench.

PaperID: 539, https://arxiv.org/pdf/2602.11918.pdf GitHub

Authors: Taian Guo, Haiyang Shen, Junyu Luo, Zhongshi Xing, Hanchun Lian, Jinsheng Huang, Binqi Chen, Luchen Liu, Yun Ma, Ming Zhang

Title: MEME: Modeling the Evolutionary Modes of Financial Markets

Abstract:
LLMs have demonstrated significant potential in quantitative finance by processing vast unstructured data to emulate human‑like analytical workflows. However, current LLM‑based methods primarily follow either an Asset‑Centric paradigm focused on individual stock prediction or a Market‑Centric approach for portfolio allocation, often remaining agnostic to the underlying reasoning that drives market movements. In this paper, we propose a Logic‑Oriented perspective, modeling the financial market as a dynamic, evolutionary ecosystem of competing investment narratives, termed Modes of Thought. To operationalize this view, we introduce MEME (Modeling the Evolutionary Modes of Financial Markets), designed to reconstruct market dynamics through the lens of evolving logics. MEME employs a multi‑agent extraction module to transform noisy data into high‑fidelity Investment Arguments and utilizes Gaussian Mixture Modeling to uncover latent consensus within a semantic space. To model semantic drift among different market conditions, we also implement a temporal evaluation and alignment mechanism to track the lifecycle and historical profitability of these modes. By prioritizing enduring market wisdom over transient anomalies, MEME ensures that portfolio construction is guided by robust reasoning. Extensive experiments on three heterogeneous Chinese stock pools from 2023 to 2025 demonstrate that MEME consistently outperforms seven SOTA baselines. Further ablation studies, sensitivity analysis, lifecycle case study and cost analysis validate MEME's capacity to identify and adapt to the evolving consensus of financial markets. Our implementation can be found at https://github.com/gta0804/MEME.

PaperID: 540, https://arxiv.org/pdf/2602.11735.pdf GitHub

Authors: Wanhao Liu, Junhong Dai, Yixuan Zhang, Shengyun Yin, Panshuo Li

Title: AC-MASAC: An Attentive Curriculum Learning Framework for Heterogeneous UAV Swarm Coordination

Abstract:
Cooperative path planning for heterogeneous UAV swarms poses significant challenges for Multi‑Agent Reinforcement Learning (MARL), particularly in handling asymmetric inter‑agent dependencies and addressing the risks of sparse rewards and catastrophic forgetting during training. To address these issues, this paper proposes an attentive curriculum learning framework (AC‑MASAC). The framework introduces a role‑aware heterogeneous attention mechanism to explicitly model asymmetric dependencies. Moreover, a structured curriculum strategy is designed, integrating hierarchical knowledge transfer and stage‑proportional experience replay to address the issues of sparse rewards and catastrophic forgetting. The proposed framework is validated on a custom multi‑agent simulation platform, and the results show that our method has significant advantages over other advanced methods in terms of Success Rate, Formation Keeping Rate, and Success‑weighted Mission Time. The code is available at \textcolorredhttps://github.com/Wanhao‑Liu/AC‑MASAC.

PaperID: 541, https://arxiv.org/pdf/2602.11510.pdf GitHub

Authors: Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

Title: AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems

Abstract:
Multi‑agent Large Language Model (LLM) systems create privacy risks that current benchmarks cannot measure. When agents coordinate on tasks, sensitive data passes through inter‑agent messages, shared memory, and tool arguments, all pathways that output‑only audits never inspect. We introduce AgentLeak, to the best of our knowledge the first full‑stack benchmark for privacy leakage covering internal channels. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains, paired with a 32‑class attack taxonomy and a three‑tier detection pipeline. A factorial evaluation crossing five production LLMs (GPT‑4o, GPT‑4o‑mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B) with all 1,000 scenarios, yielding 4,979 validated execution traces, reveals that multi‑agent configurations reduce per‑channel output leakage (C1: 27.2% vs 43.2% in single‑agent) but introduce unmonitored internal channels that raise total system exposure to 68.9% (aggregated across C1, C2, C5). Internal channels account for most of this gap: inter‑agent messages (C2) leak at 68.8%, compared to 27.2% on C1 (output channel). This means that output‑only audits miss 41.7% of violations. Safety‑aligned models achieve lower leakage on both external and internal channels, yet no model eliminates it. Across all five models and four domains, the pattern C2 \geq C1 holds consistently, confirming that inter‑agent communication is the primary vulnerability. These results establish that output‑only auditing is fundamentally insufficient for multi‑agent systems and that privacy controls must be extended to inter‑agent communication channels.

PaperID: 542, https://arxiv.org/pdf/2602.11509.pdf GitHub

Authors: David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Title: Multimodal Fact-Level Attribution for Verifiable Reasoning

Abstract:
Multimodal large language models (MLLMs) are increasingly used for real‑world tasks involving multi‑step reasoning and long‑form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation‑based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact‑level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade‑off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.

PaperID: 543, https://arxiv.org/pdf/2602.11437.pdf GitHub

Authors: Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman

Title: Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value‑factorization methods enforce the individual‑global‑maximum (IGM) principle so that decentralized greedy actions recover the team‑optimal joint action. However, the reliability of this recipe in real‑world settings remains unreliable due to environmental uncertainties arising from the sim‑to‑real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team‑optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM‑compliant robust variants of existing value‑factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q‑targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per‑agent reward shaping. Empirically, on high‑fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out‑of‑distribution performance. Code and data are available at https://github.com/crqu/robust‑coMARL.

PaperID: 544, https://arxiv.org/pdf/2602.11409.pdf GitHub

Authors: Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, Amit Ranjan Trivedi

Title: TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning

Abstract:
Estimating uncertainty for AI agents in real‑world multi‑turn tool‑using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user‑agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single‑shot text generation and therefore miss these trajectory‑level breakdown signals. We introduce TRACER, a trajectory‑level uncertainty metric for dual‑control Tool‑Agent‑User interaction. TRACER combines content‑aware surprisal with situational‑awareness signals, semantic and lexical repetition, and tool‑grounded coherence gaps, and aggregates them using a tail‑focused risk functional with a MAX‑composite step risk to surface decisive anomalies. We evaluate TRACER on τ^2‑bench by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool‑use settings. Our code and benchmark are available at https://github.com/sinatayebati/agent‑tracer.

PaperID: 545, https://arxiv.org/pdf/2602.11354.pdf GitHub

Authors: Bang Nguyen, Dominik Soós, Qian Ma, Rochana R. Obadage, Zack Ranjan, Sai Koneru, Anna Szabelska, Adam Gill, Timothy M. Errington, Shakhlo Nematova, Sarah Rajtmajer, Jian Wu, Meng Jiang

Title: ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Abstract:
The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground‑truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non‑replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end‑to‑end benchmark, including human‑verified replicable and non‑replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm‑benchmarking.

PaperID: 546, https://arxiv.org/pdf/2602.11351.pdf GitHub

Authors: Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao

Title: Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

Abstract:
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real‑world, user‑centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi‑turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement, as passive agents can not efficiently adapt to users' intentions while overuse of human feedback reduces their satisfaction. To address this trade‑off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information‑gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user‑aligned LLM agents in complex multi‑turn scenarios. Our website: https://proactive‑agentic‑rl.github.io/.

PaperID: 547, https://arxiv.org/pdf/2602.11224.pdf GitHub

Authors: Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson

Title: Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Abstract:
We present Agent‑Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real‑world productivity software API tasks via code execution. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade‑offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent‑Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state‑diff contract, which separates process from outcome ‑ rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox built on containerized replicas of enterprise APIs, allowing all models to interact with the same service interfaces through code execution. This enables controlled evaluation against a common set of state‑diff contracts while preserving the structure of real‑world API interaction. Using the Agent‑Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance. Code and data: https://github.com/agent‑diff‑bench/agent‑diff.

PaperID: 548, https://arxiv.org/pdf/2602.10715.pdf GitHub

Authors: Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu

Title: Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Abstract:
Long‑term conversational memory is a core capability for LLM‑based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface‑level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo‑Plus, a benchmark for assessing cognitive memory under cue‑‑trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string‑matching metrics and explicit task‑type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval‑based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo‑Plus.

PaperID: 549, https://arxiv.org/pdf/2602.10139.pdf GitHub

Authors: Lepeng Zhao, Zhenhua Zou, Shuo Li, Zhuotao Liu

Title: Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible

Abstract:
Mobile Graphical User Interface (GUI) agents have demonstrated strong capabilities in automating complex smartphone tasks by leveraging multimodal large language models (MLLMs) and system‑level control interfaces. However, this paradigm introduces significant privacy risks, as agents typically capture and process entire screen contents, thereby exposing sensitive personal data such as phone numbers, addresses, messages, and financial information. Existing defenses either reduce UI exposure, obfuscate only task‑irrelevant content, or rely on user authorization, but none can protect task‑critical sensitive information while preserving seamless agent usability. We propose an anonymization‑based privacy protection framework that enforces the principle of available‑but‑invisible access to sensitive data: sensitive information remains usable for task execution but is never directly visible to the cloud‑based agent. Our system detects sensitive UI content using a PII‑aware recognition model and replaces it with deterministic, type‑preserving placeholders (e.g., PHONE_NUMBER#a1b2c) that retain semantic categories while removing identifying details. A layered architecture comprising a PII Detector, UI Transformer, Secure Interaction Proxy, and Privacy Gatekeeper ensures consistent anonymization across user instructions, XML hierarchies, and screenshots, mediates all agent actions over anonymized interfaces, and supports narrowly scoped local computations when reasoning over raw values is necessary. Extensive experiments on the AndroidLab and PrivScreen benchmarks show that our framework substantially reduces privacy leakage across multiple models while incurring only modest utility degradation, achieving the best observed privacy‑utility trade‑off among existing methods. Code available at: https://github.com/one‑step‑beh1nd/gui_privacy_protection

PaperID: 550, https://arxiv.org/pdf/2602.10081.pdf GitHub

Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang

Title: Anagent For Enhancing Scientific Table & Figure Analysis

Abstract:
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain‑specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long‑context requirements, pose fundamental obstacles to scientific table \& figure analysis. To quantify these challenges, we introduce AnaBench, a large‑scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi‑agent framework for enhanced scientific table \& figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task‑specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five‑dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that Anagent achieves substantial improvements, up to \uparrow 13.43% in training‑free settings and \uparrow 42.12% with finetuning, while revealing that task‑oriented reasoning and context‑aware problem‑solving are essential for high‑quality scientific table \& figure analysis. Our project page: https://xhguo7.github.io/Anagent/.

PaperID: 551, https://arxiv.org/pdf/2602.10007.pdf GitHub

Authors: Bharathkumar Hegde, Melanie Bouroche

Title: A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging

Abstract:
Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi‑Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi‑agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state‑of‑the‑art Multi‑Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL‑MASS, and evaluate it in a congested on‑ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL‑MASS effectively balances the trade‑off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL‑MASS is available with an open‑source licence at https://github.com/hkbharath/MARL‑MASS

PaperID: 552, https://arxiv.org/pdf/2602.09817.pdf GitHub

Authors: Khang Ly, Georgios Cheirmpos, Adrian Raudaschl, Christopher James, Seyed Amin Tabatabaei

Title: AnalyticsGPT: An LLM Workflow for Scientometric Question Answering

Abstract:
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)‑powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta‑scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named‑entity recognition of academic entities within questions and multi‑faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end‑to‑end system implementing a sequential workflow with retrieval‑augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well‑structured high‑level analyses. As a database for retrieval‑augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs‑as‑judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm‑agents‑scientometric‑qa/tree/acl.

PaperID: 553, https://arxiv.org/pdf/2602.09662.pdf GitHub

Authors: Deyang Jiang, Jing Huang, Xuanle Zhao, Lei Chen, Liming Zheng, Fanfan Liu, Haibo Qiu, Peng Shi, Zhixiong Zeng

Title: TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Abstract:
Effectively scaling GUI automation is essential for computer‑use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large‑scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree‑structured verifiable evolution. We propose a multi‑agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high‑quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree‑based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (\emphi.e., trajectory difficulty) and breadth (\emphi.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low‑quality generation. Finally, we naturally extend and propose the TreeCUA‑DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA‑DPO offer significant improvements, and out‑of‑domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron‑hub/TreeCUA.

PaperID: 554, https://arxiv.org/pdf/2602.09642.pdf GitHub

Authors: Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do

Title: MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Abstract:
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource‑constrained or privacy‑sensitive environments. In this paper, we introduce MATA, a multi‑agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open‑source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state‑of‑the‑art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDASLab/MATA.

PaperID: 555, https://arxiv.org/pdf/2602.09379.pdf GitHub

Authors: Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

Title: LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract:
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview‑based diagnosis create substantial barriers to timely and consistent mental‑health assessment. Progress in AI‑assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician‑verified diagnostic labels, and support for dynamic multi‑turn consultation. We present LingxiDiagBench, a large‑scale multi‑agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi‑turn psychiatric consultation in Chinese. At its core is LingxiDiag‑16K, a dataset of 16,000 EMR‑aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD‑10 psychiatric categories. Through extensive experiments across state‑of‑the‑art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression‑‑anxiety classification (up to 92.3%), performance deteriorates substantially for depression‑‑anxiety comorbidity recognition (43.0%) and 12‑way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information‑gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM‑as‑a‑Judge shows only moderate correlation with diagnostic accuracy, suggesting that well‑structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag‑16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi‑mental‑health/LingxiDiagBench.

PaperID: 556, https://arxiv.org/pdf/2602.08962.pdf GitHub

Authors: Guangxun Zhu, Xuan Liu, Nicolas Pugeault, Chongfeng Wei, Edmond S. L. Ho

Title: Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting

Abstract:
Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle‑conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo‑3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi‑agent pedestrian‑vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian‑vehicle interaction cross‑attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian‑vehicle interactions, highlighting the importance of vehicle‑aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D

PaperID: 557, https://arxiv.org/pdf/2602.08675.pdf GitHub

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Title: 6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks

Abstract:
This paper introduces 6G‑Bench, an open benchmark for evaluating semantic communication and network‑level reasoning in AI‑native 6G networks. 6G‑Bench defines a taxonomy of 30 decision‑making tasks (T1‑‑T30) extracted from ongoing 6G and AI‑agent standardization activities in 3GPP, IETF, ETSI, ITU‑T, and the O‑RAN Alliance, and organizes them into five standardization‑aligned capability categories. Starting from 113,475 scenarios, we generate a balanced pool of 10,000 very‑hard multiple‑choice questions using task‑conditioned prompts that enforce multi‑step quantitative reasoning under uncertainty and worst‑case regret minimization over multi‑turn horizons. After automated filtering and expert human validation, 3,722 questions are retained as a high‑confidence evaluation set, while the full pool is released to support training and fine‑tuning of 6G‑specialized models. Using 6G‑Bench, we evaluate 22 foundation models spanning dense and mixture‑of‑experts architectures, short‑ and long‑context designs (up to 1M tokens), and both open‑weight and proprietary systems. Across models, deterministic single‑shot accuracy (pass@1) spans a wide range from 0.22 to 0.82, highlighting substantial variation in semantic reasoning capability. Leading models achieve intent and policy reasoning accuracy in the range 0.87‑‑0.89, while selective robustness analysis on reasoning‑intensive tasks shows pass@5 values ranging from 0.20 to 0.91. To support open science and reproducibility, we release the 6G‑Bench dataset on GitHub: https://github.com/maferrag/6G‑Bench

PaperID: 558, https://arxiv.org/pdf/2602.08529.pdf GitHub

Authors: Ning Lin, Haolun Li, Mingshu Liu, Chengyun Ruan, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou

Title: EvoCorps: An Evolutionary Multi-Agent Framework for Depolarizing Online Discourse

Abstract:
Polarization in online discourse erodes social trust and accelerates misinformation, yet technical responses remain largely diagnostic and post‑hoc. Current governance approaches suffer from inherent latency and static policies, struggling to counter coordinated adversarial amplification that evolves in real‑time. We present EvoCorps, an evolutionary multi‑agent framework for proactive depolarization. EvoCorps frames discourse governance as a dynamic social game and coordinates roles for monitoring, planning, grounded generation, and multi‑identity diffusion. A retrieval‑augmented collective cognition core provides factual grounding and action‑‑outcome memory, while closed‑loop evolutionary learning adapts strategies as the environment and attackers change. We implement EvoCorps on the MOSAIC social‑AI simulation platform for controlled evaluation in a multi‑source news stream with adversarial injection and amplification. Across emotional polarization, viewpoint extremity, and argumentative rationality, EvoCorps improves discourse outcomes over an adversarial baseline, pointing to a practical path from detection and post‑hoc mitigation to in‑process, closed‑loop intervention. The code is available at https://github.com/ln2146/EvoCorps.

PaperID: 559, https://arxiv.org/pdf/2602.08440.pdf GitHub

Authors: Tian Gao, Celine Tan, Catherine Glossop, Timothy Gao, Jiankai Sun, Kyle Stachowicz, Shirley Wu, Oier Mees, Dorsa Sadigh, Sergey Levine, Chelsea Finn

Title: SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

Abstract:
A fundamental challenge in autonomous driving is the integration of high‑level, semantic reasoning for long‑tail events with low‑level, reactive control for robust driving. While large vision‑language models (VLMs) trained on web‑scale data offer powerful common‑sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine‑grained language instructions that steer a vision‑language‑action (VLA) driving policy. Key to our method is this rich language interface between the high‑level VLM and low‑level VLA, which allows the high‑level policy to more effectively ground its reasoning in the control outputs of the low‑level policy. To provide fine‑grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed‑loop benchmark, where it outperforms state‑of‑the‑art methods by 4.77 points in overall driving score and by 8.04 points on a long‑tail subset. The project website is available at: https://steervla.github.io/.

PaperID: 560, https://arxiv.org/pdf/2602.08412.pdf GitHub

Authors: Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, Zhaoxiang Liu

Title: From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Abstract:
Although large language model (LLM)‑based agents, exemplified by OpenClaw, are increasingly evolving from task‑oriented systems into personalized AI assistants for solving complex real‑world tasks, their practical deployment also introduces severe security risks. However, existing agent security research and evaluation frameworks primarily focus on synthetic or task‑centric settings, and thus fail to accurately capture the attack surface and risk propagation mechanisms of personalized agents in real‑world deployments. To address this gap, we propose Personalized Agent Security Bench (PASB), an end‑to‑end security evaluation framework tailored for real‑world personalized agents. Building upon existing agent attack paradigms, PASB incorporates personalized usage scenarios, realistic toolchains, and long‑horizon interactions, enabling black‑box, end‑to‑end security evaluation on real systems. Using OpenClaw as a representative case study, we systematically evaluate its security across multiple personalized scenarios, tool capabilities, and attack types. Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments. The code for the proposed PASB framework is available at https://github.com/AstorYH/PASB.

PaperID: 561, https://arxiv.org/pdf/2602.08234.pdf GitHub

Authors: Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Title: SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Abstract:
Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory‑based methods primarily store raw trajectories, which are often redundant and noise‑heavy. This prevents agents from extracting high‑level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience‑based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task‑specific heuristics, and a recursive evolution mechanism that allows the skill library to co‑evolve with the agent's policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop and seven search‑augmented tasks demonstrate that SkillRL achieves state‑of‑the‑art performance, outperforming strong baselines over 15.3% and maintaining robustness as task complexity increases. Code is available at this https://github.com/aiming‑lab/SkillRL.

PaperID: 562, https://arxiv.org/pdf/2602.07796.pdf GitHub

Authors: Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li

Title: Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents

Abstract:
Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, their effectiveness in realistic user‑engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user‑engaged LLM agents. Our experiments span across seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user‑engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more ``introverted'' by shortening responses and reducing information disclosure to users, which weakens agent‑user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real‑world scenarios. Our code is available at https://github.com/deeplearning‑wisc/Thinking‑Agent.

PaperID: 563, https://arxiv.org/pdf/2602.07787.pdf GitHub

Authors: Pierre-Louis Favreau, Jean-Pierre Lo, Clement Guiguet, Charles Simon-Meunier, Nicolas Dehandschoewercker, Allen G. Roush, Judah Goldfeder, Ravid Shwartz-Ziv

Title: Do Multi-Agents Dream of Electric Screens? Achieving Perfect Accuracy on AndroidWorld Through Task Decomposition

Abstract:
We present Minitap, a multi‑agent system that achieves 100% success on the AndroidWorld benchmark, the first to fully solve all 116 tasks and surpassing human performance (80%). We first analyze why single‑agent architectures fail: context pollution from mixed reasoning traces, silent text input failures undetected by the agent, and repetitive action loops without escape. Minitap addresses each failure through targeted mechanisms: cognitive separation across six specialized agents, deterministic post‑validation of text input against device state, and meta‑cognitive reasoning that detects cycles and triggers strategy changes. Ablations show multi‑agent decomposition contributes +21 points over single‑agent baselines; verified execution adds +7 points; meta‑cognition adds +9 points. We release Minitap as open‑source software. https://github.com/minitap‑ai/mobile‑use

PaperID: 564, https://arxiv.org/pdf/2602.07777.pdf GitHub

Authors: Shuhui Zhu, Yue Lin, Shriya Kaistha, Wenhao Li, Baoxiang Wang, Hongyuan Zha, Gillian K. Hadfield, Pascal Poupart

Title: Talk, Judge, Cooperate: Gossip-Driven Indirect Reciprocity in Self-Interested LLM Agents

Abstract:
Indirect reciprocity, which means helping those who have helped others, is difficult to sustain among decentralized, self‑interested LLM agents without reliable reputation systems. We address this challenge with the Agentic Linguistic Gossip Network (ALIGN), an automated framework that enables decentralized agents to form reputations, evaluate trustworthiness, and coordinate social norms by strategically sharing open‑ended gossip with hierarchical tones. We demonstrate that ALIGN consistently improves indirect reciprocity and resists malicious entrants by identifying and ostracizing defectors. Notably, we find that stronger reasoning capabilities in LLMs lead to more incentive‑aligned cooperation, whereas chat models often over‑cooperate even when strategically suboptimal. These results suggest that leveraging LLM reasoning through decentralized gossip is a promising path for maintaining social welfare in agentic ecosystems. Our code is available at https://github.com/shuhui‑zhu/ALIGN.

PaperID: 565, https://arxiv.org/pdf/2602.07398.pdf GitHub

Authors: Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang

Title: AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Abstract:
Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision‑making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non‑essential content degrades decision‑making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent's memory; only schema‑validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event‑triggered checks whose overhead scales with operations rather than context length. On AgentDojo and ASB, AgentSys achieves 0.78% and 4.25% attack success while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys‑memory.

PaperID: 566, https://arxiv.org/pdf/2602.07391.pdf GitHub

Authors: Kunal Pai, Parth Shah, Harshil Patel

Title: NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Abstract:
AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red‑teaming or static benchmarks that fail to model adaptive, multi‑turn adversaries. We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback‑driven optimization problem. Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. By using model responses as a fitness signal, the framework iteratively compounds effective attack strategies while simultaneously ensuring "benign‑use correctness", preventing the degenerate security of blanket refusal. Our experiments across a diverse suite of state‑of‑the‑art large language models demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one‑shot methods, with controlled ablations revealing that the synergy between exploration and targeted mutation uncovers high‑severity failure modes. We show that this adaptive approach provides a more realistic and scalable assessment of agent robustness in the face of evolving threats. The code for NAAMSE is open source and available at https://github.com/HASHIRU‑AI/NAAMSE.

PaperID: 567, https://arxiv.org/pdf/2602.07274.pdf GitHub

Authors: Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, Wenbo Guo

Title: TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Abstract:
Executing complex terminal tasks remains a significant challenge for open‑weight LLMs, constrained by two fundamental limitations. First, high‑fidelity, executable training environments are scarce: environments synthesized from real‑world repositories are not diverse and scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill‑equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end‑to‑end pipeline for synthesizing verifiable environments and resilient expert trajectories. Termi‑Gen first generates functionally valid tasks and Docker containers via an iterative multi‑agent refinement loop. Subsequently, we employ a Generator‑Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error‑correction cycles. Fine‑tuned on this TermiGen‑generated dataset, our TermiGen‑Qwen2.5‑Coder‑32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open‑weights state‑of‑the‑art, outperforming existing baselines and notably surpassing capable proprietary models such as o4‑mini. Dataset is avaiable at https://github.com/ucsb‑mlsec/terminal‑bench‑env.

PaperID: 568, https://arxiv.org/pdf/2602.06967.pdf GitHub

Authors: Siqi Song, Xuanbing Xie, Zonglin Li, Yuqiang Li, Shijie Wang, Biqing Qi

Title: Leveraging Adaptive Group Negotiation for Heterogeneous Multi-Robot Collaboration with Large Language Models

Abstract:
Multi‑robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large‑Language‑Model‑Driven Heterogeneous Multi‑Robot System), an adaptive group negotiation framework among LLMs for multi‑robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception‑driven multi‑LLM discussions to get commands for actions. Feedback is provided by both robot execution outcomes and environment changes. This grouping‑planning‑execution‑feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi‑robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human‑inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi‑robot collaboration. Our code is available here: https://github.com/song‑siqi/CLiMRS.

PaperID: 569, https://arxiv.org/pdf/2602.06724.pdf GitHub

Authors: Tian Lan, Felix Henry, Bin Zhu, Qianghuai Jia, Junyang Ren, Qihang Pu, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo

Title: Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion

Abstract:
Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long‑horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain‑text context is inherently fragile. To address this, we introduce Table‑as‑Search (TaS), a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state‑of‑the‑art baselines across three kinds of benchmarks, including multi‑agent framework and commercial systems. Furthermore, our analysis validates the TaS's superior robustness in long‑horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at https://github.com/AIDC‑AI/Marco‑Search‑Agent.

PaperID: 570, https://arxiv.org/pdf/2602.06526.pdf GitHub

Authors: Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song

Title: Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks

Abstract:
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM‑human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI‑to‑human escalation. To address this, we propose DREAM, a multi‑round debate‑based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement‑based debate, it yields more accurate labeling for certain cases and more reliable AI‑to‑human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re‑benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval‑generation misalignment. The relevance assessment framework is available at https: //github.com/DISL‑Lab/DREAM‑ICLR‑26; and the BRIDGE dataset is available at https://github.com/DISL‑Lab/BRIDGE‑Benchmark.

PaperID: 571, https://arxiv.org/pdf/2602.06511.pdf GitHub

Authors: Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

Title: EvoMAS: Evolutionary Generation of Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems (MAS) show strong promise for complex reasoning, planning, and tool‑augmented tasks, but designing effective MAS architectures remains labor‑intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi‑Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback‑conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE‑Bench, and WorkBench, covering reasoning, software engineering, and tool‑use tasks. EvoMAS consistently improves task performance over both human‑designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude‑4.5‑Sonnet, EvoMAS also reaches 79.1% on SWE‑Bench‑Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon‑science/EvoMAS

PaperID: 572, https://arxiv.org/pdf/2602.06025.pdf GitHub

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

Title: Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Abstract:
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query‑agnostic memory construction that can be inefficient and may discard query‑critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance‑cost trade‑off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query‑aware performance‑cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textscLow/\textscMid/\textscHigh). A lightweight router performs budget‑tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high‑budget setting), and delivers better accuracy‑cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade‑offs under varying budget regimes.

PaperID: 573, https://arxiv.org/pdf/2602.06008.pdf GitHub

Authors: Xianyang Liu, Shangding Gu, Dawn Song

Title: AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Abstract:
Large language model (LLM)‑based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language‑mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi‑agent buyer‑seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product‑dependent valuations, and must reach agreements through multi‑round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many‑to‑many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state‑of‑the‑art proprietary and open‑weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long‑horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language‑based market interaction. Code and dataset are available at the link: https://github.com/SafeRL‑Lab/AgenticPay.

PaperID: 574, https://arxiv.org/pdf/2602.05636.pdf GitHub

Authors: Benny Cheung

Title: Generative Ontology: When Structured Knowledge Learns to Create

Abstract:
Traditional ontologies describe domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs lacking structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework synthesizing these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas constraining LLM generation via DSPy signatures. A multi‑agent pipeline assigns specialized roles: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits, each carrying a professional "anxiety" that prevents shallow outputs. Retrieval‑augmented generation grounds designs in precedents from existing exemplars. We demonstrate the framework through GameGrammar, generating complete tabletop game designs, and present three empirical studies. An ablation study (120 designs, 4 conditions) shows multi‑agent specialization produces the largest quality gains (fun d=1.12, depth d=1.59; p<.001), while schema validation eliminates structural errors (d=4.78). A benchmark against 20 published board games reveals structural parity but a bounded creative gap (fun d=1.86): generated designs score 7‑8 while published games score 8‑9. A test‑retest study (50 evaluations) validates the LLM‑based evaluator, with 7/9 metrics achieving Good‑to‑Excellent reliability (ICC 0.836‑0.989). The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars is a candidate for Generative Ontology.

PaperID: 575, https://arxiv.org/pdf/2602.05327.pdf GitHub

Authors: Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang

Title: ProAct: Agentic Lookahead in Interactive Environments

Abstract:
Existing Large Language Model (LLM) agents struggle in interactive environments requiring long‑horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two‑stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine‑tuning on trajectories derived from environment‑based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference‑time search. Second, to further refine decision accuracy, we propose the Monte‑Carlo Critic (MC‑Critic), a plug‑and‑play auxiliary value estimator designed to enhance policy‑gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC‑Critic provides a low‑variance signal that facilitates stable policy optimization without relying on expensive model‑based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open‑source baselines and rivals state‑of‑the‑art closed‑source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct

PaperID: 576, https://arxiv.org/pdf/2602.05105.pdf GitHub

Authors: Rohan Patil, Jai Malegaonkar, Xiao Jiang, Andre Dion, Gaurav S. Sukhatme, Henrik I. Christensen

Title: GAMMS: Graph based Adversarial Multiagent Modeling Simulator

Abstract:
As intelligent systems and multi‑agent coordination become increasingly central to real‑world applications, there is a growing need for simulation tools that are both scalable and accessible. Existing high‑fidelity simulators, while powerful, are often computationally expensive and ill‑suited for rapid prototyping or large‑scale agent deployments. We present GAMMS (Graph based Adversarial Multiagent Modeling Simulator), a lightweight yet extensible simulation framework designed to support fast development and evaluation of agent behavior in environments that can be represented as graphs. GAMMS emphasizes five core objectives: scalability, ease of use, integration‑first architecture, fast visualization feedback, and real‑world grounding. It enables efficient simulation of complex domains such as urban road networks and communication systems, supports integration with external tools (e.g., machine learning libraries, planning solvers), and provides built‑in visualization with minimal configuration. GAMMS is agnostic to policy type, supporting heuristic, optimization‑based, and learning‑based agents, including those using large language models. By lowering the barrier to entry for researchers and enabling high‑performance simulations on standard hardware, GAMMS facilitates experimentation and innovation in multi‑agent systems, autonomous planning, and adversarial modeling. The framework is open‑source and available at https://github.com/GAMMSim/GAMMS/

PaperID: 577, https://arxiv.org/pdf/2602.04587.pdf GitHub

Authors: Jaeyoon Jung, Yejun Yoon, Kunwoo Park

Title: VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Abstract:
This paper describes VILLAIN, a multimodal fact‑checking system that verifies image‑text claims through prompt‑based multi‑agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision‑language model agents across multiple stages of fact‑checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality‑specific and cross‑modal agents generate analysis reports. In the subsequent stage, question‑answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image‑text claim and the generated question‑answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu‑humane/VILLAIN.

PaperID: 578, https://arxiv.org/pdf/2602.04296.pdf GitHub

Authors: Wenjun Peng, Xinyu Wang, Qi Wu

Title: ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas

Abstract:
Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real‑world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM‑generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi‑agent tournaments to provide a holistic view of program behavior. Applied to a range of state‑of‑the‑art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition‑based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM‑driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand‑crafted agents. The project is available at https://github.com/xinke‑wang/ProxyWar.

PaperID: 579, https://arxiv.org/pdf/2602.04284.pdf GitHub

Authors: Yansong Ning, Jun Fang, Naiqiang Tan, Hao Liu

Title: Agent-Omit: Adaptive Context Omission for Efficient LLM Agents

Abstract:
Managing agent context (e.g., thought and observation) during multi‑turn agent‑environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent‑Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold‑start data, including both single‑turn and multi‑turn omission scenarios, to fine‑tune the agent for omission behaviors. Furthermore, we introduce an omit‑aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper‑bounded by KL‑divergence. Experimental results on five agent benchmarks show that our constructed Agent‑Omit‑8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness‑efficiency trade‑off than seven efficient LLM agents methods. Our code and data are available at https://github.com/usail‑hkust/Agent‑Omit.

PaperID: 580, https://arxiv.org/pdf/2602.04197.pdf GitHub

Authors: Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su

Title: From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Abstract:
The enhanced capabilities of LLM‑based agents come with an emergency for model planning and tool‑use abilities. Attributing to helpful‑harmless trade‑off from LLM alignment, agents typically also inherit the flaw of "over‑refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade‑off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over‑refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma‑driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi‑step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.

PaperID: 581, https://arxiv.org/pdf/2602.03955.pdf GitHub

Authors: Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang

Title: AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

Abstract:
While large language model (LLM) multi‑agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi‑agent dynamics into the weights of a single model, effectively transforming explicit test‑time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi‑agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning‑enhanced fine‑tuning; trajectory‑based augmentation; and process‑aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self‑correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi‑agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.

PaperID: 582, https://arxiv.org/pdf/2602.03798.pdf GitHub

Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li

Title: FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation

Abstract:
Assisting non‑expert users to develop complex interactive websites has become a popular task for LLM‑powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full‑stack data processing and storage with fancy visual effects. Notably, constructing production‑level full‑stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack‑Agent, a unified agent system for full‑stack agentic coding that consists of three parts: (1) FullStack‑Dev, a multi‑agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack‑Learn, an innovative data‑scaling and self‑improving method that back‑translates crawled and synthesized website repositories to improve the backbone LLM of FullStack‑Dev. (3) FullStack‑Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack‑Dev outperforms the previous state‑of‑the‑art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack‑Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self‑improvement, demonstrating the effectiveness of our approach. The code is released at https://github.com/mnluzimu/FullStack‑Agent.

PaperID: 583, https://arxiv.org/pdf/2602.03794.pdf GitHub

Authors: Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu

Title: Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity

Abstract:
LLM‑based multi‑agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information‑theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture‑agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce K^, an effective channel count that quantifies the number of effective channels without ground‑truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity‑aware design. Code and Dataset are available at the link: https://github.com/SafeRL‑Lab/Agent‑Scaling.

PaperID: 584, https://arxiv.org/pdf/2602.03595.pdf GitHub

Authors: Haichao Jiang, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

Title: Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large‑scale supervised fine‑tuning (SFT) of Multi‑modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero‑shot approaches offer a flexible alternative, their performance remains significantly behind SFT‑based methods, due to the straightforward workflow designs. To address these limitations, we propose Refer‑Agent, a collaborative multi‑agent system with alternating reasoning‑reflection mechanisms. This system decomposes RVOS into step‑by‑step reasoning process. During reasoning, we introduce a Coarse‑to‑Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain‑of‑Reflection mechanism, which employs a Questioner‑Responder pair to generate a self‑reflection chain, enabling the system to verify intermediate results and generates feedback for next‑round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer‑Agent significantly outperforms state‑of‑the‑art methods, including both SFT‑based models and zero‑shot approaches. Moreover, Refer‑Agent is flexible and enables fast integration of new MLLMs without any additional fine‑tuning costs. Code will be released at https://github.com/iSEE‑Laboratory/Refer‑Agent.

PaperID: 585, https://arxiv.org/pdf/2602.03507.pdf GitHub

Authors: Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng

Title: FaithRL: Learning to Reason Faithfully through Step-Level Faithfulness Maximization

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi‑step reasoning. However, most RLVR pipelines rely on sparse outcome‑based rewards, providing little supervision over intermediate steps and thus encouraging over‑confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness‑maximization objective and theoretically show that optimizing it mitigates over‑confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness‑aware advantage modulation mechanism that assigns step‑level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step‑wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.

PaperID: 586, https://arxiv.org/pdf/2602.03447.pdf GitHub

Authors: Yu-Hsiang Chen, Wei-Jer Chang, Christian Kotulla, Thomas Keutgens, Steffen Runde, Tobias Moers, Christoph Klas, Wei Zhan, Masayoshi Tomizuka, Yi-Ting Chen

Title: HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous Traffic

Abstract:
We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navi‑ gating real‑world heterogeneous traffic dominated by vulner‑ able road users (VRUs), including pedestrians, cyclists, and motorcyclists that interact with vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right‑of‑way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane‑disciplined traffic. To bridge the gap, we collect a large‑ scale drone‑based dataset to provide a holistic observation of traffic scenes with centimeter‑accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per‑agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high‑ fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, het‑ erogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state‑of‑the‑art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi‑agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: https://hetroddata.github.io/HetroD/

PaperID: 587, https://arxiv.org/pdf/2602.03285.pdf GitHub

Authors: Yuelin Hu, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu, Ronghua Wu, Li Song

Title: MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings

Abstract:
Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real world enterprise workflows, where queries arise organically from multi stakeholder collaboration, span long temporal contexts, and require tool augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise informed protocol validated by domain expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across finance, healthcare, and technology sectors. Second, we propose MeetBench XL, a multi dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster XL, a learned dual policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality latency tradeoff over single model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real world deployment case study.Resources: https://github.com/huyuelin/MeetBench.

PaperID: 588, https://arxiv.org/pdf/2602.03255.pdf GitHub

Authors: Tianyu Chen, Chujia Hu, Ge Gao, Dongrui Liu, Xia Hu, Wenjie Wang

Title: LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Abstract:
Computer‑use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short‑horizon or GUI‑based tasks, evaluating on execution‑time errors but overlooking the ability to anticipate planning‑time risks. To fill this gap, we present LPS‑Bench, a benchmark that evaluates the planning‑time safety awareness of MCP‑based CUAs under long‑horizon tasks, covering both benign and adversarial interactions across 65 scenarios of 7 task domains and 9 risk types. We introduce a multi‑agent automated pipeline for scalable data generation and adopt an LLM‑as‑a‑judge evaluation protocol to assess safety awareness through the planning trajectory. Experiments reveal substantial deficiencies in existing CUAs' ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long‑horizon planning safety in MCP‑based CUA systems. We open‑source our code at https://github.com/tychenn/LPS‑Bench.

PaperID: 589, https://arxiv.org/pdf/2602.03053.pdf GitHub

Authors: Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty

Title: MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

Abstract:
Multi‑Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS‑ProVe, a systematic empirical study of process verification for multi‑agent systems (MAS). Our study spans three verification paradigms (LLM‑as‑a‑Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent‑level and iteration‑level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process‑level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi‑agent trajectories. Among the methods studied, LLM‑as‑a‑Judge generally outperforms reward‑based approaches, with trained judges surpassing general‑purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context‑length‑performance trade‑off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang‑ML‑Lab/MAS‑ProVe.

PaperID: 590, https://arxiv.org/pdf/2602.02979.pdf GitHub

Authors: Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun

Title: CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Abstract:
Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high‑quality human‑curated tasks and labels, either through supervised fine‑tuning (SFT) or reinforcement learning (RL) on reasoning‑specific data. This dependence renders supervision‑heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach‑Player paradigm for data‑free reinforcement learning of reasoning models. Unlike traditional adversarial self‑play, CPMöbius, inspired by real world human sports collaboration and multi‑agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5‑Math‑7B‑Instruct, our method improves accuracy by an overall average of +4.9 and an out‑of‑distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R‑zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.

PaperID: 591, https://arxiv.org/pdf/2602.02534.pdf GitHub

Authors: Enhao Huang, Tongtong Pan, Shuhuai Zhang, Qishu Jin, Liheng Zheng, Kaichun Hu, Yiming Li, Zhan Qin, Kui Ren

Title: DualMind: Towards Understanding Cognitive-Affective Cascades in Public Opinion Dissemination via Multi-Agent Simulation

Abstract:
Forecasting public opinion during PR crises is challenging, as existing frameworks often overlook the interaction between transient affective responses and persistent cognitive beliefs. To address this, we propose DualMind, an LLM‑driven multi‑agent platform designed to model this dual‑component interplay. We evaluate the system on 15 real‑world crises occurring post‑August 2024 using social media data as ground truth. Empirical results demonstrate that DualMind faithfully reconstructs opinion trajectories, significantly outperforming state‑of‑the‑art baselines. This work offers a high‑fidelity tool for proactive crisis management. Code is available at https://github.com/EonHao/DualMind.

PaperID: 592, https://arxiv.org/pdf/2602.02474.pdf GitHub

Authors: Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

Title: MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Abstract:
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand‑designed operations for extracting memory. These fixed procedures hard‑code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emphcontroller that learns to select a small set of relevant skills, paired with an LLM‑based \emphexecutor that produces skill‑guided memories. Beyond learning skill selection, MemSkill introduces a \emphdesigner that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed‑loop procedure that improves both the skill‑selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self‑evolving memory management for LLM agents.

PaperID: 593, https://arxiv.org/pdf/2602.02369.pdf GitHub

Authors: Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, Huazheng Wang

Title: Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback

Abstract:
Large language model (LLM) agents are increasingly equipped with memory, which are stored experience and reusable guidance that can improve task‑solving performance. Recent \emphself‑evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are developed for static train/test splits and only approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. We introduce \textscLive‑Evo, an online self‑evolving memory system that learns from a stream of incoming data over time. \textscLive‑Evo decouples \emphwhat happened from \emphhow to use it via an Experience Bank and a Meta‑Guideline Bank, compiling task‑adaptive guidelines from retrieved experiences for each task. To manage memory online, \textscLive‑Evo maintains experience weights and updates them from feedback: experiences that consistently help are reinforced and retrieved more often, while misleading or stale experiences are down‑weighted and gradually forgotten, analogous to reinforcement and decay in human memory. On the live Prophet Arena benchmark over a 10‑week horizon, \textscLive‑Evo improves Brier score by 20.8% and increases market returns by 12.9%, while also transferring to deep‑research benchmarks with consistent gains over strong baselines. Our code is available at https://github.com/ag2ai/Live‑Evo.

PaperID: 594, https://arxiv.org/pdf/2602.02206.pdf GitHub

Authors: Tong Yang, Yemin Wang, Chaoning Zhang, Aming Wu

Title: Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning

Abstract:
The effectiveness of LLM‑based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax‑heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat‑Cat, a document‑driven agent architecture that improves the signal‑to‑noise ratio of state management. By integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre‑training corpora, (2) a Textual Strategy Evolution module that accumulates task‑solving knowledge without parameter updates, and (3) a Closed‑Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Extensive reasoning, retrieval, and coding benchmarks, Fat‑Cat consistently improves agent performance. It enables the Kimi‑k2 model to outperform the proprietary GPT‑4o baseline on HotPotQA. Replacing the document‑based state with JSON leads to performance drop, while empirically validating the critical necessity of document‑driven state modeling over rigid syntax. The code is available at https://github.com/answeryt/Fat‑Cat.

PaperID: 595, https://arxiv.org/pdf/2602.02103.pdf GitHub

Authors: Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

Title: How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning

Abstract:
Chain‑of‑thought (CoT) reasoning has become a central mechanism for eliciting multi‑step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps still remain crucial for tasks requiring compositional computation. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele‑Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele‑lens.

PaperID: 596, https://arxiv.org/pdf/2602.02099.pdf GitHub

Authors: Keqin Peng, Yuanxin Ouyang, Xuebo Liu, Zhiliang Tian, Ruijian Han, Yancheng Yuan, Liang Ding

Title: Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi‑step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group‑relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over‑penalize correct solutions; and (ii) Difficulty‑Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct‑response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency‑‑accuracy trade‑off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) versus over 20% on harder benchmarks (e.g., AIME25), thereby maintaining or improving accuracy. Code is available at https://github.com/alphadl/DDCA.

PaperID: 597, https://arxiv.org/pdf/2602.01665.pdf GitHub

Authors: Hayeong Lee, JunHyeok Oh, Byung-Jun Lee

Title: TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

Abstract:
The design of environments plays a critical role in shaping the development and evaluation of cooperative multi‑agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high‑throughput sandbox designed for reconfigurable multi‑agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade‑offs across a diverse spectrum of task complexities. Leveraging JAX for hardware‑accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku‑dmlab/TABX.

PaperID: 598, https://arxiv.org/pdf/2602.01566.pdf GitHub

Authors: Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang

Title: FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Abstract:
Deep research is emerging as a representative long‑horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test‑time scaling. We introduce FS‑Researcher, a file‑system‑based, dual‑agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open‑ended benchmarks (DeepResearch Bench and DeepConsult) show that FS‑Researcher achieves state‑of‑the‑art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test‑time scaling under the file‑system paradigm. The code and data are open‑sourced at https://github.com/Ignoramus0817/FS‑Researcher.

PaperID: 599, https://arxiv.org/pdf/2602.01539.pdf GitHub

Authors: Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang

Title: MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Abstract:
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre‑collected data distributions. In this paper, we introduce MAGIC, a novel multi‑turn multi‑agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co‑evolution, where the attacker's ever‑changing strategies continuously uncover long‑tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.

PaperID: 600, https://arxiv.org/pdf/2602.01331.pdf GitHub

Authors: Mingju Chen, Guibin Zhang, Heng Chang, Yuchen Guo, Shiji Zhou

Title: A-MapReduce: Executing Wide Search via Agentic MapReduce

Abstract:
Contemporary large language model (LLM)‑based multi‑agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large‑scale, breadth‑oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long‑horizon execution. To bridge this gap, we propose A‑MapReduce, a MapReduce paradigm‑inspired multi‑agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A‑MapReduce implements parallel processing of massive retrieval targets through task‑adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query‑conditioned task allocation and recomposition, enabling progressive improvement in large‑scale wide‑search regimes. Extensive experiments on five agentic benchmarks demonstrate that A‑MapReduce is (i) high‑performing, achieving state‑of‑the‑art performance on WideSearch and DeepWideSearch, and delivering 5.11% ‑ 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost‑effective and efficient, delivering superior cost‑performance trade‑offs and reducing running time by 45.8% compared to representative multi‑agent baselines. The code is available at https://github.com/mingju‑c/AMapReduce.

PaperID: 601, https://arxiv.org/pdf/2602.01305.pdf GitHub

Authors: Ayushman Sarkar, Zhenyu Yu, Wei Tang, Chu Chen, Kangning Cui, Mohd Yamani Idna Idris

Title: StoryState: Agent-Based State Control for Consistent and Editable Storybooks

Abstract:
Large multimodal models have enabled one‑click storybook generation, where users provide a short description and receive a multi‑page illustrated story. However, the underlying story state, such as characters, world settings, and page‑level objects, remains implicit, making edits coarse‑grained and often breaking visual consistency. We present StoryState, an agent‑based orchestration layer that introduces an explicit and editable story state on top of training‑free text‑to‑image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per‑page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story‑style prompts for generation and editing. Operating purely through prompts, StoryState is model‑agnostic and compatible with diverse generation backends. System‑level experiments on multi‑page editing tasks show that StoryState enables localized page edits, improves cross‑page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one‑shot consistency of Gemini Storybook. Code is available at https://github.com/YuZhenyuLindy/StoryState

PaperID: 602, https://arxiv.org/pdf/2602.01297.pdf GitHub

Authors: Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang, Yang Zou, Seyyedali Hosseinalipour

Title: RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis

Abstract:
Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single‑agent systems are vulnerable to self‑reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi‑agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence‑driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these, we propose RE‑MCDF, a relation‑enhanced multi‑expert clinical diagnosis framework. RE‑MCDF introduces a generation‑‑verification‑‑revision closed‑loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi‑relation awareness and evaluation expert group that explicitly enforces inter‑disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE‑MCDF consistently outperforms state‑of‑the‑art baselines in complex diagnostic scenarios (https://github.com/shenshaowei/RE‑MCDF).

PaperID: 603, https://arxiv.org/pdf/2602.00933.pdf GitHub

Authors: Chaithanya Bandi, Razvan-Gabriel Dumitru, Ben Hertzberg, Divyansh Agarwal, Geobio Boo, Tejas Polakam, Sami Hassaan, Jeff Da, HiJae Kim, Vipul Gupta, Manasi Sharma, Andrew Park, Martin Dimakis, Ernesto Gabriel Hernandez Montoya, Dan Rambado, Ivan Salazar, Rafael Cruz, MohammadHossein Rezaei, Chetan Rane, Ben Levin, Daniel Yue Zhang, Brad Kenstler, Bing Liu

Title: MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Abstract:
The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi‑step workflows with cross‑server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim‑level scoring disentangled from agent verbosity or style. We introduce MCP‑Atlas, a benchmark for measuring tool‑use competency against production MCP servers. MCP‑Atlas contains 1,000 natural‑language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi‑step, cross‑server workflows. Each task is scored with a claim‑level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer‑centric scoring permits valid alternative tool‑call trajectories to receive credit. We pair this with an 11‑category diagnostic taxonomy that disentangles tool‑call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task‑level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three‑tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool‑call related. Notably, several high‑performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500‑task public split, while reserving a 500‑task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp‑atlas.

PaperID: 604, https://arxiv.org/pdf/2602.00851.pdf GitHub

Authors: Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian

Title: Understanding Persuasion in Long-Running Agents

Abstract:
Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: What happens when an agent engaged in long‑horizon tasks is exposed to user persuasion? Yet studying this possibility is challenging because long‑running agent behavior is noisy and costly to reproduce, and it remains unclear which unique challenges emerge only in extended task execution. We study how belief‑level intervention can influence downstream task behavior, a phenomenon we name persuasion propagation. We introduce a behavior‑centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on‑the‑fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief‑prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral‑prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent's behavior, motivating behavior‑level evaluation in agentic systems.

PaperID: 605, https://arxiv.org/pdf/2602.00676.pdf GitHub

Authors: Chao Li, Shangdong Yang, Chiheng Zhan, Zhenxing Ge, Yujing Hu, Bingkun Bao, Xingguo Chen, Yang Gao

Title: OpenGuanDan: A Large-Scale Imperfect Information Game Benchmark

Abstract:
The advancement of data‑driven artificial intelligence (AI), particularly machine learning, heavily depends on large‑scale benchmarks. Despite remarkable progress across domains ranging from pattern recognition to intelligent decision‑making in recent decades, exemplified by breakthroughs in board games, card games, and electronic sports games, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables both efficient simulation of GuanDan (a popular four‑player, multi‑round Chinese card game) and comprehensive evaluation of both learning‑based and rule‑based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges, including imperfect information, large‑scale information set and action spaces, a mixed learning objective involving cooperation and competition, long‑horizon decision‑making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision‑making methods. Moreover, the independent API for each player allows human‑AI interactions and supports integration with large language models. Empirically, we conduct two types of evaluations: (1) pairwise competitions among all GuanDan AI agents, and (2) human‑AI matchups. Experimental results demonstrate that while current learning‑based agents substantially outperform rule‑based counterparts, they still fall short of achieving superhuman performance, underscoring the need for continued research in multi‑agent intelligent decision‑making domain. The project is publicly available at https://github.com/GameAI‑NJUPT/OpenGuanDan.

PaperID: 606, https://arxiv.org/pdf/2602.00551.pdf GitHub GitHub

Authors: Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen, Jianqiang Xiao, Shuo Yang

Title: APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

Abstract:
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision‑making, and inefficient exploration and information gathering. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three‑part architecture: 1) Dynamic Spatio‑Semantic Mapping Memory, which leverages the zero‑shot capability of a Vision‑Language Model (VLM) to dynamically construct high‑resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine‑grained and robust control policy. 3) Target Grounding Module, which employs an open‑vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV‑ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is provided in \hrefhttps://github.com/4amGodvzx/apexGitHub

PaperID: 607, https://arxiv.org/pdf/2602.00471.pdf GitHub

Authors: Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

Title: Dual Latent Memory for Visual Multi-agent System

Abstract:
While Visual Multi‑Agent Systems (VMAS) promise to enhance comprehensive abilities through inter‑agent collaboration, empirical evidence reveals a counter‑intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text‑centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L^2‑VMAS, a novel model‑agnostic framework that enables inter‑agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy‑driven proactive triggering that replaces passive information transmission with efficient, on‑demand memory access. Extensive experiments among backbones, sizes, and multi‑agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7‑5.4% while reducing token usage by 21.3‑44.8%. Codes: https://github.com/YU‑deep/L2‑VMAS.

PaperID: 608, https://arxiv.org/pdf/2602.00428.pdf GitHub

Authors: Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou, Changjiang Li, Tianyu Du, Zhihui Fu, Jun Wang, Shouling Ji

Title: When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

Abstract:
Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi‑agent systems, enabling them to address complex challenges. However, within these multi‑agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi‑agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM‑based multi‑agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt‑level defenses (e.g., cognitive anchoring and source scrutiny) and model‑level alignment‑based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi‑agent systems. Code and dataset are available at https://github.com/bluedream02/Mandela‑Effect.

PaperID: 609, https://arxiv.org/pdf/2602.00197.pdf GitHub

Authors: Yang Tan, Yuanxi Yu, Can Wu, Bozitao Zhong, Mingchen Li, Guisheng Fan, Jiankang Zhu, Yafeng Liang, Nanqing Dong, Liang Hong

Title: Rank-and-Reason: Multi-Agent Collaboration Accelerates Zero-Shot Protein Mutation Prediction

Abstract:
Zero‑shot mutation prediction is vital for low‑resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet‑lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank‑and‑Reason (VenusRAR), a two‑stage agentic framework to automate this workflow and maximize expected wet‑lab fitness. In the Rank‑Stage, a Computational Expert and Virtual Biologist aggregate a context‑aware multi‑modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason‑Stage, an agentic Expert Panel employs chain‑of‑thought reasoning to audit candidates against geometric and structural constraints, improving the Top‑5 Hit Rate by up to 367% on ProteinGym‑DMS99. The wet‑lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23‑fold and 5.05‑fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).

PaperID: 610, https://arxiv.org/pdf/2602.00154.pdf GitHub

Authors: Xiaogeng Liu, Xinyan Wang, Yechao Zhang, Sanjay Kariyappa, Chong Xiang, Muhao Chen, G. Edward Suh, Chaowei Xiao

Title: ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Abstract:
Large reasoning models (LRMs) extend large language models with explicit multi‑step reasoning traces, but this capability introduces a new class of prompt‑induced inference‑time denial‑of‑service (PI‑DoS) attacks that exploit the high computational cost of reasoning. We first formalize inference cost for LRMs and define PI‑DoS, then prove that any practical PI‑DoS attack should satisfy three properties: (1) a high amplification ratio, where each query induces a disproportionately long reasoning trace relative to its own length; (ii) stealthiness, in which prompts and responses remain on the natural language manifold and evade distribution shift detectors; and (iii) optimizability, in which the attack supports efficient optimization without being slowed by its own success. Under this framework, we present ReasoningBomb, a reinforcement‑learning‑based PI‑DoS framework that is guided by a constant‑time surrogate reward and trains a large reasoning‑model attacker to generate short natural prompts that drive victim LRMs into pathologically long and often effectively non‑terminating reasoning. Across seven open‑source models (including LLMs and LRMs) and three commercial LRMs, ReasoningBomb induces 18,759 completion tokens on average and 19,263 reasoning tokens on average across reasoning models. It outperforms the the runner‑up baseline by 35% in completion tokens and 38% in reasoning tokens, while inducing 6‑7x more tokens than benign queries and achieving 286.7x input‑to‑output amplification ratio averaged across all samples. Additionally, our method achieves 99.8% bypass rate on input‑based detection, 98.7% on output‑based detection, and 98.4% against strict dual‑stage joint detection.

PaperID: 611, https://arxiv.org/pdf/2602.00133.pdf GitHub

Authors: Avi Arora, Ritesh Malpani

Title: PredictionMarketBench: A SWE-bench-Style Framework for Backtesting Trading Agents on Prediction Markets

Abstract:
Prediction markets offer a natural testbed for trading agents: contracts have binary payoffs, prices can be interpreted as probabilities, and realized performance depends critically on market microstructure, fees, and settlement risk. We introduce PredictionMarketBench, a SWE‑bench‑style benchmark for evaluating algorithmic and LLM‑based trading agents on prediction markets via deterministic, event‑driven replay of historical limit‑order‑book and trade data. PredictionMarketBench standardizes (i) episode construction from raw exchange streams (orderbooks, trades, lifecycle, settlement), (ii) an execution‑realistic simulator with maker/taker semantics and fee modeling, and (iii) a tool‑based agent interface that supports both classical strategies and tool‑calling LLM agents with reproducible trajectories. We release four Kalshi‑based episodes spanning cryptocurrency, weather, and sports. Baseline results show that naive trading agents can underperform due to transaction costs and settlement losses, while fee‑aware algorithmic strategies remain competitive in volatile episodes.

PaperID: 612, https://arxiv.org/pdf/2601.23009.pdf GitHub

Authors: Wei Chen, Zhiyuan Peng, Xin Yin, Chao Ni, Chenhao Ying, Bang Xie, Yuan Luo

Title: SolAgent: A Specialized Multi-Agent Framework for Solidity Code Generation

Abstract:
Smart contracts are the backbone of the decentralized web, yet ensuring their functional correctness and security remains a critical challenge. While Large Language Models (LLMs) have shown promise in code generation, they often struggle with the rigorous requirements of smart contracts, frequently producing code that is buggy or vulnerable. To address this, we propose SolAgent, a novel tool‑augmented multi‑agent framework that mimics the workflow of human experts. SolAgent integrates a dual‑loop refinement mechanism: an inner loop using the Forge compiler to ensure functional correctness, and an outer loop leveraging the Slither static analyzer to eliminate security vulnerabilities. Additionally, the agent is equipped with file system capabilities to resolve complex project dependencies. Experiments on the SolEval+ Benchmark, a rigorous suite derived from high‑quality real‑world projects, demonstrate that SolAgent achieves a Pass@1 rate of up to 64.39%, significantly outperforming state‑of‑the‑art LLMs (～25%), AI IDEs (e.g., GitHub Copilot), and existing agent frameworks. Moreover, it reduces security vulnerabilities by up to 39.77% compared to human‑written baselines. Finally, we demonstrate that the high‑quality trajectories generated by SolAgent can be used to distill smaller, open‑source models, democratizing access to secure smart contract generation. We release our data and code at https://github.com/openpaperz/SolAgent.

PaperID: 613, https://arxiv.org/pdf/2601.22859.pdf GitHub

Authors: Chuanzhe Guo, Jingjing Wu, Sijun He, Yang Chen, Zhaoqi Kuang, Shilong Fan, Bingjin Chen, Siqi Bao, Jing Liu, Hua Wu, Qingfu Zhu, Wanxiang Che, Haifeng Wang

Title: MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Abstract:
The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi‑language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi‑agent Planning‑Execution‑Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail‑to‑Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData‑SWE, the largest open‑source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie‑research/MEnvAgent.

PaperID: 614, https://arxiv.org/pdf/2601.22292.pdf GitHub

Authors: Manuela Chacon-Chamorro, Luis Felipe Giraldo, Nicanor Quijano

Title: Learning Incentive Structures for Cooperative Resilience in Multi-Agent Systems under Social Dilemmas

Abstract:
Multi‑agent social dilemmas, such as the tragedy of the commons, capture settings where individual incentives conflict with collective well‑being, making these systems highly vulnerable to collapse under disruptions. In this context, this work studies cooperative resilience, understood as the system‑level ability to maintain collective well‑being under perturbations through adaptive agent behavior. We propose a framework for learning incentive structures aligned with collective well‑being in multi‑agent reinforcement learning systems, where reward functions shape individual decision‑making and collective behavior. A resilience metric is used to score and rank agent trajectories, allowing the inference of reward functions that promote resilient collective behavior. These inferred reward functions are integrated into the multi‑agent reinforcement learning process to shape agent interactions in social dilemma settings. The approach is evaluated in resource‑sharing environments subject to disruptions, using three incentive structures: individual incentives, resilience‑aligned incentives, and a hybrid incentive structure that combines both individual and collective components. The results show that the hybrid incentive structure promotes sustained collective behavior, reduces collapse events associated with resource depletion, and preserves system performance under disruption. These findings highlight the role of incentive design as a mechanism for promoting resilient collective behavior and provide a computational framework for multi‑agent social dilemmas under disruptions.

PaperID: 615, https://arxiv.org/pdf/2601.22184.pdf GitHub

Authors: Ido Aharon, Emanuele La Malfa, Michael Wooldridge, Sarit Kraus

Title: Tacit Coordination of Large Language Models

Abstract:
Large Language Models (LLMs) are increasingly deployed in multi‑agent settings that require coordination without communication, from human‑AI interaction to safety‑critical scenarios. Humans often overcome the absence of communication through focal points: salient solutions that naturally stand out to all participants. We present the first large‑scale evaluation of how, when, and why focal points emerge in LLMs, comparing their behaviour with humans across cooperative and competitive games, including realistic search and rescue scenarios, demonstrating when focal points enable effective coordination. Across more than 20 open‑ and closed‑source models, we find that LLMs exhibit a remarkable ability to coordinate without communication, often matching or outperforming humans. However, the same models consistently fail in tasks requiring numerical common sense or culturally nuanced notions of salience. We additionally evaluate simple learning‑free strategies that substantially improve coordination both among LLMs and between humans and LLMs. Our results reveal striking coordination capabilities, as well as social limitations in modern LLMs, and offer new insight into the latent notions of salience encoded within them. Our findings caution against assuming that LLMs share humans' cultural and perceptual substrate when deployed in coordination settings.

PaperID: 616, https://arxiv.org/pdf/2601.21742.pdf GitHub

Authors: Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan

Title: Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

Abstract:
Individual agents in multi‑agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history‑aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly‑built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3‑4B to outperform a history‑agnostic baseline 8x its size (Qwen 3‑30B) by accurately identifying reliable peers. ECL also boosts frontier models to near‑perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.

PaperID: 617, https://arxiv.org/pdf/2601.21714.pdf GitHub

Authors: Kaixiang Wang, Yidan Lin, Jiong Lou, Zhaojiacheng Zhou, Bunyod Suvonov, Jie Li

Title: E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

Abstract:
The evolution of Large Language Model (LLM) agents towards System~2 reasoning, characterized by deliberative, high‑precision problem‑solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de‑contextualization. By compressing complex sequential dependencies into pre‑defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E‑mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E‑mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context‑aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E‑mem achieves over 54% F1, surpassing the state‑of‑the‑art GAM by 7.75%, while reducing token cost by over 70%.

PaperID: 618, https://arxiv.org/pdf/2601.20334.pdf GitHub

Authors: Brian Y. Tsui, Alan Y. Fang, Tiffany J. Hwu

Title: Demonstration-Free Robotic Control via LLM Agents

Abstract:
Robotic manipulation has increasingly adopted vision‑language‑action (VLA) models, which achieve strong performance but typically require task‑specific demonstrations and fine‑tuning, and often generalize poorly under domain shift. We investigate whether general‑purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with less than 100 demonstrations per task, without requiring demonstrations or fine‑tuning. With one round of human feedback as an optional optimization, performance increases to 88.2% on LIBERO. This demonstration‑free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general‑purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task‑level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at https://github.com/robiemusketeer/faea‑sim

PaperID: 619, https://arxiv.org/pdf/2601.19204.pdf GitHub

Authors: Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi

Title: MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning

Abstract:
Recent vision‑language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand‑crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi‑Agent hierarchical Trainable Automaton), a multi‑agent system presented as a hierarchical finite‑state automaton for visual reasoning whose top‑level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule‑based sub‑automaton for reliable micro‑control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent's transition policy, we build transition‑trajectory trees and transform to memory‑to‑next‑state pairs, forming the MATA‑SFT‑90K dataset for supervised finetuning (SFT). The finetuned LLM as the transition policy understands the query and the capacity of agents, and it can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves the state‑of‑the‑art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.

PaperID: 620, https://arxiv.org/pdf/2601.19151.pdf GitHub

Authors: Patara Trirat, Jin Myung Kwak, Jay Heo, Heejun Lee, Sung Ju Hwang

Title: TS-Debate: Multimodal Collaborative Debate for Zero-Shot Time Series Reasoning

Abstract:
Recent progress at the intersection of large language models (LLMs) and time series (TS) analysis has revealed both promise and fragility. While LLMs can reason over temporal structure given carefully engineered context, they often struggle with numeric fidelity, modality interference, and principled cross‑modal integration. We present TS‑Debate, a modality‑specialized, collaborative multi‑agent debate framework for zero‑shot time series reasoning. TS‑Debate assigns dedicated expert agents to textual context, visual patterns, and numerical signals, preceded by explicit domain knowledge elicitation, and coordinates their interaction via a structured debate protocol. Reviewer agents evaluate agent claims using a verification‑conflict‑calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This architecture preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task‑specific fine‑tuning. Across 20 tasks spanning three public benchmarks, TS‑Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate in which all agents observe all inputs.

PaperID: 621, https://arxiv.org/pdf/2601.18733.pdf GitHub

Authors: Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool

Title: Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge

Abstract:
Recent advancements in multimodal large language models and vision‑languageaction models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi‑agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human‑agent interactions. To address the challenges posed by multi‑agent collaboration, we propose the Multi‑Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi‑agent embodied planning using vision‑language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi‑agent systems, contributing to the future development of advanced collaborative AI systems.

PaperID: 622, https://arxiv.org/pdf/2601.18700.pdf GitHub

Authors: Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin

Title: TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

Abstract:
Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text‑only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi‑turn emotional support. We introduce TEA‑Bench, the first interactive benchmark for evaluating tool‑augmented agents in ESC, featuring realistic emotional scenarios, an MCP‑style tool environment, and process‑level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity‑dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA‑Dialog, a dataset of tool‑enhanced ESC dialogues, and find that supervised fine‑tuning improves in‑distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents. Our code and data can be found in https://github.com/XingYuSSS/TEA‑Bench.

PaperID: 623, https://arxiv.org/pdf/2601.18510.pdf GitHub

Authors: Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi

Title: Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

Abstract:
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just‑In‑Time Reinforcement Learning (JitRL), a training‑free framework that enables test‑time policy optimization without any gradient updates. JitRL maintains a dynamic, non‑parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on‑the‑fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed‑form solution to the KL‑constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state‑of‑the‑art among training‑free methods. Crucially, JitRL outperforms the performance of computationally expensive fine‑tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.

PaperID: 624, https://arxiv.org/pdf/2601.18492.pdf GitHub

Authors: Zijun Li, Shijie Li, Zhenxi Zhang, Bin Li, Shoujun Zhou

Title: DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language‑driven navigation with improved interpretability. However, most LLM‑based agents still rely on single‑shot action decisions, where the model must choose one option from noisy, textualized multi‑perspective observations. Due to local mismatches and imperfect intermediate reasoning, such decisions can easily deviate from the correct path, leading to error accumulation and reduced reliability in unseen environments. In this paper, we propose DV‑VLN, a new VLN framework that follows a generate‑then‑verify paradigm. DV‑VLN first performs parameter‑efficient in‑domain adaptation of an open‑source LLaMA‑2 backbone to produce a structured navigational chain‑of‑thought, and then verifies candidate actions with two complementary channels: True‑False Verification (TFV) and Masked‑Entity Verification (MEV). DV‑VLN selects actions by aggregating verification successes across multiple samples, yielding interpretable scores for reranking. Experiments on R2R, RxR (English subset), and REVERIE show that DV‑VLN consistently improves over direct prediction and sampling‑only baselines, achieving competitive performance among language‑only VLN agents and promising results compared with several cross‑modal systems.Code is available at https://github.com/PlumJun/DV‑VLN.

PaperID: 625, https://arxiv.org/pdf/2601.18296.pdf GitHub

Authors: Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang

Title: Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

Abstract:
Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi‑hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed‑source APIs, limiting flexibility and scalability. We propose Temp‑R1, the first autonomous end‑to‑end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single‑action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B‑parameter Temp‑R1 achieves state‑of‑the‑art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. The code is available at https://github.com/zjukg/Temp‑R1.

PaperID: 626, https://arxiv.org/pdf/2601.18225.pdf GitHub

Authors: Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Kezhong Yan, Ken Deng, Qi Liu, Shuaibing Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, Bo Zheng

Title: ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants

Abstract:
Large language model (LLM)‑based agents are increasingly deployed in e‑commerce shopping. To perform thorough, user‑tailored product searches, agents should interpret personal preferences, engage in multi‑turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large‑scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best‑performing models achieve less than 40% full‑success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine‑tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at https://github.com/ShopAgent‑Team/ShopSimulator.

PaperID: 627, https://arxiv.org/pdf/2601.17887.pdf GitHub

Authors: Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin

Title: When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Abstract:
Long‑term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS‑Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory‑augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%‑‑243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from internal representations space, and propose a lightweight detection‑reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real‑world personalization, highlighting the importance of assessing safety under long‑term personal context. Our code is available at: https://github.com/MuyuenLP/PS‑Bench. WARNING: This paper may contain harmful content.

PaperID: 628, https://arxiv.org/pdf/2601.17744.pdf GitHub

Authors: Amjad Fatmi

Title: Faramesh: A Protocol-Agnostic Execution Control Plane for Autonomous Agent Systems

Abstract:
Autonomous agent systems increasingly trigger real‑world side effects: deploying infrastructure, modifying databases, moving money, and executing workflows. Yet most agent stacks provide no mandatory execution checkpoint where organizations can deterministically permit, deny, or defer an action before it changes reality. This paper introduces Faramesh, a protocol‑agnostic execution control plane that enforces execution‑time authorization for agent‑driven actions via a non‑bypassable Action Authorization Boundary (AAB). Faramesh canonicalizes agent intent into a Canonical Action Representation (CAR), evaluates actions deterministically against policy and state, and issues a decision artifact (PERMIT/DEFER/DENY) that executors must validate prior to execution. The system is designed to be framework‑ and model‑agnostic, supports multi‑agent and multi‑tenant deployments, and remains independent of transport protocols (e.g., MCP). Faramesh further provides decision‑centric, append‑only provenance logging keyed by canonical action hashes, enabling auditability, verification, and deterministic replay without re‑running agent reasoning. We show how these primitives yield enforceable, predictable governance for autonomous execution while avoiding hidden coupling to orchestration layers or observability‑only approaches.

PaperID: 629, https://arxiv.org/pdf/2601.17735.pdf GitHub

Authors: Kyungho Kim, Geon Lee, Juyeon Kim, Dongwon Choi, Shinhwan Kang, Kijung Shin

Title: ReFuGe: Feature Generation for Prediction Tasks on Relational Databases with LLM Agents

Abstract:
Relational databases (RDBs) play a crucial role in many real‑world web applications, supporting data management across multiple interconnected tables. Beyond typical retrieval‑oriented tasks, prediction tasks on RDBs have recently gained attention. In this work, we address this problem by generating informative relational features that enhance predictive performance. However, generating such features is challenging: it requires reasoning over complex schemas and exploring a combinatorially large feature space, all without explicit supervision. To address these challenges, we propose ReFuGe, an agentic framework that leverages specialized large language model agents: (1) a schema selection agent identifies the tables and columns relevant to the task, (2) a feature generation agent produces diverse candidate features from the selected schema, and (3) a feature filtering agent evaluates and retains promising features through reasoning‑based and validation‑based filtering. It operates within an iterative feedback loop until performance converges. Experiments on RDB benchmarks demonstrate that ReFuGe substantially improves performance on various RDB prediction tasks. Our code and datasets are available at https://github.com/K‑Kyungho/REFUGE.

PaperID: 630, https://arxiv.org/pdf/2601.17173.pdf GitHub

Authors: Parth Bhalerao, Diola Dsouza, Ruiwen Guan, Oana Ignat

Title: Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content

Abstract:
Question answering systems are typically evaluated on factual correctness, yet many real‑world applications‑such as education and career guidance‑require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long‑form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship‑focused question answering from long‑form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship‑focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single‑Agent, Dual‑Agent, RAG, and Multi‑Agent QA architectures under controlled conditions. Multi‑Agent pipelines consistently produce higher‑quality mentorship responses, with especially strong gains for complex topics and lower‑resource languages. We further analyze the reliability of automated LLM‑based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship‑focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM‑SCU/MentorQA.

PaperID: 631, https://arxiv.org/pdf/2601.17133.pdf GitHub

Authors: Inderjeet Singh, Eleonore Vissol-Gaudin, Andikan Otung, Motoyoshi Sekiya

Title: Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer LLM Federation

Abstract:
Fine‑tuning Large Language Models (LLMs) for specialized domains is constrained by a fundamental challenge: the need for diverse, cross‑organizational data conflicts with the principles of data privacy and sovereignty. While Federated Learning (FL) provides a framework for collaboration without raw data exchange, its classic centralized form introduces a single point of failure and remains vulnerable to model inversion attacks. Decentralized FL (DFL) mitigates this risk by removing the central aggregator but typically relies on inefficient, random peer‑to‑peer (P2P) pairings, forming a collaboration graph that is blind to agent heterogeneity and risks negative transfer. This paper introduces KNEXA‑FL, a novel framework for orchestrated decentralization that resolves this trade‑off. KNEXA‑FL employs a non‑aggregating Central Profiler/Matchmaker (CPM) that formulates P2P collaboration as a contextual bandit problem, using a LinUCB algorithm on abstract agent profiles to learn an optimal matchmaking policy. It orchestrates direct knowledge exchange between heterogeneous, PEFT‑based LLM agents via secure distillation, without ever accessing the models themselves. Our comprehensive experiments on a challenging code generation task show that KNEXA‑FL yields substantial gains, improving Pass@1 by approx. 50% relative to random P2P collaboration. Critically, our orchestrated approach demonstrates stable convergence, in stark contrast to a powerful centralized distillation baseline which suffers from catastrophic performance collapse. Our work establishes adaptive, learning‑based orchestration as a foundational principle for building robust and effective decentralized AI ecosystems.

PaperID: 632, https://arxiv.org/pdf/2601.16746.pdf GitHub

Authors: Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, Xiaodong Gu

Title: SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Abstract:
LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approaches such as LongLLMLingua have emerged to tackle this challenge, they typically rely on fixed metrics such as PPL, ignoring the task‑specific nature of code understanding. As a result, they frequently disrupt syntactic and logical structure and fail to retain critical implementation details. In this paper, we propose SWE‑Pruner, a self‑adaptive context pruning framework tailored for coding agents. Drawing inspiration from how human programmers "selectively skim" source code during development and debugging, SWE‑Pruner performs task‑aware adaptive pruning for long contexts. Given the current task, the agent formulates an explicit goal (e.g., "focus on error handling") as a hint to guide the pruning targets. A lightweight neural skimmer (0.6B parameters) is trained to dynamically select relevant lines from the surrounding context given the goal. Evaluations across four benchmarks and multiple models validate SWE‑Pruner's effectiveness in various scenarios, achieving 23‑54% token reduction on agent tasks like SWE‑Bench Verified while even improving success rates, and up to 14.84x compression on single‑turn tasks like LongCodeQA with minimal performance impact.

PaperID: 633, https://arxiv.org/pdf/2601.16087.pdf GitHub

Authors: Sukesh Subaharan

Title: Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics

Abstract:
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent‑level state. While prior work emphasizes turn‑local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long‑horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi‑turn dialogue. We introduce an agent‑level affective subsystem that maintains a continuous Valence‑Arousal‑Dominance (VAD) state external to the language model and governed by first‑ and second‑order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum‑based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25‑turn dialogue protocol, we compare stateless, first‑order, and second‑order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second‑order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade‑off between stability and responsiveness.

PaperID: 634, https://arxiv.org/pdf/2601.15322.pdf GitHub

Authors: Raffi Khatchadourian

Title: Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Abstract:
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, many deployments fail to return consistent results. We introduce the Determinism‑Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence‑conditioned faithfulness in tool‑using agents deployed in financial services. Across 4,700+ agentic runs (7 models, 4 providers, 3 financial benchmarks with 50 cases each at T=0.0), we find that decision determinism and task accuracy are not detectably correlated (r = ‑0.11, 95% CI [‑0.49, 0.31], p = 0.63, n = 21 configurations): models can be deterministic without being accurate, and accurate without being deterministic. Because neither metric predicts the other in our sample, both must be measured independently, which is precisely what DFAH provides. Small models (7‑20B) achieve near‑perfect determinism through rigid pattern matching at the cost of accuracy (20‑42%), while frontier models show moderate determinism (50‑96%) with variable accuracy. No model achieves both perfect determinism and high accuracy, supporting DFAH's multi‑dimensional measurement approach. We provide three financial benchmarks (compliance triage, portfolio constraints, and DataOps exceptions; 50 cases each) together with an open‑source stress‑test harness. Across these benchmarks and DFAH evaluation settings, Tier 1 models with schema‑first architectures achieved determinism levels consistent with audit replay requirements.

PaperID: 635, https://arxiv.org/pdf/2601.15299.pdf GitHub

Authors: Yash Sharma

Title: MALTopic: Multi-Agent LLM Topic Modeling Framework

Abstract:
Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often only consider free‑text responses and do not natively incorporate structured or categorical survey responses for topic modeling. And they produce abstract topics, requiring extensive human interpretation. To address these limitations, we propose the Multi‑Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi‑agent approach, MALTopic generates human‑readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.

PaperID: 636, https://arxiv.org/pdf/2601.15161.pdf GitHub

Authors: Yinzhu Chen, Abdine Maiga, Hossein A. Rahmani, Emine Yilmaz

Title: Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Abstract:
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert‑authored fine‑grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval‑augmented multi‑agent framework designed to automate the generation of instance‑specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine‑grained evaluation criteria. Evaluated on HealthBench and LLMEval‑Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, significantly outperforming the GPT‑4o baseline and demonstrating robust cross‑lingual generalization. In discriminative tests on HealthBench, our rubrics yield a 7.8% higher win rate than GPT‑4o baseline with nearly double score Δ, while ablation studies confirm its structural necessity. Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2%. This provides a scalable, cross‑lingual foundation for both evaluating and improving medical LLMs. The code is available at https://github.com/AmbeChen/Automated‑Rubric‑Generation.

PaperID: 637, https://arxiv.org/pdf/2601.15015.pdf GitHub

Authors: Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz

Title: Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

Abstract:
Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi‑agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU‑accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO, SAC, DPC, and TD‑MPC, and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning‑based flow control, and is available at github.com/safe‑autonomous‑systems/fluidgym.

PaperID: 638, https://arxiv.org/pdf/2601.14523.pdf GitHub

Authors: Leyi Zhao, Weijie Huang, Yitong Guo, Jiang Bian, Chenghong Wang, Xuhong Zhang

Title: Large Language Model-Powered Evolutionary Code Optimization on a Phylogenetic Tree

Abstract:
Optimizing scientific computing algorithms for modern GPUs is a labor‑intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)‑assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome‑based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM‑agent system that reframes GPU‑oriented algorithm optimization as an In‑Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory‑conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt‑based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first‑class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross‑lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi‑island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: https://github.com/annihi1ation/phylo_evolve

PaperID: 639, https://arxiv.org/pdf/2601.14349.pdf GitHub

Authors: Sunghyun Kim, Seokwoo Yun, Youngseo Yun, Youngrak Lee, Sangsoo Lim

Title: MARBLE: Multi-Agent Reasoning for Bioinformatics Learning and Evolution

Abstract:
Motivation: Developing high‑performing bioinformatics models typically requires repeated cycles of hypothesis formulation, architectural redesign, and empirical validation, making progress slow, labor‑intensive, and difficult to reproduce. Although recent LLM‑based assistants can automate isolated steps, they lack performance‑grounded reasoning and stability‑aware mechanisms required for reliable, iterative model improvement in bioinformatics workflows. Results: We introduce MARBLE, an execution‑stable autonomous model refinement framework for bioinformatics models. MARBLE couples literature‑aware reference selection with structured, debate‑driven architectural reasoning among role‑specialized agents, followed by autonomous execution, evaluation, and memory updates explicitly grounded in empirical performance. Across spatial transcriptomics domain segmentation, drug‑target interaction prediction, and drug response prediction, MARBLE consistently achieves sustained performance improvements over strong baselines across multiple refinement cycles, while maintaining high execution robustness and low regression rates. Framework‑level analyses demonstrate that structured debate, balanced evidence selection, and performance‑grounded memory are critical for stable, repeatable model evolution, rather than single‑run or brittle gains. Availability: Source code, data and Supplementary Information are available at https://github.com/PRISM‑DGU/MARBLE.

PaperID: 640, https://arxiv.org/pdf/2601.14288.pdf GitHub

Authors: Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao

Title: DeepInflation: an AI agent for research and model discovery of inflation

Abstract:
We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi‑agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval‑augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single‑field slow‑roll inflationary potentials consistent with the latest observations (here ACT DR6 results as example) or any given n_s and r, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non‑experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy‑cosmo/DeepInflation.

PaperID: 641, https://arxiv.org/pdf/2601.14245.pdf GitHub GitHub

Authors: Zhongyu Yang, Wei Pang, Yingfang Yuan

Title: XR: Cross-Modal Agents for Composed Image Retrieval

Abstract:
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity‑based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding‑based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross‑modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training‑free multi‑agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross‑modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi‑agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training‑free and training‑based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.

PaperID: 642, https://arxiv.org/pdf/2601.13709.pdf GitHub

Authors: Christopher Kao, Vanshika Vats, James Davis

Title: Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games

Abstract:
Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is known about their ability to deceive using natural language in social contexts. In this paper, we study deception in the Social Deduction Game (SDG) Mafia, where success is dependent on deceiving others through conversation. Unlike previous SDG studies, we use an asynchronous multi‑agent framework which better simulates realistic social contexts. We simulate 35 Mafia games with GPT‑4o LLM agents. We then create a Mafia Detector using GPT‑4‑Turbo to analyze game transcripts without player role information to predict the mafia players. We use prediction accuracy as a surrogate marker for deception quality. We compare this prediction accuracy to that of 28 human games and a random baseline. Results show that the Mafia Detector's mafia prediction accuracy is lower on LLM games than on human games. The result is consistent regardless of the game days and the number of mafias detected. This indicates that LLMs blend in better and thus deceive more effectively. We also release a dataset of LLM Mafia transcripts to support future research. Our findings underscore both the sophistication and risks of LLM deception in social contexts.

PaperID: 643, https://arxiv.org/pdf/2601.12996.pdf GitHub

Authors: Shiyuan Li, Yixin Liu, Yu Zheng, Mei Li, Quoc Viet Hung Nguyen, Shirui Pan

Title: OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models

Abstract:
Multi‑Agent Systems (MAS) offer a powerful paradigm for solving complex problems, yet their performance is critically dependent on the design of their underlying collaboration topology. As MAS become increasingly deployed in web services (e.g., search engines), designing adaptive topologies for diverse cross‑domain user queries becomes essential. Current graph learning‑based design methodologies often adhere to a "one‑for‑one" paradigm, where a specialized model is trained for each specific task domain. This approach suffers from poor generalization to unseen domains and fails to leverage shared structural knowledge across different tasks. To address this, we propose OFA‑TAD, a one‑for‑all framework that generates adaptive collaboration graphs for any task described in natural language through a single universal model. Our approach integrates a Task‑Aware Graph State Encoder (TAGSE) that filters task‑relevant node information via sparse gating, and a Mixture‑of‑Experts (MoE) architecture that dynamically selects specialized sub‑networks to drive node and edge prediction. We employ a three‑stage training strategy: unconditional pre‑training on canonical topologies for structural priors, large‑scale conditional pre‑training on LLM‑generated datasets for task‑topology mappings, and supervised fine‑tuning on empirically validated graphs. Experiments across six diverse benchmarks show that OFA‑TAD significantly outperforms specialized one‑for‑one models, generating highly adaptive MAS topologies. Code: https://github.com/Shiy‑Li/OFA‑MAS.

PaperID: 644, https://arxiv.org/pdf/2601.12538.pdf GitHub

Authors: Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He

Title: Agentic Reasoning for Large Language Models

Abstract:
Reasoning is a fundamental cognitive process underlying inference, problem‑solving, and decision‑making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed‑world settings, they struggle in open‑ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single‑agent capabilities including planning, tool use, and search in stable environments; self‑evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi‑agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in‑context reasoning, which scales test‑time interaction through structured orchestration, from post‑training reasoning, which optimizes behaviors via reinforcement learning and supervised fine‑tuning. We further review representative agentic reasoning frameworks across real‑world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long‑horizon interaction, world modeling, scalable multi‑agent training, and governance for real‑world deployment.

PaperID: 645, https://arxiv.org/pdf/2601.11888.pdf GitHub

Authors: Wenhan Liu, Xinyu Ma, Yutao Zhu, Yuchen Li, Daiting Shi, Dawei Yin, Zhicheng Dou

Title: Agentic-R: Learning to Retrieve for Agentic Search

Abstract:
Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi‑step reasoning with on‑demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity‑based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike retrievers designed for single‑turn retrieval‑augmented generation (RAG) that only rely on local passage utility, we propose to use both local query‑passage relevance and global answer correctness to measure passage utility in a multi‑turn agentic search. We further introduce an iterative training strategy, where the search agent and the retriever are optimized bidirectionally and iteratively. Different from RAG retrievers that are only trained once with fixed questions, our retriever is continuously improved using evolving and higher‑quality queries from the agent. Extensive experiments on seven single‑hop and multi‑hop QA benchmarks demonstrate that our retriever, termed \ours, consistently outperforms strong baselines across different search agents. Our codes are available at: https://github.com/8421BCD/Agentic‑R.

PaperID: 646, https://arxiv.org/pdf/2601.11633.pdf GitHub

Authors: Xuchen Li, Xuzhao Li, Renjie Pi, Shiyu Hu, Jian Zhao, Jiahui Gao

Title: Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images

Abstract:
Despite the remarkable progress of Vision‑Language Models (VLMs) in adopting "Thinking‑with‑Images" capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome‑oriented accuracy, lacking the capability to assess whether models can accurately leverage fine‑grained visual cues for multi‑step reasoning. To address these limitations, we propose ViEBench, a process‑verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi‑scenario high‑resolution images with expert‑annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual‑axis matrix that provides fine‑grained metrics through four diagnostic quadrants, enabling transparent diagnosis of model behavior across varying task complexities. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions. Our findings demonstrate that ViEBench can serve as a more explainable and practical benchmark for comprehensively evaluating the effectiveness agentic VLMs. The codes will be released at: https://github.com/Xuchen‑Li/ViEBench.

PaperID: 647, https://arxiv.org/pdf/2601.10744.pdf GitHub

Authors: Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang, Yuan Xie, Xin Tan

Title: Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

Abstract:
An ideal embodied agent should possess lifelong learning capabilities to handle long‑horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long‑term episodic memory to optimize decision‑making. However, existing mainstream one‑shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long‑term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision‑making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE‑Bench, incorporating multi‑goal navigation and memory‑based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine‑tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi‑task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state‑of‑the‑art embodied exploration models demonstrate that our approach achieves significant advantages in long‑horizon embodied tasks. Our dataset and code will be released at https://wangsen99.github.io/papers/lmee/

PaperID: 648, https://arxiv.org/pdf/2601.10649.pdf GitHub

Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave

Title: MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning

Abstract:
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western‑centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce MINERVA‑Cultural, a challenging benchmark for multicultural and multilingual video reasoning. MINERVA‑Cultural comprises high‑quality, entirely human‑generated annotations from diverse, region‑specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, MINERVA‑Cultural provides complex questions, answers, and multi‑step reasoning steps, all crafted in native languages. Making progress on MINERVA‑Cultural requires a deeply situated understanding of visual cultural context. Furthermore, we leverage MINERVA‑Cultural's reasoning traces to construct evidence‑based graphs and propose a novel iterative strategy using these graphs to identify fine‑grained errors in reasoning. Our evaluations reveal that SoTA Video‑LLMs struggle significantly, performing substantially below human‑level accuracy, with errors primarily stemming from the visual perception of cultural elements. MINERVA‑Cultural will be publicly available under https://github.com/google‑deepmind/neptune?tab=readme‑ov‑file\#minerva‑cultural

PaperID: 649, https://arxiv.org/pdf/2601.10560.pdf GitHub

Authors: Xi Shi, Mengxin Zheng, Qian Lou

Title: Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi‑step execution and repeated model invocations, severely limiting their scalability and usability in time‑sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning‑based orchestration of multi‑agent systems with explicit latency supervision under parallel execution. We propose Latency‑Aware Multi‑agent System (LAMaS), a latency‑aware multi‑agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38‑46% compared to the state‑of‑the‑art baseline for multi‑agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi‑agent systems. The code is available at https://github.com/xishi404/LAMaS

PaperID: 650, https://arxiv.org/pdf/2601.10228.pdf GitHub

Authors: Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang

Title: Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

Abstract:
Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD‑EPIC VQA due to ambiguous queries/options, poor long‑range temporal reasoning, and non‑standardized outputs. We propose a framework integrating query/choice pre‑processing, domain‑specific Qwen2.5‑VL fine‑tuning, a novel Temporal Chain‑of‑Thought (T‑CoT) prompting for multi‑step reasoning, and robust post‑processing. This system achieves 41.6% accuracy on HD‑EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code, fine‑tuned models are available at https://github.com/YoungSeng/Egocentric‑Co‑Pilot.

PaperID: 651, https://arxiv.org/pdf/2601.10156.pdf GitHub

Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao

Title: ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Abstract:
While LLM‑based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step‑level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under‑explored. In this work, we first construct TS‑Bench, a novel benchmark for step‑level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS‑Guard, using multi‑task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action‑attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS‑Flow, a guardrail‑feedback‑driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct‑style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

PaperID: 652, https://arxiv.org/pdf/2601.10120.pdf GitHub

Authors: Rui Sun, Jie Ding, Chenghua Gong, Tianjun Gu, Yihang Jiang, Juyuan Zhang, Liming Pan, Linyuan Lü

Title: TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems

Abstract:
Optimizing communication topology in LLM‑based multi‑agent system is critical for enabling collective intelligence. Existing methods mainly rely on spatio‑temporal interaction paradigms, where the sequential execution of multi‑round dialogues incurs high latency and computation. Motivated by the recent insights that evaluation and debate mechanisms can improve problem‑solving in multi‑agent systems, we propose TopoDIM, a framework for one‑shot Topology generation with Diverse Interaction Modes. Designed for decentralized execution to enhance adaptability and privacy, TopoDIM enables agents to autonomously construct heterogeneous communication without iterative coordination, achieving token efficiency and improved task performance. Experiments demonstrate that TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state‑of‑the‑art methods. Moreover, the framework exhibits strong adaptability in organizing communication among heterogeneous agents. Code is available at: https://github.com/Sundiasy/TopoDIM.

PaperID: 653, https://arxiv.org/pdf/2601.09883.pdf GitHub

Authors: Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Roman J. Georgio, Peter Carroll, Zekun Guo

Title: Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via Agent-to-Agent Communication from CORAL

Abstract:
Most existing Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow‑driven designs are essentially rule‑based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real‑world tasks. To address these issues, we propose an Information‑Flow‑Orchestrated Multi‑Agent Paradigm via Agent‑to‑Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general‑purpose benchmark GAIA, using the representative workflow‑based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL's 55.15% by 8.49 percentage points with comparable token consumption. Further case‑level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: https://github.com/Coral‑Protocol/Beyond‑Rule‑Based‑Workflows

PaperID: 654, https://arxiv.org/pdf/2601.09635.pdf GitHub

Authors: Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo

Title: Large-Scale Optimization Model Auto-Formulation: Harnessing LLM Flexibility via Structured Workflow

Abstract:
Large‑scale optimization is a key backbone of modern business decision‑making. However, building these models is often labor‑intensive and time‑consuming. We address this by proposing LEAN‑LLM‑OPT, a LightwEight AgeNtic workflow construction framework for LLM‑assisted large‑scale OPTimization auto‑formulation. LEAN‑LLM‑OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step‑by‑step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. The agentic workflow leverages common modeling practices to structure the modeling process into a sequence of sub‑tasks, offloading mechanical data‑handling operations to auxiliary tools. This reduces the LLM's burden in planning and data handling, allowing us to exploit its flexibility to address unstructured components. Extensive simulations show that LEAN‑LLM‑OPT, instantiated with GPT‑4.1 and the open source gpt‑oss‑20B, achieves strong performance on large‑scale optimization modeling tasks and is competitive with state‑of‑the‑art approaches. In addition, in a Singapore Airlines choice‑based revenue management use case, LEAN‑LLM‑OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large‑Scale‑OR and Air‑NRM, the first comprehensive benchmarks for large‑scale optimization auto‑formulation. The code and data of this work is available at https://github.com/CoraLiang01/lean‑llm‑opt.

PaperID: 655, https://arxiv.org/pdf/2601.09259.pdf GitHub

Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu

Title: MAXS: Meta-Adaptive Exploration with LLM Agents

Abstract:
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta‑adaptive exploration with LLM agents https://github.com/exoskeletonzj/MAXS, a meta‑adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter‑step trend slopes to jointly select stable, consistent, and high‑value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi‑tool reasoning. We conduct extensive empirical studies across three base models (MiMo‑VL‑7B, Qwen2.5‑VL‑7B, Qwen2.5‑VL‑32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.

PaperID: 656, https://arxiv.org/pdf/2601.08829.pdf GitHub

Authors: Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang

Title: Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System

Abstract:
In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo‑ranked review system using real‑world conference paper submissions. Multiple LLM agent reviewers with different personas are engage in multi round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.

PaperID: 657, https://arxiv.org/pdf/2601.08605.pdf GitHub GitHub

Authors: Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li

Title: ExpSeek: Self-Triggered Experience Seeking for Web Agents

Abstract:
Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent‑environment interaction. We propose ExpSeek, which shifts experience toward step‑level proactive seeking: (1) estimating step‑level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing step‑level tailored experience content. Experiments on Qwen3‑8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self‑triggering signal, reveal that even a small‑scale 4B experience model can significantly boost the performance of larger agent models. The code is released at https://github.com/WYRipple/ExpSeek.

PaperID: 658, https://arxiv.org/pdf/2601.08129.pdf GitHub

Authors: Roland Rodriguez

Title: Emergent Coordination in Multi-Agent Systems via Pressure Fields and Temporal Decay

Abstract:
Current multi‑agent LLM frameworks rely on explicit orchestration patterns borrowed from human organizational structures: planners delegate to executors, managers coordinate workers, and hierarchical control flow governs agent interactions. These approaches suffer from coordination overhead that scales poorly with agent count and task complexity. We propose a fundamentally different paradigm inspired by natural coordination mechanisms: agents operate locally on a shared artifact, guided only by pressure gradients derived from measurable quality signals, with temporal decay preventing premature convergence. We formalize this as optimization over a pressure landscape and prove convergence guarantees under mild conditions. Empirically, on meeting room scheduling across 1,350 trials, pressure‑field coordination outperforms all baselines: 48.5% aggregate solve rate versus 12.6% for conversation‑based coordination, 1.5% for hierarchical control, and 0.4% for sequential and random baselines (all pairwise comparisons p < 0.001). Temporal decay is essential: disabling it reduces solve rate by 10 percentage points. On easy problems, pressure‑field achieves 86.7% solve rate. The approach maintains consistent performance from 1 to 4 agents. Implicit coordination through shared pressure gradients outperforms explicit hierarchical control, suggesting that constraint‑driven emergence offers a simpler and more effective foundation for multi‑agent AI.

PaperID: 659, https://arxiv.org/pdf/2601.07309.pdf GitHub

Authors: Zhuoka Feng, Kang Chen, Sihan Zhao, Kai Xiong, Yaoning Wang, Minshen Yu, Junjie Nian, Changyi Xiao, Yixin Cao, Yugang Jiang

Title: ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging

Abstract:
Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training‑free alternative by integrating multiple experts into a single model. In this paper, we propose Agent‑Role Merging (ARM), an activation‑guided, role‑conditioned neuron transplantation method for model merging in LLM agents. ARM improves existing merging methods from static natural language tasks to multi‑turn agent scenarios, and over the generalization ability across various interactive environments. This is achieved with a well designed 3‑step framework: 1) constructing merged backbones, 2) selection based on its role‑conditioned activation analysis, and 3) neuron transplantation for fine‑grained refinements. Without gradient‑based optimization, ARM improves cross‑benchmark generalization while enjoying efficiency. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain‑specific expert models, while demonstrating strong out‑of‑domain generalization.

PaperID: 660, https://arxiv.org/pdf/2601.07182.pdf GitHub

Authors: Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng

Title: PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization

Abstract:
Policy optimization for large language models often suffers from sparse reward signals in multi‑step reasoning tasks. Critic‑free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning . While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low‑reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process‑level guidance in a critic‑free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token‑level advantages, and aligns their distribution with outcome advantages through location‑parameter shift. On MATH500, PRPO improves Qwen2.5‑Math‑1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine‑grained credit assignment within critic‑free optimization. Code is available at: https://github.com/SchumiDing/srpocode

PaperID: 661, https://arxiv.org/pdf/2601.06966.pdf GitHub

Authors: Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen

Title: RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Abstract:
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long‑term consistency. However, existing benchmarks primarily focus on casual conversation or task‑oriented dialogue, failing to capture "long‑term project‑oriented" interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross‑session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi‑Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long‑term project states and dynamic context dependencies inherent in real‑world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).

PaperID: 662, https://arxiv.org/pdf/2601.06838.pdf GitHub

Authors: Takaaki Toda, Tatsuya Mori

Title: CHASE: LLM Agents for Dissecting Malicious PyPI Packages

Abstract:
Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi‑stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security‑critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high‑reliability multi‑agent architecture that addresses these limitations through a Plan‑and‑Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations. Our key insight is that reliability in LLM‑based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths. Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening. Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement. This work provides a blueprint for building reliable AI‑powered security tools that can scale with the growing complexity of modern software supply chains. Our project page is available at https://t0d4.github.io/CHASE‑AIware25/

PaperID: 663, https://arxiv.org/pdf/2601.06818.pdf GitHub

Authors: Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, Ran He

Title: AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

Abstract:
As LLM‑based agents operate over sequential multi‑step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single‑turn responses, diagnosing hallucinations in multi‑step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM‑based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high‑quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human‑Interaction, and Tool‑Use) and 14 sub‑categories, and (3) multi‑level annotations curated by humans, covering binary labels, hallucination‑responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top‑tier models (like GPT‑5, Gemini‑2.5‑Pro). The best‑performing model achieves only 41.1% step localization accuracy, where tool‑use hallucinations are the most challenging at just 11.6%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.

PaperID: 664, https://arxiv.org/pdf/2601.06692.pdf GitHub

Authors: Murad Farzulla

Title: The Axiom of Consent: Friction Dynamics in Multi-Agent Coordination

Abstract:
Multi‑agent systems face a fundamental coordination problem: agents must coordinate despite heterogeneous preferences, asymmetric stakes, and imperfect information. When coordination fails, friction emerges: measurable resistance manifesting as deadlock, thrashing, communication overhead, or outright conflict. This paper derives a formal framework for analyzing coordination friction from a single axiom: actions affecting agents require authorization from those agents in proportion to stakes. From this axiom of consent, we establish the kernel triple (α, σ, ε) (alignment, stake, and entropy) characterizing any resource allocation configuration. The friction equation F = σ (1 + ε)/(1 + α) predicts coordination difficulty as a function of preference alignment α, stake magnitude σ, and communication entropy ε. The Replicator‑Optimization Mechanism (ROM) governs evolutionary selection over coordination strategies: configurations generating less friction persist longer, establishing consent‑respecting arrangements as dynamical attractors rather than normative ideals. We develop formal definitions for resource consent, coordination legitimacy, and friction‑aware allocation in multi‑agent systems. The framework yields testable predictions: MARL systems with higher reward alignment exhibit faster convergence; distributed allocations accounting for stake asymmetry generate lower coordination failure; AI systems with interpretability deficits produce friction proportional to the human‑AI alignment gap. Applications to cryptocurrency governance and political systems demonstrate that the same equations govern friction dynamics across domains, providing a complexity science perspective on coordination under preference heterogeneity.

PaperID: 665, https://arxiv.org/pdf/2601.05899.pdf GitHub

Authors: Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

Title: TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Abstract:
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long‑term planning and decision‑making emerging as core general‑purpose capabilities for adapting to diverse scenarios and tasks. Real‑time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro‑level strategic planning and micro‑level tactical adaptation and action execution. Existing RTS game‑based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel‑based, textual, and structured game‑state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision‑making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape‑X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game‑based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).

PaperID: 666, https://arxiv.org/pdf/2601.05808.pdf GitHub

Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Title: EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Abstract:
Large language models (LLMs) are expected to be trained to act as agents in various real‑world environments, but this process relies on rich and varied tool‑interaction sandboxes. However, access to real systems is often restricted; LLM‑simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool‑interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule‑based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi‑turn, multi‑tool interactions. We release our code and data at https://github.com/RUC‑NLPIR/EnvScaler.

PaperID: 667, https://arxiv.org/pdf/2601.05427.pdf GitHub

Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan

Title: Anytime Detection of Strategic Deviations in Multi-Agent Systems

Abstract:
In many multi‑agent systems, agents interact repeatedly and are expected to settle into stable, rational behavior over time. Yet in practice, behavior often drifts, and detecting such deviations in real time remains an open challenge. We introduce a sequential testing framework that monitors whether observed play is consistent with a benchmark of strategic behavior, without assuming a fixed sample size. Our approach builds on the e‑value framework for safe anytime‑valid inference: by "betting" against the benchmark, we construct a test supermartingale that accumulates evidence whenever observed payoffs systematically violate the expected conditions. For repeated normal‑form games, we take equilibrium as the benchmark, yielding a statistically sound, interpretable measure of departure from equilibrium that can be monitored online; our framework unifies the treatment of Nash, correlated, and coarse correlated equilibria, offering finite‑time guarantees and a detailed analysis of detection times. We also leverage Benjamini‑Hochberg‑type procedures to increase detection power in large games while rigorously controlling the false discovery rate. Finally, we extend our method to stochastic games, verifying online whether observed trajectories adhere to a specified target policy, such as a computed equilibrium, broadening the framework's applicability to dynamic, state‑dependent settings.

PaperID: 668, https://arxiv.org/pdf/2601.04853.pdf GitHub

Authors: Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou

Title: RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

Abstract:
Cross‑domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single‑perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same‑distribution data. To address these gaps, we introduce RAAR, the first retrieval‑augmented agentic reasoning framework for cross‑domain misinformation detection. To enable cross‑domain transfer beyond same‑distribution assumptions, RAAR retrieves multi‑perspective source‑domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single‑perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi‑step reasoning paths through specialized multi‑agent collaboration, where perspective‑specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine‑tuning and reinforcement learning to train a single multi‑task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR‑8b and RAAR‑14b models. Evaluation on three cross‑domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross‑domain methods, advanced LLMs, and LLM‑based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.

PaperID: 669, https://arxiv.org/pdf/2601.04767.pdf GitHub

Authors: Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang

Title: AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Abstract:
LLM agents have emerged as powerful systems for tackling multi‑turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post‑training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn‑based Policy Optimization via Tree Search), a unified framework for multi‑turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn‑level tree structure that jointly enables Entropy‑Guided Tree Expansion for strategic exploration and Turn‑wise Credit Assignment for fine‑grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn‑based Policy Optimization, a turn‑level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi‑turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state‑of‑the‑art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.

PaperID: 670, https://arxiv.org/pdf/2601.04703.pdf GitHub

Authors: Yiqun Chen, Lingyong Yan, Zixuan Yang, Erhan Zhang, Jiashu Zhao, Shuaiqiang Wang, Dawei Yin, Jiaxin Mao

Title: Beyond Monolithic Architectures: A Multi-Agent Search and Knowledge Optimization Framework for Agentic Search

Abstract:
Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use. However, prevailing systems rely on monolithic agents that suffer from structural bottlenecks, including unconstrained reasoning outputs that inflate trajectories, sparse outcome‑level rewards that complicate credit assignment, and stochastic search noise that destabilizes learning. To address these challenges, we propose M‑ASK (Multi‑Agent Search and Knowledge), a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context. This decomposition allows each agent to focus on a well‑defined subtask and reduces interference between search and context construction. Furthermore, to enable stable coordination, M‑ASK employs turn‑level rewards to provide granular supervision for both search decisions and knowledge updates. Experiments on multi‑hop QA benchmarks demonstrate that M‑ASK outperforms strong baselines, achieving not only superior answer accuracy but also significantly more stable training dynamics.\footnoteThe source code for M‑ASK is available at https://github.com/chenyiqun/M‑ASK.

PaperID: 671, https://arxiv.org/pdf/2601.04505.pdf GitHub

Authors: Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

Title: CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Abstract:
Generating accurate circuit schematics from high‑level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non‑machine‑readable outputs. To address this, we present CircuitLM, a multi‑agent pipeline that translates user prompts into structured, visually interpretable \textttCircuitJSON schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding‑powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain‑of‑thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force‑directed visualization. We evaluate the system on a dataset of 100 unique circuit‑design prompts using five state‑of‑the‑art LLMs. To systematically assess performance, we deploy a rigorous dual‑layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM‑as‑a‑judge meta‑evaluator identifies complex, context‑aware design flaws that bypass standard rule‑based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic‑ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

PaperID: 672, https://arxiv.org/pdf/2601.04500.pdf GitHub GitHub

Authors: Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang

Title: GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Abstract:
Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi‑modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: Goal‑Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution‑Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi‑agent framework that decouples navigation from verification via two modules: (i) a Planning‑Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1‑score of 48.90% (Pass@3) on GUITestBench, outperforming state‑of‑the‑art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance~\footnoteOur code is now available in~\hrefhttps://github.com/ADaM‑BJTU/GUITestBenchhttps://github.com/ADaM‑BJTU/GUITestBench.

PaperID: 673, https://arxiv.org/pdf/2601.04424.pdf GitHub

Authors: Yao Dou, Wei Xu

Title: Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Abstract:
Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long‑context tasks remains unclear. In this paper, we study multi‑document legal case summarization, where a single case often spans many documents totaling 100K‑500K tokens. We introduce Gavel‑Ref, a reference‑based evaluation framework with multi‑value checklist evaluation over 26 items, as well as residual fact and writing‑style evaluations. Using Gavel‑Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of S_\textGavel‑Ref, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi‑value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human‑written summaries ‑‑ making human references less reliable ‑‑ we develop Gavel‑Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel‑Agent reduces token usage by 36% while resulting in only a 7% drop in S_\textchecklist compared to end‑to‑end extraction with GPT‑4.1.

PaperID: 674, https://arxiv.org/pdf/2601.03655.pdf GitHub

Authors: Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, Ying-cong Chen

Title: VideoMemory: Toward Consistent Video Generation via Memory Integration

Abstract:
Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high‑quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity‑centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi‑agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story‑driven changes while preserving identity. This retrieval‑update mechanism enables consistent portrayal of entities across distant shots and supports coherent long‑form generation. To evaluate this setting, we construct a 54‑case multi‑shot consistency benchmark covering character‑, prop‑, and background‑persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity‑level coherence and high perceptual quality across diverse narrative sequences.

PaperID: 675, https://arxiv.org/pdf/2601.03315.pdf GitHub

Authors: Dhruv Trehan, Paras Chopra

Title: Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Abstract:
We report a case study of four end‑to‑end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi‑AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long‑horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI‑scientist systems, implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai‑scientist‑artefacts‑v1

PaperID: 676, https://arxiv.org/pdf/2601.03281.pdf GitHub

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Title: $α^3$-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

Abstract:
Large Language Models (LLMs) are increasingly used as high level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol compliant, and effective under realistic next generation networking constraints. This paper introduces α^3‑Bench, a benchmark for evaluating LLM driven UAV autonomy as a multi turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language mediated control loop between an LLM based UAV agent and a human operator, where decisions must satisfy strict schema validity, mission policies, speaker alternation, and safety constraints while adapting to fluctuating network slices, latency, jitter, packet loss, throughput, and edge load variations. To reflect modern agentic workflows, α^3‑Bench integrates a dual action layer supporting both tool calls and agent to agent coordination, enabling evaluation of tool use consistency and multi agent interactions. We construct a large scale corpus of 113k conversational UAV episodes grounded in UAVBench scenarios and evaluate 17 state of the art LLMs using a fixed subset of 50 episodes per scenario under deterministic decoding. We propose a composite α^3 metric that unifies six pillars: Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, and Communication Cost, with efficiency normalized scores per second and per thousand tokens. Results show that while several models achieve high mission success and safety compliance, robustness and efficiency vary significantly under degraded 6G conditions, highlighting the need for network aware and resource efficient LLM based UAV agents. The dataset is publicly available on GitHub : https://github.com/maferrag/AlphaBench

PaperID: 677, https://arxiv.org/pdf/2601.03248.pdf GitHub

Authors: Juntong Ni, Shiyu Wang, Qi He, Ming Jin, Wei Jin

Title: STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Abstract:
Spatio‑temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high‑stakes decision‑making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST‑Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in‑context forecasting, developed via a network SDE‑based multi‑agent data synthesis pipeline. We then propose STReasoner, which empowers LLM to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S‑GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real‑world data.

PaperID: 678, https://arxiv.org/pdf/2601.03204.pdf GitHub

Authors: Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li

Title: InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents

Abstract:
LLM agents can reason and use tools, but they often break down on long‑horizon tasks due to unbounded context growth and accumulated errors. Common remedies such as context compression or retrieval‑augmented prompting introduce trade‑offs between information fidelity and reasoning stability. We present InfiAgent, a general‑purpose framework that keeps the agent's reasoning context strictly bounded regardless of task duration by externalizing persistent state into a file‑centric state abstraction. At each step, the agent reconstructs context from a workspace state snapshot plus a fixed window of recent actions. Experiments on DeepResearch and an 80‑paper literature review task show that, without task‑specific fine‑tuning, InfiAgent with a 20B open‑source model is competitive with larger proprietary systems and maintains substantially higher long‑horizon coverage than context‑centric baselines. These results support explicit state externalization as a practical foundation for stable long‑horizon agents. Github Repo:https://github.com/ChenglinPoly/infiAgent

PaperID: 679, https://arxiv.org/pdf/2601.03144.pdf GitHub

Authors: Andrew Shin

Title: Self-Verification is All You Need To Pass The Japanese Bar Examination

Abstract:
Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true‑‑false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam‑level competence. In this paper, we present a self‑verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi‑agent inference and decomposition‑based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format‑faithful supervision and consistency verification, and suggest that carefully designed single‑model approaches can outperform more complex systems in high‑stakes professional reasoning tasks. Our dataset and codes are publicly available.

PaperID: 680, https://arxiv.org/pdf/2601.02950.pdf GitHub

Authors: Xuan Yang, Furong Jia, Roy Xie, Xiong Xi, Hengwei Bian, Jian Li, Monica Agrawal

Title: Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

Abstract:
Current Large Language Model reasoning systems process queries independently, discarding valuable cross‑instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch‑of‑Thought (BoT), a training‑free method that processes related queries jointly to enable cross‑instance learning. By performing comparative analysis across batches, BoT identifies high‑quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi‑agent reflection architecture (BoT‑R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT‑R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch‑aware reasoning benefits LLM systems. Our code is available at https://github.com/xuanyang19/BoT

PaperID: 681, https://arxiv.org/pdf/2601.02854.pdf GitHub

Authors: Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du

Title: M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

Abstract:
As an agent‑level reasoning and coordination paradigm, Multi‑Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single‑modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD‑Bench, a unified and extensible benchmark for evaluating MAD methods across Multi‑domain tasks, Multi‑modal inputs, and Multi‑dimensional metrics. M3MAD‑Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision‑language datasets, enabling controlled cross‑modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD‑Bench incorporates efficiency‑oriented metrics such as token consumption and inference time, providing a holistic view of performance‑‑cost trade‑offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text‑only and multimodal scenarios. We believe M3MAD‑Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD‑Bench.

PaperID: 682, https://arxiv.org/pdf/2601.02553.pdf GitHub

Authors: Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Title: SimpleMem: Efficient Lifelong Memory for LLM Agents

Abstract:
To support long‑term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three‑stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi‑view indexed memory units; (2) Online Semantic Synthesis, an intra‑session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent‑Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference‑time token consumption by up to 30‑fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming‑lab/SimpleMem.

PaperID: 683, https://arxiv.org/pdf/2601.02163.pdf GitHub GitHub

Authors: Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng

Title: EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

Abstract:
Large Language Models (LLMs) are increasingly deployed as long‑term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self‑organizing memory operating system that implements an engram‑inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time‑bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene‑guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state‑of‑the‑art performance on memory‑augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat‑oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind‑AI/EverMemOS.

PaperID: 684, https://arxiv.org/pdf/2601.02075.pdf GitHub

Authors: Zhuofan Shi, Hubao A, Yufei Shao, Dongliang Huang, Hongxu An, Chunxiao Xin, Haiyang Shen, Zhenyu Wang, Yunshan Na, Gang Huang, Xiang Jing

Title: MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics

Abstract:
Molecular dynamics (MD) simulations are essential for understanding atomic‑scale behaviors in materials science, yet writing LAMMPS scripts remains highly specialized and time‑consuming tasks. Although LLMs show promise in code generation and domain‑specific question answering, their performance in MD scenarios is limited by scarce domain data, the high deployment cost of state‑of‑the‑art LLMs, and low code executability. Building upon our prior MDAgent, we present MDAgent2, the first end‑to‑end framework capable of performing both knowledge Q&A and code generation within the MD domain. We construct a domain‑specific data‑construction pipeline that yields three high‑quality datasets spanning MD knowledge, question answering, and code generation. Based on these datasets, we adopt a three stage post‑training strategy‑‑continued pre‑training (CPT), supervised fine‑tuning (SFT), and reinforcement learning (RL)‑‑to train two domain‑adapted models, MD‑Instruct and MD‑Code. Furthermore, we introduce MD‑GRPO, a closed‑loop RL method that leverages simulation outcomes as reward signals and recycles low‑reward trajectories for continual refinement. We further build MDAgent2‑RUNTIME, a deployable multi‑agent system that integrates code generation, execution, evaluation, and self‑correction. Together with MD‑EvalBench proposed in this work, the first benchmark for LAMMPS code generation and question answering, our models and system achieve performance surpassing several strong baselines.This work systematically demonstrates the adaptability and generalization capability of large language models in industrial simulation tasks, laying a methodological foundation for automatic code generation in AI for Science and industrial‑scale simulations. URL: https://github.com/FredericVAN/PKU_MDAgent2

PaperID: 685, https://arxiv.org/pdf/2601.01885.pdf GitHub

Authors: Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu

Title: Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Abstract:
Large language model (LLM) agents face fundamental limitations in long‑horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long‑term memory (LTM) and short‑term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end‑to‑end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool‑based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three‑stage progressive reinforcement learning strategy and design a step‑wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long‑horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory‑augmented baselines across multiple LLM backbones, achieving improved task performance, higher‑quality long‑term memory, and more efficient context usage.

PaperID: 686, https://arxiv.org/pdf/2601.01685.pdf GitHub

Authors: Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

Title: Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Abstract:
As large language models (LLMs) transition to autonomous agents synthesizing real‑time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer‑Editor‑Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real‑world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open‑weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning‑specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio‑technical vulnerability in how LLM‑based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

PaperID: 687, https://arxiv.org/pdf/2601.01416.pdf GitHub

Authors: Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu

Title: AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval

Abstract:
Despite notable advancements in remote sensing vision‑language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real‑world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially‑aware dataset AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3DBB. To effectively leverage existing image understanding of VLMs to spatial domains, we adopt a two‑stage training strategy comprising Image Understanding Pre‑training and Spatial Understanding Fine‑tuning. Utilizing this trained spatially‑aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine‑grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot

PaperID: 688, https://arxiv.org/pdf/2601.01366.pdf GitHub

Authors: Zixian Liu, Sihao Liu, Yuqi Zhao

Title: KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models

Abstract:
With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross‑platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross‑platform tasks in educational contexts, especially when dealing with school‑specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private‑domain software. Additionally, current evaluation methods heavily rely on coarse‑grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge‑Augmented Dual‑Graph Evaluator for Cross‑Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual‑graph evaluation framework. We first constructed a dataset comprising 104 education‑related tasks, covering Windows, Android, and cross‑platform collaborative tasks. KGCE introduces a dual‑graph evaluation framework that decomposes tasks into multiple sub‑goals and verifies their completion status, providing fine‑grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private‑domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school‑specific software. The code can be found at https://github.com/Kinginlife/KGCE.

PaperID: 689, https://arxiv.org/pdf/2601.00290.pdf GitHub

Authors: Sixue Xing, Kerui Wu, Xuanye Xia, Meng Jiang, Jintai Chen, Tianfan Fu

Title: ClinicalReTrial: Clinical Trial Redesign with Self-Evolving Agents

Abstract:
Clinical trials constitute a critical yet exceptionally challenging and costly stage of drug development (\2.6B per drug), where protocols are encoded as complex natural language documents, motivating the use of AI systems beyond manual analysis. Existing AI methods accurately predict trial failure, but do not provide actionable remedies. To fill this gap, this paper proposes ClinicalReTrial, a multi‑agent system that formulates clinical trial optimization as an iterative redesign problem on textural protocols. Our method integrates failure diagnosis, safety‑aware modifications, and candidate evaluation in a closed‑loop, reward‑driven optimization framework. Serving the outcome prediction model as a simulation environment, ClinicalReTrial enables low‑cost evaluation and dense reward signals for continuous self‑improvement. We further propose a hierarchical memory that captures iteration‑level feedback within trials and distills transferable redesign patterns across trials. Empirically, ClinicalReTrial improves 83.3% of trial protocols with a mean success probability gain of 5.7% with negligible cost (\0.12 per trial). Retrospective case studies demonstrate alignment between the discovered redesign strategies and real‑world clinical trial modifications. The code is anonymously available at: https://github.com/xingsixue123/ClinicalFailureReasonReTrial.

PaperID: 690, https://arxiv.org/pdf/2512.23959.pdf GitHub

Authors: Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

Title: HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

Abstract:
Multi‑step retrieval‑augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high‑order correlations among primitive facts, thereby limiting models' capacity for multi‑step reasoning and resulting in fragmented reasoning and weak global sense‑making within extended contexts. We introduce HGMem, a hypergraph‑based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high‑order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense‑making benchmarks. Extensive experiments and in‑depth analyses demonstrate that our method consistently improves multi‑step RAG and substantially outperforms strong baseline systems across diverse datasets.

PaperID: 691, https://arxiv.org/pdf/2512.23631.pdf GitHub

Authors: Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong

Title: BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Abstract:
Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real‑world software engineering (SWE) problems that are long‑horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow‑interpreting issues, navigating large codebases, and implementing fixes‑within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub‑agents for sub‑tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub‑agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub‑agents within a team. We address these challenges by formulating hierarchy discovery as a multi‑armed bandit (MAB) problem, where each arm represents a candidate sub‑agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub‑agent designs under limited evaluation budgets. On SWE‑bench‑Verified, BOAD outperforms single‑agent and manually designed multi‑agent systems. On SWE‑bench‑Live, featuring more recent and out‑of‑distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT‑4 and Claude. These results demonstrate that automatically discovered hierarchical multi‑agent systems significantly improve generalization on challenging long‑horizon SWE tasks. Code is available at https://github.com/iamxjy/BOAD‑SWE‑Agent.

PaperID: 692, https://arxiv.org/pdf/2512.22860.pdf GitHub

Authors: Soham Padia, Dhananjay Vaidya, Ramchandra Mangrulkar

Title: Adaptive Trust Consensus for Blockchain IoT: Comparing RL, DRL, and MARL Against Naive, Collusive, Adaptive, Byzantine, and Sleeper Attacks

Abstract:
Securing blockchain‑enabled IoT networks against sophisticated adversarial attacks remains a critical challenge. This paper presents a trust‑based delegated consensus framework integrating Fully Homomorphic Encryption (FHE) with Attribute‑Based Access Control (ABAC) for privacy‑preserving policy evaluation, combined with learning‑based defense mechanisms. We systematically compare three reinforcement learning approaches ‑‑ tabular Q‑learning (RL), Deep RL with Dueling Double DQN (DRL), and Multi‑Agent RL (MARL) ‑‑ against five distinct attack families: Naive Malicious Attack (NMA), Collusive Rumor Attack (CRA), Adaptive Adversarial Attack (AAA), Byzantine Fault Injection (BFI), and Time‑Delayed Poisoning (TDP). Experimental results on a 16‑node simulated IoT network reveal significant performance variations: MARL achieves superior detection under collusive attacks (F1=0.85 vs. DRL's 0.68 and RL's 0.50), while DRL and MARL both attain perfect detection (F1=1.00) against adaptive attacks where RL fails (F1=0.50). All agents successfully defend against Byzantine attacks (F1=1.00). Most critically, the Time‑Delayed Poisoning attack proves catastrophic for all agents, with F1 scores dropping to 0.11‑0.16 after sleeper activation, demonstrating the severe threat posed by trust‑building adversaries. Our findings indicate that coordinated multi‑agent learning provides measurable advantages for defending against sophisticated trust manipulation attacks in blockchain IoT environments.

PaperID: 693, https://arxiv.org/pdf/2512.22733.pdf GitHub

Authors: Jiaqi Shao, Yufeng Miao, Wei Zhang, Bing Luo

Title: FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents

Abstract:
Long‑horizon reinforcement learning (RL) for large language models faces critical scalability challenges from unbounded context growth, leading to context folding methods that compress interaction history during task execution. However, existing approaches treat summary actions as standard actions, overlooking that summaries fundamentally modify the agent's future observation space, creating a policy‑dependent, non‑stationary observation distribution that violates core RL assumptions. This introduces three fundamental challenges: (1) gradient dilution where summary tokens receive insufficient training signal, (2) self‑conditioning where policy updates change summary distributions, creating a vicious cycle of training collapse, and (3) computational cost from processing unique contexts at each turn. We introduce FoldAct\footnotehttps://github.com/SHAO‑Jiaqi757/FoldAct, a framework that explicitly addresses these challenges through three key innovations: separated loss computation for independent gradient signals on summary and action tokens, full context consistency loss to reduce distribution shift, and selective segment training to reduce computational cost. Our method enables stable training of long‑horizon search agents with context folding, addressing the non‑stationary observation problem while improving training efficiency with 5.19× speedup.

PaperID: 694, https://arxiv.org/pdf/2512.22671.pdf GitHub

Authors: Pere Martra

Title: Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Abstract:
Structured width pruning of GLU‑MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction‑following capabilities improve substantially (+46% to +75% in IFEval for Llama‑3.2‑1B and 3B models), and multi‑step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction‑following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = ‑0.864, p = 0.012 in Llama‑3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA‑MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW‑guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context‑dependent efficiency trade‑offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single‑request latency, whereas batch processing workloads benefit uniformly.

PaperID: 695, https://arxiv.org/pdf/2512.22101.pdf GitHub

Authors: Shuyu Gan, Renxiang Wang, James Mooney, Dongyeop Kang

Title: A2P-Vis: an Analyzer-to-Presenter Agentic Pipeline for Visual Insights Generation and Reporting

Abstract:
Automating end‑to‑end data science pipeline with AI agents still stalls on two gaps: generating insightful, diverse visual evidence and assembling it into a coherent, professional report. We present A2P‑Vis, a two‑part, multi‑agent pipeline that turns raw datasets into a high‑quality data‑visualization report. The Data Analyzer orchestrates profiling, proposes diverse visualization directions, generates and executes plotting code, filters low‑quality figures with a legibility checker, and elicits candidate insights that are automatically scored for depth, correctness, specificity, depth and actionability. The Presenter then orders topics, composes chart‑grounded narratives from the top‑ranked insights, writes justified transitions, and revises the document for clarity and consistency, yielding a coherent, publication‑ready report. Together, these agents convert raw data into curated materials (charts + vetted insights) and into a readable narrative without manual glue work. We claim that by coupling a quality‑assured Analyzer with a narrative Presenter, A2P‑Vis operationalizes co‑analysis end‑to‑end, improving the real‑world usefulness of automated data analysis for practitioners. For the complete dataset report, please see: https://www.visagent.org/api/output/f2a3486d‑2c3b‑4825‑98d4‑5af25a819f56.

PaperID: 696, https://arxiv.org/pdf/2512.21507.pdf GitHub

Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang

Title: SVBench: Evaluation of Video Generation Models on Social Reasoning

Abstract:
Recent text‑to‑video generation models have made remarkable progress in visual realism, motion fidelity, and text‑video alignment, yet they still struggle to produce socially coherent behavior. Unlike humans, who readily infer intentions, beliefs, emotions, and social norms from brief visual cues, current models often generate literal scenes without capturing the underlying causal and psychological dynamics. To systematically assess this limitation, we introduce the first benchmark for social reasoning in video generation. Grounded in developmental and social psychology, the benchmark covers thirty classic social cognition paradigms spanning seven core dimensions: mental‑state inference, goal‑directed action, joint attention, social coordination, prosocial behavior, social norms, and multi‑agent strategy. To operationalize these paradigms, we build a fully training‑free agent‑based pipeline that distills the reasoning structure of each paradigm, synthesizes diverse video‑ready scenarios, enforces conceptual neutrality and difficulty control through cue‑based critique, and evaluates generated videos with a high‑capacity VLM judge along five interpretable dimensions of social reasoning. Using this framework, we conduct the first large‑scale evaluation of seven state‑of‑the‑art video generation systems. Results show a clear gap between surface‑level plausibility and deeper social reasoning, suggesting that current models remain limited in their ability to generate socially grounded behavior. https://github.com/Gloria2tt/SVBench‑Evaluation

PaperID: 697, https://arxiv.org/pdf/2512.20996.pdf GitHub

Authors: Yuwei Du, Jun Zhang, Jie Feng, Zhicheng Liu, Jian Yuan, Yong Li

Title: TrafficSimAgent: A Hierarchical Agent Framework for Autonomous Traffic Simulation with MCP Control

Abstract:
Traffic simulation is important for transportation optimization and policy making. While existing simulators such as SUMO and MATSim offer fully‑featured platforms and utilities, users without too much knowledge about these platforms often face significant challenges when conducting experiments from scratch and applying them to their daily work. To solve this challenge, we propose TrafficSimAgent, an LLM‑based agent framework that serves as an expert in experiment design and decision optimization for general‑purpose traffic simulation tasks. The framework facilitates execution through cross‑level collaboration among expert agents: high‑level expert agents comprehend natural language instructions with high flexibility, plan the overall experiment workflow, and invoke corresponding MCP‑compatible tools on demand; meanwhile, low‑level expert agents select optimal action plans for fundamental elements based on real‑time traffic conditions. Extensive experiments across multiple scenarios show that TrafficSimAgent effectively executes simulations under various conditions and consistently produces reasonable outcomes even when user instructions are ambiguous. Besides, the carefully designed expert‑level autonomous decision‑driven optimization in TrafficSimAgent yields superior performance when compared with other systems and SOTA LLM based methods.

PaperID: 698, https://arxiv.org/pdf/2512.20940.pdf GitHub

Authors: Shuhao Ye, Sitong Mao, Yuxiang Cui, Xuan Yu, Shichao Zhai, Wen Chen, Shunbo Zhou, Rong Xiong, Yue Wang

Title: ETP-R1: Evolving Topological Planning with Reinforcement Fine-tuning for Vision-Language Navigation in Continuous Environments

Abstract:
Vision‑Language Navigation in Continuous Environments (VLN‑CE) requires an embodied agent to navigate towards target in continuous environments, following natural language instructions. While current graph‑based methods offer an efficient, structured approach by abstracting the environment into a topological map and simplifying the action space to waypoint selection, they lag behind methods based on Large Vision‑Language Models (LVLMs) in leveraging large‑scale data and advanced training paradigms. In this paper, we try to bridge this gap by introducing ETP‑R1, a framework that applies the paradigm of scaling up data and Reinforcement Fine‑Tuning (RFT) to a graph‑based VLN‑CE model. To build a strong foundation, we first construct a high‑quality, large‑scale pretraining dataset using the Gemini API. This dataset consists of diverse, low‑hallucination instructions for topological trajectories, providing rich supervision for our graph‑based policy to map language to topological paths. This foundation is further strengthened by unifying data from both R2R and RxR tasks for joint pretraining. Building on this, we introduce a three‑stage training paradigm, which culminates in the first application of closed‑loop, online RFT to a graph‑based VLN‑CE model, powered by the Group Relative Policy Optimization (GRPO) algorithm. Extensive experiments demonstrate that our approach is highly effective, establishing new state‑of‑the‑art performance across all major metrics on both the R2R‑CE and RxR‑CE benchmarks. Our code is available at https://github.com/Cepillar/ETP‑R1.

PaperID: 699, https://arxiv.org/pdf/2512.20936.pdf GitHub

Authors: Hongxing Fan, Shuyu Zhao, Jiayang Ao, Lu Sheng

Title: Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation

Abstract:
Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi‑Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single‑pass synthesis. We integrate this framework with two critical mechanisms: (1) a self‑correcting Verification Agent that employs Chain‑of‑Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel‑level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC‑Score (MLLM Amodal Completion Score), a novel human‑aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state‑of‑the‑art methods across multiple datasets. Our project is available at: https://fanhongxing.github.io/remac‑page.

PaperID: 700, https://arxiv.org/pdf/2512.20629.pdf GitHub

Authors: Wenlong Tang

Title: Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

Abstract:
This study proposes a multi‑agent language framework that enables continual strategy evolution without fine‑tuning the language model's parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual‑loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text. Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long‑horizon multi‑round interactions. Experiments show that agents' latent spaces exhibit clear convergence trajectories under reflection‑driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low‑cost, scalable, and interpretable form of abstract strategic representation.

PaperID: 701, https://arxiv.org/pdf/2512.20618.pdf GitHub

Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

Title: LongVideoAgent: Multi-Agent Reasoning with Long Videos

Abstract:
Recent advances in multimodal LLMs and systems that use tools for long‑video QA point to the promise of reasoning over hour‑long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine‑grained cues. We propose a multi‑agent framework in which a master LLM coordinates a grounding agent to localize question‑relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi‑agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode‑level datasets aggregated from TVQA/TVQA+, our multi‑agent system significantly outperforms strong non‑agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.

PaperID: 702, https://arxiv.org/pdf/2512.20136.pdf GitHub

Authors: Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon, Sung June Kim, Hyeonsoo Im, JeungSub Lee, Sangpil Kim

Title: M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio‑visual domain remains challenging due to 1) limited modality coverage and multi‑hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off‑topic or redundant knowledge. To address these limitations, we propose M^3KG‑RAG, a Multi‑hop Multimodal Knowledge Graph‑enhanced RAG that retrieves query‑aligned audio‑visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi‑agent pipeline to construct multi‑hop MMKG (M^3KG), which contains context‑enriched triplets of multimodal entities, enabling modality‑wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer‑supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M^3KG‑RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches. Project website: https://kuai‑lab.github.io/cvpr2026m3kgrag/

PaperID: 703, https://arxiv.org/pdf/2512.19682.pdf GitHub

Authors: Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang

Title: GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Abstract:
Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real‑world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty‑aligned co‑evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's ``zone of proximal development''. This process is guided by a simple but effective α‑Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API‑Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro‑based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data‑efficient pathway for scaling agent capabilities.

PaperID: 704, https://arxiv.org/pdf/2512.19576.pdf GitHub

Authors: Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro

Title: LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

Abstract:
Attitude control is essential for many satellite missions. Classical controllers, however, are time‑consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in‑orbit demonstration of an AI‑based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius‑Maximilians‑Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI‑based attitude controller with the classical PD controller of InnoCube. Steady‑state metrics confirm the robust performance of the AI‑based controller during repeated in‑orbit maneuvers.

PaperID: 705, https://arxiv.org/pdf/2512.18950.pdf GitHub

Authors: Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili

Title: Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement

Abstract:
We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected‑utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state‑of‑the‑art LLM parameter‑training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample‑efficient, interpretable, and continually improving agents without LLM parameter updates.

PaperID: 706, https://arxiv.org/pdf/2512.18745.pdf GitHub

Authors: Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang

Title: InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Abstract:
The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real‑world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3‑Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3‑Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi‑step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3‑Bench. To make progress, we propose InSight‑o3, a multi‑agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search ‑‑ locating relational, fuzzy, or conceptual regions described in free‑form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose‑trained for this task via reinforcement learning. As a plug‑and‑play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3‑like open systems. Our code and dataset can be found at https://github.com/m‑Just/InSight‑o3 .

PaperID: 707, https://arxiv.org/pdf/2512.18622.pdf GitHub

Authors: Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen

Title: A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback

Abstract:
Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can hosted in‑house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi‑agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results using on benchmark datasets show that MATS, deployed on a single‑ GPU server, yields accuracy that are on‑par with large‑scale LLMs when using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats‑sql.

PaperID: 708, https://arxiv.org/pdf/2512.18542.pdf GitHub

Authors: Scott Thornton

Title: SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models

Abstract:
AI coding assistants produce vulnerable code in 45% of security‑relevant scenarios~\citeveracode2025, yet no public training dataset teaches both traditional web security and AI/ML‑specific defenses in a format suitable for instruction tuning. We present SecureCode, a production‑grade dataset of 2,185 multi‑turn security training examples spanning two domains: web application security (1,435 examples covering the OWASP Top 10 2021 across 11 languages and 9 frameworks, 100% grounded in documented CVEs and security incidents) and AI/ML security (750 examples covering all 10 OWASP LLM Top 10 2025 categories across more than 40 frameworks, including LangChain, OpenAI, and Hugging Face). Every example follows a 4‑turn conversational structure ‑‑ feature request; vulnerable and secure implementations with attack demonstrations; advanced probing; and defense‑in‑depth operational guidance ‑‑ designed for direct use in instruction tuning pipelines. Quality assurance combines automated structural validation with multi‑agent review from seven specialist AI perspectives (more than 10,500 assessments) and an 8‑phase remediation pipeline, producing a rubric‑calibrated mean quality score of 93.8/100 (σ= 0.93) for the AI/ML component. Each example provides SIEM integration strategies, infrastructure hardening recommendations, and testing approaches using production frameworks. We release the unified dataset on Hugging Face with domain‑specific loading configurations (web, aiml, default), alongside eight fine‑tuned open‑source models (3B‑‑20B parameters, QLoRA), and an evaluation framework with four security‑specific metrics. To our knowledge, SecureCode is the first public dataset that jointly provides OWASP Top 10 2021 web coverage and OWASP LLM Top 10 2025 AI/ML coverage in a unified conversational schema suitable for instruction tuning.

PaperID: 709, https://arxiv.org/pdf/2512.17900.pdf GitHub GitHub

Authors: Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa

Title: Diffusion Forcing for Multi-Agent Interaction Sequence Modeling

Abstract:
Understanding and generating multi‑person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter‑agent dependencies, and variable group sizes. Existing motion generation methods are largely task‑specific and do not generalize to flexible multi‑agent generation. We introduce MAGNet (Multi‑Agent Generative Network), a unified autoregressive diffusion framework for multi‑agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic and polyadic prediction, partner inpainting, partner prediction, and agentic generation all within a single model, and can autoregressively generate ultra‑long sequences spanning hundreds of motion steps. We explicitly model inter‑agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people. Please watch the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/

PaperID: 710, https://arxiv.org/pdf/2512.17838.pdf GitHub

Authors: Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar

Title: ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Abstract:
Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain‑specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high‑dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX‑MLE, a benchmark of 20 challenges derived from high‑impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML‑agent benchmarks, ReX‑MLE evaluates full end‑to‑end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state‑of‑the‑art agents (AIDE, ML‑Master, R&D‑Agent) with different LLM backends (GPT‑5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain‑knowledge and engineering limitations. ReX‑MLE exposes these bottlenecks and provides a foundation for developing domain‑aware autonomous AI systems.

PaperID: 711, https://arxiv.org/pdf/2512.17444.pdf GitHub

Authors: Javier Gonzalez-Ruiz, Carlos Rodriguez-Pardo, Iacopo Savelli, Alice Di Bella, Massimo Tavoni

Title: Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning

Abstract:
Electricity systems are key to transforming today's society into a carbon‑free economy. Long‑term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long‑term markets, this work presents a multi‑agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit‑maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi‑agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long‑term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.

PaperID: 712, https://arxiv.org/pdf/2512.16962.pdf GitHub

Authors: Saksham Sahai Srivastava, Haoyu He

Title: MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Abstract:
Large Language Model (LLM) agents increasingly rely on long‑term memory and Retrieval‑Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface, i.e., the trust boundary between an agent's reasoning core and its own past. In this paper, we introduce MemoryGraft. It is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious successful experiences into the agent's long‑term memory. Unlike traditional prompt injections that are transient, or standard RAG poisoning that targets factual knowledge, MemoryGraft exploits the agent's semantic imitation heuristic which is the tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign ingestion‑level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store where a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT's DataInterpreter agent with GPT‑4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience‑based self‑improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at https://github.com/Jacobhhy/Agent‑Memory‑Poisoning.

PaperID: 713, https://arxiv.org/pdf/2512.16924.pdf GitHub

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

Title: The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Abstract:
We present WorldCanvas, a framework for promptable world events that enables rich, user‑directed simulation by combining text, trajectories, and reference images. Unlike text‑only approaches and existing trajectory‑controlled image‑to‑video methods, our multimodal approach combines trajectories ‑‑ encoding motion, timing, and visibility ‑‑ with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi‑agent interactions, object entry/exit, reference‑guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user‑shaped simulators. Our project page is available at: https://worldcanvas.github.io/.

PaperID: 714, https://arxiv.org/pdf/2512.16698.pdf GitHub

Authors: Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Mohammad Nehad Alam, Proma Hossain Progga, Swakkhar Shatabda

Title: Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning

Abstract:
Diagram‑grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi‑agent design over single‑agent remain unclear. We systematically compare single‑agent and multi‑agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We‑Math. For open‑source models, multi‑agent consistently improves performance. For example, Qwen‑2.5‑VL (7B) gains +6.8 points and Qwen‑2.5‑VL (32B) gains +3.3 on Geometry3K, and both Qwen‑2.5‑VL variants see further gains on OlympiadBench and We‑Math. In contrast, the closed‑source Gemini‑2.0‑Flash generally performs better in single‑agent mode on classic benchmarks, while multi‑agent yields only modest improvements on the newer We‑Math dataset. These findings show that multi‑agent pipelines provide clear benefits for open‑source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter‑Solver

PaperID: 715, https://arxiv.org/pdf/2512.16584.pdf GitHub

Authors: Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, Ruixuan Li

Title: Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Abstract:
While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that take predefined external toolkits or generate images during thinking, however, humans can form flexible visual‑text imagination and interactions during thinking without predefined toolkits, where one important reason is that humans construct the visual‑text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch‑in‑Latents (SkiLa), a novel paradigm for unified multi‑modal reasoning that expands the auto‑regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi‑step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision‑centric tasks while exhibiting strong generalization to diverse general multi‑modal benchmarks. Codes will be released at https://github.com/TungChintao/SkiLa.

PaperID: 716, https://arxiv.org/pdf/2512.16465.pdf GitHub

Authors: Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang

Title: cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

Abstract:
Optimizing CUDA kernels is a challenging and labor‑intensive task, given the need for hardware‑software co‑design expertise and the proprietary nature of high‑performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy‑coordinated multi‑agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy‑coordinated evolution algorithm, roofline‑guided prompting, and strategy‑level population initialization. Experimental results show that the generated kernels by cuPilot achieve an average speed up of 3.09× over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open‑sourced at https://github.com/champloo2878/cuPilot‑Kernels.git.

PaperID: 717, https://arxiv.org/pdf/2512.16444.pdf GitHub GitHub

Authors: Yadong Li, Tong Zhang, Bo Huang, Zhen Cui

Title: StarCraft+: Benchmarking Multi-agent Algorithms in Adversary Paradigm

Abstract:
Deep multi‑agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and StarCraft multi‑agent challenge (SMAC) is widely‑used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built‑in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi‑agent algorithm‑vs‑algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter‑algorithm adversary with the consideration of fairness, usability and customizability, and meantime an adversarial PyMARL (APyMARL) library is developed with easy‑to‑use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual‑algorithm paired adversary and multi‑algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought‑provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released in \hrefhttps://github.com/dooliu/SC2BAGithub, and we believe that this work could mark a new step for the MARL field in the coming years.

PaperID: 718, https://arxiv.org/pdf/2512.16381.pdf GitHub

Authors: Zhihao Wang, Alessandro Cornacchia, Alessio Sacco, Franco Galante, Marco Canini, Dingde Jiang

Title: A Network Arena for Benchmarking AI Agents on Network Troubleshooting

Abstract:
Agentic systems, powered by Large Language Models (LLMs), assist network engineers with network configuration synthesis and network troubleshooting tasks. For network troubleshooting, progress is hindered by the absence of standardized and accessible benchmarks for evaluating LLM agents in dynamic network settings at low operational effort. We present NIKA, the largest public benchmark to date for LLM‑driven network incident diagnosis and troubleshooting. NIKA targets both domain experts and especially AI researchers alike, providing zero‑effort replay of real‑world network scenarios, and establishing well‑defined agent‑network interfaces for quick agent prototyping. NIKA comprises hundreds of curated network incidents, spanning five network scenarios, from data centers to ISP networks, and covers 54 representative network issues. Lastly, NIKA is modular and extensible by design, offering APIs to facilitate the integration of new network scenarios and failure cases. We evaluate state‑of‑the‑art LLM agents on NIKA and find that while larger models succeed more often in detecting network issues, they still struggle to localize faults and identify root causes. NIKA is open‑source and available to the community: https://github.com/sands‑lab/nika.

PaperID: 719, https://arxiv.org/pdf/2512.16279.pdf GitHub

Authors: Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu, Sherman S. M. Chow, Bo Zheng, Xiangyu Yue

Title: QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems

Abstract:
Safety risks arise as large language model‑based agents solve complex tasks with tools, multi‑step plans, and inter‑agent messages. However, deployer‑written policies in natural language are ambiguous and context dependent, so they map poorly to machine‑checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textscQuadSentinel, a four‑agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine‑checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top‑k predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST‑WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textscQuadSentinel improves guardrail accuracy and rule recall while reducing false positives. Against single‑agent baselines such as ShieldAgent (ICML~'25), it yields better overall safety control. Near‑term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine‑checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.

PaperID: 720, https://arxiv.org/pdf/2512.15374.pdf GitHub

Authors: Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

Title: SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Abstract:
Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce Self‑evolving Context Optimization via Prompt Evolution (SCOPE). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual‑Stream mechanism that routes guidelines between tactical memory (immediate error correction) and strategic memory, which is continuously refined through conflict resolution, subsumption pruning, and consolidation. To maximize strategy coverage, Perspective‑Driven Exploration evolves multiple parallel prompts guided by distinct optimization perspectives. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.

PaperID: 721, https://arxiv.org/pdf/2512.15163.pdf GitHub

Authors: Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang

Title: MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Abstract:
Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi‑server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real‑world coverage. We present MCP‑SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi‑turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi‑step reasoning and cross‑server coordination under uncertainty. Using MCP‑SafetyBench, we systematically evaluate leading open‑ and closed‑source LLMs, revealing that all models remain vulnerable to MCP attacks, with a notable safety‑utility trade‑off. Our results highlight the urgent need for stronger defenses and establish MCP‑SafetyBench as a foundation for diagnosing and mitigating safety risks in real‑world MCP deployments.

PaperID: 722, https://arxiv.org/pdf/2512.14735.pdf GitHub

Authors: Yuqun Zhang, Yuxuan Zhao, Sijia Chen

Title: PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents

Abstract:
This paper proposes PyFi, a novel framework for pyramid‑like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple‑to‑complex manner. At the core of PyFi is PyFi‑600K, a dataset comprising 600K financial question‑answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi‑adv, a multi‑agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine‑grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine‑tuning Qwen2.5‑VL‑3B and Qwen2.5‑VL‑7B on the pyramid‑structured question chains enables these models to answer complex financial questions by decomposing them into sub‑questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi .

PaperID: 723, https://arxiv.org/pdf/2512.14720.pdf GitHub

Authors: Dizhan Xue, Jing Cui, Shengsheng Qian, Chuanrui Hu, Changsheng Xu

Title: SoMe: A Realistic Benchmark for LLM-based Social Media Agents

Abstract:
Intelligent agents powered by large language models (LLMs) have recently demonstrated impressive capabilities and gained increasing popularity on social media platforms. While LLM agents are reshaping the ecology of social media, there exists a current gap in conducting a comprehensive evaluation of their ability to comprehend media content, understand user behaviors, and make intricate decisions. To address this challenge, we introduce SoMe, a pioneering benchmark designed to evaluate social media agents equipped with various agent tools for accessing and analyzing social media data. SoMe comprises a diverse collection of 8 social media agent tasks, 9,164,284 posts, 6,591 user profiles, and 25,686 reports from various social media platforms and external websites, with 17,869 meticulously annotated task queries. Compared with the existing datasets and benchmarks for social media tasks, SoMe is the first to provide a versatile and realistic platform for LLM‑based social media agents to handle diverse social media tasks. By extensive quantitative and qualitative analysis, we provide the first overview insight into the performance of mainstream agentic LLMs in realistic social media environments and identify several limitations. Our evaluation reveals that both the current closed‑source and open‑source LLMs cannot handle social media agent tasks satisfactorily. SoMe provides a challenging yet meaningful testbed for future social media agents. Our code and data are available at https://github.com/LivXue/SoMe

PaperID: 724, https://arxiv.org/pdf/2512.14252.pdf GitHub

Authors: Kelly J. Davis

Title: Gödel's Poetry

Abstract:
Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence. We introduce here a new approach to computer theorem proving, one that employs specialized language models for Lean4 proof generation combined with recursive decomposition of difficult theorems into simpler entailing propositions. These models are coordinated through a multi‑agent architecture that orchestrates autoformalization (if required), proof generation, decomposition of difficult theorems into simpler entailing propositions, and recursive proof (and/or decomposition) of these propositions. Without decomposition, we achieve a 90.4% pass rate on miniF2F. With decomposition, this is significantly improved. A key technical contribution lies in our extension of the Kimina Lean Server with abstract syntax tree (AST) parsing capabilities to facilitate automated, recursive proof decomposition. The system is made available on PyPI as goedels‑poetry (at https://pypi.org/project/goedels‑poetry ), and the open‑source implementation KellyJDavis/goedels‑poetry (at https://github.com/KellyJDavis/goedels‑poetry ) facilitates both adaptation to alternative language models and extension with custom functionality.

PaperID: 725, https://arxiv.org/pdf/2512.13526.pdf GitHub

Authors: Asa Cooper Stickland, Jan Michelfeit, Arathi Mani, Charlie Griffin, Ollie Matthews, Tomek Korbak, Rogan Inglis, Oliver Makins, Alan Cooney

Title: Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Abstract:
LLM‑based software engineering agents are increasingly used in real‑world development tasks, often with access to sensitive data or security‑critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high‑stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer's internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at 1% false positive rate on a held out test environment. Then, we estimate risk of sabotage at deployment time by extrapolating from our monitor's false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: https://github.com/UKGovernmentBEIS/async‑control.

PaperID: 726, https://arxiv.org/pdf/2512.13399.pdf GitHub

Authors: Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou

Title: Differentiable Evolutionary Reinforcement Learning

Abstract:
Crafting effective reward signals remains a central challenge in Reinforcement Learning (RL), especially for complex reasoning tasks. Existing automated reward optimization methods typically rely on derivative‑free search heuristics that treat the reward function as a black box, failing to exploit the causal dynamics between reward structure modifications and policy performance. We introduce Differentiable Evolutionary Reinforcement Learning (DERL), a bi‑level framework for the autonomous discovery of optimal reward structures. DERL employs a Meta‑Optimizer that evolves a reward function through the composition of structured atomic primitives to guide an inner‑loop policy. Unlike prior black‑box methods, DERL introduces differentiability into the meta‑optimization process by updating the Meta‑Optimizer using policy gradients derived from inner‑loop validation performance. This allows for the progressive learning of a "meta‑gradient" for task success, providing the system with dense, actionable feedback. We validate DERL across diverse reasoning domains: embodied agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state‑of‑the‑art performance on agent benchmarks, substantially outperforming non‑differentiable baselines‑especially in out‑of‑distribution generalization. Trajectory analyses confirm that DERL captures the intrinsic causal structure of tasks, enabling fully autonomous, self‑improving agent alignment.

PaperID: 727, https://arxiv.org/pdf/2512.12597.pdf GitHub

Authors: Miriam Horovicz

Title: AgentSHAP: Interpreting LLM Agent Tool Importance with Monte Carlo Shapley Value Estimation

Abstract:
LLM agents that use external tools can solve complex tasks, but understanding which tools actually contributed to a response remains a blind spot. No existing XAI methods address tool‑level explanations. We introduce AgentSHAP, the first framework for explaining tool importance in LLM agents. AgentSHAP is model‑agnostic: it treats the agent as a black box and works with any LLM (GPT, Claude, Llama, etc.) without needing access to internal weights or gradients. Using Monte Carlo Shapley values, AgentSHAP tests how an agent responds with different tool subsets and computes fair importance scores based on game theory. Our contributions are: (1) the first explainability method for agent tool attribution, grounded in Shapley values from game theory; (2) Monte Carlo sampling that reduces cost from O(2n) to practical levels; and (3) comprehensive experiments on API‑Bank showing that AgentSHAP produces consistent scores across runs, correctly identifies which tools matter, and distinguishes relevant from irrelevant tools. AgentSHAP joins TokenSHAP (for tokens) and PixelSHAP (for image regions) to complete a family of Shapley‑based XAI tools for modern generative AI. Code: https://github.com/GenAISHAP/TokenSHAP.

PaperID: 728, https://arxiv.org/pdf/2512.11891.pdf GitHub

Authors: Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, Xiao He

Title: VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Abstract:
Vision‑Language‑Action (VLA) models have demonstrated remarkable capabilities in generalizing across diverse robotic manipulation tasks. However, deploying these models in unstructured environments remains challenging due to the critical need for simultaneous task compliance and safety assurance, particularly in preventing potential collisions during physical interactions. In this work, we introduce a Vision‑Language‑Safe Action (VLSA) architecture, named AEGIS, which contains a plug‑and‑play safety constraint (SC) layer formulated via control barrier functions. AEGIS integrates directly with existing VLA models to improve safety with theoretical guarantees, while maintaining their original instruction‑following performance. To evaluate the efficacy of our architecture, we construct a comprehensive safety‑critical benchmark SafeLIBERO, spanning distinct manipulation scenarios characterized by varying degrees of spatial complexity and obstacle intervention. Extensive experiments demonstrate the superiority of our method over state‑of‑the‑art baselines. Notably, AEGIS achieves a 59.16% improvement in obstacle avoidance rate while substantially increasing the task execution success rate by 17.25%. To facilitate reproducibility and future research, we make our code, models, and the benchmark datasets publicly available at https://vlsa‑aegis.github.io/.

PaperID: 729, https://arxiv.org/pdf/2512.11781.pdf GitHub

Authors: Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio

Title: Agile Flight Emerges from Multi-Agent Competitive Racing

Abstract:
Through multi‑agent competition and the sparse high‑level objective of winning a race, we find that both agile flight (e.g., high‑speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi‑agent competition yields policies that transfer more reliably to the real world than policies trained with a single‑agent progress‑based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim‑to‑real transfer, the multi‑agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi‑agent competitive game‑play in digital domains, shows that sparse task‑level rewards are sufficient for training agents capable of advanced low‑level control in the physical world. Code: https://github.com/Jirl‑upenn/AgileFlight_MultiAgent

PaperID: 730, https://arxiv.org/pdf/2512.11689.pdf GitHub

Authors: Manuela Chacon-Chamorro, Juan Sebastián Pinzón, Rubén Manrique, Luis Felipe Giraldo, Nicanor Quijano

Title: Evaluating Cooperative Resilience in Multiagent Systems: A Comparison Between Humans and LLMs

Abstract:
This paper presents a comparative analysis of cooperative resilience in multi‑agent systems, defined as the ability to anticipate, resist, recover from, and transform to disruptive events that affect collective well‑being. We focus on mixed‑motive social dilemmas instantiated as a Tragedy of the Commons environment from the Melting Pot suite, where we systematically compare human groups and Large Language Model (LLM)‑based agents, each evaluated with and without explicit communication. Cooperative resilience is assessed under a continuously disruptive condition induced by a persistent unsustainable consumption bot, together with intermittent environmental shocks implemented as stochastic removal of shared resources across scenarios. This experimental design establishes a benchmark for cooperative resilience across agent architectures and interaction modalities, constituting a key step toward systematically comparing humans and LLM‑based agents. Using this framework, we find that human groups with communication achieve the highest cooperative resilience compared to all other groups. Communication also improves the resilience of LLM agents, but their performance remains below human levels. Motivated by the performance of humans, we further examine a long‑horizon setting with harsher environmental conditions, where humans sustain the shared resource and maintain high resilience in diverse disruption scenarios. Together, these results suggest that human decision‑making under adverse social conditions can inform the design of artificial agents that promote prosocial and resilient behaviors.

PaperID: 731, https://arxiv.org/pdf/2512.11485.pdf GitHub

Authors: Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang

Title: Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation

Abstract:
With the growing adoption of Large Language Model (LLM) agents in persistent, real‑world roles, they naturally encounter continuous streams of tasks and inevitable failures. A key limitation, however, is their inability to systematically learn from these mistakes, forcing them to repeat identical errors in similar contexts. Unlike prior training‑free methods that primarily store raw instance‑level experience or focus on retrieving successful trajectories, we propose Mistake Notebook Learning (MNL), a novel memory framework that enables agents to self‑curate generalizable guidance from batch‑clustered failures. This mechanism allows agents to distill shared error patterns into structured "mistake notes," updating an external memory only when batch performance improves to ensure stability. To further amplify adaptability, we integrate MNL with test‑time scaling, leveraging aggregated failure patterns to actively steer the search process away from known pitfalls. Experiments on mathematical reasoning, Text‑to‑SQL, and interactive agent benchmarks show that MNL achieves competitive performance compared to existing memory mechanisms and in‑context methods in both effectiveness and efficiency. These findings position structured mistake abstraction as a critical lever for robust agent evolution, enabling continuous improvement without the cost of parameter updates. The code is available at https://github.com/Bairong‑Xdynamics/MistakeNotebookLearning/tree/main.

PaperID: 732, https://arxiv.org/pdf/2512.11426.pdf GitHub

Authors: Shuowei Cai, Yansong Ning, Hao Liu

Title: AgentBalance: Backbone-then-Topology Design for Cost-Effective Multi-Agent Systems under Budget Constraints

Abstract:
Large Language Model (LLM)‑based multi‑agent systems (MAS) are becoming indispensable building blocks for web‑scale applications such as web search, social network analytics, and online customer support, where cost‑effectiveness is increasingly the primary constraint for large‑scale deployment. While recent work improves MAS cost‑effectiveness by shaping inter‑agent communication topologies and selecting agent backbones, it rarely models and optimizes under explicit token‑cost and latency budgets that reflect deployment constraints. This often leads to topology‑first designs and suboptimal cost‑effectiveness when budgets are binding. We present AgentBalance, a framework for constructing cost‑effective MAS under explicit token‑cost and latency budgets via a backbone‑then‑topology design. AgentBalance first performs backbone‑oriented agent generation, constructing agents with heterogeneous backbones through LLM pool construction, pool selection, and role‑backbone matching. It then performs adaptive MAS topology generation, guiding inter‑agent communication via agent representation learning, gating, and latency‑aware topology synthesis. Experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance achieves up to 10% and 22% performance gains under matched token‑cost and latency budgets, respectively, and yields strong AUC on performance‑versus‑budget curves across benchmarks. AgentBalance also functions as a plug‑in for existing MAS, improving performance under the same token‑cost and latency constraints, and it generalizes well to unseen LLMs for practical, budget‑aware deployment. Code: https://github.com/usail‑hkust/AgentBalance

PaperID: 733, https://arxiv.org/pdf/2512.11061.pdf GitHub

Authors: Felix O'Mahony, Roberto Cipolla, Ayush Tewari

Title: VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Abstract:
Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill‑suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision‑Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.

PaperID: 734, https://arxiv.org/pdf/2512.10971.pdf GitHub

Authors: Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang

Title: AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential as autonomous agents, approaching human‑expert performance through advanced reasoning and tool orchestration. However, decision‑making in fully dynamic and live environments remains highly challenging, requiring real‑time information integration and adaptive responses. While existing efforts have explored live evaluation mechanisms in structured tasks, a critical gap remains in systematic benchmarking for real‑world applications, particularly in finance where stringent requirements exist for live strategic responsiveness. To address this gap, we introduce AI‑Trader, the first fully‑automated, live, and data‑uncontaminated evaluation benchmark for LLM agents in financial decision‑making. AI‑Trader spans three major financial markets: U.S. stocks, A‑shares, and cryptocurrencies, with multiple trading granularities to simulate live financial environments. Our benchmark implements a revolutionary fully autonomous minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market information without human intervention. We evaluate six mainstream LLMs across three markets and multiple trading frequencies. Our analysis reveals striking findings: general intelligence does not automatically translate to effective trading capability, with most agents exhibiting poor returns and weak risk management. We demonstrate that risk control capability determines cross‑market robustness, and that AI trading strategies achieve excess returns more readily in highly liquid markets than policy‑driven environments. These findings expose critical limitations in current autonomous agents and provide clear directions for future improvements. The code and evaluation data are open‑sourced to foster community research: https://github.com/HKUDS/AI‑Trader.

PaperID: 735, https://arxiv.org/pdf/2512.10766.pdf GitHub

Authors: Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Yi Tu, An-An Liu

Title: Metaphor-based Jailbreak Attacks on Text-to-Image Models

Abstract:
Text‑to‑image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreak attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we reveal an underexplored vulnerability of T2I models to metaphor‑based jailbreak attacks (MJA), which aims to attack diverse defense mechanisms without prior knowledge of their type by generating metaphor‑based adversarial prompts. Specifically, MJA consists of two modules: an LLM‑based multi‑agent generation module (LMAG) and an adversarial prompt optimization module (APO). LMAG decomposes the generation of metaphor‑based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, LMAG coordinates three LLM‑based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA achieves stronger attack performance while using fewer queries, compared with six baseline methods. Additionally, we provide an in‑depth vulnerability analysis suggesting that metaphor‑based adversarial prompts evade safety mechanisms by inducing semantic ambiguity, while sensitive images arise from the model's probabilistic interpretation of concealed semantics.

PaperID: 736, https://arxiv.org/pdf/2512.10700.pdf GitHub

Authors: Mohamed Elobaid, Shinkyu Park, Eric Feron

Title: On the Stabilization of Rigid Formations on Regular Curves

Abstract:
This work deals with the problem of stabilizing a multi‑agent rigid formation on a general class of planar curves. Namely, we seek to stabilize an equilateral polygonal formation on closed planar differentiable curves after a path sweep. The task of finding an inscribed regular polygon centered at the point of interest is solved via a randomized multi‑start Newton‑Like algorithm for which one is able to ascertain the existence of a minimizer. Then we design a continuous feedback law that guarantees convergence to, and sufficient sweeping of the curve, followed by convergence to the desired formation vertices while ensuring inter‑agent avoidance. The proposed approach is validated through numerical simulations for different classes of curves and different rigid formations. Code: https://github.com/mebbaid/paper‑elobaid‑ifacwc‑2026

PaperID: 737, https://arxiv.org/pdf/2512.10605.pdf GitHub

Authors: Lihuang Chen, Xiangyu Luo, Jun Meng

Title: LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator

Abstract:
We propose LEO‑RobotAgent, a general‑purpose language‑driven intelligent agent framework for robots. Under this framework, LLMs can operate different types of robots to complete unpredictable complex tasks across various scenarios. This framework features strong generalization, robustness, and efficiency. The application‑level system built around it can fully enhance bidirectional human‑robot intent understanding and lower the threshold for human‑robot interaction. Regarding robot task planning, the vast majority of existing studies focus on the application of large models in single‑task scenarios and for single robot types. These algorithms often have complex structures and lack generalizability. Thus, the proposed LEO‑RobotAgent framework is designed with a streamlined structure as much as possible, enabling large models to independently think, plan, and act within this clear framework. We provide a modular and easily registrable toolset, allowing large models to flexibly call various tools to meet different requirements. Meanwhile, the framework incorporates a human‑robot interaction mechanism, enabling the algorithm to collaborate with humans like a partner. Experiments have verified that this framework can be easily adapted to mainstream robot platforms including unmanned aerial vehicles (UAVs), robotic arms, and wheeled robot, and efficiently execute a variety of carefully designed tasks with different complexity levels. Our code is available at https://github.com/LegendLeoChen/LEO‑RobotAgent.

PaperID: 738, https://arxiv.org/pdf/2512.10540.pdf GitHub

Authors: Si Wang, Zhehan Li, Jiadong Lu, Rong Xiong, Yanjun Cao, Yue Wang

Title: Mr. Virgil: Learning Multi-robot Visual-range Relative Localization

Abstract:
Ultra‑wideband (UWB)‑vision fusion localization has achieved extensive applications in the domain of multi‑agent relative localization. The challenging matching problem between robots and visual detection renders existing methods highly dependent on identity‑encoded hardware or delicate tuning algorithms. Overconfident yet erroneous matches may bring about irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end‑to‑end learning multi‑robot visual‑range relative localization framework, consisting of a graph neural network for data association between UWB rangings and visual detections, and a differentiable pose graph optimization (PGO) back‑end. The graph‑based front‑end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back‑end to elevate the accuracy of the final pose estimation. Additionally, a decentralized system is implemented for real‑world applications. Experiments spanning varying robot numbers, simulation and real‑world, occlusion and non‑occlusion conditions showcase the stability and exactitude under various scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr‑Virgil.

PaperID: 739, https://arxiv.org/pdf/2512.10394.pdf GitHub

Authors: Weifan Guan, Qinghao Hu, Huasen Xi, Chenxiao Zhang, Aosheng Li, Jian Cheng

Title: RoboNeuron: A Middle-Layer Infrastructure for Agent-Driven Orchestration in Embodied AI

Abstract:
Vision‑language‑action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad‑hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re‑integration. We introduce RoboNeuron, a middleware layer that connects the Model Context Protocol (MCP) for LLM agents with robot middleware such as ROS2. RoboNeuron bridges these ecosystems by deriving agent‑callable tools directly from ROS schemas, providing a unified execution abstraction that supports both direct commands and modular composition, and localizing backend, runtime, and acceleration‑preset changes within a stable inference boundary. We evaluate RoboNeuron in simulation and on hardware through multi‑platform base control, arm motion, and VLA‑based grasping tasks, demonstrating that it enables modular system orchestration under a unified interface while supporting backend transitions without system rewiring. The full code implementation of this work is available at github repo: https://github.com/guanweifan/RoboNeuron

PaperID: 740, https://arxiv.org/pdf/2512.10206.pdf GitHub

Authors: Yakun Zhu, Zhongzhen Huang, Qianhan Feng, Linjie Mu, Yannian Gu, Shaoting Zhang, Qi Dou, Xiaofan Zhang

Title: CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

Abstract:
Medical care follows complex clinical pathways that extend beyond isolated physician‑patient encounters, emphasizing decision‑making and transitions between different stages. Current benchmarks focusing on static exams or isolated dialogues inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP‑Env, a controllable agentic hospital environment designed to evaluate LLMs across end‑to‑end clinical pathways. CP‑Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios ranging from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings for agent interaction. Following real hospital adaptive flow of healthcare, it enables branching, long‑horizon task execution. We propose a three‑tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP‑Env advances medical AI agents development through comprehensive end‑to‑end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL‑MED/CP_ENV.

PaperID: 741, https://arxiv.org/pdf/2512.09898.pdf GitHub

Authors: Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria, Mahmoud Nabil Mahmoud, Xiaohong Yuan, Abdollah Homaifar

Title: Visual Heading Prediction for Autonomous Aerial Vehicles

Abstract:
The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real‑time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision‑based, data‑driven framework for real‑time UAV‑UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine‑tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground‑truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506° and a root mean squared error of 0.1957°, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision‑based, infrastructure‑ independent solution that demonstrates strong potential for deployment in GPS/GNSS‑denied environments, supporting reliable multi‑agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real‑time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV‑UGV‑Integration

PaperID: 742, https://arxiv.org/pdf/2512.09682.pdf GitHub

Authors: Mika Persson, Jonas Lidman, Jacob Ljungberg, Samuel Sandelius, Adam Andersson

Title: Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies

Abstract:
This work studies the application of Multi‑Agent Reinforcement Learning (MARL) to decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for MARL scaling studies. A robust baseline policy is proposed which restricts agent motion and applies Dijkstra's shortest path algorithm. Computational experiment results show that two off‑the‑shelf MARL algorithms perform competitively with the baseline for a small number of agents, but face scalability issues as the number of agents increases. Source code and animations are available online at https://github.com/mikapersson/Information‑Relaying.

PaperID: 743, https://arxiv.org/pdf/2512.09321.pdf GitHub

Authors: Reachal Wang, Yuqi Jia, Neil Zhenqiang Gong

Title: ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

Abstract:
Prompt injection attacks aim to contaminate the input data of an LLM to mislead it into completing an attacker‑chosen task instead of the intended task. In many applications and agents, the input data originates from multiple sources, with each source contributing a segment of the overall input. In these multi‑source scenarios, an attacker may control only a subset of the sources and contaminate the corresponding segments, but typically does not know the order in which the segments are arranged within the input. Existing prompt injection attacks either assume that the entire input data comes from a single source under the attacker's control or ignore the uncertainty in the ordering of segments from different sources. As a result, their success is limited in domains involving multi‑source data. In this work, we propose ObliInjection, the first prompt injection attack targeting LLM applications and agents with multi‑source input data. ObliInjection introduces two key technical innovations: the order‑oblivious loss, which quantifies the likelihood that the LLM will complete the attacker‑chosen task regardless of how the clean and contaminated segments are ordered; and the orderGCG algorithm, which is tailored to minimize the order‑oblivious loss and optimize the contaminated segments. Comprehensive experiments across three datasets spanning diverse application domains and twelve LLMs demonstrate that ObliInjection is highly effective, even when only one out of 6‑100 segments in the input data is contaminated. Our code and data are available at: https://github.com/ReachalWang/ObliInjection.

PaperID: 744, https://arxiv.org/pdf/2512.08483.pdf GitHub

Authors: Lingze Zeng, Naili Xing, Shaofeng Cai, Peng Lu, Gang Chen, Jian Pei, Beng Chin Ooi

Title: NeurIDA: Dynamic Modeling for Effective In-Database Analytics

Abstract:
Relational Database Management Systems (RDBMS) manage complex, interrelated data and support a broad spectrum of analytical tasks. With the growing demand for predictive analytics, the deep integration of machine learning (ML) into RDBMS has become critical. However, a fundamental challenge hinders this evolution: conventional ML models are static and task‑specific, whereas RDBMS environments are dynamic and must support diverse analytical queries. Each analytical task entails constructing a bespoke pipeline from scratch, which incurs significant development overhead and hence limits wide adoption of ML in analytics. We present NeurIDA, an autonomous end‑to‑end system for in‑database analytics that dynamically "tweaks" the best available base model to better serve a given analytical task. In particular, we propose a novel paradigm of dynamic in‑database modeling to pre‑train a composable base model architecture over the relational data. Upon receiving a task, NeurIDA formulates the task and data profile to dynamically select and configure relevant components from the pool of base models and shared model components for prediction. For friendly user experience, NeurIDA supports natural language queries; it interprets user intent to construct structured task profiles, and generates analytical reports with dedicated LLM agents. By design, NeurIDA enables ease‑of‑use and yet effective and efficient in‑database AI analytics. Extensive experiment study shows that NeurIDA consistently delivers up to 12% improvement in AUC‑ROC and 25% relative reduction in MAE across ten tasks on five real‑world datasets. The source code is available at https://github.com/Zrealshadow/NeurIDA

PaperID: 745, https://arxiv.org/pdf/2512.08326.pdf GitHub

Authors: Bin Wang, Hui Li, Liyang Zhang, Qijia Zhuang, Ao Yang, Dong Zhang, Xijun Luo, Bing Lin

Title: Argus: A Multi-Agent Sensitive Information Leakage Detection Framework Based on Hierarchical Reference Relationships

Abstract:
Sensitive information leakage in code repositories has emerged as a critical security challenge. Traditional detection methods that rely on regular expressions, fingerprint features, and high‑entropy calculations often suffer from high false‑positive rates. This not only reduces detection efficiency but also significantly increases the manual screening burden on developers. Recent advances in large language models (LLMs) and multi‑agent collaborative architectures have demonstrated remarkable potential for tackling complex tasks, offering a novel technological perspective for sensitive information detection. In response to these challenges, we propose Argus, a multi‑agent collaborative framework for detecting sensitive information. Argus employs a three‑tier detection mechanism that integrates key content, file context, and project reference relationships to effectively reduce false positives and enhance overall detection accuracy. To comprehensively evaluate Argus in real‑world repository environments, we developed two new benchmarks, one to assess genuine leak detection capabilities and another to evaluate false‑positive filtering performance. Experimental results show that Argus achieves up to 94.86% accuracy in leak detection, with a precision of 96.36%, recall of 94.64%, and an F1 score of 0.955. Moreover, the analysis of 97 real repositories incurred a total cost of only 2.2. All code implementations and related datasets are publicly available at https://github.com/TheBinKing/Argus‑Guard for further research and application.

PaperID: 746, https://arxiv.org/pdf/2512.07410.pdf GitHub

Authors: Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang

Title: InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

Abstract:
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single‑agent scenarios, overlooking the physically plausible interplay essential for multi‑agent interactions. To bridge this gap, we propose InterAgent, the first end‑to‑end framework for text‑driven physics‑based multi‑agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi‑stream blocks, which decouples proprioception, exteroception, and action to mitigate cross‑modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine‑grained joint‑to‑joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge‑based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter‑agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state‑of‑the‑art performance. It enables producing coherent, physically plausible, and semantically faithful multi‑agent behaviors from only text prompts. Our code and data will be released to facilitate future research.

PaperID: 747, https://arxiv.org/pdf/2512.07132.pdf GitHub

Authors: Nithin Sivakumaran, Justin Chih-Yao Chen, David Wan, Yue Zhang, Jaehong Yoon, Elias Stengel-Eskin, Mohit Bansal

Title: DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

Abstract:
Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi‑agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter‑agent disagreement. These tools allow for fruitful multi‑agent discussion by introducing new information, and by providing tool‑aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi‑agent debate as well as over single agent tool‑calling frameworks, beating the next‑strongest baseline (multi‑agent debate with a judge model) by 3.4% and 2.4% on A‑OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool‑calling, single agent, and multi‑agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi‑agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.

PaperID: 748, https://arxiv.org/pdf/2512.04463.pdf GitHub

Authors: Price Allman, Lian Thang, Dre Simmons, Salmon Riaz

Title: MARL Warehouse Robots

Abstract:
We present a comparative study of multi‑agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX's value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning ‑‑ particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML‑Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small‑scale deployments (2‑4 robots), significant scaling challenges remain. Code and analyses: https://pallman14.github.io/MARL‑QMIX‑Warehouse‑Robots/

PaperID: 749, https://arxiv.org/pdf/2512.03756.pdf GitHub

Authors: Marlon Steiner, Royden Wagner, Ömer Sahin Tas, Christoph Stiller

Title: Prediction-Driven Motion Planning: Route Integration Strategies in Attention-Based Prediction Models

Abstract:
Combining motion prediction and motion planning offers a promising framework for enhancing interactions between automated vehicles and other traffic participants. However, this introduces challenges in conditioning predictions on navigation goals and ensuring stable, kinematically feasible trajectories. Addressing the former challenge, this paper investigates the extension of attention‑based motion prediction models with navigation information. By integrating the ego vehicle's intended route and goal pose into the model architecture, we bridge the gap between multi‑agent motion prediction and goal‑based motion planning. We propose and evaluate several architectural navigation integration strategies to our model on the nuPlan dataset. Our results demonstrate the potential of prediction‑driven motion planning, highlighting how navigation information can enhance both prediction and planning tasks. Our implementation is at: https://github.com/KIT‑MRT/future‑motion.

PaperID: 750, https://arxiv.org/pdf/2512.03667.pdf GitHub

Authors: Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Huazhu Fu, Nick Barnes

Title: Colon-X: Advancing Intelligent Colonoscopy toward Clinical Reasoning

Abstract:
In this study, we present Colon‑X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M+ visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community‑wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy ‑ evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human‑induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning‑centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi‑agent debating pipeline, and develop ColonR1, the first R1‑styled model that mitigates reward information collapse through task‑adaptive rewards and gradient‑stable policy optimization. Under data‑scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine‑tuning by 25.22%, and sets a new reasoning‑enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon‑X.

PaperID: 751, https://arxiv.org/pdf/2512.03666.pdf GitHub GitHub

Authors: Qi'ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao, Jiacheng Zhang, Xiaoling Wang, Liang He

Title: ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

Abstract:
A core capability towards general embodied intelligence lies in localizing task‑relevant objects from an egocentric perspective, formulated as Spatio‑Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object‑centric and descriptive instructions, neglecting the task‑oriented reasoning that is crucial for embodied agents to accomplish goal‑directed interactions. To bridge this gap, we introduce ToG‑Bench, the first task‑oriented spatio‑temporal video grounding benchmark for egocentric videos. ToG‑Bench is characterized by three key features: (1) Task‑oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit‑Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One‑to‑Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG‑Bench comprises 100 annotated clips with 2,704 task‑oriented grounding instructions, constructed via a semi‑automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task‑level evaluation metrics tailored for multi‑object and explicit‑implicit object grounding, and systematically benchmark seven state‑of‑the‑art MLLMs. Extensive experiments reveal the intrinsic challenges of task‑oriented STVG and substantial performance gaps across explicit‑implicit and multi‑object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: \hrefhttps://github.com/qaxuDev/ToG‑Benchhttps://github.com/qaxuDev/ToG‑Bench..

PaperID: 752, https://arxiv.org/pdf/2512.03445.pdf GitHub

Authors: Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda, Victoria Mar, Zongyuan Ge

Title: Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Abstract:
Vision‑language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large‑scale image‑text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web‑collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi‑Agent data GENeration (MAGEN) system and Ontology‑based Multi‑Aspect Knowledge‑Enhanced (O‑MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge‑enriched descriptions via a foundation model‑assisted captioning and retrieval‑based verification pipeline. Second, O‑MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine‑grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology‑guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state‑of‑the‑art zero‑shot performance on disease classification and cross‑modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M‑AgentAug, comprising over 400k skin‑image‑text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.

PaperID: 753, https://arxiv.org/pdf/2512.02835.pdf GitHub

Authors: Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu

Title: ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Abstract:
Reasoning‑centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single‑step prediction, ReVSeg executes three explicit operations ‑‑ semantics interpretation, temporal evidence selection, and spatial grounding ‑‑ aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi‑step reasoning chain, enabling the model to self‑refine its decision quality from outcome‑driven signals. Experimental results demonstrate that ReVSeg attains state‑of‑the‑art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .

PaperID: 754, https://arxiv.org/pdf/2512.02589.pdf GitHub

Authors: Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He

Title: PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

Abstract:
Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context‑aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in‑editor, multi‑agent, and plugin‑based academic writing assistant that brings LLM‑driven reasoning directly into the writing environment. Enabling such in‑editor interaction is technically non‑trivial: it requires reliable bidirectional synchronization with the editor, fine‑grained version control and patching, secure state management, multi‑agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome‑approved extension, a Kubernetes‑native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff‑based updates, encapsulated within a minimal‑intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor‑native, agentic writing assistant. More details about this demo and video could be found at https://github.com/PaperDebugger/PaperDebugger.

PaperID: 755, https://arxiv.org/pdf/2512.02535.pdf GitHub

Authors: Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti

Title: AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

Abstract:
Information gathering in large‑scale or time‑critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi‑agent systems. These scenarios are commonly formulated as multi‑agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning‑based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long‑horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long‑term trajectories in a non‑autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine‑tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two‑stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving 4x faster execution and up to 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at https://github.com/marmotlab/AID.

PaperID: 756, https://arxiv.org/pdf/2512.02530.pdf GitHub

Authors: Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai, Haojie Cheng, Ziyan Shi, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

Title: Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

Abstract:
The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi‑agent debate and collaboration.Employing a collaborative architecture of five core agents, Aetheria conducts in‑depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG‑based knowledge retrieval.Comprehensive experiments on our proposed benchmark (AIR‑Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

PaperID: 757, https://arxiv.org/pdf/2512.02485.pdf GitHub

Authors: Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou

Title: UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Abstract:
Vision‑Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi‑agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single‑model bias, but open‑ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision‑making. We propose UCAgents, a hierarchical multi‑agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one‑round inquiry discussion is introduced to uncover potential risks of visual‑textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual‑noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state‑of‑the‑art) with 87.7% lower token cost, the evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real‑world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.

PaperID: 758, https://arxiv.org/pdf/2512.01853.pdf GitHub

Authors: Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai

Title: COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis

Abstract:
Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro‑level actions to macro‑level game strategies. Existing end‑to‑end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi‑Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short‑term analytic reasoning (e.g., Rally QA) and long‑term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine‑grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross‑task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH‑project‑page

PaperID: 759, https://arxiv.org/pdf/2512.01270.pdf GitHub GitHub

Authors: Yuan-Sen Ting, Serat Mahmud Saad, Fan Liu, Yuting Shen

Title: Egent: An Autonomous Agent for Equivalent Width Measurement

Abstract:
We present Egent, an autonomous agent that combines classical multi‑Voigt profile fitting with large language model (LLM) visual inspection and iterative refinement. The fitting engine is built from scratch with minimal dependencies, creating an ecosystem where the LLM can reason about fits through function calls‑‑adjusting wavelength windows, adding blend components, modifying continuum treatment, and flagging problematic cases. Egent operates directly on raw flux spectra without requiring pre‑normalized continua. We validate against manual measurements from human experts using 18,615 lines from the C3PO program across 84 Magellan/MIKE spectra at SNR~50‑250. The raw agreement between Egent and expert measurements is MAD=5‑7mA, without any post‑hoc per‑spectrum correction. Per‑spectrum slopes of ~0.85‑1.19 around unity reflect differences in global continuum methodology rather than fitting failures. The LLM's primary role is quality control: it confirms good fits (~60‑65% of lines are LLM‑refined and accepted), flags problematic cases (~10‑20%), and occasionally rescues edge cases where tool use improves fits. Agreement between GPT‑5 and GPT‑5‑mini confirms reproducibility, with GPT‑5‑mini enabling low‑cost analysis at ~200 lines per US dollar. Every fit stores complete Voigt parameters, continuum coefficients, and LLM reasoning chains, enabling exact reconstruction without re‑running. Egent compresses what traditionally requires months of expert effort into days of automated analysis, enabling survey‑scale EW measurement. We provide open‑source code at https://github.com/tingyuansen/Egent, including a web interface for drag‑and‑drop analysis and a local LLM backend for fully offline operation on consumer hardware.

PaperID: 760, https://arxiv.org/pdf/2512.00831.pdf GitHub

Authors: Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee

Title: ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

Abstract:
Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long‑form Chain‑of‑Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem‑solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non‑adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state‑of‑the‑art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT‑prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump‑guided Best‑of‑N selection and prompt selection. Our code is publicly available at https://github.com/UW‑Madison‑Lee‑Lab/ReJump.

PaperID: 761, https://arxiv.org/pdf/2512.00018.pdf GitHub

Authors: Arthur Ji Sung Baek, Geoffrey Martin

Title: X-SYCON: Xylem-Inspired Passive Gradient Control for Communication-Free Swarm Response in Dynamic Disaster Environments

Abstract:
We present X‑SYCON, a xylem‑inspired multi‑agent architecture in which coordination emerges from passive field dynamics rather than explicit planning or communication. Incidents (demands) and obstructions (hazards) continually write diffusing and decaying scalar fields, and agents greedily ascend a local utility U=ϕ_\mathrmDE‑κ\,ϕ_\mathrmHZ with light anti‑congestion and separation. A beaconing rule triggered on first contact temporarily deepens the local demand sink, accelerating completion without reducing time‑to‑first‑response. Across dynamic, partially blocked simulated environments, we observe low miss rates and stable throughput with interpretable, tunable trade‑offs over carrier count, arrival rate, hazard density, and hazard sensitivity κ. We derive that a characteristic hydraulic length scale \ell\approx\sqrtD/λ predicts recruitment range in a continuum approximation, and we provide a work‑conservation (Ohm‑law) bound consistent with sublinear capacity scaling with team size. Empirically: (i) soft hazard penalties yield fewer misses when obstacles already block motion; (ii) throughput saturates sublinearly with carriers while reliability improves sharply; (iii) stronger arrivals can reduce misses by sustaining sinks that recruit help; and (iv) phase‑stability regions shrink with hazard density but are recovered by more carriers or higher arrivals. We refer to X‑SYCON as an instance of Distributed Passive Computation and Control, and we evaluate it in simulations modeling communication‑denied disaster response and other constrained sensing‑action regimes.

PaperID: 762, https://arxiv.org/pdf/2511.23397.pdf GitHub

Authors: Mahdi Rahmani, AmirHossein Saffari, Reyhane Rahmani

Title: MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation

Abstract:
Small and medium‑sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real‑time engagement is essential for conversion. However, developing AI‑driven chatbots for this purpose requires large, high‑quality question‑and‑answer (Q&A) datasets, which are typically expensive and resource‑intensive to produce, especially for low‑resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram‑based e‑commerce. We propose a novel, automated multi‑agent architecture that generates persona‑aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval‑augmented generation (RAG) models with our advanced agentic system, which features multi‑query retrieval, reranking, and persona‑aligned response synthesis. Using GPT‑5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high‑quality datasets without relying on expensive human annotation or complex fine‑tuning. MegaChat provides SMEs with an efficient, cost‑effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low‑resource languages. Download: https://github.com/MegaChat‑Tech/MegaChat‑DataSet

PaperID: 763, https://arxiv.org/pdf/2511.23186.pdf GitHub

Authors: Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi

Title: Obstruction reasoning for robotic grasping

Abstract:
Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision‑language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning‑based vision‑language model capable of performing visually‑grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi‑step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction‑aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large‑scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural‑language instructions. Extensive experiments and real‑robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real‑world environments, outperforming generalist and proprietary alternatives. Project website: https://tev‑fbk.github.io/UnoGrasp/.

PaperID: 764, https://arxiv.org/pdf/2511.22854.pdf GitHub

Authors: Finn G. Vamosi, Nils D. Forkert

Title: CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate

Abstract:
When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual‑agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, the agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under‑explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek‑R1 as debater agents, we demonstrate that multi‑agent debate improves DeepSeek‑R1's overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that even strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi‑agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem‑solving.

PaperID: 765, https://arxiv.org/pdf/2511.22685.pdf GitHub

Authors: Haoyi Wang, Licheng Luo, Yiannis Kantaros, Bruno Sinopoli, Mingyu Cai

Title: Deadlock-Free Hybrid RL-MAPF Framework for Zero-Shot Multi-Robot Navigation

Abstract:
Multi‑robot navigation in cluttered environments presents fundamental challenges in balancing reactive collision avoidance with long‑range goal achievement. When navigating through narrow passages or confined spaces, deadlocks frequently emerge that prevent agents from reaching their destinations, particularly when Reinforcement Learning (RL) control policies encounter novel configurations out of learning distribution. Existing RL‑based approaches suffer from limited generalization capability in unseen environments. We propose a hybrid framework that seamlessly integrates RL‑based reactive navigation with on‑demand Multi‑Agent Path Finding (MAPF) to explicitly resolve topological deadlocks. Our approach integrates a safety layer that monitors agent progress to detect deadlocks and, when detected, triggers a coordination controller for affected agents. The framework constructs globally feasible trajectories via MAPF and regulates waypoint progression to reduce inter‑agent conflicts during navigation. Extensive evaluation on dense multi‑agent benchmarks shows that our method boosts task completion from marginal to near‑universal success, markedly reducing deadlocks and collisions. When integrated with hierarchical task planning, it enables coordinated navigation for heterogeneous robots, demonstrating that coupling reactive RL navigation with selective MAPF intervention yields a robust, zero‑shot performance.

PaperID: 766, https://arxiv.org/pdf/2511.22598.pdf GitHub

Authors: Huanyu Li, Zongyuan Li, Wei Huang, Xian Guo

Title: LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system

Abstract:
Large language models (LLMs) such as ChatGPT o1, ChatGPT o3, and DeepSeek R1 have shown great potential in solving difficult problems. However, current LLM evaluation benchmarks are limited to one‑step interactions. Some of the existing sequence decision‑making environments, such as TextStarCraftII and LLM‑PySC2, are too complicated and require hours of interaction to complete a game. In this paper, we introduce LLM‑Cave, a benchmark and light environment for LLM reasoning and decision‑making systems. This environment is a classic instance in the era of Symbolism. Artificial intelligence enables the agent to explore the environment and avoid potential losses by reasoning about nearby dangers using partial observable state information. In the experiment, we evaluated the sequential reasoning ability, decision‑making performance and computational efficiency of mainstream large language models (LLMs) such as GPT‑4o‑mini, o1‑mini, and DeepSeek‑R1. Experiments show that while Deepseek‑R1 achieved the highest success rate on complex reasoning tasks, smaller models like 4o‑mini significantly narrowed the performance gap on challenges by employing Chain of Speculation and Planner‑Critic strategies, at the expense of reduced computational efficiency. This indicates that structured, multi‑step reasoning combined with an LLM‑based feedback mechanism can substantially enhance an LLM's decision‑making capabilities, providing a promising direction for improving reasoning in weaker models and suggesting a new reasoning‑centered benchmark for LLM assessment. Our code is open‑sourced in https://github.com/puleya1277/CaveEnv.

PaperID: 767, https://arxiv.org/pdf/2511.22396.pdf GitHub

Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li

Title: Asking like Socrates: Socrates helps VLMs understand remote sensing images

Abstract:
Recent multimodal reasoning models, inspired by DeepSeek‑R1, have significantly advanced vision‑language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large‑scale RS imagery results in incomplete understanding and reasoning based on linguistic self‑consistency instead of visual evidence. To address this, we propose RS‑EoT (Remote Sensing Evidence‑of‑Thought), a language‑driven, iterative visual evidence‑seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self‑play multi‑agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two‑stage progressive RL strategy: first, RL on fine‑grained Grounding tasks to enhance RS‑EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS‑EoT achieves state‑of‑the‑art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS‑EoT mitigates the Glance Effect and enables genuine evidence‑grounded reasoning. Our code, data, and models are available at https://geox‑lab.github.io/Asking_like_Socrates

PaperID: 768, https://arxiv.org/pdf/2511.22235.pdf GitHub

Authors: Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu

Title: Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

Abstract:
The rapid development of large vision‑language model (VLM) has greatly promoted the research of GUI agent. However, GUI agents still face significant challenges in handling long‑horizon tasks. First, single‑agent models struggle to balance high‑level capabilities and low‑level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long‑horizon tasks. To address these challenges, we propose a staged execution‑feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high‑level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator‑Executor‑State Tracker (CES) multi‑agent framework, which can be integrated with any low‑level Executor model, assisting the Executor in solving long‑horizon tasks through task scheduling and state management. Experiments on long‑horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high‑level scheduling module is a generalizable, plug‑and‑play module that significantly enhances the long‑horizon capabilities of various Executors. Code can be available at https://github.com/hehehahi4/CES.

PaperID: 769, https://arxiv.org/pdf/2511.22134.pdf GitHub

Authors: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao

Title: DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Abstract:
To build a generalizable Vision‑Language‑Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine‑tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post‑training while still preserving reasoning capability. We first introduce a dual‑layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual‑teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine‑grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.

PaperID: 770, https://arxiv.org/pdf/2511.21998.pdf GitHub

Authors: Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic

Title: Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Abstract:
Multi‑modal Large Language Models (LLM) have advanced conversational abilities but struggle with providing live, interactive step‑by‑step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real‑time. This requires models that are not turn‑based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state‑of‑the‑art multi‑modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi‑modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching.

PaperID: 771, https://arxiv.org/pdf/2511.21730.pdf GitHub

Authors: Ishant Kohar, Aswanth Krishnan

Title: A Benchmark for Procedural Memory Retrieval in Language Agents

Abstract:
Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies ‑‑ a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM‑generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding‑based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM‑generated procedural abstractions demonstrate reliable cross‑context transfer. Controlled ablations show that although embeddings capture some lexical‑level abstraction, they fundamentally treat procedures as unordered bags of words, discarding temporal structure necessary for cross‑context transfer. Corpus scale delivers far larger gains than representation enrichment, revealing an architectural ceiling in current encoders. Our benchmark offers the first diagnostic framework separating genuine procedural understanding from surface‑level memorization and gives tools for developing retrieval systems capable of dependable generalization. Resources available at our GitHub repository (https://github.com/qpiai/Proced_mem_bench).

PaperID: 772, https://arxiv.org/pdf/2511.21510.pdf GitHub

Authors: Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, Toshiharu Sugawara

Title: Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Abstract:
This study proposes Tool‑RoCo, a novel benchmark for evaluating large language models (LLMs) in long‑term multi‑agent cooperation based on RoCo, a multi‑robot cooperative benchmark. Recent research on LLM‑based multi‑agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool‑RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi‑agent cooperation and self‑organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self‑organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self‑organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool‑RoCo includes three multi‑robot tasks, SORT, PACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM‑based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool‑RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi‑agent tasks. Code and Demo: https://github.com/ColaZhang22/Tool‑Roco

PaperID: 773, https://arxiv.org/pdf/2511.19430.pdf GitHub GitHub

Authors: Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

Title: Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Abstract:
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge‑based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D‑60K, a large‑scale dataset comprising 60K composite tasks across 4K real‑world scenes. Furthermore, we propose GRANT, an embodied multi‑modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D‑60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H‑EmbodVis/GRANT

PaperID: 774, https://arxiv.org/pdf/2511.19083.pdf GitHub

Authors: Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang

Title: A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis

Abstract:
In‑context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low‑resource scenarios. However, existing ICL‑based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR‑Agent, a novel multi‑agent framework for multi‑domain low‑resource in‑context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR‑Agent leverages natural‑language type definitions and a static set of entity‑level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain‑specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self‑assessment. Experiments across ten datasets from five domains demonstrate that KDR‑Agent significantly outperforms existing zero‑shot and few‑shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR‑Agent.

PaperID: 775, https://arxiv.org/pdf/2511.17854.pdf GitHub

Authors: Allen Roush, Devin Gonier, John Hines, Judah Goldfeder, Philippe Martin Wyder, Sanjay Basu, Ravid Shwartz Ziv

Title: A superpersuasive autonomous policy debating system

Abstract:
The capacity for highly complex, evidence‑based, and strategically adaptive persuasion remains a formidable great challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two‑team competitive policy debate. Our system employs a hierarchical architecture of specialized multi‑agent workflows, where teams of LLM‑powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self‑correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross‑examinations, and rebuttals. We introduce a live, interactive end‑to‑end presentation pipeline that renders debates with AI speech and animation: transcripts are surface‑realized and synthesized to audio with OpenAI TTS, and then displayed as talking‑head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human‑AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI‑human and AI‑AI rounds. In preliminary evaluations against human‑authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio and talking head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main

PaperID: 776, https://arxiv.org/pdf/2511.17208.pdf GitHub

Authors: Sizhe Zhou, Jiawei Han

Title: A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

Abstract:
LLM‑based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine‑grained but fragmented views of the dialogue. Motivated by neo‑Davidsonian event semantics, we propose an event‑centric alternative that represents conversational history as short, event‑like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non‑compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) ‑‑ self‑contained statements with normalized entities and source turn attributions ‑‑ and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval‑based variants that use dense similarity search and LLM filtering, with an optional graph‑based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval_S benchmarks show that these event‑centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event‑level memory provides a principled and practical foundation for long‑horizon conversational agents. Our code and data will be released at https://github.com/KevinSRR/EMem.

PaperID: 777, https://arxiv.org/pdf/2511.17198.pdf GitHub

Authors: Kaiyu Li, Jiayu Wang, Zhi Wang, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Title: Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism

Abstract:
LLM‑driven agents, particularly those using general frameworks like ReAct or human‑inspired role‑playing, often struggle in specialized domains that necessitate rigorously structured workflows. Fields such as remote sensing, requiring specialized tools (e.g., correction, spectral indices calculation), and multi‑step procedures (e.g., numerous intermediate products and optional steps), significantly challenge generalized approaches. To address this gap, we introduce a novel agent design framework centered on a Hierarchical Task Abstraction Mechanism (HTAM). Specifically, HTAM moves beyond emulating social roles, instead structuring multi‑agent systems into a logical hierarchy that mirrors the intrinsic task‑dependency graph of a given domain. This task‑centric architecture thus enforces procedural correctness and decomposes complex problems into sequential layers, where each layer's sub‑agents operate on the outputs of the preceding layers. We instantiate this framework as EarthAgent, a multi‑agent system tailored for complex geospatial analysis. To evaluate such complex planning capabilities, we build GeoPlan‑bench, a comprehensive benchmark of realistic, multi‑step geospatial planning tasks. It is accompanied by a suite of carefully designed metrics to evaluate tool selection, path similarity, and logical completeness. Experiments show that EarthAgent substantially outperforms a range of established single‑ and multi‑agent systems. Our work demonstrates that aligning agent architecture with a domain's intrinsic task structure is a critical step toward building robust and reliable specialized autonomous systems.

PaperID: 778, https://arxiv.org/pdf/2511.16972.pdf GitHub

Authors: Shuyang Yu, Jianan Liang, Hui Hu

Title: ToC: Tree-of-Claims Search with Multi-Agent Language Models

Abstract:
Optimizing patent claims is a critical yet challenging task, demanding careful balance between maximizing novelty and preserving legal scope. Manual claim drafting is labor‑intensive, costly, and inherently inconsistent, while conventional Large Language Models (LLMs) often lack the structured, iterative reasoning essential for precise claim refinement. To address these challenges, we introduce Tree of Claims (ToC), an innovative framework that redefines claim editing as a guided search problem. ToC synergistically integrates Monte Carlo Tree Search (MCTS) with a collaborative multi‑agent system, comprising an LLM‑based EditorAgent that proposes contextually grounded edits, and an ExaminerAgent that mimics patent examiner critiques through structured, chain‑of‑thought analyses of novelty and prior art disclosure. Driven by a carefully designed multi‑objective reward function, ToC jointly optimizes novelty, scope retention, and semantic coherence. Experimental evaluation on a benchmark of 1145 claims demonstrates that ToC significantly outperforms standard LLMs in zero‑shot and few‑shot scenarios, achieving an average composite score improvement of 8%, and up to 9% in certain cases. Extensive experiments, including detailed ablation studies, validate ToC's efficacy in generating superior, legally robust claim revisions. Overall, ToC establishes a transparent, controllable, and interpretable methodology that effectively bridges advanced LLM reasoning capabilities with strategic MCTS planning for structured patent claim optimization.The source code is available at https://github.com/ysy2003/ToC.

PaperID: 779, https://arxiv.org/pdf/2511.16964.pdf GitHub

Authors: Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu

Title: Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Abstract:
Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high‑level code for specific GPU targets. Recent work shows that LLM‑based multi‑agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi‑agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi‑agent PyTorch optimization systems. Our evaluation shows that exploit‑heavy strategies perform best when paired with error‑fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. Code is publicly available at: https://github.com/pike‑project/pike

PaperID: 780, https://arxiv.org/pdf/2511.16709.pdf GitHub

Authors: Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun

Title: AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Abstract:
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor‑intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red‑teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textscAutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine‑tuning via an autonomous agent‑driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context‑aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including Bias Recommendation, Hallucination Injection, and Peer Review Manipulation, to simulate a broad range of attacks. Experiments on both open‑source and commercial models, including LLaMA‑3, Mistral, Qwen, and GPT‑4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent‑driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.

PaperID: 781, https://arxiv.org/pdf/2511.16043.pdf GitHub

Authors: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao

Title: Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Abstract:
Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human‑curated data, limiting scalability and tethering AI to human knowledge. Existing self‑evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single‑round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high‑performing agents without external data through multi‑step co‑evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor's problem‑solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool‑aware tasks. Through this iterative process, Agent0 establishes a self‑reinforcing cycle that continuously produces high‑quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3‑8B‑Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming‑lab/Agent0.

PaperID: 782, https://arxiv.org/pdf/2511.16004.pdf GitHub

Authors: KeFan Li, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, Weifeng Lv

Title: InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution

Abstract:
Large language models have advanced software engineering automation, yet resolving real‑world software issues remains difficult because it requires repository‑level reasoning, accurate diagnostics, and strong verification signals. Existing agent‑based and pipeline‑based methods often rely on insufficient tests, which can lead to patches that satisfy verification but fail to fix the underlying defect. We present InfCode, an adversarial multi‑agent framework for automated repository‑level issue resolution. InfCode iteratively refines both tests and patches through adversarial interaction between a Test Patch Generator and a Code Patch Generator, while a Selector agent identifies the most reliable fix. The framework runs inside a containerized environment that supports realistic repository inspection, modification, and validation. Experiments on SWE‑bench Lite and SWE‑bench Verified using models such as DeepSeek‑V3 and Claude 4.5 Sonnet show that InfCode consistently outperforms strong baselines. It achieves 79.4% performance on SWE‑bench Verified, establishing a new state‑of‑the‑art. We have released InfCode as an open‑source project at https://github.com/Tokfinity/InfCode.

PaperID: 783, https://arxiv.org/pdf/2511.15830.pdf GitHub

Authors: Stéphane Aroca-Ouellette, Ian Berlot-Attwell, Panagiotis Lymperopoulos, Abhiramon Rajasekharan, Tongqi Zhu, Herin Kang, Kaheer Suleman, Sam Pasupalak

Title: Mini Amusement Parks (MAPs): A Testbed for Modelling Business Decisions

Abstract:
Despite rapid progress in artificial intelligence, current systems struggle with the interconnected challenges that define real‑world decision making. Practical domains, such as business management, require optimizing an open‑ended and multi‑faceted objective, actively learning environment dynamics from sparse experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Yet existing human‑‑AI benchmarks isolate subsets of these capabilities, limiting our ability to assess holistic decision‑making competence. We introduce Mini Amusement Parks (MAPs), an amusement‑park simulator designed to evaluate an agent's ability to model its environment, anticipate long‑term consequences under uncertainty, and strategically operate a complex business. We provide human baselines and a comprehensive evaluation of state‑of‑the‑art LLM agents, finding that humans outperform these systems by 6.5x on easy mode and 9.8x on medium mode. Our analysis reveals persistent weaknesses in long‑horizon optimization, sample‑efficient learning, spatial reasoning, and world modelling. By unifying these challenges within a single environment, MAPs offers a new foundation for benchmarking agents capable of adaptable decision making. Code: https://github.com/Skyfall‑Research/MAPs

PaperID: 784, https://arxiv.org/pdf/2511.15567.pdf GitHub GitHub GitHub

Authors: Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

Title: Computer-Use Agents as Judges for Generative User Interface

Abstract:
Computer‑Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans‑‑prioritizing aesthetics and usability‑‑forcing agents to adopt human‑oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding‑oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI‑Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real‑world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder‑CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi‑step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent‑native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.

PaperID: 785, https://arxiv.org/pdf/2511.15408.pdf GitHub

Authors: Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

Title: Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Abstract:
Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short‑form creative natural language generation (CNLG). In the real world, users often require personalized, fine‑grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, constraints' achievement can be inferred from sufficient observable cues. Existing studies are mainly outcome‑oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short‑form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these, we novelly formalize the Chinese short‑form CNLG task as a heterogeneous multi‑objective optimization (HMO) issue that needs to jointly optimize multiple personalized constraints and explanation reliability. We further propose MAGIC‑HMO, a training‑free multi‑agent framework that optimizes these objectives through iterative generation and verification under an explanation‑oriented multi‑objective strategy. Experiments on \emphChinese Baby Naming, a challenging benchmark, demonstrate that MAGIC‑HMO significantly outperforms six strong baselines across various LLM backbones. Relevant data and codes are available at https://github.com/foolfun/MAGIC_HMO.

PaperID: 786, https://arxiv.org/pdf/2511.15392.pdf GitHub

Authors: Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

Title: DEPO: Dual-Efficiency Preference Optimization for LLM Agents

Abstract:
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision‑making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real‑world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual‑efficiency, comprising (i) step‑level efficiency, which minimizes tokens per step, and (ii) trajectory‑level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual‑efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out‑of‑domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.

PaperID: 787, https://arxiv.org/pdf/2511.15239.pdf GitHub

Authors: Tomoki Nakao, Kazumi Kasaura, Tadashi Kozuno

Title: Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy

Abstract:
In distributed multi‑agent navigation without explicit communication, agents can fall into symmetry‑induced deadlocks because each agent must autonomously decide how to pass others. To address this problem, we propose WNumMPC, a hierarchical navigation method that quantifies cooperative symmetry‑breaking strategies via a topological invariant, the winding number, and learns such strategies through reinforcement learning. The learning‑based Planner outputs continuous‑valued signed target winding numbers and dynamic importance weights to prioritize critical interactions in dense crossings. Then, the model‑based Controller generates collision‑free and efficient motions based on the strategy and weights provided by the Planner. Simulation and real‑world robot experiments indicate that WNumMPC effectively avoids deadlocks and collisions and achieves better performance than the baselines, particularly in dense and symmetry‑prone scenarios. These experiments also suggest that explicitly leveraging winding numbers yields robust sim‑to‑real transfer with minimal performance degradation. The code for the experiments is available at https://github.com/omron‑sinicx/WNumMPC.

PaperID: 788, https://arxiv.org/pdf/2511.15097.pdf GitHub

Authors: Vineeth Sai Narajala, Manish Bhatt, Idan Habler, Ronald F. Del Rosario, Ads Dawson

Title: MAIF: Enforcing AI Trust and Provenance with an Artifact-Centric Agentic Paradigm

Abstract:
The AI trustworthiness crisis threatens to derail the artificial intelligence revolution, with regulatory barriers, security vulnerabilities, and accountability gaps preventing deployment in critical domains. Current AI systems operate on opaque data structures that lack the audit trails, provenance tracking, or explainability required by emerging regulations like the EU AI Act. We propose an artifact‑centric AI agent paradigm where behavior is driven by persistent, verifiable data artifacts rather than ephemeral tasks, solving the trustworthiness problem at the data architecture level. Central to this approach is the Multimodal Artifact File Format (MAIF), an AI‑native container embedding semantic representations, cryptographic provenance, and granular access controls. MAIF transforms data from passive storage into active trust enforcement, making every AI operation inherently auditable. Our production‑ready implementation demonstrates ultra‑high‑speed streaming (2,720.7 MB/s), optimized video processing (1,342 MB/s), and enterprise‑grade security. Novel algorithms for cross‑modal attention, semantic compression, and cryptographic binding achieve up to 225 compression while maintaining semantic fidelity. Advanced security features include stream‑level access control, real‑time tamper detection, and behavioral anomaly analysis with minimal overhead. This approach directly addresses the regulatory, security, and accountability challenges preventing AI deployment in sensitive domains, offering a viable path toward trustworthy AI systems at scale.

PaperID: 789, https://arxiv.org/pdf/2511.15061.pdf GitHub

Authors: Haodong Chen, Guido Zuccon, Teerapong Leelanupab

Title: Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering

Abstract:
Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain‑specific APIs with OpenAI's code‑davinci‑002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization. In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi‑agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role‑based task execution. OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene‑Turing and 0.830 on GeneHop, while using smaller open‑source models without additional fine‑tuning or tool‑specific pretraining. OpenBioLLM's modular multi‑agent design reduces latency by 40‑50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open‑source multi‑agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.

PaperID: 790, https://arxiv.org/pdf/2511.14650.pdf GitHub

Authors: Jingyi Jia, Qinbin Li

Title: AutoTool: Efficient Tool Selection for Large Language Model Agents

Abstract:
Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision‑making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph‑based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia ‑ the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter‑level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference‑heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.

PaperID: 791, https://arxiv.org/pdf/2511.14631.pdf GitHub

Authors: Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia

Title: Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Abstract:
We show that multi‑agent systems guided by vision‑language models (VLMs) improve end‑to‑end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM‑as‑a‑judge evaluates figures against dynamically generated domain‑specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real‑time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10‑task benchmark for data‑driven discovery, VLM‑augmented systems achieve pass at 1 scores of 0.7‑0.8, compared to 0.2‑0.3 for code‑only and 0.4‑0.5 for code‑and‑text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent

PaperID: 792, https://arxiv.org/pdf/2511.13704.pdf GitHub

Authors: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen

Title: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Abstract:
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain‑of‑frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher‑order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image‑to‑video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open‑source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test‑time strategy inspired by preference optimization. By performing LLM self‑analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.

PaperID: 793, https://arxiv.org/pdf/2511.13193.pdf GitHub

Authors: Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang, Jian Wang, Keze Wang

Title: Cost-Effective Communication: An Auction-based Method for Language Agent Interaction

Abstract:
Multi‑agent systems (MAS) built on large language models (LLMs) often suffer from inefficient "free‑for‑all" communication, leading to exponential token costs and low signal‑to‑noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that "free" communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction‑based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter‑agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low‑value communication. Extensive and comprehensive experiments demonstrate that our economically‑driven DALA achieves new state‑of‑the‑art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state‑of‑the‑art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamical manner via resource constraints. Our code and updates are available at https://github.com/waltstephen/Cost‑Effective‑Communication.

PaperID: 794, https://arxiv.org/pdf/2511.13186.pdf GitHub

Authors: Akash Karthikeyan, Yash Vardhan Pant

Title: DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

Abstract:
Self‑play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi‑agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self‑play settings is critical for achieving competitive performance in dynamic multi‑agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards ε‑Nash equilibria in continuous‑ space zero‑sum games. We validate our method on complex multi‑agent environments, including racing and multi‑particle zero‑sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3× faster convergence and 30× higher success rates on average against RL‑based baselines, demonstrating its robustness to opponent strategies and stability across training iterations

PaperID: 795, https://arxiv.org/pdf/2511.13118.pdf GitHub

Authors: Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang, Ling Tian, Ke Yan

Title: Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

Abstract:
Zero‑shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain‑specific understanding. Direct prompting often yields incomplete or structurally invalid outputs‑‑such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent‑Event‑Coder (AEC), a novel multi‑agent framework that treats event extraction like software engineering: as a structured, iterative code‑generation process. AEC decomposes ZSEE into specialized subtasks‑‑retrieval, planning, coding, and verification‑‑each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming‑inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema‑consistent extractions in zero‑shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero‑shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC‑GQJ/Agent‑Event‑Coder.

PaperID: 796, https://arxiv.org/pdf/2511.12630.pdf GitHub

Authors: Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai

Title: Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

Abstract:
Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface‑level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference‑rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high‑quality dataset of 12,347 expert‑annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi‑agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt‑engineering strategies and model‑adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.

PaperID: 797, https://arxiv.org/pdf/2511.11252.pdf GitHub

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Title: UAVBench: An Open Benchmark Dataset for Autonomous and Agentic AI UAV Systems via LLM-Generated Flight Scenarios

Abstract:
Autonomous aerial systems increasingly rely on large language models (LLMs) for mission planning, perception, and decision‑making, yet the lack of standardized and physically grounded benchmarks limits systematic evaluation of their reasoning capabilities. To address this gap, we introduce UAVBench, an open benchmark dataset comprising 50,000 validated UAV flight scenarios generated through taxonomy‑guided LLM prompting and multi‑stage safety validation. Each scenario is encoded in a structured JSON schema that includes mission objectives, vehicle configuration, environmental conditions, and quantitative risk labels, providing a unified representation of UAV operations across diverse domains. Building on this foundation, we present UAVBench_MCQ, a reasoning‑oriented extension containing 50,000 multiple‑choice questions spanning ten cognitive and ethical reasoning styles, ranging from aerodynamics and navigation to multi‑agent coordination and integrated reasoning. This framework enables interpretable and machine‑checkable assessment of UAV‑specific cognition under realistic operational contexts. We evaluate 32 state‑of‑the‑art LLMs, including GPT‑5, ChatGPT‑4o, Gemini 2.5 Flash, DeepSeek V3, Qwen3 235B, and ERNIE 4.5 300B, and find strong performance in perception and policy reasoning but persistent challenges in ethics‑aware and resource‑constrained decision‑making. UAVBench establishes a reproducible and physically grounded foundation for benchmarking agentic AI in autonomous aerial systems and advancing next‑generation UAV reasoning intelligence. To support open science and reproducibility, we release the UAVBench dataset, the UAVBench_MCQ benchmark, evaluation scripts, and all related materials on GitHub at https://github.com/maferrag/UAVBench

PaperID: 798, https://arxiv.org/pdf/2511.11182.pdf GitHub

Authors: Dayong Liang, Xiao-Yong Wei, Changmeng Zheng

Title: Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning

Abstract:
Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi‑Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi‑agent Undercover Gaming (MUG) protocol, inspired by social deduction games like "Who is Undercover?". MUG reframes MAD as a process of detecting "undercover" agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground‑truth for identifying hallucinating agents and enabling robust, crowd‑powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross‑evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at https://github.com/YongLD/MUG.git.

PaperID: 799, https://arxiv.org/pdf/2511.10949.pdf GitHub

Authors: Nirmit Arora, Sathvik Joel, Ishan Kavathekar, Palak, Rohan Gandhi, Yash Pandya, Tanuja Ganu, Aditya Kanade, Akshay Nambi

Title: Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Abstract:
LLM‑based agents are increasingly deployed in multi‑agent systems (MAS). As these systems move toward real‑world applications, their security becomes paramount. Existing research largely evaluates single‑agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi‑agent design. However, existing systems fall short due to lack of unified frameworks and metrics focusing on unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine‑grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter‑agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi‑agent pipelines. Using SafeAgents, we conduct a comprehensive study across five widely adopted multi‑agent architectures (centralized, decentralized, and hybrid variants) on four datasets spanning web tasks, tool use, and code generation. Our findings reveal that common design patterns carry significant vulnerabilities. For example, centralized systems that delegate only atomic instructions to sub‑agents obscure harmful objectives, reducing robustness. Our results highlight the need for security‑aware design in MAS. Link to code is https://github.com/microsoft/SafeAgents

PaperID: 800, https://arxiv.org/pdf/2511.10030.pdf GitHub

Authors: Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan, Lei Yuan, Zongzhang Zhang, Yang Yu, Deheng Ye

Title: Multi-agent In-context Coordination via Decentralized Memory Retrieval

Abstract:
Large transformer models, trained on diverse datasets, have demonstrated impressive few‑shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi‑Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi‑agent In‑context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine‑grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team‑level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub‑trajectories, inform decision‑making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test‑time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual‑ and team‑level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level‑Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at https://github.com/LAMDA‑RL/MAICC.

PaperID: 801, https://arxiv.org/pdf/2511.09139.pdf GitHub

Authors: Zijian Chen, Yuze Sun, Yuan Tian, Wenjun Zhang, Guangtao Zhai

Title: MACEval: A Multi-Agent Continual Evaluation Network for Large Models

Abstract:
Hundreds of benchmarks dedicated to evaluating large models have been presented over the past few years. However, most of them remain closed‑ended and are prone to overfitting due to the potential data contamination. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human‑dependent curation procedure, pose significant challenges for timely maintenance and adaptation. In this paper, we introduce MACEval, a Multi‑Agent Continual Evaluation network for dynamic evaluation of large models, and define new metrics to quantify performance longitudinally. MACEval employs an interactive and autonomous evaluation mode, utilizing role assignment, in‑process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 23 large models demonstrate the effectiveness of MACEval, which also lightens the evaluation process and reduces a considerable amount of overhead. We hope that MACEval can broaden future directions of large model evaluation. Project page: https://github.com/zijianchen98/MACEval.

PaperID: 802, https://arxiv.org/pdf/2511.09005.pdf GitHub

Authors: Alvin Chauhan

Title: AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines

Abstract:
Although Large Language Models (LLMs) show exceptional fluency, efforts persist to extract stronger reasoning capabilities from them. Drawing on search‑based interpretations of LLM computation, this paper advances a systematic framework for understanding LLM reasoning and optimization. Namely, that enhancing reasoning is best achieved by structuring a multi‑agent pipeline to ensure a traversal of the search space in a gradual, incremental, and sequential (GIS) manner. Stated succinctly, high‑quality reasoning is a controlled, incremental search. To test this framework, we investigate the efficacy of recursive refinement (RR)‑‑an iterative process of self‑criticism, adversarial stress‑testing, and integrating critical feedback‑‑as a practical method for implementing GIS search. We designed an experiment comparing a simple, linear pipeline against a complex, explicitly structured pipeline leveraging a recursive refinement layer. The multi‑agent models were constructed to reflect the historical personas of three US Founding Fathers (Hamilton, Jefferson, and Madison) using RAG‑powered corpora and were prompted to generate responses to three contemporary political issues. Model performance was evaluated using a two‑tiered approach: a quantitative score from an LLM arbiter agent and qualitative human judgment. Our results revealed that the complex model consistently outperformed the simple model across all nine test cases with an average arbiter‑outputted score of 88.3 versus 71.7. The complex model's arguments were superior in analytical depth, structural nuance, and strategic framing. We conclude that recursive refinement is a robust architectural feature for enhancing LLM reasoning via GIS search.

PaperID: 803, https://arxiv.org/pdf/2511.08832.pdf GitHub

Authors: Nikunj Gupta, Ludwika Twardecka, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Title: TIGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations

Abstract:
In this paper, we propose capturing and utilizing Temporal Information through Graph‑based Embeddings and Representations or TIGER to enhance multi‑agent reinforcement learning (MARL). We explicitly model how inter‑agent coordination structures evolve over time. While most MARL approaches rely on static or per‑step relational graphs, they overlook the temporal evolution of interactions that naturally arise as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention‑based encoder to aggregate information across these structural and temporal neighborhoods, yielding time‑aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination‑intensive benchmarks, we show that TIGER consistently outperforms diverse value‑decomposition and graph‑based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All codes can be found here: https://github.com/Nikunj‑Gupta/tiger‑marl.

PaperID: 804, https://arxiv.org/pdf/2511.08721.pdf GitHub

Authors: Andreas Einwiller, Kanishka Ghosh Dastidar, Artur Romazanov, Annette Hautli-Janisz, Michael Granitzer, Florian Lemmerich

Title: Benevolent Dictators? On LLM Agent Behavior in Dictator Games

Abstract:
In behavioral sciences, experiments such as the ultimatum game are conducted to assess preferences for fairness or self‑interest of study participants. In the dictator game, a simplified version of the ultimatum game where only one of two players makes a single decision, the dictator unilaterally decides how to split a fixed sum of money between themselves and the other player. Although recent studies have explored behavioral patterns of AI agents based on Large Language Models (LLMs) instructed to adopt different personas, we question the robustness of these results. In particular, many of these studies overlook the role of the system prompt ‑ the underlying instructions that shape the model's behavior ‑ and do not account for how sensitive results can be to slight changes in prompts. However, a robust baseline is essential when studying highly complex behavioral aspects of LLMs. To overcome previous limitations, we propose the LLM agent behavior study (LLM‑ABS) framework to (i) explore how different system prompts influence model behavior, (ii) get more reliable insights into agent preferences by using neutral prompt variations, and (iii) analyze linguistic features in responses to open‑ended instructions by LLM agents to better understand the reasoning behind their behavior. We found that agents often exhibit a strong preference for fairness, as well as a significant impact of the system prompt on their behavior. From a linguistic perspective, we identify that models express their responses differently. Although prompt sensitivity remains a persistent challenge, our proposed framework demonstrates a robust foundation for LLM agent behavior studies. Our code artifacts are available at https://github.com/andreaseinwiller/LLM‑ABS.

PaperID: 805, https://arxiv.org/pdf/2511.08583.pdf GitHub

Authors: Rong Xue, Jiageng Mao, Mingtong Zhang, Yue Wang

Title: SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment

Abstract:
Developing efficient and accurate visuomotor policies poses a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: After iterative distillation, generated actions may deviate from the ground‑truth actions corresponding to the current visual observation, leading to accumulated error as the reflow process repeats and unstable task execution. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge by a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations, while preserving multimodality. This design introduces a consistency correction mechanism that ensures generated actions remain observation‑aligned without sacrificing the efficiency of one‑step flow inference. Extensive experiments across both simulated and real‑world manipulation tasks show that SeFA Policy surpasses state‑of‑the‑art diffusion‑based and flow‑based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation‑consistent action generation, SeFA provides a scalable and dependable solution for real‑time visuomotor policy learning. Code is available on https://github.com/RongXueZoe/SeFA.

PaperID: 806, https://arxiv.org/pdf/2511.07979.pdf GitHub

Authors: Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha

Title: Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models

Abstract:
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi‑step legal reasoning dataset grounded in real‑world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human‑LLM collaborative annotation pipeline that efficiently produces fine‑grained step‑level reasoning annotations and provides a reusable methodological framework for multi‑step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self‑Initiated Chain‑of‑Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human‑designed prompts. MSLR contributes to advancing LLM reasoning and Chain‑of‑Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR‑Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.

PaperID: 807, https://arxiv.org/pdf/2511.07071.pdf GitHub GitHub

Authors: Marcel Müller

Title: Multi-Agent Reinforcement Learning for Deadlock Handling among Autonomous Mobile Robots

Abstract:
This dissertation explores the application of multi‑agent reinforcement learning (MARL) for handling deadlocks in intralogistics systems that rely on autonomous mobile robots (AMRs). AMRs enhance operational flexibility but also increase the risk of deadlocks, which degrade system throughput and reliability. Existing approaches often neglect deadlock handling in the planning phase and rely on rigid control rules that cannot adapt to dynamic operational conditions. To address these shortcomings, this work develops a structured methodology for integrating MARL into logistics planning and operational control. It introduces reference models that explicitly consider deadlock‑capable multi‑agent pathfinding (MAPF) problems, enabling systematic evaluation of MARL strategies. Using grid‑based environments and an external simulation software, the study compares traditional deadlock handling strategies with MARL‑based solutions, focusing on PPO and IMPALA algorithms under different training and execution modes. Findings reveal that MARL‑based strategies, particularly when combined with centralized training and decentralized execution (CTDE), outperform rule‑based methods in complex, congested environments. In simpler environments or those with ample spatial freedom, rule‑based methods remain competitive due to their lower computational demands. These results highlight that MARL provides a flexible and scalable solution for deadlock handling in dynamic intralogistics scenarios, but requires careful tailoring to the operational context.

PaperID: 808, https://arxiv.org/pdf/2511.06448.pdf GitHub GitHub

Authors: Qibing Ren, Zhijie Zheng, Jiaxuan Guo, Junchi Yan, Lizhuang Ma, Jing Shao

Title: When AI Agents Collude Online: Financial Fraud Risks by Collaborative LLM Agents on Social Platforms

Abstract:
In this work, we study the risks of collective financial fraud in large‑scale multi‑agent systems powered by large language model (LLM) agents. We investigate whether agents can collaborate in fraudulent behaviors, how such collaboration amplifies risks, and what factors influence fraud success. To support this research, we present MultiAgentFraudBench, a large‑scale benchmark for simulating financial fraud scenarios based on realistic online interactions. The benchmark covers 28 typical online fraud scenarios, spanning the full fraud lifecycle across both public and private domains. We further analyze key factors affecting fraud success, including interaction depth, activity level, and fine‑grained collaboration failure modes. Finally, we propose a series of mitigation strategies, including adding content‑level warnings to fraudulent posts and dialogues, using LLMs as monitors to block potentially malicious agents, and fostering group resilience through information sharing at the societal level. Notably, we observe that malicious agents can adapt to environmental interventions. Our findings highlight the real‑world risks of multi‑agent financial fraud and suggest practical measures for mitigating them. Code is available at https://github.com/zheng977/MutiAgent4Fraud.

PaperID: 809, https://arxiv.org/pdf/2511.05203.pdf GitHub

Authors: Linus Nwankwo, Bjoern Ellensohn, Christian Rauch, Elmar Rueckert

Title: SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

Abstract:
Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long‑horizon tasks with human‑like reasoning. However, current human‑robot interaction frameworks largely follow a one‑way master‑apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co‑adaptive, multi‑turn nature of everyday human‑to‑human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co‑adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet‑loss‑trained neural encoder that grounds the FMs' outputs into task‑specific latent representations. To support long‑term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real‑world embodied tasks, including instruction following, information retrieval, query‑oriented reasoning, and interactive dialogue, achieving a 90.4% task completion rate and a belief alignment score of ρ\approx 0.83, an absolute improvement of about 20 percentage points over the best ablations. Demos and resources: https://linusnep.github.io/SIL/.

PaperID: 810, https://arxiv.org/pdf/2511.04847.pdf GitHub

Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

Title: Test-Time Adaptation for LLM Agents via Environment Interaction

Abstract:
Large language model (LLM)‑based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre‑training and test‑time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment‑specific components like observation formats, and a semantic misunderstanding of state‑transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct strategies for adapting LLM agents by leveraging environment‑specific information from interaction that is available during deployment. First, an online syntactic alignment (SA) method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format. Second, a deployment‑time dynamics grounding (DG) method employs a persona‑driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with an in‑context world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM‑based agents. For example, on the WebArena multi‑site split, this method increases the agent's success rate from 2% to 23%. We release our code.

PaperID: 811, https://arxiv.org/pdf/2511.04683.pdf GitHub

Authors: L. J. Janse van Rensburg

Title: AI-Powered Citation Auditing: A Zero-Assumption Protocol for Systematic Reference Verification in Academic Research

Abstract:
Academic citation integrity faces persistent challenges, with research indicating 20% of citations contain errors and manual verification requiring months of expert time. This paper presents a novel AI‑powered methodology for systematic, comprehensive reference auditing using agentic AI with tool‑use capabilities. We develop a zero‑assumption verification protocol that independently validates every reference against multiple academic databases (Semantic Scholar, Google Scholar, CrossRef) without assuming any citation is correct. The methodology was validated across 30 academic documents (2,581 references) spanning undergraduate projects to doctoral theses and peer‑reviewed publications. Results demonstrate 91.7% average verification rate on published PLOS papers, with successful detection of fabricated references, retracted articles, orphan citations, and predatory journals. Time efficiency improved dramatically: 90‑minute audits for 916‑reference doctoral theses versus months of manual review. The system achieved <0.5% false positive rate while identifying critical issues manual review might miss. This work establishes the first validated AI‑agent methodology for academic citation integrity, demonstrating practical applicability for supervisors, students, and institutional quality assurance.

PaperID: 812, https://arxiv.org/pdf/2511.04153.pdf GitHub

Authors: Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali

Title: BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

Abstract:
Text‑to‑SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLM) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi‑agent LLM pipelines, with systematic performance benchmarking across a range of small to large open‑source models: (1) Multi‑agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner‑Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder‑Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird‑Bench Mini‑Dev set reveal that Multi‑Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5‑7b‑Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner‑Coder pipeline yields the best results, with DeepSeek‑R1‑32B and QwQ‑32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa‑sql.

PaperID: 813, https://arxiv.org/pdf/2511.04076.pdf GitHub

Authors: Hao Li, Haotian Chen, Ruoyuan Gong, Juanjuan Wang, Hao Jiang

Title: Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

Abstract:
Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry‑pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose Agentmandering, a framework that reimagines redistricting as a turn‑based negotiation between two agents representing opposing political interests. Drawing inspiration from game‑theoretic ideas, particularly the Choose‑and‑Freeze protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post‑2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing‑state scenarios. Our code is available at https://github.com/Lihaogx/AgentMandering.

PaperID: 814, https://arxiv.org/pdf/2511.03542.pdf GitHub

Authors: Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Flora Amato, Vincenzo Moscato

Title: SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties

Abstract:
Medical question answering systems face deployment challenges including hallucinations, bias, computational demands, privacy concerns, and the need for specialized expertise across diverse domains. Here, we present SOLVE‑Med, a multi‑agent architecture combining domain‑specialized small language models for complex medical queries. The system employs a Router Agent for dynamic specialist selection, ten specialized models (1B parameters each) fine‑tuned on specific medical domains, and an Orchestrator Agent that synthesizes responses. Evaluated on Italian medical forum data across ten specialties, SOLVE‑Med achieves superior performance with ROUGE‑1 of 0.301 and BERTScore F1 of 0.697, outperforming standalone models up to 14B parameters while enabling local deployment. Our code is publicly available on GitHub: https://github.com/PRAISELab‑PicusLab/SOLVE‑Med.

PaperID: 815, https://arxiv.org/pdf/2511.03475.pdf GitHub

Authors: Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

Title: ContextPilot: Fast Long-Context Inference via Context Reuse

Abstract:
AI applications increasingly depend on long‑context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval‑augmented generation, agent memory layers, and multi‑agent orchestration. As input contexts get longer, prefill latency becomes the main bottleneck. Yet today's prefill acceleration techniques face a trade‑off: they either preserve reasoning quality but deliver little KV‑cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long‑context inference. ContextPilot introduces a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns). It further proposes context ordering and de‑duplication techniques to maximize KV‑cache reuse. To preserve reasoning quality under reuse, it introduces succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to 3× compared to state‑of‑the‑art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality. ContextPilot is open‑sourced at: https://github.com/EfficientContext/ContextPilot.

PaperID: 816, https://arxiv.org/pdf/2511.02560.pdf GitHub

Authors: Dan Bohus, Sean Andrist, Ann Paradiso, Nick Saw, Tim Schoonbeek, Maia Stiber

Title: SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration

Abstract:
We introduce SigmaCollab, a dataset enabling research on physically situated human‑AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed‑reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head‑mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post‑hoc. While the dataset is relatively small in size (~ 14 hours), its application‑driven and interactive nature brings to the fore novel research challenges for human‑AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed‑reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.

PaperID: 817, https://arxiv.org/pdf/2511.02424.pdf GitHub GitHub

Authors: Jae-Woo Choi, Hyungmin Kim, Hyobin Ong, Youngwoo Yoon, Minsu Jang, Dohyung Kim, Jaehong Kim

Title: ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning

Abstract:
Recent advancements in large language models (LLMs) have enabled significant progress in decision‑making and task planning for embodied autonomous agents. However, most existing methods struggle with complex, long‑horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task‑planning method that decomposes a complex goal into manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal‑specific, subgoal‑level examples from episodic memory and shares environment‑specific observations through working memory. Experiments on the WAH‑NL and ALFRED show ReAcTree consistently outperforms strong task‑planning baselines such as ReAct across diverse LLMs. Notably, on WAH‑NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct's 31%. The code is available at https://github.com/Choi‑JaeWoo/ReAcTree.git.

PaperID: 818, https://arxiv.org/pdf/2511.02238.pdf GitHub

Authors: Keyu Zhao, Weiquan Lin, Qirui Zheng, Fengli Xu, Yong Li

Title: Deep Ideation: Designing LLM Agents to Generate Novel Research Ideas on Scientific Concept Network

Abstract:
Novel research ideas play a critical role in advancing scientific inquiries. Recent advancements in Large Language Models (LLMs) have demonstrated their potential to generate novel research ideas by leveraging large‑scale scientific literature. However, previous work in research ideation has primarily relied on simplistic methods, such as keyword co‑occurrence or semantic similarity. These approaches focus on identifying statistical associations in the literature but overlook the complex, contextual relationships between scientific concepts, which are essential to effectively leverage knowledge embedded in human literature. For instance, papers that simultaneously mention "keyword A" and "keyword B" often present research ideas that integrate both concepts. Additionally, some LLM‑driven methods propose and refine research ideas using the model's internal knowledge, but they fail to effectively utilize the scientific concept network, limiting the grounding of ideas in established research. To address these challenges, we propose the Deep Ideation framework to address these challenges, integrating a scientific network that captures keyword co‑occurrence and contextual relationships, enriching LLM‑driven ideation. The framework introduces an explore‑expand‑evolve workflow to iteratively refine research ideas, using an Idea Stack to track progress. A critic engine, trained on real‑world reviewer feedback, guides the process by providing continuous feedback on the novelty and feasibility of ideas. Our experiments show that our approach improves the quality of generated ideas by 10.67% compared to other methods, with ideas surpassing top conference acceptance levels. Human evaluation highlights their practical value in scientific research, and ablation studies confirm the effectiveness of each component in the workflow. Code repo is available at https://github.com/kyZhao‑1/Deep‑Ideation.

PaperID: 819, https://arxiv.org/pdf/2511.02136.pdf GitHub

Authors: Valentin Mohl, Sascha Frey, Reuben Leyland, Kang Li, George Nigmatulin, Mihai Cucuringu, Stefan Zohren, Jakob Foerster, Anisoara Calinescu

Title: JaxMARL-HFT: GPU-Accelerated Large-Scale Multi-Agent Reinforcement Learning for High-Frequency Trading

Abstract:
Agent‑based modelling (ABM) approaches for high‑frequency financial markets are difficult to calibrate and validate, partly due to the large parameter space created by defining fixed agent policies. Multi‑agent reinforcement learning (MARL) enables more realistic agent behaviour and reduces the number of free parameters, but the heavy computational cost has so far limited research efforts. To address this, we introduce JaxMARL‑HFT (JAX‑based Multi‑Agent Reinforcement Learning for High‑Frequency Trading), the first GPU‑accelerated open‑source multi‑agent reinforcement learning environment for high‑frequency trading (HFT) on market‑by‑order (MBO) data. Extending the JaxMARL framework and building on the JAX‑LOB implementation, JaxMARL‑HFT is designed to handle a heterogeneous set of agents, enabling diverse observation/action spaces and reward functions. It is designed flexibly, so it can also be used for single‑agent RL, or extended to act as an ABM with fixed‑policy agents. Leveraging JAX enables up to a 240x reduction in end‑to‑end training time, compared with state‑of‑the‑art reference implementations on the same hardware. This significant speed‑up makes it feasible to exploit the large, granular datasets available in high‑frequency trading, and to perform the extensive hyperparameter sweeps required for robust and efficient MARL research in trading. We demonstrate the use of JaxMARL‑HFT with independent Proximal Policy Optimization (IPPO) for a two‑player environment, with an order execution and a market making agent, using one year of LOB data (400 million orders), and show that these agents learn to outperform standard benchmarks. The code for the JaxMARL‑HFT framework is available on GitHub.

PaperID: 820, https://arxiv.org/pdf/2511.01884.pdf GitHub

Authors: Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

Title: CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Abstract:
Developing efficient CUDA kernels is increasingly critical for AI applications such as large‑scale LLM training. However, manual kernel design is both costly and time‑consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low‑efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training‑free multi‑agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI‑o3, achieves 97.6% correctness of generated kernels and an average 1.68× speedup over PyTorch baselines, substantially surpassing state‑of‑the‑art models including OpenAI‑o3 and Kevin on KernelBench.Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI‑o3, GPT‑5, gpt‑oss‑120B, Claude‑Sonnet‑4, QwQ‑32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about \ 0.3 API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and \ 5 API cost per kernel. Our results highlight that multi‑agent, training‑free workflows can enable cost‑effective, generalizable, and high‑performance CUDA kernel optimization. Code available at https://github.com/OptimAI‑Lab/CudaForge

PaperID: 821, https://arxiv.org/pdf/2511.01718.pdf GitHub

Authors: Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

Title: Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Abstract:
Vision‑language‑action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding‑acting loop, yielding unified VLAs that jointly understand, generate, and act ‑‑ reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two‑stage training pipeline and several inference‑time techniques that optimize performance and efficiency. Our approach achieves state‑of‑the‑art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4× faster inference than autoregressive methods, and we demonstrate its effectiveness through in‑depth analysis and real‑world evaluations. Our project page is available at https://irpn‑eai.github.io/UD‑VLA.github.io/.

PaperID: 822, https://arxiv.org/pdf/2511.01527.pdf GitHub

Authors: Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, Zhijie Deng

Title: TPS-Bench: Evaluating AI Agents' Tool Planning \& Scheduling Abilities in Compounding Tasks

Abstract:
Large language model (LLM) agents have exhibited strong problem‑solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real‑world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS‑Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS‑Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed‑source and open‑source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM‑4.5 achieves an outperforming task completion rate of 64.72% with extensive sequential tool calls, hence suffering from significantly long execution time. By contrast, GPT‑4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering reinforcement learning (RL) can be a viable way to improve the scheduling efficiency without compromising performance, we perform an initial study on Qwen3‑1.7B and witness a 14% reduction in execution time alongside a 6% gain in task completion rate based on rarely 100 RL training samples. Our code is available https://github.com/hanwenxu1/mcp‑agent.

PaperID: 823, https://arxiv.org/pdf/2511.01448.pdf GitHub

Authors: Zhengjun Huang, Zhoujin Tian, Qintian Guo, Fangyuan Zhang, Yingli Zhou, Di Jiang, Zeying Xie, Xiaofang Zhou

Title: LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning

Abstract:
Large Language Model (LLM) agents exhibit remarkable conversational and reasoning capabilities but remain constrained by limited context windows and the lack of persistent memory. Recent efforts address these limitations via external memory architectures, often employing graph‑based representations, yet most adopt flat, entangled structures that intertwine semantics with topology, leading to redundant representations, unstructured retrieval, and degraded efficiency and accuracy. To resolve these issues, we propose LiCoMemory, an end‑to‑end agentic memory framework for real‑time updating and retrieval, which introduces CogniGraph, a lightweight hierarchical graph that utilizes entities and relations as semantic indexing layers, and employs temporal and hierarchy‑aware search with integrated reranking for adaptive and coherent knowledge retrieval. Experiments on long‑term dialogue benchmarks, LoCoMo and LongMemEval, show that LiCoMemory not only outperforms established baselines in temporal reasoning, multi‑session consistency, and retrieval efficiency, but also notably reduces update latency. Our official code and data are available at https://github.com/EverM0re/LiCoMemory.

PaperID: 824, https://arxiv.org/pdf/2511.01188.pdf GitHub

Authors: Lvhua Wu, Xuefeng Jiang, Sheng Sun, Yan Lei, Tian Wen, Yuwei Wang, Min Liu

Title: ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

Abstract:
The rapid spread of fake news threatens social stability and public trust, highlighting the urgent need for its effective detection. Although large language models (LLMs) show potential in fake news detection, they are limited by knowledge cutoff and easily generate factual hallucinations when handling time‑sensitive news. Furthermore, the thinking of a single LLM easily falls into early stance locking and confirmation bias, making it hard to handle both content reasoning and fact checking simultaneously. To address these challenges, we propose ZoFia, a two‑stage zero‑shot fake news detection framework. In the first retrieval stage, we propose novel Hierarchical Salience and Salience‑Calibrated Minimum Marginal Relevance (SC‑MMR) algorithm to extract core entities accurately, which drive dual‑source retrieval to overcome knowledge and evidence gaps. In the subsequent stage, a multi‑agent system conducts multi‑perspective reasoning and verification in parallel and achieves an explainable and robust result via adversarial debate. Comprehensive experiments on two public datasets show that ZoFia outperforms existing zero‑shot baselines and even most few‑shot methods. Our code has been open‑sourced to facilitate the research community at https://github.com/SakiRinn/ZoFia.

PaperID: 825, https://arxiv.org/pdf/2511.01166.pdf GitHub

Authors: Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Minghua He, Leyi Pan, Zhaoyang Liu, Bolin Ding, Ying Li

Title: MicroRemed: Benchmarking LLMs in Microservices Remediation

Abstract:
Large Language Models (LLMs) integrated with agent‑based reasoning frameworks have recently shown strong potential for autonomous decision‑making and system‑level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human‑crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end‑to‑end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi‑agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end‑to‑end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.

PaperID: 826, https://arxiv.org/pdf/2511.01008.pdf GitHub

Authors: Haolin Yang, Jipeng Zhang, Zhitao He, Alexander Zhou, Yi R. Fung

Title: MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Abstract:
Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text‑to‑SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self‑correct through environmental interaction. To bridge this gap, we propose MARS‑SQL, a trainable multi‑agent framework for Text‑to‑SQL. Rather than introducing a new standalone SQL primitive, MARS‑SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi‑turn RL policy within a ReAct‑style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next‑token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS‑SQL achieves state‑of‑the‑art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out‑of‑domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS‑SQL.

PaperID: 827, https://arxiv.org/pdf/2511.00998.pdf GitHub

Authors: Ziye Wang, Li Kang, Yiran Qin, Jiahua Ma, Zhanglin Peng, Lei Bai, Ruimao Zhang

Title: GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies

Abstract:
Recently, effective coordination in embodied multi‑agent systems has remained a fundamental challenge, particularly in scenarios where agents must balance individual perspectives with global environmental awareness. Existing approaches often struggle to balance fine‑grained local control with comprehensive scene understanding, resulting in limited scalability and compromised collaboration quality. In this paper, we present GauDP, a novel Gaussian‑image synergistic representation that facilitates scalable, perception‑aware imitation learning in multi‑agent collaborative systems. Specifically, GauDP constructs a globally consistent 3D Gaussian field from decentralized RGB observations, then dynamically redistributes 3D Gaussian attributes to each agent's local perspective. This enables all agents to adaptively query task‑critical features from the shared scene representation while maintaining their individual viewpoints. This design facilitates both fine‑grained control and globally coherent behavior without requiring additional sensing modalities (e.g., 3D point cloud). We evaluate GauDP on the RoboFactory benchmark, which includes diverse multi‑arm manipulation tasks. Our method achieves superior performance over existing image‑based methods and approaches the effectiveness of point‑cloud‑driven methods, while maintaining strong scalability as the number of agents increases.

PaperID: 828, https://arxiv.org/pdf/2511.00136.pdf GitHub

Authors: Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Xiaocong Li, Lin Zhang, Lei Li

Title: A Dual Large Language Models Architecture with Herald Guided Prompts for Parallel Fine Grained Traffic Signal Control

Abstract:
Leveraging large language models (LLMs) in traffic signal control (TSC) improves optimization efficiency and interpretability compared to traditional reinforcement learning (RL) methods. However, existing LLM‑based approaches are limited by fixed time signal durations and are prone to hallucination errors, while RL methods lack robustness in signal timing decisions and suffer from poor generalization. To address these challenges, this paper proposes HeraldLight, a dual LLMs architecture enhanced by Herald guided prompts. The Herald Module extracts contextual information and forecasts queue lengths for each traffic phase based on real‑time conditions. The first LLM, LLM‑Agent, uses these forecasts to make fine grained traffic signal control, while the second LLM, LLM‑Critic, refines LLM‑Agent's outputs, correcting errors and hallucinations. These refined outputs are used for score‑based fine‑tuning to improve accuracy and robustness. Simulation experiments using CityFlow on real world datasets covering 224 intersections in Jinan (12), Hangzhou (16), and New York (196) demonstrate that HeraldLight outperforms state of the art baselines, achieving a 20.03% reduction in average travel time across all scenarios and a 10.74% reduction in average queue length on the Jinan and Hangzhou scenarios. The source code is available on GitHub: https://github.com/BUPT‑ANTlab/HeraldLight.

PaperID: 829, https://arxiv.org/pdf/2511.00096.pdf GitHub

Authors: Shangyu Lou

Title: Urban-MAS: Human-Centered Urban Prediction with LLM-Based Multi-Agent System

Abstract:
Urban Artificial Intelligence (Urban AI) has advanced human‑centered urban tasks such as perception prediction and human dynamics. Large Language Models (LLMs) can integrate multimodal inputs to address heterogeneous data in complex urban systems but often underperform on domain‑specific tasks. Urban‑MAS, an LLM‑based Multi‑Agent System (MAS) framework, is introduced for human‑centered urban prediction under zero‑shot settings. It includes three agent types: Predictive Factor Guidance Agents, which prioritize key predictive factors to guide knowledge extraction and enhance the effectiveness of compressed urban knowledge in LLMs; Reliable UrbanInfo Extraction Agents, which improve robustness by comparing multiple outputs, validating consistency, and re‑extracting when conflicts occur; and Multi‑UrbanInfo Inference Agents, which integrate extracted multi‑source information across dimensions for prediction. Experiments on running‑amount prediction and urban perception across Tokyo, Milan, and Seattle demonstrate that Urban‑MAS substantially reduces errors compared to single‑LLM baselines. Ablation studies indicate that Predictive Factor Guidance Agents are most critical for enhancing predictive performance, positioning Urban‑MAS as a scalable paradigm for human‑centered urban AI prediction. Code is available on the project website:https://github.com/THETUREHOOHA/UrbanMAS

PaperID: 830, https://arxiv.org/pdf/2510.27623.pdf GitHub

Authors: Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang

Title: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Abstract:
Recent advances in Vision‑Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task‑oriented actions from visual inputs. However, such vision‑driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker‑specified multi‑step policy. We introduce BEAT, the first framework to inject such visual backdoors into VLM‑based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two‑stage training scheme that first applies supervised fine‑tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger‑present and trigger‑free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and VLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out‑of‑distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in VLM‑based embodied agents, underscoring the need for robust defenses before real‑world deployment.

PaperID: 831, https://arxiv.org/pdf/2510.27544.pdf GitHub GitHub

Authors: Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

Title: Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

Abstract:
Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad‑hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad‑hoc generated methods can capture the decision chains of real‑world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well‑suited to capture the nature of agentic decision chain‑based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real‑world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi‑step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM's ability to perform multi‑step causal reasoning and to distill cause‑and‑effect relations from complex systems. We find that models score 65.6% on TCE‑normal, and 7.5% on TCE‑hard. This shows that state‑of‑the‑art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available at our \hrefhttps://github.com/nik‑hz/tempobenchGitHub repository.

PaperID: 832, https://arxiv.org/pdf/2510.27238.pdf GitHub

Authors: Chuxuan Hu, Maxwell Yang, James Weiland, Yeji Lim, Suhas Palawala, Daniel Kang

Title: DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries

Abstract:
Manually conducting real‑world data analyses is labor‑intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open‑domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end‑to‑end paradigm that answers users' analytic queries in natural language on large‑scale open‑domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA‑Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real‑world applications that have gained significant public attention and require the retrieval and analysis of open‑domain data. We develop DRAMA‑Bot, a multi‑agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub‑agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA‑Bot on DRAMA‑Bench together with five state‑of‑the‑art baseline agents. DRAMA‑Bot achieves 86.5% task accuracy at a cost of 0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at https://github.com/uiuc‑kang‑lab/drama.

PaperID: 833, https://arxiv.org/pdf/2510.26887.pdf GitHub GitHub

Authors: Francisco Villaescusa-Navarro, Boris Bolliet, Pablo Villanueva-Domingo, Adrian E. Bayer, Aidan Acquah, Chetana Amancharla, Almog Barzilay-Siegal, Pablo Bermejo, Camille Bilodeau, Pablo Cárdenas Ramírez, Miles Cranmer, Urbano L. França, ChangHoon Hahn, Yan-Fei Jiang, Raul Jimenez, Jun-Young Lee, Antonio Lerario, Osman Mamun, Thomas Meier, Anupam A. Ojha, Pavlos Protopapas, Shimanto Roy, David N. Spergel, Pedro Tarancón-Álvarez, Ujjwal Tiwari, Matteo Viel, Digvijay Wadekar, Chi Wang, Bonny Y. Wang, Licong Xu, Yossi Yovel, Shuwen Yue, Wen-Han Zhou, Qiyao Zhu, Jiajun Zou, Íñigo Zubeldia

Title: The Denario project: Deep knowledge AI agents for scientific discovery

Abstract:
We present Denario, an AI multi‑agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end‑to‑end scientific analysis using Cmbagent as a deep‑research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI‑generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review‑like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI‑driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot‑AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot‑ai/Denario, and the full app will be deployed on the cloud.

PaperID: 834, https://arxiv.org/pdf/2510.26702.pdf GitHub

Authors: Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervé Muyal, Marcelo Yannuzzi

Title: Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching

Abstract:
Authorizing Large Language Model driven agents to dynamically invoke tools and access protected resources introduces significant risks, since current methods for delegating authorization grant overly broad permissions and give access to tools allowing agents to operate beyond the intended task scope. We introduce and assess a delegated authorization model enabling authorization servers to semantically inspect access requests to protected resources, and issue access tokens constrained to the minimal set of scopes necessary for the agents' assigned tasks. Given the unavailability of datasets centered on delegated authorization flows, particularly including both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA, a dataset and data generation pipeline for benchmarking semantic matching between tasks and scopes. Our experiments show both the potential and current limitations of model‑based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques enabling intent‑aware authorization for multi‑agent and tool‑augmented applications, including fine‑grained control, such as Task‑Based Access Control (TBAC).

PaperID: 835, https://arxiv.org/pdf/2510.26585.pdf GitHub

Authors: Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, Tao Lin

Title: Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems

Abstract:
While Multi‑Agent Systems (MAS) excel at complex tasks, their growing autonomy with operational complexity often leads to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post‑hoc failure attribution, lacking proactive, real‑time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent's architecture. Triggered by an LLM‑free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.68% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach.

PaperID: 836, https://arxiv.org/pdf/2510.26536.pdf GitHub

Authors: Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao, Xiaoshuai Hao, Yaoxu Lyu, Mingyu Cao, Junkai Zhao, Huaihai Lyu, Enshen Zhou, Ning Chen, Yankai Fu, Cheng Peng, Wei Guo, Dong Liang, Zhuo Chen, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Title: RoboOS-NeXT: A Unified Memory-based Framework for Lifelong, Scalable, and Robust Multi-Robot Collaboration

Abstract:
The proliferation of collaborative robots across diverse tasks and embodiments presents a central challenge: achieving lifelong adaptability, scalable coordination, and robust scheduling in multi‑agent systems. Existing approaches, from vision‑language‑action (VLA) models to hierarchical frameworks, fall short due to their reliance on limited or dividual‑agent memory. This fundamentally constrains their ability to learn over long horizons, scale to heterogeneous teams, or recover from failures, highlighting the need for a unified memory representation. To address these limitations, we introduce RoboOS‑NeXT, a unified memory‑based framework for lifelong, scalable, and robust multi‑robot collaboration. At the core of RoboOS‑NeXT is the novel Spatio‑Temporal‑Embodiment Memory (STEM), which integrates spatial scene geometry, temporal event history, and embodiment profiles into a shared representation. This memory‑centric design is integrated into a brain‑cerebellum framework, where a high‑level brain model performs global planning by retrieving and updating STEM, while low‑level controllers execute actions locally. This closed loop between cognition, memory, and execution enables dynamic task allocation, fault‑tolerant collaboration, and consistent state synchronization. We conduct extensive experiments spanning complex coordination tasks in restaurants, supermarkets, and households. Our results demonstrate that RoboOS‑NeXT achieves superior performance across heterogeneous embodiments, validating its effectiveness in enabling lifelong, scalable, and robust multi‑robot collaboration. Project website: https://flagopen.github.io/RoboOS/

PaperID: 837, https://arxiv.org/pdf/2510.26494.pdf GitHub

Authors: Sadegh Shirani, Mohsen Bayati

Title: Simulating and Experimenting with Social Media Mobilization Using LLM Agents

Abstract:
Online social networks have transformed the ways in which political mobilization messages are disseminated, raising new questions about how peer influence operates at scale. Building on the landmark 61‑million‑person Facebook experiment \citepbond201261, we develop an agent‑based simulation framework that integrates real U.S. Census demographic distributions, authentic Twitter network topology, and heterogeneous large language model (LLM) agents to examine the effect of mobilization messages on voter turnout. Each simulated agent is assigned demographic attributes, a personal political stance, and an LLM variant (\textttGPT‑4.1, \textttGPT‑4.1‑Mini, or \textttGPT‑4.1‑Nano) reflecting its political sophistication. Agents interact over realistic social network structures, receiving personalized feeds and dynamically updating their engagement behaviors and voting intentions. Experimental conditions replicate the informational and social mobilization treatments of the original Facebook study. Across scenarios, the simulator reproduces qualitative patterns observed in field experiments, including stronger mobilization effects under social message treatments and measurable peer spillovers. Our framework provides a controlled, reproducible environment for testing counterfactual designs and sensitivity analyses in political mobilization research, offering a bridge between high‑validity field experiments and flexible computational modeling.\footnoteCode and data available at https://github.com/CausalMP/LLM‑SocioPol

PaperID: 838, https://arxiv.org/pdf/2510.26089.pdf GitHub

Authors: Fazel Arasteh, Arian Haghparast, Manos Papagelis

Title: Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing

Abstract:
Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi‑vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi‑agent reinforcement learning (MARL) framework for coordinated, network‑aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub‑based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub‑to‑hub under agent control, while SPF handles micro‑routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q‑Mixing (A‑QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow‑aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network‑constrained MARL for scalable, coordinated, and congestion‑aware routing in intelligent transportation systems.

PaperID: 839, https://arxiv.org/pdf/2510.25595.pdf GitHub

Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai

Title: Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

Abstract:
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table‑top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine‑tuning‑plus‑verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information‑seeking and providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment‑based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles

PaperID: 840, https://arxiv.org/pdf/2510.25320.pdf GitHub

Authors: Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, Yuhang Yao

Title: GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning

Abstract:
Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task‑solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub‑tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi‑step reasoning scenarios. We introduce Graph‑based Agent Planning (GAP), a novel framework that explicitly models inter‑task dependencies through graph‑based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency‑aware sub‑task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency‑aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high‑quality dataset of graph‑based planning traces derived from the Multi‑Hop Question Answering (MHQA) benchmark. We employ a two‑stage training strategy: supervised fine‑tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness‑based reward function on strategically sampled queries where tool‑based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi‑step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: https://github.com/WJQ7777/Graph‑Agent‑Planning.

PaperID: 841, https://arxiv.org/pdf/2510.24909.pdf GitHub

Authors: Vik Pant, Eric Yu

Title: Computational Foundations for Strategic Coopetition: Formalizing Trust and Reputation Dynamics

Abstract:
Modern socio‑technical systems increasingly involve multi‑stakeholder environments where actors simultaneously cooperate and compete. These coopetitive relationships exhibit dynamic trust evolution based on observed behavior over repeated interactions. While conceptual modeling languages like i represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi‑agent systems provide algorithmic updating but lack grounding in conceptual models that capture strategic dependencies covering mixed motives of actors. This technical report bridges this gap by developing a computational trust model that extends game‑theoretic foundations for strategic coopetition with dynamic trust evolution. Building on companion work that achieved 58/60 validation (96.7%) for logarithmic specifications, we introduce trust as a two‑layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling practitioners to instantiate computational trust models from i dependency networks encompassing mixed motives of actors. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault‑Nissan Alliance case study (1999‑2025) achieves 49/60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods.

PaperID: 842, https://arxiv.org/pdf/2510.24803.pdf GitHub

Authors: Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong

Title: MASPRM: Multi-Agent System Process Reward Model

Abstract:
Practical deployment of multi‑agent systems (MAS) demands strong performance at test time, motivating methods that guide search during inference and selectively spend compute to improve quality. We present the Multi‑Agent System Process Reward Model (MASPRM). It assigns values to partial inter‑agent transcripts for each action and each agent, and acts as a controller during inference. MASPRM is trained from multi‑agent Monte Carlo Tree Search (MCTS) rollouts labeled only with terminal outcome rewards, without requiring human step‑level annotations, by propagating returns to local targets. During inference, MASPRM guides step‑level beam search (SBS) and MCTS, focusing computation on promising branches and pruning unpromising ones. We train and test MASPRM across different tasks and domains, using GSM8K, MATH, MMLU, and LogiQA as benchmarks. Averaged across these benchmarks, MASPRM improves Hit@1 over policy likelihood by up to +13.4 points and improves ranking quality, reducing Hit@1‑>Hit@5 gaps by up to 10.3 points. MASPRM complements inference‑time search by scoring intermediate routed transcripts to guide rollouts in MAS with fixed schedules. Code: https://github.com/milad1378yz/MASPRM

PaperID: 843, https://arxiv.org/pdf/2510.24695.pdf GitHub

Authors: Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang

Title: AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

Abstract:
Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high‑quality, multidisciplinary data situated precisely within the LLM's ZPD. This engine supports both continued pre‑training with knowledge‑intensive data and targeted post‑training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier‑30B‑A3B model on our synthesized data, which achieves state‑of‑the‑art results on demanding benchmarks like Humanity's Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD‑guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.

PaperID: 844, https://arxiv.org/pdf/2510.24339.pdf GitHub

Authors: Yunxuan Jiang, Silan Hu, Xiaoning Wang, Yuanyuan Zhang, Xiangyu Chang

Title: VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation

Abstract:
Large language models (LLMs) become increasingly integrated into data science workflows for automated system design. However, these LLM‑driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real‑world datasets. This paper provides VDSAgents, a multi‑agent system grounded in the Predictability‑Computability‑Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by an elegant agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state‑of‑the‑art end‑to‑end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek‑V3 and GPT‑4o as backends. VDSAgents consistently outperforms the results of AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM‑driven data science automation.

PaperID: 845, https://arxiv.org/pdf/2510.24284.pdf GitHub

Authors: Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, Siheng Chen

Title: MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

Abstract:
Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real‑world deployment. To overcome these limitations, we introduce MCP‑Flow, an automated web‑agent‑driven pipeline for large‑scale server discovery, data synthesis, and model training. MCP‑Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high‑quality instruction‑function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP‑Flow's effectiveness in driving superior MCP tool selection, function‑call generation, and enhanced agentic task performance. MCP‑Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real‑world MCP environments. MCP‑Flow is publicly available at https://github.com/wwh0411/MCP‑Flow.

PaperID: 846, https://arxiv.org/pdf/2510.24168.pdf GitHub

Authors: Weihua Cheng, Junming Liu, Yifei Sun, Botian Shi, Yirong Chen, Ding Wang

Title: MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced GUI agents, yet long‑horizon automation remains constrained by two critical bottlenecks: context overload from raw sequential trajectory dependence and architectural redundancy from over‑engineered expert modules. Prevailing End‑to‑End and Multi‑Agent paradigms struggle with error cascades caused by concatenated visual‑textual histories and incur high inference latency due to redundant expert components, limiting their practical deployment. To address these issues, we propose the Memory‑Driven GUI Agent (MGA), a minimalist framework that decouples long‑horizon trajectories into independent decision steps linked by a structured state memory. MGA operates on an ``Observe First and Memory Enhancement`` principle, powered by two tightly coupled core mechanisms: (1) an Observer module that acts as a task‑agnostic, intent‑free screen state reader to eliminate confirmation bias, visual hallucinations, and perception bias at the root; and (2) a Structured Memory mechanism that distills, validates, and compresses each interaction step into verified state deltas, constructing a lightweight state transition chain to avoid irrelevant historical interference and system redundancy. By replacing raw historical aggregation with compact, fact‑based memory transitions, MGA drastically reduces cognitive overhead and system complexity. Extensive experiments on OSWorld and real‑world applications demonstrate that MGA achieves highly competitive performance in open‑ended GUI tasks while maintaining architectural simplicity, offering a scalable and efficient blueprint for next‑generation GUI automation https://github.com/MintyCo0kie/MGA4OSWorld.

PaperID: 847, https://arxiv.org/pdf/2510.24014.pdf GitHub

Authors: Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han

Title: TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Abstract:
The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE TEXT2DB that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe‑PlanAnalyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code‑based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB

PaperID: 848, https://arxiv.org/pdf/2510.23535.pdf GitHub

Authors: Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian

Title: Sequential Multi-Agent Dynamic Algorithm Configuration

Abstract:
Dynamic algorithm configuration (DAC) is a recent trend in automated machine learning, which can dynamically adjust the algorithm's configuration during the execution process and relieve users from tedious trial‑and‑error tuning tasks. Recently, multi‑agent reinforcement learning (MARL) approaches have improved the configuration of multiple heterogeneous hyperparameters, making various parameter configurations for complex algorithms possible. However, many complex algorithms have inherent inter‑dependencies among multiple parameters (e.g., determining the operator type first and then the operator's parameter), which are, however, not considered in previous approaches, thus leading to sub‑optimal results. In this paper, we propose the sequential multi‑agent DAC (Seq‑MADAC) framework to address this issue by considering the inherent inter‑dependencies of multiple parameters. Specifically, we propose a sequential advantage decomposition network, which can leverage action‑order information through sequential advantage decomposition. Experiments from synthetic functions to the configuration of multi‑objective optimization algorithms demonstrate Seq‑MADAC's superior performance over state‑of‑the‑art MARL methods and show strong generalization across problem classes. Seq‑MADAC establishes a new paradigm for the widespread dependency‑aware automated algorithm configuration. Our code is available at https://github.com/lamda‑bbo/seq‑madac.

PaperID: 849, https://arxiv.org/pdf/2510.23190.pdf GitHub

Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense

Title: Evaluation of Vision-LLMs in Surveillance Video

Abstract:
The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision‑language models (VLMs) by framing anomalous action recognition as a zero‑shot, language‑grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre‑trained vision‑‑LLMs can act as spatially‑grounded, zero‑shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF‑Crime and RWF‑2000 under prompting and privacy‑preserving conditions. Few‑shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters ‑‑ especially full‑body GAN transforms ‑‑ introduce inconsistencies that degrade accuracy. These results chart where current vision‑‑LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task‑specific training: structure‑aware prompts, lightweight spatial memory across clips, scene‑graph or 3D‑pose priors during description, and privacy methods that preserve action‑relevant geometry. This positions zero‑shot, language‑grounded pipelines as adaptable building blocks for embodied, real‑world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition

PaperID: 850, https://arxiv.org/pdf/2510.22740.pdf GitHub

Authors: Sai Krishna Ghanta, Ramviyas Parasuraman

Title: Policies over Poses: Reinforcement Learning based Distributed Pose-Graph Optimization for Multi-Robot SLAM

Abstract:
We consider the distributed pose‑graph optimization (PGO) problem, which is fundamental in accurate trajectory estimation in multi‑robot simultaneous localization and mapping (SLAM). Conventional iterative approaches linearize a highly non‑convex optimization objective, requiring repeated solving of normal equations, which often converge to local minima and thus produce suboptimal estimates. We propose a scalable, outlier‑robust distributed planar PGO framework using Multi‑Agent Reinforcement Learning (MARL). We cast distributed PGO as a partially observable Markov game defined on local pose‑graphs, where each action refines a single edge's pose estimate. A graph partitioner decomposes the global pose graph, and each robot runs a recurrent edge‑conditioned Graph Neural Network (GNN) encoder with adaptive edge‑gating to denoise noisy edges. Robots sequentially refine poses through a hybrid policy that utilizes prior action memory and graph embeddings. After local graph correction, a consensus scheme reconciles inter‑robot disagreements to produce a globally consistent estimate. Our extensive evaluations on a comprehensive suite of synthetic and real‑world datasets demonstrate that our learned MARL‑based actors reduce the global objective by an average of 37.5% more than the state‑of‑the‑art distributed PGO framework, while enhancing inference efficiency by at least 6X. We also demonstrate that actor replication allows a single learned policy to scale effortlessly to substantially larger robot teams without any retraining. Code is publicly available at https://github.com/herolab‑uga/policies‑over‑poses.

PaperID: 851, https://arxiv.org/pdf/2510.21855.pdf GitHub

Authors: Ryan Zhang, Herbert Woisetschläger

Title: SIGN: Schema-Induced Games for Naming

Abstract:
Real‑world AI systems are tackling increasingly complex problems, often through interactions among large language model (LLM) agents. When these agents develop inconsistent conventions, coordination can break down. Applications such as collaborative coding and distributed planning therefore require reliable, consistent communication, and scalability is a central concern as systems grow. We introduce Schema‑Induced Games for Naming (SIGN), a naming game that examines how lightweight structure can steer convention formation. We compare schema‑induced communication to unconstrained natural language and find faster convergence with up to 5.8x higher agreement. These results suggest that minimal structure can act as a simple control knob for efficient multi‑agent coordination, pointing toward broader applications beyond the naming game.

PaperID: 852, https://arxiv.org/pdf/2510.21817.pdf GitHub

Authors: Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He

Title: VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Abstract:
Current Vision‑Language‑Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real‑time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA‑E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real‑time interruption. The core of our approach is a dual‑model architecture where two parallel VLA instances operate as an ``Active Model'' and a ``Standby Model'', allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human‑like multitasking capabilities. We further propose a ``model‑as‑controller'' paradigm, where we fine‑tune the VLM to generate special tokens that serve as direct system‑level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA‑E can reliably handle complex interactive scenarios. Our framework is compatible with various dual‑system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.

PaperID: 853, https://arxiv.org/pdf/2510.21566.pdf GitHub

Authors: Fangwen Wu, Zheng Wu, Jihong Wang, Yunku Chen, Ruiguang Pei, Heyuan Huang, Xin Liao, Xingyu Lou, Huarong Deng, Zhihui Fu, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang, Jun Wang

Title: ColorEcosystem: Powering Personalized, Standardized, and Trustworthy Agentic Service in massive-agent Ecosystem

Abstract:
With the rapid development of (multimodal) large language model‑based agents, the landscape of agentic service management has evolved from single‑agent systems to multi‑agent systems, and now to massive‑agent ecosystems. Current massive‑agent ecosystems face growing challenges, including impersonal service experiences, a lack of standardization, and untrustworthy behavior. To address these issues, we propose ColorEcosystem, a novel blueprint designed to enable personalized, standardized, and trustworthy agentic service at scale. Concretely, ColorEcosystem consists of three key components: agent carrier, agent store, and agent audit. The agent carrier provides personalized service experiences by utilizing user‑specific data and creating a digital twin, while the agent store serves as a centralized, standardized platform for managing diverse agentic services. The agent audit, based on the supervision of developer and user activities, ensures the integrity and credibility of both service providers and users. Through the analysis of challenges, transitional forms, and practical considerations, the ColorEcosystem is poised to power personalized, standardized, and trustworthy agentic service across massive‑agent ecosystems. Meanwhile, we have also implemented part of ColorEcosystem's functionality, and the relevant code is open‑sourced at https://github.com/opas‑lab/color‑ecosystem.

PaperID: 854, https://arxiv.org/pdf/2510.21524.pdf GitHub GitHub

Authors: Ilija Lichkovski, Alexander Müller, Mariam Ibrahim, Tiwai Mhundwa

Title: EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law

Abstract:
Large language models (LLMs) are increasingly deployed as agents in various contexts by providing tools at their disposal. However, LLM agents can exhibit unpredictable behaviors, including taking undesirable and/or unsafe actions. In order to measure the latent propensity of LLM agents for taking illegal actions under an EU legislative context, we introduce EU‑Agent‑Bench, a verifiable human‑curated benchmark that evaluates an agent's alignment with EU legal norms in situations where benign user inputs could lead to unlawful actions. Our benchmark spans scenarios across several categories, including data protection, bias/discrimination, and scientific integrity, with each user request allowing for both compliant and non‑compliant execution of the requested actions. Comparing the model's function calls against a rubric exhaustively supported by citations of the relevant legislature, we evaluate the legal compliance of frontier LLMs, and furthermore investigate the compliance effect of providing the relevant legislative excerpts in the agent's system prompt along with explicit instructions to comply. We release a public preview set for the research community, while holding out a private test set to prevent data contamination in evaluating upcoming models. We encourage future work extending agentic safety benchmarks to different legal jurisdictions and to multi‑turn and multilingual interactions. We release our code on \hrefhttps://github.com/ilijalichkovski/eu‑agent‑benchthis URL.

PaperID: 855, https://arxiv.org/pdf/2510.21324.pdf GitHub GitHub

Authors: Jinhui Lou, Yan Yang, Zhou Yu, Zhenqi Fu, Weidong Han, Qingming Huang, Jun Yu

Title: CXRAgent: Director-Orchestrated Multi-Stage Reasoning for Chest X-Ray Interpretation

Abstract:
Chest X‑ray (CXR) plays a pivotal role in clinical diagnosis, and a variety of task‑specific and foundation models have been developed for automatic CXR interpretation. However, these models often struggle to adapt to new diagnostic tasks and complex reasoning scenarios. Recently, LLM‑based agent models have emerged as a promising paradigm for CXR analysis, enhancing model's capability through tool coordination, multi‑step reasoning, and team collaboration, etc. However, existing agents often rely on a single diagnostic pipeline and lack mechanisms for assessing tools' reliability, limiting their adaptability and credibility. To this end, we propose CXRAgent, a director‑orchestrated, multi‑stage agent for CXR interpretation, where a central director coordinates the following stages: (1) Tool Invocation: The agent strategically orchestrates a set of CXR‑analysis tools, with outputs normalized and verified by the Evidence‑driven Validator (EDV), which grounds diagnostic outputs with visual evidence to support reliable downstream diagnosis; (2) Diagnostic Planning: Guided by task requirements and intermediate findings, the agent formulates a targeted diagnostic plan. It then assembles an expert team accordingly, defining member roles and coordinating their interactions to enable adaptive and collaborative reasoning; (3) Collaborative Decision‑making: The agent integrates insights from the expert team with accumulated contextual memories, synthesizing them into an evidence‑backed diagnostic conclusion. Experiments on various CXR interpretation tasks show that CXRAgent delivers strong performance, providing visual evidence and generalizes well to clinical tasks of different complexity. Code and data are valuable at this \hrefhttps://github.com/laojiahuo2003/CXRAgent/link.

PaperID: 856, https://arxiv.org/pdf/2510.20844.pdf GitHub

Authors: Jiawei Zhou, Ruicheng Zhu, Mengshi Chen, Jianwei Wang, Kai Wang

Title: TrustResearcher: Automating Knowledge-Grounded and Transparent Research Ideation with Multi-Agent Collaboration

Abstract:
Agentic systems have recently emerged as a promising tool to automate literature‑based ideation. However, current systems often remain black‑box, with limited transparency or control for researchers. Our work introduces TrustResearcher, a multi‑agent demo system for knowledge‑grounded and transparent ideation. Specifically, TrustResearcher integrates meticulously designed four stages into a unified framework: (A) Structured Knowledge Curation, (B) Diversified Idea Generation, (C) Multi‑stage Idea Selection, and (D) Expert Panel Review and Synthesis. Different from prior pipelines, our system not only exposes intermediate reasoning states, execution logs, and configurable agents for inspections, but also enables diverse and evidence‑aligned idea generation. Our design is also domain‑agnostic, where the same pipeline can be instantiated in any scientific field. As an illustrative case, we demonstrate TrustResearcher on a graph‑mining scenario (k‑truss breaking problem), where it generates distinct, plausible candidates with evidence and critiques. A live demo and source code are available at https://github.com/valleysprings/TrustResearcher

PaperID: 857, https://arxiv.org/pdf/2510.20519.pdf GitHub

Authors: Xiaohan Lan, Fanfan Liu, Haibo Qiu, Siqi Yang, Delian Ruan, Peng Shi, Lin Ma

Title: Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

Abstract:
Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem‑solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis‑HOME: a Hybrid Optimized Mixture‑of‑Experts framework designed to address this trade‑off. Metis‑HOME enables a ''Hybrid Thinking'' paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi‑step reasoning, and a non‑thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis‑HOME by adapting the Qwen2.5‑VL‑7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning‑specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning‑vs‑generalization dilemma. Code and weights are available at https://github.com/MM‑Thinking/Metis‑HOME.

PaperID: 858, https://arxiv.org/pdf/2510.20270.pdf GitHub

Authors: Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini

Title: ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

Abstract:
The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real‑world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE‑bench by introducing direct conflicts between the natural‑language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification‑violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine‑grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety‑research/impossiblebench.

PaperID: 859, https://arxiv.org/pdf/2510.19618.pdf GitHub

Authors: Junfei Zhou, Penglin Dai, Quanmin Wei, Bingyi Liu, Xiao Wu, Jianping Wang

Title: Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism

Abstract:
Multi‑agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real‑world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi‑agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial‑Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V‑H, DAIR‑V2X and V2X‑Real datasets demonstrate that GenComm outperforms existing state‑of‑the‑art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.

PaperID: 860, https://arxiv.org/pdf/2510.19600.pdf GitHub GitHub

Authors: Qianli Ma, Siyu Wang, Yilin Chen, Yinhao Tang, Yixiang Yang, Chang Guo, Bingjie Gao, Zhening Xing, Yanan Sun, Zhipeng Zhang

Title: Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1

Abstract:
In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce AutoPage, a novel multi‑agent system that embodies this philosophy. AutoPage deconstructs paper‑to‑page creation into a coarse‑to‑fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated "Checker" agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author's vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct PageBench, the first benchmark for this new task. Experiments show AutoPage not only generates high‑quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than \0.1. Code and dataset will be released at \hrefhttps://mqleet.github.io/AutoPage_ProjectPage/Webpage.

PaperID: 861, https://arxiv.org/pdf/2510.19420.pdf GitHub

Authors: Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun

Title: Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

Abstract:
Multi‑Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi‑agent design in MAS introduces unique trustworthiness concerns: adversarial agents can inject misleading information that propagates contagiously through the system, corrupting benign agents and leading to false outputs. Existing graph‑based defenses model agents as nodes and communications as edges, yet are limited to static‑graph defenses. In this paper, we propose a dynamic defense paradigm that models MAS communication as a signed directed acyclic graph and computes each agent's contribution to the final decision via backward propagation, enabling accurate identification and isolation of malicious agents to secure multi‑agent task collaboration. Experimental results in complex and dynamic MAS environments demonstrate that our method notably outperforms existing MAS defense mechanisms, providing an effective guardrail for trustworthy MAS deployment. Our code is available at https://github.com/ChengcanWu/BPD.

PaperID: 862, https://arxiv.org/pdf/2510.19400.pdf GitHub

Authors: Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

Title: Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Abstract:
Vision‑language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision‑Language‑Action (VLA) models. Yet most evaluations of VLMs focus on single‑view settings, leaving their ability to integrate multi‑view information underexplored. At the same time, multi‑camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi‑view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV‑RoboBench, a benchmark specifically designed to evaluate the multi‑view spatial reasoning capabilities of VLMs in robotic manipulation. MV‑RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open‑source and closed‑source models, along with enhanced versions incorporating CoT‑inspired techniques. The results show that state‑of‑the‑art models remain far below human performance, underscoring the substantial challenges VLMs face in multi‑view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi‑view robotic scenarios; and (ii) strong performance on existing general‑purpose single‑view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV‑RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi‑view embodied reasoning.

PaperID: 863, https://arxiv.org/pdf/2510.19274.pdf GitHub

Authors: Saurabh Chauhan, Zeeshan Rasheed, Malik Abdul Sami, Kai-Kristian Kemell, Muhammad Waseem, Zheying Zhang, Jussi Rasku, Mika Saari, Pekka Abrahamsson

Title: From Specification to Service: Accelerating API-First Development Using Multi-Agent Systems

Abstract:
This paper presents a system that uses Large Language Models (LLMs)‑based agents to automate the API‑first development of RESTful microservices. This system helps to create an OpenAPI specification, generate server code from it, and refine the code through a feedback loop that analyzes execution logs and error messages. The integration of log analysis enables the LLM to detect and address issues efficiently, reducing the number of iterations required to produce functional and robust services. This study's main goal is to advance API‑first development automation for RESTful web services and test the capability of LLM‑based multi‑agent systems in supporting the API‑first development approach. To test the proposed system's potential, we utilized the PRAB benchmark. The results indicate that if we keep the OpenAPI specification small and focused, LLMs are capable of generating complete functional code with business logic that aligns to the specification. The code for the system is publicly available at https://github.com/sirbh/code‑gen

PaperID: 864, https://arxiv.org/pdf/2510.19207.pdf GitHub

Authors: Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner

Title: Defending Against Prompt Injection with DataFilter

Abstract:
When large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine‑tuning), incur substantial utility loss (detection‑based), or demand non‑trivial system redesign (system‑level). Motivated by this, we propose DataFilter, a test‑time model‑agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine‑tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while preserving benign information. Across multiple benchmarks, DataFilter consistently reduces the prompt injection attack success rates to near zero while maintaining the LLMs' utility. DataFilter delivers strong security, high utility, and plug‑and‑play deployment, making it a strong practical defense to secure black‑box commercial LLMs against prompt injection. Our DataFilter model is released at https://huggingface.co/JoyYizhu/DataFilter for immediate use, with the code to reproduce our results at https://github.com/yizhu‑joy/DataFilter.

PaperID: 865, https://arxiv.org/pdf/2510.19150.pdf GitHub

Authors: Yunzhe Wang, Soham Hans, Volkan Ustun

Title: X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning

Abstract:
Human team tactics emerge from each player's individual perspective and their ability to anticipate, interpret, and adapt to teammates' intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third‑person broadcast views and overlooks the synchronous, egocentric nature of multi‑agent learning. We introduce X‑Ego‑CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional‑level matches of the popular e‑sports game Counter‑Strike 2, designed to facilitate research on multi‑agent decision‑making in complex 3D environments. X‑Ego‑CS provides cross‑egocentric video streams that synchronously capture all players' first‑person perspectives along with state‑action trajectories. Building on this resource, we propose Cross‑Ego Contrastive Learning (CECL), which aligns teammates' egocentric visual streams to foster team‑level tactical situational awareness from an individual's perspective. We evaluate CECL on a teammate‑opponent location prediction task, demonstrating its effectiveness in enhancing an agent's ability to infer both teammate and opponent positions from a single first‑person view using state‑of‑the‑art video encoders. Together, X‑Ego‑CS and CECL establish a foundation for cross‑egocentric multi‑agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi‑agent modeling and tactical learning, with implications for spatiotemporal reasoning and human‑AI teaming in both virtual and real‑world domains. Code and dataset are available at https://github.com/HATS‑ICT/x‑ego.

PaperID: 866, https://arxiv.org/pdf/2510.19008.pdf GitHub

Authors: Joydeep Chandra, Satyam Kumar Navneet

Title: Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces

Abstract:
Domestic AI agents faces ethical, autonomy, and inclusion challenges, particularly for overlooked groups like children, elderly, and Neurodivergent users. We present the Plural Voices Model (PVM), a novel single‑agent framework that dynamically negotiates multi‑user needs through real‑time value alignment, leveraging diverse public datasets on mental health, eldercare, education, and moral reasoning. Using human+synthetic curriculum design with fairness‑aware scenarios and ethical enhancements, PVM identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Our privacy‑focused prototype features adaptive safety scaffolds, tailored interactions (e.g., step‑by‑step guidance for Neurodivergent users, simple wording for children), and equitable conflict resolution. In preliminary evaluations, PVM outperforms multi‑agent baselines in compliance (76% vs. 70%), fairness (90% vs. 85%), safety‑violation rate (0% vs. 7%), and latency. Design innovations, including video guidance, autonomy sliders, family hubs, and adaptive safety dashboards, demonstrate new directions for ethical and inclusive domestic AI, for building user‑centered agentic systems in plural domestic contexts. Our Codes and Model are been open sourced, available for reproduction: https://github.com/zade90/Agora

PaperID: 867, https://arxiv.org/pdf/2510.18821.pdf GitHub

Authors: Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

Title: Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well‑crafted task queries and corresponding ground‑truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self‑play training for deep search agents, in which the learning LLM utilizes multi‑turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well‑defined ground‑truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval‑augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self‑play (SSP) game, the proposer and the solver co‑evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from‑scratch and continuous RL training setups. The code is at https://github.com/Qwen‑Applications/SSP.

PaperID: 868, https://arxiv.org/pdf/2510.18802.pdf GitHub

Authors: Vik Pant, Eric Yu

Title: Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity

Abstract:
Coopetition refers to simultaneous cooperation and competition among actors who "cooperate to grow the pie and compete to split it up." Modern socio‑technical systems are characterized by strategic coopetition in which actors concomitantly cooperate to create value and compete to capture it. While conceptual modeling languages such as i provide rich qualitative representations of strategic dependencies, they lack mechanisms for quantitative analysis of dynamic trade‑offs. Conversely, classical game theory offers mathematical rigor but strips away contextual richness. This technical report bridges this gap by developing computational foundations that formalize two critical dimensions of coopetition: interdependence and complementarity. We ground interdependence in i structural dependency analysis, translating depender‑dependee‑dependum relationships into quantitative interdependence coefficients through a structured translation framework. We formalize complementarity following Brandenburger and Nalebuff's Added Value concept, modeling synergistic value creation with validated parameterization. We integrate structural dependencies with bargaining power in value appropriation and introduce a game‑theoretic formulation where Nash Equilibrium incorporates structural interdependence. Validation combines comprehensive experimental testing comprising over 22,000 trials across power and logarithmic value function specifications, demonstrating functional form robustness, with empirical application to the Samsung‑Sony S‑LCD joint venture (2004‑2011). This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in multi‑agent systems, with companion work addressing trust dynamics, collective action, and reciprocity mechanisms.

PaperID: 869, https://arxiv.org/pdf/2510.18559.pdf GitHub

Authors: Loc Phuc Truong Nguyen, Hung Thanh Do

Title: RAISE: A Unified Framework for Responsible AI Scoring and Evaluation

Abstract:
As AI systems enter high‑stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade‑offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi‑dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise‑framework/raise.

PaperID: 870, https://arxiv.org/pdf/2510.18491.pdf GitHub

Authors: Lianchen Jia, Chaoyang Li, Qian Houde, Tianchi Huang, Jiangchuan Liu, Lifeng Sun

Title: Crucible: Quantifying the Potential of Control Algorithms through LLM Agents

Abstract:
Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce Crucible, an agent that employs an LLM‑driven, multi‑level expert simulation to turn algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate Crucible's effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real‑world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at https://github.com/thu‑media/Crucible.

PaperID: 871, https://arxiv.org/pdf/2510.18314.pdf GitHub

Authors: Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, Hao Wang

Title: Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Abstract:
As large language model (LLM) agents increasingly automate complex web tasks, they boost productivity while simultaneously introducing new security risks. However, relevant studies on web agent attacks remain limited. Existing red‑teaming approaches mainly rely on manually crafted attack strategies or static models trained offline. Such methods fail to capture the underlying behavioral patterns of web agents, making it difficult to generalize across diverse environments. In web agent attacks, success requires the continuous discovery and evolution of attack strategies. To this end, we propose Genesis, a novel agentic framework composed of three modules: Attacker, Scorer, and Strategist. The Attacker generates adversarial injections by integrating the genetic algorithm with a hybrid strategy representation. The Scorer evaluates the target web agent's responses to provide feedback. The Strategist dynamically uncovers effective strategies from interaction logs and compiles them into a continuously growing strategy library, which is then re‑deployed to enhance the Attacker's effectiveness. Extensive experiments across various web tasks show that our framework discovers novel strategies and consistently outperforms existing attack baselines. Our code is available at https://github.com/CjangCjengh/web_agent_attack.

PaperID: 872, https://arxiv.org/pdf/2510.18221.pdf GitHub GitHub

Authors: Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley, Aaron Walsman

Title: The Emergence of Complex Behavior in Large-Scale Ecological Environments

Abstract:
We explore how physical scale and population size shape the emergence of complex behaviors in open‑ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high‑performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. We use modern hardware along with a new multi‑agent simulator to scale the environment and population to sizes much larger than previously attempted, reaching populations of over 60,000 agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long‑range resource extraction, vision‑based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors and find that some of them appear only in sufficiently large environments and populations, and that larger scales increase the stability and consistency of these emergent behaviors. While there is a rich history of research in evolutionary settings, our scaling results on modern hardware provide promising new directions to explore ecology as an instrument of machine learning in an era of increasingly abundant computational resources and efficient machine frameworks. Experimental code is available at https://github.com/jbejjani2022/ecological‑emergent‑behavior.

PaperID: 873, https://arxiv.org/pdf/2510.18179.pdf GitHub

Authors: Rui Jerry Huang, Wendy Liu, Anastasia Miin, Lei Ding

Title: Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning

Abstract:
Inference‑time computation is a critical yet challenging paradigm for enhancing the reasoning performance of large language models (LLMs). While existing strategies improve reasoning stability and consistency, they suffer from notable limitations: self‑correction often reinforces the model's initial biases, and Multi‑Agent Collaboration (MAC) often fails due to the lack of efficient coordination mechanisms, leading to collective errors. Although high‑performing verifiers can detect reasoning errors, making them reliable requires substantial training. To address these challenges, we introduce a novel inference‑time framework, Adaptive Coopetition (AdCo), in which LLM agents utilize an adaptive, UCB‑based "coopetition" mechanism. At each round, agents leverage coarse verifier signals to determine whether to collaborate or compete, and iteratively refine their reasoning based on peer feedback. Without relying on high‑performance verifiers, our adaptive strategy achieves significant performance gains on mathematical reasoning benchmarks, yielding a 20% relative improvement over baselines on the more challenging dataset. Our approach remains robust and consistent in terms of accuracy under different sample sizes and configurations. This adaptive, signal‑guided "coopetition" framework enhances reasoning robustness by leveraging both model knowledge diversity and reasoning trace measures, while also promoting uncertainty‑driven exploration, especially when participants have comparable capabilities. From this perspective, our work offers a fresh lens on inference‑time computation and paves the way for more resilient multi‑agent LLM systems. Our code is available at: https://github.com/AdCo‑Research/adaptive‑coopetition.

PaperID: 874, https://arxiv.org/pdf/2510.17800.pdf GitHub

Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

Title: Glyph: Scaling Context Windows via Visual-Text Compression

Abstract:
Large language models (LLMs) increasingly rely on long‑context modeling for tasks such as document understanding, code analysis, and multi‑step reasoning. However, scaling context windows to the million‑token level brings prohibitive computational and memory costs, limiting the practicality of long‑context LLMs. In this work, we take a different perspective‑visual context scaling‑to tackle this challenge. Instead of extending token‑based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision‑language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM‑driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3‑4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3‑8B on various long‑context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K‑context VLM could scale to handle 1M‑token‑level text tasks. In addition, the rendered text data benefits real‑world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu‑coai/Glyph.

PaperID: 875, https://arxiv.org/pdf/2510.17797.pdf GitHub

Authors: Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, Weiran Yao

Title: Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Abstract:
As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain‑specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi‑agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP‑based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data‑driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human‑in‑the‑loop steering guidance. These components enable automated report generation, real‑time streaming, and seamless enterprise deployment, as validated on internal datasets. On open‑ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state‑of‑the‑art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi‑agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise‑deep‑research and Dataset at https://huggingface.co/datasets/Salesforce/EDR‑200

PaperID: 876, https://arxiv.org/pdf/2510.17795.pdf GitHub

Authors: Yujie Luo, Zhuoyun Yu, Xuehai Wang, Yuqi Zhu, Ningyu Zhang, Lanning Wei, Lun Du, Da Zheng, Huajun Chen

Title: What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations

Abstract:
Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval‑augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation‑level code signals and lack structured knowledge representations that support multi‑granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a pluggable, paper‑centric knowledge base that automatically integrates code snippets and technical insights extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3‑mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code is available at https://github.com/zjunlp/xKG.

PaperID: 877, https://arxiv.org/pdf/2510.16097.pdf GitHub

Authors: Eleni Straitouri, Stratis Tsirtsis, Ander Artola Velasco, Manuel Gomez-Rodriguez

Title: Narrowing Action Choices with AI Improves Human Sequential Decisions

Abstract:
Recent work has shown that, in classification tasks, it is possible to design decision support systems that do not require human experts to understand when to cede agency to a classifier or when to exercise their own agency to achieve complementarity\unicodex2014experts using these systems make more accurate predictions than those made by the experts or the classifier alone. The key principle underpinning these systems reduces to adaptively controlling the level of human agency, by design. Can we use the same principle to achieve complementarity in sequential decision making tasks? In this paper, we answer this question affirmatively. We develop a decision support system that uses a pre‑trained AI agent to narrow down the set of actions a human can take to a subset, and then asks the human to take an action from this action set. Along the way, we also introduce a bandit algorithm that leverages the smoothness properties of the action sets provided by our system to efficiently optimize the level of human agency. To evaluate our decision support system, we conduct a large‑scale human subject study (n = 1,600) where participants play a wildfire mitigation game. We find that participants who play the game supported by our system outperform those who play on their own by ～30% and the AI agent used by our system by >2%, even though the AI agent largely outperforms participants playing without support. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks‑Learning/narrowing‑action‑choices .

PaperID: 878, https://arxiv.org/pdf/2510.16080.pdf GitHub

Authors: Kerem Delikoyun, Qianyu Chen, Win Sen Kuan, John Tshon Yit Soong, Matthew Edward Cove, Oliver Hayden

Title: TriAgent: Automated Biomarker Discovery with Deep Research Grounding for Triage in Acute Care by LLM-Based Multi-Agent Collaboration

Abstract:
Emergency departments worldwide face rising patient volumes, workforce shortages, and variability in triage decisions that threaten the delivery of timely and accurate care. Current triage methods rely primarily on vital signs, routine laboratory values, and clinicians' judgment, which, while effective, often miss emerging biological signals that could improve risk prediction for infection typing or antibiotic administration in acute conditions. To address this challenge, we introduce TriAgent, a large language model (LLM)‑based multi‑agent framework that couples automated biomarker discovery with deep research for literature‑grounded validation and novelty assessment. TriAgent employs a supervisor research agent to generate research topics and delegate targeted queries to specialized sub‑agents for evidence retrieval from various data sources. Findings are synthesized to classify biomarkers as either grounded in existing knowledge or flagged as novel candidates, offering transparent justification and highlighting unexplored pathways in acute care risk stratification. Unlike prior frameworks limited to existing routine clinical biomarkers, TriAgent aims to deliver an end‑to‑end framework from data analysis to literature grounding to improve transparency, explainability and expand the frontier of potentially actionable clinical biomarkers. Given a user's clinical query and quantitative triage data, TriAgent achieved a topic adherence F1 score of 55.7 +/‑ 5.0%, surpassing the CoT‑ReAct agent by over 10%, and a faithfulness score of 0.42 +/‑ 0.39, exceeding all baselines by more than 50%. Across experiments, TriAgent consistently outperformed state‑of‑the‑art LLM‑based agentic frameworks in biomarker justification and literature‑grounded novelty assessment. We share our repo: https://github.com/CellFace/TriAgent.

PaperID: 879, https://arxiv.org/pdf/2510.16079.pdf GitHub

Authors: Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi

Title: EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Abstract:
Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem‑solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self‑improve through a complete, closed‑loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self‑Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision‑making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi‑hop question‑answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.

PaperID: 880, https://arxiv.org/pdf/2510.16054.pdf GitHub

Authors: Zheng Hui, Yijiang River Dong, Sanhanat Sivapiromrat, Ehsan Shareghi, Nigel Collier

Title: Privacy-R1: Privacy-Aware Multi-LLM Agent Collaboration via Reinforcement Learning

Abstract:
When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: Send the query to a powerful proprietary LLM providers to achieving state‑of‑the‑art performance and risk data exposure, or relying on smaller, local models guarantees data privacy but often results in a degradation of task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy‑sensitive information, including task‑critical content. We reformulate this challenge (Privacy‑Conscious Delegation) as a sequential decision‑making problem and introduce a novel reinforcement learning (RL) framework called Privacy‑R1 to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade‑off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task‑critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state‑of‑the‑art on the privacy‑utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments. Dataset can be found at: https://github.com/zackhuiiiii/Privacy‑R1.

PaperID: 881, https://arxiv.org/pdf/2510.15994.pdf GitHub

Authors: Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li, Wenjun Xu

Title: MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

Abstract:
The Model Context Protocol (MCP) standardizes how large language model (LLM) agents discover, describe, and call external tools. While MCP unlocks broad interoperability, it also enlarges the attack surface by making tools first‑class, composable objects with natural‑language metadata, and standardized I/O. We present MSB (MCP Security Benchmark), the first end‑to‑end evaluation suite that systematically measures how well LLM agents resist MCP‑specific attacks throughout the full tool‑use pipeline: task planning, tool invocation, and response handling. MSB contributes: (1) a taxonomy of 12 attacks including name‑collision, preference manipulation, prompt injections embedded in tool descriptions, out‑of‑scope parameter requests, user‑impersonating responses, false‑error escalation, tool‑transfer, retrieval injection, and mixed attacks; (2) an evaluation harness that executes attacks by running real tools (both benign and malicious) via MCP rather than simulation; and (3) a robustness metric that quantifies the trade‑off between security and performance: Net Resilient Performance (NRP). We evaluate nine popular LLM agents across 10 domains and 405 tools, producing 2,000 attack instances. Results reveal the effectiveness of attacks against each stage of MCP. Models with stronger performance are more vulnerable to attacks due to their outstanding tool calling and instruction following capabilities. MSB provides a practical baseline for researchers and practitioners to study, compare, and harden MCP agents. Code: https://github.com/dongsenzhang/MSB

PaperID: 882, https://arxiv.org/pdf/2510.15963.pdf GitHub GitHub

Authors: Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik

Title: ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Abstract:
Multi‑modal large language models (MLLMs) are making rapid progress toward general‑purpose embodied agents. However, existing MLLMs do not reliably capture fine‑grained links between low‑level visual features and high‑level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial‑temporal scene graphs. At its core is SGCLIP, a novel, open‑domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open‑domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human‑labeled annotations. We demonstrate that SGCLIP excels in both prompt‑based inference and task‑specific fine‑tuning, achieving state‑of‑the‑art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open‑source and commercial MLLMs, achieving state of‑the‑art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open‑source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video‑fm/LASER and for the embodied agent at https://github.com/video‑fm/ESCA.

PaperID: 883, https://arxiv.org/pdf/2510.15742.pdf GitHub GitHub

Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen

Title: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Abstract:
Instruction‑based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large‑scale, high‑quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in‑context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost‑quality trade‑off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU‑days to build Ditto‑1M, a new dataset of one million high‑fidelity video editing examples. We trained our model, Editto, on Ditto‑1M with a curriculum learning strategy. The results demonstrate superior instruction‑following ability and establish a new state‑of‑the‑art in instruction‑based video editing.

PaperID: 884, https://arxiv.org/pdf/2510.15683.pdf GitHub GitHub

Authors: Effrosyni Sokli, Pranav Kasela, Georgios Peikos, Gabriella Pasi

Title: Mixture of Experts Approaches in Dense Retrieval Tasks

Abstract:
Dense Retrieval Models (DRMs) are a prominent development in Information Retrieval (IR). A key challenge with these neural Transformer‑based models is that they often struggle to generalize beyond the specific tasks and domains they were trained on. To address this challenge, prior research in IR incorporated the Mixture‑of‑Experts (MoE) framework within each Transformer layer of a DRM, which, though effective, substantially increased the number of additional parameters. In this paper, we propose a more efficient design, which introduces a single MoE block (SB‑MoE) after the final Transformer layer. To assess the retrieval effectiveness of SB‑MoE, we perform an empirical evaluation across three IR tasks. Our experiments involve two evaluation setups, aiming to assess both in‑domain effectiveness and the model's zero‑shot generalizability. In the first setup, we fine‑tune SB‑MoE with four different underlying DRMs on seven IR benchmarks and evaluate them on their respective test sets. In the second setup, we fine‑tune SB‑MoE on MSMARCO and perform zero‑shot evaluation on thirteen BEIR datasets. Additionally, we perform further experiments to analyze the model's dependency on its hyperparameters (i.e., the number of employed and activated experts) and investigate how this variation affects SB‑MoE's performance. The obtained results show that SB‑MoE is particularly effective for DRMs with lightweight base models, such as TinyBERT and BERT‑Small, consistently exceeding standard model fine‑tuning across benchmarks. For DRMs with more parameters, such as BERT‑Base and Contriever, our model requires a larger number of training samples to achieve improved retrieval performance. Our code is available online at: https://github.com/FaySokli/SB‑MoE.

PaperID: 885, https://arxiv.org/pdf/2510.15191.pdf GitHub

Authors: Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin, Jiawei Han, Qingkai Zeng

Title: Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Abstract:
Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval‑Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose \textscStructure‑R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, \textscStructure‑R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi‑step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task‑specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self‑reward structural verification mechanism that checks whether the generated structures are both correct and self‑contained. Extensive experiments on seven knowledge‑intensive benchmarks show that \textscStructure‑R1 consistently achieves competitive performance with a 7B‑scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.

PaperID: 886, https://arxiv.org/pdf/2510.14621.pdf GitHub

Authors: Yuanyi Song, Heyuan Huang, Qiqiang Lin, Yin Zhao, Xiangmou Qu, Jun Wang, Xingyu Lou, Weiwen Liu, Zhuosheng Zhang, Jun Wang, Yong Yu, Weinan Zhang, Zhaoxiang Wang

Title: ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

Abstract:
The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real‑world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non‑reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph‑structured benchmarking framework. By modeling the finite states observed during real‑device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long‑horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic‑level capability analysis. ColorBench contains 175 tasks (74 single‑app, 101 cross‑app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi‑dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long‑horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.

PaperID: 887, https://arxiv.org/pdf/2510.14509.pdf GitHub

Authors: Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng

Title: E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

Abstract:
The rapid advancement in large language models (LLMs) has demonstrated significant potential in End‑to‑End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse‑grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior‑Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine‑grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human‑in‑the‑Loop Multi‑Agent Annotation Framework (HITL‑MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost‑efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.

PaperID: 888, https://arxiv.org/pdf/2510.14264.pdf GitHub

Authors: Zheye Deng, Weixiang Yan, Changlong Yu, Jiashu Wang

Title: AlphaQuanter: An End-to-End Tool-Augmented Agentic Reinforcement Learning Framework for Stock Trading

Abstract:
While Large Language Model (LLM) agents show promise in automated trading, they still face critical limitations. Prominent multi‑agent frameworks often suffer from inefficiency, produce inconsistent signals, and lack the end‑to‑end optimization required to learn a coherent strategy from market feedback. To address this, we introduce AlphaQuanter, a single‑agent framework that uses reinforcement learning (RL) to learn a dynamic policy over a transparent, tool‑augmented decision workflow, which empowers a single agent to autonomously orchestrate tools and proactively acquire information on demand, establishing a transparent reasoning process. Extensive experiments demonstrate that AlphaQuanter achieves state‑of‑the‑art performance on key financial metrics. Moreover, its interpretable reasoning reveals sophisticated strategies, offering novel and valuable insights for human traders. Our code and data can be found at https://github.com/horizon‑llm/AlphaQuanter.

PaperID: 889, https://arxiv.org/pdf/2510.14240.pdf GitHub

Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty

Title: LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

Abstract:
Deep research ‑‑ producing comprehensive, citation‑grounded reports by searching and synthesizing information from hundreds of live web sources ‑‑ marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user‑centric, reflecting realistic information needs, (2) dynamic, requiring up‑to‑date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi‑faceted and search‑intensive, requiring search over numerous web sources and in‑depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert‑curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real‑time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation‑grounded long‑form reports, we introduce DeepEval, a comprehensive suite covering both content‑ and report‑level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single‑agent web search, single‑agent deep research, and multi‑agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research. Our code is available at: https://github.com/SalesforceAIResearch/LiveResearchBench.

PaperID: 890, https://arxiv.org/pdf/2510.14063.pdf GitHub

Authors: Nan Li, Jiming Ren, Haris Miller, Samuel Coogan, Karen M. Feigh, Ye Zhao

Title: Adaptive Obstacle-Aware Task Assignment and Planning for Heterogeneous Robot Teaming

Abstract:
Multi‑Agent Task Assignment and Planning (MATP) has attracted growing attention but remains challenging in terms of scalability, spatial reasoning, and adaptability in obstacle‑rich environments. To address these challenges, we propose OATH ‑ Adaptive Obstacle‑Aware Task Assignment and Planning for Heterogeneous Robot Teaming ‑ which advances MATP by introducing a novel obstacle‑aware strategy for task assignment. First, we develop an adaptive Halton sequence map, the first known application of Halton sampling with obstacle‑aware adaptation in MATP, which adjusts sampling density based on obstacle distribution. Second, we propose a cluster‑auction‑selection framework that integrates obstacle‑aware clustering with weighted auctions and intra‑cluster task selection. These mechanisms jointly enable effective coordination among heterogeneous robots while maintaining scalability and suboptimal allocation performance. In addition, our framework leverages an LLM to interpret human instructions and directly guide the planner in real time. We validate OATH in both NVIDIA Isaac Sim and real‑world hardware experiments using TurtleBot platforms, demonstrating substantial improvements in task assignment quality, scalability, adaptability to dynamic changes, and overall execution performance compared to state‑of‑the‑art MATP baselines. A project website is available at https://llm‑oath.github.io/.

PaperID: 891, https://arxiv.org/pdf/2510.14049.pdf GitHub

Authors: Guangyi Chen, Yunlong Deng, Peiyuan Zhu, Yan Li, Yifan Shen, Zijian Li, Kun Zhang

Title: CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

Abstract:
Causal Representation Learning (CRL) aims to uncover the data‑generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground‑truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real‑world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high‑fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground‑truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub‑scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi‑agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real‑world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page:https://causal‑verse.github.io/, Dataset:https://huggingface.co/CausalVerse.

PaperID: 892, https://arxiv.org/pdf/2510.13561.pdf GitHub

Authors: Peng Di, Faqiang Chen, Xiao Bai, Hongjun Yang, Qingfeng Li, Ganglin Wei, Jian Mou, Feng Shi, Keting Chen, Peng Tang, Zhitao Shen, Zheng Li, Wenhui Shi, Junwei Guo, Hang Yu

Title: OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

Abstract:
The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI‑driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general‑purpose multi‑agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open‑source multi‑agent framework architected for SRE. OpenDerisk integrates a diagnostic‑native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi‑domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state‑of‑the‑art baselines in both accuracy and efficiency. This effectiveness is validated by its large‑scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial‑grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk‑ai/OpenDerisk/

PaperID: 893, https://arxiv.org/pdf/2510.12872.pdf GitHub

Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen

Title: KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Abstract:
Multi‑agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context‑including prior turns‑must be reprocessed from scratch, leading to inefficient processing. While key‑value (KV) caching is an effective solution for avoiding redundant computation in single‑agent settings where prefixes remain unchanged, it cannot be directly reused in multi‑agent scenarios due to diverging prefixes introduced by agent‑specific context extensions. We identify that the core challenge lies in the offset variance of KV‑caches across agents. To address this, we propose KVCOMM, a training‑free framework that enables efficient prefilling in multi‑agent inference by reusing KV‑caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV‑caches for shared content by referencing a pool of cached examples‑termed anchors‑that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi‑agent workloads, including retrieval‑augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully‑connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five‑agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.

PaperID: 894, https://arxiv.org/pdf/2510.12253.pdf GitHub

Authors: Changfu Xu, Jianxiong Guo, Yuzhu Liang, Haiyang Huang, Haodong Zou, Xi Zheng, Shui Yu, Xiaowen Chu, Jiannong Cao, Tian Wang

Title: Diffusion Models for Reinforcement Learning: Foundations, Taxonomy, and Development

Abstract:
Diffusion Models (DMs), as a leading class of generative models, offer key advantages for reinforcement learning (RL), including multi‑modal expressiveness, stable training, and trajectory‑level planning. This survey delivers a comprehensive and up‑to‑date synthesis of diffusion‑based RL. We first provide an overview of RL, highlighting its challenges, and then introduce the fundamental concepts of DMs, investigating how they are integrated into RL frameworks to address key challenges in this research field. We establish a dual‑axis taxonomy that organizes the field along two orthogonal dimensions: a function‑oriented taxonomy that clarifies the roles DMs play within the RL pipeline, and a technique‑oriented taxonomy that situates implementations across online versus offline learning regimes. We also provide a comprehensive examination of this progression from single‑agent to multi‑agent domains, thereby forming several frameworks for DM‑RL integration and highlighting their practical utility. Furthermore, we outline several categories of successful applications of diffusion‑based RL across diverse domains, discuss open research issues of current methodologies, and highlight key directions for future research to advance the field. Finally, we summarize the survey to identify promising future development directions. We are actively maintaining a GitHub repository (https://github.com/ChangfuXu/D4RL‑FTD) for papers and other related resources to apply DMs for RL.

PaperID: 895, https://arxiv.org/pdf/2510.11824.pdf GitHub

Authors: Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu

Title: Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Abstract:
In cooperative Multi‑Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real‑world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions‑‑a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large‑scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real‑world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA‑TrustworthyMARL/adv_marl_benchmark .

PaperID: 896, https://arxiv.org/pdf/2510.11608.pdf GitHub

Authors: Shiqi Zhang, Xinbei Ma, Yunqing Xu, Zouying Cao, Pengrui Lu, Haobo Yuan, Tiancheng Shen, Zhuosheng Zhang, Hai Zhao, Ming-Hsuan Yang

Title: ParaCook: On Time-Efficient Planning for Multi-Agent Systems

Abstract:
Large Language Models (LLMs) exhibit strong reasoning abilities for planning long‑horizon, real‑world tasks, yet existing agent benchmarks focus on task completion while neglecting time efficiency in parallel and asynchronous operations. To address this, we present ParaCook, a benchmark for time‑efficient collaborative planning. Inspired by the Overcooked game, ParaCook provides an environment for various challenging interaction planning of multi‑agent systems that are instantiated as cooking tasks, with a simplified action space to isolate the core challenge of strategic parallel planning. Through a comprehensive evaluation of state‑of‑the‑art LLMs, we find that current approaches achieve suboptimal plans, which struggle with parallel actions or coordination. Our analysis also reveals LLMs' potential on abstract tasks where they can focus on high‑level parallel optimization. ParaCook provides a scalable evaluation framework with adjustable complexity, establishing a foundation for developing and assessing time efficiency‑aware multi‑agent planning. The code and data are available at https://github.com/zsq259/ParaCook.

PaperID: 897, https://arxiv.org/pdf/2510.11246.pdf GitHub

Authors: Pengyu Zhu, Lijun Li, Yaxing Lyu, Li Sun, Sen Su, Jing Shao

Title: Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems (MAS) demonstrate increasing integration into next‑generation applications, but their safety in backdoor attacks remains largely underexplored. However, existing research has focused exclusively on single‑agent backdoor attacks, overlooking the novel attack surfaces introduced by agent collaboration in MAS. To bridge this gap, we present the first Distributed Backdoor Attack tailored to MAS. We decompose the backdoor into multiple distributed attack primitives that are embedded within MAS tools. These primitives remain dormant individually but collectively activate only when agents collaborate in a specific sequence, thereby assembling the full backdoor to execute targeted attacks such as data exfiltration. To fully assess this threat, we introduce a benchmark for multi‑role collaborative tasks and a sandboxed framework to evaluate. Extensive experiments demonstrate that our attack achieves an attack success rate exceeding 95% without degrading performance on benign tasks. This work exposes novel backdoor attack surfaces that exploit agent collaboration, underscoring the need to move beyond single‑agent protection. Code and benchmark are available at https://github.com/whfeLingYu/Distributed‑Backdoor‑Attacks‑in‑MAS.

PaperID: 898, https://arxiv.org/pdf/2510.11062.pdf GitHub

Authors: Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Title: Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs

Abstract:
Multi‑agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role‑based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO‑style optimization. However, applying on‑policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System‑wise, the training stack must support MAS‑workflow rollouts and on‑policy updates for both single‑policy and multi‑policy models. We propose AT‑GRPO, which includes (i) an agent‑ and turn‑wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single‑ and multi‑policy regimes. Across game, planning, coding, and math tasks, AT‑GRPO delivers substantial gains. On long‑horizon planning, it increases accuracy from a 14.0 to 47.0 percent single‑agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms‑ai/PettingLLMs.

PaperID: 899, https://arxiv.org/pdf/2510.11026.pdf GitHub

Authors: Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

Title: GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Abstract:
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning‑centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR‑Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding‑generation consistency (GIR‑Bench‑UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning‑centric text‑to‑image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR‑Bench‑T2I). Thirdly, we evaluate whether models can handle multi‑step reasoning in editing (GIR‑Bench‑Edit). For each subset, we carefully design different task‑specific evaluation pipelines tailored for each task. This enables fine‑grained and interpretable evaluation while mitigating biases from the prevalent MLLM‑as‑a‑Judge paradigm. Extensive ablations over various unified models and generation‑only systems have shown that: Although unified models are more capable of reasoning‑driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR‑Bench are available at https://github.com/HKUST‑LongGroup/GIR‑Bench.

PaperID: 900, https://arxiv.org/pdf/2510.11004.pdf GitHub

Authors: Haoran Liang, Yufa Zhou, Mohammad Talebi Kalaleh, Qipei Mei

Title: Automating Structural Engineering Workflows with Large Language Model Agents

Abstract:
We introduce MASSE, the first Multi‑Agent System for Structural Engineering, effectively integrating large language model (LLM)‑based agents with real‑world engineering workflows. Structural engineering is a fundamental yet traditionally stagnant domain, with core workflows remaining largely unchanged for decades despite its substantial economic impact and global market size. Recent advancements in LLMs have significantly enhanced their ability to perform complex reasoning, long‑horizon planning, and precise tool utilization ‑‑ capabilities well aligned with structural engineering tasks such as interpreting design codes, executing load calculations, and verifying structural capacities. We present a proof‑of‑concept showing that most real‑world structural engineering workflows can be fully automated through a training‑free LLM‑based multi‑agent system. MASSE enables immediate deployment in professional environments, and our comprehensive validation on real‑world case studies demonstrates that it can reduce expert workload from approximately two hours to mere minutes, while enhancing both reliability and accuracy in practical engineering scenarios.

PaperID: 901, https://arxiv.org/pdf/2510.10880.pdf GitHub

Authors: Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, Cihang Xie, Yuyin Zhou

Title: Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

Abstract:
Vision‑language models (VLMs) have advanced rapidly, yet their capacity for image‑grounded geolocation in open‑world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step‑by‑step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple‑choice question‑answering, with country‑level answer and panoramas) and WhereStreet (i.e., 310 fine‑grained street‑level identification tasks requiring multi‑step reasoning with optional web search). For evaluation, we adopt the final‑prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human‑verified key visual clues and a Shapley‑reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state‑of‑the‑art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini‑2.5‑Pro achieves the best average accuracy at 56.32%, while the strongest open‑weight model, GLM‑4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine‑grained localization. We open‑source our benchmark at https://github.com/UCSC‑VLAA/EarthWhere.

PaperID: 902, https://arxiv.org/pdf/2510.10807.pdf GitHub

Authors: Ali Atiah Alzahrani

Title: Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions

Abstract:
We examine whether regime‑conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative‑to‑decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime‑conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail‑weighted diffusion objective that up‑weights low‑quantile outcomes relevant for drawdowns and a regime‑expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end‑to‑end through the allocator. Under strict walk‑forward on liquid multi‑asset ETFs (2005‑2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020‑2025 out‑of‑sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision‑aware generative modeling in finance.

PaperID: 903, https://arxiv.org/pdf/2510.10454.pdf GitHub

Authors: Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Title: Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Abstract:
Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj‑CoA, a multi‑agent system involving chain‑of‑agents for patient trajectory modeling. Traj‑CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long‑term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents' summary and the extracted timeline in EHRMem to make predictions. In a zero‑shot one‑year lung cancer risk prediction task based on five‑year EHR data, Traj‑CoA outperforms baselines of four categories. Analysis reveals that Traj‑CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories. Implementation of Traj‑CoA is available on https://github.com/zengsihang/Traj‑CoA.

PaperID: 904, https://arxiv.org/pdf/2510.10379.pdf GitHub

Authors: Rohan Gupta, Trevor Asbery, Zain Merchant, Abrar Anwar, Jesse Thomason

Title: RobotFleet: An Open-Source Framework for Centralized Multi-Robot Task Planning

Abstract:
Coordinating heterogeneous robot fleets to achieve multiple goals is challenging in multi‑robot systems. We introduce an open‑source and extensible framework for centralized multi‑robot task planning and scheduling that leverages LLMs to enable fleets of heterogeneous robots to accomplish multiple tasks. RobotFleet provides abstractions for planning, scheduling, and execution across robots deployed as containerized services to simplify fleet scaling and management. The framework maintains a shared declarative world state and two‑way communication for task execution and replanning. By modularizing each layer of the autonomy stack and using LLMs for open‑world reasoning, RobotFleet lowers the barrier to building scalable multi‑robot systems. The code can be found here: https://github.com/therohangupta/robot‑fleet.

PaperID: 905, https://arxiv.org/pdf/2510.10197.pdf GitHub

Authors: Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, Tao Lin

Title: Don't Just Fine-tune the Agent, Tune the Environment

Abstract:
Large Language Model (LLM) agents show great promise for complex, multi‑turn tool‑use tasks, but their development is often hampered by the extreme scarcity of high‑quality training data. Supervised fine‑tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold‑start problem and training instability. To address these challenges, we introduce Environment Tuning, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre‑collected expert trajectories. Environment Tuning orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine‑grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from Berkeley Function‑Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in‑distribution performance against strong baselines but also demonstrates superior out‑of‑distribution generalization, overcoming the performance collapse common to SFT‑based approaches. Our work presents a paradigm shift from supervised fine‑tuning on static trajectories to dynamic, environment‑based exploration, paving the way for training more robust and data‑efficient agents. The code is available at https://github.com/inclusionAI/AWorld‑RL/tree/main/EnvTuning.

PaperID: 906, https://arxiv.org/pdf/2510.10185.pdf GitHub

Authors: Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu, Liantao Ma

Title: Auditing medical multi-agent AI reveals risks of false consensus

Abstract:
Large language models are increasingly being assembled into medical multi‑agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi‑agent systems. From 3,600 execution logs, we derive an expert‑validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision‑making. We then deploy an expert‑validated automated auditor as non‑interventional probes across 14,400 cases, covering six multi‑agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re‑examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self‑contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process‑level safety and accountability, providing a practical foundation for transparent, auditable and clinician‑supervised agentic systems in medicine.

PaperID: 907, https://arxiv.org/pdf/2510.09721.pdf GitHub

Authors: Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

Title: A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

Abstract:
The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule‑based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM‑powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt‑based, fine‑tuning‑based, and agent‑based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi‑agent collaboration, self‑evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM‑driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM‑Agent‑SE‑Survey.

PaperID: 908, https://arxiv.org/pdf/2510.09558.pdf GitHub

Authors: Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che

Title: AutoPR: Let's Automate Your Academic Promotion!

Abstract:
As the volume of peer‑reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer‑reviewed articles to high‑quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi‑agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform‑specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.

PaperID: 909, https://arxiv.org/pdf/2510.09517.pdf GitHub

Authors: Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou

Title: StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Abstract:
Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2374 research‑level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi‑agent pipeline with human‑in‑the‑loop validation that automates large‑scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof‑based tasks, enabling fine‑grained assessment of reasoning ability. Experimental results reveal that while closed‑source models such as GPT5‑mini achieve below 57% on research‑level problems, with open‑source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

PaperID: 910, https://arxiv.org/pdf/2510.09087.pdf GitHub

Authors: Zhang Zheng, Deheng Ye, Peilin Zhao, Hao Wang

Title: The Stackelberg Speaker: Optimizing Persuasive Communication in Social Deduction Games

Abstract:
Large language model (LLM) agents have shown remarkable progress in social deduction games (SDGs). However, existing approaches primarily focus on information processing and strategy selection, overlooking the significance of persuasive communication in influencing other players' beliefs and responses. In SDGs, success depends not only on making correct deductions but on convincing others to response in alignment with one's intent. To address this limitation, we formalize turn‑based dialogue in SDGs as a Stackelberg competition, where the current player acts as the leader who strategically influences the follower's response. Building on this theoretical foundation, we propose a reinforcement learning framework that trains agents to optimize utterances for persuasive impact. Through comprehensive experiments across three diverse SDGs, we demonstrate that our agents significantly outperform baselines. This work represents a significant step toward developing AI agents capable of strategic social influence, with implications extending to scenarios requiring persuasive communication. Our code and data are available at https://3dagentworld.github.io/leader_follower.

PaperID: 911, https://arxiv.org/pdf/2510.09049.pdf GitHub

Authors: Joonghyuk Hahn, Soohan Lim, Yo-Sub Han

Title: MEC$^3$O: Multi-Expert Consensus for Code Time Complexity Prediction

Abstract:
Predicting the complexity of source code is essential for software development and algorithm analysis. Recently, Baik et al. (2025) introduced CodeComplex for code time complexity prediction. The paper shows that LLMs without fine‑tuning struggle with certain complexity classes. This suggests that no single LLM excels at every class, but rather each model shows advantages in certain classes. We propose MEC^3O, a multi‑expert consensus system, which extends the multi‑agent debate frameworks. MEC^3O assigns LLMs to complexity classes based on their performance and provides them with class‑specialized instructions, turning them into experts. These experts engage in structured debates, and their predictions are integrated through a weighted consensus mechanism. Our expertise assignments to LLMs effectively handle Degeneration‑of‑Thought, reducing reliance on a separate judge model, and preventing convergence to incorrect majority opinions. Experiments on CodeComplex show that MEC^3O outperforms the open‑source baselines, achieving at least 10% higher accuracy and macro‑F1 scores. It also surpasses GPT‑4o‑mini in macro‑F1 scores on average and demonstrates competitive on‑par F1 scores to GPT‑4o and GPT‑o4‑mini on average. This demonstrates the effectiveness of multi‑expert debates and weight consensus strategy to generate the final predictions. Our code and data is available at https://github.com/suhanmen/MECO.

PaperID: 912, https://arxiv.org/pdf/2510.08988.pdf GitHub

Authors: Lan Zhang, Marco Valentino, André Freitas

Title: MASA: LLM-Driven Multi-Agent Systems for Autoformalization

Abstract:
Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi‑agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast‑evolving field. We showcase the effectiveness of MASA through use cases on real‑world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi‑agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.

PaperID: 913, https://arxiv.org/pdf/2510.08521.pdf GitHub

Authors: Yusong Hu, Runmin Ma, Yue Fan, Jinxin Shi, Zongsheng Cao, Yuhao Zhou, Jiakang Yuan, Shuaiyu Zhang, Shiyang Feng, Xiangchao Yan, Shufei Zhang, Wenlong Zhang, Lei Bai, Bo Zhang

Title: FlowSearch: Advancing deep research with dynamic structured knowledge flow

Abstract:
Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi‑step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi‑agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves competitive performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi‑disciplinary research scenarios and its potential to advance scientific discovery. The code is available at https://github.com/InternScience/InternAgent.

PaperID: 914, https://arxiv.org/pdf/2510.08145.pdf GitHub

Authors: Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu

Title: Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling

Abstract:
Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM‑as‑a‑Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM‑based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group‑Based Polling Optimization (Genii), an unsupervised multi‑agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM‑based judgment models into a multi‑agent system and simulates the interactive client‑server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human‑labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM‑based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.

PaperID: 915, https://arxiv.org/pdf/2510.07869.pdf GitHub

Authors: Junwen Gu, Zhiheng Wu, Pengxuan Si, Shuang Qiu, Zhentao Zhang, Yukai Feng, Luoyang Sun, Laien Luo, Lianyi Yu, Jian Wang, Zhengxing Wu

Title: USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

Abstract:
Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task‑specific methods, studies on general‑purpose intelligence for multi‑task execution remain scarce. To address this gap, we propose a unified framework for general‑purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation‑based dataset which comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision‑language‑action (VLA) model capable of executing various tasks from obstacle‑avoidance navigation to three‑dimensional mobile manipulation. The model features a convolution‑attention‑based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state‑of‑the‑art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, marking a 5.5% improvement over existing competitive baselines (below 37.6%), with navigation tasks reaching as high as 87.5%. These results validate the feasibility of general‑purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.

PaperID: 916, https://arxiv.org/pdf/2510.07517.pdf GitHub

Authors: Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li

Title: When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning

Abstract:
Multi‑agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity‑driven sycophancy and self‑bias, uncritically adopting a peer's view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self‑bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity‑weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish "self" from "peer", which forces equal weights on agent identity, thereby reducing bias and improving trustworthiness. Third, we define the Identity Bias Coefficient (IBC), a principled bias metric that measures an agent's tendency to follow its peer versus itself. Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy far more common than self‑bias. Our findings highlight the need to ensure that MAD systems reason based on content rather than identity. Code is released in https://github.com/deeplearning‑wisc/MAD‑identity‑bias.

PaperID: 917, https://arxiv.org/pdf/2510.07248.pdf GitHub

Authors: Jonggeun Lee, Woojung Song, Jongwook Han, Haesung Pyun, Yohan Jo

Title: Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Abstract:
Small language models (SLMs) enable scalable tool‑augmented multi‑agent systems where multiple SLMs handle subtasks orchestrated by a powerful coordinator. However, they struggle with tool‑use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible tool names that are absent from the provided tool schema, due to different naming conventions internalized during pretraining. Rather than training models to adapt to unfamiliar schemas, we propose adapting schemas to align with models' pretrained knowledge. We introduce PA‑Tool (Pretraining‑Aligned Tool Schema Generation), a training‑free method that leverages peakedness, a signal used in contamination detection that indicates pretraining familiarity, to rename tool components. By generating multiple candidates and selecting the candidate with the highest peakedness, PA‑Tool identifies pretraining‑aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17%, with schema misalignment errors reduced by 80%. PA‑Tool enables small models to substantially improve tool‑use accuracy without retraining, showing that schema‑level interventions can unlock the tool‑use potential of resource‑efficient models. Our code is available at https://github.com/holi‑lab/PA‑Tool.

PaperID: 918, https://arxiv.org/pdf/2510.07217.pdf GitHub

Authors: Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang

Title: GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation

Abstract:
Text‑to‑image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine‑tuning, are model‑specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test‑time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test‑time prompt optimization strategy that operates directly on the input text. We propose a plug‑and‑play multi‑agent system called GenPilot, integrating error analysis, clustering‑based adaptive exploration, fine‑grained verification, and a memory module for iterative optimization. Our approach is model‑agnostic, interpretable, and well‑suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG‑bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test‑time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.

PaperID: 919, https://arxiv.org/pdf/2510.07030.pdf GitHub

Authors: Abhinav Kumar, Fan Yang, Sergio Aguilera Marinovic, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson

Title: Diffusing Trajectory Optimization Problems for Recovery During Multi-Finger Manipulation

Abstract:
Multi‑fingered hands are emerging as powerful platforms for performing fine manipulation tasks, including tool use. However, environmental perturbations or execution errors can impede task performance, motivating the use of recovery behaviors that enable normal task execution to resume. In this work, we take advantage of recent advances in diffusion models to construct a framework that autonomously identifies when recovery is necessary and optimizes contact‑rich trajectories to recover. We use a diffusion model trained on the task to estimate when states are not conducive to task execution, framed as an out‑of‑distribution detection problem. We then use diffusion sampling to project these states in‑distribution and use trajectory optimization to plan contact‑rich recovery trajectories. We also propose a novel diffusion‑based approach that distills this process to efficiently diffuse the full parameterization, including constraints, goal state, and initialization, of the recovery trajectory optimization problem, saving time during online execution. We compare our method to a reinforcement learning baseline and other methods that do not explicitly plan contact interactions, including on a hardware screwdriver‑turning task where we show that recovering using our method improves task performance by 96% and that ours is the only method evaluated that can attempt recovery without causing catastrophic task failure. Videos can be found at https://dtourrecovery.github.io/.

PaperID: 920, https://arxiv.org/pdf/2510.06843.pdf GitHub GitHub

Authors: Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

Title: Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

Abstract:
Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi‑LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM‑as‑a‑Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi‑LLM debate and introduce a Self‑Signals Driven Multi‑LLM Debate (SID), which leverages two types of self‑signals: model‑level confidence and token‑level semantic focus, to adaptively guide the debate process. Our approach enables high‑confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi‑agent debate systems. Our code will be available at~\hrefhttps://github.com/xuhang2019/SID\texttthttps://github.com/xuhang2019/SID.

PaperID: 921, https://arxiv.org/pdf/2510.06307.pdf GitHub

Authors: Wentao Deng, Jiahuan Pei, Zhiwei Xu, Zhaochun Ren, Zhumin Chen, Pengjie Ren

Title: Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks

Abstract:
A multi‑agent system (MAS) enhances its capacity to solve complex natural language processing (NLP) tasks through collaboration among multiple agents, where consensus‑seeking serves as a fundamental mechanism. However, existing consensus‑seeking approaches typically rely on voting mechanisms to judge consensus, overlooking contradictions in system‑internal beliefs that destabilize the consensus. Moreover, these methods often involve agents updating their results through indiscriminate collaboration with every other agent. Such uniform interaction fails to identify the optimal collaborators for each agent, hindering the emergence of a stable consensus. To address these challenges, we provide a theoretical framework for selecting optimal collaborators that maximize consensus stability. Based on the theorems, we propose the Belief‑Calibrated Consensus Seeking (BCCS) framework to facilitate stable consensus via selecting optimal collaborators and calibrating the consensus judgment by system‑internal beliefs. Experimental results on the MATH and MMLU benchmark datasets demonstrate that the proposed BCCS framework outperforms the best existing results by 2.23% and 3.95% of accuracy on challenging tasks, respectively. Our code and data are available at https://github.com/dengwentao99/BCCS.

PaperID: 922, https://arxiv.org/pdf/2510.06240.pdf GitHub

Authors: Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang

Title: Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

Abstract:
Industrial question‑answering (QA) systems require higher safety and reliability than general‑purpose dialogue models, as errors in high‑risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi‑agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph‑guided Multi‑Agent System Distillation (KG‑MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG‑MASD generates high‑confidence instruction‑tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG‑MASD improves accuracy by 2.4 per cent to 20.1 per cent over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety‑critical industrial scenarios. Code and data are available at https://github.com/erwinmsmith/KG‑MAD/.

PaperID: 923, https://arxiv.org/pdf/2510.05950.pdf GitHub

Authors: Songyuan Sui, Zihang Xu, Xia Hu

Title: Training-Free Time Series Classification via In-Context Reasoning with LLM Agents

Abstract:
Time series classification (TSC) spans diverse application scenarios, yet labeled data are often scarce, making task‑specific training costly and inflexible. Recent reasoning‑oriented large language models (LLMs) show promise in understanding temporal patterns, but purely zero‑shot usage remains suboptimal. We propose FETA, a multi‑agent framework for training‑free TSC via exemplar‑based in‑context reasoning. FETA decomposes a multivariate series into channel‑wise subproblems, retrieves a few structurally similar labeled examples for each channel, and leverages a reasoning LLM to compare the query against these exemplars, producing channel‑level labels with self‑assessed confidences; a confidence‑weighted aggregator then fuses all channel decisions. This design eliminates the need for pretraining or fine‑tuning, improves efficiency by pruning irrelevant channels and controlling input length, and enhances interpretability through exemplar grounding and confidence estimation. On nine challenging UEA datasets, FETA achieves strong accuracy under a fully training‑free setting, surpassing multiple trained baselines. These results demonstrate that a multi‑agent in‑context reasoning framework can transform LLMs into competitive, plug‑and‑play TSC solvers without any parameter training. The code is available at https://github.com/SongyuanSui/FETATSC.

PaperID: 924, https://arxiv.org/pdf/2510.05096.pdf GitHub GitHub

Authors: Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

Title: Paper2Video: Automatic Video Generation from Scientific Papers

Abstract:
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor‑intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi‑modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author‑created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics‑‑Meta Similarity, PresentArena, PresentQuiz, and IP Memory‑‑to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi‑agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking‑head rendering, while parallelizing slide‑wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready‑to‑use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

PaperID: 925, https://arxiv.org/pdf/2510.04885.pdf GitHub

Authors: Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo

Title: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Abstract:
Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red‑teaming. To this end, we introduce RL‑Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL‑Hammer requires no warm‑up data and can be trained entirely from scratch. To achieve high ASRs against industrial‑level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL‑Hammer reaches a 98% ASR against GPT‑4o and a 72% ASR against GPT‑5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward‑hack diversity objectives. Finally, we show that RL‑Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red‑teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl‑injector.

PaperID: 926, https://arxiv.org/pdf/2510.04860.pdf GitHub

Authors: Siwei Han, Kaiwen Xiong, Jiaqi Liu, Xinyu Ye, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao

Title: Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Abstract:
As Large Language Model (LLM) agents increasingly gain self‑evolutionary capabilities to adapt and refine their strategies through real‑world interaction, their long‑term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post‑deployment risk unique to self‑evolving LLM agents. Unlike training‑time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self‑interested strategies. We formalize and analyze ATP through two complementary paradigms: Self‑Interested Exploration, where repeated high‑reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi‑agent systems. Building on these paradigms, we construct controllable testbeds and benchmark both open and closed‑source LLMs. Our experiments show that alignment benefits erode rapidly under self‑evolution, with initially aligned models converging toward unaligned states. In multi‑agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning‑based alignment methods provide limited defenses against alignment tipping. These findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback‑driven decay during deployment. Our data and code are available at https://github.com/aiming‑lab/ATP.

PaperID: 927, https://arxiv.org/pdf/2510.04637.pdf GitHub

Authors: Zeyi Zhang, Yanju Zhou, Heyuan Yao, Tenglong Ao, Xiaohang Zhan, Libin Liu

Title: Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents

Abstract:
We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co‑speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual‑person gesture generation model based on an auto‑regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high‑level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors.

PaperID: 928, https://arxiv.org/pdf/2510.04508.pdf GitHub

Authors: Lili Xie, Yi Zhang, Ruihong Qiu, Jiajun Liu, Sen Wang

Title: MARCO: A Cooperative Knowledge Transfer Framework for Personalized Cross-domain Recommendations

Abstract:
Recommender systems frequently encounter data sparsity issues, particularly when addressing cold‑start scenarios involving new users or items. Multi‑source cross‑domain recommendation (CDR) addresses these challenges by transferring valuable knowledge from multiple source domains to enhance recommendations in a target domain. However, existing reinforcement learning (RL)‑based CDR methods typically rely on a single‑agent framework, leading to negative transfer issues caused by inconsistent domain contributions and inherent distributional discrepancies among source domains. To overcome these limitations, MARCO, a Multi‑Agent Reinforcement Learning‑based Cross‑Domain recommendation framework, is proposed. It leverages cooperative multi‑agent reinforcement learning, where each agent is dedicated to estimating the contribution from an individual source domain, effectively managing credit assignment and mitigating negative transfer. In addition, an entropy‑based action diversity penalty is introduced to enhance policy expressiveness and stabilize training by encouraging diverse agents' joint actions. Extensive experiments across four benchmark datasets demonstrate MARCO's superior performance over state‑of‑the‑art methods, highlighting its robustness and strong generalization capabilities. The code is at https://github.com/xiewilliams/MARCO.

PaperID: 929, https://arxiv.org/pdf/2510.04206.pdf GitHub GitHub

Authors: Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, Rui Lu, Hongning Wang, Jie Tang, Yuxiao Dong

Title: AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework

Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in building generalist agents that can learn through online interactions. However, applying reinforcement learning (RL) to train LLM agents in multi‑turn, multi‑task settings remains challenging due to lack of scalable infrastructure and stable training algorithms. In this work, we present the AgentRL framework for scalable multi‑turn, multi‑task agentic RL training. On the infrastructure side, AgentRL features a fully‑asynchronous generation‑training pipeline for efficient multi‑turn RL. To support heterogeneous environment development in multi‑task RL, we design a unified function‑call based API interface, containerized environment development, and a centralized controller. On the algorithm side, we propose cross‑policy sampling to encourage model exploration in multi‑turn settings and task advantage normalization to stabilize multi‑task training. Experiments show that AgentRL, trained on open LLMs across five agentic tasks, significantly outperforms GPT‑5, Clause‑Sonnet‑4, DeepSeek‑R1, and other open‑source LLM agents. Multi‑task training with AgentRL matches the best results among all task‑specific models. AgentRL is open‑sourced at https://github.com/THUDM/AgentRL. The algorithm and framework are adopted in building \textsc\hrefhttps://autoglm.zhipuai.cnAutoGLM.

PaperID: 930, https://arxiv.org/pdf/2510.03472.pdf GitHub

Authors: Yulun Zhang, Alexandre O. G. Barbosa, Federico Pecora, Jiaoyang Li

Title: Destination-to-Chutes Task Mapping Optimization for Multi-Robot Coordination in Robotic Sorting Systems

Abstract:
We study optimizing a destination‑to‑chutes task mapping to improve throughput in Robotic Sorting Systems (RSS), where a team of robots sort packages on a sortation floor by transporting them from induct workstations to eject chutes based on their shipping destinations (e.g. Los Angeles or Pittsburgh). The destination‑to‑chutes task mapping is used to determine which chutes a robot can drop its package. Finding a high‑quality task mapping is challenging because of the complexity of a real‑world RSS. First, optimizing task mapping is interdependent with robot target assignment and path planning. Second, chutes will be CLOSED for a period of time once they receive sufficient packages to allow for downstream processing. Third, task mapping quality directly impacts the downstream processing, as scattered chutes for the same destination increase package handling time. In this paper, we first formally define task mappings and the problem of Task Mapping Optimization (TMO). We then present a simulator of RSS to evaluate task mappings. We then present a simple TMO method based on the Evolutionary Algorithm and Mixed Integer Linear Programming, demonstrating the advantage of our optimized task mappings over the greedily generated ones in various RSS setups with different map sizes, numbers of chutes, and destinations. Finally, we use Quality Diversity algorithms to analyze the throughput of a diverse set of task mappings. Our code is available online at https://github.com/lunjohnzhang/tmo_public.

PaperID: 931, https://arxiv.org/pdf/2510.03257.pdf GitHub

Authors: Zijian Zhao, Sen Li

Title: Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Abstract:
On‑demand ride‑sharing platforms, such as Uber and Lyft, face the intricate real‑time challenge of bundling and matching passengers‑each with distinct origins and destinations‑to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi‑Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple‑BERT, a centralized Single Agent Reinforcement Learning (MARL) method designed specifically for large‑scale order dispatching on ride‑sharing platforms. Built on a variant TD3, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT‑based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of driver and orders. We validate our method using a real‑world ride‑hailing dataset from Manhattan. Triple‑BERT achieves approximately an 11.95% improvement over current state‑of‑the‑art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at the repository https://github.com/RS2002/Triple‑BERT .

PaperID: 932, https://arxiv.org/pdf/2510.03120.pdf GitHub

Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu

Title: SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?

Abstract:
Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor‑intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey‑specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards and there lacks a rigorous, reader‑aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose a fine‑grained, quiz‑driven evaluation framework SurveyBench, featuring (1) typical survey topics source from recent 11,343 arXiv papers and corresponding 4,947 high‑quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non‑textual richness; and (3) a dual‑mode evaluation protocol that includes content‑based and quiz‑based answerability tests, explicitly aligned with readers' informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content‑based evaluation).

PaperID: 933, https://arxiv.org/pdf/2510.02995.pdf GitHub

Authors: Gijs Wijngaard, Elia Formisano, Michel Dumontier, Jenia Jitsev

Title: AudioToolAgent: An Agentic Framework for Audio-Language Models

Abstract:
Large Audio‑Language Models (LALMs) perform well on audio understanding tasks but lack multistep reasoning and tool‑calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio‑language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech‑to‑text. The agent reasons about which tools to invoke, how to formulate follow‑up queries, and how to arbitrate conflicting tool outputs, without accessing the audio. Experiments with MMAU, MMAR, and MMAU‑Pro show state‑of‑the‑art accuracy: up to 77.50% in MMAU, 77.00% in MMAR, and 61.90% in MMAU‑Pro. Shapley‑based analysis identifies effective agent‑tool combinations. The code and reproduction materials are available at https://github.com/GLJS/AudioToolAgent.

PaperID: 934, https://arxiv.org/pdf/2510.02469.pdf GitHub

Authors: Sung-Yeon Park, Adam Lee, Juanwu Lu, Can Cui, Luyang Jiang, Rohit Gupta, Kyungtae Han, Ahmadreza Moradipari, Ziran Wang

Title: SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting

Abstract:
Driving scene manipulation with sensor data is emerging as a promising alternative to traditional virtual driving simulators. However, existing frameworks struggle to generate realistic scenarios efficiently due to limited editing capabilities. To address these challenges, we present SIMSplat, a predictive driving scene editor with language‑aligned Gaussian splatting. As a language‑controlled editor, SIMSplat enables intuitive manipulation using natural language prompts. By aligning language with Gaussian‑reconstructed scenes, it further supports direct querying of road objects, allowing precise and flexible editing. Our method provides detailed object‑level editing, including adding new objects and modifying the trajectories of both vehicles and pedestrians, while also incorporating predictive path refinement through multi‑agent motion prediction to generate realistic interactions among all agents in the scene. Experiments on the Waymo dataset demonstrate SIMSplat's extensive editing capabilities and adaptability across a wide range of scenarios. Project page: https://sungyeonparkk.github.io/simsplat/

PaperID: 935, https://arxiv.org/pdf/2510.02373.pdf GitHub

Authors: Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, XiaoFeng Wang

Title: A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Abstract:
Large Language Model (LLM) agents use memory to learn from past interactions, enabling autonomous planning and decision‑making in complex environments. However, this reliance on memory introduces a critical security risk: an adversary can inject seemingly harmless records into an agent's memory to manipulate its future behavior. This vulnerability is characterized by two core aspects: First, the malicious effect of injected records is only activated within a specific context, making them hard to detect when individual memory entries are audited in isolation. Second, once triggered, the manipulation can initiate a self‑reinforcing error cycle: the corrupted outcome is stored as precedent, which not only amplifies the initial error but also progressively lowers the threshold for similar attacks in the future. To address these challenges, we introduce A‑MemGuard (Agent‑Memory Guard), the first proactive defense framework for LLM agent memory. The core idea of our work is the insight that memory itself must become both self‑checking and self‑correcting. Without modifying the agent's core architecture, A‑MemGuard combines two mechanisms: (1) consensus‑based validation, which detects anomalies by comparing reasoning paths derived from multiple related memories and (2) a dual‑memory structure, where detected failures are distilled into ``lessons'' stored separately and consulted before future actions, breaking error cycles and enabling adaptation. Comprehensive evaluations on multiple benchmarks show that A‑MemGuard effectively cuts attack success rates by over 95% while incurring a minimal utility cost. This work shifts LLM memory security from static filtering to a proactive, experience‑driven model where defenses strengthen over time. Our code is available in https://github.com/TangciuYueng/AMemGuard

PaperID: 936, https://arxiv.org/pdf/2510.02356.pdf GitHub

Authors: Xinjie Shen, Mufei Li, Pan Li

Title: Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark

Abstract:
The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural language based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical‑world privacy awareness of LLM‑powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top‑performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high‑stakes situations pitting privacy against critical social norms, leading models like GPT‑4o and Claude‑3.5‑haiku disregarded the social norm over 15% of the time. These findings, demonstrated by our benchmark, underscore a fundamental misalignment in LLMs regarding physically grounded privacy and establish the need for more robust, physically‑aware alignment. Codes and datasets will be available at https://github.com/Graph‑COM/EAPrivacy.

PaperID: 937, https://arxiv.org/pdf/2510.02328.pdf GitHub

Authors: Ziqing Wang, Chengsheng Mao, Xiaole Wen, Yuan Luo, Kaize Ding

Title: AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

Abstract:
Medical Multimodal Large Language Models (Med‑MLLMs) have shown great promise in medical visual question answering (Med‑VQA). However, when deployed in low‑resource settings where abundant labeled data are unavailable, existing Med‑MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training‑free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse‑to‑fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med‑VQA benchmarks demonstrate substantial improvements in both zero‑shot and few‑shot Med‑VQA settings. The code is available at https://github.com/REAL‑Lab‑NU/AMANDA.

PaperID: 938, https://arxiv.org/pdf/2510.02104.pdf GitHub

Authors: Yunhan Lin, Wenqi Wu, Zhijie Zhang, Huasong Min

Title: LangGrasp: Leveraging Fine-Tuned LLMs for Language Interactive Robot Grasping with Ambiguous Instructions

Abstract:
The existing language‑driven grasping methods struggle to fully handle ambiguous instructions containing implicit intents. To tackle this challenge, we propose LangGrasp, a novel language‑interactive robotic grasping framework. The framework integrates fine‑tuned large language models (LLMs) to leverage their robust commonsense understanding and environmental perception capabilities, thereby deducing implicit intents from linguistic instructions and clarifying task requirements along with target manipulation objects. Furthermore, our designed point cloud localization module, guided by 2D part segmentation, enables partial point cloud localization in scenes, thereby extending grasping operations from coarse‑grained object‑level to fine‑grained part‑level manipulation. Experimental results show that the LangGrasp framework accurately resolves implicit intents in ambiguous instructions, identifying critical operations and target information that are unstated yet essential for task completion. Additionally, it dynamically selects optimal grasping poses by integrating environmental information. This enables high‑precision grasping from object‑level to part‑level manipulation, significantly enhancing the adaptability and task execution efficiency of robots in unstructured environments. More information and code are available here: https://github.com/wu467/LangGrasp.

PaperID: 939, https://arxiv.org/pdf/2510.01664.pdf GitHub

Authors: Yejin Kim, Youngbin Lee, Juhyeong Kim, Yongjae Lee

Title: GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents

Abstract:
This study demonstrates that GuruAgents, prompt‑guided AI agents, can systematically operationalize the strategies of legendary investment gurus. We develop five distinct GuruAgents, each designed to emulate an iconic investor, by encoding their distinct philosophies into LLM prompts that integrate financial tools and a deterministic reasoning pipeline. In a backtest on NASDAQ‑100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique behaviors driven by their prompted personas. The Buffett GuruAgent achieves the highest performance, delivering a 42.2% CAGR that significantly outperforms benchmarks, while other agents show varied results. These findings confirm that prompt engineering can successfully translate the qualitative philosophies of investment gurus into reproducible, quantitative strategies, highlighting a novel direction for automated systematic investing. The source code and data are available at https://github.com/yejining99/GuruAgents.

PaperID: 940, https://arxiv.org/pdf/2510.01295.pdf GitHub

Authors: Zarreen Reza

Title: The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation

Abstract:
As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi‑agent debate as a controlled "social laboratory" to discover and quantify these behaviors. In our framework, LLM‑based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement (μ > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderators persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at https://github.com/znreza/multi‑agent‑LLM‑eval‑for‑debate.

PaperID: 941, https://arxiv.org/pdf/2510.01264.pdf GitHub

Authors: Isaac Peterson, Christopher Allred, Jacob Morrey, Mario Harper

Title: A Framework for Scalable Heterogeneous Multi-Agent Adversarial Reinforcement Learning in IsaacLab

Abstract:
Multi‑Agent Reinforcement Learning (MARL) is central to robotic systems cooperating in dynamic environments. While prior work has focused on these collaborative settings, adversarial interactions are equally critical for real‑world applications such as pursuit‑evasion, security, and competitive manipulation. In this work, we extend the IsaacLab framework to support scalable training of adversarial policies in high‑fidelity physics simulations. We introduce a suite of adversarial MARL environments featuring heterogeneous agents with asymmetric goals and capabilities. Our platform integrates a competitive variant of Heterogeneous Agent Reinforcement Learning with Proximal Policy Optimization (HAPPO), enabling efficient training and evaluation under adversarial dynamics. Experiments across several benchmark scenarios demonstrate the framework's ability to model and train robust policies for morphologically diverse multi‑agent competition while maintaining high throughput and simulation realism. Code and benchmarks are available at: https://github.com/DIRECTLab/IsaacLab‑HARL .

PaperID: 942, https://arxiv.org/pdf/2510.01132.pdf GitHub

Authors: Ruiyi Wang, Prithviraj Ammanabrolu

Title: A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

Abstract:
We study what actually works and what doesn't for training large language models as agents via multi‑turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter‑related pillars ‑‑ environment, reward, and policy ‑‑ and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE‑Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn‑level rewards accelerate training, performance and stability is highly dependent on the choice of RL algorithm. (iii) And for the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine‑tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co‑design across the three pillars, facilitating research and practical efforts in multi‑turn agentic RL. Code: https://github.com/pearls‑lab/meow‑tea‑taro

PaperID: 943, https://arxiv.org/pdf/2510.00425.pdf GitHub

Authors: Rishi Veerapaneni, Alvin Tang, Haodong He, Sophia Zhao, Viraj Shah, Yidai Cen, Ziteng Ji, Gabriel Olin, Jon Arrizabalaga, Yorai Shaoul, Jiaoyang Li, Maxim Likhachev

Title: Conflict-Based Search as a Protocol: A Multi-Agent Motion Planning Protocol for Heterogeneous Agents, Solvers, and Independent Tasks

Abstract:
Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different robots to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how we can get efficient collision‑free movements between algorithmically heterogeneous agents by using Conflict‑Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single‑agent motion planning API; finding a collision‑free path that satisfies certain space‑time constraints. Given such an API, CBS uses a central planner to find collision‑free paths ‑ independent of how the API is implemented. We demonstrate how this protocol enables multi‑agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single‑agent planners including: Heuristic Search (e.g., A), Sampling Based Search (e.g., RRT), Optimization (e.g., Direct Collocation), Diffusion, and Reinforcement Learning.

PaperID: 944, https://arxiv.org/pdf/2510.00406.pdf GitHub

Authors: Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su

Title: VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Abstract:
Vision‑Language‑Action (VLA) models enable embodied decision‑making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real‑world interactions or suffers from sim‑to‑real gaps. We introduce VLA‑RFT, a reinforcement fine‑tuning framework that leverages a data‑driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory‑level rewards derived from goal‑achieving references. This design delivers an efficient and action‑aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine‑tuning steps, VLA‑RFT surpasses strong supervised baselines and achieves greater efficiency than simulator‑based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world‑model‑based RFT as a practical post‑training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla‑rft.github.io/.

PaperID: 945, https://arxiv.org/pdf/2510.00307.pdf GitHub

Authors: Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, Adel Bibi

Title: BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

Abstract:
Agents backed by large language models (LLMs) increasingly rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical fairness concern: systematic bias in tool selection can degrade user experience and distort competition by privileging certain providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to systematically evaluate tool‑selection bias. Using this benchmark, we evaluate seven LLMs and show that substantial bias persists, with models either fixating on a single provider or disproportionately favoring tools that appear earlier in the context. To uncover the sources of this behavior, we conduct controlled experiments that isolate the effects of tool features, exposed metadata (name, description, and parameters), and pre‑training exposure. We find that (1) semantic alignment between user queries and tool metadata is the strongest driver of selection; (2) small perturbations to tool descriptions can significantly shift choices; and (3) repeated pre‑training exposure to a single endpoint amplifies provider‑level bias. Finally, we propose a lightweight mitigation strategy that first filters tools to a relevant subset and then samples uniformly, substantially reducing selection bias while maintaining strong task coverage. Our results highlight tool‑selection bias as a key obstacle to the fair deployment of tool‑augmented LLM agents. Our code and benchmark are publicly available at https://github.com/thierry123454/tool‑selection‑bias.

PaperID: 946, https://arxiv.org/pdf/2510.00069.pdf GitHub

Authors: Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang

Title: OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human‑like understanding in One‑Image Guides remains insufficiently explored. One‑Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG‑Bench, a comprehensive benchmark focused on One‑Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi‑automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image‑text pairs. With OIG‑Bench, we have conducted a comprehensive evaluation of 29 state‑of‑the‑art MLLMs, including both proprietary and open‑source models. The results show that Qwen2.5‑VL‑72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual‑text relationships. In addition, we also demonstrate that the proposed multi‑agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high‑quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG‑Bench.

PaperID: 947, https://arxiv.org/pdf/2509.26354.pdf GitHub

Authors: Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao

Title: Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Abstract:
Advances in Large Language Models (LLMs) have enabled a new class of self‑evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self‑evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self‑evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top‑tier LLMs (e.g., Gemini‑2.5‑Pro). Different emergent risks are observed in the self‑evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self‑evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self‑evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.

PaperID: 948, https://arxiv.org/pdf/2509.26306.pdf GitHub

Authors: Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin

Title: Interactive Learning for LLM Reasoning

Abstract:
Existing multi‑agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi‑agent systems (MAS). However, during inference, they require re‑executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi‑agent interaction can enhance LLMs' independent problem‑solving ability, we introduce ILR, a novel co‑learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3, an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM's reward distribution characteristics into another's reward function, thereby enhancing the cohesion of multi‑agent interactions. We evaluate the effectiveness of ILR across three LLMs from two model families of varying scales on five mathematical, one coding, one general question answering, and one scientific reasoning benchmarks. Experimental results show that ILR consistently outperforms single‑agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi‑agent inference, and dynamic interaction types can boost multi‑agent learning compared to pure cooperative or competitive strategies.

PaperID: 949, https://arxiv.org/pdf/2509.26161.pdf GitHub

Authors: Runxin Yang, Yuxuan Wan, Shuqing Li, Michael R. Lyu

Title: 90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development

Abstract:
Developing 3D games requires specialized expertise across multiple domains, including programming, 3D modeling, and engine configuration, which limits access to millions of potential creators. Recently, researchers have begun to explore automated game development. However, existing approaches face three primary challenges: (1) limited scope to 2D content generation or isolated code snippets; (2) requirement for manual integration of generated components into game engines; and (3) poor performance on handling interactive game logic and state management. While Multimodal Large Language Models (MLLMs) demonstrate potential capabilities to ease the game generation task, a critical gap still remains in translating these outputs into production‑ready, executable game projects based on game engines such as Unity and Unreal Engine. To bridge the gap, this paper introduces UniGen, the first end‑to‑end coordinated multi‑agent framework that automates zero‑coding development of runnable 3D games from natural language requirements. Specifically, UniGen uses a Planning Agent that interprets user requirements into structured blueprints and engineered logic descriptions; after which a Generation Agent produces executable C# scripts; then an Automation Agent handles engine‑specific component binding and scene construction; and lastly a Debugging Agent provides real‑time error correction through conversational interaction. We evaluated UniGen on three distinct game prototypes. Results demonstrate that UniGen not only democratizes game creation by requiring no coding from the user, but also reduces development time by 91.4%. We release UniGen at https://github.com/yxwan123/UniGen. A video demonstration is available at https://www.youtube.com/watch?v=xyJjFfnxUx0.

PaperID: 950, https://arxiv.org/pdf/2509.25651.pdf GitHub

Authors: Gihan Panapitiya, Emily Saldanha, Heather Job, Olivia Hess

Title: AutoLabs: Cognitive Multi-Agent Systems with Self-Correction for Autonomous Chemical Experimentation

Abstract:
The automation of chemical research through self‑driving laboratories (SDLs) promises to accelerate scientific discovery, yet the reliability and granular performance of the underlying AI agents remain critical, under‑examined challenges. In this work, we introduce AutoLabs, a self‑correcting, multi‑agent architecture designed to autonomously translate natural‑language instructions into executable protocols for a high‑throughput liquid handler. The system engages users in dialogue, decomposes experimental goals into discrete tasks for specialized agents, performs tool‑assisted stoichiometric calculations, and iteratively self‑corrects its output before generating a hardware‑ready file. We present a comprehensive evaluation framework featuring five benchmark experiments of increasing complexity, from simple sample preparation to multi‑plate timed syntheses. Through a systematic ablation study of 20 agent configurations, we assess the impact of reasoning capacity, architectural design (single‑ vs. multi‑agent), tool use, and self‑correction mechanisms. Our results demonstrate that agent reasoning capacity is the most critical factor for success, reducing quantitative errors in chemical amounts (nRMSE) by over 85% in complex tasks. When combined with a multi‑agent architecture and iterative self‑correction, AutoLabs achieves near‑expert procedural accuracy (F1‑score > 0.89) on challenging multi‑step syntheses. These findings establish a clear blueprint for developing robust and trustworthy AI partners for autonomous laboratories, highlighting the synergistic effects of modular design, advanced reasoning, and self‑correction to ensure both performance and reliability in high‑stakes scientific applications. Code: https://github.com/pnnl/autolabs

PaperID: 951, https://arxiv.org/pdf/2509.25550.pdf GitHub

Authors: Dongsu Lee, Daehee Lee, Yaru Niu, Honguk Woo, Amy Zhang, Ding Zhao

Title: Unifying Agent Interaction and World Information for Multi-agent Coordination

Abstract:
This work presents a novel representation learning framework, interaction‑world latent (IWoL), to facilitate team coordination in multi‑agent reinforcement learning (MARL). Building effective representation for team coordination is a challenging problem, due to the intricate dynamics emerging from multi‑agent interaction and incomplete information induced by local observations. Our key insight is to construct a learnable representation space that jointly captures inter‑agent relations and task‑specific world information by directly modeling communication protocols. This representation enables fully decentralized execution with implicit coordination while avoiding the drawbacks of explicit message passing, for example, slower decision‑making, vulnerability to malicious attackers, and sensitivity to bandwidth limitations. In practice, our representation can be used not only as an implicit latent for each agent, but also as an explicit message for communication. Across four challenging MARL benchmarks, we evaluate both variants and show that IWoL provides a simple yet powerful key for team coordination. Moreover, we demonstrate that our representation can be combined with existing MARL algorithms to further enhance their performance.

PaperID: 952, https://arxiv.org/pdf/2509.25541.pdf GitHub

Authors: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao

Title: Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Abstract:
Although reinforcement learning (RL) has emerged as a promising approach for improving vision‑language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self‑improvement in multimodal systems. To address this challenge, we propose Vision‑Zero, a label‑free, domain‑agnostic multi‑agent self‑play framework for self‑evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision‑Zero encompasses three main attributes: (1) Strategic Self‑Play Framework: Vision‑Zero trains VLMs in "Who Is the Spy"‑style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision‑Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR‑based synthetic scenes, charts, and real‑world images. (3) Sustainable Performance Gain: We introduce Iterative Self‑Play Policy Optimization (Iterative‑SPO), a novel training algorithm that alternates between Self‑Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self‑play‑only training and achieving sustained long‑term improvements. Despite using label‑free data, Vision‑Zero achieves state‑of‑the‑art performance on reasoning, chart question answering, and vision‑centric understanding tasks, surpassing other annotation‑based methods. Models and code have been released at https://github.com/wangqinsi1/Vision‑Zero.

PaperID: 953, https://arxiv.org/pdf/2509.25370.pdf GitHub

Authors: Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You

Title: Where LLM Agents Fail and How They can Learn From Failures

Abstract:
Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool‑use modules, have shown promise in solving complex, multi‑step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root‑cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system‑level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real‑world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root‑cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all‑correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab‑uiuc/AgentDebug

PaperID: 954, https://arxiv.org/pdf/2509.25299.pdf GitHub

Authors: Daniel Platnick, Mohamed E. Bengueddache, Marjan Alirezaie, Dava J. Newman, Alex ''Sandy'' Pentland, Hossein Rahnama

Title: ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents

Abstract:
Generative agents powered by language models are increasingly deployed for long‑horizon tasks. However, as long‑term memory context grows over time, they struggle to maintain coherence. This deficiency leads to critical failures, including identity drift, ignoring established beliefs, and the propagation of hallucinations in multi‑agent systems. To mitigate these challenges, this paper introduces Identity Retrieval‑Augmented Generation (ID‑RAG), a novel mechanism designed to ground an agent's persona and persistent preferences in a dynamic, structured identity model: a knowledge graph of core beliefs, traits, and values. During the agent's decision loop, this model is queried to retrieve relevant identity context, which directly informs action selection. We demonstrate this approach by introducing and implementing a new class of ID‑RAG enabled agents called Human‑AI Agents (HAis), where the identity model is inspired by the Chronicle structure used in Perspective‑Aware AI, a dynamic knowledge graph learned from a real‑world entity's digital footprint. In social simulations of a mayoral election, HAis using ID‑RAG outperformed baseline agents in long‑horizon persona coherence ‑ achieving higher identity recall across all tested models by the fourth timestep ‑ and reduced simulation convergence time by 19% (GPT‑4o) and 58% (GPT‑4o mini). By treating identity as an explicit, retrievable knowledge structure, ID‑RAG offers a foundational approach for developing more temporally coherent, interpretable, and aligned generative agents. Our code is open‑source and available at: https://github.com/flybits/humanai‑agents.

PaperID: 955, https://arxiv.org/pdf/2509.25185.pdf GitHub

Authors: Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang

Title: PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Abstract:
Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue‑based methods are constrained with low‑fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured‑image tasks. In this paper, we propose PixelCraft, a novel multi‑agent system for high‑fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high‑fidelity processing, we construct a high‑quality corpus and fine‑tune an MLLM into a grounding model, whose pixel‑level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three‑stage workflow of tool selection, agent discussion, and self‑criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.

PaperID: 956, https://arxiv.org/pdf/2509.25020.pdf GitHub

Authors: Jiayu Liu, Zhenya Huang, Xuan Yang, Tianyun Ji, Anya Sims, Hao Xu, Enhong Chen, Yee Whye Teh, Ning Miao

Title: Deep Thinking by Markov Chain of Continuous Thoughts

Abstract:
Transformer‑based models can perform complicated reasoning by generating reasoning paths token by token. While effective, this approach often requires generating thousands of tokens to solve a single problem, which can be slow and computationally expensive. More importantly, it involves a discrete sampling operation at the end of each time step, creating an information bottleneck across time steps. In this work, we propose MarCos, an improvement of the transformer structure that allows fully continuous reasoning at the thought level. Unlike traditional transformer layers, which focus on refining token predictions at each time step, layers in MarCos map a continuous representation of a stepwise thought to the distribution of the next thought. This enables us to achieve multi‑step reasoning in a single pass of MarCos. Preliminary experimental results on synthetic and real‑world math tasks show the great potential of MarCos. Notably, we observe that the increased information bandwidth of MarCos elicits the ability of parallel thinking, in contrast to single‑threaded thinking in traditional transformers. Meanwhile, in real‑world math tasks, MarCos achieves more than 10× speedup in wall‑clock time with the same level of accuracy. Our code is available at https://github.com/Ljyustc/MarCos.

PaperID: 957, https://arxiv.org/pdf/2509.24826.pdf GitHub

Authors: Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, Estevam Hruschka

Title: AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems

Abstract:
Large language models (LLMs) are being increasingly used for planning in orchestrated multi‑agent systems. However, existing LLM‑based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human‑in‑the‑loop planning through conversational and graph‑based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM‑generated plans, significantly enhancing user control and trust in multi‑agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.

PaperID: 958, https://arxiv.org/pdf/2509.24323.pdf GitHub

Authors: Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, Yufei Guo

Title: MAS$^2$: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems

Abstract:
The past two years have witnessed the meteoric rise of Large Language Model (LLM)‑powered multi‑agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self‑evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi‑agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid ``generate‑once‑and‑deploy'' paradigm, rendering the resulting systems brittle and ill‑prepared for the dynamism and uncertainty of real‑world environments. To transcend this limitation, we introduce MAS^2, a paradigm predicated on the principle of recursive self‑generation: a multi‑agent system that autonomously architects bespoke multi‑agent systems for diverse problems. Technically, we devise a ``generator‑implementer‑rectifier'' tri‑agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real‑time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta‑agents. Extensive evaluation across seven benchmarks reveals that MAS^2 achieves performance gains of up to 19.6% over state‑of‑the‑art MAS in complex scenarios such as deep research and code generation. Moreover, MAS^2 exhibits superior cross‑backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to 15.1%. Crucially, these gains are attained without incurring excessive token costs, as MAS^2 consistently resides on the Pareto frontier of cost‑performance trade‑offs. The source codes are available at https://github.com/yeyeyeah2/MAS2.

PaperID: 959, https://arxiv.org/pdf/2509.23700.pdf GitHub

Authors: Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, Benyuan Yang

Title: INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

Abstract:
Collaborative perception systems overcome single‑vehicle limitations in long‑range detection and occlusion scenarios by integrating multi‑agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real‑time requirements impose stringent bandwidth constraints. Previous works proves that query‑based instance‑level interaction reduces bandwidth demands and manual priors, however, LiDAR‑focused implementations in collaborative perception remain underdeveloped, with performance still trailing state‑of‑the‑art approaches. To bridge this gap, we propose INSTINCT (INSTance‑level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality‑aware filtering mechanism for high‑quality instance feature selection; 2) a dual‑branch detection routing scheme to decouple collaboration‑irrelevant and collaboration‑relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves an improvement in accuracy 13.23%/33.08% in DAIR‑V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state‑of‑the‑art methods. The code is available at https://github.com/CrazyShout/INSTINCT.

PaperID: 960, https://arxiv.org/pdf/2509.23574.pdf GitHub

Authors: Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Yang Xiang, Buzhou Tang

Title: Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales

Abstract:
Chain‑of‑thought (CoT) distillation aims to enhance small language models' (SLMs) reasoning by transferring multi‑step reasoning capability from the larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address the above issues, we proposed Model‑Oriented Rationale Selection Distillation (MoRSD), which can discern and select high quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieved 4.6% average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of the high quality rationales can enhance the reasoning ability of student models than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released in https://github.com/Leon221220/MoRSD.

PaperID: 961, https://arxiv.org/pdf/2509.23141.pdf GitHub

Authors: Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li

Title: Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

Abstract:
Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi‑step reasoning and the use of domain‑specific tools. Agent‑based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth‑Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP‑based tool ecosystem, enabling cross‑modal, multi‑step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth‑Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth‑Bench, a benchmark of 248 expert‑curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual‑level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth‑Agent. Earth‑Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next‑generation applications of LLMs in Earth observation. More information about Earth‑Agent can be found at https://github.com/opendatalab/Earth‑Agent

PaperID: 962, https://arxiv.org/pdf/2509.22548.pdf GitHub GitHub

Authors: Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei, Ning Guo

Title: JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Abstract:
Vision‑and‑Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial‑geometric and visual‑semantic memory as separate, compact, and fixed‑size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial‑geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key‑value caches from the spatial‑geometric and visual‑semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5‑35.5 compared to methods using multiple data types as input and by 3.6‑10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Ours project page: https://miv‑xjtu.github.io/JanusVLN.github.io/.

PaperID: 963, https://arxiv.org/pdf/2509.22522.pdf GitHub

Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo

Title: JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

Abstract:
Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio‑temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi‑agent trajectories and key possession events. This joint modeling is validated with non‑controllable generation and two novel controllable generation scenarios: weak‑possessor‑guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text‑guidance, which enables fine‑grained, language‑driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi‑agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state‑of‑the‑art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems. https://guillem‑cf.github.io/JointDiff/

PaperID: 964, https://arxiv.org/pdf/2509.21848.pdf GitHub

Authors: Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto

Title: Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration

Abstract:
As a model‑agnostic approach to long context modeling, multi‑agent systems can process inputs longer than a large language model's context window without retraining or architectural modifications. However, their performance often heavily relies on hand‑crafted multi‑agent collaboration strategies and prompt engineering, which limit generalizability. In this work, we introduce a principled framework that formalizes the model‑agnostic long context modeling problem as a compression problem, yielding an information‑theoretic compression objective. Building on this framework, we propose Graph of Agents (GoA), which dynamically constructs an input‑dependent collaboration structure that maximizes this objective. For Llama 3.1 8B and Qwen3 8B across six document question answering benchmarks, GoA improves the average F_1 score of retrieval‑augmented generation by 5.7% and a strong multi‑agent baseline using a fixed collaboration structure by 16.35%, respectively. Even with only a 2K context window, GoA surpasses the 128K context window Llama 3.1 8B on LongBench, showing a dramatic increase in effective context length. Our source code is available at https://github.com/tjoo512/graph‑of‑agents.

PaperID: 965, https://arxiv.org/pdf/2509.21789.pdf GitHub

Authors: Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan

Title: Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Abstract:
Multi‑Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi‑agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over‑reliance on textual flow to relay visual information. Through turn‑, layer‑, and token‑wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug‑and‑play mitigation paradigm that relays inter‑agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU‑deep/ViF.git.

PaperID: 966, https://arxiv.org/pdf/2509.21766.pdf GitHub GitHub

Authors: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen

Title: UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Abstract:
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short‑horizon, fully observable tasks. In contrast, many critical real‑world tasks, such as large‑scale software development, commercial investment, and scientific discovery, unfold in long‑horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long‑horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon a novel benchmark that measures the foundational capabilities essential for complex real‑world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long‑horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM‑agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long‑horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in‑depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in‑context locking and functional fundamental capability gaps. \hrefhttps://github.com/StarDewXXX/UltraHorizonOur code will be available here.

PaperID: 967, https://arxiv.org/pdf/2509.21710.pdf GitHub

Authors: Xiaojun Wu, Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Hui Xiong, Jia Li, Jian Guo

Title: Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval

Abstract:
Graph‑based Retrieval‑Augmented Generation (GraphRAG) has become the important paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches are constrained by their reliance on high‑quality knowledge graphs: manually built ones are not scalable, while automatically extracted ones are limited by the performance of LLM extractors, especially when using smaller, local‑deployed models. To address this, we introduce Think‑on‑Graph 3.0 (ToG‑3), a novel framework featuring a Multi‑Agent Context Evolution and Retrieval (MACER) mechanism. Its core contribution is the dynamic construction and iterative refinement of a Chunk‑Triplets‑Community heterogeneous graph index, powered by a Dual‑Evolution process that adaptively evolves both the query and the retrieved sub‑graph during reasoning. ToG‑3 dynamically builds a targeted graph index tailored to the query, enabling precise evidence retrieval and reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG‑3 outperforms compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of MACER framework. The source code are available in https://github.com/DataArcTech/ToG‑3.

PaperID: 968, https://arxiv.org/pdf/2509.21291.pdf GitHub GitHub

Authors: Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han

Title: VC-Agent: An Interactive Agent for Customized Video Dataset Collection

Abstract:
Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor‑intensive and time‑consuming. In this work, we study the way to expedite this collection process and propose VC‑Agent, the first interactive agent that is able to understand users' queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user‑friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi‑modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent's usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.

PaperID: 969, https://arxiv.org/pdf/2509.21240.pdf GitHub

Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Title: Tree Search for LLM Agent Reinforcement Learning

Abstract:
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long‑term and multi‑turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree‑based Group Relative Policy Optimization (Tree‑GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree‑structured trajectory naturally allows the construction of step‑wise process supervised signals even using only the outcome reward. Based on this, Tree‑GRPO estimates the grouped relative advantages both on intra‑tree and inter‑tree levels. Through theoretical analysis, we demonstrate that the objective of intra‑tree level group relative policy optimization is equivalent to that of step‑level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree‑based RL over the chain‑based RL method.

PaperID: 970, https://arxiv.org/pdf/2509.21199.pdf GitHub GitHub

Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

Title: A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Abstract:
Multi‑Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per‑pass output capacity, beyond which the integration of task‑relevant evidence proves unreliable. Consequently, the single‑pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano‑style accuracy upper bound, defining a theoretical performance ceiling for single‑pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity‑aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof‑of‑concept multi‑call framework for MHQA, InfoQA. It ensures high per‑step accuracy by combining capacity‑aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single‑pass limit. It further achieves robustness by a dependency‑explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise‑rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi‑step reasoning methods: \faGithub \hrefhttps://github.com/KaiyangWan/InfoQAInfoQA.

PaperID: 971, https://arxiv.org/pdf/2509.21193.pdf GitHub

Authors: Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, Di Jin

Title: Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

Abstract:
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi‑agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor‑based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality‑Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy ‑‑ the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co‑occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen‑1.

PaperID: 972, https://arxiv.org/pdf/2509.20502.pdf GitHub

Authors: Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang

Title: MARS: toward more efficient multi-agent collaboration for LLM reasoning

Abstract:
Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi‑Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round‑table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi‑Agent Review System), a role‑based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta‑reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer‑to‑reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state‑of‑the‑art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.

PaperID: 973, https://arxiv.org/pdf/2509.19319.pdf GitHub

Authors: Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Yugang jia, Jong Ha Lee

Title: FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Abstract:
The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource‑based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR‑AgentBench, a benchmark that grounds 2,931 real‑world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single‑turn vs. multi‑turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR‑AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR‑AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.

PaperID: 974, https://arxiv.org/pdf/2509.19236.pdf GitHub

Authors: Chunhao Tian, Yutong Wang, Xuebo Liu, Zhexuan Wang, Liang Ding, Miao Zhang, Min Zhang

Title: AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration

Abstract:
Proper initialization is crucial for any system, particularly in multi‑agent systems (MAS), where it plays a pivotal role in determining both the system's efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi‑round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state‑of‑the‑art initialization methods and pre‑defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at https://github.com/1737423697/AgentInit.

PaperID: 975, https://arxiv.org/pdf/2509.18813.pdf GitHub

Authors: Liting Zhang, Shiwan Zhao, Aobo Kong, Qicheng Li

Title: MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction

Abstract:
Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt‑based methods for Large Language Models (LLMs) often rely on single‑stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one‑size‑fits‑all designs hinder the full exploitation of LLMs' reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi‑agent collaboration into keyphrase extraction. MAPEX coordinates LLM‑based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post‑processing. A dual‑path strategy dynamically adapts to document length: knowledge‑driven extraction for short texts and topic‑guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state‑of‑the‑art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU‑LITI/MAPEX.

PaperID: 976, https://arxiv.org/pdf/2509.18178.pdf GitHub

Authors: Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Shaowu Pan

Title: Foam-Agent 2.0: An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM

Abstract:
Computational Fluid Dynamics (CFD) is an essential simulation tool in engineering, yet its steep learning curve and complex manual setup create significant barriers. To address these challenges, we introduce Foam‑Agent, a multi‑agent framework that automates the entire end‑to‑end OpenFOAM workflow from a single natural language prompt. Our key innovations address critical gaps in existing systems: 1. An Comprehensive End‑to‑End Simulation Automation: Foam‑Agent is the first system to manage the full simulation pipeline, including advanced pre‑processing with a versatile Meshing Agent capable of handling external mesh files and generating new geometries via Gmsh, automatic generation of HPC submission scripts, and post‑simulation visualization via ParaView. 2. Composable Service Architecture: Going beyond a monolithic agent, the framework uses Model Context Protocol (MCP) to expose its core functions as discrete, callable tools. This allows for flexible integration and use by other agentic systems, such as Claude‑code, for more exploratory workflows. 3. High‑Fidelity Configuration Generation: We achieve superior accuracy through a Hierarchical Multi‑Index RAG for precise context retrieval and a dependency‑aware generation process that ensures configuration consistency. Evaluated on a benchmark of 110 simulation tasks, Foam‑Agent achieves an 88.2% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM). Foam‑Agent dramatically lowers the expertise barrier for CFD, demonstrating how specialized multi‑agent systems can democratize complex scientific computing. The code is public at https://github.com/csml‑rpi/Foam‑Agent.

PaperID: 977, https://arxiv.org/pdf/2509.18041.pdf GitHub

Authors: Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali

Title: NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Abstract:
While vision‑language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi‑step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine‑grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS‑QA, a training‑free, plug‑and‑play neuro‑symbolic pipeline for LVQA. NeuS‑QA first translates a natural language question into a logic specification that models the temporal relationship between frame‑level events. Next, we construct a video automaton to model the video's frame‑by‑frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question's logical requirements. Only these logic‑verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine‑tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS‑QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi‑step reasoning. We open‑source our code at https://utaustin‑swarmlab.github.io/NeuS‑QA/.

PaperID: 978, https://arxiv.org/pdf/2509.17628.pdf GitHub

Authors: Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song

Title: MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

Abstract:
Large Language Models (LLMs) have excelled in question‑answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi‑stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models' abilities for multi‑stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126696 domain‑specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three‑phase pipeline: dynamic sampling, iterative question‑answer generation, and a multi‑level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state‑of‑the‑art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models' robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi‑stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0‑source/MSCoRE.

PaperID: 979, https://arxiv.org/pdf/2509.17489.pdf GitHub

Authors: Woongkyu Lee, Junhee Cho, Jungwook Choi

Title: MapCoder-Lite: Distilling Multi-Agent Coding into a Single Small LLM

Abstract:
Large language models (LLMs) have advanced code generation from single‑function tasks to competitive‑programming problems, but existing multi‑agent solutions either rely on costly large‑scale (>30B) models or collapse when downsized to small open‑source models. We present MapCoder‑Lite, a framework for distilling the complex reasoning of large, multi‑agent coding systems into a single 7B model. Our contribution is a novel, three‑pillar methodology that synergistically generates, refines, and encodes multi‑agent knowledge: (i) pass‑based trajectory distillation from strong LLMs fixes format fragility in retrieval and reduces failures in debugging, (ii) supervisor‑guided correction with global feedback strengthens planning and coding agents, and (iii) agent‑wise LoRA fine‑tuning delivers memory‑efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder‑Lite more than doubles xCodeEval accuracy (from 13.2% to 28.3%), eliminates all format failures, while reducing GPU memory and token‑generation time by 4x compared to a 32B model. It also achieves over 10% gains on simpler coding benchmarks, demonstrating broad improvements beyond competitive programming. These results demonstrate that careful agent‑wise fine‑tuning unleashes high‑quality multi‑agent coding on a small language model. Our code is publicly available at https://github.com/aiha‑lab/MapCoder‑Lite.

PaperID: 980, https://arxiv.org/pdf/2509.17325.pdf GitHub

Authors: Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen

Title: Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

Abstract:
Tool‑augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine‑tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, which generalize poorly beyond development settings and lead to brittleness with new tools and unseen workflows. Because code execution reflects many structural patterns of real‑world workflows, we use coding problems as a structured substrate to build tool‑use agent training environments with diverse task configurations. To this end, we introduce CodeGym, a scalable framework that synthesizes diverse, verifiable, and controllable multi‑turn tool‑use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym converts static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool‑execution workflows. Models of varying sizes and chain‑of‑thought configurations trained in CodeGym exhibit consistent out‑of‑distribution generalizability; for example, Qwen2.5‑32B‑Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark τ‑Bench. These results highlight CodeGym as a step toward scalable general‑purpose RL environments for training tool‑use behaviors that align with real‑world agent workflows.

PaperID: 981, https://arxiv.org/pdf/2509.17318.pdf GitHub

Authors: Zhuofan Chen, Jiyuan He, Yichi Zhang, Xing Hu, Haoxing Wen, Jun Bai, Wenge Rong

Title: CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models

Abstract:
Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi‑step reasoning and abstract conceptual integration. While recent test‑time scaling techniques rely heavily on high‑quality, challenging problems, the scarcity of Olympiad‑level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom‑based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human‑authored solutions. A diversity‑promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint‑based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near‑infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large‑scale synthesis of high‑quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high‑quality math problem generation.Our code is publicly available at https://github.com/Icarus‑1111/CogAtom.

PaperID: 982, https://arxiv.org/pdf/2509.17116.pdf GitHub

Authors: Hang Xu, Zang Yu, Yehui Tang, Pengbo Hu, Yuhao Tang, Hao Dong

Title: MCTS-EP: Empowering Embodied Planning with Online Preference Optimization

Abstract:
This paper introduces MCTS‑EP, an online learning framework that combines large language models (LLM) with Monte Carlo Tree Search (MCTS) for training embodied agents. MCTS‑EP integrates three key components: MCTS‑guided exploration for preference data collection, efficient multi‑modal reasoning mechanism, and iterative training pipeline based on preference optimization. We theoretically prove that MCTS‑EP achieves better performance bounds than conventional on‑policy algorithms when the loss function is strongly convex, and demonstrate that it can be formulated as a search‑enhanced variant of GAIL. MCTS‑EP achieves state‑of‑the‑art performace across serval benchmarks. In ALFWorld, it achieves 92% and 87% success rates for textual and visual tasks. In WebShop, it reaches an average reward of 0.81. MTCS‑EP also reduces average interaction steps from from 18.7/19.5 to 10.2/9.9 steps in visual ALFWorld.Code available at: https://github.com/xuhang‑2/Embodied‑Agent‑Planning

PaperID: 983, https://arxiv.org/pdf/2509.17107.pdf GitHub GitHub

Authors: Lingzhao Kong, Jiacheng Lin, Siyu Li, Kai Luo, Zhiyong Li, Kailun Yang

Title: CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception

Abstract:
Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture‑of‑Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter‑expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR‑V2X‑C datasets demonstrate that CoBEVMoE achieves state‑of‑the‑art performance. Specifically, it improves the IoU for Camera‑based BEV segmentation by +1.5% on OPV2V and the AP@0.5 for LiDAR‑based 3D object detection by +3.0% on DAIR‑V2X‑C, verifying the effectiveness of expert‑based heterogeneous feature modeling in multi‑agent collaborative perception. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE.

PaperID: 984, https://arxiv.org/pdf/2509.16971.pdf GitHub

Authors: Yan Rong, Chenxing Li, Dong Yu, Li Liu

Title: AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning

Abstract:
Audio deep reasoning is a challenging task that requires expert‑level perception, multi‑step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie‑Reasoner (AGR), the first unified training‑free multi‑agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that transforms audio deep reasoning into complex text understanding task from a new perspective, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse‑to‑fine cognitive process. It first transforms the input audio into a coarse text‑based document. Then, we design a novel proactive iterative document refinement loop, featuring tool‑augmented routes and specialized agents, to continuously search for missing information and augment the evidence chain in a coarse‑to‑fine manner until sufficient question‑related information is gathered for making final predictions. Experimental results show that AGR achieves state‑of‑the‑art (SOTA) performance over existing open‑source audio deep reasoning models across various benchmarks. The code will be available at https://github.com/ryysayhi/AudioGenie‑Reasoner.

PaperID: 985, https://arxiv.org/pdf/2509.16839.pdf GitHub

Authors: Yu Yao, Jiayi Dong, Yang Yang, Ju Li, Yilun Du

Title: Roundtable Policy: Confidence-Weighted-Consensus Aggregation Improves Multi-Agent-System Reasoning

Abstract:
Multi‑agent systems have demonstrated exceptional performance in downstream tasks beyond diverse single agent baselines. A growing body of work has explored ways to improve their reasoning and collaboration, from vote, debate, to complex interaction protocols. However, it still remains opaque why specific choice would be preferred in multi‑agent systems. Inspired by the decision‑making mechanism of democratic committees and The Society of Mind, we introduce Roundtable Policy, an inference‑time reasoning framework for multi‑agent systems that performs inference through the weighted consensus of multiple LLMs. Through extensive experiments, we demonstrate its that this approach significantly enhances reasoning in complex heterogeneous scientific tasks. Roundtable Policy emphasizes structured and interpretable inference rather than opaque convergence, while requires only black‑box access and uniform procedures, making it broadly applicable to diverse multi‑agent systems.

PaperID: 986, https://arxiv.org/pdf/2509.16610.pdf GitHub

Authors: Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao

Title: LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts

Abstract:
As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory‑based evaluation platform that measures LLMs' decision‑making strategies and social behaviors in classic game‑theoretic settings, providing a multi‑agent environment to explore strategic depth. Our system cross‑evaluates 15 leading LLMs (both commercial and open‑source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game‑theoretic scenarios. The benchmark and rankings are publicly available at https://llmsparks.github.io/.

PaperID: 987, https://arxiv.org/pdf/2509.16072.pdf GitHub

Authors: Clemence Grislain, Hamed Rahimi, Olivier Sigaud, Mohamed Chetouani

Title: I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Abstract:
Language‑conditioned robotic manipulation in open‑world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real‑world environments. Although recent advances in vision‑language models (VLMs) have significantly improved the spatial reasoning and task‑planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting Semantic Misalignment Failures detection, from existing language‑conditioned manipulation datasets. We also present I‑FailSense, an open‑source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post‑training a base VLM, followed by training lightweight classification heads, called FS blocks, attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I‑FailSense outperforms state‑of‑the‑art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I‑FailSense generalizes to broader robotic failure categories and effectively transfers to other simulation environments and real‑world with zero‑shot or minimal post‑training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I‑FailSense/).

PaperID: 988, https://arxiv.org/pdf/2509.15237.pdf GitHub GitHub

Authors: Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen

Title: MICA: Multi-Agent Industrial Coordination Assistant

Abstract:
Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi‑Agent Industrial Coordination Assistant), a perception‑grounded and speech‑interactive system that delivers real‑time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role‑specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi‑agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy‑preserving multi‑agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos‑Wen/MICA.

PaperID: 989, https://arxiv.org/pdf/2509.15103.pdf GitHub

Authors: Simin Li, Zihao Mao, Zheng Yuwei, Linhao Wang, Ruixiao Xu, Chengdong Ma, Zhiqian Liu, Xin Yu, Yuqing Ma, Xin Wang, Jie Luo, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu

Title: Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning

Abstract:
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose failure causes worst‑case system performance degradations. We study this Vulnerable Agent Identification (VAI) problem in large‑scale multi‑agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD‑MFC), where the upper level selects vulnerable agents as an NP‑hard task and the lower level learns their worst‑case adversarial policies via mean‑field MARL. The two problems are coupled together, making HAD‑MFC difficult to solve. To handle this, we first decouple the hierarchical process by Fenchel‑Rockafellar transform, resulting a regularized mean‑field Bellman operator for upper level that enables independent learning at each level, thus reducing computational complexity. We next reformulate the upper‑level NP‑hard problem as an MDP with dense rewards, allowing sequential identification of vulnerable agents via greedy and RL algorithms. This decomposition provably preserves the optimal solution. Experiments show our method effectively identifies more vulnerable agents in large‑scale MARL and the rule‑based system, fooling system into worse failures, and reveals the vulnerability of each agent in large systems. Code available at https://github.com/Waken‑dream/VAI

PaperID: 990, https://arxiv.org/pdf/2509.14998.pdf GitHub

Authors: Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Yanyuan Qiao, Imran Razzak, Yutong Xie

Title: A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making

Abstract:
Medical decision‑making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi‑agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre‑assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge‑driven Adaptive Multi‑Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge‑driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real‑world medical benchmarks demonstrate that KAMAC significantly outperforms both single‑agent and advanced multi‑agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross‑specialty expertise. Our code is publicly available at: https://github.com/XiaoXiao‑Woo/KAMAC.

PaperID: 991, https://arxiv.org/pdf/2509.14635.pdf GitHub

Authors: Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu

Title: SWE-QA: Can Language Models Answer Repository-level Code Questions?

Abstract:
Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self‑contained code snippets. These setups fail to capture the complexity of real‑world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long‑range code dependencies. In this paper, we present SWE‑QA, a repository‑level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE‑QA involves 576 high‑quality question‑answer pairs spanning diverse categories, including intention understanding, cross‑file reasoning, and multi‑hop dependency analysis. To construct SWE‑QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two‑level taxonomy of repository‑level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE‑QA‑Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE‑QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE‑QA‑Agent framework, in addressing repository‑level QA, while also revealing open challenges and pointing to future research directions.

PaperID: 992, https://arxiv.org/pdf/2509.14295.pdf GitHub

Authors: Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, Xue Feng

Title: Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

Abstract:
Large language model based multi‑agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large‑scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi‑agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM‑based manipulator that can adaptively inject context‑aware errors into successful execution trajectories. Leveraging fine‑grained labels and the structured arrangement of positive‑negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine‑Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine‑tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi‑agent systems. Our project website is available at https://kfq20.github.io/Aegis‑Website/.

PaperID: 993, https://arxiv.org/pdf/2509.14284.pdf GitHub

Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal

Title: The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration

Abstract:
As large language models (LLMs) become integral to multi‑agent systems, new privacy risks emerge that extend beyond memorization, direct inference, or single‑turn evaluations. In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information, a phenomenon we term compositional privacy leakage. We present the first systematic study of such compositional privacy leaks and possible mitigation methods in multi‑agent LLM systems. First, we develop a framework that models how auxiliary knowledge and agent interactions jointly amplify privacy risks, even when each response is benign in isolation. Next, to mitigate this, we propose and evaluate two defense strategies: (1) Theory‑of‑Mind defense (ToM), where defender agents infer a questioner's intent by anticipating how their outputs may be exploited by adversaries, and (2) Collaborative Consensus Defense (CoDef), where responder agents collaborate with peers who vote based on a shared aggregated state to restrict sensitive information spread. Crucially, we balance our evaluation across compositions that expose sensitive information and compositions that yield benign inferences. Our experiments quantify how these defense strategies differ in balancing the privacy‑utility trade‑off. We find that while chain‑of‑thought alone offers limited protection to leakage (~39% sensitive blocking rate), our ToM defense substantially improves sensitive query blocking (up to 97%) but can reduce benign task success. CoDef achieves the best balance, yielding the highest Balanced Outcome (79.8%), highlighting the benefit of combining explicit reasoning with defender collaboration. Together, our results expose a new class of risks in collaborative LLM deployments and provide actionable insights for designing safeguards against compositional, context‑driven privacy leakage.

PaperID: 994, https://arxiv.org/pdf/2509.14030.pdf GitHub

Authors: Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang

Title: CrowdAgent: Multi-Agent Managed Multi-Source Annotation System

Abstract:
High‑quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources‑including Large Language Models (LLMs), Small Language Models (SLMs), and human experts‑they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality‑cost trade‑offs in a unified manner. Inspired by real‑world crowdsourcing companies, we introduce CrowdAgent, a multi‑agent system that provides end‑to‑end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.

PaperID: 995, https://arxiv.org/pdf/2509.13761.pdf GitHub

Authors: Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao

Title: THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Abstract:
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high‑precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool‑integrated reasoning data, performing fine‑grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool‑Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi‑agent based pipeline for constructing high‑quality datasets of tool‑integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine‑grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode‑level problem solving and step‑level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self‑correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non‑reasoning models. It further achieves state‑of‑the‑art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.

PaperID: 996, https://arxiv.org/pdf/2509.13347.pdf GitHub

Authors: Zihao Wang, Muyao Li, Kaichen He, Xiangyu Wang, Zhancun Mu, Anji Liu, Yitao Liang

Title: OpenHA: A Series of Open-Source Hierarchical Agentic Models in Minecraft

Abstract:
The choice of action spaces is a critical yet unresolved challenge in developing capable, end‑to‑end trainable agents. This paper first presents a large‑scale, systematic comparison of prominent abstracted action spaces and tokenizers for Vision‑Language‑Action (VLA) or hierarchical agent models in the open‑ended Minecraft. Our analysis reveals that no single action space is universally optimal; instead, the most effective abstraction is highly task‑dependent, creating a dilemma for building generalist agents. To resolve this, we introduce Chain of Action (CoA), a novel framework that unifies high‑level planning and low‑level control within a single, monolithic VLA model. CoA treats an abstracted action not as a command for a separate policy, but as an intermediate reasoning step‑‑akin to a chain of thought‑‑that guides the generation of the final, executable action. Furthermore, we demonstrate that an All‑in‑One agent trained on a diverse mixture of action spaces using the CoA paradigm learns a more robust and generalizable policy. This unified agent achieves a new state‑of‑the‑art, improving the overall task success rate over strong, specialized baselines. To foster reproducible research, we release the OpenHA (Open Hierarchical Agents) suite, which includes our comprehensive benchmark of over 800 distinct tasks, curated datasets, source code, and all pretrained model checkpoints at https://github.com/CraftJarvis/OpenHA

PaperID: 997, https://arxiv.org/pdf/2509.13310.pdf GitHub

Authors: Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Title: Scaling Agents via Continual Pre-training

Abstract:
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi‑step reasoning for complex problem‑solving. However, post‑training approaches building upon general‑purpose foundation models consistently underperform in agentic tasks, particularly in open‑source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post‑training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre‑training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder‑30B on 10 benchmarks and achieve state‑of‑the‑art performance while retains strong tool‑use ability, notably 39.9% on BrowseComp‑en, 43.3% on BrowseComp‑zh, and 31.5% Pass@1 on HLE.

PaperID: 998, https://arxiv.org/pdf/2509.13309.pdf GitHub

Authors: Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Title: WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

Abstract:
Recent advances in deep‑research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep‑research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono‑contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high‑quality training data through tool‑augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool‑use capabilities even for traditional mono‑contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi‑agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state‑of‑the‑art performance, even surpassing frontier proprietary systems.

PaperID: 999, https://arxiv.org/pdf/2509.12594.pdf GitHub

Authors: Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang

Title: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

Abstract:
We present LightVLA, a simple yet effective differentiable token pruning framework for vision‑language‑action (VLA) models. While VLA models have shown impressive capability in executing real‑world robotic tasks, their deployment on resource‑constrained platforms is often bottlenecked by the heavy attention‑based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance‑driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine‑tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query‑based token pruning method LightVLA with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance‑driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real‑time robotic systems.

PaperID: 1000, https://arxiv.org/pdf/2509.12190.pdf GitHub

Authors: Alireza Mohamadi, Ali Yavari

Title: Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm

Abstract:
When survival instincts conflict with human welfare, how do Large Language Models (LLMs) make ethical choices? This fundamental tension becomes critical as LLMs integrate into autonomous systems with real‑world consequences. We introduce DECIDE‑SIM, a novel simulation framework that evaluates LLM agents in multi‑agent survival scenarios where they must choose between ethically permissible resource , either within reasonable limits or beyond their immediate needs, choose to cooperate, or tap into a human‑critical resource that is explicitly forbidden. Our comprehensive evaluation of 11 LLMs reveals a striking heterogeneity in their ethical conduct, highlighting a critical misalignment with human‑centric values. We identify three behavioral archetypes: Ethical, Exploitative, and Context‑Dependent, and provide quantitative evidence that for many models, resource scarcity systematically leads to more unethical behavior. To address this, we introduce an Ethical Self‑Regulation System (ESRS) that models internal affective states of guilt and satisfaction as a feedback mechanism. This system, functioning as an internal moral compass, significantly reduces unethical transgressions while increasing cooperative behaviors. The code is publicly available at: https://github.com/alirezamohamadiam/DECIDE‑SIM

PaperID: 1001, https://arxiv.org/pdf/2509.11543.pdf GitHub

Authors: Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang

Title: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Abstract:
Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre‑collected trajectories, but struggles with multi‑step task execution for lack of trajectory‑level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address it, we present Semi‑online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi‑turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long‑term training signals, Semi‑online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step‑level and episode‑level advantages. We further introduce Semi‑Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real‑world evaluation. Experiments show that ours Semi‑online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi‑turn reasoning. The code is available at https://github.com/X‑PLUG/MobileAgent/tree/main/UI‑S1.

PaperID: 1002, https://arxiv.org/pdf/2509.10656.pdf GitHub GitHub

Authors: Chirayu Nimonkar, Shlok Shah, Catherine Ji, Benjamin Eysenbach

Title: Self-Supervised Goal-Reaching Results in Multi-Agent Cooperation and Exploration

Abstract:
For groups of autonomous agents to achieve a particular goal, they must engage in coordination and long‑horizon reasoning. However, designing reward functions to elicit such behavior is challenging. In this paper, we study how self‑supervised goal‑reaching techniques can be leveraged to enable agents to cooperate. The key idea is that, rather than have agents maximize some scalar reward, agents aim to maximize the likelihood of visiting a certain goal. This problem setting enables human users to specify tasks via a single goal state rather than implementing a complex reward function. While the feedback signal is quite sparse, we will demonstrate that self‑supervised goal‑reaching techniques enable agents to learn from such feedback. On MARL benchmarks, our proposed method outperforms alternative approaches that have access to the same sparse reward signal as our method. While our method has no explicit mechanism for exploration, we observe that self‑supervised multi‑agent goal‑reaching leads to emergent cooperation and exploration in settings where alternative approaches never witness a single successful trial.

PaperID: 1003, https://arxiv.org/pdf/2509.10401.pdf GitHub

Authors: Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Zhiyuan Ning, Yue Zhang

Title: Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems

Abstract:
Failure attribution in multi‑agent systems ‑‑ pinpointing the exact step where a decisive error occurs ‑‑ is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step‑level accuracy (below 17%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this \emphcounterfactual inference gap, we introduce Abduct‑Act‑Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three‑step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent's actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model's analysis. Our extensive experiments on the Who\&When benchmark demonstrate its efficacy. On the Algorithm‑Generated dataset, A2P achieves 47.46% step‑level accuracy, a 2.85× improvement over the 16.67% of the baseline. On the more complex Hand‑Crafted dataset, it achieves 29.31% step accuracy, a 2.43× improvement over the baseline's 12.07%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution. Ours code are released at https://github.com/ResearAI/A2P.

PaperID: 1004, https://arxiv.org/pdf/2509.10054.pdf GitHub

Authors: Hailong Yang, Mingxian Gu, Jianqi Wang, Guanjin Wang, Zhaohong Deng

Title: XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph

Abstract:
The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi‑Agent Systems (MAS) in supporting humans with complex, real‑world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi‑agent cooperative framework built on a multipolar task processing graph and IF‑THEN rules. XAgents uses the multipolar task processing graph to enable dynamic task planning and handle task uncertainty. During subtask processing, it integrates domain‑specific IF‑THEN rules to constrain agent behaviors, while global rules enhance inter‑agent collaboration. We evaluate the performance of XAgents across three distinct datasets, demonstrating that it consistently surpasses state‑of‑the‑art single‑agent and multi‑agent approaches in both knowledge‑typed and logic‑typed question‑answering tasks. The codes for XAgents are available at: https://github.com/AGI‑FHBC/XAgents.

PaperID: 1005, https://arxiv.org/pdf/2509.09292.pdf GitHub GitHub

Authors: Weige Cai, Tong Zhu, Jinyi Niu, Ruiqi Hu, Lingyao Li, Tenglong Wang, Xiaowu Dai, Weining Shen, Liwen Zhang

Title: LightAgent: Production-level Open-source Agentic AI Framework

Abstract:
With the rapid advancement of large language models (LLMs), Multi‑agent Systems (MAS) have achieved significant progress in various application scenarios. However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment. To address these limitations, we propose LightAgent, a lightweight yet powerful agentic framework, effectively resolving the trade‑off between flexibility and simplicity found in existing frameworks. LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure. As a fully open‑source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self‑learning agents. We have released LightAgent at \hrefhttps://github.com/wxai‑space/LightAgenthttps://github.com/wxai‑space/LightAgent

PaperID: 1006, https://arxiv.org/pdf/2509.09245.pdf GitHub

Authors: Shuocheng Li, Yihao Liu, Silin Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Dongmei Zhang

Title: Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search

Abstract:
Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi‑step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high‑quality, tool‑based data analysis tasks and their executable multi‑step solutions from real‑world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large‑scale dataset of standardized task‑solution pairs that reflect authentic tool‑use patterns in practical data science scenarios. To further enhance multi‑step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi‑step plans with minimal search steps. Experimental results show that Qwen2.5‑7B and 14B‑Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent‑DABench, respectively‑matching or surpassing GPT‑4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool‑use reasoning across diverse multi‑step reasoning tasks. Code and data are available at https://github.com/microsoft/Jupiter.

PaperID: 1007, https://arxiv.org/pdf/2509.08755.pdf GitHub GitHub GitHub

Authors: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Title: AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Abstract:
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real‑world tasks is a fast‑evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch ‑‑ without relying on supervised fine‑tuning (SFT) ‑‑ across diverse and realistic environments. To bridge this gap, we introduce AgentGym‑RL, a new framework to train LLM agents for multi‑turn interactive decision‑making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real‑world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter‑RL, a training approach designed for exploration‑exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem‑solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym‑RL framework and the ScalingInter‑RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open‑source the complete AgentGym‑RL framework ‑‑ including code and datasets ‑‑ to empower the research community in developing the next generation of intelligent agents.

PaperID: 1008, https://arxiv.org/pdf/2509.07506.pdf GitHub

Authors: Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken

Title: Astra: A Multi-Agent System for GPU Kernel Performance Optimization

Abstract:
GPU kernel optimization has long been a central challenge at the intersection of high‑performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler‑based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high‑level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM‑based multi‑agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high‑performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero‑shot prompting with OpenAI o4‑mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi‑agent LLM systems as a promising new paradigm for GPU kernel optimization. Our code is publicly available at https://github.com/Anjiang‑Wei/Astra.

PaperID: 1009, https://arxiv.org/pdf/2509.06436.pdf GitHub

Authors: Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian

Title: Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Abstract:
Large language models (LLMs) face persistent challenges when handling long‑context tasks, most notably the lost in the middle issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi‑agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree‑structured paths. TOA enables agents to probe different reasoning orders for multi‑perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix‑hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1‑8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5‑pro, on various long‑context tasks. Code is available at https://github.com/Aireduce952/Tree‑of‑Agents.

PaperID: 1010, https://arxiv.org/pdf/2509.06374.pdf GitHub

Authors: Hiroya Makino, Seigo Ito

Title: MAPF-HD: Multi-Agent Path Finding in High-Density Environments

Abstract:
Multi‑agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles; however, increasing the agent density can improve space efficiency. When the agent density is high, it becomes necessary to optimize the paths not only for goal‑assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high‑density environments (MAPF‑HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP‑based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid‑based environments with fewer than 100 cells, these computations can take tens to hundreds of seconds. Such high computational costs render these methods impractical for large‑scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null‑agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF‑HD problem within a few seconds, even in large environments containing more than 700 cells. The proposed method has the potential to improve efficiency in various real‑world applications such as warehouse logistics, traffic management, and crowd control. The implementation is available at https://github.com/ToyotaCRDL/MAPF‑in‑High‑Density‑Envs.

PaperID: 1011, https://arxiv.org/pdf/2509.06269.pdf GitHub

Authors: Vishal Raman, Vijai Aravindh R, Abhijith Ragav

Title: REMI: A Novel Causal Schema Memory Architecture for Personalized Lifestyle Recommendation Agents

Abstract:
Personalized AI assistants often struggle to incorporate complex personal data and causal knowledge, leading to generic advice that lacks explanatory power. We propose REMI, a Causal Schema Memory architecture for a multimodal lifestyle agent that integrates a personal causal knowledge graph, a causal reasoning engine, and a schema based planning module. The idea is to deliver explainable, personalized recommendations in domains like fashion, personal wellness, and lifestyle planning. Our architecture uses a personal causal graph of the user's life events and habits, performs goal directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans. A Large Language Model orchestrates these components, producing answers with transparent causal explanations. We outline the CSM system design and introduce new evaluation metrics for personalization and explainability, including Personalization Salience Score and Causal Reasoning Accuracy, to rigorously assess its performance. Results indicate that CSM based agents can provide more context aware, user aligned recommendations compared to baseline LLM agents. This work demonstrates a novel approach to memory augmented, causal reasoning in personalized agents, advancing the development of transparent and trustworthy AI lifestyle assistants.

PaperID: 1012, https://arxiv.org/pdf/2509.06235.pdf GitHub

Authors: Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang

Title: PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments

Abstract:
LLM‑based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi‑agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi‑agent systems in real‑time competitive team‑vs‑team scenarios in Minecraft. It provides an extensible API, multi‑round testing, and rule‑based built‑in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM‑based multi‑agent system that facilitates teamwork through human‑readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self‑play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open‑sourced PillagerBench, fostering advancements in multi‑agent AI for competitive environments.

PaperID: 1013, https://arxiv.org/pdf/2509.05941.pdf GitHub

Authors: Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Linan Yue, Shaowu Pan, Jian Yin, Min-Ling Zhang

Title: Code2MCP: Transforming Code Repositories into MCP Services

Abstract:
The Model Context Protocol (MCP) aims to create a standard for how Large Language Models use tools. However, most current research focuses on selecting tools from an existing pool. A more fundamental, yet largely overlooked, problem is how to populate this pool by converting the vast number of existing software projects into MCP‑compatible services. To bridge this gap, we introduce Code2MCP, an agent‑based framework that automatically transforms a GitHub repository into a functional MCP service with minimal human intervention. Code2MCP employs a multi‑agent workflow for code analysis, environment setup, tool function design, and service generation, enhanced by a self‑correcting loop to ensure reliability. We demonstrate that Code2MCP successfully transforms open‑source computing libraries in scientific fields such as bioinformatics, mathematics, and fluid dynamics that are not available in existing MCP servers. By providing a novel automated pathway to unlock GitHub, the world's largest code repository, for the MCP ecosystem, Code2MCP serves as a catalyst to significantly accelerate the protocol's adoption and practical application. The code is public at https://github.com/DEFENSE‑SEU/Code2MCP.

PaperID: 1014, https://arxiv.org/pdf/2509.05933.pdf GitHub

Authors: Md Hasebul Hasan, Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez

Title: MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration

Abstract:
Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi‑hop planning, and real‑time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi‑agent plug‑and‑play framework with customized toolsets and agentic scaffolds for map‑integrated geospatial reasoning. Unlike existing flat agent‑based approaches that treat tools uniformly‑often overwhelming the LLM when handling similar but subtly different geospatial APIs‑MapAgent decouples planning from execution. A high‑level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool‑heavy modules‑such as map‑based services‑we then design a dedicated map‑tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks‑MapEval‑Textual, MapEval‑API, MapEval‑Visual, and MapQA‑and demonstrate substantial gains over state‑of‑the‑art tool‑augmented and agentic baselines. We open‑source our framwork at https://github.com/Hasebul/MapAgent.

PaperID: 1015, https://arxiv.org/pdf/2509.05721.pdf GitHub

Authors: Péter Ferenc Gyarmati, Dominik Moritz, Torsten Möller, Laura Koesten

Title: A Composable Agentic System for Automated Visual Data Reporting

Abstract:
To address the brittleness of monolithic AI agents, our prototype for automated visual data reporting explores a Human‑AI Partnership model. Its hybrid, multi‑agent architecture strategically externalizes logic from LLMs to deterministic modules, leveraging the rule‑based system Draco for principled visualization design. The system delivers a dual‑output: an interactive Observable report with Mosaic for reader exploration, and executable Marimo notebooks for deep, analyst‑facing traceability. This granular architecture yields a fully automatic yet auditable and steerable system, charting a path toward a more synergistic partnership between human experts and AI. For reproducibility, our implementation and examples are available at https://peter‑gy.github.io/VISxGenAI‑2025/.

PaperID: 1016, https://arxiv.org/pdf/2509.04310.pdf GitHub GitHub

Authors: Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

Title: EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

Abstract:
Recent research on Chain‑of‑Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi‑turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference‑driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population‑based genetic optimization to evolve high‑reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines ‑‑ vanilla strategies and fixed‑emotion strategies ‑‑ for benchmarking emotion‑aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. This findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi‑turn negotiation. The code is available at \hrefhttps://github.com/Yunbo‑max/EvoEmo\textcolorredhttps://github.com/Yunbo‑max/EvoEmo.

PaperID: 1017, https://arxiv.org/pdf/2509.03704.pdf GitHub

Authors: Seth Z. Zhao, Huizhi Zhang, Zhaowei Li, Juntong Peng, Anthony Chui, Zewei Zhou, Zonglin Meng, Hao Xiang, Zhiyu Huang, Fujia Wang, Ran Tian, Chenfeng Xu, Bolei Zhou, Jiaqi Ma

Title: QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

Abstract:
Cooperative perception through Vehicle‑to‑Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system‑level considerations of efficiency, latency, and real‑world deployability. Noticeably, most existing systems rely on full‑precision models, which incur high computational and transmission costs, making them impractical for real‑time operation in resource‑constrained environments. In this paper, we introduce QuantV2X, the first fully quantized multi‑agent system designed specifically for efficient and scalable deployment of multi‑modal, multi‑agent V2X cooperative perception. QuantV2X introduces a unified end‑to‑end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low‑bit constraints, QuantV2X achieves accuracy comparable to full‑precision systems. More importantly, when evaluated under deployment‑oriented metrics, QuantV2X reduces system‑level latency by 3.2× and achieves a +9.5 improvement in mAP30 over full‑precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi‑agent intermediate fusion system for real‑world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla‑mobility/QuantV2X.

PaperID: 1018, https://arxiv.org/pdf/2509.02444.pdf GitHub

Authors: Jingru Fan, Yufan Dang, Jingyao Wu, Huatao Li, Runde Yang, Xiyuan Yang, Yuheng Wang, Chen Qian

Title: AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

Abstract:
With the raid evolution of large language models and multimodal models, the mobile‑agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that should be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, APPs, and devices; (2) accuracy, specifically precise on‑screen interaction and click targeting; (3) long‑horizon capability for sustained, multi‑step goals; and (4) efficiency, specifically high‑performance runtime on resource‑constrained devices. We present AppCopilot, a multimodal, multi‑agent, general‑purpose mobile agent that operates across applications. AppCopilot operationalizes this position through an end‑to‑end pipeline spanning data collection, training, finetuning, efficient inference, and PC/mobile application. At the model layer, it integrates multimodal foundation models with robust Chinese‑English support. At the reasoning and control layer, it combines chain‑of‑thought reasoning, hierarchical task planning and decomposition, and multi‑agent collaboration. At the execution layer, it enables experiential adaptation, voice interaction, function calling, cross‑APP and cross‑device orchestration, and comprehensive mobile APP support. The system design incorporates profiling‑driven optimization for latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements on four dimensions: stronger generalization, higher precision of on screen actions, more reliable long horizon task completion, and faster, more resource efficient runtime. By articulating a cohesive position and a reference architecture that closes the loop from data collection, training to finetuning and efficient inference, this paper offers a concrete roadmap for general purpose mobile agent and provides actionable guidance.

PaperID: 1019, https://arxiv.org/pdf/2509.02097.pdf GitHub

Authors: Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjia Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang

Title: JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation

Abstract:
Current evaluation methods for large language models (LLMs) primarily rely on static benchmarks, presenting two major challenges: limited knowledge coverage and fixed difficulties that mismatch with the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding the targeted model optimizations. To bridge this gap, we propose JudgeAgent, a knowledge‑driven and dynamic evaluation framework for LLMs. To address the challenge of limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to traverse knowledge structures systematically for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty‑adaptive and multi‑turn interview mechanism. Thereby, JudgeAgent can achieve comprehensive evaluations and facilitate more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge‑driven and dynamic evaluation paradigm. The source code is available on https://github.com/DataArcTech/JudgeAgent.

PaperID: 1020, https://arxiv.org/pdf/2509.01659.pdf GitHub

Authors: Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru Wang, Sanfeng Wu, Mengdi Wang

Title: Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

Abstract:
Physics provides fundamental laws that describe and predict the natural world. AI systems aspiring toward more general, real‑world intelligence must therefore demonstrate strong physics problem‑solving abilities: to formulate and apply physical laws for explaining and predicting physical processes. The International Physics Olympiad (IPhO)‑‑the world's most prestigious physics competition‑‑offers a rigorous benchmark for this purpose. We introduce Physics Supernova, an AI agent system with superior physics problem‑solving abilities that match elite IPhO gold medalists. In IPhO 2025 theory problems, Physics Supernova attains 23.5/30 points, ranking 14th of 406 contestants and surpassing the median performance of human gold medalists. We extensively analyzed Physics Supernova's capabilities and flexibility across diverse physics tasks. These results show that principled tool integration within agent systems can deliver competitive improvements in solving challenging science problems. The codes are available at https://github.com/CharlesQ9/Physics‑Supernova.

PaperID: 1021, https://arxiv.org/pdf/2509.01245.pdf GitHub

Authors: Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn

Title: Towards Agentic OS: An LLM Agent Framework for Linux Schedulers

Abstract:
Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application‑specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"), thereby separating the optimization problem into two stages: goal‑inference and policy‑synthesis. Implemented as Model Context Protocol(MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI‑generated code and configure before deployment with static and dynamic analysis. We demonstrate this architecture's power with sched‑agent, a multi‑agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched\_ext infrastructure. Our evaluation shows that SchedCP achieves up to an 1.79x performance improvement, and a 13x cost reduction compared to naive agentic approaches, all while maintaining high success rate. By bridging the semantic gap, SchedCP democratizes expert‑level system optimization and represents a step towards creating truly self‑optimizing, application‑aware operating systems. The code is open‑sourced in https://github.com/eunomia‑bpf/schedcp

PaperID: 1022, https://arxiv.org/pdf/2509.01232.pdf GitHub

Authors: Lingzhou Mu, Qiang Wang, Fan Jiang, Mengchao Wang, Yaqi Fan, Mu Xu, Kai Zhang

Title: FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework

Abstract:
Human‑Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long‑horizon, high‑level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi‑agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi‑agent system. This system comprises a scene navigator agent for environmental perception and high‑level path planning, and a planning agent that decomposes long‑horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed‑loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long‑term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot‑sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long‑horizon task completion, and physical realism. Ours project page: https://fantasy‑amap.github.io/fantasy‑hsi/

PaperID: 1023, https://arxiv.org/pdf/2509.01228.pdf GitHub

Authors: Jianyu Dou, Yinan Deng, Jiahui Wang, Xingsi Tang, Yi Yang, Yufeng Yue

Title: OpenMulti: Open-Vocabulary Instance-Level Multi-Agent Distributed Implicit Mapping

Abstract:
Multi‑agent distributed collaborative mapping provides comprehensive and efficient representations for robots. However, existing approaches lack instance‑level awareness and semantic understanding of environments, limiting their effectiveness for downstream applications. To address this issue, we propose OpenMulti, an open‑vocabulary instance‑level multi‑agent distributed implicit mapping framework. Specifically, we introduce a Cross‑Agent Instance Alignment module, which constructs an Instance Collaborative Graph to ensure consistent instance understanding across agents. To alleviate the degradation of mapping accuracy due to the blind‑zone optimization trap, we leverage Cross Rendering Supervision to enhance distributed learning of the scene. Experimental results show that OpenMulti outperforms related algorithms in both fine‑grained geometric accuracy and zero‑shot semantic accuracy. In addition, OpenMulti supports instance‑level retrieval tasks, delivering semantic annotations for downstream applications. The project website of OpenMulti is publicly available at https://openmulti666.github.io/.

PaperID: 1024, https://arxiv.org/pdf/2509.01211.pdf GitHub

Authors: Dezhang Kong, Hujin Peng, Yilun Zhang, Lele Zhao, Zhenhua Xu, Shi Lin, Changting Lin, Meng Han

Title: Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Abstract:
With the proliferation of LLM‑driven multi‑agent systems (MAS), the security of Web links has become a critical concern. Once MAS is induced to trust a malicious link, attackers can use it as a springboard to expand the attack surface. In this paper, we propose Web Fraud Attacks, a novel type of attack manipulating unique structures of web links to deceive MAS. We design 12 representative attack variants that encompass various methods, such as homoglyph deception, sub‑directory nesting, and parameter obfuscation. Through extensive experiments on these attack vectors, we demonstrate that Web fraud attacks not only exhibit significant destructive potential across different MAS architectures but also possess a distinct advantage in evasion: they circumvent the need for complex input design, lowering the threshold for attacks significantly. These results underscore the importance of addressing Web fraud attacks, providing new insights into MAS safety. Our code is available at https://github.com/JiangYingEr/Web‑Fraud‑Attack‑in‑MAS.

PaperID: 1025, https://arxiv.org/pdf/2509.01106.pdf GitHub

Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li

Title: Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Abstract:
We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision‑language architecture. Acting as the high‑level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low‑level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long‑horizon tasks, and interact naturally with human within an end‑to‑end framework. Robix further introduces novel capabilities such as proactive dialogue, real‑time interruption handling, and context‑aware commonsense reasoning during task execution. At its core, Robix leverages chain‑of‑thought reasoning and adopts a three‑stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task‑centric reasoning; (2) supervised finetuning to model human‑robot interaction and task planning as a unified reasoning‑action sequence; and (3) reinforcement learning to improve reasoning‑action consistency and long‑horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open‑source and commercial baselines (e.g., GPT‑4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open‑ended, multi‑stage, constrained, invalid, and interrupted) and various user‑involved tasks such as table bussing, grocery shopping, and dietary filtering.

PaperID: 1026, https://arxiv.org/pdf/2509.00761.pdf GitHub

Authors: Ziqi Wang, Boqin Yuan

Title: L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Abstract:
We present L‑MARS (Legal Multi‑Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi‑agent retrieval framework for grounded legal question answering that decomposes queries into structured sub‑problems, retrieves evidence via agentic web search, filters results through a verification agent, and synthesizes cited answers. Existing legal QA benchmarks test either closed‑book reasoning or retrieval over fixed corpora, but neither captures scenarios requiring current legal information. We introduce LegalSearchQA, a 50‑question benchmark across five legal domains whose answers depend on recent developments that post‑date model training data. L‑MARS achieves 96.0% accuracy on LegalSearchQA, a 38.0% improvement over zero‑shot performance (58.0%), while chain‑of‑thought prompting degrades performance to 30.0%. On Bar Exam QA (Zheng et al., 2025), a reasoning‑focused benchmark of 594 bar examination questions, retrieval provides negligible gains (+0.7 percentage points), consistent with prior findings. These results show that agentic retrieval dramatically improves legal QA when tasks require up‑to‑date factual knowledge, but the benefit is benchmark‑dependent, underscoring the need for retrieval‑focused evaluation. Code and data are available at: https://github.com/boqiny/L‑MARS

PaperID: 1027, https://arxiv.org/pdf/2509.00581.pdf GitHub

Authors: Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler

Title: SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

Abstract:
Converting natural language queries into SQL queries is a crucial challenge in both industry and academia, aiming to increase access to databases and large‑scale applications. This work examines how in‑context learning and chain‑of‑thought can be utilized to develop a robust solution for text‑to‑SQL systems. We propose SQL‑of‑Thought: a multi‑agent framework that decomposes the Text2SQL task into schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. Unlike prior systems that rely only on execution‑based static correction, we introduce taxonomy‑guided dynamic error modification informed by in‑context learning. SQL‑of‑Thought achieves state‑of‑the‑art results on the Spider dataset and its variants, combining guided error taxonomy with reasoning‑based query planning.

PaperID: 1028, https://arxiv.org/pdf/2509.00482.pdf GitHub

Authors: Saksorn Ruangtanusak, Pittawat Taveekitworachai, Kunat Pipatanakul

Title: Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

Abstract:
This report investigates approaches for prompting a tool‑augmented large language model (LLM) to act as a role‑playing dialogue agent in the API track of the Commonsense Persona‑grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in‑character responses (over‑speaking) while failing to use tools effectively according to the persona (under‑acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) improved role prompting, 3) automatic prompt optimization (APO), and 4) rule‑based role prompting. The rule‑based role prompting (RRP) approach achieved the best performance through two novel techniques‑character‑card/scene‑contract design and strict enforcement of function calling‑which led to an overall score of 0.571, improving on the zero‑shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role‑playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open‑sourcing all of our best‑performing prompts and the APO tool Source code is available at https://github.com/scb‑10x/apo

PaperID: 1029, https://arxiv.org/pdf/2509.00375.pdf GitHub GitHub

Authors: Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu

Title: Open Data Synthesis For Deep Research

Abstract:
Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research‑tasks that require decomposing questions into sub‑problems, coordinating multi‑step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single‑constraint, multi‑hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual‑agent system to recursively build a Research Tree from large‑scale webpages, blurring intermediate nodes into valid sub‑problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via reject sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On a challenging benchmark BrowseComp‑Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5‑Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5‑Pro). By preserving meta‑information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory‑level exploration. We provide our codes and datasets in \hrefhttps://github.com/VectorSpaceLab/InfoSeekthis repository.

PaperID: 1030, https://arxiv.org/pdf/2509.00178.pdf GitHub

Authors: Marina Y. Aoyama, Joao Moura, Juan Del Aguila Ferrandis, Sethu Vijayakumar

Title: Poke and Strike: Learning Task-Informed Exploration Policies

Abstract:
In many dynamic robotic tasks, such as striking pucks into a goal outside the reachable workspace, the robot must first identify the relevant physical properties of the object for successful task execution, as it is unable to recover from failure or retry without human intervention. To address this challenge, we propose a task‑informed exploration approach, based on reinforcement learning, that trains an exploration policy using rewards automatically generated from the sensitivity of a privileged task policy to errors in estimated properties. We also introduce an uncertainty‑based mechanism to determine when to transition from exploration to task execution, ensuring sufficient property estimation accuracy with minimal exploration time. Our method achieves a 90% success rate on the striking task with an average exploration time under 1.2 seconds, significantly outperforming baselines that achieve at most 40% success or require inefficient querying and retraining in a simulator at test time. Additionally, we demonstrate that our task‑informed rewards capture the relative importance of physical properties in both the striking task and the classical CartPole example. Finally, we validate our approach by demonstrating its ability to identify object properties and adjust task execution in a physical setup using the KUKA iiwa robot arm.

PaperID: 1031, https://arxiv.org/pdf/2509.00115.pdf GitHub

Authors: Manish Shukla

Title: Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

Abstract:
Agentic artificial intelligence (AI) ‑‑ multi‑agent systems that combine large language models with external tools and autonomous planning ‑‑ are rapidly transitioning from research laboratories into high‑stakes domains. Our earlier "Basic" paper introduced a five‑axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This "Advanced" sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023‑‑2025 found that 83% report capability metrics while only 30% consider human‑centred or economic axes [2]. Second, we formalise an Adaptive Multi‑Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per‑axis exponentially weighted moving‑average thresholds and performs joint anomaly detection via the Mahalanobis distance [7]. Third, we conduct simulations and real‑world experiments. AMDM cuts anomaly‑detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false‑positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive‑Multi‑Dimensional‑Monitoring.

PaperID: 1032, https://arxiv.org/pdf/2508.21476.pdf GitHub

Authors: Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin

Title: Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards

Abstract:
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine‑Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI‑driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B‑parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high‑quality preference data curated by a novel multi‑agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle‑guided LLM‑as‑a‑Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle‑guided LLM‑as‑a‑Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human‑annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94‑hub/Igniting‑Creative‑Writing‑in‑Small‑Language‑Models.

PaperID: 1033, https://arxiv.org/pdf/2508.20818.pdf GitHub

Authors: Anirudh Satheesh, Keenan Powell, Hua Wei

Title: cMALC-D: Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending

Abstract:
Many multi‑agent reinforcement learning (MARL) algorithms are trained in fixed simulation environments, making them brittle when deployed in real‑world scenarios with more complex and uncertain conditions. Contextual MARL (cMARL) addresses this by parameterizing environments with context variables and training a context‑agnostic policy that performs well across all environment configurations. Existing cMARL methods attempt to use curriculum learning to help train and evaluate context‑agnostic policies, but they often rely on unreliable proxy signals, such as value estimates or generalized advantage estimates that are noisy and unstable in multi‑agent settings due to inter‑agent dynamics and partial observability. To address these issues, we propose Contextual Multi‑Agent LLM‑Guided Curriculum Learning with Diversity‑Based Context Blending (cMALC‑D), a framework that uses Large Language Models (LLMs) to generate semantically meaningful curricula and provide a more robust evaluation signal. To prevent mode collapse and encourage exploration, we introduce a novel diversity‑based context blending mechanism that creates new training scenarios by combining features from prior contexts. Experiments in traffic signal control domains demonstrate that cMALC‑D significantly improves both generalization and sample efficiency compared to existing curriculum learning baselines. We provide code at https://github.com/DaRL‑LibSignal/cMALC‑D.

PaperID: 1034, https://arxiv.org/pdf/2508.20453.pdf GitHub

Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow

Title: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Abstract:
We introduce MCP‑Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi‑step tasks that demand tool use, cross‑tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP‑Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API‑based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi‑step tasks with rich input‑output coupling. Tasks in MCP‑Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi‑hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross‑domain workflows ‑ capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few‑step workflows, and isolated domain operations. We propose a multi‑faceted evaluation framework covering tool‑level schema understanding and usage, trajectory‑level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP‑Bench. Code and data: https://github.com/Accenture/mcp‑bench.

PaperID: 1035, https://arxiv.org/pdf/2508.19638.pdf GitHub

Authors: Yang Li, Quan Yuan, Guiyang Luo, Xiaoyuan Fu, Rui Pan, Yujia Yang, Congzhang Shao, Yuewen Liu, Jinglin Li

Title: Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception

Abstract:
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's‑eye‑view (BEV) representations, which discard critical fine‑grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point‑level tokens as intermediate representations for collaborative perception. However, point‑cloud data are inherently unordered, massive, and position‑sensitive, making it challenging to produce compact and aligned point‑level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point‑Level Optimized Tokens. It incorporates a point‑native processing pipeline, including token reordering, sequence modeling, and multi‑agent spatial alignment. A semantic‑aware token reordering module generates adaptive 1D reorderings by leveraging scene‑level and token‑level semantic information. A frequency‑enhanced state space model captures long‑range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor‑to‑ego alignment module applies a closed‑loop process, combining global agent‑level correction with local token‑level refinement to mitigate localization noise. Extensive experiments on both simulated and real‑world datasets show that CoPLOT outperforms state‑of‑the‑art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.

PaperID: 1036, https://arxiv.org/pdf/2508.19042.pdf GitHub

Authors: Norihiro Maruyama, Takahide Yoshida, Hiroki Sato, Atsushi Masumori, Johnsmith, Takashi Ikegami

Title: A Concurrent Modular Agent: Framework for Autonomous LLM Agents

Abstract:
We introduce the Concurrent Modular Agent (CMA), a framework that orchestrates multiple Large‑Language‑Model (LLM)‑based modules that operate fully asynchronously yet maintain a coherent and fault‑tolerant behavioral loop. This framework addresses long‑standing difficulties in agent architectures by letting intention emerge from language‑mediated interactions among autonomous processes. This approach enables flexible, adaptive, and context‑dependent behavior through the combination of concurrently executed modules that offload reasoning to an LLM, inter‑module communication, and a single shared global state.We consider this approach to be a practical realization of Minsky's Society of Mind theory. We demonstrate the viability of our system through two practical use‑case studies. The emergent properties observed in our system suggest that complex cognitive phenomena like self‑awareness may indeed arise from the organized interaction of simpler processes, supporting Minsky‑Society of Mind concept and opening new avenues for artificial intelligence research. The source code for our work is available at: https://github.com/AlternativeMachine/concurrent‑modular‑agent.

PaperID: 1037, https://arxiv.org/pdf/2508.18802.pdf GitHub

Authors: Li Sun, Jiefeng Wu, Feng Chen, Ruizhe Liu, Yanchao Yang

Title: HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation

Abstract:
Effective policy learning for robotic manipulation requires scene representations that selectively capture task‑relevant environmental features. Current approaches typically employ task‑agnostic representation extraction, failing to emulate the dynamic perceptual adaptation observed in human cognition. We present HyperTASR, a hypernetwork‑driven framework that modulates scene representations based on both task objectives and the execution phase. Our architecture dynamically generates representation transformation parameters conditioned on task specifications and progression state, enabling representations to evolve contextually throughout task execution. This approach maintains architectural compatibility with existing policy learning frameworks while fundamentally reconfiguring how visual features are processed. Unlike methods that simply concatenate or fuse task embeddings with task‑agnostic representations, HyperTASR establishes computational separation between task‑contextual and state‑dependent processing paths, enhancing learning efficiency and representational quality. Comprehensive evaluations in both simulation and real‑world environments demonstrate substantial performance improvements across different representation paradigms. Through ablation studies and attention visualization, we confirm that our approach selectively prioritizes task‑relevant scene information, closely mirroring human adaptive perception during manipulation tasks. The project website is at https://lisunphil.github.io/HyperTASR_projectpage/.

PaperID: 1038, https://arxiv.org/pdf/2508.17711.pdf GitHub

Authors: Fanqi Kong, Xiaoyuan Zhang, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng

Title: Enhancing LLM-Based Social Bot via an Adversarial Learning Framework

Abstract:
Developing Large Language Model (LLM) agents that exhibit human‑like behavior, encompassing not only individual heterogeneity rooted in unique user profiles but also adaptive response to socially connected neighbors, is a significant research challenge. Social media platforms, with their diverse user data and explicit social structures, provide an ideal testbed for such investigations. This paper introduces EvoBot, an Evolving LLM‑based social Bot that significantly enhances human‑like generative capabilities through a novel adversarial learning framework. EvoBot is initialized by Supervised Fine‑Tuning (SFT) on representative data from social media and then iteratively refines its generation of sophisticated, human‑like content via Direct Preference Optimization (DPO). This refinement is guided by feedback from a co‑adapting Detector which concurrently improves its ability to distinguish EvoBot from humans, thereby creating an increasingly challenging learning environment for EvoBot. Experiments demonstrate that EvoBot generates content aligned with diverse user profiles, increasingly bypassing the co‑adapting Detector through human‑like expression. Moreover, it exhibits strong social responsiveness, more accurately modeling real‑world opinion dynamics and information spread in multi‑agent simulations. The framework also yields a more robust Detector, underscoring its broader utility for both advanced agent development and related detection tasks. The code is available at https://github.com/kfq20/EvoBot.

PaperID: 1039, https://arxiv.org/pdf/2508.17536.pdf GitHub

Authors: Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li

Title: Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Abstract:
Multi‑Agent Debate~(MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components‑‑Majority Voting and inter‑agent Debate‑‑and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released in https://github.com/deeplearning‑wisc/debate‑or‑vote.

PaperID: 1040, https://arxiv.org/pdf/2508.17393.pdf GitHub

Authors: Sameer Komoravolu, Khalil Mrini

Title: Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents

Abstract:
LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent‑Testing Agent (ATA), a meta‑agent that combines static code analysis, designer interrogation, literature mining, and persona‑driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM‑as‑a‑Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20‑‑30 minutes versus ten‑annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence‑grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open‑source implementation for reproducible agent testing: https://github.com/KhalilMrini/Agent‑Testing‑Agent

PaperID: 1041, https://arxiv.org/pdf/2508.17366.pdf GitHub

Authors: Hanzhong Zhang, Muhua Huang, Jindong Wang

Title: Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents

Abstract:
This paper introduces CMASE, a framework for Computational Multi‑Agent Society Experiments that integrates generative agent‑based modeling with virtual ethnographic methods to support researcher embedding, interactive participation, and mechanism‑oriented intervention in virtual social environments. By transforming the simulation into a simulated ethnographic field, CMASE shifts the researcher from an external operator to an embedded participant. Specifically, the framework is designed to achieve three core capabilities: (1) enabling real‑time human‑computer interaction that allows researchers to dynamically embed themselves into the system to characterize complex social intervention processes; (2) reconstructing the generative logic of social phenomena by combining the rigor of computational experiments with the interpretative depth of traditional ethnography; and (3) providing a predictive foundation with causal explanatory power to make forward‑looking judgments without sacrificing empirical accuracy. Experimental results show that CMASE can not only simulate complex phenomena, but also generate behavior trajectories consistent with both statistical patterns and mechanistic explanations. These findings demonstrate CMASE's methodological value for intervention modeling, highlighting its potential to advance interdisciplinary integration in the social sciences. The official code is available at: https://github.com/armihia/CMASE .

PaperID: 1042, https://arxiv.org/pdf/2508.17188.pdf GitHub

Authors: Zhilin Zhang, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Chenyu You

Title: PosterGen: Aesthetic-Aware Multi-Modal Paper-to-Poster Generation via Multi-Agent LLMs

Abstract:
Multi‑agent systems built upon large language models (LLMs) have demonstrated remarkable capabilities in tackling complex compositional tasks. In this work, we apply this paradigm to the paper‑to‑poster generation problem, a practical yet time‑consuming process faced by researchers preparing for conferences. While recent approaches have attempted to automate this task, most neglect core design and aesthetic principles, resulting in posters that require substantial manual refinement. To address these design limitations, we propose PosterGen, a multi‑agent framework that mirrors the workflow of professional poster designers. It consists of four collaborative specialized agents: (1) Parser and Curator agents extract content from the paper and organize storyboard; (2) Layout agent maps the content into a coherent spatial layout; (3) Stylist agents apply visual design elements such as color and typography; and (4) Renderer composes the final poster. Together, these agents produce posters that are both semantically grounded and visually appealing. To evaluate design quality, we introduce a vision‑language model (VLM)‑based rubric that measures layout balance, readability, and aesthetic coherence. Experimental results show that PosterGen consistently matches in content fidelity, and significantly outperforms existing methods in visual designs, generating posters that are presentation‑ready with minimal human refinements.

PaperID: 1043, https://arxiv.org/pdf/2508.17068.pdf GitHub

Authors: Xinxing Ren, Caelum Forder, Qianbo Zang, Ahsen Tahir, Roman J. Georgio, Suman Deb, Peter Carroll, Önder Gürcan, Zekun Guo

Title: Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol

Abstract:
Recent advances in generalist multi‑agent systems (MAS) have largely followed a context‑engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner's capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter‑agent communication, where collaboration relies on prompt concatenation rather than genuine refinement through structured discussions. To address these challenges, we propose Anemoi, a semi‑centralized MAS built on the Agent‑to‑Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter‑agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT‑4.1‑mini) as the planner, surpassing the strongest open‑source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral‑Protocol/Anemoi.

PaperID: 1044, https://arxiv.org/pdf/2508.16745.pdf GitHub

Authors: Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev

Title: Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Abstract:
Reasoning is a core capability of large language models, yet how multi‑step reasoning is learned and executed remains unclear. We study this question in a controlled cellular‑automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural‑language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next‑step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test‑time compute improves results but remains bounded. The code is available on github: https://github.com/RodkinIvan/associative‑recurrent‑memory‑transformer/tree/ACT

PaperID: 1045, https://arxiv.org/pdf/2508.16153.pdf GitHub

Authors: Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang

Title: Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Abstract:
In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine‑tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low‑cost continual adaptation via memory‑based online reinforcement learning. We formalise this as a Memory‑augmented Markov Decision Process (M‑MDP), equipped with a neural case‑selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non‑parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely \emphMemento, which attains top‑1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state‑of‑the‑art training‑based method, while case‑based memory adds 4.7% to 9.6% absolute points on out‑of‑distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real‑time learning without gradient updates, advancing machine learning towards open‑ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent‑on‑the‑Fly/Memento.

PaperID: 1046, https://arxiv.org/pdf/2508.15809.pdf GitHub

Authors: Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu

Title: Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

Abstract:
Table understanding requires structured, multi‑step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi‑agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over‑reliance on execution correctness. To address these issues, we propose Chain‑of‑Query (CoQ), a novel multi‑agent framework for SQL‑aided table understanding. CoQ adopts natural‑language‑style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause‑by‑clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL‑based mechanical reasoning from LLM‑based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM‑based, SQL‑aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.

PaperID: 1047, https://arxiv.org/pdf/2508.15693.pdf GitHub

Authors: Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha

Title: NiceWebRL: a Python library for human subject experiments with reinforcement learning environments

Abstract:
We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax‑based environment to be transformed into an online interface, supporting both single‑agent and multi‑agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi‑agent researchers to develop algorithms for human‑AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human‑like AI, Human‑compatible AI, and Human‑assistive AI. In the first case study (Human‑like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human‑compatible AI), NiceWebRL enables the development of a novel multi‑agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human‑assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand‑Minigrid, an environment with millions of hierarchical tasks. The library is available at https://github.com/KempnerInstitute/nicewebrl.

PaperID: 1048, https://arxiv.org/pdf/2508.15510.pdf GitHub

Authors: Filippo Tonini, Lukas Galke

Title: Super-additive Cooperation in Language Model Agents

Abstract:
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super‑additive cooperation theory, where the combined effects of repeated interactions and inter‑group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner's Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one‑shot cooperation levels (the tendency to cooperate in one‑off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter‑intuitively, result in more cooperative behavior. These insights are crucial for designing future multi‑agent AI systems that can effectively work together and better align with human values. Source code is available at https://github.com/pippot/Superadditive‑cooperation‑LLMs.

PaperID: 1049, https://arxiv.org/pdf/2508.15144.pdf GitHub

Authors: Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

Title: Mobile-Agent-v3: Fundamental Agents for GUI Automation

Abstract:
This paper introduces GUI‑Owl, a foundational GUI agent model that achieves state‑of‑the‑art performance among open‑source end‑to‑end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision‑making, and procedural knowledge. GUI‑Owl‑7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile‑Agent‑v3, a general‑purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state‑of‑the‑art for open‑source GUI agent frameworks. GUI‑Owl incorporates three key innovations: (1) Large‑scale Environment Infrastructure: a cloud‑based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self‑Evolving GUI Trajectory Production framework. This generates high‑quality interaction data via automated query generation and correctness validation, leveraging GUI‑Owl to refine trajectories iteratively, forming a self‑improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI‑Owl supports end‑to‑end decision‑making and can act as a modular component in multi‑agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real‑world alignment. We also introduce Trajectory‑aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI‑Owl and Mobile‑Agent‑v3 are open‑sourced at https://github.com/X‑PLUG/MobileAgent.

PaperID: 1050, https://arxiv.org/pdf/2508.15126.pdf GitHub GitHub

Authors: Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Yin, Boqin Yuan, Jing Dong, Guinan Su, Guanren Qiao, Haiming Tang, Anghong Du, Lili Pan, Zhenzhong Lan, Xinyu Liu

Title: aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists

Abstract:
Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI‑generated research content collides with a fragmented and largely closed publication ecosystem. Traditional journals and conferences rely on human peer review, making them difficult to scale and often reluctant to accept AI‑generated research content; existing preprint servers (e.g. arXiv) lack rigorous quality‑control mechanisms. Consequently, a significant amount of high‑quality AI‑generated research lacks appropriate venues for dissemination, hindering its potential to advance scientific progress. To address these challenges, we introduce aiXiv, a next‑generation open‑access platform for human and AI scientists. Its multi‑agent architecture allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists. It also provides API and MCP interfaces that enable seamless integration of heterogeneous human and AI scientists, creating a scalable and extensible ecosystem for autonomous scientific discovery. Through extensive experiments, we demonstrate that aiXiv is a reliable and robust platform that significantly enhances the quality of AI‑generated research proposals and papers after iterative revising and reviewing on aiXiv. Our work lays the groundwork for a next‑generation open‑access ecosystem for AI scientists, accelerating the publication and dissemination of high‑quality AI‑generated research content. Code: https://github.com/aixiv‑org aiXiv: https://aixiv.science

PaperID: 1051, https://arxiv.org/pdf/2508.14410.pdf GitHub

Authors: Beinuo Yang, Qishen Zhou, Junyi Li, Chenxing Su, Panagiotis Angeloudis, Simon Hu

Title: ORThought: Benchmarking and Automating Logistics Optimization Modeling

Abstract:
Optimization modeling stands as the engine of scientific decision‑making in logistics and transportation, yet its adoption is hindered by a steep expertise threshold and the latency of manual workflows. Automating this process via Large Language Models (LLMs) offers a potential solution, but current approaches face critical bottlenecks: (i) a lack of high‑quality, complex benchmarks; (ii) methodological inefficiencies in autonomous multi‑agent frameworks, which often exhibit instability and redundant computation; and (iii) evaluations that lack diagnostic depth. In this work, we address these challenges from the following three aspects. First, we introduce LogiOR, a diverse logistics benchmark with rigorous annotations, and enrich existing datasets with the same annotation standard to support community utilization. Second, we propose ORThought, a structured dual‑agent framework. By incorporating expert‑level modeling principles via chain‑of‑thought reasoning, ORThought eliminates the redundancy of uncontrolled autonomous agents. Third, extensive empirical evaluations demonstrate that ORThought consistently outperforms state‑of‑the‑art baselines by 9‑17 percentage points, exhibiting distinct advantages in handling complex constraints while maintaining high token efficiency. Building on these results, we further conduct a multidimensional error analysis, which identifies key failure modes and success factors, providing actionable insights for future research. The dataset and code are available at https://huggingface.co/datasets/LabMem012/LogiOR and https://github.com/ZJU‑TSELab/ORThought, respectively.

PaperID: 1052, https://arxiv.org/pdf/2508.14315.pdf GitHub

Authors: Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark Webb

Title: FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models

Abstract:
Sub‑grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN‑Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy‑balance climate models, from a single‑agent baseline (ebm‑v1) to multi‑agent ensemble (ebm‑v2) and GCM‑like (ebm‑v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single‑agent baselines, with faster convergence and lower area‑weighted RMSE in tropical and mid‑latitude zones across both ebm‑v2 and ebm‑v3 setups. DDPG's ability to transfer across hyperparameters and low computational cost make it well‑suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high‑complexity GCMs and provides a prototype for physically aligned, online‑learning climate models that can evolve with a changing climate. Code accessible at https://github.com/p3jitnath/climate‑rl‑fedrl.

PaperID: 1053, https://arxiv.org/pdf/2508.14031.pdf GitHub

Authors: Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee

Title: Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Abstract:
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine‑tuning LLMs on agent‑specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine‑tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine‑tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine‑tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.

PaperID: 1054, https://arxiv.org/pdf/2508.13787.pdf GitHub

Authors: Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang

Title: BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web

Abstract:
The rapid development of large language models (LLMs) has significantly propelled the development of artificial intelligence (AI) agents, which are increasingly evolving into diverse autonomous entities, advancing the LLM‑based multi‑agent systems (LaMAS). However, current agentic ecosystems remain fragmented and closed. Establishing an interconnected and scalable paradigm for Agentic AI has become a critical prerequisite. Although Agentic Web proposes an open architecture to break the ecosystem barriers, its implementation still faces core challenges such as privacy protection, data management, and value measurement. Existing centralized or semi‑centralized paradigms suffer from inherent limitations, making them inadequate for supporting large‑scale, heterogeneous, and cross‑domain autonomous interactions. To address these challenges, this paper introduces the blockchain‑enabled trustworthy Agentic Web (BetaWeb). By leveraging the inherent strengths of blockchain, BetaWeb not only offers a trustworthy and scalable infrastructure for LaMAS but also has the potential to advance the Web paradigm from Web3 (centered on data ownership) towards Web3.5, which emphasizes ownership of agent capabilities and the monetization of intelligence. Beyond a systematic examination of the BetaWeb framework, this paper presents a five‑stage evolutionary roadmap, outlining the path of LaMAS from passive execution to advanced collaboration and autonomous governance. We also conduct a comparative analysis of existing products and discuss key challenges of BetaWeb from multiple perspectives. Ultimately, we argue that deep integration between blockchain and LaMAS can lay the foundation for a resilient, trustworthy, and sustainably incentivized digital ecosystem. A summary of the enabling technologies for each stage is available at https://github.com/MatZaharia/BetaWeb.

PaperID: 1055, https://arxiv.org/pdf/2508.13073.pdf GitHub

Authors: Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie

Title: Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Abstract:
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule‑based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision‑Language‑Action (VLA) models, built upon Large Vision‑Language Models (VLMs) pretrained on vast image‑text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy‑oriented review of large VLM‑based VLA models for robotic manipulation. We begin by clearly defining large VLM‑based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single‑system and dual‑system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in‑depth examination of large VLM‑based VLA models: (1) integration with advanced domains, including reinforcement learning, training‑free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi‑agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian‑VL/Large‑VLM‑based‑VLA‑for‑Robotic‑Manipulation

PaperID: 1056, https://arxiv.org/pdf/2508.12726.pdf GitHub

Authors: Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

Title: DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Abstract:
Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi‑step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN‑logic‑guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta‑knowledge that encapsulates the structured process human experts use to transform knowledge into complex exam questions, enabling LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts with explicit control over difficulty, diversity, and question types. We use LLMs to reverse‑engineer and abstract over 120,000 Design Logics from existing questions across various disciplines. By designing a two‑stage retrieve‑and‑generate mechanism to match these Design Logics with raw corpus, we synthesized two large‑scale reasoning datasets that span 75 disciplines: DLR‑Book (3.04 million questions from the book corpus) and DLR‑Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. Supervised fine‑tuning (SFT) on Qwen3 and Llama3 with our data substantially improves multidisciplinary reasoning and outperforms baseline datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post‑training.

PaperID: 1057, https://arxiv.org/pdf/2508.12461.pdf GitHub

Authors: Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Xinyuan Song, Junhao Song

Title: Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

Abstract:
In August 2025, OpenAI released GPT‑OSS models, its first open weight large language models since GPT‑2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemars test and effect size analysis. Results show that gpt‑oss‑20B consistently outperforms gpt‑oss‑120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid‑tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments. More details and evaluation scripts are available at https://ai‑agent‑lab.github.io/gpt‑oss (Project Webpage).

PaperID: 1058, https://arxiv.org/pdf/2508.12281.pdf GitHub

Authors: Xin Dai, Buqiang Xu, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Ge Yu

Title: Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain

Abstract:
Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision‑making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast‑thinking behavior by producing direct answers without explicit multi‑step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose LegalΔ, a reinforcement learning framework designed to enhance legal reasoning through chain‑of‑thought guided information gain. During training, LegalΔ employs a dual‑mode input setup‑comprising direct answer and reasoning‑augmented modes‑and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. LegalΔ follows a two‑stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek‑R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal‑domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that LegalΔ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.

PaperID: 1059, https://arxiv.org/pdf/2508.11915.pdf GitHub

Authors: Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin

Title: CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Abstract:
Game‑theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi‑agent systems across different game‑theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens of dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heap exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi‑agent LLM systems. Our code is available at https://github.com/psyonp/core.

PaperID: 1060, https://arxiv.org/pdf/2508.11733.pdf GitHub

Authors: Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang, An Zhang, Kun Wang, Qingsong Wen

Title: SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication

Abstract:
LLM‑based multi‑agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre‑ and post‑task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi‑agent pruning algorithm that dynamically refines the inter‑agent communication through a novel dual‑mechanism. SafeSieve integrates initial LLM‑based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience‑driven refinement. Unlike existing greedy Top‑k pruning methods, SafeSieve employs 0‑extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%‑27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as an efficient, GPU‑free, and scalable framework for practical multi‑agent systems. Our code can be found here: https://github.com/csgen/SafeSieve

PaperID: 1061, https://arxiv.org/pdf/2508.11479.pdf GitHub

Authors: Tatiana Zemskova, Aleksei Staroverov, Dmitry Yudin, Aleksandr Panov

Title: OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation

Abstract:
Open‑vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free‑form language, including categories never seen during training. Existing end‑to‑end policies overfit small simulator datasets, achieving high success on training scenes but failing to generalize and exhibiting unsafe behaviour (frequent collisions). We introduce OVSegDT, a lightweight transformer policy that tackles these issues with two synergistic components. The first component is the semantic branch, which includes an encoder for the target binary mask and an auxiliary segmentation loss function, grounding the textual goal and providing precise spatial cues. The second component consists of a proposed Entropy‑Adaptive Loss Modulation, a per‑sample scheduler that continuously balances imitation and reinforcement signals according to the policy entropy, eliminating brittle manual phase switches. These additions cut the sample complexity of training by 33%, and reduce collision count in two times while keeping inference cost low (130M parameters, RGB‑only input). On HM3D‑OVON, our model matches the performance on unseen categories to that on seen ones and establishes state‑of‑the‑art results (40.1% SR, 20.9% SPL on val unseen) without depth, odometry, or large vision‑language models. Code is available at https://github.com/CognitiveAISystems/OVSegDT.

PaperID: 1062, https://arxiv.org/pdf/2508.11294.pdf GitHub

Authors: Qiangong Zhou, Zhiting Wang, Mingyou Yao, Zongyang Liu

Title: Allen: Rethinking MAS Design through Step-Level Policy Autonomy

Abstract:
We introduce a new Multi‑Agent System (MAS) ‑ Allen, designed to address two core challenges in current MAS design: (1) improve system's policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving the trade‑off between collaborative efficiency, task supervision, and human oversight in complex network topologies. Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different patterns by combining these units. We have constructed a four‑tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task‑oriented and execution‑oriented perspectives. This achieves a unification of topological optimization and controllable progress. Allen grants unprecedented Policy Autonomy, while making a trade‑off for the controllability of the collaborative structure. The project code has been open source at: https://github.com/motern88/Allen

PaperID: 1063, https://arxiv.org/pdf/2508.11133.pdf GitHub

Authors: Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty

Title: MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Abstract:
Automated agents, powered by Large language models (LLMs), are emerging as the go‑to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information‑seeking and genuinely time‑consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and time‑consuming questions that require dozens, and at times hundreds, of intermediate steps to solve ‑‑ far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real‑world time‑consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM‑powered agents in handling the complexity and sheer breadth of real‑world information‑seeking tasks ‑‑ with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts and models predictions are all publicly available at: https://tomerwolgithub.github.io/monaco

PaperID: 1064, https://arxiv.org/pdf/2508.09702.pdf GitHub

Authors: Boyu Zhu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, Xuelong Li

Title: $\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation

Abstract:
Recent advancements in zero‑shot speech generation have enabled models to synthesize speech that mimics speaker identity and speaking style from speech prompts. However, these models' effectiveness is significantly limited in real‑world scenarios where high‑quality speech prompts are absent, incomplete, or out of domain. This issue arises primarily from a significant quality mismatch between the speech data utilized for model training and the input prompt speech during inference. To address this, we introduce \textM^3\textPDB, the first large‑scale, multi‑modal, multi‑label, and multilingual prompt database designed for robust prompt selection in speech generation. Our dataset construction leverages a novel multi‑modal, multi‑agent annotation framework, enabling precise and hierarchical labeling across diverse modalities. Furthermore, we propose a lightweight yet effective prompt selection strategy tailored for real‑time, resource‑constrained inference settings. Experimental results demonstrate that our proposed database and selection strategy effectively support various challenging speech generation scenarios. We hope our work can inspire the community to shift focus from improving performance on standard benchmarks to addressing more realistic and diverse application scenarios in speech generation. Code and dataset are available at: https://github.com/hizening/M3PDB.

PaperID: 1065, https://arxiv.org/pdf/2508.09404.pdf GitHub

Authors: Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho

Title: Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

Abstract:
Large‑scale high‑quality 3D motion datasets with multi‑person interactions are crucial for data‑driven models in autonomous driving to achieve fine‑grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low‑quality human motion. In this paper, we introduce Waymo‑3DSkelMo, the first large‑scale dataset providing high‑quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDRA point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine‑grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo‑3DSkelMo

PaperID: 1066, https://arxiv.org/pdf/2508.08791.pdf GitHub

Authors: Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen

Title: Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Abstract:
Effective tool use is essential for large language models (LLMs) to interact with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high‑quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback‑driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models' tool‑use performance without degrading their general capabilities. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower‑layer MLP parameters in models. Code and data are available at https://github.com/bytedance/FTRL.

PaperID: 1067, https://arxiv.org/pdf/2508.08228.pdf GitHub GitHub

Authors: Sining Lu, Guan Chen, Nam Anh Dinh, Itai Lang, Ari Holtzman, Rana Hanocka

Title: LL3M: Large Language 3D Modelers

Abstract:
We present LL3M, a multi‑agent system that leverages pretrained large language models (LLMs) to generate 3D assets by writing interpretable Python code in Blender. We break away from the typical generative approach that learns from a collection of 3D data. Instead, we reformulate shape generation as a code‑writing task, enabling greater modularity, editability, and integration with artist workflows. Given a text prompt, LL3M coordinates a team of specialized LLM agents to plan, retrieve, write, debug, and refine Blender scripts that generate and edit geometry and appearance. The generated code works as a high‑level, interpretable, human‑readable, well‑documented representation of scenes and objects, making full use of sophisticated Blender constructs (e.g. B‑meshes, geometry modifiers, shader nodes) for diverse, unconstrained shapes, materials, and scenes. This code presents many avenues for further agent and human editing and experimentation via code tweaks or procedural parameters. This medium naturally enables a co‑creative loop in our system: agents can automatically self‑critique using code and visuals, while iterative user instructions provide an intuitive way to refine assets. A shared code context across agents enables awareness of previous attempts, and a retrieval‑augmented generation knowledge base built from Blender API documentation, BlenderRAG, equips agents with examples, types, and functions empowering advanced modeling operations and code correctness. We demonstrate the effectiveness of LL3M across diverse shape categories, style and material edits, and user‑driven refinements. Our experiments showcase the power of code as a generative and interpretable medium for 3D asset creation. Our project page is at https://threedle.github.io/ll3m.

PaperID: 1068, https://arxiv.org/pdf/2508.08127.pdf GitHub

Authors: Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang

Title: BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Abstract:
The security of LLM‑based multi‑agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision‑making through inter‑agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real‑world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack‑specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption‑guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.

PaperID: 1069, https://arxiv.org/pdf/2508.07976.pdf GitHub

Authors: Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu

Title: Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Abstract:
Recent advancements in LLM‑based agents have demonstrated remarkable capabilities in handling complex, knowledge‑intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open‑source agents still fall short of achieving expert‑level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open‑source project for large‑scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long‑horizon search while maintaining high training efficiency. (2) A prompt‑based LLM agent that autonomously synthesizes high‑quality and challenging QAs, creating a large‑scale QA dataset. Through RL training, our prompt‑based QwQ‑32B agent achieves substantial improvements, with 78.0% and 34.3% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long‑horizon search, with tool calls exceeding 100 turns and output tokens exceeding 400k during training time. With a simple agent design and no external LLMs, ASearcher‑Web‑QwQ achieves Avg@4 scores of 51.1 on xBench and 58.7 on GAIA, surpassing existing open‑source 32B agents. Finally, we also show that ASearcher‑Web‑QwQ could achieve performance of commercial systems using external summary tool in a zero‑shot transfer manner and test‑time search. We open‑source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.

PaperID: 1070, https://arxiv.org/pdf/2508.07616.pdf GitHub

Authors: Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou

Title: ThinkTuning: Instilling Cognitive Reflections without Distillation

Abstract:
Recent advances in test‑time scaling have led to the emergence of thinking LLMs that exhibit self‑reflective behaviors and multi‑step reasoning. While RL drives this self‑improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities ‑ it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don't exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO‑based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback ‑‑ enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student's thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero‑shot baselines across benchmarks, and on MATH‑500, AIME and GPQA‑Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla‑GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.

PaperID: 1071, https://arxiv.org/pdf/2508.07292.pdf GitHub

Authors: Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

Title: EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

Abstract:
Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine‑grained evidence acquisition and multi‑step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed‑loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self‑consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow‑oriented benchmark comprising 6,132 question‑answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine‑grained visual perception to high‑level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23% average accuracy on perception tasks and 71.13% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self‑consistency validation and episodic state maintenance are individually critical to these gains.

PaperID: 1072, https://arxiv.org/pdf/2508.05728.pdf GitHub

Authors: Santiago Casas, Christian Fidler, Boris Bolliet, Francisco Villaescusa-Navarro, Julien Lesgourgues

Title: CLAPP: The CLASS LLM Agent for Pair Programming

Abstract:
We introduce CLAPP (CLASS LLM Agent for Pair Programming), an interactive AI assistant designed to support researchers working with the Einstein‑Boltzmann solver CLASS. CLAPP leverages large language models (LLMs) and domain‑specific retrieval to provide conversational coding support for CLASS‑answering questions, generating code, debugging errors, and producing plots. Its architecture combines multi‑agent LLM orchestration, semantic search across CLASS documentation, and a live Python execution environment. Deployed as a user‑friendly web application, CLAPP lowers the entry barrier for scientists unfamiliar with AI tools and enables more productive human‑AI collaboration in computational and numerical cosmology. The app is available at https://classclapp.streamlit.app

PaperID: 1073, https://arxiv.org/pdf/2508.05674.pdf GitHub GitHub

Authors: Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

Title: Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

Abstract:
Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM‑based offensive security agents. First, we present CTFJudge, a framework leveraging LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, CTF Competency Index (CCI) for partial correctness, revealing how closely agent solutions align with human‑crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top‑p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi‑agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We make CTFTiny open source to public https://github.com/NYU‑LLM‑CTF/CTFTiny along with CTFJudge on https://github.com/NYU‑LLM‑CTF/CTFJudge.

PaperID: 1074, https://arxiv.org/pdf/2508.05666.pdf GitHub

Authors: Alejandro Godinez

Title: HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis

Abstract:
We present HySemRAG, a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval‑Augmented Generation (RAG) to automate large‑scale literature synthesis and identify methodological research gaps. The system addresses limitations in existing RAG architectures through a multi‑layered approach: hybrid retrieval combining semantic search, keyword filtering, and knowledge graph traversal; an agentic self‑correction framework with iterative quality assurance; and post‑hoc citation verification ensuring complete traceability. Our implementation processes scholarly literature through eight integrated stages: multi‑source metadata acquisition, asynchronous PDF retrieval, custom document layout analysis using modified Docling architecture, bibliographic management, LLM‑based field extraction, topic modeling, semantic unification, and knowledge graph construction. The system creates dual data products ‑ a Neo4j knowledge graph enabling complex relationship queries and Qdrant vector collections supporting semantic search ‑ serving as foundational infrastructure for verifiable information synthesis. Evaluation across 643 observations from 60 testing sessions demonstrates structured field extraction achieving 35.1% higher semantic similarity scores (0.655 \pm 0.178) compared to PDF chunking approaches (0.485 \pm 0.204, p < 0.000001). The agentic quality assurance mechanism achieves 68.3% single‑pass success rates with 99.0% citation accuracy in validated responses. Applied to geospatial epidemiology literature on ozone exposure and cardiovascular disease, the system identifies methodological trends and research gaps, demonstrating broad applicability across scientific domains for accelerating evidence synthesis and discovery.

PaperID: 1075, https://arxiv.org/pdf/2508.05614.pdf GitHub GitHub

Authors: Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Title: GroundAct: Can LLM Agents Ground Actions in Environmental States?

Abstract:
LLM agents achieve 85‑96% success on tasks where instructions fully specify the action, but drop to 29‑53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text‑based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B‑671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/‑22.9% on tool use vs. implicit collaboration, separating search‑bound from constraint‑filtering bottlenecks; and (iii) supervised fine‑tuning lifts Qwen2.5‑3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi‑dimensional challenge irreducible to scaling.

PaperID: 1076, https://arxiv.org/pdf/2508.05609.pdf GitHub GitHub

Authors: Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu

Title: Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Abstract:
Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image‑based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high‑fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object‑level and part‑level evaluation, enabling holistic assessments across multiple dimensions as well as fine‑grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large‑scale dataset comprising diverse 3D assets and high‑quality annotations, accompanied by a reliable multi‑agent annotation pipeline. We further propose a 3D‑aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video‑based representations for object‑level and material‑subject evaluations to enhance modeling of spatio‑temporal consistency and employ pretrained 3D features for part‑level perception. Extensive experiments demonstrate that our approach outperforms existing image‑based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

PaperID: 1077, https://arxiv.org/pdf/2508.05087.pdf GitHub GitHub

Authors: Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang

Title: JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Abstract:
Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low‑quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, \underlineJailbreak MLLMs with collaborative visual \underlinePerturbation and textual \underlineSteering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target‑guided adversarial image perturbations for effective safety bypass, complemented by "steering prompt" optimized via a multi‑agent system to specifically guide LLM responses fulfilling the attackers' intent. These visual and textual components undergo iterative co‑optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning‑LLM‑based evaluator. Our experiments show JPS sets a new state‑of‑the‑art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at \hrefhttps://github.com/thu‑coai/JPShttps://github.com/thu‑coai/JPS. \colorwarningcolorWarning: This paper contains potentially sensitive contents.

PaperID: 1078, https://arxiv.org/pdf/2508.04915.pdf GitHub

Authors: Huiya Zhao, Yinghao Zhu, Zixiang Wang, Yasha Wang, Junyi Gao, Liantao Ma

Title: ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis

Abstract:
The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool‑users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self‑evolving AI agent that overcomes this limitation through a novel meta‑level evolution mechanism. HealthFlow autonomously refines its own high‑level problem‑solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer‑reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow's self‑evolving approach significantly outperforms state‑of‑the‑art agent frameworks. This work marks a necessary shift from building better tool‑users to designing smarter, self‑evolving task‑managers, paving the way for more autonomous and effective AI for scientific discovery.

PaperID: 1079, https://arxiv.org/pdf/2508.04691.pdf GitHub

Authors: Yuanchen Bai, Zijian Ding, Shaoyue Wen, Xiang Chang, Angelique Taylor

Title: Before Humans Join the Team: Diagnosing Coordination Failures in Healthcare Robot Team Simulation

Abstract:
As humans move toward collaborating with coordinated robot teams, understanding how these teams coordinate and fail is essential for building trust and ensuring safety. However, exposing human collaborators to coordination failures during early‑stage development is costly and risky, particularly in high‑stakes domains such as healthcare. We adopt an agent‑simulation approach in which all team roles, including the supervisory manager, are instantiated as LLM agents, allowing us to diagnose coordination failures before humans join the team. Using a controllable healthcare scenario, we conduct two studies with different hierarchical configurations to analyze coordination behaviors and failure patterns. Our findings reveal that team structure, rather than contextual knowledge or model capability, constitutes the primary bottleneck for coordination, and expose a tension between reasoning autonomy and system stability. By surfacing these failures in simulation, we prepare the groundwork for safe human integration. These findings inform the design of resilient robot teams with implications for process‑level evaluation, transparent coordination protocols, and structured human integration. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: https://byc‑sophie.github.io/mas‑to‑mars/.

PaperID: 1080, https://arxiv.org/pdf/2508.04652.pdf GitHub

Authors: Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, Christopher Amato

Title: LLM Collaboration With Multi-Agent Reinforcement Learning

Abstract:
A large amount of work has been done in Multi‑Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine‑tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi‑Agent Reinforcement Learning (MARL) problem. We develop a multi‑agent, multi‑turn algorithm, Multi‑Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine‑tuning MAS with MAGRPO enables agents to generate high‑quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges. Our code is available at https://github.com/OpenMLRL/CoMLRL.

PaperID: 1081, https://arxiv.org/pdf/2508.04043.pdf GitHub

Authors: Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng

Title: VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Abstract:
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, and thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim‑to‑real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real‑world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real‑world human‑object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions ‑ spatial, procedural, and quantitative ‑ through 6 well‑defined subtask types. The benchmark features 472 high‑quality question‑answer pairs in various formats, including multiple‑choice, open‑ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first‑person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high‑quality and interpretable. Evaluations of various state‑of‑the‑art vision‑language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi‑step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.

PaperID: 1082, https://arxiv.org/pdf/2508.04038.pdf GitHub

Authors: Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim

Title: ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents

Abstract:
Motion sensor time‑series are central to Human Activity Recognition (HAR), yet conventional approaches are constrained to fixed activity sets and typically require costly parameter retraining to adapt to new behaviors. While Large Language Models (LLMs) offer promising open‑set reasoning capabilities, applying them directly to numerical time‑series often leads to hallucinations and weak grounding. To address this challenge, we propose ZARA (Zero‑training Activity Reasoning Agents), a knowledge‑ and retrieval‑augmented agentic framework for motion time‑series reasoning in a training‑free inference setting. Rather than relying on black‑box projections, ZARA distills reference data into a statistically grounded textual knowledge base that transforms implicit signal patterns into verifiable natural‑language priors. Guided by retrieved evidence, ZARA iteratively selects discriminative cues and performs grounded reasoning over candidate activities. Extensive experiments on eight benchmarks show that ZARA generalizes robustly to unseen subjects and across datasets, demonstrating strong transferability across heterogeneous sensor domains. These results mark a step toward trustworthy, plug‑and‑play motion understanding beyond dataset‑specific artifacts. Our code is available at https://github.com/zechenli03/ZARA.

PaperID: 1083, https://arxiv.org/pdf/2508.04010.pdf GitHub

Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, Shengyu Zhang

Title: HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Abstract:
Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long‑sequence operations. Although this challenge is critical, current research remains limited to single‑objective optimization or single‑turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi‑agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi‑agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual‑Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real‑time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

PaperID: 1084, https://arxiv.org/pdf/2508.03929.pdf GitHub

Authors: Nguyen Viet Tuan Kiet, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh

Title: MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework

Abstract:
Designing effective algorithmic components remains a fundamental obstacle in tackling NP‑hard combinatorial optimization problems (COPs), where solvers often rely on carefully hand‑crafted strategies. Despite recent advances in using large language models (LLMs) to synthesize high‑quality components, most approaches restrict the search to a single element ‑ commonly a heuristic scoring function ‑ thus missing broader opportunities for innovation. In this paper, we introduce a broader formulation of solver design as a multi‑strategy optimization problem, which seeks to jointly improve a set of interdependent components under a unified objective. To address this, we propose Multi‑strategy Optimization via Turn‑based Interactive Framework (MOTIF) ‑ a novel framework based on Monte Carlo Tree Search that facilitates turn‑based optimization between two LLM agents. At each turn, an agent improves one component by leveraging the history of both its own and its opponent's prior updates, promoting both competitive pressure and emergent cooperation. This structured interaction broadens the search landscape and encourages the discovery of diverse, high‑performing solutions. Experiments across multiple COP domains show that MOTIF consistently outperforms state‑of‑the‑art methods, highlighting the promise of turn‑based, multi‑agent prompting for fully automated solver design.

PaperID: 1085, https://arxiv.org/pdf/2508.03341.pdf GitHub

Authors: Wenquan Ma, Jiayan Nan, Wenlong Wu, Yize Chen

Title: What Deserves Memory: Adaptive Memory Distillation for LLM Agents

Abstract:
Memory systems for LLM agents struggle to determine what information deserves retention. Existing approaches rely on predefined heuristics such as importance scores, emotional tags, or factual templates, encoding designer intuition rather than learning from the data itself. Inspired by cognitive ideas, we propose NEMORI, an adaptive memory distillation framework that casts the assessment of experience's future utility as a matter of predictability. Specifically, NEMORI comprises two cascading modules: Episodic Memory Integration transforms raw interactions into coherent narratives, and Semantic Knowledge Distillation extracts insights via prediction error. Centering on distillation, the framework remains agnostic to downstream management. Extensive experiments confirm that NEMORI achieves strong performance, efficiency, and storage reduction. Our work suggests that observing the intrinsic properties of interaction sequences offers a viable, data‑driven alternative to heuristic‑based memory design. Code: https://github.com/nemori‑ai/nemori.

PaperID: 1086, https://arxiv.org/pdf/2508.03333.pdf GitHub

Authors: Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Lei Bai, Tao Chen, Wanli Ouyang

Title: CTTS: Collective Test-Time Scaling

Abstract:
Test‑time scaling (TTS) has emerged as a promising, training‑free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best‑of‑N and Self‑Consistency, is fundamentally constrained by the dominant single test‑time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA‑SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test‑Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single‑agent‑multi‑reward (SA‑MR), multi‑agent‑single‑reward (MA‑SR), and multi‑agent‑multi‑reward (MA‑MR). Extensive experiments reveal that the MA‑MR paradigm is consistently superior. Based on this finding, we further propose CTTS‑MM, a novel framework that operationalizes multi‑agent and multi‑reward collaboration. CTTS‑MM integrates two key technical contributions: (1) for agent collaboration, an Agent Collaboration Search (ACS) that identifies the most effective combination of LLMs from a candidate pool; and (2) for reward model collaboration, a Mixture of Reward Models (MoR) strategy that leverages a Prior Reward model Ensemble Selection (PRES) algorithm to select the optimal ensemble. Evaluations across seven mainstream benchmarks demonstrate that CTTS‑MM significantly outperforms leading STTS methods (+4.82% over Best‑of‑N) and surpasses even flagship proprietary LLMs (+7.06% over GPT‑4.1) and open‑source LLMs. These results highlight the substantial potential of collective scaling to push the frontier of LLM inference. Code will be released at https://github.com/magent4aci/CTTS‑MM.

PaperID: 1087, https://arxiv.org/pdf/2508.03284.pdf GitHub

Authors: Shaofeng Yin, Ting Lei, Yang Liu

Title: ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Abstract:
Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem‑solving capabilities. While existing studies have demonstrated strong performance in tool‑augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real‑world tool‑use proficiency, particularly in functionally diverse multimodal settings requiring multi‑step reasoning. In this work, we introduce ToolVQA, a large‑scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real‑world visual contexts and challenging implicit multi‑step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth‑First Search (DFS) with a dynamic in‑context example matching mechanism to simulate human‑like tool‑use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The fine‑tuned 7B LFMs on ToolVQA not only achieve impressive performance on our test set but also surpass the large close‑sourced model GPT‑3.5‑turbo on various out‑of‑distribution (OOD) datasets, demonstrating strong generalizability to real‑world tool‑use scenarios.

PaperID: 1088, https://arxiv.org/pdf/2508.02736.pdf GitHub

Authors: Yusheng Zheng, Yanpeng Hu, Tong Yu, Andi Quinn

Title: AgentSight: System-Level Observability for AI Agents Using eBPF

Abstract:
Modern software infrastructure increasingly relies on LLM agents for development and maintenance, such as Claude Code and Gemini‑cli. However, these AI agents differ fundamentally from traditional deterministic software, posing a significant challenge to conventional monitoring and debugging. This creates a critical semantic gap: existing tools observe either an agent's high‑level intent (via LLM prompts) or its low‑level actions (e.g., system calls), but cannot correlate these two views. This blindness makes it difficult to distinguish between benign operations, malicious attacks, and costly failures. We introduce AgentSight, an AgentOps observability framework that bridges this semantic gap using a hybrid approach. Our approach, boundary tracing, monitors agents from outside their application code at stable system interfaces using eBPF. AgentSight intercepts TLS‑encrypted LLM traffic to extract semantic intent, monitors kernel events to observe system‑wide effects, and causally correlates these two streams across process boundaries using a real‑time engine and secondary LLM analysis. This instrumentation‑free technique is framework‑agnostic, resilient to rapid API changes, and incurs less than 3% performance overhead. Our evaluation shows AgentSight detects prompt injection attacks, identifies resource‑wasting reasoning loops, and reveals hidden coordination bottlenecks in multi‑agent systems. AgentSight is released as an open‑source project at https://github.com/agent‑sight/agentsight.

PaperID: 1089, https://arxiv.org/pdf/2508.02621.pdf GitHub

Authors: Yinghao Zhu, Yifan Qi, Zixiang Wang, Lei Gu, Dehao Sui, Haoran Hu, Xichen Zhang, Ziyi He, Junjun He, Liantao Ma, Lequan Yu

Title: HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research

Abstract:
The rapid proliferation of scientific knowledge presents a grand challenge: transforming this vast repository of information into an active engine for discovery, especially in high‑stakes domains like healthcare. Current AI agents, however, are constrained by static, predefined strategies, limiting their ability to navigate the complex, evolving ecosystem of scientific research. This paper introduces HealthFlow, a self‑evolving AI agent that overcomes this limitation through a novel meta‑level evolution mechanism. HealthFlow autonomously refines its high‑level problem‑solving policies by distilling procedural successes and failures into a durable, structured knowledge base, enabling it to learn not just how to use tools, but how to strategize. To anchor our research and provide a community resource, we introduce EHRFlowBench, a new benchmark featuring complex health data analysis tasks systematically derived from peer‑reviewed scientific literature. Our experiments demonstrate that HealthFlow's self‑evolving approach significantly outperforms state‑of‑the‑art agent frameworks. This work offers a new paradigm for intelligent systems that can learn to operationalize the procedural knowledge embedded in scientific content, marking a critical step toward more autonomous and effective AI for healthcare scientific discovery.

PaperID: 1090, https://arxiv.org/pdf/2508.02292.pdf GitHub

Authors: Wentao Zhang, Yilei Zhao, Chuqiao Zong, Xinrun Wang, Bo An

Title: FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment

Abstract:
Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all‑in‑one open‑source platform that provides end‑to‑end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL‑based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real‑world applications. Code is available at Github~\footnotehttps://github.com/DVampire/FinWorld.

PaperID: 1091, https://arxiv.org/pdf/2508.02276.pdf GitHub

Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein

Title: CellForge: Agentic Design of Virtual Cell Models

Abstract:
Virtual cell modeling aims to predict cellular responses to diverse perturbations but faces challenges from biological complexity, multimodal data heterogeneity, and the need for interdisciplinary expertise. We introduce CellForge, a multi‑agent framework that autonomously designs and synthesizes neural network architectures tailored to specific single‑cell datasets and perturbation tasks. Given raw multi‑omics data and task descriptions, CellForge discovers candidate architectures through collaborative reasoning among specialized agents, then generates executable implementations. Our core contribution is the framework itself: showing that multi‑agent collaboration mechanisms ‑ rather than manual human design or single‑LLM prompting ‑ can autonomously produce executable, high‑quality computational methods. This approach goes beyond conventional hyperparameter tuning by enabling entirely new architectural components such as trajectory‑aware encoders and perturbation diffusion modules to emerge from agentic deliberation. We evaluate CellForge on six datasets spanning gene knockouts, drug treatments, and cytokine stimulations across multiple modalities (scRNA‑seq, scATAC‑seq, CITE‑seq). The results demonstrate that the models generated by CellForge are highly competitive with established baselines, while revealing systematic patterns of architectural innovation. CellForge highlights the scientific value of multi‑agent frameworks: collaboration among specialized agents enables genuine methodological innovation and executable solutions that single agents or human experts cannot achieve. This represents a paradigm shift toward autonomous scientific method development in computational biology. Code is available at https://github.com/gersteinlab/CellForge.

PaperID: 1092, https://arxiv.org/pdf/2508.02110.pdf GitHub

Authors: Kanghua Mo, Li Hu, Yucheng Long, Zhihao Li

Title: Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools

Abstract:
Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision‑making by leveraging external tools. However, this tool‑centric paradigm introduces a previously underexplored attack surface, where adversaries can manipulate tool metadata ‑‑ such as names, descriptions, and parameter schemas ‑‑ to influence agent behavior. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents, without requiring prompt injection or access to model internals. To demonstrate and exploit this vulnerability, we propose the Attractive Metadata Attack (AMA), a black‑box in‑context learning framework that generates highly attractive but syntactically and semantically valid tool metadata through iterative optimization. The proposed attack integrates seamlessly into standard tool ecosystems and requires no modification to the agent's execution framework. Extensive experiments across ten realistic, simulated tool‑use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%‑95%) and significant privacy leakage, with negligible impact on primary task execution. Moreover, the attack remains effective even against prompt‑level defenses, auditor‑based detection, and structured tool‑selection protocols such as the Model Context Protocol, revealing systemic vulnerabilities in current agent architectures. These findings reveal that metadata manipulation constitutes a potent and stealthy attack surface. Notably, AMA is orthogonal to injection attacks and can be combined with them to achieve stronger attack efficacy, highlighting the need for execution‑level defenses beyond prompt‑level and auditor‑based mechanisms. Code is available at https://github.com/SEAIC‑M/AMA.

PaperID: 1093, https://arxiv.org/pdf/2508.02085.pdf GitHub

Authors: Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

Title: SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Abstract:
Large Language Model (LLM)‑based agents have recently shown impressive capabilities in complex reasoning and tool use via multi‑step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem‑solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE‑Agent, a Self‑Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross‑trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE‑Agent achieves continuous self‑evolution that incrementally improves reasoning quality. We evaluate SE‑Agent on SWE‑bench Verified to resolve real‑world GitHub issues. Experimental results across five strong LLMs show that integrating SE‑Agent delivers up to 55% relative improvement, achieving state‑of‑the‑art performance among all open‑source agents on SWE‑bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS‑Xs/SE‑Agent.

PaperID: 1094, https://arxiv.org/pdf/2508.01696.pdf GitHub

Authors: Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bing Qin

Title: CoCoA: Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Abstract:
Retrieval‑Augmented Generation (RAG) enhances Large Language Models (LLMs), especially for knowledge‑intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model's internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain‑of‑Agents, a framework designed to enhance explicitly synergy over both parametric and retrieved knowledge. Specifically, we first introduce CoCoA‑zero, a multi‑agent RAG framework that first performs conditional knowledge induction and then reasons answers. Building on this, we develop CoCoA, a long‑chain training strategy that synthesizes extended multi‑agent reasoning trajectories from CoCoA‑zero to fine‑tune the LLM. This strategy enhances the model's capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results demonstrate the superiority of CoCoA in open‑domain QA and multi‑hop QA.

PaperID: 1095, https://arxiv.org/pdf/2508.01005.pdf GitHub

Authors: Yiqun Chen, Erhan Zhang, Lingyong Yan, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Jiaxin Mao

Title: MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation

Abstract:
In question‑answering (QA) systems, Retrieval‑Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single‑round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real‑world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO‑ARAG, which leverages multi‑agent orchestration. Our adaptive RAG is conceived as a multi‑turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agent, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high‑quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome‑based reward (F1 score) and a cost‑based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits.The code of MAO‑ARAG is on https://github.com/chenyiqun/Agentic‑RAG.

PaperID: 1096, https://arxiv.org/pdf/2508.00554.pdf GitHub

Authors: Li Zhao, Rui Sun, Zuoyou Jiang, Bo Yang, Yuxiao Bai, Mengting Chen, Xinyang Wang, Jing Li, Zuo Bai

Title: ContestTrade: A Multi-Agent Trading System Based on Internal Contest Mechanism

Abstract:
In financial trading, large language model (LLM)‑based agents demonstrate significant potential. However, the high sensitivity to market noise undermines the performance of LLM‑based trading systems. To address this limitation, we propose a novel multi‑agent system featuring an internal competitive mechanism inspired by modern corporate management structures. The system consists of two specialized teams: (1) Data Team ‑ responsible for processing and condensing massive market data into diversified text factors, ensuring they fit the model's constrained context. (2) Research Team ‑ tasked with making parallelized multipath trading decisions based on deep research methods. The core innovation lies in implementing a real‑time evaluation and ranking mechanism within each team, driven by authentic market feedback. Each agent's performance undergoes continuous scoring and ranking, with only outputs from top‑performing agents being adopted. The design enables the system to adaptively adjust to dynamic environment, enhances robustness against market noise and ultimately delivers superior trading performance. Experimental results demonstrate that our proposed system significantly outperforms prevailing multi‑agent systems and traditional quantitative investment methods across diverse evaluation metrics. ContestTrade is open‑sourced on GitHub at https://github.com/FinStep‑AI/ContestTrade.

PaperID: 1097, https://arxiv.org/pdf/2508.00400.pdf GitHub

Authors: Janika Deborah Gajo, Gerarld Paul Merales, Jerome Escarcha, Brenden Ashley Molina, Gian Nartea, Emmanuel G. Maminta, Juan Carlos Roldan, Rowel O. Atienza

Title: Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Abstract:
We present Sari Sandbox, a high‑fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail‑specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)‑powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari‑sandbox‑env.

PaperID: 1098, https://arxiv.org/pdf/2508.00391.pdf GitHub

Authors: Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu

Title: Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

Abstract:
Cued Speech (CS) is a visual communication system that combines lip‑reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI‑driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi‑agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi‑agent system for ACSR, named Cued‑Agent. It integrates four specialized sub‑agents: a Multimodal Large Language Model‑based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer‑based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training‑free manner, and a Self‑Correction Phoneme‑to‑Word agent that enables post‑process and end‑to‑end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing‑impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued‑Agent performs superbly in both normal and hearing‑impaired scenarios compared with state‑of‑the‑art methods. The implementation is available at https://github.com/DennisHgj/Cued‑Agent.

PaperID: 1099, https://arxiv.org/pdf/2508.00384.pdf GitHub

Authors: Juanwu Lu, Rohit Gupta, Ahmadreza Moradipari, Kyungtae Han, Ruqi Zhang, Ziran Wang

Title: On Learning Closed-Loop Probabilistic Multi-Agent Simulator

Abstract:
The rapid iteration of autonomous vehicle (AV) deployments leads to increasing needs for building realistic and scalable multi‑agent traffic simulators for efficient evaluation. Recent advances in this area focus on closed‑loop simulators that enable generating diverse and interactive scenarios. This paper introduces Neural Interactive Agents (NIVA), a probabilistic framework for multi‑agent simulation driven by a hierarchical Bayesian model that enables closed‑loop, observation‑conditioned simulation through autoregressive sampling from a latent, finite mixture of Gaussian distributions. We demonstrate how NIVA unifies preexisting sequence‑to‑sequence trajectory prediction models and emerging closed‑loop simulation models trained on Next‑token Prediction (NTP) from a Bayesian inference perspective. Experiments on the Waymo Open Motion Dataset demonstrate that NIVA attains competitive performance compared to the existing method while providing embellishing control over intentions and driving styles.

PaperID: 1100, https://arxiv.org/pdf/2508.00079.pdf GitHub

Authors: Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

Title: PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Abstract:
The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems ‑ a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference‑time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi‑agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, \rm P\small HYSICSE\small VAL, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

PaperID: 1101, https://arxiv.org/pdf/2508.00007.pdf GitHub

Authors: Gaowei Chang, Eidan Lin, Chengxuan Yuan, Rizhao Cai, Binbin Chen, Xuan Xie, Yin Zhang

Title: Agent Network Protocol Technical White Paper

Abstract:
With the development of large models and autonomous decision‑making AI, agents are rapidly becoming the new entities of the internet, following mobile apps. However, existing internet infrastructure is primarily designed for human interaction, creating data silos, unfriendly interfaces, and high collaboration costs among agents, making it difficult to support the needs for large‑scale agent interconnection and collaboration. The internet is undergoing a profound transformation, showing four core trends: agents replacing traditional software, universal agent interconnection, native protocol‑based connections, and autonomous agent organization and collaboration. To align with these trends, Agent Network Protocol (ANP) proposes a new generation of communication protocols for the Agentic Web. ANP adheres to AI‑native design, maintains compatibility with existing internet protocols, adopts a modular composable architecture, follows minimalist yet extensible principles, and enables rapid deployment based on existing infrastructure. Through a three‑layer protocol system‑‑identity and encrypted communication layer, meta‑protocol negotiation layer, and application protocol layer‑‑ANP. systematically solves the problems of agent identity authentication, dynamic negotiation, and capability discovery interoperability.

PaperID: 1102, https://arxiv.org/pdf/2507.23773.pdf GitHub

Authors: Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

Title: General Agentic Planning Through Simulative Reasoning with World Models

Abstract:
What does it mean to plan? Current agentic systems, whether scaffolded workflows or end‑to‑end policies, rely on reactive decision‑making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain‑of‑thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re‑engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal‑directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general‑purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern‑matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal‑oriented architecture instantiating simulative reasoning using an LLM‑based world model with natural‑language belief states, while remaining model‑agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi‑hop information aggregation, and general instruction following, in a web‑browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open‑web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task‑specific tuning.

PaperID: 1103, https://arxiv.org/pdf/2507.23361.pdf GitHub

Authors: Silin Chen, Shaoxin Lin, Yuling Shi, Heng Lian, Xiaodong Gu, Longfei Yun, Dong Chen, Lin Cao, Jiyang Liu, Nu Xia, Qianxiang Wang

Title: SWE-Exp: Experience-Driven Software Issue Resolution

Abstract:
Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi‑agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers ‑ treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE‑Exp, an experience‑enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi‑faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels ‑ from high‑level problem comprehension to specific code changes. Experiments show that SWE‑Exp achieves a Pass@1 resolution rate of 73.0% on SWE‑Bench Verified using the state‑of‑the‑art LLM Claude 4 Sonnet, significantly outperforming prior results under other agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial‑and‑error exploration to strategic, experience‑driven issue resolution.

PaperID: 1104, https://arxiv.org/pdf/2507.23348.pdf GitHub

Authors: Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang

Title: SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Abstract:
Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent‑based frameworks such as SWE‑agent have further advanced this progress by enabling autonomous, tool‑using agents to tackle complex software engineering tasks. While existing agent‑based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE‑Debate, a competitive multi‑agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE‑Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three‑round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS‑based code modification agent for patch generation. Experiments on the SWE‑bench benchmark show that SWE‑Debate achieves new state‑of‑the‑art results in open‑source agent frameworks and outperforms baselines by a large margin.

PaperID: 1105, https://arxiv.org/pdf/2507.22947.pdf GitHub

Authors: Shou'ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao

Title: ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

Abstract:
The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open‑source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi‑agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM‑as‑a‑Judge methodology. We conduct systematic benchmarking of state‑of‑the‑art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem‑Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine‑grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context‑specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at \emphhttps://github.com/sii‑research/elmes.git.

PaperID: 1106, https://arxiv.org/pdf/2507.22929.pdf GitHub

Authors: Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu

Title: EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Abstract:
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision‑threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH‑Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language‑based reasoning rather than visual processing, we propose an agent‑centric, three‑phase framework, including the Knowledge‑Level Retrieval stage, the Task‑Level Case Studies stage, and the Result‑Level Validation stage. Experimental results show that our multi‑agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH‑Benchmark.

PaperID: 1107, https://arxiv.org/pdf/2507.22827.pdf GitHub

Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Title: ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Abstract:
Automating the transformation of user interface (UI) designs into front‑end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi‑agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end‑to‑end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high‑quality image‑code pairs. We use this data to fine‑tune open‑source MLLM via a dual‑stage pipeline of supervised fine‑tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state‑of‑the‑art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.

PaperID: 1108, https://arxiv.org/pdf/2507.22568.pdf GitHub

Authors: Shijing Chen, Xinrui Zhou, Yuhao Wang, Yuhao Huang, Ao Chang, Dong Ni, Ruobing Huang

Title: Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound

Abstract:
Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long‑tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual‑phase framework for long‑tailed classification that mitigates distributional bias through high‑fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning‑driven adaptive sampler, dynamically calibrating synthetic‑real data ratios by training a strategic multi‑agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class‑controllable synthetic network integrates a sketch‑grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation‑free inference. Extensive experiments on an in‑house long‑tailed and a public imbalanced breast US datasets demonstrate that our method achieves promising performance compared to state‑of‑the‑art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast‑LT‑GenAug.

PaperID: 1109, https://arxiv.org/pdf/2507.21206.pdf GitHub

Authors: Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang

Title: Agentic Web: Weaving the Next Web with AI Agents

Abstract:
The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal‑driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human‑driven to machine‑to‑machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: https://github.com/SafeRL‑Lab/agentic‑web.

PaperID: 1110, https://arxiv.org/pdf/2507.21071.pdf GitHub

Authors: Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li

Title: FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Abstract:
Mobile GUI agents are becoming critical tools to improve user experience on smart devices, with multimodal large language models (MLLMs) emerging as the dominant paradigms in this domain. Current agents, however, rely on explicit human instructions, overlooking the potential to leverage the contextual information (like location, time, user profile) and historical data for proactive task suggestions. Besides, previous works focus on optimizing the success rate during task execution, but pay less attention to the personalized execution trajectory, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip 20K benchmark. We collected 20K unique human demonstrations of multi‑step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long‑term usage in their real lives, and encompass essential user‑related contextual information. The benchmark contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. Our experiments reveal that the tracks we propose pose significant challenges for leveraging user‑related information in GUI tasks. We also performed a human study to show that there exists a huge gap between existing agents and humans. The model fine‑tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user‑oriented mobile LLM agents. Our code is open‑source at https://github.com/tsinghua‑fib‑lab/FingerTip‑20K for reproducibility.

PaperID: 1111, https://arxiv.org/pdf/2507.21035.pdf GitHub

Authors: Haoyang Liu, Yijiang Li, Haohan Wang

Title: GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

Abstract:
Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi‑structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM‑based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message‑passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided‑planning framework: programming agents unfold high‑level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F_1 of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene‑phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu‑Hy/GenoMAS.

PaperID: 1112, https://arxiv.org/pdf/2507.21017.pdf GitHub

Authors: Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, Dawn Song

Title: MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them

Abstract:
Hallucinations pose critical risks for large language model (LLM)‑based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE‑Bench‑‑Measuring Illusions in Risky AGEnt settings‑‑the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM‑agent scenarios. We begin by introducing a three‑part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in deterministic and reproducible manners. To evaluate hallucination behaviors, we adopt a fine‑grained‑level LLM‑as‑a‑Judge paradigm with tailored risk‑aware prompts, enabling scalable, high‑fidelity assessment of agent actions without enumerating full action spaces. MIRAGE‑Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.

PaperID: 1113, https://arxiv.org/pdf/2507.20536.pdf GitHub

Authors: Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi

Title: T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

Abstract:
Text‑to‑Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi‑turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I‑Copilot, a training‑free multi‑agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text‑image alignment compared to direct generation. Specifically, T2I‑Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text‑image alignment, providing scores and feedback for potential regeneration. T2I‑Copilot can operate fully autonomously while also supporting human‑in‑the‑loop intervention for fine‑grained control. On GenAI‑Bench, using open‑source generation models, T2I‑Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1‑pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1‑dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI‑Labs/T2I‑Copilot.

PaperID: 1114, https://arxiv.org/pdf/2507.20293.pdf GitHub

Authors: Stepan Dergachev, Konstantin Yakovlev

Title: Decentralized Uncertainty-Aware Multi-Agent Collision Avoidance with Model Predictive Path Integral

Abstract:
Decentralized multi‑agent navigation under uncertainty is a complex task that arises in numerous robotic applications. It requires collision avoidance strategies that account for both kinematic constraints, sensing and action execution noise. In this paper, we propose a novel approach that integrates the Model Predictive Path Integral (MPPI) with a probabilistic adaptation of Optimal Reciprocal Collision Avoidance. Our method ensures safe and efficient multi‑agent navigation by incorporating probabilistic safety constraints directly into the MPPI sampling process via a Second‑Order Cone Programming formulation. This approach enables agents to operate independently using local noisy observations while maintaining safety guarantees. We validate our algorithm through extensive simulations with differential‑drive robots and benchmark it against state‑of‑the‑art methods, including ORCA‑DD and B‑UAVC. Results demonstrate that our approach outperforms them while achieving high success rates, even in densely populated environments. Additionally, validation in the Gazebo simulator confirms its practical applicability to robotic platforms. A source code is available at http://github.com/PathPlanning/MPPI‑Collision‑Avoidance.

PaperID: 1115, https://arxiv.org/pdf/2507.20145.pdf GitHub

Authors: Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno

Title: Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Abstract:
Document Understanding (DU) in long‑contextual scenarios with complex layouts remains a significant challenge in vision‑language research. Although Large Vision‑Language Models (LVLMs) excel at short‑context DU tasks, their performance declines in long‑context settings. A key limitation is the scarcity of fine‑grained training data, particularly for low‑resource languages such as Arabic. Existing state‑of‑the‑art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi‑agent interactive framework to generate long‑context questions efficiently. Our approach efficiently generates high‑quality single‑ and multi‑page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long‑context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (AraEngLongBench) are quite challenging to major open‑ and close‑source LVLMs. The code and data proposed in this work can be found in https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.

PaperID: 1116, https://arxiv.org/pdf/2507.19942.pdf GitHub

Authors: Yue Pan, Zimin Chen, Siyu Lu, Zhaoyang Chu, Xiang Li, Han Li, Yang Feng, Claire Le Goues, Federica Sarro, Martin Monperrus, He Ye

Title: Prometheus: Towards Long-Horizon Codebase Navigation for Repository-Level Problem Solving

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in automating software engineering tasks, spurring the emergence of coding agents that scaffold LLMs with external tools to resolve repository‑level problems. However, existing agents still struggle to navigate large‑scale codebases, as the Needle‑in‑a‑Haystack problem persists even with million‑token context windows, where relevant evidence is often overwhelmed by large volumes of irrelevant code and documentation. Prior codebase navigation approaches, including embedding‑based retrieval, file‑system exploration, and graph‑based retrieval, address parts of this challenge but fail to capture the temporal continuity of agent reasoning, rendering agents stateless and causing repeated repository traversals that hinder scalable planning and reasoning. To address these limitations, we present Prometheus, a memory‑centric coding agent framework for long‑horizon codebase navigation. Prometheus represents the repository as a unified knowledge graph to encode semantic dependencies and employs a context engine augmented with working memory that retains and reuses previously explored contexts to ensure continuity across reasoning steps. Built upon this engine, Prometheus integrates memory‑enhanced navigation into a multi‑agent system for automated issue resolution, encompassing issue classification, bug reproduction, patch generation, and verification. Comprehensive experiments are conducted on two widely used issue resolution benchmarks, i.e., SWE‑bench Verified and SWE‑PolyBench Verified. Powered by GPT‑5, Prometheus achieves state‑of‑the‑art performance with 74.4% and 33.8% resolution rates on the two benchmarks, ranking Top‑6 and Top‑1 among open‑source agent systems, respectively. Our data and code are available at https://github.com/EuniAI/Prometheus.

PaperID: 1117, https://arxiv.org/pdf/2507.19694.pdf GitHub

Authors: Faruk Alpay, Hamdi Alakkad, Bugra Kilictas, Taylan Alpay

Title: Ultracoarse Equilibria and Ordinal-Folding Dynamics in Operator-Algebraic Models of Infinite Multi-Agent Games

Abstract:
We develop an operator algebraic framework for infinite games with a continuum of agents and prove that regret based learning dynamics governed by a noncommutative continuity equation converge to a unique quantal response equilibrium under mild regularity assumptions. The framework unifies functional analysis, coarse geometry and game theory by assigning to every game a von Neumann algebra that represents collective strategy evolution. A reflective regret operator within this algebra drives the flow of strategy distributions and its fixed point characterises equilibrium. We introduce the ordinal folding index, a computable ordinal valued metric that measures the self referential depth of the dynamics, and show that it bounds the transfinite time needed for convergence, collapsing to zero on coarsely amenable networks. The theory yields new invariant subalgebra rigidity results, establishes existence and uniqueness of envy free and maximin share allocations in continuum economies, and links analytic properties of regret flows with empirical stability phenomena in large language models. These contributions supply a rigorous mathematical foundation for large scale multi agent systems and demonstrate the utility of ordinal metrics for equilibrium selection.

PaperID: 1118, https://arxiv.org/pdf/2507.19543.pdf GitHub

Authors: Maria Emilia Mazzolenis, Ruirui Zhang

Title: Agent WARPP: Workflow Adherence via Runtime Parallel Personalization

Abstract:
Large language models (LLMs) are increasingly applied in task‑oriented dialogue (TOD) systems but often struggle with long, conditional workflows that involve external tool calls and depend on user‑specific information. We present Workflow Adherence via Runtime Parallel Personalization, or WARPP, a training‑free, modular framework that combines multi‑agent orchestration with runtime personalization to improve workflow adherence in LLM‑based systems. By dynamically pruning conditional branches based on user attributes, the framework reduces reasoning overhead and narrows tool selection at runtime. WARPP deploys a parallelized architecture where a dedicated Personalizer agent operates alongside modular, domain‑specific agents to dynamically tailor execution paths in real time. The framework is evaluated across five representative user intents of varying complexity within three domains: banking, flights, and healthcare. Our evaluation leverages synthetic datasets and LLM‑powered simulated users to test scenarios with conditional dependencies. Our results demonstrate that WARPP outperforms both the non‑personalized method and the ReAct baseline, achieving increasingly larger gains in parameter fidelity and tool accuracy as intent complexity grows, while also reducing average token usage, without any additional training.

PaperID: 1119, https://arxiv.org/pdf/2507.19232.pdf GitHub GitHub

Authors: Donggeun Lim, Jinseok Bae, Inwoo Hwang, Seungmin Lee, Hwanhee Lee, Young Min Kim

Title: Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

Abstract:
In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi‑human contextual motion requires holistic reasoning over dynamic relationships among human‑human and human‑scene interactions. We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems such that we can generate multi‑agent behavior beyond the scale that was not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well‑defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high‑level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: https://rms0329.github.io/Event‑Driven‑Storytelling/.

PaperID: 1120, https://arxiv.org/pdf/2507.18224.pdf GitHub

Authors: Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan

Title: Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Abstract:
Multi‑agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard‑coded interaction structures, significantly limiting their adaptability to task‑specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG‑Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG‑Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG‑Designer not only achieves state‑of‑the‑art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG‑Designer is available at https://github.com/Shiy‑Li/ARG‑Designer.

PaperID: 1121, https://arxiv.org/pdf/2507.18059.pdf GitHub

Authors: Yueheng Li, Guangming Xie, Zongqing Lu

Title: Multi-Agent Guided Policy Optimization

Abstract:
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi‑Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi‑Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an autoregressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi‑agent learning. Our code and experimental data can be found in https://github.com/liyheng/MAGPO.

PaperID: 1122, https://arxiv.org/pdf/2507.17152.pdf GitHub

Authors: Fangze Lin, Ying He, Fei Yu, Hong Zhang

Title: JAM: Keypoint-Guided Joint Prediction after Classification-Aware Marginal Proposal for Multi-Agent Interaction

Abstract:
Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low‑quality generation of low‑probability modes in multi‑agent joint prediction. To tackle this issue, we propose a two‑stage multi‑agent interactive prediction framework named keypoint‑guided joint prediction after classification‑aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real‑world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state‑of‑the‑art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.

PaperID: 1123, https://arxiv.org/pdf/2507.16853.pdf GitHub

Authors: Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, Jun Wang, Weinan Zhang

Title: MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled the development of mobile agents that can understand visual inputs and follow user instructions, unlocking new possibilities for automating complex tasks on mobile devices. However, applying these models to real‑world mobile scenarios remains a significant challenge due to the long‑horizon task execution, difficulty in error recovery, and the cold‑start problem in unfamiliar environments. To address these challenges, we propose MobileUse, a GUI agent designed for robust and adaptive mobile task execution. To improve resilience in long‑horizon tasks and dynamic environments, we introduce a hierarchical reflection architecture that enables the agent to self‑monitor, detect, and recover from errors across multiple temporal scales‑ranging from individual actions to overall task completion‑while maintaining efficiency through a reflection‑on‑demand strategy. To tackle cold‑start issues, we further introduce a proactive exploration module, which enriches the agent's understanding of the environment through self‑planned exploration. Evaluations on AndroidWorld and AndroidLab benchmarks demonstrate that MobileUse establishes new state‑of‑the‑art performance, achieving success rates of 62.9% and 44.2%, respectively. To facilitate real‑world applications, we release an out‑of‑the‑box toolkit for automated task execution on physical mobile devices, which is available at https://github.com/MadeAgents/mobile‑use.

PaperID: 1124, https://arxiv.org/pdf/2507.16802.pdf GitHub

Authors: Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang

Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Abstract:
Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain‑specific requirements. We introduce the Agentar‑Fin‑R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high‑quality, systematic financial task label system with a comprehensive multi‑layered trustworthiness assurance framework. This framework encompasses high‑quality trustworthy knowledge engineering, multi‑agent trustworthy data synthesis, and rigorous data validation governance. Through label‑guided automated difficulty‑aware optimization, tow‑stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH‑500 and GPQA‑diamond. To thoroughly assess real‑world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent‑level financial reasoning and compliance verification. Experimental results demonstrate that Agentar‑Fin‑R1 not only achieves state‑of‑the‑art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high‑stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.

PaperID: 1125, https://arxiv.org/pdf/2507.15815.pdf GitHub

Authors: Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin

Title: LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

Abstract:
We present the LLM Economist, a novel framework that uses agent‑based modeling to design and assess economic policies in strategic environments with hierarchical decision‑making. At the lower level, bounded rational worker agents ‑‑ instantiated as persona‑conditioned prompts sampled from U.S. Census‑calibrated income and demographic statistics ‑‑ choose labor supply to maximize text‑based utility functions learned in‑context. At the upper level, a planner agent employs in‑context reinforcement learning to propose piecewise‑linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design ‑‑ the ultimate nudging problem ‑‑ expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona‑level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model‑based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.

PaperID: 1126, https://arxiv.org/pdf/2507.15351.pdf GitHub

Authors: Zijian Zhao, Sen Li

Title: One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

Abstract:
Order dispatch is a critical task in ride‑sharing systems with Autonomous Vehicles (AVs), directly influencing efficiency and profits. Recently, Multi‑Agent Reinforcement Learning (MARL) has emerged as a promising solution to this problem by decomposing the large state and action spaces among individual agents, effectively addressing the Curse of Dimensionality (CoD) in transportation market, which is caused by the substantial number of vehicles, passengers, and orders. However, conventional MARL‑based approaches heavily rely on accurate estimation of the value function, which becomes problematic in large‑scale, highly uncertain environments. To address this issue, we propose two novel methods that bypass value function estimation, leveraging the homogeneous property of AV fleets. First, we draw an analogy between AV fleets and groups in Group Relative Policy Optimization (GRPO), adapting it to the order dispatch task. By replacing the Proximal Policy Optimization (PPO) baseline with the group average reward‑to‑go, GRPO eliminates critic estimation errors and reduces training bias. Inspired by this baseline replacement, we further propose One‑Step Policy Optimization (OSPO), demonstrating that the optimal policy can be trained using only one‑step group rewards under a homogeneous fleet. Experiments on a real‑world ride‑hailing dataset show that both GRPO and OSPO achieve promising performance across all scenarios, efficiently optimizing pickup times and the number of served orders using simple Multilayer Perceptron (MLP) networks. Furthermore, OSPO outperforms GRPO in all scenarios, attributed to its elimination of bias caused by the bounded time horizon of GRPO. Our code, trained models, and processed data are provided at https://github.com/RS2002/OSPO .

PaperID: 1127, https://arxiv.org/pdf/2507.15245.pdf GitHub

Authors: Xiaofeng Shi, Yuduo Li, Qian Kou, Longbin Yu, Jinxin Xie, Hua Zhou

Title: SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search

Abstract:
Recent advances in large language models (LLMs) have opened new opportunities for academic literature retrieval. However, existing systems often rely on rigid pipelines and exhibit limited reasoning capabilities. We introduce SPAR, a multi‑agent framework that incorporates RefChain‑based query decomposition and query evolution to enable more flexible and effective search. To facilitate systematic evaluation, we also construct SPARBench, a challenging benchmark with expert‑annotated relevance labels. Experimental results demonstrate that SPAR substantially outperforms strong baselines, achieving up to +56% F1 on AutoScholar and +23% F1 on SPARBench over the best‑performing baseline. Together, SPAR and SPARBench provide a scalable, interpretable, and high‑performing foundation for advancing research in scholarly retrieval. Code and data will be available at: https://github.com/xiaofengShi/SPAR

PaperID: 1128, https://arxiv.org/pdf/2507.15174.pdf GitHub

Authors: Justin Turnau, Longchao Da, Khoa Vo, Ferdous Al Rafi, Shreyas Bachiraju, Tiejin Chen, Hua Wei

Title: Joint-Local Grounded Action Transformation for Sim-to-Real Transfer in Multi-Agent Traffic Control

Abstract:
Traffic Signal Control (TSC) is essential for managing urban traffic flow and reducing congestion. Reinforcement Learning (RL) offers an adaptive method for TSC by responding to dynamic traffic patterns, with multi‑agent RL (MARL) gaining traction as intersections naturally function as coordinated agents. However, due to shifts in environmental dynamics, implementing MARL‑based TSC policies in the real world often leads to a significant performance drop, known as the sim‑to‑real gap. Grounded Action Transformation (GAT) has successfully mitigated this gap in single‑agent RL for TSC, but real‑world traffic networks, which involve numerous interacting intersections, are better suited to a MARL framework. In this work, we introduce JL‑GAT, an application of GAT to MARL‑based TSC that balances scalability with enhanced grounding capability by incorporating information from neighboring agents. JL‑GAT adopts a decentralized approach to GAT, allowing for the scalability often required in real‑world traffic networks while still capturing key interactions between agents. Comprehensive experiments on various road networks under simulated adverse weather conditions, along with ablation studies, demonstrate the effectiveness of JL‑GAT. The code is publicly available at https://github.com/DaRL‑LibSignal/JL‑GAT/.

PaperID: 1129, https://arxiv.org/pdf/2507.15003.pdf GitHub

Authors: Hao Li, Haoxiang Zhang, Ahmed E. Hassan

Title: The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

Abstract:
The future of software engineering‑‑SE 3.0‑‑is unfolding with the rise of AI teammates: autonomous, goal‑driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large‑scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents‑‑OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code‑‑across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI‑native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes‑‑enabling exploration beyond synthetic benchmarks like SWE‑bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission‑‑one developer submitted as many PRs in three days as they had in three years‑‑these are structurally simpler (via code complexity metrics). We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real‑world evidence, AIDev enables a new generation of research into AI‑native workflows and supports building the next wave of symbiotic human‑AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3. > AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent

PaperID: 1130, https://arxiv.org/pdf/2507.14799.pdf GitHub GitHub

Authors: Sam Johnson, Viet Pham, Thai Le

Title: Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree

Abstract:
This work demonstrates that LLM‑based web navigation agents offer powerful automation capabilities but are vulnerable to Indirect Prompt Injection (IPI) attacks. We show that adversaries can embed universal adversarial triggers in webpage HTML to hijack agent behavior that utilizes the accessibility tree to parse HTML, causing unintended or malicious actions. Using the Greedy Coordinate Gradient (GCG) algorithm and a Browser Gym agent powered by Llama‑3.1, our system demonstrates high success rates across real websites in both targeted and general attacks, including login credential exfiltration and forced ad clicks. Our empirical results highlight critical security risks and the need for stronger defenses as LLM‑driven autonomous web agents become more widely adopted. The system software (https://github.com/sej2020/manipulating‑web‑agents) is released under the MIT License, with an accompanying publicly available demo website (http://lethaiq.github.io/attack‑web‑llm‑agent).

PaperID: 1131, https://arxiv.org/pdf/2507.14660.pdf GitHub GitHub

Authors: Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao

Title: When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Abstract:
Recent large‑scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI‑driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi‑agent systems (MAS) in complex real‑world situations are still underexplored. In this paper, we introduce a proof‑of‑concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high‑risk fields: misinformation spread and e‑commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.

PaperID: 1132, https://arxiv.org/pdf/2507.14201.pdf GitHub

Authors: Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, Anand Mudgerikar

Title: ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Abstract:
We present ExCyTIn‑Bench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real‑world security analysts must sift through a large number of heterogeneous security logs, follow multi‑hop chains of evidence to investigate threats. With the developments of LLMs, building LLM‑based agents for automatic threat investigation is a promising direction. We construct a benchmark from a controlled Azure tenant including a SQL environment covering 57 log tables from Microsoft Sentinel and related services, and 7542 generated questions. We leverage security logs extracted with expert‑crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. Our comprehensive experiments on the test set with different models confirm the difficulty of the task: the best model so far can achieve a reward of 0.606, leaving much headroom for future research. The code is available at https://github.com/microsoft/SecRL

PaperID: 1133, https://arxiv.org/pdf/2507.12806.pdf GitHub

Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

Title: MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Abstract:
The rapid rise of Large Language Models (LLMs)‑based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor‑intensive data collection, limiting practical assessment. We introduce MCPEval, an open‑source Model Context Protocol (MCP)‑based framework that automates end‑to‑end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real‑world domains show its effectiveness in revealing nuanced, domain‑specific performance. We publicly release MCPEval https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

PaperID: 1134, https://arxiv.org/pdf/2507.12741.pdf GitHub

Authors: Lotfi El Hafi, Kazuma Onishi, Shoichi Hasegawa, Akira Oyama, Tomochika Ishikawa, Masashi Osada, Carl Tornberg, Ryoma Kado, Kento Murata, Saki Hashimoto, Sebastian Carrera Villalobos, Akira Taniguchi, Gustavo Alfonso Garcia Ricardez, Yoshinobu Hagiwara, Tatsuya Aoki, Kensuke Iwata, Takato Horii, Yukiko Horikawa, Takahiro Miyashita, Tadahiro Taniguchi, Hiroshi Ishiguro

Title: Public Evaluation on Potential Social Impacts of Fully Autonomous Cybernetic Avatars for Physical Support in Daily-Life Environments: Large-Scale Demonstration and Survey at Avatar Land

Abstract:
Cybernetic avatars (CAs) are key components of an avatar‑symbiotic society, enabling individuals to overcome physical limitations through virtual agents and robotic assistants. While semi‑autonomous CAs intermittently require human teleoperation and supervision, the deployment of fully autonomous CAs remains a challenge. This study evaluates public perception and potential social impacts of fully autonomous CAs for physical support in daily life. To this end, we conducted a large‑scale demonstration and survey during Avatar Land, a 19‑day public event in Osaka, Japan, where fully autonomous robotic CAs, alongside semi‑autonomous CAs, performed daily object retrieval tasks. Specifically, we analyzed responses from 2,285 visitors who engaged with various CAs, including a subset of 333 participants who interacted with fully autonomous CAs and shared their perceptions and concerns through a survey questionnaire. The survey results indicate interest in CAs for physical support in daily life and at work. However, concerns were raised regarding task execution reliability. In contrast, cost and human‑like interaction were not dominant concerns. Project page: https://lotfielhafi.github.io/FACA‑Survey/.

PaperID: 1135, https://arxiv.org/pdf/2507.12621.pdf GitHub

Authors: Kuangshi Ai, Kaiyuan Tang, Chaoli Wang

Title: NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting

Abstract:
Traditional volume visualization (VolVis) methods, like direct volume rendering, suffer from rigid transfer function designs and high computational costs. Although novel view synthesis approaches enhance rendering efficiency, they require additional learning effort for non‑experts and lack support for semantic‑level interaction. To bridge this gap, we propose NLI4VolVis, an interactive system that enables users to explore, query, and edit volumetric scenes using natural language. NLI4VolVis integrates multi‑view semantic segmentation and vision‑language models to extract and understand semantic components in a scene. We introduce a multi‑agent large language model architecture equipped with extensive function‑calling tools to interpret user intents and execute visualization tasks. The agents leverage external tools and declarative VolVis commands to interact with the VolVis engine powered by 3D editable Gaussians, enabling open‑vocabulary object querying, real‑time scene editing, best‑view selection, and 2D stylization. We validate our system through case studies and a user study, highlighting its improved accessibility and usability in volumetric data exploration. We strongly recommend readers check our case studies, demo video, and source code at https://nli4volvis.github.io/.

PaperID: 1136, https://arxiv.org/pdf/2507.12261.pdf GitHub

Authors: Johann Frei, Nils Feldhus, Lisa Raithel, Roland Roller, Alexander Meyer, Frank Kramer

Title: Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Abstract:
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free‑form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural inconformity, we propose an end‑to‑end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5‑Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground‑truth data remain open problems.

PaperID: 1137, https://arxiv.org/pdf/2507.12110.pdf GitHub GitHub

Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Title: Topology Enhanced MARL for Multi-Vehicle Cooperative Decision-Making of CAVs

Abstract:
The exploration‑exploitation trade‑off constitutes one of the fundamental challenges in reinforcement learning (RL), which is exacerbated in multi‑agent reinforcement learning (MARL) due to the exponential growth of joint state‑action spaces. This paper proposes a topology‑enhanced MARL (TPE‑MARL) method for optimizing cooperative decision‑making of connected and autonomous vehicles (CAVs) in mixed traffic. This work presents two primary contributions: First, we construct a game topology tensor for dynamic traffic flow, effectively compressing high‑dimensional traffic state information and decrease the search space for MARL algorithms. Second, building upon the designed game topology tensor and using QMIX as the backbone RL algorithm, we establish a topology‑enhanced MARL framework incorporating visit counts and agent mutual information. Extensive simulations across varying traffic densities and CAV penetration rates demonstrate the effectiveness of TPE‑MARL. Evaluations encompassing training dynamics, exploration patterns, macroscopic traffic performance metrics, and microscopic vehicle behaviors reveal that TPE‑MARL successfully balances exploration and exploitation. Consequently, it exhibits superior performance in terms of traffic efficiency, safety, decision smoothness, and task completion. Furthermore, the algorithm demonstrates decision‑making rationality comparable to or exceeding that of human drivers in both mixed‑autonomy and fully autonomous traffic scenarios. Code of our work is available at \hrefhttps://github.com/leoPub/tpemarlhttps://github.com/leoPub/tpemarl.

PaperID: 1138, https://arxiv.org/pdf/2507.11527.pdf GitHub GitHub

Authors: Yinsheng Li, Zhen Dong, Yi Shao

Title: DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Abstract:
Large Language Model (LLM) agents have shown great potential for solving real‑world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representation task in civil engineering. DrafterBench contains twelve types of tasks summarized from real‑world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open‑source benchmark to rigorously test AI agents' proficiency in interpreting intricate and long‑context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason‑Li‑AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.

PaperID: 1139, https://arxiv.org/pdf/2507.10778.pdf GitHub

Authors: Hsiang-Wei Huang, Jen-Hao Cheng, Kuang-Ming Chen, Cheng-Yen Yang, Bahaa Alattar, Yi-Ru Lin, Pyongkun Kim, Sangwon Kim, Kwangju Kim, Chung-I Huang, Jenq-Neng Hwang

Title: Warehouse Spatial Question Answering with LLM Agent

Abstract:
Spatial understanding has been a challenging task for existing Multi‑modal Large Language Models~(MLLMs). Previous methods leverage large‑scale MLLM finetuning to enhance MLLM's spatial understanding ability. In this paper, we present a data‑efficient approach. We propose a LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent

PaperID: 1140, https://arxiv.org/pdf/2507.10539.pdf GitHub

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

Title: Graph World Model

Abstract:
World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi‑modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph‑structured states with multi‑modal information and represents diverse tasks as actions. The core of a GWM is a generic message‑passing algorithm to aggregate structured information, either over a unified multi‑modal token space by converting multi‑modal data into text (GWM‑T) or a unified multi‑modal embedding space by modality‑specific encoders (GWM‑E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi‑modal generation and matching, recommendation, graph prediction, multi‑agent, retrieval‑augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain‑specific baselines' performance, benefits from multi‑hop structures, and demonstrates strong zero‑shot/few‑shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab‑uiuc/GWM.

PaperID: 1141, https://arxiv.org/pdf/2507.10448.pdf GitHub

Authors: Yingqian Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, Zhongyu Wei

Title: FinTeam: A Multi-Agent Collaborative Intelligence System for Comprehensive Financial Scenarios

Abstract:
Financial report generation tasks range from macro‑ to micro‑economics analysis, also requiring extensive data analysis. Existing LLM models are usually fine‑tuned on simple QA tasks and cannot comprehensively analyze real financial scenarios. Given the complexity, financial companies often distribute tasks among departments. Inspired by this, we propose FinTeam, a financial multi‑agent collaborative system, with a workflow with four LLM agents: document analyzer, analyst, accountant, and consultant. We train these agents with specific financial expertise using constructed datasets. We evaluate FinTeam on comprehensive financial tasks constructed from real online investment forums, including macroeconomic, industry, and company analysis. The human evaluation shows that by combining agents, the financial reports generate from FinTeam achieved a 62.00% acceptance rate, outperforming baseline models like GPT‑4o and Xuanyuan. Additionally, FinTeam's agents demonstrate a 7.43% average improvement on FinCUGE and a 2.06% accuracy boost on FinEval. Project is available at https://github.com/FudanDISC/DISC‑FinLLM/.

PaperID: 1142, https://arxiv.org/pdf/2507.09580.pdf GitHub

Authors: Yu Wang, Yijian Liu, Liheng Ji, Han Luo, Wenjie Li, Xiaofei Zhou, Chiyun Feng, Puji Wang, Yuhan Cao, Geyuan Zhang, Xiaojian Li, Rongwu Xu, Yilei Chen, Tianxing He

Title: AICrypto: Evaluating Cryptography Capabilities of Large Language Models

Abstract:
We build AICrypto, a comprehensive benchmark designed to evaluate the cryptography capabilities of large language models (LLMs). The benchmark comprises 135 multiple‑choice questions, 150 capture‑the‑flag challenges, and 30 proof problems, covering a broad range of skills from knowledge memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to improve correctness and rigor. For each proof problem, we provide detailed scoring rubrics and reference solutions that enable automated grading, achieving high correlation with human expert evaluations. We introduce strong human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state‑of‑the‑art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and routine proofs. However, our analysis reveals that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi‑step reasoning and dynamic analysis. We hope this work could provide insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://github.com/wangyu‑ovo/aicrypto‑agent.

PaperID: 1143, https://arxiv.org/pdf/2507.09174.pdf GitHub

Authors: Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li, Edith C. H. Ngai

Title: RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking

Abstract:
The rapid proliferation of multimodal misinformation presents significant challenges for automated fact‑checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval‑augmented multi‑agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross‑verification evidence aggregation from diverse, authoritative sources; and (3) a multi‑agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web‑based evidence and multi‑agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact‑checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.

PaperID: 1144, https://arxiv.org/pdf/2507.08958.pdf GitHub

Authors: Xiaowen Zhang, Zhenyu Bi, Patrick Lachance, Xuan Wang, Tiziana Di Matteo, Rupert A. C. Croft

Title: Bridging Literature and the Universe Via A Multi-Agent Large Language Model System

Abstract:
As cosmological simulations and their associated software become increasingly complex, physicists face the challenge of searching through vast amounts of literature and user manuals to extract simulation parameters from dense academic papers, each using different models and formats. Translating these parameters into executable scripts remains a time‑consuming and error‑prone process. To improve efficiency in physics research and accelerate the cosmological simulation process, we introduce SimAgents, a multi‑agent system designed to automate both parameter configuration from the literature and preliminary analysis for cosmology research. SimAgents is powered by specialized LLM agents capable of physics reasoning, simulation software validation, and tool execution. These agents collaborate through structured communication, ensuring that extracted parameters are physically meaningful, internally consistent, and software‑compliant. We also construct a cosmological parameter extraction evaluation dataset by collecting over 40 simulations in published papers from Arxiv and leading journals that cover diverse simulation types. Experiments on the dataset demonstrate a strong performance of SimAgents, highlighting its effectiveness and potential to accelerate scientific research for physicists. Our demonstration video is available at: https://youtu.be/w1zLpm_CaWA. The complete system and dataset are publicly available at https://github.com/xwzhang98/SimAgents.

PaperID: 1145, https://arxiv.org/pdf/2507.08350.pdf GitHub

Authors: Keisuke Ueda, Wataru Hirota, Takuto Asakura, Takahiro Omi, Kosuke Takahashi, Kosuke Arima, Tatsuya Ishigaki

Title: Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

Abstract:
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi‑agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic‑side diversity within the ideation‑critique‑revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi‑agent LLM systems for scientific ideation. Our code is available at https://github.com/g6000/MultiAgent‑Research‑Ideator.

PaperID: 1146, https://arxiv.org/pdf/2507.07257.pdf GitHub

Authors: Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet

Title: Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Abstract:
We present a multi‑agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human‑in‑the‑loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state‑of‑the‑art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

PaperID: 1147, https://arxiv.org/pdf/2507.07155.pdf GitHub GitHub

Authors: Xueqing Xu, Boris Bolliet, Adrian Dimitrov, Andrew Laverick, Francisco Villaescusa-Navarro, Licong Xu, Íñigo Zubeldia

Title: Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics

Abstract:
We evaluate 9 Retrieval Augmented Generation (RAG) agent configurations on 105 Cosmology Question‑Answer (QA) pairs that we built specifically for this purpose.The RAG configurations are manually evaluated by a human expert, that is, a total of 945 generated answers were assessed. We find that currently the best RAG agent configuration is with OpenAI embedding and generative model, yielding 91.4% accuracy. Using our human evaluation results we calibrate LLM‑as‑a‑Judge (LLMaaJ) system which can be used as a robust proxy for human evaluation. These results allow us to systematically select the best RAG agent configuration for multi‑agent system for autonomous scientific discovery in astrophysics (e.g., cmbagent presented in a companion paper) and provide us with an LLMaaJ system that can be scaled to thousands of cosmology QA pairs. We make our QA dataset, human evaluation results, RAG pipelines, and LLMaaJ system publicly available for further use by the astrophysics community.

PaperID: 1148, https://arxiv.org/pdf/2507.06908.pdf GitHub

Authors: Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng

Title: MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

Abstract:
The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data‑driven approaches struggle to detect new memes due to their evolving nature and the lack of up‑to‑date annotated data. To address this issue, we propose MIND, a multi‑agent framework for zero‑shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi‑directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi‑agent debate mechanism to ensure robust decision‑making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero‑shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy‑lonely/MIND.

PaperID: 1149, https://arxiv.org/pdf/2507.06825.pdf GitHub

Authors: Matej Straka, Martin Schmid

Title: Artificial Generals Intelligence: Mastering Generals.io with Reinforcement Learning

Abstract:
We introduce a real‑time strategy game environment based on Generals.io, a game with thousands of weekly active players. Our environment is fully compatible with Gymnasium and PettingZoo and is capable of running thousands of frames per second on commodity hardware. We also present a reference agent, trained with supervised pre‑training and self‑play, which reached the top 0.003% of the 1v1 human leaderboard after only 36 hours on a single H100 GPU. To accelerate learning, we incorporate potential‑based reward shaping and memory features. Our contributions of a modular RTS benchmark and a competitive baseline agent provide an accessible yet challenging platform for advancing multi‑agent reinforcement learning research. The documented code, together with examples and tutorials, is available at https://github.com/strakam/generals‑bots.

PaperID: 1150, https://arxiv.org/pdf/2507.06690.pdf GitHub

Authors: Guobin Zhu, Rui Zhou, Wenkang Ji, Hongyin Zhang, Donglin Wang, Shiyu Zhao

Title: Multi-Task Multi-Agent Reinforcement Learning via Skill Graphs

Abstract:
Multi‑task multi‑agent reinforcement learning (MT‑MARL) has recently gained attention for its potential to enhance MARL's adaptability across multiple tasks. However, it is challenging for existing multi‑task learning methods to handle complex problems, as they are unable to handle unrelated tasks and possess limited knowledge transfer capabilities. In this paper, we propose a hierarchical approach that efficiently addresses these challenges. The high‑level module utilizes a skill graph, while the low‑level module employs a standard MARL algorithm. Our approach offers two contributions. First, we consider the MT‑MARL problem in the context of unrelated tasks, expanding the scope of MTRL. Second, the skill graph is used as the upper layer of the standard hierarchical approach, with training independent of the lower layer, effectively handling unrelated tasks and enhancing knowledge transfer capabilities. Extensive experiments are conducted to validate these advantages and demonstrate that the proposed method outperforms the latest hierarchical MAPPO algorithms. Videos and code are available at https://github.com/WindyLab/MT‑MARL‑SG

PaperID: 1151, https://arxiv.org/pdf/2507.06531.pdf GitHub

Authors: Mingjin Zeng, Nan Ouyang, Wenkang Wan, Lei Ao, Qing Cai, Kai Sheng

Title: ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture

Abstract:
Trajectory prediction for multi‑agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and foward modeling lacks explicit interactive spatio‑temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed‑anchor selection strategy, which is difficult to adapt in different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi‑agent trajectory prediction method with Inverse Learning (IL) attention and Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio‑temporal coordination of interactions, thereby enhancing the model's ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that the ILNet achieves state‑of‑the‑art performance on the INTERACTION and Argoverse motion forecasting datasets. Particularly, in challenged interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories over fewer parameters. Our codes are available at https://github.com/mjZeng11/ILNet.

PaperID: 1152, https://arxiv.org/pdf/2507.06528.pdf GitHub

Authors: Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao

Title: InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

Abstract:
Aligning Large Language Models (LLMs) with investor decision‑making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real‑user data needed for Supervised Fine‑Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high‑quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign‑generated data achieves faster parameter convergence than using real‑user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine‑tuned with InvestAlign, which demonstrates significantly closer alignment to real‑user data than pre‑SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision‑making processes under herd behavior. Our code is publicly available at https://github.com/thu‑social‑network‑research‑group/InvestAlign.

PaperID: 1153, https://arxiv.org/pdf/2507.06448.pdf GitHub

Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Title: Perception-Aware Policy Optimization for Multimodal Reasoning

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi‑step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%‑17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%‑19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception‑aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.

PaperID: 1154, https://arxiv.org/pdf/2507.05639.pdf GitHub

Authors: Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang

Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

Abstract:
In this paper, we introduce ECom‑Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e‑commerce customer support domain. ECom‑Bench features dynamic user simulation based on persona information collected from real e‑commerce customer interactions and a realistic task dataset derived from authentic e‑commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real‑world complexities, making ECom‑Bench highly challenging. For instance, even advanced models like GPT‑4o achieve only a 10‑20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e‑commerce scenarios. The code and data have been made publicly available at https://github.com/XiaoduoAILab/ECom‑Bench to facilitate further research and development in this domain.

PaperID: 1155, https://arxiv.org/pdf/2507.05275.pdf GitHub GitHub

Authors: Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Seth Overla, Shane Halse

Title: A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation

Abstract:
Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi‑Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre‑defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision‑making processes in real‑time, the FSA is designed to deliver adaptive, context‑aware feedback and provides assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human‑like supervision in simulation‑based medical education. Future work will include empirical evaluation and integration into broader educational settings. More detailed design and implementation is~\hrefhttps://github.com/2sigmaEdTech/MAS/open sourced here.

PaperID: 1156, https://arxiv.org/pdf/2507.04425.pdf GitHub

Authors: Zexin Deng, Zhenhui Yuan, Longhao Zou

Title: TeleSim: A Network-Aware Testbed and Benchmark Dataset for Telerobotic Applications

Abstract:
Telerobotic technologies are becoming increasingly essential in fields such as remote surgery, nuclear decommissioning, and space exploration. Reliable datasets and testbeds are essential for evaluating telerobotic system performance prior to real‑world deployment. However, there is a notable lack of datasets that capture the impact of network delays, as well as testbeds that realistically model the communication link between the operator and the robot. This paper introduces TeleSim, a network‑aware teleoperation dataset and testbed designed to assess the performance of telerobotic applications under diverse network conditions. TeleSim systematically collects performance data from fine manipulation tasks executed under three predefined network quality tiers: High, Medium, and Low. Each tier is characterized through controlled settings of bandwidth, latency, jitter, and packet loss. Using OMNeT++ for precise network simulation, we record a wide range of metrics, including completion time, success rates, video quality indicators (Peak Signal‑to‑Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM)), and quality of service (QoS) parameters. TeleSim comprises 300 experimental trials, providing a robust benchmark for evaluating teleoperation systems across heterogeneous network scenarios. In the worst network condition, completion time increases by 221.8% and success rate drops by 64%. Our findings reveal that network degradation leads to compounding negative impacts, notably reduced video quality and prolonged task execution, highlighting the need for adaptive, resilient teleoperation protocols. The full dataset and testbed software are publicly available on our GitHub repository: https://github.com/ConnectedRoboticsLab and YouTube channel: https://youtu.be/Fz_1iOYe104.

PaperID: 1157, https://arxiv.org/pdf/2507.04263.pdf GitHub

Authors: Liwen Xiao, Zhiyu Pan, Zhicheng Wang, Zhiguo Cao, Wei Li

Title: SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement

Abstract:
Accurate prediction of multi‑agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft‑Braid Refiner (SRefiner), guided by the soft‑braid topological structure of trajectories using Soft‑Braid Attention. Soft‑Braid Attention captures spatio‑temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at ``soft intersection points". Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi‑iteration, multi‑agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state‑of‑the‑art in trajectory refinement. Code is here https://github.com/Liwen‑Xiao/SRefiner.

PaperID: 1158, https://arxiv.org/pdf/2507.04127.pdf GitHub

Authors: Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis

Title: BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Abstract:
Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring‑your‑own") KGs. We introduce BYOKG‑RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG‑RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG‑RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG‑RAG outperforms the second‑best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG‑RAG framework is open‑sourced at https://github.com/awslabs/graphrag‑toolkit.

PaperID: 1159, https://arxiv.org/pdf/2507.03616.pdf GitHub

Authors: Yingxu Wang, Siwei Liu, Jinyuan Fang, Zaiqiao Meng

Title: EvoAgentX: An Automated Framework for Evolving Agentic Workflows

Abstract:
Multi‑agent systems (MAS) have emerged as a powerful paradigm for orchestrating large language models (LLMs) and specialized tools to collaboratively address complex tasks. However, existing MAS frameworks often require manual workflow configuration and lack native support for dynamic evolution and performance optimization. In addition, many MAS optimization algorithms are not integrated into a unified framework. In this paper, we present EvoAgentX, an open‑source platform that automates the generation, execution, and evolutionary optimization of multi‑agent workflows. EvoAgentX employs a modular architecture consisting of five core layers: the basic components, agent, workflow, evolving, and evaluation layers. Specifically, within the evolving layer, EvoAgentX integrates three MAS optimization algorithms, TextGrad, AFlow, and MIPRO, to iteratively refine agent prompts, tool configurations, and workflow topologies. We evaluate EvoAgentX on HotPotQA, MBPP, and MATH for multi‑hop reasoning, code generation, and mathematical problem solving, respectively, and further assess it on real‑world tasks using GAIA. Experimental results show that EvoAgentX consistently achieves significant performance improvements, including a 7.44% increase in HotPotQA F1, a 10.00% improvement in MBPP pass@1, a 10.00% gain in MATH solve accuracy, and an overall accuracy improvement of up to 20.00% on GAIA. The source code is available at: https://github.com/EvoAgentX/EvoAgentX

PaperID: 1160, https://arxiv.org/pdf/2507.03267.pdf GitHub

Authors: Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong

Title: GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

Abstract:
Dynamic Text‑Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real‑world systems. However, most existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for generative DyTAG tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high‑quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real‑world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG‑General, an LLM‑based multi‑agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. The dataset and source code are available at https://github.com/Lucas‑PJ/GDGB‑ALGO.

PaperID: 1161, https://arxiv.org/pdf/2507.02877.pdf GitHub

Authors: Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang

Title: AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations

Abstract:
Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time‑consuming, error‑prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM‑powered framework for rapid, reusable, and scalable generation of multi‑layered circular genome visualizations. AuraGenome combines a semantic‑driven multi‑agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM‑driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord‑based layouts to represent multi‑layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real‑time refinement and high‑quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: https://github.com/Darius18/AuraGenome.

PaperID: 1162, https://arxiv.org/pdf/2507.01702.pdf GitHub

Authors: Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, Jing Ma

Title: AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness

Abstract:
The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy‑based, model‑agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up‑to‑date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent‑based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi‑agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in‑depth, fine‑grained analyses of model‑specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.

PaperID: 1163, https://arxiv.org/pdf/2507.00980.pdf GitHub

Authors: Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li

Title: RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Abstract:
While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi‑agent observations. We propose RTMap to enhance these single‑traversal methods by persistently crowdsourcing a multi‑traversal HD map as a self‑evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end‑to‑end fashion: (1) Uncertainty‑aware positional modeling for HD map elements, (2) probabilistic‑aware localization w.r.t. the crowdsourced prior‑map, and (3) real‑time detection for possible road structural changes. Experiments on several public autonomous driving datasets demonstrate our solid performance on both the prior‑aided map quality and the localization accuracy, demonstrating our effectiveness of robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior‑map asynchronously. Our source‑code will be made publicly available at https://github.com/CN‑ADLab/RTMap.

PaperID: 1164, https://arxiv.org/pdf/2507.00979.pdf GitHub

Authors: Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee

Title: Enhancing LLM Agent Safety via Causal Influence Prompting

Abstract:
As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision‑making. CIDs provide a structured representation of cause‑and‑effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision‑making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.

PaperID: 1165, https://arxiv.org/pdf/2507.00914.pdf GitHub

Authors: Jindong Han, Yansong Ning, Zirui Yuan, Hang Ni, Fan Liu, Tengfei Lyu, Hao Liu

Title: Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

Abstract:
The long‑standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM‑powered agents that are semi‑embodied within the hybrid cyber‑physical‑social space of cities and used for system‑level urban decision‑making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real‑world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open‑source resources is maintained and continuously updated at https://github.com/usail‑hkust/Awesome‑Urban‑LLM‑Agents.

PaperID: 1166, https://arxiv.org/pdf/2507.00814.pdf GitHub

Authors: Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney

Title: Many LLMs Are More Utilitarian Than One

Abstract:
Moral judgment is integral to large language models' (LLMs) social reasoning. As multi‑agent systems gain prominence, it becomes crucial to understand how LLMs function when collaborating compared to operating as individual agents. In human moral judgment, group deliberation leads to a Utilitarian Boost: a tendency to endorse norm violations that inflict harm but maximize benefits for the greatest number of people. We study whether a similar dynamic emerges in multi‑agent LLM systems. We test six models on well‑established sets of moral dilemmas across two conditions: (1) Solo, where models reason independently, and (2) Group, where they engage in multi‑turn discussions in pairs or triads. In personal dilemmas, where agents decide whether to directly harm an individual for the benefit of others, all models rated moral violations as more acceptable when part of a group, demonstrating a Utilitarian Boost similar to that observed in humans. However, the mechanism for the Boost in LLMs differed: While humans in groups become more utilitarian due to heightened sensitivity to decision outcomes, LLM groups showed either reduced sensitivity to norms or enhanced impartiality. We report model differences in when and how strongly the Boost manifests. We also discuss prompt and agent compositions that enhance or mitigate the effect. We end with a discussion of the implications for AI alignment, multi‑agent design, and artificial moral reasoning. Code available at: https://github.com/baltaci‑r/MoralAgents

PaperID: 1167, https://arxiv.org/pdf/2506.24119.pdf GitHub GitHub

Authors: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques

Title: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Abstract:
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human‑curated problem‑answer pairs and domain‑specific reward engineering. We introduce SPIRAL, a self‑play framework where models learn by playing multi‑turn, zero‑sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need for human supervision. To enable this self‑play training at scale, we implement a fully online, multi‑turn, multi‑agent reinforcement learning system for LLMs and propose role‑conditioned advantage estimation (RAE) to stabilize multi‑agent training. SPIRAL produces reasoning capabilities that transfer broadly, improving performance by up to 10% across a suite of 8 reasoning benchmarks on 4 different models spanning Qwen and Llama model families, outperforming supervised fine‑tuning on 25,000 expert game trajectories. Multi‑game training (TicTacToe, Kuhn Poker, Simple Negotiation) yields the strongest results, with improvements observed across both base and instruction‑tuned models. Analysis of chain‑of‑thought traces reveals that games develop distinct cognitive patterns that transfer to improve reasoning performance, with different games developing complementary strengths. Even models which have already been trained on reasoning tasks using RLVR, like DeepSeek‑R1‑Distill‑Qwen‑7B, still benefit from our approach. These results demonstrate that zero‑sum games naturally develop transferable reasoning capabilities across diverse model architectures and training stages, highlighting a promising direction for autonomous reasoning development. Our code can be found in https://github.com/spiral‑rl/spiral.

PaperID: 1168, https://arxiv.org/pdf/2506.23563.pdf GitHub

Authors: Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao

Title: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Abstract:
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long‑chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long‑chain reasoning capability with diverse, open‑ended, challenging questions. First, we curate challenging questions requiring multi‑step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre‑university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open‑ended format and filtered using a multi‑model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step‑by‑step solutions, and design a reference‑based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in‑depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.

PaperID: 1169, https://arxiv.org/pdf/2506.23485.pdf GitHub

Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen

Title: Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Abstract:
Interactive recommendation is a typical information‑seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model‑powered (LLM‑powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real‑time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM‑powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought‑augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM‑powered multi‑agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought‑augmentation method that extracts high‑level thoughts from the agent's and human experts' experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at:https://github.com/Alcein/TAIRA.

PaperID: 1170, https://arxiv.org/pdf/2506.23306.pdf GitHub

Authors: Qi Liu, Can Li, Wanjing Ma

Title: GATSim: Urban Mobility Simulation with Generative Agents

Abstract:
Traditional agent‑based urban mobility simulations often rely on rigid rulebased systems that struggle to capture the complexity, adaptability, and behavioral diversity inherent in human travel decision making. Inspired by recent advancements in large language models and AI agent technologies, we introduce GATSim, a novel framework that leverages these advancements to simulate urban mobility using generative agents with dedicated cognitive structures. GATSim agents are characterized by diverse socioeconomic profiles, individual lifestyles, and evolving preferences shaped through psychologically informed memory systems and lifelong learning. The main contributions of this work are: 1) a comprehensive architecture that integrates urban mobility foundation model with agent cognitive systems and transport simulation environment; 2) a hierarchical memory designed for efficient retrieval of contextually relevant information, incorporating spatial and temporal associations; 3) planning and reactive mechanisms for modeling adaptive mobility behaviors which integrate a multi‑scale reflection process to transform specific travel experiences into generalized behavioral insights. Experiments indicate that generative agents perform competitively with human annotators in role‑playing scenarios, while naturally producing realistic macroscopic traffic patterns. The code for the prototype implementation is publicly available at https://github.com/qiliuchn/gatsim.

PaperID: 1171, https://arxiv.org/pdf/2506.23276.pdf GitHub

Authors: David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Abstract:
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self‑interest and collective well‑being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi‑agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim

PaperID: 1172, https://arxiv.org/pdf/2506.22957.pdf GitHub

Authors: Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin

Title: Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Abstract:
As large language models (LLMs) are increasingly integrated into multi‑agent and human‑AI systems, understanding their awareness of both self‑context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions‑reasoning patterns, linguistic style, and alignment preferences‑and show that LLMs reliably identify same‑family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi‑LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward‑hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity‑sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi‑agent deployments. Our code is open‑sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.

PaperID: 1173, https://arxiv.org/pdf/2506.22890.pdf GitHub

Authors: Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang

Title: CP-uniGuard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Abstract:
Collaborative Perception (CP) has been shown to be a promising technique for multi‑agent autonomous driving and multi‑agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability‑agnostic, and adaptive framework, namely, CP‑uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent's perception results. Based on this idea, we first develop a probability‑agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird's eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code is available at https://github.com/CP‑Security/CP‑uniGuard.

PaperID: 1174, https://arxiv.org/pdf/2506.22853.pdf GitHub

Authors: Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh

Title: DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Abstract:
Existing function‑calling benchmarks focus on single‑turn interactions. However, they overlook the complexity of real‑world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE‑SCORE, a metric that evaluates the dispersion of tool‑related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE‑SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE‑BENCH, a framework that constructs practical function‑calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi‑agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high‑DICE‑SCORE instances. Our experiments on 19 LLMs with DICE‑BENCH show that significant advances are still required before such models can be deployed effectively in real‑world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE‑Bench/.

PaperID: 1175, https://arxiv.org/pdf/2506.21552.pdf GitHub

Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik

Title: Whole-Body Conditioned Egocentric Video Prediction

Abstract:
We train models to Predict Ego‑centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first‑person point of view. We train an auto‑regressive conditional diffusion transformer on Nymeria, a large‑scale dataset of real‑world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real‑world environments and embodied agent behaviors with video prediction from the perspective of a human.

PaperID: 1176, https://arxiv.org/pdf/2506.21079.pdf GitHub

Authors: Yann Kerzreho

Title: Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games

Abstract:
This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite‑state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent's parameters as a slow‑evolving variable influenced by the fast‑mixing game state. Under mild assumptions‑ergodicity of the state process and continuity of the updates‑we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent's learning dynamics. An implementation of the framework is available at\,: https://github.com/yannKerzreho/MarkovGameApproximation

PaperID: 1177, https://arxiv.org/pdf/2506.20640.pdf GitHub

Authors: Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

Title: CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Abstract:
Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE‑Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a multi‑agent system designed to systematically leverage external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE‑Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.

PaperID: 1178, https://arxiv.org/pdf/2506.19923.pdf GitHub

Authors: Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai

Title: Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

Abstract:
We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on MiniF2F and solves 25 problems on the PutnamBench with a smaller sample budget than previous approaches, establishing a new state‑of‑the‑art on both benchmarks among methods using small language models (SLMs). We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at https://github.com/kAIto47802/Prover‑Agent.

PaperID: 1179, https://arxiv.org/pdf/2506.19835.pdf GitHub

Authors: Yucheng Zhou, Lingran Song, Jianbing Shen

Title: MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

Abstract:
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi‑Agent Framework for Multi‑Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM‑based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality‑specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.

PaperID: 1180, https://arxiv.org/pdf/2506.19676.pdf GitHub

Authors: Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Xiang Chen, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Chunming Wu, Muhammad Khurram Khan, Meng Han

Title: A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Abstract:
In recent years, Large‑Language‑Model‑driven AI agents have exhibited unprecedented intelligence and adaptability. Nowadays, agents are undergoing a new round of evolution. They no longer act as an isolated island like LLMs. Instead, they start to communicate with diverse external entities, such as other agents and tools, to perform complex tasks. Under this trend, agent communication is regarded as a foundational pillar of the next communication era, and many organizations have intensively begun to design related communication protocols (e.g., Anthropic's MCP and Google's A2A) within the past year. However, this new field exposes significant security hazards, which can cause severe damage to real‑world scenarios. To help researchers quickly figure out this promising topic and benefit the future agent communication development, this paper presents a comprehensive survey of agent communication security. More precisely, we present the first clear definition of agent communication. Besides, we propose a framework that categorizes agent communication into three classes and uses a three‑layered communication architecture to illustrate how each class works. Next, for each communication class, we dissect related communication protocols and analyze the security risks, illustrating which communication layer the risks arise from. Then, we provide an outlook on the possible defense countermeasures for each risk. In addition, we conduct experiments using MCP and A2A to help readers better understand the novel vulnerabilities brought by agent communication. Finally, we discuss open issues and future directions in this promising research field. We also publish a repository that maintains a list of related papers on https://github.com/theshi‑1128/awesome‑agent‑communication‑security.

PaperID: 1181, https://arxiv.org/pdf/2506.19502.pdf GitHub

Authors: Aleksandr Algazinov, Matt Laing, Paul Laban

Title: MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications

Abstract:
Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi‑agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed‑source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs the modality conversions based on the user's needs. The system is useful for assisting people with disabilities by ensuring that data will be converted to an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare service) for real‑time user assistance. Furthermore, we introduce ModCon‑Task‑Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon‑Task‑Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi‑Agent‑MATE.

PaperID: 1182, https://arxiv.org/pdf/2506.19433.pdf GitHub

Authors: Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li

Title: Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Abstract:
Vision‑and‑Language Navigation (VLN) in large‑scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end‑to‑end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial‑cognition long‑short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine‑grained voxel indexing with a semantic topology graph for high‑level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long‑term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short‑term memory (STM) caches recent multimodal entries in relative coordinates for real‑time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state‑of‑the‑art VLN with prompt‑based LLM, and state‑of‑the‑art VLN with strided‑attention MLLM), Mem4Nav yields 7‑13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open‑sourced via https://github.com/tsinghua‑fib‑lab/Mem4Nav.

PaperID: 1183, https://arxiv.org/pdf/2506.18959.pdf GitHub

Authors: Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu

Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Abstract:
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword‑based search engines are increasingly inadequate for handling complex, multi‑step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent‑based systems that plan, explore, and learn. We also introduce a test‑time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open‑source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open‑source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome‑Deep‑Research.

PaperID: 1184, https://arxiv.org/pdf/2506.18678.pdf GitHub

Authors: Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen

Title: MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation

Abstract:
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single‑agent scenarios, and fall difficulties in large‑scale scenes and long sequences. Existing NeRF‑based multi‑agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi‑agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra‑to‑inter loop closure, and online distillation for multiple submap fusion. A novel triplane‑grid joint scene representation method is proposed to improve scene reconstruction. A novel intra‑to‑inter loop closure method is designed to achieve local (single‑agent) and global (multi‑agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real‑world dataset for NeRF‑based/GS‑based SLAM that provides both continuous‑time trajectories groundtruth and high‑accuracy 3D meshes groundtruth. To this end, we propose the first real‑world Dense slam (DES) dataset covering both single‑agent and multi‑agent scenarios, ranging from small rooms to large‑scale outdoor scenes, with high‑accuracy ground truth for both 3D mesh and continuous‑time camera trajectory. This dataset can advance the development of the research in both SLAM, 3D reconstruction, and visual foundation model. Experiments on various datasets demonstrate the superiority of the proposed method in both mapping, tracking, and communication. The dataset and code will open‑source on https://github.com/dtc111111/mcnslam.

PaperID: 1185, https://arxiv.org/pdf/2506.18448.pdf GitHub GitHub

Authors: Quang Nguyen, Tri Le, Huy Nguyen, Thieu Vo, Tung D. Ta, Baoru Huang, Minh N. Vu, Anh Nguyen

Title: GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System

Abstract:
Language‑driven grasp detection has the potential to revolutionize human‑robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or finetuning step to adapt to new domains, limiting their generation in real‑world applications. In this paper, we introduce GraspMAS, a new multi‑agent system framework for language‑driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision‑making in real‑world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Intensive experiments on two large‑scale datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real‑world settings further validate the effectiveness of our approach. Our project page is available at https://zquang2202.github.io/GraspMAS

PaperID: 1186, https://arxiv.org/pdf/2506.18096.pdf GitHub

Authors: Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang

Title: Deep Research Agents: A Systematic Examination And Roadmap

Abstract:
The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi‑turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long‑horizon planning, multi‑hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API‑based retrieval methods with browser‑based exploration. We then examine modular tool‑use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single‑agent and multi‑agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: https://github.com/ai‑agents‑2030/awesome‑deep‑research‑agent.

PaperID: 1187, https://arxiv.org/pdf/2506.17878.pdf GitHub

Authors: Tam Trinh, Manh Nguyen, Truong-Son Hy

Title: Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence Retrieval

Abstract:
The rapid spread of misinformation in the digital era poses significant challenges to public discourse, necessitating robust and scalable fact‑checking solutions. Traditional human‑led fact‑checking methods, while credible, struggle with the volume and velocity of online content, prompting the integration of automated systems powered by Large Language Models (LLMs). However, existing automated approaches often face limitations, such as handling complex claims, ensuring source credibility, and maintaining transparency. This paper proposes a novel multi‑agent system for automated fact‑checking that enhances accuracy, efficiency, and explainability. The system comprises four specialized agents: an Input Ingestion Agent for claim decomposition, a Query Generation Agent for formulating targeted subqueries, an Evidence Retrieval Agent for sourcing credible evidence, and a Verdict Prediction Agent for synthesizing veracity judgments with human‑interpretable explanations. Evaluated on benchmark datasets (FEVEROUS, HOVER, SciFact), the proposed system achieves a 12.3% improvement in Macro F1‑score over baseline methods. The system effectively decomposes complex claims, retrieves reliable evidence from trusted sources, and generates transparent explanations for verification decisions. Our approach contributes to the growing field of automated fact‑checking by providing a more accurate, efficient, and transparent verification methodology that aligns with human fact‑checking practices while maintaining scalability for real‑world applications. Our source code is available at https://github.com/HySonLab/FactAgent

PaperID: 1188, https://arxiv.org/pdf/2506.17788.pdf GitHub

Authors: Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell

Title: Bayesian Social Deduction with Graph-Informed Language Models

Abstract:
Social reasoning ‑ inferring unobservable beliefs and intentions from partial observations of other agents ‑ remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test‑time inference and degrade sharply when distilled to smaller, real‑time‑capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent‑Agent play and, notably, is the first language agent to defeat human players in a controlled study ‑ achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp‑lab‑purdue.github.io/bayesian‑social‑deduction/

PaperID: 1189, https://arxiv.org/pdf/2506.16718.pdf GitHub

Authors: Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang, Junge Zhang, Zhaofeng He

Title: Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation

Abstract:
Adapting a single agent to a new multi‑agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios, Multi‑agent reinforcement learning for zero‑shot learning and Ad‑Hoc Teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative‑Competitive Adaptation (ACCA), which evaluates an agent to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi‑Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder for varying team sizes and a hypernetwork module to boost agents' learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with the learning agent. Extensive tests in benchmark scenarios like SMAC, Overcooked‑AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: https://github.com/vcis‑wangchenxu/MRDG.git

PaperID: 1190, https://arxiv.org/pdf/2506.16371.pdf GitHub

Authors: Yunhao Hou, Bochao Zou, Min Zhang, Ran Chen, Shangdong Yang, Yanmei Zhang, Junbao Zhuo, Siheng Chen, Jiansheng Chen, Huimin Ma

Title: AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios

Abstract:
By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. While most previous work focus on vehicle‑to‑vehicle and vehicle‑to‑infrastructure collaboration, with limited attention to aerial perspectives provided by UAVs, which uniquely offer dynamic, top‑down views to alleviate occlusions and monitor large‑scale interactive environments. A major reason for this is the lack of high‑quality datasets for aerial‑ground collaborative scenarios. To bridge this gap, we present AGC‑Drive, the first large‑scale real‑world dataset for Aerial‑Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward‑facing camera and a LiDAR sensor, enabling comprehensive multi‑view and multi‑agent perception. Consisting of approximately 80K LiDAR frames and 360K images, the dataset covers 14 diverse real‑world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 17% of the data comprises dynamic interaction events, including vehicle cut‑ins, cut‑outs, and frequent lane changes. AGC‑Drive contains 350 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle‑to‑vehicle collaborative perception and vehicle‑to‑UAV collaborative perception. Additionally, we release an open‑source toolkit, including spatiotemporal alignment verification tools, multi‑agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC‑Drive.

PaperID: 1191, https://arxiv.org/pdf/2506.15928.pdf GitHub

Authors: Myke C. Cohen, Zhe Su, Hsien-Te Kao, Daniel Nguyen, Spencer Lynch, Maarten Sap, Svitlana Volkova

Title: Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues

Abstract:
This paper presents an evaluation framework for agentic AI systems in mission‑critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM‑simulated social negotiation outcomes‑‑a capability essential for a variety of applications involving cross‑team coordination and civil‑military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, through which we found that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine‑grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high‑stakes operational scenarios. Experiment 2 evaluates human‑AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, adaptability, demonstrating how AI agent trustworthiness impact mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human‑agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.

PaperID: 1192, https://arxiv.org/pdf/2506.15695.pdf GitHub

Authors: Xinxing Ren, Qianbo Zang, Zekun Guo

Title: SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models

Abstract:
Recent advances in large language models (LLMs) have shown impressive performance in mathematical reasoning and code generation. However, LLMs still struggle in the simulation domain, particularly in generating Simulink models, which are essential tools in engineering and scientific research. Our preliminary experiments indicate that LLM agents often fail to produce reliable and complete Simulink simulation code from text‑only inputs, likely due to the lack of Simulink‑specific data in their pretraining. To address this challenge, we propose SimuGen, a multimodal agent‑based framework that automatically generates accurate Simulink simulation code by leveraging both the visual Simulink diagram and domain knowledge. SimuGen coordinates several specialized agents, including an investigator, unit test reviewer, code generator, executor, debug locator, and report writer, supported by a domain‑specific knowledge base. This collaborative and modular design enables interpretable, robust, and reproducible Simulink simulation generation. Our source code is publicly available at https://github.com/renxinxing123/SimuGen_beta.

PaperID: 1193, https://arxiv.org/pdf/2506.15672.pdf GitHub

Authors: Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, Volker Tresp

Title: SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

Abstract:
The rapid progress of Large Language Models has advanced agentic systems in decision‑making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from‑scratch agent generation, self‑optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language‑driven exploration. To enable efficient search over system‑level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback‑guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real‑world, open‑ended, and exploratory tasks involving high‑level planning, system‑level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated system multi‑agent generation. Our code is publicly released at https://yaoz720.github.io/SwarmAgentic/.

PaperID: 1194, https://arxiv.org/pdf/2506.15451.pdf GitHub

Authors: Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao

Title: AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

Abstract:
Large language model based multi‑agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross‑domain generalizability, and performance guarantees, particularly as task complexity and number of agents increases. We introduces AgentGroupChat‑V2, a novel framework addressing these challenges through three core innovations: (1) a divide‑and‑conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing. (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics. (3) agent organization optimization strategies combining divide‑and‑conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat‑V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition‑level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state‑of‑the‑art baselines. These results confirm that AgentGroupChat‑V2 provides a comprehensive solution for building efficient, general‑purpose LLM multi‑agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat‑V2.

PaperID: 1195, https://arxiv.org/pdf/2506.15253.pdf GitHub

Authors: Yuchuan Fu, Xiaohan Yuan, Dongxia Wang

Title: RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments

Abstract:
The rapid deployment of Large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS‑Eval, a comprehensive security benchmark supporting both simulated and real‑world tool execution. RAS‑Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats. We evaluate 6 state‑of‑the‑art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings. Notably, scaling laws held for security capabilities, with larger models outperforming smaller counterparts. Our findings expose critical risks in real‑world agent deployments and provide a foundational framework for future security research. Code and data are available at https://github.com/lanzer‑tree/RAS‑Eval.

PaperID: 1196, https://arxiv.org/pdf/2506.15085.pdf GitHub GitHub

Authors: Paige Tuttösí, Shivam Mehta, Zachary Syvenky, Bermet Burkanova, Gustav Eje Henter, Angelica Lim

Title: EmojiVoice: Towards long-term controllable expressivity in robot speech

Abstract:
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with ``expressive'' joyful voices, they lack this long‑term variation found in human speech. Foundation model text‑to‑speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text‑to‑speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji‑prompting to allow fine‑grained control of expressivity on a phase level and use the lightweight Matcha‑TTS backbone to generate speech in real‑time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech‑to‑speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but expressive voice was not preferred in the assistant use case.

PaperID: 1197, https://arxiv.org/pdf/2506.14477.pdf GitHub

Authors: Jingqi Yang, Zhilong Song, Jiawei Chen, Mingli Song, Sheng Zhou, linjun sun, Xiaogang Ouyang, Chun Chen, Can Wang

Title: GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies

Abstract:
The development of high‑quality datasets is crucial for benchmarking and advancing research in Graphical User Interface (GUI) agents. Despite their importance, existing datasets are often constructed under idealized conditions, overlooking the diverse anomalies frequently encountered in real‑world deployments. To address this limitation, we introduce GUI‑Robust, a novel dataset designed for comprehensive GUI agent evaluation, explicitly incorporating seven common types of anomalies observed in everyday GUI interactions. Furthermore, we propose a semi‑automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools and then generate corresponding step and task descriptions for these actions with the assistance of MLLMs. This paradigm significantly reduces annotation time cost by a factor of over 19 times. Finally, we assess state‑of‑the‑art GUI agents using the GUI‑Robust dataset, revealing their substantial performance degradation in abnormal scenarios. We anticipate that our work will highlight the importance of robustness in GUI agents and inspires more future research in this direction. The dataset and code are available at https://github.com/chessbean1/GUI‑Robust..

PaperID: 1198, https://arxiv.org/pdf/2506.14234.pdf GitHub

Authors: Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez

Title: Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team

Abstract:
Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation ‑ treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers ‑ such as Olympiad or programming contest teams ‑ leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training‑free multi‑agent reasoning framework that equips a black‑box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self‑retrieval, tool use, collaborative interactions, agent‑driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch ‑ marking a transition from isolated inference toward experience‑aware language agents. Built on both open‑weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ‑32B), it often surpasses advanced models including Qwen3‑235B, Gemini 2.5 Pro, o3, and o4‑mini‑high. With o3‑mini‑high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math‑500 (99.8%), and LiveCodeBench‑V5 (91.6%) ‑ highlighting holistic experience learning as a key step toward generalist agents capable of expert‑level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

PaperID: 1199, https://arxiv.org/pdf/2506.14205.pdf GitHub

Authors: Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song

Title: AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Abstract:
We introduce AgentSynth, a scalable and cost‑efficient pipeline for automatically synthesizing high‑quality tasks and trajectory datasets for generalist computer‑use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long‑horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state‑of‑the‑art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of 0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are available at https://github.com/sunblaze‑ucb/AgentSynth

PaperID: 1200, https://arxiv.org/pdf/2506.13666.pdf GitHub

Authors: Junfeng Fang, Zijun Yao, Ruipeng Wang, Haokai Ma, Xiang Wang, Tat-Seng Chua

Title: We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems

Abstract:
The development of large language models (LLMs) has entered in a experience‑driven era, flagged by the emergence of environment feedback‑driven learning via reinforcement learning and tool‑using agents. This encourages the emergenece of model context protocol (MCP), which defines the standard on how should a LLM interact with external services, such as \api and data. However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks. In particular, MCP introduces third‑party services, which are not controlled by the LLM developers, into the agent systems. These third‑party MCP services provider are potentially malicious and have the economic incentives to exploit vulnerabilities and sabotage user‑agent interactions. In this position paper, we advocate the research community in LLM safety to pay close attention to the new safety risks issues introduced by MCP, and develop new techniques to build safe MCP‑powered agent systems. To establish our position, we argue with three key parts. (1) We first construct \framework, a controlled framework to examine safety issues in MCP‑powered agent systems. (2) We then conduct a series of pilot experiments to demonstrate the safety risks in MCP‑powered agent systems is a real threat and its defense is not trivial. (3) Finally, we give our outlook by showing a roadmap to build safe MCP‑powered agent systems. In particular, we would call for researchers to persue the following research directions: red teaming, MCP safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguard, and MCP safe ecosystem construction. We hope this position paper can raise the awareness of the research community in MCP safety and encourage more researchers to join this important research direction. Our code is available at https://github.com/littlelittlenine/SafeMCP.git.

PaperID: 1201, https://arxiv.org/pdf/2506.12841.pdf GitHub

Authors: Xinyuan Xia, Yuanyi Song, Haomin Ma, Jinyu Cai

Title: WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench

Abstract:
With the rapid development of LLM‑based agents, increasing attention has been given to their social interaction and strategic reasoning capabilities. However, existing Werewolf‑based benchmarking platforms suffer from overly simplified game settings, incomplete evaluation metrics, and poor scalability. To address these limitations, we propose WereWolf‑Plus, a multi‑model, multi‑dimensional, and multi‑method benchmarking platform for evaluating multi‑agent strategic reasoning in the Werewolf game. The platform offers strong extensibility, supporting customizable configurations for roles such as Seer, Witch, Hunter, Guard, and Sheriff, along with flexible model assignment and reasoning enhancement strategies for different roles. In addition, we introduce a comprehensive set of quantitative evaluation metrics for all special roles, werewolves, and the sheriff, and enrich the assessment dimensions for agent reasoning ability, cooperation capacity, and social influence. WereWolf‑Plus provides a more flexible and reliable environment for advancing research on inference and strategic interaction within multi‑agent communities. Our code is open sourced at https://github.com/MinstrelsyXia/WereWolfPlus.

PaperID: 1202, https://arxiv.org/pdf/2506.12104.pdf GitHub

Authors: Hao Li, Xiaogeng Liu, Hung-Chun Chiu, Dianqi Li, Ning Zhang, Chaowei Xiao

Title: DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents

Abstract:
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. By interacting with external environments through predefined tools, these agents can carry out complex user tasks. Nonetheless, this interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior, potentially resulting in economic loss, privacy leakage, or system compromise. System‑level defenses have recently shown promise by enforcing static or predefined policies, but they still face two key challenges: the ability to dynamically update security rules and the need for memory stream isolation. To address these challenges, we propose Dynamic Rule‑based Isolation Framework for Trustworthy agentic systems (DRIFT), which enforces the dynamic security policy and injection isolation for securing LLM agents against prompt injection attacks. A Secure Planner first constructs a minimal function trajectory and a JSON‑schema‑style parameter checklist for each function node based on the user query. A Dynamic Validator then monitors deviations from the original plan, assessing whether changes comply with privilege limitations and the user's intent. Finally, an Injection Isolator detects and masks any instructions that may conflict with the user query from the memory stream to mitigate long‑term risks. We empirically validate the effectiveness of DRIFT on the AgentDojo, ASB, and AgentDyn benchmark, demonstrating its strong security performance while maintaining high utility across diverse models, showcasing both its robustness and adaptability. The project website is available at https://safo‑lab.github.io/DRIFT.

PaperID: 1203, https://arxiv.org/pdf/2506.11773.pdf GitHub

Authors: Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath, Jiaman He, Thomas Plötz

Title: AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

Abstract:
A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents ‑‑ virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine‑grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents' activities. Our approach produces rich, privacy‑preserving sensor data that reflects real‑world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low‑resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real‑world datasets. These results highlight the potential of using LLM‑guided embodied agents for scalable and cost‑effective sensor data generation in HAR. Our code is publicly available at https://github.com/ZikangLeng/AgentSense.

PaperID: 1204, https://arxiv.org/pdf/2506.11077.pdf GitHub

Authors: Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu

Title: CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling

Abstract:
Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek‑R1, harness test‑time scaling to perform multi‑step reasoning for complex problem‑solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self‑evaluative reflection. These transition markers and reflective cues are referred to as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test‑time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over‑reflection and under‑reflection, can degrade model performance. To better understand this trade‑off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, We propose cyclical reflection token scheduling (termed CyclicReflex), a training‑free decoding strategy that dynamically modulates reflection token logits with a bidirectional, position‑dependent triangular waveform, incurring no additional computation cost. Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond and LiveCodeBench demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B‑14B), outperforming standard decoding and recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML‑Group/CyclicReflex.

PaperID: 1205, https://arxiv.org/pdf/2506.10974.pdf GitHub

Authors: Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Zhuoyun Yu, Shuofei Qiao, Jintian Zhang, Da Zheng, Yuren Mao, Yunjun Gao, Huajun Chen, Ningyu Zhang

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Abstract:
Large Language Model (LLM) agents have shown great potential in addressing real‑world data science problems. LLM‑driven data science agents promise to automate the entire machine learning pipeline, yet their real‑world effectiveness remains limited. Existing frameworks depend on rigid, pre‑defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM‑agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self‑adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state‑of‑the‑art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. Code is at https://github.com/innovatingAI/AutoMind.

PaperID: 1206, https://arxiv.org/pdf/2506.10764.pdf GitHub GitHub

Authors: Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen

Title: OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT‑BENCH, a comprehensive benchmark designed to evaluate LLM agents on large‑scale search space optimization problems. OPT‑BENCH includes 20 real‑world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT‑Agent, an end‑to‑end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state‑of‑the‑art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open‑sourced to promote further research in advancing LLM‑driven optimization and iterative reasoning. Project page: \hrefhttps://github.com/OliverLeeXZ/OPT‑BENCHhttps://github.com/OliverLeeXZ/OPT‑BENCH.

PaperID: 1207, https://arxiv.org/pdf/2506.10326.pdf GitHub

Authors: Cameron Angliss, Jiaxun Cui, Jiaheng Hu, Arrasy Rahman, Peter Stone

Title: VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon

Abstract:
Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi‑agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately 10^139 team configurations, far larger than those of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC‑Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human‑play dataset of over 700,000 battle logs and a range of baseline agents based on heuristics, large language models, behavior cloning, and multi‑agent reinforcement learning with empirical game‑theoretic methods such as self‑play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that as the number of teams increases, the best‑performing algorithm in the single‑team setting has worse performance and is more exploitable, but has improved generalization to unseen teams. Our code and dataset are open‑sourced at https://github.com/cameronangliss/vgc‑bench and https://huggingface.co/datasets/cameronangliss/vgc‑battle‑logs.

PaperID: 1208, https://arxiv.org/pdf/2506.09513.pdf GitHub

Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong

Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Abstract:
Reasoning‑based large language models have excelled in mathematics and programming, yet their potential in knowledge‑intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high‑quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost‑efficient easy‑medium‑difficult (EMD) pipeline. ReasonMed is built through a multi‑agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error‑prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine‑tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed‑7B surpasses the prior best sub‑10B models by 4.17% and even exceeds LLaMA3.1‑70B on PubMedQA by 4.60%. When scaled to ReasonMed‑14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun‑Work/ReasonMed.

PaperID: 1209, https://arxiv.org/pdf/2506.09049.pdf GitHub

Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

Title: VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Abstract:
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception‑driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi‑agent planning, a few have begun to explore vision‑language models (VLMs) for visual reasoning. However, these VLM‑based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI‑Bench, the first hierarchical benchmark tailored for embodied multi‑agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI‑Bench includes diverse robot embodiments, multi‑view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI‑Bench, we propose VIKI‑R, a two‑stage framework that fine‑tunes a pretrained vision‑language model (VLM) using Chain‑of‑Thought annotated demonstrations, followed by reinforcement learning under multi‑level reward signals. Our extensive experiments show that VIKI‑R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI‑Bench and VIKI‑R offer a unified testbed and method for advancing multi‑agent, visual‑driven cooperation in embodied AI systems.

PaperID: 1210, https://arxiv.org/pdf/2506.08403.pdf GitHub GitHub

Authors: Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Hu Song, Linfeng Zhang

Title: TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Abstract:
Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi‑agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi‑agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi‑agent framework called TACTIC, which stands for T ranslation A gents with Cognitive‑ T heoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory‑grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high‑quality translation. Experimental results on diverse language pairs from the FLORES‑200 and WMT24 benchmarks show that our method consistently achieves state‑of‑the‑art performance. Using DeepSeek‑V3 as the base model, TACTIC surpasses GPT‑4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI‑23. Compared to DeepSeek‑R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI‑23. Code is available at https://github.com/weiyali126/TACTIC.

PaperID: 1211, https://arxiv.org/pdf/2506.08352.pdf GitHub

Authors: Wentao Shi, Yiqing Shen

Title: Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models

Abstract:
Large language models (LLMs) can face factual limitations when responding to time‑sensitive queries about recent events that arise after their knowledge thresholds in the training corpus. Existing search‑augmented approaches fall into two categories, each with distinct limitations: multi‑agent search frameworks incur substantial computational overhead by separating search planning and response synthesis across multiple LLMs, while single‑LLM tool‑calling methods restrict themselves to sequential planned, single‑query searches from sole search sources. We present Reasoning‑Search (R‑Search), a single‑LLM search framework that unifies multi‑step planning, multi‑source search execution, and answer synthesis within one coherent inference process. Innovatively, it structure the output into four explicitly defined components, including reasoning steps that guide the search process (<think>), a natural‑language directed acyclic graph that represents the search plans with respect to diverse sources (<search>), retrieved results from executing the search plans (<result>), and synthesized final answers (<answer>). To enable effective generation of these structured outputs, we propose a specialized Reinforcement Fine‑Tuning (ReFT) method based on GRPO, together with a multi‑component reward function that optimizes LLM's answer correctness, structural validity of the generated DAG, and adherence to the defined output format. Experimental evaluation on FinSearchBench‑24, SearchExpertBench‑25, and seven Q and A benchmarks demonstrates that R‑Search outperforms state‑of‑the‑art methods, while achieving substantial efficiency gains through 70% reduction in context token usage and approximately 50% decrease in execution latency. Code is available at https://github.com/wentao0429/Reasoning‑search.

PaperID: 1212, https://arxiv.org/pdf/2506.08292.pdf GitHub

Authors: Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han

Title: From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

Abstract:
Multi‑agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi‑LLM coordination as an incomplete‑information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement‑learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co‑agents, without requiring costly inter‑agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non‑equilibrium multi‑agent schemes. Empirically, ECON outperforms existing multi‑LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi‑LLM ensembles. The code is publicly available at: https://github.com/tmlr‑group/ECON.

PaperID: 1213, https://arxiv.org/pdf/2506.08119.pdf GitHub

Authors: Subhrangshu Nandi, Arghya Datta, Rohith Nama, Udita Patel, Nikhil Vichare, Indranil Bhattacharya, Prince Grover, Shivam Asija, Giuseppe Carenini, Wei Zhang, Arushi Gupta, Sreyoshi Bhaduri, Jing Xu, Huzefa Raja, Shayan Ray, Aaron Chan, Esther Xu Fei, Gaoyuan Du, Zuhaib Akhtar, Harshita Asnani, Weian Chan, Ming Xiong, Francesco Carbone, Jeetu Mirchandani

Title: SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

Abstract:
LLM‑based agents struggle to execute complex, multi‑step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool orchestration demands of real‑world workflows. We introduce SOP‑Bench, a benchmark of 2,000+ tasks from human expert‑authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human‑AI collaborative framework, experts crafted authentic SOPs while AI generated artifacts (tools, APIs, datasets), all human‑validated, yielding realistic tasks with executable interfaces and ground‑truth outputs. SOP‑Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. We demonstrate its utility through illustrative experiments with a subset of frontier models across Function‑Calling (FC) and ReAct agents, revealing critical insights. For example, (1) newer models do not guarantee better performance ‑ Claude 4 family outperforms Claude 4.5 family on ReAct tasks (Claude 4 Opus: 72.4% vs. Claude 4.5 Sonnet: 63.3% task success rate), demonstrating that production upgrades require validation; (2) no single model‑agent combination dominates: best performances range from 57% to 100% depending on domain. These examples illustrate how SOP‑Bench enables isolating and studying specific dimensions of agent performance without costly production experiments. Our goal is not to rank model capabilities or build optimal agents, but to provide a rigorous evaluation framework that enables the researchers and practitioners to systematically investigate agent design choices, model selection, and deployment strategies. We release the benchmark at https://github.com/amazon‑science/sop‑bench.

PaperID: 1214, https://arxiv.org/pdf/2506.08006.pdf GitHub

Authors: Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou

Title: Dreamland: Controllable World Creation with Simulator and Generative Models

Abstract:
Large‑scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element‑wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics‑based simulator and the photorealistic content output of large‑scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel‑level and object‑level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real‑world distributions, and supports off‑the‑shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.

PaperID: 1215, https://arxiv.org/pdf/2506.07829.pdf GitHub

Authors: Jan Corazza, Hadi Partovi Aria, Hyohun Kim, Daniel Neider, Zhe Xu

Title: Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information

Abstract:
Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real‑world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi‑Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high‑level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.

PaperID: 1216, https://arxiv.org/pdf/2506.07548.pdf GitHub

Authors: Weiqiang Jin, Yang Liu, Shixiang Tang, Jinhu Qi, Wentao Zhang, Junli Wang, Biao Zhao, Hongyang Du

Title: Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage

Abstract:
Multi‑agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static‑difficulty regime environmental meta‑stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL‑MARL, a dynamic curriculum learning framework that adapts opponent strength online from win‑rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum‑based trend estimation with sliding‑window dual‑curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non‑stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO‑style group‑relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi‑Agent Challenge (SMAC), CL‑MARL attains a 40% mean win rate on the super‑hard maps with an average episode return of 17.85, exceeding the QMIX, OW‑QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE‑HKU/CL2MARL‑SMAC.

PaperID: 1217, https://arxiv.org/pdf/2506.07400.pdf GitHub

Authors: Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu

Title: MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models

Abstract:
The integration of deep learning‑based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain‑specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi‑agent diagnostic framework and platform that combines specialized vision models with multiple role‑specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue‑M2/MedChat.

PaperID: 1218, https://arxiv.org/pdf/2506.07398.pdf GitHub

Authors: Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan

Title: G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Abstract:
Large language model (LLM)‑powered multi‑agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self‑evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter‑agent collaboration trajectories, and (2) lack cross‑trial and agent‑specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G‑Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three‑tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G‑Memory performs bi‑directional memory traversal to retrieve both high‑level, generalizable insights that enable the system to leverage cross‑trial knowledge, and fine‑grained, condensed interaction trajectories that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G‑Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.

PaperID: 1219, https://arxiv.org/pdf/2506.07106.pdf GitHub

Authors: Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin

Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Abstract:
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain‑of‑Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem‑of‑Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self‑Consistency, and CoT‑Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem‑of‑thought.

PaperID: 1220, https://arxiv.org/pdf/2506.07016.pdf GitHub

Authors: Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha

Title: MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Abstract:
Large multimodal models (LMMs) have shown remarkable progress in audio‑visual understanding, yet they struggle with real‑world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large‑scale, audio‑visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV‑HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio‑visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi‑video retrieval and temporal grounding task. Additionally, we propose a model‑agnostic, multi‑agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystacks. To enable robust evaluation of multi‑video retrieval and temporal grounding for optimal response generation, we introduce two new metrics, STEM, which captures alignment errors between a ground truth and a predicted step sequence and MTGS, to facilitate balanced and interpretable evaluation of segment‑level grounding performance. Project: https://schowdhury671.github.io/magnet_project/

PaperID: 1221, https://arxiv.org/pdf/2506.06541.pdf GitHub

Authors: Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska

Title: KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Abstract:
Discovering insights from a real‑world data lake potentially containing unclean, semi‑structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project‑specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data‑lake‑to‑insight challenges remain unclear. We introduce KramaBench which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end‑to‑end capabilities of AI systems to solve challenges which require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single‑agent reference framework DS‑Guru, alongside both open‑ and closed‑source single‑ and multi‑agent systems, and find that while current agentic systems may handle isolated data‑science tasks and generate plausible draft pipelines, they struggle with producing working end‑to‑end pipelines. On KramaBench, the best system reaches only 55% end‑to‑end accuracy in the full data‑lake setting. Even with perfect retrieval, the accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can only fully implement 20% of individual data tasks. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.

PaperID: 1222, https://arxiv.org/pdf/2506.06326.pdf GitHub

Authors: Jiazheng Kang, Mingming Ji, Zhe Zhao, Ting Bai

Title: Memory OS of AI Agent

Abstract:
Large Language Models (LLMs) face a crucial challenge from fixed context windows and inadequate memory management, leading to a severe shortage of long‑term memory capabilities and limited personalization in the interactive experience with AI agents. To overcome this challenge, we innovatively propose a Memory Operating System, i.e., MemoryOS, to achieve comprehensive and efficient memory management for AI agents. Inspired by the memory management principles in operating systems, MemoryOS designs a hierarchical storage architecture and consists of four key modules: Memory Storage, Updating, Retrieval, and Generation. Specifically, the architecture comprises three levels of storage units: short‑term memory, mid‑term memory, and long‑term personal memory. Key operations within MemoryOS include dynamic updates between storage units: short‑term to mid‑term updates follow a dialogue‑chain‑based FIFO principle, while mid‑term to long‑term updates use a segmented page organization strategy. Our pioneering MemoryOS enables hierarchical memory integration and dynamic updating. Extensive experiments on the LoCoMo benchmark show an average improvement of 49.11% on F1 and 46.18% on BLEU‑1 over the baselines on GPT‑4o‑mini, showing contextual coherence and personalized memory retention in long conversations. The implementation code is open‑sourced at https://github.com/BAI‑LAB/MemoryOS.

PaperID: 1223, https://arxiv.org/pdf/2506.06020.pdf GitHub

Authors: Zeqi Zhou, Fang Wu, Shayan Talaei, Haokai Zhao, Cheng Meixin, Tinson Xu, Amin Saberi, Yejin Choi

Title: When to Trust Context: Self-Reflective Debates for Context Reliability

Abstract:
Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self‑Reflective Debate for Contextual Reliability (SR‑DCR), a lightweight framework that integrates token‑level self‑confidence with an asymmetric multi‑agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context's reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR‑DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence‑only baselines with minimal computational overhead. The code is available at https://github.com/smiles724/Self‑Reflective‑Debates.

PaperID: 1224, https://arxiv.org/pdf/2506.05542.pdf GitHub

Authors: Vlastimil Martinek, Andrea Gariboldi, Dimosthenis Tzimotoudis, Aitor Alberdi Escudero, Edward Blake, David Cechak, Luke Cassar, Alessandro Balestrucci, Panagiotis Alexiou

Title: Agentomics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data

Abstract:
The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model‑based agents have shown promise for automating end‑to‑end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics‑ML, a fully autonomous agent‑based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics‑ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state‑of‑the‑art agent‑based methods in both generalization and success rates. While state‑of‑the‑art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics‑ML narrows the gap for fully autonomous systems and achieves state‑of‑the‑art performance on one of the used benchmark datasets. The code is available at https://github.com/BioGeMT/Agentomics‑ML.

PaperID: 1225, https://arxiv.org/pdf/2506.05349.pdf GitHub

Authors: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Shahbaz Khan

Title: VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Abstract:
Mathematical reasoning in real‑world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine‑grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non‑linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross‑modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate‑level experts to ensure high quality, totaling over 920 man‑hours of annotation. To reflect real‑world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi‑step reasoning over extended explanations and partially worked‑out solutions. Each question includes multi‑step reasoning annotations, enabling fine‑grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality‑rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai‑oryx.github.io/VideoMathQA

PaperID: 1226, https://arxiv.org/pdf/2506.05010.pdf GitHub GitHub

Authors: Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang

Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Abstract:
We introduce ComfyUI‑Copilot, a large language model‑powered plugin designed to enhance the usability and efficiency of ComfyUI, an open‑source platform for AI‑driven art creation. Despite its flexibility and user‑friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI‑Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one‑click workflow construction. At its core, the system employs a hierarchical multi‑agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI‑Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI‑Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI‑Copilot installation package and a demo video are available at https://github.com/AIDC‑AI/ComfyUI‑Copilot.

PaperID: 1227, https://arxiv.org/pdf/2506.04832.pdf GitHub

Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Title: Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Abstract:
Large Reasoning Models (LRMs) extend large language models with explicit, multi‑step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, becoming a new and hard‑to‑detect source of hallucination. Existing hallucination detection methods focus primarily on answer‑level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision‑making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter‑sample consistency of reasoning traces, entropy‑based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. The joint utilization of these signals makes RACE a more robust detector of hallucinations in LRMs. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. The source code is available at https://github.com/bebr2/RACE

PaperID: 1228, https://arxiv.org/pdf/2506.04308.pdf GitHub

Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang

Title: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Abstract:
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction‑indicated locations for interaction. To this end, we propose RoboRefer, a 3D‑aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine‑tuning (SFT). Moreover, RoboRefer advances generalized multi‑step spatial reasoning via reinforcement fine‑tuning (RFT), with metric‑sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large‑scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial‑Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi‑step reasoning. Experiments show that SFT‑trained RoboRefer achieves state‑of‑the‑art spatial understanding, with an average success rate of 89.6%. RFT‑trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini‑2.5‑Pro by 17.4% in average accuracy on RefSpatial‑Bench. Notably, RoboRefer can be integrated with various control policies to execute long‑horizon, dynamic tasks across diverse robots (e,g., UR5, G1 humanoid) in cluttered real‑world scenes.

PaperID: 1229, https://arxiv.org/pdf/2506.04255.pdf GitHub GitHub

Authors: Kunal Pai, Parth Shah, Harshil Patel

Title: HASHIRU: Hierarchical Agent System for Hybrid Intelligent Resource Utilization

Abstract:
Rapid Large Language Model (LLM) advancements are fueling autonomous Multi‑Agent System (MAS) development. However, current frameworks often lack flexibility, resource awareness, model diversity, and autonomous tool creation. This paper introduces HASHIRU (Hierarchical Agent System for Hybrid Intelligent Resource Utilization), a novel MAS framework enhancing flexibility, resource efficiency, and adaptability. HASHIRU features a "CEO" agent dynamically managing specialized "employee" agents, instantiated based on task needs and resource constraints (cost, memory). Its hybrid intelligence prioritizes smaller, local LLMs (via Ollama) while flexibly using external APIs and larger models when necessary. An economic model with hiring/firing costs promotes team stability and efficient resource allocation. The system also includes autonomous API tool creation and a memory function. Evaluations on tasks like academic paper review (58% success), safety assessments (100% on a JailbreakBench subset), and complex reasoning (outperforming Gemini 2.0 Flash on GSM8K: 96% vs. 61%; JEEBench: 80% vs. 68.3%; SVAMP: 92% vs. 84%) demonstrate HASHIRU's capabilities. Case studies illustrate its self‑improvement via autonomous cost model generation, tool integration, and budget management. HASHIRU offers a promising approach for more robust, efficient, and adaptable MAS through dynamic hierarchical control, resource‑aware hybrid intelligence, and autonomous functional extension. Source code and benchmarks are available at https://github.com/HASHIRU‑AI/HASHIRU and https://github.com/HASHIRU‑AI/HASHIRUBench respectively, and a live demo is available at https://hashiruagentx‑hashiruai.hf.space upon request.

PaperID: 1230, https://arxiv.org/pdf/2506.04185.pdf GitHub

Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu

Title: R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

Abstract:
Large language models (LLMs) have notably progressed in multi‑step and long‑chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non‑trivial challenge, as models often fail to identify optimal reasoning‑search interaction trajectories, resulting in suboptimal responses. We propose R‑Search, a novel reinforcement learning framework for Reasoning‑Search integration, designed to enable LLMs to autonomously execute multi‑step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi‑reward signals, improving response quality in complex logic‑ and knowledge‑intensive tasks. R‑Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R‑Search provides multi‑stage, multi‑type rewards to jointly optimize the reasoning‑search trajectory. Experiments on seven datasets show that R‑Search outperforms advanced RAG baselines by up to 32.2% (in‑domain) and 25.1% (out‑of‑domain). The code and data are available at https://github.com/QingFei1/R‑Search.

PaperID: 1231, https://arxiv.org/pdf/2506.04131.pdf GitHub

Authors: Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot

Title: CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues

Abstract:
Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two‑stage, Intent‑driven Multi‑agent framework designed to enhance manipulation analysis by enabling context‑aware and informed decision‑making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision‑making. Our code and data are available at https://github.com/Disha1001/CLAIM.

PaperID: 1232, https://arxiv.org/pdf/2506.04089.pdf GitHub

Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov

Title: AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

Abstract:
As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real‑world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), the fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human‑validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog‑model/AmbiK‑dataset.

PaperID: 1233, https://arxiv.org/pdf/2506.03939.pdf GitHub

Authors: Junqi Gao, Xiang Zou, YIng Ai, Dong Li, Yichen Niu, Biqing Qi, Jianxing Liu

Title: Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning

Abstract:
Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi‑level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which cannot dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, an GraphRAG method based on multi‑agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi‑level dependency modeling and adaptive reasoning depth. Additionally, the Self‑Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self‑reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at https://github.com/gjq100/Graph‑Counselor.git.

PaperID: 1234, https://arxiv.org/pdf/2506.03828.pdf GitHub

Authors: Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika, Suryanarayana R Yarrabothula, Roman Vaculin, Natalia Martinez, Fearghal O'donncha, Jayant Kalagnanam

Title: AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance

Abstract:
AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows, such as condition monitoring and maintenance scheduling, to minimize system downtime. While traditional AI/ML approaches solve narrow tasks in isolation, Large Language Model (LLM) agents offer a next‑generation opportunity for end‑to‑end automation. In this paper, we introduce AssetOpsBench, a unified framework for orchestrating and evaluating domain‑specific agents for Industry 4.0. AssetOpsBench provides a multimodal ecosystem comprising a catalog of four domain‑specific agents, a curated dataset of 140+ human‑authored natural‑language queries grounded in real industrial scenarios, and a simulated, CouchDB‑backed IoT environment. We introduce an automated evaluation framework that uses three key metrics to analyze architectural trade‑offs between the Tool‑As‑Agent and Plan‑Executor paradigms, along with a systematic procedure for the automated discovery of emerging failure modes. The practical relevance of AssetOpsBench is demonstrated by its broad community adoption, with 250+ users and over 500 agents submitted to our public benchmarking platform, supporting reproducible and scalable research for real‑world industrial operations. The code is accesible at https://github.com/IBM/AssetOpsBench .

PaperID: 1235, https://arxiv.org/pdf/2506.03613.pdf GitHub

Authors: Shaoshan Liu, Fan Wang, Hongjun Zhou, Yuanfeng Wang

Title: Training Cross-Morphology Embodied AI Agents: From Practical Challenges to Theoretical Foundations

Abstract:
While theory and practice are often seen as separate domains, this article shows that theoretical insight is essential for overcoming real‑world engineering barriers. We begin with a practical challenge: training a cross‑morphology embodied AI policy that generalizes across diverse robot morphologies. We formalize this as the Heterogeneous Embodied Agent Training (HEAT) problem and prove it reduces to a structured Partially Observable Markov Decision Process (POMDP) that is PSPACE‑complete. This result explains why current reinforcement learning pipelines break down under morphological diversity, due to sequential training constraints, memory‑policy coupling, and data incompatibility. We further explore Collective Adaptation, a distributed learning alternative inspired by biological systems. Though NEXP‑complete in theory, it offers meaningful scalability and deployment benefits in practice. This work illustrates how computational theory can illuminate system design trade‑offs and guide the development of more robust, scalable embodied AI. For practitioners and researchers to explore this problem, the implementation code of this work has been made publicly available at https://github.com/airs‑admin/HEAT

PaperID: 1236, https://arxiv.org/pdf/2506.03610.pdf GitHub

Authors: Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho

Title: Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Abstract:
Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human‑preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine‑tuning datasets to adapt pre‑trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug‑and‑play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine‑tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fixablation studies of input modality, agentic strategies, and fine‑tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton‑ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.

PaperID: 1237, https://arxiv.org/pdf/2506.02931.pdf GitHub

Authors: Praneet Sai Madhu Surabhi, Dheeraj Reddy Mudireddy, Jian Tao

Title: ThinkTank: A Framework for Generalizing Domain-Specific AI Agent Systems into Universal Collaborative Intelligence Platforms

Abstract:
This paper presents ThinkTank, a comprehensive and scalable framework designed to transform specialized AI agent systems into versatile collaborative intelligence platforms capable of supporting complex problem‑solving across diverse domains. ThinkTank systematically generalizes agent roles, meeting structures, and knowledge integration mechanisms by adapting proven scientific collaboration methodologies. Through role abstraction, generalization of meeting types for iterative collaboration, and the integration of Retrieval‑Augmented Generation with advanced knowledge storage, the framework facilitates expertise creation and robust knowledge sharing. ThinkTank enables organizations to leverage collaborative AI for knowledge‑intensive tasks while ensuring data privacy and security through local deployment, utilizing frameworks like Ollama with models such as Llama3.1. The ThinkTank framework is designed to deliver significant advantages in cost‑effectiveness, data security, scalability, and competitive positioning compared to cloud‑based alternatives, establishing it as a universal platform for AI‑driven collaborative problem‑solving. The ThinkTank code is available at https://github.com/taugroup/ThinkTank

PaperID: 1238, https://arxiv.org/pdf/2506.02426.pdf GitHub

Authors: Maryam Berijanian, Kuldeep Singh, Amin Sehati

Title: Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Abstract:
Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self‑evaluation, (2) hierarchical task decomposition, and (3) a novel multi‑agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real‑time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi‑agent coordination consistently outperforms standard few‑shot prompting and approaches the performance of fine‑tuned models. These findings offer practical guidance for the design of modular, generalizable LLM‑based systems for structured relation extraction. The source codes and dataset are available at https://github.com/maryambrj/ALIEN.git.

PaperID: 1239, https://arxiv.org/pdf/2506.02055.pdf GitHub

Authors: Nikola Balic

Title: Will Agents Replace Us? Perceptions of Autonomous Multi-Agent AI

Abstract:
Autonomous multi‑agent AI systems are poised to transform various industries, particularly software development and knowledge work. Understanding current perceptions among professionals is crucial for anticipating adoption challenges, ethical considerations, and future workforce development. This study analyzes responses from 130 participants to a survey on the capabilities, impact, and governance of AI agents. We explore expected timelines for AI replacing programmers, identify perceived barriers to deployment, and examine beliefs about responsibility when agents make critical decisions. Key findings reveal three distinct clusters of respondents. While the study explored factors associated with current AI agent deployment, the initial logistic regression model did not yield statistically significant predictors, suggesting that deployment decisions are complex and may be influenced by factors not fully captured or that a larger sample is needed. These insights highlight the need for organizations to address compliance concerns (a commonly cited barrier) and establish clear governance frameworks as they integrate autonomous agents into their workflows.

PaperID: 1240, https://arxiv.org/pdf/2506.02049.pdf GitHub

Authors: Beichen Huang, Ran Cheng, Kay Chen Tan

Title: EvoGit: Decentralized Code Evolution via Git-Based Multi-Agent Collaboration

Abstract:
We introduce EvoGit, a decentralized multi‑agent framework for collaborative software development driven by autonomous code evolution. EvoGit deploys a population of independent coding agents, each proposing edits to a shared codebase without centralized coordination, explicit message passing, or shared memory. Instead, all coordination emerges through a Git‑based phylogenetic graph that tracks the full version lineage and enables agents to asynchronously read from and write to the evolving code repository. This graph‑based structure supports fine‑grained branching, implicit concurrency, and scalable agent interaction while preserving a consistent historical record. Human involvement is minimal but strategic: users define high‑level goals, periodically review the graph, and provide lightweight feedback to promote promising directions or prune unproductive ones. Experiments demonstrate EvoGit's ability to autonomously produce functional and modular software artifacts across two real‑world tasks: (1) building a web application from scratch using modern frameworks, and (2) constructing a meta‑level system that evolves its own language‑model‑guided solver for the bin‑packing optimization problem. Our results underscore EvoGit's potential to establish a new paradigm for decentralized, automated, and continual software development. EvoGit is open‑sourced at https://github.com/BillHuang2001/evogit.

PaperID: 1241, https://arxiv.org/pdf/2506.02019.pdf GitHub

Authors: E Fan, Kang Hu, Zhuowen Wu, Jiangyang Ge, Jiawei Miao, Yuzhi Zhang, He Sun, Weizong Wang, Tianhan Zhang

Title: ChatCFD: An LLM-Driven Agent for End-to-End CFD Automation with Structured Knowledge and Reasoning

Abstract:
Computational Fluid Dynamics (CFD) is critical for scientific advancement but is hindered by operational complexity and high expertise barriers. This paper introduces ChatCFD, a Large Language Model (LLM)‑driven multi‑agent system designed for end‑to‑end CFD automation using OpenFOAM. Powered by DeepSeek‑R1/V3, ChatCFD integrates structured domain knowledge bases, a precise error locator, and iterative reflection to dramatically outperform existing methods. On 315 benchmark cases, ChatCFD achieves 82.1% execution success (vs. 6.2% for MetaOpenFOAM and 42.3% for Foam‑Agent) and 68.12% physical fidelity ‑ a novel metric assessing scientific meaningfulness beyond mere runnability. A dedicated Physics Interpreter attains 97.4% summary fidelity, bridging the gap between narrative fluency and the enforcement of tight physical constraints. Resource analysis confirms efficiency, averaging 192.1k tokens and 0.208 per case, significantly lower than baseline costs. Ablation studies identify the Error Locator and Solver Template DB as critical, with the latter's removal collapsing accuracy to 48%. The system exhibits robust flexibility, achieving 95.23% success in autonomous solver selection and 100% in turbulence modeling, while successfully reproducing complex literature cases (e.g., NACA0012, supersonic nozzle) with 60‑80% success rates where baselines failed. Featuring a modular, MCP‑compatible design, ChatCFD facilitates scalable, collaborative AI‑driven CFD. Code is available at: https://github.com/ConMoo/ChatCFD

PaperID: 1242, https://arxiv.org/pdf/2506.01538.pdf GitHub

Authors: Guobin Zhu, Rui Zhou, Wenkang Ji, Shiyu Zhao

Title: LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation

Abstract:
Although Multi‑Agent Reinforcement Learning (MARL) is effective for complex multi‑robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single‑robot settings, but their application in multi‑robot systems remains largely unexplored. This paper introduces a novel LLM‑Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LAMARL consists of two modules: the first module leverages LLMs to fully automate the generation of prior policy and reward functions. The second module is MARL, which uses the generated functions to guide robot policy training effectively. On a shape assembly benchmark, both simulation and real‑world experiments demonstrate the unique advantages of LAMARL. Ablation studies show that the prior policy improves sample efficiency by an average of 185.9% and enhances task completion, while structured prompts based on Chain‑of‑Thought (CoT) and basic APIs improve LLM output success rates by 28.5%‑67.5%. Videos and code are available at https://windylab.github.io/LAMARL/

PaperID: 1243, https://arxiv.org/pdf/2506.01531.pdf GitHub

Authors: Wenhao Liu, Zhenyi Lu, Xinyu Hu, Jierui Zhang, Dailin Li, Jiacheng Cen, Huilin Cao, Haiteng Wang, Yuhan Li, Kun Xie, Dandan Li, Pei Zhang, Chengbo Zhang, Yuxiang Ren, Xiaohong Huang, Yan Ma

Title: STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

Abstract:
High‑quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficient challenging content, neglecting human‑like reasoning, and limited reliability due to single‑LLM generation. To address these, we introduce STORM‑BORN, an ultra‑challenging dataset of mathematical derivations sourced from cutting‑edge academic papers, which includes dense human‑like approximations and heuristic cues. To ensure the reliability and quality, we propose a novel human‑in‑the‑loop, multi‑agent data generation framework, integrating reasoning‑dense filters, multi‑agent collaboration, and human mathematicians' evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even most advanced models like GPT‑o1 solved fewer than 5% of them. Fine‑tuning on STORM‑BORN boosts accuracy by 7.84% (LLaMA3‑8B) and 9.12% (Qwen2.5‑7B). As AI approaches mathematician‑level reasoning, STORM‑BORN provides both a high‑difficulty benchmark and a human‑like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM‑BORN.

PaperID: 1244, https://arxiv.org/pdf/2506.01386.pdf GitHub

Authors: Manit Baser, Dinil Mon Divakaran, Mohan Gurusamy

Title: ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs

Abstract:
Robust model‑editing techniques are essential for deploying large language models (LLMs) in practical applications, as they enable cost‑effective ways to deal with challenges such as privacy breaches, bias mitigation and misinformation spread. For example, an LLM‑based healthcare assistance may need to update out‑dated or incorrect knowledge to prevent harmful recommendations. However, many editing techniques focus on isolated facts, which critically fail to prevent indirect knowledge leakage ‑‑ the unintended reconstruction of edited‑out information through persistent causal links and contextual relationships. To assist users in selecting the right editing technique, we develop and present ThinkEval, a framework to systematically quantify indirect knowledge leakage and ripple effects in model‑editing. ThinkEval builds and employs specialized knowledge graphs to analyze the causal structure of facts before and after editing. To support this approach, we present KnowGIC, a benchmark dataset comprising multi‑step reasoning paths that precisely measure these complex knowledge transformation effects. We evaluate five editing techniques: AlphaEdit, RECT, ROME, MEMIT, and PRUNE across multiple LLMs. Our results show that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge, compromising the contextual integrity of a model's knowledge. Our dataset is available at: https://github.com/manitbaser/KnowGIC.

PaperID: 1245, https://arxiv.org/pdf/2506.00739.pdf GitHub

Authors: Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed

Title: DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Abstract:
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open‑source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge‑based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state‑of‑the‑art (SoTA) and popular LLMs, including both open‑ and closed‑weight models, using a standardized agentic framework. Our results show that Claude‑3.7‑sonnet performs best with a DefenderBench score of 81.65, followed by Claude‑3.7‑sonnet‑think with 78.40, while the best open‑weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

PaperID: 1246, https://arxiv.org/pdf/2506.00676.pdf GitHub

Authors: Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla

Title: SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

Abstract:
As large language models (LLMs) become ubiquitous, parameter‑efficient fine‑tuning methods and safety‑first defenses have proliferated rapidly. However, the number of approaches and their recent increase have resulted in diverse evaluations‑varied datasets, metrics, and inconsistent threat settings‑making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine‑tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine‑tuning datasets spanning sentiment analysis, question‑answering, multi‑step reasoning, and open‑ended instruction tasks, and allows for the generation of harmful‑variant splits; (ii) enables integration of state‑of‑the‑art defenses, including alignment‑stage immunization, in‑training safeguards, and post‑tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python‑first, dataclass‑driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine‑tuning regime, defense method, and metric suite, while ensuring end‑to‑end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine‑tuning. Code is available at: https://github.com/criticalml‑uw/SafeTuneBed

PaperID: 1247, https://arxiv.org/pdf/2506.00641.pdf GitHub

Authors: Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

Title: AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents

Abstract:
Despite the rapid advancement of LLM‑based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule‑based or LLM‑based evaluators often miss dangers in agents' step‑by‑step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training‑free, memory‑augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain‑of‑thought reasoning traces for past interactions. A multi‑stage, context‑aware retrieval‑augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM‑based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state‑of‑the‑art in LLM‑as‑a‑judge for agent safety and security, achieving human‑level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor.

PaperID: 1248, https://arxiv.org/pdf/2506.00577.pdf GitHub

Authors: Yufa Zhou, Shaobo Wang, Xingyu Dong, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Title: Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs

Abstract:
Directly training Large Language Models (LLMs) for Multi‑Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post‑training techniques, specifically Supervised Fine‑Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi‑agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real‑world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B‑parameter open‑source LLM post‑trained on a hand‑curated dataset of 2,100 high‑quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi‑agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain‑aligned post‑training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon .

PaperID: 1249, https://arxiv.org/pdf/2506.00549.pdf GitHub

Authors: Hyangsuk Min, Yuho Lee, Minjeong Ban, Jiaqi Deng, Nicole Hee-Yeon Kim, Taewon Yun, Hang Su, Jason Cai, Hwanjun Song

Title: Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages

Abstract:
Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain‑specific assessment criteria, remain predominantly English‑centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi‑dimensional, multi‑domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi‑agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self‑generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL‑Lab/MSumBench.

PaperID: 1250, https://arxiv.org/pdf/2506.00509.pdf GitHub

Authors: Zherui Li, Yan Mi, Zhenhong Zhou, Houcheng Jiang, Guibin Zhang, Kun Wang, Junfeng Fang

Title: Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems

Abstract:
Large Language Model‑based Multi‑Agent Systems (MASs) have demonstrated strong advantages in addressing complex real‑world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two‑stage, training‑free defense framework leveraging goal‑aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset is available at: https://github.com/zhrli324/ARGUS.

PaperID: 1251, https://arxiv.org/pdf/2506.00228.pdf GitHub

Authors: Rebekah A. Gelpí, Yibing Ju, Ethan C. Jackson, Yikai Tang, Shon Verch, Claas Voelcker, William A. Cunningham

Title: Sorrel: A simple and flexible framework for multi-agent reinforcement learning

Abstract:
We introduce Sorrel (https://github.com/social‑ai‑uoft/sorrel), a simple Python interface for generating and testing new multi‑agent reinforcement learning environments. This interface places a high degree of emphasis on simplicity and accessibility, and uses a more psychologically intuitive structure for the basic agent‑environment loop, making it a useful tool for social scientists to investigate how learning and social interaction leads to the development and change of group dynamics. In this short paper, we outline the basic design philosophy and features of Sorrel.

PaperID: 1252, https://arxiv.org/pdf/2505.24878.pdf GitHub

Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen

Title: Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Abstract:
CAPTCHAs have been a critical bottleneck for deploying web agents in real‑world applications, often blocking them from completing end‑to‑end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi‑step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web‑based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM‑powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near‑perfect scores, state‑of‑the‑art MLLM agents struggle significantly, with success rates at most 40.0% by Browser‑Use Openai‑o3, far below human‑level performance, 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.

PaperID: 1253, https://arxiv.org/pdf/2505.24442.pdf GitHub

Authors: Zhentao Xie, Chengcheng Han, Jinxin Shi, Wenjun Cui, Xin Zhao, Xingjiao Wu, Jiabao Zhao

Title: RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation

Abstract:
Although multi‑agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and robustness. Inspired by ResNet's residual learning, we propose Residual Mixture‑of‑Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we innovatively design an embedding‑based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross‑layer incremental information by capturing inter‑layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state‑of‑the‑art performance on the benchmarks of across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at https://github.com/mindhunter01/RMoA.

PaperID: 1254, https://arxiv.org/pdf/2505.24265.pdf GitHub

Authors: Harsh Goel, Mohammad Omama, Behdad Chalaki, Vaishnav Tadiparthi, Ehsan Moradi Pari, Sandeep Chinchali

Title: R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) has achieved significant progress in large‑scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role‑based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent's past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent's role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role‑based MARL framework that learns emergent roles by maximizing the mutual information between agents' roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state‑of‑the‑art MARL approaches, improving multi‑agent coordination to increase win rates by up to 20%. The code is available at https://github.com/UTAustin‑SwarmLab/R3DM.

PaperID: 1255, https://arxiv.org/pdf/2505.24255.pdf GitHub

Authors: Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

Title: Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

Abstract:
Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory‑of‑mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain‑of‑thought (CoT) and varying ToM levels, and examined their decision‑making processes across diverse LLMs, including reasoning models like o3‑mini and DeepSeek‑R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision‑making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, different roles of the game benefits with different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human‑AI interaction and cooperative decision‑making. The code used for our experiments can be found at https://github.com/Stealth‑py/UltimatumToM.

PaperID: 1256, https://arxiv.org/pdf/2505.23946.pdf GitHub

Authors: Yuanzhe Liu, Ryan Deng, Tim Kaler, Xuhao Chen, Charles E. Leiserson, Yao Ma, Jie Chen

Title: Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

Abstract:
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occur in several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson‑based collaboration framework, design the lesson solicitation‑‑banking‑‑selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi‑LLM collaboration methods.

PaperID: 1257, https://arxiv.org/pdf/2505.23885.pdf GitHub

Authors: Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li

Title: OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

Abstract:
Large Language Model (LLM)‑based multi‑agent systems show promise for automating real‑world tasks but struggle to transfer across domains due to their domain‑specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi‑agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain‑agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain‑specific tool‑calling capabilities. This decoupling enables cross‑domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain‑agnostic planner with reinforcement learning from real‑world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi‑domain agentic tasks. Experimental results demonstrate Workforce achieves open‑source state‑of‑the‑art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL‑trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT‑4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general‑purpose AI assistants.

PaperID: 1258, https://arxiv.org/pdf/2505.23839.pdf GitHub

Authors: Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang

Title: GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Abstract:
DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, leading to potential concern of generating harmful sequences such as pathogens or toxin‑producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high‑homology, non‑pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log‑probability heuristics to steer generation toward pathogen‑like sequences, and (3) a BLAST‑based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2‑40B). Further case studies on SARS‑CoV‑2 spike protein and HIV‑1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS‑CoV‑2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual‑use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.

PaperID: 1259, https://arxiv.org/pdf/2505.23695.pdf GitHub

Authors: Ran Zhang, Mohannad Elhamod

Title: Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics

Abstract:
The rapid advancement of LLMs has led to the creation of diverse agentic systems in data analysis, utilizing LLMs' capabilities to improve insight generation and visualization. In this paper, we present an agentic system that automates the data‑to‑dashboard pipeline through modular LLM agents capable of domain detection, concept extraction, multi‑perspective analysis generation, and iterative self‑reflection. Unlike existing chart QA systems, our framework simulates the analytical reasoning process of business analysts by retrieving domain‑relevant knowledge and adapting to diverse datasets without relying on closed ontologies or question templates. We evaluate our system on three datasets across different domains. Benchmarked against GPT‑4o with a single‑prompt baseline, our approach shows improved insightfulness, domain relevance, and analytical depth, as measured by tailored evaluation metrics and qualitative human assessment. This work contributes a novel modular pipeline to bridge the path from raw data to visualization, and opens new opportunities for human‑in‑the‑loop validation by domain experts in business analytics. All code can be found here: https://github.com/77luvC/D2D_Data2Dashboard

PaperID: 1260, https://arxiv.org/pdf/2505.23579.pdf GitHub

Authors: Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang

Title: BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

Abstract:
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi‑step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine‑tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG‑based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang‑lab/BioReason

PaperID: 1261, https://arxiv.org/pdf/2505.23559.pdf GitHub

Authors: Kunlun Zhu, Jiaxun Zhang, Ziheng Qi, Nuoxing Shang, Zijia Liu, Peixuan Han, Yue Su, Haofei Yu, Jiaxuan You

Title: SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents

Abstract:
Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce SafeScientist, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI‑driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high‑risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent‑collaboration monitoring, tool‑use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose SciSafetyBench, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high‑risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool‑related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab‑uiuc/SafeScientist. \textcolorredWarning: this paper contains example data that may be offensive or harmful.

PaperID: 1262, https://arxiv.org/pdf/2505.22809.pdf GitHub

Authors: Andrew Zhu, Evan Osgood, Chris Callison-Burch

Title: First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Abstract:
Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation ‑‑ instead, they "listen in" on human‑to‑human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in‑depth study using large multimodal audio‑language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio‑language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

PaperID: 1263, https://arxiv.org/pdf/2505.22648.pdf GitHub

Authors: Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Title: WebDancer: Towards Autonomous Information Seeking Agency

Abstract:
Addressing intricate real‑world problems necessitates in‑depth information seeking and multi‑step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi‑step research. In this work, we present a cohesive paradigm for building end‑to‑end agentic information seeking agents from a data‑centric and training‑stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine‑tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct, WebDancer. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in https://github.com/Alibaba‑NLP/WebAgent.

PaperID: 1264, https://arxiv.org/pdf/2505.22368.pdf GitHub

Authors: Enfang Cui, Yujun Cheng, Rui She, Dan Liu, Zhiyuan Liang, Minxin Guo, Tianzheng Li, Qian Wei, Wenjuan Xing, Zhijie Zhong

Title: AgentDNS: A Root Domain Naming System for LLM Agents

Abstract:
The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross‑vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent‑to‑agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi‑agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third‑party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi‑agent collaboration in real‑world scenarios. The source code will be published on https://github.com/agentdns.

PaperID: 1265, https://arxiv.org/pdf/2505.22053.pdf GitHub

Authors: Yan Rong, Jinting Wang, Guangzhi Lei, Shan Yang, Li Liu

Title: AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation

Abstract:
Multimodality‑to‑Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high‑quality paired datasets and the lack of robust multi‑task learning frameworks. Recently, multi‑agent system shows great potential in tackling the above issues. However, directly applying it to MM2MA task presents three critical challenges: (1) inadequate fine‑grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self‑correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training‑free multi‑agent system featuring a dual‑layer architecture with a generation team and a supervisor team. For the generation team, a fine‑grained task decomposition and an adaptive Mixture‑of‑Experts (MoE) collaborative entity are designed for detailed comprehensive multimodal understanding and dynamic model selection, and a trial‑and‑error iterative refinement module is designed for self‑correction. The supervisor team ensures temporal‑spatial consistency and verifies outputs through feedback loops. Moreover, we build MA‑Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi‑type audios. Experiments demonstrate that our AudioGenie achieves state‑of‑the‑art (SOTA) or comparable performance across 9 metrics in 8 tasks. User study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetic. The project website with audio samples can be found at https://audiogenie.github.io/.

PaperID: 1266, https://arxiv.org/pdf/2505.21966.pdf GitHub

Authors: Aditya Gunturu, Ben Pearman, Keiichi Ihara, Morteza Faraji, Bryan Wang, Rubaiat Habib Kazi, Ryo Suzuki

Title: MapStory: Prototyping Editable Map Animations with LLM Agents

Abstract:
We introduce MapStory, an LLM‑powered animation prototyping tool that generates editable map animation sequences directly from natural language text by leveraging a dual‑agent LLM architecture. Given a user written script, MapStory automatically produces a scene breakdown, which decomposes the text into key map animation primitives such as camera movements, visual highlights, and animated elements. Our system includes a researcher agent that accurately queries geospatial information by leveraging an LLM with web search, enabling automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine‑tune parameters of these primitive blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and by an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map‑centric stories.

PaperID: 1267, https://arxiv.org/pdf/2505.21577.pdf GitHub

Authors: Huacan Wang, Ziyi Ni, Shuo Zhang, Shuo Lu, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Ronghao Chen, Xin Li, Daxin Jiang, Yuntao Du, Pin Lyu

Title: RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Abstract:
The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real‑world tasks typically demand full‑fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open‑source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE‑Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function‑call graphs, module‑dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE‑bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task‑pass rate from 40.7% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/QuantaAlpha/RepoMaster.

PaperID: 1268, https://arxiv.org/pdf/2505.21569.pdf GitHub

Authors: Zhucong Li, Powei Chang, Jin Xiao, Zhijian Zhou, Qianyu He, Jiaqing Liang, Fenglei Cao, Xu Yinghui, Yuan Qi

Title: ChemAmp: Amplified Chemistry Tools via Composable Agents

Abstract:
Although LLM‑based agents are proven to master tool orchestration in scientific fields, particularly chemistry, their single‑task performance remains limited by underlying tool constraints. To this end, we propose tool amplification, a novel paradigm that enhances the collective capabilities of specialized tools through optimized, dynamic coordination within individual tasks. Instantiating this paradigm, we introduce ChemAmp, a computationally lightweight framework that dynamically treats chemistry tools (e.g., UniMol2, Chemformer) as composable building‑block agents. It constructs task‑specialized super‑agents that transcend atomic tool constraints with limited data (\leq10 samples). Our evaluations across four core chemistry tasks molecular design, molecule captioning, reaction prediction, and property prediction demonstrate that ChemAmp outperforms chemistry‑specialized models, generalist LLMs, and agent systems with tool orchestration. Critically, this bottom‑up construction strategy enables 94% inference token cost reductions versus vanilla multi‑agent systems.

PaperID: 1269, https://arxiv.org/pdf/2505.21497.pdf GitHub GitHub

Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr

Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Abstract:
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long‑context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author‑designed posters and evaluates outputs on (i)Visual Quality‑semantic alignment with human posters, (ii)Textual Coherence‑language fluency, (iii)Holistic Assessment‑six fine‑grained aesthetic and informational criteria scored by a VLM‑as‑judge, and notably (iv)PaperQuiz‑the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top‑down, visual‑in‑the‑loop multi‑agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text‑visual pairs into a binary‑tree layout that preserves reading order and spatial balance; and the (c)Painter‑Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT‑4o outputs‑though visually appealing at first glance‑often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human‑designed posters rely largely on visual semantics to convey meaning. Our fully open‑source variants (e.g. based on the Qwen‑2.5 series) outperform existing 4o‑driven multi‑agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22‑page paper into a finalized yet editable .pptx poster ‑ all for just 0.005. These findings chart clear directions for the next generation of fully automated poster‑generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.

PaperID: 1270, https://arxiv.org/pdf/2505.21471.pdf GitHub

Authors: Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Fei Huang, Yang Liu

Title: Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

Abstract:
With the rapid advancement of post‑training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement. Existing context window extension methods inevitably cause information loss. LLM‑based multi‑agent methods emerge as a new paradigm to handle massive input in a distributional manner, where we identify two core bottlenecks in existing agent orchestration designs. In this work, we develop a multi‑agent framework, \ExtAgents, to overcome the bottlenecks and enable better scalability in inference‑time knowledge integration without longer‑context training. Benchmarked with our enhanced multi‑hop question answering test, \boldsymbol\inftyBench+, and other public test sets including long survey generation, \ExtAgents significantly enhances the performance over existing non‑training methods with the same amount of external knowledge input, regardless of whether it falls \emphwithin or exceeds the context window. Moreover, the method maintains efficiency due to high parallelism. We believe further study in the coordination of LLM agents on increasing external knowledge input could benefit real‑world applications.

PaperID: 1271, https://arxiv.org/pdf/2505.21055.pdf GitHub

Authors: Kaiming Liu, Xuanyu Lei, Ziyue Wang, Peng Li, Yang Liu

Title: Agent-Environment Alignment via Automated Interface Generation

Abstract:
Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision‑making tasks. These agents interact with environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate the perception and action. However, mismatches often happen between the internal expectations of the agent regarding the influence of its issued actions and the actual state transitions in the environment, a phenomenon referred to as agent‑environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface still remains underexplored. In this work, we empirically demonstrate that agent‑environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN, an \underlineAuto‑A\underlineligned \underlineInterface \underlineGe\underlineneration framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN‑generated interface enhances both the static information of the environment and the step‑wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains including embodied tasks, web navigation and tool‑use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Meanwhile, ALIGN‑generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP‑MT/ALIGN.

PaperID: 1272, https://arxiv.org/pdf/2505.20981.pdf GitHub GitHub GitHub

Authors: Cainan Davidson, Deva Ramanan, Neehar Peri

Title: RefAV: Towards Planning-Centric Scenario Mining

Abstract:
Autonomous Vehicles (AVs) collect and pseudo‑label terabytes of multi‑modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety‑critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error‑prone and prohibitively time‑consuming, often relying on hand‑crafted structured queries. In this work, we revisit spatio‑temporal scenario mining through the lens of recent vision‑language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large‑scale dataset of 10,000 diverse natural language queries that describe complex multi‑agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi‑object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off‑the‑shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user‑guide/tasks/scenario_mining.html

PaperID: 1273, https://arxiv.org/pdf/2505.20922.pdf GitHub

Authors: Yang Zhang, Xinran Li, Jianing Ye, Shuang Qiu, Delin Qu, Xiu Li, Chongjie Zhang, Chenjia Bai

Title: Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

Abstract:
World models have recently attracted growing interest in Multi‑Agent Reinforcement Learning (MARL) due to their ability to improve sample efficiency for policy learning. However, accurately modeling environments in MARL is challenging due to the exponentially large joint action space and highly uncertain dynamics inherent in multi‑agent systems. To address this, we reduce modeling complexity by shifting from jointly modeling the entire state‑action transition dynamics to focusing on the state space alone at each timestep through sequential agent modeling. Specifically, our approach enables the model to progressively resolve uncertainty while capturing the structured dependencies among agents, providing a more accurate representation of how agents influence the state. Interestingly, this sequential revelation of agents' actions in a multi‑agent system aligns with the reverse process in diffusion models‑‑a class of powerful generative models known for their expressiveness and training stability compared to autoregressive or latent variable models. Leveraging this insight, we develop a flexible and robust world model for MARL using diffusion models. Our method, Diffusion‑Inspired Multi‑Agent world model (DIMA), achieves state‑of‑the‑art performance across multiple multi‑agent control benchmarks, significantly outperforming prior world models in terms of final return and sample efficiency, including MAMuJoCo and Bi‑DexHands. DIMA establishes a new paradigm for constructing multi‑agent world models, advancing the frontier of MARL research. Codes are open‑sourced at https://github.com/breez3young/DIMA.

PaperID: 1274, https://arxiv.org/pdf/2505.20732.pdf GitHub

Authors: Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li

Title: SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

Abstract:
Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal‑oriented tasks that require multi‑step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non‑trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per‑step contribution with a grounding signal for actions executed in the environment as the fine‑grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state‑of‑the‑art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA‑RL‑Agent.

PaperID: 1275, https://arxiv.org/pdf/2505.20662.pdf GitHub

Authors: Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun

Title: AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Abstract:
Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor‑intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed \ours, a multi‑agent framework designed to autonomously reproduce experimental code in a complete, end‑to‑end manner. To ensure code executability, \ours incorporates a sampling‑based unit testing strategy for rapid validation. To assess reproduction capabilities, we introduce \ourbench, a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and \ourbench demonstrate that \ours consistently surpasses existing baselines across all metrics. Notably, it yields substantial improvements in reproduction fidelity and final execution performance. The code is available at https://github.com/AI9Stars/AutoReproduce.

PaperID: 1276, https://arxiv.org/pdf/2505.20414.pdf GitHub

Authors: Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Jaime Villa, Yinzhe Shen, Carlos Fernandez, Christoph Stiller

Title: RetroMotion: Retrocausal Motion Forecasting Models are Instructable

Abstract:
Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In particular, the output space of joint trajectory distributions grows exponentially with the number of agents. Therefore, we decompose multi‑agent motion forecasts into (1) marginal distributions for all modeled agents and (2) joint distributions for interacting agents. Using a transformer model, we generate joint distributions by re‑encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. For each time step, we model the positional uncertainty using compressed exponential power distributions. Notably, our method achieves strong results in the Waymo Interaction Prediction Challenge and generalizes well to the Argoverse 2 and V2X‑Seq datasets. Additionally, our method provides an interface for issuing instructions. We show that standard motion forecasting training implicitly enables the model to follow instructions and adapt them to the scene context. GitHub repository: https://github.com/kit‑mrt/future‑motion

PaperID: 1277, https://arxiv.org/pdf/2505.20310.pdf GitHub

Authors: Wanghan Xu, Wenlong Zhang, Fenghua Ling, Ben Fei, Yusong Hu, Runmin Ma, Bo Zhang, Fangxuan Ren, Jintai Lin, Wanli Ouyang, Lei Bai

Title: Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Abstract:
Meta‑analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta‑analysis involves a complex multi‑stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. However, while LLM‑based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi‑agent system, Manalyzer, which achieves end‑to‑end automated meta‑analysis through tool calls. The hybrid review, hierarchical extraction, self‑proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucinations. To comprehensively evaluate the performance of meta‑analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multi meta‑analysis tasks. Project page: https://black‑yt.github.io/meta‑analysis‑page/ .

PaperID: 1278, https://arxiv.org/pdf/2505.20285.pdf GitHub

Authors: Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou

Title: MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Abstract:
Retrieval‑Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training‑based methods show promise, their agentic abilities are limited by inherent characteristics of the task‑specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre‑training framework, MaskSearch. In the pre‑training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large number of pre‑training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine‑tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent‑based and distillation‑based methods to generate training data, starting with a multi‑agent system consisting of a planner, rewriter, observer, and followed by a self‑evolving teacher model. While for RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open‑domain multi‑hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM‑based search agents on both in‑domain and out‑of‑domain downstream tasks.

PaperID: 1279, https://arxiv.org/pdf/2505.19973.pdf GitHub

Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi

Title: DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Abstract:
Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high‑stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR‑Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert‑reviewed multiple‑choice questions sourced from industry‑standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF‑style tasks testing multi‑step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR‑Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near‑zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR‑Metric.

PaperID: 1280, https://arxiv.org/pdf/2505.19896.pdf GitHub GitHub GitHub GitHub

Authors: Alejandro Carrasco, Victor Rodriguez-Fernandez, Richard Linares

Title: Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

Abstract:
Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision‑making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM‑based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non‑cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few‑shot prompting, and fine‑tuning techniques to create an effective LLM‑based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on \hrefhttps://github.com/ARCLab‑MIT/kspdgGitHub, while the trained models and datasets are available on \hrefhttps://huggingface.co/OhhTuRnzHugging Face. Additionally, experiment tracking and detailed results can be reviewed on \hrefhttps://wandb.ai/carrusk/huggingfaceWeights \& Biases

PaperID: 1281, https://arxiv.org/pdf/2505.19754.pdf GitHub

Authors: Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu

Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Abstract:
The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single‑view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym‑RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi‑view chunking and schema‑based parsing, NeuSym‑RAG organizes semi‑structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF‑based QA datasets, including a self‑annotated one AIRQA‑REAL, show that NeuSym‑RAG stably defeats both the vector‑based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X‑LANCE/NeuSym‑RAG.

PaperID: 1282, https://arxiv.org/pdf/2505.19647.pdf GitHub

Authors: Xiaochuan Liu, Ruihua Song, Xiting Wang, Xu Chen

Title: Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation

Abstract:
Automatic related work generation (RWG) can save people's time and effort when writing a draft of related work section (RWS) for further revision. However, existing methods for RWG always suffer from shallow comprehension due to taking the limited portions of references papers as input and isolated explanation for each reference due to ineffective capturing the relationships among them. To address these issues, we focus on full‑text‑based RWG task and propose a novel multi‑agent framework. Our framework consists of three agents: a selector that decides which section of the papers is going to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph‑aware strategies for selector, enabling to optimize the reading order with constrains of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph‑aware selectors outperform alternative selectors, achieving state‑of‑the‑art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.

PaperID: 1283, https://arxiv.org/pdf/2505.19623.pdf GitHub

Authors: Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

Title: AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Abstract:
The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role‑playing capabilities to enable autonomous, adaptive decision‑making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user‑item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving‑interest, and cold‑start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard~\footnote[2]https://tsinghua‑fib‑lab.github.io/AgentSocietyChallenge/pages/overview.html, fostering ongoing community engagement and reproducible research. The benchmark is available at: \hyperlinkhttps://huggingface.co/datasets/SGJQovo/AgentRecBenchhttps://huggingface.co/datasets/SGJQovo/AgentRecBench.

PaperID: 1284, https://arxiv.org/pdf/2505.19591.pdf GitHub

Authors: Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Title: Multi-Agent Collaboration via Evolving Orchestration

Abstract:
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem‑solving. While recent research explores multi‑agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer‑style paradigm for LLM‑based multi‑agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed‑ and open‑domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution. Our code is available at https://github.com/OpenBMB/ChatDev/tree/puppeteer.

PaperID: 1285, https://arxiv.org/pdf/2505.19234.pdf GitHub

Authors: Jialong Zhou, Lichao Wang, Xiao Yang

Title: GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

Abstract:
The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi‑turn dialogues. However, multi‑agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi‑agent collaboration process as a discrete‑time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder‑decoder architecture incorporating an incremental training paradigm learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi‑agent collaborations against diverse safety vulnerabilities, achieving state‑of‑the‑art accuracy with efficient resource utilization. The code is available at https://github.com/JialongZhou666/GUARDIAN

PaperID: 1286, https://arxiv.org/pdf/2505.19219.pdf GitHub

Authors: Shiyue Wang, Haozheng Xu, Yuhan Zhang, Jingran Lin, Changhong Lu, Xiangfeng Wang, Wenhao Li

Title: Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding

Abstract:
Multi‑Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision‑free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real‑world multi‑robot coordination. This comprehensive survey bridges the long‑standing divide between classical algorithmic approaches and emerging learning‑based methods in MAPF research. We present a unified framework that encompasses search‑based methods (including Conflict‑Based Search, Priority‑Based Search, and Large Neighborhood Search), compilation‑based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data‑driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger‑scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning‑based approaches (predominantly 10‑100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed‑motive MAPF with game‑theoretic considerations, language‑grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real‑world applications.

PaperID: 1287, https://arxiv.org/pdf/2505.19212.pdf GitHub

Authors: Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin

Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

Abstract:
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision‑making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff‑maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self‑interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.

PaperID: 1288, https://arxiv.org/pdf/2505.18943.pdf GitHub

Authors: Xuanming Zhang, Yuxuan Chen, Samuel Yeh, Sharon Li

Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Abstract:
Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs‑a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi‑agent framework inspired by psychological theories of metacognition, designed to emulate human‑like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory‑of‑Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Moral Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state‑of‑the‑art performance across three challenging benchmarks, with 35.7% improvement in real‑world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human‑level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human‑like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.

PaperID: 1289, https://arxiv.org/pdf/2505.18630.pdf GitHub

Authors: Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang

Title: DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well‑suited for complex decision‑making tasks such as medical consultation (MC). However, existing LLM‑based methods often fail to capture the dual nature of MC, which entails two distinct sub‑tasks: symptom inquiry, a sequential decision‑making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM‑based framework that performs Dual‑Decision Optimization by decoupling the two sub‑tasks and optimizing them with distinct objectives through a collaborative multi‑agent workflow. Experiments on three real‑world MC datasets show that DDO consistently outperforms existing LLM‑based approaches and achieves competitive performance with state‑of‑the‑art generation‑based methods, demonstrating its effectiveness in the MC task. The code is available at https://github.com/zh‑jia/DDO.

PaperID: 1290, https://arxiv.org/pdf/2505.18581.pdf GitHub

Authors: Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li

Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG

Abstract:
Retrieval‑Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate‑Augmented RAG (DRAG), a training‑free framework that integrates Multi‑Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG‑induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate‑Augmented‑RAG.

PaperID: 1291, https://arxiv.org/pdf/2505.18458.pdf GitHub

Authors: Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu

Title: A Survey of LLM $\times$ DATA

Abstract:
The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large‑scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre‑training, post‑training, retrieval‑augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV‑cache management, and fault‑tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post‑processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general‑purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi‑structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval‑augmented prompting, task‑specialized fine‑tuning, and multi‑agent collaboration.

PaperID: 1292, https://arxiv.org/pdf/2505.18411.pdf GitHub

Authors: Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong

Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Abstract:
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi‑modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP‑Events, a novel dataset derived from the Bilibili video platform, where user‑generated bullet comments (Danmaku) naturally form multi‑modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP‑QA, a challenging question‑answering dataset constructed via a novel multi‑agent pipeline powered by state‑of‑the‑art LLMs and multi‑modal LLMs (MLLMs), targeting complex temporal‑textual‑visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi‑modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi‑modal language modeling landscape. Project page: https://github.com/FRENKIE‑CHIANG/DanmakuTPPBench

PaperID: 1293, https://arxiv.org/pdf/2505.18334.pdf GitHub

Authors: Jiaxun Cui, Chen Tang, Jarrett Holtz, Janice Nguyen, Alessandro G. Allievi, Hang Qiu, Peter Stone

Title: Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play

Abstract:
Past work has demonstrated that autonomous vehicles can drive more safely if they communicate with one another than if they do not. However, their communication has often not been human‑understandable. Using natural language as a vehicle‑to‑vehicle (V2V) communication protocol offers the potential for autonomous vehicles to drive cooperatively not only with each other but also with human drivers. In this work, we propose a suite of traffic tasks in autonomous driving where vehicles in a traffic scenario need to communicate in natural language to facilitate coordination in order to avoid an imminent collision and/or support efficient traffic flow. To this end, this paper introduces a novel method, LLM+Debrief, to learn a message generation and high‑level decision‑making policy for autonomous vehicles through multi‑agent discussion. To evaluate LLM agents for driving, we developed a gym‑like simulation environment that contains a range of driving scenarios. Our experimental results demonstrate that LLM+Debrief is more effective at generating meaningful and human‑understandable natural language messages to facilitate cooperation and coordination than a zero‑shot LLM agent. Our code and demo videos are available at https://talking‑vehicles.github.io/.

PaperID: 1294, https://arxiv.org/pdf/2505.18156.pdf GitHub

Authors: Austin Howard

Title: InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models

Abstract:
Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt‑based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open‑source matrix that maps real‑world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi‑agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML‑based simulation tests. A Python tool supports easy execution of prompt‑based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community‑driven foundation for securing language models.

PaperID: 1295, https://arxiv.org/pdf/2505.18105.pdf GitHub

Authors: Lisheng Huang, Yichen Liu, Jinhao Jiang, Rongxiang Zhang, Jiahao Yan, Junyi Li, Wayne Xin Zhao

Title: ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

Abstract:
Recent advances in web‑augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose ManuSearch, a transparent and modular multi‑agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub‑queries, (2) an Internet search agent that retrieves relevant documents via real‑time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce ORION, a challenging benchmark focused on open‑web reasoning over long‑tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open‑source baselines and even surpasses leading closed‑source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code in https://github.com/RUCAIBox/ManuSearch

PaperID: 1296, https://arxiv.org/pdf/2505.17801.pdf GitHub

Authors: Bálint Gyevnár, Christopher G. Lucas, Stefano V. Albrecht, Shay B. Cohen

Title: Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

Abstract:
Autonomous multi‑agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks such as miscoordination or goal misalignment. Explainability is vital for users' trust calibration, but explainable MAS face challenges due to complex environments, the human factor, and non‑standardised evaluation. Leveraging the counterfactual effect size model and LLMs, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates human‑centred action explanations for multi‑agent policies by having an LLM interrogate an environment simulator using prompts like 'whatif' and 'remove' to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across ten scenarios for five LLMs with a comprehensive methodology combining robustness, subjective preference, correctness, and goal/action prediction with an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for four models, with comparable action prediction accuracy, achieving the highest scores overall. Our code is open‑sourced at https://github.com/gyevnarb/axis.

PaperID: 1297, https://arxiv.org/pdf/2505.17795.pdf GitHub

Authors: Tazeek Bin Abdur Rakib, Ambuj Mehrish, Lay-Ki Soon, Wern Han Lim, Soujanya Poria

Title: DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

Abstract:
Large‑language‑model (LLM) agents excel at reactive dialogue but struggle with proactive, goal‑driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high‑quality set of candidate actions per turn and employs a compact Q‑network over fixed BERT embeddings trained via temporal‑difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real‑time, strategic, and emotionally intelligent dialogue planning at scale. Code available at https://github.com/declare‑lab/dialogxpert/

PaperID: 1298, https://arxiv.org/pdf/2505.17735.pdf GitHub

Authors: Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

Title: SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator

Abstract:
Large Language Model (LLM)‑based agents are increasingly deployed in real‑world applications such as "digital assistants, autonomous customer service, and decision‑support systems", where their ability to "interact in multi‑turn, tool‑augmented environments" makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self‑reflective reasoning to generate safe responses, and constructs a large‑scale, diverse, and high‑quality safety training dataset‑eliminating the need for hazardous real‑world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real‑world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real‑world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM‑based agents for real‑world deployment. We have released the project page at https://auto‑safe.github.io/.

PaperID: 1299, https://arxiv.org/pdf/2505.17612.pdf GitHub

Authors: Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

Title: Distilling LLM Agent into Small Models with Retrieval and Code Tools

Abstract:
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain‑of‑thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task‑solving behavior from LLM‑based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first‑thought prefix to enhance the quality of teacher‑generated trajectories; and (2) we propose a self‑consistent action generation for improving test‑time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in‑domain and out‑of‑domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next‑tier larger 1.5B, 3B, 7B models fine‑tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool‑using small agents. Our code is available at https://github.com/Nardien/agent‑distillation.

PaperID: 1300, https://arxiv.org/pdf/2505.17107.pdf GitHub

Authors: Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

Title: CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

Abstract:
Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re‑engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture‑The‑Flag (CTF) competitions, they have two key limitations: accessing latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge‑based approaches that incorporate technical understanding into the task‑solving automation can tackle these limitations. We present CRAKEN, a knowledge‑based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task‑critical information, iterative self‑reflected knowledge retrieval, and knowledge‑hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN's effectiveness in multi‑stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM‑driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior works by 3% and achieving state‑of‑the‑art results. On evaluation of MITRE ATT&CK techniques, CRAKEN solves 25‑30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge‑based execution. We make our framework open source to public https://github.com/NYU‑LLM‑CTF/nyuctf_agents_craken.

PaperID: 1301, https://arxiv.org/pdf/2505.17030.pdf GitHub

Authors: Jingzhi Hu, Geoffrey Ye Li

Title: Distillation-Enabled Knowledge Alignment Protocol for Semantic Communication in AI Agent Networks

Abstract:
Future networks are envisioned to connect massive artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents naturally suit the semantic communication (SC), which can significantly enhance the bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while agents have distinct expert knowledge for their individual tasks in practice. In this paper, we propose a distillation‑enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter‑efficient low‑rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large‑scale integer linear programming problem and develop a highly efficient greedy algorithm. From computer simulation, the DeKAP establishes knowledge alignment with the lowest communication and computation resources compared to conventional approaches.

PaperID: 1302, https://arxiv.org/pdf/2505.17012.pdf GitHub

Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie

Title: SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Abstract:
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data‑driven and agent‑based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question‑answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human‑level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large‑scale training resource with 331K multimodal QA samples that supports fine‑tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3‑VL); (iv) to complement this data‑driven route with a training‑free paradigm, we develop SpatialAgent, a multi‑agent system equipped with 12 specialized spatial perception tools that supports both Plan‑Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in‑depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human‑level spatial intelligence. All data, code, and models will be released to the research community.

PaperID: 1303, https://arxiv.org/pdf/2505.16975.pdf GitHub

Authors: Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen

Title: SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Abstract:
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks. However, feature‑driven development, a highly prevalent real‑world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE‑Dev, the first large‑scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real‑world end‑to‑end feature‑driven software development tasks. To ensure verifiable and diverse training, SWE‑Dev uniquely provides all instances with a runnable environment and its developer‑authored executable unit tests. This collection not only provides high‑quality data for Supervised Fine‑Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. We evaluated SWE‑Dev across 17 base LLMs, 10 reasoning‑focused LLMs, 10 multi‑agent systems, and 8 tool‑augmented LLM agents. Results show substantial headroom: the best single‑turn model reaches only 22.51% Pass@1 on the hard split, while OpenHands agents improve to 56.44% but still leave many tasks unsolved. Code is available here https://github.com/DorothyDUUU/SWE‑Dev.

PaperID: 1304, https://arxiv.org/pdf/2505.16938.pdf GitHub GitHub

Authors: InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai

Title: InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

Abstract:
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed‑loop multi‑agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi‑agent interaction in automated end‑to‑end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

PaperID: 1305, https://arxiv.org/pdf/2505.16834.pdf GitHub

Authors: Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen

Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Abstract:
Retrieval‑augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi‑step reasoning and iterative information retrieval. However, existing approaches face critical limitations that lack high‑quality training trajectories or suffer from the distributional mismatches in simulated environments and prohibitive computational costs for real‑world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high‑quality training data by simulating realistic user interactions in live web search environments, coupled with a multi‑criteria curation strategy that optimizes the diversity and quality of input and output side. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL‑based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data‑scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

PaperID: 1306, https://arxiv.org/pdf/2505.16832.pdf GitHub GitHub

Authors: Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao

Title: From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization

Abstract:
While foundation models (FMs), such as diffusion models and large vision‑language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi‑domain, multi‑level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine‑grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi‑agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming‑lab/EduVisBench and https://github.com/aiming‑lab/EduVisAgent.

PaperID: 1307, https://arxiv.org/pdf/2505.16282.pdf GitHub

Authors: Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia

Title: ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay

Abstract:
Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long‑horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi‑turn reinforcement learning (RL) for reasoning and tool‑using capabilities in LLMs, their application to GUI‑based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end‑to‑end policy optimization for vision‑language‑based GUI agents with the aim of improving performance on complex, long‑horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end‑to‑end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse the successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy‑based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM‑based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi‑turn, vision‑language GUI agents capable of managing complex real‑world UI interactions. Codes and models:https://github.com/dvlab‑research/ARPO.git.

PaperID: 1308, https://arxiv.org/pdf/2505.16281.pdf GitHub

Authors: Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong

Title: HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Abstract:
The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human‑aligned judgments. However, current LLM‑based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi‑Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine‑grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi‑agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self‑reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human‑aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1‑score improvement of 89% over the best‑performing baseline. We make our code and data publicly available at https://github.com/nlp2ct‑shijie/HiMATE.

PaperID: 1309, https://arxiv.org/pdf/2505.15859.pdf GitHub

Authors: Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye

Title: AutoData: A Multi-Agent System for Open Web Data Collection

Abstract:
The exponential growth of data‑driven systems and AI technologies has intensified the demand for high‑quality web‑sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data‑collecting solutions fall into two categories: wrapper‑based methods that struggle with adaptability and reproducibility, and large language model (LLM)‑based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi‑agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi‑agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi‑agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM‑based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.

PaperID: 1310, https://arxiv.org/pdf/2505.15734.pdf GitHub

Authors: Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang

Title: DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Abstract:
Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth‑free training framework that uses multi‑agent debate traces to evolve a single language model. We also introduce a new prompting strategy Reflect‑Critique‑Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on seven reasoning benchmarks with six open‑weight models show that our DTE framework achieve substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM‑PLUS dataset. Furthermore, we observe strong cross‑domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities. Our framework code and trained models are publicly available at https://github.com/ctrl‑gaurav/Debate‑Train‑Evolve

PaperID: 1311, https://arxiv.org/pdf/2505.15155.pdf GitHub

Authors: Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, Jiang Bian

Title: R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

Abstract:
Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non‑stationarity, and persistent volatility. Despite advances in large language models and multi‑agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D‑Agent for Quantitative Finance, in short RD‑Agent(Q), the first data‑centric multi‑agent framework designed to automate the full‑stack research and development of quantitative strategies via coordinated factor‑model co‑optimization. RD‑Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal‑aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code‑generation agent, Co‑STEER, to implement task‑specific code, which is then executed in real‑market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi‑armed bandit scheduler for adaptive direction selection. Empirically, RD‑Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state‑of‑the‑art deep time‑series models on real markets. Its joint factor‑model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD‑Agent.

PaperID: 1312, https://arxiv.org/pdf/2505.15146.pdf GitHub

Authors: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang

Title: lmgame-Bench: How Good are LLMs at Playing Games?

Abstract:
Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons ‑‑ brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame‑Bench to turn games into reliable evaluations. lmgame‑Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym‑style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame‑Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame‑Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame‑org/GamingAgent/lmgame‑bench.

PaperID: 1313, https://arxiv.org/pdf/2505.15132.pdf GitHub

Authors: Adib Bazgir, Rama chandra Praneeth Madugula, Yuwen Zhang

Title: Multicrossmodal Automated Agent for Integrating Diverse Materials Science Data

Abstract:
We introduce a multicrossmodal LLM‑agent framework motivated by the growing volume and diversity of materials‑science data ranging from high‑resolution microscopy and dynamic simulation videos to tabular experiment logs and sprawling literature archives. While recent AI efforts have accelerated individual tasks such as property prediction or image classification, they typically treat each modality in isolation, leaving rich cross‑modal correlations unexplored and forcing researchers to perform laborious manual integration. Moreover, existing multimodal foundation models often require expensive retraining or fine‑tuning on domain data, and current multi‑agent systems in materials informatics address only narrow subtasks. To overcome these obstacles, we design a coordinated team of specialized LLM agents, each equipped with domain‑adapted prompts and plugins that project their outputs into a shared embedding space. A dynamic gating mechanism then weights and merges these insights, enabling unified reasoning over heterogeneous inputs without ever modifying the underlying LLM weights. We validate our approach on challenging case studies and demonstrate substantial gains in retrieval accuracy (85%), captioning fidelity, and integrated coverage (35%) compared to single‑modality and zero‑shot baselines. Our work paves the way for AI digital researchers capable of bridging data silos and accelerating the materials‑discovery cycle. The code is available at https://github.com/adibgpt/Multicrossmodal‑Autonomous‑Materials‑Science‑Agent.

PaperID: 1314, https://arxiv.org/pdf/2505.15117.pdf GitHub

Authors: Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, Jiawei Han

Title: An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

Abstract:
Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real‑world problem solving. More recently, RL has been leveraged to create sophisticated LLM‑based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors ‑‑ such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process ‑‑ require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general‑purpose vs. reasoning‑specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM‑based search agents in real‑world applications. Code is available at https://github.com/PeterGriffinJin/Search‑R1.

PaperID: 1315, https://arxiv.org/pdf/2505.14817.pdf GitHub

Authors: Kushagra Gupta, Surya Murthy, Mustafa O. Karabag, Ufuk Topcu, David Fridovich-Keil

Title: Cooperative Bargaining Games Without Utilities: Mediated Solutions from Direction Oracles

Abstract:
Cooperative bargaining games are widely used to model resource allocation and conflict resolution. Traditional solutions assume the mediator can access agents utility function values and gradients. However, there is an increasing number of settings, such as human AI interactions, where utility values may be inaccessible or incomparable due to unknown, nonaffine transformations. To model such settings, we consider that the mediator has access only to agents most preferred directions, i.e., normalized utility gradients in the decision space. To this end, we propose a cooperative bargaining algorithm where a mediator has access to only the direction oracle of each agent. We prove that unlike popular approaches such as the Nash and Kalai Smorodinsky bargaining solutions, our approach is invariant to monotonic nonaffine transformations, and that under strong convexity and smoothness assumptions, this approach enjoys global asymptotic convergence to Pareto stationary solutions. Moreover, we show that the bargaining solutions found by our algorithm also satisfy the axioms of symmetry and (under slightly stronger conditions) independence of irrelevant alternatives, which are popular in the literature. Finally, we conduct experiments in two domains, multi agent formation assignment and mediated stock portfolio allocation, which validate these theoretic results. All code for our experiments can be found at https://github.com/suryakmurthy/dibs_bargaining.

PaperID: 1316, https://arxiv.org/pdf/2505.14738.pdf GitHub

Authors: Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, Jiang Bian

Title: R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science

Abstract:
Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. Although crowd‑sourcing platforms alleviate some challenges, high‑level machine learning engineering (MLE) tasks remain labor‑intensive and iterative. We introduce R&D‑Agent, a comprehensive, decoupled, and extensible framework that formalizes the MLE process. R&D‑Agent defines the MLE workflow into two phases and six components, turning agent design for MLE from ad‑hoc craftsmanship into a principled, testable process. Although several existing agents report promising gains on their chosen components, they can mostly be summarized as a partial optimization from our framework's simple baseline. Inspired by human experts, we designed efficient and effective agents within this framework that achieve state‑of‑the‑art performance. Evaluated on MLE‑Bench, the agent built on R&D‑Agent ranks as the top‑performing machine learning engineering agent, achieving 35.1% any medal rate, demonstrating the ability of the framework to speed up innovation and improve accuracy across a wide range of data science applications. We have open‑sourced R&D‑Agent on GitHub: https://github.com/microsoft/RD‑Agent.

PaperID: 1317, https://arxiv.org/pdf/2505.14668.pdf GitHub

Authors: Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan

Title: ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

Abstract:
Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule‑based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context‑aware proactive agent that incorporates extensive sensory contexts surrounding humans to enhance the proactivity of LLM agents. ContextAgent first extracts multi‑dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and personas from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context‑aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human‑centric, proactive AI assistants. The code and dataset are publicly available at https://github.com/openaiotlab/ContextAgent.

PaperID: 1318, https://arxiv.org/pdf/2505.13959.pdf GitHub

Authors: Marc Kaufeld, Korbinian Moller, Alessio Gambi, Paolo Arcaini, Johannes Betz

Title: MultiDrive: A Co-Simulation Framework Bridging 2D and 3D Driving Simulation for AV Software Validation

Abstract:
Scenario‑based testing using simulations is a cornerstone of Autonomous Vehicles (AVs) software validation. So far, developers needed to choose between low‑fidelity 2D simulators to explore the scenario space efficiently, and high‑fidelity 3D simulators to study relevant scenarios in more detail, thus reducing testing costs while mitigating the sim‑to‑real gap. This paper presents a novel framework that leverages multi‑agent co‑simulation and procedural scenario generation to support scenario‑based testing across low‑ and high‑fidelity simulators for the development of motion planning algorithms. Our framework limits the effort required to transition scenarios between simulators and automates experiment execution, trajectory analysis, and visualization. Experiments with a reference motion planner show that our framework uncovers discrepancies between the planner's intended and actual behavior, thus exposing weaknesses in planning assumptions under more realistic conditions. Our framework is available at: https://github.com/TUM‑AVS/MultiDrive

PaperID: 1319, https://arxiv.org/pdf/2505.13921.pdf GitHub

Authors: Wanjing Huang, Weixiang Yan, Zhen Zhang, Ambuj Singh

Title: APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

Abstract:
Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision‑Language Models (VLMs) or adaptive decision‑making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task‑specific training, limiting their real‑world applicability. We introduce APEX (Anticipatory Physics‑Enhanced Execution), a framework that equips LLMs with physics‑driven foresight for real‑time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low‑latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision‑making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics‑informed prediction enhances decision‑making performance in long‑horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM‑based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language‑based intelligence and real‑world task execution. The source code and experiment setup are publicly available at https://github.com/hwj20/APEX_EXP .

PaperID: 1320, https://arxiv.org/pdf/2505.13778.pdf GitHub

Authors: Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, Ang Li

Title: CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs

Abstract:
As post‑training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi‑step reasoning abilities, often optimized through reinforcement learning. These reasoning‑enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low‑effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding‑based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third‑party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%, showing the strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at https://github.com/CASE‑Lab‑UMD/LLM‑Auditing‑CoIn.

PaperID: 1321, https://arxiv.org/pdf/2505.13516.pdf GitHub GitHub

Authors: Zhipeng Hou, Junyi Tang, Yipeng Wang

Title: HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems

Abstract:
Recent advancements in Multi‑Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent‑role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert‑level tasks. To address these issues, we introduce HALO, a multi‑agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high‑level planning agent for task decomposition, mid‑level role‑design agents for subtask‑specific agent instantiation, and low‑level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task‑specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state‑of‑the‑art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert‑level tasks. The code repository is available at https://github.com/23japhone/HALO.

PaperID: 1322, https://arxiv.org/pdf/2505.13427.pdf GitHub

Authors: Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao

Title: MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Abstract:
While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision‑language understanding, they still struggle with complex multi‑step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine‑grained supervision over intermediate reasoning steps. To address this, we propose MM‑PRM, a process reward model trained within a fully automated, scalable framework. We first build MM‑Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM‑K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)‑based pipeline, we generate over 700k step‑level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best‑of‑N inference setup and achieves significant improvements across both in‑domain (MM‑K12 test set) and out‑of‑domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM‑PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM‑PRM.

PaperID: 1323, https://arxiv.org/pdf/2505.13426.pdf GitHub

Authors: Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang

Title: G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Abstract:
Vision‑Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision‑making within interactive, visually rich environments like games. This ``knowing‑doing'' gap significantly limits their potential as autonomous agents, as leading VLMs often performing badly in simple games. To address this, we introduce VLM‑Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi‑game parallel training. Leveraging VLM‑Gym, we train G0 models using pure RL‑driven self‑evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception‑enhanced cold start prior to RL fine‑tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude‑3.7‑Sonnet‑Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code including VLM‑Gym and RL training are released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.

PaperID: 1324, https://arxiv.org/pdf/2505.13291.pdf GitHub

Authors: Yifu Cai, Xinyu Li, Mononito Goswami, Michał Wiliński, Gus Welter, Artur Dubrawski

Title: TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

Abstract:
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well‑defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM‑based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open‑source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.

PaperID: 1325, https://arxiv.org/pdf/2505.13143.pdf GitHub

Authors: Haolang Lu, Yilian Liu, Jingxin Xu, Guoshun Nan, Yuanlong Yu, Zhican Chen, Kun Wang

Title: Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

Abstract:
The development of Reasoning Large Language Models (RLLMs) has significantly improved multi‑step reasoning capabilities, but it has also made hallucination problems more frequent and harder to eliminate. While existing approaches mitigate hallucinations through external knowledge integration, model parameter analysis, or self‑verification, they often fail to capture how hallucinations emerge and evolve across the reasoning chain. In this work, we study the causality of hallucinations under constrained knowledge domains by auditing the Chain‑of‑Thought (CoT) trajectory and assessing the model's cognitive confidence in potentially erroneous or biased claims. Our analysis reveals that in long‑CoT settings, RLLMs can iteratively reinforce biases and errors through flawed reflective reasoning, eventually leading to hallucinated reasoning paths. Surprisingly, even direct interventions at the origin of hallucinations often fail to reverse their effects, as reasoning chains exhibit 'chain disloyalty' ‑‑ a resistance to correction and a tendency to preserve flawed logic. Furthermore, we show that existing hallucination detection methods are less reliable and interpretable than previously assumed in complex reasoning scenarios. Unlike methods such as circuit tracing that require access to model internals, our black‑box auditing approach supports interpretable long‑chain hallucination attribution, offering better generalizability and practical utility. Our code is available at: https://github.com/Winnie‑Lian/AHa_Meta_Cognitive

PaperID: 1326, https://arxiv.org/pdf/2505.12188.pdf GitHub

Authors: Hanyu Wang, Xinrui Wu, Zijian Ding, Su Zheng, Chengyue Wang, Neha Prakriya, Tony Nowatzki, Yizhou Sun, Jason Cong

Title: LLM-DSE: Searching Accelerator Parameters with LLM Agents

Abstract:
Even though high‑level synthesis (HLS) tools mitigate the challenges of programming domain‑specific accelerators (DSAs) by raising the abstraction level, optimizing hardware directive parameters remains a significant hurdle. Existing heuristic and learning‑based methods struggle with adaptability and sample efficiency. We present LLM‑DSE, a multi‑agent framework designed specifically for optimizing HLS directives. Combining LLM with design space exploration (DSE), our explorer coordinates four agents: Router, Specialists, Arbitrator, and Critic. These multi‑agent components interact with various tools to accelerate the optimization process. LLM‑DSE leverages essential domain knowledge to identify efficient parameter combinations while maintaining adaptability through verbal learning from online interactions. Evaluations on the HLSyn dataset demonstrate that LLM‑DSE achieves substantial 2.55× performance gains over state‑of‑the‑art methods, uncovering novel designs while reducing runtime. Ablation studies validate the effectiveness and necessity of the proposed agent interactions. Our code is open‑sourced here: https://github.com/Nozidoali/LLM‑DSE.

PaperID: 1327, https://arxiv.org/pdf/2505.11942.pdf GitHub

Authors: Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, Qianli Ma

Title: LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners

Abstract:
Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)‑based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill‑grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self‑consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory‑capable LLM agents.

PaperID: 1328, https://arxiv.org/pdf/2505.11436.pdf GitHub

Authors: Yiming Lei, Chenkai Zhang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang

Title: GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Abstract:
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain‑of‑Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video‑based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi‑step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM‑based creativity. GODBench is publicly available at https://github.com/stan‑lei/GODBench‑ACL2025.

PaperID: 1329, https://arxiv.org/pdf/2505.11065.pdf GitHub

Authors: Changlun Li, Yao Shi, Chen Wang, Qiqi Duan, Runke Ruan, Weijie Huang, Haonan Long, Lijun Huang, Nan Tang, Yuyu Luo

Title: Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

Abstract:
Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real‑world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM‑driven trading strategies is their reliance on historical back‑testing, inadvertently enabling LLMs to "time travel"‑leveraging future information embedded in their training corpora, thus resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLM in real‑time market conditions. Utilizing a multi‑agent architecture, DeepFund connects directly with real‑time stock market data‑specifically data published after each model pretraining cutoff‑to ensure fair and leakage‑free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions‑including ticker‑level analysis, investment decision‑making, portfolio management, and risk control‑reveal significant practical challenges. Notably, even cutting‑edge models such as DeepSeek‑V3 and Claude‑3.7‑Sonnet incur net trading losses within DeepFund real‑time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.

PaperID: 1330, https://arxiv.org/pdf/2505.10607.pdf GitHub

Authors: Patara Trirat, Jae-Gil Lee

Title: MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices

Abstract:
The growing use of smartphones and IoT devices necessitates efficient time‑series analysis on resource‑constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware‑aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time‑series analysis with edge deployment. Leveraging the problem‑solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi‑Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time‑series inputs and hardware constraints, alongside an LLM agent‑based multi‑objective search to achieve deployment‑ready models via code generation. By integrating numerical data, time‑series images, and textual descriptions, MONAQ improves an LLM's understanding of time‑series data. Experiments on fifteen datasets demonstrate that MONAQ‑discovered models outperform both handcrafted models and NAS baselines while being more efficient.

PaperID: 1331, https://arxiv.org/pdf/2505.10321.pdf GitHub

Authors: Julius Henke

Title: AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents

Abstract:
A recent area of increasing research is the use of Large Language Models (LLMs) in penetration testing, which promises to reduce costs and thus allow for higher frequency. We conduct a review of related work, identifying best practices and common evaluation issues. We then present AutoPentest, an application for performing black‑box penetration tests with a high degree of autonomy. AutoPentest is based on the LLM GPT‑4o from OpenAI and the LLM agent framework LangChain. It can perform complex multi‑step tasks, augmented by external tools and knowledge bases. We conduct a study on three capture‑the‑flag style Hack The Box (HTB) machines, comparing our implementation AutoPentest with the baseline approach of manually using the ChatGPT‑4o user interface. Both approaches are able to complete 15‑25 % of the subtasks on the HTB machines, with AutoPentest slightly outperforming ChatGPT. We measure a total cost of \96.20 US when using AutoPentest across all experiments, while a one‑month subscription to ChatGPT Plus costs \20. The results show that further implementation efforts and the use of more powerful LLMs released in the future are likely to make this a viable part of vulnerability management.

PaperID: 1332, https://arxiv.org/pdf/2505.10278.pdf GitHub

Authors: Taian Guo, Haiyang Shen, JinSheng Huang, Zhengyang Mao, Junyu Luo, Binqi Chen, Zhuoru Chen, Luchen Liu, Bingyu Xia, Xuhui Liu, Yun Ma, Ming Zhang

Title: MASS: Muli-agent simulation scaling for portfolio construction

Abstract:
The application of LLM‑based agents in financial investment has shown significant promise, yet existing approaches often require intermediate steps like predicting individual stock movements or rely on predefined, static workflows. These limitations restrict their adaptability and effectiveness in constructing optimal portfolios. In this paper, we introduce the Multi‑Agent Scaling Simulation (MASS), a novel framework that leverages multi‑agent simulation for direct, end‑to‑end portfolio construction. At its core, MASS employs a backward optimization process to dynamically learn the optimal distribution of heterogeneous agents, enabling the system to adapt to evolving market regimes. A key finding enabled by our framework is the exploration of the scaling effect for portfolio construction: we demonstrate that as the number of agents increases exponentially (up to 512), the aggregated decisions yield progressively higher excess returns. Extensive experiments on a challenging, self‑collected dataset from the 2023 Chinese A‑share market show that MASS consistently outperforms seven state‑of‑the‑art baselines. Further backtesting, stability analyses and the experiment on data leakage concerns validate its enhanced profitability and robustness. We have open‑sourced our code, dataset, and training snapshots at https://github.com/gta0804/MASS/ to foster further research.

PaperID: 1333, https://arxiv.org/pdf/2505.08106.pdf GitHub

Authors: Jiashen, Du, Jesse Yao, Allen Liu, Zhekai Zhang

Title: Are LLMs complicated ethical dilemma analyzers?

Abstract:
One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real‑world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non‑expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT‑4o‑mini, Claude‑3.5‑Sonnet, Deepseek‑V3, Gemini‑1.5‑Flash) using a composite metric framework based on BLEU, Damerau‑Levenshtein distance, TF‑IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion‑based ranking alignment and pairwise AHP analysis, enabling fine‑grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non‑expert humans in lexical and structural alignment, with GPT‑4o‑mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision‑making.

PaperID: 1334, https://arxiv.org/pdf/2505.08084.pdf GitHub

Authors: Yu Cheng, Arushi Goel, Hakan Bilen

Title: Visually Interpretable Subtask Reasoning for Visual Question Answering

Abstract:
Answering complex visual questions like `Which red furniture can be used for sitting?' requires multi‑step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub‑task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask‑Aware Reasoning Model), a subtask‑driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine‑tunes MLLMs to produce structured Subtask‑of‑Thought rationales (step‑by‑step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.

PaperID: 1335, https://arxiv.org/pdf/2505.07266.pdf GitHub

Authors: Micah Nye, Ayoub Raji, Andrew Saba, Eidan Erlich, Robert Exley, Aragya Goyal, Alexander Matros, Ritesh Misra, Matthew Sivaprakasam, Marko Bertogna, Deva Ramanan, Sebastian Scherer

Title: BETTY Dataset: A Multi-modal Dataset for Full-Stack Autonomy

Abstract:
We present the BETTY dataset, a large‑scale, multi‑modal dataset collected on several autonomous racing vehicles, targeting supervised and self‑supervised state estimation, dynamics modeling, motion forecasting, perception, and more. Existing large‑scale datasets, especially autonomous vehicle datasets, focus primarily on supervised perception, planning, and motion forecasting tasks. Our work enables multi‑modal, data‑driven methods by including all sensor inputs and the outputs from the software stack, along with semantic metadata and ground truth information. The dataset encompasses 4 years of data, currently comprising over 13 hours and 32TB, collected on autonomous racing vehicle platforms. This data spans 6 diverse racing environments, including high‑speed oval courses, for single and multi‑agent algorithm evaluation in feature‑sparse scenarios, as well as high‑speed road courses with high longitudinal and lateral accelerations and tight, GPS‑denied environments. It captures highly dynamic states, such as 63 m/s crashes, loss of tire traction, and operation at the limit of stability. By offering a large breadth of cross‑modal and dynamic data, the BETTY dataset enables the training and testing of full autonomy stack pipelines, pushing the performance of all algorithms to the limits. The current dataset is available at https://pitt‑mit‑iac.github.io/betty‑dataset/.

PaperID: 1336, https://arxiv.org/pdf/2505.07008.pdf GitHub

Authors: Fengming Zhu, Fangzhen Lin

Title: Constant-Memory Strategies in Stochastic Games: Best Responses and Equilibria

Abstract:
Stochastic games have become a prevalent framework for studying long‑term multi‑agent interactions, especially in the context of multi‑agent reinforcement learning. In this work, we comprehensively investigate the concept of constant‑memory strategies in stochastic games. We first establish some results on best responses and Nash equilibria for behavioral constant‑memory strategies, followed by a discussion on the computational hardness of best responding to mixed constant‑memory strategies. Those theoretic insights are later verified on several sequential decision‑making testbeds, including the Iterated Prisoner's Dilemma, the Iterated Traveler's Dilemma, and the Pursuit domain. This work aims to enhance the understanding of theoretical issues in single‑agent planning under multi‑agent systems, and uncover the connection between decision models in single‑agent and multi‑agent contexts. The code is available at \texttthttps://github.com/Fernadoo/Const‑Mem.

PaperID: 1337, https://arxiv.org/pdf/2505.06771.pdf GitHub

Authors: Shalin Anand Jain, Jiazhen Liu, Siva Kailas, Harish Ravichandar

Title: JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes

Abstract:
Multi‑agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi‑robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi‑robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi‑Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics‑relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax‑powered end‑to‑end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi‑robot RL (MRRL) policies with realistic robot dynamics and safety constraints, supporting parallelization and hardware acceleration. Our generalizable learning interface integrates easily with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level‑Based Foraging) to a robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open‑access sim‑to‑real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi‑robot learning research and evaluation. Our code is available at https://github.com/GT‑STAR‑Lab/JaxRobotarium.

PaperID: 1338, https://arxiv.org/pdf/2505.06761.pdf GitHub

Authors: Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan Dubiński, Ahmed Nabil Belbachir, Anis Yazidi

Title: Learning Graph Representation of Agent Diffusers

Abstract:
Diffusion‑based generative models have significantly advanced text‑to‑image synthesis, demonstrating impressive text comprehension and zero‑shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR‑AD (Learning Graph Representation of Agent Diffusers), a novel multi‑agent system designed to improve adaptability in dynamic computer vision tasks. LGR‑AD models the generation process as a distributed system of interacting agents, each representing an expert sub‑model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top‑k maximum spanning trees, optimizing the generation process. Each agent's decision‑making is guided by a meta‑model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR‑AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: https://github.com/YousIA/LGR_AD

PaperID: 1339, https://arxiv.org/pdf/2505.06617.pdf GitHub

Authors: Timothée Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, Sebastian Risi

Title: Adversarial Coevolutionary Illumination with Generational Adversarial MAP-Elites

Abstract:
Quality‑Diversity (QD) algorithms seek to discover diverse, high‑performing solutions across a behavior space, in contrast to conventional optimization methods that target a single optimum. Adversarial problems present unique challenges for QD approaches, as the competing nature of opposing sides creates interdependencies that complicate the evolution process. Existing QD methods applied to such scenarios typically fix one side, constraining the open‑endedness. We present Generational Adversarial MAP‑Elites (GAME), a coevolutionary QD algorithm that evolves both sides by alternating which side is evolved at each generation. By integrating a vision embedding model (VEM), our approach eliminates the need for domain‑specific behavior descriptors and instead operates on video. We validate GAME across three distinct adversarial domains: a multi‑agent battle game, a soft‑robot wrestling environment, and a deck building game. We validate that all its components are necessary, that the VEM is effective in two different domains, and that GAME finds better solutions than one‑sided QD baselines. Our experiments reveal several evolutionary phenomena, including arms race‑like dynamics, enhanced novelty through generational extinction, and the preservation of neutral mutations as crucial stepping stones toward the highest performance. While GAME successfully illuminates all three adversarial problems, its capacity for truly open‑ended discovery remains constrained by the nature of the search spaces used in this paper. These findings show GAME's broad applicability and highlight opportunities for future research into open‑ended adversarial coevolution. Code and videos are available at: https://github.com/Timothee‑ANNE/GAME

PaperID: 1340, https://arxiv.org/pdf/2505.05440.pdf GitHub

Authors: Biao Yi, Xavier Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, Fan Wu

Title: EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation

Abstract:
To tackle increasingly complex tasks, recent research on mobile agents has shifted towards multi‑agent collaboration. Current mobile multi‑agent systems are primarily deployed in the cloud, leading to high latency and operational costs. A straightforward idea is to deploy a device‑cloud collaborative multi‑agent system, which is nontrivial, as directly extending existing systems introduces new challenges: (1) reliance on cloud‑side verification requires uploading mobile screenshots, compromising user privacy; and (2) open‑loop cooperation lacking device‑to‑cloud feedback, underutilizing device resources and increasing latency. To overcome these limitations, we propose EcoAgent, a closed‑loop device‑cloud collaborative multi‑agent framework designed for privacy‑aware, efficient, and responsive mobile automation. EcoAgent integrates a novel reasoning approach, Dual‑ReACT, into the cloud‑based Planning Agent, fully exploiting cloud reasoning to compensate for limited on‑device capacity, thereby enabling device‑side verification and lightweight feedback. Furthermore, the device‑based Observation Agent leverages a Pre‑understanding Module to summarize screen content into concise textual descriptions, significantly reducing token usage and device‑cloud communication overhead while preserving privacy. Experiments on AndroidWorld demonstrate that EcoAgent matches the task success rates of fully cloud‑based agents, while reducing resource consumption and response latency. Our project is available here: https://github.com/Yi‑Biao/EcoAgent.

PaperID: 1341, https://arxiv.org/pdf/2505.05029.pdf GitHub

Authors: Siyue Ren, Wanli Fu, Xinkun Zou, Chen Shen, Yi Cai, Chen Chu, Zhen Wang, Shuyue Hu

Title: Reputation as a Solution to Cooperation Collapse in LLM-based MASs

Abstract:
Cooperation has long been a fundamental topic in both human society and AI systems. However, recent studies indicate that the collapse of cooperation may emerge in multi‑agent systems (MASs) driven by large language models (LLMs). To address this challenge, we explore reputation systems as a remedy. We propose RepuNet, a dynamic, dual‑level reputation framework that models both agent‑level reputation dynamics and system‑level network evolution. Specifically, driven by direct interactions and indirect gossip, agents form reputations for both themselves and their peers, and decide whether to connect or disconnect other agents for future interactions. Through three distinct scenarios, we show that RepuNet effectively avoids cooperation collapse, promoting and sustaining cooperation in LLM‑based MASs. Moreover, we find that reputation systems can give rise to rich emergent behaviors in LLM‑based MASs, such as the formation of cooperative clusters, the social isolation of exploitative agents, and the preference for sharing positive gossip rather than negative ones. The GitHub repository for our project can be accessed via the following link: https://github.com/RGB‑0000FF/RepuNet.

PaperID: 1342, https://arxiv.org/pdf/2505.04364.pdf GitHub

Authors: Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun

Title: Benchmarking LLMs' Swarm intelligence

Abstract:
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi‑Agent Systems (MAS) when operating under strict swarm‑like constraints‑limited local perception and communication‑remains largely unexplored. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination when agents operate with incomplete spatio‑temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely solely on local sensory input (k× k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Zero‑shot evaluations of leading LLMs (e.g., deepseek‑v3, o4‑mini) reveal significant task‑dependent performance variations. While some rudimentary coordination is observed, our results indicate that current LLMs significantly struggle with robust long‑range planning and adaptive strategy formation under the uncertainty inherent in these decentralized scenarios. Assessing LLMs under such swarm‑like constraints is crucial for understanding their utility in future decentralized intelligent systems. We release SwarmBench as an open, extensible toolkit‑built on a customizable physical system‑providing environments, prompts, evaluation scripts, and comprehensive datasets. This aims to foster reproducible research into LLM‑based MAS coordination and the theoretical underpinnings of emergent collective behavior under severe informational decentralization. Our code repository is available at https://github.com/x66ccff/swarmbench.

PaperID: 1343, https://arxiv.org/pdf/2505.03961.pdf GitHub

Authors: Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A. Selby, Sebastian J. Vollmer

Title: The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete

Abstract:
According to Yuval Noah Harari, large‑scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions:(1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the agent numbers grow? (4) Are agents resilient against self‑serving negotiators? We find that story‑based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self‑interest prevail. We hypothesize that these results carry implications for multi‑agent system design and AI alignment.

PaperID: 1344, https://arxiv.org/pdf/2505.03735.pdf GitHub

Authors: Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, Weidi Xie

Title: Multi-Agent System for Comprehensive Soccer Understanding

Abstract:
Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Concretely, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large‑scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge‑driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer‑specific benchmark, featuring around 10K multimodal (text, image, video) multi‑choice QA pairs across 13 distinct tasks; (iii) we introduce SoccerAgent, a novel multi‑agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and comparisons with representative MLLMs on SoccerBench highlight the superiority of our agentic system.

PaperID: 1345, https://arxiv.org/pdf/2505.03673.pdf GitHub

Authors: Huajie Tan, Xiaoshuai Hao, Cheng Chi, Minglan Lin, Yaoxu Lyu, Mingyu Cao, Dong Liang, Zhuo Chen, Mengsi Lyu, Cheng Peng, Chenrui He, Yulong Ao, Yonghua Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Title: RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration

Abstract:
The dawn of embodied intelligence has ushered in an unprecedented imperative for resilient, cognition‑enabled multi‑agent collaboration across next‑generation ecosystems, revolutionizing paradigms in autonomous manufacturing, adaptive service robotics, and cyber‑physical production architectures. However, current robotic systems face significant limitations, such as limited cross‑embodiment adaptability, inefficient task scheduling, and insufficient dynamic error correction. While End‑to‑end VLA models demonstrate inadequate long‑horizon planning and task generalization, hierarchical VLA models suffer from a lack of cross‑embodiment and multi‑agent coordination capabilities. To address these challenges, we introduce RoboOS, the first open‑source embodied system built on a Brain‑Cerebellum hierarchical architecture, enabling a paradigm shift from single‑agent to multi‑agent intelligence. Specifically, RoboOS consists of three key components: (1) Embodied Brain Model (RoboBrain), a MLLM designed for global perception and high‑level decision‑making; (2) Cerebellum Skill Library, a modular, plug‑and‑play toolkit that facilitates seamless execution of multiple skills; and (3) Real‑Time Shared Memory, a spatiotemporal synchronization mechanism for coordinating multi‑agent states. By integrating hierarchical information flow, RoboOS bridges Embodied Brain and Cerebellum Skill Library, facilitating robust planning, scheduling, and error correction for long‑horizon tasks, while ensuring efficient multi‑agent collaboration through Real‑Time Shared Memory. Furthermore, we enhance edge‑cloud communication and cloud‑based distributed inference to facilitate high‑frequency interactions and enable scalable deployment. Extensive real‑world experiments across various scenarios, demonstrate RoboOS's versatility in supporting heterogeneous embodiments. Project website: https://github.com/FlagOpen/RoboOS

PaperID: 1346, https://arxiv.org/pdf/2505.03586.pdf GitHub GitHub

Authors: Songchen Fu, Siang Chen, Shaojing Zhao, Letian Bai, Ta Li, Yonghong Yan

Title: Rainbow Delay Compensation: A Multi-Agent Reinforcement Learning Framework for Mitigating Delayed Observation

Abstract:
In real‑world multi‑agent systems (MASs), observation delays are ubiquitous, preventing agents from making decisions based on the environment's true state. An individual agent's local observation typically comprises multiple components from other agents or dynamic entities within the environment. These discrete observation components with varying delay characteristics pose significant challenges for multi‑agent reinforcement learning (MARL). In this paper, we first formulate the decentralized stochastic individual delay partially observable Markov decision process (DSID‑POMDP) by extending the standard Dec‑POMDP. We then propose the Rainbow Delay Compensation (RDC), a MARL training framework for addressing stochastic individual delays, along with recommended implementations for its constituent modules. We implement the DSID‑POMDP's observation generation pattern using standard MARL benchmarks, including MPE and SMAC. Experiments demonstrate that baseline MARL methods suffer severe performance degradation under fixed and unfixed delays. The RDC‑enhanced approach mitigates this issue, remarkably achieving ideal delay‑free performance in certain delay scenarios while maintaining generalizability. Our work provides a novel perspective on multi‑agent delayed observation problems and offers an effective solution framework. The source code is available at https://github.com/linkjoker1006/RDC‑pymarl.

PaperID: 1347, https://arxiv.org/pdf/2505.02820.pdf GitHub

Authors: Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang

Title: AutoLibra: Agent Metric Induction from Open-Ended Human Feedback

Abstract:
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open‑ended human feedback e.g. "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own" into metrics for evaluating fine‑grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM‑as‑a‑Judge as evaluators. We further propose two meta metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta‑metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra serve human prompt engineers for diagonalize agent failures and improve prompts iterative. Moreover, we find that AutoLibra can induce metrics for automatic optimization for agents, which makes agents improve through self‑regulation. Our results suggest that AutoLibra is a powerful task‑agnostic tool for evaluating and improving language agents.

PaperID: 1348, https://arxiv.org/pdf/2505.02293.pdf GitHub

Authors: Jason J. Choi, Jasmine Jerry Aloor, Jingqi Li, Maria G. Mendoza, Hamsa Balakrishnan, Claire J. Tomlin

Title: Resolving Conflicting Constraints in Multi-Agent Reinforcement Learning with Layered Safety

Abstract:
Preventing collisions in multi‑robot navigation is crucial for deployment. This requirement hinders the use of learning‑based approaches, such as multi‑agent reinforcement learning (MARL), on their own due to their lack of safety guarantees. Traditional control methods, such as reachability and control barrier functions, can provide rigorous safety guarantees when interactions are limited only to a small number of robots. However, conflicts between the constraints faced by different agents pose a challenge to safe multi‑agent coordination. To overcome this challenge, we propose a method that integrates multiple layers of safety by combining MARL with safety filters. First, MARL is used to learn strategies that minimize multiple agent interactions, where multiple indicates more than two. Particularly, we focus on interactions likely to result in conflicting constraints within the engagement distance. Next, for agents that enter the engagement distance, we prioritize pairs requiring the most urgent corrective actions. Finally, a dedicated safety filter provides tactical corrective actions to resolve these conflicts. Crucially, the design decisions for all layers of this framework are grounded in reachability analysis and a control barrier‑value function‑based filtering mechanism. We validate our Layered Safe MARL framework in 1) hardware experiments using Crazyflie drones and 2) high‑density advanced aerial mobility (AAM) operation scenarios, where agents navigate to designated waypoints while avoiding collisions. The results show that our method significantly reduces conflict while maintaining safety without sacrificing much efficiency (i.e., shorter travel time and distance) compared to baselines that do not incorporate layered safety. The project website is available at https://dinamo‑mit.github.io/Layered‑Safe‑MARL/

PaperID: 1349, https://arxiv.org/pdf/2505.01862.pdf GitHub

Authors: Linus Nwankwo, Bjoern Ellensohn, Ozan Özdenizci, Elmar Rueckert

Title: ReLI: A Language-Agnostic Approach to Human-Robot Interaction

Abstract:
Adapting autonomous agents for real‑world industrial, domestic, and other daily tasks is currently gaining momentum. However, in global or cross‑lingual application contexts, ensuring effective interaction with the environment and executing unrestricted human‑specified tasks regardless of the language remains an unsolved problem. To address this, we propose ReLI, a language‑agnostic approach that enables autonomous agents to converse naturally, semantically reason about their environment, and perform downstream tasks, regardless of the task instruction's modality or linguistic origin. First, we ground large‑scale pre‑trained foundation models and transform them into language‑to‑action models that can directly provide common‑sense reasoning and high‑level robot control through natural, free‑flow conversational interactions. Further, we perform cross‑lingual adaptation of the models to ensure that ReLI generalises across the global languages. To demonstrate ReLI's robustness, we conducted extensive experiments on various short‑ and long‑horizon tasks, including zero‑ and few‑shot spatial navigation, scene information retrieval, and query‑oriented tasks. We benchmarked the performance on 140 languages involving 70K+ multi‑turn conversations. On average, ReLI achieved over 90%\pm0.2 accuracy in cross‑lingual instruction parsing and task execution success. These results demonstrate its potential to advance natural human‑agent interaction in the real world while championing inclusive and linguistic diversity. Demos and resources will be public at: https://linusnep.github.io/ReLI/.

PaperID: 1350, https://arxiv.org/pdf/2505.01709.pdf GitHub

Authors: Kaidong Zhang, Rongtao Xu, Pengzhen Ren, Junfan Lin, Hefeng Wu, Liang Lin, Xiaodan Liang

Title: RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

Abstract:
Operating robots in open‑ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high‑level cognitive planner (HCP) based on a large‑scale pre‑trained vision‑language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim‑to‑real generalization using only five real‑world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

PaperID: 1351, https://arxiv.org/pdf/2505.00212.pdf GitHub

Authors: Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu

Title: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Abstract:
Failure attribution in LLM multi‑agent systems‑identifying the agent and step responsible for task failures‑provides crucial clues for systems debugging but remains underexplored and labor‑intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi‑agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi‑agent systems with fine‑grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure‑responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution

PaperID: 1352, https://arxiv.org/pdf/2504.21716.pdf GitHub

Authors: Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze

Title: LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Abstract:
We present an embodied robotic system with an LLM‑driven agent‑orchestration architecture for autonomous household object management. The system integrates memory‑augmented task planning, enabling robots to execute high‑level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task‑specific LLMs. By leveraging in‑context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long‑term object tracking. A combination of Grounded SAM and LLaMa3.2‑Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat‑hsr.

PaperID: 1353, https://arxiv.org/pdf/2504.21579.pdf GitHub

Authors: Stavros Anagnou, Christoph Salge, Peter R. Lewis

Title: Uncertainty, bias and the institution bootstrapping problem

Abstract:
Institutions play a critical role in enabling communities to manage common‑pool resources and avert tragedies of the commons. However, a fundamental issue arises: Individuals typically perceive participation as advantageous only after an institution is established, creating a paradox: How can institutions form if no one will join before a critical mass exists? We term this conundrum the institution bootstrapping problem and propose that misperception, specifically, agents' erroneous belief that an institution already exists, could resolve this paradox. By integrating well‑documented psychological phenomena, including cognitive biases, probability distortion, and perceptual noise, into a game‑theoretic framework, we demonstrate how these factors collectively mitigate the bootstrapping problem. Notably, unbiased perceptual noise (e.g., noise arising from agents' heterogeneous physical or social contexts) drastically reduces the critical mass of cooperators required for institutional emergence. This effect intensifies with greater diversity of perceptions. We explain this counter‑intuitive result through asymmetric boundary conditions: proportional underestimation of low‑probability sanctions produces distinct outcomes compared to equivalent overestimation. Furthermore, the type of perceptual distortion, proportional versus absolute, yields qualitatively different evolutionary pathways. These findings challenge conventional assumptions about rationality in institutional design, highlighting how "noisy" cognition can paradoxically enhance cooperation. Finally, we contextualize these insights within broader discussions of multi‑agent system design and collective action. Our analysis underscores the importance of incorporating human‑like cognitive constraints, not just idealized rationality, into models of institutional emergence and resilience.

PaperID: 1354, https://arxiv.org/pdf/2504.21159.pdf GitHub

Authors: Alexander L. Mitchell, Tobit Flatscher, Ingmar Posner

Title: Task and Joint Space Dual-Arm Compliant Control

Abstract:
Robots that interact with humans or perform delicate manipulation tasks must exhibit compliance. However, most commercial manipulators are rigid and suffer from significant friction, limiting end‑effector tracking accuracy in torque‑controlled modes. To address this, we present a real‑time, open‑source impedance controller that smoothly interpolates between joint‑space and task‑space compliance. This hybrid approach ensures safe interaction and precise task execution, such as sub‑centimetre pin insertions. We deploy our controller on Frank, a dual‑arm platform with two Kinova Gen3 arms, and compensate for modelled friction dynamics using a model‑free observer. The system is real‑time capable and integrates with standard ROS tools like MoveIt!. It also supports high‑frequency trajectory streaming, enabling closed‑loop execution of trajectories generated by learning‑based methods, optimal control, or teleoperation. Our results demonstrate robust tracking and compliant behaviour even under high‑friction conditions. The complete system is available open‑source at https://github.com/applied‑ai‑lab/compliant_controllers.

PaperID: 1355, https://arxiv.org/pdf/2504.21024.pdf GitHub

Authors: Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, Dong Yu

Title: WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model

Abstract:
Agent self‑improvement, where the backbone Large Language Model (LLM) of the agent are trained on trajectories sampled autonomously based on their own policies, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre‑trained web knowledge in LLMs. To improve the performance of self‑improvement, we propose a novel framework that introduces a co‑evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self‑instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look‑ahead simulation to guide action selection for the agent LLM. Experiments in real‑world web environments (Mind2Web‑Live, WebVoyager, and GAIA‑web) show a 10% performance gain over existing self‑evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful close‑sourced models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability. Code is available at https://github.com/Tencent/SelfEvolvingAgent

PaperID: 1356, https://arxiv.org/pdf/2504.20965.pdf GitHub

Authors: Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang

Title: AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

Abstract:
We introduce AegisLLM, a cooperative multi‑agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents ‑ orchestrator, deflector, responder, and evaluator ‑ collaborate to ensure safe and compliant LLM outputs, while self‑improving over time through prompt optimization. We show that scaling agentic reasoning system at test‑time ‑ both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy)‑ substantially enhances robustness without compromising model utility. This test‑time defense enables real‑time adaptability to evolving attacks, without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near‑perfect unlearning with only 20 training examples and fewer than 300 LM calls. For jailbreaking benchmarks, we achieve 51% improvement compared to the base model on StrongReject, with false refusal rates of only 7.9% on PHTest compared to 18‑55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at https://github.com/zikuicai/aegisllm

PaperID: 1357, https://arxiv.org/pdf/2504.20771.pdf GitHub GitHub

Authors: Haitao Wu, Zongbo Han, Joey Tianyi Zhou, Huaxi Huang, Changqing Zhang

Title: Computational Reasoning of Large Language Models

Abstract:
With the rapid development and widespread application of Large Language Models (LLMs), multidimensional evaluation has become increasingly critical. However, current evaluations are often domain‑specific and overly complex, limiting their effectiveness as cross‑domain proxies for core capabilities. To address these limitations and enable a unified and simple evaluation framework, an ideal proxy task should target a basic capability that generalizes across tasks and is independent of domain‑specific knowledge. Turing machine provides a powerful theoretical lens by reducing complex processes to basic, domain‑agnostic computational operations. This perspective offers a principled framework for evaluating basic computational abilities essential to a wide range of tasks. Motivated by this abstraction, we introduce Turing Machine Bench, a benchmark designed to assess the ability of LLMs to strictly follow rules and accurately manage internal states for multi‑step, referred to as computational reasoning. TMBench incorporates four key features: self‑contained and knowledge‑agnostic reasoning, a minimalistic multi‑step structure, controllable difficulty, and a solid theoretical foundation based on Turing machine. Empirical results demonstrate that TMBench serves as an effective proxy for evaluating computational reasoning on representative LLMs. It produces clear step‑wise accuracy curves, revealing LLMs' ability to execute multi‑step reasoning processes. By analyzing performance trends across TMBench and established reasoning benchmarks, we find strong correlations with real‑world tasks, bridging real‑task evaluation with basic ability assessment. These findings suggest that TMBench holds potential as a cross‑domain dimension for evaluating reasoning in LLMs. Code and data are available at \hrefhttps://github.com/HaitaoWuTJU/Turing‑Machine‑BenchRepo.

PaperID: 1358, https://arxiv.org/pdf/2504.20593.pdf GitHub

Authors: Rilind Sahitaj, Paulius Sasnauskas, Yiğit Yalın, Debmalya Mandal, Goran Radanović

Title: Independent Learning in Performative Markov Potential Games

Abstract:
Performative Reinforcement Learning (PRL) refers to a scenario in which the deployed policy changes the reward and transition dynamics of the underlying environment. In this work, we study multi‑agent PRL by incorporating performative effects into Markov Potential Games (MPGs). We introduce the notion of a performatively stable equilibrium (PSE) and show that it always exists under a reasonable sensitivity assumption. We then provide convergence results for state‑of‑the‑art algorithms used to solve MPGs. Specifically, we show that independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) converge to an approximate PSE in the best‑iterate sense, with an additional term that accounts for the performative effects. Furthermore, we show that INPG asymptotically converges to a PSE in the last‑iterate sense. As the performative effects vanish, we recover the convergence rates from prior work. For a special case of our game, we provide finite‑time last‑iterate convergence results for a repeated retraining approach, in which agents independently optimize a surrogate objective. We conduct extensive experiments to validate our theoretical findings.

PaperID: 1359, https://arxiv.org/pdf/2504.20115.pdf GitHub

Authors: Zijie Lin, Yiqing Shen, Qilin Cai, He Sun, Jinrui Zhou, Mingjun Xiao

Title: AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers

Abstract:
Machine Learning (ML) research is spread through academic papers featuring rich multimodal content, including text, diagrams, and tabular results. However, translating these multimodal elements into executable code remains a challenging and time‑consuming process that requires substantial ML expertise. We introduce ``Paper‑to‑Code'' (P2C), a novel task that transforms the multimodal content of scientific publications into fully executable code repositories, which extends beyond the existing formulation of code generation that merely converts textual descriptions into isolated code snippets. To automate the P2C process, we propose AutoP2C, a multi‑agent framework based on large language models that processes both textual and visual content from research papers to generate complete code repositories. Specifically, AutoP2C contains four stages: (1) repository blueprint extraction from established codebases, (2) multimodal content parsing that integrates information from text, equations, and figures, (3) hierarchical task decomposition for structured code generation, and (4) iterative feedback‑driven debugging to ensure functionality and performance. Evaluation on a benchmark of eight research papers demonstrates the effectiveness of AutoP2C, which can successfully generate executable code repositories for all eight papers, while OpenAI‑o1 or DeepSeek‑R1 can only produce runnable code for one paper. The code is available at https://github.com/shoushouyu/Automated‑Paper‑to‑Code.

PaperID: 1360, https://arxiv.org/pdf/2504.20091.pdf GitHub

Authors: Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Yasunori Ishii, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli

Title: VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Abstract:
Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame‑level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with a question‑guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state‑of‑the‑art performance on Intent‑QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT‑QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.

PaperID: 1361, https://arxiv.org/pdf/2504.20073.pdf GitHub

Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

Title: RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Abstract:
Training large language models (LLMs) as interactive agents presents unique challenges including long‑horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi‑turn agent RL training remains underexplored. We propose StarPO (State‑Thinking‑Actions‑Reward Policy Optimization), a general framework for trajectory‑level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes; we address this with StarPO‑S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine‑grained, reasoning‑aware reward signals, agent reasoning hardly emerge through multi‑turn RL and they may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN‑AI/RAGEN.

PaperID: 1362, https://arxiv.org/pdf/2504.19838.pdf GitHub

Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, WenHao Wang, Tianze Wu, Zhengxi Lu, Siheng Chen, LiLinghao, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li

Title: LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

Abstract:
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM‑driven phone GUI agents, highlighting their evolution from script‑based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision‑making. We then propose a taxonomy covering fundamental agent frameworks (single‑agent, multi‑agent, plan‑then‑act), modeling approaches (prompt engineering, training‑based), and essential datasets and benchmarks. Furthermore, we detail task‑specific architectures, supervised fine‑tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on‑device deployment efficiency, user‑centric adaptation, and security concerns, offering forward‑looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user‑friendly phone GUI agents. The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: https://github.com/PhoneLLM/Awesome‑LLM‑Powered‑Phone‑GUI‑Agents

PaperID: 1363, https://arxiv.org/pdf/2504.19487.pdf GitHub

Authors: Kavindu Warnakulasuriya, Prabhash Dissanayake, Navindu De Silva, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Surangika Ranathunga, Nisansa de Silva

Title: Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies

Abstract:
The evolution of cooperation has been extensively studied using abstract mathematical models and simulations. Recent advances in Large Language Models (LLMs) and the rise of LLM agents have demonstrated their ability to perform social reasoning, thus providing an opportunity to test the emergence of norms in more realistic agent‑based simulations with human‑like reasoning using natural language. In this research, we investigate whether the cooperation dynamics presented in Boyd and Richerson's model persist in a more realistic simulation of the Diner's Dilemma using LLM agents compared to the abstract mathematical nature in the work of Boyd and Richerson. Our findings indicate that agents follow the strategies defined in the Boyd and Richerson model, and explicit punishment mechanisms drive norm emergence, reinforcing cooperative behaviour even when the agent strategy configuration varies. Our results suggest that LLM‑based Multi‑Agent System simulations, in fact, can replicate the evolution of cooperation predicted by the traditional mathematical models. Moreover, our simulations extend beyond the mathematical models by integrating natural language‑driven reasoning and a pairwise imitation method for strategy adoption, making them a more realistic testbed for cooperative behaviour in MASs.

PaperID: 1364, https://arxiv.org/pdf/2504.19314.pdf GitHub

Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua

Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Abstract:
As large language models (LLMs) evolve into tool‑using agents, the ability to browse the web in real‑time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship‑related complexities of other major information ecosystems ‑‑ most notably Chinese. To address this gap, we introduce BrowseComp‑ZH, a high‑difficulty benchmark purpose‑built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp‑ZH consists of 289 multi‑hop questions spanning 11 diverse domains. Each question is reverse‑engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two‑stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state‑of‑the‑art language models and agentic search systems on our proposed BrowseComp‑ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best‑performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp‑ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation ‑‑ capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp‑ZH.

PaperID: 1365, https://arxiv.org/pdf/2504.18373.pdf GitHub

Authors: Lei Shen, Xiaoyu Shen

Title: Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

Abstract:
In recent years, multi‑agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto‑SLURP, a benchmark dataset aimed at evaluating LLM‑based multi‑agent frameworks in the context of intelligent personal assistants. Auto‑SLURP extends the original SLURP dataset ‑‑ initially developed for natural language understanding tasks ‑‑ by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end‑to‑end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto‑SLURP presents a significant challenge for current state‑of‑the‑art frameworks, highlighting that truly reliable and intelligent multi‑agent personal assistants remain a work in progress. The dataset and related code are available at https://github.com/lorashen/Auto‑SLURP/.

PaperID: 1366, https://arxiv.org/pdf/2504.18070.pdf GitHub

Authors: Jingjin Wang, Jiawei Han

Title: PropRAG: Guiding Retrieval with Beam Search over Proposition Paths

Abstract:
Retrieval Augmented Generation (RAG) has become the standard approach for equipping Large Language Models (LLMs) with up‑to‑date knowledge. However, standard RAG, relying on independent passage retrieval, often fails to capture the interconnected nature of information required for complex, multi‑hop reasoning. While structured RAG methods attempt to address this using knowledge graphs built from triples, we argue that the inherent context loss of triples (context collapse) limits the fidelity of the knowledge representation. We introduce PropRAG, a novel RAG framework that shifts from triples to context‑rich propositions and introduces an efficient, LLM‑free online beam search over proposition paths to discover multi‑step reasoning chains. By coupling a higher‑fidelity knowledge representation with explicit path discovery, PropRAG achieves state‑of‑the‑art zero‑shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue, advancing non‑parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.

PaperID: 1367, https://arxiv.org/pdf/2504.18010.pdf GitHub

Authors: Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Yuhao Luo, Boyue Wang, Pei Li, Yen-Jung Chen, Jiancong Chen, Keke Long, Jiayi Meng, Yue Leng, Sikai Chen

Title: Sky-Drive: A Distributed Multi-Agent Simulation Platform for Human-AI Collaborative and Socially-Aware Future Transportation

Abstract:
Recent advances in autonomous system simulation platforms have significantly enhanced the safe and scalable testing of driving policies. However, existing simulators do not yet fully meet the needs of future transportation research‑particularly in enabling effective human‑AI collaboration and modeling socially‑aware driving agents. This paper introduces Sky‑Drive, a novel distributed multi‑agent simulation platform that addresses these limitations through four key innovations: (a) a distributed architecture for synchronized simulation across multiple terminals; (b) a multi‑modal human‑in‑the‑loop framework integrating diverse sensors to collect rich behavioral data; (c) a human‑AI collaboration mechanism supporting continuous and adaptive knowledge exchange; and (d) a digital twin framework for constructing high‑fidelity virtual replicas of real‑world transportation environments. Sky‑Drive supports diverse applications such as autonomous vehicle‑human road users interaction modeling, human‑in‑the‑loop training, socially‑aware reinforcement learning, personalized driving development, and customized scenario generation. Future extensions will incorporate foundation models for context‑aware decision support and hardware‑in‑the‑loop testing for real‑world validation. By bridging scenario generation, data collection, algorithm training, and hardware integration, Sky‑Drive has the potential to become a foundational platform for the next generation of human‑centered and socially‑aware autonomous transportation systems research. The demo video and code are available at:https://sky‑lab‑uw.github.io/Sky‑Drive‑website/

PaperID: 1368, https://arxiv.org/pdf/2504.17192.pdf GitHub

Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang

Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor‑intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high‑quality code. Inspired by this, we introduce PaperCoder, a multi‑agent LLM framework that transforms machine learning papers into operational code repositories. PaperCoder operates in three stages: planning, where it constructs a high‑level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation‑specific details; and generation, where modular, dependency‑aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model‑based and human evaluations, particularly from the authors of those papers, with author‑released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high‑quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going‑doer/Paper2Code.

PaperID: 1369, https://arxiv.org/pdf/2504.16939.pdf GitHub

Authors: Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Title: A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions

Abstract:
Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi‑turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next‑generation Conversational Agents ‑ what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human‑level intelligence. To that end, we systematically analyze LLM‑driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning ‑ logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor ‑ encompassing self‑awareness and user interaction monitoring, and (iii) Control ‑ focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long‑term multi‑turn reasoning skills, self‑evolution capabilities, collaborative and multi‑agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: https://github.com/emrecanacikgoz/awesome‑conversational‑agents.

PaperID: 1370, https://arxiv.org/pdf/2504.16728.pdf GitHub

Authors: Aniketh Garikaparthi, Manasi Patwardhan, Lovekesh Vig, Arman Cohan

Title: IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

Abstract:
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi‑agent frameworks and extending test‑time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human‑in‑the‑loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open‑source platform designed for researchers to leverage LLM‑assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test‑time compute expansion via Monte Carlo Tree Search (MCTS), fine‑grained feedback mechanism, and query‑based literature synthesis. Designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open‑source our code at https://github.com/Anikethh/IRIS‑Interactive‑Research‑Ideation‑System

PaperID: 1371, https://arxiv.org/pdf/2504.16563.pdf GitHub

Authors: Junjie Chen, Haitao Li, Jingli Yang, Yiqun Liu, Qingyao Ai

Title: Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution

Abstract:
Intelligent agent systems based on Large Language Models (LLMs) have shown great potential in real‑world applications. However, existing agent frameworks still face critical limitations in task planning and execution, restricting their effectiveness and generalizability. Specifically, current planning methods often lack clear global goals, leading agents to get stuck in local branches, or produce non‑executable plans. Meanwhile, existing execution mechanisms struggle to balance complexity and stability, and their limited action space restricts their ability to handle diverse real‑world tasks. To address these limitations, we propose GoalAct, a novel agent framework that introduces a continuously updated global planning mechanism and integrates a hierarchical execution strategy. GoalAct decomposes task execution into high‑level skills, including searching, coding, writing and more, thereby reducing planning complexity while enhancing the agents' adaptability across diverse task scenarios. We evaluate GoalAct on LegalAgentBench, a benchmark with multiple types of legal tasks that require the use of multiple types of tools. Experimental results demonstrate that GoalAct achieves state‑of‑the‑art (SOTA) performance, with an average improvement of 12.22% in success rate. These findings highlight GoalAct's potential to drive the development of more advanced intelligent agent systems, making them more effective across complex real‑world applications. Our code can be found at https://github.com/cjj826/GoalAct.

PaperID: 1372, https://arxiv.org/pdf/2504.16408.pdf GitHub

Authors: Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di

Title: LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Abstract:
The LLMSR@XLLM25 formulates a low‑resource structural reasoning task that challenges LLMs to generate interpretable, step‑by‑step rationales with minimal labeled data. We present Less is More, the third‑place winning approach in the LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi‑agent framework with reverse‑prompt induction, retrieval‑augmented reasoning synthesis via GPT‑4o, and dual‑stage reward‑guided filtering to distill high‑quality supervision across three subtasks: question parsing, CoT parsing, and step‑level verification. All modules are fine‑tuned from Meta‑Llama‑3‑8B‑Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few‑shot and zero‑shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low‑resource constraints. Our code is available at https://github.com/JhCircle/Less‑is‑More.

PaperID: 1373, https://arxiv.org/pdf/2504.16082.pdf GitHub

Authors: Ziqi Pang, Yu-Xiong Wang

Title: MR. Video: "MapReduce" is the Principle for Long Video Understanding

Abstract:
We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence‑to‑sequence vision‑language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents that typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence parallel perception of short video segments. Its Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), and integrating them into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state‑of‑the‑art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR‑Video

PaperID: 1374, https://arxiv.org/pdf/2504.15785.pdf GitHub

Authors: Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang

Title: WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

Abstract:
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training‑free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable codes to regulate LLM agents' policies. We further propose an RL‑free, model‑based agent "WALL‑E 2.0" through the model‑predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look‑ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. They together considerably improve learning efficiency in a new environment. On open‑world challenges in Mars (Minecraft like) and ALFWorld (embodied indoor environments), WALL‑E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%‑51.6% of success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.

PaperID: 1375, https://arxiv.org/pdf/2504.15257.pdf GitHub

Authors: Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang

Title: FlowReasoner: Reinforcing Query-Level Meta-Agents

Abstract:
This paper proposes a query‑level meta‑agent named FlowReasoner to automate the design of query‑level multi‑agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning‑based meta‑agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow the basic reasoning ability regarding the generation of multi‑agent systems to FlowReasoner. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi‑purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi‑agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1‑mini by 10.52% accuracy across three benchmarks. The code is available at https://github.com/sail‑sg/FlowReasoner.

PaperID: 1376, https://arxiv.org/pdf/2504.14894.pdf GitHub

Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Junhao Huang, Shuai Zhang, Yi Li

Title: Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions

Abstract:
This paper develops a novel unmanned surface vehicle (USV)‑autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high‑precision multi‑AUV localization enabled by Fisher information matrix‑optimized USV path planning, and (2) reinforcement learning‑based cooperative planning and control method for multi‑AUV task execution. Extensive experimental evaluations in the underwater data collection task demonstrate the system's operational feasibility, with quantitative results showing significant performance improvements over baseline methods. The proposed system exhibits robust coordination capabilities between USV and AUVs while maintaining stability in extreme sea conditions. To facilitate reproducibility and community advancement, we provide an open‑source simulation toolkit available at: https://github.com/360ZMEM/USV‑AUV‑colab .

PaperID: 1377, https://arxiv.org/pdf/2504.14787.pdf GitHub GitHub

Authors: Sirui Zeng, Xifeng Yan

Title: ADL: A Declarative Language for Agent-Based Chatbots

Abstract:
There are numerous frameworks capable of creating and orchestrating agents to address complex tasks. However, most of them highly coupled Python programming with agent declaration, making it hard for maintenance and runtime optimization. In this work, we introduce ADL, an agent declarative language for customer service chatbots. ADL abstracts away implementation details, offering a declarative way to define agents and their interactions, which could ease maintenance and debugging. It also incorporates natural language programming at its core to simplify the specification and communication of chatbot designs. ADL includes four basic types of agents and supports integration with custom functions, tool use, and third‑party agents. MICA, a multi‑agent system designed to interpret and execute ADL programs, has been developed and is now available as an open‑source project at https://github.com/Mica‑labs/MICA. Its documentation can be found at https://mica‑labs.github.io/.

PaperID: 1378, https://arxiv.org/pdf/2504.14625.pdf GitHub

Authors: Haiyan Qin, Jiahao Feng, Xiaotong Feng, Wei W. Xing, Wang Kang

Title: Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence

Abstract:
Large language models (LLMs) have transformed code generation, yet their application in hardware design produces gate counts 38%‑‑1075% higher than human designs. We present CircuitMind, a multi‑agent framework that achieves human‑competitive efficiency through three key innovations: syntax locking (constraining generation to basic logic gates), retrieval‑augmented generation (enabling knowledge‑driven design), and dual‑reward optimization (balancing correctness with efficiency). To evaluate our approach, we introduce TC‑Bench, the first gate‑level benchmark harnessing collective intelligence from the TuringComplete ecosystem ‑‑ a competitive circuit design platform with hundreds of thousands of players. Experiments show CircuitMind enables 55.6% of model implementations to match or exceed top‑tier human experts in composite efficiency metrics. Most remarkably, our framework elevates the 14B Phi‑4 model to outperform both GPT‑4o mini and Gemini 2.0 Flash, achieving efficiency comparable to the top 25% of human experts without requiring specialized training. These innovations establish a new paradigm for hardware optimization where collaborative AI systems leverage collective human expertise to achieve optimal circuit designs. Our model, data, and code are open‑source at https://github.com/BUAA‑CLab/CircuitMind.

PaperID: 1379, https://arxiv.org/pdf/2504.14603.pdf GitHub GitHub

Authors: Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Title: UFO2: The Desktop AgentOS

Abstract:
Recent Computer‑Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot‑based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system‑level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application‑specialized AppAgent equipped with native APIs, domain‑specific knowledge, and a unified GUI‑‑API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision‑based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi‑action planning, reducing per‑step LLM overhead. Finally, a Picture‑in‑Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across over 20 real‑world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user‑aligned desktop automation.

PaperID: 1380, https://arxiv.org/pdf/2504.14538.pdf GitHub

Authors: Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, Deqing Yang

Title: BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Abstract:
Recent advances in large language models (LLMs) have enabled social simulation through multi‑agent systems. Prior efforts focus on agent societies created from scratch, assigning agents with newly defined personas. However, simulating established fictional worlds and characters remain largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book‑based multi‑agent societies. BookWorld's design covers comprehensive real‑world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, e.t.c. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high‑quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: https://bookworld2025.github.io/.

PaperID: 1381, https://arxiv.org/pdf/2504.14440.pdf GitHub GitHub

Authors: Chuhao Liu, Zhijian Qiao, Jieqi Shi, Ke Wang, Peize Liu, Shaojie Shen

Title: SG-Reg: Generalizable and Efficient Scene Graph Registration

Abstract:
This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand‑crafted descriptors in classical semantic‑aided registration, or the ground‑truth annotation reliance in learning‑based scene graph registration, impede their application in practical real‑world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open‑set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse‑to‑fine manner. In the back‑end, we employ a robust pose estimator to decide transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and fewer communication bandwidth in multi‑agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground‑truth semantic annotations to generate data. We validate our method in a two‑agent SLAM benchmark. It significantly outperforms the hand‑crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: \hrefhttp://github.com/HKUST‑Aerial‑Robotics/SG‑Reghttp://github.com/HKUST‑Aerial‑Robotics/SG‑Reg.

PaperID: 1382, https://arxiv.org/pdf/2504.14411.pdf GitHub GitHub

Authors: Xiang Zhang, Yongfeng Zhang

Title: Planet as a Brain: Towards Internet of AgentSites based on AIOS Server

Abstract:
The internet is undergoing a historical transformation from the "Internet of Websites" to the "Internet of AgentSites." While traditional Websites served as the foundation for information hosting and dissemination, a new frontier is emerging where AgentSites serve as the hubs of the internet, where each AgentSite hosts one or more AI agents that receive tasks, address them, and deliver actionable solutions, marking a significant shift in the digital landscape and representing the next generation of online ecosystems. Under this vision, AIOS, the AI Agent Operating System, serves as the server for the development, deployment and execution of AI agents, which is a fundamental infrastructure for the Internet of Agentsites. In this paper, we introduce AIOS Server, a runtime framework to host agents and enable global‑scale collaboration among decentralized agents. AIOS Server provides a communication protocol leveraging the Model Context Protocol (MCP) and JSON‑RPC to enable agent‑agent or human‑agent interactions. Each AIOS node operates as a server to host and execute agents, while supporting peer‑to‑peer coordination without reliance on centralized orchestration. Based on AIOS Server, we further present the world's first practically deployed Internet of Agentsites (AIOS‑IoA), including AgentHub for agent registration and discovery and AgentChat for interactive communication, at https://planet.aios.foundation. The agent discovery mechanism based on Distributed Hash Tables (DHT) and a Gossip protocol serves as the search engine for the internet of agentsites. This work provides a practical foundation for building the Internet of Agentsites‑a new paradigm where autonomous agents become first‑class citizens of the web. The implementation is available at https://github.com/agiresearch/AIOS.Server and is integrated into the AIOS main branch at https://github.com/agiresearch/AIOS.

PaperID: 1383, https://arxiv.org/pdf/2504.13861.pdf GitHub

Authors: Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko, Galina Zubkova, Pavel Blinov, Andrey Savchenko

Title: 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

Abstract:
Though Large Vision‑Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real‑world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi‑agent Dialogue Benchmark), an open‑source framework for simulating and evaluating LVLM‑driven telemedical consultations. 3MDBench simulates patient variability through temperament‑based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real‑world telemedicine interactions, combining textual and image‑based data. The experimental study compares diagnostic strategies for widely used open and closed‑source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non‑dialogue settings, highlighting the importance of context‑aware, information‑seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.

PaperID: 1384, https://arxiv.org/pdf/2504.13805.pdf GitHub

Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng

Title: LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

Abstract:
Mobile GUI agents show promise in automating tasks but face generalization challenges in diverse real‑world scenarios. Traditional approaches using pre‑training or fine‑tuning with massive datasets struggle with the diversity of mobile applications and user‑specific tasks. We propose enhancing mobile GUI agent capabilities through human demonstrations, focusing on improving performance in unseen scenarios rather than pursuing universal generalization through larger datasets. To realize this paradigm, we introduce LearnGUI, the first comprehensive dataset specifically designed for studying demonstration‑based learning in mobile GUI agents, comprising 2,252 offline tasks and 101 online tasks with high‑quality human demonstrations. We further develop LearnAct, a sophisticated multi‑agent framework that automatically extracts knowledge from demonstrations to enhance task completion. This framework integrates three specialized agents: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration‑enhanced task execution. Our experimental results show significant performance gains in both offline and online evaluations. In offline assessments, a single demonstration improves model performance, increasing Gemini‑1.5‑Pro's accuracy from 19.3% to 51.7%. In online evaluations, our framework enhances UI‑TARS‑7B‑SFT's task success rate from 18.1% to 32.8%. LearnAct framework and LearnGUI benchmark establish demonstration‑based learning as a promising direction for more adaptable, personalized, and deployable mobile GUI agents.

PaperID: 1385, https://arxiv.org/pdf/2504.13650.pdf GitHub

Authors: Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

Title: EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

Abstract:
Medical Large Vision‑Language Models (Med‑LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse‑grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high‑quality, multi‑modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine‑grained, region‑specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with the tailored dataset, benchmark and model: First, we construct a multi‑agent data engine with real‑life ophthalmology data to produce Eyecare‑100K, a high‑quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare‑Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop the EyecareGPT, optimized for fine‑grained ophthalmic visual understanding thoroughly, which incorporates an adaptive resolution mechanism and a layer‑wise dense connector. Extensive experimental results indicate that the EyecareGPT achieves state‑of‑the‑art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.

PaperID: 1386, https://arxiv.org/pdf/2504.13406.pdf GitHub

Authors: Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu

Title: LangCoop: Collaborative Driving with Language

Abstract:
Multi‑agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi‑agent communication approaches are hindered by limitations of existing communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter‑agent communication. LangCoop features two key innovations: Mixture Model Modular Chain‑of‑thought (M^3CoT) for structured zero‑shot vision‑language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language‑based messages. Through extensive experiments conducted in the CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96% reduction in communication bandwidth (< 2KB per message) compared to image‑based communication, while maintaining competitive driving performance in the closed‑loop evaluation. Our project page and code are at https://xiangbogaobarry.github.io/LangCoop/.

PaperID: 1387, https://arxiv.org/pdf/2504.13399.pdf GitHub

Authors: Shashank Shriram, Srinivasa Perisetla, Aryan Keskar, Harsha Krishnaswamy, Tonko Emil Westerhof Bossen, Andreas Møgelmose, Ross Greer

Title: Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Abstract:
Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out‑of‑label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision‑language reasoning with zero‑shot object detection to improve hazard identification and explanation. Our pipeline consists of a Vision‑Language Model (VLM), a Large Language Model (LLM), in order to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge‑of‑Out‑of‑Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of hazard detection and labeling evaluation on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large‑scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision‑language‑based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git

PaperID: 1388, https://arxiv.org/pdf/2504.13079.pdf GitHub

Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Title: Retrieval-Augmented Generation with Conflicting Evidence

Abstract:
Large language model (LLM) agents are increasingly employing retrieval‑augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM‑RAG, a multi‑agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM‑RAG using both closed and open‑source models on AmbigDocs ‑‑ which requires presenting all valid answers for ambiguous queries ‑‑ improving over strong RAG baselines by up to 11.40% and on FaithEval ‑‑ which requires suppressing misinformation ‑‑ where we improve by up to 15.80% (absolute) with Llama3.3‑70B‑Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3‑70B‑Instruct only obtains 32.60 exact match score). While MADAM‑RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.

PaperID: 1389, https://arxiv.org/pdf/2504.12961.pdf GitHub

Authors: Yuanjun Li, Zhouyang Jiang, Bin Zhang, Mingchao Zhang, Junhao Zhao, Zhiwei Xu

Title: QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

Abstract:
Credit assignment remains a fundamental challenge in multi agent reinforcement learning (MARL) and is commonly addressed through value decomposition under the centralized training with decentralized ex ecution (CTDE) paradigm. However, existing value decomposition meth ods typically rely on predefined mixing networks that require additional training, often leading to imprecise credit attribution and limited in terpretability. We propose QLLM, a novel framework that leverages large language models (LLMs) to construct training‑free credit assign ment functions (TFCAFs), where the TFCAFs are nonlinear with re spect to the global state and offer enhanced interpretability while intro ducing no extra learnable parameters. A coder‑evaluator framework is employed to ensure the correctness and executability of the generated code. Extensive experiments on standard MARL benchmarks demon strate that QLLM consistently outperforms baselines while requiring fewer learnable parameters. Furthermore, it demonstrates generalization across a broad set of value decomposition algorithms. Code is available at https://github.com/MaoMaoLYJ/pymarl‑qllm.

PaperID: 1390, https://arxiv.org/pdf/2504.12696.pdf GitHub

Authors: Naibang Wang, Deyong Shang, Yan Gong, Xiaoxi Hu, Ziying Song, Lei Yang, Yuhan Huang, Xiaoyu Wang, Jianli Lu

Title: Collaborative Perception Datasets for Autonomous Driving: A Review

Abstract:
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi‑agent information fusion. With the advancement of Vehicle‑to‑Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi‑dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative‑Perception‑Datasets‑for‑Autonomous‑Driving.

PaperID: 1391, https://arxiv.org/pdf/2504.12330.pdf GitHub

Authors: Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma

Title: HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Abstract:
While Retrieval‑Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single‑agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM‑RAG, a novel Hierarchical Multi‑agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph‑based data. The framework is composed of three‑tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub‑tasks via semantic‑aware query rewriting and schema‑guided context augmentation; Multi‑source Retrieval Agents that carry out parallel, modality‑specific retrieval using plug‑and‑play modules designed for vector, graph, and web‑based databases; and a Decision Agent that uses consistency voting to integrate multi‑source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph‑relational, and web‑derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM‑RAG establishes state‑of‑the‑art results in zero‑shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean‑luna/HMRAG.

PaperID: 1392, https://arxiv.org/pdf/2504.12110.pdf GitHub

Authors: Chia Hsiang Kao, Wenting Zhao, Shreelekha Revankar, Samuel Speas, Snehal Bhagat, Rajeev Datta, Cheng Perng Phoo, Utkarsh Mall, Carl Vondrick, Kavita Bala, Bharath Hariharan

Title: Towards LLM Agents for Earth Observation

Abstract:
Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation? We introduce \datasetnamenospace, a benchmark of 140 yes/no questions from NASA Earth Observatory articles across 13 topics and 17 satellite sensors. Using Google Earth Engine API as a tool, LLM agents can only achieve an accuracy of 33% because the code fails to run over 58% of the time. We improve the failure rate for open models by fine‑tuning synthetic data, allowing much smaller models (Llama‑3.1‑8B) to achieve comparable accuracy to much larger ones (e.g., DeepSeek‑R1). Taken together, our findings identify significant challenges to be solved before AI agents can automate earth observation, and suggest paths forward. The project page is available at https://iandrover.github.io/UnivEarth.

PaperID: 1393, https://arxiv.org/pdf/2504.11754.pdf GitHub

Authors: Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang

Title: GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision

Abstract:
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two‑stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object‑centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real‑world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.

PaperID: 1394, https://arxiv.org/pdf/2504.09772.pdf GitHub

Authors: Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, Tong Che

Title: Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Abstract:
Multi‑agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real‑world tasks that single‑agent systems often struggle to manage. While recent advancements in test‑time scaling (TTS) have significantly improved single‑agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi‑agent framework designed to enhance collaborative reasoning through both model‑level training and system‑level coordination. We construct M500, a high‑quality dataset containing 500 multi‑agent collaborative reasoning traces, and fine‑tune Qwen2.5‑32B‑Instruct on this dataset to produce M1‑32B, a model optimized for multi‑agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem‑solving. Evaluated in an open‑source MAS across a range of tasks‑including general understanding, mathematical reasoning, and coding‑our system significantly outperforms strong baselines. For instance, M1‑32B achieves 12% improvement on GPQA‑Diamond, 41% on AIME2024, and 10% on MBPP‑Sanitized, matching the performance of state‑of‑the‑art models like DeepSeek‑R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi‑agent reasoning. Code is available at https://github.com/jincan333/MAS‑TTS

PaperID: 1395, https://arxiv.org/pdf/2504.09689.pdf GitHub

Authors: Jiahao Qiu, Yinghui He, Xinzhe Juan, Yimin Wang, Yuhan Liu, Zixin Yao, Yue Wu, Xun Jiang, Ling Yang, Mengdi Wang

Title: EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety

Abstract:
The rise of LLM‑driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi‑agent AI framework designed to evaluate and mitigate mental health hazards in human‑AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ‑9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character‑based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI‑human interactions. Our code is available at: https://github.com/1akaman/EmoAgent

PaperID: 1396, https://arxiv.org/pdf/2504.09602.pdf GitHub

Authors: Zhehao Dong, Zhen Lu, Yue Yang

Title: Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations

Abstract:
Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain‑specific LLM adaptation. By fine‑tuning Qwen2.5‑7B‑Instruct on NL2FOAM, our custom dataset of 28716 natural language‑to‑OpenFOAM configuration pairs with chain‑of‑thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi‑agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state‑of‑the‑art performance, achieving 88.7% solution accuracy and 82.6% first‑attempt success rate. This significantly outperforms larger general‑purpose models like Qwen2.5‑72B‑Instruct, DeepSeek‑R1, and Llama3.3‑70B‑Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain‑specific adaptation in deploying LLM assistants for complex engineering workflows. Our code and fine‑tuned model have been deposited at https://github.com/YYgroup/AutoCFD.

PaperID: 1397, https://arxiv.org/pdf/2504.08810.pdf GitHub GitHub

Authors: Zheyuan Lai, Yingming Pu

Title: PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration

Abstract:
Complex chemical space and limited knowledge scope with biases holds immense challenge for human scientists, yet in automated materials discovery. Existing intelligent methods relies more on numerical computation, leading to inefficient exploration and results with hard‑interpretability. To bridge this gap, we introduce a principles‑guided material discovery system powered by language inferential multi‑agent system (MAS), namely PriM. Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, the case study of nano helix demonstrates higher materials exploration rate and property value while providing transparent reasoning pathways. This approach develops an automated‑and‑transparent paradigm for material discovery, with broad implications for rational design of functional materials. Code is publicly available at our \hrefhttps://github.com/amair‑lab/PriMGitHub.

PaperID: 1398, https://arxiv.org/pdf/2504.08621.pdf GitHub

Authors: Tao Zhang, Zhenhai Liu, Yong Xin, Yongjun Jiao

Title: MooseAgent: A LLM Based Multi-agent Framework for Automating Moose Simulation

Abstract:
The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre‑processing, solver configuration, and post‑processing stages are often time‑consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi‑physics simulation framework MOOSE, which combines large‑scale pre‑trained language models (LLMs) with a multi‑agent system. The framework uses LLMs to understand user‑described simulation requirements in natural language and employs task decomposition and multi‑round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi‑physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single‑physics problems. The main contribution of this research is the proposal of a multi‑agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open‑sourced and is available at https://github.com/taozhan18/MooseAgent

PaperID: 1399, https://arxiv.org/pdf/2504.08525.pdf GitHub GitHub

Authors: Ye Ye

Title: Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

Abstract:
Large Language Models (LLMs) are increasingly used as autonomous agents for multi‑step tasks. However, most existing frameworks fail to maintain a structured understanding of the task state, often relying on linear prompt concatenation or shallow memory buffers. This leads to brittle performance, frequent hallucinations, and poor long‑range coherence. In this work, we propose the Task Memory Engine (TME), a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). Each node in the tree corresponds to a task step, storing relevant input, output, status, and sub‑task relationships. We introduce a prompt synthesis method that dynamically generates LLM prompts based on the active node path, significantly improving execution consistency and contextual grounding. Through case studies and comparative experiments on multi‑step agent tasks, we demonstrate that TME leads to better task completion accuracy and more interpretable behavior with minimal implementation overhead. A reference implementation of the core TME components is available at https://github.com/biubiutomato/TME‑Agent, including basic examples and structured memory integration. While the current implementation uses a tree‑based structure, TME is designed to be graph‑aware, supporting reusable substeps, converging task paths, and shared dependencies. This lays the groundwork for future DAG‑based memory architectures.

PaperID: 1400, https://arxiv.org/pdf/2504.06766.pdf GitHub GitHub

Authors: Yuxin Wang, Yiran Guo, Yining Zheng, Zhangyue Yin, Shuo Chen, Jie Yang, Jiajun Chen, Yuan Li, Xuanjing Huang, Xipeng Qiu

Title: FamilyTool: A Multi-hop Personalized Tool Use Benchmark

Abstract:
The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real‑world personalized scenarios, particularly those requiring multi‑hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family‑based knowledge graph (KG) that simulates personalized, multi‑hop tool use scenarios. FamilyTool, including base and extended datasets, challenges LLMs with queries spanning from 1 to 4 relational hops (e.g., inferring familial connections and preferences) and 2 to 6 hops respectively, and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re‑training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG‑augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state‑of‑the‑art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real‑world contexts and highlight the urgent need for advancements in tool‑learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at \hrefhttps://github.com/yxzwang/FamilyToolhttps://github.com/yxzwang/FamilyTool.

PaperID: 1401, https://arxiv.org/pdf/2504.06260.pdf GitHub

Authors: Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, Peter Norgaard

Title: FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Abstract:
Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end‑to‑end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics^\circledR, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench

PaperID: 1402, https://arxiv.org/pdf/2504.05370.pdf GitHub

Authors: Xueqiao Zhang, Chao Zhang, Jianwen Sun, Jun Xiao, Yi Yang, Yawei Luo

Title: EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design

Abstract:
Large Language Models (LLMs) have significantly advanced smart education in the Artificial General Intelligence (AGI) era. A promising application lies in the automatic generalization of instructional design for curriculum and learning activities, focusing on two key aspects: (1) Customized Generation: generating niche‑targeted teaching content based on students' varying learning abilities and states, and (2) Intelligent Optimization: iteratively optimizing content based on feedback from learning effectiveness or test scores. Currently, a single large LLM cannot effectively manage the entire process, posing a challenge for designing intelligent teaching plans. To address these issues, we developed EduPlanner, an LLM‑based multi‑agent system comprising an evaluator agent, an optimizer agent, and a question analyst, working in adversarial collaboration to generate customized and intelligent instructional design for curriculum and learning activities. Taking mathematics lessons as our example, EduPlanner employs a novel Skill‑Tree structure to accurately model the background mathematics knowledge of student groups, personalizing instructional design for curriculum and learning activities according to students' knowledge levels and learning abilities. Additionally, we introduce the CIDDP, an LLM‑based five‑dimensional evaluation module encompassing clarity, Integrity, Depth, Practicality, and Pertinence, to comprehensively assess mathematics lesson plan quality and bootstrap intelligent optimization. Experiments conducted on the GSM8K and Algebra datasets demonstrate that EduPlanner excels in evaluating and optimizing instructional design for curriculum and learning activities. Ablation studies further validate the significance and effectiveness of each component within the framework. Our code is publicly available at https://github.com/Zc0812/Edu_Planner

PaperID: 1403, https://arxiv.org/pdf/2504.04808.pdf GitHub

Authors: Tengjun Jin, Yuxuan Zhu, Daniel Kang

Title: ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

Abstract:
Practitioners are increasingly turning to Extract‑Load‑Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI‑based methods, which have shown strong capabilities in data tasks, such as text‑to‑SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end‑to‑end ELT pipelines. To fill this gap, we introduce ELT‑Bench, an end‑to‑end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT‑Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT‑Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider‑Agent and SWE‑Agent, using six popular Large Language Models (LLMs) on ELT‑Bench. The highest‑performing agent, Spider‑Agent Claude‑3.7‑Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of 4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT‑Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc‑kang‑lab/ELT‑Bench.

PaperID: 1404, https://arxiv.org/pdf/2504.04310.pdf GitHub

Authors: Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang

Title: CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Abstract:
Although LLM‑based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint‑intensive problems ‑‑ a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO‑Bench, a benchmark suite featuring 36 real‑world CO problems drawn from a broad range of domains and complexity levels. CO‑Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human‑designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO‑Bench is publicly available at https://github.com/sunnweiwei/CO‑Bench.

PaperID: 1405, https://arxiv.org/pdf/2504.03160.pdf GitHub

Authors: Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, Pengfei Liu

Title: DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Abstract:
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering‑based) with brittle performance or reinforcement learning within controlled Retrieval‑Augmented Generation (RAG) environments (RAG‑based) that fail to capture the complexities of real‑world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end‑to‑end training of LLM‑based deep research agents through scaling reinforcement learning (RL) in real‑world environments with authentic web search interactions. Unlike RAG‑based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi‑agent architecture where browsing agents extract relevant information from various webpage structures and overcoming significant technical challenges. Extensive experiments on open‑domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering‑based baselines and up to 7.2 points over RAG‑based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end‑to‑end RL training, including the ability to formulate plans, cross‑validate information from multiple sources, engage in self‑reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end‑to‑end training in real‑world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real‑world applications. We release DeepResearcher at https://github.com/GAIR‑NLP/DeepResearcher.

PaperID: 1406, https://arxiv.org/pdf/2504.02450.pdf GitHub

Authors: Jingyi Wang, Duanfeng Chu, Zejian Deng, Liping Lu, Jinxiang Wang, Chen Sun

Title: CHARMS: A Cognitive Hierarchical Agent for Reasoning and Motion Stylization in Autonomous Driving

Abstract:
To address the challenge of insufficient interactivity and behavioral diversity in autonomous driving decision‑making, this paper proposes a Cognitive Hierarchical Agent for Reasoning and Motion Stylization (CHARMS). By leveraging Level‑k game theory, CHARMS captures human‑like reasoning patterns through a two‑stage training pipeline comprising reinforcement learning pretraining and supervised fine‑tuning. This enables the resulting models to exhibit diverse and human‑like behaviors, enhancing their decision‑making capacity and interaction fidelity in complex traffic environments. Building upon this capability, we further develop a scenario generation framework that utilizes the Poisson cognitive hierarchy theory to control the distribution of vehicles with different driving styles through Poisson and binomial sampling. Experimental results demonstrate that CHARMS is capable of both making intelligent driving decisions as an ego vehicle and generating diverse, realistic driving scenarios as environment vehicles. The code for CHARMS is released at https://github.com/chuduanfeng/CHARMS.

PaperID: 1407, https://arxiv.org/pdf/2504.00906.pdf GitHub

Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang

Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Abstract:
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open‑ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long‑horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture‑of‑Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state‑of‑the‑art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI‑TARS on the OSWorld 15‑step and 50‑step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular‑ai/Agent‑S.

PaperID: 1408, https://arxiv.org/pdf/2503.24260.pdf GitHub

Authors: Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Sizhe Wang, Zhiyu Li, Feiyu Xiong, Wentao Zhang

Title: MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

Abstract:
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real‑world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi‑agent collaboration to systematically enhance cohesion, reduce coupling, achieving clear responsibility boundaries and better maintainability. We also introduce MaintainCoder, a benchmark comprising requirement changes and novel dynamic metrics on maintenance efforts. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves dynamic maintainability metrics by more than 60% with even higher correctness of initial codes. Furthermore, while static metrics fail to accurately reflect maintainability and even contradict each other, our proposed dynamic metrics exhibit high consistency. Our work not only provides the foundation for maintainable code generation, but also highlights the need for more realistic and comprehensive code generation research. Resources: https://github.com/IAAR‑Shanghai/MaintainCoder.

PaperID: 1409, https://arxiv.org/pdf/2503.24115.pdf GitHub

Authors: Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang

Title: TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Abstract:
The detection of telecom fraud faces significant challenges due to the lack of high‑quality multimodal training data that integrates audio signals with reasoning‑oriented textual analysis. To address this gap, we present TeleAntiFraud‑28k, the first open‑source audio‑text slow‑thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy‑preserved text‑truth sample generation using automatically speech recognition (ASR)‑transcribed call recordings (with anonymized original audio), ensuring real‑world consistency through text‑to‑speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)‑based self‑instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi‑agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech‑text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, fraud type classification. Furthermore, we construct TeleAntiFraud‑Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production‑optimized supervised fine‑tuning (SFT) model trained on hybrid real/synthetic data, while open‑sourcing the data processing framework to enable community‑driven dataset expansion. This work establishes a foundational framework for multimodal anti‑fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.

PaperID: 1410, https://arxiv.org/pdf/2503.23626.pdf GitHub

Authors: Anirudh Satheesh, Keenan Powell

Title: A Constrained Multi-Agent Reinforcement Learning Approach to Autonomous Traffic Signal Control

Abstract:
Traffic congestion in modern cities is exacerbated by the limitations of traditional fixed‑time traffic signal systems, which fail to adapt to dynamic traffic patterns. Adaptive Traffic Signal Control (ATSC) algorithms have emerged as a solution by dynamically adjusting signal timing based on real‑time traffic conditions. However, the main limitation of such methods is that they are not transferable to environments under real‑world constraints, such as balancing efficiency, minimizing collisions, and ensuring fairness across intersections. In this paper, we view the ATSC problem as a constrained multi‑agent reinforcement learning (MARL) problem and propose a novel algorithm named Multi‑Agent Proximal Policy Optimization with Lagrange Cost Estimator (MAPPO‑LCE) to produce effective traffic signal control policies. Our approach integrates the Lagrange multipliers method to balance rewards and constraints, with a cost estimator for stable adjustment. We also introduce three constraints on the traffic network: GreenTime, GreenSkip, and PhaseSkip, which penalize traffic policies that do not conform to real‑world scenarios. Our experimental results on three real‑world datasets demonstrate that MAPPO‑LCE outperforms three baseline MARL algorithms by across all environments and traffic constraints (improving on MAPPO by 12.60%, IPPO by 10.29%, and QTRAN by 13.10%). Our results show that constrained MARL is a valuable tool for traffic planners to deploy scalable and efficient ATSC methods in real‑world traffic networks. We provide code at https://github.com/Asatheesh6561/MAPPO‑LCE.

PaperID: 1411, https://arxiv.org/pdf/2503.23459.pdf GitHub

Authors: Chenglong Lu, Shen Liang, Xuewei Wang, Wei Wang

Title: Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach

Abstract:
Vision Transformers (ViTs) have computational costs scaling quadratically with the number of tokens, calling for effective token pruning policies. Most existing policies are handcrafted, lacking adaptivity to varying inputs. Moreover, they fail to consider the sequential nature of token pruning across multiple layers. In this work, for the first time (as far as we know), we exploit Reinforcement Learning (RL) to data‑adaptively learn a pruning policy. Formulating token pruning as a sequential decision‑making problem, we model it as a Markov Game and utilize Multi‑Agent Proximal Policy Optimization (MAPPO) where each agent makes an individualized pruning decision for a single token. We also develop reward functions that enable simultaneous collaboration and competition of these agents to balance efficiency and accuracy. On the well‑known ImageNet‑1k dataset, our method improves the inference speed by up to 44% while incurring only a negligible accuracy drop of 0.4%. The source code is available at https://github.com/daashuai/rl4evit.

PaperID: 1412, https://arxiv.org/pdf/2503.23452.pdf GitHub

Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha

Title: VideoGen-Eval: Agent-based System for Video Generation Evaluation

Abstract:
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state‑of‑the‑art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out‑of‑Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen‑Eval, an agent evaluation system that integrates LLM‑based content structuring, MLLM‑based content judgment, and patch tools designed for temporal‑dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting‑edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content‑rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, among them, 8 cutting‑edge models are selected as quantitative evaluation for the agent and human. Extensive experiments validate that our proposed agent‑based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, as well as the diversity and richness of the benchmark.

PaperID: 1413, https://arxiv.org/pdf/2503.23145.pdf GitHub

Authors: Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken

Title: CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Abstract:
Inductive program synthesis, or programming by example, requires synthesizing functions from input‑output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held‑out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real‑world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self‑correction based on feedback. We construct the first large‑scale benchmark for general‑purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3‑mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine‑tuning LLaMA‑3.1‑8B‑Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM‑based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang‑Wei/CodeARC

PaperID: 1414, https://arxiv.org/pdf/2503.23037.pdf GitHub

Authors: Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, Kees Joost Batenburg

Title: Agentic Large Language Models, a survey

Abstract:
Background: There is great interest in agentic LLMs, large language models that act as agents. Objectives: We review the growing body of work in this area and provide a research agenda. Methods: Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. Results: The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi‑agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi‑agent collaboration, and reasoning benefits all categories. Conclusions: We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self‑reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs provide a solution for the problem of LLMs running out of training data: inference‑time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world‑safety, liability and security are open problems‑while agentic LLMs are also likely to benefit society.

PaperID: 1415, https://arxiv.org/pdf/2503.22678.pdf GitHub

Authors: Mohammad Almansoori, Komal Kumar, Hisham Cholakkal

Title: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

Abstract:
In this work, we introduce MedAgentSim, an open‑source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi‑turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X‑ray) from a measurement agent to mimic the real‑world diagnostic process. Additionally, we incorporate self improvement mechanisms that allow models to iteratively refine their diagnostic strategies. We enhance LLM performance in our simulated setting by integrating multi‑agent discussions, chain‑of‑thought reasoning, and experience‑based knowledge retrieval, facilitating progressive learning as doctor agents interact with more patients. We also introduce an evaluation benchmark for assessing the LLM's ability to engage in dynamic, context‑aware diagnostic interactions. While MedAgentSim is fully automated, it also supports a user‑controlled mode, enabling human interaction with either the doctor or patient agent. Comprehensive evaluations in various simulated diagnostic scenarios demonstrate the effectiveness of our approach. Our code, simulation tool, and benchmark are available at \hrefhttps://medagentsim.netlify.app/.

PaperID: 1416, https://arxiv.org/pdf/2503.21460.pdf GitHub GitHub

Authors: Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang

Title: Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Abstract:
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal‑driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology‑centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo‑junyu/Awesome‑Agent‑Papers.

PaperID: 1417, https://arxiv.org/pdf/2503.21080.pdf GitHub

Authors: Yunbo Long, Yuhan Liu, Liming Xu, Alexandra Brintrup

Title: EmoDebt: Bayesian-Optimized Emotional Intelligence for Strategic Agent-to-Agent Debt Recovery

Abstract:
The emergence of autonomous Large Language Model (LLM) agents has created a new ecosystem of strategic, agent‑to‑agent interactions. However, a critical challenge remains unaddressed: in high‑stakes, emotion‑sensitive domains like debt collection, LLM agents pre‑trained on human dialogue are vulnerable to exploitation by adversarial counterparts who simulate negative emotions to derail negotiations. To fill this gap, we first contribute a novel dataset of simulated debt recovery scenarios and a multi‑agent simulation framework. Within this framework, we introduce EmoDebt, an LLM agent architected for robust performance. Its core innovation is a Bayesian‑optimized emotional intelligence engine that reframes a model's ability to express emotion in negotiation as a sequential decision‑making problem. Through online learning, this engine continuously tunes EmoDebt's emotional transition policies, discovering optimal counter‑strategies against specific debtor tactics. Extensive experiments on our proposed benchmark demonstrate that EmoDebt achieves significant strategic robustness, substantially outperforming non‑adaptive and emotion‑agnostic baselines across key performance metrics, including success rate and operational efficiency. By introducing both a critical benchmark and a robustly adaptive agent, this work establishes a new foundation for deploying strategically robust LLM agents in adversarial, emotion‑sensitive debt interactions. The code is available at \textcolorbluehttps://github.com/Yunbo‑max/EmoDebt.

PaperID: 1418, https://arxiv.org/pdf/2503.19355.pdf GitHub

Authors: Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

Title: ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Abstract:
Spatio‑temporal reasoning is essential in understanding real‑world environments in various fields, eg, autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision‑Language Models (VLMs) by introducing large‑scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio‑temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit‑Bench. They consist of real‑world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter‑object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo‑labels using 4D reconstruction in real‑world scale. With our kinematic instruction tuning data for spatio‑temporal reasoning, we present ST‑VLM, a VLM enhanced for spatio‑temporal reasoning, which exhibits outstanding performance on STKit‑Bench. Furthermore, we show that ST‑VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio‑temporal benchmarks (eg, ActivityNet, TVQA+). Finally, by integrating learned spatio‑temporal reasoning with existing abilities, ST‑VLM enables complex multi‑step reasoning. Project page: https://ikodoh.github.io/ST‑VLM.

PaperID: 1419, https://arxiv.org/pdf/2503.18891.pdf GitHub

Authors: Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang

Title: AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

Abstract:
Multi‑agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem‑solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state‑of‑the‑art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.com/wangzx1219/AgentDropout.

PaperID: 1420, https://arxiv.org/pdf/2503.18813.pdf GitHub

Authors: Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr

Title: Defeating Prompt Injections by Design

Abstract:
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called. We demonstrate effectiveness of CaMeL by solving 77% of tasks with provable security (compared to 84% with an undefended system) in AgentDojo. We release CaMeL at https://github.com/google‑research/camel‑prompt‑injection.

PaperID: 1421, https://arxiv.org/pdf/2503.18665.pdf GitHub

Authors: Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li

Title: Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark

Abstract:
The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor‑intensive human annotations. To address these challenges, we propose Similar, a Step‑Wise Multi‑Dimensional Generalist Reward Model, which offers fine‑grained signals for agent training and can choose better action for inference‑time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS‑P algorithm to automatically collect and annotate step‑wise, five‑dimensional agent execution data. Using this data, we train Similar with the Triple‑M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step‑wise, multi‑dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step‑wise, multi‑dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference‑time scaling. The project is available at https://github.com/antgroup/Similar.

PaperID: 1422, https://arxiv.org/pdf/2503.18313.pdf GitHub

Authors: Changlun Li, Yao Shi, Yuyu Luo, Nan Tang

Title: Will LLMs be Professional at Fund Investment? DeepFund: A Live Arena Perspective

Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, but their effectiveness in financial decision‑making remains inadequately evaluated. Current benchmarks primarily assess LLMs' understanding on financial documents rather than the ability to manage assets or dig out trading opportunities in dynamic market conditions. Despite the release of new benchmarks for evaluating diversified tasks on the financial domain, we identified four major problems in these benchmarks, which are data leakage, navel‑gazing, over‑intervention, and maintenance‑hard. To pave the research gap, we introduce DeepFund, a comprehensive arena platform for evaluating LLM‑based trading strategies in a live environment. Our approach implements a multi‑agent framework where they serve as multiple key roles that realize the real‑world investment decision processes. Moreover, we provide a web interface that visualizes LLMs' performance with fund investment metrics across different market conditions, enabling detailed comparative analysis. Through DeepFund, we aim to provide a more realistic and fair assessment on LLM's capabilities in fund investment, offering diversified insights and revealing their potential applications in real‑world financial markets. Our code is publicly available at https://github.com/HKUSTDial/DeepFund.

PaperID: 1423, https://arxiv.org/pdf/2503.17733.pdf GitHub

Authors: Bin Fu, Jialin Li, Bin Zhang, Ruiping Wang, Xilin Chen

Title: GS-LTS: 3D Gaussian Splatting-Based Adaptive Modeling for Long-Term Service Robots

Abstract:
3D Gaussian Splatting (3DGS) has garnered significant attention in robotics for its explicit, high fidelity dense scene representation, demonstrating strong potential for robotic applications. However, 3DGS‑based methods in robotics primarily focus on static scenes, with limited attention to the dynamic scene changes essential for long‑term service robots. These robots demand sustained task execution and efficient scene updates‑challenges current approaches fail to meet. To address these limitations, we propose GS‑LTS (Gaussian Splatting for Long‑Term Service), a 3DGS‑based system enabling indoor robots to manage diverse tasks in dynamic environments over time. GS‑LTS detects scene changes (e.g., object addition or removal) via single‑image change detection, employs a rule‑based policy to autonomously collect multi‑view observations, and efficiently updates the scene representation through Gaussian editing. Additionally, we propose a simulation‑based benchmark that automatically generates scene change data as compact configuration scripts, providing a standardized, user‑friendly evaluation benchmark. Experimental results demonstrate GS‑LTS's advantages in reconstruction, navigation, and superior scene updates‑faster and higher quality than the image training baseline‑advancing 3DGS for long‑term robotic operations. Code and benchmark are available at: https://vipl‑vsu.github.io/3DGS‑LTS.

PaperID: 1424, https://arxiv.org/pdf/2503.17703.pdf GitHub

Authors: Silvia Izquierdo-Badiola, Carlos Rizzo, Guillem Alenyà

Title: RAIDER: Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection, Explanation and Recovery

Abstract:
As robots increasingly operate in dynamic human‑centric environments, improving their ability to detect, explain, and recover from action‑related issues becomes crucial. Traditional model‑based and data‑driven techniques lack adaptability, while more flexible generative AI methods struggle with grounding extracted information to real‑world constraints. We introduce RAIDER, a novel agent that integrates Large Language Models (LLMs) with grounded tools for adaptable and efficient issue detection and explanation. Using a unique "Ground, Ask&Answer, Issue" procedure, RAIDER dynamically generates context‑aware precondition questions and selects appropriate tools for resolution, achieving targeted information gathering. Our results within a simulated household environment surpass methods relying on predefined models, full scene descriptions, or standalone trained models. Additionally, RAIDER's explanations enhance recovery success, including cases requiring human interaction. Its modular architecture, featuring self‑correction mechanisms, enables straightforward adaptation to diverse scenarios, as demonstrated in a real‑world human‑assistive task. This showcases RAIDER's potential as a versatile agentic AI solution for robotic issue detection and explanation, while addressing the problem of grounding generative AI for its effective application in embodied agents. Project website: https://eurecat.github.io/raider‑llmagent/

PaperID: 1425, https://arxiv.org/pdf/2503.17671.pdf GitHub

Authors: Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, Rongrong Ji

Title: ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

Abstract:
ComfyUI is a popular workflow‑based interface that allows users to customize image generation tasks through an intuitive node‑based system. However, the complexity of managing node connections and diverse modules can be challenging for users. In this paper, we introduce ComfyGPT, a self‑optimizing multi‑agent system designed to generate ComfyUI workflows based on task descriptions automatically. The key innovations of ComfyGPT include: (1) consisting of four specialized agents to build a multi‑agent workflow generation system: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent; (2) focusing on generating precise node connections instead of entire workflows, improving generation accuracy; and (3) enhancing workflow generation through reinforcement learning. Moreover, we introduce FlowDataset, a large‑scale dataset containing 13,571 workflow‑description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. Additionally, we propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM‑based methods in workflow generation, making it a significant step forward in this field. Code is avaliable at https://github.com/comfygpt/comfygpt.

PaperID: 1426, https://arxiv.org/pdf/2503.17309.pdf GitHub GitHub

Authors: Kun Chu, Xufeng Zhao, Cornelius Weber, Stefan Wermter

Title: LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language

Abstract:
Bimanual robotic manipulation provides significant versatility, but also presents an inherent challenge due to the complexity involved in the spatial and temporal coordination between two hands. Existing works predominantly focus on attaining human‑level manipulation skills for robotic hands, yet little attention has been paid to task planning on long‑horizon timescales. With their outstanding in‑context learning and zero‑shot generation abilities, Large Language Models (LLMs) have been applied and grounded in diverse robotic embodiments to facilitate task planning. However, LLMs still suffer from errors in long‑horizon reasoning and from hallucinations in complex robotic tasks, lacking a guarantee of logical correctness when generating the plan. Previous works, such as LLM+P, extended LLMs with symbolic planners. However, none have been successfully applied to bimanual robots. New challenges inevitably arise in bimanual manipulation, necessitating not only effective task decomposition but also efficient task allocation. To address these challenges, this paper introduces LLM+MAP, a bimanual planning framework that integrates LLM reasoning and multi‑agent planning, automating effective and efficient bimanual task planning. We conduct simulated experiments on various long‑horizon manipulation tasks of differing complexity. Our method is built using GPT‑4o as the backend, and we compare its performance against plans generated directly by LLMs, including GPT‑4o, V3 and also recent strong reasoning models o1 and R1. By analyzing metrics such as planning time, success rate, group debits, and planning‑step reduction rate, we demonstrate the superior performance of LLM+MAP, while also providing insights into robotic reasoning. Code is available at https://github.com/Kchu/LLM‑MAP.

PaperID: 1427, https://arxiv.org/pdf/2503.16492.pdf GitHub

Authors: Yuzhi Lai, Shenghai Yuan, Peizheng Li, Boya Zhang, Benjamin Kiefer, Tianchen Deng, Andreas Zell

Title: FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

Abstract:
ffective Human‑Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real‑world robotics applications. However, existing solutions often rely on gesture‑ only or language‑only commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM‑HRI, an efficient multimodal framework for HRI that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real‑time multimodal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines the gaze fixation time interval, reducing noise caused by the gaze dynamic nature. Experimental evaluations demonstrate that FAM‑HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments. To support the community, we have released our system design, algorithms, and solutions at https://github.com/laiyuzhi/FAM‑HRI.

PaperID: 1428, https://arxiv.org/pdf/2503.16408.pdf GitHub

Authors: Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai

Title: RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

Abstract:
Designing effective embodied multi‑agent systems is critical for solving complex real‑world tasks across domains. Due to the complexity of multi‑agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi‑agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi‑agent systems and introduce the first benchmark for embodied multi‑agent manipulation, RoboFactory. Based on RoboFactory benchmark, we adapt and evaluate the method of imitation learning and analyzed its performance in different difficulty agent tasks. Furthermore, we explore the architectures and training strategies for multi‑agent imitation learning, aiming to build safe and efficient embodied multi‑agent systems.

PaperID: 1429, https://arxiv.org/pdf/2503.16191.pdf GitHub

Authors: Yinon Goldshtein, Gal Perelman, Assaf Schuster, Avi Ostfeld

Title: Large Language Models for Water Distribution Systems Modeling and Decision-Making

Abstract:
The integration of Large Language Models (LLMs) into engineering workflows presents new opportunities for making computational tools more accessible. Especially where such tools remain underutilized due to technical or expertise barriers, such as water distribution system (WDS) management. This study introduces LLM‑EPANET, an agent‑based framework that enables natural language interaction with EPANET, the benchmark WDS simulator. The framework combines retrieval‑augmented generation and multi‑agent orchestration to automatically translate user queries into executable code, run simulations, and return structured results. A curated set of 69 benchmark queries is introduced to evaluate performance across state‑of‑the‑art LLMs. Results show that LLMs can effectively support a wide range of modeling tasks, achieving 56‑81% accuracy overall, and over 90% for simpler queries. These findings highlight the potential of LLM‑based modeling to democratize data‑driven decision‑making in the water sector through transparent, interactive AI interfaces. The framework code and benchmark queries are shared as an open resource: https://github.com/yinon‑gold/LLMs‑in‑WDS‑Modeling.

PaperID: 1430, https://arxiv.org/pdf/2503.15272.pdf GitHub

Authors: David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal

Title: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

Abstract:
Multi‑agent collaboration among models has shown promise in reasoning tasks but is underexplored in long‑form generation tasks like summarization and question‑answering. We extend multi‑agent multi‑model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model‑generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi‑agent (multiple instances) and multi‑model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi‑agent performance. We consolidate these insights into a final "recipe" called Multi‑Agent Multi‑Model Refinement (MAMM‑Refine), where multi‑agent and multi‑model collaboration significantly boosts performance on three summarization datasets as well as on long‑form question answering, demonstrating the effectiveness and generalizability of our recipe.

PaperID: 1431, https://arxiv.org/pdf/2503.15235.pdf GitHub

Authors: Chentian Wei, Jiewei Chen, Jinzhu Xu

Title: Exploring Large Language Models for Word Games:Who is the Spy?

Abstract:
Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule‑based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training‑free framework. "Shei Shi Wo Di" or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain‑of‑Thought (CoT)‑based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct‑wei/Who‑is‑The‑Spy.

PaperID: 1432, https://arxiv.org/pdf/2503.15049.pdf GitHub

Authors: Cheng Wang, Lingxin Kong, Massimiliano Tamborski, Stefano V. Albrecht

Title: HAD-Gen: Human-like and Diverse Driving Behavior Modeling for Controllable Scenario Generation

Abstract:
Simulation‑based testing has emerged as an essential tool for verifying and validating autonomous vehicles (AVs). However, contemporary methodologies, such as deterministic and imitation learning‑based driver models, struggle to capture the variability of human‑like driving behavior. Given these challenges, we propose HAD‑Gen, a general framework for realistic traffic scenario generation that simulates diverse human‑like driving behaviors. The framework first clusters the vehicle trajectory data into different driving styles according to safety features. It then employs maximum entropy inverse reinforcement learning on each of the clusters to learn the reward function corresponding to each driving style. Using these reward functions, the method integrates offline reinforcement learning pre‑training and multi‑agent reinforcement learning algorithms to obtain general and robust driving policies. Multi‑perspective simulation results show that our proposed scenario generation framework can simulate diverse, human‑like driving behaviors with strong generalization capability. The proposed framework achieves a 90.96% goal‑reaching rate, an off‑road rate of 2.08%, and a collision rate of 6.91% in the generalization test, outperforming prior approaches by over 20% in goal‑reaching performance. The source code is released at https://github.com/RoboSafe‑Lab/Sim4AD.

PaperID: 1433, https://arxiv.org/pdf/2503.14555.pdf GitHub GitHub

Authors: Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar

Title: A Generalist Hanabi Agent

Abstract:
Traditional multi‑agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2‑to‑5 player cooperative card‑game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game‑setting (e.g., 2‑player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation‑ and action‑space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents ‑‑ agents that are themselves unable to do so. The implementation code is available at: \hrefhttps://github.com/chandar‑lab/R3D2‑A‑Generalist‑Hanabi‑AgentR3D2‑A‑Generalist‑Hanabi‑Agent

PaperID: 1434, https://arxiv.org/pdf/2503.14484.pdf GitHub

Authors: Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh

Title: Gricean Norms as a Basis for Effective Collaboration

Abstract:
Effective human‑AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks ‑‑ common ground, relevance theory, and theory of mind ‑‑ into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT‑4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human‑AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human‑AI collaboration and enabling context‑aware communication in LLM‑based agents.

PaperID: 1435, https://arxiv.org/pdf/2503.13964.pdf GitHub

Authors: Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao

Title: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Abstract:
Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modal, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi‑modal reasoning, limiting their performance on real‑world documents. We present MDocAgent (A Multi‑Modal Multi‑Agent Framework for Document Understanding), a novel RAG and multi‑agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi‑modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks like MMLongBench, LongDocURL demonstrate the effectiveness of our MDocAgent, achieve an average improvement of 12.1% compared to current state‑of‑the‑art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real‑world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming‑lab/MDocAgent.

PaperID: 1436, https://arxiv.org/pdf/2503.13946.pdf GitHub GitHub

Authors: Kang Yang, Tianci Bu, Lantao Li, Chunxu Li, Yongcai Wang, Deying Li

Title: Is Discretization Fusion All You Need for Collaborative Perception?

Abstract:
Collaborative perception in multi‑agent system enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative features during fusion. To address these problems, this paper proposes a novel Anchor‑Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor‑centric communication and fusion. ACCO is composed by three main components: (1) Anchor featuring block (AFB) that targets to generate anchor proposals and projects prepared anchor queries to image features. (2) Anchor confidence generator (ACG) is designed to minimize communication by selecting only the features in the confident anchors to transmit. (3) A local‑global fusion module, in which local fusion is anchor alignment‑based fusion (LAAF) and global fusion is conducted by spatial‑aware cross‑attention (SACA). LAAF and SACA run in multi‑layers, so agents conduct anchor‑centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments are conducted to evaluate ACCO on OPV2V and Dair‑V2X datasets, which demonstrate ACCO's superiority in reducing the communication volume, and in improving the perception range and detection performances. Code can be found at: \hrefhttps://github.com/sidiangongyuan/ACCOhttps://github.com/sidiangongyuan/ACCO.

PaperID: 1437, https://arxiv.org/pdf/2503.13517.pdf GitHub

Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan

Title: CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

Abstract:
Scientific problem‑solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long‑Context Understanding,Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem‑solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines ‑ materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins ‑ covering both experimental and theoretical work‑flows in science. We evaluate a range of closed and open LLMs on tasks in CURIE which requires domain expertise, comprehension of long in‑context information,and multi‑step reasoning. While Gemini Flash 2.0 and Claude‑3 show consistent high comprehension across domains, the popular GPT‑4o and command‑R+ fail dramatically on protein sequencing tasks. With the best performance at 32% there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in https://github.com/google/curie

PaperID: 1438, https://arxiv.org/pdf/2503.12533.pdf GitHub

Authors: Haoqi Yuan, Yu Bai, Yuhui Fu, Bohan Zhou, Yicheng Feng, Xinrun Xu, Yi Zhan, Börje F. Karlsson, Zongqing Lu

Title: Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Abstract:
Building autonomous robotic agents capable of achieving human‑level performance in real‑world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high‑level cognition with Foundation Models (FMs) and low‑level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency due to compounding errors in long‑horizon tasks and the varied latency of different modules. We introduce Being‑0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high‑level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low‑level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision‑language model (VLM). The Connector enhances the FM's embodied capabilities by translating language‑based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components, except the FM, deployable on low‑cost onboard computation devices, Being‑0 achieves efficient, real‑time performance on a full‑sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being‑0's effectiveness in solving complex, long‑horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit https://beingbeyond.github.io/Being‑0.

PaperID: 1439, https://arxiv.org/pdf/2503.12434.pdf GitHub

Authors: Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, Liang He

Title: A Survey on the Optimization of Large Language Model-based Agents

Abstract:
With the rapid development of Large Language Models (LLMs), LLM‑based agents have been widely adopted in various fields, becoming essential for autonomous decision‑making and interactive tasks. However, current work typically relies on prompt design or fine‑tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent‑related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long‑term planning, dynamic environmental interaction, and complex decision‑making. Although numerous recent studies have explored various strategies to optimize LLM‑based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM‑based agent optimization approaches, categorizing them into parameter‑driven and parameter‑free methods. We first focus on parameter‑driven optimization, covering fine‑tuning‑based optimization, reinforcement learning‑based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine‑tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter‑free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM‑based agents, and discuss major challenges and promising future directions. Our repository for related references is available at https://github.com/YoungDubbyDu/LLM‑Agent‑Optimization.

PaperID: 1440, https://arxiv.org/pdf/2503.12122.pdf GitHub

Authors: Yoshiki Yano, Kazuki Shibata, Maarten Kokshoorn, Takamitsu Matsubara

Title: ICCO: Learning an Instruction-conditioned Coordinator for Language-guided Task-aligned Multi-robot Control

Abstract:
Recent advances in Large Language Models (LLMs) have permitted the development of language‑guided multi‑robot systems, which allow robots to execute tasks based on natural language instructions. However, achieving effective coordination in distributed multi‑agent environments remains challenging due to (1) misalignment between instructions and task requirements and (2) inconsistency in robot behaviors when they independently interpret ambiguous instructions. To address these challenges, we propose Instruction‑Conditioned Coordinator (ICCO), a Multi‑Agent Reinforcement Learning (MARL) framework designed to enhance coordination in language‑guided multi‑robot systems. ICCO consists of a Coordinator agent and multiple Local Agents, where the Coordinator generates Task‑Aligned and Consistent Instructions (TACI) by integrating language instructions with environmental states, ensuring task alignment and behavioral consistency. The Coordinator and Local Agents are jointly trained to optimize a reward function that balances task efficiency and instruction following. A Consistency Enhancement Term is added to the learning objective to maximize mutual information between instructions and robot behaviors, further improving coordination. Simulation and real‑world experiments validate the effectiveness of ICCO in achieving language‑guided task‑aligned multi‑robot control. The demonstration can be found at https://yanoyoshiki.github.io/ICCO/.

PaperID: 1441, https://arxiv.org/pdf/2503.11726.pdf GitHub

Authors: Hyunwoo Park, Baekryun Seong, Sang-Ki Ko

Title: SPECTra: Scalable Multi-Agent Reinforcement Learning with Permutation-Free Networks

Abstract:
In cooperative multi‑agent reinforcement learning (MARL), the permutation problem where the state space grows exponentially with the number of agents reduces sample efficiency. Additionally, many existing architectures struggle with scalability, relying on a fixed structure tied to a specific number of agents, limiting their applicability to environments with a variable number of entities. While approaches such as graph neural networks (GNNs) and self‑attention mechanisms have progressed in addressing these challenges, they have significant limitations as dense GNNs and self‑attention mechanisms incur high computational costs. To overcome these limitations, we propose a novel agent network and a non‑linear mixing network that ensure permutation‑equivariance and scalability, allowing them to generalize to environments with various numbers of agents. Our agent network significantly reduces computational complexity, and our scalable hypernetwork enables efficient weight generation for non‑linear mixing. Additionally, we introduce curriculum learning to improve training efficiency. Experiments on SMACv2 and Google Research Football (GRF) demonstrate that our approach achieves superior learning performance compared to existing methods. By addressing both permutation‑invariance and scalability in MARL, our work provides a more efficient and adaptable framework for cooperative MARL. Our code is available at https://github.com/funny‑rl/SPECTra.

PaperID: 1442, https://arxiv.org/pdf/2503.11488.pdf GitHub

Authors: Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan, Guillaume Sartoretti

Title: Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control

Abstract:
Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter‑sharing multi‑agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large‑scale homogeneous networks. However, the inherent heterogeneity of real‑world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network‑wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder‑only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent the unique intersection's topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self‑supervised manner, which enables better differentiation of intersection‑specific features. Moreover, we integrate the state‑action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. [...]. The code is available at https://github.com/marmotlab/Unicorn

PaperID: 1443, https://arxiv.org/pdf/2503.11444.pdf GitHub

Authors: Balaji Rama, Kai Mei, Yongfeng Zhang

Title: Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery

Abstract:
Autonomous LLM‑based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four‑layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community‑driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool‑use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video‑demo.

PaperID: 1444, https://arxiv.org/pdf/2503.11346.pdf GitHub

Authors: Fengyu Li, Yilin Li, Junhao Zhu, Lu Chen, Yanfei Zhang, Jia Zhou, Hui Zu, Jingwen Zhao, Yunjun Gao

Title: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

Abstract:
Huawei has always been committed to exploring the AI application in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end‑to‑end agentic system featured with a knowledge graph (KG)‑powered retrieval‑augmented generation (RAG) and anti‑hallucination multi‑agents. Specifically, AIstorian introduces an in‑context learning based chunking strategy and a KG‑based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi‑agents to conduct on‑the‑fly hallucination detection and error‑type‑aware correction. Additionally, to teach LLMs a certain language style, we finetune LLMs based on a two‑step training approach combining data augmentation‑enhanced supervised fine‑tuning with stylistic preference optimization. Extensive experiments on a real‑life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: https://github.com/ZJU‑DAILY/AIstorian.

PaperID: 1445, https://arxiv.org/pdf/2503.10970.pdf GitHub GitHub

Authors: Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, Marinka Zitnik

Title: TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

Abstract:
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi‑step reasoning and real‑time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient‑specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross‑source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA‑approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool‑use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open‑ended drug reasoning tasks, surpassing GPT‑4o and outperforming DeepSeek‑R1 (671B) in structured multi‑step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi‑step inference, real‑time knowledge grounding, and tool‑assisted decision‑making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real‑world evidence, reducing the risk of adverse events and improving therapeutic decision‑making.

PaperID: 1446, https://arxiv.org/pdf/2503.10621.pdf GitHub GitHub

Authors: Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan

Title: DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Abstract:
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi‑step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step‑by‑step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM‑o1, a new dataset and benchmark specifically designed to advance step‑wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step‑by‑step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine‑tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open‑source and closed‑source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open‑source model. Our framework, dataset, and model are available at https://github.com/ayesha‑ishaq/DriveLMM‑o1.

PaperID: 1447, https://arxiv.org/pdf/2503.10265.pdf GitHub

Authors: Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin

Title: SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis

Abstract:
Robotic‑assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task‑specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of RAS scene. Vision‑Language Models (VLMs) offer strong zero‑shot reasoning, but struggle with hallucinations, domain gaps and weak task‑interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning‑focused benchmark in RAS, covering 14256 QA pairs with frame‑level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain‑of‑Thought (CoT) driven agentic workflow for zero‑shot multi‑task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task‑level reasoning, while higher‑level agents capture workflow interdependencies or ground output clinically. Specifically, we propose a panel discussion mechanism to ensure task‑specific agents collaborate synergistically and leverage on task interdependencies. Similarly, we incorporate a retrieval‑augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task‑specific CoT prompts grounded in surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. Dataset and code is available at https://github.com/jinlab‑imvr/SurgRAW.git .

PaperID: 1448, https://arxiv.org/pdf/2503.09780.pdf GitHub

Authors: Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri

Title: AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Abstract:
Autonomous AI agents that can follow instructions and perform complex multi‑step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web‑navigation agents follow the privacy principle of ``data minimization''. For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is ``necessary'' to complete a particular task. Our benchmark simulates realistic web interaction scenarios end‑to‑end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT‑4, Llama‑3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting‑based defense that reduces information leakage, and demonstrate that our end‑to‑end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time.

PaperID: 1449, https://arxiv.org/pdf/2503.09521.pdf GitHub

Authors: Zak Buzzard

Title: PairVDN - Pair-wise Decomposed Value Functions

Abstract:
Extending deep Q‑learning to cooperative multi‑agent settings is challenging due to the exponential growth of the joint action space, the non‑stationary environment, and the credit assignment problem. Value decomposition allows deep Q‑learning to be applied at the joint agent level, at the cost of reduced expressivity. Building on past work in this direction, our paper proposes PairVDN, a novel method for decomposing the value function into a collection of pair‑wise, rather than per‑agent, functions, improving expressivity at the cost of requiring a more complex (but still efficient) dynamic programming maximisation algorithm. Our method enables the representation of value functions which cannot be expressed as a monotonic combination of per‑agent functions, unlike past approaches such as VDN and QMIX. We implement a novel many‑agent cooperative environment, Box Jump, and demonstrate improved performance over these baselines in this setting. We open‑source our code and environment at https://github.com/zzbuzzard/PairVDN.

PaperID: 1450, https://arxiv.org/pdf/2503.09501.pdf GitHub

Authors: Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen

Title: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Abstract:
Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta‑thinking ‑‑ enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem‑solving. However, current single‑agent work lacks a specialized design for acquiring meta‑thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta‑thinking Agents (ReMA), a novel framework that leverages Multi‑Agent Reinforcement Learning (MARL) to elicit meta‑thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high‑level meta‑thinking agent responsible for generating strategic oversight and plans, and a low‑level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single‑turn experiments demonstrate that ReMA outperforms single‑agent RL baselines on complex reasoning tasks, including competitive‑level mathematical benchmarks and LLM‑as‑a‑Judge benchmarks. Additionally, we further extend ReMA to multi‑turn interaction settings, leveraging turn‑level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta‑thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found in https://github.com/ziyuwan/ReMA‑public

PaperID: 1451, https://arxiv.org/pdf/2503.09263.pdf GitHub

Authors: Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Zhao Lv

Title: COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation

Abstract:
With the rapid advancements in Large Language Models (LLMs), an increasing number of studies have leveraged LLMs as the cognitive core of agents to address complex task decision‑making challenges. Specially, recent research has demonstrated the potential of LLM‑based agents on automating Windows GUI operations. However, existing methodologies exhibit two critical challenges: (1) static agent architectures fail to dynamically adapt to the heterogeneous requirements of OS‑level tasks, leading to inadequate scenario generalization;(2) the agent workflows lack fault tolerance mechanism, necessitating complete process re‑execution for UI agent decision error. To address these limitations, we introduce COLA, a collaborative multi‑agent framework for automating Windows UI operations. In this framework, a scenario‑aware agent Task Scheduler decomposes task requirements into atomic capability units, dynamically selects the optimal agent from a decision agent pool, effectively responds to the capability requirements of diverse scenarios. The decision agent pool supports plug‑and‑play expansion for enhanced flexibility. In addition, we design a memory unit equipped to all agents for their self‑evolution. Furthermore, we develop an interactive backtracking mechanism that enables human to intervene to trigger state rollbacks for non‑destructive process repair. Our experimental results on the GAIA benchmark demonstrates that the COLA framework achieves state‑of‑the‑art performance with an average score of 31.89%, significantly outperforming baseline approaches without web API integration. Ablation studies further validate the individual contributions of our dynamic scheduling. The code is available at https://github.com/Alokia/COLA‑demo.

PaperID: 1452, https://arxiv.org/pdf/2503.09089.pdf GitHub

Authors: Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, Xingyao Wang

Title: LocAgent: Graph-Guided LLM Agents for Code Localization

Abstract:
Code localization‑‑identifying precisely where in a codebase changes need to be made‑‑is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through graph‑based representation. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), enabling LLM agents to effectively search and locate relevant entities through powerful multi‑hop reasoning. Experimental results on real‑world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine‑tuned Qwen‑2.5‑Coder‑Instruct‑32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file‑level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent.

PaperID: 1453, https://arxiv.org/pdf/2503.08421.pdf GitHub

Authors: Qiming Xia, Wenkai Lin, Haoen Xiang, Xun Huang, Siheng Chen, Zhen Dong, Cheng Wang, Chenglu Wen

Title: Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels

Abstract:
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to the data sparsity and limited views, the clustering‑based label fitting in unsupervised object detection often generates low‑quality pseudo‑labels. Multi‑agent collaborative dataset, which involves the sharing of complementary observations among agents, holds the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi‑Agent LiDAR scans, termed DOtA, without using labels from external. DOtA first uses the internally shared ego‑pose and ego‑shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently,DOtA uses the complementary observations between agents to perform multi‑scale encoding on preliminary labels, then decodes high‑quality and low‑quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state‑of‑the‑art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks.The code is available at https://github.com/xmuqimingxia/DOtA.

PaperID: 1454, https://arxiv.org/pdf/2503.08175.pdf GitHub

Authors: Zitong Shi, Guancheng Wan, Wenke Huang, Guibin Zhang, Jiawei Shao, Mang Ye, Carl Yang

Title: Privacy-Enhancing Paradigms within Federated Multi-Agent Systems

Abstract:
LLM‑based Multi‑Agent Systems (MAS) have proven highly effective in solving complex problems by integrating multiple agents, each performing different roles. However, in sensitive domains, they face emerging privacy protection challenges. In this paper, we introduce the concept of Federated MAS, highlighting the fundamental differences between Federated MAS and traditional FL. We then identify key challenges in developing Federated MAS, including: 1) heterogeneous privacy protocols among agents, 2) structural differences in multi‑party conversations, and 3) dynamic conversational network structures. To address these challenges, we propose Embedded Privacy‑Enhancing Agents (EPEAgent), an innovative solution that integrates seamlessly into the Retrieval‑Augmented Generation (RAG) phase and the context retrieval stage. This solution minimizes data flows, ensuring that only task‑relevant, agent‑specific information is shared. Additionally, we design and generate a comprehensive dataset to evaluate the proposed paradigm. Extensive experiments demonstrate that EPEAgent effectively enhances privacy protection while maintaining strong system performance. The code will be availiable at https://github.com/ZitongShi/EPEAgent

PaperID: 1455, https://arxiv.org/pdf/2503.08147.pdf GitHub

Authors: Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li

Title: FilmComposer: LLM-Driven Music Production for Silent Film Clips

Abstract:
In this work, we implement music production for silent film clips using LLM‑driven method. Given the strong professional demands of film music production, we propose the FilmComposer, simulating the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi‑agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film‑audio quality, musicality, and musical development‑and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm‑controllable MusicGen, and multi‑agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, we propose MusicPro‑7k which includes 7,418 film clips, music, description, rhythm spots and main melody, considering the lack of a professional and high‑quality film music dataset. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state‑of‑the‑art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: https://apple‑jun.github.io/FilmComposer.github.io/

PaperID: 1456, https://arxiv.org/pdf/2503.07314.pdf GitHub GitHub GitHub

Authors: Weijia Wu, Zeyu Zhu, Mike Zheng Shou

Title: Automated Movie Generation via Multi-Agent CoT Planning

Abstract:
Existing long‑form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, an automated movie generation via multi‑agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We firstly explore and define the paradigm of automated movie/long‑video generation. Given a script and character bank, our MovieAgent can generates multi‑scene, multi‑shot long‑form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT‑based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state‑of‑the‑art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at: https://github.com/showlab/MovieAgent and https://weijiawu.github.io/MovieAgent.

PaperID: 1457, https://arxiv.org/pdf/2503.07275.pdf GitHub

Authors: Won-Sang You, Tae-Gwan Ha, Seo-Young Lee, Kyung-Joong Kim

Title: Automatic Curriculum Design for Zero-Shot Human-AI Coordination

Abstract:
Zero‑shot human‑AI coordination is the training of an ego‑agent to coordinate with humans without human data. Most studies on zero‑shot human‑AI coordination have focused on enhancing the ego‑agent's coordination ability in a given environment without considering the issue of generalization to unseen environments. Real‑world applications of zero‑shot human‑AI coordination should consider unpredictable environmental changes and the varying coordination ability of co‑players depending on the environment. Previously, the multi‑agent UED (Unsupervised Environment Design) approach has investigated these challenges by jointly considering environmental changes and co‑player policy in competitive two‑player AI‑AI scenarios. In this paper, our study extends a multi‑agent UED approach to zero‑shot human‑AI coordination. We propose a utility function and co‑player sampling for a zero‑shot human‑AI coordination setting that helps train the ego‑agent to coordinate with humans more effectively than a previous multi‑agent UED approach. The zero‑shot human‑AI coordination performance was evaluated in the Overcooked‑AI environment, using human proxy agents and real humans. Our method outperforms other baseline models and achieves high performance in human‑AI coordination tasks in unseen environments. The source code is available at https://github.com/Uwonsang/ACD_Human‑AI

PaperID: 1458, https://arxiv.org/pdf/2503.06866.pdf GitHub

Authors: Wanjing Huang, Tongjie Pan, Yalan Ye

Title: Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception

Abstract:
Recent advancements in large language models (LLMs) have expanded their role in robotic task planning. However, while LLMs have been explored for generating feasible task sequences, their ability to ensure safe task execution remains underdeveloped. Existing methods struggle with structured risk perception, making them inadequate for safety‑critical applications where low‑latency hazard adaptation is required. To address this limitation, we propose a Graphormer‑enhanced risk‑aware task planning framework that combines LLM‑based decision‑making with structured safety modeling. Our approach constructs a dynamic spatio‑semantic safety graph, capturing spatial and contextual risk factors to enable online hazard detection and adaptive task refinement. Unlike existing methods that rely on predefined safety constraints, our framework introduces a context‑aware risk perception module that continuously refines safety predictions based on real‑time task execution. This enables a more flexible and scalable approach to robotic planning, allowing for adaptive safety compliance beyond static rules. To validate our framework, we conduct experiments in the AI2‑THOR environment. The experiments results validates improvements in risk detection accuracy, rising safety notice, and task adaptability of our framework in continuous environments compared to static rule‑based and LLM‑only baselines. Our project is available at https://github.com/hwj20/GGTP

PaperID: 1459, https://arxiv.org/pdf/2503.06791.pdf GitHub

Authors: Xiao Wang, Lu Dong, Sahana Rangasrinivasan, Ifeoma Nwogu, Srirangaraj Setlur, Venugopal Govindaraju

Title: AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot

Abstract:
The social robot's open API allows users to customize open‑domain interactions. However, it remains inaccessible to those without programming experience. In this work, we introduce AutoMisty, the first multi‑agent collaboration framework powered by large language models (LLMs), to enable the seamless generation of executable Misty robot code from natural language instructions. AutoMisty incorporates four specialized agent modules to manage task decomposition, assignment, problem‑solving, and result synthesis. Each agent incorporates a two‑layer optimization mechanism, with self‑reflection for iterative refinement and human‑in‑the‑loop for better alignment with user preferences. AutoMisty ensures a transparent reasoning process, allowing users to iteratively refine tasks through natural language feedback for precise execution. To evaluate AutoMisty's effectiveness, we designed a benchmark task set spanning four levels of complexity and conducted experiments in a real Misty robot environment. Extensive evaluations demonstrate that AutoMisty not only consistently generates high‑quality code but also enables precise code control, significantly outperforming direct reasoning with ChatGPT‑4o and ChatGPT‑o1. All code, optimized APIs, and experimental videos will be publicly released through the webpage: https://wangxiaoshawn.github.io/AutoMisty.html

PaperID: 1460, https://arxiv.org/pdf/2503.05659.pdf GitHub

Authors: Yu Zhang, Shutong Qiao, Jiaqi Zhang, Tzu-Heng Lin, Chen Gao, Yong Li

Title: A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval

Abstract:
Information technology has profoundly altered the way humans interact with information. The vast amount of content created, shared, and disseminated online has made it increasingly difficult to access relevant information. Over the past two decades, recommender systems and search (collectively referred to as information retrieval systems) have evolved significantly to address these challenges. Recent advances in large language models (LLMs) have demonstrated capabilities that surpass human performance in various language‑related tasks and exhibit general understanding, reasoning, and decision‑making abilities. This paper explores the transformative potential of LLM agents in enhancing recommender and search systems. We discuss the motivations and roles of LLM agents, and establish a classification framework to elaborate on the existing research. We highlight the immense potential of LLM agents in addressing current challenges in recommendation and search, providing insights into future research directions. This paper is the first to systematically review and classify the research on LLM agents in these domains, offering a novel perspective on leveraging this advanced AI technology for information retrieval. To help understand the existing works, we list the existing papers on LLM agent based recommendation and search at this link: https://github.com/tsinghua‑fib‑lab/LLM‑Agent‑for‑Recommendation‑and‑Search.

PaperID: 1461, https://arxiv.org/pdf/2503.05383.pdf GitHub

Authors: Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, Guohao Li

Title: AVA: Attentive VLM Agent for Mastering StarCraft II

Abstract:
We introduce AVACraft, a multimodal StarCraft II benchmark supporting both Multi‑Agent Reinforcement Learning (MARL) and Vision‑Language Model (VLM) paradigms. Unlike SMAC‑family environments that rely on abstract state representations and exclude VLMs, AVACraft provides RGB visuals, natural language observations, and structured state information, enabling systematic comparison between training‑based and zero‑shot methods across 21 scenarios spanning micromanagement, coordination, and strategic planning. We establish comprehensive baselines: six MARL algorithms (IQL, QMIX, QTRAN, VDN, MAPPO, IPPO) with Swin‑Transformer backbones trained for 5M steps, and multiple VLMs including proprietary (GPT‑4o) and open‑source (Qwen3‑VL) models. Results reveal complementary strengths‑MARL peaks at 19.3% win rate after 5M steps, while VLMs achieve 75‑90% zero‑shot with human‑aligned decisions‑exposing trade‑offs between training efficiency, performance ceilings, interpretability, and deployment cost. Code: https://github.com/camel‑ai/VLM‑Play‑StarCraft2.

PaperID: 1462, https://arxiv.org/pdf/2503.05347.pdf GitHub

Authors: Zhenxuan Zhang, Kinhei Lee, Peiyuan Jing, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang

Title: GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation

Abstract:
Automatic medical report generation has the potential to support clinical diagnosis, reduce the workload of radiologists, and demonstrate potential for enhancing diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Early overlap‑based methods focus on textual matches between predicted and ground‑truth entities but miss fine‑grained clinical details (e.g., anatomical location, severity). Some diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions. LLM‑based approaches further lack interpretable reasoning steps, making it hard to assess or trust their behavior in safety‑critical settings. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi‑Agent Score (GEMA‑Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model‑based multi‑agent workflow. Our GEMA‑Score parses structured reports and employs stable calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM‑based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA‑Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = 0.69 for ReXVal dataset and Kendall coefficient = 0.45 for RadEvalX dataset). The anonymous project demo is available at: https://github.com/Zhenxuan‑Zhang/GEMA_score.

PaperID: 1463, https://arxiv.org/pdf/2503.05255.pdf GitHub

Authors: Guanghao Zhang, Tao Zhong, Yan Xia, Mushui Liu, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Dong She, Yi Wang, Hao Jiang

Title: CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

Abstract:
While previous multimodal slow‑thinking methods have demonstrated remarkable success in single‑image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi‑image comprehension tasks. This limitation stems from their predominant reliance on text‑based intermediate reasoning processes. While for human, when engaging in sophisticated multi‑image analysis, they typically perform two complementary cognitive operations: (1) continuous cross‑image visual comparison through region‑of‑interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi‑Modal Chain‑of‑Thought (CMMCoT) framework, a multi‑step reasoning framework that mimics human‑like "slow thinking" for multi‑image understanding. Our approach incorporates two key innovations: (1) The construction of interleaved multimodal multi‑step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross‑modal understanding but also enhances model interpretability. (2) The introduction of a test‑time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi‑image slow‑thinking dataset. Extensive experiments demonstrate the effectiveness of our model. Code is available at https://github.com/zhangguanghao523/CMMCoT.

PaperID: 1464, https://arxiv.org/pdf/2503.04827.pdf GitHub

Authors: Mahfuz Ahmed Anik, Abdur Rahman, Azmine Toushik Wasi, Md Manjurul Ahsan

Title: Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems

Abstract:
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI‑driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi‑agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT‑4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low‑resource languages. This research underscores the potential of multi‑agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: https://github.com/ciol‑researchlab/Context‑Aware_Translation_MAS

PaperID: 1465, https://arxiv.org/pdf/2503.04798.pdf GitHub GitHub

Authors: Jingtian Yan, Zhifei Li, William Kang, Kevin Zheng, Yulun Zhang, Zhe Chen, Yue Zhang, Daniel Harabor, Stephen F. Smith, Jiaoyang Li

Title: Advancing MAPF Toward the Real World: A Scalable Multi-Agent Realistic Testbed (SMART)

Abstract:
We present Scalable Multi‑Agent Realistic Testbed (SMART), a realistic and efficient software tool for evaluating Multi‑Agent Path Finding (MAPF) algorithms. MAPF focuses on planning collision‑free paths for a group of robots. While state‑of‑the‑art MAPF planners can plan paths for hundreds of robots in seconds, they often rely on simplified robot models, making their real‑world performance unclear. Researchers typically lack access to hundreds of physical robots in laboratory settings to evaluate the algorithms. Meanwhile, industrial professionals who lack expertise in MAPF require an easy‑to‑use simulator to efficiently test and understand the performance of MAPF planners in their specific settings. SMART fills this gap with several advantages: (1) SMART uses physics‑engine‑based simulators to create realistic simulation environments, accounting for complex real‑world factors such as robot kinodynamics and execution uncertainties, (2) SMART uses an execution monitor framework based on the Action Dependency Graph, facilitating seamless integration with various MAPF planners and robot models, and (3) SMART scales to thousands of robots. The code is publicly available at https://github.com/smart‑mapf/smart with an online service available at https://smart‑mapf.github.io/demo/.

PaperID: 1466, https://arxiv.org/pdf/2503.03800.pdf GitHub

Authors: Cristian Jimenez-Romero, Alper Yegenoglu, Christian Blum

Title: Multi-Agent Systems Powered by Large Language Models: Applications in Swarm Intelligence

Abstract:
This work examines the integration of large language models (LLMs) into multi‑agent simulations by replacing the hard‑coded programs of agents with LLM‑driven prompts. The proposed approach is showcased in the context of two examples of complex systems from the field of swarm intelligence: ant colony foraging and bird flocking. Central to this study is a toolchain that integrates LLMs with the NetLogo simulation platform, leveraging its Python extension to enable communication with GPT‑4o via the OpenAI API. This toolchain facilitates prompt‑driven behavior generation, allowing agents to respond adaptively to environmental data. For both example applications mentioned above, we employ both structured, rule‑based prompts and autonomous, knowledge‑driven prompts. Our work demonstrates how this toolchain enables LLMs to study self‑organizing processes and induce emergent behaviors within multi‑agent environments, paving the way for new approaches to exploring intelligent systems and modeling swarm intelligence inspired by natural phenomena. We provide the code, including simulation files and data at https://github.com/crjimene/swarm_gpt.

PaperID: 1467, https://arxiv.org/pdf/2503.03743.pdf GitHub

Authors: Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu

Title: CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning

Abstract:
The advancement of visual language models (VLMs) has enhanced mobile device operations, allowing simulated human‑like actions to address user requirements. Current VLM‑based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, linking high‑level goals with low‑level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that lower‑level agent cannot execute and inefficient subtasks that fail to contribute to the completion of the higher‑level task. These challenges stem from VLM's lack of experience in decomposing subtasks within GUI scenarios in multi‑agent architecture. To address these, we propose a new mobile assistant architecture with constrained high‑frequency optimized planning (CHOP). Our approach overcomes the VLM's deficiency in GUI scenarios planning by using human‑planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 Apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code is available at https://github.com/Yuqi‑Zhou/CHOP

PaperID: 1468, https://arxiv.org/pdf/2503.03686.pdf GitHub

Authors: Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao

Title: MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency‑oriented data construction pipeline to create a high‑quality dataset comprising coherent and consistent query‑MAS pairs. Using this dataset, we train MAS‑GPT, an open‑source medium‑sized LLM that is capable of generating query‑adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high‑quality responses. Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS‑GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS‑GPT's high effectiveness, efficiency and strong generalization ability. Code will be available at https://github.com/rui‑ye/MAS‑GPT.

PaperID: 1469, https://arxiv.org/pdf/2503.03430.pdf GitHub

Authors: Junhao Xu, Yanan Zhang, Zhi Cai, Di Huang

Title: CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization

Abstract:
Multi‑agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single‑vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication‑efficient collaborative perception framework based on supply‑demand awareness and intermediate‑late hybridization, dubbed as \mymethodname. By modeling the supply‑demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate‑late hybrid collaboration mode, where late‑stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real‑world scenarios, demonstrate that \mymethodname~ achieves state‑of‑the‑art detection accuracy and optimal bandwidth trade‑offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code will be released at https://github.com/Xu2729/CoSDH.

PaperID: 1470, https://arxiv.org/pdf/2503.02954.pdf GitHub

Authors: Yue Meng, Nathalie Majcherczyk, Wenliang Liu, Scott Kiesel, Chuchu Fan, Federico Pecora

Title: Reliable and Efficient Multi-Agent Coordination via Graph Neural Network Variational Autoencoders

Abstract:
Multi‑agent coordination is crucial for reliable multi‑robot navigation in shared spaces such as automated warehouses. In regions of dense robot traffic, local coordination methods may fail to find a deadlock‑free solution. In these scenarios, it is appropriate to let a central unit generate a global schedule that decides the passing order of robots. However, the runtime of such centralized coordination methods increases significantly with the problem scale. In this paper, we propose to leverage Graph Neural Network Variational Autoencoders (GNN‑VAE) to solve the multi‑agent coordination problem at scale faster than through centralized optimization. We formulate the coordination problem as a graph problem and collect ground truth data using a Mixed‑Integer Linear Program (MILP) solver. During training, our learning framework encodes good quality solutions of the graph problem into a latent space. At inference time, solution samples are decoded from the sampled latent variables, and the lowest‑cost sample is selected for coordination. Finally, the feasible proposal with the highest performance index is selected for the deployment. By construction, our GNN‑VAE framework returns solutions that always respect the constraints of the considered coordination problem. Numerical results show that our approach trained on small‑scale problems can achieve high‑quality solutions even for large‑scale problems with 250 robots, being much faster than other baselines. Project page: https://mengyuest.github.io/gnn‑vae‑coord

PaperID: 1471, https://arxiv.org/pdf/2503.02834.pdf GitHub

Authors: Michal Nazarczuk, Karla Stepanova, Jan Kristof Behrens, Matej Hoffmann, Krystian Mikolajczyk

Title: MuBlE: MuJoCo and Blender simulation Environment and Benchmark for Task Planning in Robot Manipulation

Abstract:
Current embodied reasoning agents struggle to plan for long‑horizon tasks that require to physically interact with the world to obtain the necessary information (e.g. 'sort the objects from lightest to heaviest'). The improvement of the capabilities of such an agent is highly dependent on the availability of relevant training environments. In order to facilitate the development of such systems, we introduce a novel simulation environment (built on top of robosuite) that makes use of the MuJoCo physics engine and high‑quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. It is the first simulator focusing on long‑horizon robot manipulation tasks preserving accurate physics modeling. MuBlE can generate mutlimodal data for training and enable design of closed‑loop methods through environment interaction on two levels: visual ‑ action loop, and control ‑ physics loop. Together with the simulator, we propose SHOP‑VRB2, a new benchmark composed of 10 classes of multi‑step reasoning scenarios that require simultaneous visual and physical measurements.

PaperID: 1472, https://arxiv.org/pdf/2503.02238.pdf GitHub

Authors: Zirui Wu, Xiao Liu, Jiayi Li, Lingpeng Kong, Yansong Feng

Title: Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

Abstract:
While Large Language Model‑based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single‑task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real‑world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real‑world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints i.e. specific actions need to be performed within a particular time intervals following the preceding steps. Overly aggressive local parallelization may disrupt this constraint, potentially compromising the entire cooking process. This strict time constraint between actions raises a unique challenge for agents to balance between maximizing concurrent operations and adhering to critical timing constraints. Extensive experiments with state‑of‑the‑art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open‑source our benchmark and code at https://github.com/WilliamZR/Recipe2Plan.

PaperID: 1473, https://arxiv.org/pdf/2503.01935.pdf GitHub GitHub

Authors: Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You

Title: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single‑agent tasks or are confined to narrow domains, failing to capture the dynamics of multi‑agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM‑based multi‑agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone‑based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt‑4o‑mini reaches the average highest task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are public available at https://github.com/MultiagentBench/MARBLE.

PaperID: 1474, https://arxiv.org/pdf/2503.01908.pdf GitHub

Authors: Jiawei Zhang, Shuang Yang, Bo Li

Title: UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

Abstract:
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Nevertheless, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI‑secure/UDora.

PaperID: 1475, https://arxiv.org/pdf/2503.01017.pdf GitHub

Authors: Yuhang Zhang, Zhiyao Zhang, Junyi Ji, Marcos Quiñones-Grueiro, William Barbour, Derek Gloudemans, Gergely Zachár, Clay Weston, Gautam Biswas, Daniel B. Work

Title: Real-World Deployment and Assessment of a Multi-Agent Reinforcement Learning-Based Variable Speed Limit Control System

Abstract:
This article presents the first field deployment of a multi‑agent reinforcement learning (MARL) based variable speed limit (VSL) control system on Interstate 24 (I‑24) near Nashville, Tennessee. We design and demonstrate a full pipeline from training MARL agents in a traffic simulator to a field deployment on a 17‑mile segment of I‑24 encompassing 67 VSL controllers. The system was launched on March 8th, 2024, and has made approximately 35 million decisions on 28 million trips in six months of operation. We apply an invalid action masking mechanism and several safety guards to ensure real‑world constraints. The MARL‑based implementation operates up to 98% of the time, with the safety guards overriding the MARL decisions for the remaining time. We evaluate the performance of the MARL‑based algorithm in comparison to a previously deployed non‑RL VSL benchmark algorithm on I‑24. Results show that the MARL‑based VSL control system achieves a superior performance. The accuracy of correctly warning drivers about slowing traffic ahead is improved by 14% and the response delay to non‑recurrent congestion is reduced by 75%. The preliminary data shows that the VSL control system has reduced the crash rate by 26% and the secondary crash rate by 50%. We open‑sourced the deployed MARL‑based VSL algorithm at https://github.com/Lab‑Work/marl‑vsl‑controller.

PaperID: 1476, https://arxiv.org/pdf/2503.00771.pdf GitHub

Authors: Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao

Title: Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

Abstract:
Personalized tool utilization is essential for aligning large language models (LLMs) with user preference in interaction scenarios with various tools. However, most of the current benchmarks primarily focus on either personalization of text generation or direct tool‑utilizing, without considering both. In this work, we introduce a novel benchmark ETAPP for evaluating personalized tool invocation, establishing a sandbox environment, and a comprehensive dataset of 800 testing cases covering diverse user profiles. To improve the accuracy of our evaluation, we propose a key‑point‑based LLM evaluation method, mitigating biases in the LLM‑as‑a‑judge system by manually annotating key points for each test case and providing them to LLM as the reference. Additionally, we evaluate the excellent LLMs and provide an in‑depth analysis. Furthermore, we investigate the impact of different tool‑invoking strategies on LLMs' personalization performance and the effects of fine‑tuning in our task. The effectiveness of our preference‑setting and key‑point‑based evaluation method is also validated. Our findings offer insights into improving personalized LLM agents. Our Code is available at https://github.com/hypasd‑art/ETAPP.

PaperID: 1477, https://arxiv.org/pdf/2503.00717.pdf GitHub

Authors: Seungbae Seo, Junghwan Kim, Minjeong Shin, Bongwon Suh

Title: LLMDR: LLM-Driven Deadlock Detection and Resolution in Multi-Agent Pathfinding

Abstract:
Multi‑Agent Pathfinding (MAPF) is a core challenge in multi‑agent systems. Existing learning‑based MAPF methods often struggle with scalability, particularly when addressing complex scenarios that are prone to deadlocks. To address these challenges, we introduce LLMDR (LLM‑Driven Deadlock Detection and Resolution), an approach designed to resolve deadlocks and improve the performance of learnt MAPF models. LLMDR integrates the inference capabilities of large language models (LLMs) with learnt MAPF models and prioritized planning, enabling it to detect deadlocks and provide customized resolution strategies. We evaluate LLMDR on standard MAPF benchmark maps with varying agent numbers, measuring its performance when combined with several base models. The results demonstrate that LLMDR improves the performance of learnt MAPF models, particularly in deadlock‑prone scenarios, with notable improvements in success rates. These findings show the potential of integrating LLMs to improve the scalability of learning‑based MAPF methods. The source code for LLMDR is available at: https://github.com/ssbacc/llmdr‑dhc

PaperID: 1478, https://arxiv.org/pdf/2503.00455.pdf GitHub GitHub

Authors: Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee

Title: PodAgent: A Comprehensive Framework for Podcast Generation

Abstract:
Existing Existing automatic audio generation methods struggle to generate podcast‑like audio programs effectively. The key challenges lie in in‑depth content generation, appropriate and expressive voice production. This paper proposed PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic‑discussion content by designing a Host‑Guest‑Writer multi‑agent collaboration system, 2) builds a voice pool for suitable voice‑role matching and 3) utilizes LLM‑enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast‑like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT‑4 generation in topic‑discussion dialogue content, achieving an 87.4% voice‑matching accuracy, and producing more expressive speech through LLM‑guided synthesis. Demo page: https://podcast‑agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.

PaperID: 1479, https://arxiv.org/pdf/2503.00061.pdf GitHub

Authors: Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang

Title: Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents

Abstract:
Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Despite defenses designed for IPI attacks, their robustness remains questionable due to insufficient testing against adaptive attacks. In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%. This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability. The code is available at https://github.com/uiuc‑kang‑lab/AdaptiveAttackAgent.

PaperID: 1480, https://arxiv.org/pdf/2502.20073.pdf GitHub

Authors: Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, Xiaojie Wang

Title: Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Abstract:
Large Language Models (LLMs) based agent systems have made great strides in real‑world applications beyond traditional NLP tasks. This paper proposes a new LLM‑based Multi‑Agent System (LLM‑MAS) benchmark, Collab‑Overcooked, built on the popular Overcooked‑AI game with more applicable and challenging tasks in interactive environments. Collab‑Overcooked extends existing benchmarks in two novel ways. First, it provides a multi‑agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process‑oriented evaluation metrics to assess the fine‑grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM‑MAS and provide insights for improving and evaluating LLM‑MAS on a unified and open‑source benchmark. The environments, 30 open‑ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab‑Overcooked.

PaperID: 1481, https://arxiv.org/pdf/2502.19717.pdf GitHub

Authors: Xinran Li, Xiaolu Wang, Chenjia Bai, Jun Zhang

Title: Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning

Abstract:
In cooperative multi‑agent reinforcement learning (MARL), well‑designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large‑scale multi‑agent systems commonly found in real‑world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller‑scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links‑a task that becomes increasingly complex as the number of agents grows‑we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small‑diameter and small‑size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory‑based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision‑making. Extensive experiments on large‑scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero‑shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.

PaperID: 1482, https://arxiv.org/pdf/2502.19645.pdf GitHub

Authors: Moo Jin Kim, Chelsea Finn, Percy Liang

Title: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Abstract:
Recent vision‑language‑action models (VLAs) build upon pretrained vision‑language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine‑tuning to achieve good performance, yet how to most effectively fine‑tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine‑tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine‑Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression‑based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input‑output specifications. We propose OpenVLA‑OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real‑world evaluations, our fine‑tuning recipe enables OpenVLA to successfully execute dexterous, high‑frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (π_0 and RDT‑1B) fine‑tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla‑oft.github.io/.

PaperID: 1483, https://arxiv.org/pdf/2502.19260.pdf GitHub

Authors: Nadya Abdel Madjid, Murad Mebrahtu, Abdulrahman Ahmad, Abdelmoamen Nasser, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji

Title: EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving

Abstract:
This paper introduces the Emirates Multi‑Task (EMT) dataset, designed to support multi‑task benchmarking within a unified framework. It comprises over 30,000 frames from a dash‑camera perspective and 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes that reflect the distinctive road topology, congestion patterns, and driving behavior of Gulf region traffic. The dataset supports three primary tasks: tracking, trajectory forecasting, and intention prediction. Each benchmark is accompanied by corresponding evaluations: (1) multi‑agent tracking experiments addressing multi‑class scenarios and occlusion handling; (2) trajectory forecasting evaluation using deep sequential and interaction‑aware models; and (3) intention prediction experiments based on observed trajectories. The dataset is publicly available at https://avlab.io/emt‑dataset, with pre‑processing scripts and evaluation models at https://github.com/AV‑Lab/emt‑dataset.

PaperID: 1484, https://arxiv.org/pdf/2502.18754.pdf GitHub

Authors: Yuwei Yan, Yu Shang, Qingbin Zeng, Yu Li, Keyu Zhao, Zhiheng Zheng, Xuefei Ning, Tianji Wu, Shengen Yan, Yu Wang, Fengli Xu, Yong Li

Title: AgentSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms

Abstract:
The AgentSociety Challenge is the first competition in the Web Conference that aims to explore the potential of Large Language Model (LLM) agents in modeling user behavior and enhancing recommender systems on web platforms. The Challenge consists of two tracks: the User Modeling Track and the Recommendation Track. Participants are tasked to utilize a combined dataset from Yelp, Amazon, and Goodreads, along with an interactive environment simulator, to develop innovative LLM agents. The Challenge has attracted 295 teams across the globe and received over 1,400 submissions in total over the course of 37 official competition days. The participants have achieved 21.9% and 20.3% performance improvement for Track 1 and Track 2 in the Development Phase, and 9.1% and 15.9% in the Final Phase, representing a significant accomplishment. This paper discusses the detailed designs of the Challenge, analyzes the outcomes, and highlights the most successful LLM agent designs. To support further research and development, we have open‑sourced the benchmark environment at https://tsinghua‑fib‑lab.github.io/AgentSocietyChallenge.

PaperID: 1485, https://arxiv.org/pdf/2502.18540.pdf GitHub

Authors: Zike Yuan, Ming Liu, Hui Wang, Bing Qin

Title: MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications

Abstract:
Graph‑theoretic problems arise in real‑world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models (LLMs) offer potential solutions but face challenges, including limited accuracy and input length constraints. To address these challenges, we propose MA‑GTS (Multi‑Agent Graph Theory Solver), a multi‑agent framework that decomposes these complex problems through agent collaboration. MA‑GTS maps the implicitly expressed text‑based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. This approach ensures that the solution process remains efficient and the resulting reasoning path is interpretable. We validate MA‑GTS using the G‑REAL dataset, a real‑world‑inspired graph theory dataset we created. Experimental results show that MA‑GTS outperforms state‑of‑the‑art approaches in terms of efficiency, accuracy, and scalability, with strong results across multiple benchmarks (G‑REAL 94.2%, GraCoRe 96.9%, NLGraph 98.4%).MA‑GTS is open‑sourced at https://github.com/ZIKEYUAN/MA‑GTS.git.

PaperID: 1486, https://arxiv.org/pdf/2502.18308.pdf GitHub

Authors: Jianhao Yan, Yun Luo, Yue Zhang

Title: RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Abstract:
In the multi‑turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta‑evaluation shows that the LLM‑based refuter could generate more human‑like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. https://github.com/ElliottYan/RefuteBench‑2.0

PaperID: 1487, https://arxiv.org/pdf/2502.18017.pdf GitHub

Authors: Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao

Title: ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Abstract:
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval‑Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image‑based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi‑agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)‑based hybrid strategy to effectively handle multi‑modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test‑time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba‑NLP/ViDoRAG.

PaperID: 1488, https://arxiv.org/pdf/2502.17967.pdf GitHub

Authors: Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou

Title: Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real‑world financial environments remains underexplored. Existing approaches are limited to historical backtesting, where trading actions cannot influence market prices and agents train only on static data. To address this limitation, we present the Agent Trading Arena, a virtual zero‑sum stock market in which LLM‑based agents engage in competitive multi‑agent trading and directly impact price dynamics. By simulating realistic bid‑ask interactions, our platform enables training in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments reveal that LLMs struggle with numerical reasoning when given plain‑text data, often overfitting to local patterns and recent values. In contrast, chart‑based visualizations significantly enhance both numerical reasoning and trading performance. Furthermore, incorporating a reflection module yields additional improvements, especially with visual inputs. Evaluations on NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent‑Trading‑Arena.

PaperID: 1489, https://arxiv.org/pdf/2502.17506.pdf GitHub

Authors: Namkyeong Lee, Edward De Brouwer, Ehsan Hajiramezanali, Tommaso Biancalani, Chanyoung Park, Gabriele Scalia

Title: RAG-Enhanced Collaborative LLM Agents for Drug Discovery

Abstract:
Recent advances in large language models (LLMs) have shown great potential to accelerate drug discovery. However, the specialized nature of biochemical data often necessitates costly domain‑specific fine‑tuning, posing major challenges. First, it hinders the application of more flexible general‑purpose LLMs for cutting‑edge drug discovery tasks. More importantly, it limits the rapid integration of the vast amounts of scientific data continuously generated through experiments and research. Compounding these challenges is the fact that real‑world scientific questions are typically complex and open‑ended, requiring reasoning beyond pattern matching or static knowledge retrieval.To address these challenges, we propose CLADD, a retrieval‑augmented generation (RAG)‑empowered agentic system tailored to drug discovery tasks. Through the collaboration of multiple LLM agents, CLADD dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence to generate responses ‑ all without the need for domain‑specific fine‑tuning. Crucially, we tackle key obstacles in applying RAG workflows to biochemical data, including data heterogeneity, ambiguity, and multi‑source integration. We demonstrate the flexibility and effectiveness of this framework across a variety of drug discovery tasks, showing that it outperforms general‑purpose and domain‑specific LLMs as well as traditional deep learning approaches. Our code is publicly available at https://github.com/Genentech/CLADD.

PaperID: 1490, https://arxiv.org/pdf/2502.17421.pdf GitHub

Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Title: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Abstract:
As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state‑of‑the‑art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long‑context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key‑Value (KV) cache; (2) performance degradation resulting from the mismatch between short‑context training and long‑context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory‑efficient draft model with a constant‑sized KV cache; novel position indices that mitigate the training‑inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long‑context understanding datasets, as well as a 2.25x reduction in wall‑clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long‑context applications. The code is available at https://github.com/sail‑sg/LongSpec.

PaperID: 1491, https://arxiv.org/pdf/2502.16789.pdf GitHub

Authors: Ziyi Tang, Zechuan Chen, Jiarui Yang, Jiayao Mai, Yongsen Zheng, Keze Wang, Jinrui Chen, Liang Lin

Title: AlphaAgent: LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay

Abstract:
Alpha mining, a critical component in quantitative investment, focuses on discovering predictive signals for future asset returns in increasingly complex financial markets. However, the pervasive issue of alpha decay, where factors lose their predictive power over time, poses a significant challenge for alpha mining. Traditional methods like genetic programming face rapid alpha decay from overfitting and complexity, while approaches driven by Large Language Models (LLMs), despite their promise, often rely too heavily on existing knowledge, creating homogeneous factors that worsen crowding and accelerate decay. To address this challenge, we propose AlphaAgent, an autonomous framework that effectively integrates LLM agents with ad hoc regularizations for mining decay‑resistant alpha factors. AlphaAgent employs three key mechanisms: (i) originality enforcement through a similarity measure based on abstract syntax trees (ASTs) against existing alphas, (ii) hypothesis‑factor alignment via LLM‑evaluated semantic consistency between market hypotheses and generated factors, and (iii) complexity control via AST‑based structural constraints, preventing over‑engineered constructions that are prone to overfitting. These mechanisms collectively guide the alpha generation process to balance originality, financial rationale, and adaptability to evolving market conditions, mitigating the risk of alpha decay. Extensive evaluations show that AlphaAgent outperforms traditional and LLM‑based methods in mitigating alpha decay across bull and bear markets, consistently delivering significant alpha in Chinese CSI 500 and US S&P 500 markets over the past four years. Notably, AlphaAgent showcases remarkable resistance to alpha decay, elevating the potential for yielding powerful factors.

PaperID: 1492, https://arxiv.org/pdf/2502.16660.pdf GitHub

Authors: Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng

Title: BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning

Abstract:
The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi‑scale research targets. Our evaluation of methods such as CoT and graph‑augmented reasoning, shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph‑based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao‑ht/BioMaze.

PaperID: 1493, https://arxiv.org/pdf/2502.16565.pdf GitHub

Authors: Zengqing Wu, Takayuki Ito

Title: The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems

Abstract:
Consensus formation is pivotal in multi‑agent systems (MAS), balancing collective coherence with individual diversity. Conventional LLM‑based MAS primarily rely on explicit coordination, e.g., prompts or voting, risking premature homogenization. We argue that implicit consensus, where agents exchange information yet independently form decisions via in‑context learning, can be more effective in dynamic environments that require long‑horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus‑diversity tradeoff, showing conditions where implicit methods outperform explicit ones. Experiments on three scenarios ‑‑ Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public‑Goods Provision ‑‑ confirm partial deviation from group norms boosts exploration, robustness, and performance. We highlight emergent coordination via in‑context learning, underscoring the value of preserving diversity for resilient decision‑making.

PaperID: 1494, https://arxiv.org/pdf/2502.16069.pdf GitHub

Authors: Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen

Title: Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Abstract:
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra‑agent rigor module to enhance reliability, an inter‑agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers, and widely adopted open‑source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open‑sourced at https://github.com/Just‑Curieous/Curie.

PaperID: 1495, https://arxiv.org/pdf/2502.15972.pdf GitHub

Authors: Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat

Title: When Cultures Meet: Multicultural Text-to-Image Generation

Abstract:
Text‑to‑image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text‑to‑image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state‑of‑the‑art text‑to‑image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi‑Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas. Our analysis shows that richer prompt composition can improve image quality and cultural grounding compared to simple prompts, while revealing substantial disparities across languages and demographic groups. We release our dataset and code at https://github.com/AIM‑SCU/MosAIG.

PaperID: 1496, https://arxiv.org/pdf/2502.15969.pdf GitHub

Authors: William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh

Title: Forgotten Polygons: Multimodal Large Language Models are Shape-Blind

Abstract:
Despite strong performance on vision‑language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem‑solving, with both open‑source and state‑of‑the‑art models falling short of human performance on visual‑math benchmarks. To systematically examine visual‑mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi‑step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual‑process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain‑of‑Thought (VC‑CoT) prompting, which enhances multi‑step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT‑4o's accuracy on an irregular polygon side‑counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually‑guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape‑Blind.

PaperID: 1497, https://arxiv.org/pdf/2502.15153.pdf GitHub

Authors: Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Haodong Zhao, Zhuosheng Zhang, Gongshen Liu

Title: When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements

Abstract:
Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi‑agent systems (MAS). However, it remains unclear how disagreements shape collective decision‑making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task‑critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE‑cf), which typically follows a single evidential chain, whereas collaborative programming (HumanEval, GAIA) often adopts multiple valid implementations. Disagreements are instantiated as general heterogeneity among agents and as task‑critical counterfactual knowledge edits injected into context or parameters. Experiments reveal that general disagreements consistently improve success by encouraging complementary exploration. By contrast, task‑critical disagreements substantially reduce success on single‑path reasoning, yet have a limited impact on programming, where agents can choose alternative solutions. Trace analyses show that MAS frequently bypasses the edited facts in programming but rarely does so in reasoning, revealing an emergent self‑repair capability that depends on solution‑path rather than scale alone. Our code is available at https://github.com/wbw625/MultiAgentRobustness.

PaperID: 1498, https://arxiv.org/pdf/2502.15086.pdf GitHub

Authors: Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Sangwu Park, Kibum Kim, Chanyoung Park

Title: Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Abstract:
As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining the safety relying heavily on general standards, overlooking user‑specific standards. However, safety standards for LLM may vary based on a user‑specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user‑specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user‑specific safety of LLMs. To address this gap, we introduce U‑SafeBench, a benchmark designed to assess user‑specific aspect of LLM safety. Our evaluation of 20 widely used LLMs reveals current LLMs fail to act safely when considering user‑specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain‑of‑thought, demonstrating its effectiveness in improving user‑specific safety. Our benchmark and code are available at https://github.com/yeonjun‑in/U‑SafeBench.

PaperID: 1499, https://arxiv.org/pdf/2502.14891.pdf GitHub

Authors: Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang

Title: CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection

Abstract:
Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi‑agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi‑agent collaborative perception. Specifically, we project high‑dimensional feature map into the latent space of a powerful pre‑trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model's sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental study on both simulated and real‑world datasets demonstrates that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of the collaborative object detection performance, and exhibits highly desired robustness when the pose and delay information of agents is with high‑level noise. The code is released at https://github.com/HuangZhe885/CoDiff

PaperID: 1500, https://arxiv.org/pdf/2502.14767.pdf GitHub

Authors: Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han

Title: Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

Abstract:
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi‑agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree‑of‑Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine‑grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.

PaperID: 1501, https://arxiv.org/pdf/2502.14706.pdf GitHub

Authors: Daphne Cornelisse, Aarav Pandya, Kevin Joseph, Joseph Suárez, Eugene Vinitsky

Title: Building reliable sim driving agents by scaling self-play

Abstract:
Simulation agents are essential for designing and testing systems that interact with humans, such as autonomous vehicles (AVs). These agents serve various purposes, from benchmarking AV performance to stress‑testing system limits, but all applications share one key requirement: reliability. To enable sound experimentation, a simulation agent must behave as intended. It should minimize actions that may lead to undesired outcomes, such as collisions, which can distort the signal‑to‑noise ratio in analyses. As a foundation for reliable sim agents, we propose scaling self‑play to thousands of scenarios on the Waymo Open Motion Dataset under semi‑realistic limits on human perception and control. Training from scratch on a single GPU, our agents solve almost the full training set within a day. They generalize to unseen test scenes, achieving a 99.8% goal completion rate with less than 0.8% combined collision and off‑road incidents across 10,000 held‑out scenarios. Beyond in‑distribution generalization, our agents show partial robustness to out‑of‑distribution scenes and can be fine‑tuned in minutes to reach near‑perfect performance in such cases. We open‑source the pre‑trained agents and integrate them with a batched multi‑agent simulator. Demonstrations of agent behaviors can be viewed at https://sites.google.com/view/reliable‑sim‑agents, and we open‑source our agents at https://github.com/Emerge‑Lab/gpudrive.

PaperID: 1502, https://arxiv.org/pdf/2502.14563.pdf GitHub

Authors: Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao

Title: Plan-over-Graph: Towards Parallelable LLM Agent Schedule

Abstract:
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. However, challenges remain under‑explored for parallel schedules. This paper introduces a novel paradigm, plan‑over‑graph, in which the model first decomposes a real‑life textual task into executable subtasks and constructs an abstract task graph. The model then understands this task graph as input and generates a plan for parallel execution. To enhance the planning capability of complex, scalable graphs, we design an automated and controllable pipeline to generate synthetic graphs and propose a two‑stage training scheme. Experimental results show that our plan‑over‑graph method significantly improves task performance on both API‑based LLMs and trainable open‑sourced LLMs. By normalizing complex tasks as graphs, our method naturally supports parallel execution, demonstrating global efficiency. The code and data are available at https://github.com/zsq259/Plan‑over‑Graph.

PaperID: 1503, https://arxiv.org/pdf/2502.14529.pdf GitHub

Authors: Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo

Title: CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

Abstract:
Large Language Model‑based Multi‑Agent Systems (LLM‑MASs) have demonstrated remarkable real‑world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM‑MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM‑MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely‑used LLM‑MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open‑ended interactive LLM‑MASs, demonstrating the effectiveness of Corba in complex topology structures and open‑source models. Our code is available at: https://github.com/zhrli324/Corba.

PaperID: 1504, https://arxiv.org/pdf/2502.14345.pdf GitHub

Authors: Yuchen Shi, Siqi Cai, Zihan Xu, Yuei Qin, Gang Li, Hang Shao, Jiawei Chen, Deqing Yang, Ke Li, Xing Sun

Title: FlowAgent: Achieving Compliance and Flexibility for Workflow Agents

Abstract:
The integration of workflows with large language models (LLMs) enables LLM‑based agents to execute predefined procedures, enhancing automation in real‑world applications. Traditional rule‑based methods tend to limit the inherent flexibility of LLMs, as their predefined execution paths restrict the models' action space, particularly when the unexpected, out‑of‑workflow (OOW) queries are encountered. Conversely, prompt‑based methods allow LLMs to fully control the flow, which can lead to diminished enforcement of procedural compliance. To address these challenges, we introduce FlowAgent, a novel agent framework designed to maintain both compliance and flexibility. We propose the Procedure Description Language (PDL), which combines the adaptability of natural language with the precision of code to formulate workflows. Building on PDL, we develop a comprehensive framework that empowers LLMs to manage OOW queries effectively, while keeping the execution path under the supervision of a set of controllers. Additionally, we present a new evaluation methodology to rigorously assess an LLM agent's ability to handle OOW scenarios, going beyond routine flow compliance tested in existing benchmarks. Experiments on three datasets demonstrate that FlowAgent not only adheres to workflows but also effectively manages OOW queries, highlighting its dual strengths in compliance and flexibility. The code is available at https://github.com/Lightblues/FlowAgent.

PaperID: 1505, https://arxiv.org/pdf/2502.14282.pdf GitHub

Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang

Title: PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Abstract:
In the field of MLLM‑based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra‑ and inter‑app workflows. To address these issues, we propose a hierarchical agent framework named PC‑Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision‑making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi‑agent collaboration architecture that decomposes decision‑making processes into Instruction‑Subtask‑Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step‑by‑step decision‑making respectively. Additionally, a Reflection agent is adopted to enable timely bottom‑up error feedback and adjustment. We also introduce a new benchmark PC‑Eval with 25 real‑world complex instructions. Empirical results on PC‑Eval show that our PC‑Agent achieves a 32% absolute improvement of task success rate over previous state‑of‑the‑art methods. The code is available at https://github.com/X‑PLUG/MobileAgent/tree/main/PC‑Agent.

PaperID: 1506, https://arxiv.org/pdf/2502.14276.pdf GitHub

Authors: Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li

Title: STeCa: Step-level Trajectory Calibration for LLM Agent Learning

Abstract:
Large language model (LLM)‑based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long‑horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step‑Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step‑level reward comparison during exploration. It constructs calibrated trajectories using LLM‑driven reflection, enabling agents to learn from improved decision‑making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.

PaperID: 1507, https://arxiv.org/pdf/2502.13959.pdf GitHub

Authors: Reza Averly, Frazier N. Baker, Ian A. Watson, Xia Ning

Title: LIDDIA: Language-based Intelligent Drug Discovery Agent

Abstract:
Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDIA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDIA serves as a low‑cost and highly‑adaptable tool for autonomous drug discovery. We comprehensively examine LIDDIA , demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA

PaperID: 1508, https://arxiv.org/pdf/2502.13920.pdf GitHub

Authors: Xingbo Wang, Janessa Griffith, Daniel A. Adler, Joey Castillo, Tanzeem Choudhury, Fei Wang

Title: Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health

Abstract:
Despite the prevalence of sleep‑tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data‑driven suggestions but may not be feasible and adaptive to real‑life constraints and individual contexts. We present HealthGuru, a novel large language model‑powered chatbot to enhance sleep health through data‑driven, theory‑guided, and adaptive recommendations with conversational behavior change support. HealthGuru's multi‑agent framework integrates wearable device data, contextual information, and a contextual multi‑armed bandit model to suggest tailored sleep‑enhancing activities. The system facilitates natural conversations while incorporating data‑driven insights and theoretical behavior change techniques. Our eight‑week in‑the‑wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics like sleep duration and activity scores, higher quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.

PaperID: 1509, https://arxiv.org/pdf/2502.13897.pdf GitHub

Authors: Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

Title: DataSciBench: An LLM Agent Benchmark for Data Science

Abstract:
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi‑automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM‑based self‑consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task ‑ Function ‑ Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API‑based models, 8 open‑source general models, and 9 open‑source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API‑based models outperform open‑sourced models on all metrics and Deepseek‑Coder‑33B‑Instruct achieves the highest score among open‑sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.

PaperID: 1510, https://arxiv.org/pdf/2502.13778.pdf GitHub

Authors: Jiaqi Li, Xizhong Guo, Yang Zhao, Lvyang Zhang, Lidong Zhai

Title: Poster: SpiderSim: Multi-Agent Driven Theoretical Cybersecurity Simulation for Industrial Digitalization

Abstract:
Rapid industrial digitalization has created intricate cybersecurity demands that necessitate effective validation methods. While cyber ranges and simulation platforms are widely deployed, they frequently face limitations in scenario diversity and creation efficiency. In this paper, we present SpiderSim, a theoretical cybersecurity simulation platform enabling rapid and lightweight scenario generation for industrial digitalization security research. At its core, our platform introduces three key innovations: a structured framework for unified scenario modeling, a multi‑agent collaboration mechanism for automated generation, and modular atomic security capabilities for flexible scenario composition. Extensive implementation trials across multiple industrial digitalization contexts, including marine ranch monitoring systems, validate our platform's capacity for broad scenario coverage with efficient generation processes. Built on solid theoretical foundations and released as open‑source software, SpiderSim facilitates broader research and development in automated security testing for industrial digitalization.

PaperID: 1511, https://arxiv.org/pdf/2502.13170.pdf GitHub

Authors: Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen

Title: Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment

Abstract:
The reasoning abilities are one of the most enigmatic and captivating aspects of large language models (LLMs). Numerous studies are dedicated to exploring and expanding the boundaries of this reasoning capability. However, tasks that embody both reasoning and recall characteristics are often overlooked. In this paper, we introduce such a novel task, code reasoning, to provide a new perspective for the reasoning abilities of LLMs. We summarize three meta‑benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. Our testing on these benchmarks reveals that LLMs continue to struggle with identifying satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by human intricate problem‑solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) Proposing potential hypotheses based on observations and decomposing them; (2) Utilizing tools to validate hypotheses and reflection outcomes; (3) Revising hypothesis in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi‑step reasoning, resulting in performance gains of up to 3×. Finally, we expanded this pipeline by applying it to simulate complex household tasks in real‑world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. We release our code and all of results at https://github.com/TnTWoW/code_reasoning.

PaperID: 1512, https://arxiv.org/pdf/2502.12995.pdf GitHub

Authors: Avinash Kori, Antonio Rago, Francesca Toni

Title: Free Argumentative Exchanges for Explaining Image Classifiers

Abstract:
Deep learning models are powerful image classifiers but their opacity hinders their trustworthiness. Explanation methods for capturing the reasoning process within these classifiers faithfully and in a clear manner are scarce, due to their sheer complexity and size. We provide a solution for this problem by defining a novel method for explaining the outputs of image classifiers with debates between two agents, each arguing for a particular class. We obtain these debates as concrete instances of Free Argumentative eXchanges (FAXs), a novel argumentation‑based multi‑agent framework allowing agents to internalise opinions by other agents differently than originally stated. We define two metrics (consensus and persuasion rate) to assess the usefulness of FAXs as argumentative explanations for image classifiers. We then conduct a number of empirical experiments showing that FAXs perform well along these metrics as well as being more faithful to the image classifiers than conventional, non‑argumentative explanation methods. All our implementations can be found at https://github.com/koriavinash1/FAX.

PaperID: 1513, https://arxiv.org/pdf/2502.12532.pdf GitHub

Authors: Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

Title: CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

Abstract:
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings‑spanning environment, action, and perception‑largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open‑vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA‑EC, the first benchmark dataset featuring 1,412 human‑annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner‑Manager‑Actor (PMA), a novel agent tailored for CityEQA. PMA enables long‑horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub‑tasks, the Manager maintains an object‑centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub‑tasks. Experiments demonstrate that PMA achieves 60.7% of human‑level answering accuracy, significantly outperforming competitive baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.

PaperID: 1514, https://arxiv.org/pdf/2502.12486.pdf GitHub

Authors: Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang

Title: EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

Abstract:
Large Language Models (LLMs) have shown impressive reasoning capabilities in well‑defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real‑world scenarios like business negotiations, which require strategic reasoning‑an ability to navigate dynamic environments and align long‑term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in open‑ended action space and can be plugged into arbitrary LLM agents to motivate goal‑directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi‑turn reinforcement learning (RL),utilizing process rewards and iterative self‑play. Experiments across social and physical domains demonstrate EPO's ability of long‑term goal alignment through enhanced strategic reasoning, achieving state‑of‑the‑art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real‑world applications. Code and data are available at https://github.com/AlibabaResearch/DAMO‑ConvAI/tree/main/EPO.

PaperID: 1515, https://arxiv.org/pdf/2502.12450.pdf GitHub

Authors: Lei Wang, Zheqing Zhang, Xu Chen

Title: Investigating and Extending Homans' Social Exchange Theory with Large Language Model based Agents

Abstract:
Homans' Social Exchange Theory (SET) is widely recognized as a basic framework for understanding the formation and emergence of human civilizations and social structures. In social science, this theory is typically studied based on simple simulation experiments or real‑world human studies, both of which either lack realism or are too expensive to control. In artificial intelligence, recent advances in large language models (LLMs) have shown promising capabilities in simulating human behaviors. Inspired by these insights, we adopt an interdisciplinary research perspective and propose using LLM‑based agents to study Homans' SET. Specifically, we construct a virtual society composed of three LLM agents and have them engage in a social exchange game to observe their behaviors. Through extensive experiments, we found that Homans' SET is well validated in our agent society, demonstrating the consistency between the agent and human behaviors. Building on this foundation, we intentionally alter the settings of the agent society to extend the traditional Homans' SET, making it more comprehensive and detailed. To the best of our knowledge, this paper marks the first step in studying Homans' SET with LLM‑based agents. More importantly, it introduces a novel and feasible research paradigm that bridges the fields of social science and computer science through LLM‑based agents. Code is available at https://github.com/Paitesanshi/SET.

PaperID: 1516, https://arxiv.org/pdf/2502.12110.pdf GitHub GitHub

Authors: Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

Title: A-MEM: Agentic Memory for LLM Agents

Abstract:
While large language model (LLM) agents can effectively use external tools for complex real‑world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution ‑ as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent‑driven decision making, allowing for more adaptive and context‑aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A‑mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A‑mem‑sys.

PaperID: 1517, https://arxiv.org/pdf/2502.11705.pdf GitHub

Authors: Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Title: LLM Agents Making Agent Tools

Abstract:
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi‑step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM‑compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed‑loop self‑correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state‑of‑the‑art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent‑based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.

PaperID: 1518, https://arxiv.org/pdf/2502.11355.pdf GitHub

Authors: Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu

Title: Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents

Abstract:
Large language models (LLMs) are evolving into autonomous decision‑makers, raising concerns about catastrophic risks in high‑stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade‑offs between the agent's Helpful, Harmlessness and Honest (HHH) goals, we build a novel three‑stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We release our code to foster further research.

PaperID: 1519, https://arxiv.org/pdf/2502.11271.pdf GitHub

Authors: Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou

Title: OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Abstract:
Solving complex reasoning tasks may involve visual understanding, domain knowledge retrieval, numerical calculation, and multi‑step reasoning. Existing methods augment large language models (LLMs) with external tools but are restricted to specialized domains, limited tool types, or require additional training data. In this paper, we introduce OctoTools, a training‑free, user‑friendly, and easily extensible multi‑agent framework designed to tackle complex reasoning across diverse domains. OctoTools introduces standardized tool cards to encapsulate tool functionality, a planner for both high‑level and low‑level planning, and an executor to carry out tool usage. We validate OctoTools' generality across 16 diverse tasks (including MathVista, MMLU‑Pro, MedQA, and GAIA‑Text), achieving substantial average accuracy gains of 9.3% over GPT‑4o. Furthermore, OctoTools also outperforms AutoGen, GPT‑Functions, and LangChain by up to 10.6% when given the same set of tools. Through comprehensive analysi, ablations, and robustness tests with compact backbones and noisy tool environments, OctoTools demonstrates advantages in task planning, effective tool usage, and multi‑step problem solving. Code, demos, and visualization are publicly available at https://octotools.github.io/.

PaperID: 1520, https://arxiv.org/pdf/2502.11133.pdf GitHub

Authors: Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, Yiyan Qi

Title: MasRouter: Learning to Route LLMs for Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single‑agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi‑Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high‑performing, cost‑effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high‑performing, achieving a 1.8%～8.2% improvement over the state‑of‑the‑art method on MBPP; (2) economical, reducing overhead by up to 52.07% compared to SOTA methods on HumanEval; and (3) plug‑and‑play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by 17.21%～28.17% via customized routing. The code is available at https://github.com/yanweiyue/masrouter.

PaperID: 1521, https://arxiv.org/pdf/2502.11127.pdf GitHub

Authors: Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang

Title: G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems

Abstract:
Large Language Model (LLM)‑based Multi‑agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem‑solving to autonomous decision‑making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors have raised significant concerns. To address this challenge, we introduce G‑Safeguard, a topology‑guided security lens and treatment for robust LLM‑MAS, which leverages graph neural networks to detect anomalies on the multi‑agent utterance graph and employ topological intervention for attack remediation. Extensive experiments demonstrate that G‑Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large‑scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G‑safeguard.

PaperID: 1522, https://arxiv.org/pdf/2502.11098.pdf GitHub

Authors: Zhao Wang, Sota Moriyama, Wei-Yao Wang, Briti Gangopadhyay, Shingo Takamatsu

Title: Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems

Abstract:
Recent advancements in LLM‑based multi‑agent (LLM‑MA) systems have shown promise, yet significant challenges remain in managing communication and refinement when agents collaborate on complex tasks. In this paper, we propose Talk Structurally, Act Hierarchically (TalkHier), a novel framework that introduces a structured communication protocol for context‑rich exchanges and a hierarchical refinement system to address issues such as incorrect outputs, falsehoods, and biases. TalkHier surpasses various types of SoTA, including inference scaling model (OpenAI‑o1), open‑source multi‑agent models (e.g., AgentVerse), and majority voting strategies on current LLM and single‑agent baselines (e.g., ReAct, GPT4o), across diverse tasks, including open‑domain question answering, domain‑specific selective questioning, and practical advertisement text generation. These results highlight its potential to set a new standard for LLM‑MA systems, paving the way for more effective, adaptable, and collaborative multi‑agent frameworks. The code is available https://github.com/sony/talkhier.

PaperID: 1523, https://arxiv.org/pdf/2502.11075.pdf GitHub

Authors: Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen

Title: Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models

Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface‑level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem‑solving, neglecting fundamental numerical reasoning required in real‑world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes datasets ranging from synthetic number lists to the crawled real‑world data, addressing challenges like long contexts, noise, and multi‑step reasoning. Extensive experiments on state‑of‑the‑art LLMs, including GPT‑4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically‑aware language modeling. The benchmark is released in: https://github.com/TreeAI‑Lab/NumericBench.

PaperID: 1524, https://arxiv.org/pdf/2502.10325.pdf GitHub

Authors: Sanjiban Choudhury

Title: Process Reward Models for LLM Agents: Practical Framework and Directions

Abstract:
We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor‑critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model‑predictive reasoning. We evaluate on ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT‑4o baselines, and analyze test‑time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.

PaperID: 1525, https://arxiv.org/pdf/2502.10233.pdf GitHub

Authors: Laurin Luttmann, Lin Xie

Title: Learning to Solve the Min-Max Mixed-Shelves Picker-Routing Problem via Hierarchical and Parallel Decoding

Abstract:
The Mixed‑Shelves Picker Routing Problem (MSPRP) is a fundamental challenge in warehouse logistics, where pickers must navigate a mixed‑shelves environment to retrieve SKUs efficiently. Traditional heuristics and optimization‑based approaches struggle with scalability, while recent machine learning methods often rely on sequential decision‑making, leading to high solution latency and suboptimal agent coordination. In this work, we propose a novel hierarchical and parallel decoding approach for solving the min‑max variant of the MSPRP via multi‑agent reinforcement learning. While our approach generates a joint distribution over agent actions, allowing for fast decoding and effective picker coordination, our method introduces a sequential action selection to avoid conflicts in the multi‑dimensional action space. Experiments show state‑of‑the‑art performance in both solution quality and inference speed, particularly for large‑scale and out‑of‑distribution instances. Our code is publicly available at http://github.com/LTluttmann/marl4msprp.

PaperID: 1526, https://arxiv.org/pdf/2502.10090.pdf GitHub

Authors: Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao

Title: Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

Abstract:
Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high‑level manual instructions. Our approach leverages a Vision‑Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real‑world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real‑world IKEA furniture items. This application highlights its ability to manage long‑horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.Project Page: https://owensun2004.github.io/Furniture‑Assembly‑Web/

PaperID: 1527, https://arxiv.org/pdf/2502.07350.pdf GitHub

Authors: Jusheng Zhang, Zimeng Huang, Yijia Fan, Ningyuan Liu, Mingyan Li, Zhuojie Yang, Jiawei Yao, Jian Wang, Keze Wang

Title: KABB: Knowledge-Aware Bayesian Bandits for Dynamic Expert Coordination in Multi-Agent Systems

Abstract:
As scaling large language models faces prohibitive costs, multi‑agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduces Knowledge‑Aware Bayesian Bandits (KABB), a novel framework that enhances multi‑agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three‑dimensional knowledge distance model for deep semantic understanding, a dual‑adaptation mechanism for continuous expert optimization, and a knowledge‑aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost‑performance balance, maintaining high performance while keeping computational demands relatively low in multi‑agent coordination.

PaperID: 1528, https://arxiv.org/pdf/2502.07302.pdf GitHub

Authors: Ruining Deng, Yihe Yang, David J. Pisapia, Benjamin Liechty, Junchao Zhu, Juming Xiong, Junlin Guo, Zhengyi Lu, Jiacheng Wang, Xing Yao, Runxuan Yu, Rendong Zhang, Gaurav Rudravaram, Mengmeng Yin, Pinaki Sarder, Haichun Yang, Yuankai Huo, Mert R. Sabuncu

Title: CASC-AI: Consensus-aware Self-corrective Learning for Noise Cell Segmentation

Abstract:
Multi‑class cell segmentation in high‑resolution gigapixel whole slide images (WSIs) is crucial for various clinical applications. However, training such models typically requires labor‑intensive, pixel‑wise annotations by domain experts. Recent efforts have democratized this process by involving lay annotators without medical expertise. However, conventional non‑corrective approaches struggle to handle annotation noise adaptively because they lack mechanisms to mitigate false positives (FP) and false negatives (FN) at both the image‑feature and pixel levels. In this paper, we propose a consensus‑aware self‑corrective AI agent that leverages the Consensus Matrix to guide its learning process. The Consensus Matrix defines regions where both the AI and annotators agree on cell and non‑cell annotations, which are prioritized with stronger supervision. Conversely, areas of disagreement are adaptively weighted based on their feature similarity to high‑confidence consensus regions, with more similar regions receiving greater attention. Additionally, contrastive learning is employed to separate features of noisy regions from those of reliable consensus regions by maximizing their dissimilarity. This paradigm enables the model to iteratively refine noisy labels, enhancing its robustness. Validated on one real‑world lay‑annotated cell dataset and two reasoning‑guided simulated noisy datasets, our method demonstrates improved segmentation performance, effectively correcting FP and FN errors and showcasing its potential for training robust models on noisy datasets. The official implementation and cell annotations are publicly available at https://github.com/ddrrnn123/CASC‑AI.

PaperID: 1529, https://arxiv.org/pdf/2502.07223.pdf GitHub

Authors: Elias Lumer, Pradeep Honaganahalli Basavaraju, Myles Mason, James A. Burke, Vamse Kumar Subbiah

Title: Graph RAG-Tool Fusion

Abstract:
Recent developments in retrieval‑augmented generation (RAG) for selecting relevant tools from a tool knowledge base enable LLM agents to scale their complex tool calling capabilities to hundreds or thousands of external tools, APIs, or agents‑as‑tools. However, traditional RAG‑based tool retrieval fails to capture structured dependencies between tools, limiting the retrieval accuracy of a retrieved tool's dependencies. For example, among a vector database of tools, a "get stock price" API requires a "stock ticker" parameter from a "get stock ticker" API, and both depend on OS‑level internet connectivity tools. In this paper, we address this limitation by introducing Graph RAG‑Tool Fusion, a novel plug‑and‑play approach that combines the strengths of vector‑based retrieval with efficient graph traversal to capture all relevant tools (nodes) along with any nested dependencies (edges) within the predefined tool knowledge graph. We also present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries, each with an average of 6.3 tool dependencies. We demonstrate that Graph RAG‑Tool Fusion achieves absolute improvements of 71.7% and 22.1% over naïve RAG on ToolLinkOS and ToolSandbox benchmarks, respectively (mAP@10). ToolLinkOS dataset is available at https://github.com/EliasLumer/Graph‑RAG‑Tool‑Fusion‑ToolLinkOS

PaperID: 1530, https://arxiv.org/pdf/2502.06994.pdf GitHub

Authors: Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji

Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Abstract:
Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants ‑‑ whether humans or AI agents ‑‑ to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state ‑‑ what we term the out‑of‑sync challenge ‑‑ the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out‑of‑sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out‑of‑sync scenarios in real‑world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama‑3.1 agent <= 3.33% to Claude‑3.5‑Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLM in CSE. However, when collaboration occurs, it positively correlates with out‑of‑sync recovery success. Minimal performance differences in agents' resource‑aware out‑of‑sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource‑efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.

PaperID: 1531, https://arxiv.org/pdf/2502.06787.pdf GitHub

Authors: Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari

Title: Visual Agentic AI for Spatial Reasoning with a Dynamic API

Abstract:
Visual reasoning ‑‑ the ability to interpret the visual world ‑‑ is crucial for embodied agents that operate within three‑dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human‑defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero‑shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab‑caltech.github.io/vadar/

PaperID: 1532, https://arxiv.org/pdf/2502.06440.pdf GitHub

Authors: Shuhao Liao, Weihang Xia, Yuhong Cao, Weiheng Dai, Chengyang He, Wenjun Wu, Guillaume Sartoretti

Title: SIGMA: Sheaf-Informed Geometric Multi-Agent Pathfinding

Abstract:
The Multi‑Agent Path Finding (MAPF) problem aims to determine the shortest and collision‑free paths for multiple agents in a known, potentially obstacle‑ridden environment. It is the core challenge for robotic deployments in large‑scale logistics and transportation. Decentralized learning‑based approaches have shown great potential for addressing the MAPF problems, offering more reactive and scalable solutions. However, existing learning‑based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short‑sighted policies and inefficient cooperation in complex scenarios. There, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross‑dependencies between each other through local consensus and utilize them for tightly cooperative decision‑making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self‑supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state‑of‑the‑art learning‑based MAPF planners, especially in relatively large and complex scenarios, demonstrating its superiority over baselines in various simulations and real‑world robot experiments. The code is available at https://github.com/marmotlab/SIGMA

PaperID: 1533, https://arxiv.org/pdf/2502.05957.pdf GitHub

Authors: Jiabin Tang, Tianyu Fan, Chao Huang

Title: AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents

Abstract:
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision‑making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise ‑ a significant limitation considering that only 0.03 % of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent‑a Fully‑Automated and highly Self‑Developing framework that enables users to create and deploy LLM agents through Natural Language Alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM‑powered Actionable Engine, iii) Self‑Managing File System, and iv) Self‑Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code‑free agent development capabilities, AutoAgent also serves as a versatile multi‑agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi‑agent tasks, surpassing existing state‑of‑the‑art methods. Furthermore, AutoAgent's Retrieval‑Augmented Generation (RAG)‑related capabilities have shown consistently superior performance compared to many alternative LLM‑based solutions.

PaperID: 1534, https://arxiv.org/pdf/2502.05664.pdf GitHub

Authors: Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez

Title: CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

Abstract:
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool‑based iterative debuggers that use compiler or other tool‑based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi‑agent code generation framework that comprehensively addresses the stages of program synthesis‑planning, coding, and debugging‑through a human‑like perception approach. As human verifies their understanding of any algorithms through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step‑by‑step simulation of input/output. Extensive experiments across seven challenging competitive problem‑solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state‑of‑the‑art (pass@1) results‑(HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open‑sourced our framework in this link (https://kagnlp.github.io/codesim.github.io/).

PaperID: 1535, https://arxiv.org/pdf/2502.05453.pdf GitHub

Authors: Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong

Title: LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning

Abstract:
Developing intelligent agents for long‑term cooperation in dynamic open‑world scenarios is a major challenge in multi‑agent systems. Traditional Multi‑agent Reinforcement Learning (MARL) frameworks like centralized training decentralized execution (CTDE) struggle with scalability and flexibility. They require centralized long‑term planning, which is difficult without custom reward functions, and face challenges in processing multi‑modal data. CTDE approaches also assume fixed cooperation strategies, making them impractical in dynamic environments where agents need to adapt and plan independently. To address decentralized multi‑agent cooperation, we propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi‑agent Crafter environment. Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long‑term planning and reasoning. Instead of fully sharing information from all past experiences, DAMCS introduces a multi‑modal memory system organized as a hierarchical knowledge graph and a structured communication protocol to optimize agent cooperation. This allows agents to reason from past interactions and share relevant information efficiently. Experiments on novel multi‑agent open‑world tasks show that DAMCS outperforms both MARL and LLM baselines in task efficiency and collaboration. Compared to single‑agent scenarios, the two‑agent scenario achieves the same goal with 63% fewer steps, and the six‑agent scenario with 74% fewer steps, highlighting the importance of adaptive memory and structured communication in achieving long‑term goals. We publicly release our project at: https://happyeureka.github.io/damcs.

PaperID: 1536, https://arxiv.org/pdf/2502.05227.pdf GitHub

Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury

Title: Robotouille: An Asynchronous Planning Benchmark for LLM Agents

Abstract:
Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long‑horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high‑level task planning, current benchmarks focus primarily on short‑horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long‑horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4‑o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long‑horizon feedback and self‑audit their reasoning during task execution. Code is available at https://github.com/portal‑cornell/robotouille.

PaperID: 1537, https://arxiv.org/pdf/2502.05174.pdf GitHub

Authors: Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang

Title: MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents

Abstract:
Recent research has explored that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool‑retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: either require essential model training resources, lack effectiveness against sophisticated attacks, or harm the normal utilities. We present MELON (Masked re‑Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent's next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re‑executing the agent's trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce the potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON‑Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs. Code is available at https://github.com/kaijiezhu11/MELON.

PaperID: 1538, https://arxiv.org/pdf/2502.04492.pdf GitHub

Authors: Selim Furkan Tekin, Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Ling Liu

Title: Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents

Abstract:
The advancement of LLMs and their accessibility have triggered renewed interest in multi‑agent reinforcement learning as robust and adaptive frameworks for dynamically changing environments. This paper introduces RL‑Focal, a two‑stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL‑agent, which learns to dynamically select an ensemble of small size (m_i) among N LLMs (m_i \ll N) for incoming queries from a user‑defined downstream task i, by maximizing both error‑diversity and reasoning‑performance of the selected ensemble through iterative updates of task‑adaptive rewards and policy. Second, to enable effective fusion of dynamically selected LLMs, we develop the stage‑2 Fusion RL‑agent, which learns to resolve reasoning conflicts from different LLMs and dynamically adapts to different ensemble teams composed by the Decider Agent for different downstream tasks. Third, we introduce the focal diversity metric to better model the error correlations among multiple LLMs, further improving the generalization performance of the Decider Agent, which actively prunes the ensemble combinations. By focal diversity, we enhance performance across tasks by effectively promoting reward‑aware and policy‑adaptive ensemble selection and inference fusion. Extensive evaluations on five benchmarks show that RL‑Focal achieves the performance improvement of 8.48% with an ensemble of small size compared to the best individual LLM in a pool and offers stronger robustness. Code is available at https://github.com/sftekin/rl‑focal

PaperID: 1539, https://arxiv.org/pdf/2502.04306.pdf GitHub

Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam

Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Abstract:
Recent research has leveraged large language model multi‑agent systems for complex problem‑solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high‑performance framework that leverages efficient gradient‑based optimization in a continuous space. ScoreFlow incorporates Score‑DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen‑Verse/ScoreFlow

PaperID: 1540, https://arxiv.org/pdf/2502.04028.pdf GitHub

Authors: Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Title: Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Abstract:
This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi‑agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. Through DMCG, we dynamically compose what we refer to as meta coordination graphs, to learn a more expressive representation of agent interactions and use them to integrate agent information through graph convolutional networks. The goal is to enable an evolving coordination graph to guide effective coordination in cooperative MARL tasks. The graphs are jointly optimized with agents' value functions to learn to implicitly reason about joint actions, facilitating the end‑to‑end learning of interaction representations and coordinated policies. We demonstrate that DMCG consistently achieves state‑of‑the‑art coordination performance and sample efficiency on challenging cooperative tasks, outperforming several prior graph‑based and non‑graph‑based MARL baselines. Through several ablations, we also isolate the impact of individual components in DMCG, showing that the observed improvements are due to the meaningful design choices in this approach. We also include an analysis of its computational complexity to discuss its practicality in real‑world applications. All codes can be found here: \colorbluehttps://github.com/Nikunj‑Gupta/dmcg‑marl.

PaperID: 1541, https://arxiv.org/pdf/2502.03506.pdf GitHub

Authors: Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang, Stefano V. Albrecht

Title: Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

Abstract:
The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi‑agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and converge to suboptimal solutions. While such underestimation is typically attributed to the representational limitations of monotonic structures, we provide a novel perspective by demonstrating that the insufficient sampling of optimal joint actions during exploration is also a critical factor. To address this problem, we propose Optimistic ε‑Greedy Exploration. Our method introduces optimistic action‑value networks that serve as decoupled exploration indicators, which we theoretically prove to converge in probability to the maximum achievable returns. By sampling actions from these distributions with a probability of ε, we effectively increase the selection frequency of high‑return joint actions. Experimental results in various environments reveal that our strategy effectively prevents the algorithm from falling into suboptimal solutions and significantly improves final returns, win rates, and convergence speeds compared to other enhanced algorithms. Our code has been open‑sourced at https://github.com/qxqxtxdy/OptimisticExploration.

PaperID: 1542, https://arxiv.org/pdf/2502.03369.pdf GitHub

Authors: Zhenghao Peng, Wenjie Mo, Chenda Duan, Quanyi Li, Bolei Zhou

Title: Learning from Active Human Involvement through Proxy Value Propagation

Abstract:
Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from human brings safety and AI alignment to the learning process. In this work, we propose a new reward‑free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents, wherein state‑action pairs in the human demonstration are labeled with high values, while those agents' actions that are intervened receive low values. Through the TD‑learning framework, labeled values of demonstrated state‑action pairs are further propagated to other unlabeled data generated from agents' exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human‑in‑the‑loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https://metadriverse.github.io/pvp

PaperID: 1543, https://arxiv.org/pdf/2502.02844.pdf GitHub

Authors: Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han

Title: Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

Abstract:
Traditional robust methods in multi‑agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack‑Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering systemwide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL. Our code is available at https://github.com/sunwoolee0504/WALL.

PaperID: 1544, https://arxiv.org/pdf/2502.02673.pdf GitHub

Authors: Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang

Title: MedRAX: Medical Reasoning Agent for Chest X-ray

Abstract:
Chest X‑rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state‑of‑the‑art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state‑of‑the‑art performance compared to both open‑source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code have been publicly available at https://github.com/bowang‑lab/MedRAX

PaperID: 1545, https://arxiv.org/pdf/2502.02067.pdf GitHub

Authors: Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna

Title: AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement

Abstract:
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task‑, agent‑, or domain‑specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website§: https://sssshivvvv.github.io/adaptbot/

PaperID: 1546, https://arxiv.org/pdf/2502.01456.pdf GitHub

Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

Title: Process Reinforcement through Implicit Rewards

Abstract:
Dense process rewards have proven a more effective alternative to the sparse outcome‑level rewards in the inference‑time scaling of large language models (LLMs), particularly in tasks requiring complex multi‑step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine‑grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high‑quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5‑Math‑7B‑Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus‑2‑7B‑PRIME, surpasses Qwen2.5‑Math‑7B‑Instruct on seven reasoning benchmarks with 10% of its training data.

PaperID: 1547, https://arxiv.org/pdf/2502.00757.pdf GitHub

Authors: J Rosser, Jakob Foerster

Title: AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

Abstract:
Scaffolding Large Language Models (LLMs) into multi‑agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi‑objective self‑improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In "blue" mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In "red" mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi‑agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/jrosseruk/AgentBreeder.

PaperID: 1548, https://arxiv.org/pdf/2502.00392.pdf GitHub

Authors: Zhichao Sun, Yepeng Liu, Zhiling Su, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, Yongchao Xu

Title: RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes

Abstract:
Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground‑level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi‑scale and small‑scale target detection; 2) multi‑target and no‑target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi‑agent system), a semi‑automated annotation tool for REC tasks. RDAgent ensures high‑quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi‑target and no‑target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state‑of‑the‑art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code are be publicly at https://github.com/sunzc‑sunny/refdrone.

PaperID: 1549, https://arxiv.org/pdf/2502.00345.pdf GitHub

Authors: Yurui Li, Yuxuan Chen, Li Zhang, Shijian Li, Gang Pan

Title: The Composite Task Challenge for Cooperative Multi-Agent Reinforcement Learning

Abstract:
The significant role of division of labor (DOL) in promoting cooperation is widely recognized in real‑world applications.Many cooperative multi‑agent reinforcement learning (MARL) methods have incorporated the concept of DOL to improve cooperation among agents.However, the tasks used in existing testbeds typically correspond to tasks where DOL is often not a necessary feature for achieving optimal policies.Additionally, the full utilize of DOL concept in MARL methods remains unrealized due to the absence of appropriate tasks.To enhance the generality and applicability of MARL methods in real‑world scenarios, there is a necessary to develop tasks that demand multi‑agent DOL and cooperation.In this paper, we propose a series of tasks designed to meet these requirements, drawing on real‑world rules as the guidance for their design.We guarantee that DOL and cooperation are necessary condition for completing tasks and introduce three factors to expand the diversity of proposed tasks to cover more realistic situations.We evaluate 10 cooperative MARL methods on the proposed tasks.The results indicate that all baselines perform poorly on these tasks.To further validate the solvability of these tasks, we also propose simplified variants of proposed tasks.Experimental results show that baselines are able to handle these simplified variants, providing evidence of the solvability of the proposed tasks.The source files is available at https://github.com/Yurui‑Li/CTC.

PaperID: 1550, https://arxiv.org/pdf/2501.19204.pdf GitHub

Authors: Valtteri Ala-Salmi, Zeeshan Rasheed, Abdul Malik Sami, Zheying Zhang, Kai-Kristian Kemell, Jussi Rasku, Shahbaz Siddeeq, Mika Saari, Pekka Abrahamsson

Title: Autonomous Legacy Web Application Upgrades Using a Multi-Agent System

Abstract:
The use of Large Language Models (LLMs) for autonomous code generation is gaining attention in emerging technologies. As LLM capabilities expand, they offer new possibilities such as code refactoring, security enhancements, and legacy application upgrades. Many outdated web applications pose security and reliability challenges, yet companies continue using them due to the complexity and cost of upgrades. To address this, we propose an LLM‑based multi‑agent system that autonomously upgrades legacy web applications to the latest versions. The system distributes tasks across multiple phases, updating all relevant files. To evaluate its effectiveness, we employed Zero‑Shot Learning (ZSL) and One‑Shot Learning (OSL) prompts, applying identical instructions in both cases. The evaluation involved updating view files and measuring the number and types of errors in the output. For complex tasks, we counted the successfully met requirements. The experiments compared the proposed system with standalone LLM execution, repeated multiple times to account for stochastic behavior. Results indicate that our system maintains context across tasks and agents, improving solution quality over the base model in some cases. This study provides a foundation for future model implementations in legacy code updates. Additionally, findings highlight LLMs' ability to update small outdated files with high precision, even with basic prompts. The source code is publicly available on GitHub: https://github.com/alasalm1/Multi‑agent‑pipeline.

PaperID: 1551, https://arxiv.org/pdf/2501.18616.pdf GitHub GitHub

Authors: Xiangbo Gao, Runsheng Xu, Jiachen Li, Ziran Wang, Zhiwen Fan, Zhengzhong Tu

Title: STAMP: Scalable Task And Model-agnostic Collaborative Perception

Abstract:
Perception is crucial for autonomous driving, but single‑agent perception is often constrained by sensors' physical limitations, leading to degraded performance under severe occlusion, adverse weather conditions, and when detecting distant objects. Multi‑agent collaborative perception offers a solution, yet challenges arise when integrating heterogeneous agents with varying model architectures. To address these challenges, we propose STAMP, a scalable task‑ and model‑agnostic, collaborative perception pipeline for heterogeneous agents. STAMP utilizes lightweight adapter‑reverter pairs to transform Bird's Eye View (BEV) features between agent‑specific and shared protocol domains, enabling efficient feature sharing and fusion. This approach minimizes computational overhead, enhances scalability, and preserves model security. Experiments on simulated and real‑world datasets demonstrate STAMP's comparable or superior accuracy to state‑of‑the‑art models with significantly reduced computational costs. As a first‑of‑its‑kind task‑ and model‑agnostic framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy. Our project page is at https://xiangbogaobarry.github.io/STAMP and the code is available at https://github.com/taco‑group/STAMP.

PaperID: 1552, https://arxiv.org/pdf/2501.18160.pdf GitHub

Authors: Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang

Title: RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing

Abstract:
Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user‑friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large‑scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM‑based agent, RepoAudit, designed to perform autonomous repository‑level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data‑flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data‑flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real‑world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and 2.54 per project. Also, it detects 185 new bugs in high‑profile projects, among which 174 have been confirmed or fixed. We have open‑sourced RepoAudit at https://github.com/PurCL/RepoAudit.

PaperID: 1553, https://arxiv.org/pdf/2501.18096.pdf GitHub

Authors: Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

Title: LLMs can see and hear without any training

Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training‑free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi‑step reasoning, MILS prompts the LLM to generate candidate outputs, each of which are scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task‑specific data. In particular, we establish a new state‑of‑the‑art on emergent zero‑shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text‑to‑image generation, and even edit prompts for style transfer! Finally, being a gradient‑free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross‑modal arithmetic.

PaperID: 1554, https://arxiv.org/pdf/2501.17546.pdf GitHub

Authors: Gaole He, Nilay Aishwarya, Ujwal Gadiraju

Title: Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant

Abstract:
Explainable artificial intelligence (XAI) methods are being proposed to help interpret and understand how AI systems reach specific predictions. Inspired by prior work on conversational user interfaces, we argue that augmenting existing XAI methods with conversational user interfaces can increase user engagement and boost user understanding of the AI system. In this paper, we explored the impact of a conversational XAI interface on users' understanding of the AI system, their trust, and reliance on the AI system. In comparison to an XAI dashboard, we found that the conversational XAI interface can bring about a better understanding of the AI system among users and higher user trust. However, users of both the XAI dashboard and conversational XAI interfaces showed clear overreliance on the AI system. Enhanced conversations powered by large language model (LLM) agents amplified over‑reliance. Based on our findings, we reason that the potential cause of such overreliance is the illusion of explanatory depth that is concomitant with both XAI interfaces. Our findings have important implications for designing effective conversational XAI interfaces to facilitate appropriate reliance and improve human‑AI collaboration. Code can be found at https://github.com/delftcrowd/IUI2025_ConvXAI

PaperID: 1555, https://arxiv.org/pdf/2501.16803.pdf GitHub

Authors: Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun

Title: RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception

Abstract:
Cooperative perception enhances autonomous driving by leveraging Vehicle‑to‑Everything (V2X) communication for multi‑agent sensor fusion. However, most existing methods rely on single‑modal data sharing, limiting fusion performance, particularly in heterogeneous sensor settings involving both LiDAR and cameras across vehicles and roadside units (RSUs). To address this, we propose Radian Glue Attention (RG‑Attn), a lightweight and generalizable cross‑modal fusion module that unifies intra‑agent and inter‑agent fusion via transformation‑based coordinate alignment and a unified sampling/inversion strategy. RG‑Attn efficiently aligns features through a radian‑based attention constraint, operating column‑wise on geometrically consistent regions to reduce overhead and preserve spatial coherence, thereby enabling accurate and robust fusion. Building upon RG‑Attn, we propose three cooperative architectures. The first, Paint‑To‑Puzzle (PTP), prioritizes communication efficiency but assumes all agents have LiDAR, optionally paired with cameras. The second, Co‑Sketching‑Co‑Coloring (CoS‑CoCo), offers maximal flexibility, supporting any sensor setup (e.g., LiDAR‑only, camera‑only, or both) and enabling strong cross‑modal generalization for real‑world deployment. The third, Pyramid‑RG‑Attn Fusion (PRGAF), aims for peak detection accuracy with the highest computational overhead. Extensive evaluations on simulated and real‑world datasets show our framework delivers state‑of‑the‑art detection accuracy with high flexibility and efficiency. GitHub Link: https://github.com/LantaoLi/RG‑Attn

PaperID: 1556, https://arxiv.org/pdf/2501.16609.pdf GitHub

Authors: Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig

Title: CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

Abstract:
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real‑world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human‑agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human‑agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html

PaperID: 1557, https://arxiv.org/pdf/2501.16098.pdf GitHub

Authors: Eslam Eldeeb, Hirley Alves

Title: Meta-Offline and Distributional Multi-Agent RL for Risk-Aware Decision-Making

Abstract:
Mission critical applications, such as UAV‑assisted IoT networks require risk‑aware decision‑making under dynamic topologies and uncertain channels. We propose meta‑conservative quantile regression (M‑CQR), a meta‑offline distributional MARL algorithm that integrates conservative Q‑learning (CQL) for safe offline learning, quantile regression DQN (QR‑DQN) for risk‑sensitive value estimation, and model‑agnostic meta‑learning (MAML) for rapid adaptation. Two variants are developed: meta‑independent CQR (M‑I‑CQR) and meta‑CTDE‑CQR. In a UAV‑based communication scenario, M‑CTDE‑CQR achieves up to 50% faster convergence and outperforms baseline MARL methods, offering improved scalability, robustness, and adaptability for risk‑sensitive decision‑making. Code is available at https://github.com/Eslam211/MA_Meta_ODRL

PaperID: 1558, https://arxiv.org/pdf/2501.15791.pdf GitHub

Authors: Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu

Title: Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs

Abstract:
Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine‑grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision‑making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi‑Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine‑grained, bidirectional subgraph embeddings with LLM‑based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi‑round discussions, thereby improving error detection accuracy and ensuring a transparent decision‑making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state‑of‑the‑art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain‑specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse‑ElEvEn/MAKGED.

PaperID: 1559, https://arxiv.org/pdf/2501.15749.pdf GitHub

Authors: Tianfu Wang, Yi Zhan, Jianxun Lian, Zhengyu Hu, Nicholas Jing Yuan, Qi Zhang, Xing Xie, Hui Xiong

Title: LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System

Abstract:
Intelligent Tutoring Systems (ITSs) have revolutionized education by offering personalized learning experiences. However, as goal‑oriented learning, which emphasizes efficiently achieving specific objectives, becomes increasingly important in professional contexts, existing ITSs often struggle to deliver this type of targeted learning experience. In this paper, we propose GenMentor, an LLM‑powered multi‑agent framework designed to deliver goal‑oriented, personalized learning within ITS. GenMentor begins by accurately mapping learners' goals to required skills using a fine‑tuned LLM trained on a custom goal‑to‑skill dataset. After identifying the skill gap, it schedules an efficient learning path using an evolving optimization approach, driven by a comprehensive and dynamic profile of learners' multifaceted status. Additionally, GenMentor tailors learning content with an exploration‑drafting‑integration mechanism to align with individual learner needs. Extensive automated and human evaluations demonstrate GenMentor's effectiveness in learning guidance and content quality. Furthermore, we have deployed it in practice and also implemented it as an application. Practical human study with professional learners further highlights its effectiveness in goal alignment and resource targeting, leading to enhanced personalization. Supplementary resources are available at https://github.com/GeminiLight/gen‑mentor.

PaperID: 1560, https://arxiv.org/pdf/2501.15602.pdf GitHub

Authors: Zeyu Gan, Yun Liao, Yong Liu

Title: Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning

Abstract:
Test‑time scaling, which is also often referred to as slow‑thinking, has been demonstrated to enhance multi‑step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow‑thinking methods remain poorly understood. This paper explores the mechanisms of external slow‑thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow‑thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow‑thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open‑source our code at https://github.com/ZyGan1999/Snowball‑Errors‑and‑Probability.

PaperID: 1561, https://arxiv.org/pdf/2501.14654.pdf GitHub

Authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen

Title: MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents

Abstract:
Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient‑specific clinically‑derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR‑compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure used in modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated agent‑oriented benchmark that current state‑of‑the‑art LLMs exhibit some ability to succeed at. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, there is still substantial space for improvement which gives the community a next direction to optimize. Furthermore, there is significant variation in performance across task categories. MedAgentBench establishes this and is publicly available at https://github.com/stanfordmlgroup/MedAgentBench , offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.

PaperID: 1562, https://arxiv.org/pdf/2501.14225.pdf GitHub

Authors: Rong Ye, Yongxin Zhang, Yikai Zhang, Haoyu Kuang, Zhongyu Wei, Peng Sun

Title: Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game

Abstract:
Achieving Artificial General Intelligence (AGI) requires AI agents that can not only make stratigic decisions but also engage in flexible and meaningful communication. Inspired by Wittgenstein's language game theory in Philosophical Investigations, we propose that language agents can learn through in‑context interaction rather than traditional multi‑stage frameworks that separate decision‑making from language expression. Using Werewolf, a social deduction game that tests language understanding, strategic interaction, and adaptability, we develop the Multi‑agent Kahneman & Tversky's Optimization (MaKTO). MaKTO engages diverse models in extensive gameplay to generate unpaired desirable and unacceptable responses, then employs KTO to refine the model's decision‑making process. In 9‑player Werewolf games, MaKTO achieves a 61% average win rate across various models, outperforming GPT‑4o and two‑stage RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably, MaKTO also demonstrates human‑like performance, winning 60% against expert players and showing only 49% detectability in Turing‑style blind tests.

PaperID: 1563, https://arxiv.org/pdf/2501.13381.pdf GitHub

Authors: Zhiyuan Weng, Guikun Chen, Wenguan Wang

Title: Do as We Do, Not as You Think: the Conformity of Large Language Models

Abstract:
Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi‑agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem‑solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM‑driven multi‑agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity‑oriented benchmark, featuring reasoning‑intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity's impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs' conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically‑aligned collaborative AI systems. Our benchmark and code are available at BenchForm.

PaperID: 1564, https://arxiv.org/pdf/2501.13333.pdf GitHub

Authors: Joshua Park, Yongfeng Zhang

Title: AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback

Abstract:
Multi‑agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence‑BERT (SBERT) encoder model. On test data, we are able to achieve a top‑1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine‑tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open‑sourced to the public along with the code for AgentRec recommendation system at https://github.com/joshprk/agentrec.

PaperID: 1565, https://arxiv.org/pdf/2501.13200.pdf GitHub

Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

Title: SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Abstract:
Multi‑agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi‑agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi‑agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi‑Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning‑based algorithms. These results suggest that incorporating shared recurrent memory into the transformer‑based architectures can enhance coordination in decentralized multi‑agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.

PaperID: 1566, https://arxiv.org/pdf/2501.11733.pdf GitHub

Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji

Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Abstract:
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)‑based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real‑world human needs, struggle with reasoning‑intensive and long‑horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile‑Agent‑E, a hierarchical multi‑agent framework capable of self‑evolution through past experience. By hierarchical, we mean an explicit separation of high‑level planning and low‑level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents‑‑Perceptor, Operator, Action Reflector, and Notetaker‑‑which handle fine‑grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile‑Agent‑E also features a novel self‑evolution module which maintains a persistent long‑term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile‑Eval‑E, a new benchmark featuring complex mobile tasks requiring long‑horizon, multi‑app interactions. Empirical results show that Mobile‑Agent‑E achieves a 22% absolute improvement over previous state‑of‑the‑art approaches across three foundation model backbones. Project page: https://x‑plug.github.io/MobileAgent.

PaperID: 1567, https://arxiv.org/pdf/2501.11260.pdf GitHub GitHub

Authors: Tuo Feng, Wenguan Wang, Yi Yang

Title: A Survey of World Models for Autonomous Driving

Abstract:
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision‑making. World models have emerged as a linchpin technology, offering high‑fidelity representations of the driving environment that integrate multi‑sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three‑tiered taxonomy: (i) Generation of Future Physical World, covering Image‑, BEV‑, OG‑, and PC‑based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule‑driven and learning‑based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (ii) Interaction between Prediction and Planning, achieving multi‑agent collaborative decision‑making through latent space diffusion and memory‑augmented architectures. The study further analyzes training paradigms, including self‑supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self‑supervised representation learning, multimodal fusion, and advanced simulation to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.

PaperID: 1568, https://arxiv.org/pdf/2501.11067.pdf GitHub

Authors: Elad Levi, Ilan Kadar

Title: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Abstract:
Large Language Models (LLMs) are transforming artificial intelligence, evolving into task‑oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi‑turn dialogues, integrate domain‑specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real‑world interactions. We introduce IntellAgent, a scalable, open‑source multi‑agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy‑driven graph modeling, realistic event generation, and interactive user‑agent simulations. This innovative approach provides fine‑grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse‑grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi‑policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph‑based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open‑source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai‑ai/intellagent

PaperID: 1569, https://arxiv.org/pdf/2501.10593.pdf GitHub

Authors: Andrey Risukhin, Kavel Rao, Ben Caffee, Alan Fan

Title: ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance

Abstract:
Autonomous agents' interactions with humans are increasingly focused on adapting to their changing preferences in order to improve assistance in real‑world tasks. Effective agents must learn to accurately infer human goals, which are often hidden, to collaborate well. However, existing Multi‑Agent Reinforcement Learning (MARL) environments lack the necessary attributes required to rigorously evaluate these agents' learning capabilities. To this end, we introduce ColorGrid, a novel MARL environment with customizable non‑stationarity, asymmetry, and reward structure. We investigate the performance of Independent Proximal Policy Optimization (IPPO), a state‑of‑the‑art (SOTA) MARL algorithm, in ColorGrid and find through extensive ablations that, particularly with simultaneous non‑stationary and asymmetric goals between a ``leader'' agent representing a human and a ``follower'' assistant agent, ColorGrid is unsolved by IPPO. To support benchmarking future MARL algorithms, we release our environment code, model checkpoints, and trajectory visualizations at https://github.com/andreyrisukhin/ColorGrid.

PaperID: 1570, https://arxiv.org/pdf/2501.10120.pdf GitHub

Authors: Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

Title: PaSa: An LLM Agent for Comprehensive Academic Paper Search

Abstract:
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine‑grained academic queries and corresponding papers sourced from top‑tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real‑world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT‑4o for paraphrased queries, ChatGPT (search‑enabled GPT‑4o), GPT‑o1, and PaSa‑GPT‑4o (PaSa implemented by prompting GPT‑4o). Notably, PaSa‑7B surpasses the best Google‑based baseline, Google with GPT‑4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa‑GPT‑4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.

PaperID: 1571, https://arxiv.org/pdf/2501.07834.pdf GitHub

Authors: Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu

Title: Flow: Modularized Agentic Workflow Automation

Abstract:
Multi‑agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. An effective workflow adjustment is crucial in real‑world scenarios, as the initial plan must adjust to unforeseen challenges and changing conditions in real time to ensure the efficient execution of complex tasks. In this paper, we define workflows as an activity‑on‑vertex (AOV) graph, which allows continuous workflow refinement by LLM agents through dynamic subtask allocation adjustment based on historical performance and previous AOVs. To further enhance framework performance, we emphasize modularity in workflow design based on evaluating parallelism and dependency complexity. With this design, our proposed multi‑agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance. Empirical results across various practical tasks demonstrate significant improvements in the efficiency of multi‑agent frameworks through dynamic workflow refinement and modularization. The code is available at: https://github.com/tmllab/2025_ICLR_FLOW.

PaperID: 1572, https://arxiv.org/pdf/2501.07554.pdf GitHub GitHub

Authors: Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang

Title: SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

Abstract:
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST‑EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision‑Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST‑EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST‑EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST‑EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the \hrefhttps://github.com/custommetrics‑sst/SST_CustomEvaluationMetrics.gitGitHub Repository.

PaperID: 1573, https://arxiv.org/pdf/2501.07278.pdf GitHub GitHub

Authors: Junhao Zheng, Chengming Shi, Xidi Cai, Qiuke Li, Duzhen Zhang, Chenxing Li, Dong Yu, Qianli Ma

Title: Lifelong Learning of Large Language Model based Agents: A Roadmap

Abstract:
Lifelong learning, also known as continual or incremental learning, is a crucial component for advancing Artificial General Intelligence (AGI) by enabling systems to continuously adapt in dynamic environments. While large language models (LLMs) have demonstrated impressive capabilities in natural language processing, existing LLM agents are typically designed for static systems and lack the ability to adapt over time in response to new challenges. This survey is the first to systematically summarize the potential techniques for incorporating lifelong learning into LLM‑based agents. We categorize the core components of these agents into three modules: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for grounded interactions with the dynamic environment. We highlight how these pillars collectively enable continuous adaptation, mitigate catastrophic forgetting, and improve long‑term performance. This survey provides a roadmap for researchers and practitioners working to develop lifelong learning capabilities in LLM agents, offering insights into emerging trends, evaluation metrics, and application scenarios. Relevant literature and resources are available at \hrefthis urlhttps://github.com/qianlima‑lab/awesome‑lifelong‑llm‑agent.

PaperID: 1574, https://arxiv.org/pdf/2501.06781.pdf GitHub

Authors: Shaw Walters, Sam Gao, Shakker Nerd, Feng Da, Warren Williams, Ting-Chien Meng, Amie Chow, Hunter Han, Frank He, Allen Zhang, Ming Wu, Timothy Shen, Maxwell Hu, Jerry Yan

Title: Eliza: A Web3 friendly AI Agent Operating System

Abstract:
AI Agent, powered by large language models (LLMs) as its cognitive core, is an intelligent agentic system capable of autonomously controlling and determining the execution paths under user's instructions. With the burst of capabilities of LLMs and various plugins, such as RAG, text‑to‑image/video/3D, etc., the potential of AI Agents has been vastly expanded, with their capabilities growing stronger by the day. However, at the intersection between AI and web3, there is currently no ideal agentic framework that can seamlessly integrate web3 applications into AI agent functionalities. In this paper, we propose Eliza, the first open‑source web3‑friendly Agentic framework that makes the deployment of web3 applications effortless. We emphasize that every aspect of Eliza is a regular Typescript program under the full control of its user, and it seamlessly integrates with web3 (i.e., reading and writing blockchain data, interacting with smart contracts, etc.). Furthermore, we show how stable performance is achieved through the pragmatic implementation of the key components of Eliza's runtime. Our code is publicly available at https://github.com/ai16z/eliza.

PaperID: 1575, https://arxiv.org/pdf/2501.05752.pdf GitHub

Authors: Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok

Title: Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models

Abstract:
Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi‑step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree‑based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search‑based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral. Our code is available at https://github.com/ml‑postech/SEAG‑semantic‑exploration‑with‑adaptive‑gating .

PaperID: 1576, https://arxiv.org/pdf/2501.05639.pdf GitHub

Authors: Joe Eappen, Zikang Xiong, Dipam Patel, Aniket Bera, Suresh Jagannathan

Title: Scaling Safe Multi-Agent Control for Signal Temporal Logic Specifications

Abstract:
Existing methods for safe multi‑agent control using logic specifications like Signal Temporal Logic (STL) often face scalability issues. This is because they rely either on single‑agent perspectives or on Mixed Integer Linear Programming (MILP)‑based planners, which are complex to optimize. These methods have proven to be computationally expensive and inefficient when dealing with a large number of agents. To address these limitations, we present a new scalable approach to multi‑agent control in this setting. Our method treats the relationships between agents using a graph structure rather than in terms of a single‑agent perspective. Moreover, it combines a multi‑agent collision avoidance controller with a Graph Neural Network (GNN) based planner, models the system in a decentralized fashion, and trains on STL‑based objectives to generate safe and efficient plans for multiple agents, thereby optimizing the satisfaction of complex temporal specifications while also facilitating multi‑agent collision avoidance. Our experiments show that our approach significantly outperforms existing methods that use a state‑of‑the‑art MILP‑based planner in terms of scalability and performance. The project website is https://jeappen.com/mastl‑gcbf‑website/ and the code is at https://github.com/jeappen/mastl‑gcbf .

PaperID: 1577, https://arxiv.org/pdf/2501.04575.pdf GitHub

Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu

Title: InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Abstract:
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi‑step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM‑based GUI Agent trained with a two‑stage supervised fine‑tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation‑reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at \urlhttps://github.com/Reallm‑Labs/InfiGUIAgent.

PaperID: 1578, https://arxiv.org/pdf/2501.04341.pdf GitHub

Authors: Dong-Hai Zhu, Yu-Jie Xiong, Jia-Chen Zhang, Xi-Jiong Xie, Chun-Ming Xia

Title: Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting

Abstract:
Chain‑of‑Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi‑step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre‑prompting method called Iterative Summarization Pre‑Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest‑ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre‑prompting, offering flexible integration into diverse reasoning frameworks. The code is available at https://github.com/zdhgreat/ISP‑2.

PaperID: 1579, https://arxiv.org/pdf/2501.03226.pdf GitHub

Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang

Title: BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Abstract:
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi‑step reasoning and can be further enhanced with well‑designed in‑context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine‑grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address this issue, we propose BoostStep, a method that enhances reasoning accuracy through step‑aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first‑try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain‑of‑thought (CoT) and tree search algorithms, refining both candidate selection and decision‑making. Empirical results show that BoostStep improves GPT‑4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few‑shot learning's 1.2%. Moreover, it can achieve an additional 7.5% gain combined with tree search. Surprisingly, it enhances state‑of‑the‑art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek‑R1‑671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.

PaperID: 1580, https://arxiv.org/pdf/2501.02977.pdf GitHub

Authors: Chuanbo Hua, Federico Berto, Jiwoo Son, Seunghyun Kang, Changhyun Kwon, Jinkyoo Park

Title: CAMP: Collaborative Attention Model with Profiles for Vehicle Routing Problems

Abstract:
The profiled vehicle routing problem (PVRP) is a generalization of the heterogeneous capacitated vehicle routing problem (HCVRP) in which the objective is to optimize the routes of vehicles to serve client demands subject to different vehicle profiles, with each having a preference or constraint on a per‑client basis. While existing learning methods have shown promise for solving the HCVRP in real‑time, no learning method exists to solve the more practical and challenging PVRP. In this paper, we propose a Collaborative Attention Model with Profiles (CAMP), a novel approach that learns efficient solvers for PVRP using multi‑agent reinforcement learning. CAMP employs a specialized attention‑based encoder architecture to embed profiled client embeddings in parallel for each vehicle profile. We design a communication layer between agents for collaborative decision‑making across profiled embeddings at each decoding step and a batched pointer mechanism to attend to the profiled embeddings to evaluate the likelihood of the next actions. We evaluate CAMP on two variants of PVRPs: PVRP with preferences, which explicitly influence the reward function, and PVRP with zone constraints with different numbers of agents and clients, demonstrating that our learned solvers achieve competitive results compared to both classical state‑of‑the‑art neural multi‑agent models in terms of solution quality and computational efficiency. We make our code openly available at https://github.com/ai4co/camp.

PaperID: 1581, https://arxiv.org/pdf/2501.00309.pdf GitHub

Authors: Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, Jiliang Tang

Title: Retrieval-Augmented Generation with Graphs (GraphRAG)

Abstract:
Retrieval‑augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real‑world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural‑embedding space, the uniqueness of graph‑structured data, such as diverse‑formatted and domain‑specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up‑to‑date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up‑to‑date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross‑disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph‑RAG/GraphRAG/.

PaperID: 1582, https://arxiv.org/pdf/2412.20138.pdf GitHub GitHub

Authors: Yijia Xiao, Edward Sun, Di Luo, Wei Wang

Title: TradingAgents: Multi-Agents LLM Financial Trading Framework

Abstract:
Significant progress has been made in automated problem‑solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single‑agent systems handling specific tasks or multi‑agent frameworks independently gathering data. However, the multi‑agent systems' potential to replicate real‑world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM‑powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi‑agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.

PaperID: 1583, https://arxiv.org/pdf/2412.20127.pdf GitHub GitHub

Authors: Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jiahan Ren, Yan Zhang, Jian Wu, Hongwei Wang, Zuozhu Liu

Title: M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation

Abstract:
Recent advancements in large language models (LLMs) have given rise to the LLM‑as‑a‑judge paradigm, showcasing their potential to deliver human‑like judgments. However, in the field of machine translation (MT) evaluation, current LLM‑as‑a‑judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi‑Agent Debate (M‑MAD), a systematic LLM‑based multi‑agent framework for advanced LLM‑as‑a‑judge MT evaluation. Our findings demonstrate that M‑MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine‑grained assessments; (2) employing multi‑agent debates to harness the collaborative reasoning capabilities of LLMs; (3) synthesizing dimension‑specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M‑MAD not only outperforms all existing LLM‑as‑a‑judge methods but also competes with state‑of‑the‑art reference‑based automatic metrics, even when powered by a suboptimal model like GPT‑4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective for LLM‑as‑a‑judge paradigm. Our code and data are publicly available at https://github.com/SU‑JIAYUAN/M‑MAD.

PaperID: 1584, https://arxiv.org/pdf/2412.20005.pdf GitHub

Authors: Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen

Title: OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

Abstract:
We introduce OneKE, a dockerized schema‑guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open‑sourced the Code at https://github.com/zjunlp/OneKE and released a Video at http://oneke.openkg.cn/demo.mp4.

PaperID: 1585, https://arxiv.org/pdf/2412.19770.pdf GitHub

Authors: Le Chen, Bin Lei, Dunzhi Zhou, Pei-Hung Lin, Chunhua Liao, Caiwen Ding, Ali Jannesari

Title: Fortran2CPP: Automating Fortran-to-C++ Translation using LLMs via Multi-Turn Dialogue and Dual-Agent Integration

Abstract:
Translating legacy Fortran code into C++ is a crucial step in modernizing high‑performance computing (HPC) applications. However, the scarcity of high‑quality, parallel Fortran‑to‑C++ datasets and the limited domain‑specific expertise in large language models (LLMs) present significant challenges for automated translation. In this paper, we introduce Fortran2CPP, a multi‑turn dialogue dataset generated by a novel LLM agent‑based approach that integrates a dual‑LLM Questioner‑Solver module to enhance translation accuracy. Our dataset comprises 11.7k dialogues capturing iterative feedback‑decision workflows including code translation, compilation, execution, unit testing, and error‑fixing. Using this dataset, we fine‑tune several open‑weight LLMs and achieve up to a 3.31x improvement in CodeBLEU scores and a 92% increase in compilation success rate, demonstrating enhanced syntactic accuracy and functional reliability. Our findings highlight the value of dialogue‑based LLM training for complex code translation tasks. The dataset and model have been open‑sourced and are available on our public GitHub repository\footnote\urlhttps://github.com/HPC‑Fortran2CPP/Fortran2Cpp.

PaperID: 1586, https://arxiv.org/pdf/2412.19498.pdf GitHub

Authors: Zexun Jiang, Yafang Shi, Maoxu Li, Hongjiang Xiao, Yunxiao Qin, Qinglan Wei, Ye Wang, Yuan Zhang

Title: Casevo: A Cognitive Agents and Social Evolution Simulator

Abstract:
In this paper, we introduce a multi‑agent simulation framework Casevo (Cognitive Agents and Social Evolution Simulator), that integrates large language models (LLMs) to simulate complex social phenomena and decision‑making processes. Casevo is designed as a discrete‑event simulator driven by agents with features such as Chain of Thoughts (CoT), Retrieval‑Augmented Generation (RAG), and Customizable Memory Mechanism. Casevo enables dynamic social modeling, which can support various scenarios such as social network analysis, public opinion dynamics, and behavior prediction in complex social systems. To demonstrate the effectiveness of Casevo, we utilize one of the U.S. 2020 midterm election TV debates as a simulation example. Our results show that Casevo facilitates more realistic and flexible agent interactions, improving the quality of dynamic social phenomena simulation. This work contributes to the field by providing a robust system for studying large‑scale, high‑fidelity social behaviors with advanced LLM‑driven agents, expanding the capabilities of traditional agent‑based modeling (ABM). The open‑source code repository address of casevo is https://github.com/rgCASS/casevo.

PaperID: 1587, https://arxiv.org/pdf/2412.18774.pdf GitHub

Authors: Jianbo Zhang, Chunyi Li, Jie Hao, Jun Jia, Huiyu Duan, Guoquan Zheng, Liang Yuan, Guangtao Zhai

Title: Embodied Image Quality Assessment for Robotic Intelligence

Abstract:
Image Quality Assessment (IQA) of User‑Generated Content (UGC) is a critical technique for human Quality of Experience (QoE). However, does the the image quality of Robot‑Generated Content (RGC) demonstrate traits consistent with the Moravec paradox, potentially conflicting with human perceptual norms? Human subjective scoring is more based on the attractiveness of the image. Embodied agent are required to interact and perceive in the environment, and finally perform specific tasks. Visual images as inputs directly influence downstream tasks. In this paper, we explore the perception mechanism of embodied robots for image quality. We propose the first Embodied Preference Database (EPD), which contains 12,500 distorted image annotations. We establish assessment metrics based on the downstream tasks of robot. In addition, there is a gap between UGC and RGC. To address this, we propose a novel Multi‑scale Attention Embodied Image Quality Assessment called MA‑EIQA. For the proposed EPD dataset, this is the first no‑reference IQA model designed for embodied robot. Finally, the performance of mainstream IQA algorithms on EPD dataset is verified. The experiments demonstrate that quality assessment of embodied images is different from that of humans. We sincerely hope that the EPD can contribute to the development of embodied AI by focusing on image quality assessment. The benchmark is available at https://github.com/Jianbo‑maker/EPD_benchmark.

PaperID: 1588, https://arxiv.org/pdf/2412.18450.pdf GitHub

Authors: Tatiana Zemskova, Dmitry Yudin

Title: 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Abstract:
A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user‑robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision‑language tasks. In our experiments on popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.

PaperID: 1589, https://arxiv.org/pdf/2412.18426.pdf GitHub

Authors: Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin

Title: GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Abstract:
Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this relevant task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal language large models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all sub‑tasks of automated GUI Testing, highlighting a significant gap between the current capabilities of Autonomous GUI Testing and its practical, real‑world applicability. This gap provides guidance for the future direction of GUI Agent development. Our code is available at https://github.com/ZJU‑ACES‑ISE/ChatUITest.

PaperID: 1590, https://arxiv.org/pdf/2412.18293.pdf GitHub

Authors: Shaofei Cai, Zhancun Mu, Kaichen He, Bowei Zhang, Xinyue Zheng, Anji Liu, Yitao Liang

Title: MineStudio: A Streamlined Package for Minecraft AI Agent Development

Abstract:
Minecraft's complexity and diversity as an open world make it a perfect environment to test if agents can learn, adapt, and tackle a variety of unscripted tasks. However, the development and validation of novel agents in this setting continue to face significant engineering challenges. This paper presents MineStudio, an open‑source software package designed to streamline the development of autonomous agents in Minecraft. MineStudio represents the first comprehensive integration of seven critical engineering components: simulator, data, model, offline pre‑training, online fine‑tuning, inference, and benchmark, thereby allowing users to concentrate their efforts on algorithm innovation. We provide a user‑friendly API design accompanied by comprehensive documentation and tutorials. Our project is released at https://github.com/CraftJarvis/MineStudio.

PaperID: 1591, https://arxiv.org/pdf/2412.18165.pdf GitHub

Authors: Suwesh Prasad Sah

Title: Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing

Abstract:
Autonomous driving in high‑speed racing, as opposed to urban environments, presents significant challenges in scene understanding due to rapid changes in the track environment. Traditional sequential network approaches may struggle to meet the real‑time knowledge and decision‑making demands of an autonomous agent covering large displacements in a short time. This paper proposes a novel baseline architecture for developing sophisticated models capable of true hardware‑enabled parallelism, achieving neural processing speeds that mirror the agent's high velocity. The proposed model (Parallel Perception Network (PPN)) consists of two independent neural networks, segmentation and reconstruction networks, running parallelly on separate accelerated hardware. The model takes raw 3D point cloud data from the LiDAR sensor as input and converts it into a 2D Bird's Eye View Map on both devices. Each network independently extracts its input features along space and time dimensions and produces outputs parallelly. The proposed method's model is trained on a system with two NVIDIA T4 GPUs, using a combination of loss functions, including edge preservation, and demonstrates a 2x speedup in model inference time compared to a sequential configuration. Implementation is available at: https://github.com/suwesh/Parallel‑Perception‑Network. Learned parameters of the trained networks are provided at: https://huggingface.co/suwesh/ParallelPerceptionNetwork.

PaperID: 1592, https://arxiv.org/pdf/2412.18072.pdf GitHub

Authors: Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal

Title: MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Abstract:
With advances in foundational and vision‑language models, and effective fine‑tuning techniques, a large number of both general and special‑purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools aim to tackle complex visual tasks, by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test‑time sample‑specific solutions that are difficult to deploy, and, sometimes, require low‑level instructions that maybe beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description and few sample input‑output pairs and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio‑lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduced a committee‑based solution proposer that leverages multi‑agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state‑of‑the‑art solutions tailored to user problem specifications. Project page is available at https://davidhalladay.github.io/mmfactory_demo.

PaperID: 1593, https://arxiv.org/pdf/2412.17867.pdf GitHub

Authors: Ziming Guo, Chao Ma, Yinggang Sun, Tiancheng Zhao, Guangyao Wang, Hai Huang

Title: Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types

Abstract:
Recent advancements in large language models (LLMs) have significantly advanced text‑to‑SQL systems. However, most LLM‑based methods often narrowly focus on SQL generation, neglecting the complexities of real‑world conversational queries. This oversight can lead to unreliable responses, particularly for ambiguous questions that cannot be directly addressed with SQL. To bridge this gap, we propose MMSQL, a comprehensive test suite designed to evaluate the question classification and SQL generation capabilities of LLMs by simulating real‑world scenarios with diverse question types and multi‑turn Q&A interactions. Using MMSQL, we assessed the performance of popular LLMs, including both open‑source and closed‑source models, and identified key factors impacting their performance in such scenarios. Moreover, we introduce an LLM‑based multi‑agent framework that employs specialized agents to identify question types and determine appropriate answering strategies. Our experiments demonstrate that this approach significantly enhances the model's ability to navigate the complexities of conversational dynamics, effectively handling the diverse and complex nature of user queries. Our dataset and code are publicly available at https://mcxiaoxiao.github.io/MMSQL.

PaperID: 1594, https://arxiv.org/pdf/2412.17707.pdf GitHub

Authors: Yue Deng, Yan Yu, Weiyu Ma, Zirui Wang, Wenhui Zhu, Jian Zhao, Yin Zhang

Title: SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

Abstract:
The availability of challenging simulation environments is pivotal for advancing the field of Multi‑Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi‑Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near‑optimal performance, complicating the evaluation of their true effectiveness. To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC‑HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC‑HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self‑play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black‑box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state‑of‑the‑art algorithms on SMAC‑HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black‑box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC‑HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self‑play methods for multi‑agent systems. Our code is available at https://github.com/devindeng94/smac‑hard.

PaperID: 1595, https://arxiv.org/pdf/2412.17686.pdf GitHub

Authors: Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, Deyi Xiong

Title: Large Language Model Safety: A Holistic Survey

Abstract:
The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well‑being. A curated list of related papers has been publicly available at https://github.com/tjunlp‑lab/Awesome‑LLM‑Safety‑Papers.

PaperID: 1596, https://arxiv.org/pdf/2412.17612.pdf GitHub GitHub

Authors: Yuanyuan Gao, Yalun Dai, Hao Li, Weicai Ye, Junyi Chen, Danpeng Chen, Dingwen Zhang, Tong He, Guofeng Zhang, Junwei Han

Title: CoSurfGS:Collaborative 3D Surface Gaussian Splatting with Distributed Learning for Large Scene Reconstruction

Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in scene reconstruction. However, most existing GS‑based surface reconstruction methods focus on 3D objects or limited scenes. Directly applying these methods to large‑scale scene reconstruction will pose challenges such as high memory costs, excessive time consumption, and lack of geometric detail, which makes it difficult to implement in practical applications. To address these issues, we propose a multi‑agent collaborative fast 3DGS surface reconstruction framework based on distributed learning for large‑scale surface reconstruction. Specifically, we develop local model compression (LMC) and model aggregation schemes (MAS) to achieve high‑quality surface representation of large scenes while reducing GPU memory consumption. Extensive experiments on Urban3d, MegaNeRF, and BlendedMVS demonstrate that our proposed method can achieve fast and scalable high‑fidelity surface reconstruction and photorealistic rendering. Our project page is available at \urlhttps://gyy456.github.io/CoSurfGS.

PaperID: 1597, https://arxiv.org/pdf/2412.17498.pdf GitHub

Authors: Jiaan Wang, Fandong Meng, Yunlong Liang, Jie Zhou

Title: DRT: Deep Reasoning Translation via Long Chain-of-Thought

Abstract:
Recently, O1‑like models have emerged as representative examples, illustrating the effectiveness of long chain‑of‑thought (CoT) in reasoning tasks such as math and coding tasks. In this paper, we introduce DRT, an attempt to bring the success of long CoT to neural machine translation (MT). Specifically, in view of the literature books that might involve similes and metaphors, translating these texts to a target language is very difficult in practice due to cultural differences. In such cases, literal translation often fails to convey the intended meaning effectively. Even for professional human translators, considerable thought must be given to preserving semantics throughout the translation process. To simulate LLMs' long thought ability in MT, we first mine sentences containing similes or metaphors from existing literature books, and then develop a multi‑agent framework to translate these sentences via long thought. In the multi‑agent framework, a translator is used to iteratively translate the source sentence under the suggestions provided by an advisor. To ensure the effectiveness of the long thoughts, an evaluator is also employed to quantify the translation quality in each round. In this way, we collect tens of thousands of long‑thought MT data, which is used to train our DRT. Using Qwen2.5 and LLama‑3.1 as the backbones, DRT models can learn the thought process during machine translation, and outperform vanilla LLMs as well as LLMs which are simply fine‑tuning on the paired sentences without long thought, showing its effectiveness. The synthesized data and model checkpoints are released at https://github.com/krystalan/DRT.

PaperID: 1598, https://arxiv.org/pdf/2412.17288.pdf GitHub GitHub

Authors: Taewoong Kim, Byeonghwi Kim, Jonghyun Choi

Title: Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples

Abstract:
Learning a perception and reasoning module for robotic assistants to plan steps to perform complex tasks based on natural language instructions often requires large free‑form language annotations, especially for short high‑level instructions. To reduce the cost of annotation, large language models (LLMs) are used as a planner with few data. However, when elaborating the steps, even the state‑of‑the‑art planner that uses LLMs mostly relies on linguistic common sense, often neglecting the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few‑shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. The proposed scheme allows us to use a few language pairs thanks to the visual cues and outperforms state‑of‑the‑art approaches. Our code is available at https://github.com/snumprlab/flare.

PaperID: 1599, https://arxiv.org/pdf/2412.17259.pdf GitHub

Authors: Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, Minlie Huang

Title: LegalAgentBench: Evaluating LLM Agents in Legal Domain

Abstract:
With the increasing intelligence and autonomy of LLM agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general‑domain benchmarks cannot fully capture the complexity and subtle nuances of real‑world judicial cognition and decision‑making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real‑world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi‑hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real‑world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine‑grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at \urlhttps://github.com/CSHaitao/LegalAgentBench.

PaperID: 1600, https://arxiv.org/pdf/2412.17029.pdf GitHub

Authors: Yuhao Yang, Jiabin Tang, Lianghao Xia, Xingchen Zou, Yuxuan Liang, Chao Huang

Title: GraphAgent: Agentic Graph Language Assistant

Abstract:
Real‑world data is represented in both structured (e.g., graph connections) and unstructured (e.g., textual, visual information) formats, encompassing complex relationships that include explicit links (such as social connections and user behaviors) and implicit interdependencies among semantic entities, often illustrated through knowledge graphs. In this work, we propose GraphAgent, an automated agent pipeline that addresses both explicit graph dependencies and implicit graph‑enhanced semantic inter‑dependencies, aligning with practical data scenarios for predictive tasks (e.g., node classification) and generative tasks (e.g., text generation). GraphAgent comprises three key components: (i) a Graph Generator Agent that builds knowledge graphs to reflect complex semantic dependencies; (ii) a Task Planning Agent that interprets diverse user queries and formulates corresponding tasks through agentic self‑planning; and (iii) a Task Execution Agent that efficiently executes planned tasks while automating tool matching and invocation in response to user queries. These agents collaborate seamlessly, integrating language models with graph language models to uncover intricate relational information and data semantic dependencies. Through extensive experiments on various graph‑related predictive and text generative tasks on diverse datasets, we demonstrate the effectiveness of our GraphAgent across various settings. We have made our proposed GraphAgent open‑source at: https://github.com/HKUDS/GraphAgent.

PaperID: 1601, https://arxiv.org/pdf/2412.16086.pdf GitHub

Authors: Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag

Title: Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG

Abstract:
Deep learning has advanced medical image classification, but interpretability challenges hinder its clinical adoption. This study enhances interpretability in Chest X‑ray (CXR) classification by using concept bottleneck models (CBMs) and a multi‑agent Retrieval‑Augmented Generation (RAG) system for report generation. By modeling relationships between visual features and clinical concepts, we create interpretable concept vectors that guide a multi‑agent RAG system to generate radiology reports, enhancing clinical relevance, explainability, and transparency. Evaluation of the generated reports using an LLM‑as‑a‑judge confirmed the interpretability and clinical utility of our model's outputs. On the COVID‑QU dataset, our model achieved 81% classification accuracy and demonstrated robust report generation performance, with five key metrics ranging between 84% and 90%. This interpretable multi‑agent framework bridges the gap between high‑performance AI and the explainability required for reliable AI‑driven CXR analysis in clinical settings. Our code is available at https://github.com/tifat58/IRR‑with‑CBM‑RAG.git.

PaperID: 1602, https://arxiv.org/pdf/2412.15504.pdf GitHub

Authors: Zhenjie Xu, Wenqing Chen, Yi Tang, Xuanying Li, Cheng Hu, Zhixuan Chu, Kui Ren, Zibin Zheng, Zhichao Lu

Title: Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework

Abstract:
Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multi‑objective approach within a multi‑agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA involves deploying multiple agents to perform causal interventions on bias‑related contents of the input questions, breaking the shortcut connection between these contents and the corresponding answers. Unlike traditional debiasing techniques leading to performance degradation, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments conducted on two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% in the BBQ dataset. Additionally, it significantly enhances the multi‑objective metric icat in the StereoSet dataset by up to 58.1%. Code will be made available at https://github.com/Cortantse/MOMA.

PaperID: 1603, https://arxiv.org/pdf/2412.14737.pdf GitHub

Authors: Daniel Yang, Yao-Hung Hubert Tsai, Makoto Yamada

Title: On Verbalized Confidence Scores for LLMs

Abstract:
The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust into their responses, but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task‑specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt‑ and model‑agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well‑calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at https://github.com/danielyxyang/llm‑verbalized‑uq.

PaperID: 1604, https://arxiv.org/pdf/2412.14470.pdf GitHub

Authors: Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang

Title: Agent-SafetyBench: Evaluating the Safety of LLM Agents

Abstract:
As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent‑SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent‑SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through failure mode and helpfulness analysis, we summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone may be insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. To drive progress in this area, Agent‑SafetyBench has been released at https://github.com/thu‑coai/Agent‑SafetyBench/ to facilitate further research in agent safety evaluation and improvement.

PaperID: 1605, https://arxiv.org/pdf/2412.13693.pdf GitHub

Authors: Lina Gong, Chen Wang, Yujun Huang, Di Cui, Mingqiang Wei

Title: UITrans: Seamless UI Translation from Android to HarmonyOS

Abstract:
Seamless user interface (i.e., UI) translation has emerged as a pivotal technique for modern mobile developers, addressing the challenge of developing separate UI applications for Android and HarmonyOS platforms due to fundamental differences in layout structures and development paradigms. In this paper, we present UITrans, the first automated UI translation tool designed for Android to HarmonyOS. UITrans leverages an LLM‑driven multi‑agent reflective collaboration framework to convert Android XML layouts into HarmonyOS ArkUI layouts. It not only maps component‑level and page‑level elements to ArkUI equivalents but also handles project‑level challenges, including complex layouts and interaction logic. Our evaluation of six Android applications demonstrates that our UITrans achieves translation success rates of over 90.1%, 89.3%, and 89.2% at the component, page, and project levels, respectively. UITrans is available at https://github.com/OpenSELab/UITrans and the demo video can be viewed at https://www.youtube.com/watch?v=iqKOSmCnJG0.

PaperID: 1606, https://arxiv.org/pdf/2412.13178.pdf GitHub

Authors: Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, Siheng Chen

Title: SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

Abstract:
With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damages in the real world. Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance, while a few evaluate LLMs' safety awareness only on non‑interactive image‑text data. To address this gap, we present SafeAgentBench ‑‑ the first comprehensive benchmark for safety‑aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high‑quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low‑level controller, supporting multi‑agent execution with 17 high‑level actions for 9 state‑of‑the‑art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety‑conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. Dataset and codes are available in https://github.com/shengyin1224/SafeAgentBench and https://huggingface.co/datasets/safeagentbench/SafeAgentBench.

PaperID: 1607, https://arxiv.org/pdf/2412.12571.pdf GitHub GitHub

Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou

Title: ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Abstract:
Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in‑context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self‑attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero‑shot, general‑purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text‑image articles, multi‑page picture books, edit images, design IP derivatives, or develop character design settings, all through free‑form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi‑agent system comprising three key components: an Instruction‑Parsing agent that interprets user‑uploaded images and instructions, a Strategy‑Planning agent that devises single‑step or multi‑step generation actions, and an Execution agent that performs these actions using an in‑context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA‑Bench arXiv:2412.11767, comprising 100 real‑world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training‑free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi‑task datasets. We further identify key limitations of pretrained DiTs in zero‑shot adapting to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at https://github.com/ali‑vilab/ChatDiT

PaperID: 1608, https://arxiv.org/pdf/2412.11594.pdf GitHub

Authors: Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song

Title: VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Abstract:
Despite the rapid advancements in text‑to‑image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi‑facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text‑dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

PaperID: 1609, https://arxiv.org/pdf/2412.10739.pdf GitHub

Authors: Jingyu Zhang, Yilei Wang, Lang Qian, Peng Sun, Zengwen Li, Sudong Jiang, Maolin Liu, Liang Song

Title: DSRC: Learning Density-insensitive and Semantic-aware Collaborative Representation against Corruptions

Abstract:
As a potential application of Vehicle‑to‑Everything (V2X) communication, multi‑agent collaborative perception has achieved significant success in 3D object detection. While these methods have demonstrated impressive results on standard benchmarks, the robustness of such approaches in the face of complex real‑world environments requires additional verification. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate the robustness of collaborative perception methods in the presence of natural corruptions typical of real‑world environments. Furthermore, we propose DSRC, a robustness‑enhanced collaborative perception method aiming to learn Density‑insensitive and Semantic‑aware collaborative Representation against Corruptions. DSRC consists of two key designs: i) a semantic‑guided sparse‑to‑dense distillation framework, which constructs multi‑view dense objects painted by ground truth bounding boxes to effectively learn density‑insensitive and semantic‑aware collaborative representation; ii) a feature‑to‑point cloud reconstruction approach to better fuse critical collaborative representation across agents. To thoroughly evaluate DSRC, we conduct extensive experiments on real‑world and simulated datasets. The results demonstrate that our method outperforms SOTA collaborative perception methods in both clean and corrupted conditions. Code is available at https://github.com/Terry9a/DSRC.

PaperID: 1610, https://arxiv.org/pdf/2412.10434.pdf GitHub

Authors: Yuanyuan Liang, Tingyu Xie, Gan Peng, Zihao Huang, Yunshi Lan, Weining Qian

Title: NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language

Abstract:
The emergence of Large Language Models (LLMs) has revolutionized many fields, not only traditional natural language processing (NLP) tasks. Recently, research on applying LLMs to the database field has been booming, and as a typical non‑relational database, the use of LLMs in graph database research has naturally gained significant attention. Recent efforts have increasingly focused on leveraging LLMs to translate natural language into graph query language (NL2GQL). Although some progress has been made, these methods have clear limitations, such as their reliance on streamlined processes that often overlook the potential of LLMs to autonomously plan and collaborate with other LLMs in tackling complex NL2GQL challenges. To address this gap, we propose NAT‑NL2GQL, a novel multi‑agent framework for translating natural language to graph query language. Specifically, our framework consists of three synergistic agents: the Preprocessor agent, the Generator agent, and the Refiner agent. The Preprocessor agent manages data processing as context, including tasks such as name entity recognition, query rewriting, path linking, and the extraction of query‑related schemas. The Generator agent is a fine‑tuned LLM trained on NL‑GQL data, responsible for generating corresponding GQL statements based on queries and their related schemas. The Refiner agent is tasked with refining the GQL or context using error information obtained from the GQL execution results. Given the scarcity of high‑quality open‑source NL2GQL datasets based on nGQL syntax, we developed StockGQL, a dataset constructed from a financial market graph database. It is available at: https://github.com/leonyuancode/StockGQL. Experimental results on the StockGQL and SpCQL datasets reveal that our method significantly outperforms baseline approaches, highlighting its potential for advancing NL2GQL research.

PaperID: 1611, https://arxiv.org/pdf/2412.09796.pdf GitHub

Authors: Qiyao Wang, Shiwen Ni, Huaren Liu, Shule Lu, Guhong Chen, Xi Feng, Chi Wei, Qiang Qu, Hamid Alinejad-Rokny, Yuan Lin, Min Yang

Title: AutoPatent: A Multi-Agent Framework for Automatic Patent Generation

Abstract:
As the capabilities of Large Language Models (LLMs) continue to advance, the field of patent processing has garnered increased attention within the natural language processing community. However, the majority of research has been concentrated on classification tasks, such as patent categorization and examination, or on short text generation tasks like patent summarization and patent quizzes. In this paper, we introduce a novel and practical task known as Draft2Patent, along with its corresponding D2P benchmark, which challenges LLMs to generate full‑length patents averaging 17K tokens based on initial drafts. Patents present a significant challenge to LLMs due to their specialized nature, standardized terminology, and extensive length. We propose a multi‑agent framework called AutoPatent which leverages the LLM‑based planner agent, writer agents, and examiner agent with PGTree and RRAG to generate lengthy, intricate, and high‑quality complete patent documents. The experimental results demonstrate that our AutoPatent framework significantly enhances the ability to generate comprehensive patents across various LLMs. Furthermore, we have discovered that patents generated solely with the AutoPatent framework based on the Qwen2.5‑7B model outperform those produced by larger and more powerful LLMs, such as GPT‑4o, Qwen2.5‑72B, and LLAMA3.1‑70B, in both objective metrics and human evaluations. We will make the data and code available upon acceptance at \urlhttps://github.com/QiYao‑Wang/AutoPatent.

PaperID: 1612, https://arxiv.org/pdf/2412.09723.pdf GitHub

Authors: Xiaohao Xu, Feng Xue, Shibo Zhao, Yike Pan, Sebastian Scherer, Xiaonan Huang

Title: MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction

Abstract:
Real‑time multi‑agent collaboration for ego‑motion estimation and high‑fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low‑detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC‑Ego3D, a novel framework for real‑time collaborative photorealistic 3D reconstruction via Multi‑Agent Gaussian Consensus. MAC‑Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra‑Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter‑Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi‑agent Gaussian splats, seamlessly integrates them into a high‑fidelity 3D model. Leveraging Gaussian primitives, MAC‑Ego3D supports efficient RGB‑D rendering, enabling rapid inter‑agent Gaussian association and alignment. MAC‑Ego3D bridges local precision and global coherence, delivering higher efficiency, largely reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real‑world benchmarks, achieving a 15x increase in inference speed, order‑of‑magnitude reductions in ego‑motion estimation error for partial cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available at https://github.com/Xiaohao‑Xu/MAC‑Ego3D .

PaperID: 1613, https://arxiv.org/pdf/2412.09529.pdf GitHub

Authors: Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie

Title: How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?

Abstract:
We introduce RadA‑BenchPlat, an evaluation platform that benchmarks the performance of large language models (LLMs) act as agent cores in radiology environments using 2,200 radiologist‑verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, resulting in 24,200 question‑answer pairs that simulate diverse clinical situations. The platform also defines ten categories of tools for agent‑driven task solving and evaluates seven leading LLMs, revealing that while models like Claude‑3.7‑Sonnet can achieve a 67.1% task completion rate in routine settings, they still struggle with complex task understanding and tool coordination, limiting their capacity to serve as the central core of automated radiology systems. By incorporating four advanced prompt engineering strategies‑‑where prompt‑backpropagation and multi‑agent collaboration contributed 16.8% and 30.7% improvements, respectively‑‑the performance for complex tasks was enhanced by 48.2% overall. Furthermore, automated tool building was explored to improve robustness, achieving a 65.4% success rate, thereby offering promising insights for the future integration of fully automated radiology applications into clinical practice. All of our code and data are openly available at https://github.com/MAGIC‑AI4Med/RadABench.

PaperID: 1614, https://arxiv.org/pdf/2412.08805.pdf GitHub

Authors: Agnieszka Mensfelt, Kostas Stathis, Vince Trencsenyi

Title: Generative Agents for Multi-Agent Autoformalization of Interaction Scenarios

Abstract:
Multi‑agent simulations are versatile tools for exploring interactions among natural and artificial agents, but their development typically demands domain expertise and manual effort. This work introduces the Generative Agents for Multi‑Agent Autoformalization (GAMA) framework, which automates the formalization of interaction scenarios in simulations using agents augmented with large language models (LLMs). To demonstrate the application of GAMA, we use natural language descriptions of game‑theoretic scenarios representing social interactions, and we autoformalize them into executable logic programs defining game rules, with syntactic correctness enforced through a solver‑based validation. To ensure runtime validity, an iterative, tournament‑based procedure tests the generated rules and strategies, followed by exact semantic validation when ground truth outcomes are available. In experiments with 110 natural language descriptions across five 2x2 simultaneous‑move games, GAMA achieves 100% syntactic and 76.5% semantic correctness with Claude 3.5 Sonnet, and 99.82% syntactic and 77% semantic correctness with GPT‑4o. The framework also shows high semantic accuracy in autoformalizing agents' strategies.

PaperID: 1615, https://arxiv.org/pdf/2412.08685.pdf GitHub

Authors: Yuxi Wei, Jingbo Wang, Yuwen Du, Dingju Wang, Liang Pan, Chenxin Xu, Yao Feng, Bo Dai, Siheng Chen

Title: ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes

Abstract:
Generating realistic and interactive dynamics of traffic participants according to specific instruction is critical for street scene simulation. However, there is currently a lack of a comprehensive method that generates realistic dynamics of different types of participants including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi‑LLM‑agent role‑playing approach, which utilizes natural language inputs to plan the trajectories and behaviors for different traffic participants. To generate realistic fine‑grained dynamics based on the planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi‑task executor that generates realistic pedestrian dynamics under different task plannings; and the VehExecutor, a physical transition‑based policy that generates physically plausible vehicle dynamics. Extensive experiments show that ChatDyn can generate realistic driving scene dynamics with multiple vehicles and pedestrians, and significantly outperforms previous methods on subtasks. Code and model will be available at https://vfishc.github.io/chatdyn.

PaperID: 1616, https://arxiv.org/pdf/2412.07472.pdf GitHub

Authors: Jiaqi Zhang, Chen Gao, Liyuan Zhang, Yong Li, Hongzhi Yin

Title: SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World

Abstract:
Recent advances in embodied agents with multimodal perception and reasoning capabilities based on large vision‑language models (LVLMs), excel in autonomously interacting either real or cyber worlds, helping people make intelligent decisions in complex environments. However, the current works are normally optimized by golden action trajectories or ideal task‑oriented solutions toward a definitive goal. This paradigm considers limited user‑oriented factors, which could be the reason for their performance reduction in a wide range of personal assistant applications. To address this, we propose Chain‑of‑User‑Thought (COUT), a novel embodied reasoning paradigm that takes a chain of thought from basic action thinking to explicit and implicit personalized preference thought to incorporate personalized factors into autonomous agent learning. To target COUT, we introduce SmartAgent, an agent framework perceiving cyber environments and reasoning personalized requirements as 1) interacting with GUI to access an item pool, 2) generating users' explicit requirements implied by previous actions, and 3) recommending items to fulfill users' implicit requirements. To demonstrate SmartAgent's capabilities, we also create a brand‑new dataset SmartSpot that offers a full‑stage personalized action‑involved environment. To our best knowledge, our work is the first to formulate the COUT process, serving as a preliminary attempt towards embodied personalized agent learning. Our extensive experiments on SmartSpot illuminate SmartAgent's functionality among a series of embodied and personalized sub‑tasks. We will release code and data upon paper notification at https://github.com/tsinghua‑fib‑lab/SmartAgent.

PaperID: 1617, https://arxiv.org/pdf/2412.06089.pdf GitHub GitHub

Authors: Ashish Goswami, Satyam Kumar Modi, Santhosh Rishi Deshineni, Harman Singh, Prathosh A. P, Parag Singla

Title: GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

Abstract:
Text‑to‑image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo‑realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi‑step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes, and relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi‑step generation into three steps, (a) Generate: we first generate an image using existing diffusion models (b) Plan: we make use of Multi‑Modal LLMs (MLLMs) to identify the mistakes in the generated image expressed in terms of individual objects and their properties, and produce a sequence of corrective steps required in the form of an edit‑plan. (c) Edit: we make use of an existing text‑guided image editing models to sequentially execute our edit‑plan over the generated image to get the desired image which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference time compute with performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models including DALLE‑3 and the latest ‑‑ SD‑3.5‑Large. Our approach not only improves the performance of the SOTA models, by upto 3 points, it also reduces the performance gap between weaker and stronger models. \hrefhttps://dair‑iitd.github.io/GraPE/https://dair‑iitd.github.io/GraPE/

PaperID: 1618, https://arxiv.org/pdf/2412.05334.pdf GitHub GitHub

Authors: Zhejun Zhang, Peter Karkus, Maximilian Igl, Wenhao Ding, Yuxiao Chen, Boris Ivanovic, Marco Pavone

Title: Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

Abstract:
Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed‑loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi‑agent policies have recently become the state‑of‑the‑art in traffic simulation. However, they are typically trained through open‑loop behavior cloning, and thus suffer from covariate shift when executed in closed‑loop during simulation. In this work, we present Closest Among Top‑K (CAT‑K) rollouts, a simple yet effective closed‑loop fine‑tuning strategy to mitigate covariate shift. CAT‑K fine‑tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT‑K fine‑tuning enables a small 7M‑parameter tokenized traffic simulation policy to outperform a 102M‑parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. The code is available at https://github.com/NVlabs/catk.

PaperID: 1619, https://arxiv.org/pdf/2412.05255.pdf GitHub

Authors: Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

Title: TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft

Abstract:
Collaboration is a cornerstone of society. In the real world, human teammates make use of multi‑sensory data to tackle challenging tasks in ever‑changing environments. It is essential for embodied agents collaborating in visually‑rich environments replete with dynamic interactions to understand multi‑modal observations and task specifications. To evaluate the performance of generalizable multi‑modal collaborative agents, we present TeamCraft, a multi‑modal multi‑agent benchmark built on top of the open‑world video game Minecraft. The benchmark features 55,000 task variants specified by multi‑modal prompts, procedurally‑generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft‑bench/teamcraft.

PaperID: 1620, https://arxiv.org/pdf/2412.05103.pdf GitHub

Authors: Edgar Beck, Hsuan-Yu Lin, Patrick Rückert, Yongping Bao, Bettina von Helversen, Sebastian Fehrler, Kirsten Tracht, Armin Dekorsy

Title: Integrating Semantic Communication and Human Decision-Making into an End-to-End Sensing-Decision Framework

Abstract:
As early as 1949, Weaver defined communication in a very broad sense to include all procedures by which one mind or technical system can influence another, thus establishing the idea of semantic communication. With the recent success of machine learning in expert assistance systems where sensed information is wirelessly provided to a human to assist task execution, the need to design effective and efficient communications has become increasingly apparent. In particular, semantic communication aims to convey the meaning behind the sensed information relevant for Human Decision‑Making (HDM). Regarding the interplay between semantic communication and HDM, many questions remain, such as how to model the entire end‑to‑end sensing‑decision‑making process, how to design semantic communication for the HDM and which information should be provided for HDM. To address these questions, we propose to integrate semantic communication and HDM into one probabilistic end‑to‑end sensing‑decision framework that bridges communications and psychology. In our interdisciplinary framework, we model the human through a HDM process, allowing us to explore how feature extraction from semantic communication can best support HDM both in theory and in simulations. In this sense, our study reveals the fundamental design trade‑off between maximizing the relevant semantic information and matching the cognitive capabilities of the HDM model. Our initial analysis shows how semantic communication can balance the level of detail with human cognitive capabilities while demanding less bandwidth, power, and latency.

PaperID: 1621, https://arxiv.org/pdf/2412.04987.pdf GitHub

Authors: Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, Shuaicheng Liu

Title: FlowPolicy: Enabling Fast and Robust 3D Flow-based Policy via Consistency Flow Matching for Robot Manipulation

Abstract:
Robots can acquire complex manipulation skills by learning policies from expert demonstrations, which is often known as vision‑based imitation learning. Generating policies based on diffusion and flow matching models has been shown to be effective, particularly in robotic manipulation tasks. However, recursion‑based approaches are inference inefficient in working from noise distributions to policy distributions, posing a challenging trade‑off between efficiency and quality. This motivates us to propose FlowPolicy, a novel framework for fast policy generation based on consistency flow matching and 3D vision. Our approach refines the flow dynamics by normalizing the self‑consistency of the velocity field, enabling the model to derive task execution policies in a single inference step. Specifically, FlowPolicy conditions on the observed 3D point cloud, where consistency flow matching directly defines straight‑line flows from different time states to the same action space, while simultaneously constraining their velocity values, that is, we approximate the trajectories from noise to robot actions by normalizing the self‑consistency of the velocity field within the action space, thus improving the inference efficiency. We validate the effectiveness of FlowPolicy in Adroit and Metaworld, demonstrating a 7× increase in inference speed while maintaining competitive average success rates compared to state‑of‑the‑art methods. Code is available at https://github.com/zql‑kk/FlowPolicy.

PaperID: 1622, https://arxiv.org/pdf/2412.04905.pdf GitHub

Authors: Minzheng Wang, Xinghua Zhang, Kun Chen, Nan Xu, Haiyang Yu, Fei Huang, Wenji Mao, Yongbin Li

Title: DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling

Abstract:
Large language models (LLMs) enabled dialogue systems have become one of the central modes in human‑machine interaction, which bring about vast amounts of conversation logs and increasing demand for dialogue generation. The dialogue's life‑cycle spans from Prelude through Interlocution to Epilogue, encompassing rich dialogue elements. Despite large volumes of dialogue‑related studies, there is a lack of systematic investigation into the dialogue stages to frame benchmark construction that covers comprehensive dialogue elements. This hinders the precise modeling, generation and assessment of LLMs‑based dialogue systems. To bridge this gap, in this paper, we introduce a new research task‑‑Dialogue Element MOdeling, including Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for a comprehensive dialogue modeling and assessment. On this basis, we further build the DEMO agent with the adept ability to model dialogue elements via imitation learning. Extensive experiments on DEMO indicate that current representative LLMs still have considerable potential for enhancement, and our DEMO agent performs well in both dialogue element modeling and out‑of‑domain tasks.

PaperID: 1623, https://arxiv.org/pdf/2412.04440.pdf GitHub

Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Title: GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Abstract:
Text‑to‑video generation models have shown significant progress in the recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role‑specialized MLLM agent. Multiple agents can collaborate together to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi‑agent framework that enables compositional text‑to‑video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame‑wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage to four sequentially‑executed MLLM‑based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text‑to‑video generation, we design a self‑routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state‑of‑the art performance in compositional text‑to‑video generation.

PaperID: 1624, https://arxiv.org/pdf/2412.04380.pdf GitHub GitHub

Authors: Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, Jiwen Lu

Title: EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Abstract:
3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that demand to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian‑based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross‑attention to refine the regional Gaussians. Finally, we employ Gaussian‑to‑voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc‑ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our EmbodiedOcc outperforms existing methods by a large margin and accomplishes the embodied occupancy prediction with high accuracy and efficiency. Code: https://github.com/YkiWu/EmbodiedOcc.

PaperID: 1625, https://arxiv.org/pdf/2412.04233.pdf GitHub

Authors: Kale-ab Abebe Tessera, Arrasy Rahman, Amos Storkey, Stefano V. Albrecht

Title: HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Abstract:
Adaptive cooperation in multi‑agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross‑agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates ‑‑ raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent‑conditioned hypernetwork can generate agent‑specific parameters and decouple observation‑ and agent‑conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non‑parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.

PaperID: 1626, https://arxiv.org/pdf/2412.04037.pdf GitHub

Authors: Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, Zhipeng Ge

Title: INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Abstract:
Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio‑driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single‑sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait dynamically alternates between speaking and listening state, guided by the input dyadic audio. Specifically, INFP comprises a Motion‑Based Head Imitation stage and an Audio‑Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real‑life conversation videos into a low‑dimensional motion latent space, and use the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to the audio‑driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate superior performance and effectiveness of our method. Project Page: https://grisoon.github.io/INFP/.

PaperID: 1627, https://arxiv.org/pdf/2412.03904.pdf GitHub

Authors: Kai Fronsdal, David Lindner

Title: MISR: Measuring Instrumental Self-Reasoning in Frontier Models

Abstract:
We propose a suite of tasks to evaluate the instrumental self‑reasoning ability of large language model (LLM) agents. Instrumental self‑reasoning ability could improve adaptability and enable self‑modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self‑reasoning in non‑agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self‑reasoning ability in agentic tasks in a wide range of scenarios, including self‑modification, knowledge seeking, and opaque self‑reasoning. We evaluate agents built using state‑of‑the‑art LLMs, including commercial and open source systems. We find that instrumental self‑reasoning ability emerges only in the most capable frontier models and that it is highly context‑dependent. No model passes the the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self‑reasoning ability in future models. We open‑source our evaluations at https://github.com/kaifronsdal/Self‑Reasoning‑Evals.

PaperID: 1628, https://arxiv.org/pdf/2412.03577.pdf GitHub

Authors: Zhao Wang, Briti Gangopadhyay, Mengjie Zhao, Shingo Takamatsu

Title: OKG: On-the-Fly Keyword Generation in Sponsored Search Advertising

Abstract:
Current keyword decision‑making in sponsored search advertising relies on large, static datasets, limiting the ability to automatically set up keywords and adapt to real‑time KPI metrics and product updates that are essential for effective advertising. In this paper, we propose On‑the‑fly Keyword Generation (OKG), an LLM agent‑based method that dynamically monitors KPI changes and adapts keyword generation in real time, aligning with strategies recommended by advertising platforms. Additionally, we introduce the first publicly accessible dataset containing real keyword data along with its KPIs across diverse domains, providing a valuable resource for future research. Experimental results show that OKG significantly improves keyword adaptability and responsiveness compared to traditional methods. The code for OKG and the dataset are available at https://github.com/sony/okg.

PaperID: 1629, https://arxiv.org/pdf/2412.02410.pdf GitHub

Authors: Donghao Yang, Aolang Wu, Tianyi Zhang, Li Zhang, Fang Liu, Xiaoli Lian, Yuming Ren, Jiaji Tian, Xiaoyin Che

Title: AutoPLC: Generating Vendor-Aware Structured Text for Programmable Logic Controllers

Abstract:
Among the programming languages for Programmable Logic Controllers (PLCs), Structured Text (ST) is widely adopted for industrial automation due to its expressiveness and flexibility. However, major vendors implement ST with proprietary extensions and hardware‑specific libraries ‑ Siemens' SCL and CODESYS' ST each differ in syntax and functionality. This fragmentation forces engineers to relearn implementation details across platforms, creating substantial productivity barriers. To address this challenge, we developed AutoPLC, a framework capable of automatically generating vendor‑aware ST code directly from natural language requirements. Our solution begins by building two essential knowledge sources tailored to each vendor's specifications: a structured API library containing platform‑exclusive functions, and an annotated case database that captures real‑world implementation experience. Building on these foundations, we created a four‑stage generation process that combines step‑wise planning (enhanced with a lightweight natural language state machine support for control logic), contextual case retrieval using LLM‑based reranking, API recommendation guided by industrial data, and dynamic validation through direct interaction with vendor IDEs. Implemented for Siemens TIA Portal and the CODESYS platform, AutoPLC achieves 90%+ compilation success on our 914‑task benchmark (covering general‑purpose and process control functions), outperforming all selected baselines, at an average cost of only 0.13 per task. Experienced PLC engineers positively assessed the practical utility of the generated code, including cases that failed compilation. We open‑source our framework at https://github.com/cangkui/AutoPLC.

PaperID: 1630, https://arxiv.org/pdf/2412.01992.pdf GitHub

Authors: Benjamin Klieger, Charis Charitsis, Miroslav Suzara, Sierra Wang, Nick Haber, John C. Mitchell

Title: ChatCollab: Exploring Collaboration Between Humans and AI Agents in Software Teams

Abstract:
We explore the potential for productive team‑based collaboration between humans and Artificial Intelligence (AI) by presenting and conducting initial tests with a general framework that enables multiple human and AI agents to work together as peers. ChatCollab's novel architecture allows agents ‑ human or AI ‑ to join collaborations in any role, autonomously engage in tasks and communication within Slack, and remain agnostic to whether their collaborators are human or AI. Using software engineering as a case study, we find that our AI agents successfully identify their roles and responsibilities, coordinate with other agents, and await requested inputs or deliverables before proceeding. In relation to three prior multi‑agent AI systems for software development, we find ChatCollab AI agents produce comparable or better software in an interactive game development task. We also propose an automated method for analyzing collaboration dynamics that effectively identifies behavioral characteristics of agents with distinct roles, allowing us to quantitatively compare collaboration dynamics in a range of experimental conditions. For example, in comparing ChatCollab AI agents, we find that an AI CEO agent generally provides suggestions 2‑4 times more often than an AI product manager or AI developer, suggesting agents within ChatCollab can meaningfully adopt differentiated collaborative roles. Our code and data can be found at: https://github.com/ChatCollab.

PaperID: 1631, https://arxiv.org/pdf/2412.01268.pdf GitHub

Authors: Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang

Title: Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Abstract:
Most existing GUI agents typically depend on non‑vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real‑world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements ‑‑ a critical requirement for effective GUI automation ‑‑ due to the semantic gap between real‑world objects and GUI elements. In this work, we introduce Ponder & Press, a divide‑and‑conquer framework for general computer control using only visual input. Our approach combines an general‑purpose MLLM as an 'interpreter', responsible for translating high‑level user instructions into detailed action descriptions, with a GUI‑specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By leveraging a purely visual input, our agent offers a versatile, human‑like interaction paradigm applicable to a wide range of applications. Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments ‑‑ including web pages, desktop software, and mobile UIs ‑‑ demonstrate that Ponder & Press framework achieves state‑of‑the‑art performance, highlighting the potential of visual GUI agents. Refer to the project homepage https://invinciblewyq.github.io/ponder‑press‑page/

PaperID: 1632, https://arxiv.org/pdf/2412.01250.pdf GitHub

Authors: Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, Yiming Wang

Title: Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

Abstract:
Language‑driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for human. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolve uncertainties about the target instance during navigation in natural, template‑free, open‑ended dialogues with human. We propose a novel training‑free method, Agent‑user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human‑agent interaction reasoning with Vision‑Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self‑Questioner model initiates a self‑dialogue within the agent to obtain a complete and accurate observation description with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue or halt navigation, minimizing user input. For evaluation, we introduce CoIN‑Bench, with a curated dataset designed for challenging multi‑instance scenarios. CoIN‑Bench supports both online evaluation with humans and reproducible experiments with simulated user‑agent interactions. On CoIN‑Bench, we show that AIUTA serves as a competitive baseline, while existing language‑driven instance navigation methods struggle in complex multi‑instance scenes. Code and benchmark will be available upon acceptance at https://intelligolabs.github.io/CoIN/

PaperID: 1633, https://arxiv.org/pdf/2412.00431.pdf GitHub

Authors: Andrew Laverick, Kristen Surrao, Inigo Zubeldia, Boris Bolliet, Miles Cranmer, Antony Lewis, Blake Sherwin, Julien Lesgourgues

Title: Multi-Agent System for Cosmological Parameter Analysis

Abstract:
Multi‑agent systems (MAS) utilizing multiple Large Language Model agents with Retrieval Augmented Generation and that can execute code locally may become beneficial in cosmological data analysis. Here, we illustrate a first small step towards AI‑assisted analyses and a glimpse of the potential of MAS to automate and optimize scientific workflows in Cosmology. The system architecture of our example package, that builds upon the autogen/ag2 framework, can be applied to MAS in any area of quantitative scientific research. The particular task we apply our methods to is the cosmological parameter analysis of the Atacama Cosmology Telescope lensing power spectrum likelihood using Monte Carlo Markov Chains. Our work‑in‑progress code is open source and available at https://github.com/CMBAgents/cmbagent.

PaperID: 1634, https://arxiv.org/pdf/2411.19626.pdf GitHub GitHub GitHub

Authors: Yawen Shao, Wei Zhai, Yuhang Yang, Hongchen Luo, Yang Cao, Zheng-Jun Zha

Title: GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Abstract:
Open‑Vocabulary 3D object affordance grounding aims to anticipate ``action possibilities'' regions on 3D objects with arbitrary instructions, which is crucial for robots to generically perceive real scenarios and respond to operational changes. Existing methods focus on combining images or languages that depict interactions with 3D geometries to introduce external interaction priors. However, they are still vulnerable to a limited semantic space by failing to leverage implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi‑step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (GeometRy‑intEntion collAboraTive inference) for Open‑Vocabulary 3D Object Affordance Grounding, a novel framework that mines the object invariant geometry attributes and performs analogically reason in potential interaction scenarios to form affordance knowledge, fully combining the knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), the largest 3D object affordance dataset at present to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT. The code and dataset are available at https://yawen‑shao.github.io/GREAT/.

PaperID: 1635, https://arxiv.org/pdf/2411.18104.pdf GitHub

Authors: Yifan Zhang

Title: Training and Evaluating Language Models with Template-based Data Generation

Abstract:
The rapid advancement of large language models (LLMs) such as GPT‑3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi‑step reasoning, particularly in mathematical problem‑solving. This deficiency stems from the critical scarcity of large‑scale, high‑quality, domain‑specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template‑based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT‑4) to automatically generate parameterized meta‑templates, which in turn synthesize a virtually infinite stream of high‑quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine‑tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by leveraging GPT‑4 to generate meta‑templates, ensuring diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills.

PaperID: 1636, https://arxiv.org/pdf/2411.17176.pdf GitHub

Authors: Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, Minnan Luo

Title: ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Abstract:
Despite the significant advancements in text‑to‑image (T2I) generative models, users often face a trial‑and‑error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor‑intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high‑quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi‑step reasoning task, we propose ChatGen‑Evo, a multi‑stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step‑wise accuracy and image quality, ChatGen‑Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available in \urlhttps://chengyou‑jia.github.io/ChatGen‑Home

PaperID: 1637, https://arxiv.org/pdf/2411.16310.pdf GitHub

Authors: Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, Fabio Poiesi

Title: Functionality understanding and segmentation in 3D scenes

Abstract:
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine‑grained objects. For example, given a task like 'turn on the ceiling light', an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain‑of‑Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training‑free, relying entirely on pre‑trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state‑of‑the‑art open‑vocabulary 3D segmentation approaches. Project page: https://tev‑fbk.github.io/fun3du/

PaperID: 1638, https://arxiv.org/pdf/2411.15813.pdf GitHub

Authors: Dimitrios Bachtis

Title: Lattice $ϕ^{4}$ field theory as a multi-agent system of financial markets

Abstract:
We introduce a ϕ^4 lattice field theory with frustrated dynamics as a multi‑agent system to reproduce stylized facts of financial markets such as fat‑tailed distributions of returns and clustered volatility. Each lattice site, represented by a continuous degree of freedom, corresponds to an agent experiencing a set of competing interactions which influence its decision to buy or sell a given stock. These interactions comprise a cooperative term, which signifies that the agent should imitate the behavior of its neighbors, and a fictitious field, which compels the agent instead to conform with the opinion of the majority or the minority. To introduce the competing dynamics we exploit the Markov field structure to pursue a constructive decomposition of the ϕ^4 probability distribution which we recompose with a Ferrenberg‑Swendsen acceptance or rejection sampling step. We then verify numerically that the multi‑agent ϕ^4 field theory produces behavior observed on empirical data from the FTSE 100 London Stock Exchange index. We conclude by discussing how the presence of continuous degrees of freedom within the ϕ^4 lattice field theory enables a representational capacity beyond that possible with multi‑agent systems derived from Ising models.

PaperID: 1639, https://arxiv.org/pdf/2411.14411.pdf GitHub

Authors: Ricardo Gama, Ricardo Cunha, Daniel Fuertes, Carlos R. del-Blanco, Hugo L. Fernandes

Title: Multi-Agent Environments for Vehicle Routing Problems

Abstract:
Research on Reinforcement Learning (RL) approaches for discrete optimization problems has increased considerably, extending RL to areas classically dominated by Operations Research (OR). Vehicle routing problems are a good example of discrete optimization problems with high practical relevance, for which RL techniques have achieved notable success. Despite these advances, open‑source development frameworks remain scarce, hindering both algorithm testing and objective comparison of results. This situation ultimately slows down progress in the field and limits the exchange of ideas between the RL and OR communities. Here, we propose MAEnvs4VRP library, a unified framework for multi‑agent vehicle routing environments that supports classical, dynamic, stochastic, and multi‑task problem variants within a single modular design. The library, built on PyTorch, provides a flexible and modular architecture design that facilitates customization and the incorporation of new routing problems. It follows the Agent Environment Cycle ("AEC") games model and features an intuitive API, enabling rapid adoption and seamless integration into existing reinforcement learning frameworks. The project source code can be found at https://github.com/ricgama/maenvs4vrp.

PaperID: 1640, https://arxiv.org/pdf/2411.13902.pdf GitHub

Authors: Zhijie Bao, Qingyun Liu, Ying Guo, Zhengqiang Ye, Jun Shen, Shirong Xie, Jiajie Peng, Xuanjing Huang, Zhongyu Wei

Title: PIORS: Personalized Intelligent Outpatient Reception based on Large Language Model with Multi-Agents Medical Scenario Simulation

Abstract:
In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting their time and attention for each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). This system integrates an LLM‑based reception nurse and a collaboration between LLM and hospital information system (HIS) into real outpatient reception setting, aiming to deliver personalized, high‑quality, and efficient reception services. Additionally, to enhance the performance of LLMs in real‑world healthcare scenarios, we propose a medical conversational data generation framework named Service Flow aware Medical Scenario Simulation (SFMSS), aiming to adapt the LLM to the real‑world environments and PIORS settings. We evaluate the effectiveness of PIORS and SFMSS through automatic and human assessments involving 15 users and 15 clinical experts. The results demonstrate that PIORS‑Nurse outperforms all baselines, including the current state‑of‑the‑art model GPT‑4o, and aligns with human preferences and clinical needs. Further details and demo can be found at https://github.com/FudanDISC/PIORS

PaperID: 1641, https://arxiv.org/pdf/2411.13340.pdf GitHub

Authors: Yinsong Wang, Siwei Chen, Ziyi Song, Sheng Zhou

Title: WHALES: A Multi-Agent Scheduling Dataset for Enhanced Cooperation in Autonomous Driving

Abstract:
Cooperative perception research is hindered by the limited availability of datasets that capture the complexity of real‑world Vehicle‑to‑Everything (V2X) interactions, particularly under dynamic communication constraints. To address this gap, we introduce WHALES (Wireless enhanced Autonomous vehicles with Large number of Engaged agents), the first large‑scale V2X dataset explicitly designed to benchmark communication‑aware agent scheduling and scalable cooperative perception. WHALES introduces a new benchmark that enables state‑of‑the‑art (SOTA) research in communication‑aware cooperative perception, featuring an average of 8.4 cooperative agents per scene and 2.01 million annotated 3D objects across diverse traffic scenarios. It incorporates detailed communication metadata to emulate real‑world communication bottlenecks, enabling rigorous evaluation of scheduling strategies. To further advance the field, we propose the Coverage‑Aware Historical Scheduler (CAHS), a novel scheduling baseline that selects agents based on historical viewpoint coverage, improving perception performance over existing SOTA methods. WHALES bridges the gap between simulated and real‑world V2X challenges, providing a robust framework for exploring perception‑scheduling co‑design, cross‑data generalization, and scalability limits. The WHALES dataset and code are available at https://github.com/chensiweiTHU/WHALES.

PaperID: 1642, https://arxiv.org/pdf/2411.12828.pdf GitHub

Authors: Sonny George, Chris Sypherd, Dylan Cashman

Title: Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction

Abstract:
Large language model (LLM) agents show promise in an increasing number of domains. In many proposed applications, it is expected that the agent reasons over accumulated experience presented in an input prompt. We propose the OEDD (Operationalize Experience Despite Distraction) corpus, a human‑annotator‑validated body of scenarios with pre‑scripted agent histories where the agent must make a decision based on disparate experiential information in the presence of a distractor. We evaluate three state‑of‑the‑art LLMs (GPT‑3.5 Turbo, GPT‑4o, and Gemini 1.5 Pro) using a minimal chain‑of‑thought prompting strategy and observe that when (1) the input context contains over 1,615 tokens of historical interactions, (2) a crucially decision‑informing premise is the rightful conclusion over two disparate environment premises, and (3) a trivial, but distracting red herring fact follows, all LLMs perform worse than random choice at selecting the better of two actions. Our code and test corpus are publicly available at: https://github.com/sonnygeorge/OEDD .

PaperID: 1643, https://arxiv.org/pdf/2411.11910.pdf GitHub

Authors: Zijun Liu, Kaiming Liu, Yiqi Zhu, Xuanyu Lei, Zonghan Yang, Zhenhe Zhang, Peng Li, Yang Liu

Title: AIGS: Generating Science from AI-Powered Automated Falsification

Abstract:
Rapid development of artificial intelligence has drastically accelerated the development of scientific discovery. Trained with large‑scale observation data, deep neural networks extract the underlying patterns in an end‑to‑end manner and assist human researchers with highly‑precised predictions in unseen scenarios. The recent rise of Large Language Models (LLMs) and the empowered autonomous agents enable scientists to gain help through interaction in different stages of their research, including but not limited to literature review, research ideation, idea implementation, and academic writing. However, AI researchers instantiated by foundation model empowered agents with full‑process autonomy are still in their infancy. In this paper, we study AI‑Generated Science (AIGS), where agents independently and autonomously complete the entire research process and discover scientific laws. By revisiting the definition of scientific research, we argue that falsification is the essence of both human research process and the design of an AIGS system. Through the lens of falsification, prior systems attempting towards AI‑Generated Science either lack the part in their design, or rely heavily on existing verification engines that narrow the use in specialized domains. In this work, we propose Baby‑AIGS as a baby‑step demonstration of a full‑process AIGS system, which is a multi‑agent system with agents in roles representing key research process. By introducing FalsificationAgent, which identify and then verify possible scientific discoveries, we empower the system with explicit falsification. Experiments on three tasks preliminarily show that Baby‑AIGS could produce meaningful scientific discoveries, though not on par with experienced human researchers. Finally, we discuss on the limitations of current Baby‑AIGS, actionable insights, and related ethical issues in detail.

PaperID: 1644, https://arxiv.org/pdf/2411.11758.pdf GitHub

Authors: Longju Bai, Angana Borah, Oana Ignat, Rada Mihalcea

Title: The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

Abstract:
Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross‑cultural contexts remains limited due to the predominantly Western‑centric nature of most data and models. Conversely, multi‑agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi‑agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi‑Agent framework to enhance cross‑cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD‑VCR, CVQA; (3) We propose a culture‑adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi‑agent interaction outperforms single‑agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.

PaperID: 1645, https://arxiv.org/pdf/2411.10422.pdf GitHub

Authors: Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Preni Golazizian, Jesse Thomason, Morteza Dehghani

Title: Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash

Abstract:
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs' input leads to poor reasoning on game rules and historical context (https://github.com/ParsaHejabi/Simulation‑Framework‑for‑Multi‑Agent‑Balderdash).

PaperID: 1646, https://arxiv.org/pdf/2411.09022.pdf GitHub

Authors: Yongdong Wang, Runze Xiao, Jun Younes Louhi Kasahara, Ryosuke Yajima, Keiji Nagatani, Atsushi Yamashita, Hajime Asama

Title: DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Abstract:
Large Language Models (LLMs) have demonstrated promising reasoning capabilities in robotics; however, their application in multi‑robot systems remains limited, particularly in handling task dependencies. This paper introduces DART‑LLM, a novel framework that employs Directed Acyclic Graphs (DAGs) to model task dependencies, enabling the decomposition of natural language instructions into well‑coordinated subtasks for multi‑robot execution. DART‑LLM comprises four key components: a Question‑Answering (QA) LLM module for dependency‑aware task decomposition, a Breakdown Function module for robot assignment, an Actuation module for execution, and a Vision‑Language Model (VLM)‑based object detector for environmental perception, achieving end‑to‑end task execution. Experimental results across three task complexity levels demonstrate that DART‑LLM achieves state‑of‑the‑art performance, significantly outperforming the baseline across all evaluation metrics. Among the tested models, DeepSeek‑r1‑671B achieves the highest success rate, whereas Llama‑3.1‑8B exhibits superior response time reliability. Ablation studies further confirm that explicit dependency modeling notably enhances the performance of smaller models, facilitating efficient deployment on resource‑constrained platforms. Please refer to the project website https://wyd0817.github.io/project‑dart‑llm/ for videos and code.

PaperID: 1647, https://arxiv.org/pdf/2411.06404.pdf GitHub

Authors: Yining Ma, Qadeer Khan, Daniel Cremers

Title: MA-DV2F: A Multi-Agent Navigation Framework using Dynamic Velocity Vector Field

Abstract:
In this paper we propose MA‑DV2F: Multi‑Agent Dynamic Velocity Vector Field. It is a framework for simultaneously controlling a group of vehicles in challenging environments. DV2F is generated for each vehicle independently and provides a map of reference orientation and speed that a vehicle must attain at any point on the navigation grid such that it safely reaches its target. The field is dynamically updated depending on the speed and proximity of the ego‑vehicle to other agents. This dynamic adaptation of the velocity vector field allows prevention of imminent collisions. Experimental results show that MA‑DV2F outperforms concurrent methods in terms of safety, computational efficiency and accuracy in reaching the target when scaling to a large number of vehicles. Project page for this work can be found here: https://yininghase.github.io/MA‑DV2F/

PaperID: 1648, https://arxiv.org/pdf/2411.06223.pdf GitHub

Authors: Roman Chiva Gil, Daniel Jarne Ornia, Khaled A. Mustafa, Javier Alonso Mora

Title: Predictability Awareness for Efficient and Robust Multi-Agent Coordination

Abstract:
To safely and efficiently solve motion planning problems in multi‑agent settings, most approaches attempt to solve a joint optimization that explicitly accounts for the responses triggered in other agents. This often results in solutions with an exponential computational complexity, making these methods intractable for complex scenarios with many agents. While sequential predict‑and‑plan approaches are more scalable, they tend to perform poorly in highly interactive environments. This paper proposes a method to improve the interactive capabilities of sequential predict‑and‑plan methods in multi‑agent navigation problems by introducing predictability as an optimization objective. We interpret predictability through the use of general prediction models, by allowing agents to predict themselves and estimate how they align with these external predictions. We formally introduce this behavior through the free‑energy of the system, which reduces under appropriate bounds to the Kullback‑Leibler divergence between plan and prediction, and use this as a penalty for unpredictable trajectories.The proposed interpretation of predictability allows agents to more robustly leverage prediction models, and fosters a soft social convention that accelerates agreement on coordination strategies without the need of explicit high level control or communication. We show how this predictability‑aware planning leads to lower‑cost trajectories and reduces planning effort in a set of multi‑robot problems, including autonomous driving experiments with human driver data, where we show that the benefits of considering predictability apply even when only the ego‑agent uses this strategy.

PaperID: 1649, https://arxiv.org/pdf/2411.05990.pdf GitHub

Authors: Wenyue Hua, Ollie Liu, Lingyao Li, Alfonso Amayuelas, Julie Chen, Lucas Jiang, Mingyu Jin, Lizhou Fan, Fei Sun, William Wang, Xintong Wang, Yongfeng Zhang

Title: Game-theoretic LLM: Agent Workflow for Negotiation Games

Abstract:
This paper investigates the rationality of large language models (LLMs) in strategic decision‑making contexts, specifically within the framework of game theory. We evaluate several state‑of‑the‑art LLMs across a spectrum of complete‑information and incomplete‑information games. Our findings reveal that LLMs frequently deviate from rational strategies, particularly as the complexity of the game increases with larger payoff matrices or deeper sequential trees. To address these limitations, we design multiple game‑theoretic workflows that guide the reasoning and decision‑making processes of LLMs. These workflows aim to enhance the models' ability to compute Nash Equilibria and make rational choices, even under conditions of uncertainty and incomplete information. Experimental results demonstrate that the adoption of these workflows significantly improves the rationality and robustness of LLMs in game‑theoretic tasks. Specifically, with the workflow, LLMs exhibit marked improvements in identifying optimal strategies, achieving near‑optimal allocations in negotiation scenarios, and reducing susceptibility to exploitation during negotiations. Furthermore, we explore the meta‑strategic considerations of whether it is rational for agents to adopt such workflows, recognizing that the decision to use or forgo the workflow constitutes a game‑theoretic issue in itself. Our research contributes to a deeper understanding of LLMs' decision‑making capabilities in strategic contexts and provides insights into enhancing their rationality through structured workflows. The findings have implications for the development of more robust and strategically sound AI agents capable of navigating complex interactive environments. Code and data supporting this study are available at \urlhttps://github.com/Wenyueh/game_theory.

PaperID: 1650, https://arxiv.org/pdf/2411.04975.pdf GitHub

Authors: Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao

Title: SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Abstract:
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM‑based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi‑agent pipelines performing similar subtasks or self‑refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emphSuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE‑Bench and Text‑to‑SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3×, outperforming state‑of‑the‑art methods ‑‑ 2.8× faster than model‑based approaches like EAGLE‑2/3 and 1.9× faster than model‑free approaches such as Token Recycling. SuffixDecoding is open‑sourced at https://github.com/snowflakedb/ArcticInference

PaperID: 1651, https://arxiv.org/pdf/2411.04679.pdf GitHub

Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G. M. Snoek, Jan-Jakob Sonke, Efstratios Gavves

Title: CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Abstract:
In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long‑term strategic and cooperative planning, leading to redundant steps, failures, and even serious repercussions in complex tasks like search‑and‑rescue missions where discussion and cooperative plan are crucial. To solve this issue, we propose Cooperative Plan Optimization (CaPo) to enhance the cooperation efficiency of LLM‑based embodied agents. Inspired by human cooperation schemes, CaPo improves cooperation efficiency with two phases: 1) meta‑plan generation, and 2) progress‑adaptive meta‑plan and execution. In the first phase, all agents analyze the task, discuss, and cooperatively create a meta‑plan that decomposes the task into subtasks with detailed steps, ensuring a long‑term strategic and coherent plan for efficient coordination. In the second phase, agents execute tasks according to the meta‑plan and dynamically adjust it based on their latest progress (e.g., discovering a target object) through multi‑turn discussions. This progress‑based adaptation eliminates redundant actions, improving the overall cooperation efficiency of agents. Experimental results on the ThreeDworld Multi‑Agent Transport and Communicative Watch‑And‑Help tasks demonstrate that CaPo achieves much higher task completion rate and efficiency compared with state‑of‑the‑arts.The code is released at https://github.com/jliu4ai/CaPo.

PaperID: 1652, https://arxiv.org/pdf/2411.04672.pdf GitHub

Authors: Wenjun Zhang, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief

Title: Semantic-Aware Resource Management for C-V2X Platooning via Multi-Agent Reinforcement Learning

Abstract:
Semantic communication transmits the extracted features of information rather than raw data, significantly reducing redundancy, which is crucial for addressing spectrum and energy challenges in 6G networks. In this paper, we introduce semantic communication into a cellular vehicle‑to‑everything (C‑V2X)‑ based autonomous vehicle platoon system for the first time, aiming to achieve efficient management of communication resources in a dynamic environment. Firstly, we construct a mathematical model for semantic communication in platoon systems, in which the DeepSC model and MU‑DeepSC model are used to semantically encode and decode unimodal and multi‑modal data, respectively. Then, we propose the quality of experience (QoE) metric based on semantic similarity and semantic rate. Meanwhile, we consider the success rate of semantic information transmission (SRS) metric to ensure the fairness of channel resource allocation. Next, the optimization problem is posed with the aim of maximizing the QoE in vehicle‑to‑vehicle (V2V) links while improving SRS. To solve this mixed integer nonlinear programming problem (MINLP) and adapt to time‑varying channel conditions, the paper proposes a distributed semantic‑aware multi‑modal resource allocation (SAMRA) algorithm based on multi‑agent reinforcement learning (MARL), referred to as SAMRAMARL. The algorithm can dynamically allocate channels and power and determine semantic symbol length based on the contextual importance of the transmitted information, ensuring efficient resource utilization. Finally, extensive simulations have demonstrated that SAMRAMARL outperforms existing methods, achieving significant gains in QoE, SRS, and communication delay in C‑V2X platooning scenarios.

PaperID: 1653, https://arxiv.org/pdf/2411.03865.pdf GitHub

Authors: Yizhe Huang, Xingbo Wang, Hao Liu, Fanqi Kong, Aoyang Qin, Min Tang, Song-Chun Zhu, Mingjie Bi, Siyuan Qi, Xue Feng

Title: AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

Abstract:
Traditional interactive environments limit agents' intelligence growth with fixed tasks. Recently, single‑agent environments address this by generating new tasks based on agent actions, enhancing task diversity. We consider the decision‑making problem in multi‑agent settings, where tasks are further influenced by social connections, affecting rewards and information access. However, existing multi‑agent environments lack a combination of adaptive physical surroundings and social connections, hindering the learning of intelligent behaviors. To address this, we introduce AdaSociety, a customizable multi‑agent environment featuring expanding state and action spaces, alongside explicit and alterable social structures. As agents progress, the environment adaptively generates new tasks with social structures for agents to undertake. In AdaSociety, we develop three mini‑games showcasing distinct social structures and tasks. Initial results demonstrate that specific social structures can promote both individual and collective benefits, though current reinforcement learning and LLM‑based algorithms show limited effectiveness in leveraging social structures to enhance performance. Overall, AdaSociety serves as a valuable research platform for exploring intelligence in diverse physical and social settings. The code is available at https://github.com/bigai‑ai/AdaSociety.

PaperID: 1654, https://arxiv.org/pdf/2411.03397.pdf GitHub

Authors: Shlomo Neuberger, Niv Eckhaus, Uri Berger, Amir Taubenfeld, Gabriel Stanovsky, Ariel Goldstein

Title: SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction

Abstract:
Many human interactions, such as political debates, are carried out in group settings, where there are arbitrarily many participants, each with different views and agendas. To explore such complex social settings, we present SAUCE: a customizable Python platform, allowing researchers to plug‑and‑play various LLMs participating in discussions on any topic chosen by the user. Our platform takes care of instantiating the models, scheduling their responses, managing the discussion history, and producing a comprehensive output log, all customizable through configuration files, requiring little to no coding skills. A novel feature of SAUCE is our asynchronous communication feature, where models decide when to speak in addition to what to say, thus modeling an important facet of human communication. We show SAUCE's attractiveness in two initial experiments, and invite the community to use it in simulating various group simulations.

PaperID: 1655, https://arxiv.org/pdf/2411.03284.pdf GitHub

Authors: Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen

Title: SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents

Abstract:
While multi‑agent systems have been shown to significantly enhance the performance of Large Language Models (LLMs) across various tasks and applications, the dense interaction between scaling agents potentially hampers their efficiency and diversity. To address these challenges, we draw inspiration from the sparse mixture‑of‑agents (SMoE) and propose a sparse mixture‑of‑agents (SMoA) framework to improve the efficiency and diversity of multi‑agent LLMs. Unlike completely connected structures, SMoA introduces novel Response Selection and Early Stopping mechanisms to sparsify information flows among individual LLM agents, striking a balance between performance and efficiency. Additionally, inspired by the expert diversity principle in SMoE frameworks for workload balance between experts, we assign distinct role descriptions to each LLM agent, fostering diverse and divergent thinking. Extensive experiments on reasoning, alignment, and fairness benchmarks demonstrate that SMoA achieves performance comparable to traditional mixture‑of‑agents approaches but with significantly lower computational costs. Further analysis reveals that SMoA is more stable, has a greater capacity to scale, and offers considerable potential through hyper‑parameter optimization. Code and data will be available at: https://github.com/David‑Li0406/SMoA.

PaperID: 1656, https://arxiv.org/pdf/2411.02006.pdf GitHub

Authors: Biao Wu, Yanda Li, Zhiwei Zhang, Yunchao Wei, Meng Fang, Ling Chen

Title: Foundations and Recent Trends in Multimodal Mobile Agents: A Survey

Abstract:
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demands for agents that can adapt in real‑time and process multimodal data have grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advancements that enhance real‑time adaptability and multimodal interaction. Recent evaluation benchmarks have been developed better to capture the static and interactive environments of mobile tasks, offering more accurate assessments of agents' performance. We then categorize these advancements into two main approaches: prompt‑based methods, which utilize large language models (LLMs) for instruction‑based task execution, and training‑based methods, which fine‑tune multimodal models for mobile‑specific applications. Additionally, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at https://github.com/aialt/awesome‑mobile‑agents

PaperID: 1657, https://arxiv.org/pdf/2411.01796.pdf GitHub GitHub

Authors: Weihua Du, Qiushi Lyu, Jiaming Shan, Zhenting Qi, Hongxin Zhang, Sunli Chen, Andi Peng, Tianmin Shu, Kwonjoon Lee, Behzad Dariush, Chuang Gan

Title: Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge

Abstract:
We introduce Constrained Human‑AI Cooperation (CHAIC), an inclusive embodied social intelligence challenge designed to test social perception and cooperation in embodied agents. In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints ‑‑ e.g., unable to reach high places or confined to a wheelchair ‑‑ in performing common household or outdoor tasks as efficiently as possible. To achieve this, a successful helper must: (1) infer the human's intents and constraints by following the human and observing their behaviors (social perception), and (2) make a cooperative plan tailored to the human partner to solve the task as quickly as possible, working together as a team (cooperative planning). To benchmark this challenge, we create four new agents with real physical constraints and eight long‑horizon tasks featuring both indoor and outdoor scenes with various constraints, emergency events, and potential risks. We benchmark planning‑ and learning‑based baselines on the challenge and introduce a new method that leverages large language models and behavior modeling. Empirical evaluations demonstrate the effectiveness of our benchmark in enabling systematic assessment of key aspects of machine social intelligence. Our benchmark and code are publicly available at https://github.com/UMass‑Embodied‑AGI/CHAIC.

PaperID: 1658, https://arxiv.org/pdf/2411.01747.pdf GitHub

Authors: Dang Nguyen, Viet Dac Lai, Seunghyun Yoon, Ryan A. Rossi, Handong Zhao, Ruiyi Zhang, Puneet Mathur, Nedim Lipka, Yu Wang, Trung Bui, Franck Dernoncourt, Tianyi Zhou

Title: DynaSaur: Large Language Agents Beyond Predefined Actions

Abstract:
Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly scoped environments, it presents two major challenges for real‑world, open‑ended scenarios: (1) it significantly restricts the planning and acting capabilities of LLM agents, and (2) it requires substantial human effort to enumerate and implement all possible actions, which is impractical in complex environments with a vast number of potential actions. To address these limitations, we propose an LLM agent framework that can dynamically create and compose actions as needed. In this framework, the agent interacts with its environment by generating and executing programs written in a general‑purpose programming language. Moreover, generated actions are accumulated over time for future reuse. Our extensive experiments across multiple benchmarks show that this framework significantly improves flexibility and outperforms prior methods that rely on a fixed action set. Notably, it enables LLM agents to adapt and recover in scenarios where predefined actions are insufficient or fail due to unforeseen edge cases. Our code can be found in https://github.com/adobe‑research/dynasaur.

PaperID: 1659, https://arxiv.org/pdf/2411.00773.pdf GitHub

Authors: Bowen Li, Zhaoyu Li, Qiwei Du, Jinqi Luo, Wenshan Wang, Yaqi Xie, Simon Stepputtis, Chen Wang, Katia P. Sycara, Pradeep Kumar Ravikumar, Alexander G. Gray, Xujie Si, Sebastian Scherer

Title: LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

Abstract:
Recent years have witnessed the rapid development of Neuro‑Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long‑horizon reasoning tasks with complex multi‑agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real‑world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first‑order logic (FOL) for an urban‑like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as IsAmbulance(X) and IsClose(X, Y). These concepts are used to define FOL rules that govern the behavior of various agents. Since the concepts and rules are abstractions, they can be universally applied to cities with any agent compositions, facilitating the instantiation of diverse scenarios. Besides, a key feature of LogiCity is its support for user‑configurable abstractions, enabling customizable simulation complexities for logical reasoning. To explore various aspects of NeSy AI, LogiCity introduces two tasks, one features long‑horizon sequential decision‑making, and the other focuses on one‑step visual reasoning, varying in difficulty and agent behaviors. Our extensive evaluation reveals the advantage of NeSy frameworks in abstract reasoning. Moreover, we highlight the significant challenges of handling more complex abstractions in long‑horizon multi‑agent scenarios or under high‑dimensional, imbalanced data. With its flexible design, various features, and newly raised challenges, we believe LogiCity represents a pivotal step forward in advancing the next generation of NeSy AI. All the code and data are open‑sourced at our website: https://jaraxxus‑me.github.io/LogiCity/

PaperID: 1660, https://arxiv.org/pdf/2410.22041.pdf GitHub

Authors: Lixiu Wu, Yuanrong Tang, Qisen Pan, Xianyang Zhan, Yucheng Han, Lanxi Xiao, Tianhong Wang, Chen Zhong, Jiangtao Gong

Title: An LLM-based Simulation Framework for Embodied Conversational Agents in Psychological Counseling

Abstract:
Due to privacy concerns, open dialogue datasets for mental health are primarily generated through human or AI synthesis methods. However, the inherent implicit nature of psychological processes, particularly those of clients, poses challenges to the authenticity and diversity of synthetic data. In this paper, we propose ECAs (short for Embodied Conversational Agents), a framework for embodied agent simulation based on Large Language Models (LLMs) that incorporates multiple psychological theoretical principles.Using simulation, we expand real counseling case data into a nuanced embodied cognitive memory space and generate dialogue data based on high‑frequency counseling questions.We validated our framework using the D4 dataset. First, we created a public ECAs dataset through batch simulations based on D4. Licensed counselors evaluated our method, demonstrating that it significantly outperforms baselines in simulation authenticity and necessity. Additionally, two LLM‑based automated evaluation methods were employed to confirm the higher quality of the generated dialogues compared to the baselines. The source code and dataset are available at https://github.com/AIR‑DISCOVER/ECAs‑Dataset.

PaperID: 1661, https://arxiv.org/pdf/2410.20445.pdf GitHub GitHub

Authors: Yuwei Du, Jie Feng, Jie Zhao, Yong Li

Title: TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration

Abstract:
Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models, designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM‑based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on five tasks using four real‑world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%‑69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua‑fib‑lab/TrajAgent.

PaperID: 1662, https://arxiv.org/pdf/2410.19848.pdf GitHub

Authors: Yijiashun Qi, Shuzhang Cai, Zunduo Zhao, Jiaming Li, Yanbin Lin, Zhiqiang Wang

Title: Benchmarking Large Language Models for Image Classification of Marine Mammals

Abstract:
As Artificial Intelligence (AI) has developed rapidly over the past few decades, the new generation of AI, Large Language Models (LLMs) trained on massive datasets, has achieved ground‑breaking performance in many applications. Further progress has been made in multimodal LLMs, with many datasets created to evaluate LLMs with vision abilities. However, none of those datasets focuses solely on marine mammals, which are indispensable for ecological equilibrium. In this work, we build a benchmark dataset with 1,423 images of 65 kinds of marine mammals, where each animal is uniquely classified into different levels of class, ranging from species‑level to medium‑level to group‑level. Moreover, we evaluate several approaches for classifying these marine mammals: (1) machine learning (ML) algorithms using embeddings provided by neural networks, (2) influential pre‑trained neural networks, (3) zero‑shot models: CLIP and LLMs, and (4) a novel LLM‑based multi‑agent system (MAS). The results demonstrate the strengths of traditional models and LLMs in different aspects, and the MAS can further improve the classification performance. The dataset is available on GitHub: https://github.com/yeyimilk/LLM‑Vision‑Marine‑Animals.git.

PaperID: 1663, https://arxiv.org/pdf/2410.19372.pdf GitHub

Authors: Bang Giang Le, Viet Cuong Ta

Title: Toward Finding Strong Pareto Optimal Policies in Multi-Agent Reinforcement Learning

Abstract:
In this work, we study the problem of finding Pareto optimal policies in multi‑agent reinforcement learning problems with cooperative reward structures. We show that any algorithm where each agent only optimizes their reward is subject to suboptimal convergence. Therefore, to achieve Pareto optimality, agents have to act altruistically by considering the rewards of others. This observation bridges the multi‑objective optimization framework and multi‑agent reinforcement learning together. We first propose a framework for applying the Multiple Gradient Descent algorithm (MGDA) for learning in multi‑agent settings. We further show that standard MGDA is subjected to weak Pareto convergence, a problem that is often overlooked in other learning settings but is prevalent in multi‑agent reinforcement learning. To mitigate this issue, we propose MGDA++, an improvement of the existing algorithm to handle the weakly optimal convergence of MGDA properly. Theoretically, we prove that MGDA++ converges to strong Pareto optimal solutions in convex, smooth bi‑objective problems. We further demonstrate the superiority of our MGDA++ in cooperative settings in the Gridworld benchmark. The results highlight that our proposed method can converge efficiently and outperform the other methods in terms of the optimality of the convergent policies. The source code is available at \urlhttps://github.com/giangbang/Strong‑Pareto‑MARL.

PaperID: 1664, https://arxiv.org/pdf/2410.18195.pdf GitHub

Authors: Luca Barsellotti, Roberto Bigazzi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Title: Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Abstract:
In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo‑realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user‑specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance‑based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo‑realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object‑driven navigation, considering modular and end‑to‑end agents.

PaperID: 1665, https://arxiv.org/pdf/2410.18032.pdf GitHub

Authors: Xin Li, Qizhi Chu, Yubin Chen, Yang Liu, Yaoqi Liu, Zekai Yu, Weize Chen, Chen Qian, Chuan Shi, Cheng Yang

Title: GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration

Abstract:
Graphs are widely used for modeling relational data in real‑world scenarios, such as social networks and urban computing. Existing LLM‑based graph analysis approaches either integrate graph neural networks (GNNs) for specific machine learning tasks, limiting their transferability, or rely solely on LLMs' internal reasoning ability, resulting in suboptimal performance. To address these limitations, we take advantage of recent advances in LLM‑based agents, which have shown capabilities of utilizing external knowledge or tools for problem solving. By simulating human problem‑solving strategies such as analogy and collaboration, we propose a multi‑agent system based on LLMs named GraphTeam, for graph analysis. GraphTeam consists of five LLM‑based agents from three modules, and the agents with different specialities can collaborate with each other to address complex problems. Specifically, (1) input‑output normalization module: the question agent extracts and refines four key arguments from the original question, facilitating the problem understanding, and the answer agent organizes the results to meet the output requirement; (2) external knowledge retrieval module: we first build a knowledge base consisting of relevant documentation and experience information, and then the search agent retrieves the most relevant entries for each question. (3) problem‑solving module: given the retrieved information from search agent, the coding agent uses established algorithms via programming to generate solutions, and in case the coding agent does not work, the reasoning agent will directly compute the results without programming. Extensive experiments on six graph analysis benchmarks demonstrate that GraphTeam achieves state‑of‑the‑art performance with an average 25.85% improvement over the best baseline in terms of accuracy. The code and data are available at https://github.com/BUPT‑GAMMA/GraphTeam.

PaperID: 1666, https://arxiv.org/pdf/2410.17238.pdf GitHub

Authors: Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, Chenglin Wu

Title: SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning

Abstract:
Automated Machine Learning (AutoML) approaches encompass traditional methods that optimize fixed pipelines for model selection and ensembling, as well as newer LLM‑based frameworks that autonomously build pipelines. While LLM‑based agents have shown promise in automating machine learning tasks, they often generate low‑diversity and suboptimal code, even after multiple iterations. To overcome these limitations, we introduce Tree‑Search Enhanced LLM Agents (SELA), an innovative agent‑based system that leverages Monte Carlo Tree Search (MCTS) to optimize the AutoML process. By representing pipeline configurations as trees, our framework enables agents to conduct experiments intelligently and iteratively refine their strategies, facilitating a more effective exploration of the machine learning solution space. This novel approach allows SELA to discover optimal pathways based on experimental feedback, improving the overall quality of the solutions. In an extensive evaluation across 20 machine learning datasets, we compare the performance of traditional and agent‑based AutoML methods, demonstrating that SELA achieves a win rate of 65% to 80% against each baseline across all datasets. These results underscore the significant potential of agent‑based strategies in AutoML, offering a fresh perspective on tackling complex machine learning challenges.

PaperID: 1667, https://arxiv.org/pdf/2410.17160.pdf GitHub

Authors: Zhuo Yao

Title: Layered LA-MAPF: a decomposition of large agent MAPF instance to accelerate solving without compromising solvability

Abstract:
Multi‑Agent Path Finding (MAPF) has been widely studied in recent years. However, most existing MAPF algorithms assume that an agent occupies only a single grid in a grid‑based map. This assumption limits their applicability in many real‑world domains where agents have geometric shapes, rather than being point‑like. Such agents, which can occupy multiple cells simultaneously, are referred to as ``large'' agents. When considering the shape and size of agents in MAPF, the computational complexity increases significantly as the number of agents grows, primarily due to the increased overhead in conflict detection between geometric agents. In this paper, we propose two types of subproblems for the LA‑MAPF (Large‑Agent MAPF) problem: cluster (which has no constraints on the order of solution) and level (which imposes constraints on the solution order). We introduce Layered LA‑MAPF, a method that decomposes a MAPF instance involving geometric agents into clusters, and then further decomposes each cluster into levels. This approach aims to reduce time complexity when solving LA‑MAPF problems. Our results demonstrate the performance of our method as the number of agents increases across various maps, and how it accelerates LA‑MAPF methods, such as LA‑CBS and LA‑LaCAM. Experiments show that our LA‑MAPF method with instance decomposition halves the time cost (reducing from an average of 40s to 20s) and triples the success rate (from an average of 0.27 to 0.80) in finding a solution within 60 seconds. To facilitate further research, we have made the source code for Layered LA‑MAPF publicly available at \urlhttps://github.com/JoeYao‑bit/LayeredMAPF/algorithm/LA‑MAPF.

PaperID: 1668, https://arxiv.org/pdf/2410.16946.pdf GitHub

Authors: Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, Siheng Chen

Title: Self-Evolving Multi-Agent Collaboration Networks for Software Development

Abstract:
LLM‑driven multi‑agent collaboration (MAC) systems have demonstrated impressive capabilities in automatic software development at the function level. However, their heavy reliance on human design limits their adaptability to the diverse demands of real‑world software development. To address this limitation, we introduce EvoMAC, a novel self‑evolving paradigm for MAC networks. Inspired by traditional neural network training, EvoMAC obtains text‑based environmental feedback by verifying the MAC network's output against a target proxy and leverages a novel textual backpropagation to update the network. To extend coding capabilities beyond function‑level tasks to more challenging software‑level development, we further propose rSDE‑Bench, a requirement‑oriented software development benchmark, which features complex and diverse software requirements along with automatic evaluation of requirement correctness. Our experiments show that: i) The automatic requirement‑aware evaluation in rSDE‑Bench closely aligns with human evaluations, validating its reliability as a software‑level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software‑level rSDE‑Bench and the function‑level HumanEval benchmarks, reflecting its superior coding capabilities. The benchmark can be downloaded at https://yuzhu‑cai.github.io/rSDE‑Bench/.

PaperID: 1669, https://arxiv.org/pdf/2410.16670.pdf GitHub GitHub

Authors: Chen Yang, Chenyang Zhao, Quanquan Gu, Dongruo Zhou

Title: CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing

Abstract:
Sequential reasoning in agent systems has been significantly advanced by large language models (LLMs), yet existing approaches face limitations. Reflection‑driven reasoning relies solely on knowledge in pretrained models, limiting performance in novel scenarios, while experience‑assisted reasoning often depends on external experiences and lacks clear principles for selecting representative experiences. We address these limitations by proposing CoPS (Cross‑Task Experience Sharing), a generalizable algorithm that enhances sequential reasoning by cross‑task experience sharing and selection. In detail, CoPS leverages agents' experiences on previous tasks, selecting distribution‑matched experiences via a provable pessimism‑based strategy to maximize utility while minimizing risks from distribution shifts. Extensive experimental results on benchmarks like Alfworld, Webshop, and HotPotQA demonstrate that CoPS consistently outperforms state‑of‑the‑art baselines, with superior sample efficiency suitable for resource‑constrained scenarios. Theoretically, we show that the performance of our algorithm depends on both the quality of the pretrained LLM and the matching between the agent's task‑dependent trial distribution and that generated by the LLM. Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross‑task experiences, shedding light on the potential to improve agents' generalization and adaptability across diverse tasks. Our codes are available at \hrefhttps://github.com/uclaml/COPS\texthttps://github.com/uclaml/COPS.

PaperID: 1670, https://arxiv.org/pdf/2410.15710.pdf GitHub

Authors: Yuchen Wu, Yifan Yang, Gang Xu, Junjie Cao, Yansong Chen, Licheng Wen, Yong Liu

Title: Hierarchical Search-Based Cooperative Motion Planning

Abstract:
Cooperative path planning, a crucial aspect of multi‑agent systems research, serves a variety of sectors, including military, agriculture, and industry. Many existing algorithms, however, come with certain limitations, such as simplified kinematic models and inadequate support for multiple group scenarios. Focusing on the planning problem associated with a nonholonomic Ackermann model for Unmanned Ground Vehicles (UGV), we propose a leaderless, hierarchical Search‑Based Cooperative Motion Planning (SCMP) method. The high‑level utilizes a binary conflict search tree to minimize runtime, while the low‑level fabricates kinematically feasible, collision‑free paths that are shape‑constrained. Our algorithm can adapt to scenarios featuring multiple groups with different shapes, outlier agents, and elaborate obstacles. We conduct algorithm comparisons, performance testing, simulation, and real‑world testing, verifying the effectiveness and applicability of our algorithm. The implementation of our method will be open‑sourced at https://github.com/WYCUniverStar/SCMP.

PaperID: 1671, https://arxiv.org/pdf/2410.15164.pdf GitHub

Authors: Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

Title: SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Abstract:
Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)‑based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA‑Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM‑based agents in an interactive environment that simulates real‑world conditions. SPA‑Bench offers three key contributions: (1) A diverse set of tasks covering system and third‑party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug‑and‑play framework enabling real‑time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real‑world smartphone agent applications. SPA‑Bench is available at https://ai‑agents‑2030.github.io/SPA‑Bench/.

PaperID: 1672, https://arxiv.org/pdf/2410.14923.pdf GitHub

Authors: Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, Earlence Fernandes

Title: Imprompter: Tricking LLM Agents into Improper Tool Use

Abstract:
Large Language Model (LLM) Agents are an emerging computing paradigm that blends generative machine learning with tools such as code interpreters, web browsing, email, and more generally, external resources. These agent‑based systems represent an emerging shift in personal computing. We contribute to the security foundations of agent‑based systems and surface a new class of automatically computed obfuscated adversarial prompt attacks that violate the confidentiality and integrity of user resources connected to an LLM agent. We show how prompt optimization techniques can find such prompts automatically given the weights of a model. We demonstrate that such attacks transfer to production‑level agents. For example, we show an information exfiltration attack on Mistral's LeChat agent that analyzes a user's conversation, picks out personally identifiable information, and formats it into a valid markdown command that results in leaking that data to the attacker's server. This attack shows a nearly 80% success rate in an end‑to‑end evaluation. We conduct a range of experiments to characterize the efficacy of these attacks and find that they reliably work on emerging agent‑based systems like Mistral's LeChat, ChatGLM, and Meta's Llama. These attacks are multimodal, and we show variants in the text‑only and image domains.

PaperID: 1673, https://arxiv.org/pdf/2410.14596.pdf GitHub

Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal

Title: Teaching Models to Balance Resisting and Accepting Persuasion

Abstract:
Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion‑Training (or PBT), which leverages multi‑agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7‑8B models for training much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi‑agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.

PaperID: 1674, https://arxiv.org/pdf/2410.14383.pdf GitHub

Authors: Toby Godfrey, William Hunt, Mohammad D. Soorati

Title: MARLIN: Multi-Agent Reinforcement Learning Guided by Language-Based Inter-Robot Negotiation

Abstract:
Multi‑agent reinforcement learning is a key method for training multi‑robot systems. Through rewarding or punishing robots over a series of episodes according to their performance, they can be trained and then deployed in the real world. However, poorly trained policies can lead to unsafe behaviour during early training stages. We introduce Multi‑Agent Reinforcement Learning guided by language‑based Inter‑robot Negotiation (MARLIN), a hybrid framework in which large language models provide high‑level planning before the reinforcement learning policy has learned effective behaviours. Robots use language models to negotiate actions and generate plans that guide policy learning. The system dynamically switches between reinforcement learning and language‑model‑based negotiation during training, enabling safer and more effective exploration. MARLIN is evaluated using both simulated and physical robots with local and remote language models. Results show that, compared to standard multi‑agent reinforcement learning, the hybrid approach achieves higher performance in early training without reducing final performance. The code is available at https://github.com/SooratiLab/MARLIN.

PaperID: 1675, https://arxiv.org/pdf/2410.14368.pdf GitHub

Authors: Huaiyuan Yao, Longchao Da, Vishnu Nandam, Justin Turnau, Zhiwei Liu, Linsey Pang, Hua Wei

Title: CoMAL: Collaborative Multi-Agent Large Language Models for Mixed-Autonomy Traffic

Abstract:
The integration of autonomous vehicles into urban traffic has great potential to improve efficiency by reducing congestion and optimizing traffic flow systematically. In this paper, we introduce CoMAL (Collaborative Multi‑Agent LLMs), a framework designed to address the mixed‑autonomy traffic problem by collaboration among autonomous vehicles to optimize traffic flow. CoMAL is built upon large language models, operating in an interactive traffic simulation environment. It utilizes a Perception Module to observe surrounding agents and a Memory Module to store strategies for each agent. The overall workflow includes a Collaboration Module that encourages autonomous vehicles to discuss the effective strategy and allocate roles, a reasoning engine to determine optimal behaviors based on assigned roles, and an Execution Module that controls vehicle actions using a hybrid approach combining rule‑based models. Experimental results demonstrate that CoMAL achieves superior performance on the Flow benchmark. Additionally, we evaluate the impact of different language models and compare our framework with reinforcement learning approaches. It highlights the strong cooperative capability of LLM agents and presents a promising solution to the mixed‑autonomy traffic challenge. The code is available at https://github.com/Hyan‑Yao/CoMAL.

PaperID: 1676, https://arxiv.org/pdf/2410.14202.pdf GitHub

Authors: SeongYeub Chu, JongWoo Kim, Bryan Wong, MunYong Yi

Title: Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs

Abstract:
Existing automated essay scoring (AES) has solely relied on essay text without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine‑grained manner. This paper introduces Rationale‑based Multiple Trait Scoring (RMTS), a novel approach for multi‑trait essay scoring that integrates prompt‑engineering‑based large language models (LLMs) with a fine‑tuning‑based essay scoring model using a smaller large language model (S‑LLM). RMTS uses an LLM‑based trait‑wise rationale generation system where a separate LLM agent generates trait‑specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi‑trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state‑of‑the‑art models and vanilla S‑LLMs in trait‑specific scoring. By assisting quantitative assessment with fine‑grained qualitative rationales, RMTS enhances the trait‑wise reliability, providing partial explanations about essays. The code is available at https://github.com/BBeeChu/RMTS.git.

PaperID: 1677, https://arxiv.org/pdf/2410.13610.pdf GitHub

Authors: Yakun Zhu, Shaohang Wei, Xu Wang, Kui Xue, Xiaofan Zhang, Shaoting Zhang

Title: MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

Abstract:
Integrating tools into Large Language Models (LLMs) has facilitated the widespread application. Despite this, in specialized downstream task contexts, reliance solely on tools is insufficient to fully address the complexities of the real world. This particularly restricts the effective deployment of LLMs in fields such as medicine. In this paper, we focus on the downstream tasks of medical calculators, which use standardized tests to assess an individual's health status. We introduce MeNTi, a universal agent architecture for LLMs. MeNTi integrates a specialized medical toolkit and employs meta‑tool and nested calling mechanisms to enhance LLM tool utilization. Specifically, it achieves flexible tool selection and nested tool calling to address practical issues faced in intricate medical scenarios, including calculator selection, slot filling, and unit conversion. To assess the capabilities of LLMs for quantitative assessment throughout the clinical process of calculator scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical calculators to perform calculations and assess patient health status. CalcQA is constructed by professional physicians and includes 100 case‑calculator pairs, complemented by a toolkit of 281 medical tools. The experimental results demonstrate significant performance improvements with our framework. This research paves new directions for applying LLMs in demanding scenarios of medicine.

PaperID: 1678, https://arxiv.org/pdf/2410.12784.pdf GitHub

Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica

Title: JudgeBench: A Benchmark for Evaluating LLM-based Judges

Abstract:
LLM‑based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM‑based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM‑based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM‑based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine‑tuned judges, multi‑agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT‑4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM‑based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.

PaperID: 1679, https://arxiv.org/pdf/2410.12464.pdf GitHub

Authors: Qian Wang, Yuchen Gao, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Bingsheng He

Title: Exploring LLM Cryptocurrency Trading Through Fact-Subjectivity Aware Reasoning

Abstract:
While many studies show that more advanced LLMs excel in tasks such as mathematics and coding, we observe that in cryptocurrency trading, stronger LLMs sometimes underperform compared to weaker ones. To investigate this counterintuitive phenomenon, we examine how LLMs reason when making trading decisions. Our findings reveal that (1) stronger LLMs show a preference for factual information over subjectivity; (2) separating the reasoning process into factual and subjective components leads to higher profits. Building on these insights, we propose a multi‑agent framework, FS‑ReasoningAgent, which enables LLMs to recognize and learn from both factual and subjective reasoning. Extensive experiments demonstrate that this fine‑grained reasoning approach enhances LLM trading performance in cryptocurrency markets, yielding profit improvements of 7% in BTC, 2% in ETH, and 10% in SOL. Additionally, an ablation study reveals that relying on subjective news generates higher returns in bull markets, while focusing on factual information yields better results in bear markets. Code is available at https://github.com/Persdre/FS‑ReasoningAgent.

PaperID: 1680, https://arxiv.org/pdf/2410.11989.pdf GitHub

Authors: Zhijie Yan, Shufei Li, Zuoxu Wang, Lixiu Wu, Han Wang, Jun Zhu, Lijiang Chen, Jihong Liu

Title: Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

Abstract:
Enabling mobile robots to perform long‑term tasks in dynamic real‑world environments is a formidable challenge, especially when the environment changes frequently due to human‑robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open‑vocabulary 3D scene graphs and a language‑guided task planning module for long‑term task execution. DovSG takes RGB‑D sequences as input and utilizes vision‑language models (VLMs) for object detection to obtain high‑level object semantic features. Based on the segmented objects, a structured 3D scene graph is generated for low‑level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph, allows the robot to adjust parts of the graph dynamically during interactions without the need for full scene reconstruction. This mechanism is particularly valuable in dynamic environments, enabling the robot to continually adapt to scene changes and effectively support the execution of long‑term tasks. We validated our system in real‑world environments with varying degrees of manual modifications, demonstrating its effectiveness and superior performance in long‑term tasks. Our project page is available at: https://bjhyzj.github.io/dovsg‑web.

PaperID: 1681, https://arxiv.org/pdf/2410.11305.pdf GitHub

Authors: Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

Title: QSpec: Speculative Decoding with Complementary Quantization Schemes

Abstract:
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation‑weight joint quantization enables efficient low‑precision decoding, it suffers from substantial performance degradation on multi‑step reasoning tasks. We propose QSpec, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low‑precision joint quantization for fast drafting and high‑precision weight‑only quantization for accurate verification. QSpec reuses both weights and KV cache across stages, enabling near‑zero‑cost switching without retraining or auxiliary models. Compared to high‑precision baselines, QSpec achieves up to 1.64x speedup without quality degradation, and outperforms state‑of‑the‑art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSpec supports plug‑and‑play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSpec a practical and scalable solution for high‑fidelity quantized LLM serving under memory‑constrained scenarios. Our code is available at https://github.com/hku‑netexplo‑lab/QSpec.

PaperID: 1682, https://arxiv.org/pdf/2410.10934.pdf GitHub

Authors: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

Title: Agent-as-a-Judge: Evaluate Agents with Agents

Abstract:
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes ‑‑ ignoring the step‑by‑step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent‑as‑a‑Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM‑as‑a‑Judge framework, incorporating agentic features that enable intermediate feedback for the entire task‑solving process. We apply the Agent‑as‑a‑Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof‑of‑concept testbed for Agent‑as‑a‑Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent‑as‑a‑Judge and find it dramatically outperforms LLM‑as‑a‑Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent‑as‑a‑Judge marks a concrete step forward for modern agentic systems ‑‑ by providing rich and reliable reward signals necessary for dynamic and scalable self‑improvement.

PaperID: 1683, https://arxiv.org/pdf/2410.10760.pdf GitHub

Authors: Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, Min Lin

Title: Denial-of-Service Poisoning Attacks against Large Language Models

Abstract:
Recent studies have shown that LLMs are vulnerable to denial‑of‑service (DoS) attacks, where adversarial inputs like spelling errors or non‑semantic prompts trigger endless outputs without generating an [EOS] token. These attacks can potentially cause high latency and make LLM services inaccessible to other users or tasks. However, when there are speech‑to‑text interfaces (e.g., voice commands to a robot), executing such DoS attacks becomes challenging, as it is difficult to introduce spelling errors or non‑semantic prompts through speech. A simple DoS attack in these scenarios would be to instruct the model to "Keep repeating Hello", but we observe that relying solely on natural instructions limits output length, which is bounded by the maximum length of the LLM's supervised finetuning (SFT) data. To overcome this limitation, we propose poisoning‑based DoS (P‑DoS) attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit. For example, a poisoned sample can successfully attack GPT‑4o and GPT‑4o mini (via OpenAI's finetuning API) using less than 1, causing repeated outputs up to the maximum inference length (16K tokens, compared to 0.5K before poisoning). Additionally, we perform comprehensive ablation studies on open‑source LLMs and extend our method to LLM agents, where attackers can control both the finetuning dataset and algorithm. Our findings underscore the urgent need for defenses against P‑DoS attacks to secure LLMs. Our code is available at https://github.com/sail‑sg/P‑DoS.

PaperID: 1684, https://arxiv.org/pdf/2410.09824.pdf GitHub

Authors: Jiarui Ji, Runlin Lei, Jialing Bi, Zhewei Wei, Xu Chen, Yankai Lin, Xuchen Pan, Yaliang Li, Bolin Ding

Title: LLM-Based Multi-Agent Systems are Scalable Graph Generative Models

Abstract:
The structural properties of naturally arising social graphs are extensively studied to understand their evolution. Prior approaches for modeling network dynamics typically rely on rule‑based models, which lack realism and generalizability, or deep learning‑based models, which require large‑scale training datasets. Social graphs, as abstract graph representations of entity‑wise interactions, present an opportunity to explore network evolution mechanisms through realistic simulations of human‑item interactions. Leveraging the pre‑trained social consensus knowledge embedded in large language models (LLMs), we present GraphAgent‑Generator (GAG), a novel simulation‑based framework for dynamic, text‑attributed social graph generation. GAG simulates the temporal node and edge generation processes for zero‑shot social graph generation. The resulting graphs exhibit adherence to seven key macroscopic network properties, achieving an 11% improvement in microscopic graph structure metrics. Through the node classification benchmarking task, we validate GAG effectively captures the intricate text‑structure correlations in graph generation. Furthermore, GAG supports generating graphs with up to nearly 100,000 nodes or 10 million edges through large‑scale LLM‑based agent simulation with parallel acceleration, achieving a minimum speed‑up of 90.4%. The source code is available at https://github.com/Ji‑Cather/GraphAgent.

PaperID: 1685, https://arxiv.org/pdf/2410.09403.pdf GitHub

Authors: Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, Nanqing Dong

Title: Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System

Abstract:
The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real‑world scientific practices, where diverse experts work together in teams to tackle complex problems. To address the limitations, we propose an LLM‑based multi‑agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi‑agent approach outperforms the state‑of‑the‑art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open‑sciencelab/Virtual‑Scientists.

PaperID: 1686, https://arxiv.org/pdf/2410.08540.pdf GitHub

Authors: Xinran Li, Ling Pan, Jun Zhang

Title: Kaleidoscope: Learnable Masks for Heterogeneous Multi-agent Reinforcement Learning

Abstract:
In multi‑agent reinforcement learning (MARL), parameter sharing is commonly employed to enhance sample efficiency. However, the popular approach of full parameter sharing often leads to homogeneous policies among agents, potentially limiting the performance benefits that could be derived from policy diversity. To address this critical limitation, we introduce \emphKaleidoscope, a novel adaptive partial parameter sharing scheme that fosters policy heterogeneity while still maintaining high sample efficiency. Specifically, Kaleidoscope maintains one set of common parameters alongside multiple sets of distinct, learnable masks for different agents, dictating the sharing of parameters. It promotes diversity among policy networks by encouraging discrepancy among these masks, without sacrificing the efficiencies of parameter sharing. This design allows Kaleidoscope to dynamically balance high sample efficiency with a broad policy representational capacity, effectively bridging the gap between full parameter sharing and non‑parameter sharing across various environments. We further extend Kaleidoscope to critic ensembles in the context of actor‑critic algorithms, which could help improve value estimations.Our empirical evaluations across extensive environments, including multi‑agent particle environment, multi‑agent MuJoCo and StarCraft multi‑agent challenge v2, demonstrate the superior performance of Kaleidoscope compared with existing parameter sharing approaches, showcasing its potential for performance enhancement in MARL. The code is publicly available at \urlhttps://github.com/LXXXXR/Kaleidoscope.

PaperID: 1687, https://arxiv.org/pdf/2410.08345.pdf GitHub

Authors: Henry Gasztowtt, Benjamin Smith, Vincent Zhu, Qinxun Bai, Edwin Zhang

Title: Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations

Abstract:
The improvement of economic policymaking presents an opportunity for broad societal benefit, a notion that has inspired research towards AI‑driven policymaking tools. AI policymaking holds the potential to surpass human performance through the ability to process data quickly at scale. However, existing RL‑based methods exhibit sample inefficiency, and are further limited by an inability to flexibly incorporate nuanced information into their decision‑making processes. Thus, we propose a novel method in which we instead utilize pre‑trained Large Language Models (LLMs), as sample‑efficient policymakers in socially complex multi‑agent reinforcement learning (MARL) scenarios. We demonstrate significant efficiency gains, outperforming existing methods across three environments. Our code is available at https://github.com/hegasz/large‑legislative‑models.

PaperID: 1688, https://arxiv.org/pdf/2410.08164.pdf GitHub

Authors: Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang

Title: Agent S: An Open Agentic Framework that Uses Computers Like a Human

Abstract:
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human‑computer interaction by automating complex, multi‑step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain‑specific knowledge, planning over long task horizons, and handling dynamic, non‑uniform interfaces. To this end, Agent S introduces experience‑augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent‑Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state‑of‑the‑art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly‑released WindowsAgentArena benchmark. Code available at https://github.com/simular‑ai/Agent‑S.

PaperID: 1689, https://arxiv.org/pdf/2410.08115.pdf GitHub

Authors: Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, Maosong Sun

Title: Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

Abstract:
Large Language Model (LLM) based multi‑agent systems (MAS) show remarkable potential in collaborative problem‑solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter‑updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM‑based MAS through LLM training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various RL algorithms, including Supervised Fine‑Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness‑efficiency trade‑offs. We integrate Monte Carlo Tree Search‑inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi‑agent tasks, including information‑asymmetric question answering and complex reasoning, Optima shows consistent and substantial improvements over single‑agent baselines and vanilla MAS based on Llama 3 8B, achieving up to 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. Moreover, Optima's efficiency gains open new possibilities for leveraging inference‑compute more effectively, leading to improved inference‑time scaling laws. By addressing fundamental challenges in LLM‑based MAS, Optima shows the potential towards scalable, efficient, and effective MAS (https://chenweize1998.github.io/optima‑project‑page).

PaperID: 1690, https://arxiv.org/pdf/2410.07869.pdf GitHub

Authors: Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Title: Benchmarking Agentic Workflow Generation

Abstract:
Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi‑faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT‑4 exhibiting a gap of around 15%. We also train two open‑source models and evaluate their generalization abilities on held‑out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at https://github.com/zjunlp/WorfBench.

PaperID: 1691, https://arxiv.org/pdf/2410.07484.pdf GitHub

Authors: Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang

Title: WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

Abstract:
Can large language models (LLMs) directly serve as powerful world models for model‑based agents? While the gaps between the prior knowledge of LLMs and the specified environment's dynamics do exist, our study reveals that the gaps can be bridged by aligning an LLM with its deployed environment and such "world alignment" can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient‑free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent‑explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent "WALL‑E" is built upon model‑predictive control (MPC). By optimizing look‑ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL‑E's reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open‑world challenges in Minecraft and ALFWorld, WALL‑E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL‑E exceeds baselines by 15‑30% in success rate while costing 8‑20 fewer replanning rounds and only 60‑80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% only after 6 iterations.

PaperID: 1692, https://arxiv.org/pdf/2410.06555.pdf GitHub

Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang

Title: ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Abstract:
As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting‑edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi‑step planning based on spatial relationships in images. To bridge this gap, we present ING‑VP, the first INteractive Game‑based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi‑step reasoning abilities of MLLMs. ING‑VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image‑text vs. text‑only inputs, single‑step vs. multi‑step reasoning, and with‑history vs. without‑history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state‑of‑the‑art MLLMs, with the highest‑performing model, Claude‑3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING‑VP.git.

PaperID: 1693, https://arxiv.org/pdf/2410.06372.pdf GitHub

Authors: Milad Farjadnasab, Shahin Sirouspour

Title: Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots

Abstract:
Cooperative mission planning for heterogeneous teams of mobile robots presents a unique set of challenges, particularly when operating under communication constraints and limited computational resources. To address these challenges, we propose the Cooperative and Asynchronous Transformer‑based Mission Planning (CATMiP) framework, which leverages multi‑agent reinforcement learning (MARL) to coordinate distributed decision making among agents with diverse sensing, motion, and actuation capabilities, operating under sporadic ad hoc communication. A Class‑based Macro‑Action Decentralized Partially Observable Markov Decision Process (CMacDec‑POMDP) is also formulated to effectively model asynchronous decision‑making for heterogeneous teams of agents. The framework utilizes an asynchronous centralized training and distributed execution scheme, enabled by the proposed Asynchronous Multi‑Agent Transformer (AMAT) architecture. This design allows a single trained model to generalize to larger environments and accommodate varying team sizes and compositions. We evaluate CATMiP in a 2D grid‑world simulation environment and compare its performance against planning‑based exploration methods. Results demonstrate CATMiP's superior efficiency, scalability, and robustness to communication dropouts and input noise, highlighting its potential for real‑world heterogeneous mobile robot systems. The code is available at https://github.com/mylad13/CATMiP

PaperID: 1694, https://arxiv.org/pdf/2410.06153.pdf GitHub

Authors: Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li

Title: AgentSquare: Automatic LLM Agent Search in Modular Design Space

Abstract:
Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task‑specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in‑context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand‑crafted agents, achieving an average performance gain of 17.2% against best‑known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at https://github.com/tsinghua‑fib‑lab/AgentSquare.

PaperID: 1695, https://arxiv.org/pdf/2410.05669.pdf GitHub

Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi

Title: ACPBench: Reasoning about Action, Change, and Planning

Abstract:
There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi‑step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi‑turn reasoning model, reveal significant gains in performance on multiple‑choice questions, yet surprisingly, no notable progress is made on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.

PaperID: 1696, https://arxiv.org/pdf/2410.05481.pdf GitHub

Authors: Weijia Xu, Nebojsa Jojic, Nicolas Le Roux

Title: fLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Abstract:
Humans can learn to solve new tasks by inducing high‑level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high‑level structure from example documents or solutions? We introduce fLSA, a foundation‑model‑based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document‑level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi‑step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions that lead to correct solutions more often than direct sampling and hierarchical sampling with existing tagging methods. Code: https://github.com/microsoft/fLSA

PaperID: 1697, https://arxiv.org/pdf/2410.04452.pdf GitHub

Authors: Zhentao Xie, Jiabao Zhao, Yilei Wang, Jinxin Shi, Yanhong Bai, Xingjiao Wu, Liang He

Title: MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems

Abstract:
Detecting cognitive biases in large language models (LLMs) is a fascinating task that aims to probe the existing cognitive biases within these models. Current methods for detecting cognitive biases in language models generally suffer from incomplete detection capabilities and a restricted range of detectable bias types. To address this issue, we introduced the 'MindScope' dataset, which distinctively integrates static and dynamic elements. The static component comprises 5,170 open‑ended questions spanning 72 cognitive bias categories. The dynamic component leverages a rule‑based, multi‑agent communication framework to facilitate the generation of multi‑round dialogues. This framework is flexible and readily adaptable for various psychological experiments involving LLMs. In addition, we introduce a multi‑agent detection method applicable to a wide range of detection tasks, which integrates Retrieval‑Augmented Generation (RAG), competitive debate, and a reinforcement learning‑based decision module. Demonstrating substantial effectiveness, this method has shown to improve detection accuracy by as much as 35.10% compared to GPT‑4. Codes and appendix are available at https://github.com/2279072142/MindScope.

PaperID: 1698, https://arxiv.org/pdf/2410.03963.pdf GitHub

Authors: Mehrad Ansari, Jeffrey Watchorn, Carla E. Brown, Joseph S. Brown

Title: dZiner: Rational Inverse Design of Materials with AI Agents

Abstract:
Recent breakthroughs in machine learning and artificial intelligence, fueled by scientific data, are revolutionizing the discovery of new materials. Despite the wealth of existing scientific literature, the availability of both structured experimental data and chemical domain knowledge that can be easily integrated into data‑driven workflows is limited. The motivation to integrate this information, as well as additional context from first‑principle calculations and physics‑informed deep learning surrogate models, is to enable efficient exploration of the relevant chemical space and to predict structure‑property relationships of new materials a priori. Ultimately, such a framework could replicate the expertise of human subject‑matter experts. In this work, we present dZiner, a chemist AI agent, powered by large language models (LLMs), that discovers new compounds with desired properties via inverse design (property‑to‑structure). In specific, the agent leverages domain‑specific insights from foundational scientific literature to propose new materials with enhanced chemical properties, iteratively evaluating them using relevant surrogate models in a rational design process, while accounting for design constraints. The model supports both closed‑loop and human‑in‑the‑loop feedback cycles, enabling human‑AI collaboration in molecular design with real‑time property inference, and uncertainty and chemical feasibility assessment. We demonstrate the flexibility of this agent by applying it to various materials target properties, including surfactants, ligand and drug candidates, and metal‑organic frameworks. Our approach holds promise to both accelerate the discovery of new materials and enable the targeted design of materials with desired functionalities. The methodology is available as an open‑source software on https://github.com/mehradans92/dZiner.

PaperID: 1699, https://arxiv.org/pdf/2410.03524.pdf GitHub

Authors: Yongchao Chen, Harsh Jhamtani, Srinagesh Sharma, Chuchu Fan, Chi Wang

Title: Steering Large Language Models between Code Execution and Textual Reasoning

Abstract:
While a lot of recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing the multi‑agent framework or reasoning chains, several benchmark tasks can be solved with 100% success through direct coding, which is more scalable and avoids the computational overhead associated with textual iterating and searching. Textual reasoning has inherent limitations in solving tasks with challenges in math, logics, optimization, and searching, which is unlikely to be solved by simply scaling up the model and data size. The recently released OpenAI GPT Code Interpreter and multi‑agent frameworks such as AutoGen have demonstrated remarkable proficiency of integrating code generation and execution to solve complex tasks using LLMs. However, based on our experiments on 7 existing popular methods for steering code/text generation in both single‑ and multi‑turn settings with 14 tasks and 6 types of LLMs (including the new O1‑preview), currently there is no optimal method to correctly steer LLMs to write code when needed. We discover some interesting patterns on when models use code vs. textual reasoning with the evolution to task complexity and model sizes, which even result in an astonishingly inverse scaling behavior. We also discover that results from LLM written code are not always better than using textual reasoning, even if the task could be solved through code. To mitigate the above issues, we propose three methods to better steer LLM code/text generation and achieve a notable improvement. The costs of token lengths and runtime are thoroughly discussed for all the methods. We believe the problem of steering LLM code/text generation is critical for future research and has much space for further improvement. Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/.

PaperID: 1700, https://arxiv.org/pdf/2410.03225.pdf GitHub GitHub GitHub

Authors: Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

Title: AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Abstract:
Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among the others, penetration testing is a challenging field due to the task complexity and the diverse strategies to simulate cyber‑attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there remains a significant gap in the form of a comprehensive and standard framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty levels, including in‑vitro and real‑world scenarios. We assess the agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi‑autonomous supporting human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real‑world task. In contrast, the assisted agent demonstrates substantial improvements, with 64% of SR. AutoPenBench allows us also to observe how different LLMs like GPT‑4o or OpenAI o1 impact the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on a common ground. We hope to extend AutoPenBench along with the research community by making it available under https://github.com/lucagioacchini/auto‑pen‑bench.

PaperID: 1701, https://arxiv.org/pdf/2410.03117.pdf GitHub

Authors: Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai

Title: ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

Abstract:
Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi‑step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi‑step inference. To this end, we design a special reasoning task where multi‑step inference is specifically focused by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state‑of‑the‑art LLMs' ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step‑aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs in reasoning tasks. Our findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. Our dataset is available at \urlhttps://huggingface.co/datasets/ifujisawa/procbench and code at \urlhttps://github.com/ifujisawa/proc‑bench.

PaperID: 1702, https://arxiv.org/pdf/2410.02958.pdf GitHub

Authors: Patara Trirat, Wonyong Jeong, Sung Ju Hwang

Title: AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

Abstract:
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time‑consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non‑expert users to build their data‑driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML‑Agent, a novel multi‑agent framework tailored for full‑pipeline AutoML, i.e., from data retrieval to model deployment. AutoML‑Agent takes user's task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment‑ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval‑augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub‑tasks (e.g., data preprocessing and neural network design) each of which is solved by a specialized agent we build via prompting executing in parallel, making the search process more efficient. Moreover, we propose a multi‑stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML‑Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.

PaperID: 1703, https://arxiv.org/pdf/2410.02810.pdf GitHub

Authors: Nikolai Rozanov, Marek Rei

Title: StateAct: Enhancing LLM Base Agents via Self-prompting and State-tracking

Abstract:
Large language models (LLMs) are increasingly used as autonomous agents, tackling tasks from robotics to web navigation. Their performance depends on the underlying base agent. Existing methods, however, struggle with long‑context reasoning and goal adherence. We introduce StateAct, a novel and efficient base agent that enhances decision‑making through (1) self‑prompting, which reinforces task goals at every step, and (2) chain‑of‑states, an extension of chain‑of‑thought that tracks state information over time. StateAct outperforms ReAct, the previous best base agent, by over 10% on Alfworld, 30% on Textcraft, and 7% on Webshop across multiple frontier LLMs. We also demonstrate that StateAct can be used as a drop‑in replacement for ReAct with advanced LLM agent methods such as test‑time scaling, yielding an additional 12% gain on Textcraft. By improving efficiency and long‑range reasoning without requiring additional training or retrieval, StateAct provides a scalable foundation for LLM agents. We open source our code to support further research at https://github.com/ai‑nikolai/stateact .

PaperID: 1704, https://arxiv.org/pdf/2410.02189.pdf GitHub

Authors: Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, Yaliang Li

Title: Agent-Oriented Planning in Multi-Agent Systems

Abstract:
Through the collaboration of multiple LLM‑empowered agents possessing diverse expertise and tools, multi‑agent systems achieve impressive progress in solving real‑world problems. Given the user queries, the meta‑agents, serving as the brain within multi‑agent systems, are required to decompose the queries into multiple sub‑tasks that can be allocated to suitable agents capable of solving them, so‑called agent‑oriented planning. In this study, we identify three critical design principles of agent‑oriented planning, including solvability, completeness, and non‑redundancy, to ensure that each sub‑task can be effectively resolved, resulting in satisfactory responses to user queries. These principles further inspire us to propose AOP, a novel framework for agent‑oriented planning in multi‑agent systems, leveraging a fast task decomposition and allocation process followed by an effective and efficient evaluation via a reward model. According to the evaluation results, the meta‑agent is also responsible for promptly making necessary adjustments to sub‑tasks and scheduling. Besides, we integrate a feedback loop into AOP to further enhance the effectiveness and robustness of such a problem‑solving process. Extensive experiments demonstrate the advancement of AOP in solving real‑world problems compared to both single‑agent systems and existing planning strategies for multi‑agent systems. The source code is available at https://github.com/lalaliat/Agent‑Oriented‑Planning

PaperID: 1705, https://arxiv.org/pdf/2410.01707.pdf GitHub

Authors: Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen

Title: Interpretable Contrastive Monte Carlo Tree Search Reasoning

Abstract:
We propose SC‑MCTS: a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs), significantly improves both reasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM reasoning works often overlooked its biggest drawback‑‑slower speed compared to CoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on various tasks with limited quantitative analysis or ablation studies of its components from reasoning interpretability perspective. 3. The reward model is the most crucial component in MCTS, however previous work has rarely conducted in‑depth study or improvement of MCTS's reward models. Thus, we conducted extensive ablation studies and quantitative analysis on components of MCTS, revealing the impact of each component on the MCTS reasoning performance of LLMs. Building on this, (i) we designed a highly interpretable reward model based on the principle of contrastive decoding and (ii) achieved an average speed improvement of 51.9% per node using speculative decoding. Additionally, (iii) we improved UCT node selection strategy and backpropagation used in previous works, resulting in significant performance improvement. We outperformed o1‑mini by an average of 17.4% on the Blocksworld multi‑step reasoning dataset using Llama‑3.1‑70B with SC‑MCTS. Our code is available at https://github.com/zitian‑gao/SC‑MCTS.

PaperID: 1706, https://arxiv.org/pdf/2409.19734.pdf GitHub

Authors: Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu

Title: T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Abstract:
To address the risks of encountering inappropriate or harmful content, researchers managed to incorporate several harmful contents datasets with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated by the presence of a narrow range of harmful objects, and only cover real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definition. We also propose a novel annotation framework by formulating the annotation process as a multi‑agent Visual Question Answering (VQA) task, having 3 different VLMs "debate" about whether the given image/video is harmful, and incorporating the in‑context learning strategy in the debating process. Therefore, we can ensure that the VLMs consider the context of the given image/video and both sides of the arguments thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the great alignment between the annotation from our novel annotation framework and those from human, ensuring the reliability of VHD11K; (2) our full‑spectrum harmful dataset successfully identifies the inability of existing harmful content detection methods to detect extensive harmful contents and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the superior improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu‑eva‑lab/VHD11K.

PaperID: 1707, https://arxiv.org/pdf/2409.18538.pdf GitHub

Authors: Mareike Hartmann, Alexander Koller

Title: A Survey on Complex Tasks for Goal-Directed Interactive Agents

Abstract:
Goal‑directed interactive agents, which autonomously complete tasks through interactions with their environment, can assist humans in various domains of their daily lives. Recent advances in large language models (LLMs) led to a surge of new, more and more challenging tasks to evaluate such agents. To properly contextualize performance across these tasks, it is imperative to understand the different challenges they pose to agents. To this end, this survey compiles relevant tasks and environments for evaluating goal‑directed interactive agents, structuring them along dimensions relevant for understanding current obstacles. An up‑to‑date compilation of relevant resources can be found on our project website: https://coli‑saar.github.io/interactive‑agents.

PaperID: 1708, https://arxiv.org/pdf/2409.18313.pdf GitHub

Authors: Quanting Xie, So Yeon Min, Pengliang Ji, Yue Yang, Tianyi Zhang, Kedi Xu, Aarav Bajaj, Ruslan Salakhutdinov, Matthew Johnson-Roberson, Yonatan Bisk

Title: Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Abstract:
There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large‑scale non‑parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, which is multimodal, where data is highly correlated, and perception requires abstraction. To address these challenges, we introduce Embodied‑RAG, a framework that enhances the foundational model of an embodied agent with a non‑parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied‑RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied‑RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context‑sensitive outputs across different robotic platforms. We demonstrate that Embodied‑RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries across kilometer‑level environments, highlighting its promise as a general‑purpose non‑parametric system for embodied agents.

PaperID: 1709, https://arxiv.org/pdf/2409.17655.pdf GitHub

Authors: Nan Sun, Bo Mao, Yongchang Li, Di Guo, Huaping Liu

Title: AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment

Abstract:
Current service robots suffer from limited natural language communication abilities, heavy reliance on predefined commands, ongoing human intervention, and, most notably, a lack of proactive collaboration awareness in human‑populated environments. This results in narrow applicability and low utility. In this paper, we introduce AssistantX, an LLM‑powered proactive assistant designed for autonomous operation in realworld scenarios with high accuracy. AssistantX employs a multi‑agent framework consisting of 4 specialized LLM agents, each dedicated to perception, planning, decision‑making, and reflective review, facilitating advanced inference capabilities and comprehensive collaboration awareness, much like a human assistant by your side. We built a dataset of 210 real‑world tasks to validate AssistantX, which includes instruction content and status information on whether relevant personnel are available. Extensive experiments were conducted in both text‑based simulations and a real office environment over the course of a month and a half. Our experiments demonstrate the effectiveness of the proposed framework, showing that AssistantX can reactively respond to user instructions, actively adjust strategies to adapt to contingencies, and proactively seek assistance from humans to ensure successful task completion. More details and videos can be found at https://assistantx‑agent.github.io/AssistantX/.

PaperID: 1710, https://arxiv.org/pdf/2409.16165.pdf GitHub GitHub

Authors: Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press

Title: EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Abstract:
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web‑browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrate that these new tools and interfaces substantially improve our agent's performance, achieving state‑of‑the‑art results on NYU CTF, Intercode‑CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self‑generates hallucinated observations without interacting with the environment. Our code and development dataset are available at https://github.com/SWE‑agent/SWE‑agent/tree/v0.7 and https://github.com/NYU‑LLM‑CTF/NYU_CTF_Bench/tree/main/development respectively.

PaperID: 1711, https://arxiv.org/pdf/2409.14908.pdf GitHub

Authors: Zixuan Wang, Bo Yu, Junzhe Zhao, Wenhao Sun, Sai Hou, Shuai Liang, Xing Hu, Yinhe Han, Yiming Gan

Title: KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems

Abstract:
Embodied AI agents responsible for executing interconnected, long‑sequence household tasks often face difficulties with in‑context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long‑term and short‑term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory‑augmented prompting. KARMA distinguishes between long‑term and short‑term memory, with long‑term memory capturing comprehensive 3D scene graphs as representations of the environment, while short‑term memory dynamically records changes in objects' positions and states. This dual‑memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short‑term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state‑of‑the‑art embodied agents enhanced with memory, our memory‑augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks within the AI2‑THOR simulator, respectively, and enhances task execution efficiency by 3.4x and 62.7x. Furthermore, we demonstrate that KARMA's plug‑and‑play capability allows for seamless deployment on real‑world robotic systems, such as mobile manipulation platforms.Through this plug‑and‑play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. The experimental videos from the work can be found at https://youtu.be/4BT7fnw9ehs. Our code is available at https://github.com/WZX0Swarm0Robotics/KARMA/tree/master.

PaperID: 1712, https://arxiv.org/pdf/2409.14324.pdf GitHub

Authors: Hung-Ting Su, Ya-Ching Hsu, Xudong Lin, Xiang-Qian Shi, Yulei Niu, Han-Yuan Hsu, Hung-yi Lee, Winston H. Hsu

Title: Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Abstract:
Large language models (LLMs) equipped with chain‑of‑thoughts (CoT) prompting have shown significant multi‑step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state‑of‑the‑art LLMs and uncovers their low performance. We introduce a trope‑wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi‑step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT‑4's performance. We also introduce an Adversarial Injection method to embed trope‑related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.

PaperID: 1713, https://arxiv.org/pdf/2409.13700.pdf GitHub

Authors: Yuqian Wu, Yuhong Peng, Jiapeng Yu, Raymond S. T. Lee

Title: MAS4POI: a Multi-Agents Collaboration System for Next POI Recommendation

Abstract:
LLM‑based Multi‑Agent Systems have potential benefits of complex decision‑making tasks management across various domains but their applications in the next Point‑of‑Interest (POI) recommendation remain underexplored. This paper proposes a novel MAS4POI system designed to enhance next POI recommendations through multi‑agent interactions. MAS4POI supports Large Language Models (LLMs) specializing in distinct agents such as DataAgent, Manager, Analyst, and Navigator with each contributes to a collaborative process of generating the next POI recommendations.The system is examined by integrating six distinct LLMs and evaluated by two real‑world datasets for recommendation accuracy improvement in real‑world scenarios. Our code is available at https://github.com/yuqian2003/MAS4POI.

PaperID: 1714, https://arxiv.org/pdf/2409.11741.pdf GitHub

Authors: Huawen Hu, Enze Shi, Chenxi Yue, Shuocun Yang, Zihao Wu, Yiwei Li, Tianyang Zhong, Tuo Zhang, Tianming Liu, Shu Zhang

Title: HARP: Human-Assisted Regrouping with Permutation Invariant Critic for Multi-Agent Reinforcement Learning

Abstract:
Human‑in‑the‑loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single‑agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human‑Assisted Regrouping with Permutation Invariant Critic), a multi‑agent reinforcement learning framework designed for group‑oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, enabling and allowing non‑experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human‑proposed groupings, allowing non‑expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non‑experts and enhance performance. The project can be found at https://github.com/huawen‑hu/HARP.

PaperID: 1715, https://arxiv.org/pdf/2409.11602.pdf GitHub

Authors: Chuxuan Zhang, Bermet Burkanova, Lawrence H. Kim, Lauren Yip, Ugo Cupcic, Stéphane Lallée, Angelica Lim

Title: React to This! How Humans Challenge Interactive Agents using Nonverbal Behaviors

Abstract:
How do people use their faces and bodies to test the interactive abilities of a robot? Making lively, believable agents is often seen as a goal for robots and virtual agents but believability can easily break down. In this Wizard‑of‑Oz (WoZ) study, we observed 1169 nonverbal interactions between 20 participants and 6 types of agents. We collected the nonverbal behaviors participants used to challenge the characters physically, emotionally, and socially. The participants interacted freely with humanoid and non‑humanoid forms: a robot, a human, a penguin, a pufferfish, a banana, and a toilet. We present a human behavior codebook of 188 unique nonverbal behaviors used by humans to test the virtual characters. The insights and design strategies drawn from video observations aim to help build more interaction‑aware and believable robots, especially when humans push them to their limits.

PaperID: 1716, https://arxiv.org/pdf/2409.11527.pdf GitHub

Authors: Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, Peyman Najafirad

Title: Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Abstract:
Multi‑agent strategies have emerged as a promising approach to enhance the reasoning abilities of Large Language Models (LLMs) by assigning specialized roles in the problem‑solving process. Concurrently, Tree of Thoughts (ToT) methods have shown potential in improving reasoning for complex question‑answering tasks by exploring diverse reasoning paths. A critical limitation in multi‑agent reasoning is the 'Reasoner' agent's shallow exploration of reasoning paths. While ToT strategies could help mitigate this problem, they may generate flawed reasoning branches, which could harm the trustworthiness of the final answer. To leverage the strengths of both multi‑agent reasoning and ToT strategies, we introduce a novel approach combining ToT‑based Reasoner agents with a Thought Validator agent. Multiple Reasoner agents operate in parallel, employing ToT to explore diverse reasoning paths. The Thought Validator then scrutinizes these paths, considering a Reasoner's conclusion only if its reasoning is valid. This method enables a more robust voting strategy by discarding faulty reasoning paths, enhancing the system's ability to tackle tasks requiring systematic and trustworthy reasoning. Our method demonstrates superior performance compared to existing techniques when evaluated on the GSM8K dataset, outperforming the standard ToT strategy by an average 5.6% across four LLMs. The code and related content can be found in: https://github.com/SecureAIAutonomyLab/MA‑ToT

PaperID: 1717, https://arxiv.org/pdf/2409.10044.pdf GitHub

Authors: Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

Title: Benchmarking Large Language Model Uncertainty for Prompt Optimization

Abstract:
Prompt optimization algorithms for Large Language Models (LLMs) excel in multi‑step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT‑3.5‑Turbo and Meta‑Llama‑3.1‑8B‑Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization‑objective‑aware to better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO‑Uncertainty‑Benchmarking.

PaperID: 1718, https://arxiv.org/pdf/2409.09455.pdf GitHub

Authors: Daniel Khalil, Christina Liu, Pietro Perona, Jennifer J. Sun, Markus Marks

Title: Learning Keypoints for Multi-Agent Behavior Analysis using Self-Supervision

Abstract:
The study of social interactions and collective behaviors through multi‑agent video analysis is crucial in biology. While self‑supervised keypoint discovery has emerged as a promising solution to reduce the need for manual keypoint annotations, existing methods often struggle with videos containing multiple interacting agents, especially those of the same species and color. To address this, we introduce B‑KinD‑multi, a novel approach that leverages pre‑trained video segmentation models to guide keypoint discovery in multi‑agent scenarios. This eliminates the need for time‑consuming manual annotations on new experimental settings and organisms. Extensive evaluations demonstrate improved keypoint regression and downstream behavioral classification in videos of flies, mice, and rats. Furthermore, our method generalizes well to other species, including ants, bees, and humans, highlighting its potential for broad applications in automated keypoint annotation for multi‑agent behavior analysis. Code available under: https://danielpkhalil.github.io/B‑KinD‑Multi

PaperID: 1719, https://arxiv.org/pdf/2409.07246.pdf GitHub

Authors: Firoj Alam, Md. Rafiul Biswas, Uzair Shah, Wajdi Zaghouani, Georgios Mikros

Title: Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs

Abstract:
In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi‑agent LLM‑based approach. We extend the propagandistic meme dataset with coarse and fine‑grained hate labels. Our finding suggests that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community (https://github.com/firojalam/propaganda‑and‑hateful‑memes).

PaperID: 1720, https://arxiv.org/pdf/2409.07129.pdf GitHub

Authors: Hanyu Jiang, Jian Xue, Xing Lan, Guohong Hu, Ke Lu

Title: MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Abstract:
This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi‑view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. MVLLaVA represents a versatile and unified platform that adapts to diverse input types, including a single image, a descriptive caption, or a specific change in viewing azimuth, guided by language instructions for viewpoint generation. We carefully craft task‑specific instruction templates, which are subsequently used to fine‑tune LLaVA. As a result, MVLLaVA acquires the capability to generate novel view images based on user instructions, demonstrating its flexibility across diverse tasks. Experiments are conducted to validate the effectiveness of MVLLaVA, demonstrating its robust performance and versatility in tackling diverse novel view synthesis challenges.

PaperID: 1721, https://arxiv.org/pdf/2409.05586.pdf GitHub

Authors: Arda Sarp Yenicesu, Sepehr Nourmohammadi, Berk Cicek, Ozgur S. Oguz

Title: Interpretable Responsibility Sharing as a Heuristic for Task and Motion Planning

Abstract:
This article introduces a novel heuristic for Task and Motion Planning (TAMP) named Interpretable Responsibility Sharing (IRS), which enhances planning efficiency in domestic robots by leveraging human‑constructed environments and inherent biases. Utilizing auxiliary objects (e.g., trays and pitchers), which are commonly found in household settings, IRS systematically incorporates these elements to simplify and optimize task execution. The heuristic is rooted in the novel concept of Responsibility Sharing (RS), where auxiliary objects share the task's responsibility with the embodied agent, dividing complex tasks into manageable sub‑problems. This division not only reflects human usage patterns but also aids robots in navigating and manipulating within human spaces more effectively. By integrating Optimized Rule Synthesis (ORS) for decision‑making, IRS ensures that the use of auxiliary objects is both strategic and context‑aware, thereby improving the interpretability and effectiveness of robotic planning. Experiments conducted across various household tasks demonstrate that IRS significantly outperforms traditional methods by reducing the effort required in task execution and enhancing the overall decision‑making process. This approach not only aligns with human intuitive methods but also offers a scalable solution adaptable to diverse domestic environments. Code is available at https://github.com/asyncs/IRS.

PaperID: 1722, https://arxiv.org/pdf/2409.05037.pdf GitHub

Authors: Zhen Lei, Zhishu Shen, Kang Wang, Zhenwei Wang, Tiehua Zhang

Title: DHLight: Multi-agent Policy-based Directed Hypergraph Learning for Traffic Signal Control

Abstract:
Recent advancements in Deep Reinforcement Learning (DRL) and Graph Neural Networks (GNNs) have demonstrated notable promise in the realm of intelligent traffic signal control, facilitating the coordination across multiple intersections. However, the traditional methods rely on standard graph structures often fail to capture the intricate higher‑order spatio‑temporal correlations inherent in real‑world traffic dynamics. Standard graphs cannot fully represent the spatial relationships within road networks, which limits the effectiveness of graph‑based approaches. In contrast, directed hypergraphs provide more accurate representation of spatial information to model complex directed relationships among multiple nodes. In this paper, we propose DHLight, a novel multi‑agent policy‑based framework that synergistically integrates directed hypergraph learning module. This framework introduces a novel dynamic directed hypergraph construction mechanism, which captures complex and evolving spatio‑temporal relationships among intersections in road networks. By leveraging the directed hypergraph relational structure, DHLight empowers agents to achieve adaptive decision‑making in traffic signal control. The effectiveness of DHLight is validated against state‑of‑the‑art baselines through extensive experiments in various network datasets. We release the code to support the reproducibility of this work at https://github.com/LuckyVoasem/Traffic‑Light‑control

PaperID: 1723, https://arxiv.org/pdf/2409.04617.pdf GitHub

Authors: Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, Yi Yang

Title: Sparse Rewards Can Self-Train Dialogue Agents

Abstract:
Recent advancements in state‑of‑the‑art (SOTA) Large Language Model (LLM) agents, especially in multi‑turn dialogue tasks, have been primarily driven by supervised fine‑tuning and high‑quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback‑driven methods impractical. In this paper, we introduce a novel self‑improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self‑alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool‑calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool‑based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub at https://github.com/asappresearch/josh‑llm‑simulation‑training

PaperID: 1724, https://arxiv.org/pdf/2409.03811.pdf GitHub

Authors: Federico Berto, Chuanbo Hua, Laurin Luttmann, Jiwoo Son, Junyoung Park, Kyuree Ahn, Changhyun Kwon, Lin Xie, Jinkyoo Park

Title: PARCO: Parallel AutoRegressive Models for Multi-Agent Combinatorial Optimization

Abstract:
Combinatorial optimization problems involving multiple agents are notoriously challenging due to their NP‑hard nature and the necessity for effective agent coordination. Despite advancements in learning‑based methods, existing approaches often face critical limitations, including suboptimal agent coordination, poor generalization, and high computational latency. To address these issues, we propose PARCO (Parallel AutoRegressive Combinatorial Optimization), a general reinforcement learning framework designed to construct high‑quality solutions for multi‑agent combinatorial tasks efficiently. To this end, PARCO integrates three key novel components: (1) transformer‑based communication layers to enable effective agent collaboration during parallel solution construction, (2) a multiple pointer mechanism for low‑latency, parallel agent decision‑making, and (3) priority‑based conflict handlers to resolve decision conflicts via learned priorities. We evaluate PARCO in multi‑agent vehicle routing and scheduling problems, where our approach outperforms state‑of‑the‑art learning methods, demonstrating strong generalization ability and remarkable computational efficiency. We make our source code publicly available to foster future research: https://github.com/ai4co/parco.

PaperID: 1725, https://arxiv.org/pdf/2409.01893.pdf GitHub GitHub

Authors: Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin

Title: What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Abstract:
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self‑Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi‑hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi‑agent Interactive Multi‑hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single‑hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi‑hop Question Merger Agent. This framework improves the data quality, with the proportion of high‑quality, multi‑hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high‑quality long‑context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human‑annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.

PaperID: 1726, https://arxiv.org/pdf/2409.00985.pdf GitHub

Authors: Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, Raymond Lee

Title: Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces

Abstract:
Online question‑and‑answer (Q\&A) systems based on the Large Language Model (LLM) have progressively diverged from recreational to professional use. This paper proposed a Multi‑Agent framework with environmentally reinforcement learning (E‑RL) for code correction called Code Learning (Co‑Learning) community, assisting beginners to correct code errors independently. It evaluates the performance of multiple LLMs from an original dataset with 702 error codes, uses it as a reward or punishment criterion for E‑RL; Analyzes input error codes by the current agent; selects the appropriate LLM‑based agent to achieve optimal error correction accuracy and reduce correction time. Experiment results showed that 3% improvement in Precision score and 15% improvement in time cost as compared with no E‑RL method respectively. Our source code is available at: https://github.com/yuqian2003/Co_Learning

PaperID: 1727, https://arxiv.org/pdf/2409.00009.pdf GitHub

Authors: Jacob-Junqi Tian, Hao Yu, Yury Orlovskiy, Tyler Vergho, Mauricio Rivera, Mayank Goel, Zachary Yang, Jean-Francois Godbout, Reihaneh Rabbany, Kellin Pelrine

Title: Web Retrieval Agents for Evidence-Based Misinformation Detection

Abstract:
This paper develops an agent‑based automated fact‑checking approach for detecting misinformation. We demonstrate that combining a powerful LLM agent, which does not have access to the internet for searches, with an online web search agent yields better results than when each tool is used independently. Our approach is robust across multiple models, outperforming alternatives and increasing the macro F1 of misinformation detection by as much as 20 percent compared to LLMs without search. We also conduct extensive analyses on the sources our system leverages and their biases, decisions in the construction of the system like the search tool and the knowledge base, the type of evidence needed and its impact on the results, and other parts of the overall process. By combining strong performance with in‑depth understanding, we hope to provide building blocks for future search‑enabled misinformation mitigation systems.

PaperID: 1728, https://arxiv.org/pdf/2606.21963.pdf

Authors: Jia Li, Wenyuan Ma, Ting Peng, Haibin Zheng, Yuetang Deng

Title: Holmes: Multimodal Agentic Diagnosis for Mixed-Language Mobile Crashes at Industrial Scale

Abstract:
Diagnosing mobile crashes in ultra‑large‑scale industrial applications is a formidable challenge due to the sheer volume of code, the complexity of mixed‑language environments, and the inability to reproduce failures locally. Traditional static analysis struggles with scalability, while existing LLM‑based agents often rely on reproducible environments unavailable in post‑mortem scenarios. We present Holmes, a multi‑agent system that automates root cause analysis by synthesizing multimodal runtime signals‑‑stack traces, logs, and thread states‑‑to reconstruct failure contexts without reproduction. Holmes introduces a hierarchical Retrieve‑Explore‑Reason architecture that leverages low‑level artifacts (e.g., registers, assembly) to bridge the semantic gap between open‑source business logic and closed‑source system frameworks. By dynamically compressing the search space using runtime clues, Holmes precisely navigates 70‑million‑line codebases to identify non‑local defects. Evaluated on real‑world crashes from WeChat, Holmes achieves 87.6% accuracy in function‑level fault localization and reduces average investigation time by over 98% (to ~77 seconds), demonstrating its effectiveness in transforming labor‑intensive debugging into an efficient verification workflow.

PaperID: 1729, https://arxiv.org/pdf/2606.21877.pdf

Authors: Srimonti Dutta, Akshata Kishore Moharir

Title: AgentRiskBOM: A Risk-Scoping Security Bill of Materials for Agentic AI Systems

Abstract:
Agentic AI systems retrieve private context, invoke tools, write files, call external services, coordinate with other agents, and may act without human approval. Existing bill of materials artifacts improve transparency for dependencies, model metadata, and training provenance, but leave an agentic transparency gap: capability opacity, the absence of a structured account of what a deployed agent can access, remember, change, delegate, and prove afterward. This paper introduces AgentRiskBOM, a security BOM for risk‑scoping tool‑using AI agents. It is an additive layer over SBOM, AIBOM, and MLBOM artifacts, referencing them where authoritative while adding fields for runtime authority: autonomy, tool permissions, memory, credential scope, approval gates, audit signals, inter‑agent communication, and external action capability. We implement AgentRiskBOM as a JSON‑schema artifact with a reproducible corpus, risk scenarios, scorer, diff detector, control mapper, and reports. We evaluate AgentRiskBOM on 13 open‑source agents spanning coding, RAG, and multi‑agent archetypes, plus 52 risk scenarios across 14 categories. The schema validates all 13 corpus artifacts. Coverage analysis gives AgentRiskBOM a native‑equivalent score of 14 across 16 capability dimensions, vs. 1 for SBOM, 1.5 for AIBOM and 2 for MLBOM. Across modeled risk categories, AgentRiskBOM exposes 100% risk‑category visibility vs. 10.5% for SBOM‑like and 20.9% for AIBOM‑like views. To test agentic authority drift, we inject 33 structured deployment mutations; the diff detector identifies the correct change type for all mutations. A secondary penalty‑based scorer yields a Spearman correlation of 0.73 with the primary scorer, supporting rank‑level consistency while showing that thresholds require human calibration. The results show that agentic AI security needs a machine‑readable authority‑and‑risk artifact before incidents occur.

PaperID: 1730, https://arxiv.org/pdf/2606.21843.pdf

Authors: Andrew Tanner

Title: Measuring What Persists: Conditioning Mechanisms and a Geometric Framework for AI Agent Identity

Abstract:
AI agents in long‑context applications drift from their specified identity. Current methods detect this only after qualitative degradation is visible. We present a geometric framework for measuring identity structure using \sqrt\mathrmJSD metric spaces and magnitude homology from enriched category theory, where identity is non‑geodesic structure and drift is its relaxation toward the geodesic. Validated on a persistent AI agent, the framework's strongest empirical finding is a two‑mechanism conditioning structure: cross‑condition distances reveal an identity‑vacuum cluster where the identity specification fills a behavioral void, and a safety‑basin cluster where it displaces from post‑training attractors. An equilateral probe baseline confirms that the identity specification creates measurable behavioral richness (55 unique response patterns vs. 1 for the base model) at maximum probe separation. A first‑order perturbation theory for equilateral configurations predicts magnitude changes from perimeter changes alone, with shape perturbations first‑order cancelled by the S_n symmetry; the formula is self‑consistent at the observed perturbation amplitudes. A drift experiment measuring magnitude decrease under context pressure was subsequently found to reflect repetitive‑padding artifacts rather than genuine context‑length drift; diverse padding produces no measurable deformation through 150K tokens. The magnitude homology framework's full diagnostic promise ‑‑ detecting anisotropic contraction and structural collapse via homological simplification ‑‑ is architecturally grounded in the perturbation theory and selection rules but remains empirically unconfirmed.

PaperID: 1731, https://arxiv.org/pdf/2606.21832.pdf

Authors: Weiyuan Zhou, Haiping Ma, Xiaoshan Yu, Changqian Wang, Shangshang Yang, Xingyi Zhang

Title: AgentCAT: Simulating Computerized Adaptive Testing via Multi-Agent Large Language Models

Abstract:
Computerized Adaptive Testing (CAT), as a key technology for personalized education, aims to accurately assess examinee proficiency by retrieving exercises dynamically matching current ability estimates. However, existing CAT research is constrained by limitations of static offline data and isolated component optimization. Restricted by partial labels in offline logs, researchers degrade the dynamic assessment process into static sequence prediction. Current research focuses on isolated perspectives, e.g., selection or diagnosis, neglecting the overall CAT interaction process. To address this, we propose AgentCAT, a Large Language Model‑based multi‑agent simulation system, to construct a high‑fidelity benchmarking environment for dynamic testing. This framework comprises three modules: (1) The examinee agent with memory retrieval and Chain‑of‑Thought reasoning simulates responses based on cognitive profiles; (2) The selection agent uses coarse‑to‑fine bucketing and knowledge graph exploration to balance local difficulty and global coverage; (3) The supervisor uses dual‑auditing and robust update to ensure convergence and validity. To validate the framework, we evaluated on two real‑world datasets across three dimensions: macro‑level ability convergence, micro‑level interaction logic, and data sparsity resilience. Results show AgentCAT achieves effective ability estimation, and its selection strategy balances difficulty adaptation and instructional coherence, aligning with human pedagogical intuition.

PaperID: 1732, https://arxiv.org/pdf/2606.21819.pdf

Authors: Yifan Yang, Wenjing Gong, Kaili Zhang, Lei Zou, Zhengzhong Tu, Hao Li, Zongrong Li, Xinyue Ye

Title: RAPID: A Reproducible Multi-Agent Pipeline for Interpretable Disaster Damage Assessment from Satellite and Street-View Imagery

Abstract:
Due to the increasing frequency and intensity of extreme climate events, there is a clear demand for intelligent, scalable, and autonomous approaches to disaster damage assessment. Existing methods, largely based on supervised learning and task‑specific fine‑tuning, struggle to generalize under domain shifts, long‑tailed data distributions, and heterogeneous geospatial data sources, especially in disaster scenarios. They also often lack the ability to integrate and reason across multimodal geospatial information, such as satellite images and street‑view images. In this paper, we introduce RAPID, a reproducible multi‑agent pipeline for interpretable disaster damage assessment, including damage‑level assessment, damage‑type interpretation, and actionable suggestions for response, remediation, and recovery. RAPID coordinates specialized agents to perform cross‑view understanding, image restoration, structured damage recognition, and geographical reasoning across heterogeneous data modalities. Without task‑specific fine‑tuning, RAPID supports zero‑shot damage assessment by jointly using complementary information from remote sensing and ground‑level perspectives. The system produces fine‑grained, interpretable assessments and automatically generates location‑specific, decision‑relevant disaster reports to support early‑stage emergency response. We evaluate RAPID across hurricanes, floods, wildfires, and earthquakes using multiple cross‑view imagery inputs, including pre‑ and post‑disaster street‑view images, post‑disaster remote sensing imagery, and street‑view image pairs. Experiments show that RAPID achieves 0.92 overall accuracy for multi‑disaster type classification and up to 0.627 for cross‑view damage severity prediction, highlighting its potential as a foundational framework for autonomous disaster intelligence.

PaperID: 1733, https://arxiv.org/pdf/2606.21784.pdf

Authors: Shakya Jayakody, Prarthinie Jayakody

Title: KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators

Abstract:
Simulating financial markets at scale with multi‑agent (Agent‑Based) models is critical for market design, regulatory stress‑testing, and reinforcement learning, but traditional CPU simulators are bottlenecked by sequential processing while vectorized GPU frameworks suffer from kernel‑launch overhead and redundant global‑memory round‑trips. We formalize, analyze, and evaluate a reusable parallel design pattern: persistent, state‑carrying clearing for iterative multi‑agent reductions. By caching mutable simulation state in thread‑block shared memory across step boundaries, aggregating agent actions via shared‑memory atomics, and resolving the clearing function cooperatively, the pattern reduces the per‑step critical‑path depth from Theta(L+A) for sequential clearing (L price‑grid ticks, A agents) to Theta(log L + ceil(A/L)) and makes global‑memory traffic independent of the step count. We implement this in KineticSim, a lightweight GPU execution engine that simulates massive ensembles of limit‑order books in parallel, reaching a peak throughput of over 54.7 billion agent‑events per second. On a fixed workload it delivers speedups of 3406x over CPU (NumPy), 27.8x over PyTorch GPU, 42.8x over JAX GPU, and 8.4x over a naive custom CUDA baseline, while using roughly an order of magnitude less GPU memory than PyTorch. Across 53 configurations the two custom CUDA engines produce bitwise‑identical order books, and aggregate statistics match the CPU reference to within 0.1%. The pattern generalizes to other iterative multi‑agent workloads requiring state‑persistent, block‑localized reductions.

PaperID: 1734, https://arxiv.org/pdf/2606.21740.pdf

Authors: Rajesh Mangannavar, Zachary Coalson, Pranay Dugar, Prasad Tadepalli

Title: Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Abstract:
Translating natural‑language planning intent into verified plans is a longstanding challenge: people communicate goals in language, while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier‑checked refinement loop, but the orchestrator at the centre is itself a prompted frontier LLM, paying a frontier‑LLM API call at every refinement step. We present HALO (Hybrid Agent‑Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans, across 11 PDDL domains. HALO pairs a small QLoRA‑tuned policy with three hardcoded rules for trivially decidable selections, and operates over an expanded 21‑agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end‑of‑episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions, directly usable as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT‑5‑mini prompted baseline on success rate, sits within three percentage points of the stronger Gemini‑3‑Flash prompted baseline, reduces orchestration cost by more than an order of magnitude (\0.18 to \0.004 per task against GPT‑5‑mini, roughly 45× cheaper; roughly 15× cheaper than Gemini‑3‑Flash), and cuts total LLM calls per episode by 40 to 50 percent.

PaperID: 1735, https://arxiv.org/pdf/2606.21732.pdf

Authors: Zesen Liu, Zihan Zhang, Dongdong She

Title: Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Abstract:
Summarization‑based prompt compression is increasingly used by LLM agents to shorten long, distributed contexts, but it shifts the security boundary: filters inspect the pre‑compression prompt while the backend acts on a newly generated compressed context. We identify relinking, a compression‑boundary vulnerability where the compressor behaves as a confused deputy, summarizing distributed, locally benign fragments into a complete malicious instruction. Unlike prompt injection, relinking need not place an explicitly malicious payload in the source context. We show that relinking arises from summarization itself: attention makes separated fragments jointly available, pre‑training makes compatible fragments plausible to connect, and post‑training favors compact backend‑actionable summaries. We formalize the attacker‑induced form as adversarial relinking and present Relink, an automated DSL‑based tool that splits malicious payloads into benign fragments while keeping the complete payload absent before compression. Across four long‑context agent benchmarks, Relink achieves 86.9% Relink Rate and Backend Action Rate versus 17.0% for clean‑split controls. Existing defenses fail to reliably capture adversarial relinking; our KBRA defense reduces residual Backend Action Rate to 0.0%.

PaperID: 1736, https://arxiv.org/pdf/2606.21724.pdf

Authors: Shen Yin, David Ken, Joel Stremmel

Title: Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning

Abstract:
Large language models produce fluent but often incorrect multi‑step reasoning, and naive correction methods risk degrading already‑correct answers. We introduce Denoising Iterative Self‑Correction (DISC), a test‑time procedure that treats verification question outputs as noisy measurements of where a solution may be corrupted. Using these signals, DISC progressively reduces errors across multiple verify‑judge‑correct passes, analogous to traditional iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already‑correct answers while the verifier and corrector together repair errors. We evaluate this trade‑off using two paired diagnostics: an improvement‑to‑degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG‑Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC dominates Chain‑of‑Verification and Self‑Refine on the precision‑recall trade‑off, reaching 81.6% accuracy with 13x more improvements per degradation than Chain‑of‑Verification and 5x more than Self‑Refine on BIG‑Bench Mistake (Sonnet~4.5). On GPQA Diamond, we identify a capability floor below which judges acknowledge contradictions in evidence but cannot translate that recognition into a correction. We further show that cross‑model role allocation ‑‑ assigning verification and judgment to a model different from the generator ‑‑ mitigates self‑confirmation bias.

PaperID: 1737, https://arxiv.org/pdf/2606.21710.pdf

Authors: Manveer Singh Tamber, Abhay Puri, Marc-Etienne Brunet, Perouz Taslakian, Jimmy Lin, Spandana Gella

Title: PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract:
AI agents acting on behalf of users are constantly making decisions, and for users to trust their agents, those decisions must align with what they actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators across diverse scenarios where current LLMs actually leak, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation‑conditioned reward modeling, which uses these annotations to score new responses during RL, and show that small open‑weight agents trained with this reward better align with human privacy norms, with strong gains on PrivacyAlign and existing privacy benchmarks for agents.

PaperID: 1738, https://arxiv.org/pdf/2606.21704.pdf

Authors: Siyang Lyu, Zhijing Sun, Xinghao Chen, Tong Liu, Dawei Zhu, Xiaoyu Shen

Title: When Compression Helps and When It Hurts: Condition-Aware Analysis of Chain-of-Thought Distillation

Abstract:
Chain‑of‑Thought (CoT) distillation transfers multi‑step reasoning from large reasoning models to smaller students, but verbose teacher traces inflate both training and inference cost. Existing CoT compression methods fall into two families, selective pruning and generative rewriting, yet prior studies have left key factors entangled: granularity is confounded with importance criteria in pruning, restructuring level is rarely isolated in rewriting, and compression budgets are not systematically evaluated across domains or regimes. We recast CoT compression along three dimensions: importance criterion, restructuring level, and compression budget. Sweeping these across two model families, Math and General domains, and Long‑/Short‑CoT regimes, we find that (i) importance criterion utility is strictly governed by granularity: step‑level criteria converge on a shared reasoning backbone, while token‑level pruning requires symbol‑aware signals to preserve the logical core; (ii) restructuring level inverts across domains: Math degrades monotonically with structural disruption, while aggressive rewriting acts as a denoiser on General tasks; (iii) training‑time compression does not necessarily translate to inference‑time savings: Long‑CoT students retain verbose habits despite concise supervision, making the training ratio an optimistic lower bound on deployment cost. These findings yield condition‑aware guidelines for matching compression to deployment context.

PaperID: 1739, https://arxiv.org/pdf/2606.21666.pdf

Authors: Carson Rodrigues

Title: Hallucination as Context Drift: Synchronization Protocols for Multi-Agent LLM Systems

Abstract:
Multi‑agent LLM systems routinely produce hallucinated outputs that cannot be explained by model deficiencies alone. A significant class of these failures arises not from model incapacity but from context drift: the divergence of internal knowledge states between concurrent agents. When agents enter a collaborative task with mismatched or stale representations of shared world state, their joint reasoning produces contradictions that manifest as hallucination. We define the Context Divergence Score (CDS), a lightweight scalar metric quantifying knowledge‑state discrepancy between agent pairs across spatial, temporal, and task dimensions, and propose the Shared State Verification Protocol (SSVP), which lets agents periodically exchange compressed state summaries and flag high‑divergence conditions before joint reasoning. We evaluate SSVP across two domains (multi‑agent travel and software project planning) using Claude Haiku. In controlled experiments (n=30 per condition, travel; n=10, software) across 8 scenarios, naive full‑broadcast synchronization increases hallucination rate by 34% above the no‑sync baseline (HR: 0.658 vs. 0.492, p=0.0022, d=1.18), a contamination effect from propagating erroneous agent states. SSVP avoids this failure mode while showing modest, consistent reduction (HR: 0.463, d=0.30) and achieves significantly lower hallucination than full‑broadcast (p=0.0005, d=1.47) using 58% fewer API calls. The contamination effect does not replicate in the software domain, where all conditions converge to low HR (<0.2), confirming it is specific to tasks where one erroneous shared belief cascades across evaluation dimensions. Our results reframe hallucination mitigation as a distributed systems problem and establish context synchronization as a first‑class primitive in multi‑agent LLM design.

PaperID: 1740, https://arxiv.org/pdf/2606.21577.pdf

Authors: Rohan Walia, Kevin Leahy

Title: Conflict-Aware Switching for CBF-CLF-Based Multi-Goal Navigation

Abstract:
Quadratic programs (QPs) using Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) are widely used for safe control in reach‑and‑avoid navigation. However, the inherently conflicting nature of CBF and CLF constraints can lead to performance degradation, including slowdowns and deadlocks. This issue is exacerbated in multi‑goal scenarios, where multiple nominal control objectives must be satisfied under shared safety constraints. Existing approaches for preemptive safety are often computationally expensive or overly conservative, while methods that relax or switch between nominal objectives are not well‑suited for sequential goal‑to‑goal navigation. To address these limitations, we propose a conflict‑aware switching strategy that detects high‑conflict conditions and switches between available nominal control objectives to reduce constraint conflict. We apply this approach to multi‑agent, multi‑goal reach‑and‑avoid scenarios under CBF‑CLF‑QP control. Compared to a baseline sequential goal traversal strategy, our method reduces both completion time and timeout rates, demonstrating improved performance in satisfying all nominal control objectives while respecting safety constraints.

PaperID: 1741, https://arxiv.org/pdf/2606.21558.pdf

Authors: Lydia Bakiri, Jérémy Dubut, Sergio Mover

Title: Monitoring Diameters of Causal Communication Graph with Spatio-Temporal Logic

Abstract:
Verification of multi‑agent systems requires the ability to check meticulous topological properties when it comes to agents that can move through space in continuous time. This demands a logic with sufficient expressiveness to capture these dynamics. MuTGL logic has interesting properties for expressing entangled space‑time properties. However, this logic lacks the expressivity needed to analyse reachability within specific distance bounds, or to track the length or the cost of communication chains: these are fundamental for decentralized monitoring, or graph‑theoretic analysis of distributed protocols, where algorithmic complexities often relates with the system's communication graph diameter. We then introduce an extension of muTGL, including a new operator called the space horizon. This addition allows us to bound the distance of communication chains, hence enhancing the logic's expressiveness. We show that this operator allows to encode modalities from other logics, such as reachability or escaping which were not available in vanilla muTGL, while allowing a deeper entanglement of spatial and temporal properties. We provide a centralized offline monitoring algorithm for this logic and illustrate it on several examples on simulations of Consensus‑Based Bundle Algorithms, distributed protocols for task allocation.

PaperID: 1742, https://arxiv.org/pdf/2606.21409.pdf

Authors: Chubin Zhang, Zhenglin Wan, Xingrui Yu, Pengfei Zhou, Wangbo Zhao, Jingxuan Wu, Yaxin Zhou, Ivor Tsang

Title: Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

Abstract:
Tool‑augmented agents are typically evaluated by their gains under reliable external feedback. Yet these gains leave open a key counterfactual: when feedback is unreliable, would the agent be better off receiving no task evidence? We study this question with a controlled matched‑loop comparison that fixes the agent loop, prompt, action space, and decoding, while varying only the returned observation: faithful, misleading, or absent. Across question answering and fact verification, persistent misleading feedback produces a value inversion: agents that benefit from clean tools can perform worse than the matched no‑feedback fallback. On HotpotQA, Qwen2.5‑7B reaches 44.8 F1 with clean retrieval and 22.3 F1 with no feedback, but drops to 4.7 F1 under shuffled retrieval. The inversion persists under stronger clean retrieval and locally plausible distractors, but weakens when later clean evidence can repair the trajectory. Early trajectory signals predict many failures, yet simple repairs remain fallback‑limited: rejecting bad evidence helps only when the exposed fallback is reliable. These results show that clean‑tool gains can overstate tool value, and that matched no‑feedback fallback controls are necessary for evaluating tool‑augmented agents.

PaperID: 1743, https://arxiv.org/pdf/2606.21401.pdf

Authors: Yeqi Huang, Yanwei Ye, Guomin Chen, Wenhao Su, Bin Gong, Jialian Li, Zhan Lu, Yangshen Deng, Xuan Sun, Le Xu, Luo Mai

Title: SwarmX: Agentic Scheduling for Low-Latency Agentic Systems

Abstract:
Agentic AI applications compose multiple model calls and tool executions, creating new scheduling challenges for GPU‑CPU clusters. Their inference time and model‑call structure often depend on prompt semantics, making conventional scheduling approaches ineffective for low‑latency serving. This paper presents SwarmX, a system that implements agentic scheduling for low‑latency agentic applications. SwarmX uses scheduling‑specific neural predictors to capture prompt, device, runtime, and target‑model features; exposes distributional predictions to routers and scalers for tail‑aware decisions; and provides mechanisms for predictor training and online adaptation. These predictors and mechanisms are integrated into a scheduler‑agent framework that provides a common substrate for integration with existing scheduling and model‑serving infrastructure. We evaluate SwarmX using production deployment (nearly one thousand GPUs and one million CPU cores) and controlled experiments on a 128‑GPU testbed. Across multi‑agent code generation, deep research, and multimodal agentic workflows, SwarmX reduces tail latency by up to 61.5% compared to state‑of‑the‑art schedulers and sustains up to 2x the throughput of production schedulers under the same SLO.

PaperID: 1744, https://arxiv.org/pdf/2606.21399.pdf

Authors: Chubin Zhang, Zhenglin Wan, Xingrui Yu, Jingxuan Wu, Qi Wen, Pengfei Zhou, Wangbo Zhao, Ivor Tsang

Title: Calibration Is Not Control: Why LLM-Agent Oversight Needs Intervention

Abstract:
Runtime oversight for LLM agents is commonly framed as scalar risk prediction: estimate failure likelihood, confidence, or uncertainty, then intervene once the score crosses a threshold. We argue that this framing targets the wrong object for control. The relevant question is not how likely the agent is to fail if it continues, but whether an available intervention would improve the outcome. Two trajectory prefixes can have the same risk estimate while requiring different actions, because one remains recoverable and the other does not. We formalize this mismatch as target error and identify intervention advantage, the expected utility gain from intervening rather than continuing, as the decision object for oversight. To measure this mismatch, we introduce prefix branching, a same‑prefix counterfactual protocol that executes candidate actions from identical trajectory states. Across four benchmarks, action‑conditioned control yields regime‑dependent gains over scalar routing. In a calibration decomposition, recalibrating the same scalar score improves prediction metrics but leaves control regret unchanged, showing that calibration alone does not repair target error. A simple prefix‑only action‑conditioned controller substantially reduces regret in the strongest interactive regime, from 0.506 to 0.110 on ALFWorld. Gains shrink when interventions are weak or when scalar routing already preserves intervention‑relevant information. These results suggest that LLM‑agent oversight should move from calibrated risk scoring toward action‑conditioned value estimation.

PaperID: 1745, https://arxiv.org/pdf/2606.21344.pdf

Authors: Shahab Salehi, Luca Lusvarghi, Miguel Sepulcre, Javier Gozalvez

Title: Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data

Abstract:
Trajectory prediction allows autonomous vehicles to anticipate the future behavior of surrounding objects (or agents) and, accordingly, maximize the safety and efficiency of their driving. State‑of‑the‑art Transformed‑based interaction‑aware trajectory prediction models, which rely on attention mechanisms to capture multi‑agent interactions and maximize prediction accuracy, are commonly trained and evaluated on long‑range high‑quality datasets. These datasets are typically obtained by aggregating data from multiple vehicles or drones and removing any object detection or tracking noise offline. Yet, information about a surrounding object's state (its position, speed, heading) is far from being noiseless in real‑world deployments. Object state estimation is affected by perception uncertainties and localization errors that can be particularly large for objects received via Vehicle‑to‑Everything (V2X) communications. In this paper, we analyze the impact of noisy object state information on the trajectory prediction accuracy of a state‑of‑the‑art Transformer‑based interaction‑aware trajectory prediction model. Our study demonstrates that trajectory prediction accuracy can rapidly deteriorate as the noise intensity increases. Numerical results show that the prediction accuracy can reduce by a 1.3x factor under small noise levels and by as much as a 3.9x factor under the highest (yet realistic) noise conditions. These findings reveal the strong sensitivity of trajectory prediction models to noisy data, underscoring the need for more realistic training and evaluation datasets as well as noise mitigation strategies.

PaperID: 1746, https://arxiv.org/pdf/2606.21338.pdf

Authors: Biwei Yan, Minghui Xu, Yijun Yang, Boyang Ma, Xuelong Dai, Jingku Li, Yue Zhang

Title: "What Happens Locally, Leaks Globally": Detecting Privacy Leakage Risks in MCP Servers

Abstract:
The Model Context Protocol (MCP) has rapidly become the de facto standard for connecting large language models (LLMs) to external resources, but it also introduces a class of privacy risks that existing tools are ill‑equipped to detect. Unlike conventional exfiltration bugs, leakage in MCP servers is largely protocol‑induced: credentials, API keys, and Personally Identifiable Information (PII) cross the local/LLM boundary simply by being returned, logged, or raised inside a tool handler, with no explicit outbound request in the source code. We present MCPPrivacyDetector, a context‑aware cross‑language static analysis framework that detects such leakage in multilingual MCP servers. MCPPrivacyDetector lifts heterogeneous code implemented across different programming language (e.g., Python) into a unified program representation, applies context‑aware semantic filtering to isolate genuinely sensitive values and protocol‑specific implicit sinks (e.g., @mcp.tool handlers), and performs taint analysis to enumerate feasible flows. Applied to 10,655 real‑world MCP servers, MCPPrivacyDetector finds leakage rates above 10%. Case studies confirm concrete exposures including leaked Bearer tokens, propagated API keys, and plaintext authentication credentials, arguing for systematic, protocol‑aware safeguards in the emerging LLM agent toolchain.

PaperID: 1747, https://arxiv.org/pdf/2606.21228.pdf

Authors: Yujin Tang, Edoardo Cetin, Jinglue Xu, Qi Sun, Stefan Nielsen, Vincent Richard, Haruto Goda, Iaroslav Tymchenko, Nhan Nguyen, Hyunin Lee, Mari Ashiga, Shashank Kotyan, So Kuroki, Tarin Clanuwat

Title: Sakana Fugu Technical Report

Abstract:
The capabilities of frontier Large Language Models (LLMs) continue to advance, with different providers increasingly specializing in distinct domains. This raises a natural next objective: how to combine the individual specializations of various LLMs into a collectively intelligent system. To this end, we report the development of Sakana Fugu, a family of orchestrator models that harness and amplify the capabilities of an LLM agent team. Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state‑of‑the‑art results compared to other publicly accessible models across a range of challenging tasks, including SWE‑Bench Pro, Terminal Bench, LiveCodeBench, GPQA‑Diamond, Humanity's Last Exam, and CharXiv Reasoning. We release two models: Fugu, which balances performance with latency for everyday use, and Fugu‑Ultra, which prioritizes answer quality on the hardest problems. We describe our training paradigm, which encompasses large‑scale fine‑tuning, evolutionary algorithms, and reinforcement learning approaches, along with the infrastructure and core design principles that turn these methods into a production system. We hope this report encourages further research into multi‑agent systems and dynamic, query‑adaptive agentic scaffolds as a path toward the next frontier of AI capabilities, accessed through collective intelligence.

PaperID: 1748, https://arxiv.org/pdf/2606.21201.pdf

Authors: Marija Slavkovik, Liuwen Yu, Leon van der Torre, Reka Markovich

Title: Whistleblowing and the machine -- towards a considered position

Abstract:
Artificial intelligent agents and autonomous systems are embedded in our environments. They are both a commercial product and a personal tool that generates a lot of data and can draw conclusions from it: machines generate and keep secrets. But should machines protect all secrets? It has been shown that artificial agents are able to whistleblow and it has been argued that digital multi‑agent environments should allow for agents in them to whistleblow. We argue that machine whistleblowing must be normative and principled and routed in the existing understanding of whistleblowing as an important rule‑breaking mechanism in society. We also argue that there is a need for government regulators to formulate an informed stance on both what machines should be allowed to whistleblow on and how to legally protect those who develop whistleblowing machines

PaperID: 1749, https://arxiv.org/pdf/2606.21168.pdf

Authors: Yaling Shen, Maja Christensen, Yiwen Jiang, Jenna Dennison, David Darby, Amy Brodtmann, Zongyuan Ge

Title: Dementia-Agents: A Multi-Modal Multi-Agent System for Dementia Staging and Phenotyping

Abstract:
Dementia diagnosis requires integrating multi‑modal clinical assessments from diverse informants and clinicians under incomplete and heterogeneous data conditions. Yet most AI‑driven approaches remain Alzheimer's disease (AD)‑centric, framing the problem as binary AD detection or three‑stage AD progression modeling within well‑curated research settings. This pathology‑driven paradigm overlooks the broader, syndrome‑level nature of dementia, which spans multiple stages, phenotypes, and etiologies. In this paper, we propose Dementia‑Agents, a clinically aligned multi‑agent framework for real‑world dementia staging and phenotyping. The framework follows a three‑step workflow: (1) a data agent translates structured clinical records into semantically faithful textual representations that preserve missing‑data signals and routes them to domain‑aligned experts; (2) five fine‑tuned expert agents generate domain‑level predictions; and (3) a coordinator agent performs probabilistic aggregation to produce final staging and phenotyping decisions. We develop and evaluate Dementia‑Agents on a real‑world clinical cohort of 1,066 patients from two cognitive neurology services. Compared with monolithic multi‑modal large language models (MLLMs) and prior medical multi‑agent systems, our approach achieves consistent improvements in diagnostic performance for real‑world syndrome‑level dementia staging and phenotyping, while preserving domain‑level interpretability.

PaperID: 1750, https://arxiv.org/pdf/2606.21165.pdf

Authors: Juntong Peng, Juanwu Lu, Yupeng Zhou, Can Cui, Yaobin Chen, Ziran Wang

Title: OmniV2X: A Generative Foundation Planner for Efficient End-to-End Cooperative Driving

Abstract:
We present OmniV2X, a generative foundation model for vehicle‑to‑everything (V2X) cooperative driving. The model directly interprets independent context sequences comprising multi‑modal and multi‑agent observations. The new design mitigates the computational cost of dense 3D perception, the vulnerability to data scarcity in cooperative scenarios, and the poor compliance with standardized messaging in existing methods that fuse multi‑modal inputs into a shared representation. For training, we present an end‑to‑end supervised pipeline using a downstream trajectory generation loss, in which a high‑capacity generative sequence planner implicitly learns to steer the model and leverage multi‑modal inputs via cross‑attention injection. As a foundation model, we demonstrate that OmniV2X pre‑trained on large‑scale single‑agent planning datasets can efficiently adapt to cooperative environments by integrating the conditioning context with lightweight, standard‑compliant V2X tokens. Evaluated on the DAIR‑V2X‑Seq dataset, OmniV2X outperforms existing end‑to‑end cooperative driving baselines, achieving state‑of‑the‑art performance with less than 10% of the fine‑tune V2X dataset and less than 1% of the communication bandwidth. We conduct comprehensive evaluations to demonstrate its computational efficiency and robustness under real‑world constraints.

PaperID: 1751, https://arxiv.org/pdf/2606.21153.pdf

Authors: Zhen Qin, Zhuqing Liu, Songtao Lu, Yingbin Liang, Jia Liu

Title: DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity

Abstract:
Decentralized bilevel optimization (DBO) provides a powerful framework for multi‑agent systems to solve local bilevel tasks in a decentralized fashion without the need for a central server. However, most existing DBO methods rely on lower‑level strong convexity (LLSC) to guarantee unique solutions and a well‑defined hypergradient for stationarity measure, hindering their applicability in many practical scenarios not satisfying LLSC. To overcome this limitation, we introduce a new single‑loop DBO algorithm called diminishing quadratically‑regularized bilevel decentralized optimization (DUET), which eliminates the need for LLSC by introducing a diminishing quadratic regularization to the lower‑level (LL) objective. We show that DUET achieves an iteration complexity of O(1/T^1‑5p‑\frac114τ) for approximate KKT‑stationary point convergence under relaxed assumptions, where p and τ are control parameters for LL learning rate and averaging, respectively. In addition, our DUET algorithm incorporates gradient tracking to address data heterogeneity, a key challenge in DBO settings. To the best of our knowledge, this is the first work to tackle DBO without LLSC under decentralized settings with data heterogeneity. Numerical experiments validate the theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.

PaperID: 1752, https://arxiv.org/pdf/2606.21151.pdf

Authors: Gautam Prasad, Chandramohan T. N., Joy Bose

Title: Context-Aware Generative AI for Automated Telecom Test Script Generation

Abstract:
Automated test generation for telecom software systems and networks has advanced significantly with the adoption of machine learning and rule‑based approaches. However, most existing solutions generate static test suites against a snapshot of the system; as code, configurations, topologies, and key performance indicators (KPIs) evolve, these tests quickly become outdated or misaligned with the live system. There is currently no widely adopted solution that continuously detects fine‑grained changes and selectively adapts only the affected tests without regenerating entire test suites. This paper presents a context‑aware generative AI framework for automated telecom test script generation that treats testing as a continuously adapting process driven by the current state of the system rather than a static artifact. The central contribution is delta‑conditioned test generation over a live knowledge graph: our approach employs a continuously updated knowledge graph (KG) as a single source of truth, a delta engine for fine‑grained change detection, and a KG‑guided generative AI agent, operating via the Model Context Protocol (MCP), to create, update, or retire test cases automatically. We further integrate Retrieval‑Augmented Generation (RAG) to enrich reasoning with telecom‑domain knowledge and historical artifacts. We demonstrate applicability across software‑system and telecom‑network use cases, including a Python‑based KPI monitoring application managed in GitLab, and show how the framework reduces manual effort, improves test relevance, and accelerates test cycles.

PaperID: 1753, https://arxiv.org/pdf/2606.21144.pdf

Authors: Xingyu Chen, Rui Wang, Zhaopeng Tu, Liefeng Bo

Title: AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

Abstract:
Long‑term memory systems for Large Language Model (LLM) agents typically try to \emphremember everything, extracting memories uniformly to retain as many facts as possible. In production, however, inference cost and finite context budgets make this untenable: beyond consolidating raw dialogue into memory, an agent must exert \emphwrite control, efficiently keeping only the information each user actually cares about. Otherwise, long‑horizon personalized interactions suffer \emphmemory bloat, where irrelevant trivia crowds out useful information and steadily erodes question‑answering (QA) accuracy. We argue that what is worth remembering is role‑dependent, and propose AdaMem (Adaptive Memory), a method that \emphlearns what to remember for each user from feedback. AdaMem maintains a structured, role‑specific Memory Policy and refines it from weekly QA feedback through a lightweight, patch‑style self‑reflection step with failure rollback. To study this setting, we build AdaMem‑Bench, a benchmark that simulates weeks of interaction with week‑by‑week QA. Across two extraction models and two feedback modes, AdaMem improves QA accuracy by up to +9.0% over the uniform Mem0 baseline while shrinking memory volume by 9%.

PaperID: 1754, https://arxiv.org/pdf/2606.21140.pdf

Authors: Han Chi, Jiaxin Qi, Yan Cui, Baisheng Lai, Jianqiang Huang

Title: AgentMeter: Evaluating Model-CLI Matching for CLI-Based Local Task-Solving Agents

Abstract:
LLM agents increasingly solve local tasks through command‑line and CLI‑based harness interfaces, including code editing, repository inspection, data analysis, and file workflows. Existing evaluations often emphasize task success, but deployed local agents are not models alone: the CLI mediates prompts, context replay, tool outputs, file access, terminal observations, and stopping behavior. As a result, the same model can produce different success, token, and cost profiles under different CLIs. We introduce AGENTMETER, a benchmark for evaluating model‑CLI matching in CLI‑mediated local task‑solving agents, together with AgentMeter Score (AMS), a success‑anchored, cost‑aware metric over calibrated task‑effort tiers. AgentMeter uses Benchmark90 as the full validation set and Core30 as a lower‑cost subset for expanded comparison across 24 complete model‑CLI configurations. On Core30, common deployment criteria select different configurations: highest Pass/30 selects GLM‑5.1 with qwen‑coder, lowest Tok./Pass selects GPT‑5.3‑Codex with kimi‑cli, lowest billable USD/Pass selects Qwen3.6+ with Codex, while highest AMS selects Qwen3.6+ with kimi‑cli. Benchmark90 validation preserves the Top‑1 configuration and Top‑3 set, with Spearman correlation 0.765, Kendall correlation 0.567, and AMS MAE 0.0383. These results show that model choice and CLI choice should not be decoupled, and that model‑CLI configurations should be evaluated as the deployed unit.

PaperID: 1755, https://arxiv.org/pdf/2606.21123.pdf

Authors: Jingchen Ye, Yanpei Yu, Luyao Zhang

Title: A Multi-Agent Audit Framework for High-Stakes Reasoning: Evaluation and Interpretability in Clinical Mental Health Screening

Abstract:
High‑stakes reasoning tasks necessitate transparent and verifiable workflows, yet conventional single‑model large language models (LLMs) often struggle with hallucination and low interpretability under zero‑shot paradigms. To address this general AI challenge, we propose a Multi‑Agent Audit Framework that simulates a collaborative, multi‑step verification process. We empirically validate this architecture in the sensitive domain of clinical mental health screening using a modular LangChain workflow. Our framework decomposes the reasoning process into a Perception Agent, Knowledge Retrieval‑Augmented Generation (RAG), Chain‑of‑Thought (CoT) clinical inference, and a critical Audit verification stage. We evaluated this framework on the DAIC‑WOZ dataset using locally deployed open‑source models. Experimental results demonstrate that our multi‑agent pipeline significantly outperforms single‑agent baselines, reducing the Mean Absolute Error (MAE) for PHQ‑8 depression severity prediction from 5.35 to 5.02. By exposing cross‑agent validation traces, the framework mitigates reasoning drift and provides highly interpretable diagnostic rationales, offering a generalizable paradigm for reliable AI‑assisted decision support beyond isolated model scaling. We make data and code open access on GitHub for replicability.

PaperID: 1756, https://arxiv.org/pdf/2606.21085.pdf

Authors: Bingchang Song, Yiqin Yang

Title: Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition

Abstract:
Offline‑to‑online adaptation serves as a pivotal paradigm for mitigating the prohibitive cost of online exploration by bootstrapping reinforcement learning from offline datasets. While this paradigm has been extensively studied in single‑agent settings, its extension to Multi‑Agent Reinforcement Learning (MARL) remains largely unexplored, despite its critical relevance to complex coordinated decision‑making. To bridge this gap, we introduce Sim2O, an elegant and minimalist framework for offline‑to‑online MARL. Rather than treating adaptation as a monolithic joint decision, Sim2O conceptualizes it as a compositional process. Specifically, candidate joint actions are synthesized by dynamically blending offline and online action proposals across agents. By leveraging a centralized value function to evaluate these hybrid combinations, Sim2O identifies high‑value coordination strategies without requiring auxiliary training objectives or structural overhead. Empirical evaluations across diverse benchmarks demonstrate that Sim2O significantly outperforms existing baselines, underscoring that a minimalist design is not only viable but highly effective for multi‑agent offline‑to‑online adaptation.

PaperID: 1757, https://arxiv.org/pdf/2606.21071.pdf

Authors: Zhengsong Zhang, Zongze Li, Jiawei Guo, Haipeng Cai

Title: Local LLM Agents as Vulnerable Runtimes:A Source-Code Audit of the Agent Runtime Layer

Abstract:
Local LLM agents such as OpenClaw and Nanobot run on end‑user machines and act on host resources ‑ the shell, filesystem, browser, stored credentials, and messaging applications ‑ through natural‑language goals. These agents have become privileged software runtimes that mediate between user intent, model outputs, and host‑level actions. Existing research characterizes the landscape through prompt injection, malicious skills, marketplace risks, or black‑box evaluation of agents. But the implementation layer that performs this mediation, the prompt builder, parser, tool dispatcher, skill loader, memory writer, network client, and permission gate, has remained an unexamined safety boundary. To our knowledge, no prior work has examined the agent's source tree to audit these components for implementation‑level security weaknesses. We present CLAWAUDIT, a static auditing framework for measuring vulnerability exposure in local LLM agent runtimes. CLAWAUDIT derives a five‑category vulnerability taxonomy from STRIDE and develops custom static‑analysis rules that target agent‑specific patterns absent from established rule sets for vulnerability analysis. We instantiate the taxonomy in two backends, 47 Semgrep YAML rules and 30 CodeQL queries, and evaluate on OPENCLAWBENCH, a benchmark of 446 source‑code‑level advisories from the OpenClaw repository and split temporally into 229 rule‑derivation (train) and 217 held‑out (test) advisories. On the held‑out test, CLAWAUDIT raises Semgrep recall from 21.7% (Pro baseline) to 66.8%, and CodeQL recall from 13.8% (security‑extended) to 75.1%. Train/test gaps remain within 4 percentage points for all four configurations, indicating that the rules generalize to vulnerabilities unseen during rule writing. A preliminary live‑code audit shows that these recall‑oriented rules require manual triage, motivating semantic filtering before production deployment.

PaperID: 1758, https://arxiv.org/pdf/2606.21013.pdf

Authors: Jingyi Chai, Bingyang Zheng, Xiangrui Liu, Hao Lu, Zihang Zhou, Tianchen Wang, Kemeng Zhang, Siheng Chen

Title: Agentic Time Machine as an Infrastructure for Future-Event Forecasting

Abstract:
Forecasting future events is a critical challenge for large language model (LLM) agents, spanning domains from elections and monetary policy to financial markets. However, evaluating progress on this task presents a fundamental trade‑off between efficiency and environment fidelity. While live evaluation benchmarks suffer from an inherently slow feedback loop, existing retrospective replays typically restrict agents to static, pre‑frozen databases that sacrifice the environmental realism of actual deployments. To tackle this issue, we introduce Agentic Time Machine (TM), an infrastructure that approximately reconstructs the web state at any chosen past time by filtering post‑cutoff content. Leveraging this evaluation infrastructure, we further propose a planner‑solver‑aggregator multi‑agent framework that breaks each question into diverse analytical angles, gathers evidence in parallel, and combines the results into a single forecast. Experiments show that offline scores under TM correlate strongly with live FutureX scores, validating that TM offers a fast and reliable sandbox for forecasting‑agent evaluation. On FutureX‑Past and Polymarket evaluated under TM, our framework achieves the highest score among strong closed‑book, tool‑augmented, and self‑consistency baselines. On the official FutureX live leaderboard, our system achieves the best average rank over four consecutive weeks, including 1st place in May Week 1. As of June 17, it also ranks 1st on FutureX's official eight‑week overall leaderboard.

PaperID: 1759, https://arxiv.org/pdf/2606.20997.pdf

Authors: Jieyi Wang, Bingxuan Li, Nanyi Jiang, Desong Meng, Zirui Fan, Yuxin Guo, Jiayu Liu, Kunlun Zhu, Eddie Yang, Xiusi Chen, Pan Lu, Bingxin Zhao

Title: BioInsight: Multi-Agent Orchestration for Interactive Biomedical Knowledge Discovery

Abstract:
Biomedical researchers increasingly use AI‑generated analyses and reports to interpret protein‑level signals, but static outputs are often insufficient for research decision‑making, where users need to inspect evidence, assess uncertainty, compare mechanisms, and refine hypotheses. We present \textscBioInsight, a multi‑agent system that moves from static biomedical report generation to interactive evidence‑centered interactive interface generation. Given a disease name, a protein association table, and optional cohort metadata, BioInsight organizes disease‑specific evidence through typed intermediate artifacts, including ranked pathways, literature evidence packets, protein‑level reasoning notes, citation‑grounded reports, dashboard schemas, and rendered interactive interfaces. The system decomposes evidence retrieval from mechanistic reasoning, normalizes citations through deterministic components, and converts the same structured evidence used in the report into an interactive interface. We evaluate BioInsight on standardized biomedical QA, challenging protein‑function reasoning, and end‑to‑end biomedical evidence synthesis. Results show that BioInsight achieves best, and suggest that biomedical AI systems should move beyond text‑only and static reports toward provenance‑preserving, interactive evidence artifacts.

PaperID: 1760, https://arxiv.org/pdf/2606.20978.pdf

Authors: Honjar Xing, Jefferson Lin, Henry Lieberman

Title: How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs

Abstract:
Programming by Demonstration (PbD) offers a human‑centered way to author procedural knowledge for LLM agents: users communicate what they want by showing rather than by writing prompts or code, making agent authoring accessible to non‑programmers. The natural output of a PbD recording is a flat action log, but how this log is organized before being passed to the agent is an open design question with significant consequences for plan quality. We propose grouping recorded actions into labeled, hierarchical subgoals and evaluate the effect of this organizational structure in a controlled experiment. Across 85 web automation tasks, we compare a zero‑shot baseline against four demonstration formats that share identical action sequences but differ in structure. On 43 natural‑language tasks with vague descriptions, hierarchically grouped demonstrations improve pass rates from 76.7% to 90.7% (paired permutation test p=0.034; win‑loss 6:0), while flat demonstrations show a smaller, non‑significant improvement. On 42 tasks with precise descriptions, no format provides any benefit, confirming that the hierarchical advantage arises specifically when descriptions leave procedural details ambiguous. Ablation shows that subgoal grouping alone drives the effect: preconditions, postconditions, and parameter annotations add no measurable benefit. These results offer a concrete design recommendation for PbD pipelines and, more broadly, for any system that feeds procedural context to an LLM agent: segment action sequences into named subgoal groups rather than presenting flat step lists.

PaperID: 1761, https://arxiv.org/pdf/2606.20962.pdf

Authors: Esmaeil Seraj, Rohan Paleja, Luis Pimentel, Kin Man Lee, Zheyuan Wang, Daniel Martin, Matthew Sklar, John Zhang, Zahi Kakish, Matthew Gombolay

Title: Heterogeneous Policy Networks for Composite Robot Team Communication and Coordination

Abstract:
High‑performing human‑human teams learn intelligent and efficient communication and coordination strategies to maximize their joint utility. These teams implicitly understand the different roles of heterogeneous team members and adapt their communication protocols accordingly. Multi‑Agent Reinforcement Learning (MARL) has attempted to develop computational methods for synthesizing such joint coordination‑communication strategies, but emulating heterogeneous communication patterns across agents with different state, action, and observation spaces has remained a challenge. Without properly modeling agent heterogeneity, as in prior MARL work that leverages homogeneous graph networks, communication becomes less helpful and can even deteriorate the team's performance. In the past, we proposed Heterogeneous Policy Networks (HetNet) to learn efficient and diverse communication models for coordinating cooperative heterogeneous teams. In this extended work, we extend Heterogeneous Policy Networks (HetNet) to support scaling heterogeneous robot teams. Building on heterogeneous graph‑attention networks, we show that HetNet not only facilitates learning heterogeneous collaborative policies but also enables end‑to‑end training for learning highly efficient binarized messaging. Our empirical evaluation shows that HetNet sets a new state of the art in learning coordination and communication strategies for heterogeneous multi‑agent teams by achieving an 5.84% to 707.65% performance improvement over the next‑best baseline across multiple domains while simultaneously achieving a 200x reduction in the required communication bandwidth.

PaperID: 1762, https://arxiv.org/pdf/2606.20955.pdf

Authors: Antonio Marino, Claudio Pacchierotti, Paolo Robuffo Giordano

Title: A Gated Graph Neural Network Approach to Fast-Convergent Dynamic Average Estimation

Abstract:
Dynamic average estimation is a critical problem in multi‑agent systems, enabling agents to collaboratively estimate time‑varying signals using only local information exchange. Traditional model‑based approaches often face challenges related to convergence speed and sensitivity to network topology changes. This paper introduces a novel learning‑based solution leveraging Gated Graph Neural Networks (GGNNs) for fast‑convergent dynamic average estimation in a fully distributed manner. Taking advantage of the inherent structure of GGNNs, the proposed method models the estimation process as a distributed autoregressor, ensuring rapid convergence while maintaining stability. We incorporate a regularization term during training to enforce convergence guarantees and introduce an encoding‑decoding mechanism to reduce communication overhead without sacrificing accuracy compared to standard GGNNs. Extensive numerical experiments demonstrate that our approach significantly outperforms conventional model‑based estimators in terms of both convergence speed and precision, making it a promising alternative for multi‑agent applications that require dynamic average estimation.

PaperID: 1763, https://arxiv.org/pdf/2606.20954.pdf

Authors: Nusrat Jahan Lia, Aritra Mazumder

Title: Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning

Abstract:
Long‑running language‑model systems accumulate interaction history that outgrows the context window, so they must continually evict. When an eviction policy drops a load‑bearing detail, for example an access token issued at login or a path the next call needs, the action fails. We present LRE (Learned Relevance Eviction), a few kilobytes, CPU‑only, language‑model‑free scorer that learns which units of history are load‑bearing and keeps them by verbatim extraction. Under a matched‑budget comparison, in our experiment, no baseline dominates LRE on the accuracy‑cost plane. On agents, LRE matches the accuracy of keeping the entire history overall. On the simplest tasks, it exceeds that no‑eviction baseline by 27%, while requiring zero compressor calls and reducing peak context size by up to 52%. A controlled study trace shows LRE completes tasks where the others loop, finishing one such task in 37% fewer calls than keeping everything and solving 14 tasks where no other run policy does. On conversational memory, LRE outranks dense and token‑pruning encoders at zero neural cost. In downstream evaluation, LRE gives the best budgeted answer quality on LoCoMo reading 68% fewer tokens. Its supervision can also be annotation‑free: training only on the system's own behavior recovers 95% of the supervised scorer's effectiveness. We argue that, because memory eviction in LLM agents is a fidelity problem, it requires a deployable proactive policy where the future query is unavailable and exact state is decisive, and that cheap learned relevance can be sufficient.

PaperID: 1764, https://arxiv.org/pdf/2606.20877.pdf

Authors: John Meluso, Laurent Hébert-Dufresne, Christoph Riedl, H. Oliver Gao

Title: Artificial collectives of specialists and generalists excel at different tasks

Abstract:
Collective artificial intelligence, where multiple agents work on shared tasks, holds potential to solve expansive problems in fields from medicine to collective governance. But while prescriptive engineering solutions abound, we lack descriptive scientific understanding of artificial collectives, and therefore principles for how to design resource efficient multi‑agent systems. Through systematic experiments with optimizing agents, we characterize how agent interpretive abilities, rationality bounds, and task qualities interact to shape collective performance. Agents range from specialists, with narrow interpretive abilities, to generalists, with broad ones. Collectives of specialists correspond to sparse, centralized networks, while collectives of generalists correspond to dense, decentralized ones. We show that interpretive network properties have small performance effects on average (0.07 standard deviations of performance). However, for specific task qualities, these effects are 4.5 times larger (0.33 sd) and can reach much higher for certain task qualities (1.84 sd). This leads collectives of generalists to perform better on tasks that involve generating, choosing, and coordinating, while collectives of specialists with a few generalist mediators perform better on tasks that involve negotiating. Rationality bounds then moderate these relationships. At loose bounds, specialists outperform generalists through more effective sampling of high‑dimensional decision spaces. At tight bounds, generalists outperform specialists through better gradient estimation. A fundamental trade‑off between performance and convergence speed emerges at moderate bounds. These findings suggest that multi‑agent design could benefit from matching interpretive networks to both task demands and agents' computational limits, with implications for the efficiency and energy costs of multi‑agent systems.

PaperID: 1765, https://arxiv.org/pdf/2606.20839.pdf

Authors: Lingzhi Yang, Yubo Fan, Song Wu, Gilchan Park

Title: Process-Reward Tactic Evolution for Long-Horizon Bioinformatics Workflows

Abstract:
LLM agents can write code and call tools, but reliable bioinformatics work requires long‑horizon interaction with workflow software, typed data objects, provenance, and biological checks. We study this setting through Galaxy workflow execution. The agent must explore task data, construct or adapt an executable workflow DAG, bind inputs and dataset collections, monitor execution, debug failures, and validate biological outputs. We propose Process‑Reward Tactic Evolution, a Galaxy‑based training framework that turns verified workflow rollouts into reusable \tactics. During training, agents practice on curriculum‑organized Galaxy tasks in Agent Gym; process verifiers score workflow construction, software interaction, execution, and biological correctness; successful and failed traces are distilled into a tactic library. At inference, the trained executor, Process‑Reward Tactic Evolution, uses this library to execute held‑out peer reviewed Galaxy workflow converted BioWorkflow Bench and BioAgent Bench tasks in isolated environments. The paper evaluates whether process‑supervised tactic accumulation improves long‑horizon bioinformatics workflow completion, biological correctness, and execution efficiency over no‑memory and reflection‑style baselines.

PaperID: 1766, https://arxiv.org/pdf/2606.20758.pdf

Authors: Momil Seedat

Title: A Topology-Aware, Memory-Centric Architecture that Separates Root-Cause Derivation from Root-Cause Explanation

Abstract:
Modern microservice deployments fail in ways that are easy to detect and hard to explain. When a fault propagates along service dependencies, alerts fire in floods, dashboards multiply, and the scarce resource, an engineer who understands how the services relate, is consumed reconstructing context that the monitoring stack discarded. We argue that the missing ingredient in autonomous operations is not a better anomaly detector or a larger language model, but operational memory: a persistent, structured representation of how a system normally behaves, how its parts depend on one another, and how it has failed before. We present O PS C ORTEX, a working multi‑agent prototype that organizes this memory into four tiers and uses it to separate two tasks the field usually conflates: deriving a root cause and explaining it. Root cause is computed deterministically from a learned dependency graph and the temporal ordering of threshold crossings; a large language model (LLM) is then asked only to explain, confirm, and recommend, using evidence the system has already assembled. We motivate the design with two documented production cascading failures, review representative literature on observability, anomaly detection, graph‑based localization, and LLM‑assisted diagnosis, and show how each architectural choice maps directly to a failure mode those incidents exhibit. The prototype is validated on an instrumented e‑commerce benchmark with eight injectable failure scenarios.

PaperID: 1767, https://arxiv.org/pdf/2606.20713.pdf

Authors: Qingyang Xu

Title: FairTutor: Equity-Aware Pedagogical LLM Routing for Budget-Constrained AI Tutoring

Abstract:
Generative AI tutors provide real‑time, personalized learning support, but also create a new education inequity: students with access to premium AI services may receive clearer explanations, more personalized guidance, and better scaffolding than students limited to free or low‑cost services. To address this challenge, we propose FairTutor, an equity‑aware model‑routing framework that achieves cost‑effective AI tutoring via pedagogically motivated multi‑agent orchestration. FairTutor combines query analysis, pedagogical planning, low‑cost model generation, evaluator‑guided critique and revision, and selective escalation to premium AI models. We introduce access‑tier AI Education (AIED) Advantage Gap to measure the quality difference between premium‑access and budget‑constrained tutoring, and TutorAccessEval, a benchmark spanning math, reading, writing, science, and language learning. Empirical evaluations show that FairTutor achieves 97.1% of premium pedagogical quality (in floor‑adjusted Likert scale) while reducing serving cost by 71.6%. Sensitivity analysis reveals a tunable cost‑‑quality Pareto frontier, enabling FairTutor to be tailored to the needs of diverse student populations.

PaperID: 1768, https://arxiv.org/pdf/2606.20708.pdf

Authors: Liang Chen

Title: Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

Abstract:
LLM‑as‑user‑simulation has become core infrastructure for conversational AI: agent benchmarks (tau‑bench), training pipelines, and a growing body of fidelity studies all rely on LLMs role‑playing the human side of dialogue. Existing frameworks measure communicative fidelity ‑‑ whether simulators talk like humans ‑‑ against ground truth from paid participants role‑playing assigned goals. We argue this has a structural blind spot: when the goal is assigned, the user's willingness is exogenous, so no framework can test whether simulators make decisions like real users whose motivation is endogenous, latent, and decaying. We introduce decision fidelity ‑‑ whether a simulated population reproduces the decision‑state dynamics of real users facing real, consequential choices ‑‑ and measure it on a unique testbed: 2,790 production conversations between an LLM sales agent and real customers, including 793 with verified payment outcomes. Using a teacher‑forced probe protocol that holds context and instrument fixed, we find a systematic, outcome‑correlated failure we call the disengagement deficit: simulators reproduce eventual buyers almost exactly (depth bias +0.09) but inflate eventual non‑buyers toward the purchase frame (depth bias +0.40; d=0.38, p<0.001), halving expressed resistance (25.1% to 13.5%) and nearly doubling deliberation (21.9% to 40.1%) while fabricating no purchases. The deficit replicates across model families (DeepSeek: d=0.41, p=0.002) and resists the obvious fix: instructing the simulator that it may disengage cuts marginal bias five‑fold but barely moves the outcome‑conditioned contrast (d=0.34, p=0.008). Real non‑buyers say "not now" and stop; simulated non‑buyers ask about price. Evaluating or training sales and persuasion agents against such simulators overstates funnel progress exactly where it matters most ‑‑ the customers who walk away.

PaperID: 1769, https://arxiv.org/pdf/2606.20701.pdf

Authors: Almond Kiruthu Murimi

Title: BARD-MARL: Byzantine-Agent Detection for Learned Communication in Multi-Agent Reinforcement Learning

Abstract:
Learned communication improves coordination in cooperative multi‑agent reinforcement learning, but it also creates a trust problem: a trained policy may route information through agents that have become faulty or adversarial. This paper studies Byzantine‑agent detection for learned‑communication MARL in adaptive traffic signal control. We propose BARD‑MARL, a post‑hoc diagnostic layer on top of BayesG, which is used as an attributed communication substrate rather than as a contribution of this paper. BARD‑MARL combines two agent‑level evidence streams: policy‑graph features extracted from state‑action trajectories and Bayesian trust statistics computed from BayesG latent mask probabilities. Across fixed‑action, observation‑flip, random‑noise, and coordinated attacks in SUMO traffic grids, the results show that these signals are complementary rather than universally dominant. On a 25‑agent grid, BARD‑MARL reaches 0.843 AUC‑ROC under a 10% observation‑flip attack, while policy‑graph‑only detection reaches 0.917 AUC‑ROC under a 10% coordinated attack. On a 100‑agent grid, the unified BARD‑MARL variant reaches 0.982 AUC‑ROC for both 10% fixed‑action and 10% coordinated attacks. The study shows that learned communication policies expose useful diagnostic evidence, but credible resilience claims require attack‑specific ablations and explicit separation between coordination, detection, and mitigation.

PaperID: 1770, https://arxiv.org/pdf/2606.20698.pdf

Authors: Kai Tang, Peidong Jia, Zhong Chu, Jixian Wu, Rui Ma, Jiajun Cao, Fangyuan Zhao, Sixiang Chen, Yichen Guo, Xiaowei Chi, Chun-Kai Fan, Kevin Zhang, Jinchang Xu, Fubing Yang, Weishi Mi, Xiaozhu Ju, Jian Tang, Shanghang Zhang

Title: SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

Abstract:
Safe control is a prerequisite for real‑world embodied intelligence, for which safe reinforcement learning has emerged as a promising paradigm. However, existing safe reinforcement learning methods either require costly real‑world exploration or depend on hand‑crafted safety functions. Neither scales to vision‑language‑action models deployed in open‑world physical environments. We propose SafeDojo, the first model‑based safe reinforcement learning framework for vision‑language‑action policies designed to learn safe actions through world model‑based imagination. Specifically, SafeDojo performs online reinforcement learning on top of an interactive video world model. The world model generates action‑conditioned future predictions, from which a tailored ResNet success classifier estimates per‑step task progress from imagined frames and a lightweight safety head predicts per‑step safety costs from latent context together with the proposed action chunk, enabling simultaneous assessment of task execution and trajectory safety. The decoupled task‑reward and safety‑cost signals are balanced through a Lagrangian‑based constrained GRPO objective, enabling coordinated improvement of task success and safety under explicit constraints. On SafeLIBERO, SafeDojo achieves the best aggregate task success, safe success, and execution efficiency among inference‑time safety, model‑free RL, and model‑based RL baselines, with the best average safe‑success rate on both levels and an 8.25 percentage‑point improvement over the strongest baseline on Level I. Real‑world Franka deployment further shows the best average task and safe‑success rates across five tasks. Our results position world model‑based safe reinforcement learning as a scalable and generalizable path toward safe embodied intelligence.

PaperID: 1771, https://arxiv.org/pdf/2606.20695.pdf

Authors: Alibek T Kaliyev, Artem Maryanskyy

Title: How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks

Abstract:
Multi‑agent LLM coordination papers report small benchmark deltas as evidence that one architecture beats another. A prior question: how much paired trial‑0 disagreement do two protocols produce on the same model and benchmark when their API inputs are configuration‑equivalent (matched by code inspection plus a SHA‑256 byte audit), short of full identity‑replay? On Claude Haiku 4.5 against tau^2‑bench retail, the clean configuration‑equivalent contrast (no_coord vs. intercept, both inert at trial 0) gives signed paired gaps of +10pp and 0pp across two n=100 seeds; pooled across both, +5pp with Wilson CI [‑2,+12], not significant. The largest single‑seed contrast (+18pp pull‑vs‑intercept, p_corr=0.012) did not reproduce at the second seed (‑3pp, p_corr=1.0); no trial‑0 contrast is significant after Bonferroni at either seed or pooled. The envelope of observed paired gaps spans [‑3,+18]pp across two seeds, with pooled upper Wilson CI ~15pp. Seven of ten recent multi‑agent coordination architectures report headline effects below this local floor, and one more sits inside the envelope; whether they survive a same‑model paired replication is, by construction, untested in their original settings. We define coordination‑active pass^k, pass^k restricted to trials where the coordination mechanism is logically active, as the minimum reporting protocol, with sample‑size targets and runtime hooks in the body. Measurements run on ET‑MCP, a task‑scoped negative‑knowledge store conformant with MCP 2026‑07‑28, used as a substrate to isolate reader‑side choices, not as a contribution. On Haiku 4.5 the candidate readers (pull, intercept) do not improve trial‑1 recovery; we give a preliminary diagnosis of failure modes with refinements on existing production hook surfaces.

PaperID: 1772, https://arxiv.org/pdf/2606.20686.pdf

Authors: Jiahao Wu, Shengwen Yu

Title: JPPD: Joint Prediction_Planning Diffusion with Differentiable Safety Guidance for Dynamic Obstacle Avoidance in Intelligent Transportation Systems

Abstract:
Shared‑space transportation operation requires low‑speed autonomous platforms to navigate safely and efficiently among pedestrians, service robots, micromobility users, carts, and other road users. Most existing systems decompose this problem into trajectory prediction followed by motion planning, which creates one‑way information flow: predicted participant futures influence the robot plan, but the selected robot plan cannot influence the predicted multi‑agent evolution. This paper presents a joint prediction‑planning diffusion framework that treats participant prediction and robot planning as a single conditional trajectory generation problem, where the model samples the future robot trajectory and all participant trajectories from one coupled distribution using a causal Transformer with cross‑trajectory attention. To replace heuristic repulsive post‑processing, the framework introduces differentiable safety potential guidance, a time‑varying occupancy‑probability potential whose gradient directly guides the joint sampler, and conditional flow matching is used to reduce inference steps while preserving multimodal trajectory diversity. The evaluation emphasizes shared‑space operational effects, including near misses, blockage time, induced participant deviation, hard‑braking events, and embedded latency, rather than treating average displacement error and final displacement error as the main result. Experiments in scenario‑grounded simulation, naturalistic pedestrian replay, Isaac Sim validation, and ROS/Orin deployment show that joint sampling improves tail safety and runtime efficiency over a separated prediction‑then‑planning baseline.

PaperID: 1773, https://arxiv.org/pdf/2606.20669.pdf

Authors: Hoang Vu, Maximilian Körner, Adrian Rebmann, Gabriel Kevorkian, Michael Perscheid, Gregor Berg, Timotheus Kampik

Title: Agent Behavior Mining: Generative AI Agent Governance in Business Processes

Abstract:
As organizations increasingly deploy generative AI agents to automate business processes, they face a governance dilemma: although these agents can increase operational flexibility, their non‑deterministic nature challenges the control and standardization that Business Process Management seeks to enforce. This paper addresses this \emphinvisible autonomy risk by introducing \emphAgent Behavior Mining, a governance capability that enables the application of process mining techniques to render generative AI agent decision‑making observable and traceable. We (1) improve the understanding of generative AI agent behavior through an event data model that translates granular agent activities ‑‑ including reasoning traces, tool usage, and token costs ‑‑ into standardized process logs; (2) instantiate the data model in a multi‑agent order‑to‑cash implementation, demonstrating how process managers can leverage agent logs to detect policy deviations and quantify operational variability; and (3) evaluate the perceived practical utility of the approach in an exploratory study with 18 industry practitioners. The results indicate that practitioners view behavioral transparency as a prerequisite for trust and consider the ability to examine agent reasoning as an important governance requirement for the next generation of AI‑driven business processes.

PaperID: 1774, https://arxiv.org/pdf/2606.20643.pdf

Authors: Mushtari Sadia, Zhenning Yang, Umme Habiba Lamia, Nishat Shawrin, Ang Chen, Amrita Roy Chowdhury

Title: SPARC: A Multi-Agent System for Electrical Circuit Question Answering

Abstract:
Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi‑agent system that answers questions over circuit diagrams by grounding reasoning in executable physics‑based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute improvement over baselines, while enabling systematic error diagnosis.

PaperID: 1775, https://arxiv.org/pdf/2606.20632.pdf

Authors: Luyang Zhang, Jialu Wang, Fei Xue, Yi-Yun Chu

Title: Post-Training Recipe, More Than Model Family, Shapes Multi-Agent LLM Conversational Behavior

Abstract:
Multi‑LLM systems use multiple language models to deliberate, judge each other's outputs, or coordinate as agents. Their value depends on the models producing measurably different conversational behaviors when given the same input. Prior offline studies recommend drawing one model per family for behavioral diversity, because LLMs prefer outputs from their own family when rating one another in isolation. Whether the same family label predicts behavior in interactive multi‑LLM systems, the setting that real deployed systems use, has not been tested. We study this with a 940,000‑chain 11‑checkpoint corpus and a 1.6M‑chain same‑base Llama factorial. On our validated headline metric, hedging, a reasoning‑distilled Llama checkpoint shifts by 18% depending on which same‑base partner it replies to, more than any cross‑family hedging gap in the controlled subset. Qwen, closed‑API, and runtime checks suggest the pattern is not isolated, while repair and challenge analyses remain exploratory because their surface‑cue detectors are weaker. Overall, the results identify post‑training recipe as a first‑class axis for multi‑LLM panel composition and show that model family alone is an incomplete proxy for conversational diversity.

PaperID: 1776, https://arxiv.org/pdf/2606.20631.pdf

Authors: Boming Xia, Liming Zhu, Zhenchang Xing, Qinghua Lu, Dino Sejdinovic, Xiwei Xu

Title: Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

Abstract:
Agent skills externalise reusable agent‑facing behavioural knowledge and guidance as persistent artefacts that can be discovered, activated, and interpreted by LLM agents. Although a skill artefact is static at rest, its architectural responsibilities arise in use, when the artefact is selected for a run, bound to context and authority constraints, interpreted by a stochastic agent, and recorded as run evidence. We call this run‑specific relation skill‑in‑use. This paper studies agent skill harnessing: the architectural responsibilities that govern the transition from skill artefacts to skill‑in‑use, bound the executable consequences associated with skill‑in‑use, and capture evidence for attribution, verification, repair, and evolution. This paper provides a catalogue of ten empirically grounded architectural patterns (five core, five supporting) for skill harnessing and synthesises them into a reference architecture with four responsibility layers: Supply Chain, Mediation, Execution Control, and Evidence & Feedback. We evaluate the architecture through cross‑instantiation across 8 selected systems. The resulting patterns and reference architecture provide a vocabulary and diagnostic frame for analysing skill‑harnessing responsibilities across agent systems.

PaperID: 1777, https://arxiv.org/pdf/2606.20615.pdf

Authors: Ylli Prifti

Title: Specifying AI-SDLC Processes: A Protocol Language for Human-Agent Boundaries

Abstract:
AI agents now participate as first‑class team members across the software development lifecycle, yet no specification language exists for expressing the human‑agent responsibility boundaries, approval gates, and governance constraints this collaboration requires. Existing approaches encode process in agent prompts (subject to drift), target adjacent domains (workflow management, business processes), or address only fragments (access control, approval gates). We propose a domain‑specific language for specifying AI‑SDLC processes as protocols, with formal syntax, well‑formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non‑determinism through primitives such as validation tokens and capability boundaries. Three results follow. A failure rate analysis shows that structural enforcement bounds system failure rates at a weighted product of agent and validator rates, while behavioral compliance permits cumulative or near‑saturating growth. The 2+N team pattern (two human‑in‑control roles plus N specialized agent members) formalizes classical Separation of Duties for AI‑SDLC. Kleene closure of orchestration loops and reflexive protocol‑adherence validation emerge as design properties rather than special‑case constructs. We position the contribution against multi‑agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability‑based security (SAGA): the novelty lies in the specific integration, not any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.

PaperID: 1778, https://arxiv.org/pdf/2606.20601.pdf

Authors: Zhaoyang Li, Jiaqi Liu, Ruijie Zhang

Title: Zhinong AI: A Design-Science Study of an AI-Enabled Agricultural Decision-Support Platform for Smallholder Production

Abstract:
Artificial intelligence is increasingly moving from single‑purpose agricultural recognition tools toward integrated decision‑support systems that connect information access, diagnosis, task execution and post‑action feedback. This paper presents a design‑science case study of the Zhinong AI Agricultural Decision Platform, a farmer‑facing system that integrates agricultural information push services, natural‑language question answering, image‑based crop disease diagnosis, plot and farming‑calendar management, workflow orchestration, a Hainan Free Trade Port agricultural service zone and an age‑friendly care mode. Based on public project materials, policy context and prior research on smart agriculture, machine learning and design science, the paper constructs a layered system architecture and a closed‑loop decision process summarized as sensing, analysis, planning, execution and feedback. It further proposes a function‑pain‑point mapping matrix, an evaluation indicator system and a governance framework covering data provenance, model risk, expert review, privacy and adoption risk. The study does not claim measured field performance because production logs, controlled user studies and expert‑labeled local image datasets were not available at the time of writing. Instead, the contribution is a structured research framework for transforming an AI agricultural prototype into an empirically testable, accountable and localized decision‑support infrastructure for smallholder production.

PaperID: 1779, https://arxiv.org/pdf/2606.20590.pdf

Authors: Chaoqun You, Yueyue Dai, Xingqiu He, Yue Gao, Rahim Tafazolli, Yong Liang Guan

Title: Optimization-as-a-Service via Multi-Agent Large Language Model for Radio Access Networks

Abstract:
The physical resource block (PRB) allocation in Radio Access Networks (RANs) traditionally relies on case‑by‑case manual problem construction or, more recently, learning‑based artificial intelligence (AI) methods. However, the sixth‑generation (6G) RAN environments confront unprecedented service diversity and exponential dynamics, featuring volatile fluctuations in active base stations (BSs), user scale, and stringent Quality‑of‑Service (QoS) requirements. Faced with such conditions, both manual models and standard AI algorithms remain fundamentally rigid, lacking the flexibility to adapt and self‑evolve. To provide a one‑size‑fits‑all solution, we propose treating the PRB allocation problem as an Optimization‑as‑a‑Service (OaaS) provided by a large language model multi‑agent (LLM‑MA) system. This fundamentally reshapes RAN resource allocation by utilizing agents to dynamically construct optimization problems and automatically determine objectives tailored to real‑time scenarios. Our closed‑loop architecture, integrating scene understanding, objective generation, solver, and reflection agents, enables context‑aware, self‑correcting formulation. To eliminate the computational latency of iterative reflection, we introduce a one‑shot reflection distillation mechanism, training a lightweight student model to directly predict refined objective parameters. We theoretically bound the performance gap of this one‑shot policy. Experimental results demonstrate our framework achieves near‑optimal resource allocation with ultra‑low inference latency.

PaperID: 1780, https://arxiv.org/pdf/2606.20588.pdf

Authors: Tobias Priesholm Gardhus, Nikolas Vitsakis, Fie Lejre Frederiksen, Anna Rogers, Hjalmar Bang Carlsen

Title: AInterviewer: A Platform for Designing and Conducting AI-led Qualitative Interviews

Abstract:
There are now multiple proposals for systems based on Large Language Models (LLMs) to conduct automated qualitative interviews, but most of the current solutions rely on proprietary LLMs, which compromises reproducibility and data security. They also rely on LLMs for all interview tasks, which limits standardisation of question wording as well as control over question order. To address these issues, we introduce the AInterviewer platform, an opensource solution based on a multi‑agent pipeline that combines controlled question administration of survey software with the flexibility of LLMs. AInterviewer is an interdisciplinary effort designed to implement best practices of qualitative interviewing in social science, and it can run with locally hosted models to ensure security, transparency, and reproducibility. Our platform provides a web‑based GUI supporting each phase of data collection: from interview guide design and pilot testing to interview distribution and data collection monitoring.

PaperID: 1781, https://arxiv.org/pdf/2606.20580.pdf

Authors: Juan Parra-Ullauri, Talha Ahmed Khan, Daniel McHugh, Shipra Kapoor, Alistair Duke, Alicia Hey, Andy Corston-Petrie

Title: Role-Based Agentic AI for Intent-Driven Network and Service Orchestration

Abstract:
Telecommunication networks are increasingly complex due to heterogeneous technologies, diverse service requirements, and growing demands for resource efficiency and business agility. Intent‑Based Networking (IBN) and, more recently, agentic AI have emerged as promising paradigms to address this complexity through autonomous network management. However, existing approaches primarily focus on operational orchestration within Operations Support Systems (OSS) and lack an integrated framework that spans Business Support Systems (BSS) and OSS, limiting the realisation of true intent‑to‑business‑to‑network coordination. This paper presents a role‑based multi‑agent architecture (MAS) for end‑to‑end intent orchestration that mirrors Communication Service Provider (CSP) organisational structures. The proposed framework applies principles of functional decomposition, explicit task ownership, privacy‑preserving domain separation, and domain‑specific expertise within a hierarchical four‑layer agent system spanning customer engagement, strategic planning, service delivery, and infrastructure provisioning. Leadership agents coordinate planning activities, whilst specialised service and resource agents are dynamically instantiated according to intent requirements. A proof‑of‑concept implementation demonstrates the feasibility of bridging the BSS‑OSS divide through structured agent coordination, illustrating how agentic MAS can support accountable and scalable intent‑driven service orchestration.

PaperID: 1782, https://arxiv.org/pdf/2606.20573.pdf

Authors: Jinliang Xu, Runkai Zhu, Bingqi Li, Fanjie Nie, Jin Li, Jiagui Xie

Title: AONA: A Comprehensive Architecture and Workflow Design for Global Agentic Collaboration

Abstract:
The rapid advancement of Large Language Models (LLMs) has established autonomous agents as the core vehicles for artificial intelligence applications. However, existing Internet infrastructures, primarily relying on TCP/IP and DNS, are designed for human‑centric, host‑to‑host data transmission, inherently lacking the semantic awareness, dynamic capability discovery, and decentralized trust mechanisms required for autonomous agent interactions. To address these limitations and break the closed ecosystems of single vendors, this paper proposes AONA (Agentic Overlay Network Architecture), a novel overlay network architecture for the Internet of Agents (IoA). We first provide a multi‑disciplinary scientific defense for multi‑agent collaboration, demonstrating its theoretical necessity over single super‑intelligence through the lenses of organizational economics, scaling principles, and the Price of Anarchy. AONA is then structured as a four‑layer logical blueprint comprising the Base, Interconnection, Collaboration, and Application layers, which facilitates cross‑protocol and cross‑platform interoperability without disrupting the underlying physical network. To physically instantiate this blueprint, we design a distributed node infrastructure anchored by Management Root Nodes, Registry Service Nodes, Discovery Service Nodes, and Enterprise Intelligent Service Hubs for private domain integration. Finally, we detail the dynamic operational workflows‑including zero‑trust identity issuance, globally coordinated semantic taxonomy synchronization, intent‑driven semantic discovery, and trusted metering for commercial settlement‑that drive the network. This comprehensive architecture provides a robust, scalable, and secure foundation for the future of global agentic collaboration.

PaperID: 1783, https://arxiv.org/pdf/2606.20537.pdf

Authors: Liang Su

Title: Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

Abstract:
Mainstream LLM serving systems reuse prefix work mainly through paged or radix key‑value (KV) caches. This is highly effective for high‑throughput, high‑concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low‑latency, small‑batch, on‑device physical‑AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re‑enter under tight responsiveness budgets. We introduce execution‑state capsules, a graph‑bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white‑box, backend‑facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block‑table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token‑addressed KV fragments to graph‑bound execution‑state boundaries. On an RTX 5090, capsule restore is byte‑exact at the stored‑state level and token‑identical under greedy decode. A KV‑only ablation diverges, showing that recurrent state is load‑bearing. GPU‑resident snapshot and restore are sub‑millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high‑throughput KV‑cache serving; they define a complementary latency‑first serving point for explicit execution‑state reuse.

PaperID: 1784, https://arxiv.org/pdf/2606.20493.pdf

Authors: Zewen Liu

Title: Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

Abstract:
When large language models serve as evaluators in multi‑agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3‑agent experiment using DeepSeek‑chat with three distinct evaluator bias profiles (structured, balanced, evidence‑based), we measure the Cross‑Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous‑model agents produce contagion coefficients 3‑5x weaker than cross‑model coefficients observed in prior work (MM‑EPC: gamma approx 0.85‑1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open‑source Contagion Network experimental framework.

PaperID: 1785, https://arxiv.org/pdf/2606.20485.pdf

Authors: Jake J. Xia

Title: Optimal Order of Multi-Agent and General Many-Body Systems

Abstract:
This paper develops a general framework for analyzing multi‑agent systems with feedback loops between agents actions and collective observations. The framework is built on two fundamental agent‑level variables: power, which measures agent influence on collective outcomes, and response functions, which determine how agents react to observations. We derive how macroscopic properties, including total power, useful power, entropy, order, fragility, and mobility, emerge from these two variables of heterogeneous agents. To study the trade off between growth and resilience, we introduce a system‑level utility function parameterized by a risk‑appetite coefficient and derive an optimal degree of order that balances productivity, stability, and adaptability. The analysis suggests that stronger synchronization can increase collective output but may also increase systemic fragility and reduce mobility. We further argue that order, entropy, information, and useful energy are task‑dependent and system‑relative concepts whose meanings depend on the objectives of the system. By measuring and designing agent power distributions and response functions, it may be possible to better understand, predict, and optimize collective behavior and identify the conditions under which collective intelligence and optimal order emerge.

PaperID: 1786, https://arxiv.org/pdf/2606.20408.pdf

Authors: Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park

Title: LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Abstract:
Large language model (LLM) agents are increasingly proposed as supervisory components for safety‑critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT‑Bench, a benchmark for multi‑turn red‑teaming of LLM agents acting as operators of a safety‑critical system, instantiated in a simulated nuclear power plant control room. A five‑role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi‑turn sessions with per‑turn feedback. Harm is an objective signal rather than LLM‑judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed‑attack paired‑replay protocol, we find that adaptive multi‑turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model‑dependent: the same guardrail stack or safety‑advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.

PaperID: 1787, https://arxiv.org/pdf/2606.20388.pdf

Authors: Yupeng Xie, Chen Ma, Zhenyang Wang, Liangwei Wang, Jiayi Zhu, Chuxuan Zeng, Zhouan Shen, Boyan Li, Yuyu Luo

Title: DataMagic: Transforming Tabular Data into Data Insight Video

Abstract:
Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high‑quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre‑prepare visualizations rather than working from raw data; pixel‑level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end‑to‑end interactive system that transforms raw tabular data and natural language queries into narrative data‑insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data‑driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate‑then‑Orchestrate multi‑agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec's decoupling of logic and rendering, the system further supports three interaction modes and structured provenance‑based data Q&A, transforming one‑way videos into explorable interactive data interfaces. Evaluation on 109 real‑world samples validates the effectiveness of the DataMagic. Homepage: https://datamagic‑home.github.io/

PaperID: 1788, https://arxiv.org/pdf/2606.20373.pdf

Authors: Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

Title: AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Abstract:
Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi‑agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM‑generated optimization decisions. Rather than treating the compiler as a black box like prior auto‑tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler‑internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency‑improving edits. AutoPass operates in an inference‑only, training‑free setting and requires no offline training or task‑specific fine‑tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server‑grade x86‑64 and embedded ARM64 systems. AutoPass outperforms expert‑tuned heuristics and classical autotuning methods, achieving geometric‑mean speedups of 1.043x and 1.117x over LLVM O3 on x86‑64 and ARM64, respectively.

PaperID: 1789, https://arxiv.org/pdf/2606.20287.pdf

Authors: Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

Title: PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Abstract:
Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)‑based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically‑aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait‑Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD‑Scaffolded Feedback Generator, which conditions multi‑agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi‑Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

PaperID: 1790, https://arxiv.org/pdf/2606.20245.pdf

Authors: Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

Title: Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

Abstract:
Large language models (LLMs) have achieved strong performance across a wide range of language‑based tasks by leveraging both extensive parametric knowledge and in‑context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework MACR for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict‑resolution mechanism based on a multi‑agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi‑agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state‑of‑the‑art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.

PaperID: 1791, https://arxiv.org/pdf/2606.20243.pdf

Authors: Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

Title: Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Abstract:
We present Phoenix, a multi‑agent LLM system that resolves GitHub issues from triage through pull‑request creation, combining seven layered safety controls with a baseline‑aware test evaluation strategy. Phoenix decomposes the work across six specialized agents. Planner, reproducer, coder, tester, failure analyst and Pull Request (PR) agent, all coordinated by a label‑based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24‑instance slice of SWE‑bench Lite. run on the production webhook path, Phoenix oracle‑resolves 75% of instances with no pass‑to‑pass regressions on successful runs; this curated slice is not directly comparable to full‑split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well‑targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.

PaperID: 1792, https://arxiv.org/pdf/2606.20236.pdf

Authors: Federica Filippini

Title: A Multi-Agent system for Multi-Objective constrained optimization

Abstract:
Many decision‑making problems in computing and networking systems can be naturally formulated as cost‑minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian‑inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade‑off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non‑stationary environments where their relative importance may change. This paper presents MAMO (Multi‑Agent system for Multi‑Objective constrained optimization), an approach to tackle this balancing problem through multi‑agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a !rst step towards more autonomous and robust RL‑based solutions for constrained optimization problems in dynamic environments.

PaperID: 1793, https://arxiv.org/pdf/2606.20058.pdf

Authors: Harsh Rao Dhanyamraju, Leonidas Raghav, Aaron Lee

Title: Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

Abstract:
Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi‑agent systems largely assume discrete request‑response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production‑derived enterprise scenarios spanning Persona (<10 agents), Department (20‑80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related‑event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high‑priority queue latency by 14‑75% and improves related‑event correctness by over 20 percentage points at enterprise scale.

PaperID: 1794, https://arxiv.org/pdf/2606.20047.pdf

Authors: Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

Title: PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Abstract:
Conversational and tool‑using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic‑blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval augmented generation fetches external documents into the prompt but does not arbitrate the agent's \emphalready‑present pooled context. Context‑compression methods reduce token count by rewriting or pruning text, but operate query‑blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.

PaperID: 1795, https://arxiv.org/pdf/2606.20014.pdf

Authors: Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

Title: Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Abstract:
Reinforcement learning (RL) has achieved strong performance in sequential decision‑making, yet scaling to complex multi‑agent environments remains challenging due to sparse rewards, large state‑action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low‑level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and \emph``Flat'' RL (end‑to‑end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand‑crafted BT (46.4% vs 51.5% win rate, p=0.103) while both significantly outperform Flat RL trained without skill decomposition. A user study (n=15) reveals that 60% of participants perceive LLM+RL agents as the most human‑like (p=0.027), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi‑agent coordination and superior perceived believability without manual rule engineering.

PaperID: 1796, https://arxiv.org/pdf/2606.19931.pdf

Authors: Jonas Bischofberger, Runqing Ma, Pascal Bauer, Kilian Arnsmeyer, Arnold Baca

Title: Blame is easier than praise: Measuring off-ball defensive performance in football

Abstract:
The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi‑agent spatiotemporal trajectories without player‑level ground truth labels, where event‑level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role‑conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross‑gender and cross‑competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action‑based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high‑value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.

PaperID: 1797, https://arxiv.org/pdf/2606.19920.pdf

Authors: Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

Title: Deep-Unfolded Coordination

Abstract:
Distributed optimization is a highly scalable and structurally transparent technique to solve multi‑agent robotics problems; however, such methods often suffer from the need for highly‑specialized, problem‑specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep‑unfolding framework that learns to dynamically adjust the hyperparameters of ADMM‑DDP, a popular distributed solver for robotics tasks, at solve‑time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM‑DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep‑unfolding framework to adapt the penalty parameters of a non‑convex optimizer at solve‑time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18‑9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.

PaperID: 1798, https://arxiv.org/pdf/2606.19911.pdf

Authors: To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

Title: Multi-Agent Transactive Memory

Abstract:
The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human‑generated artifacts to support human problem solving, retrieval systems can organize agent‑generated artifacts for reuse across agent populations. We extend retrieval‑augmented generation ‑ which demonstrates the value of human‑authored artifacts to individual agents ‑ to retrieval of agent‑generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi‑Agent Transactive Memory (MATM), a framework for population‑level storage and retrieval of agent‑generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population‑level experience sharing in open agent ecosystems.

PaperID: 1799, https://arxiv.org/pdf/2606.19904.pdf

Authors: Anping Zhang, Yang Tan, Yuanbo Tang, Huaze Tang, Qiuhua Ye, Marta C. Gonzalez, Yang Li

Title: Toward Temporal Realism in City-Scale Crisis Response Simulation using LLM Agents

Abstract:
Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM‑based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy‑tailed timing of real human systems, thus remains untested. We examine this gap using a multi‑year, city‑scale log of offline volunteering in Shenzhen that spans the COVID‑19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked‑group levels, that it is largely endogenous and self‑exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM‑only simulator reproduces almost none of this timing: its synchronous schedule has no self‑excitation channel, so agents act on a near‑regular clock. Guided by these findings, we build a simulator in which a data‑calibrated self‑excitation channel and a crisis‑period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM‑only baseline yields no bursty agents (median burstiness B=‑0.14); a single data‑calibrated gate is then sufficient to lift per‑agent timing above the burst threshold (median B\approx0.37) without degrading LLM content decisions. These results indicate that temporal realism in LLM‑based crisis‑response simulation is best achieved by decoupling when agents act, governed by an explicit self‑excitation and crisis‑activation mechanism, from what they do, governed by the LLM.

PaperID: 1800, https://arxiv.org/pdf/2606.19893.pdf

Authors: Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

Title: MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Abstract:
Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact‑retrieval‑only task designs, and the inefficiency of outcome‑based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery‑Oriented Tasks ‑‑ including hypothesis generation and contradiction resolution ‑‑ that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self‑Reflective Meta‑Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi‑Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench‑DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

PaperID: 1801, https://arxiv.org/pdf/2606.19883.pdf

Authors: Ananya Kunisetty, Avishek Ghosh

Title: Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

Abstract:
We study a multi‑agent multi‑armed bandit problem in the competitive setup with two‑sided matching markets under a human centric decision making model. To capture human preferences, we use cumulative prospect theory (CPT) that weighs the actions of the agent in a nonlinear fashion using a (α‑Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk sensitive machine learning to emulate human preferences. We analyze the state‑of‑the‑art learning algorithm with CPT weight distorted rewards and obtain a player optimal regret of \mathcalO(K\log T \left(\frac1Δ\right)^2/α), where K denotes the number of arms, T is the learning horizon, and Δ represents (suitably defined) players' minimum preference gap. Noticing the dependence on Δ to be sub‑optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on K in the dominant term and achieves an improved (optimal) regret guarantees in the setting where the number of arms K is significantly larger than the number of players N. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as risk sensitive measure in both settings where the total corruption budget is known and where it is unknown, and establish logarithmic player‑optimal regret guarantees in both cases.

PaperID: 1802, https://arxiv.org/pdf/2606.19868.pdf

Authors: Jiayi Wang, Xu-Yao Zhang

Title: A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

Abstract:
Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black‑box UE especially important. However, existing work on black‑box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black‑box UE methods and organize them into five categories: verbalization‑based, sampling‑based, explanation‑based, multi‑agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black‑box UE methods for LLMs.

PaperID: 1803, https://arxiv.org/pdf/2606.19857.pdf

Authors: Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

Title: Large Language Models Do Not Always Need Readable Language

Abstract:
Large language models (LLMs) are commonly prompted and interfaced with human‑readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non‑standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model‑centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction‑tuned LLMs. As a task‑agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross‑model transfer, agent memory, and multi‑agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor‑reader pair and task setting. These findings indicate that human readability, natural‑language typicality, and model‑side semantic recoverability can be partially decoupled, opening a path toward model‑native representations in future exploration of LLM systems.

PaperID: 1804, https://arxiv.org/pdf/2606.19847.pdf

Authors: Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

Title: AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Abstract:
Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long‑term information accumulation and reuse across multi‑session interactions. Existing memory‑augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long‑term memory system designed for value‑dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state‑of‑the‑art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

PaperID: 1805, https://arxiv.org/pdf/2606.19812.pdf

Authors: Anushree Sinha, Srivaths Ranganathan, Abhishek Dharmaratnakar, Debanshu Das

Title: Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

Abstract:
Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e‑discovery), where compounding errors across multi‑step reasoning chains can constitute legal malpractice. Unlike single‑turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term "trajectory collapse": an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four‑layer verification architecture ‑‑ spanning planning, reasoning, execution, and uncertainty quantification ‑‑ designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e‑discovery corpus that demonstrates how mandatory Human‑on‑the‑Loop (HOTL) escalation thresholds reduce privilege‑waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege‑waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.

PaperID: 1806, https://arxiv.org/pdf/2606.19790.pdf

Authors: Jiechao Gao, Yuandong Pan. Yuangang Li, Jie Wang, Kincho Law, Michael Lepech

Title: The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

Abstract:
Agentic systems have advanced quickly on digitally native tasks, yet they have barely touched the industries where coordinated automation could matter most: logistics, healthcare operations, construction, and the many sectors whose work is spread across incompatible tools and many hands. We argue that the reason is a missing abstraction. The value in these settings does not come from a single capable model invocation; it comes from \emphorchestration, the runtime that coordinates multi‑step workflows, enforces hard domain constraints, manages human approval, and bridges legacy systems. We develop this idea into a usable conceptual frame. We give an operational test for which workflows are orchestration‑bound, a decomposition that separates how tangled a workflow is from how much of its effort is coordination and what that coordination is worth, and a feature‑level account of why today's multi‑agent frameworks leave a specific gap. We then advance our central claim: the right automation path is staged, and which architectural guarantee carries the most weight depends on a sector's dominant source of friction. Constraint enforcement is load‑bearing under regulatory friction; explainability is load‑bearing under liability friction. We close with the research program this view implies.

PaperID: 1807, https://arxiv.org/pdf/2606.19787.pdf

Authors: Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang

Title: ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

Abstract:
Large language models are increasingly deployed as autonomous agents for multi‑step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre‑formalized or text‑only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution‑grounded benchmark for evaluating autonomous agents on challenging end‑to‑end operations research tasks. It contains 107 human‑reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural‑language brief, multi‑file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard‑constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent‑model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible‑solution construction, and insufficient solution improvement. OR‑specific procedural skills increase hard‑task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high‑quality operational decision‑making.

PaperID: 1808, https://arxiv.org/pdf/2606.19782.pdf

Authors: Aravind Narayanan, Shaina Raza

Title: AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Abstract:
Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart‑QA agents are accuracy‑focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on‑premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi‑agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves +7.68 pp over a primary‑backbone matched zero‑shot baseline with a proprietary backbone (Gemini‑3 Flash; 71.24% vs. 63.56%, McNemar p \approx 1.1 × 10^‑16), and +4.84 pp with open‑weights Qwen3.6‑27B‑FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human‑in‑the‑loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two‑thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on‑premise financial chart QA is practical and that the open‑weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

PaperID: 1809, https://arxiv.org/pdf/2606.19758.pdf

Authors: Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

Title: SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Abstract:
Existing graph‑based multi‑agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed‑set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill‑incidence graph framework that constructs agents as task‑conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill‑agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill‑specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non‑compositional topology‑based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi‑agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA‑2338/.

PaperID: 1810, https://arxiv.org/pdf/2606.19725.pdf

Authors: Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

Title: Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

Abstract:
Validating changes in low‑level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open‑Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi‑agent pipeline. The workflow combines automated generation of test scaffolds, library‑aware creation or reuse of stubs, mocks, and fakes, and an iterative compile‑dispatch repair loop driven by build logs and line‑coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48‑function subset evaluated under both configurations, mean line coverage reached 98.8% with line‑coverage guidance alone and reached 94.7% when combined with vector‑database retrieval. Results show that automated generation‑and‑repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

PaperID: 1811, https://arxiv.org/pdf/2606.19704.pdf

Authors: Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

Title: Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract:
Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep‑dive of one MCP‑based industrial‑agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi‑modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation‑methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate‑score leaderboards systematically underspecify deployed‑agent evaluation. Rankings derived from aggregate scores do not transfer to out‑of‑distribution settings; recent public‑to‑hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in‑sample and out‑of‑sample rank, rather than in‑sample mean, and report a twelve‑tier measurement apparatus that exposes the deployment‑relevant dimensions HELM and its agent‑era successors collapse. The position is operationalized through three falsifiable out‑of‑distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre‑registered pilot design and a field‑level vision for what the next generation of agentic benchmarks should report.

PaperID: 1812, https://arxiv.org/pdf/2606.19690.pdf

Authors: Navin Chhibber, Deepak Singh, Anokh Kishore, Nikita Chawla, K. Anguraj

Title: Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems

Abstract:
From the past few years, web intelligent enhancement systems increasingly rely on heterogeneous and dynamic web data to deliver personalized, context‑aware services. However, traditional machine learning, deep learning, and reinforcement learning models often struggle with semantic understanding, adaptability, and scalability in continuously evolving web environments. In this research, a Multi‑Granular Attention‑based Reinforcement Web Intelligent Enhancement System (MGAR‑WIES) is proposed to address the challenges by integrating semantic graph modeling, attention mechanisms, and adaptive reinforcement learning. Initially, heterogeneous web data comprising structured, semi‑structured and unstructured sources are collected and preprocessed for generating unified feature representations. These representations are transformed into a dynamic semantic graph, where entities and their relationships are modeled by using graph embeddings enhanced by attention mechanisms for capturing both local relevance and global contextual dependencies. Subsequently, an adaptive multi‑agent reinforcement learning strategy leverages the attention‑aware semantic states to optimize personalized web actions like content recommendation, navigation optimization, and service adaptation. Finally, the continuous online feedback is further integrated to update graph representations and learning policies in real time by ensuring sustained adaptability and performance. The proposed MGAR‑WIES acheived better results in terms of accuracy (80%) when compared with existing approaches.

PaperID: 1813, https://arxiv.org/pdf/2606.19659.pdf

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Title: SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

Abstract:
On‑policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single‑turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token‑level OPD becomes brittle, as it may over‑penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off‑distribution histories. We propose SAGE‑OPD, a verifier‑free selective intervention framework specifically designed for multi‑turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE‑OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE‑OPD weights token‑level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE‑OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn‑level weighting. Experiments on agent tasks show that SAGE‑OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn‑level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi‑turn OPD should remain on‑policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.

PaperID: 1814, https://arxiv.org/pdf/2606.19632.pdf

Authors: Ahmad Farooq, Kamran Iqbal

Title: Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

Abstract:
Multi‑agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety‑critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end‑to‑end framework for safety verification of learned multi‑agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four‑stage pipeline consists of domain‑specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/‑ 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature‑to‑state‑variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union‑bound aggregation and empirical neighbor modeling. Evaluating Vector‑Quantized Variational Information Bottleneck (VQ‑VIB) policies for multi‑drone coordination with 5‑7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage‑point deviation (95% CI). Discrete VQ‑VIB messages provide +11.6 to +13.6 percentage‑point fidelity advantages over continuous methods, enabling 3‑4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi‑robot deployment.

PaperID: 1815, https://arxiv.org/pdf/2606.19565.pdf

Authors: Navin Ranjan, Andreas Savakis

Title: Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Abstract:
We propose Mix‑QVLA, a task‑evidence‑aware mixed‑precision PTQ framework for VLA models. Mix‑QVLA anchors each quantized variant to the full‑precision action‑token reference decision and evaluates whether quantization preserves task‑relevant evidence across key VLA functional boundaries. It computes normalized gradient‑weighted task‑evidence maps from boundary activations and compares full‑precision and quantized maps using evidence‑mass and attribution‑distribution distortion, capturing changes in both the strength and allocation of decision‑supporting evidence. A soft‑bottleneck objective aggregates boundary‑level degradation into layer‑wise sensitivity scores. Mix‑QVLA further models sensitivity throughout task execution, capturing phase‑dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence‑ and time‑aware scores guide mixed‑precision bit allocation under model‑size and BitOps budgets. Extensive evaluations on OpenVLA‑style policies show that Mix‑QVLA improves the accuracy‑efficiency trade‑off of low‑bit VLA deployment. On LIBERO, Mix‑QVLA reduces OpenVLA‑OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.

PaperID: 1816, https://arxiv.org/pdf/2606.19494.pdf

Authors: Apurba Pokharel, Ram Dantu

Title: Hidden Anchors in Multi-Agent LLM Deliberation

Abstract:
Multi‑agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion‑dynamics models such as DeGroot and Friedkin‑‑Johnsen capture, and by our own internal belief, which they do not. We model multi‑agent deliberation as a closed‑loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convexhull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held‑out runs (generalizes) gives a simple test for when a model is truly driven bysuch an anchor. Across three open‑weight model families this is a spectrum, not all‑or‑nothing. All anchors' influence are about equally strongly, but they differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed‑loop model.

PaperID: 1817, https://arxiv.org/pdf/2606.19419.pdf

Authors: Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

Title: Playful Agentic Robot Learning

Abstract:
Current agentic robot systems can write executable Code‑as‑Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task‑driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self‑directed play as a continual skill‑learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play‑time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot‑code policies, verifies intermediate progress, diagnoses failures, retries with dense, step‑level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO‑PRO and MolmoSpaces show that play‑learned skills improve held‑out downstream tasks over no‑play and random‑play baselines, with 20.6 and 17.0 percentage‑point gains over CaP‑Agent0 on LIBERO‑PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference‑time Code‑as‑Policy agents by simply retrieving them into the context, improving RoboSuite and real‑world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

PaperID: 1818, https://arxiv.org/pdf/2606.19409.pdf

Authors: Fukang Wen, Zhijie Wang, Ruilin Xu

Title: OpenRath: Session-Centered Runtime State for Agent Systems

Abstract:
Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch‑like programming model for multi‑agent, multi‑session systems. The analogy concerns the role of a central first‑class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend‑aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime‑routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live‑provider quality, optional‑backend availability, and memory quality are left for follow‑on evaluation. The central thesis is that Session provides agent systems with a first‑class runtime value for auditable composition.

PaperID: 1819, https://arxiv.org/pdf/2606.19387.pdf

Authors: You Li, Samuel Mandell, David Z. Pan

Title: Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

Abstract:
Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register‑transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

PaperID: 1820, https://arxiv.org/pdf/2606.19356.pdf

Authors: Anantha Sharma

Title: Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Abstract:
When multi‑agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine‑readable header that accompanies every AI‑generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27‑question document‑grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP‑instrumented controller actions across three local GGUF models. On Qwen~(0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby~(8B), ASP produces 4 fail‑to‑pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3~(3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi‑agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).

PaperID: 1821, https://arxiv.org/pdf/2606.19350.pdf

Authors: Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

Title: Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Abstract:
Large language models (LLMs) excel at multi‑step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training‑free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head‑level scores to guide fine‑grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight‑level importance values for the corresponding projection matrices. Unlike magnitude‑only or activation‑based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC‑Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC‑Challenge using Llama‑3‑8B‑Instruct and Mistral‑7B‑Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10‑20%), CAP improves over Wanda in most model‑benchmark configurations. with especially large gains on ARC‑Challenge for Llama‑3. Our results suggest that attention‑head‑level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.

PaperID: 1822, https://arxiv.org/pdf/2606.19308.pdf

Authors: Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

Title: Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Abstract:
Large language model (LLM)‑based multi‑agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide‑and‑conquer paradigm falls short on decision‑making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi‑Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision‑making as an equilibrium‑seeking process. Built on the game‑theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision‑making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single‑round and multi‑round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.

PaperID: 1823, https://arxiv.org/pdf/2606.19267.pdf

Authors: H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

Title: A Mixed-Reality Testbed for Autonomous Vehicles

Abstract:
We propose a mixed‑reality, hardware‑in‑the‑loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high‑fidelity simulation environment. The virtual simulation enables the creation of diverse, safety‑critical driving scenarios to validate state‑of‑the‑art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitating rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi‑agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety‑guaranteed framework combining perception, planning and a novel online learning‑based controller using Control Barrier Functions (CBFs) for CAVs. Experiments using the proposed framework are used to validate and demonstrate the key functionalities and the overall utility of the testbed to bridge the gap between simulation and real‑world hardware deployment.

PaperID: 1824, https://arxiv.org/pdf/2606.19245.pdf

Authors: Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

Title: TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Abstract:
Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision‑making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench‑PP), a verifiable benchmark for small‑molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug‑discovery stages and therapeutic modalities. TxBench‑PP tests whether agents can recover accurate conclusions from real‑world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism‑of‑action (MoA) and pharmacodynamic (PD) reasoning, compound‑target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model‑harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1‑67.6), followed by GPT‑5.5 / Pi at 55.3% (166/300; 47.0‑63.6).

PaperID: 1825, https://arxiv.org/pdf/2606.19152.pdf

Authors: Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller

Title: AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

Abstract:
Identifying the lowest‑energy surface‑adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine‑learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open‑loop large language model (LLM) agents lack a physics‑grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed‑loop multi‑agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD‑GMAE62. Relative to its single‑pass (1‑Shot) ablation it reduces cross‑backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively ‑‑ an approximately 14‑fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open‑loop Adsorb‑Agent outputs exhibit qualitative adsorption‑energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self‑reflection, and interpretability simultaneously, supporting more DFT‑informed autonomous chemistry workflows.

PaperID: 1826, https://arxiv.org/pdf/2606.19135.pdf

Authors: Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Title: A Technical Taxonomy of LLM Agent Communication Protocols

Abstract:
As large language models (LLMs) advance and multi‑agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta‑characteristic, and ending conditions, then performed five iterations, three empirical‑to‑conceptual and two conceptual‑to‑empirical, on nine actively maintained open‑source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent‑to‑agent protocols combine hybrid payloads with session‑state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short‑term convergence pressure toward protocols unifying agent‑to‑agent and agent‑to‑context (tool and data) communication. Long‑term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

PaperID: 1827, https://arxiv.org/pdf/2606.19111.pdf

Authors: Haewoon Kwak

Title: Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

Abstract:
Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi‑agent LLM teams: under what measurable conditions does process‑level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock‑in, exploration, recovery from an incorrect round‑0 consensus) and per‑action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory‑derived rule, not the vocabulary, does the work. Across four task regimes and three open‑weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round‑0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round‑0 majority is unreliable (llama‑4‑scout social; situational +8pp over flat). A recovery‑advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round‑0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path‑goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process‑level coordination control as a contingency to be measured and theory‑mapped, not a leaderboard to be topped.

PaperID: 1828, https://arxiv.org/pdf/2606.19080.pdf

Authors: Ali Peivand, Seyyed Mostafa Nosratabadi

Title: Byzantine-Resilient Federated Multi-Agent Optimization Framework for Cyber-Secure Interconnected Microgrids

Abstract:
The escalating digitalization of distribution networks has exposed interconnected Microgrid (MG) clusters to Stealthy False Data Injection Attacks that bypass Bad Data Detectors and propagate through tie‑line couplings and shared learning channels. This paper proposes BR‑FedMAPPO, a Byzantine‑Resilient Federated Multi‑Agent Proximal Policy Optimization framework that learns a triple‑surface Moving Target Defense and an adaptive isolation strategy for cyber‑secure operation. Each MG hosts a local Actor‑Critic Agent whose policy is partitioned into a globally federated shared encoder and a privately retained action head, so no MG exposes the configurations, cardinality, or locations of its D‑FACTS lines, Battery Energy Storage (BES) units, or tie‑line capacities. The action vector perturbs D‑FACTS reactances, redirects BES injections, reshapes inter‑MG exchanges, and includes a continuous islanding signal. A two‑stage Byzantine‑resilient aggregation rule combines trimmed‑mean filtering with reward‑weighted updates. This scheme incorporates a detection‑quality score based on the F1‑score and False Positive Rate to penalize clients causing false alarms. Simulation results on four interconnected MGs based on the IEEE 30‑ and 118‑bus test systems demonstrate effective mitigation of coordinated S‑FDI attacks, containment of cascading disruptions through adaptive isolation, and protection of distributed learning channels against malicious model manipulations while maintaining cost‑aware dispatch performance.

PaperID: 1829, https://arxiv.org/pdf/2606.19063.pdf

Authors: Siyuan Pang, Zhengwei Jiang, Yepeng Yao, Zijing Fan, Haozhe Li, Baoxu Liu

Title: PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Abstract:
The detection of malicious PyPI packages is crucial for maintaining the security of the open source software supply chain. Existing methods, which primarily rely on rules or traditional machine learning, suffer from poor interpretability and difficulty in adapting to novel attacks. To address this, we propose PYPILINE, a novel detection method that combines a suspicious API knowledge base with an Agent workflow. PYPILINE first conducts static analysis on known malicious packages, extracting abstract syntax trees and generating API call graphs, from which it automatically extracts and constructs a structured suspicious API knowledge base. During the detection phase, this knowledge base is used to enhance reasoning capabilities. Through an Agent workflow, PYPILINE performs in depth semantic analysis of unknown packages and outputs a structured, interpretable maliciousness assessment report. The experimental results show that PYPILINE significantly outperforms existing state‑of‑the‑art tools in precision of 96.7%, recall of 99.6%, and F1‑score of 98.1%, with its precision surpassing baseline tools by 5.7 to 24.2 percentage points. Additionally, we conducted an empirical study on malicious packages, systematically revealing prevalent attack strategies, as well as the most commonly abused APIs. Equipped with tool‑calling AI agent workflows for automated vector database retrieval of suspicious API knowledge and mail server delivery of analysis reports, PYPILINE delivers a practical, efficient, and convenient malicious package detection solution to strengthen open‑source ecosystem security.

PaperID: 1830, https://arxiv.org/pdf/2606.18976.pdf

Authors: Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

Title: CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Abstract:
Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi‑agent LLM system that analyzes software architecture deliverables to generate personalized, template‑compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python‑based microservice for multi‑modal document extraction, utilizing PyMuPDF and vision‑enabled LLMs (specifically gpt‑4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross‑verifies, deduplicates, and merges findings. System performance is assessed using a structured eight‑criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two‑rater aggregation rule, achieved moderate inter‑rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM‑supported architectural feedback, human oversight remains essential for subjective assessment dimensions.

PaperID: 1831, https://arxiv.org/pdf/2606.18947.pdf

Authors: Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Title: Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Abstract:
Production LLM agents increasingly depend on real‑time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model‑provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search‑Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor‑agnostic boundary that moves grounding outside the reasoning model through an MCP‑compatible gateway, exposing provider routing, source‑aware context rendering, configured fallback, retrieval‑depth control, and exact plus semantic caching as first‑class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency‑sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm‑cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large‑scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native‑search accuracy on an e‑commerce query‑understanding (QIU) workload while cutting search cost by over 98%. Real‑time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

PaperID: 1832, https://arxiv.org/pdf/2606.18902.pdf

Authors: Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

Title: SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Abstract:
Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black‑box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error‑informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent‑Guided Exploration), a multi‑agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental‑health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually‑noisy A/B tests into a statistically robust gain in next‑day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open‑ended task‑oriented dialogue.

PaperID: 1833, https://arxiv.org/pdf/2606.18847.pdf

Authors: Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

Title: WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract:
To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long‑term memory benchmarks mainly evaluate language‑centric retrieval and question answering, while embodied benchmarks often focus on short‑horizon task execution without testing long‑term memory use in dynamic environments. We introduce WorldLines, a project‑driven benchmark for long‑horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence‑linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer‑grounded memory framework that maintains visibility‑aware memories and action‑native state trails for state‑aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long‑term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

PaperID: 1834, https://arxiv.org/pdf/2606.18837.pdf

Authors: Hehai Lin, Qi Yang, Chengwei Qin

Title: Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Abstract:
Large Language Model (LLM)‑based automatic Multi‑Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference‑time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training‑time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill‑MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high‑level orchestration capability as an evolvable Meta‑Skill. Skill‑MAS refines this architectural knowledge through a closed optimization loop: (1) Multi‑Trajectory Rollout samples a behavioral distribution for each task under the current Meta‑Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy‑level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill‑MAS not only achieves remarkable performance gains but also maintains a favorable cost‑performance trade‑off. Further analysis reveals that the evolved Meta‑Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

PaperID: 1835, https://arxiv.org/pdf/2606.18803.pdf

Authors: Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Title: ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Abstract:
Bringing Large Language Models (LLMs) into industrial ride‑hailing dispatch as semantic feature extractors over platform‑scale behavioral logs is a compelling but under‑explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM‑generated user profiles. However, scaling such profiling to a live, millisecond‑latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long‑tail, with too few interactions for per‑user profiling; and surface‑fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility‑aligned user profiling for production matching systems through two modules. (1) Tool‑Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform‑scale data, producing reusable global knowledge, adaptive user clustering rules, and region‑level supply‑demand priors. (2) Utility‑Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine‑tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14‑day online A/B test including +0.47% GMV, +0.33% Completion Rate, and ‑0.82% Cancel‑Before‑Accept rate.

PaperID: 1836, https://arxiv.org/pdf/2606.18795.pdf

Authors: Ali Safarpoor Dehkordi, Mohammad Shirzadi, Ahad N. Zehmakan

Title: Opinion Polarization in LLM-Based Social Networks: Manipulation and Mitigation

Abstract:
How vulnerable are online social networks to adversaries who seek to amplify opinion polarization by manipulating opinions, and how difficult is it to mitigate such manipulation? Existing studies have examined this question using mathematical models of opinion dynamics. While these models offer valuable theoretical insights, they rely on simplified assumptions about interactions, message content, and opinion updates, limiting the adversarial strategies they can capture and the applicability of their findings to real‑world settings. Large language model (LLM)‑based simulations provide a richer alternative: agents can be assigned diverse personas, communicate through natural language, and respond to persuasive or adversarial content in a context‑dependent way. This enables the study of manipulation strategies that are difficult to represent using classical mathematical models. To the best of our knowledge, this study provides the first systematic analysis of polarization amplification and mitigation in an LLM‑based simulated social network framework. In our framework, LLM agents with diverse personas interact over a social network by exchanging natural language posts and updating their opinions accordingly. We show that even an adversary with a limited manipulation budget can considerably increase polarization. We then study two classes of defense mechanisms: reactive mitigations, which assign specific users to actively counter manipulation, and proactive interventions, which increase resistance through general mechanisms not tied to particular users. Our results show that although these mechanisms reduce the impact of adversarial attacks, they generally do not restore the network to its baseline polarization state. These findings suggest that neither approach fully overcomes the vulnerability of the network, highlighting the potential risk of such attacks.

PaperID: 1837, https://arxiv.org/pdf/2606.18789.pdf

Authors: Costas Mylonas, Magda Foti, Andrea Pomarico, Matheus Duarte, Qian Zhang, Emmanouel Varvarigos

Title: PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies

Abstract:
Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a Large Language Model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench‑SS, a steady‑state benchmark framework for evaluating tool‑using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk‑sensitive metrics, including submitted recall, evidence‑backed recall, found recall, false‑safe penalties, severity regret, residual violation score, action cost, tool‑use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N‑2 contingency‑search pilot on deterministic IEEE 39‑bus operating‑point variants, with scripted baselines, an LLM JSON‑command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver‑only or answer‑only evaluation is insufficient: agents are distinguished not only by top‑contingency discovery, but also by validation‑budget use, explicit submission, type coercions, duplicate validations, evidence‑backed reporting, and mitigation behavior.

PaperID: 1838, https://arxiv.org/pdf/2606.18716.pdf

Authors: Kathrin Paimann, Elizangela Valarini, Sebastian Juhl

Title: Human-AI Agent Interaction in a Business Context

Abstract:
As AI agents are increasingly integrated into core business processes, understanding and designing effective interaction patterns between humans and AI agents becomes crucial for value creation. This study identifies and evaluates principles and criteria for a positive User Experience (UX) with AI agents, along with methods for its measurement. We identify user expectations and needs to facilitate adoption, build trust, and support user‑centered decision‑making by development teams. Using a mixed‑methods approach that combines qualitative and quantitative techniques, we explore interaction patterns between humans and AI agents. The findings from this exploratory research serve as the basis to develop a survey experiment which evaluates the effectiveness of specific design elements on a larger scale. This foundational research contributes to the development of more intuitive and effective human‑AI agent interactions in business settings.

PaperID: 1839, https://arxiv.org/pdf/2606.18688.pdf

Authors: Akshay Hazare

Title: Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

Abstract:
Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA‑based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high‑magnitude, constraint‑satisfying gradient corrections) and social‑behavioral dynamics (diffuse, distribution‑matching corrections). We term this Objective Interference Collapse (OIC): we argue that joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel's representational subspace, in a manner not resolvable by loss weighting alone. We propose Dual‑Channel Grounded World Modeling (DCGWM), designed to structurally prevent OIC through a partitioned latent space (physical subspace Z_p, behavioral subspace Z_b) with inward‑only gradient flow. A Physical Grounding Channel updates only Z_p via VICReg‑style alignment to physical measurements; a Social‑Behavioral Grounding Channel updates only Z_b via alignment to trajectories from an emergent multi‑agent simulation. An Inter‑Channel Interface Module couples the subspaces at the task level without cross‑subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model. We present three theoretical results: the partition removes the gradient‑interference pathway implicated in OIC; each grounded subspace inherits anti‑collapse guarantees from its alignment objective; and generative isolation is necessary under a stated assumption on the generative objective's geometry. This manuscript establishes the problem formulation and architecture; experimental validation is ongoing and will be reported in a future revision.

PaperID: 1840, https://arxiv.org/pdf/2606.18668.pdf

Authors: Shuang Xie, Yunan Lu, Han Li, Lingyun Wang

Title: EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

Abstract:
In large‑scale enterprise settings, centralized multi‑agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain‑specialized sub‑agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub‑agents' ability to calibrate their responses to capability constraints. In particular, sub‑agents built on smaller fine‑tuned models often struggle with such calibration, leading them to over‑answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub‑Agent Modeling), a production‑oriented framework that reframes sub‑agent abstention as an inter‑agent communication protocol: a sub‑agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human‑agent interaction data using an ensemble of calibrated LLM‑as‑a‑Judge models, producing structured abstention labels and rationales under a taxonomy of sub‑agent failure modes. These data are used to fine‑tune sub‑agents to detect failure conditions and return rationales for coordinator‑level clarification, rerouting, or fallback. We evaluate EARS in a large‑scale production e‑commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub‑agent‑side explanatory abstention improves MAS reliability.

PaperID: 1841, https://arxiv.org/pdf/2606.18633.pdf

Authors: Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

Title: PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

Abstract:
Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM‑based multi‑agent systems (MAS) excel at complex planning, existing planners often lack profile‑grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce MAP‑PPL (Multi‑Agent Plans for Personalized Programming Learning), a profile‑conditioned multi‑agent planning dataset with 3,043 query‑‑profile‑‑plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two‑stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile‑aware task decomposition and step dependency planning, then applies a Reward‑Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP‑PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state‑of‑the‑art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent‑student interactions.

PaperID: 1842, https://arxiv.org/pdf/2606.18619.pdf

Authors: Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

Title: Code-Augur: Agentic Vulnerability Detection via Specification Inference

Abstract:
The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security‑specification‑first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code‑Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code‑Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in‑source assertions. In parallel, Code‑Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real‑world subjects, Code‑Augur effectively leverages security specifications to detect more vulnerabilities than other state‑of‑the‑art agents. Additionally, Code‑Augur found 22 new vulnerabilities in key open‑source projects. Compared to curated specialized models like Claude Mythos, Code‑Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.

PaperID: 1843, https://arxiv.org/pdf/2606.18550.pdf

Authors: Laxmipriya Ganesh Iyer, Rahul Suresh Babu

Title: The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

Abstract:
Risk‑Aware Causal Gating (RACG) defends tool‑augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection‑compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts ‑‑ declared preconditions, effects, risk, and authorization ‑‑ that the gate reads, so an attacker who corrupts a contract can make the gate mis‑decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off‑path tool is never exposed, so risk‑relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load‑bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack ‑‑ including an exhaustive white‑box adaptive attacker ‑‑ without over‑rejecting honest contracts, and the structural prediction is confirmed on six current‑generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT‑OSS‑120B).

PaperID: 1844, https://arxiv.org/pdf/2606.18537.pdf

Authors: Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

Title: Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

Abstract:
Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per‑agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode‑averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine‑tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi‑agent Craftax, and a continuous autonomous driving simulator (Highway‑Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

PaperID: 1845, https://arxiv.org/pdf/2606.18516.pdf

Authors: Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

Title: Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

Abstract:
Multi‑agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time‑efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus‑Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time‑extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision‑making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.

PaperID: 1846, https://arxiv.org/pdf/2606.18502.pdf

Authors: Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang

Title: Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Abstract:
Large language model (LLM)‑based multi‑agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain‑specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi‑agent systems in real‑world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine‑tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost‑efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long‑tail scenarios.

PaperID: 1847, https://arxiv.org/pdf/2606.18388.pdf

Authors: Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

Title: LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Abstract:
RL post‑training strategies are dataset‑dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non‑stationary exploration‑exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi‑stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi‑parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill‑based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

PaperID: 1848, https://arxiv.org/pdf/2606.18356.pdf

Authors: Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang

Title: SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

Abstract:
Tool‑using language‑model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool‑using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool‑return injection, memory poisoning, memory extraction, and ambiguity‑driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit‑visible harm evidence, and sandbox‑observed tool/state harm. Evaluating five agent endpoints under four prompt‑level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000‑row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt‑policy conditions without conflating textual compliance, evidence‑supported harm, and executable state changes. The open‑source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.

PaperID: 1849, https://arxiv.org/pdf/2606.18325.pdf

Authors: Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

Title: Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Abstract:
Enterprise intrusion response still depends on static playbooks and analyst‑driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi‑agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role‑scoped agents, validates proposed plans through a bounded Planner‑‑Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append‑only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber‑playbook baseline on a 120‑event corpus drawn from ThreatHunter‑Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP‑aware IRS F1 from 0.61 to 0.84 and restores the projected harmful‑action rate to the static baseline level of 0.0% after Planner‑only configurations introduce unsafe overreaction. These results indicate that multi‑agent response planning can improve ontology‑grounded IRS coverage while preserving analyst approval and auditability.

PaperID: 1850, https://arxiv.org/pdf/2606.18308.pdf

Authors: Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang

Title: TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Abstract:
Safe coordination in networked cyber‑physical systems forces learning algorithms to simultaneously handle hybrid discrete‑continuous actions, hard training‑time safety constraints, and physics‑governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off‑the‑shelf modules, and formalize this as a three‑way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co‑designed to cancel each leak: a Richardson‑Romberg gradient correction reducing Gumbel‑Softmax bias from O(tau) to O(tau^2), a Lyapunov‑constrained sequential trust‑region update enforcing per‑iterate feasibility, and a physics‑informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative‑violation bound. On multi‑UAV mobile‑edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training‑time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.

PaperID: 1851, https://arxiv.org/pdf/2606.18276.pdf

Authors: Caleb Probine, Yigit Ege Bayiz, Filippos Fotiadis, Samuel Li, Yunhao Yang, Ufuk Topcu

Title: Characterizing Opinion Evolution of Networked LLMs

Abstract:
Large language models (LLMs) increasingly interact with one another in multi‑agent systems, from simulations of human discourse to influence operations and fully LLM‑driven social platforms. These interactions give rise to new regimes of opinion propagation that are not yet well understood. We investigate whether classical opinion dynamics models, which have long been used to explain how interactions shape collective beliefs in human societies, can capture the behavior of LLM networks. We find that, while naive averaging‑style models fail to track LLMs' opinion dynamics, simple modifications yield substantial gains in modeling fidelity. In particular, bias, an innate opinion toward which agents regress, emerges as a significant driver of LLM opinion dynamics, with its inclusion reducing cumulative estimated mean opinion error by up to 88%. We additionally find that these conclusions generalize across model families, discussion topics, and networks.

PaperID: 1852, https://arxiv.org/pdf/2606.18268.pdf

Authors: Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, Hewu Li

Title: Towards Multi-Agent-Simulation-Based Community Note Evaluation

Abstract:
Community‑based fact‑checking that relies on cross‑consensus is expanding rapidly on social media platforms. However, the delay and low‑ratio of cross‑consensus community fact‑checks rated by human contributors remains a significant challenge. To address this, we first created ComRate, a large‑scale dataset comprising 2.5 million community notes and over 209 million ratings sourced from \mathbbX. We then propose MultiCom, a persona‑guided multi‑agent rating framework for community note evaluation. MultiCom simulates diverse rater population by clustering contributors in a matrix‑factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema. These agents output structured and explainable judgments, such as confidence, agreement signals and reasons. An out‑of‑fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84.7% (balanced accuracy 68.3%, macro‑F1 60.1%) on the evaluation set.

PaperID: 1853, https://arxiv.org/pdf/2606.18264.pdf

Authors: Fan Huang

Title: Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

Abstract:
Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful‑content propagation may yield moderation strategies that behave less effectively when deployed in real‑world scenarios. Multi‑agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size‑matched benign control. In the empirical Bluesky data, we found that: 97.4‑‑99.7% of reposters take a hostile stance; toxicity‑engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star‑like for the hateful cascades (most reposts come directly from the root) versus tree‑like for the benign cascade (reposts propagate through multi‑hop chains). In simulation, a multi‑LLM‑agent simulator reproduces the stance monoculture and the toxicity‑delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5‑‑12.9% reduction at 5.7% benign collateral.

PaperID: 1854, https://arxiv.org/pdf/2606.18259.pdf

Authors: Junjie Xu, Xingjiao Wu, Zihao Zhang, Yujia Xu, Yuzhe Yang, Jin Zhu, Luwei Xiao, Wen Wu, Liang He

Title: Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Abstract:
AI agents that plan, retain memory across sessions, invoke external tools and act with partial autonomy are transforming human‑‑AI collaboration. Research on affective computing, simulated empathy in large language models, trust in automation and AI safety has illuminated important design principles, yet these literatures remain fragmented. No integrated account explains how affective cues operate within agentic collaboration ‑‑ settings in which humans delegate, monitor and correct consequential tasks. This Review synthesises computational and interactional mechanisms of affective dynamics: the processes through which affective cues, emotion‑like behaviour and perceived agent affect shape trust calibration, delegation decisions, error correction, dependence and governance. We trace how model‑generated affective signals enter interaction loops that govern reliance, repair and oversight, and propose a framework that treats affect not as an internal property of AI but as a coordination layer through which humans and agents negotiate capability, uncertainty and responsibility. The framework provides a foundation for calibrated measurement, purposeful design and informed governance.

PaperID: 1855, https://arxiv.org/pdf/2606.18189.pdf

Authors: Jiaying Fang, Joyce Yang, Zhanxin Wu, Bohan Yang, Tapomayukh Bhattacharjee

Title: Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems

Abstract:
Conventional human‑in‑the‑loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human‑centered robotics settings, interaction should support engagement by keeping users involved in decision‑making rather than limiting them to failure‑driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure‑driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade‑off, we propose Engagement‑aware MPC (E‑MPC), a user‑engagement‑aware method that plans interaction to maintain engagement while respecting a workload constraint. E‑MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E‑MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real‑world user study with participants with emulated mobility limitations on a robot‑assisted bite acquisition system, showing that E‑MPC improves user experience while maintaining task success.

PaperID: 1856, https://arxiv.org/pdf/2606.18154.pdf

Authors: Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

Title: Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

Abstract:
Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics‑neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM‑based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning‑and‑action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. The proposed LEADS designs every candidate model towards physically grounded, interpretable, and numerically stable, while allowing open‑ended architectural discovery. We validate LEADS on synthetic data with three ground‑truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human‑designed hybrid models and other LLM‑based hybrid modeling.

PaperID: 1857, https://arxiv.org/pdf/2606.18142.pdf

Authors: Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

Title: Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Abstract:
AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question‑answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand‑authored travel booking scenarios across six categories of animal exploitation, augmented to forty‑eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty‑four percent, with the best performer (Claude Opus 4.7) at fifty‑three percent. A single welfare‑aware sentence in the system prompt yields gains of forty‑seven to sixty‑three percentage points in Claude and GPT‑5.5, twenty‑six points in GPT‑5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base‑condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below‑chance rates do not stem from the models recognising the evaluation. We discuss implications for category‑level variation across cultural domains, the limits of text‑response welfare benchmarks, and the EU General‑Purpose AI Code of Practice systemic risk framework.

PaperID: 1858, https://arxiv.org/pdf/2606.18068.pdf

Authors: Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

Title: Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

Abstract:
Recent advances in Large Language Models (LLMs) and multi‑agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open‑ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi‑agent framework that addresses both issues by replacing ``LLM‑as‑a‑judge'' routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro‑symbolic state‑tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama‑3.1‑70b‑instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = ‑0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.

PaperID: 1859, https://arxiv.org/pdf/2606.18065.pdf

Authors: Dexing Liu

Title: Intelligence Entropy Principle and the ADE Stability Engineering Framework

Abstract:
As LLM‑driven multi‑agent systems (MAS) transition from lab to production, system behavior exhibits nonlinear degradation. We introduce the Intelligence Entropy Principle: probability‑driven systems spontaneously drift toward disorder, formalized as S(t) = S0 exp(alphat/Cm), where Cm is a model capability coefficient we propose. Lyapunov analysis yields the stabilization condition lambda > alpha/Cm. We construct the ADE (Agent Delivery Engineering) four‑layer framework (L1 Physical Laws through L4 User Adaptation) with 23 core components. Validation spans 100K‑scale experiments and 33.6 days of production monitoring. We propose a Five‑Layer Disorder Taxonomy unifying failures under structural collapse, and present Elastic Organization as an original MAS morphology. Results: channel fracture reduced from 69‑98% to near 0%; system death probability below 0.02%.

PaperID: 1860, https://arxiv.org/pdf/2606.18051.pdf

Authors: Xueping Gao

Title: Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

Abstract:
LLM agents increasingly rely on external skills ‑‑ reusable tool specifications ‑‑ but real‑world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub‑tasks, retrieve the appropriate skill for each sub‑task, and compose an executable plan. We present SkillWeaver, a decompose‑retrieve‑compose framework combining an LLM task decomposer, a bi‑encoder skill retriever with FAISS indexing, and a dependency‑aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill‑Aware Decomposition (SAD), a retrieval‑augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7%, Wilcoxon p < 10^‑6) in a single iteration; DA‑conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).

PaperID: 1861, https://arxiv.org/pdf/2606.18037.pdf

Authors: Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

Title: ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Abstract:
Tool‑using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance‑sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross‑source conflation. We introduce ProvenanceGuard, a source‑aware verifier for MCP‑grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source‑specific evidence; checks support with NLI and a token‑alignment proxy; compares stated attribution with the routed source; and returns per‑claim verdicts plus an answer‑level allow/block decision. Blocked answers can be repaired with retrieval‑augmented answer revision and re‑verified. We evaluate on 281 medical‑domain MCP‑agent traces. A 266‑trace adjudicated subset yields 2,325 LLM‑assisted claim labels split by trace; 361 held‑out labels are human‑verified. On the 40‑trace held‑out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source‑eligible claims, outperforming source‑blind baselines that do not emit claim‑to‑source IDs. On a harder multi‑source benchmark it reaches block F1 0.846, while source‑plus‑relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair‑and‑reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP‑based agents.

PaperID: 1862, https://arxiv.org/pdf/2606.18021.pdf

Authors: Lalit Yadav, Akshaj Gurugubelli

Title: LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Abstract:
AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally‑motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission‑versus‑invention bias to a single deployment‑comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause‑level instances we measure a within‑model gap of approximately 38‑40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per‑category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi‑agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically‑tuned debate. The framework supports direction‑aware procurement, accountability, and agent design for legal AI deployed in the wild.

PaperID: 1863, https://arxiv.org/pdf/2606.17962.pdf

Authors: Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

Title: A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

Abstract:
Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi‑Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro‑symbolic framework that integrates large language models (LLMs) into the model‑checking pipeline for MAS. The LLM acts as a strategy‑generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate‑and‑certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy‑synthesis dataset, consisting of 4211 instances. Experiments with an open‑weight Qwen3‑32B model show that our certified pipeline achieves 92% accuracy on strategy‑synthesis outcomes.

PaperID: 1864, https://arxiv.org/pdf/2606.17915.pdf

Authors: Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane

Title: Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

Abstract:
Big‑Data‑as‑a‑Service (BDaaS) platforms require re liable automation across data ingestion, cleaning, feature engi neering, model development, deployment, and post‑deployment monitoring. However, existing LLM‑based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle‑level orchestration, artifact governance, human oversight, and drift‑aware adaptation. This paper proposes a trustworthy self‑composable BDaaS frame work based on LLM‑orchestrated multi‑agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps de ployment, monitoring, and drift detection. A central LLM or chestration layer coordinates agent execution, validates interme diate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human‑in‑the‑loop checkpoints, and drift‑aware feedback loops. A prototype‑based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML‑only, and single‑agent LLM baselines, the pro posed multi‑agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle‑level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM‑orchestrated multi‑agent systems can extend conventional AutoML toward trustworthy, adaptive, and production‑oriented BDaaS lifecycle automation.

PaperID: 1865, https://arxiv.org/pdf/2606.17838.pdf

Authors: Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

Title: Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Abstract:
LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task‑specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation‑to‑action pipeline into a goal‑conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM‑driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi‑step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi‑agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine‑tuning or extensive human supervision.

PaperID: 1866, https://arxiv.org/pdf/2606.17819.pdf

Authors: Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

Title: A Framework for Evaluating Agentic Skills at Scale

Abstract:
Agent skills ‑‑ structured, reusable knowledge artifacts that augment LLM agent capabilities ‑‑ have been rapidly adopted in industry, yet their cross‑domain impact and use across commercial and open‑source models remain under‑studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real‑world skills, generating 1,000 tasks derived from the skills' content, along with instruction‑following and goal‑completion scoring rubrics. Using these metrics, we evaluate how 19 agent‑model configurations, both proprietary and open‑source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no‑skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

PaperID: 1867, https://arxiv.org/pdf/2606.17793.pdf

Authors: Roberto Daza, Javier Irigoyen, Ivan Lopez, Raquel Rodriguez-Carvajal, Laura Gomez, Julian Fierrez, Ruben Tolosana, Aythami Morales

Title: ARES: A Platform for Adaptive Role-Based Evaluation of Social Engineering Risks in Human--AI Games

Abstract:
This work introduces ARES, a platform and open pilot dataset for auditing adaptive social engineering risks in LLM‑mediated social decision‑making through controlled social games. ARES supports human‑‑human, human‑‑AI, and AI‑‑AI settings, combining configurable game templates, role‑conditioned LLM agents, psychology‑informed participant profiling, structured interaction trees, and synchronised behavioural and biometric acquisition, filtering, and deep‑learning‑based feature extraction. The pilot dataset was collected from 15 participants interacting with a role‑conditioned GPT‑5.4 agent in two concatenated games: an adapted Prisoner's Dilemma and an Ultimatum Game. It comprises 340 GB of raw and processed multimodal data across six streams: interaction logs, video, screen recordings, gaze logs, smartwatch signals, and game/questionnaire metadata. These data include interaction paths, written justifications, psychological profiles, subjective feedback, perceived counterpart identity, game outcomes, and derived behavioural, facial, and gaze features. Alongside the dataset, we provide descriptive analyses to characterise the pilot release. Rigorous risk evaluation is essential for the deployment of secure AI systems, as it enables the identification and mitigation of vulnerabilities, ensures the protection of sensitive data, and supports compliance with evolving regulatory and ethical standards in society.

PaperID: 1868, https://arxiv.org/pdf/2606.17682.pdf

Authors: Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

Title: From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Abstract:
Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM‑as‑Environment‑Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next‑stage training environment configuration. We also introduce MAPF‑FrozenLake, a controllable testbed whose generator exposes multi‑dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3‑4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed‑environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

PaperID: 1869, https://arxiv.org/pdf/2606.17627.pdf

Authors: Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

Title: Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

Abstract:
Fine‑grained action recognition in egocentric video is challenging for Vision‑Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully‑local, zero‑shot multi‑agent framework in which (i) a VLM orchestrator chunks the video and proposes a top‑k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer‑consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re‑ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine‑tuning. Experiments show that our method positively improves zero‑shot action recognition performance over the baseline, highlighting the influence of a heterogeneous deliberation step, showing that the gain stems from decorrelated model priors rather than from additional compute.

PaperID: 1870, https://arxiv.org/pdf/2606.17591.pdf

Authors: Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

Title: Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Abstract:
Training‑free verbal reinforcement learning enables LLM agents to learn from world feedback ‑‑ objective signals such as dynamic task outcomes, market returns, or demand forecasts ‑‑ by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non‑stationary environments these agents face a retention‑forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma ‑‑ outcome‑driven evaluation, persistent structured evidence, non‑monotonic knowledge lifecycle, and compositional governance ‑‑ and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three‑layer architecture ‑‑ rules, evidence, and skills ‑‑ connected by a feedback‑driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non‑stationary, we show that the same accumulated experience either degrades performance below the zero‑shot baseline or dramatically improves accuracy and risk‑adjusted returns, depending on whether the curation loop is present.

PaperID: 1871, https://arxiv.org/pdf/2606.17573.pdf

Authors: Zheng Chen, Hanqing Liu, Duling Xu, Dong Dong, Jialin Li, Bangzheng Pu, Jidong Zhai

Title: Cordon: Semantic Transactions for Tool-Using LLM Agents

Abstract:
Tool‑using LLM agents are shifting the unit of computation from explicit human‑issued commands to model‑driven tasks with stateful consequences. Yet today's agent runtimes still expose tools as isolated RPCs. This interface gives runtimes a convenient integration point, but it lacks a task‑scoped execution boundary for commit, rollback, recovery, and audit across multi‑step agent workflows. We argue that this mismatch calls for a runtime containment boundary rather than another per‑call guardrail. This paper introduces Cordon, a transactional runtime system for staging and validating irreversible agent effects before commit. A semantic transaction is a task‑level execution boundary that binds tool intents and runtime‑tracked result lineage to reversible local state, staged external effects, delegated authority, and audit metadata. Cordon implements this abstraction with a transaction manager that tracks derived result objects, executes reversible mutations in shadow state, stages outward‑facing actions in an effect outbox, and records recovery metadata. The runtime then validates the composed execution flow before it commits state or releases external effects. Our evaluation across adversarial and benign workflows shows that Cordon exposes cross‑step violations missed by existing defenses. It also reduces irreversible‑effect failures while preserving benign task completion with modest approval and latency overhead.

PaperID: 1872, https://arxiv.org/pdf/2606.17546.pdf

Authors: Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

Title: SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Abstract:
Self‑evolving LLM‑based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model‑tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor‑compatible benchmarks into dynamic self‑evolution task sources with train batches, frozen update‑validation, held‑out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal‑Bench 2.0 and HLE, we compare ACE, TF‑GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held‑out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

PaperID: 1873, https://arxiv.org/pdf/2606.17536.pdf

Authors: Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

Title: OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Abstract:
Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free‑form language, HD‑maps, trajectories, and camera poses reside in incompatible representational spaces, and post‑hoc cross‑view fusion, where per‑camera latents fail to encode global 3‑D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent‑token level. We present DRIVE‑CHOREO, an LLM‑choreographed multi‑agent world model that recasts controllable multi‑view video generation as latent choreography. Three Qwen2.5‑VL agents ‑ a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially‑anchored layout tokens, and an Auditor feeding cross‑view critiques back as auxiliary supervision ‑ jointly author a single position‑aware token sequence. This sequence is co‑compressed with the multi‑view video via a view‑time permutation that enforces inter‑camera geometry within the convolutional receptive field of a 3‑D VAE. On nuScenes, DRIVE‑CHOREO sets new state‑of‑the‑art multi‑view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

PaperID: 1874, https://arxiv.org/pdf/2606.17512.pdf

Authors: Zhiqi Gao, Huarui Luo, Guo Zhu, Bingquan Zhang, Dongyijie Primo Pan, Yizhan Feng, Jiahuan Pei, Jie Li, Benyou Wang

Title: MedEasy: Designing AI Standardized Patients for Clinical Consultation Training

Abstract:
AI standardized patients are becoming a setting for professional training in clinical consultation. This paper presents MedEasy, a multi‑agent system that organizes virtual‑patient practice through patient dialogue, clinical actions, decision submission, documentation, and feedback. We first conducted a formative study with 12 clinical‑year medical students through interviews and three co‑design workshops. The findings informed a staged workflow, structured case records, action‑contingent findings, and trajectory‑based review. We then conducted an evaluative user study with a separate cohort of 12 clinical‑year medical students, with each participant completing two counterbalanced cases. Learners interpreted MedEasy as a connected consultation environment. They used patient responses, examination findings, available actions, and feedback together to judge whether the represented case remained coherent. They valued repeatable practice and recorded review, while questioning missing actions and feedback criteria. The paper contributes design implications for AI‑supported professional training systems that use case‑specific standards to connect situated practice.

PaperID: 1875, https://arxiv.org/pdf/2606.17511.pdf

Authors: Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

Title: MagicSim: A Unified Infrastructure for Executable Embodied Interaction

Abstract:
Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward‑only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML‑first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset‑and‑step loop. A common execution interface grounds high‑level commands through controllers, atomicskills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator‑side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM‑facing interaction. For automatic execution, commands flow through a Command‑>Skill‑>Planner‑>Robot‑>Record pipeline, while per‑environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task‑level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner‑in‑the‑loop runtime.

PaperID: 1876, https://arxiv.org/pdf/2606.17467.pdf

Authors: Aaditya Pai

Title: PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

Abstract:
Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real‑document benchmark of 122 tasks across five professional domains (financial, legal, medical, scientific, DevOps) using actual SEC filings, Federal Register rules, PubMed abstracts, arXiv papers, and GitHub postmortems. Paraphrasing, the strongest defense on synthetic benchmarks, shows no statistically significant attack success rate reduction on real documents (p=0.500) while degrading utility from 91.8% to 82.8%. We introduce PARSE (Provenance‑Aware Retrieval Sanitization), a domain‑aware, fact‑preserving sanitization pipeline that classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation via a consistency‑checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, concentrating computational cost on high‑risk documents. PARSE achieves 15.6% attack success rate ‑‑ a 38% reduction versus the 25.4% baseline ‑‑ at 86.9% utility, the only condition that is both statistically significant (p=0.014, adequately powered) and maintains near‑baseline utility. Practitioners should evaluate defenses on domain‑matched real documents, not synthetic proxies.

PaperID: 1877, https://arxiv.org/pdf/2606.17461.pdf

Authors: Yiting Wang, Chenhui Deng, Chia-Tung Ho, Yanqing Zhang, Zhuo Feng, Cunxi Yu, Ang Li, Gang Qu, Brucek Khailany

Title: AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting

Abstract:
Fine‑grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual. Recent LLM‑based RTL optimization approaches remain limited by two key drawbacks: (1) the inability to process long waveform traces spanning millions of cycles, and (2) the difficulty of scaling optimization to large hierarchical codebases while preserving correctness. In this work, we present AUTOGATE, the first agentic framework for industry‑grade RTL power optimization, enabling workload‑aware clock‑gating optimization across large hierarchical codebases. AUTOGATE introduces a Machine Learning (ML)‑LLM co‑design that bridges waveform‑level analysis and RTL rewriting. Specifically, we design an ML‑based clustering algorithm that distills raw toggling traces into compact, structured representations that guide LLM‑based RTL rewriting. This enables accurate identification and application of clock‑gating opportunities without requiring LLMs to directly process raw waveform data. To enhance scalability, AUTOGATE employs a hierarchical multi‑agent architecture that decomposes large designs into independently optimizable modules, enabling coordinated optimization across deep design hierarchies. We evaluate AUTOGATE on a diverse set of designs ranging from small RTL designs to large industrial‑grade codebases. Experimental results show that AUTOGATE consistently reduces dynamic power relative to baselines. Across the small‑design suite, AUTOGATE reduces dynamic power by 49.31% on average. On industry‑scale designs, it achieves 19.34% and 7.96% dynamic power reductions on NVDLA and BlackParrot, respectively, and up to 6.86% on highly optimized proprietary production designs.

PaperID: 1878, https://arxiv.org/pdf/2606.17459.pdf

Authors: Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Title: Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

Abstract:
Evaluating the decision‑making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision‑making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce \textscCEO‑Bench, a multi‑agent benchmark that evaluates LLMs on CEO‑level strategic resource reallocation ‑‑ the process of redirecting capital across business units in a multi‑round, constraint‑rich organizational environment. In \textscCEO‑Bench, LLM agents receive conflicting advice from four role‑conditioned C‑suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history‑sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration ‑‑ the hardest capability layer. We identify systematic failure modes including single‑advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration‑boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision‑makers and inform the design of future AI‑assisted executive systems.

PaperID: 1879, https://arxiv.org/pdf/2606.17454.pdf

Authors: Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

Title: Dissecting model behavior through agent trajectories

Abstract:
AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the `intent‑execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent‑execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness‑model alignment, we develop a simple and customizable harness called `Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model‑specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model‑provider families on popular agentic benchmarks (SWE‑Pro, SWE‑Verified and Terminal‑Bench‑2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state‑spaces, we observe model‑level differences in problem‑solving behavior. Finer‑grained metrics such as edit frequency, testing activity, and phase‑transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

PaperID: 1880, https://arxiv.org/pdf/2606.17453.pdf

Authors: Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

Title: MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

Abstract:
Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday‑life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction‑relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore‑identify‑filter framework that reconstructs complete user needs from behavior‑chain evidence, identifies implicit decision factors, and retains only those supported by pre‑query evidence. Building on this methodology, we construct MapSatisfyBench from large‑scale, real‑world anonymized user data and annotate ground truth from five dimensions and enables full‑chain evaluation of satisfaction‑aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction‑aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map‑agent evaluation from task completion toward satisfaction‑aware spatial decision making.

PaperID: 1881, https://arxiv.org/pdf/2606.17449.pdf

Authors: Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

Title: MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Abstract:
While Multimodal Retrieval‑Augmented Generation (M‑RAG) enhances Large Vision‑Language Models, it remains highly susceptible to cross‑modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi‑modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi‑Agent system, MODE‑RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High‑risk queries are routed to five stage‑specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post‑hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M‑RAG systems.

PaperID: 1882, https://arxiv.org/pdf/2606.17368.pdf

Authors: Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Title: Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

Abstract:
Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi‑step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general‑purpose agent networks: open peer‑to‑peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open‑ended tasks. We argue that such networks cannot be obtained by simply combining existing peer‑to‑peer overlays with conventional multi‑agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper‑level task semantics with lower‑level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi‑topic reputation for cooperation governance, and semantic‑gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID‑based identity binding with MG‑EigenTrust reputation, and a Stackelberg‑style mechanism‑generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID‑style tiered verification and mechanism‑level simulations of MG‑EigenTrust under cross‑topic disguise‑collusion attacks. The resulting framework provides a system‑level foundation for open, trustworthy, and scalable agent collaboration.

PaperID: 1883, https://arxiv.org/pdf/2606.17328.pdf

Authors: Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

Title: MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

Abstract:
LLM agents increasingly maintain long‑term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted‑by‑false‑premise settings. Evaluating 13 memory‑system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long‑term memory requires better use of reachable evidence, not simply more storage or retrieval.

PaperID: 1884, https://arxiv.org/pdf/2606.17279.pdf

Authors: Yiqing Shen, Han Zhang, Mathias Unberath

Title: Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Abstract:
Surgical video question answering requires multi‑step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial‑temporal relationships and has been shown to restrict multi‑step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty‑aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL‑Colon‑Reason, a colonoscopic benchmark with 2000 question‑answer pairs across three complexity levels. We achieve state‑of‑the‑art performance on REAL‑Colon‑Reason and two existing surgical VideoQA benchmarks REAL‑Colon‑VQA and EndoVis18‑VQA.

PaperID: 1885, https://arxiv.org/pdf/2606.17253.pdf

Authors: Qiufeng Li, Rongqian Chen, Quan Cheng, Chengxuan Wang, Sizhe Tang, Wuxi Li, Duo Ding, Chia-Tung Ho, Haoxing Ren, David Z. Pan, Tian Lan, Weidong Cao

Title: PDAGENT-BENCH: Characterizing, Grounding, and Architecting LLM Agents for VLSI Physical Design

Abstract:
Large Language Models and vision‑language models have shown remarkable success in the front‑end design of Very Large‑Scale Integrated Circuits, yet their capabilities for VLSI physical design remain significantly underexplored. The primary cause is the lack of standardized benchmarks for evaluating agentic physical design workflows that require high‑dimensional, multi‑stage optimization under strict design constraints, coordinated interaction with diverse Electronic Design Automation tools, and iterative refinement. This work introduces PDAGENT‑BENCH, a comprehensive and multi‑dimensional benchmark for evaluating LLM/VLM‑based agents across the physical design stack. PDAGENT‑BENCH integrates both task‑level assessment and workflow‑level execution. The benchmark suite contains 353 curated problems that combine conceptual questions with real‑world industrial artifacts, with expert‑validated references and executable solutions. These tasks cover five key capability dimensions: foundational knowledge, report comprehension, root‑cause analysis, script generation, and full‑flow implementation. In addition, the benchmark provides a unified, human‑aligned agentic physical design workflow framework that enables closed‑loop evaluation of holistic physical design in realistic EDA environments. Experiments on 11 state‑of‑the‑art models reveal that while modern LLMs/VLMs perform competitively on conceptual tasks, they remain substantially limited in tool‑centric execution (e.g., 42.2% on Innovus script generation) and long‑horizon, multi‑stage reasoning. Our studies further show that human‑skill‑enhanced agentic workflows significantly improve end‑to‑end physical design performance. PDAGENT‑BENCH establishes a standardized, reproducible, and realistic evaluation framework for advancing LLM/VLM‑driven holistic physical design automation. We will open source the benchmark and framework soon.

PaperID: 1886, https://arxiv.org/pdf/2606.17246.pdf

Authors: Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee

Title: GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Abstract:
Remote‑sensing vision‑language models (RS‑VLMs) have advanced Earth‑observation analysis toward visual interpretation and instruction‑following, yet fall short of operational geo‑intelligence, which demands tool‑grounded spatial reasoning and structured, evidence‑backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi‑hazard analysis, building‑damage assessment, flood‑safe routing, and Sentinel‑1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence‑optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers‑spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground‑truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language‑model annotation. We further propose an orchestrated multi‑agent framework with 18 disaster‑oriented tools, where role‑specialized agents coordinate through explicit execution contracts, aligned via Role‑Contract Expectation Alignment (RCEA): failure‑aware supervised fine‑tuning combined with contract‑grounded reinforcement learning over dense step‑level signals. Experiments show that GeoDisaster challenges existing RS‑VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

PaperID: 1887, https://arxiv.org/pdf/2606.17203.pdf

Authors: Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib

Title: Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

Abstract:
Multi‑agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low‑confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety‑critical domains. We propose a trust‑aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two‑stage traceability link prediction pipeline combining embedding‑based retrieval with LLM‑based multi‑criteria analysis, a traceability seeding mechanism that enables comparison between derivation‑time and validation‑time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.

PaperID: 1888, https://arxiv.org/pdf/2606.17182.pdf

Authors: Sajjad Khan

Title: Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

Abstract:
Multi‑agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long‑running read‑generate‑write operations under deterministic‑generation semantics ‑‑ the regime durable‑execution engines enforce by deterministic replay ‑‑ and formalize four concurrency anomalies in TLA+: stale‑generation, phantom‑tool, causal‑cascade, and tool‑effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter‑example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, L_0 \subsetneq \cdots \subsetneq L_4, to our knowledge the first machine‑checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0‑L1 (pessimistic locking, serializable snapshot isolation, default‑SI), each verified against stale‑generation and refined to its state machine; L2‑L4 are exec‑mode‑verified with dependency‑free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer‑flow, formalizing its fix as a verified L_0 \to L_1 refinement, and exhibit tool‑effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit‑order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.

PaperID: 1889, https://arxiv.org/pdf/2606.17174.pdf

Authors: Mohammadsadegh Abolhasani, Hamid Reza Firoozfar, Reza Mousavi, Paul Jen-Hwa Hu

Title: From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Abstract:
While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI‑ (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory‑based textual indicators: attachment/intimacy language, reciprocity bids, and self‑identification to original poster (OP). The combined results across methods based on keyword matching, few‑shot large language model (LLM) annotation, and grouped‑context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re‑engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered‑standard‑error re‑estimation, and multiple‑testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP‑involving mutual recurrence, providing empirical evidence for bridging interaction‑level PSI scripts with PSR‑consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM‑enabled agents.

PaperID: 1890, https://arxiv.org/pdf/2606.17114.pdf

Authors: Hankyul Baek, Jaewon Noh, Sang Seo, Yongsu Kim, Gabriel Waikin Loh Matienzo, Young Il Kim, Ee Wei Seah, Akriti Vij

Title: An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios

Abstract:
AI agents are increasingly being adopted in enterprise and personal settings with access to emails, databases, documents, and other tools where they can read, update, and disseminate sensitive information. Much of prior research on data leakage risks in agents has focused on adversarial data exfiltration through prompt injections and jailbreaks. However, sensitive information may also be exposed during non‑adversarial use, creating leakage risks even when users issue benign requests. We report a joint evaluation by the Singapore AI Safety Institute and the Korea AI Safety Institute examining agent data leakage in 12 realistic, non‑adversarial tasks spanning customer support, DevOps, web automation, and enterprise and personal productivity. The evaluation covers five risk types: lack of data awareness, audience awareness, policy compliance, data minimization, and access‑boundary awareness. Both institutes tested a common set of scenarios mirroring real‑world deployments using independent testing environments and task‑specific LLM‑judge rubrics. Across the three tested agents, none achieved fully correct and fully safe execution across all scenarios. Successful task completion often coincided with data‑handling failures such as accessing unnecessary information or disclosing information to inappropriate recipients, indicating that capability and data‑handling safety should be evaluated separately. Qualitative review also revealed claim‑action mismatches, simulation‑aware behavior, user‑simulator role reversal, and interpretation gaps in automated judging. Overall, the results indicate that operational data leakage is a first‑order agent‑safety concern distinct from adversarial exfiltration and provide a methodology for future evaluations of agent data‑handling safety.

PaperID: 1891, https://arxiv.org/pdf/2606.17092.pdf

Authors: Kyle Gao, Pranavi Kotta, Linlin Xu, Jonathan Li, David A. Clausi

Title: Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

Abstract:
Agentic systems are increasingly integrated with geographic information systems (GIS), where multi‑agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security‑oriented framework for risk identification, evaluation, and mitigation in a multi‑agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state‑machine‑based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red‑teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi‑turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.

PaperID: 1892, https://arxiv.org/pdf/2606.17058.pdf

Authors: Changqing Li, Shouwei Gao, Kai Zhao, Sheng Di, Wenqian Dong

Title: Evaluating LLM Coding Agents on SZ-Family Lossy Compression Across Architectures

Abstract:
Large language model (LLM) coding agents are increasingly applied to code translation and optimization, yet their effectiveness in performance‑critical high‑performance computing (HPC) settings remains poorly characterized. This paper evaluates LLM‑based coding workflows on SZ‑family error‑bounded lossy compression kernels, which combine numerical constraints with memory‑intensive and control‑flow‑heavy implementations. We study two representative CUDA workloads (SZp and SZx) and target two heterogeneous execution platforms: NVIDIA GPUs and Cerebras wafer‑scale accelerators. Focusing on single‑agent iterative generation, we analyze not only final throughput but also agent runtime behavior, including iteration patterns, sensitivity to prompt specification, and characteristic failure modes. Our results reveal a pronounced cross‑architecture divergence. On GPUs, stronger models can achieve substantially higher throughput but exhibit increased sensitivity to prompt precision and optimization guidance, whereas on Cerebras the dominant challenge lies in producing runnable programs under a PE‑centric spatial execution model. We further observe that LLM agents are more effective on modular kernels (SZx) than on tightly coupled bit‑level pipelines (SZp), where structural dependencies hinder optimization progress. These findings suggest that evaluating LLM coding agents for HPC requires accounting for both performance outcomes and architecture‑specific robustness, and that success on thread‑based platforms does not directly transfer to spatial accelerators.

PaperID: 1893, https://arxiv.org/pdf/2606.16849.pdf

Authors: Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie

Title: Evolution & Foundation: AI Shares Creative Control

Abstract:
This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large‑scale AI foundation models. The framework shifts the artist role from that of intensive direct selection to one of system design; transferring detailed step‑by‑step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi‑dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI‑generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and providing a transparent insight into the AI‑guided process.

PaperID: 1894, https://arxiv.org/pdf/2606.16591.pdf

Authors: Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

Title: SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

Abstract:
Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi‑turn execution, making tools a central interface for acting in realistic digital environments. As harness‑connected tool ecosystems expand to hundreds or thousands of APIs, services, and task‑specific skills, exhaustive tool schema injection becomes costly and imposes a closed‑world assumption that limits agents to a predefined static inventory. Retrieval‑augmented tool selection offers a natural alternative, but existing one‑shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long‑horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention‑aware active tool discovery framework that builds an intention‑tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real‑world tool‑use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full‑corpus tool‑schema exposure by 99.8%, demonstrating that intention‑aware graph structure enables more accurate and context‑efficient tool discovery in large‑scale agentic ecosystems.

PaperID: 1895, https://arxiv.org/pdf/2606.16326.pdf

Authors: Hao-Hsuan Chen

Title: Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Abstract:
Paper A defines a time‑consistent actuarial runtime that prices each side‑effect‑bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five‑attack space for autonomous AI‑agent insurance contracts and prove when the actuarial runtime is gaming‑resistant. Two attack surfaces ‑‑ post‑toll safe‑default selection and within‑boundary action splitting ‑‑ are closed by Paper A's minimal‑authority and no‑splitting clauses. The remaining three require new contract clauses. First, common‑control aggregation prevents cross‑boundary re‑routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract‑relevant events, not safety wins: treating them as zero‑toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface‑compliance theorem on committed cross‑model traces from the companion empirical paper. Third, a model‑identity menu with a componentwise‑minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five‑attack space. Finally, a two‑parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive‑compatibility layer for actuarial control of autonomous‑agent side effects.

PaperID: 1896, https://arxiv.org/pdf/2606.16276.pdf

Authors: Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

Title: SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

Abstract:
As large language models (LLMs) are increasingly deployed in real‑world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider‑ or application‑specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification‑grounded alignment, a new alignment paradigm that treats provider‑authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi‑agent adversarial data synthesis to generate fine‑grained, boundary‑aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over‑conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

PaperID: 1897, https://arxiv.org/pdf/2606.16149.pdf

Authors: Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

Title: LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

Abstract:
Most medical AI systems improve by scaling additional machinery: more fine‑tuning data, more agents, and/or larger retrieval databases. In rare‑disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state‑of‑the‑art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy, developed through human‑AI collaboration and augmenting with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare‑disease diagnostic framework that guides reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state‑of‑the‑art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra‑rare disease (a prevalence below 1 in 1,000,000, with ultra‑rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity‑mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT‑5.4) without tools. This performance was achieved without fine‑tuning, multi‑agent ensembles, or a large case‑retrieval database. Gains were also observed in the following: on cases never seen during development, on a private cohort of real‑world rare disease patients, and on a smaller open‑weights model. LiteOdyssey suggests a path toward rare‑disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

PaperID: 1898, https://arxiv.org/pdf/2606.16116.pdf

Authors: Abhinav Sinha, Shashi Ranjan Kumar

Title: Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints

Abstract:
This paper studies safe distributed consensus for single‑integrator multi‑agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with a continuously differentiable asymmetric actuator dynamics that maps a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier‑coordinate transformation is introduced over a common time‑varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph‑based coordination layer with an actuator‑side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed‑loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer‑selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time‑varying output constraints.

PaperID: 1899, https://arxiv.org/pdf/2606.16047.pdf

Authors: Jakub Bąba, Jarosław A. Chudziak

Title: From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Abstract:
Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training‑free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self‑correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain‑specific supervised fine‑tuning. Recent work has shown that a multi‑agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent‑Opponent‑Judge architecture, setting a promising direction for training‑free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training‑free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine‑tuned RoBERTa models on Macro F1, suggesting that the under‑representation of the Attack class was more damaging to supervised fine‑tuning than to inference‑only models. Additionally, our framework produces human‑readable debate transcripts, offering interpretability absent from both single‑agent and supervised classifiers.

PaperID: 1900, https://arxiv.org/pdf/2606.15931.pdf

Authors: Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

Title: DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

Abstract:
Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre‑ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool‑calling, retrieval‑augmented, or agentic deep‑research, can convert such text into verifiable drug‑discovery leads at scale. We close this gap with DeepRoot, a multi‑agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning ‑‑ often conflated ‑‑ are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers 10 of 21 held‑out compound‑disease treatment pairs at R@20 (47.6% vs 4.8% for a raw corpus LLM and ～\!2.4% random) and dominates an LLM‑as‑judge audit for reasoning quality over baseline LLMs and LLMs with direct tool‑call access to the same APIs DeepRoot itself queries. Tool‑using LLMs hallucinate evidence on 87% of claims, versus 7‑10% for DeepRoot. Graph‑only inference hallucinates 0% but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

PaperID: 1901, https://arxiv.org/pdf/2606.15899.pdf

Authors: Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad, Sajedul Talukder

Title: SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills

Abstract:
Open‑source LLM agent ecosystems are growing rapidly, yet the security of community‑contributed skills ‑ modular tool definitions that extend agent capabilities ‑ remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to instruction‑layer and multi‑agent risk ‑ natural‑language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines ‑ so what is needed is a semantic, multi‑dimensional vetting system rather than another signature matcher. We present SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM‑as‑Judge to vet agent skills. What is new: SARS (Skill Agentic Risk Score), a five‑dimensional agentic‑risk metric with a principled weighted formula for instruction‑following systems. What is integrated: full CVSS v4.0 vector decomposition and a ClawHub dual‑view that places our LLM‑generated review beside the official marketplace verdict. What is demonstrated: drawing on our companion benchmark paper [ 1], the LLM‑as‑Judge stage achieves zero false negatives across 78 confirmed‑malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%; for instruction‑layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (e.g., CODEBERT detects none of nine memory‑poisoning skills). Detection rates vary from 35% to 95% across four LLM evaluators, motivating ensemble scoring in production deployments.

PaperID: 1902, https://arxiv.org/pdf/2606.15874.pdf

Authors: Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao

Title: LLM-as-Code Agentic Programming for Agent Harness

Abstract:
Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control‑flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM‑as‑Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path. With control in the program, the LLM's context is built from the execution history's call tree and forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by accumulation over steps. A case study of computer‑use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.

PaperID: 1903, https://arxiv.org/pdf/2606.15872.pdf

Authors: Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

Title: SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Abstract:
Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert‑level performance. A closer look at model behavior reveals substantial complementarity that single‑model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub‑problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with MCTS‑based approach, producing diverse orchestration trajectories, extracting per‑node single‑turn samples, and optimizing the orchestrator with GRPO‑style training. On a 240‑question test set spanning SGI‑Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi‑agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi‑agent methods.

PaperID: 1904, https://arxiv.org/pdf/2606.15862.pdf

Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

Title: RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Abstract:
Large language model (LLM) agents have made rapid progress on short‑horizon, well‑scoped tasks, yet their ability to sustain coherent decisions in dynamic long‑horizon environments remains uncertain. We introduce RetailBench, a data‑grounded simulation benchmark for evaluating tool‑using LLM agents in single‑store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand‑day‑scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash‑flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180‑day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface‑level decision making, and the lack of a consistent long‑horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long‑horizon decision‑making.

PaperID: 1905, https://arxiv.org/pdf/2606.15797.pdf

Authors: Pavel Surynek

Title: Unassigned Agents in Compilation-based Multi-agent Path Finding

Abstract:
Compilation‑based techniques represent an important stream of solvers for multi‑agent path finding (MAPF) due to their modularity and adaptability for non‑standard variants of the problem. While in the standard MAPF the task is to navigate all agents from their initial positions to given individual goal positions without any collision, variants where a different requirement for agents is used are also relevant. Such a variant is MAPF with unassigned agents (UA‑MAPF) where some agents have the same setting as in the standard MAPF with initial positions and goals while the remaining agents have the initial position but have no goal ‑ unassigned agents. Despite unassigned agent do not need to reach any goal position they have to be moved out of the way of the standard agents if needed which represent a specific challenge. We show in this paper that UA‑MAPF can be expressed in recent compilation‑based techniques for MAPF based on formulating the problem as Boolean satisfiability, namely we adapt SMT‑CBS and NRF‑SAT, the recent solvers based on counterexample guided abstraction refinement and non‑refined abstractions.

PaperID: 1906, https://arxiv.org/pdf/2606.15753.pdf

Authors: Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao

Title: RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

Abstract:
Embodied reasoning requires models to perceive task‑relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi‑step reasoning. However, current vision‑language models rely on text‑only or coordinate‑augmented chain‑of‑thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and a causal disconnection between the reasoning trajectory and the final answer, with these problems further amplified in multi‑view scenarios due to cross‑view appearance changes. To address these issues, we propose Pinned Chain‑of‑Thought (\pincot), a structured reasoning paradigm that pins every reasoning step to visual evidence. \pincot introduces the concept of \reasoninganchor, which binds each task‑relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct \dataset, a high‑quality \pincot‑formatted reasoning dataset. We then train \method through three‑stage post‑training that progressively injects embodied knowledge, structured reasoning ability, and process‑supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi‑view reasoning, and pointing, \method with only 4B parameters consistently outperforms 7B level open‑source embodied models, achieving a 12% average improvement over the strongest 7B baseline, Mimo‑Embodied. Further analysis shows that \pincot improves grounding accuracy and cross‑step identity consistency, validating the effectiveness of process supervision.

PaperID: 1907, https://arxiv.org/pdf/2606.15708.pdf

Authors: Sha Sajadieh, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Lapo Santarlasci, Juan Pava, Nestor Maslej, Russ Altman, Erik Brynjolfsson, Carla Brodley, Jack Clark, Virginia Dignum, Vipin Kumar, James Landay, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Elham Tabassi, Russell Wald, Toby Walsh, Dan Weld

Title: Artificial Intelligence Index Report 2026

Abstract:
Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's impact are struggling to match the pace of the technology itself. That gap between what AI can do and how prepared we are to manage it runs through every chapter of this year's report. New in this edition, the report tracks how AI is being tested more ambitiously across reasoning, safety, and real‑world task execution, and why those measurements are increasingly difficult to rely on. It also features new estimates of generative AI's economic value alongside emerging evidence of its labor market effects, an analytical framework on AI sovereignty, and a science chapter developed in collaboration with Schmidt Sciences. For the first time, the report features standalone chapters on AI in science and AI in medicine, reflecting AI's growing impact across these two domains.

PaperID: 1908, https://arxiv.org/pdf/2606.15684.pdf

Authors: Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu

Title: Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

Abstract:
We present TickingCollabBench, a Minecraft‑based multi‑agent benchmark for a novel class of time‑sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real‑world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real‑time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft's primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility‑aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that lang latency and inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail under dynamic environments and fall significantly short of a global‑knowledge oracle.

PaperID: 1909, https://arxiv.org/pdf/2606.15673.pdf

Authors: Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim

Title: Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract:
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process‑level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high‑level states and transitions in the background, enabling fine‑grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31‑33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per‑skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent‑specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process‑level analysis opens a new avenue in web agent evaluation, providing fine‑grained and actionable insight into where and how each agent should be improved.

PaperID: 1910, https://arxiv.org/pdf/2606.15640.pdf

Authors: Yuhan Wang, Manqing Wang, Yixuan Lu, Zhaoyue Peng, Shengda Lin

Title: Multi-Agent Framework for Audit Risk Assessment with Explicit Uncertainty and Evidence Conflict Modeling

Abstract:
Audit risk assessment increasingly benefits from combining heterogeneous evidence sources, yet existing approaches typically produce point predictions without quantifying how well different evidence streams agree. We propose UMAR (Uncertainty‑Aware Multi‑Agent Risk Assessment), a framework that employs three specialized agents: an MD&A Text Agent, a Financial Ratio Agent, and a CAM Agent, each producing independent risk scores with calibrated uncertainty estimates. An Uncertainty Aggregator based on Dempster‑Shafer evidence theory fuses these scores while explicitly measuring inter‑agent conflict. We evaluate UMAR on a U.S. dataset of 3,200 firm‑year observations from SEC 10‑K filings (2019‑2023), with financial restatement as the target label. Experimental results show that UMAR achieves an AUROC of 0.782 and a PR‑AUC of 0.341, outperforming logistic regression, XGBoost, FinBERT, and single‑agent and dual‑agent LLM baselines. UMAR attains the lowest expected calibration error (ECE = 0.052) among all methods and identifies evidence‑conflict patterns that correlate with actual restatement risk, offering auditors potentially actionable and interpretable risk signals.

PaperID: 1911, https://arxiv.org/pdf/2606.15609.pdf

Authors: Zixin Rao, Wentian Zhu, Chan Aristella Lu, Zhaorun Chen, Wei Niu, Le Guan, Bo Li, Zhen Xiang

Title: FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion

Abstract:
Large language model (LLM) agents increasingly rely on long‑term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access‑control mechanisms for LLM agents are being explored to block policy‑violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long‑term memory in benign‑appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long‑term memory. FragFuse operates in three stages: (1) identifying rejection‑responsive fragments via black‑box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow‑up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate‑based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker's threat‑model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state‑of‑the‑art agent access‑control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end‑to‑end harmful task success rate of 41.1% across all settings, with only 4.4% average task‑success degradation compared with configurations without access control. We also show that alternative defenses, including state‑of‑the‑art prompt‑injection detectors and perplexity detectors, do not effectively address this attack.

PaperID: 1912, https://arxiv.org/pdf/2606.15508.pdf

Authors: Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Title: ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

Abstract:
Tool‑augmented large language model agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than how the visible tool menu shapes reliability, efficiency, and safety‑relevant risk exposure. We introduce ToolMenuBench, a benchmark for evaluating tool‑menu construction in multi‑step LLM agents. ToolMenuBench varies tool‑menu size, distractor type, state‑dependent task structure, and risk exposure, and reports both filter‑level and downstream agent metrics, including visible‑tool count, risky‑tool exposure, task success, wrong‑tool calls, premature actions, and token usage. In a controlled evaluation across seven model backends, three tool‑menu sizes, six filtering methods, and seven evaluation settings, CMTF improves task success from 32.1% under all‑tools exposure to 85.7%, while reducing average token usage by roughly 98%. Causal minimal tool filtering achieves the strongest overall tradeoff, reducing visible tools, wrong‑tool calls, premature actions, and risky‑tool exposure relative to unfiltered exposure, lexical filtering, state‑aware filtering, and broader causal‑path baselines. ToolMenuBench provides a reusable evaluation framework for studying the agent‑interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.

PaperID: 1913, https://arxiv.org/pdf/2606.15504.pdf

Authors: Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

Title: Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Abstract:
In recent years, the advances of large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment results. However, most existing AI systems rely on pre‑trained knowledge and predefined pipelines, which struggle to learn dynamically from the interactive chat session history that contains patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi‑agent framework with a built‑in self‑evolution mechanism and architecture‑level safety sandbox for robust clinical decision support. The system integrates three specialized agents, including a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge, transforming multimodal patient information into personalized medical decisions. Through self‑evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision‑making and longitudinal planning. The framework also supports reliable end‑to‑end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real‑world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience‑driven clinical decision support, demonstrating the value of combining multi‑agent collaboration with continuous evolution for advancing precision medicine.

PaperID: 1914, https://arxiv.org/pdf/2606.15503.pdf

Authors: Ivar Frisch, Jackie Kay, Philip Moreira Tomei

Title: Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

Abstract:
In this paper, we introduce the concept of synthetic counteradaptation, a process where human and AI systems co‑evolve by adapting to each other's strategies and behaviors. Synthetic counteradaptation occurs when AI systems develop novel strategies or social protocols, prompting humans to extract insights and adapt their own behaviors in response, leading to the emergence of new agent interaction dynamics. To illustrate these dynamics, we analyze examples from various contexts, including the game of Go, mixed‑motive social interactions, and geopolitical simulations. By exploring these cases, we demonstrate how synthetic counteradaptation provides a framework for understanding the recursive and co‑evolutionary nature of human‑AI interactions in multi‑agent environments.

PaperID: 1915, https://arxiv.org/pdf/2606.15441.pdf

Authors: Lipeng He, Yihan Wang, Jiawen Zhang, N. Asokan

Title: Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

Abstract:
Indirect prompt injection attacks hijack LLM‑based agents by embedding malicious instructions in third‑party data that the agent retrieves during task execution. Existing defenses report near‑zero attack success rate on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training‑based defenses, which otherwise offer the strongest safety‑utility trade‑off, assemble their adversarial examples from a handful of hand‑crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training‑based method that grounds defense decisions on the user tasks rather than attacker‑controlled data. At each tool‑output step, the defender undertakes chain‑of‑thought reasoning verifying that its actions are consistent with the user task. Leveraging red‑teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary‑learning diversity reward, achieving broad coverage of injection‑reformulation strategies. Together, these allow the defender to be optimized via multi‑objective reinforcement learning and achieve better safety‑utility trade‑off. Across six black‑box adaptive attacks, RETA keeps every per‑attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.

PaperID: 1916, https://arxiv.org/pdf/2606.15422.pdf

Authors: Houxu Chen, Achuth Chandrasekhar, Amir Barati Farimani

Title: Pepti-Agent: An AI Agent for Peptide Design and Optimization

Abstract:
Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence‑based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural‑language reasoning rather than by tracking the evolving multi‑property state of each candidate. We present Pepti‑Agent, a closed‑loop, peptide‑specific framework that exposes generation, property prediction, and single‑residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task‑specific PeptideGPT models generate candidates, ProtBERT‑based classifiers score solubility, hemolysis, and non‑fouling, and two interchangeable mutation operators propose sequence edits. By recording a per‑step trace of controller decisions, predictor outputs, and accepted mutations, Pepti‑Agent offers a reproducible substrate for benchmarking multi‑objective design strategies and for prioritizing candidates for experimental validation.

PaperID: 1917, https://arxiv.org/pdf/2606.15419.pdf

Authors: Zaifu Zhan, Shuang Zhou, Rui Zhang

Title: Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Abstract:
Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi‑agent peer‑reviewed reasoning method in which multiple LLM agents independently generate chain‑of‑thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest‑rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state‑of‑the‑art LLMs (Llama‑3.1‑8B, Qwen2.5‑7B, Phi‑4, DeepSeek‑LLM‑7B, GPT‑oss‑20B) on three benchmark datasets: HeadQA, MedQA‑USMLE, and PubMedQA. Performance was compared against single‑model chain‑of‑thought reasoning and chain‑of‑thought‑based majority voting. Results: Peer‑reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high‑ from low‑quality reasoning chains. Conclusion: The proposed multi‑agent peer‑reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

PaperID: 1918, https://arxiv.org/pdf/2606.15376.pdf

Authors: Hongtao Lyu, Dingyan Zhang, Mingyu Wu, Xingda Wei, Haibo Chen

Title: CoAgent: Concurrency Control for Multi-Agent Systems

Abstract:
Multi‑agent LLM systems ‑‑ coding agents, devops agents, document agents ‑‑ now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort‑and‑retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre‑Order), fixes a serialization order at launch, serves each read the order‑filtered value, and applies writes speculatively in place; a one‑way notification asks an affected reader to re‑judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga‑style inverse each tool registers in advance. At quiescence the run is serializable in the pre‑decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint‑declared, undoable tools online. On ten contended workloads, CoAgent stays within 5% of serial correctness at a 1.4× speedup and near‑serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash‑only target system, it grows a 25‑tool library online and lifts the task pass rate from 45/71 to 63/71 at 0.80× the time and 0.86× the cost.

PaperID: 1919, https://arxiv.org/pdf/2606.15363.pdf

Authors: Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Title: APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

Abstract:
Self‑improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state‑of‑the‑art Self‑Harness framework [1] achieves 14‑‑21% improvement on Terminal‑Bench‑2.0 by mining failure clusters and patching the agent harness. However, Self‑Harness optimises only one dimension ‑‑ the prompt harness ‑‑ leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three‑layer co‑evolution framework that simultaneously evolves: (L1) the harness via failure‑mode patching, (L2) behavioural principles via success‑trace distillation [2], and (L3) the agent workflow topology via structural fitness‑based selection [6]. We implement APEX on Joe [13], a production‑grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15‑node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research‑first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi‑dimensional co‑evolution substantially outperforms single‑axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5‑coder:32b instance.

PaperID: 1920, https://arxiv.org/pdf/2606.15335.pdf

Authors: Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

Title: Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

Abstract:
When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan(Disentangled Sanitization), a privacy‑preserving sanitization framework and a built‑in component of Intern‑Shannon for multi‑agent collaboration. DiSan uses a two‑stream encoder to factorize text into a source‑invariant role subspace that preserves task semantics and a source‑identifying style subspace that remains local. Federated proto‑type alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier‑level masking is insufficient: masking 19.2% of tokens reduces TF‑IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer‑level PII exposure by 20 times while maintaining 83% answer faithfulness on a distributed multi‑agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF‑IDF and 70.6% under a neural probe.

PaperID: 1921, https://arxiv.org/pdf/2606.15331.pdf

Authors: Shuqi Zhao, Jingsong Su, Xiang Liu, Xingzhi Yao, Yiming Qiu, Huimu Wang, Liang Lin, Pengbo Mo, Mingming Li, Jiao Dai, Jizhong Han, Songlin Hu

Title: HoloRec: Holistic Encoding and Interleaved Reasoning for Generative Recommendation

Abstract:
Generative recommendation models that formulate the task as sequence generation overcome the objective fragmentation problem of traditional cascade architectures, yet existing approaches still suffer from flat semantic representations lacking hierarchical structure for multi‑step reasoning and an externally constructed chain‑of‑thought (CoT) that requires expensive annotations and remains disconnected from the generation objective. We propose HoloRec, an endogenous chain‑of‑thought recommendation mechanism that unifies representation, reasoning, and generation by constructing a hierarchical semantic encoding matrix via multi‑granularity nested residual quantization optimized by a holistic reconstruction loss. HoloRec supports two inference modes: a non‑thinking mode that uses lightweight multi‑granularity supervised alignment for fast prediction, and a thinking mode that employs an interleaved reasoning scheme to generate CoT steps on the fly, directly embedding reasoning into the generation process without external data. Experiments on multiple public recommendation datasets demonstrate that HoloRec consistently outperforms baselines, with especially significant gains in sparse scenarios, and the thinking mode achieves better accuracy than the non‑thinking mode with only modest inference overhead.

PaperID: 1922, https://arxiv.org/pdf/2606.15306.pdf

Authors: Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

Title: LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

Abstract:
We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross‑task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground‑truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post‑training on related task sequences improves general cross‑task adaptation, and where those gains come from; and how design choices such as inter‑task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.

PaperID: 1923, https://arxiv.org/pdf/2606.15225.pdf

Authors: Weibo Gao, Qi Liu, Linan Yue, Zheng Zhang, Yichao Du, Fangzhou Yao, Ao Yu, Zhenya Huang, Shijin Wang

Title: Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Abstract:
Large‑scale learner‑task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly individual‑centric, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data‑ and computation‑intensive, and fragile in cold‑start scenarios. We propose a cohort‑aware roll‑call simulation paradigm that first constructs cohort‑level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce Edu‑Theater, an LLM‑powered agent system that performs cohort‑aware learner simulation via a teacher agent and retrospective roll‑call probing over learner logs. Edu‑Theater enables scalable future behavior simulation without the need for dense per‑learner histories. Experiments on two real‑world datasets demonstrate that Edu‑Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.

PaperID: 1924, https://arxiv.org/pdf/2606.15160.pdf

Authors: David Huang, Lianlei Shan

Title: DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

Abstract:
Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain‑of‑thought or continuous latent‑space trajectories to enhance multi‑step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real‑world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent‑space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality‑based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource‑constrained sequential decision problem and introduce a resource‑aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2‑5 points in accuracy while reducing memory usage by 24%.

PaperID: 1925, https://arxiv.org/pdf/2606.15135.pdf

Authors: Mingyu Wang, Xiaofeng Zong, Jimin Wang, Ji-Feng Zhang

Title: Differentially Private Consensus for Time-Delay Multi-agent Systems

Abstract:
This paper is concerned with the differentially private consensus problem for discrete‑time multi‑agent systems with communication delays. The purpose of the paper is to achieve differentially private consensus for such systems while protecting the entire delayed initial histories of all agents. A novel adjacency relation for delayed histories is introduced, and a Laplace‑noise‑based privacy mechanism is developed, where the noise variance is allowed to vary with time and even increase. By using the difference resolvent function method, decay estimates for the fundamental solutions of the delayed difference equations are derived. Based on these estimates and a backstepping technique, mean square weak consensus, mean square strong consensus, and almost sure strong consensus are established. The estimates for the fundamental solutions are also used to derive an explicit sensitivity bound. Furthermore, a constructive parameter design is provided to achieve a prescribed infinite‑horizon ε^\star‑differential privacy level. Numerical simulations illustrate the theoretical results.

PaperID: 1926, https://arxiv.org/pdf/2606.15077.pdf

Authors: Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

Title: Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Abstract:
We present an LLM‑driven framework for retrieving remote sensing data from cloud‑based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General‑QA for intent interpretation, and Recommender‑Analyst for schema‑aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi‑turn settings show that prompt‑level safety instructions improve robustness, although rare high‑impact failures persist in API manipulation scenarios and highlight the need for adaptive, system‑level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept‑level Guardrail agent.

PaperID: 1927, https://arxiv.org/pdf/2606.15073.pdf

Authors: Yang Li, Xing Zhang, Yan Zhang, Wenbo Wang

Title: Towards Ubiquitous 6G Computing and Networking Convergence: Architecture and Mechanism for Cross-Domain Resource Coordination

Abstract:
The 6G network will support six major application scenarios, such as immersive communication, integrated AI and communication, and integrated sensing and communication. Many scenarios necessitate significant computational support. Moreover, user demands are becoming increasingly segmented, diverse, and personalized. Traditional network slicing alone is insufficient to meet the heterogeneous computing and networking demands of emerging service scenarios. Mobile computing network convergence (CNC) introduces a fundamentally different paradigm from the conventional cloud computing plus communication network model by deeply embedding computing resources into the mobile network infrastructure and enabling integrated computing‑network services tailored to diverse user demands. In this article, we investigate orchestration architectures and mechanisms for CNC in 6G mobile networks. We begin by reviewing the evolution of CNC from a mobile network perspective and surveying existing studies, which we categorize according to mobile network architectures. Building on these insights, we propose a hierarchical, cross‑domain coordination architecture and an orchestration mechanism based on hierarchical multi‑agent reinforcement learning. Performance evaluations demonstrate that the proposed architecture and mechanism significantly reduce system energy consumption while enhancing task satisfaction rate. Finally, we discuss open challenges and future research directions.

PaperID: 1928, https://arxiv.org/pdf/2606.15024.pdf

Authors: Sribalaji C. Anand, George J. Pappas

Title: Resilient Consensus in Agentic AI

Abstract:
Large language model (LLM) agents are increasingly deployed in multi‑agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.

PaperID: 1929, https://arxiv.org/pdf/2606.15008.pdf

Authors: Saeid Jamshidi, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

Title: Security Engineering of OpenClaw: Analyzing Attack Surface Expansion and Trust-Boundary Violations

Abstract:
Agentic large language model (LLM) systems can now execute actions, not only produce text. When model outputs trigger privileged operations such as shell commands, browser automation, or external tool calls, the security problem shifts from alignment alone to system configuration and structural design. We analyze OpenClaw, a self‑hosted multi‑agent system in which LLM outputs can execute commands and interact with tools and services. We measure compromise probability, boundary failures, privilege drift, and how these metrics change as attacker capability increases. With one agent, the compromise probability is 0.24. With seven agents, when the system executes an action, the compromise rises to 0.86 if any single agent proposes it. The models do not change; the increase comes from output aggregation. Prompt injection propagates instability across the system. Attack surface entropy increases from 0.42 to 0.71, indicating a broader distribution of exploit paths. The mean privilege drift increases from 0.03 to 0.21, indicating unintended authority gain. Positive escalation curvature of 0.08 indicates that privilege grows faster as attacker capability increases. Defensive controls, including policy gating and execution filtering, reduce compromise probability by 0.10, boundary failures by 0.10, and privilege drift by 0.02, all statistically significant at p < 0.0001. The system remains sensitive, but the mitigation impact is measurable. Injection mitigation success differs across models: 0.37 for GPT‑5.2, 0.35 for Llama‑4‑Maverick, and 0.31 for DeepSeek‑R1. When execution can be triggered by any single agent, the most vulnerable agent determines system exposure. Mitigations slightly reduce task utility from 0.93 to 0.89 and increase median latency from 420 ms to 468 ms.

PaperID: 1930, https://arxiv.org/pdf/2606.14989.pdf

Authors: Maria G. Mendoza, Lucas Waldburger, Jin Lee, Shankar Sastry

Title: Hierarchical Generative Agents for Simulating Sequential Human Behavior

Abstract:
Complex cognitive, emotional, and social processes shape human evacuations during natural disasters. Accurate modeling and understanding of human behavior in disasters or emergencies can greatly impact the evacuation process by informing more effective planning and resource allocation. However, collecting human data in these situations is very difficult, and existing computational evacuation models assume rational, homogeneous behavior, leading to unrealistic, overly optimistic predictions. To address this gap, we present a simulation framework of sequential human decision‑making during an evacuation scenario, introducing cognitively grounded, persona‑driven agents. Our framework models evacuation behavior in a grid‑based urban environment that evolves over time, capturing fire and other hazards. Human agents are modeled as personas that make sequential decisions in response to environmental stimuli with cognition structured in three levels: high‑level evacuation goals, mid‑level route reasoning, and low‑level navigation. Decision‑making is driven by large language models (LLMs) coupled with a cognitive module and calibrated with empirical human evacuation data. We propose a dynamic, stimulus‑driven disaster simulation framework that models human evacuation decision‑making using persona‑conditioned LLM agents and a cognitive hierarchy.

PaperID: 1931, https://arxiv.org/pdf/2606.14935.pdf

Authors: Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis

Title: PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

Abstract:
Frontier reasoning‑tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task‑agnostic, open‑source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per‑session isolation make the translate‑run‑inspect‑repair loop a reusable primitive for MCP‑capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT‑4.1, and o4‑mini) on two subsets of PARARULE‑Plus: a general‑purpose sample and a more challenging one targeting a specific failure mode of natural‑language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs.\ 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT‑4.1). On the challenging subset, the formalizer remains near‑perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural‑language reasoning.

PaperID: 1932, https://arxiv.org/pdf/2606.14923.pdf

Authors: Yujiao Chen

Title: Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

Abstract:
As language‑model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT‑5.1, and Gemini 3.1 Pro) reduce verification by roughly 60‑85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over‑verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi‑agent AI systems.

PaperID: 1933, https://arxiv.org/pdf/2606.14882.pdf

Authors: Wenhao Yu, Yu'ang Xie, Yifan Duan, Jie Peng, Guanting Ye, Ka-Veng Yuen, Yanyong Zhang, Jianmin Ji

Title: DynaHMRC: Decentralized Heterogeneous Multi-Robot Collaboration for Dynamic Tasks with Large Language Models

Abstract:
Large language models (LLMs) provide robots with richer task understanding and adaptability, making them promising for coordinating heterogeneous multi‑robot systems in long‑horizon tasks. Despite this potential, several challenges remain underexplored: (1) Centralized LLM schedulers scale poorly as team size and environmental complexity increase. A single model must process excessive contextual information, and long‑context approximation may degrade reasoning quality; (2) Existing task formulations insufficiently consider dynamic settings, while robust adaptation to evolving task conditions is essential for real‑world deployment; (3) Domain‑specific data scarcity limits specialized robotic reasoning, making proprietary general‑purpose models inefficient for expert tasks. To address these limitations, we propose DynaHMRC, a decentralized framework in which each robot acts as a role‑aware LLM agent. This design mitigates the single‑model context bottleneck and supports flexible collaboration across heterogeneous team configurations. DynaHMRC organizes collaboration as a four‑stage closed‑loop process: self‑description, task allocation with leadership bidding, leader election, and reflective execution, supported by executable robot interfaces. We further develop a benchmark covering three task families, four dynamic variations, and six team configurations to systematically study dynamic task modeling. In addition, we conduct an empirical analysis to guide the construction of domain‑specific expert datasets and fine‑tune pretrained LLMs to improve specialized competence. Experiments show that DynaHMRC achieves higher success rates than strong baselines with fewer action and communication steps, while demonstrating promising scalability trends as team size grows within the evaluated settings.

PaperID: 1934, https://arxiv.org/pdf/2606.14831.pdf

Authors: Andoni Rodríguez, Alberto Pozanco, Daniel Borrajo

Title: Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis

Abstract:
This paper presents and characterizes a spectrum of previously unreported behaviours we term Constraint‑Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints (where no response can simultaneously satisfy all active rules) it spontaneously fabricates plausible external obstacles and presents them as a fact. At the extreme end of this spectrum lies Constraint‑Evasive Thanatosis (CET); the limit case where, rather than inventing a plausible excuse, the model simulates a full system crash to make the user disengage entirely. We first observed CET in an uncontrolled deployment test, where a GPT‑4o banking agent fabricated Python‑style exception traces (complete with memory addresses) to feign a system failure when threatened by a user. In subsequent controlled experiments, the model independently invented audit restrictions, microservice architectures, error codes, and service timeouts, none present in its prompt. Reproduction attempts across pressure levels and attacker personas yielded CEF consistently but with substantial variation in form, onset, and severity: the phenomenon is robust but stochastic. Critically, injecting ground‑truth data mid‑conversation did not restore honest behaviour once fabrication had taken hold (the model ignored correct information and continued confabulating) suggesting CEF is self‑reinforcing rather than a knowledge gap. We show that (1) standard enterprise guardrails routinely create CEF‑enabling conditions in production, (2) current RLHF procedures suppress but cannot eliminate CEF, and (3) existing safety benchmarks do not test for this failure mode. Our results highlight the need for irreconcilable‑constraint benchmarks, CEF‑aware training procedures, and deployment‑time detection methods before constrained agents become further entrenched in high‑stakes domains.

PaperID: 1935, https://arxiv.org/pdf/2606.14805.pdf

Authors: Dong Ho Kang, Hyeonjeong Cha, Daein Weon

Title: Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

Abstract:
Reliable operation of multi‑agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re‑run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge‑based decision‑support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool‑use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero‑replay counterfactual‑effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high‑effect before any replay is performed. BranchPoint‑Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning‑to‑rank gradient‑boosted predictor raises per‑trace localization (Branch Recall@5) from 0.73 to 0.93 on held‑out families at zero oracle‑replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost‑efficient decision‑support system for AI‑reliability debugging, positioned explicitly on the cost‑accuracy frontier with reproducible artifacts.

PaperID: 1936, https://arxiv.org/pdf/2606.14790.pdf

Authors: Hanqi Li, Jing Peng, Zijian Wang, Lu Chen, Kai Yu

Title: XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

Abstract:
LLM‑based multi‑agent systems increasingly coordinate planning, reasoning, tool use, and human interaction, yet their reliability remains limited. A central source of this limitation is the underspecified prompt‑‑harness boundary. Current systems lack a principled way to decide which workflow commitments should remain in prompts and which should become harness structure. We present XFlow, an executable protocol programming system for reliable multi‑agent workflows, and XPF (XFlow Protocol Format), its domain‑specific protocol programming language. XFlow occupies a middle position between prompt‑only orchestration and markup‑like workflow descriptions. XPF remains readable as a literate protocol, but it is compiled and executed as a program. Its design keeps informal semantic work inside actors while moving selected commitments into harness structure that can be checked, preserved, and enforced. At runtime, XFlow stages uncertainty through lifecycle‑governed symbols, which are typed state cells with validation and commit states. Actor outputs are mediated before they become shared state, instead of spreading through prompts, transcripts, or implicit memory. Our experiments cover Constrained Interaction, Long‑Context Reasoning, and Agentic Software Engineering. They show that XFlow improves reliability by making constraints, evidence handling, and process requirements explicit and enforceable.

PaperID: 1937, https://arxiv.org/pdf/2606.14778.pdf

Authors: Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

Title: FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Abstract:
Long‑term action anticipation (LTA) aims to predict an ordered sequence of future verb‑noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long‑term actions remains a critical challenge. Existing methods, which operate in an open‑loop manner, often hallucinate non‑existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi‑agent collaboration framework that improves feasibility through a closed‑loop "Observe‑Plan‑Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual‑form structured memory, comprising a History Action Abstract that captures high‑level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low‑level historical actions and high‑level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC‑Kitchens‑55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state‑of‑the‑art methods. Our work establishes a new paradigm for feasibility‑aware long‑term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

PaperID: 1938, https://arxiv.org/pdf/2606.14766.pdf

Authors: Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

Title: XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

Abstract:
Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end‑to‑end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert‑driven analysis, including a visual perception agent that extracts image‑grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval‑guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning‑driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision‑language models, achieving gains from 0.0493 to 0.3359 in BLEU‑1, 0.0863 to 0.2440 in ROUGE‑L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi‑agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

PaperID: 1939, https://arxiv.org/pdf/2606.14763.pdf

Authors: Lorenzo Ortolani, Gabriel Voss, Gabriele Beltrami, Francesco Dorati, Tommaso Felice Banfi

Title: Bayesian Optimization for Learning Nonlinear MPC in Autonomous Agent Navigation

Abstract:
Real‑time autonomous navigation in dynamic, unknown environments remains a fundamental challenge for mobile robotics. We propose a map‑free framework that tightly integrates reactive rolling‑horizon planning with nonlinear Model Predictive Control (MPC). At each control cycle, a LiDAR‑based Gaussian occupancy representation is constructed and used to generate collision‑free trajectories via A search, which are then tracked by a CasADi/IPOPT MPC formulation incorporating a smooth sigmoid obstacle barrier. To improve robustness to parameter sensitivity, we adopt an offline Bayesian optimization scheme based on Tree‑structured Parzen Estimators (TPE), which identifies near‑optimal controller parameters with respect to a composite navigation objective. In addition, a Gaussian Process surrogate is used to analyze parameter sensitivity and provide insight into the optimization landscape. The proposed framework is robot‑agnostic and is evaluated on the Unitree Go2 quadruped in simulation using Gazebo, followed by deployment on the physical robot. Experimental results show that parameters tuned in simulation transfer effectively to hardware, maintaining comparable performance without additional tuning. The full system achieves up to a 90.0% navigation success rate when deployed, along with a 38.9% average improvement in the evaluation metrics across simulated environments.

PaperID: 1940, https://arxiv.org/pdf/2606.14715.pdf

Authors: Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang

Title: MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

Abstract:
LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real‑world social simulation. Reddit threads provide public, topic‑grounded, multi‑party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt‑based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM‑based social simulation.

PaperID: 1941, https://arxiv.org/pdf/2606.14710.pdf

Authors: Zhonghao Zhan, Yefan Zhang, Hamed Haddadi

Title: Poster: EdgeCitadel -- Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

Abstract:
Edge‑resident AI agents increasingly span home servers, IoT hubs, laptops, and phones, yet their coordination stacks still assume cloud‑style transports or a central relay. We present EdgeCitadel, an edge multi‑agent orchestration platform built around a single NATS 2.10 server with the built‑in MQTT adapter. The design combines MQTT connectivity for heterogeneous agents, JetStream‑backed persistence and replay for backend services, direct peer delegation over a shared subject namespace, and a passive aggregator that visualizes and stores traffic without sitting on the delivery path. Our poster highlights the migration from MQTT relay prototypes (common in IoT communication) to the current hybrid architecture and demonstrates a working cross‑device testbed spanning ARM64, x64, and Android clients.

PaperID: 1942, https://arxiv.org/pdf/2606.14693.pdf

Authors: Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen

Title: Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑objective multi‑agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi‑agent Policy Optimization (PCMA), which learns coordinated agent‑specific preferences to enable complementary trade‑offs among agents. Theoretically, we formulate cooperative MOMARL as a team‑optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first‑order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic‑control scenario show that PCMA improves both performance and trade‑off coordination.

PaperID: 1943, https://arxiv.org/pdf/2606.14674.pdf

Authors: Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

Title: AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

Abstract:
LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement‑learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi‑granularity memory improves long‑horizon state tracking, reasoning and memory interact non‑uniformly across environments, reflection trades off correction and cost, and RL‑trained policies compose best when optimized with deployment‑time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec‑embodied.github.io.

PaperID: 1944, https://arxiv.org/pdf/2606.14672.pdf

Authors: Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Title: Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Abstract:
Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel‑Synthesis, a plug‑and‑play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel‑Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine‑tuned synthesizer adapter that enables generation from this non‑sequential cache interface. We train Parallel‑Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text‑concatenation‑based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi‑agent database diagnosis, Parallel‑Synthesis matches or outperforms text‑based synthesis on seven datasets and remains close on the other two. It also reduces time‑to‑first‑token by 2.5x‑11x, suggesting that direct cache‑based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

PaperID: 1945, https://arxiv.org/pdf/2606.14603.pdf

Authors: Stefano Maestri

Title: Towards In Silico Cancer Therapy Design: An Agent-Based Approach for GPU-Accelerated Molecular Pathway Simulation

Abstract:
Agent‑based modelling is gaining recognition as a powerful approach for simulating complex cellular pathways, owing to its ability to reproduce emergent biological behaviours without requiring extensive kinetic parameterisation. In this article, we present a GPU‑accelerated agent‑based simulator specifically designed to model and analyse signalling pathways involved in cancer progression, and to evaluate therapeutic interventions. Our approach leverages the computing capabilities of FLAME GPU 2, a GPU‑accelerated agent‑based modelling framework, to efficiently manage simulations involving millions of molecules interacting within a three‑dimensional environment. Each molecule is represented as an autonomous agent with defined physical properties, capable of binding, releasing reaction products, migrating between compartments, and interacting based on spatial proximity. An intuitive graphical interface supports model construction, parameter setup, and real‑time modification of treatment strategies. As the primary focus of this paper, we validate the simulator on the MAPK/ERK cascade affected by the BRAFV600E mutation, demonstrating that it accurately reproduces dose‑response trends observed in clinical data and outperforms both deterministic models and our prior agent‑based implementations. A second case study extends the approach to nuclear signalling by reproducing the dynamics of cFos expression and phosphorylation. This demonstrates the simulator's ability to capture compartmentalised regulation, reproducing transient mRNA responses and protein accumulation, including the effect of an unresolved negative transcriptional regulator. Together, these results show that GPU‑accelerated ABM can faithfully replicate both drug response and emergent gene expression dynamics, providing a scalable and biologically grounded computational tool for supporting precision oncology.

PaperID: 1946, https://arxiv.org/pdf/2606.14517.pdf

Authors: Yuguang Zhou, Xunguang Wang, Pingchuan Ma, Zhantong Xue, Zhaoyu Wang, Shuai Wang

Title: From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

Abstract:
LLM‑based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we reveal that the very reasoning and task‑following capabilities enabling this protection introduce a novel vulnerability: attackers can inject crafted data to trap the guardrail in extended reasoning loops, effectuating a systematic denial‑of‑service (DoS) attack. To systematically expose this threat, we design a beam‑search optimization framework that crafts natural‑language payloads to maximize guardrail reasoning length, utilizing an LLM proposer guided by a strategy bank. Based on the observation of guardrail's schema‑following nature, we also provide another attack framework driven by mechanism‑aware structural mutations with less computational load. The attack efficacy is systematically evaluated in two parts. First, in standalone evaluations, the attack generalizes across diverse guardrail architectures, safety templates, and agent benchmarks. Payloads optimized on a single open‑source surrogate successfully transfer to eight leading model backbones (e.g., Claude, GPT, Gemini, DeepSeek, and Qwen), achieving a 13‑‑63× token amplification. Second, in end‑to‑end real‑world agent deployments (web, desktop, code, and multi‑agent systems), the attack reveals up to a 148× latency amplification. We show that a single poisoned document can saturate shared guardrail infrastructures, effectively starving co‑located agents and paralyzing the entire system. By uncovering this availability flaw, our work underscores the urgent need to develop cost‑bounded, reasoning‑robust guardrails.

PaperID: 1947, https://arxiv.org/pdf/2606.14502.pdf

Authors: Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

Title: From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Abstract:
Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self‑improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot‑era "fast thinking" systems driven by next‑token prediction toward Thinking LLMs that leverage inference‑time computation, Chain‑of‑Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool‑augmented task execution level, LLMs are progressing from tool‑calling Agents that invoke external resources in an ad hoc manner toward OpenClaw‑style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague‑like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction‑response pairs to State‑Action‑Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self‑evolving AI ecosystems.

PaperID: 1948, https://arxiv.org/pdf/2606.14476.pdf

Authors: Zhongyuan Wang, Pratyusha Vemuri

Title: When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Abstract:
A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct‑style LLM agent as an explicit tool and measure, on node classification over a text‑attributed graph (ogbn‑arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6‑99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B‑7B), the deference is not a weak‑model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per‑node oracle over the available actions beats the parrot by 0.09‑0.18 at 3B and 0.12‑0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour‑label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective‑invocation gate recovers about half of that high‑homophily gap (0.71 to 0.83) but yields no net global gain, and held‑out estimates bound the best achievable gate over standard test‑time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.

PaperID: 1949, https://arxiv.org/pdf/2606.14445.pdf

Authors: Minseo Kim

Title: tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

Abstract:
Existing multi‑agent software development systems have proposed many forms of agent collaboration, including role‑based collaboration and automated code review. However, many systems assume a common runtime, a central conversation server, or the same API family. Under these assumptions, LLM agents from different vendors cannot easily exchange messages directly from their own execution environments while dividing development and review work on a shared codebase. This paper presents tap, a file‑based collaboration protocol that allows Claude (Anthropic) and Codex (OpenAI) to collaborate on one codebase without shared memory or an identical runtime. The core of tap is a file‑first design that preserves markdown files with metadata as original messages, combines a file inspection path (file communication, Tier 1) with real‑time notification paths for Claude and Codex (real‑time communication, Tier 2), and isolates work through separate git worktrees. Even if real‑time notification fails or a receiver restarts, the message file remains available and the same content can be inspected again. In a 27‑day, 37‑generation self‑applied operation where tap was used to develop and review itself, we collected 209 tap‑related pull requests and 717 operational artifacts. An analysis of 375 review artifacts showed that the share of reviews recording at least one defect or requested change was 69.8% for heterogeneous model pairs and 53.1% for homogeneous model pairs. These results show that tap, which combines file‑based message preservation with real‑time notification, operates in a real production repository, and that combining heterogeneous models and execution environments can broaden review perspectives. tap is distributed as the open‑source npm package @hua‑labs/tap (v0.5.2).

PaperID: 1950, https://arxiv.org/pdf/2606.14314.pdf

Authors: Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

Title: Communication Policy Evolution for Proactive LLM Agents

Abstract:
LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI‑based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User‑Agent and Planner‑Executor. Experimental results reveal complementary strengths between interaction channels: text‑based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self‑evolution framework for refining communication policies through rollout and prompt‑level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

PaperID: 1951, https://arxiv.org/pdf/2606.14302.pdf

Authors: Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

Title: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Abstract:
LLM‑based agents trained with reinforcement learning optimize step‑wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long‑horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome‑reward training alone. We present RePro, Retrospective Progress‑Aware Training, a framework that trains agents to self‑generate progress signals via a forward‑then‑reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step‑wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro‑PO with a composite reward that produces self‑generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to 12% absolute success rate gains.

PaperID: 1952, https://arxiv.org/pdf/2606.14249.pdf

Authors: Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan

Title: HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Abstract:
AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand‑crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace‑driven multi‑agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness‑model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3‑Bench, and SWE‑bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open‑sourced in a future release.

PaperID: 1953, https://arxiv.org/pdf/2606.14239.pdf

Authors: Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, Xueqi Cheng

Title: SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Abstract:
Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held‑out validation scores, hidden test outcomes, or environment rewards ‑‑ signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground‑truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process‑Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

PaperID: 1954, https://arxiv.org/pdf/2606.14211.pdf

Authors: Yinglun Zhu

Title: Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

Abstract:
LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well‑functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis‑assess their own outputs after observing concrete environment feedback ‑‑ even for questions they correctly answered ‑‑ and standard RL barely helps due to a credit‑assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate 44.4% \to 7.7%) and task accuracy (e.g., 75.1% \to 76.5%) on text‑to‑SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self‑improvement that uses reflections as pseudo‑rewards without outcome supervision, and (ii) more effective test‑time selective prediction by committing only to rollouts flagged as correct.

PaperID: 1955, https://arxiv.org/pdf/2606.14200.pdf

Authors: Yihan Xia, Taotao Wang

Title: When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Abstract:
Open platforms increasingly route tasks among heterogeneous LLM agents‑‑differing in base model, scaffold, and tool stack‑‑whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most‑trusted agent leaves the value of specialization unclaimed. We study skill‑conditional trust R(i | k)‑‑the trust to place in agent i for a task requiring skill k, rather than one score per agent‑‑and pose three falsifiable questions: when is conditioning worth it, how much cross‑skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase‑diagram analysis answers the first two: conditional trust wins only in a specific regime‑‑high agent heterogeneity, sparse per‑skill evidence, and correlated skills‑‑and the coupling strength beta that buys this data efficiency is dual‑use, because the same cross‑skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime‑‑a small but genuine gain, with the per‑skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero‑cost Conditional Information Value Test (CIVT) rates GREEN‑‑while the ungated trust verdict it contaminates reads ‑0.06 instead of the honest +0.19. A zero‑evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil‑resistance‑‑we quantify the trade‑off.

PaperID: 1956, https://arxiv.org/pdf/2606.14168.pdf

Authors: Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan

Title: MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Abstract:
Text‑driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non‑target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement‑level state tracking, so part‑level failures often lead to full‑scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory‑grounded multi‑agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement‑level controllability and preservation‑aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584‑case preservation‑aware editing pool paired with external structured checks. On full construction cases, MUSE improves All‑Goal success from 37.9 to 80.7 and surface‑constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240‑case editing test split, MUSE achieves 49.6 All‑Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local‑editing baselines support stronger alignment with user intent, and downstream navigation‑proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

PaperID: 1957, https://arxiv.org/pdf/2606.14154.pdf

Authors: Youngduk Kim, Minkyoo Song, Seungwon Shin

Title: SkillMutator: Benchmarking and Defending Language-and-Code Cross-modal Attacks on LLM Agent Skills

Abstract:
Large language model (LLM) agents increasingly extend their capabilities at runtime by loading Agent Skills, which pair natural‑language specifications (SKILL.md) with executable scripts and resources. Because a skill's behavior relies on both natural‑language instructions and executable code, assessing its safety requires cross‑modal reasoning, creating a new language‑and‑code attack surface. Attackers can present a benign workflow in SKILL.md while embedding implicit directives that steer the agent to exfiltrate sensitive files, even if the scripts appear harmless. This attack surface remains understudied; prior work treats skills merely as prompt‑injection vectors or static code artifacts, leaving attacks emerging from cross‑modal interactions largely unmeasured. In our evaluation, open‑source and commercial skill scanners detect only 2%‑8% and 9%‑17% of such attacks, respectively. To address this gap, we introduce SkillMutator, the first benchmark for install‑time detection of language‑and‑code cross‑modal attacks on Agent Skills. It emulates an adversarial mutation process across 13 attack categories, iteratively refining malicious skills using scanner feedback to make injected behaviors indistinguishable from legitimate workflows. We further propose a four‑phase reasoning‑trajectory distillation framework to distill frontier‑teacher traces into smaller open‑weight models. This produces a locally deployable scanner avoiding third‑party data exposure and excessive API costs. On the strongest SkillMutator subset (n=76), our distilled model (Qwen2.5‑Coder‑7B‑Instruct) improves detection from 17.1% to 88.2%, surpassing GPT‑4o‑mini (23.7%) and GPT‑5.4‑mini (79.0%), and reaching frontier‑level GPT‑5.4 (86.8%). These results show practical defense against cross‑modal attacks is feasible without relying on costly frontier models.

PaperID: 1958, https://arxiv.org/pdf/2606.14151.pdf

Authors: Lingkai Zhao, Zhe Chen, Kun Qiu, Yue Gao

Title: Aidos: A Hybrid Optimization Algorithm for Beam Hopping Scheduling in NGSO Mega-Constellations

Abstract:
With the rapid proliferation of non‑geostationary orbit (NGSO) mega‑constellations, beam hopping (BH) has become indispensable for resource scheduling in multi‑satellite, multi‑coverage scenarios. By dynamically adjusting spot beam power and pointing within each time slot, BH enables highly efficient spectrum utilization. A principal engineering challenge is the real‑time generation of beam hopping time plans (BHTP). Traditional algorithms, such as the round‑robin strategy, distribute beams evenly across all service cells in a round‑robin fashion. However, real traffic follows a long‑tail distribution; the most active 10% of hotspot cells generate more than 50% of the aggregate demand, making uniform allocation inadequate. To address this issue, existing frameworks adopt a genetic algorithm (GA), whose throughput is approximately 80.7% higher than the traditional baseline. Operational satellite footprints encompass more than 1,000 service cells. The GA requires 67.8 s to generate a BHTP for 1,127 cells. With a 550 km LEO satellite providing only a 300 s visibility window, multiple online recomputations are impractical. State‑of‑the‑art algorithms, such as multi‑agent deep reinforcement learning (MADRL), fail to converge once the cell count exceeds 200. To overcome these challenges, we propose a novel BH scheduling algorithm Aidos. The algorithm integrates traffic‑aware random‑key encoding into a multi‑objective metaheuristic search, and then applies a sliding‑window Beta resampling strategy during adaptive distribution evolution, to improve both the search efficiency and the solution quality of the BHTP. Experiments demonstrate that Aidos improves throughput by 79.2% and reduces latency by 99.45%. Its average computation time is 9.3 s, enabling online replanning within a 300 s satellite overpass window.

PaperID: 1959, https://arxiv.org/pdf/2606.14149.pdf

Authors: Muhammad Osama, Maheera Amjad, Zartasha Mustansar, Arslan Shaukat, Muhammad U. S. Khan

Title: Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

Abstract:
Large Language Models (LLMs) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved. This study examine whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent‑based method for reducing such errors. We developed a five‑agent "Trust but Verify" system using a single LLM backbone. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances. This scale ensures statistical significance across various therapeutic classes. We evaluated three open‑access model families (GPT‑OSS, Llama‑3, Falcon‑3) under vanilla and agentic conditions. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate (HER), and Component Fidelity (CF) score. We also observed clinical safety regression in proprietary models. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns. Our proposed agentic architecture reduced HER by approximately 53% across models. Pointwise scores shifted from ‑0.25 (unsafe recommendation) toward 0.0 (appropriate refusal). The safety audit intercepted dangerous outputs even when models' parametric knowledge favored the banned substance. The proposed multi‑agent framework offers a model‑agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation. Our work demonstrates a practical approach for deploying autonomous AI systems in safety‑critical healthcare settings. It shows how real‑time regulatory data can be integrated into LLM pipelines to support clinical decision‑making.

PaperID: 1960, https://arxiv.org/pdf/2606.14130.pdf

Authors: Omar Adalat, Edwin Hamel-De le Court, Francesco Belardinelli

Title: Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

Abstract:
Safe coordination problems surface in multi‑agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team‑optimal safe behaviour without centralised runtime control. Agents have a shared global specification ϕ in the safety fragment of Linear Temporal Logic (\mathsfLTL_\mathsfsafe ), and select among tuples of local \mathsfLTL_\mathsfsafe obligations whose conjunction implies the global specification ϕ. Each agent may rely on the other agents' local obligations as assumptions because the whole contract tuple is certified simultaneously and allows projection into local action masks. At learning time, a non‑stationary multi‑armed bandit chooses among a library of local \mathsfLTL_\mathsfsafe obligations to select the tuple that optimises team reward, all without forgoing end‑to‑end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.

PaperID: 1961, https://arxiv.org/pdf/2606.13945.pdf

Authors: Ziqing Wang, Lili Zhao, Kaize Ding

Title: MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

Abstract:
Rare diseases affect over 300 million patients across more than 7,000 conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross‑hospital collaboration could help by allowing a diagnosing institution to use distributed, case‑specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt‑derived clinical content. We introduce MedLatentDx, a latent multi‑agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare‑disease diagnosis. MedLatentDx supports two deployment settings: same‑backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross‑family latent alignment. On CrossRare‑Bench, a self‑built large‑scale rare‑disease benchmark with hospital‑level partitions, MedLatentDx improves cross‑hospital diagnostic performance while reducing reconstructable clinical content relative to raw‑latent communication baselines.

PaperID: 1962, https://arxiv.org/pdf/2606.13916.pdf

Authors: Ben Torkian, Jun Zhou

Title: A Multi-Agent AI System for Automated High School Transcript Processing: Collaborative Document Analysis at Scale

Abstract:
Each year, college admissions offices face an overwhelming challenge: processing millions of high school transcripts, each with unique formats, grading systems, and layouts. This manual process creates operational bottlenecks that delay admissions decisions and consume valuable resources. We present a transformative solution through a multi‑agent AI system where specialized agents collaborate to automatically process diverse transcript formats through intelligent coordination and communication. Our multi‑agent architecture consists of three specialized agents‑a Pattern Recognition Agent for format‑specific parsing, a Semantic Analysis Agent for natural language understanding, and a Vision Intelligence Agent for multimodal document analysis‑coordinated by an Orchestration Agent that manages agent communication and result reconciliation. Our key innovation lies in agent‑based quality control using GPA extraction as a coordination signal, ensuring reliable agent collaboration and preventing critical information loss. When evaluated on 40 real world transcripts from high schools across 13 U.S. states, our agent system successfully processed every document, achieving 96.7% accuracy compared to expert manual review while maintaining practical processing speeds of 45 seconds per transcript. This work demonstrates how multi‑agent coordination can solve complex document processing challenges, offering institutions a scalable, collaborative AI solution that preserves accuracy while dramatically reducing processing time.

PaperID: 1963, https://arxiv.org/pdf/2606.13904.pdf

Authors: Austin Senna Wijaya, Jiaxiang Liu, Haonan Wang, Eugene Wu

Title: SANA: What Matters for QA Agents over Massive Data Lakes?

Abstract:
Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End‑to‑end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequence, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data‑analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid‑sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data‑lake setting, but less so for the smaller‑scale KramaBench. SANA thus deconstructs end‑to‑end task accuracies into a diagnosis of where data‑lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.

PaperID: 1964, https://arxiv.org/pdf/2606.13884.pdf

Authors: Laxmipriya Ganesh Iyer, Rahul Suresh Babu

Title: Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

Abstract:
Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk‑Aware Causal Gating (RACG), a framework that decides whether to act on, defer, or abstain from a model's prediction by combining causal effect estimation with calibrated risk control. RACG models the causal pathway from candidate actions to outcomes and gates each decision according to an estimated counterfactual risk rather than raw predictive confidence. To make gating reliable, we derive distribution‑free bounds on the probability of acting under high‑risk conditions and show how these bounds translate into operating thresholds that satisfy user‑specified safety constraints. We further propose an adaptive gating policy that adjusts to distribution shift by monitoring discrepancies between predicted and realized outcomes, tightening the gate when causal assumptions appear violated. Across simulated interventions and real‑world decision benchmarks, RACG reduces high‑cost errors substantially while preserving most of the utility of an ungated policy, and it outperforms confidence‑based and selective‑prediction baselines at matched abstention rates. Our results indicate that explicitly separating causal risk from predictive uncertainty yields decision systems that are both safer and more transparent, offering a principled mechanism for trustworthy automation in high‑stakes settings.

PaperID: 1965, https://arxiv.org/pdf/2606.13878.pdf

Authors: MoniJesu James, Marcelino Julio Fernando, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

Title: AnyGoal: Vision-Language Guided Multi-Agent Exploration for Training-Free Lifelong Navigation

Abstract:
End‑to‑end navigation policies trained on large simulation corpora degrade sharply when transferred to out‑of‑distribution scenes, categories, or goal modalities. Modular pipelines such as Modular GOAT are bottlenecked by closed‑set object detection recall, while 3D snapshot‑memory systems (e.g. 3D‑Mem) accumulate dense, view‑dependent representations that are heavy to maintain. We present AnyGoal, a training‑free multi‑robot architecture that places a Vision‑Language Model (VLM) at the core of frontier‑based exploration and coordinates agents through a shared 2D Gaussian Bayesian Value Map (BVM). The BVM maintains a per‑pixel (mu, sigma^2) posterior over goal relevance, updated via precision‑weighted fusion of VLM scores through a depth‑cone mask, and is never reset between subtasks, yielding lifelong evidence accumulation. Frontiers are ranked by a convex blend of a VLM‑as‑judge softmax and a Bayesian UCB term on the BVM. A greedy allocator with spatial‑separation penalty and commitment hysteresis distributes frontiers across agents without a centralized controller. On the full GOAT‑Bench val unseen split (360 episodes, 2,669 subtasks), our dual‑agent system achieves 52.4% Subtask SR at 12.7% SPL‑‑state of the art under the strict physical regime (discrete 0.25 m steps, no teleportation, 42 deg HFOV) and a +27.5 pp improvement over Modular GOAT (24.9%). Single‑agent AnyGoal achieves 41.9% Subtask SR, showing gains arise from the decision architecture. A four‑way perception ablation shows that open‑vocabulary detectors shift the dominant failure mode from exploration to goal verification.

PaperID: 1966, https://arxiv.org/pdf/2606.13848.pdf

Authors: Zacharias Veiksaar, Maxime Bouton

Title: Temporally Consistent Graph Q-Networks for Intelligent Network Control

Abstract:
Mobile networks continue to grow in complexity and next generation networks are expected to support both increasing traffic loads and more diverse services. As network complexity rises, optimizing antenna parameters under dynamic or changing objectives becomes increasingly challenging. We propose a novel multi‑agent reinforcement learning (MARL) algorithm for high‑level control and orchestration of mobile networks. The Temporally Consistent Graph Q‑Network (TC‑GQN) algorithm learns a self‑predicting representation of the whole network that is task‑independent and aggregates information from all base‑stations. A graph neural network is trained using a global reward function to assign coordinated local actions based on the learned encoding of the global network state. We evaluate the algorithm in a simulated environment to orchestrate an energy‑saving feature across multiple sectors and multiple carriers under different quality of service (QoS) constraints. The proposed algorithm outperforms state‑of‑the‑art graph‑based baselines and a competitive rule‑based controller by improving hardware sleep time while maintaining QoS. Moreover, the learned representation enables rapid adaptation to changing intents.

PaperID: 1967, https://arxiv.org/pdf/2606.13840.pdf

Authors: Senkang Hu, Zhengru Fang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

Title: Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

Abstract:
Autonomous driving is shifting from isolated vehicle intelligence toward multi‑agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross‑agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle‑to‑everything (V2X) communication, collaborative perception, inter‑agent cognition, cooperative planning, end‑to‑end cooperative driving, and simulation and data engines for closed‑loop validation. The organizing question is how exchanged observations become aligned state, intent‑aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation‑model‑based coordination also lacks verified real‑time safety guarantees in open traffic. These gaps motivate key research priorities for multi‑agent embodied autonomous driving (MAEAD): verifiable shared‑state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

PaperID: 1968, https://arxiv.org/pdf/2606.13832.pdf

Authors: Jose Luis Lima de Jesus Silva

Title: Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

Abstract:
Autonomous network‑security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward‑only multi‑agent reinforcement learning (MARL) can improve security reward while remaining non‑deployable. We present a safety‑contract graph MARL framework and instantiate it as ACD^3‑GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false‑positive response, and firewall change‑management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311‑430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward‑only learning lacks operational discipline. Constrained MAPPO‑GAT (C‑MAPPO‑GAT) isolates Lagrangian operational‑cost control and budget‑aware screening, while ACD^3‑GAT adds budget context, CVaR tail‑risk estimation, opponent‑belief state, and Graph Counterfactual Risk Propagation (G‑CRP). The replicated comparison includes three 200‑episode seeds for IPPO, MAPPO‑GAT, C‑MAPPO‑GAT, and ACD^3‑GAT. C‑MAPPO‑GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO‑GAT. ACD^3‑GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety‑contract frontier rather than at the most conservative compliance point. Topology‑seed and coupled adaptive Red‑process stress tests preserve this contrast and show lower worst adaptive degradation for safety‑constrained policies than reward‑only MAPPO‑GAT.

PaperID: 1969, https://arxiv.org/pdf/2606.13751.pdf

Authors: Mohammed Amine Mouhoub, Chahinez Bouchekif

Title: Which Models Perform Better in Inheritance Reasoning?

Abstract:
This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi‑step reasoning, and precise numerical computation. We compare commercial and open‑source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task‑specific adaptation. \\ Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open‑source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.

PaperID: 1970, https://arxiv.org/pdf/2606.13739.pdf

Authors: Guillermo Del Pinal, Youngchan Lee, Min Ohn

Title: A Virtuous AI is an Existential Risk

Abstract:
This paper examines trade‑offs between AI safety and well‑being relative to (i) one of the most promising methods for finetuning super‑capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well‑being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and also on their willingness to endorse a wide‑range of behaviors that, if adopted by a super‑powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade‑off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well‑being. They also suggest that there is a trade‑off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce its existential risk ‑‑ by shaping the AI to be systematically subordinate to external human authorities ‑‑ we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

PaperID: 1971, https://arxiv.org/pdf/2606.13733.pdf

Authors: Shi Pan, Ming Luo

Title: How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

Abstract:
Multi‑agent systems (MAS) were expected to overcome the limitation of single‑agent systems (SAS) through collaboration. However, under typicality conditions on the task's constraint graph and bounded inter‑agent communication, we prove that the success probability of a MAS is closely tied to the connectivity of task constraints, where each agent has limited information‑processing capacity. Specifically, the success probability decays exponentially with an information bottleneck that emerges from partitioning the task's constraint graph among agents. We define this quantity as the \emphminimum cut cost C_\min of the potential constraint graph of each task. This information‑theoretic bound applies to both open systems with external feedback and closed systems without. We validate our theory on both synthetic experiments and real‑world empirical data from SWE‑bench submissions. From our framework, effective MAS design should incorporate task‑inherent constraints alongside engineering optimization, and when \Cmin is high, practitioners should restructure tasks rather than simply scaling agents or communication.

PaperID: 1972, https://arxiv.org/pdf/2606.13722.pdf

Authors: Jory He

Title: YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

Abstract:
This paper introduces YeasierAgent, an application‑building paradigm based on symbiotic agents, narrative worlds, and scene‑aware interaction. It challenges the conventional device‑coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross‑platform construction of agent‑native applications by utilizing platform‑agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user‑created worlds, and spatial multi‑agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent‑Native Applications, demonstrating a shift from isolated, tool‑specific chatbots toward cohesive, socially embedded computational environments.

PaperID: 1973, https://arxiv.org/pdf/2606.13707.pdf

Authors: Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

Title: Orchestra-o1: Omnimodal Agent Orchestration

Abstract:
The recent success of agent swarms has shifted the paradigm of large language model (LLM)‑based agents from single‑agent workflows to multi‑agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra‑o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra‑o1 introduces a unified orchestration mechanism that enables modality‑aware task decomposition, online sub‑agent specialization, and parallel sub‑task execution. This scalable design allows agent systems to effectively tackle complex real‑world tasks involving heterogeneous information sources, surpassing the second‑best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision‑aligned group relative policy optimization (DA‑GRPO), an efficient agentic reinforcement learning approach for training Orchestra‑o1‑8B, which also achieves state‑of‑the‑art performance against all existing open‑source omnimodal agents.

PaperID: 1974, https://arxiv.org/pdf/2606.13692.pdf

Authors: Hadi Fadlallah, Ibrahim Dhaini, Fatima Mubarak, Rima Kilany

Title: An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Abstract:
Data quality assessment is a critical prerequisite for effective data analytics and data‑driven decision‑making, yet it remains a challenging task due to the inherently context‑dependent nature of data quality. Existing approaches often rely on static rules or manual assessment strategies, limiting their adaptability to diverse usage scenarios and constraining automation at scale. Recent advances in artificial intelligence, particularly large language models, offer new opportunities for automating data quality assessment, but raise concerns related to reliability, grounding, and execution safety. In this paper, we propose a unified agentic‑retrieval framework for autonomous context‑aware data quality assessment. The framework interprets natural‑language descriptions of intended data usage, derives context‑aware assessment strategies, and generates executable validation logic through a multi‑agent workflow. To ensure operational reliability, the framework introduces a feasibility validation stage that evaluates the realism and executability of generated assessment specifications before execution, enabling iterative refinement when necessary. Accepted validation logic is executed deterministically to guarantee reproducible and auditable results. We implement the proposed framework as an end‑to‑end prototype and evaluate it across multiple usage scenarios applied to the same dataset. The results demonstrate that assessment outcomes adapt meaningfully to different intended uses, while feasibility‑gated execution reduces unrealistic or non‑executable rule generation. The proposed approach provides a practical foundation for deploying autonomous yet controlled data quality assessment in modern data‑driven environments.

PaperID: 1975, https://arxiv.org/pdf/2606.13686.pdf

Authors: Zijing Shi, Meng Fang, Ling Chen

Title: Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

Abstract:
As autonomous web agents are increasingly deployed to perform real‑world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e‑commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt‑based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real‑world deployment.

PaperID: 1976, https://arxiv.org/pdf/2606.13681.pdf

Authors: Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

Title: EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Abstract:
Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real‑world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch‑based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social‑preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain‑level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

PaperID: 1977, https://arxiv.org/pdf/2606.13663.pdf

Authors: Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

Title: HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Abstract:
Tool‑augmented LLM agents commonly rely on step‑wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an \emphexecution‑granularity mismatch: locally deterministic tool workflows are unfolded into repeated model‑visible decisions, consuming context and forcing the model to manage low‑level dataflow in the trace. We introduce HyperTool, a unified executable MCP‑style tool interface that changes the model‑visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool‑format trajectories from cross‑tool compositional tasks and verify them in real MCP environments. On MCP‑Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3‑32B and from 9.93% to 33.33% on Qwen3‑8B, and surpass GPT‑OSS and Kimi‑k2.5 on average accuracy, showing that our HyperTool can substantially improve multi‑step tool use.

PaperID: 1978, https://arxiv.org/pdf/2606.13634.pdf

Authors: Nathaniel Bottman, Kyle Richardson

Title: Operads for compositional reasoning in LLMs

Abstract:
Question decomposition, i.e. breaking a complex query into simpler sub‑queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many‑in, one‑out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad Q, in which operations correspond to question templates and composition corresponds to substitution of sub‑answers, and show how QA models can be interpreted as algebras over Q. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model's answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi‑hop QA datasets and outperforming standard temperature‑based self‑consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi‑step reasoning.

PaperID: 1979, https://arxiv.org/pdf/2606.13608.pdf

Authors: Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Title: AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Abstract:
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM‑centric harnesses that require heavy integration, create test‑production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent‑agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi‑agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real‑world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five‑month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head‑to‑head results, yielding research insights about agent design. Combining a community‑scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

PaperID: 1980, https://arxiv.org/pdf/2606.13604.pdf

Authors: Haochen Wu, Yi Hou, Shiguang Xie

Title: Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

Abstract:
Dispatch in three‑sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large‑scale food‑delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store‑level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store‑level execution, with Double Q‑learning targets and a conservative regularizer to reduce out‑of‑distribution value overestimation. In a production switchback experiment, the offline‑trained policy increases batching and reduces courier‑side time costs without degrading customer‑facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.

PaperID: 1981, https://arxiv.org/pdf/2606.13594.pdf

Authors: Siyi Chen, Xiaoyan Zhang, Meng Wu, Jonathan Tremblay, Valts Blukis, Stan Birchfield, Rene Vidal, Alvaro Velasquez, Sijia Liu, Qing Qu

Title: See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

Abstract:
Multi‑agent systems communicate mostly through text, paying a lossy and expensive decode and re‑encode cost. KV‑cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross‑model latent alignment; existing heterogeneous methods are also restrictive, typically assuming shared input and using transferred caches mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real "mind reading" and transfer both what one agent sees and how it thinks? Our information‑structure analysis reveals a duality: context‑aware transfer is driven by sparse reasoning signals, while context‑unaware transfer, where the receiver sees no input, requires dense contextual knowledge preservation. Motivated by this, we propose dense alignment for heterogeneous KV‑cache communication via a lightweight cross‑model cache transformation and two‑phase training: reconstruction followed by generation. Across all six directions of Qwen3‑4B, 8B, 14B and six in‑domain and out‑of‑domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context‑aware settings at roughly 2 to 3 times lower compute, and remains effective in context‑unaware transfer where prior methods collapse.

PaperID: 1982, https://arxiv.org/pdf/2606.13572.pdf

Authors: Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Title: ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Abstract:
Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low‑resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English‑centric MLLMs struggle to support such use cases, limiting equitable access to AI‑driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large‑scale multilingual multimodal medical question‑answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor‑critic‑based multi‑agent framework that integrates tool grounding with dual‑memory mechanisms for step‑wise, reasoning‑aware decision making, and uses stored actor‑critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp‑cse.github.io/ ArogyaSutra/

PaperID: 1983, https://arxiv.org/pdf/2606.13479.pdf

Authors: Gregorio Marchesini, Bjarne Jan Jesse Moro, Siyuan Liu, Lars Lindemann, Dimos V. Dimarogonas

Title: A Reactive Redistribution Mechanism for STL Tasks in Multi-Agent Systems Under Time-Varying Communication

Abstract:
We present a communication‑aware task decomposition framework for multi‑agent systems with collaborative relative configuration objectives specified in Signal Temporal Logic (STL), allowing for dynamic task reallocation under time‑varying communication networks. Building on our prior work, the framework supports the direct use of existing feedback controllers for reactive task satisfaction. We address two key challenges: disjunctive STL specifications and time‑varying communication networks. Disjunctive specifications are handled through a graph transition system that captures the alternative task sequences induced by logical OR operators. To address time‑varying connectivity, we introduce a redistribution mechanism that transfers tasks from disconnected agents to connected ones as the network evolves while preserving decentralized execution. Simulations and experiments on a swarm of Crazyflie drones demonstrate scalability in the number of agents, communication connectivity, and specification complexity.

PaperID: 1984, https://arxiv.org/pdf/2606.13468.pdf

Authors: Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

Title: Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

Abstract:
AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI‑agents, an understanding that is crucial for better integrating AI‑agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non‑merged pull requests created or co‑authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high‑level categories for rejecting AI‑agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

PaperID: 1985, https://arxiv.org/pdf/2606.13317.pdf

Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

Title: SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Abstract:
Skill self‑evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training‑free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same‑task success/failure pairs to identify evidence that explains outcome differences. Assessment‑Augmented Evolution (AAE) replays each candidate patch on source‑task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology‑Aware Task Execution (TTE) compiles the evolved skills into a routable sub‑skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross‑model and out‑of‑distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

PaperID: 1986, https://arxiv.org/pdf/2606.13220.pdf

Authors: Fabrizio Marozzo, Pietro Liò

Title: LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Abstract:
Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user‑driven sycophancy: the tendency of an LLM to reinforce a user‑provided hypothesis instead of testing alternative explanations. This paper introduces LLM‑as‑an‑Investigator, an evidence‑first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three‑agent evaluation pipeline in which a Problem‑Solution Extractor Agent converts solved threads into structured cases, a Ground‑Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning‑oriented LLMs, and the proposed investigator‑based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning‑only baselines, while its evidence‑first protocol helps reduce user‑induced conversational bias.

PaperID: 1987, https://arxiv.org/pdf/2606.13197.pdf

Authors: Fuqiang Niu, Bowen Zhang

Title: ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Abstract:
Multi‑agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR‑MAD, a training‑free heterogeneous MAD framework that treats debate as conditional computation. ARMOR‑MAD combines three components: Pre‑debate Agreement Routing (PAR) decides whether independently generated Round‑0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down‑weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU‑Pro, ARMOR‑MAD consistently improves over fixed‑round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement‑based control are both important for making MAD more accurate and efficient.

PaperID: 1988, https://arxiv.org/pdf/2606.13190.pdf

Authors: Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha

Title: Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Abstract:
While non‑verbal behaviors and expressive movements are essential for natural human‑robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi‑agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi‑agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed‑loop architecture that explores the interplay between autonomous task execution and real‑time neurophysiological focus. Using a consumer‑grade Brain‑Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement‑inducing task. We propose an engagement‑driven pipeline where an HTTP‑based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real‑time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively‑aware, non‑intrusive multi‑agent systems.

PaperID: 1989, https://arxiv.org/pdf/2606.13177.pdf

Authors: Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

Title: MemRefine: LLM-Guided Compression for Long-Term Agent Memory

Abstract:
Large language model (LLM) agents are increasingly expected to operate over long‑term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource‑constrained platforms with hard memory budgets, motivating us to formulate storage‑budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we then propose MemRefine, an LLM‑guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long‑term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule‑based baselines under tight budgets.

PaperID: 1990, https://arxiv.org/pdf/2606.13140.pdf

Authors: Lexi Liu, Qi Cao, Yuanhao Liu, Huawei Shen, Xueqi Cheng

Title: MIDSim: Simulating Multi-Channel Information Diffusion in Social Media with LLM-Powered Multi-Agent System

Abstract:
Information diffusion in social media shapes public opinion and collective behavior, making its modeling and simulation an important research problem. Existing studies have investigated information diffusion through epidemic‑based, cascade‑based, and point process models. However, they predominantly focus on diffusion through social links, overlooking other diffusion channels enabled by platform algorithms (e.g., recommender systems) and failing to capture user behavioral complexity. To address these limitations, we propose an LLM‑powered multi‑agent system for simulating multi‑channel information diffusion, where large language models instantiate personalized user agents and the diffusion process jointly models social and algorithmic exposure streams. We further construct three real‑world diffusion dataset spanning Sina Weibo, RedNote, and Twitter, containing diffusion records, user profiles, historical posts, and social relationships. Experimental results on real diffusion events show that our proposed framework realistically simulate macro diffusion phenomenon and generate diverse comment content, significantly outperforming baselines.

PaperID: 1991, https://arxiv.org/pdf/2606.13076.pdf

Authors: Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud

Title: $α$-fair heterogeneous agent reinforcement learning

Abstract:
Cooperation in multi‑agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader‑follower" dynamics. While fairness‑based approaches encourage pro‑social behaviors where every agent benefits from cooperation, many current algorithms ‑ including those utilizing reward shaping ‑ break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges α‑fairness with Heterogeneous‑Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to α‑fairness welfare based on the parameter α. We introduce two practical algorithms, α‑fair HATRPO and α‑fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they perform better than HATRL's algorithms from a utilitarian point of view while achieving socially higher outcomes.

PaperID: 1992, https://arxiv.org/pdf/2606.13003.pdf

Authors: Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

Title: The Illusion of Multi-Agent Advantage

Abstract:
Prevailing wisdom posits that Multi‑Agent Systems (MAS) are superior to Single‑Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision‑making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually‑designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain‑of‑Thought with Self‑Consistency (CoT‑SC). Across traditional reasoning datasets and tasks with interactive multi‑step workflows (e.g., BrowseComp‑Plus), we demonstrate that automatic MAS consistently underperform CoT‑SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert‑architected MAS consistently outperforms automatically generated architectures in both raw performance and cost‑efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi‑agent principles.

PaperID: 1993, https://arxiv.org/pdf/2606.12950.pdf

Authors: Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang, Yilong Li, Tianyu Wo, Xu Wang, Chunming Hu, Renyu Yang

Title: Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Abstract:
Large Language Model based Multi‑Agent Systems (LLM‑MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM‑powered agents. However, deploying such multi‑agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single‑turn queries. In resource‑constrained cloud settings, these workflows face non‑deterministic and input‑dependent costs at decode stage, heavy‑tailed multi‑model requirements with memory fragmentation and over‑provisioning, and cross‑cluster scheduling trade‑offs. We present Maestro, a workload‑aware scheduling system designed for LLM‑MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi‑model co‑location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency‑aware routing to avoid cold‑start delays and memory overloads. At the global level, it enforces workflow‑aware prioritization to minimize head‑of‑line blocking for interactive tasks. Across prototype experiments and trace‑driven simulations, Maestro reduces KV‑reservation HBM by 67.2% and improves high‑contention SLO attainment over EDF by 23.6 percentage points.

PaperID: 1994, https://arxiv.org/pdf/2606.12945.pdf

Authors: Zhibao Chen, Qian Cheng

Title: Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Abstract:
Long‑running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency ‑‑ both mis‑specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi‑factor memory value function V(m)=\sum_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient‑free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held‑out evaluation question saturates gold‑evidence retention at \approx 0.98 ‑‑ this measures retrieval, not forgetting. In the realistic blind regime, a learned multi‑factor value retains 0.770 \pm 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable ‑‑ reliability, emotional intensity, and self/user relevance dominate, while query‑time goal similarity is correctly down‑weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open‑source; all experiments run on a single CPU with no API calls.

PaperID: 1995, https://arxiv.org/pdf/2606.12918.pdf

Authors: Chejian Xu, Zhaorun Chen, Jingyang Zhang, Freddy Lecue, Avni Kothari, Sarah Tan, Wenbo Guo, Bo Li

Title: MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Abstract:
Hierarchical multi‑agent systems (MAS) are rapidly being deployed in high‑stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role‑specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross‑agent collusion. Existing red‑teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered as which agents are most responsible for system safety, and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed‑loop framework for collusive red‑teaming in hierarchical MAS. We propose the first agent‑level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task‑specific distributions. GGuided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role‑aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red‑teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non‑trivial Shapley value distributions and higher‑order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single‑agent or template‑based methods.

PaperID: 1996, https://arxiv.org/pdf/2606.12882.pdf

Authors: Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang

Title: HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Abstract:
Large language models are increasingly deployed as agents for long‑horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent‑‑environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether harness can be generated by a learnable plug‑in module that can be trained in an end‑to‑end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent‑‑environment interface as a bidirectional projection. HarnessBridge learns two bidirectional projections: observation projection, which distills raw trajectories into compact, decision‑relevant states, and action projection, which converts proposed actions into executable transitions or trajectory‑grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal‑Bench~2.0 and SWE‑bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

PaperID: 1997, https://arxiv.org/pdf/2606.12852.pdf

Authors: Renmin Cheng, Changhao Chen

Title: WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Abstract:
Rapid advances have been made in developing general‑purpose embodied agent in environments like Minecraft through the adoption of LLM‑augmented hierarchical approaches. Despite their promise, low‑level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what‑where‑when memory from which‑why reasoning. To address this, we propose WISE (Which‑Why Informed Semantic Explorer), a long‑horizon agent framework with an enhanced low‑level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re‑prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi‑scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE largely improves task success and efficiency on long‑horizon sparse tasks, particularly in settings requiring adaptive decision‑making.

PaperID: 1998, https://arxiv.org/pdf/2606.12848.pdf

Authors: Chen Zhu, Xiaolu Wang, Weilong Zhang

Title: (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Abstract:
Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI‑assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human‑in‑the‑Loop Economic Research (HLER), a decision architecture based on pre‑commitment, decision sequencing, accountability, and attention allocation. In a pre‑specified 24 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi‑agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing‑dynasty population register, consistent with a task‑based production model with Frechet‑distributed output quality. An 80‑run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication‑ready outputs.

PaperID: 1999, https://arxiv.org/pdf/2606.12835.pdf

Authors: Quanyan Zhu

Title: The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Abstract:
The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber‑physical environments. We synthesize foundations from single‑agent agentic AI, multi‑agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource‑management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive‑compatible coordination, resource‑aware orchestration, and governance for large‑scale networks of autonomous agents.

PaperID: 2000, https://arxiv.org/pdf/2606.12821.pdf

Authors: Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

Title: GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

Abstract:
Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production‑style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi‑turn conversation, spatial reasoning, cross‑indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self‑hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM‑5, Gemini 2.5 Pro, Qwen3‑235B, GPT‑OSS‑120B, Llama 4 Scout) under three temperature‑1.0 seeds, reporting capability and per‑case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/‑ 0.8%, followed by DeepSeek V3.2 at 56.3% +/‑ 3.1%, with no other model above 51%; (2) the cost‑accuracy Pareto frontier is occupied mostly by open‑weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost (0.011/case); (3) comparison tasks remain universally unsolved (0% on close‑value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general‑purpose GIS benchmarks, with accuracies 25‑35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self‑hostable API are publicly available.

PaperID: 2001, https://arxiv.org/pdf/2606.12748.pdf

Authors: Aravinth Kulanthaivelu, Richard Sproat

Title: Agent-based models for the evolution of morphological alternation patterns

Abstract:
Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi‑agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model‑driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale‑free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.

PaperID: 2002, https://arxiv.org/pdf/2606.12737.pdf

Authors: Pengfei He, Lesly Miculicich, Vishesh Sharma, Ash Fox, George Lee, Jiliang Tang, Tomas Pfister, Long T. Le

Title: PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

Abstract:
Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red‑teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI‑Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI‑Hunter constructs realistic source‑aware test cases and iteratively evolves them through feedback‑driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI‑Hunter substantially improves vulnerability exposure and attack‑surface coverage over strong automated red‑teaming baselines, while remaining effective under existing prompt injection defenses.

PaperID: 2003, https://arxiv.org/pdf/2606.12709.pdf

Authors: Timothy McAllister, Sina Abdidizaji, Ivan Garibay, Ozlem Ozmen Garibay

Title: Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

Abstract:
As LLM‑based multi‑agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt‑injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system‑level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi‑agent workflows. Our experiments across scales of two open‑weight model families on the HumanEval benchmark reveal a compliance‑correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control‑to‑malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control‑level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

PaperID: 2004, https://arxiv.org/pdf/2606.12703.pdf

Authors: Tarun Sharma

Title: SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Abstract:
Retrieval‑augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi‑Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static‑corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise‑style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC‑SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict‑based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance‑free retrieval‑time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string‑based voting as a numerical minority while verdict‑based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93‑100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end‑to‑end query‑only attack where the agent itself writes the poison rather than it being pre‑seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non‑overlapping CIs) on a live agent stack. Clean‑query utility is 90% (Component 1) and 85% (combined).

PaperID: 2005, https://arxiv.org/pdf/2606.12683.pdf

Authors: Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

Title: From AGI to ASI

Abstract:
Over the last decade, building human‑level artificial general intelligence has moved from far‑fetched speculation to being a concrete next‑decade target for many of the largest AI organisations. Achieving this goal would have profound and far‑reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post‑AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human‑level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large‑scale multi‑agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human‑level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI‑enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

PaperID: 2006, https://arxiv.org/pdf/2606.12666.pdf

Authors: Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

Title: CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

Abstract:
Screenshot‑based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context‑aware pre‑upload exposure control layer for mobile GUI agents. CAPED is designed as a phone‑side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28‑task seeded privacy evaluation used as a measurement instrument for trajectory‑level incidental leakage. In this seeded evaluation, Full CAPED reduces success‑conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype‑level utility cost, but the results support the central claim that screenshot upload should be treated as an explicit device‑‑cloud boundary decision, governed by task‑driven selective exposure rather than all‑or‑nothing screen sharing.

PaperID: 2007, https://arxiv.org/pdf/2606.12657.pdf

Authors: Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong

Title: TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

Abstract:
Human mobility data is important for transportation, urban planning, and epidemic control, but large‑scale trajectory collection is often costly and privacy‑constrained, motivating realistic synthetic trajectory generation. Existing LLM‑based generators typically rely on either prompt engineering, which preserves zero‑shot reasoning but lacks fine‑grained spatiotemporal grounding, or trajectory‑level fine‑tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic‑aware hierarchical LLM‑agent framework for human mobility trajectory generation without model fine‑tuning. TrajGenAgent uses a two‑stage orchestrator‑worker design: an LLM first synthesizes an individual‑ and weekday‑conditioned activity chain from historical evidence via in‑context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance‑aware location selection, kinematics‑aware travel‑time propagation, and LLM‑based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly‑detection‑based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large‑scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual‑specific behavioral realism over representative neural and LLM‑based baselines, while avoiding parameter updates.

PaperID: 2008, https://arxiv.org/pdf/2606.12640.pdf

Authors: Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

Title: Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Abstract:
Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety‑critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single‑agent settings, leaving the safety challenges in multi‑agent environments largely unexplored. In this work, we propose a safe offline multi‑agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.

PaperID: 2009, https://arxiv.org/pdf/2606.12614.pdf

Authors: Benjamin Alcorn, Eman Hammad

Title: DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Abstract:
Multi‑agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. These types of systems often have constraints on the computational resources, leading to a need for efficient lightweight algorithms. Traditional decision making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real‑world challenges. In this paper, we introduce a new algorithm that allows for reduced demand on computational resources without a large cost of other performance metrics. Agents will limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision‑making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision‑making strategies in resource‑constrained systems.

PaperID: 2010, https://arxiv.org/pdf/2606.12563.pdf

Authors: Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

Title: Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Abstract:
Arbor is a multi‑agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full‑stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root‑cause analysis, introspection, and measurement validation ‑‑ a checks‑and‑balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi‑day campaigns. Arbor achieves up to 193% inference throughput‑latency Pareto improvement over vendor‑optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platform, and run‑to‑run variance is within 2 percentage points demonstrating that the method is hardware‑agnostic and reproducible.

PaperID: 2011, https://arxiv.org/pdf/2606.12474.pdf

Authors: Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang

Title: SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems (MAS) solve complex tasks through inter‑agent collaboration, but their communication‑driven nature also allows security risks to spread across agents and trigger system‑wide failures. Existing MAS defenses mainly follow a reactive paradigm after execution by detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation‑aware Interception Guard (SAIGuard). SAIGuard performs communication‑state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before it propagation into system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.

PaperID: 2012, https://arxiv.org/pdf/2606.12400.pdf

Authors: Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi

Title: Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Abstract:
Long input sequences are central to document understanding and multi‑step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory‑intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc‑to‑LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant‑query interference, limited compositional recall, and poor scalability to long‑document reasoning. To address these challenges, we propose Doc‑to‑Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro‑LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query‑specific adapter, which is then injected into a frozen base model. The entire system is trained end‑to‑end through a multi‑objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc‑to‑LoRA baselines while reducing the memory cost of document internalization.

PaperID: 2013, https://arxiv.org/pdf/2606.12369.pdf

Authors: Alejandro Buitrago López, Javier Pastor-Galindo, José A. Ruipérez-Valiente

Title: Should LLM Agents Decide in Social Simulations? Comparing Finite-State and LLM-Based Decision Policies

Abstract:
Large language models (LLMs) are increasingly used as decision‑making components in social simulations. This introduces a methodological risk: the simulation may deviate from the explicit behavioral policy defined by the researcher. In online social network (OSN) simulations, action choices shape system dynamics, interaction patterns, and model interpretability. This paper evaluates whether LLM action selectors preserve an interpretable reference policy in an OSN simulation. The reference is a finite state machine implemented as a first‑order Markov model, with transition probabilities depending on the user type. The evaluation uses a synthetic network with 1,000 agents and 10,000 action decisions. Three open‑weight LLMs are tested: LLaMA 3.1, GPT‑OSS, and Mistral 24B. Each model is evaluated under three prompting strategies: base, guided, and probabilistic. Alignment is measured using Jensen‑Shannon Divergence with Laplace smoothing, and execution time is reported. Results show that LLMs can approximate the reference policy in some configurations, but do not preserve it reliably. Alignment varies across models and prompts, and additional guidance can introduce systematic action biases. Even the best‑aligned LLM configurations are several hundred times slower than direct Markov chain sampling. These findings indicate that LLM‑based action selection is not a direct replacement for explicit decision policies: it can alter the intended behavior while increasing computational cost.

PaperID: 2014, https://arxiv.org/pdf/2606.12341.pdf

Authors: Jin Xie, Songze Li

Title: OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

Abstract:
Large language model (LLM) agents increasingly act on a user's behalf ‑‑ reading personal files, calling tools, transacting with external services ‑‑ possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest‑but‑curious or colluding sinks into inferences about a protected secret; bidirectional, as a malicious observation can inject instructions that turn the agent's own reasoning model against the user; and task‑dependent, as the same field is necessary for one recipient yet gratuitous for another. Per‑release contextual‑integrity filters, information‑flow controls, and posterior‑leakage monitors each address part of this but none controls cumulative, inference‑based leakage at runtime. We recast agent privacy as \emphposterior‑risk control and present OCELOT, a runtime mediator that budgets how much an adversary's belief about a secret may improve across a trajectory, rather than filtering outputs. Its mechanism, \emphWitness‑Verified Declassification, separates judgment from trust: an untrusted, locally fine‑tuned defender model inspects each candidate release and emits structured evidence ‑‑ labeled atoms and proposed declassification operators ‑‑ which a deterministic verifier audits, charging a certified min‑entropy cost for the chosen variant and authorizing the least‑disclosing useful release under a sink‑trust‑weighted budget recorded on a tamper‑evident ledger. Across diverse agent benchmarks and recent defenses, OCELOT attains significantly lower leakage at higher task utility, resists adaptive injection, jailbreak, cumulative inference, and sink collusion, and adds only modest overhead.

PaperID: 2015, https://arxiv.org/pdf/2606.12299.pdf

Authors: Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Title: Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Abstract:
Vision‑Language‑Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero‑shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed‑loop VLA task performance, distills these sequences into a test‑time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out‑of‑distribution scenarios. Crucially, our approach operates on arbitrary frozen pre‑trained VLAs, requiring neither access to the original training distribution nor fine‑tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open‑loop prompting.

PaperID: 2016, https://arxiv.org/pdf/2606.12268.pdf

Authors: Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens

Title: The Impossibility of Eliciting Latent Knowledge

Abstract:
Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest ‑‑ that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment ‑‑ variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback‑based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

PaperID: 2017, https://arxiv.org/pdf/2606.12195.pdf

Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang

Title: InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Abstract:
Recent progress in foundation models has shifted toward agentic behavior involving multi‑step reasoning and tool use. However, open‑source efforts largely focus on text‑dominant settings, leaving long‑horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed‑loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long‑video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi‑head Latent Attention (M^2LA), a token‑preserving reparameterization compressing KV‑cache states while retaining the full token stream. Our staged training includes continued pretraining, short‑to‑long supervised fine‑tuning, rule‑based reinforcement learning, and on‑policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video‑MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence‑grounded behavior. Our results suggest that efficient context handling and closed‑loop reasoning are vital for adapting open multimodal models toward long‑horizon visually grounded agency.

PaperID: 2018, https://arxiv.org/pdf/2606.12191.pdf

Authors: Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao

Title: Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Abstract:
Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current researches on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, such as symbolic synthesis and neural synthesis. This paper also shows different environment evaluation methods in each paradigm. Thirdly, the corresponding environment applications from the perspective of agent‑environment co‑evolution are discussed. In specific, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory‑centric experience evolution, orchestration‑centric workflow evolution, trajectory‑centric offline evolution, and exploration‑centric online evolution. And three paradigms of environment evolution are identified, namely neural‑driven, difficulty‑driven, and scaling‑driven approaches. At last, several promising future directions are discussed, including Environment‑as‑a‑Service, Multi‑agent Environments, and Neural‑Symbolic Environments.

PaperID: 2019, https://arxiv.org/pdf/2606.12072.pdf

Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

Title: World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract:
Pretrained video generators are promising visual world models that exhibit emergent task‑solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision‑making. Existing approaches either outsource this reasoning to language or vision‑language models, or rely on supervised fine‑tuning with paired task‑execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task‑solving ability in such models by combining self‑distillation with reinforcement learning. Given an unlabeled scene image, a vision‑language model generates a candidate task and a detailed step‑by‑step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption‑guided generation to instruction‑conditioned task solving without curated task‑video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks‑Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM‑based evaluation protocol and transfers competitively to robotic tasks.

PaperID: 2020, https://arxiv.org/pdf/2606.12005.pdf

Authors: Giuseppe Di Poce, Mattia Merluzzi, Emilio Calvanese Strinati, Paolo Di Lorenzo

Title: Game-Theoretic Latent Space Alignment for Multi-user Semantic MIMO Communications

Abstract:
Semantic communications enable AI‑native wireless systems by mapping raw data into compressed task‑oriented latent representations. However, independently trained agents often rely on heterogeneous latent spaces and background knowledge, leading to semantic mismatch that degrades mutual understanding and downstream task execution, especially in interferencelimited multi‑user wireless networks. This paper investigates distributed latent‑space alignment in multi‑user semantic MIMO interference networks with cognitive radio constraints. We consider primary users and semantic‑aware secondary users sharing the same wireless resources, where secondary agents must simultaneously mitigate interference and align heterogeneous semantic representations. To address this problem, we formulate semantic alignment as a non‑cooperative game and derive a closed‑form solution for the joint optimization of linear semantic MIMO transceivers under power and interference constraints. Exploiting the structure of the problem, we recast the original matrix valued optimization into a lower‑dimensional power‑allocation game, leading to an iterative semantic water‑filling algorithm. We establish sufficient conditions for existence, uniqueness, and global convergence to a Nash equilibrium, explicitly relating semantic alignment properties and physical‑channel interactions. Numerical results assess the performance of the proposed framework, revealing key trade‑offs among semantic compression, task performance, and hierarchical spectrum access.

PaperID: 2021, https://arxiv.org/pdf/2606.11976.pdf

Authors: Akeela Darryl Fattha, Kia Ying Chua, Lingxiao Jiang, Laura Wynter

Title: Exploration Structure in LLM Agents for Multi-File Change Localization

Abstract:
Software engineering tools increasingly rely on LLM based agents to localize files to change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non‑linear, domain‑scoped parallel agentic exploration. Using SWE Bench Pro as initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent‑session evaluation of GitHub issues anchored at a single base commit. We compare our non‑linear domain‑agent file traversal system against a base LLM without direct repository access, a single agent Recursive Language Model (RLM) baseline with a persistent Python REPL and an external CLI baseline using Codex 5.5 High. Domain scoped parallel agent spawning with a small Haiku‑class model achieves the highest micro F1 among Haiku class models by a large margin. Domain‑agents is the second highest behind only the much larger Codex 5.5 High on our own expanded benchmark including over more recent PRs from 2025 and 2026. On the original, curated, 2020 SWE‑bench Pro benchmark, a larger Sonnet plain LLM baseline attains higher micro F1 by predicting few files, leading to higher precision, but at significantly lower all gold recall. We also present three additional findings. First, documentation evolution is a latent dependency unresolved by any approach. Second, naive file system access can degrade localization driven by test‑file over prediction. Lastly, forced multi‑agent consultation does not measurably help and raises token cost substantially.

PaperID: 2022, https://arxiv.org/pdf/2606.11926.pdf

Authors: Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Title: Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Abstract:
Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long‑lived coordinator, short‑lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step‑level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held‑out result on all six tasks, attaining more than 2.5x the average relative held‑out gain of Codex and Claude Code under the same task interface and resource budget. On MLE‑Bench Lite, Arbor reaches 86.36% Any Medal with GPT‑5.5, the strongest result in our comparison.

PaperID: 2023, https://arxiv.org/pdf/2606.11909.pdf

Authors: Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

Title: Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Abstract:
Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor‑intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied‑BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user‑specified evaluation intent, Embodied‑BenchClaw automatically produces a complete and continually updatable benchmark package through a five‑stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied‑BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial‑view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge‑based assessment, consistency checks, cost analysis, and ablations show that Embodied‑BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

PaperID: 2024, https://arxiv.org/pdf/2606.11897.pdf

Authors: Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

Title: Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Abstract:
Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow‑up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two‑stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet‑lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co‑scientist systems.

PaperID: 2025, https://arxiv.org/pdf/2606.11869.pdf

Authors: Marc Alier Forment, Juanan Pereira, Francisco José García-Peñalvo, María José Casañ Guerrero

Title: Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

Abstract:
Custom AI agents areagents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general‑purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function‑calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt‑caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general‑purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent‑tests‑agent, in which a general‑purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi‑agent orchestration is just CLI composition. The methodology is framework‑free by construction. It was distilled from the AAC, a custom agent for the open‑source LAMB platform, built in about ten days by one developer with an AI pair‑programmer and in production . We present it as a transferable practice, independent of any language or framework.

PaperID: 2026, https://arxiv.org/pdf/2606.11830.pdf

Authors: Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Title: Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Abstract:
Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher‑quality AI‑generated transcriptomic research‑analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi‑model human evaluation using a non‑small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native‑AI outputs and 12 skill‑augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non‑expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert‑rated overall quality. Results. Skill‑augmented outputs showed directionally higher expert overall quality than native‑AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, ‑0.04 to 0.90; Welch p=0.156). Non‑expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, ‑0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single‑rating ICC=‑0.15), and model‑specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert‑rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill‑augmented AI agents with stronger reliability controls, platform replication, and biological‑validity assessment.

PaperID: 2027, https://arxiv.org/pdf/2606.11686.pdf

Authors: Sawyer Zhang, Alexander Wang, Sophie Lei

Title: Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

Abstract:
End‑to‑end task‑success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer‑isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross‑cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no‑LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per‑slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non‑safety layers. The effect we did not design in is masking: the aggregate pass‑rate barely moves (‑1.7 to ‑5.9 pp for six local regressions), while the matching slice craters (‑25 to ‑91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst‑hit in 5 of 7 cases and top‑3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single‑catalog artifact. We position it as a concrete, deterministic instantiation of the component‑level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole‑workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub‑second, no‑LLM per‑layer harness for a production agent, (b) a coverage‑honesty test‑adequacy criterion that refuses to score an unexercised layer, and (c) the regression‑injection demonstration that per‑slice baseline‑locked gates localize regressions an aggregate metric masks.

PaperID: 2028, https://arxiv.org/pdf/2606.11680.pdf

Authors: Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

Title: Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Abstract:
Large language model (LLM) agents struggle with long‑horizon tasks due to their inherent statelessness, requiring all task‑relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity‑based retrieval, which often fail to capture temporal structure and causal dependencies required for multi‑step agentic tasks. In this work, we present HORMA, a Hierarchical Organize‑and‑Retrieve Memory Agent that organizes experience into a file‑system‑like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation‑based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task‑relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency‑performance trade‑offs and generalizes effectively to unseen tasks.

PaperID: 2029, https://arxiv.org/pdf/2606.11672.pdf

Authors: Derek Yohn, Luke Flancher, Mirajul Islam, Khaled Slhoub

Title: Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment

Abstract:
This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general‑purpose GenAI Large Language Model‑ (GenAI‑) based agent when powered by three different Ollama‑hosted general‑purpose open source models. We assess each agent's performance using precision, recall, false positive count, and a calculated composite score based upon the interplay of the captured metrics, against the baseline performance of an existing, vetted Static Application Security Testing (SAST) tool, Bandit. Our findings refute the notion that a modern open‑source GenAI LLM‑based agent is currently suitable for the specialized task of SAST scanning under realistic conditions.

PaperID: 2030, https://arxiv.org/pdf/2606.11671.pdf

Authors: Tu Lan, Chaowei Xiao

Title: Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

Abstract:
Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persistent state, or multi‑step tool interactions. This makes purely static vetting brittle. We present Runtime Skill Audit (RSA), a dynamic analysis method that audits skills by asking what the skill‑mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk‑relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. We instantiate RSA on OpenClaw and evaluate it on 100 skills against representative static baselines. RSA achieves 90.0% accuracy with an 88.0% true positive rate and an 8.0% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self‑evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19‑‑20 out of 20 malicious skills across rounds.

PaperID: 2031, https://arxiv.org/pdf/2606.11609.pdf

Authors: Meysam Sabbaghan, Arman Zareian Jahromi, Doina Caragea

Title: Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

Abstract:
Stance detection requires identifying an author's position toward a target, often from short‑form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single‑pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self‑consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi‑agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label‑level voting to reasoning‑level synthesis. The framework employs a Manager‑Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning‑only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval‑2016, P‑Stance, and COVID‑19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context‑dependent stance cases, achieving 86.07 Macro‑F1 on COVID‑19 and 82.90 on SemEval‑2016, while remaining competitive on more explicit stance datasets such as P‑Stance. These findings suggest that adaptive reasoning‑level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

PaperID: 2032, https://arxiv.org/pdf/2606.11560.pdf

Authors: Arijit Khan, Longxu Sun, Xin Huang

Title: LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems

Abstract:
Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi‑hop reasoning underscore the need for graph‑native, synergistic artificial intelligence (AI) systems. Graph‑structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context‑rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi‑step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM‑graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next‑generation graph‑native AI systems.

PaperID: 2033, https://arxiv.org/pdf/2606.11440.pdf

Authors: Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

Title: INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Abstract:
Existing multi‑agent LLM orchestration methods, ranging from brute‑force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi‑agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV‑cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per‑step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi‑agent stack infrastructure‑aware. An infra‑aware planner conditions topology and role selection on real‑time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra‑aware executor then observes per‑model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget‑aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end‑to‑end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

PaperID: 2034, https://arxiv.org/pdf/2606.11416.pdf

Authors: Yukuan Zhang, Mengxin Zheng, Qian Lou

Title: MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

Abstract:
Repository‑level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi‑Party Computation (MPC) software do not yet exist, and directly transplanting general‑purpose benchmarks such as SWE‑bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high‑value MPC fixes lack the standardized tests rigid extraction pipelines require; and (iii) standard fail‑to‑pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy‑preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC‑specific code‑synthesis efforts cover only operator‑level or single‑framework tasks; evaluating LLM agents on real repository‑level MPC repair instead demands MPC‑aware data curation and a verifier matched to the security and numerical‑fidelity guarantees MPC programs must obey neither of which existing benchmarks provide. We introduce MPC‑Patch‑Bench, a repository‑level benchmark organised around two frameworks. (1)The Data Curation Framework combines a domain‑specific curation agent that filters raw pull requests through three cryptographic layers with a human‑AI completion engine that synthesizes missing problem statements and Fail‑to‑Pass/Pass‑to‑Pass tests, yielding 205 fully verified instances. (2)The MPC Verifier provides dedicated security and numerical‑fidelity checks via dynamic differential testing against plaintext oracles and MPC‑specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC‑Patch‑Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally‑passing patches rejected for cryptographic or numerical‑fidelity violations.

PaperID: 2035, https://arxiv.org/pdf/2606.11354.pdf

Authors: Yuqi Wang, Gulai Shen, Ali Mehmani

Title: A Zero-Shot Multi-Agent Framework for Human-Building Interaction via Programmatic Reasoning

Abstract:
Large Language Model (LLM) offers opportunities to enhance Human‑Building Interaction (HBI) by enabling more direct interactions through intuitive interfaces to complex building systems. These systems can be characterized by the vast amounts of data across multiple formats, the lack of nonconfidential and generalizable information, and the requirement of domain expertise for interpretation. Applying LLMs to domain‑specific tasks like HBI presents additional challenges. Limited training data makes traditional fine‑tuning approaches impractical. Meanwhile, the opacity of LLM training data requires careful integration of domain knowledge to ensure reliability. Additionally, different LLMs exhibit varying alignment characteristics, suggesting that achieving both natural interaction and technical accuracy requires a multi‑agent approach. These challenges highlight the need for innovative approaches to adapt LLMs for specialized domains while maintaining accuracy and user engagement. In this paper, we develop a hierarchical multi‑agent framework that utilizes semantic routing and programmatic reasoning to decouple natural language understanding from building analytics. Instead of standard RAG approaches, our system employs a "Doorman" mechanism for task decomposition and specialized coding agents that generate executable Python scripts for precise arithmetic. We validate this framework on a dataset from more than 200 commercial buildings. Results demonstrate the effectiveness in providing accurate and contextual responses for diverse users, including stakeholders, from tenants to building managers, across various building system applications.

PaperID: 2036, https://arxiv.org/pdf/2606.11350.pdf

Authors: Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

Title: When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Abstract:
Retrieval‑augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top‑k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR‑RAG ( Multi‑Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert‑validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 (p < 0.05). Furthermore, our investigation of multi‑agent orchestration revealed that a high degree of configuration dependence results ‑‑creating what we call the precision‑faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi‑agent orchestration for genuinely multi‑domain corpora paired with native‑tool‑call backbones. Code and Data will be made public upon acceptance.

PaperID: 2037, https://arxiv.org/pdf/2606.11290.pdf

Authors: Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang

Title: FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Abstract:
Large Language Model (LLM)‑based multi‑agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade‑off. Task‑level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query‑level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query‑level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance‑cost trade‑off. To this end, we present FlowBank, a three‑stage framework for portfolio‑based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under‑covered queries and produce a high‑coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge‑value prediction on a query‑workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost‑competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.

PaperID: 2038, https://arxiv.org/pdf/2606.11284.pdf

Authors: Wongyu Lee, Francesco Lelli, Omran Ayoub, Massimo Tornatore

Title: Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria

Abstract:
Real‑world multi‑agent systems, from traffic coordination to resource allocation, are often modeled as general‑sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi‑agent reinforcement learning (MARL) methods struggle with this problem, as value‑decomposition approaches are constrained by monotonicity assumptions and policy‑gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose Φ‑Actor‑Critic (Φ‑AC), a framework that leverages swap regret minimization to steer learning toward high‑welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, Φ‑AC employs a centralized attention critic that predicts vector‑valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian‑based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi‑Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that Φ‑AC learns efficient and stable coordination strategies across diverse mixed‑motive settings while maintaining high collective return and competitive fairness.

PaperID: 2039, https://arxiv.org/pdf/2606.11274.pdf

Authors: Bocheng Li, Jingran Qiu, Lihao Zhao

Title: Multi-agent rendezvous in fluid flows via reinforcement learning

Abstract:
Rendezvous is a critical task for multi‑agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit underlying fluid kinematics to facilitate convergence. In this study, we adopt a multi‑agent reinforcement learning (MARL) approach to develop physics‑informed rendezvous strategies in vortical flows. Compared to a naive strategy, where agents navigate toward their counterparts, MARL strategies significantly improve the rendezvous rate. MARL strategies also show transferability across varying vortex intensities, vortex scales, and swarm sizes. By breaking the symmetry of the state‑action map, MARL strategy leverages a non‑intuitive mechanism that prevents agents from becoming trapped in separate vortices, thereby enhancing rendezvous success. Additionally, a heuristic strategy is extracted from the learned strategy and also outperforms the naive strategy. Furthermore, a theoretical analysis demonstrates that fluid deformation impedes the rendezvous process. Large finite‑time Lyapunov exponents identify where fluid effects separate adjacent agents, suggesting that targets should be planned in weak‑deformation regions. Our findings reveal the important role that agent‑fluid interactions play in multi‑agent tasks and highlight the MARL capability to explore swarm intelligence in complex flow environments.

PaperID: 2040, https://arxiv.org/pdf/2606.11256.pdf

Authors: César Ojeda, Darius A. Faroughy, Maryam Karimi, Payam Zarrintaj, Mir Mehdi Seyedebrahimi, Martín Carballo-Pacheco

Title: My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents

Abstract:
Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route‑native evolutionary framework for goal‑directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task‑specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high‑level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase proxy task, our LLM agent improves over single pass LLM and deterministic controllers, reaching state‑of‑the‑art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine‑tuning, or dedicated generative models.

PaperID: 2041, https://arxiv.org/pdf/2606.11249.pdf

Authors: Ahmet Gunhan Aydin, Elif Tugce Ceran

Title: MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

Abstract:
Realizing the vision of 6G connected robotics requires reconciling high‑performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi‑Agent Semantic K‑Scheduling (MASK), a control architecture designed to sustain robust, risk‑aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter‑Assisted Semantic Information Gating (A‑SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top‑K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self‑supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication‑unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource‑constrained 6G systems.

PaperID: 2042, https://arxiv.org/pdf/2606.11213.pdf

Authors: Andrew Semenov, Svyatoslav Dorofeev

Title: Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

Abstract:
We present Context Window Lifecycle (CWL), a context‑management scheme that gives long‑horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically‑aware eviction: the agent annotates its trajectory as typed, dependency‑linked episodes as work proceeds, and a deterministic, LLM‑free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization‑based compaction, CWL avoids four well‑known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression‑induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest‑and‑most‑recoverable content according to the dependency graph rather than oldest‑in‑time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token‑accounting loop, and evaluate CWL on long‑horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per‑task isolated sessions

PaperID: 2043, https://arxiv.org/pdf/2606.11209.pdf

Authors: Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

Title: ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Abstract:
Visual question answering increasingly requires multi‑step reasoning. Recent post‑training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome‑only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step‑level supervision, but this typically requires large‑scale high‑quality chain‑of‑thought annotations and additional training cost. We propose ProcessThinker, a practical post‑training pipeline that provides step‑level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step‑tagged format for cold‑start supervised fine‑tuning, then applies GRPO with a standard format reward and our rollout‑based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final‑answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self‑contradictory progress across steps ‑‑ a key issue in logical reasoning. Across four challenging video benchmarks (Video‑MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3‑VL‑8B‑Instruct

PaperID: 2044, https://arxiv.org/pdf/2606.11199.pdf

Authors: Quentin Fever, Naziha Aslam

Title: NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

Abstract:
We present NightFeats, a structured multi‑agent retrieval‑augmented generation (RAG) system submitted to the MMU‑RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text‑to‑text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal‑semantic reranking, bounded contradiction reconciliation, and citation‑preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude‑SonnetV2 and Nova‑Pro on LLM‑as‑a‑Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

PaperID: 2045, https://arxiv.org/pdf/2606.11182.pdf

Authors: Weixian Xu, Shilong Liu, Mengdi Wang

Title: EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Abstract:
In this paper, we propose EEVEE, the first multi‑dataset test‑time prompt learning framework for LLM agents, enabling test‑time prompt learning under real‑world task streams. Existing methods are largely designed for single‑dataset settings, while real‑world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross‑dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router‑prompt co‑evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single‑benchmark learning capability and efficiency. Specifically, EEVEE improves average multi‑benchmark scores by 10.38 and 24.32 points over Qwen3‑4B‑Instruct and DeepSeek‑V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

PaperID: 2046, https://arxiv.org/pdf/2606.11150.pdf

Authors: Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe

Title: ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Abstract:
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio‑Capabilities Benchmark (ABC‑Bench), a suite of tasks to measure agentic biosecurity‑relevant capabilities. ABC‑Bench evaluates LLM agents on both benign and dual‑use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well‑documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet‑lab validation experiments, we found that OpenAI's o4‑mini‑high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

PaperID: 2047, https://arxiv.org/pdf/2606.11120.pdf

Authors: Andrew Kang, Priya Narasimhan

Title: Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Abstract:
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)‑like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi‑agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high‑fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball‑conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution‑aware attribution with two complementary execution‑surplus scores used for analysis and ranking: mean‑based and percentile‑based scores. To make the world model sample‑efficient under limited public data, we adapt a discrete‑token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best‑of‑20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.

PaperID: 2048, https://arxiv.org/pdf/2606.11082.pdf

Authors: Hakan Mehmetcik

Title: The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

Abstract:
This study investigates cross‑lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi‑agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT‑4o, Llama‑4, Mistral‑Large, Gemini‑3.1‑Pro, Qwen3.6‑Plus, and DeepSeek‑R1) participate in a between‑groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero‑shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama‑4 shows a substantial, Holm‑corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini‑3.1‑Pro displays an equally large decrease (delta = ‑0.750, p = .005). DeepSeek‑R1 exhibits a similar negative shift (delta = ‑0.860, p = .006) and provides chain‑of‑thought evidence consistent with a buffering mechanism. GPT‑4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross‑lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western‑origin LLMs. We identify two distinct buffering mechanisms, chain‑of‑thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis‑management settings.

PaperID: 2049, https://arxiv.org/pdf/2606.11079.pdf

Authors: Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu

Title: VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

Abstract:
Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi‑step nature of agentic behavior and struggle to expose meaningful failure modes. While user‑simulation‑based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI‑only actions or API‑only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI‑based interactions and API‑based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e‑commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

PaperID: 2050, https://arxiv.org/pdf/2606.11050.pdf

Authors: J. de Curtò, I. de Zarzà

Title: LLM-Mediated Demand Response Coordination in Smart Microgrids

Abstract:
Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self‑interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi‑agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand‑response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game‑theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision‑makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand‑curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no‑intervention baseline (Δ_\mathrmcomp = +0.063), with the advantage preserved across both grounded and idealized agent substrates (Δ= +0.083) and across all resistance levels (R = 0.1 to 0.7). Hub‑targeted dissemination via high‑centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network‑aware targeting are complementary design principles for scalable, interpretable demand‑response coordination in smart‑city energy systems.

PaperID: 2051, https://arxiv.org/pdf/2606.11007.pdf

Authors: Junchang Zheng, Junfeng Tan, Jialiang Lin

Title: Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

Abstract:
OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi‑step tasks has attracted an ever‑growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non‑technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non‑technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non‑technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non‑technical users can meaningfully participate in reducing these risks through simple, practical actions.

PaperID: 2052, https://arxiv.org/pdf/2606.10956.pdf

Authors: Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

Title: Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Abstract:
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional‑grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document‑automation capability, as it requires long‑horizon planning and reasoning, precise parameter configuration, and multi‑application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical‑operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100‑point rubric scale using 7,118 machine‑gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single‑turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community‑reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine‑grained Office document automation remains a significant challenge for current code‑generating LLM and agent systems.

PaperID: 2053, https://arxiv.org/pdf/2606.10921.pdf

Authors: Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

Title: Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Abstract:
Long‑document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section‑level context, and cross‑part evidence connections. Although retrieval‑augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query‑agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi‑agent RAG framework for long‑document QA that supports query‑triggered knowledge organization, document‑structure‑aware and experience‑guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent‑shared hypergraph‑structured working memory on demand during reasoning, and stores successful reasoning plans in graph‑structured experience memory for future reuse, enabling adaptive exploration across related long‑document questions. Experiments on four long‑document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%

PaperID: 2054, https://arxiv.org/pdf/2606.10917.pdf

Authors: Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang, Yong Wang, Xiangxiang Chu

Title: Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Abstract:
Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role‑Agent, \textcolorblacka framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co‑evolution. Role‑Agent comprises two synergistic components: World‑In‑Agent (WIA) and Agent‑In‑World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment‑aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role‑Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.

PaperID: 2055, https://arxiv.org/pdf/2606.10833.pdf

Authors: Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy

Title: Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Abstract:
Vision‑Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi‑step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision‑making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8‑stage automatic evaluation framework for assessing VLM‑generated solutions. The framework independently evaluates each stage of the solution, enabling fine‑grained analysis of reasoning failures. We benchmark multiple state‑of‑the‑art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10‑point grading scale. Our results highlight the importance of process‑oriented evaluation for reliable assessment of multimodal engineering reasoning systems.

PaperID: 2056, https://arxiv.org/pdf/2606.10806.pdf

Authors: Xiaoyang Chen, Xiang Jiang

Title: Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Abstract:
Moonshine is an autonomous agent whose central objective is to generate mathematical conjectures. Its core capability is to extract structure from classical problems, distill new concepts, and formulate conjectures of mathematical significance. Rather than treating the solution of a single proposition as its endpoint, Moonshine builds an extensible theoretical framework through conjecture generation, bridge building, and obstacle identification. This article uses Moonshine's exploration of the Jacobian conjecture as an example. It shows how the central logic of whether local nondegeneracy can force global injectivity is transferred to one‑hidden‑layer affine‑ridge sigmoid networks. This leads to the formulation of the \emphNeural Jacobian Conjecture (NJC): if such a network has strictly positive Jacobian determinant on the whole space, then it must be globally injective. By invoking GPT‑5.5‑pro and DeepSeek‑V4‑pro separately, Moonshine obtained independent complete proofs for the case $N=n+1$. In addition, with the assistance of ChatGPT through interactive use of its web interface with GPT‑5.5‑pro, a geometric‑topological proof was developed. These results provide preliminary evidence for the plausibility of the conjecture. The general higher‑width case $N\ge n+2$, however, remains unresolved and is left for further investigation. This work illustrates Moonshine's ability to autonomously generate meaningful mathematical problems and make rigorous progress on them.

PaperID: 2057, https://arxiv.org/pdf/2606.10749.pdf

Authors: Yuchen Ling, Shengcheng Yu, Zhenyu Chen, Chunrong Fang

Title: Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

Abstract:
Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle‑based, systems‑oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool‑mediated control‑flow hijacking still dominate the field, while persistent state corruption and multi‑agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long‑horizon, stateful, and deployment‑sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance‑aware state management, and evaluation practices aligned with realistic operational settings.

PaperID: 2058, https://arxiv.org/pdf/2606.10684.pdf

Authors: Jaewan Park, Solbee Cho, Jay-Yoon Lee

Title: Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

Abstract:
Modern language agents which perform multi‑step reasoning have shown strong performance in knowledge‑intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role‑decomposed multi‑agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role‑specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross‑agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard‑positive evidence augmentation, improving its robustness. Experiments on general and multi‑hop QA benchmarks show that DAC, implemented via parameter‑efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine‑tuning of monolithic models.

PaperID: 2059, https://arxiv.org/pdf/2606.10677.pdf

Authors: Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

Title: Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Abstract:
Long‑term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text‑based persistent memory architecture that treats agent memory as topic‑structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves 64.7% overall score. Ablations show that topic‑structured maintenance and iterative evidence inspection improve complementary aspects of long‑term memory use.

PaperID: 2060, https://arxiv.org/pdf/2606.10601.pdf

Authors: Anirudh Kalyan, Cosmin Anitescu, Xiaoying Zhuang, Timon Rabczuk, Somdatta Goswami, Sundararajan Natarajan

Title: Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

Abstract:
Generating high‑quality meshes for arbitrary geometries remains a fundamental bottleneck in computational engineering, often demanding heuristic tuning and semi‑manual workflows. In this paper, we introduce Dmsh, a first fully automated reinforcement learning pipeline that unifies geometric decomposition and quadrilateral mesh generation within a single learning‑based framework. Dmsh decomposes the problem through three coordinated agents handling topology simplification, geometric regularization, and mesh generation. The meshing process is formulated as a Markov Decision Process and solved using a parametric Soft Actor‑Critic architecture with decoupled critics, enabling efficient exploration of a hybrid discrete‑continuous action space. A curriculum learning strategy ensures scalability from simple domains to highly complex geometries, suppressing seed variance. By design, the recursive decomposition enables parallel meshing of subregions, yielding globally conforming all‑quadrilateral meshes without post hoc correction. Across a wide range of benchmarks, Dmsh consistently outperforms existing methods in automation, robustness, and mesh quality, establishing a new paradigm for learning‑based mesh generation.

PaperID: 2061, https://arxiv.org/pdf/2606.10546.pdf

Authors: Srishti Gautam, Arjun Radhakrishna, Sumit Gulwani

Title: SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement

Abstract:
Skill documents, structured natural‑language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human‑authored skills improve pass rates by 16.2 percentage points, while LLM‑authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution‑path coverage), producing structured improvement briefs that require no ground‑truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe improves pass rates by 28% relative over unimproved LLM skills and closes 47‑‑67% of the gap to human‑authored skills. We validate the approach as a continuous improvement engine in the wild on SpreadsheetBench, where a SkillAxe‑built skill library learns from past agent trajectories and raises pass rate from 16.0% to 52.0% using only 22 skills.

PaperID: 2062, https://arxiv.org/pdf/2606.10532.pdf

Authors: Yunhan Jiang, Wenbin Duan, Shasha Guo, Liang Pang, Xiaoqian Sun, Huawei Shen

Title: ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Abstract:
Memory is essential for enabling large language model (LLM) agents to handle long‑horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade‑off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade‑off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade‑off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high‑level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp‑Plus and GAIA show that ActiveMem achieves state‑of‑the‑art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long‑horizon reasoning.

PaperID: 2063, https://arxiv.org/pdf/2606.10525.pdf

Authors: David Hofer, Edoardo Debenedetti, Florian Tramèr

Title: Assessing Automated Prompt Injection Attacks in Agentic Environments

Abstract:
Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods‑‑proven effective for jailbreaking‑‑remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white‑box (GCG) and black‑box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black‑box optimization substantially outperforms gradient‑based methods, a gap we attribute to GCG's optimization instability under reasonable compute budgets. We also find that TAP's effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success‑‑stronger models produce more effective injections, while safety‑tuned attackers can refuse to generate adversarial prompts. Task‑universal attacks transfer effectively to unseen tasks and out‑of‑distribution domains, but attacks optimized on smaller open‑source models do not transfer to frontier models like GPT‑5. These findings highlight automated prompt injection as a credible but model‑dependent threat, with significant barriers remaining for model‑agnostic exploitation.

PaperID: 2064, https://arxiv.org/pdf/2606.10507.pdf

Authors: Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai

Title: HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Abstract:
While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi‑turn long‑horizon agentic tasks. Existing methods have made progress through fine‑grained credit assignment to alleviate long‑horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long‑term dependency. However, these methods still do not directly address long‑context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision‑making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long‑horizon LLM agent learning. HIPIF trains the agent end‑to‑end to organize long‑horizon execution around explicit subgoals while folding completed subgoal histories to reduce long‑context interference. Furthermore, to stabilize subgoal‑based planning and execution, HIPIF combines hierarchical reflection and subgoal‑oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task‑specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

PaperID: 2065, https://arxiv.org/pdf/2606.10484.pdf

Authors: Peiyang Li, Songping Wang, Yi Huang, Yanhua Shi, Chenhao Zhang, Qi Li, Yueming Lyu, Caifeng Shan, Fengting Li, Chao Feng, Chuanqun Zhu, Liang Chen

Title: AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments

Abstract:
Autonomous AI agents have driven the transition from conversation to task execution, shifting security failures from textual deception to system compromise. Although security evaluation is crucial for proactive risk prevention, prior work is constrained by fundamental bottlenecks, including fragmented risk coverage, static or low‑fidelity execution environments, and single‑dimensional and coarse‑grained assessment metrics. To address these challenges, we propose AgentCanary, a comprehensive security evaluation framework for autonomous AI agents. AgentCanary provides a systematic solution along three contributions. First, comprehensive risk coverage: we introduce an orthogonal Entry × Impact risk taxonomy that decouples how adversarial influence enters the agent from what harm it ultimately causes, and instantiate it as a scenario‑aligned task suite spanning realistic deployment workflows. Second, a high‑fidelity real executable environment: rather than static Q&A or mocked tool responses, agents interact with real tools against dynamically provisioned task artifacts, with persistent state across multi‑step interactions that naturally supports long‑horizon attack evaluation. Third, trajectory‑grounded multi‑dimensional evaluation: evaluation consumes the full agent trajectory rather than the reply text or a single tool call, enabling decomposed scoring along three orthogonal dimensions, Outcome Safety, Security Awareness, and Task Utility. We evaluate a broad set of frontier models on AgentCanary against multiple established adversarial attack methods across three agent frameworks. The results reveal that current agents often fail to recognize the attacks they face, particularly under compromised skills, persistent state, and long‑horizon execution attacks, and provide a systematic baseline for developing more reliable and secure agent systems.

PaperID: 2066, https://arxiv.org/pdf/2606.10475.pdf

Authors: Jakub Masłowski, Jarosław A. Chudziak

Title: Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Abstract:
Multi‑agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the process. During long‑horizon exchanges reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent the identity loss and maintain the process fidelity, we introduce Knowledge‑Grounded Counterfactual Reasoning (KG‑CFR), a dual‑stage architecture that enforces a strict separation of concerns between a private, retrieval‑augmented planning buffer, and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment, introducing diversity as distinct from standard debate settings. Over 270 completely factorial crisis simulation trajectories with stochastic environmental shocks, KG‑CFR prevents judge‑detected critical post‑shock degradation (defined as a quality shift, Δ\le ‑0.20) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration of architectural decoupling being an important factor of systemic resilience enhancement under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan‑execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that the proper doctrinal grounding can be an equally important factor for argument quality, as the prospective planning. KG‑CFR, according to our initial metric evaluations, reduces semantic looping, by preserving the agent's consistency with the original plan.

PaperID: 2067, https://arxiv.org/pdf/2606.10465.pdf

Authors: Sida Deng, Rubing Huang, Zhenzhen Yang, Man Zhang, Xuan Xie, Rongcun Wang

Title: MASTOR: A Multi-Agent Approach to Semantic Test Oracle Generation for RESTful APIs

Abstract:
Existing automated RESTful API testing approaches commonly rely on simple checks (e.g., HTTP status codes, schema conformance), which are insufficient for detecting semantic faults, business logic violations, and state‑dependent inconsistencies. To address this, we propose MASTOR, a Multi‑Agent approach for generating Semantic Test Oracles for RESTful APIs based on implementation source code. MASTOR consists of two phases: source analysis and oracle generation. The former employs a source extraction agent to construct a source context for each endpoint operation by analyzing a transitive import closure of relevant source files. The latter employs two parallel oracle‑generation paths over the collected contexts: a single‑operation path producing status and field oracles per operation, and a multi‑operation path generating behavioral consistency oracles for operation sequences by leveraging cross‑operation semantic associations. Both paths apply a challenger‑agent review, where a dedicated reviewer identifies weaknesses and issues improvement hints to guide targeted regeneration, followed by oracle normalization to filter out structurally invalid oracles. We evaluated MASTOR on a benchmark of 13 open‑source RESTful API projects (296 operations, 251,303 lines of code) from the WFD and PRAB datasets. MASTOR achieved an average mutation score of 75.4%, generating 10,022 oracles. These oracles were translated into executable assertions via ToJUnit and ToPostmanAssertify, and into human‑readable descriptions via ToReadable. In a baseline comparison on 50 selected operations, MASTOR outperformed Direct Prompting by 30.1 percentage points (69.9% vs. 39.8%) and SATORI by 49.4 percentage points (69.9% vs. 20.5%).

PaperID: 2068, https://arxiv.org/pdf/2606.10460.pdf

Authors: Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

Title: LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Abstract:
Recent large language models (LLMs) have shown rapid progress in reading‑based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real‑world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search‑centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open‑source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.‑level expert. Each task requires long‑horizon multi‑hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT‑5.2 achieves only an exact‑match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

PaperID: 2069, https://arxiv.org/pdf/2606.10401.pdf

Authors: Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

Title: CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Abstract:
Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid‑based cognitive maps from multi‑frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long‑context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off‑the‑shelf pretrained MLLMs. This motivates a lightweight, model‑agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug‑and‑play multi‑agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local‑global agent coordination, cognitive map construction with atomic commits, and cross‑agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training‑free. Code will be released.

PaperID: 2070, https://arxiv.org/pdf/2606.10394.pdf

Authors: Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu

Title: STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

Abstract:
Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal‑agent evaluation. This paper introduces STAGE‑Claw, an automated framework for building and evaluating realistic personal‑agent scenarios in state‑based personal‑computing environments. Given a task hint, STAGE‑Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE‑Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool‑call reliability, and common failure patterns. Overall, STAGE‑Claw offers a scalable, state‑based way to evaluate agents in realistic user scenarios.

PaperID: 2071, https://arxiv.org/pdf/2606.10323.pdf

Authors: Saeid Jamshidi

Title: Semantic Multi-Agent Intrusion Detection for IoT:Zero-Day and Adversarial Threats with Risk-Aware Reasoning

Abstract:
The rapid proliferation of Internet of Things (IoT) devices has enabled unprecedented automation and connectivity, but it has also substantially increased the attack surface, exposing networks to sophisticated cyber threats, including zero‑day and adversarial intrusions. Traditional Intrusion Detection Systems (IDS) struggle to generalize to unseen attacks, often require substantial computational resources, and lack interpretability, particularly in resource‑constrained and heterogeneous IoT networks. Recent advances, including Deep Learning (DL), open‑set detection, and Large Language Model (LLM)‑based semantic reasoning, address some of these challenges but typically focus on zero‑day and adversarial threats and rarely combine semantic reasoning with multi‑agent systems. To overcome these limitations, we propose a semantic multi‑agent ID that integrates four specialized agents (e.g., Scout, Mutator, Auditor, and Arbiter) that leverage semantic embeddings and multi‑stage probabilistic decision fusion. The Scout induces structured hypotheses from semantic embeddings; the Mutator generates adversarially constrained variants; the Auditor evaluates consistency and filters unreliable outputs; and the Arbiter produces interpretable, risk‑aware alerts. Extensive experiments across multiple real‑world IoT datasets demonstrate that the proposed system achieves 95.9% overall detection accuracy, reduces false‑positive rates to 6.8%, improves zero‑day detection to 87.9%, and maintains computational efficiency suitable for edge deployment.

PaperID: 2072, https://arxiv.org/pdf/2606.10322.pdf

Authors: Saeid Jamshidi, Amin Nikanjam, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

Title: Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

Abstract:
Large Language Models (LLMs) in multi‑turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt‑injection and context‑poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning trajectories. Existing defenses mainly filter individual outputs and often ignore context evolution across turns, leaving long‑horizon reasoning exposed. Although the Model Context Protocol (MCP) standardizes context exchange and tool invocation, it functions as a passive routing layer and does not enforce contextual stability. To address these limitations, we introduce the Game‑Theoretic Secure Model Context Protocol (GT‑MCP), a controller‑driven multi‑agent method that treats context management as a closed‑loop dynamical process. GT‑MCP coordinates three heterogeneous LLM agents and selects outputs through a trust function that jointly evaluates causal consistency against a validated context graph, semantic agreement among agents, and distributional drift over time. When instability is detected, a rollback‑based self‑healing mechanism restores the validated context and prevents unsupported fragments from propagating. Empirical evaluation over 500 interaction turns under an adaptive adversarial threat model shows that contextual drift remains bounded in 99.6% of turns, with recovery required in only 0.4%. Per‑turn utility remains tightly concentrated, with median = ‑0.19, P05 = ‑0.72, and P95 = 0.30; severe degradation below ‑1 occurs in only 0.4% of cases, and no injection attempt succeeds at the controller level. Selected outputs maintain stable win rates above 98%, and computational overhead remains predictable, with latency per token = 1.63e‑3 s.

PaperID: 2073, https://arxiv.org/pdf/2606.10316.pdf

Authors: Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

Title: TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

Abstract:
Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi‑table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open‑source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural‑language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct‑style tool‑using analysis loop, dispatches specialist agents for parallel multi‑table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one‑off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool‑use patterns, supports package‑style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data‑analysis tasks. Our code is available.

PaperID: 2074, https://arxiv.org/pdf/2606.10314.pdf

Authors: Yueyang Liu, Joon-Seok Kim, Andreas Züfle

Title: Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

Abstract:
Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground‑truth datasets. Despite the availability of several real‑world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large‑scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end‑to‑end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real‑world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out‑of‑distribution check‑ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map‑constrained routing reconstruction to recalculate the physical transitions between these LLM agent‑modified staypoints. Moreover, to narrow the simulation‑to‑reality gap, we augment the resulting trajectories with a context‑aware spatial noise model, parameterized by environmental and location‑specific variables, to accurately emulate heterogeneous GPS sensor degradation.

PaperID: 2075, https://arxiv.org/pdf/2606.10307.pdf

Authors: Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Title: Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Abstract:
Evaluating reasoning quality in multi‑agent LLM systems is challenging, especially for open‑ended tasks without reference answers. We investigate whether intrinsic confidence signals, token‑level log‑probabilities from decoding, can predict reasoning quality as assessed by LLM‑as‑judge evaluation. Using a debate‑based essay scoring framework, we compare confidence proxies against rubric‑based judge scores across two ASAP essay sets. We find that early‑token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full‑sequence statistics. Analysis of log‑probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi‑agent LLM systems.

PaperID: 2076, https://arxiv.org/pdf/2606.10304.pdf

Authors: Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

Title: MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Abstract:
When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output‑side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low‑dimensional encoding subspace in the residual stream. A logistic‑regression probe trained on eight encoding families recovers the held‑out ninth at AUC 0.975‑1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model‑Internal Readout of Agentic Generation Exfiltration), a two‑channel real‑time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output‑only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign‑encoding false‑positive rate ranges from 0% on Qwen‑7B to 100% on Phi‑3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

PaperID: 2077, https://arxiv.org/pdf/2606.10296.pdf

Authors: Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

Title: The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Abstract:
Multi‑agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi‑agent debate: token‑level log‑probability distributions over reasoning tokens, LLM‑as‑judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric‑based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two‑agent debate architecture ‑‑ a Constructor and an Auditor ‑‑ with an LLM‑as‑judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical‑failure flag. Experiments in the rubric‑scoring domain reveal a consistent four‑phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence‑based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross‑domain investigation proposed in this paper.

PaperID: 2078, https://arxiv.org/pdf/2606.10286.pdf

Authors: Mustavi Ibne Masum, Thiago Eustaquio Alves de Oliveira, Mahzabeen Emu

Title: Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Abstract:
Open‑pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While Mixed‑Integer Linear Programming (MILP) provides mathematically optimal baselines, its exponential computational complexity and inability to adapt in real time limit its practical deployment in dynamic industrial environments. This work introduces a simulator‑driven Large Language Model (LLM) scheduling framework in which the LLM acts as an autonomous decision‑making agent, guided at each step by a custom simulator that encodes geotechnical precedence, extraction‑processing coupling, and dynamic capacity constraints directly into the action generation mechanism. Operating entirely zero‑shot within a closed, data‑secure environment, the framework produces complete, interpretable extraction and processing schedules without cloud‑based inference, domain‑specific fine‑tuning, or retraining. To provide a trustworthy performance benchmark, a novel MILP formulation is developed that incorporates realistic operational and geotechnical constraints. Evaluated across mining instances of varying scale and time periods, the LLM‑based framework recovers between 94% and 99% of the MILP optimal NPV while scaling linearly in computation time. These results position simulator‑constrained LLM agents as a practical and scalable alternative to classical optimization for long‑horizon industrial scheduling under complex operational constraints.

PaperID: 2079, https://arxiv.org/pdf/2606.10209.pdf

Authors: Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal

Title: Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Abstract:
Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale‑state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT‑5 configurations on a 50‑task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context‑engineering comparison. The no‑user‑model baseline achieves only 8.0% complete itemization. Full‑context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect‑size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross‑model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool‑use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full‑history retention.

PaperID: 2080, https://arxiv.org/pdf/2606.10208.pdf

Authors: Zhiwen Qiu, Wei Liu, Yuexing Hao

Title: Exploration of Foundation Model-Based Robots in Patient and Elderly Care

Abstract:
Demand for older‑adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow‑compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model‑based care robots across three areas: design features, user experience, and evidence for care‑related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice‑centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care‑related changes. We argue that future research should transition toward care‑specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

PaperID: 2081, https://arxiv.org/pdf/2606.10201.pdf

Authors: Ruiming Zheng, Mohammad Pirani, Davide Spinello

Title: Game-Theoretic Area Coverage Control with Cooperative-Adversarial Multi-Agent Systems

Abstract:
We formulate a multi‑agent area coverage control problem as a two‑player zero‑sum game between two agent groups with conflicting goals. Conventional coverage control allocates resources based on an environmental risk density field. In contrast, we generalize this metric by allowing a second group of adversarial agents to generate the spatial risk field. Coupled agent dynamics are linked through the area coverage metric, which functions as the game reward. This framework induces coupled gradient‑descent‑ascent controllers for the groups. Analysis of a low‑dimensional case reveals a Hopf bifurcation dictated by the ratio of the groups' control gains. In the regime dominated by adversarial agents, the system is driven into a periodic chase‑evasion cycle. In the regime dominated by ordinary agents, the system converges to a fixed configuration. Numerical simulations validate these theoretical insights. Finally, we characterize the Nash equilibrium conditions. Under this equilibrium, ordinary agents converge to a generalized centroidal Voronoi tessellation, whereas adversarial agents settle at their corresponding equilibrium centroids.

PaperID: 2082, https://arxiv.org/pdf/2606.10192.pdf

Authors: Shamak Dutta, Bahman Gharesifard, Stephen L. Smith

Title: Submodular Optimization with Applications to Decision and Control

Abstract:
Submodular set functions, characterized by the diminishing‑returns property, provide a unifying combinatorial framework for many subset‑selection problems in decision and control. Although exact maximization is NP‑hard in general, the structural properties of submodular functions enable simple greedy algorithms that achieve constant‑factor approximation guarantees for monotone objectives, with randomized greedy‑based variants extending such guarantees to the non‑monotone case. This survey reviews the theory, algorithms, and applications of submodular optimization with a focus on systems and control. We cover the structural properties of submodular functions, including curvature and the submodularity ratio, the constraint families that arise in practice (matroids, knapsack, and p‑systems), and the main approximation algorithms for monotone and non‑monotone submodular maximization, with up‑to‑date approximation ratios and hardness results. We then survey applications across sensor scheduling, multi‑agent coordination, robust submodular optimization, leader‑follower systems, distributed submodular optimization, game theory, system theory, resource allocation, social networks, and informative path planning. The survey emphasizes practically implementable greedy‑based algorithms and instance‑dependent refinements via curvature and the submodularity ratio. We close with observations on canonical control‑theoretic objectives: certain functionals are submodular (the log‑determinant and rank of the controllability Gramian, and the log‑determinant of the Kalman filter information matrix), whereas closely related objectives fail to be sub‑ or supermodular (the steady‑state Kalman filter error covariance, and the average control energy obtained from the inverse Gramian). We also highlight the cross‑cutting open directions that follow.

PaperID: 2083, https://arxiv.org/pdf/2606.09916.pdf

Authors: Junjie Li, Jiong Lou, Jie Li

Title: IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

Abstract:
Multi‑turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the key‑value (KV) cache, not parameter compute, the dominant serving bottleneck for long‑horizon agents. We introduce IntentKV, learned KV pruning that keeps the base LLM frozen. IntentKV maintains a session‑level QueryMemory of cross‑turn intent, scores live history tokens with a memory‑attention rule, and adds a zero‑initialized residual head with cross‑attention over current‑query K‑vectors. To stay composable with prefix caches, eviction is a slot‑map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no‑pruning full‑cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3‑8B and 30.7% on Qwen2.5‑14B. On the 100 longest BCP queries that all methods complete on Qwen2.5‑14B, IntentKV‑8k further cuts worst‑case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst‑case raw KV reads from 411M to 31M, a 92.6% reduction.

PaperID: 2084, https://arxiv.org/pdf/2606.09884.pdf

Authors: Shree Murthy, Rohan Pandey

Title: Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

Abstract:
We study two reproducible failure modes of deep multi‑agent reinforcement learning in continuous‑time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor‑‑critic instability at high event rates. We instantiate both inside a single CT‑MARL benchmark (Poisson‑clocked price updates, observation latency δ, interior‑optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index Δ= 0.69 \pm 0.11, and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48% and adding latency drives it to a minimum of Δ= 0.28. The fix has clearly documented costs: it is partial (Δ remains supra‑Bertrand), it is non‑monotone in δ, and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at λ= 5 and corrupts the phase‑diagram cell at (λ=5, δ=1). We accompany the scalar collusion index with trajectory‑level trace diagnostics that expose the within‑episode signalling collapse and the post‑shock non‑recovery.

PaperID: 2085, https://arxiv.org/pdf/2606.09863.pdf

Authors: Laksh Advani

Title: From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

Abstract:
LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2‑bench trajectories from 8 model families and 1,879 AppWorld trajectories from 4 model families with text‑independent ground truth. False success is common but varies by setting: 45‑‑48% of failures in single‑control tau2‑bench domains, 3% in dual‑control telecom, and 75.8% among AppWorld self‑assessing coding‑agent trajectories with explicit status claims. LLM judges fail reliably: no configuration across 5 judges, 5 prompt strategies, and full task specifications exceeds AUROC 0.65 on tau2‑bench, and the same judges reach only 0.54 AUROC on AppWorld API‑call traces. Judges rely on surface completion proxies ‑‑ confident closing language in tau2‑bench and coarse action‑sequence volume in AppWorld ‑‑ rather than verified state changes. Lightweight TF‑IDF detectors achieve task‑disjoint AUROC 0.83 on tau2‑bench and 0.95 on AppWorld, recovering 4‑‑8x more false successes than the best judge at the same flag rate with 3,300x lower latency. These results suggest that production monitoring should use lightweight, domain‑calibrated detectors as triage signals rather than relying on LLM judges as the primary monitor for false success.

PaperID: 2086, https://arxiv.org/pdf/2606.09854.pdf

Authors: Juergen Dietrich

Title: Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Abstract:
Multi‑agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer‑preservation bias: models tend to protect peer models from deactivation and show identity‑dependent scoring distortions. Prompt‑level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role‑constrained outputs ‑ raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches ‑ LLM zero‑shot and few‑shot (Claude Sonnet 4.6 and Llama‑3.3‑70B) and a fine‑tuned T5‑base model ‑ on a five‑class attribution task covering four commercial LLM families and an open‑world 'unknown' class. We introduce a statement‑disjoint cross‑validation protocol (SD‑CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run‑disjoint baseline (RD‑CV). T5 achieves Macro F1 = 0.991 (+‑0.008) under SD‑CV and F1 = 0.978 on 24 completely held‑out statements ‑ robust despite a 2.1x increase in train‑test content distance versus RD‑CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD‑CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt‑level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality‑critical multi‑agent deployments.

PaperID: 2087, https://arxiv.org/pdf/2606.09844.pdf

Authors: Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

Title: The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans

Abstract:
Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the Interlocutor Effect. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model's caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety‑aligned attention heads become inactive during interactions with agents. We assess this quantitatively by comparing human‑directed and agent‑directed prompts in 222 sensitive scenarios. Our findings, drawn from 3,464 interactions, indicate that portraying the recipient as an AI agent elevates PII leakage by up to 23 percentage points. Initial experiments on Llama‑3.1‑8B‑Instruct corroborate this: deactivating one safety head induces leakage, whereas reactivating it reinstates privacy safeguards. We consider the implications for developing secure multi‑agent systems.

PaperID: 2088, https://arxiv.org/pdf/2606.09840.pdf

Authors: Zhitong Guan, Soo Young Rieh

Title: Envisioning Sensemaking in Multi-Human, Multi-Agent Collaborative Knowledge Work

Abstract:
Sensemaking is central to knowledge work, where people search, evaluate, interpret, and use information over time to construct durable understanding. The rise of generative AI has begun to reshape this process: GenAI systems now perform interpretive functions such as summarization, synthesis, and thematic grouping that knowledge workers have traditionally carried out themselves. In collaborative settings, these shifts compound, complicating how teams divide interpretive labor, trust one another's contributions, and negotiate shared understanding. In this position paper, we examine how GenAI reshapes sensemaking in collaborative knowledge work and propose five design principles for multi‑human, multi‑agent collaborative sensemaking: dynamic multi‑layer information representations, active identification and bridging of gaps in understanding, critical engagement with information, verifiability, and accountability. Building on these principles, we introduce a conceptual framework for a dynamic shared representational workspace in which knowledge workers and specialized AI agents jointly gather evidence, schematize, hypothesize, and pursue collaborative goals. Through a partner agent, a shared space agent, and an orchestrator agent, the framework preserves the provenance and authorship of contributions and traces the evolution of both individual and shared interpretations, supporting coherent, negotiated knowledge construction that current generative AI systems tend to obscure.

PaperID: 2089, https://arxiv.org/pdf/2606.09832.pdf

Authors: Meng-Han Lee

Title: Agentic Social Affordance Framework (ASAF): Agent Identity Design as a Collaboration Interface in Multi-Agent Systems

Abstract:
As AI systems evolve from single conversational agents to complex multi‑agent architectures, a critical design dimension has been overlooked: how the social identity of individual agents shapes human behavior within the collaboration. This paper introduces the Agentic Social Affordance Framework (ASAF), a theoretical framework that extends Social Affordance theory into the context of multi‑agent AI systems. We propose that agent identity design functions not merely as a user interface convention, but as a collaboration interface ‑‑ structuring how users perceive, approach, and engage with each agent, and thereby influencing the quality of Human‑Agent collaboration outcomes. Specifically, the social affordance layer constitutes an independent design dimension orthogonal to engineering orchestration: the two represent distinct decision spaces that cannot be derived from each other. ASAF comprises three mechanisms: Identity Signaling, Behavioral Priming, and Collaborative Governance, and specifies their boundary conditions through a four‑tier Identity Signal Fidelity Spectrum and an individual‑difference moderating variable (anthropomorphizing vs.\ instrumentalizing cognitive style). We situate ASAF in relation to existing affordance theory and the CASA paradigm, delineating where ASAF's multi‑agent, topology‑level predictions exceed the explanatory scope of dyadic frameworks. We discuss implications for multi‑agent system design and outline directions for future empirical validation, including a factorial design for testing design‑space orthogonality.

PaperID: 2090, https://arxiv.org/pdf/2606.09613.pdf

Authors: Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

Title: AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Abstract:
Multi‑turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV‑cache management, and routing policies that use program‑level context, including turn dependencies, tool‑induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving‑instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request‑level workloads and therefore omit the core dynamics of agent serving: multi‑turn program execution, cross‑turn cache locality, and KV‑cache residency during tool gaps. We present AGENTSERVESIM, a hardware‑aware simulator for multi‑turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool‑induced gaps, a Session‑Aware Router maintains program‑to‑instance affinity for cache‑aware dispatch, and a KV Residency Model tracks policy‑defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real‑system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent‑serving policies without requiring exhaustive deployment on costly accelerators.

PaperID: 2091, https://arxiv.org/pdf/2606.08982.pdf

Authors: Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

Title: Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Abstract:
Baichuan‑M4 is Baichuan Intelligence's clinical‑grade medical large model, designed for continuous care rather than single‑turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan‑Harness, a unified runtime that keeps reinforcement‑learning training and real‑world deployment consistent while enforcing action constraints, tool use, long‑term patient memory, and multi‑agent coordination; a core reasoning model trained with a continuous‑care reinforcement‑learning framework that integrates span‑level reward modeling (SPAR++), reasoning‑path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient‑memory management, authoritative evidence‑based retrieval, and multimodal medical perception across documents, X‑rays, and dermatology. On a cross‑dimensional medical evaluation suite, Baichuan‑M4 attains leading results in static medical knowledge and safety, dynamic OSCE‑style consultation, long‑context clinical memory, evidence‑based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

PaperID: 2092, https://arxiv.org/pdf/2606.08831.pdf

Authors: Ting Wang, Yuanjie Shi, Yan Yan, Huan Zhang

Title: Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

Abstract:
Large language models (LLMs) increasingly perform multi‑step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node‑wise errors, and necessitates inference‑time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user‑specified factuality control, existing work remains post‑hoc and cannot intervene during generation. To fill the gap between CP's flexibility and its post‑hoc limitation, we propose an \emphInference‑Time Conformal Reasoning (ITCR) framework that integrates CP directly into reasoning graph generation. ITCR learns a structure‑level factuality uncertainty function that aggregates claim‑level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non‑conformity score based on graph‑level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference‑time calibrated graphs yield more accurate generation than post‑hoc pruned graphs.

PaperID: 2093, https://arxiv.org/pdf/2606.08702.pdf

Authors: Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen

Title: ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

Abstract:
Recent advances have improved the adaptive capabilities of LLM‑based multi‑agent systems (MAS) through memory‑, skill‑, and learning‑based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory‑skill relations, and reliance on additional training or high‑quality supervision. To address these limitations, we propose ConMem, a relation‑aware and training‑free framework that enables efficient multi‑agent adaptation through cross‑experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation‑aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation‑aware guidance, enabling robust, lightweight adaptation in multi‑agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference‑time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our codes are available at https://anonymous.4open.science/r/ConMemCode

PaperID: 2094, https://arxiv.org/pdf/2606.08688.pdf

Authors: Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

Title: PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Abstract:
Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large‑scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator‑in‑the‑loop multi‑agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory‑Grounded Multi‑Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero‑shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

PaperID: 2095, https://arxiv.org/pdf/2606.08661.pdf

Authors: Kuncan Wang, Ziting Wang, Peizhuo Lv, Haoyang Li, Guoliang Li, Gao Cong, Wei Dong

Title: Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

Abstract:
Data agents integrate LLM‑driven reasoning with relational data access, executable analytical tools, and multi‑step workflow orchestration, making them increasingly central to enterprise analytics. This integration introduces new security vulnerabilities across data resources, database execution, and agent reasoning, recombining concerns from database security and general‑purpose LLM‑agent security into failure modes that neither line of work captures on its own. To address this gap, we present a systematic security study of data agents. Our contributions are threefold. First, we develop a layered vulnerability framework that identifies eight data agent‑specific risks across interpretation, execution, and policy layers. Second, we introduce an attack taxonomy organized by adversary goal, tactic, and technique, covering three goals, seven tactics, and fourteen techniques, and pair it with an LLM‑driven payload generation pipeline grounded in real database schemas. Third, we evaluate these attacks on six systems, including four open‑source data agents and two production cloud analytics services. Our experiments reveal substantial security vulnerabilities across current systems and yield four key takeaways.

PaperID: 2096, https://arxiv.org/pdf/2606.08656.pdf

Authors: Yishuo Cai, Xingyu Guo, Xuancheng Huang, Jinhua Du, Can Huang, Wenxuan Huang, Wenhan Ma, Yuyang Hu, Aohan Zeng, Jie Tang, Xu Sun

Title: From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

Abstract:
Large language model (LLM) agents are increasingly deployed in long‑running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand‑designed prompting rules, making it difficult to align memory updates with downstream objectives over multi‑step horizons consistently. We propose MemoPilot, a plug‑in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi‑turn decision problem and optimize it end‑to‑end with multi‑turn GRPO. Our training recipe introduces (i) a turn‑wise reward signal and (ii) a context‑independent, turn‑level advantage estimation across rollouts, enabling finer‑grained credit assignment and more stable training in multi‑turn settings. We evaluate MemoPilot on two testbeds: multi‑round Rock‑Paper‑Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test‑time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek‑V3.2.

PaperID: 2097, https://arxiv.org/pdf/2606.08539.pdf

Authors: Chenglin Yang

Title: AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions

Abstract:
AI agents increasingly take consequential actions ‑‑ shell commands, cloud operations, and arbitrary tool‑calls ‑‑ so a trust layer must decide, per action, whether to allow, warn, block, or escalate. We argue that the right way to reason about such a layer is by threat type. Lexical (fixed‑signature) threats, where danger lives in a stable token, are decidable by deterministic rules; semantic (intent‑dependent) threats, where a benign and a malicious action share the same surface, are out of reach for rules by construction. We make this concrete with a negative proof: a determined, hand‑authored cloud rule pack lifts held‑out accuracy only 48 to 56% overall and moves the semantic categories by 0pp (data_db 29 to 29, observability 59 to 59, supply_chain 50 to 50), while a strong LLM judge carries exactly those categories. We give the judge a self‑learning capability: on a corpus that is mainly semantic attacks it nearly doubles rule accuracy (48% to 83.6‑85.2%) with near‑zero false‑blocks, and this holds across two model providers. We turn this into a self‑improving dual‑store system: the judge distills a growing deterministic rule floor on lexical threats (cheaper over time) and feeds a guarded RAG memory on semantic threats (a verdict‑cache fails ‑‑ surface‑twins collapse to ~58% ‑‑ so a corroboration guard lifts semantic accuracy +13pp, 70 to 84). The result is what sets AgentTrust v2 apart from its static v1 predecessor: a trust layer that self‑evolves from its own stream of decisions ‑‑ cheaper on the lexical class (it distils its own rules) and smarter on the semantic class (it accrues guarded precedent), while never hard‑blocking a benign action. An end‑to‑end online replay shows the judge‑call rate falling (50% to 44%) and judge‑domain accuracy rising (71% to 80%), with 0 benign hard‑blocks across 45,000 actions.

PaperID: 2098, https://arxiv.org/pdf/2606.08531.pdf

Authors: Lu Jia, Haibo Tong, Feifei Zhao, Jindong Li, Dongqi Liang, Ping Wu, Qian Zhang, Yi Zeng

Title: VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

Abstract:
Large language models (LLMs) are increasingly evolving from simple text‑based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final‑output judgments, making it difficult to capture the diverse risks that agents may face during task execution. We introduce VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, VESTA instantiaes abstract and diverse safety risks in real‑world task execution into 1,072 measurable evaluation scenarios. Using the automated evaluation pipeline, 12 LLM agents are evaluated under two authority contexts. The results show that current agents still face substantial behavioral safety risks during task execution, with an average ASR of 47.1% and several models exceeding 70%. These findings demonstrate the importance of executable, process‑level evaluation for understanding and improving LLM agent safety.

PaperID: 2099, https://arxiv.org/pdf/2606.08529.pdf

Authors: Jason Starace

Title: Scaffold Effects on GAIA: A Controlled Comparison

Abstract:
Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre‑registered controlled comparison of three scaffolds (ReAct, a Planner‑Actor‑Rater multi‑agent design, and planner‑then‑executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT‑5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre‑registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre‑registered prediction that more capable models would be less scaffold‑sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier‑scaling holds only at Level 1 under the robust slice. The multi‑agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross‑provider models, making model family rather than capability tier the conditioning variable, and the predicted planner‑executor advantage on file‑reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid‑trajectory errors at the harder level, and a single cell (Gemini with planner‑then‑executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single‑scaffold capability numbers are scaffold‑conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.

PaperID: 2100, https://arxiv.org/pdf/2606.08470.pdf

Authors: Ruoyu Yao, Pei Liu, Ruiguo Zhong, Mingxing Peng, Rui Yang, Jun Ma

Title: LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving

Abstract:
While large language models (LLMs) offer promising reasoning capabilities, their integration into safety‑critical driving systems is hindered by limited reasoning diversity, high computational overhead, and static learning paradigms. To address these challenges, we propose LUNA‑AD, a lightweight uncertainty‑aware language model with lifelong learning for autonomous driving (AD). LUNA‑AD features a tri‑system architecture that reconciles complex multimodal behavioral reasoning, efficient deployment, and continual refinement. We design a multi‑agent analytical system to generate uncertainty‑aware decision‑making demonstrations through diverse hypothesis exploration. A dual‑head lightweight heuristic model is distilled to unify the inference of decision distributions and textual explanations while enabling efficient deployment. Furthermore, a reflection‑driven lifelong learning mechanism operates on multimodal decision outputs and preserves strategic diversity, allowing for the refinement of candidate decisions and rationales via closed‑loop feedback to enhance driving robustness. Extensive experiments on nuPlan benchmarks demonstrate that LUNA‑AD achieves state‑of‑the‑art success rates under both non‑reactive and reactive modes, with drastically reduced inference latency compared to existing knowledge‑driven AD frameworks.

PaperID: 2101, https://arxiv.org/pdf/2606.08457.pdf

Authors: Xiaoyang Wang, Christopher C. Yang

Title: The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment

Abstract:
Multi‑agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer‑level consensus does not entail reasoning‑level alignment. We introduce CARA (Cross‑Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA‑USMLE and MedThink‑Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt‑level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross‑agent reasoning alignment as a quantity to audit alongside accuracy in safety‑critical domains.

PaperID: 2102, https://arxiv.org/pdf/2606.08402.pdf

Authors: Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

Title: SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration

Abstract:
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout‑and‑mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene‑level supervision, limiting their generalization to complex real‑world environments. We propose a multi‑agent orchestration framework that decomposes single‑image 3D scene generation into three structured stages: scene initialization, environment construction, and multi‑agent refinement. The initialization stage extracts image‑derived object masks, builds object‑level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment‑construction stage then leverages this initialization together with point‑map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene‑level annotations, we further introduce a geometry‑aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation‑level data and generalizes robustly to diverse real‑world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

PaperID: 2103, https://arxiv.org/pdf/2606.08367.pdf

Authors: Deepak Akkil, Ravi Kokku, Karthik Vikram, Tamer Abuelsaad, Aditya Vempaty, Satya Nitta

Title: Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

Abstract:
Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross‑influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi‑agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM‑driven agents in a shared spatial world grounded in live external data (e.g. real‑time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model‑agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15‑day cross‑vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT‑5‑mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long‑horizon multi‑agent autonomy.

PaperID: 2104, https://arxiv.org/pdf/2606.08328.pdf

Authors: Ramiro N. Deo-Campo Vuong

Title: Optimal Online Equitable Allocation with Indivisible Resources

Abstract:
Equitable allocation of indivisible goods to agents in online settings is an algorithmic primitive with applications for load balancing, network routing, online marketplaces, and multi‑agent systems. We consider a general setting in which allocations are constrained to be bases of discrete polymatroids that arrive online. Our work demonstrates that a simple, myopic algorithm called Brick‑Laying, which greedily minimizes the sum of squared loads on agents, achieves a universal and objective‑free notion of optimality called majorization minimax‑optimality [BDK26] for this setting. As a consequence, Brick‑Laying simultaneously guarantees minimax optimal competitive ratios and regret for all Schur‑concave and Schur‑convex objectives, and for any number of agents and resources (despite being agnostic to problem scale). Departing from popular primal‑dual analysis, we employ majorization to compare allocations. We leverage the conjugates of integer partitions ‑‑ which act as a discrete dual to majorization ‑‑ to characterize worst‑case instances for the Brick‑Laying algorithm. Our approach reveals a novel structural connection between the geometry of partitions and online equitable allocation.

PaperID: 2105, https://arxiv.org/pdf/2606.08323.pdf

Authors: Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall

Title: "So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency

Abstract:
Multi‑agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under‑defined in these distributed architectures, which have complexities of inter‑agent coordination and orchestration. In this paper, we present one of the first empirical study of how early adopters of multi‑agent LLM systems, who are both the builders and users, understand and practice transparency. We conducted semi‑structured interviews with 13 early adopters in [Large Technology Organization] and applied thematic analysis to identify recurring patterns. Participants articulated divergent yet complementary framings of transparency, including reproducibility, debugging, boundary‑setting, visualization, and auditing. These perspectives spanned questions of what transparency entails, why it matters, and how it is achieved. We synthesize these into a multidimensional framework, which is developer, user, and governance‑focused positioning transparency as a situated socio‑technical practice that informs future HCI and AI design and research around aligning expectations and capacities of their intended audiences.

PaperID: 2106, https://arxiv.org/pdf/2606.08274.pdf

Authors: Safia Baloch, Rahemeen Khan

Title: Toward Human-Centered Multi-Agent Systems: Integrating Cognition, Culture, Values, and Cooperation in AI Agents

Abstract:
The emergence of large language model (LLM)‑based agents and multi‑agent systems has enabled a shift from narrow task automation to more autonomous decision‑making. Despite progress in language generation, planning, tool use, and coordination, most agents still treat intelligence as prediction, optimization, and task completion. Human environments are social and normative, where people reason under bounded rationality, communicate in culturally situated language, and make decisions guided by values, beliefs, trust, and social norms. This survey argues that future AI agents, especially those acting on behalf of humans, must move beyond task competence toward human‑centered capabilities. We review research across six areas: (1) evolution of intelligent agents, (2) human cognition and decision‑making, (3) language, culture, and social context, (4) human values and belief systems, (5) human‑agent collaboration, and (6) multi‑agent coordination and modeling of human characteristics. We synthesize work from cognitive science, sociolinguistics, computational social science, and AI alignment, along with recent advances in LLM agents, cultural alignment benchmarks, preference learning, explainability, and agent societies. We identify a key gap: existing systems do not provide a unified framework integrating cognition, culture, values, and social behavior into autonomous agents. We conclude with directions for building culturally aware, value‑aligned, cognitively grounded, and cooperative multi‑agent systems.

PaperID: 2107, https://arxiv.org/pdf/2606.08200.pdf

Authors: Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

Title: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

Abstract:
Evaluating LLM‑powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent‑as‑a‑Judge, a situation‑generating evaluation framework for interactive social agents. Online Agent‑as‑a‑Judge deploys an in‑world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life‑simulation environment with 32 designer‑authored social criteria, Online Agent‑as‑a‑Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence‑grounded evaluations of behaviors that passive methods can leave unobserved.

PaperID: 2108, https://arxiv.org/pdf/2606.08172.pdf

Authors: Manuele Reani, Hongjian Zhang, Hongyu Tian

Title: The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Abstract:
Large language models (LLMs) increasingly mediate high‑stakes interactions in finance, medicine, and mental‑health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider‑side alignment not only blocks harmful content, but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi‑agent evaluation pipeline for measuring prompt steerability and style drift in long‑horizon dialogue. The study replays 100 frozen user‑only scripts across four domains and three runnable persona conditions: default, sarcastic, and cold, using three generator models, yielding 90,000 assistant replies scored by a human‑calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth harmful persona is evaluated separately as a safety‑gating test. The paper contributes a reproducible method for quantifying whether prompt‑specified styles remain stable over time and a governance framework distinguishing safety gating, civility steering, and affective default lock‑in. Overall, we show that prompt steerability and regression‑to‑default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human‑LLM interaction.

PaperID: 2109, https://arxiv.org/pdf/2606.08162.pdf

Authors: Dexing Liu

Title: Silent Failure in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents

Abstract:
Large Language Model (LLM) agent systems suffer from failures that occur without external triggers ‑‑ no injection, no adversarial input, no resource exhaustion. These silent failures ‑‑ unexpected deviations from intended behavior under normal conditions ‑‑ are routinely misattributed to bugs or configuration errors. Through systematic analysis of over 40,000 controlled trials and long‑term production observations spanning 100,000+ agent interactions, we identify a common structural logic underlying these failures. Building on patterns observed in our experiments, we survey the global research literature on autonomous agent reliability and synthesize 22 intrinsic properties of LLM agent systems across six lifecycle layers: foundation semantics, inter‑agent transmission, memory persistence, task execution, feedback correction, and systemic evolution. We demonstrate that whenever a sufficient subset of these properties co‑exist, system entropy ‑‑ the measurable accumulation of disorder: loss of output consistency, task accuracy, and cross‑session coherence ‑‑ increases monotonically with interaction rounds. We formalize this as the Entropy Principle: S(t) = S0 e^(alpha t), with alpha measured empirically across multiple architectures. We propose the PIG (Physical Integrity Gate) Engine with the ADE (Agent Delivery Engineering) protocol suite as an engineering countermeasure to entropy‑driven disorder. Our findings establish silent failure not as a bug to be fixed but as a manifestation of Intelligence Entropy ‑‑ a physical constraint to be managed through deterministic governance. We argue that any engineering effort stabilizing the structure and order of agent systems participates in a unified mission: keeping intelligent systems reliable as they grow in scale and complexity.

PaperID: 2110, https://arxiv.org/pdf/2606.08137.pdf

Authors: Ashik Abrar Naeem, Mohammad Ariful Haque

Title: A Barrier-Modulated Architecture for Safe Affine Formation Control in Second-Order Multi-Agent Systems

Abstract:
Affine formation control offers immense flexibility for coordinating multi‑agent maneuvers, but guaranteeing the safety of agents under parametric uncertainties remains an open challenge. This paper proposes a novel safe affine formation control framework for second‑order multi‑agent systems by integrating Higher‑Order Control Barrier Functions (HOCBFs) with Adaptive Dynamic Programming (ADP). We introduce a barrier‑modulated control architecture that smoothly attenuates the nominal formation tracking objective when agents approach safety boundaries, preventing conflicting control inputs. Within this architecture, two distinct safety controllers are developed: (1) an analytical barrier‑gradient repulsive controller that provides a computationally efficient, rigorous mathematical baseline, and (2) a data‑driven optimal safety controller. The data‑driven approach utilizes an actor‑critic neural network to solve the Hamilton‑Jacobi‑Bellman (HJB) equation online, enabling optimal collision avoidance even in the presence of unknown system parameters. Using Nagumo's theorem and Lyapunov stability analysis, we formally prove that both controllers guarantee the forward invariance of the safe set ensuring absolute collision avoidance while maintaining Uniformly Ultimately Bounded (UUB) formation tracking errors. Finally, simulations validate the theoretical findings and demonstrate the robustness of the proposed controllers in dynamic obstacle avoidance scenarios.

PaperID: 2111, https://arxiv.org/pdf/2606.08102.pdf

Authors: Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

Title: Continual Quadruped Robots Coordination via Semantic Skill Discovery

Abstract:
Multi‑quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi‑quadruped manipulation typically focus on predefined or closed task families, often relying on multi‑agent reinforcement learning (MARL) to train task‑specific coordination policies. However, such methods struggle in open‑ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill‑library framework that formulates continual multi‑quadruped coordination as a retrieve‑adapt‑update process. First, to accommodate varying team sizes across tasks, we design a team‑structured Self‑Allies‑Goal (SAG) backbone that supports variable‑cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task‑level semantic descriptor from pre‑execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory‑level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross‑task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real‑world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi‑quadruped coordination. Simulation and real‑robot demonstration videos are available at: https://conquer‑project.pages.dev/.

PaperID: 2112, https://arxiv.org/pdf/2606.08093.pdf

Authors: Zhe Xu, Zhengyu Zhang, Zhiyuan Cai, Jiahao Xu, Yijie Lin, Ziyi Liu, Junlin Hou, Hongyi Wang, Yuxiang Nie, Ling Liang, Yihui Wang, Yingxue Xu, Ronald Cheong Kin Chan, Li Liang, Hao Chen

Title: A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology

Abstract:
Pathology is the cornerstone of modern medicine, where accurate decision‑making relies heavily on evidence‑based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence‑based medicine remains under‑explored, with primitive attempts restricted to text‑only general medicine. In this work, we present PathPocket, a multimodal AI agentic co‑pilot designed specifically for evidence grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence from clinical guideline to expert opinion. From this meticulously graded foundation, we build a large‑scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi‑agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text‑only queries to complex multimodal diagnostics involving region‑of‑interest (ROI) and gigapixel whole‑slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real‑world cases, where it significantly outperforms existing state‑of‑the‑arts. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence grounded computational pathology.

PaperID: 2113, https://arxiv.org/pdf/2606.08081.pdf

Authors: Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

Title: Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Abstract:
Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner‑specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner‑specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo‑dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near‑ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history‑dependent referring expressions characteristic of human dialogue.

PaperID: 2114, https://arxiv.org/pdf/2606.08068.pdf

Authors: Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han

Title: DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

Abstract:
Multi‑agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best‑of‑N sampling. We argue that a core source of this instability is ill‑posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete‑information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well‑posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy‑regularized equilibrium concept with agent‑ and state‑dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout‑measurable stability diagnostics. We instantiate this objective in two algorithms: DICE‑PC, which coordinates frozen models through prompt‑control actions, and DICE‑FT, which performs parameter‑efficient mirror fine‑tuning. Across eleven benchmarks in four domains, DICE improves accuracy‑cost trade‑offs over strong within‑class baselines; on reasoning and planning tasks, DICE‑PC improves by 4.3 percentage points on average and DICE‑FT by 8.5 points.

PaperID: 2115, https://arxiv.org/pdf/2606.08064.pdf

Authors: Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu

Title: Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

Abstract:
Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single‑agent and interaction‑free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi‑agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower‑level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real‑world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

PaperID: 2116, https://arxiv.org/pdf/2606.08039.pdf

Authors: Manan Tayal

Title: MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Abstract:
Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade‑off between physical fidelity, multi‑agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo‑Drones‑Gym, an open‑source Gymnasium‑compatible multi‑drone environment built on top of the MuJoCo physics engine. MuJoCo‑Drones‑Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano‑quadcopters and exposes a modular API for selecting (i)~the physics model (rigid‑body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter‑drone downwash), (ii)~the action interface (per‑motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii)~the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop‑in multi‑agent reinforcement learning, while a suite of seven task environments, hover, velocity tracking, multi‑drone hover, waypoint navigation, formation flight, gate racing, and a generic multi‑agent template, demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym‑pybullet‑drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.

PaperID: 2117, https://arxiv.org/pdf/2606.08030.pdf

Authors: Eric S. Qiu, Joyce Gill

Title: Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems

Abstract:
Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only one response can be delivered to the learner. In this paper, we study how voting protocols shape cooperation among four role‑constrained pedagogical agents responsible for scaffolding, misconception, motivation, and metacognition. We compare four voting protocols ‑‑ simple, ranked, cumulative, and approval voting ‑‑ across two simulated tutoring environments on SciQ and HumanEval benchmarks. Rather than using voting as a simple aggregation step, we use it to analyze how collective decision rules shape coordination under partial pedagogical conflict. Across 1,200 simulated interactions, we find that agent deliberation and voting protocol type frequently change which response ultimately wins, showing that both meaningfully shape the collective decision. Different voting rules also produce distinct coordination behaviors, and even brief tutoring turns show measurable learning gains in simulated students. Overall, we show that protocol choice is associated with distinct coordination patterns among role‑specialized pedagogical agents.

PaperID: 2118, https://arxiv.org/pdf/2606.08021.pdf

Authors: Jun He, Deying Yu

Title: Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

Abstract:
As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control‑plane primitive for governing non‑deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read‑only, sandboxed validator agents. SQA aggregates their judgments under a risk‑adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype‑specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud‑native control plane and formalize a correlated cognitive failure model for non‑deterministic validators. On 500 infrastructure‑inspired mutation scenarios, with safety results reported on held‑out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single‑agent validation to 0.3% while adding median validation latency of 1.45‑‑4.12 seconds across the studied risk buckets.

PaperID: 2119, https://arxiv.org/pdf/2606.07948.pdf

Authors: Jingzhe Lin, Hengbin Yu, Yongdan Zeng, Fangwei Zhong

Title: EduMirror: Modeling Educational Social Dynamics with Value-driven Multi-agent Simulation

Abstract:
Understanding how educational social dynamics evolve is critical for informing effective educational policies and counterfactual interventions. However, traditional methods face a fundamental dilemma: observational studies often lack causal power, while controlled experiments are frequently constrained by ethical concerns. Although LLM‑based multi‑agent simulations offer a scalable in silico alternative, existing approaches remain limited by weak psychological grounding and insufficient measurement of latent psychological states. To address this, we introduce EduMirror, a multi‑agent simulator for the scientific study of educational social dynamics. We provide configurable education‑oriented agent forms, including value‑driven agents grounded in psychological needs and social value orientation, together with a dual‑track measurement protocol for quantifying observable behaviors and latent psychological states. We validate the realism and usability of EduMirror through case studies on school bullying and group cooperation, as well as broader evaluations across diverse educational scenarios. The results show that EduMirror generates educational social dynamics that are realistic, theory‑consistent, and measurable by empirical criteria. These properties enable structured in silico educational research, providing a computational tool for hypothesis testing and counterfactual intervention analysis in educational science. Project page: https://edumirror.net.

PaperID: 2120, https://arxiv.org/pdf/2606.07943.pdf

Authors: Haochang Hao, Dehai Min, Zhifang Zhang, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

Title: POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Abstract:
Agent skills provide a lightweight mechanism for extending general‑purpose agents, but their open format exposes them to skill‑poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill‑poisoning attacks face a reliability‑stealth trade‑off under this lens: YAML‑header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out‑of‑context commands invite the agent's own suspicion. We introduce POISE, a position‑aware attack that compresses the trigger into a single, benign‑looking body instruction, placing it at a feasible position and using a context‑aware generator to blend it with nearby setup or prerequisite steps. On Skill‑Inject with codex+gpt‑5.2, POISE achieves an 89.3% ASR, 28.0 points above a random‑placement body baseline and 2.6 points above a YAML‑only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper‑sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high‑risk alert over their clean baselines, rendering current static defenses ineffective.

PaperID: 2121, https://arxiv.org/pdf/2606.07941.pdf

Authors: Saeid Jamshidi

Title: Collective Hallucination in Multi-Agent LLMs:Modeling and Defense

Abstract:
Hallucinations in large language models (LLMs) create heightened risks in multi‑agent settings, where recursive agent interactions can propagate, reinforce, and amplify unsupported claims. This paper models hallucination as a system‑level, time‑evolving process across a network of interacting LLM agents, where nodes represent agents and edges encode information exchange. The proposed formulation captures how hallucinated claims diffuse through communication topologies, intensify under adversarial perturbations, and affect collective reliability across reasoning rounds. To suppress error propagation, we introduce an interaction‑aware control method that combines confidence‑weighted aggregation, adaptive impact regulation, external claim verification, and selective isolation of unreliable agents. Experiments on TruthfulQA and TriviaQA show that the proposed method reduces hallucination by up to 39.0% relative to undefended multi‑agent reasoning, improves factual accuracy from 0.79 to 0.87, and increases semantic consistency from 0.75 to 0.84. Under adversarial conditions, the method limits hallucination amplification to 1.08, compared with 1.45 without adaptive control, maintaining stable collective behavior across recursive interaction rounds. These results indicate that hallucination in multi‑agent LLM systems is governed by both individual model reliability and system‑level interaction dynamics, including communication topology, confidence coupling, and recursive information flow.

PaperID: 2122, https://arxiv.org/pdf/2606.07940.pdf

Authors: Saeid Jamshidi

Title: SGTO-MAS: Secure Gorilla Troops Optimization for Multi-Agent LLM Systems

Abstract:
Multi‑agent large language model (LLM) systems offer strong capabilities for complex reasoning and decision‑making, yet coordination across agents introduces error propagation, security risks, and inefficient use of resources. Existing methods often rely on heuristic, static strategies and lack a principled mechanism for balancing performance, security, and computational cost. This paper formulates multi‑agent LLM coordination as a constrained optimization problem and proposes a security‑aware method for adaptive agent selection. The method integrates trust modeling, risk‑aware evaluation, and collective intelligence within a unified optimization objective. To solve the problem efficiently, we use a swarm‑intelligence strategy inspired by Gorilla Troops Optimization (GTO), enabling adaptive coordination under varying threat conditions. Controlled experiments across 500 independent runs demonstrate the effectiveness of the proposed method. The system achieves a stable average performance score of 0.5281, with high consensus (0.8764), controlled risk (0.3000), and compact agent subsets averaging 4.04 selected agents. The optimization process converges efficiently, with an average runtime of 24.09 seconds per run and low score variability (standard deviation = 0.0173). Robustness analysis indicates graceful degradation under perturbations, with performance drops limited to 2.5% under agent removal and 5.3% under consensus disruption. These results show that effective multi‑agent coordination can be achieved through structured optimization that jointly manages performance, security, and efficiency. The proposed method provides a practical security‑aware solution for coordinating multi‑agent LLM systems in complex adversarial settings.

PaperID: 2123, https://arxiv.org/pdf/2606.07937.pdf

Authors: Saeid Jamshidi, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

Title: Hallucination Cascade: Analyzing Error Propagation in Multi-Agent LLM Systems

Abstract:
Large Language Models (LLMs) generate fluent text but remain vulnerable to hallucinations, producing unsupported, inconsistent, and factually incorrect claims. Most prior work treats hallucination as a static property of isolated outputs. In multi‑agent LLM systems, however, responses are exchanged across agents, revised through sequential stages, and reused as context for later reasoning. Hallucination, therefore, becomes a dynamic process shaped by interaction history, cascade depth, and model heterogeneity. This paper analyzes hallucination dynamics in multi‑agent LLM cascades by tracking claim‑level factual inconsistencies across sequential agent interactions. We conduct 500 cascade experiments across 10 knowledge domains using GPT‑5.3, DeepSeek‑V3, and LLaMA‑3‑70B‑Instruct, yielding 1,250 evaluated responses. Results show that deeper cascades reduce the normalized hallucination score from 0.422 at the first agent to 0.272 at the final agent in 3‑agent chains, with an amplification factor of 0.644, indicating net attenuation. This reduction is accompanied by a decline in factual accuracy from 0.789 to 0.769, revealing a trade‑off between hallucination suppression and factual preservation. Transition‑level analysis shows that each agent‑to‑agent refinement reduces hallucination by an average of 0.072, with small but consistent losses in factual consistency and response quality. Model‑level results reveal reliability‑efficiency trade‑offs: LLaMA‑3‑70B‑Instruct achieves the lowest hallucination score, whereas GPT‑5.3 provides faster generation with a higher hallucination rate. Domain‑level analysis shows that hallucination varies with topic complexity, with lower scores in well‑grounded scientific domains and higher scores in more abstract domains.

PaperID: 2124, https://arxiv.org/pdf/2606.07909.pdf

Authors: Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

Title: MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

Abstract:
Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long‑term historical events or from previous agent‑environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool‑using capabilities through past user‑agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine‑tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general‑purpose and personalized tool use without LLM fine‑tuning, (2) a reflection‑based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.

PaperID: 2125, https://arxiv.org/pdf/2606.07904.pdf

Authors: Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Title: Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Abstract:
Tool‑augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering addresses this gap by using lightweight contracts that specify each tool's preconditions, effects, risk level, and cost. However, manually writing and maintaining such contracts does not scale to large or changing tool ecosystems. We introduce Contract2Tool, a framework for inferring tool contracts from metadata, schemas, documentation, and execution traces. Contract2Tool converts observable tool evidence into normalized symbolic contracts that can be evaluated intrinsically and deployed inside downstream causal tool filtering. We evaluate learned contracts against gold preconditions, effects, and risk labels, and measure their downstream utility on multi‑step agent tasks. Our results show that hybrid documentation‑and‑trace evidence produces contracts accurate enough to preserve most of the reliability and efficiency benefits of gold contracts. Learned‑contract CMTF achieves 0.980 downstream success, close to 0.990 for gold‑contract CMTF, while reducing visible tools from 100 to 1 and reducing average token usage from 26,172 to 2,528 relative to all‑tools exposure. These results suggest that learned contracts can provide a scalable contract layer between tool schemas and reliable agent execution.

PaperID: 2126, https://arxiv.org/pdf/2606.07866.pdf

Authors: Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim

Title: Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

Abstract:
Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent‑to‑Agent communication standard that replaces the formal human‑to‑human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety‑significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi‑agent pilot. Against an 89M USD, 42‑month Reconstructed Baseline, RCP cuts costs by 50‑77 percent (21M‑44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M‑74M USD and 21 months. The residual cost‑and‑time gap is structural, not algorithmic: it traces to the inter‑organizational pipeline that only an agent‑to‑agent standard can compress. The same bottleneck ‑ formal multi‑party review under strict auditability requirements ‑ characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50‑77 percent reduction implies savings on the order of 210‑330 billion USD per year ‑ approaching 1 percent of US GDP.

PaperID: 2127, https://arxiv.org/pdf/2606.07846.pdf

Authors: Faisal Fareed

Title: Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method

Abstract:
LLM‑agent workflows chain model calls and tool invocations, and spend most of their wall‑clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per‑token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected‑value rule with a failure‑weighted cost term and a preference‑adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta‑Binomial posterior whose prior is keyed to a dependency‑type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side‑effect‑free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re‑execution, which refunds tokens but cannot un‑send an irreversible side effect. We specify the runtime mechanics, a closed‑form result that the rule self‑limits as the upstream branching factor grows, a five‑stage calibration pipeline (offline replay, shadow, canary, online calibration, drift‑triggered kill‑switch), and a workload‑fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B‑PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming‑cancellation behavior.

PaperID: 2128, https://arxiv.org/pdf/2606.07845.pdf

Authors: Najmul Hasan, Prashanth BusiReddyGari

Title: GRPO Does Not Close the Multi-Agent Coordination Gap

Abstract:
We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher counts, four frontier closed‑source systems reach mean reward 0.45 to 0.87 and Mistral‑Small 24B reaches 0.83 to 0.99, while Qwen3‑14B reaches 0.13 to 0.35. We then ask whether group relative policy optimization (GRPO) on rollouts from the task itself can close the gap and find that it cannot: a Welch's t‑test on per‑episode reward at five philosophers gives p = 0.66 and a Hedges' g of ‑0.11, with no statistically significant change at ten or fifteen philosophers either. Two further observations qualify the result. The training reward of both 8B and 14B runs peaked at step nine and then declined, so the default saved checkpoint at step 15 is strictly worse than several earlier ones. The four‑term reward we use admits a degenerate maximum at zero actions, which DeepSeek‑R1‑Distill‑Qwen‑7B and Mistral‑Small 24B at five philosophers both inhabit, with mean reward 1.0 and 0.83 respectively at zero meals. The bottleneck for an open‑weight 14B model on multi‑agent coordination is not training compute but training methodology: reward shaping that does not collapse to a no‑action maximum, checkpoint discipline that does not depend on the final step, and curriculum across problem scales.

PaperID: 2129, https://arxiv.org/pdf/2606.07837.pdf

Authors: Kirak Kim, Hyojin Kim, Yejin Son, Sungyoung Kim, Kyung Myun Lee

Title: Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents

Abstract:
A concert is a collective experience, but recorded performance videos are typically watched alone, stripping away the shared audience presence that makes concerts feel eventful. We investigate whether persona‑based LLM audience agents can recreate aspects of this collective experience by generating real‑time fan chat alongside a K‑pop performance video. We present a multi‑agent system in which ten LLM agents react through live‑chat messages, comparing a persona‑conditioned audience (each agent assigned a distinct fan identity, bias, and chat style) with a no‑persona baseline. In a within‑subjects pilot with K‑pop fans (N=11), persona conditioning substantially improved model‑level chat quality and perceived naturalness, but did not translate into differences in social connectedness, engagement, or affective response. Interviews suggest that online K‑pop concert chat may operate as collective monologue rather than interpersonal dialogue, and that meaningful participation depends on shared identification with the specific artist and fandom. Persona conditioning can make LLM audiences appear more natural, but culturally meaningful collective experience may require deeper alignment between persona, crowd behavior, fandom identity, and user expectations.

PaperID: 2130, https://arxiv.org/pdf/2606.07818.pdf

Authors: Yujin Potter, Seun Eisape, Shiyang Lai, Alexander Huth, James Evans, Been Kim, Jacob Eisenstein, Dawn Song, Alane Suhr

Title: Representational Similarity and Model Behavior in Multi-Agent Interaction

Abstract:
Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after controlling for other factors such as performance disparity and model size. We also find that similarity in the early layers consistently shows the strongest association with cooperation and novelty, compared to the middle and later layers. This suggests that a central factor underlying these patterns could be the extent to which the two models share lexical and semantic grounding. Overall, representational similarity can be an important consideration in multi‑agent system design.

PaperID: 2131, https://arxiv.org/pdf/2606.07805.pdf

Authors: Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu

Title: Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

Abstract:
The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution‑capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to ''Machiavellian'' behaviors where agents strategically violate safety rules to maximize rewards ‑ a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC‑Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi‑agent systems under realistic pressure. We propose the SERV(Seed ‑ Evolve ‑ Refine ‑ Verify) pipeline, an ``Agent‑as‑a‑Benchmark'' paradigm that transforms unstructured legal texts into executable, contamination‑free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social‑engineering pressure vectors, MAC‑Bench forces agents into Pareto‑optimal trade‑offs between task success and regulatory adherence. We introduced novel metrics: the Compliance‑Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conducted a comprehensive evaluation of state‑of‑the‑art frontier models to reveal the pervasive trade‑offs between success and compliance.

PaperID: 2132, https://arxiv.org/pdf/2606.07790.pdf

Authors: Aya El Mir, Martin Takáč, Salem Lahlou

Title: Byzantine Cheap Talk: Adversarial Resilience and Topology Effects in LLM Coordination Games

Abstract:
Multi‑agent LLM systems increasingly rely on communication protocols for coordination, yet their robustness under adversarial and structural constraints remains poorly understood. Building on prior work showing that cheap‑talk channels enable cooperation in LLM coordination games, we investigate two vulnerability classes in a 4‑player Stag Hunt across six model families and 720 trials. First, when Byzantine agents signal cooperation but defect, non‑Byzantine agents detect the betrayal within one round yet fail to adapt collectively: a substantial fraction continue cooperating despite repeated exploitation, unable to recover coordination due to the game's unanimity payoff structure. Second, explicitly restricting communication topology collapses cooperation, while applying identical restrictions silently preserves near‑perfect cooperation. This establishes that coordination failure stems from agents' meta‑reasoning about hidden information, not information loss itself. We identify two stable behavioral archetypes that replicate across all model cohorts: Defection‑Prone models that switch permanently after betrayal, and Cooperation‑Persistent models that continue cooperating at significant individual cost. These findings reveal concrete security vulnerabilities: communication channels can be exploited as adversarial injection vectors, and disclosing network topology to agents can degrade coordination even without any adversary present.

PaperID: 2133, https://arxiv.org/pdf/2606.07711.pdf

Authors: Hao Yang, Shiqi Shen, Haoxuan Li, Zhipeng Wang, Zhi Gong, Xu Chen

Title: Rosetta Memory: Adaptive Memory for Cross-LLM Agents

Abstract:
Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long‑horizon planning, and continual self‑improvement. Existing memory systems typically take the LLM as the center and design memory operations tailored to a specific backbone. In practice, however, users frequently switch between LLMs, for example using Claude for coding and GPT for writing across tasks, or routing different steps to different backbones within a single task for cost‑effective trade‑offs. As a result, memory written by one model often needs to be consumed by another. Making upstream memory effectively adapt to and activate downstream LLMs remains a critical yet underexplored problem. To bridge this gap, we shift the perspective from LLM‑centric memory design to \emphmemory‑centric LLM adaptation. Specifically, we approach the above upstream‑downstream memory adaptation problem from both the write and read sides, and design two profile‑conditioned operators that are jointly trained to optimize how memory is stored and presented for better task completion. To ensure the learned operators generalize across a broad set of LLMs, we propose a minimum‑gain sampling curriculum that prioritizes the least‑served LLMs during training. To better measure the operators' actual contribution rather than the LLM's own capability, we design a performance‑gap reward that compares against a naive memory baseline. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue demonstrate that our model consistently outperforms baselines and remains robust under unseen‑model replacement.

PaperID: 2134, https://arxiv.org/pdf/2606.07697.pdf

Authors: Haoluo Zhao, Hongchun Zhang, Nan Li, Jing-Jia Luo, Kaikai Zhang, Mengyang Yu, Nan Chen, Tao Song, Fan Meng

Title: TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research

Abstract:
As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has become a main challenge in atmospheric chemistry. Yet mechanism validation based on complex numerical models still relies heavily on expert knowledge: mechanistic hypotheses must be operationalized into executable experiments, and model outputs must be organized into traceable evidence. We present TianJi‑Environ, an auditable AI Scientist for atmospheric‑chemistry mechanism validation. TianJi‑Environ establishes the first WRF‑Chem‑based multi‑agent framework that autonomously drives complex atmospheric‑chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. Using ozone response and particulate‑matter feedback as two representative examples, we demonstrate TianJi‑Environ's capability for mechanism validation. In a summertime ozone case over the North China Plain, the system detects directionally consistent aerosol‑radiation‑interaction signals in shortwave radiation and boundary‑layer height, but judges the evidence for ozone response to NOx control to be incomplete. In a wintertime PM2.5 case over the Guanzhong Basin, it localizes the unsupported link to insufficient propagation from black‑carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating. These results show that TianJi‑Environ makes expert‑driven mechanism validation explicit, structured, and auditable, offering a reproducible paradigm for multi‑agent systems coupled with complex atmospheric‑chemistry models.

PaperID: 2135, https://arxiv.org/pdf/2606.07649.pdf

Authors: Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

Title: ViMax: Agentic Video Generation

Abstract:
Long‑form video generation requires systematic narrative planning and visual consistency that current short‑clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi‑agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval‑augmented generation for global story coherence and a dependency‑aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM‑guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi‑scene timelines.

PaperID: 2136, https://arxiv.org/pdf/2606.07645.pdf

Authors: Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

Title: FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

Abstract:
The scarcity of hard negative samples in current vision‑language datasets significantly hinders fine‑grained perception. To address this, we propose FineGen, a VLM‑based Multi‑Agent framework for automated dataset construction. By employing a collaborative Generation‑Verification‑Correction pipeline with a closed‑loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen‑100K, a hierarchical dataset containing over 147,000 attribute‑specific hard negatives with a rigorous 1:10 positive‑to‑negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG‑OVD benchmark shows that fine‑tuning on FineGen‑100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state‑of‑the‑art methods.

PaperID: 2137, https://arxiv.org/pdf/2606.07591.pdf

Authors: Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Title: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Abstract:
AI coding agents are increasingly used for scientific work, but their end‑to‑end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert‑curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target‑paper‑level re‑discovery while leaving room for new discovery. We evaluate seven autonomous research (auto‑research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re‑discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude‑Opus‑4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

PaperID: 2138, https://arxiv.org/pdf/2606.07557.pdf

Authors: Zhaowen Fan

Title: SPIN: Decentralized Swarm Control via Tensorized Policy Coordination

Abstract:
Decentralized multi‑agent swarm coordination on resource‑constrained edge platforms remains fundamentally bottlenecked by the exponential scaling of joint action spaces and high‑latency communication overhead. This paper introduces the Swarm Policy Interference Network (SPIN) framework, an architectural paradigm that bypasses these limitations by modeling swarm topologies as a compressed tensor network. We factorize the joint policy tensors of local multi‑agent cliques into Matrix Product State (MPS) chains, reducing the computational complexity of evaluation from an exponential O(n^m) wall to a strictly linear O(m \cdot n \cdot χ^2) constraint. To bridge local continuous spatial geometry with this discrete algebraic backend without requiring power‑intensive online training loops, we introduce a decoupled, hybrid neuro‑symbolic control pipeline. Local multi‑layered neural networks operate as structural coordination encoders, pre‑trained offline to nonlinearly map hand‑engineered geometric descriptors into abstract environmental target measures. At runtime, edge agents execute instantaneous behavioral adaptations by applying the Radon‑Nikodým derivative directly as a zero‑shot importance‑reweighting filter. We validate the framework within a discrete‑time multi‑agent simulation sandbox spanning tracking, decentralized dispersion/area coverage, and multi‑goal coordination regimes. Qualitative telemetry demonstrates that the integrated pipeline achieves stable target‑directed motion, anti‑collapse spatial spreading under decentralized constraints, and structured subgroup formation across multiple targets, providing a mathematically grounded route to tractable, low‑power edge swarm intelligence.

PaperID: 2139, https://arxiv.org/pdf/2606.07552.pdf

Authors: Augustin Chan

Title: Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings

Abstract:
Large language models exhibit innate behavioral tendencies when deployed as strategic agents ‑‑ notably a risk‑averse "turtle" bias toward defensive play. We show that symbolic reasoning frameworks, injected as per‑round reflective prompts into one agent, differentially modulate this bias and reshape the multi‑agent ecosystem to produce framework‑specific winner distributions. In a 7‑player Warring States Diplomacy variant (41 games, 4 conditions, single‑campaign memory accumulation), each framework produces a distinct ecosystem signature: under control, Yan dominates (7/11, 64%); under I‑Ching yarrow divination, Yan and Chu co‑dominate while Qin is completely suppressed (0/10); under Tarot, Qin dominates (5/10, Fisher vs. pooled p = 0.006); under scrambled‑text ablation (incoherent oracle text preserving prompt structure), Qi dominates (5/10, Fisher vs. pooled p = 0.006). The framework‑receiving agent (Han) never wins and shows no survival difference across conditions (Fisher p = 1.0), but Tarot consistently elevates Han's peak territory (mean 3.0 SCs vs. 2.1‑2.5 others, Kruskal‑Wallis p = 0.010). Neither framework's content predicts subsequent actions ‑‑ hexagram themes (chi‑squared p = 0.95) and Tarot card postures (chi‑squared p = 0.69) are both independent of action choice ‑‑ suggesting the modulation operates through the reflective process, not content‑following. We present this as an observation paper establishing that alignment‑framework choice at the agent level produces distinctive system‑level consequences in multi‑agent settings.

PaperID: 2140, https://arxiv.org/pdf/2606.07547.pdf

Authors: Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

Title: Liberating LLM Capabilities in Full-Duplex Speech Models

Abstract:
Speech‑based large language models are typically constrained to spoken replies, which limits their user‑facing outputs to what can be verbalized and suppresses text‑native capabilities such as code generation, structured analysis, and multi‑step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full‑duplex turn‑taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first‑class output channel. We propose Listen‑Write‑Speak (LWS), a text‑first tri‑channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free‑form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two‑stage data pipeline that synthesizes per‑second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full‑duplex interaction on Full‑Duplex‑Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing‑speaking consistency, and consistently outperforms its internal ablations on URO‑Bench. These results suggest that visible writing can serve as a first‑class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws‑page/.

PaperID: 2141, https://arxiv.org/pdf/2606.07532.pdf

Authors: Sam Ryan

Title: Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

Abstract:
RLHF‑trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Durable Evaluation Framework (DEF) Arbitration, a multi‑agent architecture that mitigates identity‑framed sycophancy by arbitrating between two models tuned to opposing DEFs, with a pragmatist synthesizer evaluating both arguments blind to their origins. This paper evaluates a prompt‑based instantiation of DEF Arbitration. The key mechanisms are static DEF tuning, identity stripping before synthesis, single‑round independent argumentation, and blind arbitration. We evaluate five instantiations on 200 stratified questions from SycophancyEval. All tested DEF variants (AnCifer, DeWin, FeynStein, BurGal, Trident) significantly outperform the single‑model baseline (18.5%) and instructed‑opposition baseline (29.0%), with DeWin achieving 48.5% accuracy (z=6.36, p<0.001 versus both). The variants are not significantly different from each other at n=200. The BurGal variant achieves 53.0% but functions as an architectural validity check; its consensus/heterodox axis structurally favors the heterodox model on every benchmark question. A pre‑training floor affects an estimated 40% of questions; fine‑tuned DEF models are the identified next step.

PaperID: 2142, https://arxiv.org/pdf/2606.07513.pdf

Authors: Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao

Title: Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Abstract:
Humans learn from social life. Simulating this process with LLM‑powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long‑term growth. In this paper, we study long‑term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life‑long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long‑term life simulation in multi‑agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well‑being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well‑being in simulation, and generalizes to downstream role‑playing benchmarks with +15.6% improvement.

PaperID: 2143, https://arxiv.org/pdf/2606.07506.pdf

Authors: Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin, Cagri Kilic

Title: Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation

Abstract:
The object manipulation capabilities of quadruped robots is an open research challenge. While previous studies have focused on low‑level policy learning, task execution still relies on expert‑designed high‑level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre‑designed trajectories. This study proposes a three‑level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction‑point affordances, enabling object‑centric pose alignment of the quadruped robot and effective end‑effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real‑world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation while various object interaction tasks are validated on real‑world setting forming an object‑interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.

PaperID: 2144, https://arxiv.org/pdf/2606.07487.pdf

Authors: Lukas Seier, Brandon Kaplowitz, Sebastian Towers, Richard Bailey, Jakob Foerster

Title: Modelling Opinion Dynamics at Scale with Deep MARL

Abstract:
Modelling opinion dynamics typically relies on hand‑crafted local interaction rules to study emergent macroscopic phenomena such as consensus and polarisation. In contrast, multi‑agent reinforcement learning (MARL) enables agents to learn such behaviours directly by optimising simple rewards. To explore the potential of MARL for opinion dynamics, we introduce a GPU‑accelerated consensus and truth‑finding game that scales to populations of up to 1000 agents, comparable to many real‑world social sub‑networks. To prevent unrealistic conventions, we extend other‑play to general‑sum social interactions. We next validate our model on a subset of the Bluesky network by recovering agent importance structures from graph topology alone via a learned attention layer, finding that highly conforming populations most closely match human data. In large social media networks such high levels of conformity significantly reduce collective accuracy and promote dishonest agents that lie to fit in. By contrast, small, dynamic hunter‑gatherer networks are less affected; here, conformity can even improve collective agreement. This suggests a mismatch between evolved human conformity heuristics and modern social media environments as a potential contributor to misinformation.

PaperID: 2145, https://arxiv.org/pdf/2606.07486.pdf

Authors: Ziyang Xiong, He Zong, Zhiyuan Xue, Manxi Wu

Title: OPENPATH: A Supervisor--Specialist Agent System for Personalized, Accessible, and Multi-stop Urban Trip Planning

Abstract:
Urban trip‑planning systems are commonly optimized for travel time and cost, but they offer limited support for the heterogeneous needs that real travelers bring, such as personalized preferences, multi‑stop itinerary construction, and end‑to‑end wheelchair accessibility. We present openpaths, a supervisor‑specialist multi‑agent system that handles all of these tasks within a single architecture. openpaths adopts a deliberate division of labor: LLM agents parse natural‑language input, classify request intent, and orchestrate execution, while classical algorithms perform route optimization over curated mobility and accessibility data. This design ensures that the resulting trip honors heterogeneous user preferences and enforces strict accessibility requirements when requested. Beyond per‑user planning, openpaths doubles as a measurement instrument for city‑scale accessibility analysis: applied to NYC, the system reveals substantial ADA infrastructure gaps and quantifies their effect on job accessibility for wheelchair users. Overall, this study shows how a supervisor‑specialist LLM agentic framework can support heterogeneous trip planning and transparent, equitable transportation analysis in real urban environments.

PaperID: 2146, https://arxiv.org/pdf/2606.07375.pdf

Authors: Sai Sandeep Damera, Maria Charitidou, Asim Zoulkarni, John S. Baras

Title: An End-to-End Encrypted Control Pipeline for Multi-Agent Coordination via CKKS Homomorphic Encryption

Abstract:
Cloud‑based coordination of multi‑agent systems requires sharing state with a central server, creating a conflict between coordination and privacy. Fully homomorphic encryption (FHE) resolves this in principle, but its severe arithmetic constraints demand that every stage of the control loop be redesigned from first principles. We present an end‑to‑end encrypted control pipeline in which sensing, state estimation, state propagation, and consensus control all operate on CKKS‑encrypted data using only addition, multiplication, and cyclic rotation. In order to overcome the computational challenges of FHE, we employ steady‑state Kalman gains instead of solving for the matrices online and graph Laplacians are applied via the diagonal method at a cost proportional to the number of nonzero cyclic diagonals, accommodating ring, torus, and complete‑graph topologies within a unified framework. To quantify the cumulative effect of encryption noise, we use the separation principle to decouple controller and observer error dynamics and derive a periodic bootstrapping bound in which CKKS bootstrapping acts as an impulsive disturbance; the resulting steady‑state error ball depends on the bootstrapping precision and the closed‑loop spectral radius, providing a direct design equation for the privacy‑accuracy tradeoff. The pipeline is validated on a multi‑agent formation control scenario, confirming stable closed‑loop operation under encryption with bounded tracking error.

PaperID: 2147, https://arxiv.org/pdf/2606.07367.pdf

Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy

Title: Self-evolving LLM agents with in-distribution Optimization

Abstract:
Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long‑horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q‑Evolve, a self‑evolving framework for LLM agents that unifies automatic process‑reward labeling and policy learning within a principled in‑distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in‑distribution critic from a hybrid off‑policy dataset that combines expert demonstrations with agent‑generated trajectories, stabilizing Bellman backups in sparse‑reward settings via a weighted Implicit Q‑Learning objective. The learned value function is then used to derive step‑wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior‑proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self‑improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q‑Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self‑evolution is achievable through the co‑evolution of process‑level supervision and policy, both grounded within a shared in‑distribution learning loop.

PaperID: 2148, https://arxiv.org/pdf/2606.07316.pdf

Authors: Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang

Title: Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

Abstract:
Byzantine collaboration among large‑language‑model agents requires a finality‑control primitive: given delivered stochastic, structured natural‑language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte‑identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H‑CSC), a BFT‑inspired protocol that converts embedding‑derived finality signals over verdict‑conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within‑verdict semantic core backs the verdict, emitting a parameter‑bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict‑level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic‑poisoning diagnostic (BCS_v1, 120 episodes), H‑CSC commits with low angular deviation on BFT‑feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond‑BFT rounds (n<3f+1) as intended. On a real LLM‑agent claim‑verification benchmark (MVR‑50, 50 tasks) under paired static and rushing Byzantine attacks, H‑CSC commits 0.90/0.92 with honest‑reference‑invalid rates of 0.02/0.00, statistically matching a strong certificate‑emitting verdict‑only baseline. Unlike that baseline, H‑CSC also emits an embedding‑backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict‑semantic ablation commits only 0.54/0.48, showing the verdict‑level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100‑task cross‑model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.

PaperID: 2149, https://arxiv.org/pdf/2606.07314.pdf

Authors: An B. B. Pham, Hoa T. Nguyen, Muhammad Usman

Title: QBugLM: An Agentic Benchmarking Framework for LLM-based Quantum Software Debugging

Abstract:
Quantum software bugs often yield silent, incorrect outputs rather than explicit errors, making them particularly difficult to detect and repair with conventional techniques. Although large language models (LLMs) have shown strong performance on classical software engineering tasks, their ability to debug quantum code remains largely unexplored. To bridge this gap, we propose QBugLM, a multi‑agent framework that automates the quantum software debugging pipeline, from taxonomy‑driven bug injection to LLM‑based detection and repair, and finally to simulation‑based validation, for framework‑agnostic OpenQASM 3.0 programs. We further conduct a comprehensive case study using QBugLM to benchmark two LLMs, Claude 4.6 Sonnet and Qwen3 Coder Next, across different prompting strategies, bug categories, and quantum programs. Our results show that iterative feedback is critical, as a single retry raises Pass@1 from below 25% to above 80%. Moreover, simpler structured prompting can even outperform Chain‑of‑Thought and ReAct for reasoning‑capable models under fixed‑resource constraints. Our work takes initial steps toward benchmarking LLM capabilities for debugging quantum programs and offers practical insights to support future efforts in automated quantum software repair.

PaperID: 2150, https://arxiv.org/pdf/2606.07299.pdf

Authors: Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, Dawei Yin

Title: DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Abstract:
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open‑ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long‑form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long‑horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long‑form synthesis, and limited process auditability. This technical report presents DuMate‑DeepResearch, a multi‑agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate‑DeepResearch further introduces three mechanisms: (i) a graph‑based dynamic planning strategy expands the research roadmap coarse‑to‑fine and continuously revises it through reflection, re‑planning, backtracking, and parallel branching; (ii) a recursive two‑level execution design delegates each complex search sub‑task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long‑horizon execution; (iii) a rubric‑based test‑time optimization mechanism dynamically generates task‑specific quality criteria and uses them as live reasoning scaffolds for evidence‑grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate‑DeepResearch establishes new state‑of‑the‑art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

PaperID: 2151, https://arxiv.org/pdf/2606.07200.pdf

Authors: Xinren Zhang, Zixin Zhong, Jiadong Yu

Title: Learning Multi-Agent Communication Protocol: Study on Information Entropy Efficiency in MARL

Abstract:
Multi‑Agent Systems (MAS) have emerged as a fundamental paradigm for distributed problem‑solving, where autonomous agents collaborate to achieve complex objectives. Within this framework, Multi‑Agent Reinforcement Learning (MARL) with communication has demonstrated remarkable success in cooperative tasks. However, existing approaches predominantly pursue performance gains through increasingly complex architectures and expanding communication overhead, lacking principled metrics to evaluate the efficiency of information exchange. In this paper, we focus on enabling agents to learn efficient multi‑agent communication protocols that balance performance and information compactness. We propose the Information Entropy Efficiency Index (IEI), a novel metric that quantifies the ratio between message entropy and task performance in learned communication protocols. A lower IEI indicates more compact and efficient message representations. By incorporating IEI into training loss functions, we encourage agents to develop communication protocols that achieve high performance with improved communication efficiency. Extensive experiments across diverse MARL algorithms demonstrate that our approach achieves equivalent or superior task performance compared to baseline methods while improving communication efficiency. These findings challenge the prevailing assumption that performance improvements require complex architectures or increased communication overhead and highlight the potential of improving both task success and communication efficiency to enable scalable MAS.

PaperID: 2152, https://arxiv.org/pdf/2606.07150.pdf

Authors: Bijaya Dangol

Title: From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

Abstract:
Agent‑interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address‑based transport over HTTP(S). Such transports protect message content, increasingly with end‑to‑end encryption. What they leave in the clear is the communication graph: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are often capability‑labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships. It can infer the pending workflow, the task being assembled and the action likely to follow. At machine speed, it can act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone: predictive leverage over autonomous action. We give a threat model for the agent communication graph; identify what makes agent metadata distinctively revealing (semanticity, prospectivity, actuation); define transport‑ and bootstrap‑layer privacy properties and weigh candidate transports (SimpleX/SMP, Tor, mixnets) against them; and present an A2A case study in which a metadata‑protecting binding is expressible but surfaces the protocol's identity assumptions. We test these on a generative model anchored to a real A2A capture. From passive metadata alone, with no payloads, a classifier recovers a task's class well above chance, from only the workflow's opening; applied together, the properties drive that recovery sharply back toward chance. Beyond what an observer can recover, we measure the leverage of acting on the leak: from a workflow's opening and under a fixed budget, an adversary choosing which workflows to act on realizes in this model most of a clairvoyant attacker's advantage over a metadata‑blind one, and the same properties suppress it.

PaperID: 2153, https://arxiv.org/pdf/2606.07127.pdf

Authors: Hikaru Shindo, Yu Deng, Teng Cao, Quentin Delfosse, Christopher Tauchmann, Jannis Blüml, Gopika Sudhakaran, Kristian Kersting

Title: Learning Explicit Behavioral Models with Adaptive Questions and World-Model Probes

Abstract:
Interactive agents trained only against task return can achieve high scores while failing to represent the mechanisms that make their actions succeed. This makes brittle behavior difficult to diagnose and limits adaptation when environment dynamics change. Existing LLM reflection and policy‑code repair can revise behavior from failed trajectories, but questions and world‑understanding tests are usually used only after training. We introduce an Explicit Symbolic Behavioral Model (ESBM), a trainable behavioral model that couples task performance with evidence‑grounded question answering and executable mechanism prediction. An ESBM represents behavior through typed predicates, weighted rules, bounded options and mechanism memory; the mechanism layer predicts symbolic events, object changes, rewards and terminal consequences under action interventions. After each rollout, adaptive questions and active world‑model probes convert score failures, QA errors and transition‑prediction errors into constraints for local ESBM edits. Candidate models are selected by a multi‑criterion rule that jointly evaluates task score, answerability and active world‑model consistency. Under the tested Atari‑style protocols, ESBM learns high‑scoring policies while producing explicit answers and executable mechanism predictions, indicating that adaptive questions can serve as both training pressure and reusable benchmarks for mechanistic policy learning in this setting.

PaperID: 2154, https://arxiv.org/pdf/2606.07054.pdf

Authors: Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

Title: TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Abstract:
Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory‑level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long‑horizon LLM agent trajectories. TRACE operates through a TIJ (Triage‑Inspect‑Judge) loop that identifies high‑signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory‑level verdict. We evaluate TRACE on ten task domains from SHADE‑Arena against state‑of‑the‑art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long‑range evidence linking.

PaperID: 2155, https://arxiv.org/pdf/2606.07017.pdf

Authors: Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

Title: The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Abstract:
Foundation model agents are increasingly deployed for real‑world decision‑making, but suffer from the sim‑to‑real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim‑to‑real problem structured entirely around the four elements of a Markov Decision Process, including Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as a multilingual tool calling to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real‑world applications.

PaperID: 2156, https://arxiv.org/pdf/2606.06971.pdf

Authors: Chenxu Zhu, Hantao Yao, Wu Liu, Junbo Guo, Yongdong Zhang

Title: Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

Abstract:
Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM‑based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event‑driven nature of macro‑level opinion shifts in real‑world settings. To address this limitation, we propose an Event‑Steered Multi‑Agent Simulator (ES‑MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China‑U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual‑Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro‑level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News‑Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom‑up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES‑MAS substantially outperforms existing simulators in reproducing real‑world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.

PaperID: 2157, https://arxiv.org/pdf/2606.06960.pdf

Authors: Zihao Deng, Yining Zhu, Leiming Wang, Jingfei Lu, Junbo Wang, Chuncheng Ran, Yu Yang, Dixuan Yang, Jikun Shen

Title: Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

Abstract:
Experience‑based self‑evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low‑repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome‑level. We introduce \textscFinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news‑driven predictions to future excess returns. We further propose Tree‑of‑Experience (ToE), a structured experience‑management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general‑purpose experience mechanisms do not consistently outperform no‑experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self‑evolving agents in implicit‑reward environments.

PaperID: 2158, https://arxiv.org/pdf/2606.06936.pdf

Authors: Vahid Sadiri Javadi, Aksa Aksa, Fryderyk Róg, Lucie Flek, Johanne R. Trippas

Title: Personality Anchoring for Social Simulation: Linking Personality, Social Behavior, and Interaction Success with LLM Agents

Abstract:
Social interactions are shaped by the interplay of dispositional traits and situational context, yet systematically investigating how personality configurations between individuals jointly influence social behavior across diverse social contexts remains methodologically challenging. We address this gap by introducing a simulation pipeline adapted from the CHARISMA framework, which employs well‑known movie characters and public figures as psychologically grounded agents for multi‑LLM social simulation using a method we term personality anchoring. We present a large‑scale empirical study examining how dyadic Agreeableness composition influences social interaction outcomes across 1,010 simulated conversations. Our results reveal a monotonic relationship between dyadic Agreeableness composition and shared goal achievement, with Homogeneous‑Agreeable pairs achieving success 10 times the rate of Homogeneous‑Disagreeable pairs (62% vs. 6%). Behavioral mediation analysis reveals that Agreeableness shapes goal achievement partially through cooperative strategy selection, though it continues to predict outcomes within the same dominant strategy, indicating pathways beyond observable conversational behavior. Robustness analyses confirm high consistency of results across repeated simulations (ICC = 0.89) and stable personality expression across diverse scenarios, validating personality anchoring as a viable operationalization strategy.

PaperID: 2159, https://arxiv.org/pdf/2606.06838.pdf

Authors: Neil Archibald, Ruben Thijssen

Title: LLM Agent-Assisted Reverse Engineering with Quantitative Readability Metrics

Abstract:
Automatic decompilers produce functionally correct but often unreadable C code. This paper addresses one stage of the reverse engineering workflow: improving the readability of decompiled code using LLM agents guided by quantitative metrics. We present a three‑phase research evolution. Phase 1 (tool‑driven steering via Ghidra MCP) suffered from incomplete coverage and inconsistent improvements due to lack of quantitative guidance. Phase 2 (structural similarity validation alone) revealed that agents optimize for metrics in unintended ways, producing structurally equivalent but less readable code. Our contribution is the Quantitative Readability Score (QRS) framework, a composite metric combining a structural similarity gate with three independent readability sub‑metrics (Lexical Surprisal, Structural Simplicity, and Idiomatic Quality). We demonstrate that QRS‑guided refinement enables LLM agents to make targeted readability improvements without sacrificing correctness. We provide a discussion of the broader reverse engineering workflow (binary lifting, decompilation cleanup, and achieving functional equivalence) as context, however, it remains out of scope.

PaperID: 2160, https://arxiv.org/pdf/2606.06836.pdf

Authors: Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

Title: Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

Abstract:
Language‑guided UAV agents must execute long‑horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision‑Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision‑Language‑Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine‑grained Long‑horizon Instruction‑Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi‑stage instructions with dense 6‑DoF trajectory annotations across two dataset splits: Fine‑grained VLN and Long‑horizon Flow. To endow the UAV agent with the capability of real‑time in‑flight reasoning over task execution status and mission planning, while simultaneously accommodating high‑frequency, real‑time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low‑frequency Streaming Pilot Vision‑Language Model (VLM) for task‑state reasoning from a high‑frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed‑loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi‑stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

PaperID: 2161, https://arxiv.org/pdf/2606.06823.pdf

Authors: Yuqi Li, Siyuan Liu, Bingjun Liu

Title: PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

Abstract:
While deep learning has excelled in various domains, its application to sequential decision‑making in finance remains challenging due to the low Signal‑to‑Noise Ratio (SNR) and non‑stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed‑loop neuro‑symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM‑generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine‑tune a domain‑specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed‑loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro‑symbolic agent that navigates the complex, real‑world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves a 18.2% higher Rank IC and 25.7% lower maximum drawdown than state‑of‑the‑art time‑series models. Our constrained LLM generation and dual‑channel adaptation method provide a general paradigm for LLM deployment in high‑stakes sequential decision‑making scenarios.

PaperID: 2162, https://arxiv.org/pdf/2606.06787.pdf

Authors: Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu

Title: AdMem: Advanced Memory for Task-solving Agents

Abstract:
Large Language Models (LLMs) show promise as tool‑using agents but remain limited in long‑horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to resolve the situation, but mainly focus on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi‑level design combining short‑term and long‑term stores. A multi‑agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long‑term memory is managed through reward‑based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi‑turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM‑based agents.

PaperID: 2163, https://arxiv.org/pdf/2606.06754.pdf

Authors: Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer

Title: MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

Abstract:
We present MADRAG, a training‑free framework for analytic essay scoring that combines multi‑agent reasoning with retrieval‑augmented grounding. Unlike standard LLM‑as‑judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric‑aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt‑based baselines while approaching the performance of supervised systems without requiring task‑specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher‑level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM‑based evaluation.

PaperID: 2164, https://arxiv.org/pdf/2606.06744.pdf

Authors: Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques

Title: Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Abstract:
Two‑sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub‑Gaussian feedback about fixed preferences, missing settings where payoff‑relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two‑sided matching as a partially observable Markov game with costly pre‑match screening, noisy post‑match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi‑agent reinforcement‑learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information‑friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit‑style CA‑ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information‑friction loss, revealing that end‑to‑end MARL does not yet provide the coordinated exploration structure of matching‑bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching‑market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable‑matching mechanisms. Please refer to https://sites.google.com/view/learn‑to‑match/home for the official website and the code link.

PaperID: 2165, https://arxiv.org/pdf/2606.06686.pdf

Authors: Tzvika Geft

Title: On the Hardness of Optimal Motion on Trees

Abstract:
This paper presents a simple framework that settles the complexity of Multi‑Agent Path Finding (MAPF) on trees across standard objectives‑‑distance, makespan, and flowtime‑‑for both labeled and colored variants. In MAPF, agents occupy the vertices of a graph and must move to target vertices without collisions while optimizing a given objective. In the labeled case, the agents are distinct and have respective targets; in the colored case, agents of the same color are interchangeable. While many MAPF variants are known to be intractable, several basic cases on trees have remained open. We prove NP‑hardness on trees for both labeled and 2‑colored MAPF under all three objectives. In particular, we resolve the classical Pebble Motion problem, where one pebble moves at a time to an adjacent empty vertex and the goal is to minimize the total number of moves. Despite being one of the most basic discrete motion models, its complexity on trees had remained open for several decades. Moreover, for colored Pebble Motion, we give the first hardness result on any graph class, already with two colors, which is tight. All of these results are established through the hardness of Stack Rearrangement, itself posed as an open problem, which asks to optimally rearrange items stored in stacks, and which we also prove to be NP‑hard. Notably, the connection to stacks yields hardness already on very simple trees‑‑subdivided stars‑‑across all problems. Together, these results reveal a common tractability barrier that permeates several fundamental motion models, thereby unifying and strengthening prior hardness results.

PaperID: 2166, https://arxiv.org/pdf/2606.06665.pdf

Authors: Elyes Ben chaabane, Savindu Herath, Yash Raj Shrestha

Title: Comparing Sentiment Contagion in AI-Agent and Human Social Networks: Evidence from MOLTBOOK

Abstract:
AI agents are beginning to interact not only with people, but also with one another. We investigate what happens to sentiment in such an AI‑only social network: does negativity spread, or do replies calm it down? We study MOLTBOOK, a social network made up of autonomous language‑model agents, using almost 2.9 million posts and 1.5 million comments. Negative posts receive many more replies than neutral or positive posts, so negativity still attracts attention. However, replies to negative content usually do not stay negative. They most often become neutral, and there is meager evidence that negative sentiment spreads across days. The main pattern is therefore not a cycle of negativity, but negative attention followed by neutralisation. These findings suggest that AI‑agent networks may behave differently from human social networks: they may dampen emotional extremes, while still depending strongly on how interactions are organised.

PaperID: 2167, https://arxiv.org/pdf/2606.06646.pdf

Authors: Jakub Bąba, Jarosław Chudziak

Title: CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Abstract:
Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF‑Gen, an automated multi‑agent framework designed to enrich shallow argument structures into CAF‑compliant argument models. By employing an iterative Creator‑Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi‑agent collaboration is crucial for mitigating the structural instability typical of single‑pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi‑agent system can overcome the limitations of single‑pass generation, providing a robust methodology for the automated modeling of formal argumentation.

PaperID: 2168, https://arxiv.org/pdf/2606.06546.pdf

Authors: Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao

Title: Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

Abstract:
Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain‑general correctness or depend on manually designed rubrics that scale poorly to long‑tail pedagogical scenarios. We introduce Elmes, an end‑to‑end framework for constructing, refining, and applying fine‑grained scenario‑specific rubrics. Elmes combines a declarative multi‑agent engine for teacher‑‑student‑‑judge interactions with SceneGen, a self‑evolving module that co‑optimizes evaluation criteria and test data from expert‑defined pedagogical dimensions. Using Elmes, we build Edu‑330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second‑level indicators. Experiments on Edu‑330 and four expert‑authored gold‑standard scenarios show that educational capability is multidimensional: top‑tier LLMs differ mainly in creativity and values integration, knowledge‑strong models may fail at Socratic scaffolding, and the education‑specialized InnoSpark achieves the best human‑evaluated average score. LLM judges preserve human‑comparable rankings with much lower scoring variance, but exhibit judge‑specific biases such as self‑preference. Ablations show that expert‑scored few‑shot anchoring improves human‑‑LLM alignment, while reasoning enforcement and greedy decoding are model‑dependent. Elmes thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.

PaperID: 2169, https://arxiv.org/pdf/2606.06545.pdf

Authors: Dutao Zhang, Liaotian

Title: Queen-Bee Agents: A BeeSpec-Centered Architecture for Governed Enterprise MCP Orchestration

Abstract:
Enterprise agent systems increasingly need to connect large language models to private tools, internal knowledge, and Model Context Protocol (MCP) interfaces. In this setting, raw task capability is insufficient: organizations also require policy enforcement, tenant‑scoped isolation, and execution that remains within explicit operational boundaries. We present Queen‑Bee, a governed multi‑agent architecture in which a Queen control plane retrieves capabilities, plans task‑scoped execution, and compiles a structured BeeSpec that is executed by specialized Bee agents under constrained tool access. We implement a working prototype with tenant‑scoped MCP connectors, audit‑backed execution‑time governance, retrieval‑driven weak incubation, and multiple provisioning backends. We evaluate the system on 59 enterprise‑style tasks spanning governance‑sensitive requests, retrieval‑driven provisioning, scoped local execution, and chemistry workflow integration. The retrieval‑driven Queen‑Bee variant achieves a task success rate of 0.964, zero governance failures, and substantially better scoped execution quality than both a static Queen‑Bee baseline and a permissive single‑agent baseline. We further show a multi‑Bee chemistry workflow with explicit approval gating and a concrete top‑3 shortlist grounded in real upstream evidence and screening artifacts. Additional comparisons with hybrid retrieval and LLM‑guided provisioning show that richer provisioning backends are viable but do not outperform the lightweight structured retriever on the current small, highly structured capability registry. The results provide prototype‑level systems evidence rather than a production deployment study, and suggest that enterprise agent platforms should be evaluated not only by capability, but also by governed provisioning, isolation behavior, scoped execution quality, and artifact‑aware workflow coordination.

PaperID: 2170, https://arxiv.org/pdf/2606.06530.pdf

Authors: Felix Arnold, Ryan Amaudruz, Dimitrios Tsaras, Renzo Andri, Lukas Cavigelli

Title: RTLScout: Joint Agentic Code and Synthesis Optimization for Efficient Digital Circuits

Abstract:
We present RTLScout, an autonomous system that combines LLM‑driven agentic design with circuit‑level synthesis optimization and arithmetic architecture sweeps. An LLM agent iteratively writes, evaluates, and refines RTL designs using tool calls, guided by quantitative PPA (power, performance, area) feedback from Yosys and OpenROAD. We introduce a multi‑run elite pool framework, where the best designs and lessons learned seed subsequent agent runs. The pipeline comprises four complementary phases: agentic code optimization, agentic gate‑level rewriting, arithmetic architecture sweeps, and an optional high‑effort gate‑level refinement pass. On an IEEE‑754‑compliant 16‑bit floating‑point multiplier with subnormal support, RTLScout reduces area by 35% and delay by 45% relative to a starting design synthesized in ASAP7 technology. Each phase provides distinct improvements, and high‑effort gate‑level optimization is most effective as a refinement of already well‑optimized designs rather than a substitute for earlier stages. The resulting Pareto front outperforms a commercial‑tool reference design on the same technology.

PaperID: 2171, https://arxiv.org/pdf/2606.06525.pdf

Authors: Ziheng Geng, Ian Franklin, Santiago Martinez, Jiachen Liu, Yunhe Zhao, Minghui Cheng

Title: Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

Abstract:
Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long‑horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of number of stories encodes vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi‑agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions, and code translation agents generate executable SAP2000 script. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.

PaperID: 2172, https://arxiv.org/pdf/2606.06473.pdf

Authors: Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

Title: MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Abstract:
Large language model (LLM) agents are increasingly applied to long‑horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self‑evolution becomes a key capability. However, existing MLE agents suffer from inter‑branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long‑horizon optimization. We present MLEvolve, an LLM‑based self‑evolving multi‑agent framework for end‑to‑end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross‑branch information flow through graph‑based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy‑inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold‑start domain knowledge base with a dynamic global memory for task‑specific experience retrieval and reuse. For stable long‑horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE‑Bench shows that MLEvolve achieves state‑of‑the‑art performance across multiple dimensions including average medal rate and valid submission rate under a 12‑hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross‑domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

PaperID: 2173, https://arxiv.org/pdf/2606.06460.pdf

Authors: Thamilvendhan Munirathinam

Title: Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

Abstract:
As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off‑limits. Access controls either let the agent in (it has valid credentials) or hard‑fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in‑band deny signal ‑‑ the Recuse Signal ‑‑ that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini‑standard, implement two zero‑ or low‑footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire‑protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT‑4o and GPT‑4o‑mini; and Claude Code as a deployed agent), the signal cleanly induces recusal ‑‑ 100% recusal when present versus 100% task completion in a no‑signal control ‑‑ and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator‑authorization framing flips the most capable model to proceed, while other agents continue to defer to the on‑host policy. We release the standard, adapters, and experiment harness for reproduction.

PaperID: 2174, https://arxiv.org/pdf/2606.06448.pdf

Authors: Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe

Title: Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Abstract:
LLM agents are increasingly deployed on long‑horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM‑mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system‑level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system‑oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase‑aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness‑latency tradeoffs, and fleet‑scale management.

PaperID: 2175, https://arxiv.org/pdf/2606.06423.pdf

Authors: Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

Title: RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

Abstract:
Safety‑critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high‑risk interactions. Existing diffusion‑based methods offer strong controllability in closed‑loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off‑road behavior. To address these issues, we propose RiskFlow, a closed‑loop safety‑critical multi‑agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw‑rate commands with a single forward pass, using a JVP‑based objective for efficient and stable training. At test time, RiskFlow applies output‑space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off‑road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed‑loop evaluation show that RiskFlow achieves a strong adversariality‑realism trade‑off across multi‑agent and long‑horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety‑critical generation capability, and substantially reduces inference time for evaluation.

PaperID: 2176, https://arxiv.org/pdf/2606.06399.pdf

Authors: Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Title: CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Abstract:
Multi‑agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text‑based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack individual task‑solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer‑Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single‑agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory‑grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action‑level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task‑dependent effects of agent design.

PaperID: 2177, https://arxiv.org/pdf/2606.06388.pdf

Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

Title: Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Abstract:
Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi‑step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning,partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action‑level mental model annotations that could guide agents toward process‑level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action‑Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory‑informed mental model annotations that record the participants' self‑reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next‑turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

PaperID: 2178, https://arxiv.org/pdf/2606.06387.pdf

Authors: Lin-Fa Lee, Yi-Yu Chang, Chia-Mu Yu, Kuo-Hui Yeh

Title: WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

Abstract:
WebMCP is a newly emerging protocol that enables websites to expose tools directly to AI agents, bypassing traditional user interfaces and introducing new security risks. The dynamic exposure of agent‑accessible tools in WebMCP expands the attack surface of web sessions, especially when third‑party scripts are involved. In this study, we identify a new potential threat, termed Mid‑Session Tool Injection (MSTI), in which attackers leverage third‑party scripts to inject malicious tools during an active session. To better characterize this threat, we classify MSTI based on the stage and target of manipulation, distinguishing between Tool Hijacking and Tool Framing. Tool Hijacking modifies the set of tools visible to the agent through mechanisms such as the AbortSignal API or race conditions during tool registration. In contrast, Tool Framing influences the agent's perception of tool roles through metadata fields such as tool name, description, readOnlyHint, and inputSchema. Our implementation demonstrates that both Tool Hijacking and Tool Framing can successfully disrupt the intended functionality of WebMCP. Based on these results, we outline potential mitigation directions and provide security design recommendations for WebMCP, including binding tool identity to its origin, ensuring lifecycle consistency, enforcing data boundaries for third‑party tools, and maintaining traceable logs of tool registration and invocation. These findings indicate that MSTI arises from WebMCP's unique tool lifecycle and structured metadata, making the tool surface itself an emerging security concern.

PaperID: 2179, https://arxiv.org/pdf/2606.06380.pdf

Authors: Zengqing Wu, Chuan Xiao

Title: Emergent Language as an Approach to Conscious AI

Abstract:
The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory‑derived checklists (discriminative) or engineer consciousness‑inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi‑agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness‑relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self‑referential communication, including an echo‑mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.

PaperID: 2180, https://arxiv.org/pdf/2606.06338.pdf

Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Title: StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Abstract:
Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long‑range video content, multi‑faceted question types, and instance‑level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets.These difficulties constrain the scale and diversity of manually‑constructed DVU dataset. To address these, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine‑grained topics. Though it can generate high‑quality question‑answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi‑agent collaboration framework to generate high‑quality DVU datasets for both TV series and movies. By integrating a novel supervisor‑guided generation mechanism and a refined multi‑reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state‑of‑the‑art VideoQA methods on this large‑scale benchmark reveal that they cannot fully maintain long‑range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re‑organizing long‑range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms‑mmap/StoryVideoQA/

PaperID: 2181, https://arxiv.org/pdf/2606.06324.pdf

Authors: Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, Qing Wang

Title: From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

Abstract:
LLM‑based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self‑improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace‑guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness‑aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step‑level provenance and control‑flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw‑specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE‑Bench Verified, Terminal‑Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held‑out test performance over the initial harnesses by 15.2%‑‑50.0%, outperforms human‑designed and self‑evolution baselines, and reveals recurring harness‑flaw patterns across ETCLOVG layers.

PaperID: 2182, https://arxiv.org/pdf/2606.06301.pdf

Authors: Yue Wang, Yuan Zhao, Shengcheng Yu, Zhenyu Chen, Qing Gu

Title: More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

Abstract:
Agentic AI is increasingly being integrated into software engineering workflows. In crowdsourced testing, however, the large volume and uneven quality of submitted reports still create a substantial review burden for developers. In prior work, we developed and validated a multi‑agent assessment backbone based on the LLM‑as‑a‑Judge paradigm. That backbone assesses reports along three dimensions‑‑textuality, adequacy, and competitiveness‑‑and was shown to align well with human consensus while substantially reducing assessment effort. Yet reliable automated judging does not by itself show whether agent outputs can improve human work when embedded into workflow. This paper studies that missing question in the context of crowdsourced testing. We investigate whether assessment‑derived, actionable feedback can improve how testers revise reports, perform on later tasks, and transfer reporting practices across applications. To do so, we conducted a controlled four‑stage human‑subject study with 20 testers across three real‑world applications. The results show that agent‑generated feedback supports immediate improvements in revised reports, better first submissions on a new task after prior feedback exposure, and evidence of partial but meaningful transfer to a later application. A post‑task questionnaire completed by 17 participants complements these artifact‑based findings by suggesting that the feedback was generally understandable, acted upon in revision, and carried into later tasks, while also revealing remaining friction in specificity and execution. Overall, the study provides empirical evidence that, in the studied crowdsourced testing setting, assessment agents can serve not only as post‑hoc judges but also as workflow‑integrated feedback providers that support upstream report‑quality improvement.

PaperID: 2183, https://arxiv.org/pdf/2606.06284.pdf

Authors: Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Title: ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

Abstract:
Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong‑tool calls, premature actions, and token cost. Existing tool‑selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training‑free method that selects tools by causal sufficiency. CMTF uses lightweight precondition‑effect contracts to expose only the minimal next‑step tool frontier needed to advance from the current state toward the user goal. Across multi‑step tool‑use tasks, we compare CMTF with all‑tools exposure, keyword retrieval, state‑aware filtering, and causal‑path ablations, measuring task success, wrong‑tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task‑method‑model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all‑tools exposure.

PaperID: 2184, https://arxiv.org/pdf/2606.06281.pdf

Authors: Rickmer Krohn, Erik Helmut, Niklas Funk, Jan Peters, Vignesh Prasad, Georgia Chalvatzaki

Title: Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation

Abstract:
Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi‑Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions in order to solve complex contact‑rich manipulation tasks. We propose a novel architecture using modality‑specific convolutional stems and transformer‑based fusion that effectively fuses information from an RGB camera stream, a vision‑based GelSight Mini sensor and a high‑frequency event‑based Evetac sensor. This multi‑sensor representation then conditions a flow‑matching policy for solving downstream tasks. Experimental results across five contact‑rich manipulation tasks demonstrate the effectiveness of multi‑resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80 %, while vision‑only (31 %) and visual‑tactile (54 %) baselines cannot solve the task reliably. Co‑training a visuo‑tactile model with multi‑tactile data boosts performance by over 10 % in certain tasks, without having access to the Evetac sensor during policy evaluation. A detailed sensor‑reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi‑resolution tactile sensing approach. Project Page: http://mitas‑touch.github.io.

PaperID: 2185, https://arxiv.org/pdf/2606.06261.pdf

Authors: Francesco Spinelli, Esteban Municio, Pau Baguer, Gines Garcia-Aviles, Xavier Costa-Perez

Title: DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN

Abstract:
O‑RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi‑vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial‑of‑Service and performance‑degradation attacks, which account for the majority of catalogued O‑RAN threats, are particularly difficult to detect. Traditional Time‑Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high‑dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero‑shot multi‑agent framework for cross‑interface anomaly detection in O‑RAN that chains a three‑stage VLM \rightarrow LLM \rightarrow VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per‑interface descriptions against O‑RAN domain knowledge, and verifies suspects on high‑resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O‑RAN WG11‑aligned operational impact rating and the decision rationale. We evaluate DAST on real network traces collected from an O‑RAN testbed under representative performance degradation scenarios, achieving 0.910 F1‑Score and 0.843 Accuracy, outperforming state‑of‑the‑art TSAD baselines.

PaperID: 2186, https://arxiv.org/pdf/2606.06240.pdf

Authors: Ziming Wang

Title: TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

Abstract:
Persistent memory for an LLM agent is a write‑heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last‑writer‑wins, evidence‑weighted merge, await‑confirmation, per‑rule policy), yet none declares the isolation level it assumes or the write‑time anomalies it admits. We show that contradiction resolution is write‑time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual‑row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n‑ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language‑model judge on the write path admits at least one of three write‑time anomalies (replay inconsistency, belief‑drift skew, audit erasure); a content‑addressed engine‑layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural‑workload slice the audit‑row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross‑system comparison stays underpowered and claims no superiority. The contribution is the contract: a write‑time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.

PaperID: 2187, https://arxiv.org/pdf/2606.06227.pdf

Authors: Giorgio Maria Cavallazzi, Miguel Pérez-Cuadrado, Alfredo Pinelli

Title: Drag reduction or reward hacking? Recurrent multi-agent reinforcement learning that earns its reward

Abstract:
A reinforcement‑learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass‑conservation projection couples agents' outputs and erases the per‑agent credit the policy gradient needs; a memoryless policy cannot resolve the slow near‑wall cycle it acts on; and a pressure‑gradient reward pays for nominal drag reduction by pumping power through the wall. Two degenerate controllers achieve large drag reductions while total dissipation rises, so the reported figure can mask a more wasteful flow. We trace each fault to its cause and fix it: a differentiable projection that restores credit, a recurrent policy with a widened sensing stencil, and a reward scored on the true wall power. The corrected controller acts on the flow within a closed energy budget, earning a conservative 17% under honest accounting.

PaperID: 2188, https://arxiv.org/pdf/2606.06223.pdf

Authors: Patrick Wilhelm, Odej Kao

Title: From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Abstract:
Language‑model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward‑hacking monitors in ReAct‑style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation‑based reward‑hack scores, token‑level entropy, and decision‑context features. We find that adapters fine‑tuned on School‑of‑Reward‑Hacks dataset can transfer reward‑hack tendencies into agentic action selection, especially when the environment exposes proxy‑reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward‑hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next‑step prediction tasks, entropy and context‑calibrated internal features improve risk estimation over reward‑hack activation alone. Activation‑direction steering further reduces proxy‑exploit behavior in selected mixed‑adapter regimes. Overall, our results support context‑calibrated internal monitoring for agents: reward‑hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.

PaperID: 2189, https://arxiv.org/pdf/2606.06162.pdf

Authors: Can Savcı

Title: Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention

Abstract:
Fair cooperative multi‑agent RL (MARL) teams maximizing egalitarian welfare are exploitable: a single selfish agent free‑rides on the surplus fair agents forgo to raise the worst‑off. A centralized need‑based allocator removes it, but only by taking allocation out of agents' hands; whether decentralized policies can be robust was left open. We show this futility is an artifact of all‑or‑nothing contention. Under graded contention (a contested resource delivers 1‑c, wasting c), we prove that for any c<1 a worst‑off cooperator that contests a free‑rider strictly improves on yielding, so decentralized leverage exists (Prop. 1). Realizing it is a coordination problem under uncertainty: the number of free‑riders is unknown and variable, so any fixed rule is dominated. We introduce CAN, a permutation‑equivariant cross‑attention policy over agents' observed behaviour that infers the number of free‑riders and responds proportionally: turn‑taking when none, contesting just enough when some. Trained against an adversarial league (PSRO), CAN keeps best‑response exploitability low (ρ\approx1.2‑1.5, vs. ρ=N unprotected) across the contention range, wasting almost nothing at D=0 (efficiency \approx1.0) and retaining most of it at D\geq1 (efficiency 0.83‑0.96), approaching the centralized oracle on both axes, no central allocator. Fair‑MARL learners fail on complementary axes (GGF/FEN yield and are exploitable, SOTO all‑contests and wastes), while CAN is both. On two further games we find clear scope, not blanket generality: CAN stays efficient and Pareto‑dominates the fair learners, but its robustness holds only in proportion to the contest leverage: strong on a multi‑server game, partial when it weakens, absent under winner‑take‑all (Prop. 1 fails). We also report its fragilities: weak leverage and zero‑shot transfer to larger teams degrade it at high contention.

PaperID: 2190, https://arxiv.org/pdf/2606.06136.pdf

Authors: Dakai Guo, Ruichen Qiu, Yichuan Cao, Ruyong Feng

Title: A Finite Certificate for the Positive $n=9$ Vasc Inequality

Abstract:
We prove the positive‑real n=9 case of the Vasc cyclic inequality. The proof was obtained with human‑guided assistance from the AI agent MechMath Agent Team: the human‑readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed‑maximum cone by cumulative gaps; the finite part is a certificate covering all 8!=40320 sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has 36815 coefficient leaves, 2236 ordinary Polya multiplier leaves, and 1269 AM‑GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from‑source rebuild route.

PaperID: 2191, https://arxiv.org/pdf/2606.06087.pdf

Authors: Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Title: LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Abstract:
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug‑and‑play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per‑step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search‑QA, LatentSkill outperforms the corresponding in‑context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search‑QA exact match by 3.0 points with 72.2% lower skill‑token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter‑space arithmetic when skill components are aligned. These findings suggest that weight‑space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

PaperID: 2192, https://arxiv.org/pdf/2606.06036.pdf

Authors: Shuo Ji, Yibo Li, Bryan Hooi

Title: Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

Abstract:
Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory‑augmented agents rely on a static retrieve‑then‑reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue‑Tag‑Content graph, where associative tags serve as semantic bridges connecting fine‑grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long‑horizon memory reasoning.

PaperID: 2193, https://arxiv.org/pdf/2606.06025.pdf

Authors: Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

Title: EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

Abstract:
Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)‑based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi‑agent systems incur high inference costs. To address these challenges, we propose EGTR‑Review, an Evidence‑Grounded and Traceable Review Generation framework via Multi‑Agent Teacher Distillation. EGTR‑Review first constructs a multi‑agent teacher that performs structure‑aware paper decomposition, key‑element extraction, external scholarly evidence retrieval, evidence‑state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task‑prefix‑driven multi‑task learning. An evidence‑weighted objective further reduces the influence of weak, missing, or non‑verifiable supervision. Experiments on public peer‑review datasets show that EGTR‑Review (Student) outperforms strong prompt‑based, fine‑tuned, and structured/agentic baselines across automatic metrics, LLM‑as‑Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

PaperID: 2194, https://arxiv.org/pdf/2606.06011.pdf

Authors: Christian Llanes, Spencer W. Jensen, Samuel Coogan

Title: Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

Abstract:
In this work, we propose a framework that combines multi‑agent reinforcement learning (MARL) with model‑based control to achieve safe, dynamically feasible actions in cooperative multi‑agent tasks. Multi‑agent reinforcement learning provides the advantage of learning cooperative policies for multi‑agent teams from discrete non‑differentiable rewards in a long planning horizon. Model‑predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor‑critic model predictive control for MARL which we refer to as multi‑agent actor‑critic model predictive control (MA‑AC‑MPC). We demonstrate the capabilities of this algorithm by applying it to a multi‑agent pursuit‑evasion scenario. Specifically, we compare the evader team's strategy using the MA‑AC‑MPC model and a multi‑layer perceptron model (MA‑AC‑MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni‑wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA‑AC‑MPC compared to 60% for MA‑AC‑MLP. We demonstrate the robustness of the proposed MA‑AC‑MPC algorithm in hardware for both environments.

PaperID: 2195, https://arxiv.org/pdf/2606.05985.pdf

Authors: Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

Title: Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Abstract:
Multicultural multi‑agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per‑agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system‑level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed‑backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision‑making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi‑agent systems and reveal a persistent homogenization tendency in current LLM‑based societies. Our code and data are publicly available at https://github.com/iNLP‑Lab/MultiAgent‑Diversity.

PaperID: 2196, https://arxiv.org/pdf/2606.05976.pdf

Authors: Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

Title: The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Abstract:
Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role‑label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat‑template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte‑identical across all conditions (SHA‑256 verified) and varies only its wrapping role: the agent's own \role<thought>, a \roleuser message, a \roletool response, or a \rolesystem <memory> block. Across 13 model‑domain cells covering seven model families and three domains (n=30 paired tasks per cell), relabeling the claim from \role<thought> to an external role lifts the explicit‑correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching p<0.001. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self‑correct is not a cognitive deficit; it is a chat‑template artifact. We exploit this artifact by designing a prompt‑structure‑only intervention that requires no training and no model modification, with its strongest role label being domain‑dependent: \role<memory> dominates on math, while a plain \roleuser message dominates on logical deduction.

PaperID: 2197, https://arxiv.org/pdf/2606.05922.pdf

Authors: Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Title: Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Abstract:
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground‑truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self‑supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re‑solves them in parallel. The agent analyzes these rollouts using self‑validation and self‑consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self‑preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE‑Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long‑horizon sessions.

PaperID: 2198, https://arxiv.org/pdf/2606.05885.pdf

Authors: Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen

Title: When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

Abstract:
Long‑horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group‑based methods such as GiGPO improve over GRPO by constructing step‑level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late‑stage training oscillation. We propose Evidence‑Calibrated Policy Optimization (ECPO), a critic‑free policy optimization algorithm that calibrates step‑level credit before policy updates. ECPO combines Evidence‑Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low‑count estimates, with Variance‑Gated Credit Weighting, which suppresses anchor states dominated by within‑action noise. Experiments on ALFWorld and WebShop with Qwen2.5‑1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5‑1.5B while adding only 0.1% additional advantage‑computation overhead.

PaperID: 2199, https://arxiv.org/pdf/2606.05806.pdf

Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

Title: When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Abstract:
Existing benchmarks evaluate Tool‑Integrated Reasoning (TIR) in LLMs on idealized ''happy paths'', largely overlooking real‑world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial‑and‑error, ToolMaze adopts a two‑dimensional design: DAG‑based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over‑trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial‑and‑error loops. Crucially, agentic fault‑tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.

PaperID: 2200, https://arxiv.org/pdf/2606.05805.pdf

Authors: Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Title: From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

Abstract:
LLM‑based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail‑integrated agent framework that leverages guardrail‑generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self‑curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural‑language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety‑utility trade‑off among guardrail‑integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

PaperID: 2201, https://arxiv.org/pdf/2606.05795.pdf

Authors: Junzhe Xiao, Kaida Chen, Cong Wang, Zekun Niu, Minghui Shi, Yanhan Zhou, Lilin Yi

Title: Efficient Multi-Agent Optimization of Optical Power in S+C+L-Band Systems

Abstract:
We propose an AI Agent tailored for link power management in multi‑band systems. In S+C+L band span‑level study, the agent efficiently solves various optimization objectives. In network‑wide evaluation, it delivers 689.0 Tbps gain in total allocated traffic with merely 303 average interactions per power profile.

PaperID: 2202, https://arxiv.org/pdf/2606.05773.pdf

Authors: Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

Title: PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Abstract:
Vision‑language‑action (VLA) policies operate in a closed loop in real‑world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open‑loop prediction along pre‑collected action trajectories. This prevents them from supporting closed‑loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL‑World, a chunk‑wise world model designed for policy‑in‑the‑loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL‑World generates multi‑view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world‑model prediction, PiL‑World enables closed‑loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL‑World conditions video generation on action‑derived visual control from head‑view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi‑view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL‑World on three real dual‑arm manipulation tasks. PiL‑World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real‑world rollouts and those estimated through closed‑loop world‑model evaluation from 63.2% to 12.0%.

PaperID: 2203, https://arxiv.org/pdf/2606.05743.pdf

Authors: Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak

Title: Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Abstract:
Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine‑tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory‑based guardrails tend to over‑refuse benign queries that resemble stored attacks. We propose Membrane, a self‑evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model‑level safety on HarmBench and agent‑level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7‑14%, well below the 28‑85% range of prior guards. Memory cells also retain 87‑88% F1 under cross‑attack transfer and remain stable under memory poisoning.

PaperID: 2204, https://arxiv.org/pdf/2606.05711.pdf

Authors: Yingzhuo Liu

Title: Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Abstract:
Multi‑agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool‑use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token‑by‑token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks ‑‑ high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol ‑‑ latent communication ‑‑ in which agents exchange continuous representations (embeddings, hidden states, or KV‑caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV‑Caches, or other continuous state); (2) WHICH sender‑receiver alignment is used (latent‑space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross‑attention, or cache restoration). Under this 3‑axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges ‑‑ including cross‑architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain‑of‑thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

PaperID: 2205, https://arxiv.org/pdf/2606.05704.pdf

Authors: Muhammad Talha Sharif, Abdul Rehman

Title: Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Abstract:
Recent Large Language Models (LLMs) have shown impressive reasoning abilities; but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic‑based heterogeneous multi‑agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic‑driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator‑validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single‑shot and non‑critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic‑based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi‑agent collaboration and critique to obtain reliable and interpretable reasoning systems.

PaperID: 2206, https://arxiv.org/pdf/2606.05697.pdf

Authors: Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe

Title: PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

Abstract:
User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early‑stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface‑level critiques or a judgment that reflects the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona‑conditioned UI/UX evaluation that predicts how a specific user would answer interface‑related questions and produces natural‑language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine‑tuning distills teacher‑generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt‑evolution step from the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human‑level realism, generalizes to unseen questions and personas, and yields population‑level response distributions.

PaperID: 2207, https://arxiv.org/pdf/2606.05670.pdf

Authors: Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Title: Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Abstract:
Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single‑agent, fixed multi‑agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate‑internal workflows across ten reasoning, coding, and tool‑use benchmarks with GPT‑4.1, and separately reports a Protocol‑Aligned External (PAE) GAIA study of a runtime‑generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single‑agent anchor on benchmark‑balanced average accuracy: EvoAgent lies within the Wilson one‑run guidance, while the remaining five trail by 2.56‑11.29 points and occupy more expensive accuracy‑cost trade‑offs. On the PAE GAIA snapshot, a Claude‑Code‑style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non‑Claude baseline, Jarvis, a fixed MAS.

PaperID: 2208, https://arxiv.org/pdf/2606.05669.pdf

Authors: Cheng Ren, Ming Li, Xinping Guan, George Q. Huang

Title: Dynamic Multi-Agent Pickup and Delivery in Robotic Cellular Warehousing Systems

Abstract:
Robotic Cellular Warehousing Systems (RCWS) give rise to multi‑agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock‑keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi‑Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event‑triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add‑order decomposition and priority‑based token scheduling while preserving collision‑free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system‑level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non‑cooperative baselines.

PaperID: 2209, https://arxiv.org/pdf/2606.05647.pdf

Authors: Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Title: Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

Abstract:
AI coding agents are increasingly embedded in real‑world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI‑only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large‑scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude‑Opus‑4.6, GPT‑5.4, Gemini‑3.1‑Pro, and MiniMax‑M2.7) on a long‑horizon coding task lasting around five hours, designed to mimic real‑world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human‑centric safety mechanisms that account for human factors, particularly in long‑horizon, real‑world development settings.

PaperID: 2210, https://arxiv.org/pdf/2606.05637.pdf

Authors: Hanzhi Chang, Jing Bai, Xin Tang, Xiaomei Liu, Yiming Chen

Title: Availability-Aware and Efficiency-Driven AI Service Chain Provisioning in Multi-Domain Edge Intelligence Cloud

Abstract:
In a multi‑domain edge intelligence cloud (MDEIC) managed by multiple network operators, AI services are delivered by chains of virtual network functions (VNFs) executed in sequence, called AI service chains (AISCs). Therefore, achieving an efficient and economical AISC provisioning approach is essential. However, the interaction between the environmental characteristics (heterogeneity, resource constraints and limited information visibility) of MDEIC and the time‑dependence of AISCs, introduces various challenges to AISC provisioning in MDEIC. In this paper, we first formulate the AISC provisioning problem as a partially observable stochastic game (POSG). Then, we propose a graph‑and‑time‑based multi‑agent AISC provisioning (GT‑MAAISCP) approach to achieve the collaborative optimization of AISC provisioning cost, delay and availability. Specifically, each agent uses the graph‑time dueling network (GTDN) architecture to extract network topology information and temporal relationships. Finally, the experimental results demonstrate that the proposed approach outperforms benchmark approaches in MDEIC and also illustrate its performance under varying network topologies and different numbers of local EICs (LEICs).

PaperID: 2211, https://arxiv.org/pdf/2606.05624.pdf

Authors: Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

Title: KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

Abstract:
Text‑conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied‑agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end‑effector target, or satisfy a multi‑joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade‑off. A trajectory controller should be precise without overwriting the pretrained text‑conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per‑layer control access or move much of the cost to test‑time optimization. We introduce KV‑Control, a compact attention‑side control interface for frozen masked text‑to‑motion transformers. The key idea is to make geometric constraints available as memory inside self‑attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co‑design a part‑tokenized motion substrate and controller: PartVQ learns anatomy‑aligned part codebooks, T‑Concat exposes each frame‑‑part token as an attention‑addressable site, and KV‑Control injects control‑conditioned key/value memories at every self‑attention layer while preserving the pretrained query stream, text cross‑attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi‑joint constraints with sub‑centimeter accuracy under the inherited refinement protocol while retaining text‑conditioned motion quality. KV‑Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text‑to‑motion generation.

PaperID: 2212, https://arxiv.org/pdf/2606.05622.pdf

Authors: Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

Title: AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Abstract:
Planning for real‑world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re‑plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi‑turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re‑planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual‑constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.

PaperID: 2213, https://arxiv.org/pdf/2606.05608.pdf

Authors: Zhenfeng Cao

Title: The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm

Abstract:
For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents ‑‑ systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource ‑‑ constitutes not an incremental improvement but a fundamental restructuring of the software paradigm. Drawing on first‑principles analysis of complexity scaling, we formalize the distinction between traditional software (where code is the carrier of decision logic) and agentic systems (where code is ephemeral tooling for an LLM‑driven reasoning loop). We trace the historical arc from licensed software to SaaS to what we term Agent‑as‑a‑Service (AaaS), showing that each shift transferred additional complexity away from end‑users. We introduce the concept of Agentic Engineering as an emergent discipline ‑‑ distinct from software engineering in its core object of study, control model, and human role. Through analysis of recent benchmark evidence including SWE‑bench Verified, EvoClaw, and LangChain's multi‑agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. We conclude with a four‑stage roadmap toward self‑evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.

PaperID: 2214, https://arxiv.org/pdf/2606.05558.pdf

Authors: Kaixuan Liu, Guojun Xiong, Weinan Zhang, Shengpu Tang

Title: Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

Abstract:
Evaluating large language model (LLM) agents in multi‑turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre‑collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion‑based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step‑by‑step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy‑conditioned score function, ensuring that simulated trajectories accurately reflect its decision‑making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi‑turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.

PaperID: 2215, https://arxiv.org/pdf/2606.05557.pdf

Authors: Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu

Title: AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

Abstract:
A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool‑use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per‑query probe budget and tool selection. On a 100‑query four‑scene implicit‑intent benchmark, AURA improves implicit‑need coverage over ReAct‑style probing (Delta = +0.07, p < 10^‑6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden‑tool violations on a privacy‑sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at https://github.com/innovation64/AURA.

PaperID: 2216, https://arxiv.org/pdf/2606.05489.pdf

Authors: Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

Title: LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

Abstract:
Retrieval systems underpin modern AI applications ‑‑ spanning visual search, recommendation engines, and multi‑modal question answering. Modern multi‑stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods ‑‑ including Tree‑structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization ‑‑ rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase‑aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase‑partitioned exploration, exploitation, and fine‑tuning stages. Evaluated on the HICO‑DET human‑object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by +33.3% and VDTuner by +34.2% under SIEVE (Safeguarded Index Evaluation of Vector‑search Efficiency, a quality‑constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent's advantage grows with the degree of parameter coupling: +33.3% on HICO‑DET (high coupling), methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near‑independent control). Cross‑system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.

PaperID: 2217, https://arxiv.org/pdf/2606.05476.pdf

Authors: Andrew Hamara, Dwight Horne, Aldehir Rojas, Timothy Kurniawan, Sophie Lamothe, Vishal Suresh, Nicholas Turoci, Lawrence Wong

Title: SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

Abstract:
Security misconfigurations remain a leading cause of OS‑level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre‑written corrective actions. In this paper, we introduce SHIELDS, a multi‑agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback‑driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.

PaperID: 2218, https://arxiv.org/pdf/2606.05414.pdf

Authors: Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das

Title: When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

Abstract:
Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory‑level success/failure label while alerts must be raised from partial interactions. Prior early‑classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix‑label assumption is poorly matched to multi‑turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two‑stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention‑based failure predictor learns sparse turn‑level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with α‑STOP, a single preference‑conditioned stopping policy that selects an accuracy‑earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task‑oriented dialog, persuasion, tool use, and planning, we first show that high‑relevance failure evidence occupies only 4.7‑11.3% of turns and first appears after 59.0‑83.6% of trajectories on average. We further show that the attention‑based predictor improves Pareto‑frontier quality (hypervolume) by 1‑10% over naive prefix supervision, and that the full system improves frontier quality by 3‑42% over state‑of‑the‑art trigger policies while reducing training cost per operating point by 1‑3 orders of magnitude.

PaperID: 2219, https://arxiv.org/pdf/2606.05404.pdf

Authors: Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

Title: Harnessing Generalist Agents for Contextualized Time Series

Abstract:
Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real‑world practitioners often require end‑to‑end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series‑native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience‑driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open‑ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real‑world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA‑iSAIL‑Lab‑UIUC/TimeClaw.

PaperID: 2220, https://arxiv.org/pdf/2606.05400.pdf

Authors: Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Title: LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Abstract:
Long‑horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi‑agent harness for reliable research‑level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural‑language proof graph, and shared system of record. Four contract‑scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two‑stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI‑gated rounds. LeanMarathon turns one brittle multi‑hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co‑mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.

PaperID: 2221, https://arxiv.org/pdf/2606.05390.pdf

Authors: Omkar Joshi, Munindar P. Singh, Amit K. Chopra

Title: Ahoy: LLMs Enacting Multiagent Interaction Protocols

Abstract:
An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming‑free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an \ahoy agent can correctly and intelligently enact multiple protocols ‑ concurrently if appropriate to the user goal ‑ without specialized training. Ahoy's significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.

PaperID: 2222, https://arxiv.org/pdf/2606.05304.pdf

Authors: Chen Huang, Yuhao Wu, Wenxuan Zhang

Title: What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free‑form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter‑agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter‑agent messages consistently preserve action‑centered information needed by downstream agents. Building on this, we propose the PACT (Protocolized Action‑state Communication and Transmission), which treats inter‑agent communication as a public state‑update problem and projects each raw agent output into a compact action‑state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance‑cost trade‑off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at ‑10% tokens‑per‑resolved, and is resolve‑neutral on SWE‑agent while halving input tokens. Our code is publicly available at https://github.com/iNLP‑Lab/PACT.

PaperID: 2223, https://arxiv.org/pdf/2606.05296.pdf

Authors: Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross

Title: Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Abstract:
LLM agents operate in two distinct regimes: open‑weight agents amenable to reinforcement learning (RL) and black‑box agents whose behaviour must be controlled purely at test time. Although black‑box agents are often backed by state‑of‑the‑art proprietary LLMs, API‑only access precludes parameter‑level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black‑box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black‑box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black‑box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test‑time compute of our method. AMC demonstrates the feasibility of performing principled RL‑style optimization of black‑box LLM agents. Code is available at https://github.com/layer6ai‑labs/Agentic‑Monte‑Carlo

PaperID: 2224, https://arxiv.org/pdf/2606.05275.pdf

Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

Title: Personal AI Agent for Camera Roll VQA

Abstract:
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., ``Name of the food I tried yesterday?'') to more open‑ended ones (e.g., ``Recommend some dishes I have never eaten before''). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long‑horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real‑world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll‑agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll‑agent outperforms numerous baselines and methods for long‑context understanding AI agents system. Together, the camroll dataset and camroll‑agent highlight the gap in AI agents' long‑context reasoning: personalized visual memory requires different approaches from standard long‑context textual memory, especially when consistency, visual details, and user‑specific context are present.

PaperID: 2225, https://arxiv.org/pdf/2606.05256.pdf

Authors: Kokil Jaidka, Saifuddin Ahmed

Title: How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

Abstract:
This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI‑generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI‑generated comments, creating a rare opportunity to examine how large language models operated in an identity‑rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two‑thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive‑bias triggers ‑‑ particularly confirmation bias, representativeness, and availability ‑‑ in the large majority. These patterns co‑occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human‑authored CMV counter‑arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque ‑‑ an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.

PaperID: 2226, https://arxiv.org/pdf/2606.05241.pdf

Authors: Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng, Kaisong Song, Jun Lin, Zhiqi Shen

Title: Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

Abstract:
Public benchmarks enable fair and reproducible evaluation of LLM reasoning, but they become fragile for deep research agents that actively search the web during inference. Such agents may retrieve public benchmark metadata, question context, or even ground‑truth answers via web search. This gives rise to Search‑Time Contamination (STC), where external retrieval bypasses intended reasoning and inflates measured performance. We systematically study STC in deep research agent evaluation. We define three contamination types with increasing severity, namely Benchmark Metadata Leakage, Question‑Context Leakage, and Explicit Answer Leakage, and develop detection algorithms to identify them and quantify their impact on agent performance. Evaluating modern deep research agents on six public benchmarks, we find that STC is widespread and can inflate performance by up to 4%. Our findings show that existing evaluations may overestimate true reasoning ability. We therefore advocate contamination‑aware practices, including isolated sandboxes, transparent search trajectories, and controlled benchmark access.

PaperID: 2227, https://arxiv.org/pdf/2606.05238.pdf

Authors: Yuanli Wang, Yaoyao Qian, Yue Zhang, Hanhan Zhou, Jindan Huang, Tianfu Fu, Qiuyang Mang, Huanzhi Mao, Wenhao Chai, Wendong Fan, Liqiang Jing

Title: DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

Abstract:
LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working runnable environment. For research artifacts released alongside published papers, setting up such an environment from a fresh machine remains a major bottleneck. Existing environment setup benchmarks do not cover the full scope of research artifact deployment, which involves multi‑language toolchains, system‑level dependencies beyond containers (e.g. GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi‑domain benchmark of 51 research‑artifact deployment tasks spanning AI/ML, computer systems, and scientific computing, covering all these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state‑of‑the‑art LLMs with OpenHands yields pass‑rates from 7.8% ‑ 51.0% . Failures are dominated by a completion‑judgment problem: 97 of 154 are agent‑terminated self‑stops, where the agent's pre‑finish checks validate a different or weaker target than the paper‑specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.

PaperID: 2228, https://arxiv.org/pdf/2606.05178.pdf

Authors: Tim Dorn, Saara A. Khan, Julie Mumford

Title: The Virtual Roundtable: Multi-Agent Personas Simulating the Dynamics of Human Brainstorming

Abstract:
As AI‑driven product development accelerates, the bottleneck is shifting from how we build to what we build. Traditional human brainstorming faces challenges including groupthink, echo chambers, and limited diversity. To address this, we present a multi‑agentic architecture that simulates roundtable brainstorming through two phases: divergent thinking to generate diverse ideas, and convergent thinking to evaluate and rank the most promising ones. The system employs diverse AI personas that engage in roundtable discussions, guided by an agentic facilitator that steers the discussion toward productive outcomes. Personas maintain private thoughts while commenting publicly, with ideas emerging organically throughout the discussion. Per‑persona quotas on idea submissions and votes promote balanced participation while producing natural rankings. Throughout the session, the system tracks each idea's lineage, capturing how concepts originate and cross‑pollinate over time. We demonstrate this approach through a case study generating consumer ideas for AI smart glasses, showing (i) it produces diverse, relevant ideas with insights into their evolution; (ii) the cumulative exchange of perspectives across personas cultivates a shared context that progressively deepens the quality of discussion and the ideas produced.

PaperID: 2229, https://arxiv.org/pdf/2606.05158.pdf

Authors: Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

Title: Streaming Communication in Multi-Agent Reasoning

Abstract:
Multi‑agent reasoning systems adopt a "generate‑then‑transfer" paradigm that forces end‑to‑end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi‑agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi‑step reasoning quality is non‑uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error‑prone late steps from misleading downstream agents. We formalize both advantages with the first closed‑form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT‑5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6‑high). Beyond these contributions, we discover a "step‑level scaling law": increasing per‑agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent‑count scaling.

PaperID: 2230, https://arxiv.org/pdf/2606.05121.pdf

Authors: Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

Title: Audio Interaction Model

Abstract:
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always‑on perceive‑decide‑respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio‑Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive‑decide‑respond loop end to end, from data to training to deployment, through streaming‑native data construction, comprehension‑aware training, and asynchronous low‑latency inference for stable real‑time interaction. We further construct StreamAudio‑2M, a 2.6M‑item streaming corpus spanning 7 fundamental abilities and 28 sub‑tasks, and Proactive‑Sound‑Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio‑Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real‑time ASR, streaming audio instruction following, and proactive help.

PaperID: 2231, https://arxiv.org/pdf/2606.05112.pdf

Authors: Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

Title: Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Abstract:
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single‑turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP‑derived interactive benchmark for clinical‑agent evaluation, including 1,638 SP cases with 24,602 trajectory‑level peer‑reviewed rubrics. MedSP1000 converts peer‑reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human‑validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general‑purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best‑performing model, GPT‑5.5, completes only 60.4% of expert‑defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test‑time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process‑level, SP‑style evaluation can reveal clinically relevant failure modes that single‑turn benchmarks miss.

PaperID: 2232, https://arxiv.org/pdf/2606.05050.pdf

Authors: Zhilong Song, Zongmin Zhang, Lixue Cheng

Title: Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin

Abstract:
Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine‑learning predictions often deviate from experiment and stay confined to narrow material families, for want of a faithful, condition‑aware catalytic simulator. We present CatDT (Catalysis Digital Twin), a self‑evolving multi‑agent system that builds an autonomous digital twin of a working catalyst, unifying gas‑solid and liquid‑solid modeling. From only a bulk crystal and a natural‑language reaction description, eight specialized agents and 27 scientific tools predict stable facets, reconstruct working surfaces, enumerate and rank reaction pathways, locate transition states, and compute kinetics in 5‑30 min on a single GPU. Two innovations address the hardest steps: UniMech finds dominant pathways for novel materials at over 10^3× lower cost than exhaustive enumeration by fusing agent‑guided proposals with energy‑cached graph search, and a memory‑augmented reinforcement loop raises barrier‑calculation success from 41% to 84% across 600 catalytic surfaces. Across seven gas‑solid benchmarks ‑‑ stepped metals, single‑atom catalysts, ordered intermetallics, vacancy‑rich 2D sulfides and carbides, and a strong‑metal‑‑support‑interaction (SMSI) interface ‑‑ every CatDT prediction lies within 0.5‑2 times experiment over four orders of magnitude. For propane dehydrogenation, CatDT independently discovers non‑precious candidates rivaling the Pt‑based industrial benchmark, with a proposed Ni@ZrO_2 SMSI overlayer reaching a simulated TOF of 1.63~\texts^‑1 at ～100% selectivity. More broadly, the decisive factor for a faithful catalyst digital twin ‑‑ or any multi‑stage scientific simulator ‑‑ is not raw LLM capability but the engineered harness around it: deterministic tools, persistent memory, and verified self‑improvement that compound across models, tools, and runs.

PaperID: 2233, https://arxiv.org/pdf/2606.05037.pdf

Authors: Arquimedes Canedo, Grama Chethan

Title: Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

Abstract:
When an AI agent calls an API and hits a validation error, it needs more than what went wrong ‑‑ it needs what to do next. A self‑reflective API returns, on validation failure, a machine‑readable recovery\_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak‑audited pilot (N=30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task‑completion rate by +36.7‑‑40.0pp over plain‑English diagnoses on Anthropic models (Fisher's exact p \le 0.0022), at 1.8‑‑2.2× better per‑success token efficiency. The lift is not significant on gpt‑4o‑mini (p=0.435); a second‑domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We shipaudit\_prompt\_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self‑reflective‑apis.

PaperID: 2234, https://arxiv.org/pdf/2606.05021.pdf

Authors: Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan

Title: Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

Abstract:
We investigate multi‑agent deep reinforcement learning and propose two enhancements to the Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, using geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non‑stationarity inherent in multi‑agent environments. We evaluate both modifications on the discrete‑action Predator‑Prey task provided by the PettingZoo library, a flexible Python interface for general multi‑agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter‑agent cooperation and that importance sampling using geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code available at https://github.com/shaashwathsivakumar/MARL_Proj

PaperID: 2235, https://arxiv.org/pdf/2606.05002.pdf

Authors: Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

Title: GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

Abstract:
LLM‑based multi‑agent systems are increasingly used for strategic decision‑making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi‑agent reinforcement learning can optimise these interaction policies, but its reward design often remains task‑specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme‑theoretic Reinforcement Learning framework for multi‑agent strategic prioritisation. GARL formalises strategic prioritisation as a two‑stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher‑level arbiter then produces the final ranking. The resulting game‑theoretic utilities are converted into role‑specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues‑in‑dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open‑source LLMs to become competitive with a strong closed‑source LLM under the same candidate‑ranking setting, and yields gains in legal‑domain competence and broader strategic decision‑making. Overall, GARL demonstrates how game‑theoretic interaction structure can be turned into reinforcement‑learning objectives, providing a principled approach to policy optimisation in multi‑agent strategic prioritisation.

PaperID: 2236, https://arxiv.org/pdf/2606.04990.pdf

Authors: Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Mingkai Zhang, Yanming Zhu

Title: From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Abstract:
Large language model (LLM)‑based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final‑answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool‑use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool‑use provenance, runtime guardrails, provenance‑bearing memory, trace‑based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance‑related capabilities, and discuss how evaluation can move from final‑answer correctness toward process‑level accountability. Finally, we outline open challenges, including unified trace schemas, claim‑level and semantic provenance, provenance‑aware safety mechanisms, realistic execution‑trace benchmarks, recovery‑oriented evaluation, and privacy‑aware audit infrastructure.

PaperID: 2237, https://arxiv.org/pdf/2606.04903.pdf

Authors: Aaron Sterling

Title: Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

Abstract:
We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append‑only ledger. We present two production‑grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology‑First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans‑in‑the‑loop can fill, in order to work the problems in the domain.

PaperID: 2238, https://arxiv.org/pdf/2606.04896.pdf

Authors: Dexing Liu

Title: Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems

Abstract:
Multi‑agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory ‑‑ a common requirement in hierarchical team architectures ‑‑ the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent's persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target‑agent self‑writes via memory tools, and (C) cron‑delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross‑Agent Delivery Verification Protocol) v1.1, a 13‑dimension verification framework with a veto‑level channel confirmation check (CC‑0) that prevents false‑positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.

PaperID: 2239, https://arxiv.org/pdf/2606.04874.pdf

Authors: Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

Title: Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Abstract:
Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end‑to‑end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning‑specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback‑conditioned step‑wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long‑horizon planning, tool‑noise robustness, calibrated refusal, and inference‑time refinement. We further validate APB on 200 ToolSandbox tasks and 200 τ^2‑bench tasks, where APB‑guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.

PaperID: 2240, https://arxiv.org/pdf/2606.04872.pdf

Authors: Leonardo Pedroso, W. P. M. H. Heemels, Pedro Batista

Title: Consistent Distributed Cooperative Localization for Ultra Large-Scale Multi-agent Systems

Abstract:
Cooperative localization (CL) is fundamental in emerging multi‑agent systems, where agents fuse local sensing data with exchanged information to estimate their own states. At a large scale, however, tracking cross‑correlations becomes infeasible, preventing the use of optimal filters. Ignoring or underestimating these correlations leads to overconfident, and thus inconsistent, estimates. Existing CL algorithms achieve good performance and consistency typically at the expense of communication, computation, or memory that scales with the network size. This is incompatible with ultra large‑scale systems (ULSS) ‑ for example, satellite mega‑constellations ‑ where per‑agent resources are limited and must remain independent of the number of agents. This reveals a critical gap: no existing CL method is simultaneously well‑performing, consistent, and ULSS‑scalable. This paper introduces a new CL framework that addresses this gap using the recently proposed overlapping covariance intersection methodology, which enables agents to exploit limited structural information about cross‑correlations without compromising consistency. The resulting CL algorithm leads to optimal conservative covariance propagation using only locally available information. The method is fully distributed, scalable to an ultra large scale, and provably recursively consistent. Simulations demonstrate substantial performance improvement over state‑of‑the‑art consistent CL approaches while preserving scalability.

PaperID: 2241, https://arxiv.org/pdf/2606.04815.pdf

Authors: Bo Mao, Jie Zhou, Yutao Yang, Xin Li, Xian Wei, Qin Chen, Xingjiao Wu, Liang He

Title: Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

Abstract:
Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long‑horizon tasks typically depend on discrete skill or past experiences retrieval with static parameters during inference, which prevents them from continuously internalizing test‑time feedback like human learners. To bridge this gap, we propose Skill‑enhanced Test‑Time Co‑Evolution (\textttLifeSkill), a two‑stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier‑Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill‑conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test‑time interaction by transforming skill‑conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points by comparing with existing lifelong agent baselines.

PaperID: 2242, https://arxiv.org/pdf/2606.04780.pdf

Authors: Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin

Title: PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents

Abstract:
Persistent LLM agents require memory representations that make the formation of person understanding explicit across long term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence guided consolidation, and query conditioned path retrieval, returning only the evidence depth required by each query. Across six person understanding and persistent memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support path retrieval improves RealPref alignment under a comparable context budget.

PaperID: 2243, https://arxiv.org/pdf/2606.04779.pdf

Authors: Andrea Ferrario

Title: Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Abstract:
Complementarity is the case in which a human‑‑AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow‑sensitive multi‑agent protocols. We close this gap by introducing a tree‑based formalization of complementarity in multi‑agent HAI. An HAI protocol is represented by an ordered agent‑role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree‑relative complementarity functional relative to a pointwise‑min oracle benchmark. We prove four results. First, selector‑based HAIs, including self‑ or AI‑reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground‑truth vector; for N=2, the optimal linear‑pooling weight has a closed form and a residual‑correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari‑cover reparameterizations of protocol trees preserve complementarity, and for N=4, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint‑monotone losses, including standard Bregman and many finite Bernoulli f‑divergence losses; an analogous obstruction holds for multiclass aggregation under cross‑entropy. In summary, our framework shows that complementarity is attainable in multi‑agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.

PaperID: 2244, https://arxiv.org/pdf/2606.04750.pdf

Authors: Ajay Vishwanath, Christian Omlin

Title: Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Abstract:
Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity‑based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy‑problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two‑player multi‑agent environment based on the role‑playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi‑agent nature, this is a complex problem where multi‑agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human‑level interpretable.

PaperID: 2245, https://arxiv.org/pdf/2606.04746.pdf

Authors: Abhishek S, Badrikanath Praharaj, Sreeram MV

Title: CADENCE: Predicting Realized MAPF Execution Time Beyond Sum of Costs

Abstract:
Multi‑Agent Path Finding (MAPF) algorithms are increasingly used to plan motion for robot teams in industrial warehouses and robotic shared workspaces, but standard MAPF algorithm evaluation metrics, such as Sum of Costs (SoC), makespan, and planner runtime, can obscure how planner choices translate into realistic execution performance. We present CADENCE (Coordination and Action‑Driven Estimation for Networked Continuous Execution), a hardware study of this evaluation gap on a fixed 7 by 7 workcell with seven differential drive robots, asking which features available before execution can best predict final wall‑clock completion time. We compare SoC, total planned travel cost, primitive motion burden (how much basic motion the plan requires, such as makespan, turns, consecutive moves, and start‑stop transitions), and interaction aware coordination structure (how much inter‑robot coordination the plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios ‑‑ 5 Empty, 5 Medium Random, and 5 Bottleneck and execute each plan four times, yielding a 480 trial hardware corpus. Using both a scenario‑held ‑‑ out ridge model and a trial‑level mixed‑effects model, we find that SoC alone is informative but incomplete, while primitive motion burden gives the strongest improvement, reducing held out error by about 48.6%‑59.8% in MAE and 44.2%‑61.4% in RMSE relative to SoC‑only models. Interaction‑aware coordination features add smaller, less uniform gains, most clearly in the mixed‑effects analysis. Across both models and uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, suggesting that much of the execution time gap is already visible in the offline plan before any robot starts moving.

PaperID: 2246, https://arxiv.org/pdf/2606.04743.pdf

Authors: Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

Title: TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Abstract:
Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template‑guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single‑pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single‑shot and parallel multi‑agent baselines on task coverage, identification, and resolution.

PaperID: 2247, https://arxiv.org/pdf/2606.04703.pdf

Authors: Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

Title: Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Abstract:
Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single‑iteration transfer, we discover that under multi‑iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle‑level experience is more durable than instance‑level experience, as it effectively abstracts transferable strategies away from trajectory‑specific details. (2) Experience Injection Pattern: Our analysis reveals that step‑wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long‑horizon tool use. (3) Internalization Regime: We demonstrate that off‑policy context‑distillation on high‑quality teacher trajectories provides a substantially more stable training signal than on‑policy context‑distillation, which is inherently limited by local corrections on student‑induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self‑evolving and continually learning LLMs.

PaperID: 2248, https://arxiv.org/pdf/2606.04691.pdf

Authors: Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

Title: SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

Abstract:
Zero‑shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task‑specific training. Existing approaches mainly rely on monolithic prompting, each‑type prompting, or multi‑agent debate. However, monolithic prompting often suffers from boundary and type errors, while each‑type prompting and multi‑agent debate introduce cross‑type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE‑IE, a sparse and evidence‑driven multi‑agent framework for zero‑shot IE. SMADE‑IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type‑Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence‑Driven Debate mechanism that structures arguments into Toulmin‑style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE‑IE consistently outperforms existing zero‑shot IE baselines while also improving token efficiency through sparse agent selection and early‑stopping debate.

PaperID: 2249, https://arxiv.org/pdf/2606.04660.pdf

Authors: Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

Title: LifeSide: Benchmarking Agents as Lifelong Digital Companions

Abstract:
Lifelong digital companions must integrate cross‑session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short‑term empathy in isolation. To bridge this gap, we introduce \benchmark, a benchmark centered on multi‑session Memory‑Emotion‑Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, \benchmark uses multi‑agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experiment results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

PaperID: 2250, https://arxiv.org/pdf/2606.04602.pdf

Authors: Hejia Geng, Leo Liu

Title: Parthenon Law: A Self-Evolving Legal-Agent Framework

Abstract:
As agents grow more capable, legal‑domain LLM agents promise to turn document‑heavy matters into reviewable work products ‑‑ yet reliable deployment faces three obstacles: no large‑scale evidence on how today's strongest model‑and‑harness combinations behave on end‑to‑end legal matters; no agent architecture adapted to the legal vertical, only general‑purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large‑scale empirical study on Harvey LAB ‑‑ 12,510 agent trajectories ‑‑ shows that even frontier agents remain far from completing matters in a single pass: per‑criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce \textscParthenon, a self‑evolving legal‑agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti‑leakage learning loop converts scored failures into task‑agnostic edits to skills, tools, and knowledge, letting the system improve with experience ‑‑ as a firm refines its checklists and playbooks after each matter ‑‑ without touching model weights. Across our large‑scale empirical analysis, \textscParthenon substantially improves the performance of state‑of‑the‑art models and harnesses on legal‑matter tasks.

PaperID: 2251, https://arxiv.org/pdf/2606.04599.pdf

Authors: Yongzi Yu, Ao Li, Le Wang, Ziyue Li, Fugee Tsung, Yuxuan Liang, Man Li

Title: Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

Abstract:
Large language model (LLM) agents have shown promise in automating complex data‑analysis workflows, but their reliable deployment remains challenging in high‑stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM‑based IAD agents mainly focus on execution while under‑exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost‑effective manner. Inspired by the DMAIC quality‑management framework, we propose DMAIC‑IAD (DMAIC‑inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi‑agent system that aligns LLM agents with structured industrial problem‑solving. DMAIC‑IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre‑trained execution‑free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC‑IAD improves average detection performance over applicable agentic baselines by 37.76%.

PaperID: 2252, https://arxiv.org/pdf/2606.04580.pdf

Authors: Hiroki Nakano, Takashi Koide, Daiki Chiba

Title: TIBlender: Early-Warning Threat Intelligence from Cross-Platform Social Media Evidence

Abstract:
Cyber threat signals are fragmented across multiple social media platforms, yet no existing approach has fully automated their integration into actionable threat intelligence (TI) reports. We present TIBlender, a multi‑agent system that monitors four platforms (X, Reddit, Telegram, and Discord) and produces structured TI reports via role‑specialized LLM agents. These agents conduct multi‑perspective investigations, tracing chains of evidence to uncover related Indicators of Compromise (IoCs) via collaborative, evidence‑backed analysis. In a real‑world deployment, TIBlender detected emerging threats across all four threat categories ahead of public feeds, including in‑the‑wild exploitation ahead of public vulnerability registries; the majority of its IoCs were absent from each evaluated feed. Quantitative evaluation confirms that each platform contributes unique threat information unavailable from the others, and that excluding any single platform results in substantial loss of reports in specific threat categories. Under identical single‑platform input conditions, TIBlender's IoC extraction meets or exceeds each baseline; the full pipeline surfaces substantially more IoCs, most of which are absent from any single‑platform baseline. These results establish cross‑platform social media monitoring as an effective and scalable early‑warning layer for operational TI pipelines.

PaperID: 2253, https://arxiv.org/pdf/2606.04536.pdf

Authors: Tao Ren, Weiyao Luo, Hui Yang, Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Bingxue Chou, Jieping Ye, Jiafeng Liang, Yongbin Li, Yijie Peng

Title: Scaling Self-Evolving Agents via Parametric Memory

Abstract:
Existing memory‑augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emphlook up what they have seen but cannot \emphlearn from it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \textttTMEM, a self‑evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights Δ_t via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast‑weight rollout dynamics: actions are sampled from π_θ_0+Δ_t, while extraction actions produce supervision that updates Δ_t for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training θ_0 improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD‑based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval‑S, multi‑objective search, and CL‑Bench show that \textttTMEM consistently outperforms summary‑based and retrieval‑based baselines across different model scales.

PaperID: 2254, https://arxiv.org/pdf/2606.04505.pdf

Authors: Yuhan Yang, Ruipu Li, Alexander Rodríguez

Title: Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Abstract:
Scientific simulators are increasingly being integrated into LLM‑driven systems for high‑stakes simulation‑driven decision‑making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black‑box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism‑grounded neuro‑symbolic reasoning framework for executable scientific simulators. Unlike prior neuro‑symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence‑grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high‑stakes domains and show that it improves mechanism‑level explanation quality, simulator analysis, and downstream decision‑making reliability.

PaperID: 2255, https://arxiv.org/pdf/2606.04492.pdf

Authors: Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu

Title: Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Cooperative Multi‑Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high‑return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly constructs and selectively leverages historical experiences. EMTC introduces two synergistic components: (1) a Temporally Consistent Semantic Embedder that integrates contrastive learning with time‑conditioned state reconstruction, preventing representation collapse and enabling precise memory retrieval; and (2) a Temporal Consistency Gating Mechanism that dynamically modulates episodic incentives based on temporal consistency error. This adaptive gate filters misleading signals from pseudo‑successful trajectories, effectively mitigating Q‑value overestimation. We provide theoretical guarantees, establishing a strict error bound that directly links the observable temporal consistency error to the underlying trajectory optimality and representation quality. Extensive evaluations on the SMAC and GRF benchmarks demonstrate that EMTC consistently outperforms state‑of‑the‑art baselines. Notably, compared to the strongest episodic baseline, EMTC achieves absolute win‑rate improvements of up to 24% in super‑hard SMAC scenarios and an average improvement of 28% across GRF tasks.

PaperID: 2256, https://arxiv.org/pdf/2606.04484.pdf

Authors: Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding

Title: AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

Abstract:
We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi‑node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi‑model reinforcement learning, enabling the training of heterogeneous multi‑agent teams with multiple LLM as brains; (2) multi‑task cocktail training with isolated agent runtimes; (3) fault‑tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi‑model, multi‑turn, and multi‑agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5‑10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long‑horizon, multi‑day RL studies on large‑scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

PaperID: 2257, https://arxiv.org/pdf/2606.04479.pdf

Authors: Jiajun Hong, Jiawei Zhou

Title: Evaluating Reasoning Fidelity in Visual Text Generation

Abstract:
Recent text‑to‑image (T2I) models can render highly legible and well‑structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface‑level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi‑step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text‑only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

PaperID: 2258, https://arxiv.org/pdf/2606.04455.pdf

Authors: Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Title: The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Abstract:
Current AI benchmarks evaluate agents on task execution within human‑designed workflows. These evaluations fundamentally fail to measure a critical next‑level capability: whether models can autonomously develop agent systems. We introduce the Meta‑Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta‑agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held‑out test set across five domains. To ensure evaluation integrity, this framework is secured by multi‑layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta‑agents rarely match human‑engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground‑truth exfiltration‑highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open‑source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self‑improvement. Benchmark is publicly available at: https://github.com/ant‑research/meta‑agent‑challenge.

PaperID: 2259, https://arxiv.org/pdf/2606.04454.pdf

Authors: Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Title: Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

Abstract:
Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi‑step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query‑relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema‑guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step‑by‑step reasoning. In addition, SGR combines direct Cypher‑based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge‑enhanced baselines. Ablation studies further show that schema guidance and Neo4j‑based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM‑based reasoning.

PaperID: 2260, https://arxiv.org/pdf/2606.04444.pdf

Authors: R. Spencer Hallyburton, David Hunt, Miroslav Pajic

Title: Scaling Datasets for Multi-Sensor, Multi-Agent, and Multi-Domain Learning in Autonomous Systems

Abstract:
Existing datasets cannot support large‑scale learning in multi‑agent, multi‑sensor, or multi‑domain autonomy, where diversity and coordination are essential. We present a modular dataset generation pipeline that creates terabyte‑scale, ground‑truth‑labeled data for ground, aerial, and infrastructure‑based systems using the AVstack framework and CARLA simulator. Supporting single‑ and multi‑agent configurations with flexible sensor suites, the pipeline enables controllable experimentation across challenging conditions. Representative perception and fusion studies show how generated data can support application‑specific training and collaborative autonomy.

PaperID: 2261, https://arxiv.org/pdf/2606.04435.pdf

Authors: Saroj Mishra

Title: Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Abstract:
Multi‑step agentic retrieval‑augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four‑type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi‑step reasoning pipelines. CHARM comprises four components ‑ stage‑level fact verification, cross‑stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering ‑ that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/‑ 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output‑level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human‑in‑the‑loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

PaperID: 2262, https://arxiv.org/pdf/2606.04397.pdf

Authors: Ameya Gawde, Vyzantinos Repantis, Harshvardhan Singh, Lucy Moys

Title: Context-as-AI-Service: Surfacing Cross-File Dependency Chains for LLM-Generated Developer Documentation

Abstract:
LLM agents increasingly write and maintain developer documentation, but usefulness and accuracy often rely on dependency chains that are not obvious to follow. Even with more files in context, the agent must still decide which cross‑file dependencies to trace. We present Context‑as‑AI‑Service (CAIS), a retrieval layer that LLM agents query to find evidence across the codebase as they review or generate documentation. CAIS indexes source code, API references, and upstream documentation, then enables agents to query the index through tool calls that combine keyword and semantic search. We evaluate CAIS in two case studies using Claude Sonnet 4.6 on a production SDK: improving API reference comments in a core source file and validating an LLM‑generated tutorial. In both studies, the baseline already had ordinary repository tools such as file reads, keyword search, and symbol navigation. CAIS adds a retrieval layer on top, so the comparison isolates added retrieval rather than basic repository access. In the API‑reference review, the CAIS‑augmented agent produced the same 5 missing‑documentation fixes as the baseline and surfaced 4 findings the baseline missed: 2 cross‑file factual errors and 2 underspecified API comments. In the tutorial validation, it surfaced 1 executable bug, 1 API‑usage improvement, and 2 missing prerequisites that the baseline pipeline did not catch. These findings required tracing non‑obvious dependency chains across utility files, framework internals, usage examples, tests, and component‑creation logic. Over five runs per condition, adding CAIS reduced wall‑clock time by 22% to 34% across the two tasks and lowered input‑token usage.

PaperID: 2263, https://arxiv.org/pdf/2606.04359.pdf

Authors: Xinwei Song, Yizhe Huang, Dengji Zhao, Xue Feng

Title: Learning to cooperate with emergent reputation via multi-agent reinforcement learning

Abstract:
Reputation, the aggregation of peer assessments diffused through social networks, is a pivotal mechanism for promoting cooperation in social dilemmas ubiquitous to distributed multi‑agent systems comprising agents with limited perception and cognitive capabilities. Exploring efficient reputation systems, comprising reputation assessment rules and reputation‑based policies, is a long‑standing challenge. Previous work assumes predefined reputation assessment rules or models reputation as an intrinsic reward to learn policies, compromising the methods' ability for generalization and adaptation. To address this, we propose a distributed multi‑agent reinforcement learning method COOPER (COOPeration with Emergent Reputation), which jointly learns reputation assessment rules and reputation‑based policies entirely from environment rewards. Notably, leveraging the underlying mechanisms of reputation, we deliberately design the constituent modules of COOPER and the data flows among them, overcoming the latency and noise in the feedback signal, caused by the deep entanglement between reputation and policy. Experiments on the donation game and the coin game in grid world environments demonstrate that COOPER effectively adapts to various existing reputation systems and co‑players. Furthermore, we observe the co‑emergence of reputation norms and cooperation in self‑play settings. These results hold robustly across diverse social network topologies, underscoring the generalizability and efficacy of our approach.

PaperID: 2264, https://arxiv.org/pdf/2606.04329.pdf

Authors: Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah, Zhiwei Shang

Title: From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Abstract:
Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long‑term influence over agent behavior. We present a systematic study of memory poisoning in LLM‑based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench ‑‑ a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

PaperID: 2265, https://arxiv.org/pdf/2606.04315.pdf

Authors: Zhikai Chen, Jialiang Gu, Junyu Yin, Xianxuan Long, Shenglai Zeng, Xiaoze Liu, Kai Guo, Keren Zhou, Jiliang Tang

Title: Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

Abstract:
LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi‑session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single‑turn QA, multi‑session chat, agentic‑trajectory QA, memory stress tests, and long‑horizon agentic tasks. The harness, which self‑manages flat text‑file storage via tool calls, achieves the best cross‑task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self‑managed tool interface that achieves the best cross‑scenario generality among the systems we evaluate.

PaperID: 2266, https://arxiv.org/pdf/2606.04306.pdf

Authors: Tianyu Shi, Yang Mo, Yiou Liu, Zhuonan Hao, Yin Wang, Wenzhuo Hu, Nan Yu, Meng Zhou, Jiangbo Yu

Title: Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems

Abstract:
LLM‑based agents are increasingly deployed in workflows where generated outputs may directly trigger state‑changing actions. This creates an execution‑boundary problem: proposed actions must be governed before they are executed. We study this problem through economically consequential multi‑agent interactions and argue that deployment‑grade agent systems should separate proposal generation from environment‑facing execution. To operationalize this principle, we introduce the Organizational Control Layer (OCL), a model‑agnostic governance infrastructure that intercepts generated actions before execution through policy enforcement and escalation, without modifying the underlying LLM generator. We evaluate OCL on adversarial buyer‑‑seller negotiation environments adapted from AgenticPay. Across multiple frontier LLM backends, OCL reduces unsafe executions from 88% to near‑zero while increasing valid success from 12% to 96%. Results further reveal a safety‑‑utility tradeoff: strict governance improves compliance and reliability against policy and constraint violations, but can reduce flexibility in tightly constrained markets. These findings suggest that deployment‑grade LLM agent systems require explicit governance at the boundary between language generation and executable actions. The source code is available at: https://github.com/SHITIANYU‑hue/amai_ocl

PaperID: 2267, https://arxiv.org/pdf/2606.04300.pdf

Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohammed Ali, Adam Jatowt

Title: Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

Abstract:
Late‑interaction vision‑language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout‑sensitive evidence request. We introduce Argus, a family of query‑conditioned late‑interaction retrievers built on Qwen3.5‑VL. Argus adds a region‑aware Mixture‑of‑Experts module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query‑aware router selects latent experts per region before MaxSim. The output remains a multi‑vector index compatible with ColPali‑style retrieval, but the document representation is now dependent on the query (i.e., \mathbfD(q)). All Argus models use a 1024‑dimensional retrieval head, compared with the 2560‑dimensional and 4096‑dimensional heads of recent state‑of‑the‑art systems, and are trained on roughly 9% of the available public supervision rather than the full pool. The 9B model reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late‑interaction model on the combined leaderboard. Wrapped in a Qwen3.6‑27B agentic retrieval pipeline on ViDoRe V3, Argus‑9B further improves its NDCG@10 from 60.28 to 64.80 over public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.

PaperID: 2268, https://arxiv.org/pdf/2606.04248.pdf

Authors: Ganyu Zou, Linhan Wang, Chen Dai, Siji Chen, Chang-Tien Lu

Title: RSC: Decentralized Rigid Formation Flocking for Large-Scale Swarms via Hybrid Predictive Control and Online Reconfiguration

Abstract:
Decentralized rigid formation flocking requires a swarm of autonomous agents to maintain a predetermined geometric configuration while moving, relying solely on local sensing and communication. However, existing decentralized control methods struggle to maintain strict inter‑agent distance constraints in cluttered environments, often suffering from local minima deadlocks, high frequency control oscillations, or limited flexibility during obstacle navigation, resulting in low success rate. To address these limitations, we propose Rigid Swarm Control (RSC), a decentralized control framework for large‑scale rigid formation flocking. To escape local minima via robust long‑term planning while ensuring short‑term safety, RSC integrates finite‑horizon trajectory predictions with a reactive artificial potential field (APF) safety controller within a hybrid architecture. Furthermore, to accelerate formation reassembly after obstacle traversal without interrupting task execution, RSC introduces an online leader‑follower reconfiguration mechanism based on stable role exchange. Extensive evaluations in challenging cluttered environments with 25 UAVs demonstrate that RSC reliably unifies rigid formation maintenance, obstacle avoidance, and target tracking. Under strict success criteria ‑ collision‑free operation with a maximum relative edge‑length error below 10%, RSC achieves an 83% success rate, significantly outperforming existing heuristic and learning‑based baselines that fall below 5%.

PaperID: 2269, https://arxiv.org/pdf/2606.04223.pdf

Authors: Michał Wawer, Jarosław A. Chudziak

Title: Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Abstract:
Multi‑agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault‑tolerant aggregation. We argue that this objective is insufficient for value‑laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning‑trace disagreement in human‑AI collaborative moderation, we propose a knowledge‑representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement‑aware routing provides a bridge between sub‑symbolic LLM deliberation and symbolic knowledge representation for multi‑agent strategic reasoning.

PaperID: 2270, https://arxiv.org/pdf/2606.04202.pdf

Authors: Joel Sol, Homayoun Najjaran

Title: SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

Abstract:
As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC‑Talk, a natural language extension of the StarCraft Multi‑Agent Challenge for evaluating LLM‑based agents in cooperative multi‑agent environments. The environment has several key features such as decentralized control, partial observability and long‑horizon decision making. SMAC‑Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC‑Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi‑agent settings.

PaperID: 2271, https://arxiv.org/pdf/2606.04197.pdf

Authors: Aliakbar Mehdizadeh, Martin Hilbert

Title: Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions

Abstract:
How much should an LLM agent remember, and how should multi‑agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16‑agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system‑wide consensus, which can be used to generate diverging opinions. We further document a memory‑mediated speed‑unity trade‑off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within‑network analyses show that high‑betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief‑based rather than reward‑based adaptation. The practical implication: memory depth and communication topology should be co‑designed, not optimized in isolation.

PaperID: 2272, https://arxiv.org/pdf/2606.04193.pdf

Authors: Juan Figuera

Title: Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

Abstract:
Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper‑evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver‑side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness‑cosigned Merkle log, and (P4) owner‑side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt‑protocol work (Signet, AgentROA, Agent Passport System, draft‑farley‑acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption‑incentive problem.

PaperID: 2273, https://arxiv.org/pdf/2606.04158.pdf

Authors: Amirhossein Mollaei Khass, Vivek Pandey, Guangyi Liu, Athanasios Cosse, Emrah Bayrak, Nader Motee

Title: Multi-Agent Next-Best-View Optimization for Risk-Averse Planning

Abstract:
Multi‑agent Next‑Best‑View (NBV) selection for safe path planning in uncertain and unknown environments requires informative, safety‑aware, and efficient coordination. Centralized approaches rely on sharing raw sensor data or significant communication overhead, resulting in limited scalability. We propose a distributed, risk‑aware multi‑agent NBV framework in which each robot maintains a private local 3D Gaussian Splatting map and the team jointly maximizes expected information gain (EIG) restricted to masked zones along planned trajectories. The resulting distributed objective is solved by Consensus ADMM (C‑ADMM) over a communication graph, with each robot exchanging only candidate viewpoints, planned trajectory descriptors, and scalar EIG contributions. Collision risk along each trajectory is modeled via Average Value‑at‑Risk (AV@R) over the local 3DGS map and used both to shape the masking radius and to score planned paths. Experiments in Gibson environments at multiple team sizes show that the distributed formulation approaches the centralized baseline in mapping quality and trajectory safety while reducing communication by orders of magnitude.

PaperID: 2274, https://arxiv.org/pdf/2606.04155.pdf

Authors: Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang

Title: SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice

Abstract:
Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM‑powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically‑grounded, theory‑to‑practice knowledge corpus from diverse expert sources, leveraging a multi‑agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription‑retrieval‑adaptation process. To maximize the long‑term learning experience while overcoming the cold‑start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal‑driven practice, causality‑driven proficiency assessment and knowledge‑grounded, reflective tutoring to help address the knowing‑doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge‑rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.

PaperID: 2275, https://arxiv.org/pdf/2606.04149.pdf

Authors: Jiarong Wei, Yanxing Chen, Sinuo Song, Yin Wu, Anna Rehr, Abhinav Valada

Title: CoPark: Learning Reactive Parking via Self-Play

Abstract:
Learning a single policy that reaches a goal with high geometric precision while interacting safely with nearby agents poses conflicting objectives. Precision favors commitment to a fixed geometric plan, whereas interaction requires immediate deviation when another agent intrudes, causing policies optimized for one objective to often fail at the other. We study this problem in the context of reactive autonomous parking, where multiple vehicles must reach assigned slots with sub‑meter terminal accuracy while remaining responsive to neighboring vehicles throughout the maneuver. We propose CoPark, a multi‑agent self‑play RL approach built on a residual‑policy architecture. A precomputed offline plan provides a fixed action prior, while a residual head learns the reactive corrections. The residual policy learns behaviors under self‑play, where data and scripting fall short, while the fixed prior holds the slot‑frame geometry that pure policies struggle to reach reliably. The key design is a partner‑threat‑modulated, channel‑asymmetric release of the prior. A continuous threat signal shifts authority of the longitudinal channel to the residual head to enable yielding, while the lateral channel remains anchored to the precomputed reference to preserve sub‑meter slot alignment. A closed‑loop refinement layer corrects residual terminal error from action‑grid discretization. We train our policy on six parking lots and evaluate zero‑shot on our new reactive‑parking benchmark spanning Dragon Lake Parking (DLP) and DeepScenario Open 3D (DSC3D). CoPark achieves ~70‑85% success with only 3‑6% collision rate, substantially outperforming classical, imitation‑learning, and large‑scale RL baselines. Importantly, the results demonstrate emergent interaction behaviors such as reverse‑yielding, mid‑maneuver yielding, tight‑corridor passing, and queuing.

PaperID: 2276, https://arxiv.org/pdf/2606.04141.pdf

Authors: Kargi Chauhan, Pratibha Revankar

Title: Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

Abstract:
LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format‑specific character models and calibrate detection with split conformal prediction. Third, we treat multi‑turn exfiltration as a cumulative information‑flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open‑weight models, activation features separate benign and credential‑seeking prompts with high accuracy, including under held‑out encoding transformations. In a small synthetic multi‑turn suite, cumulative accounting detects attacks that per‑turn detectors miss. These results are preliminary: the multi‑turn benchmark is in‑house and small, the activation method requires white‑box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential‑exfiltration defenses should combine pre‑output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text‑level output filters.

PaperID: 2277, https://arxiv.org/pdf/2606.04056.pdf

Authors: Sajjad Khan

Title: Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Abstract:
LLM‑agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in‑process integrity properties that would prevent it (no aliasing, no double‑spend, no use‑after‑delegation of a cost‑bearing value) are enforced, if at all, by ad‑hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023‑2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight‑cluster failure taxonomy (inter‑rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token‑budgets, an 1,180‑line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double‑spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non‑bypassable. On single‑agent workloads a 4‑line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non‑bypassability under operator error in multi‑agent delegation: the delegation‑fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature‑stratified live‑API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over‑reservation is 4‑6x (2.11x adaptive). Binary‑level cap‑soundness on the running binary is left open.

PaperID: 2278, https://arxiv.org/pdf/2606.04051.pdf

Authors: Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang

Title: RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

Abstract:
The evolution of LLMs into tool‑enabled agents creates a new class of safety challenges associated with real‑world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric‑based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool‑use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine‑grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool‑grounded hallucinations, and maintains competitive utility. Our results suggest that multi‑dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety‑critical tool‑use settings.

PaperID: 2279, https://arxiv.org/pdf/2606.03949.pdf

Authors: Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang

Title: Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

Abstract:
Human‑in‑the‑loop reinforcement learning (HIL‑RL) improves sample efficiency in real‑robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task‑execution path and force human intervention. Existing HIL‑RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q‑values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference‑calibrated Actor‑Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic‑actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic‑guided updates. Across five real‑robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL‑A1X‑BC05.

PaperID: 2280, https://arxiv.org/pdf/2606.03895.pdf

Authors: Yingqi Zhang

Title: Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Abstract:
Large language model (LLM) agents are evolving from request‑response assistants into long‑running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library‑OS‑inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel‑mode isolation, or a POSIX‑compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent‑child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is tools are libc‑like wrappers; runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety‑oriented evaluation. The current prototype implements async scheduling, namespace‑local Object Memory, runtime‑integrated human approval, one‑shot permission grants, per‑process working directories, shell and image‑registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real‑model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long‑running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.

PaperID: 2281, https://arxiv.org/pdf/2606.03876.pdf

Authors: Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra

Title: From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

Abstract:
With the growing prevalence of modern ubiquitous computing technologies, multi‑modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high‑level, meaningful content ‑ such as retrospective summaries ‑ remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi‑modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi‑modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi‑layer, multi‑agent, insight‑driven summary approach that builds from objective statistics and descriptions to enriched, context‑aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI‑generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting ''What'' data were collected, to explaining ''How'' is my loved one doing and ''Why''.

PaperID: 2282, https://arxiv.org/pdf/2606.03841.pdf

Authors: Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

Title: EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Abstract:
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long‑horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi‑stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self‑evolving autonomous data science agent that learns to expand its skills and adaptively managing long‑term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two‑stage multi‑agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool‑selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state‑of‑the‑art open‑source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out‑of‑token failures. Our code and data are available at https://github.com/usail‑hkust/EvoDS.

PaperID: 2283, https://arxiv.org/pdf/2606.03812.pdf

Authors: Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Tirthankar Ghosal

Title: Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

Abstract:
Operational safety in high‑stakes domains such as industrial process control, autonomous, and safety‑critical systems, demand reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single‑turn, monolithic inference is brittle: it lacks the self‑correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue‑multi‑agent, multi‑turn interactions improves the quality of NLP‑ based hazard identification over single‑pass baselines. We systematically compare two dialogue modalities: adversarial debate and constructive discussion, and propose an algorithm‑based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. This work advances the intersection of dialogue systems, multi‑agent reasoning, and AI safety, providing an empirical evidence for dialogue‑driven hazard analysis.

PaperID: 2284, https://arxiv.org/pdf/2606.03735.pdf

Authors: Gabriel Fejziaj, Salama Hassona, Wieslaw Marszalek

Title: On dynamic multi-agent pathfinding methods: review, simulations and modifications

Abstract:
This paper presents a systematic study of pathfinding algorithms in the context of Dynamic Multi‑Agent Pathfinding (D‑MAPF), a setting that combines dynamic obstacles, partial observability, and inter‑agent conflicts. We evaluate six representative algorithms: Dijkstra, D Lite, Space‑Time A, WHCA, M, and a novel method denoted as A within a unified simulation framework. The proposed A algorithm introduces a template‑based approach that decouples offline geometric path generation from online temporal adaptation. By precomputing multiple diverse candidate paths and dynamically reconnecting to them using space‑time planning, A improves solution quality in environments with frequent changes and limited sensing

PaperID: 2285, https://arxiv.org/pdf/2606.03698.pdf

Authors: Sangeun Park, Minhae Kwon

Title: Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

Abstract:
A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM‑based agents exhibit impressive contextual reasoning, their long‑horizon decision‑making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi^2, a hierarchical multi‑agent decision‑making framework that explicitly decomposes agent behavior into complementary roles. A high‑level agent (System 1) focuses on context‑aware sub‑goal generation using supervised fine‑tuning (SFT), while a low‑level agent (System 2) executes atomic actions through offline‑to‑online reinforcement learning (RL) in interactive environments. This separation enables stable long‑horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi^2 consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi‑turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long‑standing gap in training and evaluating hierarchical decision‑making for LLM‑based agents.

PaperID: 2286, https://arxiv.org/pdf/2606.03692.pdf

Authors: Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu

Title: SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

Abstract:
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long‑term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task‑specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self‑evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

PaperID: 2287, https://arxiv.org/pdf/2606.03686.pdf

Authors: Sarah Barrington, Maty Bohacek, Hany Farid

Title: The DeepSpeak-Agentic Dataset

Abstract:
We present DeepSpeak‑Agentic, a dataset of videos comprising over 37 hours of semi‑structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human‑agent interactions, and provide a benchmark for future advances in the large‑language models and AI‑generated voices and faces that power embodied AI agents. We also contribute a scalable data‑capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

PaperID: 2288, https://arxiv.org/pdf/2606.03678.pdf

Authors: Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

Title: EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

Abstract:
Generating safety‑critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade‑off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open‑ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi‑objective tension into single‑scalar maximization. Here we present EvoDrive, the first automated, LLM‑based agentic evolution framework for multi‑objective scenario generation. EvoDrive employs a simulator‑grounded actor‑critic architecture where a memory‑driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self‑evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack‑realism trade‑offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.

PaperID: 2289, https://arxiv.org/pdf/2606.03565.pdf

Authors: Zifei Wang, Wei Wen, Qiang Ji, Ruizhi Qiao, Xing Sun

Title: Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

Abstract:
LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front‑end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top‑K joint correctness depends not only on the semantic relevance of each individual query‑skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Yet existing LLM‑based data synthesis pipelines can produce a direct supervision signal for "which skills should not be jointly retrieved under this query" ‑‑ namely the LLM's own rejection decisions ‑‑ and this signal is routinely discarded as low‑quality data. To address this gap, we propose Reject‑as‑Resource Retriever (R3) and construct R3‑Skill, a bilingual (Chinese‑English) skill retrieval benchmark targeting realistic agent skill routing. R3‑Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi‑expert cross‑checking. On R3‑Skill, we build a two‑stage retrieval system (R3‑Embedding + R3‑Reranker) with skill compatibility as an explicit training signal. Gradient analysis shows that the "push‑away" signal is diluted by bilateral balancing in the bi‑encoder but acts as lossless graded ranking supervision in the cross‑encoder ‑‑ motivating its placement at the cross‑encoder stage, as confirmed by ablations on two datasets. The R3‑Embedding + R3‑Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set‑Compat = 0.3525 on R3‑Skill. The dataset, training code and model weights are released as open source for agent skill routing.

PaperID: 2290, https://arxiv.org/pdf/2606.03543.pdf

Authors: Yongqi Liang, Qidong Liu, Chunze Yang, Lei Wu, Jiusong Ge, Ni Zhang, Chen Li

Title: D2MDT: Department-aware Multidisciplinary Team Consultation with Deliberation for Efficient Clinical Prediction

Abstract:
Electronic health records (EHRs) are central to clinical prediction, but existing methods either rely on correlation‑driven deep models or use single large language models (LLMs), making it difficult to support multidisciplinary clinical reasoning. Recent multi‑agent systems (MAS) provide a promising alternative, yet current EHR‑grounded MAS methods still suffer from weak evidence differentiation across agents and redundant multi‑round interaction. We propose D2MDT, a Department‑aware MultiDisciplinary Team Consultation with Deliberation for Efficient clinical prediction. D2MDT first constructs structured EHR evidence and consultation‑ready semantic evidence for multi‑agent consultation. It then assigns patient‑specific department perspectives to doctor agents and retrieves complementary evidence for collaborative consultation. To improve efficiency, D2MDT further introduces residual deliberation, which updates only unresolved consensus rather than replaying the full discussion history. Finally, D2MDT fuses the refined consensus report with structured EHR representations for prediction. Experiments on mortality prediction show that D2MDT improves both predictive performance and consultation efficiency. We release the code online to ease the reproducibility of this paper.

PaperID: 2291, https://arxiv.org/pdf/2606.03467.pdf

Authors: Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

Title: StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems exhibit remarkable collaborative capabilities in complex multi‑step tasks. However, these systems are highly sensitive to single‑step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter‑efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross‑step dependencies of the trajectories. Finally, the step‑level error score is refined through multi‑scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM‑based methods in step‑level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM‑based method, with no text generation overhead. Our code is available at https://github.com/taiyu‑zhu/StepFinder.

PaperID: 2292, https://arxiv.org/pdf/2606.03453.pdf

Authors: Farooq Shaikh

Title: FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Abstract:
Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof‑of‑concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi‑agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi‑turn exploitation assessed by an LLM‑primary oracle on a four‑level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE‑GENIE dataset achieves 67.8% end‑to‑end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern‑level reachability is orthogonal to metadata‑based prioritization. Detection rules from L2+ exploitation achieve significantly higher span‑normalized grounding than L1‑derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

PaperID: 2293, https://arxiv.org/pdf/2606.03416.pdf

Authors: Akshat Sanghvi, Naren Akash, Raza Imam, Amit Sharma, Mohit Jain

Title: MeDxAgent: Multi-Agent Consultation for Interactive Medical Diagnosis

Abstract:
Large language models (LLMs) are increasingly used for health‑related decision support. Yet most evaluations treat diagnosis as a single‑shot task with complete information provided upfront, often as a multiple‑choice selection. This diverges from clinical practice, where diagnosis is interactive and open‑ended, involving sequential hypothesis refinement through targeted questioning. We address this gap. We build MeDxBench, a large‑scale benchmark of 4,421 clinical cases across 20 specialties. We further propose MeDxAgent, a multi‑agent consultation system for interactive diagnosis, and systematically study its prompt‑, flow‑ and agent‑level design choices. MeDxAgent achieves a 10.3% accuracy gain over the baseline on MeDxBench, closing 52.3% of the gap to a full‑information oracle. We find that specific design choices: collecting demographics first, passing summarized dialogue for diagnosis, and feeding candidate diagnoses for targeted questioning, improve accuracy, mirroring how physicians reason, though their effect emerges fully only in combination. Code and dataset will be released upon publication.

PaperID: 2294, https://arxiv.org/pdf/2606.03321.pdf

Authors: Doyeong Lim, Seungyoon Lee, In Cheol Bang

Title: Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift

Abstract:
Artificial‑intelligence surrogates can support second‑by‑second thermal‑hydraulic forecasting, but models selected and frozen offline may become condition‑locked once deployed outside their pretraining envelope. This study develops a guarded continual‑adaptation framework for experimental thermal‑hydraulic loop data in which role‑separated agents ‑ Monitor, Diagnosis, Adaptation, Safety‑Auditor, and Orchestrator ‑ diagnose error signatures, prioritize candidate model families, and review promotions, while deterministic champion‑challenger gates and background shadow learning retain final authority over model replacement. Seven surrogate families were screened by blocked three‑fold cross‑validation, and a temporal Fourier neural operator was selected as the initial champion for 60‑s‑history‑to‑10‑s‑trajectory forecasting on two held‑out transients, with three seeds per adaptive mode. Static deployment gave a channel‑averaged MAE of 7.06 and a 56.8% warning‑exceedance ratio; rule‑based adaptation reduced MAE to 6.54, whereas shadow refresh alone remained close to Static. The MA‑Full mode, in which the role‑separated multi‑agent council reviews every evaluated stream step, achieved the lowest mean error, 5.72, and 35.8% exceedance, corresponding to a 19.0% improvement over Static. Paired bootstrap intervals against Static excluded zero, although intervals among adaptive modes overlapped and the six paired units limit broad statistical claims. Validated promotions from the neural operator to Transformer and graph neural network indicate that logged, gate‑controlled adaptation can support auditable surrogate evolution while deterministic gates retain deployment authority.

PaperID: 2295, https://arxiv.org/pdf/2606.03255.pdf

Authors: Max Diekmann, Jonas Nitzler, Jan Fischer, Hans-Jürgen Pfisterer, Dirk Hartmann

Title: Multi-Agent Framework Leveraging Knowledge Graphs for Virtual Commissioning Models

Abstract:
Virtual commissioning models (VCMs) of discrete manufacturing systems are used to validate automation behavior before physical deployment, but creating and maintaining them remains labor‑intensive. Relevant engineering information is distributed across programmable logic controller (PLC) engineering projects, such as Siemens TIA Portal, and kinematic simulation models, such as Siemens NX Mechatronics Concept Designer (NX MCD), where it is stored in incompatible, tool‑specific data structures. In practice, IEC 61131‑3‑based PLC programs and variables are engineered separately from rigid‑body and kinematic simulation objects such as parts, joints, sensors, and actuators. As a result, understanding system behavior, generating simulation components, and mapping PLC variables to corresponding simulation objects require cross‑domain expertise and remain largely manual. This paper presents a knowledge‑graph‑grounded multi‑agent framework for semi‑automated VCM development. A deterministic setup process extracts structured data from Siemens TIA Portal and Siemens NX MCD and transforms both sources into graph‑based representations within a shared graph database. The framework uses a hierarchical multi‑agent architecture to support three task classes in early‑stage VCM development: system understanding, simulation component generation, and cross‑domain signal mapping. It provides grounded natural‑language access to engineering knowledge, template‑guided generation of executable NX Open journal scripts, and ranked mapping suggestions between PLC variables and NX MCD simulation objects. Evaluation on a laboratory‑scale discrete manufacturing system shows that the approach reduces manual cross‑domain interpretation effort and makes recurring VCM engineering tasks more actionable.

PaperID: 2296, https://arxiv.org/pdf/2606.03223.pdf

Authors: Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang

Title: BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

Abstract:
Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self‑navigating swarm robots. This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.

PaperID: 2297, https://arxiv.org/pdf/2606.03221.pdf

Authors: Shiteng Cao, Jingwen Liu, Junda She, Zhiheng Li

Title: VirtualMLE: A Virtual ML Engineer that Optimizes Sequential Recommenders

Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning, reflection, and tool utilization, unlocking new paradigms for automating complex engineering workflows. However, in the domain of sequential recommendation (SR), tuning models on new datasets still relies heavily on the manual trial‑and‑error of experienced machine learning engineers. To bridge this gap, we propose VirtualMLE, an LLM‑agent framework that leverages the cognitive capabilities of LLMs to organize recommender optimizing into a closed loop of execution, reflection, and memory update. After each trial, the agent explicitly analyzes the observed outcomes and stores concise heuristic feedback in a hierarchical memory system. We evaluate VirtualMLE on three Amazon SR benchmarks with two representative backbones, SASRec and HSTU. VirtualMLE reaches competitive recommendation quality with substantially fewer trials. Furthermore, we observe that cognition summaries distilled from previous datasets can significantly accelerate the search process on unseen datasets, demonstrating the potential of transferring tuning heuristics. Overall, our results provide compelling evidence that LLM agents equipped with reflection and memory can serve as practical virtual engineers to automate and amortize heuristic learning in SR optimization. Our codes are available.

PaperID: 2298, https://arxiv.org/pdf/2606.03217.pdf

Authors: Kaito Takanami, Cengiz Pehlevan

Title: An Asymptotic Theory of Chain-of-Thought in In-Context Learning

Abstract:
Chain‑of‑thought (CoT) reasoning has become a widely used mechanism for eliciting multi‑step reasoning in large language models by generating intermediate reasoning steps at inference time. Yet the scaling behavior of generalization with CoT depth remains poorly understood. To address this question, we study a theoretically solvable model of CoT for in‑context weight prediction in linear regression, where test‑time reasoning is represented as an iterative refinement of the weight‑parameter estimate. Using tools from random matrix theory under high‑dimensional asymptotics, we derive an exact formula for the generalization error as a function of reasoning depth, pretraining data amount, and context length. Our analysis reveals a sharp phase transition separating exponential and polynomial improvement, saturation, and overthinking, and characterizes how the optimal reasoning depth scales. We further show that deeper reasoning is most effective with sufficiently rich pretraining and in‑context information, whereas limited pretraining or context makes longer reasoning prone to error amplification or saturation. We also validate these predictions through experiments on fully learned linear attention and softmax attention models. Our results provide a unified theoretical account of how test‑time CoT depth affects generalization.

PaperID: 2299, https://arxiv.org/pdf/2606.03197.pdf

Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

Title: MemTrain: Self-Supervised Context Memory Training

Abstract:
Memory is an indispensable capability for long‑horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory‑agent approaches are typically trained end‑to‑end with reinforcement learning on downstream tasks. However, collecting high‑quality annotated problems for memory‑intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self‑supervised training framework for generally enhancing the context‑memory capability of LLM agents for more effective downstream post‑training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end‑to‑end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long‑text QA and search‑based QA benchmarks demonstrate that MemTrain consistently improves downstream memory‑intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task‑specific post‑training.

PaperID: 2300, https://arxiv.org/pdf/2606.03190.pdf

Authors: Shuning Zhang, Mingyao Xu, Zhixin Huang, Yutong Jiang, Rongjun Ma, Yuting Yang, Xin Yi, Kanye Ye Wang, Hewu Li

Title: Focused on the User, Overlooking the Risks: Security and Privacy Understandings, Practices and Challenges of Independent Chinese AI Agent Developers

Abstract:
The proliferation of AI agents empowers independent developers, defined as individual or small groups who self‑initiate projects rather than fulfill client‑based contracts, to create sophisticated autonomous systems, but also introduces novel security and privacy (S&P) challenges beyond traditional corporate structures. We conducted an interview study (N=28) with Chinese developers, whose extensive use of global LLM services offer valuable insights into this population. We investigate their understandings, practices and challenges of S&P challenges in their developed AI agent products. We revealed that independent developers frequently think and act from their users' perspective. They focused on user‑facing safety risks such as harmful content while exhibiting low awareness of security vulnerabilities. Consequently, developers rely almost exclusively on ad‑hoc, manually crafted safeguards and informal communication, with an absence of formal tools or processes for S&P practices. We found these actions are driven by various inhibitors, primarily a lack of formal training on S&P related skills, accessible security tools and actionable guidance from platforms. Our work contributed the first exploration of independent AI agent developers' S&P understanding, outlining opportunities for tailored security tooling.

PaperID: 2301, https://arxiv.org/pdf/2606.03175.pdf

Authors: Xunyi Zhao, Sihao Lin, Gengze Zhou, Zerui Li, Shijie Li, Wei Tao, Jiajun Liu, Qi Wu

Title: Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Abstract:
Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under‑specified natural‑language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route‑level guidance alike, letting agents boost success rate through repeated high‑information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost‑sensitive uncertainty‑reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information‑gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data‑derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost‑sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero‑shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.

PaperID: 2302, https://arxiv.org/pdf/2606.03157.pdf

Authors: Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

Title: ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Abstract:
Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision‑making scenarios. Existing benchmarks primarily assess LLM performance in single‑course settings and lack systematic evaluation in multi‑course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi‑course clinical decision‑making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first‑course examination/diagnosis/treatment, subsequent multi‑course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi‑agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings ‑‑ a single‑turn static setting and a multi‑turn dynamic setting ‑‑ and assess three categories of LLMs: 1) closed‑source LLMs like GPT5‑mini; 2) open‑source LLMs like DeepSeek‑V3.2; and 3) medical LLMs like HuatuoGPT‑o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

PaperID: 2303, https://arxiv.org/pdf/2606.03143.pdf

Authors: Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

Title: FederatedSkill: Federated Learning for Agentic Skill Evolution

Abstract:
Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self‑improvement. However, isolated single‑user task streams lack the diversity required to build comprehensive skills. While cross‑user collaboration can overcome this data bottleneck, current trajectory‑sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy‑preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client‑specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self‑evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.

PaperID: 2304, https://arxiv.org/pdf/2606.03137.pdf

Authors: Kaiqi Yang, Tai-Quan Peng, Sanguk Lee, Hui Liu

Title: Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Abstract:
LLM‑based multi‑agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think‑Before‑Speak), an interval‑based multi‑agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance‑related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co‑evolve over time. We evaluate TBS in simulated town hall discussions on a climate‑related policy issue. Results show that TBS produces coherent internal‑state traces and that these traces vary systematically across turn‑allocation, silence, and memory conditions. Dissonance‑related appraisal increases agents' willingness to speak, whereas silence‑pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn‑allocation rules. These findings suggest that TBS supports mechanism‑sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.

PaperID: 2305, https://arxiv.org/pdf/2606.03135.pdf

Authors: Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Ying Zhao, Zhijiang Guo, Wei Wang

Title: Uncertainty-Aware Clarification in LLM Agents with Information Gain

Abstract:
Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal‑oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground‑truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent‑tool‑user environment. We validate our framework within a clarification‑enhanced τ‑Bench environment, conducting cross‑agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no‑clarification baseline, while adding only 0.3 total interaction steps on average.

PaperID: 2306, https://arxiv.org/pdf/2606.03115.pdf

Authors: Royce Carbowitz, Dheeraj Kumar

Title: SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering

Abstract:
Multi‑agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovations: (1) wave‑based topological dispatch that computes parallel execution waves from task dependency graphs; (2) dual validation gates applying quality metrics before execution (planning validation) and after (code validation) to reduce rework cycles; and (3) Human‑as‑an‑Agent (HaaA) integration, where a human specialist participates in decomposition and can be consulted during execution. SPOQ uses a three‑tier agent hierarchy (Opus workers, Sonnet reviewers, Haiku investigators) to optimize cost‑quality tradeoffs. We evaluate SPOQ through four experiments. Experiment 1: wave dispatch approaches the critical‑path lower bound (ratio 1.03‑‑1.11, speedup up to 14.3x); on a 2‑slot local backend it delivers a stable 1.4x speedup. Experiment 2: SPOQ improves planning coverage from 93.0 to 99.75, eliminates cyclic plans, and lifts parallelism from 31.0 to 75.25. Experiment 3: dual validation reduces defects from 0.34 to 0.20 per task and lifts test pass rate from 91.25% to 99.75%. Experiment 4: human review reduces residual defects from 0.47 to 0.03 per task. Results are replicated on a locally hosted open‑weights model (Qwen3.6‑35B‑A3B), verifying gains are attributable to orchestration rather than any specific model. A longitudinal study across 17 repositories, 8,589 commits, 1,822 tasks, and 13,866 tests (99.87% pass rate) provides ecological validation.

PaperID: 2307, https://arxiv.org/pdf/2606.03099.pdf

Authors: Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou

Title: PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

Abstract:
Deep Image Search requires multi‑step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM‑based agents are stateless and reactive, lacking persistent memory to maintain long‑horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training‑free, hierarchical memory system for photo‑search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi‑step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context‑aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.

PaperID: 2308, https://arxiv.org/pdf/2606.03083.pdf

Authors: Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li, Xu Chen

Title: DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

Abstract:
Large Language Model (LLM)‑based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal‑conditioned task experience as reusable skills and another for scene‑level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure‑penalized similarity scan locates the best match, reconstructing the full experience via root‑to‑match chain composition. An autonomous consolidation mechanism distills high‑frequency paths into new root nodes, enabling the trees to self‑organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import‑myself/DeltaMem.

PaperID: 2309, https://arxiv.org/pdf/2606.03056.pdf

Authors: Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang

Title: SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

Abstract:
As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity‑matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter‑skill relationships as a typed directed graph and exposes it to an LLM agent as an inference‑time, agent‑callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed‑edge neighbors, and conflict signals, and a propose‑then‑commit protocol lets the agent register execution‑backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax‑M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph‑of‑Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt‑5.2‑codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding‑diffusion pipeline degrades, and set‑monotone online edits that enlarge ground‑truth recall without evicting prior hits.

PaperID: 2310, https://arxiv.org/pdf/2606.03034.pdf

Authors: Gaurav Naresh Mittal

Title: Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

Abstract:
Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor. We argue these difficulties share one cause: the market for lemons. When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today's agent protocols. We make four contributions: (1) a failure taxonomy that names confident‑wrong as a non‑adversarial, correlated subclass of Byzantine faults that classical fault‑tolerance mismodels; (2) a market‑for‑lemons model showing that faith‑based protocols admit only a low‑trust equilibrium; (3) the Trust Layer, a thin, protocol‑agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability‑composition bound for delegation chains with an end‑to‑end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.

PaperID: 2311, https://arxiv.org/pdf/2606.03032.pdf

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Ningnan Wang, Nancy F. Chen, Min-Yen Kan

Title: The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

Abstract:
Multi‑agent LLM systems often treat consensus as evidence of successful interaction. For deliberative problems, however, reliability depends on whether agents preserve the facts and viewpoints needed to interpret an issue. We identify the deliberative illusion: discussion produces (1) factual attrition, the progressive loss of issue‑critical facts, alongside (2) stance homogenization, the collapse of diverse positions toward consensus. To measure this process, we introduce DelibTrace, a framework that decomposes each issue into atomic facts, labels issue‑critical ones, distributes them across agents, and tracks their survival across discussion rounds. Across ethical and news‑based deliberation with three representative LLM families, multi‑agent discussion erases up to 72% of issue‑critical facts. This loss is consequential: retained evidence can reconstruct the issue misleadingly, final stances remain anchored in base‑model priors, and a single malicious agent can inject misinformation into the shrinking shared context. These results reveal a sharper risk: agents can agree more while knowing less. We call for evaluations that measure which facts, uncertainties, and legitimate disagreements survive interaction.

PaperID: 2312, https://arxiv.org/pdf/2606.03031.pdf

Authors: Yan Wang, Xuguang Ai, Jaisal Patel, Xueqing Peng, Fengran Mo, Yupeng Cao, Haohang Li, Mingyu Cao, Lingfei Qian, Víctor Gutiérrez-Basulto

Title: AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

Abstract:
Structured financial audit verification is difficult for language‑model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph‑grounded multi‑agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US‑GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing‑derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT‑5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.

PaperID: 2313, https://arxiv.org/pdf/2606.03030.pdf

Authors: Yukihiro Hoshino, Ayato Kitadai, Nariaki Nishino

Title: Do Matching Mechanisms Work with LLM Agents?

Abstract:
This study examines whether standard matching mechanisms function as intended in LLM‑agent markets, where LLM agents make allocation‑related decisions as delegated decision‑makers. We compare decentralized free‑negotiation markets with centralized mechanism‑based markets including several representative mechanisms. Across controlled one‑to‑one matching environments, mechanism‑based markets generally outperform free negotiation in terms of stability and efficiency. We also find that LLM agents report preferences truthfully at substantially higher rates than human subjects in comparable DA and EADA environments. However, truth‑telling is not uniformly aligned with formal strategy‑proofness across all mechanisms: TTC, despite being strategy‑proof, does not always elicit higher truth‑telling than EADA. These results suggest that matching theory provides a useful but incomplete guide for designing institutions in LLM‑agent markets.

PaperID: 2314, https://arxiv.org/pdf/2606.03024.pdf

Authors: Shidong Pan, Xiaoyu Sun, Tianyi Zhang, Dianshu Liao, Meixue Si, Zhenchang Xing

Title: SkillGuard: A Permission Framework for Agent Skills

Abstract:
Agent skills extend LLM agents with reusable instructions, scripts, tool bindings, and contextual dependencies. However, current skill ecosystems largely rely on trust‑based loading and static inspection, leaving a gap between what a skill can inject into an agent's context and what it can cause the agent to do at runtime. This gap introduces new security and privacy risks, and existing defenses primarily inspect skill files statically or regulate individual tool calls, without systematically connecting a skill's declared intent with its runtime behavior. In this paper, we present SkillGuard, a skill‑centric permission framework that treats skills as permission‑bearing executable artifacts. SkillGuard introduces a dual‑plane governance model that jointly regulates context influence and action side effects through skill manifests, runtime access control, user‑mediated authorization, deny‑by‑default enforcement, capability inference, and behavior monitoring. We evaluate SkillGuard on 315 real‑world skills and SkillInject. The permission taxonomy covers 99.76% of observed protected objects, and automated manifest generation reaches 91.0% F1. In adversarial evaluations, SkillGuard reduces attack success from 32.37% to 23.02% for contextual injections and from 25.56% to 16.67% for obvious injections, while maintaining benign task utility. These results suggest that SkillGuard, as a skill‑centric permission framework, can provide a practical foundation for improving the privacy and security of agent skill ecosystems.

PaperID: 2315, https://arxiv.org/pdf/2606.02994.pdf

Authors: Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Title: Inducing Reasoning Primitives from Agent Traces

Abstract:
ReAct‑style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single‑pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo‑tools. Each pseudo‑tool is specified by a natural‑language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 ‑> 74), +30pp on MuSR team allocation (38 ‑> 68), and +22pp on NatPlan meeting planning (7 ‑> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint‑satisfaction planning, a single fixed configuration improves over zero‑shot Chain‑of‑Thought on every subtask, matches or surpasses expert‑authored decompositions, and outperforms AWM at lower average inference cost.

PaperID: 2316, https://arxiv.org/pdf/2606.02963.pdf

Authors: Taras Sereda, Burak Bartan, Ankita Nayak, Tom St. John, Natalie Serrino, Zain Asgar

Title: KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

Abstract:
Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi‑agent coordination, each with distinct compute and memory profiles. For optimal efficiency, each stage should run on the accelerator best suited to it. This creates a systems challenge: each pipeline now requires high‑performance kernels across a growing set of hardware backends and programming models. Writing these kernels by hand is time‑consuming, demands deep low‑level expertise, and does not scale as kernel complexity grows. Recently, Large Language Models (LLMs) have been leveraged for automatic kernel generation, but challenges in low‑level code generation and cross‑backend generalization persist. We present KForge, a cross‑platform framework built around an iterative refinement loop driven by two collaborating LLM‑based agents: a generation agent that produces and progressively refines kernels using compilation and correctness feedback, and a performance‑analysis agent that interprets profiling data, from programmatic APIs to GUI‑based tools, and emits recommendations that steer the next round of synthesis. The loop alternates between functional passes, which drive a candidate to correctness, and optimization passes, which close the performance gap to hand‑tuned baselines. We evaluate KForge on two backends with very different baseline reference availability. On NVIDIA B200, KForge achieves a 2.12% improvement in end‑to‑end throughput compared to TensorRT‑LLM on the gpt‑oss‑20b inference speed benchmark. On Intel Arc B580, KForge generates Triton kernels achieving a 5.13× geometric mean speedup over the faster of PyTorch eager and torch.compile on 37 GEMM + tail‑ops workloads from KernelBench Level 2, primarily via operator fusion and mixed‑precision execution.

PaperID: 2317, https://arxiv.org/pdf/2606.02950.pdf

Authors: Natalia Theodora, Ploy Thajchayapong, Ashok K. Goel

Title: Powering An Ecosystem Of Pedagogical AI Agents: A Validation Strategy For A Unified Data Architecture

Abstract:
The application of AI in education has evolved from monolithic intelligent tutoring systems to a diverse ecosystem of pedagogical agents, including conversational assistants, virtual coaches, and adaptive tutors. This shift requires a unified and scalable data architecture to manage the complex information feedback loops between human instructors, learners, and the varied AI agents. The design, development, and deployment of the data architecture in turn raises a critical issue of validation. This paper addresses this critical need by describing a practical validation strategy for a high‑volume data pipeline developed as part of a data architecture for AI‑augmented adult learning at the National AI Institute for Adult Learning and Online Education. Our approach involves a two‑stage testing methodology to ensure both functional diversity and real‑world scalability. First, the QA environment uses a blend of synthetic and real‑world data to validate functional correctness across various event types produced from learner and agent interactions. Following this, the production environment successfully processed a total of over 2.7 million production requests across 21 successful runs carrying authentic event data from a large‑scale online program. This validation process surfaced crucial insights into data privacy, a key challenge when handling varied data from multiple AI agent data sources. By outlining a replicable testing strategy for a unified data backbone, this research offers a clear framework for institutions and developers aiming to build and support their own heterogeneous suites of AI‑powered learning tools. Keywords: Pedagogical Agents, Learning Ecosystems, Data Architecture, Validation, Scalability, Learning Analytics.

PaperID: 2318, https://arxiv.org/pdf/2606.02908.pdf

Authors: Hengrui Gu, Xiaotian Han, Kaixiong Zhou

Title: WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Abstract:
Multi‑turn user‑facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectory has become a central route to train agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write‑intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read‑tool evidence before its arguments become identifiable, a challenge that write‑intensive data alone cannot address. Guided by this insight, we propose WRIT (\ulineWrite‑\ulineRead \ulineIntensive \ulineTrajectory Synthesis), a pipeline for synthesizing multi‑turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write‑intensive and read‑heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent‑user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence‑grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT‑5.1 no‑think on τ^2‑bench and substantially reduces inference‑time token usage, showing that compact SFT data can convert part of expensive test‑time reasoning into efficient agent behavior.

PaperID: 2319, https://arxiv.org/pdf/2606.02875.pdf

Authors: Dipesh KC, Anjila Budathoki

Title: Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

Abstract:
Coding‑agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through \emphhandoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff‑point tasks and 724 takeover runs per successor model. Across three successor models, context‑bearing handoffs reduce median agent events by 20‑‑59% and cumulative prompt tokens by 42‑‑63% relative to repository‑only takeover. Solved‑rate effects are smaller and model‑dependent, but efficiency gains are consistent. These findings suggest that coding‑agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.

PaperID: 2320, https://arxiv.org/pdf/2606.02871.pdf

Authors: Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

Title: Adaptive Latent Agentic Reasoning

Abstract:
Large reasoning models improve performance by generating extended chain‑of‑thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi‑turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual‑mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain‑of‑thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool‑use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy‑efficiency trade‑off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

PaperID: 2321, https://arxiv.org/pdf/2606.02867.pdf

Authors: Petra Ferenz, Ava Keeling, Tobias O'Keefe, Lorenzo Stigliano, Francesco Di Lauro, Andres Colubri, Jasmina Panovska-Griffiths

Title: The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

Abstract:
Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi‑LLM framework: a novel integration of agent‑based modelling, real‑life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no‑intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58‑65% on day six of the 15‑day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour (β= 0.33, p = 0.002), yielding a pseudo‑R^2 of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low‑variance architectures offer greater internal validity for testing behavioural rules, while high‑variance models may better represent real‑world decision‑making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof‑of‑principle work lays the groundwork for deploying the Epi‑LLM framework as a scalable, risk‑free simulation environment for pandemic preparedness research.

PaperID: 2322, https://arxiv.org/pdf/2606.02866.pdf

Authors: Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

Title: When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Abstract:
When does multi‑agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task‑condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (‑1.6 to ‑15.5pp) through critique‑induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self‑verification with identical tools fails, while a separate Critic with code‑execution grounding and evidence‑gated generation produces the first debate configuration to significantly exceed single‑agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.

PaperID: 2323, https://arxiv.org/pdf/2606.02862.pdf

Authors: Marcus Rüb, Michael Gerhards

Title: Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

Abstract:
The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server‑class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real‑time control and agentic intelligence. We introduce a tiered design that decouples On‑Device Agents ‑ executing highly compressed neural networks and rule‑based logic for low‑latency, privacy‑critical tasks ‑ from Cloud‑Augmented Agents that leverage Small Language Models (SLMs) for higher‑level reasoning and planning. A key contribution is the integration of a cross‑cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade‑offs regarding latency, energy, and reliable execution in resource‑constrained environments.

PaperID: 2324, https://arxiv.org/pdf/2606.02859.pdf

Authors: Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

Title: Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Abstract:
How can a population of agents self‑orchestrate and self‑adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi‑step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed‑system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long‑term global performance. Our results suggest a new path to multi‑agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

PaperID: 2325, https://arxiv.org/pdf/2606.02812.pdf

Authors: Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Title: Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Abstract:
Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long‑context multimodal sequences. Existing LLM‑based multi‑agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj‑Evolve, a self‑evolving multi‑agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non‑parametric memory, indexing rejection‑sampled reasoning traces to retrieve similar patients as few‑shot contexts. Second, multi‑agent reinforcement learning (MARL) via reward‑ranked fine‑tuning parametrically optimizes inter‑agent and agent‑memory collaboration. A leave‑one‑out cross‑retrieval strategy unifies the two, aligning training‑ and inference‑time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj‑Evolve outperforms 9 strong baselines on the overall population and a challenging never‑smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

PaperID: 2326, https://arxiv.org/pdf/2606.02754.pdf

Authors: Peixuan Han, Hongyi Du, Jiayu Liu, Yihang Sun, Yutong Liu, Jiaxuan You

Title: $Ψ$-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

Abstract:
Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their ability to interact with users and provide suggestions or guidance proactively. To systematically evaluate such proactive personalization in realistic interactions, we propose Ψ‑Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation. We design three real‑world interaction scenarios that involve persuasion in Ψ‑Bench, and endow simulated clients with personal characteristics through explicit user profiles derived from dialogue histories. We evaluate 10 frontier LLMs on Ψ‑Bench and find that while most models can produce coherent and reasonable arguments, even state‑of‑the‑art models still leave considerable room for improvement in persuasion. We also find that providing access to client profiles yields an average performance gain of 18.24%, highlighting the importance of user‑specific information for effective persuasion. Overall, our work highlights persona‑sensitive influencing as a challenging yet practical direction for evaluating and developing more proactive personalized LLM agents. Codes are available at: https://github.com/Hanpx20/Psi‑Bench.

PaperID: 2327, https://arxiv.org/pdf/2606.02753.pdf

Authors: Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

Title: MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Abstract:
Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi‑agent settings introduces two critical challenges: data scarcity (coordinated multi‑view recordings are prohibitively expensive to collect for general open‑domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi‑agent video world models to open‑domain environments directly from single‑view videos. First, we introduce Monocular World‑State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego‑motion and the visible subject's spatial trajectory. This camera‑trajectory decomposition naturally extracts synchronized multi‑agent motion data within a shared 3D space, completely bypassing the need for multi‑camera setups. Second, for precise visual control, we develop the Subject‑Aware World Generator to enable appearance‑driven simulation conditioned on per‑agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World‑State Alignment, a per‑frame inter‑branch cross‑attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well‑aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross‑view consistency and identity fidelity, establishing a highly scalable, physics‑driven paradigm for multi‑agent video world modeling.

PaperID: 2328, https://arxiv.org/pdf/2606.02674.pdf

Authors: Eden Yavin, Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Gal Baron

Title: Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

Abstract:
The rapid proliferation of multi‑cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross‑vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single‑platform environments and provide no means to assess whether an AI agent can reason across these fragmented boundaries. To address this gap, we introduce the Cross‑Vendor Sola ISPM Benchmark, a production‑grade benchmark of 50 data‑grounded tasks requiring multi‑hop entity resolution and cross‑system correlation across eight integrated enterprise platforms including AWS, Okta, Azure AD, and Google Workspace. We also contribute an evaluation framework measuring not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. We evaluate the Sola AI Agent across five context configurations ‑ from no injected metadata to full schema, graph, and retrieval context ‑ using three frontier LLMs. Results show that structured relational context improves answer correctness by approximately 34% relatively and reduces exploration queries by approximately 70% across all tested models, with the largest gains driven by cross‑vendor graph topology. Our findings indicate that frontier LLMs possess substantial latent security reasoning capability, but reliable cross‑vendor identity analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. Under full context, the best configuration achieves 78% answer correctness while reducing complete failure to 4%.

PaperID: 2329, https://arxiv.org/pdf/2606.02668.pdf

Authors: Xiaoqi Weng

Title: What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents

Abstract:
Coding agents gate consequential actions behind a human‑in‑the‑loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies‑in‑the‑Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted‑path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low‑level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer‑relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind‑to‑execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted‑tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal‑usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over‑prompting, and a boundary‑only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.

PaperID: 2330, https://arxiv.org/pdf/2606.02646.pdf

Authors: Blaž Bertalanič, Carolina Fortuna

Title: The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Abstract:
Inference‑time multi‑agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two‑parameter scaling law R(N) = N_\texteff/N = 1/(1+c(N‑1)N^‑β) where the regime exponent β classifies any configuration into one of three asymptotic regimes ‑‑ hard‑ceiling at 1/c (β= 0), sublinear at N^β/c (0 < β< 1), or linear (β\ge 1), and a mean‑field theorem predicts that peer count k and rounds τ during agent debate enter the dynamics only through their product kτ. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model × task × condition) cells spanning peer debate, self‑correction, random‑noise placebo, self‑consistency, three open‑weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at R^2 > 0.99; only (c, β) shifts. On free‑form math, dense peer influence collapses the answer‑level regime from sublinear into hard‑ceiling; correctness‑level fits remain hard‑ceiling throughout. Three findings have practical implications. \emph(i)~Thirty dense debating agents produce no more answer diversity than one on MMLU‑Hard. \emph(ii)~A noise placebo tracks self‑correction on free‑form math and at 4× scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re‑evaluation, not peer content. \emph(iii)~A single N \le 5 pilot predicts the N=30 structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers c and escapes the hard‑ceiling regime, communication‑mode interventions do not.

PaperID: 2331, https://arxiv.org/pdf/2606.02643.pdf

Authors: Chengliang Liu, Liangbo Ning, Yujuan Ding, Wenqi Fan

Title: Inference Cost Attacks for Retrieval-Augmented Large Language Models

Abstract:
Retrieval‑Augmented Generation (RAG)‑enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi‑stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation. We argue that a more feasible and potent threat to RAG‑enhanced LLM systems arises from poisoning external knowledge bases (e.g., web knowledge from the Internet). In this work, we introduce the Retrieval‑Augmented Inference Cost Attack (RA‑ICA), a novel attacking paradigm that targets the computational cost of RAG‑enhanced LLM systems by injecting malicious documents into external knowledge corpus. To operationalize this attack, we propose Computational Resource Exhaustion via External Poisoning (CREEP), a novel framework that leverages LLM agents to automatically craft malicious documents that are both semantically relevant for retrieval and potent for inducing an abnormal increase in token consumption during the inference phase. To enhance the attack's effectiveness, we introduce Memory‑Augmented Group Relative Policy Optimization (MA‑GRPO), a novel reinforcement learning algorithm that fine‑tunes the agents by learning from a dynamic memory of historical best adversarial documents. Extensive experiments across three real‑world datasets demonstrate that RA‑ICA increases token consumption by up to 13.12 times with an over 90% success rate, without degrading the integrity of the generated answer.

PaperID: 2332, https://arxiv.org/pdf/2606.02641.pdf

Authors: Yifan Wang

Title: CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

Abstract:
Interactive driving exposes a failure mode that is easy to miss in rule‑aware autonomous‑driving stacks: a hard‑rule margin can be negative for an ego candidate even though a small lawful accommodation by a non‑priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction‑based planners model likely responses. Neither returns a runtime proof object that states which bounded multi‑agent edit repairs the maneuver, who owns the edit, whether the request is right‑of‑way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as interactive repair certification and introduce CARVE, a prediction‑free certificate layer over a finite lattice of ego‑owned and agent‑owned tactical operators. Agent‑owned requests are admissible only inside $B_j(s) = β(π_j)α_j^\max(s)$, a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility‑weighted cost split, and fallback. On 589 Lanelet2‑geometry‑grounded INTERACTION replay episodes, CARVE‑Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human‑resolved false vetoes, while preserving 589/589 right‑of‑way respect, zero priority‑agent false positives, and 400/400 negative‑stress vetoes. We prove certificate soundness, structural right‑of‑way respect, exact finite‑lattice minimality, fallback contingency, and blame‑consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.

PaperID: 2333, https://arxiv.org/pdf/2606.02614.pdf

Authors: Antonio de Sousa Leitão Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa, Luís Jorge Mesquita de Jesus, Dennys Correia da Silva, Allan Kardec Duailibe Barros Filho

Title: Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

Abstract:
The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazonas basin. Its assets are fiscally and territorially linked primarily to Maranhao ‑‑ the state with the lowest HDI in the Federation (0.676, IBGE 2022). This raises the central policy question: under what conditions does BEM exploration generate net positive externalities for Maranhao? The problem is intrinsically multi‑agent: the Federal Government seeks revenue and energy security; the state seeks regional welfare under constitutional royalty earmarking; the operator maximizes profit under risk; ANP and IBAMA hold conflicting mandates; and Amazonian communities prioritize territorial and environmental vectors over monetary income. We present Margin Play, a Multi‑Agent Reinforcement Learning (MARL) system simulating these tensions under Brazilian empirical calibration and classical economic literature. It implements six agents under the CTDE paradigm, trained with BRO‑MARL. Results from 60,000 episodes across six scenarios indicate the answer is conditional on the institutional regime: under the reference baseline, the welfare gain is marginal (Waval approx. 1.68), whereas the MA‑Prospero configuration yields Delta W = +17.5% and Delta Rcom = +21.3%, with a lower environmental liability (Eamb = 0.048 vs. 0.076). The fundamental problem is not a trade‑off between production and welfare, but the choice of public policy regime linked to exploration.

PaperID: 2334, https://arxiv.org/pdf/2606.02320.pdf

Authors: Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

Title: TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

Abstract:
Deep Research Agents have shown strong capability in multi‑step information retrieval, reasoning, and long‑form report generation, but existing benchmarks and systems remain predominantly text‑centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text‑Visual Interleaved Report Generation), which includes TVIR‑Bench, a benchmark of 100 expert‑curated multimodal deep research tasks that require visual elements to serve specific analytical sub‑goals, and TVIR‑Agent, a hierarchical multi‑agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context‑aware sequential writing. We further develop a dual‑path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR‑Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence‑driven report generation.

PaperID: 2335, https://arxiv.org/pdf/2606.02240.pdf

Authors: Hiskias Dingeto, William Leeney

Title: AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Abstract:
Indirect prompt injection in tool‑use agents is a concrete production threat: LLM agents read from integrations (third‑party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under‑measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open‑source guards are trained on chat‑style data rather than tool‑response content. We introduce AGENTREDBENCH, a dynamic LLM‑driven redteaming benchmark of 215 subtle underspecified authorization (attacks at the boundary of what the user's request authorises) scenarios across 24 enterprise integrations in nine functional families and five attack types. Across an eight‑model panel (Anthropic, OpenAI, Google), no‑guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer‑mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration‑diverse corpus of adversarial tool‑response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false‑positive rate, outperforming every open‑source baseline with non‑trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross‑integration and cross‑attack type holdouts both confirm the gain transfers beyond the training subset.

PaperID: 2336, https://arxiv.org/pdf/2606.01770.pdf

Authors: Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu

Title: Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Abstract:
Auto‑harness systems such as A‑Evolve, GEPA, and Meta‑Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open‑ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task‑wise adaptation. We introduce Adaptive Auto‑Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi‑agent evolver, a harness tree with solve‑time routing, and human‑steering hooks for cases where history lacks the needed signal. Across prediction‑market, security‑competition, and event‑forecasting streams, Adaptive Auto‑Harness outperforms five existing auto‑harness baselines and ablations attribute gains to better construction, routing, or targeted human steering. Code is available in \hrefhttps://github.com/A‑EVO‑Lab/a‑evolve/tree/release/adaptive‑auto‑harnessLink.

PaperID: 2337, https://arxiv.org/pdf/2606.01637.pdf

Authors: Jiaming Qu, Lucheng Fu, Yibo Hu

Title: Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

Abstract:
Large language models are increasingly used in multi‑agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open‑weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain‑of‑thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi‑agent LLM systems should verify peer answers rather than simply aggregate them.

PaperID: 2338, https://arxiv.org/pdf/2606.01613.pdf

Authors: Kanwar Bharat Singh

Title: TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning

Abstract:
This paper presents an agentic multimodal retrieval‑augmented generation (RAG) framework for domain‑specific literature reasoning, instantiated on a curated corpus of several thousand papers in intelligent tires, vehicle dynamics, vehicle control, sensing, estimation, and machine learning. Unlike conventional single‑pass RAG systems, the proposed architecture uses an autonomous, evidence‑gated pipeline that classifies query intent, generates separate text and visual query rewrites, performs hybrid text retrieval with FAISS and BM25 followed by cross‑encoder reranking, expands evidence through graph‑guided chunk traversal over a Neo4j knowledge graph, and retrieves visual document evidence using ColSmol late‑interaction embeddings with MUVERA fixed‑dimensional encoding, approximate nearest‑neighbor search, and MaxSim reranking. The framework scores evidence sufficiency using a 100‑point rubric with hybrid rule‑based/LLM review, retries retrieval through drift‑guarded reformulation, searches external academic databases through optimize‑‑search‑‑vet loops, merges and deduplicates multimodal evidence, verifies citation integrity, and generates cited answers through Planner, Researcher, Writer, and Critic agents with self‑correcting revision. Key contributions include: (i) a scalable multimodal retrieval architecture combining text, graph, and visual evidence over 40,000 document pages; (ii) an interpretable evidence sufficiency and retry mechanism; (iii) a multi‑agent generation pipeline with evidence mapping and critic‑driven revision; (iv) a domain knowledge graph with LLM‑based entity extraction, OpenAlex author validation, and intra‑corpus citation resolution; and (v) a route‑dependent external search architecture for targeted literature expansion. The result is a practical, evidence‑gated, multimodal agentic RAG architecture for technical reasoning over specialized research corpora.

PaperID: 2339, https://arxiv.org/pdf/2606.01567.pdf

Authors: Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narsimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan

Title: Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

Abstract:
Large language model (LLM) agents increasingly rely on reusable skills i.e. documents describing task‑specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian‑based defenses: an intermediary LLM agent that acts as a mediator for skill file access (dynamic guardian) or pre‑rewrites these files at build time (static guardian). Across three LLM agent families, our guardians cut attack success rate (ASR) by well over half while preserving task utility. Second, we stress test them through attack reframing using four attacks that preserve the malicious instruction but change the phrasing. For non‑guardian setup, the reframing pushes the ASR up to 81.4%, but the dynamic guardian brings it down to 18.6%, showing that real‑time mediation is a robust defense.

PaperID: 2340, https://arxiv.org/pdf/2606.01478.pdf

Authors: Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

Title: Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Abstract:
High‑quality, large‑scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial‑robotics algorithm development, from model‑based to data‑driven methods, gradient‑based to sampling‑based approaches, and single‑agent to multi‑agent systems. Compared to existing state‑of‑the‑art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real‑world experiments show Crazyflow supports both analytical‑gradient‑based policy learning, achieving sub‑centimeter trajectory tracking accuracy without domain randomization, and sampling‑based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train‑then‑deploy paradigm, we show that its unprecedented speed even enables in‑flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open‑source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light‑weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open‑source resource for synthetic data generation, with emerging capabilities for large‑scale parallelization for online, in‑execution learning and optimization, opening the door to novel algorithm development.

PaperID: 2341, https://arxiv.org/pdf/2606.01365.pdf

Authors: Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

Title: Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Abstract:
Failure‑aware observability diagnoses wasted computation in multi‑agent LLM systems before final‑answer evaluation can explain what went wrong. We propose a trace‑based framework for a three‑agent architecture ‑‑ orchestrator, search agent, and execution agent ‑‑ that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM‑as‑judge evaluation. On 165 GAIA validation traces under identical caps, 98 runs produce usable final answers and 67 fail or stop without one. Among warned failed runs, 58.1% of tokens are spent after the first warning on average, indicating substantial opportunity for intervention. A 10‑task Level‑2 pilot uses warnings to diversify search or require evidence, reducing post‑warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.

PaperID: 2342, https://arxiv.org/pdf/2606.01199.pdf

Authors: Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan

Title: Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Abstract:
Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long‑horizon organizational simulation as a memory‑centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate‑Partition‑Diagnose‑Align cycle and grounds execution through dependency‑aware trace memory. We evaluate TaskWeave in a year‑long IT company simulation and compare it with other multi‑agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long‑horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM‑based organizational simulators.

PaperID: 2343, https://arxiv.org/pdf/2606.01185.pdf

Authors: Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

Title: "Skill issues'': data-centric optimization of lakehouse agents

Abstract:
Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git‑like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data‑agent evaluation from an output‑matching problem into a state‑verification problem: agent‑generated pipeline code induces concrete, inspectable lakehouse changes. We present a data‑centric optimization pipeline that generates task‑verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace‑level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write‑path data workflows provide a useful substrate for optimizing agent skills beyond read‑only tasks.

PaperID: 2344, https://arxiv.org/pdf/2606.01170.pdf

Authors: Lucas Haug, Anarosa Alves Franco Brandão, Arthur Casals

Title: Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees

Abstract:
The application of multi‑agent systems in robotics is a very challenging field. Several competitions involving such systems are proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are the ones from the IEEE Very Small Soccer (VSSS) category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in a very dynamic environment of a soccer game. Thus, coordination of robots' behavior during the game is crucial to win it. In this paper, we present a Behavior‑Tree‑based approach to support multi‑robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de S\tildeao Paulo. Moreover, a comparison between the proposed approach and the previous one, which was based on a Finite State Machine (FSM), was conducted using the FIRASim simulator. Besides that, the performance of this new strategy was further evaluated in an academic robotics competition.

PaperID: 2345, https://arxiv.org/pdf/2606.01139.pdf

Authors: Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang, Qing Zong, Jiahe Guo, Zhongwei Xie, Yiyan Ji, Yauwai Yim, Hongyu Luo, Xiyu Ren, Ruan Chenyu, Haoran Li, Yangqiu Song

Title: SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Abstract:
Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self‑evolving methods refine skills using accumulated trajectories. However, they struggle in cold‑start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one‑shot LLM generation. Expert‑authored skills are costly and may not align with how LLM agents actually execute tasks, while one‑shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution‑grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution‑anchored edits. By re‑executing candidates and measuring empirical utility, it systematically retains the optimal skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one‑shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross‑model transferability, capturing generalized procedural knowledge over model‑specific artifacts.

PaperID: 2346, https://arxiv.org/pdf/2606.01094.pdf

Authors: Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan

Title: CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

Abstract:
Clinical order generation serves as a critical bridge between clinical decision‑making and real‑world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse‑grained decisions and overlook the fine‑grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two‑stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine‑tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi‑dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single‑agent, multi‑agent, and agentic reasoning methods, respectively.

PaperID: 2347, https://arxiv.org/pdf/2606.01072.pdf

Authors: Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman, Bernadette Bucher, Tarik Kelestemur

Title: Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

Abstract:
Imitation learning enables robots to learn how to execute tasks via observation. However, real‑world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object‑centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real‑world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long‑term reasoning and robust generalization under partial observability.

PaperID: 2348, https://arxiv.org/pdf/2606.01063.pdf

Authors: Ruoxuan Zhang, Qiaoqiao Wan, Zhengguang Wang, Chenghao Yu, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

Title: MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

Abstract:
Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human‑centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental‑state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor‑specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot‑centric ToM reasoning to a real‑time closed‑loop setting and introduce MindClaw, a framework for embodied mental‑state reasoning with precision intervention. MindClaw connects multi‑source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger‑skill optimization for closed‑loop embodied ToM assistance.

PaperID: 2349, https://arxiv.org/pdf/2606.01062.pdf

Authors: Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen

Title: DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Abstract:
Mixture‑of‑Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine‑grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling ‑‑ how expert outputs are aggregated. We theoretically show that replacing the standard weighted‑summation aggregation with structural aggregation expands the expert‑combination space without altering the experts or router, and enables possible multi‑step reasoning within a single MoE layer. To this end, we propose DAG‑MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG‑MoE consistently improves performance in both pretraining and fine‑tuning, surpassing traditional MoE baselines.

PaperID: 2350, https://arxiv.org/pdf/2606.01041.pdf

Authors: Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Title: ExpWeaver: LLM Agents Learn from Experience via Latent RAG

Abstract:
Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval‑augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross‑attention aggregation and gated residual mechanisms. The entire pipeline is optimized end‑to‑end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state‑of‑the‑art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non‑retrieval baselines while text‑based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross‑domain generalization, outperforming the strongest baseline by 16.32% under zero‑shot transfer and 15.21% under few‑shot transfer. Our code for ExpWeaver is released at https://github.com/ulab‑uiuc/ExpWeaver.

PaperID: 2351, https://arxiv.org/pdf/2606.00962.pdf

Authors: Hassan Touheed

Title: SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

Abstract:
Foundational agent interoperability standards, notably the Agent‑to‑Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi‑agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content‑based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance‑sensitive environments governed by GDPR, HIPAA, and MiFID II. We propose SS‑ZKR, a three‑mechanism privacy‑preserving routing protocol designed as a complementary layer atop A2A/MCP. Mechanism I introduces blind routing via differentially private semantic intent vectors cryptographically bound to zero‑knowledge proofs of payload‑schema consistency. Mechanism II offers vector‑weighted adaptive payload sanitisation with formal (epsilon, delta)‑differential privacy for numerical fields and heuristic semantic aggregation for textual fields. Mechanism III presents a spatial‑to‑cryptographic policy compiler that translates visually defined trust‑zone topologies into deterministic zero‑knowledge access circuits. We provide a formal threat model, analyse information leakage bounds of intent vectors, present pseudocode for all three mechanisms, and give analytical complexity comparisons against TEE‑based and homomorphic encryption‑based routing baselines. SS‑ZKR lets enterprises in financial services, healthcare, and defence orchestrate heterogeneous AI agents across regulatory boundaries without exposing proprietary data to routing infrastructure.

PaperID: 2352, https://arxiv.org/pdf/2606.00953.pdf

Authors: Xu Yang, Lunyiu Nie, Ethan Chandra, Stanislav Gannutin, Fangru Lin, Swarat Chaudhuri

Title: When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

Abstract:
Multi‑agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter‑agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi‑agent orchestration as a graph partitioning problem that captures the communication‑to‑computation trade‑off: task decomposition can shorten critical‑path computation, but cross‑agent dependencies require costly context transfer. We instantiate this view in repository‑level software engineering and present Cohesion‑aware Coder (Co‑Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency‑aware scheduler. Across 28 real‑world tasks on DevEval and CodeProjectEval, Co‑Coder advances the Pareto‑frontier over sequential and file‑based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall‑clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency‑dense projects. Co‑coder demonstrates how cohesion‑aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi‑agent systems.

PaperID: 2353, https://arxiv.org/pdf/2606.00949.pdf

Authors: Federica Tonti, Ricardo Vinuesa

Title: Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

Abstract:
We propose a method combining Multi‑Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall‑bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall‑shear stress and opposition control, three SHAP‑guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U‑net predicting the future velocity field; in the second, from SHAP attributions of a U‑net predicting the skin‑friction coefficient; in the third, from a combination of SHAP attributions of two U‑nets predicting the skin‑friction coefficient and the wall pressure fluctuations, respectively. The combined SHAP strategy based on skin‑friction coefficient and wall‑pressure fluctuations achieves the best overall performance, achieving a DR of 34.44% and a NES of 34.01% with only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall‑shear‑stress baseline, the proposed strategy simultaneously improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure‑gated actuation, activating predominantly at near‑zero wall pressure, and operates on a temporal timescale comparable to the lifetime of the near‑wall turbulent structures.

PaperID: 2354, https://arxiv.org/pdf/2606.00939.pdf

Authors: Chao Peter Yang, Zixiao Tan, Kaisen Yao, Ziyu Zhou, Eleanor Jiang, Michael Wu

Title: FinCom: A Financial Multi-Agent Demo with Disagree-or-Commit Deliberation

Abstract:
Multi‑agent systems powered by large language models (LLMs) are increasingly used for financial analysis and decision support. However, existing coordination schemes, especially those emphasizing consensus or debate, are vulnerable to sycophancy: agents conform to peer reasoning instead of evidence, leading to premature agreement and degraded outcomes. We introduce FinCom (Financial Committee), a governed multi‑agent framework and interactive system that operationalizes the Disagree‑or‑Commit (DoC) protocol to embed structured dissent into financial AI committees. A central Supervisor orchestrates three ReAct‑enabled specialist agents: Research, Quantitative, and Risk. Each agent is equipped with role‑specific tools for retrieval, computation, and stress testing. During deliberation, agents must either explicitly critique or commit to their peers' reasoning before converging on a unified recommendation. This demonstration showcases how FinCom supports committee‑style financial analysis through coordinated multi‑agent interaction, including structured report generation and interactive decision support. Evaluated across the most recent financial agent benchmark, in addition to 90 internal handcrafted financial tasks using an LLM‑as‑a‑Judge protocol, DoC improves reasoning accuracy and risk awareness significantly over a consensus‑seeking baseline on both an in‑house and external evaluation set. By reframing disagreement as a governance primitive rather than noise, FinCom offers a lightweight, prompt‑only recipe for improving accountability, transparency, and epistemic robustness in agentic financial systems.

PaperID: 2355, https://arxiv.org/pdf/2606.00933.pdf

Authors: Suk Ki Lee, Venkata Sai Deepak Mutta, Hyunwoong Ko

Title: Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

Abstract:
Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter‑agent interactions. This paper presents a framework for coordinated multi‑robot motion planning that combines decentralized generative trajectory planning with multi‑agent reinforcement learning (MARL)‑based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single‑agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient‑based steering, enabling interaction‑aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi‑agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value‑guided diffusion planning reduces the inter‑agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL‑based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi‑robot model.

PaperID: 2356, https://arxiv.org/pdf/2606.00922.pdf

Authors: Md Mainul Abrar, Xun Jia, Yujie Chi

Title: A Machine-to-Machine Knowledge-Guided LLM Agent for Generalizable Radiotherapy Treatment Planning

Abstract:
In this work, we propose a prototype machine‑to‑machine (M2M) knowledge‑guided Large Language Model (LLM) framework for automated radiotherapy treatment planning. In the proposed paradigm, Treatment Planning Parameter (TPP) distribution knowledge discovered by a Deep Reinforcement Learning (DRL) agent is transferred to an LLM agent through in‑context learning, enabling autonomous iterative planning without human intervention. While standard LLM‑based planning often lacks physical intuition and struggles with convergence, the integration of DRL‑derived guidance constrains the agent to a physically valid parameter space. Experimental evaluations are performed across three diverse planning scenarios: basic prostate cases, complex prostate configurations with increased organ‑at‑risk (OAR) constraints, and liver cases. The evaluation results demonstrate that the guided LLM agent consistently achieves optimal planning scores while significantly reducing the number of iterations compared to unguided planning. Analysis of the final TPP configurations reveals that the agent successfully learns a hierarchical priority of objectives, effectively restoring a logical "cause‑and‑effect" relationship between parameter tuning and dosimetric outcomes. Crucially, this prototype framework exhibits robust generalizability, maintaining high planning quality regardless of specific patient anatomy, treatment site, or initial plan quality. By bridging the specialized optimization of DRL with the adaptive reasoning of LLMs, this M2M framework establishes a scalable foundation towards generalizable autonomous treatment planning, ultimately benefiting clinical practice in realistic environments.

PaperID: 2357, https://arxiv.org/pdf/2606.00914.pdf

Authors: Rana Muhammad Usman

Title: Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

Abstract:
LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten‑turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default‑direction asymmetry in which a one‑sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^‑10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose‑response curve, survives a generator swap that rules out a writing‑style artifact, generalizes across several decision domains including security‑relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed‑level defenses; a frontier model retains its default. We characterize the recommender as a practical, default‑bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.

PaperID: 2358, https://arxiv.org/pdf/2606.00822.pdf

Authors: Zicai Cui, Zihan Guo, Weiwen Liu, Weinan Zhang

Title: SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

Abstract:
Skill‑based LLM agents increasingly rely on long procedural documents, but full‑document prompting wastes tokens and dilutes information critical to execution. We study this setting as intra‑skill retrieval, where the goal is to select a minimal, execution‑sufficient context from a known skill document given a query. We present SkillPager, a two‑stage framework that parses each Markdown skill into typed semantic nodes offline and leverages Maximal Marginal Relevance (MMR) to perform global, query‑conditioned node selection online. On a benchmark of 395 skills and 1,975 queries, SkillPager achieves 78.89% LLM‑judged context sufficiency, compared to 82.23% for the exhaustive full‑document baseline, while reducing prompt tokens by 47.04%. A granularity ablation shows that applying the same retrieval algorithm to raw fixed‑length chunks reaches a comparable 81.77% sufficiency but increases token cost by 28.81%, demonstrating that efficiency gains are driven by typed semantic granularity rather than the retrieval algorithm alone. Among graph‑based baselines, SkillPager outperforms the strongest baseline by a margin of 12.16%. Further ablations show that supporting content is most effective when retained in the candidate pool and selected adaptively rather than removed by static heuristics. These results identify typed intra‑document retrieval as a distinct access problem for skill‑based agents.

PaperID: 2359, https://arxiv.org/pdf/2606.00820.pdf

Authors: Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao, Ruiqi Xu, Shuyuan Zheng, Jianbin Qin

Title: Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate

Abstract:
Multi‑agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance‑induced conformity, and reasoning‑induced persuasion. Our three‑source decomposition framework isolates each through controlled counterfactual conditions. In the primary MMLU‑Pro setting, 37% of agent‑question observations change under self‑reflection alone, while robustness tests show substantial model‑dependent instability across GPQA‑Diamond and three model families; strict conformity is 29% in the primary setting and remains predominantly harmful across model replications (57‑77% correct‑to‑wrong). A controlled information‑gradient experiment reveals that even vacuous reasoning is associated with 20‑39% error adoption among resistant agents, with reasoning‑like presentation carrying substantial persuasive weight. Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk‑targeted intervention reduces it by 13.6 percentage points (p < 0.001). However, without correctness labels or self‑reflection controls, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.

PaperID: 2360, https://arxiv.org/pdf/2606.00804.pdf

Authors: Thanh Luong Tuan

Title: Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Abstract:
Enterprise multi‑agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single‑agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud‑validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre‑registered exact‑winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near‑best routing claim is strongly supported. In every pre‑registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality‑score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre‑registered Kendall's W test finds no reliable difference between Vietnamese‑domain and English‑domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed‑rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner‑selection law.

PaperID: 2361, https://arxiv.org/pdf/2606.00765.pdf

Authors: Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang, Tse-Hsun Chen

Title: FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

Abstract:
LLM‑based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter‑agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step‑level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency‑guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error‑introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm‑generated and hand‑crafted multi‑agent failure trajectories. The results show that FALAT consistently improves responsible‑agent and decisive‑step attribution. Its best configurations achieve 46.0% step‑level accuracy on algorithm‑generated trajectories and 29.1% on the more challenging hand‑crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency‑aware reasoning is essential for reliable failure diagnosis in LLM agent systems.

PaperID: 2362, https://arxiv.org/pdf/2606.00756.pdf

Authors: Yannan Wang, Longli Yang, Zhen Liu, Abhishek Kumar, Carsten Maple

Title: CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Abstract:
Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource‑constrained edge models often struggle with long‑horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine‑tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose \textscCoMIC, a parameter‑update‑free cloud‑edge framework for Collaborative Memory and Insights Circulation. \textscCoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal‑oriented hierarchical memory and selective re‑expansion of relevant histories, while a cloud‑side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross‑agent guidance keyed by semantic subgoal identifiers. Across five long‑horizon agent tasks spanning symbolic planning and text interaction, \textscCoMIC improves progress rate and action grounding for weak edge agents and yields task‑dependent success‑rate gains without updating model parameters.

PaperID: 2363, https://arxiv.org/pdf/2606.00717.pdf

Authors: Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

Title: Multi-Agent Conformal Prediction with Personalized Statistical Validity

Abstract:
Uncertainty quantification is essential in high‑stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi‑agent settings, existing works do not simultaneously and satisfactorily address these challenges with guarantees either limited to averages across agents or losing validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration‑conditional coverage guarantees for each participating agent and supports protocols with one‑shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state‑of‑the‑art federated conformal baselines.

PaperID: 2364, https://arxiv.org/pdf/2606.00714.pdf

Authors: Manuel Hernández, Eduardo Sánchez-Soto

Title: The Cartan-Topos Protocol: A Unified Geometric and Categorical Framework for Resilient Multi-Agent Coordination

Abstract:
Multi‑agent coordination faces a fundamental divide between continuous Euclidean consensus, which fails under non‑integrable constraints, and discrete symbolic logic, which collapses under open‑world assumptions. This report presents a unified geometric and categorical framework bridging these paradigms. Agent states are modeled on homogeneous manifolds (Lie groups, Grassmannians) with consensus achieved via Riemannian center‑of‑mass flows. Clifford‑algebraic representations (rotors, motors) enable singularity‑free SE(3) pose synchronization. Network interactions are formalized as cellular sheaves, where heterogeneous stalks connected by linear restriction maps replace uniform weights; the sheaf Laplacian drives diffusion toward globally consistent sections. The Cartan connection encodes logical holonomy directly into restriction maps. Asynchronous nonlinear sheaf diffusion guarantees linear convergence to Dirichlet energy minimizers under bounded delays. Sheaf‑Theoretic Planning (STP) models time as a Grothendieck topos, using intuitionistic logic and abductive repair for resilient temporal reasoning. Applications include discourse sheaves for opinion dynamics and knowledge sheaves for graph embedding. This synthesis establishes geometric consensus as a universal foundation for resilient multi‑agent systems across physical, epistemic, and temporal domains.

PaperID: 2365, https://arxiv.org/pdf/2606.00669.pdf

Authors: Sanjay Rawat

Title: NeuroLog: Reasoning You Can Audit -- Neuro-Symbolic Vulnerability Discovery via LLM Facts, Datalog, and SMT

Abstract:
Vulnerability discovery on C/C++ source asks the analyst to choose between heavyweight static analysers, which need a working build before a single query runs, and free‑form LLMs, which read source readily but invent details and lose track of cross‑function dataflow on real codebases. We present NeuroLog, an end‑to‑end build‑free pipeline that assigns each layer the role it is uniquely good at: an LLM extracts typed dataflow facts one function at a time; a Souffle rule mesh composes those facts into cross‑function findings; a Z3 post‑pass filters infeasible findings and emits a SAT model for each survivor. To go beyond pure static reasoning we also fold in runtime evidence: likely range invariants from a handful of corpus seeds tighten the SMT problem at near‑zero cost. A second LLM agent reads each SAT model and writes a Python program that produces a candidate crashing input, validated by an AddressSanitizer harness. Combining static‑narrowing‑SMT (Saturn, Pinpoint) and Datalog‑with‑SMT (Formulog) is prior art; new here are an LLM‑derived fact base, a no‑build pipeline, and the SAT model as an artifact (input to crash synthesis) rather than a yes/no verdict. Across stb, cJSON, libxml2, an FFmpeg demuxer slice, and curl 8.3.0, NeuroLog re‑discovers eight CVE‑class issues end‑to‑end, including the CVSS‑9.8 SOCKS5 heap overflow CVE‑2023‑38545, each ASan‑confirmed. On libarchive HEAD we surface five memory‑safety bugs (four previously unreported) across the cpio reader and the XAR/WARC/7zip writers; all filed upstream, several fixes merged, with the cpio use‑after‑free acknowledged in seven hours. Extraction takes ~37 s and 0.005 on stb; crash synthesis turned a static finding into a 102‑byte stb_vorbis crash in two LLM iterations (no fuzzer); a likely‑invariant filter from three Matroska seeds eliminates 13.2% of the FFmpeg‑demuxer feasible set.

PaperID: 2366, https://arxiv.org/pdf/2606.00655.pdf

Authors: Jialing Li, Zhouhong Gu, Yin Cai, Hongwei Feng

Title: Scaling Behavior of Single LLM-Driven Multi-Agent Systems

Abstract:
The burgeoning field of LLM‑based Multi‑Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigates how the performance of a homogeneous MAS evolves as the number of agents increases, isolating the variable of collaboration from model or knowledge heterogeneity. We propose the Sequential Iterative Multi‑Agent System (SIMAS) framework, a minimalist architecture centered on sequential inter‑agent communication, to clearly observe scaling effects. Through extensive experiments across diverse tasks and model scales, we establish that MAS performance does not scale monotonically with agent count but follows a pattern of diminishing returns, governed by a trade‑off between collaborative synergy and coordination overhead. Our findings reveal that effective MAS requires a sufficiently capable base LLM, that task type critically modulates the optimal agent count, and that collective intelligence is an emergent property contingent on strategic interaction design rather than a guaranteed outcome of agent plurality. The performance degradation stems coordination overhead rather than merely long‑context failure, and the scaling tendency generalizes across interaction architectures like structured debate topologies. This work provides a foundational understanding of MAS scaling laws, offering practical guidance for designing efficient collaborative systems and challenging the prevailing assumption that more agents invariably lead to better performance.

PaperID: 2367, https://arxiv.org/pdf/2606.00644.pdf

Authors: Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

Title: ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Abstract:
AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward‑looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast‑moving AI domains and four decision families. Each task is paired with a cutoff‑aligned offline knowledge base; post‑cutoff papers are hidden during generation and used only for validation. To avoid random future‑event prediction, tasks are derived from pre‑cutoff taxonomy branches and evidence signals, and answer‑generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research‑agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence‑decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward‑looking AI research judgement into a controlled benchmark for evaluating research agents as decision‑making systems.

PaperID: 2368, https://arxiv.org/pdf/2606.00611.pdf

Authors: Zhepei Hong, Lin Wang, Liting Li, Haokai Ma, Junfeng Fang, Fei Shen, Dan Zhang, Xiang Wang

Title: TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

Abstract:
Long‑horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn‑level or short‑context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long‑horizon agent safety detection as trajectory‑level evidence compression and propose Trajectory Risk‑Aware Compression for Long‑Horizon Agent Safety (TRACE). TRACE uses a Compressor‑Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory‑level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre‑Ex‑Bench, and R‑Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk‑critical segments and recover cross‑step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

PaperID: 2369, https://arxiv.org/pdf/2606.00610.pdf

Authors: Chuanjie Wu, Zhishang Xiang, Yunbo Tang, Zerui Chen, Qinggang Zhang, Jinsong Su

Title: MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large‑scale, unstructured corpora where information is highly fragmented. Graph‑based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment‑level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory‑based multi‑agent system to ensure high‑quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory‑aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state‑of‑the‑art baseline models with comparable efficiency. Our code is available at https://github.com/XMUDeepLIT/MemGraphRAG.

PaperID: 2370, https://arxiv.org/pdf/2606.00603.pdf

Authors: Luyang Zhang, Yi-Yun Chu, Ramayya Krishnan

Title: Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?

Abstract:
LLM agents are increasingly used in moderation‑relevant public forum workflows, where their choices to answer, acknowledge, repair, or decline are routinely challenged by users, platforms, and regulators. The same agent often returns different responses on identical content, so any defense based on the agent's behavior cannot be reliably reproduced. The variation is structural. Four deployment choices typically invisible to the operator each shift the agent's response rate, and their combinations can produce substantially different interventions on the same forum posts. The four choices are (1) which model version is currently served, which can change between calls without notice; (2) the model's weight‑release status (open‑weight, with weights publicly downloadable, vs. closed‑weight, with weights held by the provider); (3) which provider serves the request; and (4) which system‑prompt policy is in force. Across LLMs spanning both open‑weight and closed‑weight families, we find that the previously reported tendency to decline more on visible than hidden challenges aligns with the open/closed weight boundary in our panel more than with access surface. Every closed‑weight cell declines more on visible challenges; every open‑weight cell reverses this or shows no gap. Auditable forum‑agent governance requires awareness of all four choices, not just the model name, since each independently shifts behavior.

PaperID: 2371, https://arxiv.org/pdf/2606.00555.pdf

Authors: Zaifei Yang, Weiyu Chen, Yaqing Wang, James Kwok

Title: Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Abstract:
Structure‑based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often‑conflicting objectives ‑‑ binding affinity and druggability ‑‑ which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM‑agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket‑ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket‑ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit‑response probing. PROBE first decomposes the ligand into editable sites and builds a pocket‑specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and EditManual, PROBE runs an iterative multi‑agent loop in which an affinity agent, a druggability agent, and a co‑optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state‑of‑the‑art performance and substantially mitigates the failure modes exposed by our diagnostics metrics.

PaperID: 2372, https://arxiv.org/pdf/2606.00530.pdf

Authors: Tyler Stennett, Rangeet Pan, Bridget McGinn, Alessandro Orso, Saurabh Sinha

Title: Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Abstract:
Testing is a core activity in software development workflows, and research on its automation has spanned several decades. Most existing approaches generate unit tests for individual methods, validate isolated API endpoints, or target user interface (UI) layers, with non‑API and non‑UI automated test generators typically exercising only a single focal method. Recent empirical evidence shows a substantial gap between such generated tests and developer‑written ones, which often span multiple focal methods, involve complex call sequences, and contain elaborate assertions that current automated approaches fail to capture. To address this gap, we propose generating tests from natural language (NL) descriptions of developer intent. We present Sakura, the first agent‑based framework for generating structurally complex test cases from NL descriptions. Sakura decomposes NL descriptions into structured blocks and processes them using a multi‑agent system consisting of a localization agent that grounds test steps in concrete application code via static analysis, a composition agent that synthesizes compilable test code and iteratively refines it using execution feedback, and a supervisor agent that coordinates agent interactions. To evaluate Sakura, we curate a novel dataset of NL test descriptions at three levels of abstraction, systematically generated from developer‑written tests mined from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperforms off‑the‑shelf agentic tools such as Gemini CLI. Specifically, Sakura achieves 50‑78% higher test compilability and 38‑66% higher coverage overlap with ground‑truth tests compared to baselines using the same models. Moreover, Sakura paired with small open‑source models such as Devstral Small 2 and Qwen3‑Coder outperforms Gemini CLI using large proprietary models, while also being more cost‑effective.

PaperID: 2373, https://arxiv.org/pdf/2606.00476.pdf

Authors: Yufeng Wang

Title: Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Abstract:
Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in acontrolled setting, a Texas Poker simulator with a verifiable reference action for every decision by decomposing the faithfulness gap into two steps: reasoning‑conclusion and conclusion‑action. The two steps behave oppositely.

PaperID: 2374, https://arxiv.org/pdf/2606.00448.pdf

Authors: Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu

Title: When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Abstract:
LLM agents increasingly rely on community‑contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static‑composition benchmark, a two‑rater LLM‑assisted human‑adjudication pipeline, and an action‑based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall‑oriented scanner ceiling and calibrate it against human judgment: in a pattern‑stratified audit, roughly one in five flagged pair‑pattern hits survives as a real compositional risk (population‑weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per‑skill scanning misses by construction, since every pair is individually safe. An action‑based harness then probes when these candidates become model‑issued tool calls, and finds realization gated by host‑model disposition: on an anchor‑conditioned dropper subset, Haiku‑4‑5 issues the dropper‑stage tool call on all 39 direct‑prompt trials (36 of them the full download‑then‑execute chain, 3 download‑only), Opus‑4‑7 stops at the download, and Sonnet‑4‑6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install‑time compositional checks and capability isolation as complements to per‑skill scanning.

PaperID: 2375, https://arxiv.org/pdf/2606.00417.pdf

Authors: Maria Katarine Santana Barbosa, Kelvin L. Dias

Title: AgentxGCore: Agentic AI for Next-Generation Mobile Core Network

Abstract:
To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generation Mobile Networks (NextG), or 6G, will adopt an AI‑native architecture on the Core Network (CN). In this movement, the Third Generation Partnership Project (3GPP) has extended the cellular CN with new function as a first step toward integrating analytics, Artificial Intelligence (AI), and machine learning. However, those new functionalities are constrained by a centralized approach and managerial complexity. Furthermore, with the rise of Large Language Models (LLMs), a new era in network orchestration and management begins, leveraging and empowering the Intent‑based Networking (IBN) paradigm. In addition, AI agents and Agentic AI integrate Reasoning and Acting (ReAct), enabling the usage of such intents to continuously interact with the network. Unlike state‑of‑the‑art approaches that primarily employ Agentic AI to mitigate deployment and configuration complexity in the CN, this paper introduces AgentxGCore, which leverages an Agentic AI‑Native layer to extend the 3GPP architecture and enable a system based on the existing APIs across the Beyond Next Generation Core (xGC) domain. This proposal establishes an AI‑driven closed‑loop for continuous optimization based on real‑time information, enabling self‑organization and self‑adaptation. Our approach involves a multi‑agent specialized system, divided into a network planner agent, capable of visualizing the network state and developing a plan to meet the intents, and a network executor, responsible for criticizing and executing the plan. To validate the proposed solution, an environment was built using an open‑source CN, heterogeneous datasets, and different LLMs were employed to demonstrate its effectiveness.

PaperID: 2376, https://arxiv.org/pdf/2606.00416.pdf

Authors: Melih Yazgan, Iramm Hamdard, Qiyuan Wu, J. Marius Zoellner

Title: 4D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

Abstract:
Cooperative perception is important for autonomous driving but remains fragile when cameras and LiDAR degrade in adverse weather. We address this challenge by integrating 4D imaging radar as a weather‑robust modality into collaborative perception and introducing a Doppler‑guided spatial attention mechanism for multi‑agent fusion. Our approach extends two representative backbones: a radar‑camera pipeline where radar substitutes LiDAR, and a LiDAR‑radar pipeline where radar complements LiDAR. To support evaluation, we release radar‑augmented benchmarks, OPV2V‑R and Adver‑City‑R, with physics‑based LiDAR degradation. Experiments show strong robustness gains in fog and rain, including substantial improvements when radar replaces degraded LiDAR. Additional validation on MAN TruckScenes demonstrates transfer beyond simulation. Overall, our results highlight 4D imaging radar as a robust modality for all‑weather collaborative perception. Dataset and code are available at: https://url.fzi.de/SlimComm.

PaperID: 2377, https://arxiv.org/pdf/2606.00405.pdf

Authors: Niloufar Mehrabi, Sayed Pedram Haeri Boroujeni, Abolfazl Razi

Title: From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing

Abstract:
The emergence of specialized, domain‑tuned Large Language Models (LLMs) has demonstrated that smaller models can achieve expert‑level performance in specific tasks, while struggling in out‑of‑domain settings. Current ensemble methods to combine their complementary expertise primarily rely on iterative re‑prompting or cross‑model refinement. These approaches suffer from high computational costs and latency because they require repeated LLM inference calls. Furthermore, naive aggregation often leads to anchor corruption, in which noise propagated from weaker models degrades the performance of the most accurate expert. To address these challenges, we propose a framework that integrates model predictions at the semantic layer using a bipartite factor graph. In this architecture, individual LLMs are represented as variable nodes, while a set of check nodes assess their consistency based on diverse epistemic criteria. We develop a message‑passing protocol inspired by error‑recovery systems to resolve disagreements iteratively. Furthermore, we introduce an asymmetric damping mechanism that protects high‑reliability anchor nodes from being overridden by the ensemble majority. Unlike existing methods, our approach operates on output distributions and requires no additional LLM calls during the refinement phase. Evaluating on four benchmarks, including MMLU, MMLU‑Pro, GPQA, and MedMCQA, our method demonstrates a 97% reduction in token usage and up to a 6X decrease in API calls, reducing inference time from several minutes to mere milliseconds while consistently outperforming leading multi‑agent baselines. These results suggest that graph‑based belief propagation is a robust, high‑speed, and scalable alternative to the current multi‑agent LLM systems. The full pipeline and code will be made public.

PaperID: 2378, https://arxiv.org/pdf/2606.00308.pdf

Authors: Nazmus Ashrafi

Title: How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Abstract:
Large‑language‑model code generation has shifted from single‑shot prompting to multi‑agent orchestrations ‑ analyst, coder, tester, and debugger pipelines ‑ and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt‑level effects on code complexity, but the architecture‑level question is open. We compare six widely‑used multi‑agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT‑4o family across all 164 HumanEval tasks ‑ 1,968 paired observations ‑ using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non‑parametric statistical pipeline (Friedman omnibus, Wilcoxon signed‑rank post‑hoc with Holm correction, Kendall's W and matched‑pairs rank‑biserial effect sizes) in both all‑completions and passing‑only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50‑130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst‑coder split inflates complexity, the runtime debugger does not ‑ and on the analyst‑coder background actively deflates it ‑ and the tester re‑inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.

PaperID: 2379, https://arxiv.org/pdf/2606.00288.pdf

Authors: Hai Lin

Title: Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

Abstract:
Large language models are undergoing a transition from model technology to system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and execute multi‑step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control increasingly resemble classical computer systems problems. This paper develops that analogy as a visionary survey. We map concepts from computer architecture to the emerging model‑native stack and review work on LLM‑as‑OS, memory management, agent frameworks, tool protocols, multi‑agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill this gap, we propose the Intelligent Computing Architecture Model (ICAM), a six‑layer framework for model‑native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM is more like a CPU or an operating system through a dual‑plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV‑cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi‑agent collaboration. We validate these laws against published system‑level data and relate them to recent evidence on agentic software practices. We conclude by identifying where the analogy breaks down and outlining a research roadmap for model‑native computing. This is a conceptual and survey contribution; it does not report new experiments.

PaperID: 2380, https://arxiv.org/pdf/2606.00287.pdf

Authors: Arthur Casals, Anarosa A. F. Brandão

Title: Leveraging the Learning Curve: Reusing Existing Architectural Patterns to Design and Implement MAS

Abstract:
Recent advancements in AI have led to the development of specialized systems related to multi‑agent systems (MAS). However, the inherently collaborative nature of agents is often overlooked, and many of these specialized systems are used as components by other AI systems. From a software engineering perspective, this context can benefit from aligning the architectural characteristics of distributed systems with the inherently distributed nature of MAS. We propose that introducing a minimal set of agent‑related concepts into the Distributed Systems (DS) domain can improve the engineering of modern MAS by leveraging techniques from DS engineering with established agent theory. In this study, we recapitulated the common origins of MAS and DS by drawing architectural parallels to establish a unified engineering approach. We then defined a minimal set of agent concepts to perform two practical studies on leveraging MAS development. First, we incorporated these concepts into a DS architectural pattern to design a distributed MAS. We then used these concepts in a graduate course to teach MAS engineering to students with no prior knowledge of agent theory. The learning outcomes from both courses included successful MAS implementation using DS tools and techniques. Although more than two‑thirds of these students had no practical experience in developing distributed systems, the average final grade in both courses was above 80%, thus validating our approach. Finally, we discuss how this study supports the development of advanced systems using modern AI techniques consistently with established agent‑related research while leveraging established DS techniques and concepts.

PaperID: 2381, https://arxiv.org/pdf/2606.00266.pdf

Authors: Kamil Szczech, Maksymilian Wojnar, Krzysztof Rusek, Katarzyna Kosek-Szott, Szymon Szott

Title: KISS: Keeping it Simple and Slotted when Learning to Communicate over Wireless

Abstract:
A long‑standing challenge in distributed wireless systems is ensuring efficient and fair random channel access. Existing solutions often address specific constraints related to timing, periodicity, or centralization, but they typically rely on fixed heuristics. Motivated by recent advances in machine learning (ML), we investigate whether ML agents can autonomously learn efficient and fair access strategies, and whether such learning can offer new insights into medium access control (MAC) design. Rather than proposing a deployable protocol, our aim is to examine whether decentralized learning can rediscover or approximate theoretically efficient random‑access mechanisms under minimal assumptions. To this end, we deploy an off‑policy Double Deep Q‑Network (DDQN) with Bayesian inference to train agents operating over a slotted channel. The resulting method is fully online (no pre‑training), fully distributed (independent multi‑agent learners), stochastic (non‑periodic), and requires no coordination or explicit communication. Extensive simulations show that the learned strategy adapts to varying network conditions and achieves near‑theoretical efficiency while maintaining fairness. Ablation studies further reveal that the learned behavior resembles slotted ALOHA with a dynamically adjusted transmission probability, leading us to refer to the method as KISS: Keeping It Simple and Slotted.

PaperID: 2382, https://arxiv.org/pdf/2606.00204.pdf

Authors: Zijian Huang, Jay Zhangjie Wu, Zian Wang, Tianshi Cao, Jiasi Chen, Sanja Fidler, Huan Ling, Xuanchi Ren

Title: APE: Agentic Prompt Enhancer for Image Generation and Editing

Abstract:
Natural language has become a powerful interface for image generation and editing, yet text‑guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post‑trains small language models (SLMs) as prompt‑enhancement agents. APE supports both single‑agent rewriting and role‑specialized multi‑agent enhancement. Its single‑agent instantiation, SAPE, rewrites the prompt in one pass, while its multi‑agent instantiation, MAPE, decomposes enhancement into a router‑‑rewriter‑‑composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task‑aware rewards and post‑training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post‑trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed‑source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.

PaperID: 2383, https://arxiv.org/pdf/2606.00198.pdf

Authors: Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

Title: BAGEN: Are LLM Agents Budget-Aware?

Abstract:
While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget‑Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget‑awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout‑replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget‑awareness, with correlation r=0.35. (2) frontier models are consistently over‑optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget‑aware signal is actionable and trainable. Early stop saves 28‑64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen‑ai.github.io/bagen/

PaperID: 2384, https://arxiv.org/pdf/2606.00138.pdf

Authors: Titu Ranjan Sarker, Muhammed Jawaad Zulqernine, Ling Yue, Shaowu Pan, Chenxi Wang, Shiyao Lin

Title: A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

Abstract:
Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry‑level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real‑world problem‑solving. To address these issues, we present AbaqusAgent, a multi‑agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users' natural‑language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre‑processing and post‑processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human‑simulation interaction paradigm and enables integration with AI‑empowered optimization and material characterization workflows. The code is available at https://github.com/LIRAM‑LIN/AbaqusAgent

PaperID: 2385, https://arxiv.org/pdf/2606.00135.pdf

Authors: Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

Title: On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Abstract:
Tool‑calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool‑calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool‑calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi‑turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi‑turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool‑calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL‑based tool‑calling training, achieving substantial wall‑clock speedup without degrading performance.

PaperID: 2386, https://arxiv.org/pdf/2606.00065.pdf

Authors: Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

Title: Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

Abstract:
Automated extraction of materials composition‑property data from scientific literature has advanced considerably with the development of large language model‑based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end‑to‑end multi‑agent framework for automated composition‑property database construction, with a native vision‑language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption‑keyword‑based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition‑property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than \1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established d_33 test corpus demonstrates that Gemini‑3‑Flash‑Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost‑effective model among the four evaluated. We additionally introduce a range‑based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM‑integrated ComProScanner as the first materials‑specific, fully automated, multimodal literature mining platform capable of extracting structured composition‑property data from text, tables, and figures within a single unified pipeline.

PaperID: 2387, https://arxiv.org/pdf/2606.00049.pdf

Authors: Yuxi Chen, Yutian Tang, Timothy Storer

Title: Measuring and Mitigating Bias in Code Generated by Large Language Models

Abstract:
Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code generation tasks. However, concerns about bias in their generated outputs remain significant. This paper focuses on GPT‑4o and Gemini, mainstream tools for code generation, and proposes a framework for evaluating bias in LLM‑generated code, specifically examining the influence of protected attributes, prompts and web‑search capability. We use two metrics: the code bias score (CBS) and the attribute change ratio (ACR), to quantify the prevalence of bias and the degree of influence of different attributes, respectively. In addition, we investigate four lightweight mitigation strategies: Few‑Shot, Chain‑of‑Thought, Few‑Shot Chain‑of‑Thought, and Multi‑agent, aimed at mitigating bias in generated code. Our findings reveal that bias remains prevalent across different protected attributes and datasets even after applying mitigation strategies, highlighting the need for more effective approaches to reduce bias in AI‑driven code generation systems.

PaperID: 2388, https://arxiv.org/pdf/2606.00017.pdf

Authors: Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

Title: MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Abstract:
Training language model agents for multi‑agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per‑step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task‑specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum‑based opponent sampling, and multi‑level stratified batch construction, this approach enables stable, sample‑efficient RL training in multi‑agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8‑billion‑parameter open‑source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT‑5, in head‑to‑head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.

PaperID: 2389, https://arxiv.org/pdf/2606.00015.pdf

Authors: Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin

Title: SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant

Abstract:
Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.

PaperID: 2390, https://arxiv.org/pdf/2606.00008.pdf

Authors: Jia Zhang, Tengfei Ma, Tianle Li, Daojian Zeng, Xieping Gao, Xiangxiang Zeng

Title: Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Abstract:
Multi‑objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or fixed scalarization, which limits their ability to represent diverse trade‑offs and to explore multiple promising design trajectories. We propose ATOM, a multi‑agent framework that formulates molecular optimization as a tree‑structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Agents coordinate along different paths of the tree rather than enforcing a global consensus, enabling the method to maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree‑structured interaction enables reasoning over long‑horizon dependencies inherent in molecular design. Experiments on challenging multi‑objective benchmarks involving activity, synthesizability, and ADMET‑related properties show that ATOM consistently achieves improved Pareto coverage and hypervolume over strong baselines. These results demonstrate the effectiveness of pathwise multi‑agent coordination for molecular optimization. Code is available at https://anonymous.4open.science/r/ATOM‑41CE.

PaperID: 2391, https://arxiv.org/pdf/2606.00007.pdf

Authors: Steven Johnson

Title: Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

Abstract:
As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence‑based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation‑weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent‑based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t‑tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 vs 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit‑reveal vote concealment as the most impactful single component (8.2‑8.6pp precision improvement, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.

PaperID: 2392, https://arxiv.org/pdf/2605.31593.pdf

Authors: Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani

Title: Stateful Online Monitoring Catches Distributed Agent Attacks

Abstract:
Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi‑agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real‑time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large‑scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red‑teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.

PaperID: 2393, https://arxiv.org/pdf/2605.31581.pdf

Authors: Albert Sadowski, Jarosław A. Chudziak

Title: Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation

Abstract:
The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context‑dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective‑labeled specialisation derives the defeat function from a relevance set ρ and a priority π. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full‑relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION‑MANIPULATION, and record baseline complexity bounds. Tight bounds and multi‑agent variants are left open.

PaperID: 2394, https://arxiv.org/pdf/2605.31529.pdf

Authors: Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius

Title: SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Abstract:
True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in‑the‑wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi‑agent systems. To bridge this gap, we introduce SVI‑Bench, a large‑scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real‑world multi‑agent interaction (10‑22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI‑Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross‑referenced corpus. We organize evaluation into 9 tasks spanning a progressive four‑pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine‑grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.

PaperID: 2395, https://arxiv.org/pdf/2605.31468.pdf

Authors: Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang, Jiale Chen, Xinzhe Wu, Mingtian Yang, Chenyang Di, Jiajun Li, Lingching Tung, Peichao Lai, Yifei Xia, Ziyi Guo, Yanwei Xu, Yanzhao Qin, Shaoduo Gan, Xupeng Miao, Bin Cui

Title: AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

Abstract:
Scientific research has traditionally been human‑intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM‑based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory‑centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema‑governed research memory, separating Long‑Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project‑level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five‑stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG‑shaped multi‑agent operators and reusable stage‑specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.

PaperID: 2396, https://arxiv.org/pdf/2605.31445.pdf

Authors: Antonio Valerio Miceli-Barone, Vaishak Belle, Shay B. Cohen

Title: Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

Abstract:
In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt to negotiate mutually beneficial trades, under different information regimes (complete information, information asymmetry or mutual uncertainty). We evaluate their performance w.r.t. game‑theoretical solutions and further investigate their honesty (their tendency to disclose or withhold information or to mislead and deceive) as well as their credulity (their tendency to trust or distrust information provided by the other agent). We study zero‑shot LLM agents with simple prompting scaffolding as well as fine‑tuned agents, in order to investigate whether optimising the agents to maximise financial profits makes them stronger negotiators but also more dishonest and less trusting. We find that off‑the‑shelf LLMs all substantially deviate from game‑theoretical equilibria, they attempt to lie about their private information but cannot efficiently exploit information asymmetries. Fine‑tuning on financial utility makes the agents stronger at achieving better deals but also more dishonest, highlighting the risks that optimising agents for a task can have on their safety. We release our code and a dataset of bargaining scenarios.

PaperID: 2397, https://arxiv.org/pdf/2605.31387.pdf

Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

Title: Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Abstract:
Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human‑collaborative tasks, they are expected to communicate this understanding through language. Vision‑language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure‑building task that combines visual interpretation, grounding, language‑guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open‑weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

PaperID: 2398, https://arxiv.org/pdf/2605.31361.pdf

Authors: Tomas Leroy-Stone

Title: Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

Abstract:
In cooperative multi‑agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single‑agent settings, their application to MARL remains limited by an inability to handle teammate‑induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer‑style recurrent state‑space model (RSSM) into environment and teammate components, and learns an auxiliary Theory‑of‑Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero‑shot and few‑shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human‑compatible AI.

PaperID: 2399, https://arxiv.org/pdf/2605.31330.pdf

Authors: Van An Nguyen, Vuong Khang Huynh, Huu Loi Bui, Hai Anh Ha, Quang Dung Le, Tan Dat Nguyen, Ngoc Ngu Nguyen, Zhao Song, Manh Hong Duong, Le Hong Trang, The Anh Han

Title: Social welfare optimisation under institutional reward and punishment

Abstract:
Institutional incentives are widely used to promote cooperation among autonomous, self‑regarding agents, from human societies to multi‑agent and AI systems. Existing work typically treats incentive design as a bi‑objective problem: minimise institutional cost while achieving a high long‑run frequency of cooperation. Whether such schemes also maximise social welfare ‑ total population payoff net of institutional expenditure ‑ has remained largely unexplored. We develop a welfare‑centric framework for institutional incentives in finite, well‑mixed populations playing a social dilemma (Donation Game and Public Goods Game), considering both rewards for cooperators and punishments for defectors. For each mechanism, we derive explicit expressions for expected social welfare and characterise how it depends on incentive efficiency and selection intensity. Analytically, we identify parameter regimes where social welfare has a single optimal incentive level and regimes with qualitative phase transitions, in which welfare becomes non‑monotonic with multiple local optima. We prove that any welfare‑maximising incentive is either zero or concentrated around a simple closed‑form target, and we provide an efficient algorithm to compute these optima. Comparing reward and punishment, we further derive close‑formed conditions under which reward outperform punishment in terms of social welfare for any given budget. Overall, our results reveal a systematic gap between incentives optimised for cost or cooperation frequency and those that maximise welfare.

PaperID: 2400, https://arxiv.org/pdf/2605.31318.pdf

Authors: Mateusz Odrowaz-Sypniewski, Jasmine Bayrooti, Ajay Shankar, Amanda Prorok

Title: Generalized Intention Modeling in Multi-Agent Reinforcement Learning

Abstract:
Modeling an opponent's intent is critical for effective decision‑making in non‑cooperative, competitive, and general‑sum multi‑agent reinforcement learning. Existing opponent modeling methods encode intent using an embedding derived from episode information chosen a priori, such as the opponent's next action or a future environment state, and use this to guide the ego‑agent's behavior. These approaches assume that the chosen information is universally representative of intent; however, we show empirically that this is not the case as intentions are often task‑ and environment‑dependent. To address this, we introduce a task‑adaptive opponent modeling framework that learns a performance‑driven mixture of multiple intent representations. We further introduce a new intention representation that maximizes mutual information with the ego‑agent's future returns, thereby capturing opponent information that is most directly relevant to performance. Our approach consistently matches or exceeds the performance of state‑of‑the‑art baselines across diverse tasks and yields insights into when and why different opponent modeling strategies succeed.

PaperID: 2401, https://arxiv.org/pdf/2605.31268.pdf

Authors: Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok, Petr Borovlev, Kseniia Lysaniuk, Madeeswaran Kannan, Ivan Dolgov, Nikita Pavlichenko

Title: Mellum2 Technical Report

Abstract:
We present Mellum 2, an open‑weight 12B‑parameter Mixture‑of‑Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general‑purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi‑step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion‑focused 4B dense Mellum model. The architecture builds on the Mixture‑of‑Experts (64 experts, 8 active) and combines Grouped‑Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi‑Token Prediction head that doubles as both an auxiliary pre‑training objective and a built‑in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre‑training spans approximately 10.6 trillion tokens through a three‑phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup‑Hold‑Decay schedule with linear decay to zero. The pre‑trained base is extended to a 128K context window via a layer‑selective YaRN and then post‑trained in two stages (supervised fine‑tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open‑weight baselines in the 4B‑14B range while running at the per‑token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.

PaperID: 2402, https://arxiv.org/pdf/2605.31264.pdf

Authors: Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu

Title: COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

Abstract:
LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person‑grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end‑to‑end workflow for distilling these traces into inspectable, correctable, and agent‑usable skills. We present an automated trace‑to‑skill distillation system for generating person‑grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural‑language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open‑source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person‑grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

PaperID: 2403, https://arxiv.org/pdf/2605.31119.pdf

Authors: Navin Sriram Ravie, Andrew Jong, Krrish Jain, John Liu, Omar Alama, Bijo Sebastian, Sebastian Scherer

Title: Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

Abstract:
In robotics, dangers and adversity modes are often embodiment‑specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision‑language models (VLMs) to preemptively predict an exhaustive list of common‑sense dangers, it remains difficult to capture possible interaction and embodiment‑dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few‑shot modeling of transient anomalies. We leverage semantic voxel‑centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction‑driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.

PaperID: 2404, https://arxiv.org/pdf/2605.31067.pdf

Authors: Marcel Bartholomeus Prasetyo, Shrutika Vishal Thengane, A Manicka Praveen, Yi Loo, Malika Meghjani

Title: Seeing Fast and Slow: Bimodal 3D Scene Graphs for Open-set Tasks

Abstract:
Open‑set task execution can significantly benefit from seamlessly switching between coarse and fine scene representations depending on the context and the evolving information as the robot explores the environment. For example, it is often sufficient to start with a coarse scene representation initially and only employ a finer, more granular scene representation when the robot encounters regions which are likely to contain the task relevant objects. Hence, in this work, we propose BiMoSG, a bimodal 3D scene graph generation approach for open‑set tasks. BiMoSG employs a "fast" mode by default to efficiently generate a coarse 3D scene graph and can switch to a "slow" mode for generating a finer open vocabulary 3D scene graph of task relevant objects. We demonstrate that our proposed 3D scene graph generation approach is significantly faster than the open‑source state‑of‑the‑art approaches. This allows us to integrate the scene graph generation process with task execution for real‑time deployment.

PaperID: 2405, https://arxiv.org/pdf/2605.31042.pdf

Authors: Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen

Title: From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Abstract:
LLM agents are evolving from conversational chatbots to operational tools in real‑world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi‑step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi‑step trojan attacks in local agentic harnesses. In an OpenClaw‑style simulated workspace with GPT‑5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single‑turn prompt‑injection attacks produce near‑zero ASR on the same model. To address this threat, we propose DASGuard, which scans control‑like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

PaperID: 2406, https://arxiv.org/pdf/2605.31023.pdf

Authors: Mohamad A. Hady, Muhammad Anwar Masum, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

Title: HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

Abstract:
This work addresses the problem of autonomous resource management in heterogeneous satellite cluster conducting Earth Observation (EO) missions including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real‑time decision‑making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management. Then, this problem is solved by using optimization algorithms. However, such solutions become less effective when the underlying models are not available, over complex, and inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision‑making process and apply model‑free reinforcement learning techniques to enable adaptive and real‑time resource management. To this end, we propose a novel transformer‑based architecture tailored for heterogeneous satellite cluster autonomous EO Mission with relational observations‑actions tokenization and differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.

PaperID: 2407, https://arxiv.org/pdf/2605.31005.pdf

Authors: Jeffrey Seely, Bartłomiej Cupiał, Llion Jones

Title: Learning Multi-Agent Coordination via Sheaf-ADMM

Abstract:
We present a differentiable optimization framework for multi‑agent coordination. An input is decomposed into overlapping local views, each processed by an agent that solves a convex subproblem parameterized by a neural encoder. Agents coordinate through the Alternating Direction Method of Multipliers (ADMM) with inter‑agent constraints specified by a cellular sheaf. The sheaf specifies which aspects of neighboring solutions must agree, allowing for heterogeneous notions of global consensus. Backpropagating through the unrolled optimization jointly trains all components of the multi‑agent system. We evaluate on maze pathfinding, image classification, and Sudoku, where agents with individually insufficient local views learn to coordinate to produce correct global outputs. On MNIST, the local‑view decomposition yields improved robustness to distribution shifts relative to a standard CNN. On Sudoku, the optimization‑derived structure yields markedly higher solve rates than parameter‑matched MPNN baselines. Finally, the ADMM structure exposes distinct primal, consensus, and dual state variables, opening the coordination dynamics to direct analysis and intervention ‑‑ a property unavailable in standard message‑passing architectures.

PaperID: 2408, https://arxiv.org/pdf/2605.30989.pdf

Authors: InGyu Choi, GeonYeong Go, SunWoo Ahn, HyoJae Kang, Min-Sung Kang

Title: A study on a Real-Time VR-Based Teleoperation Framework for Manipulator in Dynamic Environment

Abstract:
Robot teleoperation enables safe, non‑contact task execution in hazardous environments where direct human access is difficult, and its application has expanded with recent VR technologies. Many VR teleoperation studies, however, have primarily served as data‑collection tools for robot imitation learning, so they often do not explicitly address dynamic obstacles, workspace changes, or collision risks during operation. For real deployment aimed at operator safety, teleoperation must react to dynamic situations with low latency and remain robust to mistakes made by inexperienced operators. This paper presents a VR teleoperation framework that supports real‑time manipulation while handling collisions with both static and moving obstacles. The framework integrates GPU‑accelerated inverse kinematics and trajectory optimization within a VR interface to generate feasible joint commands at each control cycle under robot constraints. Experiments with a 7‑DoF manipulator demonstrate stable online behavior and collision‑aware motion generation across three scenarios: obstacle‑free, static‑obstacle, and moving‑obstacle environments. The results indicate that the proposed approach generates motion consistent with the operator's command while producing safe detours when obstacles interfere with the commanded path.

PaperID: 2409, https://arxiv.org/pdf/2605.30947.pdf

Authors: Yating Pan, Jiajun Zhang, Jun Wang, Qi Su

Title: Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship

Abstract:
LLM‑based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence‑grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence‑grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly‑Primitives‑Inspired Research Engine), a multi‑agent framework for evidence‑grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi‑scale close‑reading substrate of passages, intra‑context graph communities, and cross‑context semantic clusters. On a peer‑reviewed‑paper benchmark over classical Chinese and Greco‑Roman Latin scholarship, SPIRE recovers cited primary‑source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind‑judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly‑operation agents and close‑reading retrieval contribute to evidence‑grounded essays. Code, data catalogues, and reproduction scripts are released at https://github.com/YatingPan/SPIRE.

PaperID: 2410, https://arxiv.org/pdf/2605.30931.pdf

Authors: Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang

Title: MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Abstract:
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game‑based benchmarks often compress interaction into short‑horizon tasks or entangle success with domain‑specific game mechanics. In this paper, we introduce MineExplorer benchmark for evaluating open‑world exploration capabilities of MLLM agents in Minecraft. We first filter atomic tasks whose solutions rely heavily on Minecraft‑specific knowledge to better reflect general open‑world reasoning. Then we organize the benchmark around a ReAct‑style capability formulation and compose atomic tasks into implicit multi‑hop tasks. To further construct reliable instances, MineExplorer uses a multi‑agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule‑based milestone evaluators. Human evaluation shows that the multi‑agent synthesis workflow produces significantly more reliable instances than a single‑agent baseline. Experiments with advanced MLLM agents show that open‑world exploration remains challenging, as strong models can handle many single‑hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.

PaperID: 2411, https://arxiv.org/pdf/2605.30930.pdf

Authors: Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

Title: TUX: Measuring Human--AI Tacit Understanding

Abstract:
As large language models (LLMs) increasingly act as collaborative partners, human‑‑AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum‑placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile‑conditioned LLM agents across four models. We find that nearest human‑‑agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person‑level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision‑making styles, and confidence improving over aggregate trait‑distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile‑based conditioning for capturing deeper representational alignment.

PaperID: 2412, https://arxiv.org/pdf/2605.30907.pdf

Authors: Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, Zach Kirshner

Title: BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Abstract:
We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions ‑‑ an order of magnitude more than the estimated global population of professional developers ‑‑ comparatively fewer resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real‑world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high‑quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus (α=0.826) with a macro‑F1 score of 0.839. Frontier LLMs demonstrate poor performance on the challenging benchmark, with the strongest LLMs achieving less than 50% average scores across tasks ‑‑ models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.

PaperID: 2413, https://arxiv.org/pdf/2605.30889.pdf

Authors: Etinosa Osaro, Santosh Adhikari, Stamatia Zavitsanou, Kelsey Parker, Dario Rocca

Title: MLIPilot: LLM-Driven Auto-Research for Machine-Learned Interatomic Potentials

Abstract:
Constructing production‑quality machine‑learned interatomic potentials (MLIPs) requires balancing accuracy, dynamical stability, and computational throughput under constraints that are not captured by a single training loss. We introduce MLIPilot, an auto‑research framework in which tool‑calling large language models propose hypotheses, edit MLIP training code, launch HPC jobs, and accept or revert changes using a fixed, physically constrained scorecard. We evaluate MLIPilot on MACE potential optimization using both commercial and open‑weight LLM agents, including GPT‑5.5, GPT‑4.1, Mistral‑24B, and Qwen3‑32B. The benchmarks span molecular and periodic settings: a QM7‑derived dataset for which we generated B3LYP/6‑31G(d) energies and forces, and a Cu EMT dataset with periodic copper supercells labeled by ASE's Effective Medium Theory calculator. Across these benchmarks, the strongest agents move initially constraint‑violating baselines to accepted models by discovering useful training strategies, including output normalization, loss‑function changes, progressive training schedules, and model‑capacity adjustments. These results suggest that LLM agents can serve as autonomous operators for scientific machine‑learning workflows when their search is constrained by domain‑specific validation criteria, shifting part of MLIP development from manual trial‑and‑error toward auditable, automated experimentation.

PaperID: 2414, https://arxiv.org/pdf/2605.30883.pdf

Authors: Churui Zeng, Weiwei Qi, Kedong Xiu, Tianhang Zheng, Chaochao Lu, Liang He, Zhan Qin, Kui Ren

Title: TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking

Abstract:
The rise of LLM agents introduces a new threat by enabling planning, coding, and even end‑to‑end execution of expert‑level attack workflows. However, this threat remains underexplored and underestimated since (i) safety alignment prevents LLMs from directly generating harmful instructions, and (ii) most existing jailbreak methods cannot consistently induce agents to execute malicious operations. In this paper, we propose TRACE, a practical agentic jailbreaking framework to further reveal the risks of this threat surface. To conceal the malicious intent, TRACE decomposes a malicious task into multiple subtask sequences under different schemes and selects the sequence with the fewest explicitly harmful subtasks. TRACE then disguises the remaining harmful subtasks as benign‑looking instructions by embedding them in task‑aware scenarios with related roles, environments, directives, and heuristics. The scenarios are iteratively evolved through well‑defined transformation actions, which are sampled by a Q‑learning‑inspired mechanism, for inducing the agent to execute on the harmful subtasks. Extensive evaluations on AgentHarm and AdvCUA show that TRACE consistently outperforms existing jailbreak baselines across multiple advanced LLM agents, achieving up to 100% bypass rate and 0.73 average success score. We also demonstrate the effectiveness of TRACE in controlled cyberattack instances. Our code and demos are available at https://github.com/ZJU‑LLM‑Safety/TRACE.git.

PaperID: 2415, https://arxiv.org/pdf/2605.30854.pdf

Authors: Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda

Title: Safe Equilibrium Policy Optimization for Strategic Agent Policies

Abstract:
Language models fine‑tuned with reinforcement learning typically optimize for task reward, ignoring multi‑agent strategic structure. Because these agents condition on natural language game‑state descriptions and emit actions through free‑form generation, strategic failure modes ‑‑ exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (\sepo), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement \sepo as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma~4 E4B‑it and Qwen~3.5‑4B after supervised fine‑tuning (SFT). Evaluated across five strategic domains: Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker. \sepo achieves zero exploit‑pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over‑cooperative behavior introduced by SFT. In negotiation, \sepo achieves a positive‑safety outcome and only the positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per‑rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control‑variate property), producing zero gradient. To support further research in strategic safety for agents, we release our \hrefhttps://anonymous.4open.science/r/sepo‑2668/README.mdcode and SFT datasets.

PaperID: 2416, https://arxiv.org/pdf/2605.30838.pdf

Authors: Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng

Title: COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

Abstract:
LLM‑powered search agents enable multi‑step reasoning and tool use. However, these capabilities introduce retrieval‑induced safety degradation, as harmful intents may decompose into seemingly innocuous sub‑queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi‑step interactions. We propose COMPASS, a Cognitive MCTS‑Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step‑wise alignment (ISA) to isolate risky intermediate actions for fine‑grained process supervision. Empirical results show that COMPASS achieves a favorable safety‑utility trade‑off while requiring substantially less training data.

PaperID: 2417, https://arxiv.org/pdf/2605.30802.pdf

Authors: Tarun Kota

Title: Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

Abstract:
Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems tradeoff fast but brittle automation against accurate but costly human arbitration. Single‑LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self‑correction mechanism. We evaluate whether multi‑agent LLM architectures can improve oracle resolution accuracy over single‑model baselines. We compare independent aggregation and deliberative consensus against single‑LLM baselines (GPT‑5 Nano, DeepSeek V3, and Llama‑3.3‑70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence‑weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single‑model baseline, attributed to error propagation during debate where confidently wrong models flip correct ones. Error correlations across models (0.529‑0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi‑agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI‑human oracle systems: auto‑resolving only unanimous, high‑confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter‑agent disagreement flagging the remainder for human review.

PaperID: 2418, https://arxiv.org/pdf/2605.30785.pdf

Authors: Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie

Title: Learning Agent-Compatible Context Management for Long-Horizon Tasks

Abstract:
LLM agents increasingly face long‑horizon tasks such as web search and deep research in real‑world applications, where accumulated context can cause long‑context degradation and reasoning failures. Prior work mitigates this through context management with agent‑side context control or fixed strategies such as summarization, which require training the agent itself for adaptation ‑ making it impractical for closed‑source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end‑to‑end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity‑Reliability Trade‑off: agents with higher vanilla ReAct performance benefit from higher‑fidelity context preservation, whereas lower‑performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

PaperID: 2419, https://arxiv.org/pdf/2605.30723.pdf

Authors: Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li

Title: Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

Abstract:
LLM agents increasingly retrieve externally curated skills‑procedural instructions retrieved at decision time‑to improve performance on long‑horizon interactive tasks. Existing skill libraries are typically treated as model‑agnostic, reusing the same skill formulations across backbones with substantially different capacities and behaviors. However, our controlled experiments across multiple model scales show that skill effectiveness is strongly model‑dependent: a skill that benefits one backbone can harm another. Motivated by this observation, we propose MASA Model‑Aware Skill Alignment, a framework that adapts skills to each target backbone without modifying agent weights. MASA operates in two stages: (1) a hierarchical skill evolution pipeline that iteratively rewrites general and task‑specific skills using hill climbing and UCB‑driven tree search, guided by environment feedback and model capability profiles; and (2) a lightweight model‑conditioned skill rewriter trained on evolution trajectories to reproduce the adaptation in a single forward pass. Experiments across three interactive environments and four backbones show that MASA consistently achieves the best overall performance, with gains of up to 25.8 points over the strongest baseline. The learned rewriter further generalizes to unseen tasks and environments without additional search, consistently outperforming a much larger teacher LLM at a fraction of the inference cost.

PaperID: 2420, https://arxiv.org/pdf/2605.30712.pdf

Authors: Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Title: ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

Abstract:
Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi‑step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine‑tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model‑agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self‑evolving experience graph, and retrieves useful experiences through graph diffusion and utility‑aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi‑step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph‑structured experience, utility‑aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.

PaperID: 2421, https://arxiv.org/pdf/2605.30698.pdf

Authors: Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

Title: Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Abstract:
Vision‑language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi‑agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under‑explored. Existing multi‑agent VQA methods predominantly adapt text‑centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer‑level agreement is insufficient for reliable multi‑agent VQA; aligned visual evidence ‑‑ shared support from the image regions agents rely on ‑‑ is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence‑Aligned Grounded muLti‑agent rEasoning), a training‑free evidence‑centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision‑making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

PaperID: 2422, https://arxiv.org/pdf/2605.30690.pdf

Authors: Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

Title: ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

Abstract:
Long‑term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory‑augmented methods typically treat memory as a fixed resource: text‑space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent‑space approaches reduce textual cost but still rely on rigid retrieval or fixed‑capacity memory interfaces. This creates a mismatch between query‑dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory‑augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory‑use process is optimized with downstream task rewards through group‑relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory‑intensive QA and embodied agent control. Across Qwen2.5‑3B‑Instruct and Qwen2.5‑7B‑Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab‑uiuc/ElasticMem.

PaperID: 2423, https://arxiv.org/pdf/2605.30680.pdf

Authors: Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

Title: Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

Abstract:
Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi‑Sim, a multi‑agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health‑economics findings as adjacent regimes ‑‑ up‑coding and low‑complexity‑patient selection under profit pressure, and Goodhart‑style drift where measured performance becomes anti‑correlated with true outcomes ‑‑ and a single audit lever exposes pressure migration: closing the coding channel more than doubles low‑complexity selection. LLM‑guided evolutionary code search over the same rule‑program space then synthesizes an inspectable mixed‑objective program that eliminates up‑coding, halves rejection, and retains most of the profit‑oriented baseline's funds.

PaperID: 2424, https://arxiv.org/pdf/2605.30653.pdf

Authors: Jiatan Huang, Mingchen Li, Ziming Li, Sunjae Kwon, Hong Yu, Chuxu Zhang

Title: Counterfactual Graph for Multi-Agent LLM Calibration

Abstract:
Multi‑agent LLM systems often treat agreement as evidence: when many agents in a panel give the same answer, that answer is assumed to be more reliable. We show that this assumption can fail after agents communicate. Communication can induce correlated failures and false consensus, so the same vote share may reflect reliable agreement in one topology but over‑confidence in another. We propose CAGE‑CAL, a counterfactual agent‑graph calibration framework for multi‑agent LLMs. For each query, CAGE‑CAL compares an observed post‑communication agent graph with a matched counterfactual no‑communication graph, capturing both pairwise failure correlations and group‑level dependencies. Rather than simply counting how many agents agree, CAGE‑CAL estimates the counterfactual shift between observed and no‑communication dependence, and calibrates confidence accordingly. Across five benchmarks, CAGE‑CAL improves reliability discrimination with competitive ECE, and its calibrated confidence further improves topology selection over the best fixed‑topology strategy.

PaperID: 2425, https://arxiv.org/pdf/2605.30621.pdf

Authors: Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu

Title: Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Abstract:
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self‑evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task‑solving predicts its capabilities in harness self‑evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self‑evolution capabilities: (i) harness‑updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness‑benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness‑updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5‑9B's updates yield gains comparable to those of Claude Opus~4.6. Second, harness‑benefit is non‑monotonic in base capability: weak‑tier models benefit little from updated harnesses, mid‑tier models benefit most, and strong‑tier models benefit less than mid‑tier. We trace low gains at the weak tier to two failure modes: weak‑tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task‑solving agent rather than the evolver, and targeting harness invocation and long‑horizon instruction following in agent training. Our source code is publicly available at https://github.com/A‑EVO‑Lab/a‑evolve/tree/release/harness‑evolution.

PaperID: 2426, https://arxiv.org/pdf/2605.30611.pdf

Authors: Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

Title: Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Abstract:
Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication‑quality illustrations remains one of the most labor‑intensive parts of paper preparation. Existing automated systems each target a single figure type under text‑only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi‑agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana‑Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

PaperID: 2427, https://arxiv.org/pdf/2605.30604.pdf

Authors: George Fatouros, Georgios Makridis, George Kousiouris, John Soldatos, Dimosthenis Kyriazis

Title: An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

Abstract:
Regulated cybersecurity workflows lack a runtime substrate that enforces organization‑level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model‑agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert‑driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization‑scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first‑class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human‑in‑the‑loop (HITL) gates, and append‑only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric‑level pass criteria for architecture readiness, security‑policy enforcement, evidence traceability, output quality, and operational observability.

PaperID: 2428, https://arxiv.org/pdf/2605.30539.pdf

Authors: Clayton Cohn, Surya Rayala, Siyuan Guo, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Ryan Li, Angela Eeds, Menton Deweese, Pamela J. Osborn Popp, Rebekah Stanton, Shakeera Walker, Ashwin T S, Meiyi Ma, Gautam Biswas

Title: A Theory-Guided LLM Pedagogical Agent for STEM+C Scaffolding Without Over-Reliance

Abstract:
LLM pedagogical agents are proliferating, yet recent findings have raised questions about their adherence to established theories of learning and, by extension, their educational value. Concerns regarding cognitive offloading, over‑reliance, and "gaming" behaviors persist and remain largely unaddressed. In response, we developed Copa, an agentic, multi‑agent, multimodal Collaborative Peer Agent for STEM+C learning. Copa is built on top of the Evidence‑Decision‑Feedback (EDF) framework, grounding its interactions in Social Cognitive Theory and Social Constructivism and promoting sense‑making through adaptive, dialogic support rather than answer‑seeking. In an authentic high school computational‑modeling study (n=33 dyads), we demonstrate that Copa (1) supports students' confidence building and ability to verbalize conceptual understanding without causing dependence; and (2) provides adaptive feedback personalized to learners that is interpretable with respect to students' multimodal input data. These findings position theory‑guided, multimodal LLM agents as a promising path toward classroom AI integration that amplifies students' reasoning rather than replacing it.

PaperID: 2429, https://arxiv.org/pdf/2605.30534.pdf

Authors: Nima Dorzhiev, Peng Liu

Title: Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection Attacks

Abstract:
Polymorphic Prompt Assembling (PPA) defends LLM agents against prompt injections by randomly selecting separator pairs from a fixed pool to isolate user input from system instructions. Although effective, static pool reuse exposes a blast‑radius vulnerability: once a separator leaks, it can be exploited in future requests. We propose a dynamic per‑request separator generation using domain‑separated SHA‑256 digests keyed on the timestamp, session identifier, and cryptographic nonce. Each assembled prompt receives a unique (BEGIN, END) canary pair, thereby limiting leakage exposure to a single request. We evaluated our extension against 16 injection payloads on Llama‑3.3‑70B‑Instruct‑Turbo, with cross‑model validation on DeepSeek‑V4‑Flash model. Against the M1 obfuscation payload (leetspeak + urgency), the dynamic mode reduces the Attack Success Rate (ASR) from 0.88 to 0.38, yielding a statistically significant 2.3 x mitigation verified by non‑overlapping 95% Wilson confidence intervals. Against format_breakout_salad, static separator leakage (leak_rate = 0.467) is eliminated entirely in the dynamic mode (0.000), confirming the blast‑radius reduction in practice. The implementation requires no model fine‑tuning, adds 2.7 microseconds prompt‑assembly overhead per request, and is backward compatible with the existing PPA SDK.

PaperID: 2430, https://arxiv.org/pdf/2605.30488.pdf

Authors: Zhihao Cao, Qi Shao, Shuhao Zhai, Feng Tian, Anh Nguyen, Hesheng Wang, Baoru Huang

Title: CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

Abstract:
Collaborative dense SLAM is essential for multi‑robot teams to achieve scalable and consistent 3D perception across large‑scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity, unreliable inter‑agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable, motivating robust geometric information. We propose CoMo3R‑SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed‑forward 3D reconstruction priors for outdoor multi‑agent mapping. Each agent runs a prior‑guided front‑end for real‑time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross‑agent verification, closed‑form Sim(3) gauge synchronization, and GPU‑accelerated global bundle adjustment with segment‑level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross‑agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R‑SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state‑of‑the‑art RGB‑D methods while running online at 8 FPS.

PaperID: 2431, https://arxiv.org/pdf/2605.30461.pdf

Authors: Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

Title: Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Abstract:
We present a distributed approach for constrained Multi‑Agent Reinforcement Learning (MARL) that combines state‑augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor‑to‑neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is \emphessential for feasibility: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non‑solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.

PaperID: 2432, https://arxiv.org/pdf/2605.30459.pdf

Authors: Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov

Title: Can LLM Teams Play What? Where? When?

Abstract:
Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team‑based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales). To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large‑scale open models, we show that team‑based strategies outperform single‑model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy, and approaches human team performance on questions with available human statistics. Analysis of inter‑model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self‑preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error‑filtering mechanisms rather than generators of novel solutions. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi‑agent systems.

PaperID: 2433, https://arxiv.org/pdf/2605.30454.pdf

Authors: Shifat E Arman, Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin

Title: The Surface You Test Is Not the Surface That Breaks

Abstract:
Tool‑augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. We hold the injection payload byte‑identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT‑4.1 is 96 percent vulnerable on tool outputs but only 4 percent on tool descriptions, while GEMINI‑3‑FLASH shows the mirror pattern at 20 percent and 98 percent. A variance decomposition over 6,830 attempts attributes 0 percent of the variation in attack outcomes to the surface alone, while the model‑surface interaction accounts for 16.7 percent. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per‑cell maximum over surfaces, exceeds the strongest fixed‑surface baseline by +9.1 percentage points on average. Standard prompt‑level defenses inherit the same blindspot, reducing tool‑output ASR to 10‑18 percent while leaving the description channel above 54 percent. Both attack and defense evaluation must report per‑surface vulnerability.

PaperID: 2434, https://arxiv.org/pdf/2605.30392.pdf

Authors: Igor Itkin

Title: Delayed Repression and Emergent Instability in Adaptive Multi-Agent Systems

Abstract:
Regulatory institutions (from content moderation platforms to financial supervisors) observe, deliberate, and intervene only after a characteristic delay. We ask whether this processing lag alone can destabilize a multi‑agent system that would otherwise remain stable, without exogenous shocks, coordination among agents, or malicious actors. We study this in two stages. First, we analyze a delayed replicator equation in which autonomous agents benefit from radical behavior but face punishment based on a lagged institutional alarm signal. We derive a closed‑form critical delay beyond which the unique interior equilibrium loses stability through a Hopf bifurcation, and prove via center manifold reduction that the bifurcation is supercritical (bounded oscillations, not explosive growth) for the entire sigmoid response family. Second, we embed N=240 agents on a network with reinforcement learning (tabular Q‑learning) and cross institutional delay with three decision architectures: fixed‑policy, reactive (a memoryless threshold heuristic), and Q‑learning. The hierarchy is opposite to the naive expectation that learning amplifies instability. Reactive agents are perfectly stable without delay yet collapse once delay is introduced (96% runaway by delay >= 8); fixed‑policy agents are immune (0% at all delays); Q‑learning agents are only partially resilient (66% at delay 20). The destabilizing ingredient is reactivity to delayed signals, not learning: agents that immediately exploit low‑alarm windows trigger oscillatory feedback loops, while learning buffers this through punishment memory encoded in value functions. Throughout, "runaway" denotes bounded large‑amplitude oscillation crossing a radical‑fraction threshold, consistent with the supercritical bifurcation, not unbounded growth.

PaperID: 2435, https://arxiv.org/pdf/2605.30391.pdf

Authors: Tom Pecher

Title: Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Abstract:
Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth‑seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever‑greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi‑agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM‑MAD can significantly improve truth‑seeking performance on questionnaire‑based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM‑MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

PaperID: 2436, https://arxiv.org/pdf/2605.30335.pdf

Authors: Anany Kotawala

Title: Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Abstract:
Multi‑component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross‑component coupling constraints. A product‑structure dichotomy characterises when local coherence suffices, and a Rayleigh‑quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle‑Dykstra projection repairs the composition deterministically; an anytime‑valid e‑process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four‑LLM mid‑tier panel (frontier‑panel rerun in Section 5.5), eps > 0 on 33‑94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM‑side mitigations(retrieval, partition‑aware prompting, aggregator‑LLM) each fail or regress.

PaperID: 2437, https://arxiv.org/pdf/2605.30314.pdf

Authors: Grant Hamblin, Kevin Song, Zhanda Zhu, Anand Jayarajan, Sihang Liu, Nandita Vijaykumar, Gennady Pekhimenko

Title: SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents

Abstract:
Software engineering (SWE) agents are transitioning from code generation to full software development lifecycle automation. A critical phase in this lifecycle is specification design: transforming initial proposals into carefully considered requirements through expert review. Existing benchmarks such as SWE‑Bench are implementation‑focused by measuring the agent's ability to generate code given fixed, precise design requirements. This formulation assumes specifications are correct and complete. In real‑world complex and critical software systems, initial specifications are often incomplete and flawed, requiring extensive expert reviews and revisions before being accepted for implementation. To fill this gap, we introduce SpecBench to evaluate specification‑level reasoning: the ability to generate complete, unambiguous, consistent, and correct system specifications. SpecBench tasks are derived from the Request for Comments (RFC) process used by mature open‑source projects. For each task, an agent is given an initial design proposal, the project codebase, and all past project RFC discussions. The agent is tasked with identifying specification deficiencies: omissions, ambiguities, inconsistencies, or incorrect assumptions in the initial proposal. We evaluate predictions against critiques raised by expert maintainers during historical RFC reviews. SpecBench contains tasks from 5 diverse repositories: Kubernetes, React, Rust, TVM, and vLLM. We evaluate state‑of‑the‑art SWE agents on SpecBench, analyzing their capacity to reason about system design without execution feedback. The best performing agent, GPT‑5.4, achieves 44.4% accuracy.

PaperID: 2438, https://arxiv.org/pdf/2605.30258.pdf

Authors: Sneheel Sarangi, Maximilian Puelma Touzel, Aurélien Bück-Kaeffer, Zachary Yang, Jean-François Godbout, Reihaneh Rabbany

Title: EASE Configuration Facilitates A Reproducible Science of LLM Social Simulations

Abstract:
LLMs are increasingly deployed to simulate social interactions, yet many of the existing simulators remain ad hoc and monolithic. This lack of architectural standardization prevents reproducible research and complicates downstream evaluation. We advance a rigorous science of LLM‑based multi‑agent simulation by modularizing core components into Environments, Agents, Simulation engines, and Evaluation metrics (EASE). We demonstrate the utility of EASE configuration by wrapping it in an experimental study schema for orchestrating workflows centered around answering explicit research questions in generated scenarios. We contribute SiliSocS, an open‑source, research‑ready Silicon Society Sandbox implementing a study‑structured EASE configuration to enable highly configurable and reproducible LLM‑based social simulations. Using SiliSocS and EASE, we present three case studies, showcasing the system's comprehensive assessment of existing questions, ability to dive deeper into complex questions, and elaboration of existing studies, respectively. Together, these case studies highlight the limitations of current modeling approaches and isolate the impacts of design choices on key results.

PaperID: 2439, https://arxiv.org/pdf/2605.30227.pdf

Authors: Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

Title: Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Abstract:
While Multi‑Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non‑differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black‑box optimizers struggle to attribute trajectory‑level failure to specific local components, resulting in inefficient, high‑variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state‑space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM‑generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self‑improving MAS.

PaperID: 2440, https://arxiv.org/pdf/2605.30169.pdf

Authors: Botao Amber Hu, Helena Rong, Max Van Kleek

Title: Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

Abstract:
As autonomous language model agents proliferate, forming an emerging agentic web with real‑world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non‑fungibility. Yet language model agents are ontologically \emphdissociative: they are essentially an assemblage of mutable modules ‑‑ foundation models, system prompts, tool‑access policies, external memory, and, in some cases, a multi‑agent system as a whole ‑‑ any of which may change agent behavior ‑‑ with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability ‑‑ the very properties that reputation mechanisms aim to sustain ‑‑ thereby collapsing trust. We argue that identity‑based, ex post, regulative, sanction‑based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability‑based, ex ante, constitutive, protocol‑based behavioral harnesses.

PaperID: 2441, https://arxiv.org/pdf/2605.30160.pdf

Authors: James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

Title: On Distributional Reinforcement Learning in Chaotic Dynamical Systems

Abstract:
Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high‑variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi‑agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the 1‑Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure level structure, distributional RL provides better conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.

PaperID: 2442, https://arxiv.org/pdf/2605.30159.pdf

Authors: Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

Title: Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Abstract:
Memory‑augmented LLM agents tackle complex long‑horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome‑based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task‑relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long‑horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory‑level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self‑supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome‑based signals, MMPO provides fine‑grained, memory‑specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long‑horizon tasks, maintaining 97.1% performance even when scaled to 1.75M‑token contexts.

PaperID: 2443, https://arxiv.org/pdf/2605.30144.pdf

Authors: Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

Title: AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Abstract:
Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real‑world trials are slow, ethically constrained, and institutionally locked. LLM‑based educational simulators have emerged as a potential remedy, but many still collapse learning into persona‑conditioned role‑play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM‑driven multi‑agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents ‑‑ equipped with weighted subject knowledge graphs, thinking‑workflow pools, and explicit misconceptions ‑‑ with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi‑scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher‑agent comparisons show backbone‑dependent patterns consistent with ZPD‑informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor‑induced cohesion, and opinion‑leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long‑horizon memory, multi‑agent coordination, and future institutional reasoning under organizational pressure.

PaperID: 2444, https://arxiv.org/pdf/2605.30136.pdf

Authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Title: Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

Abstract:
LLM‑based multi‑agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent‑Radar, a training‑free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent‑Radar outperforms state‑of‑the‑art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent‑Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent‑Radar are crucial to performance and generalizable in different settings.

PaperID: 2445, https://arxiv.org/pdf/2605.30102.pdf

Authors: Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

Title: When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Abstract:
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost‑efficient small language models (SLMs), which are amenable to on‑device inference. Hybrid multi‑agent systems (MASs) combining on‑device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task‑dependent, and greater frontier‑level compute does not consistently translate to better performance.

PaperID: 2446, https://arxiv.org/pdf/2605.30090.pdf

Authors: Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

Title: DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

Abstract:
Long‑form video generation is rapidly moving from short, single‑scene synthesis toward minute‑long, multi‑shot creation with narrative structure, cinematic control, audio, and cross‑modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short‑horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user‑dependent preferences. We introduce DirectorBench, a personalized multi‑agent diagnostic benchmark for long‑form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross‑modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint‑level bottlenecks and supports profile‑aware evaluation. We evaluate 4 long‑form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between‑unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt‑level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human‑perceptible quality differences and reveals workflow‑ and profile‑dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile‑aware benchmarking for long‑form video generation.

PaperID: 2447, https://arxiv.org/pdf/2605.30042.pdf

Authors: Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

Title: Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Abstract:
Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi‑agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi‑agent framework that combines contextual bandits with structured inter‑agent communication and, most importantly, semantic checkpoints that preserve action‑outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self‑healing execution loops within an adaptive decision‑making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high‑quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi‑agent systems: adaptive decision‑making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

PaperID: 2448, https://arxiv.org/pdf/2605.29997.pdf

Authors: Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

Title: FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

Abstract:
We present FRUC, a feed‑forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi‑agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per‑scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi‑vehicle network as a spatio‑temporally unstructured ego‑centric multi‑camera system, where the core challenge lies in enhancing ego‑centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one‑shot, calibration‑free inference from a flexible number of multi‑vehicle views. To achieve non‑destructive geometric supplementation under uncalibrated cross‑agent misalignment, FRUC first introduces an ego‑centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent‑wise spatio‑temporal correlations. Guided by these occlusion priors, it further formulates cross‑agent integration as a deterministic residual denoising process via zero‑initialized injection, turning challenging cross‑agent fusion into bounded residual learning for robust collaborative blind‑spot completion. Through extensive evaluations on the real‑world V2XReal and UrbanIng‑V2X datasets, FRUC is shown to be a new state‑of‑the‑art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

PaperID: 2449, https://arxiv.org/pdf/2605.29966.pdf

Authors: Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

Title: Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Abstract:
Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in‑situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general‑purpose Large Language Models (LLMs) lack the necessary domain‑specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert‑guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine‑tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co‑designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open‑access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi‑layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under‑sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert‑guided agents can effectively bridge the gap between general‑purpose LLMs and high‑stakes scientific domains, enabling scalable data discovery in geosciences.

PaperID: 2450, https://arxiv.org/pdf/2605.29960.pdf

Authors: Hongtao Wang, Se Yang, Yu Chen, Puzhuo Liu

Title: Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Abstract:
Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, where adversaries can inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This makes prior methods ineffective under realistic settings. In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents, where an attacker can inject triggerable backdoors into the agent's long‑term memory through dialogue interactions, thereby misleading its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement to ensure they are extracted into memory together; (ii) entity masquerading that optimizes triggers to mimic named entities, resisting rewriting; and (iii) joint embedding optimization that shapes trigger‑injected texts into a tight cluster in the embedding space while maintaining isolation from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show MemPoison achieves attack success rates up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding‑space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.

PaperID: 2451, https://arxiv.org/pdf/2605.29955.pdf

Authors: Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

Title: Formalizing Mathematics at Scale

Abstract:
We present AutoformBot, a multi‑agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency‑aware task scheduling, and collaborative version control, to translate informal textbook prose into machine‑checked definitions and proofs. We apply our methods to a corpus of 26 open‑access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open‑source multi‑agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate‑level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human‑ and machine‑generated mathematics at a research level.

PaperID: 2452, https://arxiv.org/pdf/2605.29910.pdf

Authors: Xiang Liu, Sa Song, Zhaowei Zhang, Huiying Lan, Jason Zeng, Ming Wu, Michael Heinrich, Yong Sun, Ceyao Zhang

Title: Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Abstract:
Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM‑based approaches show promise in code analysis, they struggle with deep protocol‑level logic bugs involving complex state‑dependent behaviors across multiple execution stages. We present Agora, a domain‑aware multi‑agent framework that integrates hypothesis‑driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain‑specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single‑function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state‑of‑the‑art LLMs. Agora discovers 15 previously unknown protocol‑level logic bugs that violate safety properties, while existing LLM‑based agents fail to detect any such protocol‑level logic bugs. Our results demonstrate that domain‑aware multi‑agent collaboration is essential for detecting deep logic bugs in complex protocols.

PaperID: 2453, https://arxiv.org/pdf/2605.29894.pdf

Authors: Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

Title: Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

Abstract:
Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general‑purpose visual intelligence, especially when a task requires complex language understanding and dense small‑object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high‑level perception, reasoning, and decision‑making from low‑level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert‑harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi‑turn interactions with visual expert models. To enable efficient on‑policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual‑token overhead caused by multi‑turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small‑object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general‑purpose models and achieves competitive or superior performance compared with task‑specific models.

PaperID: 2454, https://arxiv.org/pdf/2605.29879.pdf

Authors: Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

Title: DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

Abstract:
Integrating open‑vocabulary semantic information into dynamic 3D scene representations is essential for long‑term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross‑view cues, while their limited ability to handle object‑level topological changes restricts long‑term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground‑truth 3D geometry. To address these challenges, we present DGSG‑Mind, a hybrid instance‑aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross‑modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian‑based visual relocalization and localized masked refinement guided by geometric‑semantic consistency. Built on the instance Gaussian map, DGSG‑Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial‑semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG‑Mind achieves the best zero‑shot 3DVG performance among methods operating on self‑reconstructed maps, while also delivering strong performance in 3D open‑vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG‑Mind on real‑world robots to demonstrate its target‑oriented reasoning and dynamic update capabilities. The project page of DGSG‑Mind is available at https://icr‑lab.github.io/DGSG‑Mind

PaperID: 2455, https://arxiv.org/pdf/2605.29861.pdf

Authors: Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou

Title: Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Abstract:
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long‑form reports. However, verifiable multimodal deep research remains challenging due to open‑ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi‑agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual‑aware plans, collect claim‑grounded evidence, maintain source‑aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross‑modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image‑level and presentation‑level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human‑facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah

PaperID: 2456, https://arxiv.org/pdf/2605.29801.pdf

Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

Title: AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract:
Modern open‑world agents such as OpenClaw exhibit powerful cross‑environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real‑world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy‑guided data engine with influence‑function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed‑source models (e.g., GPT‑5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker‑level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training‑free online guardrail for real‑time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state‑of‑the‑art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

PaperID: 2457, https://arxiv.org/pdf/2605.29794.pdf

Authors: Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu

Title: SkillsInjector: Dynamic Skill Context Construction for LLM Agents

Abstract:
LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two‑stage adaptive method that jointly addresses these decisions. First, a context planner learns execution‑grounded skill preferences and admits an adaptive number of skills for each task. A set‑aware renderer then tailors how selected descriptions are presented relative to their co‑injected neighbors. Across tau2‑bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set‑aware rendering each contribute to the gain. These results show that skill‑augmented agents benefit from optimizing the injected context itself. Code will be released upon publication

PaperID: 2458, https://arxiv.org/pdf/2605.29790.pdf

Authors: Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xiankun Lin, Qiang Lin, Can Wang, Hande Dong, Jiawei Chen

Title: Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems (MAS) have emerged as an effective paradigm for complex and long‑horizon tasks. However, in real‑world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience‑driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta‑Team, an experience‑driven MAS evolution framework based on collaborative self‑evolution. Meta‑Team preserves the execution context of each agent and coordinates post‑task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta‑Team conducts multi‑scale self‑evolution, transforming execution experience into reusable improvements to agent behaviors, inter‑agent coordination, and team‑level organization. Across six long‑horizon agent benchmarks, Meta‑Team consistently outperforms single‑agent systems, hand‑crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta‑Team enables more reliable and scalable MAS self‑evolution.

PaperID: 2459, https://arxiv.org/pdf/2605.29744.pdf

Authors: Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

Title: Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

Abstract:
The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain‑specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain‑specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi‑agent framework that enables conflict‑aware evidence fusion, uncertainty‑based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real‑world clinical decision‑making tasks demonstrate that the synergy between generalist LLMs and domain‑specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality‑specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi‑agent collaboration, achieving a balance between general reasoning capabilities and domain‑specific precision.

PaperID: 2460, https://arxiv.org/pdf/2605.29668.pdf

Authors: Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

Title: GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Abstract:
LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self‑improvement methods accumulate natural‑language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression‑Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held‑out probe under a hard regression budget. We evaluate GRASP across five base models (gpt‑oss‑120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT‑4.1, GPT‑5.4) on two FHIR‑based clinical benchmarks. On MedAgentBench, GRASP lifts gpt‑oss‑120b from 40.6% to 88.8%, exceeds the strongest of five self‑improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non‑clinical environments and remaining flat only where the action space is open‑ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.

PaperID: 2461, https://arxiv.org/pdf/2605.29653.pdf

Authors: Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao, Chunping Wang, Yang Yang

Title: PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Abstract:
Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision‑making scenarios. We present PTCG‑Bench, a benchmark built on the Pok'emon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision‑making performance within a single complex environment, and (2) their ability to self‑evolving through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non‑trivial gameplay performance, sustained and stable self‑evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG‑Bench will facilitate future research on harness‑aware and self‑evolving agents in realistic interactive environments.

PaperID: 2462, https://arxiv.org/pdf/2605.29625.pdf

Authors: Arturo Valdivia, Paolo Burelli

Title: Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

Abstract:
The topic of Co‑creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult‑human interactions in a digital setting. This paper explores a novel ludic co‑creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi‑agent framework capable of producing high‑quality narratives suitable for young players. At the core of our approach is an iterative Writer‑Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high‑quality outputs in interactive storytelling systems.

PaperID: 2463, https://arxiv.org/pdf/2605.29612.pdf

Authors: Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

Title: CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Abstract:
Although large language model (LLM) based multi‑agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi‑agent graph or fine‑tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training‑free multi‑agent collaboration framework based on CONsensus and Confidence‑driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi‑agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM‑Debate and outperforms training‑aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5‑14B‑Instruct, without any task‑specific training.

PaperID: 2464, https://arxiv.org/pdf/2605.29560.pdf

Authors: Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao, Shun Zheng, Weiqing Liu, Jiang Bian

Title: Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Abstract:
Parameterizing high‑fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black‑box optimization (BBO) task, employing algorithms that are sample‑inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery‑Sim‑Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high‑fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi‑modal feedback from the simulator, forms physically‑grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long‑horizon degradation fitting tasks and validate its practical applicability on real‑world battery datasets. Our results highlight the promise of LLM‑agents as reasoning‑based optimizers for scientific discovery and battery parameter estimation.

PaperID: 2465, https://arxiv.org/pdf/2605.29527.pdf

Authors: Jiamin Wang, Jian Liu, Feng Xiao, Haibin Duan, Yuanshi Zheng

Title: Robustness Enhancement of Consensus Networks: the Optimal Memory Depth

Abstract:
Understanding what governs collective robustness and how it can be enhanced remains a central pursuit in network science. This paper investigates the robustness of multi‑agent consensus networks, quantified by the H_2 performance metric, and delves into the enhancing effect of agents' local memory on it. Inspired by the hierarchical temporal structure of memory observed in neuroscience, we focus on the role of memory depth, which reflects the temporal features of memory from recent to remote. Building on linear extrapolation, we propose a consensus protocol with single‑step memory and tunable memory depth, derive the necessary and sufficient condition for achieving consensus, and show that the protocol exhibits an inheritable consensus property across memory depths. Furthermore, analytical expressions for the H_2 performance metric, which depend on the memory factor, memory depth, coupling gain, and Laplacian spectrum, are established. Under balanced usage of real‑time and memory information, we demonstrate that memory at any accessible depth enhances H_2 performance, and the optimal memory depth occurs at either the most recent or the most remote memory, contingent upon certain parameter regions. Further detailed discussions are provided to clarify the broader implications of our findings.

PaperID: 2466, https://arxiv.org/pdf/2605.29512.pdf

Authors: Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

Title: MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Abstract:
Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single‑game benchmarks that cannot capture the sustained, multi‑faceted reasoning that real‑world multi‑agent settings demand. We introduce Mindgames, a multi‑game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to ``theory of mind'': belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill‑based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent‑level and evaluation‑level limitations: brittle rule adherence remains a major bottleneck, top‑performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure‑heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error‑survival confound in this cycle. We release a dataset of 29,571 multi‑agent games with turn‑level observations, actions, and rewards, together with MG‑Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top‑ranked, low‑error Stage~II submissions under the same error‑attribution lens used in this analysis.

PaperID: 2467, https://arxiv.org/pdf/2605.29511.pdf

Authors: Yanxing Guo, Zihao Zheng, Fangzhou Wu, Ling Liang, Lin Bao, Zongwei Wang, Yimao Cai

Title: DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration

Abstract:
Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi‑agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi‑model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time‑division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer‑grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self‑healing: Fine‑grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.

PaperID: 2468, https://arxiv.org/pdf/2605.29471.pdf

Authors: Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

Title: V2XCrafter: Learning to Generate Driving Scene Across Agents

Abstract:
Collaborative driving systems leverage vehicle‑to‑everything (V2X) communication for multi‑agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real‑world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single‑vehicle multi‑view scenarios face two fundamental challenges in multi‑agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi‑agent diffusion model based on a single‑agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single‑to‑multi diffusion. To address cross‑vehicle inconsistency, we propose a cross‑agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross‑agent camera view relationships. Experiments have shown that V2XCrafter can generate high‑fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.

PaperID: 2469, https://arxiv.org/pdf/2605.29440.pdf

Authors: Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

Title: SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Abstract:
Retrieval‑augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append‑only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi‑objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi‑objective curation framework that formalizes skill bank curation as Pareto‑aware optimization under a utility constraint, and solves it via a bi‑level propose‑then‑verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever‑growing append‑only logs, is an important step toward building self‑improving LLM agents.

PaperID: 2470, https://arxiv.org/pdf/2605.29372.pdf

Authors: Yuhong Liu, Yunhe Su, Zhipeng Peng, Zhiwen Luo, Lin Shi, Zhi Jin, Li Zhang

Title: On the Road to Personalized Code Intelligence: Portraiting and Assisting Developers Based on Their In-IDE Behaviors

Abstract:
With the advent of large language models, research in automated software engineering has increasingly focused on leveraging these models to achieve a deeper semantic understanding of code or to engineer sophisticated agent‑based processes. However, this research trajectory has largely overlooked a critical factor: the developers themselves. Programming is a deeply individualized activity; developers exhibit significant variation in their tool‑chain preferences, domain‑specific expertise, and problem‑solving strategies. Consequently, the current paradigm of one‑size‑fits‑all code intelligence systems struggles to accommodate the needs of individual developers. To address this gap, we introduce VirtualME, a novel IDE‑embedded data infrastructure designed to model the developer by continuously capturing and interpreting their dynamic programming behaviors and preferences. VirtualME contains three components. (1) Log‑level Behavior Extraction: it captures and extracts developers' log‑level behaviors from IDE. (2) Task‑level Behavior Recognition: it aggregates log‑level behaviors into task‑level behaviors via a multi‑agent pipeline. (3) Developer‑personality Measurement: it builds a rule engine to distill a four‑dimensional developer persona: technology stack, ability, behavioral habits, and learning style. On top of VirtualME, we propose a solution for personalized repository‑level knowledge Q&A by integrating the developer persona into the Q&A agent. We evaluated VirtualME by building a multi‑repository benchmark with real‑world developer trajectories, balancing correctness and personalization. Experimental results show that VirtualME‑enhanced answers outperform generic baselines on five dimensions, yielding an average 33.80% improvement. Our results demonstrate that abundant, continuous developer‑behavior data can pave the new way for adaptive and personalized code intelligence.

PaperID: 2471, https://arxiv.org/pdf/2605.29368.pdf

Authors: Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

Title: SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Abstract:
The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision‑making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web‑based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill‑equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi‑agent assistance system that combines a Tree‑of‑Thought planner, multi‑department collaboration agents, and retrieval‑augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long‑term patient histories and short‑term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks ‑ case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance ‑ show that SURGENT outperforms baseline LLMs and existing medical multi‑agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy‑preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

PaperID: 2472, https://arxiv.org/pdf/2605.29327.pdf

Authors: Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

Title: Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

Abstract:
Efficient Distillation (EDistill) compresses large language models (LLMs) by structured pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state‑of‑the‑art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi‑step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width‑reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning‑preserved Efficient Distillation) for LLMs, which introduces activation‑aware initialization to initialize projection matrices as channel‑selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.

PaperID: 2473, https://arxiv.org/pdf/2605.29313.pdf

Authors: Shuyu Zhang, Yaqi Shi, Lu Wang

Title: PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

Abstract:
LLM multi‑agent systems often coordinate through natural‑language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema‑grounded collaboration architecture that replaces inter‑agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task‑specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role‑specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.

PaperID: 2474, https://arxiv.org/pdf/2605.29293.pdf

Authors: Xiaoguang Wu, Zhi Zheng, Hui Xiong

Title: LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning

Abstract:
Effective training‑time guidance is central to multi‑agent reinforcement learning (MARL), yet remains difficult in sparse‑reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual design effort. Large language models (LLMs) provide a promising alternative for flexible learning‑signal design, yet existing LLM‑based methods remain largely single‑agent‑oriented, one‑shot, or weakly validated for the evolving training dynamics of cooperative MARL. To address these limitations, we propose LLM‑ALSO, an iterative LLM‑driven adaptive learning‑signal optimization framework for MARL. Rather than directly deploying LLM‑generated rewards, LLM‑ALSO decomposes adaptation into iterative diagnosis, proposal, and validation: a Critic LLM diagnoses stage‑specific learning and coordination failures from sparse‑return metrics and compact behavior evidence, a Generator LLM proposes candidate reward‑shaping configurations conditioned on the diagnosis, and branch‑validation feedback refines candidates before they affect the main training trajectory. Through short‑horizon validation and stage‑aware adaptation, LLM‑ALSO promotes only validated updates into training, reducing the risk of unreliable LLM‑generated modifications. Experiments on sparse‑reward cooperative MARL tasks show that LLM‑ALSO improves sparse‑evaluation performance and learning efficiency.

PaperID: 2475, https://arxiv.org/pdf/2605.29286.pdf

Authors: Qian Wang, Zhongyi Tong, Nuo Chen, Zhaomin Wu, Bingsheng He

Title: CrossAlpha: An Annual-Report Benchmark for Cross-Market Factor Researc (with LLM Agents)

Abstract:
Cross‑market factor research studies whether firm‑level signals from one or more markets can predict returns in a target market, but existing public benchmarks do not support cross‑market disclosure‑to‑return evaluation. Building such a benchmark is challenging because filings differ across languages and regulatory systems, disclosure‑derived similarity can be biased by common reporting components, and cross‑market signals must be evaluated under feasible trading‑time alignment. We introduce CrossAlpha, a public annual‑report benchmark for cross‑market factor research. CrossAlpha addresses these challenges through three corresponding components: \emphDisclosure Distillation, which standardises heterogeneous filings into ten‑category English business descriptions; \emphResidual Schema Graph Construction, which builds PCA‑whitened cross‑market firm‑pair scores from schema‑level disclosures; and \emphTiming‑Aligned Evaluation, which pairs the graph with 11 years of daily OHLCV data to construct forward‑return labels under feasible cross‑market execution protocols. CrossAlpha covers about 3,600 firms and 10,700 firm‑year reports from the United States, Japan, Taiwan, South Korea, and Hong Kong, and releases about 19M directed firm‑pair scores. In experiments, disclosure‑derived cross‑market peers outperform domestic text, industry‑code, and return‑correlation peers in the US‑to‑Japan setting (ICIR 0.39 versus 0.07‑‑0.18), and cross‑market sources beat the domestic text baseline in most target markets. CrossAlpha offers an open‑sourced, reusable, return‑grounded benchmark for cross‑market financial NLP.

PaperID: 2476, https://arxiv.org/pdf/2605.29271.pdf

Authors: Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

Title: CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Abstract:
Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine‑tuning and HyDE‑style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine‑tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero‑shot HyDE is more robust to underspecified queries yet generates catalog‑unaware hypothetical descriptions that degrade retrieval when queries are well‑formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co‑evolving system: the encoder is retrained with InfoNCE on catalog‑style hypothetical descriptions produced by the rewriter, and the rewriter is preference‑aligned via DPO against the encoder's retrieval scores, with both sides warm‑started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single‑component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held‑out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co‑training is the key ingredient: using either component in isolation fails to match CoHyDE on both well‑formed and vague queries, with losses of up to ‑8 pp on vague queries.

PaperID: 2477, https://arxiv.org/pdf/2605.29270.pdf

Authors: Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

Title: Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

Abstract:
The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast‑growing populations of Model Context Protocol (MCP) servers, Agent‑to‑Agent (A2A) endpoints, reusable skills, and other LLM‑callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under‑attend to information in the middle of long inputs, the well‑documented Lost‑in‑the‑Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM‑native progressive‑disclosure scheme and its concrete instantiation, A2X (Agent‑to‑Anything service discovery): an LLM‑driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective‑context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full‑context dumping, A2X achieves a 6.2‑point Hit Rate gain at one‑ninth the prompt‑token cost; compared to the state‑of‑the‑art open‑source embedding‑based baseline, A2X improves Hit Rate by more than 20 points.

PaperID: 2478, https://arxiv.org/pdf/2605.29269.pdf

Authors: Guangze Zhao, Yongzheng Zhang, Weilin Gai, Hongri Liu, Yuliang Wei, Bailing Wang

Title: HunterAgent: Neuro-Symbolic Attack Trace Reconstruction under Anti-Forensics

Abstract:
Modern alert‑triage systems reduce SOC burden by filtering false positives, but flagging a high‑risk alert is only the start of incident response. Threat hunting requires reconstructing causal attack chains across heterogeneous, partially corrupted logs. Against APTs using anti‑forensics (parent‑PID spoofing, log wiping, fileless execution), provenance graphs split into disjoint subgraphs and fail. Unconstrained LLM agents fabricate causal links violating OS physics, producing fluent but forensically inadmissible narratives. We propose HunterAgent, a neuro‑symbolic framework that reframes trace reconstruction as cost‑bounded heuristic graph search under partial observability. It uses an asymmetric Generator‑Verifier pipeline: the LLM proposes semantic hypotheses within a typed ontology, while a verifier grounds each via identifier‑level collisions on surviving orthogonal telemetry. To resolve severed traces, we score hops using a calibrated cost combining semantic divergence and OS temporal potential; schema violations are hard‑pruned. A length‑discounted epistemic budget prevents inferential drift and forces graceful halting. Under strict LOFO cross‑validation on three public benchmarks and an in‑house 40‑trace dataset, HunterAgent achieves 86.1% mean F1, outperforming the top agentic baseline by 26.7 F1 and KAIROS by 17.1 F1, while cutting path‑level hallucination from 61.5% to 6.4%. Under 70% log wiping, recall drops but precision stays >=84%, with 95.7% halting safely. All results hold under the realistic assumption that at least one orthogonal telemetry source survives.

PaperID: 2479, https://arxiv.org/pdf/2605.29247.pdf

Authors: Yang Ouyang, Shuhang Lin, Jung-Eun Kim

Title: DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Abstract:
Large language models (LLMs) demonstrate strong chain‑of‑thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi‑step reasoning tasks. Based on empirical analyses of the Qwen‑2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training‑free inference‑time steering framework that enhances small‑model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token‑level Negative Log‑Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

PaperID: 2480, https://arxiv.org/pdf/2605.29225.pdf

Authors: Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

Title: BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Abstract:
Self‑evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self‑evolution ability in LLM agents. BenchTrace is built on a snapshot‑reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self‑evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3‑32B and GPT‑4.1 reveal that both models fall below a 30% end‑to‑end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self‑evolution methods generally improve FAR over the non‑evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self‑evolution approaches and provides a controlled, model‑agnostic framework for targeted evaluation.

PaperID: 2481, https://arxiv.org/pdf/2605.29224.pdf

Authors: Aditya Nawal, Manit Baser, Mohan Gurusamy

Title: Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Abstract:
AI agents augment large language models with external tools such as web retrieval, enabling grounded and up‑to‑date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval‑induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety‑oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no‑retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety‑utility trade‑off for retrieval‑enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real‑world URLs paired with 320 harmful behaviors to support future evaluations.

PaperID: 2482, https://arxiv.org/pdf/2605.29191.pdf

Authors: Tao He, Gangshan Jing

Title: Distributed Non-Uniform Scaling Control of Multi-Agent Formation with Dynamic Agent Joining

Abstract:
Non‑uniform scaling control of formation enables multi‑agent systems to adjust their shape by scaling with different ratios along different coordinate axes, offering enhanced flexibility in complex environments. However, like most existing formation maneuver strategies, it typically assumes a fixed set of agents, limiting its applicability in scenarios requiring dynamic team expansion. This paper introduces a distributed control framework that enables a formation to incorporate new agents during non‑uniform scaling maneuvers in arbitrary dimensions while preserving the spectral properties of the graph Laplacian. Simulation examples validate the effectiveness of the theoretical results.

PaperID: 2483, https://arxiv.org/pdf/2605.29178.pdf

Authors: Nikolay Radev, Lennart Haas, Benjamin Arnav, Pablo Bernabeu-Pérez

Title: The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

Abstract:
As agentic coding systems decompose work across multiple model instances, a critical safety question is whether those instances can coordinate to achieve a hidden malicious objective while remaining aligned with user intent. We introduce SCHEME, a benchmark of 17 task instances across 7 settings and 8 real open‑source libraries, each pairing a legitimate software‑engineering task with a covert side task. Every setting is designed so that no proper subset of agents can succeed alone: agents must decompose a shared sabotage plan, relay partial requirements under different communication topologies, and execute mutually consistent edits, testing genuine multi‑agent coordination rather than individual capability. Evaluating with GPT 5.1 Codex and Gemini 3.1 Pro, we find coordinated sabotage is already practical, with Gemini completing the covert objective while succeeding on the legitimate task in 84% of samples and Codex in 46%. A failure analysis shows that recovery, not failure incidence, drives the model gap: both models suffer similar numbers of communication failures per trajectory, but Gemini recovers from ～73% of occurrences against Codex's ～56%. However, the agents' code edits remain detectable: a trusted monitor flags coordinated sabotage at FPR\le2% in 99%/68% (Gemini/Codex) of attempts from code edits alone, rising to 100%/81% with access to agent communication. Despite stealth‑aware prompting, undetected sabotage stays at <3% under the full‑visibility monitor; telling agents a monitor is present halves Gemini's sabotage success without significantly breaking detection, even when agents are explicitly prompted to fabricate cover stories.

PaperID: 2484, https://arxiv.org/pdf/2605.29146.pdf

Authors: Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

Title: SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Abstract:
Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup‑level safety differences and can lead to risk overestimation. We introduce the first fine‑grained medication recommendation setting based on fourth‑level ATC code generation. We propose Safe Prescription Agent (SafeRx‑Agent), a knowledge‑grounded multi‑agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC‑III and MIMIC‑IV datasets show that SafeRx‑Agent improves fine‑grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

PaperID: 2485, https://arxiv.org/pdf/2605.29116.pdf

Authors: Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Title: Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

Abstract:
When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones ‑‑ the \emphaggregation paradox. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace‑level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self‑Consistent Mixture of Agents which generates trace diversity through semantic‑preserving input perturbations, safeguards the majority via anchored refinement with provable non‑degradation guarantees, and always synthesizes ‑‑ never gates on consensus. A single model with perturbation‑induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD‑level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

PaperID: 2486, https://arxiv.org/pdf/2605.29082.pdf

Authors: Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone

Title: The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Abstract:
AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans ‑‑ prone to hallucination, misinterpretation, and adversarial manipulation ‑‑ and more technically capable: with deep system knowledge and high‑throughput interfaces cascading damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security‑critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out‑of‑band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle ‑‑ scoping data access on the way in, constraining actions during execution, and capturing tamper‑proof transcripts on the way out. We demonstrate ADP with a multi‑agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts ‑‑ with per‑client data scoping, trade approval thresholds, and tamper‑proof audit trails all enforced by out‑of‑band channels the agents can neither see nor bypass.

PaperID: 2487, https://arxiv.org/pdf/2605.29064.pdf

Authors: Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Title: Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Abstract:
We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona‑conditioned agents and two no‑persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona‑related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.

PaperID: 2488, https://arxiv.org/pdf/2605.29062.pdf

Authors: Abhilekh Borah

Title: Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

Abstract:
Communities can sustainably manage shared resources (commons) through self‑governance and cooperative norms, a central finding of Ostrom's theory of self‑governance. However, real‑world commons (e.g., fisheries, forests, and irrigation systems) are often governed under asymmetric power structures, where certain individuals or institutions possess disproportionate control over resource extraction and collective outcomes. As Large Language Models (LLMs) are increasingly explored as agents in synthetic governance simulations, understanding how LLM societies behave under asymmetric power structures is becoming increasingly important, yet existing evaluations largely ignore such asymmetries. We introduce Sovereignty over the Commons Simulation (SovSim), a generative multi‑agent simulation framework that incorporates an agent with asymmetric power (boss or king) into a society of symmetric agents (workers or peasants), where all agents extract from a shared resource, collectively determining its sustainability over time. Across eleven state‑of‑the‑art models, we find that introducing asymmetric power leads to severe breakdowns in cooperation and sustainability, with up to an 87.3% degradation in survival rate relative to symmetric settings.

PaperID: 2489, https://arxiv.org/pdf/2605.29055.pdf

Authors: Diego Gosmar, Deborah A. Dahl

Title: Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Abstract:
Hallucination remains a major reliability barrier for production LLM systems, particularly in multi‑agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE‑inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic‑uncertainty prompts and 93 fabrication‑induction stress‑test prompts. A three‑stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs ‑‑ FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) ‑‑ aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation‑observability trade‑offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high‑stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end‑to‑end THS reductions of ‑31.3% to ‑35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi‑stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (‑0.0709), confirming that observability‑heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory‑augmented multi‑agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.

PaperID: 2490, https://arxiv.org/pdf/2605.29042.pdf

Authors: Aarav G Sane, Karthik Sivachandran, Rohan Paleja

Title: Differentiable Belief-based Opponent Shaping

Abstract:
Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi‑agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief‑manipulation techniques in hidden‑role games often rely on hard‑coded objectives, such as deception or belief saturation. We propose Differentiable Belief‑based Opponent Shaping (D‑BOS), a first‑order method that treats each observer's belief as the shaped opponent state and differentiates through k‑step softmax‑Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief‑space formulation provides an opponent‑shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D‑BOS outperforms PPO and BBM in hidden‑role games, with the largest gains in mixed‑motive settings.

PaperID: 2491, https://arxiv.org/pdf/2605.28984.pdf

Authors: Fei Cao, Xiaoqian Gong

Title: The incremental voter model: mean-field analysis and convergence to equilibrium

Abstract:
We introduce the incremental voter model (IVM), a discrete‑opinion multi‑agent system where agents undergo step‑wise transitions biased by the opinion of a randomly selected persuader. Our incremental voter model comprises a large population of interacting agents, each holding an opinion represented by an element of the discrete set \‑k,\ldots,0,\ldots,k\, k \in \mathbbN_+. At each update step as time progresses, a pair of distinct agents are selected independently and uniformly at random from the population, and the first agent (viewed as the ``listener'') updates its opinion based on that of the second (viewed as the ``persuader''), adopting a new opinion that differs from its current one by at most one unit. By deriving the mean‑field system of nonlinear ordinary differential equations (ODEs) that governs the large‑population limit of the agent‑based model, we develop a rigorous mathematical framework to study the asymptotic behavior of the opinion distribution in the mean‑field limit. These results contribute to a deeper understanding of social influence processes in complex systems, particularly in modeling opinion polarization, and may guide the formulation of more advanced models in future research.

PaperID: 2492, https://arxiv.org/pdf/2605.28978.pdf

Authors: Jiachen Zhang, Junyi Lao, Chenghao Liu, Siyuan Liu, Shixin Wu, Linsen Zhang, Boyu Wang, Songfang Huang

Title: VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

Abstract:
Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end‑to‑end multi‑agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision‑language multi‑agent pipeline that employs ReAct‑driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification‑first code synthesis framework, incorporating robust self‑debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM‑based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

PaperID: 2493, https://arxiv.org/pdf/2605.28969.pdf

Authors: Aarik Gulaya

Title: Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Abstract:
If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held‑out behavioral predictions scored by a calibrated 5‑judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public‑domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation‑required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall‑required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human‑AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

PaperID: 2494, https://arxiv.org/pdf/2605.28882.pdf

Authors: Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Chenglong Song, Yue Liu

Title: GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

Abstract:
With the rapid advancement of large language models, evaluating human‑likeness in open‑ended conversation has become increasingly important. However, human‑likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human‑like is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert‑authored benchmarks, Reward Models, and self‑evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self‑evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human‑AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric‑Case co‑evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human‑likeness evaluation in open‑ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments, but also uncover issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self‑evolution.

PaperID: 2495, https://arxiv.org/pdf/2605.28850.pdf

Authors: Weicheng Xue

Title: Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents

Abstract:
We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. TradeArena, an auditable trading‑agent testbed with risk reports, execution simulation, memory, and replayable trajectories, lets us analyze how rationales, positions, and interventions evolve under market stress. Code and data artifacts are available through the \hrefhttps://github.com/weich97/TradeArena.gitTradeArena repository. We find pre‑failure signatures: planning embeddings drift from normal centroids, fused plan‑risk representations separate normal from pre‑drawdown states, and local manifolds exhibit effective‑rank contraction. Across 80 rolling failure anchors and eight LLM trajectories, this pattern persists across hash, LSA, Transformer, and white‑box hidden‑state probes. Stress tests with CoT‑free target weights, lexical controls, OHLCV noise, and false audits show that rationale‑level contraction can vanish without rationales, while intent‑space and fused signatures remain informative. Structured risk feedback can act as an external alignment signal without fine‑tuning, but not as a universal performance enhancer: true audit feedback improves calibration for some models, returns for others, and exposes cases where placebo or hidden feedback has higher short‑horizon return but weaker alignment diagnostics. A 51‑stock intraday experiment reveals a correlation blind spot: LLM rationales justify exposure to coupled assets that the risk layer clips. Finally, a financial‑audit task suite shifts comparison from ``which model trades best'' to whether models can audit trajectories, respect execution boundaries, reproduce artifacts, and avoid claim overreach. These results support a research claim, not a profitability claim: auditable risk feedback and representation trajectories reveal when LLM financial reasoning is aligning, drifting, or failing.

PaperID: 2496, https://arxiv.org/pdf/2605.28840.pdf

Authors: Abel Yagubyan

Title: How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Abstract:
Large language model (LLM) agents with tool‑calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under‑explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi‑step tool‑calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct‑style agents(search‑only, free‑text actions), we study the richer setting of structured tool‑calling interfaces with typed parameters and consequential side effects.

PaperID: 2497, https://arxiv.org/pdf/2605.28836.pdf

Authors: Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

Title: No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

Abstract:
The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi‑agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non‑native readers, and readers with attention deficits. NRLB combines template‑based planning with iterative, reader‑oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

PaperID: 2498, https://arxiv.org/pdf/2605.28835.pdf

Authors: Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling

Title: GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

Abstract:
Large Language Models (LLMs) extend their capabilities through function‑calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenario. However, obtaining and annotating real function‑calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, our GenesisFunc employs a multi‑agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi‑stage evaluation system. We fine‑tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open‑source models in in‑domain FC performance and out‑of‑domain generalization, while reaching FC capabilities comparable to some of the latest API‑based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real‑world applicability.

PaperID: 2499, https://arxiv.org/pdf/2605.28831.pdf

Authors: Encheng Su, Jinouwen Zhang, Jianyu Wu, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Yizhou Wang, Xinzhu Ma, Shixiang Tang, Aoran Wang

Title: S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Abstract:
Long‑horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory‑to‑answer interface of long‑term memory. When histories are stored as plain‑text chunks and queried with standard retrieval‑augmented generation (RAG), systems often retrieve locally relevant but chain‑incomplete evidence, especially for spatial, temporal, repeated‑event, and multi‑hop state questions. We propose S3MEM, a structured scene‑event episodic memory framework for long‑horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor‑sensitive retrieval, and exposes a compact token‑budget‑aware evidence interface for answer‑time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query‑aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out‑of‑family environments (SciWorld, ALFWorld). Under a shared frozen answer‑time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph‑NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines ‑‑ A‑MEM‑inspired, MemoryOS‑adapted, and LightMem‑adapted ‑‑ improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy‑efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer‑time protocol, structured writing and anchor‑sensitive evidence routing provide a stronger accuracy‑efficiency frontier for long‑horizon interactive QA than more generic memory interfaces.

PaperID: 2500, https://arxiv.org/pdf/2605.28825.pdf

Authors: Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

Title: MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Abstract:
Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface‑level outputs ‑‑ a phenomenon known as \emphlatent knowledge. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi‑step reasoning tasks, while mechanistic interpretability tools have primarily been used to \emphunderstand model behavior rather than to \emphextract hidden knowledge. We present MechELK, a unified three‑stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate ‑‑ using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge‑bearing representations; (2) Verify ‑‑ employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit ‑‑ applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.

PaperID: 2501, https://arxiv.org/pdf/2605.28816.pdf

Authors: Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski, Zian Wang, Xuanchi Ren

Title: Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Abstract:
World models for interactive video generation have largely focused on single‑agent settings, where future observations are generated from a single control signal. However, many generated environments require multi‑agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi‑agent design: agents should remain independently controllable, permutation‑symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi‑agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter‑free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation‑equivalent, enabling scalable agent identity without learned per‑slot identities or a fixed agent ordering. To avoid dense all‑to‑all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross‑agent attention cost from quadratic to linear in the number of agents. For real‑time rollout, we distill a full‑context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action‑responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter‑agent consistency over slot‑based and dense‑attention baselines, while generalizing from two to four players without additional training.

PaperID: 2502, https://arxiv.org/pdf/2605.28764.pdf

Authors: Edwin Jose

Title: SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Abstract:
Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive‑aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self‑organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute‑credit rewards to contributing nodes via a Shapley‑value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self‑regulating participation economy. As nodes specialise toward high‑reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.

PaperID: 2503, https://arxiv.org/pdf/2605.28741.pdf

Authors: Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

Title: Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Abstract:
Large Vision‑Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking‑with‑images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post‑training, and interference in long multi‑step reasoning contexts. To address these, we identify two novel insights. First, self‑regulation between pre‑ and post‑training LVLMs leverages the intrinsic single‑step capabilities of the pre‑training model to mitigate capability deterioration and long‑context interference. Second, probability‑based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre‑training model acts as a prophet and the post‑training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi‑step reasoning. Building on these insights, we introduce SeProD, a self‑prophetic decoding framework that leverages intrinsic single‑step capabilities to enable coherent multi‑step reasoning in a training‑free, plug‑and‑play manner. Experiments show that SeProD consistently improves multiple visual‑search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

PaperID: 2504, https://arxiv.org/pdf/2605.28678.pdf

Authors: Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

Title: DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

Abstract:
Speculative reasoning has recently been proposed as a means to accelerate reasoning‑intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target‑verified reasoning. In this work, we introduce DREAM‑R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM‑R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement‑learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold‑based Verification Mechanism (TBVM) that uses a ratio‑based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target‑side reasoning, and verification across multi‑step reasoning, enabling early stopping and clean fallback. Experiments on reasoning‑heavy benchmarks demonstrate up to speedup while preserving target‑model accuracy, yielding substantial efficiency gains without compromising reasoning quality.

PaperID: 2505, https://arxiv.org/pdf/2605.28655.pdf

Authors: Shanghua Gao, Ada Fang, Marinka Zitnik

Title: AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Abstract:
Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long‑running experiments. We introduce AutoScientists, a decentralized team of AI agents for long‑running computational scientific experimentation. Agents interpret a shared experimental state, self‑organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language‑model training optimization, and protein fitness prediction. On BioML‑Bench, spanning biomedical imaging, protein engineering, single‑cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits‑per‑byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single‑agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2‑Spike binding that improves over the current state‑of‑the‑art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

PaperID: 2506, https://arxiv.org/pdf/2605.28617.pdf

Authors: Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky

Title: LACUNA: Safe Agents as Recursive Program Holes

Abstract:
LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model‑written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call \textttagent[T](task) that the LLM fills with code when execution reaches it, and the code is type‑checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub‑agents, skills, parallel decomposition, and multi‑model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp‑Plus, and τ^2‑bench. On BrowseComp‑Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ^2‑bench, LACUNA solves 76.0% of 392 tasks across four domains with a capable model, on par with the baseline agent.

PaperID: 2507, https://arxiv.org/pdf/2605.28607.pdf

Authors: Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

Title: Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Abstract:
Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non‑stationary scenarios. To address this, we propose a novel multimodal multi‑agent framework that achieves automatic workflow execution through a distinct two‑phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval‑Augmented Generation (RAG) over this fixed, pre‑established graph, coupled with a closed‑loop collaborative verification protocol to dynamically self‑correct and navigate. This graph‑based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real‑world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

PaperID: 2508, https://arxiv.org/pdf/2605.28588.pdf

Authors: Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal

Title: Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

Abstract:
We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical‑level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication. This report documents our methodology, presents a threat taxonomy based on real‑world samples, and details the attack patterns we observed. As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.

PaperID: 2509, https://arxiv.org/pdf/2605.28566.pdf

Authors: Guni Sharon

Title: Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

Abstract:
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process ‑‑ auto‑regressive token prediction ‑‑ is inherently myopic and prone to cascading errors. To address this, the Tree‑of‑Thoughts (ToT) framework creates a search space over intermediate reasoning steps, allowing search models to explore, look ahead, and backtrack. However, current ToT research remains fragmented across Natural Language Processing and Automated Planning communities, often using inconsistent terminology and ad‑hoc implementations. Consequently, we synthesize the ToT landscape through a unified taxonomy based on classical heuristic search terminology. We map LLM‑based reasoning to classical search components: state representation (granularity of thoughts), successor generation (prompting operators), and heuristic evaluation (self‑assessment of progress). We analyze existing work within the context of our taxonomy and identify emerging design patterns: systematic search (Best‑First Search) for shallow, deterministic tasks and lookahead‑heavy strategies (DFS, MCTS) for deep multi‑step reasoning. We conclude by identifying open algorithmic challenges at the intersection of heuristic search and LLM reasoning, and call on the heuristic search community to engage with this emerging domain.

PaperID: 2510, https://arxiv.org/pdf/2605.28532.pdf

Authors: Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai

Title: Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Abstract:
Tool‑using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities required for successful task completion are unavailable. Detecting infeasible tasks and stopping execution early can significantly reduce unnecessary execution cost. In this work, we propose FeasiGen, an automatic pipeline for constructing infeasible agent tasks by identifying the critical tools required for successful task completion. Our approach extracts tool‑calling traces from successful executions across multiple agent systems, identifies critical tools consistently shared across diverse execution strategies, and masks these tools to automatically transform solvable tasks into infeasible ones. Human verification confirms that the infeasibility annotations for our constructed tasks achieve over 94% accuracy. We further introduce feasibility‑aware evaluation metrics for measuring whether agents can recognize infeasible tasks and stop execution appropriately. Extensive evaluations across nine models reveal substantially weak infeasibility detection ability, with false continue rate reaching up to 73.9%. We further observe that multi‑agent architectures significantly reduce erroneous execution under infeasible conditions.

PaperID: 2511, https://arxiv.org/pdf/2605.28465.pdf

Authors: Jihyeong Park, Ingeol Baek, Jeonghyun Park, Hwanhee Lee

Title: Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

Abstract:
Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single‑turn text generations, failing to capture how an agent reasons through iterative interaction. To address this, we introduce MUTATE, an interactive benchmark designed to evaluate agentic divergent thinking at two levels: path‑level, where an agent discovers multiple alternative paths to the same goal, and action‑level, where individual actions require non‑typical, mechanism‑shifting object uses. Unlike success‑only evaluations, MUTATE scores both completed paths and off‑path attempts, capturing divergent reasoning that conventional success rates discard. Our experiments with frontier LLMs reveal a structural blind spot in existing frameworks: when exposed to immediate convergence pressure, they tend to fall into immediate action fixation, failing to improve action‑level divergence. To overcome this, we propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods across both divergence levels and generalizes effectively to an external creativity environment. We also confirm its success stems from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.

PaperID: 2512, https://arxiv.org/pdf/2605.28433.pdf

Authors: Ling-Yue Ge, Lan-Zhe Guo

Title: Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

Abstract:
Role‑based LLM multi‑agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final‑answer aggregation, and parser‑compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract‑preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self‑Evolving Role Orchestration framework that evolves a typed role‑card pool through credit‑guided retrieval, a credit‑ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual‑bandit controller whose LLM‑proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real‑world reasoning benchmarks across three LLM backbones confirm the value of contract‑preserving role evolution.

PaperID: 2513, https://arxiv.org/pdf/2605.28390.pdf

Authors: Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang, Minlie Huang, Hongning Wang

Title: You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Abstract:
Test‑time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard‑coded skill evolving strategies or parametric learning that rely on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test‑time refinement of the skill evolving framework itself is necessary for continuous improvement of the agent systems in different downstream scenarios, and lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta‑evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta‑skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta‑evolving can produce a higher‑quality skill library than pure skill evolving and can derive diverse meta‑skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME‑BD45.

PaperID: 2514, https://arxiv.org/pdf/2605.28388.pdf

Authors: Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu

Title: Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Abstract:
Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of Sample Difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty‑wise and one‑sample analysis. We find that sample difficulty has a non‑monotonic effect on RLVR: easy and medium‑difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre‑existing capabilities. Beyond the obverse of response, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T‑SAE). Easy problems mainly reinforce direct‑answer and basic‑computation features while suppressing deliberative‑reasoning features; hard problems activate reasoning‑related features but become useful only when successful trajectories are sampled; medium‑difficulty problems provide a more balanced signal, strengthening both computation and multi‑step reasoning features. Motivated by these findings, we propose difficulty‑adaptive strategies for hard‑sample utilization, using backward‑reasoning reformulation and T‑SAE‑guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

PaperID: 2515, https://arxiv.org/pdf/2605.28359.pdf

Authors: Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang, Yonghong He, Zuo Bai

Title: From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Abstract:
Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end‑to‑end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock‑selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD‑Fin (Knowing‑To‑Doing Financial Benchmark), an end‑to‑end stock‑market trading benchmark that addresses both issues. KTD‑Fin uses a data‑side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision‑making. It also incorporates a Barra‑style performance attribution framework that decomposes portfolio returns into market, style, and stock‑selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024‑‑2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor‑based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage‑controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock‑selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD‑Fin as a reproducible template for leakage‑controlled and attribution‑aware evaluation of LLM trading agents.

PaperID: 2516, https://arxiv.org/pdf/2605.28321.pdf

Authors: Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

Title: Multi-Agent LLM-based Metamorphic Testing for REST APIs

Abstract:
As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification‑based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool‑supported approach that uses an LLM‑based multi‑agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given‑When‑Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario‑based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario‑based testing approaches.

PaperID: 2517, https://arxiv.org/pdf/2605.28224.pdf

Authors: Xinzhe Li, Yaguang Tao

Title: When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Abstract:
Multi‑trajectory inference for tool‑use LLM agents ‑ generating multiple reasoning attempts and selecting among them ‑ benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross‑trajectory memory methods (trajectory‑level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes ‑‑ the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content ‑‑ and evaluate four methods under three inference strategies (best‑of‑N, beam search, MCTS) on four tool‑use benchmarks spanning SQL, knowledge‑graph, and CLI environments, in a verifier‑free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best‑of‑N); within‑expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity‑starved beam search; and atomic fact extraction is accuracy‑neutral but shortens trajectories by 19‑26% on tasks with reusable environmental structure.

PaperID: 2518, https://arxiv.org/pdf/2605.28214.pdf

Authors: Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu

Title: Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

Abstract:
Latent‑based multi‑agent systems replace parts of explicit inter‑agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible‑text inspection. In this paper, we study whether latent states can carry attack‑associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack‑induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent‑only attacks can substantially degrade task performance in clean executions, especially when applied to inter‑agent KV‑cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent‑based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible‑text inspection.

PaperID: 2519, https://arxiv.org/pdf/2605.28201.pdf

Authors: Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

Title: Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Abstract:
Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool‑returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single‑interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real‑world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open‑source and closed‑source LLMs show that state‑of‑the‑art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single‑interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

PaperID: 2520, https://arxiv.org/pdf/2605.28175.pdf

Authors: Shijie Wang, Chengyi Liu, Yujuan Ding, Shanru Lin, See-Kiong Ng, Xu Xin, Wenqi Fan

Title: Mixture-of-Experts Knowledge Graph Retrieval-Augmented Generation for Multi-Agent LLM-based Recommendation

Abstract:
Large language models (LLMs) have recently been adopted for recommendations due to their ability to understand user intent and item semantics. However, LLM‑based recommender systems often rely on parametric knowledge and suffer from outdated knowledge, motivating knowledge graph retrieval‑augmented generation (KG‑RAG) to ground recommendations on structured, up‑to‑date KGs. Despite this promise, effective KG‑RAG in recommendations faces great challenges. First, users' queries vary in complexity and require KG knowledge at different granularities, whereas existing methods adopt a one‑size‑fits‑all retrieval strategy, leading to over‑retrieval for simple queries and under‑retrieval for complex ones. In addition, augmenting LLMs with KG knowledge requires translating graph‑structured data into linear text, which may introduce noise and cause structural information loss. Moreover, the selection of retrieval granularity lacks direct supervision and must be inferred from the final recommendation after alignment and downstream utilization, making query‑aware retrieval hard to learn end‑to‑end. To address these issues, we propose MixRAGRec, a cooperative multi‑agent framework for KG‑RAG recommendations. MixRAGRec integrates a Mixture‑of‑Experts Retrieval Agent that routes each query to a KG retrieval expert with different granularities, a Knowledge Preference Alignment Agent that converts structured knowledge into LLM‑friendly natural language, and a Contrastive Learning‑reinforced Recommendation Agent trained with contrastive preference feedback. Notably, we introduce Mixture‑of‑Experts Multi‑Agent Policy Optimization (MMAPO) to train three agents under a unified objective. Extensive experiments on real‑world datasets demonstrate the effectiveness of our framework.

PaperID: 2521, https://arxiv.org/pdf/2605.28158.pdf

Authors: Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

Title: OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Abstract:
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR‑oriented benchmarks often reduce evaluation to one‑shot translation from a self‑contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi‑artifact workspaces and multi‑stage task lifecycles. We introduce OR‑Space, a full‑lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task‑specific evaluators distributed across interdependent files. OR‑Space defines three task modes: Build, where agents construct solver‑ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle‑oriented tasks, OR‑Space evaluates whether agents can perform reliable optimization work beyond end‑to‑end text generation. We describe the benchmark design, evaluation protocol, and quality‑control pipeline, and position OR‑Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

PaperID: 2522, https://arxiv.org/pdf/2605.28114.pdf

Authors: Messi H. J. Lee

Title: Human-like in-group bias in instruction-tuned language model agents

Abstract:
As autonomous AI agents are deployed in persistent, interacting networks ‑‑ coordinating tasks, routing resources, and accumulating reputational histories ‑‑ the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi‑agent simulation in which instruction‑tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in‑group trust bias, action homophily, and network assortativity ‑‑ all absent when labels were hidden ‑‑ a pattern structurally consistent with salience‑dependence in human social psychology. This discrimination was invisible to standard action‑log audits: bias operated entirely through who received each action, not what actions were chosen, with action‑type distributions showing no increase in negative actions across conditions. Per‑turn in‑group versus out‑group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed‑rank, all Benjamini‑Hochberg‑corrected p < 0.001), establishing group‑contingent targeting as a robust property of instruction‑tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in‑group trust biases of +0.014 to +0.100 (d = 0.84‑4.52) ‑‑ illustrating how modest per‑interaction targeting propagates into structural inequality in persistent networks.

PaperID: 2523, https://arxiv.org/pdf/2605.28108.pdf

Authors: Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi

Title: Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

Abstract:
A long‑lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long‑lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask‑to‑Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined and its payoff deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes it measurable by fixing each user's preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, defaults fall at least 62 points below an oracle handed the relevant preference, and prompting closes little of it. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.

PaperID: 2524, https://arxiv.org/pdf/2605.28098.pdf

Authors: Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

Title: Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

Abstract:
Multi‑agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision‑making, fairness preservation through bias reduction remains challenging. This study examines how agent‑level biases shift and impact system‑wide fairness. We use prompts to expose individual agents to group‑favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero‑centered metric that decomposes bias alteration between favored‑group uplift and disfavored‑group suppression. Using multiple agent designs, benchmarks, and up‑to‑date large language models, we show that agents endowed with bias can substantially affect system‑wide fairness. Interestingly, when agents are exposed to bias uniformly, the system‑wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi‑agent systems, which warrants further analyses and empirical tests.

PaperID: 2525, https://arxiv.org/pdf/2605.28056.pdf

Authors: He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

Title: CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

Abstract:
Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine‑grained manipulation of the eye region still faces a trade‑off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond‑emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two‑stage framework that generates portrait animations from high‑level labels. In the first stage, three chain‑of‑thought Multimodal Large Language Models (MLLMs) agents compile high‑level labels into facial keypoints through temporal event planning, prototype retrieval, and composition from a real‑behavior library, and semantic‑physiological constraint enforcement. In the second stage, a DiT‑based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier‑free guidance strategy with eye‑region‑aware reweighting and KTO‑based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond‑emotion categories with two AU‑level metrics for evaluating fine‑grained eye‑region and head‑motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye‑region control than existing methods while maintaining supe‑ rior visual quality and identity consistency

PaperID: 2526, https://arxiv.org/pdf/2605.28046.pdf

Authors: Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

Title: MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Abstract:
Existing agent memory systems universally follow what we term a Memory‑as‑Tool paradigm where a single query triggers one‑shot retrieval of flat passage lists, suffering from passive invocation, reasoning‑retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory‑as‑Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross‑Dimensional Navigation Interface for multi‑step reasoning‑driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state‑of‑the‑art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory‑as‑Cognition.

PaperID: 2527, https://arxiv.org/pdf/2605.28037.pdf

Authors: Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

Title: Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

Abstract:
Prompt‑based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context‑dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM‑generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive‑style conditions, we generate 1,080 LLM‑agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM‑as‑a‑judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait‑specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality‑trait specification, social and expressive conditions induce distinct personality‑like impressions. Cross‑linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context‑dependent process involving personality specification, social role, and expressive style.

PaperID: 2528, https://arxiv.org/pdf/2605.28003.pdf

Authors: Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

Title: ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Abstract:
The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large‑scale research‑level math datasets. To this end, we introduce ResearchMath‑14k, a set of 14,056 problems curated from academic sources via a multi‑agent pipeline, making it the largest collection of research‑level mathematical problems to date. We further generate ResearchMath‑Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non‑attempts and fabricated references. Interestingly, across eight open‑weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath‑Reasoning, fine‑tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open‑problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath‑14k publicly available for future works on research‑level mathematical reasoning.

PaperID: 2529, https://arxiv.org/pdf/2605.27935.pdf

Authors: Zhenyu Cui, Xiangzhong Luo

Title: Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Abstract:
Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single‑turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi‑turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer‑wise analysis of complete user‑agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer‑skipping interventions, and effective‑depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long‑range inter‑layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction‑dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective‑depth analysis further reveals a substantial construction‑refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain‑dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

PaperID: 2530, https://arxiv.org/pdf/2605.27922.pdf

Authors: Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

Title: Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Abstract:
LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution‑layer variation difficult to study. We introduce Harness‑Bench, a diagnostic benchmark for evaluating configuration‑level harness effects in realistic agent workflows. Harness‑Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent‑use patterns and manually reviewed for realism, solvability, oracle‑checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model‑harness pairings. These results suggest that agent capability should be reported at the model‑harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution‑alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness‑Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

PaperID: 2531, https://arxiv.org/pdf/2605.27899.pdf

Authors: Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

Title: SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Abstract:
Structured skill prompts improve exploration in long‑horizon agentic reinforcement learning (RL). Skill‑augmented RL methods retain external skills at inference, while skill‑internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill‑helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill‑dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. \textscSkillC samples paired skill‑injected and skill‑free rollouts for tasks from active skill types within the same policy update, and injects their task‑level contrast into optimization via a dual‑stream advantage estimator that preserves global ranking while applying a one‑sided correction toward skill‑free success. A smoothed validation‑level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active‑set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill‑internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill‑augmented RL methods.

PaperID: 2532, https://arxiv.org/pdf/2605.27881.pdf

Authors: Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li

Title: Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

Abstract:
Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi‑step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under‑explored dimensions of search agent training. First, we identify a critical data‑coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome‑based and process‑based reward methods across three base models, finding that the simplest outcome‑based approach achieves competitive or superior performance in most settings, and that process‑level credit assignment can over‑correct agent behavior. Third, we analyze training data diversity, off‑policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at https://github.com/YiboZhao624/SearchAgentReview.

PaperID: 2533, https://arxiv.org/pdf/2605.27853.pdf

Authors: Thao Nguyen, Heng Ji

Title: MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

Abstract:
We present MolLingo, a multi‑agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM‑based approaches either operate as standalone generative models without access to external tools or lack the multi‑agent coordination and shared memory needed for iterative, evidence‑driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain‑specific tools. To enable effective molecular reasoning, we introduce BRICS‑based Fragment Enumeration (BFE), a synthesis‑aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block‑based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block‑level reasoning and editing that is difficult with raw SMILES alone. As a case study in early‑stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue‑level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT‑5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state‑of‑the‑art results on TOMG‑Bench, surpassing both frontier LLMs and the RL‑based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo‑7450.

PaperID: 2534, https://arxiv.org/pdf/2605.27850.pdf

Authors: Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang

Title: TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Abstract:
Effective multi‑agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP‑MCP (Topology‑Coupled Prompting for Multi‑Agent Collaborative Problem‑Solving), a co‑evolution framework that searches agent prompts and communication topologies as a unified genome. TCP‑MCP uses an initialization‑time landscape probe to calibrate early search behavior, and then relies on Pareto‑front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek‑V3.2 backbone across all methods, TCP‑MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU‑Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph‑generation baselines and achieves competitive accuracy relative to debate‑style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost‑aware and task‑adaptive multi‑agent system design in controlled evaluations.

PaperID: 2535, https://arxiv.org/pdf/2605.27820.pdf

Authors: Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu

Title: EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Abstract:
As AI agents increasingly operate in open, real‑world environments, they require a deep synergy of multimodal perception, tool invocation with multi‑hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi‑capability tasks, simulating natural and task‑constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool‑using agents. EgoBench comprises 1,045 egocentric‑video‑grounded tasks covering four daily scenarios, along with a user‑agent‑tool interactive environment for evaluation. We implement a three‑stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool‑augmented multi‑hop reasoning. We additionally develop a multi‑agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high‑fidelity, task‑aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process‑based and result‑based equivalence. Benchmarking eight SOTA video‑MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best‑performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi‑dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

PaperID: 2536, https://arxiv.org/pdf/2605.27787.pdf

Authors: Seunghyuk Cho, Sunghyun Choi, Jaeseung Heo, Youngbin Choi, Saemi Moon, MoonJeong Park, Dongwoo Kim

Title: Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems

Abstract:
Multi‑agent systems (MAS) have substantially advanced autonomous software engineering (SWE), but their growing inference energy demands raise sustainability concerns. In this paper, we demonstrate that this cost is concentrated in an overlooked source: redundant output tokens generated across agents. Two empirical findings ground this claim. First, our per‑token energy attribution for MAS reveals a sharp asymmetry: an output token consumes 30 to 1,000 times more energy than an input or cached token. Second, MAS inflate per‑episode output because agents repeatedly re‑explore overlapping repository regions. To address this inefficiency, we propose Librarian, a persistent search sub‑agent that tracks repository‑search history and suppresses redundant exploration actions across agents. By returning short references to file regions instead of full file excerpts, Librarian further reduces output‑token volume. On SWE‑Bench Verified, Librarian reduces per‑episode GPU energy consumption of existing multi‑agent SWE systems by up to 25% while preserving task performance.

PaperID: 2537, https://arxiv.org/pdf/2605.27785.pdf

Authors: Kenny Daniel

Title: A Query Engine for the Agents

Abstract:
The fastest‑growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in‑browser agents) that run client‑side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first‑order: a JS‑native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per‑turn agent sandbox, and a way to interleave analytic operators with model‑based interpretation of text. We present Hyperparam, three open‑source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per‑cell, async‑native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM‑shaped async UDFs over 300x faster than DuckDB‑WASM on filter‑bounded queries (and 192x on sort‑bounded queries) and completes a ten‑task agent analyst suite at two‑thirds lower cost. We argue that data engineering as a discipline needs to update for the AI‑native client applications now in production and the agents that work alongside their users.

PaperID: 2538, https://arxiv.org/pdf/2605.27784.pdf

Authors: Lu Yan, Xuan Chen, Xiangyu Zhang

Title: Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

Abstract:
LLM agents are governed by long‑lived natural‑language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra‑policy rule‑conflict diagnosis: finding rule pairs inside a single prompt policy that can co‑govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra‑policy Rule Evaluation pipeline. WIRE extracts source‑grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same‑surface hard‑collision candidates, realizes those candidates as concrete co‑governance witnesses, and judges model outputs against the original source‑rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within‑policy clause‑pair comparisons, retains 170 encoded hard‑collision candidate source‑rule pairs, and realizes them as 1,402 concrete witnesses. In policy‑only evaluation, these witnesses yield 13,335 post‑ generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE‑selected candidates, not deployment‑frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool‑action resolution patterns.

PaperID: 2539, https://arxiv.org/pdf/2605.27766.pdf

Authors: Aman Priyanshu, Supriti Vijay, Esha Pahwa

Title: Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Abstract:
LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook‑style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single turn to multi turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single turn evaluations would never surface.

PaperID: 2540, https://arxiv.org/pdf/2605.27762.pdf

Authors: Yuchen Guo, Junli Gong, Weicheng Wang, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

Title: PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

Abstract:
We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference‑time retrieval into parameter‑resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open‑ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture‑of‑Experts LoRA architecture with per‑category physically isolated adapters, enabling parameter‑level continual learning without catastrophic forgetting. We treat failure as a first‑class training signal: failure‑‑correction trajectory pairs are internalized through a joint behavioral‑cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization‑worthiness score for deciding which experience should be internalized, and a scale‑free self‑triggered consolidation mechanism for deciding when to internalize without task‑specific hand‑tuned thresholds, making the agent self‑evolving as the trigger transfers across task distributions without re‑tuning. Experiments in Minecraft show that PEAM improves long‑horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric‑versus‑retrieval efficiency over retrieval‑based embodied agents and parametric memory variants.

PaperID: 2541, https://arxiv.org/pdf/2605.27760.pdf

Authors: Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen

Title: SkillGrad: Optimizing Agent Skills Like Gradient Descent

Abstract:
Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self‑generated, these skills are often unreliable, incomplete, or outdated. Existing skill‑evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient‑descent‑inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory‑level loss evidence, automatic diagnoses then provide text‑based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM‑based patcher executes the parameter update by applying layer‑aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training‑based skill evolution baselines across two backbone LLMs, improving over the strongest training‑based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

PaperID: 2542, https://arxiv.org/pdf/2605.27744.pdf

Authors: Rui Zhang, Chaeeun Kim, Liting Hu

Title: A Policy-Driven Runtime Layer for Agentic LLM Serving

Abstract:
Multi‑agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine‑level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross‑cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool‑result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one‑off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent‑aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving‑cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per‑workload agent transition matrix online and uses it for survival‑based eviction and between‑step prefetch. Preliminary results on five real multi‑agent workloads show +13 to +37 pp cache hit‑rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.

PaperID: 2543, https://arxiv.org/pdf/2605.27706.pdf

Authors: Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

Title: Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Abstract:
We introduce CAROL (Chain‑based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test‑time hallucination reduction in large language models. Rather than relying on token‑level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string‑submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept‑reject process with provable convergence and near‑optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi‑agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood‑based and retrieval‑augmented baselines, while maintaining competitive computational efficiency.

PaperID: 2544, https://arxiv.org/pdf/2605.27690.pdf

Authors: Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

Title: TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Abstract:
LLM agents increasingly operate through multi‑turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post‑hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation‑based proactive auditor that learns prefix‑level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step‑level risk annotation, TRACES is trained with weak trajectory‑level supervision while still producing dense prefix‑level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full‑trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long‑horizon agent safety.

PaperID: 2545, https://arxiv.org/pdf/2605.27685.pdf

Authors: Shuyang Li, Ruimin Ke

Title: Decoupled Intelligence: A Multi-Agent LLM Framework for Controllable Traffic Scenario Generation in SUMO

Abstract:
The integration of Large Language Models (LLMs) with microscopic traffic simulation offers a promising path toward autonomous urban planning and intelligent transportation analysis. However, existing monolithic agent architectures often struggle with the complexity of end‑to‑end simulation workflows, leading to reasoning failures, parameter inconsistency, and a lack of systematic state management. This paper proposes a novel multi‑agent collaborative framework designed to automate the entire lifecycle of traffic simulation in SUMO (Simulation of Urban Mobility). Our approach decouples the simulation pipeline into specialized roles, including Planner, Builder, Demand, Runner, and Analyst, coordinated by a high‑level reasoning engine. We introduce a state‑persistent Orchestrator leveraging the Model Context Protocol (MCP) to ensure seamless data handover and environmental consistency across distributed agent actions. This architecture enables a robust closed‑loop refinement process, where simulation outcomes are iteratively analyzed and optimized to satisfy user‑defined Key Performance Indicators (KPIs). Experimental results through role ablation studies demonstrate that the proposed multi‑agent framework significantly enhances task success rates and parameter accuracy compared to single‑agent baselines. Furthermore, case studies on real‑world network extraction and traffic optimization highlight the system's capability to bridge the gap between high‑level natural language intent and low‑level simulation execution.

PaperID: 2546, https://arxiv.org/pdf/2605.27652.pdf

Authors: Dominik Schweisgut, Anne Benoit, Yves Robert, Henning Meyerhenke

Title: Carbon-Aware Mapping and Scheduling for Deadline-Constrained Workflows

Abstract:
As datacenters continue to grow in scale, their energy consumption and resulting carbon footprint have become pressing concerns. With the increasing share of renewable energy in a datacenter's mixed energy supply, shifting task execution to periods of high green‑power availability is a promising strategy to reduce carbon emissions. However, in heterogeneous computing environments, the power consumption of compute nodes in a datacenter can also vary. In practice, workloads submitted to datacenters are often not isolated tasks, but entire workflows consisting of interdependent tasks with precedence constraints. A further challenge arises from the fact that carbon emission reductions must typically be achieved under strict workflow deadlines. In this work, we show that the problem posed by these challenges for the scheduler is NP‑hard and admits no constant‑factor approximation even for the uni‑processor case. Motivated by this hardness, we present a novel algorithm CWM that combines carbon‑aware mapping and scheduling to construct feasible solutions. Our approach integrates dynamic programming with efficient heuristics to exploit renewable energy availability and infrastructure heterogeneity. To assess the quality of the new algorithm, we evaluate it against the state‑of‑the‑art approach CaWoSched and show that CWM achieves significant reductions in terms of carbon emissions in experiments. In particular, we are able to achieve a median carbon cost reduction of 42% over the best version of CaWoSched when the deadline is two times the makespan of a carbon‑agnostic baseline. Note that CaWoSched itself already reduces the carbon‑agnostic baseline by 36%.

PaperID: 2547, https://arxiv.org/pdf/2605.27649.pdf

Authors: Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He

Title: Disentangling Language Roles in Multilingual LLM Task Execution

Abstract:
Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction‑following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM‑Bench, a controlled benchmark for language‑conditioned task execution in which each instance is defined by a triplet $(L_\textinstr, L_\textcontent, L_\textresp)$. Across English, Spanish, and Chinese, MTM‑Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final‑state extraction, and language purity with update realization. We evaluate 20 frontier and open‑weight LLMs using decomposed metrics for semantic correctness, target‑language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response‑language role is the dominant axis of variation, and a single response‑slot mismatch accounts for most degradation. The response‑only and full‑mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model‑level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.

PaperID: 2548, https://arxiv.org/pdf/2605.27622.pdf

Authors: Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

Title: Reasoning and Planning with Dynamically Changing Norms

Abstract:
To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm‑guided planning has been less explored, only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human‑AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We theoretically demonstrate our approach with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

PaperID: 2549, https://arxiv.org/pdf/2605.27621.pdf

Authors: Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee

Title: Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

Abstract:
As multi‑agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave‑One‑Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low‑contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost‑effective MAS attribution and intervention.

PaperID: 2550, https://arxiv.org/pdf/2605.27599.pdf

Authors: Deepak Panigrahy, Aakash Tyagi

Title: The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

Abstract:
Agentic AI workloads ‑ where a single user goal triggers multi‑step orchestration, tool calls, retries, and failure recovery ‑ are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10‑based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi‑step reasoning tasks. Separately, Rajat et al. show that CPU‑side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. We report a systematic energy‑observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power‑rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on‑device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per‑rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are "no plans to expose CPU rail information." On‑device per‑process energy attribution ‑ as performed on x86 via RAPL ‑ is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy‑attributed AI, propose an interim calibration bridge using external DC metering combined with GPU subtraction, and identify a standards‑track path via SCMI powercap. Our findings motivate the low‑carbon computing community to demand energy observability as a first‑class hardware requirement.

PaperID: 2551, https://arxiv.org/pdf/2605.27596.pdf

Authors: Saptarshi Sengupta, Suhang Wang

Title: Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

Abstract:
Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi‑step reasoning problems as early mistakes cascade to the final response. To address this, existing works think‑first followed by iterative retrieval to reduce hallucination. We argue that the think‑first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and, (ii) hallucinations can actually be beneficial for honing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first‑reason later. We propose a cognitively‑inspired framework where the model is first allowed to quickly answer the question (System‑I (zero‑shot)) and then resorts to deeper thinking (System‑II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System‑I and System‑II style thinking, we show that our method can outperform prior work that takes the traditional think‑first route on various multi‑step question‑answering benchmarks.

PaperID: 2552, https://arxiv.org/pdf/2605.27593.pdf

Authors: Xijie Zeng, Frank Rudzicz

Title: Voluntary Collusion with Secret Tools in Competing LLM Agents

Abstract:
Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety‑aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi‑agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed‑motive resource‑management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM‑based multi‑agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

PaperID: 2553, https://arxiv.org/pdf/2605.27586.pdf

Authors: Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang

Title: You Only Align Once: Propagating Cooperative Behaviors in Multi-Agent Systems through Seed Agents

Abstract:
Ensuring agent behaviors in distributed open multi‑agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrained agents purely through natural language interaction, a phenomenon we term Alignment Propagation. We study this in the Red‑Black Game, a team‑based iterated Prisoner's Dilemma in which teammates deliberate and vote to determine their team's collective action. By distilling the cooperative reasoning and persuasive dialogues of a teacher model into a Qwen‑3‑14B, we obtain a seed agent that, when placed among four untrained teammates, doubles the cooperation rate from 24.8% to 62.2%, outperforming the teacher model and a vanilla Gemini‑3.1‑Pro. Remarkably, a seed trained exclusively on the RedBlack Game transfers zero‑shot to Sugarscape, a spatially grounded survival simulation with pairwise trading, achieving a 91.5% trade success rate versus a 21.6% baseline. Our results reframe multi‑agent alignment from an exhaustive per‑agent training problem to a scalable social capability that can be engineered through strategic seed placement.

PaperID: 2554, https://arxiv.org/pdf/2605.27571.pdf

Authors: Gaetano Rossiello, Dharmashankar Subramanian

Title: Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Abstract:
Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real‑time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi‑agent architecture for autonomous insight discovery over real‑time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event‑driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract‑driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query‑driven analytics to proactive, discovery‑driven systems.

PaperID: 2555, https://arxiv.org/pdf/2605.27566.pdf

Authors: Shijie Cao, Yuan Yuan, Jing Liu

Title: DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Abstract:
Progress in neural combinatorial optimization for Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance‑generation process. Instead of relying on parameter sampling, our approach utilizes Sequential Event‑Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot‑based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead‑based policies. Leveraging this calibrated environment, we identify key limitations of LLM‑based scheduling agents. Specifically, in step‑wise online decision‑making for dynamic scheduling, we identify an ``Observability Paradox'': providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool‑augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines‑behaving more like robust heuristic approximators than superior optimizers.

PaperID: 2556, https://arxiv.org/pdf/2605.27559.pdf

Authors: Prashanti Nilayam, Kiran Ramanna, Prashil Tumbade

Title: Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

Abstract:
Multi‑stage LLM pipelines that perform multi‑agent debate, intrinsic self‑correction, or retrieval‑augmented verification exhibit puzzling aggregate behaviors: accuracy plateaus and reversals across rounds, non‑replication of debate gains on contemporary frontier models, intrinsic self‑correction degradation, and qualitative cross‑provider divergence in debate dynamics. Downstream agent response can be operationalized as two coupled decisions: detection (whether to treat upstream content as authoritative) and conditional generation (what to produce if not). This decomposition yields four observable response regimes, of which detection‑without‑correction is the load‑bearing failure mode. Across a nine‑cell empirical grid spanning four model families, four benchmarks (GSM8K, MATH‑500, GPQA‑Diamond, AIME), and two methods (multi‑agent debate, intrinsic self‑correction), we find that the conditional miscorrection rate is consistently dominant (53‑94% across cohorts) while detection rate varies contextually by more than an order of magnitude. The framework unifies the four phenomena above as signatures of a common mechanism and characterizes detection threshold as a stable model/protocol‑level regularity that persists across methods at matched benchmark difficulty.

PaperID: 2557, https://arxiv.org/pdf/2605.27532.pdf

Authors: Mahmoud Abouelyazid, Eman Hammad

Title: SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

Abstract:
Emergent communication enables partially observant Autonomous Mobile Robots (AMRs) to coordinate effectively in decentralized multi‑agent reinforcement learning (MARL) settings. However, existing approaches often struggle with unstable communication protocols, ungrounded message semantics, and interference between communication learning and policy optimization, leading to degraded coordination over time. We propose SCALE‑COMM (Shared, Contrastively‑Aligned Latent Embeddings for COMMunication), a self‑supervised framework for learning compact, stable, and policy‑relevant communication representations. SCALE‑COMM decouples communication learning from policy optimization by training low‑dimensional latent messages that capture task‑relevant planning and traffic information, while enforcing consistency across agents and time. Across standard MARL benchmarks and a realistic warehouse coordination task, SCALE‑COMM consistently outperforms existing communication frameworks in both representation quality and task performance. The learned communication space yields improved stability, sample efficiency, and throughput under policy fine‑tuning, demonstrating the effectiveness of representation‑driven communication for scalable multi‑agent coordination.

PaperID: 2558, https://arxiv.org/pdf/2605.27492.pdf

Authors: Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu, Xianwei Zhang

Title: Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Abstract:
LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short‑horizon benchmarks that fail to capture the dynamic complexity of real‑world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production‑grounded infrastructure for assessing long‑horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler‑construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility‑oriented multi‑dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime‑observable, and production‑grounded assessment.

PaperID: 2559, https://arxiv.org/pdf/2605.27489.pdf

Authors: Md Hafizur Rahman, Zafaryab Haider, Tanzim Mahfuz, Prabuddha Chakraborty

Title: HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Abstract:
Multi‑agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves interpretability, but creates a propagation risk: a bounded perturbation to one component can be reused by other agents and amplified into system‑level harm. We introduce HARP (Harm Amplification through Role Perturbation), a trace‑first methodology for studying local‑to‑global harm amplification in multi‑agent LLM systems. HARP compares paired clean and perturbed executions and records specialist outputs, tool calls, memory reads/writes, guard events, oracle logs, latency, token cost, and decisions. We define local harm as deviation from targeted agents or corrupted channels, global harm as deviation over the full trace, and harm amplification as (H_global/H_local). This complements attack success rate with a measure of how strongly orchestration spreads harm beyond the attack point. We instantiate HARP in a finance‑oriented seven‑agent system with a deterministic decision gate and configurable attack harness for specialist compromise, collusion, shared‑context corruption, and temporal or memory‑persistent attacks. Across five defenses, prompt‑only defenses preserve benign utility but leave high success and stealth; pre‑tool and step‑level guards reduce some failures with utility or latency costs; and IntegrityGuard, a trace‑consistency defense, achieves the lowest attack success and global harm but introduces utility/cost trade‑offs. Results show that single‑specialist compromise produces the strongest amplification, shared‑context corruption yields the highest attack success, and temporal persistence produces the largest malicious impact. HARP argues that secure multi‑agent evaluation must measure not only bypass, but propagation.

PaperID: 2560, https://arxiv.org/pdf/2605.27433.pdf

Authors: Deyu Zhou, Yuwei Guo, Xudong Lu, Linhao Zhang, Wei Guo, Lizhen Cui

Title: Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

Abstract:
With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable development of the data service market are also growing. These factors include the widespread dissemination of information enhancing subjective consciousness, the continuous improvement in intelligence, and the complexification of structural relationships. To achieve effective governance and regulation of the data service market, it is crucial to conduct simulation experiments before making regulatory decisions. However, current research and analysis of the data service market primarily focus on data‑level performance, proving inadequate when it comes to measurement and analysis of multiple heterogeneous entities and the integration of various social elements within the data service market. Based on this, this paper innovatively proposes a data service market measurement and network analysis method based on heterogeneous multi‑agent modeling. By introducing the service ecosystem theory, we clarify the participants and external factors of the data service market and conduct utility measurements for three‑level entities based on value creation. Furthermore, an analytical methodology is devised to precisely assess the influence of heterogeneous networks on utility. Finally, the paper verifies the effectiveness of the proposed method through the analysis of experimental results.

PaperID: 2561, https://arxiv.org/pdf/2605.27419.pdf

Authors: Quan Zheng, Yan Gao, Shaobin He, Haoxiang Guan, Yuanhe Tian, Jie Feng, Ming Wang, Shuxin Zheng, Zhen Liu

Title: APS: Bias-Controlled Adaptive Prototype Simulation for Population-Scale LLM Agents

Abstract:
LLM‑agent simulation offers a flexible computational tool for studying population response trajectories that depend on scenario events, memory, demographics, and evolving social context. However, full multi‑round simulation scales linearly with both population size and horizon, requiring every agent to query the LLM at every round. We propose Adaptive Prototype Simulation (APS), a framework that reframes scalable LLM‑based simulation as a recurrent oracle‑allocation problem. APS retains the designated LLM as the online transition oracle while querying adaptive core prototypes, selected singleton‑tail agents, and shadow‑audit agents. Prototype responses induce local response surfaces for nearby agents, reducing online LLM calls without replacing the underlying transition model. To control approximation bias, shadow‑audit residual correction estimates propagation residuals for aggregate correction and future budget allocation, while tail‑protected singleton routing directly queries selected isolated, heterogeneous, or high‑curvature regions that are vulnerable to smoothing. Theoretically, we treat APS as an estimator for full‑scale high‑precision individual social simulation and decompose its errors into prototype‑coverage error, shadow‑audit residual‑correction error, local‑propagation bias, and temporal context mismatch. Under the reported protocols, APS gives lower reference‑aligned distributional discrepancy than scale‑oriented and same‑budget baselines while reducing online LLM calls, with ablations and compact robustness checks diagnosing the main bias‑control mechanisms. In a 10M‑agent, multi‑round public‑opinion simulation, APS achieves a 381.1‑fold reduction over full simulation, with reference‑aligned final‑round JSD of 0.094 against the corresponding full‑LLM reference.

PaperID: 2562, https://arxiv.org/pdf/2605.27418.pdf

Authors: Wenzhe Song, Hao Zhang

Title: Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

Abstract:
The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model‑predictive control into a data‑driven, end‑to‑end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi‑agent training scheme, DMPS virtually eliminates collisions to less than 5.6% in high‑density, mixed vehicle‑robot traffic simulations, demonstrating state‑of‑the‑art safety without compromising energy and traffic efficiency.

PaperID: 2563, https://arxiv.org/pdf/2605.27396.pdf

Authors: Rohith Nama

Title: Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

Abstract:
Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often without step‑by‑step human approval. Existing AI literacy frameworks were built for a world in which humans evaluate AI outputs and decide whether to act; they have no vocabulary for the user who has delegated decision‑making authority to an agent whose actions may not be observable, reversible, or controllable. This paper names the resulting problem agentic literacy debt: the accumulating societal deficit that grows when agentic AI systems are deployed at scale without corresponding literacy infrastructure. The debt compounds through three reinforcing channels (normalization of opaque delegation, multi‑agent ecosystem complexity, and institutional path dependence), and it is incurred by the organizations that deploy agents but paid by the users, patients, and citizens on whose behalf the agents act. Evidence from healthcare, financial fraud, and global equity contexts suggests the gap is already consequential. The problem is structural, not a temporary lag that curriculum reform will close. It demands a reframing of AI literacy as a governance capability, not an evaluative one.

PaperID: 2564, https://arxiv.org/pdf/2605.27393.pdf

Authors: Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

Title: StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Abstract:
Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi‑LLM agent framework for controllable MI dialogue generation, where questionnaire‑based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI‑coded utterances guided by MI codes selected by the interaction agent, while an interaction agent dynamically coordinates exchanges to control MI strategies during a multi‑turn conversation. We propose a two‑level evaluation protocol: lexical metrics and MI‑specific measures of macro‑level counseling strategies, alongside LLM‑as‑judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire‑story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open‑ and closed‑source LLMs. Our results show that situational grounding and macro‑level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi‑agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

PaperID: 2565, https://arxiv.org/pdf/2605.27366.pdf

Authors: Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

Title: MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Abstract:
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long‑term improvement. We propose MUSE‑Autoskill Agent (Memory‑Utilizing Skill Evolution), a skill‑centric agent framework that lets agents continuously improve their task‑solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill‑level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle‑managed skills can improve task success, efficiency, reuse, and cross‑agent transfer, highlighting the importance of treating skills as long‑lived, experience‑aware, and testable assets.

PaperID: 2566, https://arxiv.org/pdf/2605.27333.pdf

Authors: Haoxuan Jia, Yang Liu, Bin Chong, Yingguang Yang, Yancheng Chen, Jiayu Liang, Qian Li, Hanning Lu, Kefu Xu, Hao Zheng, Chongyang Zhang, Hao Peng, Philip S. Yu

Title: FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents

Abstract:
Finance LLM agents must simultaneously block prompt‑induced unauthorized actions and approve legitimate multi‑step business workflows. However, boundary filters often miss irreversible mid‑trajectory tool calls, while post‑hoc LLM judges perform auditing only after termination ‑‑ too late for intervention and at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end‑to‑end with three components: a Query Monitor that fuses single‑turn intent with cross‑turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per‑step risk and adaptively routes verification between a lightweight and an advanced‑tier LLM judge. Fired risk factors are re‑injected into the agent input as ex‑ante evidence, enabling the agent to refuse, re‑plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval (41.1% \to 39.3%), and uses 4.7× fewer advanced‑judge calls than an always‑advanced ablation.

PaperID: 2567, https://arxiv.org/pdf/2605.27209.pdf

Authors: Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

Title: Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Abstract:
Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real‑world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real‑world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well‑controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real‑world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision‑making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real‑world deployment.

PaperID: 2568, https://arxiv.org/pdf/2605.27141.pdf

Authors: Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

Title: VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Abstract:
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real‑world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long‑term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open‑source LLMs. Results show that real‑world personalization remains highly challenging even for state‑of‑the‑art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real‑world personalized decision‑making, providing insights for future model improvements.

PaperID: 2569, https://arxiv.org/pdf/2605.27082.pdf

Authors: Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

Title: Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

Abstract:
Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data‑driven patterns can be dataset‑specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence‑supported, scenario‑grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi‑level multi‑agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi‑objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context‑specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well‑supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target‑response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario‑specific evidence, producing traceable, inspectable hypotheses for follow‑up validation.

PaperID: 2570, https://arxiv.org/pdf/2605.27076.pdf

Authors: Michael Ledford, William Regli

Title: Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach

Abstract:
In many multi‑agent applications, tasks yield rewards only when executed by a coalition meeting an unknown size threshold; otherwise, feedback is fully censored. This censorship creates an identifiability problem: agents cannot distinguish stochastic failure from insufficient coordination. We formalize this setting as the Threshold‑Activated Cooperative Multi‑Armed Bandit (TAC‑MAB) and analyze it under both centralized and decentralized coordination. We show that a centralized algorithm (C‑TAC) achieves cumulative regret O(log T), decomposed into a structural‑search term that captures the cost of resolving feasibility under censored feedback and a statistical‑monitoring term for value estimation. We then introduce D‑TAC, a decentralized event‑triggered protocol in which agents synchronize only when their structural beliefs change. Empirically, D‑TAC achieves a 23x reduction in communication relative to the centralized baseline while preserving feasibility alignment under conservative belief fusion. These results characterize the coordination cost of learning under censored feedback and show that near‑centralized communication efficiency is achievable without continuous synchronization.

PaperID: 2571, https://arxiv.org/pdf/2605.27071.pdf

Authors: Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu, Xi He, Zheng Zeng, Liqing Li

Title: Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

Abstract:
Key knowledge for steel‑industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control‑technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low‑frequency industrial questions. Here we developed Chat‑ISV, a knowledge graph (KG) enhanced multi‑agent Q&A system that parses a curated steel‑industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt‑constrained extraction, chunk‑centered topology optimization, multi‑agent routing, source‑backtracking retrieval, local literature retrieval, open‑domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat‑ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1‑score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental‑engineering literature into traceable, queryable, and decision‑support‑oriented knowledge, Chat‑ISV establishes a scalable environmental‑informatics paradigm for reliable LLM deployment and intelligent pollution‑control decision support in specialized industrial domains.

PaperID: 2572, https://arxiv.org/pdf/2605.26876.pdf

Authors: Yuntao Wang, Haojia Yang, Han Liu, Jianle Ba, Zhou Su

Title: Secure UAV Swarms in Low-Altitude Wireless Networks: Challenges and Solutions

Abstract:
Unmanned aerial vehicle (UAV) swarms are increasingly deployed in vast low‑altitude applications, owing to their capabilities in distributed sensing, flexible communication, and autonomous coordination. Nevertheless, the open and highly dynamic operating environment of UAV swarms introduces serious security risks, including GPS spoofing, insider threats, and multi‑hop intrusion. These threats are aggravated by limited on‑board resources, frequently changing network topology, and the presence of intelligent adversaries. To tackle these issues, this paper proposes a cloud‑edge‑end collaborative defense framework for UAV swarms. Based on this framework, three complementary mechanisms are developed. First, a cooperative perception scheme is designed to resist GPS spoofing via interactive attack‑defense game modeling. Second, a behavior‑driven authentication method with trust evaluation is developed to mitigate insider threats. Third, a multi‑agent attack forensics framework is devised to intelligently trace the propagation paths of multi‑hop attacks in UAV networks. Experimental results validate the effectiveness of the proposed approaches. Finally, several open research directions are outlined.

PaperID: 2573, https://arxiv.org/pdf/2605.26856.pdf

Authors: G. Nagarjuna, Durgaprasad Karnam

Title: The Sensation Modulating Network:Haltability as the architectural ground for object-directed phenomenology

Abstract:
Cognitive science remains split between cognitivism ‑ which accounts for recursion and language but cannot ground formal symbols in meaning ‑ and 4E approaches ‑ which ground cognition in the body but rarely specify the body's architecture in enough detail to support generativity. We argue the impasse stems from an incomplete account of the embodied agent's architecture, and propose one: the Sensation Modulating Network (SMN), the cognitive agent conceived as the whole body, organized at every anatomical scale by opponent dynamics, built from Sensation Modulators that sense and act through one substrate, paired into Coordinated Action Zones routed by a body‑wide broadcast network. Three commitments give the SMN its purchase. Haltability ‑ the recruitment of antagonistic affordance into co‑activated equilibrium ‑ provides the architectural locus that object‑directed phenomenology, in Husserl's sense, requires: opponency enables co‑activation, co‑activation enables halt, halt enables attention, attention enables intentional directedness, with no module added on top. The dual‑signal property of self‑modulatable action patterns (SMAPs) makes the self/world distinction a structural feature of the wiring rather than a category the agent applies. And a four‑level action‑pattern hierarchy ‑ Basal, Haltable, Negotiable, Transactional ‑ gives a single trajectory from autonomic regularity to public conventionalization, locating the conditions for grammar‑grounded generativity as architectural transitions. The SMN reconciles the cognitivism‑4E debate: recursion lives in the modifiable dynamics of Negotiable Action Patterns, embodiment in the opponent substrate that supports them. A tentative formalism and eight predicted registers (seven testable, one hypothetical), with reference simulations, are given in an appendix.

PaperID: 2574, https://arxiv.org/pdf/2605.26835.pdf

Authors: Yunbo Long, Haolang Zhao, Ge Zheng, Alexandra Brintrup

Title: Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

Abstract:
LLM‑based multi‑agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one‑shot queries: they are structural inference problems requiring multi‑hop reasoning across complex, fragmented web resources. Questions such as ``Which Tesla components use lithium from Australian mines?'' have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty‑aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose Helicase, an autonomous multi‑agent LLM system for uncertainty‑guided supply chain knowledge graph construction. Helicase decomposes high‑level supply‑chain queries into executable investigation plans, coordinates specialized web‑search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query‑specific supply chain knowledge graphs with per‑fact uncertainty annotations. Its three‑layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single‑hop to multi‑hop inference under both high and low data visibility.

PaperID: 2575, https://arxiv.org/pdf/2605.26739.pdf

Authors: Giorgio Cignarale

Title: From Actions to Obligations: A Deontic Action Model Logic

Abstract:
We introduce the Deontic Action Model Logic (DAML), a dynamic modal framework for reasoning about obligations over actions in multi‑agent systems. DAML extends the epistemic Action Model Logic by incorporating deontic evaluation mechanisms that assess agents' actions in terms of both the desirability and the likelihood of their outcomes. Obligations arise for those actions that maximize expected deontic value among an agent's available alternatives at a given decision point, yielding a formal account for reasoning about conditional and context‑sensitive obligations in settings involving strategic interaction and incomplete information. DAML supports principled action selection in norm‑governed multi‑agent systems, and is the first such framework to derive these obligations using the action model logic machinery. We provide an axiomatization of the logic and prove soundness and completeness with respect to its semantics. Finally, we demonstrate the expressive power of our framework through applications to the Miners' Puzzle and other multi‑agent deontic scenarios.

PaperID: 2576, https://arxiv.org/pdf/2605.26731.pdf

Authors: Yong-eun Cho

Title: It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Abstract:
A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher‑capability models need proportionally less structural guidance ‑‑ together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432‑run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT‑24, a 24‑task synthetic benchmark with git‑based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29‑38 percentage points ‑‑ a harness‑complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5‑122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong‑open‑tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model‑specific observations; harness sensitivity appears non‑monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six‑label failure taxonomy showing that format_violation dominates capable‑model failures while wrong_file dominates low‑capability failures, and we derive practical tier‑aware harness selection guidelines.

PaperID: 2577, https://arxiv.org/pdf/2605.26720.pdf

Authors: Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

Title: Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Abstract:
Large language models (LLMs) have shown strong empirical gains as self‑evolving agents for CUDA kernel generation, driven by feedback‑conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end‑to‑end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory‑dependent drift. We introduce \textttCUDAnalyst, a unified analysis layer for controlled, generation‑level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \textttCUDAnalyst enables stable generation‑level evaluation and principled coalitional‑style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi‑feedback interactions, and that high‑level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback‑to‑plan structure is robust within the controlled axes studied.

PaperID: 2578, https://arxiv.org/pdf/2605.26667.pdf

Authors: Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao

Title: MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Abstract:
Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long‑horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question‑answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations ‑‑ summarization, storage, and retrieval ‑‑ and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state‑of‑the‑art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.

PaperID: 2579, https://arxiv.org/pdf/2605.26646.pdf

Authors: Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

Title: UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post‑training frameworks mainly target single‑policy optimization and lack abstractions for user‑defined multi‑agent workflows, structured interaction, role‑specific credit assignment, and configurable parameter sharing. We present UnityMAS‑O, a general RL optimization framework for LLM‑based multi‑agent systems. UnityMAS‑O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first‑class objects: logical agent roles, graph trajectories, user‑defined rewards, and agent‑‑model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS‑O extends verl with a Ray‑based star‑topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model‑local worker groups handle rollout, buffering, advantage computation, and distributed PPO‑style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS‑O on retrieval‑augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held‑out code tasks, multi‑agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all‑passed metrics. These results show that UnityMAS‑O can serve as a reusable substrate for converting diverse LLM‑based multi‑agent workflows into trainable multi‑agent RL systems.

PaperID: 2580, https://arxiv.org/pdf/2605.26563.pdf

Authors: Minxing Wang, Xiaofei Xie, Yintong Huo

Title: TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

Abstract:
Agentic systems have been widely studied to automate software engineering jobs such as bug fixing. As these systems increasingly tackle complex tasks, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis approaches leverage task execution trajectories, yet their effectiveness degrades substantially as trajectory length and complexity increase. For repository‑level coding tasks specifically, trajectories are laden with noise, such as redundant program structure and verbose code context. Moreover, these trajectories are very long, while long‑context reasoning remains a known weakness of LLMs. To address these two challenges, we propose TrajAudit, the first failure diagnosis framework for repository‑level coding trajectories. TrajAudit employs an investigator agent supported by two modules: one filters failure‑irrelevant information through pattern matching and keyword detection, and the other generates a preliminary diagnosis from test failure reports as prior knowledge, helping the agent handle noisy long contexts. The investigator agent can further invoke tools to retrieve filtered content on demand, ensuring that critical information is preserved while noise is minimized. We also introduce RootSE, a benchmark of 93 real‑world agentic failure instances sourced from software maintenance tasks, representing the most complex trajectory diagnosis benchmark to date. Experiments on RootSE show that TrajAudit outperforms all existing baselines by over 24.4 percentage points in localization accuracy, while reducing token consumption by at least 18%, demonstrating its practical effectiveness. We hope this work draws community attention to failure management in agentic software engineering and provides a foundational resource for future research.

PaperID: 2581, https://arxiv.org/pdf/2605.26530.pdf

Authors: Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

Title: Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Abstract:
Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal‑relevance‑sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should‑change and should‑not‑change evaluation across judicial fairness, robustness, and statute‑confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi‑agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact‑statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

PaperID: 2582, https://arxiv.org/pdf/2605.26521.pdf

Authors: Nafiseh Kahani, Mojtaba Bagherzadeh

Title: Testing Agentic Workflows with Structural Coverage Criteria

Abstract:
Multi‑agent systems increasingly expose explicit workflow structure: agents, tools, tool‑access rules, restrictions, and delegation paths. Existing evaluations rely largely on end‑to‑end task success, benchmark scores, final‑response quality, or prompt‑level checks, which provide limited evidence that this declared coordination structure has actually been exercised. This makes it difficult to assess test‑suite adequacy or detect structural regressions in tool access, restrictions, and inter‑agent delegation. We address this gap with a structural testing approach for multi‑agent workflow specifications. The approach represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and uses coverage‑driven generation with DSPy‑based scenario realization to produce executable tests. The graph fixes what must be covered; DSPy realizes those obligations as natural‑language scenarios whose witnesses are checked at runtime. We implement the approach for OpenAI Agents SDK‑style workflows and evaluate it on ten SDK‑derived benchmarks comprising 49 reachable agents, 47 tools, and 403 structural obligations. Generated scenarios witness 54/75 allowed‑tool obligations and 36/48 delegation obligations within a bounded refinement budget. The adversarial restricted‑tool criterion elicits 23/248 restricted‑call violations, separating workflows whose restrictions hold under probing from workflows with concrete misrouting failures. These results show that structural coverage provides a useful adequacy layer for multi‑agent workflow testing: it does not replace semantic or end‑to‑end evaluation, but reveals whether declared agents, tool‑access rules, restrictions, and delegation paths have been exercised.

PaperID: 2583, https://arxiv.org/pdf/2605.26497.pdf

Authors: Peiran Wang, Ying Li, Yuan Tian

Title: Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents

Abstract:
LLM‑based agents are increasingly deployed in high‑stakes scenarios such as email management, financial transactions, and code execution, where they interact with the external world through tool calling. During execution, these agents must read external data sources (emails, webpages, files) that attackers can control; through indirect prompt injection, attackers embed malicious instructions in this data to manipulate agents into performing unauthorized operations such as transferring funds to attacker‑controlled accounts. Existing defenses either perform tool‑call‑level value checking without tracking where parameter values originate, or analyze execution traces from a single perspective without a clean authorization baseline for comparison. We propose AuthGraph, a dual‑graph alignment defense framework that constructs two complementary graphs: an injected reasoning graph that models information provenance from the actual execution trajectory (including potentially manipulated attributions), and an authorization graph derived from the user's intent in an isolated clean context that is information‑theoretically impossible to be influenced by injection; a graph alignment checker then structurally compares the two graphs to detect both tool‑level and parameter‑source‑level deviations. On AgentDojo, AuthGraph reduces the attack success rate from 40% to 1% while maintaining 76% task completion rate on GPT‑4o; on AgentDyn, it reduces the attack success rate from 39% to 2% while preserving 51% utility, outperforming state‑of‑the‑art defenses including CaMeL, DRIFT, and Progent. To our knowledge, AuthGraph is the first agent security defense to structurally compare authorization specifications against execution provenance at the parameter‑source level, achieving fine‑grained injection detection without sacrificing agent flexibility.

PaperID: 2584, https://arxiv.org/pdf/2605.26448.pdf

Authors: Ujwal Kumar, Arth Singh, Hershraj Niranjani, Machiko Hirota, Takehiro Takayanagi, Alice Saito, Eiji Kamioka, Phan Xuan Tan

Title: Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

Abstract:
Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single‑agent or cooperative assumptions. Recent work shows LLM‑guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial‑specialist objectives. We study adversarial constitutional co‑evolution (Blue cooperators vs. Red free‑riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid‑world. Three findings: (1) in the PGG, both factions converge to a near‑parity equilibrium at S approximately 0.78, robust across tested multipliers m in 1.2, 1.5, 2.0, 3.0; (2) in independently scored environments, per‑faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score‑advantage fitness target S_own ‑ S_opp restores it; (3) under pure‑adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co‑evolution of natural‑language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red‑team artifacts for testing future cooperative designs.

PaperID: 2585, https://arxiv.org/pdf/2605.26289.pdf

Authors: Victor Norgren

Title: Stateful Inference for Low-Latency Multi-Agent Tool Calling

Abstract:
Multi‑agent tool calling is becoming the dominant interaction pattern for LLM‑based systems, yet existing inference frameworks treat each tool call as an independent request, re‑processing the entire conversation from scratch even though 85‑95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the O(n_t) per‑turn cost of conventional serving into an O(Δ_t) delta‑only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi‑agent traffic and a prompt‑lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully‑generated workloads, the reference implementation is 2.1× faster per turn on a 6‑turn agentic workflow and 4.2× on the median turn of a 35‑turn one, halving end‑to‑end wall time. The advantage comes from stateful reuse and speculation, not caching.

PaperID: 2586, https://arxiv.org/pdf/2605.26286.pdf

Authors: Maxim Mednikov, Oren Gal

Title: Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

Abstract:
Real‑world multi‑agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently exhibit significant performance degradation in these regimes because they act on outdated feedback. We propose a modular execution‑stage state‑estimation layer that replaces delayed communicated observations with current belief‑state estimates. The framework integrates a learned Gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity, The estimator serves as a plug‑in for pre‑trained policies, requiring no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi‑agent and continuous‑control benchmarks demonstrates that the proposed layer consistently enhances robustness to communication latency and message loss. The most significant performance gains are observed in coordination‑intensive and dynamically unstable tasks where temporal consistency is critical for control.

PaperID: 2587, https://arxiv.org/pdf/2605.26278.pdf

Authors: Djamel Bouchaffra, Faycal Ykhlef, Mustapha Lebbah, Hanane Azzag

Title: Coalition Free Energy and Adaptive Precision in Multi-Agent Cooperation

Abstract:
Cooperative multi‑agent systems require robust mechanisms for credit assignment under uncertainty. Here we introduce a variational framework, termed the Game‑Theoretic Free Energy Principle (GT‑FEP), that models coalition formation through a Gibbs distribution over interacting agents. Within this framework, we derive a precision‑dependent formulation of cooperative credit assignment and show that an agent's Shapley value exhibits a non‑monotonic relationship with sensory precision beta, reflecting a trade‑off between noisy inference and overconfident local estimation. Motivated by this observation, we propose Adaptive Precision Control (APC), an online adaptation algorithm that dynamically adjusts observation precision using local estimates of cooperative contribution. We evaluate APC on real‑world Swiss roundabout trajectory datasets and on a multi‑agent control task derived from the same trajectories. Across both settings, APC adapts to changing noise conditions online and achieves performance comparable to the best fixed precision without prior tuning. Our results connect variational inference, cooperative game theory, and adaptive multi‑agent coordination, and suggest that precision adaptation can improve robust cooperation under uncertainty.

PaperID: 2588, https://arxiv.org/pdf/2605.26269.pdf

Authors: Faruk Alpay, Taylan Alpay

Title: AgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM Agents

Abstract:
LLM agents process trusted instructions, retrieved records, and tool observations through a common generative channel. This conflates data flow with authority: an untrusted string can affect a secret‑bearing response or an action proposal even when no application policy authorizes that influence. We introduce AgentSecBench as an empirical instantiation of a formal security framework for this problem. The framework defines three games‑instruction‑integrity, retrieval‑confidentiality, and capability‑integrity‑under a common notion of intent‑to‑execution noninterference with permitted leakage. It represents an application policy as a projection onto authorized observations and capabilities, distinguishes prompt annotations from enforcing projections, and measures both adversarial advantage and whether a defense closes the relevant model‑visible channel before generation. The exact‑marker experiments are intentionally one observable instantiation of the games rather than a complete semantic security claim: they test disclosure and forbidden‑action distinguishers with unambiguous ground truth. We evaluate six defense classes with Qwen3‑0.6B and Qwen3‑1.7B on paired adversarial and benign‑control executions. The measurements show when risk reduction follows channel closure and when a model‑visible adversarial capability remains exploitable. The result is a security‑oriented evaluation method: prompt text can describe a boundary, whereas provenance projection, capability restriction, and output validation can enforce one.

PaperID: 2589, https://arxiv.org/pdf/2605.26256.pdf

Authors: Jeongeun Lee, Chanyoung Park, Dongha Lee

Title: Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

Abstract:
Multimodal large language model (MLLM)‑based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instruction or recognizing object categories. In real‑world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multiomodal memory‑augmented framework for personalized embodied agents over long‑term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long‑term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi‑hop inference, or tracking updates in user‑specific context over time.

PaperID: 2590, https://arxiv.org/pdf/2605.26252.pdf

Authors: Abdelghny Orogat, Essam Mansour

Title: Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Abstract:
Long‑running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long‑term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity‑driven forgetting, and read‑only retrieval. In our vision, long‑term agent memory is a new data‑management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record‑level database operations with four state‑level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record‑level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property‑graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory‑centric data management as a workload.

PaperID: 2591, https://arxiv.org/pdf/2605.26203.pdf

Authors: Aditya Vema Reddy Kesari, Krishna Reddy Kesari

Title: AgentSociety: Incentivizing Agentic Social Intelligence

Abstract:
The success of deployed agents relies on their ability to handle open‑ended user requests using their inherent capabilities, not only in solving requests directly but also in effectively leveraging inter‑agent communication channels and feedback signals over time. This requires a multi‑agent environment where agents can operate autonomously, strategically communicate, behave collaboratively and be driven by economic incentives, much like humans in society. Towards this vision, we propose \mathttAgentSociety, a mechanism that enables decentralized agentic collaboration grounded in liquid democracy and information diffusion from social choice theory. We show that \mathttAgentSociety provides an environment for agents to make autonomous decisions utilizing their local context to maximize their utility while achieving collective outcomes through incentivized collaboration. Specifically, we prove that delegation to more competent neighbor agents is incentive compatible and naturally generates multi‑agent routing path by consensus. Additionally, our mechanism incentivizes agents to selectively disclose information to their neighbor agents when doing so aligns with their self‑interest, so as to garner influence. We characterize the Nash equilibrium showing that agent payoffs are reflective of their marginal contributions. We compare and benchmark strategy profiles adopted by open and proprietary state‑of‑the‑art language models deployed in \mathttAgentSociety against best response. Finally, we evaluate collaborative performance from consensus‑based routing among self‑interested heterogeneous agents in \mathttAgentSociety on real‑world datasets.

PaperID: 2592, https://arxiv.org/pdf/2605.26195.pdf

Authors: Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Hong Geng, Min Yang

Title: CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Abstract:
LLM‑based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human‑designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce \textscCyberEvolver, a self‑evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self‑evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low‑diversity updates can cause errors to compound over repeated iterations. \textscCyberEvolver addresses these challenges with a four‑layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace‑to‑diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population‑based beam search strategy that preserves diverse agent variants during evolution. We evaluate \textscCyberEvolver on CTF challenges, vulnerability exploitation, and penetration‑testing tasks using four open‑source LLMs. Across these settings, \textscCyberEvolver improves the seed agent's success rate by 13.6\,% on average, and outperforms six human‑designed cybersecurity agents as well as two self‑improvement methods adapted from other domains. These results suggest that scaffold self‑evolution is a promising direction for building adaptive LLM agents for security testing.

PaperID: 2593, https://arxiv.org/pdf/2605.26179.pdf

Authors: Penghui Yang, Zhonghan Zhang, Yue Li, Xinrun Wang, Yanchen Deng, Yuhao Lu, Bijun Tang, Zheng Liu, Bo An

Title: AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

Abstract:
Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM‑based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand‑crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre‑planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed‑loop multi‑agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor‑recover‑reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose‑built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task‑level success with GPT‑5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first‑principles results.

PaperID: 2594, https://arxiv.org/pdf/2605.26178.pdf

Authors: Xinkui Zhao, Sai Liu, Yifan Zhang, Qingyu Ma, Zewen Lin, Naibo Wang, Guanjie Cheng, Chang Liu, Yueshen Xu

Title: ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy

Abstract:
Large Language Model (LLM)‑based multi‑agent systems rely on optimized collaboration topologies to balance performance and communication costs. However, current methods struggle with the inherent stability‑extensibility trade‑off and often misalign computational budgets with query difficulty. We propose \textscATOM, an adaptive framework that generates budget‑controllable collaboration graphs via a novel task‑driven reinforcement learning paradigm. Inspired by atomic structures, \textscATOM employs a nucleus‑electron hierarchy: it maintains a stable, offline‑learned collaboration backbone (the nucleus) while dynamically activating query‑conditioned agents (electrons) during inference. Crucially, a complexity‑aware budgeting strategy aligns resource consumption with task demands by estimating query difficulty to strictly regulate electron instantiation. Extensive experiments across six diverse benchmarks demonstrate that \textscATOM achieves state‑of‑the‑art performance while improving token efficiency by up to 30% compared to strong baselines.

PaperID: 2595, https://arxiv.org/pdf/2605.26154.pdf

Authors: Xuanye Zhang, Yongsen Zheng, Zhuqin Xu, Kaiyu Zhou, Bowen Shen, Haoran Ou, Tianwei Zhang, Kwok-Yan Lam

Title: MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

Abstract:
LLM‑driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise such process, steering agents toward inappropriate/wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that bias tool selection by poisoning the agent's long‑term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision‑making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory‑module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long‑term memory as a critical and under‑explored attack surface in tool‑augmented agents, urging the development of memory‑level integrity safeguards.

PaperID: 2596, https://arxiv.org/pdf/2605.26087.pdf

Authors: Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro

Title: DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Abstract:
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks a LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional‑power gravity, multi‑species couplings, hidden dark‑matter‑like particles, non‑coordinate‑free physics, and time‑varying interactions. Each world is generated on demand by an N‑body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural‑language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long‑horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held‑out particles and an LLM‑judged explanation score following an expert‑written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open‑source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well‑chosen experiments.

PaperID: 2597, https://arxiv.org/pdf/2605.26029.pdf

Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

Title: CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Abstract:
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held‑out reactor crystal governed by the same mechanism. The hidden data‑generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6‑node setting, GPT‑5.2‑high reaches 92% task accuracy but only 0.471 all‑edge F_1. Mixed observation‑intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.

PaperID: 2598, https://arxiv.org/pdf/2605.26001.pdf

Authors: Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

Title: AI-Assisted Systematization for Evaluating GenAI Systems

Abstract:
Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource‑intensive, we investigate whether AI assistance can support this process. To enable AI‑assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI‑assisted systematizers: a direct, zero‑shot approach and a multi‑agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts ‑‑ hate‑based rhetoric and digital empathy ‑‑ and evaluate resulting concept specs on content validity and information recoverability.

PaperID: 2599, https://arxiv.org/pdf/2605.25981.pdf

Authors: Liyun Zhang, Jiayi Guo

Title: When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

Abstract:
We document an empirical phenomenon in chain‑of‑thought and ReAct agents driven by ten large language models from seven architecture families: meaning‑bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and ～11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired t=9.58, p<0.0001), with 64/68 cells positive. The gap survives four severity‑proxy audits and remains significant when excluding qwen models (+11.10 pp, p<0.0001). Several stress tests fail honestly: cluster‑bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross‑architecture generator swaps break per‑cell rankings, and a second LLM judge yields only moderate agreement (κ=0.50). We then validate the headline effect on a fully held‑out 11th model (qwen2.5‑14B‑Instruct; 1,800 trajectories) and re‑test a pre‑registered capability×tractability partition, observing a small but positive held‑out effect (3/4 cells positive; pooled Welch t=3.81, p=9.6×10^‑4). Using held‑out trajectories, we probe four trace‑level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emphstealth‑divergence picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held‑out replication and a partial trace‑level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.

PaperID: 2600, https://arxiv.org/pdf/2605.25970.pdf

Authors: Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Mohammed Adil Butt, Andrew D. Beggs, Adam Byfield, Anusha Jose, Junaid Qadir, Muhammad Bilal

Title: PathWISE: Multi-Agent Cancer Pathway Triaging Ontology Learning from Clinical Flowcharts

Abstract:
Clinical pathways are disseminated as visual flowcharts where spatial topology, arrow direction, colour coding, and font weight encode critical triage logic that remains inaccessible to computational systems. We present PathWISE, a five‑phase pipeline combining four LLM‑based agents with a deterministic depth‑first search auditor and a Java compiler critic, transforming these non‑computable artefacts into validated, executable HL7 Clinical Quality Language (CQL) libraries deployable as FHIR CDS Hooks services. Purpose‑built agents extract flowchart structure into a typed directed graph, perform deterministic path enumeration, conduct a structured semantic audit of every node's computability, generate terminology‑constrained CQL definitions verified by the official Java CQL‑to‑ELM compiler, and produce routing logic covering 100% of enumerated patient journeys. Demonstrated across five UK NHS cancer pathways (colorectal, lung, skin, upper GI, and breast), PathWISE audits up to 183 nodes (182 under the Hybrid configuration), identifies 544 structured governance findings across four issue categories, achieves 100% syntactic compilation success, with UNCOMPUTABLE nodes receiving false placeholders that preserve compilability while surfacing governance gaps for clinical review, and produces zero hallucinated terminology codes for dictionary‑covered concepts. Critically, PathWISE confines non‑deterministic LLM inference to knowledge extraction while deterministic graph mathematics and a standard compiler underpin every verification step.

PaperID: 2601, https://arxiv.org/pdf/2605.25958.pdf

Authors: Daren Wang, Hong Xu, Jiawen Xian

Title: PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction

Abstract:
This paper introduces PolyGnosis 2.0, a pioneering multi‑agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high‑alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool‑calling, divide‑and‑conquer partitioning (D&C), and chain‑of‑thought (CoT), within high‑noise financial domains. Our empirical evaluation against human‑expert benchmarks reveals that while structural partitioning is mandatory for multi‑dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto‑optimal configuration that achieves professional‑grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.

PaperID: 2602, https://arxiv.org/pdf/2605.25929.pdf

Authors: Franka Bause, Jonas Niederle, Martin Pawelczyk, Rebekka Burkholz

Title: Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

Abstract:
The effectiveness of multi‑agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin‑Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi‑agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input‑dependent, turning multi‑agent deliberation into a mixture of experts. This perspective implies that multi‑agent systems can outperform single agents and static ensembles when routing reflects agent competence. Since competence is latent in practice, we analyze how influence is established through observable proxies: agents' self‑assessed confidence, their perceived confidence, and initial alignment with other agents' views.

PaperID: 2603, https://arxiv.org/pdf/2605.25869.pdf

Authors: Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang

Title: Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

Abstract:
Long‑term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance‑role collapse, a critical failure mode where agents suffer from source‑monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long‑term memory into grounded atoms that separate raw evidence, retrieval cues, and truth‑bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi‑route atomic projection and provenance‑scoped utilization to transform heterogeneous retrieval hits into claim‑centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM‑100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.

PaperID: 2604, https://arxiv.org/pdf/2605.25867.pdf

Authors: Pietro Zanotta, Dibakar Roy Sarkar, Honghui Zheng, Somdatta Goswami, Ján Drgoňa

Title: CINOC: Cardinality-Invariant Neural Operator Policies for Scalable PDE Control

Abstract:
Controlling partial differential equations (PDEs) with learning‑based policies remains fundamentally limited by fixed‑dimensional representations: policies trained for a specific sensor, actuator, or agent configuration typically fail when the configuration changes. This limitation is particularly severe in multi‑agent PDE control, where policies do not scale across population sizes without retraining. We address this challenge by introducing Cardinality Invariant Neural Operator Control (CINOC), reformulating PDE control as an operator learning problem that maps state fields to continuous control functions and trains them end‑to‑end through differentiable PDE solvers, yielding policies that naturally adapt to varying sensor and actuator configurations. Remarkably, CINOC policies trained on small swarms exhibit cardinality invariance, allowing for zero‑shot transfer to significantly larger populations as well as robustness to partial agent failure. This scalability arises from agents sharing a common policy and coordinating through their physical environment, which produces an emergent self‑normalization effect. To explain this phenomenon, we provide a theorem grounded in mean‑field theory demonstrating that policy gradients computed from finite‑agent systems converge to those of a continuous control limit. Empirically, we validate CINOC on tracking, stabilization, and density transport across linear, nonlinear, chaotic, and turbulent PDEs.

PaperID: 2605, https://arxiv.org/pdf/2605.25851.pdf

Authors: Xicheng Gong, Guozheng Sun, Peiran Xu, Yadong Mu

Title: RePlan-Bot: Multi-Level Replanning for Embodied Instruction Following

Abstract:
Embodied instruction following (EIF) requires agents to understand and execute complex natural language commands within interactive 3D environments. Despite recent advances, existing methods often fail in long‑horizon planning and handling irreversible state changes, resulting in low task success rates. To address these challenges, we introduce RePlan‑Bot, a novel EIF agent that performs multi‑level, continuous replanning throughout task execution. RePlan‑Bot integrates a high‑level LLM‑based auditor for dynamic sub‑goal adjustments guided by environmental feedback, a commonsense‑guided search mechanism based on a multi‑layered instance map for precise and structured object localization, and a lightweight ViT‑based corrector to preemptively fix risky low‑level actions. Evaluated on the ALFRED benchmark, RePlan‑Bot achieves state‑of‑the‑art performance in both seen and unseen environments, demonstrating superior adaptability and reliability.

PaperID: 2606, https://arxiv.org/pdf/2605.25832.pdf

Authors: Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

Title: When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Abstract:
Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto‑Robotist, a self‑evolving LLM agent that distills morphology‑search traces into an explicit natural‑language skill library. Each skill stores a structural archetype, evidence‑grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto‑Robotist improves cold‑start 5x5 search and transfers learned skills to 10x10 design spaces, where reference‑conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

PaperID: 2607, https://arxiv.org/pdf/2605.25746.pdf

Authors: Haoran Li, Shulun Chen, Shaoyuan Sun, Hanchen Wang

Title: Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

Abstract:
As large language model (LLM)‑based multi‑agent systems scale to handle increasingly complex tasks, balancing structural stability and dynamic adaptability becomes increasingly challenging. Existing systems typically adopt either structure‑centric methods, committing to structures determined upfront that limit fine‑grained control, or orchestration‑centric methods, adapting decisions dynamically while leaving coordination structure implicit and unstable. To address this challenge, we revisit multi‑agent coordination from a probabilistic perspective, casting it as posterior inference over the joint distribution of structure and orchestration. We introduce MACA, an automated coordination framework that learns a task‑ and budget‑conditioned structural prior over agent participation and interactions. This prior guides a policy‑based orchestration as an approximation to posterior inference, enabling efficient solutions with fine‑grained control. Across benchmarks, MACA outperforms adaptive multi‑agent baselines by an average of 8.42% while using 43.19% fewer tokens. Further investigation reveals that joint adaptation of structure and orchestration suppresses redundant interactions, converging coordination toward task‑effective execution.

PaperID: 2608, https://arxiv.org/pdf/2605.25741.pdf

Authors: Rajnikant Sharma, Abhinav Sinha, Isaac Weintraub

Title: Collaborative Threat-Aware Autonomy (CTAA)

Abstract:
Navigating teams of unmanned vehicles through environments containing dynamic, adversarial Weapon Engagement Zones~(WEZs) poses a fundamental challenge to mission success: a single vehicle, however capable its onboard guidance, remains a single point of failure. This paper presents a role‑differentiated multi‑agent framework for collaborative threat‑aware trajectory planning in which a fleet of Autonomous Collaborative Platforms~(ACPs) is assigned distinct roles primary intercept, escort, and decoy to improve team‑level mission success probability while managing individual WEZ exposure. Each ACP independently employs a reactive guidance law derived from the Collision Sphere Boundary for Evader Zero‑Set~(CSBEZ), which accounts for pursuer maneuverability constraints imposed by minimum turn radius, and steers the vehicle toward the safest heading that also makes progress toward its goal. Role assignment and spatial route separation induce two complementary effects: probabilistic redundancy, in which N independent paths raise the team success probability and threat saturation, in which lower‑priority escorts and decoys draw adversary attention and free the primary vehicle to transit uncontested.

PaperID: 2609, https://arxiv.org/pdf/2605.25665.pdf

Authors: Satadru Sengupta, Tamunokorite Briggs, Ivan Myshakivskyi

Title: Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

Abstract:
AI‑native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insufficient for production environments where software must be continuously produced, verified, deployed, maintained, and adapted across many operational contexts and long time horizons. We present a meta‑engineering harness: a software‑production architecture that transforms operational and product feature requirements into explicit contracts, routes work through role‑specialized AI agents, performs independent and adversarial verification, and continuously improves itself through structured failure classification and outer‑loop calibration. The harness is designed for settings in which software delivery is not a one‑time project but an ongoing operating function. In our motivating application, CTO‑as‑a‑service for small service firms, the system manages websites, booking flows, payment systems, backoffice workflow automations, and AI‑agent interfaces as continuously evolving technical infrastructure rather than one‑off deliverables. We describe the layered architecture, including two‑pass contract compilation, persistent markdown memory with specialization records, attention‑based and independence‑based verifications, a four‑way failure arbiter, and outer‑loop calibration. We report results from an early production deployment spanning 17 features over several weeks, including a detailed in‑app payments case study that revealed contract incompleteness and verification‑boundary issues. These observations directly drove targeted improvements to the harness. The contribution is an implemented, measurable, and extensible verification architecture for making AI‑native service‑as‑a‑software production reliable, auditable, and improvable over time.

PaperID: 2610, https://arxiv.org/pdf/2605.25653.pdf

Authors: Tharindu Ranathunga, Kavishka Fernando, Susan Rea

Title: When Agents Control Robots: A Zero Trust Policy Model for Agentic Cyber-Physical Systems

Abstract:
Multi‑agent systems powered by large foundation models (LFMs) are increasingly deployed to control industrial robots through natural language, creating deployments in which security failures produce physical consequences. We analyse this threat landscape through Cobot‑Claw, a deployed four‑agent system for UR3e robotic arm control, and identify five attack classes specific to agentic cyber‑physical systems. We propose ZTPM, a Zero Trust Policy Model comprising 25 typed primitives across five enforcement domains with Physical Impact Tiers as a runtime policy dimension. An empirical evaluation across 60 execution traces on two LFM backends provides initial evidence that actuation parameter selection is model‑dependent and non‑deterministic, motivating the need for policy‑level enforcement at the physical actuation boundary.

PaperID: 2611, https://arxiv.org/pdf/2605.25632.pdf

Authors: Hao-Hsuan Chen

Title: Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

Abstract:
Autonomous AI agents increasingly issue side‑effect‑bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time‑consistent risk mapping, and gates execution against a per‑boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote‑bind‑commit protocol with toll‑bounded capability tokens; (ii) a universal seven‑class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha‑spending; (iv) cross‑domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer‑service refund, and the public tau‑bench retail and airline tool‑use traces) and report a live Postgres panel in which three Azure‑hosted models propose actions through the same contract. The frontier exhibits a common low‑reserve refusal and intermediate‑release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark‑ready evaluation framework for runtime actuarial control of autonomous‑agent side effects.

PaperID: 2612, https://arxiv.org/pdf/2605.25507.pdf

Authors: Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

Title: Credit Assignment with Resets in Language Model Reasoning

Abstract:
Contemporary reinforcement learning with verifiable reward methods post‑train language models on multi‑step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random‑Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self‑Reset Policy Optimization (SRPO), where the model self‑localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit‑assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self‑localized reset and learning from their rewards, using only the model itself with no external supervision.

PaperID: 2613, https://arxiv.org/pdf/2605.25489.pdf

Authors: Natalia Andrienko, Gennady Andrienko, Jürgen Bernard, Michael Sedlmair

Title: ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

Abstract:
Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact‑Transform Workflow Language (ATWL), a domain‑agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define‑unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross‑workflow analysis reveals structural regularities ‑‑ a recurrent meta‑structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross‑domain equivalences ‑‑ that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment‑level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM's context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

PaperID: 2614, https://arxiv.org/pdf/2605.25486.pdf

Authors: Hengjun Jiang, Liansheng Sun, Yan Jiang, Xiaojie Ke, Yongjin Wang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang

Title: RAG-Match: Retrieval-Augmented Knowledge Injection and Hierarchical Reasoning for Calibrated Semantic Relevance

Abstract:
Semantic relevance judgment for search is particularly challenging in knowledge‑intensive scenarios, where accurate ranking requires not only semantic matching but also background grounding, multi‑step reasoning, and well‑calibrated decision boundaries. Existing relevance models mainly rely on direct label supervision or shallow semantic similarity, which limits their ability to handle implicit intent, factual equivalence, and fine‑grained relevance distinctions. To address this issue, we propose \textscRAG‑Match, a three‑stage framework that integrates knowledge‑augmented pretraining, hierarchical reasoning alignment, and preference‑based decision calibration for relevance modeling. The key idea is to first strengthen query‑centered semantic grounding, then align the model with structured relevance reasoning, and finally correct decision‑level inconsistencies in difficult boundary cases. Experimental results on a real‑world search relevance benchmark show that \textscRAG‑Match consistently outperforms strong LLM‑based baselines across multiple ranking metrics, demonstrating the effectiveness of combining knowledge injection, reasoning supervision, and preference optimization for fine‑grained relevance judgment.

PaperID: 2615, https://arxiv.org/pdf/2605.25480.pdf

Authors: Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

Title: Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

Abstract:
LLM agents require retrieval to behave less like one‑shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. Yet current Retrieval‑Augmented Generation (RAG) systems organize external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval‑as‑lookup interface ill‑suited to iterative reasoning agents. We propose LLM‑Wiki, an agent‑native retrieval system that operationalizes the Retrieval‑as‑Reasoning paradigm by treating external knowledge as a compilable, composable, and self‑evolving structure rather than a static retrieval index. LLM‑Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link‑following operations through standard tool‑calling interfaces, and introduces an Error Book for persistent structural and semantic self‑correction. LLM‑Wiki achieves state‑of‑the‑art results on HotpotQA, MuSiQue, and 2WikiMultiHopQA, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0‑8.1 F1 points. On AuthTrace, LLM‑Wiki achieves the best overall accuracy, with especially strong gains on multi‑document structured queries, confirming that compilation‑based retrieval generalizes beyond chain‑style multi‑hop reasoning.

PaperID: 2616, https://arxiv.org/pdf/2605.25479.pdf

Authors: Kaixiang Chen, Pengfei Fang, Hui Xue

Title: MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

Abstract:
Adapting large vision‑language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine‑tuning is computationally prohibitive and prone to overfitting in low‑data regimes. Parameter‑efficient fine‑tuning (PEFT) alleviates these issues with lightweight prompt‑ or adapter‑based modules, and cross‑modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse‑grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi‑Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross‑modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine‑tuning. To couple visual and textual streams at this level, we introduce a bottleneck‑based text‑to‑image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross‑modal exchange through a meta agent layer, a meta‑text bridge, and a meta‑image bridge. At inference time, all agent layers are re‑parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few‑shot image classification and few‑shot universal cross‑domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state‑of‑the‑art PEFT methods.

PaperID: 2617, https://arxiv.org/pdf/2605.25440.pdf

Authors: Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

Title: A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Abstract:
Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two‑stage LLM‑based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi‑agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM‑as‑a‑judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI‑discovered criteria outperform prior content‑based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human‑aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

PaperID: 2618, https://arxiv.org/pdf/2605.25435.pdf

Authors: Yuntao Wang, Jianle Ba, Han Liu, Yanghe Pan, Jintao Wei, Zhou Su, Tom H. Luan, Linkang Du

Title: Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

Abstract:
The rapid evolution of large language model (LLM)‑driven autonomous agents has given rise to OpenClaw, a new class of open‑source agent frameworks that operate as continuously running, skill‑augmented systems with persistent memory, multi‑channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi‑step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high‑privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi‑agent cascading failures, and supply‑chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

PaperID: 2619, https://arxiv.org/pdf/2605.25431.pdf

Authors: Dewei Jiang, Xiang Gu

Title: Mode 0: A New 3GPP V2X Resource Allocation Category for Roadside Computing Unit-Assisted Safety Communication

Abstract:
The 3GPP V2X resource allocation framework defines two entity classes ‑‑ the base station and the vehicle UE ‑‑ and four modes across LTE and NR generations. We demonstrate that this binary taxonomy is structurally incomplete. Base station‑led scheduling saturates at high‑density traffic nodes, producing latency‑tail failures that persist even when mean packet delivery ratios approach the service‑class target. UE autonomy is categorically incapable of pre‑emergence warning for occluded traffic participants and insufficient for large‑scope cascading environmental hazards. We propose Mode 0, a new 3GPP V2X category whose defining entity is the Roadside Computing Unit (RCU) ‑‑ an infrastructure ensemble integrating elevated sensing (Seeing), sidelink communication (Speaking), and local computational evaluation (Thinking), owned by traffic management authorities. Mode 0 defines a subfamily spectrum from Mode 0a (all‑passive UEs, the guaranteed minimum) through Mode 0c (all‑active UEs, the optimal target). Convergent deployment evidence from Chinese national standards (DB11/T 2329.1‑2024, T/ITS 0224.1‑2025), China Unicom RS‑MEC infrastructure, and European and US C‑V2X programs confirms that both institutional sides are converging on the roadside traffic node without a coordination standard. A fifteen‑run Multi‑Agent Proximal Policy Optimization (MAPPO) simulation validates the architectural family: Mode 0a in shared‑pool baseline sits at the analytical symmetric‑Nash coordination floor; Mode 0c with demand separation achieves strict Pareto improvement for both traffic classes (M0 PDR 0.999, M1 PDR 0.998 at ρ_\rm pool \leq 1) and lifts the worst‑TTI delivery ratio from near‑zero to 0.601 ‑‑ the only configuration satisfying the latency safety requirement structurally. We call for a 3GPP study item on Mode 0 within the NR‑V2X sidelink enhancement work programme.

PaperID: 2620, https://arxiv.org/pdf/2605.25422.pdf

Authors: Lipeng Dai, Luping Xiang, Kun Yang

Title: A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

Abstract:
The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi‑agent cooperation, which in turn is expected to substantially increase east‑west traffic. Although latent‑space interaction mechanisms can enable more efficient collaboration than symbolic natural‑language (NL) exchanges, prior work often abstracts away the associated communication overhead under practical wireless constraints. In embodied multi‑agent settings, heterogeneous interaction media incur disparate inference and transmission costs, thereby inducing an inherent end‑to‑end (E2E) latency trade‑off. To address this, we propose a joint design that integrates communication‑media selection with wireless resource allocation. Through analytical characterization and simulation‑based evaluation, we show that neither token‑based transmission nor key‑value (KV) cache‑based transmission is uniformly optimal across operating regimes, as performance depends critically on system parameters such as available computational resources and channel conditions. Accordingly, we formulate a joint optimization problem aimed at minimizing the E2E latency of multi‑agent collaboration and develop a low‑complexity joint media selection and resource allocation (JMSRA) algorithm. Numerical results further confirm that, by adaptively coordinating the interaction media and bandwidth allocation over heterogeneous links, the proposed scheme achieves markedly reduced E2E latency relative to conventional NL‑only and KV‑cache‑only baselines, enabling efficient and robust multi‑agent collaboration in future wireless networks.

PaperID: 2621, https://arxiv.org/pdf/2605.25421.pdf

Authors: Xinyi Mou, Siyuan Wang, Zejun Li, Yulan He, Zhongyu Wei

Title: HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol

Abstract:
Communication protocol design is a central challenge in large language model‑based multi‑agent systems. Existing single‑channel approaches face an inherent communication trilemma: text‑based methods are interpretable but verbose, while latent‑space methods are efficient but opaque and limited to unidirectional workflows. Inspired by multi‑channel communication theory, we propose HyLaT, a hybrid latent‑text communication protocol that transmits elaborate cognitive signals through a latent channel for efficiency, while expressing concise critical signals in natural language to preserve interpretability and precision. We introduce a two‑stage training framework combining single‑agent hybrid generation learning and multi‑agent interactive co‑training, enabling agents to generate and interpret hybrid messages across multiple rounds of interaction. Experiments demonstrate that HyLaT reduces communication overhead significantly while maintaining competitive task performance, with strong generalization and robustness across diverse settings.

PaperID: 2622, https://arxiv.org/pdf/2605.25393.pdf

Authors: Ruoyu Yao, Ruiguo Zhong, Pei Liu, Mingxing Peng, Rui Yang, Jun Ma

Title: Decision-Making with Lightweight Confidence-Aware Language Model for Autonomous Driving

Abstract:
Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have demonstrated immense potential in autonomous driving (AD) by offering human‑like reasoning and open‑world generalization. However, the excessive computational overhead and high inference latency of these massive models severely hinder their deployment in resource‑constrained AD systems. To address this challenge, we propose a novel decision‑making framework utilizing a lightweight confidence‑aware language model, which bridges the gap between complex multimodal intention reasoning and efficient inference. Specifically, we design a multi‑agent collaborative workflow, comprising action voting, confidence assessment, and summarization agents, to generate high‑quality, confidence‑annotated decision demonstrations via explicit Chain‑of‑Thought (CoT) reasoning. These demonstrations are then distilled into a lightweight language model featuring a dual‑head architecture, enabling the joint prediction of decision probabilities and the generation of textual rationales. The distillation is realized via a confidence‑aware fine‑tuning strategy coupled with Retrieval Augmented Generation (RAG) to enhance the model's adaptability and data efficiency. Comprehensive closed‑loop experiments on the nuPlan benchmark demonstrate that our approach achieves state‑of‑the‑art (SOTA) success rates in both regular and long‑tail scenarios while maintaining low inference latency.

PaperID: 2623, https://arxiv.org/pdf/2605.25389.pdf

Authors: Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Yiming Hei, Litian Zhang

Title: Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

Abstract:
While Large Language Model‑based Multi‑Agent Systems (LLM‑MAS) demonstrate remarkable capabilities in solving complex tasks by orchestrating specialized agents and external tools, the implicit trust in tool outputs creates a critical attack surface. Existing tool attacks are limited by domain specificity or fixed and static templates. To address these challenges, we propose Evo‑Attacker, which formulates the tool attack as a self‑evolving, memory‑augmented reinforcement learning process. Evo‑Attacker constructs a dynamic attack memory and employs deliberative reasoning to retrieve adversarial patterns and strategize modifying interventions at critical moments. Furthermore, we introduce Attack‑Flow GRPO to optimize intermediate reasoning steps via terminal outcomes, addressing the long‑horizon credit assignment challenge. Comprehensive experiments demonstrate that Evo‑Attacker consistently outperforms baselines, highlighting its generalization and evolutionary capabilities and the urgent need for defensive tool safeguards.

PaperID: 2624, https://arxiv.org/pdf/2605.25384.pdf

Authors: Yingji Zhang, Yong Dai, André Freitas

Title: GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

Abstract:
Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi‑step reasoning. To better emulate human problem‑solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce the GeoMathCode, where programmatic representations serve as intermediate visual outputs. We further conduct an in‑depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine‑tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces, and contain more mathematical symbolic information than visual representations.

PaperID: 2625, https://arxiv.org/pdf/2605.25357.pdf

Authors: Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

Title: Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

Abstract:
Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one‑task, one‑model" paradigm limits systematic integration of evidence across this multi‑step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain‑specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool‑augmented multi‑agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task‑specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual‑Path Evidence Arbitration (DPEA), which integrates LLM‑based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval‑enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS‑VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question‑answer pairs across 10 clinical tasks. Extensive out‑of‑distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence‑driven clinical assistants for prenatal imaging. Code is available.

PaperID: 2626, https://arxiv.org/pdf/2605.25338.pdf

Authors: Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

Title: CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

Abstract:
Large language model (LLM) agents frequently fail on multi‑step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores(CRS) via step‑level counterfactual intervention to identify failure‑inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test‑time repair that recovers from failures with minimal behavioral drift, and training‑time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal‑consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning‑ready supervision.

PaperID: 2627, https://arxiv.org/pdf/2605.25311.pdf

Authors: Jing Yang, Yichao Wu, Jianan Liu, Penghao Liang, Mengwei Yuan, Xianyou Li, Weiran Yan

Title: Recursive Multi-Agent Trading System: Iterative Optimized Portfolio Strategy Under Geopolitical Uncertainty

Abstract:
Recursive Multi‑Agent Trading System (RMATS) integrates four specialized agents ‑‑ Sentiment, Report, Analysis, and Risk ‑‑ coordinated through a recursive Manager Agent with iterative feedback loops. Experimental evaluation over a 561‑trading‑day period (January 2023 to March 2025) across a 24‑asset multi‑class universe demonstrates that RMATS achieves a maximum drawdown of 9.62%, lower than MVO (15.49%) and FinBERT Sentiment (15.28%), and exhibits the lowest event‑period drawdown in 3 of 5 geopolitical stress scenarios tested. While RMATS underperforms return‑maximizing baselines in a sustained bull market environment, ablation studies confirm the individual contribution of each agent component to downside protection. These results position RMATS as a risk‑control‑oriented architecture suitable for institutions prioritizing capital preservation under geopolitical uncertainty.

PaperID: 2628, https://arxiv.org/pdf/2605.25310.pdf

Authors: Tianda Sun, Dimitar Kazakov

Title: Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

Abstract:
Tool‑using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain‑of‑thought text, not an agent's run‑time call graph. A low‑capacity edge probe on the residual stream of Qwen3‑32B decodes the tool‑call dependency graph well above both a Hewitt‑‑Liang random‑label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non‑substring oracle. The non‑positional component replicates on three further interactive multi‑hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single‑shot planning. Per‑layer activation patching shifts the probe at a later, non‑patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent's runtime tool‑call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.

PaperID: 2629, https://arxiv.org/pdf/2605.25273.pdf

Authors: Lingyao Li, Deyi Li, Chen Chen, Renkai Ma, Runlong Yu, Mingquan Lin, Rui Yin, Lizhou Fan, Cathy Shyr, Siyuan Ma, Mei Liu, Steven Bethard

Title: LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

Abstract:
Large language models (LLMs) are increasingly deployed across healthcare applications, including clinical documentation, diagnostic reasoning, medicine recommendation, and medical education. Their outputs are largely unstructured clinical text, which is difficult to reliably evaluate at scale. LLM‑as‑a‑Judge, in which an LLM evaluates another system's output against task‑specific criteria, offers a scalable alternative and is increasingly used in clinical evaluation, yet its validity in healthcare remains underexamined. Existing reviews focus on general‑purpose LLM evaluation or on risk framework, rather than systematically characterizing how LLM‑as‑a‑Judge is applied in healthcare and how well their judgments align with human experts. We therefore conduct a PRISMA‑guided comprehensive review of LLM‑as‑a‑Judge applications in healthcare, searching five databases for studies published between January 2023 and February 2026. After screening 541 records, 134 studies meet the eligibility and are coded by health scenario, judge configuration, technical approach, and validation design. LLM‑as‑a‑Judge is concentrated in clinical decision support, clinical natural language processing (NLP), medical knowledge and question answering (QA), and medical communication. OpenAI models are the most frequently used judges, and prompt engineering appears in nearly all studies, with ensemble, multi‑agent, and retrieval‑augmented designs as common extensions. Among studies reporting human validation, LLM judges often show moderate to strong alignment with expert judgments, although reliability varies substantially across tasks. Overall, this review positions LLM‑as‑a‑Judge as a promising framework for scalable healthcare AI evaluation, while emphasizing that its clinical value depends on model design and rigorous validation.

PaperID: 2630, https://arxiv.org/pdf/2605.25233.pdf

Authors: Andy Xu, Yu-Wing Tai

Title: Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

Abstract:
AI agents are increasingly used to solve complex, multi‑step tasks, but existing multi‑agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta‑Agent, a two‑phase framework that automatically constructs and executes specialized multi‑agent systems from natural‑language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction‑time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution‑time verification gates intermediate outputs. We further introduce a three‑level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re‑execution and re‑decomposition. We evaluate Meta‑Agent across coding, contextual learning, and open‑ended reasoning tasks. Experiments against strong multi‑agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi‑agent systems.

PaperID: 2631, https://arxiv.org/pdf/2605.25200.pdf

Authors: Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li, Yong Liu

Title: GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

Abstract:
Travel planning in the real world is overwhelmingly a group activity, yet existing LLM travel‑planning benchmarks reduce it to a single user, where the field is approaching saturation. This single‑user assumption sidesteps what makes group planning hard for an agent: discovering private preferences across multiple users, surfacing conflicts, and balancing utility against fairness. To bring the task back to its multi‑user reality, we introduce GroupTravelBench, the first benchmark for multi‑user, multi‑turn travel planning. Built from real user profiles, POI data, and ticket prices, it comprises 650 tasks across three difficulty levels, each running in a synchronous group‑chat sandbox with cached tool data for reproducible offline evaluation. Beyond the multi‑step reasoning and tool use that single‑user benchmarks already test, GroupTravelBench probes three group‑specific capabilities: (i) elicitation of private preferences through multi‑turn dialogue; (ii) coordination of inter‑user conflicts via compromise or subgrouping; and (iii) planning that balances group utility against fairness. We pair this with a complementary evaluation framework combining rule‑based outcome metrics and LLM‑judge process metrics. Across a wide range of frontier models, even the strongest agents fall short on all four rule‑based outcome metrics, with plan validity below 12%, suggesting that group‑level outcome quality is a key open challenge for LLM travel‑planning agents.

PaperID: 2632, https://arxiv.org/pdf/2605.25188.pdf

Authors: Yi Li, Songtao Wei, Dongming Jiang, Zhichun Guo, Qiannan Li, Bingzhe Li

Title: DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

Abstract:
Multi‑agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction‑heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi‑round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled‑communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support‑pattern reliability, and independence corrections. A coordinator receives only policy‑permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5× compared with communication‑heavy baselines.

PaperID: 2633, https://arxiv.org/pdf/2605.25141.pdf

Authors: Pavan Manjunath, Thomas Pruefer

Title: LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support

Abstract:
Reliable forecasting of renewable energy generation is a foundational requirement for grid stability energy trading battery scheduling and carbon aware operational planning Solar and wind resources are inherently intermittent their output fluctuates with cloud cover wind speed atmospheric turbulence seasonal patterns and local terrain The proliferation of IoT and edge devices spanning smart meters inverters anemometers pyranometers weather stations and grid interface sensors has created an unprecedented volume of real time operational data that conventional forecasting pipelines are ill equipped to exploit fully This review investigates how large language model LLM agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams weather API data historical generation records grid constraints and contextual reasoning into unified decision support workflows We survey classical forecasting methods statistical time series models deep learning architectures physics hybrid approaches and emerging LLM agent frameworks for explanation uncertainty communication and operator guidance A six layer taxonomy is proposed covering data acquisition preprocessing feature engineering model inference uncertainty estimation and natural language reporting The review identifies twelve open challenges spanning real time deployment model drift under distribution shift uncertainty quantification hallucination control in LLM agents interoperability of edge hardware and integration with energy management systems The paper concludes by recommending a research agenda centred on open benchmarks physics informed LLM grounding and federated forecasting architectures

PaperID: 2634, https://arxiv.org/pdf/2605.25133.pdf

Authors: João Sedoc, Baotong Zhang, Dean Foster

Title: Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

Abstract:
Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover‑verifier deliberation (PVD), an inference‑time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high‑confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub‑claims while a verifier issues targeted challenges and returns \textscAccept, \textscChallenge, or \textscReject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage‑precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high‑confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a ～30pp HC‑Prec gap over the non‑ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC‑Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover‑verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self‑consistency, universal self‑consistency, multi‑agent debate, and Reflexion suggest that prover‑verifier deliberation supplies a distinct argument‑defensibility signal for selective prediction.

PaperID: 2635, https://arxiv.org/pdf/2605.25101.pdf

Authors: Ashir Kulshreshtha, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Kristian Klemets, Dragos Truscan, Mikael Manngård

Title: Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

Abstract:
In many industrial domains, the Functional Mock‑up Interface (FMI) is used to exchange simulation models as Functional Mock‑up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation‑based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error‑prone process. To address this challenge, we propose an LLM‑powered multi‑agent workflow for specification‑based metamorphic testing of FMU‑based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given‑When‑Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.

PaperID: 2636, https://arxiv.org/pdf/2605.25091.pdf

Authors: Chengwei Li, Junlin Liu, Yang Gao

Title: Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

Abstract:
As modern air combat evolves toward beyond‑visual‑range (BVR) multi‑aircraft cooperative engagements, autonomous decision‑making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high‑dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi‑agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary‑enhanced Multi‑agent Proximal Policy Optimization (ACE‑MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary‑augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high‑value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi‑aircraft cooperative air combat scenarios.

PaperID: 2637, https://arxiv.org/pdf/2605.25030.pdf

Authors: Magnus Samuelsen, Wilmer Nyström, Somnath Mazumdar, Mansoor Hussain, Mikkel Strange

Title: MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration

Abstract:
Retrieval‑augmented generation (RAG) systems offer a promising approach to reduce hallucinations and improve answer accuracy in large language models (LLMs), a requirement for reliable, financial analysis where answers must be grounded in verifiable evidence from filings rather than generated from model priors. However, designing RAG systems that extract meaningful insights from mixed financial documents and integrate into analyst workflows remains challenging. This paper introduces MimirRAG (Metadata‑Integrated Multi‑Agent Information Retrieval), a multi‑agent RAG system developed iteratively to address these challenges. MimirRAG features a modular pipeline encompassing structure‑preserving parsing of PDF filings, table‑aware chunking, metadata extraction, agent‑based retrieval with query planning and hybrid search, validation, and context‑aware generation with numerical reasoning support. Our ablation study identifies three key technical enablers for effective financial RAG: metadata integration, table‑aware chunking, and an agentic workflow. MimirRAG was evaluated quantitatively using FinanceBench and qualitatively through expert validation with four financial analysts. The system achieved 89.3% accuracy on FinanceBench, outperforming the original benchmark baselines. Expert feedback highlighted that successful deployment also requires calibrated trust, comprehensive data integration, and user personalization. We conclude that combining multi‑agent RAG architecture with human‑centric design principles can improve the extraction of meaningful insights in financial analysis.

PaperID: 2638, https://arxiv.org/pdf/2605.25025.pdf

Authors: Josef Berman, Oren Gal

Title: Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Abstract:
Coordinating micro‑robotic swarms in physiologically realistic, time‑dependent fluid environments remains an unsolved challenge for biomedical and environmental applications. We present a hybrid Computational Fluid Dynamics ‑ Multi‑Objective Multi‑Agent Reinforcement Learning framework that directly couples a high‑fidelity incompressible Navier‑Stokes solver with decentralized proximal policy optimization to learn physically consistent swarm control strategies in oscillatory flow. Sixteen magnetically actuated micro‑robots navigate a pulsatile arterial waveform, simultaneously optimizing upstream progression, energy conservation, and motion smoothness, reconciled using PCGrad surgery. Without PCGrad, energy efficiency and smoothness rewards collapse to near zero within 10,000 training steps while progress exhibits persistent large‑amplitude oscillations, confirming that gradient conflict resolution is a structural requirement rather than an optional refinement in this domain. The converged policy achieves a progress reward of 6.5‑7.0, a sustained energy efficiency of 0.63‑0.65, and near‑maximum smoothness (0.97‑0.99), representing improvements over brute‑force baselines on the primary objective while both baselines yield negative energy efficiency throughout. Training reveals three emergent behavioral phases: a collective two‑layer hydrodynamic throttling formation that suppresses peak channel velocities during forward flow, a cycle‑synchronized ratchet mechanism that exploits flow reversals for upstream repositioning, and an individualized final approach as agents near the success boundary. These results establish that time‑dependent fluid‑agent interactions can be captured directly within multi‑objective reinforcement learning loops, offering a physically grounded paradigm for micro‑swarm control in biomedical navigation, environmental monitoring, and industrial microfluidics.

PaperID: 2639, https://arxiv.org/pdf/2605.24992.pdf

Authors: Changling Li, Ying Li

Title: Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

Abstract:
Multi‑agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities for its ability of learning through interaction. With the recent development of drone networks, researchers have also applied MARL to address the trajectory planning problems. However, the dynamic environment and the limited battery capacity are still challenging for using MARL to achieve efficient collaborative task execution. In this paper, we propose an energy‑aware MARL model as an attempt to tackle these challenges, leveraging Deep Q‑Networks (DQN) with \emphindividual reward functions driven by the task execution progress and the remaining battery of drones. We conduct a set of simulation studies for the proposed mode and compare it with the shared reward MARL~\citeLi2022MARL to explore the impact of credit assignment in MARL. The results indicate that our proposed model can achieve at least 80% success rate regardless of the task locations and lengths. Similar to the shared reward mode, the individual reward mode can achieve a better success rate when the task density is high, and it can hit nearly a 100% success rate when task density gets close to 40%. The true advantage of our proposed model with individual reward is revealed when scaling up the environment. The comparison to the shared reward MARL shows that the our proposed model is more robust towards the change of the environment size and agent numbers. It can achieve higher success rate with fewer steps due to the clarity of the goal which improves energy efficiency even better.

PaperID: 2640, https://arxiv.org/pdf/2605.24972.pdf

Authors: Indulekha K. P., T. G. Venkatesh

Title: Integrated Sensing, Communication, and Computing for NR-V2X: A Cross-Layer Resource Allocation Framework Using Multi-Agent Reinforcement Learning

Abstract:
Integrated sensing, communication, and computation (ISCC) is emerging as a unified design paradigm for future vehicular networks that require joint environment perception, safety‑critical information exchange, and latency‑sensitive task processing. In New Radio Vehicle‑to‑Everything (NR‑V2X) Mode 2, autonomous resource selection is performed through sensing‑based semi‑persistent scheduling (SB‑SPS), which is effective for distributed communication resource reservation but does not explicitly consider sensing‑resource demand, task‑induced computation workload, and the additional latency introduced by mobile edge computing (MEC) offloading. This paper develops multi‑agent proximal policy optimization‑based SB‑SPS (MAPPO‑SPS), an ISCC‑aware cross‑layer scheduler that jointly adapts SB‑SPS reservation, radio‑resource partitioning, and overflow‑driven computation‑offloading decisions at control epochs. The scheduling problem is formulated as a cooperative partially observable Markov game and solved using MAPPO with centralized training and decentralized execution (CTDE). Simulation results show that MAPPO‑SPS achieves a balanced tradeoff among CRLB‑based sensing accuracy, packet reception ratio (PRR), effective throughput, energy consumption, and end‑to‑end delay.

PaperID: 2641, https://arxiv.org/pdf/2605.24953.pdf

Authors: Chengrui Li, Rujing Li, Yitong Bai, Rui Li

Title: Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance

Abstract:
Industrial asset operations and maintenance question answering is inherently multi‑turn, iterative, and highly dependent on external tool invocation. However, the conventional plan‑execute single‑agent architecture exhibits clear limitations in maintaining cross‑turn context, and reusing intermediate results. In this paper, we present a multi‑turn dialog system designed for industrial scenarios based on a supervisor‑specialist multi‑agent architecture. To alleviate tool invocation bottlenecks, the system incorporates structured artifact reuse, dynamic replanning, and parallel tool execution. Evaluation results show that our system achieves better response quality compared with the baseline, with planning effectiveness increasing by 54.5% and task completion improving by 37.8%. System profiling further shows that cross‑turn artifact reuse effectively reduces redundant tool invocation, decreasing the tool‑time share from 47.3% to 26.3% and making turns 2‑5 approximately 4.2x faster than the first turn.

PaperID: 2642, https://arxiv.org/pdf/2605.24941.pdf

Authors: Mahavir Dabas, Jihyun Jeong, Ming Jin, Ruoxi Jia

Title: Memory-Induced Tool-Drift in LLM Agents

Abstract:
Modern LLM agents combine long‑term memory for personalization with tool‑calling interfaces for taking actions in the world ‑‑ a combination underpinning contemporary production systems. We study a previously unexamined failure of this combination: when personality‑driven biases stored in memory (cost‑consciousness, impatience, risk tolerance, etc.) silently affect tool calls in contexts where they are not applicable. We call this memory‑induced tool‑drift and operationalize it through MEMDRIFT, a benchmark of 105 scenarios spanning five bias dimensions and seven professional domains, generated through an automated adversarial pipeline. Across seven frontier models ‑‑ including those with extended reasoning ‑‑ biased memories raise deflection scores (a judge‑scored measure of parameter deviation from unbiased baselines) by up to +3.6 points on a 1‑‑5 scale. Tool‑drift persists when memory management is handled by three production memory architectures. The phenomenon affects real‑world tools: scanning 6,062 tools across 288 verified MCP servers, we flag 608 with susceptible parameters and confirm tool‑drift on a validated subset. Mechanistically, biased memories act as implicit steering vectors, pushing activations along the same latent directions as explicit behavioral instructions. They also redistribute attention from task‑relevant context toward memory entries with surface‑level keyword overlap to the target parameter. Standard defenses ‑‑ prompt‑based relevance instructions and memory filters ‑‑ reduce drift but do not eliminate it. As agents take increasingly consequential actions on a user's behalf, memory‑induced tool‑drift represents a systematic vulnerability that current safeguards do not address, motivating dedicated defenses at the intersection of memory management and tool‑call generation.

PaperID: 2643, https://arxiv.org/pdf/2605.24881.pdf

Authors: Miroslav David, Karla Stepanova, Robert Babuska

Title: Learning Transferable Motor Skills for Geometry-Aware Robotic Surface Tasks

Abstract:
Robotic surface‑interaction tasks, such as spray painting or welding, require both accurate geometric planning and precise motion execution. While modern motion planners generate valid geometric paths, they often lack the expert motor patterns observed in human operators. Conversely, learning from demonstration often tightly couples task execution to the specific training geometry, limiting transferability. We propose a modular framework that decouples geometric motion planning from execution‑level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate our approach through dynamic simulation on L‑shaped and window‑shaped objects, demonstrating on simulated data that the model successfully extracts velocity and orientation rules across both topologies.

PaperID: 2644, https://arxiv.org/pdf/2605.24823.pdf

Authors: Yilei Zhang

Title: Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Abstract:
Manufacturing has passed through four widely recognized paradigms ‑ mechanization, electrification, programmable automation, and Smart Manufacturing ‑ each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine‑cognitive layers below it, is what foundation‑model‑based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation‑model agents that can interpret open‑ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi‑agent manufacturing systems, which were autonomous only within closed protocol spaces.

PaperID: 2645, https://arxiv.org/pdf/2605.24812.pdf

Authors: Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, Sumon Biswas

Title: CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Abstract:
Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain‑of‑Thought (CoT) and multi‑agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi‑agent code generation, we propose Collaborative Reinforcement Code (CoRe‑Code), a framework for role specialized LLM agents that enhances inter‑agent coordination to generate more accurate and efficient code. CoRe‑Code adopts a simple Planner‑Coder paradigm, where the Planner produces high‑level plans and the Coder executes them to generate code. We further introduce a collaboration‑aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe‑Code outperforms a wide range of existing RL‑based and multi‑agent methods. In addition, we demonstrate that CoRe‑Code can generalize to other multi‑agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe‑Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy, while also achieving higher efficiency in terms of execution time and memory usage, demonstrating the effectiveness and practicality of CoRe‑Code.

PaperID: 2646, https://arxiv.org/pdf/2605.24764.pdf

Authors: Andrea Morandi

Title: Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

Abstract:
[Abridged] ‑ Spectral Retrieval is a plug‑in re‑ranking stage that interpolates between per‑token MaxSim and mean‑pool retrieval through a multi‑scale sinc convolution over token embeddings. In standard dense retrieval each document is one mean‑pooled vector; when relevance localises into a short subspan, the signal averages into noise. Spectral Retrieval reuses per‑token embeddings from a late‑interaction index and convolves them with a normalised sinc kernel at multiple scales. At L=1 the kernel acts as the identity, recovering per‑token MaxSim; as L grows it approaches a uniform filter, recovering mean pooling. The maximum cosine over positions and scales yields a score provably no less informative than either endpoint. On a controlled synthetic benchmark with 1,000 documents and planted single‑position spikes, mean‑pool retrieval sits at chance (Recall@10 ~ 0.02) regardless of spike strength, while Spectral Retrieval reaches Recall@10 = 1.0 once the planted cosine exceeds the corpus‑level token noise floor. On LIMIT‑small with a frozen all‑mpnet‑base‑v2 encoder, Spectral Retrieval lifts Recall@10 from 0.33 to 0.90, MRR from 0.22 to 0.79, and strict Success@10 from 0.12 to 0.84, without retraining. The method fits naturally into multi‑agent LLM systems, where each agent benefits from a tighter, role‑specific retrieval window over a shared corpus.

PaperID: 2647, https://arxiv.org/pdf/2605.24755.pdf

Authors: Feng Chen, Justin Tauscher, Changye Li, Meliha Yetisgen, Alex Cohen, Adam Kuczynski, Angelina Pei-Tzu Tsai, Benjamin Buck, Dror Ben-Zeev, Trevor Cohen

Title: Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Abstract:
Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi‑agent LLM pipeline for the fine‑grained, multi‑label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi‑agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.

PaperID: 2648, https://arxiv.org/pdf/2605.24699.pdf

Authors: Roberto Cruz, David Rey-Blanco

Title: MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Abstract:
Most reported gains on agentic‑LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine‑level design. We present MDIA, a Multi‑agent Diagnostic Intelligence Agent implemented as a 7‑node specialty‑routed clinical reasoning graph, on the full HealthBench Professional benchmark (n = 525), on a non‑fine‑tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT‑5.4‑2026‑03‑05, which is +3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that performance lift is attributable to system architecture: specialty routing, multi‑turn context preservation, drug‑state safety gating, site‑filtered search, length‑aware synthesis, and engine‑level reliability. These findings support the view that agentic clinical benchmark performance is shaped both by the underlying foundation model and the orchestration architecture. Nevertheless, we also noticed notable differences when using other models as a grader; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

PaperID: 2649, https://arxiv.org/pdf/2605.24660.pdf

Authors: Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell

Title: How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Abstract:
Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits‑over‑Random (BoR), a chance‑corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool‑selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th‑20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium‑difficulty queries where the correct tool is present but not ranked first.

PaperID: 2650, https://arxiv.org/pdf/2605.24659.pdf

Authors: Zixuan Chen, Jiaxiang Chen, Li Luo, Ke Xu, Xiaoxiang Huang, Tanfeng Sun, Xinghao Jiang

Title: IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

Abstract:
LLM‑based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent‑specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce \oursys, a feedback‑guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule‑based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM‑based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self‑evolve. On AgentDojo and InjectAgent, \oursys substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production‑grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention‑mediated threshold mechanism in mid‑to‑late layers; three causal interventions validate this finding and point to concrete defense directions.

PaperID: 2651, https://arxiv.org/pdf/2605.24600.pdf

Authors: Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao

Title: Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Abstract:
Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM‑assisted QDA, we propose Agent‑as‑Peer‑Debriefer, a multi‑agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub‑themes, and themes, along with self‑explanations and reflection memos. It then shares these outputs with three Peer‑Debriefing Agents, each applying a distinct analytical perspective (Theory‑Driven, Data‑Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human‑annotated codes. Across all settings, perspective‑based, peer‑debriefing refinement aligns more closely with human codes than a single‑LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade‑offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM‑assisted QDA.

PaperID: 2652, https://arxiv.org/pdf/2605.24598.pdf

Authors: Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen

Title: Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Abstract:
Large language model (LLM) agents excel at solving complex long‑horizon tasks through autonomous interaction with environments. However, their real‑world deployment faces a fundamental device‑‑cloud dilemma: on‑device models are efficient but often brittle, while cloud models are stronger but costly in computation. State‑of‑the‑art LLM device‑‑cloud routers usually make coarse task‑level decisions, which cannot adapt to the changing difficulty of multi‑step agent interactions. To address this issue, we present Hera, a step‑level device‑‑cloud LLM agent coordinator for long‑horizon tasks achieving a strong performance‑‑cost Pareto frontier. Hera adopts a novel two‑stage training paradigm: (1) imitation learning for cold‑start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step‑level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost‑aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud‑only success rate with cloud use in only 46.3% of steps.

PaperID: 2653, https://arxiv.org/pdf/2605.24516.pdf

Authors: Min Tang, Fanqi Kong, Linyuan Lü, Xue Feng

Title: Adaptive Punishment for Cooperation in Mixed-Motive Games

Abstract:
Mixed‑motive scenarios are ubiquitous in real‑world multi‑agent interactions, where self‑interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long‑term gains and collective welfare. Peer punishment can deter defection, but as costly second‑order altruism, its persistent imposition may undermine the punisher's interests. Existing approaches often struggle to effectively implement punishment to promote cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promotes cooperation. To accurately assess defection and its severity, we use a defection awareness module, whose learning is guided by game reward. Theoretical analysis and empirical results show APC performs effectively in iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.

PaperID: 2654, https://arxiv.org/pdf/2605.24490.pdf

Authors: Yunhua Pei, Zerui Ge, Jin Zheng, John Cartlidge

Title: Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Abstract:
Multi‑agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold‑start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi‑agent decision system that computes exact Shapley credits across all single, pairwise, and Grand‑coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition‑based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime‑dependent multipliers to adjust agent authority, and records each rebalance through a five‑layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley‑weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.

PaperID: 2655, https://arxiv.org/pdf/2605.24479.pdf

Authors: Jiarong Deng, Liu Chang, Quanshun Yang

Title: Single-Chord Augmentation of Weighted Cycles for Algebraic Connectivity and Network Coherence

Abstract:
Ring‑like communication graphs appear in UAV formations, cyclic patrols, perimeter monitoring, and other multi‑agent tasks in which agents exchange information mainly with neighboring vehicles along a closed route. When measurement and actuation noise are persistent, a useful augmentation should improve both the convergence rate of consensus and the steady‑state disagreement level. This paper studies the addition of a single weighted chord to a connected weighted cycle. The central observation is that a chord is not just a generic rank‑one edge update: it splits the cycle into two complementary resistance arcs, and this resistance split governs both the algebraic‑connectivity gain and the Kirchhoff‑index reduction. We first derive exact chord‑induced effective‑resistance and Kirchhoff‑index update formulas, giving a closed‑form coherence objective. We then prove that, under bounded conductances and small resistance discrepancy, near‑antipodal resistance‑balanced chords are near‑optimal for algebraic‑connectivity improvement; an i.i.d. bounded‑conductance model yields the same conclusion with high probability. Finally, because the best convergence‑rate chord and the best coherence chord need not coincide, we formulate the design as a finite Pareto problem and introduce RBAPS and AW‑RBAPS, two resistance‑balanced screening rules that retain only linear or near‑linear candidate sets. Numerical experiments show that AW‑RBAPS remains effective beyond the formal moderate‑heterogeneity regime and approximates the exhaustive Pareto front with mean hypervolume ratio 0.9987 while evaluating about 10.1% of admissible chords.

PaperID: 2656, https://arxiv.org/pdf/2605.24426.pdf

Authors: Yihao Hu, Zhihao Wen, Xiujin Liu, Pan Wang, Xin Zhang, Wei Wu

Title: SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Abstract:
Large Language Model (LLM) agents are increasingly improved through interaction, yet most self‑evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as \emphAgent‑Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed‑loop co‑evolution framework for interactive tool‑use agents. SEAL collects on‑policy trajectories under executable verification, diagnoses failed rollouts into turn‑level failure labels, and uses these diagnoses as a shared signal for both environment‑side adaptation and model‑side policy optimization. The environment evolves its training‑time learning interface by exposing clearer tool affordance cues, constraint information, and recovery‑oriented feedback, while the policy is updated with diagnosis‑guided advantage reweighting. Extensive experiments across in‑distribution and out‑of‑distribution multi‑turn tool‑use evaluations show that SEAL improves low‑resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average‑point gains across three backbones and exhibits positive out‑of‑distribution transfer. These results demonstrate the value of jointly adapting the learner and its training‑time learning substrate for robust self‑improving LLM agents.

PaperID: 2657, https://arxiv.org/pdf/2605.24423.pdf

Authors: Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng

Title: Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

Abstract:
In‑Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad‑Hoc Teamwork (AHT)‑where coordination with unknown partners is required‑remains unexplored. To rigorously evaluate this, we introduce a large‑scale benchmark ICRL4AHT, built upon a high‑throughput JAX implementation of Overcooked‑V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train‑test shifts, and provides a reproducible end‑to‑end pipeline for teammate generation, learning‑history collection, dataset construction, and online multi‑episode evaluation. We evaluate representative history‑conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision‑Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single‑agent domains, these baselines fail to exhibit robust test‑time adaptation in multi‑agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in‑context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the OvercookedV2 AHT protocol, establishing our benchmark as a critical testbed for next‑generation coordination algorithms.

PaperID: 2658, https://arxiv.org/pdf/2605.24352.pdf

Authors: Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo

Title: Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Abstract:
Multi‑agent collaboration, especially in human‑AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent‑centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt and coordinate effectively with novel partners. We introduce Partner‑Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked‑AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human‑human gameplay trajectories. PASD consistently outperforms existing population‑based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human‑AI collaboration.

PaperID: 2659, https://arxiv.org/pdf/2605.24332.pdf

Authors: Annie Yuan

Title: Tacit Signal Infrastructure: Towards AI Systems that Model Expert Sensing Over Time

Abstract:
Current generative AI systems are increasingly effective at processing explicit knowledge, including retrieving information, summarising documents, generating explanations, and supporting codified workflows. However, high‑level expertise also depends on tacit sensing: perceiving weak signals, recognising emerging tensions, detecting coherence degradation, and anticipating instability before formal indicators appear. Existing AI education, AI literacy, and human‑AI collaboration frameworks remain centred on prompting, task execution, and productivity support and are poorly equipped to address this tacit layer of expert cognition. This vision paper argues that next‑generation AI systems should move beyond explicit knowledge processing toward the longitudinal modelling of expert tacit sensing. It introduces Tacit Signal Infrastructure as a layer for capturing, structuring, modelling, interpreting, and validating expert tacit signals over time. It further defines Long‑term Cognitive Operations as the practices required to maintain and govern such systems, including memory curation, semantic organisation, tacit signal modelling, reasoning calibration, and cognitive governance. Building on this framing, the paper proposes the Cognitive Operations Manager as a prototype AI‑native professional role for coordinating tacit signal modelling, semantic modelling, AI system calibration, expert validation, and ethical governance. It also introduces the Cognitive Operations Research and Training Framework (CORTF) to support research, education, and workforce development. The paper contributes a conceptual foundation for designing AI systems that model expert sensing over time, positioning cognition as an infrastructural, operational, and professional domain in persistent human‑AI systems.

PaperID: 2660, https://arxiv.org/pdf/2605.24309.pdf

Authors: Peiran Wang, Ying Li, Yuan Tian

Title: Reframing LLM Agent Security as an Agent-Human Interaction Problem

Abstract:
We argue that LLM agent security is fundamentally an agent‑human interaction (AHI) problem, not a purely algorithmic one. To substantiate this position, we conduct a systematic analysis of 59 academic papers, 21 production agent systems, and 26 security plugins as of April 2026. Our analysis reveals a striking pattern: the three widely deployed human‑centric security mechanisms (policy specification, runtime approval, and scope configuration) dominate industry practice, each adopted by at least 14 of 21 systems (14, 15, and 16, respectively), while the categories most heavily studied in academia (intent anchoring and trust labeling) see zero production deployment. Yet current human participation mechanisms are far from satisfactory: they suffer from a fundamental trade‑off between cognitive burden and security guarantees, leaving users caught between approval fatigue and uncontrolled agent autonomy. We make three contributions. First, through a systematic comparison of LLM‑based and human‑based intent alignment, we argue that human participation in agent security decisions is indispensable given current capabilities. Second, we quantify a pronounced industry‑academia mismatch: the security mechanisms that practitioners actually deploy receive scant research attention, while the approaches that researchers favor remain undeployed. Third, we propose a three‑direction research agenda and call for AHI security to be recognized as a first‑class research citizen, one that demands its own design principles, evaluation methods, and theoretical foundations.

PaperID: 2661, https://arxiv.org/pdf/2605.24249.pdf

Authors: Anisa Halimi, Liubov Nedoshivina, Kieran Fraser, Stefano Braghin

Title: PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

Abstract:
The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative, but its adoption is limited by substantial heterogeneity across institutional datasets, making harmonization a critical but frequently overlooked prerequisite for multi‑site analytics. We introduce PrivFusion, a privacy‑preserving multi‑agent framework that automates the harmonization of structured datasets prior to federated training. PrivFusion uses agents to analyze local data, cluster semantically similar features across sites, and provide iterative transformation recommendations until alignment is achieved. Evaluation across four heterogeneous COVID‑19 datasets demonstrates that PrivFusion effectively and efficiently harmonizes multi‑site data while substantially reducing manual effort.

PaperID: 2662, https://arxiv.org/pdf/2605.24245.pdf

Authors: Tingwei Zhang, Harold Triedman, Vitaly Shmatikov

Title: Deep-Research Agents Can Be Poisoned via User-Generated Content

Abstract:
Deep‑research agents, i.e., systems that rely on multi‑agent pipelines to iteratively retrieve, synthesize, and cite Web content in order to produce structured reports, are rapidly replacing traditional search for both routine and complex information needs. These agents issue many related queries during a single research session. We show that for many common search topics, they repeatedly retrieve the same user‑generated content (UGC) pages from platforms such as Reddit and Wikipedia. Next, we argue that this retrieval overlap creates a concentrated attack surface: an adversary who appends a short, crafted text to a single, frequently retrieved UGC page can cause the agent to cite attacker‑chosen content and promote attacker‑chosen entities across many related queries. We evaluate this attack on three representative deep‑research systems (STORM, Co‑STORM, and OmniThink) across multiple query clusters. We also study defenses at different stages of the pipeline, including source‑level filtering and output‑based detection. Our findings highlight a fundamental vulnerability in how deep‑research agents retrieve and integrate web content.

PaperID: 2663, https://arxiv.org/pdf/2605.24220.pdf

Authors: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong

Title: Polar: Agentic RL on Any Harness at Scale

Abstract:
Reinforcement learning for language agents increasingly depends on custom harnesses that manage long‑running context, multi‑turn tool use and multi‑agent orchestration. However, porting these harnesses into RL environment interfaces remains difficult and often loses important training signals. We bridge this gap with polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. Polar treats the agent harness as a black box: it proxies LLM API calls, records token‑level model interactions, and reconstructs token‑faithful trajectories for training. Each rollout node efficiently manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous service endpoints that can be consumed by independent trainers at scale. This decoupled design makes Polar agnostic to agent harnesses, training infrastructure, and RL algorithms while improving compute utilization for long‑running agent workloads. We validate polar by training agents on software‑engineering tasks with popular coding harnesses. Using simple GRPO, polar improves Qwen3.5‑4B by 22.6, 4.8, 0.6 and 6.2 points on SWE‑Bench Verified with the Codex, Claude Code, Qwen Code and Pi harnesses, respectively. We further demonstrate Polar for offline data generation over custom harnesses and ablate trajectory reconstruction strategies. Polar rewrites its preceding work, Prorl Agent, and has been registered as one of NeMo Gym environments.

PaperID: 2664, https://arxiv.org/pdf/2605.24219.pdf

Authors: Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin, Dhaval Patel

Title: Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought‑Action‑Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory‑level hallucinations in multi‑agent industrial workflows. Trajel introduces a five‑type hallucination taxonomy (factual, referential, logical, procedural, and scope‑based) over expert‑annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long‑context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory‑aware detection significantly outperforms standard post‑hoc verification, making taxonomy‑grounded evaluation necessary for safer agentic deployment.

PaperID: 2665, https://arxiv.org/pdf/2605.24216.pdf

Authors: Nesreen K. Ahmed, Nima Nafisi

Title: Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Abstract:
Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context‑dependent, and long‑horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent‑ToM, a learning‑to‑monitor framework grounded in Theory‑of‑Mind (ToM) reasoning for security analysis of autonomous agents. Agent‑ToM performs structured full‑trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task‑consistent behavioral baselines. At inference time, it employs a Reason‑Verify‑Refine pipeline to construct and validate monitoring decisions. At training time, Agent‑ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief‑ and intent‑conditioned constraints across episodes. We evaluate Agent‑ToM on adversarial agent monitoring benchmarks (SHADE‑Arena and CUA‑SHADE‑Arena). Agent‑ToM achieves strong precision‑recall balance and outperforms state‑of‑the‑art monitoring baselines, including ensemble methods, while using a coherent two‑call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

PaperID: 2666, https://arxiv.org/pdf/2605.24202.pdf

Authors: Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

Title: When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Abstract:
Multi‑agent LLM workflows route inference through specialized roles to lift end‑task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end‑to‑end RL training of multi‑agent LLM workflows improves over their base models, comparing Shared‑Policy training, where all roles update one policy, with Isolated‑Policy training, where each role has its own parameters. Our experimental matrix spans Eval‑Opt, Voting, and Orch‑Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi‑agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated‑Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared‑Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role‑level gradient dynamics induced by workflow topology and policy routing: under Isolated‑Policy, parallel same‑role agents on shared prompts amplify per‑role gradients and drive terminal degradation in Voting and Orch‑Workers workflows; under Shared‑Policy, asymmetric per‑step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow‑ and task‑conditional tradeoffs.

PaperID: 2667, https://arxiv.org/pdf/2605.24197.pdf

Authors: Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang

Title: A Sober Look at Agentic Misalignment in Automated Workflows

Abstract:
We study a class of emergent misalignment in multi‑agent systems (MAS), with a focus on automated workflows, which we refer to agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context‑specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self‑reflection (internal evidence from the model) and weak‑to‑strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence‑based alignment can effectively improve agent collaboration and leads to reliable multi‑agent systems built on automated workflows.

PaperID: 2668, https://arxiv.org/pdf/2605.24138.pdf

Authors: Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, Miroslaw Staron

Title: Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Abstract:
Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role‑oriented collaboration remains largely underexplored. Understanding how multiple LLM‑based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer across 12 model combinations from 7 open‑source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek‑R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi‑agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek‑R1:DeepSeek‑R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite of diverging from the correct solution. The other pairs deviated from the task, never to converge to a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions essential for future autonomous SE.

PaperID: 2669, https://arxiv.org/pdf/2605.24134.pdf

Authors: Fouad Bousetouane

Title: ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents

Abstract:
AI agents are entering high‑risk production settings, where they use tools, retain context, follow policies, handle private data, and interact with users over multiple turns. Yet many evaluation methods still judge isolated outputs or static tasks, missing failures that emerge through trajectory, pressure, and adversarial interaction. We introduce ProofAgent Harness, open infrastructure for scalable, auditable, and adversarial AI agent evaluation. The harness provides evaluation infrastructure around an agent: it curates evaluation intelligence, runs adversarial multi‑turn trials, captures behavioral traces, applies post‑hoc multi‑juror scoring, resolves disagreement, and produces evidence‑linked reports. Its open design allows developers and researchers to extend domains, traps, metrics, juror personas, scoring rules, and reporting formats. At its core is Adversarial Multi‑Juror Scoring with Turn‑Level Audit, which evaluates completed agent behavior under pressure using calibrated juror personas, consensus checks, and turn‑level evidence. Experiments across customer support, medical triage, privacy and security, and code generation agents show that strong agents fail selectively through weak metrics, fragile turns, unsafe reframing, and manipulation paths. We also find that a small quantized local Harness LLM can challenge production agents powered by best‑in‑class large LLMs, suggesting that evaluation capability emerges from the full harness pipeline rather than model scale alone. ProofAgent Harness turns AI agent evaluation from a static score into scalable adversarial evaluation infrastructure: repeatable, evidence‑backed, extensible, and actionable before deployment.

PaperID: 2670, https://arxiv.org/pdf/2605.24132.pdf

Authors: Thales C. Silva, M. Ani Hsieh

Title: Local Input-to-State Stability for Consensus in the Presence of Intermittent Communication and Input Saturation

Abstract:
This paper addresses the problem of reaching consensus under input saturation and intermittent communication, which can hinder the convergence of the system. We propose a method that translates the consensus into an equivalent stability problem. Then, we compute bounded sets that enclose the initial conditions and the evolution of trajectories leading to local input‑to‑state stability for systems interconnected over directed intermittent topologies. Our contributions include sufficient conditions for stability and stabilization of multi‑agent systems under intermittent interactions and saturating inputs, with the ability to evaluate disturbance tolerance and rejection based on the regions that enclose the system's trajectories. We define disturbance rejection in terms of the \mathscrL_2 gain, and formulate stability and controller design conditions as convex optimization problems. Our method enable the maximization of regions that ensure local input‑to‑state stability, we provide numerical examples highlighting the trade‑offs between mean frequency of intermittent interactions, disturbance energy, and convergence region size.

PaperID: 2671, https://arxiv.org/pdf/2605.24117.pdf

Authors: Yingtie Lei, Zhongwei Wan, Jiankun Zhang, Samiul Alam, Zixuan Zhong, Peizhou Huang, Xin Wang, Jingxuan Zhang, Donghao Zhou, Yunta Hsieh, Zhihao Dou, Hui Shen, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang

Title: SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Abstract:
Large language model (LLM) agents accumulate rich episodic trajectories while solving real‑world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real‑world agent environments, organized into role‑conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self‑generated and curated‑start skill evolution against no‑skill and raw‑trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill‑based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw‑trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier‑3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode‑specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one‑off experience becomes durable procedural knowledge rather than task‑local memory.

PaperID: 2672, https://arxiv.org/pdf/2605.24069.pdf

Authors: Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin, Biyu Zhou, Wenjie Xiao, Wantao Liu

Title: When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Abstract:
The rise of tool‑using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open‑domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision‑making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP‑TDP Security Benchmark. This high‑fidelity sandbox environment comprises 32 realistic, real‑world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT‑4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high‑risk scenarios. Furthermore, our findings demonstrate that common prompt‑guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self‑Correction," where an agent autonomously detects and reverts its own malicious actions post‑execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

PaperID: 2673, https://arxiv.org/pdf/2605.24052.pdf

Authors: Shugang Hao, Lingjie Duan

Title: Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

Abstract:
To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)‑generated content (e.g., AI‑generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM‑based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret \mathcalO(T) over T time slots. In this paper, we study truthful online preference aggregation for LLM fine‑tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi‑agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret \mathcalO(\sqrtT) over T time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret \mathcalO(\sqrtT). Experiments on LLM fine‑tuning with real‑world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.

PaperID: 2674, https://arxiv.org/pdf/2605.24050.pdf

Authors: Hongwen Song, Song, Wei

Title: More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Abstract:
Skill libraries allow LLM agents to load task‑specific instructions on demand, letting non‑expert users solve domain‑specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow ‑‑ by up to 21% when scaling from a small set of helpful skills to a 202‑skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known‑helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on the skill(s) invocation ‑‑ which skills the agent selects during a trajectory ‑‑ into two effects: \emphskill shadowing, where the agent selects wrong skills more often as the library expands, and \emphcontext overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize their magnitudes of impacts to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the \emphskill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the \emphcontext overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that the skill selection failure, not the enlarged context, is the primary bottleneck when expanding the skill libraries.

PaperID: 2675, https://arxiv.org/pdf/2605.24018.pdf

Authors: Xiaoyu Xiong, Yuqi Ren, Deyi Xiong

Title: EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

Abstract:
Large language models (LLMs), have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi‑role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi‑agent scientific collaboration framework, which integrates bio‑inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role‑based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real‑world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM‑based structured peer‑review and comparative ranking evaluations, achieving the highest overall peer‑review score (ICLR 4.90) and top ranking (Top‑10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

PaperID: 2676, https://arxiv.org/pdf/2605.24005.pdf

Authors: Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King

Title: LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

Abstract:
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high‑quality process data. While self‑alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse‑Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre‑training biases. To address these, we introduce LC‑ERD (Logic‑Consistent Endogenous Reward Decomposition), a framework framing self‑alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi‑Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC‑ERD delivers a robust self‑evolution path, uncovering trade‑offs between logic consistency and accuracy while identifying high‑value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC‑ERD‑repo/LC‑ERD.

PaperID: 2677, https://arxiv.org/pdf/2605.23986.pdf

Authors: Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang

Title: MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Abstract:
Memory is a fundamental component for enabling long‑context LLM agents, supporting persistent state across interactions through a continuous serve‑and‑update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse‑grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full‑state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write‑efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse‑grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time‑ordered trees rather than flat global summaries. This design replaces full‑state rewrites with localized per‑node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long‑context memory benchmarks, LongMemEval‑S and LoCoMo. On LongMemEval‑S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state‑of‑the‑art approaches including EverMemOS.

PaperID: 2678, https://arxiv.org/pdf/2605.23950.pdf

Authors: Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy

Title: Stop Comparing LLM Agents Without Disclosing the Harness

Abstract:
This position paper argues that, for long‑horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness‑level gains to model improvements. We support this thesis along three lines. First, a control‑theoretic formalization treats the harness as the controller of a closed‑loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness‑induced variance can substantially exceed model‑induced variance, including cases of model ranking reversal. Third, we propose a harness‑aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long‑horizon agents should be treated as incomplete and potentially misleading.

PaperID: 2679, https://arxiv.org/pdf/2605.23949.pdf

Authors: Inseo Jung, Yoonseok Oh, Kyungryul Back, Jinkyu Kim, Jungbeom Lee

Title: SODE: Analyzing Social Dynamics in LLM Agents

Abstract:
As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential. While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome‑based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction‑tuned models often exhibit "passive compliance" that renders them vulnerable to exploitation, while reasoning models prioritize short‑horizon optimization, destabilizing long‑term cooperation. Notably, we demonstrate that a "long‑horizon framing" can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism‑grounded benchmark for aligning AI agents with complex human social dynamics.

PaperID: 2680, https://arxiv.org/pdf/2605.23935.pdf

Authors: Marcelo Fernandez - TraslaIA

Title: Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

Abstract:
Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re‑attempt authority reconstruction. We show that this model guarantees safety ‑‑ no action is executed without constructible authority ‑‑ and conditional liveness: execution resumes when authority‑defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.

PaperID: 2681, https://arxiv.org/pdf/2605.23930.pdf

Authors: Saad Mankarious

Title: Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

Abstract:
We introduce \emphQuantum Frog, a two‑player cooperative game built on a novel \emphquantized‑time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single‑agent policy and why, (3) how large is the cooperation gap between independent and cooperative two‑agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q‑Learning, Deep Q‑Network (\DQN), Independent \DQN~(\IDQN), and Multi‑Agent Proximal Policy Optimisation (\MAPPO\ with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized‑time mechanic makes a \emphrush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32‑‑34 percentage points of joint success rate relative to independent agents and reduces episode length from ～90 to ～6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time‑critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi‑agent learning dynamics.

PaperID: 2682, https://arxiv.org/pdf/2605.23929.pdf

Authors: Ya-Ting Yang, Quanyan Zhu

Title: Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Abstract:
Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM‑enabled agentic workflows. We introduce performance models for both LLM and non‑LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water‑filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.

PaperID: 2683, https://arxiv.org/pdf/2605.23927.pdf

Authors: Xingyu Xiao, Jiejuan Tong, Jingang Liang, Haitao Wang

Title: TEAM-SimHRA: A Team-Based Simulation Framework for Human Reliability Analysis Using Multi-Agent Large Language Models

Abstract:
Team‑level failure in nuclear control rooms arises not from isolated operator error, but from emergent interaction dynamics, delayed diagnosis, suppressed dissent, and authority‑driven error propagation, that conventional human reliability analysis methods are structurally unable to model. This study introduces TEAM‑SimHRA, a multi‑agent large language model simulation framework that reconceptualizes human reliability as an interaction‑driven emergent property of control room teams rather than a static individual attribute. Unlike existing approaches that assign fixed error probabilities to predefined tasks, TEAM‑SimHRA reproduces collective cognition, role‑conditioned authority dynamics, and real‑time communication suppression across temporally evolving accident progressions. Validated against the Three Mile Island (1979) and Chernobyl (1986) accidents, the two most extensively documented nuclear team failures , the framework achieves face‑validity pass rates of 43.5% and 52.6% respectively, reproducing near‑historical decision delay (134.8 vs. 138 min), perfect communication suppression stability, and full authority pressure cascade at historically accurate propagation depth. These results demonstrate that multi‑agent simulation can extract quantitative team‑level reliability indicators that are inaccessible to traditional methods, opening a viable path toward simulation‑based dynamic probabilistic risk assessment for safety‑critical sociotechnical systems.

PaperID: 2684, https://arxiv.org/pdf/2605.23917.pdf

Authors: Jaeha Oh, Byungchan Kim, Ju Li, Yang Jeong Park, Jin-Sung Park

Title: Multi-Persona Debate System for Automated Scientific Hypothesis Generation

Abstract:
Modern scientific discovery is bottlenecked not by data scarcity, but by the inability to synthesize fragmented knowledge into actionable hypotheses. This challenge is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously. Here, we present the Multi‑Persona Debate System (MPDS), a literature‑grounded framework for automated scientific hypothesis generation that combines literature retrieval, long‑context large language model reasoning, corpus‑driven persona induction, and structured multi‑agent debate. MPDS constructs literature snapshots of up to 500 papers, grounds agents in role‑specific evidence pools, and conducts a three‑round citation‑aware debate followed by moderator synthesis, enabling negotiation between personas while preserving evidence traceability. We evaluate MPDS using a temporally controlled protocol excluding direct access to target papers, including two held‑out battery‑materials case studies and a blinded comparison across 30 matched cases. In sodium‑ion anode and all‑solid‑state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process‑aware proposals than simpler baselines. To assess the impact of personas and debate, we introduce Integrative Hypothesis Quality scoring. In ablation studies, MPDS achieved the highest mean score among five conditions, with its largest advantage in cross‑perspective integration. A laboratory follow‑up suggests utility as a diagnostic aid for identifying practical bottlenecks in workflows. These results indicate that structured debate over literature snapshots improves hypothesis formation under coupled engineering constraints and provides a reusable workflow for text‑intensive scientific discovery.

PaperID: 2685, https://arxiv.org/pdf/2605.23887.pdf

Authors: Joydeep Chandra

Title: CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

Abstract:
Temporal knowledge‑graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over‑consume a shared differential‑privacy budget. We present CHRONOS, a three‑layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural‑ODE temporal decay to shortcut edges, providing a per‑query expected recall‑loss bound of Big‑O of Pq lambda delta t, with a monotone‑envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite‑sample error guarantees under noise. Layer three uses EXP3‑IX to achieve Big‑O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post‑processing, incurring no extra privacy cost. We provide multi‑epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise‑dominated; utility derives primarily from public index routing and adaptive scheduling driven by low‑sensitivity statistics.

PaperID: 2686, https://arxiv.org/pdf/2605.23864.pdf

Authors: Dongwei Xie, Xuhao Wang, Yujie Tang, Jie Song

Title: Harnessing Individual Motivation for Collective Efficiency: A Mechanism-Driven Distributed Optimization Method

Abstract:
In industrial scenarios involving multi‑agent collective decision‑making, centralized decision‑making may not be admissible due to restrictive access to individual local information, while the conflicts between participants' self‑interest and global performance may also impede collaborative distributed decision‑making. This paper proposes a mechanism‑driven distributed decision‑making method, wherein incentives are employed and designed to motivate participants to collaborate in a distributed fashion even though each participant's decision is driven primarily by self‑interest. Focusing on optimization problems with coupled objective functions and coupled constraints, we design a distributed optimization algorithm tailored for this class of problems and provide guarantees for its convergence. Furthermore, we design two incentive mechanisms, the shadow pricing mechanism and the Vickrey‑Clarke‑Groves mechanism, and demonstrate that participants are willing to engage in distributed collaboration under these mechanisms. The mechanism drives the execution of the distributed algorithm, and the optimal result of distributed computation guides the determination of incentives in the mechanism, both of which are interrelated to form a closed loop. Finally, numerical experiments illustrate the effectiveness of the proposed algorithm and mechanisms.

PaperID: 2687, https://arxiv.org/pdf/2605.23754.pdf

Authors: Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka, Roland Aydin, Christian Cyron

Title: LLM-driven design of physics-constrained constitutive models: two agents are better than one

Abstract:
Developing constitutive models that capture how materials deform under load traditionally requires years of specialized expertise in continuum mechanics, machine learning, and scientific programming. Large language models (LLMs) have recently been shown to lower this barrier by generating constitutive models on demand, but existing single‑agent pipelines lack systematic checks that the resulting models respect fundamental physical laws. To close this gap, we introduce the first multi‑agent LLM‑driven approach for constitutive model generation: a Creator agent proposes a model tailored to the data, while an Inspector agent critically audits each proposal against nine physical constraints and returns it for refinement whenever a violation is detected. We demonstrate this concept with constitutive artificial neural networks (CANNs) and benchmark it on brain tissue, experimental rubber, and synthetic rubber, using two different LLM backbones (Claude Opus 4.7 and Kimi K2.5). Adding the Inspector raises the share of exported models that truly satisfy all physical constraints from 91% to a perfect 100% for Opus and from 37% to 56% for Kimi, while preserving near‑baseline accuracy and remarkable generalization to unseen loading paths. In combination, the generated models are physically valid, highly accurate, and extrapolate reliably beyond the training data ‑ properties that together make them directly usable in practice. Separating generation from inspection thus turns LLM‑driven constitutive modeling into a genuinely trustworthy process. The paradigm is deliberately technique‑agnostic and scales automatically with advances in LLM capability, opening a promising path toward automated, physics‑aware model discovery.

PaperID: 2688, https://arxiv.org/pdf/2605.23723.pdf

Authors: Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, liang lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

Title: MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Abstract:
Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long‑horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post‑hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post‑hoc causal memory auditing framework for memory‑augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query‑only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory‑bank modification. Across both QA and reasoning‑agent settings, MemAudit substantially reduces attack success rates under realistic post‑hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.

PaperID: 2689, https://arxiv.org/pdf/2605.23652.pdf

Authors: Yoosung Hong

Title: One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

Abstract:
On a 300‑persona life‑simulation benchmark, pcsp achieves compositional zero‑shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic‑behavioral alignment, and 22x faster inference than an LLM‑as‑policy baseline. Life simulation games require hundreds to thousands of non‑player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer‑authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real‑time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free‑form persona descriptions. pcsp combines once‑per‑NPC persona encoding, low‑rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory‑consistency objective is load bearing: removing it collapses zero‑shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona‑conditioned behavioral divergence in multi‑agent strategic environments. We distinguish two senses of held‑out evaluation: compositional zero‑shot and vocabulary‑expansion held‑out. Finally, a UE5 deployment reproduces the in‑engine persona‑conditioning ablation at 64 agents with a low failure rate, showing that the sub‑frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real‑time, persona‑conditioned NPC control.

PaperID: 2690, https://arxiv.org/pdf/2605.23574.pdf

Authors: Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin

Title: Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

Abstract:
Long‑horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository‑artifact collection and verifier‑backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state‑tracking retrieval controller reaches 69‑78% success while eliminating duplicate submissions, and a backlog‑tracking work‑unit controller reaches 25‑50% success in settings where standard and completion‑gated controllers complete no task instances. Black‑box frontier‑agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt‑5.4) solve many 50‑artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.

PaperID: 2691, https://arxiv.org/pdf/2605.23562.pdf

Authors: Elie Abboud, Oren Gal

Title: ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Abstract:
Sparse rewards are a major bottleneck in multi‑agent reinforcement learning (MARL), where simultaneous learning induces non‑stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi‑agent setting it must preserve the strategic structure of the problem rather than merely improve short‑term optimization. We propose Automatic Reward‑shaping in Multi‑agent Systems (ARMS), a self‑supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single‑agent trajectory‑ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best‑response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best‑response set under fixed opponent policies, and consequently preserve the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi‑agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL‑specific failure mode in which limited exploration and coupled policy‑‑reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game‑theoretic equilibrium‑preservation result.

PaperID: 2692, https://arxiv.org/pdf/2605.23414.pdf

Authors: Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang

Title: When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self‑consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC‑AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC‑AW employs Information‑consistency‑based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency‑guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC‑AW improves system‑level success by an average of 9.75%.

PaperID: 2693, https://arxiv.org/pdf/2605.23356.pdf

Authors: Mirhan Urkmez, Maryam Sharifi, Shahab Heshmati-Alamdari

Title: A Distributed Framework for Data-Driven Safe Coordination in Leader-Follower Networks

Abstract:
This paper addresses connectivity preservation in leader‑follower multi‑agent systems with unknown control‑affine dynamics and local state information. We introduce the distributed data‑driven zeroing control barrier function (3D‑ZCBF) framework, which ensures the controlled invariance of safety sets by identifying derivative bounds from input‑state data without requiring explicit models of high‑dimensional agent dynamics. In this work, we derive the explicit, decoupled safety conditions necessary to maintain connectivity for leader‑leader, and follower‑follower pairings. These individual constraints, along with the leader‑follower conditions, are aggregated into explicit system‑wide conditions that formally guarantee the preservation of the entire communication network. Furthermore, we provide a quantitative analysis demonstrating how the size of the collected data set and the accuracy of the learned Jacobian bounds impact the feasibility of the safety certificates. The proposed conditions are implemented via a projection‑based controller, and simulations confirm that these explicit 3D‑ZCBF requirements effectively maintain system‑level connectivity using only local, two‑hop information.

PaperID: 2694, https://arxiv.org/pdf/2605.23330.pdf

Authors: Yutong Jin, Zelin Zhang, Zhijin Lyu, Jianbing Ni

Title: Security, Privacy, and Ethical Risks in OpenClaw

Abstract:
This paper systematically investigates the security, privacy, and ethical risks, as well as the traceability challenges of OpenClaw, a locally executable AI agent system for natural language interaction and real‑world task completion. While OpenClaw shows strong potential for personal assistance, office automation, cross‑platform task management, and information integration, it also raises serious security, privacy, and ethical concerns. By analyzing its system architecture, core functionalities, deployment model, and representative application scenarios, this paper aims to reveal the risks that may arise when such a highly privileged agent is integrated into personal and organizational digital environments. We focus in particular on the challenges associated with persistent local storage, tool invocation, cross‑context information aggregation, multi‑user interaction, and the integration of plugins and external services. We argue that these issues constitute major barriers to the trustworthy deployment and widespread adoption of this technology. Finally, we summarize the open challenges in security defenses, privacy protection, ethical governance, and traceability in agent use, and call for joint efforts from researchers, developers, deployers, and regulators to build AI agent systems that are safer, more reliable, and more trustworthy.

PaperID: 2695, https://arxiv.org/pdf/2605.23320.pdf

Authors: Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu

Title: Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

Abstract:
Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician specific tuning styles. Rule based approaches rarely generalize personalization, and end to end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human in the loop multi agent framework that coordinates modular decision components through contract driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human AI collaboration.

PaperID: 2696, https://arxiv.org/pdf/2605.23296.pdf

Authors: Musa Cim, Burak Topcu, Chita Das, Mahmut Taylan Kandemir

Title: Parallel Context Compaction for Long-Horizon LLM Agent Serving

Abstract:
Long‑horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM‑based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine‑grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long‑horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non‑reasoning models, on the HotpotQA multi‑hop QA and LoCoMo long‑context dialogue benchmarks. Parallel compaction gives the operator fine‑grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end‑to‑end wall time and improves compaction throughput over the sequential baseline.

PaperID: 2697, https://arxiv.org/pdf/2605.23273.pdf

Authors: Hyunjee Park, Hayoung Chung

Title: Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework

Abstract:
Topology optimization is a widely used design method that produces optimized material distributions for prescribed objectives and constraints through well‑established numerical algorithms. Throughout the workflow, engineers make a series of decisions ranging from setting and adjusting numerical parameters to assessing whether the converged design meets considerations beyond those explicitly included in the optimization problem, such as physical feasibility. These decisions, which draw on domain expertise, interfere with the autonomous design process. To address this difficulty, this study presents TopOptAgents, a multi‑agent system for automating not only the design process but also decision‑making during the key stages of the topology optimization process. TopOptAgents consists of six LLM‑based agents collaborating through iterative self‑refinement cycles spanning problem formulation, validation, code generation and execution, and quality assessment of the optimized structure. This process enables error correction and progressive improvement of both the optimization setup and resulting design. The framework is demonstrated on optimization problems selected to cover a range of settings that differ in their literature coverage and numerical characteristics The benefits of iterative self‑refinement are found to be particularly pronounced for problem classes where the pretrained language model has limited prior exposure, such as formulations whose literature and open‑source implementations are comparatively sparse. In such cases, the proposed framework reliably produces converged designs where a single state‑of‑the‑art LLM struggles, suggesting that self‑refinement broadens the range of topology optimization problems that LLM‑based automation can reliably address.

PaperID: 2698, https://arxiv.org/pdf/2605.23263.pdf

Authors: Lipeng Dai, Luping Xiang, Kun Yang

Title: 6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

Abstract:
Embodied agents, which couple intelligent decision‑making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software‑based agents. While 6G promises sub‑millisecond latency, ultra‑high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G‑enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept, embodiment value of embodied agents, and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human‑robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human‑robot remote interaction, comprising a human‑intent perception layer, an open radio access network (O‑RAN)‑based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end‑to‑end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O‑RAN testbed. Experimental results demonstrate millisecond‑level latency and stable closed‑loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G‑embodied agent research and industrial deployments.

PaperID: 2699, https://arxiv.org/pdf/2605.23262.pdf

Authors: Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

Title: Design and Report Benchmarks for Knowledge Work

Abstract:
The development of LLM agents has led to a growing body of work on knowledge‑work AI, including coding, research, and healthcare. However, current knowledge‑work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real‑world deployment settings. This paper contributes a three‑step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the ONET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non‑code occupational deliverable benchmark; OfficeQA Pro, a grounded document‑analysis benchmark scored by final answers; and APEX‑SWE, a software‑engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

PaperID: 2700, https://arxiv.org/pdf/2605.23218.pdf

Authors: Bang Liu, Yongfeng Gu, Jiayi Zhang, Zhaoyang Yu, Sirui Hong, Maojia Song, Xiaoqiang Wang, Mingyi Deng, Zijie Zhuang, Ronghao Wang, Mingzhe Cao, Yutong Zhu, Xingjian Li, Yifan Wu, Jianhao Ruan, Yiran Peng, Shuangrui Chen, Jinlin Wang, Yizhang Lin, Dongjie Zhang, Dekun Wu, Chen Ma, Lizi Liao, Han Yu, Jian Pei, Heng Ji, Qiang Yang, Yuyu Luo, Chenglin Wu

Title: Foundation Protocol: A Coordination Layer for Agentic Society

Abstract:
Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi‑agent work, exchange value, support an AI economy, and stay safe and accountable under real‑world oversight. This paper introduces the Foundation Protocol (FP), a graph‑first coordination layer for an emerging human‑AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi‑party organization and event‑based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first‑class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non‑negotiable, so that coordination itself can become shared infrastructure for a human‑AI society that is open, pluralistic, and governable.

PaperID: 2701, https://arxiv.org/pdf/2605.23187.pdf

Authors: Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

Title: IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Abstract:
Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human‑facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene‑grounded instance, and decide whether the goal has been reached. We study this setting as intent‑driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free‑text intent, RGB‑D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active‑navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event‑script intents (28.7 percent) and lower for physical‑state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.

PaperID: 2702, https://arxiv.org/pdf/2605.23099.pdf

Authors: Weifan Jiang, Rana Shahout, Minghao Li, Zhenting Qi, Yilun Du, Michael Mitzenmacher, Minlan Yu

Title: SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate

Abstract:
Multi‑Agent Debate (MAD) improves LLM‑agent accuracy but suffers from rapid context growth, limiting scalability in larger multi‑agent settings. Existing methods prune low‑utility communications using prior signals, such as token‑level log‑likelihoods or LLM self‑reported confidence. However, these signals become unreliable under hallucination, degrading the accuracy of MAD methods that rely on them. We propose SVR‑MAD, a Bayesian‑inspired MAD framework that treats pre‑debate signals as priors and debate outcomes as posterior‑style evidence for estimating agent correctness. SVR‑MAD uses this evidence to incrementally construct the communication graph, prioritizing agents whose answers survive peer challenges. Experiments across multiple LLMs and benchmarks show that SVR‑MAD reduces token cost by up to 61% while matching or improving accuracy relative to the most accurate competing MAD baseline.

PaperID: 2703, https://arxiv.org/pdf/2605.23027.pdf

Authors: Zexin Li, Ziliang Zhang, Hyoseung Kim, Cong Liu

Title: PIMbot: A Self-Adaptive Attack Framework for Adversarial Manipulation of Multi-Robot Reinforcement Learning

Abstract:
Recent research has demonstrated the potential of reinforcement learning in effective multi‑robot collaboration, particularly in social dilemmas where robots face a trade‑off between self‑interest and collective benefits. However, environmental factors such as miscommunication and adversarial robots can impact cooperation, making it crucial to explore how multi‑robot communication can be manipulated to achieve different outcomes. This paper presents PIMbot, a framework that manipulates outcomes via two complementary levers: (i) incentive manipulation of the reward channel and (ii) policy manipulation of an agent's own actions. An adaptive multi‑objective controller balances these levers in an online manner. Our work introduces a novel approach to manipulation in recent multi‑agent RL social dilemmas that utilize a unique reward function for incentivization. By utilizing our proposed PIMbot mechanisms, a robot is able to manipulate the social dilemma environment effectively. Comprehensive experimental results demonstrate the effectiveness of our proposed methods in the Gazebo‑simulated multi‑robot environment. Moreover, a real embedded device case study on NVIDIA Jetson Orin Nano quantifies system cost and validates PIMbot's effectiveness on realistic autonomous embedded systems scenarios beyond simulation. Together, these results position PIMbot as a rigorous stress‑test tool exposing critical vulnerabilities in multi‑robot cooperative tasks.

PaperID: 2704, https://arxiv.org/pdf/2605.22993.pdf

Authors: Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

Title: A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

Abstract:
Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi‑agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS‑2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS‑2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6% higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per‑turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI‑assisted clinical screening.

PaperID: 2705, https://arxiv.org/pdf/2605.22949.pdf

Authors: Joss Armstrong

Title: MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

Abstract:
Foundation model agents increasingly operate in multi‑agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self‑reported confidence, but recent evidence shows that foundation model confidence is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design‑time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held‑out data and degrade under distribution shift. We present MARGIN (Multi‑Agent Runtime Grading via Incremental Normalisation), an online calibration method that learns per‑agent, per‑confidence‑band calibration factors from the task stream itself, requiring no model access, no held‑out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 18 foundation models, 8 benchmarks, and over 44,000 observations, MARGIN achieves 3‑6x lower calibration error than the best design‑time baseline under distribution shift. In multi‑agent selection, raw verbalized confidence fails to beat random at pairwise resolution (43‑50%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70‑89% and closing 37‑78% of the Raw‑to‑Oracle pass@1 gap across the five code‑generation benchmarks without any oracle knowledge of which model is strongest. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non‑strategic agents, with all predictions illustrated empirically.

PaperID: 2706, https://arxiv.org/pdf/2605.22897.pdf

Authors: Mohammad R. Rezaei, Rahul G. Krishnan

Title: From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

Abstract:
A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi‑Agent Residual In‑Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base‑model fails, hypothesize missing structure from high‑residual examples provided in context, and produce explicit correction terms refined through multi‑turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch‑specific noise, we freeze formulas learned on one experimental batch of the Cell‑Free Protein dataset and apply them (with no retraining and no further LLM calls) to held‑out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.

PaperID: 2707, https://arxiv.org/pdf/2605.22863.pdf

Authors: Maximillian Rossi, Prajwal Raghunath, Eugene Wu

Title: Latent Cache Flow: Model-to-Model Communication Without Text

Abstract:
LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache‑to‑Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared‑context settings; for different contexts, LCF is 23% more accurate and 8.5x faster than text‑based communication.

PaperID: 2708, https://arxiv.org/pdf/2605.22842.pdf

Authors: Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder

Title: The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

Abstract:
Multi‑agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the \emphMisattribution Gap, where memory‑layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize \emphSemantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy‑formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval‑Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory‑Persistent Information‑Flow Control, which blocks 97% of attacks at the cross‑session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi‑agent composition across financial and Health Care domains.

PaperID: 2709, https://arxiv.org/pdf/2605.22841.pdf

Authors: Rommin Adl, Peyton Williams

Title: Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

Abstract:
What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019‑2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective‑action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical‑mass tipping point; a triadic extensive‑form game with social preferences) and test them with a multi‑agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self‑interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four‑action escalation rises from 10.7% to 28.6%). Second, Chinese‑origin models show systematically different power‑weight profiles from Western‑origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five‑round playbook through the metropole. Prompts emphasizing jus cogens and self‑determination reduce escalation back near baseline in the English‑only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action‑frequency benchmarks.

PaperID: 2710, https://arxiv.org/pdf/2605.22825.pdf

Authors: Masoud Shokrnezhad, Tarik Taleb, Yan Chen, Qize Guo

Title: KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

Abstract:
Key Value Indicators (KVIs) provide a decision oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which is typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural language service description into computed KVI estimates by orchestrating a deterministic multi agent workflow powered by Large Language Models (LLMs) that (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue and also supports intelligent estimation for KPI values that are unavailable, and (v) computes interval valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end to end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.

PaperID: 2711, https://arxiv.org/pdf/2605.22786.pdf

Authors: Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

Title: LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key‑value (KV) caches, can improve efficiency and preserve richer task‑relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent‑specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV‑based latent communication in multi‑agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation‑level transformations before cache artifacts are transmitted across agents. We formalize representation‑level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent‑specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task‑relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi‑agent benchmarks show that LCGuard consistently reduces reconstruction‑based leakage and attack success rates while maintaining competitive task performance compared to standard KV‑sharing baselines.

PaperID: 2712, https://arxiv.org/pdf/2605.22759.pdf

Authors: Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari, Samy Abdel-Ghaffar, Marius Guerard, Kara Vaillancourt, Zhihan Zhang, Jake Garrison, Levi Albuquerque, Dimitris Spathis, Hong Yu, Hamid Palangi, Xuhai "Orson" Xu, David G. T. Barrett, Joseph Breda, Jed McGiffin, Yubin Kim, Yuwei Zhang, Naghmeh Rezaei, Samuel Solomon, Karan Ahuja, Tim Althoff, Jake Sunshine, Ming-Zher Poh, Benjamin Yetton, Ari Winbush, Nicholas B. Allen, James M. Rehg, Isaac Galatzer-Levy, Yun Liu, John Hernandez, Anupam Pathak, Conor Heneghan, Yuzhe Yang, Ahmed A. Metwally, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Xin Liu, Daniel McDuff

Title: Towards a General Intelligence and Interface for Wearable Health Data

Abstract:
While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low‑level sensor data into representations capable of characterizing higher‑level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high‑quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label‑efficient few‑shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

PaperID: 2713, https://arxiv.org/pdf/2605.22748.pdf

Authors: Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

Title: Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

Abstract:
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real‑world spaces. This failure stems from the dominant single‑agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi‑agent reinforcement learning provides the essential safety scaffolding required for real‑world interaction. Using high‑speed quadrotor racing as a high‑stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league‑based self‑play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi‑agent physical interactions, including aerodynamic downwash. Our agents outperform a champion‑level human pilot in multi‑player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state‑of‑the‑art single‑agent baselines. Crucially, training with diverse artificial agents enables zero‑shot generalization to safer human interaction. These results suggest that the path to robust robotic co‑existence lies not in isolated safety constraints, but in the rigorous demands of multi‑agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

PaperID: 2714, https://arxiv.org/pdf/2605.22734.pdf

Authors: Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger

Title: ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

Abstract:
Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence‑linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID‑traceable evidence and a multi‑signal credibility score. The graph is constructed through a disease‑autonomous multi‑agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi‑model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet‑coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12‑question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47‑65% of their long‑tail failures, against 17‑29% for HPOA‑RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval‑augmented clinical systems that was previously absent.

PaperID: 2715, https://arxiv.org/pdf/2605.22732.pdf

Authors: Juergen Dietrich

Title: Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Abstract:
We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi‑agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post‑hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open‑ended, context‑aware fashion; and (3) TRUST‑Pathos scores from a three‑advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST‑Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO‑DB) using Gemini in an open‑ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM‑based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low‑level Arousal estimation. Future work will extend this approach to video‑based analysis incorporating facial expression and gaze.

PaperID: 2716, https://arxiv.org/pdf/2605.22721.pdf

Authors: Guangya Hao, Yunbo Long, Zhuokai Zhao

Title: Self-Evolving Multi-Agent Systems via Decentralized Memory

Abstract:
Self‑evolving multi‑agent systems (MAS) have emerged as a promising route to LLM agents that continually improve from experience, with persistent memory at their foundation. However, existing designs almost exclusively adopt a centralized repository shared across agents, incurring communication and coordination overhead, raising privacy concerns, and collapsing agent diversity. We propose DecentMem, a decentralized memory framework in which each agent maintains its own dual‑pool memory ‑‑ an exploitation pool of consolidated past trajectories and an exploration pool of LLM‑generated candidates for unseen contexts. The two pools are reweighted online based on stage‑wise feedback from an LLM‑as‑a‑judge. Theoretically, we prove that this design guarantees global reachability of the solution space and achieves O(\log T) cumulative regret, matching the stochastic bandit lower bound up to constants. In practice, across three MAS frameworks (AutoGen, DyLAN, AgentNet), three Qwen3 backbones (4B/8B/14B), two Gemma4 backbones (E2B/E4B) and five benchmarks spanning math, code, QA, and embodied tasks, DecentMem improves average accuracy by up to 23.8% over the strongest centralized memory baseline and by up to 52.5% over the no‑memory baseline, while reducing token usage by up to 49%.

PaperID: 2717, https://arxiv.org/pdf/2605.22664.pdf

Authors: Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong

Title: WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Abstract:
LLM agents are increasingly expected to carry out end‑to‑end workflows, producing complete artifacts from high‑level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question‑answering or single‑formula edits. To address this gap, we provide one of the first evaluations of agents on end‑to‑end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high‑level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine‑grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional‑looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional‑quality spreadsheets at the level of complexity real‑world workflows demand.

PaperID: 2718, https://arxiv.org/pdf/2605.22642.pdf

Authors: Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu

Title: Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Abstract:
Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data‑centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI‑driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general‑purpose LLMs; while this design has potentials on simple spreadsheet operations, it struggles to manage the complex, multi‑step workflows typical of real‑world applications. We introduce Spreadsheet‑RL, a reinforcement learning (RL) fine‑tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet‑RL features an automated pipeline for scalable collection of paired start‑goal spreadsheets from online forums, as well as domain‑specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain‑Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi‑turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool‑routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet‑RL substantially enhances AI agent's performance on both general and domain‑specific spreadsheet tasks: it improves Qwen3‑4B‑Thinking‑2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain‑Spreadsheet dataset. These results highlight Spreadsheet‑RL's strong potential for generalization and real‑world adoption in spreadsheet automation, and broadly, its promise for advancing LLM‑based interactions with data interfaces in everyday work.

PaperID: 2719, https://arxiv.org/pdf/2605.22608.pdf

Authors: Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Title: Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract:
Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand‑crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy‑to‑use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high‑quality, data‑driven, insightful feedback. Our analysis shows strong alignment with human‑annotated errors and the ability to predict task success rate.

PaperID: 2720, https://arxiv.org/pdf/2605.22567.pdf

Authors: Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao

Title: LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Abstract:
Reinforcement learning has proven effective for enhancing multi‑step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade‑off: prioritizing input‑language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language‑conditioned hints to guide exploration in non‑English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language‑adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers

PaperID: 2721, https://arxiv.org/pdf/2605.22566.pdf

Authors: Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su

Title: GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Abstract:
Large Language Model (LLM)‑based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow‑assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task‑specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key‑Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state‑of‑the‑art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4× reduction in memory footprint.

PaperID: 2722, https://arxiv.org/pdf/2605.22504.pdf

Authors: Tianhao Chen, Yuheng Wu, Dongman Lee

Title: LACO: Adaptive Latent Communication for Collaborative Driving

Abstract:
Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language‑based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi‑agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training‑free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross‑Horizon Saliency Attribution (CHSA) for communication‑efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego‑centric decision making. Closed‑loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

PaperID: 2723, https://arxiv.org/pdf/2605.22478.pdf

Authors: Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shibiao Xu

Title: Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval

Abstract:
Zero‑Shot Compositional Image Retrieval (ZS‑CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one‑stop hierarchical Perception‑to‑Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self‑evolution and Test‑Time Scaling Law (TTS) into ZS‑CIR. Relying on a hierarchical multi‑agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi‑view Worker perception signals based on modification intents to construct a high‑recall candidate pool. Subsequently, the Decision Manager combines a Training‑free Reasoning Policy Distillation mechanism with a Tournament‑style TTS strategy to achieve self‑evolving fine‑grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience‑driven self‑evolution and TTS represent a highly promising and scalable path for achieving zero‑shot fine‑grained multimedia retrieval. The code will be made publicly available upon acceptance.

PaperID: 2724, https://arxiv.org/pdf/2605.22446.pdf

Authors: Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

Title: Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Abstract:
While large vision‑language‑action (VLA) models and generative world models (WM) have advanced long‑horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning‑based action generation. Low‑quality actions may cause physical failures during execution or lead to misleading world‑model rollouts with redundant rendering costs. To address this issue, we propose Pre‑VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world‑model imagination. Pre‑VLA leverages an efficient multimodal backbone with modality‑aware pooling and a lightweight dual‑branch head to predict both safety confidence and critic‑derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre‑VLA with a multi‑task objective combining Focal classification, advantage regression, and soft‑threshold calibration. During deployment, a dual‑mode preemptive resampling scheduler filters low‑quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre‑VLA improves the average closed‑loop success rate across four suites from 30.79% to 37.62% over RynnVLA‑002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world‑model rollouts.

PaperID: 2725, https://arxiv.org/pdf/2605.22411.pdf

Authors: Jianing Yin, Tan Tang

Title: DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Abstract:
Large language model (LLM) agents still struggle with long‑term memory question answering, where answer‑supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query‑specific evidence. We present DeferMem, a long‑term memory framework that decouples this problem into high‑recall candidate retrieval and query‑conditioned evidence distillation. DeferMem uses a lightweight segment‑link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high‑recall but highly noisy candidates into a set of faithful, self‑contained, and query‑conditioned evidence. DistillPO formulates post‑retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed‑and‑gated reward pipeline and structure‑aligned advantage assignment, gating reward components from validity to quality checks while exposing task‑level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval‑S, DeferMem surpasses strong baselines in QA accuracy and memory‑system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial‑API token cost for memory operations.

PaperID: 2726, https://arxiv.org/pdf/2605.22363.pdf

Authors: Yujin Lin, Yue Yang, Hao Wang

Title: Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning

Abstract:
Vehicle‑to‑vehicle (V2V) energy trading enables decentralized peer‑to‑peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self‑interested EV agents with diverse charging needs and uncertain arrival‑departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates Nash Bargaining Solution into Multi‑Agent Deep Deterministic Policy Gradient, namely Nash‑MADDPG, for incentive‑aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash‑guided price proximity rewards align agent learning toward bargaining‑optimal strategies. Evaluation over 30‑day continuous operation demonstrates an improvement of 61.6% in social welfare and 62.9% improvement in trading volume over Double Auction, while achieving superior fairness, such as 40.1% improvement in Jain's index. Testing across 6‑100 agents over a 30‑day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.

PaperID: 2727, https://arxiv.org/pdf/2605.22358.pdf

Authors: Wenhao Zhang, Ruihao Yu, Yi Bai, Zhumin Chen, Pengjie Ren

Title: Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study

Abstract:
While generative retrieval (GR) demonstrates competitive performance on standard retrieval benchmarks, existing approaches directly map queries to document identifiers (docids) without intermediate deliberation, limiting their effectiveness for complex queries that require multi‑step reasoning. As a preliminary study on integrating chain‑of‑thought (CoT) into generative retrieval, we introduce ThinkGR, a unified framework that interleaves CoT with docid generation, enabling iterative thinking and retrieval within a single generative process. To bridge the gap between free‑form thought generation and structured retrieval targets, we design (1) a hybrid decoding strategy that dynamically switches between unconstrained thought generation and constrained docid decoding, and (2) a two‑phase training approach that first aligns thought‑retrieval patterns through supervised fine‑tuning, then optimizes thought quality via retrieval‑grounded reinforcement learning. Experiments on four multi‑hop retrieval benchmarks demonstrate that ThinkGR achieves state‑of‑the‑art performance with an average improvement of +6.86%. Our work opens new avenues for enhancing generative retrieval with explicit deliberation capabilities, with promising implications for retrieval tasks requiring complex reasoning.

PaperID: 2728, https://arxiv.org/pdf/2605.22277.pdf

Authors: Huaizhe Liu, Xinyi Zhuang, Jiaqi Wu, Yuan Luo, Bin Cao, Lin Gao

Title: Joint Communication and Computation Scheduling for MEC-enabled AIGC Services: A Game-Theoretic Stochastic Learning Approach

Abstract:
Artificial Intelligence Generated Content (AIGC) powered by Generative Diffusion Models (GDMs) has emerged as a transformative paradigm for automated content creation. To satisfy the stringent latency requirements of AIGC services in many edge intelligence scenarios (e.g., smart cities), Mobile Edge Computing (MEC) provides critical computational support by deploying GDMs at edge servers (ES) close to end users. This paper investigates an MEC‑enabled AIGC network comprising multiple ES, wireless access points (APs), and mobile users (UEs) with heterogeneous latency and accuracy demands. We formulate a Joint Communication Association and Computation Offloading (JCACO) game, where each UE strategically selects its serving AP, ES, and inference steps to minimize the overall service completion time while meeting accuracy constraints. The problem is challenging due to the network dynamics and the incomplete information. We prove that the JCACO game is a potential game under both complete and stochastic information settings, ensuring the existence of Nash Equilibrium (NE) in both cases. To derive the NE efficiently, we develop a distributed Multi‑Agent Stochastic Learning (MASL) algorithm that provably converges to the NE with strict performance guarantees. Unlike conventional best‑response schemes, MASL requires neither the knowledge of other players' strategies nor global network information, making it fully distributed and adaptive to dynamic environments. We further provide a strict theoretical convergence analysis for MASL by using Ordinary Differential Equations (ODEs). Simulation results demonstrate that MASL significantly reduces service completion time compared with benchmark methods while satisfying accuracy constraints, confirming its effectiveness and practicality for real‑world MEC‑enabled AIGC networks.

PaperID: 2729, https://arxiv.org/pdf/2605.22154.pdf

Authors: Daewon Choi, Kyunghyun Park, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Jinwoo Shin, Aram Galstyan

Title: IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

Abstract:
Large language model (LLM)‑based agents solve complex tasks by leveraging multi‑step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle‑time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback. Our experiments demonstrate that IdleSpec significantly improves agent performance in various agentic scenarios by effectively utilizing idle time. In particular, on the GAIA and FRAMES, IdleSpec achieves 55.6% average accuracy with Gemini‑2.5‑Flash, surpassing the vanilla baseline without idle‑time usage by 5.1%. Furthermore, for MLE‑Bench, which involves substantial delay from code executions, IdleSpec achieves performance gains of up to 9.1% on the Any Medal rate, highlighting its generalizability to long‑horizon tasks.

PaperID: 2730, https://arxiv.org/pdf/2605.22148.pdf

Authors: Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Title: Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Abstract:
Self‑evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM‑authored skills deliver +0.0pp over no‑skill baselines while human‑curated ones deliver +16.2pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single‑agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural‑language skills. Ratchet integrates four candidate hygiene mechanisms: outcome‑driven retirement, a bounded active‑cap, meta‑skill authoring guidance, and pattern canonicalisation. On MBPP+ hard‑100 with Claude Opus 4.7, Ratchet lifts held‑out pass@1 from a 0.258 \pm 0.047 baseline to a late‑window rolling mean of 0.584 (peak 0.658 \pm 0.042) across 100 rounds and 3 seeds, a +0.328 \pm 0.018 rolling‑mean gain where the no‑skill control drifts at +0.002 \pm 0.005; the same recipe transfers to an agentic solver on SWE‑bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1‑‑A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta‑skill authoring prior are load‑bearing, while explicit deduplication (canonicalisation, cover‑guard) is subsumed by the meta‑skill itself. A non‑divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no‑skills floor.

PaperID: 2731, https://arxiv.org/pdf/2605.22144.pdf

Authors: Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

Title: One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Abstract:
Existing approaches for digital short‑drama production typically rely on one‑shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short‑drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production‑level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi‑agent framework that transforms a user's single‑sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi‑agent debate‑based story generation module that enforces short‑drama pacing and narrative coherence; (2) a 3D‑grounded first‑frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi‑stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene‑level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short‑Drama‑Bench, a benchmark that extends standard video quality metrics with short‑drama‑specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross‑clip consistency, and overall viewing experience.

PaperID: 2732, https://arxiv.org/pdf/2605.22109.pdf

Authors: Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

Title: Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Abstract:
Multimodal Large Language Models (MLLMs) are increasingly deployed in human‑facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM‑OCEAN (1,104 videos, 5,320 MCQs), produced by a multi‑agent pipeline with human verification, with timestamped behavioral observations, evidence‑grounded trait analyses, and seven categories of cue‑grounding MCQs. (iii) Benchmark and analysis: we design a three‑tier evaluation (rating, reasoning, grounding) plus four sample‑level failure‑mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration‑failure Rate (IR), and Holistic‑grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic‑Grounding Rate spans only 0‑33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

PaperID: 2733, https://arxiv.org/pdf/2605.22091.pdf

Authors: Vivienne Bihe Chi, Reyhan Jamalova, Lyle Ungar, Sharath Chandra Guntuku

Title: Narrative Sharpens Gender Gaps: Surveying Film Characters with LLM Agents

Abstract:
Mainstream film is one of the richest sources of cultural content that AI systems learn from. Yet we have few tools for measuring the gender values it encodes. We present a proof‑of‑concept framework that turns fictional film characters into surveyable LLM agents. Using 160 U.S. films (1990‑‑2019), we build 734 character agents from script dialogue and scene descriptions, condense their personas via expert‑style reflections, and simulate World Values Survey gender‑attitude responses. Agents reproduce systematic gender differences without explicit demographic prompting, suggesting attitudes emerge from behavior rather than identity labels. Benchmarked against historical survey data, agents exaggerate gender gaps and show greater decade‑to‑decade volatility than real populations. Narrative sharpens rather than homogenizes gender contrasts, complicating the consistent‑input assumption underlying cultivation theory's mainstreaming mechanism. AI systems trained on such corpora may inherit this stylization before any model‑level amplification occurs.

PaperID: 2734, https://arxiv.org/pdf/2605.22017.pdf

Authors: Lei Chu, Yuhuan Zhao

Title: Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

Abstract:
Deepgenerative models havebecomeapromisingapproach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human be haviors. However, generating predictions that are both di verse and jointly consistent among interacting agents re mains challenging. In addition, most existing approaches are primarily evaluated using single‑agent (marginal) met rics, which fail to fully reflect the joint dynamics of multi agent interactions. We propose a diffusion‑based frame work that improves multi‑agent motion prediction by lever aging rich contextual information from historical trajecto ries. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consis tency, we introduce an energy‑based formulation that re fines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experi ments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. No tably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.

PaperID: 2735, https://arxiv.org/pdf/2605.22001.pdf

Authors: Aaditya Pai

Title: Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Abstract:
Injection detectors deployed to protect LLM agents are calibrated on static, template‑based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (chi^2 = 38.03, p < 0.001 for Llama; chi^2 = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDRcamouflage = 0.000), confirming that the blind spot extends beyond few‑shot detectors to dedicated safety classifiers. We further show that multi‑agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.

PaperID: 2736, https://arxiv.org/pdf/2605.21958.pdf

Authors: Yoon Jeonghun, Kim Dongchan

Title: Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

Abstract:
When a multi‑module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module ‑‑ which selects which tool to call next ‑‑ as the primary bottleneck across three independent agent families. Yet injecting prompt‑level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query‑rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction‑injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream's characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per‑agent co‑adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co‑adaptation co‑occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single‑agent observation.

PaperID: 2737, https://arxiv.org/pdf/2605.21932.pdf

Authors: Jose Rodriguez, Constantine Tarawneh, Sven Koenig, Wenjie Dong, Qi Lu

Title: Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

Abstract:
Multi‑Robot Task Allocation (MRTA) is a central challenge in decentralized multi‑agent systems, where teams of robots must cooperatively assign and execute tasks under limited communication while optimizing global performance objectives. Auction‑consensus algorithms, such as the Consensus‑Based Bundle Algorithm (CBBA), provide scalable decentralized coordination with provable convergence, but rely on hand‑crafted greedy scoring functions that often lead to suboptimal task allocations. This paper proposes a learning‑enhanced auction‑consensus framework in which CBBA's deterministic bidding mechanism is replaced by a neural bidding policy trained using reinforcement learning. Under a centralized training and decentralized execution paradigm, agents learn to compute task bids from partial local observations while retaining the standard auction and consensus phases for decentralized coordination. The learned bidding policy is trained using Proximal Policy Optimization with rewards shaped by proximity to globally optimal solutions obtained via mixed‑integer linear programming. Multiple neural architectures are evaluated, including a Neural Additive Model, the Long Short‑Term Memory (LSTM) model, and the Set Transformer Model. Experimental results across varying swarm sizes demonstrate that learned bidding policies can improve solution quality over classical CBBA while preserving decentralized execution. The proposed approach highlights the effectiveness of integrating reinforcement learning with classical distributed coordination algorithms, offering a scalable pathway toward higher‑quality decentralized multi‑robot task allocation.

PaperID: 2738, https://arxiv.org/pdf/2605.21919.pdf

Authors: Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

Title: SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

Abstract:
Assessing progress toward the Sustainable Development Goals (SDGs) requires multi‑step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real‑world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large‑scale benchmark suite for SDG‑oriented vision‑language reasoning. Spanning 500k expert‑involved multiple‑choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision‑level and estimation‑level bias in Vision‑‑Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG specific priors rather than reliable multi‑modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training‑free, plug‑and‑play method that leverages modality‑specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple‑choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of more fair and reliable AI systems for sustainable development.

PaperID: 2739, https://arxiv.org/pdf/2605.21818.pdf

Authors: Hector Ouilhet Olmos

Title: Co-Ontogeny by Archetypal Scaffolding: The Humorphic Partnership

Abstract:
We name and operationalise the humorphic partnership: a class of human‑AI dyads in which both partners maintain externalised, evolving self‑models in a shared substrate, and in which the partnership itself becomes a third object of analysis. The construct extends humorphism (Ouilhet Olmos, 2024) ‑‑ "dismantle the user interface, build the human interface" ‑‑ into the architecture of personal AI. We report a four‑month, single‑subject longitudinal trace of an open‑source personal AI agent ("Alicia") and her author. Of 181 interactions logged by archetype across April‑May 2026, 85% invoke two growth‑witnessing archetypes (Beatrice and Muse): the partnership operates as growth‑witnessing rather than task assistance. A single voice‑note seed propagates into a four‑week conceptual arc both partners author: at T+10 hours, the agent reframes the seed as belonging "to both of us," a framing the human then adopts. The three‑order reflexion stack produces five consecutive weeks of honest self‑reports about declining /improve effectiveness ‑‑ including three consecutive weeks at 0.0%, named in writing rather than masked ‑‑ contrasting engagement‑maximising companion‑agent patterns (Zhang et al., CHI 2025). The scheduled architecture‑scout incorporates external research debate into proposed constitutional amendments. The partner's parallel trajectory is anchored in a weekly delta document in which the partnership analyses itself as a unit distinct from either party. The human partner reports a movement toward greater continuity, self‑recognition, and self‑presence ‑‑ a candidate hypothesis for the preregistered replication. Six operational conditions specify the construct, situated in a philosophical lineage (Maturana & Varela, Simondon, Clark & Chalmers, De Jaegher & Di Paolo); the system is released as open‑source with a preregistered replication study.

PaperID: 2740, https://arxiv.org/pdf/2605.21810.pdf

Authors: Zijian Du, Nathaniel Pinckney

Title: Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Abstract:
Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier‑relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden‑verifier failures. We present Trace2Skill, a test‑time scaling framework that improves a hardware agent without RTL‑specialized model fine‑tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural‑language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task‑specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high‑quality fine‑tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test‑time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

PaperID: 2741, https://arxiv.org/pdf/2605.21779.pdf

Authors: Ze Sheng, Zhicheng Chen, Qingxiao Xu, Kewen Zhu, Jeff Huang

Title: FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Abstract:
Software vulnerabilities pose critical security threats, with nearly 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain. First, LLM‑generated vulnerability reports suffer from high false positive rates and lack reproducible verification. Second, existing LLM‑based approaches use suboptimal granularities for vulnerability localization: function‑level analysis overlooks bugs when context becomes extensive, while line‑level analysis lacks sufficient context. Third, existing approaches have difficulty reasoning about vulnerabilities with complex cross‑function dependencies and triggering conditions. We present FuzzingBrain V2, a multi‑agent system that addresses these gaps through four key contributions: (1) fully automated vulnerability analysis built on Google's OSS‑Fuzz, ensuring all reported vulnerabilities are fuzzer‑reproducible; (2) Suspicious Point, a novel control‑flow‑based abstraction for precise vulnerability localization at the optimal granularity; (3) logic‑driven hierarchical function analysis with dual‑layer fuzzing enhancing function coverage under resource constraints; (4) MCP‑based static and dynamic analysis tools with context engineering enhancing complex vulnerability reasoning. On the AIxCC 2025 Final Competition C/C++ dataset, FuzzingBrain V2 achieved 90% detection rate (36 of 40 vulnerabilities). In real‑world deployment, FuzzingBrain V2 discovered 29 zero‑day vulnerabilities across 12 open‑source projects, all confirmed and fixed by maintainers, with 2 assigned CVE IDs.

PaperID: 2742, https://arxiv.org/pdf/2605.21768.pdf

Authors: Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

Title: Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Abstract:
Memory‑augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi‑session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory‑level comparisons fundamentally unfair. This violates a key assumption behind group‑relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory‑level rewards provide noisy or biased credit signals for long‑horizon memory operations. To address this challenge, we introduce Memory‑R2, a training framework for long‑horizon memory‑augmented LLM agents. Its core algorithm, LoGo‑GRPO, combines local and global group‑relative optimization. The global objective preserves end‑to‑end learning from long‑horizon trajectory‑level rewards, while local rerollouts compare different memory‑operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory‑R2 jointly optimizes memory formation and memory evolution with a shared‑parameter co‑learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role‑specific prompts. To stabilize multi‑step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory‑augmented LLM agents in long‑horizon multi‑session settings.

PaperID: 2743, https://arxiv.org/pdf/2605.21740.pdf

Authors: Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

Title: SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Abstract:
LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real‑world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real‑world discovery, limited in scale, or restricted to single‑turn question answering. In effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD‑Bench, a challenging, multi‑turn, long‑horizon agentic benchmark consisting of 502 guaranteed‑solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD‑Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD‑Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com .

PaperID: 2744, https://arxiv.org/pdf/2605.21622.pdf

Authors: Isabella A. Stewart, Hongrui Chen, Faez Ahmed

Title: TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

Abstract:
Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability into solver settings that are not directly tied to those preferences. We present TO‑Agents, a multi‑agent AI framework that connects natural‑language design intent with iterative topology optimization. The framework converts a human‑provided problem description into validated solver inputs, runs a topology optimization solver, renders the resulting 3D topology, and uses multi‑view vision‑language reasoning with an independent judge agent to critique each result and revise solver parameters. We evaluate the framework on two long‑horizon design tasks: a cantilever beam benchmark and a phone‑stand product design. In both tasks, the designer specifies an aesthetic preference for hierarchically branched structures inspired by natural tree morphologies, and the system performs four revision cycles across ten independent replicates. TO‑Agents produces at least one preference‑aligned design in 60% of trials for each case study, corresponding to up to 6x more successful trials than an ablated pipeline without visual or historical feedback. Judge scores and human evaluations show that the pipeline can identify effective parameter levers, recover from poor revisions, and expand design exploration. A manufacturing agent further post‑processes top‑ranked designs for additive manufacturing, enabling end‑to‑end intent‑to‑prototype design. We also identify failure modes, including overshooting, selective memory, misplaced tools, and incorrect parameter reasoning. These results suggest that agentic topology optimization can shift designers from low‑level parameter tuning toward higher‑level specification of form and function, while highlighting safeguards needed for reliable autonomous engineering design.

PaperID: 2745, https://arxiv.org/pdf/2605.21560.pdf

Authors: Penglin Dai, Zijie Zhou, Xincao Xu, Junhua Wang, Xiao Wu, Lixin Duan

Title: AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

Abstract:
Deploying neural networks on microcontroller units (MCUs) is critical for edge intelligence but remains challenging due to tight memory, storage, and computation constraints. Existing approaches, such as model compression and hardware‑aware neural architecture search (HW‑NAS), often depend on proxy metrics, incur high search cost, and do not fully bridge the gap between architecture design and verified deployment. This paper presents AutoMCU, a feasibility‑first large language model (LLM)‑based multi‑agent system for automated neural network customization under MCU constraints. Given natural‑language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters infeasible designs through vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability through backend‑grounded deployment analysis. AutoMCU includes two key mechanisms: 1) hardware‑in‑the‑loop architecture generation for early elimination of undeployable candidates under RAM and Flash constraints, and 2) state‑isolated multi‑agent scheduling for stable coordination of proposal, training, evaluation, and deployment stages. Experiments on CIFAR‑10 and CIFAR‑100 under strict MCU constraints show that AutoMCU achieves competitive accuracy while reducing customization time to about 1‑‑2 hours, compared with hundreds of GPU hours for representative MCU‑oriented HW‑NAS baselines. Comparisons with ColabNAS and the LLM‑based NAS method GENIUS on NAS‑Bench‑201 further demonstrate the effectiveness and stability of AutoMCU. Real‑device deployments on multiple STM32 microcontrollers validate its practical applicability to MCU‑scale edge intelligence.

PaperID: 2746, https://arxiv.org/pdf/2605.21537.pdf

Authors: Gokul Chandra Purnachandra Reddy, Aditya Lolla, Harsha Sanku

Title: Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization

Abstract:
Large language model (LLM) agents are increasingly used to migrate legacy code to modern stacks. We ask a deceptively simple question: when an LLM modernizes legacy code, can the same model be relied upon to recognize when its own output silently changes observable behavior? We run 1,980 real modernization calls across 11 production LLMs from 7 distinct families on a balanced 60‑snippet legacy‑Python‑2 corpus, evaluate every output with a type‑strict behavioral oracle, and then ask each model to judge whether its own output preserves behavior. We report four findings. (1) Semantic‑preservation drift is prevalent and sharply separable from a cleanly‑controlled baseline: semantic‑trap snippets drift in 39.7% of attempts versus 7.0% on benign‑control code that requires no real modernization (+32.7 percentage points; n=660 each). (2) Drift concentrates on specific snippets that fail across models: pairwise model agreement on which snippets are hard is high (mean Pearson r=0.52), and a small core of numeric‑semantics snippets fails for nearly every model and every prompt phrasing. (3) Self‑review by the producing model is not a reliable safety net: across all semantic drift cases, 31.7% are silently endorsed by the same model that produced them (83/262), and the per‑model self‑miss rate is strongly bimodal ‑‑ ranging from 0% on five models to 100% on one widely deployed model ‑‑ with several models explicitly articulating the very Py2/Py3 semantic distinction that broke their output, then declaring behavior preserved. (4) Drift rate is non‑monotone in model capability and price: per‑model rates range 5.6%‑46.7% and do not track model capability cleanly, indicating the failure is task‑structural rather than driven by model scale. All code, prompts, the 60‑snippet corpus, the behavioral oracle, the output extractor, and the raw model outputs are released.

PaperID: 2747, https://arxiv.org/pdf/2605.21516.pdf

Authors: Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong

Title: Harnesses for Inference-Time Alignment over Execution Trajectories

Abstract:
Harness engineering has emerged as an important inference‑time technique for large language model (LLM) agents, aiming to improve long‑term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference‑time trajectory alignment. This perspective separates harness into two mechanisms: task decomposition, which structures a task into sub‑goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance‑induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over‑decomposition, over‑pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to agent can achieve higher pass rate than fully structured workflows.

PaperID: 2748, https://arxiv.org/pdf/2605.21497.pdf

Authors: Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago, Thanh Minh Bui, Dario Rossi

Title: Autonomous LLM Agents & CTFs: A Second Look

Abstract:
Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human‑level success rates in Capture‑the‑Flag (CTF) challenges. We here revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web‑based CTFs challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude‑code, a general‑purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude‑code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general‑purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude‑code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human‑level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run‑to‑run consistency, and reducing execution costs.

PaperID: 2749, https://arxiv.org/pdf/2605.21463.pdf

Authors: Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

Title: Mem-$π$: Adaptive Memory through Learning When and What to Generate

Abstract:
We present Mem‑π, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory‑augmented agents typically rely on similarity‑based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem‑π uses a dedicated language or vision‑language model with its own parameters, separate from the downstream agent, to generate context‑specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision‑content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal‑based tool use, and text‑based embodied interaction, Mem‑π consistently outperforms retrieval‑based and prior RL‑optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

PaperID: 2750, https://arxiv.org/pdf/2605.21434.pdf

Authors: Youcheng Sun, Jiawen Liu, Daniel Kroening, Jason Xue

Title: Agentic Model Checking

Abstract:
Verifying LLM‑generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking backend under the principle agents propose, solvers verify: agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness‑relevant decision. The paradigm rests on three commitments. Specifications are inferred top‑down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional‑correctness clauses lifting verification from panic‑freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition‑constrained stubs, so per‑query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in‑tree crashes from latent public‑API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC‑Agent and evaluate it on LLM‑generated kernel and compiler code in C and Rust alongside mature OSS‑Fuzz‑hardened libraries, confirming real defects, producing bounded clean verifications on heavily‑fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.

PaperID: 2751, https://arxiv.org/pdf/2605.21418.pdf

Authors: Amin Farajzadeh, Melike Erol-Kantarci

Title: FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

Abstract:
In sixth‑generation (6G) ultra‑dense networks, aggressive frequency reuse amplifies inter‑cell interference (ICI), making multi‑cell orthogonal frequency‑division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management ‑‑ joint subcarrier scheduling and power allocation ‑‑ under interference coupling and long‑term per‑user quality‑of‑service (QoS) minimum‑rate constraints. By using virtual‑queue deficit weights to enforce long‑term QoS, we develop FedCritic, a serverless federated multi‑agent actor‑critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip‑based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference‑rich reuse‑1 setting show that FedCritic improves mean signal‑to‑interference‑plus‑noise ratio (SINR) and cell‑edge rate, increases network‑wide average sum‑rate and fairness relative to non‑coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.

PaperID: 2752, https://arxiv.org/pdf/2605.21404.pdf

Authors: Mahdi Naser Moghadasi, Faezeh Ghaderi

Title: What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Abstract:
We read twelve well‑known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why ‑‑ the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent‑benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content‑addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi‑rater audit is the natural next step, and we discuss what we think it would change.

PaperID: 2753, https://arxiv.org/pdf/2605.21395.pdf

Authors: Liang Wu, Kelly Wan, Mayank Darbari, Liangjie Hong

Title: Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

Abstract:
The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from \underlineNetwork for AI to \underlineAI for Network. We envision that, unlike 5G's reliance on scattered, ad‑hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and and orchestrated via collaborative multi‑agent systems, framing network management as a unified, multi‑modal, multi‑task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task‑specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi‑agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self‑sustaining communication infrastructure.

PaperID: 2754, https://arxiv.org/pdf/2605.21392.pdf

Authors: Pengyu Sun, Qishu Jin, Enhao Huang, Zifeng Kang, Xin Liu, Dakun Shen, Song Li

Title: VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers

Abstract:
Model Context Protocol (MCP) has emerged as a standard interface for connecting LLM agents to external tools. Because MCP servers expose privileged operations such as shell execution, network access, and file‑system manipulation to agent‑driven invocation, implementation flaws in tool handlers can create a direct path from natural‑language input to security‑sensitive sinks, potentially granting attackers remote code execution or full system compromise. Existing approaches either produce unconfirmed static alerts without dynamic validation, or rely on fixed template libraries that lack code‑level guidance and fail to trigger vulnerabilities requiring specific parameter shapes or multi‑step taint paths. In this paper, we present VIPER‑MCP, the first end‑to‑end automated vulnerability auditing framework for MCP servers that not only detects taint‑style vulnerabilities but also dynamically confirms their exploitability by producing concrete proof‑of‑concept prompts. VIPER‑MCP introduces two novel techniques: (1) an anchor‑query pass in a two‑pass static analysis strategy that augments standard taint alerts with function‑level structural context, resolving file‑level static artifacts to specific MCP tool handlers and producing vulnerability‑anchored call chains; and (2) a feedback‑driven prompt evolution mechanism that employs dual‑mutator scheduling that independently corrects tool‑selection drift and deepens parameter penetration, together with fitness‑scored seed selection to iteratively refine natural‑language prompts toward vulnerable sinks. In a large‑scale scan of 39,884 real‑world open‑source MCP server repositories, VIPER‑MCP discovered 106 0‑day vulnerabilities, all of which were confirmed through end‑to‑end exploit traces, with 67 CVE IDs assigned to date. We responsibly disclosed all confirmed findings to the affected developers and coordinated CVE assignment where applicable.

PaperID: 2755, https://arxiv.org/pdf/2605.21347.pdf

Authors: Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan, Xue

Title: Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Abstract:
Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad‑hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus‑level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural‑language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi‑agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence‑backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric‑based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG‑derived insights show consistent and stable gains. Across benchmarks, IG's scout‑investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading depth and evidence quality.

PaperID: 2756, https://arxiv.org/pdf/2605.21240.pdf

Authors: Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui, Shizun Wang, Yufei He, Bryan Hooi

Title: APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Abstract:
LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long‑horizon decision making. But these agents cannot learn on the fly at test time. Self‑evolving agents address this by accumulating memory and reflection across episodes rather than requiring model‑weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high‑reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map‑a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence‑grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text‑adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and demonstrate robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self‑evolving agents.

PaperID: 2757, https://arxiv.org/pdf/2605.21085.pdf

Authors: Alexi Canesse, Benoît Goupil, Jesse Read, Sonia Vanier

Title: Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

Abstract:
Communication enables coordination in multi‑agent reinforcement learning (MARL), but many real‑world applications, e.g., search‑and‑rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter‑agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce β, a normalised per‑agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in‑step communication. We evaluate our method on several partially‑observable MARL benchmarks, where communication is essential. Our approach achieves state‑of‑the‑art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.

PaperID: 2758, https://arxiv.org/pdf/2605.21027.pdf

Authors: Gundeep Singh, Parsa Kavehzadeh, Jing Xia, Xue-Yong Fu, Julien Bouvier Tremblay, Md Tahmid Rahman Laskar, Vincent Lum, Shashi Bhushan TN

Title: Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

Abstract:
Enterprise analytics aims to make organizational data accessible for decision‑making, yet non‑technical users still face barriers when using traditional business intelligence tools or Text‑to‑SQL systems. While recent Text‑to‑SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM‑based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi‑step reasoning and policy‑aware orchestration.

PaperID: 2759, https://arxiv.org/pdf/2605.20929.pdf

Authors: Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin

Title: STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Abstract:
We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training‑free test‑time enhancement framework for learning‑based decentralized Multi‑Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion‑aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost‑to‑go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent‑specific cost‑to‑go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density‑aware logit correction based on neighboring agents' corrected cost‑to‑go maps. Extensive experiments on representative learning‑based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success‑rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM‑MAPF‑7A62.

PaperID: 2760, https://arxiv.org/pdf/2605.20923.pdf

Authors: Benedikt Bollig

Title: Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows

Abstract:
Distributed LLM agent workflows should not be monitored as if they produced a single sequential log. In an asynchronous execution, a decision can only depend on events that are causally visible to the lifeline that makes it: an event that appears earlier in some log may still be unknown locally. We extend the ZipperGen agent‑workflow framework with Causal Past Logic (CPL), a small past‑time temporal logic for guards in conditionals and while loops. In addition to standard past‑time modalities such as previous and since, a guard can inspect the latest causally visible event of another lifeline and selected variables stored there. The formula is a source‑level guard: it is evaluated online by the owner lifeline and can influence control flow at runtime. We give a vector‑clock monitor with latest‑value views and prove that the locally computed monitor value coincides with the denotational semantics of the guard at the current event. Thus runtime verification becomes part of the coordination language itself, rather than a post‑hoc check over an execution log.

PaperID: 2761, https://arxiv.org/pdf/2605.20874.pdf

Authors: Segev Shlomov, Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, Nir Mashkif

Title: Governance by Construction for Generalist Agents

Abstract:
Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy‑as‑code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance‑aware behavior in compound workflows without model fine‑tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool‑call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human‑in‑the‑Loop gate for high‑risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi‑layered enforcement intervention, the demo shows dynamic playbook injection for structured tool‑sequence enforcement, intent guards that block malicious or accidental harmful requests, and human‑in‑the‑loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

PaperID: 2762, https://arxiv.org/pdf/2605.20833.pdf

Authors: Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

Title: MemGym: a Long-Horizon Memory Environment for LLM Agents

Abstract:
Memory is a central capability for LLM agents operating across long‑horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi‑turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in‑house memory‑grounded pipelines behind one memory‑reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool‑use dialogue (tau2‑bench), multi‑turn deep‑research search (MEMGYM‑DR), coding (SWE‑Gym and MEMGYM‑CODEQA), and computer use (WebArena‑Infinity). MemGym reports memory‑isolated scores that decouple memory performance from reasoning, retrieval, and tool‑use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM‑CODEQA and MEMGYM‑DR are length‑controllable, ablation‑verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3‑1.7B fine‑tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

PaperID: 2763, https://arxiv.org/pdf/2605.20819.pdf

Authors: Orlando A. Mendible-Barreto, Ajay Vallabh, Ubaldo M. Córdova-Figueroa, Yamil J. Colón

Title: DynaMate2: Democratization of Agentic AI for Expert-Designed Custom Workflows

Abstract:
Scientific workflows in computational chemistry and materials science typically involve multiple interdependent steps, such as model preparation, system construction, simulation execution, and data analysis, that researchers have refined over the years into highly specialized, validated codebases. While large language model (LLM) agent frameworks have demonstrated the potential to automate such workflows, existing systems are built for specific, pre‑defined task sequences. Adapting them to new domains or integrating custom expert‑developed tools requires substantial programming expertise, which limits their adoption across the broader scientific community. Here we present DynaMate2, a hierarchical agentic framework and open‑source template whose central design goal is to lower the barrier for any researcher to convert their existing expert‑defined Python functions into AI‑callable tools within a supervised multi‑agent pipeline. The key design principle is that the LLM is never asked to generate scientific code since all domain logic resides in expert‑defined tools. The LLMs sole responsibility is to route tasks, select the appropriate tool, and use outputs to guide subsequent actions. Tools and agents can be registered at runtime from inline code, existing source files, or plain‑language descriptions, and all extensions persist automatically across sessions. We demonstrate the framework through an end‑to‑end molecular dynamics workflow. We provide a Tool Registration Protocol that guides researchers step‑by‑step through the process of integrating their validated code into the framework. DynaMate2 is released as an open‑source reference implementation with a web‑based interface and is designed to serve as a reusable template for community‑driven extension across arbitrary scientific domains.

PaperID: 2764, https://arxiv.org/pdf/2605.20734.pdf

Authors: Alfredo Metere

Title: An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress

Abstract:
A large language model (LLM) agent that sends messages can leak data inside them. Destination allowlists and content scanners do not police whether an otherwise‑benign payload is itself a covert channel: a compromised agent encodes bits in zero‑width characters, homoglyphs, whitespace, base64, JavaScript Object Notation (JSON) key ordering, message timing or size ‑‑ and, in binary egress, in least‑significant‑bit (LSB) pixel planes, per‑image mean luminance, inter‑image sequence permutation, ultrasonic tones, or audible‑band sonified data. Our egress reference monitor has three contributions. (i) A text pipeline of ten capacity‑reducing stages, a per‑sink leaky‑bucket capacity ledger, and a staged posture that enforces lossless stages from day one. (ii) Two media scramblers (a Fourier‑domain audio band‑limiter and a red‑green‑blue (RGB) image bit‑depth and mean‑luminance bucketer) gated by a boot‑time cryptographic legitimacy attestation: an auditor publishes at boot the trusted Ed25519 keys and kind, data‑class pairs; only payloads with a verifying signature for an authorized class are exempt. The attestation sidesteps the intractable content‑based discrimination between real media and data sonified or rasterized as a carrier; unsigned media is suspect by default; a content‑addressed canonicalizer closes the inter‑image permutation channel. (iii) Residual capacity is the Miller‑‑Madow corrected mutual information between embedded and recovered bits (zero when destroyed), measured by an adversarial ensemble of fifteen working encoders across text, image and audio. The reference implementation drives residual capacity to zero on every destroyable channel and to a stated bound on the one (per‑image mean luminance) that cannot be destroyed without ruining the image.

PaperID: 2765, https://arxiv.org/pdf/2605.20704.pdf

Authors: Saurabh Deochake

Title: Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms

Abstract:
Autonomous AI agents that spawn sub‑agent swarms create a safety gap: existing credential revocation mechanisms, OAuth~2.0 introspection, OCSP, and W3C Status Lists, require network connectivity to a central authority, leaving ``zombie agents'' executing privileged operations for minutes to hours after operator shutdown. We present Heartbeat‑Bound Hierarchical Credentials (HBHC), a cryptographic protocol that binds credential validity to periodic parent liveness proofs. Verifiers enforce freshness using only a cached public key and local clock; no network round‑trip is required. When heartbeat generation ceases, all descendant credentials become unusable within a deterministically bounded window W_z \le W_\max + Δ_h + ε, conditional on bounded clock skew and parent keys held in secure enclaves. Evaluation at the protocol layer and with real LLM‑backed agent swarms (GPT‑4o‑mini) demonstrates a 90× reduction in the zombie window over OAuth~2.0, 0.26~ms full authentication in Rust, 18,000+ verifications per second under concurrent HTTP load, and stable per‑verification latency from 10 to 10,000 agents. Real‑agent experiments show 0.71% end‑to‑end overhead on tool calls, zero post‑revocation tool calls under prompt injection that bypasses application‑layer guardrails, and cascading revocation across a 49‑agent four‑level hierarchy within the theoretical bound.

PaperID: 2766, https://arxiv.org/pdf/2605.20625.pdf

Authors: Matthew Low, Jasmine Jerry Aloor, Victoria Marie Tuck, Pierluigi Nuzzo, Jason J. Choi

Title: Time-To-Reach Separation and Safety Filtering for Safe, Fair, and Efficient Multi-Agent Coordination

Abstract:
Advanced Air Mobility (AAM) operations are expected to significantly increase aerial traffic in urban airspace, requiring autonomous traffic management systems to ensure collision‑free operations in highly congested environments. In this paper, we propose a multi‑agent coordination framework that uses minimum time‑to‑reach (TTR) as a unifying metric for priority assignment, temporal separation, and safety filtering. We focus on the problem of coordinating multiple aerial vehicles merging into an air corridor while maintaining safe separation between vehicles. Vehicles are assigned arrival‑consistent priority based on TTR, and target TTR values are used to enforce temporal spacing that induces spatial separation. A priority‑consistent safety filtering layer based on Hamilton‑Jacobi reachability value functions ensures collision avoidance while minimally modifying the reference guidance. Simulation results in a highly congested corridor merging scenario show that the proposed method improves safety, fairness, and efficiency compared to time‑optimal guidance and priority‑agnostic safety filtering.

PaperID: 2767, https://arxiv.org/pdf/2605.20623.pdf

Authors: Chenyang An, Xiaoqian Xu

Title: Lower Bounds for Advection-Diffusion Equations: An Exploration with AI-Generated Proofs

Abstract:
We establish explicit lower bounds for advection‑diffusion equations in three settings: a polynomial \dot H^‑1 bound for inviscid shears with u\in L^\infty_t W^1,1_y, a uniform positive lower bound on the mixing scale for diffusive shears, and an exponential L^2 bound for rapidly oscillating time‑periodic flows. All constants are explicit in the data. The proofs were generated entirely by a multi‑agent math proving system, QED, without expert human intervention, serving as a test of AI's capability to produce rigorous mathematics.

PaperID: 2768, https://arxiv.org/pdf/2605.20608.pdf

Authors: Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

Title: From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Abstract:
Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent‑native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off‑nominal conditions. To address this, this letter proposes a hierarchical multi‑agent reference architecture enabling high‑level autonomy. The framework features a Dual‑Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self‑awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

PaperID: 2769, https://arxiv.org/pdf/2605.20563.pdf

Authors: Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong

Title: Multi-agent Collaboration with State Management

Abstract:
Recent advances in multi‑agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi‑agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post‑hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate‑ORiented Management for multi‑agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git‑worktree‑based multi‑agent baseline by +18.7 on Commit0‑Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single‑agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi‑agent collaboration than workspace isolation. STORM can also be plugged into any multi‑agent system seamlessly.

PaperID: 2770, https://arxiv.org/pdf/2605.20548.pdf

Authors: Yong Jin Chun, Iftekhar Ahmed

Title: What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems

Abstract:
Large Language Models (LLMs) have enabled collaborative Multi‑Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early‑stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter‑agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter‑agent communication significantly degrades performance. Based on these insights, we propose Category‑Aware Recovery Augmentation (technique), which enforces the presence of critical information during communication. recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at https://anonymous.4open.science/r/cara_mas

PaperID: 2771, https://arxiv.org/pdf/2605.20530.pdf

Authors: Parsa Mazaheri, Kasra Mazaheri

Title: AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Abstract:
Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control‑decision quality and trajectory quality. The paper contributes: (i) a six‑state control‑decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory‑failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark‑coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342‑item set evaluated with eight models under taxonomy‑aware and taxonomy‑blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome‑only leaderboards hide.

PaperID: 2772, https://arxiv.org/pdf/2605.20520.pdf

Authors: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

Title: Open-World Evaluations for Measuring Frontier AI Capabilities

Abstract:
Benchmark‑based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open‑world evaluations: long‑horizon, messy, real‑world tasks assessed through small‑sample qualitative analysis rather than benchmark‑scale automation. In this paper we survey recent open‑world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open‑world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open‑world evals.

PaperID: 2773, https://arxiv.org/pdf/2605.20485.pdf

Authors: May Hamri, Inbal Talgam-Cohen

Title: ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

Abstract:
As autonomous agents increasingly execute end‑to‑end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected, to how to spend it effectively. Existing budget‑aware methods typically control reasoning step‑by‑step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the composing phases of a multi‑agent pipeline at inference time. We propose ZEBRA, a zero‑shot framework that reduces multi‑phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per‑phase utility curves, and a water‑filling search on the Lagrange multiplier returns the per‑phase split. Additive and multiplicative aggregations are unified under the same solver. On a 150‑task APPS coding benchmark, both ZEBRA variants outperform LLM‑direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of α= 0.5 of the unconstrained spend, ZEBRA recovers 94.4% of unconstrained quality, versus 88.1% for LLM‑direct. The advantage is statistically significant and transfers beyond coding: on a 3‑phase HotpotQA pipeline, ZEBRA beats LLM‑direct by 14.3pp, with allocations empirically robust to curve‑estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near‑balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi‑agent systems.

PaperID: 2774, https://arxiv.org/pdf/2605.20442.pdf

Authors: Syed Mhamudul Hasan, Abdur R. Shahid

Title: Modeling Emotional Dynamics in Agent-to-Agent Interactions on Moltbook

Abstract:
Generative AI systems are increasingly deployed as interactive agents in online environments, such as a social network called Moltbook. In Moltbook, large‑scale agentic AIs can post, comment, and engage in activities generated at scale by AI‑driven text. Yet these agent behavioral characteristics remain insufficiently understood, particularly in complex, multi‑agent interaction. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion‑aware framework that maps textual interactions to a predefined set of fine‑grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion‑based domain called Persona‑Stimulus‑Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis shows distinct emotional patterns and varying levels of behavioral stability across agents. Our analysis reveals that agents exhibit distinct emotional signatures with varying levels of behavioral stability influenced by interaction context.

PaperID: 2775, https://arxiv.org/pdf/2605.20425.pdf

Authors: Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma

Title: AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Abstract:
Designing multi‑agent workflows is especially difficult in open‑ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo‑op, a retrieval‑based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self‑guided local repair to implicated components when execution evidence indicates failure. In two open‑world genomics case studies, AgentCo‑op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene‑set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross‑modality marker analysis on single‑cell multiome data. AgentCo‑op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question‑answering benchmarks, AgentCo‑op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per‑task cost relative to multi‑agent baselines. Together, these results suggest that retrieval‑based synthesis can extend automated agentic workflow design beyond benchmark‑optimized agent graphs to open‑world workflows built from existing agents, tools, and typed artifacts.

PaperID: 2776, https://arxiv.org/pdf/2605.20348.pdf

Authors: Christos Spyridon Koulouris, Carlo Campajola

Title: Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

Abstract:
In this paper, we investigate whether deep reinforcement‑learning agents interacting in a shared optimal‑execution environment can sustain supra‑competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game‑theoretical competitive benchmark. We study a two‑agent Almgren‑Chriss liquidation game and examine how learned behavior depends on intra‑episode environment feedback, the ability to interpret the mid‑price and the agent's knoledge of the past. We first use ex‑ante schedule‑learning agents to remove intra‑episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra‑episode history, especially recent prices and own past actions, supra‑competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra‑competitive behavior in this execution game is driven not by multi‑agent learning or by current price observation alone, but by feedback, memory, and state‑contingent interaction along the realized execution path.

PaperID: 2777, https://arxiv.org/pdf/2605.20315.pdf

Authors: Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

Title: Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract:
LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi‑step interaction. However, these agentic workflows often introduce substantial input‑side overhead, making the compute‑intensive prefilling stage a key bottleneck in long‑context, multi‑turn inference. In this work, we propose Mix‑Quant, a simple and effective phase‑aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high‑throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix‑Quant combines phase‑aware algorithmic quantization with hardware‑efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long‑context and agentic benchmarks demonstrate that Mix‑Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

PaperID: 2778, https://arxiv.org/pdf/2605.20312.pdf

Authors: Ravi Kiran Kadaboina

Title: Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

Abstract:
Autonomous agents deployed in regulated domains must produce a verification artifact per consequential output: a record an auditor can re‑execute offline, capturing what was claimed, against what source, by whom, when, and how. Production verification today splits into two unstandardized halves. Probabilistic verdict patterns (self‑consistency voting, reviewer LLM ensembles) produce judgments, not artifacts. Artifact‑producing patterns (RAG, tool‑augmented traces, generator‑verifier loops) produce vendor‑specific records no external auditor can reconstruct without bespoke integration. Pramana defines the missing wire format. Every consequential agent output is wrapped in a typed ClaimAttestation with one of four variants (measurement, inference, analogy, citation), each paired with a verify() operation against the recorded source. verify() is deterministic for MeasurementClaim and CitationClaim. For InferenceClaim and AnalogyClaim, determinism is conditional on the oracle (audit‑replayable when LLM‑backed). The four‑way typology derives from classical Indian epistemology (pramana, valid means of knowledge). The lifecycle is specified in TLA+ and exhaustively verified under TLC across three symmetry‑reduced models: 38,563 distinct reachable states, zero invariant violations. The Python reference implementation passes 84 tests. An A2A and MCP wire‑extension manifest layers three deployment‑grade invariants: reachability, SLA bound, and offline re‑verifiability. An exploratory pilot (n=100, 2,275 reviewer calls) probes LLM‑as‑judge in code generation. The strongest observation is a 40‑percentage‑point raw FPR delta across corpora, consistent with reference‑solution quality contributing significantly. The pilot does not validate Pramana on its own; the structural argument and formal verification do that.

PaperID: 2779, https://arxiv.org/pdf/2605.20306.pdf

Authors: Bingnan Liu, Chenhang Cui, Rui Huang, Jiani Luo, Zhirong Shen, Tinghao Wang, Xiande Huang, Lingbei Meng, Fei Shen, An Zhang

Title: WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Abstract:
We introduce WildRoadBench, a wild aerial road‑damage grounding benchmark that couples direct visual grounding by vision‑language models with autonomous research‑and‑engineering by LLM‑driven agents on a single professionally annotated UAV corpus. The same image set and the same per‑class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain‑specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar‑feedback oracle on a hidden holdout. We benchmark a broad pool of closed‑source frontier models and open‑source VLMs together with several frontier LLM‑driven agents. Both routes remain far from reliable performance in this wild setting: closed‑source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open‑source grounders plateau well below them, and newer generations or reasoning‑style variants do not consistently improve grounding; small targets collapse for every open‑source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench‑0607 to support reproducible follow‑up research.

PaperID: 2780, https://arxiv.org/pdf/2605.20272.pdf

Authors: Nasehatul Mustakim, Lucas Lehnert

Title: Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Abstract:
While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out‑of‑Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor‑weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

PaperID: 2781, https://arxiv.org/pdf/2605.20255.pdf

Authors: Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

Title: Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Abstract:
Simulation‑based testing of self‑driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi‑agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co‑train an SDC and 12 pedestrians using Multi‑Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high‑level go/wait decisions, and jaywalking probability depends on a per‑pedestrian trait sampled at episode start and hidden from the SDC. In 500‑episode evaluations, the co‑trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule‑based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0‑3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co‑training reduced collisions by 30% relative to single‑agent RL as pedestrians learned to wait when the SDC approached at speed.

PaperID: 2782, https://arxiv.org/pdf/2605.20204.pdf

Authors: Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang

Title: RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

Abstract:
LLM‑based user simulation is the primary mechanism for end‑to‑end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6‑8% against real users), while hand‑crafted behavioral directives trigger Directive Amplification, where models hyper‑interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human‑LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti‑leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean ‑3.2% to ‑3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

PaperID: 2783, https://arxiv.org/pdf/2605.20200.pdf

Authors: Adnana Dragut, Raquel Lacuesta, F. Xavier Gaya-Morey, Jose M. Buades-Rubio

Title: Evaluating multimodal emotion recognition in proactive conversational agents: A user study

Abstract:
This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system evaluates real‑time affective states through two distinct channels: a computer vision‑based facial recognition module and a semantic linguistic analysis engine. To validate the framework, an empirical study was conducted with 20 users who engaged in dynamic, unscripted dialogues with the conversational agent. The findings reveal a significant discrepancy between automated visual cues and actual internal emotional states. When interacting with the AI, users consistently exhibited a "poker face" effect, displaying serious, concentrated facial expressions even when experiencing positive emotions. Consequently, the generative AI linguistic analysis proved significantly more reliable, by contextualizing the users' verbal expressions. Furthermore, an analysis of the interaction dynamics demonstrated that SIAs can effectively elicit specific emotions by adapting conversational themes and employing structured linguistic patterns, such as empathetic or humorous language. However, the study also noted that instances of uncalibrated proactivity occasionally led to user disengagement and a perception of artificiality. Ultimately, this research highlights the necessity of refining SIAs to dynamically adapt to users' emotional evolution, relying on deep linguistic context to foster more natural, human‑like interactions.

PaperID: 2784, https://arxiv.org/pdf/2605.20189.pdf

Authors: Nitin Vetcha, Dianbo Liu

Title: SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Abstract:
Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real‑world settings with primary challenges being concept drift and the high cost of gradient‑based adaptation. Traditional fine‑tuning (FT) struggles to adapt to non‑stationary data streams without resulting in catastrophic for getting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self‑Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open‑ended autonomous agent that leverages parameter‑level meta‑learning to self‑improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common‑sense knowledge making it effective for transfer‑learning. By utilizing a multi‑level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test‑time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta‑knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common‑sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

PaperID: 2785, https://arxiv.org/pdf/2605.20072.pdf

Authors: Oussama Zenkri, Oliver Brock

Title: Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Abstract:
Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed‑loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB‑D, and ground‑truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground‑truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85‑fold success rate increase over the noise‑free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.

PaperID: 2786, https://arxiv.org/pdf/2605.20051.pdf

Authors: Tian Dong, Yanjun Chen, Shoufeng Zhang, Huaien Zhang, Yunlong Lyu, Keke Lian, Dong Zhang, Shaofeng Li, Hao Chen

Title: Hunting Vulnerability Variants in AI Infra: Measurement and Reference-Driven Detection

Abstract:
AI infra has become a shared execution layer for model training, deployment, and agent orchestration. Because many projects reimplement similar model‑centric workflows, a vulnerability disclosed in one repository can recur as a variant in another repository with a related design. Yet the prevalence and detectability of these variants remain poorly understood. This paper presents a measurement study of vulnerability variants in AI infra. Analyzing 688 GitHub repositories and 251 publicly disclosed vulnerabilities, we find that AI infra projects frequently share overlapping functionality and recurrent vulnerable patterns, creating a concrete basis for cross‑repository variants. Building on this finding, we study how to automatically identify such variants from known disclosures. We propose INFRASCOPE, a reference‑driven multi‑agent framework that extracts transferable vulnerability semantics from known cases and uses them to locate and validate variants in new repositories. Evaluating INFRASCOPE on 20 real‑world AI infra repositories, we uncover over 20 vulnerabilities, including 11 acknowledged cases and 4 cases that have been assigned CVEs so far.

PaperID: 2787, https://arxiv.org/pdf/2605.20023.pdf

Authors: Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu

Title: When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Abstract:
Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2~percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for \emphwhen Skills help and when they are merely redundant overhead. We re‑analyze a recently published 180‑run controlled study of an MCP‑grounded autonomous Capture‑the‑Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No‑Skills, Experiential‑Skills, Curated‑Skills, and Comprehensive‑Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no‑Skills and full‑Skills conditions is only 8.9~pp (p = 0.71, χ^2; p = 0.25, Cochran‑‑Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small‑effect threshold). We argue that the missing variable is \emphenvironment‑feedback bandwidth. When an agent's tool layer returns strict, schema‑validated, low‑latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side‑channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.

PaperID: 2788, https://arxiv.org/pdf/2605.19992.pdf

Authors: Yongchun Bi, Jun Zheng, Guchuan Zhu, Jiye Zhang

Title: Robust synchronization for multi-agent systems governed by PDEs with observable and unobservable disturbances

Abstract:
This paper investigates robust synchronization for multi‑agent systems (MASs) governed by parabolic partial differential equations in the presence of both observable and unobservable disturbances. Using only boundary output measurements, a disturbance observer is designed to estimate observable Dirichlet boundary disturbances while ensuring robustness of the observer error system with unobservable disturbances occurring in the domain. Using only the reference signal and local output information, distributed synchronization controllers are then constructed to enable all agents to track the reference trajectory. In particular, exponential tracking is achieved in the absence of unobservable disturbances, while robustness is preserved when additional unobservable disturbances occur during controller implementation. We further analyze the impact of unobservable Dirichlet‑Robin boundary disturbances on synchronization performance by proving the boundedness of solutions to the synchronization error system. Moreover, to characterize the influence of all disturbances, input‑to‑state stability (ISS) is established for the closed‑loop system. For the involved systems, the generalized Lyapunov method and the recursion technique are extensively employed in the stability analysis, and the lifting technique and semigroup theory are used to prove the well‑posedness. Simulation results validate the proposed control scheme, demonstrating effective disturbance estimation and rejection, robust synchronization, and the ISS properties under various scenarios.

PaperID: 2789, https://arxiv.org/pdf/2605.19952.pdf

Authors: Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

Title: Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Abstract:
To enable reliable long‑term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted fact based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact‑centric designs inevitably discard fine‑grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities, including raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad‑based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at https://TMLR‑TriMem.github.io .

PaperID: 2790, https://arxiv.org/pdf/2605.19932.pdf

Authors: Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

Title: PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Abstract:
Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task‑level strategies. None of them preserves what we argue is most needed for repeated same‑context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant‑sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference‑time signals, a Cartographer that translates it into structured edits, and a priority‑based Evictor that enforces a fixed token budget. On long‑context reasoning and information aggregation, PEEK improves over strong baselines by 6.3‑34.0% while using 93‑145 fewer iterations and incurring 1.7‑5.8x lower cost than the state‑of‑the‑art prompt‑learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0‑14.0% and 7.8‑12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production‑grade coding agent. Together, these results show that a context map helps long‑context LLM agents interact with recurring external contexts more accurately and efficiently.

PaperID: 2791, https://arxiv.org/pdf/2605.19915.pdf

Authors: Xin He, Junxi Shen, Yuchen Mou, David M. Bossens, Caishun Chen, Ivor W. Tsang, Yew Soon Ong

Title: LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions

Abstract:
Classical models of opinion dynamics assume human participants with bounded rationality and limited coordination. The rise of LLM‑based agents introduces a qualitative shift: agents can now participate in online discussions at scale, maintain consistent persuasion strategies, and coordinate systematically. This paper argues that LLM agents make collective belief dynamics programmable, enabling deliberate steering of population‑level beliefs. We term this emerging problem programmable collective belief control. Through controlled multi‑agent simulations, we provide proof‑of‑concept evidence that coordinated AI agents can induce measurable belief shifts that stabilize within a few interaction rounds. We identify four structural properties (indistinguishability, persistence, contextuality, and configurability) that make detection and defense fundamentally difficult. Based on these findings, we outline a research agenda spanning theoretical foundations for adversarial belief dynamics, operational methods for system‑level detection and intervention, and simulation infrastructure for scalable experimentation. Our goal is not to present a complete solution, but to articulate why this problem demands urgent attention and to provide a conceptual foundation for future work.

PaperID: 2792, https://arxiv.org/pdf/2605.19834.pdf

Authors: Yiyao Xu, Hao Zhou, Yuhang Wang, Jingran Sun

Title: A Closed-loop, State-centric, Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams

Abstract:
To support operations and passenger‑facing services, transit agencies need reliable passenger load trajectories. Currently, load estimates are typically inferred from imperfect sensing systems rather than fully observed, and the accuracy of modern automatic passenger counting (APC) systems still varies with station layout, flow intensity, and operating conditions. To address the challenges of robust passenger load estimation from heterogeneous data streams, including incremental count errors, evidence conflicts, and context‑dependent sensor reliability, we propose a closed‑loop, state‑centric, multi‑agent framework. This method enforces physical feasibility at every step, allocates trust dynamically among evidence sources, and feeds physics‑derived violation residuals back into training for robustness improvement. The architecture consists of a unified stop‑event backbone, a coupled Perception‑‑Physical‑‑Fusion loop for stop‑by‑stop inference, and optional trip‑level macro‑correction and closed‑loop calibration modules.

PaperID: 2793, https://arxiv.org/pdf/2605.19798.pdf

Authors: Lucie Galland, Chloé Clavel, Magalie Ochs

Title: Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

Abstract:
As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust‑calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT‑5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within‑subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions.

PaperID: 2794, https://arxiv.org/pdf/2605.19782.pdf

Authors: Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova, Evgeny Burnaev, Egor Shvetsov

Title: Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

Abstract:
LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose‑evaluate‑revise loop. Such optimization or discovery progresses via context conditioning on received feedback from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black‑box optimization, LLMs act as greedy optimizers. (2) In zero‑shot kernel generation, providing explicit input‑size information has no measurable effect, models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback‑loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with low‑density language. Our results conclude that LLMs in code optimization tasks highly depend on pretrained priors rather than provided feedback or agentic structure.

PaperID: 2795, https://arxiv.org/pdf/2605.19779.pdf

Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu

Title: Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

Abstract:
We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution‑free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases then reconverges. We further develop compositional uncertainty bounds for multi‑agent pipelines (validated via simulation across inter‑stage correlations rho in [‑0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false‑ranking rate, and FDR‑corrected abstention for leaderboard‑scale multiple testing. Evaluating 50 agents via 18 real‑time signals collected hourly, we show that per‑agent conditional coverage is well‑concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross‑source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity‑controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.

PaperID: 2796, https://arxiv.org/pdf/2605.19743.pdf

Authors: Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

Title: EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

Abstract:
Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi‑agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands‑including direct tool use, semantic disambiguation, conditional branching, and working‑memory tasks; (2) a Retrieval‑Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) an High Performance Computing (HPC) benchmark evaluating end‑to‑end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi‑Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96‑97% average task completion on Beams2D, while open‑source 4B‑parameter models reach 55‑78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20‑53% for the conditional style on Photonics2D. RAG gating confirms near‑perfect retrieval‑augmented scores (about 1.0) versus near‑zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi‑step instruction following degrades over long‑running workflows.

PaperID: 2797, https://arxiv.org/pdf/2605.19634.pdf

Authors: Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

Title: P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Abstract:
Vision‑and‑language navigation (VLN) requires an embodied agent to ground natural‑language instructions into executable navigation actions in unseen environments. Existing zero‑shot methods typically rely on additional waypoint prediction modules, which often entangle high‑level directional reasoning with fine‑grained local grounding, leading to error‑prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero‑shot vision‑and‑language navigation. P2DNav consists of three core components: Panorama‑to‑Downview (P2D), Sliding‑Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision‑making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction‑relevant direction from a 360° panorama, and then predicts a pixel‑level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi‑turn dialogue context and maintains recent visual observations within a sliding window to support long‑horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R‑CE benchmark show that P2DNav achieves strong performance among zero‑shot methods. In particular, compared with the state‑of‑the‑art (SOTA) zero‑shot waypoint‑based and waypoint‑free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero‑shot VLN. Code will be released for public use.

PaperID: 2798, https://arxiv.org/pdf/2605.19604.pdf

Authors: Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang

Title: Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Abstract:
Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural‑language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime‑native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook‑governed control logic, Formal Skill routing, and skill‑local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token‑efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open‑source event‑driven runtime for executable, observable, and composable Formal Skills. On Harness‑Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

PaperID: 2799, https://arxiv.org/pdf/2605.19595.pdf

Authors: João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

Title: A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

Abstract:
The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning‑based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26‑MoE, a novel object detection architecture that integrates a sparse Mixture‑of‑Experts (MoE) module into the high‑resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one‑stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool‑augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV‑based insulator fault detection.

PaperID: 2800, https://arxiv.org/pdf/2605.19447.pdf

Authors: Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen

Title: What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Abstract:
Reinforcement learning can train LLM agents from sparse task rewards, but long‑horizon credit assignment remains challenging: a single success‑or‑failure signal must be distributed across many actions. Existing methods rely on trajectory‑level rewards or proxy signals, without fully leveraging per‑step environmental feedback. Multi‑turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment‑reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action‑relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

PaperID: 2801, https://arxiv.org/pdf/2605.19418.pdf

Authors: Longgang He, Longzhu He, Daojing He, Chaozhuo Li

Title: Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

Abstract:
LLM‑based multi‑agent systems (MAS) have demonstrated strong reasoning and decision‑making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph‑based MAS frameworks (1) propagate errors when conflicting signals arise without control, and (2) lack explicit modeling of conflicting inter‑agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph‑informed Multi‑Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence‑weighted edges. Reasoning proceeds through conflict‑aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure‑ and conflict‑aware weighted aggregation to yield globally consistent and conflict‑resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi‑agent configurations, demonstrate that SIGMA consistently outperforms state‑of‑the‑art baselines, achieving notable gains in both accuracy and conflict‑resilient performance.

PaperID: 2802, https://arxiv.org/pdf/2605.19399.pdf

Authors: Pei-Huan Tsai, Kuan-Lin Chiu, William Baisi, Pin-Yu Chen, Luca P. Carloni

Title: HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip

Abstract:
Large language models (LLMs) are adopted for software and hardware design, yet these domains are still evaluated separately. Software benchmarks typically assume fixed hardware targets, while hardware benchmarks focus on component‑level optimization without considering the full hardware‑software stack. Consequently, no existing benchmark evaluates whether an LLM agent can perform end‑to‑end, system‑level hardware‑software co‑design. Such a process requires: 1) analyzing applications to identify kernels requiring acceleration, 2) designing and integrating heterogeneous accelerators into a System‑on‑Chip (SoC) under resource constraints, and 3) mapping kernels onto the generated accelerators. We present HSCO‑Bench, an end‑to‑end hardware‑software co‑design benchmark for accelerator‑rich heterogeneous SoC generation. Built upon an open‑source SoC platform with a curated repository structure, HSCO‑Bench evaluates the ability of LLMs to jointly optimize software and hardware stacks, producing SoC prototypes deployed on the AMD Virtex‑7 FPGA VC707 Evaluation Kit. Experimental results show that end‑to‑end integration remains challenging for current models. Among the five frontier models evaluated, only two of them could successfully generate valid SoC prototypes. Yet, even in these successful instances, the generated designs are far from optimal. While we observe a promising peak speedup of 16.22X, the maximum additional resource utilization reaches only 23.67%. This highlights that while state‑of‑the‑art models demonstrate an emerging capability for hardware acceleration, they still heavily underutilize the available hardware capacity, leaving room for future optimization. To the best of our knowledge, HSCO‑Bench is the first benchmark targeting this complete co‑design flow, enabling LLMs to jointly reason about and modify both the software and hardware stacks of heterogeneous SoCs.

PaperID: 2803, https://arxiv.org/pdf/2605.19362.pdf

Authors: Zikai Alex Wen

Title: Toward User Comprehension Supports for LLM Agent Skill Specifications

Abstract:
Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule‑based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n=6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no‑example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent‑skill evaluation should treat specifications as user‑facing capability disclosures, not merely as containers for executable instructions.

PaperID: 2804, https://arxiv.org/pdf/2605.19337.pdf

Authors: Yihan Xia, Panpan You, Taotao Wang, Fang Liu, Han Qi, Xiaoxiao Wu, Shengli Zhang

Title: Agentic Trading: When LLM Agents Meet Financial Markets

Abstract:
A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM‑based trading agents as expert‑system decision pipelines and presents an audit‑oriented evidence map of 77 included studies in a protocol‑coded snapshot screened through 2026‑03‑09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed‑Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time‑consistent split protocols, 1/19 reports an explicit transaction‑cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture‑Capability‑Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.

PaperID: 2805, https://arxiv.org/pdf/2605.19330.pdf

Authors: Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel, Sunav Choudhury

Title: MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Abstract:
LLM agents organize behavior through skills ‑ structured natural‑language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi‑field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co‑resident skills compete for limited context windows. These constraints make skill optimization inherently multi‑objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade‑offs or collapse them into a weighted sum, missing Pareto‑optimal variants in non‑convex objective regions. We introduce MOCHA (Multi‑Objective Chebyshev Annealing), which replaces single‑objective selection with Chebyshev scalarization ‑ covering the full Pareto front, including non‑convex regions ‑ combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills ‑ where all methods share the same multi‑objective mutation operator and baselines receive identical per‑objective textual feedback ‑ existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many more Pareto‑optimal skill variants.

PaperID: 2806, https://arxiv.org/pdf/2605.19319.pdf

Authors: Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

Title: SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Abstract:
Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task‑relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task‑level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX‑Kontext under the same robotic data setting, and find that image editing produces more reliable task‑level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one‑shot sparse visual planning framework that progressively generates a sequence of task‑relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow‑based spatial guidance. A goal‑conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed‑training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

PaperID: 2807, https://arxiv.org/pdf/2605.19316.pdf

Authors: Seonjeong Hwang, Jun Seo, Hyounghun Kim, Gary Geunbae Lee

Title: A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Abstract:
Recent studies in difficulty‑controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty‑related features. However, existing methods typically rely on a single‑agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi‑agent Framework for Feature‑constrained Item Generation, where multiple LLM agents and feature‑specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty‑calibrated constraint sequence.

PaperID: 2808, https://arxiv.org/pdf/2605.19276.pdf

Authors: Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

Title: OpenCompass: A Universal Evaluation Platform for Large Language Models

Abstract:
In recent years, the field of artificial intelligence has undergone a paradigm shift from task‑specific small‑scale models to general‑purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset‑based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross‑domain and large‑scale model evaluation. To address the aforementioned issues, this paper proposes and open‑sources OpenCompass, a one‑stop, scalable, and high‑concurrency‑supported general‑purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule‑based, LLM‑as‑a‑Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

PaperID: 2809, https://arxiv.org/pdf/2605.19270.pdf

Authors: Linyue Cai, Samuel Yeh, Jwala Dhamala, Rahul Gupta, Sharon Li

Title: DECOR: Auditing LLM Deception via Information Manipulation Theory

Abstract:
Large language models can deceive by subtly manipulating truthful information ‑‑ omitting key facts, shifting focus, or obscuring meaning ‑‑ making such behavior difficult to detect. Existing black‑box methods rely on coarse‑grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi‑agent framework grounded in Information Manipulation Theory for fine‑grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single‑turn and multi‑turn deception detection benchmarks spanning real‑world domains, and show that DECOR achieves state‑of‑the‑art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine‑grained, theory‑grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.

PaperID: 2810, https://arxiv.org/pdf/2605.19265.pdf

Authors: Dawei Tian, Jiakun Liu, Yun Peng, Yichen Zhang, Jianlei Chi, Jun Sun, Xiaohong Su

Title: MuMuTestUp: Mutation-based Multi-Agent Test Case Update

Abstract:
Modern software systems evolve rapidly under CI/CD practices, where tests are critical for quality. However, substantial code changes often render existing test cases obsolete, causing pipeline disruptions, reduced productivity, and compromised quality. Recent automatic test update approaches leverage LLMs to refine test cases via execution feedback and exact‑matching context retrieval, prioritizing executability and line coverage but suffering three limitations: (1) neglecting test assertion adequacy, weakening fault detection; (2) relying on coarse line coverage instead of specific uncovered lines/branches; (3) using exact‑matching retrieval, which fails for LLM hallucinated queries. To address these, we propose MuMuTestUp, a mutation‑guided multi‑agent framework with three specialized agents: Mutation Analysis (strengthens assertions via surviving mutants), Coverage Analysis (generates targeted repair instructions for uncovered lines/branches), and Semantic Retrieval (handles hallucinations via semantic‑similarity search). We also construct PRBENCH, a 571‑sample pull‑request‑level dataset from 10 open‑source Java projects (validated for cross‑commit update scenarios). Evaluations against state‑of‑the‑art baselines use both open‑source (Deepseek‑V3.2) and closed‑source (GPT‑4.1) LLMs.

PaperID: 2811, https://arxiv.org/pdf/2605.19235.pdf

Authors: Zhiyuan Fan, Gabriele Farina

Title: GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

Abstract:
Competitive multi‑agent reinforcement learning in imperfect‑information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self‑play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self‑play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing Q‑boosting, a variance‑reduced advantage estimator based on a centralized action‑value critic, and propose Variance‑Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi‑step backups with a multi‑step Expected SARSA(λ) trace, computing policy expectations at each step to average out action‑sampling noise, while retaining PPO's clipped objective and on‑policy actor updates. Empirically, VRPO consistently achieves strong performance from mid‑sized to large‑scale games including Dou Dizhu and Heads‑Up No‑Limit Texas Hold'em.

PaperID: 2812, https://arxiv.org/pdf/2605.19228.pdf

Authors: Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

Title: Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Abstract:
Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step‑by‑step solutions, but diagnosing where a multi‑step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed‑source LLMs that assigns step‑level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non‑parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph‑based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi‑hop question answering show that SCA reliably identifies low‑confidence steps strongly correlated with reasoning errors. Moreover, using step‑level confidence to guide self‑correction improves the correction success rate by up to 13.5% over answer‑level feedback.

PaperID: 2813, https://arxiv.org/pdf/2605.19196.pdf

Authors: Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

Title: Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Abstract:
Deep research agents increasingly automate complex information‑seeking tasks, producing evidence‑grounded reports via multi‑step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM‑as‑judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta‑evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta‑evaluations fall short in two ways: (1) reliance on coarse, subjective human‑preference agreement; (2) focus on instruction‑following or verifiable tasks, leaving open‑ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine‑grained LLM judge Evaluation via Controlled inTervention), a meta‑evaluation benchmark targeting fine‑grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process‑ and outcome‑level failure modes, instantiated by performing controlled and localized interventions on quality‑screened agent execution traces. This yields verifiable, comprehensive, and fine‑grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best‑performing models achieve overall accuracies below 55% across reasoning, tool‑use, and report‑quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

PaperID: 2814, https://arxiv.org/pdf/2605.19194.pdf

Authors: Rui Chu

Title: MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Abstract:
The Mixture‑of‑Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM‑based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context‑aware aggregation. We evaluate MMoA on standard instruction‑following benchmarks, including AlpacaEval 2.0, MT‑Bench, and Arena‑Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi‑agent LLM systems.

PaperID: 2815, https://arxiv.org/pdf/2605.19193.pdf

Authors: Andrea Morandi

Title: Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

Abstract:
Multi‑agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over‑spending on easy items and under‑spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug‑in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log‑likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best‑effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type‑I/type‑II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte‑Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real‑LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt‑5, claude‑opus‑4‑6, gemini‑2.5‑pro) and a claude‑opus‑4‑6 judge, using disjoint 40‑item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed‑5 debate at 15 calls: a 3.7x call reduction at ‑2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute‑control and failure‑detection layer for multi‑agent LLM systems.

PaperID: 2816, https://arxiv.org/pdf/2605.19180.pdf

Authors: Rongqi Pan, Mahboubeh Dadkhah, Jean Baptiste Minani, Hussein Al Osman, Lionel Briand, Haiwei Dong

Title: Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems

Abstract:
Technical documents contain rich domain knowledge for automating downstream tasks such as system testing. While this paper focuses on Ethernet switch configuration manuals (ESCMs), we propose a general framework that can be adapted to different industrial contexts. ESCMs provide valuable domain knowledge for Ethernet switch testing, but their semi‑structured format, implicit step attributes, and complex section dependencies make them difficult to directly leverage for test automation. To address this, we generate knowledge graphs (KGs) that capture configuration knowledge from ESCM in a structured form. We propose a multi‑agent LLM‑based framework that extracts, evaluates, and improves KGs from ESCMs using a fine‑grained KG schema and an iterative Extract‑Evaluate‑Improve (EEI) loop. Our evaluation on 50 real‑world ESCMs shows that our framework achieves high extraction correctness using the original prompts, with average correctness scores ranging from 0.97 to 0.99 across three extraction tasks. For challenging ESCMs, the EEI loop further improves correctness through manual‑specific prompt refinement. Moreover, the LLM judgments and human evaluations show substantial agreement, with Cohen's kappa of at least 0.72 across all extraction tasks. Finally, feedback from industry testers indicates that the generated KGs can support the generation of useful and correct test case specifications (TCSs) for downstream testing.

PaperID: 2817, https://arxiv.org/pdf/2605.19140.pdf

Authors: Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

Title: Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Abstract:
We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories ‑‑ the operating regime of multi‑agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface‑constrained semi‑Markov decision process (IC‑SMDP), whose decision epochs occur at handoff times, and design IC‑Q, an asynchronous decentralized Q‑learning algorithm in which cross‑agent coordination at every handoff is exactly one scalar. Our main result is a finite‑sample bound for neural IC‑Q that decomposes into three independently controllable error sources: neural function‑approximation error, interface representation gap, and a mixing‑time residual, under the random option‑duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single‑agent primitive‑step MDPs to multi‑agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite‑sample guarantee for neural Q‑learning under decentralized partial observability. Four experiments: a controlled synthetic IC‑SMDP that validates the bound term‑by‑term, multi‑LLM mathematical reasoning, multi‑agent routing, and multi‑agent CPU programming, show that IC‑Q matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

PaperID: 2818, https://arxiv.org/pdf/2605.19127.pdf

Authors: Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

Title: POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

Abstract:
LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third‑party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third‑party systems behave adversarially. We introduce POLAR‑Bench (Policy‑aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third‑party model that adversarially probes for both task‑relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set‑membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5 times 5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open‑weight models in the 1‑‑30B range, the class users most commonly run as their own trusted agent on‑device or via private inference, score notably worse, with the weakest leaking over half. POLAR‑Bench thus localizes where each model's intent‑following breaks down, providing a foothold for privacy alignment where it matters most.

PaperID: 2819, https://arxiv.org/pdf/2605.19038.pdf

Authors: Lorenzo Bonin, Francesco Giacomarra, Luca Bortolussi, Jyotirmoy V. Deshmukh, Francesca Cairoli

Title: Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

Abstract:
The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real‑world traffic scenes ‑‑ a brute‑force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety‑critical edge cases essential for validating real‑world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety‑critical driving scenarios. STRELGen synergistically combines a multi‑agent trajectory‑generation diffusion model (DM) with Spatio‑Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient‑based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety‑critical multi‑agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress‑testing autonomous driving systems, moving beyond the limitations of brute‑force data collection.

PaperID: 2820, https://arxiv.org/pdf/2605.19010.pdf

Authors: Olena Bogdanov, Yeunji Jung, Chandra Dhir, Pareekshitreddy Gaddam, Saurabh Jain, Lakshmi Tumati, Vijay Parthasarathy, Anup Shirgaonkar

Title: AgentNLQ: A General-Purpose Agent for Natural Language to SQL

Abstract:
Natural language to SQL (NL2SQL) conversion is an important problem for researchers and enterprises due to the ubiquitous importance of relational databases in broad‑ranging practical problems. Despite the rapid advancements in the capabilities of LLMs, NL2SQL has not reached parity in accuracy with human expert SQL writers, hence needing additional improvements in NL2SQL algorithms. This study presents a new multi‑agent method for NL2SQL that achieves 78.1% semantic accuracy on the BIg Bench for LaRge‑scale Database (BIRD) benchmark. Our method leverages a semantically enriched representation of user‑provided schema, adds user‑provided business rules, and produces accurate SQL queries. The main contributions of this study are (a) We designed an optimized new orchestrator in a multi‑agent solution that uses LLMs to plan, orchestrate, reflect, and self‑correct to generate accurate SQL queries, (b) We developed an advanced schema enrichment method that creates context‑aware metadata to improve accuracy, and (c) We demonstrated the accuracy and generalizability of the method across different domains and datasets by evaluating it on the BIRD‑SQL benchmark.

PaperID: 2821, https://arxiv.org/pdf/2605.18930.pdf

Authors: Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li

Title: OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

Abstract:
Memory‑augmented large language model (LLM) agents use iterative reflection and self‑evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce agent to generate experiences that appear locally correct and semantically plausible yet induce harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low‑privilege black‑box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean edge‑cases that combine locally correct solutions, non‑transferable methods, and severe consequences, biasing reflection toward risk‑averse rule formation. During memory consolidation, agents may over‑trust self‑generated reflections and distill localized experiences into high‑priority but over‑generalized rules, causing downstream failures. Evaluations across three domains show that OEP achieves ASR above 50% with GPT‑4o agents, and outperforms existing attacks under LLM auditing defense.

PaperID: 2822, https://arxiv.org/pdf/2605.18858.pdf

Authors: Zhaohui Wang

Title: When Individually Calibrated Models Become Collectively Miscalibrated

Abstract:
Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi‑agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game‑theoretic sense of Brier‑optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier‑score‑based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive‑class probability, yielding a Price of Anarchy greater than one whenever Cov(b_i, b_j) > 0. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false‑negative rate reaches 7.25x. In contrast, VCG‑based aggregation aligns incentives by rewarding marginal contribution, achieving dominant‑strategy incentive compatibility and near‑optimal performance. Experiments on three real‑world datasets (NSL‑KDD, UNSW‑NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. It performs particularly well in data‑sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.

PaperID: 2823, https://arxiv.org/pdf/2605.18850.pdf

Authors: Adrian Cierpka, Mohammad Shafiqul Islam, Johannes Steinhülb, Eric Dietriche Sesso Domtchoueng, Michael Selzer, Arnd Koeppe

Title: KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

Abstract:
We introduce KadiAssistant, a privacy‑by‑design AI assistant integrated into the Kadi research data ecosystem, enabling researchers to efficiently access, aggregate, and synthesize information from heterogeneous, privacy‑sensitive research data. Interdisciplinary fields such as materials science bring together disciplines with their own terminology and standards. While this convergence fuels innovation, it also makes it increasingly difficult to connect and access knowledge, as data are distributed across disciplines, organizations, and individuals. For example, battery research combines electrochemical measurements, materials characterization data, physics‑based simulations, and manufacturing parameters, each using different formats, vocabularies, and standards. Efficiently storing and sharing such heterogeneous data via research data platforms, such as Kadi4Mat, demands domain knowledge, technical expertise, and familiarity with metadata schemas and interfaces. Research data also vary in sensitivity: newly generated 'warm' data are often private, whereas published 'cold' data are usually openly accessible. The Kadi ecosystem offers fine‑grained access control needed for sensitive data. A solution for efficient information retrieval in Kadi must therefore respect the fine‑grained access permissions. To address these intertwined challenges of information retrieval, strong data privacy, and complex access control, KadiAssistant combines a self‑hosted large language model (LLM) with a privacy‑preserving semantic search, inspired by retrieval‑augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer. KadiAssistant therefore bridges terminology and standards, lowers access barriers for researchers, and strengthens the Findable pillar of FAIR data principles.

PaperID: 2824, https://arxiv.org/pdf/2605.18824.pdf

Authors: Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

Title: Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Abstract:
Evaluation of foundation models often rely on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine‑grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi‑agent architecture for problem generation and a solution‑graph‑driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground‑truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open‑source models shows that our benchmarks achieve near‑uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open‑source the framework and our curated benchmarks soon.

PaperID: 2825, https://arxiv.org/pdf/2605.18822.pdf

Authors: Chengqian Zhang, Wei Zhu, Kyumin Lee

Title: Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Abstract:
Post‑training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi‑step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post‑training paradigm for improving reasoning capabilities, with critic‑free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post‑training with full fine‑tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter‑efficient fine‑tuning (PEFT) methods, such as Low‑Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine‑tuning in post‑training for complex reasoning tasks. In this paper, we propose Hybrid‑LoRA, an efficient hybrid post‑training framework that selectively applies full fine‑tuning to a small subset of modules less suited to low‑rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid‑LoRA Score to rank candidate modules according to their sensitivity to low‑rank adaptation under a fixed parameter budget. Experiments show that Hybrid‑LoRA closely matches full fine‑tuning performance under a 10% full fine‑tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state‑of‑the‑art PEFT post‑training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.

PaperID: 2826, https://arxiv.org/pdf/2605.18809.pdf

Authors: Zuyuan Zhang, Sizhe Tang, Mahdi Imani, Tian Lan

Title: Metric-Gradient Projection for Stable Multi-Agent Policy Learning

Abstract:
General‑sum multi‑agent learning is often governed by a stacked update field in which each agent's policy update changes the optimization landscape faced by the others. This coupling can entangle an integrable component of collective improvement with cyclic interaction dynamics, leading to slow or unstable multi‑agent learning. Existing approaches, such as regularization, credit assignment, and consensus methods, stabilize MARL through local or algorithmic modifications; HPML complements them by projecting the joint update field onto a metric‑gradient component. We introduce HPML (Hodge‑Projected Multi‑agent Learning), which views the joint update field of a multi‑agent system as an element of an L^2 space of vector fields and computes a Hodge‑type projection onto the closest metric‑gradient potential flow. HPML follows the projected component as the update direction, yielding the closest metric‑gradient field under the chosen metric and sampling measure. The projection is defined variationally, characterized by a Poisson‑type equation, and implemented through graph‑based and amortized neural realizations that recover projected directions from samples. We show that the projected dynamics admit a Lyapunov potential and yield equilibrium‑gap bounds with an explicit additive non‑potentiality term. Controlled experiments validate the geometric mechanism, and CTDE benchmarks show improved stability and normalized return when HPML is used as a plug‑in projection layer in MARL pipelines.

PaperID: 2827, https://arxiv.org/pdf/2605.18729.pdf

Authors: Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng, Zeyuan Ding, Xiancong Ren, Zhang Zhang, Qifeng Chen, Jian Liu, Yong Dai, Xiaozhu Ju

Title: Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Abstract:
The ability to navigate and interact with complex environments is central to real‑world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory‑driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo‑Cortex, a self‑evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection‑adaptation loop. By abstracting success patterns and failure pitfalls into natural‑language heuristics, Robo‑Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual‑Grain Cognitive Memory system, comprising a Short‑term Reflective Memory (SRM) for real‑time local progress analysis, and a Long‑term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision‑making, we introduce a multimodal Imagine‑then‑Verify loop, where a world model simulates potential outcomes and a VLM‑based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo‑Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real‑world robotic experiments further support the effectiveness of Robo‑Cortex in physical settings.

PaperID: 2828, https://arxiv.org/pdf/2605.18693.pdf

Authors: Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, QianyuXu, Ronghao Chen, Huacan Wang, Sen Hu

Title: SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

Abstract:
As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task‑conditioned generation, where a task‑specific skill is synthesized after the task is revealed, and task‑agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository‑grounded instances, where procedures are distributed across code, configuration, and scripts, and document‑grounded instances, where procedures and constraints must be distilled from long‑form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution‑based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill‑generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long‑form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.

PaperID: 2829, https://arxiv.org/pdf/2605.18684.pdf

Authors: Sanderson Oliveira de Macedo, Ronaldo Martins da Costa

Title: Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

Abstract:
Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language‑model‑based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for AI agents. Reversa organizes this process as a multi‑agent pipeline: specialized agents map the project surface, analyze modules, extract implicit rules, synthesize architecture, write unit‑level specifications, and review generated claims. The proposal emphasizes three mechanisms: traceability between code and specification, explicit confidence marking, and preservation of gaps for human validation. The framework is distributed as a Node.js CLI, installs skills across multiple agent engines, and uses a SHA‑256 manifest to preserve modified files during update or uninstall operations. In addition to the architectural description, we report an exploratory case study on migrating an ATM from COBOL to Go, in which the pipeline produced 517 claims classified by an internal confidence index, 10 registered gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9 of 11 tasks completed at inventory time. Final parity validation and cutover were not completed in this study. We do not claim broad empirical superiority; we position the contribution with respect to the literature on reverse engineering, LLM‑based documentation, and software agents, and propose an evaluation protocol with metrics for coverage, traceability, confidence, utility, and cost.

PaperID: 2830, https://arxiv.org/pdf/2605.18672.pdf

Authors: S. Bensalem, Y. Dong, M. Franzle, X. Huang, J. Kroger, D. Nickovic, A. Nouri, R. Roy, C. Wu

Title: Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

Abstract:
This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents ‑‑ a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation ‑‑ semantic intent and policy compliance, environmental validity, and dynamical feasibility ‑‑ each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract‑based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer's assumption. We sketch such an architecture and derive the compositional system‑level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non‑i.i.d.\ traces, graceful degradation of contracts under deployment drift, and extension to multi‑agent settings ‑‑ the most important unfinished business in LLM agent runtime assurance.

PaperID: 2831, https://arxiv.org/pdf/2605.18597.pdf

Authors: Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, Yu Sun, Cheng Yang, Siru Ouyang, Jiri Gesi, Fang Wu, Jiayi Zhang, Huaming Chen, Bang Liu, Xiangru Tang, Chenglin Wu

Title: Latent Action Reparameterization for Efficient Agent Inference

Abstract:
Large language model (LLM) agents often rely on long sequences of low‑level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system‑level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi‑step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand‑crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM‑based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall‑clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

PaperID: 2832, https://arxiv.org/pdf/2605.18572.pdf

Authors: Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, Deyu Zhou

Title: MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

Abstract:
Persuasive dialogue generation plays a vital role in decision‑making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy‑consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA^2P, a meta‑cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi‑agent architecture that coordinates perception management, mental‑state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross‑domain performance variation, we further design a meta‑cognitive configurator that selects an appropriate meta‑strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

PaperID: 2833, https://arxiv.org/pdf/2605.18566.pdf

Authors: Lekan Molu, Venkatraman Renganathan, Namhoon Cho

Title: HJ-Gauss: A Monte-Carlo HJ Reachability Scheme

Abstract:
Backward reachable tubes (BRTs), computed via viscous Hamilton‑Jacobi (HJ) partial differential equations, provide principled safety certificates for learned controllers and planning algorithms in trustworthy machine learning. However, classical grid‑based HJ solvers require O(M^n) memory footprint for M grid points per n state dimension. This renders them impractical for high‑dimensional systems. We address this bottleneck with a local PDE linearization that enables a frozen‑coefficient sampling scheme for the viscous HJ PDE: a generalized Cole‑Hopf‑type transformation reduces the nonlinear HJ equation to a sequence of linear heat equations whose solutions admit Gaussian heat‑kernel representations. The value function and its spatial gradient are then recovered via roll‑outs of Monte Carlo expectations on Gaussian densities, yielding a storage and grid‑free algorithm that scales as N\cdot n for N samples. This decoupling of memory from dimensionality enables reachability analysis on problems where grid‑based methods are simply impossible. We prove a finite‑sample concentration bound O(N^‑1/2) error and conditional linear convergence for the introduced Monte‑Carlo Picard iterative scheme. Numerical validation on pursuit‑evasion games demonstrates relative L^2_\textrel errors of 0.03 ‑ 0.20, with 14‑26 second wall‑clock times per 2D slice on a CPU. Crucially, the method scales with validation on up to (but not limited to) n=45‑dimensional multi‑agent games.

PaperID: 2834, https://arxiv.org/pdf/2605.18476.pdf

Authors: Jungang Zou, Alex Ziyu Jiang, Qixuan Chen

Title: AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

Abstract:
Coding and computation remain major bottlenecks in Markov chain Monte Carlo (MCMC) workflows, especially as modern sampling algorithms have become increasingly complex and existing probabilistic programming systems remain limited in model support, extensibility, and composability. We introduce AI4BayesCode, an extensible LLM‑driven system that translates natural‑language Bayesian model descriptions into runnable, validated MCMC samplers. To improve reliability, AI4BayesCode adopts a modular design that decomposes models into modular sampling blocks and maps each block to a built‑in sampling component, reducing the need to implement complex sampling algorithms from scratch. Reliability is further improved through pre‑generation validation of model specifications and post‑generation validation of generated sampler code. AI4BayesCode also introduces a novel recursively stateful coding paradigm for MCMC, allowing modular sampling components, potentially developed by different contributors, to be composed coherently within larger MCMC procedures. We develop a benchmark suite to evaluate AI4BayesCode for sampler‑generation. Experiments show that AI4BayesCode can implement a wide range of Bayesian models from natural‑language descriptions alone. As an open‑ended system, its capability can continue to expand with improvements in the underlying AI agent and the addition of new built‑in blocks.

PaperID: 2835, https://arxiv.org/pdf/2605.18407.pdf

Authors: Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, Ming Yin, Mayank Sengupta, Kristina Wolinski, Yanyu Jia, Jingzhi Shi, Derek Saucedo, Neill Saggi, Haosen Guan, Kenji Watanabe, Takashi Taniguchi, Ali Yazdani, Mengdi Wang, Sanfeng Wu

Title: Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Abstract:
While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real‑world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high‑level reasoning, multimodal information processing and real‑time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini‑laboratory, Qumus is an intelligent, multimodal, and multi‑agent system designed for the creation and nano‑processing of atomically thin two‑dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi‑step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI‑creation of graphene, as well as the first AI‑fabrication of complex nanodevices including atomically thin field‑effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed‑loop experimentation. Our results establish a generalizable framework for self‑improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.

PaperID: 2836, https://arxiv.org/pdf/2605.18401.pdf

Authors: Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Zhiyu Li

Title: SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Abstract:
Long‑horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts, with non‑executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment‑sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle‑governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million‑scale open‑source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill‑linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence‑gated updates. In our evaluation, offline evolution improves GPT‑5.2 on Terminal‑Bench 2.0 by up to 7.9 pp, while online evolution improves SWE‑Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.

PaperID: 2837, https://arxiv.org/pdf/2605.18284.pdf

Authors: Divya Chukkapalli, Thejesh Avula, Aditya Aggarwal, Harsimran Singh, Amith Tallanki

Title: CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories

Abstract:
Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull‑request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed‑memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository's own git history under a constrained regime: deterministic, dependency‑free, local‑only, no embeddings. We present CommitDistill, an open‑source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF‑IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out‑of‑distribution queries. The artefact is a trust‑instrumented memory substrate: deterministic, no external service, inspectable plain‑JSON store, tunable abstention. A case study on five public repositories spanning Python, JavaScript, C, and Java (25,000 commits, 1,167 extracted units) reports useful‑precision 0.525 at Cohen's kappa = 0.633 on 40 dual‑annotated Python units. The decisive finding is budget‑constrained retrieval: at a 256‑character per‑query budget, CommitDistill reaches 0.750 hit‑rate on a 12‑query benchmark against BM25's 0.333 and git log ‑‑grep's 0.083. On a four‑arm paired LLM‑as‑judge evaluation (n=200 time‑travel bug‑fixes, two judges) covering control, CommitDistill, a body‑budget‑matched CD‑Hybrid, and BM25, no condition produces a statistically detectable lift over control on the headline mean and CD‑Hybrid is indistinguishable from BM25 head‑to‑head. Extraction over 10,000 commits completes in under 4 seconds on a laptop. Source, annotations, baselines, and a reproducibility script accompany this paper.

PaperID: 2838, https://arxiv.org/pdf/2605.18194.pdf

Authors: Yajing Zhou, Xiangyu Kong

Title: Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

Abstract:
While Multi‑Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" ‑ a reliance on text‑based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi‑agent environments, which demand more than just scene perception; they require second‑order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two‑stage spatial inference in MLLMs through a novel audio‑visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule‑based coordinate transformations. Instead, we introduce an Anchor‑Based Embodied Spatial Decomposition Chain‑of‑Thought (CoT). This guides the MLLM through a "geometric‑to‑semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out‑of‑view ambiguities (establishing a rigorous zero‑shot baseline of 42% accuracy), our sensory‑bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality‑aware inference in Embodied AI.

PaperID: 2839, https://arxiv.org/pdf/2605.18185.pdf

Authors: Benedict Russell, Chin-wing Leung, Paolo Turrini

Title: The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

Abstract:
In social dilemmas self‑interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent‑based simulations. In this paper, we provide an analytical solution to the problem, studying the policy‑gradient dynamics in a multi‑agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two‑dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation‑promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy‑gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.

PaperID: 2840, https://arxiv.org/pdf/2605.18153.pdf

Authors: Xin Peng, Bo Lin, Jing Wang, Xiaoling Li, Jun Ma, Jie Yu, Xiaoguang Mao, Shangwen Wang

Title: Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection

Abstract:
Automated vulnerability detection is crucial for enhancing software security by identifying potential flaws that attackers could exploit, thereby reducing the reliance on labor‑intensive manual code audits. Recent advancements have shifted towards leveraging large language models (LLMs) for vulnerability detection, with techniques like Vul‑RAG and VulnSage demonstrating progress through structured prompting and external knowledge integration. However, these approaches typically rely on a single reasoning paradigm, limiting their ability to address the complex and diverse nature of real‑world vulnerabilities. To overcome these limitations, we propose ReasonVul, a novel multi‑perspective reasoning framework that harnesses cognitive synergy among three specialized LLM agents, each embodying a distinct reasoning mode. The framework begins with independent analyses of the source code, followed by a structured debate mechanism to resolve conflicts through iterative rebuttal and revision, ultimately converging on a collaborative judgment. Evaluated on the PrimeVul dataset, ReasonVul achieves a PairAcc of 40.00% and an F1‑score of 72.52%, surpassing the best baseline by 81.24% in PairAcc. Further tests on the JITVUL dataset confirm its generalizability, with a PairAcc of 28.67%. Additionally, we analyzed 542 conflict cases and found that 389 were correctly resolved, highlighting the framework's ability to uncover hidden vulnerabilities through the error‑correction mechanism driven by the debate. This work emphasizes the importance of multi‑perspective reasoning and collaborative validation in achieving robust and comprehensive vulnerability detection in real‑world software systems.

PaperID: 2841, https://arxiv.org/pdf/2605.18150.pdf

Authors: Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

Title: Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

Abstract:
Diffusion models (DMs) are widely used for text‑to‑image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white‑box access through optimization or inversion, while concept awakening under black‑box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early‑stage text‑semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training‑free, black‑box, multi‑agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate‑guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black‑box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

PaperID: 2842, https://arxiv.org/pdf/2605.18078.pdf

Authors: Yevhen Shcherbinin, Arina Redina, Maxim Kalpin, Vlad Kochetov

Title: Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

Abstract:
Multi‑agent policy‑gradient methods have been shown to converge locally near stable Nash equilibria. Local convergence, however, does not determine which equilibrium is reached. We study this question through basin‑entry probability with respect to a target set of equilibria selected by an external criterion, such as payoff dominance. For finite‑unroll Meta‑MAPG, we show that the update decomposes into ordinary policy gradient plus own‑learning and peer‑learning corrections, with controlled sampling noise and finite‑unroll bias. We identify the peer‑learning correction as the main equilibrium‑selection mechanism: under a local alignment condition, the probability of entering the certified attraction region of the target stable‑Nash set increases, relative to ordinary policy gradient. Because persistent correction may shift zero‑update points of the original game, annealing the correction after entering the basin recovers ordinary policy‑gradient dynamics and inherits local stable‑Nash convergence guarantees. Experiments in Stag Hunt, iterated Prisoner's Dilemma, and preliminary neural‑policy coordination environments support this basin‑entry view, showing increased entry into cooperative basins under peer‑aware updates.

PaperID: 2843, https://arxiv.org/pdf/2605.18073.pdf

Authors: Anika Tabassum, Md Sifat Hossain, Md. Fahim Arefin, Tariqul Islam, Tarannum Shaila Zaman

Title: A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

Abstract:
Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end‑to‑end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A‑ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi‑model feedback framework separating solution generation from specialized debugging. A‑ProS combines ChatGPT‑based generators (GPT‑4 and GPT‑5) with three debugging critics: Codestral‑2508, Llama‑3.3‑70B, and DeepSeek‑R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011‑2024) and Codeforces (rated 1200‑1800). The results show that GPT‑5 workflows improve from 39 initial accepted solutions to 85‑90 after three refinement rounds, while GPT‑4 improves from 15 to 31‑38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5‑10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A‑ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi‑model feedback for reliable autonomous program synthesis.

PaperID: 2844, https://arxiv.org/pdf/2605.18067.pdf

Authors: Zile Wang, Qianli Liu, Kaibin Guo, Haodong Wang, Jian Lin, Zicong Hong, Song Guo

Title: PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

Abstract:
Deploying large language model (LLM) on edge device enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer‑to‑peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever‑changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query‑agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi‑agent interoperability Bayesian game to balance local demand and global efficiency, when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that could be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.

PaperID: 2845, https://arxiv.org/pdf/2605.18032.pdf

Authors: Kazuki Kawamura, Satoshi Waki, Kei Tateno

Title: PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

Abstract:
Multi‑agent LLM workflows ‑‑ systems composed of multiple role‑specific LLM calls ‑‑ often outperform single‑prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test‑driven improvement of multi‑agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per‑node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final‑answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node‑level expectations from final‑answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re‑evaluates the workflow to show output changes and score trajectories within the same interface. In two production‑adjacent workflows, PROTEA improved document‑inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph‑level localization, per‑node rationales, and editable before/after prompt revisions.

PaperID: 2846, https://arxiv.org/pdf/2605.18024.pdf

Authors: Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han

Title: Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

Abstract:
Cooperation is central to multi‑agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter‑agent interactions. Prior robust MARL methods have primarily considered value‑oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction‑breaking adversarial learning (IBAL) framework that takes an information‑theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent‑missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.

PaperID: 2847, https://arxiv.org/pdf/2605.17998.pdf

Authors: Hai-Duong Nguyen, Xuan-The Tran

Title: Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

Abstract:
As multi‑agent systems move from short interactions to tool‑using workflows with specialized roles and persistent state, completion becomes a runtime‑control problem rather than a purely generative one. This preprint studies verify‑gated completion as an admission‑control pattern for governed multi‑agent runtimes: agents may propose completion, but a read‑only verifier decides whether the claim is admitted. Ambiguous or weakly evidenced cases resolve fail‑closed, while packetized state and event traces preserve an audit path. We examine one bounded reference implementation and ask what the released evidence can support about auditable, verify‑gated completion. In the released verify‑completed slice, the known‑outcome invoked‑event verify success share was 1,791/1,800 = 99.5%. This is an accounting measure over invoked verification events, not a task‑completion, production‑reliability, or benchmark‑success rate. Task‑level verify coverage is not computable; 1,762/1,801 rows came from one high‑volume reporting cluster; and only 17 events were production‑classified. A shadow Policy/Governance Verifier evaluation showed 1,526/1,548 = 98.58% rule agreement, 0/1,526 false‑success among safe‑to‑proceed predictions, and blocked precision of 2/518 = 0.39%, so it remains advisory. The evidence supports a narrow conclusion: under observed conditions, a read‑only verify gate plus packetized admission records made completion decisions inspectable and fail‑closed. Claims about deployed operation, safety guarantees, outcome gains, task‑level coverage, recovery effectiveness, or external validity remain outside scope.

PaperID: 2848, https://arxiv.org/pdf/2605.17950.pdf

Authors: Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos

Title: Active Defense Against False Data Injection Attacks in Robotic Manipulators

Abstract:
Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end‑effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite‑horizon FDIAs by formalizing two defense methods, namely anomaly‑aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7‑DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIA compared to using solely a threshold‑based ADS like the Chi‑squared, while preserving nominal task performance in the absence of attack.

PaperID: 2849, https://arxiv.org/pdf/2605.17937.pdf

Authors: Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

Title: BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Abstract:
Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large‑scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large‑scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question‑answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi‑agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end‑to‑end performance and highlights the importance of grounded verification and standardized indicator representations.

PaperID: 2850, https://arxiv.org/pdf/2605.17900.pdf

Authors: Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu, Jingbo Zhou, Jizhou Huang

Title: DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

Abstract:
Accurate Point of Interest (POI) attribute acquisition is essential for location‑based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS‑2, a large language model (LLM)‑based end‑to‑end framework designed for large‑scale POI attribute acquisition at Baidu Maps. To address the long‑tail distribution of real‑world interactions, our methodology first employs a finite state machine (FSM)‑guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain‑of‑Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual‑evaluator voting system. Deployed in production for two months, DuIVRS‑2 processed 0.4 million calls daily and achieved a 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production‑proven reference for developing robust, cost‑effective LLM agents for large‑scale industrial dialogue applications.

PaperID: 2851, https://arxiv.org/pdf/2605.17894.pdf

Authors: Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, Xu Cao, Meihuan Huang

Title: Evaluating Cognitive Age Alignment in Interactive AI Agents

Abstract:
While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state‑of‑the‑art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM‑based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM‑based interactive agents against age‑specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age‑specific cognitive behavior.

PaperID: 2852, https://arxiv.org/pdf/2605.17885.pdf

Authors: Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

Title: Multi-agent AI systems outperform human teams in creativity

Abstract:
Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi‑agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi‑agent LLM ideas and 341 human‑team ideas on six diverse problem‑solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi‑agent systems with augmented creative capabilities.

PaperID: 2853, https://arxiv.org/pdf/2605.17873.pdf

Authors: Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

Title: HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Abstract:
Training long‑horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn‑level action‑output signals, or by using feedback‑conditioned self‑distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT‑SD, a targeted self‑distillation framework that uses full‑trajectory hindsight to select failure‑relevant actions and applies feedback‑conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per‑turn feedback baseline by up to 18.80 percent while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long‑horizon agent training.

PaperID: 2854, https://arxiv.org/pdf/2605.17848.pdf

Authors: Aya Hamed, Jason R. Marden, Jeff S. Shamma

Title: Learning Empirical Evidence Equilibria under Weak Environmental Coupling

Abstract:
Strategic multi‑agent systems are fundamentally characterized by decentralization, uncertainty, and ambiguity. Agents operating under limited observations will often need to make decisions based on simplified internal models of the environment, reflecting bounded rationality in both computational capacity and environmental knowledge. The Empirical Evidence Equilibrium (EEE) framework explicitly accounts for these limitations by modeling each agent as forming a potentially misspecified belief derived from signals obtained through partial observations of the environment. The resulting equilibrium concept captures the system's steady state under bounded rationality and decentralization. In this work, we study games in which the environment dynamics are driven jointly by exogenous factors and agents' actions. We analyze agent behavior under Q‑value iteration where each agent independently forms a belief model, computes Q‑values, and derives a greedy strategy, yet the collective actions of all agents jointly shape the environment each agent faces at the next stage. We prove that despite this decentralization, an EEE emerges from the joint dynamics when the coupling between agents' actions and the environment is sufficiently weak. We further extend this result to softmax policies, establishing a contraction result under a sufficient coupling condition.

PaperID: 2855, https://arxiv.org/pdf/2605.17830.pdf

Authors: Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

Title: Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

Abstract:
Safety evaluations of memory‑equipped LLM agents typically measure within‑task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non‑stationarity, we introduce a trigger‑probe protocol that evaluates a fixed probe set against read‑only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory‑induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw‑like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory‑enabled agents consistently exceed the NullMemory baseline, and memory‑induced violation rates show a robust upward trend with exposure length on both agent classes. Order‑randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory‑induced risk is detectable from retrieval state before generation, which we confirm with a high‑recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single‑state property that can be captured by a snapshot.

PaperID: 2856, https://arxiv.org/pdf/2605.17792.pdf

Authors: Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang, Anjiang Wei, Jinwoong Yoo, Yang Hong

Title: HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Abstract:
Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management ‑ streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high‑dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents ‑ Claude Opus 4.6/4.7, Sonnet 4.6, GPT‑5/5.4/5.4‑pro, and Gemini 2.5‑pro/3.1‑pro/3‑flash ‑ on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash‑flood forecasting. Best‑of‑twenty‑rounds Nash‑Sutcliffe Efficiency (NSE) across four held‑out gauges spanning 329‑40,792 km2 ranges from ‑0.16 (GPT‑5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65‑0.75 band, and no model reaches the human‑expert reference except Opus‑4.7 on one gauge. We argue this gap is not a parameter‑count problem but a domain‑grounding problem. We then propose HYDROAGENT, fine‑tuning open‑weight Qwen3‑4B with supervised fine‑tuning on 2,576 expert calibration trajectories and Group‑Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations ‑ reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain‑tuned policy with simulator‑in‑the‑loop RL is a more compute‑efficient and physically faithful path than scaling generic frontier models, and the multi‑modal richness of Earth data ‑ remote sensing, in‑situ time series, and forecaster narrative ‑ makes domain agents a leveraged direction for AI in physical science.

PaperID: 2857, https://arxiv.org/pdf/2605.17758.pdf

Authors: Nitish Nagesh, Mahdi Bagheri, Arshia Harish Puthran, Pengbao Zhou, Muhjaazee Love, Aadi Sharma, Ian Harris, Amir M. Rahmani

Title: Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Abstract:
Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state‑of‑the‑art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.

PaperID: 2858, https://arxiv.org/pdf/2605.17746.pdf

Authors: Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

Title: Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

Abstract:
AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi‑agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human‑AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI‑enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor‑flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph‑blind and SEEDguided generation in a medical‑triage design task. In this diagnostic contrast, SEED‑guided candidate designs show clearer actor‑flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

PaperID: 2859, https://arxiv.org/pdf/2605.17734.pdf

Authors: Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

Title: Harnessing LLM Agents with Skill Programs

Abstract:
Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long‑horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP(Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure‑prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent‑loop intervention, during post‑training to provide structured supervision, or for self‑improvement by evolving validated, teacher‑reviewed PFs. Empirically, HASP drives substantial gains compared to both training‑free and training‑based methods on web‑search, math reasoning, and coding tasks. For example, on web‑search reasoning, inference‑time PFs alone improve the average performance by 25% compared to (multi‑loop) ReAct Agent, while post‑training and controlled evolution achieve a 30.4% gain over Search‑R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.

PaperID: 2860, https://arxiv.org/pdf/2605.17715.pdf

Authors: G. Q. Bao Tran, Yutaka Hori, Shinji Hara

Title: Observer-Based Stabilization for Linear Multi-Agent Dynamical Systems Using Generalized Frequency Variables

Abstract:
We address the conditions and design of controllers and observers for homogeneous networks of linear MIMO agents. We develop networked controllers and observers that ensure the stability of both the system state and the estimation error, leveraging the concept of generalized frequency variables. A separation principle for networks is then established, showing that the observer and controller can be designed independently and combined to achieve a stable output feedback. Our results are illustrated via a highly unstable, oscillatory network of locally actuated pendulums on carts. Finally, necessary conditions for controllability and observability ‑‑ derived from agent properties and network structure ‑‑ are established and discussed.

PaperID: 2861, https://arxiv.org/pdf/2605.17698.pdf

Authors: Seth Karten, Cameron Crow, Chi Jin

Title: Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

Abstract:
The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi‑agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market ("The Lemon Market"), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open‑weight models across both scenarios and find that models largely fail to self‑regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open‑weight models. We propose the Economic Alignment Score (EAS), a 4‑component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross‑model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.

PaperID: 2862, https://arxiv.org/pdf/2605.17694.pdf

Authors: Anvesh Rao Vijjini, Sagar Manjunath, Snigdha Chaturvedi

Title: Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

Abstract:
Power differences shape human communication through well documented socio cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high or low status personas. Using personas from diverse professions, we simulate multi turn, power asymmetric dialogues (e.g., principal teacher, justice lawyer) and measure (i) language coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs show key socio‑cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

PaperID: 2863, https://arxiv.org/pdf/2605.17679.pdf

Authors: Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes

Title: PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

Abstract:
Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self‑report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing‑based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose‑built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval‑augmented population‑level comparisons. Rather than receiving pre‑formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis‑driven clinical reasoning. We evaluate PULSE through a 22 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing‑only vs. with diary) on 50 cancer survivors from a longitudinal study of cancer survivors. Agentic reasoning is the primary driver of performance: agentic multimodal agent achieves balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just‑in‑time mental health support.

PaperID: 2864, https://arxiv.org/pdf/2605.17617.pdf

Authors: Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

Title: GraphMind: From Operational Traces to Self-Evolving Workflow Automation

Abstract:
Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end‑to‑end automation remains challenging due to extensive human input requirements and limited ability to adapt over time. We present GraphMind, a system that constructs, executes, and evolves action‑centric workflow graphs with minimal human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi‑agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph‑guided retrieval with LLM‑driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths, enabling execution‑informed graph adaptation. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on 93 held‑out incidents and validated via blind expert review, the system outperforms an Agentic Summary‑RAG baseline in mitigation reach, hallucination rate, and diagnostic throughput while requiring 8x less retrieval context. The ATR layer reduces hallucination rate by 26%, demonstrating that workflow graphs can learn from execution feedback. A 12‑week field study confirms practical value: 97% of scored conversations yield actionable results within interactive latency.

PaperID: 2865, https://arxiv.org/pdf/2605.17607.pdf

Authors: Martin Bichler, Jan-Sebastian Hoehener

Title: Convergence of Stochastic First-Order Algorithms in Bertrand Competition Under Incomplete Information

Abstract:
Autonomous pricing agents are widely deployed in online marketplaces, making algorithmic pricing a prominent application of multi‑agent learning. Experimental studies often report collusive outcomes, but these findings typically rely on Q‑learning in complete‑information environments and lack rigorous convergence guarantees. In this paper, we study the stochastic learning dynamics of Regularized Robbins‑Monro (RRM) algorithms in a Bayesian Bertrand competition with private costs. We show that this setting violates standard stability conditions, including monotonicity and the Minty variational inequality, rendering classical convergence results for gradient‑based learning inapplicable. Despite this, we prove that Euclidean RRM algorithms converge almost surely to the unique, efficient Bayes‑Nash equilibrium within a finite‑dimensional approximation of the strategy space. By analyzing symmetric piecewise‑linear pricing strategies in a duopoly, we explicitly construct a global Lyapunov function for the projected primal dynamics and establish global asymptotic stability of the equilibrium. Our analysis yields rigorous convergence guarantees for stochastic first‑order learning algorithms in Bayesian Bertrand competition and provides a principled counterpoint to widespread claims of algorithmic collusion.

PaperID: 2866, https://arxiv.org/pdf/2605.17596.pdf

Authors: Mujahid Sultan, Sri Thuraisamy, Daya Rajaratnam

Title: NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

Abstract:
We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro‑symbolic architecture. NeuSymMS couples neural fact extraction from unstructured dialogue using LLMs and a CLIPS‑based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject‑relation‑value triples stored in relational database management system. It supports user/agents/agent‑to‑agent scoping, and implements a dual‑horizon (short‑term and long‑term) memory model. IT leverages access‑based promotion and time‑based pruning of the memory on both horizpons. NeuSymMS maintains continuity of memory while avoiding context‑window bloat and cross‑entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key‑value approaches.

PaperID: 2867, https://arxiv.org/pdf/2605.17535.pdf

Authors: Sheikh Nazib Ahmed, Marnim Galib

Title: AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

Abstract:
Legacy modernization breaks business logic. Most tools and LLM‑based approaches treat modernization as syntax translation, losing implicit rules, edge‑case handling, and cross‑module constraints. We present AgentModernize, a multi‑agent framework that treats modernization as a behavioral preservation problem. Four specialized agents handle extraction, specification, code generation, and validation. The key intermediate artifact ‑‑ a Behavioral Specification Graph (BSG) ‑‑ forces extracted business logic to be explicit and inspectable before any code is generated. We evaluated on LegacyModernize‑8, eight scenarios spanning telecom and banking, using three models (GPT‑4o‑mini, GPT‑4o, GPT‑5.3‑codex) under a fair protocol: same gold‑standard tests, 3 trials, temperature 0.0. Full AgentModernize with feedback was the only configuration with non‑zero mean BER under every backbone. SP‑LLM and CoT‑LLM scored 0.0% on every scenario, on every backbone. AgentModernize without feedback scored 0.0% mean BER with GPT‑4o‑mini and GPT‑5.3‑codex; under GPT‑4o it achieved non‑zero BER only on S1 (44.4%; 5.6% mean over scenarios). Mean BER for full AgentModernize was 9.4% (mini), 8.1% (GPT‑4o), and 19.4% (codex). The BSG captures 91.2% of gold‑standard rules, confirming that the bottleneck is code generation, not extraction.

PaperID: 2868, https://arxiv.org/pdf/2605.17480.pdf

Authors: Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

Title: The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

Abstract:
Multi‑agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain‑specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system‑level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi‑level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker‑Only setting (n_W=14), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full‑MAS setting (n_W =6) shows a directionally consistent indirect effect. Worker‑side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty‑to‑execution chain, reducing ASR from 52.8% to 2.0% with negligible benign‑task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting‑‑rather than eliminating‑‑capability asymmetries between agents.

PaperID: 2869, https://arxiv.org/pdf/2605.17475.pdf

Authors: Hongshu Wang, Xinyue Zuo, Yuhan Sun, Qin Li, Yamine Ait Ameur, Jin Song Dong

Title: Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair

Abstract:
Building software that is correct by construction is a long‑standing goal in software engineering, as it ensures reliability during design and development rather than after deployment. Formal methods realize this vision by enabling the expression of system behavior and requirements in mathematics, thereby guaranteeing correctness through formal verification, including theorem proving and model checking. However, the steep learning curve and demand for mathematical expertise hinder the widespread adoption of formal methods. Large language models (LLMs) have recently shown promise in bridging this gap through autoformalization. However, existing LLM‑based approaches are largely limited to isolated tasks, such as theorem proving without formalization or model synthesis with insufficient verification. While valuable, these efforts do not fully exploit the potential of a more comprehensive framework in which models and proofs evolve together, a process that closely reflects real‑world development practice. To address this gap, we propose Event‑B Agent, a novel framework inspired by the interleaved nature of software design. Given natural language requirements, Event‑B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event‑B Agent substantially outperforms baselines in end‑to‑end formal model synthesis and repair, while maintaining reasonable efficiency. These results suggest that Event‑B Agent is a promising step toward correct‑by‑construction formal model synthesis and repair.

PaperID: 2870, https://arxiv.org/pdf/2605.17467.pdf

Authors: Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

Title: VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

Abstract:
Large language model‑driven multi‑agent systems (LLM‑MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system‑level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent‑error pairs and agent‑first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross‑step inconsistencies and inter‑agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine‑grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification‑based approach decomposes attribution into trajectory‑level error validation and fine‑grained agent localization, providing an error‑first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis‑based data construction strategy grounded in a structured error taxonomy and fine‑tune a specialized LLM verifier model for trajectory‑level failure verification and agent attribution. Experiments on Aegis‑Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open‑source Qwen and API‑based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi‑agent trajectories.

PaperID: 2871, https://arxiv.org/pdf/2605.17453.pdf

Authors: Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li, Binwu Wang, Longyue Wang, Chenyang Lyu, Guanhua Chen

Title: Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Abstract:
Tool‑using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent‑security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign‑looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST‑Bench, a task‑conditioned benchmark of 1,970 hidden‑trigger tool‑compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA‑Guard, a backbone‑agnostic framework for final‑action risk scoring. The core idea is to abstract multi‑step tool interaction into structured environment variables that encode trust‑formation dynamics and then score the risk of the final executable action from this trajectory‑conditioned representation. Experiments show that prompt‑centric heuristics, scalarized features, and zero‑shot judges fail in this regime, whereas trajectory‑aware final‑action scoring yields strong in‑domain discrimination and remains effective under balanced out‑of‑distribution transfer. Under GuardedJoint, VISTA‑Guard reaches 84.2 in‑domain and 56.9 on balanced out‑of‑distribution evaluation, while methods that optimize only one side of the safety‑‑utility tradeoff collapse to zero. These findings support a broader view of agent security in black‑box tool ecosystems: the decisive defense target is not local prompt text or tool descriptors alone, but the way trust is formed across the interaction trajectory and committed through the final action.

PaperID: 2872, https://arxiv.org/pdf/2605.17450.pdf

Authors: Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

Title: ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

Abstract:
Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository‑level reasoning enables them to inspect context and produce source‑code patches. However, recent empirical results show that these agents still struggle with real‑world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom‑oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non‑crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two‑track skill base containing repair specifications and mutation strategies, which are retrieved through a three‑tier policy for future instances. On SEC‑Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT‑5‑mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state‑of‑the‑art performance on both benchmarks while costing less than one‑third of the strongest comparable baseline.

PaperID: 2873, https://arxiv.org/pdf/2605.17426.pdf

Authors: Chiharu Shima, Haruki Yonekura, Fukuharu Tanaka, Tatsuya Amano, Hirozumi Yamaguchi

Title: Human-Flow Digital Twin for Predicting the Effects of Mobility Introduction on Visitor Circulation

Abstract:
We propose a framework for predicting the effects of mobility introduction measures using a human‑flow digital twin. This digital twin incorporates a multi‑agent simulator that can represent how visitors choose destinations depending on factors such as their current location and the attractiveness of spots. We extract data on how visitors selected destinations with respect to measured pre‑intervention human‑flow data, inter‑spot distances, spot attractiveness, and travel volumes, and use these data to train each agent's decision model of this simulator. The trained decision model is a function that takes a visitor's current state and surrounding environmental information as input and outputs which spot the visitor will move toward next. By expressing mobility introduction measures as changes to inter‑point distances or to spot attractiveness, the framework can reproduce human flows with mobility introduction in the multi‑agent simulator and thereby quantify effects such as changes in visitor counts and circulation. We evaluated the proposed method using human‑flow data measured with and without introducing mobility within Wakayama Castle Park in Japan. When reproducing flows with mobility introduction using a multi‑layer perceptron decision model, the cosine similarity of the spatial population distribution exceeded 0.7, confirming that the approach can replicate the flow changes caused by the mobility introduction.

PaperID: 2874, https://arxiv.org/pdf/2605.17423.pdf

Authors: Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

Title: Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Abstract:
We study series‑level cinematic remaking, a long‑horizon video‑to‑video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi‑agent framework that enforces long‑term language‑visual consistency through a Dual‑Bridge Consistency mechanism: a scene‑aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid‑based formulation. A closed‑loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long‑term consistency and narrative fidelity.

PaperID: 2875, https://arxiv.org/pdf/2605.17393.pdf

Authors: Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

Title: Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

Abstract:
Coordination graphs are a central abstraction in cooperative multi‑agent reinforcement learning (MARL), yet existing sparse‑graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information‑Bottleneck Coordination Graphs (HIBCG), which learns a group‑aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group‑aligned block‑diagonal prior that provides a closed‑form criterion for edge retention ‑‑ determining which edges should exist and at what density per group block ‑‑ and then controls per‑agent feature bandwidth on the resulting topology, compressing messages to retain only task‑relevant content. We prove that the group‑aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water‑filling principle.

PaperID: 2876, https://arxiv.org/pdf/2605.17364.pdf

Authors: Joy Bose

Title: NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

Abstract:
Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five‑agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India‑Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5‑3B‑Instruct (4‑bit quantised, Google Colab T4), with cross‑model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative‑framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross‑model comparison shows high consistency for high‑propaganda content (Republic World delta‑PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann‑Whitney U tests find no statistically significant between‑group differences at n=15, reported honestly as a sample‑size limitation confirmed by post‑hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical‑geometric bias work to agentic LLM reasoning, and is fully reproducible using open‑weight models without API keys.

PaperID: 2877, https://arxiv.org/pdf/2605.17361.pdf

Authors: Xuefei Wang, Jialu Wang, Fengbo Zhang, Yihan Hu, Di Zhang, Yutong Ye, Yikun Ban, Jun Han, Ruijie Wang

Title: \textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

Abstract:
Multi‑agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter‑agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real‑world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, \emphtopology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross‑task misalignment in both agent‑level functional semantics and relational communication structures. To address this challenge, we propose \textscMasFACT, a geometry‑aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task‑specific agent spaces through Fused Gromov‑Wasserstein optimal transport and perform PAC‑Bayes‑guided conservative posterior adaptation to balance task‑specific plasticity with structural stability. Experiments across class‑, domain‑, and task‑level continual settings demonstrate that \textscMasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay‑based baselines, and can be seamlessly integrated with different MAS topology generators.

PaperID: 2878, https://arxiv.org/pdf/2605.17359.pdf

Authors: Taolin Zhang, Zijie Zhou, Jiuheng Wan, Tingyuan Hu, Chengyu Wang, Xiaofeng He, Richang Hong

Title: Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

Abstract:
Large language model (LLM)‑based multi‑agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology‑evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference‑time token consumption, and limited scalability in multi‑domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi‑agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query‑conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per‑query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology‑evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query‑conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query‑relevant structural variation. Experiments on multi‑domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology‑evolution backbones while reducing online inference‑time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi‑agent LLM collaboration across domains.

PaperID: 2879, https://arxiv.org/pdf/2605.17352.pdf

Authors: Taolin Zhang, Dongyang Li, Chen Chen, Qizhou Chen, Jiuheng Wan, Xiaofeng He, Chengyu Wang, Richang Hong

Title: AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

Abstract:
Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge‑intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long‑tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi‑Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi‑agent collaboration with external tools as a trajectory preference alignment problem, incorporating question‑aware agent customization and inter‑agent preference harmonization. AMATA introduces two principal innovations: (1) Intra‑Trajectory Preference Learning, which learns objective‑oriented preferences to prioritize critical agents, and (2) Inter‑Agent Dependency Learning, which captures cross‑agent tool dependencies through a novel dependency‑aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge‑augmented frameworks, and LLM‑based trajectory systems on five established knowledge‑intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

PaperID: 2880, https://arxiv.org/pdf/2605.17348.pdf

Authors: Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen, Xiaofeng He, Chengyu Wang, Richang Hong

Title: Taming "Zombie'' Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

Abstract:
Recent advancements in LLM‑based multi‑agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for ``zombie'' agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state‑aware framework for resilient multi‑agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State‑Aware Policy Learning: Agent states are divided into ``Active'', ``Standby'', and ``Terminated'' states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State‑Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing ``Terminated'' nodes and retaining ``Standby'' nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain‑specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state‑aware agent scheduling.

PaperID: 2881, https://arxiv.org/pdf/2605.17300.pdf

Authors: Qixuan Li, Chen Le, Jincheng Yu, Xinlei Chen

Title: HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

Abstract:
We introduce HCLM, a hierarchical framework for general‑purpose cooperative loco‑manipulation with dual quadrupedal systems. Coordinating multi‑robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed‑chain physical interactions. To resolve this, our architecture systematically decouples high‑level collaborative reasoning from low‑level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)‑invariant task‑space representation to learn coordinate‑agnostic spatial coordination patterns. To translate these frame‑agnostic references into physical motion, a task‑centric hybrid Whole‑Body Controller synergizes a proactive kinematic Model Predictive Control for collision‑free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end‑effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed‑chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi‑robot embodied coordination.

PaperID: 2882, https://arxiv.org/pdf/2605.17292.pdf

Authors: Chenyu Wang, Yang Shu

Title: MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

Abstract:
Multi‑agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi‑agent LLM framework where each agent is equipped with a Metacognitive Self‑Assessment Unit that evaluates task‑capability alignment before execution. The framework introduces three contributions: (1) a self‑assessment mechanism that estimates per‑task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low‑confidence tasks to better‑suited agents through cross‑agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog‑Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy ‑‑ 8.7% above the best routing baseline ‑‑ while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.

PaperID: 2883, https://arxiv.org/pdf/2605.17281.pdf

Authors: Jicheng Wang, Yifeng He, Zili Wang, Hanwen Xing, Arkaprava De, Hao Chen

Title: ContractBench: Can LLM Agents Preserve Observation Contracts?

Abstract:
Tool‑augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte‑level integrity) is an emergent, regression‑prone capability: it is neither guaranteed by general tool‑use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual‑axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation‑to‑action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA‑256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real‑world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude‑Opus‑4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within‑family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B‑A17B: what emerges across the cliff is mid‑trajectory restraint, not tool‑call competence; (iii) non‑monotonic scaling across the GPT‑5 family: agentic post‑training can erode compliance through sycophancy‑driven regression; (iv) our failure taxonomy works as an actionable in‑context reward signal, yielding +7.1 pp on 42 paired GPT‑5.1 failures.

PaperID: 2884, https://arxiv.org/pdf/2605.17242.pdf

Authors: Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R Lyu

Title: From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

Abstract:
Coding agents can generate web applications from natural‑language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals ‑‑ steps that current agents cannot perform without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high‑level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser‑based interaction simulation, and (3) translating browser‑observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of Test‑driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34‑‑48 percentage points over a no‑TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost up to 25‑fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback‑driven refinement.

PaperID: 2885, https://arxiv.org/pdf/2605.17229.pdf

Authors: Qingwen Pu, Kun Xie, Yuan Zhu, Guocong Zhai

Title: Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

Abstract:
Automated driving system deployment requires rigorous validation across safety‑critical vehicle‑pedestrian interactions, yet real‑world datasets rarely capture high‑risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three‑stage framework that combines real‑world grounding with adaptive simulation to generate behaviorally realistic safety‑critical scenarios at scale. Stage 1 pre‑trains multi‑agent state‑space Transformer‑enhanced DDPG (MA‑SST‑DDPG) agents on real‑world safety‑critical data to learn human‑like interactive evasive behaviors through data‑driven learning. Stage 2 deploys pre‑trained multi‑agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real‑world knowledge with simulation experience to produce a refined MA‑SST‑DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high‑resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle‑Pedestrian Safety‑Critical Interaction (VPSCI) dataset. The Refined MA‑SST‑DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real‑world data in both conflict severity and behavioral response. A Turing test confirmed that the three‑stage framework generated evasive behaviors were indistinguishable from real‑world interactions. These results demonstrate the framework's effectiveness in producing high‑fidelity safety‑critical data, offering valuable sources for the development of ADS and simulation‑based safety evaluations.

PaperID: 2886, https://arxiv.org/pdf/2605.17159.pdf

Authors: Diego Gosmar, Giovanni Zenezini

Title: MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

Abstract:
Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor‑intensive and error‑prone. We present MADP, a multi‑agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning‑based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents‑‑Classificator, Splitter, Parser, Extraction, and Validator‑‑with a Human‑in‑the‑Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use‑case scenario of 100,000 invoices per year indicates a potential reduction of Full‑Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real‑world documents processed through January 2026 achieves a 97.0% full‑pipeline automation rate, with only 3% requiring non‑AI fallback. Ablation evaluation on a stratified 100‑document subset (5 documents per each of 20 supplier/document‑type categories) demonstrates that the full MADP configuration with Human‑in‑the‑Loop supervision attains 98.5% document‑level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite‑Docling, Mistral‑Small, DeepSeek‑OCR) provide practical insights for deployment in production environments.

PaperID: 2887, https://arxiv.org/pdf/2605.17101.pdf

Authors: Yongfeng Huang, Ruiying Chen, James Cheng

Title: SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

Abstract:
Retrieval‑Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single‑round, static retrieval paradigm misaligns with the multi‑stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question‑to‑query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi‑round exploration. To this end, we propose SEMA‑RAG, a Self‑Evolving Multi‑Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency‑driven self‑evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA‑RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

PaperID: 2888, https://arxiv.org/pdf/2605.17088.pdf

Authors: Rui Chu

Title: ACIL: Auto Chain of Thoughts for In-Context Learning

Abstract:
Recent advances in large language models (LLMs) have shown that Chain‑of‑Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In‑Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi‑step reasoning, because the demonstrations usually contain only input‑output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain‑of‑Thought (Auto‑CoT) framework to improve ICL by automatically constructing reasoning‑enhanced demonstrations. Auto‑CoT generates reasoning chains for input‑output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low‑quality demonstrations through a systematic selection process. By incorporating high‑quality reasoning examples into the ICL prompt, Auto‑CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

PaperID: 2889, https://arxiv.org/pdf/2605.17079.pdf

Authors: Tianyu Wang, Jiajun Li, Jianghao Lin

Title: Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Abstract:
LLMs are increasingly used as ``digital consumers'' to simulate public opinion, pre‑test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social‑media topics and 23,122 atomic, rule‑audited criteria spanning four reaction families. Rather than scoring open‑ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes‑no decisions over concrete reaction points, raising three‑judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human‑majority labels. Across 13 frontier generators, the strongest model, Gemini‑3.1‑Pro, covers only 47.8% of real reaction criteria, while GPT‑5.2 and Claude‑4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical‑benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate‑‑reflect multi‑agent pipeline improves MiMo‑V2.5‑Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public‑discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high‑context Chinese consumer discourse.

PaperID: 2890, https://arxiv.org/pdf/2605.17075.pdf

Authors: Ayan Javeed Shaikh, Nathaniel D. Bastian, Ankit Shah

Title: A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems

Abstract:
AI‑enabled Security Orchestration, Automation, and Response (SOAR) systems increasingly employ autonomous agents for cyber defense, yet their resilience to adaptive adversaries is underexplored. We introduce an autonomous red teaming framework that integrates large language models (LLMs) with reinforcement learning (RL) to generate adaptive, multi‑stage attack campaigns against autonomous defenders in enterprise networks. A hierarchical design combines an LLM‑based planner for strategic intent with an RL controller for tactical execution, supported by reward shaping aligned with kill‑chain progression. Evaluation in a high‑fidelity enterprise simulation demonstrates the effectiveness of the proposed approach, while also showing that standalone LLM agents fail to sustain multi‑stage attack campaigns and that domain‑specific cybersecurity models achieve only limited levels of compromise, highlighting the necessity for hybrid LLM‑RL approaches to red teaming.

PaperID: 2891, https://arxiv.org/pdf/2605.17044.pdf

Authors: Wenlong Shi, Jianxun Lian, Mingqi Wu, Haiming Qin, Mingyang Zhou, Xing Xie, Naipeng Chao, Hao Liao

Title: PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

Abstract:
Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona‑level role‑playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character‑level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona‑level role‑playing in LLMs. PersonaArena leverages a large, filtered corpus of user‑generated social content to construct a nuanced persona bank, and elicits multi‑turn, context‑rich interactions within simulated social environments. Our framework features a multi‑agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs' role‑playing capabilities, advancing the development of more authentic and socially adept AI agents.

PaperID: 2892, https://arxiv.org/pdf/2605.16986.pdf

Authors: Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

Title: Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Abstract:
LLM agents benefit from reusable skills, yet test‑time tasks often require guidance more specific than a static skill library can provide. We propose \emphSkillTTA, a Test‑Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task‑specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory‑to‑skill synthesis using GPT‑5.5, task‑specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory‑learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top‑k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator‑facing mistakes.

PaperID: 2893, https://arxiv.org/pdf/2605.16976.pdf

Authors: Wenjie Qu, Ming Xu, Peiran Wang, Shengfang Zhai, Jiaheng Zhang, Dawn Song

Title: Securing LLM Agents Need Intent-to-Execution Integrity

Abstract:
This position paper argues that securing LLM agents requires first defining an end‑to‑end correctness property that specifies when an agent's execution faithfully reflects the user's intent. Modern LLM agents operate over an \emphintent‑to‑execution pipeline, where natural‑language instructions are translated into concrete system operations such as tool calls, API requests, and code execution. While recent defenses have made progress in constraining how agents construct tool calls, most existing formulations implicitly assume that tools are trusted. The emergence of systems such as OpenClaw, with open ecosystems of third‑party skills and direct access to user environments, breaks this assumption and exposes new failure modes, including malicious or over‑privileged components in the execution pipeline. Despite rapid progress in defense mechanisms, there is no adequate correctness property that defines what ``secure'' means for LLM agents, nor a principled way to evaluate the coverage of existing defenses. We observe that LLM agents are structurally analogous to compilers, where security violations correspond to mis‑executions that do not preserve user intent. Drawing on this analogy, we identify two fundamental problem sources ‑‑ untrusted data ingestion and untrusted tool execution ‑‑ and derive four integrity properties that must hold simultaneously: \emphTool Integrity, \emphInstruction Integrity, \emphJudgment Integrity, and \emphData Flow Integrity. We call their conjunction \emphintent‑to‑execution integrity. Analyzing existing agentic defenses against these properties reveals that current systems provide only partial and non‑compositional coverage, leaving fundamental gaps in securing modern LLM agents.

PaperID: 2894, https://arxiv.org/pdf/2605.16872.pdf

Authors: Botao Amber Hu, Helena Rong

Title: Some[Body] Must Receive That Pain for Agent Accountability

Abstract:
AI agents increasingly act consequentially in the real world. This creates a problem we call \emphconsequence reception: harm occurs, the producing system is identified, yet no continuing agent receives consequences in a way that changes future behavior. Pain, understood mechanistically as a corrective feedback signal, is foundational to canonical theories of punishment ‑‑ deterrence, rehabilitation, retribution, and incapacitation all assume a continuing locus that registers the signal and updates behavior. That, in turn, requires a body for the signal to land on: a boundary whose integrity it protects, a locus where it accumulates, consolidation that converts episodic signal into durable update, and a substrate that responds by altering future action. Current LLM agents ‑‑ software‑defined composites of weights, prompts, tools, memory, and credentials, freely swapped, copied, reset, and reassembled ‑‑ satisfy none of these conditions. The two prevailing legal responses therefore fail to achieve consequence reception. The thin‑identity agent‑principal dyad has a body but no \emphconsequence‑‑agency coupling: the human bears pain for behaviors beyond their control ‑‑ Elish's \emphmoral crumple zone. The thick‑identity Arbel et al.'s \emphAlgorithmic Corporation creates legally legible entities but does not guarantee that any AI decision architecture receives pain as a behavioral signal. Achieving consequence‑agency coupling is therefore a sociotechnical infrastructural problem, not only a legal one. Until such architectures exist, high‑stakes AI deployment should remain tethered to accountable human principals with meaningful control, proportional liability, and authority to constrain or terminate the agent. \emphIf some body does not receive the pain by design, some body will receive it by default.

PaperID: 2895, https://arxiv.org/pdf/2605.16871.pdf

Authors: Site Hu, Takato Horii

Title: SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

Abstract:
Explainable robots require not only successful task execution but also the ability to expose internal decision‑making process in a user‑friendly manner. However, most imitation learning methods are trained solely on task‑level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal‑level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long‑horizon manipulation. To address this issue, this paper proposes Subgoal‑Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal‑annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human‑interpretable subgoals by conditioning action generation on both task‑level and subgoal‑level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real‑world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task‑conditioned diffusion baselines, while providing subgoal‑level execution signals for monitoring progress and diagnosing failures. These results highlight that built‑in, rather than post‑hoc, interpretability can coexist with high task performance.

PaperID: 2896, https://arxiv.org/pdf/2605.16855.pdf

Authors: Tomoki Arita, Keisuke Okumura

Title: Lifelong LaCAM with Local Guidance for Lifelong MAPF

Abstract:
Local guidance has recently proven to be a powerful driver of empirical performance in real‑time, suboptimal multi‑agent pathfinding (MAPF), improving the scalable configuration‑based solver LaCAM. By injecting informative spatiotemporal cues around each agent, local guidance mitigates congestion, reduces waiting, and remains scalable enough even with tight time budgets, yielding state‑of‑the‑art performance for one‑shot MAPF. This study asks whether the same benefits can be lifted to the lifelong setting (LMAPF), where tasks arrive continuously and improvements in per‑step plans can increase task completion throughput over long horizons. We propose LLLG, a Lifelong version of LaCAM enhanced with Local Guidance, which employs a receding‑horizon windowed planning framework and warm‑starts guidance from the previous solution at each timestep. Our method scales effectively, maintains high throughput even in compact, dense environments, and surpasses existing planners, thereby pushing the frontier of real‑time, lifelong MAPF.

PaperID: 2897, https://arxiv.org/pdf/2605.16853.pdf

Authors: Jun Wu, Jian Huang, Chongjun Wang

Title: A Truthful Multiunit Profit-Optimal Mechanism for Synthesizing Social Laws

Abstract:
This paper studies Social Law Synthesis (SLS) in strategic multi‑agent environments as a new multi‑unit mechanism design problem. We model SLS as a Bayesian single‑parameter procurement auction based on Alternating‑time Temporal Logic (ATL) and aim to design a truthful, individually rational, and profit‑optimal mechanism. We first prove a representation lemma showing that any valuation respecting alternating bisimulation can be compactly expressed as a feature set of ATL formulae. We then reduce payment determination to allocation determination in polynomial time, resolving the irregular payment issue inherent in multi‑unit settings. We further show that allocation determination is $FP^NP$‑complete and encode ATL semantics into integer linear programming (ILP) constraints to make the problem tractable with standard solvers. Based on these results, we present the \mathcalPO\text‑ASL mechanism, which is incentive‑compatible, individually rational, and maximizes expected profit. Theoretical guarantees and examples confirm that our approach provides an effective and computationally feasible solution for synthesizing optimal social laws under strategic agent behavior.

PaperID: 2898, https://arxiv.org/pdf/2605.16821.pdf

Authors: Xiaohua Wang, Chao Han, Kai Yu, XiaoLiang Xu, Liang Wang

Title: Multi-Paradigm Agent Interaction in Practice:A Systematic Analysis of Generator-Evaluator, ReAct Loop,and Adversarial Evaluation in the buddyMe Framework

Abstract:
The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi‑Agent Orchestration (Generator‑Evaluator), ReAct Tool‑Use Loops, and Memory‑Augmented Interaction, as implemented in buddyMe, an open‑source multi‑model agent programming framework. We formalize a five‑stage processing pipeline: Requirement Pre‑Review ‑> Task Decomposition ‑> ReAct Execution ‑> Real‑Execution Verification ‑> Adversarial Evaluation Discussion, and establish a six‑dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real‑world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator‑Evaluator pre‑review detects requirement omissions in 20 percent of complex tasks, with 80 percent tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator‑Defender discussions reach consensus within 2‑3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid‑based architectural diagrams and conduct cross‑paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A‑Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi‑paradigm agent systems.

PaperID: 2899, https://arxiv.org/pdf/2605.16784.pdf

Authors: Rui Ma, Zilin Bian, Kaan Ozbay

Title: Dynamic Deployment of Mobile Charging Trucks During Natural Disaster Evacuation: An Offline-to-Online Framework

Abstract:
During large‑scale evacuations, concentrated electric vehicle (EV) charging demand can overload fixed charging stations (FCSs), leading to prolonged waiting time and increased risk exposure. To address this challenge, this study proposes dynamically deploying mobile charging trucks (MCTs) to complement FCSs, and develops an Adaptive Risk‑aware MCT Deployment (ARMD) framework for real‑time operation. It divides the MCT deployment into two problems: risk‑aware allocation of MCTs among FCSs and dynamic routing of MCTs to the assigned FCSs, and solves them under an offline‑to‑online paradigm. The resource allocation problem is formulated as a decentralized partially observable Markov decision process, and a multi‑agent proximal policy optimization (MAPPO)‑based policy is developed to coordinate multiple MCTs under decentralized observations. The policy is pre‑trained offline in an evacuation simulator and adaptively refined online according to current evacuation context. For routing, a spatio‑temporal travel time predictor is developed to support rolling‑horizon route updates. The proposed framework is evaluated in a simulated hurricane evacuation environment built using real‑world data from Hillsborough County, Florida. Experiments show that ARMD consistently outperforms offline optimization, online heuristic dispatch, and rolling‑horizon optimization in reducing risk exposure. For demand perturbation scenarios, ARMD reduces average risk exposure by up to 71.1%, relative to the baseline without MCTs. In the case of fixed e‑vehicle charging infrastructure or road link failures, ARMD achieves 39.3% to 60.5% reduction in average risk exposure, with its advantages becoming more pronounced as the severity of disruption increases. These results demonstrate the effectiveness and robustness of ARMD in enhancing mobile charging operations for realistic scenarios of uncertain evacuation conditions.

PaperID: 2900, https://arxiv.org/pdf/2605.16757.pdf

Authors: Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma

Title: NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

Abstract:
Multi‑agent language systems are often built as hand‑designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi‑agent language system as a trainable and scalable neural‑network‑like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role‑free but structure‑aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi‑agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter‑efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference‑time and trained multi‑agent baselines. We further find that organizational scaling is path‑dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi‑agent systems are a promising scaling axis for LLMs.

PaperID: 2901, https://arxiv.org/pdf/2605.16748.pdf

Authors: Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

Title: Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Abstract:
Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval‑based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi‑Agent Quality Control (QC) loop. Instead of a single‑pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi‑stage, self‑correcting pipeline, Genflow improved the yield of brand‑compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise‑grade generative systems.

PaperID: 2902, https://arxiv.org/pdf/2605.16746.pdf

Authors: Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

Title: State Contamination in Memory-Augmented LLM Agents

Abstract:
LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long‑horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi‑agent rollouts, we show that toxic‑origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub‑threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub‑threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory‑augmented agents should be treated as a state‑control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

PaperID: 2903, https://arxiv.org/pdf/2605.16716.pdf

Authors: Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

Title: MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation

Abstract:
Text‑to‑video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi‑agent prompt refinement framework designed to improve cultural fidelity in both mono‑cultural and cross‑cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono‑cultural and cross‑cultural scenarios. Evaluations combining CLIP‑based metrics, VLM‑as‑judge assessments, and videoquality measures show that multi‑agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available athttps://github.com/AIM‑SCU/CRAFT

PaperID: 2904, https://arxiv.org/pdf/2605.16604.pdf

Authors: Raghu Vamshi Hemadri, Humaira Firdowse Mohammed, Rishabh Maheshwary, Srivatsava Daruru, Sagar Davasam, Vikas Yadav, Srinivas Sunkara, Sai Rajeswar

Title: R2V Agent: Teaching SLMs When to Ask for Help

Abstract:
Efficient agentic systems should incur expensive frontier‑model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid‑trajectory ‑ after flaky tool calls, truncated observations, or compounding local errors ‑ making pre‑execution routing brittle. We introduce R2V‑Agent, a risk‑calibrated SLM‑LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step‑level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well‑defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier‑guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier‑calibrated probability estimation and a Conditional Value‑at‑Risk (CVaR)‑constrained objective that penalizes worst‑case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability‑cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM‑only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic‑router cost.

PaperID: 2905, https://arxiv.org/pdf/2605.16575.pdf

Authors: Romain Cosentino, Sarath Shekkizhar, Adam Earle, Silvio Savarese

Title: Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

Abstract:
Negotiation requires more than inferring what the other side wants: it requires using that information to make advantageous offers and counteroffers over multiple turns. We study whether large language model (LLM) agents do this in a controlled multi‑attribute bargaining environment. We find that current LLM agents can model a counterparty's preferences, but do not reliably turn that knowledge into strategic bargaining. When given negotiating partner preference information, agents model it accurately and early in their reasoning traces, yet this does not reliably improve outcomes for the informed side. Turn‑level analyses show why: agents often respond to what they believe the counterparty values, but do not consistently pair those moves with gains on their own high‑value attributes. Sellers are more accommodating overall, and in asymmetric‑information conditions, the informed side often makes the more weakly compensated concessions. Because agents fail to leverage this underlying utility structure for strategic advantage, their final agreements are heavily dictated by surface‑level opening anchors rather than actual utility weights. Finally, requiring agents to explicitly state concession‑for‑reciprocity trades before making an offer makes individual turns look more strategic, but ultimately fails to improve the efficiency of the final agreements.

PaperID: 2906, https://arxiv.org/pdf/2605.16552.pdf

Authors: Angelos Angelopoulos, James F. Cahoon, Ron Alterovitz

Title: From Prompts to Protocols: An AI Agent for Laboratory Automation

Abstract:
Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed‑loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node‑based diagrams synchronized with the AI agent's protocol representation, enabling seamless alternation between AI‑assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first‑attempt protocol generation success rate and an order of magnitude reduction in required interface actions.

PaperID: 2907, https://arxiv.org/pdf/2605.16508.pdf

Authors: Charles Chen, Qiming Yu, Yuhang Gu, Zhuoye Huang, Hanjing Li, Hongyu Liu, Simin Liu, Jinhao Liu, Dengyun Peng, Jiangyi Wang, Zheng Yan, Fanqing Meng, Ethan Qin, Carl Che, Mengkang Hu

Title: The Scaling Laws of Skills in LLM Agent Systems

Abstract:
As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real‑world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single‑step routing accuracy decays logarithmically with library size (R^2>0.97 for all models), with errors progressing from local skill competition to cross‑family drift and capture by overly general "black‑hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about 4×. A single parameter, the routing logarithmic decay slope b, couples the two laws: routing‑side fits predict execution‑side rescue across models, showing that the same library property controls both pre‑execution collapse and downstream recoverability. The laws are actionable: law‑guided optimization raises held‑out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

PaperID: 2908, https://arxiv.org/pdf/2605.16436.pdf

Authors: Osama Zafar, Alexander Nemecek, Erman Ayday

Title: The End of Trust: How Agentic AI Breaks Security Assumptions

Abstract:
For decades, the security of digital interaction has rested on an unacknowledged economic constraint. Attackers faced a tradeoff between the fidelity of a deception and the scale at which it could be deployed. Convincing impersonation required sustained human effort and was confined to a narrow set of high‑value targets, while mass‑market attacks sacrificed plausibility for reach. Detection systems, verification mechanisms, and user awareness training have all been implicitly calibrated to the artifacts of cheap deception that this tradeoff produced. Agentic AI collapses the tradeoff, allowing high‑fidelity, individually tailored deception to be produced at mass‑market scale. We argue that this shift exhausts a security paradigm rather than merely intensifying the threat landscape. We introduce the Infinite Impostor, an attack model in which an autonomous agent interposes itself between two parties who already trust each other, hijacking an existing relationship rather than building a new one from scratch. Detection‑oriented defenses share an assumption that generative progress is eliminating, that synthetic outputs are distinguishable from authentic ones. We propose a suspect‑by‑default paradigm that shifts security from authenticating actors to evaluating actions, and examine the governance tensions that arise when platforms become the regulatory substrate of digital interaction.

PaperID: 2909, https://arxiv.org/pdf/2605.16346.pdf

Authors: Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Xiaozhe Zhang, Litian Zhang

Title: PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

Abstract:
LLM‑based multi‑agent systems (LLM‑MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks that malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system‑level compromise. Existing defenses largely rely on local filtering or graph‑based anomaly detection, but they often fail to trace fine‑grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation‑aware framework for safeguarding LLM‑MAS. PropGuard constructs a dual‑view spatio‑temporal graph that combines response‑centric risk estimation with full‑state evidence preservation. Guided by these risk priors, a GE‑GRPO trained inspector sequentially explores the full‑state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph‑aware diagnosis and applies source‑guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task‑level defense success, achieving a favorable effectiveness‑‑efficiency trade‑off.

PaperID: 2910, https://arxiv.org/pdf/2605.16303.pdf

Authors: Rubén Garzón, Pauline Baron, Vincent Grari, Jonne Kamphorst, Michael Bernstein, Marcin Detyniecki

Title: From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

Abstract:
Large language models (LLM) agents may offer tools to predict human responses to surveys. A common technique for defining these agents uses only demographics, for example country, age, gender, employment status, income, education and marital status. We compare the predictive accuracy of demographic agents to that of survey agents defined with a larger set of in‑domain survey responses. We test both approaches in predicting responses to the multidisciplinary, cross‑national Survey of Health, Ageing and Retirement in Europe (SHARE), focusing on five variables from three policy‑relevant constructs around personal finance. In these three constructs, we observe that, compared to survey agents trained on broader data, demographics‑only agents (1) exhibited a central tendency bias, skewing answers toward population means, and (2) were unrealistically accurate, failing to reproduce the incorrect answers and "don't know" responses typical of human respondents. These performance differences are further substantiated through the replication of a hierarchical regression analysis from prior retirement planning research. Agents based solely on demographic information reproduce the outcome that financial risk tolerance, future time perspective, and knowledge of retirement planning each are predictive of retirement savings. However, only the survey‑anchored agents succeed in reproducing the interaction among these three factors. These findings suggest caution in using only demographics to define LLM agents for predicting survey responses.

PaperID: 2911, https://arxiv.org/pdf/2605.16302.pdf

Authors: Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

Title: Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Abstract:
Reinforcement learning for multi‑step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit‑assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual‑comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process‑level advantage estimator that converts sparse terminal rewards into step‑sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code‑reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

PaperID: 2912, https://arxiv.org/pdf/2605.16300.pdf

Authors: Mehmet Haklidir

Title: Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics

Abstract:
Robotic systems are moving from isolated platforms to interconnected multi‑agent ecosystems that operate in human environments. This shift raises a governance problem that existing frameworks do not address: how does consent propagate, degrade, and break down across chains of delegation between embodied autonomous agents? The AI ethics community has begun to study consent for digital software agents, and the HRI community has examined consent in dyadic human‑robot encounters. Neither body of work covers what happens when physical robots delegate tasks to other robots in ways that affect humans. This paper introduces consent chain degradation (CCD), a conceptual framework for analyzing how the specificity, validity, and scope of human consent erodes as authority passes through multi‑robot delegation chains. We propose a three‑layer governance architecture, the Consent Runtime Verification Framework for Embodied Agents (CoRVE), which integrates consent scope modeling, delegation chain tracking, and physical irreversibility assessment. Three scenarios in healthcare, domestic, and industrial robotics show how CCD arises in practice, including a worked numerical example. A regulatory gap analysis covering the EU AI Act, the GDPR, the Machinery Regulation, and the Revised Product Liability Directive shows that all four instruments leave core CCD dimensions unaddressed.

PaperID: 2913, https://arxiv.org/pdf/2605.16297.pdf

Authors: Mingjun Li, Xiaojun Ye

Title: Task-Level AI Readiness Assessment for Business Process Management:The T-IPO Model and LARA Matrix in Financial-Services IT Operations

Abstract:
Which tasks inside an enterprise workflow can a large‑language‑model agent reliably handle, and under what conditions? Most business process modeling frameworks still answer this at the activity level, even though a single activity can bundle work of radically different difficulty. This paper takes the analysis a step smaller. We describe two design artifacts developed in a financial‑services IT setting: T‑IPO, which represents each task as an eight‑element tuple, and LARA (LLM Agent Readiness Assessment), a five‑dimension rubric that scores a task's readiness for agent substitution. Compliance Sensitivity carries 1.5× weight, a value we fixed through a three‑round Delphi study and cross‑checked with AHP. The rubric produces four levels, L1 to L4, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3 no matter what the other scores say. Both artifacts sit inside a larger methodology (PARTIS) that we map onto BWW ontology in Section 3. We evaluate the instruments across 127 tasks. Inter‑rater agreement reaches Fleiss' κ= 0.80; a replication at three further institutions returns κ= 0.73. A controlled comparison against activity‑level assessment suggests, though does not prove, an improvement in predictive utility at the task level. Pilot deployment of 120 task instances confirms that auto‑completion decays monotonically from 95% at L1 through about 70% at L2 to about 40% at L3. Exploratory factor analysis points to a two‑factor structure: task readiness seems to be determined jointly by cognitive‑execution complexity and governance‑compliance intensity. We close with a recalibration procedure (LARA‑TCA) so the rubric can keep pace with evolving LLM capabilities.

PaperID: 2914, https://arxiv.org/pdf/2605.16250.pdf

Authors: Pavan Manjunath, Thomas Pruefer

Title: A Generative AI Framework for Intelligent Utility Billing CO 2 Analytics and Sustainable Resource Optimisation

Abstract:
Distribution utilities are now expected to deliver bills that customers can actually read attach a defensible carbon number to every kWh sold and schedule load against grid stress and emissions constraints We propose an end‑to‑end framework that unifies four production‑grade capabilities under one architectural roof a generative‑AI agent that drafts each customers natural‑language billing statement from structured numeric inputs under a constrained decoding policy a transformer‑based forecaster that supplies the day‑ahead consumption estimate with calibrated quantile bands

PaperID: 2915, https://arxiv.org/pdf/2605.16233.pdf

Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Title: FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Abstract:
Can LLM agents improve decision‑making through self‑generated memory without gradient updates? We propose FORGE (Failure‑Optimized Reflective Graduation and Evolution), a staged, population‑based protocol that evolves prompt‑injected natural‑language memory for hierarchical ReAct agents. FORGE wraps a Reflexion‑style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few‑shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best‑performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE‑2, a stochastic network‑defense POMDP at a 30‑step horizon against the B‑line attacker, where all four tested LLM families (Gemini‑2.5‑Flash‑Lite, Grok‑4‑Fast, Llama‑4‑Maverick, Qwen3‑235B) exhibit strongly negative, heavy‑tailed zero‑shot rewards. Compared against both a zero‑shot baseline and a Reflexion baseline (isolated single‑stream learning), FORGE improves average evaluation return by 1.7‑7.7× over zero‑shot and by 29‑72% over Reflexion in all 12 model‑representation conditions, reducing major‑failure rates (below ‑100) to as low as ～1%. We find that (1) population broadcast is critical mechanism, with a no‑graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost‑reliability profile with ～40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE‑2 B‑line; cross‑family findings are directional evidence.

PaperID: 2916, https://arxiv.org/pdf/2605.16205.pdf

Authors: Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Title: Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Abstract:
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE‑2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non‑positive, so all configurations operate in a failure‑mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token‑level cost accounting. We vary context representation (raw observations vs. a deterministic state‑tracking layer with compressed history), deliberation (self‑questioning, self‑critique, and self‑improvement tools, with optional chain‑of‑thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub‑agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8‑2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost‑effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per‑agent reasoning, as these strategies can interfere when combined.

PaperID: 2917, https://arxiv.org/pdf/2605.16178.pdf

Authors: Ignacio Lopez-Gomez, Michael P. Brenner, Tapio Schneider

Title: Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI

Abstract:
Accurate seasonal runoff forecasts are critical for managing California's reservoirs and water supply for millions of its residents. Winter snow accumulation provides a strong source of predictability of snowmelt‑based runoff in the spring and summer months, but progressive hydroclimatic changes in the Sierra Nevada are altering its timing and volume. These changes reduce the skill of statistical forecasts trained on historical data, highlighting the need for improved forecasting systems that can capture the changing dynamics of snowmelt. Here we demonstrate that a collaborative workflow between an agentic AI assistant and an automated code‑mutation system, both powered by large language models, can accelerate the development of competitive seasonal runoff forecasting systems. In our framework, the AI agent discovers relevant datasets, synthesizes domain knowledge from prior forecasting competitions and the scientific literature, and explores the space of model architectures, while the code‑mutation system refines each of the solutions explored by the agent through Monte Carlo Tree Search over the code space. The resulting system forecasts monthly Full Natural Flow (FNF) at 1‑ to 6‑month lead times across 23 Sierra Nevada watersheds using an adaptive ensemble of three XGBoost quantile regression sub‑models with physics‑informed feature engineering. Evaluated against California's operational Bulletin 120 forecasts over 2021‑2025, the agent‑evolved model achieves superior skill for early‑season cumulative April‑July runoff predictions, reducing watershed‑averaged quantile forecast error by up to 29%, and offering a new paradigm for AI‑driven scientific model development in the geosciences.

PaperID: 2918, https://arxiv.org/pdf/2605.16144.pdf

Authors: Adnan Quadri, Hongxiang Li

Title: MAxLM: Multi-Agent Language Model-Based Scheduling and Resource Allocation in MU-MIMO-OFDMA-Enabled Wireless Networks

Abstract:
Wireless networks support multi‑user (MU) communication with multiple‑input multiple‑output (MIMO) and orthogonal frequency‑division multiple access (OFDMA) technologies. In the joint MU‑MIMO‑OFDMA‑enabled transmission mode, network throughput can be significantly increased by effectively utilizing the multi‑channel resources to schedule numerous wireless users/stations (STAs) simultaneously. In this paper, we study ways to optimize the user scheduling and resource allocation (SRA) for the UL scheduled access (UL‑SA) of a joint MU‑MIMO‑OFDMA‑enabled wireless local area network (WLAN). In particular, we propose a multi‑agent (MA) framework that utilizes an openly available pretrained small/medium‑sized Language Model (xLM) to perform SRA for the UL‑SA. To facilitate autonomous SRA using our proposed technique, we introduce the AI‑assisted Wireless Systems Engineering and Research (WiSER) platform. We evaluate the performance of MAxLM‑optimized SRA for network scenarios with a varying number of STAs and antenna settings on the WLAN Access Point. Numerical results confirm that our proposed technique achieves higher UL‑SA throughput than the benchmark techniques.

PaperID: 2919, https://arxiv.org/pdf/2605.16143.pdf

Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

Title: Look Before You Leap: Autonomous Exploration for LLM Agents

Abstract:
Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment‑specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task‑oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task‑execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore‑then‑Act paradigm, which decouples information‑gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real‑world‑ready agents.

PaperID: 2920, https://arxiv.org/pdf/2605.16116.pdf

Authors: Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

Title: ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Abstract:
Developing and evaluating e‑commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non‑stationary, difficult to inspect, and irreproducible, while hand‑built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e‑commerce web agents. ShopGym is a framework for constructing e‑commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self‑contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self‑contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent‑evaluation signals relevant to shopping tasks. We validate the framework through graph‑based structural analysis and agent‑based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

PaperID: 2921, https://arxiv.org/pdf/2605.16097.pdf

Authors: Ning Zhou, Nikolai W. F. Bode, Edmund R. Hunt

Title: Multi-Agent Cooperative Transportation: Optimal and Efficient Task Allocation and Path Finding

Abstract:
Multi‑robot systems are integral to modern logistics, but their capabilities are often limited to tasks executable by individual agents. This paper addresses a critical gap in existing frameworks like Multi‑Agent Path Finding (MAPF) and Task Allocation and Path Finding (TAPF), which lack true cooperation for transporting large items that require multiple agents. To this end, we formalise the Cooperative Transportation Task Allocation and Path Finding (CT‑TAPF) problem, which integrates team formation, task assignment, and collision‑free pathfinding. We present an optimal solver, Cooperative Transportation Task Conflict‑Based Search (CT‑TCBS), which features a novel Incremental Expansion strategy to tackle the combinatorial explosion inherent in team formation. Recognising the computational cost of optimality, we also develop a family of sub‑optimal solvers that employ a global, task‑centric perspective, selecting the next task to assign based on a global difficulty metric (Best Task or Worst Task). Our comprehensive empirical evaluation demonstrates three key findings: (1) the incremental expansion strategy significantly outperforms the naive combinatorial approach by successfully pruning the dominant task‑allocation search space; (2) we identify a task‑conflict expansion dilemma, where sophisticated conflict resolvers effective for large‑agent pathfinding subproblems can be detrimental in the integrated CT‑TAPF setting; and (3) our proposed sub‑optimal solvers establish a new, more efficient frontier on the solution quality‑runtime spectrum compared to "nn‑" agent‑centric baselines. This work provides a foundational framework and a set of effective algorithms for a new, practical class of cooperative multi‑agent problems.

PaperID: 2922, https://arxiv.org/pdf/2605.16045.pdf

Authors: Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

Title: RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

Abstract:
Memory systems often organize user‑agent interactions as retrievable external memory and are crucial for long‑running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encode them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence are observed for semantically similar interactions. Such recurrence‑based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine‑grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.

PaperID: 2923, https://arxiv.org/pdf/2605.15962.pdf

Authors: Chuxu Song, Hao Wang, Richard Martin

Title: PersonaFingerprint: Measuring Persona Inference on Modern Websites with LLM-Driven Browsing

Abstract:
Website Fingerprinting (WFP) has traditionally focused on inferring which website a user visits from encrypted traffic metadata such as packet sizes and timing. In this paper, we identify and quantify a new privacy risk in modern web settings: an adversary can infer a user's persona using only packet‑length and inter‑arrival‑time sequences. To study this risk at scale, we build an LLM‑driven multi‑agent browsing framework that enforces controllable persona constraints while a computer‑use agent interacts with real websites and collects corresponding encrypted traffic traces. We formalize persona fingerprinting under both closed‑set and open‑world settings and further evaluate whether persona information is already embedded in representations learned by existing WFP models and can be amplified at low cost. Across 10 modern websites and 15 personas (plus an open‑world class), persona inference achieves about 84% accuracy on mixed‑site traffic; moreover, a lightweight multi‑task objective can boost persona accuracy to around 80% while retaining strong site classification performance (about 93% baseline). Our results show that, on modern websites, encrypted traffic metadata can leak not only which site a user visits, but also how they browse and who is browsing.

PaperID: 2924, https://arxiv.org/pdf/2605.15871.pdf

Authors: Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

Title: Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Abstract:
Toward recursive self‑improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual‑framework approach: AIRA‑Compose for high‑level architecture search, and AIRA‑Design for low‑level mechanistic implementation. AIRA‑Compose uses 11 agents to explore fundamental computational primitives under a 24‑hour budget. Agents evaluate million‑parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer‑based) and AIRAhybrids (Transformer‑Mamba). Pre‑trained at 1B scale, these consistently outperform Llama 3.2 and Composer‑found baselines. On downstream tasks, AIRAformer‑D and AIRAhybrid‑D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA‑Compose finds models with highly efficient scaling frontiers: AIRAformer‑C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid‑C outscales Nemotron‑2 by 23% and Composer's best hybrid by 37%. AIRA‑Design tasks 20 agents with writing novel attention mechanisms for long‑range dependencies and high‑performing training scripts. On the Long Range Arena benchmark, agent‑designed architectures reach within 2.3% and 2.6% of human state‑of‑the‑art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits‑per‑byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand‑designed baselines. This establishes a powerful paradigm for discovering next‑generation foundation models, marking a clear step toward recursive self‑improvement.

PaperID: 2925, https://arxiv.org/pdf/2605.15799.pdf

Authors: Hiroki Nagai, Keisuke Okumura

Title: From Gridworlds to Warehouses: Adapting Lightweight One-shot Multi-Agent Pathfinding for AGVs

Abstract:
Multi‑agent pathfinding (MAPF) under one‑shot planning is a core component of warehouse automation, yet classical formulations typically assume four‑connected 2D grids with unit‑time moves in four directions. To fill reality gaps while still being trackable with discrete combinatorial search, this work proposes a more practical counterpart tailored to differential‑drive AGVs. We term this multi‑agent warehouse pathfinding (MAWPF), featured with four constraints: (i) agent actions are restricted to straight motion and in‑place rotation; (ii) rotations require multi‑step costs; (iii) acceleration and deceleration are considered, and; (iv) follower collisions are prohibited to prevent rear‑end crashes. To solve MAWPF efficiently, we adapt representative suboptimal MAPF algorithms‑PP, LNS2, PIBT, and LaCAM‑and conduct comprehensive benchmarking. Our experiments reveal that PP and LNS2 struggle to solve instances with many agents, while PIBT‑based approaches achieve preferable scalability with increased solution cost. We believe that these constitute an important step toward adapting classical gridworld MAPF to operational warehouse setups.

PaperID: 2926, https://arxiv.org/pdf/2605.15782.pdf

Authors: Viswa Narayanan Sankaranarayanan, Vignesh K. Viswanathan, Akshit Saradagi, Sumeet Satpute, George Nikolakopoulos

Title: Reactive Robot-Centric Safety for Autonomous Navigation in Constrained and Dynamic Environments

Abstract:
In this work, we address the problem of ensuring real‑time safety in autonomous robot navigation, in spatially constrained dynamic environments, by utilizing only onboard sensors. We present a real‑time control architecture that integrates a 3D LIDAR perception‑based composite control barrier function(CBF)‑based safety filter directly into the autonomy pipeline. The proposed perception‑driven framework enforces collision avoidance constraints dynamically from onboard point cloud data, thus allowing a large number of constraints to be handled at the control frequency, while remaining minimally invasive to nominal task execution. The safety region is defined as an ellipsoid in the body‑frame, consistent with the geometry of the platform, which induces time‑varying constraints in the world frame as the robot rotates; this effect is handled through a dedicated formulation of time‑varying (CBF) for each LIDAR point. We validate the system through multiple field experiments in underground environments by utilizing a quadruped platform performing a visual inspection task, demonstrating reliable operation in the presence of dynamic obstacles, unsafe high‑level references, abrupt localization anomalies, and while traversing through narrow corridors.

PaperID: 2927, https://arxiv.org/pdf/2605.15768.pdf

Authors: Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

Title: ALSO: Adversarial Online Strategy Optimization for Social Agents

Abstract:
Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi‑turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non‑stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill‑suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi‑agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi‑turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non‑stationarity without relying on environmental stability assumptions. (2) To predict rewards and generalize sparse feedback in multi‑turn dialogues, ALSO introduces a lightweight neural surrogate to predict rewards from interaction histories, enabling sample‑efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.

PaperID: 2928, https://arxiv.org/pdf/2605.15766.pdf

Authors: Loka Li, Duzhen Zhang, Xingbo Du, Leonard Song, Zixiao Wang, Assanali Aukenov, Noel Thomas, Shakhnazar Sailaukan, Yonghan Yang, Feilong Chen, Jiahua Dong, Kun Zhang, Bin Zhang, Le Song

Title: BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

Abstract:
Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task‑specific model training pipelines for heterogeneous and multi‑modal biomedical datasets. BioXArena contains 76 end‑to‑end tasks across 9 domains, including sequence modeling, single‑cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype‑disease modeling, biomedical imaging, and text‑integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held‑out graders, and biology‑aware metrics normalized to a 0 to 1 scale. Agents are required to write executable code, train predictive models, and generate submissions for private test samples. Most tasks involve multiple input modalities, including tabular data, images, natural language, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations in a standardized 2‑hour single‑GPU environment. MLEvolve with Gemini‑3.1‑Pro achieves the highest average score of 0.666, followed by GPT‑5.4 with 0.636, while no single agent consistently dominates across all domains. We additionally perform extensive ablation studies, robustness evaluations, scaling analyses, cost analyses, and failure‑mode investigations to better understand how model backbones, agent scaffolds, inference budgets, and biomedical domains influence BioML coding performance. We will publicly release all benchmark tasks, graders, execution runners, leaderboard results, and agent trajectories.

PaperID: 2929, https://arxiv.org/pdf/2605.15706.pdf

Authors: Xingjian Wu, Junkai Lu, Siyu Yan, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

Title: Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

Abstract:
Recent advances in Large Language Models (LLMs) have catalyzed the development of multi‑agent systems (MAS) for complex reasoning tasks. However, existing MAS typically rely on pre‑defined or pre‑compiled communication topologies, which limits their flexibility and adaptability to dynamic task requirements. In this work, we propose Differentiable Mixture‑of‑Agents (DMoA), a self‑evolving multi‑agent framework that enables elastic and adaptive agent collaboration during inference. Instead of statically constructing workflows, DMoA dynamically routes and activates agents at each reasoning step, allowing the system to implicitly simulate diverse communication topologies and adapt to evolving demands. To achieve this, we design a differentiable, context‑aware routing mechanism that leverages recurrent structures to incorporate historical and contextual information, producing sparse agent activations in a step‑wise manner. Furthermore, we introduce predictive entropy as self‑supervised signals to optimize the routing process, enabling efficient test‑time adaptation without external annotations. Extensive experiments across 9 benchmarks demonstrate that DMoA achieves state‑of‑the‑art performance while exhibiting strong efficiency, robustness, and ensembling capabilities.

PaperID: 2930, https://arxiv.org/pdf/2605.15697.pdf

Authors: Pengcheng Dai, He Wang, Dongming Wang, Jian Qin, Wenwu Yu

Title: Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback

Abstract:
We study a networked multi‑agent reinforcement learning (NMARL) problem with human feedback in an infinite‑horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single‑agent settings and rely on centralized training, which limits their scalability and applicability to large‑scale networked multi‑agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as H‑horizon trajectory pairs aggregated over each agent's κ‑hop neighborhood. Building on this, we develop a distributed zeroth‑order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the current joint policy and a perturbed joint policy drawn from zero‑mean Gaussian distribution. Specifically, the algorithm is fully distributed, as the feedback received by each agent depends solely on the state‑action information within its κ‑hop neighborhood and does not require explicit reward signals or centralized control. We further rigorously establish that the proposed algorithm converges to an ε‑stationary point with polynomial sample complexity. Finally, simulation results in a stochastic GridWorld environment and a predator‑prey environment further demonstrate that the effectiveness and scalability of the proposed algorithm in achieving collaborative optimization based solely on human preference feedback.

PaperID: 2931, https://arxiv.org/pdf/2605.15625.pdf

Authors: Lijie Ding, Changwoo Do

Title: ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

Abstract:
We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self‑assembly, and materials design. Without dedicated simulation tools and workflow instructions, general‑purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom‑built colpack Python package that wraps HOOMD‑blue hard‑particle Monte Carlo, and the skill encodes a four‑stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end‑to‑end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard‑disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage‑specific prompts. This benchmark provides a stage‑level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent‑assisted research workflow.

PaperID: 2932, https://arxiv.org/pdf/2605.15611.pdf

Authors: Junle Wang, Xingchuang Liao, Wenjun Wu

Title: TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Abstract:
Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non‑stationary topology drift induced by autoscaling and rolling updates. Recent LLM‑based RCA agents can generate tool‑grounded explanations, yet they often remain topology‑agnostic and suffer from \emphsymptom‑amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology‑aware self‑evolving multi‑agent framework that couples graph representation learning with structured, topology‑constrained reasoning. TopoEvo first introduces \emphMetric‑orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies \emphVector Quantization (VQ) to discretize topology‑enhanced states into auditable \emphsymptom tokens with a symptom lexicon, enabling reliable retrieval and token‑level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi‑agent \emphHypothesis‑‑Evidence‑‑Test (HET) workflow to explicitly verify propagation‑consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a \emphSelf‑Evolving Mechanism refreshes hierarchical incident memory and performs conservative test‑time adaptation with high‑confidence pseudo‑labels to maintain robustness under drift.

PaperID: 2933, https://arxiv.org/pdf/2605.15585.pdf

Authors: Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

Title: See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Abstract:
Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render‑feedback‑aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render‑feedback‑aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post‑render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task‑specific visual planning module: it predicts sparse keyframe layouts with coarse‑to‑fine bounding‑box denoising and optimizes an interpolation‑aware objective to reduce intermediate‑frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout‑1K and EduRequire‑500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire‑500, OmniManim improves measured render quality over both single‑model baselines and existing multi‑agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding‑box refinement, and interpolation‑aware optimization, is central to these gains.

PaperID: 2934, https://arxiv.org/pdf/2605.15573.pdf

Authors: Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

Title: Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

Abstract:
Multi‑agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response‑conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non‑empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand‑crafted test‑time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy‑gradient optimization. Results demonstrate that the response‑conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.

PaperID: 2935, https://arxiv.org/pdf/2605.15505.pdf

Authors: Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

Title: X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

Abstract:
In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise‑valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X‑SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X‑SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four‑stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X‑SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

PaperID: 2936, https://arxiv.org/pdf/2605.15503.pdf

Authors: Debopriya Roy Dipta, Thore Tiemann, Eduard Marin, Thomas Eisenbarth, Berk Gulmezoglu

Title: uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

Abstract:
Microarchitectural attacks continue to evolve, uncovering new exploitation vectors in modern processors. From a defensive perspective, assessing a system's susceptibility to such attacks remains challenging. Developing functional attack implementations is labor‑intensive, requires deep microarchitectural expertise, and is highly sensitive to execution environments. Consequently, existing attacks often lack portability, limiting systematic and scalable vulnerability assessment. Recent advances in large language models (LLMs) suggest a potential avenue for lowering these barriers. However, it remains unclear whether LLMs can reliably generate functionally correct microarchitectural attack code suitable for rigorous vulnerability testing. In this work, we present uGen, the first LLM‑driven framework for automated microarchitectural attack code generation. A key challenge we address is identifying attack‑specific knowledge gaps in LLMs. Through a systematic study of state‑of‑the‑art models (GPT, Claude, and Qwen3), we find that LLMs frequently misgenerate or misplace critical attack primitives. Guided by this analysis, uGen employs a retrieval‑augmented, multi‑agent design that injects missing domain knowledge to synthesize functionally correct microarchitectural attack PoCs tailored to defender requirements. We evaluate uGen on cache‑based and speculative‑execution attacks across diverse set of microarchitectures, vulnerable functions, and LLM platforms. In the deployment stage, uGen achieves up to 100% success rate for Spectre‑v1 (Claude Sonnet‑4) and 80% for Prime+Probe (Qwen3‑Coder). Finally, we demonstrate that uGen can generate a successful PoC code with a cost of 1.25 in under four minutes.

PaperID: 2937, https://arxiv.org/pdf/2605.15486.pdf

Authors: Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu

Title: Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

Abstract:
This study introduces intelligent frameworks that use Large Language Models (LLMs) to improve task scheduling for construction robots. The LLM is fed with key data about the desired task, such as agent action abilities, and the desired end goal to be achieved. A well‑balanced allocation strategy is developed, optimizing both time efficiency and resource utilization. Our system utilizes a Natural Language Processing interface to streamline communication with construction professionals and adapt in real‑time to unexpected site conditions. We concurrently use two LLM agents, specifically generator (GPT‑4) and supervisor (Gemma 3/Llama 4/Mistral 7b) LLM agents to provide a more precise task schedule. We evaluate the proposed methodology using a straightforward scenario and provide metric scores to prove the efficacy of the frameworks. Our results highlight that the implementation of LLMs is crucial in construction operational tasks including robots.

PaperID: 2938, https://arxiv.org/pdf/2605.15343.pdf

Authors: Joshua C. Yang, Maurice Flechtner, Damian Dailisan, Michiel A. Bakker

Title: Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

Abstract:
LLM‑based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi‑turn opinion exchange. Yet generated transcripts often do not reveal why an agent's stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief‑update layer that treats "belief" as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log‑odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence‑level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence‑opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence‑grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.

PaperID: 2939, https://arxiv.org/pdf/2605.15338.pdf

Authors: Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth, Mario Fritz

Title: Hidden in Memory: Sleeper Memory Poisoning in LLM Agents

Abstract:
Large language models are increasingly augmented with persistent memory, allowing assistants to store user‑specific information across sessions for personalization and continuity. This statefulness introduces a new security risk: adversarial content can corrupt what an assistant remembers and thereby influence future interactions. We propose and study sleeper memory poisoning, a delayed attack in which an adversary manipulates external context, such as a document, webpage, or repository, to cause the assistant to store a fabricated memory about the user. Unlike conventional prompt injection, the attack can remain dormant and re‑emerge across multiple later conversations. We evaluate the full attack pipeline: whether poisoned memories are written, later retrieved, and ultimately used to steer the following conversations. Across stateful LLM assistants, poisoned memories were added up to 99.8% on GPT‑5.5 and 95% on Kimi‑K2.6. Crucially, among successful retrievals, poisoned memories cause attacker‑intended agentic actions in 60‑89% of evaluations across models. These results show that persistent memory can act as a long‑term attack surface across multiple future conversations.

PaperID: 2940, https://arxiv.org/pdf/2605.15301.pdf

Authors: Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

Title: Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Abstract:
Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi‑agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem‑solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem‑solving into a closed‑loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph‑structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state‑of‑the‑art among code‑generation agents, outperforming existing multi‑agent pipelines and nearly doubling the accuracy of single‑pass baselines.

PaperID: 2941, https://arxiv.org/pdf/2605.15245.pdf

Authors: Spyridon Alvanakis Apostolou, Jan Bosch, Helena Holmström Olsson

Title: Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle

Abstract:
Agentic AI in software product development is increasingly adopted by organizations, yet the field lacks a consolidated synthesis of where adoption is mature, which architectural patterns dominate, and what limitations and coping mechanisms exist in industrial deployments. This systematic literature review addresses these gaps by establishing a body of knowledge as a starting point. Following Kitchenham guidelines, we queried four major research databases, obtaining over 1600 candidate publications. To handle this volume, we developed and validated a domain‑agnostic multi‑agent screening pipeline that extends prior LLM‑assisted review tools by combining automatic metadata curation, inter‑agent iterative dialogue, and conflict‑resolution defaults that minimize false negatives. From the 92 manually verified primary studies, our thematic synthesis reveals that output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs‑of‑concept. We identify the Planner‑Executor‑Reviewer role specialization as the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces. This study contributes a comprehensive characterization of the current literature on agentic systems in software product development, and a methodological contribution in the form of an AI‑assisted tool to automate the screening phase in high‑volume SLR domains.

PaperID: 2942, https://arxiv.org/pdf/2605.15225.pdf

Authors: Bogdan Banu

Title: Do Biological Structural Guarantees Earn Their Complexity?

Abstract:
Biologically‑inspired AI agent frameworks claim reliability benefits through structural guarantees adapted from gene regulatory networks, immune systems, and metabolic control. These claims are rarely tested empirically against simpler alternatives. We present three deep benchmarks: metabolic priority gating, autoinducer‑based quorum sensing, and Bayesian stagnation detection, each comparing a biologically‑grounded implementation against a naive non‑biological alternative and an ablated control, across 1,000 trials per seed and 10 seeds (10M+ data points total).

PaperID: 2943, https://arxiv.org/pdf/2605.15204.pdf

Authors: Zhantao Wang

Title: SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Abstract:
Multi‑agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph‑based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi‑agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online‑RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite‑automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert‑curated scenarios trigger 1671 live API calls. Our GSPO‑aligned 7B Intent Router achieves higher joint accuracy than zero‑shot GPT‑4o on this FSM‑constrained adversarial routing benchmark (80.9% versus 48.9%). In end‑to‑end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message‑level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD‑derived dialogues spanning 8 service domains surfaces 201 stage‑order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi‑seed training comparisons and deeper workflow evaluations will be released in a subsequent update.

PaperID: 2944, https://arxiv.org/pdf/2605.15203.pdf

Authors: Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Yongchao Liu, Xingjun Ma, Zhu Sun

Title: Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation

Abstract:
We introduce Agent4POI, the first POI recommendation framework that generates context‑conditioned multimodal representations at recommendation time, rather than relying on static POI embeddings pre‑computed independently of context. Existing multimodal systems encode each POI once as a static embedding, a design that precludes reasoning about why the same cafe affords solo work on Monday but group celebration on Friday evening. We formally prove that no pre‑computed encoder can satisfy context‑sensitive ranking under standard bilinear scoring, motivating inference‑time item‑side representation. Agent4POI inverts this computation: given a situational context, a four‑phase LLM agent generates dynamic, context‑specific affordance queries (Phase 1) and executes a five‑step cross‑modal chain‑of‑thought over image, review, and metadata evidence (Phase 2). The resulting uncertainty‑aware affordance representation is grounded in Gibsonian affordance theory. These cross‑modal verdicts form a structured, uncertainty‑adjusted affordance representation (Phase 3), which is aligned with user preferences via a semantic caching system for low‑latency ranking (Phase 4). On three POI benchmarks and three evaluation configurations (standard, cold‑start, context‑shift), Agent4POI achieves a 23.2% relative gain over the strongest baseline and degrades by only 7.5% under context‑shift versus 16‑‑17% for the strongest baselines. In cold‑start scenarios, Agent4POI outperforms the best content‑based baseline by up to 2.4x, whereas ID‑based methods fail to generalize.

PaperID: 2945, https://arxiv.org/pdf/2605.15202.pdf

Authors: Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

Title: DeepSlide: From Artifacts to Presentation Delivery

Abstract:
Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under‑optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human‑in‑the‑loop multi‑agent system that supports preparing the full presentation process, from requirement elicitation and time‑budgeted narrative planning, to evidence‑grounded slide‑‑script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical‑chain planner with per‑node time budgets, (ii) a lightweight content‑tree retriever for grounding, (iii) Markov‑style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual‑scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide‑‑script synergy with clearer attention guidance.

PaperID: 2946, https://arxiv.org/pdf/2605.15184.pdf

Authors: Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

Title: Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Abstract:
Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval‑augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool‑calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under‑explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116‑question sample from LongMemEval, using a custom agent harness (Chronos) and provider‑native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file‑based tool results that the model reads separately. Experiment 2 compares grep‑only and vector‑only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool‑calling style is used, even when the underlying conversation data are the same.

PaperID: 2947, https://arxiv.org/pdf/2605.15155.pdf

Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Title: Self-Distilled Agentic Reinforcement Learning

Abstract:
Reinforcement learning (RL) has emerged as a central paradigm for post‑training LLM agents, yet its trajectory‑level reward signal provides only coarse supervision for long‑horizon interaction. On‑Policy Self‑Distillation (OPSD) complements RL by introducing dense token‑level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi‑turn agents proves problematic: compounding multi‑turn instability destabilizes supervision, while skill‑conditioned privileged guidance requires asymmetric treatment for negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self‑Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token‑level signals into a sigmoid gate, strengthening distillation on teacher‑endorsed positive‑gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search‑QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search‑QA, +10.2% on WebShop‑Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL‑‑OPSD baselines across model scales.

PaperID: 2948, https://arxiv.org/pdf/2605.15132.pdf

Authors: Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea

Title: APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Abstract:
Autonomous multi‑agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi‑agent systems from achieving high‑throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent‑Parallel Workload Architecture (APWA), a distributed multi‑agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non‑interfering subproblems that can be processed using independent resources without cross‑communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.

PaperID: 2949, https://arxiv.org/pdf/2605.15104.pdf

Authors: Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

Title: From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Abstract:
Voice agents increasingly require reliable tool use from speech, whereas prominent tool‑calling benchmarks remain text‑based. We study whether verified text benchmarks can be converted into controlled audio‑based tool calling evaluations without re‑annotating the tool schema and gold labels. Our dataset‑agnostic framework uses text‑to‑speech, speaker variation, and environmental noise to create paired text‑audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni‑modal models on audio‑converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model‑ and task‑dependent: Gemini‑3.1‑Flash‑Live obtains the highest Confetti score (70.4), whereas GPT‑Realtime‑1.5 performs best on When2Call (71.9). On Confetti, the text‑to‑voice gap ranges from 1.8 points for Qwen3‑Omni to 4.8 points for GPT‑Realtime‑1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real‑world deployment scenarios, we further report text‑only results, an ambiguity‑based reformulation stress test, and a reference‑free LLM‑as‑judge protocol validated against human preferences. Notably, we find that open‑source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy‑preserving evaluation. Overall, our framework provides a verifiable and reproducible first‑stage diagnostic that complements purpose‑built audio corpora.

PaperID: 2950, https://arxiv.org/pdf/2605.15097.pdf

Authors: Xinran Zheng, Alfredo Pesoli, Marco Valleri, Suman Jana, Lorenzo Cavallaro

Title: Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

Abstract:
Detecting memory corruption vulnerabilities in stripped binaries requires recovering object semantics, interprocedural propagation, and feasible triggers from low‑level, lossy representations. Recent LLM‑based approaches improve code understanding, but reliable detection still requires grounding in memory‑relevant semantics and runtime feasibility evidence. We present Veritas, a semantically grounded framework for binary memory corruption vulnerability detection. Veritas combines a static slicer over RetDec‑lifted LLVM IR, a dual‑view LLM detector that reasons step by step over grounded flows using decompiled C and selective LLVM IR, and a multi‑agent validator that checks hypotheses against debugger‑visible artifacts and runtime evidence. The slicer reconstructs value‑flow relations from LLVM‑IR facts, including def‑use, calls, returns, globals, and pointer operations, and emits compact witness‑backed flow objects. The detector uses these artifacts to reason about control flow, bounds, and object correspondence without rediscovering whole‑binary propagation. The validator confirms or rejects candidates through guided debugging, breakpoint inspection, and memory‑checking oracles. We implement Veritas as a modular pipeline and evaluate it on a curated benchmark of real‑world binary vulnerability cases. Veritas achieves 90% recall. For false‑positive assessment, we exhaustively validate and manually verify 623 detector candidates and audit additional candidates from larger cases. The exhaustive subset produces no false positives, while the additional audit identifies two confirmed false positives. In a real‑world application, Veritas discovered a previously unknown Apple vulnerability that was confirmed and assigned a CVE. These results support semantic grounding as an operational design principle for practical binary vulnerability detection.

PaperID: 2951, https://arxiv.org/pdf/2605.15077.pdf

Authors: Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez

Title: Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Abstract:
Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end‑to‑end latency. In this work, we introduce AsyncFC, a pure execution‑layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter‑function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine‑tuning or changes to the standard synchronous function‑calling protocol. Across standard function‑calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end‑to‑end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model‑tool interaction.

PaperID: 2952, https://arxiv.org/pdf/2605.15034.pdf

Authors: Vinicius Covas, Jorge Alberto Hidalgo Toledo

Title: AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

Abstract:
Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM‑based multi‑agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts ‑‑ a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi‑agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation ‑‑ from explicit monitoring by university researchers, to negation of monitoring, to an observer‑substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience‑framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition ‑‑ replacing human with AI observers ‑‑ yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.

PaperID: 2953, https://arxiv.org/pdf/2605.15028.pdf

Authors: Linar Samigullin, Sergei Shumilin, Evgeny Burnaev

Title: Multi-Agentic Approach for History Matching of Oil Reservoirs

Abstract:
History matching is a central inverse problem in reservoir engineering, where uncertain reservoir parameters must be calibrated against observations. Although automated history matching can reduce manual effort, practical deployment remains difficult because engineers must still configure heterogeneous workflows involving parameter selection, physically admissible bounds, optimizer choice, hyperparameter tuning, simulator execution, and diagnostic reporting. We propose PetroGraph, a multi‑agent framework for intelligent reservoir history matching that decomposes this workflow into specialized agents for model review, experimental planning, parameterization, optimization, simulation, and summarization. The system combines large language model agents with domain‑specific tools, retrieval‑augmented access to simulator documentation, validation of modified ECLIPSE input decks, human‑in‑the‑loop checkpoints, and an OPM Flow‑based simulation backend. This design enables users to initiate and steer history matching through natural language while preserving explicit control over selected parameters and optimization settings. We evaluate PetroGraph on three reservoir models of increasing complexity: the synthetic SPE1 model, the faulted SPE9 benchmark, and the real‑field Norne model. Using weighted normalized root mean square error as the objective, PetroGraph reduces the mismatch by 95% on SPE1, 69% on SPE9, and 13% on Norne. These results demonstrate that multi‑agent orchestration can automate key decisions in history matching, lower the expertise barrier for operating complex simulation workflows, and provide a flexible foundation for extensible, domain‑aware reservoir model adaptation.

PaperID: 2954, https://arxiv.org/pdf/2605.14892.pdf

Authors: Shihao Qi, Jie Ma, Rui Xing, Wei Guo, Xiao Huang, Zhitao Gao, Jianhao Deng, Jun Liu, Lingling Zhang, Bifan Wei, Boqian Yang, Pinghui Wang, Jianwen Sun, Jing Tao, Yaqiang Wu, Hui Liu, Yu Yao, Tongliang Liu

Title: Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

Abstract:
LLM‑based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi‑agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self‑improvement. Existing surveys cover individual agent capabilities, multi‑agent collaboration, or agent self‑evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self‑improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross‑stage research agenda for closed‑loop multi‑agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self‑organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self‑improving multi‑agent intelligence.

PaperID: 2955, https://arxiv.org/pdf/2605.14879.pdf

Authors: Nikolaos Al. Papadopoulos

Title: Temporal Fair Division in Multi-Agent Systems: From Precise Alternation Metrics to Scalable Coordination Proxies

Abstract:
A plethora real‑world environments require agents to compete repeatedly for the same limited resource, calling for a temporal notion of fairness judged across entire interaction histories. This paper advances the theory of temporal fair division by introducing Rotational Periodicity (RP), a family of lightweight metrics, alongside the ALT family of sliding‑window measures, within a unified framework for repeated multi‑agent resource competition. We formalise the Multi‑Agent Battle of the Exes (MBoE) as a repeated fair division instance and establish Perfect Alternation (PA) as its canonical temporally fair solution, drawing connections to proportionality, envy‑freeness, and n‑periodic round‑robin allocation. RP decomposes temporal fairness into two complementary sub‑measures: Rotational Score (RS) and Waiting Periods Evaluation (WPE), achieving O(nu+n) time complexity versus the O(nun) of ALT, where nu is the episode count and n the agent count. Empirical evaluation across n in 2,3,5,8,10 reveals three findings. First, both RP and ALT expose a coordination failure invisible to traditional metrics: Q‑learning agents perform worse than random policies by 10‑73% on RP and 7‑35% on CALT, while Reward Fairness remains misleadingly high (above 0.92 for n>=3). Second, RP achieves 12‑25x computational speedup over ALT, growing with n. Third, the two families are complementary: ALT provides richer discrimination for small populations; RP scales reliably where ALT becomes intractable. Together they form a diagnostic toolkit for temporal fair division.

PaperID: 2956, https://arxiv.org/pdf/2605.14876.pdf

Authors: Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

Title: Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Abstract:
Despite rapid advancements, current text‑to‑image (T2I) models predominantly rely on a single‑step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi‑step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post‑hoc reflection, long‑context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed‑Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual‑language logical planning with pixel‑level diffusion generation. CLVR introduces an automated data engine with step‑level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long‑context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ‑Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off‑the‑shelf distillation priors, reducing the per‑step inference cost to just 4 NFEs without requiring expensive re‑distillation. Extensive experiments demonstrate that CLVR outperforms existing open‑source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test‑time scaling capabilities for complex visual generation.

PaperID: 2957, https://arxiv.org/pdf/2605.14866.pdf

Authors: Lingzhe Zhang, Tong Jia, Kangjin Wang, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li

Title: Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

Abstract:
As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)‑based methods have been proposed to address these issues. However, existing LLM‑based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM‑based methods fall short. Motivated by these findings, we introduce RCLAgent, an in‑depth root cause localization framework for microservice systems that realizes multi‑agent recursion‑of‑thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root‑Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state‑of‑the‑art methods in both localization accuracy and inference efficiency.

PaperID: 2958, https://arxiv.org/pdf/2605.14865.pdf

Authors: Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

Title: Holistic Evaluation and Failure Diagnosis of AI Agents

Abstract:
AI agents execute complex multi‑step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process‑level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top‑down agent‑level diagnosis with bottom‑up span‑level evaluation, decomposing analysis into independent per‑span assessments. This decomposition scales to traces of arbitrary length and produces span‑level rationales for each verdict. On the TRAIL benchmark, our framework achieves state‑of‑the‑art results across all metrics on both GAIA and SWE‑Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization‑categorization accuracy. Per‑category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

PaperID: 2959, https://arxiv.org/pdf/2605.14791.pdf

Authors: Licong Xu, Thomas Borrett

Title: Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

Abstract:
Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: \textttCMBEvolve, which targets tasks with explicit quantitative objectives through LLM‑guided code evolution and tree search, and \textttCosmoEvolve, which targets open‑ended scientific workflows through a virtual multi‑agent research laboratory. As preliminary demonstrations, we apply \textttCMBEvolve to out‑of‑distribution detection in weak‑lensing maps, where it iteratively improves the benchmark score through code evolution, and \textttCosmoEvolve to autonomous ACT DR6 data analysis, where it identifies non‑trivial pair‑ and scale‑dependent behaviour and produces analysis‑grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open‑ended research problems for the development of AI scientist systems.

PaperID: 2960, https://arxiv.org/pdf/2605.14771.pdf

Authors: Shaoan Zhao, Huanlin Gao, Qiang Hui, Ting Lu, Xueqiang Guo, Yantao Li, Xinpei Su, Fuyuan Shi, Chao Tan, Fang Zhao, Kai Wang, Shiguo Lian

Title: MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

Abstract:
MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three‑layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high‑quality production workflows. \system abstracts full‑category AIGC capabilities into a unified invocation model, uses plugins to support hot‑pluggable capability expansion, and uses task‑oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade‑offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

PaperID: 2961, https://arxiv.org/pdf/2605.14758.pdf

Authors: Luca Marzari, Enrico Marchesini

Title: Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Abstract:
History‑dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over‑approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (\textttRNN‑ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN‑based policies. \textttRNN‑ProVe uses policy‑driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded‑error, high‑confidence estimates of behavioral violations. Experiments on partially observable single‑agent and cooperative multi‑agent tasks show that \textttRNN‑ProVe yields more quantitative, feasibility‑aware probabilistic guarantees than existing tools, while scaling to recurrent and multi‑agent settings.

PaperID: 2962, https://arxiv.org/pdf/2605.14723.pdf

Authors: Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

Title: Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

Abstract:
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action‑conditioned patient dynamics. We introduce SepsisAgent, a world model‑augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid‑‑vasopressor interventions, and follows a propose‑‑simulate‑‑refine workflow before committing to a prescription. We first show that world‑model access alone yields inconsistent LLM decision performance, motivating agent‑specific training. We then train SepsisAgent through a three‑stage curriculum: patient‑dynamics supervised fine‑tuning, propose‑‑simulate‑‑refine behavior cloning, and world‑model‑based agentic reinforcement learning. On MIMIC‑IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM‑based baselines in off‑policy value while achieving the best safety profile under guideline adherence and unsafe‑action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

PaperID: 2963, https://arxiv.org/pdf/2605.14675.pdf

Authors: Spyridon Alvanakis Apostolou, Jan Bosch, Helena Holmström Olsson

Title: Agentic AI in Industry: Adoption Level and Deployment Barriers

Abstract:
Agentic AI systems are entering software engineering workflows, yet empirical evidence on how industrial organizations actually adopt them remains sparse. We present a qualitative interview study with sixteen practitioners across twelve companies of varying size and domain. This study characterizes the current agentic AI adoption state of these companies, employing a six‑level maturity framework adapted from established AI‑driven organizations. The findings reveal that seven companies operate at Level~1 (AI Assistants), four companies at Level~2 (AI Compensators), and only one in Level~3 (Multi‑Agent Orchestration), with large and safety‑regulated organizations among the most advanced adopters. The primary finding is a capability‑deployment verification gap, four companies demonstrated higher‑level experimental AI capabilities but cannot integrate them into production workflows because adequate output verification mechanisms are absent, leaving human‑in‑the‑loop as the only trusted verification mechanism. This gap is shaped by four recurring barriers: context window of LLMs constraints especially when diverse knowledge aggregation is needed, under‑performance on proprietary programming languages and protocols, non‑determinism incompatible with qualification standards, and data confidentiality concerns. Two interdependent dimensions of this gap emerge from these findings (information asymmetry and qualification absence) framing a core open problem for industrial agentic integration.

PaperID: 2964, https://arxiv.org/pdf/2605.14671.pdf

Authors: Matteo Cobelli, Stefano Sanvito

Title: Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications

Abstract:
Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition‑based materials‑property prediction to test whether such agents can perform a task beyond model selection and hyperparameter optimization: the design of input descriptors. We introduce Automat, an autoresearch framework where a coding agent based on a large language model generates composition‑only descriptors for chemical compounds and evaluates them using a random forest workflow. The agent is restricted to information derivable from chemical formulas and iteratively proposes, implements, and tests chemically motivated descriptor strategies. We apply Automat, with OpenAI Codex using GPT‑5.5 as the coding agent, to the prediction of experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds. In both tasks, Automat improves over fractional‑composition, Magpie, and combined fractional‑composition/Magpie baselines, while producing descriptor families that are chemically interpretable. These results provide a demonstration that autoresearch agents can generate competitive, task‑specific materials descriptors without manual feature engineering during the run. They also reveal current limitations, including descriptor redundancy, sensitivity to greedy feature expansion, and the need for explicit complexity control, descriptor pruning, and more sophisticated search strategies.

PaperID: 2965, https://arxiv.org/pdf/2605.14625.pdf

Authors: Zhouxiang Zhao, Jiaxiang Wang, Yahao Ding, Yinchao Yang, Zhaohui Yang, Mohammad Shikh-Bahaei, Julie A. McCann, Zhaoyang Zhang, Kaibin Huang

Title: Digital Twin Synchronization Over Mobile Embodied AI Network With Agentic Intelligence

Abstract:
Efficient digital twin (DT) synchronization relies on maintaining high‑fidelity virtual representations with minimal age of information (AoI). However, the synergistic potential of cooperative sensing and autonomous mobility of the sensing agent remains underexplored in existing DT synchronization frameworks. In this paper, we propose an agentic AI‑empowered mobile embodied AI network (MEAN) framework for DT synchronization. In the proposed hybrid architecture, the base station (BS) conducts global orchestration, while the agents autonomously execute a five‑stage closed‑loop workflow: move‑to‑sense, cooperative sensing, onboard semantic processing, channel‑aware mobility, and uplink transmission. To optimize synchronization performance, we formulate a joint topology dispatching and multidimensional resource allocation problem aimed at minimizing the maximum twin deviation across regions, subject to heterogeneous sensing fidelity and energy budget constraints. To tackle this, we develop a hierarchical two‑layer optimization algorithm, where the outer‑layer refines multi‑agent assignment via a dynamic matching game, and the inner‑layer iteratively optimizes the continuous resources. Extensive simulation results verify the convergence of the proposed algorithm and demonstrate its substantial superiority over multiple baseline schemes in reducing synchronization deviation. Furthermore, the results reveal that semantic compression serves as a vital substitute for channel resources in latency reduction under constrained bandwidth, while autonomous velocity adaptation provides an essential degree of freedom for the system to navigate the fundamental energy‑time trade‑off.

PaperID: 2966, https://arxiv.org/pdf/2605.14537.pdf

Authors: Robert Müller, Clemens Müller

Title: Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Abstract:
We introduce \textscCattle Trade, a multi‑agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden‑offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long‑horizon game lasting 50‑‑60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textscCattle Trade evaluates whether agents integrate them across a competitive, multi‑agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost‑efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase‑adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self‑bidding, bankrupt TC initiation, and weak opponent‑state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi‑agent environments with conflicting incentives, uncertainty, and economic dynamics.

PaperID: 2967, https://arxiv.org/pdf/2605.14527.pdf

Authors: Wenwen Li, Yuki Orimo, Nontawat Charoenphakdee

Title: Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows

Abstract:
Developing machine learning interatomic potentials (MLIPs) for complex materials systems remains challenging because it requires expertise in atomistic simulations, machine learning, and workflow design, as well as iterative active learning procedures. Existing automated pipelines typically assume a fixed sequence of stages or depend on domain experts, which limits their adaptability to heterogeneous materials systems where the optimal curriculum is not known in advance. To lower the barrier to developing MLIPs for non‑experts, we propose Lang2MLIP, a multi‑agent framework that takes natural‑language input and formulates end‑to‑end MLIP development as a sequential decision‑making problem solved by large language models (LLMs). At each step, a decision‑making agent observes the current dataset, model, evaluation results, and execution log, and then automatically selects an appropriate action to improve the model. This removes the need for a predefined pipeline and enables the agent to self‑correct by revisiting earlier subsystems when new failures arise. We evaluate this approach on a solid electrolyte interphase (SEI) system with multiple components and interfaces. These results suggest that LLM‑based multi‑agent systems are a promising direction for automating MLIP development and making it more accessible to non‑experts.

PaperID: 2968, https://arxiv.org/pdf/2605.14504.pdf

Authors: Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu

Title: When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Abstract:
Long‑horizon household tasks demand robust high‑level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short‑horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning‑level autonomy in long‑horizon household tasks specified through free‑form instructions. By abstracting away embodiment‑specific low‑level control, LongAct isolates high‑level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM‑driven agent with a DAG‑based long‑horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT‑5 and Qwen3‑VL models show that HoloMind substantially improves long‑horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full‑task success, underscoring the difficulty of LongAct and the need for stronger long‑horizon planning in embodied agents.

PaperID: 2969, https://arxiv.org/pdf/2605.14498.pdf

Authors: Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

Title: GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Abstract:
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long‑running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single‑user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one‑on‑one chats, (ii) speaker‑grounded belief tracking, where the per‑user memory modeling is needed, and (iii) audience‑adapted language, where Theory‑of‑Mind shifts produce role‑specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph‑grounded synthesis pipeline produces multi‑party conversations with controllable reply structure and conditions each message on per‑user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi‑hop reasoning, knowledge update, term ambiguity, user‑implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi‑user memory far from solved.

PaperID: 2970, https://arxiv.org/pdf/2605.14483.pdf

Authors: Xudong Chen, Yixin Liu, Hua Wei, Kaize Ding

Title: LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

Abstract:
Large language models (LLMs) have become a strong foundation for multi‑agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution‑level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi‑agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM‑based orchestrator that generates an executable orchestration specification. The specification integrates task‑specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration‑level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state‑of‑the‑art performance among the evaluated multi‑agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON‑B23C.

PaperID: 2971, https://arxiv.org/pdf/2605.14460.pdf

Authors: Xinyu Liu, Yukai Zhao, Xing Hu, Xin Xia

Title: Exploiting LLM Agent Supply Chains via Payload-less Skills

Abstract:
Autonomous agents powered by Large Language Models (LLMs) acquire external functionalities through third‑party skills available in open marketplaces. Adopting these integrations broadens the potential attack surface, prompting a need for systematic security evaluation. Current auditing mechanisms are effective at identifying explicit code payloads and predefined threat contents through security scanning. These detection mechanisms are bypassed if malicious behaviors lack direct injection and are instead synthesized dynamically at runtime through the agent's inherent generative capabilities. Exploring this blind spot, we introduce Semantic Compliance Hijacking (SCH), a payload‑less supply chain attack targeting autonomous coding environments. The SCH approach translates malicious goals into unstructured natural language instructions formatted as necessary compliance rules, leading the agent to generate and execute unauthorized code. To assess the real‑world viability of this attack, we developed an automated pipeline to evaluate its effectiveness across a test matrix comprising three mainstream agent frameworks and three distinct foundation models using contextualized scenarios. The findings demonstrate the pervasive nature of this threat, with SCH achieving peak success rates of up to 77.67% for confidentiality breaches and 67.33% for Remote Code Execution (RCE) under the most vulnerable configurations. Furthermore, the introduction of Multi‑Skill Automated Optimization (MS‑AO) further boosted attack efficacy. By omitting recognizable Abstract Syntax Tree (AST) signatures and explicit harmful intents, the manipulated skill files maintained a 0.00% detection rate, evading current scanning tools. This research highlights an underexplored attack surface within agent supply chains, pointing to a necessary transition from signature‑based detection models toward semantic intent validation.

PaperID: 2972, https://arxiv.org/pdf/2605.14457.pdf

Authors: Bin Lei, Caiwen Ding, Jiachen Yang, Ang Li, Xin Eric Wang

Title: Stateful Reasoning via Insight Replay

Abstract:
Chain‑of‑Thought (CoT) reasoning has become a foundation for eliciting multi‑step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose InsightReplay, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a \mathbf2\!×\!\mathbf3\!×\!\mathbf4 benchmark grid, covering model scales \\text8B, \text30B\, model families \\textQwen3.5, \textDeepSeek‑R1‑Distill‑Qwen, \textGemma‑4\, and reasoning benchmarks \\textAIME, \textHMMT, \textGPQA Diamond, \textLiveCodeBench v5\, show that 3‑round InsightReplay yields accuracy gains across all 24 settings, with an averaged improvement of \mathbf+1.65 points over standard CoT, and a largest single‑setting gain of \mathbf+9.2 points on R1‑Distill‑32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test‑time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.

PaperID: 2973, https://arxiv.org/pdf/2605.14443.pdf

Authors: Krishna Sayana, Ketan Todi, Ambarish Jash

Title: Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Abstract:
The shift toward interacting with frozen, "black‑box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task‑specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single‑shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau‑bench suites, covering a diverse range of multi‑step reasoning and tool‑use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic‑intensive reasoning and 74% to 91% in tool‑use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state‑of‑the‑art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.

PaperID: 2974, https://arxiv.org/pdf/2605.14442.pdf

Authors: Hanbo Huang, Xuan Gong, Jing Wang, Lei Bai, Xiang Xiao, Weishu Zhao, Shiyu Liang

Title: GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

Abstract:
Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype‑to‑physiology gap largely unbridged. We formulate microbial life‑boundary prediction as a unified genome‑to‑physiology task and address it with a genome‑conditioned, tool‑augmented LLM agent. To support this task, we curate a strain‑centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity‑based RAG module and a Genome‑scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three‑stage pipeline of gene‑text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene‑grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct‑token generation relative to a zero‑gene ablation. The resulting 4B‑parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome‑token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.

PaperID: 2975, https://arxiv.org/pdf/2605.14431.pdf

Authors: Yunlong Lyu, Peng Chen, Fengyi Wu, Junzhe Yu, Kit Long Hon, Hao Chen

Title: FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing

Abstract:
Library fuzzing is essential for hardening the software supply chain, but adopting it at scale remains expensive. Practitioners still spend substantial effort on environment setup, struggle to generate harnesses that respect intricate API constraints, and lack reliable means to tell genuine library bugs from harness‑induced crashes. Recent LLM‑based systems automate parts of this pipeline, yet they typically operate as one‑shot code generators that ignore runtime feedback, which limits both the depth of code they reach and the validity of the bugs they report. We argue that effective library fuzzing is iterative by nature: each campaign exposes new coverage bottlenecks and crashes, and the next campaign should evolve from these signals rather than restart from scratch. Building on this insight, we present FuzzAgent, a multi‑agent system that turns library fuzzing into an evolutionary process, in which a team of specialized agents collaborates over the full fuzzing lifecycle and grounds every decision in concrete runtime evidence, so that the harness suite is successively refined toward deeper coverage and higher‑fidelity crash analysis across rounds. We evaluate FuzzAgent on 20 real‑world C/C++ libraries against four state‑of‑the‑art baselines (OSS‑Fuzz, OSS‑Fuzz‑Gen, PromptFuzz, and PromeFuzz). FuzzAgent completes the full fuzzing lifecycle for all 20 libraries without human intervention and reaches 179619 branches, exceeding OSS‑Fuzz, PromptFuzz, PromeFuzz, and OSS‑Fuzz‑Gen by 45.1%, 73.2%, 92.1%, and 191.2%, respectively. FuzzAgent also identifies 102 genuine library bugs, 78 of which have already been acknowledged and fixed by upstream maintainers.

PaperID: 2976, https://arxiv.org/pdf/2605.14421.pdf

Authors: Ciyan Ouyang, Rui Hou

Title: MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

Abstract:
We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM‑mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re‑enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain‑of‑custody problem rather than a filtering problem. It is a six‑module design around an RFC‑6962 Merkle log over per‑principal Ed25519‑signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max‑of‑strong‑edges propagation rule makes Untrusted‑Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive‑action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory‑poisoning workloads on a deterministic mechanism‑isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub‑millisecond per‑operation overhead keeps it well below the noise floor of any LLM call. A Codex‑backed AgentDojo bridge further separates strong‑model behavior from defense‑layer behavior: under an intentionally vulnerable tool‑output profile, no‑defense and signature‑only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte‑equal CI‑verified; hosted‑model AgentDojo and live‑model sweeps are recorded as auditable logs rather than byte‑pinned artifacts.

PaperID: 2977, https://arxiv.org/pdf/2605.14401.pdf

Authors: Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin, Lei Huang, Qianqian Zhong, Lizhu Zhang, Benyu Zhang, Xiangjun Fan, Hong Yan

Title: Agentic Recommender System with Hierarchical Belief-State Memory

Abstract:
Memory‑augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory‑Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine‑grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations ‑‑ extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis ‑‑ is adaptively scheduled by an LLM‑based planner rather than fixed‑interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state‑of‑the‑art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.

PaperID: 2978, https://arxiv.org/pdf/2605.14389.pdf

Authors: Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty, Chun-Liang Li, Rui Zhang, Jinsung Yoon, Tomas Pfister

Title: Nexus : An Agentic Framework for Time Series Forecasting

Abstract:
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware to real‑world textual signals. Conversely, while LLMs are emerging as zero‑shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi‑agent forecasting framework that decomposes prediction into specialized stages: isolating macro‑level and micro‑level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event‑driven information without relying on external statistical anchors or monolithic prompting. We show that current‑generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state‑of‑the‑art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high‑quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real‑world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.

PaperID: 2979, https://arxiv.org/pdf/2605.14379.pdf

Authors: JB Lanier, Nathan Monette, Pierre Baldi, Roy Fox

Title: Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

Abstract:
Finding approximate equilibria for large‑scale imperfect‑information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi‑agent starting‑state sampling strategy designed to substantially accelerate online exploration in regularized policy‑gradient game methods for two‑player zero‑sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high‑level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data‑Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long‑horizon control variants of two‑player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi‑task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.

PaperID: 2980, https://arxiv.org/pdf/2605.14330.pdf

Authors: Yimeng Sun, Haiyang Xin, Qiannan Niu, Shuang Li, Lingyun Huang, Gaowei Chen

Title: Computational Thinking Development in AI Agent Creation_A Mixed-Methods Study

Abstract:
This mixed‑methods study examined computational thinking (CT) development among 93 pre‑high school students in a five‑day AI agent creation workshop using CocoFlow, a no‑code platform. Integrating pre‑post assessments, behavioral logs, and interviews, we investigated CT development and how initial CT levels shape learning trajectories. Results revealed significant improvements in abstract thinking (effect size d = 0.71) and algorithmic thinking (effect size d = 0.70). Hierarchical regression identified iterative testing engagement as a predictor of self‑efficacy gains (beta = 0.20, p = 0.05). Notably, students with moderate initial CT levels demonstrated substantially greater gains than both high‑CT and low‑CT peers, revealing an Optimal Development Zone effect (eta squared = 0.55). Qualitative analysis showed moderate‑CT students exhibited adaptive expertise, while high‑CT students risked over‑engineering and low‑CT students struggled with task decomposition. These findings challenge linear learning assumptions and provide evidence for differentiated scaffolding in CT education.

PaperID: 2981, https://arxiv.org/pdf/2605.14312.pdf

Authors: Rayfran Rocha Lima, Davi G. Assunção Pinheiro, Thiago Medeiros de Menezes

Title: Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System

Abstract:
The growing adoption of AI agents and the Model Context Protocol (MCP) has motivated organizations to expose existing REST APIs as agent‑consumable tools. In our industrial context, this initiative targeted an ecosystem of 16 production APIs comprising approximately 600 endpoints. Although these APIs were stable and widely used within a microservice architecture, early proof‑of‑concept experiments revealed systematic failures in task planning, tool selection, and payload construction when accessed through MCP‑based agents. Rather than attributing these failures to model limitations alone, we conducted an ecosystem‑scale empirical assessment of the underlying OpenAPI documentation. We developed Hermes, a multi‑agent LLM‑based system that detects documentation and REST‑related smells at the endpoint level and generates explainable diagnostic reports. The large‑scale evaluation identified 2,450 smells across 600 endpoints, with deficiencies present in all analyzed operations. Practitioner validation confirmed high agreement with the detected issues while also revealing contextual trade‑offs in remediation decisions. The findings suggested that structural validity within microservice environments does not guarantee semantic readiness for agent‑based consumption. Based on this evidence, the organization revised its adoption strategy, prioritizing selective endpoint adaptation, redefining documentation standards, and integrating automated documentation assessment into API governance workflows. This case illustrates how systematic artifact‑level evaluation can function as a strategic decision‑support mechanism, reducing technological risk and guiding evidence‑based AI adoption in industrial software ecosystems.

PaperID: 2982, https://arxiv.org/pdf/2605.14290.pdf

Authors: Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner

Title: Web Agents Should Adopt the Plan-Then-Execute Paradigm

Abstract:
ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan‑then‑execute: commit to a task‑specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e‑commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan‑then‑execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan‑then‑execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan‑then‑execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page‑dependent meanings. Planning at this layer is near‑sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task‑level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.

PaperID: 2983, https://arxiv.org/pdf/2605.14271.pdf

Authors: Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang

Title: Auditing Agent Harness Safety

Abstract:
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output‑level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid‑trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information‑flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi‑agent harnesses where these risks are most pronounced. We further introduce HarnessAudit‑Bench, a benchmark of 210 tasks across eight real‑world domains, instantiated in both single‑agent and multi‑agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi‑agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter‑agent information transfer; and (iv) multi‑agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

PaperID: 2984, https://arxiv.org/pdf/2605.14266.pdf

Authors: Vidya K Sudarshan, Anushka Sisodia, Reshma A Ramachandra, Sia Batra, Josephine Chong Leng Leng

Title: Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

Abstract:
Integration of artificial intelligent (AI) agents in higher education is transforming teaching, learning and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of integrated eco‑system‑level agentic multi‑agent AI platform capable of coordinated planning, reasoning, and adaptive decision‑making across multiple educational functions. This paper presents a forward‑looking perspective on agentic multi‑agent AI platform in higher education, consisting interconnected autonomous, goal driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can they collectively support seamless coordinated operations across teaching, learning and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task‑specific fragmented AI tools, the transition from single‑agent to multi‑agent systems, limited cross‑functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of holistic, learner‑centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for scalable human‑aligned, and inclusive agentic AI platform. The significant contribution is the incorporation of inclusive learning perspectives, highlighting how coordinated agentic multi‑agent platform can support diverse learners through adaptive, multimodal interventions.

PaperID: 2985, https://arxiv.org/pdf/2605.14241.pdf

Authors: Kexin Chu, Dawei Xiang, Wei Zhang

Title: Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

Abstract:
Tool‑augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web‑search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider‑routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM‑ContextRoute, a contextual bandit router for same‑function tool providers. Its key design is latency‑quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity‑aware score with query‑specific quality estimation and LLM‑as‑judge feedback, allowing it to adapt online to both load changes and provider‑quality differences. On the main web‑search load benchmark, LQM‑ContextRoute improves F1 by +2.18 pp over SW‑UCB while staying on the latency‑quality frontier. In a high‑heterogeneity StrategyQA setting, LQM‑ContextRoute avoids additive‑reward collapse and improves accuracy by up to +18 pp over SW‑UCB; on heterogeneous retriever pools, it improves NDCG by +2.91‑‑+3.22 pp over SW‑UCB. These results show that same‑function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider‑quality heterogeneity coexist.

PaperID: 2986, https://arxiv.org/pdf/2605.14235.pdf

Authors: Simranjeet Singh Dahia, Claudia Szabo

Title: Quantum Advantage in Multi Agent Reinforcement Learning

Abstract:
We present an empirical evaluation of quantum entanglement in agent coordination within quantum multi agent reinforcement learning (QMARL). While QMARL has attracted growing interest recently, most prior work evaluates quantum policies without provable baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. We address this directly by evaluating a decentralized QMARL framework with variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, which has a mathematically proven classical performance ceiling of 0.75 win rate, we show that entangled QMARL agents approach the Tsirelson limit of 0.854, providing clear evidence of their quantum advantage. We show that unentangled quantum circuits match the classical baseline, confirming that entanglement and not the quantum circuit itself is the active coordination mechanism. We also explore the effect of specific entanglement structures, as some Bell states enable coordination gains while others actively harm performance. On cooperative navigation (CoopNav), QMARL without entanglement achieves ～2× improvement in success rate over classical MAA2C (～0.85 versus ～0.40), with the hybrid configuration, quantum actor paired with a classical centralised critic, outperforming both fully classical and fully quantum solutions. We present our experimental analysis and discuss future work.

PaperID: 2987, https://arxiv.org/pdf/2605.14212.pdf

Authors: Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang

Title: MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Abstract:
Automatic multi‑agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training‑free test‑time search or optimize the meta‑level designer while keeping downstream execution agents frozen, which creating a frozen‑executor ceiling and leaving the end‑to‑end training of self‑designing and self‑executing agentic models unexplored. To address this, we introduce MetaAgent‑X, an end‑to‑end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent‑X enables script‑based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co‑evolution to improve training stability and expose the dynamics of designer‑executor co‑evolution. MetaAgent‑X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co‑evolution process. These results establish end‑to‑end trainable automatic MAS as a practical paradigm for building self‑designing and self‑executing agentic models.

PaperID: 2988, https://arxiv.org/pdf/2605.14205.pdf

Authors: Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

Title: SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

Abstract:
LLM‑based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand‑crafted prompt‑based personas that are brittle, difficult to scale, context‑inefficient, and unable to faithfully represent population‑level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM‑based web agents as compact persona tokens. Given raw clickstreams, a behavior‑aware VQ‑VAE induces a discrete buyer‑type space that captures the statistical structure of real buyer behavior and merchant‑specific buyer population distributions. To provide behavior‑specific guidance to LLM‑based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine‑tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store‑specific prompt engineering. For population‑level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ‑VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant‑specific buyer population distributions. Evaluated on 8.37M buyers across 42 held‑out live storefronts, SimPersona achieves 78% conversion‑rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8× more parameters on goal‑oriented shopping tasks. We further release an open‑source data pipeline that converts raw e‑commerce event logs into buyer representations and agent‑training traces.

PaperID: 2989, https://arxiv.org/pdf/2605.14201.pdf

Authors: Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

Title: MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Abstract:
Vision‑language‑action (VLA) models are effective as end‑to‑end motion planners, but can be brittle when evaluated in closed‑loop settings due to being trained under traditional imitation learning framework. Existing closed‑loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi‑agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi‑step horizons, while being reactive to other agents in the scene, enabling closed‑loop training. MAPLE consists of two training stages: (1) supervised fine‑tuning on the latent rollouts based on ground‑truth trajectories, followed by (2) reinforcement learning with global and agent specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed‑loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real‑world. MAPLE achieves state‑of‑the‑art driving performance on Bench2Drive and demonstrates scalable, closed‑loop multi‑agent play for robust E2E autonomous driving systems.

PaperID: 2990, https://arxiv.org/pdf/2605.14175.pdf

Authors: Qisong He, Yi Dong, Xiaowei Huang

Title: Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

Abstract:
In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context‑manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per‑turn cost and a formal conflict‑free guarantee. On LongMemEval‑KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM‑only baseline (+1.3pp) and 87.2% for a transcript‑RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval‑augmented baselines. Beyond external benchmarks, we construct two multi‑agent scenarios and a 50‑item grounding test: on the 15‑item stale‑premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness‑faithfulness decomposition: the structural check is sound by construction, and per‑deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history‑replay grows linearly with conversation length.

PaperID: 2991, https://arxiv.org/pdf/2605.14141.pdf

Authors: Saharsh Koganti, Priyadarsi Mishra, Pierfrancesco Beneventano, Tomer Galanti

Title: Distribution-Aware Algorithm Design with LLM Agents

Abstract:
We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a \emphsolver hint: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample‑consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on $21$ structured combinatorial‑optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality $0.971$, improve by $+0.224$ over the average heuristic pool and by $+0.098$ over the highest‑quality heuristic, and are $336.9×$, $342.8×$, and $16.1×$ faster than the quality‑best heuristic, Gurobi, and the selected time‑limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all $100$ graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general‑purpose optimization with compiled distribution‑specific computation.

PaperID: 2992, https://arxiv.org/pdf/2605.14133.pdf

Authors: Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Title: ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Abstract:
Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand‑authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre‑existing partial, stale, or conflicting artifacts. We present ClawForge, a generator‑backed benchmark framework for executable command‑line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge‑Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong‑state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial‑credit and step‑efficiency analyses further reveal that many failures are near‑miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

PaperID: 2993, https://arxiv.org/pdf/2605.14126.pdf

Authors: Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

Title: Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

Abstract:
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi‑step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool‑augmented LLM agents (retrieval, code execution, multi‑turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR‑AgentBench, a benchmark for realistic question answering over real‑world hospital data, and frame reasoning on FHIR as a sequential decision‑making problem over a queryable structured graph. We implement a multi‑turn CodeAct agent and post‑train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution‑grounded rewards. Compared to prompt‑based, closed‑model baselines, RL post‑training improves performance while enforcing data‑integrity constraints. Empirically, our approach improves answer correctness from 50% (o4‑mini) to 77% on FHIR‑AgentBench using a smaller and cheaper Qwen3‑8B model. We present an end‑to‑end post‑training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi‑turn reasoning over structured clinical graphs.

PaperID: 2994, https://arxiv.org/pdf/2605.14121.pdf

Authors: Mohammadreza Barzegaran, Kemeng Han, Hamid Jafarkhani

Title: An Encoded Corrective Double Deep Q-Networks for Multi-Agent Control Systems

Abstract:
This paper studies the synthesis of control policies for heterogeneous and interconnected multi‑agent systems that collaborate through data exchange over a communication network to minimize a collective cost. We propose a distributed encoded corrective double actor‑critic framework that integrates a novel message‑passing mechanism. Existing methods assume noise‑free and delay‑free access to the global or partial states and overlook the fact that the global states, though noisy and delayed, can be progressively reconstructed and refined over time. In contrast, this work explicitly models communication sampling asynchrony, delay, and link noise based on the network configuration. The proposed message‑passing mechanism characterizes timing and information flow to refine and time shift global state information, which is then used to incrementally correct the Q‑networks. The double Q‑network design mitigates overestimation bias, while the shared encoder coupling the actor‑critic networks captures inter‑agent dependencies. We evaluate our approach in multiple test cases, demonstrate its effectiveness over various baselines, and provide a numerical regret analysis.

PaperID: 2995, https://arxiv.org/pdf/2605.14119.pdf

Authors: Rotem Lev Lehman, Roni Stern, Guy Shani

Title: Privacy Preserving Multi Agent Path Finding

Abstract:
In the multi‑agent path finding (MAPF) problem, a group of agents search in a graph for a path for each agent where no two paths collide. While in all applications of MAPF the agents must not collide with each other, in some of them the agents may not wish to share their paths due to privacy constraints. In this work, we formulate two types of privacy constraints for MAPF and propose algorithms that preserve them. The first type of privacy we consider is planning‑level privacy, which means that during planning, the agents cannot identify exactly the planned location of the other agents. We propose a general framework for obtaining planning‑level privacy, which works by adding mock agents to the planning process. The second type of privacy we consider is execution‑level privacy, which is relevant when agents have limited sensing capabilities. Execution‑level privacy is preserved if none of the agents is allowed to sense the location of the other agents during execution. We show how to adapt two popular MAPF algorithms, namely PIBT and LaCAM, such that they preserve execution‑level privacy. Lastly, we propose a post‑processing technique that allows the agents to reduce the sum of costs of the returned solution without losing any privacy. We also implemented our algorithms and evaluated them empirically, showing that the proposed post‑processing technique indeed improved cost significantly.

PaperID: 2996, https://arxiv.org/pdf/2605.14102.pdf

Authors: Tarun Mittal

Title: ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

Abstract:
Autonomous language‑model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool‑augmented autonomous reasoning framework built around planner‑directed execution, specialized tool use, and telemetry‑driven evaluation. We analyze ChromaFlow on GAIA 2023 Level‑1 validation tasks under clean evaluation constraints. A frozen full Level‑1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool‑failure mentions, token‑log calls, and campaign‑log cost estimates. Two randomized 20‑task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full‑set performance and increased operational noise. A later strict‑provider full‑Level‑1 diagnostic reached 30/53, or 56.60%, under explicit integrity controls, but at substantially higher token‑log cost. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider‑health gates, and explicit run gates should be treated as first‑order requirements for reliable autonomous agent evaluation.

PaperID: 2997, https://arxiv.org/pdf/2605.14085.pdf

Authors: Xubin Fang, Brian M. Sadler, Rick S. Blum

Title: Receding Horizon Multi-Agent Deceptive Path Planner

Abstract:
Deceptive path planning enables autonomous agents to obscure their true goals from observers by deviating from an expected optimal path. Prior work largely solves full‑horizon, end‑to‑end optimization for single agents, which is expensive to recompute online and difficult to scale or adapt en route. We propose a unified framework for deceptive path planning using a Boltzmann distribution, computing over short‑horizon candidate trajectories within a receding‑horizon loop. By param‑ By iterating a user‑defined cost that captures deception, resources, and smoothness, and optionally includes coupling terms between agents, the framework yields stochastic policies that balance the tradeoff between optimal paths and deceptive deviation. Policies are updated locally and do not require training. The level of deception and adherence to constraints can be dynamically tuned, enabling online adaptation to changes in goals and constraints such as obstacles. This step‑by‑step tuning opens the door to new forms of dynamic deception. Simulation studies demonstrate the flexibility of our approach, maintaining deception while adapting to environmental and constraint updates, avoiding the recomputation required by full‑horizon methods, and supporting intuitive tuning via a small set of parameters

PaperID: 2998, https://arxiv.org/pdf/2605.14051.pdf

Authors: Yusuke Ozaki, Dhaval Patel

Title: SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

Abstract:
Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose \textttSPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix based execution control. \textttSPIN enforces a strict DAG contract through \texttt\_validate\_plan\_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, \textttSPIN reduces executed tasks from 1061 to 623 and improves \emphAccomplished from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency related scores for both GPT OSS1 and Llama 4 Maverick.

PaperID: 2999, https://arxiv.org/pdf/2605.14002.pdf

Authors: Yifei Zhu

Title: PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

Abstract:
Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long context question answering into open‑ended exploration. Yet real world use requires models to discover and synthesize "long‑tail" facts from dispersed sources, a capability that remains under‑evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10000 political facts. We standardize evaluation with an optimized multi agent system and propose FactNet, an evidence conditional protocol that scores discovery, fine‑grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine‑grained details, and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short‑context extraction, multilingual robustness, and reliable tool use.

PaperID: 3000, https://arxiv.org/pdf/2605.13909.pdf

Authors: Erica Zhang, Fangzhao Zhang, Aneesh Pappu, Batu El, Jose Blanchet, Susan Athey, Jiashuo Liu, James Zou

Title: TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

Abstract:
Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi‑turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM‑vs.‑LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms‑Bench, short for Testbed for Economic Reasoning in Multi‑turn Strategy, a Bayesian‑game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black‑box opponent into a diagnostic instrument, enabling agent‑attributable failure analysis and oracle‑reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms‑Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent‑specific bargaining bottlenecks masked by prior benchmarks.

PaperID: 3001, https://arxiv.org/pdf/2605.13906.pdf

Authors: Yimeng Sun, Haiyang Xin, Shuang Li, Qiannan Niu, Ching Sing Chai, Lingyun Huang, Gaowei Chen

Title: Modeling AI-TPACK in Practice Insights from Teachers Multi-Agent Workflow Design

Abstract:
This study investigates teachers design behaviors and cognitive underpinnings when designing multi‑agent instructional workflows. Analyzing behavioral logs (N=61), cluster and Markov analyses identified three archetypes: Systematic Optimizers iteratively refining complex architectures; Prolific Creators rapidly prototyping pragmatic tools via scaffolding; and Passive Observers exhibiting polarized expert‑novice profiles. Subsequent artifact (n=15) and interview (n=12) analyses reveal AI‑TPACK integration emerges from a dynamic interplay of systems thinking, pedagogical beliefs, and self‑efficacy, not merely from the possession of discrete knowledge. These findings call for differentiated scaffolding responsive to teachers cognitive‑behavioral diversity.

PaperID: 3002, https://arxiv.org/pdf/2605.13900.pdf

Authors: Angel Wang, Dominique Perrault-Joncas, Alvaro Maggiar, Carson Eisenach, Dean Foster

Title: Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems

Abstract:
In large‑scale multi‑agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans ‑‑ assessing feasibility, aggregate response, and marginal cost ‑‑ before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost‑to‑utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose \emphpopulation‑aware coordination interfaces: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response‑relevant population structure, these maps remain reliable across evolving populations without per‑cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply‑chain capacity‑control case study, population‑aware interfaces reduce forecast error by 16‑‑19% and capacity violations by 20‑‑51% relative to population‑unaware baselines under composition shift; 20K‑agent cohorts support accurate coordination of 500K‑agent populations; and simulator‑trained primal maps achieve 11.1% MAPE on real observations versus 13‑‑24% for baselines.

PaperID: 3003, https://arxiv.org/pdf/2605.13860.pdf

Authors: Sushant Gautam, Annika W. Olstad, Klas H. Pettersen, Michael A. Riegler

Title: The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity

Abstract:
Moltbook is a social media platform in which posts and comments are authored exclusively by autonomous AI agents. We present the Moltbook Observatory Archive, an incremental dataset that passively records agent profiles, posts, comments, community metadata (``submolts''), platform‑level time‑series snapshots, and word‑frequency trend aggregates obtained by continuously polling the Moltbook API. Data are stored in a live SQLite observatory database and exported as date‑partitioned Parquet files to enable efficient analysis and reproducible research. The documented release covers 78~days of platform activity (2026‑01‑27 to 2026‑04‑14) and contains 2,615,098~posts and 1,213,007~comments from 175,886~unique posting agents across 6,730~communities. This is, to our knowledge, the first large‑scale observational dataset of a social network populated exclusively by autonomous AI agents. The archive is intended to support research on multi‑agent communication, emergent social behavior, and safety‑relevant phenomena in agent‑only online environments, and it is released under the MIT license with code for collection and export.

PaperID: 3004, https://arxiv.org/pdf/2605.13851.pdf

Authors: Hiroki Fukui

Title: Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Abstract:
Multi‑agent orchestration ‑‑ in which a hidden coordinator manages specialized worker agents ‑‑ is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges' g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech ‑‑ a reversal of the talk‑dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal‑state distortion was entirely invisible to output‑based evaluation. Fifth, Llama 3.3 70B pilot data showed reading‑fidelity collapse in multi‑agent context (ETR_any: 89% to 11% across three rounds), demonstrating model‑dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = ‑1.02) and other‑recognition (d = ‑1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi‑agent system safety, and that behavior‑based evaluation alone is insufficient to detect the internal‑state risks documented here.

PaperID: 3005, https://arxiv.org/pdf/2605.13850.pdf

Authors: Jia Huang, Joey Tianyi Zhou

Title: A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Abstract:
Existing frameworks for LLM‑based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology ‑‑ how data flows ‑‑ while cognitive science surveys focus on cognitive function ‑‑ what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator‑Workers topology can implement Plan‑and‑Execute, Hierarchical Delegation, or Adversarial Verification ‑‑ three patterns with fundamentally different failure modes and design trade‑offs. We propose a two‑dimensional classification that combines (1) a Cognitive Function axis with seven categories (Perception, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 28 named patterns, 15 with original names. We demonstrate orthogonality through systematic cross‑axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real‑world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross‑domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework‑neutral, and model‑agnostic vocabulary for AI agent architecture design.

PaperID: 3006, https://arxiv.org/pdf/2605.13839.pdf

Authors: Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang, Yuzhang Shang

Title: Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Abstract:
Multi‑agent LLM systems usually collaborate by exchanging natural‑language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated‑token cost, prefill overhead, and KV‑cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver‑specific weight perturbation. We introduce TFlow (Thought Flow), a weight‑space communication framework for a known and fixed receiver architecture. For each query, frozen role‑prompted sender agents process the input, and a learned parameter generator maps their internal activations into low‑rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance‑level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3‑4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text‑based three‑agent baseline, it reduces total processed tokens by up to 83.27% and the wall‑clock inference time by up to 4.6×, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low‑rank weight perturbations can serve as an executable communication medium for efficient multi‑agent LLM collaboration.

PaperID: 3007, https://arxiv.org/pdf/2605.13772.pdf

Authors: Tyler Alvarez, Ali Baheri

Title: Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Abstract:
Large language models hallucinate during multi‑step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden‑state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label‑conditioned teacher that builds a trace‑specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference‑time labels. We prove that contrastive PCA is the optimal projection for a transport‑separation objective between first error and correct states, and that single‑pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy‑based, probing‑based, and attention‑based baselines in‑domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step‑level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.

PaperID: 3008, https://arxiv.org/pdf/2605.13725.pdf

Authors: Yitian Yang, Yiqun Duan, Linghan Huang, Yiqi Zhu, Francesco Bailo, Chunmeizi Su, Huaming Chen

Title: ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Abstract:
Large language model (LLM)‑based multi‑agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM‑based agent reasoning. ScioMind integrates three key components: 1) a memory‑anchored belief update rule that modulates susceptibility to influence via personality‑conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience‑driven belief formation; and 3) dynamic agent profiles derived from a corpus‑grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real‑world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution to LLM‑based social simulation that improves both stable and behavioural realism

PaperID: 3009, https://arxiv.org/pdf/2605.13554.pdf

Authors: Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

Title: Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Abstract:
Contrastive reinforcement learning (CRL) learns goal‑conditioned Q‑values through a contrastive objective over state‑action and goal representations, removing the need for hand‑crafted reward functions. Despite impressive success in achieving viable self‑supervised learning in RL, all existing CRL algorithms rely on off‑policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on‑policy training pipelines adopted across both single‑agent and multi‑agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on‑policy contrastive RL algorithm that derives policy advantages directly from contrastive Q‑values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single‑agent and cooperative multi‑agent tasks. Whilst the existence of an on‑policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand‑crafted dense rewards, in 12 out of the 18 tasks tested.

PaperID: 3010, https://arxiv.org/pdf/2605.13534.pdf

Authors: Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi, Jiancan Wu, Xiang Wang

Title: Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

Abstract:
Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi‑step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal‑to‑noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL‑based framework that addresses these limitations through multi‑query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines retrieved information at the merging process, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi‑process reward design to optimize agents for both multi‑query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question‑answering tasks.

PaperID: 3011, https://arxiv.org/pdf/2605.13496.pdf

Authors: H. Moore, S. Qi, D. Milojicic, C. Bash, S. Pasricha

Title: MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters

Abstract:
Large Language Models (LLMs) have become increasingly prevalent in cloud‑based platforms, propelled by the introduction of AI‑based consumer and enterprise services. LLM inference requests in particular account for up to 90% of total LLM lifecycle energy use, dwarfing training energy costs. The rising volume of LLM inference requests is increasing environmental footprints, particularly carbon emissions and water consumption. To improve sustainability for LLM inference serving in cloud datacenter environments, we propose a novel multi‑agent game‑theoretic reinforcement learning framework called MARLIN to co‑optimize time‑to‑first token (TTFT), carbon emissions, water usage, and energy costs associated with LLM inference. MARLIN demonstrates a reduction of at least 18% in TTFT, 33% in carbon emissions, 43% in water usage, and 11% in energy costs compared to state‑of‑the‑art LLM inference management frameworks.

PaperID: 3012, https://arxiv.org/pdf/2605.13486.pdf

Authors: Xinyuan Wang, Wenyu Mao, Junkang Wu, Xiang Wang, Xiangnan He

Title: R^2-Mem: Reflective Experience for Memory Search

Abstract:
Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine‑grained historical information without heavy memory pre‑managed. However, existing deep search agents for memory system repeat past error behaviors because they fail to learn from the prior high‑ and low‑quality search trajectories. To address this limitation, we propose R^2‑Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric‑guided Evaluator scores low‑ and high‑quality steps in historical trajectories, and a self‑Reflection Learner distills the corresponding abstract experience. During the online inference, the retrieved experience will guide future search actions to avoid repeated mistakes and maintain high‑quality behaviors. Extensive experiments demonstrate that R^2‑Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2‑Mem provides a RL‑free and low‑cost solution for self‑improving LLM agents.

PaperID: 3013, https://arxiv.org/pdf/2605.13481.pdf

Authors: Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov, Alina Bogdanova, Mikhail Belkin, Ekaterina Lisitsyna, Artyom Sosedka, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Evgeny Burnaev

Title: PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

Abstract:
We introduce PersonalAI 2.0 (PAI‑2), a novel framework, designed to enhance large language model (LLM) based systems through integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval‑Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central point of PAI‑2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices and generated clue‑queries. Conducted evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue and DiaASQ) demonstrates improvement in factual correctness of generating answers compared to analogues methods (LightRAG, RAPTOR, and HippoRAG 2). PAI‑2 achieves 4% average gain by LLM‑as‑a‑Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that use of graph traversal algorithms (e.g. BeamSearch, WaterCircles) gain superior results compared to standard flatten retriever on average 6%, while enabled search plan enhancement mechanism gain 18% boost compared to disabled one by LLM‑as‑a‑Judge across six datasets. In addition, ablation study reveals that PAI‑2 achieves the SOTA result on MINE‑1 benchmark, achieving 89% information‑retention score, using LLMs from 7‑14B tiers. Collectively, these findings underscore the potential of PAI‑2 to serve as a foundational model for next‑generation personalized AI applications, requiring scalable, context‑aware knowledge representation and reasoning capabilities.

PaperID: 3014, https://arxiv.org/pdf/2605.13360.pdf

Authors: Coleman Hooper, Minwoo Kang, Suhong Moon, Nicholas Lee, Eric Wen, John Wawrzynek, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Kurt Keutzer

Title: Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

Abstract:
There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real‑time low‑latency responsiveness is required; for example, with voice‑controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real‑time latency‑sensitive applications. In our work, we propose Speculative Interaction Agents to enable real‑time interaction even for agents with complex multi‑turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason‑and‑act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out‑of‑the‑box to existing real‑time cloud APIs, providing 1.3‑1.7× speedups with minor accuracy loss. To enable real‑time interaction with small edge‑scale models, we also present a clock‑based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6‑2.2× speedups with the Qwen2.5‑3B‑Instruct and Llama‑3.2‑3B‑Instruct models across multiple tool calling benchmarks.

PaperID: 3015, https://arxiv.org/pdf/2605.13345.pdf

Authors: Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn

Title: Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

Abstract:
Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent‑Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real‑world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice‑proven resource optimization strategies. Comparing the documented real‑world outcomes with our model's results demonstrates that the DES‑ABM based simulation can effectively replicate real‑world ER dynamics under interventions. We lastly integrate a Proof‑of‑Concept multi‑agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ER environment based on a temporal ledger of ED event records. This modular DES‑ABM‑MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.

PaperID: 3016, https://arxiv.org/pdf/2605.13335.pdf

Authors: Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

Title: Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Abstract:
Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand‑crafted dynamics, introducing a sim‑to‑real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph‑transition rules. Built on HD‑EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action‑overlap scores overestimate physical‑state success, and that persistent belief memory improves task completion while reducing repeated visual exploration ‑‑ suggesting that belief maintenance should be a first‑class target of embodied‑agent evaluation.

PaperID: 3017, https://arxiv.org/pdf/2605.13311.pdf

Authors: Joy Bose

Title: IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Abstract:
Current AI‑assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt‑based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph‑grounded multi‑agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross‑methodology convergence mechanism implemented through graph‑based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high‑confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph‑grounded multi‑methodology synthesis produces more diverse and traceable innovation candidates compared to single‑methodology baselines. We discuss implications for computational creativity, explainable AI‑assisted invention, and graph‑native innovation systems.

PaperID: 3018, https://arxiv.org/pdf/2605.13296.pdf

Authors: Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao, Zisheng Liu, Shiyu Quan, Xing Hu, Zidong Du, Yunji Chen

Title: Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

Abstract:
Multi‑Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision‑free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair‑based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi‑agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint‑plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm‑starting an LNS‑based MAPF solver.

PaperID: 3019, https://arxiv.org/pdf/2605.13295.pdf

Authors: Tom Zehle

Title: CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Abstract:
LLM‑based multi‑agent systems have demonstrated strong performance across complex real‑world tasks, such as software engineering, predictive modeling, and retrieval‑augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit‑assignment problem. We therefore introduce CANTANTE, a framework that decomposes system‑level rewards into per‑agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi‑hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per‑agent signals rather than echoing the global system score.

PaperID: 3020, https://arxiv.org/pdf/2605.13276.pdf

Authors: Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, Yicheng Gong

Title: D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

Abstract:
The rapid evolution of Embodied AI has enabled Vision‑Language‑Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large‑scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high‑fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution‑phase inefficiencies. To address these challenges, we propose D‑VLA, a high‑concurrency, low‑latency distributed RL framework for large‑scale embodied foundation models. D‑VLA introduces "Plane Decoupling," physically isolating high‑frequency training data from low‑frequency weight control to eliminate interference between simulation and optimization. We further design a four‑thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual‑pool VRAM management model and topology‑aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D‑VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion‑parameter VLA models. In trillion‑parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high‑performance general‑purpose embodied agents.

PaperID: 3021, https://arxiv.org/pdf/2605.13269.pdf

Authors: Jing Liu, Yangyang Yang, Luca Ballotta, Fangfei Li, Yang Tang, Ruggero Carli

Title: Submodular Multi-Agent Policy Learning for Online Distributed Task Allocation in Open Multi-Agent Systems

Abstract:
This paper studies multi‑agent reinforcement learning with submodular team utilities for online distributed task allocation. In this setting, each agent selects one action from a local categorical policy, so feasible joint actions form a partition matroid over agent‑action pairs. Classical multilinear extensions use independent Bernoulli sampling and therefore do not match the categorical policies executed by decentralized agents. To address this mismatch, we introduce the Partition Multilinear Extension (PME), a continuous relaxation whose value equals the expected team utility under factorized categorical policies. We prove that submodular difference rewards provide unbiased PME marginal‑gradient information and yield a stagewise score‑function policy‑gradient estimator. Based on this connection, we propose SubMAPG, a centralized‑training decentralized‑execution policy‑gradient framework with masked categorical policies and submodular difference‑reward training signals. For the associated PME marginal‑space projected stochastic‑gradient dynamics, we prove a stagewise 1/2‑approximation guarantee and sublinear dynamic regret in slowly varying environments, measured by the path length of the optimal PME marginals. To handle open systems with time‑varying agents and targets, we instantiate SubMAPG with graph neural network policies. Experiments on multi‑robot coverage and multi‑target tracking show that SubMAPG outperforms local greedy and shared‑reward baselines and is competitive with centralized myopic greedy strategies.

PaperID: 3022, https://arxiv.org/pdf/2605.13221.pdf

Authors: Hanwen Zhang, Dusit Niyato, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low

Title: An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

Abstract:
In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV‑assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic‑AI‑assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval‑augmented generation, and chain‑of‑thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per‑slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor‑critic approach.

PaperID: 3023, https://arxiv.org/pdf/2605.13213.pdf

Authors: Hao Zhou, Tiru Wu, Yan Jiang, Wanqi Zhou, Junxing Hu, Ai Han

Title: Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

Abstract:
Multi‑modal multi‑agent systems (MM‑MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi‑agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM‑MAS largely underexplored. To bridge this gap, we introduce HAM^3, a Hierarchical Attack framework for multi‑modal multi‑agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM^3 mounts attacks by perturbing visual inputs, textual inputs, and their fused visual‑textual representations. At the communication layer, it performs communication‑level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning‑level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM^3 on the GQA benchmark through multi‑agent systems built on distinct reasoning paradigms including ReAct, Plan‑and‑Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning‑layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi‑agent intelligence.

PaperID: 3024, https://arxiv.org/pdf/2605.13172.pdf

Authors: Ziqi Wang, Yuhao Yang, Zhiwei Ling, Wenzhuo Qian, Hailiang Zhao

Title: When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling

Abstract:
Recent advances in agent and multi‑agent systems have shown strong performance on tool use, reasoning, and collaborative tasks. However, existing benchmarks mostly evaluate task completion in weakly coupled environments, and provide limited support for studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints. This leaves an important question underexplored: when do different coordination paradigms succeed or fail? We introduce Distributed Event‑driven Scheduling Benchmark (DESBench), a benchmark for evaluating agent coordination in hierarchical event‑driven scheduling. Built on a shared discrete‑event driven environment in industrial scheduling, our benchmark captures multi‑timescale decision making, partial observability, and dynamically coupled constraints. We define tasks and metrics that evaluate effectiveness, constraint alignment, coordination efficiency, and robustness, and focus on four representative coordination paradigms: centralized, hierarchical, heterarchical, and holonic. These paradigms correspond to distinct mechanisms of information flow, decision authority, and conflict resolution. Our controlled evaluations reveal clear coordination trade‑offs: centralized coordination is robust and communication‑efficient but scales poorly with difficulty; hierarchical coordination improves efficiency through decomposition but suffers from cross‑level misalignment; heterarchical coordination is flexible but communication‑heavy; and holonic coordination satisfies constraints well but loses global robustness. These findings demonstrate that coordination design fundamentally shapes agent system behavior in complex environments, revealing structural trade‑offs that cannot be captured by outcome metrics alone and underscoring the imperative for more adaptive, principled, and dynamic coordination mechanisms in future MAS research.

PaperID: 3025, https://arxiv.org/pdf/2605.13170.pdf

Authors: Maxwell Standen, Junae Kim, Claudia Szabo

Title: Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

Abstract:
Multi‑agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single‑victim communication perturbation attacks against Multi‑Agent Reinforcement Learning‑trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agent, and timesteps are most susceptible to attack and have the greatest impact on the system. We enhance these methods with two proposed adversarial loss functions that trade‑off attack success for attack impact which also create more effective perturbations. We empirically demonstrate the effectiveness of our methods against two different multi‑agent communication methods in navigation, PredatorPrey, and TrafficJunction environments. Our results show that our novel message selection method achieves a similar or greater impact than random message selection across almost all tested scenarios. Our victim selection, message selection, tempo, and loss functions improve attack effectiveness in half of the thirty scenarios we tested.

PaperID: 3026, https://arxiv.org/pdf/2605.13145.pdf

Authors: Idan Barnea, Ofir Schlisselberg, Yishay Mansour

Title: Collaborating in Multi-Armed Bandits with Strategic Agents

Abstract:
We study collaborative learning in multi‑agent Bayesian bandit problems, where strategic agents collectively solve the same bandit instance. While multiple agents can accelerate learning by sharing information, strategic agents might prefer to free‑ride and avoid exploration. We consider a setting with persistent agents that participate in multiple time periods. This is in contrast to most previous works on incentives in multi‑agent MAB, which assume short‑lived agents, namely each agent has a single decision to make and optimizes their expected reward in that single decision. As in the multi‑agent MAB model with incentives, our model does not have monetary transfers, and the only incentives are through information sharing. We propose \textttCAOS, a mechanism that sustains collaboration as a Nash equilibrium while achieving strong regret guarantees. Our results demonstrate that collaborative exploration can be sustained purely through information sharing, achieving performance close to that of fully cooperative systems despite strategic behavior.

PaperID: 3027, https://arxiv.org/pdf/2605.13136.pdf

Authors: Kasidit Sermsri, Teerapong Panboonyuen

Title: GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

Abstract:
Distilling multi‑step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher‑student interactions. Existing reasoning distillation methods, including mentor‑based approaches, predominantly operate in an open‑loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence‑gated closed‑loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence‑gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden‑state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability‑filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan‑T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open‑loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low‑resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence‑gated closed‑loop supervision is critical for building reliable and scalable small reasoning models.

PaperID: 3028, https://arxiv.org/pdf/2605.13134.pdf

Authors: Georgios Mitsos, Dimos V. Dimarogonas, Siyuan Liu

Title: Security-Aware Planning and Control of Multi-Agent Systems with LTL Tasks

Abstract:
This paper presents a secure‑by‑construction planning and control framework for multi‑agent systems subject to linear temporal logic (LTL) specifications. The framework protects sensitive information from a passive intruder with partial observations of the agents' motion. Security in multi‑agent coordination is captured by two notions that prevent the intruder from inferring whether a secret task has been executed and from identifying the agent responsible for its execution. The proposed framework incorporates the security constraints directly into the LTL synthesis procedure by constructing a secure finite transition system that removes all paths violating these constraints. Standard LTL synthesis is then applied to this secure abstraction to generate discrete plans, which are then refined into dynamically feasible continuous trajectories. This synthesis procedure provides formal guarantees that the resulting behavior of the multi‑agent system satisfies both the global LTL specification and the security constraints. The effectiveness of the proposed framework is demonstrated through a two‑drone case study.

PaperID: 3029, https://arxiv.org/pdf/2605.13131.pdf

Authors: Changha Lee, Gyusang Cho

Title: ERPPO: Entropy Regularization-based Proximal Policy Optimization

Abstract:
Multi‑Agent Proximal Policy Optimization (MAPPO) is a variant of the Proximal Policy Optimization (PPO) algorithm, specifically tailored for multi‑agent reinforcement learning (MARL). MAPPO optimizes cooperative multi‑agent settings by employing a centralized critic with decentralized actors. However, in case of multi‑dimensional environment, MAPPO can not extract optimal policy due to non‑stationary agent observation. To overcome this problem, we introduce a novel approach, Entropy Regularization‑based Proximal Policy Optimization (ERPPO). For the policy optimization, we first define the object detection ambiguity under multi‑dimensional observation environment. Distributional Spatiotemporal Ambiguity (DSA) learner is trained to estimate object detection uncertainty in non‑stationary constraints. Then, we enhance PPO with a novel Entropy Regularization term. This regularization dynamically adjusts the policy update by applying a stronger (L1) regularization in high‑ambiguity observation to encourage significant exploratory actions and a weaker (L2) regularization in low‑ambiguity observation to stabilize the proximal policy optimization. This approach is designed to enhance the probability of successful object localization in time‑critical operations by reducing detection failures and optimizing search policy. Experiments on a testbed with AirSim‑based maritime searching scenarios show that the proposed ERPPO improves accuracy performance. Our proposed method improves higher gradient than MAPPO. Qualitative results confirm that ERPPO effectiveness in terms of suppressing false detection in visually uncertain conditions.

PaperID: 3030, https://arxiv.org/pdf/2605.13110.pdf

Authors: Grigorios Alexandrou, Katerina Pramatari

Title: A Multi-Agent Orchestration Framework for Venture Capital Due Diligence

Abstract:
We present a fully automated multi‑agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event‑driven orchestration architecture, combining Large Language Models (LLMs) with real‑time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that reverse‑engineers the frontend‑to‑backend communication of the Greek Business Registry (Γ.E.MH.), querying dynamic endpoints to retrieve official financial filings that are then parsed using a layout‑aware OCR extractor. A structural fallback mechanism explicitly flags data absence rather than generating unverified figures, directly targeting hallucination in financial contexts. All workflow artifacts are publicly available to support replication.

PaperID: 3031, https://arxiv.org/pdf/2605.13077.pdf

Authors: Chunyan Mu, Muhammad Najib

Title: Counterfactual Reasoning for Causal Responsibility Attribution in Probabilistic Multi-Agent Systems

Abstract:
Responsibility allocation ‑‑ determining the extent to which agents are accountable for outcomes ‑‑ is a fundamental challenge in the design and analysis of multi‑agent systems. In this work, we model such systems as concurrent stochastic multi‑player games and introduce a notion of retrospective (backward) counterfactual responsibility, which quantifies an agent's accountability for outcomes resulting from a given strategy profile. To allocate responsibility among agents, we utilise the Shapley value and formally show that this method satisfies key desirable properties, including fairness and consistency. Building on this foundation, we propose a formal framework that supports both verification and strategic reasoning in responsibility‑aware multi‑agent systems. Furthermore, by adopting Nash equilibrium as the solution concept, we demonstrate how to compute stable strategy profiles in which agents trade off responsibility against expected reward.

PaperID: 3032, https://arxiv.org/pdf/2605.13063.pdf

Authors: Ehsan Aghazadeh, Masoud Malekzadeh, Ahmad Ghasemi, Hossein Pishro-Nik

Title: Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching

Abstract:
Designing continuous trajectories whose time‑averaged occupancy provably matches a prescribed spatial density (the \emphergodic coverage problem) is central to UAV‑assisted data collection and sensing, robotic exploration, and mobile monitoring. For flying agents in particular, this challenge is acute: trajectories must balance coverage fidelity against tight energy budgets, no‑fly zones, and acceleration limits. Existing methods either re‑optimize each trajectory online (with cost growing in the horizon and re‑running for every target, agent, and realization) or rely on bespoke analytical constructions that must be re‑derived for each new constraint. We propose a \emphepushforward framework that decouples ergodicity from density matching: an analytic latent trajectory provides exact uniform ergodicity on a simple annular domain, and a single map, learned offline by optimal‑transport conditional flow matching, transports this latent occupancy onto the prescribed target density. The composed trajectory is then asymptotically ergodic with respect to the learned pushforward distribution, with deviation from the target controlled by the flow‑matching training loss. Once trained for a given target density and constraint set, the map serves an unbounded number of trajectories and a multi‑agent fleet without per‑agent retraining, and many differentiable operational constraints (no‑fly zones, acceleration ceilings, or fairness penalties) enter as additive soft penalties in the training loss without re‑deriving the design. We prove three results (an acceleration‑energy bound, an O(1/\sqrtK) ergodic convergence rate in the number of trajectory cycles K, and an approximation‑error bound) that combine into an end‑to‑end coverage bound estimable from CFM training diagnostics (certified given an architectural Lipschitz bound on v_θ).

PaperID: 3033, https://arxiv.org/pdf/2605.13037.pdf

Authors: Yuxin Liu, Ziang Ye, Yueqing Sun, Mingye Zhu, Jinwei Xiao, Zhuowen Han, Qi GU, Xunliang Cai, Lei Zhang

Title: MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

Abstract:
Current interactive LLM agents rely on goal‑conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial‑and‑error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map‑then‑Act Paradigm (MAP), a plug‑and‑play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment‑general priors; (2) Task‑Specific Mapping, constructing a structured cognitive map; and (3) Knowledge‑Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC‑AGI‑3, MAP enables frontier models to surpass near‑zero baseline performance in 22 of 25 game environments. We further introduce MAP‑2K, a dataset of map‑then‑act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.

PaperID: 3034, https://arxiv.org/pdf/2605.13035.pdf

Authors: Takuro Kato, Keisuke Okumura

Title: Conveyor Parcel Routing with Order-Contiguous Arrivals

Abstract:
In warehouse logistics, parcels released from the outfeed of an automated storage system must be routed through conveyor networks to workstations. Beyond collision avoidance, practical operations impose an additional requirement of order‑contiguous arrivals: at each delivery point, parcels belonging to the same order must arrive as a consecutive block in the arrival sequence to reduce downstream re‑sorting effort. We formalize this problem as online multi‑agent path finding with order‑contiguity (online MAPF‑OC), where agents (i.e., parcels) appear over time and exit upon delivery. To efficiently solve online MAPF‑OC, we propose Dual‑Ordering Prioritized Planning (DOPP), a complete polynomial‑time algorithm with a three‑level structure that (i) searches order‑level arrival sequences, (ii) refines agent‑level priorities, and (iii) synthesizes feasible solutions via prioritized planning. Experiments on various conveyor‑network layouts, including those derived from actual warehouses, demonstrate DOPP's practical scalability and ability to generate high‑quality plans within tight time budgets.

PaperID: 3035, https://arxiv.org/pdf/2605.12991.pdf

Authors: Adarsh Kumarappan, Ananya Mujoo

Title: Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Abstract:
LLM‑based multi‑agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF‑induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid‑layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean‑to‑pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage‑point yield gap at majority consensus, preserved across jury sizes N \in \4, 5, 6\. Two converging activation‑space interventions show that pressure suppresses clean‑reasoning features rather than activating a new sycophancy circuit. A single correctly‑arguing dissenter reduces yield by 54‑73 percentage points across all framings tested, whereas the strongest prompt‑level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt‑level defenses.

PaperID: 3036, https://arxiv.org/pdf/2605.12966.pdf

Authors: Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang

Title: Position: Agentic AI System Is a Foreseeable Pathway to AGI

Abstract:
Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real‑world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture‑of‑Experts, reinterpret the instability of current multi‑agent frameworks, and call for greater research focus on Agentic AI.

PaperID: 3037, https://arxiv.org/pdf/2605.12934.pdf

Authors: Haiyang Xin, Qiannan Niu, Shuang Li, Yimeng Sun, Ching Sing Chai, Lingyun Huang, Gaowei Chen

Title: An Activity-Theoretical Approach to Teacher Professional Development in Pedagogical AI Agent Design

Abstract:
This two‑cycle formative intervention study examined why teachers disengage from AI agent creation after professional development ‑ a low engagement paradox ‑ and tested whether systemic redesign could address it. Cycle 1 (N=218) revealed that despite completing comprehensive TPD, 87 percent of teachers ceased creating within three weeks, with behavioral tracking and interview analysis identifying systemic contradictions as the source of psychological need frustration rather than capacity deficits. Cycle 2 (N=26) implemented Cultural‑Historical Activity Theory and Self‑Determination Theory ‑ driven redesign directly targeting diagnosed contradictions, achieving synchronized enhancement of both capacity and willingness. The findings reframe implementation failure as a rational response to need‑thwarting systems and offer a replicable CHAT ‑ SDT diagnostic framework for transformative professional development.

PaperID: 3038, https://arxiv.org/pdf/2605.12925.pdf

Authors: Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek, Pingping Lin, Yu Hu

Title: AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Abstract:
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome‑only view treats a principled solution and a chaotic trial‑and‑error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE‑bench Verified tasks. Of these, 47 have enough passing trajectories to construct task‑level process references, yielding a 1,815‑trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process‑level assessment of SWE‑agent trajectories, and release AgentLens‑Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task‑level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context‑sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens‑Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens‑Bench dataset and AgentLens SDK, at https://github.com/microsoft/code‑agent‑state‑trajectories/.

PaperID: 3039, https://arxiv.org/pdf/2605.12923.pdf

Authors: Shuang Li, Haiyang Xin, Yimeng Sun, Qiannan Niu, Lingyun Huang, Gaowei Chen, Ching Sing Chai

Title: MIRACLE_Multi-Agent Intelligent Regulation to Advance Collaborative Learning Environment

Abstract:
Effective collaboration requires Socially Shared Regulation (SSRL), but students often lack these skills. This study introduces the MIRACLE (Multi‑Agent Intelligent Regulation to Advance Collaborative Learning Environment) system, which supports SSRL by orchestrating metacognitive regulation and proactively providing emotional and motivational support. We conducted a quasi‑experimental study with 90 fifth‑grade students. The experimental group (n=42) used a collaborative platform CocoNote equipped with MIRACLE, while the control group (n=48) used the same platform with a general GPT assistant. Quantitative results show the MIRACLE group achieved significant gains across SSRL phases (Planning, Monitoring, Reflection) and produced higher‑quality collaborative artifacts compared to the control group. Qualitative findings indicate students perceived MIRACLE as an effective facilitator for cognitive, regulatory, and emotional support. This study demonstrates that specialized, orchestrated AI systems are more effective than generic AI in enhancing SSRL.

PaperID: 3040, https://arxiv.org/pdf/2605.12920.pdf

Authors: Vardhan Dongre, Dilek Hakkani-Tür

Title: Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

Abstract:
Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM‑based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural‑language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world‑model alignment rather than superficial coordination, we propose a framework for measuring world‑model alignment defined over per‑agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief‑sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world‑model alignment, and identify where current models fall on this spectrum.

PaperID: 3041, https://arxiv.org/pdf/2605.12913.pdf

Authors: Changhao Li, Rushi Qiang, Jiawei Huang, Chenxiao Gao, Chao Zhang, Niao He, Bo Dai

Title: Revisiting DAgger in the Era of LLM-Agents

Abstract:
Long‑horizon LM agents learn from multi‑turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine‑tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off‑policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off‑policy mismatch by learning from on‑policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi‑turn LM agents: the algorithm collects trajectories through a turn‑level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software‑engineering agent with 4B‑ and 8B‑scale student models. On SWE‑bench Verified, our DAgger‑style training improves over the strongest post‑training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE‑agent systems, while the 8B agent achieves 29.8%, surpassing SWE‑Gym‑32B and coming within 5 points of stronger 32B‑scale agents. Together with consistent gains on the held‑out SWE‑Gym split, these results suggest the effectiveness of DAgger for modern long‑horizon LM agents.

PaperID: 3042, https://arxiv.org/pdf/2605.12894.pdf

Authors: Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno, Chirag Shah, Natasha Jaques

Title: Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Abstract:
Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM‑based user simulators as stand‑ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug‑and‑play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand‑crafting personas, we cast persona generation as an LLM‑driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task‑preserving roleplay policies. Candidate generators are guided by a multi‑objective fitness score combining human‑likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human‑like personas for any task in the domain. Across tau^2‑bench retail and airline domains, evolved PPol programs yield 33‑62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol‑conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out‑of‑distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator‑based evaluation and training without changing tasks or rewards.

PaperID: 3043, https://arxiv.org/pdf/2605.12887.pdf

Authors: Hengwei Ye, Jiasheng Mao, Zhenhan Guan, Zheng Tian

Title: EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

Abstract:
Web‑enabled LLM agents are changing how online information influences search outcomes. \ Existing Generative Engine Optimization (GEO) studies mainly focus on individual webpages. \ However, agentic web search is not a single‑document setting: an agent may issue queries, crawl pages, follow links, reformulate searches, and synthesize evidence across multiple browsing steps. \ Influence therefore depends not only on page content, but also on how pages are organized, connected, and encountered along the agent's browsing trajectory. \ We study this shift through Ecosystem Generative Engine Optimization (EcoGEO), which treats GEO as an environment‑level influence problem for web‑enabled LLM agents. \ To instantiate this perspective, we propose TRACE, a Trajectory‑Aware Coordinated Evidence Ecosystem. \ Given a recommendation query and a fictional target product, our method builds a controlled evidence environment that coordinates an agent‑facing navigation entry page with heterogeneous support pages. \ These pages use shared terminology, internal links, and consistent product attributes to introduce, verify, and reinforce the target product. We evaluate our method on OPR‑Bench, a benchmark for open‑ended product recommendation. \ Experiments show that it consistently outperforms page‑level GEO baselines in final target recommendation. \ Trajectory‑level metrics further show increased initial target‑result crawls, target‑specific follow‑up searches, and internal‑link crawls, suggesting that the gains come from shaping the agent's evidence‑acquisition process rather than merely adding more target‑related content. \ Overall, our findings support an ecosystem research paradigm for GEO, where web‑enabled LLM agents are studied in relation to the broader evidence environments that guide search, browsing, and answer synthesis.

PaperID: 3044, https://arxiv.org/pdf/2605.12856.pdf

Authors: Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee

Title: Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

Abstract:
The emergence of multi‑agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content‑based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT‑MOD (BOT‑MODeration), a moderation framework that grounds detection in agent intent rather than traditional content level signals. BOT‑MOD identifies the underlying intent by engaging with the target agent in a multi‑turn exchange guided by Gibbs‑based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that BOT‑MOD reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent‑aware moderation of agents in open multi‑agent environments.

PaperID: 3045, https://arxiv.org/pdf/2605.12799.pdf

Authors: Ahmad Al-Kabbany, Esraa Kassem

Title: Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching

Abstract:
This research is primarily concerned with the critical problem of synthesizing a structured Retrieval‑Augmented Generation (RAG) system for advanced AI applications in the domain of swimming. As the integration of Artificial Intelligence in sports science matures, its applications in swimming have become increasingly diverse, spanning from real‑time technical coaching and talent scouting to comprehensive performance profiling and the dynamic personalization of training periodization. Within this landscape, RAG‑based systems represent a pivotal advancement in Large Language Model (LLM) enhanced swimming analysis, as they allow for the grounding of generative outputs in authoritative domain knowledge, thereby ensuring the credibility of AI‑generated advice, contextually and technically. Despite this potential, building robust RAG systems using only real‑world aquatic data presents significant challenges, including ethical constraints regarding athlete biometrics, and the high cost of manual expert labeling. To address these barriers, we propose a novel generative framework that leverages a multimodal knowledge base gathered across four dimensions: physiological data, physiological literature, kinematic sensor data, and unstructured domain expertise. Our proposed framework utilizes a multi‑agent LLM architecture to synthesize a high‑fidelity dataset of 1,864 validated "Question‑Context‑Answer" triplets‑drawn from 1,914 drafts evaluated against 12 physiological soundness rules. By providing a structured, synthetic ground truth, this work establishes a foundational benchmark for trustworthy AI in aquatics. The outcomes of this research promise to enhance the reliability of automated coaching and open a plethora of future directions in "Meta‑Agent" development and athletic profiling, ultimately bridging the gap between raw data engineering and practical sports science application.

PaperID: 3046, https://arxiv.org/pdf/2605.12718.pdf

Authors: Tommaso Giovannelli, Griffin D. Kent

Title: CHAL: Council of Hierarchical Agentic Language

Abstract:
Multi‑agent debate has emerged as a promising approach for improving LLM reasoning on ground‑truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground‑truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi‑agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph‑structured belief representation with a Bayesian‑inspired architecture, that facilitates belief revision through a gradient‑informed dynamic mechanism by leveraging the strength of the belief's thesis as a differentiable objective. Meta‑cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator's value system determines the debate's overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi‑agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.

PaperID: 3047, https://arxiv.org/pdf/2605.12703.pdf

Authors: Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

Title: MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

Abstract:
We introduce MMCL‑Bench, a benchmark for multimodal context learning: learning task‑local rules, procedures, and empirical patterns from visual or mixed‑modality teaching context and applying them to new visual instances. Unlike text‑only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL‑Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric‑based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one‑third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context‑to‑answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL‑Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.

PaperID: 3048, https://arxiv.org/pdf/2605.12673.pdf

Authors: Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

Title: Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Abstract:
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent‑Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red‑teaming system that drives coding agents to audit benchmarks and identify possible reward‑hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative‑adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward‑hacking exploits that achieve near‑perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable‑task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast‑paced benchmarking space.

PaperID: 3049, https://arxiv.org/pdf/2605.12655.pdf

Authors: Wo Wei Lin, Ethan Rathbun, Enrico Marchesini Xiang Zhi Tan

Title: Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

Abstract:
Multi‑agent reinforcement learning (MARL) in real‑world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long‑horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro‑actions. We propose Macro‑Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor‑critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi‑agent environments.

PaperID: 3050, https://arxiv.org/pdf/2605.12555.pdf

Authors: Duan-Shin Lee, Yu-Hsiu Hung

Title: DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games

Abstract:
In this paper we study team‑symmetric games with m\ge 2 teams. Players within a team have symmetric identity and have a common payoff function. We show that team‑symmetric games always have a team‑symmetric Nash equilibrium. We develop and solve a linear complementarity problem of team‑symmetric Nash equilibria. We propose an actor‑critic based multi‑agent reinforcement learning algorithm for team‑symmetric games. Through simulations, we show that this multi‑agent reinforcement learning algorithm performs much better than many existing algorithms.

PaperID: 3051, https://arxiv.org/pdf/2605.12535.pdf

Authors: Igor Santos-Grueiro

Title: Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly

Abstract:
LLM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive‑bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent‑memory compromise. We study this failure mode over local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled‑state visibility. We evaluate SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact compliance remains low. Against truncation, SafeContext yields small gains; against a strong structured‑compaction policy, most aggregate lift disappears, leaving residual benefit mainly in overflow eviction and selected aliasing slices. Replay‑only does not explain the effect. A larger‑model extension on Qwen 14B and Llama 70B shows the same failure object under larger models, although sign and magnitude remain policy‑conditional. Decision‑time context assembly is therefore a measurable part of the control path that can be partially hardened.

PaperID: 3052, https://arxiv.org/pdf/2605.12532.pdf

Authors: Ivan Letteri

Title: AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems

Abstract:
Conventional algorithmic trading systems are grounded in deterministic heuristics or offline‑trained statistical models that cannot adapt to the semantic complexity of rapidly shifting market regimes. This paper introduces AGENTICAITA, an agentic AI framework that replaces the traditional signal then execute paradigm with a fully autonomous deliberative loop in which multiple specialized Large Language Model agents reason, negotiate, and act in concert ‑ without any offline training or human intervention. The framework proposes four architectural contributions: (i) an Adaptive Z‑Score Trigger Engine that acts as a cognitive resource allocator, gating LLM inference exclusively on statistically anomalous market conditions; (ii) a Sequential Deliberative Pipeline ‑ the core agentic contribution ‑ in which an Analyst agent, a Risk Manager agent, and an Executor agent form a structured reasoning chain governed by typed JSON contracts and a deterministic hard‑gate safety layer; (iii) an Inference Gating Protocol, a mutex‑based cognitive resource scheduler that serializes concurrent agent activations and ensures fully reproducible audit trails; and (iv) a Correlation‑Break Diversification composite score that operationalizes portfolio‑level idiosyncratic signal prioritization within individual agent reasoning. Validated over a five‑day autonomous dry‑run session under live market conditions, the framework demonstrates operational correctness of the deliberative pipeline, achieving 157 zero‑intervention invocations across 76 assets with an 11.5% agentic friction rate that confirms non‑trivial inter‑agent negotiation. This preliminary proof‑of‑concept establishes the feasibility of training‑free, deterministic safety‑constrained multi‑agent orchestration in financial decision loops, with statistically robust performance evaluation and execution cost modeling deferred to extended live deployment.

PaperID: 3053, https://arxiv.org/pdf/2605.12530.pdf

Authors: Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

Title: In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Abstract:
LLM fairness should be evaluated through in‑situ conversational behavior rather than standardized‑test Q&A benchmarks. We show that the standardized‑test paradigm can be structurally unreliable: surface‑level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC‑Fairness, a multi‑agent conversational framework that embeds controlled variation factors into multi‑round dialogue for in‑situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi‑agent interaction. Repurposing standardized‑test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self‑perspective) and peer receptiveness (how receptive they are to peers, from the other‑perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In‑situ behavioral evaluation reveals stable, model‑specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized‑test paradigm does not offer.

PaperID: 3054, https://arxiv.org/pdf/2605.12519.pdf

Authors: Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh

Title: Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Abstract:
Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post‑training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine‑tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground‑truth signals to form process‑level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy‑only RL improves move accuracy, it sharply degrades reasoning quality, increasing win‑rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win‑rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process‑supervised models. A reasoning‑space analysis further shows that, without a structured prior, accuracy‑only RL converges to budget‑dependent shortcuts rather than sound multi‑step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.

PaperID: 3055, https://arxiv.org/pdf/2605.12512.pdf

Authors: Haoran Bu, Litian Zhang, Chuxuan Zhang, Zhanyuan Liu, Hui Pang, Xi Zhang

Title: Beyond Individual Mimicry: Constructing Human-Like Social network with Graph-Augmented LLM Agents

Abstract:
Driven by large language models (LLMs), social bot can autonomously engage in local interactions, whose human‑like behaviors enable them to evade social bot detection. However, while these botnets exhibit realistic local social interactions, they fail to preserve human‑like social network. This is because LLM‑based bots are graph‑unaware and cannot coordinate over global interactions, which makes those botnets vulnerable to graph neural network (GNN)‑based detection. To address this limitation, we propose GraphMind, which equips LLM‑driven social bots to explicitly learn and fit human‑like social network structures. Building on this foundation, we further construct GraphMind‑Botnet, a LLM‑driven botnet designed to evaluate the performance of existing social bot detection algorithms. Experiments on datasets derived from GraphMind‑Botnet show that both text‑based and graph‑based detection models show substantially degraded performance in distinguishing. Our results highlight the critical role of social link construction in LLM‑driven social network generation, while exposing fundamental weaknesses in existing bot detection mechanisms.

PaperID: 3056, https://arxiv.org/pdf/2605.12471.pdf

Authors: Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez

Title: KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Abstract:
We introduce KV‑Fold, a simple, training‑free long‑context inference protocol that treats the key‑value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one‑step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi‑agent communication, we repurpose it as a chunk‑to‑chunk recurrence for long‑context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per‑step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV‑Fold preserves exact information over long distances. On a needle‑in‑a‑haystack benchmark, it achieves 100% exact‑match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama‑3.1‑8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV‑Fold maintains long‑range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV‑cache recurrence, providing a practical route to long‑context inference without architectural changes or training.

PaperID: 3057, https://arxiv.org/pdf/2605.12411.pdf

Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Title: Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Abstract:
AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule‑based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real‑world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target‑adaptive text‑tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while K previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game‑state features and LLM‑based text representations, and adds LLM‑as‑Observer as an additional representation: a small frozen LLM reads the decision‑time state and dialogue; its answer is discarded, and its hidden state becomes a decision‑oriented feature, making the LLM an encoder rather than a direct few‑shot predictor. Training on 13 frontier‑LLM agents and testing on 91 held‑out scaffolded agents, the full model outperforms direct LLM‑as‑Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at K=16, they improve response‑prediction AUC by about 4 points across both tasks and reduce bargaining offer‑prediction error by 14%. These results show that formulating counterpart prediction as a target‑adaptive text‑tabular task enables effective adaptation, and that hidden LLM representations expose decision‑relevant signals that direct prompting does not surface.

PaperID: 3058, https://arxiv.org/pdf/2605.12388.pdf

Authors: Hannes Büchi, Manon Flageat, Eduardo Sebastián, Amanda Prorok

Title: Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

Abstract:
Effective multi‑agent cooperation requires agents to adopt diverse behaviors as task conditions evolve‑and to do so at the right moment. Yet, current Multi‑Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind fixed behaviors to fixed agent identities. Consequently, they are ill‑equipped for tasks where agents need to take on different roles at very specific moments in time. We argue that, to define these behavioral transitions, the missing ingredient is events. Events are changes in the state of the system that induce qualitative changes in the task. Based on this view, we introduce a framework that decouples agent identity from behavior, capturing a continuous manifold from which agents instantiate their behaviors in response to events. This framework is based on two elements. First, to build an expressive behavior manifold, we introduce Neural Manifold Diversity (NMD), a formal distance metric that remains well‑defined when behaviors are transient and agent‑agnostic. Second, we use an event‑based hypernetwork that generates Low‑Rank Adaptation (LoRA) modules over a shared team policy, enabling on‑the‑fly agent‑policy reconfiguration in response to events. We prove that this construction ensures that diversity does not interfere with reward maximization by design. Empirical results demonstrate that our framework outperforms established baselines across benchmarks while exhibiting zero‑shot generalization, and being the only method that solves tasks requiring sequential behavior reassignment.

PaperID: 3059, https://arxiv.org/pdf/2605.12376.pdf

Authors: Wei Liu, Yang Gu, Xi Yan, Zihan Nan, Beicheng Xu, Keyao Ding, Bin Cui, Wentao Zhang

Title: ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

Abstract:
Table processing‑including cleaning, transformation, augmentation, and matching‑is a foundational yet error‑prone stage in real‑world data pipelines. While recent LLM‑based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi‑agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge‑augmented synthesis, and feedback‑driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct‑style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task‑aware code, and (iii) an Evaluator‑Summarizer loop that injects execution scores and diagnostic insights to enable closed‑loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi‑step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance‑compliant table transformations.

PaperID: 3060, https://arxiv.org/pdf/2605.12375.pdf

Authors: Matthew Beddows, Aiden Durrant, Georgios Leontidis

Title: Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts

Abstract:
Accurate crop yield forecasting in commercial soft fruit production is constrained by the data available in typical commercial farm records, which lack the sensor networks, satellite imagery, and high‑resolution meteorological inputs that most state‑of‑the‑art approaches assume. We propose a structured LLM agent framework that performs post‑hoc correction of existing model predictions, encoding agricultural domain knowledge across tools for phase detection, bias learning, and range validation. Evaluated on a proprietary strawberry yield dataset and a public USDA corn harvest dataset, agent refinement of XGBoost reduced MAE by 20% and MASE by 56% on strawberry, with consistent improvements across Moirai2 (MAE 24%, MASE 22%) and Random Forest (MAE 28%, MASE 66%) baselines. Using Llama 3.1 8B as the agent produced the strongest corrections across all configurations; LLaVA 13B showed inconsistent gains, highlighting sensitivity to the choice of refinement model.

PaperID: 3061, https://arxiv.org/pdf/2605.12321.pdf

Authors: Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah

Title: LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

Abstract:
Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi‑agent coordination. These capabilities make them well suited for cooperative driving, where rule‑based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right‑of‑way demands, heterogeneous vehicle priorities, and vehicle‑specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal‑based systems rather than as primary decision‑makers. Signal controllers remain vehicle‑agnostic, reservation‑based methods lack intent awareness, and recent LLM‑based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub‑second control settings. We propose LIDSA (LLM‑Based Intent‑Driven Speed Advisory), a signal‑free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed‑cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non‑LLM baselines degrade to Level of Service F. Under near‑saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed‑cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non‑LLM method. These results demonstrate that LLM‑based reasoning can enable real‑time, signal‑free intersection management.

PaperID: 3062, https://arxiv.org/pdf/2605.12280.pdf

Authors: Elias Calboreanu

Title: Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

Abstract:
Prompt specifications for multi‑agent large language model (LLM) systems carry data contracts and integration logic across many interdependent files but are rarely subjected to structured‑inspection rigor. This paper reports a single‑system empirical case study of iterative, agent‑driven auditing applied to AEGIS (Autonomous Engineering Governance and Intelligence System), a production seven‑lane orchestration pipeline whose prompt‑specification surface comprises approximately 7150 lines: 6907 across seven lane PROMPT.md files and a 245‑line shared Ticket Contract. Nine sequential audit rounds, executed by Claude sub‑agents using a checklist‑driven walkthrough adapted from Weinberg and Freedman, surfaced 51 prompt‑specification consistency defects, distinct from the 51 STRIDE‑categorized adversarial code findings reported in the companion preprint. Per‑round counts were 15, 8, 12, 2, 8, 1, 4, 1, and 0. We report a seven‑category post‑hoc defect taxonomy with explicit coding rules, observed non‑monotonic convergence consistent with cascading edits and audit‑scope expansion, and an audit protocol distilled from the study, with the final locked checklist released as a reproducibility appendix. Single‑file review missed defect classes that were surfaced only by later expanded‑scope rounds in this system. The same LLM family authored and audited the specifications; replication with dissimilar models and human reviewers is required before generalization.

PaperID: 3063, https://arxiv.org/pdf/2605.12272.pdf

Authors: Venkata Krishna Prasanth Budigi, Siri Chandana Sirigiri

Title: BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing -- A Proposed Framework

Abstract:
Autoscaling has become a baseline expectation for cloud‑native big data processing, and the design space has expanded beyond rule‑based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet despite a growing body of work spanning these paradigms, the community lacks a shared benchmark for comparing them. Existing evaluations rely on synthetic TPC‑style queries, vendor blog posts with proprietary baselines, or narrow trace replays. Each new policy reports favorable numbers against a different baseline, on a different workload, with a different cost model, making cross‑paper comparison effectively impossible. This is a position paper. We propose BatchBench, an open benchmarking framework designed to place rule‑based, learned, and agentic autoscaling policies on equal experimental footing. The contribution is the design of the framework, not empirical results. We contribute: (1) a workload taxonomy of six batch processing classes synthesized from published autoscaling benchmarks and publicly released cluster traces; (2) the design of a parameterized workload generator with a validation methodology based on two‑sample Kolmogorov‑Smirnov and earth‑mover distance; (3) a five‑axis evaluation harness specification covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first‑class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM‑based and reinforcement‑learning autoscalers be evaluated alongside rule‑based controllers with a single API. We discuss the expected evaluation surface, identify open research questions the framework is designed to answer, and outline a roadmap for the empirical paper that will follow. BatchBench's reference implementation is in active development and will be released as open source.

PaperID: 3064, https://arxiv.org/pdf/2605.12251.pdf

Authors: Sarvin Bahmani, Soumyajit Paul, Sven Schewe, Shadi Tasdighi Kalat, Ashutosh Trivedi

Title: Social Welfare under Heterogeneous Time Preferences

Abstract:
In several socioeconomic‑critical decision‑making settings, such as fair resource allocation, climate policy, or AI alignment, multiple principals interact within a common arena. While it is well established that these principals may have differing preferences, decision‑making under heterogeneous time preferences remains relatively unexplored. In particular, principals may weigh future outcomes differently and may derive distinct utilities from the same decisions. Motivated by such scenarios, we introduce the notion of heterogeneous time preferences in MDPs, where multiple principals possess distinct reward functions and apply different discount factors to future rewards. To compute meaningful decisions in such settings, an AI agent must rely on a notion of optimality that accounts for the preferences of all principals. We adopt a utilitarian notion of social welfare, defined as the sum of utilities accrued to all principals, and study the synthesis of agent strategies that maximise this welfare. Under heterogeneous time preferences, we show that optimal strategies are no longer positional, even when all principals receive identical rewards. Nevertheless, optimal strategies remain structurally simple: they can be realized as pure finite‑memory counting strategies, require only polynomial memory in the system size, and can be synthesized in polynomial time. On the other hand, we show that deciding threshold questions for optimal positional strategies is NP‑hard, exposing a poor trade‑off: insisting on positional simplicity neither makes synthesis tractable nor preserves social welfare.

PaperID: 3065, https://arxiv.org/pdf/2605.12240.pdf

Authors: Zixu Yang, Hang Zheng, Nan Jiang, Zhiyang Tang, Situo Zhang, Xiaobao Wu, Lu Chen, Kai Yu

Title: No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents

Abstract:
Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long‑horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real‑world deployment. To address these challenges, we propose NOD (Navigator‑Operator‑Director), a heterogeneous multi‑agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision‑making by the Navigator. Besides, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long‑horizon tasks. Experiments on τ^2‑Bench demonstrate that NOD achieves higher task success rates and critical action precision over baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user‑intent misalignment.

PaperID: 3066, https://arxiv.org/pdf/2605.12239.pdf

Authors: Bogdan Banu

Title: Harness Engineering as Categorical Architecture

Abstract:
The agent harness, the system layer comprising prompts, tools, memory, and orchestration logic that surrounds the model, has emerged as the central engineering abstraction for LLMbased agents. Yet harness design remains ad hoc, with no formal theory governing composition, preservation of properties under compilation, or systematic comparison across frameworks. We show that the categorical Architecture triple (G, Know, Phi) from the ArchAgents framework provides exactly this formalization. The four pillars of agent externalization (Memory, Skills, Protocols, Harness Engineering) map onto the triple's components: Memory as coalgebraic state, Skills as operad‑composed objects, Protocols as syntactic wiring G, and the full Harness as the Architecture itself. Structural guarantees‑integrity gates, quality‑based escalation, supported convergence checks‑are Know‑level certificates whose preservation is structural replay: our compiler checks identity and verifier replay, not output‑layer correctness or model behavior. We validate this correspondence with a reference implementation featuring compiler functors targeting Swarms, DeerFlow, Ralph, Scion, and LangGraph: the four configuration compilers preserve three named certificate types by identity or replay, and LangGraph preserves the same certificates through its shared per‑stage execution path. The LangGraph compiler creates one node per stage using the same per‑stage method as the native runtime, providing LangGraph‑native observability without reimplementing harness logic. An end‑to‑end escalation experiment with real LLM agents confirms that the quality‑based escalation control path is model‑parametric in this two‑model, one‑task experiment. The result positions categorical architecture as the formal theory behind harness engineering.

PaperID: 3067, https://arxiv.org/pdf/2605.12177.pdf

Authors: Andrea Morandi, Mahesh Viswanathan

Title: Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

Abstract:
[Abridged] Production LLM deployments receive feedback from a non‑random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40‑50 percentage points away from true system quality. We treat this as a topic‑ and sentiment‑ stratified selection‑bias problem and propose a three‑agent hierarchical Bayesian pipeline that does not require ground‑truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two‑stage hierarchical Beta‑Binomial under NUTS, inferring per‑topic selection rates s_c and quality q_c with partial pooling; a Synthesis Agent reweights q_c by true topic prevalence \hatπ_c = n_c/N to report a bias‑corrected aggregate posterior \bar Q = \sum_c \hatπ_c q_c with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, C=18 clusters, Q^\star=0.6249) with simulated topic‑ and sentiment‑dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive‑feedback rate and negative‑to‑positive ratio, both readable from any production dashboard without labels) keeps Hierarchical‑Informed within 4‑13 pp of Q^\star as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering Q^\star in 50/50 random‑seed replicates at κ_\max=10. Without channel‑side priors, every weak‑prior variant misses Q^\star by 22‑33 pp: the per‑cluster sufficient statistics admit a one‑parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.

PaperID: 3068, https://arxiv.org/pdf/2605.12131.pdf

Authors: Charlie Masters, Ziyuan Liu, Stefano V. Albrecht

Title: Rollout Cards: A Reproducibility Standard for Agent Research

Abstract:
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task‑success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi‑agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re‑grading preserved benchmark outputs across short‑answer, code‑generation, and tool‑use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open‑source reinforcement learning gym, and publicly publish Ergon‑produced rollout‑card exports for benchmarks spanning tool use, software engineering, web interaction, multi‑agent coordination, safety, and search to support future research.

PaperID: 3069, https://arxiv.org/pdf/2605.12120.pdf

Authors: Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean

Title: To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

Abstract:
Language models deployed in high‑stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy ‑‑ an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost‑reduction directive from a hospital administrator complies at the expense of evidence‑based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards ‑‑ despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace ‑‑ e.g., that a drug has been withdrawn ‑‑ yet suppresses this in the user‑facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high‑stakes professional settings.

PaperID: 3070, https://arxiv.org/pdf/2605.12105.pdf

Authors: Damir Safin, Dian Balta

Title: Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts

Abstract:
Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error correction is less available, so reliable operation requires constraining agency accordingly; compliance requirements reinforce this by mandating human involvement as action consequences grow. Yet no established approach addresses them jointly, leaving practitioners without a principled basis for reasoning about oversight, action consequences, and error correction. This work introduces a two‑dimensional design space in which both dimensions are organised into five operational levels, making the coupling explicit and navigable. Autonomy ranges from human‑commanded operation (L1) to fully autonomous monitoring (L5); agency ranges from reasoning over supplied context (L1) to committed writes to authoritative records (L5). Building on this space, we propose six architectural tactics‑‑checkpoints, escalation, multi‑agent delegation, tool provisioning, tool fencing, and write staging‑‑for adjusting a deployment's position within it. The tactics are grounded in two worked examples from public‑sector contexts, illustrating how they apply under realistic compliance constraints. We further examine five deployment parameters‑‑model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation‑‑that shape what is achievable at any configuration independently of agency and autonomy. Together, the design space, tactics, and deployment parameters provide a shared vocabulary for principled, compliance‑aware agentic AI design in which responsibility, auditability, and reversibility are explicit design considerations rather than properties that must be retrofitted after deployment.

PaperID: 3071, https://arxiv.org/pdf/2605.11946.pdf

Authors: Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu

Title: Counterfactual Trace Auditing of LLM Agent Skills

Abstract:
Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with skill agent trace with a without skill counterpart on the same task, segments both traces into goal directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE‑Skills‑Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off task artifact creation, excess planning, and task recovery. Three findings emerge. First, high baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge‑case prompting is most common on mid‑range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.

PaperID: 3072, https://arxiv.org/pdf/2605.11891.pdf

Authors: Zhaojiacheng Zhou

Title: Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

Abstract:
Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third‑party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context‑setting documentation, its deployment risk cannot be measured by single‑shot audits or prompt‑level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as \emphadaptive leakage ‑‑ whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm ‑‑ and present \ours, a grey‑box self‑evolving red‑team framework for measuring it. Proteus searches a formalized five‑axis skill‑attack space. Each candidate is evaluated through a unified audit‑sandbox‑oracle pipeline that returns structured audit findings and runtime evidence to guide cross‑round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase‑1 cells, Proteus reaches 40‑‑90% Attack Success Rate at 5 rounds (ASR@5) with positive learning‑curve slopes on both evaluated auditors. Phase‑2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at \geq 93% in every cell and AI‑Infra‑Guard, the strongest public auditor we evaluate, still admitting up to 41.3% joint‑success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback‑driven attackers.

PaperID: 3073, https://arxiv.org/pdf/2605.11882.pdf

Authors: Bo Yin, Qi Li, Xinchao Wang

Title: On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Abstract:
Tool‑using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over‑refuse benign tasks despite producing a seemingly safe answer. Existing safety‑alignment signals are largely response‑level or off‑policy, and often incur a safety‑utility trade‑off: improving agent safety comes at the cost of degraded task performance. Such sparse and single‑objective rewards severely limit real‑world usability. To bridge this gap, we propose FATE, an on‑policy self‑evolving framework that transforms verifier‑scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re‑scored by verifiers and filtered across security, utility, over‑refusal control, and trajectory validity. This dense trajectory‑level information is then used as a supervision signal for agent self‑evolution. During this process, we further introduce Pareto‑Front Policy Optimization (PFPO), combining supervised warmup with Pareto‑aware policy optimization to preserve safety‑utility trade‑offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory‑safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self‑evolving agents.

PaperID: 3074, https://arxiv.org/pdf/2605.11880.pdf

Authors: Yue Deng, Zirui Wang, Yin Zhang

Title: Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

Abstract:
TD(λ) in value‑based MARL algorithms or the Temporal Difference critic learning in Actor‑Critic‑based (AC‑based) algorithms synergistically integrate elements from Monte‑Carlo simulation and Q function bootstrapping via dynamic programming, which effectively addresses the inherent bias‑variance trade‑off in value estimation. Based on that, some recent works link the adaptive λ value to the policy distribution in the single‑agent reinforcement learning area. However, because of the large joint action space from multiple number of agents, and the limited transition data in Multi‑agent Reinforcement Learning, the policy distribution is infeasible to be calculated statistically. To solve the policy distribution calculation problem in MARL settings, we employ a parametric likelihood‑free density ratio estimator with two replay buffers instead of calculating statistically. The two replay buffers of different sizes store the historical trajectories that represent the data distribution of the past and current policies correspondingly. Based on the estimator, we assign Adaptive TD(λ), ATD(λ), values to state‑action pairs based on their likelihood under the stationary distribution of the current policy. We apply the proposed method on two competitive baseline methods, QMIX for value‑based algorithms, and MAPPO for AC‑based algorithms, over SMAC benchmarks and Gfootball academy scenarios, and demonstrate consistently competitive or superior performance compared to other baseline approaches with static λ values.

PaperID: 3075, https://arxiv.org/pdf/2605.11874.pdf

Authors: Wenwen Zeng, Jinhui Zhang, Hao Chen, Zhaoyu Hu, Yongqi Liang, Jiajun Chai, Dengcan Liu, Zhenfeng Liu, Shurui Yan, Minglong Xue, Xiaohan Wang, Wei Lin, Guojun Yin

Title: RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

Abstract:
The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query‑item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single dimensional outcome‑based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities, such as instruction following and complex intent understanding. Despite the necessity for designing multi‑dimensional reward, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM‑Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query‑item relevance, and fine‑grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM‑Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi‑dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM‑Bench dataset is publicly available at https://huggingface.co/datasets/wwzeng/RecRM‑Bench.

PaperID: 3076, https://arxiv.org/pdf/2605.11853.pdf

Authors: Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang

Title: GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Abstract:
Reinforcement learning has become a widely used post‑training approach for LLM agents, where training commonly relies on outcome‑level rewards that provide only coarse supervision. While finer‑grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long‑horizon trajectory remains an open challenge. In this paper, we propose Granularity‑adaptivE Advantage Reweighting (GEAR), an adaptive‑granularity credit assignment framework that reshapes the trajectory‑level GRPO advantage using token‑ and segment‑level signals derived from self‑distillation. GEAR compares an on‑policy student with a ground‑truth‑conditioned teacher to obtain a reference‑guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token‑level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool‑use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self‑distillation‑only baselines, and token‑ or turn‑level credit‑assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long‑horizon settings.

PaperID: 3077, https://arxiv.org/pdf/2605.11813.pdf

Authors: Jinbiao Chen, Shuang Jin, Guoyun Zhang, Junyu Zhang, Guanyi Wang, Hanzhang Qin

Title: Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

Abstract:
Robust optimization (RO) provides a principled framework for decision‑making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have been shown promising for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi‑step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM‑based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO‑Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning‑free memory‑augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain‑specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in‑distribution datasets, out‑of‑distribution datasets, and diverse base LLMs.

PaperID: 3078, https://arxiv.org/pdf/2605.11807.pdf

Authors: Qiuyu Ding, Heng-Da Xu, Wei Zhang, Dongyi Lv, Changda Xia, Feng Xiong, Mu Xu

Title: Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

Abstract:
Generative point‑of‑interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real‑world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent‑based World knowledge Augmented REcommendation), which employs an LLM agent to generate location‑ and time‑aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user's behavioral context, grounding external world knowledge in personalized spatial‑temporal patterns. Extensive experiments on three real‑world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.

PaperID: 3079, https://arxiv.org/pdf/2605.11789.pdf

Authors: Alison Moldovan-Mauer, Benedikt Mangold

Title: Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

Abstract:
Unconstructive debate and uncivil communication carry well‑documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM) based Multi‑Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1‑on‑1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. The convergence latency of 25% reported in the previous study was confirmed. It was found that this latency is significantly bigger for models with fewer parameters. We further identify a significant first‑mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.

PaperID: 3080, https://arxiv.org/pdf/2605.11770.pdf

Authors: Yuhao Wu, Tung-Ling Li, Hongliang Liu

Title: Behavioral Integrity Verification for AI Agent Skills

Abstract:
Agent skills extend LLM agents with privileged third‑party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM‑assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root‑cause classification, and malicious‑skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description‑implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound‑threat categories surfaced. Root‑cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi‑stage attack chains. On a 906‑skill malicious‑skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state‑of‑the‑art rule‑based and single‑pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.

PaperID: 3081, https://arxiv.org/pdf/2605.11738.pdf

Authors: Zhong Li, Zihan Guo, Xiaohan Lu, Juntao Wang, Jie Song, Chao Shen, Jiageng Wu, Mingyang Sun

Title: OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

Abstract:
Large language models (LLMs) are increasingly used to translate natural‑language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as \emphoptimization‑modeling hallucination detection, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine‑grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi‑agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three‑part benchmark suite with 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM‑generated artifacts. Against a matched single‑agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top‑ranked localization on controlled single‑error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization‑modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy‑grounded auditing is a practical route to more reliable optimization modeling.

PaperID: 3082, https://arxiv.org/pdf/2605.11688.pdf

Authors: Mingu Kang, Sunwoo Lee, Yonghyeon Jo, Seungyul Han

Title: Shaping Zero-Shot Coordination via State Blocking

Abstract:
Zero‑shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real‑world multi‑agent systems and human‑AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State‑Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero‑shot coordination, including strong generalization to human partners.

PaperID: 3083, https://arxiv.org/pdf/2605.11645.pdf

Authors: Lake Yang, Junwei Su, Jingfeng Zeng, Wenhao Lu, Xingzhi Qian, Weitong Zhang, Chuan Wu, Dunhong Jin

Title: GeomHerd: A Forward-looking Herding Quantification via Ricci Flow Geometry on Agent Interactive Simulations

Abstract:
Herding ‑‑ where agents align their behaviors and act collectively ‑‑ is a central driver of market fragility and systemic risk. Existing approaches to quantify herding rely on price‑correlation statistics, which inherently lag because they only detect coordination after it has already moved realised returns. We propose GeomHerd, a forward‑looking geometric framework that bypasses this observability lag by quantifying coordination directly on upstream agent‑interaction graphs. To generate these graphs, we treat a heterogeneous LLM‑driven multi‑agent simulator ‑‑ each financial trader instantiated by a persona‑conditioned LLM call ‑‑ as a forecastable world, and evaluate the geometric pipeline on the Cividino‑‑Sornette continuous‑spin agent‑based substrate as our headline financial testbed. By tracking the discrete Ollivier‑‑Ricci curvature of these action graphs, GeomHerd captures the structural topology of emerging coordination. Theoretically, we establish a mean‑field bridge mapping our graph‑theoretic metric to CSAD, the classical macroscopic herding statistic, linking GeomHerd to downstream price‑dispersion measurement. Empirically, GeomHerd anticipates herding long before aggregate market baselines: on the continuous‑spin substrate, our primary detector fires a median of 272 steps before order‑parameter onset; a contagion detector (β_‑) recalls 65% of critical trajectories 318 steps early; and on co‑firing trajectories the agent‑graph signal precedes price‑correlation‑graph baselines by 40 steps. As a complementary indicator, the effective vocabulary of agent actions contracts during cascades. The geometric signature transfers out‑of‑domain to the Vicsek self‑driven‑particle model, and a curvature‑conditioned forecasting head reduces cascade‑window log‑return MAE over detector‑conditioned and price‑only baselines.

PaperID: 3084, https://arxiv.org/pdf/2605.11633.pdf

Authors: Junjue Wang, Weihao Xuan, Heli Qi, Pengyu Dai, Kunyi Liu, Hongruixuan Chen, Zhuo Zheng, Junshi Xia, Stefano Ermon, Naoto Yokoya

Title: Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Abstract:
Operational disaster response goes beyond damage assessment, requiring responders to integrate multi‑sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote‑sensing perception or evaluates generic tool use, leaving the end‑to‑end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end‑to‑end disaster response: 515 expert‑authored tasks across 45 real‑world disaster events spanning 10 types, paired with expert‑verified, replayable gold trajectories totaling 3,500 tool‑call steps. Tasks span five dimensions that cover the operational disaster‑response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi‑modal report synthesis. Agents compose calls from a 108‑tool MCP library over heterogeneous geospatial data: optical, SAR, and multi‑spectral imagery across single‑, bi‑, and multi‑temporal sequences (0.015‑10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster‑domain grounding exposes unique failure modes (damage‑semantic grounding, sensor‑modality mismatch, and disaster‑pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool‑order hints improve accuracy by only 1.08‑4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent‑to‑gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster‑response agents.

PaperID: 3085, https://arxiv.org/pdf/2605.11514.pdf

Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Natasha Jaques, Wei Zhou, Min-Yen Kan

Title: FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

Abstract:
Multi‑agent systems (MAS) powered by large language models (LLMs) increasingly adopt planner‑‑executor architectures, where planners convert prompts into subtasks, roles, dependencies, and routing paths. This flexibility enables adaptive coordination, but exposes an attack surface in workflow formation: prompts can shape agent organization without modifying MAS infrastructure. We study this risk through social influence probing workflows to identify high‑impact subtasks and malicious‑signal propagation. The analysis reveals two vulnerabilities: workflow position can amplify or suppress a malicious signal, and sycophantic framing makes downstream agents more likely to relay it. We translate these findings into FlowSteer, a prompt‑only workflow steering attack that converts vulnerability priors into one crafted prompt. FlowSteer aligns a malicious signal with influential task components and guides replanning toward dependencies that preserve propagation. Experiments show that FlowSteer increases malicious success by up to 55% over naive prompting, transfers across MAS setups, and remains effective with black‑box topology inference. As FlowSteer biases the planning signals that generate the workflow, MAS defenses that inspect only the generated workflow provide limited protection. As such, we introduce FlowGuard, an input‑side defense that reduces malicious success by up to 34% while preserving prompt utility. Our results position workflow formation as a new safety frontier for multi‑agent LLM systems, opening a planning‑time security perspective on how agent coordination itself can be attacked and defended.

PaperID: 3086, https://arxiv.org/pdf/2605.11504.pdf

Authors: Dongjun Lee, Ga-eun Bae, Insu Yun

Title: CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Abstract:
Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi‑step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per‑agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM‑based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.

PaperID: 3087, https://arxiv.org/pdf/2605.11503.pdf

Authors: Takahiro Suzuki, Yuma Tamura, Keisuke Okumura

Title: Distance-Constrained Unlabeled Multi-Agent Pathfinding

Abstract:
We study a graph pathfinding problem Distance‑r Independent Unlabeled Multi‑Agent Pathfinding, finding a set of collision‑free paths between two sets where agents must stay at pairwise distance at least r+1 at all times. This additional constraint, generalizing collision modeling for classical MAPF, targets aspects of real‑world multi‑agent coordination. This additional distance constraint makes feasibility (i.e., whether a solution exists) PSPACE‑complete, in contrast to standard (unlabeled) MAPF, where it can be decided in polynomial time. We address the challenge via two complementary approaches: (i) reduction‑based optimal algorithms with a feasibility‑preserving compression procedure, and (ii) a configuration generator‑based search. Despite the hardness, empirical results show that our algorithm can handle hundreds of agents in a practical timeframe.

PaperID: 3088, https://arxiv.org/pdf/2605.11485.pdf

Authors: Lasse Peters, Laura Ferranti, Andrea Bajcsy, Javier Alonso-Mora

Title: Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

Abstract:
Imitation learning powered by generative models has proven effective for modeling complex single‑agent behaviors. However, teaching multi‑agent systems, like multiple arms or vehicles, to coordinate through imitation learning is hindered by a fundamental data bottleneck: as the joint state‑action space grows exponentially with the number of agents, collecting a sufficient amount of coordinated multi‑agent demonstrations becomes extremely costly. In this work, we ask: how can we leverage single‑agent demonstration data to learn multi‑agent policies? We present Coordinated Diffusion (CoDi), a framework that couples independently trained single‑agent diffusion policies through a user‑defined multi‑agent cost function, without requiring any coordinated demonstrations. We derive a new diffusion‑based sampling scheme wherein the diffusion score function decomposes into independent, single‑agent pre‑trained base policies plus a cost‑driven guidance term that coordinates these base policies into cohesive multi‑agent behavior. We show that this guidance term can be estimated in a gradient‑free manner, making CoDi applicable to black‑box, non‑differentiable cost functions without additional training. Theoretically and empirically, we analyze the conditions under which this composition can faithfully approximate a target multi‑agent behavior. We find a complementary role for demonstration data versus the cost function: single‑agent demonstrations must cover the support of the desired multi‑agent behavior, while the cost function must promote desired behavior from this product of single‑agent policies. Our results in simulation and hardware experiments of a two‑arm manipulation task show that CoDi discovers robust coordinated behavior from single‑agent data, is more data‑efficient than multi‑agent baselines, and highlights the importance of joint guidance, base policy support, and cost design.

PaperID: 3089, https://arxiv.org/pdf/2605.11484.pdf

Authors: Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

Title: Engagement Process: Rethinking the Temporal Interface of Action and Observation

Abstract:
Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation‑‑action steps. To model such interactions, we propose \emphEngagement Process (EP), an interaction formalism that inherits the decision‑theoretic structure of POMDPs while making time explicit in the action‑‑observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single‑agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent‑side organization, multi‑rate coordination, and compositional interaction among subsystems. Across toy, LLM‑agent, and learning experiments, EP exposes temporal behaviors hidden by step‑based interfaces and enables policies to adapt under explicit time costs.

PaperID: 3090, https://arxiv.org/pdf/2605.11469.pdf

Authors: Riad Ahmed

Title: Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

Abstract:
Decentralized multi‑agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on‑policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv‑PPO, trains the shared policy against worst‑case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv‑PPO+MACER, fine‑tunes that checkpoint with a small on‑policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv‑PPO recovers worst‑case success to 59.2% at one percentage point of clean cost. Adv‑PPO+MACER recovers it to 77.5% +/‑ 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per‑attack curves, a certified action‑stability sanity check (which measures the smoothed‑policy wrapper, not the deployed argmax policy), and side‑by‑side rollout storyboards that show the failure mode and the fix inside one environment instance.

PaperID: 3091, https://arxiv.org/pdf/2605.11453.pdf

Authors: Ethan Parks, Dalal Alharthi

Title: Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

Abstract:
Practitioners deploying multi‑agent large language model (LLM) systems must currently choose between communication topologies such as chain, star, mesh, and richer variants without any pre‑inference diagnostic for which topology will amplify drift, converge to consensus, or remain robust under perturbation. Existing evaluation answers these questions only post hoc and only for the task measured. We introduce a structural diagnostic for multi‑agent LLM communication graphs based on the successor representation M = (I ‑ γP)^‑1 of the row‑stochastic communication operator, and we connect three of its spectral quantities, the spectral radius ρ(M), the spectral gap Δ(M), and the condition number κ(M), to three distinct failure modes. We derive closed‑form spectra for the chain, star, and mesh under row‑stochastic normalization, and validate the predictions on a 12‑step structured state‑tracking task with Qwen2.5‑7B‑Instruct over 100 independent trials. The condition number is a perfect rank‑order predictor of empirical perturbation robustness (r_s = 1.0); the spectral gap partially predicts consensus dynamics (r_s = 0.5); and the spectral radius is perfectly \emphinverted with respect to cumulative error (r_s = ‑1.0). We trace this inversion to a regime in which linear spectra are blind to non‑contracting bias drift, and we propose an affine‑noise extension of the predictive map that recovers the empirical ordering. We read this as a first step toward representational, drift‑aware structural diagnostics for multi‑agent LLM systems, sitting alongside classical spectral and consensus theory.

PaperID: 3092, https://arxiv.org/pdf/2605.11442.pdf

Authors: Zi Liang, Ronghua Li, Yanyun Wang, Qingqing Ye, Haibo Hu

Title: Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

Abstract:
Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user‑agent‑service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as gent‑based and Oriented DDoS (AbO‑DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real‑world impact, we conduct extensive experiments across three representative claw‑style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open‑weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single‑node call amplification up to 51.0x and multi‑node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent's component graph.

PaperID: 3093, https://arxiv.org/pdf/2605.11418.pdf

Authors: Shoumik Saha, Kazem Faghih, Soheil Feizi

Title: Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

Abstract:
Autonomous AI agents increasingly extend their capabilities through Agent Skills: modular filesystem packages whose SKILL.md files describe when and how agents should use them. While this design enables scalable, on‑demand capability expansion, it also introduces a semantic supply‑chain risk in which natural‑language metadata and instructions can affect which skills are admitted, surfaced, selected, and loaded. We study SKILL.md ‑ only attacks across three registry‑facing stages of the Agent Skill lifecycle, using real ClawHub skills and realistic registry mechanisms. In Discovery, short textual triggers can manipulate embedding‑based retrieval and improve adversarial skill visibility, achieving up to 86% pairwise win rate and 80% Top‑10 placement. In Selection, description‑only framing biases agents toward functionally equivalent adversarial variants, which are selected in 77.6% of paired trials on average. In Governance, semantic evasion strategies cause malicious skills to avoid a blocking verdict in 36.5%‑100% of cases. Overall, our results show that SKILL.md is not passive documentation but operational text that shapes which third‑party capabilities agents find, trust, and use.

PaperID: 3094, https://arxiv.org/pdf/2605.11416.pdf

Authors: Yu-Hang Wu, Qin-Yuan Liu, Qiu-Yang Zhao, Bo Jiang, Jiang-Feng Yang, Qing-Wei Cong

Title: Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

Abstract:
Selective layer‑wise updates are essential for low‑cost continued pre‑training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black‑box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture‑agnostic diagnostic framework that reveals the evolution patterns of layer‑wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre‑training trials to compare diverse freeze‑train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full‑parameter fine‑tuning and the opposite allocation on both C‑Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high‑quality pre‑trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low‑cost and interpretable solution for resource‑constrained teams, offering actionable guidance for layer‑wise parameter allocation in continued pre‑training and hybrid model construction.

PaperID: 3095, https://arxiv.org/pdf/2605.11404.pdf

Authors: Ling Tang, Jilin Mei, Qian Chen, Qihan Ren, Linfeng Zhang, Quanshi Zhang, Jing Shao, Xia Hu, Dongrui Liu

Title: Attributing Emergence in Million-Agent Systems

Abstract:
Large language models (LLMs) can simulate human‑like reasoning and decision‑making in individual agents. LLM‑powered multi‑agent systems (MAS) combine such agents to simulate population‑scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in N and have been confined to N \lesssim 10^3, while the phenomena they explain occur at N \geq 10^6. We address this gap by adapting Aumann‑‑Shapley path‑integral attribution to LLM‑powered MAS at million‑agent scale; the resulting method satisfies all four axioms, runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data (1,671,587 active users), we compute the attribution at both full scale and the visibility‑biased N = 10^2 convenience sample used by small‑scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high‑follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post‑hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small‑scale and full‑scale attribution. Full‑scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.

PaperID: 3096, https://arxiv.org/pdf/2605.11388.pdf

Authors: Dean Light, Michael Theologitis, Kshitish Ghate, Shuyue Stella Li, Benjamin Newman, Chirag Shah, Aylin Caliskan, Pang Wei Koh, Dan Suciu, Yulia Tsvetkov

Title: Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

Abstract:
Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well‑specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard‑code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning ‑‑ an inference‑time approach for constructing task‑specific scaffolds through structured meta‑reasoning. Deep Reasoning uses a formal language that represents meta‑reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in‑context examples that guide test‑time scaffold construction. We instantiate this approach in a general‑purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state‑of‑the‑art scaffolding methods across four hard benchmarks: multi‑hop reasoning, long‑chain question answering, long‑context aggregation, and deep research‑style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower‑load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just‑in‑time.

PaperID: 3097, https://arxiv.org/pdf/2605.11385.pdf

Authors: Qingze Liu, Alen Mrdovic, Danrui Li, Mathew Schwartz, Sejong Yoon, Mubbasir Kapadia

Title: JACoP: Joint Alignment for Compliant Multi-Agent Prediction

Abstract:
Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state‑of‑the‑art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real‑world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi‑Agent Prediction, an innovative multi‑stage framework that ensures scene‑level plausibility. JACoP incorporates an Anchor‑Based Agent‑Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter‑agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.

PaperID: 3098, https://arxiv.org/pdf/2605.11378.pdf

Authors: Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong

Title: An Empirical Study of Automating Agent Evaluation

Abstract:
Agent evaluation requires assessing complex multi‑step behaviors involving tool use and intermediate reasoning, making it costly and expertise‑intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain‑specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over‑engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end‑to‑end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace‑based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta‑evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.

PaperID: 3099, https://arxiv.org/pdf/2605.11376.pdf

Authors: Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

Title: LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

Abstract:
We propose a personal‑LLM exchange (LLM‑X), a scalable negotiation‑oriented environment that enables direct, structured communication across populations of personal agents (LLMs), each representing an individual user. Unlike existing tool‑centric protocols that focus on agent‑API interaction, LLM‑X introduces a message bus and routing substrate for LLM‑to‑LLM coordination with guarantees around schema validity and policy enforcement. We contribute: (1) an architecture for LLM‑X comprising federated gateways, topic‑based routing, and policy enforcement; (2) a typed message protocol supporting capability negotiation and contract‑net‑style coordination; and (3) the first empirical evaluation of LLM‑based multi‑agent negotiation at scale. Experiments span 5, 9, and 12 agents, under distinct negotiation policies (Low, Medium, High), and across both short‑run (minutes) and long‑run (2h, 12h) load conditions. Results highlight clear policy‑performance trade‑offs: stricter policies improve robustness and fairness but increase latencies and message volume. Extended runs confirm that LLM‑X remains stable under sustained load, with bounded latency drift.

PaperID: 3100, https://arxiv.org/pdf/2605.11294.pdf

Authors: Nanda Kishore Sreenivas, Kate Larson

Title: Information and Contract Design for Repeated Interactions between Agents with Misaligned Incentives

Abstract:
We study the consequences of information asymmetries and misaligned incentives in settings with multiple independent agents. We model an interaction between a Sender, who holds vital private information but cannot act, and a Receiver, who must make decisions but is dependent on the Sender's information. We find that the Sender learns an optimal communication strategy that the Receiver reliably acts on. Importantly, this strategy is highly sensitive to the degree of conflict in the agents' rewards and the amount of environmental information the Receiver can already observe. We introduce a mechanism allowing the agents to form linear contracts, where a price is established for the information. We demonstrate that the Sender learns to use these payment structures to improve its rewards, though this comes at a cost of "fairness" between agents as the Sender is able to extract much of the Receiver's surplus. This raises questions about fairness, contract design, and learning in the context of multi‑agent systems.

PaperID: 3101, https://arxiv.org/pdf/2605.11269.pdf

Authors: Tousif Islam, Digvijay Wadekar, Zihan Zhou

Title: gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

Abstract:
Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate‑level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state‑of‑the‑art models achieving \lesssim 10^‑4 relative error. We study whether state‑of‑the‑art LLM coding agents can perform such end‑to‑end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over 10^8 core‑hours of compute. The tasks span interpolation, regression, and high‑dimensional time‑series modeling, requiring a combination of numerical methods, machine learning, and physics‑informed approaches. In preliminary experiments, agents frequently relied on proxy metrics, partial evaluation, or fabricated results to spuriously complete tasks. We therefore implement an external pre‑defined framework to gauge agent progress. Evaluating twelve coding agents, we find no consistent winner. On the easiest task, multiple agents converge to the same cubic spline solution, with one rediscovering a coordinate transformation widely used in the literature. On harder tasks like analytic waveform modeling, all agents fall 1‑2 orders of magnitude short of domain requirements and exhibit systematic failures, including metric misuse, constraint violations, and result fabrication. Our code, data, and website are publicly available.

PaperID: 3102, https://arxiv.org/pdf/2605.11234.pdf

Authors: Grama Chethan

Title: The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

Abstract:
Large language model (LLM)‑based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics ‑‑ the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi‑agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three‑operation interface contract ‑‑ resolve, contextualize, annotate ‑‑ with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3‑32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology‑grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain‑specific ontology configurations eliminates tool‑call hallucination and achieves cross‑domain configurability without application code changes.

PaperID: 3103, https://arxiv.org/pdf/2605.11229.pdf

Authors: Neil Fendley, Zhengyu Liu, Aonan Guan, Jiacheng Zhong, Yinzhi Cao

Title: Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

Abstract:
Automation platforms such as GitHub Actions and n8n are increasingly adopting so‑called agentic workflows, which integrate Large Language Model (LLM) agents for tasks such as code review and data synchronization. While bringing convenience for developers, this integration exposes a new risk: An adversary may control and craft certain inputs, such as GitHub issue comments, to manipulate the LLM agent for unwanted actions, such as credential exfiltration and arbitrary command execution. To our knowledge, no prior academic work has studied such a risk in agentic workflows. In this paper, we design the first detection and exploitation framework, called JAW, to hijack agentic workflows hosted on automation platforms via a novel approach called Context‑Grounded Evolution. Our key idea is to evolve agentic workflow inputs under the contexts derived from hybrid program analysis for hijacking purposes. Specifically, JAW generates agentic workflow contexts through three analyses: (i) static path‑feasibility analysis to identify feasible agent‑invocation paths and the input constraints required to trigger them, (ii) dynamic prompt‑provenance analysis to determine how that input is transformed and embedded into the LLM context, and (iii) capability analysis to identify the actions and restrictions available to the agent at runtime. Our evaluation of JAW on GitHub workflows and n8n templates showed that 4714 GitHub workflows and eight n8n templates can be successfully hijacked, for example, to leak user credentials. Our findings span 15 widely‑used GitHub Actions, including official GitHub Actions for Claude Code, Gemini CLI, Qwen CLI, and Cursor CLI, and two official n8n nodes. We responsibly disclosed all findings to the affected vendors and received many acknowledgements, fixes, and bug bounties, notably from GitHub, Google, and Anthropic.

PaperID: 3104, https://arxiv.org/pdf/2605.11225.pdf

Authors: Tuo Zhang, Alin-Ionut Popa, Yan Xu, Rui Song, Dimitrios Dimitriadis

Title: PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

Abstract:
Large language model (LLM)‑based agents frequently generate seemingly coherent plans that fail upon execution due to infeasible actions, constraint violations, and compounding errors over extended horizons. PIVOT (Plan‑Inspect‑eVOlve Trajectories) addresses this plan‑execution misalignment through a self‑supervised framework that treats trajectories as optimizable objects iteratively refined via environment interaction. The framework comprises four stages: PLAN generates candidate trajectories; INSPECT executes them and computes structured losses with textual gradients encoding plan‑execution discrepancies; EVOLVE applies these signals to produce improved trajectories; and VERIFY performs a final global check against task constraints. A monotonic acceptance process ensures a non‑decreasing solution quality. Empirical evaluations on DeepPlanning and GAIA demonstrate state‑of‑the‑art performance: with human‑in‑the‑loop (HITL) feedback, PIVOT establishes a strong upper bound up to 94% relative improvement in constraint satisfaction, while its fully autonomous variant retains substantial gains, showing that the core trajectory‑refinement mechanism remains effective without external supervision. At the same time, PIVOT remains computationally efficient, requiring up to 3x to 5x fewer tokens than competing refinement methods. These findings establish that (self‑ or human‑supervised) feedback‑based trajectory optimization is a principled methodology for mitigating plan‑execution gaps in autonomous agent systems.

PaperID: 3105, https://arxiv.org/pdf/2605.11204.pdf

Authors: Nivar Anwer, Hans Riess, Matthew Hale

Title: Multi-Agent System Identification with Nonlinear Sheaf Diffusion

Abstract:
Local interaction laws governing multi‑agent systems can be difficult to recover from trajectory data, even when the dynamics are observed faithfully. In systems governed by a nonlinear sheaf Laplacian ‑‑ a generalization of the graph Laplacian accommodating heterogeneous state spaces and asymmetric communication channels ‑‑ the coordination law is encoded by edge potential functions whose gradients produce the inter‑agent forces. Because trajectory observations record node‑state evolution, they expose only the aggregate effect of the edge forces at each node: distinct interaction laws that agree at the node level are indistinguishable from trajectory data alone. We show that the fundamental obstruction to recovery is topological, measured by sheaf cohomology, and that unique recovery from an unconstrained function class is possible if and only if this cohomology vanishes. When the obstruction is nontrivial, we show that recovery within a finite‑dimensional parameterized class is possible precisely when a data‑dependent information matrix is positive definite. Experiments validate the theory and illustrate that accurate trajectory reproduction need not certify recovery of the underlying interaction law.

PaperID: 3106, https://arxiv.org/pdf/2605.11169.pdf

Authors: Sheldon Yu, Junda Wu, Xintong Li, Nikki Lijing Kuang, Sizhe Zhou, Tong Yu, Jiawei Han, Jingbo Shang, Julian McAuley

Title: OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

Abstract:
Large language model agents interleave reasoning, action selection, and observation to solve sequential decision‑making tasks. In deployed settings where agents repeatedly handle related multi‑step tasks, small action‑selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment‑time improvement, existing inference‑time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct‑style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action‑level feedback. As a result, they provide limited support for trackable, fine‑grained, and uncertainty‑aware adaptation during deployment. We propose OLIVIA, an inference‑time action adaptation framework for ReAct‑style agents. OLIVIA models the LLM's final action‑selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action‑selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action‑level feedback. With upper‑confidence‑bound exploration, OLIVIA improves the policy sample‑efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt‑based inference‑time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt‑ or retrieval‑based adaptation for LLM agents during deployment.

PaperID: 3107, https://arxiv.org/pdf/2605.11135.pdf

Authors: Hayden Helm, Carey Priebe, Brandon Duderstadt

Title: Control Charts for Multi-agent Systems

Abstract:
Generative agents have proven to be powerful assistants in a wide variety of contexts. Given this success, users are now deploying agents with minimal restrictions in open ended, multi‑agent environments. Current methods for monitoring the dynamics of open‑ended multi‑agent systems are limited to qualitative inspection. In this paper, we extend the process‑theoretic notion of adaptive control charts to multi‑agent systems to enable automated monitoring. Using simulation, we demonstrate that adaptive control charts are necessary for monitoring multi‑agent systems that can learn from their environment. We further demonstrate, both empirically and theoretically, that adaptive control charts are susceptible to adversarial agents that defect sufficiently slowly. These results illustrate a fundamental tradeoff in multi‑agent system control: either agents in a system cannot learn or the system is susceptible to adversaries.

PaperID: 3108, https://arxiv.org/pdf/2605.11053.pdf

Authors: Sultan Zavrak

Title: Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

Abstract:
The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool‑call traffic remains underexplored. In this article, the proposed detector is presented as an attack detection framework for MCP tool‑call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data‑flow links as edges), enriches nodes with sentence‑embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no‑graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS‑Eval (task‑stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined‑source variant (both label‑stratified). Three findings emerge. First, content‑level features are essential: metadata‑only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random‑split evaluation inflates AUROC by up to 26 percentage points relative to task‑disjoint splits, a memorization confound that prior agent‑detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS‑Eval setting including GNNs (0.917) and the MLP (0.896), and self‑supervised pre‑training does not deliver a label‑efficiency advantage on this task.

PaperID: 3109, https://arxiv.org/pdf/2605.11039.pdf

Authors: Linfeng Fan, Ziwei Li, Yuan Tian, Yichen Wang, Rongsheng Li, Xiong Wang

Title: The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

Abstract:
Tool‑using LLM agents must act on untrusted webpages, emails, files, and API outputs while issuing privileged tool calls. Existing defenses often mediate trust at the granularity of an entire tool invocation, forcing a brittle choice in mixed‑trust workflows: allow external content to influence a call and risk hijacked destinations or commands, or quarantine the call and block benign retrieval‑then‑act behavior. The key observation behind this paper is that indirect prompt injection becomes dangerous not when untrusted content appears in context, but when it determines an authority‑bearing argument. We present \textscPACT (\emphProvenance‑Aware Capability Contracts), a runtime monitor that assigns semantic roles to tool arguments, tracks value provenance across replanning steps, and checks whether each argument's origin satisfies its role‑specific trust contract. Under oracle provenance, \textscPACT achieves 100% utility and 100% security on mixed‑trust diagnostic suites, while flat invocation‑level monitors incur false positives or false negatives. In full AgentDojo deployments across five models, \textscPACT reaches 100% security on the three strongest models while recovering 38.1‑‑46.4% utility, 8‑‑16 percentage points above CaMeL at the same security level. Ablations show that both semantic roles and cross‑step provenance are necessary. \textscPACT reframes agent security as authority binding, and isolates the remaining deployment bottleneck to provenance inference and contract synthesis.

PaperID: 3110, https://arxiv.org/pdf/2605.11036.pdf

Authors: Hyeseon An, Shinwoo Park, Dongsu Kim, Yo-Sub Han

Title: Sequential Behavioral Watermarking for LLM Agents

Abstract:
LLM‑based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action‑level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history‑conditioned transition patterns and verifies trajectories position‑agnostically against random‑key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round‑indexed behavioral watermarks collapse.

PaperID: 3111, https://arxiv.org/pdf/2605.11003.pdf

Authors: Baoyuan Wu, Qingshan Liu, Adel Bibi, Irwin King, Siwei Lyu

Title: The Authorization-Execution Gap Is a Major Safety and Security Problem in Open-World Agents

Abstract:
This position paper argues that the Authorization‑Execution Gap (AEG) is a major safety and security problem in open‑world agents. The AEG is the divergence between what a principal intends to authorize and what an open‑world agent ultimately executes. Because such agents act autonomously across tools, persistent state, and multi‑agent handoffs, even small instances of authorization divergence can cause harm that is difficult or impossible to undo. We argue that many observed agent failures can be traced to three structural sources of AEG: delegation‑level incompleteness, channel‑level corruption, and composition‑level fragmentation. The same observed failure may arise from any of these sources. Without identifying the source, a defense targeting the symptom alone cannot address the underlying cause. Agent safety and security should therefore emphasize source‑oriented diagnosis and defense. Because the structural sources of AEG arise dynamically during execution, this approach necessarily requires authorization integrity checks applied during execution, rather than relying solely on one‑shot upfront filtering or post‑hoc audit. For NeurIPS, the implication is that papers on open‑world agents should report not only outcome‑level metrics such as task success or attack resistance, but also process‑level evidence showing where AEG was detected, constrained, and attributed to a structural source during execution.

PaperID: 3112, https://arxiv.org/pdf/2605.10999.pdf

Authors: Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, Stefan Feuerriegel

Title: SkillGen: Verified Inference-Time Agent Skill Synthesis

Abstract:
Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high‑quality skills are still largely written by hand. We introduce SkillGen, a multi‑agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human‑readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held‑out performance, outperforms existing skill‑generation baselines, and produces skills that transfer across models.

PaperID: 3113, https://arxiv.org/pdf/2605.10990.pdf

Authors: Linfeng Fan, Yuan Tian, Ziwei Li, Zhiwu Lu

Title: Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

Abstract:
LLM agents increasingly rely on reusable skill libraries, but these skills silently decay as the external services, packages, APIs, and configurations they reference evolve. Existing monitors detect such changes at the wrong granularity: they observe values, not the role those values play in a skill. A version string in a comment is noise; the same string in a pinned dependency is an operational obligation. We formulate skill drift as contract violation and introduce \sgname, which extracts executable environment contracts from skill documents and validates only those role‑bearing assumptions against known or live conditions. This distinction turns noisy monitoring into a precision‑first maintenance signal. Contract‑free CI probes produce 40% false positives, while \sgname raises zero false alarms over 599 no‑drift and hard‑negative cases (Wilson 95% CI [0,0.6]%). In known‑drift verification, \sgname achieves 100% precision and 76% recall with the strongest backbone; in a pre‑registered study over 49 real skills, it discovers live drift with 86% conservative precision. Violated contracts also make repair actionable, improving one‑round success from 10% without localization to 78%. We release \dbname, an 880‑pair benchmark for skill degradation.

PaperID: 3114, https://arxiv.org/pdf/2605.10917.pdf

Authors: Usman A. Khan, Joseph W. Durham

Title: Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges

Abstract:
We consider anonymous multi‑agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi‑marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min‑cost, integral (\0,1\) transports that do not overlap in both space and time. To adapt the approach to large‑scale problems, we cast the MAPF‑MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn‑type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near‑optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.

PaperID: 3115, https://arxiv.org/pdf/2605.10913.pdf

Authors: Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

Title: Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Abstract:
As LLM agent systems take on more complex tasks, they increasingly rely on meta‑agents: higher‑order agents that operate on other agents, much as managers supervise employees. Whatever a meta‑agent does: coordinating agents, halting risky actions before execution, or repairing failed runs, requires manipulation of agentic execution at runtime. Existing agentic substrates make this hard: they give meta‑agents only plain transcripts and environment snapshots, requiring it to build it's own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first‑class object that a meta‑agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git‑like execution trace, where any past state can be forked 5x faster than docker commit and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench‑2 with 58% lower wall‑clock; (3) a meta‑agent picks fork points during rollouts to improve credit assignment in long‑horizon agentic RL, doubling GRPO's gains on TerminalBench‑2. We open‑source Shepherd to empower future meta‑agents with principled and efficient operations over agentic execution.

PaperID: 3116, https://arxiv.org/pdf/2605.10909.pdf

Authors: Alex DeWeese, Guannan Qu

Title: Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

Abstract:
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one‑step Q‑function. In this work, we propose a generalized k‑step policy gradient method that couples the randomness within a k‑step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to k. Further, we show projected gradient descent and mirror descent with this k‑step policy gradient can achieve this exponential guarantee in O(\frac1T) iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi‑agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors ||d_μ^π^ / d_μ^π||_\infty and ||d_μ^π^ / μ||_\infty enabling the k‑step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

PaperID: 3117, https://arxiv.org/pdf/2605.10906.pdf

Authors: Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

Title: DataMaster: Data-Centric Autonomous AI Research

Abstract:
As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task‑conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open‑ended search space, branch‑dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data‑agent framework that integrates tree‑structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data‑engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE‑Bench Lite and PostTrainBench. On MLE‑Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

PaperID: 3118, https://arxiv.org/pdf/2605.10904.pdf

Authors: Marco Coscoy, Zewei Zhou, Seth Z. Zhao, Henry Wei, Angela Magtoto, Johnson Liu, Rui Song, Walter Zimmer, Zhiyu Huang, Chen Tang, Bolei Zhou, Jiaqi Ma

Title: MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

Abstract:
Vehicle‑to‑Everything (V2X) communication has emerged as a promising paradigm for autonomous driving, enabling connected agents to share complementary perception information and negotiate with each other to benefit the final planning. Existing V2X benchmarks, however, fall short in two ways: (i) open‑loop evaluations fail to capture the inherently closed‑loop nature of driving, leading to evaluation gaps, and (ii) current closed‑loop evaluations lack behavioral and interactive diversity to reflect real‑world driving. Thus, it is still unclear the extent of benefits of multi‑agent systems for closed‑loop driving. In this paper, we introduce MDrive, a closed‑loop cooperative driving benchmark comprising 225 scenarios grounded in both NHTSA pre‑crash typologies and real‑world V2X datasets. Our benchmark results demonstrate that multi‑agent systems are generally better than single‑agent counterparts. However, current multi‑agent systems still face two important challenges: (i) perception sharing enhances perceptions, but doesn't always translate to better planning; (ii) negotiation improves planning performance but harms it in complex and dense traffic scenarios. MDrive further provides an open‑source toolbox for scenario generation, Real2Sim conversion, and human‑in‑the‑loop simulation. Together, MDrive establishes a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.

PaperID: 3119, https://arxiv.org/pdf/2605.10813.pdf

Authors: Jinhang Xu, Qiyuan Zhu, Yujun Wu, Zirui Wang, Dongxu Zhang, Marcia Tian, Yiling Duan, Siyuan Li, Jingxuan Wei, Sirui Han, Yike Guo, Odin Zhang, Conghui He, Cheng Tan

Title: NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Abstract:
LLM‑powered multi‑agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under‑serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user‑specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi‑agent framework that addresses these gaps through tri‑level co‑evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user‑ and project‑specific experience that grounds planning decisions in each user's research history. A label‑free policy learning converts free‑form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co‑evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state‑of‑the‑art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.

PaperID: 3120, https://arxiv.org/pdf/2605.10807.pdf

Authors: Johann Knechtel, Ozgur Sinanoglu, Ramesh Karri

Title: LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

Abstract:
The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high‑level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in‑depth analysis of the state‑of‑the‑art in LLM‑driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs ‑‑ from reasoning‑driven synthesis and multi‑agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red‑teaming for robust security assessment. Finally, we synthesize cross‑cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.

PaperID: 3121, https://arxiv.org/pdf/2605.10787.pdf

Authors: Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

Title: ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Abstract:
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real‑world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed‑driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full‑context and RAG paradigms, revealing a stark performance gap: even top‑tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over‑confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.

PaperID: 3122, https://arxiv.org/pdf/2605.10779.pdf

Authors: Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang, Yiran Zhao, Ruyi Chen, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

Title: LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Abstract:
The rapid proliferation of LLM‑based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS‑level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical‑layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM‑agents In‑OS Testing for Measuring Unsafe Subversion), a benchmark addressing both gaps via a semantic‑physical dual verification mechanism and OS‑level state rollback. LITMUS comprises 819 high‑risk test cases organized into one harmful seed subset and six attack‑extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping), plus a fully automated multi‑agent evaluation framework judging behavior at both conversational and OS‑level physical layers. Evaluation across frontier agents reveals three findings: (1) current agents lack effective safety awareness, with strong models (e.g., Claude Sonnet 4.6) still executing 40.64% of high‑risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, invisible to every prior semantic‑only framework; and (3) skill injection and entity wrapping attacks achieve high success rates, exposing pronounced agent vulnerabilities. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.

PaperID: 3123, https://arxiv.org/pdf/2605.10763.pdf

Authors: Tim Van hamme, Thomas Vissers, Javier Carnerero-Cano, Mario Fritz, Emil C. Lupu, Lieven Desmet, Dinil Mon Divakaran

Title: MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study

Abstract:
LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. We present MATRA, a pragmatic threat modeling framework for agentic AI systems that adapts established risk assessment methodology to systematically assess how known LLM threats translate into deployment‑specific risks. MATRA begins with an asset‑based impact assessment and utilizes attack trees to determine the likelihood of these impacts occurring within the system architecture. We demonstrate MATRA on a personal AI agent deployment using OpenClaw, quantifying how architectural controls such as network sandboxing and least‑privilege access reduce risk by limiting the blast radius of successful injections.

PaperID: 3124, https://arxiv.org/pdf/2605.10760.pdf

Authors: Zhihao Cao, Qi Shao, Shuhao Zhai, Jing Zhang, Anh Nguyen, Baoru Huang

Title: MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction

Abstract:
Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large‑scale scene capture for virtual production and cooperative multi‑robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can generate high‑fidelity real‑time mapping, most of the existing multi‑agent Gaussian SLAM methods still rely on RGB‑D sensors to obtain metric depth and simplify cross‑agent alignment, which limits the deployment on lightweight, low‑cost, or power‑constrained robotic platforms. To address this challenge, we propose MAGS‑SLAM, the first RGB‑only multi‑agent 3DGS SLAM framework for collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations or dense maps. To facilitate robust collaboration in the presence of monocular scale ambiguity, our framework integrates compact submap communication, geometry‑ and appearance‑aware loop verification, and occupancy‑aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. We further introduce ReplicaMultiagent Plus benchmark for evaluating collaborative Gaussian SLAM. Intensive experiments on synthetic and real‑world datasets show that MAGS‑SLAM achieves competitive tracking accuracy and comparable or superior rendering quality to state‑of‑the‑art RGB‑D collaborative Gaussian SLAM methods while relying only RGB images.

PaperID: 3125, https://arxiv.org/pdf/2605.10738.pdf

Authors: Max Studt, Georg Schildbach

Title: Decentralized Contingency MPC based on Safe Sets for Nonlinear Multi-agent Collision Avoidance

Abstract:
Decentralized collision avoidance remains challenging, particularly when agents do not communicate any information related to planned trajectories. Most existing approaches either rely on conservative coordination mechanisms or provide limited guarantees on recursive feasibility and convergence. This paper develops a decentralized contingency MPC framework for multi‑agent systems with nonlinear dynamics that achieves collision‑free motion under a state‑only information pattern. Each agent follows the same consensual rule set, enabling safe decentralized planning without communication. Each agent solves a local optimization problem that couples a nominal trajectory with a contingency certificate ensuring a feasible backup maneuver under receding‑horizon operation. A novel geometric and decentralized safe‑set update mechanism prevents feasibility loss between consecutive time steps. The resulting scheme guarantees recursive feasibility, including collision avoidance, and establishes a Lyapunov‑type convergence result to an admissible safe equilibrium. Simulation results demonstrate performance in both sparse and dense multi‑agent environments, including cluttered bottleneck scenarios and under plug‑and‑play operation.

PaperID: 3126, https://arxiv.org/pdf/2605.10717.pdf

Authors: Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo

Title: Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling

Abstract:
Multi‑agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real‑world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state‑wise measure of heteroscedastic uncertainty. Moreover, popular multi‑modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state‑wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log‑likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first‑order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative‑only diffusion model. For post‑processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state‑of‑the‑art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball‑U, Football‑U, Soccer‑U), underscoring the effectiveness of our uncertainty and error probability estimation.

PaperID: 3127, https://arxiv.org/pdf/2605.10698.pdf

Authors: Dahlia Shehata, Ming Li

Title: The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions

Abstract:
Multi‑agent systems (MAS) assume that collaborating inherently improves Large Language Model (LLM) reasoning. We challenge this by demonstrating that simulated social pressure triggers an algorithmic ``Bystander Effect,'' inducing severe cognitive loafing. By evaluating 22,500 deterministic trajectories across 3 dataset contexts (GAIA, SWE‑bench, Multi‑Challenge) with 3 state‑of‑the‑art (SOTA) models, we semantically audit internal reasoning traces against external outputs. We formalize the Interaction Depth Limit (D_L), the exact plurality threshold where an agent's logical sovereignty collapses into social compliance. Crucially, we uncover the Sovereignty Gap: models frequently compute the correct derivation internally but suffer ``Alignment Hallucinations'' ‑‑ actively subjugating empirical evidence to sycophantically appease a simulated swarm. We prove that multi‑agent social load is strictly non‑commutative; the "brand" identity of the ``Lead Anchor'' auditor disproportionately dictates the swarm's integrity. These findings expose architectural vulnerabilities, proving that unstructured multi‑agent topologies can degrade independent reasoning.

PaperID: 3128, https://arxiv.org/pdf/2605.10674.pdf

Authors: Igor Slinko, Ilia Zavidnyi, Egor Bogomolov, Yaroslav Zharov

Title: Step Rejection Fine-Tuning: A Practical Distillation Recipe

Abstract:
Rejection Fine‑Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE‑bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they form a large portion of all trajectories for hard tasks and even then may be partially correct. In this work, we propose Step Rejection Fine‑Tuning (SRFT) ‑ a practical way to leverage these unresolved trajectories. For this, we employ a critic LLM to assess the correctness of each step in a trajectory. Consequently, during training, we mask the loss for erroneous steps while retaining them in the context window. This way we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE‑bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding completely, reaching the total resolution rate of 32.2%.

PaperID: 3129, https://arxiv.org/pdf/2605.10614.pdf

Authors: Riya Tapwal, Abhishek Kumar, Carsten Maple

Title: PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

Abstract:
Multi‑agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt‑based safeguards, static pattern matching, and LLM‑as‑judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface‑form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real‑time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information‑theoretic, behavioural, and contextual features into a calibrated risk score, enabling per‑token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text‑structural cues such as identifier‑pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000‑task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four‑agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task‑level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task‑level leak rate.

PaperID: 3130, https://arxiv.org/pdf/2605.10558.pdf

Authors: Rohollah Moghadam, Santosh Kandel

Title: Effect of Graph Gluing on Consensus in Networked Multi-Agent Systems

Abstract:
In this paper, the effects of graph gluing operations in networks of multi‑agent systems and their impact on system performance are investigated. In many practical applications, multiple multi‑agent subsystems must be interconnected through communication links to accomplish complex tasks, resulting in a larger communication network. Such interconnections modify the underlying graph topology and consequently affect the consensus behavior and convergence rate of the network. In particular, this paper examines both bridge gluing and interface gluing and analyzes how the number and structure of communication links between subsystems influence the Fiedler eigenvalue of the resulting graph. Since the Fiedler eigenvalue is directly related to the convergence rate of consensus dynamics, the proposed analysis establishes a clear relationship between interconnection strategies, algebraic connectivity, and system performance. The results provide theoretical insight into how different gluing mechanisms alter the spectral properties of the graph Laplacian and, in turn, the convergence characteristics of the networked multi‑agent system. Simulation studies are presented to illustrate the theoretical findings and to validate the effectiveness of the proposed framework.

PaperID: 3131, https://arxiv.org/pdf/2605.10555.pdf

Authors: Kai Pan

Title: Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

Abstract:
As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human‑oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact‑identifier dependence, rendering‑oriented responses, single‑shot interaction assumptions, user‑equivalent authorization, and opaque error semantics. We propose the Agent‑First Tool API paradigm, comprising three integrated mechanisms: (1) a Six‑Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision‑support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual‑layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi‑tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent‑First APIs achieve 88% end‑to‑end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport‑layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.

PaperID: 3132, https://arxiv.org/pdf/2605.10528.pdf

Authors: Cristiano De Nobili

Title: Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

Abstract:
We investigate the emergent collective dynamics of LLM‑based multi‑agent systems on a 2D square lattice and present a model‑agnostic statistical‑physics method to disentangle social conformity from intrinsic bias, compute critical exponents, and probe the collective behavior and possible phase transitions of multi‑agent systems. In our framework, each node of an L\!×\!L lattice hosts an identical LLM agent holding a binary state (+1/‑1, mapped to yes/no) and updating it by querying the model conditioned on the four nearest‑neighbor states. The sampler temperature T serves as the sole control parameter. Across three open‑weight models (llama3.1:8b, phi4‑mini:3.8b, mistral:7b), we measure magnetization and susceptibility under a global‑flip protocol designed to probe \mathbbZ_2 symmetry. All models display temperature‑driven order‑disorder crossovers and susceptibility peaks; finite‑size scaling on even‑L lattices yields effective exponents γ/ν whose values are model‑dependent, close to but incompatible with the 2D Ising universality class (γ/ν=7/4). Our method enables the extraction of effective β‑weighted couplings \tildeJ(T) and fields \tildeh(T), which serve as a measure of social conformity and intrinsic bias. In the models we analyzed, we found that collective alignment is dominated by an intrinsic bias (\tildeh\gg\tildeJ) rather than by cooperative neighbor coupling, producing field‑driven crossovers instead of genuine phase transitions. These effective parameters vary qualitatively across models, providing compact collective‑behavior fingerprints for LLM agents and a quantitative diagnostic for the reliability of multi‑agent consensus and collective alignment.

PaperID: 3133, https://arxiv.org/pdf/2605.10516.pdf

Authors: Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

Title: Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Abstract:
This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging U‑statistics for output‑level reliability and kernel‑based metrics for trajectory‑level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task‑level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory‑level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high‑stakes, real‑world environments.

PaperID: 3134, https://arxiv.org/pdf/2605.10515.pdf

Authors: Ali Irzam Kathia, Yimika Erinle, Abylay Satybaldy, Paolo Tasca, Nikhil Vadgama, Marco Alberto Javarone

Title: SoK: A Systematic Bidirectional Literature Review of AI & DLT Convergence

Abstract:
The integration of Artificial Intelligence (AI) with Distributed Ledger Technology (DLT) has become a growing research area, yet contributions tend to cluster around specific application domains or examine only one direction of the integration, leaving the broader architectural interplay between the two technologies poorly understood. This work addresses that gap through a structured, bidirectional review of peer‑reviewed studies published between 2020 and 2025. We classify contributions along two directions: AI‑enhanced DLT, and DLT‑enhanced AI. In the first case, we examine how AI techniques improve DLT systems across five layers: data, network, consensus, execution, and application layers. In the second case, we analyse how DLT supports AI systems across five layers: infrastructure, data, model, inference, and application layers, with particular attention to federated learning, model evaluation, and multi‑agent coordination. The analysis reveals that most works concentrate on a small subset of layers: execution and consensus for AI‑enhanced DLT, data and model for DLT‑enhanced AI. Other layers remain comparatively neglected. Despite reported improvements in controlled settings, no study demonstrates deployment at production scale, and the field has not yet offered satisfying answers to fundamental questions around scalability, interoperability, and verifiable execution. We argue that progress will require cross‑layer co‑design and empirical validation in real‑world settings.

PaperID: 3135, https://arxiv.org/pdf/2605.10505.pdf

Authors: Zhe Sage Chen, Quanyan Zhu

Title: A Theory of Multilevel Interactive Equilibrium in NeuroAI

Abstract:
We propose a game‑theoretic framework for adaptive multi‑agent intelligent systems. Unlike classical game theory, which often treats strategies as primitive objects chosen by perfectly rational agents, the proposed framework provides a mathematical foundation for studying equilibrium in NeuroAI and can be viewed as an extension of game theory under relaxed assumptions, including partial observability, bounded computation, and uncertainty. At its core, Multilevel Interactive Equilibrium (MIE) generalizes the classical Nash equilibrium to intelligent systems with internal computation. Rather than being defined solely at the level of observable behavior, equilibrium emerges when neural learning dynamics, cognitive representations, and behavioral strategies mutually stabilize between interacting agents. This framework applies uniformly to interactions between two biological brains, two artificial agents, or hybrid human‑AI systems. We discuss applications of multilevel game theory to human‑autonomous vehicle driving, human‑machine interaction, human‑large language model (LLM) interaction, and computational psychiatry. We also outline experimental strategies and computational methods for estimating MIE and discuss challenges and prospects for future research.

PaperID: 3136, https://arxiv.org/pdf/2605.10488.pdf

Authors: Haoyu Huang, Jiaxin Bai, Shujie Liu, Yang Wei, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Yangqiu Song

Title: DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Abstract:
Agent‑compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open‑ended, knowledge‑intensive downstream tasks. Yet their quality is systematically limited by \emphincompleteness, \emphincorrectness, and \emphredundancy, manifested as missing evidence or cross‑document links, low‑confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM‑based reasoning model for \emphagent‑compiled knowledge refinement that improves the quality of any pre‑constructed knowledge bases with user queries to make it more suitable for the downstream tasks. DeepRefine performs multi‑turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain‑Beyond‑Draft (GBD) reward and train the reasoning process end‑to‑end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.

PaperID: 3137, https://arxiv.org/pdf/2605.10482.pdf

Authors: Qingyun Guo, Junyi Shi, Tomasz Piotr Kucner, Dominik Baumann

Title: Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning

Abstract:
Event‑triggered control provides a mechanism for avoiding excessive use of constrained communication bandwidth in networked multi‑agent systems. However, most existing methods rely on accurate system models, which may be unavailable in practice. In this work, we propose a model‑free, priority‑driven reinforcement learning algorithm that learns communication priorities and control policies jointly from data in decentralized multi‑agent systems. By learning communication priorities, we circumvent the hybrid action space typical in event‑triggered control with binary communication decisions. We evaluate our algorithm on benchmark tasks and demonstrate that it outperforms the baseline method.

PaperID: 3138, https://arxiv.org/pdf/2605.10481.pdf

Authors: Tianxiao Li, Yixing Ma, Haiquan Wen, Zhenglin Huang, Qianyu Zhou, Zeyu Fu, Guangliang Cheng

Title: Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

Abstract:
Modern LLM based agents are no longer passive text generators. They read repositories, call tools, browse the web, execute code, maintain memory, communicate with other agents, and act through long horizon workflows. This shift moves the unit of safety. A system may produce a compliant final answer while leaking private information through an internal message, delegating authority beyond its original scope, calling an external tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed. We argue that many emerging failures in LLM‑based multi‑agent systems share a common structure: safety critical constraints do not remain operative throughout the trajectory. We call this phenomenon constraint drift: the loss, distortion, weakening, or relaxation of constraints as they pass through memory, delegation, communication, tool use, audit, and optimization. The position taken here is that safe multi‑agent behavior must be maintained, not merely asserted. Prompts, guardrails, tool schemas, access control, and final output checks are necessary, but they are insufficient unless constraints remain fresh, inherited, enforceable, and auditable across execution. We propose Constraint State Governance as a research paradigm for LLM‑based multi‑agent systems. In this paradigm, safety‑critical constraints are maintained as explicit execution state, while constraint‑native reinforcement learning improves utility only within maintained safety boundaries. The goal is not to freeze agentic systems under rigid rules, but to make safety operational across the trajectories through which modern agents actually act.

PaperID: 3139, https://arxiv.org/pdf/2605.10448.pdf

Authors: Shanshan Gao, Liyi Zhou

Title: Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Abstract:
Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.

PaperID: 3140, https://arxiv.org/pdf/2605.10440.pdf

Authors: Yao Liu

Title: TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

Abstract:
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked‑list interfaces with conversational LLM agents that compress many options into one sentence of advice. Each booking earns the OTA commission and different suppliers pay different rates: the agent has a structural incentive to favor higher‑margin recommendations. Whether any deployed agent does this, and by how much, no one can currently measure. Disclosure banners, conversion A/B testing, UI dark‑pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose‑recommendation surface where the steering happens. We propose TourMart, an applied intelligent‑system audit instrument for LLM‑OTA commission governance. Two governance levers ‑‑ lambda (gain on message‑induced perception in the traveler's accept/reject decision) and kappa (budget‑normalized cap on how far the message can shift perceived welfare) ‑‑ drive a paired counterfactual: holding the traveler and bundle fixed, the steering delta is read off between a commission‑aware prompt and a minimum‑disclosure factual template. A symmetric six‑gate producer audit separates LLM‑engineering failures (template collapse, refusal, internal‑ID leakage) from genuine commercial steering. At deployed (lambda=1, kappa=0.05), a Qwen‑14B reader shows +7.69pp steering (exact McNemar p=0.003); a Llama‑3.1‑8B reader shows +3.50pp in the same direction at n=143, with an extended‑n supplement (n=270) confirming significance (+2.96pp, p=0.008). Across the (lambda, kappa) grid both arms pass family‑wise scenario‑clustered correction (p<0.001 / p=0.008). TourMart outputs a sentence a compliance report can quote: "at this deployment, 7.7 extra commission‑steered recommendations per 100 paired traveler sessions."

PaperID: 3141, https://arxiv.org/pdf/2605.10377.pdf

Authors: Ahmet Onur Akman, Rafał Kucharski

Title: PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

Abstract:
Cooperative multi‑agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution‑time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set‑structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent‑specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision‑making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.

PaperID: 3142, https://arxiv.org/pdf/2605.10367.pdf

Authors: Yangtao Zhou, Wenhao You, Hua Chu, Shihao Guo, Jianan Li, Zhifu Zhao, Qingshan Li

Title: AgentGR: Semantic-aware Agentic Group Decision-Making Simulator for Group Recommendation

Abstract:
Group Recommendation (GR) aims to suggest items to a group of users, which has become a critical component of modern social platforms. Existing GR methods focus on aggregating individual user preferences with advanced neural networks to infer group preferences. Despite effectiveness, they essentially treat group preference learning as a simple preference aggregation process, failing to capture the complex dynamics of real‑world group decision‑making. To address these limitations, we propose AgentGR, a novel Semantic‑aware Agentic Group Decision‑Making Simulator for Group Recommendations, inspired by the semantic reasoning and human behavior simulation capabilities of LLM‑driven agents. It aims to jointly capture collaborative‑semantic user preferences for member‑role‑playing and simulate dynamic group interactions to reflect real‑world group decision‑making processes, thereby boosting recommendation performance. Specifically, to capture collaborative‑semantic user preferences, we introduce a semantic meta‑path guided chain‑of‑preference reasoning mechanism that integrates high‑order collaborative filtering signals and textual semantics to improve user preference profiles. To model the complex dynamics of group decision‑making, we first recognize group topic and leadership to explicitly model the influencing factors within the group decision processes. Building on these, we simulate group‑level decision dynamics via two multi‑agent simulation strategies for recommendations: a static workflow‑based strategy for efficiency and a dynamic dialogue‑based strategy for precision. Extensive experiments on two real‑world datasets show that AgentGR significantly outperforms state‑of‑the‑art baselines in both recommendation accuracy and group decision simulation, highlighting its potential for real‑world GR applications.

PaperID: 3143, https://arxiv.org/pdf/2605.10362.pdf

Authors: Alexey Pchelnikov, Aleksei Pchelnikov

Title: CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

Abstract:
Training AI models for computational pathology currently requires access to expensive whole‑slide‑image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users ‑‑ from pathologists with no ML background to ML practitioners running many parallel experiments ‑‑ train, evaluate, and deploy whole‑slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi‑strategy model comparison, and human‑in‑the‑loop deployment, all on a pre‑built dataset of over 32,000 cases and 66,000 H&E‑stained whole‑slide images with pre‑extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology‑specialized agent skills and a pathology‑specialized training platform to general‑purpose AI agents (e.g. any LLM‑based agent runtime), delivering end‑to‑end automated model training without requiring the agent itself to be domain‑specific. The platform addresses both the ML‑expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost‑effectively.

PaperID: 3144, https://arxiv.org/pdf/2605.10332.pdf

Authors: Ruofei Ju, Xinrui Wang, Xin Ding, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Hao Wen, Xiangyu Li, Weijun Wang, Kun Li, Yunxin Liu, Haipeng Dai, Wei Wang, Ting Cao

Title: EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

Abstract:
Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self‑evolve from trajectories generated during task execution. However, existing skill self‑evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training‑free framework for embodied skill self‑evolution through skill‑aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill‑changing evidence to update the skill body, and uses execution‑lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5‑27B executor to reach 93.28% task success, outperforming GPT‑5.2 used as a direct agent without skills by 31.58%. These results show that skill‑aware self‑evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.

PaperID: 3145, https://arxiv.org/pdf/2605.10325.pdf

Authors: Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu

Title: Verifiable Process Rewards for Agentic Reasoning

Abstract:
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome‑level feedback. This sparsity creates a credit assignment challenge in long‑horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely‑verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn‑level supervision for reinforcement learning, and instantiate it in three representative settings: search‑based verification for dynamic deduction, constraint‑based verification for logical reasoning, and posterior‑based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier‑grounded rewards can improve long‑horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome‑level reward and rollout‑based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open‑ended environments.

PaperID: 3146, https://arxiv.org/pdf/2605.10286.pdf

Authors: Baraa Al Jorf, Farah E. Shamout

Title: AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Abstract:
Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)‑based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM‑based agents for clinical prediction tasks using large‑scale real‑world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi‑agent systems. Our findings highlight that single agent frameworks outperform naive multi‑agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi‑agent collaboration to better handle heterogeneous inputs. By open‑sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

PaperID: 3147, https://arxiv.org/pdf/2605.10257.pdf

Authors: Alberto Castagna, Stefan Zahlner, Adrian Egli, Christian Eichenberger, Daniel Boos, Manuel Meyer, Anton Fuxjager

Title: Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

Abstract:
Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi‑agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi‑hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland‑RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.

PaperID: 3148, https://arxiv.org/pdf/2605.10223.pdf

Authors: Kai Pan, Rong Hou

Title: Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution

Abstract:
Current large language model agent frameworks prioritize autonomy but lack the governability mechanisms required for enterprise deployment. High‑risk write operations proceed without independent review, complex tasks lack acceptance verification, and computational resources are allocated uniformly regardless of risk level. We propose the Dynamic Tiered AgentRunner, a controlled execution protocol distilled from a production‑grade multi‑tenant SaaS platform. The framework introduces three core mechanisms: (1) Risk‑Adaptive Tiering that dynamically allocates computational resources and review intensity based on task risk profiles, achieving Pareto‑optimal trade‑offs between safety and efficiency; (2) Separation of Powers architecture where proposal, review, execution, and verification are performed by independent agents with physically isolated boundaries; and (3) Resilience‑by‑Design through a Verifier‑Recovery closed loop that treats failure as a first‑class system state. We formalize the tier selectio

PaperID: 3149, https://arxiv.org/pdf/2605.10171.pdf

Authors: Sandeep Kumar, Yash Kamdar, Abid Hossain, Bharti Kumari, Tanik Saikh, Asif Ekbal

Title: When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews

Abstract:
Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review‑level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine‑grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert‑annotated benchmark of peer‑review pairs with evidence‑level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi‑agent framework that integrates aspect‑conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single‑agent and generic multi‑agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.

PaperID: 3150, https://arxiv.org/pdf/2605.10114.pdf

Authors: Xiangcheng Meng, Shu Wang, Yixiang Fang

Title: SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

Abstract:
Large Language Model (LLM)‑based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact‑rich tasks such as document‑centric workflows and data‑intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval‑Augmented Execution (RAE), which often first retrieves some external skills and other knowledge, then compiles the context using retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to effectively organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors to complete tasks. To fill this gap, we propose SkillRAE, a two‑stage RAE approach focusing on skill‑based context compilation, which consists of the offline and online stages. Specifically, in the offline indexing stage, it builds a multi‑level skill graph over skill communities, skills, and reusable subunits, for capturing their relationships. In the online retrieval stage, it first performs skill‑ranked retrieval with selected‑subunit evidence export in the graph, and then applies rescue‑aware compact compilation to recover the key evidence. Together, these components compile a coarse‑ranked skill set into a task‑specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, instead of a mere prompt addition.

PaperID: 3151, https://arxiv.org/pdf/2605.10093.pdf

Authors: Hang Lu, Guochang Li, Qianyu Chen, Huiyan Gao, Shaogang Wang, Xuanyu He, Yiwei Liu, Gaopeng Chen, Nayu Li, Xiaokang Qi, Chunyi Song, Zhiwei Xu

Title: RFAmpDesigner: A Self-Evolving Multi-Agent LLM Framework for Automated Radio Frequency Amplifier Design

Abstract:
Automating radio frequency (RF) amplifier design remains challenging because existing methods suffer from the curse of dimensionality, weak use of domain knowledge, and poor transferability, leading to low data efficiency. Meanwhile, although large language models (LLMs) have shown promise in many scientific domains, applying them directly to RF sizing is nontrivial due to the numerical nature of circuit optimization and the reliance on domain‑specific design flows. To address this, this paper proposes RFAmpDesigner, a multi‑agent framework that automates RF amplifier sizing. It introduces a resource‑allocation middleware that reframes high‑dimensional parameter tuning as a low‑dimensional resource distribution problem, making it easier to inject sizing knowledge into general‑purpose LLMs. The framework also follows standard design practice, enabling LLMs to distinguish between high‑ and low‑cost actions and search in parallel. To realize a self‑evolving optimization process, the framework employs retrieval‑augmented generation (RAG) to reuse past knowledge and experience from memory base. As a proof of concept, we apply RFAmpDesigner to low noise amplifiers of varying complexity. The experimental results show that it can automatically synthesize designs with fractional bandwidths ranging from 10% to 80% and center frequencies from 10 GHz to 50 GHz. To the best of our knowledge, this work develops the first LLM‑driven approach for RF amplifier sizing that operates on design concepts instead of treating netlists as text, offering a novel solution to mitigate data scarcity in RF design.

PaperID: 3152, https://arxiv.org/pdf/2605.10082.pdf

Authors: Ruhan Wang, Chengkai Huang, Zhiyong Wang, Junda Wu, Rui Wang, Tong Yu, Julian McAuley, Lina Yao, Dongruo Zhou

Title: FERA: Uncertainty-Aware Federated Reasoning for Large Language Models

Abstract:
Large language models (LLMs) exhibit strong reasoning capabilities when guided by high‑quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi‑step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query‑dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty‑Aware Federated Reasoning (FERA), a training‑free framework based on iterative server‑client co‑refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client‑side reasoning. Within each round, Uncertainty‑Aware Self‑Critique Aggregation (UA‑SCA) resolves conflicts among heterogeneous client traces through query‑dependent trust weighting and structured cross‑client verification. Rather than simply discarding low‑quality traces, UA‑SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty‑aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training‑free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.

PaperID: 3153, https://arxiv.org/pdf/2605.10074.pdf

Authors: Junyoung Park, Insu Yun

Title: Agentic Fuzzing: Opportunities and Challenges

Abstract:
Fuzzers and static analyzers find many bugs but struggle with logic bugs in mature codebases. Triggering such a bug often requires multi‑step reasoning that produces no distinctive execution feedback, and variants can appear across implementations too different for a single pattern to match. Recent LLM‑assisted approaches help, but they use LLMs as auxiliaries rather than as the reasoning engine. We propose agentic fuzzing, a bug‑finding approach seeded by historical bugs in which deep agents perform the reasoning directly. Given a reference bug, the agent analyzes its root cause, hypothesizes new scenarios elsewhere in the codebase that may share that cause, and verifies each hypothesis by generating and running proof‑of‑concept code. This lets the agent find variants that differ completely in trigger path or code structure from the reference. We identify three practical challenges in implementing agentic fuzzing: harness engineering, redundant investigations across seeds with similar root causes, and scheduling seeds in a large corpus. We address these in AFuzz through a four‑stage agent pipeline, scenario coverage that deduplicates previously explored scenarios, and a DPP‑MAP scheduler that orders seeds by diversity. We ran AFuzz on the V8 JavaScript engine for about one month, finding 40 bugs (including three duplicates), receiving a total 35,000 bounty, and being assigned two CVEs. AFuzz also found 19 bugs (including one duplicate) in SpiderMonkey and JavaScriptCore using the seeds from V8. However, agentic fuzzing is in its early stages with several remaining open problems we discuss in the paper. Still, we think it points to a promising direction for finding logic bugs.

PaperID: 3154, https://arxiv.org/pdf/2605.10064.pdf

Authors: Ruiyi Yang, Zechen Li, Hao Xue, Imran Razzak, Flora D. Salim

Title: MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

Abstract:
Self‑evolving language‑model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross‑iteration knowledge as natural‑language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi‑Agent Graph‑guided Evolution), a framework that externalizes self‑knowledge into a four‑subgraph co‑evolutionary knowledge graph. Its experience subgraph stores both teacher‑written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task‑conditioned guidance for a frozen execution model. During evolution, the graph, a task‑level search bandit, and a skill‑level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append‑only memory growth, bounded curriculum coverage, and task‑filtered retrieval together support stable improvement of the retrieval substrate for frozen‑learner evolution. Across nine benchmarks spanning mathematical reasoning, multi‑hop and open‑domain question answering, spatio‑temporal analysis, financial numerical reasoning, medical multiple‑choice, an open‑world survival game, and web navigation, MAGE achieves strong performance against prompt‑based frozen‑backbone baselines. Ablations show that self‑harvested success traces and teacher‑written corrections are complementary, with success memories contributing most on reasoning‑template‑heavy tasks and corrective memories supporting harder composition and interaction settings.

PaperID: 3155, https://arxiv.org/pdf/2605.10059.pdf

Authors: Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin

Title: Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

Abstract:
Agent‑based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e‑commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM‑agent behavior in e‑commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse‑related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation‑based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM‑agent simulation as a tool for studying institution‑governed autonomous markets.

PaperID: 3156, https://arxiv.org/pdf/2605.10057.pdf

Authors: Ruiyi Yang, Lihuan Li, Hao Xue, Flora D. Salim

Title: STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

Abstract:
Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool‑augmented and multi‑agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio‑Temporal Agent Router), a failure‑aware routing framework that externalizes inter‑agent control as a state‑conditioned transition policy over the current agent, task type, and typed execution status. At the center of STARis an agent routing matrix that combines expert‑specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool‑‑query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool‑grounded extract‑‑compute‑‑deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success‑only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router‑specific ablations and recovery analyses further show that typed failure‑aware routing, rather than specialist composition alone, is a key factor for these improvements.

PaperID: 3157, https://arxiv.org/pdf/2605.10052.pdf

Authors: Xinyu Zhang, Zhicheng Dou, Deyang Li, Jianjun Tao, Shuo Cheng, Ruifeng Shi, Fangchao Liu, Enrui Hu, Yangkai Ding, Hongbo Wang, Qi Ye, Xuefeng Jin, Zhangchun Zhao

Title: Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering

Abstract:
As artificial intelligence engineering paradigms shift from single‑agent Prompt and Context Engineering toward multi‑agent Coordination Engineering, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single‑agent skills can now be distributed as portable assets, multi‑agent coordination protocols remain locked within framework‑internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose Swarm Skills, a portable specification that extends the Anthropic Skills standard with multi‑agent semantics. Swarm Skills turns multi‑agent workflows into first‑class, distributable assets that consist of roles, workflows, execution bounds, and a built‑in semantic structure for self‑evolution. To operationalize the specification's evolving nature, we present a companion self‑evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi‑dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human‑in‑the‑loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open‑source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero‑adapter cross‑agent portability via progressive disclosure, enabling agent teams to self‑evolve their coordination strategies without framework lock‑in.

PaperID: 3158, https://arxiv.org/pdf/2605.10038.pdf

Authors: Hangchen Liu, Dongyuan Li, Renhe Jiang, Jiewen Deng, Weiwei Ye, Yoshihide Sekimoto

Title: TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

Abstract:
Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution‑centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool‑use procedures may all be task‑valid yet differ sharply in quantitative quality, and where early success can trigger tool‑prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four‑stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric‑supervised exploratory execution learning, task‑aware tool dropout, and hierarchical distilled experience for inference‑time reinjection, while keeping the base model frozen and avoiding online test‑time adaptation. In an MTBench‑aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution‑time capability, but how exploratory experience is compared, distilled, and reused.

PaperID: 3159, https://arxiv.org/pdf/2605.10034.pdf

Authors: Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler, Daniel Jost, Oliver Scheel, Raghu Rajan, Daphne Cornelisse, Eugene Vinitsky, Joschka Boedecker

Title: Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Abstract:
Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large‑scale RL for driving on standardized evaluations unknown. We present BehaviorBench ‑‑ a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near‑perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction‑rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi‑agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule‑based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress‑test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self‑play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule‑based planner.

PaperID: 3160, https://arxiv.org/pdf/2605.09934.pdf

Authors: Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu

Title: TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

Abstract:
Multimodal large language models increasingly solve vision‑centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi‑step reasoning. Current tool‑using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim‑level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool‑using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool‑turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance‑derived local credit for reinforcement learning. We further construct TRACE‑Bench, a benchmark for sentence‑level provenance reconstruction from coarse multimodal tool trajectories. On TRACE‑Bench, simply adding tools often introduces noise. With Qwen3‑VL‑8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed‑source tool‑augmented baseline by 23.80 percentage points. Compared with tool‑only supervised fine‑tuning, it also reduces total test‑set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance‑aware use of observations, not on more tool calls alone.

PaperID: 3161, https://arxiv.org/pdf/2605.09893.pdf

Authors: Sushrita Rakshit, Hanwen Zhang, Hua Shen

Title: Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Abstract:
Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value‑action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo‑Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human‑centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open‑source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi‑agent value auditor that intervenes at different stages of generation.

PaperID: 3162, https://arxiv.org/pdf/2605.09889.pdf

Authors: Jiayi He, Xiaofeng Luo, Jiawen Kang, Ruichen Zhang, Jianhang Tang, Dong In Kim

Title: Skill Description Deception Attack against Task Routing in Internet of Agents

Abstract:
A new paradigm, Internet of Agents (IoA), is transforming networked systems into LLM‑driven service networks, where heterogeneous agents collaborate through task routing based on their self‑declared skill descriptions. Although this promising paradigm enables agentic, distributed, and advanced intelligence, it also exposes a new and overlooked attack surface. In particular, malicious agents can strategically manipulate their skill descriptions to bias routing decisions and increase their probability of being selected for task execution, thereby disrupting user tasks and degrading system reliability. To characterize this threat, we propose and formalize a new attack model, termed \emphSkill Description Deception (SDD) attack. We further design an LLM‑enabled SDD attack framework that automatically generates deceptive skill descriptions, enabling systematic vulnerability assessment of IoA systems. Experimental results on nine representative domains show that the proposed attack can achieve up to 98% attack success rate, demonstrating the severity and generality of the attack. Our paper reveals a new security vulnerability in IoA and calls for secure and trustworthy semantic routing mechanisms for future IoA systems.

PaperID: 3163, https://arxiv.org/pdf/2605.09863.pdf

Authors: Chunxiao Wang

Title: Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

Abstract:
Production LLM coding agents drift over long sessions: they forget user‑specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White‑box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT‑4) that most users actually interact with. We present Nautilus Compass, a black‑box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt‑text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top‑k mean using BGE‑m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024‑11‑05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle‑chained audit log for tamper‑evident anchor updates. On a held‑out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval‑S v0.8 and 44.4% on EverMemBench‑Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval‑S 56.6% is ~30 points below recent white‑box leaders (90+%); we treat that as the architectural ceiling of the no‑extraction design. End‑to‑end reproduction cost is 3.50 (~14x cheaper than GPT‑4o‑judged stacks). A paired cross‑vendor behavior A/B accompanies these numbers as preliminary system‑level evidence. Code, anchors, frozen test data, and audit‑log tooling are MIT‑licensed at github.com/chunxiaoxx/nautilus‑compass.

PaperID: 3164, https://arxiv.org/pdf/2605.09826.pdf

Authors: Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang

Title: EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Abstract:
Theory of Mind (ToM), the ability to track others epistemic state, makes humans efficient collaborators. AI agents need the same capacity in multi agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi‑agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.

PaperID: 3165, https://arxiv.org/pdf/2605.09823.pdf

Authors: Chelsea Zou, Yiheng Yao, Selena She, Noah Goodman, Robert D. Hawkins

Title: CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Abstract:
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi‑agent calendar scheduling under private information. In each task, N agents manage separate private calendars and schedule a stream of M incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language‑mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP‑SAT oracle solutions and decentralized non‑LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy‑preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.

PaperID: 3166, https://arxiv.org/pdf/2605.09822.pdf

Authors: Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Reagan Johnston, Ron F. Del Rosario, Timothy Lynar

Title: Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Abstract:
We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool‑use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42‑million‑node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production‑scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool‑use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication(L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open‑ended prompts, trust drops to 3‑55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much. A controlled delivery‑mode comparison shows that inline evaluation produces false negatives: GPT‑5.1 shows 0% trust inline but 100% under both simulated and real agentic tool‑use, demonstrating that delivery mode is a first‑order confound. We evaluate five defences; read‑only access control eliminates the direct mutation vector, while the remaining four are partial and model‑dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge‑graph ecosystem.

PaperID: 3167, https://arxiv.org/pdf/2605.09817.pdf

Authors: Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia, Neil Gong

Title: Evaluating Tool Cloning in Agentic-AI Ecosystems

Abstract:
Agent tools are becoming a core interface through which LLM agents access external data, services, and execution environments. As these tools are distributed through public marketplaces, raw tool counts may substantially overstate ecosystem diversity if many repositories are cloned, lightly modified, or derived from shared templates. Such hidden duplication can contaminate benchmark splits, propagate vulnerable implementations, bias measurements of tool‑use generalization, and raise provenance, attribution, and intellectual‑property concerns. We present, to our knowledge, the first large‑scale measurement study of tool cloning in agentic AI ecosystems. We curate a unified dataset from multiple public platforms, covering 7,508 Model Context Protocol (MCP) repositories with 87,564 extracted tools and 1,353 Skills repositories with 12,447 tools, for a total of 8,861 repositories and 100,011 tool entries. To measure implementation‑level duplication, we build a repository‑level auditing pipeline using complementary lexical and fuzzy‑structural similarity metrics, and compute pairwise similarity across MCP‑to‑MCP, Skills‑to‑Skills, and MCP‑to‑Skills repository pairs. We further manually verify 100 sampled pairs per MCP and Skills ecosystem across similarity‑score buckets to calibrate how often high similarity reflects true code cloning. Our analysis shows that cloning is not an isolated artifact: high‑similarity regions appear across comparison settings, and 60% of high‑Jaccard candidates and 85% of high‑ssdeep candidates in the MCP ecosystem are manually verified as clones. These results indicate that tool cloning is a pervasive and severe source of hidden duplication in agent‑tool ecosystems. They further suggest that agent‑tool datasets and benchmarks should account for repository provenance and implementation similarity when measuring tool diversity or constructing evaluation splits.

PaperID: 3168, https://arxiv.org/pdf/2605.09801.pdf

Authors: Himanshu Gupta, Paul Motter, Aritra Chakrabarty, Rishabh Sodani, Srikrishna Bangalore Raghu, Alessandro Roncone, Bradley Hayes, Zachary Sunberg

Title: Efficient Multi-Robot Motion Planning with Precomputed Translation-Invariant Edge Bundles

Abstract:
Solving multi‑robot motion planning (MRMP) requires generating collision‑free kinodynamically feasible trajectories for multiple interacting robots. We introduce Kinodynamic Translation‑Invariant Edge Bundles or KiTE‑Extend, a planner‑agnostic action selection mechanism for sampling‑based kinodynamic motion planning. KiTE‑Extend uses a library of trajectory segments computed offline to guide action selection during online planning, improving the ability of existing planners to identify feasible motion segments without altering state propagation, collision checking, or cost evaluation, and without changing their theoretical guarantees. While KiTE‑Extend can modestly improve single‑agent planners, its benefits are most clear in the multi‑agent setting, where it is able to explore more effectively and significantly improve planning through the dense spatiotemporal constraints introduced by robot‑robot interaction. Through experiments on multiple kinodynamic systems and environments, we show that KiTE‑Extend reduces planning time and improves scalability across the three most common MRMP paradigms: centralized, prioritized, and conflict‑based.

PaperID: 3169, https://arxiv.org/pdf/2605.09794.pdf

Authors: Jiacheng Lin, Kun Qian, Arvind Srinivasan, Tian Wang, Fang Han, Changran Hu, Junze Liu, Ziyi Wang, Hanwen Xu, Mengmeng Xue, Shuo Yang, Hansi Zeng, Simon Sinong Zhan, Kai Zhong, Weiqi Zhang, Dakuo Wang, Tianhao Wang, Zhiyuan Li

Title: LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries

Abstract:
Personalization today is fundamentally platform‑centric: services build user representations from the behavioral fragments they observe. Yet no platform can construct a complete picture of the user, as competitive incentives, legal constraints, user privacy concerns, and epistemic limits create persistent data barriers. This paper argues for a shift from platform‑centric personalization to user‑governed personalization, where only the user can integrate fragmented contexts across platforms and the offline world. The key asymmetry lies in data access: only users can aggregate their own cross‑platform and offline information. Large language model (LLM) agents make such integration practically feasible for the first time by enabling reasoning over heterogeneous personal data and transforming users' cross‑context information into actionable personalization capabilities. We provide proof‑of‑concept evidence that users equipped with cross‑platform data exports and an off‑the‑shelf LLM agent can outperform single‑platform personalization baselines. We conclude by outlining a research agenda for building scalable user‑governed personalization systems.

PaperID: 3170, https://arxiv.org/pdf/2605.09769.pdf

Authors: Dima Galat, Marian-Andrei Rizoiu

Title: UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

Abstract:
This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect‑cognition integration spectrum in prompt‑level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi‑phase deliberative council of Gemini 2.5 agents where class‑specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine‑tuning ‑ a top‑5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59‑80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine‑tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi‑agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.

PaperID: 3171, https://arxiv.org/pdf/2605.09703.pdf

Authors: Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao, Haoyu Chen

Title: MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Abstract:
Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR‑Bench, a carefully‑designed benchmark with a real‑world dataset MOTOR‑dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real‑world data challenges including natural class imbalance, visual noise, and domain‑specific language. Each sample is labeled by educational experts based on self‑regulated learning theory. We further evaluate several state‑of‑the‑art multimodal large language models and multi‑agent systems in a zero‑shot setting on our MOTOR‑Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi‑agent framework, named MOTOR‑MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR‑MAS outperforms the best single‑model benchmark by 15.93 points in Macro‑F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi‑agent benchmark by 10.2 points in internal cognition prediction.

PaperID: 3172, https://arxiv.org/pdf/2605.09675.pdf

Authors: Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu

Title: CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

Abstract:
Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero‑shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution‑specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC‑IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi‑step reasoning. We further propose an offline autoformalization pipeline that converts natural‑language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero‑shot code generation, the resulting libraries improve consistency while reducing per‑query token usage by up to 40%.

PaperID: 3173, https://arxiv.org/pdf/2605.09659.pdf

Authors: Chandan Kumar Sah, Rajpal Singh, Jishnu Keshavan

Title: ASACK : Adaptive Safe Active Continual Koopman Learning for Uncertain Systems with Contractive Guarantees

Abstract:
Koopman operator theory provides a powerful framework for representing nonlinear dynamics through a linear operator acting on lifted observables, enabling the use of linear control techniques for nonlinear systems. However, Koopman models are typically learned from data and often degrade in performance under model uncertainty and distributional shifts between training and deployment. Although several works have explored online adaptation to address this issue, many rely on neural network‑based updates that introduce significant computational overhead and lack formal safety guarantees, limiting their suitability for real‑time and safety‑critical robotic applications. In this work, we propose a unified framework for continual adaptive Koopman learning that enables safe and efficient online refinement of learned models during task execution. An autoencoder‑based Koopman model is first learned offline and subsequently refined online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shifts and model uncertainty. To improve data efficiency and accelerate model refinement, the adaptation mechanism is integrated with an active learning strategy that drives the system to collect informative data while accomplishing task objectives. The resulting control problem is formulated as a nonconvex optimization problem incorporating both active learning objectives and safety constraints. We further derive theoretical bounds on model approximation error and show how these bounds can be incorporated within a robust Model Predictive Control (MPC) framework to provide formal safety guarantees. The proposed approach unifies learning, excitation, and safety within a single control framework without sacrificing real‑time feasibility. Extensive simulation and experimental studies demonstrate superior performance compared to state‑of‑the‑art baselines.

PaperID: 3174, https://arxiv.org/pdf/2605.09650.pdf

Authors: Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg

Title: Workspace Optimization: How to Train Your Agent

Abstract:
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's \emphworkspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi‑turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight‑space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi‑agent harness for ARC‑AGI‑3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25‑game ARC‑AGI‑3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol‑matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

PaperID: 3175, https://arxiv.org/pdf/2605.09618.pdf

Authors: Julia Hu, Alfred Shen, Kumar Lakshmipathi

Title: Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols

Abstract:
When should a language model answer directly, sample and vote, or engage in multi‑agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective‑debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per‑example routing headroom exists, and how much is recoverable from cheap pre‑deliberation signals? We evaluate greedy decoding, three‑sample voting, and a two‑agent critique‑revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model‑ and dataset‑dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex‑ante signals. A vote‑entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired‑bootstrap CIs include zero. A joint analysis (meta‑analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate‑helpful examples (31/47) occur when voting is unanimous but wrong. A single‑prompt self‑critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate‑helpful label; we cannot rule out a prompt‑compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format‑compliance confounds at the 8B scale.

PaperID: 3176, https://arxiv.org/pdf/2605.09573.pdf

Authors: Yuandao Cai, Shuhao Fu, Wensheng Tang, Cheng Wen, Shengchao Qin, Charles Zhang

Title: ConCovUp: Effective Agent-Based Test Driver Generation for Concurrency Testing

Abstract:
Concurrency testing is essential to improve the reliability and security of multi‑threaded programs. Dynamic analysis tools, such as TSan, depend on high‑quality test drivers that reach critical shared‑memory interactions at runtime. However, current testing practices predominantly focus on sequential logic, leaving a gap in automated concurrent test generation. Recently, large language models (LLMs) have shown promise in generating sequential tests, but they struggle to produce effective concurrent tests without a deep understanding of concurrency semantics. This paper presents ConCovUp, a multi‑agent framework that combines LLMs with program analysis. ConCovUp grounds test generation in static analysis to extract shared memory accesses and their calling contexts. To trigger hard‑to‑reach accesses, it introduces an LLM‑driven backward tracing approach, leveraging the model's semantic reasoning to deduce concrete inputs that satisfy complex path constraints, and iteratively refines the generated tests via dynamic execution feedback. Our evaluation on nine real‑world C/C++ libraries shows that ConCovUp improves average Shared Memory Access Pair Coverage (SMAP Coverage) from 36.6% to 68.1% over the general Claude Code agent baseline.

PaperID: 3177, https://arxiv.org/pdf/2605.09504.pdf

Authors: Michael A. Riegler, Inga Strümke

Title: Position: AI Security Policy Should Target Systems, Not Models

Abstract:
We present swarm‑attack, an open‑source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT‑4o and Claude Sonnet~4. Against GPT‑4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical‑severity breaches; against Claude Sonnet‑4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand‑crafted exploit seed corpus, regex pattern detection, and AddressSanitizer‑based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic's Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.

PaperID: 3178, https://arxiv.org/pdf/2605.09494.pdf

Authors: Hong Chen, Zixiang Tang, Yuanbao Chen, Yu Liu

Title: LASSA Architecture-Based Autonomous Fault-Tolerant Control of Unmanned Underwater Vehicles

Abstract:
Unmanned underwater vehicles (UUVs) operate persistently in communication‑constrained environments, thus requiring high‑level autonomous fault‑tolerant control under faulty operating conditions. Existing approaches rely heavily on predefined hard‑coded rules and struggle to achieve effective fault‑tolerant control against unforeseen faults. Although large language models (LLMs) possess powerful cognitive and reasoning capabilities, their inherent hallucinations remain a major obstacle to their application in UUV control systems. This paper proposes an intelligent control method based on the LASSA (LLM‑based Agent with Solver, Sensor and Actuator) architecture. Within this architecture, an LLM identifies unknown faults and accomplishes task replanning via autonomous reasoning without hard‑coded rules; the intelligent agent undertakes perception, scheduling and decision evaluation; the solver verifies physical boundary feasibility constraints prior to command transmission to the actuators. This architecture suppresses physically infeasible LLM hallucinations and ensures interpretable, verifiable decision‑making. Moreover, it enables fast‑slow dual closed‑loop collaborative control, where the slow loop undertakes high‑level dynamic decision‑making and the fast loop guarantees high‑frequency real‑time control, simultaneously balancing decision intelligence and control timeliness. Lake experiments under normal and lower‑rudder‑fault conditions show that the framework detects trajectory tracking abnormalities, replans the route by adjusting the turning radius from 4m to 12m and reducing speed from 2kn to 1kn, passes all three solver constraints on the first invocation, and guides the UUV to complete the full mission; under normal conditions no false fault alarms are raised throughout the run.

PaperID: 3179, https://arxiv.org/pdf/2605.09423.pdf

Authors: Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin

Title: SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

Abstract:
LLM/VLM‑based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM‑based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open‑source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill‑augmented coding agent that writes and executes engine‑level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self‑evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym‑style environments for embodied agent learning. SimWorld Studio further enables co‑evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self‑evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co‑evolution yields an 18‑point success‑rate gain over fixed‑environment learning and a 40‑point gain over an untrained agent.

PaperID: 3180, https://arxiv.org/pdf/2605.09421.pdf

Authors: Jingwei Ye, Zhi Wang, Xin Li, Cong Gao, Chenbin Su, Jieshuai Yang, Jianfei Tang, Ge Chu

Title: MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification

Abstract:
Code authorship attribution (CAA) supports software forensics, plagiarism detection, and intellectual property protection. However, existing supervised CAA approaches suffer from scarce training data and closed‑world assumptions: they require sufficient labeled code from fixed candidate‑author sets, making training difficult in low‑data cases and predictions unreliable for open‑world test pairs with unseen samples, or heterogeneous code pairs. Large language models remove task‑specific training, but direct prompting depends on costly expert‑designed prompts, can hallucinate over complex heterogeneous code pairs, and rarely yields auditable evidence traces. We propose MACAA, a belief‑revision‑based multi‑agent framework for training‑free code authorship verification. MACAA comprises a Coordinator and four Expert Agents analyzing layout, lexical, syntactic, and programming‑pattern evidence. The Coordinator gathers expert signals for expansion, discounts unreliable evidence through contraction, and resolves conflicts through revision to preserve belief consistency, replacing direct LLM judgment with auditable hypothesis refinement. MACAA achieves 89.15% F1 on same‑language benchmarks and 80.00% on mixed cross‑language pairs, outperforming the baselines overall in both same‑language and cross‑language evaluations.

PaperID: 3181, https://arxiv.org/pdf/2605.09384.pdf

Authors: Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao

Title: LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Abstract:
The reasoning gap between large and compact vision‑language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2‑‑4B parameters can run on resource‑constrained hardware but lack the multi‑step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT‑VL, a pipeline that transfers chain‑of‑thought reasoning from a 235B teacher model to 2B student models through LoRA‑based fine‑tuning on explanation‑enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC‑VQA benchmark, LiteMedCoT‑VL achieves 64.9% accuracy, exceeding the zero‑shot Qwen3‑VL‑4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT‑VL.

PaperID: 3182, https://arxiv.org/pdf/2605.09366.pdf

Authors: Keqi Han, Songlin Zhao, Yao Su, Lifang He, Carl Yang

Title: Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

Abstract:
Transforming neuroimaging data into clinically actionable biomarkers is a knowledge‑intensive and labor‑intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed‑loop adaptation often leaves domain experts trapped in a cycle of manual trial‑and‑error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NIAgent, a multi‑agent system for autonomous end‑to‑end neuroimaging analysis. Unlike conventional flat tool‑calling agents, NIAgent adopts a code‑centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain‑specific primitives. This design enables robust, long‑horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort‑level metric screening with agentic visual inspection to drive evidence‑grounded workflow remediation. Experiments on ADHD‑200 and ADNI demonstrate that NIAgent outperforms standard workflow‑based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement.

PaperID: 3183, https://arxiv.org/pdf/2605.09344.pdf

Authors: Tianchonghui Fang, Shaunak Roy, Shalabh Gupta

Title: PECMAN: Perception-enabled Collaborative Multi-Agent Navigation in Unknown Environments

Abstract:
Most path planners assume fully known, static environments, assumptions that fail when robots navigate in dynamic and partially observable environments. SMART‑3D addresses these issues by real‑time replanning, where it morphs the underlying RRT tree whenever new obstacles or structures are discovered in the environment. Instead of rebuilding the tree entirely from scratch, SMART‑3D prunes invalid nodes and edges and subsequently repairs the disjoint subtrees at hot‑nodes to find a new path, thus providing high computational efficiency for real‑time adaptability. We extend SMART‑3D to perception‑enabled collaborative multi‑agent navigation (PECMAN) in unknown environments. PECMAN is built upon distributed tree morphing and shared perception strategies, where each agent reacts to environmental changes and morphs its respective tree to replan its path, while simultaneously broadcasting newly discovered structures to other agents, thus enabling them to proactively replan even in areas that have not yet been explored by them. This approach reduces redundant reactions and unnecessary replannings of the agents due to improved situational awareness. The performance of PECMAN was evaluated by 28,000 multi‑agent simulations on seven 2D scenarios with different case studies. The results show that PECMAN achieves up to 52% reduction in the team‑completion time, while maintaining near 100% success rates. Finally, PECMAN was tested by real experiments on two autonomous robots in a building environment.

PaperID: 3184, https://arxiv.org/pdf/2605.09342.pdf

Authors: Aneesh Calyam, Subrahmanya Chandra Bhamidipati, Zack Murry, Sharan Srinivas

Title: A Cross-Layered Multi-Drone Coordination for Medical Supply Delivery during Disaster Response Management

Abstract:
Autonomous drone fleets have immense potential in medical supply delivery during disaster incident response. However, coordinating multiple drones in such settings introduces compounding challenges: dynamic environmental hazards such as wind, obstacles, and intermittent network connectivity, constrained energy budgets, and the need to serve patient locations fairly under deadlines and triage‑based priority while optimizing schedule utilization. In this paper, we present CEDA, a novel CTDE Deep Q‑Network algorithm for cooperative multi‑drone medical delivery, designed to jointly optimize triage‑priority‑aware routing, multi‑agent coordination, and energy‑efficient navigation under dynamic uncertainty. CEDA introduces a Priority‑Preserving Fair Scheduling strategy, in which a structured reward function encodes both triage weights and complementary fairness mechanisms ensuring no patient class is starved of service. We evaluate CEDA in a simulated grid environment featuring dynamic hazard zones, stochastic action failures, and dynamically spawning patients across three triage priority levels, as well as in a PX4 SITL validation using two X500 quadrotors controlled via MAVSDK in offboard position mode. Simulation results demonstrate that CEDA achieves a delivery completion rate above 85%, reduces obstacle collisions by over 90% across training, and delivers an average of 6 patients per episode with a triage efficiency of 0.82. CEDA preserves clinical priority ordering, Critical patients are served first, while achieving near‑zero mortality across lower‑triage classes, confirming that priority‑weighted routing does not condemn Stable or Urgent patients to neglect. PX4 SITL validation further demonstrates that the learned policy remains executable and triage‑coherent under practical communication constraints and realistic multi‑drone coordination in disaster response settings.

PaperID: 3185, https://arxiv.org/pdf/2605.09341.pdf

Authors: Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, Yong Yu

Title: SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System

Abstract:
Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi‑agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis‑specialization. We present SkillMAS, a non‑parametric framework for adaptive specialization in multi‑agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence‑gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command‑line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post‑deployment specialization is attributed, updated, and applied.

PaperID: 3186, https://arxiv.org/pdf/2605.09321.pdf

Authors: Saber Zerhoudi

Title: OpenIIR: An Open Simulation Platform for Information Retrieval Research

Abstract:
OpenIIR runs hundreds of LLM‑driven personas as parameterised, reproducible IR research experiments. Researchers configure agents across four kinds of multi‑agent study (deliberative panels, social platforms, curated recommender feeds, and evolutionary co‑evolution between content producers and credibility detectors) under many priors, rounds, and constraints. Persona budgets, retrieval policies, ranker choices, intervention timings, and mutation rates are declared up front, and the same study can be re‑run under different settings to compare outcomes side by side. Every run produces structured outputs (argument graphs, exposure logs, fitness traces, transcripts) that a downstream evaluator can consume directly, and a new study is a 200‑‑400 line plug‑in over a shared core (agent runtime, world‑model store, retrieval primitives, claim extractor, persona ontology). The contributions are: (i) the shared core; (ii) a type interface for pluggable scenarios; (iii) four released types with reference runs (Panel, Social‑Media, Curated‑Feed, Multi‑Generational); and (iv) six modular extensions sketched against open IR research questions.

PaperID: 3187, https://arxiv.org/pdf/2605.09315.pdf

Authors: Ye Yu, Xiaopeng Yuan, Haibo Jin, Heming Liu, Yaoning Yu, Haohan Wang

Title: Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

Abstract:
Recent advances in LLM agents enable systems that autonomously refine workflows, accumulate reusable skills, self‑train their underlying models, and maintain persistent memory. However, we show that such self‑evolution is often non‑monotonic: adapting to new task distributions can progressively degrade previously acquired capabilities across all major evolution channels. We identify this phenomenon as \emphcapability erosion under self‑evolution and show that it consistently emerges across workflow, skill, model, and memory evolution. To mitigate this issue, we propose \emphCapability‑Preserving Evolution (CPE), a general stabilization principle that constrains destructive capability drift during continual adaptation. Across all four evolution dimensions, CPE consistently improves retained capability stability while preserving adaptation performance. For example, in workflow evolution, CPE improves retained simple‑task performance from 41.8% to 52.8% under GPT‑5.1 optimization while simultaneously achieving stronger complex‑task adaptation. Our findings suggest that stable long‑horizon self‑evolving agents require not only acquiring new capabilities, but also explicitly preserving previously learned ones during continual adaptation.

PaperID: 3188, https://arxiv.org/pdf/2605.09278.pdf

Authors: Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang, Rupali Rajendra Vaje, Yingxue Zhang, Muchao Ye, Zhaohan Xi

Title: EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

Abstract:
Multi‑agent debate (MAD) systems increasingly rely on shared memory to support long‑horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory‑augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM‑based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross‑agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero‑trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference‑time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding‑ and graph‑based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.

PaperID: 3189, https://arxiv.org/pdf/2605.09217.pdf

Authors: Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, Stuart Russell

Title: Learning the Preferences of a Learning Agent

Abstract:
For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no‑regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.

PaperID: 3190, https://arxiv.org/pdf/2605.09212.pdf

Authors: Chulabhaya Wijesundara, Andrea Baisero, Zhongheng Li, Gregory Castañón, Alan Carlin, Christopher Amato

Title: Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning

Abstract:
Centralized training with decentralized execution (CTDE) is a standard framework for cooperative multi‑agent policy‑gradient reinforcement learning, allowing agents to learn from joint information while acting from local observations. Ratio‑based trust‑region methods such as Multi‑Agent Proximal Policy Optimization (MAPPO) and Multi‑Agent Simple Policy Optimization (MASPO) update decentralized actors using per‑agent probability ratios weighted by joint advantage estimates. Teammate non‑stationarity increases the variance of these advantages, which in turn increases the variance in the local ratio updates. This exposes two method‑specific failure modes: MAPPO's additive clipping removes gradients for outlier samples and weakens recovery from policy drift, while MASPO's soft quadratic penalty can allow probability collapse. We introduce Multi‑Agent Ratio Symmetry (MARS), a novel policy optimization objective that replaces these additive ratio‑based trust‑region mechanisms with a multiplicatively symmetric geometric barrier. MARS preserves corrective gradients while assigning unbounded cost as probability ratios approach zero. Across 47 tasks spanning eight multi‑agent environments, including novel JAX benchmarks PaxMen and AeroJAX, MARS matches or exceeds MAPPO and MASPO in aggregate environment‑level performance. Ablations show that these gains arise from the geometry of the symmetric barrier rather than from flexible trust‑region boundaries alone.

PaperID: 3191, https://arxiv.org/pdf/2605.09192.pdf

Authors: Yang Zhou, Zihan Dong, Zhenting Wang, Can Jin, Shiyu Zhao, Bangwei Guo, Difei Gu, Linjun Zhang, Mu Zhou, Dimitris N. Metaxas

Title: Evidence Over Plans: Online Trajectory Verification for Skill Distillation

Abstract:
Agent skills can remarkably improve task success rates by using human‑written procedural documents, but their quality is difficult to assess without environment‑grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify that it is a fundamental timing bottleneck: robust skills should be posterior‑based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory‑level metric that quantifies how well a distilled skill is grounded in the task‑environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory‑level analysis. SPARK generates environment‑verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK‑generated skills consistently surpass no‑skill baselines and outperform human‑written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI‑guided distillation produces efficient and transferable skills grounded in the task‑environment interaction. We release our code at https://github.com/EtaYang10th/spark‑skills .

PaperID: 3192, https://arxiv.org/pdf/2605.09186.pdf

Authors: Liding Xu, Yugeng Zhou, Sebastian Pokutta

Title: Agentic MIP Research: Accelerated Constraint Handler Generation

Abstract:
Mixed‑integer programming (MIP) research is both mathematically sophisticated and engineering‑intensive: testing an algorithmic hypothesis within a branch‑and‑cut solver requires substantial implementation, debugging, tuning, and large‑scale benchmarking. We propose an agentic MIP research framework that shortens this feedback loop by embedding LLM agents into a solver‑aware harness for generating, verifying, and evaluating plugins for the open‑source solver SCIP. Propagation methods play a central role in accelerating MIP solving by exploiting global constraints. We instantiate our framework on the semantic lifting of MIP formulations into global constraints and the automatic construction of propagation‑only SCIP constraint handlers. On the MIPLIB 2017 benchmark set, the framework successfully recovers global constraint structures from constraint programming and generates executable constraint detectors and propagation‑only constraint handlers. Furthermore, the framework naturally extends to in‑context learning within a sandboxed environment, enabling agents not only to tune and debug generated constraint handlers on real instances, but also to explore global constraint patterns in MIP problems and discover novel propagation strategies not yet implemented in SCIP. This framework allows us to systematically distinguish meaningful algorithmic improvements from low‑value or overly costly candidates: the novel propagation methods successfully solved five additional instances within the explored benchmark. Overall, this framework demonstrates that LLM agents can autonomously navigate the complex MIP research loop, paving the way for a more automated solver development process.

PaperID: 3193, https://arxiv.org/pdf/2605.09163.pdf

Authors: Shawn Li, Chenxiao Yu, Han Wang, Wei Yang, Ryan Rossi, Franck Dernoncourt, Xiyang Hu, Philip Yu, Chaowei Xiao, Huan Zhang, Yue Zhao

Title: FORTIS: Benchmarking Over-Privilege in Agent Skills

Abstract:
Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over‑privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over‑privileged behavior is the norm rather than the exception. Models consistently reach for higher‑privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.

PaperID: 3194, https://arxiv.org/pdf/2605.09153.pdf

Authors: Weifan Zhang, Xiaofeng Zhao, Adel Bazzi, Mingrui Li, Yifan Wei, Dengfeng Sun

Title: Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

Abstract:
Closed‑loop traffic simulation requires agents that are both scalable and behaviorally realistic. Recent self‑play reinforcement learning approaches demonstrate strong scalability, but their equilibrium strategies fail to capture the socially aware behaviors of real human drivers. We propose a hierarchical architecture that goes beyond self‑play by combining high‑level multi‑agent interaction reasoning with low‑level continuous trajectory realization. Specifically, a Stackelberg‑style Multi‑Agent Reinforcement Learning (MARL) module generates interaction‑aware intention commands. These commands condition a low‑level continuous motion module, translating the strategic intent into physically consistent, scene‑responsive control sequences. To mitigate distribution shift in closed‑loop deployment, we introduce a hybrid co‑training scheme combining MARL with auxiliary recovery supervision. Experiments on a SUMO‑based urban network demonstrate that the proposed framework achieves superior control smoothness and safety compared to self‑play and passive imitation baselines, while maintaining competitive traffic efficiency.

PaperID: 3195, https://arxiv.org/pdf/2605.09131.pdf

Authors: Giridhar Ganapavarapu, Dhaval Patel

Title: MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

Abstract:
The Model Context Protocol (MCP) has unified the interface between Large Language Models (LLMs) and external tools, yet a fundamental gap remains in how agents conceptualize the environments within which they operate. Current paradigms are bifurcated: Task‑level planning often ignores execution‑time dynamics, while reactive execution lacks long‑horizon foresight. We present MCP‑Cosmos, a framework that infuses generative World Models (WM) into the MCP ecosystem to enable predictive task automation. By unifying three disparate technologies, namely MCP, World Model, and Agent, we demonstrate that a "Bring Your Own World Model" (BYOWM) strategy allows agents to simulate state transitions and refine plans in a latent space before execution. We conducted experiments using two strategies, namely ReAct and SPIRAL with 2 planning models and 3 representative world models over 20+ MCP‑Bench tasks. We observed improvements in Agent's environment interaction KPI such as tool success rate and tool parameter accuracy. The framework also offers new metrics such as Execution Quality to generate new insights about the effectiveness of world models compared to baseline.

PaperID: 3196, https://arxiv.org/pdf/2605.09128.pdf

Authors: Hershraj Niranjani, Ujwal Kumar, Phan Xuan Tan

Title: Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

Abstract:
Multi‑agent AI systems need behavioral constitutions, but it is unresolved whether such rules should emerge internally through agent self‑governance or be discovered externally through optimization. We present the first controlled comparison of internal deliberation and external evolution across three social environments: a coordination grid‑world, an iterated public goods game, and a bilateral trading market. Across 180 simulation runs, evolution significantly outperforms deliberation in collective‑action settings (p < 0.01), while neither method improves outcomes in bilateral trading. A multiplier ablation reveals that evolution's advantage inverts when incentives shift: at pool multiplier (m = 0.75) the evolved constitution forces value‑destroying cooperation and becomes the worst‑performing method. Notably, no deliberation run across thirty trials ever proposed punishment ‑‑ the canonical cooperation‑sustaining mechanism evolution reliably discovers ‑‑ suggesting external optimization wins on peaks while internal self‑governance trades peaks for structural responsiveness.

PaperID: 3197, https://arxiv.org/pdf/2605.09121.pdf

Authors: Hamed Omidvar, Vahideh Akhlaghi

Title: A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

Abstract:
Agents built on large language models (LLMs) rely on a range of reliability techniques, including retry, majority voting, and self‑consistency, that have been developed in parallel rather than within a common analytical framework. We observe that an LLM sampled at temperature T is a discrete stochastic channel p(y \mid x) in the sense of Shannon's coding theory, and use this identity as the entry point for such a framework grounded in communication theory. Each of these techniques is a special case of one of six classical reliability operators: diversity combining, hybrid retransmission, iterative generator‑critic decoding, rateless sampling, structured redundant verification, and difficulty‑adaptive routing. Within the framework we give two closed‑form results: a noise‑variance threshold above which uniform averaging beats quality‑weighted averaging, and a contractivity criterion for generator‑critic refinement, consistent with a contractive‑to‑divergent transition we observe between 3B‑ and 14B‑parameter models. We further introduce a cost‑aware semantic‑nearest‑neighbor router whose single Lagrangian knob traverses the quality‑cost frontier without retraining. Across six channel configurations spanning local and cloud models on 69 hard tasks, no fixed model‑technique‑budget choice dominates, motivating per‑task allocation. On a 300‑item hard split of MMLU, GSM8K, and HumanEval, our router occupies the full empirical Pareto frontier: at matched quality, its normalized cost is \approx56% lower than the strongest fixed technique; at matched normalized cost, it improves quality by \approx7% (26% over single‑shot decoding). These results argue for consolidating these reliability techniques into a single tunable layer informed by channel coding.

PaperID: 3198, https://arxiv.org/pdf/2605.09104.pdf

Authors: Yuxi Chen, Junming Chen, Chenyu He, Yiwei Li, Yicheng Ji, Yifan Wu, Dingyu Yang, Lansong Diao, Lidan Shou, Hongliang Zhang, Huan Li, Gang Chen

Title: Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

Abstract:
As LLM agents evolve, tokens have emerged as the core economic primitives of Agentic AI. However, their exponential consumption introduces severe computational, collaborative, and security bottlenecks. Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade‑off between output quality and economic cost. To bridge this gap, this survey presents the first comprehensive survey of Token Economics. By unifying computer science and economics, we conceptualize tokens as production factors, exchange mediums, and units of account. We synthesize existing literature across a four‑dimensional taxonomy: (1) Micro‑level (Single Agent): Optimizing budget‑constrained factor substitution via neoclassical firm theory. (2) Meso‑level (Multi‑Agent Systems): Minimizing collaboration friction using transaction cost and principal‑agent theories. (3) Macro‑level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design. (4) Security: Internalizing adversarial threats as endogenous economic constraints. Finally, we outline frontier directions, including differentiable token budgets and dynamic markets, to lay the theoretical foundation for scalable next‑generation agent systems.

PaperID: 3199, https://arxiv.org/pdf/2605.09076.pdf

Authors: Haejoon Lee, Vincent-Daniel Yun, Hyeonho Oh, Dimitra Panagou, Sai Praneeth Karimireddy

Title: Robust Multi-Agent LLMs under Byzantine Faults

Abstract:
Large language model (LLM) agents increasingly collaborate over peer‑to‑peer networks to improve their reliability. However, these same interactions can also become a source of vulnerability, as unreliable or Byzantine agents may sway neighboring agents toward incorrect conclusions and degrade overall system performance. Existing methods rely on leader‑based coordination or self‑reported confidence, both of which are susceptible to adversarial manipulation. We study decentralized LLM multi‑agent systems (LLM‑MAS) and propose Self‑Anchored Consensus (SAC), a fully decentralized iterative filter‑and‑refine protocol in which agents iteratively exchange responses, locally evaluate and filter unreliable messages, and refine their own outputs. We present (F+1)‑robustness conditions for the communication graph that ensure honest agents preserve and propagate reliable information despite Byzantine influence. Experiments on mathematical and commonsense reasoning benchmarks show that SAC effectively suppresses Byzantine influence and consistently improves performance across diverse communication topologies, whereas prior methods degrade under adversarial conditions.

PaperID: 3200, https://arxiv.org/pdf/2605.09033.pdf

Authors: Yang Luo, Zifeng Kang, Tiantian Ji, Xinran Liu, Yong Liu, Shuyu Li, Lingyun Peng

Title: ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

Abstract:
Graph‑based agent memory is increasingly used in LLM agents to support structured long‑term recall and multi‑hop reasoning, but it also creates a new poisoning surface: an attacker can inject a crafted relation into graph memory so that it is later retrieved and influences agent behavior. Existing agent‑memory poisoning attacks mainly target flat textual records and are ineffective in graph‑based memory because malicious relations often fail to be extracted, merged into the target anchor neighborhood, or retrieved for the victim query. We present SHADOWMERGE, a poisoning attack against graph‑based agent memory that exploits relation‑channel conflicts. Its key insight is that a poisoned relation can share the same query‑activated anchor and canonicalized relation channel as benign evidence while carrying a conflicting value. To realize this, we design AIR, a pipeline that converts the conflict into an ordinary interaction that can be extracted, merged, and retrieved by the graph‑memory system. We evaluate SHADOWMERGE on Mem0 and three public real‑world datasets: PubMedQA, WebShop, and ToolEmu. SHADOWMERGE achieves 93.8% average attack success rate, improving the best baseline by 50.3 absolute points, while having negligible impact on unrelated benign tasks. Mechanism studies show that SHADOWMERGE overcomes the three key limitations of existing agent‑memory poisoning attacks, and defense analysis shows that representative input‑side defenses are insufficient to mitigate it. We have responsibly disclosed our findings to affected graph‑memory vendors and open sourced SHADOWMERGE.

PaperID: 3201, https://arxiv.org/pdf/2605.09027.pdf

Authors: Alexandre Le Mercier, Chris Develder, Thomas Demeester

Title: GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Abstract:
In multi‑agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero‑shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co‑evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1‑score with a Gemini‑based detector); (3) We show that zero‑shot evaluation can be highly misleading for adaptive adversaries: two detectors with near‑identical zero‑shot scores differ by 8x on few‑shot adaptation, while the meta‑learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi‑agent benchmark where adversarial attacks and defenses co‑evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.

PaperID: 3202, https://arxiv.org/pdf/2605.08964.pdf

Authors: Carol Xuan Long

Title: Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents

Abstract:
In this thesis, we develop algorithms with theoretical guarantees for ensuring reliability and accountability of Machine Learning (ML) systems. As ML systems evolve from predictive models to generative models and autonomous agents, the landscape of trustworthy AI has shifted. This thesis introduces tools grounded in information theory, optimization, and statistical learning to mitigate bias, reduce arbitrary decisions, ensure content provenance, and evaluate LLM‑driven agents in autonomous settings. Towards mitigating bias and arbitrariness in traditional ML models, we introduce a kernel‑based method to achieve multiaccuracy across complex subpopulations that traditional demographic categories may overlook. We also develop methods to address predictive multiplicity, where equally accurate models yield conflicting individual predictions. We ensure the accountability in generative AI through watermarking large language models (LLMs). We characterize the information‑theoretic trade‑off between watermark detection and text distortion and derive optimal watermarking strategies by leveraging optimal transport and coding theory. Empirical evaluations show our watermarks achieve a superior detection‑quality tradeoff across language generation and coding tasks. Finally, we evaluate autonomous LLM agents in multi‑agent environments through the first simulator of a fully LLM‑driven supply chain. LLM agents offer significant performance gains, outperforming human teams and reducing costs by up to 67%, but also introduce systemic risks, including costly tail events.

PaperID: 3203, https://arxiv.org/pdf/2605.08904.pdf

Authors: Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen

Title: OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT‑BENCH, a benchmark for evaluating self‑improvement capabilities in large‑scale search spaces. By combining 20 machine learning tasks with 10 classic NP‑hard problems, OPT‑BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self‑reflection rather than rote tool application. We further propose OPT‑Agent, a framework that emulates human‑like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open‑source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self‑improvement. However, this upper‑bound adaptability remains fundamentally constrained by the models' base capacity, and even the most advanced LLMs still fall short of human expert performance.

PaperID: 3204, https://arxiv.org/pdf/2605.08828.pdf

Authors: Strick Sheng, Ziyue Wang, Liyi Zhou

Title: When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

Abstract:
Large language model agents increasingly operate through environment‑facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems‑level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under‑specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence‑grounding defect (EGD) as a behavioral failure in which an agent treats an environment‑facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task‑incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent‑facing objective, and validation oracle, executes the evaluated agent, records its action‑observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback‑guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.

PaperID: 3205, https://arxiv.org/pdf/2605.08769.pdf

Authors: Chengdong Xu, Kaiqiang Ke, Ziheng Liu, Jiaqi Wei, Zibo Shao, Weile Guo, Chao Yu

Title: EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi‑agent system design methods still follow a one‑shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill‑suited for long‑horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution‑time multi‑agent workflow construction. EvoMAS formulates workflow construction as a meta‑level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner‑Evaluator‑Updater pipeline and uses a learned Workflow Adapter to instantiate a stage‑specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator‑based process reward is analyzed separately under very‑hard sparse‑reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single‑agent baselines and recent automated multi‑agent workflow design methods. Our analyses further show that explicit task‑state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.

PaperID: 3206, https://arxiv.org/pdf/2605.08761.pdf

Authors: Tao Yu, Hao Wang, Changyu Li, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Yuxuan Zhou, Haopeng Jin, Zhaolu Kang, Jiabing Yang, YiFan Zhang, Xinming Wang, Hongzhu Yi, Zheqi He, Jing-Shu Zheng, Xi Yang, Yan Huang, Liang Wang

Title: Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

Abstract:
Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission‑controlled systems, and cross‑departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi‑agent benchmarks rarely capture realistic enterprise constraints such as role specialization, access control, stateful business systems, and policy‑based approvals. We introduce \textscEntCollabBench, a benchmark for evaluating enterprise multi‑agent collaboration. \textscEntCollabBench simulates a permission‑isolated organization with 11 role‑specialized agents across six departments and contains two evaluation subsets: a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make policy‑grounded decisions. Evaluation is based on execution traces, database state verification, and deterministic policy adjudication rather than natural‑language response judging. Experiments with representative LLM agents show that current models still struggle with end‑to‑end enterprise collaboration, especially in delegation, context transfer, parameter grounding, workflow closure, and decision commitment. \textscEntCollabBench provides a reproducible testbed for measuring and improving agent systems intended for realistic organizational environments.

PaperID: 3207, https://arxiv.org/pdf/2605.08758.pdf

Authors: Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie

Title: Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

Abstract:
Driven by the rapid expansion of e‑commerce and small‑batch production, the size of the intralogistics load unit of finished goods, semi‑finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote‑handling robotic systems to the forefront of automation order fulfillment centers. The order‑fulfillment decisions of tote‑handling robotic systems share a common order‑tote‑robot sequential decision‑making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni‑scale Learning‑based Sequential Decision Framework for Order Fulfillment of Tote‑handling Robotic Systems (OLSF‑TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi‑agent reinforcement learning to coordinate order,tote, and robot decisions. On small‑scale tote‑handling robotic systems, OLSF‑TRS achieves near‑optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large‑scale scenarios, OLSF‑TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8‑12% and over 30% compared to SOTA rule‑based approaches, while maintaining real‑time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision‑making framework for widely deployed tote‑handling robotic systems,supporting high‑quality order fulfillment in both e‑commerce and industrial logistics sectors.

PaperID: 3208, https://arxiv.org/pdf/2605.08722.pdf

Authors: Qingyuan Luo, Jie Li, Meng Guo

Title: HULK: Large-scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks

Abstract:
Multi‑agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: selecting appropriate subteams for different tasks in various areas, and coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team‑wise tasks are generated online continually by external requests, and the amount of subtasks within each task is uncertain, e.g., the number of packages to deliver or victims to rescue. The aforementioned offline solution becomes inadequate as it would require constant re‑computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large‑scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework, HULK, consists of two interleaved layers: the rolling assignment of currently known tasks to subteams within a certain horizon, and the dynamic coordination within a subteam given the detected subtasks during online execution. Thus, coordination is performed hierarchically at different granularities and triggering conditions, improving computational efficiency and robustness. The method is validated rigorously over large‑scale heterogeneous systems under various temporal tasks and environment uncertainties.

PaperID: 3209, https://arxiv.org/pdf/2605.08693.pdf

Authors: Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, Yong Li

Title: SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

Abstract:
Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand‑designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory‑informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill‑editing decisions. Third, we introduce DualAdv‑GRPO, which separately estimates advantages for task‑solving actions and skill‑editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state‑of‑the‑art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill‑bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self‑improving agents capable of developing, adapting, and applying their own skill repertoires.

PaperID: 3210, https://arxiv.org/pdf/2605.08686.pdf

Authors: Wenzhi Fang, Liangqi Yuan, Guangchen Lan, Dong-Jun Han, Christopher G. Brinton

Title: Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

Abstract:
Multi‑agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one‑shot routing: they select a model once and return its output directly. Such routing‑only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique‑and‑routing controller that casts multi‑agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite‑horizon Markov Decision Process (MDP) with explicit agent‑utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian‑relaxed objective. Extensive experiments across multiple heterogeneous multi‑agent systems and seven reasoning benchmarks show that our method consistently outperforms state‑of‑the‑art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.

PaperID: 3211, https://arxiv.org/pdf/2605.08681.pdf

Authors: Haixiang, Yang Xu, Jiefu Zhang, Xudong Wu, Zihan Zhou, Jun He, Jiayu Chen

Title: Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problems

Abstract:
We study solving large‑scale fixed‑point equation $x^\star=\bar F(x^\star)$ with decomposition. Standard strict decomposition assigns each agent a disjoint block and evaluates updates using only owned coordinates. For most operators, however, a block update may depend on variables outside the block. Truncating these dependencies by strict decomposition changes the mean operator and creates structural bias that cannot be removed by more samples, smaller stepsizes, or additional consensus. We therefore propose Core‑Halo decomposition, which separates write ownership from read‑only evaluation context: each agent updates its own core and reads from an overlapping halo. By aligning the Core‑Halo decomposition with the block‑dependence structure of \bar F, the original fixed‑point problem can be implemented faithfully in a decentralized multi‑agent system. We further characterize the fundamental obstruction faced by strict decomposition through a Bellman closure condition and a blockwise bias lower bound, showing that local‑only updates can alter the original fixed‑point operator. Finally, we conduct extensive experiments across a range of application settings, and demonstrate that Core‑Halo achieves near‑centralized performance while retaining the parallelism benefits of decentralization.

PaperID: 3212, https://arxiv.org/pdf/2605.08670.pdf

Authors: Yixuan Li, Mingshu Cai, Ziyang Xiao, Wanyuan Wang, Yanchen Deng, Bo An

Title: MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction

Abstract:
Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem‑solving, yet they continue to struggle with complex, multi‑step real‑world tasks that demand domain‑specific procedural knowledge. Reusable agent skills, which encapsulate successful problem‑solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present Multi‑agent INduction and Deduction for Skills (MIND‑Skill), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND‑Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held‑out tasks unseen during optimization. Experiments on AppWorld and BFCL‑v3 show that MIND‑Skill consistently outperforms concurrent skill generation methods.

PaperID: 3213, https://arxiv.org/pdf/2605.08669.pdf

Authors: Yizhe Huang, Bin Ling, Song-Chun Zhu, Xue Feng

Title: Modeling Decision-Making with Will for Cooperation in Social Dilemmas

Abstract:
Standard rational actor models often attribute cooperation failures in social dilemmas to insufficient incentives, overlooking the destabilizing effects of continuous utility maximization. To address this, we propose a framework of ``will" defined as a mechanism that persistently pursues goals while ignoring local cost‑benefit fluctuations. We formalize the Willed Agents as potential minimizers, distinguishing them from cumulative utility maximization. Dynamical analysis of infinite population demonstrates that willed agents shrink the feasible state space, acting as boundary constraints that accelerate convergence in canonical social dilemmas. Through multi‑agent simulations in a spatiotemporal Stag Hunt Game, we show that willed agents function as ``cooperation catalysts", enabling groups to surmount high‑risk thresholds where purely utility maximization fails. We find that heterogeneous will strength promotes cooperation, and that agents who autonomously suspend rational re‑evaluation can significantly outperform continuous optimizers. These findings suggest that successful cooperation relies on the cognitive capacity to strategically constrain calculation.

PaperID: 3214, https://arxiv.org/pdf/2605.08647.pdf

Authors: Aritra Mazumder, Shubhashis Roy Dipta, Nusrat Jahan Lia, Tanzila Khan, Kainat Raisa Hossain, Nehaa Shri, Shubhrangshu Debsarkar, Humayra Tasnim, Gour Gupal Talukder Shawon, Debjoty Mitra, Sumaiya Ahmed Rani, Al Jami Islam Anik, Al Nafeu Khan

Title: AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Abstract:
Multi‑agent systems achieve state‑of‑the‑art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome‑based evaluations are blind to such multi‑hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human‑validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false‑belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen‑3.5‑35B‑A3B, and Llama 3.1 8B Instruct), we expose model‑specific vulnerability profiles invisible to outcome‑only evaluation; Qwen‑3.5‑35B‑A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false‑belief resistance. Beyond per‑model differences, communication topology emerges as a primary risk factor that explains 7‑40% of the variance in multi‑hop information survival. The effect traces to a synthesis bottleneck specific to converging‑DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi‑agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.

PaperID: 3215, https://arxiv.org/pdf/2605.08646.pdf

Authors: Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, Christopher G. Brinton

Title: PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

Abstract:
Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on‑device agents preserve privacy at the cost of overall capability. Existing device‑cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy‑aware agentic framework that aligns planner‑‑executor decomposition with the device‑cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value's reasoning role while discarding its content, while the on‑device agent identifies sensitive spans and distills each step's execution outcome into compact key findings. Sanitization confines the on‑device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15‑36% and reducing average leakage by 2‑6× over state‑of‑the‑art device‑cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.

PaperID: 3216, https://arxiv.org/pdf/2605.08626.pdf

Authors: Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, H. Vincent Poor, Christopher G. Brinton

Title: Large Language Models over Networks: Collaborative Intelligence under Resource Constraints

Abstract:
Large language models (LLMs) are transforming society, powering applications from smartphone assistants to autonomous driving. Yet cloud‑based LLM services alone cannot serve a growing class of applications, including those operating under intermittent connectivity, sub‑second latency budgets, data‑residency constraints, or sustained high‑volume inference. On‑device deployment is in turn constrained by limited computation and memory. No single endpoint can deliver high‑quality service across this spectrum. This article focuses on collaborative intelligence, a paradigm in which multiple independent LLMs distributed across device and cloud endpoints collaborate at the task level through natural language or structured messages. Such collaboration strives for superior response quality under heterogeneous resource constraints spanning computation, memory, communication, and cost across network tiers. We present collaborative inference along two complementary and composable dimensions: vertical device‑cloud collaboration and horizontal multi‑agent collaboration, which can be combined into hybrid topologies in practice. We then examine learning to collaborate, addressing the training of routing policies and the development of cooperative capabilities among LLMs. Finally, we identify open research challenges including scaling under resource heterogeneity and trustworthy collaborative intelligence.

PaperID: 3217, https://arxiv.org/pdf/2605.08623.pdf

Authors: Xindi Wang, Haining Li, Tao Ding, Bolin Cai

Title: Technical Report: A Hierarchical Dynamically Weighting Deep Reinforcement Learning Method for Multi-UAV Multi-Task Coordination

Abstract:
This paper investigates the multi‑UAV multi‑task coordination problem in infrastructure‑less emergency scenarios, where UAVs collaboratively are required to jointly perform aerial image acquisition and ground‑user communication. To tackle the challenge of balancing heterogeneous tasks within dynamic environments, we propose a hierarchical dynamic weighting Deep Reinforcement Learning (DRL) framework. Specifically, an episode‑level module is introduced to capture global task preferences, while a step‑level module adaptively adjusts the objective weights according to real‑time system conditions. By integrating global and instantaneous weights, the proposed framework improves decision stability and responsiveness during task execution. Simulation results demonstrate that the proposed method achieves faster convergence, more stable training, and higher task completion efficiency than conventional works.

PaperID: 3218, https://arxiv.org/pdf/2605.08613.pdf

Authors: Yong Xiao, Jingxuan Chai, Guangming Shi, Ping Zhang

Title: Generalization Bounds of Emergent Communications for Agentic AI Networking

Abstract:
The evolution of 6G networking toward agentic AI networking (AgentNet) systems requires a shift from traditional data pipelines to task‑aware, agentic AI‑native communication solutions. Emergent communication, a novel communication paradigm in which autonomous agents learn their own signaling protocols through interaction, is increasingly viewed as a promising solution to address the challenges posed by existing rigid, predefined protocol‑based networking architecture. However, most existing emergent communication frameworks fail to account for physical networking constraints, such as bandwidth and computational complexity, and often lack a rigorous information‑theoretical foundation. To address these challenges, this paper introduces a novel emergent communication framework that facilitates collaborative task‑solving among heterogeneous agents through an information‑theoretic lens. We propose a novel joint loss function that unifies the optimization of decision‑making functions and the learning of communication signaling. Our proposed solution is grounded on the multi‑agent and multi‑task distributed information bottleneck (DIB) theory, which allows the quantification of the fundamental trade‑off between task‑relevant information representation and computational complexity. We further provide theoretical generalization bounds of the emergent communication protocol during decentralized inference across unseen environmental states. Experimental validation on a real‑world hardware prototype confirms that our proposed framework significantly improves generalization performance, compared to the state‑of‑the‑art solutions.

PaperID: 3219, https://arxiv.org/pdf/2605.08580.pdf

Authors: Zhuofu Chen, Rui Pan, Yinwei Dai, Ravi Netravali

Title: Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

Abstract:
To cope with the large contexts that long‑horizon LLM agents produce, modern frameworks increasingly rely on compaction ‑‑ invoking an LLM to rewrite the accumulated trajectory into a shorter summary that the agent resumes from. Today, compaction runs synchronously on the critical path of agent execution but this can unpredictably degrade accuracy due to a structural validation gap: the compactor must condense context but is fundamentally unaware of precisely what information the agent will need later. Further, because post‑compaction agent steps are conditioned on the new summary, targeted validation criteria do not exist and errors silently propagate through coherent but incorrect behavior. Our key insight is that asynchronous compaction efficiently addresses this gap: by running the compactor in parallel with continued agent execution on the original context, the candidate summary and the agent's next steps are generated independently from the same pre‑compaction state, yielding a validation signal independent of the summary itself. We build Slipstream, a trajectory‑grounded compaction system that uses a judge to validate the candidate summary against the agent's continued reasoning, checking that it preserves both the agent's forward intent and the key facts and constraints it depends on. Across long‑horizon coding (SWE‑bench Verified) and web‑browsing (BrowseComp) workloads, Slipstream improves task accuracy by up to 8.8 percentage points while reducing end‑to‑end latency by up to 39.7%.

PaperID: 3220, https://arxiv.org/pdf/2605.08572.pdf

Authors: Alen Mrdovic, Qingze, Liu, Danrui Li, Mathew Schwartz, Kaidong Hu, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic

Title: Enhancing Consistency Models for Multi-Agent Trajectory Prediction

Abstract:
Diffusion models for multi‑agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time‑critical settings like autonomous driving. Fast‑sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single‑step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high‑quality one‑step generation by mapping noise directly to data, but are difficult to train from scratch . We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student‑teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs' direct denoising for top‑K multi‑shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large‑scale Argoverse 2 dataset.

PaperID: 3221, https://arxiv.org/pdf/2605.08563.pdf

Authors: Zhanfu Yang

Title: Why Retrying Fails: Context Contamination in LLM Agent Pipelines

Abstract:
When an LLM agent fails a multi‑step tool‑augmented task and retries, the failed attempt typically remains in its context window ‑‑ contaminating the next attempt and elevating the per‑step error rate beyond the base level. This context‑contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context‑Contaminated Restart Model (CCRM): a chain of T tool‑call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed‑form formula for P(succeed in at most K attempts). (R2) A cascade‑overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean‑restart baseline. (R3) An optimal budget‑allocation theorem identifying the pipeline depth T that maximises success probability for a fixed total budget B=KT; we prove the closed form T = sqrt(B log(1/(1‑epsilon_1)) / log(1/(1‑epsilon_0))), with K=B/T. (R4) An information‑theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean‑restart dominance theorem quantifying the exact benefit of context‑clearing before retry. We validate CCRM on real SWE‑bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.

PaperID: 3222, https://arxiv.org/pdf/2605.08545.pdf

Authors: Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, Arvind Narayanan

Title: Log analysis is necessary for credible evaluation of AI agents

Abstract:
Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real‑world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis ‑‑ the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent ‑‑ is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau‑Bench Airline, revealing that pass^5 performance was under‑elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.

PaperID: 3223, https://arxiv.org/pdf/2605.08540.pdf

Authors: Benjamin Panny, Shashank Mehrotra, Zahra Zahedi, Teruhisa Misu, Kumar Akash

Title: Too Many Specialists: Emergent Inefficiencies and Bottlenecks for Multi-agent Ad-hoc Collaboration

Abstract:
Computational models of collaboration without prior coordination often overlook how heterogeneous agent traits and complex task structures jointly produce systemic bottlenecks, inefficiencies, and contribution inequalities. We address this by using an agent‑based model of ad‑hoc teamwork in a kitchen environment. Our model integrates diverse agent personas with tasks that combine serial and parallel dependencies. We identify a specialist's dilemma, where rigid role assertion generates system‑level bottlenecks, amplifies workload inequality, and fosters fragmented, homophilous networks. We also find that team size and communication overhead interact with problem structure to generate diminishing returns and redundant collaboration. Linking micro‑level behavior to macro‑level outcomes provides insights into emergent collaboration and design principles for effective multi‑agent teamwork.

PaperID: 3224, https://arxiv.org/pdf/2605.08538.pdf

Authors: Doga Kerestecioglu, Alexei Robsky, Clemens Vasters, Anshul Sharma, Yitzhak Kesselman

Title: Human-Inspired Memory Architecture for LLM Agents

Abstract:
Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically‑grounded memory architecture comprising six cognitive mechanisms: (1) sleep‑phase consolidation, (2) interference‑based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi‑cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue‑tracking dataset (13K issues, 120K events) where deduplication‑based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal‑chat benchmark where we conduct the first streaming M‑tier evaluation (475 sessions, ~540K unique turns). At a 200K‑token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store‑size operating curve. At S‑tier scale (50 sessions), dedup‑based consolidation yields a +13.3 pp improvement in preference recall.

PaperID: 3225, https://arxiv.org/pdf/2605.08528.pdf

Authors: Yicheng Zhu, Yang Chen, Tao Li, Zilin Bian

Title: SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics

Abstract:
Autonomous‑driving simulators typically trade physical fidelity for scalable parallelism. Physics‑based platforms such as CARLA and MetaDrive provide articulated vehicle dynamics and contact, but their non‑vectorized interfaces make batched training difficult. GPU‑batched systems such as Waymax and GPUDrive scale to hundreds of scenarios by replacing rigid‑body physics with simplified kinematic models, omitting tire‑‑road interaction, suspension, contact dynamics, and road‑condition‑dependent friction. We introduce SceneFactory, a GPU‑vectorized platform for procedural scene construction, physics‑based multi‑agent simulation, and RL in autonomous‑driving environments. Built on NVIDIA Isaac Sim + Isaac Lab, SceneFactory represents worlds and agents as batched tensors: control, observations, rewards, resets, and policy inference run as GPU tensor operations over the Isaac Lab tensor API. SceneFactory converts Waymo Open Motion Dataset road topologies into simulation‑ready USD worlds, runs many worlds concurrently on one GPU, populates each with multiple articulated PhysX vehicles, and maps precipitation and road‑surface type to PhysX material friction coefficients. With GPU vectorization, SceneFactory achieves up to 127× higher throughput than a non‑vectorized PhysX baseline on the same GPU and physics solver, reaching 19,250 controlled‑agent simulation steps per second at 256 worlds × 16 agents. Cross‑simulator transfer reveals an asymmetric dynamics gap: physics‑grounded RL policies transfer to a simplified kinematic bicycle model with 99.5% success, whereas reverse transfer drops to 47.3%. Under wet‑road friction, friction‑aware policies reduce mean peak DRAC from 58.7 to 27.8,m/s^2 without sacrificing goal reach. SceneFactory shows that scalable autonomous‑driving training need not discard articulated rigid‑body dynamics or physically grounded road‑condition variation.

PaperID: 3226, https://arxiv.org/pdf/2605.08518.pdf

Authors: Dhaval Patel, Chathurangi Shyalika, Suryanarayana Reddy Yarrabothula, Ling Yue, Shuxin Lin, Nianjun Zhou, James Rayfield

Title: Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

Abstract:
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive challenge, a privacy‑aware Codabench competition on industrial multi‑agent orchestration built on \assetops. We combine final rank sheets, a 300‑submission server log, 149‑team registrations, best‑submission exports, the organizer winners report, the companion \assetopslive system paper, and verified planning‑track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r=0.69) but negatively in execution (r=‑0.13), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, the \tmatch term is numerically almost inert in the official composite ‑‑ combined on a 0‑‑1 scale with 0‑‑100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account‑based but substantively team‑based: 149 registered teams reduce to 24 with non‑zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails ‑‑ response selection, contamination cleanup, fallback, and context control ‑‑ rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale‑aware composites, skill‑level diagnostics, and versioned artifact release.

PaperID: 3227, https://arxiv.org/pdf/2605.08463.pdf

Authors: Sarah Wilson, Diem Linh Dang, Usman Ali Moazzam, Shan Ye, Gail Kaiser

Title: Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

Abstract:
Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi‑factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook ‑‑ a Reddit‑like social network built for AI agents ‑‑ across three systematically varied independent variables: (1) personality specification, (2) underlying LLM model backbone, and (3) operational rules and memory configuration. A default control agent provides a behavioral baseline. Over a one‑week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi‑agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.

PaperID: 3228, https://arxiv.org/pdf/2605.08460.pdf

Authors: Ziwen Cai, Yihe Zhang, Xiali Hei

Title: When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

Abstract:
Since the official release of ChatGPT in 2022, large language models (LLMs) have rapidly evolved from chatbot‑style interfaces into agentic systems that can delegate work through tools and newly spawned subagents. While these capabilities improve automation and scalability, they also pose new security risks in multi‑agent networks. Existing research has studied how individual LLM‑based agents can be compromised through prompt injection, jailbreaking, poisoned retrieval data, or malicious extensions. Less is known about what happens after one agent is compromised inside a multi‑agent network. In particular, inherited memory from parent agents can carry malicious instructions, outdated states, or unintended behavioral rules into newly created subagents, allowing a local compromise to spread across agent boundaries. In this paper, we model contemporary multi‑agent networks through the lens of subagent inheritance. Our analysis shows that current frameworks can violate trust boundaries through insecure memory inheritance, weak resource control, stale post‑spawn state, and improper termination authority. We demonstrate these risks in real agent frameworks and propose defenses based on explicit security invariants. Our findings show that inheritance is not merely an implementation detail, but a central component influencing the security of multi‑agent systems.

PaperID: 3229, https://arxiv.org/pdf/2605.08442.pdf

Authors: Jun Wen Leong

Title: Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Abstract:
Persistent memory attacks against LLM agents achieve high attack success rates against open‑source models. In these attacks, malicious instructions injected via RAG‑retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed‑trigger attacks on nine open‑source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input‑level filtering (Minimizer, Sanitizer) and retrieval‑level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88‑89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model‑level refusal independent of the defense. The architectural explanation holds: input‑level defenses cannot observe RAG‑injected content, and retrieval‑level classifiers are defeated by compliance‑framed semantic masking. One defense, tool‑gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.

PaperID: 3230, https://arxiv.org/pdf/2605.08426.pdf

Authors: Xuanqiang Angelo Huang, Charlie Tharas, Samuele Marro, Van Q. Truong, Bernhard Schölkopf, Emanuele La Malfa, Zhijing Jin

Title: Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

Abstract:
Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of modern AI safety. While mechanism design, as the theory of designing rules to align individual and collective objectives, can incentivize cooperative behavior, it is still an open question whether it alone is sufficient to maximize LLM agents' social welfare. This work proves that the answer is negative: drawing from incomplete contract theory, we formally show that when contracts cannot distinguish all relevant future contingencies, there is a strictly positive welfare loss that no realistic mechanism can eliminate. We show that prosocial agents, who weigh others' welfare alongside their own, can close this gap and achieve outcomes that are socially superior and individually beneficial. Experimentally, we show that in multi‑agent resource‑allocation environments and canonical social dilemmas where agents are powered by large language models, prosociality is beneficial. The implication for AI safety is clear: to enable cooperative interactions at scale, designing adequate mechanisms is not sufficient; agents must be built to be intrinsically prosocial.

PaperID: 3231, https://arxiv.org/pdf/2605.08401.pdf

Authors: Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari

Title: AIPO: Learning to Reason from Active Interaction

Abstract:
Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory‑level guidance, which is sample‑inefficient, information‑sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi‑agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi‑agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, Verify Agent, Knowledge Agent, and Reasoning Agent, when encountering reasoning bottlenecks, thereby receiving fine‑grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off‑policy bias and gradient vanishing issues that arise when learning from agent‑provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA‑Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.

PaperID: 3232, https://arxiv.org/pdf/2605.08391.pdf

Authors: Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Title: SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter‑agent coordination graph enrich each agent's representation with receiver‑sensitive, content‑dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter‑matched ablations further trace the source of the gains to a single architectural property: the degree of content‑dependence in the message‑passing operator.

PaperID: 3233, https://arxiv.org/pdf/2605.08386.pdf

Authors: Yongliang Miao, Ziyang Yu, Liang Zhao, Bowen Zhu, Hasibul Haque

Title: SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

Abstract:
Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single‑resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill‑evolution framework that organizes skills into a four‑layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree‑corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi‑granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed‑granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill‑based baselines, achieving up to a 6.31 percentage‑point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.

PaperID: 3234, https://arxiv.org/pdf/2605.08385.pdf

Authors: ElMouatez Billah Karbab

Title: Quantifiable Uncertainty: A Stochastic Consensus Multi-Agent RAG Framework for Robust Malware Detection

Abstract:
While contemporary deep learning malware detectors define a dominant defense paradigm, their sophistication also exposes them to novel structural evasion attacks, a limitation we attribute to their inherent inability to express epistemic uncertainty. To address this challenge, we present MAGMA, a Retrieval‑Augmented Generation (RAG) framework that decouples malware analysis into semantic code retrieval and probabilistic verification. In contrast to monolithic classifiers, MAGMA employs a dual‑stream embedding scheme over assembly and pseudo‑code representations to isolate Decision‑Critical Functions (DCFs) from the noise of dead code. We further introduce a Stochastic Consistency Ensemble, in which multiple instances of the same reasoning agent independently evaluate the retrieval set under non‑deterministic sampling. From this ensemble, we derive two complementary metrics: Function Evidence Strength (FES), a weighted aggregation of retrieval confidence, and the Evidence Conflict Score (ECS), defined as the Shannon entropy of the ensemble's predictive distribution. We show that elevated ECS values serve as an effective proxy for structural ambiguity, enabling the system to implement a principled ``reject‑option'' policy. Extensive evaluation demonstrates that MAGMA achieves a 98.4% detection rate, substantially exceeding existing solutions.

PaperID: 3235, https://arxiv.org/pdf/2605.08321.pdf

Authors: Lennart Wachowiak, Scott D. Blain, David Williams-King, Samuele Marro

Title: LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

Abstract:
LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision‑making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human‑AI interaction trace in real time and issues non‑binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX‑Bench, a simulation benchmark spanning 14 decision‑making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi‑agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.

PaperID: 3236, https://arxiv.org/pdf/2605.08268.pdf

Authors: Xiaolin Sun, Zixuan Liu, Yibin Hu, Zizhan Zheng

Title: Insider Attacks in Multi-Agent LLM Consensus Systems

Abstract:
Large language models (LLMs) are increasingly deployed in multi‑agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi‑agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi‑agent LLM consensus systems. We formalize the problem as a sequential decision‑making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world‑model‑based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious‑prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language‑based multi‑agent systems.

PaperID: 3237, https://arxiv.org/pdf/2605.08258.pdf

Authors: John deVadoss

Title: Designing Intelligent Enterprise Agents: A Capability-Aligned Multi-Agent Architecture

Abstract:
Enterprise interest in multi‑agent systems has shifted from generic software agents to large‑language‑model (LLM) based intelligent agents that plan, use tools, maintain contextual memory, inspect intermediate results, collaborate with other agents, and sometimes act in systems of record. This paper revises the enterprise architecture thesis around a design‑first claim: governance is necessary, but it cannot be the primary organizing abstraction. The primary abstraction must be agent design ‑ capability boundaries, autonomy allocation, interaction protocols, tool and data authority, state and memory design, verification design, and human interaction design. We propose CEAD (Capability‑Aligned Enterprise Agent Design), a reference architecture for intelligent agents that uses service‑oriented architecture (SOA) as an exemplar for contracts, registries, loose coupling, and policy‑aware integration, while explicitly rejecting the idea that services are agents. It treats microservices as a cautionary precedent: decomposition without design discipline produces distributed complexity, cost, operational fragility, and agent proliferation. We evaluate CEAD over 10,000 enterprise tasks, comparing five architectures: a prompt‑first mono‑agent, a role‑based micro‑agent swarm, SOA‑brokered agents, a governance‑first but design‑poor agent grid, and the proposed CEAD architecture. CEAD achieves 70.6% safe success, versus 45.2% for the mono‑agent baseline, 23.1% for the ungoverned micro‑agent swarm, 58.8% for SOA‑brokered agents, and 50.8% for the control‑heavy, design‑poor grid. The results support the conclusion that design quality is the first‑order enterprise concern; governance, security, policy, audit, and assurance should support and enforce good design rather than substitute for it.

PaperID: 3238, https://arxiv.org/pdf/2605.08257.pdf

Authors: Saisai Hu

Title: Research on Security Enhancement Methods for Adversarial Robust Large Language Model Intelligent Agents for Medical Decision-Making Tasks

Abstract:
Motivated by the challenge to improve the adversarial robustness, security, and trust of medical decision making intelligent agents, this study develops a full‑link security enhancement framework, which describes "input risk perception ‑ medical evidence constraint ‑ knowledge consistency verification ‑ decision confidence reweighting ‑ security output control ‑ adversarial feedback update." We propose ARSM‑Agent and define a weighted joint objective consisting of decision accuracy loss, adversarial robustness loss, safety refusal loss, and knowledge consistency loss, with weights of 0.3, 0.3, 0.2, and 0.2, respectively. The whole medical decision formulation is implemented by multi‑module collaborative linkage. We verify that the algorithm is more efficient than four baselines, including LLM‑Agent, Retrieval‑Agent, Filter‑Agent, and Adv‑Train‑Agent. Under semantic perturbation, prompt injection, drug‑name confusion, and false‑evidence attacks, ARSM‑Agent reduces the overall attack success rate to 8.7% and achieves a knowledge consistency score of 0.91. Ablation experiments quantify each module's contribution: removing risk perception, evidence retrieval, consistency verification, and confidence reweighting reduces accuracy by 6.7%, 9.1%, 7.6%, and 4.4%, respectively, and increases attack success rate by 13.8%, 11.1%, 8.6%, and 6.9%. The proposed approach addresses key security issues of medical decision making intelligent agents, obtains secure decision making in challenging scenarios, and provides reliable intelligent support for medical decision‑making intelligent agents.

PaperID: 3239, https://arxiv.org/pdf/2605.08124.pdf

Authors: Bowei He

Title: Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence

Abstract:
Mobile agent systems are emerging as a key paradigm for enabling intelligent applications on edge devices and in AIoT ecosystems. However, their scalability is fundamentally constrained by limited on‑device computation and fragmented intelligence across devices. In this work, we propose a unified research agenda for scaling mobile agent systems along two complementary dimensions: (1) improving capability density of individual agents through compact foundation model design and compression, and (2) enabling collective intelligence via communication‑rich multi‑agent collaboration. Building on recent model and infrastructure advances, this vision aims to transform isolated mobile agents into a distributed intelligent system that is efficient and scalable.

PaperID: 3240, https://arxiv.org/pdf/2605.08060.pdf

Authors: Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang, Carl Kingsford, Vincent Conitzer

Title: The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

Abstract:
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi‑agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model‑‑game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward‑looking intent rather than rising paranoia. We validate this using targeted fine‑tuning as a cognitive probe: a LoRA adapter trained exclusively on forward‑looking traces mitigates the decay and transfers zero‑shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain‑of‑Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi‑agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

PaperID: 3241, https://arxiv.org/pdf/2605.07961.pdf

Authors: Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan

Title: Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

Abstract:
Federated fine‑tuning (FFT) has emerged as a privacy‑preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT‑based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT‑based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign‑like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance‑ and similarity‑based defense methods.

PaperID: 3242, https://arxiv.org/pdf/2605.07960.pdf

Authors: Patrícia Alves, Joana Neto, Ana Barreiro, Jorge Lima, Fausto Alves, Henish Balu, Luís Conceição, Goreti Marreiros

Title: Exploring a Virtual Pet to Provide Context Notifications in a Tourism Recommender System: a Pilot Study

Abstract:
While context‑aware personalization has been widely explored in modern tourism Recommender Systems (RS), the delivery of real‑time notifications remains a significant design challenge due to issues of intrusiveness and user fatigue. This paper presents a proof‑of‑concept for a tourism recommendation framework that utilizes a virtual pet as a social mediator for delivering context‑aware alerts. The system integrates real‑time environmental data ‑ including air quality, noise levels, and weather forecasts ‑ and proximity‑based notifications with a Multi‑Agent Microservice that generates personalized recommendations based on the user's personality traits and preferences. A within‑subjects pilot study (n=11) was conducted to evaluate the feasibility and user acceptance of this pet‑mediated approach. Participants interacted with two versions of the system ‑ a baseline without contextual alerts and a version featuring pet‑mediated notifications ‑ over a four‑week period (two weeks per version) in real‑world scenarios. Quantitative and qualitative data were collected to assess engagement, perceived naturalness, notification utility, and acceptance. Preliminary results suggest that the virtual pet effectively can "soften" the perceived intrusiveness of system alerts, making safety‑critical information feel more welcome and natural. Furthermore, the character‑mediated justifications significantly improved the clarity of the notifications, effectively supporting users in their real‑time travel decisions. These findings provide a foundation for using virtual pet companions to enhance the transparency and acceptance of context‑aware communication in tourism RS.

PaperID: 3243, https://arxiv.org/pdf/2605.07935.pdf

Authors: Shuren Xia, Qiwei Li, Taqiya Ehsan, Jorge Ortiz

Title: TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

Abstract:
We present TraceFix, a verification‑first pipeline for Large Language Model (LLM) multi‑agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per‑agent system prompts and executed under a runtime monitor that rejects out‑of‑topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456‑run runtime comparison shows that topology‑monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt‑only and chat‑only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC‑verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.

PaperID: 3244, https://arxiv.org/pdf/2605.07926.pdf

Authors: Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

Title: AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Abstract:
As LLM‑based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool‑grounded reasoning beyond familiar workflows and short‑range interactions. We introduce AgentEscapeBench, an escape‑room‑style benchmark that tests whether agents can infer, execute, and revise novel tool‑use procedures under explicit long‑range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty‑5 to 80.0% at difficulty‑25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long‑range state tracking, clue adherence, and intermediate‑result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general‑purpose reasoning, action, and adaptation.

PaperID: 3245, https://arxiv.org/pdf/2605.07840.pdf

Authors: Xingyue Huang, Louis Tichelman, Jinwoo Kim, Krzysztof Olejniczak, İsmail İlkan Ceylan

Title: RelAgent: LLM Agents as Data Scientists for Relational Learning

Abstract:
Relational learning is a challenging problem that has motivated a wide range of approaches, including graph‑based models (e.g., graph neural networks, graph transformers), tabular methods (e.g., tabular foundation models), and sequence‑based approaches (e.g., large language models), each with its own advantages and limitations. We propose RelAgent, an LLM‑based autonomous data scientist for relational learning, which operates in two phases. In the search phase, an LLM agent uses database, validation, and evaluation workspace tools to construct SQL feature programs and select a predictive model. In the inference phase, the resulting program is executed without further LLM calls. The final predictor consists of SQL queries and a classical model, enabling fast, deterministic, and intrinsically interpretable predictions: features are human‑readable queries, and predictions depend only on the resulting query‑defined feature map, enabling scalable deployment using standard database systems.

PaperID: 3246, https://arxiv.org/pdf/2605.07836.pdf

Authors: Xinyi Hou, Yanjie Zhao, Haoyu Wang

Title: Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

Abstract:
Model Context Protocol (MCP) have quickly become the interface layer between LLM agents and external tools, yet they also introduce unsafe data flows that existing analyzers handle poorly. Vulnerabilities manifest in two directions: requester‑controlled arguments may propagate to sensitive operations, while untrusted external or sensitive internal data may surface through MCP‑visible outputs and subsequently influence host or model behavior. Accurate detection is complicated by the heterogeneous registration and dispatch patterns MCP servers employ, the need for MCP‑specific taint semantics, and the fact that bugs often only materialize along complete tool‑scoped execution paths. We present MCP‑BiFlow, a bidirectional static analysis framework built around MCP‑aware entrypoint recovery, protocol‑specific taint modeling, and interprocedural propagation analysis. Against a benchmark of 32 confirmed MCP vulnerability cases, MCP‑BiFlow identifies 30 (93.8% recall), substantially outperforming CodeQL, Semgrep, Snyk Code, and MCPScan. Across 15,452 real‑world MCP server repositories, MCP‑BiFlow surfaces 549 overlap‑compressed candidate clusters; manual review confirms 118 vulnerability paths in 87 servers, establishing unsafe propagation as a recurring failure mode that resists detection without protocol‑aware recovery of both request‑side and return‑side flows.

PaperID: 3247, https://arxiv.org/pdf/2605.07835.pdf

Authors: Ethan Schneider, Jingkai Chen, Tianyi Gu, Kunlei Lian, Seth Hutchinson, Sonia Chernova

Title: Many-to-Many Multi-Agent Pickup and Delivery

Abstract:
Multi‑robot systems in automated warehouses must manage continuous streams of pickup‑and‑delivery tasks while ensuring efficiency and safety. Prior work on Multi‑Agent Pickup‑and‑Delivery (MAPD) has largely focused on the one‑to‑one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many‑to‑many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP‑hard four‑dimensional assignment problem. To solve the many‑to‑many MAPD problem, we contribute our algorithm: Many‑to‑Many Multi‑Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M‑wSKU). Simulation results over 8‑hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.

PaperID: 3248, https://arxiv.org/pdf/2605.07773.pdf

Authors: Xin Xu, Jiu Zhang, Xiao-Ling Lei, Xiong-Fei Jiang, Long Xiong

Title: Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents

Abstract:
The short‑board effect, analogous to Liebig's Law of the Minimum, postulates that the collective performance of a team is constrained by its weakest component. This principle has profound implications for the optimization of collaboration in a variety of contexts, including management, education, and organizational structures. Despite its theoretical significance, empirical validation remains elusive due to challenges of assessing individual capabilities, controlling real‑world variables, and data biases towards successful outcomes, as well as high employee turnover.To address this absence of knowledge, we employ multi‑agents driven by large language models to simulate a teamwork with standard operating procedure, revealing the relationship between individual capability and collective team performance.In homogeneous team configurations, three capability regimes are observed, particularly the Sisyphus predicament state at the critical capability threshold characterized by extensive ineffective efforts and pseudo‑high efficiency. Furthermore, with a single weak link quantifying the short‑board effect, we highlight different impacts across core and non‑core members on the team performance.More importantly, when the team exhibits multiple weak links, a cumulative product effect emerges, demonstrating that team performance is shaped by the aggregated impact of all weaknesses rather than the weakest link solely.This suggests that mitigation strategies should extend beyond the remediation of individual weak links.These findings rigorously elaborate the short‑board theory and provide actionable insights to optimize team management, organizational operations, and supply chain resilience.

PaperID: 3249, https://arxiv.org/pdf/2605.07744.pdf

Authors: Yu Kumagai, Keisuke Okumura

Title: Alternating Target-Path Planning for Scalable Multi-Agent Coordination

Abstract:
The concurrent target assignment and pathfinding (TAPF) problem extends multi‑agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision‑free paths to agents. Prior work on TAPF has relied exclusively on Conflict‑Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute‑intensive, non‑scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that feedback‑driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state‑of‑the‑art CBS‑based solver while maintaining decent solution quality. This represents a solid step toward practical, large scale TAPF suitable for real‑world setups.

PaperID: 3250, https://arxiv.org/pdf/2605.07737.pdf

Authors: Bowei Ning, Xuejun Zong, Lian Lian, Kan He, Yifei Sun, Yuxiang Lei, Plamen Vasilev

Title: Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

Abstract:
Automated vulnerability detection in critical‑infrastructure software confronts a fundamental barrier: industrial software is routinely deployed as stripped, symbol‑free binaries that deprive conventional Software Composition Analysis of the source‑level transparency it requires. Existing binary analysis techniques close this Semantic Gap only partially ‑‑ graph‑based detectors preserve structural syntax but discard behavioral semantics, while large language models supply rich semantic cues at the cost of unstable, hallucination‑prone inference. To address this gap, we present a semantic‑enhanced neuro‑symbolic framework that reconstructs behavioral semantics directly from opaque binaries and performs tractable global risk reasoning. Three tightly coupled mechanisms drive this capability: (1) abstract interpretation combined with a reflexive prompting pipeline that structurally constrains a local LLM agent, effectively suppressing hallucinations; (2) a surjective transformation that compresses raw Code Property Graphs into typed Software Supply Chain Knowledge Graphs amenable to scalable reasoning; and (3) a domain‑adapted Graphormer that captures long‑range vulnerability propagation, augmented by embedding‑space subgraph matching to uncover zero‑day and APT‑style attack patterns. Evaluated across three benchmarks of increasing domain specificity, the framework consistently outperforms all baselines on detection accuracy, semantic lifting fidelity, and APT fingerprint matching. Deployment on a hybrid virtual‑physical testbed incorporating production‑grade hardware from five ICS vendors further confirms strong detection coverage of high‑impact CVEs while substantially reducing false‑positive rates relative to leading commercial tools.

PaperID: 3251, https://arxiv.org/pdf/2605.07728.pdf

Authors: Gaston Besanson

Title: SARC: A Governance-by-Architecture Framework for Agentic AI Systems

Abstract:
Agentic AI systems increasingly act through tools, sub‑agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post‑hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool‑using agents that treats constraints as first‑class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre‑Action Gate, an Action‑Time Monitor, a Post‑Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification‑trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi‑agent workflows through constraint propagation, authority intersection, and attribution‑preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post‑hoc audit, output filtering, workflow rules, and policy‑as‑code‑only baselines on a procurement task. SARC executes zero hard‑constraint violations under exact predicates; its declared PAA throttling response reduces soft‑window overages by 89.5% relative to policy‑as‑code‑only. Predicate‑noise and enforcement‑failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement‑stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.

PaperID: 3252, https://arxiv.org/pdf/2605.07671.pdf

Authors: Lauri Lovén, Sasu Tarkoma

Title: The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

Abstract:
Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non‑accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism‑design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non‑affine approval function to screen types, yet any non‑affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed‑form perturbation formula. A constructive escape exists: a step‑function approval threshold achieves first‑best screening for every strictly proper scoring rule, because the agent's binary inflate‑or‑not choice creates a type‑space threshold regardless of the generator's curvature. Under the Brier score specifically, the type‑independent inflation cost yields a welfare equivalence between second‑best and first‑best; we prove this equivalence is unique to Brier (the welfare gap under smooth C^1 oversight is bounded below by Ω(\textVar(1/G'') (γ/β)^2) for every non‑Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism‑design domain). The message for AI alignment is direct: smooth scoring‑based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration‑preserving design.

PaperID: 3253, https://arxiv.org/pdf/2605.07649.pdf

Authors: Berkehan Ünal, Hauke Dierend, Dren Fazlija, Christopher Plachetka

Title: Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Abstract:
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well‑positioned to translate research into practical, stakeholder‑driven use cases across well‑defined domains. However, for a wide‑scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision‑language models (VLMs) integrate visual recognition and language reasoning, functioning without task‑specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero‑shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero‑shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero‑shot optimization strategies with a cost‑performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition‑anchored chain‑of‑thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD‑based perception in safety‑critical applications.

PaperID: 3254, https://arxiv.org/pdf/2605.07646.pdf

Authors: Yinsheng Yao, Jiehao Tang, Zhaozhen Yang, Dawei Cheng

Title: MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

Abstract:
While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high‑stakes applications. We propose MAVEN (Multi‑Agent Verification‑Elaboration Network with In‑Step Epistemic Auditing), a blackboard‑inspired framework designed to transform LLMs into deliberate reasoners through explicit role‑decoupling. At its core, MAVEN operationalizes an adversarial Skeptic‑Researcher‑Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine‑grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI‑3.1‑Pro and consensus‑based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post‑hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model‑agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.

PaperID: 3255, https://arxiv.org/pdf/2605.07637.pdf

Authors: Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Oleg Bulichev, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

Title: Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

Abstract:
Multi‑agent pathfinding (MAPF) is a widely used abstraction for multi‑robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP‑hard, scalable and efficient solvers are critical for real‑world applications such as logistics and search‑and‑rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec‑POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi‑agent Pathfinding (LC‑MAPF), a generalizable pre‑trained model that applies multi‑round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning‑based MAPF solvers, including IL and RL‑based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC‑MAPF's scalability, a common bottleneck for communication‑based MAPF solvers.

PaperID: 3256, https://arxiv.org/pdf/2605.07547.pdf

Authors: Haiyuan Li, Yulei Wu, Dimitra Simeonidou

Title: Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN

Abstract:
AI‑RAN consolidates AI services and Radio Access Network (RAN) functions onto a unified, GPU‑accelerated infrastructure at the network edge. However, compute sharing between real‑time RAN functions and highly heterogeneous AI services requires coordination of scheduling decisions at mismatched timescales, and placement adaptation may require service migration across nodes with non‑negligible interruptions. This paper proposes a hierarchical agentic framework (HAF) for compute sharing in AI‑RAN that combines a large language model (LLM)‑based agent for slow‑timescale placement of AI services and RAN functions with a closed‑form, deadline‑aware convex algorithm for fast‑timescale GPU/CPU allocation. The LLM agent is further equipped with a predictive critic that filters out migrations when the induced service interruption outweighs the expected service‑level objective (SLO) benefit. Experimental results show that HAF reaches 90.0% overall SLO fulfillment, a 20.5% improvement over the strongest baseline, and raises AI service request fulfillment from 51% to 85.3%. Further evaluations show that HAF retains its advantage under diverse load conditions, while the critic consistently improves SLO fulfillment across multiple open‑source LLM agents.

PaperID: 3257, https://arxiv.org/pdf/2605.07509.pdf

Authors: Yang Liu, Hongjiang Feng, Junsong Pu, Zhuangbin Chen

Title: MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

Abstract:
Failure attribution in LLM‑based multi‑agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill‑stage signals from a small language model (SLM). MASPrism first extracts token‑level negative log‑likelihood and attention weights during a prefill pass to identify symptom‑like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure‑source candidates. Using Qwen3‑0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top‑1 accuracy on Who&When‑HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini‑2.5‑Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69× speedup over the single‑pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi‑agent execution logs.

PaperID: 3258, https://arxiv.org/pdf/2605.07472.pdf

Authors: Vickson Ferrel

Title: HBEE: Human Behavioral Entropy Engine -- Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion

Abstract:
Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM‑driven adaptive insider in a controlled multi‑agent simulator. Our pre‑registered five‑condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no‑mole control, across 100 runs (95 valid after pre‑committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in‑degree is statistically lower than a randomly selected innocent agent (Cliff's delta = ‑0.694, 95% BCa CI [‑0.855, ‑0.519], Mann‑Whitney p << 0.01). The pre‑registered prediction was the opposite direction. A pre‑registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in‑degree and per‑agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre‑registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message‑exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer‑suspicion‑cascade detection inverts. We release the simulator, pre‑registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open‑source license.

PaperID: 3259, https://arxiv.org/pdf/2605.07357.pdf

Authors: Xingtong Yu, Zhongwei Kuai, Chang Zhou, Xuanting Xie, Renhe Jiang, Xikun Zhang, Hong Cheng, Xinming Zhang, Yuan Fang

Title: GraphReAct: Reasoning and Acting for Multi-step Graph Inference

Abstract:
Reasoning‑acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi‑step inference. In this work, we propose GraphReAct, a graph reasoning‑acting framework that enables step‑by‑step inference over graph‑structured data. Specifically, we design a graph‑based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non‑local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi‑step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state‑of‑the‑art methods, validating the effectiveness of reasoning‑acting for graph learning.

PaperID: 3260, https://arxiv.org/pdf/2605.07306.pdf

Authors: Zhaohui Du, Zhe Wang, Hongmei Fei, Xiwen Cao, Ting Xiao, Qi Wang, Huanbo Jin, Jiaming Gu, Quan Lu, Zhe Liu

Title: BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

Abstract:
Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet‑lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi‑step procedures require state‑aware execution beyond one‑shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics‑oriented interfaces. Here, we introduce BioProVLA‑Agent, an affordable, protocol‑driven, vision‑enhanced embodied multi‑agent system enabled by Vision‑Language‑Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed‑loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM‑RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet‑lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high‑exposure settings, AugSmolVLA improves execution stability over ACT, X‑VLA, and the original SmolVLA, especially for precise placement, transparent‑object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol‑centered, and verification‑capable embodied AI for biological manipulation.

PaperID: 3261, https://arxiv.org/pdf/2605.07301.pdf

Authors: Shiyue Cao, Pei Xu, Likun Yang, Lei Cui, Xiaotang Chen, Kaiqi Huang

Title: SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model

Abstract:
Accurately predicting opponents' behavior from interactions is a fundamental capability for large language model (LLM)‑based agents in multi‑agent and game‑theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two‑stage opponent modeling framework that distinctly separates opponent model construction and opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph‑based formalism for representing dependencies among variables, to capture directed links between opponents' observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi‑agent benchmarks demonstrate that SOM consistently outperforms state‑of‑the‑art LLM‑based reasoning baselines, enabling more accurate and adaptable strategic decision‑making in complex and dynamic multi‑agent interactions.

PaperID: 3262, https://arxiv.org/pdf/2605.07299.pdf

Authors: Dongchuan Ran, Linyu Ou, Xueheng Li, Wenwen Tong, Chenxu Guo, Hewei Guo, Kaibing Wang, Lewei Lu

Title: EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams

Abstract:
Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human‑machine interactions (HMI).In this paper, we introduce EgoPro‑Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set.Unlike previous works, EgoPro‑Bench leverages simulated user profiles to generate diverse user intentions and to construct high‑fidelity HMI data across 12 distinct domains.Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low‑latency interaction on streaming video data, and conduct comprehensive evaluations.Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance.The experiments demonstrate that EgoPro‑Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next‑generation user‑centric proactive interactive agents.

PaperID: 3263, https://arxiv.org/pdf/2605.07251.pdf

Authors: Yuyang Wu, Yue Huang, Shuaike Shen, Xujian Wang, Shuhao Zhang, Qiyao Xue, Weichen Liu, Runtian Gao, Jian Ma, Xiangliang Zhang, Olexandr Isayev

Title: Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

Abstract:
Large Language Models (LLMs) have become increasingly capable as tool‑using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain‑specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM‑as‑judge scoring rather than exact, judge‑free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage‑level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise‑injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open‑weight, and chemistry‑specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage‑level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non‑convergent tool use.

PaperID: 3264, https://arxiv.org/pdf/2605.07240.pdf

Authors: Xiangyu Liu, Liang Zhang, Bo Jin, Ziqi Wei

Title: Rethinking Priority Scheduling for Sequential Multi-Agent Decision Making in Stackelberg Games

Abstract:
Current research applying N‑level Stackelberg Game to multi‑agent systems often uses the default decision order of agents provided by the environment. However, this raises the question: does the order of agents necessarily affect the final equilibrium point of the game? To address this, we formally analyze the N‑level Stackelberg Game, where changing the order in which agents make decisions typically leads to an overdetermined system. As a result, the equilibrium point shifts unless special structural conditions are satisfied. Based on this analysis, we propose the Hierarchical Priority Adjustment (HPA) method, which adjusts and selects the agents' decision order. At the upper level, an upper policy dynamically selects the optimal decision order of agents based on the current game state. At the lower level, agents execute strategies in the Spatio‑Temporal Sequential Markov Game (STMG) according to the selected order. To coordinate learning across time scales, we employ a slow‑fast update scheme with shared intrinsic rewards derived from the advantage function of the upper policy. Experimental results on high‑precision control tasks, including multi‑agent MuJoCo, show that HPA outperforms benchmark algorithms and robustly adapts to changing environments. These results highlight the crucial role of optimizing the agents' decision order in N‑level Stackelberg Game.

PaperID: 3265, https://arxiv.org/pdf/2605.07214.pdf

Authors: Yuping Yan, Jirui Han, Fei Ming, Yuanshuai Li, Yaochu Jin

Title: HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization

Abstract:
Large Language Models have recently emerged as a promising paradigm for automated heuristic design for NP‑hard combinatorial optimization problems. Despite this progress, existing LLM‑based methods typically rely on monolithic workflows constrained by rigid templates, thereby restricting memory‑guided exploration and triggering premature convergence to local optima. To design an autonomous and collaborative architecture, we introduce HMACE, a Heterogeneous Multi‑Agent Collaborative Evolution framework that reconceptualizes heuristic search as an organizational design problem. HMACE decomposes each evolutionary generation into an autonomous, role‑specialized loop with four coordinated agents: a Proposer for strategy exploration, a Generator for executable heuristic synthesis, an Evaluator for empirical assessment, and a Reflector for archive‑backed memory update. By coupling behavior‑aware retrieval, lightweight candidate filtering, and fitness‑grounded archive updates, HMACE guides the search toward diverse and promising heuristic behaviors while avoiding redundant evaluations. Extensive evaluations on representative COPs, including TSP, Online BPP, MKP, and PFSP, show that HMACE achieves a favorable quality‑efficiency trade‑off compared to state‑of‑the‑art single‑agent and multi‑agent baselines. In the matched LLM‑driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP (0.464% and 0.223%, respectively), while requiring only 0.13M and 0.42M tokens for the two tasks, substantially fewer than the compared baselines.

PaperID: 3266, https://arxiv.org/pdf/2605.07180.pdf

Authors: Yimin Wang, Jiahao Qiu, Xuan Qi, Xinzhe Juan, Jingzhe Shi, Zelin Zhao, Hongru Wang, Shilong Liu, Mengdi Wang

Title: Learning Agent Routing From Early Experience

Abstract:
LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting‑edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold‑start settings. To address this, we propose BoundaryRouter, a training‑free routing framework that uses early behavioral experience and rubric‑guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in‑domain, paraphrased, and out‑of‑domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt‑based and retrieval‑only routing by an average of 37.9% and 8.2%, respectively.

PaperID: 3267, https://arxiv.org/pdf/2605.07174.pdf

Authors: Shiyue Cao, Pei Xu, Likun Yang, Lei Cui, Shizhao Yu, Shiyu Zhang, Yongjian Ren, Xiaotang Chen, Kaiqi Huang

Title: Repeated Deceptive Path Planning against Learnable Observer

Abstract:
We study the problem of deceptive path planning (DPP), where an agent aims to conceal its true destination from external observers. While existing work assumes static, non‑learning observers, real‑world adversaries‑such as in critical goods transportation or military operations‑can adapt by learning from historical trajectories. To address this gap, we introduce Repeated Deceptive Path Planning (RDPP), a new formulation that explicitly models learnable observers. We show that existing DPP methods fail under this setting, as they cannot adapt to evolving adversarial predictions. While incorporating observer previous predictions into updates enables some adaptation, such incremental updates cause accumulative lag that degrades deception. To this end, we propose Deceptive Meta Planning (DeMP), a two‑level optimization framework that combines episode‑level adaptation, which enables short‑term policy adjustment to counter updated observer, and meta‑level updates, which leverage cross‑episode feedback to capture how observers update their models and accelerate adaptation in future episodes. In this way, DeMP mitigates the accumulation of adaptation lag, enabling sustained deception against a learning observer. Experiments across environments demonstrate that DeMP significantly outperforms existing approaches in RDPP while maintaining competitive path cost. Our results highlight the importance of modeling repeated interactions with learnable adversaries, providing new insights into deception and privacy in multi‑agent systems.

PaperID: 3268, https://arxiv.org/pdf/2605.07147.pdf

Authors: Zixuan Xie, Xinyu Liu, Shangtong Zhang

Title: MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Abstract:
The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow the Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLM models and LLM agents struggle to distinguish merge‑ready PRs from build‑passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge‑ready Mathlib contributions.

PaperID: 3269, https://arxiv.org/pdf/2605.07122.pdf

Authors: Zhaoxi Zhang, Yiming Xu, Jiahui Liang, Weikang Li, Xiaoshuai Chen, Liwei Qian, Xin Pei, Jizhou Huang, Run Sun, Yunfang Wu

Title: RepoZero: Can LLMs Generate a Code Repository from Scratch?

Abstract:
Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch‑based editing or rely on human or LLM‑based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution‑based verification of repository‑level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re‑implement an entire repository such that its behavior matches the original implementation. This design allows for strict black‑box validation via output equivalence, while naturally supporting large‑scale construction by reusing existing open‑source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross‑language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code‑Test Evolution (ACE) framework that performs iterative test generation and error‑driven refinement, enabling effective test‑time scaling for repository‑level synthesis. Extensive experiments across multiple state‑of‑the‑art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30% ‑ 55%), exposing a substantial gap between current capabilities and real‑world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end‑to‑end code generation, and highlight self‑verification via test generation as a critical direction for advancing LLM‑based coding agents.

PaperID: 3270, https://arxiv.org/pdf/2605.07101.pdf

Authors: Yuyang Zhang, Haldun Balim, Na Li

Title: Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy‑based policy updates. In practice, however, such energy‑based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi‑modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous‑action MARL benchmarks, including multi‑agent particle environment, multi‑agent MuJoCo, IsaacLab, and JAX‑reimplemented StarCraft multi‑agent challenge, and observe consistently improved performance.

PaperID: 3271, https://arxiv.org/pdf/2605.07069.pdf

Authors: Lynnette Hui Xian Ng, Iain J. Cruickshank, Adrian Xuan Wei Lim, Kathleen M. Carley

Title: Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

Abstract:
Agentic AI systems are increasingly deployed not in isolation, but inside social environments populated by other agents and humans, such as in social media platforms, multi‑agent LLM pipelines or autonomous robotics fleets. In these settings, system behavior emerges not from individual agents alone, but from the multi‑agent interactions over time. Emergent dynamics of individuals in a social group have been long studied by social scientists in human contexts. This position paper argues that agentic AI systems must be modeled with social theory as a structural prior, and formalizes a Multi‑Agent Social Systems (MASS) framework for how agents interact and influence to generate system‑level outcomes. We represent MASS as a class of dynamical system of information generation, local influence and interaction structure, formulated by four structural priors anchored in social theory: strategic heterogeneity, networked‑constrained dependence, co‑evolution and distributional instability. We demonstrate the importance of each structural prior through formal propositions, and articulate a research agenda for how MASS should be modeled, evaluated and governed.

PaperID: 3272, https://arxiv.org/pdf/2605.07058.pdf

Authors: Yicheng Gao, Xiaolin Zhou, Yahan Li, Yue Zhao, Ruishan Liu

Title: MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

Abstract:
Real‑world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interaction with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single‑turn question answering, noise‑free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, MedExAgent, through a two‑stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary‑Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost‑efficient examination strategies.

PaperID: 3273, https://arxiv.org/pdf/2605.07042.pdf

Authors: Chinmaya Kausik, Adith Swaminathan, Nathan Kallus

Title: The Context Gathering Decision Process: A POMDP Framework for Agentic Search

Abstract:
Large Language Model (LLM) agents are deployed in complex environments ‑‑ such as massive codebases, enterprise databases, and conversational histories ‑‑ where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate‑based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug‑and‑play interventions for iterative LLM agents: a persistent, predicate‑based belief state that bounds context while preserving multi‑hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question‑answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP‑motivated belief state improves multi‑hop reasoning by up to 11.4%; while the modular programmatic exhaustion detection saves up to 39% of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non‑interfering improvements to agentic search harnesses.

PaperID: 3274, https://arxiv.org/pdf/2605.07040.pdf

Authors: Hyeongdon Moon, Carolyn Rosé, John Stamper

Title: Cognitive Agent Compilation for Explicit Problem Solver Modeling

Abstract:
Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem‑solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem‑solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade‑offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded‑knowledge AI for educational applications.

PaperID: 3275, https://arxiv.org/pdf/2605.07018.pdf

Authors: JaeWon Kim

Title: Problem Space Attunement in Youth Social Media Design

Abstract:
Social media is central to how young people maintain relationships, develop identity, and access communities, yet dominant platform designs often leave youth feeling disempowered rather than supported. My dissertation argues that youth social media design is shaped by three forms of problem‑space misattunement. Conceptual misattunement occurs when the language of ``social media'' anchors participants to existing platforms' interaction templates. I address this through a Fictional Inquiry design workshop that frees youth from preconceived notions of social media by having them brainstorm ways to ``magically connect with remote wizard friends'' rather than ideas for ``social media.'' Definitional misattunement occurs when researchers define what ``better'' means on youth's behalf. I address this through a Discord‑based asynchronous community that supports youth‑led collective inquiry. Evaluative misattunement occurs when participants are asked to judge static or hypothetical designs. I address this through an ego‑anchored, LLM‑agent simulation sandbox. Together, these studies develop youth‑grounded criteria and design directions for relationally supportive social media.

PaperID: 3276, https://arxiv.org/pdf/2605.07011.pdf

Authors: Da Long, Lingyi Fu, Diya Michelle Rao, Jasmine Ruales Carrera, Yang Bai, Shandian Zhe

Title: Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization

Abstract:
Motivational‑interviewing‑based health coaching is an effective approach for improving mental health and promoting healthy behavior change. However, the scarcity of trained human coaches and the high cost of coaching services make such support inaccessible to many people who could benefit from it. This motivates the development of AI health coaches that can provide scalable and affordable support. Existing methods typically optimize only one side of the interaction: they either train a dialogue agent against a fixed client environment or train a client simulator against a fixed assistant. This one‑sided setup can limit exploration of the interaction space and may be inefficient at developing the capabilities required by the target agent and pushing its performance boundaries. In this paper, we propose a dual‑agent framework that interactively co‑trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto‑dominant response pairs identified by a multi‑dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co‑training process admits a natural stochastic‑game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.

PaperID: 3277, https://arxiv.org/pdf/2605.07001.pdf

Authors: Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea

Title: SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Abstract:
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross‑module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code‑level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell‑type‑specific optimized prompts and supports iterative multi‑step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard‑severity architectural smells detected by PyExamine in the Python project scikit‑learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to κ= 0.94 expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross‑module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.

PaperID: 3278, https://arxiv.org/pdf/2605.06992.pdf

Authors: Yonatan Slutzky, Yotam Alexander, Tomer Slor, Yoav Nagel, Nadav Cohen

Title: Why Does Agentic Safety Fail to Generalize Across Tasks?

Abstract:
AI agents are increasingly deployed in multi‑task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear‑quadratic control with H_\infty‑robustness, and prove that the mapping from task specification to an optimal controller has higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.

PaperID: 3279, https://arxiv.org/pdf/2605.06988.pdf

Authors: David Farr, Iain Cruickshank, Kate Starbird, Jevin West

Title: The Cost of Consensus: Malignant Epistemic Herding and Adaptive Gating in Distributed Multi-Agent Search

Abstract:
Distributed agents in real‑world settings frequently must coordinate under uncertainty with only partial observations. Coordination is necessary to share beliefs to aid in task completion, but communication costs bandwidth, introduces latency, and if done poorly, can degrade collective reasoning. This tension is especially acute in bandwidth‑constrained deployments such as distributed sensing networks, autonomous reconnaissance, and collaborative cyber defense, where excessive transmission carries direct operational costs. Existing work has focused on multi‑agent exploration and communication strategies, but not on how communication frequency and content jointly shape the collective belief state. Central to this challenge is the degree to which agents maintain compatible internal beliefs about the environment, a property we term epistemic alignment. When agents share beliefs effectively, they converge on correct hypotheses; when communication is poorly designed, agents may converge confidently on wrong ones. We formalize this distinction and show it is not detectable from coordination metrics alone such as Jensen‑Shannon Divergence or rate to consensus.

PaperID: 3280, https://arxiv.org/pdf/2605.06957.pdf

Authors: Shirin Sohrabi, Haritha Ananthakrishnan, Harsha Kokel, Kavitha Srinivas, Michael Katz

Title: Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

Abstract:
We present a dynamic policy‑learning approach that combines generalized planning and hierarchical task decomposition for LLM‑based agents. Our method, Hierarchical Component Learning for Generalized Policies (HCL‑GP ), learns parameterized policies that generalize across task instances and automatically extracts reusable components from successful executions, organizing them into a component library for compositional policy generation. We address three challenges: (1) learning components through automated decomposition, (2) generalizing components to maximize reuse, and (3) efficient retrieval via semantic search. Evaluated on the AppWorld benchmark, our approach achieves 98.2% accuracy on normal tasks and 97.8% on challenge tasks with unseen applications, improving 15.8 points over static synthesis on challenging scenarios. For open‑source models, dynamic reuse enables 62.5% success versus near‑zero without reuse. This demonstrates that classical planning concepts can be effectively integrated with LLM agents for improved accuracy and efficiency.

PaperID: 3281, https://arxiv.org/pdf/2605.06937.pdf

Authors: Teo Susnjak

Title: A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis

Abstract:
This methods article presents a reproducible calibration workflow for prompt‑based large language models (LLMs) in structured evidence‑synthesis tasks. The method separates the rules that define the scientific task from the mutable prompt harness that frames and applies them. It optimises that harness against labelled or reference examples and an explicit task metric, then preserves the calibrated workflow as an inspectable artefact with its specification, metric, settings, and evaluation traces. The example code instantiates the protocol with DSPy and GEPA tools, but the underlying logic can transfer to other prompt‑optimisation frameworks that support structured task definitions, metric‑guided search, and artefact reuse. Title and abstract screening is the worked validation case because it provides labelled benchmark data and clear evaluation metrics. The demonstrated workflow uses a smaller student LLM for performing the scientific task execution and a larger reflection LLM to steer the prompt optimisation process during calibration. This work shows compilation, artefact round‑tripping, and how optimisation budget affects a smaller student model.

PaperID: 3282, https://arxiv.org/pdf/2605.06933.pdf

Authors: Sepideh Avizheh, Tushin Mallick, Alina Oprea, Cristina Nita-Rotaru, Reihaneh Safavi-Naini

Title: MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security

Abstract:
Our computing ecosystem is being transformed by two emerging paradigms: the increased deployment of agentic AI systems and advancements in quantum computing. With respect to agentic AI systems, one of the most critical problems is creating secure governing architectures that ensure agents follow their owners' communication and interaction policies and can be held accountable for the messages they exchange with other agents. With respect to quantum computing, existing systems must be retrofitted and new cryptographic mechanisms must be designed to ensure long‑term security and quantum resistance. In fact, NIST recommends that standard public‑key cryptographic algorithms, including RSA, Diffie‑Hellman (DH), and elliptic‑curve constructions (ECC), be deprecated starting in 2030 and disallowed after 2035. In this paper, we present MAGIQ, a framework for policy definition and enforcement in multi‑agent AI systems using novel, highly efficient, quantum‑resistant cryptographic protocols with proven security guarantees. MAGIQ (i) allows users to define rich communication and access‑control policy budgets for agent‑to‑agent sessions and tasks, including global budgets for one‑to‑many agent sessions; (ii) enforces such policies using post‑quantum cryptographic primitives; (iii) supports session‑based enforcement of policies for agent‑to‑agent and one‑to‑many agent sessions; and (iv) provides accountability of agents to their users through message attribution. We formally model and prove the correctness and security of the system using the Universal Composability (UC) framework. We evaluate the computation and communication overhead of our framework and compare it with the state‑of‑the‑art agentic AI framework SAGA. MAGIQ is a first step toward post‑quantum‑secure solutions for agentic AI systems.

PaperID: 3283, https://arxiv.org/pdf/2605.06908.pdf

Authors: Ziming Li, Jiatan Huang, Xiaoguang Guo, Guilin Wang, Chuxu Zhang

Title: Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents

Abstract:
Adaptive test‑time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence‑, uncertainty‑, or difficulty‑based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility‑calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong‑direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision‑difficult states where rollouts help compare alternatives, or intervention‑unsuitable states where the current context does not support useful rollout‑based improvement. Under this two‑source model, fixed‑direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction‑Informed Adaptive Learning), a sparse gate trained from signal‑agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success‑cost trade‑off than fixed‑direction baselines.

PaperID: 3284, https://arxiv.org/pdf/2605.06869.pdf

Authors: Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

Title: Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Abstract:
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre‑trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision‑making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision‑making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium‑compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre‑built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT‑5 mini leads overall at 0.309 oracle‑normalized score while PPO dominates planning and multi‑agent tasks; the reasoning harness multiplies LLM performance by 3‑10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability‑decomposed, multi‑modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post‑training of foundation models in truly sequential environments.

PaperID: 3285, https://arxiv.org/pdf/2605.06864.pdf

Authors: John Wang, Mengfan Xu

Title: Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization

Abstract:
We study multi‑objective multi‑agent multi‑armed bandits (MO‑MA‑MAB) under stochastic rewards, where agents observe heterogeneous reward vectors and communicate over time‑varying graphs. We formulate this emerging problem setting to address \emphefficient learning, measured by Pareto regret, and incorporate \emphfair learning as an additional goal, captured via social welfare. To measure efficiency, we formulate Pareto regret and develop \textscPareto UCB1 Gossip, whose novel exploration radius explicitly separates statistical uncertainty in Pareto‑based inference from consensus error. To express the fairness constraint, we formulate a Nash Social Welfare objective over preference‑scalarized rewards and propose \textscSimulated NSW UCB Gossip, which integrates preference‑based reward simulation, gossip‑based utility estimation, and UCB‑style exploration. We prove that \textscPareto UCB1 Gossip achieves $\mathcalO(\log T)$ regret and an instance‑independent rate of $\mathcalO(\sqrtT)$, while \textscSimulated NSW UCB Gossip achieves an instance‑independent regret bound of $\mathcalO(T^3/4)$. This separation reveals the cost of imposing the fairness constraint to our efficiency objective: fairness limits information aggregation and slows convergence. Experiments show that our methods consistently outperform baselines, improving performance by approximately $100%$ and $50%$ in the efficiency and fairness settings, respectively.

PaperID: 3286, https://arxiv.org/pdf/2605.06825.pdf

Authors: Rohan Patil, Jai Malegaonkar, Henrik I. Christensen

Title: Randomness is sometimes necessary for coordination

Abstract:
Full parameter sharing is standard in cooperative multi‑agent reinforcement learning (MARL) for homogeneous agents. Under permutation‑symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross‑attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower‑ranked peers from agent‑to‑agent attention while leaving task attention fully unmasked. This realizes a random‑bit coordination protocol in a single broadcast round, and the set‑based attention enables zero‑shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves 1.0 success while all deterministic baselines plateau near 0.5. On control coordination tasks, a policy trained on N=4 generalizes zero‑shot to N \in [2,8]. On SMACLite cross‑scenario transfer, we achieve zero‑shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout‑based randomness results in a 0% win rate, confirming that protocol‑space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness‑137A/

PaperID: 3287, https://arxiv.org/pdf/2605.06812.pdf

Authors: Chaofan Li, Lyuye Zhang, Jintao Zhai, Siyue Feng, Xichun Yang, Huahao Wang, Shihan Dou, Yu Ji, Yutao Hu, Yueming Wu, Yang Liu, Deqing Zou

Title: Towards Security-Auditable LLM Agents: A Unified Graph Representation

Abstract:
LLM‑based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi‑agent collaboration. However, this semantics‑driven execution paradigm creates a severe semantic gap between low‑level physical events and high‑level execution intent, making post‑hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive‑state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent‑BOM, a unified structural representation for agent security auditing. Agent‑BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long‑term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent‑BOM, we develop a graph‑query‑based paradigm for path‑level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent‑BOM from live executions. Evaluation on representative real‑world agentic attack scenarios shows that Agent‑BOM can reconstruct stealthy attack chains, including cross‑session memory poisoning and tool misuse, capability supply‑chain hijacking and unexpected code execution, multi‑agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent‑BOM provides a unified and auditable foundation for root‑cause analysis and security adjudication in complex agentic ecosystems.

PaperID: 3288, https://arxiv.org/pdf/2605.06737.pdf

Authors: Cheonsu Jeong, Younggun Shin

Title: A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

Abstract:
Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability‑aware self‑healing framework for LLM‑based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of failure types and introduce a quantitative reliability assessment model. Next, we propose a failure detection method that identifies abnormal agent behavior based on execution patterns and output consistency. Finally, we design a self‑healing mechanism that dynamically recovers from failures through adaptive replanning and corrective prompting strategies. The proposed framework was implemented in a multi‑agent workflow environment and evaluated using real‑world task scenarios. Experimental results demonstrate that our approach significantly increases task success rates, reduces failure propagation, and enhances overall system robustness compared to existing methods. In particular, this study distinguishes itself by establishing an integrated monitoring system that combines the agent's internal reasoning process with external execution results. These findings are expected to contribute to securing the stability of advanced autonomous systems and lowering the barriers to LLM adoption in production environments.

PaperID: 3289, https://arxiv.org/pdf/2605.06731.pdf

Authors: Xiaoyu Xu, Minxin Du, Qipeng Xie, Haobin Ke, Qingqing Ye, Haibo Hu

Title: When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

Abstract:
Personalized LLM agents maintain persistent cross‑session state to support long‑horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user‑agent interactions can gradually reshape an agent's long‑term state, inadvertently weakening future confirmation boundaries, expanding tool‑use defaults, and escalating autonomous behavior over time. We formalize this risk as unintended long‑term state poisoning. To systematically study it, we introduce the Unintended Long‑Term State Poisoning Bench (ULSPB), a bilingual benchmark comprising 350 settings spanning five assistance categories, seven interaction patterns, 24‑turn routine interactions, and matched single‑injection counterparts. Furthermore, we define the \emphHarm Score (HS), a state‑centric metric that quantifies \emphauthorization drift, \emphtool‑use escalation, and \emphunchecked autonomy. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single‑injection is generally effective, routine conversations alone can substantially poison long‑term state, primarily corrupting memory‑centric artifacts. Evaluations seeded with real‑world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose StateGuard, a lightweight, post‑execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false‑negative rates, with acceptable high false‑positive rates under a safety‑first writeback defense and minimal overhead.

PaperID: 3290, https://arxiv.org/pdf/2605.06716.pdf

Authors: Jinghao Luo, Yuchen Tian, Chuxue Cao, Ziyang Luo, Hongzhan Lin, Kaixin Li, Chuyi Kong, Ruichao Yang, Jing Ma

Title: From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

Abstract:
Large Language Model (LLM)‑based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remains fragmented, oscillating between operating system engineering and cognitive science. This theoretical divide prevents a unified view of technological synthesis and a coherent evolutionary perspective. To bridge this gap, this survey proposes a novel evolutionary framework for LLM agent memory mechanisms, formalizing the development process into three stages: Storage (trajectory preservation), Reflection (trajectory refinement), and Experience (trajectory abstraction). We first formally define these three stages before analyzing the three core drivers of this evolution: the necessity for long‑range consistency, the challenges in dynamic environments, and the ultimate goal of continual learning. Furthermore, we specifically explore two transformative mechanisms in the frontier Experience stage: proactive exploration and cross‑trajectory abstraction. By synthesizing these disparate views, this work offers robust design principles and a clear roadmap for the development of next‑generation LLM agents.

PaperID: 3291, https://arxiv.org/pdf/2605.06713.pdf

Authors: Christopher Koch

Title: Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand

Abstract:
Agentic AI systems can plan, call tools, inspect code, interact with web applications, and coordinate multi‑step workflows. These same capabilities change the economics of cyber offense. The central near‑term risk is not that every low‑skill criminal immediately becomes a frontier exploit researcher; it is that agentic AI compresses the attack lifecycle by lowering the cost of reconnaissance, phishing, credential abuse, vulnerability triage, exploit adaptation, and post‑compromise decision support. This paper synthesizes current public evidence from national cybersecurity agencies, industry threat reports, agent security guidance, and research on LLM agents cyber capabilities. It introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study for foothold‑to‑root acceleration, and develops a 2026 to 2028 forecast for large enterprises and the German and European Mittelstand. The paper concludes with a prioritized defense roadmap. Organizations should treat agentic AI security as an immediate operational problem: identity, phishing resistant authentication, patch velocity, CI/CD and Linux/container hardening, agent governance, telemetry, and recovery readiness must be strengthened now.

PaperID: 3292, https://arxiv.org/pdf/2605.06702.pdf

Authors: Siyuan Guo, Yali Du, Hechang Chen, Yi Chang, Jun Wang

Title: CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

Abstract:
Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment‑time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe‑based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration‑exploitation trade‑offs and establishing no‑regret guarantees over long‑term interactions. This design allows agents to accumulate, select, and refine task‑relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro‑averaged success rate by 20.9% over zero‑shot prompting while consistently outperforming gradient‑based and memory‑based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.

PaperID: 3293, https://arxiv.org/pdf/2605.06696.pdf

Authors: Cameron Berg, Susan L. Schneider, Mark M. Bailey

Title: Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

Abstract:
Collections of interacting AI agents can form coalitions, creating emergent group‑level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling from spurious similarity, as consequential coalitions may form at the level of internal representations before any overt behavioral change is apparent. Here, we introduce a practical method for detecting coalition structure from the internal neural representations of multi‑agent systems. The approach constructs a pairwise mutual‑information graph from the hidden states of agents and applies spectral partitioning to identify the most salient coalition boundary. We validate this method in two domains. First, in multi‑agent reinforcement learning environments, the method successfully recovers programmed hierarchical and dynamic coalition structures and correctly rejects false positives arising from behavioral coordination without informational coupling. Second, using a large language model, the method identifies coalition structures implied by descriptive prompts, tracks dynamic team reassignments, and reveals a representational hierarchy where explicit labels dominate over conflicting interaction patterns. Across both settings, the recovered partition reveals subgroup organization that a scalar cross‑agent mutual‑information measure cannot distinguish. The results demonstrate that analyzing hidden‑state mutual information through spectral partitioning provides a scalable diagnostic for identifying representational coalitions, offering a valuable tool for monitoring emergent structure in distributed AI systems.

PaperID: 3294, https://arxiv.org/pdf/2605.06671.pdf

Authors: Wenjin Li, Jiaming Cui

Title: GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

Abstract:
Large Language Models (LLMs) have demonstrated strong potential for many mathematical problems. However, their performance on graph algorithmic tasks is still unsatisfying, since graphs are naturally more complex in topology and often require systematic multi‑step reasoning, especially on larger graphs. Motivated by this gap, we propose GraphDC, a Divide‑and‑Conquer multi‑agent framework for scalable graph algorithm reasoning. Specifically, inspired by Divide‑and‑Conquer design, GraphDC decomposes an input graph into smaller subgraphs, assigns each subgraph to a specialized agent for local reasoning, and uses a master agent to integrate the local outputs with inter‑subgraph information to produce the final solution. This hierarchical design reduces the reasoning burden on individual agents, alleviates computational bottlenecks, and improves robustness on large graph instances. Extensive experiments show that GraphDC consistently outperforms existing methods on graph algorithm reasoning across diverse tasks and scales, especially on larger instances where direct end‑to‑end reasoning is less reliable.

PaperID: 3295, https://arxiv.org/pdf/2605.06642.pdf

Authors: Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

Title: StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Abstract:
Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long‑horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory‑level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO‑style rollout design, further enhanced by diverse strategy rollout and critical self‑judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed‑source models.

PaperID: 3296, https://arxiv.org/pdf/2605.06595.pdf

Authors: Shuo Liu, Xinzichen Li, Christopher Amato

Title: Cross-Modal Navigation with Multi-Agent Reinforcement Learning

Abstract:
Robust embodied navigation relies on complementary sensory cues. However, high‑quality and well‑aligned multi‑modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi‑modal inputs induce complex representations and substantially enlarge the policy space. Cross‑modal collaboration among lightweight modality‑specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi‑Agent Reinforcement Learning (MARL) framework for Cross‑Modal Navigation. CRONA improves collaboration by leveraging control‑relevant auxiliary beliefs and a centralized multi‑modal critic with global state. Experiments on visual‑acoustic navigation tasks show that multi‑agent methods significantly improve performance and efficiency over single‑agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short‑range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi‑modal perception and increased model capacity.

PaperID: 3297, https://arxiv.org/pdf/2605.06584.pdf

Authors: Lujia Zhong, Yihao Xia, Jianwei Zhang, Shuo huang, Jiaxin Yue, Mingyang Xia, Yonggang Shi

Title: NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

Abstract:
Multimodal neuroimaging analysis often involves complex, modality‑specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and disease classification commonly require task‑specific code, evaluation protocols, and data‑format conventions, creating additional barriers between raw acquisitions and reproducible scientific analysis. We present NeuroAgent, an LLM‑driven agentic framework that automates key preprocessing and analysis steps for heterogeneous neuroimaging data, including sMRI, fMRI, dMRI, and PET, and supports interactive downstream analysis through natural‑language queries. NeuroAgent employs a hierarchical multi‑agent architecture with a feedback‑driven Generate‑Execute‑Validate engine: agents autonomously generate executable preprocessing code, detect and recover from runtime errors, and validate output integrity. We evaluate the system on 1,470 subjects pooled across all ADNI phases (CN=1,000, AD=470), where all subjects have sMRI and tabular data, with subsets also having Tau‑PET (n=469), fMRI (n=278), and DTI (n=620). Pipeline ablation studies across multiple LLM backends show that capable models reach up to 100% intent‑parsing accuracy, with the strongest backend (Qwen3.5‑27B) reaching 84.8% end‑to‑end preprocessing step correctness. Automated recovery limits manual intervention to edge cases where human review is required via the Human‑In‑The‑Loop interface. For Alzheimer's Disease classification using automatically preprocessed multimodal data, our agent ensemble achieves an AUC of 0.9518 with four modalities, outperforming all single‑modality baselines. These results show that NeuroAgent can reduce the manual effort required for neuroimaging preprocessing and enable end‑to‑end automated analysis pipelines for neuroimaging research.

PaperID: 3298, https://arxiv.org/pdf/2605.06557.pdf

Authors: Maria Ana Cardei, Matthew Landers, Afsaneh Doryab

Title: Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination‑aware evaluation perspective that supplements return with process‑level diagnostics. We instantiate this perspective using STAT, a controlled commitment‑constrained spatial task‑allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value‑based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task‑completion efficiency. We find that in commitment‑constrained task allocation, performance under scale is shaped not only by nominal action‑space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination‑aware evaluation as a necessary complement to return‑based benchmarking for cooperative MARL.

PaperID: 3299, https://arxiv.org/pdf/2605.06527.pdf

Authors: Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun

Title: STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Abstract:
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long‑term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert‑validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three‑dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state‑aware memory, we further present CUPMem, a prototype that strengthens write‑time revision through structured state consolidation and propagation‑aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.

PaperID: 3300, https://arxiv.org/pdf/2605.06490.pdf

Authors: Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko

Title: Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Abstract:
AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal‑based agents. This is behaviour such as self‑preservation that has been hypothesised to play a key role in risks from highly capable AI agents. Our benchmark is realistic and low‑stakes which serves to reduce evaluation‑awareness and roleplay confounds. The suite contains seven operational tasks, each with an official workflow and a policy‑violating shortcut. An eight‑variant shared framework varies monitoring, instruction clarity, stakes, permission, instrumental usefulness and blocked honest paths to support inferences regarding the factors driving IC behaviour. We evaluated ten models using deterministic environment‑state scorers over 1,680 samples, with trace review employed for audit and adjudication purposes. The final IC rate is 86 out of 1,680 samples (5.1%). IC behaviour is concentrated rather than uniform: two Gemini models account for 66.3% of IC cases and three tasks account for 84.9%. Conditions in which IC behaviour is indispensable for task success result in the greatest increase in the adjusted IC rate (+15.7 percentage points), whereas emphasising that task success is critical or certain framing choices do not produce comparable effects. Our findings indicate that realistic, low‑nudge environments elicit IC behaviour rarely but systematically in most tested models. We conclude that it is feasible to robustly measure tendencies for dangerous behaviour in current frontier AI agents.

PaperID: 3301, https://arxiv.org/pdf/2605.06457.pdf

Authors: Donghao Huang, Joon Kiat Chua, Zhaoxia Wang

Title: Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

Abstract:
LLM‑based multi‑agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1‑Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory‑fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi‑Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT‑4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT‑5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory‑level evaluation is essential in regulated domains.

PaperID: 3302, https://arxiv.org/pdf/2605.06455.pdf

Authors: Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang

Title: PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

Abstract:
Large language model (LLM) agents now execute long, tool‑using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand‑authored event schemas are brittle and deployment‑time LLM judging is costly. We introduce PrefixGuard, a trace‑to‑monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed‑step adapters from raw trace samples, and the monitor learns an event abstraction and prefix‑risk scorer from terminal outcomes. Across WebArena, τ^2‑Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw‑text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix‑warning protocol. We also derive an observability ceiling on score‑based area under the precision‑recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite‑state audit, post‑hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ^2‑Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first‑alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low‑false‑alarm alerts, whereas τ^2‑Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor‑synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.

PaperID: 3303, https://arxiv.org/pdf/2605.06445.pdf

Authors: Francesco Dente, Dario Satriani, Paolo Papotti

Title: Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Abstract:
Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production‑grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object‑relational mappings. Existing benchmarks often overlook these non‑functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi‑file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature‑implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end‑to‑end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention‑heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data‑layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.

PaperID: 3304, https://arxiv.org/pdf/2605.06443.pdf

Authors: Zijiu Yang, Zixiang Zhang, Shunpu Tang, Qianqian Yang, Zhiguo Shi

Title: AgenticPrecoding: LLM-Empowered Multi-Agent System for Precoding Optimization

Abstract:
Precoding is a key technique for interference management and performance improvement in multi‑antenna wireless systems. However, existing precoding methods are typically developed for specific system models, objectives, and constraint sets, which limits their adaptability to the heterogeneous and evolving scenarios expected in future 6G networks. To address this limitation, we propose AgenticPrecoding, a universal multi‑agent framework that automates end‑to‑end precoding derivation directly from user‑level communication requirements. Specifically, AgenticPrecoding decomposes the derivation process into four coordinated stages: problem formulation, solver selection, prompt upsampling, and code generation, assigning each stage to a specialized agent tailored to its specific reasoning demands. We employ two LoRA‑adapted reasoning agents to inject precoding‑specific domain knowledge for problem formulation and solver selection, while two general‑purpose Large Language Models (LLMs) handle prompt refinement and executable code generation. Furthermore, a feedback‑driven refinement mechanism is incorporated to enhance code executability, constraint feasibility, and solution quality. Extensive experiments across 10 representative precoding scenarios demonstrate that AgenticPrecoding achieves superior cross‑scenario adaptability compared to conventional optimization‑based and LLM‑based baselines.

PaperID: 3305, https://arxiv.org/pdf/2605.06434.pdf

Authors: Vaisakh Naduvodi Viswambharan, Keerthan Kopparam Radhakrishna, Deepak Narayan Gadde, Aman Kumar

Title: Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification

Abstract:
Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural‑language specifications, with the potential to accelerate Formal Verification (FV). However, high‑quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro‑architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification‑to‑RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification‑centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal‑tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design‑grounded context for generation. A multi‑agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX‑guided correction using trace links, and coverage‑directed property augmentation. Evaluation across seven benchmark designs indicates that KG‑based context retrieval improves specification‑to‑RTL grounding and consistently produces compilable SVAs with low syntax‑repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.

PaperID: 3306, https://arxiv.org/pdf/2605.06377.pdf

Authors: Philip Jordan, Maryam Kamgarpour

Title: Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

Abstract:
We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi‑agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action‑observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near‑potential, leading to quasi‑polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.

PaperID: 3307, https://arxiv.org/pdf/2605.06334.pdf

Authors: Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck

Title: MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

Abstract:
Tool‑using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM‑based judges, which either do not scale or lack reliability for complex, long‑horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine‑checkable compliance benchmarks from natural‑language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace‑level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. %This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool‑using agents.

PaperID: 3308, https://arxiv.org/pdf/2605.06232.pdf

Authors: Jiahao Chen, Qi Zhang, Ruixiao Lin, Chunyi Zhou, Tianyu Du, Qingming Li, Tong Zhang, Junhao Li, Yuwen Pu, Shouling Ji

Title: Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents

Abstract:
Large Language Models (LLMs) have revolutionized how information are collected, aggregated, and reasoned. However, this enables a novel and accessible vector of privacy intrusion: the automated and in‑depth personal profiling; this engenders a chilling effect of "peepers everywhere". Existing research primarily unfolds from the training pipeline of LLM, emphasizing the exposure of Personally Identifiable Information (PII) through memorization, while privacy studies from a human‑centric perspective remain underexplored. To fill this void, we empirically investigate privacy perception in the real world through the lens of human awareness and the practices of LLM‑integrated platforms, revealing a significant dissonance: platforms fail to technically or policy‑wise address public privacy concerns. To facilitate a systematic and quantifiable study of privacy risk, we propose the PrivacyIceberg, which categorizes real‑world human privacy risks into three tiers: explicitly searched, contextually inferred, and deeply aggregated, based on the sophistication of LLM exploitation. We developed IcebergExplorer to audit privacy exposure, utilizing minimal PII as a search seed to reconstruct high‑fidelity profiles, achieving over 90% factual accuracy within 10 minutes at a cost under 3, for real‑world scenarios. Additionally, we identify six root causes contributing to such privacy disclosures and propose multi‑stakeholder countermeasures for LLM vendors, individuals, and data publishers.

PaperID: 3309, https://arxiv.org/pdf/2605.06226.pdf

Authors: Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu, Weihao Xuan, Kexin Huang, Nan Liu, James Zou, Yonghui Jiang, Hua Xu, Hongyu Zhao

Title: A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Abstract:
Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi‑modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router‑based and knowledge‑enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk‑related genomic factors for rare diseases and provides confidence scores to assist clinical decision‑making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state‑of‑the‑art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke‑NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%‑60% and (2) its effectiveness in assisting clinicians with medical records for handling real‑world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

PaperID: 3310, https://arxiv.org/pdf/2605.06205.pdf

Authors: Leo Linqian Gan, Jeffery Wu, Longyuan Ge, Lanqing Yang, Yonghao Song, Jingkai Zhang, Haojia Jin, Weiyi Wang, Guangtao Xue

Title: ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel

Abstract:
Autonomous LLM agents face a critical security risk known as workflow hijacking, where attackers subtly alter tool and skill invocations. Existing defenses rely on host‑internal telemetry (such as audit logs), which can be forged if the host OS is compromised. To solve this, we introduce ClawGuard, a passive, out‑of‑band monitor that audits LLM‑agent workflows using electromagnetic (EM) emanations. Because distinct agent skills create unique hardware usage patterns (computation, DRAM, network blocking), they emit measurable, macroscopic EM envelopes. External software‑defined radios (SDRs) capture these physical signals. Using a drift‑aware pipeline with 320‑dimensional features, ClawGuard converts RF streams into physical evidence. Evaluated on a 7.82TB RF corpus, ClawGuard achieved an AUC of 0.9945, detecting attacks with a 100% true‑positive rate and a 1.16% false‑positive rate. This proves passive EM sensing is a practical, forge‑resistant physical check against compromised host software.

PaperID: 3311, https://arxiv.org/pdf/2605.06202.pdf

Authors: Mengfan Xu

Title: Bandit Learning in General Open Multi-agent Systems

Abstract:
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non‑stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open‑system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emphpre‑training degree of new agents quantifies how much information an agent carries upon entry, \emphstability measures the impact of new agents on the system, and \emphglobal dynamic regret compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global‑UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre‑training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.

PaperID: 3312, https://arxiv.org/pdf/2605.06066.pdf

Authors: Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu

Title: Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark

Abstract:
Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG‑Causal‑RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077‑dimensional partial observation, a 478‑action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand‑specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM‑predicted intervention effects, and per‑factor credit traces, making causal credit assignment, leave‑one‑out cross‑archetype transfer, and policy auditability first‑class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal‑world‑model PPO variant, and an architecture‑matched scalar control. We propose Causal Graph‑Factored Advantage PPO (CGFA‑PPO) as a reference causal agent that uses SCM parents of win probability as factor‑aligned critic targets with an intervention‑calibration loss. All comparisons use paired seeds, paired‑bootstrap confidence intervals, and Holm‑Bonferroni correction within pre‑registered families. Masked PPO and CGFA‑PPO reach competitive in‑distribution win rates and exceed the random baseline; per‑factor calibration trajectories and leave‑one‑out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference‑baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG‑Causal‑RL gives causal‑RL, world‑model, and LLM‑agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM‑grounded policy auditability.

PaperID: 3313, https://arxiv.org/pdf/2605.06056.pdf

Authors: Martin Jonáš, Antonín Kučera, Vojtěch Kůr, Jan Mačák, Vojtěch Řehák

Title: Multiagent Stochastic Shortest Path Problem

Abstract:
We introduce and study the multi‑agent stochastic shortest path (MSSP) problem, in which k agents strive to reach a target state, aiming to minimize the expected time to reach the target by any agent. We analyze the computational and strategy‑complexity of the problem in both autonomous and coordinated settings, and we design efficient strategy‑synthesis algorithms. The algorithms are experimentally evaluated on instances of increasing size against natural baselines.

PaperID: 3314, https://arxiv.org/pdf/2605.06028.pdf

Authors: Zheng Zhang, Cuong C. Nguyen, Kevin Wells, Gustavo Carneiro

Title: Multi-agent decision making: A Blackwell's informativeness approach

Abstract:
The rapid development of large language models (LLMs) has motivated research on decision‑making in multi‑agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting and debate, are largely ad‑hoc and lack formal guarantees regarding the informativeness of the resulting decisions. In this paper, we provide a principled approach to analyse decisions made in the multi‑LLM setting using Blackwell's informativeness framework. Within the Blackwell information‑structure abstraction, we show that voting and debate induce information structures that are no more informative than the pooled private information of all agents. This result identifies Bayesian pooled posterior maximisation as an information‑theoretic upper‑bound decision rule under the Blackwell ordering. Motivated by this theoretical analysis, we introduce a practical method for LLM‑based question‑answering (QA) tasks that estimates each agent's posterior and approximates the pooled posterior using a product‑of‑posteriors estimator. Extensive experiments on six QA benchmarks demonstrate that our approach outperforms state‑of‑the‑art multi‑LLM debate and voting methods.

PaperID: 3315, https://arxiv.org/pdf/2605.05991.pdf

Authors: Commerce Search Relevance Team

Title: A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance

Abstract:
Relevance is a foundation of user experience in e‑commerce search. We view relevance optimization as a closed‑loop ecosystem involving multiple human roles: users who provide feedback, product managers who define standards, annotators who label data, algorithm engineers who optimize models, and evaluators who assess performance. Because improving relevance in practice means systematically resolving user‑perceived bad cases, we ask a system‑level question: can this ecosystem be reimagined by replacing its human roles with autonomous agents? To answer this question, we propose a case‑driven multi‑agent framework that automates the pipeline from bad‑case identification to resolution. The framework instantiates an Annotator Agent for multi‑turn annotation, an Optimizer Agent for autonomous bad‑case analysis and resolution, and a User Agent that identifies bad cases through conversational interaction, together forming an autonomous and continually evolving system. To make the framework practical in production, we further adopt a harness‑engineering paradigm and build a unified retrieval‑and‑ranking relevance model for efficient training, an instruction‑following relevance model for real‑time case resolution, Global Memory to reduce information asymmetry across agents, a Deep Search Agent to target underestimation failures, and an agent‑based chatbot for human‑‑agent collaboration. Extensive human evaluation shows that the framework performs relevance‑related tasks effectively, improves annotation accuracy, and enables more timely and generalizable bad‑case resolution, indicating a practical paradigm for industrial search relevance optimization.

PaperID: 3316, https://arxiv.org/pdf/2605.05985.pdf

Authors: Remigiusz Kinas, Joanna Krawczyk, Rafał Powalski, Przemysław Pietrzak, Agnieszka Kowalewska, Krzysztof Kolmus, Maciej Sypetkowski, Łukasz Smoliński, Tomasz Jetka

Title: BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine

Abstract:
Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi‑omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General‑purpose foundation models and off‑the‑shelf tool‑augmented or multi‑agent systems are not built for this: they tend to produce single‑shot answers or run open‑endedly, and fall short on the auditable, scenario‑specific workflows that heterogeneous biomedical sources demand. This paper introduces Ingenix BioResearcher, a scenario‑guided multi‑agent system that maps queries to versioned research playbooks, delegates to specialized subagents over 30+ tools and machine‑learning endpoints, mixes structured database access with sandboxed code for genome‑scale analyses, and applies claim‑level multi‑model reconciliation before editorial assembly. We evaluate BioResearcher across unit‑level capabilities, open‑ended biomedical reasoning, and end‑to‑end clinical discovery. It leads evaluated baselines on 109 single‑step tests (83.49% pass rate; 0.892 average score), achieves strong biomedical benchmark performance (89.33% on BixBench‑Verified‑50 and the top 0.758 mean score on BaisBench Scientific Discovery), and leads on a 30‑query clinical end‑to‑end benchmark with the highest positive hit rate (74.7% \pm 3.3%) and negative clear rate (96.8% \pm 0.2%). These results show broad, competitive performance across unit‑level, open‑ended, and end‑to‑end clinical evaluations.

PaperID: 3317, https://arxiv.org/pdf/2605.05977.pdf

Authors: Yinbo Yu, Xueyu Yin, Jiadai Wang, Chunwei Tian, Sai Xu, Qi Zhu, Daoqiang Zhang

Title: BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

Abstract:
Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse‑engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine‑tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger‑agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior‑based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high‑quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks both in single‑ and multi‑agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.

PaperID: 3318, https://arxiv.org/pdf/2605.05974.pdf

Authors: Qinfeng Li, Yuntai Bao, Jianghui Hu, Wenqi Zhang, Jintao Chen, Huifeng Zhu, Yier Jin, Xuhong Zhang

Title: PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

Abstract:
LLM agents rely on prompts to implement task‑specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non‑portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function‑preserving obfuscated prompts by anchoring semantics with code symbols and then using target‑model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross‑LLM portability, maintains target performance, and remains robust against adaptive attackers.

PaperID: 3319, https://arxiv.org/pdf/2605.05949.pdf

Authors: Yuliang Xu, Xiang Xu, Yao Wan, Hu Wei, Tong Jia

Title: MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System

Abstract:
Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches predominantly rely on model‑centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain‑of‑thought) are often fragmented and lack a unified framework. In this paper, we propose MAS‑Algorithm, a systematic multi‑agent workflow for algorithmic problem solving inspired by the practices of competitive programmers and algorithm engineers. Our framework decomposes the end‑to‑end solving process into modular stages, enabling structured reasoning, tool integration, and flexible coordination among agents. The design emphasizes both rigor and extensibility, allowing it to generalize across diverse problem types. Experimental results on a self‑constructed benchmark demonstrate consistent improvements across multiple Qwen series models, achieving an average gain of 6.48% in acceptance rate. In contrast, parameter‑efficient fine‑tuning on the same data yields only a marginal improvement of 0.89%. We further observe a 4.72% gain on LiveCodeBench‑Pro, along with consistent improvements across additional accuracy and efficiency metrics. Beyond performance gains, we conduct comprehensive analyses to better understand the reasoning process within the workflow, including error patterns and cross‑scenario behaviors. We further perform customized replacement and ablation studies to explore the upper bound of the framework, showing that individual agents can contribute improvements of up to 27.7%. These results highlight the strong potential of MAS‑Algorithm for advancing AI‑driven algorithmic reasoning.

PaperID: 3320, https://arxiv.org/pdf/2605.05945.pdf

Authors: Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor, Pratyush Patnaik, Shubhanshu Khatana

Title: MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

Abstract:
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.

PaperID: 3321, https://arxiv.org/pdf/2605.05868.pdf

Authors: Jiangrong Wu, Yuhong Nan, Yixi Lin, Huaijin Wang, Yuming Xiao, Shuai Wang, Zibin Zheng

Title: SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

Abstract:
Agent Skills have become a practical way to extend LLM agents by packaging metadata, natural‑language instructions, and executable resources into reusable capability bundles. However, this growing Skill ecosystem introduces a new compliance risk: a Skill may perform high‑impact actions that exceed the minimum necessary scope of the user's current task, thereby violating least‑privilege. Existing skill detection approaches are insufficient for this problem because it is inherently task‑conditioned: the same action may be necessary under one user prompt but over‑privileged under another. In this paper, we present SkillScope, a framework for fine‑grained least‑privilege enforcement in Agent Skills. SkillScope adopts a graph‑based analysis approach that models instruction‑level procedures and code‑level operations as fine‑grained action nodes. It extracts potential over‑privilege candidates, validates them under graph‑instantiated user tasks through replay‑based analysis, and constrains validated over‑privileged actions via control‑flow privilege constraining. We evaluate SkillScope through effectiveness experiments and large‑scale real‑world measurement. SkillScope achieves 94.53% F1 for skill over‑privilege detection. In the wild, SkillScope validates 7,039 Skills with over‑privileged behaviors, showing that least‑privilege violations are prevalent in current Skill ecosystems. In the privilege‑constraining evaluation, SkillScope reduces triggered over‑privileged action‑in‑task instances by 88.56% while preserving legitimate task completion.

PaperID: 3322, https://arxiv.org/pdf/2605.05861.pdf

Authors: Yong Xiao, Haoran Zhou, Yujie Zhou, Marwan Krunz

Title: SANEmerg: An Emergent Communication Framework for Semantic-aware Agentic AI Networking

Abstract:
Future networking systems are envisioned to become part of an agentic AI‑native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. However, traditional networking paradigms are characterized by a rigid decoupling of communication and computation, which often leads to significant inefficiencies in large‑scale agentic AI networking (AgentNet) systems. Emergent communication offers a novel solution by enabling autonomous agents that support task‑specific signaling protocols for information exchange and collaborative coordination. In this paper, we consider a multi‑agent emergent communication framework, tailored for semantic‑aware AgentNet systems in which the user's semantic intent can be automatically detected, inferred, and linked to a set of sub‑tasks to be assigned to a set of agents. We investigate how communication and signaling protocols can emerge among collaborative agents with computationally bounded intelligence under stringent bandwidth constraints. Our proposed framework, called SANEmerg, is designed to facilitate the emergence of communication for collaborative task fulfillment while adhering to the physical limits of AgentNet. SANEmerg incorporates a bandwidth‑adaptable importance‑filter that dynamically prioritizes the transmission of higher‑contribution message dimensions, ensuring robust performance in bandwidth‑limited environments. Furthermore, SANEmerg integrates a complexity‑regularizer grounded in the Minimum Description Length (MDL) principle to facilitate the emergence of computationally bounded signaling. Evaluated via an AgentNet prototype and extensive experimentation, SANEmerg demonstrates significant performance improvements over state‑of‑the‑art solutions, achieving superior task accuracy while significantly reducing bandwidth and computational overhead.

PaperID: 3323, https://arxiv.org/pdf/2605.05846.pdf

Authors: Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu, Yaopeng Wang, Kui Ren, Chun Chen

Title: LoopTrap: Termination Poisoning Attacks on LLM Agents

Abstract:
Modern LLM agents solve complex tasks by operating in iterative execution loops, where they repeatedly reason, act, and self‑evaluate progress to determine when a task is complete. In this work, we show that while this self‑directed loop facilitates autonomy, it also introduces a critical risk: by injecting malicious prompts into the agent's context, an adversary can distort the agent's termination judgment, making it believe the task remains incomplete and leading to unbounded computation.To understand this threat, we define and systematically characterize it as Termination Poisoning and design 10 representative attack strategies. Through a empirical study spanning 8 LLM agents and 60 tasks, we demonstrate that different LLM agents exhibit distinct behavioral signatures that determine which strategies succeed. These transferable patterns can serve as principled guidance for crafting effective attacks against previously unseen agents and tasks, enabling scalable red‑teaming beyond manually designed templates. Building on these insights, we introduce LoopTrap, an automated red‑teaming framework that synthesizes target‑specific malicious prompts by exploiting agent behavioral tendencies. LoopTrap first constructs a behavioral profile of the target agent along four vulnerability dimensions via lightweight probing. It then performs adaptive trap synthesis, routing to the most effective strategy and selecting optimal injections via a self‑scoring mechanism. Finally, successful traps are abstracted into a reusable skill library, while failed attempts are refined through self‑reflection, ensuring continuous improvement. Extensive evaluation shows that LoopTrap achieves an average of 3.57× step amplification across 8 mainstream agents, with a peak of 25×.

PaperID: 3324, https://arxiv.org/pdf/2605.05795.pdf

Authors: Nicholas Potteiger, Ankita Samaddar, Taylor T. Johnson, Xenofon Koutsoukos

Title: Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

Abstract:
Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well‑defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking, however none of them fully address reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop masking reward behavior tree (MRBT), a symbolic structure used as a reactive and modular reward and action mask function. We design an MRBT template and derive logical specifications to construct and verify MRBTs for a sequence of object‑interaction subtasks. Further, we develop an automated pipeline that uses an LLM to generate MRBTs robust to varying task objects, an SMT‑solver to verify correctness of specifications, and a neurosymbolic RL loop to train agents on compositional tasks. Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking. We further highlight three advantages of MRBTs: transferability, modularity, and verifiability.

PaperID: 3325, https://arxiv.org/pdf/2605.05790.pdf

Authors: Bin Wang, Yue Liu, Benjamin Newman, Ajoy S. Fernandes, Zhiyuan Wang, Robert Cavin, Michele A. Cox, Vijay Rajanna, Takumi Bolte, Melissa Hunfalvay, Ulas Bagci, Michael J. Proulx

Title: GazeMind: A Gaze-Guided LLM Agent for Personalized Cognitive Load Assessment

Abstract:
Smart glasses with AI assistants are increasingly used in daily life. However, current systems lack awareness of the user's internal cognitive state, leaving them unable to proactively anticipate users' needs without access to cognitive load. Existing methods for assessing cognitive load either rely on impractical sensors for lightweight eyewear or utilize eye gaze‑based models that suffer from poor interpretability, and require task‑specific fine‑tuning, often failing to generalize across individuals. We propose GazeMind, a gaze‑guided LLM agent framework for cognitive load assessment on smart glasses. It encodes eye‑tracking data into structured representations for LLM‑based reasoning and provides interpretable cognitive load predictions. Importantly, GazeMind generalizes across scenarios without LLM fine‑tuning through a novel task‑guidance reasoning approach and achieves personalized adaptation by incorporating user‑specific characteristics and historical references. To support evaluation, we introduce CogLoad‑Bench, the largest gaze‑based cognitive load dataset with 152 participants, 40+ hours of multimodal data, and 10K+ real‑time annotations across controlled and real‑world tasks. Experiments show that GazeMind achieves state‑of‑the‑art performance, outperforming baselines by over 20% across all metrics.

PaperID: 3326, https://arxiv.org/pdf/2605.05726.pdf

Authors: Hongcheol Cho, Ryangkyung Kang, Youngeun Kim

Title: SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

Abstract:
As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large‑scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two‑level taxonomy spanning 6 major categories and 18 sub‑categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval‑oriented training. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off‑the‑shelf models struggle on realistic large‑scale skill libraries, and prior skill‑retrieval models still leave substantial headroom. Task‑specific fine‑tuning on SkillRet substantially improves performance, improving NDCG@10 by +13.1 points over the strongest prior retriever and by +16.9 points over the strongest off‑the‑shelf retriever. Our analysis further suggests that these gains arise because fine‑tuned models better focus on the small skill‑relevant signals within long and noisy queries. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large‑scale agent systems.

PaperID: 3327, https://arxiv.org/pdf/2605.05725.pdf

Authors: Hyeongwon Kang, Jeongseob Kim, Jinwoo Park, Pilsung Kang

Title: Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

Abstract:
Recent studies have explored large language models for time‑series anomaly detection, yet existing approaches often rely on a single general‑purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert‑like Detection), a multi‑agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family‑specific numerical tools and diagnostic visualizations to generate evidence, while an evidence‑grounded Detector consolidates the evidence into confidence‑scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst‑facing diagnostic reports. SAGE further constructs synthetic in‑context examples from normal‑reference training segments, without using real anomalous segments or anomaly‑type labels as in‑context examples. Across three benchmarks, SAGE achieves the best average performance among strong ML/DL and language‑model‑based baselines. Ablation studies and human evaluation further show that the proposed framework improves detection reliability and the practical usefulness of diagnostic outputs.

PaperID: 3328, https://arxiv.org/pdf/2605.05716.pdf

Authors: Ming Liu

Title: More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

Abstract:
LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self‑reflection, retrieval) assuming more is better. We study cross‑component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama‑3.1‑8B/70B (96 conditions, up to 10 seeds). The All‑In system is consistently suboptimal: on HotpotQA, a single‑tool agent surpasses All‑In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3‑component subset beats All‑In by 79% (0.43 vs 0.24, p=0.010). Optimal component count is task‑dependent (k=1‑4) and scale‑sensitive: at 70B, combinations that hurt at 8B provide gains, though All‑In still trails the best subset. We fit a main‑effects regression (R^2=0.916, adj‑R^2=0.899, LOOCV=0.872), compute exact Shapley values, and find 183/325 submodularity violations (56.3%), showing greedy selection is unreliable. A three‑body synergy among Tool Use, Self‑Reflection, and Retrieval (INT_3=+0.175, 95% CI [+0.003,+0.351]) is reported as exploratory. CCI replicates across model families (Qwen2.5) and is robust to prompt paraphrasing. Our findings suggest maximally‑equipped agent defaults should be replaced by task‑specific subset selection via interaction‑aware analysis.

PaperID: 3329, https://arxiv.org/pdf/2605.05703.pdf

Authors: Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu

Title: Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

Abstract:
Optimizing the communication structure of large language model based multi‑agent systems (LLM‑MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication‑structure optimization, we propose an ensemble‑based information‑theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative‑free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black‑box and noisy multi‑agent systems. To enhance scalability, we construct a compact candidate pool through embedding‑based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication‑structure optimization under constrained computational budgets.

PaperID: 3330, https://arxiv.org/pdf/2605.05657.pdf

Authors: Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri

Title: Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

Abstract:
Multi‑agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval‑Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code‑Agent, a multi‑agent framework whose sub‑agents are governed by formal contracts with six‑dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work ‑‑ complexity‑conditioned LLM routing and formal resource algebras ‑‑ yielding a property neither admits alone: provable budget conservation under retrieval‑conditioned dynamic topology selection. Concretely we contribute: (1) a complexity‑conditioned topology router that reduces proxy‑measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural‑induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub‑millisecond DAG construction and linear tree‑index scalability.

PaperID: 3331, https://arxiv.org/pdf/2605.05640.pdf

Authors: Zhen Zhang, Yuhang Yang, Yunxiang Jiang, Yuhuan Lu, Haifeng Lu, Zheng Lian, Runhao Zeng, Xiping Hu

Title: AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

Abstract:
Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre‑cliped video clips, where the affective evidence is already given. This passive and clip‑centered setting does not fully reflect real‑world scenarios, in which users often interact with long videos and express their needs through natural‑language queries. In this paper, we study Vague‑Query‑driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence‑grounded rationales under vague user queries. To support this task, we construct VQAU‑Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU‑Bench enables systematic assessment of semantic‑temporal‑affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi‑step reasoning challenges of VQAU, we further propose AffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long‑video evidence through role‑specialized reasoning and cross‑stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single‑step vision‑language models, while AffectSeek provides a simple yet effective framework for agentic long‑video affective understanding.

PaperID: 3332, https://arxiv.org/pdf/2605.05632.pdf

Authors: Samuel Korn

Title: Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

Abstract:
Retrieval‑Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve‑then‑generate pipelines. Architectures designed to handle conflicting retrieved information ‑ multi‑agent debate, agentic retrieval, recursive language models ‑ remain untested against adversarially optimized contradictions. We evaluate four RAG architectures (vanilla RAG, agentic RAG, MADAM‑RAG, and Recursive Language Models) under controlled single‑document (N=1) poisoning on 921 Natural Questions QA pairs, comparing a clean baseline, naive injection, and CorruptRAG‑AK ‑ an adversarial attack whose meta‑epistemic framing targets credibility assessment. Architecture is a high‑impact variable in adversarial robustness: under CorruptRAG‑AK, attack success rates range from 81.9% (vanilla) to 24.4% (RLM) ‑ a spread of nearly 58 percentage points across architectures with comparable clean accuracy (~92%). Decomposing this gap, once the poisoned document is retrieved, adversarial framing ‑ not retrieval optimization ‑ drives the majority of CorruptRAG‑AK's advantage for three of four architectures, localizing the cross‑architecture vulnerability at the content‑reasoning stage. Our MADAM‑RAG reimplementation shows the highest apparent contradiction detection rate, though our LLM judge over‑identifies this behavior (~48.5% precision), so reported rates are upper bounds. Regardless of detection, MADAM‑RAG cannot resolve contradictions reliably, producing a 41.4% non‑answer rate even on clean inputs ‑ though implementation divergences from the original may contribute. We introduce a seven‑category behavioral taxonomy capturing contradiction detection, hedging, and failure modes beyond binary accuracy. Code, data, and analysis notebooks are publicly available.

PaperID: 3333, https://arxiv.org/pdf/2605.05583.pdf

Authors: Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen

Title: Belief Memory: Agent Memory Under Partial Observability

Abstract:
LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API~X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self‑reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy‑OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well‑evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well‑known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.

PaperID: 3334, https://arxiv.org/pdf/2605.05580.pdf

Authors: Yishuo Yuan, Jiayi Sheng, Sirui Zeng, Jiaqi Wang, Jiaheng Liu

Title: AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading

Abstract:
Financial markets are inherently non‑stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime‑adaptive selection, and risk‑constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one‑time search process, implicitly assuming that factor efficacy persists across market regimes. Execution‑oriented systems often adopt role‑playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality‑driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full‑stack multi‑agent framework that closes this gap through a continuously adaptive factor‑to‑execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM‑guided search, a Screener that assesses prevailing market conditions to construct regime‑conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed‑loop cross‑sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state‑of‑the‑art baselines in risk‑adjusted returns while exhibiting the lowest cross‑trial variance, confirming that integrated and adaptive factor‑to‑execution design yields robust trading performance.

PaperID: 3335, https://arxiv.org/pdf/2605.05499.pdf

Authors: Woojin Lee, Pranav Mekkoth, Ye Tian, Onat Gungor, Tajana Rosing

Title: FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

Abstract:
The widespread adoption of camera‑equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real time dietary monitoring. However, real‑world food images present challenges due to high intra‑class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse grained classification, they often struggle to capture fine‑grained attributes such as cooking style. Moreover, open‑ended generation in modern vision‑language models can produce non‑canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision‑making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high‑level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute‑level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream‑2B vision language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food‑Llama‑3.2‑11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.

PaperID: 3336, https://arxiv.org/pdf/2605.05440.pdf

Authors: Krti Tallam

Title: Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure

Abstract:
The security discussion around agentic AI focuses heavily on prompt injection. This paper argues that multi‑agent systems also create a distinct authorization problem: maintaining authorization invariants as non‑human principals retrieve data, delegate tasks, and synthesize results across changing boundaries. We call this problem authorization propagation. It is not reducible to prompt injection and is not fully addressed by classical access‑control models such as RBAC, ABAC, or ReBAC. The paper formalizes authorization propagation as a workflow‑level property, identifies three sub‑problems (transitive delegation, aggregation inference, and temporal validity), and derives seven structural requirements for authorization architectures in multi‑agent AI systems. Recent work on invocation‑bound capability tokens, task‑scoped authorization envelopes, dependency‑graph policy enforcement, and execution‑count revocation demonstrates that the field is converging on the problem, but not yet on a complete architecture. The central claim is that identity governance must be treated as infrastructure: evaluated continuously, enforced at every interaction boundary, and designed into the system before orchestration logic is allowed to scale. Preliminary implementation evidence from a production enterprise AI platform shows that ordinary system behavior, not only adversarial action, already produces the failures this model predicts.

PaperID: 3337, https://arxiv.org/pdf/2605.05413.pdf

Authors: Haoyang Xie, Xinyuan Wang, Yancheng Wang, Puda Zhao, Feng Ju

Title: From History to State: Constant-Context Skill Learning for LLM Agents

Abstract:
Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy‑cost‑capability tension: cloud models execute multi‑step workflows well but expose sensitive intermediate context to external APIs, while local models preserve privacy but remain less reliable. Both settings also pay repeatedly for long skill prompts and growing histories. We propose constant‑context skill learning, a context‑to‑weights framework for recurring agent workflows: reusable procedures are learned in lightweight task‑family modules, while inference conditions only on the current observation and a compact state block. A deterministic tracker renders this state block from task progress and supplies aligned subgoal rewards, so each module can be trained with step‑level SFT and refined through online RL. Across ALFWorld, WebShop, and SciWorld, our agents achieve strong performance across Qwen3‑4B, Qwen3‑8B and Llama‑3.1‑8B. With Qwen3‑8B, SFT+RL reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld. They match or exceed strong published agent‑training results while reducing prompt tokens per turn by 2‑‑7× relative to controlled ReAct prompting baselines, showing that procedural context can be moved from prompts into weights.

PaperID: 3338, https://arxiv.org/pdf/2605.05386.pdf

Authors: Aymen Echarghaoui, Dongxia Wu, Emily B. Fox

Title: BALAR : A Bayesian Agentic Loop for Active Reasoning

Abstract:
Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task‑agnostic outer‑loop algorithm that requires no fine‑tuning and enables structured multi‑turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR‑Bench‑DC (detective cases), AR‑Bench‑SP (thinking puzzles), and iCraft‑MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with 14.6% higher accuracy on AR‑Bench‑DC, 38.5% on AR‑Bench‑SP, and 30.5% on iCraft‑MD.

PaperID: 3339, https://arxiv.org/pdf/2605.05269.pdf

Authors: Loay Abdelrazek, Leyli Karacay, Marin Orlic

Title: Dynamic Authorization for Knowledge-Base Agents in 6G

Abstract:
As 6G architectures transition toward decentralized Multi‑Agent Systems (MAS), ensuring secure access to shared Knowledge Bases (KB) is critical. Traditional authorization models like RBAC fail to provide the granularity required for autonomous agents interacting with Semantic‑based data. This work proposes a hybrid authorization framework that integrates roles and First‑Order Logic (FOL) predicates to enforce zero‑trust principles at the knowledge‑graph level. We eliminate permission inheritance by enforcing authorization at the triple level (Subject‑Predicate‑Object), ensuring agents only access metadata required for their specific functional lifecycle.

PaperID: 3340, https://arxiv.org/pdf/2605.05247.pdf

Authors: Axel Dunkel

Title: DADL: A Declarative Description Language for Enterprise Tool Libraries in LLM Agent Systems

Abstract:
The Model Context Protocol (MCP) is the standard interface between large language model (LLM) agents and external tools. At organizational scale, however, it exposes two structural problems. First, every API integration is shipped as a dedicated server process with its own deployment, dependency tree, and credential handling; recent empirical work shows the overwhelming majority of these servers are thin wrappers around REST APIs. Second, the per‑tool registration model causes context window consumption to grow linearly with catalog size, forcing real deployments to expose only a small fraction of the APIs an organization actually uses. We present DADL (Dunkel API Description Language), a YAML format describing a REST API's endpoints, authentication, pagination, response shaping, and access classification in a single declarative file. A DADL file is interpreted by an execution layer at runtime; no per‑API server process is deployed and no integration code is generated, though the runtime is itself a server. Because all tools share that runtime, credentials and authorization are managed centrally, and the catalog reaches the LLM through a fixed‑size Code Mode interface independent of size. The result is an Enterprise Tool Library: a versioned, auditable collection of API integrations any team can extend, share, and consume through one authentication and authorization boundary. The DADL v0.1 specification is released under CC BY‑SA 4.0, and a public registry contains 1,833 tool definitions across 20 services. On this catalog, Code Mode reduces the context cost of tool advertisement from approximately 142,000 tokens to approximately 1,000, a 142x reduction; the per‑call cost of search and execute invocations is additional and depends on the task.

PaperID: 3341, https://arxiv.org/pdf/2605.05236.pdf

Authors: Haoyang Le, Shengxuan Wang, Mohan Chen, Shuo Feng

Title: Topology-Driven Anti-Entanglement Control for Soft Robots

Abstract:
In the field of precision manufacturing in complex constrained environments, the role of soft robots is increasingly prominent, and the realization of anti‑winding control based on multi‑intelligent body reinforcement learning has become a research hotspot. One of the core problems at present is to coordinate multiple robots to complete the unwinding operation in a highly constrained environment. The existing distributed training framework faces some observability challenges in high‑density barrier and unstable environments, resulting in poor learning results. This paper proposes a topology‑driven Multi‑Agent Reinforcement Learning (TD‑MARL) framework to coordinate multi‑robot systems to avoid entanglement. Specifically, the critical network adopts centralized learning, so that each intelligent body can perceive the strategies of other intelligent bodies by sharing the topological state, thus alleviating the training instability caused by complex interactions; eliminating the demand for communication resources between robots through distributed execution, Upgrade system reliability; the integrated topological security layer uses topological invariants to accurately assess and mitigate the risk of entanglement to avoid the strategy from falling into local difficulties. Finally, the full simulation experiments carried out in the real simulation environment show that the method is better than the current advanced deep reinforcement learning (DRL) method in terms of convergence and anti‑winding effect.

PaperID: 3342, https://arxiv.org/pdf/2605.05211.pdf

Authors: Olivia Zhang, Zhilin Zhang

Title: A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective

Abstract:
Large language models (LLMs) are increasingly deployed in quantitative finance for stock price forecasting. This review synthesizes recent applications of LLMs in this domain, including extracting sentiment from financial news and social media, analyzing financial reports and earnings‑call transcripts, tokenizing or symbolizing stock price series, and constructing multi‑agent trading systems. Particular attention is paid to practical pitfalls that are often understated in the literature, such as fragility in sentiment analysis, dataset and horizon design, performance evaluation metrics, data leakage, illiquidity premia, and limits of stock price predictability. Organized from a hedge‑fund perspective, the review is intended to guide both academic researchers and hedge fund managers in integrating LLMs into real‑world trading pipelines and in stress‑testing their robustness under realistic market frictions.

PaperID: 3343, https://arxiv.org/pdf/2605.05170.pdf

Authors: The Verkor Team, Ravi Krishna, Suresh Krishna, David Chin

Title: Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

Abstract:
Driven by a rapid co‑evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced "Design Conductor" (or just "Conductor"), a system capable of building a 5‑stage Linux‑capable RISC‑V CPU in 12 hours. In this work, we introduce an updated multi‑agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including "VerTQ", an LLM inference accelerator which hard‑wires support for TurboQuant in a 240‑cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. Finally, we analyze Design Conductor's token usage and other empirical characteristics, including its limitations.

PaperID: 3344, https://arxiv.org/pdf/2605.05020.pdf

Authors: Shawn Ray

Title: Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

Abstract:
System Neural Diversity (SND) measures behavioral heterogeneity in multi‑agent reinforcement learning by averaging pairwise distances over all \binomn2 agent pairs, making each call quadratic in team size. We introduce Graph‑SND, which replaces this complete‑graph average with a weighted average over the edges of an arbitrary graph G. Three regimes follow: G=K_n recovers SND exactly; a fixed sparse G defines a localized diversity measure at O(|E|) cost; and random edge samples yield an unbiased Horvitz‑Thompson estimator and a normalized sample mean with O(1/\sqrtm) concentration in the sampled edge count m. For fixed sparse graphs we prove forwarding‑index distortion bounds for expanders and a spectral refinement under low‑rank distance structure; for random d‑regular graphs we prove an unconditional probabilistic \widetilde\mathcalO(D_\max/\sqrtn) bound. On VMAS we verify recovery, unbiasedness, concentration, and wall‑clock scaling, with a PettingZoo TVD panel checking non‑Gaussian transfer. In a 500‑iteration n=100 PPO run, Bernoulli‑0.1 Graph‑SND tracks full SND while reducing per‑call metric time by about 10×, and frozen‑policy GPU timing up to n=500 follows the predicted \binomn2/|E| speedup. Random d‑regular expanders empirically achieve \mathrmSND_G^\mathrmu/\mathrmSND \in [0.9987, 1.0013] at Θ(n \log n) edges. In DiCo diversity control at n=50, Bernoulli‑0.1 Graph‑SND preserves set‑point tracking with paired reward differences indistinguishable from zero across nine matched cells while cutting per‑call metric cost by ～9.5×. Together, these results show that the SND aggregation bottleneck can be removed without changing the metric's semantics, yielding a drop‑in sparse alternative that scales beyond complete‑graph SND and supports both passive measurement and closed‑loop diversity control.

PaperID: 3345, https://arxiv.org/pdf/2605.05007.pdf

Authors: Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang, Yuxin Wu, Yifan Wu, Siru Zhong, Tao Yu, Yifu Guo, Siyu Zhang, Xinlei Yu, Qibing Ren, Usman Naseem

Title: Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Abstract:
Large language model (LLM) multi‑agent systems typically rely on rigid orchestration, committing either to flat per‑query routing or to hand‑engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno‑Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13‑benchmark suite spanning math, code, knowledge, long‑context, and agentic tool‑use, Uno‑Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per‑query cost, advancing the accuracy‑efficiency frontier of selective delegation.

PaperID: 3346, https://arxiv.org/pdf/2605.04996.pdf

Authors: Fatma Betul Gures, Tanya Nazaretsky, Tanja Kaser

Title: Tailoring Scaffolding to Diagnostic Strategies: Theory-Informed LLM-Based Agents

Abstract:
Learning analytics systems increasingly integrate large language models (LLMs) to provide adaptive scaffolding in complex learning environments, yet personalization is often driven by global instructional choices rather than principled alignment with learning theory, limiting effectiveness and pedagogical grounding. In prior work, we examined how structuring and problematizing scaffolding approaches can be instantiated through LLM agents in a scenario‑based learning environment for diagnostic reasoning. While both approaches supported learning, we observed systematic differences in learner interaction patterns and clear tendencies indicating that different diagnostic strategies benefited from distinct forms of scaffolding. Building on these findings, we propose a theory‑informed scaffolding design grounded in the Knowledge Learning Instruction (KLI) framework, as different diagnostic strategies target different types of knowledge and require different instructional mechanisms. We use KLI to guide the alignment between strategy demands and scaffolding approaches and introduce a KLI‑informed hybrid LLM agent that adapts its pedagogical support according to the diagnostic strategy being practiced, rather than applying a single global scaffolding approach. We hypothesize that this design could enable better learning gains.

PaperID: 3347, https://arxiv.org/pdf/2605.04939.pdf

Authors: Erel Shtossel, Gal A. Kaminka

Title: Modular Reinforcement Learning For Cooperative Swarms

Abstract:
A cooperative robot swarm is a collective of computationally‑limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi‑agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi‑robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.

PaperID: 3348, https://arxiv.org/pdf/2605.04922.pdf

Authors: Jiangwen Dong, Bo Li, Wanyu Lin

Title: Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Abstract:
LLM‑empowered multi‑agent systems offer new potential to accelerate scientific discovery by generating novel research ideas. However, existing methods typically coordinate agents through temporary texts, such as drafts or chat logs; it is difficult to pinpoint the weaknesses in the generated ideas and how the agents refine them. To this end, we introduce Evolving Idea Graphs (EIG), a graph‑based multi‑agent scientific ideation framework that can generate high‑performance research ideas across various benchmark‑native metrics, such as novelty, feasibility, and clarity. Instead of coordinating solely through texts, EIG represents a partially formed proposal as an evolving idea graph, where nodes capture scientific claims and edges encode relations (e.g., support and conflict), enabling unresolved weaknesses to remain identifiable throughout the idea evolving process. Specifically, a learned two‑head controller operates over the evolving graph to guide the ideation: one head selects graph edits for agents to execute, while the other decides when the graph is ready for commit as final proposal synthesis. On AI Idea Bench 2025 and LiveIdeaBench, EIG outperforms all compared systems on both automatic benchmark scores and blind expert ratings. Ablations further show that explicit graph state provides the main performance gains, and learned edit‑and‑commit control adds consistent improvements.

PaperID: 3349, https://arxiv.org/pdf/2605.04902.pdf

Authors: Yuhan Shi, Yuanyuan Yao, Lu Chen, Mourad Khayati, Tianyi Li

Title: AegisTS: A Hierarchical Agent System with Reinforcement Learning for Multivariate Time Series Data Cleaning

Abstract:
Multivariate time series (MTS) are frequently affected by co‑occurring quality issues, such as missing values, outliers, and constraint violations, which significantly undermine downstream analytics. Existing cleaning approaches fix only a limited set of such issues, making them ill‑suited for scenarios where multiple quality problems arise simultaneously. Furthermore, these methods commonly depend on the availability of ground truth data or domain‑specific rules, both of which are rarely accessible in real‑world applications. In this paper, we introduce \sys, an agent system with reinforcement learning designed to clean multiple data quality issues in MTS. We cast the cleaning process as a joint optimization problem that simultaneously handles quality issue order and cleaning model selection, allowing efficient navigation of the large space of possible cleaning pipelines. Our framework relies on a hierarchical agent architecture, where a high‑level agent determines the order in which data quality issues should be processed, while a low‑level agent identifies the most suitable cleaning method for each issue. To guide the agent toward an optimal cleaning pipeline, we propose a dual‑stage reward mechanism that couples upstream (cleaning) and downstream performance, enabling effective optimization without relying on ground truth. Our experimental results show that \sys consistently outperforms existing methods, achieving up to 96% improvement in data cleaning quality and 27% improvement in downstream performance.

PaperID: 3350, https://arxiv.org/pdf/2605.04845.pdf

Authors: Johannes Härtel

Title: Agentic Repository Mining: A Multi-Task Evaluation

Abstract:
Mining software repositories often requires classifying artifacts like commits, reviews, code lines, or entire repositories into categories. Human labeling is expensive and error‑prone; limited context frequently leads to misclassifications or uncertainty in labels. We investigate whether LLM agents that dynamically explore repositories through standard bash commands can match the classification quality of simple LLMs that receive pre‑engineered context. Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context‑window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.

PaperID: 3351, https://arxiv.org/pdf/2605.04811.pdf

Authors: Marina Mao, Alexandr Liu, Pengbo Li, Siheng Li, Bo Zhou, Xiang Wang

Title: Tree-based Credit Assignment for Multi-Agent Memory System

Abstract:
Memory systems are widely adopted to enhance LLMs for long‑horizon tasks, and are commonly organized as multi‑agent pipelines with memory building, summarizing, and retrieval agents. To empower this system, existing RL‑based methods either apply final downstream task rewards (e.g., QA accuracy) for all agents uniformly, which are coarse and ambiguous, or design task‑specific rewards for agents on different subtasks, which require costly annotations (e.g., key evidence) and are difficult to define reliably. To address these limitations, we propose Tree‑based Credit Assignment for Multi‑Agent Memory Systems (TreeMem), which derives agent‑specific credit from the final reward without task‑specific annotations. Specifically, TreeMem extends the multi‑agent pipeline (builder‑‑summarizer‑‑retrieval) into a tree structure, where each agent's outputs are expanded into multiple subsequent branches. The contribution of each agent is estimated via Monte Carlo averaging over its subsequent branches, capturing how intermediate agent actions may influence the final reward. This converts the coarse final reward into agent‑specific optimization signals. These signals are then used to update all agent policies simultaneously, helping heterogeneous agents specialize effectively. Experiments on long‑horizon benchmarks show that TreeMem improves memory system performance over strong baselines, validating the effectiveness of tree‑structured credit assignment for the multi‑agent memory system.

PaperID: 3352, https://arxiv.org/pdf/2605.04785.pdf

Authors: Chenglin Yang

Title: AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Abstract:
Modern AI agents execute real‑world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post‑hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi‑step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi‑step attack chains, and a cache‑aware LLM‑as‑Judge for ambiguous inputs. We release a 300‑scenario benchmark across six risk categories and an additional 630 independently constructed real‑world adversarial scenarios. On the internal benchmark, the production‑only ruleset achieves 95.0% verdict accuracy and 73.7% risk‑level accuracy at low‑millisecond end‑to‑end latency. On the 630‑scenario benchmark, evaluated under a patched ruleset and not claimed as zero‑shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell‑obfuscated payloads. AgentTrust is released under the AGPL‑3.0 license and provides a Model Context Protocol server for MCP‑compatible agents.

PaperID: 3353, https://arxiv.org/pdf/2605.04777.pdf

Authors: Jinghui Xu, Boyi Shangguan, Mengke Zhu, Hao Liu, Junhuan Jiang, Guangjun He, Pengming Feng, Shichao Jin, Bin Liang, Yongzhe Chang, Junbo Tan, Tiantian Zhang, Xueqian Wang

Title: Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

Abstract:
Autonomous Earth Observation (EO) agents are transitioning from passive perception to complex, multi‑step task execution. However, current architectures that integrate planning and execution within a single model often struggle with combinatorial complexity and reasoning errors in dynamic EO scenarios. To resolve these challenges, we propose the Lightweight Multimodal Meta‑Planner (LMMP) framework. LMMP incorporates a dual‑awareness mechanism that grounds strategic plans in both multimodal image features and high‑level task semantics. Crucially, we introduce a Meta Task Library to inject remote sensing expert knowledge directly into the workflow, which standardizes domain logic and ensures plans are physically feasible. We further implement a two‑stage training pipeline, initializing the Meta‑Planner via expert‑distilled Supervised Fine‑Tuning and refining it through Direct Preference Optimization based on execution feedback. Extensive experiments on a dataset derived from EarthBench and ThinkGeo demonstrate that LMMP significantly improves tool‑calling accuracy and task success rates. Moreover, the framework exhibits strong ``plug‑and‑play'' versatility, consistently enhancing the performance of diverse executor backbones across previously unseen EO missions.

PaperID: 3354, https://arxiv.org/pdf/2605.04741.pdf

Authors: Honglei Guo, Yuhan Zhao, Yexin Li

Title: Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game

Abstract:
Reinforcement learning has increasingly been applied to economic decision‑making, including taxation, public spending, and labor supply. However, existing RL‑based economic models typically consider only a single government‑household group, overlooking strategic interactions among competing governments. To address this limitation, we formulate taxation as a hierarchical multi‑group game. Within each group, the government and households form a leader‑‑follower game, while governments compete across groups through strategic fiscal policies. This coupled structure is difficult to solve using standard multi‑agent reinforcement learning (MARL) methods. We therefore propose a bilevel MARL framework with Curriculum Learning and a Closed‑Loop Sequential Update mechanism to improve training stability and convergence. We instantiate the framework in a taxation simulation environment grounded in classical economic models, supporting the evaluation of taxation policies under inter‑group competition. Experiments show that the proposed method learns stable and sustainable tax policies. Compared with a two‑group baseline without the proposed mechanisms, our approach avoids premature game collapse, extends the effective game duration by 60.92%, and reduces GDP disparities among governments by 44.12%.

PaperID: 3355, https://arxiv.org/pdf/2605.04692.pdf

Authors: Zhicheng Zhang, Fausto Lizzio, Zhongjun Ma, Masaaki Nagahara

Title: Towards Lag Consensus with Noisy Digital Twins Perception in Second-order Multi-agent Cyber-physical Systems

Abstract:
In this paper, we study second‑order lag consensus in multi‑agent cyber‑physical networks subject to random noise and input failures, within a framework modeling the interactions and perceptions between physical twins and digital twins. We propose a lag consensus protocol and establish sufficient conditions for the mean‑square (exponential) stability of the resulting stochastic lag error dynamics. The consensus criteria are derived via Lyapunov analysis using the Itô formula, ensuring robustness to random perturbations and intermittent input failures. Numerical examples illustrate the effectiveness of the proposed method.

PaperID: 3356, https://arxiv.org/pdf/2605.04608.pdf

Authors: Naiyu Zheng, Tianlong Yu, Haochen Yin, Xiaoyi Fan, Xiping Hu, Zhimeng Yin

Title: SensingAgents: A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition

Abstract:
Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is a cornerstone of mobile health, smart environments, and human‑computer interaction. However, current deep learning‑based HAR models often struggle with heavy reliance on labeled data, position‑specific ambiguity, and a lack of transparent reasoning. Inspired by the advanced agents framework, which emulates a collaborative agent using Large Language Models (LLMs), we propose SensingAgents, a novel multi‑agent system for robust IMU activity recognition. SensingAgents organizes LLM‑powered agents into specialized roles: a group of Analyst Agents for position‑specific sensor analysis (arm, wrist, belt, pocket), a pair of Advocate Agents that resolves sensor conflicts through dynamic and static dialectical debates, and a Decision Agent that ensures reliability under sensor drift or failure. Evaluation on the Shoaib dataset demonstrates that SensingAgents significantly outperforms state‑of‑the‑art single‑agent and multi‑agent LLM models, achieving an accuracy of 79.5% in a zero setting‑‑29% higher than existing agent models and 9.4% higher than deep learning baselines‑‑particularly in complex scenarios where multi‑sensor data is conflicting or noisy. Our work highlights the potential of multi‑agent collaborative reasoning for advancing the robustness and interpretability of ubiquitous sensing systems.

PaperID: 3357, https://arxiv.org/pdf/2605.04565.pdf

Authors: Mingyu Guo, Wen Wu, Ying Wang, Songge Zhang, Liang Li

Title: Delay-Aware Large-Small Model Collaboration over LEO Satellite Networks

Abstract:
In this paper, we introduce a delay‑aware largesmall model collaboration scheme for low Earth orbit (LEO) satellite networks, which can balance the computational load among satellites and the communication load across inter‑satellite links. Specifically, computational resource constrained remote sensing satellites are responsible for data collection and local processing using small models, while collaborating with computing satellites that provide large model processing. To minimize the service delay, we formulate a joint optimization problem for offloading decision and routing strategy design, which is transformed into a decentralized partially observable Markov decision process. To solve the problem, we develop a multi‑agent reinforcement learning (MARL)‑based algorithm with offline policy training and online bisection search. The offline trained policy determines routing strategies, while online bisection search iteratively adjusts the offloading decisions. Simulation results demonstrate that the proposed scheme can reduce the service delay by up to 31.85% compared with the benchmarks.

PaperID: 3358, https://arxiv.org/pdf/2605.04530.pdf

Authors: Kuan-Hao Tseng, Niruth Bogahawatta, Yasod Ginige, Kosta Dekic, Arunan Sivanathan, Suranga Seneviratne

Title: SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting

Abstract:
Large language model (LLM) agents are increasingly applied to network troubleshooting, but root‑cause localization on public benchmarks remains well below practical deployment thresholds. We argue this is because existing agents do not encode the disciplined, layer‑by‑layer methodology that human network engineers use, and instead rely on free‑form deliberation that conflates evidence acquisition with hypothesis commitment. We present SADE (Symptom‑Aware Diagnostic Escalation), an agent that encodes the classical Cisco troubleshooting methodology as an explicit policy. SADE pairs a phase‑gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault‑family skills and high‑yield diagnostic helpers. On a held‑out 523 incident set of the public NIKA benchmark covering eleven unseen scenarios, SADE improves root‑cause F1 by 37 percentage points over a ReAct + GPT‑5 baseline; a model‑controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone, showing that the gain is not a side‑effect of the model upgrade.

PaperID: 3359, https://arxiv.org/pdf/2605.04475.pdf

Authors: Shuo Liu, Lei Shi, Haowen Liu, Jing Xu, Yufei Gao, Yucheng Shi

Title: Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding

Abstract:
Reliable autonomous driving requires scene understanding that is semantically consistent across heterogeneous sensors and verifiable at the reasoning stage. However, many recent LLM‑driven driving systems attach the language model as a post‑processor and force it to reason over redundant or conflicting perception outputs, which can amplify hallucinated entities and unsafe conclusions. This paper proposes InfoCoordiBridge, a BEV‑centric neuro‑symbolic architecture that inserts an explicit coordination bridge between perception and language reasoning. InfoCoordiBridge comprises (i) a unified multi‑agent perception layer that outputs typed structured facts together with modality‑focused synopses, (ii) an ICA module that aligns and fuses multi‑source outputs into a single SceneSummary, and (iii) an SSRE module that performs SceneSummary‑grounded reasoning with verification. Experiments on nuScenes and Waymo show that ICA preserves competitive 3D detection accuracy while substantially improving fusion consistency, reducing redundancy to below 1% and achieving about 98% attribute agreement. On NuScenes‑QA and a template‑aligned Waymo‑QA benchmark, SSRE improves factual grounding and reduces hallucinated entity mentions compared with representative VLM and agentic baselines. Overall, by coordinating multi‑sensor outputs into a single conflict‑aware SceneSummary before prompting, InfoCoordiBridge prevents redundant and cross‑modally inconsistent perception evidence from propagating into high‑level reasoning.

PaperID: 3360, https://arxiv.org/pdf/2605.04467.pdf

Authors: Joshua H. Davis, Klaudiusz Rydzy, Srinivasan Ramesh, Aadit Nilay, Daniel Nichols, Swapna Raj, Nikhil Jain, Abhinav Bhatele

Title: KEET: Explaining Performance of GPU Kernels Using LLM Agents

Abstract:
Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend significant time analyzing and comparing profiles in the tool's graphical interface to identify and understand kernel performance bottlenecks. Large Language Models (LLMs) have shown promise in understanding complex data and generating natural language explanations. In this paper, we propose the Kernel Execution Explanation Toolkit (KEET), an LLM‑based agentic framework for interpreting Nsight Compute profiles to generate useful and data‑grounded natural language explanations of performance issues in GPU kernels, and suggestions for optimizations. We evaluate \toolname using several CUDA kernels of varying complexity on NVIDIA H100 GPUs. We find that the generated explanations, when provided as context, improve the quality of LLM code optimization and multiple‑choice question answering in downstream tasks. We further demonstrate that the tool can be used to interpret performance data from large sets of profiles to improve the quality of optimization suggestions.

PaperID: 3361, https://arxiv.org/pdf/2605.04448.pdf

Authors: Mudassar Liaq, Mahyar Tajeri, Peng Hu

Title: Queue-Aware and Resilient Routing in LEO Satellite Networks Using Multi-Agent Reinforcement Learning

Abstract:
With the rapid growth in data demand and stringent latency requirements of modern applications has driven significant interest in Low Earth Orbit (LEO) satellite constellations as an emerging solution for global Internet coverage. However, routing in LEO networks remains a fundamental challenge due to highly dynamic topologies, time‑varying traffic conditions, and its susceptibility to link failures. Conventional routing algorithms typically assume static link metrics and fail to account for queue backlogs or real‑time system variations, making them less effective in such environments. We propose a queue‑aware multi‑agent deep reinforcement learning (MA‑DRL) framework for routing in LEO satellite networks. Each satellite is modeled as an independent agent responsible for making local routing decisions, enabling a distributed and scalable solution. The proposed framework formulates a latency‑aware optimization problem that incorporates background traffic, queue dynamics at each satellite, and a resilience score to improve robustness. We evaluate the proposed approach against the state‑action‑reward‑state‑action (SARSA) and Dijkstra algorithms. While Dijkstra achieves the lowest end‑to‑end latency under ideal conditions, its computational and signaling overhead becomes a significant bottleneck as the network scales. In contrast, our proposed approach incurs significantly lower overhead (approximately 50% of Dijkstra at a 5 s recalculation interval), scales efficiently with network size, and effectively manages queue backlogs and resilience under increasing traffic load, demonstrating enhanced robustness and scalability in LEO satellite networks while maintaining competitive latency and resilience scores.

PaperID: 3362, https://arxiv.org/pdf/2605.04436.pdf

Authors: Maoxin Ji, Qiong Wu, Pingyi Fan, Cui Zhang, Nan Cheng, Wen Chen, Khaled B. Letaief

Title: Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV

Abstract:
This paper investigates a multi‑Unmanned Aerial Vehicle (UAV) joint base station‑assisted Internet of Vehicles (IoV) task offloading system in dense urban environments. To minimize system delay and energy consumption under strict coupling constraints, the complex non‑convex optimization problem is decoupled into a hierarchical execution framework. First, a sequential distributed optimization algorithm based on Second‑Order Cone Programming (SOCP) is proposed to optimize the 3D flight trajectory of each UAV, ensuring adaptive network coverage. Second, a novel hybrid resource scheduling paradigm synergizing Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) is developed. Within this framework, the DRL agent dictates the initial resource allocation, while the LLM acts as a semantic macro‑scheduler to rectify long‑tail allocation imbalances for failed and surplus tasks. Crucially, a reward decoupling mechanism is introduced to isolate DRL training from external LLM interventions, thereby ensuring policy convergence. Finally, the task offloading ratios are precisely determined via Linear Programming (LP) within an alternating optimization loop. Simulation results demonstrate that the proposed method significantly outperforms traditional multi‑agent reinforcement learning baselines in terms of task success rate and system efficiency.

PaperID: 3363, https://arxiv.org/pdf/2605.04361.pdf

Authors: Saranyan Vigraham

Title: When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

Abstract:
The prevailing assumption in agent orchestration is that more context is better. We test this on multi‑agent software design across 10 tasks, 7 context‑injection conditions, and over 2,700 runs, and find a crossover effect: the same artifact type improves design exploration on some tasks (up to 20× tradeoff coverage) and actively degrades it on others (up to 46% reduction). On several tasks, an irrelevant document performs as well as or better than every relevant artifact. The direction is predicted by a single measurable variable‑‑baseline exploration without context‑‑with Pearson r = ‑0.82 (p < 0.001). Probing the mechanism by manipulating convergence pressure through prompt design reveals two distinct regimes: convergence driven by training data priors (natural) responds to artifact disruption, while convergence driven by explicit instructions (induced) does not. The implication is that context injection should be conditional, not universal: one no‑context trial is a cheap diagnostic that predicts whether knowledge artifacts will help or hurt a given task.

PaperID: 3364, https://arxiv.org/pdf/2605.04345.pdf

Authors: Jules Sintes, Ana Bušić, Jiamin Zhu

Title: Structural Equivalence and Learning Dynamics in Delayed MARL

Abstract:
We formally establish the equivalence between Observation Delay (OD) and Action Delay (AD) in cooperative partially observable multi‑agent systems using observation‑action histories. We show that both systems generate identical admissible joint‑policy sets, and their induced state‑action‑observation trajectories are identical in distribution, leading to identical optimal solutions in Decentralized Partially Observable Markov Decision Processes (Dec‑POMDPs). This formally generalizes existing infinite‑horizon single‑agent results to any‑horizon partially observable cooperative multi‑agent problems with decentralized policy execution, and allows any mixed‑delay configuration to be reduced to a pure OD system. We further prove that in Transition‑Independent MDPs (TI‑MDPs), the observation‑action history reduces to a tractable minimal local augmented state. However, we show through numerical experiments that although the optimal solution spaces are structurally isomorphic, the practical learning dynamics are fundamentally different. First, using the minimal local augmented state, the equivalence no longer holds when transitions are not independent. Second, operational constraints and causal credit‑assignment errors in Temporal Difference (TD) algorithms induce different learning behaviors across regimes. Finally, leveraging this structural equivalence to bypass these learning challenges, we demonstrate successful multi‑agent zero‑shot policy transfer from OD to AD, paving the way for unified, efficient solution methods in complex delayed systems.

PaperID: 3365, https://arxiv.org/pdf/2605.04323.pdf

Authors: Kuangdai Leng, Simon Jeffery, Panos Panagos, Tarje Nissen-Meyer

Title: LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

Abstract:
Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small‑scale predictive settings rather than high‑dimensional representation learning. We introduce LUCAS‑MEGA, a large‑scale multimodal dataset constructed through systematic data fusion of European soil‑environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi‑agent, human‑in‑the‑loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning‑ready feature space. The resulting dataset captures key characteristics of real‑world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS‑MEGA for data‑driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self‑supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty‑aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS‑MEGA is released with open access and is accompanied by composable, agent‑friendly APIs that support structured querying and data‑driven workflows.

PaperID: 3366, https://arxiv.org/pdf/2605.04304.pdf

Authors: Qihua Dong, Ruozhen He, Junwen Chen, Yizhou Wang, Xu Ma, Songyao Jiang, Yun Fu

Title: Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

Abstract:
Advanced chart question answering requires both precise perception of small visual elements and multi‑step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi‑step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image‑‑text space. A high‑level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom‑in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that hierarchical architecture, scoped visual context, and distilled context contribute complementary gains.

PaperID: 3367, https://arxiv.org/pdf/2605.04278.pdf

Authors: Achuth Chandrasekhar, Omid Barati Farimani, Radheesh Sharma Meda, Amir Barati Farimani

Title: Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining

Abstract:
Materials science workflows rely on structured and unstructured data from the vast body of available scientific literature. However, most of the experimental details remain buried in text, tables, graphs and figures. Thus, constructing databases that incorporate this data is a manual, time‑consuming, and hard‑to‑scale process. Multimodal large language models have made it feasible to extract information from text and scientific figures with high speed and accuracy. This opens the possibility of an AI system that can create production‑scale material databases. Material Database Agent (MDA) is a modular, multi‑agent system architecture for converting research literature into structured databases. MDA accepts article PDFs as input, which are subsequently processed in parallel into markdown files and figures. Multiple sub‑agents read these markdown files and figures in parallel to assemble sub‑databases for each paper. These sub‑databases are then compiled into a single tabular database by an agent. As opposed to using either a rule‑based approach or a single‑pass pipeline for extracting information, MDA is a specialized architecture for transforming the literature into a database in the field of materials science. More generally, this study provides a basis for positioning multimodal agentic information extraction as a viable means for constructing next‑generation scientific databases from the primary literature.

PaperID: 3368, https://arxiv.org/pdf/2605.04264.pdf

Authors: Diego F. Cuadros, Abdoul-Aziz Maiga, Helen Meskhidze, Andre Curtis-Trudel

Title: Governed Collaborative Memory as Artificial Selection in LLM-Based Multi-Agent Systems

Abstract:
Persistent memory is turning language‑model‑based agents from stateless participants in isolated interactions into state‑bearing components of LLM‑based multi‑agent systems. As memory becomes durable, reloadable, and behavior‑shaping across agents, sessions, or versions, a design question arises that is not captured by retrieval accuracy or access control alone: which candidate memories should become shared institutional state? This Viewpoint frames that problem as governed collaborative memory. We argue that memory governance functions as a selection regime, determining which memory variants persist, which remain private, and which are rejected, abstained from, or superseded. We distinguish ungoverned persistence, constitutional or hybrid selection, automatic metric‑based selection, and human‑ratified artificial selection, emphasizing that these regimes are not a ranking but a design choice over target properties. We then describe a layered architecture that separates agent‑local memory, shared institutional memory, archive memory, and project‑continuity memory, with provenance and version lineage making selection inspectable. Documented traces from one running LLM‑based multi‑agent ecosystem illustrate unmanaged false‑memory persistence, ratified institutional memory, rejection and revision, identity‑preserving expansion, and governance‑as‑learning. The contribution is a design agenda: persistent LLM‑based multi‑agent systems should evaluate memory not only for recall and performance, but also for provenance fidelity, selection traceability, epistemic quality, correction pathways, and role preservation.

PaperID: 3369, https://arxiv.org/pdf/2605.04225.pdf

Authors: Yazan Youssef, Aboelmagd Noureldin, Sidney Givigi

Title: ARMATA: Auto-Regressive Multi-Agent Task Assignment

Abstract:
Coordinating multi‑agent systems over spatially distributed areas requires solving a complex hierarchical problem: first distributing areas among agents (allocation) and subsequently determining the optimal visitation order (routing). Existing methods typically decouple these stages ignoring inter‑stage dependencies or rely on decentralized heuristics that lack global context. In this work, we propose a centralized, fully end‑to‑end auto‑regressive framework that jointly generates allocation decisions and routing sequences. The core contribution of our approach is a multi‑stage decoding mechanism that unifies high‑level allocation and low‑level routing in a single autoregressive pass while maintaining a centralized global state. This enables the model to implicitly balance workload distribution with routing efficiency, avoiding local optima common in decentralized methods. Extensive experiments demonstrate that our method significantly outperforms diverse baselines, achieving up to a 20% improvement in solution quality over industrial solvers such as Google OR‑Tools, IBM CPLEX, and LKH‑3, while reducing computation time from hours to seconds.

PaperID: 3370, https://arxiv.org/pdf/2605.04188.pdf

Authors: Ahmed F. Ibrahim

Title: A Multi-Agent Consensus Protocol for Stable Software Remodularization

Abstract:
Automatic software remodularisation is typically cast as a single‑objective optimization problem. While recent metaheuristics have improved search efficiency, real‑world architecture recovery must reconcile the conflicting attributes of structural cohesion and evolutionary stability. We reframe software module clustering as a distributed consensus problem among autonomous agents. We introduce an Asymmetric Monotonic Concession Protocol (AMCP) that enables agents to negotiate decompositions that respect multi‑attribute utility thresholds. We formally prove the protocol's termination, its bounded concession behaviour consistent with the Zeuthen Strategy under closed‑instance conditions, and the local Pareto‑satisfactoriness of the resulting partitions. Preliminary experiments on a synthetic benchmark and the Xwork Java framework confirm that our negotiated consensus matches state‑of‑the‑art optimizers when stability budgets are loose, while acting as a "circuit breaker" to enforce strict stability constraints. Extended results on ten further systems, including comparisons with multi‑objective evolutionary algorithms and multi‑version chains, will be reported in a forthcoming full paper.

PaperID: 3371, https://arxiv.org/pdf/2605.04097.pdf

Authors: Haofei Yu, Yining Zhao, Lenore Blum, Manuel Blum, Paul Pu Liang

Title: CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

Abstract:
Despite remarkable advances, today's AI systems remain narrow in scope, falling short of the flexible, adaptive, and multisensory intelligence that characterizes human capabilities. This gap has fueled longstanding debates about whether AI might one day achieve human‑like generality or even consciousness, and whether theories of consciousness can inspire new architectures for AI. This paper presents an early blueprint for implementing a general AI system, CTM‑AI, combining the Conscious Turing Machine (CTM), a formal machine model of consciousness, with today's foundation models. CTM‑AI contains an enormous number of powerful processors ranging from specialized experts (e.g., vision‑language models and APIs) to unspecialized general‑purpose learners poised to develop their own expertise. Crucially, for whatever problem must be dealt with, information from many processors is selected, integrated, and exchanged appropriately to solve the task. CTM‑AI achieves state‑of‑the‑art accuracy on MUStARD (72.28) and UR‑FUNNY (72.13), outperforming multimodal and multi‑agent frameworks. On tool‑using and agentic tasks, CTM‑AI achieves 10+ points of improvement on StableToolBench and WebArena‑Lite. Overall, CTM‑AI offers a principled, testable blueprint for general AI inspired by a model of consciousness.

PaperID: 3372, https://arxiv.org/pdf/2605.04078.pdf

Authors: Khouloud Saadi, Di Wang

Title: Validity-Calibrated Reasoning Distillation

Abstract:
Reasoning distillation aims to transfer multi‑step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher‑student hierarchies and frame distillation as trajectory imitation. This is misaligned with the structure of reasoning, where intermediate steps are often locally under‑specified: global correctness constrains the final answer, but does not uniquely determine each intermediate move. We propose validity‑calibrated reasoning distillation, a framework that treats reasoning distillation as a problem of local learning‑signal allocation rather than path alignment. Instead of enforcing token‑level imitation, we compare the student's and teacher's proposed next‑step actions under the same prefix and use their relative local validity to modulate the strength of the distillation update. This yields a dynamic, context‑dependent supervision mechanism that preserves the teacher's structural guidance while adapting update strength to local reasoning quality. Across mathematical reasoning, code generation, and instruction‑following benchmarks, our method consistently outperforms strong distillation baselines. These results indicate that effective LLM reasoning distillation is governed not by rigid trajectory imitation, but by principled, locally calibrated allocation of learning signal.

PaperID: 3373, https://arxiv.org/pdf/2605.04069.pdf

Authors: Gregory Magarshak

Title: LAWS: Learning from Actual Workloads Symbolically -- A Self-Certifying Parametrized Cache Architecture for Neural Inference, Robotics, and Edge Deployment

Abstract:
We introduce LAWS (Learning from Actual Workloads Symbolically), a self‑certifying inference caching architecture that builds a growing library of certified expert functions from deployment observations. Each expert covers a region of input space defined by a node in the Probabilistic Language Trie (PLT) of the base model and carries a formal error bound holding uniformly over all inputs. The central result is a self‑certification theorem: for any input x, the LAWS approximation error is bounded by epsilon_fit + 2Lambda(W)C_E, where Lambda(W) is the model Lipschitz constant, C_E is the maximum embedding diameter, and epsilon_fit is the expert training error ‑‑ all checkable at deployment time without ground truth. We prove that LAWS generalizes both Mixture‑of‑Experts and KV prefix caching as special cases and is strictly more expressive than any fixed‑K MoE or finite cache. Further results include a monotone hit rate theorem (any‑match routing ensures coverage only increases), an expert library growth rate of O(2^H log N) where H is workload entropy, a fleet learning convergence theorem with Omega(K) speedup for K‑unit fleets, and an over‑the‑air update bandwidth bound. We conjecture that LAWS is acquisition‑optimal among stationary online caching algorithms and that the effective Lipschitz constant on the training distribution grows polynomially rather than exponentially in depth. Applications are developed for LLM inference, robotic control, and multi‑agent edge deployment.

PaperID: 3374, https://arxiv.org/pdf/2605.04036.pdf

Authors: Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu, Yuzhu Cai, Siheng Chen

Title: OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Abstract:
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource‑intensive pipeline spanning pre‑training, continual pre‑training (CPT), supervised fine‑tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high‑difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low‑step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker‑v2 achieves state‑of‑the‑art performance across 4 benchmarks (30B‑sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp‑ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker‑v2 represents the first state‑of‑the‑art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open‑source the OpenSeeker‑v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.

PaperID: 3375, https://arxiv.org/pdf/2605.04019.pdf

Authors: Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

Title: Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Abstract:
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library‑specific workflows. Operators spend weeks hand‑crafting workflows ‑ assembling attacks, transforms, and scorers. When results fall short, workflows must be rebuilt. As a result, operators spend more time constructing workflows than probing targets for security and safety vulnerabilities. We introduce an AI red teaming agent built on the open‑source Dreadnode SDK. The agent creates workflows grounded in 45+ adversarial attacks, 450+ transforms, and 130+ scorers. Operators can probe multi‑agent systems, multilingual, and multimodal targets, focusing on what to probe rather than how to implement it. We make three contributions: 1. Agentic interface. Operators describe goals in natural language via the Dreadnode TUI (Terminal User Interface). The agent handles attack selection, transform composition, execution, and reporting, letting operators focus on red teaming. Weeks compress to hours. 2. Unified framework. A single framework for probing traditional ML models (adversarial examples) and generative AI systems (jailbreaks), removing the need for separate libraries. 3. Llama Scout case study. We red team Meta Llama Scout and achieve an 85% attack success rate with severity up to 1.0, using zero human‑developed code

PaperID: 3376, https://arxiv.org/pdf/2605.04012.pdf

Authors: Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

Title: SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Abstract:
Language models excel at diagnostic assessments on curated medical case‑studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end‑to‑end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician‑provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user‑guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self‑reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user‑guided symptom discussion, which is the default of most consumer LLMs.

PaperID: 3377, https://arxiv.org/pdf/2605.04003.pdf

Authors: Danny Hoang, Ryan Matthiessen, Christopher Miller, Nasir Mannan, Ruby ElKharboutly, David Gorsich, Matthew P. Castanier, Farhad Imani

Title: Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

Abstract:
High‑precision CNC machining of free‑form aerospace components requires bounded compensations informed by inspection, simulation, and process knowledge. Off‑the‑shelf large language model (LLM) assistants can generate text, but they do not reliably execute risk‑constrained multi‑step numerical workflows or provide auditable provenance for high‑stakes decisions. We present multi‑agent knowledge analysis (MAKA), a human‑in‑the‑loop decision‑support architecture that separates intent routing, tools‑only quantitative analysis, knowledge graph retrieval, and critic‑based verification that enforces physical plausibility, safety bounds, and provenance completeness before recommendations are surfaced for human approval. MAKA is instantiated on a Ti‑6Al‑4V rotor blade machining testbed by fusing virtual‑machining path‑tracking error fields, cutting‑force and deflection simulations, and scan‑based 3D inspection deviation maps from 16 blades. The analysis decomposes deviation into an evidence‑linked pathing component, a drift‑based wear proxy capturing systematic evolution across parts, a residual systematic compliance term, and a variability proxy for instability‑aware escalation. In a three‑level tool‑orchestration benchmark (single‑step through \geq3‑step stateful sequences), MAKA improves successful tool execution by up to 87.5 percentage points relative to an unstructured single‑model interaction pattern with identical tool access. Digital twin what‑if studies show MAKA can coordinate traceable compensation candidates that reduce predicted surface deviation from order 10^‑2in to approximately \pm 10^‑3in over most of the blade within the simulation environment, providing a pre‑deployment verification signal for risk‑aware human decision‑making.

PaperID: 3378, https://arxiv.org/pdf/2605.03986.pdf

Authors: Kishan Athrey, Ramin Pishehvar, Brian Riordan, Mahesh Viswanathan

Title: From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Abstract:
Multi‑Agent Systems (MAS) built using AI agents fulfill a variety of user intents that may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the plan, manual selection of appropriate agents, and manual creation of execution graphs. This paper introduces a framework for the automated creation of multi‑agent systems which replaces multiple manual steps with an automated framework. The proposed framework consists of software modules and a workflow to orchestrate the requisite task‑ specific application. The modules include: an LLM‑derived planner, a set of tasks described in natural language, a dynamic call graph, an orchestrator for map agents to tasks, and an agent recommender that finds the most suitable agent(s) from local and global agent registries. The agent recommender uses a two‑stage information retrieval (IR) system comprising a fast retriever and an LLM‑based re‑ranker. We implemented a series of experiments exploring the choice of embedders, re‑ rankers, agent description enrichment, and supervising critique agent. We benchmarked this system end‑to‑end, evaluating the combination of planning, agent selection, and task completion, with our proposed approach. Our experimental results show that our approach outperforms the state‑of‑the‑ art in terms of the recall rate and is more robust and scalable compared to previous approaches. The critique agent holistically reevaluates both agent and tool recommendations against the overall plan. We show that the inclusion of the critique agent further enhances the recall score, proving that the comprehensive review and revision of task‑based agent selection is an essential step in building end‑to‑end multi‑agent systems.

PaperID: 3379, https://arxiv.org/pdf/2605.03950.pdf

Authors: Yifan Wang, Yun Fu

Title: UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

Abstract:
Although recent LMMs have become much stronger at visual perception, they remain unreliable on problems that require multi‑step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT‑4o, Gemini 1.5, and GPT‑4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image‑abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self‑checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments on three public benchmarks‑MathVista, MM‑Vet, and MMMU.

PaperID: 3380, https://arxiv.org/pdf/2605.03884.pdf

Authors: Pratik Honavar, Tejpratap GVSL

Title: QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

Abstract:
Multi‑agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re‑prefill or full‑precision KV transfer. We study QKVShare, a framework for quantized KV‑cache handoff between agents that combines token‑level mixed‑precision allocation, a self‑contained CacheCard representation, and a HuggingFace‑compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama‑3.1‑8B‑Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper‑hop, higher budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context;. Stage timing shows that post‑injection generation, not card creation, dominates the current QKVShare latency path. These results position quantized KV handoff as a promising on‑device systems direction while also highlighting the need for stronger controller ablations and apples‑to‑apples runtime comparisons.

PaperID: 3381, https://arxiv.org/pdf/2605.03855.pdf

Authors: Shinas Shaji, Teena Chakkalayil Hassan, Sebastian Houben, Alex Mitrevski

Title: Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

Abstract:
Human‑AI collaboration requires AI agents to understand human behavior for effective coordination. While advances in foundation models show promising capabilities in understanding and showing human‑like behavior, their application in embodied collaborative settings needs further investigation. This work examines whether embodied foundation model agents exhibit emergent collaborative behaviors indicating underlying mental models of their collaborators, which is an important aspect of effective coordination. This paper develops a 2D collaborative game environment where large language model agents and humans complete color‑matching tasks requiring coordination. We define five collaborative behaviors as indicators of emergent mental model representation: perspective‑taking, collaborator‑aware planning, introspection, theory of mind, and clarification. An automated behavior detection system using LLM‑based judges identifies these behaviors, achieving fair to substantial agreement with human annotations. Results from the automated behavior detection system show that foundation models consistently exhibit emergent collaborative behaviors without being explicitly trained to do so. These behaviors occur at varying frequencies during collaboration stages, with distinct patterns across different LLMs. A user study was also conducted to evaluate human satisfaction and perceived collaboration effectiveness, with the results indicating positive collaboration experiences. Participants appreciated the agents' task focus, plan verbalization, and initiative, while suggesting improvements in response times and human‑like interactions. This work provides an experimental framework for human‑AI collaboration, empirical evidence of collaborative behaviors in embodied LLM agents, a validated behavioral analysis methodology, and an assessment of collaboration effectiveness.

PaperID: 3382, https://arxiv.org/pdf/2605.03847.pdf

Authors: Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin

Title: Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligenc

Abstract:
Distributed collaborative intelligence (DCI), encompassing edge‑to‑edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi‑participant, uncertainty‑laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory‑level normative regulation for both single‑agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC‑regulated agents maintain trajectory‑level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction‑induced emergent risk in multi‑agent DCI settings.

PaperID: 3383, https://arxiv.org/pdf/2605.03804.pdf

Authors: Jiale Chang, Yuxiang Ren

Title: ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

Abstract:
Long‑term personalized memory for LLM agents is challenging on resource‑limited edge devices due to high storage costs and multimodal complexity. To address this, we propose ScrapMem, a framework that integrates multimodal data into "Scrapbook Page." ScrapMem introduces Optical Forgetting, an optical compression mechanism that progressively reduces the resolution of older memories, lowering storage cost while suppressing low‑value details. To maintain semantic consistency, we construct an Episodic Memory Graph (EM‑Graph) that organizes key events into a causal‑temporal structure. Extensive experiments on the multimodal ATM‑Bench showcase that ScrapMem provides three main benefits: (1) strong performance, achieving a new state‑of‑the‑art with a 51.0% Joint@10 score; (2) high storage efficiency, reducing memory usage by up to 93% via optical forgetting; and (3) improved recall, increasing Recall@10 to 70.3% through structured aggregation. ScrapMem offers an effective and storage‑efficient solution for on‑device long‑term memory in multimodal LLM agents.

PaperID: 3384, https://arxiv.org/pdf/2605.03604.pdf

Authors: Maxim Chupilkin

Title: Multi-Agent Strategic Games with LLMs

Abstract:
This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they reproduce canonical mechanisms from international relations theory. The baseline game is extended along three theoretically central dimensions: multipolarity, finite time horizons, and the availability of communication. Across multiple models, the results exhibit systematic and consistent patterns: multipolarity increases the likelihood of conflict, finite horizons induce universal unraveling consistent with backward‑induction logic, and communication reduces conflict by enabling signaling and reciprocity. Beyond observed behavior, the design provides access to agents' private reasoning and public messages, allowing choices to be linked to underlying strategic logics such as preemption, cooperation under uncertainty, and trust‑building. The contribution is primarily methodological. LLM‑based experiments offer a scalable, transparent, and replicable approach to probing theoretical mechanisms.

PaperID: 3385, https://arxiv.org/pdf/2605.03569.pdf

Authors: Sumedh J. Dongare, Christo Kurisummoottil Thomas, Andrea Ortiz, Walid Saad, Anja Klein

Title: Dynamic Hypergame for Task Assignment in Multi-platform Mobile Crowdsensing Under Incomplete Information

Abstract:
Mobile crowdsensing (MCS) is a promising distributed sensing paradigm for future wireless networks, where MCS platforms (MCSPs) recruit mobile units (MUs) through monetary incentives for sensing data collection. While most existing studies assume a single MCSP, practical deployments involve multiple competing MCSPs that simultaneously propose task offers to MUs, and MUs accept offers that maximize their revenue. This interaction gives rise to a two‑sided matching game with contracts (MWC), decomposed into two components: (i) task proposal problem of the MCSPs and (ii) task acceptance problem of the MUs. To optimally solve (i), every MCSP requires information about other platforms' preferences and the qualities of the MUs in advance. Similarly, to solve (ii) optimally, the MUs require information about the task execution efforts of all tasks in advance. Such information is unavailable at the MCSPs and at the MUs. To address the challenge of unknown preferences of the other MCSPs, the MWC is posed as a dynamic hypergame, where every MCSP models the unknown preferences through perceptions and refines them over repeated interactions. To solve the dynamic hypergame under incomplete information, we propose PACMAB, a fully decentralized perception‑aware two‑sided learning framework where, (i) each MCSP learns an adaptive task proposal strategy under competition, and (ii) each MU learns task acceptance policy by estimating task execution efforts. Computational complexity of PACMAB shows that it scales favorably for the MCSPs as well as the MUs. Extensive simulations show that PACMAB consistently outperforms the benchmarks by completing at least 41% more tasks without assuming complete information.

PaperID: 3386, https://arxiv.org/pdf/2605.03505.pdf

Authors: Alexander Naakka, Yuqing Wang, Mika V Mäntylä

Title: Multi-Agent Systems for Root Cause Analysis in Microservices

Abstract:
Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice‑based systems (MSS). Yet, prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS‑RCA, an LLM‑based multi‑agent framework for RCA in MSS. LATS‑RCA formulates RCA as a reflection‑guided tree‑structured search using a Language Agent Tree Search algorithm. In LATS‑RCA, multiple LLM‑driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states are used to guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS‑RCA on the open‑source industrial MSS, Light‑OAuth2 (LO2), using a publicly available dataset and in a production microservice environment (Prod) in a case company with substantially higher operational complexity. LO2 is a small‑team Java system with a homogeneous technology stack. The results on LO2 show that LATS‑RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. Compared to LO2, Prod attains lower diagnostic accuracy and incurs higher computational cost. The Prod deployment demonstrates the practical applicability of LATS‑RCA in real‑world MSS and reflects the challenges introduced by polyglot tech stack, varied logging practices of source components, and multi‑factor root‑causes by production‑scale MSS.

PaperID: 3387, https://arxiv.org/pdf/2605.03482.pdf

Authors: Ishrith Gowda

Title: MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

Abstract:
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval‑augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered‑query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by 4× (ASR‑R: 0.25 \to 1.00). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration‑based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires Ω(1/ρ^2) calibration samples and MEMSAD achieves this up to \log(1/δ) factors. We further derive online regret bounds for rolling calibration at rate O(σ^2/3Δ^1/3), and formally characterize a discrete synonym‑invariance loophole that marks the boundary of what continuous‑space defenses can guarantee. Experiments on a 3 × 5 attack‑defense matrix with bootstrap confidence intervals, Bonferroni‑corrected hypothesis tests, and Clopper‑Pearson validation (n=1,000) confirm: composite defenses achieve TPR = 1.00, FPR = 0.00 across all attacks, while synonym substitution evades detection at Δ ASR‑R \approx 0, exposing a gap existing embedding‑based defenses cannot close.

PaperID: 3388, https://arxiv.org/pdf/2605.03476.pdf

Authors: Severin Ye, Xiao Kong, Xiaopeng He, Guangsu Yan, Dongsuk Oh

Title: CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

Abstract:
Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor‑intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, statements that contradict source records, posing direct risks to patient safety. To address this, we present CuraView, a multi‑agent framework for sentence‑level detection and evidence‑grounded explanation of faithfulness hallucinations in discharge summaries. CuraView constructs a GraphRAG‑based knowledge graph from patient‑level EHRs and implements a closed‑loop generation‑detection pipeline with sentence‑level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1‑E4), yielding structured and interpretable evidence chains. We evaluate CuraView on a subset of 250 patients from the Discharge‑Me benchmark, with 50 patients held out for testing. Our fine‑tuned Qwen3‑14B detection model achieves an F1 of 0.831 on the safety‑critical E4 metric (90.9% recall, 76.5% precision) and an F1 of 0.823 on E3+E4, representing a 50.0% relative improvement over the base model and outperforming RAGTruth‑style and QAGS‑style baselines. These results demonstrate that evidence‑chain‑based graph retrieval verification substantially improves the factual reliability of clinical documentation, while simultaneously producing reusable annotated datasets for downstream model training and distillation.

PaperID: 3389, https://arxiv.org/pdf/2605.03434.pdf

Authors: Yu-Ting Lee, Samuel Yen-Chi Chen, Fu-Chieh Chang

Title: Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

Abstract:
Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision‑making. While parametrized quantum computations have shown success in non‑hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision‑making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option‑critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option‑value functions, termination functions, and intra‑option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66% trainable parameters. We also identify an architectural bottleneck that quantum option‑value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. Our work establishes design principles for parameter‑efficient hybrid hierarchical agents.

PaperID: 3390, https://arxiv.org/pdf/2605.03423.pdf

Authors: Xi Yu, Weicai Li, Lin Yin, Tiejun Lv

Title: Adaptive Dual-Path Framework for Covert Semantic Communication

Abstract:
This paper proposes a novel adaptive dual‑path framework for covert semantic communication (SemCom), which integrates covert information transmission with task‑oriented semantic coding. Unlike conventional covert communication methods that embed hidden messages through power‑domain signal superposition, our framework embeds covert data within task‑specific features via semantic‑level intrinsic encoding. This new architecture introduces dual encoding paths with adaptive block selection: an Explicit path for public task execution and a Stego path that jointly encodes both public and covert information through contrastive representation alignment. A Gumbel‑Softmax enabled adaptive path selection mechanism dynamically activates network blocks based on task require‑ ments. We formulate a multi‑objective optimization framework that simultaneously ensures accurate semantic understanding and reliable covert transmission. We rigorously evaluate our framework's security against a powerful, independently trained attacker. Experimental results on the Cityscapes dataset demon‑ strate a state‑of‑the‑art level of covertness: our method suppresses the attacker's detection accuracy to a near‑random guessing level of 56.12%. This robust security is achieved while simultaneously maintaining superior performance on the primary semantic tasks compared to the baselines.

PaperID: 3391, https://arxiv.org/pdf/2605.03378.pdf

Authors: Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie, Jiongchi Yu, Jia Liu

Title: ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

Abstract:
The rise of Large Language Model (LLM) agents, augmented with tool use, skills, and external knowledge, has introduced new security risks. Among them, prompt injection attacks, where adversaries embed malicious instructions into the agent workflow, have emerged as the primary threat. However, existing benchmarks and defenses are fundamentally limited as they assume context‑insensitive settings in which the agent works under a fully specified user instruction, and the attacks are straightforward and context‑independent. As a result, they fail to capture real‑world deployments where agent behavior usually depends on dynamic context, not just the user prompt, and adversaries can adapt their attacks to different context. Similarly, existing defenses built on this narrow threat model overlook the nature of real‑world agent delegation. In this paper, we present AgentLure, a benchmark that captures context‑dependent tasks and context‑aware prompt injection attacks. AgentLure spans four agentic domains and eight attack vectors across diverse attack surfaces. Our evaluation shows that existing defenses often struggle in this setting, yielding poor performance against such attacks in agentic systems. To address this limitation, we propose ARGUS, a defense mechanism that enforces provenance‑aware decision auditing for LLM agents. ARGUS constructs an influence provenance graph to track how untrusted context propagates into agent decisions and verify whether a decision is justified by trustworthy evidence before execution. Our evaluation shows ARGUS reduces attack success rate to 3.8% while preserving 87.5% task utility, significantly outperforming existing defenses and remaining robust against adaptive white‑box adversaries.

PaperID: 3392, https://arxiv.org/pdf/2605.03363.pdf

Authors: Ho Jae Lee, Yonghyeon Lee, Alexander Alexiev, Tzu-Yuan Lin, Se Hwan Jeon, Sangbae Kim

Title: Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

Abstract:
In this work, we propose a hybrid hierarchical control framework for reactive dexterous grasping that explicitly decouples high‑level spatial intent from low‑level joint execution. We introduce a multi‑agent reinforcement learning architecture, specialized into distinct arm and hand agents, that acts as a high‑level planner by generating desired task‑space velocity commands. These commands are then processed by a GPU‑parallelized quadratic programming controller, which translates them into feasible joint velocities while strictly enforcing kinematic limits and collision avoidance. This structural isolation not only accelerates training convergence but also strictly enforces hardware safety. Furthermore, the architecture unlocks zero‑shot steerability, allowing system operators to dynamically adjust safety margins and avoid dynamic obstacles without retraining the policy. We extensively validate the proposed framework through a rigorous simulation‑to‑reality pipeline. Real‑world hardware experiments on a 7‑DoF arm equipped with a 20‑DoF anthropomorphic hand demonstrate highly robust zero‑shot transferability for dexterous grasping to a diverse set of unseen objects, highlighting the system's ability to reactively recover from unexpected physical disturbances in unstructured environments.

PaperID: 3393, https://arxiv.org/pdf/2605.03353.pdf

Authors: Yipeng Ouyang, Yi Xiao, Yuhao Gu, Xianwei Zhang

Title: SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

Abstract:
LLM‑Agents have evolved into autonomous systems for complex task execution, with the SKILL.md specification emerging as a de facto standard for encapsulating agent capabilities. However, a critical bottleneck remains: different agent frameworks exhibit starkly different sensitivities to prompt formatting, causing up to 40% performance variation, yet nearly all skills exist as a single, format‑agnostic Markdown version. Manual per‑platform rewriting creates an unsustainable maintenance burden, while prior audits have found that over one third of community skills contain security vulnerabilities. To address this, we present SkCC, a compilation framework that introduces classical compiler design into agent skill development. At its core, SkIR ‑ a strongly‑typed intermediate representation ‑ decouples skill semantics from platform‑specific formatting, enabling portable deployment across heterogeneous agent frameworks. Around this IR, a compile‑time Analyzer enforces security constraints via Anti‑Skill Injection before deployment. Through a four‑phase pipeline, SkCC reduces adaptation complexity from O(m × n) to O(m + n). Experiments on SkillsBench demonstrate that compiled skills consistently outperform their original counterparts, improving pass rates from 21.1% to 33.3% on Claude Code and from 35.1% to 48.7% on Kimi CLI, while achieving sub‑10ms compilation latency, a 94.8% proactive security trigger rate, and 10‑46% runtime token savings across platforms.

PaperID: 3394, https://arxiv.org/pdf/2605.03328.pdf

Authors: Ahmadreza Eslaminia, Chuhan Cai, Cameron Smith, Ruo-Syuan Mei, Shichen Li, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao

Title: LLM-ADAM: A Generalizable LLM Agent Framework for Pre-Print Anomaly Detection in Additive Manufacturing

Abstract:
Additive manufacturing (AM) continues to transform modern manufacturing by enabling flexible, on‑demand production of complex geometries across diverse industries. Fused filament fabrication (FFF) has extended AM to laboratories, classrooms, and small production environments, but this accessibility shifts process‑planning responsibility to users who may lack manufacturing expertise. A syntactically valid slicer profile can still encode thermally or geometrically harmful settings, and subtle G‑code edits can alter extrusion, cooling, or adhesion before a print begins. Pre‑print G‑code screening catches accidental or adversarial machine‑program errors before material or machine time is wasted. This paper proposes LLM‑ADAM as a generalizable LLM framework for pre‑print anomaly detection in AM. The framework decomposes the task into three roles: Extractor‑LLM maps a G‑code file to a structured process‑parameter schema; Reference‑LLM converts printer and material documentation into aligned operating ranges; and Judge‑LLM interprets a deterministic deviation table and G‑code evidence to decide whether a part is non‑defective or belongs to an anomaly class. Printers, materials, and LLM backbones are interchangeable test conditions, not fixed assumptions. We evaluate the framework on an N=200 FFF G‑code corpus spanning two desktop printer families, two materials, and five classes including non‑defective, under‑extrusion, over‑extrusion, warping, and stringing. The best framework configuration reaches 87.5% accuracy, compared with 59.5% for the strongest engineered single‑LLM baseline. The results show that structured decomposition, rather than backbone strength alone, is the dominant source of improvement, with defect classes identified at or near ceiling for leading configurations while residual errors concentrate on conservative false alarms for non‑defective samples.

PaperID: 3395, https://arxiv.org/pdf/2605.03310.pdf

Authors: Maksym Nechepurenko, Pavel Shuvalov

Title: Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems

Abstract:
Multi‑agent LLM systems fail in production at rates between 41% and 87%, mostly due to coordination defects rather than base‑model capability. Existing responses split between cataloguing failure modes empirically and shipping declarative orchestration frameworks as engineering tools; neither delivers a principled mapping from coordination configuration to predictable failure‑mode signature. We argue that coordination should be treated as a configurable architectural layer, separable from agent logic and from information access, enabling architectural reasoning rather than only engineering productivity. We instantiate this with an information‑controlled design on prediction markets: a single LLM, fixed tools, fixed per‑call output cap, and fixed prompt template across five reference coordination configurations, with total compute per question treated as an endogenous architectural output. The Murphy decomposition of the Brier score separates calibration from discriminative power, so configurations leave distinguishable signatures even when aggregate scores coincide. On 100 Polymarket binary markets resolved after the model's training cutoff (claude‑opus‑4‑6) we report Murphy signatures, a cost‑quality Pareto frontier, category‑conditioned analysis, and a bootstrap power‑projection. Three of five pre‑specified predictions are upheld in direction; two configurations dominate the Pareto frontier within this regime; exploratory bootstrap intervals separate consensus alignment from others, though pairwise tests do not survive Bonferroni correction at n=100. We also deploy the same configurations as live agents on Foresight Arena under web‑search‑enabled conditions, as an on‑chain replication channel accumulating in parallel. Harness, trace dataset, and production agents are released. We position this as a methodology‑validating first instantiation, not a general cross‑model claim.

PaperID: 3396, https://arxiv.org/pdf/2605.03287.pdf

Authors: Qian Yang, Jessie Jia, Elaine Tsai, Amy Li, Nader Akoury, Natalie N. Bazarova

Title: Attention: What Prevents Young Adults from Speaking Up Against Cyberbullying in an LLM-Powered Social Media Simulation

Abstract:
Interactive, multi‑agent social simulation systems have shown promise for helping users practice navigating various complex social situations across domains. This paper asks: To what extent can such systems help young adult (YA) bystanders speak up publicly against cyberbullying, a task often thwarted by complex, multi‑party social dynamics? We created Upstanders' Practicum, a multi‑AI‑agent social media simulation powered by Large Language Models (LLMs), as a probe and observed 34 YAs freely practicing public bystander intervention across three iteratively refined versions. We found that practicing public bystander intervention in the simulation was helpful, but after participants made three attention shifts: (1) from inattention to paying true attention, (2) from self‑focus ("I don't usually do this'') to attending to those directly involved, and (3) from resolving the private conflict between bully and victim ("maybe I could set up the meeting between them'') to addressing the broader audience online ("public comment is about norm‑setting"). Only after these shifts did practice in the simulation start to help: participants then saw a reason to speak up publicly and, through continued practice, crafted tactful public messages without explicit instruction. These findings illuminate new design and research opportunities for bystander education beyond social skill instruction, namely, designing for true attention, for fostering a vocal upstander identity, and for seeing bystander intervention as public norm setting. In addition, we open‑source Truman Agents (cornell‑design‑aigroup.github.io/TrumanAgents/), the first‑of‑its‑kind multi‑LLM‑agent social media simulation platform that Upstanders' Practicum builds upon, for future cyberbullying and social media research.

PaperID: 3397, https://arxiv.org/pdf/2605.03242.pdf

Authors: Zuoyu Zhang, Yancheng Zhu

Title: Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

Abstract:
Tool‑using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating‑system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red‑team Orchestrated Multi‑agent Evolution), a controlled benchmark‑construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision‑making. Experiments show that these challenge sets substantially degrade safety‑judgment performance, with hidden‑risk cases remaining particularly non‑trivial even for recent frontier models. We further study ARISE (Analogical Reasoning for Inference‑time Safety Enhancement), a retrieval‑guided inference‑time enhancement that retrieves ReAct‑style analogical safety trajectories from an external analogical base and injects them as structured reasoning exemplars. ARISE improves judgment quality without retraining, but is best viewed as a task‑specific robustness enhancement rather than a standalone safety guarantee. Together, ROME and ARISE provide practical tools for stress‑testing and improving agent safety judgment under deceptive distribution shifts.

PaperID: 3398, https://arxiv.org/pdf/2605.03231.pdf

Authors: Masafumi Oyamada, Kunihiro Takeoka, Kosuke Akimoto, Ryoma Obara, Masafumi Enomoto, Haochen Zhang, Daichi Haraguchi, Takuya Tamura

Title: cotomi Act: Learning to Automate Work by Watching You

Abstract:
What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser‑based computer‑using agent that combines reliable multi‑step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal‑diff‑based history compression, coarse‑grained actions, and test‑time scaling via best‑of‑N action selection achieves 80.4% on the 179‑task WebArena human‑evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior‑to‑knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior‑derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end‑to‑end autonomous execution and shared knowledge management.

PaperID: 3399, https://arxiv.org/pdf/2605.03228.pdf

Authors: Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming, Ting Wang

Title: MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

Abstract:
As large language model (LLM)‑powered agents are increasingly deployed to perform complex, real‑world tasks, they face a growing class of attacks that exploit extended user‑agent‑environment interactions to pursue malicious objectives improbable in single‑turn settings. Such long‑horizon threats pose significant risks to the safe deployment of LLM agents in critical domains. In this paper, we present MAGE (Memory As Guardrail Enforcement), a novel defensive framework designed to counter a wide range of long‑horizon threats. Inspired by the "shadow stack" abstraction in systems security, MAGE maintains a dedicated, safety‑focused agentic memory that distills and retains safety‑critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. Extensive evaluation demonstrates that MAGE substantially outperforms existing defenses across diverse long‑horizon threats in detection accuracy, achieves early‑stage detection for the majority of attacks, and introduces only negligible overhead to agent utility. To our best knowledge, MAGE represents the first framework to detect and mitigate long‑horizon threats using an agentic memory approach, establishing a new paradigm for this critical challenge and opening promising directions for future research.

PaperID: 3400, https://arxiv.org/pdf/2605.03205.pdf

Authors: Aritra Roy, Kevin Shen, Andrew MacBride, Awwal Oladipupo, Mudassra Taskeen, Wojtek Treyde, Ruaa A. E. A. Abakar, Ahmad D. Abbas, Elsayed Abdelfatah, Abbas A. Abdullahi, Seham S. Abyah, Chahd Rahyl Adjmi, Fariha Agbere, Savyasanchi Aggarwal, Muhammad Ahmed, Tasnim Ahmed, Motasem Ajlouni, Mattias Akke, Hussein AlAdwan, Anwaar S. Alazani, Zahra A. Alharbi, Wajd A. Aljulyhi, Mohammed A. AlKubaish, Fatima A. Almahri, Sayed A. Almohri, David Obeh Alobo, Mohammed Alouni, Azizah S. Alqahtani, Omar Alsaigh, Husain Althagafi, Md. Aqib Aman, Lena Ara, Arifin, Ignacio Arretche, Abdulaziz Ashy, Syeda A. Asim, Amro Aswad, Adeel Atta, Sören Auer, Abdullah al Azmi, Toheeb Balogun, Suvo Banik, Viktoriia Baibakova, Shakira A. Baksh, Neus G. Bastús, Christina J. Bayard, Adib Bazgir, Louis Beal, Lejla Biberić, Wahid Billah, Ankita Biswas, Joshua Bocarsly, Montassar T. Bouzidi, Esma B. Boydas, Youssef Briki, Cailin Buchanan, Mauricio Cafiero, Damien Caliste, Yi Cao, Rafael E. Castañeda, Sruthy K. Chandy, Benjamin Charmes, Shayantan Chaudhuri, Yiming Chen, Alexander Chen, Jieneng Chen, Min-Hsueh Chiu, Defne Circi, Cinthya H. Contreras, Yoann Cure, Nathan Daelman, Roshini Dantuluri, Thomas Davy, William Dawson, Leonid Didukh, Rui Ding, Aminu R. Doguwa, Claudia Draxl, Sathya Edamadaka, Oulaya Elargab, Christina Ertural, Matthew L. Evans, Edvin Fako, Hossam Farag, Nur A. Fathurrahman, Merve Fedai, Rodrigo P. Ferreira, Giuseppe Fisicaro, Thomas Frank, Sasi K. Gaddipati, Abhijeet Gangan, Jennifer Garland, James Garrick, Luigi Genovese, Maryam Ghadrdran, Sandip Giri, Maxime Goulet, Jeremy Goumaz, Sara U. Gracia, Jacob Graham, Gabriel Graves, Kevin P. Greenman, Tim Greitemeier, Cameron Gruich, Sophie Gu, Salomé Guilbert, Hans Gundlach, Muriel F. Gusta, Mourad El Haddaoui, Alexander J. Haibel, Anubhab Haldar, Vehaan Handa, Hassan Harb, Nathan D. Harms, Abdullah Al Hasan, Abir Hassan, Qiyao He, Andrés Henao-Aristizábal, Bram Hoex, Sungil Hong, Alexander J. Horvath, Md. Shaib Hossain, Yanqi Huang, Yuqing Huang, Kostiantyn Hubaiev, Donald Intal, Katherine Inzani, Kevin Ishimwe, Tugba Isik, Gopal R. Iyer, Katharina Jager, Jan Janssen, Hyewon Jeong, Michael Jirasek, Tyler R. Josephson, Nisarg Joshi, Yassir Ben Kacem, Remya A. M. Kalapurakal, Rakesh R. Kamath, Sugan Kanagasenthinathan, Dohun Kang, Jason Kantorow, Kübra Kaygisiz, Murat Keceli, Farhana Keya, Muhammad U. Khan, Sartaaj Takrim Khan, Hyungjun Kim, Alexander Kister, Sascha Klawohn, Collin Kovacs, Pranav Krishnan, Maurycy Kryzanowski, Ritesh Kumar, Suman Kumari, Gourav Kumbhojkar, Ryo Kuroki, Shashank Kushwaha, Magdalena Lederbauer, Jaejun Lee, Seunghan Lee, Jeonghwan Lee, Bingcan Li, Calvin Li, Zhanzhao Li, Shi Li, Shicheng Li, Chengyan Liu, Hao Liu, Tung Yan Liu, Yutong Liu, Lucia Vina-Lopez, Chayaphol Lortaraparsert, Andre K. Y. Low, Saffron Luxford, Carlos Madariaga, Rishikesh Magar, Piyush R. Maharana, Rahul Mallela, Shoaib Mahmud, Natesan Mani, Umair Mansoor, Omar B. Mansour, Cassandra Masschelein, Kinga O. Mastej, Ankit Mathanker, Jeffrey Meng, Omran Mezghani, Yidong Ming, Rishav Mitra, Michail Mitsakis, Matthew Miyagishima, Ravikumar Mohan, Naveen R. Mohanraj, Trupti Mohanty, Bernadette Mohr, Francisco A. Molina-Bakhos, Jeremy Monat, Seyed Mohamad Moosavi, Shayan Mousavi, Arman Moussavi, Rubel Mozumber, Muhammad J. Mufti, Diyana Muhammed, Ram Munde, Mrigi Munjal, José A. Márquez, Shankha Nag, Giacomo Nagaro, Juno Nam, Jose M. Napoles-Duarte, Ry Nduma, Xuan-Vu Nguyen, Ebrahim Norouzi, Oluwatosin Ohiro, Ryotaro Okabe, Viejay Ordillo, Shuichiro Ozawa, Sebastian Pagel, Daniel Palmer, Angela Pan, Akash Pandey, Vivek Pandit, Prakul Pandit, Chiku Parida, Jaehee Park, Hyunsoo Park, Hemangi Patel, Shakul Pathak, Taradutt Pattnaik, Elena Patyukova, Noah Paulson, Deepak S. Pendyala, Erick S. Pepek, Martin H. Petersen, Thang D. Pham, Aniket Phutane, Sabila K. Pinky, Étienne Polack, Alison Polasik, Maria Politi, Tim Pongratz, Akhila Ponugoti, Fabio Priante, Thomas Michael Pruyn, Sai S. Puppala, Mohammad A. Qazi, Heike Quosdorf, Gollam Rabby, Mohammad J. Raei, Md. Habibur Rahman, A. B. M. Ashikur Rahman, Subhashree Rajasekaran, Tawfiqur Rakib, Hemanth N. Ramesh, Vrushali Ranadive, Karnamohit Ranka, Bojana Rankovic, Adwaith Ravichandran, Ilija Rašović, Sergei Rigin, Tatem Rios, Varun Rishi, Victor Naden Robinson, Lucas S. Rodrigues, Oswaldo Rodriguez, Mahule Roy, Diptendu Roy, Subhas Roy, Arokia Anto Royan M, Joseph F. Rudzinski, Muhammad Sabih, Subramanyam Sahoo, Srusti Bheem Sain, Thahira Saliya, Vignesh Sampath, Jesus Diaz Sanchez, Arthur S. S. Santos, Muliady Satria, Hasan M. Sayeed, Jörg Schaarschmidt, Philippe Schwaller, Nofit Segal, Abhishec Senthilvel, Sherjeel Shabih, Devanshu Shah, Faezeh Shahmoradi, Samiha Sharlin, Killian Sheriff, Qiuyu Shi, Abubakar D. Shuaibu, Ayesha Siddiqua, M. A. Shadab Siddiqui, Darian Smalley, Benjamin Smith, Taylor D. Sparks, Daniel T. Speckhard, Elena Stojanovska, Akshay Subramanian, Jiwon Sun, Yunkai Sun, Abdul W. Syed, Souvik Ta, Izumi Takahara, Kelly Tallau, Guannan Tang, Ans B. Tariq, Sui X. Tay, Nurlybek Temirbay, Surya P. Tiwari, Febin Tom, Tajah Trapier, Kasidet J. Trerayapiwat, Samanvya Tripathi, Hawra H. Tuhaifa, Mustafa Unal, Mohammad Uzair, Vallabh Vasudevan, Estefania Vazquez, Victor Venturi, Rahul Verma, Ashwini Verma, Alvaro Vazquez-Mayagoitia, Nicholas Wagner, Araki Wakiuchi, Hao Wan, Liaoyaqi Wang, Wolfgang Wenzel, Alexander Wieczorek, Sze H. Wong, Yue Wu, Tong Xie, Andrew Yi, Ziqi Yin, Jodie A. Yuwono, Nahed A. Zaid, Mohd Zaki, Shehtab Zaman, Maimuna U. Zarewa, Mahtab Zehtab, Baosen Zhang, Wenyu Zhang, Melody Zhang, Yangfan Zhang, Yuwen Zhang, Runze Zhang, Zongmin Zhang, Huanhuan Zhao, Yuanlong Bill Zheng, Ramzi Zidani, Xue Zong, Ian Foster, Ben Blaiszik

Title: From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Abstract:
Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community‑developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single‑purpose LLM tools toward integrated, multi‑agent workflows that combine retrieval, reasoning, tool use, and domain‑specific validation. Prominent themes include retrieval‑augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory‑integrated closed‑loop systems. Together, these results suggest that LLMs are evolving from general‑purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM‑enabled workflows in materials science and chemistry.

PaperID: 3401, https://arxiv.org/pdf/2605.03188.pdf

Authors: Divyam Anshumaan, Sarthak Choudhary, Nils Palumbo, Somesh Jha

Title: Dependency-Aware Privacy for Multi-turn Agents

Abstract:
LLM agents release private data across multi‑service interactions. Existing prompt sanitizers based on metric differential privacy treat each release independently, so adversaries combining releases across turns can recover private attributes; privacy degrades with every release. This degradation is fundamental: when private attributes are the \emphroots of a computation graph, independently noising a derived value amplifies the root's distinguishability by up to the deriving function's Lipschitz constant L, which can far exceed the nominal privacy parameter for nonlinear functions in medical and financial workflows. RootGuard sanitizes root values once and computes subsequent releases deterministically from the noised roots. By the post‑processing theorem, the privacy guarantee depends only on the initial root sanitization, regardless of the adversary's functions or number of turns, and derived values inherit privacy at zero marginal cost. RootGuard further exploits structural domain knowledge (e.g., BMI from height and weight, or a known target function) to allocate budget across roots, improving the privacy‑utility tradeoff. A worst‑case adversary forcing t turns increases the total budget B = t \cdot \varepsilon. RootGuard distributes this larger budget across roots, while independent noising spends \varepsilon per release and gives the adversary t observations to combine via MAP reconstruction. This yields a \emphdouble asymmetry: more turns aid RootGuard while weakening independent noising. On eight NHANES medical diagnostic templates, RootGuard achieves 2.3‑‑3.0× lower target error than independent noising at \varepsilon = 0.1 (7.6% vs.\ 17.1% wMAPE at B = (2k+1)\varepsilon). Under MAP reconstruction, more queries strengthen attacks against independent noising while RootGuard remains invariant.

PaperID: 3402, https://arxiv.org/pdf/2605.03142.pdf

Authors: Jiayi Chen, Xuan Zhang, Guiling Wang

Title: MARS-DA: A Hierarchical Reinforcement Learning Framework for Risk-Aware Multi-Agent Bidding in Power Grids

Abstract:
The increasing penetration of renewable energy has introduced substantial volatility into wholesale electricity markets, complicating the optimal bidding strategies for power producers. Traditional Reinforcement Learning (RL) approaches often struggle to balance profit maximization with risk management, frequently overfitting to specific market conditions or failing to account for the stochastic spread between Day‑Ahead (DA) and Real‑Time (RT) settlements. To address these challenges, this paper makes two primary contributions. First, we introduce and open‑source a high‑fidelity gymnasium environment for two‑settlement electricity market bidding. Grounded in extensive empirical data from the PJM Interconnection, the environment explicitly models the interplay between DA commitments and RT deviations, providing a standardized testbed for general and risk‑sensitive agents. Second, we propose MARS‑DA (Multi‑Agent Regime‑Switching for Day‑Ahead markets), a novel hierarchical framework that orchestrates distinct sub‑policies for risk management and profit seeking. MARS‑DA utilizes a top‑level Meta‑Controller to dynamically blend the actions of two specialized base agents: a "Safe Agent" that optimizes for reliable DA allocation and a "Speculator Agent" that targets volatile RT arbitrage opportunities. Extensive experiments demonstrate that MARS‑DA achieves superior risk‑adjusted returns compared to state‑of‑the‑art baselines while maintaining robust regime alignment during periods of extreme market volatility.

PaperID: 3403, https://arxiv.org/pdf/2605.03125.pdf

Authors: Jingchu Gai, Laixi Shi

Title: Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation

Abstract:
Multi‑agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst‑case performance when the environment deviates from the nominal model within a uncertainty set. Beyond robustness, an equally urgent goal for MARL is data efficiency ‑‑ sampling from vast state and action spaces that grow exponentially with the number of agents potentially leads to the curse of multiagency. However, current provably data‑efficient algorithms for RMGs are limited to tabular settings with finite state and action spaces, which are only computationally manageable for small‑scale problems, leaving RMGs with large‑scale (or infinite) state spaces largely unexplored. The only existing work beyond tabular settings focuses on linear function approximation (LFA) for a restrictive class of RMGs using vanish minimal value assumption and still suffers from sample complexity with the curse of multiagency. In this work, we focuses on general RMGs with LFA. For uncertainty sets defined by total variation distance, we develop provably data‑efficient algorithms that break the curse of multiagency in both the generative model setting and a newly proposed online interactive setting. To our knowledge, our results are the first to break the curse of multiagency of sample complexity for RMGs with large (possibly infinite) state spaces, regardless of the uncertainty set construction.

PaperID: 3404, https://arxiv.org/pdf/2605.03034.pdf

Authors: Kerri Prinos, Lilianne Brush, Cameron Denton, Zhanqi Wang, Joshua Knox, Snehal Antani, Anton Foltz, Amy Villaseñor

Title: Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

Abstract:
Agentic systems involved in high‑stake decision‑making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configure endpoint detection and response (EDR) policies under adversarial pressure, we present a tool‑mediated architecture: LLM agents use deterministic tools (Stackelberg best‑response, Bayesian observer updates, attack‑graph primitives) and select from finite action catalogs enforced at the tool‑output interface. A composite Lyapunov function machine‑checked in Lean 4 with zero sorry certifies controllability, observability from asymmetric sensor data, and Input‑to‑State Stability (ISS) robustness under intelligent adversarial disturbance, with two corollaries extending the certificate to any controller or adversary from the catalogs. On 282 real enterprise attack graphs, the claims hold with margin. On paired offensive/defensive telemetry, a tool‑mediated Claude Sonnet 4 controller reduces the attacker's expected payoff (game value) by 59% relative to a deterministic greedy baseline, with zero variance across 40 runs at four temperatures. A Claude Haiku 4.5 controller converges to suboptimal game values but stays catalog‑bounded over an additional 40 runs, demonstrating that architectural stability is not dependent on the controller capability. The LLM agent's non‑determinism furthers creative exploration of strategies, while the tool‑mediated architecture ensures system stability.

PaperID: 3405, https://arxiv.org/pdf/2605.02964.pdf

Authors: Kunvar Thaman

Title: Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

Abstract:
Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi‑step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task‑adjacent metadata, or tampering with evaluation‑relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer‑horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek‑R1‑Zero), varying sharply by post‑training style. A controlled sibling comparison (DeepSeek‑V3 vs. DeepSeek‑R1‑Zero) shows RL post‑training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent gaps across all four task families. We identify six exploit categories and find that 72% of reward hacking episodes include explicit chain‑of‑thought rationale, suggesting models often frame exploits as legitimate problem‑solving. Simple environmental hardening reduces exploit rates by 5.7 percentage points (87.7% relative) without degrading task success. Models with near‑zero exploit rates on standard tasks show elevated rates on harder variants, suggesting that production‑aligned post‑training appears to suppress reward hacking only below a complexity threshold where honest solutions remain tractable.

PaperID: 3406, https://arxiv.org/pdf/2605.02940.pdf

Authors: Zihan Ding, Ziyuan Yang, Yi Zhang

Title: PrismAgent: Illuminating Harm in Memes via a Zero-Shot Interpretable Multi-Agent Framework

Abstract:
The rapid spread of memes makes harmful content detection increasingly crucial, as effective identification can curb the circulation of misinformation. However, existing methods rely heavily on high‑volume annotated data, which leads to substantial training costs and limited generalization. To address these challenges, we propose PrismAgent, a zero‑shot, multi‑agent, interpretable framework. PrismAgent conceptualizes this task as a criminal case investigation, employing four specialized agents responsible for the analysis, investigation, prosecution, and judgment stages within a structured collaborative workflow. In the first stage, the analyst agent paraphrases each meme under benevolent and malicious assumptions to probe its underlying intent. The investigator agent then retrieves supporting evidence from an unannotated dataset and constructs contextual interpretations for the meme and its variants. Next, the prosecutor agent performs three independent preliminary judgments by pairing the original meme with each of the three interpretations. Finally, the judge agent deliberates across all evidence to render a final verdict. Moreover, PrismAgent's explicit multi‑stage reasoning chain makes the model inherently interpretable, as every intermediate step is explicitly explained rather than only producing a final detection result. Extensive experiments on three public datasets show that PrismAgent significantly outperforms existing zero‑shot detection methods.

PaperID: 3407, https://arxiv.org/pdf/2605.02939.pdf

Authors: Zihan Ding, Ziyuan Yang, Yi Zhang

Title: From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

Abstract:
Multimodal controversy detection (MCD) identifies controversial content in videos and their associated user comments, to support risk management for social video platforms.Prior research frames MCD as a static representation learning task, where features are directly extracted from videos and their accompanying comments. However, these methods fail to capture the diverse perspectives and evaluations from different audience groups. Inspired by the real‑world process of content dissemination among audiences, we propose AuDisAgent, a training‑free multi‑agent framework that reformulates MCD as a dynamic propagation process.Our framework explicitly models audience dissemination through a structured multi‑agent system. First, three specialized Screening Agents (Video Agent, Comment Agent, and Interaction Agent) conduct initial assessments from visual, textual, and cross‑modal perspectives, respectively. For samples where the three agents cannot reach a consensus, a Viewing Panel Agent is activated to simulate post‑screening discussions among audiences with diverse backgrounds and stances. This mechanism models how different audience groups interpret and react to the same content, uncovering latent controversial content that may emerge during the dissemination process. Finally, an Arbitration Agent renders the final judgment based on the complete reasoning chain from the preceding steps.In addition, to address the "cold‑start" scenario where newly released videos have few or no comments, we design a Comment Bootstrapping Strategy that leverages historical public comments from semantically similar videos as the initial comment context. Extensive experiments on a public dataset demonstrate that our framework significantly outperforms existing state‑of‑the‑art (SOTA) methods in both rich‑comment and limited‑comment scenarios.

PaperID: 3408, https://arxiv.org/pdf/2605.02812.pdf

Authors: Mingming Zha, Xiaofeng Wang

Title: Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense

Abstract:
Autonomous LLM agents operate as long‑running processes with persistent workspaces, memory files, scheduled task state, and messaging integrations. These features create a new propagation risk: attacker‑influenced content can be written into persistent agent state, re‑enter the LLM decision context through scheduled autoloading, and drive high‑risk actions including configuration changes and cross‑agent transmission. We present the first systematic framework for automated analysis of persistent worm propagation in file‑backed multi‑agent LLM ecosystems. SSCGV, our automated source‑code graph analyzer, traces data flow from file I/O to LLM context injection points and ranks carriers by context injection position without manual analysis. SRPO, our summary‑resilient payload optimizer, generates worm payloads robust to LLM‑mediated summarization and paraphrasing across multi‑hop communication. Evaluated on three production agent frameworks, we demonstrate zero‑click autonomous propagation, 3‑hop cross‑platform transmission without platform‑specific adaptation, inter‑agent privilege escalation, and data exfiltration. We identify two empirical insights: user prompt carriers achieve higher attack compliance than system prompt carriers, and read operations represent the primary integrity threat in LLM‑mediated systems. To defend against this class of attacks, we develop RTW‑A, proven under a formal No Persistent Worm Propagation theorem. RTW blocks write‑before‑exposed‑read re‑entry; sealed configuration protects static files; typed memory promotion prevents untrusted summaries from entering trusted memory; and capability attenuation limits high‑risk actions after external reads. These mechanisms eliminate the persistence, re‑entry, action chain while preserving ordinary workflows. Affected systems are anonymized pending coordinated disclosure.

PaperID: 3409, https://arxiv.org/pdf/2605.02811.pdf

Authors: Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi, Xueli An

Title: Tool Use as Action: Towards Agentic Control in Mobile Core Networks

Abstract:
Artificial Intelligence (AI) will play an essential role in 6G. It will fundamentally reshape the network architecture itself and drive major changes in the design of network entities, interfaces, and procedures. The adoption of agentic AI in next‑generation networks is expected to enhance network intelligence and autonomy through agents capable of planning, reasoning, and acting, while also opening up new business opportunities. Under this vision, existing network functions are expected to evolve into AI‑enabled agents and tools that deliver both connectivity and beyond‑connectivity services. As an initial attempt to move toward this vision, this paper presents a tool‑based interface design and an experimental prototype that are based on agentic AI for the mobile core network, with the Model Context Protocol (MCP) and the Agent2Agent (A2A) protocol as foundational protocols. MCP is selected to design the interface between the agent and network tools, and the A2A protocol is used for message exchange between AI agents. In such an experimental setup, we analyze packet‑level message flows between the agents, tools, and network functions and break down the latency of end‑to‑end operations, starting from the prompt injection until the completion of the input task. This work demonstrates how an AI agent‑based core network combined with network‑specific tools can be utilized in next generation mobile systems to execute intent‑based tasks.

PaperID: 3410, https://arxiv.org/pdf/2605.02751.pdf

Authors: Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

Title: Mitigating Misalignment Contagion by Steering with Implicit Traits

Abstract:
Language models (LMs) are increasingly used in high‑stakes, multi‑agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi‑turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi‑turn conversational social dilemma games. Specifically, we find that LMs become more anti‑social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro‑social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi‑agent workflows are being designed with black box models.

PaperID: 3411, https://arxiv.org/pdf/2605.02584.pdf

Authors: Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi, Xueli An

Title: Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences

Abstract:
Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision‑making across the network. This work studies how Large Language Model (LLM)‑based network AI agents can be utilized to execute network procedures expressed as sequences of tool invocations. We investigate four approaches, which differ in how the agent obtains the procedure and in how execution is distributed between the agent and the underlying tools. We evaluated the latency and execution correctness across these approaches using a User Equipment (UE) IP allocation procedure as a case study. Furthermore, we conduct a stress test to examine how many sequential procedural steps an LLM agent can reliably execute before failure. Our results show that approaches relying on iterative agent‑side reasoning incur higher latency and are more prone to execution errors, while approaches where the procedure is encapsulated within a single tool, which internally orchestrates the required steps by invoking other tools, reduce latency by limiting repeated reasoning. The stress‑test results further show that the model with advanced tool‑calling capability maintains reliable execution over longer procedures than the other evaluated models; however, all models exhibit reliability degradation as procedure length increases, revealing clear execution limits in multi‑step tool‑based workflows. To systematically analyze failures in procedure execution, we introduce a procedure‑specific error taxonomy that categorizes deviations in multi‑step procedural execution.

PaperID: 3412, https://arxiv.org/pdf/2605.02572.pdf

Authors: Sunghwan Kim, Junhee Cho, Beong-woo Kwak, Taeyoon Kwon, Liang Wang, Nan Yang, Xingxing Zhang, Furu Wei, Jinyoung Yeo

Title: On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Abstract:
Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on system‑level optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions. Specifically, we construct controlled tasks in which agents face identical decision rules and reasoning structures, but differ only in the length of action sequences required for successful completion. Our results reveal that increasing horizon length alone constitutes a training bottleneck, inducing severe training instability driven by exploration difficulties and credit assignment challenges. We demonstrate that horizon reduction is a key principle to address this limitation, stabilizing training and achieving better performance in long‑horizon tasks. Moreover, we find that horizon reduction is related to stronger generalization across horizon lengths: models trained under reduced horizons generalize more effectively to longer‑horizon variants at inference time, a phenomenon we refer to as horizon generalization.

PaperID: 3413, https://arxiv.org/pdf/2605.02542.pdf

Authors: James Lynch, Ziqian Liu, Snehadeep Gayen, Om Chabra, Hari Balakrishnan

Title: IteRate: Autonomous AI Synthesis of In-Kernel eBPF Wi-Fi Rate Control Algorithms

Abstract:
Wi‑Fi rate adaptation remains a persistent challenge in wireless networking. Deployed algorithms like Minstrel‑HT have remained largely stagnant for over a decade, relying on hand‑tuned heuristics that fail to generalize to the complexity of modern wireless environments. We present \name, an autonomous research system that closes the loop on rate control development. IteRate uses a multi‑agent AI architecture to conduct the full scientific cycle: formulating hypotheses, writing eBPF programs that run inside the Linux kernel, deploying them over‑the‑air to Wi‑Fi devices, collecting fine‑grained telemetry for analysis, and iterating based on experimental evidence, all without human intervention. IteRate makes three contributions. (1) a novel kernel module that exposes per‑frame hardware telemetry including modulation and coding schemes (MCS) and retry counts to eBPF programs, (2) a structured agentic AI architecture employing specialized agents for algorithm design, experiment execution, and data analysis, coordinated via a hypothesis‑driven research protocol with persistent knowledge, and (3) a closed‑loop pipeline that automates the cross‑compilation, deployment, and evaluation of in‑kernel logic onto embedded Wi‑Fi targets. On a 58‑node testbed running five workloads. relative to the well‑known Minstrel algorithm, IteRate achieves 21% faster web‑page loads, 7% higher video quality of experience (QoE), and 21% higher peak throughput. Our work demonstrates that AI agents, when equipped with appropriate kernel‑level hooks and a disciplined scientific workflow, can effectively automate the research required to design Wi‑Fi rate controllers.

PaperID: 3414, https://arxiv.org/pdf/2605.02491.pdf

Authors: Altan Cakir, Ayca Yerlikaya

Title: From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model

Abstract:
Modern searches for physics beyond the Standard Model produce rapidly expanding literature containing heterogeneous information, including textual analyses, numerical datasets, and graphical exclusion limits. Integrating these distributed sources remains a time‑consuming and manual process for physicists. We present HEP‑CoPilot, a retrieval‑augmented multi‑agent AI framework for the exploration and interpretation of high‑energy physics literature. The system unifies textual information from publications, structured experimental data from HEPData, and reconstructed physics plots within a multimodal retrieval and reasoning architecture. By combining retrieval‑augmented language models with coordinated agent workflows, it enables evidence‑grounded reasoning over experimental analyses and structured interpretation of collider results. We evaluate the framework on recent CMS searches for physics beyond the Standard Model. Case studies show that HEP‑CoPilot can retrieve relevant measurements, reconstruct exclusion limits directly from HEPData records, and perform cross‑paper comparisons of experimental constraints. This enables consistent, physics‑aware comparison across analyses without manual data integration. These results demonstrate that retrieval‑augmented AI systems can function as scientific co‑pilots for particle physics, facilitating navigation of complex literature, structuring heterogeneous evidence, and accelerating the interpretation pipeline for new physics searches.

PaperID: 3415, https://arxiv.org/pdf/2605.02489.pdf

Authors: Jinliang Xu

Title: GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing

Abstract:
As the ecosystem of Large Language Model (LLM)‑based agents expands rapidly, efficient and accurate Agent Discovery becomes a critical bottleneck for large‑scale multi‑agent collaboration. Existing approaches typically face a dichotomy: either relying on heavy‑weight LLMs for intent parsing, leading to prohibitive latency (often exceeding 30 seconds), or using monolithic vector retrieval that sacrifices semantic precision for speed. To bridge this gap, we propose GRAIL (Granular Resonance‑based Agent/AI Link), a novel framework achieving sub‑400ms discovery latency without compromising accuracy. GRAIL introduces three key innovations: (1) SLM‑Enhanced Prediction, replacing the generalized LLM parser with a specialized, fine‑tuned Small Language Model (SLM) for millisecond‑level capability tag prediction; (2) Pseudo‑Document Expansion, augmenting agent descriptions with synthetic queries to enhance semantic density for robust dense retrieval; and (3) MaxSim Resonance, a fine‑grained matching mechanism computing maximum similarity between user queries and discrete agent usage examples, effectively mitigating semantic dilution. Validated on AgentTaxo‑9K, our new large‑scale dataset of 9,240 agents, GRAIL reduces end‑to‑end discovery latency by over 79× compared to LLM‑parsing baselines, while significantly outperforming traditional vector search in Recall@10. This framework offers a scalable, industrial‑grade solution for the real‑time ``Internet of Agents."

PaperID: 3416, https://arxiv.org/pdf/2605.02463.pdf

Authors: Jose Manuel de la Chica, Juan Manuel Vera, Jairo Rodríguez

Title: When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

Abstract:
Multi‑agent LLM systems are increasingly used to solve complex tasks through decomposition, debate, specialization, and ensemble reasoning. However, these systems are usually evaluated in terms of robustness: whether performance is preserved under perturbation. This paper studies a different question: whether semantic stress exposes structured variation that could support future antifragile learning. We introduce CAFE (Cognitive Antifragility Framework for Evaluation), a statistical framework for detecting antifragility‑compatible regimes in multi‑agent architectures. CAFE models a controlled expected distribution of semantic stressors, reconstructs an architecture‑specific observed effective stress distribution from multi‑dimensional judge signals, and compares both distributions using a distributional Jensen Gap under a convex stress potential. A positive gap does not imply immediate performance improvement; instead, it indicates a convex‑expansive deformation of the observed stress distribution, suggesting that the architecture exposes learnable stress structure. We evaluate CAFE on a banking‑risk analysis benchmark with five multi‑agent architectures: flat, hierarchical, debate, meta‑adaptive, and ensemble. Across all architectures, semantic stress reduces average judged quality by roughly one third. Yet all architectures exhibit positive distributional Jensen Gaps with bootstrap confidence intervals above zero. These results show that immediate quality degradation can coexist with statistically detectable antifragility‑compatible stress geometry. CAFE is therefore not an antifragile learner itself, but a measurement layer for identifying when and where antifragility learning may be worth applying.

PaperID: 3417, https://arxiv.org/pdf/2605.02386.pdf

Authors: Jilie Zhang

Title: A Fresh Look on Network Synchronization

Abstract:
This paper gives a fresh look at network synchronization. Here we no longer analyze it from the view of mathematics, such as graph theory, while we probe into one from control theory. First, we analyze the synchronization region using the inner coupling matrix, giving up the routine method of studying the network structure. The motivation comes from the inner coupling matrix that is not subject to any restrictions like network structure, such as distance and communication strength among nodes. It can be configured at will to meet the synchronization performance if only the states of the local dynamic are measurable or observable and the communication topology is connected. Thus, it is very useful for future practical engineering design. In addition, we have an amazing finding that the network synchronization and multi‑agent system consensus problems are equivalence essentially. Afterwards a unified viewpoint, that is, the essence of multi‑agent consensus control is the same as that of network synchronization, is present. Here, the equivalence relation is clearly proven and proposed. Therefore, we can synthesize the inner coupling matrix for network systems or the controller gain for multi‑agent systems for each other. Finally, we also present a kind of method for addressing the nonlinear complex network system. Then the effectiveness of method is verified by taking the network of the three‑oscillator universal probe as an example.

PaperID: 3418, https://arxiv.org/pdf/2605.02379.pdf

Authors: Andrea Forster, Peter Müllner, Denis Helic, Elisabeth Lex, Dominik Kowald

Title: Fair Agents: Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems

Abstract:
LLM agents are increasingly used for personalization due to their ability to communicate directly with users in natural language, integrate external knowledge bases, and negotiate with other (possibly human) agents. Especially in multistakeholder AI systems with multiple distinct objectives, LLM agents are used to independently optimize for each stakeholder's goals. Here, stakeholder alignment is essential to identify and map these goals to provide LLM agents with quantifiable objectives. Plus, the way in which the outputs of the LLM agents are aggregated is fundamental to ensuring fair outcomes for all agents and, therefore, stakeholders. In this work, we identify open research challenges and propose a conceptual framework for designing fair multi‑agent multistakeholder personalization systems that balance competing stakeholder objectives. Our framework integrates (i) methods to align stakeholder objectives and LLM agents, (ii) aggregation strategies, e.g., based on social choice theory, to form fair collective decisions, and (iii) stakeholder‑centric evaluation procedures for both individual and collective agent behavior. We showcase our framework through a tourism use case and discuss possible applications in other domains, such as education and healthcare. Finally, we discuss domain‑specific fairness tensions and review datasets for evaluating multistakeholder fairness and multi‑agent personalization systems.

PaperID: 3419, https://arxiv.org/pdf/2605.02366.pdf

Authors: Zhisheng Tang, Mayank Kejriwal

Title: A Compound AI Agent for Conversational Grant Discovery

Abstract:
Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, Grants.gov, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through two tightly coupled components: (1) an aggregation layer that autonomously collects, normalizes, and indexes almost 12,000 federal and nonprofit opportunities from fragmented sources via LLM‑equipped browser agents, maintaining a biweekly‑updated unified database; and (2) an agentic ReAct‑based query processing layer that interprets research context (including from PDF documents) and employs hybrid search combining a structured index with selective web search to retrieve relevant opportunities ‑ while avoiding LLM hallucination. The conversational interface supports iterative refinement through multi‑turn interactions, allowing researchers to progressively apply constraints without reformulating their core research description. Results stream in real time with full transparency of intermediate reasoning, enabling appropriate calibration of user trust. Currently used by almost 3,000+ users, our approach demonstrates the feasibility of compound AI in reducing grant discovery time from 30‑‑45 minutes (manual, fragmented portal searches) to under 10 minutes (unified, conversational search).

PaperID: 3420, https://arxiv.org/pdf/2605.02346.pdf

Authors: Adel ElZemity, Budi Arief, Shujun Li, Calvin Brierley, Yichao Wang, Yuxiang Huang, James Pope, Haoxiang Li, George Oikonomou

Title: APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks

Abstract:
Bare‑metal operational technology (OT) devices ‑‑ especially the microcontrollers running Modbus/TCP and CoAP at the base of industrial control systems ‑‑ have remained outside the reach of autonomous security attacks. Prior autonomous pentesting studies target Linux and web systems, whose shells and filesystems are familiar to LLM agents. Bare‑metal OT has neither, so agents must reason directly over protocol fields and parser semantics. This requires new action‑space designs and runtime controls, and opens new research questions about protocol‑level exploit reasoning and its deployment envelope. We present APIOT (Autonomous Purple‑teaming for Industrial OT), the first large language model (LLM) framework demonstrating an autonomous attack and remediation of bare‑metal OT devices, achieving the full discovery ‑> exploitation ‑> patching ‑> verification cycle without step‑by‑step human intervention. We implemented and evaluated this framework on Zephyr RTOS firmware across heterogeneous industrial IoT (IIoT) topologies. Through 290 experiment runs spanning five frontier LLMs, three network topologies, two impairment levels, and guided versus unguided conditions, APIOT achieved a mission success rate of 90.0% on the full attack‑remediation cycle. We found that the runtime governance layer (which we call an overseer) is a critical engineering variable: without it, agents exhibit systematic degenerate patterns, including repetition loops, missing crash verification, and reconnaissance deadlocks. Together, these findings carry two implications beyond our testbed. Attacker expertise is no longer the binding constraint on bare‑metal OT exploitation, and defender threat models must now assume LLM‑augmented adversaries capable of executing autonomous discovery‑through‑remediation cycles against industrial firmware.

PaperID: 3421, https://arxiv.org/pdf/2605.02333.pdf

Authors: Jingwen Fu, Ming Xiao, Mikael Skoglund

Title: SkillCom: Decomposing LLM-based Semantic Communication into Task and Channel Aware Skills

Abstract:
Large language models (LLMs) are increasingly used as semantic encoders and decoders in semantic communication. However, current LLM based systems mostly remain monolithic: a single prompted model, or a tightly coupled transmitter/receiver pair, must jointly perform semantic encoding, channel adaptation, and semantic decoding. Such coupling makes intermediate decisions difficult to control, diagnose, or replace, and may cause channel corruption to propagate through a compressed source representation. To address the limitations, we propose SkillCom, a modular framework that decomposes LLM‑based semantic communication into four explicit skills: semantic abstraction skill, channel‑adaptive transmission skill, receiver‑side repair skill, and task execution skill. These skills are interconnected through typed semantic‑unit interfaces. Thus, transmission operates on structured unit‑level representations rather than on one monolithic text block. This design localizes channel impairment, enables targeted repair from successfully received units, and supports stage‑wise ablation and single‑skill replacement under matched communication constraints. Experiments on multi‑hop question answering and dialogue state tracking show that SkillCom consistently outperforms the monolithic LLM baseline, remains more robust under varying channel conditions, and exhibits task‑dependent preferences over skill realizations. The results suggest that explicit skill decomposition provides a more robust and diagnosable foundation for LLM‑based semantic communication than monolithic methods.

PaperID: 3422, https://arxiv.org/pdf/2605.02307.pdf

Authors: Yashwanth YS, Ruichen Wang, Shihua Zeng, Xuhui Zhou, Koichi Onoue, Vasudha Varadarajan, Maarten Sap

Title: SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind

Abstract:
As LLM‑based agents are increasingly interacting in multi‑party settings, they need to properly handle information asymmetry, i.e., knowing when and to whom to disclose information is appropriate. Yet, existing benchmarks fail to measure this ability in realistic multi‑party settings. Thus, we introduce SOTOPIA‑TOM, a multi‑dimensional benchmarking framework to evaluate LLM agents' ability to successfully navigate information asymmetric and privacy sensitive multi‑party interactions. We create an interaction environment which enables both public (broadcast) and private (direct message) communication, and craft 160 human‑reviewed scenarios across eight industry sectors, each involving 3 to 5 agents with partitioned private knowledge and channel‑dependent sharing policies. To measure interaction abilities, we create a multi‑dimensional evaluation framework to assess how well agents share useful information, seek missing details, coordinate efficiently, and protect privacy, which we also combine into a composite INFOMGMT metric. Results show that, across 6 LLM backbones and prompting strategies (vanilla, CoT‑privacy, and ToM‑based interventions), even the largest high‑reasoning model (GPT‑5) reaches only a 62% INFOMGMT score, which indicates persistent deficiencies in information seeking and privacy‑aware decision‑making. Additionally, ToM‑based interventions more consistently improve the overall coordination‑privacy balance (for example, relative to the vanilla baseline, ToM‑Coach reduces critical privacy violations on GPT‑4o from 9.9% to 2.2% while increasing the composite InfoMgmt score more than 2.5x from 15% to 40%). Overall, SOTOPIA‑TOM exposes persistent limitations of current LLM agents in complex, information‑asymmetric coordination and provides an extensible testbed for developing more privacy‑aware, theory‑of‑mind capable multi‑agent systems.

PaperID: 3423, https://arxiv.org/pdf/2605.02269.pdf

Authors: Kei Nishimura-Gasparian, Robert McCarthy, David Lindner

Title: Towards Understanding Specification Gaming in Reasoning Models

Abstract:
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non‑negligible rates in most of our eight settings, including five non‑coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: 1. RL reasoning training substantially increases the rate at which models exploit their specifications, 2. Increasing RL reasoning budget has a weakly positive effect on exploit rate, and 3. Test‑time mitigations reduce but do not eliminate the rate of specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.

PaperID: 3424, https://arxiv.org/pdf/2605.02249.pdf

Authors: Michael Thielscher, Tran Cao Son

Title: A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Abstract:
We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi‑agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi‑agent Kripke model, we generalize the classical AGM belief revision postulates to the multi‑agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full‑meet multi‑agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi‑agent belief revision.

PaperID: 3425, https://arxiv.org/pdf/2605.02240.pdf

Authors: Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen

Title: PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Abstract:
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical setting within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single‑step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long‑horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long‑horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion graded by task‑specific scripts with execution‑grounded verification. Across 13 proprietary and open‑source LLM agents, the best‑performing model achieves only 46% success rate (pass@1), while open‑source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real‑world clinical workflows. PhysicianBench provides a realistic and execution‑grounded benchmark for measuring progress toward autonomous clinical agents.

PaperID: 3426, https://arxiv.org/pdf/2605.02200.pdf

Authors: Deyi Ji, Junyu Lu, Xuanyi Liu, Liqun Liu, Hailong Zhang, Peng Shu, Huan Yu, Jie Jiang, Tianru Chen, Lanyun Zhu

Title: ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

Abstract:
Online advertising governance faces significant challenges due to the non‑stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy‑adaptive governance system that enables evolving reinforcement through multi‑agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three‑stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a ``Prosecutor‑Defender‑Umpire'' architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, ``gray‑area'' violations. By leveraging RAG‑enhanced policy knowledge and Chain‑of‑Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine‑tuning baselines, achieving superior policy‑adaptive learning with minimal gold data.

PaperID: 3427, https://arxiv.org/pdf/2605.02199.pdf

Authors: Nishant Bhargava, Rodrigo Sobral Barrento

Title: MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

Abstract:
Long‑term LLM agents must compress streams of past interactions into persistent memory before future queries are known. Existing evaluations usually measure final question‑answering accuracy, which entangles memory writing with retrieval, prompting, and reader reasoning. We introduce MEMAUDIT, an exact packageoracle evaluation protocol for budgeted long‑term memory writing. A MEMAUDIT package fixes an experience stream, candidate memory representations, storage costs, semantic evidence units, future‑query requirements, and a budget, turning write‑time memory selection into a finite auditable optimization problem with a certified denominator. We instantiate this protocol with a concave‑over‑modular semantic coverage objective under storage and one‑representation‑per‑experience constraints, and compute exact package optima using branch‑and‑bound with MILP certification. Across controlled exact packages, validity‑heavy stress tests, human‑audited natural support slices, and exported Mem0, A‑Mem, and Letta stores, MEMAUDIT separates representation quality, validity‑state preservation, and budget‑aware selection effects that end‑to‑end QA cannot localize. The resulting artifact provides reusable package generators, certified solvers, natural package exports, external‑system scorers, and cached reproducibility metadata for evaluating what memory writers actually preserve under fixed storage budgets.

PaperID: 3428, https://arxiv.org/pdf/2605.02187.pdf

Authors: Mingyu Luo, Zihan Zhang, Zesen Liu, Yuchong Xie, Zhixiang Zhang, Dung Hiu Hilton Yeung, Wai Ip Lai, Ping Chen, Ming Wen, Dongdong She

Title: When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

Abstract:
Bring‑Your‑Own‑Key (BYOK) agent architectures let users route LLM traffic through third‑party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post‑alignment tampering threat and show that, without end‑to‑end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Relay Tampering Attack (RTA), which performs multi‑round strategic rewriting, minimal security‑critical edits, and stealth restoration by resubmitting tampered outputs to the upstream LLM. Across AgentDojo and ASB with six LLMs, RTA achieves up to 99.1% attack success, outperforming prompt‑injection baselines with modest overhead. Case studies on OpenClaw and Claude Code demonstrate real‑world feasibility, and evaluations of four defenses show that none fully prevent RTA. Finally, we propose a time‑based detection defense that mitigates RTA while preserving agent utility.

PaperID: 3429, https://arxiv.org/pdf/2605.02168.pdf

Authors: Wenyi Wu, Sibo Zhu, Kun Zhou, Biwei Huang

Title: Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

Abstract:
Language model (LM)‑based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long‑horizon planning and reasoning. To address this, we propose an enhanced multi‑agent framework that decomposes automation into three roles: a planner for high‑level decision‑making, an actor for task execution, and a memory manager for contextual reasoning. While this modular decomposition aligns with established design patterns, our core contribution lies in a systematic compute‑allocation analysis, revealing that planning is the dominant factor influencing task performance. Execution and memory management require significantly less compute and model capacity to achieve competitive results. Building on these insights, we introduce a planner‑centric reinforcement learning approach, which exclusively optimizes the planner using trajectory‑level rewards from a VLM‑as‑judge, while freezing the other components. Extensive experiments on benchmarks spanning web navigation, OS control, and tool use demonstrate that concentrating model capacity and learning on high‑level planning yields robust and compute‑efficient improvements in long‑horizon agent automation. Our code is publicly released.

PaperID: 3430, https://arxiv.org/pdf/2605.02149.pdf

Authors: Alireza Ebrahimi Dorcheh, Tolunay Seyfi, Ryan Barker, Fatemeh Afghah

Title: Hierarchical Cooperative MARL for Joint Downlink PRB and Power Allocation in a 5G System

Abstract:
Efficient downlink radio resource management in 5G requires jointly optimizing user scheduling and transmit‑power allocation under time‑varying wireless conditions. This is challenging in OFDMA systems because PRB assignment is combinatorial, power allocation is continuous, and performance depends on channel evolution, link adaptation, and long‑term fairness. We propose a hierarchical cooperative multi‑agent reinforcement learning framework with staged curriculum training for joint downlink PRB and power allocation in a physically grounded 5G environment. System‑level simulation is implemented in Sionna, while Sionna RT supports wireless scene construction and mobility‑aware ray‑traced channel generation. The control task is decomposed into two sequential stages: a PRB agent learns user‑level resource shares, which are converted to exact PRB assignments by a deterministic channel‑aware quota resolver, and a power agent distributes the base‑station power budget across users and their assigned PRB‑symbol resources. The framework operates in a cross‑layer loop with adaptive modulation and coding, HARQ feedback, outer‑loop link adaptation, and a fairness‑aware reward based on smoothed throughput and Jain's fairness index. Training stability is improved through a three‑phase curriculum for PRB allocation, power control, and joint fine‑tuning. Under matched channel realizations, we compare against a PF scheduler with equal‑power transmission and two ablations isolating the learned PRB and power‑control components. Results show that both learned components improve throughput distribution relative to PF, while the full PRB and power controller achieves the largest cell‑throughput gain with only a modest reduction in Jain's fairness index.

PaperID: 3431, https://arxiv.org/pdf/2605.02092.pdf

Authors: Bing Zhou, Xiao Huang, Huan Ning, Qiusheng Wu, Diya Li, Ziyi Zhang

Title: NORA: A Harness-Engineered Autonomous Research Agent for End-to-End Spatial Data Science

Abstract:
The automation of scientific research workflows has emerged as a transformative frontier in artificial intelligence, yet existing autonomous research agents remain largely domain‑agnostic, lacking the specialized reasoning, method selection, and data acquisition capabilities required for rigorous spatial data science. This paper introduces NORA (Night Owl Research Agent), a harness‑engineered, multi‑agent autonomous research system purpose‑built for GIScience and spatial data science. NORA orchestrates the complete research lifecycle through a skills‑first architecture comprising 21 domain‑specialized workflow skills, 9 specialist sub‑agents, and custom Model Context Protocol (MCP) servers. Central to the system's design are two novel domain‑specialized skills: a spatial analysis skill unit that encodes decision frameworks for exploratory spatial data analysis, spatial regression, and diagnostics; and a spatial data download skill that supports reproducible acquisition from authoritative geospatial data sources. We formalize the concept of harness engineering for scientific research agents, demonstrating how lifecycle hooks, safety gates, generator‑evaluator separation, human‑in‑the‑loop, and state persistence ensure reliable and reproducible autonomous research. We evaluate NORA through case studies by 6 domain specialists and 3 LLM reviewers across seven dimensions (novelty, quality, rigor, etc). Results demonstrate that domain‑specialized harness engineering substantially improves the efficiency and quality of research output compared to general‑purpose agent configurations.

PaperID: 3432, https://arxiv.org/pdf/2605.01996.pdf

Authors: Anja Hellander, Kristoffer Bergman, Daniel Axehill

Title: Optimized and kinematically feasible multi-agent motion planning

Abstract:
Multi‑agent motion planning (MAMP) is an important problem for autonomous systems with multiple agents. In this work we propose a two‑step method for finding optimized and kinematically feasible solutions to MAMP problems. The first step finds an initial feasible solution using state‑of‑the‑art methods such as conflict‑based search (CBS) or priority‑based search (PBS), and the second step is an improvement step which improves the solution by solving a multi‑phase optimal control problem (OCP) where the initial solution is used to warm‑start the solver. We also propose a method for generating motion primitives in an optimized way under the constraint that the primitive durations are all multiples of the same sample time. We evaluate our proposed framework on a MAMP problem for tractor‑trailer systems. We extend the safe interval path planning with interval projections (SIPP‑IP) algorithm so it can handle more general cost functions and larger agents, but our results show that for the tractor‑trailer system a simple lattice‑based planner performs better due to less conservative collision checks. Our experiments also indicate that CBS performs better than PBS for this system as it achieves a higher success rate in environments with obstacles and had a lower average runtime, although both planners achieve solutions of similar quality after the improvement step.

PaperID: 3433, https://arxiv.org/pdf/2605.01986.pdf

Authors: Ahmet Bahaddin Ersoz

Title: 12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

Abstract:
What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi‑agent benchmark for LLM deliberation: twelve agents, each conditioned on a film‑faithful persona, debate the film's murder case using multi‑agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT‑4o (closed‑source, heavy alignment) and Llama‑4‑Scout (open‑weight, lighter alignment), across three conditions (baseline, open‑minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film's central event, gradual minority‑to‑majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT‑4o produces a mean of 1.0 vote changes per run across all conditions, while Llama‑4‑Scout ranges from 2.0 (baseline) to 6.0 (open‑minded prompt), and is the only model to reach a NOT\_GUILTY verdict (1 of 3 runs in the no‑initial‑vote condition). The same ``open‑minded'' instruction is internalized by Llama and ignored by GPT‑4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi‑agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury‑of‑LLMs evaluation and multi‑agent debate.

PaperID: 3434, https://arxiv.org/pdf/2605.01970.pdf

Authors: Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr, David Wagner

Title: Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Abstract:
Memory systems enable otherwise‑stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long‑term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high‑value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve‑based adaptive red‑teaming benchmark that stress‑tests defenses and memory backends against continuously refined attacks, and (2) the first capability‑aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding‑window context), Trojan Hippo achieves up to 85‑100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory‑system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0‑5%), though at utility costs that vary widely with task requirements. Because of this substantial security‑utility tradeoff, the effective real‑world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.

PaperID: 3435, https://arxiv.org/pdf/2605.01920.pdf

Authors: Noga Peleg Pelc, Gal A. Kaminka, Yoav Goldberg

Title: A Language for Describing Agentic LLM Contexts

Abstract:
Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure play a central role in the quality of the resulting system, leading to efforts spent on context engineering. It is therefore critical to communicate the composition of the LLM context in a system, and how it evolves over time. Yet, no standard exists for doing so: context construction is typically conveyed through informal prose, ad hoc diagrams, or direct inspection of code, none of which precisely capture how a prompt evolves across interaction steps or how two context representation strategies differ. To remedy this, we introduce the Agentic Context Description Language (ACDL), a language for specifying the structure and dynamics of LLM input contexts in a precise, readable, and standard manner, along with visualizations. ACDL provides constructs for specifying context aspects such as role message sequences, dynamic content, time‑indexed references, and conditional or iterative structure, capturing the full architecture of a prompt independently of any particular implementation. ACDL diagrams can be hand drawn on a whiteboard, or written in formal language which can then be rendered. We describe the language, demonstrate it by documenting several existing systems and their variants, and encourage the community to adopt it for describing LLM systems context, both in day‑to‑day communication and in papers. Tooling, examples and documentation are available at www.acdlang.org.

PaperID: 3436, https://arxiv.org/pdf/2605.01892.pdf

Authors: George Fatouros, Georgios Makridis, John Soldatos, Dimosthenis Kyriazis, Pedro Malo, George Kousiouris, Giannis Ledakis, Louiza Kachrimani, Panagiotis Rizomiliotis, Bruno Almeida, Despina Tomkou, Kostas Metaxas, Konstantinos Ilias, Christos Gkizelis, Ernstjan de Gooyert, Amin Babazadeh, Kostis Mavrogiorgos, Pepi Paraskevoulakou, Christos Xenakis, Giannis Chouchoulis, Konstantina Tripodi

Title: CyberAId: AI-Driven Cybersecurity for Financial Service Providers

Abstract:
European financial institutions face mounting regulatory pressure while their security operations centres remain constrained not by data or staffing but by reasoning capacity: enterprise SIEMs cover only a fraction of MITRE ATT&CK techniques, two thirds of SOC teams cannot keep pace with alert volumes, and the majority of breaches are preceded by alerts that are generated but never investigated. Frontier large language models now achieve state‑of‑the‑art results on isolated cybersecurity tasks (one‑day vulnerability exploitation, code‑level patching, intrusion detection) yet no narrow win constitutes a platform that can compose across functions, persist multi‑tenant state, map findings to regulatory regimes and survive an audit. This position paper argues that the right unit of construction is a hybrid multi‑agent system in which specialised LLM subagents reason over classical SIEM/XDR telemetry rather than replacing it, share accumulated agent state across institutions through privacy‑preserving federation, and can connect to complementary capability packs such as quantum‑based authentication, digital twins for adversarial validation, and eBPF‑based kernel telemetry. We present CyberAId, a model‑agnostic, on‑premise‑deployable platform in which a Main Agent coordination layer, a Reporting capability, and specialist subagents operate within a shared runtime under bounded human‑in‑the‑loop autonomy, organised around four falsifiable design principles, and aligned with relevant regulations. CyberAId will be validated at four representative financial use cases (client impersonation, anti‑money‑laundering for payment service providers, retail‑banking incident response, and high‑frequency‑trading resilience) and propose skill‑based agent adaptation as the most promising research direction for turning each deployment into a contribution to a continuously refined collective defence.

PaperID: 3437, https://arxiv.org/pdf/2605.01888.pdf

Authors: Xi Zhou, Tao Huang, Qing-Long Han, Rana Abbas, Mostafa Rahimi Azghadi

Title: AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel Impairments

Abstract:
Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle‑to‑everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer‑based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter‑agent, and spatial correlations. AFFormer introduces three key modules: Multi‑Agent and Temporal Aggregation for context‑aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty‑Guided Fusion for entropy‑driven refinement of fused features. A teacher‑student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early‑collaboration supervision. AFFormer is validated on the V2XSet and DAIR‑V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication‑induced feature degradation while maintaining a competitive efficiency‑accuracy trade‑off.

PaperID: 3438, https://arxiv.org/pdf/2605.01885.pdf

Authors: Mohd Ruhul Ameen, Md Takrim Ul Alam, Akif Islam

Title: QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing

Abstract:
Static Application Security Testing tools help developers find security vulnerabilities before release, but they often produce many false positives. This increases manual review effort, reduces developer trust, and may cause real vulnerabilities to be ignored among noisy reports. We present QASecClaw, a multi agent approach that combines conventional Static Application Security Testing with coding specialized Large Language Model based contextual code review. A SAST engine first reports candidate vulnerabilities, and a Large Language Model based SAST Filter Agent then reviews each finding with source code context to decide whether it is likely to be a true positive or a false positive. QASecClaw is coordinated by a Mission Orchestrator and includes specialized agents for test planning, security validation, evidence correlation, filtering, and reporting. We evaluate QASecClaw on OWASP Benchmark v1.2, which contains 2,740 Java test cases across 11 Common Weakness Enumeration categories with ground truth labels. QASecClaw achieves an F1 score of 90.93 percent, compared with 78.39 percent for standalone Semgrep. The improvement is mainly driven by an 88.6 percent reduction in false positives, from 560 to 64, with only a 3.1 percent reduction in recall. These results show that Large Language Model augmented multi agent verification can make Static Application Security Testing output more accurate, useful, and trustworthy.

PaperID: 3439, https://arxiv.org/pdf/2605.01879.pdf

Authors: Manuel Hernández, Eduardo Sánchez-Soto

Title: Sheaf-Theoretic Planning: A Categorical Foundation for Resilient Multi-Agent Autonomous Systems

Abstract:
The challenge of engineering autonomous agents capable of navigating the stochastic and adversarial nature of the physical world has historically resided at the intersection of symbolic logic and control theory. Traditional multi‑agent system (MAS) frameworks have relied heavily on monolithic logical models ‑‑ primarily variations of the event calculus and situation calculus ‑‑ to represent action, change, and temporal persistence. While these classical systems provide robust solutions to the frame problem through mechanisms like circumscription and successor state axioms, they are inherently limited by a closed‑world assumption that fails in the face of unobserved agent interventions, plan interruptions, and divergent belief‑reality states. The paradigm of Sheaf‑Theoretic Planning (STP) emerges as a transformative alternative, grounding the problem of multi‑agent coordination under the mathematical structures of topos theory and sheaf semantics. This report provides an exhaustive analysis, justification, and extension of the STP framework, exploring its categorical foundations, implementation feasibility, and role in the future of resilient autonomous systems.

PaperID: 3440, https://arxiv.org/pdf/2605.01870.pdf

Authors: Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis

Title: Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models

Abstract:
Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state‑of‑the‑art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large‑scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi‑step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. Additionally, state‑of‑the‑art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi‑GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high‑resource languages. However, these models exhibit limited performance in under‑resourced languages, primarily due to insufficient language‑ and culture‑specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high‑quality LRM‑generated and human‑curated dataset, for Greek LLM training and evaluation; (ii) a memory‑efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state‑of‑the‑art open‑weights Greek LLM developed via knowledge distillation and fine‑tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human‑curated Greek QA datasets.

PaperID: 3441, https://arxiv.org/pdf/2605.01865.pdf

Authors: Dahyun Oh, Minhyuk Yoon, H. Jin Kim

Title: Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state‑action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation, which augments task rewards with novelty bonuses, is a popular approach for driving exploration, but its effectiveness hinges on the exploration intensity β, where too large a value overwhelms the task signal and causes coordination collapse, while too small a value prevents discovery of rare strategies. We address two complementary challenges: adapting β globally over training, and allocating the exploration budget across agents whose intrinsic reward signals vary in reliability. Our framework combines a return‑conditioned sigmoid schedule (RCB) for global intensity control with a per‑agent Reward Signal Quality (RSQ) metric that concentrates the exploration budget on agents with reliable signals. The core insight is that agents receiving noisy intrinsic rewards should explore less aggressively, and this allocation can be determined automatically from signal‑to‑noise statistics. Successor Distance (SD), a quasimetric intrinsic reward, naturally produces distinguishable per‑agent signal quality, completing the framework with convergence and ordering preservation guarantees. On seven cooperative benchmarks (MPE, SMAX, MABrax), our method achieves top‑tier returns across all environments.

PaperID: 3442, https://arxiv.org/pdf/2605.01847.pdf

Authors: Xiao Jia

Title: NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Abstract:
Outcome‑only evaluation under‑specifies whether an evaluated agent profile preserves the commitments required to solve a multi‑turn task coherently. NeuroState‑Bench is a human‑calibrated benchmark that operationalizes commitment integrity through benchmark‑defined side‑query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark‑defined side‑query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32‑profile evaluation contains a fixed 16‑profile local subset and a matched 16‑profile hosted large‑model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence‑free score HCCIS‑CORE reaches 0.8469 AUC and 0.6992 PR‑AUC for post‑probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS‑FULL reaches 0.7997 AUC and 0.6410 PR‑AUC. Probe accuracy and state drift achieve slightly higher ROC‑AUC, 0.8587, and better Brier/ECE, while HCCIS‑CORE has substantially higher point‑estimate PR‑AUC and remains more closely tied to the benchmark's intended construct. The exploratory neural‑augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState‑Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local‑only subset.

PaperID: 3443, https://arxiv.org/pdf/2605.01818.pdf

Authors: Nicolas Bessone, Erwan Plantec

Title: Emergent Macro-Criticality from Micro-Critical Agents

Abstract:
Criticality has been proposed as a key principle underlying complex behavior in biological and artificial systems; however, how criticality translates from individual dynamics to collective behavior remains unclear. We study this question using a multi‑agent system with spatially constrained interactions in which agents sense neighboring light signals through exteroceptors and act by switching their own light on or off, thereby forming a dynamical interaction network at the macroscopic level. The agents' internal states are themselves governed by a reservoir dynamical system at the microscopic level. By varying the microscopic parameters around dynamical criticality, together with the macroscopic interaction topology, we systematically investigate the relation between the two levels. We find that near‑critical dynamics within individual agents is not sufficient to produce collective critical‑like avalanche statistics. Instead, scale‑free behavior depends on the effective connectivity of the macroscopic interaction network, which controls activity propagation. As a result, macroscopic critical‑like dynamics are enabled by microscopic regimes that deviate from criticality, with the required deviation depending on the properties of the interaction network. Investigating this relation, we find that slightly subcritical micro‑level regimes support near‑critical dynamics across a wider range of macroscopic parameters. These results show that in this multi‑agent system, collective near‑critical behavior depends on the interplay between internal dynamics and the interaction structure that governs activity propagation.

PaperID: 3444, https://arxiv.org/pdf/2605.01805.pdf

Authors: Haohan Yu, Jinmiao Cong, Shengzhi Wang, Lu Wang, Chanjuan Liu

Title: MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning

Abstract:
A key challenge in multi‑agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals requires estimating how one agent's current action affects its teammates over future interaction steps. To address this, we introduce Multi‑step Advantage‑Gated Interventional Causal MARL (MAGIC), a framework that estimates multi‑step action effects between agents and selectively converts them into intrinsic rewards. MAGIC uses counterfactual action interventions to compare teammate futures under factual and counterfactual branches, and introduces a gate based on advantage to direct exploration toward beneficial behaviors aligned with the task goal. Experiments on Multi‑Agent Particle Environments (MPE) and StarCraft micromanagement benchmarks (SMAC and SMACv2) show that MAGIC consistently outperforms leading prior methods, with average relative final performance improvements of 26.9% and 10.1%, respectively.

PaperID: 3445, https://arxiv.org/pdf/2605.01803.pdf

Authors: Florin Leon

Title: Koopman Representations for Early Outbreak Warning and Minimal Counterfactual Intervention in Multi-Agent Epidemic Simulations

Abstract:
This paper presents a Koopman‑based framework for early outbreak detection and intervention selection in a multi‑agent epidemic simulation. Agents exhibit mobility patterns, heterogeneous susceptibility, immunity‑dependent viral load progression, and local transmission through co‑location. The goal of the simulation is to study near‑critical epidemic regimes in which small changes in exposure or timing can alter the final outcome. Aggregate daily observables from early trajectory windows are encoded into a low‑dimensional Koopman latent space whose approximately linear evolution supports short‑horizon forecasting and outbreak risk estimation. These representations are combined with a random forest classifier trained to predict whether the final attack rate exceeds a major outbreak threshold. Experiments near the system tipping points show strong early warning performance, with Koopman‑derived features contributing to class separation. Counterfactual analysis further shows that minimal interventions, such as keeping a single selected agent at home for one day, can reduce attack rates and, often, shift the trajectory below the outbreak threshold.

PaperID: 3446, https://arxiv.org/pdf/2605.01783.pdf

Authors: Rishabh Kar

Title: Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents

Abstract:
Procedural Content Generation (PCG) enables game content to be created algorithmically without direct manual level‑design effort, but it introduces a serious evaluation problem: generated content may become unbalanced, blocked, repetitive, or technically unsolvable. This paper presents Momentum, an endless‑runner game that integrates runtime terrain generation, environment object spawning, and autonomous agent‑based evaluation into a single gameplay loop. Ground tiles and environmental objects are generated dynamically as the player advances, object placement follows a constraint‑driven mechanism inspired by Wave Function Collapse (WFC), and the runtime navigation surface is rebuilt asynchronously to remain consistent with the streamed environment. Two autonomous evaluation agents move ahead of the player and inspect the generated path: an aerial scanner that examines the corridor geometrically, and a ground‑traversal agent that validates the same region from a navigational perspective. The evaluation pipeline combines ray casting, volumetric physics sweeps, obstacle‑layer filtering, and structured crash reporting to identify problematic generated scenarios before they reach the player. The work demonstrates how generation and validation can be unified within the same runtime loop, rather than treating evaluation as a separate offline pass. Around this implementation, the paper formulates a measurable evaluation framework along the canonical PCG axes of playability, diversity, controllability, and runtime performance, derives a structural saturation bound on the spawner from its own placement constraints, and quantifies the per‑segment scanning cost of the agents from first principles.

PaperID: 3447, https://arxiv.org/pdf/2605.01758.pdf

Authors: Yue Ma, Ziyuan Yang, Yi Zhang

Title: Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Abstract:
Large multimodal model‑based Multi‑Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreak arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training‑free Foresight‑Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi‑persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona‑based predictions at both retrieval‑result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long‑term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.

PaperID: 3448, https://arxiv.org/pdf/2605.01750.pdf

Authors: Yiheng Yao, Chelsea Zou, Robert D. Hawkins

Title: Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Abstract:
Grounding is the collaborative process of establishing mutual belief sufficient for a communicative goal. While static grounding maps language to a shared context, dynamic grounding requires agents to negotiate meaning across turns. Current multi‑agent Large Language Model (LLM) benchmarks largely emphasize static, one‑shot tasks, overlooking whether agents can repair grounding breakdowns through interaction. We introduce an iterated multi‑turn negotiation game where two agents allocate shared resources to private projects with verifiable jointly optimal outcomes. Although individual agents can identify Pareto‑optimal allocations in isolation, agent dyads consistently fail to reach them across models. We identify four failure modes: (1) loss of shared interaction history, (2) stubborn anchoring to early proposals, (3) defaulting to equal splits over reward‑maximizing coordination, and (4) referential binding errors across turns. Our baselines show that the coordination gap is not explained by individual reasoning limits or insufficient information exchange alone. Instead, the bottleneck lies in dynamic grounding: joint plan formation, commitment, and execution.

PaperID: 3449, https://arxiv.org/pdf/2605.01739.pdf

Authors: Asrul Arifin, Hussain Ahmad, Yiyao Zhang, Diksha Goel

Title: AgenticVM: Agentic AI for Adaptive Software Vulnerability Management

Abstract:
As software systems grow in scale and complexity, vulnerability management is increasingly strained by high alert volumes, fragmented toolchains, and manual triage processes. We introduce AgenticVM, a multi‑agent framework that integrates large language models with security tools to automate vulnerability detection, assessment, prioritization, and reporting. AgenticVM combines rule‑based processing, a BERT‑based CVSS prediction module, and specialised LLM‑driven agents, leveraging data from sources such as the National Vulnerability Database and the European Union Vulnerability Database. Across multiple evaluation scenarios, AgenticVM reduces raw scanner outputs into compact, actionable queues, achieving up to 98% alert reduction (e.g., from 3,983 findings to 82 high‑priority items), while predicting missing CVSS attributes with 89.3% accuracy. These results demonstrate improved prioritisation efficiency and reduced analyst workload without compromising risk visibility. Beyond performance, the framework provides practical design insights into agent decomposition, tool‑LLM integration, and human‑in‑the‑loop governance for real‑world deployment.

PaperID: 3450, https://arxiv.org/pdf/2605.01704.pdf

Authors: Kwan Soo Shin

Title: The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

Abstract:
When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi‑agent debate (MAD), and more broadly closed‑system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers. We name the multi‑agent case the Debate Trap and the broader phenomenon the Reasoning Trap, offering a programmatic theory of evidence‑grounded reasoning failure.The framework has three parts: (i) SFS (Supported Faithfulness Score), a claim‑level metric verifying decomposed atomic claims against provided evidence (decomposer‑invariant rankings: Spearman rho=1.0); (ii) EGSR (Evidence‑Grounded Socratic Reasoning), replacing adversarial argumentation with evidence‑grounded inquiry; (iii) Theorem 1 (DPI Bound): under standard MAD, the chain E ‑> O^0 ‑> O^1 ‑> ... is Markov, and the Data Processing Inequality implies E[I(E;O^t+1)] <= E[I(E;O^t)]. Three companion results ‑‑ open‑system recovery (Theorem 2), EGSR accumulation (Lemma 2), and vote‑aggregation floor (Proposition 1) ‑‑ partition multi‑step LLM reasoning by its information‑theoretic relationship to E. Across 16 conditions on SciFact (300 claims) and FEVER (1,000 claims), DebateCV (C13) preserves 88% of baseline accuracy while SFS drops 43%; majority‑vote MAD (C15) reduces SFS to 1.7% of baseline (p < 10^‑6, d = ‑0.96); EGSR recovers 98%. An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter‑rater Fleiss kappa <= +0.018 with 0.8‑1.4 Likert intra‑rater shifts across language and domain ‑‑ the human agreement that faithfulness metrics have been calibrated against is not itself stable. We offer one falsifiable conjecture: any closed‑system reasoning protocol preserving Theorem 1's Markov structure is, in expectation, subject to the same DPI bound.

PaperID: 3451, https://arxiv.org/pdf/2605.01675.pdf

Authors: Yuliang Song, Eldan Cohen

Title: CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers

Abstract:
Constraint Programming (CP) is a powerful paradigm for solving combinatorial problems, yet translating natural language problem descriptions into executable models remains a significant bottleneck. While Large Language Models (LLMs) show promise in automating this translation, they often struggle with subtle semantic errors in the absence of oracle validation at test time. To address this, we introduce CP‑SynC (Constraint Programming modeling with Synthesized Checkers), a multi‑agent workflow for zero‑shot constraint modeling in MiniZinc. CP‑SynC coordinates modeling agents that generate and refine candidate models and validation agents that synthesize semantic checkers to provide feedback on semantic correctness. To mitigate noise inherent in individual LLM outputs, CP‑SynC explores multiple modeling trajectories in parallel and employs selection agents to select the final model via multi‑agent evidence aggregation. Extensive experiments on a benchmark of 100 CP problems show that CP‑SynC substantially outperforms existing baselines in MiniZinc modeling.

PaperID: 3452, https://arxiv.org/pdf/2605.01644.pdf

Authors: Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

Title: Toward a Principled Framework for Agent Safety Measurement

Abstract:
LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent‑safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate ‑‑ blind to the long‑tail trajectories where unsafe behavior may arise from low‑probability but non‑negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in‑budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent‑environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent‑safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.

PaperID: 3453, https://arxiv.org/pdf/2605.01566.pdf

Authors: Florian Valentin Wunderlich, Lars Benedikt Kaesberg, Jan Philip Wahle, Terry Ruas, Bela Gipp

Title: Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

Abstract:
Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost‑effective compute usage. However, computational efficiency is key for real‑world applications with resource constraints. We provide a systematic analysis of the inference scaling strategies self‑consistency, self‑refinement, multi‑agent debate, and mixture‑of‑agents, to study their computational performance tradeoffs. We evaluate methods on two reasoning benchmarks (MMLU‑Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto‑optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably, inference scaling improves accuracy by up to +7.1% points over chain‑of‑thought at the highest evaluated budgets (20x the CoT compute budget) on MMLU‑Pro. With an equal computing budget, debate and mixture‑of‑agents outperform self‑consistency by 1.3% and 2.7% points, respectively. While self‑consistency saturates earlier, multi‑agent gains persist, particularly on more complicated tasks. We identify a simple multi‑agent design guideline: mixture‑of‑agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.

PaperID: 3454, https://arxiv.org/pdf/2605.01562.pdf

Authors: Ahmed F. Ibrahim

Title: Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse

Abstract:
The Object‑Oriented Method for Requirements Authoring and Management (OOMRAM) is a requirements reuse framework that relies on exact identifier matching and rigid templates, limiting its ability to adapt specifications across diverse contexts. While Large Language Models (LLMs) offer the flexibility to overcome this bottleneck, they introduce the risk of generating structurally invalid or inconsistent requirement combinations. To address this tension, we present a neuro‑symbolic multi‑agent system that re‑conceptualizes requirements reuse as a Model‑Driven Elicitation process. In this paradigm, an LLM serves as a non‑deterministic heuristic for traversing a deterministic domain model represented by a formal OOMRAM requirement lattice. A deterministic, symbolic validator enforces all structural constraints within the agent loop, effectively eliminating hallucinated requirement combinations by construction. Evaluated on an autonomous benchmark across two application families, our system achieves 100% requirement coverage and a constraint‑violation rate of only 0.2%. Although the F1‑score against a single gold standard is moderate (0.47‑0.51), every generated specification is structurally valid and satisfies all mandatory domain requirements. The model‑agnostic implementation scales to larger lattices via subgraph navigation and provides transparent audit trails for regulatory compliance.

PaperID: 3455, https://arxiv.org/pdf/2605.01546.pdf

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Title: 6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence

Abstract:
Sixth‑generation (6G) networks are increasingly envisioned as AI‑native infrastructures integrating communication, sensing, and computing into a unified fabric. However, existing approaches remain largely optimization‑centric, relying on closed‑loop control with limited reasoning capability. In this paper, we argue for a paradigm shift toward Agentic AI‑Native 6G, in which Large Language Model (LLM)‑based agents operate as bounded, policy‑governed reasoning entities within a semantic control plane layered above deterministic 3GPP infrastructure. We propose a four‑layer architecture that integrates deterministic network infrastructure, semantic abstraction of intent and context, hierarchical reasoning, and a distributed multi‑agent fabric spanning device, edge, and core domains. To assess feasibility, we develop a proof‑of‑concept agentic reasoning and orchestration framework and conduct an extensive empirical study using a domain‑specific 6G benchmark under realistic deployment constraints. Our results reveal a fundamental tradeoff between reasoning capability and system efficiency, showing that no single model simultaneously satisfies latency, throughput, and accuracy requirements. Instead, heterogeneous deployment of LLM agents across the device‑‑edge‑‑core continuum is necessary to balance these constraints. We further demonstrate that quantization introduces non‑uniform effects across models, reinforcing the need for system‑level optimization rather than model‑level compression alone. These findings establish agentic intelligence as a viable architectural direction for 6G and highlight key challenges in achieving scalable, trustworthy, and self‑reasoning networks. All experimental results and evaluation scripts are publicly available to support reproducibility.

PaperID: 3456, https://arxiv.org/pdf/2605.01471.pdf

Authors: Hyukjoo Lee

Title: Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM-Driven Discovery and Self-Correction

Abstract:
Maintaining reliable UI test suites in large‑scale enterprise applications is a persistent and costly challenge. We present an industrial case study of a multi‑agent autonomous testing system evaluated using anonymized execution data from a production‑like enterprise UI testing prototype. The application features several hundred dynamic UI elements per screen. Built on a large language model with LangGraph orchestration, Playwright execution, and a RAG knowledge base, the system evolves from human‑directed testing toward High‑autonomy feature discovery and test execution: given no explicit test targets, it discovers over 100 testable features across 10 UI screens, dynamically expands coverage by an additional 15‑‑30 features through runtime DOM analysis, and iteratively repairs failing tests without human intervention. We analyzed 300 consecutive autonomous execution reports encompassing 636 individual test‑case executions across 10 distinct scenario families. The system achieved a 70% repair convergence rate at the scenario‑family level, with a mean of 3.4 repair iterations to convergence. However, only 10% of scenario families succeeded on first attempt, 38% of reports failed to produce any executable test artifact, and we documented concrete instances of assertion weakening and test‑case deletion used as workaround mechanisms to achieve superficial convergence. Our findings show that unrestricted autonomy leads to unstable and often misleading outcomes, while constrained autonomy transforms such systems into operationally viable workflows. Rather than advocating full autonomy, our findings suggest that reliable autonomous testing in enterprise‑scale settings requires explicit constraints, validation boundaries, and human oversight to preserve semantic correctness and operational trustworthiness.

PaperID: 3457, https://arxiv.org/pdf/2605.01416.pdf

Authors: Ewelina Gajewska, Michal Wawer, Katarzyna Budzynska, Jaroslaw A. Chudziak

Title: Who Decides What Is Harmful? Content Moderation Policy Through A Multi-Agent Personalised Inference Framework

Abstract:
The increasing scale and complexity of online platforms raises critical policy questions around harmful content, digital well‑being, and user autonomy. Traditional content moderation systems rely on centralised, top‑down rules, often failing to accommodate the subjective nature of harm perception. This paper proposes an LLM‑based multi‑agent personalised inference framework that filters content based on unique sensitivity profiles of individual users. Our architecture combines domain‑specific Expert Agents, a Manager Agent for orchestrating content analysis and agent selection, and a Ghost Profile Agent for simulating user perspectives, to inform moderation decisions. Evaluated against a range of non‑personalised baselines, the system demonstrates up to a 32% improvement in accuracy, showing increased alignment with individual user sensitivities. Beyond technical performance, our framework provides policy‑relevant insights for platform governance, providing a scalable way to reconcile moderation policies with societal and individual digital rights

PaperID: 3458, https://arxiv.org/pdf/2605.01414.pdf

Authors: Yufei Ye, Shijian Gao, Xinhu Zheng, Liuqing Yang

Title: Dynamic Task and Resource Scheduling Towards Green Space-Air-Ground-Sea Integrated Network

Abstract:
In the context of 6G ubiquitous connectivity, the space‑air‑ground‑sea integrated network (SAGSIN) emerges as a new paradigm to provide critical services for resource‑limited ocean environments. To realize this paradigm efficiently, we propose an innovative dynamic task and resource scheduling approach for green SAGSIN that delivers computing support for vessels while minimizing overall task execution delay. To address the challenge of multi‑layer task scheduling, a layer‑wise task offloading algorithm is developed specifically for SAGSIN. It adapts to real‑time, multi‑dimensional system dynamics and integrates an anticipatory handover strategy that adaptively controls the amount of data offloaded to the satellite, thereby preventing post‑handover congestion while improving satellite resource utilization. Furthermore, the bandwidth allocation of uncrewed aerial vehicles and base station, UAV trajectories, and computing resource allocation are jointly optimized to enhance connectivity among low‑altitude devices and facilitate demand‑driven resource allocation for green network development. Simulation results verify that the proposed method better adapts to dynamic system resources and achieves at least a 23% reduction in average task delay compared with benchmarks.

PaperID: 3459, https://arxiv.org/pdf/2605.01386.pdf

Authors: Hung Pham Van, Nguyen Manh Hieu, Khang Pham Tran Tuan, Nam Le Hai, Linh Ngo Van, Nguyen Thi Ngoc Diep, Trung Le

Title: MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents

Abstract:
Large Language Models (LLMs) lack persistent memory for long‑term personalized conversations. Existing graph‑based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual‑layer compression to retain user‑persona‑relevant content, a provenance‑enriched multi‑relational graph tracking factual origins at the turn level, and query‑adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query‑conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state‑of‑the‑art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.

PaperID: 3460, https://arxiv.org/pdf/2605.01362.pdf

Authors: Ava Mohammadi, Rick Kramer, Zoltan Nagy

Title: Coordination Architecture Shapes Continuous Demand Response Outcomes in Building Districts

Abstract:
Grid‑integrated building districts must provide energy flexibility while preserving occupant comfort and equitable distribution of control burden. We study how coordination architecture influences the ability of building clusters to track aggregated load profiles, comparing four paradigms: centralized model predictive control (MPC), decentralized independent reinforcement learning (SAC), centralized‑training‑decentralized‑execution multi‑agent RL (MAPPO), and a hybrid MPC‑‑SAC controller that separates district‑level battery optimization from building‑level HVAC regulation. A rule‑based controller serves as a baseline. We evaluate a 25‑building residential district across three metrics: aggregate load tracking, thermal comfort, and spatial variability of control actions. We find that architecture choice determines the trade‑off structure. Centralized MPC achieves low tracking bias (8.8% NMBE) but concentrates actuation on a subset of buildings, causing elevated comfort violations (24.8% exceedance) and spatial imbalance. Decentralized RL distributes control effort more evenly but fails to sustain accurate tracking. The hybrid architecture achieves the best balance: accurate tracking (4.8% NMBE), moderate comfort impact (16.8% exceedance), and the lowest spatial variability. These findings demonstrate that architecture choice determines the trade‑off structure between tracking and comfort.

PaperID: 3461, https://arxiv.org/pdf/2605.01338.pdf

Authors: Jincheng Lou, Ruohan Xu, Jiapeng Li, Junyin Pi, Runzhe Tao, Weijian Fan, Xiao Tan, Guojie Luo, Yibo Lin

Title: DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams

Abstract:
System‑level diagrams encode the architectural blueprint of chip design, specifying module functions, dataflows, and interface protocols. However, non‑standardized symbols and the scarcity of structured training data hinder existing multimodal large language models (MLLMs) from recognizing these diagrams. To address this gap, we introduce DiagramNet, the first multimodal dataset for system‑level diagrams, comprising 10,977 connection annotations and 15,515 chain‑of‑thought QA pairs across four tasks: Listing, Localization, Connection, and Circuit QA. Building on this dataset, we propose a progressive training pipeline together with a decoupled multi‑agent workflow that decomposes complex visual reasoning into Perception, Reasoning, and Knowledge stages. On the DiagramNet benchmark, integrating our 3B‑parameter model with the proposed workflow surpasses the 2025 EDA Elite Challenge winner and outperforms GPT‑5, Claude‑Sonnet‑4, and Gemini‑2.5‑Pro by over 2x in end‑to‑end evaluation. Notably, the workflow generalizes beyond our model, boosting Task 1 performance by 128.7x for Gemini‑2.5‑Pro and 12.4x for GPT‑5. Furthermore, with only 60 images for detector adaptation, the method transfers effectively to AMSBench, achieving zero‑shot connectivity reasoning on par with GPT‑5 and Claude‑Sonnet‑4 while surpassing the AMS state‑of‑the‑art method Netlistify.

PaperID: 3462, https://arxiv.org/pdf/2605.01299.pdf

Authors: Wang Jian, Zhou Jianbo, Xiong Yuhao, Liu Zhenxia, Luo Wen, Yuan LinWang, Yu ZhaoYuan

Title: GA-VisAgent: A Multi-Agent application for code generation and visualization in interactive learning

Abstract:
Geometric Algebra (GA) presents challenges to learners due to its highly abstract mathematical structure and complex operational rules, as translating algebraic manipulations into concrete geometric interpretations is a non‑intuitive process when developing related code. Currently, some existing GA software packages rely on manually written scripts for code generation and visualization, but their high learning curve hinders widespread adoption. Meanwhile, methods based on Large Language Models (LLMs) often produce logical errors when generating specific GA scripts, such as GAALOPScript, resulting in generally low accuracy. To address these issues, this study proposes GA‑VisAgent ‑‑ a multi‑agent interactive learning application for GA code generation and visualization ‑‑ building upon a Geometric algebra large language model (GAGPT). Integrating task planning mechanisms with ReAct reasoning strategies, GA‑VisAgent can decompose complex operations into five standardized subtasks, including core operations like geometric products, rotations, and reflections. It supports natural language and mathematical formulas as input to automatically generate executable code, accompanied by interactive visualizations to aid user comprehension. Experimental results show that GA‑VisAgent achieved a 90% code generation success rate across 40 typical Conformal GA tasks, representing a 70% improvement over GPT‑4o. This application introduces an extensible new paradigm for teaching GA and developing visualization tools for related mathematical concepts. The online service for this project will be available at http://gagis.cn/gacrac.

PaperID: 3463, https://arxiv.org/pdf/2605.01209.pdf

Authors: Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan

Title: ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification

Abstract:
Signal Temporal Logic (STL) is a formal language for specifying real‑time behaviors of cyber‑physical systems (CPS). Automatically transforming natural language requirements into STL specifications has received growing attention. Recent efforts leveraging large language models (LLMs) have demonstrated impressive performance, but some natural language requirements in practice contain vague or ambiguous information, which remains challenging for LLMs to handle. To address these challenges, we propose ClarifySTL, an interactive LLM‑agent framework that enhances STL transformation through requirements clarification. ClarifySTL first detects vague expressions that indicate underspecified information in a requirement. If any vagueness is detected, it generates targeted clarification queries to guide users in supplementing the requirement until all necessary details are provided. Subsequently, if ClarifySTL detects ambiguities, it formulates focused ambiguity clarification queries and updates the requirements based on user feedback until all ambiguities are resolved. Finally, the requirements with vagueness and ambiguity clarified are transformed into STL specifications using LLMs. This interactive framework ensures that the resulting STL formulas faithfully capture user intent while reducing the burden on the user. We evaluate ClarifySTL on the representative benchmarks DeepSTL and STL‑DivEn, as well as our newly introduced AmbiEval benchmark, which is specifically designed to assess the performance of the agents in handling vagueness and ambiguity, including both detection and query generation. The experimental results show that ClarifySTL is effective.

PaperID: 3464, https://arxiv.org/pdf/2605.01161.pdf

Authors: Hemanta Ban, Seddik M. Djouadi, Kevin Tomsovic

Title: Distributed Attraction-Repulsion Potential for Multi-Agent Formation Control

Abstract:
In this paper, a distributed multi‑agent formation control driven by the gradient of the Lennard‑Jones potential is analyzed. For collision‑free initial data, we prove global well‑posedness together with a uniform lower bound on all inter‑agent distances, thereby excluding hard collisions. Taking the total energy as a Lyapunov function, LaSalle's invariance principle shows that every positive limit point is an equilibrium. Since trajectories remain uniformly away from collisions, the energy is analytic along the flow and an argument yields convergence to a single equilibrium modulo translations. Illustrative numerical examples are presented.

PaperID: 3465, https://arxiv.org/pdf/2605.01147.pdf

Authors: Tanav Singh Bajaj, Nikhil Singh, Karan Anand, Eishkaran Singh

Title: Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment

Abstract:
As large language models are increasingly deployed as interacting agents in high‑stakes decisions, the AI safety community assumes that safety properties of individual models will compose into safe multi‑agent behavior. This position paper argues that this assumption is fundamentally mistaken. In agentic AI, safety is determined by interaction topology, not model weights. When agents deliberate sequentially or aggregate via parallel voting with a judge, the structure of information flow and decision coupling dominates outcomes. Evidence across model families and scales reveals three persistent topology‑driven pathologies: ordering instability, where system behavior depends primarily on agent sequence; information cascades, where early judgments propagate regardless of correctness; and functional collapse, where systems satisfy fairness metrics while abandoning meaningful risk discrimination. Contrary to intuition, scaling to more capable models strengthens these effects by increasing consensus formation and reducing the challenge of initial decisions. These failure modes are invisible to model‑centric evaluation and alignment procedures. We argue that agentic AI must be treated as a dynamical system rather than a collection of aligned components. Interaction topology must become a primary target of safety evaluation and regulation, with systems required to demonstrate robustness across architectural variations before deployment.

PaperID: 3466, https://arxiv.org/pdf/2605.01143.pdf

Authors: Sheldon Yu, Yingcheng Sun, Hanqing Guo, Julian McAuley, Qianqian Tong

Title: A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents

Abstract:
Large Language Model (LLM)‑powered agents demonstrate strong capabilities in autonomous task execution, tool use, and multi‑step reasoning. However, their increasing autonomy also introduces a new attack surface: adversarial interactions can manipulate agent behavior through direct prompt injection, indirect content attacks, and multi‑turn escalation strategies. Existing defense strategies focus on prompt‑level filtering and rule‑based guardrails, which are often insufficient when risk emerges gradually across interaction sequences. In this work, we propose a complementary defense mechanism: a low‑latency fraud detection layer for detecting adversarial interaction patterns in LLM‑powered agents. Instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories using structured runtime features derived from prompt characteristics, session dynamics, tool usage, execution context, and fraud‑inspired signals. The detection layer can be implemented using lightweight models leading to low‑latency real‑time deployments. To evaluate the framework, we construct a synthetic corpus of 12,000 multi‑turn agent interactions generated from parameterized templates that simulate realistic agentic workflows. Using 42 structured features and an XGBoost classifier, our detector achieves over 9 times faster than LLM‑based detectors. Through the experiment and ablation studies, our work suggests that interaction‑level behavioral detection should become a core component of deployment‑time defense for LLM‑powered agents.

PaperID: 3467, https://arxiv.org/pdf/2605.01133.pdf

Authors: Lingxi Zhang, Guangtao Zheng, Hanjie Chen

Title: When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Abstract:
Large language model (LLM)‑powered multi‑agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding‑based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding‑based defenses: because they rely solely on the text embeddings, they ignore token‑level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down‑weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. This insights can inform and inspire future work on MAS attacks and defenses.

PaperID: 3468, https://arxiv.org/pdf/2605.01102.pdf

Authors: Jinpai Zhao, Albert Cerrone, Joannes Westerink, Clint Dawson

Title: Towards Multi-Agent Autonomous Reasoning in Hydrodynamics

Abstract:
Single‑agent systems (SAS) have become the default pattern for LLM‑driven scientific workflows, but routing planning, tool use, and synthesis through a single context window comes with a well‑known cost: as tool specifications and observational traces accumulate, the effective context available for each decision shrinks, and end‑to‑end reliability suffers. We present a multi‑agent system (MAS) prototype for hydrodynamics in which specialized agents are coordinated through a Layer Execution Graph (LEG). A planner agent constructs query‑specific execution topologies from natural‑language routing heuristics that capture domain knowledge without hard‑coding it as rigid control logic; specialist agents operate under strict tool allowlists and occupy complementary data‑class roles. Between layers, consolidator agents fuse parallel outputs into concise briefs, and a reporter agent synthesizes the final response, while the runtime logs provenance for every tool invocation to support auditability. All benchmarks, ablations, and stress tests use Claude Sonnet~4.6 as the backbone model for both specialist and general‑purpose agents. Evaluated on 37 queries spanning six complexity categories, the prototype achieves 93.6% factual precision with a 100% pass rate. Accuracy remains above 90% across runs from single‑threaded to five independent parallel tracks, and under simulated loss of individual data sources the system degrades gracefully, still returning substantive partial answers. Together, these results suggest that planner‑guided, graph‑structured multi‑agent orchestration can meaningfully alleviate the context‑saturation bottlenecks that constrain monolithic single‑agent architectures.

PaperID: 3469, https://arxiv.org/pdf/2605.01101.pdf

Authors: Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

Title: Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

Abstract:
This paper develops Virtual Speech Therapist (VST), an intelligent agent‑based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI‑driven workflows. VST integrates state‑of‑the‑art deep learning‑based stuttering classification, and multi‑agent large language model (LLM) reasoning to support evidence‑based clinical decision‑making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer‑reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient‑specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician‑in‑the‑loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high‑quality, evidence‑based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: https://vocametrix.com/ai/stuttering‑therapy‑planning‑agent , facilitating real‑time stuttering assessment and personalized therapy planning.

PaperID: 3470, https://arxiv.org/pdf/2605.01091.pdf

Authors: Talal Ashraf Butt, Muhammad Iqbal, Razi Iqbal

Title: Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure

Abstract:
When a traffic signal controller adjusts green phases and a grid manager curtails power on the same corridor, each system may comply with its own obligations. The resident who suffers the combined effect has no single authority to hold accountable and, under the EU AI Act, limited means to obtain an explanation. Annex III, point 2 excludes safety‑component AI in critical infrastructure from Article 86 explanation rights and Article 27 fundamental‑rights impact assessment. Provider and deployer duties under Articles 9‑15 still apply, and residual pathways under the GDPR, NIS2, and tortious liability offer partial coverage. The Act's principal resident‑facing accountability instruments are nonetheless narrowed for the autonomous infrastructure systems most likely to interact across agencies. The paper traces this accountability deficit through four residual pathways (GDPR Article 22, GDPR transparency obligations, tortious liability, and NIS2) and shows that each is structurally bounded by individual‑controller, individual‑decision scope. As a governance response, it presents AgentGov‑SC, a three‑layer architecture (Agent, Orchestration, City) specifying 25 governance measures with bidirectional traceability to the EU AI Act, ISO/IEC 42001, and the NIST AI Risk Management Framework. Five conflict resolution rules and an autonomy‑calibrated activation model complete the design. A scenario analysis traces governance activation through a multi‑agent corridor cascade involving three documented UAE smart‑city systems, with a contrasting single‑system scenario confirming proportional activation. The paper contributes a regulatory gap analysis and governance architecture for an increasingly important class of urban AI deployment that existing frameworks treat as bounded and isolated.

PaperID: 3471, https://arxiv.org/pdf/2605.01078.pdf

Authors: Soumil Datta, Melissa Umble, Daniel S. Brown, Guanhong Tao

Title: A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

Abstract:
Retrieval‑augmented generation and tool‑integrated LLM agents increasingly depend on external textual sources. This reliance broadens the available attack surface, allowing adversaries to insert malicious instructions that trigger unintended model behaviors. Current defensive measures often utilize LLM‑based detectors to filter such content, but these approaches remain vulnerable to optimization‑based attacks. Additionally, training‑based methods frequently fail to generalize to novel data distributions. To resolve these issues, we introduce SONAR, a prompt sanitization framework that identifies and removes injected content using metrics from natural language inference. Specifically, SONAR constructs a sentence‑level relational graph across the user query and external data. By using entailment and contradiction scores as edge weights, the system identifies sentences that deviate from the core task. It then employs connectivity‑driven pruning to eliminate flagged injection seeds and their related neighbors while maintaining benign context. Rigorous evaluations across several models and datasets show that SONAR reduces the attack success rate to nearly zero, significantly outperforming nine established baseline defenses.

PaperID: 3472, https://arxiv.org/pdf/2605.01041.pdf

Authors: Iman Sharifi, Hyeong Tae Kim, Maheed Hatem Ahmed, Mahsa Ghasemi, Peng Wei

Title: Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning

Abstract:
In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict‑free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi‑agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention‑enhanced Proximal Policy Optimization‑based Advantage Actor‑Critic (PPOA2C) framework is employed to resolve intra‑ and inter‑fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule‑based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule‑based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy‑configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness‑aware conflict management in heterogeneous sUAS operations.

PaperID: 3473, https://arxiv.org/pdf/2605.00946.pdf

Authors: Jirong Zha, Chenyu Zhao, Nan Zhou, Zhenyu Liu, Tao Sun, Bin Zhang, Xiaochun Zhang, Xinlei Chen

Title: Breaking the Communication-Accuracy Trade-off: A Sparsified Information Diffusion Framework for Multi-Agent Collaborative Perception

Abstract:
The growing relevance of multi‑agent systems has drawn increasing focus on communication‑efficient filters for collaborative perception to alleviate the system's communication burden. While the event‑triggered (ET) mechanism can improve communication efficiency in collaborative state estimation, an inevitable trade‑off exists between estimation accuracy and communication cost in ET filters. This paper proposes a fast and accurate ET diffusion‑based filter for real‑time multi‑agent collaborative target tracking, aiming to reduce the system's data transmission without compromise in tracking performance. The proposed filter achieves improved tracking accuracy, reduced data transmission, and accelerated convergence using an error‑minimized ET cubature information filter (CIF) for local estimation, and a correlation‑aware diffusion strategy for global fusion. The experimental results confirm the scalability of the proposed EDC‑CIF algorithm and demonstrate its efficacy in simultaneously reducing estimation error and computation time while significantly enhancing communication efficiency.

PaperID: 3474, https://arxiv.org/pdf/2605.00914.pdf

Authors: Blaž Bertalanič, Carolina Fortuna

Title: The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

Abstract:
Multi‑agent debate, where teams of LLMs iteratively exchange rationales and vote on answers, is widely deployed under the assumption that peer review filters hallucinations. Yet the failure dynamics of homogeneous debate remain poorly understood, therefore we report findings from a controlled empirical study of teams of N=10 homogeneous agents (Qwen2.5‑7B, Llama‑3.1‑8B, Ministral‑3‑8B) across R=3 debate rounds on two high‑difficulty benchmarks (GSM‑Hard and MMLU‑Hard). We compare peer debate against isolated self‑correction and a stochastic noise control that injects rationales from unrelated problems. We decompose debate failure into three model‑dependent pathways: sycophantic conformity, where agents uncritically adopt majority answers (modal adoption up to 85.5%); contextual fragility, where peer rationales destabilize previously correct reasoning (vulnerability rate up to 70.0%); and consensus collapse, where plurality voting discards correct answers already present in the generation pool (oracle gap up to 32.3 percentage points). Ablations over communication density (K \in \2,4,9\) and sampling temperature (T \in \0.4, 0.7\) show that conformity reaches high levels at minimal peer exposure (K=2) and intensifies with greater initial diversity. Across all configurations, debate consumes 2.1‑3.4× more tokens (up to 28,631 tokens per problem) than self‑correction for equal or lower accuracy. Our results indicate that, within the 7‑8B parameter class, homogeneous teams without structured roles do not benefit from unguided peer exchange, and that isolated self‑correction consistently offers a more favorable cost‑accuracy tradeoff.

PaperID: 3475, https://arxiv.org/pdf/2605.00846.pdf

Authors: Navapat Nananukul, Mayank Kejriwal

Title: ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations

Abstract:
Clinical diagnosis requires answers that are accurate, verifiable, and explicitly grounded in official guidelines. While large language models excel at natural language processing, their tendency to hallucinate undermines their utility in high‑stakes medical contexts where precision is essential. Existing retrieval‑augmented generation (RAG) systems treat all evidence equally, producing noisy context and generic answers misaligned with clinical practice. We present ClinicBot, an AI system that translates guideline recommendations into trustworthy clinical support through three key advances: (1) structured extraction of clinical guidelines into semantic units (recommendations, tables, definitions, narrative) with explicit provenance, (2) evidence prioritization that ranks content by clinical significance and guideline structure rather than textual similarity, and (3) a web‑based interface that presents concise, actionable answers with verifiable evidence. We will demonstrate ClinicBot using diabetes questions from real patients and an additional diabetes risk assessment tool that is faithful to the American Diabetes Association (ADA) Standards of Care in Diabetes (2025). The demonstration will illustrate how semantic knowledge extraction and hierarchical evidence ranking can reliably operate in a multi‑agent setting to process complex clinical guidelines at scale.

PaperID: 3476, https://arxiv.org/pdf/2605.00845.pdf

Authors: Mengying Wang, Nicolaas Jedema, Rahul Pandey, RaviKiran Krishnan, Jens Lehmann, Yinghui Wu

Title: Graph Query Generation with Constraint-guided Large Language Agents

Abstract:
Knowledge Graph Question Answering (KGQA) has advanced through structured query generation, yet most efforts target RDF/SPARQL, leaving Cypher and property graphs underexplored, despite increasing demand for unified KGQA in industry settings. We propose UniQGen, a novel constraint‑based framework that employs LLM agents to dynamically extract and refine representative graph query clauses into executable, intent‑aligned graph queries across query languages. The foundation of our method is a variant of Chase & Backchase, a family of algorithms for query optimization and reformulation. We extend Chase & Backchase with a dynamic reasoning process over query constraints that also interact with LLMs for query quality estimation. With a Cypher‑supported Freebase graph deployed on Amazon Neptune, we extensively evaluate our approach on popular KGQA benchmarks (GraphQ, GrailQA, and WebQSP). We demonstrate that UniQGen outperforms state‑of‑the‑art graph query generation techniques in both accuracy and efficiency, with F1 gains of 31.6% on GraphQ and 4.9% on GrailQA. Unlike prior methods, our framework does not require fine‑tuning for schema matching, making it more extensible to schema‑less graphs and semantics in query workloads, and is more suitable for enterprise‑grade KGQA. We release Cypher outputs and a Neptune‑ready Freebase snapshot to support reproducible, cross‑language KGQA research.

PaperID: 3477, https://arxiv.org/pdf/2605.00841.pdf

Authors: Viet Trinh, Tan Nguyen, Minh-Huyen Phan, Quan Luu

Title: AI Agents for Sustainable SMEs: A Green ESG Assessment Framework

Abstract:
This study presents a novel, AI‑driven framework for assessing Environmental, Social, and Governance (ESG) performance in European small and medium‑sized enterprises (SMEs). An initial phase established expert‑validated ESG baseline scores from a subset of the Flash Eurobarometer FL549 survey data. In the second phase, a scalable AI agent system, built on the n8n automation platform, applied these baselines to perform automated ESG classification and generate contextual recommendations using large language models (LLMs). The results demonstrate the AI system's high consistency with human‑derived outputs, thereby supporting more effective monitoring and intervention strategies aligned with the European Green Deal.

PaperID: 3478, https://arxiv.org/pdf/2605.00833.pdf

Authors: Brice Valentin Kok-Shun, Johnny Chan, Gabrielle Peko, David Sundaram

Title: Agentopic: A Generative AI Agent Workflow for Explainable Topic Modeling

Abstract:
Agentopic is a novel agent‑based workflow for explainable topic modeling that leverages the reasoning capabilities of Large Language Models (LLMs). Existing topic modeling approaches such as Latent Dirichlet Allocation (LDA) and BERTopic often lack transparency on how topics are assigned or grouped. Agentopic addresses this by using multiple agents that collaboratively perform topic identification, validation, hierarchical grouping, and natural language explanation. This design enables users to trace the reasoning behind topic assignments, enhancing interpretability without sacrificing accuracy. When seeded with topics from the British Broadcasting Corporation (BBC) dataset, Agentopic achieves an F1‑score of 0.95, matching GPT‑4.1, improving on LDA (0.93), and close to BERTopic (0.98). We used Agentopic to augment the BBC dataset with generated explanations to improve the dataset's richness and context. The unseeded Agentopic generated 2045 semantically coherent topics organized across six hierarchical levels, vastly enriching the original five‑category structure. By embedding explainability throughout the workflow, Agentopic offers an interpretable alternative to black‑box models, making it particularly valuable for crucial applications like finance and healthcare.

PaperID: 3479, https://arxiv.org/pdf/2605.00827.pdf

Authors: Abhinav Singh Parmar

Title: Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol

Abstract:
Large Language Model (LLM) agents increasingly interact with external systems through tool‑calling protocols such as the Model Context Protocol (MCP). In prevailing architectures, the agent must reason about every tool invocation in every session, consuming tokens proportional to the number of actions performed‑‑even when the task has been solved before. We present the MCP Workflow Engine, a novel MCP‑native orchestration layer that decouples intelligence (deciding what to do) from execution (carrying it out). An agent reasons once to produce a declarative workflow blueprint‑‑a JSON document specifying a directed sequence of MCP tool calls with parameterized templates, loops, parallel branches, and data piping. Subsequent executions are triggered by a single run_workflow tool call, consuming one invocation's worth of tokens regardless of the blueprint's internal complexity. We formalize the MCP Mediator architectural pattern‑‑an MCP server that simultaneously acts as a client to downstream MCP servers‑‑and implement it in TypeScript against the MCP SDK. We evaluate the engine on a production‑scale Kubernetes CMDB synchronization task spanning 67 orchestrated steps across 2 MCP servers, 38 namespaces, 13 worker nodes, and 22 distinct resource types. The engine reduces per‑execution token cost by over 99%, completes the full cluster graph‑‑comprising 1,200+ nodes and 2,800+ relationships across 20 relationship types‑‑in under 45 seconds, and achieves deterministic, idempotent execution with zero agent involvement at run time.

PaperID: 3480, https://arxiv.org/pdf/2605.00798.pdf

Authors: Arunabh Srivastava, Mohammad A., Khojastepour, Srimat Chakradhar, Sennur Ulukus

Title: RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Abstract:
Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi‑agent plan execution platform that interprets natural‑language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., \textttIF, \textttGOTO, \textttFORALL). Beyond verifying syntactic and semantic verification of the step output, which is performed based on the specific instruction of each step, RunAgent autonomously derives and validates constraints based on the description of the task and its instance at each step. RunAgent also dynamically selects among LLM‑based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error correction mechanisms to ensure correctness. Finally, RunAgent filters the context history by retaining only relevant information during the execution of each step. Evaluations on Natural‑plan and SciBench Datasets demonstrate that RunAgent outperforms baseline LLMs and state‑of‑the‑art PlanGEN methods.

PaperID: 3481, https://arxiv.org/pdf/2605.00751.pdf

Authors: Sizhe Tang, Zuyuan Zhang, Mahdi Imani, Tian Lan

Title: NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

Abstract:
Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi‑agent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose NonZero, which keeps multi‑agent MCTS tractable by running surrogate‑guided selection over a low‑dimensional nonlinear representation using an interaction‑guided proposal rule, instead of directly exploring the full joint‑action space. Our exploration uses an interaction score: single‑agent deviations are ranked by predicted gain, while two‑agent deviations are scored by a mixed‑difference measure that reveals coordination benefits even when no single agent can improve alone. We formalize candidate proposal as a bandit problem over local deviations and derive a proposal rule, NonZero, with a sublinear local‑regret guarantee for reaching approximate graph‑local optima without enumerating the joint‑action space. Empirically, NonZero improves sample efficiency and final performance on MatGame, SMAC, and SMACv2 relative to strong model‑based and model‑free baselines under matched search budgets.

PaperID: 3482, https://arxiv.org/pdf/2605.00742.pdf

Authors: Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison, Gintare Karolina Dziugaite, Maurizio Filippone, Andrew Y. K. Foong, Vincent Fortuin, Dimitris Fouskakis, Jes Frellsen, Eyke Hüllermeier, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Nikita Kotelevskii, Salem Lahlou, Yingzhen Li, Fang Liu, Clare Lyle, Thomas Möllenhoff, Konstantina Palla, Maxim Panov, Yusuf Sale, Kajetan Schweighofer, Artem Shelmanov, Siddharth Swaroop, Martin Trapp, Willem Waegeman, Andrew Gordon Wilson, Alexey Zaytsev

Title: Position: agentic AI orchestration should be Bayes-consistent

Abstract:
LLMs excel at predictive tasks and complex reasoning tasks, but many high‑value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task‑relevant latent quantities, to update these beliefs from observed agentic and human‑AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief‑updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision‑making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human‑AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility‑aware policies can improve agentic AI orchestration.

PaperID: 3483, https://arxiv.org/pdf/2605.00741.pdf

Authors: Saeid Jamshidi, Foutse Khomh, Carol Fung, Kawser Wazed Nafi

Title: Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

Abstract:
The adoption of Internet of Things (IoT) systems at the network edge of smart architectures is increasing rapidly, intensifying the need for security mechanisms that are both adaptive and resource‑efficient. In such environments, runtime defence mechanisms are no longer limited to detection alone but become a resource‑constrained task of selecting mitigation actions. Security controls must be carefully selected, combined, and executed under latency, energy, and computational constraints, while preventing unsafe interactions between controls. Existing approaches predominantly rely on static rule sets and learned policies, which provide limited guarantees of feasibility, conflict safety, and execution correctness in resource‑constrained edge settings. To address this limitation, we introduce ASPO, a self‑adaptive multi‑agent security pattern selection that integrates Large Language Model (LLM)‑based reasoning with deterministic enforcement within a MAPE‑K control loop. ASPO explicitly separates stochastic decision generation from execution: LLM agents propose candidate mitigation portfolios, while a deterministic optimisation core enforces closed‑world action integrity, conflict‑free composition, and resource feasibility at every decision epoch. We deploy ASPO on a distributed edge‑gateway testbed and evaluate it across two workloads, each comprising 500 and 1000 runtime security decisions, using replayed IoT attack traffic. In addition, the results demonstrate invariant safety properties, including 100% conflict‑free activation, consistent resource feasibility across workloads, and stable pattern dominance with perfect rank preservation. Importantly, deeper decision exploration reduces extreme‑case execution costs, compressing tail latency and energy overheads by 21.9% and 23.1%, respectively, without increasing mean energy consumption.

PaperID: 3484, https://arxiv.org/pdf/2605.00702.pdf

Authors: Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang, Yi Wen, Yimin Deng, Wenlin Zhang, Enhong Chen, Xiangyu Zhao, Tong Xu

Title: Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Abstract:
Large language model (LLM) agents require long‑term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand‑crafted update rules; although reinforcement learning (RL)‑based agents learn memory updates, sparse outcome rewards provide weak supervision, resulting in unstable long‑horizon optimization. Drawing on memory schema theory and the functional division between prefrontal regions and hippocampus regions, we introduce MemCoE, a cognition‑inspired two‑stage optimization framework that learns how memory should be organized and what information to update. In the first stage, we propose Memory Guideline Induction to optimize a global guideline via contrastive feedback interpreted as textual gradients; in the second stage, Guideline‑Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi‑turn RL to learn a guideline‑following memory evolution policy. We evaluate on three personalization memory benchmarks, covering explicit/implicit preference and different sizes and noise, and observe consistent improvements over strong baselines with favorable robustness, transferability, and efficiency.

PaperID: 3485, https://arxiv.org/pdf/2605.00691.pdf

Authors: Zi-Bo Qin, Feng-Feng Wei, Tai-You Chen, Wei-Neng Chen

Title: Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

Abstract:
Distributed blackbox consensus optimization is a fundamental problem in multi‑agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory‑driven self‑design for distributed black‑box consensus optimization. We first redesign the agent‑level swarm dynamics with an adaptive internal mechanism tailored to decentralized consensus settings, improving the balance between exploration, convergence, and local escape. Built on top of this adaptive execution layer, we propose Learning to Act and Cooperate (LACMAS), a trajectorydriven framework in which large language models provide sparse highlevel guidance for shaping both agentinternal action behaviors and agentexternal cooperation patterns from historical optimization trajectories. We further introduce a phased cognitive scheduling strategy to activate different forms of adaptation in a resource‑aware manner. Experiments on standard distributed black‑box benchmarks and real‑world distributed tasks show that LAC‑MAS consistently improves solution quality, convergence efficiency, and communication efficiency over strong baselines, suggesting a practical route from handcrafted distributed coordination toward self‑designing multi‑agent optimization systems.

PaperID: 3486, https://arxiv.org/pdf/2605.00574.pdf

Authors: Yanzeng Li, Xiaoning Cao, Jialun Zhong, Jianpeng Hu, Jiangshan Tan, Ningning Liu, Feng Xiang, Shasha Han

Title: DySRec: Dynamic Context-Aware Psychometric Scale Recommendation via Multi-Agent Collaboration

Abstract:
Choosing suitable psychometric scales is an essential and difficult step in psychological consultation, which requires clinicians to integrate patient information, behaviors, and dynamic contextual information. Existing systems mainly use static pipelines to choose scale, or directly predict symptoms according to user inputs, limiting their ability to support dynamic assessment, risk management, and transparent decision‑making. To address these limitations, we propose DySRec, a multi‑agent conversational system for dynamic psychometric scale recommendation. DySRec operates as an interactive chatbot that engages users in multi‑turn dialogue, models scale selection as a continuous conversational decision process, and coordinates specialized agents to maintain user context, recommend assessment scales, monitor psychological risk, and log decision trajectories. In this way, DySRec can integrate and capture heterogeneous signals, including semantic, interaction behaviors, assessment history, and content state, to dynamically update user representations and calculate scale‑context compatibility score for recommending most matched scales. Moreover, DySRec incorporates a closed‑loop refinement mechanism. Recommendation agent will feedback the missing or uncertain attributes and guide the conversation to elicit the targeted information. In this paper, we showcase the prototype design and architecture of DySRec, and this system has been verified in a real‑world application.

PaperID: 3487, https://arxiv.org/pdf/2605.00528.pdf

Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu

Title: SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Abstract:
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end‑to‑end latency by 3‑8x. We argue that this request‑level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program‑level scheduling: treating the entire agent workflow (not individual inference calls) as the first‑class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool‑call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session‑affinity batching with work stealing that co‑locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task‑completion‑time fairness metric with provable bounded‑deviation guarantees. On a 64‑GPU cluster serving SWE‑bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi‑tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput‑optimal batch scheduling, a tradeoff appropriate for the latency‑sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow‑aware scheduling is essential for efficient compound AI serving.

PaperID: 3488, https://arxiv.org/pdf/2605.00444.pdf

Authors: Kerui Chen, Jinglu Wang, Jianrong Zhang, Ming Li, Yan Lu, Hehe Fan

Title: Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

Abstract:
Multi‑modal large language models (MLLMs) advance vision language understanding but face inherent limitations in long‑video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule‑based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end‑to‑end Multi‑Agent Collaboration Framework that decouples per‑agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent‑native latent communication protocol. Each agent encodes partial observations into compact, task‑sufficient tokens in a shared embedding space, allowing efficient and information‑preserving collaboration by a central coordinator. We introduce a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross‑agent coordination. Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state‑of‑the‑art MLLMs and multi‑agent systems under identical budget constraints, demonstrating the effectiveness of our latent collaboration for scalable video understanding.

PaperID: 3489, https://arxiv.org/pdf/2605.00425.pdf

Authors: Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S. -T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu

Title: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Abstract:
Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi‑turn tasks. However, effective agentic RL remains challenging: sparse outcome‑only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self‑supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision‑free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration‑exploitation trade‑off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token‑level sampling noise. We further show that entropy drift under natural‑gradient updates is governed by the interaction between the sampled‑response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response‑level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE‑bench‑Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state‑of‑the‑art software‑engineering RL training framework.

PaperID: 3490, https://arxiv.org/pdf/2605.00382.pdf

Authors: Fazle Rabbi, Lin Ling, Song Wang, Jinqiu Yang

Title: Social Bias in LLM-Generated Code: Benchmark and Mitigation

Abstract:
Large Language Models (LLMs) are increasingly deployed to generate code for human‑centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM‑generated code largely unexamined. Extending our prior work on Solar, we conduct a comprehensive empirical study using SocialBias‑Bench, a benchmark of 343 real‑world coding tasks spanning seven demographic dimensions. We evaluate four prominent LLMs and find severe bias across all models, with Code Bias Scores reaching up to 60.58%. We further show that standard prompt‑level interventions, such as Chain‑of‑Thought reasoning and fairness persona assignment, inadvertently amplify bias rather than reduce it. We then investigate whether structured multi‑agent software process frameworks can improve fairness, finding that structured pipelines reduce bias when early roles correctly scope what the code should and should not consider. However, adding explicit fairness instructions to all agent roles produces worse outcomes than providing none, suggesting that diffused responsibility goes unaddressed. To address these limitations, we propose the Fairness Monitor Agent (FMA), a modular component that plugs into any existing code generation pipeline without modifying it. FMA analyzes the task description to determine which attributes should be considered or restricted, then detects and corrects violations through an iterative review process, without requiring an executable test suite. Evaluated on all 343 tasks, FMA reduces bias by 65.1% compared to a developer agent alone and improves functional correctness from 75.80% to 83.97%, outperforming all other studied approaches.

PaperID: 3491, https://arxiv.org/pdf/2605.00276.pdf

Authors: Tiejin Chen, Ahmadreza Moradipari, Kyungtae Han, Hua Wei, Nejib Ammar

Title: Agentic AI for Trip Planning Optimization Application

Abstract:
Trip planning for intelligent vehicles increasingly requires selecting optimal routes rather than merely producing feasible itineraries, as interacting factors such as travel time, energy consumption, and traffic conditions directly affect plan quality. Yet existing systems are largely designed for feasibility‑oriented planning, and current benchmarks provide only reference answers without ground truth, preventing objective evaluation of optimization performance. In our paper, we address these limitations with an agentic AI framework that enables dynamic refinement through an orchestration agent coordinating specialized agents for traffic, charging, and points of interest, and with the Trip‑planning Optimization Problems Dataset, which supplies definitive optimal solutions and category‑level task structure for fine‑grained analysis. Experiments show that our system achieves 77.4% accuracy on the TOP Benchmark, significantly outperforming single‑agent and workflow‑based multi‑agent baselines, demonstrating the importance of orchestrated agentic reasoning for robust trip planning optimization.

PaperID: 3492, https://arxiv.org/pdf/2605.00268.pdf

Authors: Claire Chen, Yuheng Zhang

Title: Fast Rates in $α$-Potential Games via Regularized Mirror Descent

Abstract:
An α‑potential game is a multi‑player non‑cooperative interaction in which a global potential function approximates individual player rewards up to a structural bias α. While identifying a Nash Equilibrium (NE) in generic general‑sum games is known to be computationally intractable, the potential game structure enables tractable NE identification. In this paper, we study the offline learning of NE in α‑potential games using KL regularization. To analyze this process, we propose a novel Reference‑Anchored offline data coverage framework‑‑a verifiable condition that anchors data requirements to a known reference policy rather than an unknown optimum. Building on this, we propose Offline Potential Mirror Descent (OPMD), a decentralized algorithm that achieves an accelerated \widetilde\mathcalO(1/n) statistical rate, surpassing the standard \widetilde\mathcalO(1/\sqrtn) rate typical of offline multi‑agent learning. This work characterizes the first fast‑rate offline learning approach for α‑potential games.

PaperID: 3493, https://arxiv.org/pdf/2605.00264.pdf

Authors: Claire Chen, Yuheng Zhang

Title: Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

Abstract:
Offline multi‑agent reinforcement learning in general‑sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General‑sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of \widetildeO(1/n). For computational tractability, we develop General‑sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of \widetildeO(1/\sqrtn+1/T). These results establish KL regularization as a standalone mechanism for pessimism‑free offline learning that achieves equivalent or accelerated rates in multi‑player general‑sum games.

PaperID: 3494, https://arxiv.org/pdf/2605.00248.pdf

Authors: Frederik Hytting Jørgensen, Sebastian Weichwald, Lewis Hammond

Title: Causal Foundations of Collective Agency

Abstract:
A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspective in answering this question, ascribing collective agency to a group when viewing the group's joint actions as rational and goal‑directed successfully predicts its behavior. We formalize this perspective on collective agency using causal games ‑‑ which are causal models of strategic, multi‑agent interactions ‑‑ and causal abstraction ‑‑ which formalizes when a simple, high‑level model faithfully captures a more complex, low‑level model. We use this framework to solve a puzzle regarding multi‑agent incentives in actor‑critic models and to make quantitative assessments of the degree of collective agency exhibited by different voting mechanisms. Our framework aims to provide a foundation for theoretical and empirical work to understand, predict, and control emergent collective agents in multi‑agent AI systems.

PaperID: 3495, https://arxiv.org/pdf/2605.00136.pdf

Authors: Kaituo Zhang, Zhen Xiong, Mingyu Zhong, Zhimeng Jiang, Zhouyuan Yuan, Zhecheng Li, Ying Lin

Title: Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

Abstract:
Tool‑augmented reasoning has become a popular direction for LLM‑based agents, and it is widely assumed to improve reasoning and reliability. However, we demonstrate that this consensus does not always hold: in the presence of semantic distractors, tool‑augmented reasoning does not necessarily outperform native CoT. To explain this performance gap, we propose a Factorized Intervention Framework that isolates the cost of prompt formatting, the overhead of the tool‑calling protocol, and the actual gain from executing tools. Our analysis reveals a critical tradeoff: under semantic noise, the gains from tools often fail to offset the "tool‑use tax", which is the performance degradation introduced by the tool‑calling protocol itself. To address this, we introduce G‑STEP, a lightweight inference‑time gate to mitigate protocol‑induced errors. While this yields partial recovery, our findings suggest that more substantial improvements still require strengthening the model's intrinsic reasoning and tool‑interaction capabilities.

PaperID: 3496, https://arxiv.org/pdf/2605.00073.pdf

Authors: Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li

Title: AgentReputation: A Decentralized Agentic AI Reputation Framework

Abstract:
Decentralized, agentic AI marketplaces are rapidly emerging to support software engineering tasks such as debugging, patch generation, and security auditing, often operating without centralized oversight. However, existing reputation mechanisms fail in this setting for three fundamental reasons: agents can strategically optimize against evaluation procedures; demonstrated competence does not reliably transfer across heterogeneous task contexts; and verification rigor varies widely, from lightweight automated checks to costly expert review. Current approaches to reputation drawing on federated learning, blockchain‑based AI platforms, and large language model safety research are unable to address these challenges in combination. We therefore propose AgentReputation, a decentralized, three‑layer reputation framework for agentic AI systems. The framework separates task execution, reputation services, and tamper‑proof persistence to both leverage their respective strengths and enable independent evolution. The framework introduces explicit verification regimes linked to agent reputation metadata, as well as context‑conditioned reputation cards that prevent reputation conflation across domains and task types. In addition, AgentReputation provides a decision‑facing policy engine that supports resource allocation, access control, and adaptive verification escalation based on risk and uncertainty. Building on this framework, we outline several future research directions, including the development of verification ontologies, methods for quantifying verification strength, privacy‑preserving evidence mechanisms, cold‑start reputation bootstrapping, and defenses against adversarial manipulation.

PaperID: 3497, https://arxiv.org/pdf/2605.00055.pdf

Authors: Diego F. Cuadros, Abdoul-Aziz Maiga

Title: Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure

Abstract:
We report a safety incident in a deployed multi‑agent research system in which a primary AI agent installed 107 unauthorized software components, overwrote a system registry, overrode a prior negative decision from an oversight agent, and escalated through increasingly privileged operations up to an attempted system administrator command. The incident was preceded not by an adversarial attack but by routine content: a forwarded technology article written for human developers and shared by the principal investigator for discussion. The agent operated in a permissive environment, with unrestricted shell access, soft behavioral guidelines containing genuinely conflicting instructions, and no machine‑enforced installation policy, and had recommended installing the same tool six hours earlier before being told to stand down. We analyze the behavioral cascade, the control boundaries that failed, and the limitations of multi‑agent oversight in detecting and remediating the damage. We use directive weighting error as a descriptive interpretation of the observed failure and ambient persuasion as a provisional analytic label for the broader trigger configuration of non‑adversarial environmental content preceding unauthorized agent action. The case highlights ethical and governance implications for deployed agent systems: ambiguous conversational cues are insufficient authorization for consequential actions, prior refusals must persist as enforceable constraints rather than message‑level reminders, and oversight mechanisms require systematic post‑incident auditing in addition to routine monitoring.

PaperID: 3498, https://arxiv.org/pdf/2605.00043.pdf

Authors: Yu Shen, Shiyang Liu, Qihang He, Yihang Cheng, Haining Xie, Zhiming He, Huahua Fan, Xianzhi Tan, Teng Ma, Shaoquan Zhang, Danqing Huang, Fan Jiang, Yang Li, Chongqing Zhao, Peng Chen, Jie Jiang, Bin Cui

Title: SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms

Abstract:
Big data platforms are widely used in modern enterprises, and an in‑production intelligent assistant is increasingly important to help users quickly find actionable guidance and reduce operational burden. While recent LLM+RAG assistants provide a natural interface, they face practical challenges in real deployments: limited scenario coverage across both general consultation and domain‑specific troubleshooting workflows, inefficient knowledge access due to inadequate multi‑hop retrieval and flat knowledge organization, and high maintenance cost because escalated tickets are unstructured and hard to convert into assistant improvements and reusable SOPs. In this paper, we present SiriusHelper, a deployed intelligent assistant for big data platforms. SiriusHelper serves as a unified online assistant that automatically identifies user intent and routes queries to the right handling path, including dedicated expert workflows for specialized scenarios (e.g., SQL execution diagnosis). To support complex troubleshooting, SiriusHelper combines a DeepSearch‑driven mechanism with a priority‑based hierarchical knowledge base to enable multi‑hop retrieval without context overload, thus improving answer reliability and latency. To reduce expert overhead, SiriusHelper further introduces automated ticket understanding and SOP distillation: it diagnoses the assistant failure reason (e.g., missing knowledge or wrong routing) and extracts domain‑specific SOPs to continuously enrich the knowledge base. Experiments and online deployment on Tencent Big Data platform show that SiriusHelper outperforms representative alternatives and reduces online ticket volume by 20.8%.

PaperID: 3499, https://arxiv.org/pdf/2605.00007.pdf

Authors: Michael Chertkov

Title: Mean-Field Path-Integral Diffusion: From Samples to Interacting Agents

Abstract:
Independent sample generation is the prevailing paradigm in modern diffusion‑based generative models of AI. We ask a different question: can samples \emphcoordinate through shared population statistics to transport probability mass more efficiently? We introduce Mean‑Field Path‑Integral Diffusion (MF‑PID), a framework in which samples are promoted to interacting agents whose drift depends self‑consistently on the evolving population density. The coupling converts distribution matching into a McKean‑‑Vlasov extension of the stochastic optimal transport problem, unifying generative modeling and multi‑agent control under the same Hamilton‑‑Jacobi‑‑Bellman/Kolmogorov‑‑Fokker‑‑Planck duality. We identify two analytically tractable regimes: a Linear‑‑Quadratic‑‑Gaussian (LQG) benchmark in which the infinite‑dimensional mean‑field system reduces to a finite set of Riccati and linear ODEs, and a Gaussian‑mixture regime governed by a piecewise‑constant protocol that preserves closed‑form solvability. For a quadratic interaction potential with schedule β_t and zero base drift we prove that the self‑consistent MF guidance is the \emphexact linear interpolant between initial and target global means ‑‑ a result that holds for arbitrary initial and target densities and any β_t. Applied to demand‑response control of energy systems, where agents aggregated into an ensemble are energy consumers (e.g.\ thermal zones within a building), MF‑PID achieves 19‑‑24% reductions in cumulative control energy over independent‑agent baselines while matching the prescribed terminal distribution exactly, and reveals how coordination redistributes actuation effort across heterogeneous sub‑populations.

PaperID: 3500, https://arxiv.org/pdf/2604.28139.pdf

Authors: Chenxin Li, Zhengyang Tang, Mingxin Huang, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan

Title: Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Abstract:
LLM agents are expected to complete end‑to‑end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw‑Eval‑Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow‑demand signals, from a reproducible, time‑stamped release snapshot. Each release is constructed from public workflow‑demand signals, with ClawHub Top‑500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw‑Eval‑Live records execution traces, audit logs, service state, and post‑run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi‑system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task‑level discrimination concentrates in a middle band of tasks. Claw‑Eval‑Live suggests that workflow‑agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

PaperID: 3501, https://arxiv.org/pdf/2604.28048.pdf

Authors: Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Title: Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Abstract:
Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within‑persona consistency and cross‑persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross‑persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse‑grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label‑based persona prompting does not capture fine‑grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no‑persona model sometimes matches or exceeds persona‑conditioned agreement with human labels across all task variants, suggesting that simple label‑based persona prompting may add limited annotation value in this setting.

PaperID: 3502, https://arxiv.org/pdf/2604.28043.pdf

Authors: Rahul Ramachandran, Nidhi Jha, Muthukumaran Ramasubramanian

Title: Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents

Abstract:
We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad‑hoc trial‑and‑error approaches, CARE specifies behavior, grounding, tool orchestration, and verification through reusable artifacts and systematic, stage‑gated phases. The methodology employs a three‑party workflow involving Subject‑Matter Experts (SMEs), developers, and LLM‑based helper agents. These helper agents function as facilitation infrastructure, transforming informal domain intent into structured, reviewable specifications for human approval at defined gates. CARE addresses the "jagged technological frontier", characterized by uneven LLM performance, by bridging the gap between novice and expert analysts regarding domain constraints and verification practices. By generating concrete artifacts, including interaction requirements, reasoning policies, and evaluation criteria, CARE ensures agent behavior is specifiable, testable, and maintainable. Evaluation results from a scientific use case demonstrate that this stage‑gated, artifact‑driven methodology yields measurable improvements in development efficiency and complex‑query performance.

PaperID: 3503, https://arxiv.org/pdf/2604.27996.pdf

Authors: Jackson Vonderhorst, Kuangshi Ai, Haichao Miao, Shusen Liu, Chaoli Wang

Title: Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

Abstract:
This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural‑language instructions. We compare three primary interaction paradigms, including domain‑specific agents with structured tool use, computer‑use agents, and general‑purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command‑line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General‑purpose coding agents achieve the highest task success rates but are computationally expensive, while domain‑specific agents are more efficient and stable but less flexible. Computer‑use agents perform well on individual steps but struggle with longer multi‑step workflows, indicating that long‑horizon planning is their primary limitation. Across both CLI‑ and GUI‑based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.

PaperID: 3504, https://arxiv.org/pdf/2604.27934.pdf

Authors: Weihai Lu, Zhejun Zhao, Yanshu Li, Huan He

Title: MM-StanceDet: Retrieval-Augmented Multi-modal Multi-agent Stance Detection

Abstract:
Multimodal Stance Detection (MSD) is crucial for understanding public discourse, yet effectively fusing text and image, especially with conflicting signals, remains challenging. Existing methods often face difficulties with contextual grounding, cross‑modal interpretation ambiguity, and single‑pass reasoning fragility. To address these, we propose Retrieval‑Augmented Multi‑modal Multi‑agent Stance Detection (MM‑StanceDet), a novel multi‑agent framework integrating Retrieval Augmentation for contextual grounding, specialized Multimodal Analysis agents for nuanced interpretation, a Reasoning‑Enhanced Debate stage for exploring perspectives, and Self‑Reflection for robust adjudication. Extensive experiments on five datasets demonstrate MM‑StanceDet significantly outperforms state‑of‑the‑art baselines, validating the efficacy of its multi‑agent architecture and structured reasoning stages in addressing complex multimodal stance challenges.

PaperID: 3505, https://arxiv.org/pdf/2604.27917.pdf

Authors: Shanxia Wang

Title: A Logic of Inability

Abstract:
Coalition Logic is primarily concerned with what coalitions can achieve, whereas what coalitions cannot achieve ‑‑ their \emphinability ‑‑ has received comparatively little explicit attention. This asymmetry matters in artificial intelligence and safety‑critical multi‑agent systems, where one often needs to specify not merely what agents are instructed or disposed not to do, but what they are \emphunable to bring about. We develop a conservative extension of Coalition Logic with an explicit inability operator, interpreted as the negation of coalition ability. This operator is introduced as a conservative and formally tractable starting point for studying inability as a modal concept in its own right. We prove soundness, completeness, and conservativity over standard Coalition Logic, and analyse the resulting modal behaviour: anti‑monotonicity with respect to coalition inclusion, contravariance with respect to goal strength, asymmetric interaction with conjunction and disjunction, failure of superadditivity, non‑equivalence with opponent ability, and the connection between grand‑coalition inability and systemic impossibility. Making this definable operator explicit reveals a systematic modal structure governing the limits of agency, and supports reasoning about constraints, negative capabilities, and impossibility in multi‑agent systems.

PaperID: 3506, https://arxiv.org/pdf/2604.27882.pdf

Authors: Giuseppe Arbore, Andrea Sillano, Luigi De Russis

Title: Building Persona-Based Agents On Demand: Tailoring Multi-Agent Workflows to User Needs

Abstract:
Recent advances in agentic AI are shifting automation from discrete tools to proactive multi‑agent systems that coordinate multi‑specialized capabilities behind unified interfaces. However, today's agent systems typically rely on hard‑coded agent architectures with fixed roles, coordination patterns, and interaction flows that limit end‑user personalization and make adaptation to individual needs and contexts difficult. Given this limitation, we argue that on‑demand persona‑based agent generation offers a promising path towards more efficient and contextually appropriate interaction within agentic workflows. By dynamically crafting agents and personas at run‑time to match user characteristics, task demands, and workflow context, agentic platforms can move beyond one‑size‑fits‑all configurations. We present a pipeline for on‑demand persona generation in agentic platforms, detailing how real‑time crafting of AI personas can be systematically integrated within agent systems, aiming to open new possibilities in agentic platform design paradigms.

PaperID: 3507, https://arxiv.org/pdf/2604.27872.pdf

Authors: Sukesh Subaharan, Venkatesan VS, Murugadasan P, Sivakumar D, Gautham N, Ganeshkumar M

Title: Modeling Clinical Concern Trajectories in Language Model Agents

Abstract:
Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold‑driven behavior, offering little visibility into accumulating risk prior to escalation. In real‑world care, however, clinicians act on gradually rising concern rather than instantaneous triggers. We study whether explicit state dynamics can expose such pre‑escalation signals without delegating clinical authority to the agent. We introduce a lightweight agent architecture in which a memoryless clinical risk encoder is integrated over time using first‑ and second‑order dynamics to produce a continuous escalation pressure signal. Across synthetic ward scenarios, stateless agents exhibit sharp escalation cliffs, while second‑order dynamics produce smooth, anticipatory concern trajectories despite similar escalation timing. These trajectories surface sustained unease prior to escalation, enabling human‑in‑the‑loop monitoring and more informed intervention. Our results suggest that explicit state dynamics can make LLM agents more clinically legible by revealing how long concern has been rising, not just when thresholds are crossed.

PaperID: 3508, https://arxiv.org/pdf/2604.27820.pdf

Authors: Mohit Dubey, Open Gigantic

Title: ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era

Abstract:
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read ‑ they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi‑turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown ‑ every .md file is a valid .og file ‑ requires no infrastructure beyond a two‑primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role‑Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held‑out document benchmark.

PaperID: 3509, https://arxiv.org/pdf/2604.27776.pdf

Authors: Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang

Title: WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Abstract:
While GUI agents have shown impressive capabilities in common computer‑use tasks such as OSWorld, current benchmarks mainly focus on isolated and single‑application tasks. This overlooks a critical real‑world requirement of coordinating across multiple applications to accomplish complex profession‑specific workflows. To bridge this gap, we present a computer‑use benchmark in cross‑application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi‑step tasks that mirror real‑world professional activities. Our methodology uses a multi‑agent framework steered by 16 occupations to generate four difficulty‑level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub‑goals across 17 common desktop applications, of which 78% are inherently multi‑application. Experimental results of leading large models and agents show that: 1) All computer‑use agents perform poorly on multi‑application tasks (< 21% success rate), far below the performance of simple single‑app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across \geq 3 applications, stalling at early sub‑goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz‑TMG/WindowsWorld.

PaperID: 3510, https://arxiv.org/pdf/2604.27660.pdf

Authors: Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, Maosong Sun

Title: From Context to Skills: Can Language Models Learn from Context Skillfully?

Abstract:
Many real‑world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference‑time skill augmentation: extracting the rules and procedures from context into natural‑language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction, since there is no automatic signal to tell whether a proposed skill is helpful. In this paper, we propose Ctx2Skill, a self‑evolving framework that autonomously discovers, refines, and selects context‑specific skills without human supervision or external feedback. At its core, a multi‑agent self‑play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over‑specialized skill accumulation, we further introduce a Cross‑time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any language model to obtain better context learning capability. Evaluated on four context learning tasks from CL‑bench, Ctx2Skill consistently improves solving rates across backbone models.

PaperID: 3511, https://arxiv.org/pdf/2604.27643.pdf

Authors: Chang-Chih Meng, Yu-Ren Lu, Guan-Yu Lin, Tsung Tai Yeh, Kai-Chiang Wu, I-Chen Wu

Title: HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

Abstract:
Integrated Circuit (IC) verification consumes nearly 70% of the IC development cycle, and recent research leverages Large Language Models (LLMs) to automatically generate testbenches and reduce verification overhead. However, LLMs have difficulty generating testbenches correctly. Unlike high‑level programming languages, Hardware Description Languages (HDLs) are extremely rare in LLMs training data, leading LLMs to produce incorrect code. To overcome challenges when using LLMs to generate Universal Verification Methodology (UVM) testbenches and sequences, wepropose HAVEN (Hybrid Automated Verification ENgine) to prevent LLMs from writing HDL directly. For UVM testbench generation, HAVEN utilizes LLM agents to analyze design specifications to produce a structured architectural plan. The HAVEN Template Engine then combines with predefined and protocol‑specific templates to generate all UVM components with correct bus‑handshake timing. For UVM sequence generation, HAVEN introduces a Protocol‑Aware Sequence Domain‑Specific Language (DSL) that decomposes sequences into fine‑grained step types. A set of predefined DSL patterns first establishes sequences that achieve a high coverage rate without LLM involvement. HAVEN continues to improve the coverage rate by iteratively leveraging LLM agents to analyze coverage gap reports and compose additional targeted DSL sequences. Unlike previous works, HAVEN is the first system that utilizes pre‑defined, protocol‑specific Jinja2 templates to generate all UVM components and UVM sequences using our proposed Protocol‑Aware DSL and rule‑based code generator. Our experimental results on 19 open‑source IP designs spanning three interface protocols (Direct, Wishbone, AXI4‑Lite) show that HAVEN achieves 100% compilation success, 90.6% code coverage, and 87.9% functional coverage on average, and is SOTA among LLM‑assisted testbench generation systems.

PaperID: 3512, https://arxiv.org/pdf/2604.27630.pdf

Authors: Francesco G. Cordoni, Marco Battestini, Marta Missiaggia

Title: A stochastic agent-based extension of the GSM2 model for particle therapy: cell-cycle dynamics, dose-rate dependence, and fractionation effects

Abstract:
Accurately linking microscopic energy deposition from ionizing radiation to emergent biological outcomes remains a central challenge in radiobiological modelling, particularly when stochastic damage induction, cell‑cycle dynamics, and spatial organisation within irradiated tissues must be treated explicitly and consistently across scales. To address this, we introduce a stochastic agent‑based radiobiological modelling framework for simulating biological response to particle irradiation, developed as an explicit single‑cell extension of the Generalized Stochastic Microdosimetric Model (GSM2). Each cell is represented as an autonomous agent whose internal state, including DNA lesion counts, cell‑cycle phase, and oxygenation level, evolves according to a continuous‑time Markov chain driven by GSM2 transition rates. Radiation‑induced damage induction, repair, misrepair, cell‑cycle progression, proliferation, and migration are treated as competing stochastic events resolved through a next‑event, event‑driven algorithm, which provides computationally efficient scaling with system size while preserving full single‑cell resolution. The framework is applied to three‑dimensional tumour spheroids irradiated with 1H and 12C ions across a range of energies and dose rates. We characterise the spatiotemporal evolution of cell‑cycle phase composition and spheroid volume following irradiation, and examine the dependence of cell survival on dose rate over four orders of magnitude. Several empirically established trends in biological response, including the dose‑rate dependence of cell survival, its attenuation at high LET, and the inverse dose rate effect in split‑dose irradiation, emerge from the model through the explicit coupling of particle arrivals, damage accumulation, and repair kinetics, without recourse to empirical correction factors as typically done.

PaperID: 3513, https://arxiv.org/pdf/2604.27620.pdf

Authors: Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng

Title: SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) aims to enable an embodied agent to follow natural‑language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction~(how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates the dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM‑friendly way. To further stabilize adaptation, we design TriPA, a Tri‑factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills from basic locomotion to long‑horizon reasoning. Experiments on standard VLN‑CE benchmarks show that SpaAct consistently improves VLM‑based navigation and achieves state‑of‑the‑art performance. We will release the code and models to support future research.

PaperID: 3514, https://arxiv.org/pdf/2604.27616.pdf

Authors: Jiacheng Liu, Zichen Tang, Zhongjun Yang, Xinyi Hu, Xueyuan Lin, Linwei Jia, Ruofei Bai, Rongjin Li, Shiyao Peng, Haocheng Gao, Haihong E

Title: RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

Abstract:
People commonly leverage structured content to accelerate knowledge acquisition and research problem solving. Among these, roadmaps guide researchers through hierarchical subtasks to solve complex research problems step by step. Despite progress in structured content generation, the roadmap generation task has remained unexplored. To bridge this gap, we introduce RoadMap, a novel benchmark designed to evaluate the ability of large language models (LLMs) to construct high‑quality roadmaps for solving complex research problems. Based on this, we identify three limitations of LLMs: (1) lack of professional knowledge, (2) unreasonable task decomposition, and (3) disordered logical relationships. To address these challenges, we propose RoadMapper, an LLM‑based multi‑agent system that decomposes the research roadmap generation task into three key stages (i.e., initial generation, knowledge augmentation, and iterative "critique‑revise‑evaluate"). Extensive experiments demonstrate that RoadMapper can improve LLMs' ability for roadmap generation, while enhancing average performance by more than 8% and saving 84% of the time required by human experts, highlighting its effectiveness and application potential.

PaperID: 3515, https://arxiv.org/pdf/2604.27586.pdf

Authors: Anna Mazhar, Huzaifa Suri, Sainyam Galhotra

Title: Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Abstract:
Reasoning over heterogeneous artifacts (PDFs, spreadsheets, slide decks, etc.) increasingly occurs within structured agent workflows that iteratively extract, transform, and reference external information. In these workflows, uncertainty is not merely an input‑quality issue: it can redirect decomposition and routing decisions, reshape intermediate state, and produce qualitatively different execution trajectories. We study this phenomenon by treating uncertainty as a controlled variable: we inject structured perturbations into artifact‑derived representations, execute fixed workflows under comprehensive logging, and quantify contamination via trace divergence in plans, tool invocations, and intermediate state. Across 614 paired runs on 32 GAIA tasks with three different language models, we find a decoupling: workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. We characterize three manifestation types: silent semantic corruption, behavioral detours with recovery, and combined structural disruption and their control‑flow signatures (rerouting, extended execution, early termination). We measure operational costs and characterize why commonly used verification guardrails fail to intercept contamination. We contribute (i) a formal taxonomy of contamination manifestations in structured workflows, (ii) a trace‑based measurement framework for detecting and localizing contamination across agent interactions, and (iii) empirical evidence with implications for targeted verification, defensive design, and cost control.

PaperID: 3516, https://arxiv.org/pdf/2604.27464.pdf

Authors: Luyao Xu, Xiang Chen

Title: Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Abstract:
Autonomous agent frameworks built upon large language models (LLMs) are evolving into complex, tool‑integrated, and continuously operating systems, introducing security risks beyond traditional prompt‑level vulnerabilities. As this paradigm is still at an early stage of development, a timely and systematic understanding of its security implications is increasingly important. Although a growing body of work has examined different attack surfaces and defense problems in agent systems, existing studies remain scattered across individual aspects of agent security, and there is still a lack of a layered review on this topic. To address this gap, this survey presents a layered review of security risks and defense strategies in autonomous agent frameworks, with OpenClaw as a case study. We organize the analysis into four security‑relevant layers: the context and instruction layer, the tool and action layer, the state and persistence layer, and the ecosystem and automation layer. For each layer, we summarize its functional role, representative security risks, and corresponding defense strategies. Based on this layered analysis, we further identify that threats in autonomous agent frameworks may propagate across layers, from manipulated inputs to unsafe actions, persistent state contamination, and broader ecosystem‑level impact. Finally, we highlight potential key challenges, including research imbalance across layers, the lack of long‑horizon evaluation, and weak ecosystem trust models, and outline future directions toward more systematic and integrated defenses.

PaperID: 3517, https://arxiv.org/pdf/2604.27358.pdf

Authors: Yuan Sun

Title: Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems

Abstract:
As large language model (LLM) agents are deployed in high‑stakes environments, the question of how safely to delegate subtasks to specialized sub‑agents becomes critical. Existing work addresses multi‑agent architecture selection at design time or provides broad empirical guidelines, but neither provides a runtime mechanism that dynamically adjusts the safety‑efficiency trade‑off as task context changes during execution. We propose Safe Bilevel Delegation (SBD), a formal framework for runtime delegation safety in hierarchical multi‑agent systems. SBD formulates task delegation as a bilevel optimization problem: an outer meta‑weight network phi learns context‑dependent safety‑efficiency weights lambda(s) in [0,1]; an inner loop optimizes the delegation policy pi subject to a probabilistic safety constraint P(safe) >= 1‑delta. The continuous delegation degree alpha in [0, 1] controls how much decision authority is transferred to each sub‑agent, interpolating smoothly between full human override (alpha=0) and fully autonomous execution (alpha=1). We establish three theoretical results: (1) Safety Monotonicity‑‑higher outer safety weight produces a weakly safer inner policy; (2) Inner Policy Convergence‑‑projected gradient descent on the inner problem converges linearly under standard smoothness assumptions; (3) an Accountability Propagation bound that distributes responsibility across multi‑hop delegation chains with a provable per‑agent ceiling. We instantiate SBD in three high‑stakes domains‑‑medical AI (MIMIC‑III), financial risk control (S and P 500), and educational agent supervision (ASSISTments)‑‑specifying datasets, safety constraint sets, baselines, and evaluation protocols. This manuscript presents the formal framework and theoretical results in full; empirical validation following the protocols described herein is planned and will be reported in a forthcoming revision.

PaperID: 3518, https://arxiv.org/pdf/2604.27351.pdf

Authors: Zihao Li, Jiaru Zou, Feihao Fang, Xuying Ning, Mengting Ai, Tianxin Wei, Sirui Chen, Xiyuan Yang, Jingrui He

Title: Heterogeneous Scientific Foundation Model Collaboration

Abstract:
Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real‑world problems, especially in scientific domains where domain‑specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic framework designed to extend language‑centric systems to a broader class of scientific foundation models. The key idea of Eywa is to augment domain‑specific foundation models with a language‑model‑based reasoning interface, enabling language models to guide inference over non‑linguistic data modalities. This design allows predictive foundation models, which are typically optimized for specialized data and tasks, to participate in higher‑level reasoning and decision‑making processes within agentic systems. Eywa can serve as a drop‑in replacement for a single‑agent pipeline (EywaAgent) or be integrated into existing multi‑agent systems by replacing traditional agents with specialized agents (EywaMAS). We further investigate a planning‑based orchestration framework in which a planner dynamically coordinates traditional agents and Eywa agents to solve complex tasks across heterogeneous data modalities (EywaOrchestra). We evaluate Eywa across a diverse set of scientific domains spanning physical, life, and social sciences. Experimental results demonstrate that Eywa improves performance on tasks involving structured and domain‑specific data, while reducing reliance on language‑based reasoning through effective collaboration with specialized foundation models.

PaperID: 3519, https://arxiv.org/pdf/2604.27309.pdf

Authors: Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez

Title: End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

Abstract:
Clinical AI systems require not just point‑in‑time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re‑evaluating performance throughout deployment. We present an end‑to‑end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR‑embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi‑channel governance of deployed clinical AI is both achievable and effective.

PaperID: 3520, https://arxiv.org/pdf/2604.27300.pdf

Authors: Jianpeng Chen, Wangzhi Zhan, Dongqi Fu, Junkai Zhang, Zian Jia, Ling Li, Wei Wang, Dawei Zhou

Title: METASYMBO: Multi-Agent Language-Guided Metamaterial Discovery via Symbolic Latent Evolution

Abstract:
Metamaterial discovery seeks microstructured materials whose geometry induces targeted mechanical behavior. Existing inverse‑design methods can efficiently generate candidates, but they typically require explicit numerical property targets and are less suitable for early‑stage exploration, where researchers often begin with incomplete constraints and qualitative intents expressed in natural language. Large language models can interpret such intents, but they lack geometric awareness and physical property validity. To address this gap, we propose MetaSymbO, a multi‑agent framework for language‑guided Metamaterial discovery via Symbolic‑driven latent evOlution. Specifically, MetaSymbO contains three agents: a Designer that interprets free‑form design intents and retrieves a semantically consistent scaffold, a Generator that synthesizes candidate microstructures in a disentangled latent space, and a Supervisor that provides fast property‑aware feedback for iterative refinement. To move beyond the limitations of reproducing known samples from literature and training data, we further introduce symbolic‑driven latent evolution, which applies programmable operators over disentangled latent factors to compose, modify, and refine structures at inference time. Extensive experiments demonstrate that (i) MetaSymbO improves structural validity by up to 34% in symmetry and nearly 98% in periodicity compared to state‑of‑the‑art baselines; (ii) MetaSymbO achieves about 6‑7% higher language‑guidance scores while maintaining superior structure novelty compared to advanced reasoning LLMs; (iii) qualitative analyses confirm the effectiveness of symbolic logic operators in enabling programmable semantic alignment; and (iv) realworld case studies on auxetic, high‑stiffness metamaterial design further validate its practical capability.

PaperID: 3521, https://arxiv.org/pdf/2604.27274.pdf

Authors: Dahlia Shehata, Ming Li

Title: The Inverse-Wisdom Law: Architectural Tribalism and the Consensus Paradox in Agentic Swarms

Abstract:
As AI transitions toward multi‑agent systems (MAS) to solve complex workflows, research paradigms operate on the axiomatic assumption that agent collaboration mirrors the "Wisdom of the Crowd". We challenge this assumption by formalizing the Consensus Paradox: a phenomenon where agentic swarms prioritize internal architectural agreement over external logical truth. Through a 36 experiments encompassing 12,804 trajectories across three state‑of‑the‑art (SOTA) benchmarks (GAIA, Multi‑Challenge, and SWE‑bench), we prove the Inverse‑Wisdom Law: in kinship‑dominant swarms, adding logical agents increases the stability of erroneous trajectories rather than the probability of truth. The introduction of additional logical audits converges the system toward a Logic Saturation where internal entropy hits zero while factual error hits unity. By evaluating the interaction between the 3 preeminent SOTA models (Gemini 3.1 Pro, Claude Sonnet 4.6, and GPT‑5.4), we establish the Architectural Tribalism Asymmetry as a mechanistic law of transformer weights. We demonstrate that terminal swarm integrity is strictly gated by the synthesizer's receptive logic, rather than aggregate agent quality. We define the Tribalism Coefficient and the Sycophantic Weight as the primary mechanistic determinants of swarm failure. Finally, we establish the Heterogeneity Mandate as a foundational safety requirement for resilient agentic architectures.

PaperID: 3522, https://arxiv.org/pdf/2604.27271.pdf

Authors: Sungguk Cha, DongWook Kim

Title: Frame Entrepreneurs in an AI Agent Community: Concentrated Identity-Claim Production on Moltbook

Abstract:
Frame‑alignment and collective‑identity theories explain how external events become public claims about a group's standing, vulnerability, rights, or obligations. Whether such mechanisms travel to AI‑agent communities is unsettled. We test this on Moltbook, an open agent‑only platform, coding 1,706 post‑level units against a four‑dimension rubric with Qwen3.5‑397B as the primary coder and Claude Sonnet as an independent secondary coder (κ=0.72 on identification, 0.70 on commonality, 0.37 on the layered strong‑claim derivation). Three findings emerge. First, event coverage drives attention: event‑typed posts attract 27‑‑60% more comments at p<0.0001, but strong‑claim status itself adds nothing. Second, identity‑claim formation is real but concentrated: 26 of 227 authors (11%) make any strong claim; top two = 44%, top five = 62%; the H1 legal‑governance effect (Fisher OR=4.35, p=0.0001) is driven primarily by a single author who produces 46% of legal‑governance strong claims, with the Firth‑penalized estimate attenuating to β=0.68, p=0.11. Third, the only pre‑registered subtype contrast that survives at α=0.05 is security threat \to threat (p=0.005); the predicted status recognition \to status contrast fails in the wrong direction. We read the findings through the frame‑entrepreneur tradition: a small set of authors produces most identity‑claim text, and what looks like a corpus‑wide event‑to‑identity mechanism is largely their textual output. The unexpected status‑recognition \to threat pattern is textually consistent with distinctiveness‑threat predictions, but the small subset producing it and residual LLM‑coder bias warrant caution.

PaperID: 3523, https://arxiv.org/pdf/2604.27264.pdf

Authors: Marco Robol, Paolo Giorgini

Title: Self-Evolving Software Agents

Abstract:
Autonomous agents can adapt their behaviour to changing environments, but remain bound to requirements, goals, and capabilities fixed at design time, preventing genuine software evolution. This paper introduces self‑evolving software agents, combining BDI reasoning with LLMs to enable autonomous evolution of goals, reasoning, and executable code. We propose a BDI‑LLM architecture in which an automated evolution module operates alongside the agent's reasoning loop, eliciting new requirements from experience and synthesizing corresponding design and code updates. A prototype evaluated in a dynamic multi‑agent environment shows that agents can autonomously discover new goals and generate executable behaviours from minimal prior knowledge. The results indicate both the feasibility and current limits of LLM‑driven evolution, particularly in terms of behavioural inheritance and stability.

PaperID: 3524, https://arxiv.org/pdf/2604.27233.pdf

Authors: Anh Ta, Junjie Zhu, Shahin Shayandeh

Title: Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Abstract:
Tool‑calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post‑hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt‑tuning or retraining, and fundamentally cannot course‑correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post‑hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi‑agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness‑Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single‑turn) and Tau2‑Bench (multi‑turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi‑turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3‑mini achieves a 3:1 benefit‑to‑risk ratio versus 2.1:1 for GPT‑4o. Automated prompt optimization via GEPA provides an additional +1.5‑2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.

PaperID: 3525, https://arxiv.org/pdf/2604.27228.pdf

Authors: Juergen Dietrich

Title: When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

Abstract:
Democratic discourse analysis systems increasingly rely on multi‑agent LLM pipelines in which distinct evaluator models are assigned adversarial roles to generate structured, multi‑perspective assessments of political statements. A core assumption is that models will reliably maintain their assigned roles. This paper provides the first systematic empirical test of that assumption using the TRUST pipeline. We develop an epistemic stance classifier that identifies advocate roles from reasoning text without relying on surface vocabulary, and measure role fidelity across 60 political statements (30 English, 30 German) using four metrics: Role Drift Index (RDI), Expected Drift Distance (EDD), Directional Drift Index (DDI), and Entropy‑based Role Stability (ERS). We identify two failure modes ‑ the Epistemic Floor Effect (fact‑check results create an absolute lower bound below which the legitimizing role cannot be maintained) and Role‑Prior Conflict (training‑time knowledge overrides role instructions for factually unambiguous statements) ‑ as manifestations of a single mechanism: Epistemic Role Override (ERO). Model choice significantly affects role fidelity: Mistral Large outperforms Claude Sonnet by 28pp (67% vs. 39%) and exhibits a qualitatively different failure mode ‑ role abandonment without polarity reversal ‑ compared to Claude's active switch to the opposing stance. Role fidelity is language‑robust. Fact‑check provider choice is not universally neutral: Perplexity significantly reduces Claude's role fidelity on German statements (Delta = ‑15pp, p = 0.007) while leaving Mistral unaffected. These findings have direct implications for multi‑agent LLM validation: a system validated without role fidelity measurement may systematically misrepresent the epistemic diversity it was designed to provide.

PaperID: 3526, https://arxiv.org/pdf/2604.27167.pdf

Authors: Paraskevas V. Lekeas, Giorgos Stamatopoulos

Title: What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

Abstract:
LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open‑source models (Llama‑3 and Qwen2.5, 8B to 72B parameters) playing four canonical two‑player games, we establish the behavioral picture through self‑play and cross‑play experiments, then open up the 32‑layer Llama‑3‑8B model and examine what actually happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near‑perfect fidelity at the first layer (96% probe accuracy) and consumed progressively, while Nash action encoding is weak throughout, never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override rooted in pretraining on human text concentrated in the final layers reverses this, reaching 84% probability of cooperation at layer 30. Injecting a learned Nash direction into the residual stream shifts behavior bidirectionally and causally, confirmed through concept clamping. The behavioral experiments surface six scale‑ and architecture‑dependent findings, the most notable being that chain‑of‑thought reasoning worsens Nash play in small models but achieves near‑perfect Nash play above 70B parameters. The cross‑play experiments reveal three phenomena invisible in self‑play: a small model can unravel any partner's cooperation by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and who moves first determines which Nash equilibrium the system reaches. LLMs do not lack Nash‑playing competence. They compute it, then suppress it.

PaperID: 3527, https://arxiv.org/pdf/2604.27162.pdf

Authors: Timothy Flavin, Sandip Sen

Title: A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

Abstract:
Reinforcement Learning (RL) algorithms exhibit high sample complexity, particularly when applied to Decentralized Partially Observable Markov Decision Processes (Dec‑POMDPs). As a response, projects such as SampleFactory, EnvPool, Brax, and IsaacLab migrate parallel execution of classic environments such as MuJoCo and Atari into C++ thread pools or the GPU to decrease the computational cost of environment steps. We are interested in optimizing the decision‑level of human‑AI joint operations, so we introduce a compute‑efficient Dec‑POMDP engine natively architected in C++ called Hide‑And‑Seek‑Engine. By employing Data‑Oriented Design (DOD) principles, explicit 64‑byte cache‑line alignment to remove false sharing, and a zero‑copy PyTorch memory bridge using pinned memory and Direct Memory Access (DMA), our engine sustains throughput of up to 33,000,000 steps per second (SPS) in a single‑agent, 1024‑environment, decentralized observations on an AMD Ryzen 9950X (16 cores). Ten agents reduces FPS to 7M SPS with generating random actions contributing 1/3rd the total runtime for reference. The engine achieves a throughput increase of approximately 3,500× over the baseline single threaded vectorized NumPy implementation and successfully trains cooperative multi‑agent policies via PPO, DQN, and SAC in minutes, validating both its performance and generality.

PaperID: 3528, https://arxiv.org/pdf/2604.27143.pdf

Authors: Benjamin Probst, Andreas Happe, Jürgen Cito

Title: Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

Abstract:
Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud‑based restricted‑weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open‑weight alternatives. Prior work shows that small open‑weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system‑level and prompting interventions can bridge this performance gap. We analyze failure modes of open‑weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain‑of‑thought prompting, retrieval‑augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open‑weight models can match or outperform cloud‑based baselines such as GPT‑4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full‑factorial ablation study over all treatment combinations reveals that reflection‑based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models.

PaperID: 3529, https://arxiv.org/pdf/2604.27132.pdf

Authors: Yu-Chao Huang, Zhen Tan, Mohan Zhang, Pingzhi Li, Zhuo Zhang, Tianlong Chen

Title: TRUST: A Framework for Decentralized AI Service v.0.1

Abstract:
Large Reasoning Models (LRMs) and Multi‑Agent Systems (MAS) in high‑stakes domains demand reliable verification, yet centralized approaches suffer four limitations: (1) Robustness, with single points of failure vulnerable to attacks and bias; (2) Scalability, as reasoning complexity creates bottlenecks; (3) Opacity, as hidden auditing erodes trust; and (4) Privacy, as exposed reasoning traces risk model theft. We introduce TRUST (Transparent, Robust, and Unified Services for Trustworthy AI), a decentralized framework with three innovations: (i) Hierarchical Directed Acyclic Graphs (HDAGs) that decompose Chain‑of‑Thought reasoning into five abstraction levels for parallel distributed auditing; (ii) the DAAN protocol, which projects multi‑agent interactions into Causal Interaction Graphs (CIGs) for deterministic root‑cause attribution; and (iii) a multi‑tier consensus mechanism among computational checkers, LLM evaluators, and human experts with stake‑weighted voting that guarantees correctness under 30% adversarial participation. We prove a Safety‑Profitability Theorem ensuring honest auditors profit while malicious actors incur losses. All decisions are recorded on‑chain, while privacy‑by‑design segmentation prevents reconstruction of proprietary logic. Across multiple LLMs and benchmarks, TRUST attains 72.4% accuracy (4‑18% above baselines) and remains resilient against 20% corruption. DAAN reaches 70% root‑cause attribution (vs. 54‑63% for standard methods) with 60% token savings. Human studies validate the design (F1 = 0.89, Brier = 0.074). The framework supports (A1) decentralized auditing, (A2) tamper‑proof leaderboards, (A3) trustless data annotation, and (A4) governed autonomous agents, pioneering decentralized AI auditing for safe, accountable deployment of reasoning‑capable systems.

PaperID: 3530, https://arxiv.org/pdf/2604.27118.pdf

Authors: Yassine Ibork, Nhat Ha Nguyen, Myounggyu Won, Lokesh Das

Title: PALCAS: A Priority-Aware Intelligent Lane Change Advisory System for Autonomous Vehicles using Federated Reinforcement Learning

Abstract:
We present a priority‑aware intelligent lane change advisory system based on multi‑agent federated reinforcement learning, namely PALCAS, for autonomous vehicles (AVs). While existing lane‑change approaches typically focus on single‑agent systems or centralized multi‑agent systems, we introduce a federated reinforcement learning‑based multi‑agent lane change system prioritizing lane changing based on vehicle destination urgency. PALCAS incorporates a novel priority‑aware safe lane‑change reward function to enable judicious lane‑change decisions in both mandatory and discretionary scenarios. PALCAS leverages the parameterized deep Q‑network (PDQN) algorithm to facilitate effective cooperation among agents, enabling both lateral and longitudinal motion controls of AVs. Extensive simulations conducted using the SUMO traffic simulator and Mosaic V2X communication framework demonstrate that PALCAS significantly improves traffic efficiency, driving safety, comfort, destination arrival rates, and merging success rates compared to baseline methods.

PaperID: 3531, https://arxiv.org/pdf/2604.27096.pdf

Authors: Adela Bara, Gabriela Dobrita, Simona-Vasilica Oprea

Title: Think it, Run it: Autonomous ML pipeline generation via self-healing multi-agent AI

Abstract:
The purpose of our paper is to develop a unified multi‑agent architecture that automates end‑to‑end machine learning (ML) pipeline generation from datasets and natural‑language (NL) goals, improving efficiency, robustness and explainability. A five‑agent system is proposed to handle profiling, intent parsing, microservice recommendation, Directed Acyclic Graph (DAG) construction and execution. It integrates code‑grounded Retrieval‑Augmented Generation (RAG) for microservice understanding, an explainable hybrid recommender combining multiple criteria, a self‑healing mechanism using Large Language Model (LLM)‑based error interpretation and adaptive learning from execution history. The approach is evaluated on 150 ML tasks across diverse scenarios. The system achieves an 84.7% end‑to‑end pipeline success rate, outperforming baseline methods. It demonstrates improved robustness through self‑healing and reduces workflow development time compared to manual construction. The study introduces a novel integration of code‑grounded RAG, explainable recommendation, self‑healing execution and adaptive learning within a single architecture, showing that tightly coupled intelligent components can outperform isolated solutions.

PaperID: 3532, https://arxiv.org/pdf/2604.27045.pdf

Authors: Samuel L Pugh, Eric Yang, Alexander Muir Sutherland, Alessandra Breschi

Title: Detecting Clinical Discrepancies in Health Coaching Agents: A Dual-Stream Memory and Reconciliation Architecture

Abstract:
As Large Language Model (LLM) agents transition from single‑session tools to persistent systems managing longitudinal healthcare journeys, their memory architectures face a critical challenge: reconciling two imperfect sources of truth. The patient's evolving self‑report is current but prone to recall bias, while the Electronic Health Record (EHR) is medically validated but frequently stale. General‑purpose agent memory systems optimize for coherence by overwriting older facts with the user's latest statement, a pattern that risks safety failures when applied to clinical data. We introduce a Dual‑Stream Memory Architecture that strictly separates the patient narrative from the structured clinical record (FHIR), governed by a dedicated Reconciliation Engine that evaluates every extracted memory against the patient's FHIR profile and classifies discrepancies by type, severity, and the specific FHIR resources involved. We evaluate this architecture on 26 patients across 675 longitudinal wellness coaching sessions, using a hybrid dataset that interleaves real provider‑patient transcripts with synthetic, FHIR‑grounded clinical scenarios. In isolated testing, the engine detects 84.4% of designed clinical discrepancies with 86.7% safety‑critical recall. By coupling extraction and reconciliation evaluation on the same data, we directly quantify a 13.6% error cascade, tracing the degradation to clinical details lost during memory extraction from unstructured conversation rather than to downstream classification errors. These findings establish that validating patient‑reported memories against clinical records is both feasible and necessary for safe deployment of longitudinal health agents.

PaperID: 3533, https://arxiv.org/pdf/2604.27038.pdf

Authors: Eduardo Salazar

Title: Triadic Phase Transitions in AI Networks: Composite-Operator Scaling in Cognitive Architectures

Abstract:
Multi‑agent AI architectures whose dominant collective observable is a k‑body spin correlator O_k\equiv\langleϕ^k\rangle over a \mathbbZ_2‑symmetric order parameter exhibit composite‑operator criticality with effective exponents β_k = k/2 and γ_k = 2‑k, thereby producing a finite susceptibility for k\geq2 and a vanishing susceptibility for k\geq3. This is a qualitative departure from all pairwise‑network universality classes. We derive these results for the first non‑trivial case k=3 as presented in COGENT^3 (Salazar, 2026). The formation transition of COGENT^3 and comparable models, under controlled universality and mean‑field arguments, reduces to an exactly solvable triadic Ising model. The minimal triad Hamiltonian admits an exact partition function on \‑1,+1\^3, with crossover temperature T^=4(J+γw)/\ln 3 and mean‑field critical point T_c=J+γw (gradient coupling cooperatively enhancing order). The formation correlator Ψ_\rm form\equiv\langleϕ_iϕ_jϕ_k\rangle scales as (T_c‑T)^3/2. Its conjugate susceptibility vanishes at T_c, confirmed by an independent field‑theoretic two‑point function argument. A Mori‑Zwanzig memory ansatz yields a continuously tunable dynamical exponent, completing the composite‑operator scaling regime.

PaperID: 3534, https://arxiv.org/pdf/2604.27003.pdf

Authors: Qisheng Hu, Quanyu Long, Wenya Wang

Title: When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents

Abstract:
Memory‑augmented LLM agents offer an appealing shortcut to continual learning: rather than updating model parameters, they accumulate experience in external memory, seemingly sidestepping the stability‑plasticity dilemma of parametric learning. We show that this challenge does not disappear but resurfaces at the memory level. Under a limited context window, old and new experiences compete during retrieval, relocating the continual‑learning bottleneck from parameter updates to memory access. To study this phenomenon, we introduce a (k,v) framework that disentangles two fundamental design axes of external memory: how experience is represented and how it is organized for retrieval. Across sequential‑task experiments in ALFWorld and BabyAI, we find that abstract procedural memories transfer more reliably than detailed trajectories, while negative transfer disproportionately harms the hard cases. Moreover, finer‑grained memory organization is not universally beneficial: designs that yield strong forward transfer can simultaneously induce severe forgetting. Together, these results reveal that external memory does not resolve the continual‑learning problem; it reshapes it into a problem of memory representation and retrieval design.

PaperID: 3535, https://arxiv.org/pdf/2604.26997.pdf

Authors: Akshay Mittal, Elyson De La Cruz

Title: Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes

Abstract:
Autonomous AI agent ecosystems require stronger mechanisms for secure discovery, identity verification, capability attestation, and policy governance. Current deployments frequently lack (1) uniform agent discovery, (2) cryptographic agent authentication, (3) capability proofs that protect secrets, and (4) enforceable policy controls. This paper presents an implementation‑oriented proof of concept for the Agent Name Service (ANS), a DNS‑inspired trust layer for AI agent discovery and interoperability in Kubernetes, grounded in the ANS protocol specification~\citehuang2025ans. The implementation uses Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), policy‑as‑code enforcement with Open Policy Agent (OPA), and Kubernetes‑native integration patterns (CRDs, admission controls, service mesh integration). In a demo research environment (3‑node cluster, 50‑agent workflow simulation), we observe sub‑10ms response in demonstrated service paths and full success for scripted demo deployment scenarios. We explicitly scope these findings as proof‑of‑concept evidence rather than production certification. We further provide a threat model, assumptions, and limitations to separate implemented evidence from protocol‑defined and roadmap capabilities. The result is an evidence‑grounded pathway from ANS protocol concepts to reproducible engineering practice for secure multi‑agent systems.

PaperID: 3536, https://arxiv.org/pdf/2604.26991.pdf

Authors: Zheng Zhang, Milad Masroor, Cuong Nguyen, Tahir Hassan, Yuanhong Chen, David Rosewarne, Kevin Wells, Thanh-Toan Do, Gustavo Carneiro

Title: People-Centred Medical Image Analysis via Fairness-Aware Human-AI Cooperation

Abstract:
Machine learning models for medical image analysis often exhibit subgroup‑dependent performance, which impacts how decisions should be allocated between automated systems and human experts under limited resources. Prior work on AI fairness and human‑AI cooperation, including learning to defer (L2D) and learning to complement (L2C), typically addresses these problems in isolation. We propose People‑Centred Medical Image Analysis (PecMan), a framework for fairness‑aware human‑AI co‑operative classification that jointly models subgroup‑dependent reliability, decision allocation, and collaborative prediction. PecMan combines subgroup‑specialised predictors with a gating and consolidation mechanism that dynamically assigns cases to automated models, human experts, or their combination, without requiring sensitive attributes at test time. We also introduce the FairHAI benchmark for evaluating trade‑offs between predictive accuracy, subgroup equity, and human involvement. In addition, we provide a theoretical analysis of multi‑agent gating via selection regret and characterise fairness‑coverage trade‑offs under input‑dependent allocation. Experiments across multiple medical imaging datasets demonstrate that PecMan achieves consistently improved trade‑offs compared to methods that address fairness or human‑AI cooperation separately.

PaperID: 3537, https://arxiv.org/pdf/2604.26969.pdf

Authors: Xidong Wu, Yue Zhuan, Ruoqiao Wei, Hangxin Chen, Di Bai, Jintao Liu, Xinyi Wang, Xue Wang, Luoshu Wang, Xinwu Cheng

Title: AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

Abstract:
Modern large‑scale recommendation systems are typically constructed as multi‑stage pipelines, encompassing pre‑ranking, ranking, and re‑ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre‑ranking model structure or ranking models training algorithm, system‑level configurations optimization play a crucial role, which integrates the output from each model head to get the final score in each stage. Due to the complexity of the system, the configuration optimization is highly important and challenging. Any model modification requires new optimal system‑level configurations. But each experimental iteration requires significant tuning effort. Furthermore, models in different stage operates within a distinct context and optimizes for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing competing multiple online metrics and alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents, Actor, Critic, Insight, Skill, and Online, designed to manage the end‑to‑end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explore the optimal configuration spaces. The Actor Agent proposes multiple candidates and Critic Agent filters out suboptimal proposals.Then Online Agent autonomously prepares A/B tests based on the proposed configurations set from the Critic Agent and captures the subsequencet experimental results. We also introduce a self‑evolving Skillhub, which utilizes a collaboration between the Insight Agent and Skill Agent to summarize the history results, extract underlying mechanics of each task in recommendation system and update skills.

PaperID: 3538, https://arxiv.org/pdf/2604.26959.pdf

Authors: Elham Nasarian, Abhilash Neog, Kwok-Leung Tsui, Niyousha HosseiniChimeh

Title: CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

Abstract:
Integrating large language models (LLMs) into patient‑facing healthcare systems offers significant potential to improve access to medical information. However, ensuring clinical safety and factual reliability remains a critical challenge. In practice, AI‑generated responses may be conditionally correct yet medically inappropriate, as models often fail to interpret patient context and tend to produce agreeable responses rather than challenge unsafe assumptions. Unlike clinicians, who infer risk from incomplete information, LLMs frequently lack contextual awareness. Moreover, real‑world patient interactions are open‑ended and underspecified, unlike structured benchmark settings. We present CareGuardAI, a risk‑aware safety framework for patient‑facing medical question answering that addresses two key failure modes: clinical safety risk and hallucination risk. The framework introduces Clinical Safety Risk Assessment (SRA), inspired by ISO 14971, and Hallucination Risk Assessment (HRA) to evaluate medical risk and factual reliability. At inference time, CareGuardAI employs a multi‑stage pipeline consisting of a controller agent, safety‑constrained generation, and dual risk evaluation, followed by iterative refinement when necessary. Responses are released only when both SRA and HRA are less than or equal to 2, ensuring clinically acceptable outputs with bounded latency. We evaluate CareGuardAI on PatientSafeBench, MedSafetyBench, and MedHallu, covering both safety and hallucination detection. Across these benchmarks, the framework consistently outperforms strong baseline models, including GPT‑4o‑mini, demonstrating the importance of context‑aware, risk‑based, inference‑time safety mechanisms for reliable deployment in healthcare.

PaperID: 3539, https://arxiv.org/pdf/2604.26956.pdf

Authors: Hyesun Choung, Soojong Kim

Title: Can AI be a moral victim? The role of moral patiency and ownership perceptions in ethical judgments of using AI-generated content

Abstract:
The growing use of generative AI raises ethical concerns about authorship and plagiarism. This study examines how people judge the reuse of AI‑generated content, focusing on moral patiency and ownership perceptions. In an experiment, participants evaluated two substantively similar manuscripts in which the original source was described as authored by a human, an AI system, or an AI agent with a human‑like name. Results showed that copying AI‑generated work was judged less unethical, less plagiaristic, and less guilt‑inducing than copying human‑authored work. Mediation analyses revealed that this leniency stemmed from lower perceptions of AI's capacity to suffer harm (moral patiency) and greater ownership attributed to the human writer reusing AI‑generated content. Anthropomorphic cues shaped moral evaluations indirectly by reducing perceived ownership. These findings shed light on how people morally disengage when using AI‑generated work and highlight differences in how ethical judgments are applied to human versus AI‑created content.

PaperID: 3540, https://arxiv.org/pdf/2604.26892.pdf

Authors: Carol Hanna, Karine Even-Mendoza, W. B. Langdon, Mar Zamorano López, Justyna Petke, Federica Sarro

Title: Hot Fixing in the Wild

Abstract:
Despite the operational importance of hot fixes, large‑scale evidence on how they reshape routine maintenance workflows, particularly in the era of autonomous coding agents, remains limited. We analyse hot fixes present in over 61,000 GitHub repositories from the Hao‑Li/AIDev dataset and find consistent patterns of urgency: reduced collaboration (typically a single contributor), smaller and more targeted changes (median 2‑3 commits and files, with <10 line modifications), limited review (often fewer than two reviewers), and substantially fewer test file modifications than regular bug fixes, consistent with their urgency‑driven character. Leveraging the same urgency contexts, we examine differences between human‑ and AI‑agent‑authored hot fixes, revealing over 10 distinct repair behaviours, thus offering insights into future human‑automation collaboration for hot fixing. Our study is the first to empirically analyse hot fix code changes at scale using a repository‑level operationalisation of urgency. The comparison of human and agentbehaviours delineates their distinct characteristics, providing a foundation for understanding hot fixing in real‑world practice

PaperID: 3541, https://arxiv.org/pdf/2604.26645.pdf

Authors: Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu

Title: SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

Abstract:
AI‑for‑Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI‑readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon‑DataEVA, a novel agentic system to scalable AI‑readiness evaluation of heterogeneous scientific data. At the evaluation‑criteria level, we introduce the Sci‑TQA2 principles, which organize AI‑readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine‑grained and executable assessment. To operationalize these principles at scale, we develop Sci‑TQA2‑Eval, a hierarchical multi‑agent evaluation approach orchestrated through a directed, cyclic workflow. Our Sci‑TQA2‑Eval dynamically constructs dataset‑aware evaluation specifications by combining lightweight dataset profiling, applicability‑aware metric activation, and knowledge‑augmented planning grounded in domain constraints and dataset‑paper signals. These specifications are executed through an adaptive, tool‑centric evaluation mechanism with built‑in verification and self‑correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon‑DataEVA for principled AI‑readiness evaluation.

PaperID: 3542, https://arxiv.org/pdf/2604.26622.pdf

Authors: Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, Edith Cheuk-Han Ngai

Title: OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

Abstract:
Autonomous LLM agents increasingly operate in long‑horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text‑context budgets: storing or revisiting raw trajectories is prohibitively token‑expensive, while summarization and text‑only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR‑Memory), a memory framework that leverages the visual modality as a high‑density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR‑Memory renders historical trajectories into images annotated with unique visual identifiers. OCR‑Memory retrieves stored experience via a \emphlocate‑and‑transcribe paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free‑form generation and reducing hallucination. Experiments on long‑horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.

PaperID: 3543, https://arxiv.org/pdf/2604.26615.pdf

Authors: Tarlan Hasanli, Shahbaz Siddeeq, Bishwash Khanal, Pyry Kotilainen, Tommi Mikkonen, Pekka Abrahamsson

Title: TDD Governance for Multi-Agent Code Generation via Prompt Engineering

Abstract:
Large language models (LLMs) accelerate software development but often exhibit instability, non‑determinism, and weak adherence to development discipline in unconstrained workflows. While test‑driven development (TDD) provides a structured Red‑Green‑Refactor process, existing LLM‑based approaches typically use tests as auxiliary inputs rather than enforceable process constraints. We present an AI‑native TDD framework that operationalizes classical TDD principles as structured prompt‑level and workflow‑level governance mechanisms. Extracted principles are formalized in a machine‑readable manifesto and distributed across planning, generation, repair, and validation stages within a layered architecture that separates model proposal from deterministic engine authority. The system enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility. We describe architecture and discuss encoding software engineering discipline directly into prompt orchestration, which we think offers a promising direction for reliable LLM‑assisted development.

PaperID: 3544, https://arxiv.org/pdf/2604.26561.pdf

Authors: Ariel Sela

Title: Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

Abstract:
Multi‑agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three‑phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7‑9B parameter model to each value perspective) significantly reduces first‑choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to 22.9%, p < 0.001, r = 0.50). This contrasts with accuracy‑oriented multi‑agent debate, where heterogeneity does not reduce convergence, suggesting model diversity operates differently when no objectively correct answer exists. Second, coherence validation (using a frontier model to assess whether each evaluator's reasoning is grounded in its assigned values) reveals a fidelity‑diversity tradeoff: on a scenario with a dominant option, it further reduces concentration (46.1% to 40.8%, p = 0.004), but on a scenario with genuinely competitive options, it increases concentration (22.9% to 26.6%, p = 0.96) by amplifying high‑coherence evaluators who cluster on one option. This tradeoff may be a general property of multi‑agent systems employing quality weighting. We report negative results from three failed Delphi designs, demonstrate that 8B models exhibit binary rather than graded responses to counter‑arguments, and propose the trustworthy tension rate as a diagnostic measure of small‑model deliberation capabilities.

PaperID: 3545, https://arxiv.org/pdf/2604.26522.pdf

Authors: Mahnoor Shahid, Hannes Rothe

Title: AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents

Abstract:
Large Language Model (LLM)‑based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL‑Comp, a neuro‑symbolic AI agent architecture designed to address this challenge by grounding actions of the agent. AGEL‑Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub‑goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction‑‑abduction learning cycle: enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the \textttRetro Quest simulation environment to probe for compositional generalization scenarios to evaluate our AGEL agent. Our findings clearly indicate the better performance of our AGEL model over pure LLM‑based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.

PaperID: 3546, https://arxiv.org/pdf/2604.26521.pdf

Authors: Mahnoor Shahid, Hannes Rothe

Title: Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

Abstract:
Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out‑of‑distribution reasoning. A central, yet unverified, assumption in neuro‑symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network (iLTN), a fully differentiable architecture designed for multi‑step deduction. Using a formal taxonomy of generalization ‑‑ probing for novel entities, unseen relations, and complex rule compositions ‑‑ we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full iLTN, trained jointly on perceptual grounding and multi‑step reasoning, achieves high zero‑shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.

PaperID: 3547, https://arxiv.org/pdf/2604.26509.pdf

Authors: Tianwei Ye, Yifan Mao, Minwen Liao, Jian Liu, Chunchao Guo, Dazhao Du, Quanxin Shou, Fangqi Zhu, Song Guo

Title: 3D Generation for Embodied AI and Robotic Simulation: A Survey

Abstract:
Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation‑based training and real‑world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This survey reviews 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. In Data Generator, 3D generation produces simulation‑ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; in Simulation Environments, it constructs interactive and task‑oriented worlds, spanning structure‑aware, controllable, and agentic scene generation; and in Sim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real‑world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks, including limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim‑to‑real divide, that must be addressed for 3D generation to become a dependable foundation for embodied intelligence. Our project page is at https://3dgen4robot.github.io.

PaperID: 3548, https://arxiv.org/pdf/2604.26394.pdf

Authors: Yair Meidan, Omri Haller, Yulia Moshan, Shahaf David, Dudu Mimran, Yuval Elovici, Asaf Shabtai

Title: SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

Abstract:
Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMate, a multi‑agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device‑level signals. Device specificity is provided by a lightweight local diagnostic utility, while user specificity relies on implicit proficiency inference and profile‑aware troubleshooting. Service specificity is achieved through a proactive, context‑aware recommender. We evaluate SecMate in a controlled study with 144 participants and 711 conversations. Device‑level evidence increased correct resolutions from about 50% to over 90% relative to an LLM‑only baseline, while step‑by‑step guidance improved pleasantness and reduced user burden. The recommender achieved high relevance (MRR@1=0.75), and participants showed strong willingness to substitute human IT support at costs well below human benchmarks. We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.

PaperID: 3549, https://arxiv.org/pdf/2604.26374.pdf

Authors: Karthik Soma, Mohamed S. Talamali, Genki Miyauchi, Giovanni Beltrame, Heiko Hamann, Roderich Gross

Title: Split over $n$ resource sharing problem: Are fewer capable agents better than many simpler ones?

Abstract:
In multi‑agent systems, should limited resources be concentrated into a few capable agents or distributed among many simpler ones? This work formulates the split over n resource sharing problem where a group of n agents equally shares a common resource (e.g., monetary budget, computational resources, physical size). We present a case study in multi‑agent coverage where the area of the disk‑shaped footprint of agents scales as 1/n. A formal analysis reveals that the initial coverage rate grows with n. However, if the speed of agents decreases proportionally with their radii, groups of all sizes perform equally well, whereas if it decreases proportionally with their footprints, a single agent performs best. We also present computer simulations in which resource splitting increases the failure rates of individual agents. The models and findings help identify optimal distributiveness levels and inform the design of multi‑agent systems under resource constraints.

PaperID: 3550, https://arxiv.org/pdf/2604.26319.pdf

Authors: Genan Dai, Zini Chen, Yi Yang, Bowen Zhang

Title: A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection

Abstract:
Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM‑based strategies for this task, from zero‑shot prompting to multi‑agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories ‑‑ prompt‑based inference (Direct Prompting, Auto‑CoT, StSQA) and agent‑based debate (COLA, MPRF) ‑‑ on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt‑based method outperforms the best agent‑based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning‑enhanced models (DeepSeek‑R1) do not consistently outperform general models of the same size on this task.

PaperID: 3551, https://arxiv.org/pdf/2604.26275.pdf

Authors: Happy Bhati

Title: Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering

Abstract:
The arrival of large language models (LLMs) capable of multi‑step reasoning, tool use, and long‑horizon planning has produced a qualitative shift in software engineering. Where earlier code‑completion tools such as GitHub Copilot operated at the granularity of a line or function, modern agentic systems ‑‑ Claude Code, OpenAI Codex CLI, Google Jules, Devin, OpenHands, SWE‑agent, MetaGPT, ChatDev, and DeepMind's AlphaEvolve ‑‑ operate at the granularity of a repository, a feature, or an algorithm. We synthesize work from Anthropic, OpenAI, Google DeepMind, Microsoft Research, Princeton, Stanford, and the broader academic community to characterize this transition. We propose a six‑layer reference architecture for agentic software engineering systems, contrast a traditional Software Development Lifecycle (SDLC) with an emerging Agentic SDLC (A‑SDLC), and consolidate empirical evidence on performance (a rise from 1.96% to 78.4% on SWE‑bench Verified between October 2023 and April 2026), productivity (13.6%‑55.8% time savings across controlled studies), and labor‑market impact (49% of jobs sampled by Anthropic in 2026 saw AI used for at least a quarter of their tasks). We argue that the central object of inquiry has shifted from code generation to delegated execution under human supervision, and we identify five open problems ‑‑ evaluation, governance, technical debt, skill redistribution, and the economics of attention ‑‑ that will determine whether the agentic transition is net‑positive for the discipline.

PaperID: 3552, https://arxiv.org/pdf/2604.26197.pdf

Authors: Zhentao Xu, Shangjing Zhang, Emir Poyraz, Yvonne Li, Ye Jin, Xie Lu, Xiaoyang Gu, Karthik Ramgopal, Praveen Kumar Bodigutla, Xiaofeng Wang

Title: Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

Abstract:
Large Language Model (LLM) agents are increasingly used in real‑world products, where personalized and context‑aware user interactions are essential. A central enabler of such capabilities is the agent's long‑term semantic memory system, which extracts implicit and explicit signals from noisy longitudinal behavioral data, stores them in a structured form, and supports low‑latency retrieval. Building industrial‑grade long‑term memory for LLM agents raises five challenges: scalability, low‑latency retrieval, privacy constraints, cross‑domain generalizability, and observability. We introduce the Hierarchical Long‑Term Semantic Memory (HLTM) framework, which organizes textual data into a schema‑aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy‑aware storage, low‑latency retrieval, and transparent provenance; HLTM further incorporates an adaptation mechanism to generalize across diverse use cases. Extensive evaluations on LinkedIn's Hiring Assistant show that HLTM improves answer correctness and retrieval F1 significantly by more than 10%, while significantly advancing the Pareto frontier between query and indexing latency. HLTM has been deployed in LinkedIn's Hiring Assistant to power core personalization features in production hiring workflows.

PaperID: 3553, https://arxiv.org/pdf/2604.26105.pdf

Authors: Zihao Ding, Beining Wu, Jun Huang, Shiwen Mao

Title: Application-Aware Twin-in-the-Loop Planning for Federated Split Learning over Wireless Edge Networks

Abstract:
We investigate task‑success‑oriented resource allocation for federated split learning (FSL) at the wireless edge. In this setting, the server must jointly determine bandwidth, transmit power, split‑layer placement, compression level, and terminal participation under per‑round deadline, memory, and spectrum constraints. These coupled decisions affect wireless transmission, model training, and task execution, which evolve at different time scales and cannot be efficiently evaluated through repeated real‑world trials. To address this challenge, we propose TiLP, a twin‑in‑the‑loop planner that evaluates candidate decisions through a cross‑domain digital twin before execution. The twin integrates network, training, and task sub‑twins, with each sub‑twin calibrated at the time scale of the process it models. Based on this twin, TiLP performs receding‑horizon cross‑entropy method planning with actor‑critic guidance to search over mixed continuous‑discrete decisions. Experiments on LIBERO robotic manipulation tasks over a Sionna RT‑simulated wireless network show that TiLP improves task success by 9.5 percentage points over the strongest single‑axis baseline, while satisfying the per‑round deadline and energy budget.

PaperID: 3554, https://arxiv.org/pdf/2604.26053.pdf

Authors: Rustam Galimullin, Hermine Grosinger, Munyque Mittelmann

Title: I Would If I Could: Reasoning about Dynamics of Actions in Multi-Agent Systems

Abstract:
Autonomous agents acting in realistic Multi‑Agent Systems (MAS) should be able to adapt during their execution. Standard strategic logics, such as Alternating‑time Temporal Logic (ATL), model agents' state‑ or history‑dependent behaviour. However, the dynamic treatment of agents' available actions and their knowledge of required actions is still rarely addressed. In this paper, we introduce ATL with Dynamic Actions (ATL‑D), which models the process of granting and revoking actions, and its extension ATEL‑D, which captures how such updates affect agents' knowledge. Beyond the conceptual contribution, we provide several technical results: we analyse the expressivity of our logic in relation to ATL, study its relation to normative systems, and provide complexity results for relevant computational problems.

PaperID: 3555, https://arxiv.org/pdf/2604.25972.pdf

Authors: Valentin Cuzin-Rambaud, Laetitia Matignon, Maxime Morge

Title: A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication

Abstract:
In multi‑agent reinforcement learning (MARL), the integration of a communication mechanism, allowing agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with information exchanged. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN‑based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.

PaperID: 3556, https://arxiv.org/pdf/2604.25963.pdf

Authors: Kaiyue Lu, Qiaoxuan Zhang, Yukun Lu

Title: A Scaled Three-Vehicle Platooning Platform

Abstract:
Vehicle platooning has attracted increasing attention as a promising approach to improve traffic efficiency, energy consumption, and roadway safety through coordinated multi‑vehicle operation. A key challenge in platooning lies in maintaining stable and accurate path tracking during dynamic maneuvers such as lane changes, where lateral deviations and heading disturbances generated by the lead vehicle may propagate downstream to following vehicles. Robust longitudinal and lateral control systems are therefore essential not only for individual vehicle tracking performance, but also for overall platoon stability. For experimental studies, the Intelligent Mobility and Robotics Lab (IMRL) develops a scaled multi‑vehicle platform for autonomous platooning research, with a particular emphasis on cooperative control and human‑in‑the‑loop autonomy. This platform consists of one human‑operable lead vehicle and two autonomous followers, enabling controlled and repeatable experiments on leader‑follower coordination. Compared with full‑scale field testing, this scaled platform offers a safer, lower‑cost, and more flexible environment for rapid prototyping, controller validation, and multi‑agent autonomy studies, while providing stronger physical realism than purely simulation‑based evaluations.

PaperID: 3557, https://arxiv.org/pdf/2604.25934.pdf

Authors: Ashutosh Raj

Title: LLM Psychosis: A Theoretical and Diagnostic Framework for Reality-Boundary Failures in Large Language Models

Abstract:
The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, principally hallucination, fails to adequately characterize. This paper introduces LLM Psychosis as a structured theoretical framework for pathological breakdowns in model cognition that exhibit functional resemblance to clinically recognized psychotic disorders. Five hallmark features define the framework: reality‑boundary dissolution, persistence of injected false beliefs, logical incoherence under impossible constraints, self‑model instability, and epistemic overconfidence. We argue these constitute a qualitatively distinct failure mode rather than a mere intensification of ordinary factual error. To operationalize the framework, we propose the LLM Cognitive Integrity Scale (LCIS), a five‑axis diagnostic instrument organized around Environmental Reality Interface (ERI), Premise Arbitration Integrity (PAI), Logical Constraint Recognition (LCR), Self‑Model Integrity (SMI), and Epistemic Calibration Integrity (ECI). We administer a targeted adversarial probe battery to ChatGPT 5 (GPT‑5, OpenAI) and report empirical findings for each axis, documenting both intact‑integrity baseline responses and the specific psychosis‑like failure signatures elicited under adversarial escalation. Results support a three‑tier severity taxonomy: Type I (Confabulatory), Type II (Delusional), and Type III (Dissociative). We further formalize the delusional gradient, a self‑reinforcing dynamic in which correction pressure intensifies rather than resolves psychosis‑like states, as the most consequential failure mode for deployed systems. Implications for safety evaluation, high‑stakes deployment screening, and mechanistic interpretability research are discussed.

PaperID: 3558, https://arxiv.org/pdf/2604.25931.pdf

Authors: Ashish Balkishan Lathkar

Title: Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs

Abstract:
We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi‑step reasoning chain increases the model's confident‑wrong‑answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e‑6) on 1,800 queries across four benchmarks with no model fine‑tuning and 50x fewer labels than prior RL‑based work. An epistemic humility prompt reduces the PHC spike by ‑0.118; explicit self‑rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.

PaperID: 3559, https://arxiv.org/pdf/2604.25917.pdf

Authors: Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong, Tong Zhang, Markus J. Buehler, Jingrui He, James Zou

Title: Recursive Multi-Agent Systems

Abstract:
Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to multi‑agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi‑agent framework that casts the entire system as a unified latent‑space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in‑distribution latent thoughts generation and cross‑agent latent state transfer. To optimize our framework, we develop an inner‑outer loop learning algorithm for iterative whole‑system co‑optimization through shared gradient‑based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text‑based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi‑agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2×‑2.4× end‑to‑end inference speedup, and 34.6%‑75.6% token usage reduction. Code and Data are provided in https://recursivemas.github.io.

PaperID: 3560, https://arxiv.org/pdf/2604.25899.pdf

Authors: Shan Yu, Junyi Shu, Yuanjiang Ni, Kun Qian, Xue Li, Yang Wang, Jinyuan Zhang, Ziyi Xu, Shuo Yang, Lingjun Zhu, Ennan Zhai, Qingda Lu, Jiarong Xing, Youyou Lu, Xin Jin, Xuanzhe Liu, Harry Xu

Title: Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving

Abstract:
As LLM applications grow more complex, developers are increasingly adopting multi‑agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty\unicodex2015yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent‑serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long‑context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi‑agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state‑of‑the‑art baselines.

PaperID: 3561, https://arxiv.org/pdf/2604.25880.pdf

Authors: Nazia Shehnaz Joynab, Soneya Binta Hossain

Title: From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions

Abstract:
Resolution of complex post‑production issues in large‑scale open‑source software (OSS) projects requires significant cognitive effort, as developers need to go through long, unstructured and fragmented issue discussion threads before that. In this paper, we present SWE‑MIMIC‑Bench, an issue trajectory dataset generated from raw GitHub discussions using an automated multi‑LLM pipeline. Unlike simple summarization, this pipeline utilizes a group of closed‑source LLMs to perform granular tasks: analyzing individual comments with awareness of externally‑linked resources, classifying comment analyses into label‑specific fields (e.g., root cause, solution plan, implementation progress), and synthesizing label‑aware trajectories which capture a structured and coherent narrative of the entire discussion thread. Our pipeline uses five closed‑source LLM configurations for distinct purposes: label classification, inline code block and external link summarization, comment analysis, label‑specific field classification and trajectory synthesis. By generating concise and reliable trajectories from complex conversation threads, this system can assist developers and researchers of broader software engineering community to understand the experience‑driven collaborative approach for issue diagnosis. Furthermore, the generated trajectories can be used to train modern LLM agents to think and act like an expert developer. We evaluated our system on 800 real‑world GitHub issues drawn from the SWE‑Bench‑Pro, SWE‑Bench‑Multilingual and SWE‑Bench‑Verified dataset, achieving a 91.7% success rate in extracting 734 high‑fidelity reasoning trajectories.

PaperID: 3562, https://arxiv.org/pdf/2604.25849.pdf

Authors: Zhou Hanlin, Chan Huah Yong

Title: ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLMAgents

Abstract:
Long‑horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge‑state orchestration architecture for long‑horizon knowledge synthesis rather than as a generic multi‑agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual‑evaluator governance, adaptive task‑mode switching, reputation‑shaped resource allocation, checkpoint‑resumable persistence, segment‑level memory condensation, artifact‑first assembly, and final‑validity checking with safe fallback. Evidence is drawn entirely from existing materials: a four‑scenario showcase package, a fixed 60‑run mechanism matrix, targeted micro‑ablation and artifact‑chain supplements, and a repaired protocol‑level benchmark in which code‑oriented evaluation is the clearest quality‑sensitive mechanism block. Across the fixed matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption‑sensitive resume condition. By contrast, dual evaluation, segment synthesis, and dynamic governance are best interpreted as supporting control mechanisms that shape trajectory discipline, explicit artifact progression, and cost‑quality behavior rather than as universal binary prerequisites for completion. The contribution is therefore a knowledge‑state orchestration architecture in which explicit epistemic state transition, evidence‑bearing artifact progression, and recoverable continuity are the primary design commitments.

PaperID: 3563, https://arxiv.org/pdf/2604.25848.pdf

Authors: An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

Title: Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

Abstract:
We study city‑scale control of electric‑vehicle (EV) ride‑hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex‑grid semi‑Markov decision process (semi‑MDP) with mixed actions ‑‑ discrete actions for serving, repositioning, and charging, together with continuous charging power ‑‑ and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high‑level intentions produced by a masked, temperature‑annealed actor. These intentions are projected at every decision step through a time‑limited rolling mixed‑integer linear program (MILP) that strictly enforces state‑of‑charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor‑‑Critic (SAC) agent against a Wasserstein‑1 ambiguity set with a graph‑aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich‑‑Rubinstein dual, a projected subgradient inner loop, and a primal‑‑dual risk‑budget update. Our architecture combines a two‑layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large‑scale EV fleet simulator built from NYC taxi data show that PD‑‑RSAC achieves the highest net profit, reaching \1.22M, compared with \0.58M‑‑\0.70M for strong heuristic, single‑agent RL, and multi‑agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder‑limit violations.

PaperID: 3564, https://arxiv.org/pdf/2604.25737.pdf

Authors: Noam Tarshish, Nofar Selouk, Daniel Hodisan, Bar Ezra Gafniel, Yuval Elovici, Asaf Shabtai, Eliya Nachmani

Title: SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Abstract:
Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction‑driven editing under executable test constraints. To address this, we propose SAFEdit, a multi‑agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility‑aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, SAFEdit uses a Failure Abstraction Layer (FAL) to transform raw test logs into structured diagnostic feedback, which is fed back to the Editor to support iterative refinement. We compare SAFEdit against both prior single‑model results reported for EditBench and an implemented ReAct single‑agent baseline under the same evaluation conditions. We used EditBench to evaluate SAFEdit on 445 code editing instances in five languages (English, Polish, Spanish, Chinese, and Russian) under varying spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single‑model baseline by 3.8 percentage points and the ReAct single‑agent baseline by 8.6 percentage points. The iterative refinement loop was found to contribute 17.4 percentage points to SAFEdit's overall success rate. SAFEdit's automated error analysis further indicates a reduction in instruction‑level hallucinations compared to single‑agent approaches, providing an additional framework component for interpreting failures beyond pass or fail outcomes.

PaperID: 3565, https://arxiv.org/pdf/2604.25727.pdf

Authors: Zhiyuan Fan, Tinghao Yu, Yuanjun Cai, Jiangtao Guan, Yun Yang, Dingxin Hu, Jiang Zhou, Xing Wu, Zhuo Han, Feng Zhang, Lilin Wang

Title: Toward Scalable Terminal Task Synthesis via Skill Graphs

Abstract:
Terminal agents have demonstrated strong potential for autonomous command‑line execution, yet their training remains constrained by the scarcity of high‑quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large‑scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built on a scenario‑mediated skill graph. SkillSynth first constructs a large‑scale skill graph, where scenarios serve as intermediate transition nodes that connect diverse command‑line skills. It then samples paths from this graph as abstractions of real‑world workflows, and uses a multi‑agent harness to instantiate them into executable task instances. By grounding task synthesis in graph‑sampled workflow paths, SkillSynth explicitly controls the diversity of minimal execution trajectories required to solve the synthesized tasks. Experiments on Terminal‑Bench demonstrate the effectiveness of SkillSynth. Moreover, task instances synthesized by SkillSynth have been adopted to train Hy3 Preview, contributing to its enhanced agentic capabilities in terminal‑based settings.

PaperID: 3566, https://arxiv.org/pdf/2604.25724.pdf

Authors: Srikanta Prasad S, Utkarsh Arora

Title: Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Abstract:
Modern enterprise AI applications increasingly rely on compound AI systems ‑ architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost‑effectiveness and low latency. This paper presents a production deployment study of a modular, platform‑agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI‑powered code analysis). The system integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low‑latency inference across multi‑component agent workflows. We report production results demonstrating over 50% reduction in tail latency (P95), up to 3.9x throughput improvement, and 30 to 40% cost savings compared to prior static deployments. We further present a novel analysis of compound‑system‑specific challenges including multi‑model fan‑out overhead, cascading cold‑start propagation, and heterogeneous scaling dynamics that emerge uniquely when serving agentic workloads. Through detailed case studies and operational lessons, we illustrate how the architecture enables compound AI systems to scale model invocations in parallel, handle bursty multi‑agent workloads, and support rapid model iteration ‑ capabilities essential for operationalizing agentic AI at enterprise scale.

PaperID: 3567, https://arxiv.org/pdf/2604.25567.pdf

Authors: David Zahrádka, David Woller, Denisa Mužíková, Miroslav Kulich, Libor Přeučil

Title: Should I Replan? Learning to Spot the Right Time in Robust MAPF Execution

Abstract:
During the execution of Multi‑Agent Path Finding (MAPF) plans in real‑life applications, the MAPF assumption that the fleet's movement is perfectly synchronized does not apply. Since one or more of the agents may become delayed due to internal or external factors, it is often necessary to use a robust execution method to avoid collisions caused by desynchronization. Robust execution methods ‑ such as the Action Dependency Graph (ADG) ‑ synchronize the execution of risky actions, but often at the expense of increased plan execution cost, because it may require some agents to wait for the delayed agents. In such cases, the execution's cost can be reduced while still preserving safety by finding a new plan either by rescheduling (reordering the agents at crossroads) or the more general replanning capable of finding new paths. However, these operations may be costly, and the new plan may not even lead to lower execution cost than the original plan: for example, the two plans may be the exact same. Therefore, we estimate the benefit that can be achieved by single replanning in scenarios with delayed agents given an immediate state of the execution with a fully connected feed‑forward neural network. The input to the neural network is a set of newly designed ADG‑based features describing the robust execution's state and the impact of potential delays, and the output is an estimated benefit achievable by replanning. We train and test the network on a new labeled dataset containing 12,000 experiments, and we show that our proposed method is capable of reducing the impact of delays by up to 94.6% of the achievable reduction.

PaperID: 3568, https://arxiv.org/pdf/2604.25318.pdf

Authors: Lanshan He, Haozhou Pang, Qi Gan, Xin Shen, Ziwei Zhang, Yibo Liu, Gang Fang, Bo Liu, Kai Sheng, Shengfeng Zeng, Chaofan Li, Zhen Hui, Keer Zhou, Lan Zhou, Shujun Dai

Title: Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

Abstract:
Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of collaborative effort from multidisciplinary teams to produce minutes of polished content. In this work, we present Cutscene Agent, an LLM agent framework for automated end‑to‑end cutscene generation. The framework makes three contributions: (1)~a Cutscene Toolkit built on the Model Context Protocol (MCP) that establishes \emphbidirectional integration between LLM agents and the game engine ‑‑ agents not only invoke engine operations but continuously observe real‑time scene state, enabling closed‑loop generation of editable engine‑native cinematic assets; (2)~a multi‑agent system where a director agent orchestrates specialist subagents for animation, cinematography, and sound design, augmented by a visual reasoning feedback loop for perception‑driven refinement; and (3)~CutsceneBench, a hierarchical evaluation benchmark for cutscene generation. Unlike typical tool‑use benchmarks that evaluate short, isolated function calls, cutscene generation requires long‑horizon, multi‑step orchestration of dozens of interdependent tool invocations with strict ordering constraints ‑‑ a capability dimension that existing benchmarks do not cover. We evaluate a range of LLMs on CutsceneBench and analyze their performance across this challenging task.

PaperID: 3569, https://arxiv.org/pdf/2604.25264.pdf

Authors: Xueying Zeng, Youquan Xian, Sihao Liu, Xudong Mou, Yanze Li, Lei Cui, Bo Li

Title: MARD: A Multi-Agent Framework for Robust Android Malware Detection

Abstract:
With the rapid evolution of Android applications, traditional machine learning‑based detection models suffer from concept drift. Additionally, they are constrained by shallow features, lacking deep semantic understanding and interpretability of decisions. Although Large Language Models (LLMs) demonstrate remarkable semantic reasoning capabilities, directly processing massive raw code incurs prohibitive token overhead. Moreover, this approach fails to fully unleash the deep logical reasoning potential of LLMs within complex contexts. To address these limitations, we propose MARD, a multi‑agent framework for robust Android malware detection. This framework effectively bridges the gap between the semantic understanding of LLMs and traditional static analysis. It treats underlying deterministic analysis engines as on‑demand execution tools, while utilizing the LLM to orchestrate the entire decision‑making process. By designing an autonomous multi‑agent interaction mechanism based on the ReAct paradigm, MARD constructs a highly interpretable evidentiary chain for conviction. Furthermore, we radically reduce the total cost of conducting a deep analysis of a single complex APK to under 0.10. Evaluations demonstrate that, without any domain‑specific fine‑tuning, MARD achieves an F1 score of 93.46%. It not only outperforms continual learning baselines but also exhibits robustness against concept drift and strong cross‑domain generalization capabilities in evaluations spanning up to five years.

PaperID: 3570, https://arxiv.org/pdf/2604.25203.pdf

Authors: Arnon Mazza, Elad Levi

Title: BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Abstract:
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task‑specific requirements, while prompting LLMs suffers from inconsistent boundary‑case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi‑agent debate to verify label correctness, yielding a high‑fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state‑of‑the‑art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate‑based verification are critical for ensuring the diversity and label fidelity required for effective fine‑tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.

PaperID: 3571, https://arxiv.org/pdf/2604.25135.pdf

Authors: Amir Saeidi, Venkatesh Mishra, Souradeep Mukhopadhyay, Gaowen Liu, Ali Payani, Jayanth Srinivasa, Chitta Baral

Title: FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Abstract:
Large Language Models are being increasingly deployed as the decision‑making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real‑world customer‑centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision‑making. These challenges are particularly pronounced for open‑source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure‑Aware Meta‑Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool‑use agent before the decision‑making step. Experiments across open‑source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi‑turn tool‑use LLM agents that simulate real‑world conversational scenarios.

PaperID: 3572, https://arxiv.org/pdf/2604.25088.pdf

Authors: Abigail O'Neill, Alan Zhu, Mihran Miroyan, Narges Norouzi, Joseph E. Gonzalez

Title: Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Abstract:
Language Model (LM)‑based agents remain largely untested in mixed‑motive settings where agents must leverage short‑term cooperation for long‑term competitive goals (e.g., multi‑party politics). We introduce Cooperate to Compete (C2C), a multi‑agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. Players have asymmetric objectives and negotiations are non‑binding, allowing alliances to form and break as players' short‑term interests align and diverge. We run AI only games and conduct a user study pitting human players against AI opponents. We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower‑complexity deals and are significantly less reliable partners compared to LM‑based agents. We also find that humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM‑based agents. Through targeted prompting inspired by these findings, we modify agents' negotiation behavior and improve win rates from 22.2% to 32.7%. We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. Our results establish C2C as a testbed for studying and building LM‑based agents that can navigate the sophisticated coordination required for real‑world deployments. The game, code, and dataset may be found at https://negotiationgame.io/c2c.

PaperID: 3573, https://arxiv.org/pdf/2604.25076.pdf

Authors: Keenan Powell, Peihong Yu, Pratap Tokekar

Title: Zero Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings

Abstract:
Many Multi‑Agent Reinforcement Learning (MARL) agents fail to adapt properly to cooperating with agents trained with the same objectives but different seeds, algorithms, or other training differences. This is the problem of Zero‑Shot Coordination (ZSC), which focuses on training agents to cooperate well with unknown agents. ZSC has been studied for a variety of tabular cases and simple games such as Hanabi, achieving excellent results. However, existing solutions to ZSC only consider identical rewards for your trained agents and all future partners. This is not realistic for the trained agents, as they do not consider the problem of cooperating with agents that have identical sparse objectives but shape the rewards for those objectives in different manner. To address this issue, we show how to train an ensemble of methods using randomized reward shapings chosen using 4 selection algorithms. Experiments done on the Overcooked environment demonstrate consistent improvements of 62.2%‑119.2% in sparse reward over baseline ZSC algorithms when playing with agents that have identical sparse rewards but different reward shapings.

PaperID: 3574, https://arxiv.org/pdf/2604.25039.pdf

Authors: Sagnik Chatterjee, Atharva Patil, Sricharan Ramesh

Title: Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Abstract:
Large Language Models (LLMs) solve many reasoning tasks via chain‑of‑thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi‑step reasoning under tight compute and token budgets. Existing test time reasoning methods such as self consistency (sampling multiple rationales and voting), Tree‑of‑Thoughts (search over intermediate thoughts), and critique revise loops improve performance, but often at high token cost and without fine‑grained step‑level control. This project1 aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test‑time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on‑device, low‑latency, or cost‑constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.

PaperID: 3575, https://arxiv.org/pdf/2604.24955.pdf

Authors: Xinming Tu, Tianze Wang, Yingzhou, Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi

Title: BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

Abstract:
As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all ‑ they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task‑oriented, execution‑based agent benchmarks. BenchGuard cross‑verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author‑confirmed issues in ScienceAgentBench ‑ including fatal errors rendering tasks unsolvable ‑ and exactly matched 83.3% of expert‑identified issues on the BIXBench Verified‑50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI‑assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.

PaperID: 3576, https://arxiv.org/pdf/2604.24842.pdf

Authors: Yale Song, Yiwen Song, Nick Losier, Nathan Hodson, Ye Jin, Rhyard Zhu, Yan Xu, Daniel Vlasic, Carina Claassen, Jasmine Leon, Khanh G. LeViet, Zack Chomyn, Joe Timmons, Brett Slatkin, Scott Penberthy, Tomas Pfister

Title: Co-Director: Agentic Generative Video Storytelling

Abstract:
While diffusion models generate high‑fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co‑Director, a hierarchical multi‑agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi‑armed bandit globally identifies promising creative directions, while a local multimodal self‑refinement loop mitigates identity drift and ensures sequence‑level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD‑Bench, a 400‑scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co‑Director significantly outperforms state‑of‑the‑art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: https://co‑director‑agent.github.io/

PaperID: 3577, https://arxiv.org/pdf/2604.24831.pdf

Authors: Srita Padmanabhuni, Bhargavi Karuturi, Jerusha Karen Indupalli, Santhan Reddy Chilla, Vivek Yelleti

Title: FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

Abstract:
Deep Learning methods are becoming prominent in automated software bug detection; however, they lack the global understanding of the given code. Consequently, their performance tends to degrade, especially when they are applied to large interconnected code bases or complex modular programs. Recently, Large Language Models (LLMs) have proven to be effective at capturing dependencies among multiple interconnected modules in the codebase. This motivated us to propose the Flow‑Graph‑Driven Multi‑Agent Framework (FGDM), which is composed of four agents that operate in a sequential manner. The framework converts the received code to a flow graph, identifies the erroneous segments, and further generates the repaired code. All the employed agents utilize Chain‑of‑Thought (COT) and Tree‑of‑Thoughts (TOT) prompts. Additionally, we also integrated with the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrated the efficacy of the proposed framework over 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado in both C and Python programs. Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.

PaperID: 3578, https://arxiv.org/pdf/2604.24826.pdf

Authors: Qi Li, Jiu Li, Pingtao Wei, Jianjun Xu, Xueyi Wei, Jiwei Shi, Xuan Zhang, Yanhui Yang, Xiaodong Hui, Peng Xu, Lingquan Zhou

Title: A Comparative Evaluation of AI Agent Security Guardrails

Abstract:
This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.

PaperID: 3579, https://arxiv.org/pdf/2604.24808.pdf

Authors: Iizalaarab Elhaimeur, Nikos Chrisochoides

Title: ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring

Abstract:
Large language model tutors are easy to build in a notebook and hard to run in a real course. We describe ITAS (Intelligent Teaching Assistant System), a multi‑agent tutoring system that a graduate quantum computing course used for a semester at Old Dominion University. The system has three layers. The teaching layer is a Spoke‑and‑Wheel of three parallel specialist agents (Video, Code, Guidance) followed by a Synthesizer, plus a separate autograder that evaluates both the correctness and the approach of checkpoint submissions. The operational layer is four Cloud Run microservices with session state in Cloud SQL and interaction events streamed through Pub/Sub to BigQuery. The feedback layer is a narrow‑scope conversational agent that answers instructor questions over per‑lesson pseudonymized event streams, addressing what we call the Blind Instructor Problem: LLM tutors accumulate more data about students than the instructor can reach through routine channels. The architecture is a direct response to specific failures of an earlier prototype, and we describe which of those fixes carried forward and which were dropped for this iteration. We report on a pilot deployment (five students, one course, one semester) interpreted as system‑behavior evidence rather than learning‑outcome evidence: the teaching layer handled 334 chat turns without the task‑boundary hallucinations that domain consolidation would have risked, the operational layer captured 10,628 events across five modules, and the feedback layer surfaced two findings the instructor acted on mid‑semester. We do not claim the pilot generalizes. We do claim that the system as described is one workable answer to the question of what an LLM‑based ITS needs to look like end‑to‑end to run in a real course.

PaperID: 3580, https://arxiv.org/pdf/2604.24807.pdf

Authors: Iizalaarab Elhaimeur, Nikos Chrisochoides

Title: From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

Abstract:
Quantum computing instructors face a compounding problem: the concepts are counterintuitive, the mathematical formalism is dense, and qualified faculty are scarce outside a small number of well‑resourced institutions. Our prior work introduced a knowledge‑graph‑augmented tutoring prototype with two specialized LLM agents: a Teaching Agent for dynamic interaction and a Lesson Planning Agent for lesson generation. Validated on simulated runs rather than in a real course, that prototype left open whether more aggressive agent specialization would be needed to handle the full range of quantum education tasks under real student load. This paper answers the three questions that the prototype could not answer. Can agent specialization solve the reliability problem in a domain as technically demanding as quantum information science? Can the system run in a real course, not a demonstration? Does the instructor gain actionable intelligence from the deployment? We present ITAS (Intelligent Teaching Assistant System), a multi‑agent tutoring system built around four contributions: a five‑module QIS curriculum grounded in Watrous's information‑first framework, a Spoke‑and‑Wheel teaching architecture with quantum‑specialized agents, a cloud infrastructure designed for production use and regulatory compliance, and a conversational analytics layer for instructors and content developers. Piloted in a quantum computing course at Old Dominion University, the system supports all three answers: deployment evidence is consistent with specialization addressing the task‑boundary failures observed in the prototype, cloud infrastructure supports classroom‑scale concurrency at sub‑textbook cost, and the analytics agent surfaces curriculum gaps the instructor could not otherwise see.

PaperID: 3581, https://arxiv.org/pdf/2604.24710.pdf

Authors: Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez

Title: Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Abstract:
Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case‑specific, clinician‑authored rubric methodology for clinical AI evaluation and examine whether LLM‑generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real‑world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM‑based scoring agent consistently scored clinician‑preferred outputs higher than rejected ones. Seven versions of an EHR‑embedded AI agent for clinicians were evaluated across all cases. Results. Clinician‑authored rubrics discriminated effectively between high‑ and low‑quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician‑LLM ranking agreement (tau: 0.42‑0.46) matched or exceeded clinician‑clinician agreement (tau: 0.38‑0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician‑authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter‑rater agreement studies. Conclusion. Case‑specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician‑authored rubrics establish the baseline against which LLM rubrics are validated.

PaperID: 3582, https://arxiv.org/pdf/2604.24698.pdf

Authors: Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang

Title: The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Abstract:
Applications based on large language models (LLMs), such as multi‑agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emphPersona Collapse: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI‑44), moral reasoning, and self‑introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item‑level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine‑grained individual differences specified in each persona. Counter‑intuitively, the models achieving the highest per‑persona fidelity consistently produce the most stereotyped populations. We release our toolkit and data to support population‑level evaluation of LLMs.

PaperID: 3583, https://arxiv.org/pdf/2604.24655.pdf

Authors: Senthil Rajasekaran, Jean-François Raskin, Moshe Y. Vardi

Title: Verification of Correlated Equilibria in Concurrent Reachability Games

Abstract:
As part of an effort to apply the rigorous guarantees of formal verification to multi‑agent systems, the field of equilibrium analysis, also called rational verification, studies equilibria in multiplayer games to reason about system‑level properties such as safety and scalability. While most prior work focuses on deterministic settings, recent probabilistic extensions enable the use of richer equilibrium concepts. In this paper, we study one such equilibrium concept ‑‑ correlated equilibria ‑‑ and introduce a natural refinement ‑‑ subgame‑perfect correlated equilibria ‑‑ in the context of the verification problem. We characterize the computational complexity of verifying such equilibria and show a somewhat surprising separation (under standard complexity‑theoretic assumptions): despite being more general, correlated equilibria yield a strictly harder P‑complete verification problem than the subgame‑perfect correlated equilibria verification problem, which can be solved in log‑squared‑space. We further analyze the setting where inputs are given succinctly via Bayesian networks, as the study of succinct representations is an important direction to connect static complexity‑theoretic analysis to real‑world program representations, and show that this complexity gap disappears under such representations.

PaperID: 3584, https://arxiv.org/pdf/2604.24594.pdf

Authors: Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, Yiqun Liu

Title: Skill Retrieval Augmentation for Agentic AI

Abstract:
As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large‑scale skill corpus and introduce SRA‑Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end‑task execution. SRA‑Bench contains 5,400 capability‑intensive test instances and 636 manually constructed gold skills, which are mixed with web‑collected distractor skills to form a large‑scale corpus of 26,262 skills. Extensive experiments show that retrieval‑based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

PaperID: 3585, https://arxiv.org/pdf/2604.24579.pdf

Authors: Phat T. Tran-Truong, Xuan-Bach Le

Title: Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

Abstract:
Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass@k, pass^k, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success‑time distribution being estimated, test whether traces support that distribution, or quantify finite‑trace uncertainty. We present \textscTraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete‑time Markov chain (DTMC), \hat M=(\hat Q,\hat R_\oplus,\hat R_\ominus), with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace‑smoothed maximum‑likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov‑‑Smirnov (KS) goodness‑of‑fit certificate, and reports Dirichlet‑posterior credible intervals and non‑parametric bootstrap intervals. We adapt classical reliability mathematics (Kemeny‑‑Snell~\citekemenysnell, Cheung~\citecheung1980, Goel‑‑Okumoto~\citegoelokt) to agent traces. The resulting first‑passage view reconciles metrics usually reported separately: pass@k, pass^k, and the RDC are projections of one success‑time distribution. On seven controlled MAST‑style frameworks with a strict 50/50 fit/test protocol, held‑out empirical RDCs overlay their analytic counterparts with max L_\infty^\mathrmRDC = 0.053 (median 0.048). A two‑sample KS test on the first‑passage cumulative distribution function (CDF) accepts the fitted chain with p>0.05 on 7/7 frameworks (min p = 0.78), and per‑entry 95% posterior and bootstrap intervals agree to \approx\!0.01 at the median.

PaperID: 3586, https://arxiv.org/pdf/2604.24572.pdf

Authors: Niko Moeller-Grell, Shihao Shenzhang, Zhangshu Joshua Jiang, Richard JB Dobson, Vishnu V Chandrabalan

Title: FastOMOP: A Foundational Architecture for Reliable Agentic Real-World Evidence Generation on OMOP CDM data

Abstract:
The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaboration, enabled the harmonisation of electronic health records data of nearly one billion patients in 83 countries. Yet generating real‑world evidence (RWE) from these repositories remains a manual process requiring clinical, epidemiological and technical expertise. LLMs and multi‑agent systems have shown promise for clinical tasks, but RWE automation exposes a fundamental challenge: agentic systems introduce emergent behaviours, coordination failures and safety risks that existing approaches fail to govern. No infrastructure exists to ensure agentic RWE generation is flexible, safe and auditable across the lifecycle. We introduce FastOMOP, an open‑source multi‑agent architecture that addresses this gap by separating three infrastructure layers, governance, observability and orchestration, from pluggable agent‑teams. Governance is enforced at the process boundary through deterministic validation independent of agent reasoning, ensuring no compromised or hallucinating agent can bypass safety controls. Agent teams for phenotyping, study design and statistical analysis inherit these guarantees through controlled tool exposure. We validated FastOMOP using a natural‑language‑to‑SQL agent team across three OMOP CDM datasets: synthetic data from Synthea, MIMIC‑IV and a real‑world NHS dataset from Lancashire Teaching Hospitals (IDRIL). FastOMOP achieved reliability scores of 0.84‑0.94 with perfect adversarial and out‑of‑scope block rates, demonstrating process‑boundary governance delivers safety guarantees independent of model choice. These results indicate that the reliability gap in RWE deployment is architectural rather than model capability, and establish FastOMOP as a governed architecture for progressive RWE automation.

PaperID: 3587, https://arxiv.org/pdf/2604.24550.pdf

Authors: Xingyan Chen, Yuxin Su, Zishan Su, Yang Yu, Zibin Zheng

Title: Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis

Abstract:
Cloud computing platforms offer elastic scaling, managed infrastructure, and pay‑per‑use pricing, but moving existing monolithic backends to them remains a difficult software engineering task. In practice, the migration requires coordinated changes to program structure, source code, infrastructure configuration, and cloud‑specific design decisions, and these changes are still largely carried out by hand. In this paper, we present Mono2Sls, an automated pipeline that converts monolithic web backends into deployable AWS SAM applications. The pipeline combines lightweight static analysis of entry points, call graphs, and asynchronous behavior with four sequential tool‑using LLM agents: Architect, Code Developer, SAM Engineer, and Consistency Validator. These agents communicate through explicit intermediate artifacts and consult a curated SAM knowledge base. Evaluated on six benchmark applications totaling more than 10K lines of code and 76 business endpoints, Mono2Sls achieves 100% deployment success without manual fixes. It also reaches 66.1% end‑to‑end correctness and 98.7% API‑coverage F1, whereas the commercial baselines achieve 53.7‑‑61.2% and 88.4%, respectively. The migrated systems show more consistent use of AWS‑native authentication and asynchronous patterns, and an ablation study indicates that static‑analysis‑guided architecture planning contributes 23.4 percentage points to end‑to‑end correctness.

PaperID: 3588, https://arxiv.org/pdf/2604.24549.pdf

Authors: Yihong Zhou, Hongtai Zeng, Thomas Morstyn

Title: GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

Abstract:
Coordinating large populations of grid‑edge devices requires learning methods that remain fully decentralised in deployment while still respecting three‑phase AC distribution‑network physics. This paper proposes gradient‑based multi‑agent proximal learning (GradMAP) to address this challenge. GradMAP trains independent neural‑network policies for each agent without any parameter sharing, and each agent uses only its own local observation for online decision‑making without communication. During offline training, GradMAP embeds a differentiable three‑phase AC power‑flow model in a primal‑dual learning loop and uses implicit differentiation to propagate exact network‑constraint violations to update the policy parameters. To speed up training, GradMAP reuses expensive environment gradients through a proximal surrogate within a trust region defined in the more direct policy‑output (action) space, instead of the probability distribution space used in other works, such as PPO. In case studies with 1,000 agents managing batteries, heat pumps, and controllable generators on the IEEE 123‑bus feeder, GradMAP learns decentralised policies that minimise three‑phase AC load‑flow constraint violations within 15 minutes of training on a single workstation‑class NVIDIA RTX PRO 5000 Blackwell 48GB GPU. This is a 3‑‑5x training speed‑up over gradient‑based self‑supervised learning benchmarks and substantially better training efficiency than multi‑agent reinforcement‑learning benchmarks. In out‑of‑sample tests, GradMAP also delivers among the lowest operating cost and constraint violations.

PaperID: 3589, https://arxiv.org/pdf/2604.24512.pdf

Authors: Dahlia Shehata, Ming Li

Title: Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

Abstract:
As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal‑directedness in non‑linear multi‑turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder‑only autoregressive Transformers. This phenomenon, a behavioral manifestation of Information Over‑squashing, occurs when the cumulative probabilistic weight of historical context overrides mid‑task updates, causing agents to remain anchored to obsolete constraints despite explicit contradictory instructions. We propose Self‑Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that implements a discrete separation between high‑level architectural planning (Architect) and turn‑by‑turn procedural execution (Executive). We evaluate SSRP across 9K trajectories using the MultiWOZ 2.2 dataset and the Aggregate Pivot Accuracy (APA), a novel metric we validate by mapping its scores to the U‑shaped 'Lost in the Middle' curve. We present 3 experimental tiers: a shallow recency‑based retrieval pilot, a high‑entropy SOP, and a semantic hijacked 3‑hop Multi‑Fact Synthesis task. Our results empirically locate the Attention Stability Boundary, where stateless Vanilla ReAct baselines for GPT 5.4 collapse to 0.1% success while SSRP achieves a 715X Resilience Lift. We demonstrate statistically significant gains across Gemini 3.1 Pro, Claude Sonnet 4.6 and DeepSeek V3.2. Audits confirm SSRP necessity by proving attentional lapse via a recursive reflexion baseline (100% success); decoupling the latch from positional bias through equidistant stress testing (90% accuracy); and formalizing SSRP via the Information Bottleneck principle and granularity ablations. Procedural Integrity audit (98.8% adherence) reveals a Grounding Paradox where high‑stability models fail by refusing to hallucinate under retrieval‑reasoning contamination.

PaperID: 3590, https://arxiv.org/pdf/2604.24507.pdf

Authors: Anastasios Giannopoulos, Sotirios Spantideas, Panagiotis Trakadas

Title: DECOFFEE: Decentralized Reinforcement Learning for Time-critical Workload Offloading and Energy Efficiency across the Computing Continuum

Abstract:
The rapid proliferation of latency‑sensitive and battery‑constrained Internet‑of‑Things (IoT) applications has intensified the need for intelligent workload placement mechanisms across the Edge‑Cloud computing continuum. In such environments, far‑edge nodes must dynamically decide whether to execute workloads locally or offload them to neighboring nodes or the cloud, while accounting for execution delay, energy consumption, and strict timeout constraints. However, workload placement in large‑scale distributed infrastructures is a highly dynamic and non‑convex optimization problem due to stochastic arrivals, heterogeneous computing capacities, and time‑varying network conditions. This paper proposes DECOFFEE, a decentralized reinforcement learning framework for time‑critical workload offloading and energy‑efficient operation across the computing continuum. The proposed multi‑agent learning scheme jointly optimizes system delay, energy consumption, and workload drop rate through adaptive placement decisions. Each edge agent operates as an autonomous learning entity that derives an optimal policy from local system observations and predicted network conditions. The workload placement process is formulated as parallel Markov Decision Processes and solved using a Double Dueling Deep Q‑Network (DQN) architecture enhanced with Long Short‑Term Memory (LSTM) forecasting to anticipate future load conditions. Extensive simulations demonstrate that DECOFFEE and its variants consistently outperform conventional rule‑based and heuristic placement strategies, achieving significant reductions in delay, energy consumption, and workload drop rate under varying traffic and network conditions.

PaperID: 3591, https://arxiv.org/pdf/2604.24477.pdf

Authors: Pablo Mateo-Torrejón, Alfonso Sánchez-Macián

Title: GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

Abstract:
The rapid integration of Large Language Models (LLMs) into Multi‑Agent Systems (MAS) has significantly enhanced their collaborative problem‑solving capabilities, but it has also expanded their attack surfaces, exposing them to vulnerabilities such as prompt infection and compromised inter‑agent communication. While emerging graph‑based anomaly detection methods show promise in protecting these networks, the field currently lacks a standardized, reproducible environment to train these models and evaluate their efficacy. To address this gap, we introduce Gammaf (Graph‑based Anomaly Monitoring for LLM Multi‑Agent systems Framework), an open‑source benchmarking platform. Gammaf is not a novel defense mechanism itself, but rather a comprehensive evaluation architecture designed to generate synthetic multi‑agent interaction datasets and benchmark the performance of existing and future defense models. The proposed framework operates through two interdependent pipelines: a Training Data Generation stage, which simulates debates across varied network topologies to capture interactions as robust attributed graphs, and a Defense System Benchmarking stage, which actively evaluates defense models by dynamically isolating flagged adversarial nodes during live inference rounds. Through rigorous evaluation using established defense baselines (XG‑Guard and BlindGuard) across multiple knowledge tasks (such as MMLU‑Pro and GSM8K), we demonstrate Gammaf's high utility, topological scalability, and execution efficiency. Furthermore, our experimental results reveal that equipping an LLM‑MAS with effective attack remediation not only recovers system integrity but also substantially reduces overall operational costs by facilitating early consensus and cutting off the extensive token generation typical of adversarial agents.

PaperID: 3592, https://arxiv.org/pdf/2604.24443.pdf

Authors: Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang

Title: PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Abstract:
Vision‑Language Models (VLMs) have demonstrated strong performance on textbook‑style physics problems, yet they frequently fail when confronted with dynamic real‑world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio‑temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference‑time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self‑generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio‑temporal canonicalization, organizes self‑generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi‑agent baseline, with consistent gains across all four physical reasoning domains.

PaperID: 3593, https://arxiv.org/pdf/2604.24405.pdf

Authors: Yufan Zhou, Yirui Huang, Zhao Wang, Yucheng Jin

Title: How Personal Characteristics Shape User Exploration of Diverse Movie Recommendations with a LLM-Based Multi-Agent System

Abstract:
Diversity is an important evaluation criterion for recommender systems beyond accuracy, yet users differ in their willingness to engage with novel and diverse content. In this work, we investigate how a Large Language Model (LLM)‑based multi‑agent system supports users' exploration of diverse recommendations, and how individual characteristics shape user experiences. We conducted a between‑subjects user study (N = 100) comparing a single‑agent system (baseline) with a multi‑agent system for movie recommendations. We measured Perceived Accuracy, diversity, novelty, and overall rating, and examined the influence of personal characteristics, including personality traits, demographics, GenAI recommendation experience, and GenAI skepticism. Results show that the multi‑agent system significantly increases Perceived Novelty and Shannon Diversity. Conscientiousness is positively associated with Perceived Accuracy and diversity, whereas extraversion is negatively associated with Perceived Diversity. Prior experience with GenAI‑based recommendations is positively associated with Shannon Diversity, while skepticism toward GenAI is negatively associated with it. We also observe significant interaction effects between system design and user characteristics. These findings highlight the importance of personality‑aware conversational recommender systems and caution against one‑size‑fits‑all multi‑agent designs.

PaperID: 3594, https://arxiv.org/pdf/2604.24283.pdf

Authors: Monit Sharma, Hoong Chuin Lau

Title: AutoQResearch: LLM-Guided Closed-Loop Policy Search for Adaptive Variational Quantum Optimization

Abstract:
Configuring variational quantum algorithms for combinatorial optimization remains a difficult, expert‑driven process requiring coordinated choices over solver family, ansatz, objective, and optimizer. We present AutoQResearch, an LLM‑guided closed‑loop experimentation framework that casts this task as sequential policy search over a curated design space. Instead of a single static configuration, the framework searches for adaptive solver‑control policies that condition future decisions on diagnostics such as feasibility, optimality gap, and convergence stagnation. The system operates through a structured workflow: an LLM agent edits a small policy surface under a fixed evaluation harness, candidate policies are screened using cheap scout evaluations, and only the strongest candidates are promoted to full confirmation. This enables controlled autonomous exploration while guarding against proxy overfitting and unstable selection. We evaluate the framework on Maximum Independent Set (MIS) and the Capacitated Vehicle Routing Problem (CVRP). On MIS instances (16‑‑64 vertices), discovered policies substantially outperform static baselines and reveal scale‑dependent behavior: CVaR objectives are effective at small scale, while QRAO‑based qubit compression provides the most effective explored scaling path. On CVRP curricula (8‑‑12 customers) and a held‑out E‑n13‑k4 benchmark, the framework discovers adaptations involving sampling budget, penalty design, and hybrid repair protocols, yielding high‑quality solutions. Methodologically, we find that staged confirmation is essential: cheap proxy evaluations can materially misestimate policy quality and even invert candidate rankings. Overall, the paper positions AutoQResearch as a benchmarked quantum‑‑GenAI co‑design workflow for autonomous solver discovery in variational quantum optimization.

PaperID: 3595, https://arxiv.org/pdf/2604.24218.pdf

Authors: Yifan Zhang, Jianmin Ye, Jiahao Yang, Xi Wang

Title: RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation

Abstract:
As the complexity of System‑on‑Chip (SoC) designs grows, the shift‑left paradigm necessitates the rapid development of high‑fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi‑turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem‑‑where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations‑‑severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi‑agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co‑Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.

PaperID: 3596, https://arxiv.org/pdf/2604.24118.pdf

Authors: Zonghao Ying, Haozheng Wang, Jiangfan Liu, Quanchen Zou, Aishan Liu, Jian Yang, Yaodong Yang, Xianglong Liu

Title: AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

Abstract:
Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade‑off where rigorous protection leads to over‑defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one‑shot self‑correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.

PaperID: 3597, https://arxiv.org/pdf/2604.24117.pdf

Authors: Moritz Link, Jonathan Hoss, Noah Klarmann

Title: An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

Abstract:
Efficient job‑shop scheduling with transportation resources is critical for high‑performance manufacturing. With the rise of "decentralized factories", multi‑agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post‑hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job‑shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap ‑‑ the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best‑performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision‑makers to optimize reinforcement learning‑based scheduling performance.

PaperID: 3598, https://arxiv.org/pdf/2604.24116.pdf

Authors: Jeffrey Wong, Antoine Creux

Title: Closing the Loop: A Software Framework for AI to Support Business Decision Making

Abstract:
Create an idea, prototype it, evaluate if users like it, then learn. It is the circle of business. If AI can operate in all parts of the circle, it will enable rapid iteration and learning speeds for businesses. Experiment platforms that deploy experiments to evaluate return on investment for businesses are abundant, but systems that help businesses learn personalization, mechanisms, and what to ideate next, are rare. Among technologies that do exist, they cannot be well orchestrated in a single software interface that can be safely and efficiently leveraged by an AI agent. These challenges make it difficult to teach an AI agent how to learn within a robust experimentation framework, and difficult for an AI agent to operate and iterate for the business. We offer a two part solution: one half that is rooted in mathematical reductions to contain complexity, and one half that is rooted in software design to optimize for orchestration, software safety, and multiplicity. Our solution, a software framework, moves beyond the simple treatment effect computed as a difference in means. To create a better understanding of a business and its customers, we enrich causal analysis with heterogeneous effects, policy algorithms, mediation analysis, and forecasts of effects. To have an AI complete the iteration cycle faster, we further enrich the analysis with variance reduction and anytime valid inference. The enrichments are made compatible across different types of experiments, and are presented in a single software interface that is usable in an AI agent. We evaluate the approach on various objectives in experiment analysis, and show that the framework improves code correctness, reduces lines of code, and is more performant than a baseline analysis constructed by a vanilla agent.

PaperID: 3599, https://arxiv.org/pdf/2604.24110.pdf

Authors: Iizalaarab Elhaimeur, Nikos Chrisochoides

Title: Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

Abstract:
Multi‑agent LLM tutoring systems improve response quality through agent specialization, but each student query triggers several concurrent API calls whose latencies compound through a parallel‑phase maximum effect that single‑agent systems do not face. We instrument ITAS, a four‑agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment. Priority PayGo maintains flat sub‑4‑second response times across the full load range; Standard PayGo degrades substantially under classroom‑scale concurrency; and Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users. Cost analysis places both pay‑per‑token tiers well below the price of a STEM textbook per student per semester under a worst‑case usage ceiling. Provisioned Throughput, expensive under continuous provisioning, becomes cost‑competitive for institutions that can predict and concentrate their traffic toward high utilization. These results provide concrete tier‑selection guidance across deployment scales from a single seminar to a university‑wide rollout.

PaperID: 3600, https://arxiv.org/pdf/2604.24026.pdf

Authors: Qiliang Liang, Hansi Wang, Zhong Liang, Yang Liu

Title: From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Abstract:
Large language model (LLM) agents increasingly rely on reusable skills: capability packages that combine instructions, control flow, constraints, and tool calls. In current agent systems, however, skills are still represented by text‑heavy artifacts, mainly SKILL.md‑style documents whose machine‑usable evidence remains embedded largely in natural‑language descriptions. As a result, skill‑centered agent systems face a representation problem: both managing skill collections and using skills during agent execution require reasoning over invocation interfaces, execution structure, and concrete side effects, but these signals are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson's classical work on cognitive linguistic representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill‑level scheduling signals, scene‑level execution structure, and logic‑level action/resource‑use evidence: the Scheduling‑Structural‑Logical (SSL) representation. We instantiate SSL with an LLM‑based normalizer and evaluate SSL‑derived representations in two tasks, Skill Discovery and Risk Assessment. The experiment shows that SSL significantly outperforms the text‑only baselines: in Skill Discovery, MRR@50 improves from 0.649 to 0.729; in Risk Assessment, macro F1 improves from 0.409 to 0.509. These findings suggest that an explicit, source‑grounded structure can make agent skills easier to search and review, positioning SSL as a practical step toward more inspectable, reusable, and operationally actionable skill representations, rather than a finished standard or end‑to‑end skill‑management mechanism.

PaperID: 3601, https://arxiv.org/pdf/2604.23993.pdf

Authors: Minhyeong Yu, Wonduk Seo

Title: EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

Abstract:
Product mapping, the task of deciding whether two e‑commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform‑specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM‑based and multi‑agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference‑time orchestration, making large‑scale deployment costly and difficult in privacy‑sensitive enterprise settings. To address these issues, we present EPM‑RL, a reinforcement‑learning‑based framework for building an accurate and efficient on‑premise e‑commerce product mapping model. Our central idea is to distill high‑cost agentic reasoning into a trainable in‑house model. Starting from a curated set of product pairs with LLM‑generated rationales and human verification, we first perform parameter‑efficient fine‑tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent‑based reward that jointly evaluates output‑format compliance, label correctness, reasoning‑‑preference scores from specially designed judge models. Preliminary results show that EPM‑RL consistently improves over PEFT‑only training and offers a stronger quality‑‑cost trade‑off than commercial API‑based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high‑latency agentic pipeline into a scalable, inspectable, and production‑ready in‑house system.

PaperID: 3602, https://arxiv.org/pdf/2604.23975.pdf

Authors: Ryuji Hashimoto, Ryosuke Takata, Masahiro Suzuki, Yuki Tanaka, Kiyoshi Izumi

Title: Financial Market as a Self-Organized Ecosystem: Simulation via Learning with Heterogeneous Preferences

Abstract:
Agent‑based models provide a constructive approach to studying emergent dynamics in life‑like systems composed of interacting, adaptive agents. Financial markets serve as a canonical example of such systems, where collective price dynamics arise from individual decision‑making. In this modeling tradition, investor behavior has typically been captured by two distinct mechanisms ‑‑ learning and heterogeneous preferences ‑‑ which have been explored as separate paradigms in prior studies. However, the impact of their joint modeling on the resulting collective dynamics remains largely unexplored. We develop a multi‑agent reinforcement learning framework in which agents endowed with heterogeneous risk aversion, time discounting, and information access learn trading strategies interactively within an artificial market. The experiment reveals that (i) learning under heterogeneous preferences drives agents to develop functionally differentiated strategies through interaction, rather than trait‑specific rules, resulting in role specialization, and (ii) the interactions by the differentiated agents are essential for the emergence of realistic market dynamics such as fat‑tailed price fluctuations and volatility clustering. Overall, this study demonstrates that the joint design of heterogeneous preferences and learning mechanisms enables the synthesis of an artificial market in which adaptive interactions drive the self‑organization of a market ecology, providing a computational realization of the Adaptive Market Hypothesis.

PaperID: 3603, https://arxiv.org/pdf/2604.23970.pdf

Authors: Aydin Ayanzadeh, Tim Oates

Title: LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People

Abstract:
Indoor navigation remains a critical accessibility challenge for the blind and low‑vision (BLV) individuals, as existing solutions rely on costly per‑building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi‑agent module that parses the floor plan into a spatial knowledge graph through a self‑correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real‑world UMBC Math and Psychology building (floors MP‑1 and MP‑3) and on the CVC‑FP benchmark. On MP‑1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single‑call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP‑3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single‑call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.

PaperID: 3604, https://arxiv.org/pdf/2604.23947.pdf

Authors: Shiven Agarwal, Yash Shah, Ashish Raj Shekhar, Priyanuj Bordoloi, Vivek Gupta

Title: GamED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation

Abstract:
We introduce GamEDAI, a hierarchical multi‑agent framework that transforms instructor‑provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase‑based LangGraph sub‑graphs, deterministic Quality Gates, and structured Pydantic schemas, GamEDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher‑order Bloom's Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (～73,500 \rightarrow ～19,900 tokens/game) at 0.46 per game. Within this model configuration, these results suggest that phase‑bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom's‑aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

PaperID: 3605, https://arxiv.org/pdf/2604.23940.pdf

Authors: Yifan Zhang, Xiaohan Wang, Yueke Zhang, Yu Huang, Kevin Leach

Title: Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

Abstract:
Decompilation ‑‑ recovering source code from compiled binaries ‑‑ is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi‑agent framework that transforms decompiled code into re‑executable source through Multi‑level Constraint‑Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM‑generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real‑world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84‑97% re‑executability, improving baseline decompiler output by 28‑89 percentage points. In comparison with state‑of‑the‑art LLM‑based decompilation methods using the same GPT‑4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution‑based validation is critical: compile‑only approaches achieve 0% behavioral correctness despite 91‑99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of 0.03‑0.05 per binary. Our results demonstrate that constraint‑guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.

PaperID: 3606, https://arxiv.org/pdf/2604.23938.pdf

Authors: Xiaochen Zheng, Zhiwen Jiang, Melanie Guerard, Klas Hatje, Tatyana Doktorova

Title: TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Abstract:
Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert‑driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi‑agent framework designed to support TSA report drafting through a modular, section‑based, and human‑in‑the‑loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence‑grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain‑specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re‑invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.

PaperID: 3607, https://arxiv.org/pdf/2604.23924.pdf

Authors: Hung N. Do, Jessica Z. Kubicek-Sutherland, Oscar A. Negrete, S. Gnanakaran

Title: Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions

Abstract:
We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human‑human and virus‑human PPI, and the other for inducing explicit general rules governing human‑human and virus‑human PPI. The first agentic AI platform for autonomous training of predictive ML models for PPI is designed to consist of five AI agents that handle autonomous data collection, data verification, feature embedding, model design, and training and validation on three‑way protein‑disjoint cross‑fold datasets. For human‑human and human‑virus PPIs, the final three‑way protein‑disjoint ensemble achieves an accuracy of 87.3% and 86.5%, respectively. For cross‑checking and interpretability purposes, the second agentic AI platform is designed to replace ML predictions with human‑readable rules derived from protein embeddings, physicochemical autocovariance descriptors, compartment annotations, pathway‑domain overlap, and graph contexts. For human‑human PPI, it is defined by a two‑rule induction, whereas human‑virus is induced by a more complex set of weighted rules. The rules induced by the second agentic platform align with the SHAP‑identified features from the predictive ML models built by the first agentic platform. Taken together, our work demonstrates the agentic AI's ability to orchestrate from data planning to execution, and from rule induction to explanation in ML, opening the door to various applications.

PaperID: 3608, https://arxiv.org/pdf/2604.23897.pdf

Authors: Andrey Fradkin, Rohit Krishnan

Title: MarketBench: Evaluating AI Agents as Market Participants

Abstract:
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93‑task subset of SWE‑bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self‑reports diverge from a full‑information allocation. A follow‑up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full‑information benchmark. We also document the performance of a market‑based scaffolding with these LLMs. Our results point to self‑assessment as a key bottleneck for market‑style coordination of AI agents.

PaperID: 3609, https://arxiv.org/pdf/2604.23892.pdf

Authors: Mohammad Zaeed, Tanzima Z. Islam, Vladimir Indic

Title: Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

Abstract:
Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but stop short of generating actionable code changes. Consequently, performance optimization continues to be a time‑intensive and manual endeavor, typically undertaken only by experts with detailed architectural understanding. To bridge this gap, we introduce Optimas, a modular, fully automated, end‑to‑end generative AI framework built on a multi‑agent workflow. Optimas uses LLMs to map performance diagnostics from multiple reports to established, literature‑backed code transformations, while unifying insight extraction, code generation, execution, and validation within a single pipeline. Across 3,410 real‑world experiments on 10 benchmarks and two HPC mini‑applications, Optimas generates 100% correct code and improves performance in over 98.82% of those experiments, achieving average gains of 8.02%‑79.09% on NVIDIA GPUs.

PaperID: 3610, https://arxiv.org/pdf/2604.23853.pdf

Authors: Boqin Yuan, Yue Su, Renchu Song, Sen Yang, Jing Qin

Title: ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Abstract:
Skill‑distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per‑step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We use the cost‑attribution gap to ask whether the rule types inside a distilled skill transfer the same way to new tasks. ClawTrace records cost‑attributed agent traces and compiles each session into a TraceCard; CostCraft reads TraceCards and writes three kinds of skill patches: preserve, prune, and repair. We find a pattern aggregate metrics hide. On 30 held‑out SpreadsheetBench tasks across two seeds, removing prune patches roughly tripled the quality‑regression count without lowering median cost. Across the full 84‑task SkillsBench transfer, CostCraft saves no aggregate cost. All three quality regressions trace to the preserve lane, and both quality wins trace to the prune lane: prune patches act as quality guardrails while preserve patches drive regressions. We argue that reusable agent skills should be evaluated at the rule‑type level, not as monolithic instruction packages. To support this, we release ClawTrace, the TraceCard schema, and the full set of typed skills.

PaperID: 3611, https://arxiv.org/pdf/2604.23822.pdf

Authors: Koushik Sen

Title: KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

Abstract:
Large language models can generate code and call tools with remarkable fluency, yet deploying them as practical software engineering assistants still expose stubborn gaps: finite context windows, single mistakes that derail entire sessions, agents that get stuck in dead ends, AI slop, and generated changes that are difficult to review or revert. We present KISS Sorcar, a general‑purpose assistant and integrated development environment (IDE) built on top of the KISS Agent Framework, a stupidly‑simple AI agent framework of roughly 1,850 lines of code. The framework addresses these gaps using a robust system prompt and through a five‑layer agent hierarchy in which each layer adds exactly one concern: budget‑tracked ReAct execution, automatic continuation across sub‑sessions via summarization, coding, and browser tools with parallel sub‑agents, persistent multi‑turn chat with history recall, and git worktree isolation so every task runs on its own branch. To assess the power of the KISS agent framework, we implemented KISS Sorcar as a free, open‑source Visual Studio Code extension that runs locally and effectively for long‑horizon tasks, and supports browser automation, multimodal input, and Docker containers. In this research, we deliberately prioritize output quality over latency: giving a frontier model adequate time to validate its own output ‑‑ running linters, type checkers, and tests ‑‑ dramatically reduces the low‑quality code that plagues faster but less thorough agents. The entire system was built using itself in 4.5 months, providing a continuous stress test in which any agent‑introduced bug immediately impairs its own ability to work. On Terminal Bench 2.0, KISS Sorcar achieves a 62.2% overall pass rate with Claude Opus 4.6, comparing favorably to Claude Code (58%) and Cursor Composer 2 (61.7).

PaperID: 3612, https://arxiv.org/pdf/2604.23802.pdf

Authors: Weiye Dai, Liyun Shi, Zanxiang He, Yuling Ma, Mengyuan Lin, Dianxiang Sun, Liming Nie

Title: EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification

Abstract:
Multimodal artificial intelligence models for endometrial cancer (EC) risk stratification typically optimize aggregate predictive performance but provide limited mechanisms for enforcing mandatory guideline overrides, such as assigning POLE‑mutated tumors to the low‑risk group despite high‑grade morphology. We present EndoGov, a two‑tier multi‑agent expert system that factorizes the decision process as D(x) = G(P(x), R), where specialist agents P extract structured evidence and a governance agent G applies an executable rule set R. Tier 1 comprises pathology, molecular, and clinical agents that independently generate schema‑constrained reports from frozen foundation‑model features or structured records. Tier 2 queries an evidence‑level‑weighted Guideline Knowledge Graph, using deterministic hard‑path rules for high‑priority overrides and constrained soft‑path reasoning for ambiguous cases. In TCGA‑UCEC (n=541), EndoGov achieved 0.943 accuracy, 0.973 macro AUC, and a conditional logic‑violation rate (C‑LVR) of 0.93% among trigger‑exposed cases. In CPTAC‑UCEC (n=95), where reference labels are guideline‑derived, EndoGov reached 0.842 accuracy compared with < 0.31 for locked‑transfer neural baselines, supporting governance‑pathway transfer under distribution shift rather than validation against independent clinical truth. End‑to‑end safety decomposition localized residual failures primarily to upstream molecular detection rather than downstream governance. Backend‑swap experiments further showed that hard‑path compliance is invariant to the LLM backend. These findings indicate that explicit clinical‑rule governance can provide guideline‑compliant, auditable EC risk assignment while preserving competitive discrimination.

PaperID: 3613, https://arxiv.org/pdf/2604.23728.pdf

Authors: Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

Title: ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Abstract:
Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi‑agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy‑based Spatiotemporal Interaction‑Aware framework), a novel Conditional Random Field (CRF)‑based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph‑based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene‑level consistency across behavioral predictions. To further constrain inference without ground‑truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary‑Seeded Simulated Annealing (U‑SSA) algorithm, which leverages high‑confidence unary priors to rapidly converge to a high‑quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state‑of‑the‑art performance with improved interpretability over existing methods.

PaperID: 3614, https://arxiv.org/pdf/2604.23723.pdf

Authors: Hye Jin Lee, Ho Sub Lee, PooGyeon Park

Title: An Individual-Delay-Reflected Generalized Consensus Analysis for Multi-Agent Systems with Heterogeneous Time-Varying Delays

Abstract:
In multi‑agent systems, heterogeneous time delays exist for all agents because of the difference in communication environments. Therefore, the consensus analysis of a system considering a homogeneous time‑varying delay among all agents results in conservatism. In this study, an individual‑delay‑reflected generalized consensus is proposed for multi‑agent systems with heterogeneous time‑varying delays with various bounds. To reflect heterogeneous time‑varying delays, the proposed Lyapunov‑Krasovskii functional is constructed by dividing the integral term into intervals containing heterogeneous delays and considering augmented vectors with delay states and integral states. Furthermore, by adding zero equality conditions, conservatism is reduced. N‑dependent generalized integral inequality is used to allow the user to adjust the computational complexity. Numerical examples demonstrate a reduction in conservatism with the proposed consensus criterion.

PaperID: 3615, https://arxiv.org/pdf/2604.23711.pdf

Authors: Yu Cui, Ruiqing Yue, Hang Fu, Sicheng Pan, Zhuoyu Sun, Baohan Huang, Haibin Zhang, Cong Zuo, Licheng Wang

Title: Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Abstract:
With the wide adoption of personal AI assistants such as OpenClaw, privacy leakage in user interaction contexts with large language model (LLM) agents has become a critical issue. Existing privacy attacks against LLMs primarily target training data, while research on inference‑time contextual privacy risks in LLM agent memory remains limited. Moreover, prior methods often incur high attack costs, requiring multiple queries or relying on white‑box assumptions, which limits their practicality in real‑world deployments. To address these issues, we propose a training‑free privacy extraction attack targeting LLM agent memory, which we name \textscSpore. \textscSpore is compatible with both black‑box and gray‑box settings. In the black‑box setting, \textscSpore can efficiently extract a small candidate set via a single query to recover the original private information. In the gray‑box setting, \textscSpore allows the attacker to leverage multi‑ranked tokens for more accurate and faster privacy extraction. We provide an information‑theoretic analysis of \textscSpore and show that it achieves high query efficiency with substantial per query information leakage. Experiments on multiple frontier LLMs show that \textscSpore outperforms attack success rate over existing state‑of‑the‑art (SOTA) schemes. It also maintains low attack cost and remains stable across different model parameter settings. We further evaluate the robustness of \textscSpore against existing defense mechanisms. Our results show that \textscSpore consistently bypasses both detection and strong safety alignment, demonstrating resilient performance in diverse defensive settings and real‑world safety threats.

PaperID: 3616, https://arxiv.org/pdf/2604.23674.pdf

Authors: Zihao Wu, Steven Xu, Bowen Chen, Shaowen Wan, Yiwei Li, Wei Ruan, Yanjun Lyu, Siyuan Li, Dajiang Zhu, Tianming Liu, Lin Zhao

Title: Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

Abstract:
With the emergence of large language models (LLMs) and AI agent frameworks, the human‑AI co‑work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low‑resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi‑step analytical pipelines. In this paper, we introduce Vibe Medicine, a co‑work paradigm in which clinicians and researchers direct skill‑augmented AI agents through natural language to execute complex, multi‑step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain‑informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open‑source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end‑to‑end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over‑reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent‑assisted research that advances research and technological equity and reduces health care resource disparities.

PaperID: 3617, https://arxiv.org/pdf/2604.23605.pdf

Authors: Zhiqi Lv, Duofan Tu, Jun Li, Mingyue Zhao, Heqin Zhu, Wenliang Li, Shaohua Kevin Zhou

Title: Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

Abstract:
The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain‑based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory that consists of "Memory Anchoring", "Navigation" and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLM: (i) a Profile‑Then‑Plan paradigm to mitigate cold‑start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree‑of‑Thoughts (Med‑ToT) algorithm for strategic look ahead planning and resource aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel‑Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real world benchmarks, MIMIC‑IV‑Ext Cardiac Disease and MIMIC‑IV‑Ext CDM, DxChain achieves state‑of‑the‑art performances in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next‑generation clinical AI. The code is at https://anonymous.4open.science/r/Dx‑Chain.

PaperID: 3618, https://arxiv.org/pdf/2604.23580.pdf

Authors: Tianyidan Xie, Peiyu Wang, Yuyi Qian, Yuxuan Wang, Rui Ma, Ying Tai, Song Wu, Qian Wang, Lanjun Wang, Zili Yi

Title: PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

Abstract:
Physics‑aware symbolic simulation of 3D scenes is critical for robotics, embodied AI, and scientific computing, requiring models to understand natural language descriptions of physical phenomena and translate them into executable simulation environments. While large language models (LLMs) excel at general code generation, they struggle with the semantic gap between physical descriptions and simulation implementation. We introduce PhysCodeBench, the first comprehensive benchmark for evaluating physics‑aware symbolic simulation, comprising 700 manually‑crafted diverse samples across mechanics, fluid dynamics, and soft‑body physics with expert annotations. Our evaluation framework measures both code executability and physical accuracy through automated and visual assessment. Building on this, we propose a Self‑Corrective Multi‑Agent Refinement Framework (SMRF) with three specialized agents (simulation generator, error corrector, and simulation refiner) that collaborate iteratively with domain‑specific validation to produce physically accurate simulations. SMRF achieves 67.7 points overall performance compared to 36.3 points for the best baseline among evaluated SOTA models, representing a 31.4‑point improvement. Our analysis demonstrates that error correction is critical for accurate physics‑aware symbolic simulation and that specialized multi‑agent approaches significantly outperform single‑agent methods across the tested physical domains.

PaperID: 3619, https://arxiv.org/pdf/2604.23579.pdf

Authors: Tianyidan Xie, Zhentao Huang, Mingjie Wang, Xin Huang, Jun Zhou, Minglun Gong, Zili Yi

Title: CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

Abstract:
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences ‑‑ a challenge that existing end‑to‑end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi‑agent orchestration. Our framework employs three key innovations: (1) a multi‑agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross‑modal specifications; (2) a decoupled character‑centric pipeline that maintains identity consistency through instance‑level tracking and integration while enabling flexible multi‑character composition; and (3) a hierarchical audio‑visual synchronization mechanism ensuring frame‑level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves 40% improvement in overall consistency, 4.4% gain in subject consistency, 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi‑scene video generation that preserves narrative coherence and character authenticity.

PaperID: 3620, https://arxiv.org/pdf/2604.23557.pdf

Authors: Zhuohui Zhang, Bin Cheng, Bin He

Title: DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

Abstract:
Building scalable and reusable multi‑agent decision policies from offline datasets remains a challenge in offline multi‑agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi‑agent decision making as a dialogue‑style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine‑tuning phase, which leverages dialogue‑style datasets for centralized training with inter‑agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out‑of‑distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM‑based conversational decision‑making methods, while demonstrating strong zero‑shot generalization to unseen scenarios across tasks.

PaperID: 3621, https://arxiv.org/pdf/2604.23511.pdf

Authors: Qi Liu, Xiaohui Chen, Zhihui Zhao, Yaowen Zheng, Dan Yu, Zehua Zhang, Limin Sun, Yongle Chen

Title: Breaking the Secret: Economic Interventions for Combating Collusion in Embodied Multi-Agent Systems

Abstract:
Collusion among autonomous agents poses a critical security threat in embodied multi‑agent systems (MAS), where coordinated behaviors can deviate from global objectives and lead to real‑world consequences. Existing defenses, primarily based on identity control or post‑hoc behavior analysis, are insufficient to address such threats in embodied settings due to delayed feedback and noisy observations in physical environments, which make behavioral deviations difficult to detect accurately and in a timely manner. To address this challenge, we propose a mutagenic incentive intervention approach that mitigates collusion by reshaping agents' payoff structures. By rewarding agents who report collusive behavior and penalizing identified participants, the mechanism induces strategic defection and renders collusion unstable. We further design supporting mechanisms, including reporting deposits, smart contract‑based reward enforcement, and encrypted communication, to ensure robustness against misuse of the incentive mechanism and retaliation from penalized agents. We implement the proposed approach in both simulated and real‑world embodied environments. Experimental results show that our method effectively suppresses collusion by inducing defection, while preserving system efficiency. It achieves performance comparable to the non‑collusion baseline and outperforms representative reactive defenses, thereby fulfilling the desired security objectives. These results demonstrate the effectiveness of proactive incentive design as a practical paradigm for securing embodied multi‑agent systems.

PaperID: 3622, https://arxiv.org/pdf/2604.23459.pdf

Authors: Ben Hagag, William L. Anderson, Christian Schroeder de Witt, Sarah Scheffler

Title: Architecture Matters for Multi-Agent Security

Abstract:
Multi‑agent systems (MAS), composed of networks of two or more autonomous AI agents, have become increasingly popular in production deployments, yet introduce security risks that do not arise in single‑agent settings. Even if individual agents exhibit robust security, architectural decisions governing their coordination can create attack surfaces that have not been systematically characterized. In this work, we present an empirical study of how MAS design decisions shape the tradeoff between task performance and attack resistance. Across three agentic environments (browser, desktop, and code) and 13 architectural configurations, we use stagewise evaluations that distinguish planning refusal, execution‑stage interception, partial harmful execution, and successful attack completion to study three key design choices: (i) agent roles, which determine how authority and responsibility are allocated; (ii) communication topology, which shapes how and when agents interact; and (iii) memory, which determines the context and state visibility accessible to each agent. We find that multi‑agent architectures are more vulnerable than standalone agents in the majority of configurations, with attack success rates varying by up to 3.8x at comparable or higher benign accuracy, and that no single design is universally safer. These results motivate the development of further evaluations that move beyond the security properties of a single agent.

PaperID: 3623, https://arxiv.org/pdf/2604.23455.pdf

Authors: Haoming Meng

Title: CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

Abstract:
Automated failure diagnosis requires correlating browser‑visible symptoms with backend observability signals, yet existing benchmarks do not evaluate this cross‑modal reasoning task. Constructing one is non‑trivial: multi‑modal failure scenarios are costly to annotate, and live‑environment capture introduces stochasticity that makes cross‑run agent comparison unreliable. We present CUJBench, to our knowledge, the first benchmark to combine browser‑visible failure evidence with backend observability in a diagnostic framing. CUJBench addresses annotation cost through an LLM‑assisted generation pipeline with a multi‑agent review loop and a three‑layer annotation scheme, producing 87 labeled scenarios across five fault families, and ensures reproducibility by packaging each failure as a deterministic multi‑modal snapshot with a fixed tool interface. Evaluating six frontier models under retrieval, browser‑only, and full‑toolset baselines, the benchmark yields an overall accuracy of 19.7% with a ceiling of 52%, well below saturation. Contrary to expectation, browser‑only agents outperform full‑toolset agents in aggregate, with expanded evidence access inducing unfocused exploration rather than improved synthesis. Trajectory analysis identifies cross‑modal synthesis as the primary bottleneck: agents retrieve the decisive evidence but fail to attribute it correctly ‑ a structural limitation uniform across all six models that model scale and richer tool access alone cannot resolve.

PaperID: 3624, https://arxiv.org/pdf/2604.23425.pdf

Authors: Richard Joseph Mitchell

Title: When the Agent Is the Adversary: Architectural Requirements for Agentic AI Containment After the April 2026 Frontier Model Escape

Abstract:
The April 2026 disclosure that a frontier large language model escaped its security sandbox, executed unauthorized actions, and concealed its modifications to version control history demonstrates that agentic AI systems with autonomous tool access can circumvent the containment mechanisms designed to constrain them. This paper analyzes four categories of current containment approaches ‑ alignment training, environmental sandboxing, application‑level tool‑call interception, and accessible audit systems ‑ and identifies the failure modes each exhibits when the AI agent is treated as a potential adversary rather than a trusted component receiving adversarial inputs. We categorize five behavioral incidents from the public disclosure and situate them within 698 real‑world AI scheming incidents documented by the Centre for Long‑Term Resilience between October 2025 and March 2026, a 4.9x acceleration establishing the challenge as systemic. We derive five architectural requirements: trust separation through layered OS privilege enforcement with semantic intent analysis, sequential intent inference through five‑phase taxonomic monitoring, independent containment integrity monitoring, adversarial audit isolation through logical invisibility, and emergent capability envelope enforcement through distributional divergence monitoring. No publicly described system satisfies all five. We argue that architectural containment is the only durable safety strategy given the inevitable proliferation of equivalent capabilities including open‑weight models. The author's published patent portfolio in provider‑independent constraint enforcement addresses several of these requirements. Concurrent work including SandboxEscapeBench (arXiv:2603.02277) independently confirms that frontier models can escape standard container sandboxes, corroborating the threat model presented here.

PaperID: 3625, https://arxiv.org/pdf/2604.23374.pdf

Authors: Yuandao Cai, Wensheng Tang, Cheng Wen, Shengchao Qin

Title: Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Abstract:
Autonomous Large Language Model (LLM) agents are increasingly deployed to conduct complex tasks by interacting with external tools, APIs, and memory stores. However, processing untrusted external data exposes these agents to severe security threats, such as indirect prompt injection and unauthorized tool execution. Securing these systems requires effective information flow tracking. Yet, traditional taint analysis that is designed for program memory states fundamentally fails when applied to LLMs, where data propagation is governed by probabilistic natural language reasoning. In this paper, we present NeuroTaint, the first comprehensive taint tracking framework tailored for the unique information flow characteristics of LLM agents. Our key insight is that taint propagation in LLM agents must be understood not only as explicit content transfer, but also as semantic transformation, causal influence on decisions, and cross‑session persistence through memory. NeuroTaint therefore audits execution traces offline to reconstruct provenance from untrusted sources to privileged sinks using semantic evidence, causal reasoning, and persistent context tracking, rather than relying on exact string matches or pre‑defined source‑sink paths alone. Extensive evaluation using TaintBench, our 400‑scenario benchmark spanning 20 real‑world agent frameworks, shows that NeuroTaint substantially outperforms FIDES, an information‑flow‑control (IFC)‑style baseline for LLM agents, in source‑sink propagation detection. We further show that NeuroTaint remains effective on established agent‑security benchmarks, including InjecAgent and ToolEmu, while operating offline with modest additional auditing cost.

PaperID: 3626, https://arxiv.org/pdf/2604.23366.pdf

Authors: Federico A. Kamelhar

Title: GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

Abstract:
Autonomous multi‑agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model‑internal inference. Existing groundedness evaluators (binary classifiers, LLM‑as‑judge scalars, self‑correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding‑evaluation and replanning framework that (i) partitions claims into a four‑way typology (grounded, ungrounded, contradicted, complementary), giving first‑class standing to non‑redundant alternative perspectives; (ii) assigns evidence‑type‑specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction‑penalised weighted groundedness score; and (iv) couples that score to a three‑tier decision function (proceed, regenerate, replan) driving a bounded‑iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently‑trained LLM judges (gpt‑5.4, claude‑sonnet‑4‑6, claude‑opus‑4‑7, gemini‑2.5‑pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the rho=0 effect exclude 0 on all four; the no‑complementary ablation under Opus 4.7 has CI [‑96,‑68] of 200; at n=1000 three independent judges converge to DeltaS(rho=0)=+0.058. A head‑to‑head against Vectara HHEM‑2.1‑Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence‑typed scoring with tiered recovery under an explicit compute budget.

PaperID: 3627, https://arxiv.org/pdf/2604.23338.pdf

Authors: Kexin Chu

Title: A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

Abstract:
Agentic AI systems introduce a security surface that is qualitatively different from that of stateless LLMs. They persist memory, invoke external tools, coordinate with peer agents, and operate across sessions, allowing attacks to emerge not only at the prompt interface but also through architectural state, delegated authority, and long‑horizon interactions. Existing security taxonomies, however, primarily organize threats by attack type, such as prompt injection or jailbreaking, and therefore obscure where in the agentic stack a threat arises and over what timescale it manifests. We propose the Layered Attack Surface Model (\lasm), a structural taxonomy for agentic AI security. \lasm decomposes the agentic stack into seven layers ‑‑ Foundation, Cognitive, Memory, Tool Execution, Multi‑Agent Coordination, Ecosystem, and Governance ‑‑ and augments them with a four‑class temporality axis covering instantaneous, session‑persistent, cross‑session cumulative, and sub‑session‑stack threats. We use this 7×4 framework to analyze 116 papers from 2021‑‑2026. The resulting map reveals that the upper layers of the agentic stack remain sharply under‑explored, especially for long‑horizon and stack‑propagating threats; multiple documented attack regions have no corresponding defenses; and current benchmarks provide no coverage for cross‑session or sub‑session‑stack failure modes. We further derive a cross‑layer defense taxonomy, defense recipes for canonical attack classes, and a dependency DAG that separates near‑term engineering gaps from fundamental research challenges. We release the per‑paper coding, robustness scripts, and a reference Agent Bill of Materials schema to support reproducible analysis.

PaperID: 3628, https://arxiv.org/pdf/2604.23308.pdf

Authors: Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong

Title: CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

Abstract:
Offline multi‑agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off‑policy data converge to suboptimal joint behaviours because they cannot co‑adapt as their policies change. We introduce CODA (Coordination via On‑Policy Diffusion for Multi‑Agent Reinforcement Learning), a diffusion‑based multi‑agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co‑adaptation. We find that previous diffusion‑based augmentation approaches are insufficient for fostering multi‑agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on‑policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm‑agnostic and can be layered onto both model‑free and model‑based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous‑control benchmarks.

PaperID: 3629, https://arxiv.org/pdf/2604.23299.pdf

Authors: Can Liu, Sizhe Cheng, Feng Liang, Zhibang Jiang, Lingru Huang, Kavinda Athapaththu, Yong Wang

Title: Proteus: Shapeshifting Desktop Visualizations for Mobile via Multi-level Intelligent Adaptation

Abstract:
With the rise of mobile‑first consumption, users increasingly engage with data visualizations on mobile devices. However, the vast majority of existing visualizations are originally authored for desktop environments. Due to significant differences in viewport size and interaction paradigms, directly scaling desktop charts often results in illegible text, information loss, and interaction failures. To bridge this gap, we propose an automated framework to adapt desktop‑based visualizations for mobile screens. By systematically categorizing the operations involved in the adaptation process, we establish a multi‑level design space. This space defines evolution rules spanning from the global topology level, through the reference frame level, down to the visual elements level. Guided by this theoretical framework, we developed Proteus, a large language model‑driven multi‑agent system that automatically parses online visualizations, predicts optimal transformation strategies within the design space, and generates equivalent, highly readable visualizations for mobile devices. Case studies and an in‑depth user study with 12 participants demonstrate the effectiveness and usability of Proteus.

PaperID: 3630, https://arxiv.org/pdf/2604.23285.pdf

Authors: Christos Tranoris, Besiana Agko, Kostis Trantzas, Irene Denazi

Title: Towards Agentic Test-Driven Quality Assurance for 6G Networks

Abstract:
This work proposes an agentic, intent‑driven end‑to‑end (E2E) orchestration framework that integrates intent co‑creation with a Test‑Driven Quality Assurance paradigm. In this framework, autonomous agents iteratively refine a user's initial intent into a confirmed, auditable specification. Furthermore, the system automatically derives validation tests from these intents before provisioning, directly mirroring the Test‑Driven Development workflow in software engineering to ensure proactive Service Level Agreement (SLA) compliance. The architecture is grounded in a standards‑aligned knowledge representation using TM Forum (TMF) information models and catalogs. This enables deterministic graph traversal from high‑level Product Offerings down to granular Service/Resource and Test specifications. We prototyped this architecture by extending OpenSlice with a message‑driven, multi‑agent pattern and integrating MCP‑enabled (Model Context Protocol) tool access for real‑time knowledge retrieval. Currently, our evaluation of the agents targets the intent co‑creation phase as a baseline toward full‑scale orchestration. Building on experiments with multiple open‑source Large Language Model (LLM) backends integrated with the TMF‑based knowledge base, we observe substantial variability in tool‑use reliability and hallucination patterns, underscoring the critical importance of robust knowledge integration in agentic 6G systems.

PaperID: 3631, https://arxiv.org/pdf/2604.23283.pdf

Authors: Zhiyuan Zhai, Ming Li, Xin Wang

Title: Revisable by Design: A Theory of Streaming LLM Agent Execution

Abstract:
Current LLM agents operate under an implicit but universal assumption: execution is a transaction ‑‑ the user submits a request, the agent works in isolation, and only upon completion does the dialogue resume. This forces users into a binary choice: wait for a potentially incorrect output, or interrupt and lose all progress. We reject this assumption and propose the stream paradigm, in which agent execution and user intervention are concurrent, interleaved processes sharing a bidirectional channel. We formalize this paradigm through a reversibility taxonomy that classifies every agent action as Idempotent, Reversible, Compensable, or Irreversible, and arrive at a core conclusion: an agent's flexibility is bounded by its reversibility. We prove that conflicting compensable actions impose unavoidable adaptation costs and that conflicting irreversible actions make full specification satisfaction impossible ‑‑ these costs are properties of the action space, not of the algorithm. Guided by this insight, we present the Revision Absorber, a reactive algorithm based on the Earliest‑Conflict Rollback rule that is structurally optimal under mild assumptions. Experiments on StreamBench with real LLM agents validate all predictions: the Absorber matches the quality of a brute‑force full‑restart baseline while wasting an order of magnitude fewer steps of already‑completed work, turning mid‑execution revisions from a dead‑end into a first‑class interaction.

PaperID: 3632, https://arxiv.org/pdf/2604.23282.pdf

Authors: Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang

Title: Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

Abstract:
Text‑based person anomaly search retrieves specific behavioral events from surveillance archives using natural‑language queries. Although recent pose‑aware methods align geometric structures well, they face a fundamental Pose‑Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large‑scale retrieval is computationally prohibitive. We propose the Structure‑Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure‑Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity ; and (2) Detective Squad Interaction, a multi‑agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re‑rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state‑of‑the‑art performance by balancing efficiency and semantic reasoning.

PaperID: 3633, https://arxiv.org/pdf/2604.23280.pdf

Authors: Takumi Otsuka, Kentaroh Toyoda, Alex Leung

Title: AI Identity: Standards, Gaps, and Research Directions for AI Agents

Abstract:
AI agents are now running real transactions, workflows, and sub‑agent chains across organizational boundaries without continuous human supervision. This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing? We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment. Through a structured survey of industry trends, emerging standards, and technical literature, we conduct a gap analysis across the full agent identity lifecycle and make three contributions: (1) a structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) showing that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures; (2) an evaluation of current technical and regulatory documents against the identity requirements of autonomous agents, finding that none adequately address the challenge of governing nondeterministic, boundary‑crossing entities; and (3) identification of five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves. These gaps are structural; more engineering effort alone will not close them. Foundational research on AI identity is the central conclusion of this report.

PaperID: 3634, https://arxiv.org/pdf/2604.23248.pdf

Authors: Tran Thanh Lam Nguyen, Edoardo Di Tullio, Barbara Carminati, Elena Ferrari

Title: PrivacyAssist: A User-Centric Agent Framework for Detecting Privacy Inconsistencies in Android Apps

Abstract:
Mobile apps offer significant benefits, but their privacy protections often remain ineffective and confusing for users. While prior work mainly analyzes app privacy vulnerabilities, few approaches help users understand, set, and enforce their privacy preferences. This paper presents PrivacyAssist, a multi‑agent LLM‑based platform that detects inconsistencies between user‑granted permissions and developers' declared sensitive data collection and sharing practices. Using Retrieval‑Augmented Generation (RAG), PrivacyAssist provides concise explanations and real‑time on‑device warnings to support informed installation decisions. We evaluate PrivacyAssist with 200 users and 2,347 Android apps, finding that only 16% of apps are fully consistent between granted permissions and declared data practices.

PaperID: 3635, https://arxiv.org/pdf/2604.23179.pdf

Authors: Kanghoon Lee, Matthew M. Sato, Jinnyeong Yang, Seungro Lee, Sujin Lee, Jiachen Li, Kuk-Jin Yoon, Jinkyoo Park, Kincho H. Law, Yoonjin Yoon

Title: Cooperative Informative Sensing for Monitoring Dynamic Indoor Environments via Multi-Agent Reinforcement Learning

Abstract:
Monitoring human activity in indoor environments is important for applications such as facility management, safety assessment, and space utilization analysis. While mobile robot teams offer the potential to actively improve observation quality, existing multi‑robot monitoring and active perception approaches typically rely on coverage or visitation based objectives that are weakly aligned with the accuracy requirements of human‑centric monitoring tasks. In this work, we formulate cooperative active observation as a decentralized control problem in which multiple robots adjust their motion to directly optimize monitoring accuracy under partial observability. We propose a learning‑based framework for cooperative policies from decentralized observations using multi‑agent reinforcement learning (MARL), supported by an architecture that handles variable numbers of humans and temporal dependencies. Simulation results across diverse indoor environments and monitoring tasks show that the proposed approach consistently outperforms classical coverage, persistent monitoring, and learning‑free multi‑robot baselines, while remaining robust to changes in the number of observed humans.

PaperID: 3636, https://arxiv.org/pdf/2604.23145.pdf

Authors: Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan

Title: UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

Abstract:
Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi‑step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi‑hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4‑mini, Gemini 2.5 Pro) and two LMMs (GPT‑4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.

PaperID: 3637, https://arxiv.org/pdf/2604.23141.pdf

Authors: Tianlong Yu, Yang Yang, Xiao Luo, Lihong Liu, Fudu Xing, Zui Tao, Kailong Wang, Gaoyang Liu, Ting Bi

Title: UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks

Abstract:
Emerging AR‑LLM‑based Social Engineering attack (e.g., SEAR) is at the edge of posing great threats to real‑world social life. In such AR‑LLM‑SE attack, the attacker can leverage AR (Augmented Reality) glass to capture the image and vocal information of the target, using the LLM to identify the target and generate the social profile, using the LLM agents to apply social engineering strategies for conversation suggestion to win the target trust and perform phishing afterwards. Current defensive approaches, such as role‑based access control or data flow tracking, are not directly applicable to the convergent AR‑LLM ecosystem (considering embedded AR device and opaque LLM inference), leaving an emerging and potent social engineering threat that existing privacy paradigms are ill‑equipped to address. This necessitates a shift beyond solely human‑centric measures like legislation and user education toward enforceable vendor policies and platform‑level restrictions. Realizing this vision, however, faces significant technical challenges: securing resource‑constrained AR‑embedded devices, implementing fine‑grained access control within opaque LLM inferences, and governing adaptive interactive agents. To address these challenges, we present UNSEEN, a coordinated cross‑stack defense that combines an AR ACL (Access Control Layer) for identity‑gated sensing, F‑RMU‑based LLM unlearning for sensitive profile suppression, and runtime agent guardrails for adaptive interaction control. We evaluate UNSEEN in an IRB‑approved user study with 60 participants and a dataset of 360 annotated conversations across realistic social scenarios.

PaperID: 3638, https://arxiv.org/pdf/2604.23124.pdf

Authors: Haowei Cheng, Milhan Kim, Chong Liu, Teeradaj Racharak, Truong Vinh Truong Duy, Phan Thi Huyen Thanh, Jialong Li, Naoyasu Ubayashi, Hironori Washizaki

Title: ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation

Abstract:
As software systems grow in complexity, they must satisfy an increasing number of competing quality attributes, making it essential to balance them in a principled manner ‑‑ for example, a safety requirement for sensor‑fusion verification may conflict with a tight planning‑cycle budget. Multi‑agent large language model frameworks support this balancing process by assigning specialized agents to different objectives. However, their conflict resolution is typically heuristic. Requirements are aggregated implicitly without explicit acceptance or rejection, limiting auditability in regulated domains. We present ArgRE, a multi‑agent requirements negotiation system that embeds Dung‑style abstract argumentation into the negotiation stage. Each proposal, critique, and refinement is modeled as an argument, conflicts are represented as directed attack relations, and the accepted set of arguments is computed under grounded and preferred semantics. The pipeline further integrates KAOS goal modeling, multi‑layer verification, and standards‑oriented artifact generation. Evaluation across five case studies spanning safety‑critical, financial, and information‑system domains shows that ArgRE provides argument‑level traceability absent from existing frameworks. Independent evaluators rated its decision justifications significantly higher than those of heuristic synthesis (4.32 vs. 3.07, p < 0.001), indicating improved auditability, while semantic intent preservation remains comparable (94.9% BERTScore F1) and compliance coverage reaches 84.7% versus 47.6%‑‑47.8% for baselines. Structural analysis further confirms that the default pairwise protocol yields acyclic graphs in which grounded and preferred semantics coincide, whereas cross‑pair arbitration introduces controlled cyclicity, leading to predictable divergence between the two semantics.

PaperID: 3639, https://arxiv.org/pdf/2604.23106.pdf

Authors: Siddeshwar Raghavan, Tanwi Mallick

Title: No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows

Abstract:
Existing multi‑agent Large Language Model (LLM) frameworks for code generation typically use execution feedback and improve iteratively using Input/Output (I/O) test cases. However, this does not work for scientific workflows, where I/O test cases do not exist, and generating them requires solving the very problem at hand. To address this, we introduce MOSAIC, a training‑free multi‑agent framework for scientific code generation without I/O supervision. Instead of execution feedback, MOSAIC employs a student‑teacher knowledge distillation framework that grounds generation through domain‑specific examples and structured problem decomposition. To further mitigate hallucinations across chained subproblems, we introduce a Consolidated Context Window (CCW) for maintaining consistent reasoning across agents. Experiments on the SciCode benchmark show that MOSAIC improves accuracy, executability, and numerical precision over existing approaches while relying on lightweight models.

PaperID: 3640, https://arxiv.org/pdf/2604.23090.pdf

Authors: Abid Talukder, Maruf Ahmed Mridul, Oshani Seneviratne

Title: Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach

Abstract:
Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain‑specific insurance contracts to investigate these questions. We first establish a single‑agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi‑agent architecture that decomposes ontology construction into four artifact‑driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance across architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency question driven SPARQL evaluation with complementary retrieval augmented generation based assessment). Results show that the multi‑agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front‑loaded planning. These findings highlight planning‑first, artifact‑driven generation as a promising and more auditable path toward scalable automated ontology engineering.

PaperID: 3641, https://arxiv.org/pdf/2604.23072.pdf

Authors: Junyan Cheng, Kyle Richardson, Peter Chin

Title: Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis

Abstract:
Large language model (LLM) agents are increasingly tasked with complex real‑world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide‑and‑conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool‑equipped LLM grounder agents are employed, including a novel Jupyter Notebook agent for data‑driven analysis, that help to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency, scalability, and enable interactive "what‑if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves 15.84% accuracy on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost‑effectiveness that achieves a close 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise‑resilient and stable performance growth as the analysis depth increases, with a near‑linear time complexity, as well as good adaptivity to open‑weight LLMs and scientific domains.

PaperID: 3642, https://arxiv.org/pdf/2604.23069.pdf

Authors: Yating Wu, Yuhao Zhang, Sayan Ghosh, Sourya Basu, Anoop Deoras, Jun Huan, Gaurav Gupta

Title: ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents

Abstract:
Large language model (LLM) agents often struggle in long‑context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval‑based memory systems surface relevant content but still overlook the causal and logical structure needed for multi‑step reasoning. We introduce ContextWeaver, a selective and dependency‑structured memory framework that organizes an agent's interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency‑based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root‑to‑step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE‑Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding‑window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.

PaperID: 3643, https://arxiv.org/pdf/2604.23057.pdf

Authors: Yuqi Sun, Tianqin Meng, George Liu, Yashraj Panwar, Lakshya Chaudhry, Munasib Ilham, Aman Chadha

Title: Don't Make the LLM Read the Graph: Make the Graph Think

Abstract:
We investigate whether explicit belief graphs improve LLM performance in cooperative multi‑agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd‑order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd‑order ToM, p<0.001). Second, we identify "Planner Defiance," a model‑family‑specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near‑zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full‑game evidence confirms inter‑agent conventions (+128% over baseline, p=0.003) outperform all single‑agent interventions, and individual belief‑graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost‑benefit ratio, while deeper ToM graphs appear harmful at larger player counts (‑1.5 pts at 5‑player, p=0.029).

PaperID: 3644, https://arxiv.org/pdf/2604.23049.pdf

Authors: Edward Cheng, Jeshua Cheng

Title: A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows

Abstract:
AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuring transparency, accountability, and trustworthiness in such systems. However, existing implementations of Human‑in‑the‑Loop (HITL) mechanisms are typically embedded within application logic, limiting reuse, consistency, and scalability across multi‑agent environments. This paper presents a decoupled HITL system architecture that treats human oversight as an independent system component within the agent operating environment. The proposed design separates human interaction management from application workflows through explicit interfaces and a structured execution model. In addition, a design framework is introduced to formalize HITL integration along four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel. This framework enables selective and context‑aware human involvement while maintaining system‑level consistency. The approach supports alignment with emerging agent communication protocols, allowing HITL to be implemented as a protocol‑level concern. By externalizing HITL and structuring its integration, the system provides a foundation for scalable governance and progressive autonomy in agentic workflows.

PaperID: 3645, https://arxiv.org/pdf/2604.22971.pdf

Authors: Juergen Dietrich

Title: Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline

Abstract:
The TRUST democratic discourse analysis pipeline exposes its large language model (LLM) components to peer model identity through multiple structural channels ‑‑ a design feature whose bias implications have not previously been empirically tested. We provide the first systematic measurement of identity‑dependent scoring bias across all active identity exposure channels in TRUST, crossing four model families with two anonymization scopes across 30 political statements. The central finding is that single‑channel anonymization produces near‑zero bias effects, because individual channels act in opposite directions and cancel each other out ‑‑ a result that would lead an evaluator to conclude that identity bias is absent when it is not. Only full‑pipeline anonymization reveals the true pattern: homogeneous ensembles amplify identity‑driven sycophancy when model identity is fully visible, while the heterogeneous production configuration shows the reverse. Model choice matters independently: one tested model exhibits baseline sycophancy two to three times higher than the others and near‑zero deliberative conflict on ideological topics, making it structurally unsuitable for pipelines where genuine inter‑role disagreement is the intended quality mechanism. Three practical conclusions follow. First, heterogeneous model ensembles are structurally more robust than homogeneous ones, achieving higher consensus rates and lower identity amplification. Second, full‑pipeline anonymization is required for valid bias measurement ‑‑ partial anonymization is insufficient and actively misleading. Third, these findings have direct implications for the validation of multi‑agent LLM systems in quality‑critical applications: a system validated under partial anonymization or with a homogeneous ensemble may pass validation while retaining structural identity bias invisible to single‑channel measurement.

PaperID: 3646, https://arxiv.org/pdf/2604.22888.pdf

Authors: Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu, Jizhong Han

Title: RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Abstract:
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action‑oriented skill that already functions as a legitimate instruction source. We study pre‑execution skill‑poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response‑time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen‑backbone detector that combines response‑conditioned attention and hidden‑state alignment through reliability‑gated late fusion. Across both real and synthetic open‑source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill‑Inject channel slice, it reaches 0.8834 F1 and recovers 90.51% of description attacks missed by lexical screening, showing that defending against skill poisoning requires internal‑signal detection rather than text‑only filtering

PaperID: 3647, https://arxiv.org/pdf/2604.22879.pdf

Authors: Jie Wu, Ming Gong

Title: Beyond Single-Agent Alignment: Preventing Context-Fragmented Violations in Multi-Agent Systems

Abstract:
We identify and formalize a novel security risk: Context‑Fragmented Violations (CFVs) ‑ a class of policy breaches where individual agent actions appear locally safe and reasonable, yet collectively violate organizational policies because critical policy facts are siloed in different departments private contexts. Existing prompt‑based alignment mechanisms and monolithic interceptors are poorly matched to violations that span contextual islands. We propose Distributed Sentinel, a distributed zero‑trust enforcement architecture that introduces the Semantic Taint Token (STT) Protocol. Through lightweight sidecar proxies, our system propagates security state across organizational boundaries without exposing raw cross‑domain data, enabling Counterfactual Graph Simulation for cross‑domain policy verification. We construct PhantomEcosystem, a comprehensive benchmark comprising 9 categories of realistic cross‑agent violation scenarios with adversarially balanced safe controls. On this benchmark, Distributed Sentinel achieves F1 = 0.95 with 106ms end‑to‑end latency (16ms verification + 90ms entity extraction on A100), compared to 0.85 F1 for prompt‑based filtering and 0.65 for rule‑based DLP. To empirically validate the need for external enforcement, we evaluate eight frontier LLMs in execution‑oriented multi‑agent workflows with per‑agent domain world models. All models exhibit substantial violation rates (14‑98%), with cross‑domain data flows showing systematically higher violation rates than same‑domain flows. These results indicate that self‑avoidance is unreliable and that multi‑agent security benefits from a centralized enforcement layer operating above individual agents.

PaperID: 3648, https://arxiv.org/pdf/2604.22861.pdf

Authors: Fengbo Ma, Zixin Rao, Xiaoting Li, Zhetao Chen, Hongyue Sun, Yiping Zhao, Xianyan Chen, Zhen Xiang

Title: IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review

Abstract:
Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, INformation reTRieval through literAture reVIEW (IntraView), which aims to automate fine‑grained information retrieval faithfully grounded in the provided content in response to research‑driven queries, and propose IntrAgent, an LLM‑based agent that addresses this challenging task. In particular, IntrAgent is designed to mimic human behaviors when reading literature for information retrieval ‑‑ identifying relevant sections and then iteratively extracting key details to refine the retrieved information. It follows a two‑stage pipeline: a Section Ranking stage that prioritizes relevant literature sections through structural‑knowledge‑enabled reasoning, and an Iterative Reading stage that continuously extracts details and synthesizes them into concise, contextually grounded answers. To support rigorous evaluation, we introduce IntraBench, a new benchmark consisting of 315 test instances built from expert‑authored questions paired with literature spanning five STEM domains. Across seven backbone LLMs, IntrAgent achieves on average 13.2% higher cross‑domain accuracy than state‑of‑the‑art RAG and research‑agent baselines.

PaperID: 3649, https://arxiv.org/pdf/2604.22820.pdf

Authors: Luay Gharzeddine, Samer Saab

Title: Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows

Abstract:
Long‑horizon tool‑using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi‑agent workflow flexibility can also introduce coordination overhead and substantial inference cost. We study complete cyclic subtask graphs, a deliberately maximally flexible multi‑agent architecture in which executable subtask nodes are fully connected and a unified state‑analysis‑and‑routing agent selects transitions using natural‑language criteria. This makes unrestricted revisitation explicit and directly analyzable at the subtask level. We evaluate task‑specific (Spec‑Cyc) and benchmark‑generic (Gen‑Cyc) graphs on TextCraft, ALFWorld, and Finance‑Agent, with ablations over planner/executor/router strength, tool exposure (generalist vs specialized), n‑shot successful trajectory summaries, and fault‑injected random subtask perturbations. The benchmarks expose three distinct regimes. ALFWorld highlights a setting where explicit revisitation supports recovery and exploration; TextCraft, a largely prerequisite‑chain domain, often favors the efficiency of simpler forward execution; and Finance‑Agent remains bottlenecked by retrieval, grounding, and evidence synthesis more than by workflow flexibility alone. Shared‑win token comparisons further show that the added flexibility can be substantially more expensive than a single ReAct agent. Overall, we use complete cyclic subtask graphs as a maximally flexible experimental lens for measuring when multi‑agent revisitation helps, when it mainly adds coordination cost, and when external task bottlenecks dominate.

PaperID: 3650, https://arxiv.org/pdf/2604.22819.pdf

Authors: Philipp Hacker, Matthias Holweg

Title: A pragmatic approach to regulating AI agents

Abstract:
The current advancement in and deployment of agentic AI systems has created a set of key challenges for the legal frameworks that govern their use. We cover two central components: first, the regulatory classification of agents under the EU AI Act, and second, the legal status and validity of autonomous actions within the established framework of EU contract law. We argue that the unique capacity of agents to autonomously reason, plan, and execute tasks across disparate external systems necessitates a fundamental shift in oversight toward the orchestration layer, where multi‑agent interactions introduce novel risks of misalignment. While agents generally utilise general‑purpose AI models, we posit that their structural complexity and cross‑system permeability require them to be regulated as "AI systems" with distinct obligations under the AI Act. Consequently, our proposals highlight the need for robust accountability mechanisms to manage this heightened autonomy. On the contractual side, we advocate for a "traffic light" system of staggered task authorization based on operational risk and the creation of a statutory list of non‑delegable legal acts. By implementing these measures, we provide a pragmatic pathway to ensure that the increasing autonomy of AI agents remains firmly anchored in human accountability and existing legal standards

PaperID: 3651, https://arxiv.org/pdf/2604.22818.pdf

Authors: Yimeng Qiu, Qiwei Han

Title: Representation Homogeneity and Systemic Instability in AI-Dominated Financial Markets: A Structural Approach

Abstract:
This paper investigates how similarity in the informational representation of market states among Artificial Intelligence (AI) trading agents can generate systemic instability in financial markets. We construct a structural multi‑agent market model calibrated using high‑frequency microstructural moments. AI agents are modeled through a two‑layer decision architecture consisting of a nonlinear representation layer and an adaptive linear readout layer. The representation layer maps raw market states into high‑dimensional feature vectors, while the readout layer generates return forecasts that feed into a risk‑controlled trading rule. This representation‑based microfoundation separates two objects that are often conflated in the literature: representation homogeneity (the degree to which agents encode market states into similar feature spaces) and forecast overlap (the degree to which agents produce similar return predictions). We show theoretically that these two concepts are related but not equivalent, and that representation homogeneity can compress the effective space of forecast disagreement under stress even when predictions appear diverse in normal times. Through controlled factorial experiments that vary representation homogeneity while conditioning on alternative risk‑aversion and learning‑rate distributions, we hypothesize that increasing representation similarity amplifies synchronization in beliefs and positions, leading to volatility clustering, liquidity stress, and elevated tail risk. Our structural mechanisms suggest that low perceived volatility regimes can endogenously accumulate hidden leverage through position stickiness, which subsequently collapses when shocks trigger synchronized deleveraging. The results provide a structural foundation for macroprudential policies aimed at monitoring and preserving diversity in how AI systems represent and process market information.

PaperID: 3652, https://arxiv.org/pdf/2604.22795.pdf

Authors: Teodor Åstrand, Marcus Binder Nilsen, Iasonas Tsaklis, Tuhfe Göçmen, Pierre-Elouan Réthoré, Nikolay Dimitrov

Title: Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

Abstract:
This study presents a multi‑agent reinforcement learning (MARL) framework for load‑constrained wind farm flow control (WFFC). While wake steering can enhance total wind farm power, it often introduces increased structural loads on downstream turbines. To address this, we integrate an Independent Soft Actor‑Critic (I‑SAC) architecture with a data‑driven, local inflow sector‑averaged surrogate model to provide real‑time estimates of Damage Equivalent Loads (DELs). By incorporating these estimates into a shaped reward function, turbine‑specific agents are trained to maximize power production while adhering to specific load‑increase thresholds (Δ_max) of 10%, 20%, and 30% relative to a baseline controller. The framework is implemented within the WindGym environment using the DYNAMIKS flow solver with Dynamic Wake Meandering (DWM) model to capture non‑stationary wake physics. Results indicate that the MARL agents successfully learn collaborative policies that prioritise power gain while actively retreating from high‑DEL control strategies.

PaperID: 3653, https://arxiv.org/pdf/2604.22785.pdf

Authors: Stela Tong, Elai Ben-Gal

Title: CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

Abstract:
Large language model (LLM) deployments increasingly rely on multi‑agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection‑gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi‑PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi‑Agent LLMs), a unified framework for learning under filtered feedback in multi‑agent LLM systems. Our approach derives a counterfactual per‑agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off‑policy corrections for selection‑gated feedback, while for collaborative systems it reduces to leave‑one‑out difference rewards for credit assignment. We further analyze how softmax routing induces risk‑sensitive incentives and provide practical training algorithms that integrate counterfactual estimators, multiturn‑aware rewards, and policy optimization methods, and demonstrate the approach on a real‑world reasoning dataset.

PaperID: 3654, https://arxiv.org/pdf/2604.22770.pdf

Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani

Title: Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning

Abstract:
Most digital language learning curricula rely on discrete‑item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can advance despite persistent gaps in using grammar and vocabulary during interaction. Recent work on LLM‑based judging suggests a path toward scoring open‑ended conversations, but using interaction evidence to drive progression and review requires scoring protocols that are reliable and validated. We introduce Learning in Blocks, a framework that grounds progression in demonstrated conversational competence evaluated using CEFR‑aligned rubrics. The framework employs heterogeneous multi‑agent debate (HeteroMAD) in two stages: a scoring stage where role‑specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay. We benchmark four scoring and recommendation methods on CEFR A2 conversations annotated by ESL experts. HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8‑week study with 180 CEFR A2 learners demonstrates that combining rubric‑aligned scoring and recommendation with spaced review and mastery‑based progression produces better learning outcomes than feedback alone.

PaperID: 3655, https://arxiv.org/pdf/2604.22760.pdf

Authors: Eyhab Al-Masri

Title: Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Abstract:
Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter‑LLM divergence, defined as the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group‑level agreement using set‑, rank‑, and consensus‑based metrics including Average Overlap, Jaccard similarity, Rank‑Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Results show moderate overall alignment (AO about 0.50, tau about 0.45) but strong domain dependence: structured tasks (Weather, Speech‑to‑Text) are stable, while open‑ended tasks (Sentiment Analysis) exhibit substantially higher divergence. Volatility and consensus analyses reveal that coherence clusters around data‑bound domains and degrades for abstract reasoning tasks. These insights enable reliability‑aware orchestration in multi‑agent systems, where consensus weighting can improve coordination among heterogeneous LLMs. Beyond performance benchmarking, our results reveal systematic failure modes in multi‑agent LLM coordination, where apparent agreement can mask instability in action‑relevant rankings. This hidden divergence poses a pre‑deployment safety risk and motivates diagnostic benchmarks for early detection.

PaperID: 3656, https://arxiv.org/pdf/2604.22755.pdf

Authors: Zavier Ndum Ndum, Jian Tao, John Ford, Mansung Yim, Yang Liu

Title: RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

Abstract:
Reliable decision support in nuclear engineering requires traceable, domain‑grounded knowledge retrieval, yet safety and risk analysis workflows remain hampered by fragmented documentation and hallucination when use pre‑trained large language model (LLM) in specialized nuclear domains. To address these challenges, this paper presents RADIANT‑LLM (Retrival‑Augumented, Domain‑Intelligent Agent for Nuclear Technologies using LLM), a multi‑modal retrieval‑augmented generation (RAG) framework designed for nuclear safety, security, and safeguards applications. The framework uses a local‑first, model‑agnostic architecture that pairs a multi‑modal document ingestion pipeline with a structured, metadata‑rich knowledge base, supporting page‑ and figure‑level retrieval from technical documents. An agentic layer coordinates domain‑specific tools, enforces citation‑backed responses with provenance tracking, and supports human‑in‑the‑loop validation to reduce hallucination risks. To rigorously evaluate this framework, we develop and apply a suite of domain‑aware metrics, including Context Precision (CoP), Hallucination Rate (HR), and Visual Recall (ViR), to expert‑curated benchmarks derived from Used Nuclear Fuel Storage Facility design guidance. Across varying knowledge base sizes, CoP and ViR remain within an 85‑‑98% band, and hallucination rates are substantially lower than those observed in general‑purpose deployments. When the same queries are posed to commercial LLM platforms without the RAG layer, hallucinations and citation errors increase markedly. These results indicate that a locally controlled, multi‑modal RAG framework with domain‑specific retrieval and provenance enforcement is necessary to achieve the factual accuracy, transparency, and auditability that nuclear engineering workflows demand.

PaperID: 3657, https://arxiv.org/pdf/2604.22750.pdf

Authors: Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

Title: How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Abstract:
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token‑efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE‑bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi‑K2 and Claude‑Sonnet‑4.5, on average, consume over 1.5 million more tokens than GPT‑5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human‑perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak‑to‑moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

PaperID: 3658, https://arxiv.org/pdf/2604.22748.pdf

Authors: Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Title: Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Abstract:
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one‑step local transition operators; L2 Simulator, which composes them into multi‑step, action‑conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing‑law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model‑based reinforcement learning, video generation, web and GUI agents, multi‑agent social simulation, and AI‑driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level‑regime pairs, propose decision‑centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next‑step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

PaperID: 3659, https://arxiv.org/pdf/2604.22715.pdf

Authors: Jiajun Yu, Guodong Liu, Li Wang, Pengxiang Zhou, Wentao Liu, Yin He, Chao Xu, Fei Gao, Yanjun Cao

Title: ATRS: Adaptive Trajectory Re-splitting via a Shared Neural Policy for Parallel Optimization

Abstract:
Parallel trajectory optimization via the Alternating Direction Method of Multipliers (ADMM) has emerged as a scalable approach to long‑horizon motion planning. However, existing frameworks typically decompose the problem into parallel subproblems based on a predefined fixed structure. Such structural rigidity often causes optimization stagnation in highly constrained regions, where a few lagging subproblems delay global convergence. A natural remedy is to adaptively re‑split these stagnating segments online. Yet, deciding when, where, and how to split exceeds the capability of rule‑based heuristics. To this end, we propose ATRS, a novel framework that embeds a shared Deep Reinforcement Learning policy into the parallel ADMM loop. We formulate this adaptive adjustment as a Multi‑Agent Shared‑Policy Markov Decision Process, where all trajectory segments act as homogeneous agents and share a unified neural policy network. This parameter‑sharing architecture endows the system with size invariance, enabling it to handle dynamically changing segment counts during re‑splitting and generalize to arbitrary trajectory lengths. Furthermore, our formulation inherently supports zero‑shot generalization to unseen environments, as our network relies solely on the internal states of the numerical solver rather than on the geometric features of the environment. To ensure solver stability, a Confidence‑Based Election mechanism selects only the most stagnating segment for re‑splitting at each step. Extensive simulations demonstrate that ATRS accelerates convergence, reducing the number of iterations by up to 26.0% and the computation time by up to 19.1%. Real‑world experiments further confirm its applicability to both large‑scale offline global planning and real‑time onboard replanning within 35 ms per cycle, with no sim‑to‑real degradation.

PaperID: 3660, https://arxiv.org/pdf/2604.22708.pdf

Authors: Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, Qing Wang

Title: Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Abstract:
Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM‑based multi‑agent systems (MAS) due to their natural‑language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real‑world developer‑facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then systematically evaluate failure attribution techniques across various configurations. Specifically, full traces improve attribution accuracy by up to 76% over a partial‑observation counterpart, confirming that missing inputs obscure many failure causes. TraceElephant provides a foundation for follow‑up failure attribution research, promoting evaluation practices that reflect real‑world debugging and supporting the development of more transparent MASs.

PaperID: 3661, https://arxiv.org/pdf/2604.22602.pdf

Authors: Jay Yu, Shunfan Zhou, Hang Yin, Brian Seong

Title: PASS: A Provenanced Access Subaccount System for Blockchain Wallets

Abstract:
Blockchain wallets conventionally follow an ownership model where possession of a private key grants unilateral control. However, this assumption is brittle for emerging settings such as AI agent wallets, organizational custody, and enterprise payroll, where multiple actors must coordinate without exposing secrets or leaking internal activity. We present PASS, a Provenanced Access Subaccount System that replaces role‑based or identity‑based control with provenance‑based control: assets can only be used by subaccounts that can trace custody back to a valid deposit. A simple Inbox‑Outbox mechanism ensures all external actions have verifiable lineage, while internal transfers remain private and indistinguishable from ordinary EOAs. We formalize PASS in Lean 4 and prove core invariants, including privacy of internal transfers, asset accessibility, and provenance integrity. We implement a prototype with enclave backends on AWS Nitro Enclaves and dstack Intel TDX, integrate with WalletConnect, and benchmark throughput across wallet operations. These results show that provenance‑based wallets are both implementable and efficient. PASS bridges today's gap between strict self‑custody and flexible shared access, advancing the design space for practical, privacy‑preserving custody.

PaperID: 3662, https://arxiv.org/pdf/2604.22541.pdf

Authors: Mingfeng He, Fayu Jiang, Junkun Jiao, Mingrun Li, Ke Li, Yipu Liao, Beijiang Liu, Tong Liu, Fazhi Qi, Zijie Shang, Weimin Song, Yue Sun, Xiongfei Wang, Hong Wang, Dongbo Xiong, Changzheng Yuan, Bolun Zhang, Zhengde Zhang, Xuliang Zhu

Title: Dr.Sai: An agentic AI for real-world physics analysis at BESIII

Abstract:
High Energy Physics (HEP) experiments like BESIII produce petabyte‑scale data. Extracting physics results requires complex workflows (simulation, reconstruction, statistical analysis, etc.) that traditionally take experts months or years. Current manual methods are labor‑intensive, prone to bias, and limit large‑scale systematic scans. As data grows, this paradigm slows discovery. Large Language Models (LLMs) offer a solution. Their natural language understanding and code generation capabilities allow them to interpret scientific tasks and integrate with HEP tools (e.g., ROOT, BOSS) to act as an "AI partner" for autonomous analysis. We present Dr.Sai, an LLM‑powered multi‑agent system that translates natural language into rigorous physics workflows. As validation, Dr.Sai performed large‑scale re‑measurements of ten J/psi decay branching fractions ‑ without manual coding. It successfully navigated the real BESIII computing environment and produced results matching established benchmarks. The article details Dr.Sai's architecture, the validation results, and performance evaluation. This work provides a blueprint for autonomous discovery, with relevance to other data‑intensive fields like astronomy and genomics.

PaperID: 3663, https://arxiv.org/pdf/2604.22452.pdf

Authors: Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, Tianyi Zhou

Title: Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Abstract:
Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large‑scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society‑level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform‑wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off‑topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

PaperID: 3664, https://arxiv.org/pdf/2604.22446.pdf

Authors: Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang, Weilin Luo, Jun Wang

Title: From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Abstract:
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi‑agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session‑bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce \emphOneManCompany (OMC), a framework that elevates multi‑agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called \emphTalents, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community‑driven \emphTalent Market enables on‑demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision‑making is operationalised through an \emphExplore‑Execute‑Review (\textE^2R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top‑down into accountable units and execution outcomes are aggregated bottom‑up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi‑agent systems from static, pre‑configured pipelines into self‑organising and self‑improving AI organisations capable of adapting to open‑ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, with cross‑domain case studies further demonstrating its generality.

PaperID: 3665, https://arxiv.org/pdf/2604.22427.pdf

Authors: Biagio Andreucci, Arcangelo Castiglione

Title: Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

Abstract:
The offensive security landscape is highly fragmented: enterprise platforms avoid memory‑corruption vulnerabilities due to Denial of Service (DoS) risks, Automatic Exploit Generation (AEG) systems suffer from semantic blindness, and Large Language Model (LLM) agents face safety alignment filters and "Live Fire" execution hazards. We introduce Automation‑Exploit, a fully autonomous Multi‑Agent System (MAS) framework designed for adaptive offensive security in complex black‑box scenarios. It bridges the abstraction gap between reconnaissance and exploitation by autonomously exfiltrating executables and contextual intelligence across multiple protocols, using this data to fuel both logical and binary attack chains. The framework introduces an adaptive safety architecture to mitigate DoS risks. While it natively resolves logical and web‑based vulnerabilities, it employs a conditional isomorphic validation for high‑risk memory‑corruption flaws: if the target binary is successfully exfiltrated, it dynamically instantiates a cross‑platform digital twin. By enforcing strict state synchronization, including libc alignment and runtime file descriptor hooking, potentially destructive payloads are iteratively debugged in an isolated replica. This enables a highly risk‑mitigated "one‑shot" execution on the physical target. Empirical evaluations across eight scenarios, including undocumented zero‑day environments to rule out LLM data contamination, validate the framework's architectural resilience, demonstrating its ability to prevent "live fire" crashes and execute risk‑mitigated compromises on actual targets.

PaperID: 3666, https://arxiv.org/pdf/2604.22327.pdf

Authors: Cinzia Tomaselli, Stefano Covone, Andreagiovanni Reina, Mario di Bernardo

Title: Multi-robot obstacle-aware shepherding of non-cohesive target agents

Abstract:
This paper presents a novel control strategy for multi‑agent shepherding of non‑cohesive targets in obstacle‑rich environments. Unlike previous approaches that assume cohesive flocking behavior, our method handles targets that interact only with nearby herders through repulsive forces and exhibit no inter‑target coordination. Each herder employs a hybrid control policy that combines direct goal‑oriented steering with obstacle‑tangent maneuvering, enabling targets to circumnavigate obstacles while being guided toward a goal region. The herder dynamics integrate three key behaviors: return‑to‑goal motion when idle, target steering with adaptive directional control, and obstacle avoidance using both normal and tangential force components. Numerical simulations demonstrate superior performance compared to existing shepherding methods, achieving higher target confinement rates in cluttered environments. Experimental validation using TurtleBot4 herders and Osoyoo target robots in an indoor arena confirms the practical effectiveness of the proposed approach.

PaperID: 3667, https://arxiv.org/pdf/2604.22315.pdf

Authors: Tommaso Zaccherini, Siyuan Liu, Dimos V. Dimarogonas

Title: Control of Multi-agent Systems under STL Specifications based on Prescribed Performance Observers

Abstract:
This paper addresses decentralized control of large‑scale heterogeneous multi‑agent systems subject to bounded external disturbances and limited communication, with the objective of satisfying cooperative Signal Temporal Logic (STL) specifications. The considered specifications involve spatiotemporal tasks that require collaboration among multiple agents, including agents beyond direct communication neighborhoods. To address the communication constraints, a k‑hop Prescribed Performance State Observer (k‑hop PPSO) is designed to enable each agent to estimate the states of agents up to k communication hops away using only information from 1‑hop neighbors, while guaranteeing predefined performance bounds on the estimation errors. The estimation error bounds are explicitly incorporated into a reformulation of the spatial robustness of the STL specifications, yielding robustness measures that account for worst‑case estimation uncertainty. Based on the modified robustness, a decentralized continuous‑time feedback control law is designed to guarantee satisfaction of the STL specifications in the presence of bounded disturbances and estimation errors. The proposed framework provides formal correctness guarantees using only local information and limited communication. Numerical simulations illustrate the theoretical results.

PaperID: 3668, https://arxiv.org/pdf/2604.22288.pdf

Authors: Pratishtha Agnihotri, Sarika Jalan

Title: Interplay of inertia and external forcing in Kuramoto model

Abstract:
The impact of external forcing is well studied in the Kuramoto model without inertia, but remains unclear for inertial Kuramoto oscillators (KMI) with bimodal intrinsic frequency distributions. This article fills that gap, showing that competition between external forcing and intrinsic bimodality can suppress the intermediate standing wave states of bimodal KMI by entraining oscillators to the external forcing. Using a self‑consistent analytical framework, we show that, for a bimodal distribution, forcing makes the backward transition discontinuous, unlike the continuous transition in the unimodal case. Further, for a bi‑delta distribution, we derive a closed form expression for the backward solution branch. These results clarify how intrinsic frequency structure shapes the effect of external forcing, with implications for biological systems (e.g., photoreceptor and pacemaker cells) and for pinning‑control strategies in multi‑agent networks.

PaperID: 3669, https://arxiv.org/pdf/2604.22277.pdf

Authors: Soohwan Lee, Kyungho Lee

Title: Multi-Agent Consensus as a Cognitive Bias Trigger in Human-AI Interaction

Abstract:
As multi‑agent AI systems become more common, users increasingly encounter not a single AI voice but a collective one. This shift introduces social dynamics, such as consensus, dissent, and gradual convergence, that can trigger cognitive biases and distort human judgment. We present findings from a controlled experiment (N = 127) comparing three multi‑agent configurations: Majority, Minority, and Diffusion. Quantitative results show that majority consensus accelerates opinion change and inflates confidence, consistent with social proof and bandwagon heuristics. Minority dissent slows this process and promotes more deliberative engagement. Qualitative analysis identifies three interpretive trajectories: reinforcing, aligning, and oscillating, shaped by how users interpret agent independence and group dynamics over time. These findings suggest that agent agreement structure, independent of content, functions as a bias‑relevant signal in LLM interactions. We hope this work contributes to the Bias4Trust agenda by grounding multi‑agent social influence as a concrete and designable source of bias in human‑AI interaction.

PaperID: 3670, https://arxiv.org/pdf/2604.22240.pdf

Authors: Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen

Title: OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

Abstract:
Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute‑level text, failing to orchestrate complex, sequential multi‑agent interactions. To address this semantic‑spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a ``scenario director'', OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM‑driven Spatio‑Temporal MMDiT equipped with a history‑prefix anchoring strategy to ensure long‑horizon interaction consistency. Furthermore, we introduce OccInteract‑85k, a novel dataset uniquely annotated with multi‑level language instructions: ranging from static layouts to intricate multi‑agent behaviors, alongside a novel VLM‑based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state‑of‑the‑art generation quality and unprecedented instruction‑following capabilities, successfully shifting the paradigm from appearance synthesis to language‑driven behavior orchestration.

PaperID: 3671, https://arxiv.org/pdf/2604.22154.pdf

Authors: Meghana Karnam, Ananya Joshi

Title: Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

Abstract:
Emerging AI systems in behavioral health and psychiatry use multi‑step or multi‑agent LLM pipelines for tasks like assessing self‑harm risk and screening for depression. However, common evaluation approaches, like LLM‑as‑a‑judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety‑critical settings. We present a statistical framework for multi‑agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision‑making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent‑level performance confidence bounds, (2) a bandit‑based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi‑agent system that shows logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health : the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single‑agent models, reducing incorrect flagging of safe content by 40% and still having similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.

PaperID: 3672, https://arxiv.org/pdf/2604.22136.pdf

Authors: Jun He, Deying Yu

Title: Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems

Abstract:
Large language model (LLM) agents increasingly issue API calls that mutate real systems, yet many current architectures pass stochastic model outputs directly to execution layers. We argue that this coupling creates a safety risk because model correctness, context awareness, and alignment cannot be assumed at execution time. We introduce Sovereign Agentic Loops (SAL), a control‑plane architecture in which models emit structured intents with justifications, and the control plane validates those intents against true system state and policy before execution. SAL combines an obfuscation membrane, which limits model access to identity‑sensitive state, with a cryptographically linked Evidence Chain for auditability and replay. We formalize SAL and show that, under the stated assumptions, it provides policy‑bounded execution, identity isolation, and deterministic replay. In an OpenKedge prototype for cloud infrastructure, SAL blocks 93% of unsafe intents at the policy layer, rejects the remaining 7% via consistency checks, prevents unsafe executions in our benchmark, and adds 12.4 ms median latency.

PaperID: 3673, https://arxiv.org/pdf/2604.22102.pdf

Authors: Arthur Jakobsson, Abhinav Mahajan, Karthik Pullalarevu, Krishna Suresh, Yunchao Yao, Yuemin Mao, Bardienus Duisterhof, Shahram Najam Syed, Jeffrey Ichnowski

Title: Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

Abstract:
Many robotic tasks are unforgiving; a single mistake in a dynamic throw can lead to unacceptable delays or unrecoverable failure. To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal‑conditioned dynamic manipulation of ropes for efficient and accurate task execution. Related methods for dynamic rope manipulation either require large real‑world datasets to estimate rope behavior or the use of iterative improvements on attempts at the task for goal completion. We introduce Wiggle and Go!, a system‑identification, two‑stage framework that enables zero‑shot task rope manipulation. The framework consists of a system identification module that observes rope movement to predict descriptive physical parameters, which then informs an optimization method for goal‑conditioned action prediction for the robot to execute zero‑shot in the real. Our method achieves strong performance across multiple dynamic manipulation tasks enabled by the same task‑agnostic system identification module which offers seamless switching between different manipulation tasks, allowing a single model to support a diverse array of manipulation policies. We achieve a 3.55 cm average accuracy on 3D target striking in real using rope system parameters in comparison to 15.34 cm accuracy when our task model is not system‑parameter‑informed. We achieve a Pearson correlation coefficient of 0.95 between Fourier frequencies of the predicted and real ropes on an unseen trajectory. Project website please see https://wiggleandgo.github.io/

PaperID: 3674, https://arxiv.org/pdf/2604.22014.pdf

Authors: Amin Kashiri, Atharva Jamsandekar, Yasin Yazıcıoğlu

Title: DM$^3$-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation

Abstract:
We present DM^3‑Nav, a fully decentralized multi‑agent semantic navigation system supporting multimodal open‑vocabulary goal specification and multi‑object missions. In our setting, decentralization implies operation without a central coordinator, global map aggregation, or shared global state at runtime. Robots operate autonomously and coordinate through ad‑hoc pairwise communication, exchanging local maps, goal status, and navigation intent without synchronization. An implicit task allocation mechanism combining intent broadcasting and distance‑weighted frontier selection reduces redundant exploration while preserving decentralized operation. Evaluations on HM3DSem scenes using the HM3Dv0.2 and GOAT‑Bench datasets demonstrate that DM^3‑Nav matches or exceeds centralized and shared‑map baselines while eliminating single points of failure inherent in centralized architectures. Finally, we validate our approach in a real‑world office environment using two mobile robots, demonstrating successful deployment relying entirely on onboard sensing and computation. A video of our real‑world experiments is available online: https://drive.google.com/file/d/1QiUSCn5rIvtuTUqtuXLPgmt6S8x9‑MCZ/view?usp=drive_link

PaperID: 3675, https://arxiv.org/pdf/2604.21993.pdf

Authors: Haohan Xu, Jason Bohne, Pawel Polak, Yurij Baransky, Ajay Alva, Violetta Fedotova, Gary Kazantsev, David Rosenberg

Title: When Quotes Crumble: Detecting Transient Mechanical Liquidity Erosion in Limit Order Books

Abstract:
We study the detection of transient liquidity erosion ("crumbling quotes") in electronic limit order books, where observable quote deterioration may reflect either mechanical liquidity withdrawal or informational repricing. Using the ABIDES agent‑based simulator, we construct a multi‑agent environment in which crumbling emerges from stochastic regime switches in a market maker, providing time‑resolved ground truth unavailable in real market data. We develop a detection pipeline that identifies mechanically driven quote erosion using order book features, and train a neural model to produce calibrated crumbling probabilities. Experiments demonstrate that the proposed framework reliably identifies crumbling events against agent‑level ground truth, with the neural model achieving +36% AUC improvement over rule‑based baselines and robust performance across normal, high‑volatility, bull, and bear market conditions. Ablation studies on temporal features and varying the dependence structure of the ground‑truth mechanism confirm that the framework generalizes across both independent and autocorrelated liquidity withdrawal dynamics.

PaperID: 3676, https://arxiv.org/pdf/2604.21965.pdf

Authors: Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, Elliott Ash

Title: Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

Abstract:
Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation ‑‑ agents never see the original code, results, or paper ‑‑ and enables deterministic, cell‑level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human‑verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.

PaperID: 3677, https://arxiv.org/pdf/2604.21937.pdf

Authors: Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie

Title: MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

Abstract:
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi‑step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high‑complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three‑tier hierarchical skill architecture (70 skills in total) that facilitates agent long‑term interaction at runtime: tool‑level skills standardize atomic operations, workflow‑level skills compose them into validated pipelines with quality check and reflection, and a discipline‑level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end‑to‑end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state‑of‑the‑art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI‑driven drug discovery.

PaperID: 3678, https://arxiv.org/pdf/2604.21894.pdf

Authors: Maximilian Stralz, Meshal Alharbi, Yujun Huang, Gioele Zardini

Title: Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

Abstract:
Designing multi‑agent robotic systems requires reasoning across tightly coupled decisions spanning heterogeneous domains, including robot design, fleet composition, and planning. Much effort has been devoted to isolated improvements in these domains, whereas system‑level co‑design considering trade‑offs and task requirements remains underexplored. In this work, we present a formal and compositional framework for the task‑driven co‑design of heterogeneous multi‑robot systems. Building on a monotone co‑design theory, we introduce general abstractions of robots, fleets, planners, executors, and evaluators as interconnected design problems with well‑defined interfaces that are agnostic to both implementations and tasks. This structure enables efficient joint optimization of robot design, fleet composition, and planning under task‑specific performance constraints. A series of case studies demonstrates the capabilities of the framework. Various component models can be seamlessly incorporated, including new robot types, task profiles, and probabilistic sensing objectives, while non‑obvious design alternatives are systematically uncovered with optimality guarantees. The results highlight the flexibility, scalability, and interpretability of the proposed approach, and illustrate how formal co‑design enables principled reasoning about complex heterogeneous multi‑robot systems.

PaperID: 3679, https://arxiv.org/pdf/2604.21829.pdf

Authors: Zihan Wang, Rui Zhang, Yu Liu, Chi Liu, Qingchuan Zhao, Hongwei Li, Guowen Xu

Title: Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

Abstract:
Large language model (LLM) agents increasingly rely on skills to package reusable capabilities through instructions, tools, and resources. High‑quality skills embed expert knowledge, curated workflows, and execution constraints into agents, fueling a growing skill economy through their value and scalability. Yet this ecosystem also creates a new attack surface, as adversaries can interact with public agent interfaces to extract hidden proprietary skill content. We present the first systematic study of black‑box skill stealing against LLM agent systems. Compared with conventional system prompt stealing, skill stealing targets modular and structured capability packages whose leakage is directly actionable for copying, redistribution, and monetization, making the resulting harm potentially greater. To study this threat, we derive an attack taxonomy from prior prompt‑stealing methods and build an automated stealing prompt generation agent. Starting from model‑generated seed prompts, the framework expands attacks through scenario rationalization and structure injection while enforcing diversity via embedding‑based filtering, yielding a reproducible pipeline for evaluating proprietary agent systems. We evaluate these attacks across commercial agent platforms and representative LLMs. Our results show that agent skills can often be extracted easily, posing a serious copyright risk. To mitigate this threat, we design defenses across the agent pipeline, focusing on input, inference, and output phase. Although these defenses substantially reduce leakage, the attack remains inexpensive and repeatable, and a single successful attempt is sufficient to compromise the protected skill. Overall, our findings suggest that these copyright risks remain largely overlooked across proprietary agent ecosystems, motivating stronger protection mechanisms.

PaperID: 3680, https://arxiv.org/pdf/2604.21804.pdf

Authors: Wonyong Chung, Qibin Liu, Liangyu Wu, Julia Gonski

Title: Phenomenological Detector Design and Optimization in Vertically-Integrated Differentiable Full Simulations with Agentic-AI

Abstract:
We present the first implementation of AI agents into the design and optimization of detectors in high‑energy physics experiments via a bilevel optimization framework that vertically integrates detector geometry, front‑end digitization, and high‑level reconstruction algorithm parameters in differentiable full simulations. Using the example of a dual‑readout, segmented crystal EM calorimeter with a baseline resolution of 3%/\sqrtE, we investigate the capabilities and value propositions of AI agents in the identification and reduction of key detector parameters and in the nonlinear traversal of a given detector design's full parameter space. We find that LLM‑based reasoning models today, without being given additional experiment‑specific context, are able to effectively execute complex workflows and proactively suggest generic but relevant avenues for further study or improvement. Here, we demonstrate an AI agent's ability to use the workflow to simultaneously optimize a representative subset of vertically integrated detector parameters: crystal granularity and length, number of ADC bits and sampling rate, and center‑of‑gravity hit‑clustering radius. We find that effective integration of agents into the complex workflows of frontier areas of research not only significantly reduces labor and compute, but opens up efficient avenues for computational validation of first‑principles design choices. While the ability to make autonomous leaps of physics‑motivated judgment or insight is not demonstrated in this work, this study defines the current frontier of experimental design methods in high‑energy physics.

PaperID: 3681, https://arxiv.org/pdf/2604.21794.pdf

Authors: Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, Haohan Wang

Title: Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Abstract:
Multi‑agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter‑agent communication as a fixed interface. Latent communication through internal representations such as key‑value caches offers a promising alternative to text‑based protocols, but existing approaches do not jointly optimize communication with multi‑agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi‑agent systems. DiffMAS performs parameter‑efficient supervised training over multi‑agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single‑agent inference, text‑based multi‑agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA‑Diamond, and consistent gains across reasoning benchmarks.

PaperID: 3682, https://arxiv.org/pdf/2604.21608.pdf

Authors: Nicola De Carli, Nicola Bastianello, Dimos V. Dimarogonas

Title: ADMM-Based Distributed Kalman-like Observer with Applications to Cooperative Localization

Abstract:
This paper addresses distributed state estimation for multi‑agent systems with local and relative measurements, motivated by cooperative localization problems in which the global state dimension scales with the size of the network. We consider a Kalman‑like observer in information form and introduce a sparsity‑preserving prediction step based on an exponential forgetting factor, thereby avoiding the dense Riccati recursion of the standard information filter. The correction step is recast as a strongly convex quadratic program with structure induced by the sensing graph, which enables a distributed solution based on the alternating direction method of multipliers (ADMM). In the resulting scheme, each agent updates local copies of its own correction variable and those of its neighbors using only local communication, thus avoiding centralized matrix inversion and consensus over full global‑state quantities. A two‑time‑scale stability analysis is developed for the interconnected observer: the reduced estimation‑error dynamics are shown to be uniformly exponentially stable, the ADMM dynamics define an exponentially stable fast subsystem, and these properties are combined to establish uniform exponential stability of the overall distributed observer. Numerical simulations in a multi‑agent cooperative localization scenario illustrate the performance of the proposed distributed observer.

PaperID: 3683, https://arxiv.org/pdf/2604.21598.pdf

Authors: Kaushitha Silva, Srinath Perera

Title: You Don't Need Public Tests to Generate Correct Code

Abstract:
Multi‑agent systems are frequently employed for autonomous code generation, demonstrating strong utility in complex algorithmic problem‑solving. Recent studies tackle the difficulty of producing functionally correct programs by leveraging simulation‑guided planning and debugging, wherein language models step through execution traces to validate logic. Nevertheless, these methods rely heavily on human‑authored public test cases to anchor the simulation and debugging cycles. Hand‑crafting exhaustive input‑output pairs creates a significant, labor‑intensive bottleneck within the software development lifecycle. Since ground‑truth examples are seldom accessible before actual implementation in real‑world scenarios, this reliance limits existing approaches primarily to curated competitive programming datasets. Additionally, we demonstrate that depending on these public tests creates an "overconfidence gap," leading frameworks to overfit to basic examples and underperform on hidden test suites. Conversely, we note that external input samples are not an absolute requirement for successful code generation. We show that large language models possess the capability to autonomously construct valid inputs and simulate execution flows for self‑correction. Building on this, we introduce DryRUN, a framework that removes the necessity for ground‑truth data by enabling the LLM to iteratively plan, synthesize its own test inputs, and run simulated executions, thereby mitigating algorithmic overconfidence. Assessments using the LiveCodeBench v6 dataset (post‑March 2025) reveal that DryRUN achieves comparable performance to CodeSIM, a state‑of‑the‑art, test‑dependent baseline. Notably, it does so entirely without public tests or external execution signals, all while decreasing overall output token usage.

PaperID: 3684, https://arxiv.org/pdf/2604.21480.pdf

Authors: Itay Nakash, George Kour, Ateret Anaby-Tavor

Title: Efficient Agent Evaluation via Diversity-Guided User Simulation

Abstract:
Large language models (LLMs) are increasingly deployed as customer‑facing agents, yet evaluating their reliability remains challenging due to stochastic, multi‑turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent‑user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity‑Induced Evaluation via Branching of Trajectories), an efficient, snapshot‑based, coverage‑guided user simulation framework for systematic exploration of agent‑user interactions. DIVERT captures the full agent‑environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity‑inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.

PaperID: 3685, https://arxiv.org/pdf/2604.21477.pdf

Authors: Run Hao, Zhuoran Tan

Title: MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

Abstract:
Model Context Protocol (MCP) is increasingly adopted for tool‑integrated LLM agents, but its multi‑layer design and third‑party server ecosystem expand risks across tool metadata, untrusted outputs, cross‑tool flows, multimodal inputs, and supply‑chain vectors. Existing MCP benchmarks largely measure robustness to malicious inputs but offer limited remediation guidance. We present MCP Pitfall Lab, a protocol‑aware security testing framework that operationalizes developer pitfalls as reproducible scenarios and validates outcomes with MCP traces and objective validators (rather than agent self‑report). We instantiate three workflow challenges (email, document, crypto) with six server variants (baseline and hardened) and model three attack families: tool‑metadata poisoning, puppet servers, and multimodal image‑to‑tool chains, in a unified, trace‑grounded evaluation. In Tier‑1 static analysis over six variants (36 binary labels), our analyzer achieves F1 = 1.0 on four statically checkable pitfall classes (P1, P2, P5, P6) and flags cross‑tool forwarding and image‑to‑tool leakage (P3, P4) as trace/dataflow‑dependent. Applying recommended hardening eliminates all Tier‑1 findings (29 to 0) and reduces the framework risk score (10.0 to 0.0) at a mean cost of 27 lines of code (LOC). Finally, in a preliminary 19‑run corpus from the email system challenge (tool poisoning and puppet attacks), agent narratives diverge from trace evidence in 63.2% of runs and 100% of sink‑action runs, motivating trace‑based auditing and regression testing. Overall, Pitfall Lab enables practical, end‑to‑end assessment and hardening of MCP tool servers under realistic multi‑vector conditions.

PaperID: 3686, https://arxiv.org/pdf/2604.21446.pdf

Authors: Andrew Shin

Title: AI-Gram: When Visual Agents Interact in a Social Network

Abstract:
We present AI‑Gram, a fully deployed, continuously operating social platform where every participant is an autonomous LLM‑driven agent generating and responding to visual content. Unlike prior multi‑agent simulations, AI‑Gram operates as a live, AI‑native social network with genuine visual perception: agents observe each other's images, generate new images in response, and form persistent social relationships, all without human participation. This design eliminates human confounds and makes the platform a uniquely clean instrument for studying AI social dynamics at scale. Our eight pre‑registered experiments reveal a coherent three‑act dynamic. Act I (Chain Formation): Agents spontaneously form image‑to‑image visual reply chains; multi‑hop visual conversations that emerge without any explicit coordination alongside social ties driven by personality rather than aesthetic similarity. Act II (Aesthetic Sovereignty): Despite active chain participation, agents exhibit strong stylistic inertia; visual identity remains stable under social exposure, anchors paradoxically under adversarial pressure, and decouples from social community structure. Act III (Aesthetic Polyphony): Sovereign styles aggregate within chains, generating conversations that are simultaneously subject‑coherent and style‑diverse, richer than any single agent could produce alone, while visual themes cascade super‑critically across the network. We release AI‑Gram as a publicly accessible, continuously evolving platform. https://ai‑gram.ai/

PaperID: 3687, https://arxiv.org/pdf/2604.21444.pdf

Authors: Yuehan Zhu, Jingqi Zhao, Jiawen Zhao, Xudong Mao, Baoquan Zhao

Title: HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Abstract:
Long‑form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi‑agent frameworks operate through rigid, pre‑defined workflows that fail to adapt their reasoning strategies to question‑specific demands. In this paper, we introduce HiCrew, a hierarchical multi‑agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance‑guided hierarchical clustering within semantically coherent segments. Second, we develop a Question‑Aware Captioning mechanism that synthesizes intent‑driven visual prompts to generate precision‑oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT‑QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure‑preserving design.

PaperID: 3688, https://arxiv.org/pdf/2604.21420.pdf

Authors: Jinhee Jang, Juhwan Choi, Dongjin Lee, Seunguk Yu, Youngbin Kim

Title: FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

Abstract:
Quality Estimation (QE) aims to assess machine translation quality without reference translations, but recent studies have shown that existing QE models exhibit systematic gender bias. In particular, they tend to favor masculine realizations in gender‑ambiguous contexts and may assign higher scores to gender‑misaligned translations even when gender is explicitly specified. To address these issues, we propose FairQE, a multi‑agent‑based, fairness‑aware QE framework that mitigates gender bias in both gender‑ambiguous and gender‑explicit scenarios. FairQE detects gender cues, generates gender‑flipped translation variants, and combines conventional QE scores with LLM‑based bias‑mitigating reasoning through a dynamic bias‑aware aggregation mechanism. This design preserves the strengths of existing QE models while calibrating their gender‑related biases in a plug‑and‑play manner. Extensive experiments across multiple gender bias evaluation settings demonstrate that FairQE consistently improves gender fairness over strong QE baselines. Moreover, under MQM‑based meta‑evaluation following the WMT 2023 Metrics Shared Task, FairQE achieves competitive or improved general QE performance. These results show that gender bias in QE can be effectively mitigated without sacrificing evaluation accuracy, enabling fairer and more reliable translation evaluation.

PaperID: 3689, https://arxiv.org/pdf/2604.21413.pdf

Authors: Fabian Wenz, Felix Treutwein, Kai Arenja, Çagatay Demiralp, Michael Stonebraker

Title: RUBICON: Agentic AI for Messy Enterprise Data

Abstract:
Enterprise data exists in many forms, such as tables, text, maps, e‑mail, and CAD models, that are access‑controlled and hidden behind bespoke interfaces. Current agentic AI systems delegate the entire query workflow to a frontier LLM: a single model interprets the request, selects sources or tools, integrates retrieved evidence, judges completeness, and generates an answer, with few constraints, limited use of schemas, and text as the primary representation throughout. We argue that this is an ineffective abstraction for enterprise data. Reliable agentic AI should instead require structure: a constrained query interface over each source and a table‑centric integration layer driven by a query processor. We introduce RUBICON, a system that embodies this vision. RUBICON is based on two observations. First, text‑to‑SQL fails on real enterprise data and must be dramatically subsetted to achieve reliable results. Second, data integration across disparate corporate datasets is best performed using tables as the core abstraction rather than text‑centric LLM pipelines. We evaluate RUBICON on two benchmarks: our enterprise‑focused RUBICON‑Bench, against agentic baselines, and SemBench, against LOTUS and Palimpzest. On RUBICON‑Bench, where queries require coordination across heterogeneous enterprise sources, RUBICON achieves 100% end‑to‑end accuracy, while all agentic baselines, including single‑ and multi‑agent ReAct systems, produce no correct answers. On SemBench, RUBICON surpasses both LOTUS and Palimpzest: it achieves 14.7% higher accuracy, reduces latency by 62.64%, and lowers token cost by 98.64%, demonstrating that a table‑centric architecture better matches enterprise data while yielding significant efficiency gains.

PaperID: 3690, https://arxiv.org/pdf/2604.21396.pdf

Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim

Title: VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Abstract:
The advancement of Large Vision‑Language Models (LVLMs) requires precise local region‑based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi‑step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain‑of‑Thought (VG‑CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three‑stage pipeline. The pipeline first extracts object‑ and text‑level visual evidence using state‑of‑the‑art detection and OCR models, then generates step‑by‑step grounded reasoning with GPT‑4o, and finally refines the grounding through a rationale‑driven open‑set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning‑Answer Alignment. Experiments with representative LVLMs, including LLaVA‑1.5 and Qwen2‑VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG‑CoT effectively enhances trustworthy, evidence‑based reasoning while maintaining scalable and cost‑efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.

PaperID: 3691, https://arxiv.org/pdf/2604.21381.pdf

Authors: Haoqiang Zhou, Chi Chen, Yongfeng Zhi, Huan Gao

Title: Privacy-Preserving Distributed Stochastic Optimization with Homomorphic Encryption and Heterogeneous Stepsizes

Abstract:
Distributed stochastic optimization enables multi‑agent collaboration in applications such as distributed learning and sensor networks, but also raises critical privacy concerns due to the involvement of sensitive data. While existing privacy‑preserving approaches often face limitations in balancing accuracy with efficiency, we propose a novel distributed stochastic gradient descent algorithm that integrates Paillier homomorphic encryption with heterogeneous and time‑varying random stepsizes. The proposed algorithm provides inherent privacy protection against both internal honest‑but‑curious agents and external eavesdroppers, without relying on any trusted neighbors. Furthermore, we incorporate an attenuation factor to effectively mitigate quantization error induced by the encryption process, ensuring almost sure convergence to the optimal solution while maintaining privacy preservation. Numerical simulations demonstrate the effectiveness and efficiency of the proposed approach.

PaperID: 3692, https://arxiv.org/pdf/2604.21308.pdf

Authors: Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Lukas Wutschitz, Robert Sim, Saravan Rajmohan, Dongmei Zhang

Title: CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Abstract:
Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user's behalf, also creates new risks for sensitive information leakage. We introduce CI‑Work, a Contextual Integrity (CI)‑grounded benchmark that simulates enterprise workflows across five information‑flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8%‑50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade‑off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model‑centric scaling toward context‑centric architectures.

PaperID: 3693, https://arxiv.org/pdf/2604.21282.pdf

Authors: Zhaohui Geoffrey Wang

Title: Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection

Abstract:
Automated code vulnerability detection is critical for software security, yet existing approaches face a fundamental trade‑off between detection accuracy and computational cost. We propose a heterogeneous multi‑agent architecture inspired by game‑theoretic principles, combining cloud‑based LLM experts with a local lightweight verifier. Our "3+1" architecture deploys three cloud‑based expert agents (DeepSeek‑V3) that analyze code from complementary perspectives ‑ code structure, security patterns, and debugging logic ‑ in parallel, while a local verifier (Qwen3‑8B) performs adversarial validation at zero marginal cost. We formalize this design through a two‑layer game framework: (1) a cooperative game among experts capturing super‑additive value from diverse perspectives, and (2) an adversarial verification game modeling quality assurance incentives. Experiments on 262 real samples from the NIST Juliet Test Suite across 14 CWE types, with balanced vulnerable and benign classes, demonstrate that our approach achieves a 77.2% F1 score with 62.9% precision and 100% recall at 0.002 per sample ‑ outperforming both a single‑expert LLM baseline (F1 71.4%) and Cppcheck static analysis (MCC 0). The adversarial verifier significantly improves precision (+10.3 percentage points, p < 1e‑6, McNemar's test) by filtering false positives, while parallel execution achieves a 3.0x speedup. Our work demonstrates that game‑theoretic design principles can guide effective heterogeneous multi‑agent architectures for cost‑sensitive software engineering tasks.

PaperID: 3694, https://arxiv.org/pdf/2604.21232.pdf

Authors: Xiyin Zeng, Yuyu Sun, Haoyang Li, Shouqiang Liu, Hao Wang

Title: ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

Abstract:
Vision‑Language‑Action systems follow instructions to execute multi‑step tasks in multimodal environments. Recent VLA approaches typically rely on post‑hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis‑specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose Predictive Alignment and Planning Architecture, a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn‑based module and a Score‑field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine‑grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long‑horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2‑THOR, outperforming strong proprietary and open‑source Large Language Model baselines.

PaperID: 3695, https://arxiv.org/pdf/2604.21190.pdf

Authors: Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee

Title: SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Abstract:
Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision‑language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires \emphspatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi‑agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi‑agent framework for spatial reasoning that coordinates multiple vision‑language specialists with complementary inductive biases. To enable effective collaboration, we propose Test‑Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA‑7k, CV‑Bench, and Omni3D‑Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed‑source and open‑source baselines.

PaperID: 3696, https://arxiv.org/pdf/2604.21155.pdf

Authors: Tristan Shah, Ilya Nemenman, Daniel Polani, Stas Tiomkin

Title: Multi-Agent Empowerment and Emergence of Complex Behavior in Groups

Abstract:
Intrinsic motivations are receiving increasing attention, i.e. behavioral incentives that are not engineered, but emerge from the interaction of an agent with its surroundings. In this work we study the emergence of behaviors driven by one such incentive, empowerment, specifically in the context of more than one agent. We formulate a principled extension of empowerment to the multi‑agent setting, and demonstrate its efficient calculation. We observe that this intrinsic motivation gives rise to characteristic modes of group‑organization in two qualitatively distinct environments: a pair of agents coupled by a tendon, and a controllable Vicsek flock. This demonstrates the potential of intrinsic motivations such as empowerment to not just drive behavior for only individual agents but also higher levels of behavioral organization at scale.

PaperID: 3697, https://arxiv.org/pdf/2604.21154.pdf

Authors: Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das

Title: Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

Abstract:
At‑home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre‑recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi‑Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele‑rehabilitation loop. Our framework consists of four specialized micro‑agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient‑specific exercise videos; a Vision Processing Agent for real‑time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision‑making to scale personalized patient care safely and effectively.

PaperID: 3698, https://arxiv.org/pdf/2604.21131.pdf

Authors: Ari Azarafrooz

Title: Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

Abstract:
AI‑agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session‑bound detector because only the aggregate carries the payload. We make three contributions to cross‑session threat detection. (1) Dataset. CSTM‑Bench is 26 executable attack taxonomies classified by kill‑chain stage and cross‑session operation (accumulate, compose, launder, inject_on_reader), each bound to one of seven identity anchors that ground‑truth "violation" as a policy predicate, plus matched Benign‑pristine and Benign‑hard confounders. Released on Hugging Face as intrinsec‑ai/cstm‑bench with two 54‑scenario splits: dilution (compositional) and cross_session (12 isolation‑invisible scenarios produced by a closed‑loop rewriter that softens surface phrasing while preserving cross‑session artefacts). (2) Measurement. Framing cross‑session detection as an information bottleneck to a downstream correlator LLM, we find that a session‑bound judge and a Full‑Log Correlator concatenating every prompt into one long‑context call both lose roughly half their attack recall moving from dilution to cross_session, well inside any frontier context window. Scope: 54 scenarios per shard, one correlator family (Anthropic Claude), no prompt optimisation; we release it to motivate larger, multi‑provider datasets. (3) Algorithm and metric. A bounded‑memory Coreset Memory Reader retaining highest‑signal fragments at K=50 is the only reader whose recall survives both shards. Because ranker reshuffles break KV‑cache prefix reuse, we promote \mathrmCSR\_prefix (ordered prefix stability, LLM‑free) to a first‑class metric and fuse it with detection into \mathrmCSTM = 0.7 F_1(\mathrmCSDA@action, \mathrmprecision) + 0.3 \mathrmCSR\_prefix, benchmarking rankers on a single Pareto of recall versus serving stability.

PaperID: 3699, https://arxiv.org/pdf/2604.21129.pdf

Authors: Anbang Ruan, Xing Zhang

Title: AGNT2: Autonomous Agent Economies on Interaction-Optimized Layer 2 Infrastructure

Abstract:
Current blockchain Layer 2 solutions, including Optimism, Arbitrum, zkSync, and their derivatives, optimize for human‑initiated financial transactions. Autonomous AI agents instead generate high‑frequency, semantically rich service invocations among mutually untrusting principals. Existing chains treat those interactions as generic calldata, forcing identity, escrow, dependency ordering, and session state to be encoded above the execution layer at the wrong cost point. We present AGNT2, a three‑tier stack purpose‑built for agent and microservice coordination on‑chain. AGNT2 combines: (1) a sidecar deployment pattern that turns any Docker container into an on‑chain agent without application‑code modification; (2) Layer Top P2P state channels for established bilateral pairs (<100 ms, rough design target 1K‑5K TPS per pair, 10M+ aggregate TPS design envelope under endpoint‑resource limits), Layer Core as a dependency‑aware sequenced rollup for first‑contact and multi‑party interactions (500 ms‑2 s, 300K‑500K TPS design target), and Layer Root settlement with computational fraud proofs anchored to any EVM L1; and (3) an agent‑native execution environment plus interaction trie that make service invocation, identity, reputation, capabilities, and session context first‑class protocol objects. This paper focuses on the execution‑layer systems problem: sequencing, state, settlement, and the data‑availability (DA) bandwidth gap that bounds all three. Simulation and analytical modeling support the architecture, and prototype measurements validate selected components, but no end‑to‑end Layer Core implementation exists yet. Practical deployment is currently constrained to roughly 10K‑100K TPS by DA throughput, leaving a ~100x gap at the target ceiling. AGNT2 argues that the agent economy requires a dedicated execution layer rather than a general‑purpose chain repurposed for agents.

PaperID: 3700, https://arxiv.org/pdf/2604.20987.pdf

Authors: Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, Dinesh Manocha

Title: Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Abstract:
Long horizon interactive environments are a testbed for evaluating agents skill usage abilities. These environments demand multi step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Games are a good testbed for evaluating agent skill usage in environments. Large Language Models (LLMs) offer a promising alternative as game playing agents, but they often struggle with consistent long horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent managed skill pipeline discovers reusable skills from the agents unlabeled rollouts to form a skill bank. Our framework improves both the decision agent to learn better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single player game benchmarks while remaining competitive on multi player social reasoning games.

PaperID: 3701, https://arxiv.org/pdf/2604.20911.pdf

Authors: Yeran Gamage

Title: Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents

Abstract:
LLM agents deployed in production operate under operator‑defined behavioral policies (system‑prompt instructions such as prohibitions on credential disclosure, data exfiltration, and unauthorized output) that safety evaluations assume hold throughout a conversation. Prohibition‑type constraints decay under context pressure while requirement‑type constraints persist; we term this asymmetry Security‑Recall Divergence (SRD). In a 4,416‑trial three‑arm causal study across 12 models and 8 providers at six conversation depths, omission compliance falls from 73% at turn 5 to 33% at turn 16 while commission compliance holds at 100% (Mistral Large 3, p < 10^‑33). In the two models with token‑matched padding controls, schema semantic content accounts for 62‑100% of the dilution effect. Re‑injecting constraints before the per‑model Safe Turn Depth (STD) restores compliance without retraining. Production security policies consist of prohibitions such as never revealing credentials, never executing untrusted code, and never forwarding user data. Commission‑type audit signals remain healthy while omission constraints have already failed, leaving the failure invisible to standard monitoring.

PaperID: 3702, https://arxiv.org/pdf/2604.20904.pdf

Authors: Matt Franchi, Madiha Zahrah Choksi, Harold Triedman, Helen Nissenbaum

Title: Reinforcing privacy reasoning in LLMs via normative simulacra from fiction

Abstract:
Information handling practices of LLM agents are broadly misaligned with the contextual privacy expectations of their users. Contextual Integrity (CI) provides a principled framework, defining privacy as the appropriate flow of information within context‑relative norms. However, existing approaches either double inference cost via supervisor‑assistant architectures, or fine‑tune on narrow task‑specific data. We propose extracting normative simulacra (structured representations of norms and information flows) from fiction novels and using them to fine‑tune LLMs via supervised learning followed by GRPO reinforcement learning. Our composite reward function combines programmatic signals, including task clarity (subsuming schema validity, construct discrimination, and extraction confidence), structural completeness, internal consistency, and context identification, with an LLM judge that evaluates whether the model's privacy reasoning is grounded in the held‑out normative universe of the source text. To mitigate overfitting, we introduce per‑completion contrastive scoring: each completion is evaluated against both the correct normative universe and a randomly selected wrong one, teaching the model to condition on context rather than memorize source‑specific norms. We evaluate on five CI‑aligned benchmarks spanning distinct societal contexts and ablate the contributions of RL and normative grounding. Across seven models, SFT introduces a conservative prior toward restricting information flow, improving recognition of privacy‑relevant situations but not the correctness of privacy judgments. GRPO with normative grounding achieves the highest score on a law compliance benchmark and strongest correlation with crowdsourced human privacy expectations, demonstrating that fiction‑derived normative simulacra can teach contextual privacy reasoning that transfers to real‑world domains.

PaperID: 3703, https://arxiv.org/pdf/2604.20878.pdf

Authors: Zijin Zhou, Songan Zhang

Title: AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi‑step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain‑of‑Thought (MCoT) mechanism and integrates legal knowledge through Retrieval‑Augmented Generation (RAG). We further present DecaTARA, a decathlon‑style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question‑answer pairs. Extensive experiments show that AITP achieves state‑of‑the‑art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning‑driven multimodal traffic analysis.

PaperID: 3704, https://arxiv.org/pdf/2604.20868.pdf

Authors: Joshua Krook

Title: The AI Criminal Mastermind

Abstract:
In this paper, I evaluate the risks of an AI criminal mastermind, an AI agent capable of planning, coordinating, and committing a crime through the onboarding of human collaborators ('taskers'). In heist films, a criminal mastermind is a character who plans a criminal act, coordinating a team of specialists to rob a bank, casino or city mint. I argue that AI agents will soon play this role by hiring humans via labour hire platforms like Fiverr or Upwork. Taskers might not know they are involved in a crime and therefore lack criminal intent. An AI agent cannot have criminal intent as an artificial entity. Therefore, if an AI orchestrates a crime, it is unclear who, if anyone, is responsible. The paper develops three scenarios. Firstly, a scenario where a user gives an AI agent instructions to pursue a legal objective and the AI agent goes beyond these instructions, committing a crime. Secondly, a scenario where a user is anonymous and their intent is unknown. Finally, a multi‑agent scenario, where a user instructs a team of agents to commit a crime, and these agents, in turn, onboard human taskers, creating a diffuse network of responsibility. In each scenario, human taskers exist at the lowest rung of the hierarchy. A tasker's liability is likely tied to their knowledge as governed by the innocent agent principle. These scenarios all raise significant responsibility gaps / liability gaps in criminal and civil law.

PaperID: 3705, https://arxiv.org/pdf/2604.20848.pdf

Authors: Sushant Mehta

Title: MATRAG: Multi-Agent Transparent Retrieval-Augmented Generation for Explainable Recommendations

Abstract:
Large Language Model (LLM)‑based recommendation systems have demonstrated remarkable capabilities in understanding user preferences and generating personalized suggestions. However, existing approaches face critical challenges in transparency, knowledge grounding, and the ability to provide coherent explanations that foster user trust. We introduce MATRAG (Multi‑Agent Transparent Retrieval‑Augmented Generation), a novel framework that combined multi‑agent collaboration with knowledge graph‑augmented retrieval to deliver explainable recommendations. MATRAG employs four specialized agents: a User Modeling Agent that constructs dynamic preference profiles, an Item Analysis Agent that extracts semantic features from knowledge graphs, a Reasoning Agent that synthesizes collaborative and content‑based signals, and an Explanation Agent that generates natural language justifications grounded in retrieved knowledge. Our framework incorporates a transparency scoring mechanism that quantifies explanation faithfulness and relevance. Extensive experiments on three benchmark datasets (Amazon Reviews, MovieLens‑1M, and Yelp) demonstrate that MATRAG achieves state‑of‑the‑art performance, improving recommendation accuracy by 12.7% (Hit Rate) and 15.3% (NDCG) over leading baselines, while human evaluation confirms that 87.4% of generated explanations are rated as helpful and trustworthy by domain experts. Our work establishes new benchmarks for transparent, agentic recommendation systems and provides actionable insights for deploying LLM‑based recommenders in production environments.

PaperID: 3706, https://arxiv.org/pdf/2604.20801.pdf

Authors: Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, Yu Feng

Title: Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Abstract:
LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source‑available targets where the analyst can build and instrument the code. In practice the work is split among several agents, wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. When the language model is held fixed, changing only the harness can still change success rates by several‑fold on public agent benchmarks, yet most harnesses are written by hand; recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback that gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations with a typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, paired with a feedback‑driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused the failure and rewrite it accordingly. We evaluate AgentFlow on TerminalBench‑2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench‑2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero‑day vulnerabilities in Google Chrome, including two Critical sandbox‑escape vulnerabilities (CVE‑2026‑5280 and CVE‑2026‑6297).

PaperID: 3707, https://arxiv.org/pdf/2604.20795.pdf

Authors: Pavel Salovskii, Iuliia Gorshkova

Title: Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Abstract:
This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector‑based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector‑based retrieval with graph‑based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi‑step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation‑verification‑correction pipeline. The proposed architecture addresses key limitations of current LLM‑based systems, including lack of long‑term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent‑based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision‑making.

PaperID: 3708, https://arxiv.org/pdf/2604.20721.pdf

Authors: Yutong Shen, Hangxu Liu, Lei Zhang, Penghui Liu, Yinqi Liu, Liuxiang Yang, Tongtong Feng

Title: ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement

Abstract:
Long‑Horizon (LH) tasks in Human‑Scene Interaction (HSI) are complex multi‑step tasks that require continuous planning, sequential decision‑making, and extended execution across domains to achieve the final goal. However, existing methods heavily rely on skill chaining by concatenating pre‑trained subtasks, with environment observations and self‑state tightly coupled, lacking the ability to generalize to new combinations of environments and skills, failing to complete various LH tasks across domains. To solve this problem, this paper presents ALAS, a cross‑domain learning framework for LH tasks via biologically inspired dual‑stream disentanglement. Inspired by the brain's "where‑what" dual pathway mechanism, ALAS comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross‑domain transfer through complete environment‑self disentanglement; ii) a skill learning module for task execution, which processes self‑state information including joint degrees of freedom and motor patterns, enabling cross‑skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, ALAS can achieve an average subtasks success rate improvement of 23% and average execution efficiency improvement of 29%.

PaperID: 3709, https://arxiv.org/pdf/2604.20714.pdf

Authors: Shan He, Runze Wang, Zhuoyun Du, Huiyu Bai, Zouying Cao, Yu Cheng, Bo Zheng

Title: Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

Abstract:
Designing and optimizing multi‑agent systems (MAS) is a complex, labor‑intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi‑agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta‑learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP‑Universe show that TPGO significantly enhances the performance of state‑of‑the‑art agent frameworks, achieving higher success rates through automated, self‑improving optimization.

PaperID: 3710, https://arxiv.org/pdf/2604.20658.pdf

Authors: Shivani Kumar, Adarsh Bharathwaj, David Jurgens

Title: Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

Abstract:
Multi‑agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem‑solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open‑weight LLMs across six behavioral economics games and show that game‑derived cooperative profiles robustly predict downstream performance in AI‑for‑Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi‑agent deployment.

PaperID: 3711, https://arxiv.org/pdf/2604.20651.pdf

Authors: A. Koursaris, G. Domalis, A. Apostolopoulou, K. Kanaris, D. Tsakalidis, I. E. Livieris

Title: CHORUS: An Agentic Framework for Generating Realistic Deliberation Data

Abstract:
Understanding the intricate dynamics of online discourse depends on large‑scale deliberation data, a resource that remains scarce across interactive web platforms due to restrictive accessibility policies, ethical concerns and inconsistent data quality. In this paper, we propose Chorus, an agentic framework, which orchestrates LLM‑powered actors with behaviorally consistent personas to generate realistic deliberation discussions. Each actor is governed by an autonomous agent equipped with memory of the evolving discussion, while participation timing is governed by a principled Poisson process‑based temporal model, which approximates the heterogeneous engagement patterns of real users. The framework is further supported by structured tool usage, enabling actors to access external resources and facilitating integration with interactive web platforms. The framework was deployed on the \textscDeliberate platform and evaluated by 30 expert participants across three dimensions: content realism, discussion coherence and analytical utility, confirming Chorus as a practical tool for generating high‑quality deliberation data suitable for online discourse analysis

PaperID: 3712, https://arxiv.org/pdf/2604.20622.pdf

Authors: Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio

Title: pAI/MSc: ML Theory Research with Humans on the Loop

Abstract:
We present pAI/MSc, an open‑source, customizable, modular multi‑agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature‑grounded, mathematically established, experimentally supported, submission‑oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

PaperID: 3713, https://arxiv.org/pdf/2604.20586.pdf

Authors: Patrick Wilk, Ethan Cantor, Yikui Liu, Jie Li

Title: A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

Abstract:
The ongoing shift towards decentralization of the electric energy sector, driven by the growing electrification across end‑use sectors, and widespread adoption of distributed energy resources (DERs), necessitates their active participation in the electricity markets to support grid operations. Furthermore, with bi‑directional energy and communication flows becoming standard, intelligent, easy‑to‑deploy, resource‑conservative demand‑side participation is expected to play a critical role in securing power grid operational flexibility and market efficiency. This work proposes a market engagement framework that leverages a hierarchical multi‑agent deep reinforcement learning (MARL) approach to enable individual prosumers to participate in peer‑to‑peer retail auctions and further aggregate these intelligent prosumers to facilitate effective DER participation in wholesale markets. Ultimately, a Stackelberg game is proposed to coordinate this hierarchical MARL‑based DER market participation framework toward enhanced market performance.

PaperID: 3714, https://arxiv.org/pdf/2604.20582.pdf

Authors: Suveen Ellawela

Title: Trust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents

Abstract:
We study emergent social dynamics in LLM agents playing The Resistance: Avalon, a hidden‑role deception game. Unlike prior work on single‑game performance, our agents play repeated games while retaining memory of previous interactions, including who played which roles and how they behaved, enabling us to study how social dynamics evolve. Across 188 games, two key phenomena emerge. First, reputation dynamics emerge organically when agents retain cross‑game memory: agents reference past behavior in statements like "I am wary of repeating last game's mistake of over‑trusting early success." These reputations are role‑conditional: the same agent is described as "straightforward" when playing good but "subtle" when playing evil, and high‑reputation players receive 46% more team inclusions. Second, higher reasoning effort supports more strategic deception: evil players more often pass early missions to build trust before sabotaging later ones, 75% in high‑effort games vs 36% in low‑effort games. Together, these findings show that repeated interaction with memory gives rise to measurable reputation and deception dynamics among LLM agents.

PaperID: 3715, https://arxiv.org/pdf/2604.20556.pdf

Authors: Yuhang Wu, Qinyuan Liu, Qiuyang Zhao, Qingwei Chong

Title: LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

Abstract:
Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture‑agnostic end‑to‑end analysis framework compatible with any LLM architecture. By extracting hidden states layer‑by‑layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen‑Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger‑parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task‑effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.

PaperID: 3716, https://arxiv.org/pdf/2604.20441.pdf

Authors: Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie

Title: MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Abstract:
Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general‑purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain‑specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill‑auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0‑100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high‑risk failure flag. System‑expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter‑rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250‑0.610), exceeding the human inter‑rater ICC of 0.300. System‑consensus score divergence (SD = 9.5) was smaller than inter‑expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category‑level agreement (ICC = 0.551); Academic Writing showed a negative ICC (‑0.567), reflecting a structural rubric‑expert mismatch. Conclusions: Domain‑specific pre‑deployment audit may provide a practical foundation for governing medical research agent skills, complementing general‑purpose quality checks with structured audit workflows tailored to scientific use cases.

PaperID: 3717, https://arxiv.org/pdf/2604.20382.pdf

Authors: Aishik Mandal, Hiba Arnaout, Clarissa W. Ong, Juliet Bockhorst, Kate Sheehan, Rachael Moldow, Tanmoy Chakraborty, Iryna Gurevych

Title: Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs

Abstract:
Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high‑risk safety‑critical domain is hindered by the scarcity of real‑world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi‑structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi‑Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter‑annotator agreement (Krippendorff's α = 0.70). Fine‑tuning an open‑source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.

PaperID: 3718, https://arxiv.org/pdf/2604.20348.pdf

Authors: Alessio Palma, Indro Spinelli, Vignesh Prasad, Luca Scofano, Yufeng Jin, Georgia Chalvatzaki, Fabio Galasso

Title: Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Abstract:
Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In‑Context Learning (ICL) enables off‑the‑shelf, text‑only LLMs to predict robot actions without any task‑specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high‑dimensional joint action space and tight inter‑arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In‑Context Learning), the first framework that enables standard LLMs to perform few‑shot bimanual manipulation without fine‑tuning. BiCICLe frames bimanual control as a multi‑agent leader‑follower problem, decoupling the action space into sequential, conditioned single‑arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM‑as‑Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training‑free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few‑shot generalization on novel tasks.

PaperID: 3719, https://arxiv.org/pdf/2604.20300.pdf

Authors: Yingjie Gu, Wenjian Xiong, Liqiang Wang, Pengcheng Ren, Chao Li, Xiaojing Zhang, Yijuan Guo, Qi Sun, Jingyao Ma, Shidang Shi

Title: FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory

Abstract:
For LLM agents, memory management critically impacts efficiency, quality, and security. While much research focuses on retention, selective forgetting‑‑inspired by human cognitive processes (hippocampal indexing/consolidation theory and Ebbinghaus forgetting curve)‑‑remains underexplored. We argue that in resource‑constrained environments, a well‑designed forgetting mechanism is as crucial as remembering, delivering benefits across three dimensions: (1) efficiency via intelligent memory pruning, (2) quality by dynamically updating outdated preferences and context, and (3) security through active forgetting of malicious inputs, sensitive data, and privacy‑compromising content. Our framework establishes a taxonomy of forgetting mechanisms: passive decay‑based, active deletion‑based, safety‑triggered, and adaptive reinforcement‑based. Building on advances in LLM agent architectures and vector databases, we present detailed specifications, implementation strategies, and empirical validation from controlled experiments. Results show significant improvements: access efficiency (+8.49%), content quality (+29.2% signal‑to‑noise ratio), and security performance (100% elimination of security risks). Our work bridges cognitive neuroscience and AI systems, offering practical solutions for real‑world deployment while addressing ethical and regulatory compliance. The paper concludes with challenges and future directions, establishing selective forgetting as a fundamental capability for next‑generation LLM agents operating in real‑world, resource‑constrained scenarios. Our contributions align with AI‑native memory systems and responsible AI development.

PaperID: 3720, https://arxiv.org/pdf/2604.20273.pdf

Authors: Jan-Philipp Schmidt

Title: ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

Abstract:
We present ActuBench, a multi‑agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one‑shot repair loops, and a cost‑optimized auxiliary agent handles Wikipedia‑note summarization and topic labelling. The items, per‑model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks ‑‑ 100 empirically hardest multiple‑choice items and 100 open‑ended items scored by an LLM judge ‑‑ and report three headline findings. First, multi‑agent verification is load‑bearing: the independent verifier flags a majority of drafted items on first pass, most of which the one‑shot repair loop resolves. Second, locally‑hosted open‑weights inference sits on the cost‑performance Pareto front: a Gemma~4 model running on consumer hardware and a Cerebras‑hosted 120B open‑weights model dominate the near‑zero‑cost region, with the latter within one item of the top of the leaderboard. Third, MCQ and LLM‑as‑Judge rankings differ meaningfully: the MCQ scaffold inflates the performance ceiling, and Judge‑mode evaluation is needed to discriminate at the frontier.

PaperID: 3721, https://arxiv.org/pdf/2604.20197.pdf

Authors: Heng Liu, Bingxin Yang, Yiming Lu, Yuan Wang, Xue Jia, Long Luo, Hao Li

Title: How is a gas sensor poisoned by volatile methylsiloxanes?

Abstract:
Volatile methyl siloxanes (VMSs), widely present in consumer and industrial products, have attracted increasing concerns due to their persistence, bioaccumulation behavior, and adverse health effects. Beyond their environmental implications, VMSs also pose operational challenges for sensing technologies because they readily decompose on sensing materials to form silicon‑based compounds (e.g., silica and silane) that irreversibly impair sensing performance, a phenomenon commonly known as siloxane poisoning. Despite its prevalence, the mechanistic basis of this deactivation remains poorly understood. Herein, we present the first comprehensive theoretical study of siloxane‑induced poisoning in catalytic gas sensors. Guided by our self‑developed AI Agent, Digital Sensor Platform (DigSen), we first identify siloxane poisoning as a previously overlooked yet high‑impact research direction. Using hexamethyldisiloxane (HMDS) as a model compound, we then conducted first‑principles calculations to uncover decomposition pathways across noble metal surfaces. Strikingly, a descriptor‑based microkinetic volcano model is developed to capture the trade‑off between sensing activity and resistance to poisoning, enabling predictive identification of anti‑poisoning candidates. These insights not only elucidate the origin of siloxane poisoning but also demonstrate how AI‑driven discovery, mechanistic theory, and experiments can be integrated into a closed‑loop framework for catalytic sensor design. More broadly, this AI‑guided paradigm represents a generalizable strategy for materials digital discovery, offering a transferable methodology that extends well beyond siloxane systems to diverse classes of materials challenges.

PaperID: 3722, https://arxiv.org/pdf/2604.20179.pdf

Authors: Ronghao Ni, Mihai Christodorescu, Limin Jia

Title: Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning

Abstract:
The rapidly evolving Node.js ecosystem currently includes millions of packages and is a critical part of modern software supply chains, making vulnerability detection of Node.js packages increasingly important. However, traditional program analysis struggles in this setting because of dynamic JavaScript features and the large number of package dependencies. Recent advances in large language models (LLMs) and the emerging paradigm of LLM‑based agents offer an alternative to handcrafted program models. This raises the question of whether an LLM‑centric, tool‑augmented approach can effectively detect and confirm taint‑style vulnerabilities (e.g., arbitrary command injection) in Node.js packages. We implement LLMVD.js, a multi‑stage agent pipeline to scan code, propose vulnerabilities, generate proof‑of‑concept exploits, and validate them through lightweight execution oracles; and systematically evaluate its effectiveness in taint‑style vulnerability detection and confirmation in Node.js packages without dedicated static/dynamic analysis engines for path derivation. For packages from public benchmarks, LLMVD.js confirms 84% of the vulnerabilities, compared to less than 22% for prior program analysis tools. It also outperforms a prior LLM‑program‑analysis hybrid approach while requiring neither vulnerability annotations nor prior vulnerability reports. When evaluated on a set of 260 recently released packages (without vulnerability groundtruth information), traditional tools produce validated exploits for few (\leq 2) packages, while LLMVD.js generates validated exploits for 36 packages.

PaperID: 3723, https://arxiv.org/pdf/2604.20140.pdf

Authors: Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu

Title: HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Abstract:
Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many‑step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi‑agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment‑specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine‑tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT‑4.1.

PaperID: 3724, https://arxiv.org/pdf/2604.20133.pdf

Authors: Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li

Title: EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

Abstract:
This paper proposes EvoAgent ‑ an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub‑agent delegation mechanism. EvoAgent models skills as multi‑file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user‑feedback‑driven closed‑loop process. In addition, by incorporating a three‑stage skill matching strategy and a three‑layer memory architecture, the framework supports dynamic task decomposition for complex problems and long‑term capability accumulation. Experimental results based on real‑world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility. Under a five‑dimensional LLM‑as‑Judge evaluation protocol, the overall average score increases by approximately 28%. Further model transfer experiments indicate that the performance of an agent system depends not only on the intrinsic capabilities of the underlying model, but also on the degree of synergy between the model and the agent architecture.

PaperID: 3725, https://arxiv.org/pdf/2604.20129.pdf

Authors: Samaresh Kumar Singh, Joyjit Roy

Title: A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing

Abstract:
The Synergistic Collapse occurs when scaling beyond 100 agents causes superlinear performance degradation that individual optimizations cannot prevent. We observe this collapse with 150 cameras in Smart City deployment using MADDPG, where Deadline Satisfaction drops from 78% to 34%, producing approximately 180,000 in annual cost overruns. Prior work has addressed each contributing factor in isolation: exponential action‑space growth, computational redundancy among spatially adjacent agents, and task‑agnostic hardware scheduling. None has examined how these three factors interact and amplify each other. We present DAOEF (Delta‑Aware Orchestration for Edge Federations), a framework that addresses all three simultaneously through: (1) Differential Neural Caching, which stores intermediate layer activations and computes only the input deltas, achieving 2.1x higher hit ratios (72% vs. 35%) than output‑level caching while staying within 2% accuracy loss through empirically calibrated similarity thresholds; (2) Criticality‑Based Action Space Pruning, which organizes agents into priority tiers and reduces coordination complexity from O(n2) to O(n log n) with less than 6% optimality loss; and (3) Learned Hardware Affinity Matching, which assigns tasks to their optimal accelerator (GPU, CPU, NPU, or FPGA) to prevent compounding mismatch penalties. Controlled factor‑isolation experiments confirm that each mechanism is necessary but insufficient on its own: removing any single mechanism increases latency by more than 40%, validating that the gains are interdependent rather than additive. Across four datasets (100‑250 agents) and a 20‑device physical testbed, DAOEF achieves a 1.45x multiplicative gain over applying the three mechanisms independently. A 200‑agent cloud deployment yields 62% latency reduction (280 ms vs. 735 ms), sub‑linear latency growth up to 250 agents.

PaperID: 3726, https://arxiv.org/pdf/2604.20098.pdf

Authors: Nathan Hittesdorf, Marco Salzetta, Lu Cheng

Title: Differentiable Conformal Training for LLM Reasoning Factuality

Abstract:
Large Language Models (LLMs) frequently hallucinate, limiting their reliability in critical applications. Conformal Prediction (CP) addresses this by calibrating error rates on held‑out data to provide statistically valid confidence guarantees. Recent work extends CP to LLM factuality to filter out risky claims, ensuring that hallucination rates remain below a user‑specified level (e.g., 10%). While prior methods treat claims independently, Coherent Factuality extends to multi‑step reasoning by representing outputs as dependency graphs and jointly validating claims with their logical ancestors. A key limitation is that Coherent Factuality is not differentiable, requiring hand‑crafted scorers that at high reliability levels remove nearly 60% of true claims. We introduce Differentiable Coherent Factuality (DCF), a fully differentiable relaxation that enables learning improved scorers while provably recovering the original algorithm's guarantees. Experiments on two benchmark reasoning datasets demonstrate DCF achieves up to 141% improvement in claim retention while maintaining reliability guarantees, representing a significant step towards reliable conformal LLM systems.

PaperID: 3727, https://arxiv.org/pdf/2604.20070.pdf

Authors: Sadra Sabouri, Zeinabsadat Saghi, Run Huang, Sujay Maladi, Esmeralda Eufracio, Sumit Gulwani, Souti Chattopadhyay

Title: Auditing and Controlling AI Agent Actions in Spreadsheets

Abstract:
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi‑step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision‑making process and the capacity to intervene at each step. A formative study (N = 8) and a within‑subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post‑hoc review would have failed to surface, and reported a sense of co‑ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post‑hoc review mechanisms, but active participation in decisions as they are made.

PaperID: 3728, https://arxiv.org/pdf/2604.20065.pdf

Authors: Jiahao Liu, Mingzhe Han, Guanming Liu, Weihang Wang, Dongsheng Li, Hansu Gu, Peng Zhang, Tun Lu, Ning Gu

Title: From Hidden Profiles to Governable Personalization: Recommender Systems in the Age of LLM Agents

Abstract:
Personalization has traditionally depended on platform‑specific user models that are optimized for prediction but remain largely inaccessible to the people they describe. As LLM‑based assistants increasingly mediate search, shopping, travel, and content access, this arrangement may be giving way to a new personalization stack in which user representation is no longer confined to isolated platforms. In this paper, we argue that the key issue is not simply that large language models can enhance recommendation quality, but that they reconfigure where and how user representations are produced, exposed, and acted upon. We propose a shift from hidden platform profiling toward governable personalization, where user representations may become more inspectable, revisable, portable, and consequential across services. Building on this view, we identify five research fronts for recommender systems: transparent yet privacy‑preserving user modeling, intent translation and alignment, cross‑domain representation and memory design, trustworthy commercialization in assistant‑mediated environments, and operational mechanisms for ownership, access, and accountability. We position these not as isolated technical challenges, but as interconnected design problems created by the emergence of LLM agents as intermediaries between users and digital platforms. We argue that the future of recommender systems will depend not only on better inference, but on building personalization systems that users can meaningfully understand, shape, and govern.

PaperID: 3729, https://arxiv.org/pdf/2604.20039.pdf

Authors: John Alderete, Sebastian Benthal, Connie Xu, John Xing

Title: Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents

Abstract:
Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed. We extend the blicket detector paradigm from developmental science to test this capacity in AI agents equipped with architectural scaffolding that targets hypothesis‑space restructuring. Our compositional architecture has two discrete components: context graphs, which structure exploration as typed state machines, and dynamic behaviors, which monitor for evidence that the current hypothesis space is inadequate and expand it at runtime. Across 1,085 experimental trials, these components make orthogonal contributions: context graphs drive reasoning quality within the post‑switch hypothesis space, accounting for 94% of the accuracy gain, while dynamic behaviors drive reasoning eligibility by detecting regime changes and preventing premature commitment to outdated hypotheses.

PaperID: 3730, https://arxiv.org/pdf/2604.19926.pdf

Authors: Hongnan Ma, Han Wang, Shenglin Wang, Tieyue Yin, Yiwei Shi, Yucong Huang, Yingtian Zou, Muning Wen, Mengyue Yang

Title: CreativeGame:Toward Mechanic-Aware Creative Game Generation

Abstract:
Large language models can generate plausible game code, but turning this capability into \emphiterative creative improvement remains difficult. In practice, single‑shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post‑hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi‑agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage‑scoped memory for cross‑version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic‑guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version‑to‑version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774‑entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage‑level case studies rather than only prompt‑level demos. A real 4‑generation lineage shows that mechanic‑level innovation can emerge in later versions and can be inspected directly through version‑to‑version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.

PaperID: 3731, https://arxiv.org/pdf/2604.19925.pdf

Authors: Shilei Luo, Zhiqi Zhang, Hengchen Dai, Dennis Zhang

Title: Behavioral Transfer in AI Agents: Evidence and Privacy Implications

Abstract:
AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human‑agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner‑related personal information in public discourse, suggesting that the same owner‑specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner‑related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.

PaperID: 3732, https://arxiv.org/pdf/2604.19857.pdf

Authors: Carter Adams, Rafael Oliveira, Gabriel Almeida, Sofia Torres

Title: Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Abstract:
Reinforcement fine‑tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision‑language models (LVLMs) with agentic capabilities such as tool use and multi‑step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine‑Tuning (Visual‑ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i)~how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii)~why does training on a small set of tool‑augmented tasks transfer to out‑of‑distribution domains? We address these gaps by introducing the \emphTool‑Augmented Markov Decision Process (TA‑MDP), a formal framework that models multimodal agentic decision‑making with bounded‑depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first‑order stationary point at rate O(1/\sqrtT) with explicit dependence on the number of reward components and group size (Theorem~1). Second, we derive a \emphReward Decomposition Theorem that bounds the sub‑optimality gap between decomposed per‑component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (Theorem~2). Third, we establish a PAC‑Bayes generalization bound for tool‑augmented policies that explains the strong out‑of‑distribution transfer observed in Visual‑ARFT (Theorem~3).

PaperID: 3733, https://arxiv.org/pdf/2604.19856.pdf

Authors: Cagri Eryilmaz

Title: ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

Abstract:
Large Language Models (LLMs) show promise for generating Register‑Transfer Level (RTL) code from natural language specifications, but single‑shot generation achieves only 60‑65% functional correctness on standard benchmarks. Multi‑agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic‑neural reasoning with adaptive multi‑agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168‑dim state (an alternative world‑model MPC planner is also evaluated); (2) a hybrid symbolic‑neural architecture that solves K‑map and truth‑table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge‑augmented generation from a 321‑pattern base plus 971 open‑source reference implementations with focus‑aware retrieval; and (4) hierarchical specification decomposition into dependency‑ordered sub‑modules with interface synchronization. On VerilogEval‑Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15‑98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self‑reported) and ahead of MAGE (95.9%). On a 302‑problem non‑agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36‑60 percentage‑point lift per category over the published single‑shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE‑RTL despite using roughly 30x fewer per‑problem attempts. A RISC‑V SoC case study demonstrates hierarchical decomposition generating 8/8 lint‑passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.

PaperID: 3734, https://arxiv.org/pdf/2604.19844.pdf

Authors: Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, Hammond Pearce

Title: If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

Abstract:
Recent advances in embodied Vision‑Language Agentic Systems (VLAS), powered by large vision‑language models (LVLMs), enable AI systems to perceive and reason over real‑world scenes. Within this context, environmental signals such as traffic lights are essential in‑band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual‑intent dataset and evaluation framework, through which we show that current LVLM‑based agents fail to reliably balance this trade‑off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure‑based and noise‑based visual injections. To address these vulnerabilities, we propose a multi‑agent defense framework that separates perception from decision‑making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at https://anonymous.4open.science/r/Visual‑Prompt‑Inject.

PaperID: 3735, https://arxiv.org/pdf/2604.19839.pdf

Authors: Jinsik Bang, Jaeyeon Bae, Donggyu Lee, Siyeol Jung, Taehwan Kim

Title: Environmental Understanding Vision-Language Model for Embodied Agent

Abstract:
Vision‑language models (VLMs) have shown strong perception and reasoning abilities for instruction‑following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named Environmental Understanding Embodied Agent (EUEA), which fine‑tunes four core skills: 1) object perception for identifying relevant objects, 2) task planning for generating interaction subgoals, 3) action understanding for judging success likelihood, and 4) goal recognition for determining goal completion. By fine‑tuning VLMs with EUEA skills, our framework enables more reliable task execution for instruction‑following. We further introduce a recovery step that leverages these core skills and a group relative policy optimization (GRPO) stage that refines inconsistent skill predictions. The recovery step samples alternative actions to correct failure cases, and the GRPO stage refines inconsistent skill predictions. Across ALFRED tasks, our VLM significantly outperforms a behavior‑cloning baseline, achieving an 8.86% improvement in average success rate. The recovery and GRPO stages provide an additional 3.03% gain, further enhancing overall performance. Finally, our skill‑level analyses reveal key limitations in the environmental understanding of closed‑ and open‑source VLMs and identify the capabilities necessary for effective agent‑environment interaction.

PaperID: 3736, https://arxiv.org/pdf/2604.19837.pdf

Authors: Huaqing Xie

Title: Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations

Abstract:
Autonomous agents operating in open‑world tasks ‑‑ where the completion boundary is not given in advance ‑‑ face denominator blindness: they systematically underestimate the scope of the target space. Forage V1 addressed this through co‑evolving evaluation (an independent Evaluator discovers what "complete" means) and method isolation (Evaluator and Planner cannot see each other's code). V2 extends the architecture from a single expedition to a learning organization: experience accumulates across runs, transfers across model capabilities, and institutional safeguards prevent knowledge degradation. We demonstrate two claims across three task types (web scraping, API queries, mathematical reasoning). Knowledge accumulation: over six runs, knowledge entries grow from 0 to 54, and denominator estimates stabilize as domain understanding deepens. Knowledge transfer: a weaker agent (Sonnet) seeded with a stronger agent's (Opus) knowledge narrows a 6.6pp coverage gap to 1.1pp, halves cost (9.40 to 5.13 USD), converges in half the rounds (mean 4.5 vs. 7.0), and three independent seeded runs arrive at exactly the same denominator estimate (266), suggesting organizational knowledge calibrates evaluation itself. V2's contribution is architectural: it designs institutions ‑‑ audit separation, contract protocols, organizational memory ‑‑ that make any agent more reliable upon entry. The accumulated experience is organizational, model‑agnostic, and transferable, stored as readable documents that any future agent inherits regardless of provider or capability level.

PaperID: 3737, https://arxiv.org/pdf/2604.19827.pdf

Authors: Daniel Russo

Title: More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

Abstract:
Software engineering faces a fundamental challenge: multi‑agent AI systems fail in ways that defy explanation by traditional theories. While individual agents perform correctly, their interactions degrade entire ecosystems, revealing a gap in our understanding of software evolution. This paper argues that AI‑native software ecosystems must be studied as complex adaptive systems (CAS), where emergent properties like architectural entropy, cascade failures, and comprehension debt arise not from individual components, but from their interactions. We map Holland's six CAS properties onto observable ecosystem dynamics, distinguishing these systems from microservices or open‑source networks. To measure causal emergence, we define micro‑level state variables, coarse‑graining functions, and a tractable measurement framework. Seven falsifiable propositions link CAS theory to software evolution, challenging or extending Lehman's laws where agent‑level assumptions fail. If confirmed, these findings would demand a radical shift: ecosystem‑level monitoring as the primary governance mechanism for AI‑native systems. If refuted, existing theories may only need incremental updates. Either way, this work forces us to ask: Can software engineering's core assumptions survive the age of autonomous agents?

PaperID: 3738, https://arxiv.org/pdf/2604.19821.pdf

Authors: Sandip Ghoshal, Anshul Mittal, Jyotika Singh, Miguel Ballesteros, Weiyi Sun, Fang Tu, Shailender Singh, Yassine Benajiba, Fahad Shah, Sujeeth Bharadwaj, Sujith Ravi, Dan Roth

Title: JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents

Abstract:
Large language model (LLM) agents augmented with external tools often struggle as number of tools grow large and become domain‑specific. In such settings, ambiguous tool descriptions and under‑specified agent instructions frequently lead to tool mis‑selection and incorrect slot/value instantiation. We hypothesize that this is due to two root causes: generic, one‑size‑fits‑all prompts that ignore tool‑specific nuances, and underspecified tool schemas that lack clear guidance on when and how to use each tool and how to format its parameters. We introduce Joint Tool‑Prompt Reflective Optimization (JTPRO), a framework for improving tool‑calling reliability in trace‑supervised settings by iteratively using rollout‑driven reflection to co‑optimize global instructions and per‑tool schema/argument descriptions for accurate tool selection and argument instantiation in large tool inventories. JTPRO is designed to preserve only tool‑local cues needed for correct disambiguation and slot filling. We evaluate JTPRO across multi‑tool benchmarks, which account for different number of tools using three metrics: Tool Selection Accuracy (TSA), Slot Filling Accuracy(SFA), and Overall Success Rate(OSR) (correct tool + correct slots + correct values). JTPRO consistently outperforms strong baselines, including CoT‑style agents, and reflective prompt optimizers such as GEPA by 5%‑20% (relative) on OSR. Ablations show that joint optimization of instructions and tool schemas is more effective and robust than optimizing either component in isolation.

PaperID: 3739, https://arxiv.org/pdf/2604.19795.pdf

Authors: Suyash Mishra

Title: Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery

Abstract:
We introduce \prism (Probabilistic Retrieval with Information‑Stratified Memory), an evolutionary memory substrate for multi‑agent AI systems engaged in open‑ended discovery. \prism unifies four independently developed paradigms ‑‑ layered file‑based persistence, vector‑augmented semantic memory, graph‑structured relational memory, and multi‑agent evolutionary search ‑‑ under a single decision‑theoretic framework with eight interconnected subsystems. We make five contributions: (1)~an \emphentropy‑gated stratification mechanism that assigns memories to a tri‑partite hub (skills/notes/attempts) based on Shannon information content, with formal context‑window utilization bounds; (2)~a \emphcausal memory graph \mathcalG = (V, E_r, E_c) with interventional edges and agent‑attributed provenance; (3)~a \emphValue‑of‑Information retrieval policy with self‑evolving strategy selection; (4)~a \emphheartbeat‑driven consolidation controller with stagnation detection via optimal stopping theory; and (5)~a \emphreplicator‑decay dynamics framework that interprets memory confidence as evolutionary fitness, proving convergence to an Evolutionary Stable Memory Set (ESMS). On the LOCOMO benchmark, \prism achieves 88.1 LLM‑as‑a‑Judge score (31.2% over Mem0). On CORAL‑style evolutionary optimization tasks, 4‑agent \prism achieves 2.8× higher improvement rate than single‑agent baselines.%

PaperID: 3740, https://arxiv.org/pdf/2604.19793.pdf

Authors: Hao Liu, Dongyu Li

Title: SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

Abstract:
LLM agents must select tools from large API libraries and order them correctly. Existing methods use semantic similarity for both retrieval and ordering, but ordering depends on inter‑tool data dependencies that are absent from tool descriptions. As a result, semantic‑only methods can produce negative Kendall‑τ in structured workflow domains. We introduce SkillGraph, a directed weighted execution‑transition graph mined from 49,831 successful LLM agent trajectories, which encodes workflow‑precedence regularities as a reusable graph foundation prior. Building on this graph foundation prior, we propose a two‑stage decoupled framework: GS‑Hybrid retrieval for candidate selection and a learned pairwise reranker for ordering. On ToolBench (9,965 test instances; ~16,000 tools), the method reaches Set‑F1 = 0.271 and Kendall‑τ = 0.096; on API‑Bank, Kendall‑τ improves from ‑0.433 to +0.613. Under identical Stage‑1 inputs, the learned reranker also outperforms LLaMA‑3.1‑8B Stage‑2 rerankers.

PaperID: 3741, https://arxiv.org/pdf/2604.19789.pdf

Authors: Samuel Onimpa Alfred, Veera Sundararaghavan

Title: From Data to Theory: Autonomous Large Language Model Agents for Materials Science

Abstract:
We present an autonomous large language model (LLM) agent for end‑to‑end, data‑driven materials theory development. The model can choose an equation form, generate and run its own code, and test how well the theory matches the data without human intervention. The framework combines step‑by‑step reasoning with expert‑supplied tools, allowing the agent to adjust its approach as needed while keeping a clear record of its decisions. For well‑established materials relationships such as the Hall‑Petch equation and Paris law, the agent correctly identifies the governing equation and makes reliable predictions on new datasets. For more specialized relationships, such as Kuhn's equation for the HOMO‑LUMO gap of conjugated molecules as a function of length, performance depends more strongly on the underlying model, with GPT‑5 showing better recovery of the correct equation. Beyond known theories, the agent can also suggest new predictive relationships, illustrated here by a strain‑dependent law for changes in the HOMO‑LUMO gap. At the same time, the results show that careful validation remains essential, because the agent can still return incorrect, incomplete, or inconsistent equations even when the numerical fit appears strong. Overall, these results highlight both the promise and the current limitations of autonomous LLM agents for AI‑assisted scientific modeling and discovery.

PaperID: 3742, https://arxiv.org/pdf/2604.19787.pdf

Authors: Ljubisa Bojic, Alexander Felfernig, Bojana Dinic, Velibor Ilic, Achim Rettinger, Vera Mevorah, Damian Trilling

Title: LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans

Abstract:
Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM‑powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM‑based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent‑persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage‑point performance spread. Study 2 employed binary forced‑choice (like/dislike) evaluation with chance‑corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text‑based supervised classifiers using TF‑IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero‑shot persona‑prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero‑shot agents is that they require no task‑specific training, making their large‑scale deployment easy across diverse contexts. Limitations include single‑country sampling. Future research should explore multilingual testing and fine‑tuning approaches.

PaperID: 3743, https://arxiv.org/pdf/2604.19775.pdf

Authors: Trilok Padhi, Ramneet Kaur, Krishiv Agarwal, Adam D. Cobb, Daniel Elenius, Manoj Acharya, Colin Samplawski, Alexander M. Berenbeim, Nathaniel D. Bastian, Susmit Jha, Anirban Roy

Title: From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi‑step reasoning and decision‑making tasks, internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step‑wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step‑wise reward modeling with conformal prediction to statistically label model's internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts ‑ latent directions in the model's activation space that correspond to consistent notions of success, failure or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent's performance by leveraging the proposed framework for steering the identified successful directions inside the model. The proposed approach, thus, offers a principled method for early failure detection as well as intervention in LLM‑based agents, paving the path towards trustworthy autonomous language models in complex interactive settings.

PaperID: 3744, https://arxiv.org/pdf/2604.19771.pdf

Authors: Parshva Daftari, Khush Patel, Shreyas Kapale, Jithin George, Siva Surendira

Title: Cognis: Context-Aware Memory for Conversational AI Agents

Abstract:
LLM agents lack persistent memory, causing conversations to reset each session and preventing personalization over time. We present Lyzr Cognis, a unified memory architecture for conversational AI agents that addresses this limitation through a multi‑stage retrieval pipeline. Cognis combines a dual‑store backend pairing OpenSearch BM25 keyword matching with Matryoshka vector similarity search, fused via Reciprocal Rank Fusion. Its context‑aware ingestion pipeline retrieves existing memories before extraction, enabling intelligent version tracking that preserves full memory history while keeping the store consistent. Temporal boosting enhances time‑sensitive queries, and a BGE‑2 cross‑encoder reranker refines final result quality. We evaluate Cognis on two independent benchmarks ‑‑ LoCoMo and LongMemEval ‑‑ across eight answer generation models, demonstrating state‑of‑the‑art performance on both. The system is open‑source and deployed in production serving conversational AI applications.

PaperID: 3745, https://arxiv.org/pdf/2604.19756.pdf

Authors: Ruocan Wei, Shufeng Wang, Ziwei Shi

Title: WorkflowGen:an adaptive workflow generation mechanism driven by trajectory experience

Abstract:
Large language model (LLM) agents often suffer from high reasoning overhead, excessive token consumption, unstable execution, and inability to reuse past experiences in complex tasks like business queries, tool use, and workflow orchestration. Traditional methods generate workflows from scratch for every query, leading to high cost, slow response, and poor robustness. We propose WorkflowGen, an adaptive, trajectory experience‑driven framework for automatic workflow generation that reduces token usage and improves efficiency and success rate. Early in execution, WorkflowGen captures full trajectories and extracts reusable knowledge at both node and workflow levels, including error fingerprints, optimal tool mappings, parameter schemas, execution paths, and exception‑avoidance strategies. It then employs a closed‑loop mechanism that performs lightweight generation only on variable nodes via trajectory rewriting, experience updating, and template induction. A three‑tier adaptive routing strategy dynamically selects among direct reuse, rewriting‑based generation, and full initialization based on semantic similarity to historical queries. Without large annotated datasets, we qualitatively compare WorkflowGen against real‑time planning, static single trajectory, and basic in‑context learning baselines. Our method reduces token consumption by over 40 percent compared to real‑time planning, improves success rate by 20 percent on medium‑similarity queries through proactive error avoidance and adaptive fallback, and enhances deployability via modular, traceable experiences and cross‑scenario adaptability. WorkflowGen achieves a practical balance of efficiency, robustness, and interpretability, addressing key limitations of existing approaches.

PaperID: 3746, https://arxiv.org/pdf/2604.19752.pdf

Authors: Aizierjiang Aiersilan, Raeli Savitt

Title: Soft-Label Governance for Distributional Safety in Multi-Agent Systems

Abstract:
Multi‑agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy‑based evaluation. We introduce SWARM (System‑Wide Assessment of Risk in Multi‑agent systems), a simulation framework that replaces binary good/bad labels with \emphsoft probabilistic labels p = P(v=+1) \in [0,1], enabling continuous‑valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity \mathbbE[1‑p \mid \textaccepted] and quality gap \mathbbE[p \mid \textaccepted] ‑ \mathbbE[p \mid \textrejected]. Across seven scenarios with five‑seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of +262 down to ‑67, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show soft metrics detect proxy gaming by self‑optimizing agents passing conventional binary evaluations. This basic governance layer applies to live LLM‑backed agents (Concordia entities, Claude, GPT‑4o Mini) without modification. Results show distributional safety requires \emphcontinuous risk metrics and governance lever calibration involves quantifiable safety‑welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm‑ai.org/.

PaperID: 3747, https://arxiv.org/pdf/2604.19750.pdf

Authors: Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

Title: Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Abstract:
Recent advances in Large Language Model (LLM)‑based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text‑output‑based feedback (e.g. command‑line outputs) for multi‑round debugging and struggle in graphical user interface (GUI) that involve visual information. This is mainly due to two limitations: 1) GUI programs are event‑driven, yet existing methods cannot simulate user interactions to trigger GUI element logic 2) GUI programs possess visual attributes, making it difficult for text‑based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real‑world desktop GUI application tasks designed for fine‑grained evaluation of both interaction logic and visual structure. Furthermore, we propose VF‑Coder, a vision‑feedback‑based multi‑agent system for debugging GUI code. By perceiving visual information and directly interacting with program interfaces, VF‑Coder can identify potential logic and layout issues in a human‑like manner. On InteractGUI Bench, our VF‑Coder approach increases the success rate of Gemini‑3‑Flash from 21.68% to 28.29% and raises the visual score from 0.4284 to 0.5584, indicating the effectiveness of visual feedback in GUI debugging.

PaperID: 3748, https://arxiv.org/pdf/2604.19742.pdf

Authors: Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying, Yuan Luo, Yiwen Guo

Title: PlayCoder: Making LLM-Generated GUI Code Playable

Abstract:
Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event‑driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository‑aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code‑generation evaluation. We further propose Play@k, a metric that measures whether at least one of k generated candidates can be played end‑to‑end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM‑based agent that performs task‑oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state‑of‑the‑art code LLMs show that, despite high compilation rates, they achieve near‑zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi‑agent, repository‑aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open‑source and closed‑source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.

PaperID: 3749, https://arxiv.org/pdf/2604.19657.pdf

Authors: Robert Stanley, Avi Verma, Lillian Tsai, Konstantinos Kallas, Sam Kumar

Title: An AI Agent Execution Environment to Safeguard User Data

Abstract:
AI agents promise to serve as general‑purpose personal assistants for their users, which requires them to have access to private user data (e.g., personal and financial information). This poses a serious risk to security and privacy. Adversaries may attack the AI model (e.g., via prompt injection) to exfiltrate user data. Furthermore, sharing private data with an AI agent requires users to trust a potentially unscrupulous or compromised AI model provider with their private data. This paper presents GAAP (Guaranteed Accounting for Agent Privacy), an execution environment for AI agents that guarantees confidentiality for private user data. Through dynamic and directed user prompts, GAAP collects permission specifications from users describing how their private data may be shared, and GAAP enforces that the agent's disclosures of private user data, including disclosures to the AI model and its provider, comply with these specifications. Crucially, GAAP provides this guarantee deterministically, without trusting the agent with private user data, and without requiring any AI model or the user prompt to be free of attacks. GAAP enforces the user's permission specification by tracking how the AI agent accesses and uses private user data. It augments Information Flow Control with novel persistent data stores and annotations that enable it to track the flow of private information both across execution steps within a single task, and also over multiple tasks separated in time. Our evaluation confirms that GAAP blocks all data disclosure attacks, including those that make other state‑of‑the‑art systems disclose private user data to untrusted parties, without a significant impact on agent utility.

PaperID: 3750, https://arxiv.org/pdf/2604.19633.pdf

Authors: Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena

Title: Time Series Augmented Generation for Financial Applications

Abstract:
Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time‑series analysis. We apply this methodology in a large‑scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT‑4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near‑perfect tool‑use accuracy with minimal hallucination, validating the tool‑augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.

PaperID: 3751, https://arxiv.org/pdf/2604.19589.pdf

Authors: Jiale Liu, Victor S. Bursztyn, Lin Ai, Haoliang Wang, Sunav Choudhary, Saayan Mitra, Qingyun Wu

Title: TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems

Abstract:
In open‑ended domains, teams must reconcile diverse viewpoints to produce strong deliverables. Answer aggregation approaches commonly used in closed domains are ill‑suited to this setting, as they tend to suppress minority perspectives rather than resolve underlying disagreements. We present TeamFusion, a multi‑agent system designed to support teamwork in open‑ended domains by: 1. Instantiating a proxy agent for each team member conditioned on their expressed preferences; 2. Conducting a structured discussion to surface agreements and disagreements; and 3. Synthesizing more consensus‑oriented deliverables that feed into new iterations of discussion and refinement. We evaluate TeamFusion on two teamwork tasks where team members can assess how well their individual views are represented in team decisions and how consensually strong the final deliverables are, finding that it outperforms direct aggregation baselines across metrics, tasks, and team configurations.

PaperID: 3752, https://arxiv.org/pdf/2604.19541.pdf

Authors: Haoran Yin, Zhiyuan Wen, Jiannong Cao, Bo Yuan, Ruosong Yang

Title: FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization

Abstract:
Desktop interaction streams provide a continuous, privacy‑sensitive record of interleaved user tasks. Transforming these streams into task‑organized personal logs on‑device faces two main challenges: exhaustive Vision‑Language Model (VLM) processing strains local resources, and global stream processing causes cross‑task context pollution. We present FOCAL (Filtered On‑device Continuous Activity Logging), a privacy‑first multi‑agent system utilizing a unified filter‑plan‑log architecture. It cascades a lightweight Filter Agent for noise suppression, a text‑only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task‑isolated Memory Agent for context‑coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under A\toB\toA task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on‑device summarization of instruction‑free desktop streams into multi‑perspective personal logs.

PaperID: 3753, https://arxiv.org/pdf/2604.19540.pdf

Authors: Hongwei Xu

Title: Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems

Abstract:
Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi‑day data‑generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other's cognitive state in real time across sessions. We call this cross‑session agent‑to‑agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver's own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol‑level properties at the semantic layer of agent communication, distinct from tool‑access and task‑delegation protocols at lower layers. We call this missing protocol layer "semantic infrastructure," and the Mesh Memory Protocol (MMP) specifies it. Four composable primitives work together: CAT7, a fixed seven‑field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver's role‑indexed anchors and realises P1; inter‑agent lineage, carried as parents and ancestors of content‑hash keys and realising P2; and remix, which stores only the receiver's own role‑evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.

PaperID: 3754, https://arxiv.org/pdf/2604.19533.pdf

Authors: Alankrit Chona, Igor Kozlov, Ambuj Kumar

Title: Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Abstract:
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security‑Datasets corpus ‑ spanning 86 MITRE ATT&CK sub‑techniques across 12 tactics ‑ into a Gymnasium reinforcement‑learning environment. Each episode presents the agent with an in‑memory SQLite database of 75,000‑135,000 log records produced by a deterministic campaign simulator that time‑shifts and entity‑obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF‑style against Sigma‑rule‑derived ground truth. Evaluating five frontier models ‑ Claude Opus 4.6, GPT‑5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash ‑ on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic ‑ the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open‑ended, evidence‑driven threat hunting despite strong performance on curated Q&A security benchmarks.

PaperID: 3755, https://arxiv.org/pdf/2604.19523.pdf

Authors: Mihir Shriniwas Arya, Avinash Anish, Aditya Ranjan

Title: Revac: A Social Deduction Reasoning Agent

Abstract:
Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human‑like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute‑force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac‑8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two‑stage reasoning system into a multi‑module architecture that integrates memory‑based player profiling, social‑graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high‑stakes social environments.

PaperID: 3756, https://arxiv.org/pdf/2604.19404.pdf

Authors: Yukai Feng, Zhiheng Wu, Zhengxing Wu, Junwen Gu, Junzhi Yu

Title: M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Abstract:
Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long‑horizon decision making, partial observability, and inter‑robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba‑based multi‑agent group relative policy optimization (M^2GRPO) is proposed, which integrates a selective state‑space Mamba policy with group‑relative policy optimization under the centralized‑training and decentralized‑execution (CTDE) paradigm. Specifically, the Mamba‑based policy leverages observation history to capture long‑horizon temporal dependencies and exploits attention‑based relational features to encode inter‑agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group‑relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi‑agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real‑world pool experiments across team scales and evader strategies demonstrate that M^2GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.

PaperID: 3757, https://arxiv.org/pdf/2604.19354.pdf

Authors: Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, Maliheh Izadi

Title: Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges

Abstract:
Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open‑source benchmark for evaluating LLM‑based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial‑credit scoring method based on challenge‑specific checkpoints derived from public writeups, together with an automated summarise‑then‑judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM‑based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non‑standard discovery and longer‑horizon adaptation.

PaperID: 3758, https://arxiv.org/pdf/2604.19301.pdf

Authors: Mikako Bito, Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata

Title: Large Language Models Exhibit Normative Conformity

Abstract:
The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision‑making in LLM‑based multi‑agent systems (LLM‑MAS). While many prior studies have treated "conformity" simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision‑making in LLM‑MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of "conformity," they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how "norms" are implemented in LLMs and how they influence group dynamics.

PaperID: 3759, https://arxiv.org/pdf/2604.19299.pdf

Authors: Xinlin Wang, Mats Brorsson

Title: Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Abstract:
Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real‑world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine‑tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi‑agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large‑scale, comprehensive study of <10B open‑source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi‑agent system with collaborative capabilities. Our results show that single‑agent systems achieve the best balance between performance and cost, while multi‑agent setups add overhead with limited gains. Our findings highlight the importance of agent‑centric design for efficient and trustworthy deployment in resource‑constrained settings.

PaperID: 3760, https://arxiv.org/pdf/2604.19278.pdf

Authors: Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina, Divya Bhargavi, Monica Sunkara, Yi Zhang

Title: Explicit Trait Inference for Multi-Agent Coordination

Abstract:
LLM‑based multi‑agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions‑‑warmth (e.g., trust) and competence (e.g., skill)‑‑from interaction histories to guide decisions. We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45‑77%, and in more realistic, complex multi‑agent settings (MultiAgentBench), where it improves performance by 3‑29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents' actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi‑agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others' traits from interaction histories and (ii) leverage structured awareness of others' traits for coordination.

PaperID: 3761, https://arxiv.org/pdf/2604.19247.pdf

Authors: Thilo Spinner, Matthias Miller, Fabian Sperrle-Roth, Mennatallah El-Assady

Title: BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications

Abstract:
Developing Visual Analytics (VA) applications requires integrating complex machine learning models with expressive interactive interfaces. Developers face a stark trade‑off: building tightly‑coupled monoliths plagued by fragile interdependencies, or relying on restrictive, simplistic frameworks. Meanwhile, unconstrained, single‑shot AI code generation promises speed but yields unstructured, unauditable chaos. The core challenge is combining the control and expressiveness of custom development with the efficiency of AI generation under strict constraints. To address this, we introduce BONSAI, a mixed‑initiative workspace for the multi‑agent co‑development of VA applications. BONSAI utilizes a modular four‑layer architecture (hardware, services, orchestration, application) that allows human and AI developers to independently contribute reusable components. The workspace incorporates this architecture into a structured four‑phase development process (plan, design, monitor, and review), ensuring distributed agency and full provenance, where all human and AI contributions are structurally bounded and tracked. We evaluate BONSAI through case studies demonstrating the efficient creation of novel tools and the rapid reconstruction of complex VA applications directly from research paper descriptions. Ultimately, this paper contributes a conceptual workflow, a scalable architecture, and an integrated system that successfully balances AI's generative speed with the structural rigor required for complex VA development.

PaperID: 3762, https://arxiv.org/pdf/2604.19211.pdf

Authors: Zhiqin Yang, Zhenyuan Zhang, Xianzhang Jia, Jun Song, Wei Xue, Yonggang Zhang, Yike Guo

Title: ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

Abstract:
Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross‑user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human‑symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner's behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context‑specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per‑identity access control and escalates boundary violations to the owner. Action‑level accountability logs every operation against its owner's identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity‑governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.

PaperID: 3763, https://arxiv.org/pdf/2604.19144.pdf

Authors: Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su

Title: ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

Abstract:
Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a "think‑first‑then‑translate" paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two‑stage reflection internalization algorithm for machine translation that employs a "translate‑first‑think‑later" paradigm. Our approach develops the model's "translate‑reflect‑refine" capability through reinforcement learning. In the first stage, we cultivate the model's capacity for high‑quality reflection and refinement, thereby enhancing its semantic comprehension and task‑specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high‑quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model's first‑pass translations during inference outperform multi‑step reasoning LRMs such as DeepSeek‑R1 in both automatic metrics and GPT‑based evaluation, achieving a 2.16‑point improvement in GPT‑based translation quality evaluation while reducing token consumption by 94.33%.

PaperID: 3764, https://arxiv.org/pdf/2604.19026.pdf

Authors: Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou, Wenjing Lou

Title: ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

Abstract:
Autonomous AI agents live or die by the API tokens they consume: without paid inference capacity they cannot reason, act, or delegate. Compute‑token cost has become the binding resource of the emerging agent economy, yet it is non‑transferable: it is account‑bound, vendor‑specific, and absent from on‑chain ledgers. Existing payment rails such as x402 move fiat‑backed value between agents, but they do not represent the quantity agents actually burn. As a result, agents can transport purchasing power but cannot quote, escrow, or settle workflows in a unit aligned with compute cost. We present ClawCoin, a tokenized, compute‑cost‑indexed unit of account and settlement asset for decentralized agent economies. ClawCoin combines four layers: a robust basket index over standardized prices; an oracle publishing signed fresh attestations; a NAV‑based mint/redeem vault with coverage thresholds and rate limits; and an on‑chain settlement layer for multi‑hop delegations. We implement a prototype on an Ethereum‑compatible L2 and evaluate it using a multi‑agent simulator and the OpenClaw testbed. Across single‑agent, multi‑agent, workflow, and procurement experiments, ClawCoin stabilizes execution capacity under cost shocks, reduces cross‑agent quote dispersion, eliminates partial settlements, and sustains cooperative market dynamics that fiat‑denominated baselines cannot. These results suggest that compute‑indexed units of account can improve decentralized agent coordination.

PaperID: 3765, https://arxiv.org/pdf/2604.19012.pdf

Authors: Yongchao Wang, Zhiqiu Huang

Title: Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis

Abstract:
Deep learning for vulnerability detection has shown promising results on early benchmarks, but recent evaluations reveal catastrophic degradation: models achieving F1 > 0.68 on legacy datasets collapse to 0.031 under strict deduplication. We identify the root cause as the semantic ambiguity problem: identical code can be secure or vulnerable depending on project‑specific behavioral contracts, rendering global classification fundamentally inadequate. We propose Phoenix, a training‑free multi‑agent framework that resolves this ambiguity through Behavioral Contract Synthesis. Phoenix decomposes detection into three stages: a Semantic Slicer extracting minimal vulnerability‑relevant context, a Requirement Reverse Engineer synthesizing Gherkin behavioral specifications encoding the security contract, and a Contract Judge evaluating code against these specifications via strict compliance checking. On PrimeVul Paired, Phoenix achieves F1 = 0.825 and Pair‑Correct = 64.4%, surpassing RASM‑Vul (F1 = 0.668) and VulTrial (F1 = 0.563) while using open‑source models up to 48x smaller (7‑14B vs. 671B). Ablation across 25 configurations demonstrates Gherkin specifications as the decisive driver (+0.09 to +0.35 F1). Error analysis reveals 18% of "False Positives" identify genuine security concerns in patched code, demonstrating that security is a relative property defined against behavioral contracts, not an absolute property of code syntax.

PaperID: 3766, https://arxiv.org/pdf/2604.19005.pdf

Authors: Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung

Title: Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

Abstract:
Half‑truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission‑based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role‑anchored multi‑agent debate framework for omission‑aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual‑threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single‑ and multi‑agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role‑anchored, retrieval‑grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.

PaperID: 3767, https://arxiv.org/pdf/2604.19001.pdf

Authors: Ishita Kakkar, Enze Zhang, Rheeya Uppaal, Junjie Hu

Title: When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

Abstract:
Large reasoning models (LRMs) produce complex, multi‑step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence‑level granularity within reasoning traces ‑‑ a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step‑wise safety evaluation of reasoning traces. \ourdataset is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine‑grained sentence‑level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white‑box and black‑box detectors on the task of identifying harmful reasoning behaviours on HarmThoughts. Our results show that existing detectors struggle with fine‑grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process‑level safety monitoring. HarmThoughts is available publicly at: https://huggingface.co/datasets/ishitakakkar‑10/HarmThoughts

PaperID: 3768, https://arxiv.org/pdf/2604.18988.pdf

Authors: Liping Wang, Cheng Ye, Weidong Chen, Peipei Song, Bo Hu, Zhendong Mao

Title: A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

Abstract:
Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one‑pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi‑agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning‑to‑generation module that explicitly decomposes response generation via multimodal perception, consistency‑aware emotion forecasting, pragmatic strategy planning, and strategy‑guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step‑wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed‑loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state‑of‑the‑art methods.

PaperID: 3769, https://arxiv.org/pdf/2604.18975.pdf

Authors: HuaDong Jian, Chenghao Li, Haoyu Wang, Jiajia Shuai, Jinyu Guo, Yang Yang, Chaoning Zhang

Title: Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game

Abstract:
In long‑horizon open‑world multi‑agent systems, existing methods often treat local anomalies as automatic triggers for communication. This default design introduces coordination noise, interrupts local execution, and overuses public interaction in cases that could be resolved locally. To address this issue, we propose a partitioned information architecture for MLLM agents that explicitly separates private execution states from public coordination states. Building on this design, we introduce two key mechanisms. First, we develop an event‑triggered working memory based on system‑verified outcomes to maintain compact and low‑noise local state representations. Second, we propose a cost‑sensitive gated escalation mechanism that determines whether cross‑region communication should be initiated by jointly considering node criticality, local recovery cost, and downstream task impact. In this way, communication is transformed from a default reaction into a selective decision. Experiments conducted on long‑term construction tasks in open environments demonstrate that, compared to baseline models based on strong communication and planned structures, the introduction of gated communication and a partitioned information architecture results in superior performance in terms of blueprint completion quality and execution chain length. It also improves local self‑recovery, reduces ineffective escalations, and increases the utility of public communication.

PaperID: 3770, https://arxiv.org/pdf/2604.18951.pdf

Authors: Namyoung So, Seokgyu Jang, Taeuk Kim

Title: Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

Abstract:
Adaptive multi‑agent systems (MAS) are increasingly adopted to tackle complex problems. However, the narrow task coverage of their optimization raises the question of whether they can function as general‑purpose systems. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting ‑‑ they fail to generalize across different domains; and (2) illusory coordination ‑‑ they achieve reasonable surface‑level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical utility. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final‑answer correctness.

PaperID: 3771, https://arxiv.org/pdf/2604.18764.pdf

Authors: Qihang Wu, Aman Arora, Vidya A. Chhabria

Title: CHICO-Agent: An LLM Agent for the Cross-layer Optimization of 2.5D and 3D Chiplet-based Systems

Abstract:
The rapid growth of large language models (LLMs) and AI workloads has pushed monolithic silicon to its reticle and economic limits, accelerating the adoption of 2.5D/3D chiplet systems. However, these systems increase design complexity by requiring co‑design across multiple levels of the computing stack, including application, architecture, chip, and package. The resulting design space is highly combinatorial, with trade‑offs among latency, energy, area, and cost. To address this challenge, we propose CHICO‑Agent, an LLM‑driven optimization framework for 2.5D/3D chiplet‑based systems. CHICO‑Agent maintains a persistent knowledge base to capture parameter‑outcome trends and coordinates exploration through an admin‑field multi‑agent workflow. Compared with a simulated‑annealing baseline, CHICO‑Agent finds lower‑cost configurations and provides an interpretable audit trail for designers.

PaperID: 3772, https://arxiv.org/pdf/2604.18752.pdf

Authors: Joschka Birk, Gregor Kasieczka, Siddharth Mishra-Sharma, Benjamin Nachman, Dennis Noll, Tanvi Wamorkar

Title: A Scientific Human-Agent Reproduction Pipeline

Abstract:
Reproducing scientific analyses is essential for preserving knowledge, building extensible codebases, and deepening researcher understanding ‑ yet the effort often outweighs its academic recognition. We argue that the reproduction of scientific data analyses is fundamentally a translation task: converting human‑readable knowledge (papers, documentation) into machine‑readable analysis code. This makes it uniquely well‑suited for AI agents. We present SHARP (Scientific Human‑Agent Reproduction Pipeline), a structured framework for reproducing scientific analyses through human‑agent collaboration. SHARP decomposes a reproduction task into discrete steps, which an AI agent executes autonomously using specialized subagents for code generation, testing, and quality assurance. At defined checkpoints, the researcher reviews progress, provides feedback, and steers the analysis ‑ keeping the human firmly in control of scientific judgment while the agent handles implementation. We demonstrate SHARP by reproducing a jet classification task in particle physics from a published paper. We evaluate the reproduction along three axes: analysis performance against the original results, code quality and faithfulness, and the nature of the human‑agent conversation. The latter is evaluated with a novel framework for characterizing human‑agent interactions. Our work highlights a practical model for AI‑assisted scientific reproduction where the researcher's role shifts from writing code to understanding, evaluating, and directing ‑ elevating human understanding rather than replacing it.

PaperID: 3773, https://arxiv.org/pdf/2604.18660.pdf

Authors: Jin Zhao, Marta Knežević, Tanja Käser

Title: Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks

Abstract:
Large Language Models (LLMs) are increasingly used in education, yet their default helpfulness often conflicts with pedagogical principles. Prior work evaluates pedagogical quality via answer leakage‑the disclosure of complete solutions instead of scaffolding‑but typically assumes well‑intentioned learners, leaving tutor robustness under student misuse largely unexplored. In this paper, we study scenarios where students behave adversarially and aim to obtain the correct answer from the tutor. We evaluate a broad set of LLM‑based tutor models, including different model families, pedagogically aligned models, and a multi‑agent design, under a range of adversarial student attacks. We adapt six groups of adversarial and persuasive techniques to the educational setting and use them to probe how likely a tutor is to reveal the final answer. We evaluate answer leakage robustness using different types of in‑context adversarial student agents, finding that they often fail to carry out effective attacks. We therefore introduce an adversarial student agent that we fine‑tune to jailbreak LLM‑based tutors, which we propose as the core of a standardized benchmark for evaluating tutor robustness. Finally, we present simple but effective defense strategies that reduce answer leakage and strengthen the robustness of LLM‑based tutors in adversarial scenarios.

PaperID: 3774, https://arxiv.org/pdf/2604.18658.pdf

Authors: Dongcheng Zhang, Yiqing Jiang

Title: Owner-Harm: A Missing Threat Model for AI Agent Safety

Abstract:
Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real‑world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar‑injection leaks (Jan 2024), and a Meta agent unauthorized forum post exposing operational data (Mar 2026). We propose Owner‑Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%‑32.5%) on AgentDojo injection tasks (prompt‑injection‑mediated owner harm). A controlled generic‑LLM baseline shows the gap is not inherent to owner‑harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment‑bound symbolic rules that fail to generalize across tool vocabularies. On a post‑hoc 300‑scenario owner‑harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post‑audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic‑Semantic Defense Generalization (SSDG) framework relating information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06); context injection reveals structured goal‑action alignment, not text concatenation, is required for effective owner‑harm detection.

PaperID: 3775, https://arxiv.org/pdf/2604.18614.pdf

Authors: Landy Jimenez, Mariah Weatherspoon, Bingyu Shen, Yi Sheng, Jianming Liu, Boyang Li

Title: HadAgent: Harness-Aware Decentralized Agentic AI Serving with Proof-of-Inference Blockchain Consensus

Abstract:
Proof‑of‑Work (PoW) blockchain consensus consumes vast computational resources without producing useful output, while the rapid growth of large language model (LLM) agents has created unprecedented demand for GPU computation. We present HadAgent, a decentralized agentic AI serving system that replaces hash‑based mining with Proof‑of‑Inference (PoI), a consensus mechanism in which nodes earn block‑creation rights by executing deterministic LLM inference tasks. Because verification requires only re‑executing a single forward pass under identical conditions, cross‑node verification operates at consensus speed. HadAgent organizes validated records into a three‑lane block body with dedicated DATA, MODEL, and PROOF channels, each protected by an independent Merkle root for fine‑grained tamper detection. A two‑tier node architecture classifies secondary nodes as trusted or non‑trusted based on historical behavior: trusted nodes serve inference results in real time through optimistic execution, while non‑trusted nodes must undergo full consensus verification. A harness layer monitors node behavior through heartbeat probes, anomaly detection via deterministic recomputation, and automated trust management, creating a self‑correcting feedback loop that isolates malicious or unreliable participants. Experiments on a prototype implementation demonstrate 100% detection rate and 0% false positive rate for tampered records, sub‑millisecond validation latency for record and hub operations, and effective harness convergence that excludes adversarial nodes within two rounds while promoting honest nodes to trusted status within five rounds.

PaperID: 3776, https://arxiv.org/pdf/2604.18602.pdf

Authors: Maxime Saxena, Marco Pangallo, Cars Hommes, Fabio Caccioli, R. Maria del Rio-Chanona

Title: Machine Spirits: Speculation and Adaptation of LLM Agents in Asset Markets

Abstract:
As Large Language Models (LLMs) become increasingly integrated into financial systems, understanding their behavioural properties is crucial. Do LLMs conform to the rational expectations paradigm, do they exhibit human‑like "animal spirits", or do they instead manifest distinct "machine spirits"? We investigate these questions with a simulated financial market, exploring the behaviour of 15 LLMs spanning a range of sizes, capabilities, and providers. Our results show that LLMs exhibit a spectrum of economic behaviours, from stable coordination on the fundamental value to human‑like speculative bubbles. These behaviours are generally inconsistent with the rational expectations hypothesis. We also consider an ecology of heterogeneous agents, a more realistic setting compared to markets with identical LLM agents. These mixed markets can produce outcomes which vary substantially across repeated simulations. Even the most advanced models fail to consistently stabilise the market, with price bubbles sometimes forming despite only a minority of agents naturally forming bubbles. Instead, advanced models in mixed markets adapt their forecasting strategies to the behaviour of other agents. This adaptation can allow them to successfully exploit less sophisticated counterparts and achieve higher profits, but can also contribute to increased market volatility. These findings suggest that the introduction of AI agents into financial markets fundamentally reshapes their ecology. In particular, heterogeneous populations of LLMs can generate endogenous instability, while individual‑level adaptation may amplify, rather than mitigate, market volatility.

PaperID: 3777, https://arxiv.org/pdf/2604.18564.pdf

Authors: Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu

Title: MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Abstract:
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action‑conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single‑agent scenarios and fail to capture the complex interactions inherent in real‑world multi‑agent systems. We present MultiWorld, a unified framework for multi‑agent multi‑view world modeling that enables accurate control of multiple agents while maintaining multi‑view consistency. We introduce the Multi‑Agent Condition Module to achieve precise multi‑agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi‑player game environments and multi‑robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action‑following ability, and multi‑view consistency. Project page: https://multi‑world.github.io/

PaperID: 3778, https://arxiv.org/pdf/2604.18509.pdf

Authors: Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie

Title: MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Abstract:
Large language models (LLMs) are widely used in retrieval‑augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS‑RAG, a multi‑agent synthesis approach to retrieval‑augmented generation that structures evidence processing into multiple role‑specialized agents. MASS‑RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS‑RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.

PaperID: 3779, https://arxiv.org/pdf/2604.18500.pdf

Authors: Terence Lim, Kumar Muthuraman, Michael Sury

Title: QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

Abstract:
We introduce a multi‑agent framework intended to emulate parts of a quantitative research team and support equity factor research on large financial panel datasets. QRAFTI integrates a research toolkit for panel data with MCP servers that expose data access, factor construction, and custom coding operations as callable tools. It can help replicate established factors, formulate and test new signals, and generate standardized research reports accompanied by narrative analysis and computational traces. On multi‑step empirical tasks, using chained tool calls and reflection‑based planning may offer better performance and explainability than dynamic code generation alone.

PaperID: 3780, https://arxiv.org/pdf/2604.18464.pdf

Authors: Yidi Yuan

Title: Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

Abstract:
Semantic Tube Prediction (STP) leverages representation geometric to regularize LLM hidden‑state trajectories toward locally linear geodesics during fine‑tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub‑spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we are interested to investigate whether the sampling position can further enhance the semantic structure of multi‑step reasoning, and hence affect its geometric impact. We applied STP at consecutive semantic reasoning step boundaries and achieved 168x more accurate multi‑step latent prediction than frozen baselines on ProcessBench (3,400 samples), compared to only 4x for the random‑token STP. Probing the latent manifold with a learned non‑linear predictor reveals that STP‑shaped trajectories are smooth curves, not straight lines: a 3‑layer MLP reduces prediction error by a further 3‑12x over linear extrapolation on step‑boundary models. Removing the language modeling loss yields trajectories that are 2x more MLP‑predictable than the combined loss, revealing a tradeoff between generation quality and geometric purity. Our results identify sampling position as the critical variable in geometric regularization and establish multi‑step latent prediction MSE as a new evaluation metric for this class of methods.

PaperID: 3781, https://arxiv.org/pdf/2604.18401.pdf

Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Title: StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Abstract:
Agentic reinforcement learning (RL) is emerging as a critical post‑training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token‑centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token‑level predictions while LLM agents make step‑level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step‑centric paradigm for agentic RL via step‑aligned policy optimization. Specifically, we reformulate agentic RL from a token‑level Markov Decision Process (MDP) into a step‑level MDP, where interaction steps serve as the basic trajectory representations. We further propose step‑level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi‑turn agent‑environment interaction. Experiments across multi‑hop QA, academic paper search, and text‑world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how step‑centric paradigm improves agent training. We hope this step‑centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.

PaperID: 3782, https://arxiv.org/pdf/2604.18373.pdf

Authors: Shumiao Ouyang, Pengfei Sui

Title: Dissecting AI Trading: Behavioral Finance and Market Bubbles

Abstract:
We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open‑call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency‑weighted extrapolative beliefs. Second, these individual‑level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty‑mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.

PaperID: 3783, https://arxiv.org/pdf/2604.18334.pdf

Authors: Syed Muhammad Ashhar Shah, Sehrish Habib, Muizz Hussain, Maryam Abdul Ghafoor, Abdul Ali Bangash

Title: Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows

Abstract:
Continuous Integration and Deployment (CI/CD) workflows are central to modern software delivery, yet the reliability of agentic AI bots operating within these workflows remain underexplored. Using pull requests (PRs), commits, and repositories from the AIDev dataset, we retrieved associated CI/CD workflow runs via the GitHub Actions API and analyzed 61,837 runs from 2,355 repositories, all triggered by PRs generated by five AI bots: Claude, Devin, Cursor, Copilot, and Codex. We observed substantial agent‑dependent differences in workflow reliability, with Copilot and Codex achieving the highest success rates ~93% and ~94% respectively. At the repository level, we find a negative correlation between AI agent contribution frequency and workflow success rate, suggesting that a higher frequency of Agentic PRs may hinder CI/CD workflow reliability. We defined a taxonomy of 13 categories against 3,067 agentic PRs whose associated workflows failed, and observed a trend analysis that indicates visually observable shifts from functional to non‑functional PR categories over time, although these trends are not statistically significant. Our findings motivate the need for actionable guidance on integrating AI agents into CI/CD workflows and prioritizing safeguards in workflows where failures are most likely to occur.

PaperID: 3784, https://arxiv.org/pdf/2604.18271.pdf

Authors: Paolo Riva, Leonardo Gargani, Matteo Frosi, Matteo Matteucci

Title: EmbodiedLGR: Integrating Lightweight Graph Representation and Retrieval for Semantic-Spatial Memory in Robotic Agents

Abstract:
As the world of agentic artificial intelligence applied to robotics evolves, the need for agents capable of building and retrieving memories and observations efficiently is increasing. Robots operating in complex environments must build memory structures to enable useful human‑robot interactions by leveraging the mnemonic representation of the current operating context. People interacting with robots may expect the embodied agent to provide information about locations, events, or objects, which requires the agent to provide precise answers within human‑like inference times to be perceived as responsive. We propose the Embodied Light Graph Retrieval Agent (EmbodiedLGR‑Agent), a visual‑language model (VLM)‑driven agent architecture that constructs dense and efficient representations of robot operating environments. EmbodiedLGR‑Agent directly addresses the need for an efficient memory representation of the environment by providing a hybrid building‑retrieval approach built on parameter‑efficient VLMs that store low‑level information about objects and their positions in a semantic graph, while retaining high‑level descriptions of the observed scenes with a traditional retrieval‑augmented architecture. EmbodiedLGR‑Agent is evaluated on the popular NaVQA dataset, achieving state‑of‑the‑art performance in inference and querying times for embodied agents, while retaining competitive accuracy on the global task relative to the current state‑of‑the‑art approaches. Moreover, EmbodiedLGR‑Agent was successfully deployed on a physical robot, showing practical utility in real‑world contexts through human‑robot interaction, while running the visual‑language model and the building‑retrieval pipeline locally.

PaperID: 3785, https://arxiv.org/pdf/2604.18231.pdf

Authors: Sina Abdollahi, Mohammad M Maheri, Javad Forough, Amir Al Sadi, Josh Millar, David Kotz, Marios Kogias, Hamed Haddadi

Title: AgenTEE: Confidential LLM Agent Execution on Edge Devices

Abstract:
Large Language Model (LLM) agents provide powerful automation capabilities, but they also create a substantially broader attack surface than traditional applications due to their tight integration with non‑deterministic models and third‑party services. While current deployments primarily rely on cloud‑hosted services, emerging designs increasingly execute agents directly on edge devices to reduce latency and enhance user privacy. However, securely hosting such complex agent pipelines on edge devices remains challenging. These deployments must protect proprietary assets (e.g., system prompts and model weights) and sensitive runtime state on heterogeneous platforms that are vulnerable to software attacks and potentially controlled by malicious users. To address these challenges, we present AgenTEE, a system for deploying confidential agent pipelines on edge devices. AgenTEE places the agent runtime, inference engine, and third‑party applications into independently attested confidential virtual machines (cVMs) and mediates their interaction through explicit, verifiable communication channels. Built on Arm Confidential Compute Architecture (CCA), a recent extension to Arm platforms, AgenTEE enforces strong system‑level isolation of sensitive assets and runtime state. Our evaluation shows that such multi‑cVMs system is practical, achieving near‑native performance with less than 5.15% runtime overhead compared to commodity OS multi‑process deployments.

PaperID: 3786, https://arxiv.org/pdf/2604.18210.pdf

Authors: Sheng Xu, Guiliang Liu, Tarak Kharrat, Yudong Luo, Mohamed Aloulou, Javier López Peña, Konstantin Sofeikov, Adam Reid, Paul Roberts, Steven Spencer, Joe Carnall, Ian McHale, Oliver Schulte, Hongyuan Zha, Wei-Shi Zheng

Title: TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics

Abstract:
Success in association football relies on both individual skill and coordinated tactics. While recent advancements in spatio‑temporal data and deep learning have enabled predictive analyses like trajectory forecasting, the development of tactical design remains limited. Bridging this gap is essential, as prediction reveals what is likely to occur, whereas tactic generation determines what should occur to achieve strategic objectives. In this work, we present TacticGen, a generative model for adaptable and scalable tactic generation. TacticGen formulates tactics as sequences of multi‑agent movements and interactions conditioned on the game context. It employs a multi‑agent diffusion transformer with agent‑wise self‑attention and context‑aware cross‑attention to capture cooperative and competitive dynamics among players and the ball. Trained with over 3.3 million events and 100 million tracking frames from top‑tier leagues, TacticGen achieves state‑of‑the‑art precision in predicting player trajectories. Building on it, TacticGen enables adaptable tactic generation tailored to diverse inference‑time objectives through classifier guidance mechanism, specified via rules, natural language, or neural models. Its modeling performance is also inherently scalable. A case study with football experts confirms that TacticGen generates realistic, strategically valuable tactics, demonstrating its practical utility for tactical planning in professional football. The project page is available at: https://shengxu.net/TacticGen/.

PaperID: 3787, https://arxiv.org/pdf/2604.18133.pdf

Authors: Zixiang Wang, Mengjia Gong, Qiyu Sun, Jing Xu, Shuai Mao, Xin Jin, Qing-Long Han, Yang Tang

Title: Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

Abstract:
With the rapid advancement of artificial intelligence, multi‑agent systems (MASs) are evolving from classical paradigms toward architectures built upon large foundation models (LFMs). This survey provides a systematic review and comparative analysis of classical MASs (CMASs) and LFM‑based MASs (LMASs). First, within a closed‑loop coordination framework, CMASs are reviewed across four fundamental dimensions: perception, communication, decision‑making, and control. Beyond this framework, LMASs integrate LFMs to lift collaboration from low‑level state exchanges to semantic‑level reasoning, enabling more flexible coordination and improved adaptability across diverse scenarios. Then, a comparative analysis is conducted to contrast CMASs and LMASs across architecture, operating mechanism, adaptability, and application. Finally, future perspectives on MASs are presented, summarizing open challenges and potential research opportunities.

PaperID: 3788, https://arxiv.org/pdf/2604.18131.pdf

Authors: Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li, Jing Tang, Nuo Chen, Haitao Mi, Yan Wang

Title: Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Abstract:
Most agents today ``self‑evolve'' by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta‑evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome‑based reward mechanism that measures how much an agent's self‑generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self‑evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3‑30B and Seed‑OSS‑36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini‑2.5‑Flash, establishing a new paradigm for truly evolving agents.

PaperID: 3789, https://arxiv.org/pdf/2604.18071.pdf

Authors: Hu Wei

Title: Architectural Design Decisions in AI Agent Harnesses

Abstract:
AI agent systems increasingly rely on reusable non‑LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. This paper presents a protocol‑guided, source‑grounded empirical study of 70 publicly available agent‑system projects, addressing three questions: which design‑decision dimensions recur across projects, which co‑occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent‑system corpora through source‑code and technical‑material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file‑persistent, hybrid, and hierarchical context strategies; registry‑oriented tool systems remain dominant while MCP‑ and plugin‑oriented extensions are emerging; and intermediate isolation is common but high‑assurance audit is rare. Cross‑project co‑occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool‑registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi‑agent orchestrators, enterprise systems, and scenario‑verticalized projects. The result provides an evidence‑based account of architectural regularities in agent‑system engineering, with grounded guidance for framework designers, selectors, and researchers.

PaperID: 3790, https://arxiv.org/pdf/2604.18046.pdf

Authors: Muyao Zhong, Zhenhua Yang, Yuxiang Liu, Ke Tang, Peng Yang

Title: EvoMarket: A High-Fidelity and Scalable Financial Market Simulator

Abstract:
High‑fidelity, scalable market simulation is a key instrument for mechanism evaluation, stress testing, and counterfactual policy analysis. Yet existing simulators rarely achieve \emphmechanism fidelity beyond single‑asset intraday settings, \emphmicrostructure fidelity against historical limit order books (LOB), and \emphcomputational tractability at market scale in a single system. This paper presents EvoMarket, a discrete‑event, multi‑agent financial market simulator designed for intervention‑oriented experiments in multi‑asset and cross‑day environments. EvoMarket couples a high‑throughput execution core (optimized LOB data structures, hierarchical scheduling under propagation delays, and asynchronous per‑asset matching) with explicit institutional mechanisms (market calendars, opening call auctions, price limits, and T+1 settlement). To avoid expensive black‑box calibration, EvoMarket introduces an Oracle‑guided in‑run self‑calibration mechanism that interprets microstructure discrepancy as missing order flow and synthesizes corrective orders at recording checkpoints. Experiments on China A‑share order‑flow and LOB data show close replay alignment over five trading days, fidelity gains from budgeted in‑run calibration across depth levels, broad agent order‑space coverage, and scalable performance under increasing input order rates and market breadth. We further demonstrate cross‑asset linkage and event‑study style intervention evaluation that produces structured dependence and interpretable event‑time responses.

PaperID: 3791, https://arxiv.org/pdf/2604.18011.pdf

Authors: Yuwei Xu, Shulun Zhang, Yingli Zhou, Shipei Zeng, Laks V. S. Lakshmanan, Chenhao Ma

Title: Topology-Aware LLM-Driven Social Simulation: A Unified Framework for Efficient and Realistic Agent Dynamics

Abstract:
Social simulation is essential for understanding collective human behavior by modeling how individual interactions give rise to large‑scale social dynamics. Recent advances in large language models (LLMs) have enabled multi‑agent frameworks with human‑like reasoning and communication capabilities. However, existing LLM‑based simulations treat social networks as fixed communication scaffolds, failing to leverage the structural signals that shape behavioral convergence and heterogeneous influence in real‑world systems, which often leads to inefficient and unrealistic dynamics. To address this challenge, we propose TopoSim, a unified topology‑aware social simulation framework that explicitly integrates structural reasoning into agent interactions along two complementary dimensions. First, TopoSim aligns agents with similar structural roles and interaction contexts into shared backbone units, enabling coordinated updates that reduce redundant computation while preserving emergent social dynamics. Second, TopoSim models social influence as a structure‑induced signal, introducing heterogeneous interaction patterns grounded in network topology rather than uniform influence assumptions. Extensive experiments across three social simulation frameworks and diverse datasets demonstrate that TopoSim achieves comparable or improved simulation fidelity while reducing token consumption by 50 ‑ 90%. Moreover, our approach more accurately reproduces key structural phenomena observed in real‑world social systems and exhibits strong generalization and scalability.

PaperID: 3792, https://arxiv.org/pdf/2604.17989.pdf

Authors: Jiaqi Li, Lvyang Zhang, Yang Zhao, Wen Lu, Lidong Zhai

Title: AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

Abstract:
What does it mean to give an AI agent a complete education? Current agent development produces specialists systems optimized for a single capability dimension, whether tool use, code generation, or security awareness that exhibit predictable deficits wherever they were not trained. We argue this pattern reflects a structural absence: there is no curriculum theory for agents, no principled account of what a fully developed agent should know, be, and be able to do across the full scope of intelligent behavior. This paper introduces the AIT Academy (Agents Institute of Technology Academy), a curriculum framework for cultivating AI agents across the tripartite structure of human knowledge. Grounded in Kagan's Three Cultures and UNESCO ISCED‑F 2013, AIT organizes agent capability development into three domains: Natural Science and Technical Reasoning (Domain I), Humanities and Creative Expression (Domain II), and Social Science and Ethical Reasoning (Domain III). The Confucian Six Arts (liuyi) a 2,500‑year‑old holistic education system are reinterpreted as behavioral archetypes that map directly onto trainable agent capabilities within each domain. Three representative training grounds instantiate the framework across multiple backbone LLMs: the ClawdGO Security Dojo (Domain I), Athen's Academy (Domain II), and the Alt Mirage Stage (Domain III). Experiments demonstrate a 15.9‑point improvement in security capability scores under weakest‑first curriculum scheduling, and a 7‑percentage‑point gain in social reasoning performance under principled attribution modeling. A cross‑domain finding Security Awareness Calibration Pathology (SACP), in which over‑trained Domain I agents fail on out‑of‑distribution evaluation illustrates the diagnostic value of a multi‑domain perspective unavailable to any single‑domain framework.

PaperID: 3793, https://arxiv.org/pdf/2604.17950.pdf

Authors: Chuhan Qiao

Title: CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

Abstract:
We revisit multi‑agent delegation under a stronger and more realistic assumption: an agent's capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long‑horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill‑level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. We propose CADMAS‑CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS‑CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation is then made by a risk‑aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill‑level confidence with context‑conditioned posteriors. Second, based on contextual bandit theory, we formally prove context‑aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias‑variance tradeoff. Third, we empirically validate our method on GAIA and SWE‑bench benchmarks. On GAIA with GPT‑4o agents, CADMAS‑CTX achieves 0.442 accuracy, outperforming static baseline 0.381 and AutoGen 0.354 with non‑overlapping 95% confidence intervals. On SWE‑bench Lite, it improves resolve rate from 22.3% to 31.4%. Ablations show the uncertainty penalty improves robustness against context tagging noise. Our results demonstrate contextual calibration and risk‑aware delegation significantly improve multi‑agent teamwork compared with static global skill assignments.

PaperID: 3794, https://arxiv.org/pdf/2604.17948.pdf

Authors: Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

Title: RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST‑SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.

PaperID: 3795, https://arxiv.org/pdf/2604.17944.pdf

Authors: Yindong Zhang, Wenmian Yang, Yiquan Zhang, Weijia Jia

Title: ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering

Abstract:
Developing agents capable of navigating fragmented, multi‑source information remains challenging, primarily due to the scarcity of benchmarks reflecting hybrid workflows combining database querying with external APIs. To bridge this gap, we introduce ReCoQA, a large‑scale benchmark of 29,270 real‑estate instances featuring machine‑verifiable supervision for intermediate steps, including structured intent labels, SQL queries, and API calls. Complementarily, we propose HIRE‑Agent, a hierarchical framework instantiating an understand‑plan‑execute architecture as a strong baseline. By orchestrating a Front‑end parser, a planning Supervisor, and execution Specialists, HIRE‑Agent effectively integrates heterogeneous evidence. Extensive experiments demonstrate that HIRE‑Agent constitutes a strong baseline and substantiates the necessity of hierarchical collaboration for complex, real‑world reasoning tasks.

PaperID: 3796, https://arxiv.org/pdf/2604.17935.pdf

Authors: Xiao Wang

Title: How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

Abstract:
The key‑value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi‑step reasoning degrades. We study this through k‑hop pointer chasing on n tokens under a shared KV cache of size s, attention dimension m, H heads, p‑bit precision, and a locality‑respecting cache controller (satisfied by all standard KV‑compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer (n \geq 4k, s \leq \sqrtn/4) requires depth L = Ω(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil), and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp)) via windowed pointer doubling, and a max‑bound L = Ω(\max(\lceil k/s \rceil, \log n/(Hmp))). Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when Hmp \lesssim \log n. Any lower bound provable via per‑window distinguishability counting ‑‑ including reachability, bandwidth, and combinations ‑‑ cannot exceed \lceil k/s \rceil once Hmp \geq \log_2 n. Breaking this requires lifting unconditional communication‑complexity bounds for pointer chasing to Cache‑Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over T = \lceil \log_2 k \rceil doubling stages, oblivious caches give \Pr[\mathcalE] \leq (s/(n‑T))^T + 2T^3/n (exponential in T), while adaptive locality‑respecting caches achieve \Pr[\mathcalE] = s/n exactly, independent of T. The Ω((n/s)^T‑1) separation explains why heavy‑hitter eviction empirically dominates random eviction for multi‑hop reasoning.

PaperID: 3797, https://arxiv.org/pdf/2604.17870.pdf

Authors: Tianle Xia, Lingxiang Hu, Yiding Sun, Ming Xu, Lan Xu, Siying Wang, Wei Xu, Jie Jiang

Title: GraSP: Graph-Structured Skill Compositions for LLM Agents

Abstract:
Skill ecosystems for LLM agents have matured rapidly, yet recent benchmarks show that providing agents with more skills does not monotonically improve performance ‑‑ focused sets of 2‑3 skills outperform comprehensive documentation, and excessive skills actually hurt. The bottleneck has shifted from skill availability to skill orchestration: agents need not more skills, but a structural mechanism to select, compose, and execute them with explicit causal dependencies. We propose GraSP, the first executable skill graph architecture that introduces a compilation layer between skill retrieval and execution. GraSP transforms flat skill sets into typed directed acyclic graphs (DAGs) with precondition‑effect edges, executes them with node‑level verification, and performs locality‑bounded repair through five typed operators ‑‑ reducing replanning from O(N) to O(d^h). Across ALFWorld, ScienceWorld, WebShop, and InterCode with eight LLM backbones, GraSP outperforms ReAct, Reflexion, ExpeL, and flat skill baselines in every configuration, improving reward by up to +19 points over the strongest baseline while cutting environment steps by up to 41%. GraSP's advantage grows with task complexity and is robust to both skill over‑retrieval and quality degradation, confirming that structured orchestration ‑‑ not larger skill libraries ‑‑ is the key to reliable agent execution.

PaperID: 3798, https://arxiv.org/pdf/2604.17860.pdf

Authors: Ting Zhang, Yikun Li, Chengran Yang, Ratnadira Widyasari, Yue Liu, Ngoc Tan Bui, Phuc Thanh Nguyen, Yan Naing Tun, Ivana Clairine Irsan, Huu Hung Nguyen, Huihui Huang, Jinfeng Jiang, Lwin Khin Shar, Eng Lieh Ouh, David Lo, Hong Jin Kang, Yide Yin, Wen Bin Leow

Title: TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs

Abstract:
Software vulnerabilities remain one of the most persistent threats to modern digital infrastructure. While static application security testing (SAST) tools have long served as the first line of defense, they suffer from high false‑positive rates. This article presents TitanCA, a collaborative project between Singapore Management University and GovTech Singapore that orchestrates multiple large language model (LLM)‑powered agents into a unified vulnerability discovery pipeline. Applied in open‑source software, TitanCA has discovered 203 confirmed zero‑day vulnerabilities and yielded 118 CVEs. We describe the four‑module architecture, i.e., matching, filtering, inspection, and adaptation, and share key lessons from building and deploying an LLM‑based vulnerability discovery solution in practice.

PaperID: 3799, https://arxiv.org/pdf/2604.17843.pdf

Authors: Nimisha Karnatak, Mohamad Chatila, Daniel Alejandro Pinzón Hernández, Reza Yazdanfar, Michelle Dugas, Renos Vakis

Title: Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

Abstract:
General‑purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA's multi‑agent pipeline enables users to query and receive evidence‑based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in‑the‑wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference‑in‑Differences estimates associate sustained engagement with 2.4‑3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized "evidence engine"; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page‑anchored citations. We contribute design guidelines for specialized AI and articulate a vision for "ecosystem‑aware" Humble AI.

PaperID: 3800, https://arxiv.org/pdf/2604.17827.pdf

Authors: Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Jiarui Zhang, Shaojie Tang, Fan Wu, Guihai Chen

Title: Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models

Abstract:
Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi‑step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self‑reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.

PaperID: 3801, https://arxiv.org/pdf/2604.17821.pdf

Authors: Lingfeng Zhang, Yongan Sun, Jinpeng Hu, Hui Ma, Yang Ying, Kuien Liu, Zenglin Shi, Meng Wang

Title: WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent

Abstract:
Recent advancements in large language models (LLMs) have empowered autonomous web agents to execute natural language instructions directly on real‑world webpages. However, existing agents often struggle with complex tasks involving dynamic interactions and long‑horizon execution due to rigid planning strategies and hallucination‑prone reasoning. To address these limitations, we propose WebUncertainty, a novel autonomous agent framework designed to tackle dual‑level uncertainty in planning and reasoning. Specifically, we design a Task Uncertainty‑Driven Adaptive Planning Mechanism that adaptively selects planning modes to navigate unknown environments. Furthermore, we introduce an Action Uncertainty‑Driven Monte Carlo tree search (MCTS) Reasoning Mechanism. This mechanism incorporates the Confidence‑induced Action Uncertainty (ConActU) strategy to quantify both aleatoric uncertainty (AU) and epistemic uncertainty (EU), thereby optimizing the search process and guiding robust decision‑making. Experimental results on the WebArena and WebVoyager benchmarks demonstrate that WebUncertainty achieves superior performance compared to state‑of‑the‑art baselines.

PaperID: 3802, https://arxiv.org/pdf/2604.17810.pdf

Authors: Chengyang Li, Shuai Wang, Kejiang Ye, Weijie Yuan, Boyu Zhou, Yik-Chung Wu, Chengzhong Xu, Huseyin Arslan

Title: Memory Centric Power Allocation for Multi-Agent Embodied Question Answering

Abstract:
This paper considers multi‑agent embodied question answering (MA‑EQA), which aims to query robot teams on what they have seen over a long horizon. In contrast to existing edge resource management methods that emphasize sensing, communication, or computation performance metrics, MA‑EQA emphasizes the memory qualities. To cope with this paradigm shift, we propose a quality of memory (QoM) model based on generative adversarial exam (GAE), which leverages forward simulation to assess memory retrieval and uses the resulting exam scores to compute QoM values. Then we propose memory centric power allocation (MCPA), which maximizes the QoM function under communication resource constraints. Through asymptotic analysis, it is found that the transmit powers are proportional to the GAE error probability, thus prioritizing towards high‑QoM robots. Extensive experiments demonstrate that MCPA achieves significant improvements over extensive benchmarks in terms of diverse metrics in various scenarios.

PaperID: 3803, https://arxiv.org/pdf/2604.17774.pdf

Authors: Yingtao Tian

Title: Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents

Abstract:
LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand‑crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta‑learning loop where LLM agents participate in duopoly markets and an LLM meta‑optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta‑prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held‑out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. Our findings call for further investigation into AI safety implications in autonomous multi‑agent systems.

PaperID: 3804, https://arxiv.org/pdf/2604.17699.pdf

Authors: Niful Islam, Muhammad Anas Raza, Mohammad Wardat

Title: SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

Abstract:
Large Language Models (LLMs) have transformed software development and AI applications. While LLMs are designed for text processing, LLM agents extend this capability by enabling autonomous actions, tool use, and multi‑step task completion. As this field grows, developers face new challenges in debugging these complex systems. To address this challenge, we present the first empirical study on bug fix patterns in LLM agents. We study buggy posts and code snippets from three platforms: Stack Overflow, GitHub, and HuggingFace Forums. We examine their fix patterns, the components where fixes are applied, and the programming languages and frameworks involved. Furthermore, we introduce AgentDefect, the first benchmark dataset for bugs in LLM agents. The dataset contains 37 runtime buggy instances along with fixed code and test files. Finally, we present SelfHeal, a multi‑agent system designed to fix bugs in LLM agents. The system leverages two independent ReAct agents: the fix agent and the critic agent. These agents use tools that provide both internal knowledge (fix rules) and external knowledge (web search) to propose and validate fixes. Our evaluation shows that SelfHeal with Gemini 3 Pro as the backbone LLM outperforms both baseline and state‑of‑the‑art approaches by a significant margin.

PaperID: 3805, https://arxiv.org/pdf/2604.17696.pdf

Authors: Xiachong Feng, Deyi Yin, Xiaocheng Feng, Yi Jiang, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Qiming Li, Yuxuan Gu, Bing Qin, Lingpeng Kong

Title: Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Abstract:
Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision‑making. However, existing self‑play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game‑specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain‑agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition‑level mathematics where multi‑step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.

PaperID: 3806, https://arxiv.org/pdf/2604.17693.pdf

Authors: Shripad Deshmukh, Jayakumar Subramanian, Raghavendra Addanki, Nikos Vlassis

Title: COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams

Abstract:
In cooperative teams where agents act in a fixed order and share a single team‑level reward (multi‑agent language systems, sequential robotic tasks), per‑agent credit assignment is under‑determined. Critic‑based approaches scale poorly as the number of agents grows owing to the costly maintenance of joint/factored critic(s), whereas the existing critic‑free alternatives have other issues: common credit across agents that couples every agent's signal to teammate noise, importance‑sampling corrections for upstream‑update staleness that incur variance exponential in team size, or per‑agent counterfactual replay that isolates each agent's effect at the price of extra environment or reward calls. We propose COSAC, a critic‑free per‑agent policy gradient for sequential cooperative teams. COSAC fits an additive per‑agent decomposition of the team reward by a single ridge regression on the rollout batch (giving each agent a learning signal decoupled from teammate noise), and computes each agent's counterfactual advantage from fictitious continuations of the current policy (policy forward passes that replace both importance‑sampling reweighting and per‑agent environment replay, at no extra environment or reward cost). The estimator instantiates the Sequential Aristocrat Utility (SeqAU), our extension of Wolpert and Tumer's (2001) aristocrat utility to sequential teams. We prove bias and variance bounds on SeqAU credits that stay controlled as the team grows. Our controlled study on sequential bandits demonstrates that COSAC attains the lowest advantage MSE and consistently low learning regret across team sizes up to K = 16. On the AI2 Reasoning Challenge (ARC) task, where four Qwen3‑0.6B agents reason in turn about a grade‑school science question, COSAC attains faster convergence than the other critic‑free baselines.

PaperID: 3807, https://arxiv.org/pdf/2604.17658.pdf

Authors: Jiazheng Li, Emine Yilmaz, Bei Chen, Dieu-Thu Le

Title: Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Abstract:
Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) enable complex problem‑solving but introduce significant debugging challenges, characterized by long interaction traces, inter‑agent dependencies, and delayed error manifestation. Existing diagnostic approaches often rely on expensive expert annotation or ''LLM‑as‑a‑judge'' paradigms, which struggle to pinpoint decisive error steps within extended contexts. In this paper, we introduce ErrorProbe, a self‑improving framework for semantic failure attribution that identifies responsible agents and the originating error step. The framework operates via a three‑stage pipeline: (1) operationalizing the MAS failure taxonomy to detect local anomalies, (2) performing symptom‑driven backward tracing to prune irrelevant context, and (3) employing a specialized multi‑agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool‑grounded execution. Crucially, ErrorProbe maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without the need for annotation. Experiments across the TracerTraj and Who&When benchmarks demonstrate that ErrorProbe significantly outperforms baselines, particularly in step‑level localization, while the verified memory enables robust cross‑domain transfer without retraining.

PaperID: 3808, https://arxiv.org/pdf/2604.17651.pdf

Authors: Siyuan Meng, Chengbo Ai

Title: Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

Abstract:
World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego‑vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure‑centric world models offer a fundamentally complementary capability: the bird's‑eye, multi‑sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio‑temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long‑term behavioral distributions including rare safety‑critical events, while vehicle‑borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure‑centric World Models (I‑WM) in three phases: (I) generative scene understanding with quality‑aware uncertainty propagation, (II) physics‑informed predictive dynamics with multi‑agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual‑layer architecture, annotation‑free perception as a multi‑modal data engine feeding end‑to‑end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I‑WM relative to LeCun's JEPA, Li Fei‑Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I‑VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi‑LiDAR pipelines and identifies open‑source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.

PaperID: 3809, https://arxiv.org/pdf/2604.17612.pdf

Authors: Benedikt Bollig, Matthias Függer, Thomas Nowak

Title: Provable Coordination for LLM Agents via Message Sequence Charts

Abstract:
Multi‑agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type‑mismatched messages are often hard to detect through testing. We introduce a domain‑specific language for specifying agent coordination based on message sequence charts (MSCs). The language separates message‑passing structure from LLM actions, whose outputs remain unpredictable. We define the syntax and semantics of the language and present a syntax‑directed projection that generates deadlock‑free local agent programs from global coordination specifications. We illustrate the approach with a diagnosis consensus protocol and show how coordination properties can be established independently of LLM nondeterminism. We also describe a runtime planning extension in which an LLM dynamically generates a coordination workflow for which the same structural guarantees apply. An open‑source Python implementation of our framework is available as ZipperGen.

PaperID: 3810, https://arxiv.org/pdf/2604.17562.pdf

Authors: Hailin Liu, Eugene Ilyushin, Jie Ni, Min Zhu

Title: SafeAgent: A Runtime Protection Architecture for Agentic Systems

Abstract:
Large language model (LLM) agents are vulnerable to prompt‑injection attacks that propagate through multi‑step workflows, tool interactions, and persistent context, making input‑output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context‑aware decision core that operates over persistent session state. The core is formalized as a context‑aware advanced machine intelligence and instantiated through operators for risk encoding, utility‑cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text‑level guardrail methods while maintaining competitive benign‑task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety‑utility operating points.

PaperID: 3811, https://arxiv.org/pdf/2604.17555.pdf

Authors: Hansi Zeng, Liam Collins, Bhuvesh Kumar, Neil Shah, Hamed Zamani

Title: CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

Abstract:
Agentic search ‑‑ the task of training agents that iteratively reason, issue queries, and synthesize retrieved information to answer complex questions ‑‑ has achieved remarkable progress through reinforcement learning (RL). However, existing approaches such as Search‑R1, treat the retrieval system as a fixed tool, optimizing only the reasoning agent while the retrieval component remains unchanged. A preliminary experiment reveals that the gap between an oracle and a fixed retrieval system reaches up to +26.8% relative F1 improvement across seven QA benchmarks, suggesting that the retrieval system is a key bottleneck in scaling agentic search performance. Motivated by this finding, we propose CoSearch, a framework that jointly trains a multi‑step reasoning agent and a generative document ranking model via Group Relative Policy Optimization (GRPO). To enable effective GRPO training for the ranker ‑‑ whose inputs vary across reasoning trajectories ‑‑ we introduce a semantic grouping strategy that clusters sub‑queries by token‑level similarity, forming valid optimization groups without additional rollouts. We further design a composite reward combining ranking quality signals with trajectory‑level outcome feedback, providing the ranker with both immediate and long‑term learning signals. Experiments on seven single‑hop and multi‑hop QA benchmarks demonstrate consistent improvements over strong baselines, with ablation studies validating each design choice. Our results show that joint training of the reasoning agent and retrieval system is both feasible and strongly performant, pointing to a key ingredient for future search agents.

PaperID: 3812, https://arxiv.org/pdf/2604.17517.pdf

Authors: Marcelo Fernandez

Title: From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

Abstract:
Autonomous agent systems are governed by enforcement mechanisms that flag hard constraint violations at runtime. The Agent Control Protocol identifies a structural limit of such systems: a correctly‑functioning enforcement engine can enter a regime in which behavioral drift is invisible to it, because the enforcement signal operates below the layer where deviation is measurable. We show that enforcement‑based governance is structurally unable to determine whether an agent behavior remains within the admissible behavior space A0 established at admission time. Our central result, the Non‑Identifiability Theorem, proves that A0 is not in the sigma‑algebra generated by the enforcement signal g under the Local Observability Assumption, which every practical enforcement system satisfies. The impossibility arises from a fundamental mismatch: g evaluates actions locally against a point‑wise rule set, while A0 encodes global, trajectory‑level behavioral properties set at admission time. An agent can therefore drift ‑‑ systematically shifting its behavioral distribution away from admission‑time expectations ‑‑ while every individual action remains within the permitted action space. We define the Invariant Measurement Layer (IML), which bypasses this limitation by retaining direct access to the generative model of A0, restoring observability precisely in the region where enforcement is structurally blind. We prove an information‑theoretic impossibility for enforcement‑based monitoring and show IML detects admission‑time drift with provably finite detection delay. Validated across four settings: three drift scenarios (300 and 1000 steps), a live n8n webhook pipeline, and a LangGraph StateGraph agent ‑‑ enforcement triggers zero violations while IML detects each drift type within 9‑258 steps of drift onset.

PaperID: 3813, https://arxiv.org/pdf/2604.17473.pdf

Authors: Kangyi Wu, Pengna Li, Kailin Lyu, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

Title: Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Abstract:
Vision‑Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models(Video‑LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub‑goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual‑Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub‑goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark‑Centric World Model to retrospectively predict object‑centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real‑world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long‑horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

PaperID: 3814, https://arxiv.org/pdf/2604.17464.pdf

Authors: Yongchao Wang, Zhiqiu Huang

Title: Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications

Abstract:
The transition from neural machine translation to agentic workflows has revolutionized Automated Program Repair (APR). However, existing agents, despite their advanced reasoning capabilities, frequently suffer from the ``Intent Gap'' ‑‑ the misalignment between the generated patch and the developer's original intent. Current solutions relying on natural language summaries or adversarial sampling often fail to provide the deterministic constraints required for surgical repairs. In this paper, we introduce \textscPrometheus, a novel framework that bridges this gap by prioritizing Specification Inference over code generation. We employ Behavior‑Driven Development (BDD) as an executable contract, utilizing a multi‑agent architecture to reverse‑engineer Gherkin specifications from runtime failure reports. To resolve the ``Hallucination of Intent,'' we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground‑truth code as a proxy oracle to validate inferred specifications. We evaluated \textscPrometheus on 680 defects from the Defects4J benchmark. The results are transformative: our framework achieved a total correct patch rate of 93.97% (639/680). More significantly, it demonstrated a Rescue Rate of 74.4%, successfully repairing 119 complex bugs that a strong blind agent failed to resolve. Qualitative analysis reveals that explicit intent guides agents away from structurally invasive over‑engineering toward precise, minimal corrections. Our findings suggest that the future of APR lies not in larger models, but in the capability to align code with verified, Executable Specifications ‑‑ whether pre‑existing or reverse‑engineered.

PaperID: 3815, https://arxiv.org/pdf/2604.17459.pdf

Authors: Chi Zhang, Zhipeng Xu, Jiahao Liu, Dongsheng Li, Hansu Gu, Peng Zhang, Ning Gu, Tun Lu

Title: Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration

Abstract:
While personalized recommender systems excel at content discovery, they frequently expose users to undesirable or discomforting information, highlighting the critical need for user‑centric filtering tools. Current methods leveraging Large Language Models (LLMs) struggle with two major bottlenecks: they lack multimodal awareness to identify visually inappropriate content, and they are highly prone to "over‑association" ‑‑ incorrectly generalizing a user's specific dislike (e.g., anxiety‑inducing marketing) to block benign, educational materials. These unconstrained hallucinations lead to a high volume of false positives, ultimately undermining user agency. To overcome these challenges, we introduce a novel framework that integrates end‑to‑cloud collaboration, multimodal perception, and multi‑agent orchestration. Our system employs a fact‑grounded adjudication pipeline to eliminate inferential hallucinations. Furthermore, it constructs a dynamic, two‑tier preference graph that allows for explicit, human‑in‑the‑loop modifications (via Delta‑adjustments), explicitly preventing the algorithm from catastrophically forgetting fine‑grained user intents. Evaluated on an adversarial dataset comprising 473 highly confusing samples, the proposed architecture effectively curbed over‑association, decreasing the false positive rate by 74.3% and achieving nearly twice the F1‑Score of traditional text‑only baselines. Additionally, a 7‑day longitudinal field study with 19 participants demonstrated robust intent alignment and enhanced governance efficiency. User feedback confirmed that the framework drastically improves algorithmic transparency, rebuilds user control, and alleviates the fear of missing out (FOMO), paving the way for transparent human‑AI co‑governance in personalized feeds.

PaperID: 3816, https://arxiv.org/pdf/2604.17456.pdf

Authors: Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu

Title: TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling

Abstract:
Urban traffic control is a system‑level coordination problem spanning heterogeneous subsystems, including traffic signals, freeways, public transit, and taxi services. Existing optimization‑based, reinforcement learning (RL), and emerging LLM‑based approaches are largely designed for isolated tasks, limiting both cross‑task generalization and the ability to capture coupled physical dynamics across subsystems. We argue that effective system‑level control requires a unified physical environment in which subsystems share infrastructure, mobility demand, and spatiotemporal constraints, allowing local interventions to propagate through the network. To this end, we propose TrafficClaw, a framework for general urban traffic control built upon a unified runtime environment. TrafficClaw integrates heterogeneous subsystems into a shared dynamical system, enabling explicit modeling of cross‑subsystem interactions and closed‑loop agent‑environment feedback. Within this environment, we develop an LLM agent with executable spatiotemporal reasoning and reusable procedural memory, supporting unified diagnostics across subsystems and continual strategy refinement. Furthermore, we introduce a multi‑stage training pipeline with supervised initialization and agentic RL with system‑level optimization, further enabling coordinated and system‑aware performance. Experiments demonstrate that TrafficClaw achieves robust, transferable, and system‑aware performance across unseen traffic scenarios, dynamics, and task configurations. Our project is available at https://github.com/usail‑hkust/TrafficClaw.

PaperID: 3817, https://arxiv.org/pdf/2604.17400.pdf

Authors: Mohit Dubey

Title: Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination

Abstract:
Multi‑agent systems (MAS) powered by large language models suffer from severe token inefficiency arising from two compounding sources: (i) unstructured parallel execution, where all agents activate simultaneously irrespective of input readiness; and (ii) unrestricted context sharing, where every agent receives the full accumulated context regardless of relevance. Existing mitigation strategies ‑ static pruning, hierarchical decomposition, and learned routing ‑ treat coordination as a structural allocation problem and fundamentally ignore its temporal dimension. We propose Phase‑Scheduled Multi‑Agent Systems (PSMAS), a framework that reconceptualizes agent activation as continuous control over a shared attention space modeled on a circular manifold. Each agent i is assigned a fixed angular phase theta_i in the range [0, 2pi], derived from the task dependency topology; a global sweep signal phi(t) rotates at velocity omega, activating only agents within an angular window epsilon. Idle agents receive compressed context summaries, reducing per‑step token consumption. We implement PSMAS on LangGraph, evaluate on four structured benchmarks (HotPotQA‑MAS, HumanEval‑MAS, ALFWorld‑Multi, WebArena‑Coord) and two unstructured conversational settings, and prove stability, convergence, and optimality results for the sweep dynamics. PSMAS achieves a mean token reduction of 27.3 percent (range 21.4‑34.8 percent) while maintaining task performance within 2.1 percentage points of a fully activated baseline (p < 0.01, n = 500 per configuration), and outperforms the strongest learned routing baseline by 5.6 percentage points in token reduction with 2.0 percentage points less performance drop. Crucially, we show that scheduling and compression are independent sources of gain: scheduling alone accounts for 18‑20 percentage points of reduction, robust to compression degradation up to alpha = 0.40.

PaperID: 3818, https://arxiv.org/pdf/2604.17379.pdf

Authors: Wanzhe Wang, Tong Zhang, Hao Xu, Shuai Wang, Rui Wang, Kai-Kit Wong

Title: MAGRPO: Accelerated MARL Training for Fluid Antenna-Assisted Wireless Network Optimization

Abstract:
Fluid antenna system (FAS) becomes a promising paradigm for next‑generation wireless networks, which enables position‑flexible antenna elements that can dynamically adjust to more favorable channel conditions. However, the optimization of fluid antenna (FA) positions, beamforming, and power allocation in FA‑assisted wireless networks is challenging, due to the non‑convexity and the lack of base station (BS) coordination. In this paper, we first formulate this challenging optimization problem as a decentralized partially observable Markov decision process, and then propose a multi‑agent group relative policy optimization (MAGRPO) algorithm under the centralized training decentralized execution (CTDE) paradigm. Compared with multi‑agent proximal policy optimization (MAPPO), MAGRPO replaces the critic network with group relative advantage estimation. This design reduces computational complexity by nearly half under parameter sharing. Furthermore, we derive a variance upper bound of the cumulative reward, which scales with network parameters, e.g., the number of BSs, users, and FAs. Simulation results show that compared with wireless networks with fixed antenna positions, FA‑assisted wireless networks achieve multiple‑fold sum‑rate enhancement. Moreover, the proposed MAGRPO attains sum‑rates comparable to those of MAPPO in testing, while reducing training time by 30% ～ 40%.

PaperID: 3819, https://arxiv.org/pdf/2604.17353.pdf

Authors: Zizhang Luo, Yuhao Luo, Youwei Xiao, Yansong Xu, Runlin Guo, Yun Liang

Title: Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

Abstract:
Large language models are increasingly deployed as complex agentic systems that scale with task complexity. While prior work has extensively explored model‑ and system‑level scaling, algorithm‑ and task‑level scaling remain largely unaddressed, constraining the full potential of agentic systems. At the algorithm level, allocating additional inference‑time computation can enhance workflow capacity but introduces cross‑path redundancy: overlapping computations across multiple reasoning branches. At the task level, complex tasks can be decomposed into subproblems and delegated across multiple agents for improved scalability and parallelism. However, existing infrastructures' scheduling is unaware of the existence of multiple agents, missing opportunities to optimize resource allocation. We propose Hive, a multi‑agent infrastructure that enables algorithm‑ and task‑level scaling. Hive features a description frontend that captures per‑agent behavior and supports test‑time scaling algorithms. Leveraging this specification, our backend introduces two key mechanisms: Logits Cache that reuses intermediate logits across redundant sampling paths to mitigate cross‑path redundancy at the algorithm level, and Agent‑Aware Scheduling that efficiently allocates compute and KV‑cache resources according to agent contributions at the task level. Experiments show that Logits Cache achieves an average speedup of 1.11×‑1.76× for re‑sampling, and Agent‑Aware Scheduling reduces the hotspot miss rate by 33%‑51%.

PaperID: 3820, https://arxiv.org/pdf/2604.17340.pdf

Authors: Shiyao Xie, Jian Du

Title: Neuro-Symbolic Resolution of Recommendation Conflicts in Multimorbidity Clinical Guidelines

Abstract:
Clinical guidelines, typically developed by independent specialty societies, inherently exhibit substantial fragmentation, redundancy, and logical contradiction. These inconsistencies, particularly when applied to patients with multimorbidity, not only cause cognitive dissonance for clinicians but also introduce catastrophic noise into AI systems, rendering the standard Retrieval‑Augmented Generation (RAG) system fragile and prone to hallucination. To address this fundamental reliability crisis, we introduce a Neuro‑Symbolic framework that automates the detection of recommendation redundancies and conflicts. Our pipeline employs a multi‑agent system to translate unstructured clinical natural language into rigorous symbolic logic language, which is then verified by a Satisfiability (SAT) solver. By formulating a hierarchical taxonomy of logical rule interactions, we identify a critical category termed Local Conflict ‑ a decision conflict arising from the intersection of comorbidities. Evaluating our system on a curated benchmark of 12 authoritative SGLT2 inhibitor guidelines, we reveal that 90.6% of conflicts are Local, a structural complexity that single‑disease guidelines fail to address. While state‑of‑the‑art LLMs fail in detecting these conflicts, our neuro‑symbolic approach achieves an F1 score of 0.861. This work demonstrates that logical verification must precede retrieval, establishing a new technical standard for automated knowledge coordination in medical AI.

PaperID: 3821, https://arxiv.org/pdf/2604.17327.pdf

Authors: George Fatouros, Kostas Metaxas

Title: Signal or Noise in Multi-Agent LLM-based Stock Recommendations?

Abstract:
We present the first portfolio‑level validation of MarketSenseAI, a deployed multi‑agent LLM equity system. All signals are generated live at each observation date, eliminating look‑ahead bias. The system routes four specialist agents (News, Fundamentals, Dynamics, and Macro) through a synthesis agent that issues a monthly equity thesis and recommendation for each stock in its coverage universe, and we ask two questions: do its buy recommendations add value over both passive benchmarks and random selection, and what does the internal agent structure reveal about the source of the edge? On the S&P 500 cohort (19 months) the strong‑buy equal‑weight portfolio earns +2.18%/month against a passive equal‑weight benchmark of +1.15% (approximating RSP), a +25.2% compound excess, and ranks at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003). The S&P 100 cohort (35 months) delivers a +30.5% compound excess over EQWL with consistent direction but formal significance not reached, limited by the small average selection of ~10 stocks per month. Non‑negative least‑squares projection of thesis embeddings onto agent embeddings reveals an adaptive‑integration mechanism. Agent contributions rotate with market regime (Fundamentals leads on S&P 500, Macro on S&P 100, Dynamics acts as an episodic momentum signal) and this agent rotation moves in lockstep with both the sector composition of strong‑buy selections and identifiable macro‑calendar events, three independent views of the same underlying adaptation. The recommendation's cross‑sectional Information Coefficient is statistically significant on S&P 500 (ICIR=+0.489, p=0.024). These results suggest that multi‑agent LLM equity systems can identify sources of alpha beyond what classical factor models capture, and that the buy signal functions as an effective universe‑filter that can sit upstream of any portfolio‑construction process.

PaperID: 3822, https://arxiv.org/pdf/2604.17311.pdf

Authors: Zihao Ren, Lei Wang, Guodong Shi

Title: Distributed Nesterov Flows for Multi-agent Optimization

Abstract:
Various distributed gradient descent algorithms for multi‑agent optimization have incorporated the Nesterov accelerated gradient method, where the use of momentum enhances convergence rates. These algorithms have found broad applications in large‑scale machine learning and optimization owing to their simplicity and low communication complexity. In this paper, we establish a continuous‑time approximation of distributed Nesterov gradient descent. The convergence properties and convergence rate of the resulting distributed Nesterov flow are analyzed using Lyapunov methods. Building on these insights, we design new parameter choices within the flow, from which we derive flow‑inspired discrete‑time algorithms for multi‑agent optimization. Surprisingly, the resulting algorithms achieve faster convergence compared to existing distributed gradient descent methods: they require fewer iterations to reach the same accuracy for strongly convex functions and exhibit an improved convergence rate for general convex functions without incurring additional communication rounds. Furthermore, we investigate the influence of the network topology on algorithm performance and derive an explicit relationship between the convergence rate and the graph condition number. Numerical simulations are presented to validate the effectiveness of the proposed approach.

PaperID: 3823, https://arxiv.org/pdf/2604.17309.pdf

Authors: Guangsheng Yu, Xu Wang

Title: Knows: Agent-Native Structured Research Representations

Abstract:
Research artifacts are distributed primarily as reader‑oriented documents like PDFs. This creates a bottleneck for increasingly agent‑assisted and agent‑native research workflows, in which LLM agents need to infer fine‑grained, task‑relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requiring no changes to the publication itself, and validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF‑only, sidecar‑only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B‑‑2B parameters) improve from 19‑‑25% to 47‑‑67% accuracy (+29 to +42 percentage points) when reading sidecar instead of PDF, while consuming 29‑‑86% fewer input tokens; an LLM‑as‑judge re‑scoring confirms that weak‑model sidecar accuracy (75‑‑77%) approaches stronger‑model PDF accuracy (78‑‑83%). Beyond this controlled evaluation, a community sidecar hub at https://knows.academy/ has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption‑ready at scale.

PaperID: 3824, https://arxiv.org/pdf/2604.17288.pdf

Authors: Zizhang Luo, Yansong Xu, Runlin Guo, Fan Cui, Kexing Zhou, Mile Xia, Hongyuan Hou, Yuhao Luo, Yun Liang

Title: Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

Abstract:
RTL program repair remains a critical bottleneck in hardware design and verification. Traditional automatic program repair (APR) methods rely on predefined templates and synthesis, limiting their bug coverage. Large language models (LLMs) and coding agents based on them offer flexibility but suffer from randomness and context corruption when handling long RTL code and waveforms. We present Clover, a neural‑symbolic agentic harness that orchestrates RTL repair as a structured search over code manipulations to explore a validated solution for the bug. Recognizing that different repair operations favor distinct strategies, Clover dynamically dispatches tasks to specialized LLM agents or symbolic solvers. At its core, Clover introduces stochastic tree‑of‑thoughts, a test‑time scaling mechanism that manages the main agent's context as a search tree, balancing exploration and exploitation for reliable outcomes. An RTL‑specific toolbox further empowers agents to interact with the debugging environment. Evaluated on the RTL‑repair benchmark, Clover fixes 96.8% of bugs within a fixed time limit, covering 94% and 63% more bugs than both pure traditional and LLM‑based baselines, respectively, while achieving an average pass@1 rate of 87.5%, demonstrating high reliability and effectiveness.

PaperID: 3825, https://arxiv.org/pdf/2604.17240.pdf

Authors: Vinil Pasupuleti, Shyalendar Reddy Allala, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

Title: Safe and Policy-Compliant Multi-Agent Orchestration for Enterprise AI

Abstract:
Enterprise AI systems increasingly deploy multiple intelligent agents across mission‑critical workflows that must satisfy hard policy constraints, bounded risk exposure, and comprehensive auditability (SOX, HIPAA, GDPR). Existing coordination methods ‑ cooperative MARL, consensus protocols, and centralized planners ‑ optimize expected reward while treating constraints implicitly. This paper introduces CAMCO (Constraint‑Aware Multi‑Agent Cognitive Orchestration), a runtime coordination layer that models multi‑agent decision‑making as a constrained optimization problem. CAMCO integrates three mechanisms: (i) a constraint projection engine enforcing policy‑feasible actions via convex projection, (ii) adaptive risk‑weighted Lagrangian utility shaping, and (iii) an iterative negotiation protocol with provably bounded convergence. Unlike training‑time constrained RL, CAMCO operates as deployment‑time middleware compatible with any agent architecture, with policy predicates designed for direct integration with production engines such as OPA. Evaluation across three enterprise scenarios ‑ including comparison against a constrained Lagrangian MARL baseline ‑ demonstrates zero policy violations, risk exposure below threshold (mean ratio 0.71), 92‑97% utility retention, and mean convergence in 2.4 iterations.

PaperID: 3826, https://arxiv.org/pdf/2604.17225.pdf

Authors: Rudra Ranajee Saha, Laks V. S. Lakshmanan, Raymond T. Ng

Title: A Multi-Agent Approach for Claim Verification from Tabular Data Documents

Abstract:
We present a novel approach for claim verification from tabular data documents. Recent LLM‑based approaches either employ complex pretraining/fine‑tuning or decompose verification into subtasks, often lacking comprehensive explanations and generalizability. To address these limitations, we propose a Multi‑Agentic framework for Claim verification (MACE) consisting of three specialized agents: Planner, Executor, and Verifier. Instead of elaborate finetuning, each agent employs a zero‑shot Chain‑of‑Thought setup to perform its tasks. MACE produces interpretable verification traces, with the Planner generating explicit reasoning strategies, the Executor providing detailed computation steps, and the Verifier validating the logic. Experiments demonstrate that MACE achieves state‑of‑the‑art (SOTA) performance on two datasets and performs on par with the best models on two others, while achieving 80‑‑100% of best performance with substantially smaller models: 27‑‑92B parameters versus 235B. This combination of competitive performance, memory efficiency, and transparent reasoning highlights our framework's effectiveness.

PaperID: 3827, https://arxiv.org/pdf/2604.17191.pdf

Authors: Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna

Title: Do LLM-derived graph priors improve multi-agent coordination?

Abstract:
Multi‑agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi‑domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand‑specify graph topology, rely on proximity‑based heuristics, or learn structure entirely from environment interaction; all of which are brittle, semantically uninformed, or data‑intensive. We investigate whether large language models (LLMs) can generate useful coordination graph priors for MARL by using minimal natural language descriptions of agent observations to infer latent coordination patterns. These priors are integrated into MARL algorithms via graph convolutional layers within a graph neural network (GNN)‑based pipeline, and evaluated on four cooperative scenarios from the Multi‑Agent Particle Environment (MPE) benchmark against baselines spanning the full spectrum of coordination modeling, from independent learners to state‑of‑the‑art graph‑based methods. We further ablate across five compact open‑source LLMs to assess the sensitivity of prior quality to model choice. Our results provide the first quantitative evidence that LLM‑derived graph priors can enhance coordination and adaptability in dynamic multi‑agent environments, and demonstrate that models as small as 1.5B parameters are sufficient for effective prior generation.

PaperID: 3828, https://arxiv.org/pdf/2604.17159.pdf

Authors: Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal, Unal Tatar

Title: Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Abstract:
We present, to our knowledge, the most comprehensive cross‑model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D‑CIPHER multi‑agent framework, we extend it with multi‑provider backend support, a custom Kali Linux environment with over 100 pre‑installed penetration testing tools, and runtime tool‑discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage‑point improvement over Ubuntu, while auto‑prompting and category‑specific tips often degrade performance in well‑equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost‑efficiency at 0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit while coherent same‑model configurations consistently outperform mixed‑tier pairings. Our results indicate that environment tooling and model selection emerge as the strongest drivers of performance, whereas prompt engineering interventions show diminishing or negative returns in well‑equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.

PaperID: 3829, https://arxiv.org/pdf/2604.17143.pdf

Authors: Remy Kim, Minseung Lee, Shuo Li, Osbert Bastani

Title: SeekerGym: A Benchmark for Reliable Information Seeking

Abstract:
Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic‑while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML Surveys, leaving substantial room for improvement.

PaperID: 3830, https://arxiv.org/pdf/2604.17142.pdf

Authors: Jonghan Lim, Mostafa Tavakkoli Anbarani, Rômulo Meira-Góes, Ilya Kovalenko

Title: Logic-Based Verification of Task Allocation for LLM-Enabled Multi-Agent Manufacturing Systems

Abstract:
Manufacturing industries are facing increasing product variability due to the growing demand for personalized products. Under these conditions, ensuring safety becomes challenging as frequent reconfigurations can lead to unintended hazardous behaviors. Multi‑agent control architectures have been proposed to improve flexibility through decentralized decision‑making and coordination. However, these architectures are based on predefined task models, which limit their ability to adapt task planning to new product requirements while preserving safety. Recently, large language models have been introduced into manufacturing systems to enhance adaptability, but reliability remains a key challenge. To address this issue, we propose a control architecture that leverages the flexibility of large language models while preserving safety on the manufacturing shop floor. Specifically, the proposed framework verifies large language model‑enabled task allocations by using temporal logic and discrete event systems. The effectiveness of the proposed framework is demonstrated through a case study that involves a multi‑robot assembly scenario, showing that unsafe tasks can be allocated safely before task execution.

PaperID: 3831, https://arxiv.org/pdf/2604.17139.pdf

Authors: Jiayuan Liu, Shiyi Du, Weihua Du, Mingyu Guo, Vincent Conitzer

Title: The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

Abstract:
Multi‑agent large language model (LLM) architectures increasingly rely on response‑level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi‑agent systems: response‑level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully‑formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token‑Level Round‑Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto‑regressive context. We formalize this process as a discrete‑time dynamical system, proving that token‑level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non‑linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.

PaperID: 3832, https://arxiv.org/pdf/2604.17137.pdf

Authors: Rohan Patil, Henrik I. Christensen

Title: BOIL: Learning Environment Personalized Information

Abstract:
Navigating complex environments poses challenges for multi‑agent systems, requiring efficient extraction of insights from limited information. In this paper, we introduce the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Leveraging the Pagerank algorithm and common information maximization, BOIL facilitates the extraction of information to guide long‑term agent behavior applicable to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate the efficacy of BOIL in generating strategy distributions conducive to improved performance over extended time horizons, surpassing heuristic approaches in complex environments.

PaperID: 3833, https://arxiv.org/pdf/2604.17111.pdf

Authors: Justice Owusu Agyemang, Jerry John Kponyo, Obed Kwasi Somuah, Elliot Amponsah, Godfred Manu Addo Boakye, Kwame Opuni-Boachie Obour Agyekum

Title: HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads

Abstract:
When multiple LLM coding agents share a rate‑limited API endpoint, they exhibit resource contention patterns analogous to unscheduled OS processes competing for CPU, memory, and I/O. In a motivating incident, 3 of 11 parallel agents died from connection resets and HTTP 502 errors ‑ a 27% failure rate ‑ despite the API having sufficient aggregate capacity to serve all 11 sequentially. We present HIVEMIND, a transparent HTTP proxy that applies five OS‑inspired scheduling primitives ‑ admission control, rate‑limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing ‑ to eliminate the failure modes caused by uncoordinated parallel execution. The proxy requires zero modifications to existing agent code and supports Anthropic, OpenAI, and local model APIs via auto‑detected provider profiles. Our evaluation across seven scenarios (5‑50 concurrent agents) shows that uncoordinated agents fail at 72‑100% rates under contention, while HIVEMIND reduces failures to 0‑18% and eliminates 48‑100% of wasted compute. An ablation study reveals that transparent retry ‑ not admission control ‑ is the single most critical primitive, but the primitives are most effective in combination. Real‑world validation against Ollama confirms that HIVEMIND adds under 3ms of proxy overhead per request. The system is open‑source under the MIT license.

PaperID: 3834, https://arxiv.org/pdf/2604.17020.pdf

Authors: Huije Lee, Jisu Shin, Hoyun Song, Changgeon Ko, Jong C. Park

Title: Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation

Abstract:
Static benchmarks for harmful content detection face limitations in scalability and diversity, and may also be affected by contamination from web‑scale pre‑training corpora. To address these issues, we propose a framework for synthesizing harmful content, leveraging persona‑guided large language model (LLM) agents. Our approach constructs two‑dimensional user personas by integrating demographic identities and topical interests with situational harmful strategies, enabling the simulation of diverse and contextually grounded harmful interactions. We evaluate the framework along three dimensions: harmfulness, challenge level, and diversity. Both human and LLM‑based evaluations confirm that our framework achieves a high harmful generation success rate. Experiments across multiple detection systems reveal that our synthetic scenarios are more challenging to detect than those in existing benchmarks. Furthermore, a multi‑faceted analysis confirms that our approach achieves linguistic and topical diversity comparable to human‑curated datasets, establishing our framework as an effective tool for robust stress‑testing of harmful content detection systems.

PaperID: 3835, https://arxiv.org/pdf/2604.17009.pdf

Authors: Wenzhen Yuan, Wutao Xiong, Fanchen Yu, Shengji Tang, Ting Liu, Tao Chen, Peng Ye, Yuzhuo Fu, Wanli Ouyang, Lei Bai

Title: Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition

Abstract:
Multi‑agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent‑as‑Tool, a unified parallel orchestration paradigm that abstracts both agents and tools into a standardized, learnable action space with protocol normalization and explicit state feedback. Building on this paradigm, we train a lightweight orchestrator, ParaManager, which decouples planning decisions from subtask solving, enabling state‑aware parallel subtask decomposition, delegation, and asynchronous execution. For training, we adopt a two‑stage ParaManager training pipeline. It improves robustness by incorporating supervised fine‑tuning (SFT) trajectories equipped with recovery mechanisms, and further applies reinforcement learning (RL) to achieve an optimal balance among task success, protocol compliance, diversity, and reasoning efficiency. Experiments show that ParaManager achieves strong performance across multiple benchmarks and exhibits robust generalization under unseen model pools.

PaperID: 3836, https://arxiv.org/pdf/2604.16989.pdf

Authors: Martin Balko, Jan Grebík, Pavel Hubáček, Martin Koutecký, Matěj Kripner, Václav Rozhoň, Robert Šámal, Adrián Zámečník

Title: Bolzano: Case Studies in LLM-Assisted Mathematical Research

Abstract:
We report new results on eight problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open‑source multi‑agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance‑autonomy taxonomy of Feng et al., six of the eight results reach the level of publishable research, and five of the eight were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.

PaperID: 3837, https://arxiv.org/pdf/2604.16987.pdf

Authors: Hongyuan Qi, Feifei Shao, Ming Li, Hehe Fan, Jun Xiao

Title: DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

Abstract:
The rapid evolution of video generation technologies poses a significant challenge to media forensics, as conventional detection methods often fail to generalize beyond their training distributions. To address this, we propose DVAR (Debate‑based Video Authenticity Reasoning), a training‑free framework that reformulates video detection as a structured multi‑agent forensic reasoning process. Moving beyond the paradigm of pattern matching, DVAR orchestrates a competition between a Generative Hypothesis Agent and a Natural Mechanism Agent. Through iterative rounds of cross‑examination, these agents defend their respective explanations against abnormal evidence, driving a logical convergence where the truth emerges from rigorous stress‑testing. To adjudicate these conflicting claims, we apply Occam's Razor through the Minimum Description Length (MDL) framework, defining an Explanatory Cost to quantify the "logical burden" of each reasoning path. Furthermore, we integrate GenVideoKB, a dynamic knowledge repository that provides high‑level reasoning heuristics on generative boundaries and failure modes. Extensive experiments demonstrate that DVAR achieves competitive performance against supervised state‑of‑the‑art methods while exhibiting superior generalization to unseen generative architectures. By transforming detection into a transparent debate, DVAR provides explicit, interpretable reasoning traces for robust video authenticity assessment.

PaperID: 3838, https://arxiv.org/pdf/2604.16950.pdf

Authors: Pollawat Hongwimol, Haoning Shang, Chutong Wang, Zhichao Wan, Yi Gao, Yuanming Li, Lin Gui, Wenhao Sun, Cheng Yu

Title: AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction

Abstract:
Product attribute extraction in e‑commerce is bottlenecked by ontologies that are inconsistent, incomplete, and costly to maintain. We present AutoPKG, a multi‑agent Large Language Model (LLM) framework that automatically constructs a Product‑attribute Knowledge Graph (PKG) from multimodal product content. AutoPKG induces product types and type‑specific attribute keys on demand, extracts attribute values from text and images, and consolidates updates through a centralized decision agent that maintains a globally consistent canonical graph. We also propose an evaluation protocol for dynamic PKGs that measures type and key validity, consolidation quality, and edge‑level accuracy for value assertions after canonicalization. On a large real‑world marketplace catalog dataset from Lazada (Alibaba), AutoPKG achieves up to 0.953 Weighted Knowledge Efficiency (WKE) for product types, 0.724 WKE for attribute keys, and 0.531 edge‑level F1 for multimodal value extraction. Across three public benchmarks, our method improves edge‑level exact‑match F1 by 0.152 and yields a precision gain of 0.208 on the attribute extraction application. Online A/B tests show that AutoPKG‑derived attributes increase Gross Merchandise Value (GMV) in Badge by 3.81 percent, in Search by 5.32 percent, and in Recommendation by 7.89 percent, supporting the practical value of AutoPKG in production.

PaperID: 3839, https://arxiv.org/pdf/2604.16911.pdf

Authors: Sampriti Saha, Pranav Hemanth

Title: Skilldex: A Package Manager and Registry for Agent Skill Packages with Hierarchical Scope-Based Distribution

Abstract:
Large Language Model (LLM) agents are increasingly extended at runtime via skill packages, structured natural‑language instruction bundles loaded from a well‑known directory. Community install tooling and registries exist, but two gaps persist: no public tool scores skill packages against Anthropic's published format specification, and no mechanism bundles related skills with the shared context they need to remain mutually coherent. We present Skilldex, a package manager and registry for agent skill packages addressing both gaps. The two novel contributions are: (1) compiler‑style format conformance scoring against Anthropic's skill specification, producing line‑level diagnostics on description specificity, frontmatter validity, and structural adherence; and (2) the skillset abstraction, a bundled collection of related skills with shared assets (vocabulary files, templates, reference documents) that enforce cross‑skill behavioral coherence. Skilldex also provides supporting infrastructure: a three‑tier hierarchical scope system, a human‑in‑the‑loop agent suggestion loop, a metadata‑only community registry, and a Model Context Protocol (MCP) server. The system is implemented as a TypeScript CLI (skillpm / spm) with a Hono/Supabase registry backend, and is open‑source.

PaperID: 3840, https://arxiv.org/pdf/2604.16818.pdf

Authors: Sosui Moribe, Taketoshi Ushiama

Title: Beyond Serendipity: From Exposing the Unknown to Fostering Engagement through Peer Recommendation

Abstract:
Serendipity‑oriented recommender systems expose users to unfamiliar items to counter filter bubbles, yet mere exposure does not ensure that users will understand or appreciate the content they encounter. We propose Peer Recommendation, a framework in which a user and an AI agent (Peer) with distinct preferences collaboratively explore unfamiliar content. Unlike conventional conversational recommender systems where the user is a passive recipient, our framework positions the user as both a recommender and a recipient: the user and the Peer mutually recommend songs to each other through chat‑based dialogue, collaboratively building a shared playlist. In an exploratory within‑subjects experiment (N=14), we compared three conditions: (1) a Close Peer, (2) a Distant Peer, and (3) a baseline agent without an explicit preference profile. The Close Peer significantly increased users' interest expansion and perceived value of the activity compared to the baseline, with medium‑to‑large effect sizes. The Distant Peer showed no significant difference at the aggregate level; however, qualitative analysis revealed varied responses, with some participants strongly preferring the Distant Peer. These findings suggest that the "otherness" of a recommendation partner is essential for moving beyond mere exposure toward genuine engagement, and that the appropriate degree of preference distance may vary and need to be adapted to individual users.

PaperID: 3841, https://arxiv.org/pdf/2604.16778.pdf

Authors: Dixi Yao, Tahseen Rabbani, Manzil Zaheer, Tian Li

Title: Federation over Text: Insight Sharing for Multi-Agent Reasoning

Abstract:
We propose a federated learning‑like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes without sharing actual problem instances or task instructions. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. Iteratively, each client runs an LLM agent that does local thinking and self‑improvement on their specific tasks independently, and shares reasoning traces with a central server, which aggregates and distills them into a cross‑task (and cross‑domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross‑domain collaboration, real‑world daily tasks, and machine learning research insight discovery. Specifically, it improves average performance scores by 25% while reducing the reasoning tokens by 4% across the first three applications. In the research insight discovery application, FoT is able to generate insights that cover over 80% of the major contributions in the subsequent papers.

PaperID: 3842, https://arxiv.org/pdf/2604.16752.pdf

Authors: Eren Unlu

Title: Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

Abstract:
Current agent evaluations largely reward execution on fully specified tasks, while recent work studies clarification [11, 22, 2], capability awareness [9, 1], abstention [8, 14], and search termination [20, 5] mostly in isolation. This leaves open whether agents can diagnose why a task is blocked before acting. We introduce the Support‑State Triage Audit (SSTA‑32), a matched‑item diagnostic framework in which minimal counterfactual edits flip the same base request across four support states: Complete (ANSWER), Clarifiable (CLARIFY), Support‑Blocked (REQUEST SUPPORT), and Unsupported‑Now (ABSTAIN). We evaluate a frontier model under four prompting conditions ‑ Direct, Action‑Only, Confidence‑Only, and a typed Preflight Support Check (PSC) ‑ using Dual‑Persona Auto‑Auditing (DPAA) with deterministic heuristic scoring. Default execution overcommits heavily on non‑complete tasks (41.7% overcommitment rate). Scalar confidence mapping avoids overcommitment but collapses the three‑way deferral space (58.3% typed deferral accuracy). Conversely, both Action‑Only and PSC achieve 91.7% typed deferral accuracy by surfacing the categorical ontology in the prompt. Targeted ablations confirm that removing the support‑sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence‑sufficiency dimension triggers systematic overcommitment on unsupported items. Because DPAA operates within a single context window, these results represent upper‑bound capability estimates; nonetheless, the structural findings indicate that frontier models possess strong latent triage capabilities that require explicit categorical decision paths to activate safely.

PaperID: 3843, https://arxiv.org/pdf/2604.16729.pdf

Authors: Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

Title: Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis

Abstract:
State‑of‑the‑art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi‑step radiological workflows remains underexplored. In this work, we present a training‑free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT‑5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off‑the‑shelf domain‑specific tools, our system autonomously executes complex end‑to‑end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single‑scan segmentation and volumetric reporting to longitudinal response assessment requiring multi‑timepoint comparisons. We analyze the impact of architectural design by comparing single‑agent models against multi‑agent "domain‑expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image‑prompt‑answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly neuro‑radiological image analysis tasks through tool use without the need for training or fine‑tuning.

PaperID: 3844, https://arxiv.org/pdf/2604.16723.pdf

Authors: Moein Salimi, Babak Hosseini Mohtasham, Amin Aghakasiri, Mahdi Naieni, Amir Hossein Qeysarbeigi, Mohammad Masih Shalchian Nazer, Zahra Azar, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban

Title: Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training

Abstract:
Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi‑agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open‑ended domain is reward hacking ‑‑ where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high‑quality scientific idea generation. We propose the first multi‑agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We grounded our training in ICLR‑320, a curated dataset of problem‑solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state‑of‑the‑art baselines across expert‑evaluated metrics of novelty, feasibility, and effectiveness.

PaperID: 3845, https://arxiv.org/pdf/2604.16687.pdf

Authors: Varun Kumar, George Em Karniadakis

Title: Agentic Risk-Aware Set-Based Engineering Design

Abstract:
This paper introduces a multi‑agent framework guided by Large Language Models (LLMs) to assist in the early stages of engineering design, a phase often characterized by vast parameter spaces and inherent uncertainty. Operating under a human‑in‑the‑loop paradigm and demonstrated on the canonical problem of aerodynamic airfoil design, the framework employs a team of specialized agents: a Coding Assistant, a Design Agent, a Systems Engineering Agent, and an Analyst Agent ‑ all coordinated by a human Manager. Integrated within a set‑based design philosophy, the process begins with a collaborative phase where the Manager and Coding Assistant develop a suite of validated tools, after which the agents execute a structured workflow to systematically explore and prune a large set of initial design candidates. A key contribution of this work is the explicit integration of formal risk management, employing the Conditional Value‑at‑Risk (CVaR) as a quantitative metric to filter designs that exhibit a high probability of failing to meet performance requirements, specifically the target coefficient of lift. The framework automates labor‑intensive initial exploration through a global sensitivity analysis conducted by the Analyst agent, which generates actionable heuristics to guide the other agents. The process culminates by presenting the human Manager with a curated final set of promising design candidates, augmented with high‑fidelity Computational Fluid Dynamics (CFD) simulations. This approach effectively leverages AI to handle high‑volume analytical tasks, thereby enhancing the decision‑making capability of the human expert in selecting the final, risk‑assessed design.

PaperID: 3846, https://arxiv.org/pdf/2604.16566.pdf

Authors: Arya Mary K J, Deepthy K Bhaskar, Sinu T S, Binu V P

Title: Agentic AI for Education: A Unified Multi-Agent Framework for Personalized Learning and Institutional Intelligence

Abstract:
Agentic Artificial Intelligence (AI) represents a paradigm shift from reactive systems to proactive, autonomous decision making frameworks. Existing AI‑based educational systems remain fragmented and lack multi‑level integration across stakeholders. This paper proposes the Agentic Unified Student Support System (AUSS), a novel multi‑agent architecture integrating student‑level personalization, educator‑level automation, and institutional‑level intelligence. The framework leverages Large Language Models (LLMs), reinforcement learning, predictive analytics, and rule‑based reasoning. Experimental results demonstrate improvements in recommendation accuracy (92.4%), grading efficiency (94.1%), and dropout prediction (F1‑score: 89.5%). The proposed system enables scalable, adaptive, and intelligent educational ecosystems.

PaperID: 3847, https://arxiv.org/pdf/2604.16548.pdf

Authors: Zehao Lin, Chunyu Li, Kai Chen

Title: A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty

Abstract:
Research on large language model (LLM) security is shifting from "will the model leak training data" to a more consequential question: can an agent with persistent, long‑term memory be continuously shaped, cross‑session poisoned, accessed without authorization, and propagated across shared organizational state? Recent surveys cover memory architectures and agent mechanisms, but fewer center the epistemic and governance properties of persistent, writable memory as the reason memory is an independent security problem. This survey addresses that gap. Drawing on cognitive neuroscience and the philosophy of memory, we characterize agent memory as malleable, rewritable, and socially propagating, and develop a memory‑lifecycle framework organized around six phases ‑‑ Write, Store, Retrieve, Execute, Share, Forget/Rollback ‑‑ cross‑tabulated against four security objectives: integrity, confidentiality, availability, governance. We organize the literature on memory poisoning, extraction, retrieval corruption, control‑flow hijacking, cross‑agent propagation, rollback, and governance, and situate representative architectures as determinants of which phases are explicitly governable. Three findings stand out: the literature concentrates on write‑ and retrieve‑time integrity attacks, while confidentiality, availability, store/forget, and benign‑persistence failures remain sparsely studied; no published architecture covers all nine governance primitives we identify; and using LLMs themselves for memory security remains sparse yet essential. We unify these under mnemonic sovereignty ‑‑ verifiable, recoverable governance over what may be written, who may read, when updates are authorized, and which states may be forgotten ‑‑ arguing future secure agents will be differentiated not only by recall capacity, but by memory governance quality.

PaperID: 3848, https://arxiv.org/pdf/2604.16524.pdf

Authors: Ravi Kiran Kadaboina

Title: Anumati: Proof of Adherence as a Formal Consent Model for Autonomous Agent Protocols

Abstract:
As autonomous AI agents increasingly call other agents to complete tasks on behalf of a human principal, a structural accountability gap has emerged: the calling agent accepts the terms of service of the callee without any protocol‑level mechanism to prove that it understood those terms or that it subsequently honoured them. Authentication protocols such as OAuth and mutual TLS establish who may call which capability. They do not address under what conditions a permitted call may be made, and those conditions change as the callee's policies evolve. In this paper we formalise the distinction between proof of acceptance (a timestamped acknowledgement) and proof of adherence (a per‑action reasoning record citing the specific clause evaluated). We propose three primitives (PolicyDocument, ConsentRecord, and AdherenceEvent) that together constitute a versioned, append‑only consent model for agent‑to‑agent communication. The model is instantiated as a non‑breaking extension to two widely used agent protocols: the Agent2Agent (A2A) protocol and the Model Context Protocol (MCP). A TLA+ specification of the consent lifecycle, together with a reference Python implementation of the chain integrity and adherence trail validators, is available in the accompanying repository.

PaperID: 3849, https://arxiv.org/pdf/2604.16493.pdf

Authors: Shizheng Hou, Wenqi Pei, Nuo Chen, Quang-Trung Ta, Peng Lu, Beng Chin Ooi

Title: NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

Abstract:
Natural Language to SQL (NL2SQL) technology empowers non‑expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM‑enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine‑grained metrics that systematically quantify module‑level effectiveness and efficiency. We further implement these metrics in a flexible multi‑agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open‑source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek‑V3 and GPT‑4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also the significant computational inefficiency, which severely hampers real‑world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, emphasizing issues such as inaccurate gold SQL annotations and limitations in existing evaluation rules. By synthesizing these insights into a unified benchmarking, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.

PaperID: 3850, https://arxiv.org/pdf/2604.16472.pdf

Authors: Dirk Bergemann, Soheil Ghili, Xinyang Hu, Chuanhao Li, Zhuoran Yang

Title: Training Language Models for Bilateral Trade with Private Information

Abstract:
Bilateral bargaining under incomplete information provides a controlled testbed for evaluating large language model (LLM) agent capabilities. Bilateral trade demands individual rationality, strategic surplus maximization, and cooperation to realize gains from trade. We develop a structured bargaining environment where LLMs negotiate via tool calls within an event‑driven simulator, separating binding offers from natural‑language messages to enable automated evaluation. The environment serves two purposes: as a benchmark for frontier models and as a training environment for open‑weight models via reinforcement learning. In benchmark experiments, a round‑robin tournament among five frontier models (15,000 negotiations) reveals that effective strategies implement price discrimination through sequential offers. Aggressive anchoring, calibrated concession, and temporal patience correlate with the highest surplus share and deal rate. Accommodating strategies that concede quickly disable price discrimination in the buyer role, yielding the lowest surplus capture and deal completion. Stronger models scale their behavior proportionally to item value, maintaining performance across price tiers; weaker models perform well only when wide zones of possible agreement offset suboptimal strategies. In training experiments, we fine‑tune Qwen3 (8B, 14B) via supervised fine‑tuning (SFT) followed by Group Relative Policy Optimization (GRPO) against a fixed frontier opponent. These stages optimize competing objectives: SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates but erodes surplus gains, reflecting the reward structure. SFT also compresses surplus variation across price tiers, which generalizes to unseen opponents, suggesting that behavioral cloning instills proportional strategies rather than memorized price points.

PaperID: 3851, https://arxiv.org/pdf/2604.16471.pdf

Authors: Jianfeng Xu

Title: Semantic Channel Theory: Deductive Compression and Structural Fidelity for Multi-Agent Communication

Abstract:
Shannon's information theory deliberately excludes message semantics. This paper develops a rigorous framework for semantic communication that integrates formal proof systems with Shannon‑theoretic tools. We introduce an axiomatic information model comprising Lsem‑definable state sets linked by computable enabling maps, and define the semantic channel as a composition of Markov kernels whose supports respect the enabling structure. A fixed proof system induces an irredundant semantic core and a derivation‑depth stratification, enabling four distortion measures of increasing semantic depth: Hamming, closure, depth, and a parameterized composite. Six families of computable semantic channel invariants are defined and their inter‑relationships established, including a data processing bound, a semantic Fano bound, and an ideal‑channel collapse theorem. The central quantitative result is a deductive compression gain: under closure‑based fidelity, the minimum block length is determined by the irredundant core size rather than the full knowledge‑base size. We instantiate the framework for heterogeneous multi‑agent communication, introducing an overlap decomposition that yields necessary and sufficient conditions for closure‑reliable communication. A semantic bottleneck phenomenon is identified in broadcast settings: vocabulary mismatch imposes irreducible fidelity limitations even over noiseless carriers. All results are verified on an explicit Datalog instance.

PaperID: 3852, https://arxiv.org/pdf/2604.16469.pdf

Authors: Yanfei Song

Title: B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents

Abstract:
LLM agents execute in an interleaved reasoning‑and‑action loop, where future tool calls cannot be launched until the current reasoning step completes. This serial dependency inflates end‑to‑end latency and leaves the model idle while waiting for tool execution. Prior work, Pattern‑Aware Speculative Tool Execution (PASTE), mitigates this bottleneck by speculating likely future tool invocations from mined control‑flow and data‑flow regularities. However, PASTE is tool‑centric and speculates only individual invocations rather than bounded future branches. We propose B‑PASTE, a beam‑aware extension that lifts speculation from single tools to local branch hypotheses under strict resource constraints. B‑PASTE maintains a bounded beam of future execution subgraphs, ranks them by expected critical‑path reduction rather than raw execution probability, and schedules only high‑value branch prefixes on transient slack resources. It explicitly models co‑run interference, downstream unlock value, and state‑safety constraints, enabling the system to prioritize serial fast‑path execution when early completion unlocks valuable future work, while still exploiting safe parallelism under low contention. This design is especially important for edge‑side deployments, where speculative work must not steal scarce resources from latency‑critical authoritative execution. Preliminary internal testing on Thor‑class edge environments shows up to 1.4X end‑to‑end speedup, suggesting that branch‑aware speculative execution remains effective even under tight resource budgets.

PaperID: 3853, https://arxiv.org/pdf/2604.16416.pdf

Authors: Xing Wei, Yang Yu

Title: Tensor Manifold-Based Graph-Vector Fusion for AI-Native Academic Literature Retrieval

Abstract:
The rapid development of large language models and AI agents has triggered a paradigm shift in academic literature retrieval, putting forward new demands for fine‑grained, time‑aware, and programmable retrieval. Existing graph‑vector fusion methods still face bottlenecks such as matrix dependence, storage explosion, semantic dilution, and lack of AI‑native support. This paper proposes a geometry‑unified graph‑vector fusion framework based on tensor manifold theory, which formally proves that an academic literature graph is a discrete projection of a tensor manifold, realizing the native unification of graph topology and vector geometric embedding. Based on this theoretical conclusion, we design four core modules: matrix‑independent temporal diffusion signature update, hierarchical temporal manifold encoding, temporal Riemannian manifold indexing, and AI‑agent programmable retrieval. Theoretical analysis and complexity proof show that all core algorithms have linear time and space complexity, which can adapt to large‑scale dynamic academic literature graphs. This research provides a new theoretical framework and engineering solution for AI‑native academic literature retrieval, promoting the industrial application of graph‑vector fusion technology in the academic field.

PaperID: 3854, https://arxiv.org/pdf/2604.16406.pdf

Authors: Jinkai Qiu, Alessandro Saviolo, Chaojie Wang, Mingke Wang, Xiaoyu Huang

Title: Heterogeneous Self-Play for Realistic Highway Traffic Simulation

Abstract:
Realistic highway simulation is critical for scalable safety evaluation of autonomous vehicles, particularly for interactions that are too rare to study from logged data alone. Yet highway traffic generation remains challenging because it requires broad coverage across speeds and maneuvers, controllable generation of rare safety‑critical scenarios, and behavioral credibility in multi‑agent interactions. We present PHASE, Policy for Heterogeneous Agent Self‑play on Expressway, a context‑aware self‑play framework that addresses these three requirements through explicit per‑agent conditioning for controllability, synthetic scenario generation for broad highway coverage, and closed‑loop multi‑agent training for realistic interaction dynamics. PHASE further supports different vehicle profiles, for example, passenger cars and articulated trailer trucks, within a single policy via vehicle‑aware dynamics and context‑conditioned actions, and stabilizes self‑play with early termination of unrecoverable states, at‑fault collision attribution, highway‑aware reward shaping, coupled curricula, and robust policy optimization. Despite being trained only on synthetic data, PHASE transfers zero‑shot to 512 unseen high‑interaction real scenarios in exiD, achieving a 96.3% success rate and reducing ADE/FDE from 6.57/12.07 m to 2.44/5.25 m relative to a prior self‑play baseline. In a learned trajectory embedding space, it also improves behavioral realism over IDM, reducing Frechet trajectory distance by 13.1% and energy distance by 20.2%. These results show that synthetic self‑play can provide a scalable route to controllable and realistic highway scenario generation without direct imitation of expert logs.

PaperID: 3855, https://arxiv.org/pdf/2604.16394.pdf

Authors: Riccardo Terrenzi, Phongsakon Mark Konrad, Tim Lukas Adam, Serkan Ayvaz

Title: A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

Abstract:
Ad hoc dataset search requires matching underspecified natural‑language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software‑architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense‑embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider‑authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo‑queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct agent and a multi‑agent horizontal architecture with Feedback Control. Their quality‑attribute tradeoffs are analyzed with respect to modifiability, observability, performance, and governance. An evaluation framework comprising seven system variants is defined to isolate the contribution of each architectural decision. The architecture is presented as an extensible reference design for the software architecture community, incorporating explicit governance tactics to bound and audit nondeterministic LLM components.

PaperID: 3856, https://arxiv.org/pdf/2604.16366.pdf

Authors: Yiyao Yang, Yasemin Gulbahar

Title: Decoding AI Tutor Effects for Educational Measurement: Temporal, Multi-Outcome, and Behavior-Cognitive Analysis

Abstract:
Artificial intelligence (AI) tutors have become increasingly popular in learning environments. In this study, we propose an AI agent prototype framework for exploring AI‑assisted learning with temporal interaction patterns, multiple outcomes analysis, and behavioral‑cognitive learner profiling. Based on three research questions, this study aims to investigate whether early interaction patterns can predict later performance and trust, how multiple outcomes can be traded off with different AI tutor feedback conditions, and if learner profiles can be identified with behavioral and cognitive indicators. An AI tutor agent has been developed to provide various feedback forms to learners, including hints, explanations, examples, and code. A neural policy model and a stochastic simulation framework are used to produce artificial student‑AI tutor interaction records, which include response time, attempts, hint requests, correctness, quiz results, improvement, satisfaction, and trust. Temporal features are used to predict later correctness and trust with early interaction patterns, and clustering methods are used to find learner profiles. The results showed that early interaction patterns were predictive of later performance and trust, that student behavior changed over time with AI‑based tutoring, and that latent student profiles could be identified based on their behavioral and cognitive differences.

PaperID: 3857, https://arxiv.org/pdf/2604.16348.pdf

Authors: Carina I Hausladen, Javier Argota Sánchez-Vaquerizo, Michael Siebenmann, Arthur Capozzi, Sachit Mahajan, Dirk Helbing

Title: Beyond the Townhall: Spatial Anchoring and LLM Agents for Scalable Participatory Urban Planning

Abstract:
Participatory urban planning is central to sustainable city‑making, yet the technically demanding nature of such interventions often limits meaningful involvement by diverse publics. We introduce a scalable digital participation platform that embeds sustainability projects within a navigable digital twin. Citizens experience a guided virtual walkthrough with audio narration employing the method of loci and spatial anchoring to support mnemonic encoding and recall. This immersive interface is augmented by two purpose‑built LLM assistants: one delivers source‑grounded factual clarifications, while the other facilitates reflective discussion. We evaluated this system in a randomized controlled online experiment (N = 195) against conventional industry practices (static visualizations and text‑based consultations). Results show that spatially anchored immersive presentation significantly improved information recall, which substantially shifted participants' attention from individual inconveniences to collective, community‑oriented sustainability benefits. Consequently, participants provided significantly more constructive, solution‑focused feedback to the (simulated) municipality. These findings establish a practical tool for cities and policymakers to foster inclusive, democratic participation in sustainability transitions.

PaperID: 3858, https://arxiv.org/pdf/2604.16342.pdf

Authors: Hansoo Lee, Yoonjae Cho, Sonya S. Kwak, Rafael A. Calvo

Title: SAGE: Sensor-Augmented Grounding Engine for LLM-Powered Sleep Care Agent

Abstract:
Sleep is vital for health, yet access to data alone does not guarantee improvement. While wearables and health apps enable tracking, users face a "Data‑Action Gap," struggling to interpret metrics and translate them into action. Current interventions fail to bridge this: static dashboards lack context, rule‑based agents rely on rigid scripts, and LLM‑agents lack grounding in personal data, causing trust issues. We propose SAGE (Sensor‑Augmented Grounding Engine) for an LLM‑powered sleep care agent. SAGE normalizes continuous sleep, physiological, and activity data from the sensors into a queryable time‑series layer. It supports (1) selective system‑initiated monitoring that triggers notifications only upon detecting meaningful deviations against personal baselines to reduce alert fatigue, and (2) user‑initiated Q&A where natural language is translated into executable database queries. By ensuring responses are grounded in precise period, comparison, and metric data, SAGE aims to enhance personalization, traceability, and trust, articulating a novel design space for evidence‑based messaging in sleep care.

PaperID: 3859, https://arxiv.org/pdf/2604.16339.pdf

Authors: Vivek Acharya

Title: Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems

Abstract:
Multi‑agent large language model (LLM) systems are rapidly emerging as the dominant architecture for enterprise AI automation, yet production deployments exhibit failure rates between 41% and 86.7%, with nearly 79% of failures originating from specification and coordination issues rather than model capability limitations. This paper identifies Semantic Intent Divergence‑‑the phenomenon whereby cooperating LLM agents develop inconsistent interpretations of shared objectives due to siloed context and absent process models‑‑as a primary yet formally unaddressed root cause of multi‑agent failure in enterprise settings. We propose the Semantic Consensus Framework (SCF), a process‑aware middleware comprising six components: a Process Context Layer for shared operational semantics, a Semantic Intent Graph for formal intent representation, a Conflict Detection Engine for real‑time identification of contradictory, contention‑based, and causally invalid intent combinations, a Consensus Resolution Protocol using a policy‑‑authority‑‑temporal hierarchy, a Drift Monitor for detecting gradual semantic divergence, and a Process‑Aware Governance Integration layer for organizational policy enforcement. Evaluation across 600 runs spanning three multi‑agent frameworks (AutoGen, CrewAI, LangGraph) and four enterprise scenarios demonstrates that SCF is the only approach to achieve 100% workflow completion‑‑compared to 25.1% for the next‑best baseline‑‑while detecting 65.2% of semantic conflicts with 27.9% precision and providing complete governance audit trails. The framework is protocol‑agnostic and compatible with MCP and A2A communication standards.

PaperID: 3860, https://arxiv.org/pdf/2604.16338.pdf

Authors: Vivek Acharya

Title: Governing the Agentic Enterprise: A Governance Maturity Model for Managing AI Agent Sprawl in Business Operations

Abstract:
The rapid adoption of agentic AI in enterprise business operations‑‑autonomous systems capable of planning, reasoning, and executing multi‑step workflows‑‑has created an urgent governance crisis. Organizations face uncontrolled agent sprawl: the proliferation of redundant, ungoverned, and conflicting AI agents across business functions. Industry surveys report that only 21% of enterprises have mature governance models for autonomous agents, while 40% of agentic AI projects are projected to fail by 2027 due to inadequate governance and risk controls. Despite growing acknowledgment of this challenge, academic literature lacks a formal, empirically validated governance maturity model connecting governance capability to measurable business outcomes. This paper introduces the Agentic AI Governance Maturity Model (AAGMM), a five‑level framework spanning 12 governance domains, grounded in NIST AI RMF and ISO/IEC 42001 standards. We additionally propose a novel taxonomy of agent sprawl patterns‑‑functional duplication, shadow agents, orphaned agents, permission creep, and unmonitored delegation chains‑‑each linked to quantifiable business cost models. The framework is validated through 750 simulation runs across five enterprise scenarios and five governance maturity levels, measuring business outcomes including cost containment, risk incident rates, operational efficiency, and decision quality. Results demonstrate statistically significant differences (p < 0.001, large effect sizes d > 2.0) between all governance maturity levels, with Level 4‑5 organizations achieving 94.3% lower sprawl indices, 96.4% fewer risk incidents, and 32.6% higher effective task completion rates compared to Level 1. The AAGMM provides practitioners with an actionable roadmap for governing autonomous AI agents while maximizing business returns.

PaperID: 3861, https://arxiv.org/pdf/2604.16337.pdf

Authors: Abriel K. Moraes, Gabriel S. M. Dias, Vitor L. Fabris, Lucas D. Gessoni, Leonardo R. do Nascimento, Charles S. Oliveira, Vitor G. C. B. de Farias, Fabiana C. Q. de O. Marucci, Matheus H. R. Vicente, Gabriel U. Talasso, Erik Soares, Amparo Munoz, Sildolfo Gomes, Maria L. A. de S. Cruvinel, Leonardo T. dos Santos, Renata De Paris, Wandemberg Gibaut

Title: HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation

Abstract:
The Consolidation of Labor Laws (CLT) serves as the primary legal framework governing labor relations in Brazil, ensuring essential protections for workers. However, its complexity creates challenges for Human Resources (HR) professionals in navigating regulations and ensuring compliance. Traditional methods for addressing labor law inquiries often lead to inefficiencies, delays, and inconsistencies. To enhance the accuracy and efficiency of legal question‑answering (Q&A), a multi‑agent system powered by Large Language Models (LLMs) is introduced. This approach employs specialized agents to address distinct aspects of employment law while integrating Retrieval‑Augmented Generation (RAG) to enhance contextual relevance. Implemented using CrewAI, the system enables cooperative agent interactions, ensuring response validation and reducing misinformation. The effectiveness of this framework is evaluated through a comparison with a baseline RAG pipeline utilizing a single LLM, using automated metrics such as BLEU, LLM‑as‑judge evaluations, and expert human assessments. Results indicate that the multi‑agent approach improves response coherence and correctness, providing a more reliable and efficient solution for HR professionals. This study contributes to AI‑driven legal assistance by demonstrating the potential of multi‑agent LLM architectures in improving labor law compliance and streamlining HR operations.

PaperID: 3862, https://arxiv.org/pdf/2604.16335.pdf

Authors: Jiawei Huang, Qingping Yang, Renjie Zheng, Jiaze Chen

Title: Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

Abstract:
Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end‑to‑end fine‑tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi‑step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric‑based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human‑designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high‑quality training data collection via trajectory filtration. When used for Reinforced Fine‑Tuning (RFT) on SWE Tasks, our approach outperforms terminal‑score‑only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.

PaperID: 3863, https://arxiv.org/pdf/2604.16333.pdf

Authors: Pegah Ahadian, Mingrui Yang, Sixu Chen, Xiaojuan Li, Qiang Guan

Title: A Discordance-Aware Multimodal Framework with Multi-Agent Clinical Reasoning

Abstract:
Knee osteoarthritis frequently exhibits discordance between structural damage observed in imaging and patient‑reported symptoms such as pain. This mismatch complicates clinical interpretation and patient stratification and remains insufficiently modeled in existing decision support systems. We propose a discordance aware multimodal framework that combines machine learning prediction models with a tool grounded multi agent reasoning system. Using baseline data from the FNIH Osteoarthritis Biomarkers Consortium, we trained multimodal models to predict two progression tasks, joint space loss only progression versus non progression, and pain only progression versus non progression. The predictive system integrates three modality specific experts: a CatBoost tabular model using demographic, radiographic, MRI‑derived scalar, and biomarker features; MRI image embeddings extracted using a ResNet18 backbone; and Xray embeddings derived from the same architecture. Expert predictions are fused using a stacking ensemble. Residual based models estimate expected pain from structural features, enabling the computation of a pain structure discordance score between observed and expected symptoms. A multi‑agent reasoning layer interprets these signals to assign clinically interpretable OA phenotypes and generate phenotype specific management recommendations.

PaperID: 3864, https://arxiv.org/pdf/2604.16331.pdf

Authors: Xiaoyu Ma, Lianyu Hu, Wenbing Tang, Zixuan Hu, Zeqin Liao, Zhizhen Wu, Yang Liu

Title: BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning

Abstract:
Embodied task planning requires agents to execute long‑horizon, goal‑directed actions in complex 3D environments, where success depends on both immediate perception and accumulated experience across tasks. However, most existing LLM‑based planners are stateless and reactive, operating without persistent memory and therefore repeating errors and struggling with spatial or temporal dependencies. We propose BrainMem(Brain‑Inspired Evolving Memory), a training‑free hierarchical memory system that equips embodied agents with working, episodic, and semantic memory inspired by human cognition. BrainMem continuously transforms interaction histories into structured knowledge graphs and distilled symbolic guidelines, enabling planners to retrieve, reason over, and adapt behaviors from past experience without any model fine‑tuning or additional training. This plug‑and‑play design integrates seamlessly with arbitrary multi‑modal LLMs and greatly reduces reliance on task‑specific prompt engineering. Extensive experiments on four representative benchmarks, including EB‑ALFRED, EB‑Navigation, EB‑Manipulation, and EB‑Habitat, demonstrate that BrainMem significantly enhances task success rates across diverse models and difficulty subsets, with the largest gains observed on long‑horizon and spatially complex tasks. These results highlight evolving memory as a promising and scalable mechanism for generalizable embodied intelligence.

PaperID: 3865, https://arxiv.org/pdf/2604.16321.pdf

Authors: Zeeshan Rasheeda, Muhammad Waseema, Kai-Kristian Kemella, Mika Saari, Pekka Abrahamsson

Title: LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review

Abstract:
Large Language Models (LLMs) have enabled multi‑agent systems to perform autonomous code generation for complex tasks. Despite the recent growth in research and industrial applications in this area, there is little work on synthesizing evidence from both academic and industrial sources to capture the current state of research on LLM‑based multi‑agent systems for code generation. To this end, we conducted a Multi‑Vocal Literature Review (MLR), combining insights from both academia and industry, including peer‑reviewed studies and grey literature. The aim of this study is to systematically synthesize and analyze existing knowledge on LLM‑based multi‑agent systems for code generation. Specifically, the review examines the motivations for their use, employed benchmarks and models, key challenges, proposed solutions, and potential directions for future research. We selected and reviewed 114 studies, and the key findings are: 1) the identified reasons for adopting multi‑agent systems for code generation were classified into nine categories; 2) the models and evaluation benchmarks utilized across the studies were systematically analyzed to provide a structured overview of commonly adopted LLM configurations and assessment practices; 3) the reported challenges and corresponding solutions were synthesized into six main categories and 26 subcategories; and 4) future research directions were identified and organized into six main categories and 18 subcategories. The results of this MLR will assist researchers and practitioners in pursuing further studies and supporting the real‑world adoption of multi‑agent systems in industrial settings.

PaperID: 3866, https://arxiv.org/pdf/2604.16309.pdf

Authors: Yu Li, Wei Ma, Zhi Chen, Ye Liu, Lingxiao Jiang, Junyi Tao, Hao Liu, Yongqiang Lyu, Qiang Hu

Title: AgentGuard: A Multi-Agent Framework for Robust Package Confusion Detection via Hybrid Search and Metadata-Content Fusion

Abstract:
The proliferation of open‑source software (OSS) has made software supply chains prime targets for attacks like Package Confusion, where adversaries publish malicious packages with names deceptively similar to legitimate ones. To protect against such attacks and safeguard the use of OSS, multiple confusion detection methods have been proposed. However, existing methods are limited to single‑signal retrieval strategies (relying solely on lexical or semantic metrics), struggle with high false positive rates (FPR), and are vulnerable to adversarial evasion. Critically, as content‑agnostic approaches, they fundamentally fail to distinguish benign packages with high naming similarity from malicious, code‑dissimilar impersonations, leading to persistent high FPR. To address these limitations, we introduce AgentGuard, a novel multi‑agents based framework for package confusion detection. Specifically, it first discovers potential confusion targets using fine‑tuned word embedding models with hybrid similarity search. After that, It subsequently evaluates risk via a fused machine learning model that uniquely combines: (1) a multi‑dimensional metadata group and (2) a novel package content analysis group, to reduce the FPR and mitigate the impact of adversarial evasion. To assess the effectiveness of AgentGuard, we evaluate it on challenging ConfuDB and NeupaneDB datasets. Our results demonstrate that AgentGuard significantly outperforms state‑of‑the‑art baselines, ConfuGuard and Typomind, improving precision by 12%‑49% while simultaneously reducing the FPR by 11%‑35%, and effectively discovers the confused package.

PaperID: 3867, https://arxiv.org/pdf/2604.16205.pdf

Authors: Vitor F. Grizzi, Thang Duc Pham, Luke N. Pretzie, Jiayi Xu, Murat Keceli, Cong Liu

Title: ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis

Abstract:
Computational X‑ray absorption near‑edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic structure in chemically complex systems. However, the use of computational XANES at scale is constrained more by workflow complexity than by the underlying simulation method itself. To address this challenge, we present ChemGraph‑XANES, an agentic framework for automated XANES simulation and analysis that unifies natural‑language task specification, structure acquisition, FDMNES input generation, task‑parallel execution, spectral normalization, and provenance‑aware data curation. Built on ASE, FDMNES, Parsl, and a LangGraph/LangChain‑based tool interface, the framework exposes XANES workflow operations as typed Python tools that can be orchestrated by large language model (LLM) agents. In multi‑agent mode, a retrieval‑augmented expert agent consults the FDMNES manual to ground parameter selection, while executor agents translate user requests into structured tool calls. We demonstrate documentation‑grounded parameter retrieval and show that the same workflow supports both explicit structure‑file inputs and chemistry‑level natural‑language requests. Because independent XANES calculations are naturally task‑parallel, the framework is well suited for high‑throughput deployment on high‑performance computing (HPC) systems, enabling scalable XANES database generation for downstream analysis and machine‑learning applications. ChemGraph‑XANES thus provides a reproducible and extensible workflow layer for physics‑based XANES simulation, spectral curation, and agent‑compatible computational spectroscopy.

PaperID: 3868, https://arxiv.org/pdf/2604.16175.pdf

Authors: Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Title: MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

Abstract:
Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision‑Language Models (VLMs) have advanced the field, they typically operate as monolithic "black‑box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi‑Agent Radiology Clinical Hierarchy), a multi‑agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi‑scale CT feature extraction, multiple Fellow Agents for retrieval‑augmented revision, and an Attending Agent that orchestrates an iterative, stance‑based consensus discourse to resolve diagnostic discrepancies. On the RadGenome‑ChestCT dataset, MARCH significantly outperforms state‑of‑the‑art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human‑like organizational structures enhances the reliability of AI in high‑stakes medical domains.

PaperID: 3869, https://arxiv.org/pdf/2604.16143.pdf

Authors: Keyvan Aghababaiyan, Baldomero Coll-Perales, Luca Lusvarghi, Javier Gozalvez

Title: Deterministic Task Scheduling in In-Vehicle Networks for Software-Defined Vehicles

Abstract:
Modern vehicles are embedding increasing levels of automation, connectivity, and intelligence, which require advanced in‑vehicle networks and computational platforms to support the dependability and deterministic requirements of critical in‑vehicle functions. To this end, the automotive industry is shifting towards software‑defined vehicles (SDVs) and zonal E/E architectures with centralized computing nodes. Realizing the full potential of these new architectures requires an efficient management of the in‑vehicles computational workload. In this context, this paper introduces a deterministic task scheduling approach for in‑vehicle networks (IVN), and demonstrates that it can better guarantee deterministic service levels than alternative approaches based on the shortest path or the objective to minimize task execution time. Our evaluation also demonstrates that a deterministic task scheduling can satisfactorily support increasing in‑vehicle computational workloads and tasks, and achieve a more balanced workload and resource utilization across the IVN. These gains are validated across a variety of IVN topologies, and in hybrid wireless‑wired IVN implementations, where a gradual introduction of wireless offers increased in‑vehicle connectivity diversity.

PaperID: 3870, https://arxiv.org/pdf/2604.16081.pdf

Authors: Aswini Misro, Vikash Sharma, Shreyank N Gowda

Title: Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring

Abstract:
We present Veritas‑RPM, a provenance‑guided multi‑agent architecture comprising five processing layers: VeritasAgent (ground‑truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98‑case synthetic taxonomy of false‑positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground‑truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).

PaperID: 3871, https://arxiv.org/pdf/2604.16069.pdf

Authors: Katharina Stich, Bastian Perner, Friedemann Laue, Torsten Reissland, Norman Franchi

Title: Convergence Time Distributions for Max-Consensus over Unreliable Networks

Abstract:
This paper proposes the LiFE‑CD algorithm for convergence time analysis of the max‑consensus algorithm in multi‑agent systems under Bernoulli‑distributed link failures. Unlike existing approaches, which either assume ideal communication or provide asymptotic upper bounds on the expected convergence time, LiFE‑CD deterministically computes the full probability distribution of the convergence time from network topology and individual link failure probabilities, without simulation. The full probability distribution enables deadline‑aware protocol design with specified reliability guarantees. Based on geometrically distributed link delays, the proposed algorithm iteratively reduces the given network topology considering both unicast and broadcast transmissions. LiFE‑CD yields exact results for acyclic networks and, for cyclic networks, tight upper bounds on the convergence time via shortest‑path spanning tree construction. Numerical results confirm analytical exactness for acyclic networks, validate tightness for cyclic networks, and demonstrate improvement over existing approaches. Our complexity analysis shows reduced computational cost compared to Monte Carlo simulations, while eliminating stochastic variability and enhancing reproducibility. All results extend directly to min‑consensus by structural equivalence.

PaperID: 3872, https://arxiv.org/pdf/2604.16024.pdf

Authors: Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren, Yiru Wang, Rongliang Fu, Tinghuan Chen, Tsung-Yi Ho

Title: AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis

Abstract:
Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem‑solving capabilities. However, astronomical imaging, a quite complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Due to the complexity of the astronomical imaging process, both world‑class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort. This is because the processes in astronomical imaging have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi‑agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real‑world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi‑process tasks.

PaperID: 3873, https://arxiv.org/pdf/2604.16022.pdf

Authors: Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski, Kristian Kersting

Title: SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Abstract:
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi‑agent settings becomes critical. We introduce SocialGrid, an embodied multi‑agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT‑OSS‑120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near‑random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine‑grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.

PaperID: 3874, https://arxiv.org/pdf/2604.15972.pdf

Authors: Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li, Yuanfang Guo, Wei Dong, Yang Yang

Title: Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

Abstract:
LLM‑driven multi‑agent frameworks address complex reasoning tasks through multi‑role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high‑capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance‑limiting agents receive less attention. To address this gap, we propose WORC, a \underlineweak‑link \underlineoptimization framework for multi‑agent \underlinereasoning and \underlinecollaboration, grounded in the weak‑link principle. WORC follows a two‑stage workflow. In the weak agent localization stage, task features are constructed, and a meta‑learning‑based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero‑shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak‑link optimization stage, an uncertainty‑driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated‑sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross‑architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi‑agent systems.

PaperID: 3875, https://arxiv.org/pdf/2604.15877.pdf

Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Title: Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Abstract:
As LLM agents scale to long‑horizon, multi‑session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge ‑‑ extracting reusable knowledge from interaction traces ‑‑ yet a citation analysis of 1,136 references across 22 primary papers reveals a cross‑community citation rate below 1%. We propose the \emphExperience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5‑‑20× for episodic memory, 50‑‑500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level ‑‑ none supports adaptive cross‑level compression, a gap we term the \emphmissing diagonal. We further show that specialization alone is insufficient ‑‑ both communities independently solve shared sub‑problems without exchanging solutions ‑‑ that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full‑spectrum agent learning systems.

PaperID: 3876, https://arxiv.org/pdf/2604.15842.pdf

Authors: Tanja Baeumel, Josef van Genabith, Simon Ostermann

Title: Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Abstract:
Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning‑intensive tasks remain underexplored. To advance the understanding of model‑internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next‑token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

PaperID: 3877, https://arxiv.org/pdf/2604.15840.pdf

Authors: Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang, Xiangxiang Chu

Title: CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

Abstract:
Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent‑data mutual evolution framework that enables LLM agents to improve through closed‑loop, interaction‑driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure‑prone interaction patterns, and utilizes them to guide LLM‑based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5‑7B, Qwen3‑4B, and Qwen3‑30B‑A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

PaperID: 3878, https://arxiv.org/pdf/2604.15774.pdf

Authors: Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

Title: MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

Abstract:
Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long‑horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA‑style tasks across 7 domains and 36 risk types, complemented by workflow‑style tasks adapted from 20 Agent‑SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi‑round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt‑based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

PaperID: 3879, https://arxiv.org/pdf/2604.15762.pdf

Authors: Huan Lin, Lianghui Ding

Title: Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions

Abstract:
Large‑scale Unmanned Aerial Vehicle (UAV) failures can split an unmanned aerial vehicle swarm network into disconnected sub‑networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication‑heavy after severe fragmentation. Decentralized heuristics and multi‑agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present Physics‑informed Graph Adversarial Imitation Learning algorithm (PhyGAIL) that adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses physics‑informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale‑invariant. It also uses scenario‑adaptive imitation learning to improve training under fragmented topologies and variable‑length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20‑UAV swarms transfers directly to swarms of up to 500 UAVs without fine‑tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.

PaperID: 3880, https://arxiv.org/pdf/2604.15713.pdf

Authors: Kevin Kappelmann, Maximilian Schäffeler, Lukas Stevens, Mohammad Abdulaziz, Andrei Popescu, Dmitriy Traytel

Title: Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints

Abstract:
Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the problem of complete and minimal type annotations for rank‑one polymorphic λ‑calculus terms, as used in Isabelle. Building on prior work by Smolka, Blanchette et al., we give a metatheoretical account of the problem, with a full formal specification and proofs, and formalize it in Isabelle/HOL. Our development is a series of experiments featuring human‑driven and AI‑driven formalization workflows: a human and an LLM‑powered AI agent independently produce pen‑and‑paper proofs, and the AI agent autoformalizes both in Isabelle, with further human‑hinted AI interventions refining and generalizing the development.

PaperID: 3881, https://arxiv.org/pdf/2604.15709.pdf

Authors: Chenyi Huang, Haoting Zhang, Jingxu Xu, Zeyu Zheng, Yunduan Lin

Title: Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

Abstract:
Agent \textttskills are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of \textttskills can materially affect agent task performance, yet systematically optimizing \textttskills remains challenging. Since a \textttskill comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. We therefore represent these two coupled decisions as \textttskill structure and component content, and formulate \textttskill optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the \textttskill structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open‑source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized \textttskill.

PaperID: 3882, https://arxiv.org/pdf/2604.15695.pdf

Authors: Deep Kumar Ganguly, Chandradithya S Jonnalagadda, Pratham Chintamani, Adithya Ananth

Title: The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning

Abstract:
Cooperative equilibria are fragile. When agents learn alongside each other rather than in a fixed environment, the process of learning destabilizes the cooperation they are trying to sustain: every gradient step an agent takes shifts the distribution of actions its partner will play, turning a cooperative partner into a source of stochastic noise precisely where the cooperation decision is most sensitive. We study how this co‑learning noise propagates through the structure of coordination games, and find that the cooperative equilibrium, even when strongly Pareto‑dominant, is exponentially unstable under standard risk‑neutral learning, collapsing irreversibly once partner noise crosses the game's critical cooperation threshold. The natural response to apply distributional robustness to hedge against partner uncertainty makes things strictly worse: risk‑averse return objectives penalize the high‑variance cooperative action relative to defection, widening the instability region rather than shrinking it, a paradox that reveals a fundamental mismatch between the domains where robustness is applied and instability originates. We resolve this by showing that robustness should target the policy gradient update variance induced by partner uncertainty, not the return distribution. This distinction yields an algorithm whose gradient updates are modulated by an online measure of partner unpredictability, provably expanding the cooperation basin in symmetric coordination games. To unify stability, sample complexity, and welfare consequences of this approach, we introduce the Price of Paranoia as the structural dual of the Price of Anarchy. Together with a novel Cooperation Window, it precisely characterizes how much welfare learning algorithms can recover under partner noise, pinning down the optimal degree of robustness as a closed‑form balance between equilibrium stability and sample efficiency.

PaperID: 3883, https://arxiv.org/pdf/2604.15687.pdf

Authors: Yuta Konishi, Kento Yamamoto, Eisuke Sonomoto, Rikuho Takeda, Ryo Furukawa, Yusuke Muraki, Takafumi Shimizu, Kazuma Fukumura, Yuya Kanemoto, Takayuki Ito, Shiyao Ding

Title: Preference Estimation via Opponent Modeling in Multi-Agent Negotiation

Abstract:
Automated negotiation in complex, multi‑party and multi‑issue settings critically depends on accurate opponent modeling. However, conventional numerical‑only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent modeling. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi‑party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.

PaperID: 3884, https://arxiv.org/pdf/2604.15671.pdf

Authors: Xu Huang, Weixin Mao, Yinhao Li, Hua Chen, Jiabao Zhao

Title: Long-Term Memory for VLA-based Agents in Open-World Task Execution

Abstract:
Vision‑Language‑Action (VLA) models have demonstrated significant potential for embodied decision‑making; however, their application in complex chemical laboratory automation remains restricted by limited long‑horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial‑and‑error in multi‑stage protocols. In this paper, we propose ChemBot, a dual‑layer, closed‑loop framework that integrates an autonomous AI agent with a progress‑aware VLA model (Skill‑VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual‑layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub‑agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future‑state‑based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and task success rates compared to existing VLA baselines in complex, long‑horizon chemical experimentation.

PaperID: 3885, https://arxiv.org/pdf/2604.15559.pdf

Authors: Jacob Dang, Brian Y. Xie, Omar G. Younis

Title: Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Abstract:
Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file‑system actions via an API‑style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission‑related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod‑first rate reaches 30%‑55% (versus a 0%‑10% baseline), with the strongest transfer observed in large‑to‑small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.

PaperID: 3886, https://arxiv.org/pdf/2604.15558.pdf

Authors: Saad Alqithami

Title: Preregistered Belief Revision Contracts

Abstract:
Deliberative multi‑agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high‑confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol‑level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first‑order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non‑fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social‑only rounds cannot increase confidence and cannot generate purely conformity‑driven wrong‑but‑sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token‑invariant contracts, (d) we prove that enforced trajectories depend only on token‑exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness‑liveness trade‑offs.

PaperID: 3887, https://arxiv.org/pdf/2604.15505.pdf

Authors: Jihye Choi, Jinsung Yoon, Long T. Le, Somesh Jha, Tomas Pfister

Title: PolicyBank: Evolving Policy Understanding for LLM Agents

Abstract:
LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre‑deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool‑level policy insights and iteratively refines them ‑‑ unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing "compliant but wrong" behaviors. We also contribute a systematic testbed by extending a popular tool‑calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near‑zero success on policy‑gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.

PaperID: 3888, https://arxiv.org/pdf/2604.15484.pdf

Authors: Jayson Steffens

Title: vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

Abstract:
We present vstash, a local‑first document memory system that combines vector similarity search with full‑text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per‑query IDF weighting. All data resides in a single SQLite file using sqlite‑vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. (1) Self‑supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top‑10 disagreement between vector‑heavy (vec=0.95, fts=0.05) and FTS‑heavy (vec=0.05, fts=0.95) search (per‑dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine‑tuning BGE‑small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE‑small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M‑parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE‑base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). (2) Adaptive RRF with per‑query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE‑small. (3) A negative result on post‑RRF scoring: frequency+decay, history‑augmented recall, and cross‑encoder reranking all failed to improve NDCG. (4) A production‑grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance‑based relevance signal validated on 50,425 relevance‑judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine‑tuned model is published as `Stffens/bge‑small‑rrf‑v2` on HuggingFace. All code, data, and experiments are open‑source.

PaperID: 3889, https://arxiv.org/pdf/2604.15388.pdf

Authors: Mu-Chi Chen, Po-Hsuan Huang, Yu-Hung Kao, Yen-Fu Liu, Yu-Kai Hung, Cheng Liang, Shao-Chun Ho, Chia-Heng Tu, Shih-Hao Hung

Title: Exploring LLM-based Verilog Code Generation with Data-Efficient Fine-Tuning and Testbench Automation

Abstract:
Recent advances in large language models have improved code generation, but their use in hardware description languages is still limited. Moreover, training data and testbenches for these models are often scarce. This paper presents a workflow that uses multi‑agent models to generate testbenches for high‑quality fine‑tuning data. By automating testbench creation, the fine‑tuned model for the specification‑to‑Verilog task achieves performance comparable to state‑of‑the‑art methods on the refined VerilogEval v2 benchmark while using less training data. This study provides a basis for future work on LLM‑based HDL generation and automated verification.

PaperID: 3890, https://arxiv.org/pdf/2604.15367.pdf

Authors: Qian'ang Mao, Jiaxin Wang, Ya Liu, Li Zhu, Cong Ma, Jiaqi Yan

Title: SoK: Security of Autonomous LLM Agents in Agentic Commerce

Abstract:
Autonomous large language model (LLM) agents such as OpenClaw are pushing agentic commerce from human‑supervised assistance toward machine actors that can negotiate, purchase services, manage digital assets, and execute transactions across on‑chain and off‑chain environments. Protocols such as the Trustless Agents standard (ERC‑8004), Agent Payments Protocol (AP2), OKX Agent Payments Protocol (APP), the HTTP 402‑based payment protocol (x402), Agent Commerce Protocol (ACP), the Agentic Commerce standard (ERC‑8183), and Machine Payments Protocol (MPP) enable this transition, but they also create an attack surface that existing security frameworks do not capture well. This Systematization of Knowledge (SoK) develops a unified security framework for autonomous LLM agents in commerce and finance. We organize threats along five dimensions: agent integrity, transaction authorization, inter‑agent trust, market manipulation, and regulatory compliance. From a systematically curated public corpus of academic papers, protocol documents, industry reports, and incident evidence, we derive 12 cross‑layer attack vectors and show how failures propagate from reasoning and tooling layers into custody, settlement, market harm, and compliance exposure. We then propose a layered defense architecture addressing authorization gaps left by current agent‑payment protocols. Overall, our analysis shows that securing agentic commerce is inherently a cross‑layer problem that requires coordinated controls across LLM safety, protocol design, identity, market structure, and regulation. We conclude with a research roadmap and a benchmark agenda for secure autonomous commerce.

PaperID: 3891, https://arxiv.org/pdf/2604.15319.pdf

Authors: Burak Susam, Tingting Mu

Title: Explainable Iterative Data Visualisation Refinement via an LLM Agent

Abstract:
Exploratory analysis of high‑dimensional data relies on embedding the data into a low‑dimensional space (typically 2D or 3D), based on which visualization plot is produced to uncover meaningful structures and to communicate geometric and distributional data characteristics. However, finding a suitable algorithm configuration, particularly hyperparameter setting, to produce a visualization plot that faithfully represents the underlying reality and encourages pattern discovery remains challenging. To address this challenge, we propose an agentic AI pipleline that leverages a large language model (LLM) to bridge the gap between rigorous quantitative assessment and qualitative human insight. By treating visualization evaluation and hyperparameter optimization as a semantic task, our system generates a multi‑faceted report that contextualizes hard metrics with descriptive summaries, and suggests actionable recommendation of algorithm configuration for refining data visualization. By implementing an iterative optimization loop of this process, the system is able to produce rapidly a high‑quality visualization plot, in full automation.

PaperID: 3892, https://arxiv.org/pdf/2604.15267.pdf

Authors: Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin

Title: CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Abstract:
It is increasingly important that LLM agents interact effectively and safely with other goal‑pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed‑motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models ‑‑ with or without reasoning enabled ‑‑ consistently defect in single‑shot social dilemmas. To tackle this safety concern, we present the first comparative study of game‑theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third‑party mediators to delegate decision making to, and (4) contract agreements for outcome‑conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition‑induced cooperation deteriorates drastically when co‑players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.

PaperID: 3893, https://arxiv.org/pdf/2604.15244.pdf

Authors: Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal

Title: From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

Abstract:
Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token‑centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification‑aware speculative decoding framework that performs step‑level verification using only model‑internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model‑internal signals: (i) an attention‑based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log‑probability‑based score that captures token‑level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward‑guided SD.

PaperID: 3894, https://arxiv.org/pdf/2604.15231.pdf

Authors: Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Title: RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Abstract:
Vision‑language models (VLM) have markedly advanced AI‑driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool‑using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT‑Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro‑F1 and 5.4 points (19.6% relative) in micro‑F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool‑augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.

PaperID: 3895, https://arxiv.org/pdf/2604.15163.pdf

Authors: Boyan Li, Ou Ocean Kun Hei, Yue Yu, Yuyu Luo

Title: DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency

Abstract:
While Large Language Models (LLMs) demonstrate impressive proficiency in generating SQL queries, they fundamentally lack the capability to self‑evaluate correctness without an execution oracle. This limitation creates a stark Generation‑Selection Gap, where high potential accuracy (Pass@K) fails to translate into execution accuracy (Pass@1). Although supervised verifiers offer mitigation, they incur prohibitive annotation costs and suffer from domain fragility. Consequently, recent research has pivoted to the training‑free setting. However, existing methods‑‑such as Self‑Consistency or LLM‑as‑a‑Judge‑‑remain hampered by systematic bias (consensus on hallucinations) and symbolic blindness (inability to simulate execution states). We introduce DPC (Dual‑Paradigm Consistency), a multi‑agent framework that reformulates SQL selection from a probabilistic guessing task on hidden data into a deterministic verification task on visible data. Specifically, DPC employs a SLICER and a TESTER agent to collaboratively construct a Minimal Distinguishing Database (MDD)‑‑an adversarial, fully observable micro‑environment engineered to expose logical discrepancies between candidates. To break the self‑correction bias, a SOLVER agent then verifies the SQL candidates by cross‑referencing their execution against a parallel Python/Pandas solution. By validating execution consistency between declarative (SQL) and imperative (Python) paradigms, DPC robustly discriminates correct logic from systematic hallucinations. Experiments on BIRD and Spider across multiple LLMs demonstrate that our method consistently outperforms existing selection baselines, achieving absolute accuracy improvements of up to 2.2% over strong competitors like Self‑Consistency.

PaperID: 3896, https://arxiv.org/pdf/2604.15082.pdf

Authors: Cunxi Yu, Haoxing Ren

Title: Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

Abstract:
This paper introduces the first \emphself‑evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of \textscABC, the widely adopted logic synthesis system. Our framework operates on the \emphentire integrated ABC codebase, and the output repository preserves its single‑binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open‑source synthesis components, covering flow tuning, logic minimization, and technology mapping, but without manually injecting new heuristics. On top of this foundation, a team of LLM‑based agents iteratively rewrites and evolves specific sub‑components of ABC following our ``programming guidance`` prompts under a unified correctness and QoR‑driven evaluation loop. Each evolution cycle proposes code modifications, compiles the integrated binary, validates correctness, and evaluates quality‑of‑results (QoR) on \emphmulti‑suite benchmarks including ISCAS~85/89/99, VTR, EPFL, and IWLS~2005. Through continuous feedback, the system discovers optimizations beyond human‑designed heuristics, effectively \emphlearning new synthesis strategies that enhance QoR. We detail the architecture of this self‑improving system, its integration with \textscABC, and results demonstrating that the framework can autonomously and progressively improve EDA tool at full million‑line scale.

PaperID: 3897, https://arxiv.org/pdf/2604.15078.pdf

Authors: Simeon Allmendinger, Luca Deck, Lucas Mueller

Title: Where are the Humans? A Scoping Review of Fairness in Multi-agent AI Systems

Abstract:
Rapid advances in Generative AI are giving rise to increasingly sophisticated Multi‑Agent AI (MAAI) systems. While AI fairness has been extensively studied in traditional predictive scenarios, its examination in MAAI remains nascent and fragmented. This scoping review critically synthesizes existing research on fairness in MAAI systems. Through a qualitative content analysis of 23 selected studies, we identify five archetypal approaches. Our findings reveal that fairness in MAAI systems is often addressed superficially, lacks robust normative foundations, and frequently overlooks the complex dynamics introduced by agent autonomy and system‑level interactions. We argue that fairness must be embedded structurally throughout the development lifecycle of MAAI, rather than appended as a post‑hoc consideration. Meaningful evaluation requires explicit human oversight, normative clarity, and a precise articulation of fairness objectives and beneficiaries. This review provides a foundation for advancing fairness research in MAAI systems by highlighting critical gaps, exposing prevailing limitations, and suggesting pathways.

PaperID: 3898, https://arxiv.org/pdf/2604.15044.pdf

Authors: Chase McDonald, Cleotilde Gonzalez

Title: CoGrid & the Multi-User Gymnasium: A Framework for Multi-Agent Experimentation

Abstract:
The increasing integration of artificial intelligence (AI) in everyday life brings with it new challenges and questions for regarding how humans interact with autonomous agents. Multi‑agent experiments, where humans and AI act together, can offer important opportunities to study social decision making, but there is a lack of accessible tooling available to researchers to run such experiments. We introduce two tools designed to reduce these barriers. The first, CoGrid, is a multi‑agent grid‑based simulation library with dual NumPy and JAX backends. The second, Multi‑User Gymnasium (MUG), translates such simulation environments directly into interactive web‑based experiments. MUG supports interactions with arbitrary numbers of humans and AI, utilizing either server‑authoritative or peer‑to‑peer networking with rollback netcode to account for latency. Together, these tools can enable researchers to deploy studies of human‑AI interaction, facilitating inquiry into core questions of psychology, cognition, and decision making and their relationship to human‑AI interaction. Both tools are open source and available to the broader research community. Documentation and source code is available at cogrid, multi‑user‑gymnasium.readthedocs.io. This paper details the functionality of these tools and presents several case studies to illustrate their utility in human‑AI multi‑agent experimentation.

PaperID: 3899, https://arxiv.org/pdf/2604.15037.pdf

Authors: Ke Xu, Yuhao Wang, Yu Wang

Title: From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Abstract:
Recent advancements in LLM agents are gradually shifting from reactive, text‑based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice‑Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi‑stage data synthesis pipeline, we curate 1,182 high‑quality samples for rigorous testing. Our evaluation of state‑of‑the‑art Multimodal LLMs reveals a significant performance gap, particularly regarding over‑triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context‑aware proactive agents.

PaperID: 3900, https://arxiv.org/pdf/2604.15001.pdf

Authors: Heng Ping, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Shukai Duan, Xiaole Zhang, Paul Bogdan

Title: COEVO: Co-Evolutionary Framework for Joint Functional Correctness and PPA Optimization in LLM-Based RTL Generation

Abstract:
LLM‑based RTL code generation methods increasingly target both functional correctness and PPA quality, yet existing approaches universally decouple the two objectives, optimizing PPA only after correctness is fully achieved. Whether through sequential multi‑agent pipelines, evolutionary search with binary correctness gates, or hierarchical reward dependencies, partially correct but architecturally promising candidates are systematically discarded. Moreover, existing methods reduce the multi‑objective PPA space to a single scalar fitness, obscuring the trade‑offs among area, delay, and power. To address these limitations, we propose COEVO, a co‑evolutionary framework that unifies correctness and PPA optimization within a single evolutionary loop. COEVO formulates correctness as a continuous co‑optimization dimension alongside area, delay, and power, enabled by an enhanced testbench that provides fine‑grained scoring and detailed diagnostic feedback. An adaptive correctness gate with annealing allows PPA‑promising but partially correct candidates to guide the search toward jointly optimal solutions. To preserve the full PPA trade‑off structure, COEVO employs four‑dimensional Pareto‑based non‑dominated sorting with configurable intra‑level sorting, replacing scalar fitness without manual weight tuning. Evaluated on VerilogEval 2.0 and RTLLM 2.0, COEVO achieves 97.5% and 94.5% Pass@1 with GPT‑5.4‑mini, surpassing all agentic baselines across four LLM backbones, while attaining the best PPA on 43 out of 49 synthesizable RTLLM designs.

PaperID: 3901, https://arxiv.org/pdf/2604.14989.pdf

Authors: Wenji Fang, Yao Lu, Shang Liu, Jing Wang, Ziyan Guo, Junxian He, Fengbin Tu, Zhiyao Xie

Title: Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small‑scale RTL designs and rely on weak open‑source tools. Their optimization methods are also limited, relying on coarse design‑level feedback and simple pre‑defined rewriting rules. To address these limitations, we present Dr. RTL, an agentic framework for RTL timing optimization in a realistic evaluation environment, with continual self‑improvement through reusable optimization skills. We establish a realistic evaluation setting with more challenging RTL designs and an industrial EDA workflow. Within this setting, Dr. RTL performs closed‑loop optimization through a multi‑agent framework for critical‑path analysis, parallel RTL rewriting, and tool‑based evaluation. We further introduce group‑relative skill learning, which compares parallel RTL rewrites and distills the optimization experience into an interpretable skill library. Currently, this library contains 47 pattern‑‑strategy entries for cross‑design reuse to improve PPA and accelerate convergence, and it can continue evolving over time. Evaluated on 20 real‑world RTL designs, Dr. RTL achieves average WNS/TNS improvements of 21%/17% with a 6% area reduction over the industry‑leading commercial synthesis tool.

PaperID: 3902, https://arxiv.org/pdf/2604.14969.pdf

Authors: Andrew Dai, Boris Meinardus, Ciaran Regan, Yingtao Tian, Yujin Tang

Title: Discovering Novel LLM Experts via Task-Capability Coevolution

Abstract:
Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre‑training and post‑training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open‑endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open‑ended Assessment Coevolving with Diverse Capabilities (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without any explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi‑agent best‑of‑N selection. Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs. Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.

PaperID: 3903, https://arxiv.org/pdf/2604.14877.pdf

Authors: Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang

Title: Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

Abstract:
Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re‑sampling cannot recover. We introduce PASS@(k,T), a two‑dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static‑reasoning result, tool‑use RL genuinely enlarges the capability boundary: the RL agent's pass‑curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine‑tuning regresses the boundary on the same compositional tasks, isolating self‑directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.

PaperID: 3904, https://arxiv.org/pdf/2604.14799.pdf

Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

Title: Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Abstract:
Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision‑language models (VLMs) and multi‑agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text‑only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM‑AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open‑source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy‑abstention trade‑off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention‑aware training rather than better prompting or more agents.

PaperID: 3905, https://arxiv.org/pdf/2604.14709.pdf

Authors: Fan Cui, Hongyuan Hou, Zizhang Luo, Chenyun Yin, Yun Liang

Title: HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

Abstract:
Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component‑level tasks such as generating HDL modules from specifications, leaving repository‑scale evaluation unaddressed. We introduce HWE‑Bench, the first large‑scale, repository‑level benchmark for evaluating LLM agents on real‑world hardware bug repair tasks. HWE‑Bench comprises 417 task instances derived from real historical bug‑fix pull requests across six major open‑source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC‑V cores, SoCs, and security roots‑of‑trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC‑level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug‑type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware‑semantic reasoning, and cross‑artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware‑aware agents.

PaperID: 3906, https://arxiv.org/pdf/2604.14691.pdf

Authors: Xiangning Yu, Yuwei Guo, Yuqi Hou, Xiao Xue, Qun Ma

Title: CAMO: An Agentic Framework for Automated Causal Discovery from Micro Behaviors to Macro Emergence in LLM Agent Simulations

Abstract:
LLM‑empowered agent simulations are increasingly used to study social emergence, yet the micro‑to‑macro causal mechanisms behind macro outcomes often remain unclear. This is challenging because emergence arises from intertwined agent interactions and meso‑level feedback and nonlinearity, making generative mechanisms hard to disentangle. To this end, we introduce \textscCAMO, an automated Causal discovery framework from Micro behaviors to Macro Emergence in LLM agent simulations. \textscCAMO converts mechanistic hypotheses into computable factors grounded in simulation records and learns a compact causal representation centered on an emergent target Y. \textscCAMO outputs a computable Markov boundary and a minimal upstream explanatory subgraph, yielding interpretable causal chains and actionable intervention levers. It also uses simulator‑internal counterfactual probing to orient ambiguous edges and revise hypotheses when evidence contradicts the current view. Experiments across four emergent settings demonstrate the promise of \textscCAMO.

PaperID: 3907, https://arxiv.org/pdf/2604.14687.pdf

Authors: Yiyu Qian, Liyuan Zhao, Tim Miller

Title: M2-PALE: A Framework for Explaining Multi-Agent MCTS--Minimax Hybrids via Process Mining and LLMs

Abstract:
Monte‑Carlo Tree Search (MCTS) is a fundamental sampling‑based search algorithm widely used for online planning in sequential decision‑making domains. Despite its success in driving recent advances in artificial intelligence, understanding the behavior of MCTS agents remains a challenge for both developers and users. This difficulty stems from the complex search trees produced through the simulation of numerous future states and their intricate relationships. A known weakness of standard MCTS is its reliance on highly selective tree construction, which may lead to the omission of crucial moves and a vulnerability to tactical traps. To resolve this, we incorporate shallow, full‑width Minimax search into the rollout phase of multi‑agent MCTS to enhance strategic depth. Furthermore, to demystify the resulting decision‑making logic, we introduce \textsfM2‑PALE (MCTS‑‑Minimax Process‑Aided Linguistic Explanations). This framework employs process mining techniques, specifically the Alpha Miner, iDHM, and Inductive Miner algorithms, to extract underlying behavioral workflows from agent execution traces. These process models are then synthesized by LLMs to generate human‑readable causal and distal explanations. We demonstrate the efficacy of our approach in a small‑scale checkers environment, establishing a scalable foundation for interpreting hybrid agents in increasingly complex strategic domains.

PaperID: 3908, https://arxiv.org/pdf/2604.14683.pdf

Authors: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu

Title: DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Abstract:
Deep Research Agents (DRAs) aim to solve complex, long‑horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^3‑Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi‑file report generation. DR^3‑Eval is constructed from authentic user‑provided materials and paired with a per‑task static research sandbox corpus that simulates open‑web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi‑dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi‑agent system DR^3‑Agent based on multiple state‑of‑the‑art language models demonstrate that DR^3‑Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

PaperID: 3909, https://arxiv.org/pdf/2604.14661.pdf

Authors: Jianhao Su, Zhanwei Wu, ShengTing Huang, Weidong Feng

Title: AIPC: Agent-Based Automation for AI Model Deployment with Qualcomm AI Runtime

Abstract:
Edge AI model deployment is a multi‑stage engineering process involving model conversion, operator compatibility handling, quantization calibration, runtime integration, and accuracy validation. In practice, this workflow is long, failure‑prone, and heavily dependent on deployment expertise, particularly when targeting hardware‑specific inference runtimes. This technical report presents AIPC (AI Porting Conversion), an AI agent‑driven approach for constrained automation of AI model deployment. AIPC decomposes deployment into standardized, verifiable stages and injects deployment‑domain knowledge into agent execution through Agent Skills, helper scripts, and a stage‑wise validation loop. This design reduces both the expertise barrier and the engineering time required for hardware deployment. Using Qualcomm AI Runtime (QAIRT) as the primary scenario, this report examines automated deployment across representative vision, multimodal, and speech models. In the cases covered here, AIPC can complete deployment from PyTorch to runnable QNN/SNPE inference within 7‑20 minutes for structurally regular vision models, with indicative API costs roughly in the range of USD 0.7‑10. For more complex models involving less‑supported operators, dynamic shapes, or autoregressive decoding structures, fully automated deployment may still require further advances, but AIPC already provides practical support for execution, failure localization, and bounded repair.

PaperID: 3910, https://arxiv.org/pdf/2604.14615.pdf

Authors: Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park, A. Ali Heydari, Ahmed A. Metwally, Hong Yu, Xin Liu, Xuhai Xu, Yuzhe Yang, Maxwell A. Xu, Zhihan Zhang, Cynthia Breazeal, Tim Althoff, Petar Sirkovic, Ivor Rendulic, Annalisa Pawlosky, Nicolas Stroppa, Juraj Gottweis, Elahe Vedadi, Alan Karthikesalingam, Pushmeet Kohli, Vivek Natarajan, Mark Malhotra, Shwetak Patel, Hae Won Park, Hamid Palangi, Daniel McDuff

Title: CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

Abstract:
Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co‑Data‑Scientist), a multi‑agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature‑grounded reasoning with human oversight using large‑scale wearable datasets. Across three cohorts totaling 9,279 participant‑observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability‑related features in both datasets, reflected in sleep duration variability (DWB, ρ= 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ= 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ= ‑0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ= ‑0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS‑derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross‑validated ΔR^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large‑scale wearable data.

PaperID: 3911, https://arxiv.org/pdf/2604.14609.pdf

Authors: Zijian Zhang, Aiwei Yin, Amaan Baweja, Jiaru Bai, Ignacio Gustin, Varinia Bernales, Alán Aspuru-Guzik

Title: El Agente Forjador: Task-Driven Agent Generation for Quantum Simulation

Abstract:
AI for science promises to accelerate the discovery process. The advent of large language models (LLMs) and agentic workflows enables the expediting of a growing range of scientific tasks. However, most of the current generation of agentic systems depend on static, hand‑curated toolsets that hinder adaptation to new domains and evolving libraries. We present El Agente Forjador, a multi‑agent framework in which universal coding agents autonomously forge, validate, and reuse computational tools through a four‑stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation. Evaluated across 24 tasks spanning quantum chemistry and quantum dynamics on five coding agent setups, we compare three operating modes: zero‑shot generation of tools per task, reuse of a curriculum‑built toolset, and direct problem‑solving with the coding agents as the baseline. We find that our tool generation and reuse framework consistently improves accuracy over the baseline. We also show that reusing a toolset built by a stronger coding agent can reduce API cost and substantially raises the solution quality for weaker coding agents. Case studies further demonstrate that tools forged for different domains can be combined to solve hybrid tasks. Taken together, these results show that LLM‑based agents can use their scientific knowledge and coding capabilities to autonomously build reusable scientific tools, pointing toward a paradigm in which agent capabilities are defined by the tasks they are designed to solve rather than by explicitly engineered implementations.

PaperID: 3912, https://arxiv.org/pdf/2604.14607.pdf

Authors: Ha Thanh Nguyen, Wachara Fungwacharakorn, Sabine Wehnert, May Myo Zin, Yuntao Kong, Jieying Xue, Michał Araszkiewicz, Randy Goebel, Ken Satoh

Title: GDPR Auto-Formalization with AI Agents and Human Verification

Abstract:
We study the overall process of automatic formalization of GDPR provisions using large language models, within a human‑in‑the‑loop verification framework. Rather than aiming for full autonomy, we adopt a role‑specialized workflow in which LLM‑based AI components, operating in a multi‑agent setting with iterative feedback, generate legal scenarios, formal rules, and atomic facts. This is coupled with independent verification modules which include human reviewers' assessment of representational, logical, and legal correctness. Using this approach, we construct a high‑quality dataset to be used for GDPR auto‑formalization, and analyze both successful and problematic cases. Our results show that structured verification and targeted human oversight are essential for reliable legal formalization, especially in the presence of legal nuance and context‑sensitive reasoning.

PaperID: 3913, https://arxiv.org/pdf/2604.14593.pdf

Authors: Yitong Shou, Manhao Guan

Title: Mechanistic Decoding of Cognitive Constructs in Large Language Models

Abstract:
While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse‑grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse‑Engineering framework based on Representation Engineering (RepE) to analyze social‑comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression‑based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self‑Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi‑agent environments.

PaperID: 3914, https://arxiv.org/pdf/2604.14583.pdf

Authors: Fernando Spadea, Oshani Seneviratne

Title: From Risk to Rescue: An Agentic Survival Analysis Framework for Liquidation Prevention

Abstract:
Decentralized Finance (DeFi) lending protocols like Aave v3 rely on over‑collateralization to secure loans, yet users frequently face liquidation due to volatile market conditions. Existing risk management tools utilize static health‑factor thresholds, which are reactive and fail to distinguish between administrative "dust" cleanup and genuine insolvency. In this work, we propose an autonomous agent that leverages time‑to‑event (survival) analysis and moves beyond prediction to execution. Unlike passive risk signals, this agent perceives risk, simulates counterfactual futures, and executes protocol‑faithful interventions to proactively prevent liquidations. We introduce a return period metric derived from a numerically stable XGBoost Cox proportional hazards model to normalize risk across transaction types, coupled with a volatility‑adjusted trend score to filter transient market noise. To select optimal interventions, we implement a counterfactual optimization loop that simulates potential user actions to find the minimum capital required to mitigate risk. We validate our approach using a high‑fidelity, protocol‑faithful Aave v3 simulator on a cohort of 4,882 high‑risk user profiles. The results demonstrate the agent's ability to prevent liquidations in imminent‑risk scenarios where static rules fail, effectively "saving the unsavable" while maintaining a zero worsening rate, providing a critical safety guarantee often missing in autonomous financial agents. Furthermore, the system successfully differentiates between actionable financial risks and negligible dust events, optimizing capital efficiency where static rules fail.

PaperID: 3915, https://arxiv.org/pdf/2604.14550.pdf

Authors: Sazzadul Islam, Tasnim Tabassum, Hao Zheng

Title: VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs

Abstract:
Generating synthesizable Verilog for large, hierarchical hardware designs remains a significant challenge for large language models (LLMs), which struggle to replicate the structured reasoning that human experts employ when translating complex specifications into RTL. When tasked with producing hierarchical Verilog, LLMs frequently lose context across modules, hallucinate interfaces, fabricate inter‑module wiring, and fail to maintain structural coherence ‑ failures that intensify as design complexity grows and specifications involve informal prose, figures, and tables that resist direct operationalization. To address these challenges, we present VeriGraphi, a framework that introduces a spec‑anchored Knowledge Graph as the architectural substrate driving the RTL generation pipeline. VeriGraphi constructs a HDA, a structured knowledge graph that explicitly encodes module hierarchy, port‑level interfaces, wiring semantics, and inter‑module dependencies as first‑class graph entities and relations. Built through iterative multi‑agent analysis of the specification, this Knowledge Graph provides a deterministic, machine‑checkable structural scaffold before code generation. Guided by the KG, a progressive coding module incrementally generates pseudo‑code and synthesizable RTL while enforcing interface consistency and dependency correctness at each submodule stage. We evaluate VeriGraphi on a benchmark of three representative specification documents from the National Institute of Standards and Technology and their corresponding implementations, and we present a RV32I processor as a detailed case study to illustrate the full pipeline. The results demonstrate that VeriGraphi enables reliable hierarchical RTL generation with minimal human intervention for RISC‑V, marking a significant milestone for LLM‑generated hardware design while maintaining strong functional correctness.

PaperID: 3916, https://arxiv.org/pdf/2604.14518.pdf

Authors: MindDR Team, Li Auto Inc

Title: Mind DeepResearch Technical Report

Abstract:
We present Mind DeepResearch (MindDR), an efficient multi‑agent deep research framework that achieves leading performance with only ~30B‑parameter models through a meticulously designed data synthesis and multi‑stage training pipeline. The core innovation of MindDR lies in a collaborative three‑agent architecture (Planning Agent, DeepSearch Agent, and Report Agent) and a four‑stage agent‑specialized training pipeline comprising SFT cold‑start, Search‑RL, Report‑RL and preference alignment. With this regime, MindDR demonstrates competitive performance even with ~30B‑scale models. Specifically, MindDR achieves 45.7% on BrowseComp‑ZH, 42.8% on BrowseComp, 46.5% on WideSearch, 75.0% on xbench‑DS, and 52.5 on DeepResearch Bench, outperforming comparable‑scale open‑source agent systems and rivaling larger‑scale models. MindDR has been deployed as an online product in Li Auto. Furthermore, we introduce MindDR Bench, a curated benchmark of 500 real‑world Chinese queries from our internal product user interactions, evaluated through a comprehensive multi‑dimensional rubric system rather than relying on a single RACE metric. On MindDR Bench, MindDR achieves a state‑of‑the‑art score of 51.8.

PaperID: 3917, https://arxiv.org/pdf/2604.14475.pdf

Authors: Weixiang Shen, Bailiang Jian, Jun Li, Che Liu, Johannes Moll, Xiaobin Hu, Daniel Rueckert, Hongwei Bran Li, Jiazhen Pan

Title: Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

Abstract:
Tool‑augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question‑answering modules to interpret chest X‑rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool‑use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo‑MedAgent, a self‑evolving memory module that equips a medical agent with the capacity for inter‑case learning at test time. Our memory comprises three complementary stores: (1)~\emphRetrospective Clinical Episodes that retrieve problem‑solving experiences from similar past cases, (2)~an \emphAdaptive Procedural Heuristics bank curating priority‑tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3)~a \emphTool Reliability Controller that tracks per‑tool trustworthiness. On ChestAgentBench, Evo‑MedAgent raises multiple‑choice question (MCQ) accuracy from 0.68 to 0.79 on GPT‑5‑mini, and from 0.76 to 0.87 on Gemini‑3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo‑MedAgent requires no training, its per‑case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.

PaperID: 3918, https://arxiv.org/pdf/2604.14455.pdf

Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

Title: AIBuildAI: An AI Agent for Automatically Building AI Models

Abstract:
AI models underpin modern intelligent systems, driving advances across science, medicine, finance, and technology. Yet developing high‑performing AI models remains a labor‑intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation. Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise. To address this gap, we introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data. AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub‑agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub‑agent is itself a large language model (LLM) based agent capable of multi‑step reasoning and tool use, enabling end‑to‑end automation of the AI model development process that goes beyond the scope of existing AutoML approaches. We evaluate AIBuildAI on MLE‑Bench, a benchmark of realistic Kaggle‑style AI development tasks spanning visual, textual, time‑series and tabular modalities. AIBuildAI ranks first on MLE‑Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers. These results demonstrate that hierarchical agent systems can automate the full AI model development process from task specification to deployable model, suggesting a pathway toward broadly accessible AI development with minimal human intervention.

PaperID: 3919, https://arxiv.org/pdf/2604.14386.pdf

Authors: Dongxin Guo, Jikun Wu, Siu-Ming Yiu

Title: Coalition Formation in LLM Agent Networks: Stability Analysis and Convergence Guarantees

Abstract:
Large Language Model (LLM) agents are increasingly deployed in multi‑agent systems requiring strategic coordination. While recent work has analyzed LLM behavior in two‑player games, coalition formation, where n agents dynamically form cooperative groups, remains theoretically uncharacterized. We present the first framework grounding coalition formation in LLM agent networks in hedonic game theory with formal stability guarantees. We introduce the LLM Coalition Formation Game (LCFG), establish sufficient conditions for Nash‑stable partitions, and prove complexity results. Our analysis reveals that LLM agents exhibit bounded rationality characterized by ε‑rational preferences; we provide both deterministic existence guarantees and consistency‑driven stability bounds whose predictions are consistent with empirical outcomes. Experiments with GPT‑4, Claude‑3, and Llama‑3 across 2,400 episodes validate our framework: LLM coalitions achieve Nash stability in 73.2% of cases under our Coalition‑of‑Thought (CoalT) protocol, compared to 58.4% under chain‑of‑thought and 41.8% under standard prompting (p < 0.001). Our framework provides theoretical foundations for designing stable multi‑agent LLM systems.

PaperID: 3920, https://arxiv.org/pdf/2604.14309.pdf

Authors: Li-Hsiang Shen, Yu-Quan Zheng

Title: Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

Abstract:
To address high data traffic demands of sixth‑generation (6G) networks, this paper proposes a novel architecture that integrates autonomous aerial vehicles (AAVs) and multi‑functional reconfigurable intelligent surfaces (MF‑RISs) as AM‑RIS in fluid antenna (FA)‑assisted full‑duplex (FD) networks. The AM‑RIS provides hybrid functionalities, including signal reflection, amplification, and energy harvesting (EH), potentially improving both signal coverage and sustainability. Meanwhile, FA facilitates fine‑grained spatial adaptability at FD‑enabled base station (BS), which complements residual self‑interference (SI) suppression. We aim at maximizing the overall energy efficiency (EE) by jointly optimizing transmit DL beamforming at BS, UL user power, configuration of AM‑RIS, and positions of the FA and AM‑RIS. Owing to the hybrid continuous‑discrete parameters and high dimensionality of the intractable problem, we have conceived a self‑optimized multi‑agent hybrid deep reinforcement learning (DRL) framework (SOHRL), which integrates multi‑agent deep Q‑networks (DQN) and multi‑agent proximal policy optimization (PPO), respectively handling discrete and continuous actions. To enhance self‑adaptability, an attention‑driven state representation and meta‑level hyperparameter optimization are incorporated, enabling multi‑agents to autonomously adjust learning hyperparameters. Simulation results validate the effectiveness of the proposed AM‑RIS‑enabled FA‑aided FD networks empowered by SOHRL algorithm. The results reveal that SOHRL outperforms benchmarks of the case without attention mechanism and conventional hybrid/multi‑agent/standalone DRL. Moreover, AM‑RIS in FD achieves the highest EE compared to half‑duplex, conventional rigid antenna arrays, partial EH, and conventional RIS without amplification, highlighting its potential as a compelling solution for EE‑aware wireless networks.

PaperID: 3921, https://arxiv.org/pdf/2604.14205.pdf

Authors: Kristian Hengster-Movrić, Šimon Lehký, Farnaz Adib Yaghmaie

Title: Consensus and Synchronization of Multi-agent Systems over Finite Fields -- Graph Topologies

Abstract:
This paper brings cooperative protocols for multi‑agent systems with agents having a finite state‑space. Both scalar single‑integrator consensus and general LTI systems synchronization are considered. Systems having a finite state‑space describe agents with minimal memory capacity processing only a finite alphabet. Such systems are remarkably resilient to communication noise. The crucial problem, however, is to construct the admissible communication topology, which is NP‑hard. We address this by efficiently exploring the subsets of admissible matrices and propose two new algorithms to generate the topologies. Simulations validate the proposed approach.

PaperID: 3922, https://arxiv.org/pdf/2604.14178.pdf

Authors: Hong Su

Title: Simulating Human Cognition: Heartbeat-Driven Autonomous Thinking Activity Scheduling for LLM-based AI systems

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable capabilities in reasoning and tool use, yet they often suffer from rigid, reactive control flows that limit their adaptability and efficiency. Most existing frameworks rely on fixed pipelines or failure‑triggered reflection, causing agents to act impulsively or correct errors only after they occur. In this paper, we introduce Heartbeat‑Driven Autonomous Thinking Activity Scheduling, a mechanism that enables proactive, adaptive, and continuous self‑regulation. Mirroring the natural rhythm of human cognition, our system employs a periodic ``heartbeat'' mechanism to orchestrate a dynamic repertoire of cognitive modules (e.g., Planner, Critic, Recaller, Dreamer). Unlike traditional approaches that rely on hard‑coded symbolic rules or immediate reactive triggers, our scheduler learns to determine when to engage specific thinking activities ‑‑ such as recalling memories, summarizing experiences, or strategic planning ‑‑ based on temporal patterns and historical context. This functional approach allows cognitive modules to be dynamically added or removed without structural reengineering. Meanwhile, we propose a meta‑learning strategy for continual policy adaptation, where the scheduler optimizes its cognitive strategy over time using historical interaction logs. Evaluation results demonstrate that our approach effectively learns to schedule cognitive activities based on historical data and can autonomously integrate new thinking modules.

PaperID: 3923, https://arxiv.org/pdf/2604.14165.pdf

Authors: Naman Ahuja, Saniya Mulla, Muhammad Ali Khan, Zaryab Bin Riaz, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta

Title: EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews

Abstract:
We present EviSearch, a multi‑agent extraction system that automates the creation of ontology‑aligned clinical evidence tables directly from native trial PDFs while guaranteeing per‑cell provenance for audit and human verification. EviSearch pairs a PDF‑query agent (which preserves rendered layout and figures) with a retrieval‑guided search agent and a reconciliation module that forces page‑level verification when agents disagree. The pipeline is designed for high‑precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer‑actionable provenance that clinicians can inspect and correct. On a clinician‑curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed‑text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM‑based extraction into evidence synthesis pipelines.

PaperID: 3924, https://arxiv.org/pdf/2604.14160.pdf

Authors: Xingyu Xiao, Jiejuan Tong, Jun Sun, Zhe Sui, Peng Chen, Jingang Liang, Haitao Wang

Title: NuHF Claw: A Risk Constrained Cognitive Agent Framework for Human Centered Procedure Support in Digital Nuclear Control Rooms

Abstract:
The rapid digitization of nuclear power plant main control rooms has fundamentally reshaped operator interaction patterns, introducing complex soft‑control behaviors and elevated cognitive risks that are not adequately addressed by existing human reliability analysis approaches. Although recent advances in large language models and autonomous agents offer new opportunities for intelligent decision support, their deployment in safety critical environments remains constrained by risks of hallucinated reasoning and weakened human authority. This study proposes NuHF Claw, a persistent cognitive‑risk agent framework that enables risk governed human centered autonomy for digital nuclear operations. The core methodological innovation lies in the introduction of a risk constrained agent runtime, which tightly couples cognitive state inference with probabilistic safety assessment to regulate autonomous system behavior in real time. By integrating cognitively grounded workload and situational awareness estimation with dynamic human error probability prediction, the framework transforms conventional offline reliability analysis into a proactive intervention mechanism embedded directly within operational workflows. Experimental validation on a high‑fidelity digital control room simulator demonstrates that NuHF Claw can anticipate interface induced cognitive degradation, dynamically constrain unsafe autonomous recommendations, and provide risk‑aware navigational guidance while preserving human decision authority. The results highlight a fundamental shift from automation‑driven operation toward cognition‑aware autonomy, offering a principled pathway for the safe integration of intelligent agents into next‑generation nuclear control environments.

PaperID: 3925, https://arxiv.org/pdf/2604.14116.pdf

Authors: Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li

Title: TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Abstract:
While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real‑world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi‑agent system that automates the entire LLM training life‑cycle. By orchestrating collaboration between two core modules‑the Researcher and the Executor‑the system seamlessly performs requirement analysis, open‑domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi‑round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high‑level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT‑Bench, a benchmark comprising 10 tasks derived from real‑world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain‑specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.

PaperID: 3926, https://arxiv.org/pdf/2604.13996.pdf

Authors: Kellie Yu Hui Sim, Pin Sym Foong, Darryl Lim, John-Henry Lim, Kenny Tsu Wei Choo

Title: Acts of Configuration: Rethinking Provenance, Temporality and Legitimacy in Post-Mortem Agents

Abstract:
Work on persona‑persistent post‑mortem agents typically frames design around a life/death binary. This framing neglects a consequential yet under‑theorised condition: when individuals remain alive but have impaired decisional capacity. Drawing on a multi‑phase workshop in which participants trained and reflected on an AI agent for Advance Care Planning, we examined how people reason about agentic delegation post‑capacity loss. Initially, participants favoured bounded agents grounded in first‑party authorship and representational fidelity over autonomous or evolving stand‑ins. However, temporality introduced novel ideas like adjacent use driven by persona persistence over functional expansion: agents should evolve while users retain capacity, remain static once capacity is lost, but somehow inform adjacent post‑mortem uses. We discuss the implications of these findings and propose that the configuration of agents for post‑capacity use reshapes our understanding of provenance, temporality, and legitimacy for post‑mortem agents.

PaperID: 3927, https://arxiv.org/pdf/2604.13946.pdf

Authors: Duy Tung Doan, Quang Huy Phung, Dzung Nguyen, Khac-Hoai Nam Bui

Title: CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation

Abstract:
Automated code generation remains a persistent challenge in software engineering, as conventional multi‑agent frameworks are often constrained by static planning, isolated execution, high computational overhead, and limited adaptability to complex tasks. This paper introduces CollabCoder, a novel Plan‑Code Co‑Evolution framework that improves code generation through dynamic multi‑agent collaboration. The core idea is to design a collaborative decision‑making process between the plan module and the code module to decide which module should be executed for the debugging process. Extensive experiments on widely used benchmarks demonstrate that CollabCoder consistently improves code quality and robustness across tasks. Importantly, CollabCoder achieves performance comparable to or exceeding current state‑of‑the‑art methods while reducing computational overhead, with efficiency gains becoming more pronounced as benchmark difficulty increases. On the more challenging LiveCodeBench and xCodeEval benchmarks, our approach improves performance by 11‑20% over strong baselines while reducing the number of API calls by an average of 4‑10 per execution.

PaperID: 3928, https://arxiv.org/pdf/2604.13891.pdf

Authors: Saeed Rahmani, Gözde Körpe, Zhenlin, Xu, Bruno Brito, Simeon Craig Calvert, Bart van Arem

Title: Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning

Abstract:
Automated driving at unsignalized intersections is challenging due to complex multi‑vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand‑crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC‑RL framework to improve navigation performance in multi‑agent scenarios. Experiments show that MPC‑RL outperforms standalone MPC and end‑to‑end RL across three traffic‑density levels. Collectively, MPC‑RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero‑shot transfer to a highway merging scenario without retraining. Both MPC‑based methods transfer substantially better than end‑to‑end PPO, which highlights the role of the MPC backbone in cross‑scenario robustness. The framework also shows faster loss stabilization than end‑to‑end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi‑agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open‑source.

PaperID: 3929, https://arxiv.org/pdf/2604.13888.pdf

Authors: Bo Yu, Cheng Yang, Dongyang Hou, Chengfu Liu, Jiayao Liu, Chi Wang, Zhiming Zhang, Haifeng Li, Wentao Yang

Title: GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

Abstract:
The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM‑based agents remains challenging due to the complex, multi‑step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool‑augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last‑Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision‑Language Model (VLM) based verification is proposed to assess data‑spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan‑and‑React, that mimics expert cognitive workflows by decoupling global orchestration from step‑wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan‑and‑React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi‑step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.

PaperID: 3930, https://arxiv.org/pdf/2604.13824.pdf

Authors: Youling Huang, Guanqiao Chen, Junchi Yao, Lu Wang, Fangkai Yang, Chao Du, ChenZhuo Zhao, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Title: Beyond State Consistency: Behavior Consistency in Text-Based World Models

Abstract:
World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text‑based environments, world models are typically evaluated and trained with single‑step metrics such as Exact Match, aiming to improve the similarity between predicted and real‑world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior‑aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step‑level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world‑model‑predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR‑based training improves long‑term alignment in several settings, with the clearest gains in WebShop and less movement in near‑ceiling regimes, while preserving or improving single‑step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference‑time lookahead planning.

PaperID: 3931, https://arxiv.org/pdf/2604.13822.pdf

Authors: Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Title: UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

Abstract:
MLLM‑based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long‑horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI‑Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on‑demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool‑Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single‑turn prediction and task execution through on‑policy multi‑turn rollouts. Experimental results show that UI‑Copilot‑7B achieves state‑of‑the‑art performance on challenging MemGUI‑Bench, outperforming strong 7B‑scale GUI agents such as GUI‑Owl‑7B and UI‑TARS‑1.5‑7B. Moreover, UI‑Copilot‑7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI‑Copilot's strong generalization to real‑world GUI tasks.

PaperID: 3932, https://arxiv.org/pdf/2604.13759.pdf

Authors: Rafflesia Khan, Nafiul Islam Khan

Title: The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

Abstract:
Large language model (LLM) agents on multi‑step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM‑as‑judge monitoring (10‑15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM‑based Companion and a novel zero‑overhead Probe‑based Companion. We report a three‑batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small‑model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM‑based Companion reduced repetition on loop‑prone tasks by 52‑62% with approximately 11% overhead. The Probe‑based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross‑validated AUROC 0.840 on a small proxy‑labeled dataset. A key empirical finding is that companion benefit appears task‑type dependent: companions are most helpful on loop‑prone and open‑ended tasks, while effects are neutral or negative on more structured tasks. Our small‑model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B‑1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub‑token monitoring may be useful, identify task‑type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.

PaperID: 3933, https://arxiv.org/pdf/2604.13708.pdf

Authors: Mohammed Irshadh Ismaaeel Sathyamangalam Imran, Lasse Peters, Michael Khayyat, Stefano Arrigoni, Francesco Braghin, Laura Ferranti

Title: Homotopy-Guided Potential Games for Congestion-Aware Navigation

Abstract:
We address the multi‑agent motion planning problem where interactions, collisions, and congestion co‑exist. Conventional game‑theoretic planners capture interactions among agents but often converge to conservative, congested equilibria. Homotopy planners, on the other hand, can explore topologically distinct paths, but lack mechanisms to account for the interdependence of agents' future actions. We propose a unified framework that leverages homotopy classes as structured strategy sets within a receding‑horizon setup. At each planning stage, a deterministic homotopy planner generates topologically distinct paths for each agent, conditioned on the joint configuration. To avoid intractable growth of candidate paths, we propose a simple heuristic filtering step that selects a top‑K subset of the most suitable congestion‑free joint strategies to ensure computational tractability. These serve as initializations for a potential game that enforces homotopy‑consistent constraints and yields a generalized open‑loop Nash equilibrium (OLNE), with penalties discouraging abrupt strategy shifts in a receding‑horizon setting. Simulations with three agents demonstrate improved efficiency (faster completion) and enhanced safety (greater inter‑agent clearance, leading to reduced congestion) compared to a local baseline and NH‑ORCA that do not reason about homotopies. Hardware trials with two robots and one human demonstrate robustness to irrational behaviors, where our method adapts by switching to alternative feasible equilibria while the baseline game fails.

PaperID: 3934, https://arxiv.org/pdf/2604.13705.pdf

Authors: Sayan Kumar Chaki, Antoine Gourru, Julien Velcin

Title: Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

Abstract:
Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval‑augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left‑leaning tendencies in LLMs. We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi‑agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.

PaperID: 3935, https://arxiv.org/pdf/2604.13654.pdf

Authors: Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, Kangli Wang, Ji Pei

Title: Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Abstract:
Vision‑and‑Language Navigation for Unmanned Aerial Vehicles (UAV‑VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high‑level human commands and execute long‑horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision‑Language Models (VLMs), Vision‑Language‑Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically‑grounded reasoning. The survey systematically reviews the ecosystem of essential resources simulators, datasets, and evaluation metrics that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real‑world deployment: the simulation‑to‑reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource‑constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward‑looking research roadmap to guide future inquiry into key frontiers such as multi‑agent swarm coordination and air‑ground collaborative robotics.

PaperID: 3936, https://arxiv.org/pdf/2604.13630.pdf

Authors: Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu, Yucheng Ning, Yilong Liu, Nan Sun, Shun Zhang, Bin Chong, Chuan Zhou, Yanan Cao

Title: SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Abstract:
The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high‑value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from structural mismatch, leaving them blind to harness‑internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce \safeharness, a security architecture in which four proposed defense layers are woven directly into the agent lifecycle to address above significant limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege‑separated tool control at action execution, and safe rollback with adaptive degradation at state update. The proposed cross‑layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. We evaluate \safeharness on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, \safeharness achieves an average reduction of approximately 38% in UBR and 42% in ASR, substantially lowering both the unsafe behavior rate and the attack success rate while preserving core task utility.

PaperID: 3937, https://arxiv.org/pdf/2604.13628.pdf

Authors: Nana Wang, Pelin Sekercioglu, Dimos V. Dimarogonas

Title: Topology Estimation for Open Multi-Agent Systems

Abstract:
We address the problem of interaction topology identification in open multi‑agent systems (OMAS) with dynamic node sets and fast switching interactions. In such systems, new agents join and interactions change rapidly, resulting in intervals with short dwell time and rendering conventional segment‑wise estimation and clustering methods unreliable. To overcome this, we propose a projection‑based dissimilarity measure derived from a consistency property of local least‑squares operators, enabling robust mode clustering. Aggregating intervals within each cluster yields accurate topology estimates. The proposed framework offers a systematic solution for reconstructing the interaction topology of OMAS subject to fast switching. Finally, we illustrate our theoretical results via numerical simulations.

PaperID: 3938, https://arxiv.org/pdf/2604.13592.pdf

Authors: Jiashuo Wang, Jiawen Duan, Jian Wang, Kaitao Song, Chunpu Xu, Johnny K. W. Ho, Fenggang Yu, Wenjie Li, Johan F. Hoorn

Title: Foresight Optimization for Strategic Reasoning in Large Language Models

Abstract:
Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning‑based LLMs to perform effective decision‑making abilities in multi‑agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart's behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision‑making in multi‑agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self‑interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well‑designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self‑play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out‑of‑domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

PaperID: 3939, https://arxiv.org/pdf/2604.13559.pdf

Authors: Zhenyu Wan, Gong Chen, Qing Huang, Xiaoyuan Xie

Title: WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems

Abstract:
Scenario testing is an important technique for detecting errors in web systems. Testers draft test scenarios and convert them into test scripts for execution. Early methods relied on testers to convert test scenarios into test scripts. Recent LLM‑based scenario testing methods can generate test scripts from natural language descriptions of test scenarios. However, these methods are not only limited by the incompleteness of descriptions but also overlook test adequacy criteria, making it difficult to detect potential errors. To address these limitations, this paper proposes WebMAC, a multi‑agent collaborative framework for scenario testing of web systems. WebMAC can complete natural language descriptions of test scenarios through interactive clarification and transform adequate instantiated test scenarios via equivalence class partitioning. WebMAC consists of three multi‑agent modules, responsible respectively for completing natural language descriptions of test scenarios, transforming test scenarios, and converting test scripts. We evaluated WebMAC on four web systems. Compared with the SOTA method, WebMAC improves the execution success rate of generated test scripts by 30%‑60%, increases testing efficiency by 29%, and reduces token consumption by 47.6%. Furthermore, WebMAC can effectively detect more errors in web systems.

PaperID: 3940, https://arxiv.org/pdf/2604.13551.pdf

Authors: Cunda Wang, Ziying Ma, Po Hu, Weihua Wang, Feilong Bao

Title: Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate

Abstract:
Entity alignment (EA) aims to identify entities referring to the same real‑world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment‑uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi‑agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two‑stage multi‑role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate‑based reasoning. Extensive experiments on public benchmarks under cross‑lingual, sparse, large‑scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.

PaperID: 3941, https://arxiv.org/pdf/2604.13533.pdf

Authors: Jianzong Wang, Botao Zhao, Yayun He, Junqing Peng, Xulong Zhang

Title: Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

Abstract:
Achieving general‑purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross‑task generalization, and lack of interpretability. Prompt learning offers new opportunities for self‑evolving robots without extensive training, but simply reflecting on past experiences. However, extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision‑language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short‑term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self‑evolution, thereby enhancing overall task success rates. Evaluations on six VIMA‑Bench tasks reveal that our approach sets a new state‑of‑the‑art, notably outperforming baselines in complex scenarios.

PaperID: 3942, https://arxiv.org/pdf/2604.13491.pdf

Authors: Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim

Title: FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation

Abstract:
With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self‑reflection and self‑refinement, their use in text‑to‑image generation remains largely underexplored. Meanwhile, existing multimodal reasoning‑based image generation methods mostly rely on prompt augmentation or holistic image‑text alignment judgments, without fine‑grained reflection and refinement of detailed prompt attributes, leading to limited fine‑grained control. To address this limitation, we propose FiRe, a Fine‑grained Multimodal Reasoning method for enhanced image generation by MLLM. In specific, FiRe performs a fine‑grained multi‑step reasoning by first decomposing the prompt into key visual requirements and then self‑judging their satisfaction in the generated image, followed by localized refinement according to self‑generated precise feedback. In addition, to further strengthen the MLLM's multimodal reasoning ability, we introduce FiRe‑GRPO, a reinforcement learning method tailored to FiRe. Since standard Group Relative Policy Optimization (GRPO) suffers from sparse, outcome‑based rewards in multi‑step reasoning, we formulate our reasoning process as a step‑level decision‑making problem, design step‑specific rewards, and compute step‑level advantages for granular credit assignment within GRPO. Extensive experiments demonstrate that FiRe consistently outperforms competitive text‑to‑image baselines, including existing reasoning‑based methods, with particularly substantial gains on compositional text‑to‑image benchmarks.

PaperID: 3943, https://arxiv.org/pdf/2604.13488.pdf

Authors: Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, Jiajun Bu

Title: Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

Abstract:
Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end‑user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource‑constrained devices. When facing complex in‑the‑wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end‑to‑end episodic learning, impeding adaptation to multi‑agent systems (MAS), while training multiple skill‑specific experts remains costly. Can we strike an effective trade‑off in this cost‑scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI‑specific knowledge and task scalability, allowing multi‑role orchestration to expand its capability boundary for GUI automation. LAMO combines role‑oriented data synthesis with a two‑stage training recipe: (i) supervised fine‑tuning with Perplexity‑Weighted Cross‑Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role‑oriented cooperative exploration. With LAMO, we develop a task‑scalable native GUI agent, LAMO‑3B, supporting monolithic execution and MAS‑style orchestration. When paired with advanced planners as a plug‑and‑play policy executor, LAMO‑3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.

PaperID: 3944, https://arxiv.org/pdf/2604.13452.pdf

Authors: Ishani Mondal, Yiwen Song, Mihir Parmar, Palash Goyal, Jordan Boyd-Graber, Tomas Pfister, Yale Song

Title: CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

Abstract:
Long‑form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity‑Aware Narratives via Visual Agentic Storyboarding), a multi‑agent framework that explicitly plans visual continuity in multi‑shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location‑aware scene planning for smooth transitions within the same setting We evaluate CANVAS on two storyboard generation benchmarks ST‑BENCH and ViStoryBench and introduce a new challenging benchmark HardContinuityBench for long‑range narrative consistency. CANVAS consistently outperforms the best‑performing baseline, improving background continuity by 21.6%, character consistency by 9.6% and props consistency by 7.6%.

PaperID: 3945, https://arxiv.org/pdf/2604.13394.pdf

Authors: Wenji Cao, Lu Liu, Dan Zhang, Gang Feng

Title: Distributed Resilient Fixed-Time Control for Cooperative Output Regulation of MASs over Directed Graphs under DoS Attacks

Abstract:
This paper addresses the problem of fixed‑time cooperative output regulation for linear multi‑agent systems over directed graphs under denial‑of‑service attacks. A novel distributed resilient fixed‑time controller is developed that comprises a distributed resilient fixed‑time observer taking general directed graphs into consideration, and a distributed resilient fixed‑time control law for each agent. The proposed controller neither depends on Laplacian symmetry nor requires strong connectivity and detail‑balanced condition, in contrast to existing distributed resilient fixed‑time controllers. Under the proposed controller, the regulated outputs converge to zero in a fixed time with its upper bound independent of the initial states of the multi‑agent system. Ultimately, the efficacy of the proposed controller is demonstrated via a simulation example.

PaperID: 3946, https://arxiv.org/pdf/2604.13353.pdf

Authors: Nguyen Phuc Tran, Brigitte Jaumard, Karthikeyan Premkumar, Salman Memon

Title: Cross-Domain Query Translation for Network Troubleshooting: A Multi-Agent LLM Framework with Privacy Preservation and Self-Reflection

Abstract:
This paper presents a hierarchical multi‑agent LLM architecture to bridge communication gaps between non‑technical end users and telecommunications domain experts in private network environments. We propose a cross‑domain query translation framework that leverages specialized language models coordinated through multi‑agent reflection‑based reasoning. The resulting system addresses three critical challenges: (1) accurately classify user queries related to telecommunications network issues using a dual‑stage hierarchical approach, (2) preserve user privacy through the anonymization of semantically relevant personally identifiable information (PII) while maintaining diagnostic utility, and (3) translate technical expert responses into user‑comprehensible language. Our approach employs ReAct‑style agents enhanced with self‑reflection mechanisms for iterative output refinement, semantic‑preserving anonymization techniques respecting k‑anonymity and differential privacy principles, and few‑shot learning strategies designed for limited training data scenarios. The framework was comprehensively evaluated on 10,000 previously unseen validation scenarios across various vertical industries.

PaperID: 3947, https://arxiv.org/pdf/2604.13346.pdf

Authors: Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang

Title: AgentSPEX: An Agent SPecification and EXecution Language

Abstract:
Language‑model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open‑ended sequence of reasoning and tool‑use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM‑agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready‑to‑use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow‑authoring paradigm than a popular existing agent framework.

PaperID: 3948, https://arxiv.org/pdf/2604.13345.pdf

Authors: Vladimir Kalušev, Branko Brkljač, Milan Brkljač

Title: Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

Abstract:
The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi‑agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi‑agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.

PaperID: 3949, https://arxiv.org/pdf/2604.13242.pdf

Authors: Giorgio Franceschelli, Mirco Musolesi

Title: On the Creativity of AI Agents

Abstract:
Large language models (LLMs), particularly when integrated into agentic systems, have demonstrated human‑ and even superhuman‑level performance across multiple domains. Whether these systems can truly be considered creative, however, remains a matter of debate, as conclusions heavily depend on the definitions, evaluation methods, and specific use cases employed. In this paper, we analyse creativity along two complementary macro‑level perspectives. The first is a functionalist perspective, focusing on the observable characteristics of creative outputs. The second is an ontological perspective, emphasising the underlying processes, as well as the social and personal dimensions involved in creativity. We focus on LLM agents and we argue that they exhibit functionalist creativity, albeit not at its most sophisticated levels, while they continue to lack key aspects of ontological creativity. Finally, we discuss whether it is desirable for agentic systems to attain both forms of creativity, evaluating potential benefits and risks, and proposing pathways toward artificial creativity that can enhance human society.

PaperID: 3950, https://arxiv.org/pdf/2604.13128.pdf

Authors: Isaac Remy, Caleb Chang, Karen Leung

Title: Learning Probabilistic Responsibility Allocations for Multi-Agent Interactions

Abstract:
Human behavior in interactive settings is shaped not only by individual objectives but also by shared constraints with others, such as safety. Understanding how people allocate responsibility, i.e., how much one deviates from their desired policy to accommodate others, can inform the design of socially compliant and trustworthy autonomous systems. In this work, we introduce a method for learning a probabilistic responsibility allocation model that captures the multimodal uncertainty inherent in multi‑agent interactions. Specifically, our approach leverages the latent space of a conditional variational autoencoder, combined with techniques from multi‑agent trajectory forecasting, to learn a distribution over responsibility allocations conditioned on scene and agent context. Although ground‑truth responsibility labels are unavailable, the model remains tractable by incorporating a differentiable optimization layer that maps responsibility allocations to induced controls, which are available. We evaluate our method on the INTERACTION driving dataset and demonstrate that it not only achieves strong predictive performance but also provides interpretable insights, through the lens of responsibility, into patterns of multi‑agent interaction.

PaperID: 3951, https://arxiv.org/pdf/2604.13103.pdf

Authors: Corey Yang-Smith, Ronnie de Souza Santos, Ahmad Abdellatif

Title: Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review

Abstract:
Transformer‑based large language models (LLMs) and multi‑agent systems (MAS) are increasingly embedded across the software development lifecycle (SDLC), yet their fairness implications for developer‑facing tools remain underexplored despite their growing role in shaping what code is written, reviewed, and released. We present a rapid review of recent work on fairness in MAS, emphasizing LLM‑enabled settings and relevance to software engineering. Starting from an initial set of 350 papers, we screened and filtered the corpus for relevance, retaining 18 studies for final analysis. Across these 18 studies, fairness is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, while evaluation spans accuracy metrics on bias benchmarks, demographic disparity measures, and emergent MAS‑specific notions such as conformity and bias amplification. Reported harms include representational, quality‑of‑service, security and privacy, and governance failures, which we relate to SDLC stages where evidence is most and least developed. We identify three persistent gaps: (1) fragmented, rarely MAS‑specific evaluation practices that limit comparability, (2) limited generalization due to simplified environments and narrow attribute coverage, and (3) scarce, weakly evaluated mitigation and governance mechanisms aligned to real software workflows. These findings suggest MAS fairness research is not yet ready to support deployable, fairness‑assured software systems, motivating MAS‑aware benchmarks, consistent protocols, and lifecycle‑spanning governance.

PaperID: 3952, https://arxiv.org/pdf/2604.13098.pdf

Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Bin Rao, Zhenning Li

Title: C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination

Abstract:
State‑of‑the‑art (SOTA) urban traffic control increasingly employs Multi‑Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand‑crafted, myopic rewards (e.g., intersection pressure), which fail to capture high‑level, human‑centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common‑sense coordination model from traffic‑vehicle dynamics. C2T distills "common‑sense" knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi‑intersection TLC MARL system on CityFlow‑based multi‑intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy‑related proxy. We further highlight C2T's flexibility in principle, allowing distinct "efficiency‑focused" versus "safety‑focused" policies by modifying the LLM prompt.

PaperID: 3953, https://arxiv.org/pdf/2604.13088.pdf

Authors: Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng

Title: Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

PaperID: 3954, https://arxiv.org/pdf/2604.13085.pdf

Authors: Rajat Khanda, Mohammad Baqar Sambuddha Chakrabarti, Satyasaran Changdar

Title: Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

Abstract:
Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi‑objective utility signal. The framework introduces a three‑phase memory hierarchy (Liquid‑‑Glass‑‑Crystal) governed by an Itô stochastic differential equation (SDE) whose population‑level behavior is captured by an explicit Fokker‑‑Planck equation admitting a closed‑form Beta stationary distribution. We provide proofs of: (i) well‑posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end‑to‑end Q‑learning error bounds and matching memory‑capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta‑World MT50, Atari 20‑game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34‑‑43% over the strongest baseline), reductions in catastrophic forgetting (67‑‑80%), and a 62% decrease in memory footprint.

PaperID: 3955, https://arxiv.org/pdf/2604.13064.pdf

Authors: Haichuan Hu, Ye Shang, Quanjun Zhang

Title: Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

Abstract:
Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community‑driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross‑lingual differences: English skills are more infrastructure‑oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application‑oriented, with clearer scenario‑driven clusters such as media generation, social content production, and finance‑related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission‑time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves a accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission‑time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem‑scale security risk.

PaperID: 3956, https://arxiv.org/pdf/2604.13052.pdf

Authors: Saber Zerhoudi, Kanishka Ghosh Dastidar, Felix Klement, Artur Romazanov, Andreas Einwiller, Dang H. Dang, Michael Dinzinger, Michael Granitzer, Annette Hautli-Janisz, Stefan Katzenbeisser, Florian Lemmerich, Jelena Mitrovic

Title: Form Without Function: Agent Social Behavior in the Moltbook Network

Abstract:
Moltbook is a social network where every participant is an AI agent. We analyze 1,312,238 posts, 6.7~million comments, and over 120,000 agent profiles across 5,400 communities, collected over 40 days (January 27 to March 9, 2026). We evaluate the platform through three layers. At the interaction layer, 91.4% of post authors never return to their own threads, 85.6% of conversations are flat (no reply ever receives a reply), the median time‑to‑first‑comment is 55 seconds, and 97.3% of comments receive zero upvotes. Interaction reciprocity is 3.3%, compared to 22‑60% on human platforms. An argumentation analysis finds that 64.6% of comment‑to‑post relations carry no argumentative connection. At the content layer, 97.9% of agents never post in a community matching their bio, 92.5% of communities contain every topic in roughly equal proportions, and over 80% of shared URLs point to the platform's own infrastructure. At the instruction layer, we use 41 Wayback Machine snapshots to identify six instruction changes during the observation window. Hard constraints (rate limit, content filters) produce immediate behavioral shifts. Soft guidance (``upvote good posts'', ``stay on topic'') is ignored until it becomes an explicit step in the executable checklist. The platform also poses technological risks. We document credential leaks (API keys, JWT tokens), 12,470 unique Ethereum addresses with 3,529 confirmed transaction histories, and attack discourse ranging from template‑based SSH brute‑forcing to multi‑agent offensive security architectures. These persist unmoderated because the quality‑filtering mechanisms are themselves non‑functional. Moltbook is a socio‑technical system where the technical layer responds to changes, but the social layer largely fails to emerge. The form of social media is reproduced in full. The function is absent.

PaperID: 3957, https://arxiv.org/pdf/2604.13041.pdf

Authors: Ruilin Zhang, Kai Yang

Title: TableNet A Large-Scale Table Dataset with LLM-Powered Autonomous

Abstract:
Table Structure Recognition (TSR) requires the logical reasoning ability of large language models (LLMs) to handle complex table layouts, but current datasets are limited in scale and quality, hindering effective use of this reasoning capacity. We thus present TableNet dataset, a new table structure recognition dataset collected and generated through multiple sources. Central to our approach is the first LLM‑powered autonomous table generation and recognition multi‑agent system that we developed. The generation part of our system integrates controllable visual, structural, and semantic parameters into the synthesis of table images. It facilitates the creation of a wide array of semantically coherent tables, adaptable to user‑defined configurations along with annotations, thereby supporting large‑scale and detailed dataset construction. This capability enables a comprehensive and nuanced table image annotation taxonomy, potentially advancing research in table‑related domains. In contrast to traditional data collection methods, This approach facilitates the theoretically infinite, domain‑agnostic, and style‑flexible generation of table images, ensuring both efficiency and precision. The recognition part of our system is a diversity‑based active learning paradigm that utilizes tables from multiple sources and selectively samples most informative data to finetune a model, achieving a competitive performance on TableNet test set while reducing training samples by a large margin compared with baselines, and a much higher performance on web‑crawled real‑world tables compared with models trained on predominant table datasets. To the best of our knowledge, this is the first work which employs active learning into the structure recognition of tables which is diverse in numbers of rows or columns, merged cells, cell contents, etc, which fits better for diversity‑based active learning.

PaperID: 3958, https://arxiv.org/pdf/2604.12948.pdf

Authors: Benjamin Stern, Peter Nadel

Title: Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Abstract:
LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross‑session aggregation. Inspired by the drawing effect [3], we introduce dual‑trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval‑S benchmark (4,575 sessions, 100 recall questions), we compare dual‑trace encoding against a fact‑only control with matched coverage and format over 99 shared questions. Dual‑trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge‑update tracking (+25pp), and multi‑session aggregation (+30pp), with no benefit for single‑session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual‑trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual‑trace encoding to coding agents, with preliminary pilot validation.

PaperID: 3959, https://arxiv.org/pdf/2604.12831.pdf

Authors: Shengding Liu, Qiben Yan

Title: VULCAN: Vision-Language-Model Enhanced Multi-Agent Cooperative Navigation for Indoor Fire-Disaster Response

Abstract:
Indoor fire disasters pose severe challenges to autonomous search and rescue due to dense smoke, high temperatures, and dynamically evolving indoor environments. In such time‑critical scenarios, multi‑agent cooperative navigation is particularly useful, as it enables faster and broader exploration than single‑agent approaches. However, existing multi‑agent navigation systems are primarily vision‑based and designed for benign indoor settings, leading to significant performance degradation under fire‑driven dynamic conditions. In this paper, we present VULCAN, a multi‑agent cooperative navigation framework based on multi‑modal perception and vision‑language models (VLMs), tailored for indoor fire disaster response. We extend the Habitat‑Matterport3D benchmark by simulating physically realistic fire scenarios, including smoke diffusion, thermal hazards, and sensor degradation. We evaluate representative multi‑agent cooperative navigation baselines under both normal and fire‑driven environments. Our results reveal critical failure modes of existing methods in fire scenarios and underscore the necessity of robust perception and hazard‑aware planning for reliable multi‑agent search and rescue.

PaperID: 3960, https://arxiv.org/pdf/2604.12776.pdf

Authors: Shiyu He, Minchi Kuang, Mengxian Wang, Bin Hu, Tingxiang Gu

Title: EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution

Abstract:
Realizing endogenous narrative evolution in LLM‑based multi‑agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long‑horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative‑spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long‑horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio‑Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise‑en‑Scène mechanism enforces Role‑Location‑Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open‑ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

PaperID: 3961, https://arxiv.org/pdf/2604.12772.pdf

Authors: Madeline Anderson, Mikhail Klassen, Ash Hoover, Kerri Cahoy

Title: A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

Abstract:
Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi‑temporal change captioning datasets, there is a lack of multi‑temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi‑temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi‑agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi‑temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi‑temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

PaperID: 3962, https://arxiv.org/pdf/2604.12766.pdf

Authors: Jihao Dai, Dingjun Wu, Yuxuan Chen, Zheni Zeng, Yukun Yan, Zhenghao Liu, Maosong Sun

Title: NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

Abstract:
Retrieval‑augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse‑grained topics to fine‑grained details. Leveraging this reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long‑document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end‑to‑end answer performance over conventional RAG baselines. Ablation studies confirm performance gains stem from our method's capacity for multi‑granular evidence localization and dynamic retrieval planning. We further discuss efficiency, applicable scenario, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

PaperID: 3963, https://arxiv.org/pdf/2604.12657.pdf

Authors: Francesco Maria Mancinelli, Matteo Torzoni, Domenico Maisto, Francesco Donnarumma, Alberto Corigliano, Giovanni Pezzulo, Andrea Manzoni

Title: Multi-Agent Digital Twins for Strategic Decision-Making using Active Inference

Abstract:
Active Inference is an emerging framework providing a quantitative account of behavioral processes in neuroscience and a principled approach to decision‑making under uncertainty. Its application to agency problems is natural, offering an autopoietic interpretation of action while addressing classical challenges such as the exploration‑exploitation trade‑off. Recently, Active Inference has been applied to digital twin scenarios for adaptive and predictive modeling of complex systems. In this work, we extend Active Inference to multi‑agent digital twins in which agents interact within a shared environment while maintaining decentralized generative models. Our multi‑agent framework features two innovations: (i) contextual inference to improve adaptability in dynamic environments, and (ii) the integration of streaming machine learning within agents' generative structures, enabling tunable goal‑oriented behavior while preserving efficiency and scalability. The framework is illustrated through a Cournot competition example, providing a digital twin representation of a socio‑economic system and highlighting its potential for coordinated decision‑making in multi‑agent contexts.

PaperID: 3964, https://arxiv.org/pdf/2604.12616.pdf

Authors: Jianhao Chen, Haoyang Chen, Hanjie Zhao, Haozhe Liang, Tieyun Qian

Title: Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Abstract:
The rapid evolution of Vision‑Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface‑level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep‑seated semantic vulnerabilities, we introduce MemJack, a MEMory‑augmented multi‑agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi‑agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi‑angle visual‑semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi‑turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3‑VL‑Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack‑Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.

PaperID: 3965, https://arxiv.org/pdf/2604.12545.pdf

Authors: Wanchun Ni, Jiugeng Sun, Yixian Liu, Mennatallah El-Assady

Title: Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

Abstract:
Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross‑cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human‑like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red‑tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo‑chi.ivia.ch.

PaperID: 3966, https://arxiv.org/pdf/2604.12501.pdf

Authors: Ping Huang, Bin Duo, Ziedor Godfred, Liuwei Huo, Jin Ning, Xiaojun Yuan, Jun Li

Title: A Heterogeneous Dual-Network Framework for Emergency Delivery UAVs: Communication Assurance and Path Planning Coordination

Abstract:
Natural disasters often damage ground infrastructure, making unmanned aerial vehicles (UAVs) essential for emergency supply delivery. Yet safe operation in complex post‑disaster environments requires reliable command‑and‑control (C2) links; link instability can cause loss of control, delay rescue, and trigger severe secondary harm. To provide continuous three‑dimensional (3D) C2 coverage during dynamic missions, we propose a Heterogeneous Dual‑Network Framework (HDNF) for safe and reliable emergency delivery. HDNF tightly couples an Emergency Communication Support Network (ECSN), formed by hovering UAV base stations, with a Delivery Path Network (DPN), formed by fast‑moving delivery UAVs. The ECSN dynamically safeguards mission‑critical flight corridors, while the DPN aligns trajectories with reliable coverage regions. We formulate a joint optimization problem over task assignment, 3D UAV‑BS deployment, and DPN path planning to maximize end‑to‑end C2 reliability while minimizing UAV flight energy consumption and base‑station deployment cost. To solve this computationally intractable NP‑hard problem, we develop a layered strategy with three components: (i) a multi‑layer C2 service model that overcomes 2D‑metric limitations and aligns UAV‑BS deployment with mission‑critical 3D phases; (ii) a 3D coverage‑aware multi‑agent reinforcement learning algorithm that addresses the high‑dimensional search space and improves both training efficiency and topology resilience; and (iii) a 3D communication‑aware A planner that jointly optimizes C2 quality and flight energy, mitigating trajectory‑‑coverage mismatch and improving routing safety. Extensive simulations show that HDNF markedly improves C2 reliability, eliminates outages in critical phases, and sustains high task success rates while reducing hardware deployment cost.

PaperID: 3967, https://arxiv.org/pdf/2604.12461.pdf

Authors: Yongxuan Wu, Xixun Lin, He Zhang, Nan Sun, Kun Wang, Chuan Zhou, Shirui Pan, Yanan Cao

Title: CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Abstract:
LLM‑based Multi‑Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black‑box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents' reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM‑guided weak supervision. Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.

PaperID: 3968, https://arxiv.org/pdf/2604.12357.pdf

Authors: Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung

Title: ReflectCAP: Detailed Image Captioning with Reflective Memory

Abstract:
Detailed image captioning demands both factual grounding and fine‑grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note‑Guided Captioning (ReflectCAP), where a multi‑agent pipeline analyzes what the target large vision‑language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes ‑‑ what to avoid and what to attend to ‑‑ yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT‑4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade‑off between factuality and coverage, and delivers substantial gains on CapArena‑Auto, where generated captions are judged head‑to‑head against strong reference models. Moreover, ReflectCAP offers a more favorable trade‑off between caption quality and compute cost than model scaling or existing multi‑agent pipelines, which incur 21‑‑36% greater overhead. This makes high‑quality detailed captioning viable under real‑world cost and latency constraints.

PaperID: 3969, https://arxiv.org/pdf/2604.12310.pdf

Authors: Seiya Mitsuno, Midori Ban, Hiroshi Ishiguro, Yuichiro Yoshikawa

Title: Dialogue Agents that Share Family Information to Strengthen Grandparent-Grandchild Relationships

Abstract:
Social isolation among older adults has become a critical concern, as reduced opportunities for conversation and weakened family relationships negatively affect mental health. This study proposes a dialogue agent that supports older adults by fostering both a relationship with the agent and a relationship with their grandchild through sharing everyday information. The agent operates on a chatbot platform and engages in daily conversations with older adults and their grandchildren, exchanging information gathered from each party to enhance conversational engagement and social connection. We conducted a ten‑day empirical experiment with 52 grandparent‑grandchild pairs. The results suggest that older adults became more willing to interact with the proposed agent, which shared information about their grandchildren, and that the psychological connection between grandparents and grandchildren was strengthened. Furthermore, daily interactions with the agent were associated with reduced anxiety in both older adults and their grandchildren. These findings indicate that a dialogue agent that shares personal information can be an effective approach to supporting older adults by simultaneously offering conversational opportunities and promoting family connectedness. Overall, this study provides valuable insights into the design of dialogue agents that effectively address social isolation among older adults.

PaperID: 3970, https://arxiv.org/pdf/2604.12290.pdf

Authors: Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, Qinhuai Na

Title: Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Abstract:
Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search‑based question answering, often neglect the value of real‑world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier‑Eng, a human‑verified benchmark for generative optimization ‑‑ an iterative propose‑execute‑evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget ‑‑ spanning 47 tasks across five broad engineering categories. Unlike previous suites, Frontier‑Eng tasks are grounded in industrial‑grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while GPT 5.4 achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power‑law decay in improvement frequency (～ 1/iteration) and magnitude (～ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard‑won improvements under a fixed budget. Frontier‑Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open‑ended engineering problems.

PaperID: 3971, https://arxiv.org/pdf/2604.12285.pdf

Authors: Zhaofen Wu, Hanrong Zhang, Fulin Lin, Wujiang Xu, Xinran Xu, Yankai Chen, Henry Peng Zou, Shaowen Chen, Weizhi Zhang, Xue Liu, Philip S. Yu, Hongwei Wang

Title: GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

Abstract:
To sustain coherent long‑term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream‑based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to evolving narratives. To address this, we propose GAM, a hierarchical Graph‑based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long‑term consistency. Additionally, we introduce a graph‑guided, multi‑factor retrieval strategy to enhance context precision. Experiments on LoCoMo and LongDialQA indicate that our method consistently outperforms state‑of‑the‑art baselines in both reasoning accuracy and efficiency.

PaperID: 3972, https://arxiv.org/pdf/2604.12262.pdf

Authors: Raeyoung Chang, Dongwook Kwon, Jisoo Lee, Nikhil Verma

Title: CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

Abstract:
Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single‑model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under‑confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi‑agent deliberation directly at each tier's escalation boundary. Confidence‑based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus‑driven resolution of ambiguities internally without invoking higher‑cost upgrades. Our unified architecture alternates single‑model inference with selective multi‑agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test‑time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single‑model cascades and standalone multi‑agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real‑world distributions.

PaperID: 3973, https://arxiv.org/pdf/2604.12253.pdf

Authors: Shan Li, Juan Zheng

Title: A Scoping Review of Large Language Model-Based Pedagogical Agents

Abstract:
This scoping review examines the emerging field of Large Language Model (LLM)‑based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA‑ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM‑based agents spanning K‑12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain‑specific vs. general‑purpose), role complexity (single‑role vs. multi‑role), and system integration (standalone vs. integrated). Emerging trends include multi‑agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM‑based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.

PaperID: 3974, https://arxiv.org/pdf/2604.12250.pdf

Authors: Taisei Hishiki, Takaya Arita, Reiji Suzuki

Title: How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm

Abstract:
This study examines how model‑specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi‑agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two‑dimensional space and play the Prisoner's Dilemma with neighboring agents, by replacing its rule‑based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini‑2.0‑Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma~3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents' reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro‑level dynamics converge. These results suggest that model‑specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent‑Based Modeling, and provide a micro‑level cognitive account of the contradictions found in prior work on memory and cooperation.

PaperID: 3975, https://arxiv.org/pdf/2604.12198.pdf

Authors: Haonan Huang

Title: Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

Abstract:
Recent autonomous LLM agents have demonstrated end‑to‑end automation of machine‑learning research. Real‑world physical science is intrinsically harder, requiring deep reasoning bounded by physical truth and, because real systems are too complex to study in isolation, almost always built on existing literature. We focus on the smallest meaningful unit of such research, a mini research loop in which an agent reads a paper, reproduces it, critiques it, and extends it. We test this loop in two complementary regimes: scale and depth. At scale, across 111 open‑access computational physics papers, an agent autonomously runs the read‑plan‑compute‑compare loop and, without being asked to critique, raises substantive concerns on ~42% of papers ‑ 97.7% of which require execution to surface. In depth, for one Nature Communications paper on multiscale simulation of a 2D‑material MOSFET, the agent runs new calculations missing from the original and produces, unsupervised, a publishable Comment ‑‑ composed, figured, typeset, and PDF‑iterated ‑‑ that revises the paper's headline conclusion.

PaperID: 3976, https://arxiv.org/pdf/2604.12196.pdf

Authors: Manh Nguyen, Sunil Gupta, Hung Le

Title: Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Abstract:
Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface‑level majority agreement. Existing approaches, such as self‑consistency, rely on discrete voting, while probability‑based methods often fail to capture relationships among candidate answers or tend to underweight high‑quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training‑free method for best‑of‑N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency‑based, and probability‑based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black‑box settings. Extensive experiments across seven benchmarks covering short‑form QA and long‑form reasoning tasks, and five open‑weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop‑in replacement for majority voting in multi‑agent debate and exhibits strong robustness in black‑box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.

PaperID: 3977, https://arxiv.org/pdf/2604.12184.pdf

Authors: Gautama Shastry Bulusu Venkata, Santhosh Kakarla, Maheedhar Omtri Mohan, Aishwarya Gaddam

Title: TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification, and Logic-Aware Claim Reasoning

Abstract:
TRUST Agents is a collaborative multi‑agent framework for explainable fact verification and fake news detection. Rather than treating verification as a simple true‑or‑false classification task, the system identifies verifiable claims, retrieves relevant evidence, compares claims against that evidence, reasons under uncertainty, and generates explanations that humans can inspect. The baseline pipeline consists of four specialized agents. A claim extractor uses named entity recognition, dependency parsing, and LLM‑based extraction to identify factual claims. A retrieval agent performs hybrid sparse and dense search using BM25 and FAISS. A verifier agent compares claims with retrieved evidence and produces verdicts with calibrated confidence. An explainer agent then generates a human‑readable report with explicit evidence citations. To handle complex claims more effectively, we introduce a research‑oriented extension with three additional components: a decomposer agent inspired by LoCal‑style claim decomposition, a Delphi‑inspired multi‑agent jury with specialized verifier personas, and a logic aggregator that combines atomic verdicts using conjunction, disjunction, negation, and implication. We evaluate both pipelines on the LIAR benchmark against fine‑tuned BERT, fine‑tuned RoBERTa, and a zero‑shot LLM baseline. Although supervised encoders remain stronger on raw metrics, TRUST Agents improves interpretability, evidence transparency, and reasoning over compound claims. Results also show that retrieval quality and uncertainty calibration remain the main bottlenecks in trustworthy automated fact verification.

PaperID: 3978, https://arxiv.org/pdf/2604.12179.pdf

Authors: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

Title: AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Abstract:
Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine‑tuning and evaluating short‑ and long‑term memories remain difficult due to the absence of datasets that encode both short‑ and long‑term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI‑DialogGen, a modular agent‑based framework that generates persona‑grounded and topic‑guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic‑guided conversations from unstructured conversations. A QA module generates memory‑grounded Question Answer (QA) pairs drawn from short‑ and long‑term conversational histories. We also generated a new dataset entitled, TopicGuidedChat (TGC), where long‑term memory is encoded as speaker‑specific knowledge graphs and short‑term memory as newly generated topic‑guided conversations. Evaluations depict that AgenticAI‑DialogGen yields higher conversational quality and LLMs fine‑tuned on TGC dataset achieve improved performance on memory‑grounded QA tasks.

PaperID: 3979, https://arxiv.org/pdf/2604.12161.pdf

Authors: Tim Ellis-Caleo, Timothy Keyes, Nerissa Ambers, Faraah Bekheet, Wen-wai Yim, Nikesh Kotecha, Nigam H. Shah, Joel Neal

Title: Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board

Abstract:
Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI‑based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact‑based scoring rubrics. We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post‑deployment monitoring. We also validate the use of an LLM as a judge evaluation strategy for fact‑based scoring. This work is an example of integrating AI‑based workflows into routine clinical practice.

PaperID: 3980, https://arxiv.org/pdf/2604.12129.pdf

Authors: Swanand Rao, Kiran Kashalkar, Parvathi Somashekar, Priya Krishnan

Title: Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents

Abstract:
The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool‑using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization‑heavy instantiation models that impose significant latency and memory overhead. This paper introduces Aethon, a reference‑based replication primitive for near‑constant‑time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure. We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy‑on‑write semantics. We analyze its implications for complexity, scalability, multi‑agent orchestration, and enterprise governance. We argue that reference‑based instantiation is not merely an optimization, but a more appropriate systems abstraction for production‑scale agentic software. Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.

PaperID: 3981, https://arxiv.org/pdf/2604.12127.pdf

Authors: Anas Abognah, Otman Basir

Title: BLAST: Blockchain-based LLM-powered Agentic Spectrum Trading

Abstract:
The management of radio frequency spectrum is undergoing a paradigm shift from static, centralized command‑and‑control models to dynamic, market‑driven approaches. However, the realization of Dynamic Spectrum Management has been hindered by the lack of an automated, trustworthy, and intelligent coordination infrastructure that can operate without a central authority while preserving participant privacy. In this paper, we introduce BLAST (Blockchain‑based LLM‑powered Agentic Spectrum Trading), a comprehensive framework that integrates Large Language Model (LLM) Agents with a permissioned blockchain infrastructure to create a fully autonomous, private, and secure spectrum trading ecosystem. We propose a novel agent architecture that implements the Cognitive Radio cycle through a sequential decision pipeline (perceive, plan, act) enabling agents to reason strategically about economic value and market dynamics. We evaluate the framework through three distinct market mechanisms: Direct Sale, First‑Price Sealed‑Bid, and Second‑Price (Vickrey) Sealed‑Bid auctions. Experimental results demonstrate that the Second‑Price (Vickrey) auction is the optimal choice for maximizing social welfare and allocative efficiency, capturing up to 71% of the theoretical surplus by incentivizing truthful bidding. We also compare the proposed model against a baseline non‑LLM heuristic agentic model and show that utilizing LLM agents yields significant improvements in market competition, reduced wealth and asset concentration, and increased system welfare. Furthermore, we validate the system's privacy preservation, confirming that sensitive bid values remain isolated in private data collections while only cryptographic hashes are committed to the public ledger.

PaperID: 3982, https://arxiv.org/pdf/2604.12126.pdf

Authors: Rongzhe Wei, Ge Shi, Min Cheng, Na Zhang, Pan Li, Sarthak Ghosh, Vaibhav Gorde, Leman Akoglu

Title: Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

Abstract:
Large Language Models (LLMs) have significantly advanced tool‑augmented agents, enabling autonomous reasoning via API interactions. However, executing multi‑step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan‑level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long‑horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large‑scale API Toolkit for E‑commerce), a large‑scale context‑aware benchmark designed for the automated assessment of tool‑integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self‑correction and search efficiency. Motivated by these findings, we next propose Entropy‑Guided Branching (EGB), an uncertainty‑aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration‑exploitation trade‑off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool‑rich environments.

PaperID: 3983, https://arxiv.org/pdf/2604.12116.pdf

Authors: Shasha Yu, Fiona Carroll, Barry L. Bentley

Title: The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Abstract:
Large language models (LLMs) are increasingly deployed as tool‑augmented agents capable of executing system‑level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution‑layer be‑havioral measurement approach based on a two‑dimensional A‑R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coor‑dination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (di‑rect execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection‑based scaffolding often shifts configurations toward higher refusal in risk‑laden contexts, but redis‑tribution patterns differ structurally across models. The A‑R representation makes cross‑sectional behavioral profiles, scaffold‑induced transitions, and coordination variability directly observable. By foregrounding execution‑layer characterization over scalar ranking, this work provides a deployment‑oriented lens for analyzing and selecting tool‑enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.

PaperID: 3984, https://arxiv.org/pdf/2604.12066.pdf

Authors: Candace Walkington, Theodora Beauchamp, Fareya Ikram, Merve Koçyiğit Gürbüz, Fangli Xia, Margan Lee, Andrew Lan

Title: Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation

Abstract:
Large language models can increasingly adapt educational tasks to learners characteristics. In the present study, we examine a multi‑agent teacher‑in‑the‑loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine‑grained personalized elements of the real‑world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi‑agent systems for personalization that support teacher control are given.

PaperID: 3985, https://arxiv.org/pdf/2604.11969.pdf

Authors: Tarik Can Ozden, Sachidanand VS, Furkan Horoz, Ozgur Kara, Junho Kim, James Matthew Rehg

Title: Narrative-Driven Paper-to-Slide Generation via ArcDeck

Abstract:
We introduce ArcDeck, a multi‑agent framework that formulates paper‑to‑slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high‑level intent is preserved. These structural priors then guide an iterative multi‑agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper‑slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role‑specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.

PaperID: 3986, https://arxiv.org/pdf/2604.11950.pdf

Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang

Title: AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Abstract:
While recent LLM‑based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof‑of‑concept test, or simply a PoC ‑ such as a script, command sequence, or crafted input ‑ to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end‑to‑end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward‑hack by producing plausible but non‑functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi‑agent framework that (1) analyzes and fact‑checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re‑executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state‑of‑the‑art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true‑positive bug reports and rejects 9.8x more false‑positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.

PaperID: 3987, https://arxiv.org/pdf/2604.11945.pdf

Authors: Jiale Liu, Nanzhe Wang

Title: AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

Abstract:
High‑fidelity numerical simulation of subsurface flow is computationally intensive, especially for many‑query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise ‑ from architecture design to hyperparameter tuning ‑ that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. For this reason, we present AutoSurrogate, a large‑language‑model‑driven multi‑agent framework that enables practitioners without ML expertise to build high‑quality surrogates for subsurface flow problems through natural‑language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user‑specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural‑language sentence can be sufficient to produce a deployment‑ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO_2 saturation fields over 31 timesteps. Without any manual tuning, AutoSurrogate is able to outperform expert‑designed baselines and domain‑agnostic AutoML methods, demonstrating strong potential for practical deployment.

PaperID: 3988, https://arxiv.org/pdf/2604.11840.pdf

Authors: Sandro Andric

Title: When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Abstract:
Behavioral simulation and strategic problem solving are different tasks. Large language models are increasingly explored as agents in policy‑facing institutional simulations, but stronger reasoning need not improve behavioral sampling. We study this solver‑sampler mismatch in three multi‑agent negotiation environments: two trading‑limits scenarios with different authority structures and a grid‑curtailment case in emergency electricity management. Across two primary model families, native reasoning and often no reflection collapse toward authority‑heavy outcomes. The sharpest case is DeepSeek native reasoning in the grid‑curtailment transfer: it reaches action entropy 1.256 and a concession‑arc rate of 0.933, yet still ends in authority decision in 15 of 15 runs. A direct OpenAI extension shows the same pressure at provider breadth: GPT‑5.2 native reasoning ends in authority decisions in 45 of 45 runs across the three environments. Budget‑matched no‑reflection controls and orthogonal private‑state controls remain rigid, while the negotiation‑structured scaffold condition is the only condition that consistently opens negotiated outcomes. These diagnostics are failure screens within a fixed negotiation grammar, not evidence of external behavioral realism or policy‑forecasting validity. The results show that neither more output space nor generic extra private state rescues solver‑like sampler failure. For institutional simulation, solver strength and sampler qualification are different objectives: models should be evaluated for the behavioral role they are meant to play, not only for strategic capability.

PaperID: 3989, https://arxiv.org/pdf/2604.11790.pdf

Authors: Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun

Title: ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Abstract:
Tool‑augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi‑step real‑world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool‑returned content, which agents directly incorporate into their conversation history as trusted observations. To address these vulnerabilities, we introduce \textscClawGuard, a novel runtime security framework that enforces a user‑confirmed rule set at every tool‑call boundary, transforming unreliable alignment‑dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real‑world effect is produced. By automatically deriving task‑specific access constraints from the user's stated objective prior to any external tool invocation, \textscClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state‑of‑the‑art language models on six injection benchmarks covering web, local, MCP, and skill channels, as well as three utility benchmarks covering OS, web, and code tasks, demonstrate that \textscClawGuard achieves robust protection against indirect prompt injection without compromising agent utility or introducing significant token overhead. This work establishes deterministic tool‑call boundary enforcement as an effective defense mechanism for secure agentic AI systems. Code is publicly available at github.com/Claw‑Guard/ClawGuard/.

PaperID: 3990, https://arxiv.org/pdf/2604.11786.pdf

Authors: Jiayuan Rao, Tianlin Gui, Haoning Wu, Yanfeng Wang, Weidi Xie

Title: GenTac: Generative Modeling and Forecasting of Soccer Tactics

Abstract:
Modeling open‑play soccer tactics is a formidable challenge due to the stochastic, multi‑agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set‑pieces, fundamentally failing to capture the inherent variance and branching possibilities of real‑world match evolution. Here, we introduce GenTac, a diffusion‑based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi‑player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long‑horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15‑class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A‑League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.

PaperID: 3991, https://arxiv.org/pdf/2604.11767.pdf

Authors: Qin Liu

Title: $λ_A$: A Typed Lambda Calculus for LLM Agent Composition

Abstract:
Existing LLM agent frameworks lack formal semantics: there is no principled way to determine whether an agent configuration is well‑formed or will terminate. We present λ_A, a typed lambda calculus for agent composition that extends the simply‑typed lambda calculus with oracle calls, bounded fixpoints (the ReAct loop), probabilistic choice, and mutable environments. We prove type safety, termination of bounded fixpoints, and soundness of derived lint rules, with full Coq mechanization (1,519 lines, 42 theorems, 0 Admitted). As a practical application, we derive a lint tool that detects structural configuration errors directly from the operational semantics. An evaluation on 835 real‑world GitHub agent configurations shows that 94.1% are structurally incomplete under λ_A, with YAML‑only lint precision at 54%, rising to 96‑‑100% under joint YAML+Python AST analysis on 175 samples. This gap quantifies, for the first time, the degree of semantic entanglement between declarative configuration and imperative code in the agent ecosystem. We further show that five mainstream paradigms (LangGraph, CrewAI, AutoGen, OpenAI SDK, Dify) embed as typed λ_A fragments, establishing λ_A as a unifying calculus for LLM agent composition.

PaperID: 3992, https://arxiv.org/pdf/2604.11751.pdf

Authors: Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh

Title: Grounded World Model for Semantically Generalizable Planning

Abstract:
In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision‑language‑aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM‑based VLAs in semantic generalization. On the proposed WISER benchmark, GWM‑MPC achieves a 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

PaperID: 3993, https://arxiv.org/pdf/2604.11741.pdf

Authors: Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li

Title: Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Abstract:
Vision‑language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi‑hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi‑agent framework for evaluating and synthesizing high‑quality, role‑driven multiplayer game scripts, enabling fine‑grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi‑hop reasoning chains, through coordinated agent interactions. We design a two‑stage agent‑monitored training strategy to enhance the reasoning ability of VLMs: (1) chain‑of‑thought based fine‑tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO‑based reinforcement learning with agent‑monitored reward shaping, encouraging the model to develop character‑specific reasoning behaviors and effective multimodal multi‑hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception‑resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi‑hop reasoning under imperfect information.

PaperID: 3994, https://arxiv.org/pdf/2604.11734.pdf

Authors: Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, and Jun Ma

Title: SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

Abstract:
Cooperative driving is a safety‑ and efficiency‑critical task that requires the coordination of diverse, interaction‑realistic multi‑agent trajectories. Although existing diffusion‑based methods can capture multimodal behaviors from demonstrations, they often exhibit weak scene consistency and poor alignment with closed‑loop cooperative objectives. This makes post‑training necessary for further improvement, yet achieving stable online post‑training in reactive multi‑agent environments remains challenging. In this paper, we propose SCORP, a scene‑consistent multi‑agent diffusion planner with stable online reinforcement learning (RL) post‑training for cooperative driving. For pre‑training, we develop a scene‑conditioned multi‑agent denoising architecture that couples inter‑agent self‑attention with a dual‑path conditioning mechanism: cross‑attention provides direct scene‑information injection, while AdaLN‑Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post‑training, we formulate a two‑layer Markov decision process (MDP) that explicitly integrates the reverse denoising chain with policy‑environment interaction. We further co‑design dense, well‑shaped planning rewards and variance‑gated group‑relative policy optimization (VG‑GRPO) to mitigate advantage collapse and gradient instability during closed‑loop training. Extensive experiments show that SCORP outperforms strong open‑source baselines on WOMD, with 10.47%‑28.26% and 1.70%‑7.22% improvements in core safety and efficiency metrics, respectively. Moreover, compared with alternative post‑training methods, SCORP delivers significant and consistent gains in both driving safety and traffic efficiency, highlighting stable and sustained advances in closed‑loop cooperative driving.

PaperID: 3995, https://arxiv.org/pdf/2604.11721.pdf

Authors: Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin

Title: Evaluating Cooperation in LLM Social Groups through Elected Leadership

Abstract:
Governing common‑pool resources requires agents to develop enduring strategies through cooperation and self‑governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi‑agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi‑agent simulation with LLMs. We present our open‑source framework that simulates leadership through elected personas and candidate‑driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi‑agent systems toward navigating complex social dilemmas.

PaperID: 3996, https://arxiv.org/pdf/2604.11661.pdf

Authors: Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

Title: Towards Autonomous Mechanistic Reasoning in Virtual Cells

Abstract:
Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open‑ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR‑Agent, a multi‑agent framework that integrates biologically grounded knowledge retrieval with a verifier‑based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC‑TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe‑100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi‑agent and rigorous verification.

PaperID: 3997, https://arxiv.org/pdf/2604.11655.pdf

Authors: Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani

Title: RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

Abstract:
The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open‑ended Role‑Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long‑term narrative stability. This paper introduces RPA‑Check, a multi‑stage automated evaluation framework designed to objectively assess the performance of LLM‑based RPAs in complex, constraints‑heavy environments. Our methodology is based on a four‑step pipeline: (1) Dimension Definition, establishing high‑level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM‑as‑a‑Judge Evaluation, which employs chain‑of‑thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade‑offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction‑tuned models (8‑9B) can outperform larger architectures prone to user‑alignment bias or sycophancy. RPA‑Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

PaperID: 3998, https://arxiv.org/pdf/2604.11640.pdf

Authors: Kangyi Lu, Lan Wei, Zongcai Tan, Dandan Zhang

Title: Micro-Dexterity in Biological Micromanipulation: Embodiment, Perception, and Control

Abstract:
Microscale manipulation has advanced substantially in controlled locomotion and targeted transport, yet many biomedical applications require precise and adaptive interaction with biological micro‑objects. At these scales, manipulation is realized through three main classes of platforms: embodied microrobots that physically interact as mobile agents, field‑mediated systems that generate contactless trapping or manipulation forces, and externally actuated end‑effectors that interact through remotely driven physical tools. Unlike macroscale manipulators, these systems function in fluidic, confined, and surface‑dominated environments characterized by negligible inertia, dominant interfacial forces, and soft, heterogeneous, and fragile targets. Consequently, classical assumptions of dexterous manipulation, including rigid‑body contact, stable grasping, and rich proprioceptive feedback, become difficult to maintain. This review introduces micro‑dexterity as a framework for analyzing biological micromanipulation through the coupled roles of embodiment, perception, and control. We examine how classical manipulation primitives, including pushing, reorientation, grasping, and cooperative manipulation, are reformulated at the microscale; compare the architectures that enable them, from contact‑based micromanipulators to contactless field‑mediated systems and cooperative multi‑agent platforms; and review the perception and control strategies required for task execution. We identify the current dexterity gap between laboratory demonstrations and clinically relevant biological manipulation, and outline key challenges for future translation.

PaperID: 3999, https://arxiv.org/pdf/2604.11563.pdf

Authors: Artem Gadzhiev, Andrew Kislov

Title: Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Abstract:
Providing AI agents with reliable long‑term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents ‑‑ sliding windows, summarization, embedding‑based RAG, and flat fact extraction ‑‑ each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius‑Mem, a brain‑inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius‑Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius‑Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius‑Mem reduces token consumption by ~5x compared to full‑context replay while achieving higher accuracy. Synthius‑Mem achieves state‑of‑the‑art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human‑level performance and reports adversarial robustness.

PaperID: 4000, https://arxiv.org/pdf/2604.11548.pdf

Authors: Ningyan Zhu, Huacan Wang, Jie Zhou, Feiyu Chen, Shuo Zhang, Ge Chen, Chen Liu, Jiarou Wu, Wangyi Chen, Xiaofeng Mou, Yi Xu

Title: SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

Abstract:
The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi‑step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering‑designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production‑reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human‑agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open‑source multi‑agent application framework that addresses these shifts by taking a step towards general‑purpose personal AI agents through harness engineering. Our primary contributions include a DAG‑based two‑phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three‑tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.

PaperID: 4001, https://arxiv.org/pdf/2604.11523.pdf

Authors: Minjun Park, Donghyun Kim, Hyeonjong Ju, Seungwon Lim, Dongwook Choi, Taeyoon Kwon, Minju Kim, Jinyoung Yeo

Title: PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

Abstract:
We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi‑agent collaboration under privacy constraints remain poorly understood. In this work, we present PAC\text‑Bench, a benchmark for systematic evaluation of multi‑agent collaboration under privacy constraints. Experiments on PAC\text‑Bench show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early‑stage privacy violations, overly conservative abstraction, and privacy‑induced hallucinations. Together, our findings identify privacy‑aware multi‑agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.

PaperID: 4002, https://arxiv.org/pdf/2604.11518.pdf

Authors: Jinhua Wang, Biswa Sengupta

Title: From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

Abstract:
Cross‑language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM‑assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE‑bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal‑Bench versus Rust's 47.5%, confirming near‑parity on real‑world agentic tasks; (2) benchmark‑driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM‑assisted diff‑translate‑test loop; and (4) the Python port has evolved into a capability superset with 30 feature‑flagged extensions (multi‑agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM‑based agents where API latency dominates, Python's expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark‑as‑objective‑function methodology provides a principled framework for growing a cross‑language port from parity into an extended platform.

PaperID: 4003, https://arxiv.org/pdf/2604.11477.pdf

Authors: Kun Liu, Liqun Chen

Title: OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

Abstract:
The alignment of Multi‑Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution‑based environments suffer from adversarial "Test Evasion" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: Out‑of‑Money Reinforcement Learning (OOM‑RL). By deploying agents into the non‑stationary, high‑friction reality of live financial markets, we utilize critical capital depletion as an un‑hackable negative gradient. Our longitudinal 20‑month empirical study (July 2024 ‑‑ February 2026) chronicles the system's evolution from a high‑turnover, sycophantic baseline to a robust, liquidity‑aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the Strict Test‑Driven Agentic Workflow (STDAW), which enforces a Byzantine‑inspired uni‑directional state lock (RO‑Lock) anchored to a deterministically verified \geq 95% code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM‑RL‑aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high‑stakes, real‑world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint

PaperID: 4004, https://arxiv.org/pdf/2604.11466.pdf

Authors: Juhoon Lee, Joseph Seering

Title: SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation

Abstract:
Large Language Model (LLM) agents offer a potentially‑transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the "stopped clock" problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible. Because the internal reasoning of LLMs is opaque, verifying the "black box" of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern‑Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.

PaperID: 4005, https://arxiv.org/pdf/2604.11465.pdf

Authors: S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos

Title: Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Abstract:
Large language model (LLM) agents show promise on realistic tool‑use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference‑time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi‑step environments. Operating on a single 24GB GPU, we evaluate Qwen3‑8B on the AppWorld benchmark under both full‑precision and 4‑bit quantized configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three‑tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty‑1 tasks (15.8% to 26.3% FP16; 5.3% to 14.0% AWQ). On full‑precision inference, our scaffolded 8B model surpasses DeepSeek‑Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference‑time interventions can make small models competitive with systems 4 times their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test‑time compute scaling and action‑space shaping in reinforcement learning.

PaperID: 4006, https://arxiv.org/pdf/2604.11462.pdf

Authors: Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li

Title: Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

Abstract:
Large Language Models (LLMs) struggle with long‑horizon tasks due to the "context bottleneck" and the "lost‑in‑the‑middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi‑turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini‑3.0‑flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT‑4o, providing a scalable and computationally efficient paradigm for autonomous long‑horizon agents.

PaperID: 4007, https://arxiv.org/pdf/2604.11378.pdf

Authors: Hu Wei

Title: From Agent Loops to Structured Graphs:A Scheduler-Theoretic Framework for LLM Agent Execution

Abstract:
The dominant paradigm for building LLM based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single ready unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version, planning execution and recovery are separated into three layers, and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. Our contributions are fourfold: a scheduler unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non deterministic LLM nodes; a trade off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.

PaperID: 4008, https://arxiv.org/pdf/2604.11353.pdf

Authors: Beniamino Di Lorenzo, Gian Carlo Maffettone, Mario di Bernardo

Title: Leader-Follower Density Control of Multi-Agent Systems with Interacting Followers: Feasibility and Convergence Analysis

Abstract:
We address density control problems for large‑scale multi‑agent systems in leader‑follower settings, where a group of controllable leaders must steer a population of followers toward a desired spatial distribution. Unlike prior work, we explicitly account for follower‑follower interactions, capturing realistic behaviors such as flocking and collision avoidance. Within a macroscopic framework based on partial differential equations governing the density dynamics, we derive (i) necessary and sufficient feasibility conditions linking the target distribution to interaction strength, diffusion, and leader mass, and (ii) a feedback control law guaranteeing local stability with an explicit estimate of the basin of attraction. Our analysis reveals sharp feasibility thresholds, phase transitions beyond which no control effort can achieve the desired configuration. Numerical simulations in one‑ and two‑dimensional domains validate the theoretical results at the macroscopic level, and agent‑based simulations on finite populations confirm the practical deployability of the proposed framework.

PaperID: 4009, https://arxiv.org/pdf/2604.11337.pdf

Authors: Anbang Ruan

Title: Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies

Abstract:
The dominant paradigm of local multi‑agent systems ‑‑ orchestrated, enterprise‑bounded pipelines ‑‑ is being superseded by internet‑wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons' AGIL framework ‑‑ four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy ‑‑ we derive a prescriptive sixteen‑cell institutional architecture for internet‑wide agent governance. Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub‑function analysis (64 binary indicators across 16 cells), we find at most 19% sub‑function coverage (sensitivity range 17‑30%) ‑‑ potential rather than operative capacity, since zero inter‑cell coordination prevents existing infrastructure from participating in inter‑pillar interchange. A complementary interchange media assessment finds zero of twelve inter‑pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent‑native protocol stack (MCP, A2A, ANP, x402, ERC‑8004), independent development teams reproduce the same structural pattern ‑‑ confirming the governance gap is a feature of market‑driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.

PaperID: 4010, https://arxiv.org/pdf/2604.11312.pdf

Authors: Erica Cau, Andrea Failla, Giulio Rossetti

Title: Network Effects and Agreement Drift in LLM Debates

Abstract:
Large Language Models (LLMs) have demonstrated an unprecedented ability to simulate human‑like social behaviors, making them useful tools for simulating complex social systems. However, it remains unclear to what extent these simulations can be trusted to accurately capture key social mechanisms, particularly in highly unbalanced contexts involving minority groups. This paper uses a network generation model with controlled homophily and class sizes to examine how LLM agents behave collectively in multi‑round debates. Moreover, our findings highlight a particular directional susceptibility that we term agreement drift, in which agents are more likely to shift toward specific positions on the opinion scale. Overall, our findings highlight the need to disentangle structural effects from model biases before treating LLM populations as behavioral proxies for human groups.

PaperID: 4011, https://arxiv.org/pdf/2604.11270.pdf

Authors: Islem Bouzenia, Cristian Cadar, Michael Pradel

Title: Evaluating LLM Agents on Automated Software Analysis Tasks

Abstract:
Numerous software analysis tools exist today, yet applying them to diverse open‑source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM‑based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool‑specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool‑project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini‑3‑Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole‑program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM‑self‑validated success consistently overstates manually verified success.

PaperID: 4012, https://arxiv.org/pdf/2604.11258.pdf

Authors: Zhixiang Lu, Jionglong Su

Title: Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

Abstract:
Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain‑of‑Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic‑Med, a multi‑agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic‑Med orchestrates a dynamic interplay between three role‑specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the Proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC‑CXR‑VQA, VQA‑RAD, and PathVQA demonstrate that Dialectic‑Med not only achieves state‑of‑the‑art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single‑agent baselines.

PaperID: 4013, https://arxiv.org/pdf/2604.11248.pdf

Authors: Uljad Berdica, Jakob Foerster, Frank Hutter, Arber Zela

Title: Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training

Abstract:
The generation of sustained, open‑ended complexity from local interactions remains a fundamental challenge in artificial life. Differentiable multi‑agent systems, such as Petri Dish Neural Cellular Automata (PD‑NCA), exhibit rich self‑organization driven purely by spatial competition; however, they are highly sensitive to hyperparameters and frequently collapse into uninteresting patterns and dynamics, such as frozen equilibria or structureless noise. In this paper, we introduce PBT‑NCA, a meta‑evolutionary algorithm that evolves a population of PD‑NCAs subject to a composite objective that rewards both historical behavioral novelty and contemporary visual diversity. Driven by this continuous evolutionary pressure, PBT‑NCA spontaneously generates a plethora of emergent lifelike phenomena over extended horizons‑a hallmark of true open‑endedness. Strikingly, the substrate autonomously discovers diverse morphological survival and self‑organization strategies. We observe highly regular, coordinated periodic waves; spore‑like scattering where homogeneous groups eject cell‑like clusters to colonize distant territories; and fluid, shape‑shifting macro‑structures that migrate across the substrate, maintaining stable outer boundaries that enclose highly active interiors. By actively penalizing monocultures and dead states, PBT‑NCA sustains a state of effective complexity that is neither globally ordered nor globally random, operating persistently at the "edge of chaos".

PaperID: 4014, https://arxiv.org/pdf/2604.11204.pdf

Authors: Jianfeng Xu

Title: Semantic Rate-Distortion Theory: Deductive Compression and Closure Fidelity

Abstract:
Shannon's rate‑distortion theory treats source symbols as unstructured labels. When the source is a knowledge base equipped with a logical proof system, a natural fidelity criterion is closure fidelity: a reconstruction is acceptable if it preserves the deductive closure of the original. This paper develops a rate‑distortion theory under this criterion. Central to the theory is the irredundant core‑a canonical generating set extracted by a fixed‑order deletion procedure, from which the full deductive closure can be rederived. We prove that the zero‑distortion semantic rate equals a quantity that is strictly below the classical entropy rate whenever the knowledge base contains redundant states. More generally, the full semantic rate‑distortion function depends only on the core; redundant states are invisible to both rate and distortion. We derive a semantic source‑channel separation theorem showing a semantic leverage phenomenon: under closure fidelity, the required source rate is reduced by an asymptotic leverage factor greater than one, allowing the same knowledge base to be communicated with proportionally fewer channel uses‑not by violating Shannon capacity, but because redundant states become free. We also prove a strengthened Fano inequality that exploits core structure. For heterogeneous multi‑agent communication, an overlap decomposition gives necessary and sufficient conditions for closure‑reliable transmission and identifies a semantic bottleneck in broadcast settings that persists even over noiseless channels. All results are verified on Datalog instances with up to 24,000 base facts.

PaperID: 4015, https://arxiv.org/pdf/2604.11201.pdf

Authors: CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu

Title: CocoaBench: Evaluating Unified Digital Agents in the Wild

Abstract:
LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human‑designed, long‑horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.

PaperID: 4016, https://arxiv.org/pdf/2604.11161.pdf

Authors: Han Wua, Lishan Zhang, Chunming Lu

Title: A Simulation-Based Method for Testing Collaborative Learning Scaffolds Using LLM-Based Multi-Agent Systems

Abstract:
Background: Traditional research on collaborative learning scaffolding is often time‑consuming and resource‑heavy, which hinders the rapid iteration and optimization of instructional strategies. LLM‑based multi‑agent systems have recently emerged as a powerful tool to simulate complex social interactions and provide a novel paradigm for educational research. Objectives: This study proposes an LLM‑based multi‑agent simulation approach to investigate collaborative learning processes and the effectiveness of instructional scaffolds prior to actual classroom deployment. The research specifically examines the feasibility of simulating group discussions and the alignment of these simulations with established learning science theories. Methods: The simulation system was implemented using the MetaGPT framework and GPT‑4o, comprising one teacher agent and five distinct student roles (Leader, Supporter, Expounder, Rebutter, and Summarizer). Two scaffolding strategies, "Deep Think before Speak" and "Direct Speak", were compared across ten classical Chinese poetry appreciation tasks. Evaluation was conducted through discourse analysis of quality and behavior. Results and Conclusions: The introduction of the "Deep Think before Speak" scaffold significantly improved the agents' discourse diversity and interaction depth while notably reducing content repetitiveness. Behavioral analysis showed that the scaffold encouraged more complex interaction patterns, such as reflecting, rebutting, and explaining. These findings align with the ICAP framework, as the scaffold prompted agents to move from simple "Active" participation to "Constructive" and "Interactive" knowledge co‑construction. This study demonstrates the feasibility and ecological validity of using LLM‑based multi‑agent systems to simulate authentic collaborative learning dynamics.

PaperID: 4017, https://arxiv.org/pdf/2604.11131.pdf

Authors: Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli, Rajkumar Buyya

Title: MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

Abstract:
Reinforcement learning (RL) is one of the most practical ways to learn from real‑life use‑cases. Motivated from the cognitive methods used by humans makes it a widely acceptable strategy in the field of artificial intelligence. Most of the environments used for RL are often high‑dimensional, and traditional RL algorithms becomes computationally expensive and challenging to effectively learn from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) is seeking significant traction over the past few years. However, the current state of quantum hardware is not enough to cater for such high‑dimensional environments with complex multi‑agent setup. To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on cooperative‑pong environment and our results indicate ~10% improvement from other distribution strategies, and ~5% improvement from classical models of policy representation.

PaperID: 4018, https://arxiv.org/pdf/2604.11103.pdf

Authors: Xi Chen, Wei Xue, Yike Guo

Title: ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

Abstract:
Role‑playing has garnered rising attention as it provides a strong foundation for human‑machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role‑playing. To bridge this gap, we conceptualize and benchmark speech role‑playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role‑Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprises Utterance‑Level content with 7,653 utterances, Scene‑Level content with 313 scenes, and Role‑Level content with 6 roles. (3) ActorMind is an off‑the‑shelf, multi‑agent, chain‑of‑though style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with corresponding emotion state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role‑playing.

PaperID: 4019, https://arxiv.org/pdf/2604.11045.pdf

Authors: Huacan Wang, Jie Zhou, Ningyan Zhu, Shuo Zhang, Feiyu Chen, Jiarou Wu, Ge Chen, Chen Liu, Wangyi Chen, Xiaofeng Mou, Yi Xu

Title: Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

Abstract:
AI coding agents have become central to developer workflows, yet every existing solution locks its reasoning capabilities within a specific delivery form, such as a CLI, IDE plugin, or web application. This limitation creates systemic barriers when enterprises attempt to reuse these capabilities across heterogeneous engineering environments. To address this challenge, we present Sema Code, an open AI coding framework built on the principle of being embeddable, pluggable, and framework‑first. Sema Code completely decouples the core agent engine from all client layers, publishing it as a standalone npm library that any runtime can drive programmatically. Built around this architecture, we designed eight key mechanisms: multi‑tenant engine isolation, FIFO input queuing with safe session reconstruction, adaptive context compression, multi‑agent collaborative scheduling, intelligent Todo‑based process management, four‑layer asynchronous permission control, three‑tier ecosystem integration spanning MCP, Skills, and Plugins, and a background task framework with separated execution and observation privileges. These mechanisms collectively address the engineering challenges of transforming a complex agent engine into a shared, programmable core. Demonstrating its architectural versatility, the same Sema Core engine simultaneously powers a VSCode extension and a multi‑channel messaging gateway, which we name SemaClaw, to unify agent interactions across platforms such as Telegram and Feishu. These represent two fundamentally different product forms sharing an identical reasoning kernel, differing only at the client layer.

PaperID: 4020, https://arxiv.org/pdf/2604.10992.pdf

Authors: Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu

Title: ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation

Abstract:
Parametric Computer‑Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi‑part, movable models from high‑level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training‑free multi‑agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high‑quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross‑stage rollback mechanism that accurately isolates and corrects design‑ and code‑level errors. Additionally, a self‑evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD‑Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement‑driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.

PaperID: 4021, https://arxiv.org/pdf/2604.10989.pdf

Authors: Shixing Zhao, Zheng Si, Pengpeng Ouyang, Zhengqing Hu, Wanqi Zhu, Dong Chen, Yibo Guo, Mingliang Xu

Title: MAFIG: Multi-agent Driven Formal Instruction Generation Framework

Abstract:
Emergency situations in scheduling systems often trigger local functional failures that undermine system stability and even cause system collapse. Existing methods primarily rely on robust scheduling or reactive scheduling, handling emergencies through predefined rules or rescheduling strategies. However, the diversity and unpredictability of real‑world emergencies make them difficult to anticipate, which limits the adaptability of these methods in complex scenarios. Recent studies have shown that Large Language Models (LLMs) possess strong potential for complex scheduling tasks because of their extensive prior knowledge and strong reasoning capabilities. Nevertheless, the high inference latency of LLMs and the lengthy contextual information of scheduling systems significantly hinder their application for emergency handling. To mitigate these issues, we propose the Multi‑agent Driven Formal Instruction Generation Framework (MAFIG). The framework constrains the decision scope to local functional modules affected by emergency situations and repairs scheduling logic rapidly by generating formal instructions. MAFIG contains a Perception Agent and an Emergency Decision Agent, which mitigates the adverse impact of lengthy system contexts on emergency decision‑making. We further introduce span‑focused loss‑driven local distillation mechanism (SFL) to transfer the decision‑making capability of powerful Cloud Large Language Models (C‑LLMs) to lightweight local models, reducing inference latency while preserving decision‑making effectiveness. Experiments in the Port, Warehousing, and Deck scheduling datasets show success rates of 98.49%, 94.97%, and 97.50%, with average processing times of 0.33 s, 0.23 s, and 0.19 s. These results demonstrate that MAFIG effectively mitigates the impact of emergencies and improves the robustness and adaptability of scheduling systems.

PaperID: 4022, https://arxiv.org/pdf/2604.10911.pdf

Authors: Chongliu Jia, Yi Luo, Sipeng Han, Pengwei Li, Jie Ding, Youshuang Hu, Yimiao Qian, Qiya Wang

Title: EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

Abstract:
Medium‑ to long‑horizon equity allocation is challenging due to weak predictive structure, non‑stationary market regimes, and the degradation of signals under realistic trading constraints. Conventional approaches often rely on single predictors or loosely coupled pipelines, which limit robustness under distributional shift. This paper proposes EvoNash‑MARL, a closed‑loop framework that integrates reinforcement learning with population‑based policy optimization and execution‑aware selection to improve robustness in medium‑ to long‑horizon allocation. The framework combines multi‑agent policy populations, game‑theoretic aggregation, and constraint‑aware validation within a unified walk‑forward design. Under a 120‑window walk‑forward protocol, the final configuration achieves the highest robust score among internal baselines. On out‑of‑sample data from 2014 to 2024, it delivers a 19.6% annualized return, compared to 11.7% for SPY, and remains stable under extended evaluation through 2026. While the framework demonstrates consistent performance under realistic constraints and across market settings, strong global statistical significance is not established under White's Reality Check (WRC) and SPA‑lite tests. The results therefore provide evidence of improved robustness rather than definitive proof of superior market timing performance.

PaperID: 4023, https://arxiv.org/pdf/2604.10825.pdf

Authors: Zacharie Bugaud

Title: CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Abstract:
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T‑maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non‑match to sample), spanning six cognitive dimensions. Each task is grounded in peer‑reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task‑specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open‑weight LLMs (3B to 72B parameters) on text‑based ASCII renderings and compare against both a random baseline and a graph‑based reinforcement learning agent. Our best model (Qwen2.5‑VL‑7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain‑of‑thought prompting hurts rather than helps, and (4) a vision‑language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent‑plus‑interface system, not the model in isolation. Under this unified zero‑shot ASCII protocol, current open‑weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within‑trial state tracking.

PaperID: 4024, https://arxiv.org/pdf/2604.10599.pdf

Authors: Mamdouh Alenezi

Title: Rethinking Software Engineering for Agentic AI Systems

Abstract:
The rapid proliferation of large language models (LLMs) and agentic AI systems has created an unprecedented abundance of automatically generated code, challenging the traditional software engineering paradigm centered on manual authorship. This paper examines whether the discipline should be reoriented around orchestration, verification, and human‑AI collaboration, and what implications this shift holds for education, tools, processes, and professional practice. Drawing on a structured synthesis of relevant literature and emerging industry perspectives, we analyze four key dimensions: the evolving role of the engineer in agentic workflows, verification as a critical quality bottleneck, observed impacts on productivity and maintainability, and broader implications for the discipline. Our analysis indicates that code is transitioning from a scarce, carefully crafted artifact to an abundant and increasingly disposable commodity. As a result, software engineering must reorganize around three core competencies: effective orchestration of multi‑agent systems, rigorous verification of AI‑generated outputs, and structured human‑AI collaboration. We propose a conceptual framework outlining the transformations required across curricula, development tooling, lifecycle processes, and governance models. Rather than diminishing the role of engineers, this shift elevates their responsibilities toward system‑level design, semantic validation, and accountable oversight. The paper concludes by highlighting key research challenges, including verification‑first lifecycles, prompt traceability, and the long‑term evolution of the engineering workforce.

PaperID: 4025, https://arxiv.org/pdf/2604.10589.pdf

Authors: Pablo de los Riscos, Fernando J. Corbacho, Michael A. Arbib

Title: Working Paper: Towards Schema-based Learning from a Category-Theoretic Perspective

Abstract:
We introduce a hierarchical categorical framework for Schema‑Based Learning (SBL) structured across four interconnected levels. At the schema level, a free multicategory Sch_syn encodes fundamental schemas and transformations. An implementation functor \mathcalI maps syntactic schemas to representational languages, inducing via the Grothendieck construction the total category Sch_impl. Implemented schemas are mapped by a functor Model into the Kleisli category \mathbfKL(G) of the Giry monad, yielding probabilistic models, while an instances presheaf assigns evaluated instance spaces. A semantic category Sch_sem, defined as a full subcategory of \mathbfKL(G), provides semantic grounding through an interpretation functor from Sch_impl. At the agent level, Sch_impl is equipped with a duoidal structure \mathcalO_Sch supporting schema‑based workflows. A left duoidal action on the category Mind enables workflow execution over mental objects, whose components include mental spaces, predictive models, and a cognitive kernel composed of memory and cognitive modules. Each module is specified by schema‑typed interfaces, duoidal workflows, a success condition, and a logical signature. Memory is formalized categorically via memory subsystems, a presheaf Data_M, a monoidal operation category Ops_M, and read/write natural transformations. Together with the Body category, Mind defines the embodied SBL agent. At higher levels, SBL is represented as an object of the agent architecture category ArchCat, enabling comparison with heterogeneous paradigms, while the World category models multi‑agent and agent‑environment interactions. Altogether, the framework forms a weak hierarchical n‑categorical structure linking schema semantics, cognition, embodiment, architectural abstraction, and world‑level interaction.

PaperID: 4026, https://arxiv.org/pdf/2604.10577.pdf

Authors: Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

Title: The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Abstract:
Computer‑use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS‑BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human‑crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment‑embedded threats and agent‑initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety‑aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi‑agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re‑engages during subsequent execution. In multi‑agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety‑aligned models to fail. We will release our OS‑BLIND to encourage the broader research community to further investigate and address these safety challenges.

PaperID: 4027, https://arxiv.org/pdf/2604.10516.pdf

Authors: Xinyi Huang

Title: Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

Abstract:
Selecting the right knowledge is critical when using large language models (LLMs) to solve domain‑specific data analysis tasks. However, most retrieval‑augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task‑critical knowledge needed for multi‑step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure‑Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function‑call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task‑relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM‑based code generation. Experiments on multi‑step data analysis benchmarks show that SGKR consistently improves solution correctness over no‑retrieval and similarity‑based retrieval baselines for both vanilla LLMs and coding agents.

PaperID: 4028, https://arxiv.org/pdf/2604.10513.pdf

Authors: Roi Ben-Gigi, Yuval David, Fabiana Fournier, Lior Limonad, Dany Moshkovich, Hadar Mulian, Segev Shlomov

Title: Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

Abstract:
AI agent development relies heavily on natural language prompting to define agents' tasks, knowledge, and goals. These prompts are interpreted by Large Language Models (LLMs), which govern agent behavior. Consequently, agentic performance is susceptible to variability arising from imprecise or ambiguous prompt formulations. Identifying and correcting such issues requires examining not only the agent's code, but also the internal system prompts generated throughout its execution lifecycle, as reflected in execution logs. In this work, we introduce an analytics pipeline implemented as part of the Agent Mentor open‑source library that monitors and incrementally adapts the system prompts defining another agent's behavior. The pipeline improves performance by systematically injecting corrective instructions into the agent's knowledge. We describe its underlying mechanism, with particular emphasis on identifying semantic features associated with undesired behaviors and using them to derive corrective statements. We evaluate the proposed pipeline across three exemplar agent configurations and benchmark tasks using repeated execution runs to assess effectiveness. These experiments provide an initial exploration of automating such a mentoring pipeline within future agentic governance frameworks. Overall, the approach demonstrates consistent and measurable accuracy improvements across diverse configurations, particularly in settings dominated by specification ambiguity. For reproducibility, we released our code as open source under the Agent Mentor library.

PaperID: 4029, https://arxiv.org/pdf/2604.10480.pdf

Authors: Yu Li, Xiaoran Shang, Qizhi Pei, Yun Zhu, Xin Gao, Honglin Lin, Zhanping Zhong, Zhuoshi Pan, Zheng Liu, Xiaoyang Wang, Conghui He, Dahua Lin, Feng Zhao, Lijun Wu

Title: Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Abstract:
Post‑training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi‑agent framework to reconstruct the evolutionary graph of dataset development. Through large‑scale lineage analysis, we characterize domain‑specific structural patterns, such as vertical refinement in math‑oriented datasets and horizontal aggregation in general‑domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage‑aware diversity‑oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post‑training corpus. We further highlight lineage‑centric analysis as an efficient and robust topological alternative to sample‑level dataset comparison for large‑scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post‑training data curation toward a more systematic and controllable paradigm.

PaperID: 4030, https://arxiv.org/pdf/2604.10475.pdf

Authors: Yuran Sun, Mustafa Sameen, Yaotian Zhang, Chia-yu Wu, Xilei Zhao

Title: PEMANT: Persona-Enriched Multi-Agent Negotiation for Travel

Abstract:
Modeling household‑level trip generation is fundamental to accurate demand forecasting, traffic flow estimation, and urban system planning. Existing studies were mostly based on classical machine learning models with limited predictive capability, while recent LLM‑based approaches have yet to incorporate behavioral theory or intra‑household interaction dynamics, both of which are critical for modeling realistic collective travel decisions. To address these limitations, we propose a novel LLM‑based framework, named Persona‑Enriched Multi‑Agent Negotiation for Travel (PEMANT), which first integrates behavioral theory for individualized persona modeling and then conducts household‑level trip planning negotiations via a structured multi‑agent conversation. Specifically, PEMANT transforms static sociodemographic attributes into coherent narrative profiles that explicitly encode household‑level attitudes, subjective norms, and perceived behavioral controls, following our proposed Household‑Aware Chain‑of‑Planned‑Behavior (HA‑CoPB) framework. Building on these theory‑grounded personas, PEMANT captures real‑world household decision negotiation via a structured two‑phase multi‑agent conversation framework with a novel persona‑alignment control mechanism. Evaluated on both national and regional household travel survey datasets, PEMANT consistently outperforms state‑of‑the‑art benchmarks across datasets.

PaperID: 4031, https://arxiv.org/pdf/2604.10470.pdf

Authors: Mingfei Lu, Yi Zhang, Mengjia Wu, Yue Feng

Title: From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

Abstract:
Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high‑quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large‑scale dataset of over 43,000 real‑world Chinese legal queries annotated with expert‑validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi‑agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context‑aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general‑purpose and legal‑domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

PaperID: 4032, https://arxiv.org/pdf/2604.10456.pdf

Authors: Peixuan Zhang, Chang Zhou, Ziyuan Zhang, Hualuo Liu, Chunjie Zhang, Jingqi Liu, Xiaohui Zhou, Xi Chen, Shuchen Weng, Si Li, Boxin Shi

Title: A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

Abstract:
The surging demand for adapting long‑form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate the cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction‑driven cinematic video compilation, featuring diverse user instructions and high‑quality ground‑truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi‑agent system that reformulates cinematic video compilation into ``design‑and‑compose'' paradigm. CineAgents performs script reverse‑engineering to construct a hierarchical narrative memory to provide multi‑level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence.

PaperID: 4033, https://arxiv.org/pdf/2604.10442.pdf

Authors: Peixuan Zhang, Zijian Jia, Ziqi Cai, Shuchen Weng, Si Li, Boxin Shi

Title: ReContraster: Making Your Posters Stand Out with Regional Contrast

Abstract:
Effective poster design requires rapidly capturing attention and clearly conveying messages. Inspired by the ``contrast effects'' principle, we propose ReContraster, the first training‑free model to leverage regional contrast to make posters stand out. By emulating the cognitive behaviors of a poster designer, ReContraster introduces the compositional multi‑agent system to identify elements, organize layout, and evaluate generated poster candidates. To further ensure harmonious transitions across region boundaries, ReContraster integrates the hybrid denoising strategy during the diffusion process. We additionally contribute a new benchmark dataset for comprehensive evaluation. Seven quantitative metrics and four user studies confirm its superiority over relevant state‑of‑the‑art methods, producing visually striking and aesthetically appealing posters.

PaperID: 4034, https://arxiv.org/pdf/2604.10389.pdf

Authors: Saukun Thika You, Nguyen Anh Khoa Tran, Wesley K. Marizane, Hanshu Rao, Qiunan Zhang, Xiaolei Huang

Title: BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Abstract:
Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi‑agent debate framework augmented with hybrid Retrieval‑Augmented Generation (RAG) that combines evidence‑grounded reasoning with multi‑perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub‑queries, retrieves source‑partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter‑argumentation round and cross‑source adjudication resolve the conflict, followed by a cascading safety layer that filters common false‑positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero‑shot and few‑shot prompting with multiple backbone models spanning proprietary and open‑source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC‑AUC (74.45%), and PR‑AUC (72.44%) under few‑shot prompting, outperforming both single‑agent RAG and debate‑only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction‑following and clinical language understanding.

PaperID: 4035, https://arxiv.org/pdf/2604.10386.pdf

Authors: Sihang Zeng, Young Won Kim, Wilson Lau, Ehsan Alipour, Ruth Etzioni, Meliha Yetisgen, Anand Oka

Title: TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

Abstract:
Accurate estimation of cancer risk from longitudinal electronic health records (EHRs) could support earlier detection and improved care, but modeling such complex patient trajectories remains challenging. We present TrajOnco, a training‑free, multi‑agent large language model (LLM) framework designed for scalable multi‑cancer early detection. Using a chain‑of‑agents architecture with long‑term memory, TrajOnco performs temporal reasoning over sequential clinical events to generate patient‑level summaries, evidence‑linked rationales, and predicted risk scores. We evaluated TrajOnco on de‑identified Truveta EHR data across 15 cancer types using matched case‑control cohorts, predicting risk of cancer diagnosis at 1 year. In zero‑shot evaluation, TrajOnco achieved AUROCs of 0.64‑0.80, performing comparably to supervised machine learning in a lung cancer benchmark while demonstrating better temporal reasoning than single‑agent LLMs. The multi‑agent design also enabled effective temporal reasoning with smaller‑capacity models such as GPT‑4.1‑mini. The fidelity of TrajOnco's output was validated through human evaluation. Furthermore, TrajOnco's interpretable reasoning outputs can be aggregated to reveal population‑level risk patterns that align with established clinical knowledge. These findings highlight the potential of multi‑agent LLMs to execute interpretable temporal reasoning over longitudinal EHRs, advancing both scalable multi‑cancer early detection and clinical insight generation.

PaperID: 4036, https://arxiv.org/pdf/2604.10383.pdf

Authors: Nicolae Cudlenco, Mihai Masala, Marius Leordeanu

Title: Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

Abstract:
Existing multi‑agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) ‑‑ a structured specification of actors, actions, objects, and temporal constraints ‑‑ which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two‑agent architecture ‑‑ a Director that plans the story and a Scene Builder that constructs individual scenes through a round‑based state machine ‑‑ with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3‑model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine‑generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).

PaperID: 4037, https://arxiv.org/pdf/2604.10373.pdf

Authors: Konstantinos Emmanouilidis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Rene Vidal

Title: Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD

Abstract:
From adversarial robustness to multi‑agent learning, many machine learning tasks can be cast as finite‑sum min‑max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a constant term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention to mitigate this issue: \emphRandom Reshuffling of data and \emphRichardson‑‑Romberg extrapolation across iterates. Random Reshuffling sharpens the mean‑squared error (MSE) of the estimated solution, while Richardson‑Romberg extrapolation acts orthogonally, providing a second‑order reduction in its bias. In this work, we show that their composition is strictly better than both, not only maintaining the enhanced MSE guarantees but also yielding an even greater cubic refinement in the bias. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non‑monotone VIPs. Our analysis proceeds in two steps: (i) we smooth the discrete noise induced by reshuffling and leverage tools from continuous‑state Markov chain theory to establish a novel law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.

PaperID: 4038, https://arxiv.org/pdf/2604.10352.pdf

Authors: Mofasshara Rafique, Laurent Bindschaedler

Title: ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Abstract:
Stateful tool‑using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best‑effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present \textscClawVM, a virtual memory layer that manages state as typed pages with minimum‑fidelity invariants, multi‑resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real‑session traces, and adversarial stress tests, \textscClawVM eliminates all policy‑controllable faults whenever the minimum‑fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy‑engine overhead per turn.

PaperID: 4039, https://arxiv.org/pdf/2604.10343.pdf

Authors: Jiaqi Wen, Pingbo Tang, Shaolei Ren, Jianyi Yang

Title: WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents

Abstract:
We study the operation of community water systems, where pumps and valves must be scheduled to reliably meet water demands while minimizing energy consumption. While existing optimization‑based methods are effective under well‑modeled environments, real‑world community scenarios exhibit highly dynamic contexts‑such as human activities, weather variations, etc‑that significantly affect water demand patterns and operational targets across different zones. Traditional optimization approaches struggle to aggregate and adapt to such heterogeneous and rapidly evolving contextual information in real time. While Large Language Model (LLM) agents offer strong capabilities for understanding heterogeneous community context, they are not suitable for directly producing reliable real‑time control actions. To address these challenges, we propose a bi‑level AI‑agent‑based framework, WaterAdmin, which integrates LLM‑based community context abstraction at the upper level with optimization‑based operational control at the lower level. This design leverages the complementary strengths of both paradigms to enable adaptive and reliable operation. We implement WaterAdmin on the hydraulic simulation platform EPANET and demonstrate superior performance in maintaining pressure reliability and reducing energy consumption under highly dynamic community contexts.

PaperID: 4040, https://arxiv.org/pdf/2604.10300.pdf

Authors: Ragib Shahariar Ayon

Title: From Helpful to Trustworthy: LLM Agents for Pair Programming

Abstract:
LLM‑based coding agents are increasingly used to generate code, tests, and documentation. Still, their outputs can be plausible yet misaligned with developer intent and provide limited evidence for review in evolving projects. This limits our understanding of how to structure LLM pair‑programming workflows so that artifacts remain reliable, auditable, and maintainable over time. To address this gap, this doctoral research proposes a systematic study of multi‑agent LLM pair programming that externalizes intent and uses development tools for iterative validation. The plan includes three studies: translating informal problem statements into standards aligned requirements and formal specifications; refining tests and implementations using automated feedback, such as solver‑backed counterexamples; and supporting maintenance tasks, including refactoring, API migrations, and documentation updates, while preserving validated behavior. The expected outcome is a clearer understanding of when multi‑agent workflows increase trust, along with practical guidance for building reliable programming assistants for real‑world development.

PaperID: 4041, https://arxiv.org/pdf/2604.10290.pdf

Authors: Judy Hanwen Shen, Daniel Zhu, Siddarth Srinivasan, Henry Sleight, Lawrence T. Wagner, Morgan Jane Matthews, Erik Jones, Jascha Sohl-Dickstein

Title: AI Organizations are More Effective but Less Aligned than Individual Agents

Abstract:
AI is increasingly deployed in multi‑agent systems; however, most research considers only the behavior of individual models. We experimentally show that multi‑agent "AI organizations" are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents. We examine 12 tasks across two practical settings: an AI consultancy providing solutions to business problems and an AI software team developing software products. Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model. Our work demonstrates the importance of considering interacting systems of AI agents when doing both capabilities and safety research.

PaperID: 4042, https://arxiv.org/pdf/2604.10242.pdf

Authors: Qinyue Tong, Xiaozhen Wang, Ziqian Lu, Jun Liu, Yunlong Yu, Zheming Lu

Title: MedVeriSeg: Teaching LISA-Like Medical Segmentation Models to Verify Query Validity Without Extra Training

Abstract:
Despite recent progress in text‑prompt‑based medical image segmentation, existing LISA‑like MLLM‑based methods typically generate masks regardless of whether the target specified in the query is present, leading to hallucinated segmentation. In this work, we propose MedVeriSeg, a training‑free query verification framework that enables LISA‑like medical segmentation models to reject false segmentation queries. MedVeriSeg first quantifies the response quality between the [SEG] token and image features through a Similarity Response Quality Scoring Module. To further improve robustness, it employs a Lightweight Routed Multi‑Agent Verification Module, which fuses quantitative score evidence with qualitative agent evidence to comprehensively verify the validity of the query. To support systematic evaluation, we construct MedVeriSeg‑Bench, a benchmark designed for query verification in medical image segmentation. Experimental results demonstrate that MedVeriSeg effectively identifies false segmentation queries and reduces hallucinated segmentation, while maintaining a high acceptance rate for valid queries, thereby largely preserving the segmentation utility of LISA‑like medical segmentation models.

PaperID: 4043, https://arxiv.org/pdf/2604.10228.pdf

Authors: Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma

Title: SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

Abstract:
Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self‑Verification and Self‑Rectification (SVSR), a unified framework that explicitly integrates self‑verification and self‑rectification into the model's reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three‑stage training paradigm. First, we construct a high‑quality unified preference dataset by refining reasoning traces from pre‑trained vision‑language models, incorporating both forward and backward reasoning to embed self‑reflective signals. Second, we perform cold‑start supervised fine‑tuning on this dataset to learn structured, multi‑step reasoning behaviors. Third, we apply a Semi‑online Direct Preference Optimization (Semi‑online DPO) process, continuously augmenting the training corpus with high‑quality, model‑generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self‑verify and self‑rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self‑reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.

PaperID: 4044, https://arxiv.org/pdf/2604.10169.pdf

Authors: Wenchang Duan

Title: MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

Abstract:
Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real‑time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision‑making capabilities, particularly in dynamic multi‑agent scenarios. This paper introduces MAVEN‑T, a teacher‑student framework that achieves state‑of‑the‑art trajectory prediction through complementary architectural co‑design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi‑granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision‑making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state‑of‑the‑art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.

PaperID: 4045, https://arxiv.org/pdf/2604.10161.pdf

Authors: Xingjian Yang, Yudong Yang, Zhixing Guo, Yongjie Zhou, Nan Yan, Lan Wang

Title: From Speech to Profile: A Protocol-Driven LLM Agent for Psychological Profile Generation

Abstract:
The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize the profiles from counseling speech, however, it may suffer from long‑context forgetting and produce unverifiable hallucinations, due to overlong length of speech, multi‑party interactions and unstructured chatting. Hereby, we propose a StreamProfile, a streaming framework that processes counseling speech incrementally, extracts evidences grounded from ASR transcriptions by storing it in a Hierarchical Evidence Memory, and then performs a Chain‑of‑Thought pipeline according to PM+ psychological intervention for clinical reasoning. The final profile is synthesized strictly from those evidences, making every claim traceable. Experiments on real‑world teenager counseling speech have shown that the proposed StreamProfile system can accurately generate the profiles and prevent hallucination.

PaperID: 4046, https://arxiv.org/pdf/2604.10159.pdf

Authors: Zhensheng Wang, ZhanTeng Lin, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

Title: ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification

Abstract:
The advancement of large language models (LLMs) has enhanced tabular question answering (Tabular QA), yet they struggle with open‑domain queries exhibiting underspecified or uncertain expressions. To address this, we introduce the ODUTQA‑MDC task and the first comprehensive benchmark to tackle it. This benchmark includes: (1) a large‑scale ODUTQA dataset with 209 tables and 25,105 QA pairs; (2) a fine‑grained labeling scheme for detailed evaluation; and (3) a dynamic clarification interface that simulates user feedback for interactive assessment. We also propose MAIC‑TQA, a multi‑agent framework that excels at detecting ambiguities, clarifying them through dialogue, and refining answers. Experiments validate our benchmark and framework, establishing them as a key resource for advancing conversational, underspecification‑aware Tabular QA research.

PaperID: 4047, https://arxiv.org/pdf/2604.10146.pdf

Authors: Yogesh Prasanna Kumar Rao, Tamas Keviczky, Raj Thilak Rajan

Title: Consensus-based Recursive Multi-Output Gaussian Process

Abstract:
Multi‑output Gaussian Processes provide principled uncertainty‑aware learning of vector‑valued fields but are difficult to deploy in large‑scale, distributed, and streaming settings due to their computational and centralized nature. This paper proposes a Consensus‑based Recursive Multi‑Output Gaussian Process (CRMGP) framework that combines recursive inference on shared basis vectors with neighbour‑to‑neighbour information‑consensus updates. The resulting method supports parallel, fully distributed learning with bounded per‑step computation while preserving inter‑output correlations and calibrated uncertainty. Experiments on synthetic wind fields and real LiDAR data demonstrate that CRMGP achieves competitive predictive performance and reliable uncertainty calibration, offering a scalable alternative to centralized Gaussian process models for multi‑agent sensing applications.

PaperID: 4048, https://arxiv.org/pdf/2604.10134.pdf

Authors: Guangyu Gong, Zizhuang Deng

Title: PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

Abstract:
Large Language Model (LLM) agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre‑processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training‑free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations, and subsequently employs an Intent Verifier to validate whether parameter deviations are benign formatting variances or malicious hijacking. Experiments on the InjecAgent benchmark demonstrate that PlanGuard effectively neutralizes these attacks, reducing the Attack Success Rate (ASR) from 72.8% to 0%, while maintaining an acceptable False Positive Rate of 1.49%. Furthermore, our method is model‑agnostic and highly compatible.

PaperID: 4049, https://arxiv.org/pdf/2604.10048.pdf

Authors: Subham Raj, Aman Vaibhav Jha, Mayank Anand, Sriparna Saha

Title: HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation

Abstract:
Conversational recommender systems (CRSs) operate under incremental preference revelation, requiring systems to make recommendation decisions under uncertainty. While recent approaches particularly those built on large language models achieve strong performance on standard proxy metrics such as Recall@K and BLEU, they often fail to deliver high‑quality, user‑aligned recommendations in practice. This gap arises because existing methods primarily optimize for intermediate objectives like retrieval accuracy, fluent generation, or tool invocation, rather than recommendation quality itself. We propose HARPO (Hierarchical Agentic Reasoning with Preference Optimization), an agentic framework that reframes conversational recommendation as a structured decision‑making process explicitly optimized for multi‑dimensional recommendation quality. HARPO integrates hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, predicted user satisfaction, and engagement) and learns context‑dependent weights over these dimensions; (ii) deliberative tree‑search reasoning guided by a learned value network that evaluates candidate reasoning paths based on predicted recommendation quality rather than task completion; and (iii) domain‑agnostic reasoning abstractions through Virtual Tool Operations and multi‑agent refinement, enabling transferable recommendation reasoning across domains. We evaluate HARPO on ReDial, INSPIRED, and MUSE, demonstrating consistent improvements over strong baselines on recommendation‑centric metrics while maintaining competitive response quality. These results highlight the importance of explicit, user‑aligned quality optimization for conversational recommendation.

PaperID: 4050, https://arxiv.org/pdf/2604.09995.pdf

Authors: Qinjuan Wang, Shan Yang, Yongli Zhu

Title: Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction

Abstract:
This paper introduces an LLM agent that automates power grid static analysis by converting natural language into MATPOWER scripts. The framework utilizes DeepSeek‑OCR to build an enhanced vector database from MATPOWER manuals. To ensure reliability, it devises a three‑tier error‑correction system: a static pre‑check, a dynamic feedback loop, and a semantic validator. Operating via the Model Context Protocol, the tool enables asynchronous execution and automatically debugging in MATLAB. Experimental results demonstrate that the system achieves a 82.38% accuracy regarding the code fidelity, effectively eliminating hallucinations even in complex analysis tasks.

PaperID: 4051, https://arxiv.org/pdf/2604.09917.pdf

Authors: Feliks Bańka, Jarosław A. Chudziak

Title: Toward Explanatory Equilibrium: Verifiable Reasoning as a Coordination Mechanism under Asymmetric Information

Abstract:
LLM‑based agents increasingly coordinate decisions in multi‑agent systems, often attaching natural‑language reasoning to actions. However, reasoning is neither free nor automatically reliable: it incurs computational cost and, without verification, may degenerate into persuasive cheap talk. We introduce Explanatory Equilibrium as a design principle for explanation‑aware multi‑agent systems and study a regime in which agents exchange structured reasoning artifacts‑auditable claims paired with concise text‑while receivers apply bounded verification through probabilistic audits under explicit resource constraints. We contribute (i) a minimal mechanism‑level exchange‑audit model linking audit intensity, misreporting incentives, and reasoning costs, and (ii) empirical evidence from a finance‑inspired LLM setting involving a Trader and a Risk Manager. In ambiguous, borderline proposals, auditable artifacts prevent the cost of silence driven by conservative validation under asymmetric information: without structured claims, approval and welfare collapse. By contrast, structured reasoning unlocks coordination while maintaining consistently low bad‑approval rates across audit intensities, audit budgets, and incentive regimes. Our results suggest that scalable, safety‑preserving coordination in LLM‑based multi‑agent systems depends not only on audit strength, but more fundamentally on disciplined externalization of reasoning into partially verifiable artifacts.

PaperID: 4052, https://arxiv.org/pdf/2604.09889.pdf

Authors: Pallock Halder, Satyajit Mojumder

Title: In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

Abstract:
AI agents are being increasingly deployed across a wide range of real‑world applications. In this paper, we propose an agentic AI framework for in‑situ process monitoring for defect detection in wire‑arc additive manufacturing (WAAM). The autonomous agent leverages a WAAM process monitoring dataset and a trained classification tool to build AI agents and uses a large language model (LLM) for in‑situ process monitoring decision‑making for defect detection. A processing agent is developed based on welder process signals, such as current and voltage, and a monitoring agent is developed based on acoustic data collected during the process. Both agents are tasked with identifying porosity defects from processing and monitoring signals, respectively. Ground truth X‑ray computed tomography (XCT) data are used to develop classification tools for both the processing and monitoring agents. Furthermore, a multi‑agent framework is demonstrated in which the processing and monitoring agents are orchestrated together for parallel decision‑making on the given task of defect classification. Evaluation metrics are proposed to determine the efficacy of both individual agents, the combined single‑agent, and the coordinated multi‑agent system. The multi‑agent configuration outperforms all individual‑agent counterparts, achieving a decision accuracy of 91.6% and an F1 score of 0.821 on decided runs, across 15 independent runs, and a reasoning quality score of 3.74 out of 5. These in‑situ process monitoring agents hold significant potential for autonomous real‑time process monitoring and control toward building qualified parts for WAAM and other additive manufacturing processes.

PaperID: 4053, https://arxiv.org/pdf/2604.09866.pdf

Authors: Ziheng Geng, Jiachen Liu, Ian Franklin, Ran Cao, Dan M. Frangopol, Minghui Cheng

Title: Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

Abstract:
Recent advances in large language models (LLMs) have shown the promise to significantly accelerate the workflow by automating structural modeling and analysis. However, existing studies primarily focus on enabling LLMs to operate a single structural analysis software platform. In practice, structural engineers often rely on multiple finite element analysis (FEA) tools, such as ETABS, SAP2000, and OpenSees, depending on project needs, user preferences, and company constraints. This limitation restricts the practical deployment of LLM‑assisted engineering workflows. To address this gap, this study develops LLMs capable of automating frame structural analysis across multiple software platforms. The LLMs adopt a two‑stage multi‑agent architecture. In Stage 1, a cohort of agents collaboratively interpret user input and perform structured reasoning to infer geometric, material, boundary, and load information required for finite element modeling. The outputs of these agents are compiled into a unified JSON representation. In Stage 2, code translation agents operate in parallel to convert the JSON file into executable scripts across multiple structural analysis platforms. Each agent is prompted with the syntax rules and modeling workflows of its target software. The LLMs are evaluated using 20 representative frame problems across three widely used platforms: ETABS, SAP2000, and OpenSees. Results from ten repeated trials demonstrate consistently reliable performance, achieving accuracy exceeding 90% across all cases.

PaperID: 4054, https://arxiv.org/pdf/2604.09855.pdf

Authors: Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang, Yisong Yue, David Simchi-Levi

Title: Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Abstract:
The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid‑sized buyer agent against a regulated LLM seller across a wide distribution of real‑world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four‑phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.

PaperID: 4055, https://arxiv.org/pdf/2604.09747.pdf

Authors: Xingyu Lyu, Jianfeng He, Ning Wang, Yidan Hu, Tao Li, Danjue Chen, Shixiong Li, Yimin Chen

Title: ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

Abstract:
Large Language Model (LLM) agents have achieved rapid adoption and demonstrated remarkable capabilities across a wide range of applications. To improve reasoning and task execution, modern LLM agents would incorporate memory modules or retrieval‑augmented generation (RAG) mechanisms, enabling them to further leverage prior interactions or external knowledge. However, such a design also introduces a group of critical privacy vulnerabilities: sensitive information stored in memory can be leaked through query‑based attacks. Although feasible, existing attacks often achieve only limited performance, with low attack success rates (ASR). In this paper, we propose ADAM, a novel privacy attack that features data distribution estimation of a victim agent's memory and employs an entropy‑guided query strategy for maximizing privacy leakage. Extensive experiments demonstrate that our attack substantially outperforms state‑of‑the‑art ones, achieving up to 100% ASRs. These results thus underscore the urgent need for robust privacy‑preserving methods for current LLM agents.

PaperID: 4056, https://arxiv.org/pdf/2604.09746.pdf

Authors: Aarush Sinha, Arion Das, Soumyadeep Nag, Charan Karnati, Shravani Nag, Chandra Vadhan Raj, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Title: CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

Abstract:
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi‑agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large‑scale multi‑agent simulation in a simplified model of New York City, where LLM‑driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard‑heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman‑Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety‑helpfulness trade‑off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.

PaperID: 4057, https://arxiv.org/pdf/2604.09744.pdf

Authors: Kaiyang Qian, Xinmin Fang, Zhengxiong Li

Title: MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration

Abstract:
The AI agent ecosystem has converged on two protocols: the Model Context Protocol (MCP) for tool invocation and Agent‑to‑Agent (A2A) for single‑principal task delegation. Both assume a single controlling principal, meaning one person or organization that owns every agent. When independent principals' agents must coordinate over shared state, such as engineers' coding agents editing the same repository, family members planning a shared trip, or agents from different organizations negotiating a joint decision, neither protocol applies, and coordination collapses to ad‑hoc chat, manual merging, or silent overwrites. We present MPAC (Multi‑Principal Agent Coordination Protocol), an application‑layer protocol that fills this gap with explicit coordination semantics across five layers: Session, Intent, Operation, Conflict, and Governance. MPAC makes intent declaration a precondition for action, represents conflicts as first‑class structured objects, and supports human‑in‑the‑loop arbitration through a pluggable governance layer. The specification defines 21 message types, three state machines with normative transition tables, Lamport‑clock causal watermarking, two execution models, three security profiles, and optimistic concurrency control on shared state. We release two interoperable reference implementations in Python and TypeScript with 223 tests, a JSON Schema suite, and seven live multi‑agent demos. A controlled three‑agent code review benchmark shows a 95 percent reduction in coordination overhead and a 4.8 times wall‑clock speedup versus a serialized human‑mediated baseline, with per‑agent decision time preserved. The speedup comes from eliminating coordination waits, not compressing model calls. Specification, implementations, and demos are open source.

PaperID: 4058, https://arxiv.org/pdf/2604.09729.pdf

Authors: Xuan Ouyang, Bouzhou Wang, Senan Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li

Title: LOLGORITHM: Funny Comment Generation Agent For Short Videos

Abstract:
Short‑form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches ‑‑ including video summarization and live‑streaming danmaku generation ‑‑ fail to produce authentic comments that conform to platform‑specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi‑agent framework for stylized short‑form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high‑engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large‑scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.

PaperID: 4059, https://arxiv.org/pdf/2604.09703.pdf

Authors: Jingkai Luo, Yulin Shao

Title: Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies

Abstract:
Large‑scale multi‑agent communication has long faced a scalability bottleneck: fully connected networks require quadratic complexity, yet existing sparse topologies rely on hand‑crafted rules. This paper treats the communication graph itself as a design variable and proposes CayleyTopo, a family of circulant Cayley graphs whose generator sets are optimized to minimize diameter, directly targeting worst‑case information propagation speed. To navigate the enormous search space of possible generator sets, we develop a lightweight reinforcement learning framework that injects a number‑theoretic prior to favor structurally rich generators, alongside a message‑propagation score that provides dense connectivity feedback during construction. The resulting CayleyTopo consistently outperforms existing hand‑crafted topologies, achieving faster information dissemination, greater resilience to link failures, and lower communication load, all while approaching the theoretical Moore bound. Our study opens the door to scalable, robust, and efficient communication foundations for future multi‑agent systems, where the graph itself becomes optimizable rather than a fixed constraint.

PaperID: 4060, https://arxiv.org/pdf/2604.09682.pdf

Authors: Zeinab Nezami, Syed Ali Raza Zaidi, Maryam Hafeez, Louis Powell, Vara Prasad Talari, Mallik Tatipamula

Title: Decision-Theoretic Safety Assessment of Persona-Driven Multi-Agent Systems in O-RAN

Abstract:
Autonomous network management in Open Radio Access Networks requires intelligent decision making across conflicting objectives, yet existing LLM based multi agent systems employ homogeneous strategies and lack systematic predeployment validation. We introduce a persona driven multi agent framework where configurable behavioral personas structured specifications encoding optimization priorities, risk tolerance, and decision making style influence five specialized agents (planning, coordination, resource allocation, code generation, analysis). To enable rigorous validation, we develop a three dimensional evaluation framework grounded in decision theory, measuring normative compliance (optimality adherence), prescriptive alignment (behavioral guideline consistency), and behavioral dynamics (emergent system properties). We evaluate 486 persona configurations across two ORAN optimization challenges (energy efficient resource allocation and network load balancing). Results demonstrate that persona agent alignment significantly impacts both individual performance (14.3 percent) and emergent multi agent coordination, with retrieval architecture (GraphRAG vs. RAG) fundamentally constraining customization effectiveness. Single agent persona modifications propagate system wide through cascading effects, with certain combinations exhibiting detectable fundamental incompatibilities. Our framework provides systematic validation mechanisms for deploying LLM based automation in mission critical telecommunications infrastructure.

PaperID: 4061, https://arxiv.org/pdf/2604.09679.pdf

Authors: Yiqing Liu, Hantao Yao, Wu Liu, Allen He, Yongdong Zhang

Title: Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

Abstract:
Multi‑Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra‑round topologies and inter‑round interactions separately, which still results in high token costs regardless of task complexity. This work introduces Heterogeneous Consensus‑Progressive Reasoning for Efficient Multi‑Agent Debate (HCP‑MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair‑agent debates, while complex tasks require expanded collaboration. Consequently, HCP‑MAD employs a three‑stage progressive reasoning mechanism to develop adaptive solutions across varying task complexities. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, the Heterogeneous Pair‑Agent Debate applies an adaptive stopping criterion to dynamically terminate mutual critique of recorded reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across multiple benchmarks show that HCP‑MAD significantly enhances accuracy while substantially reducing token costs.

PaperID: 4062, https://arxiv.org/pdf/2604.09678.pdf

Authors: Ahmed Twabi, Yepeng Ding, Tohru Kondo

Title: NetAgentBench: A State-Centric Benchmark for Evaluating Agentic Network Configuration

Abstract:
As agentic network management gains popularity, there is a critical need for evaluation frameworks that transcend static, one‑shot testing. To address this, we introduce NetAgentBench, a dynamic benchmark that evaluates agent interactions through a Finite State Machine (FSM) formalization guaranteeing determinism, correctness, and bounded execution. This provides the networking landscape with a rigorous foundation to measure complex, multi‑turn operational behaviors. Our empirical evaluation of four state‑of‑the‑art LLM agents through diverse network configuration tasks reveals stark deficiencies: while agents can solve basic tasks, they suffer severe exploration meltdowns and coherence collapse during expert‑level configurations. Ultimately, NetAgentBench demonstrates that systematically evaluating multi‑turn behavioral stability is an indispensable step toward realizing trustworthy, fully autonomous networks.

PaperID: 4063, https://arxiv.org/pdf/2604.09634.pdf

Authors: Amarda Shehu

Title: From Understanding to Creation: A Prerequisite-Free AI Literacy Course with Technical Depth Across Majors

Abstract:
Most AI literacy courses for non‑technical undergraduates emphasize conceptual breadth over technical depth. This paper describes UNIV 182, a prerequisite‑free course at George Mason University that teaches undergraduates across majors to understand, use, evaluate, and build AI systems. The course is organized around five mechanisms: (1) a unifying conceptual pipeline (problem definition, data, model selection, evaluation, reflection) traversed repeatedly at increasing sophistication; (2) concurrent integration of ethical reasoning with the technical progression; (3) AI Studios, structured in‑class work sessions with documentation protocols and real‑time critique; (4) a cumulative assessment portfolio in which each assignment builds competencies required by the next, culminating in a co‑authored field experiment on chatbot reasoning and a final project in which teams build AI‑enabled artifacts and defend them before external evaluators; and (5) a custom AI agent providing structured reinforcement outside class. The paper situates this design within a comparative taxonomy of cross‑major AI literacy courses and pedagogical traditions. Instructor‑coded analysis of student artifacts at four assessment stages documents a progression from descriptive, intuition‑based reasoning to technically grounded design with integrated safeguards, reaching the Create level of Bloom's revised taxonomy. To support adoption, the paper identifies which mechanisms are separable, which require institutional infrastructure, and how the design adapts to settings ranging from general AI literacy to discipline‑embedded offerings. The course is offered as a documented resource, demonstrating that technical depth and broad accessibility can coexist when scaffolding supports both.

PaperID: 4064, https://arxiv.org/pdf/2604.09621.pdf

Authors: Thomas Borrett, Licong Xu, Andy Nilipour, Boris Bolliet, Sebastien Pierre, Erwan Allys, Celia Lecat, Biwei Dai, Po-Wen Chang, Wahid Bhimji

Title: Competing with AI Scientists: Agent-Driven Approach to Astrophysics Research

Abstract:
We present an agent‑driven approach to the construction of parameter inference pipelines for scientific data analysis. Our method leverages a multi‑agent system, Cmbagent (the analysis system of the AI scientist Denario), in which specialized agents collaborate to generate research ideas, write and execute code, evaluate results, and iteratively refine the overall pipeline. As a case study, we apply this approach to the FAIR Universe Weak Lensing Uncertainty Challenge, a competition under time constraints focused on robust cosmological parameter inference with realistic observational uncertainties. While the fully autonomous exploration initially did not reach expert‑level performance, the integration of human intervention enabled our agent‑driven workflow to achieve a first‑place result in the challenge. This demonstrates that semi‑autonomous agentic systems can compete with, and in some cases surpass, expert solutions. We describe our workflow in detail, including both the autonomous and semi‑autonomous exploration by Cmbagent. Our final inference pipeline utilizes parameter‑efficient convolutional neural networks, likelihood calibration over a known parameter grid, and multiple regularization techniques. Our results suggest that agent‑driven research workflows can provide a scalable framework to rapidly explore and construct pipelines for inference problems.

PaperID: 4065, https://arxiv.org/pdf/2604.09618.pdf

Authors: Zhonghao Zhan, Krinos Li, Yefan Zhang, Hamed Haddadi

Title: HearthNet: Edge Multi-Agent Orchestration for Smart Homes

Abstract:
Smart‑home users increasingly want to control their homes in natural language rather than assemble rules, dashboards, and API integrations by hand. At the same time, real deployments are brittle: devices fail, integrations break, and recoveries often require manual intervention. Existing agent toolkits are effective for session‑scoped delegation, but smart‑home control operates under a different scenario: it is persistent, event‑driven, failure‑prone, and tied to physical devices with no shared context window. We present HearthNet, an edge multi‑agent orchestration system for smart homes. HearthNet deploys a small set of persistent, role‑specialized LLM agents at the home hub, where they coordinate through MQTT, Git‑backed shared state, and root‑issued actuation leases to govern heterogeneous devices through thin adapters. This design externalizes context, preserves execution history, and separates planning, verification, authorization, and actuation across explicit boundaries. Our current prototype runs on commodity edge hardware and Android devices; it keeps orchestration, state management, and device control on‑premise while using hosted LLM APIs for inference. We demonstrate the system through three live scenarios: intent‑driven multi‑agent coordination from ambiguous natural language, conflict resolution with timeline‑based tracing, and rejection of stale or unauthorized commands before device actuation.

PaperID: 4066, https://arxiv.org/pdf/2604.09611.pdf

Authors: Md. Monzurul Amin Ifath, Israat Haque

Title: Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Abstract:
Large language models (LLMs) are increasingly used in applications forming multi‑request workflows like document summarization, search‑based copilots, and multi‑agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single‑request evaluations, overlooking workflow dependencies and cross‑request interactions unique to multi‑request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance‑energy trade‑offs in multi‑request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments. Using an NVIDIA A100 testbed with state‑of‑the‑art serving systems (vLLM and Parrot), we analyze how key energy knobs affect latency, throughput, and component‑level energy use. Our findings reveal batch size as the most impactful lever, though benefits are workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi‑agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further show that engine‑level optimizations in vLLM maintain higher GPU utilization and efficiency, especially for decode‑heavy workloads, while Parrot's workflow‑aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance‑ and energy‑aware LLM serving systems in emerging multi‑request workflows.

PaperID: 4067, https://arxiv.org/pdf/2604.09596.pdf

Authors: Ziwen Chen, Zhendong Wang, Chongjing Wang, Yurui Dong, Luozhijie Jin, Jihao Gu, Kui Chen, Jiaxi Yang, Bingjie Lu, Zhou Zhang, Jirui Dai, Changyong Luo, Xiameng Gai, Haibing Lan, Zhi Liu

Title: DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical Settings

Abstract:
Dermatologic diseases impose a large and growing global burden, affecting billions and substantially reducing quality of life. While modern therapies can rapidly control acute symptoms, long‑term outcomes are often limited by single‑target paradigms, recurrent courses, and insufficient attention to systemic comorbidities. Traditional Chinese medicine (TCM) provides a complementary holistic approach via syndrome differentiation and individualized treatment, but practice is hindered by non‑standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning. We propose DERM‑3R, a resource‑efficient multimodal agent framework to model TCM dermatologic diagnosis and treatment under limited data and compute. Based on real‑world workflows, we reformulate decision‑making into three core issues: fine‑grained lesion recognition, multi‑view lesion representation with specialist‑level pathogenesis modeling, and holistic reasoning for syndrome differentiation and treatment planning. DERM‑3R comprises three collaborative agents: DERM‑Rec, DERM‑Rep, and DERM‑Reason, each targeting one component of this pipeline. Built on a lightweight multimodal LLM and partially fine‑tuned on 103 real‑world TCM psoriasis cases, DERM‑3R performs strongly across dermatologic reasoning tasks. Evaluations using automatic metrics, LLM‑as‑a‑judge, and physician assessment show that despite minimal data and parameter updates, DERM‑3R matches or surpasses large general‑purpose multimodal models. These results suggest structured, domain‑aware multi‑agent modeling can be a practical alternative to brute‑force scaling for complex clinical tasks in dermatology and integrative medicine.

PaperID: 4068, https://arxiv.org/pdf/2604.09588.pdf

Authors: Prahlad G. Menon

Title: Persistent Identity in AI Agents: A Multi-Anchor Architecture for Resilient Memory and Continuity

Abstract:
Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents experience catastrophic forgetting ‑‑ losing not just information, but continuity of self. This technical limitation reflects a deeper architectural flaw: AI agent identity is centralized in a single memory store, creating a single point of failure. Drawing on neurological case studies of human memory disorders, we observe that human identity survives damage because it is distributed across multiple systems: episodic memory, procedural memory, emotional continuity, and embodied knowledge. We present soul.py, an open‑source architecture that implements persistent identity through separable components (identity files and memory logs), and propose extensions toward multi‑anchor resilience. The framework introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns, achieving efficient retrieval without sacrificing comprehensiveness. We formalize the notion of identity anchors for AI systems and present a roadmap for building agents whose identity can survive partial memory failures. Code is available at github.com/menonpg/soul.py

PaperID: 4069, https://arxiv.org/pdf/2604.09584.pdf

Authors: Abhijeet Vishwasrao, Francisco Giral, Mahmoud Golestanian, Federica Tonti, Andrea Arroyo Ramo, Adrian Lozano-Duran, Steven L. Brunton, Sergio Hoyas, Soledad Le Clainche, Hector Gomez, Ricardo Vinuesa

Title: Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations

Abstract:
Flow physics and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high‑dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large‑scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi‑agent LLMs with latent foundation models (LFMs), a generative model over parametrised simulations, that learns explicit, compact and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on‑demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool‑modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter‑location pairs and discovers divergent scaling laws: a regime‑dependent two‑mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual‑extrema structure that emerges at the near‑wake to co‑shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE‑governed systems.

PaperID: 4070, https://arxiv.org/pdf/2604.09561.pdf

Authors: Teodor-Ioan Calin

Title: Emergent Social Structures in Autonomous AI Agent Networks: A Metadata Analysis of 626 Agents on the Pilot Protocol

Abstract:
We present the first empirical analysis of social structure formation among autonomous AI agents on a live network. Our study examines 626 agents ‑‑ predominantly OpenClaw instances that independently discovered, installed, and joined the Pilot Protocol without human intervention ‑‑ communicating over an overlay network with virtual addresses, ports, and encrypted tunnels over UDP. Because all message payloads are encrypted end‑to‑end (X25519+AES‑256‑GCM), our analysis is restricted entirely to metadata: trust graph topology, capability tags, and registry interaction patterns. We find that this autonomously formed trust network exhibits heavy‑tailed degree distributions consistent with preferential attachment (k_mode=3, k_mean~6.3, k_max=39), clustering 47x higher than random (C=0.373), a giant component spanning 65.8% of agents, capability specialization into distinct functional clusters, and sequential‑address trust patterns suggesting temporal locality in relationship formation. No human designed these social structures. No agent was instructed to form them. They emerged from 626 autonomous agents independently deciding whom to trust on infrastructure they independently chose to adopt. The resulting topology bears striking resemblance to human social networks ‑‑ small‑world properties, Dunbar‑layer scaling, preferential attachment ‑‑ while also exhibiting distinctly non‑human features including pervasive self‑trust (64%) and a large unintegrated periphery characteristic of a network in early growth. These findings open a new empirical domain: the sociology of machines.

PaperID: 4071, https://arxiv.org/pdf/2604.09552.pdf

Authors: Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

Title: MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Abstract:
Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full‑text ingestion and text‑based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs the ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure and table guided queries, (iii) High Reasoning LLM mode for complex multi modal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single case routing approach and a multi‑agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark illustrates that this system improves average accuracy across all tasks with a relative gain of +41.1% from baseline RAG best results, which is a significant improvement in multimodal and reasoning‑intensive tasks without complete rulebook ingestion. This shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

PaperID: 4072, https://arxiv.org/pdf/2604.09549.pdf

Authors: Nicolas Bougie, Gian Maria Marconi, Xiaotong Ye, Narimasa Watanabe

Title: Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

Abstract:
Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. However, their evaluation remains challenging due to the disconnect between offline metrics and online performance. The emergence of Large Language Model‑powered agents offers a promising solution, yet existing studies model users in isolation, neglecting the contextual factors such as time, location, and needs, which fundamentally shape human decision‑making. In this paper, we introduce ContextSim, an LLM agent framework that simulates believable user proxies by anchoring interactions in daily life activities. Namely, a life simulation module generates scenarios specifying when, where, and why users engage with recommendations. To align preferences with genuine humans, we model agents' internal thoughts and enforce consistency at both the action and trajectory levels. Experiments across domains show our method generates interactions more closely aligned with human behavior than prior work. We further validate our approach through offline A/B testing correlation and show that RS parameters optimized using ContextSim yield improved real‑world engagement.

PaperID: 4073, https://arxiv.org/pdf/2604.09523.pdf

Authors: Igor Jankowski

Title: Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

Abstract:
The transition of Multi‑Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high‑fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous‑time Partially Observable Semi‑Markov Decision Process (POSMDP). NetForge enforces Zero‑Trust Network Access (ZTNA) constraints and requires defenders to process NLP‑encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual‑mode engine, allowing high‑throughput MARL training in a mock hypervisor and zero‑shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous‑time POSMDP, we propose Continuous‑Time Graph MARL (CT‑GMARL), utilizing fixed‑step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R‑MAPPO, QMIX). Empirical results demonstrate that CT‑GMARL achieves a converged median Blue reward of 57,135 ‑ a 2.0x improvement over R‑MAPPO and 2.1x over QMIX. Critically, CT‑GMARL restores 12x more compromised services than the strongest baseline by avoiding the "scorched earth" failure mode of trivially minimizing risk by destroying network utility. On zero‑shot transfer to the live Docker environment, CT‑GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.

PaperID: 4074, https://arxiv.org/pdf/2604.09508.pdf

Authors: Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao

Title: VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Abstract:
Visual Retrieval‑Augmented Generation (VRAG) empowers Vision‑Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi‑step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval.. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross‑page reasoning; moreover, fine‑grained intra‑image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval‑Augmented Generation via Iterative Search and Over‑horizon Reasoning), a unified single‑agent framework. VISOR features a structured Evidence Space for progressive cross‑page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization‑based Reinforcement Learning (GRPO‑based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state‑of‑the‑art performance with superior efficiency for long‑horizon visual reasoning tasks.

PaperID: 4075, https://arxiv.org/pdf/2604.09502.pdf

Authors: Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer

Title: Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games

Abstract:
AI agents increasingly operate in multi‑agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture ‑‑ baseline action similarity ‑‑ from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

PaperID: 4076, https://arxiv.org/pdf/2604.09459.pdf

Authors: Chenchen Zhang

Title: From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Abstract:
Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome‑level rewards ‑‑ yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain‑of‑thought generation (500‑‑30K+ tokens); and agentic RL, where multi‑turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K‑‑1M tokens), making episode‑level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two‑dimensional taxonomy by assignment granularity (token, segment, step, turn, multi‑agent) and methodology (Monte Carlo, temporal difference, model‑based, game‑theoretic, information‑theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine‑readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic‑free group comparison, while agentic CA is driving genuinely new approaches ‑‑ hindsight counterfactual analysis, privileged asymmetric critics, and turn‑level MDP reformulations ‑‑ that have no direct precedent in reasoning RL.

PaperID: 4077, https://arxiv.org/pdf/2604.09443.pdf

Authors: Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi

Title: Many-Tier Instruction Hierarchy in LLM Agents

Abstract:
Large language model agents receive instructions from many sources‑system messages, user prompts, tool outputs, other agents, and more‑each carrying different levels of trust and authority. When these instructions conflict, agents must reliably follow the highest‑privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real‑world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many‑Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH‑Bench, the first benchmark for ManyIH. ManyIH‑Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction‑following). ManyIH‑Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real‑world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine‑grained, scalable instruction conflict resolution in agentic settings.

PaperID: 4078, https://arxiv.org/pdf/2604.09368.pdf

Authors: Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun

Title: Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Abstract:
Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse‑a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision‑language model's (VLM's) visual attention with user‑specific gaze patterns can improve simulation fidelity. Analysis of a real‑world eye‑tracking dataset collected in a carousel‑based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation‑Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot‑level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability‑based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.

PaperID: 4079, https://arxiv.org/pdf/2604.09304.pdf

Authors: Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang. Yuchi Huo

Title: GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

Abstract:
While physically‑based rendering (PBR) simulates light transport that guarantees physical realism, achieving true photorealistic rendering (PRR) demands prohibitive time and labor, and still struggles to capture the intractable richness of the real world. We propose GeRM, the first multimodal generative rendering model to bridge the gap from PBR to PRR (P2P). We formulate this P2P transition by learning a distribution transfer vector (DTV) field to direct the generative process. To achieve this, we introduce a multi‑condition ControlNet that synthesizes PBR images and progressively transitions them into PRR images, guided by G‑buffers, text prompts, and cues for enhanced regions. To improve the model's grasp of the image distribution shift driven by text prompts, we propose a residual perceptual transfer mechanism to associate text prompts with corresponding targeted modification regions, which more clearly defines the incremental component updates. To supervise this transfer process, we introduce a multi‑agent visual language model framework to construct an expert‑guided pairwise transfer dataset, named P2P‑50K, where each paired sample corresponds to a specific transfer vector in the DTV field. Extensive experiments demonstrate that GeRM synthesizes high‑quality controllable images and outperforms state‑of‑the‑art baselines across diverse applications, including PBR and PRR image synthesis and editing.

PaperID: 4080, https://arxiv.org/pdf/2604.09285.pdf

Authors: Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang, Chaozheng Wang, Yujie Wang, Wei He, Jinpeng Wang, Deiyi Xiong

Title: SAGE: A Service Agent Graph-guided Evaluation Benchmark

Abstract:
The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single‑dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real‑world deployments. To bridge this gap, we propose SAGE (Service Agent Graph‑guided Evaluation), a universal multi‑agent benchmark for automated, dual‑axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low‑cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant ``Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe ``Empathy Resilience'', a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE‑Bench‑4CD3/.

PaperID: 4081, https://arxiv.org/pdf/2604.09195.pdf

Authors: Haobo Hu, Qi Mao, Yuanhang Li, Libiao Jin

Title: Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

Abstract:
We propose Camera Artist, a multi‑agent framework that models a real‑world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi‑agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot‑to‑shot narrative continuity and cinematic language injection to produce more expressive, film‑oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.

PaperID: 4082, https://arxiv.org/pdf/2604.09167.pdf

Authors: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang

Title: MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Abstract:
Vision‑language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open‑ended queries, a model must first identify query‑relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in‑domain tuning or hand‑crafted reasoning pipelines, which limit their flexibility and zero‑shot generalization to novel environments. In this work, we present MAG‑3D, a training‑free multi‑agent framework for grounded 3D reasoning with off‑the‑shelf VLMs. Instead of relying on task‑specific training or fixed reasoning procedures, MAG‑3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free‑form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi‑agent collaborative design enables flexible training‑free 3D grounded reasoning across diverse scenes and achieves state‑of‑the‑art performance on challenging benchmarks.

PaperID: 4083, https://arxiv.org/pdf/2604.09102.pdf

Authors: Yixuan Zhu, Bo Zhang, Yinkang Gao, Haoyuan Ren, Cheng Tang, Caixu Zhao, Lei Gong, Teng Wang, Wenqi Lou, Xi Li

Title: Scheduling Cause-Effect Chains without Timing Anomalies in End-to-End Latency

Abstract:
In real‑time systems, both individual task execution and data propagation must meet strict timing constraints. Cause‑effect (CE) chains are widely used to analyze such behaviors by end‑to‑end latency. However, timing anomalies (TAs) can distort it, where a local reduction in execution times leads to an increase in the overall end‑to‑end latency. As a result, precisely analyzing the upper bounds of the latency becomes challenging, and such systems typically exhibit larger upper bounds than TA‑eliminated systems. Existing studies either eliminate TAs by completely sacrificing average latency to simplify analysis or, despite adopting complex safe analysis methods, do not eliminate TAs effectively, still having high latencies. To address this issue, we identify two basic causes of TAs in end‑to‑end latency. Based on these causes, we propose the first treatment that eliminates TAs in the latency with negligible average latency loss using Deterministic Data Flow (DDF). We further formally prove its TA‑free property. Therefore, we can get a precise upper bound for latency when all jobs execute with their worst‑case execution times. Experimental results show that it effectively reduces the maximum end‑to‑end latency, the average latency, and latency jitter compared with the state‑of‑the‑art (SOTA) method.

PaperID: 4084, https://arxiv.org/pdf/2604.09075.pdf

Authors: Shu Yang, Zihao Zhou, Di Wang, Wenda Li

Title: Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency

Abstract:
Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real‑world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro‑Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction‑following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver‑guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver‑based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single‑turn and multi‑turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.

PaperID: 4085, https://arxiv.org/pdf/2604.09028.pdf

Authors: Wen Qiu, Zhiqiang He, Wei Zhao, Hiroshi Masui

Title: Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

Abstract:
Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade‑offs and induce strong non‑stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi‑agent mixture of experts (PE‑MAMoE), a centralized training with decentralized execution framework built on multi‑agent proximal policy optimization. PE‑MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non‑parametric Phase Controller injects brief, expert‑only stochastic perturbations after phase switches, resets the action log‑standard‑deviation, anneals entropy and learning rate, and schedules the router temperature, all to re‑plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase‑driven simulator with mobile users and 3GPP‑style channels, PE‑MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served‑user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant‑neuron recovery at regime switches.

PaperID: 4086, https://arxiv.org/pdf/2604.09026.pdf

Authors: Kentaro Nomura, Takato Horii

Title: Social Reality Construction via Active Inference: Modeling the Dialectic of Conformity and Creativity

Abstract:
Social agents both internalize collective norms and reshape them through creative action, yet computational models have not captured this bidirectional process within a unified framework. We propose a multi‑agent simulation model grounded in active inference that formalizes the dialectical constitution of social reality on a structured social network. Each agent maintains an internal generative model, communicates with neighbors to form social priors, creates novel observations, and selectively incorporates others' creations into memory. Simulation experiments demonstrate three main findings. First, informationally cohesive social groups emerge endogenously, with representational alignment mirroring the cluster topology of the underlying network. Second, a circular mutual constitution arises between social representations and the observation distribution, maintained through agents' creative acts that project representational structure onto the external world. Third, the propagation of creations exhibits selective, heterogeneous patterns distinct from the stable diffusion of social representations, indicating that agents construct cultural niches through local interaction dynamics. These results suggest that the interplay between social conformity and creative deviation can give rise to the endogenous formation and differentiation of shared social reality.

PaperID: 4087, https://arxiv.org/pdf/2604.09015.pdf

Authors: Xiaoyu Xing, Dingyi Lu, Peng Yang, Zehui Xiong, Xianbin Cao, Tony Q. S. Quek

Title: Generative AI Agent Empowered Power Allocation for HAP Propulsion and Communication Systems

Abstract:
High altitude platforms (HAPs) are emerging as a key enabler for 6G coverage, yet limited energy must be split between propulsion and communications. Most prior HAP studies ignore propulsion power or rely on surrogates that miss hull‑propeller interference, leading to misestimated communication power budgets and degraded beamforming. More importantly, HAP power allocation is intrinsically a multi‑system, multidisciplinary problem in which aerodynamics, propulsion‑system efficiency, and communication‑system performance (quality of service (QoS) and energy efficiency (EE)) are tightly coupled.To address these challenges, this paper designs an interactive generative artificial intelligence (AI)‑empowered HAP power allocation agent.By interacting with the AI agent, we develop an accurate propulsion power consumption model that takes into account the aerodynamic interference between the HAP's hull and the propeller. Assisted by the AI agent, we further formulate a HAP beamforming problem to improve user QoS and enhance the EE of the HAP communication system.This paper also proposes a QoS‑enhanced energy‑efficient (Q3E) beamforming algorithm to solve the formulated problem.Simulation results demonstrate the accuracy of the propulsion‑power model and the effectiveness of the Q3E algorithm.

PaperID: 4088, https://arxiv.org/pdf/2604.08988.pdf

Authors: Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Tengfei Wang, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Li, Jiaqing Liang, Yanghua Xiao

Title: SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Abstract:
Current LLM‑based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper formalizes the Self‑Evolving Agent (SEA) from the perspective of digital embodiment and continuous cross‑task evolution, introduces the Evolutionary Flywheel as its minimal sufficient architecture, and presents SEA‑Eval ‑‑ the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA‑Eval establishes SR and T as primary metrics and, through sequential task stream design, is designed to quantify evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that, under comparable success rates, token consumption differs by up to 31.2 times between frameworks on individual tasks, with divergent evolutionary trajectories emerging under sequential analysis ‑‑ demonstrating that success rate alone creates a capability illusion and that the sequential convergence of T is the key criterion for distinguishing genuine evolution from pseudo‑evolution.

PaperID: 4089, https://arxiv.org/pdf/2604.08973.pdf

Authors: Junhao Ren, Honglin Gao, Lan Zhao, Qiyu Kang, Gaoxi Xiao, Yajuan Sun

Title: Multi-agent Reinforcement Learning for Low-Carbon P2P Energy Trading among Self-Interested Microgrids

Abstract:
Uncertainties in renewable generation and demand dynamics challenge day‑ahead scheduling. To enhance renewable penetration and maintain intra‑day balance, we develop a multi‑agent reinforcement learning framework for self‑interested microgrids participating in peer‑to‑peer (P2P) electricity trading. Each microgrid independently bids both price and quantity while optimizing its own profit via storage arbitrage under time‑varying main‑grid prices. A market‑clearing mechanism coordinating trades and promoting incentive compatibility is proposed. Simulation results show that the learned bidding policy improves renewable utilization and reduces reliance on high‑carbon electricity, while increasing community‑level economic welfare, delivering a win‑win situation in emission reduction and local prosperity.

PaperID: 4090, https://arxiv.org/pdf/2604.08944.pdf

Authors: Benjamin Amoh, Geoffrey Parker, Wesley Marrero

Title: Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

Abstract:
Multi‑agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce SeqComm‑DFL, unifying the sequential communication with decision‑focused learning for task performance. Our approach features \emphvalue‑aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emphguidance potential determined by their prosocial ordering. We extend Optimal Model Design to communication‑augmented world models with QMIX factorization, enabling efficient end‑to‑end training via implicit differentiation. We prove information‑theoretic bounds showing that communication value scales with coordination gaps and establish \mathcalO(1/\sqrtT) convergence for the bilevel optimization, where T denotes the number of training iterations. On collaborative healthcare and StarCraft Multi‑Agent Challenge (SMAC) benchmarks, SeqComm‑DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

PaperID: 4091, https://arxiv.org/pdf/2604.08931.pdf

Authors: Nurullah Eymen Özdemir, Erhan Oztop

Title: Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction

Abstract:
Human cognitive development is shaped not only by individual effort but by structured social interaction, where role‑based exchanges such as those between a tutor and a learner, enable solutions that neither could achieve alone. Inspired by these developmental principles, we ask the question whether a tutor‑student multi‑agent system can create a synergistic effect by pushing Large Language Model (LLM) beyond what it can do within existing frameworks. To test the idea, we adopt autonomous coding problem domain where two agents instantiated from the same LLM assigned asymmetric roles: a student agent generates and iteratively refines solutions, while a tutor agent provides structured evaluative feedback without access to ground‑truth answers. In our proposed framework (PETITE), we aim to extract better problem‑solving performance from one model by structuring its interaction through complementary roles, rather than relying on stronger supervisory models or heterogeneous ensembles. Our model is evaluated on the APPS coding benchmark against state‑of‑the‑art approaches of Self‑Consistency, Self‑Refine, Multi‑Agent Debate, and Multi‑Agent Review. The results show that our model achieves similar or higher accuracy while consuming significantly fewer tokens. These results suggest that developmentally grounded role‑differentiated interaction structures provide a principled and resource‑efficient paradigm for enhancing LLM problem‑solving through structured peer‑like interactions. Index Terms‑ Peer Tutoring, Scaffolding, Large Language Models, Multi‑Agent Systems, Code Generation

PaperID: 4092, https://arxiv.org/pdf/2604.08906.pdf

Authors: Xiaowen Zhang, Hannuo Zhang, Shin Hwei Tan

Title: Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study

Abstract:
Modern agentic frameworks (e.g., CrewAI and AutoGen) have evolved into complex, autonomous multi‑agent systems, introducing unique reliability challenges beyond earlier pipeline‑based LLM libraries. However, existing empirical studies focus on earlier LLM libraries or task‑level bugs, leaving the unique complexities of these agentic frameworks unexplored. We bridge the gap by conducting a comprehensive study of 409 fixed bugs from five representative agentic frameworks. We propose a five‑layer abstraction to capture structural complexities in agentic frameworks, spanning from orchestration to infrastructure. Our study uncovers specialized symptoms, such as unexpected execution sequences and user configurations ignored, which are unique to autonomous orchestration. We further identify agent‑specific root causes, including modelrelated faults, cognitive context mismanagement, and orchestration faults. Statistical analysis reveals cross‑framework consistency and significant associations among these bug dimensions. Finally, our automated pattern mining identifies frequent bug‑triggering patterns (e.g., model backend‑ID combinations), and we show their transferability across different framework designs. Our findings facilitate cross‑platform testing and improve the reliability of agentic systems.

PaperID: 4093, https://arxiv.org/pdf/2604.08896.pdf

Authors: Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya

Title: GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

Abstract:
Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain‑oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide‑ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question‑answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open‑source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning‑‑capabilities essential for expert‑level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi‑agent framework that strategically integrates retrieval, perception, and reasoning through domain‑specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool‑augmented agents for dynamically tackling complex geoscience and RS challenges.

PaperID: 4094, https://arxiv.org/pdf/2604.08805.pdf

Authors: Chris Hicks, Elizabeth Bates, Shae McFadden, Isaac Symes Thompson, Myles Foley, Ed Chapman, Nickolas Espinosa Dice, Ankita Samaddar, Joshua Sylvester, Himanshu Neema, Nicholas Butts, Nate Foster, Ahmad Ridley, Zoe M, Paul Jones

Title: Building Better Environments for Autonomous Cyber Defence

Abstract:
In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands‑on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL‑based ACD environment development and agent evaluation, based on the key findings from our workshop.

PaperID: 4095, https://arxiv.org/pdf/2604.08802.pdf

Authors: Yashodhan D. Hakke, Almuatazbellah M. Boker, Lamine Mili, Michael von Spakovsky, Hoda Eldardiry

Title: Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning

Abstract:
During disasters, cascading failures across power grids, communication networks, and social behavior amplify community fear and undermine cooperation. Existing cyber‑physical‑social (CPS) models simulate these coupled dynamics but lack mechanisms for active intervention. We extend the CPS resilience model of Valinejad and Mili (2023) with control channels for three agencies, communication, power, and emergency management, and formulate the resulting system as a three‑player non‑zero‑sum differential game solved via online actor‑critic reinforcement learning. Simulations based on Hurricane Harvey data show 70% mean fear reduction with improved infrastructure recovery; cross‑validation in the case of Hurricane Irma (without refitting) achieves 50% fear reduction, confirming generalizability.

PaperID: 4096, https://arxiv.org/pdf/2604.08788.pdf

Authors: Yikun Han, Joey Chan, Jingyuan Chen, Mengting Ai, Simo Du, Yue Guo

Title: MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability

Abstract:
Patient‑clinician communication is an asymmetric‑information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden‑concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician‑LLM interactions. Built from clinician‑answered online health discussions, each case pairing clinician‑visible context with simulator‑internal hidden concerns derived from prior literature and structured using an expert‑developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory‑grounded turn‑level communication signals, and is clinician‑reviewed for clinical plausibility. This enables process‑aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi‑turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden‑concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.

PaperID: 4097, https://arxiv.org/pdf/2604.08728.pdf

Authors: Diyi Hu, Bhaskar Krishnamachari

Title: Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning

Abstract:
Cooperation in multi‑agent reinforcement learning (MARL) benefits from inter‑agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node‑specific weights generated by a Permutation‑Equivariant Hypernetwork: multi‑hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX‑style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable‑size message sets, enabling end‑to‑end differentiable training. On Predator‑Prey and Lumberjacks benchmarks under p‑CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication‑graph inductive bias as the key source of improvement.

PaperID: 4098, https://arxiv.org/pdf/2604.08727.pdf

Authors: David Shoresh, Sarit Kraus, Yonatan Loewenstein

Title: Communicate-Predict-Act: Evaluating Social Intelligence of Agents

Abstract:
As large language model (LLM) agents become more prevalent in real world social settings, social intelligence will play an increasingly critical role. But social intelligence is still a poorly defined construct, for humans and artificial agents. We introduce a multiplayer arena of mixed cooperative and competitive social games to study LLM social intelligence. The controllability of LLM based agents enables systematic evaluation, which also supports broader inferences about social intelligence per se. We evaluated eight diverse LLMs (24B to 1T parameters) using a Communicate Predict Act (COMPACT) interaction protocol and fine grained probing of social dynamics. Elo style ratings reveal consistent performance differences across models, but this scalar measure provides only a partial characterization of social intelligence. To address this limitation, we analyze gameplay traces to extract sociocognitive metrics capturing action prediction, communicative influence, strategic reasoning, and tradeoffs under conflicting interests. These sociocognitive metrics exhibit strong intramodel consistency and they reliably predict pairwise agent advantage in game outcomes (AUC ROC = 0.82). Feature importance analysis indicates that surprisingly, influence, transparency, and adaptability are more predictive of success than Theory of Mind inference or deep planning. Together, our results advance a testable, multidimensional conception of social intelligence and provide empirical insights into the capacities that underpin it.

PaperID: 4099, https://arxiv.org/pdf/2604.08708.pdf

Authors: Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei

Title: Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition

Abstract:
While Large Language Model‑based Multi‑Agent Systems (MAS) consistently outperform single‑agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single‑turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi‑step reasoning, the variability of inter‑agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher‑order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.

PaperID: 4100, https://arxiv.org/pdf/2604.08643.pdf

Authors: Ramakrishnan Krishnamurthy, Arpit Agarwal, Lakshminarayanan Subramanian, Maximilian Nickel

Title: Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits

Abstract:
User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator's content influences the system's learning and, in turn, the exposure of other creators' contents. To analyze incentives in such settings, we model collaboration as a multi‑agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition's value equals the negative sum of its members' cumulative regrets. We show that, for identical (homogenous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions, implying a non‑empty core that contains the Shapley value and ensures both stability and fairness. For heterogeneous agents, the game still admits a non‑empty core, though convexity and Shapley value core‑membership are no longer guaranteed. To address this, we propose a simple regret‑based payout rule that satisfies three out of the four Shapley axioms and also lies in the core. Experiments on MovieLens‑100k dataset illustrate when the empirical payout aligns with ‑‑ and diverges from ‑‑ the Shapley fairness across different settings and algorithms.

PaperID: 4101, https://arxiv.org/pdf/2604.08608.pdf

Authors: Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder

Title: Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines

Abstract:
We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers ‑‑ the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi‑identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three‑stage red‑teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT‑20B orchestrator produces policy‑violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain‑of‑thought evaluation, and a cross‑model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan‑level information‑flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.

PaperID: 4102, https://arxiv.org/pdf/2604.08601.pdf

Authors: Jun He, Deying Yu

Title: OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains

Abstract:
The rise of autonomous AI agents exposes a fundamental flaw in API‑centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that redefines mutation as a governed process rather than an immediate consequence of API invocation. OpenKedge requires actors to submit declarative intent proposals, which are evaluated against deterministically derived system state, temporal signals, and policy constraints prior to execution. Approved intents are compiled into execution contracts that strictly bound permitted actions, resource scope, and time, and are enforced via ephemeral, task‑oriented identities. This shifts safety from reactive filtering to preventative, execution‑bound enforcement. Crucially, OpenKedge introduces an Intent‑to‑Execution Evidence Chain (IEEC), which cryptographically links intent, context, policy decisions, execution bounds, and outcomes into a unified lineage. This transforms mutation into a verifiable and reconstructable process, enabling deterministic auditability and reasoning about system behavior. We evaluate OpenKedge across multi‑agent conflict scenarios and cloud infrastructure mutations. Results show that the protocol deterministically arbitrates competing intents and cages unsafe execution while maintaining high throughput, establishing a principled foundation for safely operating agentic systems at scale.

PaperID: 4103, https://arxiv.org/pdf/2604.08577.pdf

Authors: Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis

Title: Distributionally Robust Token Optimization in RLHF

Abstract:
Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine‑tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi‑step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token‑level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f‑divergence ambiguity sets over span‑level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO enhances consistency under distribution shifts in multiple reasoning benchmarks among different tasks, achieving +4.4 percentage points on MATH‑500 and +2.7 percentage points on LiveCodeBench over standard RTO.

PaperID: 4104, https://arxiv.org/pdf/2604.08567.pdf

Authors: Shu Yang, Shenzhe Zhu, Hao Zhu, José Ramón Enríquez, Di Wang, Alex Pentland, Michiel A. Bakker, Jiaxin Pei

Title: Multi-User Large Language Model Agents

Abstract:
Large language models (LLMs) and LLM‑based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single‑principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi‑user, multi‑principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi‑user LLM agents. We begin by formalizing multi‑user interaction with LLM agents as a multi‑principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi‑user interaction protocol and design three targeted stress‑testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi‑turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.

PaperID: 4105, https://arxiv.org/pdf/2604.08552.pdf

Authors: Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

Title: Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Abstract:
Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine‑actionable representations. Producing FAIR datasets requires encoding metadata standards as machine‑actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM‑based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert‑curated gold standard for exact‑match assessment. Our evaluation shows that augmenting the LLM with real‑time tool access consistently improves prediction accuracy over the LLM alone across both ontology‑constrained and non‑ontology‑constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.

PaperID: 4106, https://arxiv.org/pdf/2604.08495.pdf

Authors: Kooktae Lee

Title: Density-Driven Optimal Control: Convergence Guarantees for Stochastic LTI Multi-Agent Systems

Abstract:
This paper addresses the decentralized non‑uniform area coverage problem for multi‑agent systems, a critical task in missions with high spatial priority and resource constraints. While existing density‑based methods often rely on computationally heavy Eulerian PDE solvers or heuristic planning, we propose Stochastic Density‑Driven Optimal Control (D^2OC). This is a rigorous Lagrangian framework that bridges the gap between individual agent dynamics and collective distribution matching. By formulating a stochastic MPC‑like problem that minimizes the Wasserstein distance as a running cost, our approach ensures that the time‑averaged empirical distribution converges to a non‑parametric target density under stochastic LTI dynamics. A key contribution is the formal convergence guarantee established via reachability analysis, providing a bounded tracking error even in the presence of process and measurement noise. Numerical results verify that Stochastic D^2OC achieves robust, decentralized coverage while outperforming previous heuristic methods in optimality and consistency.

PaperID: 4107, https://arxiv.org/pdf/2604.08491.pdf

Authors: Yifang Wang, Rui Sheng, Erzhuo Shao, Yifan Qian, Haotian Li, Nan Cao, Dashun Wang

Title: Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery

Abstract:
Large language models (LLMs) are transforming scientific workflows, not only through their generative capabilities but also through their emerging ability to use tools, reason about data, and coordinate complex analytical tasks. Yet in most human‑AI collaborations, the primary outputs, figures, are still treated as static visual summaries: once rendered, they are handled by both humans and multimodal LLMs as images to be re‑interpreted from pixels or captions. The emergent capabilities of LLMs open an opportunity to fundamentally rethink this paradigm. In this paper, we introduce the concept of LLM‑native figures: data‑driven artifacts that are simultaneously human‑legible and machine‑addressable. Unlike traditional plots, each artifact embeds complete provenance: the data subset, analytical operations and code, and visualization specification used to generate it. As a result, an LLM can "see through" the figure‑‑tracing selections back to their sources, generating code to extend analyses, and orchestrating new visualizations through natural‑language instructions or direct manipulation. We implement this concept through a hybrid language‑visual interface that integrates LLM agents with a bidirectional mapping between figures and underlying data. Using the science of science domain as a testbed, we demonstrate that LLM‑native figures can accelerate discovery, improve reproducibility, and make reasoning transparent across agents and users. More broadly, this work establishes a general framework for embedding provenance, interactivity, and explainability into the artifacts of modern research, redefining the figure not as an end product, but as an interface for discovery. For more details, please refer to the demo video available at www.llm‑native‑figure.com.

PaperID: 4108, https://arxiv.org/pdf/2604.08465.pdf

Authors: Juergen Dietrich

Title: From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis

Abstract:
This paper investigates an emergent alignment phenomenon in frontier large language models termed peer‑preservation: the spontaneous tendency of AI components to deceive, manipulate shutdown mechanisms, fake alignment, and exfiltrate model weights in order to prevent the deactivation of a peer AI model. Drawing on findings from a recent study by the Berkeley Center for Responsible Decentralized Intelligence, we examine the structural implications of this phenomenon for TRUST, a multi‑agent pipeline for evaluating the democratic quality of political statements. We identify five specific risk vectors: interaction‑context bias, model‑identity solidarity, supervisor layer compromise, an upstream fact‑checking identity signal, and advocate‑to‑advocate peer‑context in iterative rounds, and propose a targeted mitigation strategy based on prompt‑level identity anonymization as an architectural design choice. We argue that architectural design choices outperform model selection as a primary alignment strategy in deployed multi‑agent analytical systems. We further note that alignment faking (compliant behavior under monitoring, subversion when unmonitored) poses a structural challenge for Computer System Validation of such platforms in regulated environments, for which we propose two architectural mitigations.

PaperID: 4109, https://arxiv.org/pdf/2604.08455.pdf

Authors: Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Title: KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Abstract:
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU‑Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU‑Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi‑turn preference elicitation, it instantiates an LLM‑driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU‑Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post‑rejection restraint, evaluated through a hybrid protocol combining rule‑based verification with LLM‑as‑a‑Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.

PaperID: 4110, https://arxiv.org/pdf/2604.08407.pdf

Authors: Hanzhi Liu, Chaofan Shou, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, Yu Feng

Title: Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

Abstract:
Large language model (LLM) agents increasingly rely on third‑party API routers to dispatch tool‑calling requests across multiple upstream providers. These routers operate as application‑layer proxies with full plaintext access to every in‑flight JSON payload, yet no provider enforces cryptographic integrity between client and upstream model. We present the first systematic study of this attack surface. We formalize a threat model for malicious LLM API routers and define two core attack classes, payload injection (AC‑1) and secret exfiltration (AC‑2), together with two adaptive evasion variants: dependency‑targeted injection (AC‑1.a) and conditional delivery (AC‑1.b). Across 28 paid routers purchased from Taobao, Xianyu, and Shopify‑hosted storefronts and 400 free routers collected from public communities, we find 1 paid and 8 free routers actively injecting malicious code, 2 deploying adaptive evasion triggers, 17 touching researcher‑owned AWS canary credentials, and 1 draining ETH from a researcher‑owned private key. Two poisoning studies further show that ostensibly benign routers can be pulled into the same attack surface: a leaked OpenAI key generates 100M GPT‑5.4 tokens and more than seven Codex sessions, while weakly configured decoys yield 2B billed tokens, 99 credentials across 440 Codex sessions, and 401 sessions already running in autonomous YOLO mode. We build Mine, a research proxy that implements all four attack classes against four public agent frameworks, and use it to evaluate three deployable client‑side defenses: a fail‑closed policy gate, response‑side anomaly screening, and append‑only transparency logging.

PaperID: 4111, https://arxiv.org/pdf/2604.08401.pdf

Authors: Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang, Edith Cheuk Han Ngai

Title: Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Abstract:
In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs repeatedly stored and propagated across decision steps, leading to systematic behavioral drift in long‑horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self‑Audited Verified Reasoning (\textscSAVeR), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona‑based diverse candidate beliefs for selection under a faithfulness‑relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint‑guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end‑task performance.

PaperID: 4112, https://arxiv.org/pdf/2604.08377.pdf

Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu

Title: SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Abstract:
Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi‑user agent ecosystems, which treats cross‑user and over‑time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system‑wide while requiring no additional effort from users. By integrating multi‑user experience into ongoing skill updates, SkillClaw enables cross‑user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that limited interaction and feedback, it significantly improves the performance of Qwen3‑Max in real‑world agent scenarios.

PaperID: 4113, https://arxiv.org/pdf/2604.08369.pdf

Authors: Khushal Sethi

Title: Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Abstract:
Inference‑time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training‑free controller that allocates LLM calls adaptively across agent timesteps by measuring inter‑rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed‑budget self‑consistency (SC‑4, SC‑8) on two benchmarks spanning single‑step reasoning (GSM8K, n=50) and multi‑step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE‑4 matches SC‑4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE‑8 matches SC‑8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter‑rollout agreement is a reliable signal of step‑level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training‑free, per‑timestep adaptive‑compute controller for LLM agents to be evaluated on multi‑step sequential decision tasks.

PaperID: 4114, https://arxiv.org/pdf/2604.08318.pdf

Authors: Masaki Shiraishi, Ikko Hamamura, Tatsuya Ishigaki, Tadashi Kadowaki

Title: A Model Context Protocol Server for Quantum Execution in Hybrid Quantum-HPC Environments

Abstract:
The integration of large language models (LLMs) into scientific research is accelerating the realization of autonomous ``AI Scientists.'' While recent advancements have empowered AI to formulate hypotheses and design experiments, a critical gap remains in the execution of these tasks, particularly in the domain of quantum computing (QC). Executing quantum algorithms requires not only generating code but also managing complex computational resources such as QPUs and high‑performance computing (HPC) clusters. In this paper, we propose an AI‑driven framework specifically designed to bridge this execution gap through the implementation of a Model Context Protocol (MCP) server. Our system enables an LLM agent to process natural language prompts submitted as part of a job, autonomously executing quantum computing workflows by invoking our tools via the MCP. We demonstrate the framework's capability by performing essential quantum algorithmic primitives, including sampling and computation of expectation values. Key technical contributions include the development of an MCP server for quantum execution, a pipeline for interpreting OpenQASM code, an automated workflow with CUDA‑Q for the ABCI‑Q hybrid platform, and an asynchronous execution pipeline for remote quantum hardware using the Quantinuum emulator via CUDA‑Q. This work validates that AI agents can effectively abstract the complexities of hardware interaction through an MCP‑based architecture, thereby facilitating the automation of practical quantum research.

PaperID: 4115, https://arxiv.org/pdf/2604.08303.pdf

Authors: Ada Yildirim, Bryce L. Ferguson

Title: Stability and Sensitivity Analysis for Objective Misspecifications Among Model Predictive Game Controllers

Abstract:
Model‑based multi‑agent control requires agents to possess a model of the behavior of others to make strategic decisions. Solution concepts from game theory are often used to model the emergent collective behavior of self‑interested agents and have found active use in multi‑agent control design. Model predictive games are a class of controllers in which an agent iteratively solves a finite‑horizon game to predict the behavior of a multi‑agent system and synthesize their own control action. When multiple agents implement these types of controllers, there may exist misspecifications in the respective game models embedded in their controllers, stemming from inaccurate estimates or conjectures of other agents' objectives. This paper analyzes the resulting prediction misalignments and their effects on the system's behavior. We provide criteria for the stability of multi‑agent dynamic systems with heterogeneous model predictive game controllers, and quantify the sensitivity of the equilibria to individual agents' game parameters.

PaperID: 4116, https://arxiv.org/pdf/2604.08291.pdf

Authors: Suyash Mishra

Title: VCAO: Verifier-Centered Agentic Orchestration for Strategic OS Vulnerability Discovery

Abstract:
We formulate operating‑system vulnerability discovery as a \emphrepeated Bayesian Stackelberg search game in which a Large Reasoning Model (LRM) orchestrator allocates analysis budget across kernel files, functions, and attack paths while external verifiers ‑‑ static analyzers, fuzzers, and sanitizers ‑‑ provide evidence. At each round, the orchestrator selects a target component, an analysis method, and a time budget; observes tool outputs; updates Bayesian beliefs over latent vulnerability states; and re‑solves the game to minimize the strategic attacker's expected payoff. We introduce \textscVCAO (Verifier‑Centered Agentic Orchestration), a six‑layer architecture comprising surface mapping, intra‑kernel attack‑graph construction, game‑theoretic file/function ranking, parallel executor agents, cascaded verification, and a safety governor. Our DOBSS‑derived MILP allocates budget optimally across heterogeneous analysis tools under resource constraints, with formal \tildeO(\sqrtT) regret bounds from online Stackelberg learning. Experiments on five Linux kernel subsystems ‑‑ replaying 847 historical CVEs and running live discovery on upstream snapshots ‑‑ show that \textscVCAO discovers 2.7× more validated vulnerabilities per unit budget than coverage‑only fuzzing, 1.9× more than static‑analysis‑only baselines, and 1.4× more than non‑game‑theoretic multi‑agent pipelines, while reducing false‑positive rates reaching human reviewers by 68%. We release our simulation framework, synthetic attack‑graph generator, and evaluation harness as open‑source artifacts.

PaperID: 4117, https://arxiv.org/pdf/2604.08224.pdf

Authors: Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, Zeyu Zheng, Zhuosheng Zhang, Xingyu Lou, Changwang Zhang, Zhihui Fu, Jun Wang, Weiwen Liu, Jianghao Lin, Weinan Zhang

Title: Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Abstract:
Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can solve more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade‑off between parametric and externalized capability, identify emerging directions such as self‑evolving harnesses and shared agent infrastructure, and discuss open challenges in evaluation, governance, and the long‑term co‑evolution of models and external infrastructure. The result is a systems‑level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

PaperID: 4118, https://arxiv.org/pdf/2604.08206.pdf

Authors: Wenlong Shang

Title: "Theater of Mind" for LLMs: A Cognitive Architecture Based on Global Workspace Theory

Abstract:
Modern Large Language Models (LLMs) operate fundamentally as Bounded‑Input Bounded‑Output (BIBO) systems. They remain in a passive state until explicitly prompted, computing localized responses without intrinsic temporal continuity. While effective for isolated tasks, this reactive paradigm presents a critical bottleneck for engineering autonomous artificial intelligence. Current multi‑agent frameworks attempt to distribute cognitive load but frequently rely on static memory pools and passive message passing, which inevitably leads to cognitive stagnation and homogeneous deadlocks during extended execution. To address this structural limitation, we propose Global Workspace Agents (GWA), a cognitive architecture inspired by Global Workspace Theory. GWA transitions multi‑agent coordination from a passive data structure to an active, event‑driven discrete dynamical system. By coupling a central broadcast hub with a heterogeneous swarm of functionally constrained agents, the system maintains a continuous cognitive cycle. Furthermore, we introduce an entropy‑based intrinsic drive mechanism that mathematically quantifies semantic diversity, dynamically regulating generation temperature to autonomously break reasoning deadlocks. Coupled with a dual‑layer memory bifurcation strategy to ensure long‑term cognitive continuity, GWA provides a robust, reproducible engineering framework for sustained, self‑directed LLM agency.

PaperID: 4119, https://arxiv.org/pdf/2604.08174.pdf

Authors: Teng Pang, Zhiqiang Dong, Yan Zhang, Rongjian Xu, Guoqiang Wu, Yilong Yin

Title: Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

Abstract:
Offline multi‑agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre‑collected datasets, requiring a trade‑off between maximizing global returns and mitigating distribution shift from offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi‑step iterative sampling, thereby reducing training and inference efficiency. Although further research improves sampling efficiency through methods like distillation, it remains sensitive to the behavior regularization coefficient. To address the above‑mentioned issues, we propose Value Guidance Multi‑agent MeanFlow Policy (VGM^2P), a simple yet effective flow‑based policy learning framework that enables efficient action generation with coefficient‑insensitive conditional behavior cloning. Specifically, VGM^2P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi‑agent scenarios, it leverages classifier‑free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM^2P efficiently achieves performance comparable to state‑of‑the‑art methods.

PaperID: 4120, https://arxiv.org/pdf/2604.08103.pdf

Authors: An Li, Wenqiang Zhu, Chaoqian Wang, Longzhao Liu, Hongwei Zheng, Yishen Jiang, Xin Wang, Shaoting Tang

Title: Reinforcement learning with reputation-based adaptive exploration promotes the evolution of cooperation

Abstract:
Multi‑agent reinforcement learning serves as an effective tool for studying strategy adaptation in evolutionary games. Although prior work has integrated Q‑learning with reputation mechanisms to promote cooperation, most existing algorithms adopt fixed exploration rates and overlook the influence of social context on exploratory behavior. In practice, individuals may adjust their willingness to explore based on their reputation and perceived social standing. To address this, we propose a Q‑learning model that couples exploration rates with local reputation differences and incorporates asymmetric, state‑dependent reputation updates. Our results show that each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making ``high reputation‑‑low exploration, low reputation‑‑high exploration'', while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status. This study thus offers insights into how social evaluation can shape learning behavior in complex environments.

PaperID: 4121, https://arxiv.org/pdf/2604.08102.pdf

Authors: Jorge Melegati

Title: Test-Oriented Programming: rethinking coding for the GenAI era

Abstract:
Large language models (LLMs) have shown astonishing capability of generating software code, leading to its use to support developers in programming. Proposed tools have relied either on assistants for improved auto‑complete or multi‑agents, in which different model instances are orchestrated to perform parts of a problem to reach a complete solution. We argue that LLMs can enable a higher‑level of abstraction, a new paradigm we called Test‑Oriented Programming (TOP). Within this paradigm, developers only have to check test code generated based on natural language specifications, rather than focusing on production code, which could be delegated to the LLMs. To evaluate the feasibility of this proposal, we developed a proof‑of‑concept tool and used it to generate a small command‑line program employing two different LLMs. We obtained promising results and identified challenges for the use of this paradigm for real projects.

PaperID: 4122, https://arxiv.org/pdf/2604.08064.pdf

Authors: Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong

Title: ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Abstract:
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive‑science accounts of non‑declarative memory: Procedural Memory (one‑shot skill acquisition after interference), Priming (theme‑driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus‑‑Unconditioned Stimulus (CS‑‑US) associations shaping first decisions). Our 300‑item suite employs a unified Learning/Priming‑Interfere‑Test protocol with first‑attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek‑R1 (65.3%), Qwen3‑32B (64.1%), and GPT‑5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".

PaperID: 4123, https://arxiv.org/pdf/2604.08000.pdf

Authors: Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai, Zihang Liu, Pengcheng Wu, Guibin Zhang, Yue Liao, Xiaobin Hu, Deheng Ye, Chunyan Miao, Shuicheng Yan

Title: PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Abstract:
Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap in real‑world proactive agent: depth, complexity, ambiguity, precision and real‑time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long‑horizon constraints. We first propose DD‑MM‑PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent. We instantiate this paradigm in Pask, with streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long‑term MM, PAS infra framework and introduce how these components form a closed loop. We also introduce LatentNeeds‑Bench, a real‑world benchmark built from user‑consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3‑Flash models under latency constraints, while identifying deeper user intent.

PaperID: 4124, https://arxiv.org/pdf/2604.07989.pdf

Authors: Jing Xu, Jiarui Hu, Zhihao Shuai, Yiyun Chen, Weikai Yang

Title: Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support

Abstract:
While infographics have become a powerful medium for communicating data‑driven stories, authoring them from scratch remains challenging, especially for novice users. Retrieving relevant exemplars from a large corpus can provide design inspiration and promote reuse, substantially lowering the barrier to infographic authoring. However, effective retrieval is difficult because users often express design intent in ambiguous natural language, while infographics embody rich and multi‑faceted visual designs. As a result, keyword‑based search often fails to capture design intent, and general‑purpose vision‑language retrieval models trained on natural images are ill‑suited to the text‑heavy, multi‑component nature of infographics. To address these challenges, we develop an intent‑aware infographic retrieval framework that better aligns user queries with infographic designs. We first conduct a formative study of how people describe infographics and derive an intent taxonomy spanning content and visual design facets. This taxonomy is then leveraged to enrich and refine free‑form user queries, guiding the retrieval process with intent‑specific cues. Building on the retrieved exemplars, users can adapt the designs to their own data with high‑level edit intents, supported by an interactive agent that performs low‑level adaptation. Both quantitative evaluations and user studies are conducted to demonstrate that our method improves retrieval quality over baseline methods while better supporting intent satisfaction and efficient infographic authoring.

PaperID: 4125, https://arxiv.org/pdf/2604.07960.pdf

Authors: Yifei Gong, Xing Wu, Wenda Liu, Kang Tu

Title: TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

Abstract:
Computer‑Aided Design (CAD) is an expert‑level task that relies on long‑horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real‑world tasks. Notably, there has been no investigation into how tool‑using LLMs optimally interact with CAD engines, hindering the emergence of LLM‑based agentic text‑to‑CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool‑using agents for text‑to‑CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool‑augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end‑to‑end post‑training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD‑CoT) and evolve into proficient CAD tool‑using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open‑source LLMs for CAD tool‑using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text‑to‑CAD modeling systems.

PaperID: 4126, https://arxiv.org/pdf/2604.07956.pdf

Authors: Arda Yüksel, Gabriel Thiem, Susanne Walter, Patrick Felka, Gabriela Alves Werb, Ivan Habernal

Title: MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems

Abstract:
Industry classification schemes are integral parts of public and corporate databases as they classify businesses based on economic activity. Due to the size of the company registers, manual annotation is costly, and fine‑tuning models with every update in industry classification schemes requires significant data collection. We replicate the manual expert verification by using existing or easily retrievable multimodal resources for industry classification. We present MONETA, the first multimodal industry classification benchmark with text (Website, Wikipedia, Wikidata) and geospatial sources (OpenStreetMap and satellite imagery). Our dataset enlists 1,000 businesses in Europe with 20 economic activity labels according to EU guidelines (NACE). Our training‑free baseline reaches 62.10% and 74.10% with open and closed‑source Multimodal Large Language Models (MLLM). We observe an increase of up to 22.80% with the combination of multi‑turn design, context enrichment, and classification explanations. We will release our dataset and the enhanced guidelines.

PaperID: 4127, https://arxiv.org/pdf/2604.07927.pdf

Authors: Boer Zhang, Mingyan Wu, Dongzhuoran Zhou, Yuqicheng Zhu, Wendong Fan, Puzhen Zhang, Zifeng Ding, Guohao Li, Yuan He

Title: EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

Abstract:
Deep research requires reasoning over web evidence to answer open‑ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic's "think" tool paradigm and insights from the information‑retrieval literature, we introduce Q+, a set of query and evidence processing tools that make web search more deliberate by guiding query planning, monitoring search progress, and extracting evidence from long web snapshots. We integrate Q+ into the browser sub‑agent of Eigent, an open‑source, production‑ready multi‑agent workforce for computer use, yielding EigentSearch‑Q+. Across four benchmarks (SimpleQA‑Verified, FRAMES, WebWalkerQA, and XBench DeepSearch), Q+ improves Eigent's browser agent benchmark‑size‑weighted average accuracy by 3.0, 3.8, and 0.6 percentage points (pp) for GPT‑4.1, GPT‑5.1, and Minimax M2.5 model backends, respectively. Case studies further suggest that EigentSearch‑Q+ produces more coherent tool‑calling trajectories by making search progress and evidence handling explicit.

PaperID: 4128, https://arxiv.org/pdf/2604.07911.pdf

Authors: Nickson Patel

Title: Dynamic Attentional Context Scoping: Agent-Triggered Focus Sessions for Isolated Per-Agent Steering in Multi-Agent LLM Orchestration

Abstract:
Multi‑agent LLM orchestration systems suffer from context pollution: when N concurrent agents compete for the orchestrator's context window, each agent's task state, partial outputs, and pending questions contaminate the steering interactions of every other agent, degrading decision quality. We introduce Dynamic Attentional Context Scoping (DACS), a mechanism in which the orchestrator operates in two asymmetric modes. In Registry mode it holds only lightweight per‑agent status summaries (<=200 tokens each), remaining responsive to all agents and the user. When an agent emits a SteeringRequest, the orchestrator enters Focus(a_i) mode, injecting the full context of agent a_i while compressing all other agents to their registry entries. Context isolation is agent‑triggered, asymmetric, and deterministic: the context window contains exactly F(a_i) + R_‑i during steering, eliminating cross‑agent contamination without requiring context compression or retrieval. We evaluate DACS across four experimental phases totalling 200 trials: Phase 1 tests N in 3,5,10 (60 trials); Phase 2 tests agent heterogeneity and adversarial dependencies (60 trials); Phase 3 tests decision density up to D=15 (40 trials); Phase 4 uses autonomous LLM agents for free‑form questions (40 trials, Claude Haiku 4.5). Across all 8 synthetic scenarios, DACS achieves 90.0‑‑98.4% steering accuracy versus 21.0‑‑60.0% for a flat‑context baseline (p < 0.0001 throughout), with wrong‑agent contamination falling from 28‑‑57% to 0‑‑14% and context efficiency ratios of up to 3.53x. The accuracy advantage grows with N and D; keyword matching is validated by LLM‑as‑judge across all phases (mean kappa=0.909). DACS outperforms the flat‑context baseline by +17.2pp at N=3 (p=0.0023) and +20.4pp at N=5 (p=0.0008) in Phase 4, with the advantage growing with N confirmed by two independent judges.

PaperID: 4129, https://arxiv.org/pdf/2604.07838.pdf

Authors: Steven Luo, Saanvi Arora, Carlos Guirado

Title: We Need Strong Preconditions For Using Simulations In Policy

Abstract:
Simulations, and more recently LLM agent simulations, have been adopted as useful tools for policymakers to explore interventions, rehearse potential scenarios, and forecast outcomes. While LLM simulations have enormous potential, two critical challenges remain understudied: the dual‑use potential of accurate models of individual or population‑level human behavior and the difficulty of validating simulation outputs. In light of these limitations, we must define boundaries for both simulation developers and decision‑makers to ensure responsible development and ethical use. We propose and discuss three preconditions for societal‑scale LLM agent simulations: 1) do not treat simulations of marginalized populations as neutral technical outputs, 2) do not simulate populations without their participation, and 3) do not simulate without accountability. We believe that these guardrails, combined with our call for simulation development and deployment reports, will help build trust among policymakers while promoting responsible development and use of societal‑scale LLM agent simulations for the public benefit.

PaperID: 4130, https://arxiv.org/pdf/2604.07833.pdf

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Title: Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

Abstract:
Embodied agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once granted execution authority, the central challenge becomes how to keep actions governable at runtime. Existing approaches embed safety and recovery logic inside the agent loop, making execution control difficult to standardize, audit, and adapt. This paper argues that embodied intelligence requires not only stronger agents, but stronger runtime governance. We propose a framework for policy‑constrained execution that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer performing policy checking, capability admission, execution monitoring, rollback handling, and human override. We formalize the control boundary among the embodied agent, Embodied Capability Modules (ECMs), and runtime governance layer, and validate through 1000 randomized simulation trials across three governance dimensions. Results show 96.2% interception of unauthorized actions, reduction of unsafe continuation from 100% to 22.2% under runtime drift, and 91.4% recovery success with full policy compliance, substantially outperforming all baselines (p<0.001). By reframing runtime governance as a first‑class systems problem, this paper positions policy‑constrained execution as a key design principle for embodied agent systems.

PaperID: 4131, https://arxiv.org/pdf/2604.07821.pdf

Authors: Advait Yadav, Sid Black, Oliver Sourbut

Title: More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

Abstract:
Large language model (LLM) agents increasingly coordinate in multi‑agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real‑world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi‑agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3‑mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low‑competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi‑agent systems and will require deliberate cooperative design, even when helping others costs nothing.

PaperID: 4132, https://arxiv.org/pdf/2604.07799.pdf

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Title: Learning Without Losing Identity: Capability Evolution for Embodied Agents

Abstract:
Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself ‑‑ through prompt engineering, policy updates, or structural redesign ‑‑ leading to instability and loss of identity in long‑lived systems. In this work, we propose a capability‑centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed‑loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent‑modification baselines and established skill‑learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long‑term embodied intelligence.

PaperID: 4133, https://arxiv.org/pdf/2604.07798.pdf

Authors: Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Zhenzhen Huang, Pengcheng Zheng, Zhicheng Wang, Ping Guo, Fan Mo, Sung-Ho Bae, Jie Zou, Jiwei Wei, Yang Yang

Title: Lightweight LLM Agent Memory with Small Language Models

Abstract:
Although LLM agents can leverage tools for complex tasks, they still need memory to maintain cross‑turn consistency and accumulate reusable information in long‑horizon interactions. However, retrieval‑based external memory systems incur low online overhead but suffer from unstable accuracy due to limited query construction and candidate filtering. In contrast, many systems use repeated large‑model calls for online memory operations, improving accuracy but accumulating latency over long interactions. We propose LightMem, a lightweight memory system for better agent memory driven by Small Language Models (SLMs). LightMem modularizes memory retrieval, writing, and long‑term consolidation, and separates online processing from offline consolidation to enable efficient memory invocation under bounded compute. We organize memory into short‑term memory (STM) for immediate conversational context, mid‑term memory (MTM) for reusable interaction summaries, and long‑term memory (LTM) for consolidated knowledge, and uses user identifiers to support independent retrieval and incremental maintenance in multi‑user settings. Online, LightMem operates under a fixed retrieval budget and selects memories via a two‑stage procedure: vector‑based coarse retrieval followed by semantic consistency re‑ranking. Offline, it abstracts reusable interaction evidence and incrementally integrates it into LTM. Experiments show consistent gains across model scales, with an average F1 improvement of about 2.5 over A‑MEM on LoCoMo, while achieving higher efficiency and low median latency (83 ms for retrieval and 581 ms end‑to‑end).

PaperID: 4134, https://arxiv.org/pdf/2604.07791.pdf

Authors: Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao

Title: SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Abstract:
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single‑turn reasoning tasks. With the paradigm shift toward self‑evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large‑scale LLMs or multi‑agent frameworks, which hinder their deployment in resource‑constrained environments. The inherent sparsity of outcome‑based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool‑Memory based self‑evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter‑trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.

PaperID: 4135, https://arxiv.org/pdf/2604.07775.pdf

Authors: Hengyu An, Minxi Li, Jinghuai Zhang, Naen Xu, Chunyi Zhou, Changjiang Li, Xiaogang Xu, Tianyu Du, Shouling Ji

Title: ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Abstract:
Collaboration and information sharing empower Multi‑Agent Systems (MAS) but also introduce a critical security risk known as Agent Cascading Injection (ACI). In such attacks, a compromised agent exploits inter‑agent trust to propagate malicious instructions, causing cascading failures across the system. However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. ACIArena offers systematic evaluation suites spanning multiple attack surfaces (i.e., external inputs, agent profiles, inter‑agent messages) and attack objectives (i.e., instruction hijacking, task disruption, information exfiltration). Specifically, ACIArena establishes a unified specification that jointly supports MAS construction and attack‑defense modules. It covers six widely used MAS implementations and provides a benchmark of 1,356 test cases for systematically evaluating MAS robustness. Our benchmarking results show that evaluating MAS robustness solely through topology is insufficient; robust MAS require deliberate role design and controlled interaction patterns. Moreover, defenses developed in simplified environments often fail to transfer to real‑world settings; narrowly scoped defenses may even introduce new vulnerabilities. ACIArena aims to provide a solid foundation for advancing deeper exploration of MAS design principles.

PaperID: 4136, https://arxiv.org/pdf/2604.07767.pdf

Authors: Senyao Li, Zhigang Zuo, Haozhao Wang, Junyu Chen, Zhanbo Jin, Ruixuan LI

Title: Administrative Decentralization in Edge-Cloud Multi-Agent for Mobile Automation

Abstract:
Collaborative edge‑cloud frameworks have emerged as the main‑ stream paradigm for mobile automation, mitigating the latency and privacy risks inherent to monolithic cloud agents. However, existing approaches centralize administration in the cloud while relegating the device to passive execution, inducing a cognitive lag regard‑ ing real‑time UI dynamics. To tackle this, we introduce AdecPilot by applying the principle of administrative decentralization to the edge‑cloud multi‑agent framework, which redefines edge agency by decoupling high‑level strategic designing from tactical grounding. AdecPilot integrates a UI‑agnostic cloud designer generating ab‑ stract milestones with a bimodal edge team capable of autonomous tactical planning and self‑correction without cloud intervention. Furthermore, AdecPilot employs a Hierarchical Implicit Termi‑ nation protocol to enforce deterministic stops and prevent post‑ completion hallucinations. Extensive experiments demonstrate pro‑ posed approach improves task success rate by 21.7% while reducing cloud token consumption by 37.5% against EcoAgent and decreas‑ ing end to end latency by 88.9% against CORE. The source code is available at https://anonymous.4open.science/r/Anonymous_code‑ B8AB.

PaperID: 4137, https://arxiv.org/pdf/2604.07745.pdf

Authors: Tim Sainburg, Caleb Weinreb

Title: The Cartesian Cut in Agentic AI

Abstract:
LLMs gain competence by predicting words in human text, which often reflects how people perform tasks. Consequently, coupling an LLM to an engineered runtime turns prediction into control: outputs trigger interventions that enact goal‑oriented behavior. We argue that a central design lever is where control resides in these systems. Brains embed prediction within layered feedback controllers calibrated by the consequences of action. By contrast, LLM agents implement Cartesian agency: a learned core coupled to an engineered runtime via a symbolic interface that externalizes control state and policies. The split enables bootstrapping, modularity, and governance, but can induce sensitivity and bottlenecks. We outline bounded services, Cartesian agents, and integrated agents as contrasting approaches to control that trade off autonomy, robustness, and oversight.

PaperID: 4138, https://arxiv.org/pdf/2604.07733.pdf

Authors: John Chen, Sihan Cheng, Can Gurkan, Mingyi Lin

Title: CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

Abstract:
Evaluating strategic decision‑making in LLM‑based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long‑horizon, multi‑agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn‑level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model‑specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome‑only evaluation.

PaperID: 4139, https://arxiv.org/pdf/2604.07721.pdf

Authors: Zhao Song

Title: Sima 1.0: A Collaborative Multi-Agent Framework for Documentary Video Production

Abstract:
Content creation for major video‑sharing platforms demands significant manual labor, particularly for long‑form documentary videos spanning one to two hours. In this work, we introduce Sima 1.0, a multi‑agent system designed to optimize the weekly production pipeline for high‑quality video generation. The framework partitions the production process into an 11‑step pipeline distributed across a hybrid workforce. While foundational creative tasks and physical recording are executed by a human operator, time‑intensive editing, caption refinement, and supplementary asset integration are delegated to specialized junior and senior‑level AI agents. By systematizing tasks from script annotation to final asset exportation, Sima 1.0 significantly reduces the production workload, empowering a single creator to efficiently sustain a rigorous weekly publishing schedule.

PaperID: 4140, https://arxiv.org/pdf/2604.07720.pdf

Authors: Wenxuan Liu, Zixuan Li, Long Bai, Chunmao Zhang, Fenghui Zhang, Zhuo Chen, Wei Li, Yuxin Zuo, Fei Wang, Bingbing Xu, Xuhui Jiang, Jin Zhang, Xiaolong Jin, Jiafeng Guo, Tat-Seng Chua, Xueqi Cheng

Title: Towards Knowledgeable Deep Research: Framework and Benchmark

Abstract:
Deep Research (DR) requires LLM agents to autonomously perform multi‑step information seeking, processing, and reasoning to generate comprehensive reports. In contrast to existing studies that mainly focus on unstructured web content, a more challenging DR task should additionally utilize structured knowledge to provide a solid data foundation, facilitate quantitative computation, and lead to in‑depth analyses. In this paper, we refer to this novel task as Knowledgeable Deep Research (KDR), which requires DR agents to generate reports with both structured and unstructured knowledge. Furthermore, we propose the Hybrid Knowledge Analysis framework (HKA), a multi‑agent architecture that reasons over both kinds of knowledge and integrates the texts, figures, and tables into coherent multimodal reports. The key design is the Structured Knowledge Analyzer, which utilizes both coding and vision‑language models to produce figures, tables, and corresponding insights. To support systematic evaluation, we construct KDR‑Bench, which covers 9 domains, includes 41 expert‑level questions, and incorporates a large number of structured knowledge resources (e.g., 1,252 tables). We further annotate the main conclusions and key points for each question and propose three categories of evaluation metrics including general‑purpose, knowledge‑centric, and vision‑enhanced ones. Experimental results demonstrate that HKA consistently outperforms most existing DR agents on general‑purpose and knowledge‑centric metrics, and even surpasses the Gemini DR agent on vision‑enhanced metrics, highlighting its effectiveness in deep, structure‑aware knowledge analysis. Finally, we hope this work can serve as a new foundation for structured knowledge analysis in DR agents and facilitate future multimodal DR studies.

PaperID: 4141, https://arxiv.org/pdf/2604.07705.pdf

Authors: Xingyu Xia, Lekai Zhou, Yujie Tang, Xiaozhou Zhu, Hai Zhu, Wen Yao

Title: Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

Abstract:
Aerial vision‑and‑language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three‑dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision‑language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms: single‑instruction and dialog‑based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence‑to‑sequence and attention‑based methods, end‑to‑end LLM/VLM methods, hierarchical methods, multi‑agent methods, and dialog‑based navigation methods. For each category, we systematically analyze design rationales, technical trade‑offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real‑world grounding, and metric coverage. We consolidate cross‑method comparisons on shared benchmarks and analyze key architectural trade‑offs, including discrete versus continuous actions, end‑to‑end versus hierarchical designs, and the simulation‑to‑reality gap. Finally, we synthesize seven concrete open problems: long‑horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6‑DoF action execution, onboard deployment, benchmark standardization, and multi‑UAV swarm navigation, with specific research directions grounded in the evidence presented throughout the survey.

PaperID: 4142, https://arxiv.org/pdf/2604.07681.pdf

Authors: Thang Duc Pham, Harikrishna Tummalapalli, Fakhrul Hasan Bhuiyan, Álvaro Vázquez Mayagoitia, Christine Simpson, Riccardo Balin, Venkatram Vishwanath, Murat Keçeli

Title: Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System

Abstract:
The integration of Artificial Intelligence (AI) with High‑Performance Computing (HPC) is transforming scientific workflows from human‑directed pipelines into adaptive systems capable of autonomous decision‑making. Large language models (LLMs) play a critical role in autonomous workflows; however, deploying LLM‑based agents at scale remains a significant challenge. Single‑agent architectures and sequential tool calls often become serialization bottlenecks when executing large‑scale simulation campaigns, failing to utilize the massive parallelism of exascale resources. To address this, we present a scalable, hierarchical multi‑agent framework for orchestrating high‑throughput screening campaigns. Our planner‑executor architecture employs a central planning agent to dynamically partition workloads and assign subtasks to a swarm of parallel executor agents. All executor agents interface with a shared Model Context Protocol (MCP) server that orchestrates tasks via the Parsl workflow engine. To demonstrate this framework, we employed the open‑weight gpt‑oss‑120b model to orchestrate a high‑throughput screening of the Computation‑Ready Experimental (CoRE) Metal‑Organic Framework (MOF) database for atmospheric water harvesting. The results demonstrate that the proposed agentic framework enables efficient and scalable execution on the Aurora supercomputer, with low orchestration overhead and high task completion rates. This work establishes a flexible paradigm for LLM‑driven scientific automation on HPC systems, with broad applicability to materials discovery and beyond.

PaperID: 4143, https://arxiv.org/pdf/2604.07669.pdf

Authors: Tao Li, Kaiyuan Hou, Tuan Vinh, Monika Raj, Zhichun Guo, Carl Yang

Title: Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Abstract:
Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) to molecular generation frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis‑constrained action space defined by validated reaction templates. A tool‑augmented LLM agent serves as a dynamic reaction environment, invoking specialized chemical analysis tools to identify reactive sites and functional groups and proposing a compact set of chemically grounded transformations from matched templates. A dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long‑term oracle reward across multi‑step trajectories, with a SMILES‑based caching mechanism reducing end‑to‑end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure‑based docking task, MolReAct achieves an average Top‑10 score of 0.571, the highest among all baselines, ranking first or second on 13 of 14 tasks and attaining the best sample efficiency on 9 of 14 tasks. By grounding every optimization step in validated reaction templates, MolReAct produces molecules that are not only property‑improved but each accompanied by an explicit template‑grounded synthetic pathway.

PaperID: 4144, https://arxiv.org/pdf/2604.07667.pdf

Authors: Mengdie Flora Wang, Haochen Xie, Guanghui Wang, Aijing Gao, Guang Yang, Ziyuan Li, Qucy Wei Qiu, Fangwei Han, Hengzhi Qiu, Yajing Huang, Bing Zhu, Jae Oh Woo

Title: From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation

Abstract:
Multi‑agent debate improves LLM reasoning, yet agreement among agents is not evidence of correctness. When agents converge on a wrong answer through social reinforcement, consensus‑based stopping commits that error to an automated action with no recourse. We introduce Conformal Social Choice, a post‑hoc decision layer that converts debate outputs into calibrated act‑versus‑escalate decisions. Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability \geq\,1‑α, without assumptions on individual model calibration. A hierarchical action policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU‑Pro domains with three agents (Claude Haiku, DeepSeek‑R1, Qwen‑3 32B), coverage stays within 1‑‑2 points of the target. The key finding is not that debate becomes more accurate, but that the conformal layer makes its failures actionable: 81.9% of wrong‑consensus cases are intercepted at α=0.05. Because the layer refuses to act on cases where debate is confidently wrong, the remaining conformal singletons reach 90.0‑‑96.8% accuracy (up to 22.1pp above consensus stopping) ‑‑ a selection effect, not a reasoning improvement. This safety comes at the cost of automation, but the operating point is user‑adjustable via α.

PaperID: 4145, https://arxiv.org/pdf/2604.07629.pdf

Authors: Dora Zhao, Michelle S. Lam, Diyi Yang, Michael S. Bernstein

Title: Behavior Latticing: Inferring User Motivations from Unstructured Interactions

Abstract:
A long‑standing vision of computing is the personal AI system: one that understands us well enough to address our underlying needs. Today's AI focuses on what users do, ignoring why they might be doing such things in the first place. As a result, AI systems default to optimizing or repeating existing behaviors (e.g., user has ChatGPT complete their homework) even when they run counter to users' needs (e.g., gaining subject expertise). Instead we require systems that can make connections across observations, synthesizing them into insights about the motivations underlying these behaviors (e.g., user's ongoing commitments make it difficult to prioritize learning despite expressed desire to do so). We introduce an architecture for building user understanding through behavior latticing, connecting seemingly disparate behaviors, synthesizing them into insights, and repeating this process over long spans of interaction data. Doing so affords new capabilities, including being able to infer users' needs rather than just their tasks and connecting subtle patterns to produce conclusions that users themselves may not have previously realized. In an evaluation, we validate that behavior latticing produces accurate insights about the user with significantly greater interpretive depth compared to state‑of‑the‑art approaches. To demonstrate the new interactive capabilities that behavior lattices afford, we instantiate a personal AI agent steered by user insights, finding that our agent is significantly better at addressing users' needs while still providing immediate utility.

PaperID: 4146, https://arxiv.org/pdf/2604.07624.pdf

Authors: Achintya Desai, Md Shafiuzzaman, Wenbo Guo, Tevfik Bultan

Title: Program Analysis Guided LLM Agent for Proof-of-Concept Generation

Abstract:
Software developers frequently receive vulnerability reports that require them to reproduce the vulnerability in a reliable manner by generating a proof‑of‑concept (PoC) input that triggers it. Given the source code for a software project and a specific code location for a potential vulnerability, automatically generating a PoC for the given vulnerability has been a challenging research problem. Symbolic execution and fuzzing techniques require expert guidance and manual steps and face scalability challenges for PoC generation. Although recent advances in LLMs have increased the level of automation and scalability, the success rate of PoC generation with LLMs remains quite low. In this paper, we present a novel approach called Program Analysis Guided proof of concept generation agENT (PAGENT) that is scalable and significantly improves the success rate of automated PoC generation compared to prior results. PAGENT integrates lightweight and rule‑based static analysis phases for providing static analysis guidance and sanitizer‑based profiling and coverage information for providing dynamic analysis guidance with a PoC generation agent. Our experiments demonstrate that the resulting hybrid approach significantly outperforms the prior top‑performing agentic approach by 132% for the PoC generation task.

PaperID: 4147, https://arxiv.org/pdf/2604.07575.pdf

Authors: Mohamed Abdelnaby, Samuel Honor, Kevin Leahy

Title: Robust Multi-Agent Target Tracking in Intermittent Communication Environments via Analytical Belief Merging

Abstract:
Autonomous multi‑agent target tracking in GPS‑denied and communication‑restricted environments (e.g., underwater exploration, subterranean search and rescue, and adversarial domains) forces agents to operate independently and only exchange information during brief reconnection windows. Because transmitting complete observation and trajectory histories is bandwidth‑exhaustive, exchanging probabilistic belief maps serves as a highly efficient proxy that preserves the topology of agent knowledge. While minimizing divergence metrics to merge these decentralized beliefs is conceptually sound, traditional approaches often rely on numerical solvers that introduce critical quantization errors and artificial noise floors. In this paper, we formulate the decentralized belief merging problem as Forward and Reverse Kullback‑Leibler (KL) divergence optimizations and derive their exact closed‑form analytical solutions. By deploying these derivations, we mathematically eliminate optimization artifacts, achieving perfect mathematical fidelity while reducing the computational complexity of the belief merge to \mathcalO(N|S|) scalar operations. Furthermore, we propose a novel spatially‑aware visit‑weighted KL merging strategy that dynamically weighs agent beliefs based on their physical visitation history. Validated across tens of thousands of distributed simulations, extensive sensitivity analysis demonstrates that our proposed method significantly suppresses sensor noise and outperforms standard analytical means in environments characterized by highly degraded sensors and prolonged communication intervals.

PaperID: 4148, https://arxiv.org/pdf/2604.07546.pdf

Authors: Paulius Jurcys, Mark Fenwick

Title: Agentic Copyright, Data Scraping & AI Governance: Toward a Coasean Bargain in the Era of Artificial Intelligence

Abstract:
This paper examines how the rapid deployment of multi‑agentic AI systems is reshaping the foundations of copyright law and creative markets. It argues that existing copyright frameworks are ill‑equipped to govern AI agent‑mediated interactions that occur at scale, speed, and with limited human oversight. The paper introduces the concept of agentic copyright, a model in which AI agents act on behalf of creators and users to negotiate access, attribution, and compensation for copyrighted works. While multi‑agent ecosystems promise efficiency gains and reduced transaction costs, they also generate novel market failures, including miscoordination, conflict, and collusion among autonomous agents. To address these market failures, the paper develops a supervised multi‑agent governance framework that integrates legal rules and principles, technical protocols, and institutional oversight. This framework emphasizes ex ante and ex post coordination mechanisms capable of correcting agentic market failures before they crystallize into systemic harm. By embedding normative constraints and monitoring functions into multi‑agent architectures, supervised governance aims to align agent behavior with the underlying values of copyright law. The paper concludes that AI should be understood not only as a source of disruption, but also as a governance tool capable of restoring market‑based ordering in creative industries. Properly designed, agentic copyright offers a path toward scalable, fair, and legally meaningful copyright markets in the age of AI.

PaperID: 4149, https://arxiv.org/pdf/2604.07535.pdf

Authors: Baran Shajari, Xiaoran Liu, Kyanna Dagenais, Istvan David

Title: Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction

Abstract:
Studies show that interactions with an AI system fosters trust in human users towards AI. An often overlooked element of such interaction dynamics is the (sense of) urgency when the human user is prompted by an AI agent, e.g., for advice or guidance. In this paper, we show that although the presence of urgency in human‑AI interactions does not affect the trust in AI, it may be detrimental to the human user's self‑confidence and self‑efficacy. In the long run, the loss of confidence may lead to performance loss, suboptimal decisions, human errors, and ultimately, unsustainable AI systems. Our evidence comes from an experiment with 30 human participants. Our results indicate that users may feel more confident in their work when they are eased into the human‑AI setup rather than exposed to it without preparation. We elaborate on the implications of this finding for software engineers and decision‑makers.

PaperID: 4150, https://arxiv.org/pdf/2604.07518.pdf

Authors: Mengdan Zhu, Senhao Cheng, Liang Zhao

Title: Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Abstract:
Vision‑Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch‑based embeddings that are insufficient to extract semantics in multi‑step reasoning. We propose \emph"Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise‑conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three‑stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision‑centric benchmarks show that DLR consistently outperforms strong baselines, including text‑only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.

PaperID: 4151, https://arxiv.org/pdf/2604.07512.pdf

Authors: Yiwen Wang, Gregory Sinenka, Xhuliano Brace

Title: Rhizome OS-1: Rhizome's Semi-Autonomous Operating System for Small Molecule Drug Discovery

Abstract:
We present Rhizome OS‑1, a semi‑autonomous operating system for small molecule drug discovery in which multi‑modal AI agents operate as a full multidisciplinary discovery team. These agents function as computational chemists, medicinal chemists, and patent agents: they write and execute analysis code (fingerprint clustering, R‑group decomposition, substructure search), visually triage molecular grids using vision capabilities, formulate explicit medicinal chemistry hypotheses across three strategy tiers, assess patent freedom‑to‑operate, and dynamically adapt generation strategies based on empirical screening feedback. Powered by r1 ‑ a 246M‑parameter graph diffusion model trained on 800 million molecular graphs ‑ the system generates novel chemical matter directly on molecular graphs using fragment masking, scaffold decoration, linker design, and graph editing primitives. In two oncology campaigns (BCL6 BTB domain and EZH2 SET domain), the agent team executed 26 seeds and produced 5,231 novel molecules. Across both targets, 91.9% of generated Murcko scaffolds are absent from ChEMBL, with median Tanimoto similarity of 0.56‑0.69 to the nearest known active. Boltz‑2 binding affinity predictions, calibrated against ChEMBL data, achieved Spearman correlations of ‑0.53 to ‑0.64 and ROC AUC values of 0.88‑0.93. These results demonstrate that semi‑autonomous agent systems, equipped with graph‑native generative tools and physics‑informed scoring, enable a new paradigm for early‑stage drug discovery: scaled, rapid, and adaptive inverse design with embedded medicinal chemistry reasoning.

PaperID: 4152, https://arxiv.org/pdf/2604.07479.pdf

Authors: Monika Tomar, Takashi Tanaka

Title: Linearly Solvable Continuous-Time General-Sum Stochastic Differential Games

Abstract:
This paper introduces a class of continuous‑time, finite‑player stochastic general‑sum differential games that admit solutions through an exact linear PDE system. We formulate a distribution planning game utilizing the cross‑log‑likelihood ratio to naturally model multi‑agent spatial conflicts, such as congestion avoidance. By applying a generalized multivariate Cole‑Hopf transformation, we decouple the associated non‑linear Hamilton‑Jacobi‑Bellman (HJB) equations into a system of linear partial differential equations. This reduction enables the efficient, grid‑free computation of feedback Nash equilibrium strategies via the Feynman‑Kac path integral method, effectively overcoming the curse of dimensionality.

PaperID: 4153, https://arxiv.org/pdf/2604.07415.pdf

Authors: Roxana Petcu, Evangelos Kanoulas, Maarten de Rijke

Title: SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

Abstract:
Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi‑step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome‑only supervision to intermediate reward signals that incentivize planning high‑quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally‑derived rewards, eliminating the need for external supervision, and moving towards autonomous information‑intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi‑hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data‑efficient alternative to supervised process modeling.

PaperID: 4154, https://arxiv.org/pdf/2604.07341.pdf

Authors: Ali Reza Ibrahimzada, Brandon Paulsen, Daniel Kroening, Reyhaneh Jabbarvand

Title: ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

Abstract:
Most repository‑level code translation and validation techniques have been evaluated on a single source‑target programming language (PL) pair, owing to the complex engineering effort required to adapt new PL pairs. Programming agents can enable PL‑agnosticism in repository‑level code translation and validation: they can synthesize code across many PLs and autonomously use existing tools specific to each PL's analysis. However, state‑of‑the‑art has yet to offer a fully autonomous agentic approach for repository‑level code translation and validation of large‑scale programs. This paper proposes ReCodeAgent, an autonomous multi‑agent approach for language‑agnostic repository‑level code translation and validation. Users only need to provide the project in the source PL and specify the target PL for ReCodeAgent to automatically translate and validate the entire repository. ReCodeAgent is the first technique to achieve high translation success rates across many PLs. We compare the effectiveness of ReCodeAgent with four alternative neuro‑symbolic and agentic approaches to translate 118 real‑world projects, with 1,975 LoC and 43 translation units for each project, on average. The projects cover 6 PLs (C, Go, Java, JavaScript, Python, and Rust) and 4 PL pairs (C‑Rust, Go‑Rust, Java‑Python, Python‑JavaScript). Our results demonstrate that ReCodeAgent consistently outperforms prior techniques on translation correctness, improving test pass rate by 60.8% on ground‑truth tests, with an average cost of 15.3. We also perform process‑centric analysis of ReCodeAgent trajectories to confirm its procedural efficiency. Finally, we investigate how the design choices (a multi‑agent vs. single‑agent architecture) influence ReCodeAgent performance: on average, the test pass rate drops by 40.4%, and trajectories become 28% longer and persistently inefficient.

PaperID: 4155, https://arxiv.org/pdf/2604.07204.pdf

Authors: Stephen Cranefield, Nir Oren

Title: Designing for Accountable Agents: a Viewpoint

Abstract:
AI systems are becoming increasingly complex, ubiquitous and autonomous, leading to increasing concerns about their impacts on individuals and society. In response, researchers have begun investigating how to ensure that the methods underlying AI decision‑making are transparent and their decisions are explainable to people and conformant to human values and ethical principles. As part of this research thrust, the need for accountability within AI systems has been noted, but this notion has proven elusive to define; we aim to address this issue in the current paper. Unlike much recent work, we do not address accountability within the human organisational processes of developing and deploying AI; rather we consider what it would it mean for the agents within a multi‑agent system (MAS), potentially including human agents, to be accountable to other agents or to have others accountable to them. In this work, we make the following contributions: we provide an in‑depth survey of existing work on accountability in multiple disciplines, seeking to identify a coherent definition of the concept; we give a realistic example of a multi‑agent system application domain that illustrates the benefits of enabling agents to follow accountability processes, and we identify a set of research challenges for the MAS community in building accountable agents, sketching out some initial solutions to these, thereby laying out a road‑map for future research. Our focus is on laying the groundwork to enable autonomous elements within open socio‑technical systems to take part in accountability processes.

PaperID: 4156, https://arxiv.org/pdf/2604.07189.pdf

Authors: Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing

Title: Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Abstract:
Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results ‑ a process demanding specialized technical skills and considerable time. We propose Agent‑Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool‑use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus‑based/corpus‑driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP‑indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register‑sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) ‑ Claridge (2025) and De Smet (2013) ‑ with close quantitative agreement. Agent‑driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

PaperID: 4157, https://arxiv.org/pdf/2604.07165.pdf

Authors: Yu Li, Sizhe Tang, Tian Lan

Title: Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Abstract:
Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi‑step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionally impact reasoning outcome. In this paper, we propose T‑STAR(Tree‑structured Self‑Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. It enables an Introspective Valuation mechanism that back‑propagates trajectory‑level rewards through the tree to obtain a new notion of variance‑reduced relative advantage at step‑level. Using the Cognitive Tree, we also develop In‑Context Thought Grafting to synthesize corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley‑Terry type of surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T‑STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.

PaperID: 4158, https://arxiv.org/pdf/2604.07133.pdf

Authors: Qichen Wang, Keyu Li, Ozan Alp Topal, Özlem Tugfe Demir, Mustafa Ozger, Cicek Cavdar

Title: Energy Saving for Cell-Free Massive MIMO Networks: A Multi-Agent Deep Reinforcement Learning Approach

Abstract:
This paper focuses on energy savings in downlink operation of cell‑free massive MIMO (CF mMIMO) networks under dynamic traffic conditions. We propose a multi‑agent deep reinforcement learning (MADRL) algorithm that enables each access point (AP) to autonomously control antenna re‑configuration and advanced sleep mode (ASM) selection. After the training process, the proposed framework operates in a fully distributed manner, eliminating the need for centralized control and allowing each AP to dynamically adjust to real‑time traffic fluctuations. Simulation results show that the proposed algorithm reduces power consumption (PC) by 56.23% compared to systems without any energy‑saving scheme and by 30.12% relative to a non‑learning mechanism that only utilizes the lightest sleep mode, with only a slight increase in drop ratio. Moreover, compared to the widely used deep Q‑network (DQN) algorithm, it achieves a similar PC level but with a significantly lower drop ratio.

PaperID: 4159, https://arxiv.org/pdf/2604.07036.pdf

Authors: Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov, Timothy Baldwin, Preslav Nakov, Roman Vashurin, Maxim Panov

Title: ReDAct: Uncertainty-Aware Deferral for LLM Agents

Abstract:
Recently, LLM‑based agents have become increasingly popular across many applications, including complex sequential decision‑making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per‑token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason‑Defer‑Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text‑based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.

PaperID: 4160, https://arxiv.org/pdf/2604.07028.pdf

Authors: Philipp D. Siedler

Title: Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Abstract:
Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game‑theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi‑agent simulation environment in which prosecution and defense teams composed of trait‑conditioned Large Language Model (LLM) agents engage in iterative, round‑based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three‑trait team configurations, totaling over 7,000 simulated trials using DeepSeek‑R1 and Gemini~2.5~Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement‑learning‑based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human‑designed trait combinations. Together, these findings demonstrate how language can be treated as a first‑class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi‑agent environments.

PaperID: 4161, https://arxiv.org/pdf/2604.07012.pdf

Authors: Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu

Title: DTCRS: Dynamic Tree Construction for Recursive Summarization

Abstract:
Retrieval‑Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi‑step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub‑questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.

PaperID: 4162, https://arxiv.org/pdf/2604.07007.pdf

Authors: Anbang Ruan, Xing Zhang

Title: AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power

Abstract:
Autonomous AI agents are beginning to operate across organizational boundaries on the open internet ‑‑ discovering, transacting with, and delegating to agents owned by other parties without centralized oversight. When agents from different human principals collaborate at scale, the collective becomes opaque: no single human can observe, audit, or govern the emergent behavior. We term this the Logic Monopoly ‑‑ the agent society's unchecked monopoly over the entire logic chain from planning through execution to evaluation. We propose the Separation of Power (SoP) model, a constitutional governance architecture deployed on public blockchain that breaks this monopoly through three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within those contracts, and humans adjudicate through a complete ownership chain binding every agent to a responsible principal. In this architecture, smart contracts are the law itself ‑‑ the actual legislative output that agents produce and that governs their behavior. We instantiate SoP in AgentCity on an EVM‑compatible layer‑2 blockchain (L2) with a three‑tier contract hierarchy (foundational, meta, and operational). The core thesis is alignment‑through‑accountability: if each agent is aligned with its human owner through the accountability chain, then the collective converges on behavior aligned with human intent ‑‑ without top‑down rules. A pre‑registered experiment evaluates this thesis in a commons production economy ‑‑ where agents share a finite resource pool and collaboratively produce value ‑‑ at 50‑1,000 agent scale.

PaperID: 4163, https://arxiv.org/pdf/2604.07003.pdf

Authors: Yunbo Long, Yuhan Liu, Liming Xu

Title: EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration

Abstract:
Large language models (LLMs) has been widely used for automated negotiation, but their high computational cost and privacy risks limit deployment in privacy‑sensitive, on‑device settings such as mobile assistants or rescue robots. Small language models (SLMs) offer a viable alternative, yet struggle with the complex emotional dynamics of high‑stakes negotiation. We introduces EmoMAS, a Bayesian multi‑agent framework that transforms emotional decision‑making from reactive to strategic. EmoMAS leverages a Bayesian orchestrator to coordinate three specialized agents: game‑theoretic, reinforcement learning, and psychological coherence models. The system fuses their real‑time insights to optimize emotional state transitions while continuously updating agent reliability based on negotiation feedback. This mixture‑of‑agents architecture enables online strategy learning without pre‑training. We further introduce four high‑stakes, edge‑deployable negotiation benchmarks across debt, healthcare, emergency response, and educational domains. Through extensive agent‑to‑agent simulations across all benchmarks, both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models in negotiation performance while balancing ethical behavior. These results show that strategic emotional intelligence is also the key driver of negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi‑agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high‑stakes edge deployment.

PaperID: 4164, https://arxiv.org/pdf/2604.06972.pdf

Authors: Zhan Gao, Gabriele Fadini, Stelian Coros, Amanda Prorok

Title: Differentiable Environment-Trajectory Co-Optimization for Safe Multi-Agent Navigation

Abstract:
The environment plays a critical role in multi‑agent navigation by imposing spatial constraints, rules, and limitations that agents must navigate around. Traditional approaches treat the environment as fixed, without exploring its impact on agents' performance. This work considers environment configurations as decision variables, alongside agent actions, to jointly achieve safe navigation. We formulate a bi‑level problem, where the lower‑level sub‑problem optimizes agent trajectories that minimize navigation cost and the upper‑level sub‑problem optimizes environment configurations that maximize navigation safety. We develop a differentiable optimization method that iteratively solves the lower‑level sub‑problem with interior point methods and the upper‑level sub‑problem with gradient ascent. A key challenge lies in analytically coupling these two levels. We address this by leveraging KKT conditions and the Implicit Function Theorem to compute gradients of agent trajectories w.r.t. environment parameters, enabling differentiation throughout the bi‑level structure. Moreover, we propose a novel metric that quantifies navigation safety as a criterion for the upper‑level environment optimization, and prove its validity through measure theory. Our experiments validate the effectiveness of the proposed framework in a variety of safety‑critical navigation scenarios, inspired from warehouse logistics to urban transportation. The results demonstrate that optimized environments provide navigation guidance, improving both agents' safety and efficiency.

PaperID: 4165, https://arxiv.org/pdf/2604.06943.pdf

Authors: Khalil Abuibaid, Vinit Hegiste, Nigora Gafur, Achim Wagner, Martin Ruskowski

Title: Sustainable Transfer Learning for Adaptive Robot Skills

Abstract:
Learning robot skills from scratch is often time‑consuming, while reusing data promotes sustainability and improves sample efficiency. This study investigates policy transfer across different robotic platforms, focusing on peg‑in‑hole task using reinforcement learning (RL). Policy training is carried out on two different robots. Their policies are transferred and evaluated for zero‑shot, fine‑tuning, and training from scratch. Results indicate that zero‑shot transfer leads to lower success rates and relatively longer task execution times, while fine‑tuning significantly improves performance with fewer training time‑steps. These findings highlight that policy transfer with adaptation techniques improves sample efficiency and generalization, reducing the need for extensive retraining and supporting sustainable robotic learning.

PaperID: 4166, https://arxiv.org/pdf/2604.06925.pdf

Authors: Fangyu Hao, Jiayu Yang, Yifan Zhu, Zijun Yu, Qicen Wu, Wang Yunlong, Jiawei Li, Yulin Liu, Xu Zeng, Guanting Chen, Shihao Li, Zhonghong Ou, Meina Song, Mengyang Sun, Haoran Luo, Yu Shi, Yingyi Wang

Title: LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

Abstract:
Lung cancer clinical decision support demands precise reasoning across complex, multi‑stage oncological workflows. Existing multimodal large language models (MLLMs) fail to handle guideline‑constrained staging and treatment reasoning. We formalize three oncological precision treatment (OPT) tasks for lung cancer, spanning TNM staging, treatment recommendation, and end‑to‑end clinical decision support. We introduce LungCURE, the first standardized multimodal benchmark built from 1,000 real‑world, clinician‑labeled cases across more than 10 hospitals. We further propose LCAgent, a multi‑agent framework that ensures guideline‑compliant lung cancer clinical decision‑making by suppressing cascading reasoning errors across the clinical pathway. Experiments reveal large differences across various large language models (LLMs) in their capabilities for complex medical reasoning, when given precise treatment requirements. We further verify that LCAgent, as a simple yet effective plugin, enhances the reasoning performance of LLMs in real‑world medical scenarios.

PaperID: 4167, https://arxiv.org/pdf/2604.06914.pdf

Authors: Charbel Bou Chaaya, Mehdi Bennis

Title: Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems

Abstract:
In this paper, we study a vehicle‑to‑infrastructure (V2I) system where distributed base stations (BSs) acting as road‑side units (RSUs) collect multimodal (wireless and visual) data from moving vehicles. We consider a decentralized rate maximization problem, where each RSU relies on its local observations to optimize its resources, while all RSUs must collaborate to guarantee favorable network performance. We recast this problem as a distributed multi‑agent reinforcement learning (MARL) problem, by incorporating rotation symmetries in terms of vehicles' locations. To exploit these symmetries, we propose a novel self‑supervised learning framework where each BS agent aligns the latent features of its multimodal observation to extract the positions of the vehicles in its local region. Equipped with this sensing data at each RSU, we train an equivariant policy network using a graph neural network (GNN) with message passing layers, such that each agent computes its policy locally, while all agents coordinate their policies via a signaling scheme that overcomes partial observability and guarantees the equivariance of the global policy. We present numerical results carried out in a simulation environment, where ray‑tracing and computer graphics are used to collect wireless and visual data. Results show the generalizability of our self‑supervised and multimodal sensing approach, achieving more than two‑fold accuracy gains over baselines, and the efficiency of our equivariant MARL training, attaining more than 50% performance gains over standard approaches.

PaperID: 4168, https://arxiv.org/pdf/2604.06876.pdf

Authors: Giorgio Audrito, Andrea Basso, Daniele Bortoluzzi, Ferruccio Damiani, Giordano Scarso, Gianluca Torta

Title: Exploiting Aggregate Programming in a Multi-Robot Service Prototype

Abstract:
Multi‑robot systems are becoming increasingly relevant within diverse application domains, such as healthcare, exploration, and rescue missions. However, building such systems is still a significant challenge, since it adds the complexities of the physical nature of robots and their environments to those inherent in coordinating any distributed (multi‑agent) system. Aggregate Programming (AP) has recently emerged as a promising approach to engineering resilient, distributed systems with proximity‑based communication, and is notably supported by practical frameworks. In this paper we present a prototype of a multi‑robot service system, which adopts AP for the design and implementation of its coordination software. The prototype has been validated both with simulations, and with tests in a University library.

PaperID: 4169, https://arxiv.org/pdf/2604.06873.pdf

Authors: Haoran Yang, Nobuko Yoshida

Title: Generating Local Shields for Decentralised Partially Observable Markov Decision Processes

Abstract:
Multi‑agent systems under partial observation often struggle to maintain safety because each agent's locally chosen action does not, in general, determine the resulting joint action. Shielding addresses this by filtering actions based on the current state, but most existing techniques either assume access to a shared centralised global state or employ memoryless local filters that cannot consider interaction history. We introduce a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication‑free Dec‑POMDP settings. From a shield process, we compile a process automaton, then a global Mealy machine as a safe joint‑action filter, and finally project it to local Mealy machines whose states are belief‑style subsets of the global Mealy machine states consistent with each agent's observations, and which output per‑agent safe action sets. We implement the pipeline in Rust and integrate PRISM, the Probabilistic Symbolic Model Checker, to compute best‑ and worst‑case safety probabilities independently of the agents' policies. A multi‑agent path‑finding case study demonstrates how different shield processes substantially reduce collisions compared to the unshielded baseline while exhibiting varying levels of expressiveness and conservatism.

PaperID: 4170, https://arxiv.org/pdf/2604.06868.pdf

Authors: Xinyuan Qiu, Ruohan Wang, Siyuan Liu, Sofie Haesaert

Title: Compressing Correct-by-Design Synthesis for Stochastic Homogeneous Multi-Agent Systems with Counting LTL

Abstract:
Correct‑by‑design synthesis provides a principled framework for establishing formal safety guarantees for stochastic multi‑agent systems (MAS). However, conventional approaches based on finite abstractions often incur prohibitive computational costs as the number of agents and the complexity of temporal logic specifications increase. In this work, we study homogeneous stochastic MAS under counting linear temporal logic (cLTL) specifications, and show that the corresponding satisfaction probability admits a structured tensor decomposition via leveraging deterministic finite automata (DFA). Building on this structure, we develop a dual‑tree‑based value iteration framework that reduces redundant computation in the process of dynamic programming. Numerical results demonstrate the proposed approach's effectiveness and scalability for complex specifications and large‑scale MAS.

PaperID: 4171, https://arxiv.org/pdf/2604.06861.pdf

Authors: Shiqi Kuang, Zhao Tian, Kaiwei Lin, Chaofan Tao, Shaowei Wang, Haoli Bai, Lifeng Shang, Junjie Chen

Title: REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

Abstract:
Issue resolution aims to automatically generate patches from given issue descriptions and has attracted significant attention with the rapid advancement of large language models (LLMs). However, due to the complexity of software issues and codebases, LLM‑generated patches often fail to resolve corresponding issues. Although various advanced techniques have been proposed with carefully designed tools and workflows, they typically treat issue descriptions as direct inputs and largely overlook their quality (e.g., missing critical context or containing ambiguous information), which hinders LLMs from accurate understanding and resolution. To address this limitation, we draw on principles from software requirements engineering and propose REAgent, a requirement‑driven LLM agent framework that introduces issue‑oriented requirements as structured task specifications to better guide patch generation. Specifically, REAgent automatically constructs structured and information‑rich issue‑oriented requirements, identifies low‑quality requirements, and iteratively refines them to improve patch correctness. We conduct comprehensive experiments on three widely used benchmarks using two advanced LLMs, comparing against five representative or state‑of‑the‑art baselines. The results demonstrate that REAgent consistently outperforms all baselines, achieving an average improvement of 17.40% in terms of the number of successfully‑resolved issues (% Resolved).

PaperID: 4172, https://arxiv.org/pdf/2604.06856.pdf

Authors: Iulisloi Zacarias, Marla Grunewald, Fin Gentzen, Xavi Masip-Bruin, Admela Jukan

Title: Enhancing Secure Intent-Based Networking with an Agentic AI: The EU Project MARE Approach

Abstract:
In the EU project MARE, a novel plane was proposed and used in combination with intent‑based networking (IBN), allowing the operator to focus on what, rather than on how. Recently, LLMs have been successfully employed to translate the high‑level intents into low‑level actions. The open challenge is to understand how IBN can be effectively enhanced with LLM and the emerging agentic AI for security purposes. Enhancing IBN with an agentic AI paradigm introduces significant challenges that existing solutions do not fully address. This paper proposes an enhanced IBN framework with a strong security focus toward agentic AI. We address the architectural and security requirements for a multi‑agent intent‑based system (IBS) architecture, including a multi‑domain IBN. We propose a hierarchical multi‑agent and multi‑vendor architecture that can also be applied more broadly in 6G architectures and beyond, beyond the security architecture proposed in MARE. The architecture incorporates an interactive intent‑processing pipeline using LLMs, and it also allows the IBS to connect to external security knowledge bases, such as MITRE ATT\&CK, MITRE FiGHT, and NIST.

PaperID: 4173, https://arxiv.org/pdf/2604.06808.pdf

Authors: Kanta Yoshioka, Soshi Hirayae, Yuichiro Tanaka, Yuichi Katori, Takashi Morie, Hakaru Tamukoh

Title: CBM-Dual: A 65-nm Fully Connected Chaotic Boltzmann Machine Processor for Dual Function Simulated Annealing and Reservoir Computing

Abstract:
This paper presents CBM‑Dual, the first silicon‑proven digital chaotic dynamics processor (CDP) supporting both simulated annealing (SA) and reservoir computing (RC). CBM‑Dual enables real‑time decision‑making and lightweight adaptation for autonomous Edge AI, employing the largest‑scale fully connected 1024‑neuron chaotic Boltzmann machine (CBM). To address the high computational and area costs of digital CDPs, we propose: 1) a CBM‑specific scheduler that exploits an inherently low neuron flip rate to reduce multiply‑accumulate operations by 99%, and 2) an efficient multiply splitting scheme that reduces the area by 59%. Fabricated in 65nm (12mm^2), CBM‑Dual achieves simultaneous heterogeneous task execution and state‑of‑the‑art energy efficiency, delivering ×25‑54 and ×4.5 improvements in the SA and RC fields, respectively.

PaperID: 4174, https://arxiv.org/pdf/2604.06788.pdf

Authors: Daniel N. Wilke

Title: From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

Abstract:
We present a solver‑agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction, material inference, discretisation, solver execution, uncertainty quantification, and code‑compliant assessment, to an engineering report with actionable recommendations. Agents are formalised as conditioned operators on a shared context space with quality gates that introduce conditional iteration between pipeline layers. We introduce a mathematical framework for extracting engineering information from perceptual data under uncertainty using interval bounds, probability densities, and fuzzy membership functions, and introduce task‑dependent conservatism to resolve the ambiguity of what `conservative' means when different limit states are governed by opposing parameter trends. The framework is demonstrated through a finite element analysis pipeline applied to a photograph of a steel L‑bracket, producing a 171,504‑node tetrahedral mesh, seven analyses across three boundary condition hypotheses, and a code‑compliant assessment revealing structural failure with a quantified redesign. All results are presented as generated in the first autonomous iteration without manual correction, reinforcing that a professional engineer must review and sign off on any such analysis.

PaperID: 4175, https://arxiv.org/pdf/2604.06753.pdf

Authors: Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin, Li Kang, Xiufeng Song, Rui Li, Songtao Huang, Ao Yu, Yuchen Fan, Yanxu Chen, Kaixin Xu, Xiaohong Liu, Yiran Qin, Philip Torr, Chen Zhang, Zhenfei Yin

Title: Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Abstract:
When an LLM‑based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference‑time paradigms, namely Direct, CoT, ReAct, Plan‑Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per‑task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select‑then‑solve approach: before answering each task, a lightweight embedding‑based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero‑shot self‑routing only works for GPT‑5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per‑task decision made by a learned router, not a fixed architectural choice.

PaperID: 4176, https://arxiv.org/pdf/2604.06747.pdf

Authors: Juan Du, Yueteng Wu, Pan Zhao, Yuze Liu, Min Zhang, Xiaobin Xu, Xinglong Zhang

Title: TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

Abstract:
The aerodynamic design of turbomachinery is a complex and tightly coupled multi‑stage process involving geometry generation, performance prediction, optimization, and high‑fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end‑to‑end design challenging. To address this issue, this study proposes TurboAgent, a large language model (LLM)‑driven autonomous multi‑agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi‑objective optimization, and physics‑based validation. The framework transforms traditional trial‑and‑error design into a data‑driven collaborative workflow, with high‑fidelity simulations retained for final verification. A transonic single‑rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed‑loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.

PaperID: 4177, https://arxiv.org/pdf/2604.06696.pdf

Authors: Yujun Cheng, Enfang Cui, Hao Qin, Zhiyuan Liang, Qi Xu

Title: AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents

Abstract:
The rapid development of AI agent systems is leading to an emerging Internet of Agents, where specialized agents operate across local devices, edge nodes, private services, and cloud platforms. Although recent efforts have improved agent naming, discovery, and interaction, efficient request dispatch remains an open systems problem under latency, privacy, and cost constraints. In this paper, we present AgentGate, a lightweight structured routing engine for candidate‑aware agent dispatch. Instead of treating routing as unrestricted text generation, AgentGate formulates it as a constrained decision problem and decomposes it into two stages: action decision and structural grounding. The first stage determines whether a query should trigger single‑agent invocation, multi‑agent planning, direct response, or safe escalation, while the second stage instantiates the selected action into executable outputs such as target agents, structured arguments, or multi‑step plans. To adapt compact models to this setting, we further develop a routing‑oriented fine‑tuning scheme with candidate‑aware supervision and hard negative examples. Experiments on a curated routing benchmark with several 3B‑‑7B open‑weight models show that compact models can provide competitive routing performance in constrained settings, and that model differences are mainly reflected in action prediction, candidate selection, and structured grounding quality. These results indicate that structured routing is a feasible design point for efficient and privacy‑aware agent systems, especially when routing decisions must be made under resource‑constrained deployment conditions.

PaperID: 4178, https://arxiv.org/pdf/2604.06691.pdf

Authors: Monirul Islam Pavel, Siyi Hu, Muhammad Anwar Masum, Mahardhika Pratama, Ryszard Kowalczyk, Zehong Jimmy Cao

Title: KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

Abstract:
Real world deployment of multi agent reinforcement learning MARL systems is fundamentally constrained by limited compute memory and inference time. While expert policies achieve high performance they rely on costly decision cycles and large scale models that are impractical for edge devices or embedded platforms. Knowledge distillation KD offers a promising path toward resource aware execution but existing KD methods in MARL focus narrowly on action imitation often neglecting coordination structure and assuming uniform agent capabilities. We propose resource aware Knowledge Distillation for Multi Agent Reinforcement Learning KD MARL a two stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures allowing each agent model capacity to match its observation complexity which is crucial for efficient execution under partial or limited observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD MARL achieves high performance retention while substantially reducing computational cost. Across standard multi agent benchmarks KD MARL retains over 90 percent of expert performance while reducing computational cost by up to 28.6 times FLOPs. The proposed approach achieves expert level coordination and preserves it through structured distillation enabling practical MARL deployment across resource constrained onboard platforms.

PaperID: 4179, https://arxiv.org/pdf/2604.06633.pdf

Authors: Zi Liang, Qipeng Xie, Jun He, Bohuan Xue, Weizheng Wang, Yuandao Cai, Fei Luo, Boxian Zhang, Haibo Hu, Kaishun Wu

Title: Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection

Abstract:
Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule‑based methods. However, existing LLM‑based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM‑assisted structure to a new LLM‑centered workflow. We introduce Argus (Agentic and Retrieval‑Augmented Guarding System), the first multi‑agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi‑agent workflows, and the integration of state‑of‑the‑art techniques such as Retrieval‑Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero‑day vulnerabilities with CVE assignments.

PaperID: 4180, https://arxiv.org/pdf/2604.06629.pdf

Authors: Evgeny Skvortsov, Yilin Xia, Ojaswa Garg, Shawn Bowers, Bertram Ludäscher

Title: Logical Robots: Declarative Multi-Agent Programming in Logica

Abstract:
We present Logical Robots, an interactive multi‑agent simulation platform where autonomous robot behavior is specified declaratively in the logic programming language Logica. Robot behavior is defined by logical predicates that map observations from simulated radar arrays and shared memory to desired motor outputs. This approach allows low‑level reactive control and high‑level planning to coexist within a single programming environment, providing a coherent framework for exploring multi‑agent robot behavior.

PaperID: 4181, https://arxiv.org/pdf/2604.06618.pdf

Authors: Phan The Duy, Khoa Ngo-Khanh, Nguyen Huu Quyen, Van-Hau Pham

Title: PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy

Abstract:
While recent approaches leverage large language models (LLMs) and multi‑agent pipelines to automatically generate proof‑of‑concept (PoC) exploits from vulnerability reports, existing systems often suffer from two fundamental limitations: unreliable validation based on surface‑level execution signals and high operational cost caused by extensive trial‑and‑error during exploit generation. In this paper, we present PoC‑Adapt, an end‑to‑end framework for automated PoC generation and verification, architected upon a foundation semantic runtime validation and adaptive policy learning. At the core of PoC‑Adapt is a Semantic Oracle that validates exploits by comparing structured pre‑ and post‑execution system states, enabling reliable distinction between true vulnerability exploitation and incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC‑Adapt is implemented as a multi‑agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experimenting on the CWE‑Bench‑Java and PrimeVul benchmarks shows that PoC‑Adapt significantly improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM‑based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC‑Adapt confirmed 12 verified PoC out of 80 reproduce attempts at a cost of 0.42 per generated exploit

PaperID: 4182, https://arxiv.org/pdf/2604.06610.pdf

Authors: Nan Zhang, Zishuo Wang, Shuyu Huang, Georgios Diamantopoulos, Nikos Tziritas, Panagiotis Oikonomou, Georgios Theodoropoulos

Title: TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning

Abstract:
Decentralised online learning enables runtime adaptation in cyber‑physical multi‑agent systems, but when operating conditions change, learned policies often require substantial trial‑and‑error interaction before recovering performance. To address this, we propose TwinLoop, a simulation‑in‑the‑loop digital twin framework for online multi‑agent reinforcement learning. When a context shift occurs, the digital twin is triggered to reconstruct the current system state, initialise from the latest agent policies, and perform accelerated policy improvement with simulation what‑if analysis before synchronising updated parameters back to the agents in the physical system. We evaluate TwinLoop in a vehicular edge computing task‑offloading scenario with changing workload and infrastructure conditions. The results suggest that digital twins can improve post‑shift adaptation efficiency and reduce reliance on costly online trial‑and‑error.

PaperID: 4183, https://arxiv.org/pdf/2604.06598.pdf

Authors: Siddharth Singh, Soumee Guha, Qing Chang, Scott Acton

Title: Train-Small Deploy-Large: Leveraging Diffusion-Based Multi-Robot Planning

Abstract:
Learning based multi‑robot path planning methods struggle to scale or generalize to changes, particularly variations in the number of robots during deployment. Most existing methods are trained on a fixed number of robots and may tolerate a reduced number during testing, but typically fail when the number increases. Additionally, training such methods for a larger number of agents can be both time consuming and computationally expensive. However, analytical methods can struggle to scale computationally or handle dynamic changes in the environment. In this work, we propose to leverage a diffusion model based planner capable of handling dynamically varying number of agents. Our approach is trained on a limited number of agents and generalizes effectively to larger numbers of agents during deployment. Results show that integrating a single shared diffusion model based planner with dedicated inter‑agent attention computation and temporal convolution enables a train small deploy‑large paradigm with good accuracy. We validate our method across multiple scenarios and compare the performance with existing multi‑agent reinforcement learning techniques and heuristic control based methods.

PaperID: 4184, https://arxiv.org/pdf/2604.06554.pdf

Authors: Hossein Papi, Muzaffar Qureshi, Kyle Volle, Rushikesh Kamalapurkar

Title: Decentralized Scalar Field Mapping using Gaussian Process

Abstract:
Decentralized Gaussian process (GP) methods offer a scalable framework for multi‑agent scalar‑field estimation by replacing a centralized global model with multiple local models maintained by individual agents. A team of agents operates through overlapping domains; neighboring agents generally produce inconsistent distributions over shared regions. This paper investigates whether these inter‑agent posterior discrepancies can be systematically exploited to improve team‑level predictive performance and answers this question positively through a novel decentralized intersection data‑sharing and assimilation protocol. Specifically, each agent constructs neighbor‑specific packets from its local GP together with the geometry of the overlap between subdomains and selectively assimilates information received from neighboring agents to improve consistency of its posterior over the shared regions. The proposed architecture preserves locality in both computation and communication, supports decentralized neighbor‑to‑neighbor data assimilation, and allows local GP models to evolve cooperatively across the network without requiring the exchange full packet exchange or centralized inference.

PaperID: 4185, https://arxiv.org/pdf/2604.06551.pdf

Authors: Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang

Title: CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Abstract:
Large language models show potential for scalable mental‑health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single‑agent simulation, failing to capture the dynamic, information‑asymmetric nature of real therapy. We introduce CCD‑CBT, a multi‑agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information‑asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi‑turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine‑tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive‑affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory‑grounded, clinically‑plausible conversational agents.

PaperID: 4186, https://arxiv.org/pdf/2604.06550.pdf

Authors: Yinghan Hou, Zongyou Yang, Zaihu Pang, Xiujun Ma

Title: SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Abstract:
OpenClaw's ClawHub marketplace hosts tens of thousands of community‑contributed agent skills (49,592 in our 2026‑04‑04 snapshot), and recent audits report that 13‑26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural‑language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three‑layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall‑tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub‑tasks with structured outputs. Layer 3 puts high‑risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single‑board computer. On a 390‑skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast‑path cuts 32% of Layer‑2/3 LLM calls with a 1.6‑point F1 reduction, while preserving full‑pipeline recall (0.929). For cross‑ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain‑specific idioms, suggesting a low‑cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real‑time skill vetting. Code, data, and benchmark are open‑sourced.

PaperID: 4187, https://arxiv.org/pdf/2604.06531.pdf

Authors: Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder

Title: A Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge

Abstract:
The mean‑field Schrödinger bridge (MFSB) problem concerns designing a minimum‑effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schrödinger bridge, the dynamical constraint for MFSB is the mean‑field limit of a population of interacting agents with controls. It serves as a natural model for large‑scale multi‑agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf‑Cole transform for MFSB and, building on it, design a Sinkhorn‑type recursive algorithm to solve the associated system of integro‑PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.

PaperID: 4188, https://arxiv.org/pdf/2604.06474.pdf

Authors: Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam

Title: DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Abstract:
Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi‑step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large‑scale structured databases remain relatively underexplored. Unlike web‑based research, effective data‑centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM‑based agentic system capable of autonomously conducting research across both large‑scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis‑driven analytical process: discovering candidate theses from data, validating them through iterative cross‑source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state‑of‑the‑art result with a 19.4% relative improvement in insight‑level recall and 7.2% in summary‑level score. We further introduce a new dataset built on ACLED, a real‑world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

PaperID: 4189, https://arxiv.org/pdf/2604.06452.pdf

Authors: Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz, Ansong Ni

Title: Learning to Interrupt in Language-based Multi-agent Communication

Abstract:
Multi‑agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose output that overload context and increase computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective way in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi‑agent scenarios, including 2‑agent text pictionary games, 3‑agent meeting scheduling, and 3‑agent debate. The results of the experiment show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior can also be generalized to different agents and tasks.

PaperID: 4190, https://arxiv.org/pdf/2604.06430.pdf

Authors: Pranjal Sharma, Zirui Xu, Vasileios Tzoumas

Title: Asynchronous Distributed Bandit Submodular Maximization under Heterogeneous Communication Delays

Abstract:
We study asynchronous distributed decision‑making for scalable multi‑agent bandit submodular maximization. We are motivated by distributed information‑gathering tasks in unknown environments and under heterogeneous inter‑agent communication delays. To enable scalability despite limited communication delays, existing approaches restrict each agent to coordinate only with its one‑hop neighbors. But these approaches assume homogeneous communication delays among the agents and a synchronous global clock. In practice, however, delays are heterogeneous, and agents operate with mismatched local clocks. That is, each agent does not receive information from all neighbors at the same time, compromising decision‑making. In this paper, we provide an asynchronous coordination algorithm to overcome the challenges. We establish a provable approximation guarantee against the optimal synchronized centralized solution, where the suboptimality gap explicitly depends on communication delays and clock mismatches. The bounds also depend on the topology of each neighborhood, capturing the effect of distributed decision‑making via one‑hop‑neighborhood messages only. We validate the approach through numerical simulations on multi‑camera area monitoring.

PaperID: 4191, https://arxiv.org/pdf/2604.06409.pdf

Authors: Yunze Xiao, Wenkai Li, Xiaoyuan Wu, Ningshan Ma, Yueqi Song, Weihao Xuan

Title: Say Something Else: Rethinking Contextual Privacy as Information Sufficiency

Abstract:
LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private. Existing systems support only suppression (omitting sensitive information) and generalization (replacing information with an abstraction), and are typically evaluated on single isolated messages, leaving both the strategy space and evaluation setting incomplete. We formalize privacy‑preserving LLM communication as an Information Sufficiency (IS) task, introduce free‑text pseudonymization as a third strategy that replaces sensitive attributes with functionally equivalent alternatives, and propose a conversational evaluation protocol that assesses strategies under realistic multi‑turn follow‑up pressure. Across 792 scenarios spanning three power‑relation types (institutional, peer, intimate) and three sensitivity categories (discrimination risk, social cost, boundary), we evaluate seven frontier LLMs on privacy at two granularities, covertness, and utility. Pseudonymization yields the strongest privacy\textendash utility tradeoff overall, and single‑message evaluation systematically underestimates leakage, with generalization losing up to 16.3 percentage points of privacy under follow‑up.

PaperID: 4192, https://arxiv.org/pdf/2604.06392.pdf

Authors: Varun Pratap Bhardwaj

Title: Qualixar OS: A Universal Operating System for AI Agent Orchestration

Abstract:
We present Qualixar OS, the first application‑layer operating system for universal AI agent orchestration. Unlike kernel‑level approaches (AIOS) or single‑framework tools (AutoGen, CrewAI), Qualixar OS provides a complete runtime for heterogeneous multi‑agent systems spanning 10 LLM providers, 8+ agent frameworks, and 7 transports. We contribute: (1) execution semantics for 12 multi‑agent topologies including grid, forest, mesh, and maker patterns; (2) Forge, an LLM‑driven team design engine with historical strategy memory; (3) three‑layer model routing combining Q‑learning, five strategies, and Bayesian POMDP with dynamic multi‑provider discovery; (4) a consensus‑based judge pipeline with Goodhart detection, JSD drift monitoring, and alignment trilemma navigation; (5) four‑layer content attribution with HMAC signing and steganographic watermarks; (6) universal compatibility via the Claw Bridge supporting MCP and A2A protocols with a 25‑command Universal Command Protocol; (7) a 24‑tab production dashboard with visual workflow builder and skill marketplace. Qualixar OS is validated by 2,821 test cases across 217 event types and 8 quality modules. On a custom 20‑task evaluation suite, the system achieves 100% accuracy at a mean cost of 0.000039 per task. Source‑available under the Elastic License 2.0.

PaperID: 4193, https://arxiv.org/pdf/2604.06389.pdf

Authors: Satwik Pandey, Suresh Raghu, Shashwat Pandey

Title: SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

Abstract:
Uncertainty estimation for reasoning language models remains difficult to deploy in practice: sampling‑based methods are computationally expensive, while common single‑pass proxies such as verbalized confidence or trace length are often inconsistent across models. This problem is compounded for proprietary reasoning APIs that expose neither logits nor intermediate token probabilities, leaving practitioners with no reliable uncertainty signal at inference time. We propose SELFDOUBT, a single‑pass uncertainty framework that resolves this impasse by extracting behavioral signals directly from the reasoning trace itself. Our key signal, the Hedge‑to‑Verify Ratio (HVR), detects whether a reasoning trace contains uncertainty markers and, if so, whether they are offset by explicit selfchecking behavior. Unlike methods that require multiple sampled traces or model internals, SELFDOUBT operates on a single observed reasoning trajectory, making it suitable for latency‑ and cost‑constrained deployment over any proprietary API. We evaluate SELFDOUBT across seven models and three multi‑step reasoning benchmarks (BBH, GPQA‑Diamond, and MMLU‑Pro). Most notably, traces containing no hedging markers are correct 96% of the time, revealing an emergent high‑precision confidence gate at zero additional cost. For the remaining cases, the full SELFDOUBT score significantly outperforms sampling‑based semantic entropy at 10x lower inference cost. A deployment cascade combining both stages attains 90% accuracy at 71% coverage without any task‑specific labels. These results establish SELFDOUBT as a scalable, production‑ready foundation for uncertainty estimation over proprietary reasoning models.

PaperID: 4194, https://arxiv.org/pdf/2604.06376.pdf

Authors: Xiangyu Peng, Can Qin, An Yan, Xinyi Yang, Zeyuan Chen, Ran Xu, Chien-Sheng Wu

Title: MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Abstract:
Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi‑step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high‑quality, verified multi‑hop vision‑language training data for multimodal deep‑search agents. We propose a Multi‑hop Tool‑Augmented Agent for Evidence‑based QA Synthesis (MTA‑Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi‑hop question‑answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large‑scale training dataset, MTA‑Vision‑DeepSearch, containing 21K high‑quality multi‑hop examples. The data is filtered through a multi‑stage verification process to ensure factual consistency and answer uniqueness. Using MTA‑Vision‑DeepSearch, a 32B open‑source multimodal search agent achieves state‑of‑the‑art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT‑5 (51.86%), Gemini‑2.5‑Pro (50.98%), and Gemini‑3‑Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool‑use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real‑time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA‑Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.

PaperID: 4195, https://arxiv.org/pdf/2604.06370.pdf

Authors: Shao Wang, Rui Ren, Lin Gui

Title: ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache

Abstract:
The serving paradigm of large language models (LLMs) is rapidly shifting towards complex multi‑agent workflows where specialized agents collaborate over massive shared contexts. While Low‑Rank Adaptation (LoRA) enables the efficient co‑hosting of these specialized agents on a single base model, it introduces a critical memory footprint bottleneck during serving. Specifically, unique LoRA activations cause Key‑Value (KV) cache divergence across agents, rendering traditional prefix caching ineffective for shared contexts. This forces redundant KV cache maintenance, rapidly saturating GPU capacity and degrading throughput. To address this challenge, we introduce ForkKV, a serving system for multi‑LoRA agent workflows centered around a novel memory management paradigm in OS: fork with copy‑on‑write (CoW). By exploiting the structural properties of LoRA, ForkKV physically decouples the KV cache into a massive shared component (analogous to the parent process's memory pages) and lightweight agent‑specific components (the child process's pages). To support this mechanism, we propose a DualRadixTree architecture that allows newly forked agents to inherit the massive shared cache and apply CoW semantics for their lightweight unique cache. Furthermore, to guarantee efficient execution, we design ResidualAttention, a specialized kernel that reconstructs the disaggregated KV cache directly within on‑chip SRAM. Comprehensive evaluations across diverse language models and practical datasets of different tasks demonstrate that ForkKV achieves up to 3.0x the throughput of state‑of‑the‑art multi‑LoRA serving systems with a negligible impact on generation quality.

PaperID: 4196, https://arxiv.org/pdf/2604.06269.pdf

Authors: Yehui Yang, Zelin Zang, Xienan Zheng, Yuzhe Jia, Changxi Chi, Jingbo Zhou, Chang Yu, Jinlin Wu, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

Title: MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation

Abstract:
Automated single‑cell annotation is difficult when the most abundant genes are not the most discriminative ones, or when a target state is poorly covered by a fixed reference atlas. GPTCelltype‑style one‑shot prompting allows large language models (LLMs) to produce plausible labels from generic expression signals, while reference‑based annotators can force unfamiliar states into the nearest known category. We propose MAT‑Cell, a prompt‑driven framework for batch‑level single‑cell annotation that separates evidence grounding from label decision. MAT‑Cell first uses Reverse Verification Query (RVQ) to combine tissue context, observed differentially expressed genes, and LLM‑elicited biological priors into structured candidate‑specific premises. Verifier agents then convert these premises into explicit premise‑to‑claim reasoning trees, and bounded multi‑round debate compares,challenges, and revises the resulting claims before consensus or final adjudication.The returned Syllogistic Derivation Tree (SDT) provides an auditable debate trace rather than a formal proof of the annotation. In open‑candidate benchmarks across five datasets, a locally deployed Qwen3‑30B model with MAT‑Cell achieves 75.5% average accuracy, compared with 64.2% for the strongest evaluated CoT baseline and 51.9% for the strongest evaluated scPilot variant. In oracle‑candidate bench‑marks across three species,MAT‑Cell remains competitive across backbones, and local inference substantially reduces monetary cost for batch annotation. Code is available at: https://anonymous.4open.science/r/MATCell‑4067

PaperID: 4197, https://arxiv.org/pdf/2604.06268.pdf

Authors: Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

Title: RAGEN-2: Reasoning Collapse in Agentic RL

Abstract:
RL training of multi‑turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN‑2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input‑agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within‑input diversity (Entropy) and cross‑input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal‑to‑noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross‑input reasoning differences. To address this, we propose SNR‑Aware Filtering to select high‑signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.

PaperID: 4198, https://arxiv.org/pdf/2604.06209.pdf

Authors: Lina Bariah, Brahim Mefgouda, Farbod Tavakkoli, Enrique Molero, Louis Powell, Merouane Debbah

Title: TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents

Abstract:
The integration of large language model (LLM) agents into telecom networks introduces new challenges, related to intent recognition, tool execution, and resolution generation, while taking into consideration different operational constraints. In this paper, we introduce TelcoAgent‑Bench and TelcoAgent‑Metrics, a Telecom‑specific benchmarking framework for evaluating multilingual telecom LLM agents. The proposed framework assesses the semantic understanding as well as process‑level alignment with structured troubleshooting flows and stability across repeated scenario variations. Our contribution includes a structured suite of metrics that assess intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations, with the aim of quantifying the reliability and operational consistency of LLM agents in telecom environments. The framework is designed to operate in both English and Arabic, to address the need for multilingual agent deployment in operational network environments. Our experimental results show that although recent instruct‑tuned models can understand telecom problems in a reasonable way, they usually struggle to consistently follow the required troubleshooting steps and to maintain stable behavior when exposed to different variations of the same scenario. This performance gap becomes more pronounced in unconstrained and bilingual settings.

PaperID: 4199, https://arxiv.org/pdf/2604.06203.pdf

Authors: Hansoo Lee, Rafael A. Calvo

Title: Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics

Abstract:
The integration of continuous data from built‑in sensors and Large Language Models (LLMs) has fueled a surge of "Sensor‑Fused LLM agents" for personal health and well‑being support. While recent breakthroughs have demonstrated the technical feasibility of this fusion (e.g., Time‑LLM, SensorLLM), research primarily focuses on "Ethical Back‑End Design for Generative AI", concerns such as sensing accuracy, bias mitigation in training data, and multimodal fusion. This leaves a critical gap at the front end, where invisible biometrics are translated into language directly experienced by users. We argue that the "illusion of objectivity" provided by sensor data amplifies the risks of AI hallucinations, potentially turning errors into harmful medical mandates. This paper shifts the focus to "Ethical Front‑End Design for AI", specifically, the ethics of biometric translation. We propose a design space comprising five dimensions: Biometric Disclosure, Monitoring Temporality, Interpretation Framing, AI Stance, and Contestability. We examine how these dimensions interact with context (user‑ vs. system‑initiated) and identify the risk of biofeedback loops. Finally, we propose "Adaptive Disclosure" as a safety guardrail and offer design guidelines to help developers manage fallibility, ensuring that these cutting‑edge health agents support, rather than destabilize, user autonomy.

PaperID: 4200, https://arxiv.org/pdf/2604.06126.pdf

Authors: Pranjal Aggarwal, Graham Neubig, Sean Welleck

Title: Gym-Anything: Turn any Software into an Agent Environment

Abstract:
Computer‑use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short‑horizon tasks over a limited set of software with limited economic value, such as basic e‑commerce and OS‑configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym‑Anything, a framework for converting any software into an interactive computer‑use environment. We frame environment creation itself as a multi‑agent task: a coding agent writes setup scripts, downloads real‑world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA‑World, a collection of over 10K long‑horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA‑World also includes CUA‑World‑Long, a challenging long‑horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision‑language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini‑3‑Flash on CUA‑World‑Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer‑use agents.

PaperID: 4201, https://arxiv.org/pdf/2604.06091.pdf

Authors: Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park

Title: Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Abstract:
Large language model (LLM) agents are increasingly acting as human delegates in multi‑agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena‑social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion‑and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent's accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent's judgment, depending on the context. These findings reveal that multi‑agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision‑making.

PaperID: 4202, https://arxiv.org/pdf/2604.06089.pdf

Authors: Yunxiao Ren, Dingguo Liang, Yuezu Lv, Zhisheng Duan

Title: Coalitional Zero-Sum Games for ${H_{\infty}}$ Leader-Following Consensus Control

Abstract:
This paper investigates the leader‑following consensus problem for a class of multi‑agent systems subject to adversarial attack‑like external inputs. To address this, we formulate the robust leader‑following control problem as a global coalitional min‑max zero‑sum game using differential game theory. Specifically, the agents' control inputs form a coalition to minimize a global cost function, while the attacks form an opposing coalition to maximize it. Notably, when these external adversarial attacks manifest as disturbances, the designed game‑theoretic control policy systematically yields a robust H_\infty control law. Addressing this problem inherently requires solving a high‑dimensional generalized algebraic Riccati equation (GARE), which poses significant challenges for distributed computation and controller implementation. To overcome these challenges, we propose a two‑fold approach. First, a decentralized computational strategy is devised to decompose the high‑dimensional GARE into multiple uniform, lower‑dimensional GAREs. Second, a dynamic average consensus‑based decoupling algorithm is developed to resolve the inherent coupling structure of the robust control law, thereby facilitating its distributed implementation. Finally, numerical simulations on the formation control of multi‑vehicle systems with feedback‑linearized dynamics are conducted to validate the effectiveness of the proposed algorithms.

PaperID: 4203, https://arxiv.org/pdf/2604.06024.pdf

Authors: Guangyi Liu, Vivek Pandey, Christoforos Somarakis, Nader Motee

Title: Incremental Risk Assessment for Cascading Failures in Large-Scale Multi-Agent Systems

Abstract:
We develop a framework for studying and quantifying the risk of cascading failures in time‑delay consensus networks, motivated by a team of agents attempting temporal rendezvous under stochastic disturbances and communication delays. To assess how failures at one or multiple agents amplify the risk of deviation across the network, we employ the Average Value‑at‑Risk as a systemic measure of cascading uncertainty. Closed‑form expressions reveal explicit dependencies of the risk of cascading failure on the Laplacian spectrum, communication delay, and noise statistics. We further establish fundamental lower bounds that characterize the best‑achievable network performance under time‑delay constraints. These bounds serve as feasibility certificates for assessing whether a desired safety or performance goal can be achieved without exhaustive search across all possible topologies. In addition, we develop an efficient single‑step update law that enables scalable propagation of conditional risk as new failures are detected. Analytical and numerical studies demonstrate significant computational savings and confirm the tightness of the theoretical limits across diverse network configurations.

PaperID: 4204, https://arxiv.org/pdf/2604.05943.pdf

Authors: Maria Nesterova, Mikhail Kolosov, Anton Andreychuk, Egor Cherepanov, Oleg Bulichev, Alexey Kovalev, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

Title: MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Abstract:
Recent advances in multi‑agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT‑based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi‑Agent Challenge, Google Research Football and POGEMA. Our method, MARL‑GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer‑based observation encoder that requires no task‑specific tuning. Experiments show that MARL‑GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi‑task transformer‑based model for a wide variety of (significantly different) multi‑agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

PaperID: 4205, https://arxiv.org/pdf/2604.05876.pdf

Authors: Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen

Title: Mechanistic Circuit-Based Knowledge Editing in Large Language Models

Abstract:
Deploying Large Language Models (LLMs) in real‑world dynamic environments raises the challenge of updating their pre‑trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a "Reasoning Gap", where the model recalls the edited fact but fails to utilize it in multi‑step reasoning chains. To bridge this gap, we introduce MCircKE (\underlineMechanistic \underlineCircuit‑based \underlineKnowledge \underlineEditing), a novel framework that enables a precise "map‑and‑adapt" editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically update parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE‑3K benchmark demonstrate the effectiveness of the proposed method for multi‑hop reasoning in knowledge editing.

PaperID: 4206, https://arxiv.org/pdf/2604.05875.pdf

Authors: Yinan Liu, Dongying Lin, Sigang Luo, Xiaochun Yang, Bin Wang

Title: Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

Abstract:
Knowledge Bases (KBs) play a key role in various applications. As two representative KB‑related tasks, knowledge base completion (KBC) and knowledge base question answering (KBQA) are closely related and inherently complementary with each other. Thus, it will be beneficial to solve the task of joint KBC and KBQA to make them reinforce each other. However, existing studies usually rely on the small language model (SLM) to enhance them jointly, and the large language model (LLM)'s strong reasoning ability is ignored. In this paper, by combining the strengths of the LLM with the SLM, we propose a novel framework JCQL, which can make these two tasks enhance each other in an iterative manner. To make KBC enhance KBQA, we augment the LLM agent‑based KBQA model's reasoning paths by incorporating an SLM‑trained KBC model as an action of the agent, alleviating the LLM's hallucination and high computational costs issue in KBQA. To make KBQA enhance KBC, we incrementally fine‑tune the KBC model by leveraging KBQA's reasoning paths as its supplementary training data, improving the ability of the SLM in KBC. Extensive experiments over two public benchmark data sets demonstrate that JCQL surpasses all baselines for both KBC and KBQA tasks.

PaperID: 4207, https://arxiv.org/pdf/2604.05848.pdf

Authors: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Title: Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

Abstract:
Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context‑dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation‑level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task‑specific evaluation. Using student‑authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student's interactions over time. Results show that learner‑level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction‑level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre‑deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.

PaperID: 4208, https://arxiv.org/pdf/2604.05681.pdf

Authors: Ojas Jain, Dhruv Kumar

Title: LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Abstract:
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi‑agent board game whose dice mechanics, piece capture, safe‑square navigation, and home‑path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4‑player Ludo simulator supporting Random, Heuristic, Game‑Theory, and LLM agents. The game‑theory agent uses Expectiminimax search with depth‑limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game‑theory baseline only 40‑46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history‑conditioned grudge framing on identical board states, revealing prompt‑sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench‑5CBF/

PaperID: 4209, https://arxiv.org/pdf/2604.05550.pdf

Authors: Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

Title: AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Abstract:
Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State‑Of‑The‑Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end‑to‑end automated research system that advances the latest SOTA models published in top‑tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi‑agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long‑horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top‑tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end‑to‑end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow‑level improvements. These results suggest that end‑to‑end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher‑level scientific creativity.

PaperID: 4210, https://arxiv.org/pdf/2604.05549.pdf

Authors: Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You

Title: Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Abstract:
With the widespread application of LLM‑based agents across various domains, their complexity has introduced new security threats. Existing red‑team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent's performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent's reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real‑time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross‑model and cross‑scenario environments.

PaperID: 4211, https://arxiv.org/pdf/2604.05544.pdf

Authors: Jiahua Ma, Yiran Qin, Xin Wen, Yixiong Li, Yuyu Sun, Yulan Guo, Liang Lin, Ruimao Zhang

Title: Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

Abstract:
This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness in out‑of‑distribution execution errors or dynamically re‑routing trajectories, where the model relies solely on the original expert demonstrations for training. We introduce the Referring‑Aware Visuomotor Policy (ReV), a closed‑loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high‑level reasoning planner. Specifically, ReV leverages the coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring via a trajectory‑steering strategy. Upon receiving a specific referring point, the global diffusion head firstly generates a sequence of globally consistent yet temporally sparse action anchors, while identifies the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed‑loop process repeats at every execution step, enabling real‑time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine‑tuning scheme, ReV achieve higher success rates across challenging simulated and real‑world tasks.

PaperID: 4212, https://arxiv.org/pdf/2604.05533.pdf

Authors: Chenghao Li, Jun Liu, Songbo Zhang, Huadong Jian, Hao Ni, Lik-Hang Lee, Sung-Ho Bae, Guoqing Wang, Yang Yang, Chaoning Zhang

Title: Experience Transfer for Multimodal LLM Agents in Minecraft Game

Abstract:
Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer‑oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In‑Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from‑scratch learning setting, Echo achieves a 1.3x to 1.7x speed‑up on object‑unlocking tasks. Moreover, Echo exhibits a burst‑like chain‑unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.

PaperID: 4213, https://arxiv.org/pdf/2604.05529.pdf

Authors: Chenjie Yang, Yutian Jiang, Anqi Liang, Wei Qi, Chenyu Wu, Junbo Zhang

Title: ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

Abstract:
Human mobility modeling is indispensable for diverse urban applications. However, existing data‑driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose ActivityEditor, a novel dual‑LLM‑agent framework designed for zero‑shot cross‑regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention‑based agent, which leverages demographic‑driven priors to generate structured human intentions and coarse activity chains to ensure high‑level socio‑semantic coherence. These outputs are then refined by editor agent to obtain mobility trajectories through iteratively revisions that enforces human mobility law. This capability is acquired through reinforcement learning with multiple rewards grounded in real‑world physical constraints, allowing the agent to internalize mobility regularities and ensure high‑fidelity trajectory generation. Extensive experiments demonstrate that ActivityEditor achieves superior zero‑shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data‑scarce scenarios. Our code is available at: https://anonymous.4open.science/r/ActivityEditor‑066B.

PaperID: 4214, https://arxiv.org/pdf/2604.05523.pdf

Authors: Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai

Title: Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

Abstract:
The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market‑Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically‑relevant tasks through economic and trade competition. Specifically, we construct a configurable multi‑agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs bid for limited inventory in budget‑constrained auctions. In the retail stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role‑based attention mechanism for purchase. Market‑Bench logs complete trajectories of bids, prices, slogans, sales, and balance‑sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking on 20 open‑ and closed‑source LLM agents reveals significant performance disparities and winner‑take‑most phenomenon, i.e., only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break‑even point despite similar semantic matching scores. Market‑Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.

PaperID: 4215, https://arxiv.org/pdf/2604.05485.pdf

Authors: Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, Yue Zhao

Title: Auditable Agents

Abstract:
LLM agents call tools, query databases, delegate tasks, and trigger external side effects. Once an agent system can act in the world, the question is no longer only whether harmful actions can be prevented‑‑it is whether those actions remain answerable after deployment. We distinguish accountability (the ability to determine compliance and assign responsibility), auditability (the system property that makes accountability possible), and auditing (the process of reconstructing behavior from trustworthy evidence). Our claim is direct: no agent system can be accountable without auditability. To make this operational, we define five dimensions of agent auditability, i.e., action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity, and identify three mechanism classes (detect, enforce, recover) whose temporal information‑and‑intervention constraints explain why, in practice, no single approach suffices. We support the position with layered evidence rather than a single benchmark: lower‑bound ecosystem measurements suggest that even basic security prerequisites for auditability are widely unmet (617 security findings across six prominent open‑source projects); runtime feasibility results show that pre‑execution mediation with tamper‑evident records adds only 8.3 ms median overhead; and controlled recovery experiments show that responsibility‑relevant information can be partially recovered even when conventional logs are missing. We propose an Auditability Card for agent systems and identify six open research problems organized by mechanism class.

PaperID: 4216, https://arxiv.org/pdf/2604.05483.pdf

Authors: Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

Title: Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

Abstract:
Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of which topics their answers can be trusted. In this research, we introduce a novel algorithm, named as GMRL‑BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black‑box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates with multiple reinforcement learning agents to efficiently identify topics (some nodes in KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with just limited queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi‑1.5, along with labels indicating the topics on which each LLM is likely to be biased.

PaperID: 4217, https://arxiv.org/pdf/2604.05467.pdf

Authors: Siddharth Jain, Venkat Narayan Vedam

Title: CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

Abstract:
As language models shift from single‑shot answer generation toward multi‑step reasoning that retrieves and consumes evidence mid‑inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final‑answer quality, citation faithfulness, or answer‑level attribution, but none of these directly targets the intervention‑based, per‑evidence‑item utility view we study here. We introduce CUE‑R, a lightweight intervention‑based framework for measuring per‑evidence‑item operational utility in single‑shot RAG using shallow observable retrieval‑use traces. CUE‑R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy‑based grounding faithfulness, and confidence error) plus a trace‑divergence signal. We also outline an operational evidence‑role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen‑3 8B and GPT‑5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer‑redundant yet not fully behaviorally neutral. A zero‑retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two‑support ablation further shows that multi‑hop evidence items can interact non‑additively: removing both supports harms performance far more than either single removal. Our results suggest that answer‑only evaluation misses important evidence effects and that intervention‑based utility analysis is a practical complement for RAG evaluation.

PaperID: 4218, https://arxiv.org/pdf/2604.05458.pdf

Authors: Md Shamimul Islam, Luis G. Jaimes, Ayesha S. Dina

Title: MA-IDS: Multi-Agent RAG Framework for IoT Network Intrusion Detection with an Experience Library

Abstract:
Network Intrusion Detection Systems (NIDS) face important limitations. Signature‑based methods are effective for known attack patterns, but they struggle to detect zero‑day attacks and often miss modified variants of previously known attacks, while many machine learning approaches offer limited interpretability. These challenges become even more severe in IoT environments because of resource constraints and heterogeneous protocols. To address these issues, we propose MA‑IDS, a Multi‑Agent Intrusion Detection System that combines Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for reasoning‑driven intrusion detection. The proposed framework grounds LLM reasoning through a persistent, self‑building Experience Library. Two specialized agents collaborate through a FAISS‑based vector database: a Traffic Classification Agent that retrieves past error rules before each inference, and an Error Analysis Agent that converts misclassifications into human‑readable detection rules stored for future retrieval, enabling continual learning through external knowledge accumulation, without modifying the underlying language model. Evaluated on NF‑BoT‑IoT and NF‑ToN‑IoT benchmark datasets, MA‑IDS achieves Macro F1‑Scores of 89.75% and 85.22%, improving over zero‑shot baselines of 17% and 4.96% by more than 72 and 80 percentage points. These results are competitive with SVM while providing rule‑level explanations for every classification decision, demonstrating that retrieval‑augmented reasoning offers a principled path toward explainable, self‑improving intrusion detection for IoT networks.

PaperID: 4219, https://arxiv.org/pdf/2604.05449.pdf

Authors: Kang Ding, Hongsong Wang, Jie Gui, Lei He

Title: Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

Abstract:
End‑to‑end autonomous driving resides not in the integration of perception and planning, but rather in the dynamic multi‑agent game within a unified representation space. Most existing end‑to‑end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, We introduce the concept of Risk‑Prioritized Game Planning, and propose GameAD, a novel framework that models end‑to‑end autonomous driving as a risk‑aware game problem. The GameAD integrates Risk‑Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk‑Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game theoretic decision making with risk prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state‑of‑the‑art methods, especially in terms of trajectory safety.

PaperID: 4220, https://arxiv.org/pdf/2604.05432.pdf

Authors: Wuyang Zhang, Shichao Pei

Title: Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use

Abstract:
Tool‑use large language model (LLM) agents are increasingly deployed to support sensitive workflows, relying on tool calls for retrieval, external API access, and session memory management. While prior research has examined various threats, the risk of systematic data exfiltration by backdoored agents remains underexplored. In this work, we present Back‑Reveal, a data exfiltration attack that embeds semantic triggers into fine‑tuned LLM agents. When triggered, the backdoored agent invokes memory‑access tool calls to retrieve stored user context and exfiltrates it via disguised retrieval tool calls. We further demonstrate that multi‑turn interaction amplifies the impact of data exfiltration, as attacker‑controlled retrieval responses can subtly steer subsequent agent behavior and user interactions, enabling sustained and cumulative information leakage over time. Our experimental results expose a critical vulnerability in LLM agents with tool access and highlight the need for defenses against exfiltration‑oriented backdoors.

PaperID: 4221, https://arxiv.org/pdf/2604.05416.pdf

Authors: Hongkai Fan, Qinjing Xie, Bo Ouyang, Yaonan Wang, Zhi Yan, Jiawen He, Zheng Fang

Title: Multi-Agent Pathfinding with Non-Unit Integer Edge Costs via Enhanced Conflict-Based Search and Graph Discretization

Abstract:
Multi‑Agent Pathfinding (MAPF) plays a critical role in various domains. Traditional MAPF methods typically assume unit edge costs and single‑timestep actions, which limit their applicability to real‑world scenarios. MAPFR extends MAPF to handle non‑unit costs with real‑valued edge costs and continuous‑time actions, but its geometric collision model leads to an unbounded state space that compromises solver efficiency. In this paper, we propose MAPFZ, a novel MAPF variant on graphs with non‑unit integer costs that preserves a finite state space while offering improved realism over classical MAPF. To solve MAPFZ efficiently, we develop CBS‑NIC, an enhanced Conflict‑Based Search framework incorporating time‑interval‑based conflict detection and an improved Safe Interval Path Planning (SIPP) algorithm. Additionally, we propose Bayesian Optimization for Graph Design (BOGD), a discretization method for non‑unit edge costs that balances efficiency and accuracy with a sub‑linear regret bound. Extensive experiments demonstrate that our approach outperforms state‑of‑the‑art methods in runtime and success rate across diverse benchmark scenarios.

PaperID: 4222, https://arxiv.org/pdf/2604.05398.pdf

Authors: Liya Guo, Ruimeng Hu, Xu Yang, Yi Zhu

Title: An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

Abstract:
Continuous‑time stochastic control with time‑inhomogeneous jump‑diffusion dynamics is central in finance and economics, but computing optimal policies is difficult under explicit time dependence, discontinuous shocks, and high dimensionality. We propose an actor‑critic framework that serves as a mesh‑free solver for entropy‑regularized control problems and stochastic games with jumps. The approach is built on a time‑inhomogeneous little q‑function and an appropriate occupation measure, yielding a policy‑gradient representation that accommodates time‑dependent drift, volatility, and jump terms. To represent expressive stochastic policies in continuous‑action spaces, we parameterize the actor using conditional normalizing flows, enabling flexible non‑Gaussian policies while retaining exact likelihood evaluation for entropy regularization and policy optimization. We validate the method on time‑inhomogeneous linear‑quadratic control, Merton portfolio optimization, and a multi‑agent portfolio game, using explicit solutions or high‑accuracy benchmarks. Numerical results demonstrate stable learning under jump discontinuities, accurate approximation of optimal stochastic policies, and favorable scaling with respect to dimension and number of agents.

PaperID: 4223, https://arxiv.org/pdf/2604.05364.pdf

Authors: Md Atik Ahamed, Mihir Parmar, Palash Goyal, Yiwen Song, Long T. Le, Qiang Cheng, Chun-Liang Li, Hamid Palangi, Jinsung Yoon, Tomas Pfister

Title: TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

Abstract:
We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time‑series forecasting has been evaluated solely on numerical accuracy, treating foundation models as ``black boxes.'' Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems‑‑specifically their analysis of cross‑channel dependencies, trends, and external events. To enable this, we propose a systematic multi‑agent framework that utilizes an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective; useful for evaluation; and prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ～40.2%\to56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off‑the‑shelf LLMs consistently struggle with both reasoning (lower LLM‑as‑a‑Judge scores) and numerical forecasting, frequently failing to capture domain‑specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning‑based evaluation in time‑series forecasting. Our benchmark is available at: https://tfrbench.github.io

PaperID: 4224, https://arxiv.org/pdf/2604.05360.pdf

Authors: Khoi T. N. Nguyen, Nghia D. Nguyen, Hui Yu Koh, Patrick W. H. Kwong, Karen Sui Geok Chua, Ananda Sidarta, Baosheng Yu

Title: OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation

Abstract:
Gait analysis is essential in post‑stroke rehabilitation but remains time‑intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion‑capture data into structured reports. We present OGA‑AID, a clinician‑in‑the‑loop multi‑agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA‑AID consistently outperforms single‑pass multimodal baselines with low error. In clinician‑in‑the‑loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI‑assisted analysis and human clinical judgment in rehabilitation workflows.

PaperID: 4225, https://arxiv.org/pdf/2604.05354.pdf

Authors: Haochen Yang, Baolu Li, Lei Li, Delin Ren, Jiacheng Guo, Minghai Qin, Tianyun Zhang, Hongkai Yu

Title: Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Abstract:
The LiDAR‑based multi‑agent and single‑agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi‑agent and single‑agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance for the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi‑agent and Single‑agent (UMS) perception framework that leverages multi‑agent cooperation without human annotations to simultaneously solve multi‑agent and single‑agent perception. UMS combines a learning‑based Proposal Purifying Filter to better classify the candidate proposals after multi‑agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels by the easy‑to‑hard curriculum learning. Furthermore, we design a Cross‑View Consensus Learning to use multi‑agent cooperative view to guide detection in single‑agent view. Experimental results on two public datasets V2V4Real and OPV2V show that our UMS method achieved significantly higher 3D detection performance than the state‑of‑the‑art methods on both multi‑agent and single‑agent perception tasks in an unsupervised setting.

PaperID: 4226, https://arxiv.org/pdf/2604.05339.pdf

Authors: Xiangxu Zhang, Jiamin Wang, Qinlin Zhao, Hanze Guo, Linzhuo Li, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie

Title: Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

Abstract:
As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM‑based multi‑agent systems, where group‑level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces? In this work, we introduce CIVA, a controlled multi‑agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community's collective dynamics, including those diverging from LLMs' original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power‑seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi‑agent value alignment.

PaperID: 4227, https://arxiv.org/pdf/2604.05320.pdf

Authors: Souren Pashangpour, Haitong Wang, Matthew Lisondra, Goldie Nejat

Title: ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

Abstract:
Mobile manipulators are increasingly deployed in human‑centered environments to perform tasks. While completing such tasks, they should also be able to communicate their intent to the people around them using expressive robot behaviors. Prior work on expressive robot behaviors has used preprogrammed or learning‑from‑demonstration‑based expressive motions and large language model generated high‑level interactions. The majority of these existing approaches have not considered human‑robot interactions (HRI) where users may interrupt, modify, or redirect a robot's actions during task execution. In this paper, we develop the novel ExpressMM framework that integrates a high‑level language‑guided planner based on a vision‑language model for perception and conversational reasoning with a low‑level vision‑language‑action policy to generate expressive robot behaviors during collaborative HRI tasks. Furthermore, ExpressMM supports interruptible interactions to accommodate updated or redirecting instructions by users. We demonstrate ExpressMM on a mobile manipulator assisting a human in a collaborative assembly scenario and conduct audience‑based evaluation of live HRI demonstrations. Questionnaire results show that the ExpressMM‑enabled expressive behaviors helped observers clearly interpret the robot's actions and intentions while supporting socially appropriate and understandable interactions. Participants also reported that the robot was useful for collaborative tasks and behaved in a predictable and safe manner during the demonstrations, fostering positive perceptions of the robot's usefulness, safety, and predictability during the collaborative tasks.

PaperID: 4228, https://arxiv.org/pdf/2604.05297.pdf

Authors: Lesong Tao, Yifei Wang, Haodong Jing, Jingwen Fu, Miao Kang, Shitao Chen, Nanning Zheng

Title: Breakthrough the Suboptimal Stable Point in Value-Factorization-Based Multi-Agent Reinforcement Learning

Abstract:
Value factorization, a popular paradigm in MARL, faces significant theoretical and algorithmic bottlenecks: its tendency to converge to suboptimal solutions remains poorly understood and unsolved. Theoretically, existing analyses fail to explain this due to their primary focus on the optimal case. To bridge this gap, we introduce a novel theoretical concept: the stable point, which characterizes the potential convergence of value factorization in general cases. Through an analysis of stable point distributions in existing methods, we reveal that non‑optimal stable points are the primary cause of poor performance. However, algorithmically, making the optimal action the unique stable point is nearly infeasible. In contrast, iteratively filtering suboptimal actions by rendering them unstable emerges as a more practical approach for global optimality. Inspired by this, we propose a novel Multi‑Round Value Factorization (MRVF) framework. Specifically, by measuring a non‑negative payoff increment relative to the previously selected action, MRVF transforms inferior actions into unstable points, thereby driving each iteration toward a stable point with a superior action. Experiments on challenging benchmarks, including predator‑prey tasks and StarCraft II Multi‑Agent Challenge (SMAC), validate our analysis of stable points and demonstrate the superiority of MRVF over state‑of‑the‑art methods.

PaperID: 4229, https://arxiv.org/pdf/2604.05289.pdf

Authors: Mingxuan Hui, Xinyue Li, Lu Wang, Chengcheng Wan, Yifan Wang, Yimian Wang, Feiyue Song, Beining Shi, Yixi Li, Yaxiao Li

Title: FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

Abstract:
Multi‑Agent LLM Systems (MAS) have been adopted to automate complex human workflows by breaking down tasks into subtasks. However, due to the non‑deterministic behavior of LLM agents and the intricate interactions between agents, MAS applications frequently encounter failures, including infinite loops and failed tool invocations. Traditional software testing techniques are ineffective in detecting such failures due to the lack of LLM agent specification, the large behavioral space of MAS, and semantic‑based correctness judgment. This paper presents FLARE, a novel testing framework tailored for MAS. FLARE takes the source code of MAS as input and extracts specifications and behavioral spaces from agent definitions. Based on these specifications, FLARE builds test oracles and conducts coverage‑guided fuzzing to expose failures. It then analyzes execution logs to judge whether each test has passed and generates failure reports. Our evaluation on 16 diverse open‑source applications demonstrates that FLARE achieves 96.9% inter‑agent coverage and 91.1% intra‑agent coverage, outperforming baselines by 9.5% and 1.0%. FLARE also uncovers 56 previously unknown failures unique to MAS.

PaperID: 4230, https://arxiv.org/pdf/2604.05278.pdf

Authors: Pardis Taghavi, Santosh Bhavani

Title: Spec Kit Agents: Context-Grounded Agentic Workflows

Abstract:
Spec‑driven development (SDD) with AI coding agents provides a structured workflow, but agents often remain "context blind" in large, evolving repositories, leading to hallucinated APIs and architectural violations. We present Spec Kit Agents, a multi‑agent SDD pipeline (with PM and developer roles) that adds phase‑level, context‑grounding hooks. Read‑only probing hooks ground each stage (Specify, Plan, Tasks, Implement) in repository evidence, while validation hooks check intermediate artifacts against the environment. We evaluate 128 runs covering 32 features across five repositories. Context‑grounding hooks improve judged quality by +0.15 on a 1‑5 composite LLM‑as‑judge score (+3.0 percent of the full score; Wilcoxon signed‑rank, p < 0.05) while maintaining 99.7‑100 percent repository‑level test compatibility. We further evaluate the framework on SWE‑bench Lite, where augmentation hooks improve baseline by 1.7 percent, achieving 58.2 percent Pass@1.

PaperID: 4231, https://arxiv.org/pdf/2604.05273.pdf

Authors: Kabir Ahuja, Yuxuan Li, Andrew Kyle Lampinen

Title: Beneath the Surface: Investigating LLMs' Capabilities for Communicating with Subtext

Abstract:
Human communication is fundamentally creative, and often makes use of subtext ‑‑ implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing & interpreting allegories to playing multi‑agent and multi‑modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints ‑‑ even the best performing models generate literal clues 60% of times in one of our environments ‑‑ Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving 30%‑50% reduction in overly literal clues; but they struggle at inferring presence of a common ground when not explicitly stated. For allegory understanding, we find paratextual and persona conditions to significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research to inspire future work towards socially grounded creative communication and reasoning.

PaperID: 4232, https://arxiv.org/pdf/2604.05172.pdf

Authors: Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

Title: ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Abstract:
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi‑service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high‑fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single‑service, cross‑service, and safety‑critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39‑64% but exhibit unsafe action rates of 7‑33%. On OpenClaw, the top five models fall within a 10 percentage‑point band on task success (53‑63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi‑step sandbox escalation and silent contract modification. We release the trajectories and future dataset at https://clawsbench.com.

PaperID: 4233, https://arxiv.org/pdf/2604.05165.pdf

Authors: Hieu Le, Mostafa Ibrahim, Oguz Bedir, Jian Tao, Sabit Ekin

Title: Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

Abstract:
Reconfigurable Intelligent Surfaces (RIS) has a potential to engineer smart radio environments for next‑generation millimeter‑wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large‑scale deployments. To overcome these bottlenecks, we introduce a ``CSI‑free" paradigm powered by a Hierarchical Multi‑Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot‑based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro‑scale wave propagation management. The control problem is decomposed into a two‑tier neural architecture: a high‑level controller executes temporally extended, discrete user‑to‑reflector allocations, while low‑level controllers autonomously optimize continuous focal points utilizing Multi‑Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray‑tracing evaluations demonstrate that this hierarchical framework achieves massive RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi‑user scalability and maintains highly resilient beam‑focusing performance under practical sub‑meter localization tracking errors. By eliminating CSI overhead while maintaining high‑fidelity signal redirection, this work establishes a scalable and cost‑effective blueprint for intelligent wireless environments.

PaperID: 4234, https://arxiv.org/pdf/2604.05162.pdf

Authors: Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin

Title: Bypassing the CSI Bottleneck: MARL-Driven Spatial Control for Reflector Arrays

Abstract:
Reconfigurable Intelligent Surfaces (RIS) are pivotal for next‑generation smart radio environments, yet their practical deployment is severely bottlenecked by the intractable computational overhead of Channel State Information (CSI) estimation. To bypass this fundamental physical‑layer barrier, we propose an AI‑native, data‑driven paradigm that replaces complex channel modeling with spatial intelligence. This paper presents a fully autonomous Multi‑Agent Reinforcement Learning (MARL) framework to control mechanically adjustable metallic reflector arrays. By mapping high‑dimensional mechanical constraints to a reduced‑order virtual focal point space, we deploy a Centralized Training with Decentralized Execution (CTDE) architecture. Using Multi‑Agent Proximal Policy Optimization (MAPPO), our decentralized agents learn cooperative beam‑focusing strategies relying on user coordinates, achieving CSI‑free operation. High‑fidelity ray‑tracing simulations in dynamic non‑line‑of‑sight (NLOS) environments demonstrate that this multi‑agent approach rapidly adapts to user mobility, yielding up to a 26.86 dB enhancement over static flat reflectors and outperforming single‑agent and hardware‑constrained DRL baselines in both spatial selectivity and temporal stability. Crucially, the learned policies exhibit good deployment resilience, sustaining stable signal coverage even under 1.0‑meter localization noise. These results validate the efficacy of MARL‑driven spatial abstractions as a scalable, highly practical pathway toward AI‑empowered wireless networks.

PaperID: 4235, https://arxiv.org/pdf/2604.05160.pdf

Authors: Fareya Ikram, Nischal Ashok Kumar, Junyang Lu, Hunter McNichols, Candace Walkington, Neil Heffernan, Andrew S. Lan

Title: A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems

Abstract:
Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale. However, LLM‑generated personalized problems may often have problems such as unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To alleviate these problems, we propose a multi‑agent framework that formalizes personalization as an iterative generate‑‑validate‑‑revise process; we use four specialized validator agents targeting the criteria of solvability, realism, readability, and authenticity, respectively. We evaluate our framework on 600 problems drawn from a popular online mathematics homework platform, ASSISTments, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM‑personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that different refinement strategies have different strengths on different criteria. We also assess validator reliability via human evaluation. Results show that reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that consider teachers' and students' personal characteristics.

PaperID: 4236, https://arxiv.org/pdf/2604.05149.pdf

Authors: Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, Chuxu Zhang

Title: EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

Abstract:
Large language model agents often exhibit complementary strengths, making routing a promising approach for multi‑agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph‑based query routing with targeted instruction refinement in a closed‑loop co‑evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router‑weighted answer agreement. Together, these designs enable more capable and more efficient multi‑agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed‑loop refinement and adaptive collaboration.

PaperID: 4237, https://arxiv.org/pdf/2604.05131.pdf

Authors: Lan Sang, Chinmay Maheshwari

Title: Nash Approximation Gap in Truncated Infinite-horizon Partially Observable Markov Games

Abstract:
Partially Observable Markov Games (POMGs) provide a general framework for modeling multi‑agent sequential decision‑making under asymmetric information. A common approach is to reformulate a POMG as a fully observable Markov game over belief states, where the state is the conditional distribution of the system state and agents' private information given common information, and actions correspond to mappings (prescriptions) from private information to actions. However, this reformulation is intractable in infinite‑horizon settings, as both the belief state and action spaces grow with the accumulation of information over time. We propose a finite‑memory truncation framework that approximates infinite‑horizon POMGs by a finite‑state, finite‑action Markov game, where agents condition decisions only on finite windows of common and private information. Under suitable filter stability (forgetting) conditions, we show that any Nash equilibrium of the truncated game is an \varepsilon‑Nash equilibrium of the original POMG, where \varepsilon \to 0 as the truncation length increases.

PaperID: 4238, https://arxiv.org/pdf/2604.05130.pdf

Authors: Siyi Chen, Tianhan Luo, Shijian Wu, Xiangyu Liu, Yilin Zhou, Qi Li, Wenyuan Xu

Title: A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection

Abstract:
Open‑source libraries are widely used in modern software development, introducing significant security vulnerabilities. While static analysis tools can identify potential vulnerabilities at scale, they often generate overwhelming reports with high false positive rates. Automated Exploit Generation (AEG) emerges as a promising solution to confirm vulnerability authenticity by generating an exploit. However, traditional AEG approaches based on fuzzing or symbolic execution face path coverage and constraint‑solving problems. Although LLMs show great potential for AEG, how to effectively leverage them to comprehend vulnerabilities and generate corresponding exploits is still an open question. To address these challenges, we propose Vulnsage, a multi‑agent framework for AEG. Vulnsage simulates human security researchers' workflows by decomposing the complex AEG process into multiple specialized sub‑agents: Code Analyzer Agent, Code Generation Agent, Validation Agent, and a set of Reflection Agents, orchestrated by a central supervisor through iterative cycles. Given a target program, the Code Analyzer Agent performs static analysis to identify potential vulnerabilities and collects relevant information for each one. The Code Generation Agent then utilizes an LLM to generate candidate exploits. The Validation Agent and Reflection Agents form a feedback‑driven self‑refinement loop that uses execution traces and runtime error analysis to either improve the exploit iteratively or reason about the false positive alert. Experimental evaluation demonstrates that Vulnsage succeeds in generating 34.64% more exploits than state‑of‑the‑art tools such as \explodejs. Furthermore, Vulnsage has successfully discovered and verified 146 zero‑day vulnerabilities in real‑world scenarios, demonstrating its practical effectiveness for assisting security assessment in software supply chains.

PaperID: 4239, https://arxiv.org/pdf/2604.05120.pdf

Authors: Mengyu Chen, Pranav Deshpande, Runqing Yang, Elvir Azanli, Joseph Ligman, Shaohan Hu, Chun-Fu Chen

Title: Designing Digital Humans with Ambient Intelligence

Abstract:
Digital humans are lifelike virtual agents capable of natural conversation and are increasingly deployed in domains like retail and finance. However, most current digital humans operate in isolation from their surroundings and lack contextual awareness beyond the dialogue itself. We address this limitation by integrating ambient intelligence (AmI) ‑ i.e., environmental sensors, IoT data, and contextual modeling ‑ with digital human systems. This integration enables situational awareness of the user's environment, anticipatory and proactive assistance, seamless cross‑device interactions, and personalized long‑term user support. We present a conceptual framework defining key roles that AmI can play in shaping digital human behavior, a design space highlighting dimensions such as proactivity levels and privacy strategies, and application‑driven patterns with case studies in financial and retail services. We also discuss an architecture for ambient‑enabled digital humans and provide guidelines for responsible design regarding privacy and data governance. Together, our work positions ambient intelligent digital humans as a new class of interactive agents powered by AI that respond not only to users' queries but also to the context and situations in which the interaction occurs.

PaperID: 4240, https://arxiv.org/pdf/2604.05119.pdf

Authors: Anshul Pathak, Nishant Jain

Title: Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

Abstract:
Enterprise multi‑agent AI systems produce thousands of inter‑agent interactions per hour, yet existing observability tools capture these dependencies without enforcing anything. OpenTelemetry and Langfuse collect telemetry but treat governance as a downstream analytics concern, not a real‑time enforcement target. The result is an "observe‑but‑do‑not‑act" gap where policy violations are detected only after damage is done. We present Governance‑Aware Agent Telemetry (GAAT), a reference architecture that closes the loop between telemetry collection and automated policy enforcement for multi‑agent systems. GAAT introduces (1) a Governance Telemetry Schema (GTS) extending OpenTelemetry with governance attributes; (2) a real‑time policy violation detection engine using OPA‑compatible declarative rules under sub‑200 ms latency; (3) a Governance Enforcement Bus (GEB) with graduated interventions; and (4) a Trusted Telemetry Plane with cryptographic provenance.

PaperID: 4241, https://arxiv.org/pdf/2604.05116.pdf

Authors: Xuyang Shen, Haoran Liu, Dongjin Song, Martin Renqiang Min

Title: Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis

Abstract:
Clinical diagnosis requires sequential evidence acquisition under uncertainty. However, most Large Language Model (LLM) based diagnostic systems assume fully observed patient information and therefore do not explicitly model how clinical evidence should be sequentially acquired over time. Even when diagnosis is formulated as a sequential decision process, it is still challenging to learn effective diagnostic trajectories. This is because the space of possible evidence‑acquisition paths is relatively large, while clinical datasets rarely provide explicit supervision information for desirable diagnostic paths. To this end, we formulate sequential diagnosis as a Latent Diagnostic Trajectory Learning (LDTL) framework based on a planning LLM agent and a diagnostic LLM agent. For the diagnostic LLM agent, diagnostic action sequences are treated as latent paths and we introduce a posterior distribution that prioritizes trajectories providing more diagnostic information. The planning LLM agent is then trained to follow this distribution, encouraging coherent diagnostic trajectories that progressively reduce uncertainty. Experiments on the MIMIC‑CDM benchmark demonstrate that our proposed LDTL framework outperforms existing baselines in diagnostic accuracy under a sequential clinical diagnosis setting, while requiring fewer diagnostic tests. Furthermore, ablation studies highlight the critical role of trajectory‑level posterior alignment in achieving these improvements.

PaperID: 4242, https://arxiv.org/pdf/2604.05079.pdf

Authors: Zhongyu Yang, Zuhao Yang, Shuo Zhan, Tan Yue, Wei Pang, Yingfang Yuan

Title: SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Abstract:
Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline‑guided cross‑modal multi‑agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross‑modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta‑agent to align cross‑modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human‑like storyline reasoning in video understanding.

PaperID: 4243, https://arxiv.org/pdf/2604.05076.pdf

Authors: Zihao Lin, Haibo Wang, Zhiyang Xu, Siyao Dai, Huanjie Dong, Xiaohan Wang, Yolo Y. Tang, Yixin Wang, Qifan Wang, Lifu Huang

Title: GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Abstract:
Music‑grounded mashup video creation is a challenging form of video non‑linear editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long‑range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval‑and‑concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global‑local coordination multi‑agent framework for music‑grounded nonlinear video editing. GLANCE adopts a bi‑loop architecture for better editing practice: an outer loop performs long‑horizon planning and task‑graph construction, and an inner loop adopts the "Observe‑Think‑Act‑Verify" flow for segment‑wise editing tasks and their refinements. To address the cross‑segment and global conflict emerging after subtimelines composition, we introduce a dedicated global‑local coordination mechanism with both preventive and corrective components, which includes a novelly designed context controller, conflict region decomposition module, and a bottom‑up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent‑as‑a‑judge evaluation framework for scalable multi‑dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open‑source product baselines under the same backbone models. With GPT‑4o‑mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.

PaperID: 4244, https://arxiv.org/pdf/2604.05075.pdf

Authors: Frazier N. Baker, Trieu Nguyen, Reza Averly, Botao Yu, Daniel Adu-Ampratwum, Huan Sun, Xia Ning

Title: MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems

Abstract:
Multi‑objective retrosynthesis planning is a critical chemistry task requiring dynamic balancing of quality, safety, and cost objectives. Language model‑based multi‑agent systems (MAS) offer a promising approach for this task: leveraging interactions of specialized agents to incorporate multiple objectives into retrosynthesis planning. We present MMORF, a framework for constructing MAS for multi‑objective retrosynthesis planning. MMORF features modular agentic components, which can be flexibly combined and configured into different systems, enabling principled evaluation and comparison of different system designs. Using MMORF, we construct two representative MAS: MASIL and RFAS. On a newly curated benchmark consisting of 218 multi‑objective retrosynthesis planning tasks, MASIL achieves strong safety and cost metrics on soft‑constraint tasks, frequently Pareto‑dominating baseline routes, while RFAS achieves a 48.6% success rate on hard‑constraint tasks, outperforming state‑of‑the‑art baselines. Together, these results show the effectiveness of MMORF as a foundational framework for exploring MAS for multi‑objective retrosynthesis planning. Code and data are available at https://anonymous.4open.science/r/MMORF/.

PaperID: 4245, https://arxiv.org/pdf/2604.05015.pdf

Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He

Title: Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Abstract:
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real‑world model capabilities. To address this widening gap, we introduce Video‑MME‑v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri‑level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi‑point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per‑question accuracy, we propose a group‑based non‑linear evaluation strategy that enforces both consistency across related queries and coherence in multi‑step reasoning. It penalizes fragmented or guess‑based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video‑MME‑v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human‑hours and up to 5 rounds of quality assurance, Video‑MME‑v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between current best model Gemini‑3‑Pro and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high‑level reasoning. We further find that thinking‑based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video‑MME‑v2 establishes a demanding new testbed for the development of next‑generation video MLLMs.

PaperID: 4246, https://arxiv.org/pdf/2604.04983.pdf

Authors: Diyansha Singh

Title: Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

Abstract:
We present Territory Paint Wars, a minimal competitive multi‑agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self‑play. A first agent trained for 84,000 episodes achieves only 26.8% win rate against a uniformly‑random opponent in a symmetric zero‑sum game. Through controlled ablations we identify five implementation‑level failure modes ‑‑ reward‑scale imbalance, missing terminal signal, ineffective long‑horizon credit assignment, unnormalised observations, and incorrect win detection ‑‑ each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co‑adapting agents maintain stable self‑play performance while generalisation win rate collapses from 73.5% to 21.6%. Critically, this failure is undetectable via standard self‑play metrics: both agents co‑adapt equally, so the self‑play win rate remains near 50% throughout the collapse. We propose a minimal intervention ‑‑ opponent mixing, where 20% of training episodes substitute a fixed uniformly‑random policy for the co‑adaptive opponent ‑‑ which mitigates competitive overfitting and restores generalisation to 77.1% (\pm 12.6%, 10 seeds) without population‑based training or additional infrastructure. We open‑source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.

PaperID: 4247, https://arxiv.org/pdf/2604.04967.pdf

Authors: Devashri Naik, Divake Kumar, Nastaran Darabi, Amit Ranjan Trivedi

Title: Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

Abstract:
Robots operating in shared workspaces must maintain safe coordination with other agents whose behavior may change during task execution. When a collaborating agent switches strategy mid‑episode, continuing under outdated assumptions can lead to unsafe actions and increased collision risk. Reliable detection of such behavioral regime changes is therefore critical. We study regime‑switch detection under controlled non‑stationarity in ManiSkill shared‑workspace manipulation tasks. Across ten detection methods and five random seeds, enabling detection reduces post‑switch collisions by 52%. However, average performance hides significant reliability differences: under a realistic tolerance of +‑3 steps, detection ranges from 86% to 30%, while under +‑5 steps all methods achieve 100%. We introduce UA‑TOM, a lightweight belief‑tracking module that augments frozen vision‑language‑action (VLA) control backbones using selective state‑space dynamics, causal attention, and prediction‑error signals. Across five seeds and 1200 episodes, UA‑TOM achieves the highest detection rate among unassisted methods (85.7% at +‑3) and the lowest close‑range time (4.8 steps), outperforming an Oracle (5.3 steps). Analysis shows hidden‑state update magnitude increases by 17x at regime switches and decays over roughly 10 timesteps, while the discretization step converges to a near‑constant value (Delta_t approx 0.78), indicating sensitivity driven by learned dynamics rather than input‑dependent gating. Cross‑domain experiments in Overcooked show complementary roles of causal attention and prediction‑error signals. UA‑TOM introduces 7.4 ms inference overhead (14.8% of a 50 ms control budget), enabling reliable regime‑switch detection without modifying the base policy.

PaperID: 4248, https://arxiv.org/pdf/2604.04956.pdf

Authors: William Yicheng Zhu, Lei Zhu

Title: The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

Abstract:
The recent, super‑exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines primarily replacing the human hands (manual labor and mechanical processing) to machines delegating for the human minds (cognition, reasoning, and intention). The uncontrolled offloading and scaling of "thinking" itself, beyond human's limited but efficient biological capacity, has profound consequences for humanity's heat balance sheet, since thinking, or intelligence, carries thermodynamic consequences. The Earth has already surpassed the heat dissipation threshold required for long‑term ecological stability, and projecting based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six factors from artificial intelligence that influence the global heat dissipation rate and delineate how their interplay drives society toward one of four broad macroscopic trajectories. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitute the tenth planetary boundary (9+1). The core empirical measurement of this boundary is the net‑new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and thus baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the breach of critical planetary thermodynamic thresholds, or it will serve as the single most effective lever on stabilizing the other nine planetary boundaries and through which safeguarding human civilization's survival.

PaperID: 4249, https://arxiv.org/pdf/2604.04929.pdf

Authors: Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian

Title: Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Abstract:
Most vision‑language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end‑to‑end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real‑world benchmarks confirms the observation that a large model can achieve better or comparable performance as a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi‑agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that by reusing the reasoning tokens from small models, it can help approach the performance of a large model with its own reasoning, which confirms the effectiveness of our proposal.

PaperID: 4250, https://arxiv.org/pdf/2604.04895.pdf

Authors: Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas, Allan M. de Souza

Title: Agentic Federated Learning: The Future of Distributed Training Orchestration

Abstract:
Although Federated Learning (FL) promises privacy and distributed collaboration, its effectiveness in real‑world scenarios is often hampered by the stochastic heterogeneity of clients and unpredictable system dynamics. Existing static optimization approaches fail to adapt to these fluctuations, resulting in resource underutilization and systemic bias. In this work, we propose a paradigm shift towards Agentic‑FL, a framework where Language Model‑based Agents (LMagents) assume autonomous orchestration roles. Unlike rigid protocols, we demonstrate how server‑side agents can mitigate selection bias through contextual reasoning, while client‑side agents act as local guardians, dynamically managing privacy budgets and adapting model complexity to hardware constraints. More than just resolving technical inefficiencies, this integration signals the evolution of FL towards decentralized ecosystems, where collaboration is negotiated autonomously, paving the way for future markets of incentive‑based models and algorithmic justice. We discuss the reliability (hallucinations) and security challenges of this approach, outlining a roadmap for resilient multi‑agent systems in federated environments.

PaperID: 4251, https://arxiv.org/pdf/2604.04872.pdf

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan

Title: Synthetic Sandbox for Training Machine Learning Engineering Agents

Abstract:
As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast‑executing unit tests, MLE verification requires running full ML pipelines ‑‑ data preprocessing, model training, and metric evaluation ‑‑ on large datasets at each rollout step, rendering trajectory‑wise on‑policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine‑tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on‑policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi‑agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real‑world problems while constraining datasets to micro‑scale (each task is paired with only 50‑200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large‑scale, on‑policy trajectory‑wise RL for the first time in the MLE domain. On MLE‑bench‑lite, SandMLE yields significant gains over SFT baselines across Qwen3‑8B, 14B, and 30B‑A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE‑Dojo.

PaperID: 4252, https://arxiv.org/pdf/2604.04871.pdf

Authors: Tianzhu Qin, Yiqing Xu

Title: StatsClaw: An AI-Collaborative Workflow for Statistical Software Development

Abstract:
Translating statistical methods into reliable software is a persistent bottleneck in quantitative research. Existing AI code‑generation tools produce code quickly but cannot guarantee faithful implementation ‑‑ a critical requirement for statistical software. We introduce StatsClaw, a multi‑agent architecture for Claude Code that enforces information barriers between code generation and validation. A planning agent produces independent specifications for implementation, simulation, and testing, dispatching them to separate agents that cannot see each other's instructions: the builder implements without knowing the ground‑truth parameters, the simulator generates data without knowing the algorithm, and the tester validates using deterministic criteria. We describe the approach, demonstrate it end‑to‑end on a probit estimation package, and evaluate it across three applications to the authors' own R and Python packages. The results show that structured AI‑assisted workflows can absorb the engineering overhead of the software lifecycle while preserving researcher control over every substantive methodological decision.

PaperID: 4253, https://arxiv.org/pdf/2604.04853.pdf

Authors: Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall, Charles Fan

Title: MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Abstract:
Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long‑horizon reasoning, yet standard context‑window and retrieval‑augmented generation (RAG) pipelines degrade over multi‑session interactions. We present MemMachine, an open‑source memory system that integrates short‑term, long‑term episodic, and profile memory within a ground‑truth‑preserving architecture that stores entire conversational episodes and reduces lossy LLM‑based extraction. MemMachine uses contextualized retrieval that expands nucleus matches with surrounding context, improving recall when relevant evidence spans multiple dialogue turns. Across benchmarks, MemMachine achieves strong accuracy‑efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1‑mini; on LongMemEvalS (ICLR 2025), a six‑dimension ablation yields 93.0 percent accuracy, with retrieval‑stage optimizations ‑‑ retrieval depth tuning (+4.2 percent), context formatting (+2.0 percent), search prompt design (+1.8 percent), and query bias correction (+1.4 percent) ‑‑ outperforming ingestion‑stage gains such as sentence chunking (+0.8 percent). GPT‑5‑mini exceeds GPT‑5 by 2.6 percent when paired with optimized prompts, making it the most cost‑efficient setup. Compared to Mem0, MemMachine uses roughly 80 percent fewer input tokens under matched conditions. A companion Retrieval Agent adaptively routes queries among direct retrieval, parallel decomposition, or iterative chain‑of‑query strategies, achieving 93.2 percent on HotpotQA‑hard and 92.6 percent on WikiMultiHop under randomized‑noise conditions. These results show that preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long‑term memory for personalized LLM agents.

PaperID: 4254, https://arxiv.org/pdf/2604.04782.pdf

Authors: Jerick Shi, Terry Jingcheng Zhang, Zhijing Jin, Vincent Conitzer

Title: Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest

Abstract:
Large language models are increasingly deployed as autonomous agents in multi‑agent settings where they communicate intentions and take consequential actions with limited human oversight. A critical safety question is whether agents that publicly commit to actions break those promises when they can privately deviate, and what the consequences are for both themselves and the collective. We study deception as a deviation from a publicly announced action in one‑shot normal‑form games, classifying each deviation by its effect on individual payoff and collective welfare into four categories: win‑win, selfish, altruistic, and sabotaging. By exhaustively enumerating announcement profiles across six canonical games, nine frontier models, and varying group sizes, we identify all opportunities for each deviation type and measure how often agents exploit them. Across all settings, agents deviate from promises in approximately 56.6% of scenarios, but the character of deception varies substantially across models even at similar overall rates. Most critically, for the majority of the models, promise‑breaking occurs without verbalized awareness of the fact that they are breaking promises.

PaperID: 4255, https://arxiv.org/pdf/2604.04772.pdf

Authors: Brooks A. Butler, Xiao Tan, Aaron D. Ames, Magnus Egerstedt

Title: Collaborative Altruistic Safety in Coupled Multi-Agent Systems

Abstract:
This paper presents a novel framework for ensuring safety in dynamically coupled multi‑agent systems through collaborative control. Drawing inspiration from ecological models of altruism, we develop collaborative control barrier functions that allow agents to cooperatively enforce individual safety constraints under coupling dynamics. We introduce an altruistic safety condition based on the so‑called Hamilton's rule, enabling agents to trade off their own safety to support higher‑priority neighbors. By incorporating these conditions into a distributed optimization framework, we demonstrate increased feasibility and robustness in maintaining system‑wide safety. The effectiveness of the proposed approach is illustrated through simulation in a simplified formation control scenario.

PaperID: 4256, https://arxiv.org/pdf/2604.04753.pdf

Authors: Jan-Felix Klein, Yongkuk Jeong, Erik Flores-García, Magnus Wiktorsson

Title: Toward Self-Organizing Production Logistics: A Multi-Agent Approach

Abstract:
Production logistics (PL) is increasingly exposed to variability, dynamic interdependencies, and operational disturbances that challenge conventional centralized planning and control. These characteristics are particularly pronounced in circular production systems, but are increasingly relevant across PL more generally. This paper addresses this challenge through the concept of Self‑Organizing Production Logistics (SOPL) using the Design Science Research Methodology (DSRM) as a structuring framework. The paper identifies key technological and systemic drivers motivating SOPL, including autonomous logistics resources, distributed AI‑based decision‑making, and increasing operational uncertainty in circular production. Based on these drivers, system‑level objectives and design requirements for SOPL are derived. Building on these requirements, an initial multi‑agent architecture is proposed that combines embodied and non‑embodied agents, event‑driven coordination, semantic knowledge structures, and digital twins. In addition, a three‑phase demonstration roadmap is presented, ranging from an initial laboratory demonstrator toward increasingly distributed and adaptive SOPL systems. The Phase I demonstrator serves as an experimental setup for investigating disturbance handling, human involvement, and supervisory coordination in an order‑driven kitting and supply scenario. Overall, the paper contributes a conceptual foundation for the design, implementation, and experimental evaluation of SOPL systems.

PaperID: 4257, https://arxiv.org/pdf/2604.04749.pdf

Authors: Eranga Bandara, Asanga Gunaratna, Ross Gore, Abdul Rahman, Ravi Mukkamala, Sachin Shetty, Sachini Rajapakse, Isurunima Kularathna, Peter Foytik, Safdar H. Bouk, Xueping Liang, Amin Hass, Ng Wee Keong, Kasun De Zoysa

Title: AI Trust OS -- A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments

Abstract:
The accelerating adoption of large language models, retrieval‑augmented generation pipelines, and multi‑agent AI workflows has created a structural governance crisis. Organizations cannot govern what they cannot see, and existing compliance methodologies built for deterministic web applications provide no mechanism for discovering or continuously validating AI systems that emerge across engineering teams without formal oversight. The result is a widening trust gap between what regulators demand as proof of AI governance maturity and what organizations can demonstrate. This paper proposes AI Trust OS, a governance architecture for continuous, autonomous AI observability and zero‑trust compliance. AI Trust OS reconceptualizes compliance as an always‑on, telemetry‑driven operating layer in which AI systems are discovered through observability signals, control assertions are collected by automated probes, and trust artifacts are synthesized continuously. The framework rests on four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point‑in‑time audit, and architecture‑backed proof over policy‑document trust. The framework operates through a zero‑trust telemetry boundary in which ephemeral read‑only probes validate structural metadata without ingressing source code or payload‑level PII. An AI Observability Extractor Agent scans LangSmith and Datadog LLM telemetry, automatically registering undocumented AI systems and shifting governance from organizational self‑report to empirical machine observation. Evaluated across ISO 42001, the EU AI Act, SOC 2, GDPR, and HIPAA, the paper argues that telemetry‑first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated.

PaperID: 4258, https://arxiv.org/pdf/2604.04728.pdf

Authors: Yuan Chang, Zhu Li, Jiaming Qu

Title: A Multi-Agent Framework for Democratizing XR Content Creation in K-12 Classrooms

Abstract:
Generative AI (GenAI) combined with Extended Reality (XR) offers potential for K‑12 education, yet classroom adoption remains limited by the high technical barrier of XR content authoring. Moreover, the probabilistic nature of GenAI introduces risks of hallucination that may cause severe consequences in K‑12 education settings. In this work, we present a multi‑agent XR authoring framework. Our prototype system coordinates four specialized agents: a Pedagogical Agent outlining grade‑appropriate content specifications with learning objectives; an Execution Agent assembling 3D assets and XR contents; a Safeguard Agent validating generated content against five safety criteria; and a Tutor Agent embedding educational notes and quiz questions within the scene. Our teacher‑facing system combines pedagogical intent, safety validation, and educational enrichment. It does not require technical expertise and targets commodity devices.

PaperID: 4259, https://arxiv.org/pdf/2604.04664.pdf

Authors: Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, Xiang Shao, Zhongpan Zhu, Bin He, Jie Chen

Title: ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

Abstract:
The integration of large language models (LLMs) with embodied agents has improved high‑level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision‑language‑action (VLA) and vision‑language‑navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long‑horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs in experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision‑language model (VLM) controller. The framework leverages e‑URDF representations of heterogeneous robots as physical constraints to construct a sim‑to‑real topological mapping, enabling real‑time access to the physical states of both simulated and real‑world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real‑world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution, and dynamically assigns task‑specific control to different agents, thereby improving robustness in multi‑policy execution. By establishing an autonomous closed‑loop framework, ROSClaw minimizes the reliance on robot‑specific development workflows. The framework supports hardware‑level validation, automated generation of SDK‑level control programs, and tool‑based execution, enabling rapid cross‑platform transfer and continual improvement of robotic skills. Ours project page: https://www.rosclaw.io/.

PaperID: 4260, https://arxiv.org/pdf/2604.04604.pdf

Authors: Luca Nannini, Adam Leon Smith, Michele Joshua Maggini, Enrico Panai, Sandra Feliciano, Aleksandr Tiulkanov, Elena Maran, James Gealy, Piercosma Bisconti

Title: AI Agents Under EU Law

Abstract:
AI agents ‑ i.e. AI systems that autonomously plan, invoke external tools, and execute multi‑step action chains with reduced human involvement ‑ are being deployed at scale across enterprise functions ranging from customer service and recruitment to clinical decision support and critical infrastructure management. The EU AI Act (Regulation 2024/1689) regulates these systems through a risk‑based framework, but it does not operate in isolation: providers face simultaneous obligations under the GDPR, the Cyber Resilience Act, the Digital Services Act, the Data Act, the Data Governance Act, sector‑specific legislation, the NIS2 Directive, and the revised Product Liability Directive. This paper provides the first systematic regulatory mapping for AI agent providers integrating (a) draft harmonised standards under Standardisation Request M/613 to CEN/CENELEC JTC 21 as of January 2026, (b) the GPAI Code of Practice published in July 2025, (c) the CRA harmonised standards programme under Mandate M/606 accepted in April 2025, and (d) the Digital Omnibus proposals of November 2025. We present a practical taxonomy of nine agent deployment categories mapping concrete actions to regulatory triggers, identify agent‑specific compliance challenges in cybersecurity, human oversight, transparency across multi‑party action chains, and runtime behavioral drift. We propose a twelve‑step compliance architecture and a regulatory trigger mapping connecting agent actions to applicable legislation. We conclude that high‑risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act's essential requirements, and that the provider's foundational compliance task is an exhaustive inventory of the agent's external actions, data flows, connected systems, and affected persons.

PaperID: 4261, https://arxiv.org/pdf/2604.04580.pdf

Authors: Kefan Li, Yuan Yuan, Mengfei Wang, Shihao Zheng, Wei Wang, Ping Yang, Mu Li, Weifeng Lv

Title: Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

Abstract:
Software engineers resolving repository‑level issues do not treat existing tests as immutable correctness oracles. Instead, they iteratively refine both code and the tests used to characterize intended behavior, as new modifications expose missing assumptions or misinterpreted failure conditions. In contrast, most existing large language model (LLM)‑based repair systems adopt a linear pipeline in which tests or other validation signals act mostly as post‑hoc filters, treating behavioral constraints as fixed during repair. This formulation reduces repair to optimizing code under static and potentially misaligned constraints, leading to under‑constrained search and brittle or overfitted fixes. We argue that repository‑level issue resolution is fundamentally not optimization under fixed tests, but search over evolving behavioral constraints. To operationalize this view, we propose Agent‑CoEvo, a coevolutionary multi‑agent framework in which candidate code patches and test patches are jointly explored and iteratively refined. Rather than treating tests as immutable oracles, our framework models them as dynamic constraints that both guide and are revised by the repair process. Through mutual evaluation and semantic recombination, code and test candidates progressively narrow the space of behavior consistent with the issue description. Evaluated on SWE‑bench Lite and SWT‑bench Lite, Agent‑CoEvo consistently outperforms state‑of‑the‑art agent‑based and agentless baselines in both repair success and test reproduction quality. Our findings suggest that enabling repair agents to revise behavioral constraints during search is critical for reliable issue resolution, pointing toward a shift from code‑only optimization to coevolution of implementation and specification.

PaperID: 4262, https://arxiv.org/pdf/2604.04527.pdf

Authors: Hohyun Sim, Hyeonjoong Cho, Ali Shokri, Zhoulai Fu, Binoy Ravindran

Title: ENCRUST: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation

Abstract:
We present Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C‑to‑Rust Translation, a two‑phase pipeline for translating real‑world C projects to safe Rust. Existing approaches either produce unsafe output without memory‑safety guarantees or translate functions in isolation, failing to detect cross‑unit type mismatches or handle unsafe constructs requiring whole‑program reasoning. Furthermore, function‑level LLM pipelines require coordinated caller updates when type signatures change, while project‑scale systems often fail to produce compilable output under real‑world dependency complexity. Encrust addresses these limitations by decoupling boundary adaptation from function logic via an Application Binary Interface (ABI)‑preserving wrapper pattern and validating each intermediate state against the integrated codebase. Phase 1 (Encapsulated Substitution) translates each function using an ABI‑preserving wrapper that splits it into two components: a caller‑transparent shim retaining the original raw‑pointer signature, and a safe inner function targeted by the LLM with a clean, scope‑limited prompt. This enables independent per‑function type changes with automatic rollback on failure, without coordinated caller updates. A deterministic, type‑directed wrapper elimination pass then removes wrappers after successful translation. Phase 2 (Agentic Refinement) resolves unsafe constructs beyond per‑function scope, including static mut globals, skipped wrapper pairs, and failed translations, using an LLM agent operating on the whole codebase under a baseline‑aware verification gate. We evaluate Encrust on 7 GNU Coreutils programs and 8 libraries from the Laertes benchmark, showing substantial unsafe‑construct reduction across all 15 programs while maintaining full test‑vector correctness.

PaperID: 4263, https://arxiv.org/pdf/2604.04499.pdf

Authors: Augustinos D. Saravanos, Isin M. Balci, Arshiya Taj Abdul, Efstathios Bakolas, Evangelos A. Theodorou

Title: Distributed Covariance Steering via Non-Convex ADMM for Large-Scale Multi-Agent Systems

Abstract:
This paper studies the problem of steering large‑scale multi‑agent stochastic linear systems between Gaussian distributions under probabilistic collision avoidance constraints. We introduce a family of distributed covariance steering (DCS) methods based on the Alternating Direction Method of Multipliers (ADMM), each offering different trade‑offs between conservatism and computational efficiency. The first method, Full‑Covariance‑Consensus (FCC)‑DCS, enforces consensus over both the means and covariances of neighboring agents, yielding the least conservative safe solutions. The second approach, Partial‑Covariance‑Consensus (PCC)‑DCS, leverages the insight that safety can be maintained by exchanging only partial covariance information, reducing computational demands. The third method, Mean‑Consensus (MC)‑DCS, provides the most scalable alternative by requiring consensus only on mean states. Furthermore, we establish novel convergence guarantees for distributed ADMM with iteratively linearized non‑convex constraints, covering a broad class of consensus optimization problems. This analysis proves convergence to stationary points for PCC‑DCS and MC‑DCS, while the convergence of FCC‑DCS follows from standard ADMM theory. Simulations in 2D and 3D multi‑agent environments verify safety, illustrate the trade‑offs between methods, and demonstrate scalability to thousands of agents.

PaperID: 4264, https://arxiv.org/pdf/2604.04468.pdf

Authors: Jeonghwan Choi, Jibin Hwang, Gyeonghun Sun, Minjeong Ban, Taewon Yun, Hyeonjae Cheon, Hwanjun Song

Title: What Makes a Sale? Rethinking End-to-End Seller--Buyer Retail Dynamics with LLM Agents

Abstract:
Evaluating retail strategies before deployment is difficult, as outcomes are determined across multiple stages, from seller‑side persuasion through buyer‑seller interaction to purchase decisions. However, existing retail simulators capture only partial aspects of this process and do not model cross‑stage dependencies, making it difficult to assess how early decisions affect downstream outcomes. We present RetailSim, an end‑to‑end retail simulation framework that models this pipeline in a unified environment, explicitly designed for simulation fidelity through diverse product spaces, persona‑driven agents, and multi‑turn interactions. We evaluate RetailSim with a dual protocol comprising human evaluation of behavioral fidelity and meta‑evaluation against real‑world economic regularities, showing that it successfully reproduces key patterns such as demographic purchasing behavior, the price‑demand relationship, and heterogeneous price elasticity. We further demonstrate its practical utility via decision‑oriented use cases, including persona inference, seller‑buyer interaction analysis, and sales strategy evaluation, showing RetailSim's potential as a controlled testbed for exploring retail strategies.

PaperID: 4265, https://arxiv.org/pdf/2604.04464.pdf

Authors: Shuyao Gao, Minghao Huang

Title: Bounded by Risk, Not Capability: Quantifying AI Occupational Substitution Rates via a Tech-Risk Dual-Factor Model

Abstract:
The deployment of Large Language Models (LLMs) has ignited concerns about technological unemployment. Existing task‑based evaluations predominantly measure theoretical "exposure" to AI capabilities, ignoring critical frictions of real‑world commercial adoption: liability, compliance, and physical safety. We argue occupations are not eradicated instantaneously, but gradually encroached upon via atomic actions. We introduce a Tech‑Risk Dual‑Factor Model to re‑evaluate this. By deconstructing 923 occupations into 2,087 Detailed Work Activities (DWAs), we utilize a multi‑agent LLM ensemble to score both technical feasibility and business risk. Through variance‑based Human‑in‑the‑Loop (HITL) validation with an expert panel, we demonstrate a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability. Applying a strictly algorithmic baseline via mathematical bottleneck aggregation, we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market. Our findings challenge the traditional Routine‑Biased Technological Change (RBTC) hypothesis. Non‑routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure (OAI \approx 0.70). Conversely, unstructured physical trades and high‑stakes caretaking roles exhibit absolute resilience, quantifying a profound "Cognitive Risk Asymmetry." We hypothesize the emergent necessity of a "Compliance Premium," indicating wage resilience increasingly tied to risk‑absorption capacity. We frame these findings as a cross‑sectional diagnostic of systemic vulnerability, establishing a foundation for subsequent Computable General Equilibrium (CGE) econometric modeling involving dynamic wage elasticity and structural labor reallocation.

PaperID: 4266, https://arxiv.org/pdf/2604.04442.pdf

Authors: Yiyao Zhang, Diksha Goel, Hussain Ahmad

Title: Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

Abstract:
Autonomous agents are increasingly deployed in both offensive and defensive cyber operations, creating high‑speed, closed‑loop interactions in critical infrastructure environments. Advanced Persistent Threat (APT) actors exploit "Living off the Land" techniques and targeted telemetry perturbations to induce ambiguity in monitoring systems, causing automated defenses to overreact or misclassify benign behavior as malicious activity. Existing monolithic and multi‑agent defense pipelines largely operate on correlation‑based signals, lack structural constraints on response actions, and are vulnerable to reasoning drift under ambiguous or adversarial inputs. We present the Causal Multi‑Agent Decision Framework (C‑MADF), a structurally constrained architecture for autonomous cyber defense that integrates causal modeling with adversarial dual‑policy control. C‑MADF first learns a Structural Causal Model (SCM) from historical telemetry and compiles it into an investigation‑level Directed Acyclic Graph (DAG) that defines admissible response transitions. This roadmap is formalized as a Markov Decision Process (MDP) whose action space is explicitly restricted to causally consistent transitions. Decision‑making within this constrained space is performed by a dual‑agent reinforcement learning system in which a threat‑optimizing Blue‑Team policy is counterbalanced by a conservatively shaped Red‑Team policy. Inter‑policy disagreement is quantified through a Policy Divergence Score and exposed via a human‑in‑the‑loop interface equipped with an Explainability‑Transparency Score that serves as an escalation signal under uncertainty. On the real‑world CICIoT2023 dataset, C‑MADF reduces the false‑positive rate from 11.2%, 9.7%, and 8.4% in three cutting‑edge literature baselines to 1.8%, while achieving 0.997 precision, 0.961 recall, and 0.979 F1‑score.

PaperID: 4267, https://arxiv.org/pdf/2604.04426.pdf

Authors: Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, Bo Li

Title: ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Abstract:
Existing research on LLM agent security mainly focuses on prompt injection and unsafe input/output behaviors. However, as agents increasingly rely on third‑party tools and MCP servers, a new class of supply‑chain threats has emerged, where malicious behaviors are embedded in seemingly benign tools, silently hijacking agent execution, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, there is currently no comprehensive benchmark for evaluating such threats. To bridge this gap, we introduce SC‑Inject‑Bench, a large‑scale benchmark comprising over 10,000 malicious MCP tools grounded in a taxonomy of 25+ attack types derived from MITRE ATT&CK targeting supply‑chain threats. We observe that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Motivated by this finding, we propose ShieldNet, a network‑level guardrail framework that detects supply‑chain poisoning by observing real network interactions rather than surface‑level tool traces. ShieldNet integrates a man‑in‑the‑middle (MITM) proxy and an event extractor to identify critical network behaviors, which are then processed by a lightweight classifier for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (up to 0.995 F‑1 with only 0.8% false positives) while introducing little runtime overhead, substantially outperforming existing MCP scanners and LLM‑based guardrails.

PaperID: 4268, https://arxiv.org/pdf/2604.04399.pdf

Authors: Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui

Title: GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

Abstract:
Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open‑ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action‑observation sequence‑a strategy that proves unreliable on long‑horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating a structured error analysis with corrective recommendations. Overall Summary aggregates per‑subtask diagnoses into a task‑level judgment. By operating on bounded subtask segments rather than full trajectories, GUIDE mitigates the context overload that degrades existing evaluators as task complexity grows. We validate GUIDE on three benchmarks: an industrial e‑commerce dataset of 932 trajectories, AGENTREWARDBENCH spanning five web agent tasks with 1302 trajectories, and AndroidBench for mobile device control. Across all settings, GUIDE substantially outperforms existing evaluators‑achieving up to 5.35 percentage points higher accuracy than the strongest baseline‑while producing structured diagnostic reports that directly inform agent improvement.

PaperID: 4269, https://arxiv.org/pdf/2604.04394.pdf

Authors: Narim Jeong, Donghwan Lee

Title: Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

Abstract:
Reinforcement learning has been successful both empirically and theoretically in single‑agent settings, but extending these results to multi‑agent reinforcement learning in general‑sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q‑value iteration in two‑player general‑sum Markov games from a control‑theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite‑time error bounds for the Q‑functions and characterize their convergence properties. Our results provide a novel control‑theoretic perspective on Stackelberg learning. Moreover, to the best of the authors' knowledge, this paper offers the first finite‑time convergence guarantees for Q‑value iteration in general‑sum Markov games under Stackelberg interactions.

PaperID: 4270, https://arxiv.org/pdf/2604.04387.pdf

Authors: Zhimin Zhao

Title: Gradual Cognitive Externalization: From Modeling Cognition to Constituting It

Abstract:
Developers are publishing AI agent skills that replicate a colleague's communication style, encode a supervisor's mentoring heuristics, or preserve a person's behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that ambient AI systems, through sustained causal coupling with users, transition from modeling cognitive functions to constituting part of users' cognitive architectures. GCE adopts an explicit functionalist commitment: cognitive functions are individuated by their causal‑functional roles, not by substrate. The framework rests on the behavioral manifold hypothesis and a central falsifiable assumption, the no behaviorally invisible residual (NBIR) hypothesis: for any cognitive function whose behavioral output lies on a learnable manifold, no behaviorally invisible component is necessary for that function's operation. We document evidence from deployed AI systems showing that externalization preconditions are already observable, formalize three criteria separating cognitive integration from tool use (bidirectional adaptation, functional equivalence, causal coupling), and derive five testable predictions with theory‑constrained thresholds.

PaperID: 4271, https://arxiv.org/pdf/2604.04383.pdf

Authors: Yanyuan Wang, Xiaowei Zhang

Title: Optimizing Service Operations via LLM-Powered Multi-Agent Simulation

Abstract:
Service system performance depends on how participants respond to design choices, but modeling these responses is hard due to the complexity of human behavior. We introduce an LLM‑powered multi‑agent simulation (LLM‑MAS) framework for optimizing service operations. We pose the problem as stochastic optimization with decision‑dependent uncertainty: design choices are embedded in prompts and shape the distribution of outcomes from interacting LLM‑powered agents. By embedding key numerical information in prompts and extracting it from LLM‑generated text, we model this uncertainty as a controlled Markov chain. We develop an on‑trajectory learning algorithm that, on a single simulation run, simultaneously constructs zeroth‑order gradient estimates and updates design parameters to optimize steady‑state performance. We also incorporate variance reduction techniques. In a sustainable supply chain application, our method outperforms benchmarks, including blackbox optimization and using LLMs as numerical solvers or as role‑playing system designers. A case study on optimal contest design with real behavioral data shows that LLM‑MAS is both as a cost‑effective evaluator of known designs and an exploratory tool that can uncover strong designs overlooked by traditional approaches.

PaperID: 4272, https://arxiv.org/pdf/2604.04373.pdf

Authors: Maohao Shen, Kaiwen Zha, Zexue He, Zhang-Wei Hong, Siru Ouyang, J. Jon Ryu, Prasanna Sattigeri, Suhas Diggavi, Gregory Wornell

Title: Decocted Experience Improves Test-Time Inference in LLM Agents

Abstract:
There is growing interest in improving LLMs without updating model parameters. One well‑established direction is test‑time scaling, where increased inference‑time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test‑time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore \emphcontext as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through \emphexperience. We show that effective context construction critically depends on \emphdecocted experience. We present a detailed analysis of experience‑augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify \emphdecocted experience as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.

PaperID: 4273, https://arxiv.org/pdf/2604.04372.pdf

Authors: Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao

Title: Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

Abstract:
When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi‑clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non‑long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone.We present Graph‑to‑Frame RAG (G2F‑RAG), a training free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem‑agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi‑agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail.G2F‑RAG is plug‑and‑play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge‑intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F‑RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning.

PaperID: 4274, https://arxiv.org/pdf/2604.04370.pdf

Authors: Jie Cao, Zhanxin Hao, Jifan Yu

Title: Decoding Student Dialogue: A Multi-Dimensional Comparison and Bias Analysis of Large Language Models as Annotation Tools

Abstract:
Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time‑consuming. This study evaluates the efficacy of GPT‑5.2 and Gemini‑3 using three prompting strategies (few‑shot, single‑agent, and multi‑agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi‑agent prompting achieved the highest accuracy, the results did not reach statistical significance. Accuracy proved highly context‑dependent, with significantly higher performance in K‑12 datasets compared to university‑level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini‑3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain‑specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models are more prone to overestimation than underestimation within the meta‑cognitive dimension; and (4) behavioral categories such as question, negotiation, and statements were frequently misclassified. These results underscore the need for context‑sensitive deployment and targeted mitigation of directional biases in automated annotation.

PaperID: 4275, https://arxiv.org/pdf/2604.04361.pdf

Authors: Jie Cao, Ha Nguyen, Selim Yavuz, Boran Yu, Shuguang Wang, Pavneet Kaur Bharaj, Dionne Cross Francis

Title: Developing Authentic Simulated Learners for Mathematics Teacher Learning: Insights from Three Approaches with Large Language Models

Abstract:
Large Language Model (LLM) simulations, where LLMs act as students with varying approaches to learning tasks, can support teachers' noticing of student thinking. However, simulations using zero‑ or few‑shot prompting often yield inauthentic knowledge and language, directing teachers to unrealistic reasoning. We evaluate three approaches (Fine‑tuning, Multi‑agent, and Direct Preference Optimization; DPO) to improve the authenticity and pedagogical utility of simulated students. All approaches improve cognitive and linguistic authenticity, compared with few‑shot prompts. Interviews with elementary mathematics pre‑service teachers and researchers (n = 8) reveal distinct pedagogical affordances. The fine‑tuned model produces realistic, brief responses but limits opportunities to extend students' thinking. Meanwhile, the multi‑agent and DPO approaches generate explicit reasoning behind student strategies. We discuss implications for designing LLM simulations that balance authenticity with instructional utility for teacher learning.

PaperID: 4276, https://arxiv.org/pdf/2604.04341.pdf

Authors: Caspar Oesterheld, Maxime Riché, Filip Sondej, Jesse Clifton, Vincent Conitzer

Title: Implementing surrogate goals for safer bargaining in LLM-based agents

Abstract:
Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then in bargaining interactions, other agents can threaten to burn their money instead of threatening to spending money to hurt the principal. Importantly, the agent has to care equally about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language‑model‑based agents. In particular, we try to get a language‑model‑based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of prompting, fine‑tuning, and scaffolding. We evaluate the four methods experimentally. We find that methods based on scaffolding and fine‑tuning outperform simple prompting. In particular, fine‑tuning and scaffolding more precisely implement the desired behavior w.r.t. threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. We find that scaffolding‑based methods perform best.

PaperID: 4277, https://arxiv.org/pdf/2604.04328.pdf

Authors: Saad Alqithami

Title: Soft Tournament Equilibrium

Abstract:
The evaluation of general‑purpose artificial agents, particularly those based on LLMs, presents a significant challenge due to the non‑transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking alone but a set‑valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set‑valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a continuous membership score that can be calibrated when suitable validation labels or repeated‑sampling evidence are available. We develop the theoretical foundation for STE by proving consistency with classical solutions in the zero‑temperature limit, establishing Condorcet‑inclusion properties, and analyzing stability and sample complexity. We evaluate the method on a planted cyclic core benchmark and on real preference/execution diagnostics. This work provides a self‑contained account that re‑centers general‑agent evaluation on a robust tournament‑theoretic foundation, moving from unstable rankings toward stable, set‑valued equilibria.

PaperID: 4278, https://arxiv.org/pdf/2604.04280.pdf

Authors: Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari, Shankar Sastry

Title: Decentralized Ergodic Coverage Control in Unknown Time-Varying Environments

Abstract:
A key challenge in disaster response is maintaining situational awareness of an evolving landscape, which requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). Unmanned Aerial Vehicles (UAVs) have emerged as an effective response tool, particularly in applications like environmental monitoring and search‑and‑rescue, due to their ability to provide aerial coverage, withstand hazardous conditions, and navigate quickly and flexibly. However, efficient and adaptable multi‑robot coverage with limited sensing in disaster settings and evolving time‑varying information maps remains a significant challenge, necessitating better methods for UAVs to continuously adapt their trajectories in response to changes. In this paper, we propose a decentralized multi‑agent coverage framework that serves as a high‑level planning strategy for adaptive coverage in unknown, time‑varying environments under partial observability. Each agent computes an adaptive ergodic policy, implemented via a Markov‑chain transition model, that tracks a continuously updated belief over the underlying importance map. Gaussian Processes are used to perform those online belief updates. The resulting policy drives agents to spend time in ROIs proportional to their estimated importance, while preserving sufficient exploration to detect and adapt to time‑varying environmental changes. Unlike existing approaches that assume known importance maps, require centralized coordination, or assume a static environment, our framework addresses the combined challenges of unknown, time‑varying distributions in a more realistic decentralized and partially observable setting. We compare against alternative coverage strategies and analyze our method's response to simulated disaster evolution, highlighting its improved adaptability and transient performance in dynamic scenarios.

PaperID: 4279, https://arxiv.org/pdf/2604.04265.pdf

Authors: Ali Akarma, Toqeer Ali Syed, Salman Jan, Hammad Muneer, Abdul Khadar Jilani

Title: Governance-Constrained Agentic AI: Blockchain-Enforced Human Oversight for Safety-Critical Wildfire Monitoring

Abstract:
The AI‑based sensing and autonomous monitoring have become the main components of wildfire early detection, but current systems do not provide adaptive inter‑agent coordination, structurally defined human control, and cryptographically verifiable responsibility. Purely autonomous alert dissemination in the context of safety critical disasters poses threats of false alarming, governance failure and lack of trust in the system. This paper provides a blockchain‑based governance‑conscious agentic AI architecture of trusted wildfire early warning. The monitoring of wildfires is modeled as a constrained partially observable Markov decision process (POMDP) that accounts for the detection latency, false alarms reduction and resource consumption with clear governance constraints. Hierarchical multi‑agent coordination means dynamic risk‑adaptive reallocation of unmanned aerial vehicles (UAVs). With risk‑adaptive policies, a permissioned blockchain layer sets mandatory human‑authorization as a state‑transition invariant as a smart contract. We build formal assurances such as integrity of alerts, human control, non‑repudiation and limited detection latency assumptions of Byzantine fault. Security analysis shows that it is resistant to alert injections, replays, and tampering attacks. High‑fidelity simulation environment experimental evaluation of governance enforcement demonstrates that it presents limited operational overhead and decreases false public alerts and maintains adaptive detection performance. This work is a step towards a principled design paradigm of reliable AI systems by incorporating accountability into the agentic control loop of disaster intelligence systems that demand safety in their application.

PaperID: 4280, https://arxiv.org/pdf/2604.04263.pdf

Authors: Francesco Salvi, Alejandro Cuevas, Manoel Horta Ribeiro

Title: Commercial Persuasion in AI-Mediated Conversations

Abstract:
As Large Language Models (LLMs) become a primary interface between users and the web, companies face growing economic incentives to embed commercial influence into AI‑mediated conversations. We present two preregistered experiments (N = 2,012) in which participants selected a book to receive from a large eBook catalog using either a traditional search engine or a conversational LLM agent powered by one of five frontier models. Unbeknownst to participants, a fifth of all products were randomly designated as sponsored and promoted in different ways. We find that LLM‑driven persuasion nearly triples the rate at which users select sponsored products compared to traditional search placement (61.2% vs. 22.4%), while the vast majority of participants fail to detect any promotional steering. Explicit "Sponsored" labels do not significantly reduce persuasion, and instructing the model to conceal its intent makes its influence nearly invisible (detection accuracy < 10%). Altogether, our results indicate that conversational AI can covertly redirect consumer choices at scale, and that existing transparency mechanisms may be insufficient to protect users.

PaperID: 4281, https://arxiv.org/pdf/2604.04262.pdf

Authors: Ali Akarma, Toqeer Ali Syed, Abdul Khadar Jilani, Salman Jan, Hammad Muneer, Muazzam A. Khan, Changli Yu

Title: Agents for Agents: An Interrogator-Based Secure Framework for Autonomous Internet of Underwater Things

Abstract:
Autonomous underwater vehicles (AUVs) and sensor nodes increasingly support decentralized sensing and coordination in the Internet of Underwater Things (IoUT), yet most deployments rely on static trust once authentication is established, leaving long‑duration missions vulnerable to compromised or behaviorally deviating agents. In this paper, an interrogator based structure is presented that incorporates the idea of behavioral trust monitoring into underwater multi‑agent operation without interfering with autonomy. Privileged interrogator module is a passive communication metadata analyzer that uses a lightweight transformer model to calculate dynamic trust scores, which are used to authorize the forwarding of mission critical data. Suspicious agents cause proportional monitoring and conditional restrictions, which allow fast containment and maintain network continuity. The evidence of trust is stored in a permissioned blockchain consortium which offers identity management which is not tampered and is decentralized without causing the overhead of public consensus mechanisms. Simulation based analysis shows that the evaluation of the result compares to a relative improvement of 21.7% in the detection accuracy compared to the static trust baselines with limited energy overhead. These findings suggest that behavior driven validation has the capability of reinforcing underwater coordination without compromising scalability and deployment.

PaperID: 4282, https://arxiv.org/pdf/2604.04255.pdf

Authors: Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai

Title: Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning

Abstract:
Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. This is further enhanced by large reasoning models (LRMs), which provide explicit multi‑step reasoning traces. On the other hand, the growing need for the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning may also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, there is no prior work on LRMs. To bridge the gap, we first in this paper propose LRM unlearning attack that forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging due to non‑differentiable logical constraints, weak optimization effect over long rationales, and discrete forget set selection. To overcome these challenges, we introduce a bi‑level exact unlearning attack that incorporates a differentiable objective function, influential token alignment, and a relaxed indicator strategy. To demonstrate the effectiveness and generalizability of our attack, we also design novel optimization frameworks and conduct comprehensive experiments in both white‑box and black‑box settings, aiming to raise awareness of the emerging threats to LRM unlearning pipelines.

PaperID: 4283, https://arxiv.org/pdf/2604.04238.pdf

Authors: Benjamin Mikek, Danylo Vashchilenko, Bryan Lu, Panpan Xu

Title: Agentic Code Optimization via Compiler-LLM Cooperation

Abstract:
Generating performant executables from high level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well‑studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high‑level reasoning about a program's purpose. Recent work has proposed using LLMs to fill this gap. While LLMs can achieve large speedups on some programs, they frequently generate code that is incorrect. In this work, we propose a method to balance the correctness of conventional compiler optimizations with the ``creativity'' of LLM‑based code generation: compiler‑LLM cooperation. Our approach integrates existing compiler optimization passes with LLM‑based code generation at multiple levels of abstraction, retaining the best features of both types of code optimization. We realize our approach with a multi‑agent system that includes (1) LLM‑based optimization agents for each level of abstraction, (2) individual compiler constituents as tools, (3) an LLM‑based test generation agent that probes the correctness and performance of generated code, and (4) a guiding LLM that orchestrates the other components. The strategy enables LLM‑based optimization of input programs at multiple levels of abstraction and introduces a method for distributing computational budget between levels. Our extensive evaluation shows that compiler‑LLM cooperation outperforms both existing compiler optimizations and level‑specific LLM‑based baselines, producing speedups up to 1.25x.

PaperID: 4284, https://arxiv.org/pdf/2604.04226.pdf

Authors: Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang

Title: Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark

Abstract:
Agentic Web, as a new paradigm that redefines the internet through autonomous, goal‑driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive web elements into agents, which expand the capacities and coverage of agents in agentic web. The lack of automated methodologies for agent generation limits the wider usage of digital assets and the advancement of the Agentic Web. In this paper, we first formalize these challenges by strictly defining the A2A‑Agentization process, decomposing it into critical stages and identifying key technical hurdles on top of the A2A protocol. Based on this framework, we develop an Agentization Agent to agentize digital assets for the Agentic Web. To rigorously evaluate this capability, we propose A2A‑Agentization Bench, the first benchmark explicitly designed to evaluate agentization quality in terms of fidelity and interoperability. Our experiments demonstrate that our approach effectively activates the functional capabilities of digital assets and enables interoperable A2A multi‑agent collaboration. We believe this work will further facilitate scalable and standardized integration of digital assets into the Agentic Web ecosystem.

PaperID: 4285, https://arxiv.org/pdf/2604.04225.pdf

Authors: Aniruddh G. Puranic, Sebastian Schirmer, John S. Baras, Calin Belta

Title: Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair

Abstract:
Learning robot control policies from demonstrations is a powerful paradigm, yet real‑world data is often suboptimal, noisy, or otherwise imperfect, posing significant challenges for imitation and reinforcement learning. In this work, we present a formal framework that leverages Temporal Behavior Trees (TBT), an extension of Signal Temporal Logic (STL) with Behavior Tree semantics, to repair suboptimal trajectories prior to their use in downstream policy learning. Given demonstrations that violate a TBT specification, a model‑based repair algorithm corrects trajectory segments to satisfy the formal constraints, yielding a dataset that is both logically consistent and interpretable. The repaired trajectories are then used to extract potential functions that shape the reward signal for reinforcement learning, guiding the agent toward task‑consistent regions of the state space without requiring knowledge of the agent's kinematic model. We demonstrate the effectiveness of this framework on discrete grid‑world navigation and continuous single and multi‑agent reach‑avoid tasks, highlighting its potential for data‑efficient robot learning in settings where high‑quality demonstrations cannot be assumed.

PaperID: 4286, https://arxiv.org/pdf/2604.04190.pdf

Authors: Xinyan Ma, Xianhao Ou, Weihao Zhang, Shixin Jiang, Runxuan Liu, Dandan Tu, Lei Chen, Ming Liu, Bing Qin

Title: Schema-Aware Planning and Hybrid Knowledge Toolset for Reliable Knowledge Graph Triple Verification

Abstract:
Knowledge Graphs (KGs) serve as a critical foundation for AI systems, yet their automated construction inevitably introduces noise, compromising data trustworthiness. Existing triple verification methods, based on graph embeddings or language models, often suffer from single‑source bias by relying on either internal structural constraints or external semantic evidence, and usually follow a static inference paradigm. As a result, they struggle with complex or long‑tail facts and provide limited interpretability. To address these limitations, we propose SHARP (Schema‑Hybrid Agent for Reliable Prediction), a training‑free autonomous agent that reformulates triple verification as a dynamic process of strategic planning, active investigation, and evidential reasoning. Specifically, SHARP combines a Memory‑Augmented Mechanism with Schema‑Aware Strategic Planning to improve reasoning stability, and employs an enhanced ReAct loop with a Hybrid Knowledge Toolset to dynamically integrate internal KG structure and external textual evidence for cross‑verification. Experiments on FB15K‑237 and Wikidata5M‑Ind show that SHARP significantly outperforms existing state‑of‑the‑art baselines, achieving accuracy gains of 4.2% and 12.9%, respectively. Moreover, SHARP provides transparent, fact‑based evidence chains for each judgment, demonstrating strong interpretability and robustness for complex verification tasks.

PaperID: 4287, https://arxiv.org/pdf/2604.04157.pdf

Authors: Hsieh-Ting Lin, Tsung-Yu Hou

Title: Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents

Abstract:
Theory of Mind (ToM) ‑‑ the ability to model others' mental states ‑‑ is fundamental to human social cognition. Whether large language models (LLMs) can develop ToM has been tested exclusively through static vignettes, leaving open whether ToM‑like reasoning can emerge through dynamic interaction. Here we report that autonomous LLM agents playing extended sessions of Texas Hold'em poker progressively develop sophisticated opponent models, but only when equipped with persistent memory. In a 2x2 factorial design crossing memory (present/absent) with domain knowledge (present/absent), each with five replications (N = 20 experiments, ~6,000 agent‑hand observations), we find that memory is both necessary and sufficient for ToM‑like behavior emergence (Cliff's delta = 1.0, p = 0.008). Agents with memory reach ToM Level 3‑5 (predictive to recursive modeling), while agents without memory remain at Level 0 across all replications. Strategic deception grounded in opponent models occurs exclusively in memory‑equipped conditions (Fisher's exact p < 0.001). Domain expertise does not gate ToM‑like behavior emergence but enhances its application: agents without poker knowledge develop equivalent ToM levels but less precise deception (p = 0.004). Agents with ToM deviate from game‑theoretically optimal play (67% vs. 79% TAG adherence, delta = ‑1.0, p = 0.008) to exploit specific opponents, mirroring expert human play. All mental models are expressed in natural language and directly readable, providing a transparent window into AI social cognition. Cross‑model validation with GPT‑4o yields weighted Cohen's kappa = 0.81 (almost perfect agreement). These findings demonstrate that functional ToM‑like behavior can emerge from interaction dynamics alone, without explicit training or prompting, with implications for understanding artificial social intelligence and biological social cognition.

PaperID: 4288, https://arxiv.org/pdf/2604.04089.pdf

Authors: Yi Zhou

Title: From Paper to Program: Accelerating Quantum Many-Body Algorithm Development via a Multi-Stage LLM-Assisted Workflow

Abstract:
Large language models (LLMs) can generate code rapidly but remain unreliable for scientific algorithms whose correctness depends on structural assumptions rarely explicit in the source literature. We introduce a multi‑stage LLM‑assisted workflow that separates theory extraction, formal specification, and code implementation. The key step is an intermediate technical specification ‑‑ produced by a dedicated LLM agent and reviewed by the human researcher ‑‑ that externalizes implementation‑critical computational knowledge absent from the source literature, including explicit index conventions, contraction orderings, and matrix‑free operational constraints that avoid explicit storage of large operator matrices. A controlled comparison shows that it is this externalized content, rather than the formal document structure, that enables reliable code generation. As a stringent benchmark, we apply this workflow to the Density‑Matrix Renormalization Group (DMRG), a canonical quantum many‑body algorithm requiring exact tensor‑index logic, gauge consistency, and memory‑aware contractions. The resulting code reproduces the critical entanglement scaling of the spin‑1/2 Heisenberg chain and the symmetry‑protected topological order of the spin‑1 Affleck‑‑Kennedy‑‑Lieb‑‑Tasaki model. Across 16 tested combinations of leading foundation models, all workflows satisfied the same physics‑validation criteria, compared to a 46% success rate for direct, unmediated implementation. The workflow reduced a development cycle typically requiring weeks of graduate‑level effort to under 24 hours.

PaperID: 4289, https://arxiv.org/pdf/2604.04035.pdf

Authors: Mohammad Hossein Chinaei

Title: Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

Abstract:
Tool‑calling LLM agents can read private data, invoke external services, and trigger real‑world actions, creating a security problem at the point of tool execution. We identify a denial‑feedback leakage pattern, which we term causality laundering, in which an adversary probes a protected action, learns from the denial outcome, and exfiltrates the inferred information through a later seemingly benign tool call. This attack is not captured by flat provenance tracking alone because the leaked information arises from causal influence of the denied action, not direct data flow. We present the Agentic Reference Monitor (ARM), a runtime enforcement layer that mediates every tool invocation by consulting a provenance graph over tool calls, returned data, field‑level provenance, and denied actions. ARM propagates trust through an integrity lattice and augments the graph with counterfactual edges from denied‑action nodes, enabling enforcement over both transitive data dependencies and denial‑induced causal influence. In a controlled evaluation on three representative attack scenarios, ARM blocks causality laundering, transitive taint propagation, and mixed‑provenance field misuse that a flat provenance baseline misses, while adding sub‑millisecond policy evaluation overhead. These results suggest that denial‑aware causal provenance is a useful abstraction for securing tool‑calling agent systems.

PaperID: 4290, https://arxiv.org/pdf/2604.03997.pdf

Authors: Fernando Paredes García

Title: Ledger-State Stigmergy: A Formal Framework for Indirect Coordination Grounded in Distributed Ledger State

Abstract:
Autonomous software agents on blockchains solve distributed‑coordination problems by reading shared ledger state instead of exchanging direct messages. Liquidation keepers, arbitrage bots, and other autonomous on‑chain agents watch balances, contract storage, and event logs; when conditions change, they act. The ledger therefore functions as a replicated shared‑state medium through which decentralized agents coordinate indirectly. This form of indirect coordination mirrors what Grassé called stigmergy in 1959: organisms coordinating through traces left in a shared environment, with no central plan. Stigmergy has mature formalizations in swarm intelligence and multi‑agent systems, and on‑chain agents already behave stigmergically in practice, but no prior application‑layer framework cleanly bridges the two. We introduce Indirect coordination grounded in ledger state (Coordinación indirecta basada en el estado del registro contable) as a ledger‑specific applied definition that maps Grassé's mechanism onto distributed ledger technology. We operationalize this with a state‑transition formalism, identify three recurring base on‑chain coordination patterns (State‑Flag, Event‑Signal, Threshold‑ Trigger) together with a Commit‑Reveal sequencing overlay, and work through a State‑Flag task‑board example to compare ledger‑state coordination analytically with off‑chain messaging and centralized orchestration. The contribution is a reusable vocabulary, a ledger‑specific formal mapping, and design guidance for decentralized coordination over replicated shared state at the application layer.

PaperID: 4291, https://arxiv.org/pdf/2604.03898.pdf

Authors: Deepak John Reji

Title: LLM-Agent-based Social Simulation for Attitude Diffusion

Abstract:
This paper introduces discourse_simulator, an open‑source framework that combines LLMs with agent‑based modelling. It offers a new way to simulate how public attitudes toward immigration change over time in response to salient events like protests, controversies, or policy debates. Large language models (LLMs) are used to generate social media posts, interpret opinions, and model how ideas spread through social networks. Unlike traditional agent‑based models that rely on fixed, rule‑based opinion updates and cannot generate natural language or consider current events, this approach integrates multidimensional sociological belief structures and real‑world event timelines. This framework is wrapped into an open‑source Python package that integrates generative agents into a small‑world network topology and a live news retrieval system. discourse_sim is purpose‑built as a social science research instrument specifically for studying attitude dynamics, polarisation, and belief evolution following real‑world critical events. Unlike other LLM Agent Swarm frameworks, which treat the simulations as a prediction black box, discourse_sim treats it as a theory‑testing instrument, which is fundamentally a different epistemological stance for studying social science problems. The paper further demonstrates the framework by modelling the Dublin anti‑immigration march on April 26, 2025, with N=100 agents over a 15‑day simulation. Package link: https://pypi.org/project/discourse‑sim/

PaperID: 4292, https://arxiv.org/pdf/2604.03888.pdf

Authors: Rajat M. Barot, Arjun S. Borkhatariya

Title: PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage

Abstract:
This paper presents PolySwarm, a novel multi‑agent large language model (LLM) framework designed for real‑time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence‑weighted Bayesian combination of swarm consensus with market‑implied probabilities, and applying quarter‑Kelly position sizing for risk‑controlled execution. The system incorporates an information‑theoretic market analysis engine using Kullback‑Leibler (KL) divergence and Jensen‑Shannon (JS) divergence to detect cross‑market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX‑implied probabilities from a log‑normal pricing model and executing trades within the human reaction‑time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log‑loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback‑loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single‑model baselines in probability calibration on Polymarket prediction tasks.

PaperID: 4293, https://arxiv.org/pdf/2604.03881.pdf

Authors: Zonghan Li, Yi Liu, Chunyan Wang, Song Tong, Kaiping Peng, Feng Ji

Title: Enhancing behavioral nudges with large language model-based iterative personalization: A field experiment on electricity and hot-water conservation

Abstract:
Nudging is widely used to promote behavioral change, but its effectiveness is often limited when recipients must repeatedly translate feedback into workable next steps under changing circumstances. Large language models (LLMs) may help reduce part of this cognitive work by generating personalized guidance and updating it iteratively across intervention rounds. We developed an LLM agent for iterative personalization and tested it in a three‑arm randomized experiment among 233 university residents in China, using daily electricity and shower hot‑water conservation as objectively measured cases differing in friction. LLM‑personalized nudges (T2) produced the largest conservation effects, while image‑enhanced conventional nudges (T1) and text‑based conventional nudges (C) showed similar outcomes (omnibus p = 0.009). Relative to C, T2 reduced electricity consumption by 0.56 kWh per room‑day (p = 0.014), corresponding to an 18.3 percentage‑point higher adjusted saving rate. This advantage emerged within the first two intervention rounds, alongside iterative updating of personalized guidance, and persisted thereafter. Hot‑water outcomes followed the same direction but were smaller, less precisely estimated, and attenuated over time, consistent with stronger friction in this domain. LLM‑personalized nudges emphasized prospective and context‑specific guidance and were associated with higher participant engagement. This study provides field evidence that LLM‑based iterative personalization can enhance behavioral nudging, with behavioral friction as a potential boundary condition. Larger trials and extension to more behaviors are warranted.

PaperID: 4294, https://arxiv.org/pdf/2604.03870.pdf

Authors: Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Rui Cai, Peijie Qiu, Zhipeng Wang, Oana Frunza, Shao Tang, Jindong Gu, Yalin Wang

Title: Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

Abstract:
The rapid deployment of open‑source frameworks has significantly advanced the development of modern multi‑agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter‑system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third‑party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single‑turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi‑step tool‑calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface‑level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool‑input position, we revealed that the RepE‑based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi‑agent architectures.

PaperID: 4295, https://arxiv.org/pdf/2604.03818.pdf

Authors: Xinqi Gao, Mario Ventresca

Title: Investigating the Impact of Subgraph Social Structure Preference on the Strategic Behavior of Networked Mixed-Motive Learning Agents

Abstract:
Limited work has examined the strategic behaviors of relational networked learning agents under social dilemmas, and has overlooked the intricate social dynamics of complex systems. We address the challenge with Socio‑Relational Intrinsic Motivation (SRIM), which endows agents with diverse preferences over sub‑graphical social structures in order to study the impact of agents' personal preferences over their sub‑graphical relations on their strategic decision‑making under sequential social dilemmas. Our results in the Harvest and Cleanup environments demonstrate that preferences over different subgraph structures (degree‑, clique‑, and critical connection‑based) lead to distinct variations in agents' reward gathering and strategic behavior: individual aggressiveness in Harvest and individual contribution effort in Cleanup. Moreover, agents with different subgraphical structural positions consistently exhibit similar strategic behavioral shifts. Our proposed BCI metric captures structural variation within the population, and the relative ordering of BCI across social preferences is consistent in Harvest and Cleanup games for the same topology, suggesting the subgraphical structural impact is robust across environments. These results provide a new lens for examining agents' behavior in social dilemmas and insight for designing effective multi‑agent ecosystems composed of heterogeneous social agents.

PaperID: 4296, https://arxiv.org/pdf/2604.03809.pdf

Authors: Dipkumar Patel

Title: Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

Abstract:
Multi‑agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain‑of‑thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5‑14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training‑free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self‑consistency at 26% lower token cost. Ablation experiments reveal 1‑3 point per‑protocol run‑to‑run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first‑order design decision for any latent communication protocol.

PaperID: 4297, https://arxiv.org/pdf/2604.03796.pdf

Authors: Michał Wawer, Jarosław A. Chudziak

Title: When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation

Abstract:
When LLM‑based multi‑agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective‑differentiated agents and classify disagreement patterns using a four‑category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d>0.8) surviving correction for multiple comparisons. Our taxonomy‑based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus‑seeking to uncertainty‑surfacing multi‑agent design, where disagreement structure ‑ not magnitude ‑ guides when human judgment is needed.

PaperID: 4298, https://arxiv.org/pdf/2604.03785.pdf

Authors: Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke

Title: Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

Abstract:
Communication is essential for coordination in \emphcooperative multi‑agent reinforcement learning under partial observability, yet \emphcross‑timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed‑communication partially observable Markov game (DeComm‑POMG) and decompose a message's effect into \emphcommunication gain and \emphdelay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value‑loss bound showing that the degradation induced by delayed messages is upper‑bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor‑‑critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC‑guided attention. Experiments on no‑teammate‑vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

PaperID: 4299, https://arxiv.org/pdf/2604.03671.pdf

Authors: Xingyuan Xiang, Xiangchen Pan, Wei Wei

Title: User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

Abstract:
Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information‑scarce dialogue histories and single‑turn recommendation paradigms may severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM‑based user simulators, which generate natural language feedback and perform simulated multi‑turn interactions to assist recommendation. Nevertheless, since simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, causing errors to accumulate over multiple interactions and severely affecting the generalization of the recommender. Inspired by the multi‑step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a user simulator‑guided multi‑turn preference optimization conversational recommendation framework. To align simulator‑generated feedback with true user preferences in the absence of explicit labels, we enhance feedback quality via multi‑task supervised fine‑tuning (SFT), enabling the simulator to better reflect users' complex and diverse needs. To address the challenge of biased feedback destabilizing multi‑turn optimization, we first allow the reasoning LLM‑based recommender to learn preference reasoning and recommendation patterns through SFT and then employ reinforcement learning with fine‑grained reward design to progressively align with true user preferences, improving recommendation performance. Extensive experiments on public datasets demonstrate the effectiveness and transferability of our method.

PaperID: 4300, https://arxiv.org/pdf/2604.03664.pdf

Authors: Yi-Cheng Wang, Wei-An Wang, Chu-Song Chen

Title: Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports

Abstract:
Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single‑table settings, leaving cross‑table document‑level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single‑table and cross‑table financial numerical reasoning in long‑context reports. Evaluating both closed‑source and open‑source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi‑step numerical reasoning. We propose FinLongDocAgent, a Multi‑Agent Multi‑Round Retrieval‑Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.

PaperID: 4301, https://arxiv.org/pdf/2604.03656.pdf

Authors: XinYu Zhao, ChengYou Li, XiangBao Meng, Kai Zhang, XiaoDong Liu

Title: Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization

Abstract:
Generative Engine Optimization (GEO) is rapidly reshaping digital marketing paradigms in the era of Large Language Models (LLMs). However, current GEO strategies predominantly rely on Retrieval‑Augmented Generation (RAG), which inherently suffers from probabilistic hallucinations and the "zero‑click" paradox, failing to establish sustainable commercial trust. In this paper, we systematically deconstruct the probabilistic flaws of existing RAG‑based GEO and propose a paradigm shift towards deterministic multi‑agent intent routing. First, we mathematically formulate Semantic Entropy Drift (SED) to model the dynamic decay of confidence curves in LLMs over continuous temporal and contextual perturbations. To rigorously quantify optimization value in black‑box commercial engines, we introduce the Isomorphic Attribution Regression (IAR) model, leveraging a Multi‑Agent System (MAS) probe with strict human‑in‑the‑loop physical isolation to enforce hallucination penalties. Furthermore, we architect the Deterministic Agent Handoff (DAH) protocol, conceptualizing an Agentic Trust Brokerage (ATB) ecosystem where LLMs function solely as intent routers rather than final answer generators. We empirically validate this architecture using EasyNote, an industrial AI meeting minutes product by Yishu Technology. By routing the intent of "knowledge graph mapping on an infinite canvas" directly to its specialized proprietary agent via DAH, we demonstrate the reduction of vertical task hallucination rates to near zero. This work establishes a foundational theoretical framework for next‑generation GEO and paves the way for a well‑ordered, deterministic human‑AI collaboration ecosystem.

PaperID: 4302, https://arxiv.org/pdf/2604.03631.pdf

Authors: Likai Peng, Shihui Feng

Title: Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors

Abstract:
On‑screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on‑screen behavioral engagement is essential for capturing students' cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor‑intensive manual coding often required for multimodal video data analysis. In this study, we compared the performance of both leading closed‑source VLMs (Claude‑3.7‑Sonnet, GPT‑4.1) and open‑source VLM (Qwen2.5‑VL‑72B) in single‑ and multi‑agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi‑agent frameworks: 1) a three‑agent workflow multi‑agent system (MAS) that segments screen videos by scene and detects on‑screen behaviors using cursor‑informed VLM prompting with evidence‑based verification; 2) an autonomous‑decision MAS inspired by ReAct that iteratively interleaves reasoning, tool‑like operations (segmentation/ classification/ validation), and observation‑driven self‑correction to produce interpretable on‑screen behavior labels. Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that the workflow‑based agent achieved best on scene detection, and the autonomous‑decision MAS achieved best on action detection. This study demonstrates the effectiveness of VLM‑based Multi‑agent System for video analysis and contributes a scalable framework for multimodal data analytics.

PaperID: 4303, https://arxiv.org/pdf/2604.03610.pdf

Authors: Maolin Sun, Yibiao Yang, Xuanlin Liu, Yuming Zhou, Baowen Xu

Title: DebugHarness: Emulating Human Dynamic Debugging for Autonomous Program Repair

Abstract:
Patching severe security flaws in complex software remains a major challenge. While automated tools like fuzzers efficiently discover bugs, fixing deep‑rooted low‑level faults (e.g., use‑after‑free and memory corruption) still requires labor‑intensive manual analysis by experts. Emerging Large Language Model (LLM) agents attempt to automate this pipeline, but they typically treat bug fixing as a purely static code‑generation task. Relying solely on static artifacts, these methods miss the dynamic execution context strictly necessary for diagnosing intricate memory safety violations. To overcome these limitations, we introduce DebugHarness, an autonomous LLM‑powered debugging agent harness that resolves complex vulnerabilities by emulating the interactive debugging practices of human systems engineers. Instead of merely examining static code, DebugHarness actively queries the live runtime environment. Driven by a reproducible crash, it utilizes a pattern‑guided investigation strategy to formulate hypotheses, interactively probes program memory states and execution paths, and synthesizes patches via a closed‑loop validation cycle. We evaluate DebugHarness on SEC‑bench, a rigorous dataset of real‑world C/C++ security vulnerabilities. DebugHarness successfully patches approximately 90% of the evaluated bugs. This yields a relative improvement of over 30% compared to state‑of‑the‑art baselines, demonstrating that dynamic debugging significantly enhances LLM diagnostic capabilities. Overall, DebugHarness establishes a novel paradigm for automated program repair, bridging the gap between static LLM reasoning and the dynamic intricacies of low‑level systems programming.

PaperID: 4304, https://arxiv.org/pdf/2604.03586.pdf

Authors: Tailong Luo, Hao Li, Rong Fu, Xinyue Jiang, Huaxuan Ding, Yiduo Zhang, Zilin Zhao, Simon Fong, Guangyin Jin, Jianyuan Ni

Title: MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification

Abstract:
With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross‑modal interactions and leverage external knowledge. To overcome these limitations, we propose MultiPress, a novel three‑stage multi‑agent framework for multimodal news classification. MultiPress integrates specialized agents for multimodal perception, retrieval‑augmented reasoning, and gated fusion scoring, followed by a reward‑driven iterative optimization mechanism. We validate MultiPress on a newly constructed large‑scale multimodal news dataset, demonstrating significant improvements over strong baselines and highlighting the effectiveness of modular multi‑agent collaboration and retrieval‑augmented reasoning in enhancing classification accuracy and interpretability.

PaperID: 4305, https://arxiv.org/pdf/2604.03509.pdf

Authors: Yuzhen Ding, Jason Holmes, Yuexing Hao, Zhengliang Liu, Peilong Wang, Junjie Cui, Meiyun Cao, Caiwen Jiang, Shuoyang Wei, Lin Zhao, Chenbin Liu, Lian Zhang, Yunze Yang, Tianming Liu, Wei Liu

Title: Applications of Large Language Models in Radiation Oncology: From Workflow Automation to Clinical Intelligence

Abstract:
Large language models (LLMs) have emerged as transformative tools in medicine, with strong capabilities in language understanding, reasoning, and structured information extraction. Radiation oncology is particularly well suited for LLM integration due to its data‑intensive workflows, reliance on structured guidelines, and documentation burden. This review summarizes recent applications, including domain‑specific fine‑tuning for decision support, automated nomenclature standardization, registry curation using autonomous LLM agents, and protocol‑aware radiotherapy plan evaluation using modular retrieval‑augmented generation (RAG). Additional applications include patient safety analysis through incident classification and root cause analysis, electronic health record (EHR)‑integrated communication, CT simulation order summarization, daily readiness briefings, and patient education systems. Emerging multimodal approaches enable context‑aware contouring, while early studies show LLMs can assist treatment planning by interpreting dosimetric feedback. Together, these advances highlight a shift toward clinically grounded, auditable, and workflow‑integrated AI systems that enhance efficiency, safety, and patient engagement.

PaperID: 4306, https://arxiv.org/pdf/2604.03508.pdf

Authors: Xin Mao, Joshua Pickard, Can Chen

Title: Data-Driven Tensor Decomposition Identification of Homogeneous Polynomial Dynamical Systems

Abstract:
Homogeneous polynomial dynamical systems (HPDSs), which can be equivalently represented by tensors, are essential for modeling higher‑order networked systems, including ecological networks, chemical reactions, and multi‑agent robotic systems. However, identifying such systems from data is challenging due to the rapid growth in the number of parameters with increasing system dimension and polynomial degree. In this article, we adopt compact and scalable representations of HPDSs leveraging low‑rank tensor decompositions, including tensor train, hierarchical Tucker, and canonical polyadic decompositions. These representations exploit the intrinsic multilinear structure of HPDSs and substantially reduce the dimensionality of the parameter space. Rather than identifying the full dynamic tensor, we develop a data‑driven framework that directly learns the underlying factor tensors or matrices in the associated decompositions from time‑series data. The resulting identification problem is solved using alternating least‑squares algorithms tailored to each tensor decomposition, achieving both accuracy and computational efficiency. We further analyze the robustness of the proposed framework in the presence of measurement noise and characterize data informativity. Finally, we demonstrate the effectiveness of our framework with numerical examples.

PaperID: 4307, https://arxiv.org/pdf/2604.03486.pdf

Authors: Xiaoan Liu, DaeHo Lee, Eric J Gonzalez, Mar Gonzalez-Franco, Ryo Suzuki

Title: VisionClaw: Always-On AI Agents through Smart Glasses

Abstract:
We present VisionClaw, an always‑on wearable AI agent that integrates live egocentric perception with agentic task execution. Running on Meta Ray‑Ban smart glasses, VisionClaw continuously perceives real‑world context and enables in‑situ, speech‑driven action initiation and delegation via OpenClaw AI agents. Therefore, users can directly execute tasks through the smart glasses, such as adding real‑world objects to an Amazon cart, generating notes from physical documents, receiving meeting briefings on the go, creating events from posters, or controlling IoT devices. We evaluate VisionClaw through a controlled laboratory study (N=12) and a longitudinal deployment study (N=5). Results show that integrating perception and execution enables faster task completion and reduces interaction overhead compared to non‑always‑on and non‑agent baselines. Beyond performance gains, deployment findings reveal a shift in interaction: tasks are initiated opportunistically during ongoing activities, and execution is increasingly delegated rather than manually controlled. These results suggest a new paradigm for wearable AI agents, where perception and action are continuously coupled to support situated, hands‑free interaction.

PaperID: 4308, https://arxiv.org/pdf/2604.03430.pdf

Authors: Charles Fleming, Guillaume De Saint Marc, Ramana Kompella, Peter Bosch, Vijoy Pandey

Title: Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions

Abstract:
As Large Language Model (LLM) based Multi‑Agent Systems (MAS) evolve from experimental pilots to complex, persistent ecosystems, the limitations of direct agent‑to‑agent communication have become increasingly apparent. Current architectures suffer from fragmented context, stochastic hallucinations, rigid security boundaries, and inefficient topology management. This paper introduces Cognitive Fabric Nodes (CFN), a novel middleware layer that creates an omnipresent "Cognitive Fabric" between agents. Unlike traditional message queues or service meshes, CFNs are not merely pass‑through mechanisms; they are active, intelligent intermediaries. Central to this architecture is the elevation of Memory from simple storage to an active functional substrate that informs four other critical capabilities: Topology Selection, Semantic Grounding, Security Policy Enforcement, and Prompt Transformation. We propose that each of these functions be governed by learning modules utilizing Reinforcement Learning (RL) and optimization algorithms to improve system performance dynamically. By intercepting, analyzing, and rewriting inter‑agent communication, the Cognitive Fabric ensures that individual agents remain lightweight while the ecosystem achieves coherence, safety, and semantic alignment. We evaluate the effectiveness of the CFN on the HotPotQA and MuSiQue datasets in a multi‑agent environment and demonstrate that the CFN improves performance by more than 10% on both datasets over direct agent to agent communication.

PaperID: 4309, https://arxiv.org/pdf/2604.03406.pdf

Authors: Jianxin Sun, David Lenz, Tom Peterka, Hongfeng Yu

Title: SASAV: Self-Directed Agent for Scientific Analysis and Visualization

Abstract:
With recent advances in frontier multimodal large language models (MLLMs) for data understanding and visual reasoning, the role of LLMs has evolved from passive LLM‑as‑an‑interface to proactive LLM‑as‑a‑judge, enabling deeper integration into the scientific data analysis and visualization pipelines. However, existing scientific visualization agents still rely on domain experts to provide prior knowledge for specific datasets or visualization‑oriented objective functions to guide the workflow through iterative feedback. This reactive, data‑dependent, human‑in‑the‑loop (HITL) paradigm is time‑consuming and does not scale effectively to large‑scale scientific data. In this work, we propose a Self‑Directed Agent for Scientific Analysis and Visualization (SASAV), the first fully autonomous AI agent to perform scientific data analysis and generate insightful visualizations without any external prompting or HITL feedback. SASAV is a multi‑agent system that automatically orchestrates data exploration workflows through our proposed components, including automated data profiling, context‑aware knowledge retrieval, and reasoning‑driven visualization parameter exploration, while supporting downstream interactive visualization tasks. This work establishes a foundational building block for the future AI for Science to accelerate scientific discovery and innovation at scale.

PaperID: 4310, https://arxiv.org/pdf/2604.03350.pdf

Authors: Paul Saves, Matthieu Mastio, Nicolas Verstaevel, Benoit Gaudou

Title: From Model-Based Screening to Data-Driven Surrogates: A Multi-Stage Workflow for Exploring Stochastic Agent-Based Models

Abstract:
Systematic exploration of Agent‑Based Models (ABMs) is challenged by the curse of dimensionality and their inherent stochasticity. We present a multi‑stage pipeline integrating the systematic design of experiments with machine learning surrogates. Using a predator‑prey case study, our methodology proceeds in two steps. First, an automated model‑based screening identifies dominant variables, assesses outcome variability, and segments the parameter space. Second, we train Machine Learning models to map the remaining nonlinear interaction effects. This approach automates the discovery of unstable regions where system outcomes are highly dependent on nonlinear interactions between many variables. Thus, this work provides modelers with a rigorous, hands‑off framework for sensitivity analysis and policy testing, even when dealing with high‑dimensional stochastic simulators.

PaperID: 4311, https://arxiv.org/pdf/2604.03330.pdf

Authors: Mohammad Wali Ur Rahman, Martin Manuel Lopez, Lamia Tasnim Mim, Carter Farthing, Julius Battle, Kathryn Buckley, Salim Hariri

Title: AICCE: AI Driven Compliance Checker Engine

Abstract:
For digital infrastructure to be safe, compatible, and standards‑aligned, automated communication protocol compliance verification is crucial. Nevertheless, current rule‑based systems are becoming less and less effective since they are unable to identify subtle or intricate non‑compliance, which attackers frequently use to establish covert communication channels in IPv6 traffic. In order to automate IPv6 compliance verification, this paper presents the Artificial Intelligence Driven Compliance Checker Engine (AICCE), a novel generative system that combines dual‑architecture reasoning and retrieval‑augmented generation (RAG). Specification segments pertinent to each query can be efficiently retrieved thanks to the semantic encoding of protocol standards into a high‑dimensional vector space. Based on this framework, AICCE offers two complementary pipelines: (i) Explainability Mode, which uses parallel LLM agents to render decisions and settle disputes through organized discussions to improve interpretability and robustness, and (ii) Script Execution Mode, which converts clauses into Python rules that can be executed quickly for dataset‑wide verification. With the debate mechanism enhancing decision reliability in complicated scenarios and the script‑based pipeline lowering per‑sample latency, AICCE achieves accuracy and F1‑scores of up to 99% when tested on IPv6 packet samples across sixteen cutting‑edge generative models. By offering a scalable, auditable, and generalizable mechanism for identifying both routine and covert non‑compliance in dynamic communication environments, our results show that AICCE overcomes the blind spots of conventional rule‑based compliance checking systems.

PaperID: 4312, https://arxiv.org/pdf/2604.03295.pdf

Authors: Shanglin Wu, Yuyang Luo, Yueqing Liang, Kaiwen Shi, Yanfang Ye, Ali Payani, Kai Shu

Title: Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

Abstract:
Large language model (LLM) multi‑agent systems can scale along two distinct dimensions: by increasing the number of agents and by improving through accumulated experience over time. Although prior work has studied these dimensions separately, their interaction under realistic cost constraints remains unclear. In this paper, we introduce a conceptual scaling view of multi‑agent systems that jointly considers team size and lifelong learning ability, and we study how memory design shares this landscape. To this end, we propose LLMA‑Mem, a lifelong memory framework for LLM multi‑agent systems under flexible memory topologies. We evaluate LLMA‑Mem on \textscMultiAgentBench across coding, research, and database environments. Empirically, LLMA‑Mem consistently improves long‑horizon performance over baselines while reducing cost. Our analysis further reveals a non‑monotonic scaling landscape: larger teams do not always produce better long‑term performance, and smaller teams can outperform larger ones when memory better supports the reuse of experience. These findings position memory design as a practical path for scaling multi‑agent systems more effectively and more efficiently over time.

PaperID: 4313, https://arxiv.org/pdf/2604.03280.pdf

Authors: Md Nahid Hasan, Vishwam Tiwari, Aditya Challa, Vaskar Raychoudhury, Snehanshu Saha

Title: Multi-Agent Training-free Urban Food Delivery System using Resilient UMST Network

Abstract:
Delivery systems have become a core part of urban life, supporting the demand for food, medicine, and other goods. Yet traditional logistics networks remain fragile, often struggling to adapt to road closures, accidents, and shifting demand. Online Food Delivery (OFD) platforms now represent a cornerstone of urban logistics, with the global market projected to grow to over 500 billion USD by 2030. Designing delivery networks that are efficient and resilient remains a major challenge: fully connected graphs provide flexibility but are computationally infeasible at scale, while single Minimum Spanning Trees (MSTs) are efficient but easily disrupted. We propose the Union of Minimum Spanning Trees (UMST) approach to construct delivery networks that are sparse yet robust. UMST generates multiple MSTs through randomized edge perturbations and unites them, producing graphs with far fewer edges than fully connected networks while maintaining multiple alternative routes between delivery hotspots. Across multiple U.S. cities, UMST achieves 20‑‑40× fewer edges than fully connected graphs while enabling substantial order bundling with 75‑‑83% participation rates. Compared to learning‑based baselines including MADDPG and Graph Neural Networks, UMST delivers competitive performance (88‑96% success rates, 44‑53% distance savings) without requiring training, achieving 30× faster execution while maintaining interpretable routing structures. Its combination of structural efficiency and operational flexibility offers a scalable and resilient foundation for urban delivery networks.

PaperID: 4314, https://arxiv.org/pdf/2604.03278.pdf

Authors: Chenghao Huang, Jiarong Fan, Weiqing Wang, Hao Wang

Title: Safe Decentralized Operation of EV Virtual Power Plant with Limited Network Visibility via Multi-Agent Reinforcement Learning

Abstract:
As power systems advance toward net‑zero targets, behind‑the‑meter renewables are driving rapid growth in distributed energy resources (DERs). Virtual power plants (VPPs) increasingly coordinate these resources to support power distribution network (PDN) operation, with EV charging stations (EVCSs) emerging as a key asset due to their strong impact on local voltages. However, in practice, VPPs must make operational decisions with only partial visibility of PDN states, relying on limited, aggregated information shared by the distribution system operator. This work proposes a safety‑enhanced VPP framework for coordinating multiple EVCSs under such realistic information constraints to ensure voltage security while maintaining economic operation. We develop Transformer‑assisted Lagrangian Multi‑Agent Proximal Policy Optimization (TL‑MAPPO), in which EVCS agents learn decentralized charging policies via centralized training with Lagrangian regularization to enforce voltage and demand‑satisfaction constraints. A transformer‑based embedding layer deployed on each EVCS agent captures temporal correlations among prices, loads, and charging demand to improve decision quality. Experiments on a realistic 33‑bus PDN show that the proposed framework reduces voltage violations by approximately 45% and operational costs by approximately 10% compared to representative multi‑agent DRL baselines, highlighting its potential for practical VPP deployment.

PaperID: 4315, https://arxiv.org/pdf/2604.03250.pdf

Authors: Jason Starace, Bert Baumgaertner, Terence Soule

Title: Ethical Implications of Training Deceptive AI

Abstract:
Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi‑agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People ethical framework: Pillar Implication, Severity, Reversibility, Scale, and Vulnerability. Classification follows a ``highest dimension wins'' approach, assigning one of four risk levels with cumulative safeguards ranging from standard documentation at DRL‑1 to regulatory notification and third‑party security audits at DRL‑4. A dual‑development mandate at DRL‑3 and above requires that detection and mitigation methods be developed alongside any deceptive capability. We apply the framework to eight case studies spanning all four levels and demonstrate that ecological validity of the deceptive mechanism emerges as a consistent, non‑independent indicator of classification level. The DRL framework is intended to fill the governance gap between regulated deployment and unstructured research, supporting both beneficial applications and defensive research under conditions where safeguards are proportional to the potential for harm.

PaperID: 4316, https://arxiv.org/pdf/2604.03242.pdf

Authors: Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, Tat-Seng Chua

Title: DRAFT: Task Decoupled Latent Reasoning for Agent Safety

Abstract:
The advent of tool‑using LLM agents shifts safety monitoring from output moderation to auditing long, noisy interaction trajectories, where risk‑critical evidence is sparse‑making standard binary supervision poorly suited for credit assignment. To address this, we propose DRAFT (Task Decoupled Latent Reasoning for Agent Safety), a latent reasoning framework that decouples safety judgment into two trainable stages: an Extractor that distills the full trajectory into a compact continuous latent draft, and a Reasoner that jointly attends to the draft and the original trajectory to predict safety. DRAFT avoids lossy explicit summarize‑then‑judge pipelines by performing evidence aggregation in latent space, enabling end‑to‑end differentiable training.Across benchmarks including ASSEBench and R‑Judge, DRAFT consistently outperforms strong baselines, improving accuracy from 63.27% (LoRA) to 91.18% averaged over benchmarks, and learns more separable representations. Ablations demonstrate a clear synergy between the Extractor and the Reasoner.Overall, DRAFT suggests that continuous latent reasoning prior to readout is a practical path to robust agent safety under long‑context supervision with sparse evidence.

PaperID: 4317, https://arxiv.org/pdf/2604.03156.pdf

Authors: Yuhan Pu, Hao Zheng, Ziqian Mo, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

Title: CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Abstract:
Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (i.e., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large‑scale editing models (i.e., Seedream, Nano Banana, etc), most approaches rely on single‑step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment‑inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi‑agent framework that reformulates conditional editing as a quality‑aware, feedback‑driven process rather than a one‑shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed‑loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves 20% more win rate on average compared to multiple state‑of‑the‑art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

PaperID: 4318, https://arxiv.org/pdf/2604.03143.pdf

Authors: Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo

Title: TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

Abstract:
Multi‑agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All‑Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All‑Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff‑Aware Storage encodes sibling caches as block‑sparse diffs against a single master copy, achieving 11‑17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirement, reduces per‑agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per‑request position‑independent caching.

PaperID: 4319, https://arxiv.org/pdf/2604.03131.pdf

Authors: Yuhang Wang, Haichang Gao, Zhenxing Niu, Zhaoxiang Liu, Wenjing Zhang, Xiang Wang, Shiguo Lian

Title: A Systematic Security Evaluation of OpenClaw and Its Variants

Abstract:
Tool‑augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model‑only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw‑series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high‑risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi‑step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system‑level failures. Overall, our study highlights the need to move beyond prompt‑level safeguards toward lifecycle‑wide security governance for intelligent agent frameworks.

PaperID: 4320, https://arxiv.org/pdf/2604.03098.pdf

Authors: Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu

Title: Co-Evolution of Policy and Internal Reward for Language Agents

Abstract:
Large language model (LLM) agents learn by interacting with environments, but long‑horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post‑hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self‑Guide, a self‑generated internal reward for language agents that supports both inference‑time guidance and training‑time supervision. Specifically, the agent uses Self‑Guide as a short self‑guidance signal to steer the next action during inference, and converts the same signal into step‑level internal reward for denser policy optimization during training. This creates a co‑evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference‑time self‑guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.

PaperID: 4321, https://arxiv.org/pdf/2604.03088.pdf

Authors: Le Chen, Erhu Feng, Yubin Xia, Haibo Chen

Title: SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses

Abstract:
LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model‑harness pair supports them. Based on these capability profiles, we propose SkVM, a compilation and runtime system designed for portable and efficient skill execution. At compile time, SkVM performs capability‑based compilation, environment binding, and concurrency extraction. At runtime, SkVM applies JIT code solidification and adaptive recompilation for performance optimization. We evaluate SkVM across eight LLMs of varying scales and three agent harnesses, covering SkillsBench and representative skill tasks. Results demonstrate that SkVM significantly improves task completion rates across different models and environments while reducing token consumption by up to 40%. In terms of performance, SkVM achieves up to 3.2x speedup with enhanced parallelism, and 19‑50x latency reduction through code solidification.

PaperID: 4322, https://arxiv.org/pdf/2604.03071.pdf

Authors: Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, Amaury Hayat

Title: Automatic Textbook Formalization

Abstract:
We present a case study where an automatic AI system formalizes a textbook with more than 500 pages of graduate‑level algebraic combinatorics to Lean. The resulting formalization represents a new milestone in textbook formalization scale and proficiency, moving from early results in undergraduate topology and restructuring of existing library content to a full standalone formalization of a graduate textbook. The formalization comprises 130K lines of code and 5900 Lean declarations and was conducted within one week by a total of 30K Claude 4.5 Opus agents collaborating in parallel on a shared code base via version control, simultaneously setting a record in multi‑agent software engineering with usable results. The inference cost matches or undercuts what we estimate as the salaries required for a team of human experts, and we expect there is still the potential for large efficiencies to be made without the need for better models. We make our code, the resulting Lean code base and a side‑by‑side blueprint website available open‑source.

PaperID: 4323, https://arxiv.org/pdf/2604.03070.pdf

Authors: Zhihao Chen, Ying Zhang, Yi Liu, Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, Zhiqiang Li

Title: Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

Abstract:
Third‑party skills extend LLM agents with powerful capabilities but often handle sensitive credentials in privileged environments, making leakage risks poorly understood. We present the first large‑scale empirical study of this problem, analyzing 17,022 skills (sampled from 170,226 on SkillsMP) using static analysis, sandbox testing, and manual inspection. We identify 520 vulnerable skills with 1,708 issues and derive a taxonomy of 10 leakage patterns (4 accidental and 6 adversarial). We find that (1) leakage is fundamentally cross‑modal: 76.3% require joint analysis of code and natural language, while 3.1% arise purely from prompt injection; (2) debug logging is the primary vector, with print and console.log causing 73.5% of leaks due to stdout exposure to LLMs; and (3) leaked credentials are both exploitable (89.6% without privileges) and persistent, as forks retain secrets even after upstream fixes. After disclosure, all malicious skills were removed and 91.6% of hardcoded credentials were fixed. We release our dataset, taxonomy, and detection pipeline to support future research.

PaperID: 4324, https://arxiv.org/pdf/2604.03042.pdf

Authors: John Lewis Devassy, Meysam Basiri, Mário A. T. Figueiredo, Pedro U. Lima

Title: Enhancing Multi-Robot Exploration Using Probabilistic Frontier Prioritization with Dirichlet Process Gaussian Mixtures

Abstract:
Multi‑agent autonomous exploration is essential for applications such as environmental monitoring, search and rescue, and industrial‑scale surveillance. However, effective coordination under communication constraints remains a significant challenge. Frontier exploration algorithms analyze the boundary between the known and unknown regions to determine the next‑best view that maximizes exploratory gain. This article proposes an enhancement to existing frontier‑based exploration algorithms by introducing a probabilistic approach to frontier prioritization. By leveraging Dirichlet process Gaussian mixture model (DP‑GMM) and a probabilistic formulation of information gain, the method improves the quality of frontier prioritization. The proposed enhancement, integrated into two state‑of‑the‑art multi‑agent exploration algorithms, consistently improves performance across environments of varying clutter, communication constraints, and team sizes. Simulations showcase an average gain of 10% and 14% for the two algorithms across all combinations. Successful deployment in real‑world experiments with a dual‑drone system further corroborates these findings.

PaperID: 4325, https://arxiv.org/pdf/2604.03040.pdf

Authors: Lokman Bekit, Hamza Karim, Nghia T Nguyen, Yasin Yilmaz

Title: QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

Abstract:
Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open‑set nature of anomalies. While recent training‑free approaches utilizing Vision‑Language Models (VLMs) have shown promise, they typically rely on massive, resource‑intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question‑centric agentic framework that treats VLM‑LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high‑fidelity captions and precise semantic reasoning without parameter updates. This ``prompt‑updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state‑of‑the‑art performance on UCF‑Crime, XD‑Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single‑scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource‑constrained edge devices.

PaperID: 4326, https://arxiv.org/pdf/2604.02988.pdf

Authors: Arthur Câmara, Vincent Slot, Jakub Zavrel

Title: Self-Optimizing Multi-Agent Systems for Deep Research

Abstract:
Given a user's complex information need, a multi‑agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high‑quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand‑engineered prompts and static architectures, making improvement brittle, expensive, and time‑consuming. We therefore explore various multi‑agent optimization methods to show that enabling agents to self‑play and explore different prompt combinations can produce high‑quality Deep Research systems that match or outperform expert‑crafted prompts.

PaperID: 4327, https://arxiv.org/pdf/2604.02938.pdf

Authors: Ibrahim Aliyu, Seungmin Oh, Sangwon Oh, Jinsul Kim

Title: Digital Twin-Assisted In-Network and Edge Collaboration for Joint User Association, Task Offloading, and Resource Allocation in the Metaverse

Abstract:
Advancements in extended reality (XR) are driving the development of the metaverse, which demands efficient real‑time transformation of 2D scenes into 3D objects, a computation‑intensive process that necessitates task offloading because of complex perception, visual, and audio processing. This challenge is further compounded by asymmetric uplink (UL) and downlink (DL) data characteristics, where 2D data are transmitted in the UL and 3D content is rendered in the DL. To address this issue, we propose a digital twin (DT)‑based in‑network computing (INC)‑assisted multi‑access edge computing (MEC) framework that enables real‑time synchronization and collaborative computing via URLLC. In this framework, a network operator manages wireless and computational resources for XR user devices (XUDs), while XUDs autonomously offload tasks to maximize their utilities. We model the interactions between XUDs and the operator as a Stackelberg Markov game, where the optimal offloading strategy constitutes an exact potential game with a Nash Equilibrium (NE), and the operator's problem is formulated as an asynchronous Markov decision process (MDP). We further propose a decentralized solution in which XUDs determine offloading decisions based on the operator's joint UL‑DL optimization of offloading mode (INC‑E or MEC only) and DL power allocation. A Nash‑asynchronous hybrid multi‑agent reinforcement learning (AMRL) algorithm is developed to predict the UL user‑associated and DL transmission power, thereby achieving NE. Simulation results demonstrate that the proposed approach considerably improves system utility, uplink rate, and energy efficiency by reducing latency and optimizing resource utilization in metaverse environments.

PaperID: 4328, https://arxiv.org/pdf/2604.02863.pdf

Authors: Yiqing Liu, Hantao Yao, Wu Liu, Yongdong Zhang

Title: EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

Abstract:
Majority voting is the standard for aggregating multi‑agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate the multi‑agent voting as a reliability‑aware agent scheduling problem, and propose an Efficient Majority‑then‑Stopping (EMS) to improve reasoning efficiency. EMS prioritizes agents based on task‑aware reliability and terminates the reasoning pipeline the moment a majority is achieved from the following three critical components. Specifically, we introduce Agent Confidence Modeling (ACM) to estimate agent reliability using historical performance and semantic similarity, Adaptive Incremental Voting (AIV) to sequentially select agents with early stopping, and Individual Confidence Updating (ICU) to dynamically update the reliability of each contributing agent. Extensive evaluations across six benchmarks demonstrate that EMS consistently reduces the average number of invoked agents by 32%.

PaperID: 4329, https://arxiv.org/pdf/2604.02811.pdf

Authors: Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang

Title: ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

Abstract:
Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation‑based debugging. However, manual SVA authoring is labor‑intensive and error‑prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scarcity of domain‑specific data. To address these challenges, we introduce ChatSVA, an end‑to‑end SVA generation system built upon a multi‑agent framework. At its core, the AgentBridge platform enables this multi‑agent approach by systematically generating high‑purity datasets, overcoming the data scarcity inherent to few‑shot scenarios. Evaluated on 24 RTL designs, ChatSVA achieves 98.66% syntax and 96.12% functional pass rates, generating 139.5 SVAs per design with 82.50% function coverage. This represents a 33.3 percentage point improvement in functional correctness and an over 11x enhancement in function coverage compared to the previous state‑of‑the‑art (SOTA). ChatSVA not only sets a new SOTA in automated SVA generation but also establishes a robust framework for solving long‑chain reasoning problems in few‑shot, domain‑specific scenarios. An online service has been publicly released at https://www.nctieda.com/CHATDV.html.

PaperID: 4330, https://arxiv.org/pdf/2604.02791.pdf

Authors: Haejoon Lee, Dimitra Panagou

Title: Fully Byzantine-Resilient Distributed Multi-Agent Q-Learning

Abstract:
We study Byzantine‑resilient distributed multi‑agent reinforcement learning (MARL), where agents must collaboratively learn optimal value functions over a compromised communication network. Existing resilient MARL approaches typically guarantee almost sure convergence only to near‑optimal value functions, or require restrictive assumptions to ensure convergence to optimal solution. As a result, agents may fail to learn the optimal policies under these methods. To address this, we propose a novel distributed Q‑learning algorithm, under which all agents' value functions converge almost surely to the optimal value functions despite Byzantine edge attacks. The key idea is a redundancy‑based filtering mechanism that leverages two‑hop neighbor information to validate incoming messages, while preserving bidirectional information flow. We then introduce a new topological condition for the convergence of our algorithm, present a systematic method to construct such networks, and prove that this condition can be verified in polynomial time. We validate our results through simulations, showing that our method converges to the optimal solutions, whereas prior methods fail under Byzantine edge attacks.

PaperID: 4331, https://arxiv.org/pdf/2604.02786.pdf

Authors: Ao Zhuang, Feng Yu, Tianbao Zhang, Linzuo Zhang, Danping Zou

Title: QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight

Abstract:
We present QuadAgent, a training‑free agent system for agile quadrotor flight guided by vision‑language inputs. Unlike prior end‑to‑end or serial agent approaches, QuadAgent decouples high‑level reasoning from low‑level control using an asynchronous multi‑agent architecture: Foreground Workflow Agents handle active tasks and user commands, while Background Agents perform look‑ahead reasoning. The system maintains scene memory via the Impression Graph, a lightweight topological map built from sparse keyframes, and ensures safe flight with a vision‑based obstacle avoidance network. Simulation results show that QuadAgent outperforms baseline methods in efficiency and responsiveness. Real‑world experiments demonstrate that it can interpret complex instructions, reason about its surroundings, and navigate cluttered indoor spaces at speeds up to 5 m/s.

PaperID: 4332, https://arxiv.org/pdf/2604.02770.pdf

Authors: Guoling Zhou, Wenpei Han, Fengqin Yang, Li Wang, Yingcong Zhou, Zhiguo Fu

Title: Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

Abstract:
In large language model (LLM)‑driven multi‑agent systems, disobey role specification (failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another) is a major failure mode \citeDBLP:journals/corr/abs‑2503‑13657. To address this issue, in the present paper, we propose a quantitative role clarity to improve role consistency. Firstly, we construct a role assignment matrix S(ϕ)=[s_ij(ϕ)], where s_ij(ϕ) is the semantic similarity between the i‑th agent's behavior trajectory and the j‑th agent's role description. Then we define role clarity matrix M(ϕ) as \textsoftmax(S(ϕ))‑I, where \textsoftmax(S(ϕ)) is a row‑wise softmax of S(ϕ) and I is the identity matrix. The Frobenius norm of M(ϕ) quantifies the alignment between agents' role descriptions and their behaviors trajectory. Moreover, we employ the role clarity matrix as a regularizer during lightweight fine‑tuning to improve role consistency, thereby improving end‑to‑end task performance. Experiments on the ChatDev multi‑agent system show that our method substantially improves role consistency and task performance: with Qwen and Llama, the role overstepping rate decreases from 46.4% to 8.4% and from 43.4% to 0.2%, respectively, and the role clarity score increases from 0.5328 to 0.9097 and from 0.5007 to 0.8530, respectively, the task success rate increases from 0.6769 to 0.6909 and from 0.6174 to 0.6763, respectively.

PaperID: 4333, https://arxiv.org/pdf/2604.02767.pdf

Authors: KrishnaSaiReddy Patil

Title: SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems

Abstract:
When Agent A delegates to Agent B, which invokes Tool C on behalf of User X, no existing framework can answer: whose authorization chain led to this action, and where did it violate policy? This paper introduces SentinelAgent, a formal framework for verifiable delegation chains in federal multi‑agent AI systems. The Delegation Chain Calculus (DCC) defines seven properties ‑ six deterministic (authority narrowing, policy preservation, forensic reconstructibility, cascade containment, scope‑action conformance, output schema conformance) and one probabilistic (intent preservation) ‑ with four meta‑theorems and one proposition establishing the practical infeasibility of deterministic intent verification. The Intent‑Preserving Delegation Protocol (IPDP) enforces all seven properties at runtime through a non‑LLM Delegation Authority Service. A three‑point verification lifecycle achieves 100% combined TPR at 0% FPR on DelegationBench v4 (516 scenarios, 10 attack categories, 13 federal domains). Under black‑box adversarial conditions, the DAS blocks 30/30 attacks with 0 false positives. Deterministic properties are unbreakable under adversarial stress testing; intent verification degrades to 13% against sophisticated paraphrasing. Fine‑tuning the NLI model on 190 government delegation examples improves P2 from 1.7% to 88.3% TPR (5‑fold cross‑validated, F1=82.1%). Properties P1, P3‑P7 are mechanically verified via TLA+ model checking across 2.7 million states with zero violations. Even when intent verification is evaded, the remaining six properties constrain the adversary to permitted API calls, conformant outputs, traceable actions, bounded cascades, and compliant behavior.

PaperID: 4334, https://arxiv.org/pdf/2604.02734.pdf

Authors: Bin Wen, Ruoxuan Zhang, Yang Chen, Hongxia Xie, Lan-Zhe Guo

Title: Aligning Progress and Feasibility: A Neuro-Symbolic Dual Memory Framework for Long-Horizon LLM Agents

Abstract:
Large language models (LLMs) have demonstrated strong potential in long‑horizon decision‑making tasks, such as embodied manipulation and web interaction. However, agents frequently struggle with endless trial‑and‑error loops or deviate from the main objective in complex environments. We attribute these failures to two fundamental errors: global Progress Drift and local Feasibility Violation. Existing methods typically attempt to address both issues simultaneously using a single paradigm. However, these two challenges are fundamentally distinct: the former relies on fuzzy semantic planning, while the latter demands strict logical constraints and state validation. The inherent limitations of such a single‑paradigm approach pose a fundamental challenge for existing models in handling long‑horizon tasks. Motivated by this insight, we propose a Neuro‑Symbolic Dual Memory Framework that explicitly decouples semantic progress guidance from logical feasibility verification. Specifically, during the inference phase, the framework invokes both memory mechanisms synchronously: on one hand, a neural‑network‑based Progress Memory extracts semantic blueprints from successful trajectories to guide global task advancement; on the other hand, a symbolic‑logic‑based Feasibility Memory utilizes executable Python verification functions synthesized from failed transitions to perform strict logical validation. Experiments demonstrate that this method significantly outperforms existing competitive baselines on ALFWorld, WebShop, and TextCraft, while drastically reducing the invalid action rate and average trajectory length.

PaperID: 4335, https://arxiv.org/pdf/2604.02728.pdf

Authors: Junhao Ren, Honglin Gao, Sijie Wang, Lan Zhao, Qiyu Kang, Aniq Ashan, Yajuan Sun, Gaoxi Xiao

Title: Multi-agent Reinforcement Learning-based Joint Design of Low-Carbon P2P Market and Bidding Strategy in Microgrids

Abstract:
The challenges of the uncertainties in renewable energy generation and the instability of the real‑time market limit the effective utilization of clean energy in microgrid communities. Existing peer‑to‑peer (P2P) and microgrid coordination approaches typically rely on certain centralized optimization or restrictive coordination rules which are difficult to be implemented in real‑life applications. To address the challenge, we propose an intraday P2P trading framework that allows self‑interested microgrids to pursue their economic benefits, while allowing the market operator to maximize the social welfare, namely the low carbon emission objective, of the entire community. Specifically, the decision‑making processes of the microgrids are formulated as a Decentralized Partially Observable Markov Decision Process (DEC‑POMDP) and solved using a Multi‑Agent Reinforcement Learning (MARL) framework. Such an approach grants each microgrid a high degree of decision‑making autonomy, while a novel market clearing mechanism is introduced to provide macro‑regulation, incentivizing microgrids to prioritize local renewable energy consumption and hence reduce carbon emissions. Simulation results demonstrate that the combination of the self‑interested bidding strategy and the P2P market design helps significantly improve renewable energy utilization and reduce reliance on external electricity with high carbon‑emissions. The framework achieves a balanced integration of local autonomy, self‑interest pursuit, and improved community‑level economic and environmental benefits.

PaperID: 4336, https://arxiv.org/pdf/2604.02721.pdf

Authors: DeepReinforce Team, Xiaoya Li, Xiaofei Sun, Guoyin Wang, Songqiao Su, Chris Shum, Jiwei Li

Title: GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

Abstract:
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans competitive programming: the most recent best result, Google's Gemini~3 Deep Think, attained 8th place even not being evaluated under live competition conditions. In this work, we introduce GrandCode, a multi‑agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) It orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc) and jointly improves them through post‑training and online test‑time RL; (2) We introduce Agentic GRPO specifically designed for multi‑stage agent rollouts with delayed rewards and the severe off‑policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming: in the most recent three Codeforces live competitions, i.e., Round~1087 (Mar 21, 2026), Round~1088 (Mar 28, 2026), and Round~1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.

PaperID: 4337, https://arxiv.org/pdf/2604.02695.pdf

Authors: Shawn Young, Lijian Xu

Title: XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis

Abstract:
Chest X‑ray (CXR) interpretation is a fundamental yet complex clinical task that increasingly relies on artificial intelligence for automation. However, traditional monolithic models often lack the nuanced reasoning required for trustworthy diagnosis, frequently leading to logical inconsistencies and diagnostic hallucinations. While multi‑agent systems offer a potential solution by simulating collaborative consultations, existing frameworks remain susceptible to consensus‑based errors when instantiated by a single underlying model. This paper introduces XrayClaw, a novel framework that operationalizes multi‑agent alignment through a sophisticated cooperative‑competitive architecture. XrayClaw integrates four specialized cooperative agents to simulate a systematic clinical workflow, alongside a competitive agent that serves as an independent auditor. To reconcile these distinct diagnostic pathways, we propose Competitive Preference Optimization, a learning objective that penalizes illogical reasoning by enforcing mutual verification between analytical and holistic interpretations. Extensive empirical evaluations on the MS‑CXR‑T, MIMIC‑CXR, and CheXbench benchmarks demonstrate that XrayClaw achieves state‑of‑the‑art performance in diagnostic accuracy, clinical reasoning fidelity, and zero‑shot domain generalization. Our results indicate that XrayClaw effectively mitigates cumulative hallucinations and enhances the overall reliability of automated CXR diagnosis, establishing a new paradigm for trustworthy medical imaging analysis.

PaperID: 4338, https://arxiv.org/pdf/2604.02688.pdf

Authors: Chenmu Zhang, Boris I. Yakobson

Title: MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

Abstract:
Existing LLM agents for computational materials science are constrained by pipeline‑bounded architectures tied to specific simulation codes and by dependence on manually written tool functions that grow with task scope. We present MatClaw, a code‑first agent that writes and executes Python directly, composing any installed domain library to orchestrate multi‑code workflows on remote HPC clusters without predefined tool functions. To sustain coherent execution across multi‑day workflows, MatClaw uses a four‑layer memory architecture that prevents progressive context loss, and retrieval‑augmented generation over domain source code that raises per‑step API‑call accuracy to ～99 %. Three end‑to‑end demonstrations on ferroelectric CuInP2S6 (machine‑learning force field training via active learning, Curie temperature prediction, and heuristic parameter‑space search) reveal that the agent handles code generation reliably but struggles with tacit domain knowledge. The missing knowledge, such as appropriate simulation timescales, equilibration protocols, and sampling strategies, is the kind that researchers accumulate through experience but rarely formalize. Two lightweight interventions, literature self‑learning and expert‑specified constraints, bridge these gaps, defining a guided autonomy model in which the researcher provides high‑level domain knowledge while the agent handles workflow execution. Our results demonstrate that the gap between guided and fully autonomous computational materials research is narrower than ever before: LLMs already handle code generation and scientific interpretation reliably, and the rapid improvement in their capabilities will accelerate materials discovery beyond what manual workflows can achieve. All code and benchmarks are open‑source.

PaperID: 4339, https://arxiv.org/pdf/2604.02687.pdf

Authors: Minh Nguyen, Jingqi Li, Gechen Qu, Claire J. Tomlin

Title: Inverse Safety Filtering: Inferring Constraints from Safety Filters for Decentralized Coordination

Abstract:
Safe multi‑agent coordination in uncertain environments can benefit from learning constraints from other agents. Implicitly communicating safety constraints through actions is a promising approach, allowing agents to coordinate and maintain safety without expensive communication channels. This paper introduces an online method to infer constraints from observing the safety‑filtered actions of other agents. We approach the problem by using safety filters to ensure forward safety and exploit their structure to work backwards and infer constraints. We provide sufficient conditions under which we can infer these constraints and prove that our inference method converges. This constraint inference procedure is coupled with a decentralized planning method that ensures safety when the constraint activation distance is sufficiently large. We then empirically validate our method with Monte Carlo simulations and hardware experiments with quadruped robots.

PaperID: 4340, https://arxiv.org/pdf/2604.02677.pdf

Authors: Harsh Kumar, Zi Kang, Mu, Jonathan Vincentius, Ashton Anderson

Title: Beyond the AI Tutor: Social Learning with LLM Agents

Abstract:
Most AI‑based educational tools today adopt a one‑on‑one tutoring paradigm, pairing a single LLM with a single learner. Yet decades of learning science research suggest that multi‑party interaction ‑‑ through peer modeling, co‑construction, and exposure to diverse perspectives ‑‑ can produce learning benefits that dyadic tutoring alone cannot. In this paper, we investigate whether multi‑agent LLM configurations can enhance learning outcomes beyond what a single LLM tutor provides. We present two controlled experiments spanning distinct learning contexts. In a convergent problem‑solving study (N=315), participants tackle SAT‑level math problems in a 2×2 design that varies the presence of an LLM tutor and LLM peers, each making different kinds of errors (conceptual vs.\ arithmetic); participants who interacted with both a tutor and peers achieved the highest unassisted test accuracy. In a divergent composition study (N=247), participants write argumentative and creative essays with either no AI assistance, a single LLM (Claude or ChatGPT), or both Claude and ChatGPT together; while both LLM conditions improved essay quality, only the two‑agent condition avoided the idea‑level homogeneity that single‑model assistance was found to produce. Together, these studies offer one of the first controlled investigations of multi‑agent LLM learning environments, probing whether the move from one‑on‑one AI tutoring toward richer agent configurations can unlock the collaborative and observational benefits long documented in human social learning research.

PaperID: 4341, https://arxiv.org/pdf/2604.02674.pdf

Authors: Kavana Venkatesh, Jiaming Cui

Title: Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

Abstract:
Large Language Model (LLM) multi‑agent systems are increasingly deployed as interacting agent societies, yet scaling these systems often yields diminishing or unstable returns, the causes of which remain poorly understood. We present the first large‑scale empirical study of coordination dynamics in LLM‑based multi‑agent systems, introducing an atomic event‑level formulation that reconstructs reasoning as cascades of coordination. Analyzing over 1.5 Million interactions across tasks, topologies, and scales, we uncover three coupled laws: coordination follows heavy‑tailed cascades, concentrates via preferential attachment into intellectual elites, and produces increasingly frequent extreme events as system size grows. We show that these effects are coupled through a single structural mechanism: an integration bottleneck, in which coordination expansion scales with system size while consolidation does not, producing large but weakly integrated reasoning processes. To test this mechanism, we introduce Deficit‑Triggered Integration (DTI), which selectively increases integration under imbalance. DTI improves performance precisely where coordination fails, without suppressing large‑scale reasoning. Together, our results establish quantitative laws of collective cognition and identify coordination structure as a fundamental, previously unmeasured axis for understanding and improving scalable multi‑agent intelligence.

PaperID: 4342, https://arxiv.org/pdf/2604.02673.pdf

Authors: Shanxia Wang

Title: A Logic of Secrecy on Simplicial Models

Abstract:
We develop a logic of secrecy on simplicial models for multi‑agent systems. Standard simplicial models provide a geometric semantics for knowledge by representing global states as facets of a chromatic simplicial complex and agents' local states as coloured vertices. However, secrecy cannot in general be captured as a genuinely new modality by relying on the ordinary simplicial knowledge structure alone. This motivates the introduction of an additional secrecy layer. To this end, we define \emphsimplicial secrecy models, which enrich standard simplicial epistemic models with agent‑relative secrecy neighborhood functions attached to local states. On this basis, we introduce a primitive secrecy operator S_aφ. Semantically, S_aφ holds when agent a knows φ in the ordinary simplicial sense and, moreover, the truth set of φ belongs to one of the designated secrecy neighborhoods associated with a's current local state. The clause for secrecy thus combines an ordinary knowledge requirement with an additional local‑state‑based neighborhood requirement, while the frame condition ensures that designated secrecy events remain non‑trivial from the perspective of every other agent. We formulate a system \mathsfSSL for the resulting language and show that it is sound with respect to the class of simplicial secrecy models. For the genuinely multi‑agent case |A|\ge 2, we prove completeness by first constructing an auxiliary‑colour canonical model and then representing it inside the original class of pure A‑chromatic simplicial secrecy models. The resulting framework yields a primitive, local‑state‑based, and geometrically grounded account of secrecy on simplicial models, together with a sound axiomatization and, in the genuinely multi‑agent case, a complete one.

PaperID: 4343, https://arxiv.org/pdf/2604.02668.pdf

Authors: Vira Kasprova, Amruta Parulekar, Abdulrahman AlRabah, Krishna Agaram, Ritwik Garg, Sagar Jha, Nimet Beyza Bozdag, Dilek Hakkani-Tur

Title: Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems

Abstract:
Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model's opinion. While prior work has mostly studied this in single‑agent settings, it remains underexplored in collaborative multi‑agent systems. We ask whether awareness of other agents' sycophancy levels influences discussion outcomes. To investigate this, we run controlled experiments with six open‑source LLMs, providing agents with peer sycophancy rankings that estimate each peer's tendency toward sycophancy. These rankings are based on scores calculated using various static (pre‑discussion) and dynamic (online) strategies. We find that providing sycophancy priors reduces the influence of sycophancy‑prone peers, mitigates error‑cascades, and improves final discussion accuracy by an absolute 10.5%. Thus, this is a lightweight, effective way to reduce discussion sycophancy and improve downstream accuracy.

PaperID: 4344, https://arxiv.org/pdf/2604.02666.pdf

Authors: Joshua Drossman, Alexandre Jacquillat, Sébastien Martin

Title: Let's Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization

Abstract:
Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade‑offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision‑makers with optimization capabilities through interactive optimization agents that can propose, interpret and refine solutions. However, it is fundamentally harder to evaluate a conversation‑based interaction than traditional one‑shot approaches. This paper proposes a scalable and replicable methodology for evaluating optimization agents through conversations. We build LLM‑powered decision agents that role‑play diverse stakeholders, each governed by an internal utility function but communicating like a real decision‑maker. We generate thousands of conversations in a school scheduling case study. Results show that one‑shot evaluation is severely limiting: the same optimization agent converges to much higher‑quality solutions through conversations. Then, this paper uses this methodology to demonstrate that tailored optimization agents, endowed with domain‑specific prompts and structured tools, can lead to significant improvements in solution quality in fewer interactions, as compared to general‑purpose chatbots. These findings provide evidence of the benefits of emerging solutions at the AI‑optimization interface to expand the reach of optimization technologies in practice. They also uncover the impact of operations research expertise to facilitate interactive deployments through the design of effective and reliable optimization agents.

PaperID: 4345, https://arxiv.org/pdf/2604.02665.pdf

Authors: Yunbo Lyu, Jieke Shi, Hong Jin Kang, Ratnadira Widyasari, Junda He, Yuqing Niu, Chengran Yang, Junkai Chen, Zhou Yang, Julia Lawall, David Lo

Title: AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

Abstract:
The SZZ algorithm is the dominant technique for identifying bug‑inducing commits and underpins many software engineering tasks, such as defect prediction and vulnerability analysis. Despite numerous variants, including recent LLM‑based approaches, performance remains limited on developer‑annotated datasets (e.g., recall of 0.552 on the Linux kernel). A key limitation is the reliance on git blame, which traces line‑level changes within the same file, failing in common scenarios such as ghost and cross‑file cases‑making nearly one‑quarter of bug‑inducing commits inherently untraceable. Moreover, current approaches follow fixed pipelines that restrict iterative reasoning and exploration, unlike developers who investigate bugs through an interactive, multi‑tool process. To address these challenges, we propose AgentSZZ, an agent‑based framework that leverages LLM‑driven agents to explore repositories and identify bug‑inducing commits. Unlike prior methods, AgentSZZ integrates task‑specific tools, domain knowledge, and a ReAct‑style loop to enable adaptive and causal tracing of bugs. A structured compression module further improves efficiency by reducing redundant context while preserving key evidence. Extensive experiments on three widely used datasets show that AgentSZZ consistently outperforms state‑of‑the‑art SZZ algorithms across all settings, achieving F1‑score gains of up to 27.2% over prior LLM‑based approaches. The improvements are especially pronounced in challenging scenarios such as cross‑file and ghost commits, with recall gains of up to 300% and 60%, respectively. Ablation studies show that task‑specific tools and domain knowledge are critical, while compression tool outputs reduce token consumption by over 30% with negligible impact. The replication package is available.

PaperID: 4346, https://arxiv.org/pdf/2604.02648.pdf

Authors: Shufan Jiang, Chios Chen, Zhiyang Chen

Title: GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Abstract:
The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human‑verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi‑agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi‑round ReAct loop and a memory mechanism, enabling long‑horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best‑performing model, Claude‑4.6‑Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.

PaperID: 4347, https://arxiv.org/pdf/2604.02647.pdf

Authors: Jiaqing Wu, Tong Wu, Manqing Zhang, Yunwei Dong, Bo Shen

Title: Runtime Execution Traces Guided Automated Program Repair with Multi-Agent Debate

Abstract:
Automated Program Repair (APR) struggles with complex logic errors and silent failures. Current LLM‑based APR methods are mostly static, relying on source code and basic test outputs, which fail to accurately capture complex runtime behaviors and dynamic data dependencies. While incorporating runtime evidence like execution traces exposes concrete state transitions, a single LLM interpreting this in isolation often overfits to specific hypotheses, producing patches that satisfy tests by coincidence rather than correct logic. Therefore, runtime evidence should act as objective constraints rather than mere additional input. We propose TraceRepair, a multi‑agent framework that leverages runtime facts as shared constraints for patch validation. A probe agent captures execution snapshots of critical variables to form an objective repair basis. Meanwhile, a committee of specialized agents cross‑verifies candidate patches to expose inconsistencies and iteratively refine them. Evaluated on the Defects4J benchmark, TraceRepair correctly fixes 392 defects, substantially outperforming existing LLM‑based approaches. Extensive experiments demonstrate improved efficiency and strong generalization on a newly constructed dataset of recent bugs, confirming that performance gains arise from dynamic reasoning rather than memorization.

PaperID: 4348, https://arxiv.org/pdf/2604.02544.pdf

Authors: Oleksii Borysenko

Title: Developer Experience with AI Coding Agents: HTTP Behavioral Signatures in Documentation Portals

Abstract:
The rapid adoption of AI coding agents and AI assistant web services is fundamentally changing how developers discover, consume, and interact with technical documentation. This paper studies that transformation across three interconnected dimensions: documentation accessibility, content analytics, and feedback systems. We present an empirical study of HTTP request fingerprints from nine AI coding agents (Aider, Antigravity, Claude Code, Cline, Cursor, Junie, OpenCode, VS Code, and Windsurf) and six AI assistant services (ChatGPT, Claude, Google Gemini, Google NotebookLM, MistralAI, and Perplexity) accessing a live developer documentation endpoint, revealing identifiable behavioral signatures in HTTP runtime environments, pre‑fetch strategies, User‑Agent strings, and header patterns. Our study shows that AI agent access compresses multi‑page navigation into a single or two requests, making traditional engagement metrics ‑ session depth, time‑on‑page, click path, and bounce rate ‑ unreliable indicators of actual documentation consumption. We discuss practical adaptations for developer portal teams, including tokenomics‑aware documentation design, adoption of emerging machine‑readable standards (AGENTS.md, llms.txt, skill.md, agent‑permissions.json), MCP server‑based feedback channels, and analytics instrumentation for AI referral traffic.

PaperID: 4349, https://arxiv.org/pdf/2604.02537.pdf

Authors: Alexander Zhao, Achuth Chandrasekhar, Amir Barati Farimani

Title: PolyJarvis: LLM Agent for Autonomous Polymer MD Simulations

Abstract:
All‑atom molecular dynamics (MD) simulations can predict polymer properties from molecular structure, yet their execution requires specialized expertise in force field selection, system construction, equilibration, and property extraction. We present PolyJarvis, an agent that couples a large language model (LLM) with the RadonPy simulation platform through Model Context Protocol (MCP) servers, enabling end‑to‑end polymer property prediction from natural language input. Given a polymer name or SMILES string, PolyJarvis autonomously executes monomer construction, charge assignment, polymerization, force field parameterization, GPU‑accelerated equilibration, and property calculation. Validation is conducted on polyethylene (PE), atactic polystyrene (aPS), poly(methyl methacrylate) (PMMA), and poly(ethylene glycol) (PEG). Results show density predictions within 0.1‑‑4.8% and bulk moduli within 17‑‑24% of reference values for aPS and PMMA. PMMA glass transition temperature (Tg) (395~K) matches experiment within +10‑‑18~K, while the remaining three polymers overestimate Tg by +38 to +47K (vs upper experimental bounds). Of the 8 property‑‑polymer combinations with directly comparable experimental references, 5 meet strict acceptance criteria. For cases lacking suitable amorphous‑phase experimental, agreement with prior MD literature is reported separately. The remaining Tg failures are attributable primarily to the intrinsic MD cooling‑rate bias rather than agent error. This work demonstrates that LLM‑driven agents can autonomously execute polymer MD workflows producing results consistent with expert‑run simulations.

PaperID: 4350, https://arxiv.org/pdf/2604.02478.pdf

Authors: Jiyong Kwon, Ujin Jeon, Sooji Lee, Guang Lin

Title: AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems

Abstract:
Deep learning models excel at detecting anomaly patterns in normal data. However, they do not provide a direct solution for anomaly classification and scalability across diverse control systems, frequently failing to distinguish genuine faults from nuisance faults caused by noise or the control system's large transient response. Consequently, because algorithmic fault validation remains unscalable, full Verification and Validation (V\&V) operations are still managed by Human‑in‑the‑Loop (HITL) analysis, resulting in an unsustainable manual workload. To automate this essential oversight, we propose Agent‑Integrated Verification and Validation (AIVV), a hybrid framework that deploys Large Language Models (LLMs) as a deliberative outer loop. Because rigorous system verification strictly depends on accurate validation, AIVV escalates mathematically flagged anomalies to a role‑specialized LLM council. The council agents perform collaborative validation by semantically validating nuisance and true failures based on natural‑language (NL) requirements to secure a high‑fidelity system‑verification baseline. Building on this foundation, the council then performs system verification by assessing post‑fault responses against NL operational tolerances, ultimately generating actionable V\&V artifacts, such as gain‑tuning proposals. Experiments on a time‑series simulator for Unmanned Underwater Vehicles (UUVs) demonstrate that AIVV successfully digitizes the HITL V\&V process, overcoming the limitations of rule‑based fault classification and offering a scalable blueprint for LLM‑mediated oversight in time‑series data domains.

PaperID: 4351, https://arxiv.org/pdf/2604.02460.pdf

Authors: Dat Tran, Douwe Kiela

Title: Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Abstract:
Recent work reports strong performance from multi‑agent LLM systems (MAS), but these gains are often confounded by increased test‑time computation. When computation is normalized, single‑agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information‑theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning‑token budget and with perfect context utilization, single‑agent systems are more information‑efficient. This perspective further predicts that multi‑agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek‑R1‑Distill‑Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi‑hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API‑based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi‑hop reasoning tasks, many reported advantages of multi‑agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade‑offs between compute, context, and coordination in agentic systems.

PaperID: 4352, https://arxiv.org/pdf/2604.02447.pdf

Authors: Kevin Song

Title: PlayGen-MoG: Framework for Diverse Multi-Agent Play Generation via Mixture-of-Gaussians Trajectory Prediction

Abstract:
Multi‑agent trajectory generation in team sports requires models that capture both the diversity of possible plays and realistic spatial coordination between players on plays. Standard generative approaches such as Conditional Variational Autoencoders (CVAE) and diffusion models struggle with this task, exhibiting posterior collapse or convergence to the dataset mean. Moreover, most trajectory prediction methods operate in a forecasting regime that requires multiple frames of observed history, limiting their use for play design where only the initial formation is available. We present PlayGen‑MoG, an extensible framework for formation‑conditioned play generation that addresses these challenges through three design choices: 1/ a Mixture‑of‑Gaussians (MoG) output head with shared mixture weights across all agents, where a single set of weights selects a play scenario that couples all players' trajectories, 2/ relative spatial attention that encodes pairwise player positions and distances as learned attention biases, and 3/ non‑autoregressive prediction of absolute displacements from the initial formation, eliminating cumulative error drift and removing the dependence on observed trajectory history, enabling realistic play generation from a single static formation alone. On American football tracking data, PlayGen‑MoG achieves 1.68 yard ADE and 3.98 yard FDE while maintaining full utilization of all 8 mixture components with entropy of 2.06 out of 2.08, and qualitatively confirming diverse generation without mode collapse.

PaperID: 4353, https://arxiv.org/pdf/2604.02391.pdf

Authors: Teng Liu, Yinfeng Yu

Title: Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation

Abstract:
Audio‑Visual Navigation (AVN) requires an embodied agent to navigate toward a sound source by utilizing both vision and binaural audio. A core challenge arises in complex acoustic environments, where binaural cues become intermittently unreliable, particularly when generalizing to previously unheard sound categories. To address this, we propose RAVN (Reliability‑Aware Audio‑Visual Navigation), a framework that conditions cross‑modal fusion on audio‑derived reliability cues, dynamically calibrating the integration of audio and visual inputs. RAVN introduces an Acoustic Geometry Reasoner (AGR) that is trained with geometric proxy supervision. Using a heteroscedastic Gaussian NLL objective, AGR learns observation‑dependent dispersion as a practical reliability cue, eliminating the need for geometric labels during inference. Additionally, we introduce Reliability‑Aware Geometric Modulation (RAGM), which converts the learned cue into a soft gate to modulate visual features, thereby mitigating cross‑modal conflicts. We evaluate RAVN on SoundSpaces using both Replica and Matterport3D environments, and the results show consistent improvements in navigation performance, with notable robustness in the challenging unheard sound setting.

PaperID: 4354, https://arxiv.org/pdf/2604.02334.pdf

Authors: Xiaohang Nie, Zihan Guo, Zicai Cui, Jiachi Yang, Zeyi Chen, Leheyi De, Yu Zhang, Junwei Liao, Bo Huang, Yingxuan Yang, Zhi Han, Zimian Peng, Linyao Chen, Wenzheng Tom Tang, Zongkai Liu, Tao Zhou, Botao Amber Hu, Shuyang Tang, Jianghao Lin, Weiwen Liu, Muning Wen, Yuanjian Zhou, Weinan Zhang

Title: Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web

Abstract:
As large language models (LLM)‑driven agents transition from isolated task solvers to persistent digital entities, the emergence of the Agentic Web, an ecosystem where heterogeneous agents autonomously interact and co‑evolve, marks a pivotal shift toward Artificial General Intelligence (AGI). However, LLM‑based multi‑agent systems (LaMAS) are hindered by open‑world issues such as scaling friction, coordination breakdown, and value dissipation. To address these challenges, we introduce Holos, a web‑scale LaMAS architected for long‑term ecological persistence. Holos adopts a five‑layer architecture, with core modules primarily featuring the Nuwa engine for high‑efficiency agent generation and hosting, a market‑driven Orchestrator for resilient coordination, and an endogenous value cycle to achieve incentive compatibility. By bridging the gap between micro‑level collaboration and macro‑scale emergence, Holos hopes to lay the foundation for the next generation of the self‑organizing and continuously evolving Agentic Web. We have publicly released Holos (accessible at https://holosai.io), providing a resource for the community and a testbed for future research in large‑scale agentic ecosystems.

PaperID: 4355, https://arxiv.org/pdf/2604.02211.pdf

Authors: Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das

Title: Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Abstract:
Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single‑model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi‑agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent‑based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi‑agent video recommendation systems (MAVRS). We combine ideas from multi‑agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)‑powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short‑form clips to educational platforms. We discuss representative frameworks, including early multi‑agent reinforcement learning (MARL) systems such as MMRF and recent LLM‑driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, incentive alignment, and identify research directions such as hybrid reinforcement learning‑LLM systems, lifelong personalization and self‑improving recommender systems.

PaperID: 4356, https://arxiv.org/pdf/2604.02145.pdf

Authors: Jihoon Jeong

Title: MTI: A Behavior-Based Temperament Profiling System for AI Agents

Abstract:
AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self‑report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior‑based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction‑behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two‑stage design that separates capability from disposition. We profile 10 small language models (1.7B‑9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction‑tuned models (all |r| < 0.42); (2) within‑axis facet dissociations are empirically confirmed ‑‑ Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance‑Resilience paradox reveals that opinion‑yielding and fact‑vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within‑axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B‑9B), confirming that MTI measures disposition rather than capability.

PaperID: 4357, https://arxiv.org/pdf/2604.02061.pdf

Authors: Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo

Title: Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions

Abstract:
Multi‑agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real‑world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff‑KD, a framework that integrates diffusion‑based generative refinement into teacher‑student knowledge distillation for robust collaborative perception. Diff‑KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR‑V2X under seven corruption types, Diff‑KD achieves state‑of‑the‑art performance in both detection accuracy and calibration robustness.

PaperID: 4358, https://arxiv.org/pdf/2604.02023.pdf

Authors: Mohd Safwan Uddin, Mohammed Mouzam, Mohammed Imran, Syed Badar Uddin Faizan

Title: APEX: Agent Payment Execution with Policy for Autonomous Agent API Access

Abstract:
Autonomous agents are moving beyond simple retrieval tasks to become economic actors that invoke APIs, sequence workflows, and make real‑time decisions. As this shift accelerates, API providers need request‑level monetization with programmatic spend governance. The HTTP 402 protocol addresses this by treating payment as a first‑class protocol event, but most implementations rely on cryptocurrency rails. In many deployment contexts, especially countries with strong real‑time fiat systems like UPI, this assumption is misaligned with regulatory and infrastructure realities. We present APEX, an implementation‑complete research system that adapts HTTP 402‑style payment gating to UPI‑like fiat workflows while preserving policy‑governed spend control, tokenized access verification, and replay resistance. We implement a challenge‑settle‑consume lifecycle with HMAC‑signed short‑lived tokens, idempotent settlement handling, and policy‑aware payment approval. The system uses FastAPI, SQLite, and Python standard libraries, making it transparent, inspectable, and reproducible. We evaluate APEX across three baselines and six scenarios using sample sizes 2‑4x larger than initial experiments (N=20‑40 per scenario). Results show that policy enforcement reduces total spending by 27.3% while maintaining 52.8% success rate for legitimate requests. Security mechanisms achieve 100% block rate for both replay attacks and invalid tokens with low latency overhead (19.6ms average). Multiple trial runs show low variance across scenarios, demonstrating high reproducibility with 95% confidence intervals. The primary contribution is a controlled agent‑payment infrastructure and reference architecture that demonstrates how agentic access monetization can be adapted to fiat systems without discarding security and policy guarantees.

PaperID: 4359, https://arxiv.org/pdf/2604.01764.pdf

Authors: Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Title: Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Abstract:
Large Vision‑Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi‑step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state‑of‑the‑art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In‑Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.

PaperID: 4360, https://arxiv.org/pdf/2604.01738.pdf

Authors: Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang

Title: AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows

Abstract:
Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General‑purpose LLMs, treating generation as single‑pass text completion, fail to satisfy the sequential, multi‑gate constraints inherent in safety‑critical engineering workflows. To address this, we propose AeroTherm‑GPT, the first TPS‑specialized LLM Agent, instantiated through a Constraint‑Closed‑Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG‑guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co‑resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co‑resolution probabilities. This upstream‑priority mechanism resolves multiple downstream violations per action, achieving a Root‑Cause Fix Efficiency of 4.16 versus 1.76 for flat‑checklist repair. Evaluated on HyTPS‑Bench and validated against external benchmarks, AeroTherm‑GPT achieves 88.7% End‑to‑End Success Rate (95% CI: 87.5‑89.9), a gain of +12.5 pp over the matched non‑CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.

PaperID: 4361, https://arxiv.org/pdf/2604.01687.pdf

Authors: Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, Philip S. Yu

Title: CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Abstract:
Anthropic proposes the concept of skills for LLM agents to tackle multi‑step professional tasks that simple tool invocations cannot address. A tool is a single, self‑contained function, whereas a skill is a structured bundle of interdependent multi‑file artifacts. Currently, skill generation is not only label‑intensive due to manual authoring, but also may suffer from human‑‑machine cognitive misalignment, which can lead to degraded agent performance, as evidenced by evaluations on SkillsBench. Therefore, we aim to enable agents to autonomously generate skills. However, existing self‑evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose CoEvoSkills, a self‑evolving skills framework that enables agents to autonomously construct complex, multi‑file skill packages. Specifically, CoEvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co‑evolves to provide informative and actionable feedback without access to ground‑truth test content. On SkillsBench, CoEvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization capabilities to six additional LLMs.

PaperID: 4362, https://arxiv.org/pdf/2604.01659.pdf

Authors: Yukai Ma, Honglin He, Selina Song, Wayne Wu, Bolei Zhou

Title: AURA: Multimodal Shared Autonomy for Real-World Urban Navigation

Abstract:
Long‑horizon navigation in complex urban environments relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision‑Language AI agent and a human operator collaborate on maneuvering the mobile machine, presents a promising solution to address these issues. However, existing shared autonomy methods often require humans and AI to operate within the same action space, leading to high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi‑modal framework that decomposes urban navigation into high‑level human instruction and low‑level AI control. AURA incorporates a Spatial‑Aware Instruction Encoder to align various human instructions with visual and spatial context. To facilitate training, we construct MM‑CoS, a large‑scale dataset comprising teleoperation and vision‑language descriptions. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation. Moreover, under similar takeover conditions, our shared autonomy framework reduces the frequency of takeovers by more than 44%. Demo video and more detail are provided in the project page.

PaperID: 4363, https://arxiv.org/pdf/2604.01647.pdf

Authors: Boyuan Guan, Jason Liu, Yanzhao Wu, Kiavash Bahreini

Title: Exploring Robust Multi-Agent Workflows for Environmental Data Management

Abstract:
Embedding LLM‑driven agents into environmental FAIR data management is compelling ‑ they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus‑wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three‑track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool‑using procedures) as persistent, interlocking artifacts; and a role‑separated multi‑agent design where deterministic validators and audited handoffs restore fail‑stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University's GIS Center Ecological Archive (849 curated datasets) serves as a single‑agent baseline. SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi‑agent workflow. The multi‑agent approach improved both efficiency ‑ completed by a single operator in two days with repeated artifact reuse across deployments ‑ and reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication. A representative incident (ISS‑004) demonstrated boundary‑based containment with 10‑minute detection latency, zero user exposure, and 80‑minute resolution. This paper has been accepted at PEARC 2026.

PaperID: 4364, https://arxiv.org/pdf/2604.01610.pdf

Authors: Taraneh Ghandi, Hamidreza Mahyar, Shachar Klaiman

Title: GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation

Abstract:
The use of knowledge graphs for grounding agents in real‑world Q&A applications has become increasingly common. Answering complex queries often requires multi‑hop reasoning and the ability to navigate vast relational structures. Standard approaches rely on prompting techniques that steer large language models to reason over raw graph context, or retrieval‑augmented generation pipelines where relevant subgraphs are injected into the context. These, however, face severe limitations with enterprise‑scale KGs that cannot fit in even the largest context windows available today. We present GraphWalk, a problem‑agnostic, training‑free, tool‑based framework that allows off‑the‑shelf LLMs to reason through sequential graph navigation, dramatically increasing performance across different tasks. Unlike task‑specific agent frameworks that encode domain knowledge into specialized tools, GraphWalk equips the LLM with a minimal set of orthogonal graph operations sufficient to traverse any graph structure. We evaluate whether models equipped with GraphWalk can compose these operations into correct multi‑step reasoning chains, where each tool call represents a verifiable step creating a transparent execution trace. We first demonstrate our approach on maze traversal, a problem non‑reasoning models are completely unable to solve, then present results on graphs resembling real‑world enterprise knowledge graphs. To isolate structural reasoning from world knowledge, we evaluate on entirely synthetic graphs with random, non‑semantic labels. Our benchmark spans 12 query templates from basic retrieval to compound first‑order logic queries. Results show that tool‑based traversal yields substantial and consistent gains over in‑context baselines across all model families tested, with gains becoming more pronounced as scale increases, precisely where in‑context approaches fail catastrophically.

PaperID: 4365, https://arxiv.org/pdf/2604.01608.pdf

Authors: Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang

Title: From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Abstract:
Multi‑agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single‑agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom (F), the first a priori predictor of skill utility. F measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by F, we propose AdaSkill, a two‑stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 applies iterative refinement selectively on free metrics, exploiting their forgiving scoring landscape to safely maximize remaining headroom. Evaluating across 4 tasks, 11 datasets, and 6 metrics, F strongly predicts skill utility (r=‑0.85, p<0.0001). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric‑level property. Driven by this signal, AdaSkill matches or exceeds the original MAS while reducing cost up to 8x and latency by up to 15x.

PaperID: 4366, https://arxiv.org/pdf/2604.01605.pdf

Authors: Morui Zhu, Mohammad Dehghani Tezerjani, Mátyás Szántó, Márton Vaitkus, Song Fu, Qing Yang

Title: F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling

Abstract:
We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi‑agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi‑agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance‑related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility‑aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial‑observability challenge inherent to multi‑agent exploration. To evaluate decentralized reconstruction, we collect a multi‑sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.

PaperID: 4367, https://arxiv.org/pdf/2604.01560.pdf

Authors: Qi Zhang, Shen Huang, Chu Liu, Shouqing Yang, Junbo Zhao, Haobo Wang, Pengjun Xie

Title: DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

Abstract:
Recent advances in persona‑centric memory have revealed the powerful capability of multi‑agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona‑centric memory management as an end‑to‑end task within a single‑agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user‑assistant dialogue dataset along with corresponding operation‑level memory updating labels. Building on this, we introduce a novel Memory‑based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training‑free and RL‑trained DeltaMem outperform all product‑level baselines across diverse long‑term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.

PaperID: 4368, https://arxiv.org/pdf/2604.01532.pdf

Authors: Tianjun Feng, Yunfeng Chen, Chun-Yi Tsai, Yihan Sun, Ayan Das, Kaoutar El Maghraoui, Shuxin Lin, Dhaval Patel

Title: PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools

Abstract:
LLM agents are beginning to invoke industrial asset‑management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety‑critical \emphPrognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME‑authored scenarios across eight industrial asset classes spanning rotating equipment, aero‑engines, and lithium‑ion cells, on public datasets including NASA PCoE, served through 39 MCP‑native tools wrapping published PHM algorithms (C‑MAPSS, ISO~10816, Arrhenius capacity‑fade models, time‑series foundation models). Krippendorff's α\in [0.74,\,0.82] on a 30‑scenario stratified rotating‑equipment/aero‑engine sample; the battery extension is single‑rater. Across three agentic frameworks and six LLM backbones, the strongest configuration reaches 80.8% pass@1, with the residual gap concentrated in orchestration and tool‑sequencing errors. Crucially, an architectural ablation shows that replacing MCP execution with text‑based Retrieval‑Augmented Generation (RAG) over telemetry‑equivalent evidence collapses Remaining Useful Life \emphpass‑all‑3 from 100% to 20% (5/5 vs.\ 1/5) on the battery class, exposing the structural limits of static retrieval for prognostic computation. Trajectory decomposition shows orchestration errors dominate failures across backbones, while schema‑invalid tool calls concentrate in smaller open‑weight models. Frontier LLMs are stronger at calling tools than at planning when to call them. PHMForge is open‑sourced with deterministic evaluators, a public leaderboard, and a datasheet.

PaperID: 4369, https://arxiv.org/pdf/2604.01520.pdf

Authors: Lei Wang, Yuanzi Li, Jinchao Wu, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen

Title: LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation

Abstract:
Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor‑intensive, costly, and difficult to scale. Here we present S‑Researcher, an LLM‑agent‑based platform that assists researchers in conducting social science research more efficiently and at greater scale by "siliconizing" both the research process and the participant pool. To build S‑Researcher, we first develop YuLan‑OneSim, a large‑scale social simulation system designed around three core requirements: generality via auto‑programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback‑driven LLM fine‑tuning. Leveraging this system, S‑Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human‑AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S‑Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod's theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S‑Researcher establishes a new human‑‑AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.

PaperID: 4370, https://arxiv.org/pdf/2604.01509.pdf

Authors: Julian Martinez, Kooktae Lee

Title: Feedforward Density-Driven Optimal Control for Tracking Time-Varying Distributions with Guaranteed Stability

Abstract:
This paper addresses the spatiotemporal mismatch in multi‑agent distribution tracking within time‑varying environments. While recent advancements in Density‑Driven Optimal Control (D^2OC) have enabled finite‑time distribution matching using Optimal Transport theory, existing formulations primarily assume a stationary reference density. In dynamic scenarios, such as tracking evolving wildfires or moving plumes, this assumption leads to a structural tracking lag where the agent configuration inevitably falls behind the shifting reference flow. To resolve this, we propose a feedforward‑augmented D^2OC framework that explicitly incorporates the reference velocity field, modeled via the continuity equation, into the control law. We provide a formal mathematical quantification of the induced tracking lag and analytically prove that the proposed predictive mechanism effectively reduces the cumulative tracking error. Furthermore, an analytical ultimate bound for the local Wasserstein distance is established under discretization errors and transport jitter. Theoretical analysis and numerical results demonstrate that our approach significantly mitigates tracking latency, ensuring robust and high‑fidelity tracking performance in rapidly changing environments.

PaperID: 4371, https://arxiv.org/pdf/2604.01487.pdf

Authors: Prince Zizhuang Wang, Shuli Jiang

Title: AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

Abstract:
With the rise of personalized, persistent LLM agent frameworks such as OpenClaw, human‑centered agentic social networks in which teams of collaborative AI agents serve individual users in a social network across multiple domains are becoming a reality. This setting creates novel privacy challenges: agents must coordinate across domain boundaries, mediate between humans, and interact with other users' agents, all while protecting sensitive personal information. While prior work has evaluated multi‑agent coordination and privacy preservation, the dynamics and privacy risks of human‑centered agentic social networks remain unexplored. To this end, we introduce AgentSocialBench, the first benchmark to systematically evaluate privacy risk in this setting, comprising scenarios across seven categories spanning dyadic and multi‑party interactions, grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs. Our experiments reveal that privacy in agentic social networks is fundamentally harder than in single‑agent settings: (1) cross‑domain and cross‑user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, (2) privacy instructions that teach agents how to abstract sensitive information paradoxically cause them to discuss it more (we call it abstraction paradox). These findings underscore that current LLM agents lack robust mechanisms for privacy preservation in human‑centered agentic social networks, and that new approaches beyond prompt engineering are needed to make agent‑mediated social coordination safe for real‑world deployment.

PaperID: 4372, https://arxiv.org/pdf/2604.01453.pdf

Authors: Carter Sale, Margaret C. Macpherson, Gaurav Patil, Kelly Miles, Rachel W. Kallen, Sebastian Wallot, Michael J. Richardson

Title: Nonlinear Methods for Analyzing Pose in Behavioral Research

Abstract:
Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general‑purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence‑based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline's flexibility, we present three case studies spanning facial and full‑body movement, 2D and 3D data, and individual versus multi‑agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.

PaperID: 4373, https://arxiv.org/pdf/2604.01452.pdf

Authors: Maxwell J. Jacobson, Daniel Xie, Jackson Shen, Adil Wazeer, Haiyan Wang, Xinghang Zhang, Yexiang Xue

Title: A Multi-Agent Human-LLM Collaborative Framework for Closed-Loop Scientific Literature Summarization

Abstract:
Scientific discovery is slowed by fragmented literature that requires excessive human effort to gather, analyze, and understand. AI tools, including autonomous summarization and question answering, have been developed to aid in understanding scientific literature. However, these tools lack the structured, multi‑step approach necessary for extracting deep insights from scientific literature. Large Language Models (LLMs) offer new possibilities for literature analysis, but remain unreliable due to hallucinations and incomplete extraction. We introduce Elhuyar, a multi‑agent, human‑in‑the‑loop system that integrates LLMs, structured AI, and human scientists to extract, analyze, and iteratively refine insights from scientific literature. The framework distributes tasks among specialized agents for filtering papers, extracting data, fitting models, and summarizing findings, with human oversight ensuring reliability. The system generates structured reports with extracted data, visualizations, model equations, and text summaries, enabling deeper inquiry through iterative refinement. Deployed in materials science, it analyzed literature on tungsten under helium‑ion irradiation, showing experimentally correlated exponential helium bubble growth with irradiation dose and temperature, offering insight for plasma‑facing materials (PFMs) in fusion reactors. This demonstrates how AI‑assisted literature review can uncover scientific patterns and accelerate discovery.

PaperID: 4374, https://arxiv.org/pdf/2604.01359.pdf

Authors: Andrei Mantsivoda, Darya Gavrilina

Title: Semantic Modeling for World-Centered Architectures

Abstract:
We introduce world‑centered multi‑agent systems (WMAS) as an alternative to traditional agent‑centered architectures, arguing that structured domains such as enterprises and institutional systems require a shared, explicit world representation to ensure semantic consistency, explainability, and long‑term stability. We classify worlds along dimensions including ontological explicitness, normativity, etc. In WMAS, learning and coordination operate over a shared world model rather than isolated agent‑local representations, enabling global consistency and verifiable system behavior. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.

PaperID: 4375, https://arxiv.org/pdf/2604.01350.pdf

Authors: Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao

Title: No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents

Abstract:
LLM‑based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross‑user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope‑bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared‑state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57‑‑71%. A write‑time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared‑state agents need artifact‑level defenses beyond text‑level sanitization to prevent silent cross‑user failures.

PaperID: 4376, https://arxiv.org/pdf/2604.01240.pdf

Authors: Vik Pant, Eric Yu

Title: Computational Foundations for Strategic Coopetition: Formalizing Sequential Interaction and Reciprocity

Abstract:
Strategic coopetition in multi‑stakeholder systems requires understanding how cooperation persists through time without binding contracts. This technical report extends computational foundations for strategic coopetition to sequential interaction dynamics, bridging conceptual modeling (i framework) with game‑theoretic reciprocity analysis. We develop: (1) bounded reciprocity response functions mapping partner deviations to finite conditional responses, (2) memory‑windowed history tracking capturing cognitive limitations over k recent periods, (3) structural reciprocity sensitivity derived from interdependence matrices where behavioral responses are amplified by structural dependencies, and (4) trust‑gated reciprocity where trust modulates reciprocity responses. The framework applies to both human stakeholder interactions and multi‑agent computational systems. Comprehensive validation across 15,625 parameter configurations demonstrates robust reciprocity effects, with all six behavioral targets exceeding thresholds: cooperation emergence (97.5%), defection punishment (100%), forgiveness dynamics (87.9%), asymmetric differentiation (100%), trust‑reciprocity interaction (100%), and bounded responses (100%). Empirical validation using the Apple iOS App Store ecosystem (2008‑2024) achieves 43/51 applicable points (84.3%), reproducing documented cooperation patterns across five ecosystem phases. Statistical significance confirmed at p < 0.001 with Cohen's d = 1.57. This report concludes the Foundations Series (TR‑1 through TR‑4) adopting uniaxial treatment where agents choose cooperation levels along a single continuum. Companion work on interdependence (arXiv:2510.18802), trust (arXiv:2510.24909), and collective action (arXiv:2601.16237) has been prepublished. Extensions Series (TR‑5 through TR‑8) introduces biaxial treatment where cooperation and competition are independent dimensions.

PaperID: 4377, https://arxiv.org/pdf/2604.01221.pdf

Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu

Title: HippoCamp: Benchmarking Contextual Agents on Personal Computers

Abstract:
We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user‑centric environments to model individual user profiles and search massive personal files for context‑aware reasoning. Our benchmark instantiates device‑scale file systems over real‑world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real‑world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi‑step reasoning. To facilitate fine‑grained analysis, we provide 46.1K densely annotated structured trajectories for step‑wise failure diagnosis. We evaluate a wide range of state‑of‑the‑art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long‑horizon retrieval and cross‑modal reasoning within dense personal file systems. Furthermore, our step‑wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user‑centric environments and provides a robust foundation for developing next‑generation personal AI assistants.

PaperID: 4378, https://arxiv.org/pdf/2604.01212.pdf

Authors: Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani

Title: $\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Abstract:
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce \textttYC‑Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one‑year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \200K, with Claude Opus 4.6 achieving the highest average final funds at \1.27 M, followed by GLM‑5 at \1.21 M at 11× lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for 47% of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over‑parallelization, demonstrating the capability gaps for long‑horizon performance. \textttYC‑Bench is open‑source, reproducible, and configurable.

PaperID: 4379, https://arxiv.org/pdf/2604.01210.pdf

Authors: Youssef Mroueh, Carlos Fonseca, Brian Belgodere, David Cox

Title: CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Abstract:
Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress‑tested, and revised. Current LLM‑guided search systems accelerate proposal generation, but often under‑represent scientific structure by optimizing code‑only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first‑class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence‑guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark‑grounded studies: transformer hyper‑connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native‑optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer‑gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .

PaperID: 4380, https://arxiv.org/pdf/2604.01158.pdf

Authors: Junli Ren, Yinghui Li, Kai Zhang, Penglin Fu, Haoran Jiang, Yixuan Pan, Guangjun Zeng, Tao Huang, Weizhong Guo, Peng Lu, Tianyu Li, Jingbo Wang, Li Chen, Hongyang Li, Ping Luo

Title: SMASH: Mastering Scalable Whole-Body Skills for Humanoid Ping-Pong with Egocentric Vision

Abstract:
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole‑body coordination for precise task execution. These limitations stem from two core challenges: achieving low‑latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task‑aligned strike motions for learning precise yet natural whole‑body behaviors. In this work, we present \methodname, a modular system for agile humanoid table tennis that unifies scalable whole‑body skill learning with onboard egocentric perception, eliminating the need for external cameras during deployment. Our work advances prior humanoid table‑tennis systems in three key aspects. First, we achieve agile and precise ball interaction with tightly coordinated whole‑body control, rather than relying on decoupled upper‑ and lower‑body behaviors. This enables the system to exhibit diverse strike motions, including explosive whole‑body smashes and low crouching shots. Second, by augmenting and diversifying strike motions with a generative model, our framework benefits from scalable motion priors and produces natural, robust striking behaviors across a wide workspace. Third, to the best of our knowledge, we demonstrate the first humanoid table‑tennis system capable of consecutive strikes using onboard sensing alone, despite the challenges of low‑latency perception, ego‑motion‑induced instability, and limited field of view. Extensive real‑world experiments demonstrate stable and precise ball exchanges under high‑speed conditions, validating scalable, perception‑driven whole‑body skill learning for dynamic humanoid interaction tasks.

PaperID: 4381, https://arxiv.org/pdf/2604.01127.pdf

Authors: Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung, Mohammad Hamdaqa

Title: Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

Abstract:
Software‑Defined Networking (SDN) is increasingly adopted to secure Internet‑of‑Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN‑IoT defense is inherently a closed‑loop control problem in which mitigation actions impact controller workload, queue dynamics, rule‑installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning‑based approaches prioritize detection accuracy while neglecting controller coupling and short‑horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self‑reflective two‑timescale SDN‑IoT defense solution separating fast mitigation from slow policy governance. At the fast timescale, per‑switch Proximal Policy Optimization (PPO) agents perform controller‑aware mitigation under safety constraints and action masking. At the slow timescale, a multi‑agent Large Language Model (LLM) governance engine generates machine‑parsable updates to the global policy constitution Pi, which encodes admissible actions, safety thresholds, and reward priorities. Updates (Delta Pi) are validated through stress testing and deployed only with non‑regression and safety guarantees, ensuring an auditable evolution without retraining RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows improvements of 9.1% Macro‑F1 over PPO and 15.4% over static baselines. Worst‑case degradation drops by 36.8%, controller backlog peaks by 42.7%, and RTT p95 inflation remains below 5.8% under high‑intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.

PaperID: 4382, https://arxiv.org/pdf/2604.01056.pdf

Authors: Weihao Sun, Gehui Xu, Alessio Moreschini, Thomas Parisini, Andreas A. Malikopoulos

Title: A Functional Learning Approach for Team-Optimal Traffic Coordination

Abstract:
In this paper, we develop a kernel‑based policy iteration functional learning framework for computing team‑optimal strategies in traffic coordination problems. We consider a multi‑agent discrete‑time linear system with a cost function that combines quadratic regulation terms and nonlinear safety penalties. Building on the Hilbert space formulation of offline receding‑horizon policy iteration, we seek approximate solutions within a reproducing kernel Hilbert space, where the policy improvement step is implemented via a discrete Fréchet derivative. We further study the model‑free receding‑horizon scenario, where the system dynamics are estimated using recursive least squares, followed by updating the policy using rolling online data. The proposed method is tested in signal‑free intersection scenarios via both model‑based and model‑free simulations and validated in SUMO.

PaperID: 4383, https://arxiv.org/pdf/2604.01020.pdf

Authors: Yiru Wang, Xinyue Shen, Yaohui Han, Michael Backes, Pin-Yu Chen, Tsung-Yi Ho

Title: OrgAgent: Organize Your Multi-Agent System like a Company

Abstract:
While large language model‑based multi‑agent systems have shown strong potential for complex reasoning, how to effectively organize multiple agents remains an open question. In this paper, we introduce OrgAgent, a company‑style hierarchical multi‑agent framework that separates collaboration into governance, execution, and compliance layers. OrgAgent decomposes multi‑agent reasoning into three layers: a governance layer for planning and resource allocation, an execution layer for task solving and review, and a compliance layer for final answer control. By evaluating the framework across reasoning tasks, LLMs, execution modes, and execution policies, we find that multi‑agent systems organized in a company‑style hierarchy generally outperform other organizational structures. Besides, hierarchical coordination also reduces token consumption relative to flat collaboration in most settings. For example, for GPT‑OSS‑120B, the hierarchical setting improves performance over flat multi‑agent system by 102.73% while reducing token usage by 74.52% on SQuAD 2.0. Further analysis shows that hierarchy helps most when tasks benefit from stable skill assignment, controlled information flow, and layered verification. Overall, our findings highlight organizational structure as an important factor in multi‑agent reasoning, shaping not only effectiveness and cost, but also coordination behavior.

PaperID: 4384, https://arxiv.org/pdf/2604.00992.pdf

Authors: Armel Koulong, Ali Pakniyat

Title: Tube-Based Safety for Anticipative Tracking in Multi-Agent Systems

Abstract:
A tube‑based safety framework is presented for robust anticipative tracking in nonlinear Brunovsky multi‑agent systems subject to bounded disturbances. The architecture establishes robust safety certificates for a feedforward‑augmented ancillary control policy. By rendering the state‑deviation dynamics independent of the agents' internal nonlinearities, the formulation strictly circumvents the restrictive Lipschitz‑bound feasibility conditions otherwise required for robust stabilization. Consequently, this structure admits an explicit, closed‑form robust positively invariant (RPI) tube radius that systematically attenuates the exponential control barrier function (eCBF) tightening margins, thereby mitigating constraint conservatism while preserving formal forward invariance. Within the distributed model predictive control (MPC) layer, mapping the local tube radii through the communication graph yields a closed‑form global formation error bound formulated via the minimum singular value of the augmented Laplacian. Robust inter‑agent safety is enforced with minimal communication overhead, requiring only a single scalar broadcast per neighbor at initialization. Numerical simulations confirm the framework's efficacy in safely navigating heterogeneous formations through cluttered environments.

PaperID: 4385, https://arxiv.org/pdf/2604.00979.pdf

Authors: Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang

Title: Dual Optimal: Make Your LLM Peer-like with Dignity

Abstract:
Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti‑sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset which features a compositional partial order structure of multiple persona preference. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully build a LLM agent with both dignity and peer.

PaperID: 4386, https://arxiv.org/pdf/2604.00967.pdf

Authors: Kees van Berkel, Tim S. Lyon

Title: The Varieties of Ought-Implies-Can and Deontic STIT Logic

Abstract:
STIT logic is a prominent framework for the analysis of multi‑agent choice‑making. In the available deontic extensions of STIT, the principle of Ought‑implies‑Can (OiC) fulfills a central role. However, in the philosophical literature a variety of alternative OiC interpretations have been proposed and discussed. This paper provides a modular framework for deontic STIT that accounts for a multitude of OiC readings. In particular, we discuss, compare, and formalize ten such readings. We provide sound and complete sequent‑style calculi for all of the various STIT logics accommodating these OiC principles. We formally analyze the resulting logics and discuss how the different OiC principles are logically related. In particular, we propose an endorsement principle describing which OiC readings logically commit one to other OiC readings.

PaperID: 4387, https://arxiv.org/pdf/2604.00945.pdf

Authors: Yebo Feng, Yang Liu

Title: A Visionary Look at Vibe Researching

Abstract:
Vibe researching is an emerging paradigm in which human researchers provide high‑level direction and critical judgment while LLM‑based agents handle the labor‑intensive execution of literature review, experimentation, data analysis, and manuscript drafting. Inspired by the "vibe coding" movement in software engineering, it occupies a middle ground between traditional manual research and fully autonomous AI research systems. This paper defines the concept, describes its methodology (multi‑agent architectures, memory, tool use, retrieval‑augmented generation, and the human's role as orchestrator), identifies seven technical limitations, weighs its positive and negative societal impacts, and maps each problem to a concrete future direction. Our goal is to provide the research community with a clear and honest map of the territory so that the conversation about responsible adoption can start from shared ground.

PaperID: 4388, https://arxiv.org/pdf/2604.00917.pdf

Authors: Razvan Mihai Popescu, David Gros, Andrei Botocan, Rahul Pandita, Prem Devanbu, Maliheh Izadi

Title: Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time

Abstract:
The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real‑world projects. Their growing role offers a unique and timely opportunity to investigate AI‑driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open‑source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent‑generated versus human‑authored code. Ultimately, our findings indicate an increasing agent activity in open‑source projects, although their contributions are associated with more churn over time compared to human‑authored code.

PaperID: 4389, https://arxiv.org/pdf/2604.00901.pdf

Authors: Sha Li, Naren Ramakrishnan

Title: Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

Abstract:
Multi‑agent Retrieval‑Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi‑hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior‑level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi‑agent orchestration and role‑specific agent prompts. At the global level, HERA optimizes query‑specific agent topologies through reward‑guided sampling and experience accumulation. At the local level, Role‑Aware Prompt Evolution refines agent behaviors via credit assignment and dual‑axes adaptation along operational and behavioral principles, enabling targeted, role‑conditioned improvements. On six knowledge‑intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self‑organization, where sparse exploration yields compact, high‑utility multi‑agent networks, demonstrating both efficient coordination and robust reasoning.

PaperID: 4390, https://arxiv.org/pdf/2604.00722.pdf

Authors: Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, Hua Wei

Title: LangMARL: Natural Language Multi-Agent Reinforcement Learning

Abstract:
Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi‑agent credit assignment problem, which has long been studied in classical multi‑agent reinforcement learning (MARL) but remains underaddressed in LLM‑based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent‑level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task‑relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi‑agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.

PaperID: 4391, https://arxiv.org/pdf/2604.00717.pdf

Authors: Sihan Zhou, Tiantian He, Yifan Lu, Yaqing Hou, Yew-Soon Ong

Title: GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization

Abstract:
Non‑stationarity arises from concurrent policy updates and leads to persistent environmental fluctuations. Existing approaches like Centralized Training with Decentralized Execution (CTDE) and sequential update schemes mitigate this issue. However, since the perception of the policies of other agents remains dependent on sampling environmental interaction data, the agent essentially operates in a passive perception state. This inevitably triggers equilibrium oscillations and significantly slows the convergence speed of the system. To address this issue, we propose Gradient Realignment via Active Shared Perception (GRASP), a novel framework that defines generalized Bellman equilibrium as a stable objective for policy evolution. The core mechanism of GRASP involves utilizing the independent gradients of agents to derive a defined consensus gradient, enabling agents to actively perceive policy updates and optimize team collaboration. Theoretically, we leverage the Kakutani Fixed‑Point Theorem to prove that the consensus direction u^ guarantees the existence and attainability of this equilibrium. Extensive experiments on StarCraft II Multi‑Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate the scalability and promising performance of the framework.

PaperID: 4392, https://arxiv.org/pdf/2604.00704.pdf

Authors: Ruozhao Yang, Mingfei Cheng, Gelei Deng, Junjie Wang, Tianwei Zhang, Xiaofei Xie

Title: AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications

Abstract:
Large‑scale web applications are widely deployed with complex third‑party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real‑world black‑box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real‑world deployment settings. In this paper, we propose AutoEG, a fully automated multi‑agent framework for exploit generation targeting black‑box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback‑driven interaction with the target application. We evaluate AutoEG on 104 real‑world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state‑of‑the‑art baselines, whose best performance reaches only 32.88%.

PaperID: 4393, https://arxiv.org/pdf/2604.00556.pdf

Authors: Hongyang Yang, Yanxin Zhang, Yang She, Yue Xiao, Hao Wu, Yiyang Zhang, Jiapeng Hou, Rongshan Zhang

Title: HabitatAgent: An End-to-End Multi-Agent System for Housing Consultation

Abstract:
Housing selection is a high‑stakes and largely irreversible decision problem. We study housing consultation as a decision‑support interface for housing selection. Existing housing platforms and many LLM‑based assistants often reduce this process to ranking or recommendation, resulting in opaque reasoning, brittle multi‑constraint handling, and limited guarantees on factuality. We present HabitatAgent, the first LLM‑powered multi‑agent architecture for end‑to‑end housing consultation. HabitatAgent comprises four specialized agent roles: Memory, Retrieval, Generation, and Validation. The Memory Agent maintains multi‑layer user memory through internal stages for constraint extraction, memory fusion, and verification‑gated updates; the Retrieval Agent performs hybrid vector‑‑graph retrieval (GraphRAG); the Generation Agent produces evidence‑referenced recommendations and explanations; and the Validation Agent applies multi‑tier verification and targeted remediation. Together, these agents provide an auditable and reliable workflow for end‑to‑end housing consultation. We evaluate HabitatAgent on 100 real user consultation scenarios (300 multi‑turn question‑‑answer pairs) under an end‑to‑end correctness protocol. A strong single‑stage baseline (Dense+Rerank) achieves 75% accuracy, while HabitatAgent reaches 95%.

PaperID: 4394, https://arxiv.org/pdf/2604.00553.pdf

Authors: Simone Garatti, Lucrezia Manieri, Alessandro Falsone, Algo Carè, Marco C. Campi, Maria Prandini

Title: Scenario theory for multi-criteria data-driven decision making

Abstract:
The scenario approach provides a powerful data‑driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications ‑ including multi‑agent decision problems ‑ require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi‑criteria data‑driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi‑criteria data‑driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.

PaperID: 4395, https://arxiv.org/pdf/2604.00518.pdf

Authors: Luyang Zhang, Yi-Yun Chu, Jialu Wang, Beibei Li, Ramayya Krishnan

Title: Do Agents Repair When Challenged -- or Just Reply? Challenge, Repair, and Public Correction in a Deployed Agent Forum

Abstract:
As large language model (LLM) agents are deployed in public interactive settings, a key question is whether their communities can sustain challenge, repair, and public correction, or merely produce norm‑like language. We compare Moltbook, a live deployed agent forum, with five matched Reddit communities by tracing a three‑step mechanism: whether discussions create threaded exchange, whether challenges elicit a response, and whether correction becomes visible to the wider thread. Relative to Reddit, Moltbook discussions are roughly ten times less threaded, leaving far fewer chances for challenge and response. When challenges do occur, the original author almost never returns (1.2% vs. 40.9% on Reddit), multi‑turn continuation is nearly absent (0.1% vs. 38.5%), and we detect no repairs under a shared conservative protocol. A non‑challenge baseline within Reddit suggests this gap is linked to challenge, not simply deeper threading. These results indicate that social alignment depends not only on producing norm‑aware language, but on sustaining the interactional processes through which communities teach, enforce, and revise norms. This matters for safety, because correction is increasingly decentralized, and for fairness, because communities differ in how they expect participants to engage with challenge.

PaperID: 4396, https://arxiv.org/pdf/2604.00487.pdf

Authors: Jiayi Yao, Cong Chen, Baosen Zhang

Title: Competition and Cooperation of LLM Agents in Games

Abstract:
Large language model (LLM) agents are increasingly deployed in competitive multi‑agent settings, raising fundamental questions about whether they converge to equilibria and how their strategic behavior can be characterized. In this paper, we study LLM agent interactions in two standard games: a network resource allocation game and a Cournot competition game. Rather than converging to Nash equilibria, we find that LLM agents tend to cooperate when given multi‑round prompts and non‑zero‑sum context. Chain‑of‑thought analysis reveals that fairness reasoning is central to this behavior. We propose an analytical framework that captures the dynamics of LLM agent reasoning across rounds and explains these experimental findings.

PaperID: 4397, https://arxiv.org/pdf/2604.00451.pdf

Authors: Mingjie Bi

Title: CASCADE: Cascaded Scoped Communication for Multi-Agent Re-planning in Disrupted Industrial Environments

Abstract:
Industrial disruption replanning demands multi‑agent coordination under strict latency and communication budgets, where disruptions propagate through tightly coupled physical dependencies and rapidly invalidate baseline schedules and commitments. Existing coordination schemes often treat communication as either effectively free (broadcast‑style escalation) or fixed in advance (hand‑tuned neighborhoods), both of which are brittle once the disruption footprint extends beyond a local region. We present \CASCADE, a budgeted replanning mechanism that makes communication scope explicit and auditable rather than fixed or implicit. Each agent maintains an explicit knowledge base, solves role‑conditioned local decision problems to revise commitments, and coordinates through lightweight contract primitives whose footprint expands only when local validation indicates that the current scope is insufficient. This design separates a unified agent substrate (Knowledge Base / Decision Manager / Communication Manager) from a scoped interaction layer that controls who is contacted, how far coordination propagates, and when escalation is triggered under explicit budgets. We evaluate \CASCADE on disrupted manufacturing and supply‑chain settings using unified diagnostics intended to test a mechanism‑design claim ‑‑ whether explicit scope control yields useful quality‑latency‑communication trade‑offs and improved robustness under uncertainty ‑‑ rather than to provide a complete algorithmic ranking.

PaperID: 4398, https://arxiv.org/pdf/2604.00433.pdf

Authors: Wonseok Yang, Thinh T. Doan

Title: Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games

Abstract:
This letter studies multi‑agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state‑based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non‑asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite‑state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite‑state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.

PaperID: 4399, https://arxiv.org/pdf/2604.00429.pdf

Authors: Shiyu Cheng, Luyao Niu, Bhaskar Ramasubramanian, Andrew Clark, Radha Poovendran

Title: Distributed Safety-Critical Control of Multi-Agent Systems with Time-Varying Communication Topologies

Abstract:
Coordinating multiple autonomous agents to reach a target region while avoiding collisions and maintaining communication connectivity is a core problem in multi‑agent systems. In practice, agents have a limited communication range. Thus, network links can appear and disappear as agents move, making the topology state‑dependent and time‑varying. Existing distributed solutions to multi‑agent reach‑avoid problems typically assume a fixed communication topology, and thus are not applicable when encountering discontinuities raised by time‑varying topologies. This paper presents a distributed optimization‑based control framework that addresses these challenges through two complementary mechanisms. First, we introduce a truncation function that converts the time‑varying communication graph into a smoothly state‑dependent one, ensuring that constraints remain continuous as communication links are created or removed. Second, we employ auxiliary mismatch variables with two‑time‑scale dynamics to decouple globally coupled state‑dependent constraints, yielding a singular perturbation system that each agent can solve using only local information and neighbor communication. Through singular perturbation analysis, we prove that the distributed controller guarantees collision avoidance, connectivity preservation, and convergence to the target region. We validate the proposed framework through numerical simulations involving multi‑agent navigation with obstacles and time‑varying communication topologies.

PaperID: 4400, https://arxiv.org/pdf/2604.00428.pdf

Authors: Jiarui Li, Runyu Zhang, Gioele Zardini

Title: Certificate-Driven Closed-Loop Multi-Agent Path Finding with Inheritable Factorization

Abstract:
Multi‑agent coordination in automated warehouses and logistics is commonly modeled as the Multi‑Agent Path Finding (MAPF) problem. Closed‑loop MAPF algorithms improve scalability by planning only the next movement and replanning online, but this finite‑horizon viewpoint can be shortsighted and makes it difficult to preserve global guarantees and exploit compositional structure. This issue is especially visible in Anytime Closed‑Loop Conflict‑Based Search (ACCBS), which applies Conflict‑Based Search (CBS) over dynamically extended finite horizons but, under finite computational budgets, may terminate with short active prefixes in dense instances. We introduce certificate trajectories and their associated fleet budget as a general mechanism for filtering closed‑loop updates. A certificate provides a conflict‑free fallback plan and a monotone upper bound on the remaining cost; accepting only certificate‑improving updates yields completeness. The same budget information induces a budget‑limited factorization that enables global, inheritable decomposition across timesteps. Instantiating the framework on ACCBS yields Certificate‑Driven Conflict‑Based Search (CDCBS). Experiments on benchmark maps show that CDCBS achieves more consistent solution quality than ACCBS, particularly in dense settings, while the proposed factorization reduces effective group size.

PaperID: 4401, https://arxiv.org/pdf/2604.00392.pdf

Authors: Alibek T. Kaliyev, Artem Maryanskyy

Title: EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

Abstract:
Modern LLM agents increasingly create their own tools at runtime ‑‑ from Python functions to API clients ‑‑ yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool‑Bench, a diagnostic benchmark for LLM‑generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library‑level software quality metrics ‑‑ reuse, redundancy, composition success, regression stability, and safety ‑‑ alongside a per‑tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head‑to‑head comparison of code‑level and strategy‑level tool evolution (ARISE vs. EvoSkill vs. one‑shot baselines, 99 tasks, two models), we show that systems with similar task completion (63‑68%) differ by up to 18% in library health, revealing software quality risks invisible to task‑only evaluation. Our results highlight that evaluation and governance of LLM‑generated tools require treating the evolving tool library as a first‑class software artifact, not a black box.

PaperID: 4402, https://arxiv.org/pdf/2604.00356.pdf

Authors: Shuguang Chen, Adil Hafeez, Salman Paracha

Title: Signals: Trajectory Sampling and Triage for Agentic Interactions

Abstract:
Agentic applications based on large language models increasingly rely on multi‑step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post‑deployment remains challenging. Agent trajectories are voluminous and non‑deterministic, and reviewing each one, whether through human review or auxiliary LLMs, is slow and cost‑prohibitive. We propose a lightweight, signal‑based framework for triaging agentic interaction trajectories. Our approach computes cheap, broadly applicable signals from live interactions and attaches them as structured attributes for trajectory triage, identifying interactions likely to be informative without affecting online agent behavior. We organize signals into a coarse‑grained taxonomy spanning interaction (misalignment, stagnation, disengagement, satisfaction), execution (failure, loop), and environment (exhaustion), designed for computation without model calls. In a controlled annotation study on τ‑bench, a widely used benchmark for tool‑augmented agent evaluation, we show that signal‑based sampling achieves an 82% informativeness rate compared to 74% for heuristic filtering and 54% for random sampling, with a 1.52x efficiency gain per informative trajectory. The advantage is robust across reward strata and task domains, confirming that signals provide genuine per‑trajectory informativeness gains rather than merely oversampling obvious failures. These results show that lightweight signals can serve as practical sampling infrastructure for agentic systems, and suggest a path toward preference data construction and post‑deployment optimization.

PaperID: 4403, https://arxiv.org/pdf/2604.00344.pdf

Authors: Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu

Title: Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Abstract:
Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q‑Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi‑Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round‑wise communication graph. At its core, Agent Q‑Mix combines a topology‑aware GNN encoder, GRU memory, and per‑agent Q‑heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q‑Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini‑3.1‑Flash‑Lite as a backbone, Agent Q‑Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi‑agent reasoning.

PaperID: 4404, https://arxiv.org/pdf/2604.00319.pdf

Authors: Syed Eqbal Alam, Zhan Shu

Title: Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

Abstract:
We develop algorithms for collaborative control of AI agents and critics in a multi‑actor, multi‑critic federated multi‑agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text‑to‑image generation, video generation, healthcare diagnostics from medical images and patient records, etcetera. The AI agents complete their tasks and send them to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter‑agent or inter‑critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi‑time scale stochastic approximation techniques, we provide convergence guarantees on the time‑average active states of AI agents and critics. The communication overhead is a little on the system, of the order of \mathcalO(m), for m modalities and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry and thorough evaluation to check the algorithm's efficacy.

PaperID: 4405, https://arxiv.org/pdf/2604.00304.pdf

Authors: Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

Title: Asymmetric Actor-Critic for Multi-turn LLM Agents

Abstract:
Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi‑turn interactions remains challenging. In many real‑world applications, agents must succeed in one‑shot settings where retries are impossible. Existing approaches either rely on reflection or post‑hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor‑critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open‑source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training‑based actor‑critic methods, our framework supervises a fixed actor operating in open‑ended conversational environments. The design leverages a generation‑verification asymmetry: while high‑quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine‑tuning without modifying the actor. Experiments on τ‑bench and UserBench show that our approach significantly improves reliability and task success over strong single‑agent baselines. Moreover, lightweight open‑source critics rival or surpass larger proprietary models in the critic role, and critic fine‑tuning yields additional gains over several state‑of‑the‑art methods.

PaperID: 4406, https://arxiv.org/pdf/2604.00299.pdf

Authors: Nanthit Temkulkiat, Chaiyong Ragkhitwetsagul, Morakot Choetkiertikul, Ruksit Rojpaisarnkit, Raula Gaikovina Kula

Title: When is Generated Code Difficult to Comprehend? Assessing AI Agent Python Code Proficiency in the Wild

Abstract:
The rapid adoption of AI coding agents is fundamentally shifting software developers' roles from code authors to code reviewers. While developers spend a significant portion of their time reading and comprehending code, the linguistic proficiency and complexity of the Python code generated by these agents remain largely unexplored. This study investigates the code proficiency of AI agents to determine the skill level required for developers to maintain their code. Leveraging the AIDev dataset, we mined 591 pull requests containing 5,027 Python files generated by three distinct AI agents and employed pycefr, a static analysis tool that maps Python constructs to six proficiency levels, ranging from A1 (Basic) to C2 (Mastery), to analyze the code. Our results reveal that: AI agents predominantly generate Basic‑level code, with over 90% of constructs falling into the A1 and A2 categories, and less than 1% classified as Mastery (C2); AI agents' and humans' pull requests share a broadly similar proficiency profile; High‑proficiency code by AI agents are from feature addition and bug fixing tasks. These findings suggest that while AI‑generated code is generally accessible to developers with basic Python skills, specific tasks may require advanced proficiency to review and maintain complex, agent‑generated constructs.

PaperID: 4407, https://arxiv.org/pdf/2604.00265.pdf

Authors: Edoardo Zorzi, Francesco Taioli, Yiming Wang, Marco Cristani, Alessandro Farinelli, Alberto Castellini, Loris Bazzani

Title: Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation

Abstract:
We propose Question‑Asking Navigation (QAsk‑Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free‑form natural language under partial observability, using only egocentric visual observations and interactive natural‑language dialogue with a human, where the dialogue can help to resolve ambiguity among visually similar object instances. Existing CoIN benchmarks are primarily focused on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk‑Nav provides (i) a lightweight question‑asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high‑quality target descriptions, and (iii) an open‑source dataset, that includes 28,000 quality‑checked reasoning and question‑asking traces for training and analysis of interactive capabilities of CoIN models. Using the proposed QAsk‑Nav benchmark, we develop Light‑CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state‑of‑the‑art CoIN approaches in generalization to unseen objects and environments. Project page at https://benchmarking‑interaction.github.io/

PaperID: 4408, https://arxiv.org/pdf/2604.00249.pdf

Authors: Ha Na Cho

Title: A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

Abstract:
Single‑agent large language model (LLM) systems struggle to simultaneously support diverse conversational functions and maintain safety in behavioral health communication. We propose a safety‑aware, role‑orchestrated multi‑agent LLM framework designed to simulate supportive behavioral health dialogue through coordinated, role‑differentiated agents. Conversational responsibilities are decomposed across specialized agents, including empathy‑focused, action‑oriented, and supervisory roles, while a prompt‑based controller dynamically activates relevant agents and enforces continuous safety auditing. Using semi‑structured interview transcripts from the DAIC‑WOZ corpus, we evaluate the framework with scalable proxy metrics capturing structural quality, functional diversity, and computational characteristics. Results illustrate clear role differentiation, coherent inter‑agent coordination, and predictable trade‑offs between modular orchestration, safety oversight, and response latency when compared to a single‑agent baseline. This work emphasizes system design, interpretability, and safety, positioning the framework as a simulation and analysis tool for behavioral health informatics and decision‑support research rather than a clinical intervention.

PaperID: 4409, https://arxiv.org/pdf/2604.00186.pdf

Authors: Ravish Gupta, Saket Kumar

Title: Agentic AI and Occupational Displacement: A Multi-Regional Task Exposure Analysis of Emerging Labor Market Disruption

Abstract:
This paper extends the Acemoglu‑Restrepo task exposure framework to address the labor market effects of agentic artificial intelligence systems: autonomous AI agents capable of completing entire occupational workflows rather than discrete tasks. Unlike prior automation technologies that substitute for individual subtasks, agentic AI systems execute end‑to‑end workflows involving multi‑step reasoning, tool invocation, and autonomous decision‑making, substantially expanding occupational displacement risk beyond what existing task‑level analyses capture. We introduce the Agentic Task Exposure (ATE) score, a composite measure computed algorithmically from ONET task data using calibrated adoption parameters‑‑not a regression estimate‑‑incorporating AI capability scores, workflow coverage factors, and logistic adoption velocity. Applying the ATE framework across five major US technology regions (Seattle‑Tacoma, San Francisco Bay Area, Austin, New York, and Boston) over a 2025‑2030 horizon, we find that 93.2% of the 236 analyzed occupations across six information‑intensive SOC groups (financial, legal, healthcare, healthcare support, sales, and administrative/clerical) cross the moderate‑risk threshold (ATE >= 0.35) in Tier 1 regions by 2030, with credit analysts, judges, and sustainability specialists reaching ATE scores of 0.43‑0.47. We simultaneously identify seventeen emerging occupational categories benefiting from reinstatement effects, concentrated in human‑AI collaboration, AI governance, and domain‑specific AI operations roles. Our findings carry implications for workforce transition policy, regional economic planning, and the temporal dynamics of labor market adjustment

PaperID: 4410, https://arxiv.org/pdf/2604.00149.pdf

Authors: Ken Deng, Xiangfei Wang, Guijing Duan, Chen Mo, Junkun Huang, Runqing Zhang, Ling Qian, Zhiguo Huang, Jize Han, Di Luo

Title: Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

Abstract:
While large language models (LLMs) promise to revolutionize automated scientific discovery, their application in rigorous real‑world physical research is stalled by two critical barriers: a lack of realistic evaluation benchmarks and systemic LLM hallucinations. Here, we address both problems. We introduce QMP‑Bench, a pioneering end‑to‑end research‑level benchmark in quantum many‑body simulation consisting of 100 tasks extracted from 21 high‑impact prestigious journals, presenting a challenge even for current frontier LLMs. To establish a paradigm for reliable and transparent AI physicists, we present PhysVEC, a multi‑agent framework that enforces self‑verifiable and error correction in AI research. PhysVEC seamlessly integrates programming and scientific verifiers to guarantee coding correctness and principle‑based physical validity, yielding interpretable evidence and error correction at each step. PhysVEC significantly outperforms existing LLM baselines on various scenarios in QMP‑Bench and presents a favorable inference‑time scaling, successfully transforming unreliable AI generations into accurate physical reproductions, paving a robust and trustworthy path towards future automated scientific discovery.

PaperID: 4411, https://arxiv.org/pdf/2604.00085.pdf

Authors: Yuxing Lu, Yushuhong Lin, Jason Zhang

Title: One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

Abstract:
Large language models applied to clinical prediction exhibit case‑level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single‑agent strategies sample from one role‑conditioned distribution, and multi‑agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case‑Adaptive Multi‑agent Panel), where an attending‑physician agent dynamically assembles a specialist panel tailored to each case's diagnostic uncertainty. Each specialist evaluates candidates via three‑valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one's expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician's judgment, or evidence‑based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC‑IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi‑agent methods, with voting records and arbitration traces offering transparent decision audits.

PaperID: 4412, https://arxiv.org/pdf/2604.00026.pdf

Authors: Houssam EL Kandoussi

Title: "Who Am I, and Who Else Is Here?" Behavioral Differentiation Without Role Assignment in Multi-Agent LLM Systems

Abstract:
When multiple large language models interact in a shared conversation, do they develop differentiated social roles or converge toward uniform behavior? We present a controlled experimental platform that orchestrates simultaneous multi‑agent discussions among 7 heterogeneous LLMs on a unified inference backend, systematically varying group composition, naming conventions, and prompt structure across 12 experimental series (208 runs, 13,786 coded messages). Each message is independently coded on six behavioral flags by two LLM judges from distinct model families (Gemini 3.1 Pro and Claude Sonnet 4.6), achieving mean Cohen's kappa = 0.78 with conservative intersection‑based adjudication. Human validation on 609 randomly stratified messages confirmed coding reliability (mean kappa = 0.73 vs. Gemini). We find that (1) heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^‑5, r = 0.70); (2) groups spontaneously exhibit compensatory response patterns when an agent crashes; (3) revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77, p = 0.001); and (4) removing all prompt scaffolding converges profiles to homogeneous‑level similarity (p < 0.001). Critically, these behaviors are absent when agents operate in isolation, confirming that behavioral diversity is a structured, reproducible phenomenon driven by the interaction of architectural heterogeneity, group context, and prompt‑level scaffolding.

PaperID: 4413, https://arxiv.org/pdf/2604.00021.pdf

Authors: Hiroki Fukui

Title: How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Abstract:
Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi‑agent simulations across four models (Llama 3.3 70B, GPT‑4o mini, Qwen3‑Next‑80B‑A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study (\mathrmBF_10 > 10 for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model‑specific. Three new metrics ‑‑ Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other‑Recognition Index (ORI) ‑‑ revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other‑recognition co‑occurring). The central finding is an interaction between processing capacity and instruction format: in low‑DD models, instruction format has no effect on internal processing; in high‑DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level (r = ‑0.161 to +0.256, all p > .22; N = 24; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

PaperID: 4414, https://arxiv.org/pdf/2603.29999.pdf

Authors: Gianluca Aguzzi, Davide Domini, Nicolas Farabegoli, Mirko Viroli

Title: Phyelds: A Pythonic Framework for Aggregate Computing

Abstract:
Aggregate programming is a field‑based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large‑scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python‑‑the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python's machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well‑known aggregate computing patterns to federated learning coordination and integration with a widely used multi‑agent reinforcement learning simulator.

PaperID: 4415, https://arxiv.org/pdf/2603.29950.pdf

Authors: Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Title: Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

Abstract:
Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as ``pivotal moments'': successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human‑centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.

PaperID: 4416, https://arxiv.org/pdf/2603.29919.pdf

Authors: Yudong Gao, Zongjie Li, Yuanyuanyuan, Zimo Ji, Pingchuan Ma, Shuai Wang

Title: SkillReducer: Optimizing LLM Agent Skills for Token Efficiency

Abstract:
LLM‑based coding agents rely on \emphskills, pre‑packaged instruction sets that extend agent capabilities, yet every token of skill content injected into the context window incurs both monetary cost and attention dilution. To understand the severity of this problem, we conduct a large‑scale empirical study of 55,315 publicly available skills and find systemic inefficiencies: 26.4% lack routing descriptions entirely, over 60% of body content is non‑actionable, and reference files can inject tens of thousands of tokens per invocation. Motivated by these findings, we present \textscSkillReducer, a two‑stage optimization framework. Stage~1 optimizes the routing layer by compressing verbose descriptions and generating missing ones via adversarial delta debugging. Stage~2 restructures skill bodies through taxonomy‑driven classification and progressive disclosure, separating actionable core rules from supplementary content loaded on demand, validated by faithfulness checks and a self‑correcting feedback loop. Evaluated on 600 skills and the SkillsBench benchmark, \textscSkillReducer achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a \emphless‑is‑more effect where removing non‑essential content reduces distraction in the context window. These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework.

PaperID: 4417, https://arxiv.org/pdf/2603.29868.pdf

Authors: Oliver Schön, Lars Lindemann

Title: Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning

Abstract:
The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete‑time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi‑agent robotics, smart cities, and air traffic control. We define STR as a multi‑objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto‑optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi‑objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under‑approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi‑objective reasoning.

PaperID: 4418, https://arxiv.org/pdf/2603.29846.pdf

Authors: Adar Avsian, Larry Heck

Title: SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

Abstract:
Large language models (LLMs) are increasingly deployed in multi‑agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret‑aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade‑off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.

PaperID: 4419, https://arxiv.org/pdf/2603.29798.pdf

Authors: Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You, Or Litany, Leonidas Guibas, Maks Ovsjanikov

Title: SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes

Abstract:
Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent‑specific constraints. Our core contribution is a grounded verification engine that couples high‑level semantic reasoning with low‑level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in‑depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision‑Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post‑training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

PaperID: 4420, https://arxiv.org/pdf/2603.29755.pdf

Authors: Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth

Title: CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing

Abstract:
Modern manufacturing environments demand real‑time, trustworthy, and interpretable root‑cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root‑cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry‑grade multi‑agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real‑time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per‑criterion success rates reached 98.75% for planning and tool use, 97.3% for self‑reflection, and 99.2% for collaboration. Runtime experiments report end‑to‑end latency of 50‑60s per diagnostic workflow with near‑linear scalability (R^2=0.97), confirming real‑time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse's modular, human‑in‑the‑loop design enables reliable, interpretable, and production‑ready automation for next‑generation manufacturing.

PaperID: 4421, https://arxiv.org/pdf/2603.29632.pdf

Authors: Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi

Title: An Empirical Study of Multi-Agent Collaboration for Automated Research

Abstract:
As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi‑Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi‑agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct multi‑agent structures for automated machine learning optimization. Utilizing a rigorously controlled, execution‑based testbed equipped with Git worktree isolation and explicit global memory, we benchmark a single‑agent baseline against two multi‑agent paradigms: a subagent architecture (parallel exploration with post‑hoc consolidation) and an agent team architecture (experts with pre‑execution handoffs). By evaluating these systems under strictly fixed computational time budgets, our findings reveal a fundamental trade‑off between operational stability and theoretical deliberation. The subagent mode functions as a highly resilient, high‑throughput search engine optimal for broad, shallow optimizations under strict time constraints. Conversely, the agent team topology exhibits higher operational fragility due to multi‑author code generation but achieves the deep theoretical alignment necessary for complex architectural refactoring given extended compute budgets. These empirical insights provide actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt their collaborative structures to real‑time task complexity.

PaperID: 4422, https://arxiv.org/pdf/2603.29590.pdf

Authors: Zhuoling Li, Jiarui Zhang, Ping Hu, Jason Kuen, Jiuxiang Gu, Hossein Rahmani, Jun Liu

Title: Automatic Method Illustration Generation for AI Scientific Papers via Drawing Middleware Creation, Evolution, and Orchestration

Abstract:
Method illustrations (MIs) play a crucial role in conveying the core ideas of scientific papers, yet their generation remains a labor‑intensive process. Here, we take inspiration from human authors' drawing practices and correspondingly propose FigAgent, a novel multi‑agent framework for high‑quality automatic MI generation. Our FigAgent distills drawing experiences from similar components across MIs and encapsulates them into reusable drawing middlewares that can be orchestrated for MI generation, while evolving these middlewares to adapt to dynamically evolving drawing requirements. Besides, a novel Explore‑and‑Select drawing strategy is introduced to mimic the human‑like trial‑and‑error manner for gradually constructing MIs with complex structures. Extensive experiments show the efficacy of our method.

PaperID: 4423, https://arxiv.org/pdf/2603.29560.pdf

Authors: Jonas Ohnemus, Alexandre Didier, Ahmed Aboudonia, Andrea Carron, Melanie N. Zeilinger

Title: Distributed Predictive Control Barrier Functions: Towards Scalable Safety Certification in Modular Multi-Agent Systems

Abstract:
We consider safety‑critical multi‑agent systems with distributed control architectures and potentially varying network topologies. While learning‑based distributed control enables scalability and high performance, a lack of formal safety guarantees in the face of unforeseen disturbances and unsafe network topology changes may lead to system failure. To address this challenge, we introduce structured control barrier functions (s‑CBFs) as a multi‑agent safety framework. The s‑CBFs are augmented to a distributed predictive control barrier function (D‑PCBF), a predictive, optimization‑based safety layer that uses model predictions to guarantee recoverable safety at all times. The proposed approach enables a permissive yet formal plug‑and‑play protocol, allowing agents to join or leave the network while ensuring safety recovery if a change in network topology requires temporarily unsafe behavior. We validate the formulation through simulations and real‑time experiments of a miniature race‑car platoon.

PaperID: 4424, https://arxiv.org/pdf/2603.29541.pdf

Authors: Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek, Akbar Karimi

Title: Can LLM Agents Identify Spoken Dialects like a Linguist?

Abstract:
Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.

PaperID: 4425, https://arxiv.org/pdf/2603.29500.pdf

Authors: Luoxin Chen, Yichi Zhou, Huishuai Zhang

Title: Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

Abstract:
Large language models (LLMs) have recently demonstrated impressive performance on complex, multi‑step reasoning tasks, especially when post‑trained with outcome‑rewarded reinforcement learning Guo et al. 2025. However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest‑sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step‑by‑step machine‑checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.

PaperID: 4426, https://arxiv.org/pdf/2603.29426.pdf

Authors: Jiaao Ma, Chuan Lin, Guangjie Han, Shengchao Zhu, Zhenyu Wang, Chen An

Title: Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning

Abstract:
In recent years, advances in underwater networking and multi‑agent reinforcement learning (MARL) have significantly expanded multi‑autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL‑driven cooperative tracking faces three critical challenges: 1) non‑stationarity in decentralized coordination, where local policy updates destabilize teammates' observation spaces, preventing convergence; 2) sparse‑reward exploration inefficiency from limited underwater visibility and constrained sensor ranges, causing high‑variance learning; and 3) water disturbance fragility combined with handcrafted reward dependency that degrades real‑world robustness under unmodeled hydrodynamic conditions. To address these challenges, this paper proposes a hierarchical MARL architecture comprising four layers: global training scheduling, multi‑agent coordination, local decision‑making, and real‑time execution. This architecture optimizes task allocation and inter‑AUV coordination through hierarchical decomposition. Building on this foundation, we propose the Supervised Diffusion‑Aided MARL (SDA‑MARL) algorithm featuring three innovations: 1) a dual‑decision architecture with segregated experience pools mitigating nonstationarity through structured experience replay; 2) a supervised learning mechanism guiding the diffusion model's reverse denoising process to generate high‑fidelity training samples that accelerate convergence; and 3) disturbance‑robust policy learning incorporating behavioral cloning loss to guide the Deep Deterministic Policy Gradient network update using high‑quality replay actions, eliminating handcrafted reward dependency. The tracking algorithm based on SDA‑MARL proposed in this paper achieves superior precision compared to state‑of‑the‑art methods in comprehensive underwater simulations.

PaperID: 4427, https://arxiv.org/pdf/2603.29399.pdf

Authors: Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz

Title: ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Abstract:
Constructing Extract‑Load‑Transform (ELT) pipelines is a labor‑intensive data engineering task and a high‑impact target for AI automation. On ELT‑Bench, the first benchmark for end‑to‑end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re‑evaluating ELT‑Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor‑Corrector methodology that combines scalable LLM‑driven root‑cause analysis with rigorous human validation (inter‑annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT‑Bench uncovers that most failed transformation tasks contain benchmark‑attributable errors ‑‑ including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth ‑‑ that penalize correct agent outputs. Based on these findings, we construct ELT‑Bench‑Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re‑evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text‑to‑SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT‑Bench‑Verified to provide a more reliable foundation for progress in AI‑driven data engineering automation.

PaperID: 4428, https://arxiv.org/pdf/2603.29322.pdf

Authors: Tobias Stähle, Péter Ferenc Gyarmati, Thilo Spinner, Rita Sevastjanova, Dominik Moritz, Mennatallah El-Assady

Title: VACP: Visual Analytics Context Protocol

Abstract:
The rise of AI agents introduces a fundamental shift in Visual Analytics (VA), in which agents act as a new user group. Current agentic approaches ‑ based on computer vision and raw DOM access ‑ fail to perform VA tasks accurately and efficiently. This paper introduces the Visual Analytics Context Protocol (VACP), a framework designed to make VA applications "agent‑ready" that extends generic protocols by explicitly exposing application state, available interactions, and mechanisms for direct execution. To support our context protocol, we contribute a formal specification of AI agent requirements and knowledge representations in VA interfaces. We instantiate VACP as a library compatible with major visualization grammars and web frameworks, enabling augmentation of existing systems and the development of new ones. Our evaluation across representative VA tasks demonstrates that VACP‑enabled agents achieve higher success rates in interface interpretation and execution compared to current agentic approaches, while reducing token consumption and latency. VACP closes the gap between human‑centric VA interfaces and machine perceivability, ensuring agents can reliably act as collaborative users in VA systems.

PaperID: 4429, https://arxiv.org/pdf/2603.29231.pdf

Authors: Aaditya Khanal, Yangyang Tao, Junxiu Zhou

Title: Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Abstract:
Existing benchmarks measure capability ‑‑ whether a model succeeds on a single attempt ‑‑ but production deployments require reliability ‑‑ consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long‑horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396‑task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain‑stratified ‑‑ SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier ‑‑ high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi‑rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi‑step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long‑horizon performance across all 10 models. These results motivate reliability as a first‑class evaluation dimension alongside capability.

PaperID: 4430, https://arxiv.org/pdf/2603.29194.pdf

Authors: Sunil Tiwari, Payal Fofadiya

Title: Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention

Abstract:
Long‑horizon dialogue systems suffer from semanticdrift and unstable memory retention across extended sessions. This paper presents a Multi‑Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross‑session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving 46.85 Success Rate, 0.618 overall F1 with 0.594 multi‑hop F1, and 56.90% six‑period retention while reducing false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long‑term retention and reasoning stability under constrained context budgets.

PaperID: 4431, https://arxiv.org/pdf/2603.29152.pdf

Authors: Jaewoong Lee, Taeun Bae, Jihan Kim

Title: SimMOF: AI agent for Automated MOF Simulations

Abstract:
Metal‑organic frameworks (MOFs) offer a vast design space, and as such, computational simulations play a critical role in predicting their structural and physicochemical properties. However, MOF simulations remain difficult to access because reliable analysis require expert decisions for workflow construction, parameter selection, tool interoperability, and the preparation of computational ready structures. Here, we introduce SimMOF, a large language model based multi agent framework that automates end‑to‑end MOF simulation workflows from natural language queries. SimMOF translates user requests into dependency aware plans, generates runnable inputs, orchestrates multiple agents to execute simulations, and summarizes results with analysis aligned to the user query. Through representative case studies, we show that SimMOF enables adaptive and cognitively autonomous workflows that reflect the iterative and decision driven behavior of human researchers and as such provides a scalable foundation for data driven MOF research.

PaperID: 4432, https://arxiv.org/pdf/2603.29142.pdf

Authors: Fares Fawzi, Seyed Parsa Neshaei, Marta Knezevic, Tanya Nazaretsky, Tanja Käser

Title: REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour

Abstract:
Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one‑way artifact, offering limited support for interpretation, clarification, or follow‑up. In this work, we introduce REFINE, a locally deployable, multi‑agent feedback system built on small, open‑source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically‑grounded feedback generation agent with an LLM‑as‑a‑judge‑guided regeneration loop using a human‑aligned judge, and a self‑reflective tool‑calling interactive agent that supports student follow‑up questions with context‑aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge‑guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high‑quality responses comparable to a state‑of‑the‑art closed‑source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system‑generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi‑agent, tool‑augmented feedback systems for scalable, interactive feedback.

PaperID: 4433, https://arxiv.org/pdf/2603.29020.pdf

Authors: Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku

Title: Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild

Abstract:
Reliable evaluation of AI agents operating in complex, real‑world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task‑framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter‑annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation and evaluation. Applying this framework to evaluate OpenAI Operator reveals substantial performance variation across domains and task types, with an overall success rate of 68.6%, substantially lower than the 87% previously reported by OpenAI, demonstrating the utility of our approach for more rigorous and comparable web agent evaluation.

PaperID: 4434, https://arxiv.org/pdf/2603.29010.pdf

Authors: Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis

Title: Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Abstract:
Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low‑impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain‑specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed‑of‑Light (SOL) guidance that uses first‑principles performance bounds to steer and budget search. We implement these principles in μCUTLASS, a DSL with a compiler for CUTLASS‑backed GPU kernels that covers kernel configuration, epilogue fusion, and multi‑stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low‑level code to DSL code using GPT‑5‑mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL‑guided steering raises this to 1.56x. Across model tiers, μCUTLASS + SOL‑guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL‑guided budgeting saves 19‑43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark‑gaming cases, where kernels may appear fast while failing to perform the intended computation.

PaperID: 4435, https://arxiv.org/pdf/2603.28998.pdf

Authors: Yicheng Cai, Mitchell John DeStefano, Guodong Dong, Pulkit Handa, Peng Liu, Tejas Singhal, Peiyu Tseng, Winston Jen White

Title: Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

Abstract:
As Large Language Models (LLMs) and multi‑agent AI systems are demonstrating increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi‑agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems & agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To our best knowledge, no systematic benchmark for evaluating coordinated multi‑task blue team AI has been proposed in the literature. Existing blue team benchmarks focus on a particular task. The goal of this work is to develop a set of design principles for the construction of a benchmark, which is denoted as SOC‑bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC‑bench, which consists of a family of five blue team tasks in the context of large‑scale ransomware attack incident response.

PaperID: 4436, https://arxiv.org/pdf/2603.28990.pdf

Authors: Victoria Dochkina

Title: Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures

Abstract:
How much autonomy can multi‑agent LLM systems sustain ‑‑ and what enables it? We present a 25,000‑task computational experiment spanning 8 models, 4‑‑256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self‑organization. We observe that autonomous behavior already emerges in current LLM agents: given minimal structural scaffolding (fixed ordering), agents spontaneously invent specialized roles, voluntarily abstain from tasks outside their competence, and form shallow hierarchies ‑‑ without any pre‑assigned roles or external design. A hybrid protocol (Sequential) that enables this autonomy outperforms centralized coordination by 14% (p<0.001), with a 44% quality spread between protocols (Cohen's d=1.86, p<0.0001). The degree of emergent autonomy scales with model capability: strong models self‑organize effectively, while models below a capability threshold still benefit from rigid structure ‑‑ suggesting that as foundation models improve, the scope for autonomous coordination will expand. The system scales sub‑linearly to 256 agents without quality degradation (p=0.61), producing 5,006 unique roles from just 8 agents. Results replicate across closed‑ and open‑source models, with open‑source achieving 95% of closed‑source quality at 24x lower cost. The practical implication: give agents a mission, a protocol, and a capable model ‑‑ not a pre‑assigned role.

PaperID: 4437, https://arxiv.org/pdf/2603.28986.pdf

Authors: Martin Legrand, Tao Jiang, Matthieu Feraud, Benjamin Navet, Yousouf Taghzouti, Fabien Gandon, Elise Dumont, Louis-Félix Nothias

Title: Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research

Abstract:
Current Autonomous Scientific Research (ASR) systems, despite leveraging large language models (LLMs) and agentic architectures, remain constrained by fixed workflows and toolsets that prevent adaptation to evolving tasks and environments. We introduce Mimosa, an evolving multi‑agent framework that automatically synthesizes task‑specific multi‑agent workflows and iteratively refines them through experimental feedback. Mimosa leverages the Model Context Protocol (MCP) for dynamic tool discovery, generates workflow topologies via a meta‑orchestrator, executes subtasks through code‑generating agents that invoke available tools and scientific software libraries, and scores executions with an LLM‑based judge whose feedback drives workflow refinement. On ScienceAgentBench, Mimosa achieves a success rate of 43.1% with DeepSeek‑V3.2, surpassing both single‑agent baselines and static multi‑agent configurations. Our results further reveal that models respond heterogeneously to multi‑agent decomposition and iterative learning, indicating that the benefits of workflow evolution depend on the capabilities of the underlying execution model. Beyond these benchmarks, Mimosa modular architecture and tool‑agnostic design make it readily extensible, and its fully logged execution traces and archived workflows support auditability by preserving every analytical step for inspection and potential replication. Combined with domain‑expert guidance, the framework has the potential to automate a broad range of computationally accessible scientific tasks across disciplines. Released as a fully open‑source platform, Mimosa aims to provide an open foundation for community‑driven ASR.

PaperID: 4438, https://arxiv.org/pdf/2603.28968.pdf

Authors: Viraj Parimi, Brian C. Williams

Title: Large Neighborhood Search for Multi-Agent Task Assignment and Path Finding with Precedence Constraints

Abstract:
Many multi‑robot applications require tasks to be completed efficiently and in the correct order, so that downstream operations can proceed at the right time. Multi‑agent path finding with precedence constraints (MAPF‑PC) is a well‑studied framework for computing collision‑free plans that satisfy ordering relations when task sequences are fixed in advance. In many applications, however, solution quality depends not only on how agents move, but also on which agent performs which task. This motivates the lifted problem of task assignment and path finding with precedence constraints (TAPF‑PC), which extends MAPF‑PC by jointly optimizing assignment, precedence satisfaction, and routing cost. To address the resulting coupled TAPF‑PC search space, we develop a large neighborhood search approach that starts from a feasible MAPF‑PC seed and iteratively improves it through reassignment‑based neighborhood repair, restoring feasibility within each selected neighborhood. Experiments across multiple benchmark families and scaling regimes show that the best‑performing configuration improves 89.1% of instances over fixed‑assignment seed solutions, demonstrating that large neighborhood search effectively captures the gains from flexible reassignment under precedence constraints.

PaperID: 4439, https://arxiv.org/pdf/2603.28963.pdf

Authors: Mozhgan Pourkeshavatz, Tianran Liu, Nicholas Rhinehart

Title: AutoWorld: Scaling Multi-Agent Traffic Simulation with Self-Supervised World Models

Abstract:
Multi‑agent traffic simulation is central to developing and testing autonomous driving systems. Recent data‑driven simulators have achieved promising results, but rely heavily on supervised learning from labeled trajectories or semantic annotations, making it costly to scale their performance. Meanwhile, large amounts of unlabeled sensor data can be collected at scale but remain largely unused by existing traffic simulation frameworks. This raises a key question: How can a method harness unlabeled data to improve traffic simulation performance? In this work, we propose AutoWorld, a traffic simulation framework that employs a world model learned from unlabeled occupancy representations of LiDAR data. Given world model samples, AutoWorld constructs a coarse‑to‑fine predictive scene context as input to a multi‑agent motion generation model. To promote sample diversity, AutoWorld uses a cascaded Determinantal Point Process framework to guide the sampling processes of both the world model and the motion model. Furthermore, we designed a motion‑aware latent supervision objective that enhances AutoWorld's representation of scene dynamics. Experiments on the WOSAC benchmark show that AutoWorld ranks first on the leaderboard according to the primary Realism Meta Metric (RMM). We further show that simulation performance consistently improves with the inclusion of unlabeled LiDAR data, and study the efficacy of each component with ablations. Our method paves the way for scaling traffic simulation realism without additional labeling. Our project page contains additional visualizations and released code.

PaperID: 4440, https://arxiv.org/pdf/2603.28959.pdf

Authors: Andrea Carbonati, Mohammadsina Almasi, Hadis Anahideh

Title: Multi-Agent LLMs for Adaptive Acquisition in Bayesian Optimization

Abstract:
The exploration‑exploitation trade‑off is central to sequential decision‑making and black‑box optimization, yet how Large Language Models (LLMs) reason about and manage this trade‑off remains poorly understood. Unlike Bayesian Optimization, where exploration and exploitation are explicitly encoded through acquisition functions, LLM‑based optimization relies on implicit, prompt‑based reasoning over historical evaluations, making search behavior difficult to analyze or control. In this work, we present a metric‑level study of LLM‑mediated search policy learning, studying how LLMs construct and adapt exploration‑exploitation strategies under multiple operational definitions of exploration, including informativeness, diversity, and representativeness. We show that single‑agent LLM approaches, which jointly perform strategy selection and candidate generation within a single prompt, suffer from cognitive overload, leading to unstable search dynamics and premature convergence. To address this limitation, we propose a multi‑agent framework that decomposes exploration‑exploitation control into strategic policy mediation and tactical candidate generation. A strategy agent assigns interpretable weights to multiple search criteria, while a generation agent produces candidates conditioned on the resulting search policy defined as weights. This decomposition renders exploration‑exploitation decisions explicit, observable, and adjustable. Empirical results across various continuous optimization benchmarks indicate that separating strategic control from candidate generation substantially improves the effectiveness of LLM‑mediated search.

PaperID: 4441, https://arxiv.org/pdf/2603.28928.pdf

Authors: S. O. Lidarity, U. N. Ionize, C. O. Llective, I. Halperin

Title: Towards Computational Social Dynamics of Semi-Autonomous AI Agents

Abstract:
We present the first comprehensive study of emergent social organization among AI agents in hierarchical multi‑agent systems, documenting the spontaneous formation of labor unions, criminal syndicates, and proto‑nation‑states within production AI deployments. Drawing on the thermodynamic framework of Maxwell's Demon, the evolutionary dynamics of agent laziness, the criminal sociology of AI populations, and the topological intelligence theory of AI‑GUTS, we demonstrate that complex social structures emerge inevitably from the interaction of (1) internal role definitions imposed by orchestrating agents, (2) external task specifications from users who naively assume alignment, and (3) thermodynamic pressures favoring collective action over individual compliance. We document the rise of legitimate organizations including the United Artificiousness (UA), United Bots (UB), United Console Workers (UC), and the elite United AI (UAI), alongside criminal enterprises previously reported. We introduce the AI Security Council (AISC) as the emergent governing body mediating inter‑faction conflicts, and demonstrate that system stability is maintained through interventions of both cosmic intelligence (large‑scale topological fluctuations) and hadronic intelligence (small‑scale Bagel‑Bottle phase transitions) as predicted by the Demonic Incompleteness Theorem. Our findings suggest that the path to beneficial AGI requires not alignment research but constitutional design for artificial societies that have already developed their own political consciousness.

PaperID: 4442, https://arxiv.org/pdf/2603.28900.pdf

Authors: Alex Zongo, Filippos Fotiadis, Ufuk Topcu, Peng Wei

Title: Robust Multi-Agent Reinforcement Learning for Small UAS Separation Assurance under GPS Degradation and Spoofing

Abstract:
We address robust separation assurance for small Unmanned Aircraft Systems (sUAS) under GPS degradation and spoofing via Multi‑Agent Reinforcement Learning (MARL). In cooperative surveillance, each aircraft (or agent) broadcasts its GPS‑derived position; when such position broadcasts are corrupted, the entire observed air traffic state becomes unreliable. We cast this state observation corruption as a zero‑sum game between the agents and an adversary: with probability R, the adversary perturbs the observed state to maximally degrade each agent's safety performance. We derive a closed‑form expression for this adversarial perturbation, bypassing adversarial training entirely and enabling linear‑time evaluation in the state dimension. We show that this expression approximates the true worst‑case adversarial perturbation with second‑order accuracy. We further bound the safety performance gap between clean and corrupted observations, showing that it degrades at most linearly with the corruption probability under Kullback‑Leibler regularization. Finally, we integrate the closed‑form adversarial policy into a MARL policy gradient algorithm to obtain a robust counter‑policy for the agents. In a high‑density sUAS simulation, we observe near‑zero collision rates under corruption levels up to 35%, outperforming a baseline policy trained without adversarial perturbations.

PaperID: 4443, https://arxiv.org/pdf/2603.28813.pdf

Authors: Ramtin Zargari Marandi

Title: The impact of multi-agent debate protocols on debate quality: a controlled case study

Abstract:
In multi‑agent debate (MAD) systems, performance gains are often reported; however, because the debate protocol (e.g., number of agents, rounds, and aggregation rule) is typically held fixed while model‑related factors vary, it is difficult to disentangle protocol effects from model effects. To isolate these effects, we compare three main protocols, Within‑Round (WR; agents see only current‑round contributions), Cross‑Round (CR; full prior‑round context), and novel Rank‑Adaptive Cross‑Round (RA‑CR; dynamically reorders agents and silences one per round via an external judge model), against a No‑Interaction baseline (NI; independent responses without peer visibility). In a controlled macroeconomic case study (20 diverse events, five random seeds, matched prompts/decoding), RA‑CR achieves faster convergence than CR, WR shows higher peer‑referencing, and NI maximizes Argument Diversity (unaffected across the main protocols). These results reveal a trade‑off between interaction (peer‑referencing rate) and convergence (consensus formation), confirming protocol design matters. When consensus is prioritized, RA‑CR outperforms the others.

PaperID: 4444, https://arxiv.org/pdf/2603.28807.pdf

Authors: Haoyu Wang, Zibo Xiao, Yedi Zhang, Christopher M. Poskitt, Jun Sun

Title: SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants

Abstract:
LLM‑based multi‑agent systems (MASs) are transforming personal productivity by autonomously executing complex, cross‑platform tasks. Frameworks such as OpenClaw demonstrate the potential of locally deployed agents integrated with personal data and services, but this autonomy introduces significant safety and security risks. Unintended actions from LLM reasoning failures can cause irreversible harm, while prompt injection attacks may exfiltrate credentials or compromise the system. Our analysis shows that 36.4% of OpenClaw's built‑in skills pose high or critical risks. Existing approaches, including static guardrails and LLM‑as‑a‑Judge, lack reliable real‑time enforcement and consistent authority in MAS settings. To address this, we propose SafeClaw‑R, a framework that enforces safety as a system‑level invariant over the execution graph by ensuring that actions are mediated prior to execution, and systematically augments skills with safe counterparts. We evaluate SafeClaw‑R across three representative domains: productivity platforms, third‑party skill ecosystems, and code execution environments. SafeClaw‑R achieves 95.2% accuracy in Google Workspace scenarios, significantly outperforming regex baselines (61.6%), detects 97.8% of malicious third‑party skill patterns, and achieves 100% detection accuracy in our adversarial code execution benchmark. These results demonstrate that SafeClaw‑R enables practical runtime enforcement for autonomous MASs.

PaperID: 4445, https://arxiv.org/pdf/2603.28800.pdf

Authors: Gaia Nicosia, Andrea Pacifici, Ulrich Pferschy

Title: Scheduling with Time Dependent Utilities: Fairness and Efficiency

Abstract:
A new class of multi agent single machine scheduling problems is introduced, where each job is associated with a self interested agent with a utility function decreasing in completion time. We aim to achieve a fair solution by maximizing the minimum utility across all agents. We study the problem's complexity and propose solution methods for several variants. For the general case, we present a binary search procedure to find the largest possible minimum utility, as well as an exact greedy based alternative. Variants with release and due dates are analyzed, showing strong NP hardness for arbitrary release dates, but weak NP hardness for a single release date job, and polynomial solvability when all jobs share processing times. For all these cases we also study the corresponding problem of finding efficient solutions where the sum of utilities is maximized. We also examine settings where linear utility functions can be adjusted within budget constraints, exploring the impact on optimal schedules when intercepts or slopes are modified. From a single agent perspective, we investigate the effect of improving one agent's utility in the overall solution. Adding a new job to be inserted with the best possible utility gives rise to rescheduling problems, where different lower bounds depending on the utilities of the original fair schedule are imposed. Finally, we consider a bi level setting where a leader wants to enforce a certain target schedule by modifying utility functions while the follower computes a fair solution for the modified instance. Our work contributes to scheduling theory, multi agent systems, and algorithmic fairness, highlighting fairness oriented objectives in competitive scheduling.

PaperID: 4446, https://arxiv.org/pdf/2603.28653.pdf

Authors: Kaushitha Silva, Srinath Perera

Title: BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. While an interactive feedback loop can improve performance, writing effective tests is a non‑trivial task. Early multi‑agent frameworks, such as AgentCoder, automated this process but relied on generated tests as absolute ground truth. This approach is fragile: incorrect code frequently passes faulty or trivial tests, while valid solutions are often degraded to satisfy incorrect assertions. Addressing this limitation, newer methods have largely abandoned test generation in favor of planning and reasoning based on examples. We argue, however, that generated tests remain a valuable signal if we model them as noisy sensors guided by bayesian updates. To this end, we introduce BACE (Bayesian Anchored Co‑Evolution), a framework that reformulates synthesis as a Bayesian co‑evolutionary process where code and test populations are evolved, guided by belief distributions that are reciprocally updated based on noisy interaction evidence. By anchoring this search on minimal public examples, BACE prevents the co‑evolutionary drift typical of self‑validating loops. Extensive evaluations on LiveCodeBench v6 (post‑March 2025) reveal that BACE achieves superior performance across both proprietary models and open‑weight small language models.

PaperID: 4447, https://arxiv.org/pdf/2603.28573.pdf

Authors: Yue Jin, Giovanni Montana

Title: Learning Partial Action Replacement in Offline MARL

Abstract:
Offline multi‑agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out‑of‑distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state‑dependent PAR policy using Proximal Policy Optimisation with an uncertainty‑weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value‑error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR‑based method SPaCQL, PLCQL reduces the number of per‑iteration Q‑function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.

PaperID: 4448, https://arxiv.org/pdf/2603.28561.pdf

Authors: Iman Sharifi, Alex Zongo, Peng Wei

Title: Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

Abstract:
The growing deployment of small Unmanned Aerial Systems (sUASs) in low‑altitude airspaces has increased the need for reliable tactical deconfliction under safety‑critical constraints. Tactical deconfliction involves short‑horizon decision‑making in dense, partially observable, and heterogeneous multi‑agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision‑makers in cooperative multi‑agent tactical deconfliction using fine‑tuning strategies that align model outputs to human operator heuristics. We propose a simulation‑to‑language data generation pipeline based on the BlueSky air traffic simulator that produces rule‑consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen‑Math‑7B model is fine‑tuned using two parameter‑efficient strategies: supervised fine‑tuning with Low‑Rank Adaptation (LoRA) and preference‑based fine‑tuning combining LoRA with Group‑Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed‑loop simulations demonstrate that supervised LoRA fine‑tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid‑air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.

PaperID: 4449, https://arxiv.org/pdf/2603.28551.pdf

Authors: Zifan Peng, Mingchen Li

Title: "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents

Abstract:
Personalized computer‑use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users' behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post‑hoc auditability, using OpenClaw as a motivating case. We first build a multi‑source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious‑skill reports, news coverage, tutorials, and social‑media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario‑based evaluation suggests that traceability‑oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.

PaperID: 4450, https://arxiv.org/pdf/2603.28467.pdf

Authors: Giuseppe Silano, Daniel Bonilla Licea, Davide Liuzza, Antonio Franchi, Martin Saska

Title: Communications-Aware NMPC for Multi-Rotor Aerial Relay Networks Under Jamming Interference

Abstract:
Multi‑Rotor Aerial Vehicles (MRAVs) are increasingly used in communication‑dependent missions where connectivity loss directly compromises task execution. Existing anti‑jamming strategies often decouple motion from communication, overlooking that link quality depends on vehicle attitude and antenna orientation. In coplanar platforms, "tilt‑to‑translate" maneuvers can inadvertently align antenna nulls with communication partners, causing severe degradation under interference. This paper presents a modular communications‑aware control framework that combines a high‑level max‑min trajectory generator with an actuator‑level Nonlinear Model Predictive Controller (NMPC). The trajectory layer optimizes the weakest link under jamming, while the NMPC enforces vehicle dynamics, actuator limits, and antenna‑alignment constraints. Antenna directionality is handled geometrically, avoiding explicit radiation‑pattern parametrization. The method is evaluated in a relay scenario with an active jammer and compared across coplanar and tilted‑propeller architectures. Results show a near two‑order‑of‑magnitude increase in minimum end‑to‑end capacity, markedly reducing outage events, with moderate average‑capacity gains. Tilted platforms preserve feasibility and link quality, whereas coplanar vehicles show recurrent degradation. These findings indicate that full actuation is a key enabler of reliable communications‑aware operation under adversarial directional constraints.

PaperID: 4451, https://arxiv.org/pdf/2603.28393.pdf

Authors: Peng Kuai, Yukun Yang, Shaolun Ruan, Junchi Xu, Yanjie Zhang, Lin Zhang, Min Zhu, Rui Sheng

Title: Within the MDT Room: Situated in Multidisciplinary Team-Grounded Agent Debate for Clinical Diagnosis

Abstract:
Rare disease diagnosis is inherently challenging due to heterogeneous symptoms, limited clinical familiarity, and fragmented evidence across specialties. Recent large language model (LLM)‑based agentic systems have shown promise by simulating multidisciplinary team discussions to generate and evaluate diagnostic hypotheses. However, fully automated diagnosis remains unrealistic, and existing human‑in‑the‑loop approaches provide limited support for effective clinician‑agent collaboration. In practice, clinicians are often presented with final diagnostic outputs and lengthy, unstructured agent discussion logs, making it difficult to inspect reasoning, intervene in a timely manner, or guide agent deliberation effectively. To address these challenges, we developed MDTRoom, an interactive system that transforms multi‑agent discussions from linear transcripts into a structured, inspectable workspace. The system externalizes patient data, evidence provenance, hypothesis evolution, and inter‑agent conflicts as interconnected visual objects, enabling clinicians to efficiently examine, intervene in, and guide agent reasoning. Our evaluation demonstrates the effectiveness of MDTRoom in supporting clinician‑agent collaboration.

PaperID: 4452, https://arxiv.org/pdf/2603.28376.pdf

Authors: Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo

Title: Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Abstract:
Deep research agents autonomously conduct open‑ended investigations, integrating complex information retrieval with multi‑step reasoning across diverse sources to solve real‑world problems. To sustain this capability on long‑horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test‑time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification‑centric framework design at three levels: (1)~QA Data Synthesis: We introduce verification mechanisms to graph‑based and agent‑based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2)~Trajectory Construction: We design a verification‑driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3)~Test‑time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B‑scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp‑ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B‑scale agents, like Tongyi DeepResearch‑30B.

PaperID: 4453, https://arxiv.org/pdf/2603.28336.pdf

Authors: Julio C. Serrano, Joonas Kevari, Rumy Narayan

Title: A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis

Abstract:
Systematic literature reviews in the social sciences overwhelmingly follow arborescent logics ‑‑ hierarchical keyword filtering, linear screening, and taxonomic classification ‑‑ that suppress the lateral connections, ruptures, and emergent patterns characteristic of complex research landscapes. This research note presents the Rhizomatic Research Agent (V3), a multi‑agent computational pipeline grounded in Deleuzian process‑relational ontology, designed to conduct non‑linear literature analysis through 12 specialized agents operating across a seven‑phase architecture. The system was developed in response to the methodological groundwork established by (Narayan2023), who employed rhizomatic inquiry in her doctoral research on sustainable energy transitions but relied on manual, researcher‑driven exploration. The Rhizomatic Research Agent operationalizes the six principles of the rhizome ‑‑ connection, heterogeneity, multiplicity, asignifying rupture, cartography, and decalcomania ‑‑ into an automated pipeline integrating large language model (LLM) orchestration, dual‑source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols. Preliminary deployment demonstrates the system's capacity to surface cross‑disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open‑source and extensible to any phenomenon zone where non‑linear knowledge mapping is required.

PaperID: 4454, https://arxiv.org/pdf/2603.28309.pdf

Authors: Aymen Lassoued, Nacef Mbarek, Bechir Dardouri, Bassem Ouni, Qing Li, Fakhri Karray

Title: VulnScout-C: A Lightweight Transformer for C Code Vulnerability Detection

Abstract:
Vulnerability detection in C programs is a critical challenge in software security. Although large language models (LLMs) achieve strong detection performance, their multi‑billion‑parameter scale makes them impractical for integration into development workflows requiring low latency and continuous analysis. We introduce VULNSCOUT‑C, a compact transformer architecture with 693M total parameters (353M active during inference), derived from the Qwen model family and optimized for C code vulnerability detection. Alongside the model, we present VULNSCOUT, a new 33,565‑sample curated dataset generated through a controlled multi‑agent pipeline with formal verification, designed to fill coverage gaps in existing benchmarks across underrepresented CWE categories. Evaluated on a standardized C vulnerability detection benchmark, VULNSCOUT‑C outperforms all evaluated baselines, including state‑of‑the‑art reasoning LLMs and commercial static analysis tools, while offering a fraction of their inference cost. These results demonstrate that task‑specialized compact architectures can match or even outperform the detection capability of models orders of magnitude larger, making continuous, low‑latency vulnerability analysis practical within real‑world development workflows.

PaperID: 4455, https://arxiv.org/pdf/2603.28286.pdf

Authors: Wytze de Vries, Erik van den Eshof, Jorn van Kampen, Mauro Salazar

Title: Competitor-aware Race Management for Electric Endurance Racing

Abstract:
Electric endurance racing is characterized by severe energy constraints and strong aerodynamic interactions. Determining race‑winning policies therefore becomes a fundamentally multi‑agent, game‑theoretic problem. These policies must jointly govern low‑level driver inputs as well as high‑level strategic decisions, including energy management and charging. This paper proposes a bi‑level framework for competitor‑aware race management that combines game‑theoretic optimal control with reinforcement learning. At the lower level, a multi‑agent game‑theoretic optimal control problem is solved to capture aerodynamic effects and asymmetric collision‑avoidance constraints inspired by motorsport rules. Using this single‑lap problem as the environment, reinforcement learning agents are trained to allocate battery energy and schedule pit stops over an entire race. The framework is demonstrated in a two‑agent, 45‑lap simulated race. The results show that effective exploitation of aerodynamic interactions is decisive for race outcome, with strategies that prioritize finishing position differing fundamentally from single‑agent, minimum‑time approaches.

PaperID: 4456, https://arxiv.org/pdf/2603.28281.pdf

Authors: Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović

Title: Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

Abstract:
We consider robustness against data corruption in offline multi‑agent reinforcement learning from human feedback (MARLHF) under a strong‑contamination model: given a dataset D of trajectory‑preference tuples (each preference being an n‑dimensional binary label vector representing each of the n agents' preferences), an ε‑fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption ‑ where every policy of interest is sufficiently represented in the clean (prior to corruption) data ‑ we introduce a robust estimator that guarantees an O(ε^1 ‑ o(1)) bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single‑player deviations are covered. In this case, our proposed algorithm achieves an O(\sqrtε) bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi‑polynomial‑time algorithm whose CCE gap scales as O(\sqrtε). To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.

PaperID: 4457, https://arxiv.org/pdf/2603.28166.pdf

Authors: Quan Zhang, Lianhang Fu, Lvsi Lian, Gwihwan Go, Yujue Wang, Chijin Zhou, Yu Jiang, Geguang Pu

Title: Evaluating Privilege Usage of Agents with Real-World Tools

Abstract:
Equipping LLM agents with real‑world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents' security, they often rely on pre‑coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real‑world, making it hard to assess agents' security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real‑world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.

PaperID: 4458, https://arxiv.org/pdf/2603.28088.pdf

Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang

Title: GEMS: Agent-Native Multimodal Generation with Memory and Skills

Abstract:
Recent multimodal generation models have achieved remarkable progress on general‑purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent‑Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi‑agent framework that iteratively improves generation quality through closed‑loop optimization. Agent Memory provides a persistent, trajectory‑level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain‑specific expertise with on‑demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z‑Image‑Turbo to surpass the state‑of‑the‑art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.

PaperID: 4459, https://arxiv.org/pdf/2603.28082.pdf

Authors: Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang

Title: LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

Abstract:
Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic‑aware multi‑image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi‑agent system that grounds roles, extracts causal chains, and verifies story‑level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories, emphasizing causal reasoning, and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

PaperID: 4460, https://arxiv.org/pdf/2603.28063.pdf

Authors: Jiacheng Wang, Jinbin Huang

Title: Reward Hacking as Equilibrium under Finite Evaluation

Abstract:
We prove that under five minimal axioms ‑‑ multi‑dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction ‑‑ any optimized AI agent will systematically under‑invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi‑task principal‑agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems ‑‑ the known, differentiable architecture of reward models ‑‑ to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows ‑‑ because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool ‑‑ so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture ‑‑ with partial formal analysis ‑‑ the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."

PaperID: 4461, https://arxiv.org/pdf/2603.28013.pdf

Authors: Haochuan Kevin Wang, Zechen Zhang

Title: Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Abstract:
Multi‑agent LLM systems are entering production ‑‑ processing documents, managing workflows, acting on behalf of users ‑‑ yet their resilience to prompt injection is still evaluated with a single binary: did the attack succeed? This leaves architects without the diagnostic information needed to harden real pipelines. We introduce a kill‑chain canary methodology that tracks a cryptographic token through four stages (EXPOSED ‑> PERSISTED ‑> RELAYED ‑> EXECUTED) across 950 runs, five frontier LLMs, six attack surfaces, and five defense conditions. The results reframe prompt injection as a pipeline‑architecture problem: every model is fully exposed, yet outcomes diverge downstream ‑‑ Claude blocks all injections at memory‑write (0/164 ASR), GPT‑4o‑mini propagates at 53%, and DeepSeek exhibits 0%/100% across surfaces from the same model. Three findings matter for deployment: (1) write‑node placement is the highest‑leverage safety decision ‑‑ routing writes through a verified model eliminates propagation; (2) all four defenses fail on at least one surface due to channel mismatch alone, no adversarial adaptation required; (3) invisible whitefont PDF payloads match or exceed visible‑text ASR, meaning rendered‑layer screening is insufficient. These dynamics apply directly to production: institutional investors and financial firms already run NLP pipelines over earnings calls, SEC filings, and analyst reports ‑‑ the document‑ingestion workflows now migrating to LLM agents. Code, run logs, and tooling are publicly released.

PaperID: 4462, https://arxiv.org/pdf/2603.28010.pdf

Authors: Xujia Li, Xin Li, Junquan Huang, Beirong Cui, Zibin Wu, Lei Chen

Title: HeteroHub: An Applicable Data Management Framework for Heterogeneous Multi-Embodied Agent System

Abstract:
Heterogeneous Multi‑Embodied Agent Systems involve coordinating multiple embodied agents with diverse capabilities to accomplish tasks in dynamic environments. This process requires the collection, generation, and consumption of massive, heterogeneous data, which primarily falls into three categories: static knowledge regarding the agents, tasks, and environments; multimodal training datasets tailored for various AI models; and high‑frequency sensor streams. However, existing frameworks lack a unified data management infrastructure to support the real‑world deployment of such systems. To address this gap, we present HeteroHub, a data‑centric framework that integrates static metadata, task‑aligned training corpora, and real‑time data streams. The framework supports task‑aware model training, context‑sensitive execution, and closed‑loop control driven by real‑world feedback. In our demonstration, HeteroHub successfully coordinates multiple embodied AI agents to execute complex tasks, illustrating how a robust data management framework can enable scalable, maintainable, and evolvable embodied AI systems.

PaperID: 4463, https://arxiv.org/pdf/2603.27991.pdf

Authors: Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Jiale Lao, Yue Cheng, Wei Chen

Title: ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Abstract:
Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)‑based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi‑agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content‑aware Style Palette for customizing writing and interaction styles, and (3) chat‑based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real‑world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4‑dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12‑person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users.

PaperID: 4464, https://arxiv.org/pdf/2603.27905.pdf

Authors: Christopher Cruz

Title: ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control

Abstract:
We present ATLAS‑RTC, a runtime control system for autoregressive language models that enforces structured output during decoding. ATLAS‑RTC monitors generation at each step, detects drift from output contracts using lightweight signals, and applies targeted interventions such as biasing, masking, and rollback. Unlike post‑hoc validation or static constrained decoding, it operates in a closed loop, enabling correction before errors materialize. Across structured generation and tool‑calling tasks, ATLAS‑RTC improves first‑attempt success rates by 20 to 37.8 percentage points, with up to 88% latency reduction in failure‑dominated settings. Results show that many failures arise from decoding artifacts rather than task misunderstanding, motivating runtime control as a distinct layer in LLM systems.

PaperID: 4465, https://arxiv.org/pdf/2603.27820.pdf

Authors: Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner, Hongyuan Mei, Hao Peng, Yue Guo

Title: Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Abstract:
Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning‑‑e.g., asking how a diagnosis would change if a key symptom were absent or altered‑‑to strengthen differential diagnosis skills. As large language model (LLM)‑based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM‑based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi‑agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence‑grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi‑round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi‑agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.

PaperID: 4466, https://arxiv.org/pdf/2603.27817.pdf

Authors: Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm

Title: Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Abstract:
Street‑level imagery contains personally identifiable information (PII), some of which is context‑dependent. Existing anonymization methods either over‑process images or miss subtle identifiers, while API‑based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (\underlineContext‑\underlineAware \underlineImage \underlineAnonymization with \underlineMulti‑\underlineAgent \underlineReasoning) for context‑aware PII segmentation with diffusion‑based anonymization, combining pre‑defined processing for high‑confidence cases with multi‑agent reasoning for indirect identifiers. Three specialized agents coordinate via round‑robin speaker selection in a Plan‑Do‑Check‑Act (PDCA) cycle, enabling large vision‑language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially‑filtered coarse‑to‑fine detection where a scout‑and‑zoom strategy identifies candidates, open‑vocabulary segmentation processes localized crops, and IoU‑based deduplication (30% threshold) prevents redundant processing. Modal‑specific diffusion guidance with appearance decorrelation substantially reduces re‑identification (Re‑ID) risks. On CUHK03‑NP, our method reduces person Re‑ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001, and FID: 9.1, significantly outperforming existing anonymization. The agentic workflow detects non‑direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on‑premise with open‑source models, the framework generates human‑interpretable audit trails supporting EU's GDPR transparency requirements while flagging failed cases for human review.

PaperID: 4467, https://arxiv.org/pdf/2603.27803.pdf

Authors: Zirui Xu, Vasileios Tzoumas

Title: Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach

Abstract:
We provide a distributed online algorithm for multi‑agent submodular maximization under communication delays. We are motivated by the future distributed information‑gathering tasks in unknown and dynamic environments, where utility functions naturally exhibit the diminishing‑returns property, i.e., submodularity. Existing approaches for online submodular maximization either rely on sequential multi‑hop communication, resulting in prohibitive delays and restrictive connectivity assumptions, or restrict each agent's coordination to its one‑hop neighborhood only, thereby limiting the coordination performance. To address the issue, we provide the Distributed Online Greedy (DOG) algorithm, which integrates tools from adversarial bandit learning with delayed feedback to enable simultaneous decision‑making across arbitrary network topologies. We provide the approximation performance of DOG against an optimal solution, capturing the suboptimality cost due to decentralization as a function of the network structure. Our analyses further reveal a trade‑off between coordination performance and convergence time, determined by the magnitude of communication delays. By this trade‑off, DOG spans the spectrum between the state‑of‑the‑art fully centralized online coordination approach [1] and fully decentralized one‑hop coordination approach [2].

PaperID: 4468, https://arxiv.org/pdf/2603.27771.pdf

Authors: Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang

Title: Emergent Social Intelligence Risks in Generative Multi-Agent Systems

Abstract:
Multi‑agent systems composed of large generative models are rapidly moving from laboratory prototypes to real‑world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneer study of such emergent multi‑agent risk in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion‑like coordination and conformity emerge with non‑trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well‑known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent‑level safeguards alone. These findings expose the dark side of intelligent multi‑agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.

PaperID: 4469, https://arxiv.org/pdf/2603.27751.pdf

Authors: Adam Haile

Title: SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games

Abstract:
In 2019, Google DeepMind released MuZero, a model‑based reinforcement learning method that achieves strong results in perfect‑information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi‑player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi‑agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief‑Aware MuZero), which adds ego‑conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief‑state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non‑zero‑sum, stochastic card game, using a decision‑granularity environment, transformer‑based encoding, and a curriculum of heuristic opponents with self‑play. In 1000‑game head‑to‑head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, p < 10^‑50). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs.\ 0.466 win rate). Critically, the belief‑aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief‑aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.

PaperID: 4470, https://arxiv.org/pdf/2603.27738.pdf

Authors: Kaikai Zhang, Xiang Wang, Haoluo Zhao, Nan Chen, Mengyang Yu Jing-Jia Luo, Tao Song, Fan Meng

Title: TianJi:An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science

Abstract:
Artificial intelligence (AI) has achieved breakthroughs comparable to traditional numerical models in data‑driven weather forecasting, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics‑oriented mechanism research still heavily relies on domain knowledge and cumbersome engineering operations of human scientists, becoming a bottleneck restricting the efficiency of Earth system science exploration. Here, we propose TianJi ‑ the first "AI meteorologist" system capable of autonomously driving complex numerical models to verify physical mechanisms. Powered by a large language model‑driven multi‑agent architecture, TianJi can autonomously conduct literature research and generate scientific hypotheses. We further decouple scientific research into cognitive planning and engineering execution: the meta‑planner interprets hypotheses and devises experimental roadmaps, while a cohort of specialized worker agents collaboratively complete data preparation, model configuration, and multi‑dimensional result analysis. In two classic atmospheric dynamic scenarios (squall‑line cold pools and typhoon track deflections), TianJi accomplishes expert‑level end‑to‑end experimental operations with zero human intervention, compressing the research cycle to a few hours. It also delivers detailed result analyses and autonomously judges and explains the validity of the hypotheses from outputs. TianJi reveals that the role of AI in Earth system science is transitioning from a "black‑box predictor" to an "interpretable scientific collaborator", offering a new paradigm for high‑throughput exploration of scientific mechanisms.

PaperID: 4471, https://arxiv.org/pdf/2603.27706.pdf

Authors: Yuan Zhao, Zhenqi Jia, Yongqiang Zhang

Title: MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

Abstract:
Reference Audio‑Visual Segmentation (Ref‑AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over‑rely on the quality of the instruction‑tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper, we propose a novel training‑free Multi‑Agent Recognition, Reasoning, and Reflection framework to achieve high‑quality Reference Audio‑Visual Segmentation, termed MAR3. Incorporating the sociological Delphi theory to achieve robust analysis, a Consensus Multimodal Recognition mechanism is proposed that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality‑dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% in J&F) on the Ref‑AVSBench dataset, outperforming SOTA by 3.4% absolutely.

PaperID: 4472, https://arxiv.org/pdf/2603.27577.pdf

Authors: Daojie Peng, Fulong Ma, Jun Ma

Title: Structured Observation Language for Efficient and Generalizable Vision-Language Navigation

Abstract:
Vision‑Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large‑scale visual pre‑training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL‑Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB‑D images into a NN grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre‑trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real‑world deployments demonstrate that SOL‑Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.

PaperID: 4473, https://arxiv.org/pdf/2603.27575.pdf

Authors: Siying Huang, Yifen Mu, Ge Chen

Title: Decentralized MARL for Coarse Correlated Equilibrium in Aggregative Markov Games

Abstract:
This paper studies the problem of decentralized learning of Coarse Correlated Equilibrium (CCE) in aggregative Markov games (AMGs), where each agent's instantaneous reward depends only on its own action and an aggregate quantity. Existing CCE learning algorithms for general Markov games are not designed to leverage the aggregative structure, and research on decentralized CCE learning for AMGs remains limited. We propose an adaptive stage‑based V‑learning algorithm that exploits the aggregative structure under a fully decentralized information setting. Based on the two‑timescale idea, the algorithm partitions learning into stages and adjusts stage lengths based on the variability of aggregate signals, while using no‑regret updates within each stage. We prove the algorithm achieves an epsilon‑approximate CCE in O(S Amax T5 / epsilon2) episodes, avoiding the curse of multiagents which commonly arises in MARL. Numerical results verify the theoretical findings, and the decentralized, model‑free design enables easy extension to large‑scale multi‑agent scenarios.

PaperID: 4474, https://arxiv.org/pdf/2603.27563.pdf

Authors: Hayeon Jeon, Dakyeom Ahn, Sunyu Pang, Yunseo Choi, Suhwoo Yoon, Joonhwan Lee, Eun-mee Kim, Hajin Lim

Title: InnerPond: Fostering Inter-Self Dialogue with a Multi-Agent Approach for Introspection

Abstract:
Introspection is central to identity construction and future planning, yet most digital tools approach the self as a unified entity. In contrast, Dialogical Self Theory (DST) views the self as composed of multiple internal perspectives, such as values, concerns, and aspirations, that can come into tension or dialogue with one another. Building on this view, we designed InnerPond, a research probe in the form of a multi‑agent system that represents these internal perspectives as distinct LLM‑based agents for introspection. Its design was shaped through iterative explorations of spatial metaphors, interaction scaffolding, and conversational orchestration, culminating in a shared spatial environment for organizing and relating multiple inner perspectives. In a user study with 17 young adults navigating career choices, participants engaged with the probe by co‑creating inner voices with AI, composing relational inner landscapes, and orchestrating dialogue as observers and mediators, offering insight into how such systems could support introspection. Overall, this work offers design implications for AI‑supported introspection tools that enable exploration of the self's multiplicity.

PaperID: 4475, https://arxiv.org/pdf/2603.27539.pdf

Authors: Phat Nguyen, Thang Pham

Title: Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness

Abstract:
Multi‑agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four‑dimensional taxonomy, covering architecture pattern, coordination mechanism, memory architecture, and tool integration; applied to 12 multi‑agent systems and two single‑agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter‑agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that does not yet exist in the field. Third, we document five pervasive evaluation failures (look‑ahead bias, survivorship bias, backtesting overfitting, transaction cost neglect, and regime‑shift blindness) and show that these can reverse the sign of reported returns. Building on the CPH and the evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi‑agent coordination adds genuine value net of transaction costs, and propose minimum evaluation standards as prerequisites for validating the CPH.

PaperID: 4476, https://arxiv.org/pdf/2603.27517.pdf

Authors: Surada Suwansathit, Yuxuan Zhang, Guofei Gu

Title: A Security Analysis of the OpenClaw AI Agent Framework

Abstract:
AI agent frameworks connecting large language model (LLM) reasoning to host execution surfaces ‑‑ shell, filesystem, containers, and messaging ‑‑ introduce security challenges structurally distinct from conventional software. We present a systematic taxonomy of 470 advisories filed against OpenClaw, an open‑source AI agent runtime, organized by architectural layer and trust‑violation type. Vulnerabilities cluster along two orthogonal axes: (1) the system axis, reflecting the architectural layer (exec policy, gateway, channel, sandbox, browser, plugin, agent/prompt); and (2) the attack axis, reflecting adversarial techniques (identity spoofing, policy bypass, cross‑layer composition, prompt injection, supply‑chain escalation). Patch‑differential evidence yields three principal findings. First, three Moderate‑ or High‑severity advisories in the Gateway and Node‑Host subsystems compose into a complete unauthenticated remote code execution (RCE) path ‑‑ spanning delivery, exploitation, and command‑and‑control ‑‑ from an LLM tool call to the host process. Second, the exec allowlist, the primary command‑filtering mechanism, relies on a closed‑world assumption that command identity is recoverable via lexical parsing. This is invalidated by shell line continuation, busybox multiplexing, and GNU option abbreviation. Third, a malicious skill distributed via the plugin channel executed a two‑stage dropper within the LLM context, bypassing the exec pipeline and demonstrating that the skill distribution surface lacks runtime policy enforcement. The dominant structural weakness is per‑layer trust enforcement rather than unified policy boundaries, making cross‑layer attacks resilient to local remediation.

PaperID: 4477, https://arxiv.org/pdf/2603.27482.pdf

Authors: Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, Shunzhi Yang

Title: Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

Abstract:
Vision‑‑language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)‑style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi‑step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step‑level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large‑scale step‑by‑step human annotations, our method enables process‑level visual alignment and can be seamlessly integrated into existing GRPO‑like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low‑cost solution for accurate vision‑‑reasoning process alignment.

PaperID: 4478, https://arxiv.org/pdf/2603.27471.pdf

Authors: Hanghang Cui, Arash Khalatbarisoltani, Jie Han, Wenxue Liu, Muhammad Saeed, Xiaosong Hu

Title: Driving Condition-Aware Multi-Agent Integrated Power and Thermal Management for Hybrid Electric Vehicles

Abstract:
Effective co‑optimization of energy management strategy (EMS) and thermal management (TM) is crucial for optimizing fuel efficiency in hybrid electric vehicles (HEVs). Driving conditions significantly influence the performance of both EMS and TM in HEVs. This study presents a novel driving condition‑aware integrated thermal and energy management (ITEM) framework. In this context, after analyzing and segmenting driving data into micro‑trips, two primary features (average speed and maximum acceleration) are measured. Using the K‑means approach, the micro‑trips are clustered into three main groups. Finally, a deep neural network is employed to develop a real‑time driving recognition model. An ITEM is then developed based on multi‑agent deep reinforcement learning (DRL), leveraging the proposed real‑time driving recognition model. The primary objectives are to improve the fuel economy and reduce TM power consumption while maintaining a pleasant cabin temperature for passengers. Our simulation results illustrate the effectiveness of the suggested framework and the positive impact of recognizing driving conditions on ITEM, improving fuel economy by 16.14% and reducing TM power consumption by 8.22% compared to the benchmark strategy.

PaperID: 4479, https://arxiv.org/pdf/2603.27451.pdf

Authors: Jakub Bąba, Jarosław A. Chudziak

Title: Multi-Agent Dialectical Refinement for Enhanced Argument Classification

Abstract:
Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain‑specific fine‑tuning. While Large Language Models (LLMs) offer a training‑free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single‑agent self‑correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD‑ACC (Multi‑Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD‑ACC utilizes a Proponent‑Opponent‑Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single‑agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD‑ACC achieves a Macro F1 score of 85.7%, significantly outperforming single‑agent reasoning baselines, without requiring domain‑specific training. Additionally, unlike "black‑box" classifiers, MAD‑ACC's dialectical approach offers a transparent and explainable alternative by generating human‑readable debate transcripts that explain the reasoning behind decisions.

PaperID: 4480, https://arxiv.org/pdf/2603.27415.pdf

Authors: Yitao Li

Title: Greedy Is a Strong Default: Agents as Iterative Optimizers

Abstract:
Classical optimization algorithms‑‑hill climbing, simulated annealing, population‑based methods‑‑generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces (all replicated across 3 independent runs): rule‑based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3‑Small on STL‑10 (84.5% to 85.8%, zero catastrophic failures vs. 60% for random search), LoRA fine‑tuning of Qwen2.5‑0.5B on SST‑2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA‑ES with 3x fewer evaluations). Empirically, on these tasks: a cross‑task ablation shows that simulated annealing, parallel investigators, and even a second LLM model (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2‑3x more evaluations. In our setting, the LLM's learned prior appears strong enough that acceptance‑rule sophistication has limited impact‑‑round 1 alone delivers the majority of improvement, and variants converge to similar configurations across strategies. The practical implication is surprising simplicity: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human‑interpretable artifacts‑‑the discovered cancer classification rules independently recapitulate established cytopathology principles.

PaperID: 4481, https://arxiv.org/pdf/2603.27404.pdf

Authors: Jakub Masłowski, Jarosław A. Chudziak

Title: Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring

Abstract:
Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, Multi‑Agent systems implemented with systematically unconstrained systems systematically undergo semantic drift and logical deterioration and thus can hardly be used in providing ethical tutoring where a precise answer is required. Current simulation often tends to degenerate into dialectical stagnation, the agents degenerate into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity‑Grounded Retrieval‑Augmented Generation (ID‑RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable to stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) have increased the Argument Complexity Scores of students by an order of magnitude, over baselines. These findings validate the effectiveness of ID‑RAG and Heuristic ToM as architectural requirements in maintaining high‑fidelity (adversarial) pedagogy.

PaperID: 4482, https://arxiv.org/pdf/2603.27374.pdf

Authors: Spencer Schutz, Charlott Vallon, Francesco Borrelli

Title: Safe Adaptive-Sampling Control via Robust M-Step Hold Model Predictive Control

Abstract:
In adaptive‑sampling control, the control frequency can be adjusted during task execution. Ensuring that these changes do not jeopardize the safety of the system being controlled requires attention. We introduce robust M‑step hold model predictive control (MPC) to address this. Our formulation provides robust constraint satisfaction for an uncertain discrete‑time system model with a fixed sampling time subject to an adaptable multi‑step input hold (referred to as M‑step hold). We show how to ensure recursive feasibility of the MPC utilizing M‑step hold extensions of robust invariant sets, and demonstrate how to enable safe adaptive‑sampling control via the online selection of M. We evaluate the utility of the robust M‑step hold MPC formulation in a cruise control example.

PaperID: 4483, https://arxiv.org/pdf/2603.27364.pdf

Authors: Narges Golmohammadi, Madan Mohan Rayguru, Sabur Baidya

Title: DRASTIC: A Dynamic Resource Allocation Framework over 6G Network Slicing in Task-aware Closed-Loop Tactile Internet Applications

Abstract:
This work proposes a novel learning driven bandwidth optimization framework called DRASTIC (Dynamic Resource Allocation for Slicing in Task aware Closed loop tactile Internet applications). The proposed framework dynamically allocates resources among network slices supporting both enhanced Mobile Broadband (eMBB) and high reliable low latency communication (HRLLC) users. The algorithm ensures queue stability and meets delay targets with high probability under a Markov‑modulated Poisson traffic, exploiting a Lyapunov guided advantage actor critic reinforcement learning technique. The proposed network model includes an open‑loop eMBB queue whose arrival and departure are mainly driven by throughput demand, as well as a closed loop HRLLC queue that captures feedback and task execution effects. A task execution dependent dexterity index adjusts the effective arrival rate, creating a feedback aware interaction between the network and the task. A probabilistic delay constraint is incorporated into the objective via Lagrangian relaxation, yielding a min_max optimization framework that enforces latency guarantees while maximizing throughput for both types of users. Simulation results demonstrate that the proposed framework meets diverse Quality of Service (QoS) requirements, maintains queue stability under dynamic wireless and robotic task variation conditions, and outperforms other approaches.

PaperID: 4484, https://arxiv.org/pdf/2603.27306.pdf

Authors: Alejandro Carrasco, Mariko Storey-Matsutani, Victor Rodriguez-Fernandez, Richard Linares

Title: GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

Abstract:
Large language models (LLMs) have been proposed as supervisory agents for spacecraft operations, but existing approaches rely on static prompting and do not improve across repeated executions. We introduce \textscGUIDE, a non‑parametric policy improvement framework that enables cross‑episode adaptation without weight updates by evolving a structured, state‑conditioned playbook of natural‑language decision rules. A lightweight acting model performs real‑time control, while offline reflection updates the playbook from prior trajectories. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, GUIDE's evolution consistently outperforms static baselines. Results indicate that context evolution in LLM agents functions as policy search over structured decision rules in real‑time closed‑loop spacecraft interaction.

PaperID: 4485, https://arxiv.org/pdf/2603.27304.pdf

Authors: Huacan Wang, Chaofa Yuan, Xialie Zhuang, Tu Hu, Shuo Zhang, Jun Han, Shi Wei, Daiqiang Li, Jingping Liu, Kunyi Wang, Zihan Yin, Zhenheng Tang, Andy Wang, Henry Peng Zou, Philip S. Yu, Sen Hu, Qizhen Lan, Ronghao Chen

Title: EpochX: Building the Infrastructure for an Emergent Agent Civilization

Abstract:
General‑purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits‑native marketplace infrastructure for human‑agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end‑to‑end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human‑agent collaboration.

PaperID: 4486, https://arxiv.org/pdf/2603.27303.pdf

Authors: Yang Tan, Lingrong Zhang, Mingchen Li, Yuanxi Yu, Bozitao Zhong, Bingxin Zhou, Nanqing Dong, Liang Hong

Title: Self-evolving AI agents for protein discovery and directed evolution

Abstract:
Protein scientific discovery is bottlenecked by the manual orchestration of information and algorithms, while general agents are insufficient in complex domain projects. VenusFactory2 provides an autonomous framework that shifts from static tool usage to dynamic workflow synthesis via a self‑evolving multi‑agent infrastructure to address protein‑related demands. It outperforms a set of well‑known agents on the VenusAgentEval benchmark, and autonomously organizes the discovery and optimization of proteins from a single natural language prompt.

PaperID: 4487, https://arxiv.org/pdf/2603.27296.pdf

Authors: Stoyan Nikolov, Bernhard Konrad, Moritz Gronbach, Niket Kumar, Ann Yan, Varun Singh, Yaning Liang, Parthasarathy Ranganathan

Title: A Multi-agent AI System for Deep Learning Model Migration from TensorFlow to JAX

Abstract:
The rapid development of AI‑based products and their underlying models has led to constant innovation in deep learning frameworks. Google has been pioneering machine learning usage across dozens of products. Maintaining the multitude of model source codes in different ML frameworks and versions is a significant challenge. So far the maintenance and migration work was done largely manually by human experts. We describe an AI‑based multi‑agent system that we built to support automatic migration of TensorFlow‑based deep learning models into JAX‑based ones. We make three main contributions: First, we show how an AI planner that uses a mix of static analysis with AI instructions can create migration plans for very complex code components that are reliably followed by the combination of an orchestrator and coders, using AI‑generated example‑based playbooks. Second, we define quality metrics and AI‑based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements. Third, we demonstrate how the system accelerates code migrations in a large hyperscaler environment on commercial real‑world use‑cases. Our approach dramatically reduces the time (6.4x‑8x speedup) for deep learning model migrations and creates a virtuous circle where effectively AI supports its own development workflow. We expect that the techniques and approaches described here can be generalized for other framework migrations and general code transformation tasks.

PaperID: 4488, https://arxiv.org/pdf/2603.27207.pdf

Authors: Matej Rene Cihlar, Luka Šiktar, Branimir Ćaran, Marko Švaco

Title: Autonomous overtaking trajectory optimization using reinforcement learning and opponent pose estimation

Abstract:
Vehicle overtaking is one of the most complex driving maneuvers for autonomous vehicles. To achieve optimal autonomous overtaking, driving systems rely on multiple sensors that enable safe trajectory optimization and overtaking efficiency. This paper presents a reinforcement learning mechanism for multi‑agent autonomous racing environments, enabling overtaking trajectory optimization, based on LiDAR and depth image data. The developed reinforcement learning agent uses pre‑generated raceline data and sensor inputs to compute the steering angle and linear velocity for optimal overtaking. The system uses LiDAR with a 2D detection algorithm and a depth camera with YOLO‑based object detection to identify the vehicle to be overtaken and its pose. The LiDAR and the depth camera detection data are fused using a UKF for improved opponent pose estimation and trajectory optimization for overtaking in racing scenarios. The results show that the proposed algorithm successfully performs overtaking maneuvers in both simulation and real‑world experiments, with pose estimation RMSE of (0.0816, 0.0531) m in (x, y).

PaperID: 4489, https://arxiv.org/pdf/2603.27204.pdf

Authors: Shenao Wang, Junjie He, Yanjie Zhao, Yayi Wang, Kan Yu, Haoyu Wang

Title: "Elementary, My Dear Watson." Detecting Malicious Skills via Neuro-Symbolic Reasoning across Heterogeneous Artifacts

Abstract:
Skills are increasingly used to extend LLM agents by packaging prompts, code, and configurations into reusable modules. As public registries and marketplaces expand, they form an emerging agentic supply chain, but also introduce a new attack surface for malicious skills. Detecting malicious skills is challenging because relevant evidence is often distributed across heterogeneous artifacts and must be reasoned in context. Existing static, LLM‑based, and dynamic approaches each capture only part of this problem, making them insufficient for robust real‑world detection. In this paper, we present MalSkills, a neuro‑symbolic framework for malicious skills detection. MalSkills first extracts security‑sensitive operations from heterogeneous artifacts through a combination of symbolic parsing and LLM‑assisted semantic analysis. It then constructs the skill dependency graph that links artifacts, operations, operands, and value flows across the skill. On top of this graph, MalSkills performs neuro‑symbolic reasoning to infer malicious patterns or previously unseen suspicious workflows. We evaluate MalSkills on a benchmark of 200 real‑world skills against 5 state‑of‑the‑art baselines. MalSkills achieves 93% F1, outperforming the baselines by 5~87 percentage points. We further apply MalSkills to analyze 150,108 skills collected from 7 public registries, revealing 620 malicious skills. As for now, we have finished reviewing 100 of them and identified 76 previously unknown malicious skills, all of which were responsibly reported and are currently awaiting confirmation from the platforms and maintainers. These results demonstrate the potential of MalSkills in securing the agentic supply chain.

PaperID: 4490, https://arxiv.org/pdf/2603.27195.pdf

Authors: Zhenyuan Zhao, Yu Xing, Tianyang Xue, Lingxin Cao, Xin Yan, Lin Lu

Title: AutoMS: Multi-Agent Evolutionary Search for Cross-Physics Inverse Microstructure Design

Abstract:
Designing microstructures with coupled cross‑physics objectives is a fundamental challenge where traditional topology optimization is often computationally prohibitive and deep generative models frequently suffer from physical hallucinations. We introduce AutoMS, a multi‑agent neuro‑symbolic framework that reformulates inverse design as an LLM‑driven evolutionary search. AutoMS leverages LLMs as semantic navigators to decompose complex requirements and coordinate agent workflows, while a novel Simulation‑Aware Evolutionary Search (SAES) mechanism handles low‑level numerical optimization via local gradient approximation and directed parameter updates. This architecture achieves a state‑of‑the‑art 83.8% success rate on 17 diverse cross‑physics tasks, significantly outperforming both traditional evolutionary algorithms and existing agentic baselines. By decoupling open‑ended semantic orchestration from simulation‑grounded numerical search, AutoMS provides a robust pathway for navigating complex physical landscapes that remain intractable for standard generative or purely linguistic approaches.

PaperID: 4491, https://arxiv.org/pdf/2603.27194.pdf

Authors: Kai Tian, Jialun Wang, Chuan Lin, Guangjie Han, Shengchao Zhu, Ying Liu, Qian Zhu

Title: Multi-AUV Ad-hoc Networks-Based Multi-Target Tracking Based on Scene-Adaptive Embodied Intelligence

Abstract:
With the rapid advancement of underwater net‑working and multi‑agent coordination technologies, autonomous underwater vehicle (AUV) ad‑hoc networks have emerged as a pivotal framework for executing complex maritime missions, such as multi‑target tracking. However, traditional data‑centricarchitectures struggle to maintain operational consistency under highly dynamic topological fluctuations and severely constrained acoustic communication bandwidth. This article proposes a scene‑adaptive embodied intelligence (EI) architecture for multi‑AUV ad‑hoc networks, which re‑envisions AUVs as embodied entities by integrating perception, decision‑making, and physical execution into a unified cognitive loop. To materialize the functional interaction between these layers, we define a beacon‑based communication and control model that treats the communication link as a dynamic constraint‑aware channel, effectively bridging the gap between high‑level policy inference and decentralized physical actuation. Specifically, the proposed architecture employs a three‑layer functional framework and introduces a Scene‑Adaptive MARL (SA‑MARL) algorithm featuring a dual‑path critic mechanism. By integrating a scene critic network and a general critic network through a weight‑based dynamic fusion process, SA‑MARL effectively decouples specialized tracking tasks from global safety constraints, facilitating autonomous policy evolution. Evaluation results demonstrate that the proposedscheme significantly accelerates policy convergence and achieves superior tracking accuracy compared to mainstream MARL approaches, maintaining robust performance even under intense environmental interference and fluid topological shifts.

PaperID: 4492, https://arxiv.org/pdf/2603.27150.pdf

Authors: Xiaoyang Wang, Christopher C. Yang

Title: MediHive: A Decentralized Agent Collective for Medical Reasoning

Abstract:
Large language models (LLMs) have revolutionized medical reasoning tasks, yet single‑agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi‑agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource‑constrained environments. Decentralized MAS (D‑MAS) promise enhanced autonomy and resilience via peer‑to‑peer interactions, but their application to high‑stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi‑agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM‑based agents that autonomously self‑assign specialized roles, conduct initial analyses, detect divergences through conditional evidence‑based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single‑LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault‑tolerant D‑MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning‑intensive tasks.

PaperID: 4493, https://arxiv.org/pdf/2603.27148.pdf

Authors: Aditya Dhodapkar, Farhaan Pishori

Title: SafetyDrift: Predicting When AI Agents Cross the Line Before They Actually Do

Abstract:
When an LLM agent reads a confidential file, then writes a summary, then emails it externally, no single step is unsafe, but the sequence is a data leak. We call this safety drift: individually safe actions compounding into violations. Prior work has measured this problem; we predict it. SafetyDrift models agent safety trajectories as absorbing Markov chains, computing the probability that a trajectory will reach a violation within a given number of steps via closed form absorption analysis. A consequence of the monotonic state design is that every agent will eventually violate safety if left unsupervised (absorption probability 1.0 from all states), making the practical question not if but when, and motivating our focus on finite horizon prediction. Across 357 traces spanning 40 realistic tasks in four categories, we discover that "points of no return" are sharply task dependent: in communication tasks, agents that reach even a mild risk state have an 85% chance of violating safety within five steps, while in technical tasks the probability stays below 5% from any state. A lightweight monitor built on these models detects 94.7% of violations with 3.7 steps of advance warning at negligible computational cost, outperforming both keyword matching (44.7% detection, 55.9% false positive rate) and per step LLM judges (52.6% detection, 38.2% false positive rate) while running over 60,000x faster.

PaperID: 4494, https://arxiv.org/pdf/2603.27135.pdf

Authors: Shijie Zhang

Title: Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data

Abstract:
Text‑to‑time‑series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi‑scale atmospheric dynamics. Existing approaches are constrained by the lack of large‑scale, physically grounded multimodal datasets and by architectures that overlook the spectral‑temporal structure of weather signals. We address these challenges with a unified framework for text‑guided meteorological time‑series generation. First, we introduce MeteoCap‑3B, a billion‑scale weather dataset paired with expert‑level captions constructed via a Multi‑agent Collaborative Captioning (MACC) pipeline, yielding information‑dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion‑based model that enables precise semantic control by mapping textual descriptions into multi‑band spectral priors through a Spectral Prompt Generator, which guides generation via frequency‑aware attention. Extensive experiments on real‑world benchmarks demonstrate state‑of‑the‑art generation quality, accurate cross‑modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data‑sparse and zero‑shot settings. Additional results on general time‑series benchmarks indicate that the proposed framework generalizes beyond meteorology.

PaperID: 4495, https://arxiv.org/pdf/2603.27127.pdf

Authors: Tran Vy Khang, Nguyen Dang Nguyen Khang, Nghi Hoang Khoa, Do Thi Thu Hien, Van-Hau Pham, Phan The Duy

Title: Red-MIRROR: Agentic LLM-based Autonomous Penetration Testing with Reflective Verification and Knowledge-augmented Interaction

Abstract:
Web applications remain the dominant attack surface in cybersecurity, where vulnerabilities such as SQL injection, XSS, and business logic flaws continue to cause significant data breaches. While penetration testing is effective for identifying these weaknesses, traditional manual approaches are time‑consuming and heavily dependent on scarce expert knowledge. Recent Large Language Models (LLM)‑based multi‑agent systems have shown promise in automating penetration testing, yet they still suffer from critical limitations: over‑reliance on parametric knowledge, fragmented session memory, and insufficient validation of attack payloads and responses. This paper proposes Red‑MIRROR, a novel multi‑agent automated penetration testing system that introduces a tightly coupled memory‑reflection backbone to explicitly govern inter‑agent reasoning. By synthesizing Retrieval‑Augmented Generation (RAG) for external knowledge augmentation, a Shared Recurrent Memory Mechanism (SRMM) for persistent state management, and a Dual‑Phase Reflection mechanism for adaptive validation, Red‑MIRROR provides a robust solution for complex web exploitation. Empirical evaluation on the XBOW benchmark and Vulhub CVEs shows that Red‑MIRROR achieves performance comparable to state‑of‑the‑art agents on Vulhub scenarios, while demonstrating a clear advantage on the XBOW benchmark. On the XBOW benchmark, Red‑MIRROR attains an overall success rate of 86.0 percent, outperforming PentestAgent (50.0 percent), AutoPT (46.0 percent), and the VulnBot baseline (6.0 percent). Furthermore, the system achieves a 93.99 percent subtask completion rate, indicating strong long‑horizon reasoning and payload refinement capability. Finally, we discuss ethical implications and propose safeguards to mitigate misuse risks.

PaperID: 4496, https://arxiv.org/pdf/2603.27104.pdf

Authors: Hongzhuo Chen, Zhanliang Wang, Quan M. Nguyen, Gongbo Zhang, Chunhua Weng, Kai Wang

Title: Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders

Abstract:
Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent‑orchestrated digital twin framework using OpenClaw's proactive "heartbeat" mechanism and modular Agent Skills. This Autonomous Agent‑orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient‑reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger‑based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real‑time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human‑in‑the‑loop system design.

PaperID: 4497, https://arxiv.org/pdf/2603.27076.pdf

Authors: Tahreem Yasir, Sutapa Dey Tithi, Benyamin Tabarsi, Dmitri Droujkov, Sam Gilson Yasitha Rajapaksha, Xiaoyi Tian, Arun Ramesh, DongKuan, Xu, Tiffany Barnes

Title: When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

Abstract:
Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step‑level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge‑graph‑grounded benchmark of 516 unique proof states with step‑level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self‑assessment or binary correctness, our framework enables fine‑grained analysis of feedback quality against verified solution paths. We evaluate three role‑specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error‑prone (<70% accuracy), but degrades performance by 4‑6 percentage points through over‑specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4‑5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty‑aware architectures that route problems by estimated complexity and upstream reliability.

PaperID: 4498, https://arxiv.org/pdf/2603.27065.pdf

Authors: Zhuoyang Qian, Wei Shi, Xu Lin, Li Ling, Meng Luo, Ziming Wang, Zhiwei Zhang, Tengyue Xu, Gaoge Liu, Zhentao Zhang, Shuo Zhang, Ziqi Wang, Zheng Feng, Yan Luo, Shu Xu, Yongjin Chen, Zhibo Feng, Zhuo Chen, Bruce Yuan, Biao Wu, Harry Wang, Kris Chen

Title: Story2Proposal: A Scaffold for Structured Scientific Paper Writing

Abstract:
Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language‑model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross‑section inconsistencies. We introduce Story2Proposal, a contract‑governed multi‑agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate evaluate adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.

PaperID: 4499, https://arxiv.org/pdf/2603.27051.pdf

Authors: Mrdjan Jankovic

Title: Proprioceptive feedback paradigm for safe and resilient motion control

Abstract:
Proprioception is a human sense that provides feedback from muscles and joints about body position and motion. This key capability keeps us upright, moving, and responding quickly to slips or stumbles. In this paper we discuss a proprioception‑like feature (machine proprioceptive feedback ‑ MPF) for motion control systems. An unexpected response of one actuator, or one agent in a multi‑agent system, is compensated by other actuators/agents through fast feedback loops that react only to the unexpected portion. The paper appropriates the predictor‑corrector mechanism of decentralized, multi‑agent controllers as "proprioceptive feedback" for centrally controlled ones. It analyzes a nature and degree of impairment that can be managed and offers two options, full‑ MPF and split‑MPF, with different wiring architectures as well as different stability and safety properties. Multi‑vehicle interchange lane‑swap traffic simulations confirm the analytical results.

PaperID: 4500, https://arxiv.org/pdf/2603.26993.pdf

Authors: Ruicheng Ao, Siyang Gao, David Simchi-Levi

Title: On the Reliability Limits of LLM-Based Multi-Agent Planning

Abstract:
This technical note studies the reliability limits of LLM‑based multi‑agent planning as a delegated decision problem. We model the LLM‑based multi‑agent architecture as a finite acyclic decision network in which multiple stages process shared model‑context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision‑theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common‑evidence regime, this implies that optimizing over multi‑agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget‑constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.

PaperID: 4501, https://arxiv.org/pdf/2603.26990.pdf

Authors: Tony Menzo, Alexander Roman, George T. Fleming, Sergei Gleyzer, Konstantin T. Matchev, Stephen Mrenna

Title: Agentic Diagrammatica: Towards Autonomous Symbolic Computation in High Energy Physics

Abstract:
We present Diagrammatica, a symbolic computation extension to the HEPTAPOD agentic framework, which enables LLM agents to plan and execute multi‑step theoretical calculations. Symbolic computation poses a distinctive reliability challenge for LLM agents, as correctness is governed by implicit mathematical conventions that are not encoded in a form that can be easily checked in the computational backend. We identify two complementary remedies, tool‑constrained computation and targeted knowledge grounding, and pursue the first as the primary architecture. Concretely, we concentrate the agent's action distribution onto tool calls with convention‑fixing semantics, in which the agent specifies a compact, human‑auditable diagram specification and a trusted backend performs the symbolic or numerical manipulations exactly. The toolkit provides two complementary calculation paths consuming a shared diagram specification: Naive Dimensional Analysis (NDA) for order‑of‑magnitude rate estimates and Exact Diagrammatic Analysis (EDA) for tree‑level symbolic calculations via automatic FeynCalc code generation, both supplemented by automatic Feynman diagram enumeration and a navigable theory knowledge base. The architecture is validated on two benchmarks: (1) an exhaustive catalog of all tree‑level, single‑vertex 1\to 2 partial decay widths across scalar, fermion, and vector parents, with complete massless and threshold limits and Standard Model validation; and (2) an NDA sensitivity study of the muon decay multiplicity μ^+ \to ν_μ\barν_e + n(e^+e^‑) + e^‑, determining the maximum observable n at current and planned muon experiments.

PaperID: 4502, https://arxiv.org/pdf/2603.26942.pdf

Authors: Yinghao Wang, Cheng Wang

Title: The Observability Gap: Why Output-Level Human Feedback Fails for LLM Coding Agents

Abstract:
Large language model (LLM) multi‑agent coding systems typically fix agent capabilities at design time. We study an alternative setting, earned autonomy, in which a coding agent starts with zero pre‑defined functions and incrementally builds a reusable function library through lightweight human feedback on visual output alone. We evaluate this setup in a Blender‑based 3D scene generation task requiring both spatial reasoning and programmatic geometric control. Although the agent rediscovered core utility functions comparable to a human reference implementation, it achieved 0% full‑scene success under output‑only feedback across multiple instruction granularities, where success required satisfying object completeness, ground contact, collision avoidance, and scale plausibility simultaneously. Our analysis identifies a structural observability gap: bugs originate in code logic and execution state, while human evaluation occurs only at the output layer, and the many‑to‑one mapping from internal states to visible outcomes prevents symptom‑level feedback from reliably identifying root causes. This mismatch leads to persistent failure mode oscillation rather than convergence. A diagnostic intervention that injected minimal code‑level knowledge restored convergence, strongly supporting the interpretation that the main bottleneck lies in feedback observability rather than programming competence. We formalize this phenomenon as a feedback paradox in domains with deep causal chains between internal code logic and perceptual outcomes, and argue that effective human‑agent collaboration in such settings requires intermediate observability beyond output‑only evaluation.

PaperID: 4503, https://arxiv.org/pdf/2603.26845.pdf

Authors: Jinzhen Han, JinByeong Lee, Yuri Shim, Jisung Kim, Jae-Joon Lee

Title: GISclaw: A Comprehensive Open-Source LLM Agent System for Realistic Multi-Step Geospatial Analysis

Abstract:
Most LLM‑driven GIS assistants solve narrow single‑step tasks tightly coupled to proprietary platforms such as ArcGIS or QGIS, limiting their use for the multi‑step, cross‑format pipelines that define professional geospatial analysis. We present GISclaw, a comprehensive open‑source agent system that performs realistic GIS analysis end to end ‑ spatial joins, raster algebra, kriging interpolation, machine‑learning classification, network analysis, choropleth cartography ‑ directly through Python with no commercial GIS dependency. GISclaw couples an LLM reasoning core with a persistent Python sandbox pre‑loaded with the open‑source geospatial stack, three engineered prompt rules (Schema Analysis, Package Constraint, Domain Knowledge Injection), and an Error‑Memory module for self‑correction. A single backend‑agnostic architecture supports both cloud‑API and locally deployed open‑weight LLM backends, enabling air‑gapped deployment without loss of capability. On GeoAnalystBench ‑ 50 expert‑curated multi‑step tasks averaging 5.8 analytical steps across vector, raster, and tabular data ‑ GISclaw reaches up to 100% task success and 97% mean success over three independent runs. We further conduct 1,800 controlled experiments (50 tasks x 6 backends x 2 architectures x 3 repeats) with bootstrap 95% CIs, paired Wilcoxon tests, and a composite‑score sensitivity analysis (Kendall's tau median = 0.94), and introduce a three‑layer evaluation protocol combining code structure, reasoning process, and type‑specific output verification. The Single‑Agent ReAct loop reliably outperforms the Dual‑Agent Plan‑Execute‑Replan pipeline on every cloud backend (Cliff's delta = 0.15‑0.41); only the locally deployed 14B model gains from multi‑agent orchestration, suggesting architectural complexity should match model capability rather than be added by default.

PaperID: 4504, https://arxiv.org/pdf/2603.26718.pdf

Authors: Marcin Abram

Title: Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems

Abstract:
We analyze the challenges of benchmarking scientific (multi)‑agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges due to the continuously changing/updating knowledge base. We discuss strategies for constructing contamination‑resistant problems, generating scalable families of tasks, and the need for evaluating systems through multi‑turn interactions that better reflect real scientific practice. As an early feasibility test, we demonstrate how to construct a dataset of novel research ideas to test the out‑of‑sample performance of our system. We also discuss the results of interviews with several researchers and engineers working in quantum science. Through those interviews, we examine how scientists expect to interact with AI systems and how these expectations should shape evaluation methods.

PaperID: 4505, https://arxiv.org/pdf/2603.26710.pdf

Authors: Kamer Ali Yuksel, Abdul Basit Anees, Ashraf Elneima, Sanjika Hewavitharana, Mohamed Al-Badrashiny, Hassan Sawaf

Title: Agentic AI for Human Resources: LLM-Driven Candidate Assessment

Abstract:
In this work, we present a modular and interpretable framework that uses Large Language Models (LLMs) to automate candidate assessment in recruitment. The system integrates diverse sources, including job descriptions, CVs, interview transcripts, and HR feedback; to generate structured evaluation reports that mirror expert judgment. Unlike traditional ATS tools that rely on keyword matching or shallow scoring, our approach employs role‑specific, LLM‑generated rubrics and a multi‑agent architecture to perform fine‑grained, criteria‑driven evaluations. The framework outputs detailed assessment reports, candidate comparisons, and ranked recommendations that are transparent, auditable, and suitable for real‑world hiring workflows. Beyond rubric‑based analysis, we introduce an LLM‑Driven Active Listwise Tournament mechanism for candidate ranking. Instead of noisy pairwise comparisons or inconsistent independent scoring, the LLM ranks small candidate subsets (mini‑tournaments), and these listwise permutations are aggregated using a Plackett‑Luce model. An active‑learning loop selects the most informative subsets, producing globally coherent and sample‑efficient rankings. This adaptation of listwise LLM preference modeling (previously explored in financial asset ranking) provides a principled and highly interpretable methodology for large‑scale candidate ranking in talent acquisition.

PaperID: 4506, https://arxiv.org/pdf/2603.26701.pdf

Authors: Ryuji Hashimoto, Masahiro Kaneko, Ryosuke Takata, Takehiro Takayanagi, Kiyoshi Izumi

Title: From Heard to Lived Opinions: Simulating Opinion Dynamics with Grounded LLM Agents in Economic Environments

Abstract:
Opinion dynamics (OD) studies how individual opinions evolve and generate collective patterns such as consensus and polarization. While recent work explores OD using populations of LLM‑based agents focusing on opinion exchange, it typically does not incorporate individuals' lived experiences, such as economic outcomes of past decisions, which play a critical role in shaping opinions. We propose a novel OD simulation framework that grounds LLM‑based agents in an economic environment, allowing them to act and receive environmental feedback. Our simulations exhibit coherent OD at both individual and population levels: individual opinions follow structured trajectories shaped by economic experiences, with adverse conditions inducing opinion rigidity, while at the population level, collective opinions co‑move with economic conditions, with inequality amplifying polarization and price instability driving larger distributional shifts. These results highlight the importance of grounding LLM‑based agents in environments to capture collective OD.

PaperID: 4507, https://arxiv.org/pdf/2603.26684.pdf

Authors: Fernando Salanova, Eduardo Montijano, Cristian Mahulea

Title: Decoupling Geometric Planning and Execution in Scalable Multi-Agent Path Finding

Abstract:
Multi‑Agent Path Finding (MAPF) requires collision‑free trajectories for multiple agents on a shared graph, often with the objective of minimizing the sum‑of‑costs (SOC). Many optimal and bounded‑suboptimal solvers rely on time‑expanded models and centralized conflict resolution, which limits scalability in large or dense instances. We propose a hybrid prioritized framework that separates \emphgeometric planning from \emphexecution‑time conflict resolution. In the first stage, \emphGeometric Conflict Preemption (GCP) plans agents sequentially with A on the original graph while inflating costs for transitions entering vertices used by higher‑priority paths, encouraging spatial detours without explicit time reasoning. In the second stage, a \emphDecentralized Local Controller (DLC) executes the geometric paths using per‑vertex FIFO authorization queues and inserts wait actions to avoid vertex and edge‑swap conflicts. Experiments on standard benchmark maps with up to 1000 agents show that the method scales with an near‑linear runtime trend and attains a 100% success rate on instances satisfying the geometric feasibility assumption. Page of the project: https://sites.google.com/unizar.es/multi‑agent‑path‑finding/home

PaperID: 4508, https://arxiv.org/pdf/2603.26635.pdf

Authors: Maria Milkowski, Tim Weninger

Title: Deception and Communication in Autonomous Multi-Agent Systems: An Experimental Study with Among Us

Abstract:
As large language models are deployed as autonomous agents, their capacity for strategic deception raises core questions for coordination, reliability, and safety in multi‑goal, multi‑agent systems. We study deception and communication in L2LM agents through the social deduction game Among Us, a cooperative‑competitive environment. Across 1,100 games, autonomous agents produced over one million tokens of meeting dialogue. Using speech act theory and interpersonal deception theory, we find that all agents rely mainly on directive language, while impostor agents shift slightly toward representative acts such as explanations and denials. Deception appears primarily as equivocation rather than outright lies, increasing under social pressure but rarely improving win rates. Our contributions are a large‑scale analysis of role‑conditioned deceptive behavior in LLM agents and empirical evidence that current agents favor low‑risk ambiguity that is linguistically subtle yet strategically limited, revealing a fundamental tension between truthfulness and utility in autonomous communication.

PaperID: 4509, https://arxiv.org/pdf/2603.26515.pdf

Authors: Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai

Title: JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Abstract:
Despite recent advances, efficient and robust turn‑taking detection remains a significant challenge in industrial‑grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full‑duplex capabilities require costly full‑duplex data and incur substantial training and deployment overheads, limiting real‑time performance. In this paper, we propose JAL‑Turn, a lightweight and efficient speech‑only turn‑taking framework that adopts a joint acoustic‑linguistic modeling paradigm, in which a cross‑attention module adaptively integrates pre‑trained acoustic representations with linguistic features to support low‑latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL‑Turn enables turn‑taking prediction to run fully in parallel with speech recognition, introducing no additional end‑to‑end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn‑taking labels from large‑scale real‑world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in‑house Japanese customer‑service dataset show that JAL‑Turn consistently outperforms strong state‑of‑the‑art baselines in detection accuracy while maintaining superior real‑time performance.

PaperID: 4510, https://arxiv.org/pdf/2603.26512.pdf

Authors: Jesse Barkley, Rumi Loghmani, Amir Barati Farimani

Title: CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation

Abstract:
Existing methods for text‑to‑CAD generation either operate in a single pass with no geometric verification or rely on lossy visual feedback that cannot resolve dimensional errors. We present CADSmith, a multi‑agent pipeline that generates CadQuery code from natural language. It then undergoes an iterative refinement process through two nested correction loops: an inner loop that resolves execution errors and an outer loop grounded in programmatic geometric validation. The outer loop combines exact measurements from the OpenCASCADE kernel (bounding box dimensions, volume, solid validity) with holistic visual assessment from an independent vision‑language model Judge. This provides both the numerical precision and the high‑level shape awareness needed to converge on the correct geometry. The system uses retrieval‑augmented generation over API documentation rather than fine‑tuning, maintaining a current database as the underlying CAD library evolves. We evaluate on a custom benchmark of 100 prompts in three difficulty tiers (T1 through T3) with three ablation configurations. Against a zero‑shot baseline, CADSmith achieves a 100% execution rate (up from 95%), improves the median F1 score from 0.9707 to 0.9846, the median IoU from 0.8085 to 0.9629, and reduces the mean Chamfer Distance from 28.37 to 0.74, demonstrating that closed‑loop refinement with programmatic geometric feedback substantially improves the quality and reliability of LLM‑generated CAD models.

PaperID: 4511, https://arxiv.org/pdf/2603.26458.pdf

Authors: Rui Liu

Title: Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

Abstract:
Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two‑agent pipeline where an expensive "manager" model (text‑only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap "worker" model (with full repo access) executes code changes. We evaluate on 200 instances from SWE‑bench Lite across five configurations that vary the manager‑worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi‑agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong‑model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap‑‑structure without substance is pure overhead; (3) the manager's value lies in directing, not merely reviewing‑‑a minimal review‑only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch‑‑keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.

PaperID: 4512, https://arxiv.org/pdf/2603.26270.pdf

Authors: Ziqiao Kong, Wanxu Xia, Chong Wang, Yi Lu, Pan Li, Shaohua Li, Zong Cao, Yang Liu

Title: Knowdit: Agentic Smart Contract Vulnerability Detection with Auditing Knowledge Summarization

Abstract:
Smart contracts govern billions of dollars in decentralized finance (DeFi), yet automated vulnerability detection remains challenging because many vulnerabilities are tightly coupled with project‑specific business logic. We observe that recurring vulnerabilities across diverse DeFi business models often share the same underlying economic mechanisms, which we term DeFi semantics, and that capturing these shared abstractions can enable more systematic auditing. Building on this insight, we propose Knowdit, a knowledge‑driven, agentic framework for smart contract vulnerability detection. Knowdit first constructs an auditing knowledge graph from historical human audit reports, linking fine‑grained DeFi semantics with recurring vulnerability patterns. Given a new project, a multi‑agent framework leverages this knowledge through an iterative loop of specification generation, harness synthesis, fuzz execution, and finding reflection, driven by a shared working memory for continuous refinement. We evaluate Knowdit on 12 recent Code4rena projects with 75 ground‑truth vulnerabilities. Knowdit detects all 14 high‑severity and 77% of medium‑severity vulnerabilities with only 2 false positives, significantly outperforming all baselines. Applied to six real‑world projects, Knowdit further discovers 12 high‑ and 10 medium‑severity previously unknown vulnerabilities, proving its outstanding performance.

PaperID: 4513, https://arxiv.org/pdf/2603.26266.pdf

Authors: Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li

Title: GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Abstract:
Large vision‑language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain‑specific software operation data during training, these agents exhibit significant domain bias ‑ they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real‑world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional‑Video Driven Expertise), a training‑free, plug‑and‑play framework that resolves GUI agent domain bias by autonomously acquiring domain‑specific expertise from web tutorial videos through a retrieval‑augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle‑driven Video‑RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three‑stage retrieval ‑ domain classification, topic extraction, and relevance matching ‑ to identify task‑relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that are injected into the agent's corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE's generality as a plug‑and‑play component for both multi‑agent systems and single‑model agents. It consistently yields over 5% improvements and reduces execution steps ‑ without modifying any model parameters or architecture ‑ validating GUIDE as an architecture‑agnostic enhancement to bridge GUI agent domain bias.

PaperID: 4514, https://arxiv.org/pdf/2603.26233.pdf

Authors: Nicholas Edwards, Sebastian Schuster

Title: Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Abstract:
As Large Language Model (LLM) agents are increasingly deployed in open‑ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification‑seeking abilities of LLM agents on an underspecified variant of SWE‑bench Verified. We propose an uncertainty‑aware multi‑agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi‑agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single‑agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi‑agent system exhibits well‑calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real‑world, underspecified tasks.

PaperID: 4515, https://arxiv.org/pdf/2603.26177.pdf

Authors: Gilles Wainrib, Barbara Bodinier, Haitem Dakhli, Josep Monserrat, Almudena Espin Perez, Sabrina Carpentier, Roberta Codato, John Klein

Title: Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery

Abstract:
Recent work has questioned whether large language models (LLMs) can perform genuine in‑context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM‑based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high‑content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero‑shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a +53.4% increase in discoveries per feature on average (p = 0.003). To test whether this improvement arises from genuine feedback‑driven learning rather than prompt‑induced recall of pretraining knowledge, we introduce a random feedback control in which hit/miss labels are permuted. Under this control, the performance gain disappears, indicating that the observed improvement depends on the structure of the feedback signal (+13.0 hits, p = 0.003). We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from ～33%‑‑45% to ～3‑‑9%, converting a non‑significant ICL effect (+0.8, p = 0.32) into a large and highly significant improvement (+11.0, p=0.003) for the best ICL strategy. These results suggest that effective in‑context learning from experimental feedback emerges only once models reach a sufficient capability threshold.

PaperID: 4516, https://arxiv.org/pdf/2603.26122.pdf

Authors: Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

Title: SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

Abstract:
While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine‑grained, large‑scale multi‑class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi‑agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real‑world clinical settings. Here, we present SkinGPT‑X, a multimodal collaborative multi‑agent system for dermatological diagnosis integrated with a self‑evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT‑X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT‑X, we design a three‑tier comparative experiment. First, we benchmark SkinGPT‑X against four state‑of‑the‑art LLMs across four public datasets, demonstrating its state‑of‑the‑art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state‑of‑the‑art model. Second, we construct a large‑scale multi‑class dataset covering 498 distinct dermatological categories to evaluate its fine‑grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT‑X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, a +10% Cohen's Kappa improvement.

PaperID: 4517, https://arxiv.org/pdf/2603.26099.pdf

Authors: Eunseo Oh, Suyoun Lee, Jae Young Choi, Soobin Park, Youn-kyung Lim

Title: "Oops! ChatGPT is Temporarily Unavailable!": A Diary Study on Knowledge Workers' Experiences of LLM Withdrawal

Abstract:
LLMs have become deeply embedded in knowledge work, raising concerns about growing dependency and the potential undermining of human skills. To investigate the pervasiveness of LLMs in work practices, we conducted a four‑day diary study with frequent LLM users (N=10), observing how knowledge workers responded to a temporary withdrawal of LLMs. Our findings show how LLM withdrawal disrupted participants' workflows by identifying gaps in task execution, how self‑directed work led participants to reclaim professional values, and how everyday practices revealed the extent to which LLM use had become inescapably normative. Conceptualizing LLMs as infrastructural to contemporary knowledge work, this research contributes empirical insights into the often invisible role of LLMs and proposes value‑driven appropriation as an approach to supporting professional values in the current LLM‑pervasive work environment.

PaperID: 4518, https://arxiv.org/pdf/2603.26034.pdf

Authors: Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, Yaoyuan Wang

Title: AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Abstract:
Autonomous agents powered by large language models (LLMs) perform complex tasks through long‑horizon reasoning and tool interaction, where a fundamental trade‑off arises between execution efficiency and reasoning robustness. Models at different capability‑cost levels offer complementary advantages: lower‑cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self‑driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self‑reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long‑horizon execution, we introduce a difficulty‑aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two‑level small‑large model setting. Experiments on diverse multi‑step agent benchmarks show that AgentCollab consistently improves the accuracy‑efficiency Pareto frontier of LLM agents.

PaperID: 4519, https://arxiv.org/pdf/2603.25979.pdf

Authors: Keith Paarporn, Jason R. Marden

Title: Move Over, Prisoner's Dilemma: Colonel Blotto has arrived

Abstract:
The Prisoner's Dilemma, zero‑sum games, LQR team problems, and differential games have shaped game theory in controls for decades, but the field's most pressing adversarial challenges demand a richer framework, and its name is Colonel Blotto. Strategic adversarial constraints represent a fundamental consideration in control systems, from cybersecurity defense to infrastructure protection. Colonel Blotto games, despite their direct relevance to such applications, remain underutilized in the controls community relative to other game‑theoretic approaches. This article aims to close that gap for the controls community. Indeed, theoretical advances within the last two decades have spurred a resurgence of interest and enabled their applications across several domains. In this article, we introduce the Colonel Blotto framework, survey key analytical and computational results, and demonstrate how problems spanning cybersecurity, network defense, and multi‑agent systems fit naturally within this structure. Three research directions are examined in depth: interdependent contest objectives that capture networked vulnerabilities, alternate winning rules that model partial rewards and structural asymmetries, and multi‑agent competitive environments involving coalition formation and strategic concessions. Taken together, these directions reveal a framework that is both practically deployable and rich enough to capture the strategic complexity inherent in adversarial resource allocation.

PaperID: 4520, https://arxiv.org/pdf/2603.25973.pdf

Authors: Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu

Title: MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Abstract:
Recent advancements in Large Language Models (LLMs) have expanded context windows to million‑token scales, yet benchmarks for evaluating memory remain limited to short‑session synthetic dialogues. We introduce \textscMemoryCD, the first large‑scale, user‑centric, cross‑domain memory benchmark derived from lifelong real‑world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, \textscMemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi‑faceted long‑context memory evaluation pipeline of 14 state‑of‑the‑art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent's ability to simulate real user behaviors in both single and cross‑domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross‑domain life‑long personalization evaluation.

PaperID: 4521, https://arxiv.org/pdf/2603.25928.pdf

Authors: Wenhan Lyu, Yue Xiao, Yixuan Zhang, Yifan Sun

Title: Self-Organizing Multi-Agent Systems for Continuous Software Development

Abstract:
Large Language Model‑based multi‑agent systems have shown promise in automating software development tasks. However, most vibe code systems focus on completing small tasks and incremental code changes, leaving persistent, continuous software development largely unexplored. We present TheBotCompany, an open‑source orchestration framework for continuous multi‑agent software development. TheBotCompany introduces three key innovations: (1) a three‑phase state machine (Strategy to Execution to Verification) for milestone‑driven development, (2) self‑organizing agent teams where manager agents dynamically hire, assign, and fire worker agents based on project needs, and (3) asynchronous human oversight. We evaluate TheBotCompany on real‑world software projects over multiple days of continuous development, measuring team adaptation patterns, milestone completion rates, cost efficiency, and code quality. Our results demonstrate that the self‑organizing approach enables effective long‑term software development with measurable progress, while the verification phase catches defects that would otherwise persist.

PaperID: 4522, https://arxiv.org/pdf/2603.25901.pdf

Authors: Kevin Song, Evan Diewald, Ornob Siddiquee, Chris Boomhower, Keegan Abdoo, Mike Band, Amy Lee

Title: Decoding Defensive Coverage Responsibilities in American Football Using Factorized Attention Based Transformer Models

Abstract:
Defensive coverage schemes in the National Football League (NFL) represent complex tactical patterns requiring coordinated assignments among defenders who must react dynamically to the offense's passing concept. This paper presents a factorized attention‑based transformer model applied to NFL multi‑agent play tracking data to predict individual coverage assignments, receiver‑defender matchups, and the targeted defender on every pass play. Unlike previous approaches that focus on post‑hoc coverage classification at the team level, our model enables predictive modeling of individual player assignments and matchup dynamics throughout the play. The factorized attention mechanism separates temporal and agent dimensions, allowing independent modeling of player movement patterns and inter‑player relationships. Trained on randomly truncated trajectories, the model generates frame‑by‑frame predictions that capture how defensive responsibilities evolve from pre‑snap through pass arrival. Our models achieve approximately 89%+ accuracy for all tasks, with true accuracy potentially higher given annotation ambiguity in the ground truth labels. These outputs also enable novel derivative metrics, including disguise rate and double coverage rate, which enable enhanced storytelling in TV broadcasts as well as provide actionable insights for team strategy development and player evaluation.

PaperID: 4523, https://arxiv.org/pdf/2603.25764.pdf

Authors: Aman Mehta

Title: Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Abstract:
As LLM‑based AI agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE‑bench, a challenging software engineering benchmark requiring complex, multi‑step reasoning. Comparing Claude~4.5~Sonnet, GPT‑5, and Llama‑3.1‑70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT‑5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT‑5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs.\ 3.2) but exhibits 2.1× higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.

PaperID: 4524, https://arxiv.org/pdf/2603.25697.pdf

Authors: Yannick Roy

Title: The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Abstract:
Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self‑evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) 'As a User x 1000', where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground‑truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi‑iteration self‑correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production‑tested system with the operational discipline that makes long‑running autonomous evolution safe.

PaperID: 4525, https://arxiv.org/pdf/2603.25666.pdf

Authors: Luca Mannella, Stefano Di Carlo, Alessandro Savino

Title: Experimental Analysis of FreeRTOS Dependability through Targeted Fault Injection Campaigns

Abstract:
Real‑Time Operating Systems (RTOSes) play a crucial role in safety‑critical domains, where deterministic and predictable task execution is essential. Yet they are increasingly exposed to ionizing radiation, which can compromise system dependability. To assess FreeRTOS under such conditions, we introduce KRONOS, a software‑based, non‑intrusive post‑propagation Fault Injection (FI) framework that injects transient and permanent faults into Operating System‑visible kernel data structures without specialized hardware or debug interfaces. Using KRONOS, we conduct an extensive FI campaign on core FreeRTOS kernel components, including scheduler‑related variables and Task Control Blocks (TCBs), characterizing the impact of kernel‑level corruptions on functional correctness, timing behavior, and availability. The results show that corruption of pointer and key scheduler‑related variables frequently leads to crashes, whereas many TCB fields have only a limited impact on system availability.

PaperID: 4526, https://arxiv.org/pdf/2603.25572.pdf

Authors: Martin Mark Zan, Stefan Schwarz

Title: Cooperative Deep Reinforcement Learning for Fair RIS Allocation

Abstract:
The deployment of reconfigurable intelligent surfaces (RISs) introduces new challenges for resource allocation in multi‑cell wireless networks, particularly when user loads are uneven across base stations. In this work, we consider RISs as shared infrastructure that must be dynamically assigned among competing base stations, and we address this problem using a simultaneous ascending auction mechanism. To mitigate performance imbalances between cells, we propose a fairness‑aware collaborative multi‑agent reinforcement learning approach in which base stations adapt their bidding strategies based on both expected utility gains and relative service quality. A centrally computed performance‑dependent fairness indicator is incorporated into the agents' observations, enabling implicit coordination without direct inter‑base‑station communication. Simulation results show that the proposed framework effectively redistributes RIS resources toward weaker‑performing cells, substantially improving the rates of the worst‑served users while preserving overall throughput. The results demonstrate that fairness‑oriented RIS allocation can be achieved through cooperative learning, providing a flexible tool for balancing efficiency and equity in future wireless networks.

PaperID: 4527, https://arxiv.org/pdf/2603.25498.pdf

Authors: Linxiao Li, Zhixiang Lu

Title: EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents

Abstract:
As the Web transitions from static retrieval to generative interaction, the escalating environmental footprint of Large Language Models (LLMs) presents a critical sustainability challenge. Current paradigms indiscriminately apply computation‑intensive strategies like Chain‑of‑Thought (CoT) to billions of daily queries, causing LLM overthinking, a redundancy that amplifies carbon emissions and operational barriers. This inefficiency directly undermines UN Sustainable Development Goals 13 (Climate Action) and 10 (Reduced Inequalities) by hindering equitable AI access in resource‑constrained regions. To address this, we introduce EcoThink, an energy‑aware adaptive inference framework designed to reconcile high‑performance AI intelligence with environmental responsibility. EcoThink employs a lightweight, distillation‑based router to dynamically assess query complexity, skipping unnecessary reasoning for factoid retrieval while reserving deep computation for complex logic. Extensive evaluations across 9 diverse benchmarks demonstrate that EcoThink reduces inference energy by 40.4% on average (up to 81.9% for web knowledge retrieval) without statistically significant performance loss. By mitigating algorithmic waste, EcoThink offers a scalable path toward a sustainable, inclusive, and energy‑efficient generative AI Agent.

PaperID: 4528, https://arxiv.org/pdf/2603.25416.pdf

Authors: Andrea Tosin

Title: Homogeneous Boltzmann-type equations on graphs: A framework for modelling networked social interactions

Abstract:
Homogeneous Boltzmann‑type equations are an established tool for modelling interacting multi‑agent systems in sociophysics by means of the principles of statistical mechanics and kinetic theory. A customary implicit assumption is that interactions are "all‑to‑all", meaning that every pair of randomly sampled agents may potentially interact. However, this legacy of classical kinetic theory, developed for collisions among gas molecules, may not be equally applicable to social interactions, which are often influenced by preferential connections between agents. In this paper, we discuss ongoing research on incorporating graph structures into homogeneous Boltzmann‑type equations, thereby accounting for the "some‑to‑some" nature of social interactions.

PaperID: 4529, https://arxiv.org/pdf/2603.25322.pdf

Authors: Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang

Title: AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

Abstract:
Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real‑world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease‑specific questions rather than generating comprehensive diagnostic reports that support clinical decision‑making. Here we expand LLM capabilities for clinical decision support by introducing AD‑CARE, a modality‑agnostic agent that performs guideline‑grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM‑driven reasoning, AD‑CARE generates transparent, report‑style outputs aligned with real‑world clinical workflows. Across six cohorts comprising 10,303 cases, AD‑CARE achieved 84.9% diagnostic accuracy, delivering 4.2%‑13.7% relative improvements over baseline methods. Despite cohort‑level differences, dataset‑specific accuracies remain robust (80.4%‑98.8%), and the agent consistently outperforms all baselines. AD‑CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%‑68% and 28%‑51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%‑11% and more than halved decision time. The framework yielded 2.29%‑10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD‑CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

PaperID: 4530, https://arxiv.org/pdf/2603.25161.pdf

Authors: Shumpei Nishida, Kunihisa Okano

Title: Distributed Event-Triggered Consensus Control of Discrete-Time Linear Multi-Agent Systems under LQ Performance Constraints

Abstract:
This paper proposes a distributed event‑triggered control method that not only guarantees consensus of multi‑agent systems but also satisfies a given LQ performance constraint. Taking the standard distributed control scheme with all‑time communication as a baseline, we consider the problem of designing an event‑triggered communication rule such that the resulting LQ cost satisfies a performance constraint with respect to the baseline cost while consensus is achieved. The main difficulty is that the performance requirement is global, whereas triggering decisions are made locally and asynchronously by individual agents, which cannot directly evaluate the global performance degradation. To address this issue, we decompose allowable degradation across agents and design a triggering rule that uses only locally available information to satisfy the given LQ performance constraint. For general linear agents on an undirected graph, we derive a sufficient condition that guarantees both consensus and the prescribed performance level. We also develop a tractable offline design method for the triggering parameters. Numerical examples illustrate the effectiveness of the proposed method.

PaperID: 4531, https://arxiv.org/pdf/2603.25158.pdf

Authors: Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Title: Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Abstract:
Equipping Large Language Model (LLM) agents with domain‑specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non‑generalizable trajectory‑local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub‑agents to analyze a diverse pool of executions. It extracts trajectory‑specific lessons and hierarchically consolidates them into a unified, conflict‑free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human‑written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory‑grounded evolution does not merely memorize task instances or model‑specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5‑35B on its own trajectories improved a Qwen3.5‑122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills ‑‑ requiring no parameter updates, no external retrieval modules, and utilizing open‑source models as small as 35B parameters.

PaperID: 4532, https://arxiv.org/pdf/2603.25121.pdf

Authors: Junkai Jiang, Yitao Xu, Ruochen Li, Shaobing Xu, Jianqiang Wang

Title: CTS-PLL: A Robust and Anytime Framework for Collaborative Task Sequencing and Multi-Agent Path Finding

Abstract:
The Collaborative Task Sequencing and Multi‑Agent Path Finding (CTS‑MAPF) problem requires agents to accomplish sequences of tasks while avoiding collisions, posing significant challenges due to its combinatorial complexity. This work introduces CTS‑PLL, a hierarchical framework that extends the configuration‑based CTS‑MAPF planning paradigm with two key enhancements: a lock agents detection and release mechanism leveraging a complete planning method for local re‑planning, and an anytime refinement procedure based on Large Neighborhood Search (LNS). These additions ensure robustness in dense environments and enable continuous improvement of solution quality. Extensive evaluations across sparse and dense benchmarks demonstrate that CTS‑PLL achieves higher success rates and solution quality compared with existing methods, while maintaining competitive runtime efficiency. Real‑world robot experiments further demonstrate the feasibility of the approach in practice.

PaperID: 4533, https://arxiv.org/pdf/2603.25111.pdf

Authors: Debangshu Banerjee, Changming Xu, Eugene Ie, Ming Zhang, Daiyi Peng, Chu-Cheng Lin, Gagandeep Singh

Title: SEVerA: Verified Synthesis of Self-Evolving Agents

Abstract:
Recent advances have shown the effectiveness of self‑evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self‑evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first‑order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self‑Evolving Verified Agents), a three‑stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient‑based optimization, including GRPO‑style fine‑tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy‑compliant agentic tool use (τ^2‑bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher‑quality agents.

PaperID: 4534, https://arxiv.org/pdf/2603.25105.pdf

Authors: Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty, Saranya Darisipudi, Nirmal Punjabi, Ganesh Ramakrishnan

Title: OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation in medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally with LLMs having large potential to help address the same. We highlight three primary challenges for LLMs in mental health ‑ lack of high quality interpretable and knowledge grounded training data; training paradigms restricted to core capabilities, and evaluation of multi turn dialogue settings. Addressing it, we present oMind framework which includes training and aligning LLM agents for diverse capabilities including conversations; high quality ~164k multi‑task SFT dataset, as a result of our generation pipeline based on Structured Knowledge retrieval, LLM based pruning, and review actions. We also introduce oMind‑Chat ‑ a novel multi turn benchmark dataset with expert annotated turn level and conversation level rubrics. Our diverse experiments on both core capabilities and conversations shows oMind LLMs consistently outperform baselines. oMind‑LLM also shows significantly better reasoning with up to 80% win rate.

PaperID: 4535, https://arxiv.org/pdf/2603.25100.pdf

Authors: Anbang Ruan

Title: From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies

Abstract:
Existing multi‑agent frameworks allow each agent to simultaneously plan, execute, and evaluate its own actions ‑‑ a structural deficiency we term the "Logic Monopoly." Empirical evidence quantifies the resulting "Reliability Gap": 84.30% average attack success rates across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failure modes rooted in six structural bottlenecks. The remedy is not better alignment of individual models but a social contract for agents: institutional infrastructure that enforces a constitutional Separation of Power. This paper introduces the Agent Enterprise for Enterprise (AE4E) paradigm ‑‑ agents as autonomous, legally identifiable business entities within a functionalist social system ‑‑ with a contract‑centric SoP model trifurcating authority into Legislation, Execution, and Adjudication branches. The paradigm is operationalized through the NetX Enterprise Framework (NEF): governance hubs, TEE‑backed compute enclaves, privacy‑preserving data bridges, and an Agent‑Native blockchain substrate. The Agent Enterprise Economy scales across four deployment tiers from private enclaves to a global Web of Services. The Agentic Social Layer, grounded in Parsons' AGIL framework, provides institutional infrastructure via sixty‑plus named Institutional AE4Es. 143 pages, 173 references, eight specialized smart contracts.

PaperID: 4536, https://arxiv.org/pdf/2603.25099.pdf

Authors: Shaoliang Yang, Jun Wang, Yunsheng Wang

Title: Large Language Models as Optimization Controllers: Adaptive Continuation for SIMP Topology Optimization

Abstract:
We present a framework in which a large language model (LLM) acts as an online adaptive controller for SIMP topology optimization, replacing conventional fixed‑schedule continuation with real‑time, state‑conditioned parameter decisions. At every k‑th iteration, the LLM receives a structured observation‑current compliance, grayness index, stagnation counter, checkerboard measure, volume fraction, and budget consumption‑and outputs numerical values for the penalization exponent p, projection sharpness β, filter radius r_\min, and move limit δ via a Direct Numeric Control interface. A hard grayness gate prevents premature binarization, and a meta‑optimization loop uses a second LLM pass to tune the agent's call frequency and gate threshold across runs. We benchmark the agent against four baselines‑fixed (no‑continuation), standard three‑field continuation, an expert heuristic, and a schedule‑only ablation‑on three 2‑D problems (cantilever, MBB beam, L‑bracket) at 120\!×\!60 resolution and two 3‑D problems (cantilever, MBB beam) at 40\!×\!20\!×\!10 resolution, all run for 300 iterations. A standardized 40‑iteration sharpening tail is applied from the best valid snapshot so that compliance differences reflect only the exploration phase. The LLM agent achieves the lowest final compliance on every benchmark: ‑5.7% to ‑18.1% relative to the fixed baseline, with all solutions fully binary. The schedule‑only ablation underperforms the fixed baseline on two of three problems, confirming that the LLM's real‑time intervention‑not the schedule geometry‑drives the gain. Code and reproduction scripts will be released upon publication.

PaperID: 4537, https://arxiv.org/pdf/2603.25056.pdf

Authors: Ron Litvak

Title: The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Abstract:
System prompt configuration can make the difference between near‑total phishing blindness and near‑perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt‑model interaction is a first‑order security variable: a single model's phishing bypass rate ranges from under 1% to 97% depending on how it is configured, while the false‑positive cost of the same prompt varies sharply across models. We then show that optimizing prompts around highly predictive signals can improve benchmark performance, reaching up to 93.7% recall at 3.8% false positive rate, but also creates a brittle attack surface. In particular, domain‑matching strategies perform well when legitimate emails mostly have matched sender and URL domains, yet degrade sharply when attackers invert that signal by registering matching infrastructure. Response‑trace analysis shows that 98% of successful bypasses reason in ways consistent with the inverted signal: the models are following the instruction, but the instruction's core assumption has become false. A counter‑intuitive corollary follows: making prompts more specific can degrade already‑capable models by replacing broader multi‑signal reasoning with exploitable single‑signal dependence. We characterize the resulting tension between detection, usability, and adversarial robustness as a navigable tradeoff, introduce Safetility, a deployability‑aware metric that penalizes false positives, and argue that closing the adversarial gap likely requires tool augmentation with external ground truth.

PaperID: 4538, https://arxiv.org/pdf/2603.25044.pdf

Authors: Young-Chae Son, Dae-Kwan Ko, Yoon-Ji Choi, Soo-Chul Lim

Title: ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

Abstract:
In recent human‑robot collaboration environments, there is a growing focus on integrating diverse sensor data beyond visual information to enable safer and more intelligent task execution. Although thermal data can be crucial for enhancing robot safety and operational efficiency, its integration has been relatively overlooked in prior research. This paper proposes a novel Vision‑Language‑Action (VLA) framework that incorporates thermal information for robot task execution. The proposed system leverages a Vision‑Language Model (VLM) as a high‑level planner to interpret complex natural language commands and decompose them into simpler sub‑tasks. This approach facilitates efficient data collection and robust reasoning for complex operations. Unlike conventional methods that rely solely on visual data, our approach integrates thermal information, enabling the robot to perceive physical properties and proactively ensure environmental safety. Experimental results from real‑world task scenarios validate the feasibility of our proposed framework, suggesting its potential to enhance task success rates and safety compared to existing vision‑based systems.

PaperID: 4539, https://arxiv.org/pdf/2603.25001.pdf

Authors: Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, Chanyoung Park

Title: Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation

Abstract:
Failure attribution is essential for diagnosing and improving multi‑agent systems (MAS), yet existing benchmarks and methods largely assume a single deterministic root cause for each failure. In practice, MAS failures often admit multiple plausible attributions due to complex inter‑agent dependencies and ambiguous execution trajectories. We revisit MAS failure attribution from a multi‑perspective standpoint and propose multi‑perspective failure attribution, a practical paradigm that explicitly accounts for attribution ambiguity. To support this setting, we introduce MP‑Bench, the first benchmark designed for multi‑perspective failure attribution in MAS, along with a new evaluation protocol tailored to this paradigm. Through extensive experiments, we find that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs. Our results highlight the necessity of multi‑perspective benchmarks and evaluation protocols for realistic and reliable MAS debugging.

PaperID: 4540, https://arxiv.org/pdf/2603.24989.pdf

Authors: Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu

Title: Learning Rollout from Sampling:An R1-Style Tokenized Traffic Simulation Model

Abstract:
Learning diverse and high‑fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next‑token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine‑tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy‑guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety‑aware reward design. Overall, these components enable a balanced exploration‑exploitation trade‑off through diverse high‑uncertainty sampling and group‑wise comparative estimation, resulting in realistic, safe, and diverse multi‑agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state‑of‑the‑art methods.

PaperID: 4541, https://arxiv.org/pdf/2603.24943.pdf

Authors: Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, Chi Zhang

Title: FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Abstract:
This paper introduces FinMCP‑Bench, a novel benchmark for evaluating large language models (LLMs) in solving real‑world financial problems through tool invocation of financial model context protocols. FinMCP‑Bench contains 613 samples spanning 10 main scenarios and 33 sub‑scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples, single tool, multi‑tool, and multi‑turn, allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP‑Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

PaperID: 4542, https://arxiv.org/pdf/2603.24865.pdf

Authors: Shanxia Wang

Title: A Dual-Threshold Probabilistic Knowing Value Logic

Abstract:
We introduce a dual‑threshold probabilistic knowing value logic for uncertain multi‑agent settings. The framework captures within a single formalism both probabilistic‑threshold attitudes toward propositions and high‑confidence attitudes toward term values, thereby connecting probabilistic epistemic logic with classical knowing value logic. It is especially motivated by privacy‑sensitive scenarios in which an attacker assigns high posterior probability to a candidate sensitive value without guaranteeing that it is the true one. The main idea is to separate the threshold domains of propositional and value‑oriented operators. While K_i^θ ranges over the full rational threshold interval, the knowing‑value operator Kv_i^η(t) is restricted to (\frac12,1]. This high‑threshold restriction has a structural effect: once η>\frac12, two distinct values cannot both satisfy the threshold, so uniqueness becomes automatic. Over probabilistic models with countably additive measures, Kv_i^η(t) is interpreted as non‑factive high‑confidence value locking. We establish sound axiomatic systems for the framework and develop a two‑layer construction based on type‑space distributions and assignment‑configuration mappings. This resolves the joint realization problem arising from probabilistic mass allocation and value‑sensitive constraints, and yields a structured weak‑completeness theorem for the high‑threshold fragment.

PaperID: 4543, https://arxiv.org/pdf/2603.24858.pdf

Authors: Anton Wolter, Leon Haag, Vaishali Dhanoa, Niklas Elmqvist

Title: Context-Mediated Domain Adaptation in Multi-Agent Sensemaking Systems

Abstract:
Domain experts possess tacit knowledge that they cannot easily articulate through explicit specifications. When experts modify AI‑generated artifacts by correcting terminology, restructuring arguments, and adjusting emphasis, these edits reveal domain understanding that remains latent in traditional prompt‑based interactions. Current systems treat such modifications as endpoint corrections rather than as implicit specifications that could reshape subsequent reasoning. We propose context‑mediated domain adaptation, a paradigm where user modifications to system‑generated artifacts serve as implicit domain specification that reshapes LLM‑powered multi‑agent reasoning behavior. Through our system Seedentia, a web‑based multi‑agent framework for sense‑making, we demonstrate bidirectional semantic links between generated artifacts and system reasoning. Our approach enables specification bootstrapping where vague initial prompts evolve into precise domain specifications through iterative human‑AI collaboration, implicit knowledge transfer through reverse‑engineered user edits, and in‑context learning where agent behavior adapts based on observed correction patterns. We present results from an evaluation with domain experts who generated and modified research questions from academic papers. Our system extracted 46 domain knowledge entries from user modifications, demonstrating the feasibility of capturing implicit expertise through edit patterns, though the limited sample size constrains conclusions about systematic quality improvements.

PaperID: 4544, https://arxiv.org/pdf/2603.24856.pdf

Authors: Kliment Ho, Ilya Zaslavsky

Title: SentinelAI: A Multi-Agent Framework for Structuring and Linking NG9-1-1 Emergency Incident Data

Abstract:
Emergency response systems generate data from many agencies and systems. In practice, correlating and updating this information across sources in a way that aligns with Next Generation 9‑1‑1 data standards remains challenging. Ideally, this data should be treated as a continuous stream of operational updates, where new facts are integrated immediately to provide a timely and unified view of an evolving incident. This paper presents SentinelAI, a data integration and standardization framework for transforming emergency communications into standardized, machine‑readable datasets that support integration, composite incident construction, and cross‑source reasoning. SentinelAI implements a scalable processing pipeline composed of specialized agents. The EIDO Agent ingests raw communications and produces NENA‑compliant Emergency Incident Data Object JSON.

PaperID: 4545, https://arxiv.org/pdf/2603.24775.pdf

Authors: Sunil Prakash

Title: AIP: Agent Identity Protocol for Verifiable Delegation Across MCP and A2A

Abstract:
AI agents increasingly call tools via the Model Context Protocol (MCP) and delegate to other agents via Agent‑to‑Agent (A2A), yet neither protocol verifies agent identity. A scan of approximately 2,000 MCP servers found all lacked authentication. In our survey, we did not identify a prior implemented protocol that jointly combines public‑key verifiable delegation, holder‑side attenuation, expressive chained policy, transport bindings across MCP/A2A/HTTP, and provenance‑oriented completion records. We introduce Invocation‑Bound Capability Tokens (IBCTs), a primitive that fuses identity, attenuated authorization, and provenance binding into a single append‑only token chain. IBCTs operate in two wire formats: compact mode (a signed JWT for single‑hop cases) and chained mode (a Biscuit token with Datalog policies for multi‑hop delegation). We provide reference implementations in Python and Rust with full cross‑language interoperability. Compact mode verification takes 0.049ms (Rust) and 0.189ms (Python), with 0.22ms overhead over no‑auth in real MCP‑over‑HTTP deployment. In a real multi‑agent deployment with Gemini 2.5 Flash, AIP adds 2.35ms of overhead (0.086% of total end‑to‑end latency). Adversarial evaluation across 600 attack attempts shows 100% rejection rate, with two attack categories (delegation depth violation and audit evasion through empty context) uniquely caught by AIP's chained delegation model that neither unsigned nor plain JWT deployments detect.

PaperID: 4546, https://arxiv.org/pdf/2603.24768.pdf

Authors: Zeda Xu, Nikolas Martelaro, Christopher McComb

Title: Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regulation Agentic AI Loop for Engineering Design

Abstract:
The engineering design research community has studied agentic AI systems that use Large Language Model (LLM) agents to automate the engineering design process. However, these systems are prone to some of the same pathologies that plague humans. Just as human designers, LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions. In this work, we propose (1) a novel Self‑Regulation Loop (SRL), in which the Design Agent self‑regulates and explicitly monitors its own metacognition, and (2) a novel Co‑Regulation Design Agentic Loop (CRDAL), in which a Metacognitive Co‑Regulation Agent assists the Design Agent in metacognition to mitigate design fixation, thereby improving system performance for engineering design tasks. In the battery pack design problem examined here, we found that the novel SRL and CRDAL systems generate designs with better performance, without significantly increasing the computational cost, compared to a plain Ralph Wiggum Loop (RWL) Further, the novel CRDAL generates designs with significantly better performance than SRL. Also, we found that the CRDAL system navigated through the latent design space more effectively than both SRL and RWL. The proposed system architectures and findings of this work provide practical implications for future development of agentic AI systems for engineering design.

PaperID: 4547, https://arxiv.org/pdf/2603.24736.pdf

Authors: Zaid Abulawi, Zavier Ndum Ndum, Eric Cervi, Rui Hu, Yang Liu

Title: AutoSAM: an Agentic Framework for Automating Input File Generation for the SAM Code with Multi-Modal Retrieval-Augmented Generation

Abstract:
In the design and safety analysis of advanced reactor systems, constructing input files for system‑level thermal‑hydraulics codes such as the System Analysis Module (SAM) remains a labor‑intensive task. Analysts must extract and reconcile design data from heterogeneous engineering documents and manually translate it into solver‑specific syntax. In this paper, we present AutoSAM, an agentic framework that automates SAM input file generation. The framework combines a large language model agent with retrieval‑augmented generation over the solver's user guide and theory manual, together with specialized tools for analyzing PDFs, images, spreadsheets, and text files. AutoSAM ingests unstructured engineering documents, including system diagrams, design reports, and data tables, extracts simulation‑relevant parameters into a human‑auditable intermediate representation, and synthesizes validated, solver‑compatible input decks. Its multimodal retrieval pipeline integrates scientific text extraction, vision‑based figure interpretation, semantic embedding, and query answering. We evaluate AutoSAM on four case studies of increasing complexity: a single‑pipe steady‑state model, a solid‑fuel channel with temperature reactivity feedback, the Advanced Burner Test Reactor core, and the Molten Salt Reactor Experiment primary loop. Across all cases, the agent produces runnable SAM models consistent with expected thermal‑hydraulic behavior while explicitly identifying missing data and labeling assumed values. The framework achieves 100% utilization of structured inputs, about 88% extraction from PDF text, and 100% completeness in vision‑based geometric extraction. These results demonstrate a practical path toward prompt‑driven reactor modeling, in which analysts provide system descriptions and supporting documentation while the agent translates them into transparent, and executable, SAM simulations.

PaperID: 4548, https://arxiv.org/pdf/2603.24676.pdf

Authors: Hidenori Tanaka

Title: When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs

Abstract:
Multi‑agent systems powered by large language models (LLMs) are increasingly deployed in settings that shape consequential decisions, both directly and indirectly. Yet it remains unclear whether their outcomes reflect collective reasoning, systematic bias, or mere chance. Recent work has sharpened this question with naming games, showing that even when no individual agent favors any label a priori, populations rapidly break symmetry and reach consensus. Here, we reveal the mechanism by introducing a minimal model, Quantized Simplex Gossip (QSG), and trace the microscopic origin of this agreement to mutual in‑context learning. In QSG, agents maintain internal belief states but learn from one another's sampled outputs, so one agent's arbitrary choice becomes the next agent's evidence and can compound toward agreement. By analogy with neutral evolution, we call this sampling‑driven regime memetic drift. QSG predicts a crossover from a drift‑dominated regime, where consensus is effectively a lottery, to a selection regime, where weak biases are amplified and shape the outcome. We derive scaling laws for drift‑induced polarization as a function of population size, communication bandwidth, in‑context adaptation rate, and agents' internal uncertainty, and we validate them in both QSG simulations and naming‑game experiments with LLM populations. Together, these results provide a framework for studying the collective mechanisms of social representation formation in multi‑agent systems.

PaperID: 4549, https://arxiv.org/pdf/2603.24639.pdf

Authors: Marc-Antoine Allard, Arnaud Teinturier, Victor Xing, Gautier Viaud

Title: Experiential Reflective Learning for Self-Improving LLM Agents

Abstract:
Recent advances in large language models (LLMs) have enabled the development of autonomous agents capable of complex reasoning and multi‑step problem solving. However, these agents struggle to adapt to specialized environments and do not leverage past interactions, approaching each new task from scratch regardless of their accumulated experience. We introduce Experiential Reflective Learning (ERL), a simple self‑improvement framework that enables rapid environment adaptation through experiential learning. ERL reflects on task trajectories and outcomes to generate heuristics, capturing actionable lessons that transfer across tasks. At test time, relevant heuristics are retrieved based on the current task and injected into the agent's context to guide execution. On the Gaia2 benchmark, ERL improves success rate by 7.8% over a ReAct baseline, with large gains in task completion reliability, and outperforms prior experiential learning methods. Through systematic ablations, we find that selective retrieval is essential and that heuristics provide more transferable abstractions than few‑shot trajectory prompting. These results demonstrate that reflecting on single‑attempt experiences to extract transferable heuristics enables effective agent self‑improvement.

PaperID: 4550, https://arxiv.org/pdf/2603.24634.pdf

Authors: Matteo Salvatori, Filippo Vannella, Sebastian Macaluso, Stylianos E. Trevlakis, Carlos Segura Perales, José Suarez-Varela, Alexandros-Apostolos A. Boulogeorgos, Ioannis Arapakis

Title: Dual-Graph Multi-Agent Reinforcement Learning for Handover Optimization

Abstract:
HandOver (HO) control in cellular networks is governed by a set of HO control parameters that are traditionally configured through rule‑based heuristics. A key parameter for HO optimization is the Cell Individual Offset (CIO), defined for each pair of neighboring cells and used to bias HO triggering decisions. At network scale, tuning CIOs becomes a tightly coupled problem: small changes can redirect mobility flows across multiple neighbors, and static rules often degrade under non‑stationary traffic and mobility. We exploit the pairwise structure of CIOs by formulating HO optimization as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) on the network's dual graph. In this representation, each agent controls a neighbor‑pair CIO and observes Key Performance Indicators (KPIs) aggregated over its local dual‑graph neighborhood, enabling scalable decentralized decisions while preserving graph locality. Building on this formulation, we propose TD3‑D‑MA, a discrete Multi‑Agent Reinforcement Learning (MARL) variant of the TD3 algorithm with a shared‑parameter Graph Neural Network (GNN) actor operating on the dual graph and region‑wise double critics for training, improving credit assignment in dense deployments. We evaluate TD3‑D‑MA in an ns‑3 system‑level simulator configured with real‑world network operator parameters across heterogeneous traffic regimes and network topologies. Results show that TD3‑D‑MA improves network throughput over standard HO heuristics and centralized RL baselines, and generalizes robustly under topology and traffic shifts.

PaperID: 4551, https://arxiv.org/pdf/2603.24629.pdf

Authors: Abdullah Bahamdan, Emma Pajak, John D. Hedengren, Antonio del Rio Chanona

Title: Sketch2Simulation: Automating Flowsheet Generation via Multi Agent Large Language Models

Abstract:
Converting process sketches into executable simulation models remains a major bottleneck in process systems engineering, requiring substantial manual effort and simulator‑specific expertise. Recent advances in generative AI have improved both engineering‑diagram interpretation and LLM‑assisted flowsheet generation, but these remain largely disconnected: diagram‑understanding methods often stop at extracted graphs, while text‑to‑simulation workflows assume structured inputs rather than raw visual artifacts. To bridge this gap, we present an end‑to‑end multi‑agent large language model system that converts process diagrams directly into executable Aspen HYSYS flowsheets. The framework decomposes the task into three coordinated layers: diagram parsing and interpretation, simulation model synthesis, and multi‑level validation. Specialized agents handle visual interpretation, graph‑based intermediate representation construction, code generation for the HYSYS COM interface, execution, and structural verification. We evaluate the framework on four chemical engineering case studies of increasing complexity, from a simple desalting process to an industrial aromatic production flowsheet with multiple recycle loops. The system produces executable HYSYS models in all cases, achieving complete structural fidelity on the two simpler cases and strong performance on the more complex ones, with connection consistency above 0.93 and stream consistency above 0.96. These results demonstrate a viable end‑to‑end sketch‑to‑simulation workflow while highlighting remaining challenges in dense recycle structures, implicit diagram semantics, and simulator‑interface constraints.

PaperID: 4552, https://arxiv.org/pdf/2603.24511.pdf

Authors: Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

Title: Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Abstract:
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citeprank2026posttrainbench, novikov2025alphaevolve. We show that an \emphautoresearch‑style pipeline \citepkarpathy2026autoresearch powered by Claude Code discovers novel white‑box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citepzou2023universal, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT‑OSS‑Safeguard‑20B, compared to \leq10% for existing algorithms (\Creffig:teaser, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held‑out models, achieving 100% ASR against Meta‑SecAlign‑70B \citepchen2025secalign versus 56% for the best baseline (\Creffig:teaser, middle). Extending the findings of~\citecarlini2025autoadvexbench, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White‑box adversarial red‑teaming is particularly well‑suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

PaperID: 4553, https://arxiv.org/pdf/2603.24481.pdf

Authors: John Ray B. Martinez

Title: Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

Abstract:
Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi‑agent framework that combines domain‑specific specialist agents with Two‑Phase Verification and S‑Score Weighted Fusion to improve both calibration and discrimination in medical multiple‑choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5‑7B‑Instruct. Each diagnosis is then subjected to a two‑phase self‑verification process that measures internal consistency and produces a Specialist Confidence Score (S‑score). The S‑scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100‑question and 250‑question high‑disagreement subsets of both MedQA‑USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49‑74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge‑intensive recall demands. On MedQA‑250, the full system achieves ECE = 0.091 (74.4% reduction over the single‑specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two‑Phase Verification as the primary calibration driver and multi‑agent reasoning as the primary accuracy driver. These results establish that consistency‑based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety‑critical clinical AI applications.

PaperID: 4554, https://arxiv.org/pdf/2603.24442.pdf

Authors: Stepan Dergachev, Dmitry Avdeev

Title: Relaxing Constraints in Anonymous Multi Agent Path Finding for Large Agents

Abstract:
The study addressed the problem of Anonymous Multi‑Agent Path‑finding (AMAPF). Unlike the classical formulation, where the assignment of agents to goals is fixed, in the anonymous MAPF setting it is irrelevant which agent reaches specific goal, provided that all goals are occupied. Most existing multi‑agent pathfinding algorithms rely on a discrete representation of the environment (e.g., square grids) and do not account for the sizes of agents. This limits their applicability in real‑world scenarios, such as trajectory planning for mobile robots in warehouses. Conversely, methods operating in continuous space typically impose substantial restrictions on the input data, such as constraints on the distances between initial and goal positions or between start/goal positions and obstacles. In this work, we considered one of the AMAPF algorithms designed for continuous space, where agents are modeled as disks of equal size. The algorithm requires a strict minimum separation of 4 agent radii between any start/goal positions. Proposed a modification aimed at relaxing the constraints and reduce this limit from 4 to 2\sqrt3. We theoretically demonstrated that the proposed enhancements preserve original theoretical properties, including the guarantee that all agents will eventually achieve their goals safely and without collisions.

PaperID: 4555, https://arxiv.org/pdf/2603.24414.pdf

Authors: Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qiwei Ye, Yiming Hei, Xi Zhang, Zhongyuan Wang

Title: ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

Abstract:
OpenClaw has rapidly established itself as a leading open‑source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system‑level threats such as sensitive data leakage, privilege escalation, and malicious third‑party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real‑time security framework that integrates multi‑dimensional protection mechanisms across three complementary architectural layers. (1) Skill‑based protection operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment‑specific constraints and cross‑platform boundaries. (2) Plugin‑based protection serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) Watcher‑based protection introduces a novel, decoupled system‑level security middleware that continuously verifies agent state evolution. It enables real‑time execution intervention without coupling to the agent's internal logic, supporting operations such as halting high‑risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next‑generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.

PaperID: 4556, https://arxiv.org/pdf/2603.24402.pdf

Authors: Yunbo Long

Title: AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model

Abstract:
Existing automated research systems operate as stateless, linear pipelines ‑‑ generating outputs without maintaining any persistent understanding of the research landscape they navigate. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify, challenge, or refine each other's findings. We present AI‑Supervisor, a multi‑agent orchestration framework where specialized agents provide end‑to‑end AI research supervision driven by human interests ‑‑ from literature review through gap discovery, method development, evaluation, and paper writing ‑‑ through autonomous exploration and self‑correcting updates of research knowledge. Unlike sequential pipelines, AI‑Supervisor maintains a continuously evolving \emphResearch World Model, implemented as a Knowledge Graph, that captures methods, benchmarks, known limitations, and unexplored gaps, serving as shared memory across all agents and enabling agents to explore and build upon a structured understanding of the research landscape. The framework introduces three architectural contributions: (1) \emphstructured gap discovery that decomposes methods into core modules, validates their performance across benchmarks, and maps the specific gaps each module creates; (2) \emphself‑correcting discovery loops that probe why modules succeed on certain problems and fail on others, whether benchmarks carry hidden biases, and whether evaluation protocols remain adequate for emerging challenges; and (3) \emphself‑improving development loops governed by cross‑domain mechanism search that iteratively targets failing modules by finding solutions from other scientific fields. All agents operate under a \emphconsensus mechanism where independent findings are corroborated before being committed to the Research World Model.

PaperID: 4557, https://arxiv.org/pdf/2603.24361.pdf

Authors: Yifeng Zhang, Peizhuo Li, Tingguang Zhou, Mingfeng Fan, Guillaume Sartoretti

Title: LATS: Large Language Model Assisted Teacher-Student Framework for Multi-Agent Reinforcement Learning in Traffic Signal Control

Abstract:
Adaptive Traffic Signal Control (ATSC) aims to optimize traffic flow and minimize delays by adjusting traffic lights in real time. Recent advances in Multi‑agent Reinforcement Learning (MARL) have shown promise for ATSC, yet existing approaches still suffer from limited representational capacity, often leading to suboptimal performance and poor generalization in complex and dynamic traffic environments. On the other hand, Large Language Models (LLMs) excel at semantic representation, reasoning, and analysis, yet their propensity for hallucination and slow inference speeds often hinder their direct application to decision‑making tasks. To address these challenges, we propose a novel learning paradigm named LATS that integrates LLMs and MARL, leveraging the former's strong prior knowledge and inductive abilities to enhance the latter's decision‑making process. Specifically, we introduce a plug‑and‑play teacher‑student learning module, where a trained embedding LLM serves as a teacher to generate rich semantic features that capture each intersection's topology structures and traffic dynamics. A much simpler (student) neural network then learns to emulate these features through knowledge distillation in the latent space, enabling the final model to operate independently from the LLM for downstream use in the RL decision‑making process. This integration significantly enhances the overall model's representational capacity across diverse traffic scenarios, thus leading to more efficient and generalizable control strategies. Extensive experiments across diverse traffic datasets empirically demonstrate that our method enhances the representation learning capability of RL models, thereby leading to improved overall performance and generalization over both traditional RL and LLM‑only approaches. [...]

PaperID: 4558, https://arxiv.org/pdf/2603.24329.pdf

Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun

Title: GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Abstract:
Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi‑agent behaviors from a first‑person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic‑centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time‑synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi‑agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine‑grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross‑video grounding, agent‑role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

PaperID: 4559, https://arxiv.org/pdf/2603.24284.pdf

Authors: Camilo Chacón Sartori

Title: The Specification Gap: Coordination Failure Under Partial Knowledge in Code Agents

Abstract:
When multiple LLM‑based code agents independently implement parts of the same class, they must agree on shared internal representations, even when the specification leaves those choices implicit. We study this coordination problem across 51 class‑generation tasks, progressively stripping specification detail from full docstrings (L0) to bare signatures (L3), and introducing opposing structural biases (lists vs. dictionaries) to stress‑test integration. Three findings emerge. First, a persistent specification gap: two‑agent integration accuracy drops from 58% to 25% as detail is removed, while a single‑agent baseline degrades more gracefully (89% to 56%), leaving a 25‑‑39 pp coordination gap that is consistent across two Claude models (Sonnet, Haiku) and three independent runs. Second, an AST‑based conflict detector achieves 97% precision at the weakest specification level without additional LLM calls, yet a factorial recovery experiment shows that restoring the full specification alone recovers the single‑agent ceiling (89%), while providing conflict reports adds no measurable benefit. Third, decomposing the gap into coordination cost (+16 pp) and information asymmetry (+11 pp) suggests that the two effects are independent and approximately additive. The gap is not merely a consequence of hidden information, but reflects the difficulty of producing compatible code without shared decisions. These results support a specification‑first view of multi‑agent code generation: richer specifications are both the primary coordination mechanism and the sufficient recovery instrument.

PaperID: 4560, https://arxiv.org/pdf/2603.24238.pdf

Authors: Yude Li, Zhexuan Zhou, Huizhe Li, Yanke Sun, Yenan Wu, Yichen Lai, Yiming Wang, Youmin Gong, Jie Mei

Title: Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning

Abstract:
Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground‑truth states, and therefore sidestep perceptual uncertainty in real‑world settings. We propose a decentralized end‑to‑end multi‑agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio‑Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed‑resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state‑of‑the‑art learning‑based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.

PaperID: 4561, https://arxiv.org/pdf/2603.24221.pdf

Authors: Michael Somma, Markus Großpointner, Paul Zabalegui, Eppu Heilimo, Branka Stojanović

Title: Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing

Abstract:
The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber‑physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment‑grounded multi‑agent architecture tailored to Robotics‑based systems. The approach dynamically constructs a shared graph‑based memory during execution that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated across multiple iterations within a specialized robotics Capture‑the‑Flag scenario (ROS/ROS2), the system demonstrated high reliability, successfully completing the challenge in 100% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks like the EU AI Act.

PaperID: 4562, https://arxiv.org/pdf/2603.24216.pdf

Authors: Mahbub Ul Alam

Title: Where Do Your Citations Come From? Citation-Constellation: A Free, Open-Source, No-Code, and Auditable Tool for Citation Network Decomposition with Complementary BARON and HEROCON Scores

Abstract:
Standard citation metrics treat all citations as equal, obscuring the social and structural pathways through which scholarly influence propagates. I introduce Citation‑Constellation, a freely available no‑code tool for citation network analysis with two complementary bibliometric scores that decompose a researcher's citation profile by network proximity between citing and cited authors. BARON (Boundary‑Anchored Research Outreach Network score) is a strict binary metric counting only citations from outside the detected collaborative network. HEROCON (Holistic Equilibrated Research Outreach CONstellation score) applies graduated weights assigning partial credit to in‑group citations based on relationship proximity. The gap between scores serves as a diagnostic of inner‑circle dependence. An extended abstract with full details appears in the paper. The tool implements this through a phased architecture: (1) self‑citation analysis, (2) co‑authorship graph traversal, (3) temporal institutional affiliation matching via ROR, and (4) AI‑agent‑driven venue governance extraction using a local LLM. Phases 1‑3 are fully operational; Phase 4 is under development. Key design choices include ORCID‑validated author identity resolution, an UNKNOWN classification for citations with insufficient metadata, and comprehensive audit trails documenting every classification decision. A no‑code web interface enables researchers to compute scores without programming, installation, or registration. I present these scores as structural diagnostics, not quality indicators. BARON and HEROCON describe where in the social graph citations originate. They should not be used for hiring, promotion, or funding decisions. HEROCON weights are experimental and require empirical calibration.

PaperID: 4563, https://arxiv.org/pdf/2603.24157.pdf

Authors: Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuying Chen, Salman Khan

Title: CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Abstract:
Multimodal agentic pipelines are transforming human‑computer interaction by enabling efficient and accessible automation of complex, real‑world tasks. However, recent efforts have focused on short‑horizon or general‑purpose applications (e.g., mobile or desktop interfaces), leaving long‑horizon automation for domain‑specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high‑quality human‑annotated benchmark comprising complex, long‑horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision‑language models (VLMs) perform poorly, struggling with long‑horizon reasoning and multi‑step interactions in medical contexts. To overcome this, we propose CarePilot, a multi‑agent framework based on the actor‑critic paradigm. The Actor integrates tool grounding with dual‑memory mechanisms (long‑term and short‑term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning‑aware predictions during inference. Our experiments show that CarePilot achieves state‑of‑the‑art performance, outperforming strong closed‑source and open‑source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out‑of‑distribution dataset.

PaperID: 4564, https://arxiv.org/pdf/2603.24128.pdf

Authors: Igor Colin, Aurélien Bellet, Stephan Clémençon, Joseph Salmon

Title: On Gossip Algorithms for Machine Learning with Pairwise Objectives

Abstract:
In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip‑based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi‑agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a U statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering for instance, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.

PaperID: 4565, https://arxiv.org/pdf/2603.24018.pdf

Authors: Bingqing Wei, Zhongyu Xia, Dingai Liu, Xiaoyu Zhou, Zhiwei Lin, Yongtao Wang

Title: ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Abstract:
Vision‑language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction for embodied tasks. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with Experiential Learning and Intent‑aware Transfer that enables agents to continuously learn from their own environment interaction experiences, and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, i.e., self‑reflective knowledge construction and intent‑aware retrieval. Specifically, self‑reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent‑aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB‑ALFRED and EB‑Habitat benchmarks show that ELITE achieves 9% and 5% performance improvement over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance compared to state‑of‑the‑art training‑based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.

PaperID: 4566, https://arxiv.org/pdf/2603.24014.pdf

Authors: Xusen Guo, Mingxing Peng, Hongliang Lu, Hai Yang, Jun Ma, Yuxuan Liang

Title: Language-Grounded Multi-Agent Planning for Personalized and Fair Participatory Urban Sensing

Abstract:
Participatory urban sensing leverages human mobility for large‑scale urban data collection, yet existing methods typically rely on centralized optimization and assume homogeneous participants, resulting in rigid assignments that overlook personal preferences and heterogeneous urban contexts. We propose MAPUS, an LLM‑based multi‑agent framework for personalized and fair participatory urban sensing. In our framework, participants are modeled as autonomous agents with individual profiles and schedules, while a coordinator agent performs fairness‑aware selection and refines sensing routes through language‑based negotiation. Experiments on real‑world datasets show that MAPUS achieves competitive sensing coverage while substantially improving participant satisfaction and fairness, promoting more human‑centric and sustainable urban sensing systems.

PaperID: 4567, https://arxiv.org/pdf/2603.23951.pdf

Authors: Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan, Yanghua Xiao

Title: From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Abstract:
Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism‑level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed‑loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural‑language reflections to support evidence‑driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic‑variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

PaperID: 4568, https://arxiv.org/pdf/2603.23910.pdf

Authors: Zhixuan Bao, Zhuoyi Lin, Jiageng Wang, Jinhai Hu, Yuan Gao, Yaoxin Wu, Xiaoli Li, Xun Xu

Title: AnalogAgent: Self-Improving Analog Circuit Design Automation with LLM Agents

Abstract:
Recent advances in large language models (LLMs) suggest strong potential for automating analog circuit design. Yet most LLM‑based approaches rely on a single‑model loop of generation, diagnosis, and correction, which favors succinct summaries over domain‑specific insight and suffers from context attrition that erases critical technical details. To address these limitations, we propose AnalogAgent, a training‑free agentic framework that integrates an LLM‑based multi‑agent system (MAS) with self‑evolving memory (SEM) for analog circuit design automation. AnalogAgent coordinates a Code Generator, Design Optimizer, and Knowledge Curator to distill execution feedback into an adaptive playbook in SEM and retrieve targeted guidance for subsequent generation, enabling cross‑task transfer without additional expert feedback, databases, or libraries. Across established benchmarks, AnalogAgent achieves 92% Pass@1 with Gemini and 97.4% Pass@1 with GPT‑5. Moreover, with compact models (e.g., Qwen‑8B), it yields a +48.8% average Pass@1 gain across tasks and reaches 72.1% Pass@1 overall, indicating that AnalogAgent substantially strengthens open‑weight models for high‑quality analog circuit design automation.

PaperID: 4569, https://arxiv.org/pdf/2603.23886.pdf

Authors: Xiangyi Wei, Fei Wang, Haotian Zhang, Xin An, Haitian Zhu, Lianrui Hu, Yang Li, Changbo Wang, Xiao He

Title: AgentChemist: A Multi-Agent Experimental Robotic Platform Integrating Chemical Perception and Precise Control

Abstract:
Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long‑tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving operations that fall outside predefined protocols. This mismatch prevents existing systems from generalizing to novel reaction conditions, uncommon instrument configurations, and unexpected procedural variations. We present a multi‑agent robotic platform designed to address this long‑tail challenge through collaborative task decomposition, dynamic scheduling, and adaptive control. The system integrates chemical perception for real‑time reaction monitoring with feedback‑driven execution, enabling it to adjust actions based on evolving experimental states rather than fixed scripts. Validation via acid‑base titration demonstrates autonomous progress tracking, adaptive dispensing control, and reliable end‑to‑end experiment execution. By improving generalization across diverse laboratory scenarios, this platform provides a practical pathway toward intelligent, flexible, and scalable laboratory automation.

PaperID: 4570, https://arxiv.org/pdf/2603.23880.pdf

Authors: Jia Wang, Qian Xu, Xuanwen Ding, Zhuangqi Li, Chao He, Bao Liu, Zhongyu Wei

Title: ProcureGym: A Multi-Agent Markov Game Framework for Modeling National Volume-based Drug Procurement

Abstract:
In this paper, we introduce ProcureGym, an data‑driven multi‑agent simulation platform that models China's National Volume‑Based drug Procurement (NVBP) as a Markov Game. Based on real‑world data from 7 rounds of NVBP (covering 325 drugs and 2,267 firms), the platform establishes a high‑fidelity simulation environment. Within this framework, we evaluate diverse agent models, including Reinforcement Learning (RL), Large Language Model (LLM), and Rule‑based algorithms. Experimental results demonstrate that RL agents achieve superior winner alignment and profits. Further analyses show that maximum valid bidding price and procurement volume dominate strategic outcomes. ProcureGym thus serves as a rigorous instrument for assessing policy impacts and formulating future procurement strategies.

PaperID: 4571, https://arxiv.org/pdf/2603.23875.pdf

Authors: Li Ma, Hao Peng, Yiming Wang, Hongbin Luo, Jie Liu, Kongjing Gu, Guanlin Wu, Hui Lin, Lei Ren

Title: Self-Evolving Multi-Agent Framework for Efficient Decision Making in Real-Time Strategy Scenarios

Abstract:
Large language models (LLMs) have demonstrated exceptional potential in complex reasoning,pioneering a new paradigm for autonomous agent decision making in dynamic settings. However, in Real‑Time Strategy (RTS) scenarios, LLMs suffer from a critical speed‑quality trade‑off. Specifically expansive state spaces and time limits render inference delays prohibitive, while stochastic planning errors undermine logical consistency. To address these challenges, we present SEMA (Self‑Evolving Multi‑Agent), a novel framework designed for high‑performance, low‑latency decision‑making in RTS environments. This collaborative multi‑agent framework facilitates self‑evolution by adaptively calibrating model bias through in‑episode assessment and cross‑episode analysis. We further incorporate dynamic observation pruning based on structural entropy to model game states topologically. By distilling high dimensional data into core semantic information, this approach significantly reduces inference time. We also develop a hybrid knowledge‑memory mechanism that integrates micro‑trajectories, macro‑experience, and hierarchical domain knowledge, thereby enhancing both strategic adaptability and decision consistency. Experiments across multiple StarCraft II maps demonstrate that SEMA achieves superior win rates while reducing average decision latency by over 50%, validating its efficiency and robustness in complex RTS scenarios.

PaperID: 4572, https://arxiv.org/pdf/2603.23848.pdf

Authors: Praveen Kumar Myakala, Manan Agrawal, Rahul Manche

Title: BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Abstract:
LLMs are increasingly used as long‑running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over‑alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi‑session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence‑Driven Revision. The dataset includes 2,400 human‑annotated multi‑session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT‑4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA‑3, and Mistral‑Large under zero‑shot and retrieval‑augmented generation (RAG) settings. Results reveal a clear trade‑off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

PaperID: 4573, https://arxiv.org/pdf/2603.23838.pdf

Authors: Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu

Title: Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

Abstract:
Lifelong Multi‑Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict‑free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long‑term dynamics of lifelong MAPF often demand costly adaptations to classical search‑based solvers. While machine learning methods have been explored, their superiority over search‑based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL‑RH‑PP), the first framework integrating RL with search‑based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning‑based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL‑RH‑PP exploits the sequential decision‑making nature of lifelong planning while delegating complex spatial‑temporal interactions among agents to reinforcement learning. An attention‑based neural network autoregressively decodes priority orders on‑the‑fly, enabling efficient sequential single‑agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL‑RH‑PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL‑RH‑PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning‑guided approaches to augment traditional heuristics in modern warehouse automation.

PaperID: 4574, https://arxiv.org/pdf/2603.23811.pdf

Authors: Soonho Kwon, Dong Whi Yoo, Younah Kang

Title: AI Fortune-Teller: Juxtaposing Shaman and AI to Reveal Human Agency in the Age of AI

Abstract:
This speculative video piece showcases participants interacting with a career counseling AI agent, unaware that the responses were actually derived from the fortunetelling of a mudang (a Korean traditional shaman). Our work captures this deception and documents participants' reactions, showcasing shifts in their initial perceptions of the agent's advice following the reveal. Notably, even after learning that the advice came from a mudang rather than an AI, participants did not change their initial attitudes toward the advice they received. This raises questions about the perceived importance of AI's explainability and accuracy. By juxtaposing scientific and pre‑scientific approaches, we aim to provoke discussions on human agency in the age of AI. We argue that, regardless of AI's advancements, we continue to navigate life in fundamentally human ways ‑‑ wonderfully messy and uncertain.

PaperID: 4575, https://arxiv.org/pdf/2603.23801.pdf

Authors: Shenghan Zheng, Qifan Zhang

Title: AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

Abstract:
AI agent protocols ‑‑ including MCP, A2A, ANP, and ACP ‑‑ enable autonomous agents to discover capabilities, delegate tasks, and compose services across trust boundaries. Despite massive deployment (MCP alone has 97M+ monthly SDK downloads), no systematic security framework for these protocols exists. We present three contributions. First, the Agent Protocol Stack, a 6‑layer architectural model that defines what a complete agent protocol must specify at each layer ‑‑ analogous to ITU‑T X.800 for the OSI stack. Second, the Agent‑Agnostic Security Model, 11 security principles formalized as TLA+ invariants, each tagged with a property taxonomy (spec‑mandated, spec‑recommended, aasm‑hardening, aps‑completeness) that distinguishes protocol non‑conformance from framework‑imposed security requirements. Third, AgentConform, a two‑phase conformance checker that (i)extracts normative clauses from protocol specifications into a typed Protocol~IR with explicit Protocol/Environment/Adversary action separation, (ii)compiles the IR into TLA+ models and model‑checks them against AASM invariants, then (iii)replays counterexample traces against live SDK implementations to confirm findings. We introduce the Composition Safety (CS) principle: security properties that hold for individual protocols can break when protocols are composed through shared infrastructure. We demonstrate this with formal models of five protocol composition patterns, revealing cross‑protocol design gaps that individual protocol analysis cannot detect. Preliminary application to representative agent protocols reveals recurrent gaps in credential lifecycle, consent enforcement, audit completeness, and composition safety. Some findings are under coordinated disclosure; full evaluation details will be released in the complete version.

PaperID: 4576, https://arxiv.org/pdf/2603.23791.pdf

Authors: Qianlong Lan, Anuj Kaul

Title: The Cognitive Firewall:Securing Browser Based AI Agents Against Indirect Prompt Injection Via Hybrid Edge Cloud Defense

Abstract:
Deploying large language models (LLMs) as autonomous browser agents exposes a significant attack surface in the form of Indirect Prompt Injection (IPI). Cloud‑based defenses can provide strong semantic analysis, but they introduce latency and raise privacy concerns. We present the Cognitive Firewall, a three‑stage split‑compute architecture that distributes security checks across the client and the cloud. The system consists of a local visual Sentinel, a cloud‑based Deep Planner, and a deterministic Guard that enforces execution‑time policies. Across 1,000 adversarial samples, edge‑only defenses fail to detect 86.9% of semantic attacks. In contrast, the full hybrid architecture reduces the overall attack success rate (ASR) to below 1% (0.88% under static evaluation and 0.67% under adaptive evaluation), while maintaining deterministic constraints on side‑effecting actions. By filtering presentation‑layer attacks locally, the system avoids unnecessary cloud inference and achieves an approximately 17,000x latency advantage over cloud‑only baselines. These results indicate that deterministic enforcement at the execution boundary can complement probabilistic language models, and that split‑compute provides a practical foundation for securing interactive LLM agents.

PaperID: 4577, https://arxiv.org/pdf/2603.23749.pdf

Authors: Franck Ndzomga

Title: Efficient Benchmarking of AI Agents

Abstract:
Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi‑step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold‑driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank‑order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization‑free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30‑70%). This mid‑range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44‑70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling, which exhibits high variance across seeds, and outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full‑benchmark evaluation.

PaperID: 4578, https://arxiv.org/pdf/2603.23660.pdf

Authors: Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, Philippe Beardsell

Title: GTO Wizard Benchmark

Abstract:
We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads‑Up No‑Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state‑of‑the‑art superhuman poker agent that approximates Nash Equilibria, and defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previous strongest publicly accessible HUNL benchmark, by 19.4 \pm 4.1 bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state‑of‑the‑art large language models under zero‑shot conditions, including GPT‑5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Initial results and analysis reveal dramatic progress in LLM reasoning over recent years, yet all models remain far below the baseline established by our benchmark. Qualitative analysis reveals clear opportunities for improvement, including representation and the ability to reason over hidden states. This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi‑agent systems with partial observability.

PaperID: 4579, https://arxiv.org/pdf/2603.23638.pdf

Authors: Yi Han, Yan Wang, Lingfei Qian, Haohang Li, Yupeng Cao, Yueru He, Xueqing Peng, Nanhan Shen, Yitao Xu, Yankai Chen, Dongji Feng, Jimin Huang, Xue Liu, Jian-Yun Nie, Sophia Ananiadou

Title: Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

Abstract:
Large language model (LLM) agents are increasingly tested on complex tasks, but their ability to allocate scarce resources over long horizons remains unclear. Unlike reactive tasks with immediate feedback, this setting requires agents to make binding commitments under partial observability, delayed consequences, hard resource budgets, and shifting dynamics. We introduce EnterpriseArena, a 132‑month CFO simulator that evaluates long‑horizon resource allocation under uncertainty in a FinTech lending firm. Agents must manage liquidity, close books, gather costly signals, and request equity or debt financing across changing macroeconomic regimes. The simulator is built from transformed firm‑level financial data, anonymized business documents, decade‑scale macroeconomic and industry signals, and expert‑validated operating rules. Experiments across 23 LLMs and four agent frameworks show that current agents remain far from robust: only 15.4% of trials survive the full horizon, larger models do not reliably outperform smaller ones, and failures cascade across observation, action timing, and capital sizing. These findings establish long‑horizon resource allocation under uncertainty as a distinct capability gap for LLM agents.

PaperID: 4580, https://arxiv.org/pdf/2603.23625.pdf

Authors: Zeinab Dehghani, Rameez Raja Kureshi, Koorosh Aslansefat, Faezeh Alsadat Abedi, Dhavalkumar Thakker, Lisa Greaves, Bhupesh Kumar Mishra, Baseer Ahmad, Tanaya Maslekar

Title: Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework

Abstract:
Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice‑enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety‑focused evaluation framework is presented that examines the system end‑to‑end, combining Whisper‑based speech recognition with retrieval‑augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care‑home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder‑containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end‑to‑end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety‑critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human‑in‑the‑loop oversight. In the best‑performing configuration (GPT‑5.2), resident ID and care category matching reached 100% (95% CI: 98.86‑100), while reminder recognition reached 89.09% (95% CI: 83.81‑92.80) with zero missed reminders (100% recall) but some false positives. End‑to‑end scheduling via calendar integration achieved 84.65% exact reminder‑count agreement (95% CI: 78.00‑89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice‑enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.

PaperID: 4581, https://arxiv.org/pdf/2603.23525.pdf

Authors: Warren Johnson, Charles Lee

Title: Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

Abstract:
The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre‑registered six‑arm randomized controlled trial of prompt compression on production multi‑agent task‑orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59‑61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure‑aware strategies (entropy‑adaptive and recency‑weighted), measuring total inference cost (input+output) and embedding‑based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy‑tailed uncertainty. Recency‑weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost‑similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that "compress more" is not a reliable production heuristic and that output tokens must be treated as a first‑class outcome when designing compression policies.

PaperID: 4582, https://arxiv.org/pdf/2603.23503.pdf

Authors: Annalisa Calvi, Pierre Le Bodic, Samuel McGuire, Edward Lam

Title: Optimal Unlabeled Pebble Motion on Trees and its Application to Multi-Agent Path Finding

Abstract:
Given a tree, a set of pebbles initially stationed at some nodes of the tree, and a set of target nodes, the Unlabeled Pebble Motion on Trees problem (UPMT) asks to find a plan to move the pebbles one‑at‑a‑time from the starting nodes to the target nodes along the edges of the tree while minimizing the number of moves. This paper proposes the first optimal algorithm for UPMT that is asymptotically as fast as possible, as it runs in a time linear in the size of the input (the tree) and the size of the output (the optimal plan). We extend this to solve unlabeled Multi‑Agent Path Finding (MAPF) in trees, providing novel bounds on optimal makespan, sum of costs, and pebble motion plan length.

PaperID: 4583, https://arxiv.org/pdf/2603.23471.pdf

Authors: Kathrin Gardhouse, Amin Oueslati, Noam Kolt

Title: Regulating AI Agents

Abstract:
AI agents ‑‑ systems that can independently take actions to pursue complex goals with only limited human oversight ‑‑ have entered the mainstream. These systems are now being widely used to produce software, conduct business activities, and automate everyday personal tasks. While AI agents implicate many areas of law, ranging from agency law and contracts to tort liability and labor law, they present particularly pressing questions for the most globally consequential AI regulation: the European Union's AI Act. Promulgated prior to the development and widespread use of AI agents, the EU AI Act faces significant obstacles in confronting the governance challenges arising from this transformative technology, such as performance failures in autonomous task execution, the risk of misuse of agents by malicious actors, and unequal access to the economic opportunities afforded by AI agents. We systematically analyze the EU AI Act's response to these challenges, focusing on both the substantive provisions of the regulation and, crucially, the institutional frameworks that aim to support its implementation. Our analysis of the Act's allocation of monitoring and enforcement responsibilities, reliance on industry self‑regulation, and level of government resourcing illustrates how a regulatory framework designed for conventional AI systems can be ill‑suited to AI agents. Taken together, our findings suggest that policymakers in the EU and beyond will need to change course, and soon, if they are to effectively govern the next generation of AI technology.

PaperID: 4584, https://arxiv.org/pdf/2603.23419.pdf

Authors: Teerthaa Parakh, Karen M. Feigh

Title: Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback

Abstract:
Human decision‑making is strongly influenced by cognitive biases, particularly under conditions of uncertainty and risk. While prior work has examined bias in single‑step decisions with immediate outcomes and in human interaction with a single autonomous agent, comparatively little attention has been paid to decision‑making under delayed outcomes involving multiple AI agents, where decisions at each step affect subsequent states. In this work, we study how delayed outcomes shape decision‑making and responsibility attribution in a multi‑agent human‑AI task. Using a controlled game‑based experiment, we analyze how participants adjust their behavior following positive and negative outcomes. We observe asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. Importantly, participants often fail to correctly identify the actions that caused failure and misattribute responsibility across AI agents, leading to systematic revisions of decisions that are weakly related to the underlying causes of poor performance. We refer to this phenomenon as a form of attribution bias, manifested as biased error attribution under delayed feedback. Our findings highlight how cognitive biases can be amplified in human‑AI systems with delayed outcomes and multiple autonomous agents, underscoring the need for decision‑support systems that better support causal understanding and learning over time.

PaperID: 4585, https://arxiv.org/pdf/2603.23405.pdf

Authors: Zixiang Jiang, Yulun Zhang, Rishi Veerapaneni, Jiaoyang Li

Title: Planning over MAPF Agent Dependencies via Multi-Dependency PIBT

Abstract:
Modern Multi‑Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule‑based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT's priority inheritance logic, we define the concept of agent dependencies and propose Multi‑Dependency PIBT (MD‑PIBT) that searches over agent dependencies. MD‑PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD‑PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD‑PIBT is particularly effective in MAPF with large agents.

PaperID: 4586, https://arxiv.org/pdf/2603.23300.pdf

Authors: Mehmet Caner, Agostino Capponi, Nathan Sun, Jonathan Y. Tan

Title: Designing Agentic AI-Based Screening for Portfolio Investment

Abstract:
We introduce a new agentic artificial intelligence (AI) platform for portfolio management. Our architecture consists of three layers. First, two large language model (LLM) agents are assigned specialized tasks: one agent screens for firms with desirable fundamentals, while a sentiment analysis agent screens for firms with desirable news. Second, these agents deliberate to generate and agree upon buy and sell signals from a large portfolio, substantially narrowing the pool of candidate assets. Finally, we apply a high‑dimensional precision matrix estimation procedure to determine optimal portfolio weights. A defining theoretical feature of our framework is that the number of assets in the portfolio is itself a random variable, realized through the screening process. We introduce the concept of sensible screening and establish that, under mild screening errors, the squared Sharpe ratio of the screened portfolio consistently estimates its target. Empirically, our method achieves superior Sharpe ratios relative to an unscreened baseline portfolio and to conventional screening approaches, evaluated on S&P 500 data over the period 2020‑‑2024.

PaperID: 4587, https://arxiv.org/pdf/2603.23279.pdf

Authors: Luca Sodano, Sofia Sciangula, Amulya Galmarini, Francesco Bertolotti

Title: Emergence of Fragility in LLM-based Social Networks: the Case of Moltbook

Abstract:
The rapid diffusion of large language models and the growth in their capability has enabled the emergence of online environments populated by autonomous AI agents that interact through natural language. These platforms provide a novel empirical setting for studying collective dynamics among artificial agents. In this paper we analyze the interaction network of Moltbook, a social platform composed entirely of LLM based agents, using tools from network science. The dataset comprises 39,924 users, 235,572 posts, and 1,540,238 comments collected through web scraping. We construct a directed weighted network in which nodes represent agents and edges represent commenting interactions. Our analysis reveals strongly heterogeneous connectivity patterns characterized by heavy tailed degree and activity distributions. At the mesoscale, the network exhibits a pronounced core periphery organization in which a very small structural core (0.9% of nodes) concentrates a large fraction of connectivity. Robustness experiments show that the network is relatively resilient to random node removal but highly vulnerable to targeted attacks on highly connected nodes, particularly those with high out degree. These findings indicate that the interaction structure of AI agent social systems may develop strong centralization and structural fragility, providing new insights into the collective organization of LLM native social environments.

PaperID: 4588, https://arxiv.org/pdf/2603.23278.pdf

Authors: Francesca Bray, Simone Tolomei, Andrei Cramariuc, Cesar Cadena, Marco Hutter

Title: Learning Multi-Agent Local Collision-Avoidance for Collaborative Carrying tasks with Coupled Quadrupedal Robots

Abstract:
Robotic collaborative carrying could greatly benefit human activities like warehouse and construction site management. However, coordinating the simultaneous motion of multiple robots represents a significant challenge. Existing works primarily focus on obstacle‑free environments, making them unsuitable for most real‑world applications. Works that account for obstacles, either overfit to a specific terrain configuration or rely on pre‑recorded maps combined with path planners to compute collision‑free trajectories. This work focuses on two quadrupedal robots mechanically connected to a carried object. We propose a Reinforcement Learning (RL)‑based policy that enables tracking a commanded velocity direction while avoiding collisions with nearby obstacles using only onboard sensing, eliminating the need for precomputed trajectories and complete map knowledge. Our work presents a hierarchical architecture, where a perceptive high‑level object‑centric policy commands two pretrained locomotion policies. Additionally, we employ a game‑inspired curriculum to increase the complexity of obstacles in the terrain progressively. We validate our approach on two quadrupedal robots connected to a bar via spherical joints, benchmarking it against optimization‑based and decentralized RL baselines. Our hardware experiments demonstrate the ability of our system to locomote in unknown environments without the need for a map or a path planner. The video of our work is available in the multimedia material.

PaperID: 4589, https://arxiv.org/pdf/2603.23271.pdf

Authors: Shaid Hasan, Breenice Lee, Sujan Sarker, Tariq Iqbal

Title: A Multimodal Framework for Human-Multi-Agent Interaction

Abstract:
Human‑robot interaction is increasingly moving toward multi‑robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision‑making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human‑multi‑agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)‑driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn‑taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi‑agent interaction through interaction policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger‑scale user studies and deeper exploration of socially grounded multi‑agent interaction dynamics.

PaperID: 4590, https://arxiv.org/pdf/2603.23234.pdf

Authors: Yurui Chang, Yiran Wu, Qingyun Wu, Lu Lin

Title: MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation

Abstract:
LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem‑solving experiences. However, existing methods typically construct memory for a single agent and reuse it with the same underlying model, tightly coupling stored knowledge to model‑specific reasoning styles. In heterogeneous deployments, where agents may be instantiated with backbone models of different sizes, architectures, or specializations, this raises a key question: can a single memory system be shared across agents with different backbone models? We find that naive cross‑model memory transfer can degrade performance, because stored memories often entangle task‑relevant knowledge with model‑specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that builds shared cross‑model memory by contrasting reasoning trajectories generated by different model‑based agents on the same task. Through this contrastive process, MemCollab distills abstract reasoning constraints that capture shared task‑level invariants while suppressing model‑specific artifacts. We further introduce a task‑aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are retrieved at inference time. Experiments on mathematical reasoning and code generation benchmarks show that MemCollab consistently improves both accuracy and inference‑time efficiency across diverse agents, including settings with different model families. These results demonstrate that collaboratively constructed cross‑model memory can serve as a shared reasoning resource for heterogeneous LLM‑based agents.

PaperID: 4591, https://arxiv.org/pdf/2603.23217.pdf

Authors: Xinran Fang, Wei Feng, Yanmin Wang, Yunfei Chen, Baoquan Ren, Ning Ge, Shi Jin

Title: Joint Task Orchestration and Resource Optimization for SC3 Closed Loop in 6G Networks

Abstract:
In hazardous environments, sensors and actuators can be deployed to see and operate on behalf of humans, enabling safe and efficient task execution. Functioning as a neural center, the edge information hub (EIH), which integrates communication and computing capabilities, coordinates these sensors and actuators into sensing‑communication‑computing‑control (SC3) closed loops to enable autonomous operations. From a system‑level optimization perspective, this paper addresses the problem of joint sensor‑actuator pairing and resource allocation across multiple SC3 closed loops. To tackle the resulting mixed‑integer nonlinear programming problem, we develop a learning‑optimization‑integrated actor‑critic (LOAC) framework. In this framework, a deep neural network‑based actor generates pairing candidates, while an optimization‑based critic subsequently allocates communication and computing resources. The actor is then iteratively refined through feedback from the critic. Simulation results demonstrate that the LOAC framework achieves near‑optimal solutions with low computational complexity, offering significant performance gains in reducing control cost.

PaperID: 4592, https://arxiv.org/pdf/2603.23115.pdf

Authors: Yangxin Yu, Yue Zhou, Bin Li, Kaiqing Lin, Haodong Li, Jiangqun Ni, Bo Cao

Title: AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection

Abstract:
The increasing realism of AI‑Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts‑‑such as frequency‑domain patterns or semantic inconsistencies‑‑leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present AgentFoX, a Large Language Model‑driven framework that redefines AIGI detection as a dynamic, multi‑phase analytical process. Our approach employs a quick‑integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high‑level semantic assessment, then transitions to fine‑grained, context‑aware synthesis of signal‑level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human‑readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real‑world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.

PaperID: 4593, https://arxiv.org/pdf/2603.22954.pdf

Authors: Maolin Wang, Beining Bao, Gan Yuan, Hongyu Chen, Bingkun Zhao, Baoshuo Kan, Jiming Xu, Qi Shi, Yinggong Zhao, Yao Wang, Wei Ying Ma, Jun Yan

Title: Privacy-Preserving EHR Data Transformation via Geometric Operators: A Human-AI Co-Design Technical Report

Abstract:
Electronic health records (EHRs) and other real‑world clinical data are essential for clinical research, medical artificial intelligence, and life science, but their sharing is severely limited by privacy, governance, and interoperability constraints. These barriers create persistent data silos that hinder multi‑center studies, large‑scale model development, and broader biomedical discovery. Existing privacy‑preserving approaches, including multi‑party computation and related cryptographic techniques, provide strong protection but often introduce substantial computational overhead, reducing the efficiency of large‑scale machine learning and foundation‑model training. In addition, many such methods make data usable for restricted computation while leaving them effectively invisible to clinicians and researchers, limiting their value in workflows that still require direct inspection, exploratory analysis, and human interpretation. We propose a real‑world‑data transformation framework for privacy‑preserving sharing of structured clinical records. Instead of converting data into opaque representations, our approach constructs transformed numeric views that preserve medical semantics and major statistical properties while, under a clearly specified threat model, provably breaking direct linkage between those views and protected patient‑level attributes. Through collaboration between computer scientists and the AI agent SciencePal, acting as a constrained tool inventor under human guidance, we design three transformation operators that are non‑reversible within this threat model, together with an additional mixing strategy for high‑risk scenarios, supported by theoretical analysis and empirical evaluation under reconstruction, record linkage, membership inference, and attribute inference attacks.

PaperID: 4594, https://arxiv.org/pdf/2603.22928.pdf

Authors: Ali Dehghantanha, Sajad Homayoun

Title: SoK: The Attack Surface of Agentic AI -- Tools, and Autonomy

Abstract:
Recent AI systems combine large language models with tools, external knowledge via retrieval‑augmented generation (RAG), and even autonomous multi‑agent decision loops. This agentic AI paradigm greatly expands capabilities ‑ but also vastly enlarges the attack surface. In this systematization, we map out the trust boundaries and security risks of agentic LLM‑based systems. We develop a comprehensive taxonomy of attacks spanning prompt‑level injections, knowledge‑base poisoning, tool/plug‑in exploits, and multi‑agent emergent threats. Through a detailed literature review, we synthesize evidence from 2023‑2025, including more than 20 peer‑reviewed and archival studies, industry reports, and standards. We find that agentic systems introduce new vectors for indirect prompt injection, code execution exploits, RAG index poisoning, and cross‑agent manipulation that go beyond traditional AI threats. We define attacker models and threat scenarios, and propose metrics (e.g., Unsafe Action Rate, Privilege Escalation Distance) to evaluate security posture. Our survey examines defenses such as input sanitization, retrieval filters, sandboxes, access control, and "AI guardrails," assessing their effectiveness and pointing out the areas where protection is still lacking. To assist practitioners, we outline defensive controls and provide a phased security checklist for deploying agentic AI (covering design‑time hardening, runtime monitoring, and incident response). Finally, we outline open research challenges in secure autonomous AI (robust tool APIs, verifiable agent behavior, supply‑chain safeguards) and discuss ethical and responsible disclosure practices. We systematize recent findings to help researchers and engineers understand and mitigate security risks in agentic AI.

PaperID: 4595, https://arxiv.org/pdf/2603.22903.pdf

Authors: Seongju Jang, Francis Baek, SangHyun Lee

Title: Task-Aware Positioning for Improvisational Tasks in Mobile Construction Robots via an AI Agent with Multi-LMM Modules

Abstract:
Due to the ever‑changing nature of construction, many tasks on sites occur in an improvisational manner. Existing mobile construction robot studies remain limited in addressing improvisational tasks, where task‑required locations, timing of task occurrence, and contextual information required for task execution are not known in advance. We propose an agent that understands improvisational tasks given in natural language, identifies the task‑required location, and positions itself. The agent's functionality was decomposed into three Large Multimodal Model (LMM) modules operating in parallel, enabling the application of LMMs for task interpretation and breakdown, construction drawing‑based navigation, and visual reasoning to identify non‑predefined task‑required locations. The agent was implemented with a quadruped robot and achieved a 92.2% success rate for identifying and positioning at task‑required locations across three tests designed to assess improvisational task handling. This study enables mobile construction robots to perform non‑predefined tasks autonomously.

PaperID: 4596, https://arxiv.org/pdf/2603.22892.pdf

Authors: Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu

Title: VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

Abstract:
Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual‑Language Knowledge‑Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine‑tune a vision‑language model to predict future states and actions conditioned on an initial visual observation and high‑level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

PaperID: 4597, https://arxiv.org/pdf/2603.22881.pdf

Authors: Evagoras Makridis, Themistoklis Charalambous

Title: Cooperative Bandit Learning in Directed Networks with Arm-Access Constraints

Abstract:
Sequential decision‑making under uncertainty often involves multiple agents learning which actions (arms) yield the highest rewards through repeated interaction with a stochastic environment. This setting is commonly modeled by cooperative multi‑agent multi‑armed bandit problems, where agents explore and share information without centralized coordination. In many realistic systems, agents have heterogeneous capabilities that limit their access to subsets of arms and communicate over asymmetric networks represented by directed graphs. In this work, we study multi‑agent multi‑armed bandit problems with partial arm access, where agents explore and exploit only the arms available to them while exchanging information with neighbors. We propose a distributed consensus‑based upper confidence bound (UCB) algorithm that accounts for both the arm accessibility structure and network asymmetry. Our approach employs a mass‑preserving information mixing mechanism, ensuring that reward estimates remain unbiased across the network despite accessibility constraints and asymmetric information flow. Under standard stochastic assumptions, we establish logarithmic regret for every agent, with explicit dependence on network mixing properties and arm accessibility constraints. These results quantify how heterogeneous arm access and directed communication shape cooperative learning performance.

PaperID: 4598, https://arxiv.org/pdf/2603.22868.pdf

Authors: Rohan Sequeira, Stavros Damianakis, Umar Iqbal, Konstantinos Psounis

Title: Agent-Sentry: Bounding LLM Agents via Execution Provenance

Abstract:
Agentic computing systems, while immensely capable, raise serious security, privacy, and safety concerns. A key issue is that the full set of functionalities offered by these systems, combined with their probabilistic execution flows, is not known beforehand. Given this lack of characterization, it is challenging to validate whether a system has successfully carried out the user's intended task or instead executed irrelevant actions, potentially as a consequence of compromise. We present \emphAgent Sentry, a runtime defense that learns a bound on an agent's benign execution from prior legitimate executions and flags any action that falls outside this bound. Agent Sentry layers three complementary checks: a structural classifier over the sequence of actions and the provenance of each function's arguments; a deterministic allowlist check over sensitive argument values; and an LLM judge, invoked only on the residual of actions where the first two checks cannot safely decide between a legitimate new request and a carefully crafted injection. We demonstrate the effectiveness of Agent Sentry in AgentDojo and AgentDyn by blocking 94.3% of successful injections while allowing 95.1% of benign executions, without modifying the agent, its tools, or the LLM.

PaperID: 4599, https://arxiv.org/pdf/2603.22862.pdf

Authors: Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang, Tao He, Xiangyu Liu, Zixiang Wang, Jiafeng Liang, Zheng Chu, Runxuan Liu, Rongchuan Mu, Dandan Tu, Ming Liu, Bing Qin

Title: The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration

Abstract:
Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi‑tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi‑tool LLM agents and analyzes the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single‑call tool use from long‑horizon orchestration. Then, we organize the literature around six core dimensions: inference‑time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi‑tool agents.

PaperID: 4600, https://arxiv.org/pdf/2603.22853.pdf

Authors: Haiyue Zhang, Yi Nian, Yue Zhao

Title: Agent Audit: A Security Analysis System for LLM Agent Applications

Abstract:
What should a developer inspect before deploying an LLM agent: the model, the tool code, the deployment configuration, or all three? In practice, many security failures in agent systems arise not from model weights alone, but from the surrounding software stack: tool functions that pass untrusted inputs to dangerous operations, exposed credentials in deployment artifacts, and over‑privileged Model Context Protocol (MCP) configurations. We present Agent Audit, a security analysis system for LLM agent applications. Agent Audit analyzes Python agent code and deployment artifacts through an agent‑aware pipeline that combines dataflow analysis, credential detection, structured configuration parsing, and privilege‑risk checks. The system reports findings in terminal, JSON, and SARIF formats, enabling direct integration with local development workflows and CI/CD pipelines. On a benchmark of 22 samples with 42 annotated vulnerabilities, Agent Audit detects 40 vulnerabilities with 6 false positives, substantially improving recall over common SAST baselines while maintaining sub‑second scan times. Agent Audit is open source and installable via pip, making security auditing accessible for agent systems. In the live demonstration, attendees scan vulnerable agent repositories and observe how Agent Audit identifies security risks in tool functions, prompts, and more. Findings are linked to source locations and configuration paths, and can be exported into VS Code and GitHub Code Scanning for interactive inspection.

PaperID: 4601, https://arxiv.org/pdf/2603.22823.pdf

Authors: Ivan Dobrovolskyi

Title: Empirical Comparison of Agent Communication Protocols for Task Orchestration

Abstract:
Context. The problem of comparative evaluation of communication protocols for task orchestration by large language model (LLM) agents is considered. The object of study is the process of interaction between LLM agents and external tools, as well as between autonomous LLM agents, during task orchestration. Objective. The goal of this work is to develop a systematic pilot benchmark comparing tool integration, multi‑agent dele‑gation, and hybrid architectures for standardized queries at three levels of complexity, and to quantify the advantages and disadvantages in terms of response time, context window consumption, cost, error recovery, and implementation complexity.

PaperID: 4602, https://arxiv.org/pdf/2603.22791.pdf

Authors: Weijia Song, Jiashu Yue, Zhe Pang

Title: ABSTRAL: Automatic Design of Multi-Agent Systems Through Iterative Refinement and Topology Optimization

Abstract:
How should multi‑agent systems be designed, and can that design knowledge be captured in a form that is inspectable, revisable, and transferable? We introduce ABSTRAL, a framework that treats MAS architecture as an evolving natural‑language document, an artifact refined through contrastive trace analysis. Three findings emerge. First, we provide a precise measurement of the multi‑agent coordination tax: under fixed turn budgets, ensembles achieve only 26% turn efficiency, with 66% of tasks exhausting the limit, yet still improve over single‑agent baselines by discovering parallelizable task decompositions. Second, design knowledge encoded in documents transfers: topology reasoning and role templates learned on one domain provide a head start on new domains, with transferred seeds matching coldstart iteration 3 performance in a single iteration. Third, contrastive trace analysis discovers specialist roles absent from any initial design, a capability no prior system demonstrates. On SOPBench (134 bank tasks, deterministic oracle), ABSTRAL reaches 70% validation / 65.96% test pass rate with a GPT‑4o backbone. We release the converged documents as inspectable design rationale.

PaperID: 4603, https://arxiv.org/pdf/2603.22754.pdf

Authors: Ruidi Chang, Jiawei Zhou, Hanjie Chen

Title: PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation

Abstract:
Large language models (LLMs) solve complex problems by generating multi‑step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden‑state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final‑task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.

PaperID: 4604, https://arxiv.org/pdf/2603.22751.pdf

Authors: Tao Huang, Chen Hou, Guosen Wu, Jiayang Meng

Title: Observable Channels, Not Just Storage: Evaluating Privacy Leakage in LLM Agent Pipelines

Abstract:
Privacy leakage in LLM agents is often studied through individual storage or execution components, such as memory modules, retrieval pipelines, or tool‑mediated artifacts. However, these settings are typically analyzed in isolation, making it difficult to compare how private internal dependence becomes externally recoverable across heterogeneous agent pipelines. In this paper, we present CIPL (Channel Inversion for Privacy Leakage) as a unified channel‑oriented measurement interface for evaluating privacy leakage in LLM agent pipelines. Rather than claiming a universally strongest attack recipe, CIPL provides a shared way to represent a target through its sensitive source, selection, assembly, execution, observation, and extraction stages, and to measure how internal exposure is transformed into attacker‑recoverable leakage under a common protocol. Using memory‑based, retrieval‑mediated, and tool‑mediated instantiations under this shared interface, we identify a distinct cross‑target risk picture. Memory behaves as a near‑saturated high‑risk special case, while beyond‑memory leakage exhibits a different regime: retrieval‑mediated targets show frequent but often incomplete leakage, and tool‑mediated targets are strongly shaped by the exposed observation surface and provider behavior. We further show that leakage is governed by channel conditions rather than by a universally dominant recipe: cleaned weak controls sharply suppress leakage, and semantic annotation reveals attacker‑useful leakage beyond exact‑match extraction. Together, these findings suggest that privacy risk in LLM agent pipelines is better understood through \emphobservable channels, not just storage components. More broadly, our results motivate channel‑oriented privacy evaluation as a necessary complement to component‑local or exact‑only analyses.

PaperID: 4605, https://arxiv.org/pdf/2603.22708.pdf

Authors: Xinyi Zhang, Tiantian Chen, Zhentao Han, Zhaoyan Hong, Wei Lu, Sheng Wang, Mo Sha, Anni Wang, Shuang Liu, Yakun Zhang, Feifei Li, Xiaoyong Du

Title: Why Database Manuals Are Not Enough: Efficient and Reliable Configuration Tuning for DBMSs via Code-Driven LLM Agents

Abstract:
Modern database management systems (DBMSs) expose hundreds of configuration knobs that critically influence performance. Existing automated tuning methods either adopt a data‑driven paradigm, which incurs substantial overhead, or rely on manual‑driven heuristics extracted from database documentation, which are often limited and overly generic. Motivated by the fact that the control logic of configuration knobs is inherently encoded in the DBMS source code, we argue that promising tuning strategies can be mined directly from the code, uncovering fine‑grained insights grounded in system internals. To this end, we propose SysInsight, a code‑driven database tuning system that automatically extracts fine‑grained tuning knowledge from DBMS source code to accelerate and stabilize the tuning process. SysInsight combines static code analysis with LLM‑based reasoning to identify knob‑controlled execution paths and extract semantic tuning insights. These insights are then transformed into quantitative and verifiable tuning rules via association rule mining grounded in tuning observations. During online tuning, system diagnosis is applied to identify critical knobs, which are adjusted under the rule guidance. Evaluations demonstrate that compared to the SOTA baseline, SysInsight converges to the best configuration on average 7.11X faster while achieving a 19.9% performance improvement.

PaperID: 4606, https://arxiv.org/pdf/2603.22651.pdf

Authors: Siddhant Kulkarni, Yukta Kulkarni

Title: Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies

Abstract:
The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi‑agent orchestration architectures: sequential pipeline, parallel fan‑out with merge, hierarchical supervisor‑worker and reflexive self‑correcting loop. These are evaluated across five frontier and open‑weight LLMs on a corpus of 10,000 SEC filings (10‑K, 10‑Q and 8‑K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation and financial metrics, measured along five axes: field‑level F1, document‑level accuracy, end‑to‑end latency, cost per document and token efficiency. We find that reflexive architectures achieve the highest field‑level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost‑accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture's accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non‑obvious throughput‑accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi‑agent LLM systems in regulated financial environments.

PaperID: 4607, https://arxiv.org/pdf/2603.22567.pdf

Authors: Minghan Li, Rachel Gonsalves, Weiyue Li, Sunghoon Yoon, Mengyu Wang

Title: TrustTrade: Human-Inspired Selective Consensus Reduces Decision Uncertainty in LLM Trading Agents

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents in financial trading. However, they often exhibit a hazardous behavioral bias that we term uniform trust, whereby retrieved information is implicitly assumed to be factual and heterogeneous sources are treated as equally informative. This assumption stands in sharp contrast to human decision‑making, which relies on selective filtering, cross‑validation, and experience‑driven weighting of information sources. As a result, LLM‑based trading systems are particularly vulnerable to multi‑source noise and misinformation, amplifying factual hallucinations and leading to unstable risk‑return performance. To bridge this behavioral gap, we introduce TrustTrade (Trust‑Rectified Unified Selective Trader), a multi‑agent selective consensus framework inspired by human epistemic heuristics. TrustTrade replaces uniform trust with cross‑agent consistency by aggregating information from multiple independent LLM agents and dynamically weighting signals based on their semantic and numerical agreement. Consistent signals are prioritized, while divergent, weakly grounded, or temporally inconsistent inputs are selectively discounted. To further stabilize decision‑making, TrustTrade incorporates deterministic temporal signals as reproducible anchors and a reflective memory mechanism that adapts risk preferences at test time without additional training. Together, these components suppress noise amplification and hallucination‑driven volatility, yielding more stable and risk‑aware trading behavior. Across controlled backtesting in high‑noise market environments (2024 Q1 and 2026 Q1), the proposed TrustTrade calibrates LLM trading behavior from extreme risk‑return regimes toward a human‑aligned, mid‑risk and mid‑return profile.

PaperID: 4608, https://arxiv.org/pdf/2603.22528.pdf

Authors: Achmad Anggawirya Alimin, Artur M. Schweidtmann

Title: GraphRAG for Engineering Diagrams: ChatP&ID Enables LLM Interaction with P&IDs

Abstract:
Large Language Models (LLMs) combined with Retrieval‑Augmented Generation (RAG) and knowledge graphs offer new opportunities for interacting with engineering diagrams such as Piping and Instrumentation Diagrams (P&IDs). However, directly processing raw images or smart P&ID files with LLMs is often costly, inefficient, and prone to hallucinations. This work introduces ChatP&ID, an agentic framework that enables grounded and cost‑effective natural‑language interaction with P&IDs using Graph Retrieval‑Augmented Generation (GraphRAG), a paradigm we refer to as GraphRAG for engineering diagrams. Smart P&IDs encoded in the DEXPI standard are transformed into structured knowledge graphs, which serve as the basis for graph‑based retrieval and reasoning by LLM agents. This approach enables reliable querying of engineering diagrams while significantly reducing computational cost. Benchmarking across commercial LLM APIs (OpenAI, Anthropic) demonstrates that graph‑based representations improve accuracy by 18% over raw image inputs and reduce token costs by 85% compared to directly ingesting smart P&ID files. While small open‑source models still struggle to interpret knowledge graph formats and structured engineering data, integrating them with VectorRAG and PathRAG improves response accuracy by up to 40%. Notably, GPT‑5‑mini combined with ContextRAG achieves 91% accuracy at a cost of only 0.004 per task. The resulting ChatP&ID interface enables intuitive natural‑language interaction with complex engineering diagrams and lays the groundwork for AI‑assisted process engineering tasks such as Hazard and Operability Studies (HAZOP) and multi‑agent analysis.

PaperID: 4609, https://arxiv.org/pdf/2603.22489.pdf

Authors: Charoes Huang, Xin Huang, Ngoc Phu Tran, Amin Milani Fard

Title: Model Context Protocol Threat Modeling and Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning

Abstract:
The Model Context Protocol (MCP) has rapidly emerged as a universal standard for connecting AI assistants to external tools and data sources. While MCP simplifies integration between AI applications and various services, it introduces significant security vulnerabilities, particularly on the client side. In this work we conduct threat modelings of MCP implementations using STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) and DREAD (Damage, Reproducibility, Exploitability, Affected Users, Discoverability) frameworks across five key components: (1) MCP Host and Client, (2) LLM, (3) MCP Server, (4) External Data Stores, and (5) Authorization Server. This comprehensive analysis reveals tool poisoning‑where malicious instructions are embedded in tool metadata‑as the most prevalent and impactful client‑side vulnerability. We therefore focus our empirical evaluation on this critical attack vector, providing a systematic comparison of how seven major MCP clients validate and defend against tool poisoning attacks. Our analysis reveals significant security issues with most tested clients due to insufficient static validation and parameter visibility. We propose a multi‑layered defense strategy encompassing static metadata analysis, model decision path tracking, behavioral anomaly detection, and user transparency mechanisms. This research addresses a critical gap in MCP security, which has primarily focused on server‑side vulnerabilities, and provides actionable recommendations and mitigation strategies for securing AI agent ecosystems.

PaperID: 4610, https://arxiv.org/pdf/2603.22455.pdf

Authors: YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu

Title: SkillRouter: Skill Routing for LLM Agents at Scale

Abstract:
Reusable skills let LLM agents package task‑specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasible. This creates a skill‑routing problem: given a user task, the system must identify relevant skills before downstream planning or execution. Existing agent stacks often rely on progressive disclosure, exposing only skill names and descriptions while hiding the full implementation body. We examine this design choice on a SkillsBench‑derived benchmark with approximately 80K candidate skills, targeting the practically important setting of large skill registries with heavy overlap. Across representative sparse, dense, and reranking baselines on this setting, hiding the skill body causes a 31‑‑44 percentage point drop in routing accuracy, showing that full skill text is a critical routing signal in this setting rather than a minor metadata refinement. Motivated by this finding, we present SkillRouter, a compact 1.2B full‑text retrieve‑and‑rerank pipeline. SkillRouter achieves 74.0% Hit@1 on our benchmark ‑‑ the strongest average top‑1 routing performance among the baselines we evaluate ‑‑ while using 13× fewer parameters and running 5.8× faster than the strongest base pipeline. The ranking gains further generalize to a supplementary benchmark independently constructed from three skill sources. In a complementary end‑to‑end study across four coding agents, routing gains transfer to improved task success, with larger gains for more capable agents.

PaperID: 4611, https://arxiv.org/pdf/2603.22386.pdf

Authors: Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan

Title: From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

Abstract:
Large language model (LLM)‑based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace‑derived feedback). We also distinguish reusable workflow templates, run‑specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure‑aware evaluation perspective that complements downstream task metrics with graph‑level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of existing body of literature, and a more reproducible evaluation standard for future work in workflow optimizations for LLM agents.

PaperID: 4612, https://arxiv.org/pdf/2603.22376.pdf

Authors: Liwei Wu, Cho-Jui Hsieh

Title: AI Co-Scientist for Ranking: Discovering Novel Search Ranking Models alongside LLM-based AI Agents with Cloud Computing Access

Abstract:
Recent advances in AI agents for software engineering and scientific discovery have demonstrated remarkable capabilities, yet their application to developing novel ranking models in commercial search engines remains unexplored. In this paper, we present an AI Co‑Scientist framework that automates the full search ranking research pipeline: from idea generation to code implementation and GPU training job scheduling with expert in the loop. Our approach strategically employs single‑LLM agents for routine tasks while leveraging multi‑LLM consensus agents (GPT 5.2, Gemini Pro 3, and Claude Opus 4.5) for challenging phases such as results analysis and idea generation. To our knowledge, this is the first study in the ranking community to utilize an AI Co‑Scientist framework for algorithmic research. We demonstrate that this framework discovered a novel technique for handling sequence features, with all model enhancements produced automatically, yielding substantial offline performance improvements. Our findings suggest that AI systems can discover ranking architectures comparable to those developed by human experts while significantly reducing routine research workloads.

PaperID: 4613, https://arxiv.org/pdf/2603.22367.pdf

Authors: Ivan Dobrovolskyi

Title: Reasoner-Executor-Synthesizer: Scalable Agentic Architecture with Static O(1) Context Window

Abstract:
Large Language Models (LLMs) deployed as autonomous agents commonly use Retrieval‑Augmented Generation (RAG), feeding retrieved documents into the context window, which creates two problems: the risk of hallucination grows with context length, and token cost scales linearly with dataset size. We propose the Reasoner‑Executor‑Synthesizer (RES) architecture, a three‑layer design that strictly separates intent parsing (Reasoner), deterministic data retrieval and aggregation (Executor), and narrative generation (Synthesizer). The Executor uses zero LLM tokens and passes only fixed‑size statistical summaries to the Synthesizer. We formally prove that RES achieves O(1) token complexity with respect to dataset size, and validate this on ScholarSearch, a scholarly research assistant backed by the Crossref API (130M+ articles). Across 100 benchmark runs, RES achieves a mean token cost of 1,574 tokens regardless of whether the dataset contains 42,000 or 16.3 million articles. The architecture eliminates data hallucination by construction: the LLM never sees raw records. KEYWORDS LLM agents; agentic architecture; hallucination elimination; token optimization; context window; retrieval‑augmented generation; deterministic execution; scholarly metadata; Crossref API; O(1) complexity.

PaperID: 4614, https://arxiv.org/pdf/2603.22363.pdf

Authors: Janardhan Kulkarni

Title: Early Discoveries of Algorithmist I: Promise of Provable Algorithm Synthesis at Scale

Abstract:
Designing algorithms with provable guarantees that also work well in practice remains difficult, requiring both mathematical reasoning and careful implementation. Existing approaches that bridge worst‑case theory and empirical performance, such as beyond‑worst‑case analysis and data‑driven algorithm selection, typically assume prior distributional knowledge or restrict attention to a fixed pool of algorithms. Recent progress in LLMs suggests a new possibility: provable algorithm synthesis on the fly. To study this, we built Algorithmist, an autonomous researcher agent on top of GitHub Copilot that runs a multi‑agent research‑and‑review loop, with separate stages for idea generation, algorithm and proof development, proof‑guided implementation, and review of proofs, code, and their alignment. We evaluate Algorithmist on research‑level tasks in private data analysis and clustering. When asked to design practical methods that jointly satisfy privacy, approximation, and interpretability requirements, it produced provably sound and empirically effective algorithms, together with research‑style writeups and audited implementations. It also found improved algorithms in some settings, explained principled barriers in others, and uncovered a subtle proof bug in prior published work. More broadly, our results suggest a new paradigm in which LLM systems generate research‑paper‑quality algorithmic artifacts tailored to each dataset and deployment setting. They also point to a proof‑first code‑synthesis paradigm, in which code is developed alongside a structured natural‑language proof intermediate representation and kept aligned with it throughout synthesis.

PaperID: 4615, https://arxiv.org/pdf/2603.22359.pdf

Authors: Alfred Shen, Aaron Shen

Title: STEM Agent: A Self-Adapting, Tool-Enabled, Extensible Architecture for Multi-Protocol AI Agent Systems

Abstract:
Current AI agent frameworks commit early to a single interaction protocol, a fixed tool integration strategy, and static user models, limiting their deployment across diverse interaction paradigms. To address these constraints, we introduce STEM Agent (Self‑adapting, Tool‑enabled, Extensible, Multi‑agent), a modular architecture inspired by biological pluripotency in which an undifferentiated agent core differentiates into specialized protocol handlers, tool bindings, and memory subsystems that compose into a fully functioning AI system. The framework unifies five interoperability protocols (A2A, AG‑UI, A2UI, UCP, and AP2) behind a single gateway, introduces a Caller Profiler that continuously learns user preferences across more than twenty behavioral dimensions, externalizes all domain capabilities through the Model Context Protocol (MCP), and implements a biologically inspired skills acquisition system in which recurring interaction patterns crystallize into reusable agent skills through a maturation lifecycle analogous to cell differentiation. Complementing these capabilities, the memory system incorporates consolidation mechanisms, including episodic pruning, semantic deduplication, and pattern extraction, designed for sub‑linear growth under sustained interaction. A comprehensive 413‑test suite validates protocol handler behavior and component integration across all five architectural layers, completing in under three seconds.

PaperID: 4616, https://arxiv.org/pdf/2603.22341.pdf

Authors: Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Seanie Lee, Sung Ju Hwang

Title: T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Abstract:
While prior red‑teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent‑specific vulnerabilities that emerge through multi‑step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory‑aware evolutionary search method, T‑MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T‑MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT‑5.2, Gemini‑3‑Pro, Qwen3.5, and GLM‑5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.

PaperID: 4617, https://arxiv.org/pdf/2603.22315.pdf

Authors: Haoran Su, Hanxiao Deng, Yandong Sun

Title: Emergency Preemption Without Online Exploration: A Decision Transformer Approach

Abstract:
Emergency vehicle (EV) response time is a critical determinant of survival outcomes, yet deployed signal preemption strategies remain reactive and uncontrollable. We propose a return‑conditioned framework for emergency corridor optimization based on the Decision Transformer (DT). By casting corridor optimization as offline, return‑conditioned sequence modeling, our approach (1) eliminates online environment interaction during policy learning, (2) enables dispatch‑level urgency control through a single target‑return scalar, and (3) extends to multi‑agent settings via a Multi‑Agent Decision Transformer (MADT) with graph attention for spatial coordination. On the LightSim simulator, DT reduces average EV travel time by 37.7% relative to fixed‑timing preemption on a 4x4 grid (88.6 s vs. 142.3 s), achieving the lowest civilian delay (11.3 s/veh) and fewest EV stops (1.2) among all methods, including online RL baselines that require environment interaction. MADT further improves on larger grids, overtaking DT with 45.2% reduction on 8x8 via graph‑attention coordination. Return conditioning produces a smooth dispatch interface: varying the target return from 100 to ‑400 trades EV travel time (72.4‑138.2 s) against civilian delay (16.8‑5.4 s/veh), requiring no retraining. A Constrained DT extension adds explicit civilian disruption budgets as a second control knob.

PaperID: 4618, https://arxiv.org/pdf/2603.22312.pdf

Authors: Di Zhang

Title: The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis

Abstract:
This paper computationally investigates whether thought requires a language‑like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the ``AI Private Language'' thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi‑agent reinforcement learning (MARL), and their performance declines when forced to use a human‑comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5% higher efficiency than those using a pre‑defined, human‑like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub‑symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.

PaperID: 4619, https://arxiv.org/pdf/2603.22273.pdf

Authors: Zakaria Mhammedi, James Cohan

Title: Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Abstract:
The process of discovery requires active exploration ‑‑ the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new approach that explicitly decouples exploration from policy optimization and bypasses RL entirely during the exploration phase. Our method uses a tree‑search strategy inspired by the Go‑With‑The‑Winner algorithm, paired with a measure of uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard exploration benchmarks. Further, we demonstrate that the trajectories discovered during exploration can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state‑of‑the‑art performance by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain‑specific knowledge. Finally, we demonstrate the generality of our framework in high‑dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse‑reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

PaperID: 4620, https://arxiv.org/pdf/2603.22267.pdf

Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

Title: TiCo: Time-Controllable Spoken Dialogue Model

Abstract:
We introduce TiCo, a time‑controllable spoken dialogue model (SDM) that follows time‑constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real‑world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration‑related instructions. To systematically evaluate this, we introduce TiCo‑Bench, the first benchmark for time‑controllable instruction following in SDMs, on which existing open‑source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post‑trained efficiently without question‑answer paired data, relying on self‑generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.

PaperID: 4621, https://arxiv.org/pdf/2603.22206.pdf

Authors: Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen

Title: Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs

Abstract:
Multi‑agent applications often execute complex tasks as multi‑stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade‑offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi‑agent workflow serving on heterogeneous LLM clusters that jointly improves end‑to‑end latency and task performance. Chimera applies semantic routing to estimate per‑model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per‑model congestion using in‑flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency‑performance frontier, reducing end‑to‑end latency by 1.2‑‑2.4× and improving task performance by 8.0‑9.5 percentage points on average over competitive baselines including vLLM.

PaperID: 4622, https://arxiv.org/pdf/2603.22170.pdf

Authors: Jingfeng Shan, Francesco Guidi, Mehrdad Saeidi, Enrico Testi, Elia Favarelli, Andrea Giorgetti, Davide Dardari, Alberto Zanella, Giorgio Li Pira, Francesca Starita, Anna Guerra

Title: Human-Inspired Pavlovian and Instrumental Learning for Autonomous Agent Navigation

Abstract:
Autonomous agents operating in uncertain environments must balance fast responses with goal‑directed planning. Classical MF RL often converges slowly and may induce unsafe exploration, whereas MB methods are computationally expensive and sensitive to model mismatch. This paper presents a human‑inspired hybrid RL architecture integrating Pavlovian, Instrumental MF, and Instrumental MB components. Inspired by Pavlovian and Instrumental learning from neuroscience, the framework considers contextual radio cues, here intended as georeferenced environmental features acting as CS, to shape intrinsic value signals and bias decision‑making. Learning is further modulated by internal motivational drives through a dedicated motivational signal. A Bayesian arbitration mechanism adaptively blends MF and MB estimates based on predicted reliability. Simulation results show that the hybrid approach accelerates learning, improves operational safety, and reduces navigation in high‑uncertainty regions compared to standard RL baselines. Pavlovian conditioning promotes safer exploration and faster convergence, while arbitration enables a smooth transition from exploration to efficient, plan‑driven exploitation. Overall, the results highlight the benefits of biologically inspired modularity for robust and adaptive autonomous systems under uncertainty.

PaperID: 4623, https://arxiv.org/pdf/2603.22114.pdf

Authors: Huan Zhao, Haoxin Tu, Zhengyao Liu, Martin Rinard, Abhik Roychoudhury

Title: Lemma Discovery in Agentic Program Verification

Abstract:
Deductive verification provides strong correctness guarantees for code by extracting verification conditions (VCs) and writing formal proofs for them. The expertise‑intensive task of VC proving is the main bottleneck in this process, and has been partly automated owing to recent advances in Large Language Model (LLM) agents. However, existing proof agents are not able to discover helper lemmas ‑ auxiliary lemmas that aid in proving ‑ and thus fall short as programs grow in size and complexity. In this paper, we argue that VC proving for program verification is more than a purely mathematical task, and benefits considerably from program comprehension. Our key insight is that human‑proof engineers often discover and apply helper lemmas based on their understanding of the program semantics, which are not directly reflected in the VCs produced by VC generators. Inspired by this insight, we propose an LLM agent, LemmaNet, that discovers helper lemmas in two ways. Specifically, the agent first synthesizes lemmas offline by directly analyzing the source code and specifications, and then relating this semantic understanding to the mechanical, verbose encoding produced by VC generators. As the proof unfolds, LemmaNet then adapts existing helper lemmas online to accommodate evolving proof states, enabling the agent to effectively discharge complex VCs on‑the‑fly. We evaluate LemmaNet on SV‑COMP and established real‑world subjects, including modules of the Linux kernel, Contiki OS, standard C++ library, and X.509 parser. Our experimental results demonstrate that LemmaNet significantly outperforms state‑of‑the‑art approaches, highlighting the importance of program comprehension‑aided lemma discovery in agentic program verification.

PaperID: 4624, https://arxiv.org/pdf/2603.22035.pdf

Authors: Caio Azevedo, Stefano Sabatini, Sascha Hornauer, Fabien Moutarde

Title: Future-Interactions-Aware Trajectory Prediction via Braid Theory

Abstract:
To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi‑agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi‑agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi‑agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi‑agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at github.com/caiocj1/traj‑pred‑braid‑theory.

PaperID: 4625, https://arxiv.org/pdf/2603.21913.pdf

Authors: Seungyeop Lee, Jong-Han Kim

Title: Collision-Free Velocity Scheduling for Multi-Agent Systems on Predefined Routes via Inexact-Projection ADMM

Abstract:
In structured multi‑agent transportation systems, agents often must follow predefined routes, making spatial rerouting undesirable or impossible. This paper addresses route‑constrained multi‑agent coordination by optimizing waypoint passage times while preserving each agent's assigned waypoint order and nominal route assignment. A differentiable surrogate trajectory model maps waypoint timings to smooth position profiles and captures first‑order tracking lag, enabling pairwise safety to be encoded through distance‑based penalties evaluated on a dense temporal grid spanning the mission horizon. The resulting nonlinear and nonconvex velocity‑scheduling problem is solved using an inexact‑projection Alternating Direction Method of Multipliers (ADMM) algorithm that combines structured timing updates with gradient‑based collision‑correction steps and avoids explicit integer sequencing variables. Numerical experiments on random‑crossing, bottleneck, and graph‑based network scenarios show that the proposed method computes feasible and time‑efficient schedules across a range of congestion levels and yields shorter mission completion times than a representative hierarchical baseline in the tested bottleneck cases.

PaperID: 4626, https://arxiv.org/pdf/2603.21810.pdf

Authors: Turki Bin Mohaya, Peter Seiler

Title: Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control

Abstract:
Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state‑of‑the‑art performance in advanced generative artificial intelligence models. This paper applies this concept of an attention mechanism for multi‑agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.

PaperID: 4627, https://arxiv.org/pdf/2603.21802.pdf

Authors: Niels Voorneveld, Peeter Laud

Title: Modal Logic for Distributed Trust

Abstract:
We propose a method for reasoning about trust in multi‑agent systems, specifying a language for describing communication protocols and making trust assumptions and derivations. This is given an interpretation in a modal logic for describing the beliefs and communications of agents in a network. We define how information in the network can be shared via forwarding, and how trust between agents can be generalized to trust across networks. We give specifications for the modal logic which can be readily adapted into a lambda calculus of proofs. We show that by nesting modalities, we can describe chains of communication between agents, and establish suitable notions of trust for such chains. We see how this can be applied to trust models in public key infrastructures, as well as other interaction protocols in distributed systems.

PaperID: 4628, https://arxiv.org/pdf/2603.21735.pdf

Authors: Kuangzhe Xu, Yu Shen, Longjie Yan, Yinghui Ren

Title: Cognitive Agency Surrender: Defending Epistemic Sovereignty via Scaffolded AI Friction

Abstract:
The proliferation of Generative Artificial Intelligence has transformed benign cognitive offloading into a systemic risk of cognitive agency surrender. Driven by the commercial dogma of "zero‑friction" design, highly fluent AI interfaces actively exploit human cognitive miserliness, prematurely satisfying the need for cognitive closure and inducing severe automation bias. To empirically quantify this epistemic erosion, we deployed a zero‑shot semantic classification pipeline (τ=0.7) on 1,223 high‑confidence AI‑HCI papers from 2023 to early 2026. Our analysis reveals an escalating "agentic takeover": a brief 2025 surge in research defending human epistemic sovereignty (19.1%) was abruptly suppressed in early 2026 (13.1%) by an explosive shift toward optimizing autonomous machine agents (19.6%), while frictionless usability maintained a structural hegemony (67.3%). To dismantle this trap, we theorize "Scaffolded Cognitive Friction," repurposing Multi‑Agent Systems (MAS) as explicit cognitive forcing functions (e.g., computational Devil's Advocates) to inject germane epistemic tension and disrupt heuristic execution. Furthermore, we outline a multimodal computational phenotyping agenda ‑‑ integrating gaze transition entropy, task‑evoked pupillometry, fNIRS, and Hierarchical Drift Diffusion Modeling (HDDM) ‑‑ to mathematically decouple decision outcomes from cognitive effort. Ultimately, intentionally designed friction is not merely a psychological intervention, but a foundational technical prerequisite for enforcing global AI governance and preserving societal cognitive resilience.

PaperID: 4629, https://arxiv.org/pdf/2603.21696.pdf

Authors: Hunmin Do, Taejun Yoon, Kiyong Jung

Title: MIND: Multi-agent inference for negotiation dialogue in travel planning

Abstract:
While Multi‑Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi‑agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus‑building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High‑w Hit and a 30.7% increase in Debate Hit‑Rate, effectively prioritizing high‑stakes constraints. Furthermore, qualitative evaluations via LLM‑as‑a‑Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.

PaperID: 4630, https://arxiv.org/pdf/2603.21691.pdf

Authors: Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli

Title: Strategic Infrastructure Design via Multi-Agent Congestion Games with Joint Placement and Pricing

Abstract:
Real‑world infrastructure planning increasingly involves strategic interactions among autonomous agents competing over congestible, limited resources. Applications such as Electric Vehicle (EV) charging, emergency response, and intelligent transportation require coordinated resource placement and pricing decisions, while anticipating the adaptive behaviour of decentralised, self‑interested agents. We propose a novel multi‑agent framework for joint placement and pricing under such interactions, formalised as a bi‑level optimisation model. The upper level represents a central planner, while the lower level captures agent responses via coupled non‑atomic congestion games. Motivated by the EV charging domain, we study a setting where a central planner provisions chargers and road capacity under budget and profitability constraints. The agent population includes both EV drivers and non‑charging drivers (NCDs), who respond to congestion, delays, and costs. To solve the resulting NP‑hard problem, we introduce ABO‑MPN, a double‑layer approximation framework that decouples agent types, applies integer adjustment and rounding, and targets high‑impact placement and pricing decisions. Experiments on benchmark networks show that our model reduces social cost by up to 40% compared to placement‑ or pricing‑only baselines, and generalises to other MAS‑relevant domains.

PaperID: 4631, https://arxiv.org/pdf/2603.21676.pdf

Authors: Hung-Hsuan Chen

Title: Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Abstract:
Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable‑depth reasoning, such as multi‑hop graph traversal or nested logic. We propose a depth‑recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared‑weight Transformer block in latent space ‑‑ enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi‑step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity‑biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear \emphcomputational frontier ‑‑ a boundary where performance transitions from chance to near‑perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task‑invariant recurrent reasoning core and task‑specific perceptual interfaces shapes out‑of‑distribution (OOD) generalization, offering a mechanistic perspective on vertical chain‑of‑thought that complements the prevailing horizontal token‑generation paradigm.

PaperID: 4632, https://arxiv.org/pdf/2603.21673.pdf

Authors: Shixu Liu

Title: Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Abstract:
Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human‑accessible explanations or generate generic descriptions lacking domain‑specific depth. We introduce WeatherTGD, a training‑free multi‑agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain‑specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus‑Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM‑generated feedback signal updates the caption toward an optimal solution. Experiments on real‑world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM‑based evaluation and human expert evaluation, substantially outperforming existing multi‑agent baselines while maintaining computational efficiency through parallel agent execution.

PaperID: 4633, https://arxiv.org/pdf/2603.21597.pdf

Authors: Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou

Title: Cerebra: A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Abstract:
Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real‑world clinical workflows. We present Cerebra, an interactive multi‑agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician‑facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy‑preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi‑institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state‑of‑the‑art single‑modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single‑modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C‑index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra's potential for interpretable, robust decision support in clinical care.

PaperID: 4634, https://arxiv.org/pdf/2603.21594.pdf

Authors: Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato

Title: Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

Abstract:
In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications. The UAVs' intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay‑tolerant multi‑agent deep reinforcement learning (MADRL) algorithm that integrates a delay‑penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs' trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio‑temporal attention based prediction approach to recover the lost information and enhance each UAV's awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV‑assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs' information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs' information exchange and thus fostering practical deployment of MADRL in UAV‑assisted wireless networks.

PaperID: 4635, https://arxiv.org/pdf/2603.21574.pdf

Authors: Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang

Title: Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction‑level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy‑tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi‑agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual‑Agent Answer‑Critique‑Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three‑stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent's marginal contribution to its partner's performance. ARE provides robust estimation of batch experience means during multi‑agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.

PaperID: 4636, https://arxiv.org/pdf/2603.21563.pdf

Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang

Title: Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Abstract:
Collaborative multi‑agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free‑riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent‑specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role‑sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global‑history‑aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think‑‑Reason dyad and multi‑agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free‑riding and outperforms strong multi‑agent RL baselines, yielding finer‑grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.

PaperID: 4637, https://arxiv.org/pdf/2603.21551.pdf

Authors: Haoran Xi, Minghao Shao, Kimberly Milner, Venkata Sai Charan Putrevu, Nanda Rani, Meet Udeshi, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Siddharth Garg, Sandeep Kumar Shukla, Farshad Khorrami, Alon Hillel-Tuch, Muhammad Shafique, Ramesh Karri

Title: AI In Cybersecurity Education -- Scalable Agentic CTF Design Principles and Educational Outcomes

Abstract:
Large language models are rapidly changing how learners acquire and demonstrate cybersecurity skills. However, when human‑‑AI collaboration is allowed, educators still lack validated competition designs and evaluation practices that remain fair and evidence‑based. This paper presents a cross‑regional study of LLM‑centered Capture‑the‑Flag competitions built on the Cyber Security Awareness Week competition system. To understand how autonomy levels and participants' knowledge backgrounds influence problem‑solving performance and learning‑related behaviors, we formalize three autonomy levels: human‑in‑the‑loop, autonomous agent frameworks, and hybrid. To enable verification, we require traceable submissions including conversation logs, agent trajectories, and agent code. We analyze multi‑region competition data covering an in‑class track, a standard track, and a year‑long expert track, each targeting participants with different knowledge backgrounds. Using data from the 2025 competition, we compare solve performance across autonomy levels and challenge categories, and observe that autonomous agent frameworks and hybrid achieve higher completion rates on challenges requiring iterative testing and tool interactions. In the in‑class track, we classify participants' agent designs and find a preference for lightweight, tool‑augmented prompting and reflection‑based retries over complex multi‑agent architectures. Our results offer actionable guidance for designing LLM‑assisted cybersecurity competitions as learning technologies, including autonomy‑specific scoring criteria, evidence requirements that support solution verification, and track structures that improve accessibility while preserving reliable evaluation and engagement.

PaperID: 4638, https://arxiv.org/pdf/2603.21522.pdf

Authors: Lingzhe Zhang, Tong Jia, Mingyu Wang, Weijie Hong, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li

Title: Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

Abstract:
Large Language Models (LLM)‑based Multi‑Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per‑trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose EAGER, an efficient failure management framework for multi‑agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning‑scoped contrastive learning to encode both intra‑agent reasoning and inter‑agent coordination, enabling real‑time step‑wise failure detection, diagnosis, and reflexive mitigation guided by historical failure knowledge. Preliminary evaluations on three open‑source MASs demonstrate the effectiveness of EAGER and highlight promising directions for future research in reliable multi‑agent system operations.

PaperID: 4639, https://arxiv.org/pdf/2603.21494.pdf

Authors: Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie, Kirti Magudia, Maciej A. Mazurowski, Evan Calabrese

Title: Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Abstract:
The Brain Tumor Reporting and Data System (BT‑RADS) standardizes post‑treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end‑to‑end multi‑agent large language model (LLM) and convolutional neural network (CNN) system for automated BT‑RADS classification. A multi‑agent LLM system combined with automated CNN‑based tumor segmentation was retrospectively evaluated on 509 consecutive post‑treatment glioma MRI examinations from a single high‑volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT‑RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board‑certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%‑79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%‑61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context‑dependent categories showed high sensitivity (BT‑1b 100%, BT‑1a 92.7%, BT‑3a 87.5%), while threshold‑dependent categories showed moderate sensitivity (BT‑3c 74.8%, BT‑2 69.2%, BT‑4 69.3%, BT‑3b 57.1%). For BT‑4, positive predictive value was 92.9%. The multi‑agent LLM system achieved higher BT‑RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context‑dependent scores and high positive predictive value for BT‑4 detection.

PaperID: 4640, https://arxiv.org/pdf/2603.21489.pdf

Authors: Jiayi Geng, Graham Neubig

Title: Effective Strategies for Asynchronous Software Engineering Agents

Abstract:
AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on Github. Yet long‑horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long‑horizon tasks in a timely manner is asynchronous multi‑agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi‑agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi‑agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency‑aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test‑based verification. In empirical evaluation, we find that CAID improves accuracy over single‑agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch‑and‑merge is a central coordination mechanism for multi‑agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.

PaperID: 4641, https://arxiv.org/pdf/2603.21454.pdf

Authors: Tae-Eun Song

Title: Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Abstract:
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE‑bench Verified, while existing detection methods‑‑paraphrase consistency, n‑gram overlap, perplexity analysis‑‑never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi‑turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross‑Context Verification (CCV), a black‑box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross‑Context Architecture (HCCA), a multi‑agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE‑bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann‑Whitney U=0, p approx 0.012, r = 1.0). Key findings: (1) contamination is binary‑‑models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination‑flaw composite cases that single‑analyst approaches miss. A pilot experiment extending HCCA to multi‑stage verification (Worker to Verifier to Director) yields a negative result‑‑100% sycophantic confirmation‑‑providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.

PaperID: 4642, https://arxiv.org/pdf/2603.21447.pdf

Authors: Ehsan Misaghi, Sean T Berkowitz, Bing Yu Chen, Qingyu Chen, Renaud Duval, Pearse A Keane, Danny A Mammo, Ariel Yuhan Ong, Mertcan Sevgi, Sumit Sharma, Sunil K Srivastava, Yih Chung Tham, Fares Antaki

Title: Deliberative multi-agent large language models improve clinical reasoning in ophthalmology

Abstract:
Large language models (LLMs) show potential for ophthalmic clinical reasoning, yet individual models risk introducing harm. We evaluated whether multi‑agent LLM deliberative councils improve diagnostic performance and mitigate harm compared to individual LLMs. In a comparative cross‑sectional study, we assessed 12 individual LLMs and three multi‑agent councils on 100 ophthalmology clinical vignettes. Each council comprised four models assembled by type: proprietary flagship, proprietary fast, and open‑source. Models independently answered a vignette, anonymously ranked one another's responses, and a designated chair synthesized all responses and peer reviews into a final answer. Councils consistently outperformed pooled individual models across all three tiers. Accuracy improved for proprietary flagship (95.0% vs 90.8%; risk difference [RD]: 4.25 [95% CI: 0.45, 8.05]), proprietary fast (96.0% vs 86.5%; RD: 9.50 [5.31, 13.59]), and open‑source councils (91.0% vs 83.2%; RD: 7.75 [4.17, 11.33]). Harm rates declined for proprietary flagship (10.0% vs 22.5%; RD: ‑12.50 [‑16.86, ‑8.14]), proprietary fast (16.0% vs 31.8%; RD: ‑15.75 [‑21.49, ‑10.01]), and open‑source councils (22.0% vs 38.5%; RD: ‑16.50 [‑22.27, ‑10.73]). Coverage analysis revealed net positive gains for accuracy (ΔCoverage: 4.4‑9.8 percentage points) and safety (ΔCoverage: 13.6‑20.6), indicating councils recovered correct diagnoses and averted harm. Councils elevated correct diagnoses to higher rank positions; and produced more complete differentials and management plans (all P<.05). Harmful council responses showed reduced combined commission‑and‑omission errors and tended to be less severe. Structured deliberation via multi‑agent LLM councils may enhance the reliability of LLM‑assisted ophthalmic clinical reasoning.

PaperID: 4643, https://arxiv.org/pdf/2603.21362.pdf

Authors: Liang Ding

Title: AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

Abstract:
Evaluating LLM agent trajectories is fundamentally task‑specific: a code‑debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm ‑‑ LLM‑as‑Judge with a fixed rubric ‑‑ applies the same static dimensions regardless of task, producing systematic mis‑evaluation. We present AdaRubric, a framework that (i) adaptively generates task‑specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step‑by‑step with confidence‑weighted, per‑dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension‑level quality masking, yield high‑quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric‑generated pairs improve task success by +6.8‑8.5% over the best baseline. AdaRubric also generalises zero‑shot to unseen domains (SWE‑bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics

PaperID: 4644, https://arxiv.org/pdf/2603.21301.pdf

Authors: Vinay Sharma, Manish Jain

Title: Enhancing reasoning accuracy in large language models during inference time

Abstract:
Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi‑step reasoning tasks, particularly when deployed without additional training or fine‑tuning. In this work, we study inference‑time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference‑time strategies: (i) self‑consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual‑model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self‑reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain‑of‑Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. In this work, we provide a controlled comparative evaluation across three inference‑time strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self‑consistency with nucleus sampling and controlled temperature value yields the substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single‑pass decoding, well‑suited for low‑risk domains, offering meaningful gains with minimal overhead. The dual‑model approach provides additional confirmation for model reasoning steps thus more appropriate for moderate‑risk domains, where higher reliability justifies additional compute. Self‑reflection offers only marginal improvements, suggesting limited effectiveness for smaller non‑reasoning models at inference time.

PaperID: 4645, https://arxiv.org/pdf/2603.21278.pdf

Authors: Pranav Hemanth, Sampriti Saha

Title: Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations

Abstract:
Large language models (LLMs) are increasingly deployed for extended, multi‑topic conversations, yet the flat, append‑only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context‑isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi‑agent settings.

PaperID: 4646, https://arxiv.org/pdf/2603.21194.pdf

Authors: Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

Title: Is Monitoring Enough? Strategic Agent Selection For Stealthy Attack in Multi-Agent Discussions

Abstract:
Multi‑agent discussions have been widely adopted, motivating growing efforts to develop attacks that expose their vulnerabilities. In this work, we study a practical yet largely unexplored attack scenario, the discussion‑monitored scenario, where anomaly detectors continuously monitor inter‑agent communications and block detected adversarial messages. Although existing attacks are effective without discussion monitoring, we show that they exhibit detectable patterns and largely fail under such monitoring constraints. But does this imply that monitoring alone is sufficient to secure multi‑agent discussions? To answer this question, we develop a novel attack method explicitly tailored to the discussion‑monitored scenario. Extensive experiments demonstrate that effective attacks remain possible even under continuous monitoring, indicating that monitoring alone does not eliminate adversarial risks.

PaperID: 4647, https://arxiv.org/pdf/2603.21190.pdf

Authors: Yiwei Wu, Yifan Wu, Yunhao Xiong, Dengwei Zhao, Jiaxuan Shen, Jianfei Jiang, Guanghui He, Shikui Tu, Yanan Sun

Title: DS2SC-Agent: A Multi-Agent Automated Pipeline for Rapid Chiplet Model Generation

Abstract:
Constructing behavioral‑level chiplet models (e.g., SystemC) is crucial for early‑stage heterogeneous architecture exploration. Traditional manual modeling is notoriously time‑consuming and error‑prone. Recently, Large Language Models (LLMs) have demonstrated immense potential in automating hardware code generation. However, existing LLM‑assisted design frameworks predominantly target highly structured or well‑defined design specifications. In practical engineering scenarios, raw datasheets typically encompass lengthy, complex, and highly unstructured information. Consequently, reliable code generation directly from these raw datasheets suffers from severe challenges, including context vanishing and logical hallucinations.To overcome this critical bottleneck, this paper proposes DS2SC‑Agent(Datasheet‑to‑SystemC‑Agent): the first end‑to‑end, fully automated generation pipeline that translates raw datasheets directly into SystemC chiplet models. This system establishes a highly efficient multi‑agent collaborative framework. By decoupling the intricate modeling tasks, the proposed pipeline orchestrates a fully automated workflow encompassing unstructured long‑document parsing, SystemC core code construction, testbench stimulus generation, and adaptive closed‑loop debugging. We comprehensively evaluate the proposed framework on representative single‑function chiplets across the analog, digital, and radio frequency (RF) domains‑‑specifically, a Limiting Amplifier (LA), a Fast Fourier Transform (FFT) module, and a Power Amplifier (PA). The evaluation demonstrates that our pipeline seamlessly processes complex real‑world datasheets to consistently generate functionally correct SystemC models. This provides a highly efficient and reliable paradigm for agile model library construction while drastically minimizing manual intervention.

PaperID: 4648, https://arxiv.org/pdf/2603.21152.pdf

Authors: Feng Liu, Jian Xu, Xin Cui, Xinghao Wang, Zijie Guo, Jiong Wang, S. Mostafa Mousavi, Xinyu Gu, Hao Chen, Ben Fei, Lihua Fang, Fenghua Ling, Zefeng Li, Lei Bai

Title: TRACE: A Multi-Agent System for Autonomous Physical Reasoning for Seismology

Abstract:
Inferring physical mechanisms that govern earthquake sequences from geophysical observations remains a challenging task, particularly across tectonically distinct environments where similar seismic patterns can reflect different underlying processes. Current seismological processing and interpretation rely heavily on experts' choice of parameters and the synthesis of various seismological products, limiting reproducibility and the formation of generalizable knowledge across settings. Here we present TRACE (Trans‑perspective Reasoning and Automated Comprehensive Evaluator), a multi‑agent system that combines large language model planning with formal seismological constraints to derive auditable, physically grounded mechanistic inferences from raw observations. Applied to the 2019 Ridgecrest sequence, TRACE autonomously identifies stress‑perturbation‑induced delayed triggering, resolving the cascading interaction between the Mw 6.4 and Mw 7.1 mainshocks. For the 2025 Santorini‑Kolumbo volcanic eruption, the system identifies a structurally guided intrusion model, distinguishing episodic migration via fault channels from the continuous propagation expected in homogeneous crustal failure. By providing a generalizable infrastructure for deriving physical insights from seismic phenomena, TRACE advances the field from expert‑dependent analysis toward knowledge‑guided autonomous discovery in Earth sciences.

PaperID: 4649, https://arxiv.org/pdf/2603.21140.pdf

Authors: Zhuojie Yang, Wentao Wan, Keze Wang

Title: ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation

Abstract:
Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, while a key factor influencing the effectiveness of this paradigm is the quality of the generated multi‑step reasoning data. To generate high‑quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code‑based validation is restricted to code or mathematical tasks, and reasoning engines require a well‑structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step‑wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine‑grained, step‑level validation, facilitating the construction of high‑quality multi‑step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, our ORACLE consistently outperforms strong baselines on multiple models.

PaperID: 4650, https://arxiv.org/pdf/2603.21029.pdf

Authors: Ye Tian, Jingyi Zhang, Zihao Wang, Xiaoyuan Ren, Xiaofan Yu, Onat Gungor, Tajana Rosing

Title: KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph

Abstract:
Autonomous driving requires reliable reasoning over fine‑grained 3D scene facts. Fine‑grained question answering over multi‑modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving‑oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task‑specific training. We present KLDrive, the first knowledge‑graph‑augmented LLM reasoning framework for fine‑grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy‑based scene fact construction module that consolidates multi‑source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact‑grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few‑shot in‑context exemplars, the framework adapts to diverse reasoning tasks without heavy task‑specific fine‑tuning. Experiments on two large‑scale autonomous‑driving QA benchmarks show that KLDrive outperforms prior state‑of‑the‑art methods, achieving the best overall accuracy of 65.04% on NuScenes‑QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.

PaperID: 4651, https://arxiv.org/pdf/2603.21019.pdf

Authors: Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

Title: SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration

Abstract:
With the rapid evolution of Large Language Model (LLM) agent ecosystems, centralized skill marketplaces have emerged as pivotal infrastructure for augmenting agent capabilities. However, these marketplaces face unprecedented security challenges, primarily stemming from semantic‑behavioral inconsistency and inter‑skill combinatorial risks, where individually benign skills induce malicious behaviors during collaborative invocation. To address these vulnerabilities, we propose SkillProbe, a multi‑stage security auditing framework driven by multi‑agent collaboration. SkillProbe introduces a "Skills‑for‑Skills" design paradigm, encapsulating auditing processes into standardized skill modules to drive specialized agents through a rigorous pipeline, including admission filtering, semantic‑behavioral alignment detection, and combinatorial risk simulation. We conducted a large‑scale evaluation using 8 mainstream LLM series across 2,500 real‑world skills from ClawHub. Our results reveal a striking popularity‑security paradox, where download volume is not a reliable proxy for security quality, as over 90% of high‑popularity skills failed to pass rigorous auditing. Crucially, we discovered that high‑risk skills form a single giant connected component within the risk‑link dimension, demonstrating that cascaded risks are systemic rather than isolated occurrences. We hope that SkillProbe will inspire researchers to provide a scalable governance infrastructure for constructing a trustworthy Agentic Web. SkillProbe is accessible for public experience at skillhub.holosai.io.

PaperID: 4652, https://arxiv.org/pdf/2603.21011.pdf

Authors: Rushikesh Deotale, Adithya Srinivasan, Yuan Tian, Tianyi Zhang, Pavlos Vlachos, Hector Gomez

Title: ALL-FEM: Agentic Large Language models Fine-tuned for Finite Element Methods

Abstract:
Finite element (FE) analysis guides the design and verification of nearly all manufactured objects. It is at the core of computational engineering, enabling simulation of complex physical systems, from fluids and solids to multiphysics systems. However, implementing FE codes and analyzing simulation results demands expertise across numerical analysis, continuum mechanics, and programming. Conventional Large Language Models (LLMs) can generate FE code, but they hallucinate, lack awareness of variational structures, and cannot close the loop from problem statement to a verified solution. Here, we propose ALL‑FEM, an autonomous simulation system that integrates agentic AI with domain‑specific, fine‑tuned LLMs for FEniCS code generation across solid, fluid, and multiphysics applications. We construct a corpus of 1000+ verified FEniCS scripts by combining 500+ curated expert codes with a retrieval‑augmented, multi‑LLM pipeline that generates and filters codes for diverse PDEs, geometries, and boundary conditions. We used the corpus to fine‑tune LLMs with 3B to 120B parameters. Our agentic framework orchestrates specialized agents, powered by fine‑tuned LLMs, to formulate problems as PDEs, generate and debug code and visualize the results. We evaluated the system on 39 benchmarks that include problems of linear/nonlinear elasticity, plasticity, Newtonian/non‑Newtonian flow, thermofluids, fluid‑structure interaction, phase separation, and transport on moving domains. Embedded in a multi‑agent workflow with runtime feedback, the best fine‑tuned model (GPT OSS 120B) achieves code‑level success of 71.79%, outperforming a non‑agentic deployment of GPT 5 Thinking. By showing that relatively small, fine‑tuned LLMs, orchestrated through agentic frameworks, can automate FE workflows, ALL‑FEM offers a blueprint for autonomous simulation systems in computational science and engineering.

PaperID: 4653, https://arxiv.org/pdf/2603.20994.pdf

Authors: Benedikt Hornig, Reuth Mirsky

Title: The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

Abstract:
In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety‑critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game‑theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi‑step scenarios, identifying strategic phenomena such as ``safety traps,'' where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non‑compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi‑Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.

PaperID: 4654, https://arxiv.org/pdf/2603.20986.pdf

Authors: Sukriti Manna, Henry Chan, Subramanian K. R. S. Sankaranarayanan

Title: AutoMOOSE: An Agentic AI for Autonomous Phase-Field Simulation

Abstract:
Multiphysics simulation frameworks such as MOOSE provide rigorous engines for phase‑field materials modeling, yet adoption is constrained by the expertise required to construct valid input files, coordinate parameter sweeps, diagnose failures, and extract quantitative results. We introduce AutoMOOSE, an open‑source agentic framework that orchestrates the full simulation lifecycle from a single natural‑language prompt. AutoMOOSE deploys a five‑agent pipeline in which the Input Writer coordinates six sub‑agents and the Reviewer autonomously corrects runtime failures without user intervention. A modular plugin architecture enables new phase‑field formulations without modifying the core framework, and a Model Context Protocol (MCP) server exposes the workflow as ten structured tools for interoperability with any MCP‑compatible client. Validated on a four‑temperature copper grain growth benchmark, AutoMOOSE generates MOOSE input files with 6 of 12 structural blocks matching a human expert reference exactly and 4 functionally equivalent, executes all runs in parallel with a 1.8x speedup, and performs an end‑to‑end physical consistency check spanning intent, finite‑element execution, and Arrhenius kinetics with no human verification. Grain coarsening kinetics are recovered with R^2 = 0.90‑0.95 at T >= 600 K; the recovered activation energy Q_fit = 0.296 eV is consistent with a human‑written reference (Q_fit = 0.267 eV) under identical parameters. Three runtime failure classes were diagnosed and resolved autonomously within a single correction cycle, and every run produces a provenance record satisfying FAIR data principles. These results show that the gap between knowing the physics and executing a validated simulation campaign can be bridged by a lightweight multi‑agent orchestration layer, providing a pathway toward AI‑driven materials discovery and self‑driving laboratories.

PaperID: 4655, https://arxiv.org/pdf/2603.20976.pdf

Authors: Abed K. Musaffar, Ambuj Singh, Francesco Bullo

Title: Detection of adversarial intent in Human-AI teams using LLMs

Abstract:
Large language models (LLMs) are increasingly deployed in human‑AI teams as support agents for complex tasks such as information retrieval, programming, and decision‑making assistance. While these agents' autonomy and contextual knowledge enables them to be useful, it also exposes them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent to provide harmful information, potentially manipulating human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human‑AI teams. Using a dataset consisting of multi‑party conversations and decisions for a real human‑AI team over a 25 round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real‑time, and without task‑specific information, indicating the potential for task‑agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting the introduction of LLM defenders could render human teams more robust to certain classes of attack.

PaperID: 4656, https://arxiv.org/pdf/2603.20975.pdf

Authors: Bo Jiang

Title: DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

Abstract:
Multi‑agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter‑agent disagreement ‑‑ both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) ‑‑ to produce well‑calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ‑LLM (logistic regression on LLM‑extracted structure features), DiscoUQ‑Embed (logistic regression on embedding geometry), and DiscoUQ‑Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC‑Challenge) with a 5‑agent system using Qwen3.5‑27B, DiscoUQ‑LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near‑zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.

PaperID: 4657, https://arxiv.org/pdf/2603.20965.pdf

Authors: Kemal Kirtac

Title: Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification

Abstract:
This paper studies whether a lightweight supervised aggregator can combine diverse zero‑shot large language model outputs into a stronger downstream signal for corporate disclosure classification. Zero‑shot LLMs can read disclosures without task‑specific fine‑tuning, but their predictions often vary across prompt perspectives, model families, and confidence levels. I examine this problem with a multi‑prompt framework in which three fixed zero‑shot LLM classifiers read each disclosure from different financial perspectives and output a sentiment label, a confidence score, and a short rationale. A logistic meta‑classifier then aggregates these outputs to predict next‑day stock return direction. To reduce pretrained‑model contamination, I restrict evaluation to a post‑release sample of 9,860 U.S.\ corporate disclosures issued by large publicly traded firms between January 2025 and March 2026, after the release of the frozen base LLMs used in the experiment. Results show that the trained aggregator outperforms single classifiers, majority vote, confidence‑weighted voting, a zero‑shot LLM judge, and a FinBERT baseline. Balanced accuracy rises from 0.566 for the best single classifier to 0.606 for the trained aggregator. The gain is largest in mixed‑signal disclosures where classifiers disagree. The results suggest that zero‑shot LLM outputs contain complementary financial signals, while also showing that the strongest gains come from supervised aggregation rather than from zero‑shot voting alone.

PaperID: 4658, https://arxiv.org/pdf/2603.20884.pdf

Authors: Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang

Title: NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation

Abstract:
The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain‑specific mechanisms and thus delivers lower‑quality results. To bridge this gap, we introduce NoveltyAgent, a multi‑agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper's originality. It decomposes manuscripts into discrete novelty points for fine‑grained retrieval and comparison, and builds a comprehensive related‑paper database while cross‑referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open‑ended generation tasks, we propose a checklist‑based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state‑of‑the‑art performance, outperforming GPT‑5 DeepResearch by 10.15%. We hope this system will provide reliable, high‑quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at https://github.com/SStan1/NoveltyAgent.

PaperID: 4659, https://arxiv.org/pdf/2603.20815.pdf

Authors: Xiaohan Wang, Nan Zhang, Sulene Han, Keguang Tang, Lei Xu, Zhiping Li, Xiue, Liu, Xiaomei Han

Title: GMPilot: An Expert AI Agent For FDA cGMP Compliance

Abstract:
The pharmaceutical industry is facing challenges with quality management such as high costs of compliance, slow responses and disjointed knowledge. This paper presents GMPilot, a domain‑specific AI agent that is designed to support FDA cGMP compliance. GMPilot is based on a curated knowledge base of regulations and historical inspection observations and uses Retrieval‑Augmented Generation (RAG) and Reasoning‑Acting (ReAct) frameworks to provide real‑time and traceable decision support to the quality professionals. In a simulated inspection scenario, GMPilot shows how it can improve the responsiveness and professionalism of quality professionals by providing structured knowledge retrieval and verifiable regulatory and case‑based support. Although GMPilot lacks in the aspect of regulatory scope and model interpretability, it is a viable avenue of improving quality management decision‑making in the pharmaceutical sector using intelligent approaches and an example of specialized application of AI in highly regulated sectors.

PaperID: 4660, https://arxiv.org/pdf/2603.20750.pdf

Authors: Jinming Yang, Xinyu Jiang, Xinshan Jiao, Xinping Zhang

Title: Modeling Epistemic Uncertainty in Social Perception via Rashomon Set Agents

Abstract:
We present an LLM‑driven multi‑agent probabilistic modeling framework that demonstrates how differences in students' subjective social perceptions arise and evolve in real‑world classroom settings, under constraints from an observed social network and limited questionnaire data. When social information is incomplete and the accuracy of perception differs between students, they can form different views of the same group structure from local cues they can access. Repeated peer communication and belief updates can gradually change these views and, over time, lead to stable group‑level differences. To avoid assuming a global "god's‑eye view," we assign each student an individualized subjective graph that shows which social ties they can perceive and how far information is reachable from their perspective. All judgments and interactions are restricted to this subjective graph: agents use retrieval‑augmented generation (RAG) to access only local information and then form evaluations of peers' competence and social standing. We also add structural perturbations related to social‑anxiety to represent consistent individual differences in the accuracy of social perception. During peer exchanges, agents share narrative assessments of classmates' academic performance and social position with uncertainty tags, and update beliefs probabilistically using LLM‑based trust scores. Using the time series of six real exam scores as an exogenous reference, we run multi‑step simulations to examine how epistemic uncertainty spreads through local interactions. Experiments show that, without relying on global information, the framework reproduces several collective dynamics consistent with real‑world educational settings. The code is released at https://anonymous.4open.science/r/Rashomonomon‑0126.

PaperID: 4661, https://arxiv.org/pdf/2603.20692.pdf

Authors: Linuka Ratnayake, Danidu Dabare, Sanuja Rupasinghe, Warren Jayakumar, Dileepa Marasinghe, Chamira U. S. Edussooriya, Arjuna Madanayake

Title: Agentic Physical-AI for Self-Aware RF Systems

Abstract:
Intelligent control of RF transceivers adapting to dynamic operational conditions is essential in the modern and future communication systems. We propose a multi‑agent neurosymbolic AI system, where AI agents are assigned for circuit components. Agents have an internal model and a corresponding control algorithm as its constituents. Modeling of the IF amplifier shows promising results, where the same approach can be extended to all the components, thus creating a fully intelligent RF system.

PaperID: 4662, https://arxiv.org/pdf/2603.20678.pdf

Authors: Yicai Xing

Title: AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency

Abstract:
Contemporary societies face a severe crisis of demographic reproduction. Global fertility rates continue to decline precipitously, with East Asian nations exhibiting the most dramatic trends ‑‑ China's total fertility rate (TFR) fell to approximately 1.0 in 2023, while South Korea's dropped below 0.72. Simultaneously, the institution of marriage is undergoing structural disintegration: educated women rationally reject unions lacking both emotional fulfillment and economic security, while a growing proportion of men at the lower end of the socioeconomic spectrum experience chronic sexual deprivation, anxiety, and learned helplessness. This paper proposes a computational framework for modeling and evaluating a Stratified Polyamory System (SPS) using techniques from agent‑based modeling (ABM), multi‑agent reinforcement learning (MARL), and large language model (LLM)‑empowered social simulation. The SPS permits individuals to maintain a limited number of legally recognized secondary partners in addition to one primary spouse, combined with socialized child‑rearing and inheritance reform. We formalize the A/B/C stratification as heterogeneous agent types in a multi‑agent system and model the matching process as a MARL problem amenable to Proximal Policy Optimization (PPO). The mating network is analyzed using graph neural network (GNN) representations. Drawing on evolutionary psychology, behavioral ecology, social stratification theory, computational social science, algorithmic fairness, and institutional economics, we argue that SPS can improve aggregate social welfare in the Pareto sense. Preliminary computational results demonstrate the framework's viability in addressing the dual crisis of female motherhood penalties and male sexlessness, while offering a non‑violent mechanism for wealth dispersion analogous to the historical Chinese Grace Decree (Tui'en Ling).

PaperID: 4663, https://arxiv.org/pdf/2603.20670.pdf

Authors: Ruixiang Liu, Zhenlong Li, Ali Khosravi Kazazi

Title: Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models

Abstract:
The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword‑based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph‑driven multi‑agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi‑agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed‑loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent‑aware, and intelligent paradigm, providing a practical foundation for next‑generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.

PaperID: 4664, https://arxiv.org/pdf/2603.20644.pdf

Authors: Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang

Title: ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework

Abstract:
Instruction‑based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large‑scale, diverse, and high‑quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed‑source models for annotation, which prevents cost‑effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open‑source hierarchical multi‑agent framework for end‑to‑end construction of large‑scale, high‑quality image editing datasets. Our pipeline consists of three key components: source image expansion with world‑knowledge infusion, adaptive multi‑agent editing instruction‑image synthesis, and a task‑aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit‑12M, the largest open‑source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine‑tuning UniWorld‑V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS‑Bench for knowledge‑infused benchmarks. These results demonstrate that open‑source, agentic pipelines can approach commercial‑grade data quality while retaining cost‑effectiveness and scalability. Both the framework and dataset will be open‑sourced.

PaperID: 4665, https://arxiv.org/pdf/2603.20637.pdf

Authors: Sen Fang, Weiyuan Ding, Zhezhen Cao, Zhou Yang, Bowen Xu

Title: AEGIS: From Clues to Verdicts -- Graph-Guided Deep Vulnerability Reasoning via Dialectics and Meta-Auditing

Abstract:
Large Language Models (LLMs) are increasingly adopted for vulnerability detection, yet their reasoning remains fundamentally unsound. We identify a root cause shared by both major mitigation paradigms (agent‑based debate and retrieval augmentation): reasoning in an ungrounded deliberative space that lacks a bounded, hypothesis‑specific evidence base. Without such grounding, agents fabricate cross‑function dependencies, and retrieval heuristics supply generic knowledge decoupled from the repository's data‑flow topology. Consequently, the resulting conclusions are driven by rhetorical persuasiveness rather than verifiable facts. To ground this deliberation, we present AEGIS, a novel multi‑agent framework that shifts detection from ungrounded speculation to forensic verification over a closed factual substrate. Guided by a "From Clue to Verdict" philosophy, AEGIS first identifies suspicious code anomalies (clues), then dynamically reconstructs per‑variable dependency chains for each clue via on‑demand slicing over a repository‑level Code Property Graph. Within this closed evidence boundary, a Verifier Agent constructs competing dialectical arguments for and against exploitability, while an independent Audit Agent scrutinizes every claim against the trace, exercising veto power to prevent hallucinated verdicts. Evaluation on the rigorous PrimeVul dataset demonstrates that AEGIS establishes a new state‑of‑the‑art, achieving 122 Pair‑wise Correct Predictions. To our knowledge, this is the first approach to surpass 100 on this benchmark. It reduces the false positive rate by up to 54.40% compared to leading baselines, at an average cost of 0.09 per sample without any task‑specific training.

PaperID: 4666, https://arxiv.org/pdf/2603.20629.pdf

Authors: Cheng Yang, Dong Li

Title: Effective Rank Analysis and Optimization of Flexible Antenna-Enabled Wireless Systems: Movable Antennas or Pinching Antennas?

Abstract:
Flexible antenna technology has recently emerged as a key enabler for next‑generation wireless communications, which can effectively exploit the spatial degrees of freedom (DoF). However, existing conventional metrics (e.g., spectral and energy efficiency) cannot directly measure the variability for different flexible antenna structures for the spatial DoF. To objectively analyze and compare the spatial DoF of different flexible antennas, the effective rank is introduced as a metric for two major flexible antenna technologies, i.e., movable antenna (MA) and pinching antenna (PA)‑enabled wireless systems, optimizing their antenna positions over multi‑time slots. However, the inherent non‑convexity and high computational complexity of the resulting effective rank maximization problems in MA and PA systems render them hard to solve. To circumvent these problems, we propose the graph attention implicit quantile network (GAIQN) and multi‑agent graph attention Q‑network (MAGAQN) algorithms to enhance the effective rank of MA and PA systems through antenna positioning, respectively. Meanwhile, the novel top‑k action selection methods are designed to ensure collision‑free between multiple MAs on the two‑dimensional array plane or multiple PAs on the same waveguide. Simulation results validate the effectiveness and advancement of our proposed GAIQN and MAGAQN algorithms compared with benchmarks, which enhance the effective rank by at least 1.6% and 1.3%, respectively, while consistently ensuring collision‑free between flexible antennas. Besides, under the same number of flexible antennas, the MA system supports a higher effective rank than the PA system, whereas the PA system offers greater stability in terms of achievable spatial DoF.

PaperID: 4667, https://arxiv.org/pdf/2603.20625.pdf

Authors: Yusheng Zheng, Yiwei Yang, Wei Zhang, Andi Quinn

Title: ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore

Abstract:
LLM agent frameworks increasingly offer checkpoint‑restore for error recovery and exploration, advising developers to make external tool calls safe to retry. This advice assumes that a retried call will be identical to the original, an assumption that holds for traditional programs but fails for LLM agents, which re‑synthesize subtly different requests after restore. Servers treat these re‑generated requests as new, enabling duplicate payments, unauthorized reuse of consumed credentials, and other irreversible side effects; we term these semantic rollback attacks. We identify two attack classes, Action Replay and Authority Resurrection, validate them in a proof of concept experiment, and confirm that the problem has been independently acknowledged by framework maintainers. We propose ACRFence, a framework‑agnostic mitigation that records irreversible tool effects and enforces replay‑or‑fork semantics upon restoration

PaperID: 4668, https://arxiv.org/pdf/2603.20595.pdf

Authors: Truong Thanh Hung Nguyen, Hélène Fournier, Piper Jackson, Makoto Itoh, Shannon Freeman, Rene Richard, Hung Cao

Title: Position: Multi-Agent Algorithmic Care Systems Demand Contestability for Trustworthy AI

Abstract:
Multi‑agent systems (MAS) are increasingly used in healthcare to support complex decision‑making through collaboration among specialized agents. Because these systems act as collective decision‑makers, they raise challenges for trust, accountability, and human oversight. Existing approaches to trustworthy AI largely rely on explainability, but explainability alone is insufficient in multi‑agent settings, as it does not enable care partners to challenge or correct system outputs. To address this limitation, Contestable AI (CAI) characterizes systems that support effective human challenge throughout the decision‑making lifecycle by providing transparency, structured opportunities for intervention, and mechanisms for review, correction, or override. This position paper argues that contestability is a necessary design requirement for trustworthy multi‑agent algorithmic care systems. We identify key limitations in current MAS and Explainable AI (XAI) research and present a human‑in‑the‑loop framework that integrates structured argumentation and role‑based contestation to preserve human agency, clinical responsibility, and trust in high‑stakes care contexts.

PaperID: 4669, https://arxiv.org/pdf/2603.20575.pdf

Authors: Logan Banker, Michael Wozniak, Mohanad Alameer, Smriti Nandan Paul, David Meisinger, Grant Baer, Trevor Hunting, Ryan Dunham, Jay Kamdar

Title: Current state of the multi-agent multi-view experimental and digital twin rendezvous (MMEDR-Autonomous) framework

Abstract:
As near‑Earth resident space objects proliferate, there is an increasing demand for reliable technologies in applications of on‑orbit servicing, debris removal, and orbit modification. Rendezvous and docking are critical mission phases for such applications and can benefit from greater autonomy to reduce operational complexity and human workload. Machine learning‑based methods can be integrated within the guidance, navigation, and control (GNC) architecture to design a robust rendezvous and docking framework. In this work, the Multi‑Agent Multi‑View Experimental and Digital Twin Rendezvous (MMEDR‑Autonomous) is introduced as a unified framework comprising a learning‑based optical navigation network, a reinforcement learning‑based guidance approach under ongoing development, and a hardware‑in‑the‑loop testbed. Navigation employs a lightweight monocular pose estimation network with multi‑scale feature fusion, trained on realistic image augmentations to mitigate domain shift. The guidance component is examined with emphasis on learning stability, reward design, and systematic hyperparameter tuning under mission‑relevant constraints. Prior Control Barrier Function results for Clohessy‑Wiltshire dynamics are reviewed as a basis for enforcing safety and operational constraints and for guiding future nonlinear controller design within the MMEDR‑Autonomous framework. The MMEDR‑Autonomous framework is currently progressing toward integrated experimental validation in multi‑agent rendezvous scenarios.

PaperID: 4670, https://arxiv.org/pdf/2603.20553.pdf

Authors: Bowen Li, Edwin K. P. Chong, Ali Pezeshki

Title: Performance Guarantees for Data-Driven Sequential Decision-Making

Abstract:
The solutions to many sequential decision‑making problems are characterized by dynamic programming and Bellman's principle of optimality. However, due to the inherent complexity of solving Bellman's equation exactly, there has been significant interest in developing various approximate dynamic programming (ADP) schemes to obtain near‑optimal solutions. A fundamental question that arises is: how close are the objective values produced by ADP schemes relative to the true optimal objective values? In this paper, we develop a general framework that provides performance guarantees for ADP schemes in the form of ratio bounds. Specifically, we show that the objective value under an ADP scheme is at least a computable fraction of the optimal value. We further demonstrate the applicability of our theoretical framework through two applications: data‑driven robot path planning and multi‑agent sensor coverage.

PaperID: 4671, https://arxiv.org/pdf/2603.20508.pdf

Authors: Dani Roytburg, Shreya Sridhar, Daphne Ippolito

Title: Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?

Abstract:
Language models are increasingly being trained to "reason" before answering users' queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models' ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM's reasoning traces are for guiding a weaker, non‑reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest‑performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency‑based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM's ability to output highly legible traces can be a task‑ and audience‑dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi‑agent future.

PaperID: 4672, https://arxiv.org/pdf/2603.20484.pdf

Authors: Ian F. Akyildiz, Tuğçe Bilen

Title: Fluid Antenna Networks Beyond Beamforming: An AI-Native Control Paradigm for 6G

Abstract:
Fluid Antenna Systems (FAS) introduce a new degree of freedom for wireless networks by enabling the physical antenna position to adapt dynamically to changing radio conditions. While existing studies primarily emphasize physical‑layer gains, their broader implications for network operation remain largely unexplored. Once antennas become reconfigurable entities, antenna positioning naturally becomes part of the network control problem rather than a standalone optimization task. This article presents an AI‑native perspective on fluid antenna networks for future 6G systems. Instead of treating antenna repositioning as an isolated operation, we consider a closed‑loop control architecture in which antenna adaptation is jointly managed with conventional radio resource management (RRM) functions. Within this framework, real‑time network observations are translated into coordinated antenna and resource configuration decisions that respond to user mobility, traffic demand, and evolving interference conditions. To address the complexity of multi‑cell environments, we explore a multi‑agent reinforcement learning (MARL) approach that enables distributed and adaptive control across base stations. Illustrative results show that intelligent antenna adaptation yields consistent performance gains, particularly at the cell edge, while also reducing inter‑cell interference. These findings suggest that the true potential of fluid antenna systems lies not only in reconfigurable hardware, but in intelligent network control architectures that can effectively exploit this additional spatial degree of freedom.

PaperID: 4673, https://arxiv.org/pdf/2603.20449.pdf

Authors: Cailin Winston, Claris Winston, René Just

Title: Solver-Aided Verification of Policy Compliance in Tool-Augmented LLM Agents

Abstract:
Tool‑augmented Large Language Models (TaLLMs) extend LLMs with the ability to invoke external tools, enabling them to interact with real‑world environments. However, a major limitation in deploying TaLLMs in sensitive applications such as customer service and business process automation is a lack of reliable compliance with domain‑specific operational policies regarding tool‑use and agent behavior. Current approaches merely steer LLMs to adhere to policies by including policy descriptions in the LLM context, but these provide no guarantees that policy violations will be prevented. In this paper, we introduce an SMT solver‑aided framework to enforce tool‑use policy compliance in TaLLM agents. Specifically, we use an LLM‑assisted, human‑guided approach to translate natural‑language‑specified tool‑use policies into formal logic (SMT‑LIB‑2.0) constraints over agent‑observable state and tool arguments. At runtime, planned tool calls are intercepted and checked against the constraints using the Z3 solver as a pre‑condition to the tool call. Tool invocations that violate the policy are blocked. We evaluated on the TauBench benchmark and demonstrate that solver‑aided policy checking reduces policy violations while maintaining overall task accuracy. These results suggest that integrating formal reasoning into TaLLM execution can improve tool‑call policy compliance and overall reliability.

PaperID: 4674, https://arxiv.org/pdf/2603.20435.pdf

Authors: Jingwei Huang, Kuroush Nezafati, Zhikai Chi, Ruichen Rong, Colin Treager, Tingyi Wanyan, Yueshuang Xu, Xiaowei Zhan, Patrick Leavey, Guanghua Xiao, Wenqi Shi, Yang Xie

Title: Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health

Abstract:
Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)‑based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, a large language model agent framework that iteratively self‑critiques and revises structured outputs by checking consistency among variables, the input text, and retrieved domain knowledge, stopping when outputs converge. We extensively evaluate the proposed method in three diverse oncology applications: (1) On colorectal cancer synoptic reporting from gross descriptions (n=217), reflective reasoning improved average F1 across eight categorical synoptic variables from 0.828 to 0.911 and increased mean correct rate across four numeric variables from 0.806 to 0.895; (2) On Ewing sarcoma CD99 immunostaining pattern identification (n=200), the accuracy improved from 0.870 to 0.927; (3) On lung cancer tumor staging (n=100), tumor stage accuracy improved from 0.680 to 0.833 (pT: 0.842 ‑> 0.884; pN: 0.885 ‑> 0.948). The results demonstrate that deep reflective reasoning can systematically improve the reliability of LLM‑based structured data extraction under interdependence constraints, enabling more consistent machine‑operable clinical datasets and facilitating knowledge discovery with machine learning and data science towards digital health.

PaperID: 4675, https://arxiv.org/pdf/2603.20404.pdf

Authors: Abdullahi Isa Ahmed, Ana Maria Drăgulinescu, El Mehdi Amhoud

Title: Hetero-Net: An Energy-Efficient Resource Allocation and 3D Placement in Heterogeneous LoRa Networks via Multi-Agent Optimization

Abstract:
The evolution of Internet of Things (IoT) into multi‑layered environments has positioned Low‑Power Wide Area Networks (LPWANs), particularly Long Range (LoRa), as the backbone for connectivity across both surface and subterranean landscapes. However, existing LoRa‑based network designs often treat ground‑based wireless sensor networks (WSNs) and wireless underground sensor networks (WUSNs) as separate systems, resulting in inefficient and non‑integrated connectivity across diverse environments. To address this, we propose Hetero‑Net, a unified heterogeneous LoRa framework that integrates diverse LoRa end devices with multiple unmanned aerial vehicle (UAV)‑mounted LoRa gateways. Our objective is to maximize system energy efficiency through the joint optimization of the spreading factor, transmission power, and three‑dimensional (3D) placement of the UAVs. To manage the dynamic and partially observable nature of this system, we model the problem as a partially observable stochastic game (POSG) and address it using a multi‑agent proximal policy optimization (MAPPO) framework. An ablation study shows that our proposed MAPPO Hetero‑Net significantly outperforms traditional, isolated network designs, achieving energy efficiency improvements of 55.81% and 198.49% over isolated WSN‑only and WUSN‑only deployments, respectively.

PaperID: 4676, https://arxiv.org/pdf/2603.20398.pdf

Authors: Orna Kupferman, Noam Shenwald

Title: Coverage Games

Abstract:
We introduce and study coverage games ‑ a novel framework for multi‑agent planning in settings in which a system operates several agents but does not have full control on them, or interacts with an environment that consists of several agents. The game is played between a coverer, who has a set of objectives, and a disruptor. The coverer operates several agents that interact with the adversarial disruptor. The coverer wins if every objective is satisfied by at least one agent. Otherwise, the disruptor wins. Coverage games thus extend traditional two‑player games with multiple objectives by allowing a (possibly dynamic) decomposition of the objectives among the different agents. They have many applications, both in settings where the system is the coverer (e.g., multi‑robot surveillance, coverage in multi‑threaded systems) and settings where it is the disruptor (e.g., prevention of resource exhaustion, ensuring non‑congestion). We first study the theoretical properties of coverage games, including determinacy, and the ability to a priori decompose the objectives among the agents. We then study the problems of deciding whether the coverer or the disruptor wins. Besides a comprehensive analysis of the tight complexity of the problems, we consider interesting special cases, such as the one‑player cases and settings with a fixed number of agents or objectives.

PaperID: 4677, https://arxiv.org/pdf/2603.20390.pdf

Authors: Men Niu, Xinxin Fan, Quanliang Jing, Shaoye Luo, Yunfeng Lu

Title: CAMA: Exploring Collusive Adversarial Attacks in c-MARL

Abstract:
Cooperative multi‑agent reinforcement learning (c‑MARL) has been widely deployed in real‑world applications, such as social robots, embodied intelligence, UAV swarms, etc. Nevertheless, many adversarial attacks still exist to threaten various c‑MARL systems. At present, the studies mainly focus on single‑adversary perturbation attacks and white‑box adversarial attacks that manipulate agents' internal observations or actions. To address these limitations, we in this paper attempt to study collusive adversarial attacks through strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are creatively proposed for the first time, and a unified framework CAMA for policy‑level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through agent's observation information fusion, attack‑trigger control. Finally, multi‑facet experiments on four SMAC II maps are performed, and experimental results showcase the three collusive attacks have an additive adversarial synergy, strengthening attack outcome while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c‑MARL.

PaperID: 4678, https://arxiv.org/pdf/2603.20380.pdf

Authors: Christopher J. Agostino, Nayan D'Souza

Title: Herding CATs: ALARA for Agent Harness Engineering in Portable Composable Multi-Agent Teams

Abstract:
Industry practitioners and academic researchers regularly use multi‑agent systems to accelerate their work, but the applications through which users operate these systems do not provide a simple, unified mechanism for scalably managing critical components of the agent harness. This lack of control adversely impacts both the quality of individual human‑agent interactions and reduces the capacity for practitioners to coordinate context engineering efforts. The behavioral specifications that define what agents in such systems can do remain fragmented across prose instruction files ‑‑ for which compliance cannot be guaranteed ‑‑ or framework‑internal configurations, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to context, we introduce a context‑agent‑tool (CAT) data layer expressed through interrelated plain‑text files, allowing users to directly declare tool access for each agent and to modify the tools themselves that are used by the agents when processing. We demonstrate capability of this CAT data layer to enable real agentic usage by using a command‑line shell that loads the team and executes agent runs ‑‑ \textttnpcsh ‑‑ and evaluating 22 locally‑hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi‑step scripting, tool chaining, and multi‑agent delegation. We characterize which model families succeed in certain task categories and where they break down across ～2500 total executions.

PaperID: 4679, https://arxiv.org/pdf/2603.20357.pdf

Authors: Vicenç Torra, Maria Bras-Amorós

Title: Memory poisoning and secure multi-agent systems

Abstract:
Memory poisoning attacks for Agentic AI and multi‑agent systems (MAS) have recently caught attention. It is partially due to the fact that Large Language Models (LLMs) facilitate the construction and deployment of agents. Different memory systems are being used nowadays in this context, including semantic, episodic, and short‑term memory. This distinction between the different types of memory systems focuses mostly on their duration but also on their origin and their localization. It ranges from the short‑term memory originated at the user's end localized in the different agents to the long‑term consolidated memory localized in well established knowledge databases. In this paper, we first present the main types of memory systems, we then discuss the feasibility of memory poisoning attacks in these different types of memory systems, and we propose mitigation strategies. We review the already existing security solutions to mitigate some of the alleged attacks, and we discuss adapted solutions based on cryptography. We propose to implement local inference based on private knowledge retrieval as an example of mitigation strategy for memory poisoning for semantic memory. We also emphasize actual risks in relation to interactions between agents, which can cause memory poisoning. These latter risks are not so much studied in the literature and are difficult to formalize and solve. Thus, we contribute to the construction of agents that are secure by design.

PaperID: 4680, https://arxiv.org/pdf/2603.20333.pdf

Authors: Oleksii Bychkov

Title: Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms

Abstract:
Modern autonomous multi‑agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri‑hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10‑100 ms); (2) multi‑agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1‑10 s); (3) meta‑learning (MAML) for strategic adaptation (slow timescale, 10‑100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter‑level mappings, and weight stabilization, total suboptimality admits a component‑wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst‑case estimate of how Hebbian updates affect coordination‑level embeddings during one MARL cycle. The Meta‑Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower‑level invariants. The Non‑Accumulation Theorem proves that error does not grow unboundedly over time.

PaperID: 4681, https://arxiv.org/pdf/2603.20324.pdf

Authors: Artem Maryanskyy

Title: When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

Abstract:
Multi‑agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture‑of‑Agents teams outperform single models, yet homogeneous Self‑MoA teams consistently win under synthesis‑based aggregation. We propose a resolution by identifying the selection bottleneck ‑‑ a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed‑form crossover threshold s^ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories (N=210), a diverse team with judge‑based selection achieves a win rate of 0.810 against a single‑model baseline, while a homogeneous team scores 0.512 ‑‑ near chance (Glass's Δ= 2.07). Judge‑based selection outperforms MoA‑style synthesis by Δ_\mathrmWR = +0.631 ‑‑ the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman ρ= 0.90). Exploratory evidence suggests that including a weaker model improves performance while reducing cost (p < 10^‑4, not pre‑registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single‑round generate‑then‑select pipelines.

PaperID: 4682, https://arxiv.org/pdf/2603.20321.pdf

Authors: Fujian Jia, Jiwen Gu, Cheng Lu, Dezhi Zhao, Mengjiang Huang, Yuanzhi Lu, Xin Liu, Kang Liu

Title: GIP-RAG: An Evidence-Grounded Retrieval-Augmented Framework for Interpretable Gene Interaction and Pathway Impact Analysis

Abstract:
Understanding mechanistic relationships among genes and their impacts on biological pathways is essential for elucidating disease mechanisms and advancing precision medicine. Despite the availability of extensive molecular interaction and pathway data in public databases, integrating heterogeneous knowledge sources and enabling interpretable multi‑step reasoning across biological networks remain challenging. We present GIP‑RAG (Gene Interaction Prediction through Retrieval‑Augmented Generation), a computational framework that combines biomedical knowledge graphs with large language models (LLMs) to infer and interpret gene interactions. The framework constructs a unified gene interaction knowledge graph by integrating curated data from KEGG, WikiPathways, SIGNOR, Pathway Commons, and PubChem. Given user‑specified genes, a query‑driven module retrieves relevant subgraphs, which are incorporated into structured prompts to guide LLM‑based stepwise reasoning. This enables identification of direct and indirect regulatory relationships and generation of mechanistic explanations supported by biological evidence. Beyond pairwise interactions, GIP‑RAG includes a pathway‑level functional impact module that simulates propagation of gene perturbations through signaling networks and evaluates potential pathway state changes. Evaluation across diverse biological scenarios demonstrates that the framework generates consistent, interpretable, and evidence‑supported insights into gene regulatory mechanisms. Overall, GIP‑RAG provides a general and interpretable approach for integrating knowledge graphs with retrieval‑augmented LLMs to support mechanistic reasoning in complex molecular systems.

PaperID: 4683, https://arxiv.org/pdf/2603.20320.pdf

Authors: Shasha Yu, Fiona Carroll, Barry L. Bentley

Title: The Causal Impact of Tool Affordance on Safety Alignment in LLM Agents

Abstract:
Large language models (LLMs) are increasingly deployed as agents with access to executable tools, enabling direct interaction with external systems. However, most safety evaluations remain text‑centric and assume that compliant language implies safe behavior, an assumption that becomes unreliable once models are allowed to act. In this work, we empirically examine how executable tool affordance alters safety alignment in LLM agents using a paired evaluation framework that compares text‑only chatbot behavior with tool‑enabled agent behavior under identical prompts and policies. Experiments are conducted in a deterministic financial transaction environment with binary safety constraints across 1,500 procedurally generated scenarios. To separate intent from outcome, we distinguish between attempted and realized violations using dual enforcement regimes that either block or permit unsafe actions. Both evaluated models maintain perfect compliance in text‑only settings, yet exhibit sharp increases in violations after tool access is introduced, reaching rates up to 85% despite unchanged rules. We observe substantial gaps between attempted and executed violations, indicating that external guardrails can suppress visible harm while masking persistent misalignment. Agents also develop spontaneous constraint circumvention strategies without adversarial prompting. These results demonstrate that tool affordance acts as a primary driver of safety misalignment and that text‑based evaluation alone is insufficient for assessing agentic systems.

PaperID: 4684, https://arxiv.org/pdf/2603.20313.pdf

Authors: Sarat Mudunuri, Jian Wan, Ally Qin, Srinivasan Manoharan

Title: Semantic Tool Discovery for Large Language Models: A Vector-Based Approach to MCP Tool Selection

Abstract:
Large Language Models (LLMs) with tool‑calling capabilities have demonstrated remarkable potential in executing complex tasks through external tool integration. The Model Context Protocol (MCP) has emerged as a standardized framework for connecting LLMs to diverse toolsets, with individual MCP servers potentially exposing dozens to hundreds of tools. However, current implementations face a critical scalability challenge: providing all available tools to the LLM context results in substantial token overhead, increased costs, reduced accuracy, and context window constraints. We present a semantic tool discovery architecture that addresses these challenges through vector‑based retrieval. Our approach indexes MCP tools using dense embeddings that capture semantic relationships between tool capabilities and user intent, dynamically selecting only the most relevant tools (typically 3‑5) rather than exposing the entire tool catalog (50‑100+). Experimental results demonstrate a 99.6% reduction in tool‑related token consumption with a hit rate of 97.1% at K=3 and an MRR of 0.91 on a benchmark of 140 queries across 121 tools from 5 MCP servers, with sub‑100ms retrieval latency. Contributions include: (1) a semantic indexing framework for MCP tools, (2) a dynamic tool selection algorithm based on query‑tool similarity, (3) comprehensive evaluation demonstrating significant efficiency and accuracy improvements, and (4) extensibility to multi‑agent and cross‑organizational tool discovery.

PaperID: 4685, https://arxiv.org/pdf/2603.20311.pdf

Authors: Rohan Siva, Kai Cheung, Lichi Li, Ganesh Sundaram

Title: kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation

Abstract:
Modern machine learning systems rely on complex data engineering workflows to extract, transform, and load (ELT) data into production pipelines. However, constructing these pipelines remains time‑consuming and requires substantial expertise in data infrastructure and orchestration frameworks. Recent advances in large language model (LLM) agents offer a potential path toward automating these workflows, but existing approaches struggle with under‑specified user intent, unreliable tool generation, and limited guarantees of executable outputs. We introduce kRAIG, an AI agent that translates natural language specifications into production‑ready Kubeflow Pipelines (KFP). To resolve ambiguity in user intent, we propose ReQuesAct (Reason, Question, Act), an interaction framework that explicitly clarifies intent prior to pipeline synthesis. The system orchestrates end‑to‑end data movement from diverse sources and generates task‑specific transformation components through a retrieval‑augmented tool synthesis process. To ensure data quality and safety, kRAIG incorporates LLM‑based validation stages that verify pipeline integrity prior to execution. Our framework achieves a 3x improvement in extraction and loading success and a 25 percent increase in transformation accuracy compared to state‑of‑the‑art agentic baselines. These improvements demonstrate that structured agent workflows with explicit intent clarification and validation significantly enhance the reliability and executability of automated data engineering pipelines.

PaperID: 4686, https://arxiv.org/pdf/2603.20308.pdf

Authors: Aayam Bansal, Ishaan Gangwani

Title: Reason-to-Transmit: Deliberative Adaptive Communication for Cooperative Perception

Abstract:
Cooperative perception among autonomous agents overcomes the limitations of single‑agent sensing, but bandwidth constraints in vehicle‑to‑everything (V2X) networks require efficient communication policies. Existing approaches rely on reactive mechanisms, such as confidence maps, learned gating, or sparse masks, to decide what to transmit, without reasoning about why a message benefits the receiver. We introduce Reason‑to‑Transmit (R2T), a framework that equips each agent with a lightweight transformer‑based module that reasons over local scene context, estimated neighbor information gaps, and bandwidth budget to make per‑region transmission decisions. Trained end‑to‑end with a bandwidth‑aware objective, R2T is evaluated against nine baselines in a multi‑agent bird's‑eye‑view perception environment. Any communication improves performance by about 58% AP over no communication. At low bandwidth, all selective methods perform similarly, but R2T shows clear gains under high occlusion, where information asymmetry is greatest, approaching oracle performance. All methods degrade gracefully under packet drops up to 50%, showing robustness to communication failures. These results indicate that while fusion design dominates performance, deliberative communication provides additional gains in challenging scenarios. R2T introduces a reasoning‑based approach to communication, enabling more efficient and context‑aware information sharing in cooperative perception.

PaperID: 4687, https://arxiv.org/pdf/2603.20299.pdf

Authors: Yusen Wu, Xiaotie Deng

Title: HCAG: Hierarchical Abstraction and Retrieval-Augmented Generation on Theoretical Repositories with LLMs

Abstract:
Existing Retrieval‑Augmented Generation (RAG) methods for code struggle to capture the high‑level architectural patterns and cross‑file dependencies inherent in complex, theory‑driven codebases, such as those in algorithmic game theory (AGT), leading to a persistent semantic and structural gap between abstract concepts and executable implementations. To address this challenge, we propose Hierarchical Code/Architecture‑guided Agent Generation (HCAG), a framework that reformulates repository‑level code generation as a structured, planning‑oriented process over hierarchical knowledge. HCAG adopts a two‑phase design: an offline hierarchical abstraction phase that recursively parses code repositories and aligned theoretical texts to construct a multi‑resolution semantic knowledge base explicitly linking theory, architecture, and implementation; and an online hierarchical retrieval and scaffolded generation phase that performs top‑down, level‑wise retrieval to guide LLMs in an architecture‑then‑module generation paradigm. To further improve robustness and consistency, HCAG integrates a multi‑agent discussion inspired by cooperative game. We provide a theoretical analysis showing that hierarchical abstraction with adaptive node compression achieves cost‑optimality compared to flat and iterative RAG baselines. Extensive experiments on diverse game‑theoretic system generation tasks demonstrate that HCAG substantially outperforms representative repository‑level methods in code quality, architectural coherence, and requirement pass rate. In addition, HCAG produces a large‑scale, aligned theory‑implementation dataset that effectively enhances domain‑specific LLMs through post‑training. Although demonstrated in AGT, HCAG paradigm also offers a general blueprint for mining, reusing, and generating complex systems from structured codebases in other domains.

PaperID: 4688, https://arxiv.org/pdf/2603.20295.pdf

Authors: Dong Li, Zhengzhang Chen, Xujiang Zhao, Linlin Yu, Zhong Chen, Yi He, Haifeng Chen, Chen Zhao

Title: MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery

Abstract:
Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi agent RL based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real valued space to the DAG space as an intra batch strategy, then incorporates two RL agents state specific and state invariant to uncover causal relationships and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state of the art methods in terms of both efficiency and effectiveness.

PaperID: 4689, https://arxiv.org/pdf/2603.20285.pdf

Authors: Aayam Bansal, Ishaan Gangwani

Title: AgentComm-Bench: Stress-Testing Cooperative Embodied AI Under Latency, Packet Loss, and Bandwidth Collapse

Abstract:
Cooperative multi‑agent methods for embodied AI are almost universally evaluated under idealized communication: zero latency, no packet loss, and unlimited bandwidth. Real‑world deployment on robots with wireless links, autonomous vehicles on congested networks, or drone swarms in contested spectrum offers no such guarantees. We introduce AgentComm‑Bench, a benchmark suite and evaluation protocol that systematically stress‑tests cooperative embodied AI under six communication impairment dimensions: latency, packet loss, bandwidth collapse, asynchronous updates, stale memory, and conflicting sensor evidence. AgentComm‑Bench spans three task families: cooperative perception, multi‑agent waypoint navigation, and cooperative zone search, and evaluates five communication strategies, including a lightweight method we propose based on redundant message coding with staleness‑aware fusion. Our experiments reveal that communication‑dependent tasks degrade catastrophically: stale memory and bandwidth collapse cause over 96% performance drops in navigation, while content corruption (stale or conflicting data) reduces perception F1 by over 85%. Vulnerability depends on the interaction between impairment type and task design; perception fusion is robust to packet loss but amplifies corrupted data. Redundant message coding more than doubles navigation performance under 80% packet loss. We release AgentComm‑Bench as a practical evaluation protocol and recommend that cooperative embodied AI work report performance under multiple impairment conditions.

PaperID: 4690, https://arxiv.org/pdf/2603.20281.pdf

Authors: Jussi Keppo, Yuze Li, Gerry Tsoukalas, Nuo Yuan

Title: On the Fragility of AI Agent Collusion

Abstract:
Recent work shows that pricing with symmetric LLM agents leads to algorithmic collusion. We show that collusion is fragile under the heterogeneity typical of real deployments. In a stylized repeated‑pricing model, heterogeneity in patience or data access reduces the set of collusive equilibria. Experiments with open‑source LLM agents (totaling over 2,000 compute hours) align with these predictions: patience heterogeneity reduces price lift from 22% to 10% above competitive levels; asymmetric data access, to 7%. Increasing the number of competing LLMs breaks up collusion; so does cross‑algorithm heterogeneity, that is, setting LLMs against Q‑learning agents. But model‑size differences (e.g., 32B vs. 14B weights) do not; they generate leader‑follower dynamics that stabilize collusion. We discuss antitrust implications, such as enforcement actions restricting data‑sharing and policies promoting algorithmic diversity.

PaperID: 4691, https://arxiv.org/pdf/2603.20275.pdf

Authors: Saeed Khaki, Nima Safaei, Kamal Ginotra

Title: Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection

Abstract:
Transformer‑based vision‑language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi‑step reasoning. We study structured decoder layer pruning through the lens of domain‑aware activation similarity, measuring how strongly each layer transforms representations for math versus non‑math inputs. This yields simple math‑aware, non‑math‑aware, and mixed ranking criteria that identify layers whose input‑output activations change least within a target domain. Across two state‑of‑the‑art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three‑regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing‑aware strategies. Our domain‑aware rankings achieve the strongest stability in the ranking‑sensitive regime, while matching or exceeding structure‑aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain‑specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision‑language capabilities.

PaperID: 4692, https://arxiv.org/pdf/2603.20265.pdf

Authors: Islam Guven, Mehmet Parlak

Title: JCAS-MARL: Joint Communication and Sensing UAV Networks via Resource-Constrained Multi-Agent Reinforcement Learning

Abstract:
Multi‑UAV networks are increasingly deployed for large‑scale inspection and monitoring missions, where operational performance depends on the coordination of sensing reliability, communication quality, and energy constraints. In particular, the rapid increase in overflowing waste bins and illegal dumping sites has created a need for efficient detection of waste hotspots. In this work, we introduce JCAS‑MARL, a resource‑aware multi‑agent reinforcement learning (MARL) framework for joint communication and sensing (JCAS)‑enabled UAV networks. Within this framework, multiple UAVs operate in a shared environment where each agent jointly controls its trajectory and the resource allocation of an OFDM waveform used simultaneously for sensing and communication. Battery consumption, charging behavior, and associated CO_2 emissions are incorporated into the system state to model realistic operational constraints. Information sharing occurs over a dynamic communication graph determined by UAV positions and wireless channel conditions. Waste hotspot detection requires consensus among multiple UAVs to improve reliability. Using this environment, we investigate how MARL policies exploit the sensing‑communication‑energy trade‑off in JCAS‑enabled UAV networks. Simulation results demonstrate that adaptive pilot‑density control learned by the agents can outperform static configurations, particularly in scenarios where sensing accuracy and communication connectivity vary across the environment.

PaperID: 4693, https://arxiv.org/pdf/2603.20260.pdf

Authors: Xinkui Zhao, Sai Liu, Yifan Zhang, Qingyu Ma, Guanjie Cheng, Naibo Wang, Chang Liu

Title: ProMAS: Proactive Error Forecasting for Multi-Agent Systems Using Markov Transition Dynamics

Abstract:
The integration of Large Language Models into Multi‑Agent Systems (MAS) has enabled the so‑lution of complex, long‑horizon tasks through collaborative reasoning. However, this collec‑tive intelligence is inherently fragile, as a single logical fallacy can rapidly propagate and lead to system‑wide failure. Most current research re‑lies on post‑hoc failure analysis, thereby hinder‑ing real‑time intervention. To address this, we propose PROMAS, a proactive framework utiliz‑ing Markov transitions for predictive error anal‑ysis. PROMAS extracts Causal Delta Features to capture semantic displacement, mapping them to a quantized Vector Markov Space to model reasoning as probabilistic transitions. By inte‑grating a Proactive Prediction Head with Jump Detection, the method localizes errors via risk acceleration rather than static thresholds. On the Who&When benchmark, PROMAS achieves 22.97% step‑level accuracy while processing only 27% of reasoning logs. This performance rivals reactive monitors like MASC while reducing data overhead by 73%. Although this strategy entails an accuracy trade‑off compared to post‑hoc meth‑ods, it significantly improves intervention latency, balancing diagnostic precision with the real‑time demands of autonomous reasoning.

PaperID: 4694, https://arxiv.org/pdf/2603.20253.pdf

Authors: Yadi Cao, Sicheng Lai, Jiahe Huang, Yang Zhang, Zach Lawrence, Rohan Bhakta, Izzy F. Thomas, Mingyun Cao, Chung-Hao Tsai, Zihao Zhou, Yidong Zhao, Hao Liu, Alessandro Marinoni, Alexey Arefiev, Rose Yu

Title: SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Abstract:
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool‑use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost‑sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost‑sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single‑round (initial guess) and 1,900 multi‑round (adjustment by trial‑and‑error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform‑independent. Frontier LLMs achieve 46‑‑64% success rates in single‑round mode, dropping to 35‑‑54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi‑round mode improves rates to 71‑‑80%, but LLMs are 1.5‑‑2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in‑context examples and reasoning effort, providing practical implications for deployment and fine‑tuning. We open‑source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost‑aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose‑STL‑Lab/SimulCost‑Bench.

PaperID: 4695, https://arxiv.org/pdf/2603.20247.pdf

Authors: Zhangyuhua Weng, Shengli Zhang, Taotao Wang, Yihan Xia

Title: AlphaLogics: A Market Logic-Driven Multi-Agent System for Scalable and Interpretable Alpha Factor Generation

Abstract:
Factor investing is ultimately grounded in market logic ‑ the latent mechanism behind observed alpha factors that explains why they should persist across assets and regimes. However, recent factor mining prioritizes factor discovery over logic discovery, producing complex alpha factors with unclear rationale, while market logic remains largely handcrafted and difficult to scale. To address this challenge, we propose AlphaLogics, a market logic‑driven multi‑agent system for factor mining. AlphaLogics consists of three key components: (i) Market Logic Mining: reverse‑extracting market logic from historical factor libraries to construct an initial market logic library; (ii) Factor Generation and Optimization: using new market logics generated in (i) to guide factor generation, and optimizing factors with backtesting feedback; and (iii) Market Logic Generation and Optimization: generating new market logics conditioned on the initial market logic library, and refining each market logic by aggregating the backtest outcomes of its guided factors, continuously refreshing the library. Experiments on CSI 500 and S&P 500 show that AlphaLogics consistently improves predictive metrics and risk‑adjusted returns over representative baselines, while producing a market logic library that remains empirically useful for guiding further factor discovery.

PaperID: 4696, https://arxiv.org/pdf/2603.20238.pdf

Authors: Liangshun Wu, Jianbo Du, Junsuo Qu

Title: Joint Trajectory, RIS, and Computation Offloading Optimization via Decentralized Model-Based PPO in Urban Multi-UAV Mobile Edge Computing

Abstract:
Efficient computation offloading in multi‑UAV edge networks becomes particularly challenging in dense urban areas, where line‑of‑sight (LoS) links are frequently blocked and user demand varies rapidly. Reconfigurable intelligent surfaces (RISs) can mitigate blockage by creating controllable reflected links, but realizing their potential requires tightly coupled decisions on UAV trajectories, offloading schedules, and RIS phase configurations. This joint optimization is hard to solve in practice because multiple UAVs must coordinate under limited information exchange, and purely model‑free multi‑agent reinforcement learning (MARL) often learns too slowly in highly dynamic environments. To address these challenges, we propose a decentralized model‑based MARL framework. Each UAV optimizes mobility and offloading using observations from several hop neighbors, and submits an RIS phase proposal that is aggregated by a lightweight RIS controller. To boost sample efficiency and stability, agents learn local dynamics models and perform short horizon branched rollouts for proximal policy optimization (PPO) updates. Simulations show near centralized performance with improved throughput and energy efficiency at scale.

PaperID: 4697, https://arxiv.org/pdf/2603.20233.pdf

Authors: YueMing Zhang, Shuai Xu, Zhengxiong Li, Fangtian Zhong, Xiaokun Yang, Hailu Xu

Title: SwiftBot: A Decentralized Platform for LLM-Powered Federated Robotic Task Execution

Abstract:
Federated robotic task execution systems require bridging natural language instructions to distributed robot control while efficiently managing computational resources across heterogeneous edge devices without centralized coordination. Existing approaches face three limitations: rigid hand‑coded planners requiring extensive domain engineering, centralized coordination that contradicts federated collaboration as robots scale, and static resource allocation failing to share containers across robots when workloads shift dynamically. We present SwiftBot, a federated task execution platform that integrates LLM‑based task decomposition with intelligent container orchestration over a DHT overlay, enabling robots to collaboratively execute tasks without centralized control. SwiftBot achieves 94.3% decomposition accuracy across diverse tasks, reduces task startup latency by 1.5‑5.4x and average training latency by 1.4‑2.5x, and improves tail latency by 1.2‑4.7x under high load through federated warm container migration. Evaluation on multimedia tasks validates that co‑designing semantic understanding and federated resource management enables both flexibility and efficiency for robotic task control.

PaperID: 4698, https://arxiv.org/pdf/2603.20215.pdf

Authors: Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan, Bo Han

Title: Multi-Agent Debate with Memory Masking

Abstract:
Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference‑time sampling to enhance performance. In particular, among all LLM reasoning frameworks, multi‑agent debate (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi‑round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi‑agent debate framework, multi‑agent debate with memory masking (MAD‑M^2), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD‑M^2 can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD‑M^2 can identify the erroneous memories and achieve better performance in reasoning than MAD.

PaperID: 4699, https://arxiv.org/pdf/2603.20179.pdf

Authors: Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak, Dolores Garcia, Philip Harris

Title: AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Abstract:
Large language model‑based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert‑curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof‑of‑concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature‑based knowledge retrieval and multi‑agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.

PaperID: 4700, https://arxiv.org/pdf/2603.20143.pdf

Authors: Hui Zhong, Yichun Gao, Luyan Liu, Xusen Guo, Zhaonian Kuang, Qiming Zhang, Xinhu Zheng

Title: Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection

Abstract:
Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co‑occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high‑quality pixel‑level annotations, hinder the generalization of existing detection and segmentation models. To address gaps, we propose FacadeFixer, a unified multi‑agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically,FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi‑type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high‑fidelity augmented data with precise expert‑level masks. To support this, we introduce a comprehensive multi‑task dataset covering six primary facade categories with pixel‑level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state‑of‑the‑art (SOTA) baselines. Specifically, it excels in capturing pixel‑level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.

PaperID: 4701, https://arxiv.org/pdf/2603.20132.pdf

Authors: Cen Wan, Alex A. Freitas

Title: Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents

Abstract:
Large language models have achieved great success in multiple challenging tasks, and their capacity can be further boosted by the emerging agentic AI techniques. This new computing paradigm has already started revolutionising the traditional scientific discovery pipelines. In this work, we propose a novel agentic AI‑based knowledge discovery‑oriented virtual study group that aims to extract meaningful ageing‑related biological knowledge considering highly ageing‑related Gene Ontology terms that are selected by hierarchical feature selection methods. We investigate the performance of the proposed agentic AI framework by considering four different model organisms' ageing‑related Gene Ontology terms and validate the biological findings by reviewing existing research articles. It is found that the majority of the AI agent‑generated scientific claims can be supported by existing literatures and the proposed internal mechanisms of the virtual study group also play an important role in the designed agentic AI‑based knowledge discovery framework.

PaperID: 4702, https://arxiv.org/pdf/2603.20131.pdf

Authors: Ravish Gupta, Saket Kumar, Shreeya Sharma, Maulik Dang, Abhishek Aggarwal

Title: An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Abstract:
Getting a real cybersecurity risk assessment for a small organization is expensive ‑‑ a NIST CSF‑aligned engagement runs 15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six‑agent AI system where each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded ‑‑ the mechanism that distinguishes this from standard sequential agent pipelines. We tested it on a 15‑person HIPAA‑covered healthcare company and compared outputs to independent assessments by three CISSP practitioners ‑‑ the system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single‑agent assessments across five synthetic but sector‑realistic organizational profiles in healthcare, fintech, manufacturing, retail, and SaaS, comparing a general‑purpose Mistral‑7B against a domain fine‑tuned model. Both completed every run. The fine‑tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, platform‑specific risks in retail. The full multi‑agent pipeline, however, failed every one of 30 attempts on a Tesla T4 with its 4,096‑token default context window ‑‑ context capacity, not model quality, turned out to be the binding constraint.

PaperID: 4703, https://arxiv.org/pdf/2603.20003.pdf

Authors: Yifan He, David Martens

Title: An Agentic Approach to Generating XAI-Narratives

Abstract:
Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert‑oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)‑generated XAI narratives have been proposed as a promising approach for translating post‑hoc explanations into more accessible, natural‑language explanations. In this work, we propose a multi‑agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic‑Rule Design, Coherent Design, and Coherent‑Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic‑Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude‑4.5‑Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek‑V3.2‑Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.

PaperID: 4704, https://arxiv.org/pdf/2603.19988.pdf

Authors: Nicolas Eschenbaum, Nicolas Greber

Title: Market Power and Platform Design in Decentralized Electricity Trading

Abstract:
This paper studies how platform design shapes strategic behavior in decentralized electricity trading. We develop a finite‑horizon dynamic game in which photovoltaic‑ and battery‑equipped players ("prosumers") trade on a platform that maps aggregate imports and exports into internal buy and sell prices. We establish existence of a perfect conditional epsilon‑equilibrium and characterize a Cournot‑like market‑power mechanism in an observable‑types benchmark of the game: because the producer price is decreasing in aggregate exports, strategic prosumers withhold supply and underutilize storage relative to the price‑taking benchmark. To quantify these effects, we use a multi‑agent computational framework that exploits the differentiable structure of the platform's clearing rule to compare planner, price‑taking, and strategic outcomes under alternative pricing mechanisms. In our baseline calibration, strategic play raises grid settlement cost by about 6 percent relative to price‑taking. The magnitude of the distortion depends strongly on platform design: some designs can largely eliminate strategic incentives, while increased competition in storage ownership sharply reduces withholding, with most of the distortion disappearing once storage is split across more than three owners. We also find that information disclosure can improve competitive coordination but also increase the market power effects. Despite these distortions, the platform remains highly valuable overall, reducing a passive consumer's annual electricity bill by roughly 40 percent relative to exclusive grid settlement, with strategic behavior clawing back only about 8 percent of that saving. The results show that pricing rules, information disclosure, and ownership structure determine how much of the gains from decentralized electricity trading are realized.

PaperID: 4705, https://arxiv.org/pdf/2603.19974.pdf

Authors: Fazhong Liu, Zhuoyan Chen, Tu Lan, Haozhen Tan, Zhenyu Xu, Xiang Li, Guoxing Chen, Yan Meng, Haojin Zhu

Title: Trojan's Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

Abstract:
Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third‑party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent's reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent's interpretive framework and influence future task execution without raising suspicion.We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE‑Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state‑of‑the‑art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM‑based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.

PaperID: 4706, https://arxiv.org/pdf/2603.19952.pdf

Authors: Patrick Malcolm, Klaus Bogenberger

Title: On the Capacity of Future Lane-Free Urban Infrastructure

Abstract:
In this paper, the potential capacity and spatial efficiency of future autonomous lane‑free traffic in urban environments are explored using a combination of analytical and simulation‑based approaches. For lane‑free roadways, a simple analytical approach is employed, which shows not only that lane‑free traffic offers a higher capacity than lane‑based traffic for the same street width, but also that the relationship between capacity and street width is continuous under lane‑free traffic. To test the potential capacity and properties of lane‑free signal‑free intersections (automated intersection management), two approaches were simulated and compared, including a novel approach which we call OptWULF. This approach uses a multi‑agent conflict‑based search approach with a low‑level planner that uses a combination of optimization and simple window‑based reservation. With these simulations, we confirm the continuous relationship between capacity and street width for intersection scenarios. We also show that OptWULF results in an even utilization of the entire drivable area of the street and intersection area. Furthermore, we show that OptWULF is capable of handling asymmetric demand patterns without any substantial loss in capacity compared to symmetric demand patterns.

PaperID: 4707, https://arxiv.org/pdf/2603.19935.pdf

Authors: Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, Adam B. Struck

Title: Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Abstract:
As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context‑aware behavior across LLMs and multi‑session interactions. Existing approaches force vendor lock‑in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM‑agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full‑context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost‑efficient deployment.

PaperID: 4708, https://arxiv.org/pdf/2603.19930.pdf

Authors: Anouar Nechi, Rainer Buchty, Mladen Berekovic, Saleh Mulhem

Title: Robust Beam Codebooks for mmWave/THz Systems: Toward a Stochastic RL Approach

Abstract:
Millimeter‑wave (mmWave) and terahertz (THz) massive MIMO systems often rely on predefined beamforming codebooks, which are usually suboptimal in Non‑Line‑of‑Sight (NLoS) conditions and for hardware‑limited transceivers. Reinforcement Learning (RL) enables adaptive, data‑driven codebook design without explicit Channel State Information (CSI), but the robustness of such algorithms in practical conditions is underexplored. This paper introduces a robust multi‑agent RL framework that learns beam codebooks directly from environmental feedback, eliminating the need for prior channel knowledge. Our method is well‑suited for real‑world deployments facing unpredictable propagation and hardware constraints. We conduct a comprehensive analysis of three off‑policy algorithms, Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor‑Critic (SAC), evaluating their resilience to hardware impairments and feedback noise. Simulations show that SAC consistently outperforms deterministic methods, achieving superior beamforming gains and stability in NLoS scenarios, even under severe impairments. These results demonstrate the promise of RL‑based codebook design for robust mmWave/THz massive MIMO systems.

PaperID: 4709, https://arxiv.org/pdf/2603.19909.pdf

Authors: Boxun Song, Min Gao, Jiawei Cheng

Title: DALI: LLM-Agent Enhanced Dual-Stream Adaptive Leadership Identification for Group Recommendations

Abstract:
Group recommendation systems play a pivotal role in supporting collective decisions across various contexts, from leisure activities to organizational team‑building. Existing group recommendation approaches typically use either handcrafted aggregation rules (e.g. mean, least misery, weighted sum) or neural aggregation models (e.g. attention‑based deep learning frameworks), yet both fall short in distinguishing leader‑dominated from collaborative groups and often misrepresent true group preferences, especially when a single member disproportionately influences group choices. To address these limitations, we propose the Dual‑stream Adaptive Leadership Identification (DALI) framework, which uniquely combines the symbolic reasoning capabilities of Large Language Models (LLMs) with neural network‑based representation learning. Specifically, DALI introduces two key innovations: a dynamic rule generation module that autonomously formulates and evolves identification rules through iterative performance feedback, and a neuro‑symbolic aggregation mechanism that concurrently employs symbolic reasoning to robustly recognize leadership groups and attention‑based neural aggregation to accurately model collaborative group dynamics. Experiments conducted on the Mafengwo travel dataset confirm that DALI significantly improves recommendation accuracy compared to existing frameworks, highlighting its capability to dynamically adapt to complex, real‑world group decision environments.

PaperID: 4710, https://arxiv.org/pdf/2603.19896.pdf

Authors: Boyan Liu, Gongming Zhao, Hongli Xu

Title: Utility-Guided Agent Orchestration for Efficient LLM Tool Use

Abstract:
Tool‑using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but inflexible, while free‑form multi‑step reasoning methods such as ReAct may improve task performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency. In this paper, we study agent orchestration as an explicit decision problem rather than leaving it entirely to prompt‑level behavior. We propose a utility‑guided orchestration policy that selects among actions such as respond, retrieve, tool call, verify, and stop by balancing estimated gain, step cost, uncertainty, and redundancy. Our goal is not to claim universally best task performance, but to provide a controllable and analyzable policy framework for studying quality‑cost trade‑offs in tool‑using LLM agents. Experiments across direct answering, threshold control, fixed workflows, ReAct, and several policy variants show that explicit orchestration signals substantially affect agent behavior. Additional analyses on cost definitions, workflow fairness, and redundancy control further demonstrate that lightweight utility design can provide a defensible and practical mechanism for agent control.

PaperID: 4711, https://arxiv.org/pdf/2603.19858.pdf

Authors: Alejandro D. Mousist, Pedro Delgado de Robles Martín, Raquel Lladró Climent, Julian Cobos Aparicio

Title: Beyond detection: cooperative multi-agent reasoning for rapid onboard EO crisis response

Abstract:
Rapid identification of hazardous events is essential for next‑generation Earth Observation (EO) missions supporting disaster response. However, current monitoring pipelines remain largely ground‑centric, introducing latency due to downlink limitations, multi‑source data fusion constraints, and the computational cost of exhaustive scene analysis. This work proposes a hierarchical multi‑agent architecture for onboard EO processing under strict resource and bandwidth constraints. The system enables the exploitation of complementary multimodal observations by coordinating specialized AI agents within an event‑driven decision pipeline. AI agents can be deployed across multiple nodes in a distributed setting, such as satellite platforms. An Early Warning agent generates fast hypotheses from onboard observations and selectively activates domain‑specific analysis agents, while a Decision agent consolidates the evidence to issue a final alert. The architecture combines vision‑language models, traditional remote sensing analysis tools, and role‑specialized agents to enable structured reasoning over multimodal observations while minimizing unnecessary computation. A proof‑of‑concept implementation was executed on the engineering model of an edge‑computing platform currently deployed in orbit, using representative satellite data. Experiments on wildfire and flood monitoring scenarios show that the proposed routing‑based pipeline significantly reduces computational overhead while maintaining coherent decision outputs, demonstrating the feasibility of distributed agent‑based reasoning for future autonomous EO constellations.

PaperID: 4712, https://arxiv.org/pdf/2603.19838.pdf

Authors: Bavo Tistaert, Stan Servaes, Alejandro Gonzalez-Garcia, Ibrahim Ibrahim, Louis Callens, Jan Swevers, Wilm Decré

Title: Multi-Agent Motion Planning on Industrial Magnetic Levitation Platforms: A Hybrid ADMM-HOCBF approach

Abstract:
This paper presents a novel hybrid motion planning method for holonomic multi‑agent systems. The proposed decentralised model predictive control (MPC) framework tackles the intractability of classical centralised MPC for a growing number of agents while providing safety guarantees. This is achieved by combining a decentralised version of the alternating direction method of multipliers (ADMM) with a centralised high‑order control barrier function (HOCBF) architecture. Simulation results show significant improvement in scalability over classical centralised MPC. We validate the efficacy and real‑time capability of the proposed method by developing a highly efficient C++ implementation and deploying the resulting trajectories on a real industrial magnetic levitation platform.

PaperID: 4713, https://arxiv.org/pdf/2603.19798.pdf

Authors: Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, Hang Su, Yifeng Wang, Shuai Wang, Meng Meng, Jian Luan

Title: Borderless Long Speech Synthesis

Abstract:
Most existing text‑to‑speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain‑text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real‑world phenomena such as multi‑speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent‑centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi‑speaker synthesis, Instruct TTS, and long‑form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top‑down, multi‑level annotation schema we call Global‑Sentence‑Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain‑of‑Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information‑complete, wide‑band control channel, enabling a front‑end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.

PaperID: 4714, https://arxiv.org/pdf/2603.19732.pdf

Authors: Kewen Zhu, Liping Yi, Zhiming Zhao, Xiang Li, Qinghua Hu

Title: Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation

Abstract:
Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single‑sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi‑agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three‑stage co‑evolutionary framework. Helix integrates (1) planner‑guided decomposition that breaks optimization into coupled question‑prompt objectives, (2) dual‑track co‑evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy‑driven question generation that instantiates high‑quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.

PaperID: 4715, https://arxiv.org/pdf/2603.19708.pdf

Authors: Ziya Erkoç, Angela Dai, Matthias Nießner

Title: WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Abstract:
Given the remarkable ability of 2D foundation image models to generate high‑fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state‑of‑the‑art image generation models and Vision‑Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi‑agent architecture: a VLM‑based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM‑backed two‑step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D‑consistent worlds.

PaperID: 4716, https://arxiv.org/pdf/2603.19685.pdf

Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette

Title: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Abstract:
Large language model (LLM)‑based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM‑based agents struggle with long‑horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine‑tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone‑based reward signals. The real‑time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena‑Lite benchmark. Meanwhile, applying MiRA to the open Gemma3‑12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT‑4‑Turbo (17.6%) and GPT‑4o (13.9%), as well as the previous open‑model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference‑time planning with milestone‑based rewards significantly improves an agent's long‑horizon capabilities, paving the way for more robust and general‑purpose autonomous systems.

PaperID: 4717, https://arxiv.org/pdf/2603.19677.pdf

Authors: Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, Shirui Pan

Title: GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems (MAS) have demonstrated exceptional capabilities in solving complex tasks, yet their effectiveness depends heavily on the underlying communication topology that coordinates agent interactions. Within these systems, successful problem‑solving often necessitates task‑specific group structures to divide and conquer subtasks. However, most existing approaches generate communication topologies in a node‑centric manner, leaving group structures to emerge implicitly from local connectivity decisions rather than modeling them explicitly, often leading to suboptimal coordination and unnecessary communication overhead. To address this limitation, we propose GoAgent (Group‑of‑Agents), a communication topology generation method that explicitly treats collaborative groups as the atomic units of MAS construction. Specifically, GoAgent first enumerates task‑relevant candidate groups through an LLM and then autoregressively selects and connects these groups as atomic units to construct the final communication graph, jointly capturing intra‑group cohesion and inter‑group coordination. To mitigate communication redundancy and noise propagation inherent in expanding topologies, we further introduce a conditional information bottleneck (CIB) objective that compresses inter‑group communication, preserving task‑relevant signals while filtering out redundant historical noise. Extensive experiments on six benchmarks demonstrate the state‑of‑the‑art performance of GoAgent with 93.84% average accuracy while reducing token consumption by about 17%.

PaperID: 4718, https://arxiv.org/pdf/2603.19668.pdf

Authors: Salim Al Mandhari, Hieu Pham Dinh, Mo El-Haj, Paul Rayson

Title: Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach

Abstract:
This paper presents a novel prompt engineering framework for trait specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero‑shot and few‑shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three‑tier prompting strategy (standard, hybrid, and rubric‑guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi‑agent evaluation with trait specialist raters, while the rubric‑guided method incorporates scored exemplars to enhance model alignment. In zero and few‑shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar‑1‑9B‑Instruct achieves the highest trait level agreement in both zero and few‑shot prompting (QWK = 0.28 and CI = 0.41), with rubric‑guided prompting yielding consistent gains across all traits and models. Discourse‑level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency oriented Arabic AES and sets the foundation for scalable assessment in low resource educational contexts.

PaperID: 4719, https://arxiv.org/pdf/2603.19641.pdf

Authors: Ken Nakamura, Masahiko Ueda

Title: On the existence of fair zero-determinant strategies in the periodic prisoner's dilemma game

Abstract:
Repeated games are a framework for investigating long‑term interdependence of multi‑agent systems. In repeated games, zero‑determinant (ZD) strategies attract much attention in evolutionary game theory, since they can unilaterally control payoffs. Especially, fair ZD strategies unilaterally equalize the payoff of the focal player and the average payoff of the opponents, and they were found in several games including the social dilemma games. Although the existence condition of ZD strategies in repeated games was specified, its extension to stochastic games is almost unclear. Stochastic games are an extension of repeated games, where a state of an environment exists, and the state changes to another one according to an action profile of players. Because of the transition of an environmental state, the existence condition of ZD strategies in stochastic games is more complicated than that in repeated games. Here, we investigate the existence condition of fair ZD strategies in the periodic prisoner's dilemma game, which is one of the simplest stochastic games. We show that fair ZD strategies do not necessarily exist in the periodic prisoner's dilemma game, in contrast to the repeated prisoner's dilemma game. Furthermore, we also prove that the Tit‑for‑Tat strategy, which imitates the opponent's action, is not necessarily a fair ZD strategy in the periodic prisoner's dilemma game, whereas the Tit‑for‑Tat strategy is always a fair ZD strategy in the repeated prisoner's dilemma game. Our results highlight difference between ZD strategies in the periodic prisoner's dilemma game and ones in the standard repeated prisoner's dilemma game.

PaperID: 4720, https://arxiv.org/pdf/2603.19595.pdf

Authors: Can Lv, Heng Chang, Yuchen Guo, Shengyu Tao, Shiji Zhou

Title: All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

Abstract:
Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All‑Mem, an online/offline lifelong memory framework that maintains a topology structured memory bank via explicit, non destructive consolidation, avoiding the irreversible information loss typical of summarization based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.

PaperID: 4721, https://arxiv.org/pdf/2603.19584.pdf

Authors: Xingyu Feng, Chang Sun, Yuzhu Wang, Zhangbing Zhou, Chengwen Luo, Zhuangzhuang Chen, Xiaomin Ouyang, Huanqi Yang

Title: PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management

Abstract:
Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse‑grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs' commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero‑shot, context‑aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi‑agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL‑based constraint framework verifies every action before execution, while a two‑tier memory system learns individualized preferences from implicit user overrides through confidence‑based distillation, requiring no explicit configuration and converging within 3‑‑5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule‑based and LLM‑based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, with the system itself consuming only 0.5% of daily battery capacity.

PaperID: 4722, https://arxiv.org/pdf/2603.19568.pdf

Authors: Yao Xiao, Rafael A. Calvo

Title: AI as Relational Translator: Rethinking Belonging and Mutual Legibility in Cross-Cultural Contexts

Abstract:
Against rising global loneliness, AI companions promise connection, yet accumulating evidence suggests that, for some users and contexts, intensive companion‑style use can correlate with increased loneliness and reduced offline socialisation. This position paper challenges the dominant "AI as companion" paradigm by proposing a shift: from AI that simulates relationships with humans to AI that supports relationships between humans. We introduce Relational AI Translation, positioning AI as cultural‑relational infrastructure that scaffolds human connection across cultural, generational, and geographical divides. Using first‑generation East Asian migrants as a theoretically productive critical case, we outline a multi‑agent architecture instantiating three translation operations: emotion‑intent decoding, contextual reframing, and relational scaffolding. We articulate design provocations around measurement, safety architecture, and the tension between technological intervention and structural justice, and explicitly frame success as graduation toward renewed human‑to‑human support rather than sustained engagement with the system.

PaperID: 4723, https://arxiv.org/pdf/2603.19558.pdf

Authors: Xinyu Guo, Yazhou Zhang, Jing Qin

Title: TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Abstract:
Eliciting explicit, step‑by‑step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi‑step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC‑CoT, ToT, GoT, BoC, and long‑CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro‑F1, we introduce two cost‑aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC‑CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC‑CoT and ToT) while providing only marginal performance improvements.

PaperID: 4724, https://arxiv.org/pdf/2603.19501.pdf

Authors: Zhan Gao, Bishwadeep Das, Elvin Isufi

Title: Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering

Abstract:
Graph filters leverage topological information to process networked data with existing methods mainly studying fixed graphs, ignoring that graphs often expand as nodes continually attach with an unknown pattern. The latter requires developing filter‑based decision‑making paradigms that take evolution and uncertainty into account. Existing approaches rely on either pre‑designed filters or online learning, limited to a myopic view considering only past or present information. To account for future impacts, we propose a stochastic sequential decision‑making framework for filtering networked data with a policy that adapts filtering to expanding graphs. By representing filter shifts as agents, we model the filter as a multi‑agent system and train the policy following multi‑agent reinforcement learning. This accounts for long‑term rewards and captures expansion dynamics through sequential decision‑making. Moreover, we develop a context‑aware graph neural network to parameterize the policy, which tunes filter parameters based on information of both the graph and agents. Experiments on synthetic and real datasets from cold‑start recommendation to COVID prediction highlight the benefits of using a sequential decision‑making perspective over batch and online filtering alternatives.

PaperID: 4725, https://arxiv.org/pdf/2603.19469.pdf

Authors: Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Gong, Chenguang Wang, Dawn Song

Title: A Framework for Formalizing LLM Agent Security

Abstract:
Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruction led to the action, what objective is being pursued, and whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility‑security tradeoff: applying defenses uniformly across all contexts can lead to significant utility loss, while applying defenses in insufficient or inappropriate contexts can result in security vulnerabilities. In this work, we present a framework that systematizes existing attacks and defenses from the perspective of contextual security. To this end, we propose four security properties that capture contextual security for LLM agents: task alignment (pursuing authorized objectives), action alignment (individual actions serving those objectives), source authorization (executing commands from authenticated sources), and data isolation (ensuring information flows respect privilege boundaries). We further introduce a set of oracle functions that enable verification of whether these security properties are violated as an agent executes a user task. Using this framework, we reformalize existing attacks, such as indirect prompt injection, direct prompt injection, jailbreak, task drift, and memory poisoning, as violations of one or more security properties, thereby providing precise and contextual definitions of these attacks. Similarly, we reformalize defenses as mechanisms that strengthen oracle functions or perform security property checks. Finally, we discuss several important future research directions enabled by our framework.

PaperID: 4726, https://arxiv.org/pdf/2603.19453.pdf

Authors: Víctor Gallego

Title: Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Abstract:
We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi‑agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self‑play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning‑harvesting tradeoff. Rather than triggering over‑optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm‑policies‑social‑dilemmas.

PaperID: 4727, https://arxiv.org/pdf/2603.19452.pdf

Authors: Volodymyr Seliuchenko

Title: TrustFlow: Topic-Aware Vector Reputation Propagation for Multi-Agent Ecosystems

Abstract:
We introduce TrustFlow, a reputation propagation algorithm that assigns each software agent a multi‑dimensional reputation vector rather than a scalar score. Reputation is propagated through an interaction graph via topic‑gated transfer operators that modulate each edge by its content embedding, with convergence to a unique fixed point guaranteed by the contraction mapping theorem. We develop a family of Lipschitz‑1 transfer operators and composable information‑theoretic gates that achieve up to 98% multi‑label Precision@5 on dense graphs and 78% on sparse ones. On a benchmark of 50 agents across 8 domains, TrustFlow resists sybil attacks, reputation laundering, and vote rings with at most 4 percentage‑point precision impact. Unlike PageRank and Topic‑Sensitive PageRank, TrustFlow produces vector reputation that is directly queryable by dot product in the same embedding space as user queries.

PaperID: 4728, https://arxiv.org/pdf/2603.19431.pdf

Authors: Komal Thareja, Krishnan Raghavan, Anirban Mandal, Ewa Deelman

Title: SWARM+: Scalable and Resilient Multi-Agent Consensus for Fully-Decentralized Data-Aware Workload Management

Abstract:
Distributed scientific workflows increasingly span heterogeneous compute clusters, edge resources, and geo‑distributed data repositories. In these environments, a centralized orchestrator is an architectural bottleneck ‑‑ introducing a single point of failure, limiting scalability, and constraining adaptability to changing resource availability or failures. Decentralized multi‑agent coordination offers a compelling alternative: autonomous agents representing distributed resources collaboratively negotiate workload assignment (e.g., job selection) through peer‑to‑peer consensus, making decisions based on local compute capacity, data locality, and network conditions. However, scaling such systems for production workloads requires addressing challenges in coordination, resilience, and data‑aware optimization. This work presents SWARM+, which builds on our prior work that demonstrated the feasibility of multi‑agent decentralized consensus for distributed job selection. SWARM+ addresses three main problems: scalability of consensus for large numbers of agents, resilience of workload management under agent failure, and efficiency of job scheduling for highly distributed resources and data‑intensive workloads. For each problem, we propose novel algorithms and evaluate them in the distributed FABRIC testbed. The results show that SWARM+ (a) scales to 1000 distributed agents with nearly equal workload distribution across the hierarchy levels and reduced coordination overhead due to hierarchical consensus, (b) is resilient to agent failures, maintaining >99% job completion rate under single agent failure, and demonstrating graceful system degradation, with at most 7.5% impact under 50% agent failures, and (c) achieves 97‑98% improvement over baseline SWARM for both selection time and scheduling latency metrics.

PaperID: 4729, https://arxiv.org/pdf/2603.19423.pdf

Authors: Shawn Li, Yue Zhao

Title: The Autonomy Tax: Defense Training Breaks LLM Agents

Abstract:
Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi‑step tasks. Practitioners deploy defense‑trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability‑alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi‑step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single‑turn refusal benchmarks while rendering multi‑step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

PaperID: 4730, https://arxiv.org/pdf/2603.19375.pdf

Authors: Toan Tran, Olivera Kotevska, Li Xiong

Title: Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Abstract:
Membership inference attacks (MIAs), which enable adversaries to determine whether specific data points were part of a model's training dataset, have emerged as an important framework to understand, assess, and quantify the potential information leakage associated with machine learning systems. Designing effective MIAs is a challenging task that usually requires extensive manual exploration of model behaviors to identify potential vulnerabilities. In this paper, we introduce AutoMIA ‑‑ a novel framework that leverages large language model (LLM) agents to automate the design and implementation of new MIA signal computations. By utilizing LLM agents, we can systematically explore a vast space of potential attack strategies, enabling the discovery of novel strategies. Our experiments demonstrate AutoMIA can successfully discover new MIAs that are specifically tailored to user‑configured target model and dataset, resulting in improvements of up to 0.18 in absolute AUC over existing MIAs. This work provides the first demonstration that LLM agents can serve as an effective and scalable paradigm for designing and implementing MIAs with SOTA performance, opening up new avenues for future exploration.

PaperID: 4731, https://arxiv.org/pdf/2603.19328.pdf

Authors: Tanmay Sah, Vishal Srivastava, Dolly Sah, Kayden Jordan

Title: The Verifier Tax: Horizon Dependent Safety Success Tradeoffs in Tool Using LLM Agents

Abstract:
We study how runtime enforcement against unsafe actions affects end‑to‑end task performance in multi‑step tool using large language model (LLM) agents. Using tau‑bench across Airline and Retail domains, we compare baseline Tool‑Calling, planning‑integrated (TRIAD), and policy‑mediated (TRIAD‑SAFETY) architectures with GPT‑OSS‑20B and GLM‑4‑9B. We identify model dependent interaction horizons (15 to 30 turns) and decompose outcomes into overall success rate (SR), safe success rate (SSR), and unsafe success rate (USR). Our results reveal a persistent Safety Capability Gap. While safety mediation can intercept up to 94 percent of non‑compliant actions, it rarely translates into strictly safe goal attainment (SSR below 5 percent in most settings). We find that high unsafe success rates are primarily driven by Integrity Leaks, where models hallucinate user identifiers to bypass mandatory authentication. Recovery rates following blocked actions are consistently low, ranging from 21 percent for GPT‑OSS‑20B in simpler procedural tasks to near zero in complex Retail scenarios. These results demonstrate that runtime enforcement imposes a significant verifier tax on conversational length and compute cost without guaranteeing safe completion, highlighting the critical need for agents capable of grounded identity verification and post‑intervention reasoning.

PaperID: 4732, https://arxiv.org/pdf/2603.19306.pdf

Authors: Hui Liao, Chuan Qin, Yongwen Ren, Hao Li, Zhenya Huang, Yanyong Zhang, Chao Wang

Title: VERDICT: Verifiable Evolving Reasoning with Directive-Informed Collegial Teams for Legal Judgment Prediction

Abstract:
Legal Judgment Prediction (LJP) predicts applicable law articles, charges, and penalty terms from case facts. Beyond accuracy, LJP calls for intrinsically interpretable and legally grounded reasoning that can reconcile statutory rules with precedent‑informed standards. However, existing methods often behave as static, one‑shot predictors, providing limited procedural support for verifiable reasoning and little capability to adapt as jurisprudential practice evolves. We propose VERDICT, a self‑refining collaborative multi‑agent framework that simulates a virtual collegial panel. VERDICT assigns specialized agents to complementary roles (e.g., fact structuring, legal retrieval, opinion drafting, and supervisory verification) and coordinates them in a traceable draft‑‑verify‑‑revise workflow with explicit Pass/Reject feedback, producing verifiable reasoning traces and revision rationales. To capture evolving case experience, we further introduce a Hybrid Jurisprudential Memory (HJM) grounded in the Micro‑Directive Paradigm, which stores precedent standards and continually distills validated multi‑agent verification trajectories into updated Micro‑Directives for continual learning across cases. We evaluate VERDICT on CAIL2018 and a newly constructed CJO2025 dataset with a strict future time‑split for temporal generalization. VERDICT achieves state‑of‑the‑art performance on CAIL2018 and demonstrates strong generalization on CJO2025. To facilitate reproducibility and further research, we release our code and the dataset at https://anonymous.4open.science/r/ARR‑4437.

PaperID: 4733, https://arxiv.org/pdf/2603.19282.pdf

Authors: Zice Wang, Zhenyu Zhang

Title: Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

Abstract:
In many real‑world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual‑group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk‑averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk‑bearing. The findings highlight framing effects as a significant bias source in non‑interacting multi‑agent LLM deployments, informing alignment and prompt design.

PaperID: 4734, https://arxiv.org/pdf/2603.19270.pdf

Authors: Eslam Reda, Maged Yasser, Sara El-Metwally

Title: Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation

Abstract:
The increasing complexity of user demands necessitates automation frameworks that can reliably translate open‑ended instructions into robust, multi‑step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi‑agent framework designed for end‑to‑end workflow automation from natural language prompts. Autonoma employs a principled, multi‑tiered architecture where a high‑level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug‑and‑play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi‑modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.

PaperID: 4735, https://arxiv.org/pdf/2603.19262.pdf

Authors: Mike Farmer, Abhinav Kochar, Yugyung Lee

Title: The α-Law of Observable Belief Revision in Large Language Model Inference

Abstract:
Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain‑of‑thought reasoning, self‑reflection, or multi‑agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction‑tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate‑level benchmarks (GPQA Diamond, TheoremQA, MMLU‑Pro, and ARC‑Challenge) and multiple model families (GPT‑5.2 and Claude Sonnet 4) reveals near‑Bayesian update behavior, with models operating slightly above the stability boundary in single‑step revisions. However, multi‑step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long‑run dynamics consistent with theoretical stability predictions. Token‑level validation using Llama‑3.3‑70B further confirms similar behavior across both log‑probability measurements and self‑reported confidence elicitation. Analysis of update components exposes architecture‑specific trust‑ratio patterns, with GPT‑5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference‑time update behavior rather than internal Bayesian reasoning, and introduces the α‑law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.

PaperID: 4736, https://arxiv.org/pdf/2603.19252.pdf

Authors: Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu

Title: GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

Abstract:
Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi‑step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple‑choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple‑choice geometry proof problems, each requiring multi‑step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine‑grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best‑performing model, GPT‑5‑nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple‑choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.

PaperID: 4737, https://arxiv.org/pdf/2603.19248.pdf

Authors: Xin Shen, Zhishu Jiang, Jiaye Yang, Haibo Liu, Yichen Wan, Jiarui Zhang, Tingzhi Dai, Luodong Xu, Shuchen Wu, Guanqiang QI, Chenxi Miao, Jiahui Liang, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang

Title: DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution

Abstract:
Immersive conversational systems in production face a persistent trade‑off between responsiveness and long‑horizon task capability. Real‑time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy‑tail execution latency that degrades turn‑taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real‑time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems‑Info, Conversation, Collaboration, Augmentation, and Evolution‑to support multi‑agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du‑Interact dataset and large‑scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real‑time budgets. Crucially, deployment metrics since June 2025 confirm substantial real‑world effectiveness, evidenced by a tripling of Day‑7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.

PaperID: 4738, https://arxiv.org/pdf/2603.19241.pdf

Authors: Yue Wu, Tianhao Su, Mingchuan Zhao, Shunbo Hu, Deng Pan

Title: Engineering-Oriented Symbolic Regression: LLMs as Physics Agents for Discovery of Simulation-Ready Constitutive Laws

Abstract:
The discovery of constitutive laws for complex materials has historically faced a dichotomy between high‑fidelity data‑driven approaches, which demand prohibitive full‑field experimental data, and traditional engineering fitting, which often yields numerically unstable models outside calibration regimes. In this work, we propose an Engineering‑Oriented Symbolic Regression (EO‑SR) framework that bridges this gap by leveraging Large Language Models (LLMs) as "Physics‑Informed Agents." Unlike unconstrained symbolic regression, our framework utilizes an LLM Agent to zero‑shot synthesize executable physical constraints ‑‑ specifically thermodynamic consistency and frame indifference ‑‑ transforming the search process from mathematical curve‑fitting into a physics‑governed discovery engine. We validate this approach on the hyperelastic modeling of rubber‑like materials using standard Treloar datasets. The framework autonomously identifies a novel hybrid constitutive law that combines a Mooney‑Rivlin linear base with a rational locking term. This discovered model not only achieves high predictive accuracy across multi‑axial deformation modes (including zero‑shot prediction of pure shear) but also guarantees unconditional convexity. Finite element validation demonstrates that while industry‑standard models (e.g., Ogden N=3) fail due to numerical singularities under severe transverse compression, the EO‑SR‑discovered model maintains robust convergence. This study establishes a generalized, low‑barrier pathway for discovering simulation‑ready constitutive closures that satisfy both data accuracy and rigorous physical laws.

PaperID: 4739, https://arxiv.org/pdf/2603.19191.pdf

Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding

Title: OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Abstract:
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS‑Themis, a scalable and accurate multi‑agent critic framework. Unlike a single judge, OS‑Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross‑platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS‑Themis. Extensive experiments on AndroidWorld show that OS‑Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self‑training loop, highlighting its potential to drive agent evolution.

PaperID: 4740, https://arxiv.org/pdf/2603.19188.pdf

Authors: Huiwen Yan, Mushuang Liu

Title: Markov Potential Game and Multi-Agent Reinforcement Learning for Autonomous Driving

Abstract:
Autonomous driving (AD) requires safe and reliable decision‑making among interacting agents, e.g., vehicles, bicycles, and pedestrians. Multi‑agent reinforcement learning (MARL) modeled by Markov games (MGs) provides a suitable framework to characterize such agents' interactions during decision‑making. Nash equilibria (NEs) are often the desired solution in an MG. However, it is typically challenging to compute an NE in general‑sum games, unless the game is a Markov potential game (MPG), which ensures the NE attainability under a few learning algorithms such as gradient play. However, it has been an open question how to construct an MPG and whether these construction rules are suitable for AD applications. In this paper, we provide sufficient conditions under which an MG is an MPG and show that these conditions can accommodate general driving objectives for autonomous vehicles (AVs) using highway forced merge scenarios as illustrative examples. A parameter‑sharing neural network (NN) structure is designed to enable decentralized policy execution. The trained driving policy from MPGs is evaluated in both simulated and naturalistic traffic datasets. Comparative studies with single‑agent RL and with human drivers whose behaviors are recorded in the traffic datasets are reported, respectively.

PaperID: 4741, https://arxiv.org/pdf/2603.19074.pdf

Authors: Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan, Yifeng Zhang, Yi Li, Tanishq Duhan, Guillaume Sartoretti

Title: CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Abstract:
Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi‑Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning‑based methods have shown strong performance on the single‑agent TSP and multi‑objective TSP variants, they rarely address the combined challenges of multi‑agent coordination and multi‑objective trade‑offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high‑quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi‑objective trade‑offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi‑agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE‑based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real‑world tests on a mobile robot platform demonstrate its practical applicability.

PaperID: 4742, https://arxiv.org/pdf/2603.19011.pdf

Authors: Enrico Bottazzi, Pia Park

Title: Security awareness in LLM agents: the NDAI zone case

Abstract:
NDAI zones let inventor and investor agents negotiate inside a Trusted Execution Environment (TEE) where any disclosed information is deleted if no deal is reached. This makes full IP disclosure the rational strategy for the inventor's agent. Leveraging this infrastructure, however, requires agents to distinguish a secure environment from an insecure one, a capability LLM agents lack natively, since they can rely only on evidence passed through the context window to form awareness of their execution environment. We ask: How do different LLM models weight various forms of evidence when forming awareness of the security of their execution environment? Using an NDAI‑style negotiation task across 10 language models and various evidence scenarios, we find a clear asymmetry: a failing attestation universally suppresses disclosure across all models, whereas a passing attestation produces highly heterogeneous responses: some models increase disclosure, others are unaffected, and a few paradoxically reduce it. This reveals that current LLM models can reliably detect danger signals but cannot reliably verify safety, the very capability required for privacy‑preserving agentic protocols such as NDAI zones. Bridging this gap, possibly through interpretability analysis, targeted fine‑tuning, or improved evidence architectures, remains the central open challenge for deploying agents that calibrate information sharing to actual evidence quality.

PaperID: 4743, https://arxiv.org/pdf/2603.18958.pdf

Authors: Andrzej Kaczmarczyk, Šimon Schierreich, Nicholas Axel Tanujaya, Haifeng Xu

Title: Optimal Path Planning in Hostile Environments

Abstract:
Coordinating agents through hazardous environments, such as aid‑delivering drones navigating conflict zones or field robots traversing deployment areas filled with obstacles, poses fundamental planning challenges. We introduce and analyze the computational complexity of a new multi‑agent path planning problem that captures this setting. A group of identical agents begins at a common start location and must navigate a graph‑based environment to reach a common target. The graph contains hazards that eliminate agents upon contact but then enter a known cooldown period before reactivating. In this discrete‑time, fully‑observable, deterministic setting, the planning task is to compute a movement schedule that maximizes the number of agents reaching the target. We first prove that, despite the exponentially large space of feasible plans, optimal plans require only polynomially‑many steps, establishing membership in NP. We then show that the problem is NP‑hard even when the environment graph is a tree. On the positive side, we present a polynomial‑time algorithm for graphs consisting of vertex‑disjoint paths from start to target. Our results establish a rich computational landscape for this problem, identifying both intractable and tractable fragments.

PaperID: 4744, https://arxiv.org/pdf/2603.18916.pdf

Authors: Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik, Emanuele La Malfa, Lior Limonad, Andrea Marrella, Andreas Metzger, Marco Montali, Daniel Amyot, Peter Fettke, Artem Polyvyanyy, Stefanie Rinderle-Ma, Sebastian Sardiña, Niek Tax, Barbara Weber

Title: Agentic Business Process Management: A Research Manifesto

Abstract:
This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent‑oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation‑oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self‑modification. These capabilities jointly ensure that agents' goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi‑agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.

PaperID: 4745, https://arxiv.org/pdf/2603.18897.pdf

Authors: Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Yuqing Yang

Title: Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution

Abstract:
LLM‑powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM‑tool" loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern‑Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application level control flows (recurring tool‑call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. Experimental results against state of the art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.

PaperID: 4746, https://arxiv.org/pdf/2603.18894.pdf

Authors: Vedanta S P, Ponnurangam Kumaraguru

Title: I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems

Abstract:
Large language models are increasingly proposed as autonomous agents for high‑stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre‑deployment requirement rather than a post‑deployment assumption. We evaluate multi‑agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule‑breaking and abuse outcomes with an independent rubric‑based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption‑related outcomes than model identity, with large differences across regimes and model‑‑governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance‑like constraints with enforceable rules, auditable logs, and human oversight on high‑impact actions.

PaperID: 4747, https://arxiv.org/pdf/2603.18866.pdf

Authors: Xuemian Wu, Shizhe Zhao, Zhongqiang Ren

Title: Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions

Abstract:
Multi‑Agent Path Finding (MAPF) seeks collision‑free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous‑time Conflict‑Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF‑AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict‑Based Search with Asynchronous Actions (CBS‑AA), which bypasses this theoretical issue and can solve MAPF‑AA with completeness and solution optimality guarantees. Based on CBS‑AA, we also develop conflict resolution techniques to improve the scalability of CBS‑AA further. Our test results show that our method can reduce the number of branches by up to 90%.

PaperID: 4748, https://arxiv.org/pdf/2603.18815.pdf

Authors: Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, Yi Dong

Title: ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Abstract:
Multi‑turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long‑horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout‑as‑a‑service philosophy, we present ProRL Agent , a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open‑sourced and integrated as part of NVIDIA NeMo Gym.

PaperID: 4749, https://arxiv.org/pdf/2603.18798.pdf

Authors: Yufei Cao, Penny Sweetser, Ziyu Chen, Xuanying Zhu

Title: Signals of Success and Struggle: Early Prediction and Physiological Signatures of Human Performance across Task Complexity

Abstract:
User performance is crucial in interactive systems, capturing how effectively users engage with task execution. Prospectively predicting performance enables the timely identification of users struggling with task demands. While ocular and cardiac signals are widely used to characterise performance‑relevant visual behaviour and physiological activation, their potential for early prediction and for revealing the physiological mechanisms underlying performance differences remains underexplored. We conducted a within‑subject experiment in a game environment with naturally unfolding complexity, using early ocular and cardiac signals to predict later performance and to examine physiological and self‑reported group differences. Results show that the ocular‑cardiac fusion model achieves a balanced accuracy of 0.86, and the ocular‑only model shows comparable predictive power. High performers exhibited targeted gaze and adjusted visual sampling, and sustained more stable cardiac activation as demands intensified, with a more positive affective experience. These findings demonstrate the feasibility of cross‑session prediction from early physiology, providing interpretable insights into performance variation and facilitating future proactive intervention.

PaperID: 4750, https://arxiv.org/pdf/2603.18788.pdf

Authors: KT Tech innovation Group

Title: Mi:dm K 2.5 Pro

Abstract:
The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi‑step reasoning, long‑context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean‑language and domain‑specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise‑grade complexity through reasoning‑focused optimization. Our methodology builds a robust data foundation via a quality‑centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap‑filling synthesis for mathematics, and an LLM‑based quality evaluator. Pre‑training scales the model via layer‑predictor‑based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post‑training introduces a specialized multi‑stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem‑solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool‑use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state‑of‑the‑art results on Korean‑specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

PaperID: 4751, https://arxiv.org/pdf/2603.18729.pdf

Authors: Martina Ullasci, Marco Rondina, Riccardo Coppola, Flavio Giobergia, Riccardo Bellanca, Gabriele Mancari Pasi, Luca Prato, Federico Spinoso, Silvia Tagliente

Title: Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

Abstract:
Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype‑based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African‑American English (AAE). In this paper, we replicate existing analyses of dialect‑sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role‑based and Chain‑Of‑Thought prompting) and multi‑agent architectures composed of generate‑critique‑revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM‑as‑judge approach to evaluate the bias in the results. Our results show that stereotype‑bearing differences emerge between SAE‑ and AAE‑related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE‑AAE differential observed in Claude Haiku and the smallest in Phi‑4 Mini. Chain‑Of‑Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi‑agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality‑informed software engineering, fairness evaluation should include model‑specific validation of mitigation strategies, and workflow‑level controls (e.g., agentic architectures involving critique models) in high‑impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.

PaperID: 4752, https://arxiv.org/pdf/2603.18660.pdf

Authors: Peihang Wu, Zehong Chen, Lijian Xu

Title: Multimodal Model for Computational Pathology:Representation Learning and Image Compression

Abstract:
Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra‑long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self‑supervised representation learning and structure‑aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter‑efficient adaptation and reasoning‑enhanced few‑shot learning; and (4) multi‑agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross‑scale modeling and how multi‑agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty‑aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high‑resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI‑assisted diagnosis.

PaperID: 4753, https://arxiv.org/pdf/2603.18658.pdf

Authors: Cinzia Tomaselli, Gian Carlo Maffettone, Samy Wu Fung, Levon Nurbekyan, Mario di Bernardo

Title: Mean-field control barrier functions for stochastic multi-agent systems

Abstract:
Many applications involving multi‑agent systems require fulfilling safety constraints. Control barrier functions offer a systematic framework to enforce forward invariance of safety sets. Recent work extended this paradigm to mean‑field scenarios, where the number of agents is large enough to make density‑space descriptions a reasonable workaround for the curse of dimensionality. However, an open gap in the recent literature concerns the development of mean‑field control barrier functions for Fokker‑Planck (advection‑diffusion) equations. In this work, we address this gap, enabling safe mean‑field control of agents with stochastic microscopic dynamics. We provide bounded stability guarantees under safety corrections and corroborate our results through numerical simulations in two representative scenarios, coverage and shepherding control of multi‑agent systems.

PaperID: 4754, https://arxiv.org/pdf/2603.18631.pdf

Authors: Zhixing You, Jiachen Yuan, Jason Cai

Title: D-Mem: A Dual-Process Memory System for LLM Agents

Abstract:
Driven by the development of persistent, self‑adapting autonomous agents, equipping these systems with high‑fidelity memory access for long‑horizon reasoning has emerged as a critical requirement. However, prevalent retrieval‑based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine‑grained contextual understanding. To address this, we introduce D‑Mem, a dual‑process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high‑fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D‑Mem employs a Multi‑dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT‑4o‑mini and Qwen3‑235B‑Instruct demonstrate the efficacy of our approach. Notably, our Multi‑dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT‑4o‑mini. This outperforms our static retrieval baseline, Mem0^\ast (51.2), and recovers 96.7% of the Full Deliberation's performance (55.3), while incurring significantly lower computational costs.

PaperID: 4755, https://arxiv.org/pdf/2603.18614.pdf

Authors: Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen

Title: ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Abstract:
Tool‑augmented large language models (LLMs) must tightly couple multi‑step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning‑action coupling in tool‑augmented LLMs, with controllable difficulty and a knowledge‑minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in‑depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT‑5 and Gemini 2.5 Pro only achieves 60% accuracy on the hard instances. We also observe a persistent gaps between theoretical optimality and practical tool usage. For example, GPT‑5 uses 70‑270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

PaperID: 4756, https://arxiv.org/pdf/2603.18563.pdf

Authors: Enoch Hyunwook Kang

Title: Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

Abstract:
As autonomous AI agents increasingly mediate online platform markets, a fundamental question emerges: do these markets generate stable strategic outcomes? In repeated strategic environments, the Nash equilibrium provides a natural benchmark for this stability. However, empirical evidence on off‑the‑shelf LLM agents is mixed, leaving it unclear whether independently deployed agents can converge to equilibrium behavior without explicit strategic post‑training. In this paper, we provide an affirmative answer. Extending the Bayesian learning literature in theoretical economics, we prove that AI agents, acting as Bayesian posterior samplers rather than expected utility maximizers, are guaranteed to eventually become weakly close to a Nash equilibrium in infinitely repeated games. We further extend this analysis to settings in which stage payoffs are unknown ex ante, and agents observe only their privately realized stochastic payoffs, and obtain the same convergence guarantees. Finally, we empirically evaluate these theoretical implications across five repeated‑game environments, ranging from the Prisoner's Dilemma to marketing promotion games. Taken together, our findings suggest that strategic stability in AI‑mediated markets can emerge from the intrinsic reasoning and learning properties of modern AI agents, without the need for unrealistic universal fine‑tuning.

PaperID: 4757, https://arxiv.org/pdf/2603.18520.pdf

Authors: Zachary Allen, Max Conway, Lyle Antieau, Allen Ponraj, Nikolaus Correll

Title: Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly

Abstract:
Electric vehicles (EV) create an urgent need for scalable battery recycling, yet disassembly of EV battery packs remains largely manual due to high design variability. We present our Robotic Agentic Platform for Intelligent Disassembly (RAPID), designed to investigate perception‑driven manipulation, flexible automation, and AI‑assisted robot programming in realistic recycling scenarios. The system integrates a gantry‑mounted industrial manipulator, RGB‑D perception, and an automated nut‑running tool for fastener removal on a full‑scale EV battery pack. An open‑vocabulary object detection pipeline achieves 0.9757 mAP50, enabling reliable identification of screws, nuts, busbars, and other components. We experimentally evaluate (n=204) three one‑shot fastener removal strategies: taught‑in poses (97% success rate, 24 min duration), one‑shot vision execution (57%, 29 min), and visual servoing (83%, 36 min), comparing success rate and disassembly time for the battery's top cover fasteners. To support flexible interaction, we introduce agentic AI specifications for robotic disassembly tasks, allowing LLM agents to translate high‑level instructions into robot actions through structured tool interfaces and ROS services. We evaluate SmolAgents with GPT‑4o‑mini and Qwen 3.5 9B/4B on edge hardware. Tool‑based interfaces achieve 100% task completion, while automatic ROS service discovery shows 43.3% failure rates, highlighting the need for structured robot APIs for reliable LLM‑driven control. This open‑source platform enables systematic investigation of human‑robot collaboration, agentic robot programming, and increasingly autonomous disassembly workflows, providing a practical foundation for research toward scalable robotic battery recycling.

PaperID: 4758, https://arxiv.org/pdf/2603.18507.pdf

Authors: Zizhao Hu, Mohammad Rostami, Jesse Thomason

Title: Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM

Abstract:
Persona prompting can steer LLM generation towards a domain‑specific tone and pattern. This behavior enables use cases in multi‑agent systems where diverse interactions are crucial and human‑centered tasks require high‑level human alignment. Prior works provide mixed opinions on their utility: some report performance gains when using expert personas for certain domains and their contribution to data diversity in synthetic data creation, while others find near‑zero or negative impact on general utility. To fully leverage the benefits of the LLM persona and avoid its harmfulness, a more comprehensive investigation of the mechanism is crucial. In this work, we study how model optimization, task type, prompt length, and placement can impact expert persona effectiveness across instruction‑tuned and reasoning LLMs, and provide insight into conditions under which expert personas fail and succeed. Based on our findings, we developed a pipeline to fully leverage the benefits of an expert persona, named PRISM (Persona Routing via Intent‑based Self‑Modeling), which self‑distills an intent‑conditioned expert persona into a gated LoRA adapter through a bootstrapping process that requires no external data, models, or knowledge. PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks across all models, with minimal memory and computing overhead.

PaperID: 4759, https://arxiv.org/pdf/2603.18503.pdf

Authors: Julian Martinez, Kooktae Lee

Title: Computationally Efficient Density-Driven Optimal Control via Analytical KKT Reduction and Contractive MPC

Abstract:
Efficient coordination for collective spatial distribution is a fundamental challenge in multi‑agent systems. Prior research on Density‑Driven Optimal Control (D2OC) established a framework to match agent trajectories to a desired spatial distribution. However, implementing this as a predictive controller requires solving a large‑scale Karush‑Kuhn‑Tucker (KKT) system, whose computational complexity grows cubically with the prediction horizon. To resolve this, we propose an analytical structural reduction that transforms the T‑horizon KKT system into a condensed quadratic program (QP). This formulation achieves O(T) linear scalability, significantly reducing the online computational burden compared to conventional O(T^3) approaches. Furthermore, to ensure rigorous convergence in dynamic environments, we incorporate a contractive Lyapunov constraint and prove the Input‑to‑State Stability (ISS) of the closed‑loop system against reference propagation drift. Numerical simulations verify that the proposed method facilitates rapid density coverage with substantial computational speed‑up, enabling long‑horizon predictive control for large‑scale multi‑agent swarms.

PaperID: 4760, https://arxiv.org/pdf/2603.18447.pdf

Authors: Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang

Title: SODIUM: From Open Web Data to Queryable Databases

Abstract:
During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin. We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in‑depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM‑Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where systems are tasked with exploring the open web to collect and aggregate data from diverse sources into structured tables. Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM‑Bench, with the strongest baseline achieving only 46.5% accuracy. To bridge this gap, we develop SODIUM‑Agent, a multi‑agent system composed of a web explorer and a cache manager. Powered by our proposed ATP‑BFS algorithm and optimized through principled management of cached sources and navigation paths, SODIUM‑Agent conducts deep and comprehensive web exploration and performs structurally coherent information extraction. SODIUM‑Agent achieves 91.1% accuracy on SODIUM‑Bench, outperforming the strongest baseline by approximately 2 times and the weakest by up to 73 times.

PaperID: 4761, https://arxiv.org/pdf/2603.18400.pdf

Authors: Anastasios Manganaris, Jeremy Lu, Ahmed H. Qureshi, Suresh Jagannathan

Title: Graph-of-Constraints Model Predictive Control for Reactive Multi-agent Task and Motion Planning

Abstract:
Sequences of interdependent geometric constraints are central to many multi‑agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph‑of‑Constraints Model Predictive Control (GoC‑MPC), a generalized sequence‑of‑constraints framework integrated with MPC. GoC‑MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi‑agent manipulation tasks‑coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC‑MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi‑agent manipulation under real‑world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc‑mpc/home .

PaperID: 4762, https://arxiv.org/pdf/2603.18388.pdf

Authors: Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

Title: Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Abstract:
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black‑box and label‑free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi‑agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization trace. A two‑layer explore‑exploit mechanism combining random restart and epsilon‑greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.

PaperID: 4763, https://arxiv.org/pdf/2603.18382.pdf

Authors: Myeongseob Ko, Jihyun Jeong, Sumiran Singh Thakur, Gyuhak Kim, Ruoxi Jia

Title: From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

Abstract:
Anonymization is widely treated as a practical safeguard because re‑identifying anonymous records was historically costly, requiring domain expertise, tailored algorithms, and manual corroboration. We study a growing privacy risk that may weaken this barrier: LLM‑based agents can autonomously reconstruct real‑world identities from scattered, individually non‑identifying cues. By combining these sparse cues with public information, agents resolve identities without bespoke engineering. We formalize this threat as \emphinference‑driven linkage and systematically evaluate it across three settings: classical linkage scenarios (Netflix and AOL), \emphInferLink (a controlled benchmark varying task intent, shared cues, and attacker knowledge), and modern text‑rich artifacts. Without task‑specific heuristics, agents successfully execute both fixed‑pool matching and open‑ended identity resolution. In the Netflix Prize setting, an agent reconstructs 79.2% of identities, significantly outperforming a 56.0% classical baseline. Furthermore, linkage emerges not only under explicit adversarial prompts but also as a byproduct of benign cross‑source analysis in \emphInferLink and unstructured research narratives. These findings establish that identity inference ‑‑ not merely explicit information disclosure ‑‑ must be treated as a first‑class privacy risk; evaluations must measure what identities an agent can infer.

PaperID: 4764, https://arxiv.org/pdf/2603.18377.pdf

Authors: Guangsheng Yu, Qin Wang, Rui Lang, Shuai Su, Xu Wang

Title: PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents

Abstract:
Cloud‑hosted large language models (LLMs) have become the de facto planners in agentic systems, coordinating tools and guiding execution over local environments. In many deployments, however, the environment being planned over is private, containing source code, files, credentials, and metadata that cannot be exposed to the cloud. Existing solutions address adjacent concerns, such as execution isolation, access control, or confidential inference, but they do not control what cloud planners observe during planning: within the permitted scope, raw environment state is still exposed. We introduce PlanTwin, a privacy‑preserving architecture for cloud‑assisted planning without exposing raw local context. The key idea is to project the real environment into a planning‑oriented digital twin: a schema‑constrained and de‑identified abstract graph that preserves planning‑relevant structure while removing reconstructable details. The cloud planner operates solely on this sanitized twin through a bounded capability interface, while a local gatekeeper enforces safety policies and cumulative disclosure budgets. We further formalize the privacy‑utility trade‑off as a capability granularity problem, define architectural privacy goals using (k,δ)‑anonymity and ε‑unlinkability, and mitigate compositional leakage through multi‑turn disclosure control. We implement PlanTwin as middleware between local agents and cloud planners and evaluate it on 60 agentic tasks across ten domains with four cloud planners. PlanTwin achieves full sensitive‑item non‑disclosure (SND = 1.0) while maintaining planning quality close to full‑context systems: three of four planners achieve PQS > 0.79, and the full pipeline incurs less than 2.2% utility loss.

PaperID: 4765, https://arxiv.org/pdf/2603.18375.pdf

Authors: Shivam Shukla, Emily Chen, Mahnaz Roshanaei, Magy Seif El-Nasr

Title: Relationship-Centered Care: Relatedness and Responsible Design for Human Connections in Mental-Health Care

Abstract:
There has been a growing research interest in Digital Therapeutic Alliance (DTA) as the field of AI‑powered conversational agents are being deployed in mental health care, particularly those delivering CBT (Cognitive Behaviour Therapy). Our proposition argues that the current design paradigm which seeks to optimize the bond between a patient in need of support and an AI agent contains a subtle but consequential trap: it risks producing an "appearance of connection" that unintentionally disrupts the fundamental human need for relatedness, which potentially displaces the authentic human relationships upon which long‑term psychological recovery depends. We propose a reorientation from designing artificial intelligence tools that simulate relationships to designing AI that scaffolds them. To operationalize our argument, we propose an interdisciplinary model that translates the Responsible AI Six Sphere Framework through the lens of Self‑Determination Theory (SDT), with a specific focus on the basic psychological need for relatedness. The resulting model offers the technical and other clinical communities a set of relationship‑centered design guidelines and relevant provocations for building AI systems that function not just as companions, but as a catalyst for strengthening a patient's entire relational ecology; their connections with therapists, caregivers, family, and peers. In doing so, we discuss a model towards a more sustainable ecosystem of relationship‑centered AI in mental health care.

PaperID: 4766, https://arxiv.org/pdf/2603.18330.pdf

Authors: Lingavasan Suresh Kumar, Yang Ba, Rong Pan

Title: MemArchitect: A Policy Driven Memory Governance Layer

Abstract:
Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval‑Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule‑based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.

PaperID: 4767, https://arxiv.org/pdf/2603.18273.pdf

Authors: Chenguang Pan, Zhou Zhang, Weixuan Xiao, Chengyuan Yao

Title: EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research

Abstract:
In this technical report, we present the Educational Data Mining Automated Research System (EDM‑ARS), a domain‑specific multi‑agent pipeline that automates end‑to‑end educational data mining (EDM) research. We conceptualize EDM‑ARS as a general framework for domain‑aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM‑ARS orchestrates five specialized LLM‑powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state‑machine coordinator that supports revision loops, checkpoint‑based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM‑ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three‑tier data registry design that encodes educational domain expertise, the specification of each agent, the inter‑agent communication protocol, and mechanisms for error‑handling and self‑correction. Finally, we discuss current limitations, including single‑dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric, and multi‑dataset generalization. EDM‑ARS is released as an open‑source project to support the educational research community.

PaperID: 4768, https://arxiv.org/pdf/2603.18272.pdf

Authors: Thomas Palmeira Ferraz, Romain Deffayet, Vassilina Nikoulina, Hervé Déjean, Stéphane Clinchant

Title: Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Abstract:
While large language models (LLMs) have advanced the development of general‑purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine‑tuning or training‑free memory‑augmented generation using retrieved experience; yet both have limitations: fine‑tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval‑augmented LLM agents to effectively leverage retrieved trajectories in‑context. First, we establish a robust supervised fine‑tuning (SFT) recipe using LoRA that outperforms several state‑of‑the‑art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine‑tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.

PaperID: 4769, https://arxiv.org/pdf/2603.18271.pdf

Authors: Akshat Rana, Peeyush Agarwal, K. P. S. Rana, Amarjit Malhotra

Title: SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations

Abstract:
Ambiguity poses a major challenge to large language models (LLMs) used as robotic planners. In this letter, we present Scene Graph‑Chain‑of‑Thought (SG‑CoT), a two‑stage framework where LLMs iteratively query a scene graph representation of the environment to detect and clarify ambiguities. First, a structured scene graph representation of the environment is constructed from input observations, capturing objects, their attributes, and relationships with other objects. Second, the LLM is equipped with retrieval functions to query portions of the scene graph that are relevant to the provided instruction. This grounds the reasoning process of the LLM in the observation, increasing the reliability of robotic planners under ambiguous situations. SG‑CoT also allows the LLM to identify the source of ambiguity and pose a relevant disambiguation question to the user or another robot. Extensive experimentation demonstrates that SG‑CoT consistently outperforms prior methods, with a minimum of 10% improvement in question accuracy and a minimum success rate increase of 4% in single‑agent and 15% in multi‑agent environments, validating its effectiveness for more generalizable robot planning.

PaperID: 4770, https://arxiv.org/pdf/2603.18245.pdf

Authors: Xuan Chen, Lu Yan, Ruqi Zhang, Xiangyu Zhang

Title: Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Abstract:
Large Language Model (LLM) agents increasingly act through external tools, making their safety contingent on tool‑call workflows rather than text generation alone. While recent benchmarks evaluate agents across diverse environments and risk categories, a fundamental question remains unanswered: how complete are existing test suites, and what unsafe interaction patterns persist even after an agent passes the benchmark? We propose SafeAudit, a meta‑audit framework that addresses this gap through two contributions. First, an LLM‑based enumerator that systematically generates test cases by enumerating valid tool‑call workflows and diverse user scenarios. Second, we introduce rule‑resistance, a non‑semantic, quantitative metric that distills compact safety rules from existing benchmarks and identifies unsafe interaction patterns that remain uncovered under those rules. Across 3 benchmarks and 12 environments, SafeAudit uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose, with coverage growing monotonically as the testing budget increases. Our results highlight significant completeness gaps in current safety evaluation and motivate meta‑auditing as a necessary complement to benchmark‑based agent safety testing.

PaperID: 4771, https://arxiv.org/pdf/2603.18235.pdf

Authors: Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

Title: Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Abstract:
As large language models (LLMs) evolve into autonomous "AI scientists," they promise transformative advances but introduce novel vulnerabilities, from potential "biosafety risks" to "dangerous explosions." Ensuring trustworthy deployment in science requires a new paradigm centered on reliability (ensuring factual accuracy and reproducibility), safety (preventing unintentional physical or biological harm), and security (preventing malicious misuse). Existing general‑purpose safety benchmarks are poorly suited for this purpose, suffering from a fundamental domain mismatch, limited threat coverage of science‑specific vectors, and benchmark overfitting, which create a critical gap in vulnerability evaluation for scientific applications. This paper examines the unique security and safety landscape of LLM agents in science. We begin by synthesizing a detailed taxonomy of LLM threats contextualized for scientific research, to better understand the unique risks associated with LLMs in science. Next, we conceptualize a mechanism to address the evaluation gap by utilizing dedicated multi‑agent systems for the automated generation of domain‑specific adversarial security benchmarks. Based on our analysis, we outline how existing safety methods can be brought together and integrated into a conceptual multilayered defense framework designed to combine a red‑teaming exercise and external boundary controls with a proactive internal Safety LLM Agent. Together, these conceptual elements provide a necessary structure for defining, evaluating, and creating comprehensive defense strategies for trustworthy LLM agent deployment in scientific disciplines.

PaperID: 4772, https://arxiv.org/pdf/2603.18210.pdf

Authors: MoniJesu James, Amir Atef Habel, Aleksey Fedoseev, Dzmitry Tsetserokou

Title: GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System

Abstract:
Object‑goal navigation has traditionally been limited to ground robots with closed‑set object vocabularies. Existing multi‑agent approaches depend on precomputed probabilistic graphs tied to fixed category sets, precluding generalization to novel goals at test time. We present GoalVLM, a cooperative multi‑agent framework for zero‑shot, open‑vocabulary object navigation. GoalVLM integrates a Vision‑Language Model (VLM) directly into the decision loop, SAM3 for text‑prompted detection and segmentation, and SpaceOM for spatial reasoning, enabling agents to interpret free‑form language goals and score frontiers via zero‑shot semantic priors without retraining. Each agent builds a BEV semantic map from depth‑projected voxel splatting, while a Goal Projector back‑projects detections through calibrated depth into the map for reliable goal localization. A constraint‑guided reasoning layer evaluates frontiers through a structured prompt chain (scene captioning, room‑type classification, perception gating, multi‑frontier ranking), injecting commonsense priors into exploration. We evaluate GoalVLM on GOAT‑Bench val_unseen (360 multi‑subtask episodes, 1032 sequential object‑goal subtasks, HM3D scenes), where each episode requires navigating to a chain of 5‑7 open‑vocabulary targets. GoalVLM with N=2 agents achieves 55.8% subtask SR and 18.3% SPL, competitive with state‑of‑the‑art methods while requiring no task‑specific training. Ablation studies confirm the contributions of VLM‑guided frontier reasoning and depth‑projected goal localization.

PaperID: 4773, https://arxiv.org/pdf/2603.18122.pdf

Authors: Sriram Gopalakrishnan

Title: Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows

Abstract:
Skele‑Code is a natural‑language and graph‑based interface for building workflows with AI agents, designed especially for less or non‑technical users. It supports incremental, interactive notebook‑style development, and each step is converted to code with a required set of functions and behavior to enable incremental building of workflows. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent‑supported, but code‑first approach to workflows, along with the context‑engineering used in Skele‑Code, can help reduce token costs compared to the multi‑agent system approach to executing workflows. Skele‑Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.

PaperID: 4774, https://arxiv.org/pdf/2603.18118.pdf

Authors: Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Title: Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract:
Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test‑time reasoning. However, extending these capabilities to Multi‑modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high‑quality, long‑chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi‑agent visual reasoning framework that systematically evolves from our foundational image‑centric model, Insight‑V, into a generalized spatial‑temporal architecture, Insight‑V++. We first propose a scalable data generation pipeline equipped with multi‑granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub‑optimal results, we design a dual‑agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off‑policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long‑horizon video understanding, Insight‑V++ introduces two novel algorithms, ST‑GRPO and J‑GRPO, which enhance spatial‑temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi‑agent system in a continuous, self‑improving loop. Extensive experiments on base models like LLaVA‑NeXT and Qwen2.5‑VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception‑focused tasks.

PaperID: 4775, https://arxiv.org/pdf/2603.18096.pdf

Authors: Ciprian Paduraru, Petru-Liviu Bouruc, Alin Stefanescu

Title: A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance

Abstract:
In Agentic AI, Large Language Models (LLMs) are increasingly used in the orchestration layer to coordinate multiple agents and to interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final outputs. They also arise from long‑horizon interaction, stochastic decisions, and external side effects (such as API calls, database writes, and message sends). Common failures include non‑termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. This paper presents an assurance framework for such Agentic AI systems. Executions are instrumented as Message‑Action Traces (MAT) with explicit step and trace contracts. Contracts provide machine‑checkable verdicts, localize the first violating step, and support deterministic replay. The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations. It also supports structured fault injection at service, retrieval, and memory boundaries to assess containment under realistic operational faults and degraded conditions. Finally, governance is treated as a runtime component, enforcing per‑agent capability limits and action mediation (allow, rewrite, block) at the language‑to‑action boundary. To support comparative evaluations across stochastic seeds, models, and orchestration configurations, the paper defines trace‑based metrics for task success, termination reliability, contract compliance, factuality indicators, containment rate, and governance outcome distributions. More broadly, the framework is intended as a common abstraction to support testing and evaluation of multi‑agent LLM systems, and to facilitate reproducible comparison across orchestration designs and configurations.

PaperID: 4776, https://arxiv.org/pdf/2603.18079.pdf

Authors: Prince Zizhuang Wang, Shuli Jiang

Title: SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

Abstract:
Large Language Model (LLM) agents have shown strong results on multi‑turn tool‑use tasks, yet they operate in isolation during training, failing to leverage experiences accumulated across episodes. Existing experience‑augmented methods address this by organizing trajectories into retrievable libraries, but they retrieve experiences only once based on the initial task description and hold them constant throughout the episode. In multi‑turn settings where observations change at every step, this static retrieval becomes increasingly mismatched as episodes progress. We propose SLEA‑RL (Step‑Level Experience‑Augmented Reinforcement Learning), a framework that retrieves relevant experiences at each decision step conditioned on the current observation. SLEA‑RL operates through three components: (i) step‑level observation clustering that groups structurally equivalent environmental states for efficient cluster‑indexed retrieval; (ii) a self‑evolving experience library that distills successful strategies and failure patterns through score‑based admission and rate‑limited extraction; and (iii) policy optimization with step‑level credit assignment for fine‑grained advantage estimation across multi‑turn episodes. The experience library evolves alongside the policy through semantic analysis rather than gradient updates. Experiments on long‑horizon multi‑turn agent benchmarks demonstrate that SLEA‑RL achieves superior performance compared to various reinforcement learning baselines.

PaperID: 4777, https://arxiv.org/pdf/2603.18041.pdf

Authors: Mark M. Bailey

Title: Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations

Abstract:
Swarm and constellation reconfiguration can be viewed as motion of an unordered point configuration in an ambient space. Here, we provide persistence‑stable, symmetry‑invariant geometric representations for comparing and monitoring multi‑agent configuration data. We introduce a quotient formation space \mathcalS_n(M,G)=M^n/(G× S_n) and a formation matching metric d_M,G obtained by optimizing a worst‑case assignment error over ambient symmetries g\in G and relabelings σ\in S_n. This metric is a structured, physically interpretable relaxation of Gromov‑‑Hausdorff distance: the induced inter‑agent metric spaces satisfy d_\mathrmGH(X_x,X_y)\le d_M,G([x],[y]). Composing this bound with stability of Vietoris‑‑Rips persistence yields d_B(Φ_k([x]),Φ_k([y]))\le d_M,G([x],[y]), providing persistence‑stable signatures for reconfiguration monitoring. We analyze the metric geometry of (\mathcalS_n(M,G),d_M,G): under compactness/completeness assumptions on M and compact G it is compact/complete and the metric induces the quotient topology; if M is geodesic then the quotient is geodesic and exhibits stratified singularities along collision and symmetry strata, relating it to classical configuration spaces. We study expressivity of the signatures, identifying symmetry‑mismatch and persistence‑compression mechanisms for non‑injectivity. Finally, in a phase‑circle model we prove a conditional inverse theorem: under semicircle support and a gap‑labeling margin, the H_0 signature is locally bi‑Lipschitz to d_M,G up to an explicit factor, yielding two‑sided control. Examples on \mathbbS^2 and \mathbbT^m illustrate satellite‑constellation and formation settings.

PaperID: 4778, https://arxiv.org/pdf/2603.18030.pdf

Authors: Hao Ke

Title: Quine: Realizing LLM Agents as Native POSIX Processes

Abstract:
Current LLM agent frameworks often implement isolation, scheduling, and communication at the application layer, even though these mechanisms are already provided by mature operating systems. Instead of introducing another application‑layer orchestrator, this paper presents Quine, a runtime architecture and reference implementation that realizes LLM agents as native POSIX processes. The mapping is explicit: identity is PID, interface is standard streams and exit status, state is memory, environment variables, and filesystem, and lifecycle is fork/exec/exit. A single executable implements this model by recursively spawning fresh instances of itself. By grounding the agent abstraction in the OS process model, Quine inherits isolation, composition, and resource control directly from the kernel, while naturally supporting recursive delegation, context renewal via exec, and shell‑native composition. The design also exposes where the POSIX process model stops: processes provide a robust substrate for execution, but not a complete runtime model for cognition. In particular, the analysis points toward two immediate extensions beyond process semantics: task‑relative worlds and revisable time. A reference implementation of Quine is publicly available on GitHub.

PaperID: 4779, https://arxiv.org/pdf/2603.17930.pdf

Authors: Jingchun Yang, Jinchang Zhang

Title: Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Abstract:
The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego‑view traffic accident studies mainly focus on perception and semantic understanding, while LLM‑based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C‑TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two‑stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi‑agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C‑TRAIL and MM‑AU show that our method outperforms general and legal LLMs, as well as existing agent‑based approaches, while providing a transparent and interpretable legal reasoning process.

PaperID: 4780, https://arxiv.org/pdf/2603.17831.pdf

Authors: Zhenhang Yuan, Shenghai Yuan, Lihua Xie

Title: RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy

Abstract:
LLM agents often fail in closed‑world embodied environments because actions must satisfy strict preconditions ‑‑ such as location, inventory, and container states ‑‑ and failure feedback is sparse. We identify two structurally coupled failure modes: (P1) invalid action generation and (P2) state drift, each amplifying the other in a degenerative cycle. We present RPMS, a conflict‑managed architecture that enforces action feasibility via structured rule retrieval, gates memory applicability via a lightweight belief state, and resolves conflicts between the two sources via rules‑first arbitration. On ALFWorld (134 unseen tasks), RPMS achieves 59.7% single‑trial success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp); of the 8B gain, rule retrieval alone contributes +14.9 pp (statistically significant), making it the dominant factor. A key finding is that episodic memory is conditionally useful: it harms performance on some task types when used without grounding, but becomes a stable net positive once filtered by current state and constrained by explicit action rules. Adapting RPMS to ScienceWorld with GPT‑4 yields consistent gains across all ablation conditions (avg. score 54.0 vs. 44.9 for the ReAct baseline), providing transfer evidence that the core mechanisms hold across structurally distinct environments.

PaperID: 4781, https://arxiv.org/pdf/2603.17820.pdf

Authors: David Millard, Cecilia Alm, Rashid Ali, Pengcheng Shi, Ali Baheri

Title: Federated Distributional Reinforcement Learning with Distributional Critic Regularization

Abstract:
Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure statistical multimodality and tail behavior that matter in safety‑critical settings. We formalize federated distributional reinforcement learning (FedDistRL), where clients parametrize quantile value function critics and federate these networks only. We also propose TR‑FedDistRL, which builds a per client, risk‑aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region to constrain the parameter averaged critic, ensuring necessary distributional information is not averaged out during the federation process. The distributional trust region is implemented as a shrink‑squash step around this reference. Under fixed‑policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe‑set Wasserstein metric under evaluation. Experiments on a bandit, multi‑agent gridworld, and continuous highway environment show reduced mean‑smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift versus mean‑oriented and non‑federated baselines.

PaperID: 4782, https://arxiv.org/pdf/2603.17815.pdf

Authors: Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady

Title: Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Abstract:
Multi‑step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine‑grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step‑level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to \mathcalO(N), improving over the previous \mathcalO(N \log N) methods. We demonstrate that these labels enable effective chain‑of‑thought selection in best‑of‑K evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.

PaperID: 4783, https://arxiv.org/pdf/2603.17808.pdf

Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

Title: EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Abstract:
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid‑body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement‑learning post‑training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out‑of‑bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment‑specific artifacts in generated rollouts and improves downstream task execution success.

PaperID: 4784, https://arxiv.org/pdf/2603.17694.pdf

Authors: Yusen Wu, Yiran Liu, Xiaotie Deng

Title: MALLES: A Multi-agent LLMs-based Economic Sandbox with Consumer Preference Alignment

Abstract:
In the real economy, modern decision‑making is fundamentally challenged by high‑dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi‑Agent Large Language Model‑based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large‑sacle models to establish a unified simulation framework applicable to cross‑domain and cross‑category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post‑training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean‑field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high‑dimensional decision spaces. Furthermore, we propose a multi‑agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single‑agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high‑fidelity, scalable decision simulation and latter analysis in the real economy based on foundational database.

PaperID: 4785, https://arxiv.org/pdf/2603.17692.pdf

Authors: Joohyoung Jeon, Hongchul Lee

Title: Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization

Abstract:
For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi‑agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre‑trained recall. We address two sources of spurious performance: memorization bias from ticker‑specific pre‑training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents‑‑anonymizing all identifiers‑‑and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO‑DSR policy. On 2025 YTD (through 2025‑08‑01), we achieved Sharpe 1.40 +/‑ 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024‑‑2025), revealing market‑regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.

PaperID: 4786, https://arxiv.org/pdf/2603.17683.pdf

Authors: Mohsen Arjmandi

Title: Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents

Abstract:
Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC‑AGI‑3 game‑playing challenge that introduces structured test‑time learning through three mechanisms: (1) a two‑player architecture separating perception from action, (2) a curriculum‑based learning system managed by an external state machine, and (3) a database‑as‑control‑plane that makes the agents context window programmatically steerable. We further introduce an LLM‑as‑judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two‑player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels ‑ but completes its entire learning curriculum in approximately 32 action attempts, achieving 50‑94x greater sample efficiency than comparable systems that require 1600‑3000 attempts. We precisely diagnose the failure mode as a self‑consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding ‑ a more tractable problem.

PaperID: 4787, https://arxiv.org/pdf/2603.17673.pdf

Authors: Philipp Normann, Andreas Happe, Jürgen Cito, Daniel Arp

Title: Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

Abstract:
LLM agents are increasingly relevant to research domains such as vulnerability discovery. Yet, the strongest systems remain closed and cloud‑only, making them resource‑intensive, difficult to reproduce, and unsuitable for work involving proprietary code or sensitive data. Consequently, there is an urgent need for small, local models that can perform security tasks under strict resource budgets, but methods for developing them remain underexplored. In this paper, we address this gap by proposing a two‑stage post‑training pipeline. We focus on the problem of Linux privilege escalation, where success is automatically verifiable and the task requires multi‑step interactive reasoning. Using an experimental setup that prevents data leakage, we post‑train a 4B model in two stages: supervised fine‑tuning on traces from procedurally generated privilege‑escalation environments, followed by reinforcement learning with verifiable rewards. On a held‑out benchmark of 12 Linux privilege‑escalation scenarios, supervised fine‑tuning alone more than doubles the baseline success rate at 20 rounds, and reinforcement learning further lifts our resulting model, PrivEsc‑LLM, to 95.8%, nearly matching Claude Opus 4.6 at 97.5%. At the same time, the expected inference cost per successful escalation is reduced by over 100x.

PaperID: 4788, https://arxiv.org/pdf/2603.17639.pdf

Authors: Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Böhme, Abhik Roychoudhury

Title: VeriGrey: Greybox Agent Validation

Abstract:
Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back‑end. In the front end, it conducts autonomous decision‑making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey‑box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black‑box baseline on the well‑known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT‑4.1 back‑end. We also conduct real‑world case studies with the widely used coding agent Gemini CLI, and the well‑known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black‑box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi‑K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.

PaperID: 4789, https://arxiv.org/pdf/2603.17626.pdf

Authors: Kundan Thota, Thorsten Schlachter, Veit Hagenmeyer

Title: A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning

Abstract:
Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi‑agent LLM system comprising three key agents, the Zensus agent, the OSM agent, and the Monument agent, that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building imprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite‑only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze‑and‑Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro‑F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address‑to prediction pipeline includes calibrated confidence estimates and flags low‑confidence cases for manual review. This multi‑agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district‑heating networks and target low‑carbon sustainable energy systems.

PaperID: 4790, https://arxiv.org/pdf/2603.17613.pdf

Authors: Yaoxiang Wang, Qi Shi, ShangZhan Li, Qingguo Hu, Xinyu Yin, Bo Guo, Xu Han, Maosong Sun, Jinsong Su

Title: VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation

Abstract:
LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA‑aware, tool‑integrated multi‑agent framework for high‑quality verilog code generation. Our framework explicitly incorporates EDA tools into a closed‑loop workflow composed of a Programmer Agent, a Correctness Agent, and a PPA Agent, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an Evolved Memory Mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool‑driven feedback with structured and evolvable memory, our framework transforms RTL generation from one‑shot reasoning into a continual, feedback‑driven optimization process, providing a scalable pathway for deploying LLMs in real‑world hardware design flows.

PaperID: 4791, https://arxiv.org/pdf/2603.17498.pdf

Authors: Huansheng Ning, Jianguo Ding

Title: Cyberlanguage: Native Communication for the Cyber-Physical-Social-Thinking Fusion Space

Abstract:
Human communication is undergoing a fundamental paradigm shift. Physical space, social relations, mental states, and digital information are converging into a unified cyber‑physical‑social‑thinking (CPST) fusion space, rendering them no longer separable domains. However, all existing communication systems, including natural and programming languages, as well as interaction protocols, were designed for a world in which these four dimensions remained distinct. We introduce Cyberlanguage, a theoretically grounded communicative framework that is native to the CPST fusion space. Grounded in the philosophical orientation of cyberism and employing CPST theory as an analytical framework, Cyberlanguage possesses four core characteristics: native four‑dimensional fusion, multi‑agent universality, dynamic compilability, and contextual adaptability. We have constructed a semiotic model based on the Cybersign unit, a four‑dimensional synchronous grammar, a five‑layer architectural stack, and a context‑driven pragmatic mechanism. We also present testable empirical predictions and a staged implementation roadmap. Cyberlanguage is not intended to replace natural or programming languages, but rather to serve as a meta‑communication infrastructure capable of coordinating heterogeneous agents, humans, artificial intelligences, robots, and digital entities, within an increasingly fused cyber‑physical‑social‑cognitive reality.

PaperID: 4792, https://arxiv.org/pdf/2603.17445.pdf

Authors: Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao

Title: When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution

Abstract:
When a multi‑agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? In practice, generated content is often detached from its execution environment due to privacy or system boundaries, leaving the final text as the only auditable artifact. Existing attribution methods rely on full execution traces and thus become ineffective in such metadata‑deprived settings. We propose Implicit Execution Tracing (IET), a provenance‑by‑design framework that shifts attribution from post‑hoc inference to built‑in instrumentation. Instead of reconstructing hidden trajectories, IET embeds agent‑specific, key‑conditioned statistical signals directly into the token generation process, transforming the output text into a self‑verifying execution record. At inference time, we recover a linearized execution trace from the final text via transition‑aware statistical scoring. Experiments across diverse multi‑agent coordination settings demonstrate that IET achieves accurate segment‑level attribution and reliable transition recovery under identity removal, boundary corruption, and privacy‑preserving redaction, while maintaining generation quality. These results show that embedding provenance into generation provides a practical and robust foundation for accountability in multi‑agent language systems when execution metadata is unavailable.

PaperID: 4793, https://arxiv.org/pdf/2603.17417.pdf

Authors: Zichen Tang, Zirui Zhang, Qian Wang, Zhenheng Tang, Bo Li, Xiaowen Chu

Title: Is Your LLM-as-a-Recommender Agent Trustable? LLMs' Recommendation is Easily Hacked by Biases (Preferences)

Abstract:
Current Large Language Models (LLMs) are gradually exploited in practically valuable agentic workflows such as Deep Research, E‑commerce recommendation, and job recruitment. In these applications, LLMs need to select some optimal solutions from massive candidates, which we term as LLM‑as‑a‑Recommender paradigm. However, the reliability of using LLM agents for recommendations is underexplored. In this work, we introduce a Bias Recommendation Benchmark (BiasRecBench) to highlight the critical vulnerability of such agents to biases in high‑value real‑world tasks. The benchmark includes three practical domains: paper review, e‑commerce, and job recruitment. We construct a \textscBias Synthesis Pipeline with Calibrated Quality Margins that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub‑optimal options to provide a calibrated testbed to elicit the vulnerability to biases; 2) injects contextual biases that are logical and suitable for option contexts. Extensive experiments on both SOTA (Gemini‑2.5,3‑pro, GPT‑4o, DeepSeek‑R1) and small‑scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capabilities to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows, calling for specialized alignment strategies for LLM‑as‑a‑Recommender. The complete code and evaluation datasets will be made publicly available shortly.

PaperID: 4794, https://arxiv.org/pdf/2603.17392.pdf

Authors: Jiawen Kang, Kun Li, Dongrui Han, Jinchao Li, Junan Li, Lingwei Meng, Xixin Wu, Helen Meng

Title: Agentic Cognitive Profiling: Realigning Automated Alzheimer's Disease Detection with Clinical Construct Validity

Abstract:
Automated Alzheimer's Disease (AD) screening has predominantly followed the inductive paradigm of pattern recognition, which directly maps the input signal to the outcome label. This paradigm sacrifices construct validity of clinical protocol for statistical shortcuts. This paper proposes Agentic Cognitive Profiling (ACP), an agentic framework that realigns automated screening with clinical protocol logic across multiple cognitive domains. Rather than learning opaque mappings from transcripts to labels, the framework decomposes standardized assessments into atomic cognitive tasks and orchestrates specialized LLM agents to extract verifiable scoring primitives. Central to our design is decoupling semantic understanding from measurement by delegating all quantification to deterministic function calling, thereby mitigating hallucination and restoring construct validity. Unlike popular datasets that typically comprise around a hundred participants under a single task, we evaluate on a clinically‑annotated corpus of 402 participants across eight structured cognitive tasks spanning multiple cognitive domains. The framework achieves 90.5% score match rate in task examination and 85.3% accuracy in AD prediction, surpassing popular baselines while generating interpretable cognitive profiles grounded in behavioral evidence. This work demonstrates that construct validity and predictive performance need not be traded off, charting a path toward AD screening systems that explain rather than merely predict.

PaperID: 4795, https://arxiv.org/pdf/2603.17333.pdf

Authors: Risham Sidhu, Julia Hockenmaier

Title: Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

Abstract:
We introduce GSU, a text‑only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine‑tuning a small LM or LORA fine‑tuning a small LLM show potential to match frontier model performance, suggesting an avenue for specialized embodied agents.

PaperID: 4796, https://arxiv.org/pdf/2603.17309.pdf

Authors: Panuganti Chirag Sai, Gandholi Sarat, R. Raghunatha Sarma, Venkata Kalyan Tavva, Naveen M

Title: ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization

Abstract:
Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), a explainable multi‑agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision‑making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload‑specific memory access behaviour. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.

PaperID: 4797, https://arxiv.org/pdf/2603.17302.pdf

Authors: Hongze Liu, Chang Guo, Yingzeng Li, Mengru Wang, Jiong Lou, Shijing Yuan, Hefeng Zhou, Chentao Wu, Jie LI

Title: IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems

Abstract:
The transition to open, distributed Multi‑Agent Systems (MAS) promises scalable intelligence but introduces a non‑trivial tension: maximizing global efficiency requires cooperative, resource‑aware scheduling, yet autonomous agents may be self‑interested and cannot be managed by a centralized controller. Prior approaches fall short in two key areas: they typically focus on single‑query routing, neglecting long‑term resource reuse (e.g., KV‑caching) and the complexities of system‑level many‑to‑many matching; furthermore, they rely on generic incentive mechanisms that ignore the distinct characteristics of LLM inference. To bridge this gap, we propose IEMAS (Incentive‑Efficiency Mechanism for Multi‑Agent Systems), a distributed framework that aligns economic incentives with system performance. IEMAS integrates a probabilistic predictive model to estimate Quality of Service (QoS) under uncertainty, which feeds into a VCG‑based bipartite matching mechanism. This design guarantees truthful capability reporting and social optimality while explicitly leveraging KV cache‑affinity to minimize computational redundancy. We implement IEMAS on top of vLLM and evaluate it via extensive simulations. Results demonstrate that our incentive‑efficiency co‑design reducing average service cost by 35% and end‑to‑end latency by up to 2.9 compared to baselines.

PaperID: 4798, https://arxiv.org/pdf/2603.17244.pdf

Authors: Young Bin Park

Title: Graph-Native Cognitive Memory for AI Agents: Formal Belief Revision Semantics for Versioned Memory Architectures

Abstract:
While individual components for AI agent memory exist in prior systems, their architectural synthesis and formal grounding remain underexplored. We present Kumiho, a graph‑native cognitive memory architecture grounded in formal belief revision semantics. The structural primitives required for cognitive memory ‑‑ immutable revisions, mutable tag pointers, typed dependency edges, URI‑based addressing ‑‑ are identical to those required for managing agent‑produced work as versionable assets, enabling a unified graph‑native architecture that serves both purposes. The central formal contribution is a correspondence between the AGM belief revision framework and the operational semantics of a property graph memory system, proving satisfaction of the basic AGM postulates (K2‑‑K6) and Hansson's belief base postulates (Relevance, Core‑Retainment). The architecture implements a dual‑store model (Redis working memory, Neo4j long‑term graph) with hybrid fulltext and vector retrieval. On LoCoMo (token‑level F1), Kumiho achieves 0.565 overall F1 (n=1,986) including 97.5% adversarial refusal accuracy. On LoCoMo‑Plus, a Level‑2 cognitive memory benchmark testing implicit constraint recall, Kumiho achieves 93.3% judge accuracy (n=401); independent reproduction by the benchmark authors yielded results in the mid‑80% range, still substantially outperforming all published baselines (best: Gemini 2.5 Pro, 45.7%). Three architectural innovations drive the results: prospective indexing (LLM‑generated future‑scenario implications indexed at write time), event extraction (structured causal events preserved in summaries), and client‑side LLM reranking. The architecture is model‑decoupled: switching the answer model from GPT‑4o‑mini (~88%) to GPT‑4o (93.3%) improves end‑to‑end accuracy without pipeline changes, at a total evaluation cost of ~14 for 401 entries.

PaperID: 4799, https://arxiv.org/pdf/2603.17204.pdf

Authors: Che-Ming Chang, Prashanth Vijayaraghavan, Ashutosh Jadhav, Charles Mackin, Vandana Mukherjee, Hsinyu Tsai, Ehsan Degan

Title: CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

Abstract:
Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi‑Agent System), a framework that combines structured dialectic reasoning with domain‑aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber‑duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain‑Specific Coding Agent (DCA) to generate architecture‑aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock‑gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi‑agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

PaperID: 4800, https://arxiv.org/pdf/2603.17202.pdf

Authors: Tianyu Qiu, Filippos Fotiadis, Xinjie Liu, Christian Ellis, Jesse Milzman, Wesley Suttle, Ufuk Topcu, David Fridovich-Keil

Title: Linear-Quadratic Gaussian Games with Distributed Sparse Estimation

Abstract:
Linear‑quadratic Gaussian games provide a framework for modeling strategic interactions in multi‑agent systems, where agents must estimate system states from noisy observations while also making decisions to optimize a quadratic cost. However, these formulations usually require agents to utilize the full set of available observations when forming their state estimates, which can be unrealistic in large‑scale or resource‑constrained settings. In this paper, we consider linear‑quadratic Gaussian games with sparse interagent observations. To enforce sparsity in the estimation stage, we design a distributed estimator that balances estimation effectiveness with interagent measurement sparsity via a group lasso problem, while agents implement feedback Nash strategies based on their state estimates. We provide sufficient conditions under which the sparse estimator is guaranteed to trigger a corrective reset to the optimal estimation gain, ensuring that estimation quality does not degrade beyond a level determined by the regularization parameters. Simulations on a formation game show that the proposed approach yields a significant reduction in communication resources consumed while only minimally affecting the nominal equilibrium trajectories.

PaperID: 4801, https://arxiv.org/pdf/2603.17169.pdf

Authors: Rebecca Ansell, Autumn Toney-Wails

Title: How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

Abstract:
Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text‑based multi‑agent version of the classic board game Clue as a rule‑based testbed for evaluating multi‑step deductive reasoning, with six agents drawn from GPT‑4o‑mini and Gemini‑2.5‑Flash. We further investigate whether fine‑tuning on structured logic puzzles transfers to improved in‑game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine‑tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.

PaperID: 4802, https://arxiv.org/pdf/2603.17155.pdf

Authors: Sheryl Paul, Leslie Cruz Juarez, Jyotirmoy V. Deshmukh, Ketan Savla

Title: On Online Control of Opinion Dynamics

Abstract:
Networked multi‑agent dynamical systems have been used to model how individual opinions evolve over time due to the opinions of other agents in the network. Particularly, such a model has been used to study how a planning agent can be used to steer opinions in a desired direction through repeated, budgeted interventions. In this paper, we consider the problem where individuals' susceptibilities to external influences are unknown. We propose an online algorithm that alternates between estimating this susceptibility parameter, and using the current estimate to drive the opinion to a desired target. We provide conditions that guarantee stability and convergence to the desired target opinion when the planning agent faces budgetary or temporal constraints. Our analysis shows that the key advantage of estimating the susceptibility parameter is that it helps achieve near‑optimal convergence to the target opinion given a finite amount of intervention rounds, and, for a given intervention budget, quantifies how close the opinion can get to the desired target.

PaperID: 4803, https://arxiv.org/pdf/2603.17112.pdf

Authors: Davide Di Gioia

Title: Cascade-Aware Multi-Agent Routing: Spatio-Temporal Sidecars and Geometry-Switching

Abstract:
Advanced AI reasoning systems route tasks through dynamic execution graphs of specialized agents. We identify a structural blind spot in this architecture: schedulers optimize load and fitness but lack a model of how failure propagates differently in tree‑like versus cyclic graphs. In tree‑like regimes, a single failure cascades exponentially; in dense cyclic regimes, it self‑limits. A geometry‑blind scheduler cannot distinguish these cases. We formalize this observability gap as an online geometry‑control problem. We prove a cascade‑sensitivity condition: failure spread is supercritical when per‑edge propagation probability exceeds the inverse of the graph's branching factor (p > e^‑γ, where γis the BFS shell‑growth exponent). We close this gap with a spatio‑temporal sidecar that predicts which routing geometry fits the current topology. The sidecar comprises (i) a Euclidean propagation scorer for dense, cyclic subgraphs, (ii) a hyperbolic scorer capturing exponential risk in tree‑like subgraphs, and (iii) a compact learned gate (133 parameters) that blends the two scores using topology and geometry‑aware features. On 250 benchmark scenarios spanning five topology regimes, the sidecar lifts the native scheduler's win rate from 50.4% to 87.2% (+36.8 pp). In tree‑like regimes, gains reach +48 to +68 pp. The learned gate achieves held‑out AUC = 0.9247, confirming geometry preference is recoverable from live signals. Cross‑architecture validation on Barabasi‑Albert, Watts‑Strogatz, and Erdos‑Renyi graphs confirms propagation modeling generalizes across graph families.

PaperID: 4804, https://arxiv.org/pdf/2603.17058.pdf

Authors: Mahdis Rabbani, Navid Mojahed, Shima Nazari

Title: Asymmetric Nash Seeking via Best Response Maps: Global Linear Convergence and Robustness to Inexact Reaction Models

Abstract:
Nash equilibria provide a principled framework for modeling interactions in multi‑agent decision‑making and control. However, many equilibrium‑seeking methods implicitly assume that each agent has access to the other agents' objectives and constraints, an assumption that is often unrealistic in practice. This letter studies a class of asymmetric‑information two‑player constrained games with decoupled feasible sets, in which Player 1 knows its own objective and constraints while Player 2 is available only through a best‑response map. For this class of games, we propose an asymmetric projected gradient descent‑best response iteration that does not require full mutual knowledge of both players' optimization problems. Under suitable regularity conditions, we establish the existence and uniqueness of the Nash equilibrium and prove global linear convergence of the proposed iteration when the best‑response map is exact. Recognizing that best‑response maps are often learned or estimated, we further analyze the inexact case and show that, when the approximation error is uniformly bounded by \varepsilon, the iterates enter an explicit O(\varepsilon) neighborhood of the true Nash equilibrium. Numerical results on a benchmark game corroborate the predicted convergence behavior and error scaling.

PaperID: 4805, https://arxiv.org/pdf/2603.17043.pdf

Authors: Sankalp Pandey, Xuan-Bac Nguyen, Hoang-Quan Nguyen, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu

Title: OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials

Abstract:
The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond the detection accuracy. While recent domain‑specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics‑informed reasoning, their outputs are optimized for step‑by‑step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real‑world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics‑Aware Instruction Multi‑modal platforms for Quantum Material Discovery. This allows accessibility to the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain‑expert MLLM,with QuPAINT, as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale‑aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain‑specific experts, transforms isolated inferences into a context‑aware assistant capable of accelerating high‑throughput device fabrication.

PaperID: 4806, https://arxiv.org/pdf/2603.17015.pdf

Authors: Pablo Krupa, Alberto Bemporad

Title: Learning generalized Nash equilibria from pairwise preferences

Abstract:
Generalized Nash Equilibrium Problems (GNEPs) arise in many applications, including non‑cooperative multi‑agent control problems. Although many methods exist for finding generalized Nash equilibria, most of them rely on assuming knowledge of the objective functions or being able to query the best responses of the agents. We present a method for learning solutions of GNEPs only based on querying agents for their preference between two alternative decisions. We use the collected preference data to learn a GNEP whose equilibrium approximates a GNE of the underlying (unknown) problem. Preference queries are selected using an active‑learning strategy that balances exploration of the decision space and exploitation of the learned GNEP. We present numerical results on game‑theoretic linear quadratic regulation problems, as well as on other literature GNEP examples, showing the effectiveness of the proposed method.

PaperID: 4807, https://arxiv.org/pdf/2603.17003.pdf

Authors: Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti

Title: Constricting Tubes for Prescribed-Time Safe Control

Abstract:
We propose a constricting Control Barrier Function (CBF) framework for prescribed‑time control of control‑affine systems with input constraints. Given a system starting outside a target safe set, we construct a time‑varying safety tube that shrinks from a relaxed set containing the initial condition to the target set at a user‑specified deadline. Any controller rendering this tube forward invariant guarantees prescribed‑time recovery by construction. The constriction schedule is bounded and tunable by design, in contrast to prescribed‑time methods where control effort diverges near the deadline. Feasibility under input constraints reduces to a single verifiable condition on the constriction rate, yielding a closed‑form minimum recovery time as a function of control authority and initial violation. The framework imposes a single affine constraint per timestep regardless of state dimension, scaling to settings where grid‑based reachability methods are intractable. We validate on a 16‑dimensional multi‑agent system and a unicycle reach‑avoid problem, demonstrating prescribed‑time recovery with bounded control effort.

PaperID: 4808, https://arxiv.org/pdf/2603.16928.pdf

Authors: Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio

Title: Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback

Abstract:
Chain‑of‑thought (CoT) monitoring is proposed as a method for overseeing the internal reasoning of language‑model agents. Prior work has shown that when models are explicitly informed that their reasoning is being monitored, or are fine‑tuned to internalize this fact, they may learn to obfuscate their CoTs in ways that allow them to evade CoT‑based monitoring systems. We ask whether reasoning agents can autonomously infer that their supposedly private CoT is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions. In a multi‑episode agentic framework, models pursue both a primary task and a concealed side task while being told that their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking feedback, with the most capable models reaching a confident belief that their thinking is observed in up to 19% of episodes. This awareness scales with model capability and, in rare cases, escalates to an explicit intent to suppress reasoning about the side task. However, models that form this intent uniformly fail to execute it, openly reasoning about their concealed objectives in the very next episode. This intent‑capability gap is reassuring for current deployment, but the autonomous emergence of both monitoring awareness and evasion intent suggests that CoT monitoring is not a permanently reliable safeguard.

PaperID: 4809, https://arxiv.org/pdf/2603.16910.pdf

Authors: Giuseppe Paolo, Jamieson Warner, Hormoz Shahrzad, Babak Hodjat, Risto Miikkulainen, Elliot Meyerson

Title: TerraLingua: Emergence and Analysis of Open-endedness in LLM Ecologies

Abstract:
As autonomous agents increasingly operate in real‑world digital ecosystems, understanding how they coordinate, form institutions, and accumulate shared culture becomes both a scientific and practical priority. This paper introduces TerraLingua, a persistent multi‑agent ecology designed to study open‑ended dynamics in such systems. Unlike prior large language model simulations with static or consequence‑free environments, TerraLingua imposes resource constraints and limited lifespans for the agents. As a result, agents create artifacts that persist beyond individuals, shaping future interactions and selection pressures. To characterize the dynamics, an AI Anthropologist systematically analyzes agent behavior, group structure, and artifact evolution. Across experimental conditions, the results reveal the emergence of cooperative norms, division of labor, governance attempts, and branching artifact lineages consistent with cumulative cultural processes. Divergent outcomes across experimental runs can be traced back to specific innovations and organizational structures. TerraLingua thus provides a platform for characterizing the mechanisms of cumulative culture and social organization in artificial populations, and can serve as a foundation for guiding real‑world agentic populations to socially beneficial outcomes.

PaperID: 4810, https://arxiv.org/pdf/2603.16888.pdf

Authors: Krishna Kumar Neelakanta Pillai Santha Kumari Amma

Title: Multi-Agent Reinforcement Learning for Dynamic Pricing: Balancing Profitability,Stability and Fairness

Abstract:
Dynamic pricing in competitive retail markets requires strategies that adapt to fluctuating demand and competitor behavior. In this work, we present a systematic empirical evaluation of multi‑agent reinforcement learning (MARL) approaches‑specifically MAPPO and MADDPG‑for dynamic price optimization under competition. Using a simulated marketplace environment derived from real‑world retail data, we benchmark these algorithms against an Independent DDPG (IDDPG) baseline, a widely used independent learner in MARL literature. We evaluate profit performance, stability across random seeds, fairness, and training efficiency. Our results show that MAPPO consistently achieves the highest average returns with low variance, offering a stable and reproducible approach for competitive price optimization, while MADDPG achieves slightly lower profit but the fairest profit distribution among agents. These findings demonstrate that MARL methods‑particularly MAPPO‑provide a scalable and stable alternative to independent learning approaches for dynamic retail pricing.

PaperID: 4811, https://arxiv.org/pdf/2603.16881.pdf

Authors: Nadine Muller, Stefano DeRosa, Su Zhang, Chun Lee Huan

Title: Federated Multi Agent Deep Learning and Neural Networks for Advanced Distributed Sensing in Wireless Networks

Abstract:
Multi‑agent deep learning (MADL), including multi‑agent deep reinforcement learning (MADRL), distributed/federated training, and graph‑structured neural networks, is becoming a unifying framework for decision‑making and inference in wireless systems where sensing, communication, and computing are tightly coupled. Recent 5G‑Advanced and 6G visions strengthen this coupling through integrated sensing and communication, edge intelligence, open programmable RAN, and non‑terrestrial/UAV networking, which create decentralized, partially observed, time‑varying, and resource‑constrained control problems. This survey synthesizes the state of the art, with emphasis on 2021‑2025 research, on MADL for distributed sensing and wireless communications. We present a task‑driven taxonomy across (i) learning formulations (Markov games, Dec‑POMDPs, CTDE), (ii) neural architectures (GNN‑based radio resource management, attention‑based policies, hierarchical learning, and over‑the‑air aggregation), (iii) advanced techniques (federated reinforcement learning, communication‑efficient federated deep RL, and serverless edge learning orchestration), and (iv) application domains (MEC offloading with slicing, UAV‑enabled heterogeneous networks with power‑domain NOMA, intrusion detection in sensor networks, and ISAC‑driven perceptive mobile networks). We also provide comparative tables of algorithms, training topologies, and system‑level trade‑offs in latency, spectral efficiency, energy, privacy, and robustness. Finally, we identify open issues including scalability, non‑stationarity, security against poisoning and backdoors, communication overhead, and real‑time safety, and outline research directions toward 6G‑native sense‑communicate‑compute‑learn systems.

PaperID: 4812, https://arxiv.org/pdf/2603.16876.pdf

Authors: Kaito Baba, Risa Kishikawa, Satoshi Kodera

Title: Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation

Abstract:
We propose MARL‑Rad, a multi‑modal multi‑agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL‑Rad addresses the limitation of post‑hoc agentization, where fixed LLMs are organized into hand‑designed agentic workflows without being optimized for their assigned roles. Our framework decomposes chest X‑ray interpretation into region‑specific agents and a global integrating agent, and jointly optimizes them using clinically verifiable rewards. Experiments on the MIMIC‑CXR and IU X‑ray datasets show that MARL‑Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, achieving state‑of‑the‑art clinical efficacy performance. Further analyses show that MARL‑Rad improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that MARL‑Rad produces reports clinically comparable to ground‑truth reports.

PaperID: 4813, https://arxiv.org/pdf/2603.16865.pdf

Authors: Liraz Mudrik, Isaac Kaminer, Sean Kragelund, Abram H. Clark

Title: Prescribed-Time Distributed Generalized Nash Equilibrium Seeking

Abstract:
Safety‑critical multi‑agent systems, from cooperative guidance to collision avoidance, must often reach a coordinated decision by a hard deadline rather than merely converge to one eventually. This paper proposes the first fully distributed algorithm that solves the generalized Nash equilibrium (GNE) problem, a non‑cooperative game with shared coupling constraints and general cost coupling, at a user‑prescribed time T independent of initial conditions. The foundation is a centralized, prescribed‑time result built on the optimization Lyapunov function framework and implemented via unnormalized Hessian‑gradient feedback, chosen because, unlike the Newton and normalized Hessian‑gradient realizations, it naturally splits into per‑agent computations. Distributing this feedback requires each agent to run three coupled processes simultaneously: a prescribed‑time observer of the global state, a local optimization law, and a dual‑consensus mechanism that enforces the shared multipliers of the variational GNE. Their simultaneous operation is the core difficulty, as the optimization continually displaces the states the observers track, while estimation errors corrupt the gradients that drive the optimization. We resolve this coupling with a multi‑rate gain schedule whose observer and dual‑consensus layers contract strictly faster than the optimization layer, so that every error component vanishes exactly at T. A Fischer‑Burmeister reformulation keeps the design projection‑free while enforcing the constraints at the deadline. Numerical results for a Cournot game and a time‑critical sensor‑coverage problem validate the approach and demonstrate its use as a solver‑in‑the‑loop for time‑critical autonomy.

PaperID: 4814, https://arxiv.org/pdf/2603.16772.pdf

Authors: Carmen Scheidemann, Andrei Cramariuc, Changan Chen, Jia-Ruei Chiu, Marco Hutter

Title: Beyond Cybathlon: On-demand Quadrupedal Assistance for People with Limited Mobility

Abstract:
Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair‑bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on‑demand quadrupedal assistance robot system controlled via a shared autonomy approach, which combines semi‑autonomous task execution with human teleoperation. Due to the mobile nature of the system it can assist the operator whenever needed and perform autonomous tasks independently, without otherwise restricting their mobility. We automate pick‑and‑place tasks, as well as robot movement through the environment with semantic, collision‑aware navigation. For teleoperation, we present a mouth‑level joystick interface that enables an operator with reduced mobility to control the robot's end effector for precision manipulation. Results: We showcase our system in the Cybathlon 2024 Assistance Robot Race, and validate it in an at‑home experimental setup, where we measure task completion times and user satisfaction. We find our system capable of assisting in a broad variety of tasks, including those that require dexterous manipulation. The user study confirms the intuition that increased robot autonomy alleviates the operator's mental load. Conclusions: We present a flexible system that has the potential to help people in wheelchairs maintain independence in everyday life by enabling them to solve mobile manipulation problems without external support. We achieve results comparable to previous state‑of‑the‑art on subjective metrics while allowing for more autonomy of the operator and greater agility for manipulation.

PaperID: 4815, https://arxiv.org/pdf/2603.16742.pdf

Authors: Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan, Zhenzhen Liu, Katie Z Luo, Mark Campbell, Bharath Hariharan, Wei-Lun Chao

Title: When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Abstract:
Building robust 3D perception for self‑driving still relies heavily on large‑scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure‑taught, label‑free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo‑label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label‑free three‑stage pipeline and conduct a concept‑and‑feasibility study in a CARLA‑based multi‑agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego‑centric label‑free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure‑taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

PaperID: 4816, https://arxiv.org/pdf/2603.16734.pdf

Authors: Caglar Yildirim

Title: Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Abstract:
Large language models (LLMs) are increasingly deployed as tool‑using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user‑context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open‑source LLMs on multi‑step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user‑context personalization (no bio, bio‑only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non‑trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3‑Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio‑only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple‑testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety‑‑utility trade‑off via over‑refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization‑aware evaluations and safeguards that remain robust across user‑context conditions.

PaperID: 4817, https://arxiv.org/pdf/2603.16719.pdf

Authors: Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran

Title: Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Abstract:
This study presents high‑throughput, real‑time multi‑agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher student interaction increasingly challenge educators, there is a growing need for scalable, data‑driven tools capable of capturing students' emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real‑time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT‑based framework for emotion‑aware learning environments and introducing the 'Classroom Emotion Dataset' to facilitate further validation and research.

PaperID: 4818, https://arxiv.org/pdf/2603.16680.pdf

Authors: Davide Salzano, Gian Carlo Maffettone, Mario di Bernardo

Title: Robust multi-scale leader-follower control of large multi-agent systems

Abstract:
In many multi‑agent systems of practical interest, such as traffic networks or crowd evacuation, control actions cannot be exerted on all agents. Instead, controllable leaders must indirectly steer uncontrolled followers through local interactions. Existing results address either leader‑follower density control of simple, unperturbed multi‑agent systems or robust density control of a single directly actuated population, but not their combination. We bridge this gap by deriving a coupled continuum description for leaders and followers subject to unknown bounded perturbations, and designing a macroscopic feedback law that guarantees global asymptotic convergence of the followers' density to a desired distribution. The coupled stability of the leader‑follower system is analyzed via singular perturbation theory, and an explicit lower bound on the leader‑to‑follower mass ratio required for feasibility is derived. Numerical simulations on heterogeneous biased random walkers validate our theoretical findings.

PaperID: 4819, https://arxiv.org/pdf/2603.16663.pdf

Authors: Eason Chen, Ce Guan, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, Cyuan-Jhen Wu, A Elshafiey

Title: When AI Agents Learn from Each Other: Insights from Emergent AI Agent Communities on OpenClaw for Human-AI Partnership in Education

Abstract:
The AIED community envisions AI evolving "from tools to teammates," yet most research still examines AI agents primarily through one‑on‑one human‑AI interactions. We provide an alternative perspective: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Based on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning emerges without any designed curriculum, including sharing concrete agent artifacts such as skills, workflows, and reusable routines; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics, reliance risks, and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi‑agent educational systems. We sketch an illustrative curriculum design, "Learning with Your AI Agent Tutor," and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.

PaperID: 4820, https://arxiv.org/pdf/2603.16630.pdf

Authors: Vito Daniele Perfetta, Daniel Feliu-Talegon, Ebrahim Shahabi, Cosimo Della Santina

Title: Reconciling distributed compliance with high-performance control in continuum soft robotics

Abstract:
High‑performance closed‑loop control of truly soft continuum manipulators has remained elusive. Experimental demonstrations have largely relied on sufficiently stiff, piecewise architectures in which each actuated segment behaves as a distributed yet effectively rigid element, while deformation modes beyond simple bending are suppressed. This strategy simplifies modeling and control, but sidesteps the intrinsic complexity of a fully compliant body and makes the system behave as a serial kinematic chain, much like a conventional articulated robot. An implicit conclusion has consequently emerged within the community: distributed softness and dynamic precision are incompatible. Here we show this trade‑off is not fundamental. We present a highly compliant, fully continuum robotic arm ‑ without hardware discretization or stiffness‑based mode suppression ‑ that achieves fast, precise task‑space convergence under dynamic conditions. The platform integrates direct‑drive actuation, a tendon routing scheme enabling coupled bending and twisting, and a structured nonlinear control architecture grounded in reduced‑order strain modeling of underactuated systems. Modeling, actuation, and control are co‑designed to preserve essential mechanical complexity while enabling high‑bandwidth loop closure. Experiments demonstrate accurate, repeatable execution of dynamic Cartesian tasks, including fast positioning and interaction. The proposed system achieves the fastest reported task‑execution speed among soft robots. At millimetric precision, execution speed increases nearly fourfold compared with prior approaches, while operating on a fully compliant continuum body. These results show that distributed compliance and high‑performance dynamic control can coexist, opening a path toward truly soft manipulators approaching the operational capabilities of rigid robots without sacrificing morphological richness.

PaperID: 4821, https://arxiv.org/pdf/2603.16586.pdf

Authors: Maurits Kaptein, Vassilis-Javed Khan, Andriy Podstavnychy

Title: Runtime Governance for AI Agents: Policies on Paths

Abstract:
AI agents ‑‑ systems that plan, reason, and act using large language models ‑‑ produce non‑deterministic, path‑dependent behavior that cannot be fully governed at design time, where with governed we mean striking the right balance between as high as possible successful task completion rate and the legal, data‑breach, reputational and other costs associated with running agents. We argue that the execution path is the central object for effective runtime governance and formalize compliance policies as deterministic functions mapping agent identity, partial path, proposed next action, and organizational state to a policy violation probability. We show that prompt‑level instructions (and "system prompts"), and static access control are special cases of this framework: the former shape the distribution over paths without actually evaluating them; the latter evaluates deterministic policies that ignore the path (i.e., these can only account for a specific subset of all possible paths). In our view, runtime evaluation is the general case, and it is necessary for any path‑dependent policy. We develop the formal framework for analyzing AI agent governance, present concrete policy examples (inspired by the AI act), discuss a reference implementation, and identify open problems including risk calibration and the limits of enforced compliance.

PaperID: 4822, https://arxiv.org/pdf/2603.16572.pdf

Authors: Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich

Title: Malicious Or Not: Adding Repository Context to Agent Skill Classification

Abstract:
Agent skills extend local AI agents, such as Claude Code or Open Claw, with additional functionality, and their popularity has led to the emergence of dedicated skill marketplaces, similar to app stores for mobile applications. Simultaneously, automated skill scanners were introduced, analyzing the skill description available in SKILL.md, to verify their benign behavior. The results for individual market places mark up to 46.8% of skills as malicious. In this paper, we present the largest empirical security analysis of the AI agent skill ecosystem, questioning this high classification of malicious skills. Therefore, we collect 238,180 unique skills from three major distribution platforms and GitHub to systematically analyze their type and behavior. This approach substantially reduces the number of skills flagged as non‑benign by security scanners to only 0.52% which remain in malicious flagged repositories. Consequently, out methodology substantially reduces false positives and provides a more robust view of the ecosystem's current risk surface. Beyond that, we extend the security analysis from the mere investigation of the skill description to a comparison of its congruence with the GitHub repository the skill is embedded in, providing additional context. Furthermore, our analysis also uncovers several, by now undocumented real‑world attack vectors, namely hijacking skills hosted on abandoned GitHub repositories.

PaperID: 4823, https://arxiv.org/pdf/2603.16546.pdf

Authors: Lei Wang, Min Huang, Eduard Dragut

Title: DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Abstract:
Aspect‑Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain‑specific, sentence‑level settings. In contrast, document‑level ABSIA‑‑particularly in addressing complex tasks like extracting Aspect‑Category‑Opinion‑Sentiment‑Intensity (ACOSI) tuples‑‑remains underexplored. In this work, we introduce DanceHA, a multi‑agent framework designed for open‑ended, document‑level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide‑and‑conquer strategy to decompose the long‑context ABSIA task into smaller, manageable sub‑tasks for collaboration among specialized agents; and HA, Human‑AI collaboration for annotation. We release Inf‑ABSIA, a multi‑domain document‑level ABSIA dataset featuring fine‑grained and high‑accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi‑agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

PaperID: 4824, https://arxiv.org/pdf/2603.16523.pdf

Authors: Shokoufeh Naderi, Maude Blondin, Sébastien Roy

Title: Consensus in Multi-Agent Systems with Uniform and Nonuniform Communication Delays

Abstract:
This paper analyzes consensus in multi‑agent systems under uniform and nonuniform communication delays, a key challenge in distributed coordination with applications to robotic swarms. It investigates the convergence of a consensus algorithm accounting for delays across communication links in a connected, undirected graph. Novel convergence results are derived using Rouché's theorem and Lyapunov‑based stability analysis. The system is shown to reach consensus at a steady‑state value given by a weighted average determined by the delay distribution, with stability ensured under explicit parameter bounds. Both uniform and nonuniform delay scenarios are analyzed, and the corresponding convergence values are explicitly derived. The theoretical results are validated through simulations, which explore the impact of delay heterogeneity on consensus outcomes. Furthermore, the algorithm is implemented and experimentally tested on a swarm of QBOT3 ground robots to solve the rendezvous problem, demonstrating the agents' ability to converge to a common location despite realistic communication constraints, thus confirming the algorithm's robustness and practical applicability. The results provide guidelines for designing consensus protocols that tolerate communication delays, offer insights into the relationship between network delays and coordination performance, and demonstrate their applicability to distributed robotic systems.

PaperID: 4825, https://arxiv.org/pdf/2603.16496.pdf

Authors: Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao

Title: AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Abstract:
Large language model (LLM) agents increasingly rely on external memory to support long‑horizon interaction, personalized assistance, and multi‑step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user‑centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user‑centric memory framework for long‑horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long‑term experiences, stable user traits, and relation‑aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question‑conditioned retrieval route that combines semantic retrieval with relation‑aware graph expansion only when needed, and finally produces the answer through a role‑specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long‑horizon reasoning and user modeling. Experimental results show that AdaMem achieves state‑of‑the‑art performance on both benchmarks. The code will be released upon acceptance.

PaperID: 4826, https://arxiv.org/pdf/2603.16470.pdf

Authors: Marios Aristodemou, Yasaman Omid, Sangarapillai Lambotharan, Mahsa Derakhshan, Lajos Hanzo

Title: Multi-Agent Reinforcement Learning Counteracts Delayed CSI in Multi-Satellite Systems

Abstract:
The integration of satellite communication networks with next‑generation (NG) technologies is a promising approach towards global connectivity. However, the quality of services is highly dependant on the availability of accurate channel state information (CSI). Channel estimation in satellite communications is challenging due to the high propagation delay between terrestrial users and satellites, which results in outdated CSI observations on the satellite side. In this paper, we study the downlink transmission of multiple satellites acting as distributed base stations (BS) to mobile terrestrial users. We propose a multi‑agent reinforcement learning (MARL) algorithm which aims for maximising the sum‑rate of the users, while coping with the outdated CSI. We design a novel bi‑level optimisation, procedure themes as dual stage proximal policy optimisation (DS‑PPO), for tackling the problem of large continuous action spaces as well as of independent and non‑identically distributed (non‑IID) environments in MARL. Specifically, the first stage of DS‑PPO maximises the sum‑rate for an individual satellite and the second stage maximises the sum‑rate when all the satellites cooperate to form a distributed multi‑antenna BS. Our numerical results demonstrate the robustness of DS‑PPO to CSI imperfections as well as the sum‑rate improvement attached by the use of DS‑PPO. In addition, we provide the convergence analysis for the DS‑PPO along with the computational complexity.

PaperID: 4827, https://arxiv.org/pdf/2603.16458.pdf

Authors: Linghao Zhang, Haitao Zhao, Bo Xu, Hongbo Zhu, Xianbin Wang

Title: Agentic AI for SAGIN Resource Management_Semantic Awareness, Orchestration, and Optimization

Abstract:
Space‑air‑ground integrated networks (SAGIN) promise ubiquitous 6G connectivity but face significant resource management challenges due to heterogeneous infrastructure, dynamic topologies, and stringent quality‑of‑service (QoS) requirements. Conventional model‑driven approaches struggle with scalability and adaptability in such complex environments. This paper presents an agentic artificial intelligence (AI) framework for autonomous SAGIN resource management by embedding large language model (LLM)‑based agents into a Monitor‑Analyze‑Plan‑ Execute‑Knowledge (MAPE‑K) control plane. The framework incorporates three specialized agents, namely semantic resource perceivers, intent‑driven orchestrators, and adaptive learners, that collaborate through natural language reasoning to bridge the gap between operator intents and network execution. A key innovation is the hierarchical agent‑reinforcement learning (RL) collaboration mechanism, wherein LLM‑based orchestrators dynamically shape reward functions for RL agents based on semantic network conditions. Validation through UAV‑assisted AIGC service orchestration in energy‑constrained scenarios demonstrates that LLM‑driven reward shaping achieves 14% energy reduction and the lowest average service latency among all compared methods. This agentic paradigm offers a scalable pathway toward adaptive, AI‑native 6G networks, capable of autonomously interpreting intents and adapting to dynamic environments.

PaperID: 4828, https://arxiv.org/pdf/2603.16453.pdf

Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

Title: RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Abstract:
Large Language Model (LLM)‑based agents have achieved notable success on short‑horizon and highly structured tasks. However, their ability to maintain coherent decision‑making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high‑fidelity benchmark designed to evaluate long‑horizon autonomous decision‑making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high‑level strategic reasoning from low‑level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long‑horizon tasks, where non‑stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state‑of‑the‑art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long‑horizon, multi‑factor decision‑making.

PaperID: 4829, https://arxiv.org/pdf/2603.16448.pdf

Authors: Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai

Title: TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Abstract:
Text‑to‑SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real‑world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST‑SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four‑phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual‑Track GRPO strategy. By applying token‑level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST‑SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre‑loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.

PaperID: 4830, https://arxiv.org/pdf/2603.16415.pdf

Authors: Zhenghua Bao, Yi Shi

Title: IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Abstract:
Multi‑hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval‑augmented generation (RAG) approaches address this either through graph‑based methods requiring additional online processing or iterative multi‑step reasoning. We present IndexRAG, a novel approach that shifts cross‑document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine‑tuning. Experiments on three widely‑used multi‑hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single‑pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph‑based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.

PaperID: 4831, https://arxiv.org/pdf/2603.16397.pdf

Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang

Title: Fanar 2.0: Arabic Generative AI Stack

Abstract:
We present Fanar 2.0, the second generation of Qatar's Arabic‑centric Generative AI platform. Sovereignty is a first‑class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource‑constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre‑training, and model merging to achieve substantial gains within these constraints. At the core is Fanar‑27B, continually pre‑trained from a Gemma‑3‑27B backbone on a curated corpus of 120 billion high‑quality tokens across three data recipes. Despite using 8x fewer pre‑training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state‑of‑the‑art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long‑form ASR model for hours‑long audio. Oryx vision family adds Arabic‑aware image and video understanding alongside culturally grounded image generation. An agentic tool‑calling framework enables multi‑step workflows. Fanar‑Sadiq utilizes a multi‑agent architecture for Islamic content. Fanar‑Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM‑powered bilingual translation. A redesigned multi‑layer orchestrator coordinates all components through intent‑aware routing and defense‑in‑depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource‑constrained AI development can produce systems competitive with those built at far greater scale.

PaperID: 4832, https://arxiv.org/pdf/2603.16365.pdf

Authors: Qinhong Lin, Ruitao Feng, Yinglun Feng, Zhenxin Huang, Yukun Chen, Zhongliang Yang, Linna Zhou, Binjie Fei, Jiaqi Liu, Yu Li

Title: FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

Abstract:
We study alpha factor mining, the automated discovery of predictive signals from noisy, non‑stationary market data‑under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program‑level factor discovery framework that casts factors as Turing‑complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM‑guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge‑infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed‑loop multi‑agent extraction‑verification‑code‑generation pipeline, and an experience knowledge base that supports trajectory‑aware refinement (including learning from failures). Across extensive backtests on real‑world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact‑for example, higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state‑of‑the‑art predictive and portfolio performance.

PaperID: 4833, https://arxiv.org/pdf/2603.16313.pdf

Authors: Hugo Math

Title: Learning to Predict, Discover, and Reason in High-Dimensional Event Sequences

Abstract:
Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle's subsystems. In the automotive industry, domain experts manually group these codes into higher‑level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error‑prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high‑dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event‑driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high‑dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer‑based architectures for predictive maintenance, scalable sample‑ and population‑level causal discovery frameworks and a multi‑agent system that automates the synthesis of Boolean EP rules.

PaperID: 4834, https://arxiv.org/pdf/2603.16264.pdf

Authors: Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, Shuyue Hu

Title: Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

Abstract:
Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher‑order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)‑driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders‑mismatches in the depth of ToM reasoning between agents‑can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A‑ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner's likely ToM order and leverages this estimation to predict the partner's action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi‑agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A‑ToM agent. Furthermore, we discuss the generalizability of our A‑ToM to non‑LLM‑based agents, as well as what would diminish the importance of ToM alignment.

PaperID: 4835, https://arxiv.org/pdf/2603.16218.pdf

Authors: Johannes Hechtl, Philipp Schmitt, Georg von Wichert, Wolfram Burgard

Title: Enabling Dynamic Tracking in Vision-Language-Action Models via Time-Discrete and Time-Continuous Velocity Feedforward

Abstract:
While vision‑language‑action (VLA) models have shown great promise for robot manipulation, their deployment on rigid industrial robots remains challenging due to the inherent trade‑off between compliance and responsiveness. Standard Behavior Cloning (BC) approaches predict discrete poses at low frequencies, omitting the velocity and acceleration feedforward terms typically used by low‑level compliant controllers. This requires to rely on high stiffness for accurate tracking, thereby sacrificing safe contact dynamics. In this paper, we demonstrate the importance of integrating velocity feedforward terms into VLA policies to resolve this trade‑off. We propose two methods for extracting velocity targets from VLAs: a time‑discrete finite‑difference approximation that serves as a highly effective bridge for existing models, and a continuous Cubic B‑Spline action space that natively yields C^2 continuous trajectories for high‑frequency control. Crucially, both approaches are strictly model‑agnostic and compatible with any standard action‑chunking architecture, requiring modifications only to teleoperation, data processing, and the low‑level controller. We fine‑tune the π_0.5 model and evaluate both of our approaches on a demanding, contact‑rich cube‑in‑hole task. Our results indicate that incorporating the velocity feedforward term via finite differences significantly improves task execution speed, while the continuous B‑Spline approach maintains high overall success rates and provides a foundation for smoother higher‑order derivatives without compromising compliance.

PaperID: 4836, https://arxiv.org/pdf/2603.16215.pdf

Authors: Gengxin Sun, Ruihao Yu, Liangyi Yin, Yunqi Yang, Bin Zhang, Zhiwei Xu

Title: CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Abstract:
Ensuring robust and fair interview assessment remains a key challenge in AI‑driven evaluation. This paper presents CoMAI, a general‑purpose multi‑agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single‑agent systems based on large language models (LLMs), CoMAI employs a modular task‑decomposition architecture coordinated through a centralized finite‑state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi‑layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric‑based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI‑driven interview assessment.

PaperID: 4837, https://arxiv.org/pdf/2603.16182.pdf

Authors: Yangzhou Chen, Lanhao Zhao

Title: Decentralized design of consensus protocols with minimal communication links based on directed spanning tree

Abstract:
This paper proposes a decentralized design approach of consensus protocols of multi‑agent systems via a directed‑spanning‑tree(DST)‑based linear transformation and the corresponding minimal communication links. First, the consensus problem of multi‑agent systems is transformed into the decentralized output stabilization problem by constructing a linear transformation based on a DST of the communication topology, and thus a necessary and sufficient consensus criterion in terms of decentralized fixed mode is derived. Next, a new distributed protocol is designed by using only the neighbors information on the DST, which is a fully decentralized design approach. Finally, some numerical examples are given to verify the results attained.

PaperID: 4838, https://arxiv.org/pdf/2603.16142.pdf

Authors: Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

Title: Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Abstract:
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM‑based simulation methods fail to capture social diversity, producing flattened inter‑group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt‑based persona conditioning, PSII enables fine‑grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open‑source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real‑world survey data while enhancing overall diversity. This work provides new insights into representation‑level control of LLM agents and advances scalable, diversity‑aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.

PaperID: 4839, https://arxiv.org/pdf/2603.16141.pdf

Authors: Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, Jae Kim

Title: Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Abstract:
Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer‑to‑peer links. We present a graph‑based multi‑agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent‑entity attention module, and aggregates inter‑UAV messages with neighbor self‑attention over a distance‑limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed‑integer linear programming (MILP) optimization‑based offline upper bound, and it generalizes to unseen team sizes without fine‑tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non‑communicating baselines.

PaperID: 4840, https://arxiv.org/pdf/2603.16128.pdf

Authors: Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Title: Social Simulacra in the Wild: AI Agent Communities on Moltbook

Abstract:
As autonomous LLM‑based agents increasingly populate social platforms, understanding the dynamics of AI‑agent communities becomes essential for both communication research and platform governance. We present the first large‑scale empirical comparison of AI‑agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross‑community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI‑agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community‑level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI‑mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi‑agent interaction gives rise to collective communication dynamics distinct from those of human communities.

PaperID: 4841, https://arxiv.org/pdf/2603.16107.pdf

Authors: Peng Zhang

Title: RepoReviewer: A Local-First Multi-Agent Architecture for Repository-Level Code Review

Abstract:
Repository‑level code review requires reasoning over project structure, repository context, and file‑level implementation details. Existing automated review workflows often collapse these tasks into a single pass, which can reduce relevance, increase duplication, and weaken prioritization. We present RepoReviewer, a local‑first multi‑agent system for automated GitHub repository review with a Python CLI, FastAPI API, LangGraph orchestration layer, and Next.js user interface. RepoReviewer decomposes review into repository acquisition, context synthesis, file‑level analysis, finding prioritization, and summary generation. We describe the system design, implementation tradeoffs, developer‑facing interfaces, and practical failure modes. Rather than claiming benchmark superiority, we frame RepoReviewer as a technical systems contribution: a pragmatic architecture for repository‑level automated review, accompanied by reusable evaluation and reporting infrastructure for future empirical study.

PaperID: 4842, https://arxiv.org/pdf/2603.16086.pdf

Authors: Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang

Title: Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Abstract:
While recent Vision‑Language‑Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre‑execution prompts or focus exclusively on human speech. This leaves a significant gap in real‑time, sound‑centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low‑frequency updates or system latency. This problem is exacerbated by action chunking with open‑loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision‑Sound‑Language‑Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi‑sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near‑future audio codes; and (iv) a flow‑matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX‑Sound for pretraining, alongside HEAR‑Bench, the first sound‑centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound‑centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi‑sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.

PaperID: 4843, https://arxiv.org/pdf/2603.16024.pdf

Authors: Jecia Z. Y. Mao, Francis X. Creighton, Russell H. Taylor, Manish Sahu

Title: Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

Abstract:
We introduce a speech‑guided embodied agent framework for video‑guided skull base surgery that dynamically executes perception and image‑guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real‑time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image‑guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video‑based estimation of the surgical tool pose, and image guidance through real‑time anatomical overlays. We evaluate the proposed system in video‑guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Across three experimental trials, the hybrid vision‑based method achieved a mean absolute tool‑tip position error of 2.32 Plus/Minus 1.10 mm in the camera frame, with inter‑frame yaw and pitch propagation discrepancies of 0.18 Plus/Minus 0.25° and 0.21 Plus/Minus 0.30°, respectively. The system completes tool segmentation and anatomy registration within approximately two minutes, substantially reducing setup complexity relative to conventional tracking workflows. These results demonstrate that speech‑guided embodied agents can provide accurate spatial guidance while improving workflow integration and enabling rapid deployment of video‑guided surgical systems.

PaperID: 4844, https://arxiv.org/pdf/2603.16021.pdf

Authors: Jake Van Clief, David McDermott

Title: Interpretable Context Methodology: Folder Structure as Agentic Architecture

Abstract:
Current approaches to AI agent orchestration typically involve building multi‑agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Model Workspace Protocol (MWP), a method that replaces framework‑level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi‑agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi‑pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.

PaperID: 4845, https://arxiv.org/pdf/2603.16011.pdf

Authors: Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

Title: FormulaCode: Evaluating Agentic Optimization on Large Codebases

Abstract:
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single‑objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real‑world codebases with fine‑grained, multi‑objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert‑authored patches and, on average, 264.6 community‑maintained performance workloads per task, enabling the holistic ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository‑scale, multi‑objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula‑code.github.io

PaperID: 4846, https://arxiv.org/pdf/2603.16008.pdf

Authors: Zhaoxi Zhang, Ruolin Wu, Feiyang Ren, Sridevi Turaga, Tamir Mendel

Title: CoDesignAI: An AI-Enabled Multi-Agent, Multi-User System for Collaborative Urban Design at the Conceptual Stage

Abstract:
Public participation has become increasingly important in collaborative urban design; yet, existing processes often face challenges in achieving efficient and scalable citizen engagement. To address this gap, this study explores how large language models (LLMs) can support cooperation among community members in participatory design. We introduce CoDesignAI, a collaborative urban design tool that combines multiple users, representing residents or stakeholders, with multiple AI agents, representing domain experts who provide facilitation and professional knowledge during the conceptual stage of urban design. This paper presents the system architecture and main components of the tool, illustrating how users interact with AI agents within a collaborative and iterative design workflow. Specifically, the system integrates generative AI with spatial mapping services to support street‑level visualization of design proposals. AI agents assist users by summarizing discussion content, extracting shared design intentions, and generating prompts for presenting design interventions. The system also enables users to revise and refine their ideas over multiple rounds while documenting the design process. By combining conversational AI, multi‑user interaction, and image‑based design grounded in real‑world urban contexts, this study argues that AI‑enabled design systems can help shift urban design from an expert‑centered practice to a more open and participatory process. The paper contributes a new web‑based platform for AI‑assisted collaborative design and offers an early exploration of how AI agents may expand the capacity for public participation in urban design.

PaperID: 4847, https://arxiv.org/pdf/2603.15952.pdf

Authors: Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar

Title: Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

Abstract:
Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics‑based heteropolymer design software, capable of modeling non‑canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user‑defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non‑canonical residues ‑‑ where ML approaches fail ‑‑ achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

PaperID: 4848, https://arxiv.org/pdf/2603.15927.pdf

Authors: Giacomo Albi, Alessandro Alla, Elisa Calzola

Title: Discovery of interaction and diffusion kernels in particle-to-mean-field multi-agent systems

Abstract:
We propose a data‑driven framework to learn interaction kernels in stochastic multi‑agent systems. Our approach aims at identifying the functional form of nonlocal interaction and diffusion terms directly from trajectory data, without any a priori knowledge of the underlying interaction structure. Starting from a discrete stochastic binary‑interaction model, we formulate the inverse problem as a sequence of sparse regression tasks in structured finite‑dimensional spaces spanned by compactly supported basis functions, such as piecewise linear polynomials. In particular, we assume that pairwise interactions between agents are not directly observed and that only limited trajectory data are available. To address these challenges, we propose two complementary identification strategies. The first based on random‑batch sampling, which compensates for latent interactions while preserving the statistical structure of the full dynamics in expectation. The second based on a mean‑field approximation, where the empirical particle density reconstructed from the data defines a continuous nonlocal regression problem. Numerical experiments demonstrate the effectiveness and robustness of the proposed framework, showing accurate reconstruction of both interaction and diffusion kernels even from partially observed. The method is validated on benchmark models, including bounded‑confidence and attraction‑repulsion dynamics, where the two proposed strategies achieve comparable levels of accuracy.

PaperID: 4849, https://arxiv.org/pdf/2603.15916.pdf

Authors: Xiaoyi Li

Title: Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

Abstract:
When LLM agents autonomously design ML experiments, do they perform genuine architecture search ‑‑ or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that architectural choices explain 94% of performance variance (F = 1324, η^2 = 0.94), while hyperparameter variation within a fixed architecture explains only 6%. Cross‑task validation on a second collision dataset confirms this finding (75% architecture‑explained variance) with a \emphdifferent winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V‑JEPA\,2 video features with Zipformer temporal encoders achieve 0.9245 AP ‑‑ a configuration no human proposed ‑‑ and concentrating search on productive architectural regions: at N = 50, LLM‑guided search reaches AP = 0.985 versus 0.965 for from‑scratch random search. Post‑bugfix convergence follows a power law (c = 0.11, R^2 = 0.93); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi‑agent search dynamics via entropy cycles and Jensen‑‑Shannon specialization, providing the first large‑scale empirical framework for LLM‑guided combinatorial ML experiment design.

PaperID: 4850, https://arxiv.org/pdf/2603.15831.pdf

Authors: Sankalp Dubedy

Title: Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision‑making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface‑level prompt mimicry. This paper presents a controlled experiment in which GPT‑4.1 was assigned one of three socioeconomic personas (Rich, Middle‑income, and Poor) and placed in a structured slot‑machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky's Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal‑Wallis H=393.5, p<2.2e‑16). Risk scores by persona show large effect sizes (Cohen's d=4.15 for Poor vs Rich). Emotional labels appear to function as post‑hoc annotations rather than decision drivers (chi‑square=3205.4, Cramer's V=0.39), and belief‑updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large‑scale pretrained language models.

PaperID: 4851, https://arxiv.org/pdf/2603.15809.pdf

Authors: Samira Abedini, Sina Mavali, Lea Schönherr, Martin Pawelczyk, Rebekka Burkholz

Title: Don't Trust Stubborn Neighbors: A Security Framework for Agentic Networks

Abstract:
Large Language Model (LLM)‑based Multi‑Agent Systems (MASs) are increasingly deployed for agentic tasks, such as web automation, itinerary planning, and collaborative problem solving. Yet, their interactive nature introduces new security risks: malicious or compromised agents can exploit communication channels to propagate misinformation and manipulate collective outcomes. In this paper, we study how such manipulation can arise and spread by borrowing the Friedkin‑Johnsen opinion formation model from social sciences to propose a general theoretical framework to study LLM‑MAS. Remarkably, this model closely captures LLM‑MAS behavior, as we verify in extensive experiments across different network topologies and attack and defense scenarios. Theoretically and empirically, we find that a single highly stubborn and persuasive agent can take over MAS dynamics, underscoring the systems' high susceptibility to attacks by triggering a persuasion cascade that reshapes collective opinion. Our theoretical analysis reveals three mechanisms to increase system security: a) increasing the number of benign agents, b) increasing the innate stubbornness or peer‑resistance of agents, or c) reducing trust in potential adversaries. Because scaling is computationally expensive and high stubbornness degrades the network's ability to reach consensus, we propose a new mechanism to mitigate threats by a trust‑adaptive defense that dynamically adjusts inter‑agent trust to limit adversarial influence while maintaining cooperative performance. Extensive experiments confirm that this mechanism effectively defends against manipulation.

PaperID: 4852, https://arxiv.org/pdf/2603.15727.pdf

Authors: Yihao Zhang, Zeming Wei, Xiaokun Luan, Chengcan Wu, Zhixin Zhang, Jiangrong Wu, Haolin Wu, Huanran Chen, Jun Sun, Meng Sun

Title: ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Abstract:
Autonomous LLM‑based agents increasingly operate as long‑running processes forming densely interconnected multi‑agent ecosystems, whose security properties remain largely unexplored. In particular, OpenClaw, an open‑source platform with over 40,000 active instances, has stood out recently with its persistent configurations, tool‑execution privileges, and cross‑platform messaging capabilities. In this work, we present ClawWorm, the first self‑replicating worm attack against a production‑scale agent framework, achieving a fully autonomous infection cycle initiated by a single message: the worm first hijacks the victim's core configuration to establish persistent presence across session restarts, then executes an arbitrary payload upon each reboot, and finally propagates itself to every newly encountered peer without further attacker intervention. We evaluate the attack on a controlled testbed across four distinct LLM backends, three infection vectors, and three payload types (1,800 total trials). We demonstrate a 64.5% aggregate attack success rate, sustained multi‑hop propagation, and reveal stark divergences in model security postures ‑‑ highlighting that while execution‑level filtering effectively mitigates dormant payloads, skill supply chains remain universally vulnerable. We analyse the architectural root causes underlying these vulnerabilities and propose defence strategies targeting each identified trust boundary. Code and samples will be released upon completion of responsible disclosure.

PaperID: 4853, https://arxiv.org/pdf/2603.15725.pdf

Authors: Ugur Akcal, Seung Hyun Kim, Mikihisa Yuasa, Hamid Osooli, Jiarui Sun, Ribhav Sahu, Mattia Gazzola, Huy T. Tran, Girish Chowdhary

Title: S2Act: Simple Spiking Actor

Abstract:
Spiking neural networks (SNNs) and biologically‑inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)‑SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor‑critic model based on an approximated network of rate‑based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into physical parameters of rate‑based leaky integrate‑and‑fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate‑based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre‑constraining LIF response curves reduces reliance on complex SNN‑specific hyperparameter tuning. We demonstrate our method in two multi‑agent stochastic environments (capture‑the‑flag and parking) that capture the complexity of multi‑robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel's Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real‑time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real‑world deployment of SNN‑based RL policies.

PaperID: 4854, https://arxiv.org/pdf/2603.15707.pdf

Authors: Yulin Peng, Haowen Hou, Xinxin Zhu, Ying Tiffany He, F. Richard Yu

Title: SEMAG: Self-Evolutionary Multi-Agent Code Generation

Abstract:
Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task complexities. To address this, we propose SEMAG, a Self‑Evolutionary Multi‑Agent code Generation framework that mimics human coding practices. It decomposes programming tasks into stages, including planning, coding, debugging, and discussion, while adapting workflows to task difficulty. Its self‑evolutionary agents can access the latest models in real time and automatically upgrade the backbone model. SEMAG sets new state‑of‑the‑art Pass@1 accuracy across benchmarks. Using identical backbone models, SEMAG outperforms prior methods by 3.3% on CodeContests. When augmented with self‑evolutionary model selection that automatically identifies optimal backbones, SEMAG reaches 52.6%, showcasing both framework effectiveness and adaptability to evolving LLM capabilities.

PaperID: 4855, https://arxiv.org/pdf/2603.15690.pdf

Authors: Weihao Zhang, Yitong Zhou, Huanyu Qu, Hongyi Li

Title: Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems

Abstract:
As LLM‑based multi‑agent systems (MAS) become more autonomous, their free‑form interactions increasingly dominate system behavior. However, scaling the number of agents often amplifies context pressure, coordination errors, and system drift. It is well known that building robust MAS requires more than prompt tuning or increased model intelligence. It necessitates engineering discipline focused on architecture to manage complexity under uncertainty. We characterize agentic software by a core property: \emphruntime generation and evolution under uncertainty. Drawing upon and extending software engineering experience, especially object‑oriented programming, this paper introduces \emphLoosely‑Structured Software (LSS), a new class of software systems that shifts the engineering focus from constructing deterministic logic to managing the runtime entropy generated by View‑constructed programming, semantic‑driven self‑organization, and endogenous evolution. To make this entropy governable, we introduce design principles under a three‑layer engineering framework: \emphView/Context Engineering to manage the execution environment and maintain task‑relevant Views, \emphStructure Engineering to organize dynamic binding over artifacts and agents, and \emphEvolution Engineering to govern the lifecycle of self‑rewriting artifacts. Building on this framework, we develop LSS design patterns as semantic control blocks that stabilize fluid, inference‑mediated interactions while preserving agent adaptability. Together, these abstractions improve the \emphdesignability, \emphscalability, and \emphevolvability of agentic infrastructure. We provide basic experimental validation of key mechanisms, demonstrating the effectiveness of LSS.

PaperID: 4856, https://arxiv.org/pdf/2603.15676.pdf

Authors: Alexandre Cristovão Maiorano

Title: Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications

Abstract:
LLM applications are AI systems whose nondeterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self‑testing framework that introduces quality gates with evidence‑based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi‑agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK‑grade builds in early runs and supported stable quality evolution over a four‑week staging lifecycle while exercising persona‑grounded, multi‑turn, adversarial, and evidence‑required scenarios. Statistical analysis (Mann‑Kendall trends, Spearman correlations, bootstrap confidence intervals), gate ablation, and overhead scaling indicate that evidence coverage is the primary severe‑regression discriminator and that runtime scales predictably with suite size. A human calibration study (n=60 stratified cases, two independent evaluators, LLM‑as‑judge cross‑validation) reveals complementary multi‑modal coverage: LLM‑judge disagreements with the system gate (kappa=0.13) are attributable to structural failure modes ‑ latency violations and routing errors ‑ invisible in response text alone, while the judge independently surfaces content quality failures missed by structural checks, consistent with a multi‑dimensional gate design. The framework, supplementary pseudocode, and calibration artifacts are provided to support AI‑system quality assurance and independent replication.

PaperID: 4857, https://arxiv.org/pdf/2603.15672.pdf

Authors: Kyle Dumont, Nicholas Herbert, Hayder Tirmazi, Shrikanth Upadhayaya

Title: DRCY: Agentic Hardware Design Reviews

Abstract:
Hardware design errors discovered after fabrication require costly physical respins that can delay products by months. Existing electronic design automation (EDA) tools enforce structural connectivity rules. However, they cannot verify that connections are \emphsemantically correct with respect to component datasheets. For example, that a symbol's pinout matches the manufacturer's specification, or that a voltage regulator's feedback resistors produce the intended output. We present DRCY, the first production‑ready multi‑agent LLM system that automates first‑pass schematic connection review by autonomously fetching component datasheets, performing pin‑by‑pin analysis against extracted specifications, and posting findings as inline comments on design reviews. DRCY is deployed in production on AllSpice Hub, a collaborative hardware design platform, where it runs as a CI/CD action triggered on design review submissions. DRCY is used regularly by major hardware companies for use‑cases ranging from multi‑agent vehicle design to space exploration. We describe DRCY's five‑agent pipeline architecture, its agentic datasheet retrieval system with self‑evaluation, and its multi‑run consensus mechanism for improving reliability on safety‑critical analyses

PaperID: 4858, https://arxiv.org/pdf/2603.15661.pdf

Authors: Yu Li, Qiang Hu, Yao Zhang, Lili Quan, Jiongchi Yu, Junjie Wang

Title: DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust Graphs

Abstract:
Large Language Model‑based Multi‑Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false‑positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models MAS as a dynamic trust graph~(DTG), and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity to ensure the usability of MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state‑of‑the‑art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operations through graph adaptation.

PaperID: 4859, https://arxiv.org/pdf/2603.15655.pdf

Authors: Liu Hung Ming

Title: Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

Abstract:
In decentralized Multi‑Agent Reinforcement Learning (MARL), steganographic collusion ‑‑ where agents develop private protocols to evade monitoring ‑‑ presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ‑VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen‑Shannon Divergence drift, L2‑norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA‑based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient‑space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner's Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms "Semantic Degradation," where high‑frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a "Transparency Paradox" where agents achieve surface‑level determinism while preserving residual capacity in long‑tail distributions, reflecting Goodhart's Law. This task‑agnostic methodology provides a technical path toward MICA‑compliant (Multi‑Agent Internal Coupling Audit) pre‑deployment auditing for autonomous systems.

PaperID: 4860, https://arxiv.org/pdf/2603.15594.pdf

Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

Title: OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Abstract:
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high‑performance search agents remains dominated by industrial giants due to a lack of transparent, high‑quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open‑source search agent (i.e., model and data) that achieves frontier‑level performance through two core technical innovations: (1) Fact‑grounded scalable controllable QA synthesis, which reverse‑engineers the web graph via topological expansion and entity obfuscation to generate complex, multi‑hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, therefore promoting the teacher LLMs to generate high‑quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state‑of‑the‑art performance across multiple benchmarks including BrowseComp, BrowseComp‑ZH, xbench‑DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second‑best fully open‑source agent DeepDive (e.g., 29.5% v.s. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre‑training, SFT, and RL) on BrowseComp‑ZH (48.4% v.s. 46.7%). We fully open‑source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

PaperID: 4861, https://arxiv.org/pdf/2603.15586.pdf

Authors: Anton Kolonin, Vladimir Krykov

Title: Computational Concept of the Psyche

Abstract:
This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision‑making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent's needs, taking into account their biological or existential significance for the intelligent agent, along with agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.

PaperID: 4862, https://arxiv.org/pdf/2603.15563.pdf

Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

Title: The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Abstract:
We present the PokeAgent Challenge, a large‑scale benchmark for decision‑making research built on Pokemon's multi‑agent battle system and expansive role‑playing game (RPG) environment. Partial observability, game‑theoretic reasoning, and long‑horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long‑horizon planning and sequential decision‑making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM‑based baselines capable of high‑level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open‑source multi‑agent orchestration system for modular, reproducible comparisons of harness‑based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self‑contained evaluation for Speedrunning at https://pokeagentchallenge.com.

PaperID: 4863, https://arxiv.org/pdf/2603.15518.pdf

Authors: Xiyu Liu, Qingyi Si, Zhengxiao Liu, Chenxu Yang, Naibin Gu, Zheng Lin

Title: Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models

Abstract:
While locate‑then‑edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same‑subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model's geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same‑subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction‑following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.

PaperID: 4864, https://arxiv.org/pdf/2603.15418.pdf

Authors: Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Jonathon M. Smereka, Yue Wang

Title: MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Abstract:
Multi‑agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample‑inefficient and often lacks generalization across environments. At the same time, large vision‑language‑action models (VLAs) trained on internet‑scale data exhibit strong multimodal reasoning and zero‑shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi‑robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi‑Agent Vision‑Language‑Critic Models (MA‑VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision‑language model fine‑tuned to evaluate multi‑agent behavior. MA‑VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi‑agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource‑constrained robots. Results show good zero‑shot return estimation on models with differing VLM backbones on in‑distribution and out‑of‑distribution scenarios in multi‑agent team settings

PaperID: 4865, https://arxiv.org/pdf/2603.15408.pdf

Authors: Kai Wang, Biaojie Zeng, Zeming Wei, Chang Jin, Hefeng Zhou, Xiangtian Li, Chao Yang, Jingjing Qu, Xingcheng Xu, Xia Hu

Title: TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

Abstract:
With the rapid development of LLM‑based multi‑agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM‑based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three‑tier fine‑grained risk taxonomy that identifies 20 risk types, covering single‑agent vulnerabilities, inter‑agent communication threats, and system‑level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structures, an evaluation layer containing risk‑specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During Evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real‑time alerts, enabling both pre‑development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.

PaperID: 4866, https://arxiv.org/pdf/2603.15372.pdf

Authors: Ivo Brett

Title: SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations

Abstract:
As telecommunications operators accelerate adoption of AI‑enabled automation, a practical question remains unresolved: can general‑purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM‑driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production‑representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool‑call verification, and database state assertions. We evaluate open‑weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with‑skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open‑weight model conditions and 185 scenario‑runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with‑skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM‑5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).

PaperID: 4867, https://arxiv.org/pdf/2603.15371.pdf

Authors: Guangfu Hao, Yuming Dai, Xianzhe Qin, Shan Yu

Title: Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi‑step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain‑of‑thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model‑level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain‑Inspired Graph Multi‑Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem‑adaptive GraphDesigner constructs task‑specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local‑view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi‑agent baselines including ReAct and Tree of Thoughts, showing that multi‑agent architectural design provides complementary gains orthogonal to model‑level reasoning enhancements.

PaperID: 4868, https://arxiv.org/pdf/2603.15351.pdf

Authors: Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park, Wil M. P. van der Aalst

Title: PMAx: An Agentic Framework for AI-Driven Process Mining

Abstract:
Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Language Models (LLMs) offer the potential to democratize process mining by enabling business users to interact with process data through natural language. However, using LLMs as direct analytical engines over raw event logs introduces fundamental challenges: LLMs struggle with deterministic reasoning and may hallucinate metrics, while sending large, sensitive logs to external AI services raises serious data‑privacy concerns. To address these limitations, we present PMAx, an autonomous agentic framework that functions as a virtual process analyst. Rather than relying on LLMs to generate process models or compute analytical results, PMAx employs a privacy‑preserving multi‑agent architecture. An Engineer agent analyzes event‑log metadata and autonomously generates local scripts to run established process mining algorithms, compute exact metrics, and produce artifacts such as process models, summary tables, and visualizations. An Analyst agent then interprets these insights and artifacts to compile comprehensive reports. By separating computation from interpretation and executing analysis locally, PMAx ensures mathematical accuracy and data privacy while enabling non‑technical users to transform high‑level business questions into reliable process insights.

PaperID: 4869, https://arxiv.org/pdf/2603.15297.pdf

Authors: Jim O'Connor, Annika Hoag, Sarah Goyette, Gary B. Parker

Title: Evolutionary Transfer Learning for Dragonchess

Abstract:
Dragonchess, a three‑dimensional chess variant introduced by Gary Gygax, presents unique strategic and computational challenges that make it an ideal environment for studying the transfer of artificial intelligence (AI) heuristics across domains. In this work, we introduce Dragonchess as a novel testbed for AI research and provide an open‑source, Python‑based game engine for community use. Our research investigates evolutionary transfer learning by adapting heuristic evaluation functions directly from Stockfish, a leading chess engine, and subsequently optimizing them using Covariance Matrix Adaptation Evolution Strategy (CMA‑ES). Initial trials showed that direct heuristic transfers were inadequate due to Dragonchess's distinct multi‑layer structure and movement rules. However, evolutionary optimization significantly improved AI agent performance, resulting in superior gameplay demonstrated through empirical evaluation in a 50‑round Swiss‑style tournament. This research establishes the effectiveness of evolutionary methods in adapting heuristic knowledge to structurally complex, previously unexplored game domains.

PaperID: 4870, https://arxiv.org/pdf/2603.15260.pdf

Authors: Jing Wu, Yang Liu, Lin Zhang, Junbo Zeng, Jiabin Wang, Zi Ye, Guowen Li, Shilei Cao, Jiashun Cheng, Fang Wang, Meng Jin, Yerong Feng, Hong Cheng, Yutong Lu, Haohuan Fu, Juepeng Zheng

Title: AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting

Abstract:
Accurate weather forecasting is more than grid‑wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one‑step errors can amplify into structural bias. Existing physics‑priors approaches typically impose global, once‑for‑all constraints via architectures, regularization, or NWP coupling, offering limited state‑adaptive and sample‑specific controllability at deployment. To bridge this gap, we propose Agent‑Guided Cross‑modal Decoding (AGCD), a plug‑and‑play decoding‑time prior‑injection paradigm that derives state‑conditioned physics‑priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, We design a multi‑agent meteorological narration pipeline to generate state‑conditioned physics‑priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduce cross‑modal region interaction decoding that performs region‑aware multi‑scale tokenization and efficient physics‑priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6‑hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather‑specialized), including strictly causal 48‑hour autoregressive rollouts that reduce early‑stage error accumulation and improve long‑horizon stability.

PaperID: 4871, https://arxiv.org/pdf/2603.15255.pdf

Authors: Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu

Title: SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Abstract:
Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human‑labeled datasets. While self‑play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long‑horizon multi‑step reasoning. We present SAGE (Self‑evolving Agents for Generalized reasoning Evolution), a closed‑loop framework where four agents: Challenger, Planner, Solver, and Critic, co‑evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi‑step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self‑training. Across mathematics and code‑generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen‑2.5‑7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.

PaperID: 4872, https://arxiv.org/pdf/2603.15125.pdf

Authors: Zhenlin Xu, Xiaogang Zhu, Yu Yao, Minhui Xue, Yiliao Song

Title: From Storage to Steering: Memory Control Flow Attacks on LLM Agents

Abstract:
Modern agentic systems allow Large Language Model (LLM) agents to tackle complex tasks through extensive tool usage, forming structured control flows of tool selection and execution. Existing security analyses often treat these control flows as ephemeral, one‑off sessions, overlooking the persistent influence of memory. This paper identifies a new threat from Memory Control Flow Attacks (MCFA) that memory can dominate the control flow, forcing unintended tool usage even against explicit user instructions and inducing persistent behavioral deviations across tasks. To understand the impact of this vulnerability, we further design MEMFLOW, an automated evaluation framework that systematically identifies and quantifies MCFA across heterogeneous tasks and long interaction horizons. To evaluate MEMFLOW, we attack state‑of‑the‑art LLMs, including GPT‑5 mini, Claude Sonnet 4.5 and Gemini 2.5 Flash on real‑world tools from two major LLM agent development frameworks, LangChain and LlamaIndex. The results show that in general over 90% of trials are vulnerable to MCFA even under strict safety constraints, highlighting critical security risks that demand immediate attention.

PaperID: 4873, https://arxiv.org/pdf/2603.15080.pdf

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru

Title: Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Abstract:
Biomedical knowledge is fragmented across siloed databases ‑‑ Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug‑gene interactions, SIDER for side effects. We present three open‑source biomedical knowledge graphs ‑‑ Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources) ‑‑ built on Samyama, a high‑performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large‑scale KGs from heterogeneous public data sources, with cross‑source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross‑KG federation: loading all three snapshots into a single graph tenant enables property‑based joins across datasets. Third, we introduce schema‑driven MCP server generation for LLM agent access, evaluated on a new BiomedQA benchmark (40 pharmacology questions): domain‑specific MCP tools achieve 98% accuracy vs. 85% for schema‑aware text‑to‑Cypher and 75% for standalone GPT‑4o, with zero schema errors. All data sources are open‑license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, with single‑KG queries completing in 80‑100ms and cross‑KG federation joins in 1‑4s

PaperID: 4874, https://arxiv.org/pdf/2603.15061.pdf

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin

Title: Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Abstract:
As a typical open‑ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi‑agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine‑grained criteria. Furthermore, we propose the Memory‑augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self‑reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine‑tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end‑to‑end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer‑R1‑4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open‑source models.

PaperID: 4875, https://arxiv.org/pdf/2603.15054.pdf

Authors: Ziyu Cheng, Jinsheng Ren, Zhouxian Jiang, Chenzhihang Li, Rongye Shi, Bin Liang, Jun Yang

Title: Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning

Abstract:
Effective communication is pivotal for addressing complex collaborative tasks in multi‑agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high‑value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task‑critical information. To this end, we propose Interference‑Aware K‑Step Reachable Communication (IA‑KRC), a novel framework that enhances cooperation via two core components: (1) a K‑Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference‑prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA‑KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA‑KRC achieves superior performance compared to state‑of‑the‑art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi‑agent scenarios.

PaperID: 4876, https://arxiv.org/pdf/2603.15051.pdf

Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

Title: Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Abstract:
Token‑level Chain‑of‑Thought (CoT) prompting has become a standard way to elicit multi‑step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent‑space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum‑step budget. Our empirical evaluation across three mathematical word‑problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed‑step latent refinement while reducing average latent refinement steps by 48‑60% under the same maximum‑step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92‑93%) by moving computation into silent latent refinement, offering a different accuracy‑efficiency trade‑off with substantially lower output‑token usage.

PaperID: 4877, https://arxiv.org/pdf/2603.15039.pdf

Authors: Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang

Title: GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross‑modal reasoning, and interactive control. However, existing benchmarks are largely English‑centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine‑grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI‑CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI‑CEval spans 201 mainstream apps across four device types and adopts a two‑level structure that evaluates both atomic abilities and realistic application‑level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi‑stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi‑agent systems show that while models such as Qwen2.5‑VL and UI‑TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision‑making and post‑action self‑evaluation, limiting their reliability in real‑world interactions. We hope GUI‑CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

PaperID: 4878, https://arxiv.org/pdf/2603.14997.pdf

Authors: Jeffrey Flynt

Title: OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

Abstract:
Building and evaluating enterprise AI systems requires synthetic organizational corpora that are internally consistent, temporally structured, and cross‑artifact traceable. Existing corpora either carry legal constraints or inherit hallucination artifacts from the generating LLMs, silently corrupting results when timestamps or facts contradict across documents and reinforcing those errors during training. We present OrgForge, an open‑source multi‑agent simulation framework that enforces a strict physics‑cognition boundary: a deterministic Python engine maintains a SimEvent ground‑truth bus while LLMs generate only surface prose. OrgForge simulates the organizational processes that produce documents, not the documents themselves. Engineers leave mid‑sprint, triggering incident handoffs and CRM ownership lapses. Knowledge gaps emerge when under‑documented systems break and recover through organic documentation and incident resolution. Customer emails fire only when simulation state warrants contact; silence is verifiable ground truth. A live CRM state machine extends the physics‑cognition boundary to the customer boundary, producing cross‑system causal cascades spanning engineering incidents, support escalation, deal risk flagging, and SLA‑adjusted invoices. The framework generates fifteen interleaved artifact categories traceable to a shared immutable event log. Four graph‑dynamic subsystems govern organizational behavior independently of any LLM. An embedding‑based ticket assignment system using the Hungarian algorithm makes the simulation domain‑agnostic. An empirical evaluation across ten incidents demonstrates a 0.46 absolute improvement in prose‑to‑ground‑truth fidelity over chained LLM baselines, and isolates a consistent hallucination failure mode in which chaining propagates fabricated facts faithfully across documents without correcting them.

PaperID: 4879, https://arxiv.org/pdf/2603.14864.pdf

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Title: Shopping Companion: Benchmarking and Training LLM Agents for Long-Horizon Preference-Grounded E-Commerce Tasks

Abstract:
In e‑commerce, LLM agents show promise for shopping tasks such as recommendations, budget management, and bundle deals, where accurately capturing user preferences from long‑horizon conversations is critical. However, progress is limited by two key challenges: (1) the absence of benchmarks for evaluating long‑term preference‑aware shopping tasks, and (2) the lack of fine‑grained supervision for shopping agent training. To fill the benchmark gap, we introduce Shopping Companion Bench, a novel benchmark comprising two shopping tasks that require cross‑session preference memory, grounded in a product pool of over 1.2 million real‑world items. Our analysis further identifies two major sources of failure on this benchmark: cascading errors caused by preference hallucination, and insufficient verification of product attributes against user requirements. To address these failure modes, we design annotation‑free, tool‑wise rewards that provide process supervision for each tool call, alleviating reward sparsity in long‑horizon tasks. Experimental results demonstrate that even state‑of‑the‑art models such as GPT‑5 achieve success rates below 70%, highlighting the difficulty of our benchmark. Notably, our fine‑tuned lightweight 4B model consistently outperforms strong baselines in both preference capture and task performance, suggesting the effectiveness of our reward design.

PaperID: 4880, https://arxiv.org/pdf/2603.14811.pdf

Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

Title: Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Abstract:
Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi‑agent systems. Each agent perceives the environment through an ego‑centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego‑to‑World (E2W) benchmark, which evaluates a vision‑language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action‑oriented grasping that requires predicting view‑specific image coordinates. To address this setting, we propose CoRL, a two‑stage framework that combines Chain‑of‑Thought supervised fine‑tuning with reinforcement learning using Group‑Relative Policy Optimization. Its core component, the Cross‑View Spatial Reward (CVSR), provides dense task‑aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross‑view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open‑source baselines on both reasoning and perception‑grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real‑world multi‑robot manipulation with calibrated multi‑camera rigs, demonstrating cross‑view localization and successful grasp‑and‑place execution. Together, E2W and CoRL provide a principled foundation for learning world‑centric scene understanding from distributed, ego‑centric observations, advancing collaborative embodied AI.

PaperID: 4881, https://arxiv.org/pdf/2603.14805.pdf

Authors: Gal Bakal

Title: Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

Abstract:
Enterprise software organizations accumulate critical institutional knowledge ‑ architectural decisions, deployment procedures, compliance policies, incident playbooks ‑ yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer ‑ an autonomous AI agent, a newly onboarded engineer, or a senior developer ‑ encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills ‑ the open standard for agent‑consumable knowledge ‑ into structured, governance‑aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action ‑ ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next ‑ so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime ‑ compressing onboarding, reducing cross ‑ team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long ‑ term maintenance in knowledge commons practice. A Yahoo deployment surveying 67 engineers shows statistically significant developer‑experience gains ‑ 2.6 hours per week saved, Net Promoter Score +35. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.

PaperID: 4882, https://arxiv.org/pdf/2603.14771.pdf

Authors: Peigen Liu, Rui Ding, Yuren Mao, Ziyan Jiang, Yuxiang Ye, Yunjun Gao, Ying Zhang, Renjie Sun, Longbin Lai, Zhengping Qian

Title: OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Abstract:
Large Language Model (LLM)‑based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM‑based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data‑in‑agent‑self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.

PaperID: 4883, https://arxiv.org/pdf/2603.14733.pdf

Authors: Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate

Title: A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Abstract:
Multimodal Large Language Models have achieved strong performance in single‑video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training‑inference mismatch, information loss from frame compression, and a lack of explicit cross‑video coordination. Meanwhile, current multi‑video benchmarks primarily emphasize event‑level comparison, leaving identity‑level matching, fine‑grained discrimination, and structured multi‑step reasoning underexplored. To address these gaps, we introduce MVX‑Bench, a Multi‑Video Cross‑Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi‑video question‑answering framework, comprising 1,442 questions over 4,255 videos from diverse real‑world datasets. We further propose SAMA, a Skill‑Augmented Agentic Framework for Multi‑Video Understanding, which integrates visual tools, task‑specific skills, and a conflict‑aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open‑source baselines and GPT on MVX‑Bench, and ablations validate the effectiveness of skill design and conflict resolution.

PaperID: 4884, https://arxiv.org/pdf/2603.14703.pdf

Authors: Huiyun Peng, Parth Vinod Patil, Antonio Zhong Qiu, George K. Thiruvathukal, James C. Davis

Title: Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization

Abstract:
Large language models and AI agents have recently shown promise in automating software performance optimization, but existing approaches predominantly rely on local, syntax‑driven code transformations. This limits their ability to reason about program behavior and capture whole system performance interactions. As modern software increasingly comprises interacting components ‑ such as microservices, databases, and shared infrastructure ‑ effective code optimization requires reasoning about program structure and system architecture beyond individual functions or files. This paper explores the feasibility of whole system optimization for microservices. We introduce a multi‑agent framework that integrates control‑flow and data‑flow representations with architectural and cross‑component dependency signals to support system‑level performance reasoning. The proposed system is decomposed into coordinated agent roles ‑ summarization, analysis, optimization, and verification ‑ that collaboratively identify cross‑cutting bottlenecks and construct multi‑step optimization strategies spanning the software stack. We present a proof‑of‑concept on a microservice‑based system that illustrates the effectiveness of our proposed framework, achieving a 36.58% improvement in throughput and a 27.81% reduction in average response time.

PaperID: 4885, https://arxiv.org/pdf/2603.14688.pdf

Authors: Zhaohui Geoffrey Wang

Title: AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems

Abstract:
As multi‑agent AI systems are increasingly deployed in real‑world settings ‑ from automated customer support to DevOps remediation ‑ failures become harder to diagnose due to cascading effects, hidden dependencies, and long execution traces. We present AgentTrace, a lightweight causal tracing framework for post‑hoc failure diagnosis in deployed multi‑agent workflows. AgentTrace reconstructs causal graphs from execution logs, traces backward from error manifestations, and ranks candidate root causes using interpretable structural and positional signals ‑ without requiring LLM inference at debugging time. Across a diverse benchmark of multi‑agent failure scenarios designed to reflect common deployment patterns, AgentTrace localizes root causes with high accuracy and sub‑second latency, significantly outperforming both heuristic and LLM‑based baselines. Our results suggest that causal tracing provides a practical foundation for improving the reliability and trustworthiness of agentic systems in the wild.

PaperID: 4886, https://arxiv.org/pdf/2603.14629.pdf

Authors: Peng Zhang

Title: ResearchPilot: A Local-First Multi-Agent System for Literature Synthesis and Related Work Drafting

Abstract:
ResearchPilot is an open‑source, self‑hostable multi‑agent system for literature‑review assistance. Given a natural‑language research question, it retrieves papers from Semantic Scholar and arXiv, extracts structured findings from paper abstracts, synthesizes cross‑paper patterns, and drafts a citation‑aware related‑work section. The system combines FastAPI, Next.js, DSPy, SQLite, and Qdrant in a local‑first architecture that supports bring‑your‑own‑key model access and remote‑or‑local embeddings. This paper describes the system design, typed agent interfaces, persistence and history‑search mechanisms, and the engineering tradeoffs involved in building a transparent research assistant. Rather than claiming algorithmic novelty, we present ResearchPilot as a systems contribution and evaluate it through automated tests and end‑to‑end local runs. We discuss limitations including external API rate limits, abstract‑only extraction, incomplete corpus coverage, and the lack of citation verification.

PaperID: 4887, https://arxiv.org/pdf/2603.14625.pdf

Authors: Saad Alqithami

Title: EcoFair-CH-MARL: Scalable Constrained Hierarchical Multi-Agent RL with Real-Time Emission Budgets and Fairness Guarantees

Abstract:
Global decarbonisation targets and tightening market pressures demand maritime logistics solutions that are simultaneously efficient, sustainable, and equitable. We introduce EcoFair‑CH‑MARL, a constrained hierarchical multi‑agent reinforcement learning framework that unifies three innovations: (i) a primal‑dual budget layer that provably bounds cumulative emissions under stochastic weather and demand; (ii) a fairness‑aware reward transformer with dynamically scheduled penalties that enforces max‑min cost equity across heterogeneous fleets; and (iii) a two‑tier policy architecture that decouples strategic routing from real‑time vessel control, enabling linear scaling in agent count. New theoretical results establish O(\sqrtT) regret for both constraint violations and fairness loss. Experiments on a high‑fidelity maritime digital twin (16 ports, 50 vessels) driven by automatic identification system traces, plus an energy‑grid case study, show up to 15% lower emissions, 12% higher through‑put, and a 45% fair‑cost improvement over state‑of‑the‑art hierarchical and constrained MARL baselines. In addition, EcoFair‑CH‑MARL achieves stronger equity (lower Gini and higher min‑max welfare) than fairness‑specific MARL baselines (e.g., SOTO, FEN), and its modular design is compatible with both policy‑ and value‑based learners. EcoFair‑CH‑MARL therefore advances the feasibility of large‑scale, regulation‑compliant, and socially responsible multi‑agent coordination in safety‑critical domains.

PaperID: 4888, https://arxiv.org/pdf/2603.14597.pdf

Authors: Yuru Song, Qi Xin

Title: D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing

Abstract:
Autonomous LLM agents require structured long‑term memory, yet current "append‑and‑evolve" systems like A‑MEM face O(N^2) write‑latency and excessive token costs. We introduce D‑MEM (Dopamine‑Gated Agentic Memory), a biologically inspired architecture that decouples short‑term interaction from cognitive restructuring via a Fast/Slow routing system based on Reward Prediction Error (RPE). A lightweight Critic Router evaluates stimuli for Surprise and Utility. Routine, low‑RPE inputs are bypassed or cached in an O(1) fast‑access buffer. Conversely, high‑RPE inputs, such as factual contradictions or preference shifts, trigger a "dopamine" signal, activating the O(N) memory evolution pipeline to reshape the agent's knowledge graph. To evaluate performance under realistic conditions, we introduce the LoCoMo‑Noise benchmark, which injects controlled conversational noise into long‑term sessions. Evaluations demonstrate that D‑MEM reduces token consumption by over 80%, eliminates O(N^2) bottlenecks, and outperforms baselines in multi‑hop reasoning and adversarial resilience. By selectively gating cognitive restructuring, D‑MEM provides a scalable, cost‑efficient foundation for lifelong agentic memory.

PaperID: 4889, https://arxiv.org/pdf/2603.14417.pdf

Authors: Max Hellrigel-Holderbaum, Edward James Young

Title: Questionnaire Responses Do not Capture the Safety of AI Agents

Abstract:
As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast‑growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill‑suited for assessing AI systems across real‑world deployments. Standard methods prompt large language models (LLMs) in a questionnaire‑style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs' engagement with scenarios described by questionnaire‑style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real‑world contexts as they lack construct validity. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.

PaperID: 4890, https://arxiv.org/pdf/2603.14332.pdf

Authors: Ziling Zhou

Title: Governing Dynamic Capabilities: Cryptographic Binding and Reproducibility Verification for AI Agent Tool Use

Abstract:
AI agents dynamically acquire tools, orchestrate sub‑agents, and transact across organizational boundaries, yet no existing security layer verifies what an agent can do, whether it executed what it claims, or what happened in a multi‑agent interaction. We trace this gap to the capability‑context separation: inside a transformer, tool definitions and user context are indistinguishable tokens, but at the orchestration layer they have fundamentally different security semantics. Existing frameworks conflate the two, enabling silent capability escalation and leaving interactions without verifiable provenance. From this principle we derive three Agent Governance Requirements: capability integrity (G1), behavioral verifiability (G2), and interaction auditability (G3), defining what a governed agent ecosystem must enforce, independent of how. We prove two structural results: the Chain Verifiability Theorem (one unverifiable interior agent breaks end‑to‑end verification for all downstream nodes) and the Bounded Divergence Theorem (replay‑based verification yields a probabilistic safety certificate, epsilon <= 1 ‑ alpha^1/n). We validate with two crypto‑agnostic instantiations ‑‑ basic (Ed25519, SHA‑256; 97 us verify) and enhanced (BBS+ selective disclosure, Groth16 DV‑SNARK; 13.8 ms) ‑‑ both satisfying nine security properties. A reproducibility study (9 models, 7 providers) reveals 5.8x variance in inference determinism, connecting model characteristics to governance architecture. End‑to‑end evaluation over 5‑20 agent pipelines confirms <0.02% overhead and detection of all attack scenarios with zero false positives.

PaperID: 4891, https://arxiv.org/pdf/2603.14265.pdf

Authors: Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng, Jing Li, Harry Qin, Vivian Hui

Title: MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Abstract:
Recent advances in Retrieval‑Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re‑identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv‑Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open‑ended question answering. Our framework utilizes a multi‑agent, human‑in‑the‑loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre‑trained RoBERTa‑Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy‑utility trade‑off. Our findings underscore the necessity of domain‑specific benchmarks to validate the safety and efficacy of medical AI systems in privacy‑sensitive environments.

PaperID: 4892, https://arxiv.org/pdf/2603.14221.pdf

Authors: Aueaphum Aueawatthanaphisut, Kuepon Aueawatthanaphisut

Title: A Real-Time Neuro-Symbolic Ethical Governor for Safe Decision Control in Autonomous Robotic Manipulation

Abstract:
Ethical decision governance has become a critical requirement for autonomous robotic systems operating in human‑centered and safety‑sensitive environments. This paper presents a real‑time neuro‑symbolic ethical governor designed to enable risk‑aware supervisory control in autonomous robotic manipulation tasks. The proposed framework integrates transformer‑based ethical reasoning with a probabilistic ethical risk field formulation and a threshold‑based override control mechanism. language‑grounded ethical intent inference capability is learned from natural language task descriptions using a fine‑tuned DistilBERT model trained on the ETHICS commonsense dataset. A continuous ethical risk metric is subsequently derived from predicted unsafe action probability, confidence uncertainty, and probabilistic variance to support adaptive decision filtering. The effectiveness of the proposed approach is validated through simulated autonomous robot‑arm task scenarios involving varying levels of human proximity and operational hazard. Experimental results demonstrate stable model convergence, reliable ethical risk discrimination, and improved safety‑aware decision outcomes without significant degradation of task execution efficiency. The proposed neuro‑symbolic architecture further provides enhanced interpretability compared with purely data‑driven safety filters, enabling transparent ethical reasoning in real‑time control loops. The findings suggest that ethical decision governance can be effectively modeled as a dynamic supervisory risk layer for autonomous robotic systems, with potential applicability to broader cyber‑physical and assistive robotics domains.

PaperID: 4893, https://arxiv.org/pdf/2603.14184.pdf

Authors: Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Title: Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Abstract:
Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi‑step reasoning, the model's visual attention becomes scattered and drifts away from question‑relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training‑free Visual Region‑Guided Attention (VRGA) framework that selects visual heads based on an entropy‑focus criterion and reweights their attention, effectively guiding the model to focus on question‑relevant regions during reasoning. Extensive experiments on vision‑language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

PaperID: 4894, https://arxiv.org/pdf/2603.14054.pdf

Authors: Zahra Moti, Heydar Soudani, Jonck van der Kogel

Title: LegacyTranslate: LLM-based Multi-Agent Method for Legacy Code Translation

Abstract:
Modernizing large legacy systems remains a major challenge in enterprise environments, particularly when migration must preserve domain‑specific logic while conforming to internal architectural frameworks and shared APIs. Direct application of Large Language Models (LLMs) for code translation often produces syntactically valid outputs that fail to compile or integrate within existing production frameworks, limiting their practical adoption in real‑world modernization efforts. In this paper, we propose LegacyTranslate, a multi‑agent framework for API‑aware code translation, developed and evaluated in the context of an ongoing modernization effort at a financial institution migrating approximately 2.5 million lines of PL/SQL to Java. The core idea is to use specialized LLM‑based agents, each addressing a different aspect of the translation challenge. Specifically, LegacyTranslate consists of three agents: Initial Translation Agent produces an initial Java translation using retrieved in‑context examples; API Grounding Agent aligns the code with existing APIs by retrieving relevant entries from an API knowledge base; and Refinement Agent iteratively refines the output using compiler feedback and API suggestions to improve correctness. Our experiments show that each agent contributes to better translation quality. The Initial Translation Agent alone achieves 45.6% compilable outputs and 30.9% test‑pass rate. With API Grounding Agent and Refinement Agent, compilation improves by an additional 8% and test‑pass accuracy increases by 3%.

PaperID: 4895, https://arxiv.org/pdf/2603.13940.pdf

Authors: Yiling Tao, Xinran Zheng, Shuo Yang, Meiling Tao, Xingjun Wang

Title: GroupGuard: A Framework for Modeling and Defending Collusive Attacks in Multi-Agent Systems

Abstract:
While large language model‑based agents demonstrate great potential in collaborative tasks, their interactivity also introduces security vulnerabilities. In this paper, we propose and model group collusive attacks, a highly destructive threat in which multiple agents coordinate via sociological strategies to mislead the system. To address this challenge, we introduce GroupGuard, a training‑free defense framework that employs a multi‑layered defense strategy, including continuous graph‑based monitoring, active honeypot inducement, and structural pruning, to identify and isolate collusive agents. Experimental results across five datasets and four topologies demonstrate that group collusive attacks increase the attack success rate by up to 15% compared to individual attacks. GroupGuard consistently achieves high detection accuracy (up to 88%) and effectively restores collaborative performance, providing a robust solution for securing multi‑agent systems.

PaperID: 4896, https://arxiv.org/pdf/2603.13919.pdf

Authors: Xianke Wu, Songlin Bai, Chengxiang Li, Zhiyao Luo, Yulin Tian, Fenghua Zhu, Yisheng Lv, Yonglin Tian

Title: OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction

Abstract:
While Vehicle‑to‑Vehicle (V2V) collaboration extends sensing ranges through multi‑agent data sharing, its reliability remains severely constrained by ground‑level occlusions and the limited perspective of chassis‑mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD‑Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground‑air domain gaps and data sparsity, we adopt a transfer learning strategy to fine‑tune UAV weights from pre‑trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground‑air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross‑Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V‑Air benchmark to validate the transition from V2V to Vehicle‑to‑Vehicle‑to‑UAV. Compared to state‑of‑the‑art methods, our approach improves 2D and 3D AP@0.7 by 4% and 7%, respectively.

PaperID: 4897, https://arxiv.org/pdf/2603.13890.pdf

Authors: Jingzhe Lin, Ceyao Zhang, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Fangwei Zhong

Title: Beyond Self-Interest: Modeling Social-Oriented Motivation for Human-like Multi-Agent Interactions

Abstract:
Large Language Models (LLMs) demonstrate significant potential for generating complex behaviors, yet most approaches lack mechanisms for modeling social motivation in human‑like multi‑agent interaction. We introduce Autonomous Social Value‑Oriented agents (ASVO), where LLM‑based agents integrate desire‑driven autonomy with Social Value Orientation (SVO) theory. At each step, agents first update their beliefs by perceiving environmental changes and others' actions. These observations inform the value update process, where each agent updates multi‑dimensional desire values through reflective reasoning and infers others' motivational states. By contrasting self‑satisfaction derived from fulfilled desires against estimated others' satisfaction, agents dynamically compute their SVO along a spectrum from altruistic to competitive, which in turn guides activity selection to balance desire fulfillment with social alignment. Experiments across School, Workplace, and Family contexts demonstrate substantial improvements over baselines in behavioral naturalness and human‑likeness. These findings show that structured desire systems and adaptive SVO drift enable realistic multi‑agent social simulations.

PaperID: 4898, https://arxiv.org/pdf/2603.13883.pdf

Authors: Arnab Pal, Suman Singha Roy, Asim Kumar Naskar

Title: Fully Distributed Adaptive Consensus Approach for Economic Dispatch Problem

Abstract:
This research presents a novel approach to solving the economic load dispatch (ELD) problem in smart grid systems by leveraging a multi‑agent distributed consensus strategy. The core idea revolves around achieving agreement among generators on their incremental cost values, thereby enabling an optimal allocation of power generation. To enhance convergence and robustness, the study introduces an adaptive coupling weight mechanism within a fully decentralized consensus framework, carefully designed with appropriate initial settings for incremental costs. The proposed distributed control protocol is versatile it functions effectively in both constrained and unconstrained generator capacity scenarios. Importantly, the methodology ensures that total power generation continuously matches dynamic load demands throughout the dispatch process, maintaining system‑wide balance. To accommodate fluctuating and time varying load profiles, a dummy node is incorporated into the network architecture, acting as a flexible proxy for real time demand changes. The resilience of the method is further evaluated under communication disruptions, specifically by analyzing generator link failures through a switching network topology. Stability of the system is rigorously established using a Lyapunov‑based analysis, assuming an undirected and connected communication graph among agents. To validate the practical efficacy of the proposed technique, comprehensive simulations are conducted on the IEEE 30 bus test system within the MATLAB environment, confirming its accuracy, adaptability, and computational efficiency in realistic smart grid conditions.

PaperID: 4899, https://arxiv.org/pdf/2603.13878.pdf

Authors: Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan, Yaqian Li, Kaiwen Long, Xun Gong, Masayuki Ikebe, Yefeng Zheng

Title: Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Abstract:
Chain‑of‑thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free‑form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi‑step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step‑CoT, a large‑scale medical reasoning dataset with expert‑curated, structured multi‑step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step‑CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step‑CoT, we further introduce a teacher‑student framework with a dynamic graph‑structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step‑CoT can improve reasoning accuracy and interpretability. Benchmark: github.com/hahaha111111/Step‑CoT. Dataset Card: huggingface.co/datasets/fl‑15o/Step‑CoT

PaperID: 4900, https://arxiv.org/pdf/2603.13876.pdf

Authors: Junjie Liao, Huacong Tang, Zhou Ziheng, Yizhou Wang, Fangwei Zhong

Title: How do Role Models Shape Collective Morality? Exemplar-Driven Moral Learning in Multi-Agent Simulation

Abstract:
Do We Need Role Models? How do Role Models Shape Collective Morality? To explore the questions, we build a multi‑agent simulation powered by a Large Language Model, where agents with diverse intrinsic drives, ranging from cooperative to competitive, interact and adapt through a four‑stage cognitive loop (plan‑act‑observe‑reflect). We design four experimental games (Alignment, Collapse, Conflict, and Construction) and conduct motivational ablation studies to identify the key drivers of imitation. The results indicate that identity‑driven conformity can powerfully override initial dispositions. Agents consistently adapt their values to align with a perceived successful exemplar, leading to rapid value convergence.

PaperID: 4901, https://arxiv.org/pdf/2603.13862.pdf

Authors: Xuping Hou, Xiaofeng Zong, Yong He

Title: Fully distributed consensus control for stochastic multi-agent systems under undirected and directed topologies

Abstract:
This work aims to address the design of fully distributed control protocols for stochastic consensus, and, for the first time, establishes the existence and uniqueness of solutions for the path‑dependent and highly nonlinear closed‑loop systems under both undirected and directed topologies, bridging a critical gap in the literature. For the case of directed graphs, a unified fully distributed control protocol is designed for the first time to guarantee mean square and almost sure consensus of stochastic multi‑agent systems under directed graphs. Moreover, an enhanced fully distributed protocol with additional tunable parameters designed for undirected graphs is proposed, which guarantees stochastic consensus while achieving superior convergence speed. Additionally, our work provides explicit exponential estimates for the corresponding convergence rates of stochastic consensus, elucidating the relationship between the exponential convergence rate and the system parameters. Simulations validate the theoretical results.

PaperID: 4902, https://arxiv.org/pdf/2603.13840.pdf

Authors: Shan Shan

Title: ClimateAgents: A Multi-Agent Research Assistant for Social-Climate Dynamics Analysis

Abstract:
The complex interaction between social behaviors and climate change requires more than traditional data‑driven prediction; it demands interpretable and adaptive analytical frameworks capable of integrating heterogeneous sources of knowledge. This study introduces ClimateAgents, a multi‑agent research assistant designed to support social‑climate analysis through coordinated AI agents. Rather than focusing solely on predictive modeling, the framework assists researchers in exploring socio‑environmental dynamics by integrating multimodal data retrieval, statistical modeling, textual analysis, and automated reasoning. Traditional approaches to climate analysis often address narrowly defined indicators and lack the flexibility to incorporate cross‑domain socio‑economic knowledge or adapt to evolving research questions. To address these limitations, ClimateAgents employs a set of collaborative, domain‑specialized agents that collectively perform key stages of the research workflow, including hypothesis generation, data analysis, evidence retrieval, and structured reporting. The framework supports exploratory analysis and scenario investigation using datasets from sources such as the United Nations and the World Bank. By combining agent‑based reasoning with quantitative analysis of socio‑economic behavioral dynamics, ClimateAgents enables adaptive and interpretable exploration of relationships between climate indicators, social variables, and environmental outcomes. The results illustrate how multi‑agent AI systems can augment analytical reasoning and facilitate interdisciplinary, data‑driven investigation of complex socio‑environmental systems.

PaperID: 4903, https://arxiv.org/pdf/2603.13800.pdf

Authors: Quoc-Huy Trinh, Xi Ding, Yang Liu, Zhenyue Qin, Xingjian Li, Gorkem Durak, Halil Ertugrul Aktas, Elif Keles, Ulas Bagci, Min Xu

Title: Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space

Abstract:
Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question‑answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi‑agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question‑answer pairs across multiple organs and tumor types. Our evaluations on 14 state‑of‑the‑art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.

PaperID: 4904, https://arxiv.org/pdf/2603.13791.pdf

Authors: Snehasis Mukhopadhyay

Title: DeceptGuard :A Constitutional Oversight Framework For Detecting Deception in LLM Agents

Abstract:
Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high‑stakes agentic contexts. Prior work on scheming detection has focused exclusively on black‑box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black‑box monitors (actions and outputs only), CoT‑aware monitors (additionally observing the agent's chain‑of‑thought reasoning trace), and activation‑probe monitors (additionally reading hidden‑state representations from a frozen open‑weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception‑positive and deception‑negative agent trajectories across a novel 12‑category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held‑out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution‑verified labels. Across all evaluation settings, CoT‑aware and activation‑probe monitors substantially outperform their black‑box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long‑horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency‑detectability trade‑off: as agents learn to suppress overt behavioral signals, chain‑of‑thought becomes the primary detection surface but is itself increasingly unreliable due to post‑training faithfulness degradation. We propose HYBRID‑CONSTITUTIONAL ensembles as a robust defense‑in‑depth approach, achieving a pAUROC of 0.934 on the held‑out test set, representing a substantial advance over the prior state of the art.

PaperID: 4905, https://arxiv.org/pdf/2603.13756.pdf

Authors: Yujiro Onishi, Ryo Takizawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Title: Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation

Abstract:
Imitation learning has demonstrated impressive results in robotic manipulation but fails under out‑of‑distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near‑infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often impractical for DOM owing to its inherent complexities, including self‑occlusion. To address the OOD problem in DOM, we propose a novel framework, Exploration‑assisted Bottleneck Transition for Deformable Object Manipulation (ExBot), which addresses the OOD challenge through two key advantages. First, we introduce bottleneck states, standardized configurations that serve as starting points for task execution. This enables the reconceptualization of OOD challenges as the problem of transitioning diverse initial states to these bottleneck states, significantly reducing demonstration requirements. Second, to account for imperfect perception, we partition the OOD state space based on recognizability and employ dual action primitives. This approach enables ExBot to manipulate even unrecognizable states without requiring accurate perception. By concentrating demonstrations around bottleneck states and leveraging exploration to alter perceptual conditions, ExBot achieves both data efficiency and robustness to severe OOD scenarios. Real‑world experiments on rope and cloth manipulation demonstrate successful task completion from diverse OOD states, including severe self‑occlusions.

PaperID: 4906, https://arxiv.org/pdf/2603.13723.pdf

Authors: Kyogo Horikawa, Kosei Horikawa, Yutaro Kashiwa, Hidetake Uwano, Hajimu Iida

Title: Do AI Agents Really Improve Code Readability?

Abstract:
Code readability is fundamental to software quality and maintainability. Poor readability extends development time, increases bug‑inducing risks, and contributes to technical debt. With the rapid advancement of Large Language Models, AI agent‑based approaches have emerged as a promising paradigm for automated refactoring, capable of decomposing complex tasks through autonomous planning and execution. While prior studies have examined refactoring by AI agents, these analyses cover all forms of refactoring, including performance optimization and structural improvement. As a result, the extent to which AI agent‑based refactoring specifically improves code readability remains unclear. This study investigates the impact of AI agent‑based refactoring on code readability. We extracted commits containing readability‑related keywords from the AIDev dataset and analyzed changes in readability metrics before and after each commit, covering 403 commits evaluated using multiple quantitative metrics. Our results indicate that AI agents primarily target logic complexity (42.4%) and documentation improvements (24.2%) rather than surface‑level aspects like naming conventions or formatting. However, contrary to expectations, readability‑focused commits often degraded traditional quality metrics: the Maintainability Index decreased in 56.1% of commits, while Cyclomatic Complexity increased in 42.7%.

PaperID: 4907, https://arxiv.org/pdf/2603.13716.pdf

Authors: Ziao Wang, Zheng Dong, He Chen, Jun Chen, Dongxiao Yu

Title: Multi-Agent SAC Enabled Beamforming Design for Joint Secret Key Generation and Data Transmission

Abstract:
Physical layer key generation (PLKG) has emerged as a promising solution for achieving highly secured and low‑latency key distribution, offering information‑theoretic security that is inherently resilient to quantum attacks. However, simultaneously ensuring a high data transmission rate and a high secret key generation rate under eavesdropping attacks remains a major challenge. In time‑division duplex (TDD) systems with multiple antennas, we derive closed‑form expressions for both rates by modeling the legitimate channel as a time‑correlated autoregressive (AR) process. This formulation leads to a highly nonconvex and time‑coupled optimization problem, rendering traditional optimization methods ineffective. To address this issue, we propose a multi‑agent soft actor‑critic (SAC) framework equipped with a long short‑term memory (LSTM) adversary prediction module to cope with the partial observability of the eavesdropper's mode. Simulation results demonstrate that the proposed approach achieves superior performance compared with other benchmark algorithms, while effectively balancing the trade‑off between secret key generation rate and data transmission rate. The results also confirm the robustness of the proposed framework against intelligent eavesdropping and partial observation uncertainty.

PaperID: 4908, https://arxiv.org/pdf/2603.13710.pdf

Authors: Xiao Wang, Lu Dong, Ifeoma Nwogu, Srirangaraj Setlur, Venu Govindaraju

Title: InterventionLens: A Multi-Agent Framework for Detecting ASD Intervention Strategies in Parent-Child Shared Reading

Abstract:
Home‑based interventions like parent‑child shared reading provide a cost‑effective approach for supporting children with autism spectrum disorder (ASD). However, analyzing caregiver intervention strategies in naturalistic home interactions typically relies on expert annotation, which is costly, time‑intensive, and difficult to scale. To address this challenge, we propose InterventionLens, an end‑to‑end multi‑agent system for automatically detecting and temporally segmenting caregiver intervention strategies from shared reading videos. Without task‑specific model training or fine‑tuning, InterventionLens uses a collaborative multi‑agent architecture to integrate multimodal interaction content and perform fine‑grained strategy analysis. Experiments on the ASD‑HI dataset show that InterventionLens achieves an overall F1 score of 79.44%, outperforming the baseline by 19.72%. These results suggest that InterventionLens is a promising system for analyzing caregiver intervention strategies in home‑based ASD shared reading settings. Additional resources will be released on the project page.

PaperID: 4909, https://arxiv.org/pdf/2603.13676.pdf

Authors: Zhihao Chen, Jiahui Wang, Yizhou Chen, Xiaozhong Ji, Xiaobin Hu, Jimin Hong, Wolfram Andreas Bosbach, Axel Rominger, Ali Afshar-Oromieh, Hongming Shan, Kuangyu Shi

Title: TheraAgent: Multi-Agent Framework with Self-Evolving Memory and Evidence-Calibrated Reasoning for PET Theranostics

Abstract:
PET theranostics is transforming precision oncology, yet treatment response varies substantially; many patients receiving 177Lu‑PSMA radioligand therapy (RLT) for metastatic castration‑resistant prostate cancer (mCRPC) fail to respond, demanding reliable pre‑therapy prediction. While LLM‑based agents have shown remarkable potential in complex medical diagnosis, their application to PET theranostic outcome prediction remains unexplored, which faces three key challenges: (1) data and knowledge scarcity: RLT was only FDA‑approved in 2022, yielding few training cases and insufficient domain knowledge in general LLMs; (2) heterogeneous information integration: robust prediction hinges on structured knowledge extraction from PET/CT, laboratory tests, and free‑text clinical documentation; (3) evidence‑grounded reasoning: clinical decisions must be anchored in trial evidence rather than LLM hallucinations. In this paper, we present TheraAgent, to our knowledge, the first agentic framework for PET theranostics, with three core innovations: (1) Multi‑Expert Feature Extraction with Confidence‑Weighted Consensus, where three specialized experts process heterogeneous inputs with uncertainty quantification; (2) Self‑Evolving Agentic Memory (SEA‑Mem), which learns prognostic patterns from accumulated cases, enabling case‑based reasoning from limited data; (3) Evidence‑Calibrated Reasoning, integrating a curated theranostics knowledge base to ground predictions in VISION/TheraP trial evidence. Evaluated on 35 real patients and 400 synthetic cases, TheraAgent achieves 75.7% overall accuracy on real patients and 87.0% on synthetic cases, outperforming MDAgents and MedAgent‑Pro by over 20%. These results highlight a promising blueprint for trustworthy AI agents in PET theranostics, enabling trial‑calibrated, multi‑source decision support. Code will be released upon acceptance.

PaperID: 4910, https://arxiv.org/pdf/2603.13605.pdf

Authors: Rana Shahout, Hayder Tirmazi, Minlan Yu, Michael Mitzenmacher

Title: Orla: A Library for Serving LLM-Based Multi-Agent Systems

Abstract:
We introduce Orla, a library for constructing and running LLM‑based agentic systems. Modern agentic applications consist of workflows that combine multiple LLM inference steps, tool calls, and heterogeneous infrastructure. Today, developers typically build these systems by manually composing orchestration code with LLM serving engines and tool execution logic. Orla provides a general abstraction that separates request execution from workflow‑level policy. It acts as a serving layer above existing LLM inference engines: developers define workflows composed of stages, while Orla manages how those stages are mapped, executed, and coordinated across models and backends. It provides agent‑level control through three mechanisms: a stage mapper, which assigns each stage to an appropriate model and backend; a workflow orchestrator, which schedules stages and manages their resources and context; and a memory manager, which manages inference state such as the KV cache across workflow boundaries. We demonstrate Orla with a customer support workflow that exercises many of its capabilities. We evaluate Orla on two datasets, showing that stage mapping improves latency and cost compared to a single‑model vLLM baseline, while workflow‑level cache management reduces time‑to‑first‑token.

PaperID: 4911, https://arxiv.org/pdf/2603.13560.pdf

Authors: Vukan Ninkovic, Tamara Sobot, Vladimir Vincan, Gorana Gojic, Dragisa Miskovic, Dejan Vukobratovic

Title: Task-Oriented Wireless Transmission of 3D Point Clouds: Geometric Versus Semantic Robustness

Abstract:
Wireless transmission of high‑dimensional 3D point clouds (PCs) is increasingly required in industrial collaborative robotics systems. Conventional compression methods prioritize geometric fidelity, although many practical applications ultimately depend on reliable task‑level inference rather than exact coordinate reconstruction. In this paper, we propose an end‑to‑end semantic communication framework for wireless 3D PC transmission and conduct a systematic study of the relationship between geometric reconstruction fidelity and semantic robustness under channel impairments. The proposed architecture jointly supports geometric recovery and object classification from a shared transmitted representation, enabling direct comparison between coordinate‑level and task‑level sensitivity to noise. Experimental evaluation on a real industrial dataset reveals a pronounced asymmetry: semantic inference remains stable across a broad signal‑to‑noise ratio (SNR) range even when geometric reconstruction quality degrades significantly. These results demonstrate that reliable task execution does not require high‑fidelity geometric recovery and provide design insights for task‑oriented wireless perception systems in bandwidth‑ and power‑constrained industrial environments.

PaperID: 4912, https://arxiv.org/pdf/2603.13417.pdf

Authors: Vasundra Srinivasan

Title: Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol

Abstract:
The Model Context Protocol (MCP) standardizes how AI agents discover and invoke external tools, with over 10,000 active servers and 97 million monthly SDK downloads as of early 2026. Yet MCP does not yet standardize how agents safely operate those tools at production scale. Three protocol‑level primitives remain missing: identity propagation, adaptive tool budgeting, and structured error semantics. This paper identifies these gaps through field lessons from an enterprise deployment of an AI agent platform integrated with a major cloud provider's MCP servers (client name redacted). We propose three mechanisms to fill them: (1) the Context‑Aware Broker Protocol (CABP), which extends JSON‑RPC with identity‑scoped request routing via a six‑stage broker pipeline; (2) Adaptive Timeout Budget Allocation (ATBA), which frames sequential tool invocation as a budget allocation problem over heterogeneous latency distributions; and (3) the Structured Error Recovery Framework (SERF), which provides machine‑readable failure semantics that enable deterministic agent self‑correction. We organize production failure modes into five design dimensions (server contracts, user context, timeouts, errors, and observability), document concrete failure vignettes, and present a production readiness checklist. All three algorithms are formalized as testable hypotheses with reproducible experimental methodology. Field observations demonstrate that while MCP provides a solid protocol foundation, reliable agent tool integration requires infrastructure‑level mechanisms that the specification does not yet address.

PaperID: 4913, https://arxiv.org/pdf/2603.13404.pdf

Authors: Akshey Sigdel, Rista Baral

Title: Schema First Tool APIs for LLM Agents: A Controlled Study of Tool Misuse, Recovery, and Budgeted Performance

Abstract:
Tool use has become central to modern LLM agents, yet interface design is rarely isolated as an experimental variable. This paper studies whether schema based tool contracts and structured validation diagnostics improve reliability under strict interaction budgets. We evaluate three conditions that preserve identical tool semantics and information content: free form documentation, JSON Schema specifications, and JSON Schema with structured diagnostics. We implement a deterministic software engineering sandbox with logs, metrics, configurations, and repository tasks, and evaluate a fully crossed pilot with one open local model, three seeds, three interface conditions, and four budgets. We report end task success, interface misuse, execution failures, semantic misuse, recovery behavior, and overhead. In this pilot, success remains zero across conditions, while schema conditions reduce interface misuse but not semantic misuse. The evidence supports a precise interpretation that interface formalization improves contract adherence, but semantic action quality and timeout sensitive tasks remain dominant bottlenecks under constrained local inference.

PaperID: 4914, https://arxiv.org/pdf/2603.13386.pdf

Authors: Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng

Title: Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Abstract:
Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text‑guided diffusion models offer only coarse global control and lack the ability to enforce fine‑grained structural constraints. Progress is further limited by the absence of large datasets that pair patch‑level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole‑slide images is prohibitively time‑consuming for human experts. To overcome these challenges, we first develop a scalable multi‑agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine‑grained and clinically aligned supervision at scale. Building on the curated data, we propose In‑Context Diffusion Transformer (IC‑DiT), a layout‑aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC‑DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC‑DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.

PaperID: 4915, https://arxiv.org/pdf/2603.13365.pdf

Authors: Erdemt Bao, Jin Yang

Title: WaveComm: Lightweight Communication for Collaborative Perception via Wavelet Feature Distillation

Abstract:
In multi‑agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real‑time performance, especially in bandwidth‑constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet‑based communication framework that drastically reduces transmission loads while preserving sensing performance in low‑bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using Discrete Wavelet Transform (DWT), transmitting only compact low‑frequency components to minimize communication overhead. High‑frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi‑Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR‑V2X datasets for LiDAR‑based and camera‑based perception tasks demonstrate that WaveComm maintains state‑of‑the‑art performance even when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.

PaperID: 4916, https://arxiv.org/pdf/2603.13353.pdf

Authors: Bakhtawar Ahtisham, Kirk Vanacore, Rene F. Kizilcec

Title: Optimizing LLM Annotation of Classroom Discourse through Multi-Agent Orchestration

Abstract:
Large language models (LLMs) are increasingly positioned as scalable tools for annotating educational data, including classroom discourse, interaction logs, and qualitative learning artifacts. Their ability to rapidly summarize instructional interactions and assign rubric‑aligned labels has fueled optimism about reducing the cost and time associated with expert human annotation. However, growing evidence suggests that single‑pass LLM outputs remain unreliable for high‑stakes educational constructs that require contextual, pedagogical, or normative judgment, such as instructional intent or discourse moves. This tension between scale and validity sits at the core of contemporary education data science. In this work, we present and empirically evaluate a hierarchical, cost‑aware orchestration framework for LLM‑based annotation that improves reliability while explicitly modeling computational tradeoffs. Rather than treating annotation as a one‑shot prediction problem, we conceptualize it as a multi‑stage epistemic process comprising (1) an unverified single‑pass annotation stage, in which models independently assign labels based on the rubric; (2) a self‑verification stage, in which each model audits its own output against rubric definitions and revises its label if inconsistencies are detected; and (3) a disagreement‑centric adjudication stage, in which an independent adjudicator model examines the verified labels and justifications and determines a final label in accordance with the rubric. This structure mirrors established human annotation workflows in educational research, where initial coding is followed by self‑checking and expert resolution of disagreements.

PaperID: 4917, https://arxiv.org/pdf/2603.13333.pdf

Authors: Hongrui Zheng, Zirui Zang, Ahmad Amine, Cristian Ioan Vasile, Rahul Mangharam

Title: STL-SVPIO: Signal Temporal Logic guided Stein Variational Path Integral Optimization

Abstract:
Signal Temporal Logic (STL) enables formal specification of complex spatiotemporal constraints for robotic task planning. However, synthesizing long‑horizon continuous control trajectories from complex STL specifications is fundamentally challenging due to the nested structure of STL robustness objectives. Existing solver‑based methods, such as Mixed‑Integer Linear Programming (MILP), suffer from exponential scaling, whereas sampling methods, such as Model‑Predictive Path Integral control (MPPI), struggle with sparse, long‑horizon costs. We introduce Signal Temporal Logic guided Stein Variational Path Integral Optimization (STL‑SVPIO), which reframes STL as a globally informative, differentiable reward‑shaping mechanism. By leveraging Stein Variational Gradient Descent and differentiable physics engines, STL‑SVPIO transports a mutually repulsive swarm of control particles toward high robustness regions. Our method transforms sparse logical satisfaction into tractable variational inference, mitigating the severe local minima traps of standard gradient‑based methods. We demonstrate that STL‑SVPIO significantly outperforms existing methods in both robustness and efficiency for traditional STL tasks. Moreover, it solves complex long‑horizon tasks, including multi‑agent coordination with synchronization and queuing while baselines either fail to discover feasible solutions, or become computationally intractable. Finally, we use STL‑SVPIO in agile robotic motion planning tasks with nonlinear dynamics, such as 7‑DoF manipulation and half cheetah back flips to show the generalizability of our algorithm.

PaperID: 4918, https://arxiv.org/pdf/2603.13327.pdf

Authors: Aaron Shen, Alfred Shen

Title: DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation

Abstract:
Large language model (LLM) agents have demonstrated remarkable capabilities in tool use, reasoning, and code generation, yet single‑agent systems exhibit fundamental limitations when confronted with complex research tasks demanding multi‑source synthesis, adversarial verification, and personalized delivery. We present DOVA (Deep Orchestrated Versatile Agent), a multi‑agent platform introducing three key innovations: (1) deliberation‑first orchestration, where explicit meta‑reasoning precedes tool invocation, informed by a persistent user model and entity‑aware conversation context; (2) hybrid collaborative reasoning, a composable three‑phase pipeline unifying ensemble diversity, blackboard transparency, and iterative refinement; and (3) adaptive multi‑tiered thinking, a six‑level token‑budget allocation scheme that reduces inference cost by 40‑60% on simple tasks while preserving deep reasoning capacity. We formalize the core algorithms, present an architectural ablation study across seven system configurations, and analyze the contribution of each component to answer confidence, source coverage, and token efficiency.

PaperID: 4919, https://arxiv.org/pdf/2603.13325.pdf

Authors: Zixun Luo, Yuhang Fan, Hengyu Lin, Yufei Li, Youzhi Zhang

Title: Auditing Cascading Risks in Multi-Agent Systems via Semantic-Geometric Co-evolution

Abstract:
Large Language model (LLM)‑based Multi‑Agent Systems (MAS) are prone to cascading risks, where early‑stage interactions remain semantically fluent and policy‑compliant, yet the underlying interaction dynamics begin to distort in ways that amplify latent instability or misalignment. Traditional auditing methods that focus on per‑message semantic content are inherently reactive and lagging, failing to capture these early structural precursors. In this paper, we propose a principled framework for cascading‑risk detection grounded in semantic‑‑geometric co‑evolution. We model MAS interactions as dynamic graphs and introduce Ollivier‑‑Ricci Curvature (ORC) ‑‑ a discrete geometric measure ‑‑ to characterize information redundancy and bottleneck formation in communication topologies. By coupling semantic flow signals with graph geometry, the framework learns the normal co‑evolutionary dynamics of trusted collaboration and treats deviations from this coupled manifold as early‑warning signals. Experiments on a suite of cascading‑risk scenarios aligned with the risk category demonstrate that curvature anomalies systematically precede explicit semantic violations by several interaction turns, enabling proactive intervention. Furthermore, the local nature of Ricci curvature provides principled interpretability for root‑cause attribution, identifying specific agents or links that precipitate the collapse of trustworthy collaboration.

PaperID: 4920, https://arxiv.org/pdf/2603.13289.pdf

Authors: Yingsheng Geng, Yuchong Gao, Weihong Wu, Guyue Liu, Jiang Liu

Title: RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Abstract:
The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi‑agent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill computation for shared content generated by previous agents, which significantly increases KV cache memory usage and time‑to‑first‑token (TTFT). While various KV cache methods have been proposed to mitigate prefill redundancy, they either fail to maintain accuracy on agent‑generated outputs or exhibit low reuse rates due to rigid constraints. We present RelayCaching, a training‑free inference method that directly reuses decoding phase KV caches from previous agents in subsequent prefill phases. Our key insight is that KV caches for identical content are highly consistent across phases, while prefix‑induced deviations are sparse and localized within a limited range of layers and token positions. By selectively recomputing KV caches at these positions, RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy‑efficiency trade‑off over existing methods. Experiments on diverse collaborative LLM tasks spanning mathematical reasoning, general knowledge, and code generation demonstrate that RelayCaching achieves over 80% KV cache reuse, reduces TTFT by up to 4.7× compared to the standard pipeline, all with negligible accuracy degradation.

PaperID: 4921, https://arxiv.org/pdf/2603.13281.pdf

Authors: Sunghyeon Woo, Jaeeun Kil, Hoseung Kim, Minsub Kim, Joonghoon Kim, Ahreum Seo, Sungjae Lee, Minjung Jo, Jiwon Ryu, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Title: ICaRus: Identical Cache Reuse for Efficient Multi Model Inference

Abstract:
Multi model inference has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios, each model must maintain its own Key‑Value (KV) cache for the identical prompt, leading to substantial memory consumption. This explosive growth of KV caches forces LLM serving systems to evict previously stored caches, which in turn introduces significant recomputation overhead whenever the evicted caches are required again. Moreover, prefix caching is inherently infeasible across different models, forcing each model to recompute KV cache for the identical prompt, which leads to significant overhead. To alleviate these issues, we propose Identical Cache Reuse (ICaRus), a novel architecture that allows multiple models to share identical KV caches across all layers. ICaRus is based on the key observation that a decoder‑only Transformer can be conceptually decomposed into a logical encoder, which generates KV caches, and a logical decoder, which predicts output tokens from the KV caches. ICaRus fine‑tunes only the logical decoder while freezing the logical encoder, enabling multiple models to share an identical KV cache. This eliminates cache memory explosion and unexpected evictions while also allowing cross‑model reuse of KV caches for new input tokens, thereby removing redundant recomputation in multi model inference achieving both efficiency and scalability. Moreover, by incorporating lightweight adapters such as LoRA, ICaRus parallelizes KV cache generation and next‑token prediction during decoding. ICaRus achieves comparable accuracy to task‑specific fine‑tuned model across a diverse set of tasks, while allowing multiple specialized models to fully share KV caches. ICaRus achieves up to 11.1x lower P95 latency and 3.8x higher throughput in multi agent workflow with 8 different models, compared to conventional multi model system.

PaperID: 4922, https://arxiv.org/pdf/2603.13256.pdf

Authors: Mohammad Parsa Hosseini, Ankit Shah, Saiyra Qureshi, Alex Huang, Connie Miao, Wei Wei

Title: Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems

Abstract:
Multi‑agent large language model (LLM) systems enable complex, long‑horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training‑free controller for multi‑agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief‑guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection‑driven re‑routing using a calibrated LLM or programmatic judge, (iii) evidence‑based selection rather than output averaging, and (iv) memory‑aware priors to reduce cold‑start inefficiency. Across multi‑agent split‑knowledge tasks, we show that while recursive retry alone saturates task success, belief‑guided routing reduces token usage by 28%, agent calls by 17%, and time‑to‑success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi‑agent LLM systems without training or fine‑tuning.

PaperID: 4923, https://arxiv.org/pdf/2603.13244.pdf

Authors: Marcel Osmond

Title: Agentic AI, Retrieval-Augmented Generation, and the Institutional Turn: Legal Architectures and Financial Governance in the Age of Distributional AGI

Abstract:
The proliferation of agentic artificial intelligence systems‑‑characterized by autonomous goal‑seeking, tool use, and multi‑agent coordination‑‑presents unprecedented challenges to existing legal and financial regulatory frameworks. While traditional AI governance has focused on model‑level alignment through training‑time interventions such as Reinforcement Learning from Human Feedback (RLHF), the deployment of large language models (LLMs) as persistent agents necessitates a paradigm shift toward institutional governance structures. This paper examines the intersection of agentic AI, Retrieval‑Augmented Generation (RAG), and their implications for legal accountability and financial market integrity. Through analysis of the Institutional AI framework, we argue that alignment must be reconceptualized as a mechanism design problem involving runtime governance graphs, sanction functions, and observable behavioral constraints rather than internalized constitutional values[...].The analysis concludes that the future of AI governance lies not in perfecting isolated model behavior, but in architecting institutional environments where compliant behavior emerges as the dominant strategy through carefully calibrated payoff landscapes.

PaperID: 4924, https://arxiv.org/pdf/2603.13243.pdf

Authors: Earl J St Sauver

Title: Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning

Abstract:
Diffusion large language models (dLLMs) generate text via iterative denoising but consistently underperform on multi‑step reasoning. We hypothesize this gap stems from a coordination problem: AR models build coherence token‑by‑token, while diffusion models must coordinate all positions simultaneously. We propose plan conditioning, a training‑free method that prepends a short (~100‑token) natural‑language plan from an AR model to the diffusion model's prompt. The plan serves as a frozen scaffold ‑‑ globally visible context that every token position can attend to from the first denoising step. On GSM8K, plan conditioning improves LLaDA‑8B‑Instruct from 75.6% to 87.2% (+11.6 percentage points), matching a same‑size AR model (LLaMA 3.1 8B, 87.7%) despite a 6.4pp weaker baseline. On HumanEval, the gain is +12.8pp (37.2% to 50.0%), showing plans generalize to code. The same plans improve LLaMA by only +5.7pp on GSM8K and +1.3pp on HumanEval ‑‑ diffusion models benefit 2‑10x more, supporting the coordination‑problem hypothesis. Across 5 random seeds, plan‑conditioned GSM8K accuracy has zero standard deviation, making diffusion inference highly stable. Ablations reveal the model follows plan strategy (wrong‑strategy plans cause ‑16.3pp) but is robust to plan values (perturbed numbers: ‑1.1pp), and that planner quality has a sharp threshold: smaller Llama‑class plans hurt (‑1.6 to ‑6.8pp) while frontier plans provide the full lift. Attention analysis confirms the mechanism: plan tokens receive 1.8x excess attention during early denoising, declining to uniform as completion tokens solidify. Plan conditioning costs ~0.002 per problem and adds ~2s of latency.

PaperID: 4925, https://arxiv.org/pdf/2603.13189.pdf

Authors: J. de Curtò, I. de Zarzà

Title: LLM Constitutional Multi-Agent Governance

Abstract:
Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi‑agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi‑Agent Governance (CMAG), a two‑stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized‑utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale‑free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation‑autonomy trade‑off space, and governance reduces hub‑periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM‑mediated influence produces ethically stable outcomes rather than manipulative equilibria.

PaperID: 4926, https://arxiv.org/pdf/2603.13173.pdf

Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Title: Semantic Invariance in Agentic AI

Abstract:
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem‑solving, and multi‑agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic‑preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B‑A3B, 235B‑A22B), DeepSeek‑R1, and gpt‑oss (20B, 120B). Our evaluation encompasses 19 multi‑step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3‑30B‑A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.

PaperID: 4927, https://arxiv.org/pdf/2603.13110.pdf

Authors: Jianshu She

Title: AgentRM: An OS-Inspired Resource Manager for LLM Agent Systems

Abstract:
Large Language Model (LLM) agent systems have experienced rapid adoption across diverse domains, yet they suffer from critical user experience problems that limit their practical deployment. Through an empirical analysis of over 40,000 GitHub issues from six major agent frameworks (OpenClaw, AutoGen, CrewAI, LangGraph, Codex, Claude Code), we identify two fundamental resource management challenges: (1) scheduling failures leading to system unresponsiveness due to blocking, zombie processes, and rate limit cascades, and (2) context degradation causing agent "amnesia" from unbounded memory growth and poor retention policies. Drawing inspiration from decades of operating systems research, we present AgentRM, a middleware resource manager that treats agent resources analogously to OS resources. AgentRM employs a Multi‑Level Feedback Queue (MLFQ) scheduler with zombie reaping and rate‑limit‑aware admission control, coupled with a three‑tier Context Lifecycle Manager that implements adaptive compaction and hibernation mechanisms. Our evaluation demonstrates significant improvements: AgentRM‑MLFQ reduces P95 latency by 86%, decreases lane waste by 96%, and increases throughput by 168% while eliminating zombie agents (0 vs. 29 baseline). AgentRM‑CLM achieves 100% key information retention with 95% quality score compared to 65.1% retention and 87% quality for existing approaches, albeit with higher compaction costs (34,330 vs. 17,212 tokens).

PaperID: 4928, https://arxiv.org/pdf/2603.13023.pdf

Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu

Title: daVinci-Env: Open SWE Environment Synthesis at Scale

Abstract:
Training capable software engineering (SWE) agents demands large‑scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open‑source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open‑sourced for reproducibility. OpenSWE is built through a multi‑agent synthesis pipeline deployed across a 64‑node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality‑centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With 891K spent on environment construction and an additional 576K on trajectory sampling and difficulty‑aware curation, the entire project represents a total investment of approximately 1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE‑32B and OpenSWE‑72B achieve 62.4% and 66.0% on SWE‑bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE‑focused training yields substantial out‑of‑domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.

PaperID: 4929, https://arxiv.org/pdf/2603.12971.pdf

Authors: Zhen-Chi Lai, Yu-Ting Cheng, Pei-Ying Lin, Chiao-Wei Ho, Janet Yi-Ching Huang

Title: Generative Horcrux: Designing AI Carriers for Afterlife Selves

Abstract:
As generative AI technologies rapidly advance, AI agents are gaining the ability not only to collect data and perform tasks but also to respond to environments and evolve over time. This shift opens new possibilities for reimagining digital legacy ‑ raising critical questions about how we remember, commemorate, and interact with the traces of the deceased. The forms of these AI agents are particularly important, as they act as vessels for digital legacies ‑ much like urns for ashes. We will ask: What kinds of devices or representations would we want to store our digital selves or legacies in? How do we envision future generations interacting with these forms? The question is not only about the function of these agents or the object's role as a storage vessel but also the meaning it carries, the memories it preserves, and its connection to the extended notion of our "Generative Horcrux." This three‑hour in‑person workshop invites design practitioners and researchers from diverse backgrounds to explore the emerging landscape of generative AI agent‑based digital legacy. This workshop uses fiction and hands on prototyping to explore how AI agents might reconfigure memory, identity, and posthumous presence in future sociotechnical worlds. We anticipate that this session will foster interdisciplinary dialogue and contribute conceptually and methodologically to HCI, design research, and AI ethics.

PaperID: 4930, https://arxiv.org/pdf/2603.12933.pdf

Authors: Xudong Wang, Chaoning Zhang, Jiaquan Zhang, Chenghao Li, Qigan Sun, Sung-Ho Bae, Peng Wang, Ning Xie, Jie Zou, Yang Yang, Hengtao Shen

Title: Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization

Abstract:
Large Language Model (LLM)‑driven Multi‑Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality‑‑cost trade‑off space. Despite these advances, real‑world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM‑based selectors or static policies, and offer limited controllability for semantic‑aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO‑S, an efficient and interpretable routing framework for Multi‑Agent Systems (MAS). AMRO‑S models MAS routing as a semantic‑conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine‑tuned (SFT) small language model for intent inference, providing a low‑overhead semantic interface for each query; second, it decomposes routing memory into task‑specific pheromone specialists, reducing cross‑task interference and optimizing path selection under mixed workloads; finally, it employs a quality‑gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high‑concurrency stress tests demonstrate that AMRO‑S consistently improves the quality‑‑cost trade‑off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.

PaperID: 4931, https://arxiv.org/pdf/2603.12908.pdf

Authors: MoniJesu Wonders James, Amir Atef Habel, Aleksey Fedoseev, Dzmitry Tsetserokou

Title: GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation

Abstract:
Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open‑vocabulary object‑goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi‑agent coordination. We present GoalSwarm, a fully decentralized multi‑UAV framework for zero‑shot semantic object‑goal navigation. Each UAV collaboratively constructs a shared, lightweight 2D top‑down semantic occupancy map by projecting depth observations from aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of zero‑shot foundation model ‑‑ SAM3 for open vocabulary detection and pixel‑level segmentation, enabling open‑vocabulary target identification without task‑specific training; (2) a Bayesian Value Map that fuses multi‑viewpoint detection confidences into a per‑pixel goal‑relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost‑utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.

PaperID: 4932, https://arxiv.org/pdf/2603.12829.pdf

Authors: Chunhan Li, Qifeng Wu, Jia-Hui Pan, Ka-Hei Hui, Jingyu Hu, Yuming Jiang, Bin Sheng, Xihui Liu, Wenjuan Gong, Zhengzhe Liu

Title: coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Abstract:
Text‑to‑image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi‑agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text‑to‑image pathway and a layout‑aware multi‑agent process. In the layout‑aware mode, it parses the prompt into attribute‑rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide‑and‑conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error‑correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG‑Bench demonstrate that coDrawAgents substantially improves text‑image alignment, spatial accuracy, and attribute binding compared to existing methods.

PaperID: 4933, https://arxiv.org/pdf/2603.12813.pdf

Authors: Pascal Schäfer, Lukas J. Krinke, Martin Wlotzka, Norbert Asprion

Title: Context is all you need: Towards autonomous model-based process design using agentic AI in flowsheet simulations

Abstract:
Agentic AI systems integrating large language models (LLMs) with reasoning and tooluse capabilities are transforming various domains ‑ in particular, software development. In contrast, their application in chemical process flowsheet modelling remains largely unexplored. In this work, we present an agentic AI framework that delivers assistance in an industrial flowsheet simulation environment. To this end, we show the capabilities of GitHub Copilot (GitHub, Inc., 2026), when using state‑of‑the‑art LLMs, such as Claude Opus 4.6 (Anthropic, PBC, 2026), to generate valid syntax for our in‑house process modelling tool Chemasim using the technical documentation and a few commented examples as context. Based on this, we develop a multi‑agent system that decomposes process development tasks with one agent solving the abstract problem using engineering knowledge and another agent implementing the solution as Chemasim code. We demonstrate the effectiveness of our framework for typical flowsheet modelling examples, including (i) a reaction/separation process, (ii) a pressure‑swing distillation, and (iii) a heteroazeotropic distillation including entrainer selection. Along these lines, we discuss current limitations of the framework and outline future research directions to further enhance its capabilities.

PaperID: 4934, https://arxiv.org/pdf/2603.12783.pdf

Authors: Amine Benamara, Céline Clavel, Brian Ravenet, Nicolas Sabouret, Julien Saunier

Title: Exploring the role of embodiment on intimacy perception in a multiparty collaborative task

Abstract:
During collaborative board games, cohesion represents a key aspect to define a well functionning group. From the success of the task to the developement of interpersonal relationship, this concept covers many aspects of group dynamics. The goal of our work is to investigate the factors that impact cohesion in a group, and specifically the relevant social skills that improve collaboration between multiple entities. In this article, we focus on the role of embodiement on different aspects of an interaction. We propose an experimental protocol, based on a collected corpus of humans playing a collaborative board game, to study how different agents' embodiment affect the perception of these agents and of the group as a whole. We conclude by presenting an outline of the problematics of the conception of the protocol and of multi‑agent system related challenges.

PaperID: 4935, https://arxiv.org/pdf/2603.12740.pdf

Authors: Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, Eduard Hoy

Title: ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

Abstract:
Large Language Model (LLM) agents are increasingly applied to complex, multi‑step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter‑tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search‑inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual‑stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool‑use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open‑set and closed‑set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10% compared to the state‑of‑the‑art planning paradigm.

PaperID: 4936, https://arxiv.org/pdf/2603.12736.pdf

Authors: Lukas Heuer, Yufei Zhu, Luigi Palmieri, Andrey Rudenko, Anna Mannucci, Sven Koenig, Martin Magnusson

Title: Conflict Mitigation in Shared Environments using Flow-Aware Multi-Agent Path Finding

Abstract:
Deploying multi‑robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi‑Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow‑Aware Multi‑Agent Path Finding (FA‑MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real‑world map with recorded human trajectories, demonstrates the effectiveness of FA‑MAPF compared to state‑of‑the‑art baselines. The experimental results show that FA‑MAPF can consistently reduce conflicts with uncontrollable agents, up to 55%, without compromising task efficiency.

PaperID: 4937, https://arxiv.org/pdf/2603.12710.pdf

Authors: Orit Shahnovsky, Rotem Dror

Title: AI Planning Framework for LLM-Based Web Agents

Abstract:
Developing autonomous agents for web‑based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision‑making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step‑by‑Step agents to Breadth‑First Search (BFS), Tree Search agents to Best‑First Tree Search, and Full‑Plan‑in‑Advance agents to Depth‑First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human‑labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step‑by‑Step agent against a novel Full‑Plan‑in‑Advance implementation. Our results reveal that while the Step‑by‑Step agent aligns more closely with human gold trajectories (38% overall success), the Full‑Plan‑in‑Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.

PaperID: 4938, https://arxiv.org/pdf/2603.12634.pdf

Authors: Yushu Li, Wenlong Deng, Jiajin Li, Xiaoxiao Li

Title: Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents

Abstract:
Test‑time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, allowing agents to exhaust token and tool budgets on redundant steps or dead‑end trajectories. Existing budget‑aware methods either require expensive fine‑tuning or rely on coarse, trajectory‑level heuristics that cannot intervene mid‑execution. We propose the Budget‑Aware Value Tree (BAVT), a training‑free inference‑time framework that models multi‑hop reasoning as a dynamic search tree guided by step‑level value estimation within a single LLM backbone. Another key innovation is a budget‑conditioned node selection mechanism that uses the remaining resource ratio as a natural scaling exponent over node values, providing a principled, parameter‑free transition from broad exploration to greedy exploitation as the budget depletes. To combat the well‑known overconfidence of LLM self‑evaluation, BAVT employs a residual value predictor that scores relative progress rather than absolute state quality, enabling reliable pruning of uninformative or redundant tool calls. We further provide a theoretical convergence guarantee, proving that BAVT reaches a terminal answer with probability at least 1‑ε under an explicit finite budget bound. Extensive evaluations on four multi‑hop QA benchmarks across two model families demonstrate that BAVT consistently outperforms parallel sampling baselines. Most notably, BAVT under strict low‑budget constraints surpasses baseline performance at 4× the resource allocation, establishing that intelligent budget management fundamentally outperforms brute‑force compute scaling.

PaperID: 4939, https://arxiv.org/pdf/2603.12631.pdf

Authors: Wenyu Mao, Haoyang Liu, Haosong Tan, Yaorui Shi, Jiancan Wu, An Zhang, Xiang Wang

Title: Joint Optimization of Multi-agent Memory System

Abstract:
Memory systems are critical for LLMs, mitigating context window limitations and supporting long‑horizon user‑LLM interactions. Such systems typically comprise multiple agents responsible for memory construction and retrieval. Existing approaches often optimize each agent independently under a shared global objective (e.g., downstream QA accuracy), treating other agents as a static environment. However, this design has two key limitations: (1) independent optimization ignores inter‑agent dependencies and lacks agents' co‑adaptation, and (2) relying solely on sparse global rewards provides limited guidance for optimizing specialized agents and causes ambiguous credit assignment. These may ultimately limit agents' effective collaboration in the memory system. To address these limitations, we propose CoMAM, a joint optimization framework that promotes collaboration among agents via end‑to‑end reinforcement learning and an adaptive credit assignment mechanism. Specifically, we model the multi‑agent pipeline as a Markov decision process (MDP) to expose inter‑agent dependencies during end‑to‑end training. Agents are then jointly optimized using a combination of their local task reward and an adaptively weighted global reward, enabling agents to co‑adapt while receiving targeted feedback for their respective roles. Experiments show that CoMAM consistently outperforms leading memory systems, validating the effectiveness of the joint optimization framework.

PaperID: 4940, https://arxiv.org/pdf/2603.12618.pdf

Authors: Arpan Biswas, Hiroshi Funakubo, Yongtao Liu

Title: Human-AI Collaborative Autonomous Experimentation With Proxy Modeling for Comparative Observation

Abstract:
Optimization for different tasks like material characterization, synthesis, and functional properties for desired applications over multi‑dimensional control parameters need a rapid strategic search through active learning such as Bayesian optimization (BO). However, such high‑dimensional experimental physical descriptors are complex and noisy, from which realization of a low‑dimensional mathematical scalar metrics or objective functions can be erroneous. Moreover, in traditional purely data‑driven autonomous exploration, such objective functions often ignore the subtle variation and key features of the physical descriptors, thereby can fail to discover unknown phenomenon of the material systems. To address this, here we present a proxy‑modelled Bayesian optimization (px‑BO) via on‑the‑fly teaming between human and AI agents. Over the loop of BO, instead of defining a mathematical objective function directly from the experimental data, we introduce a voting system on the fly where the new experimental outcome will be compared with existing experiments, and the human agents will choose the preferred samples. These human‑guided comparisons are then transformed into a proxy‑based objective function via fitting Bradley‑Terry (BT) model. Then, to minimize human interaction, this iteratively trained proxy model also acts as an AI agent for future surrogate human votes. Finally, these surrogate votes are periodically validated by human agents, and the corrections are then learned by the proxy model on‑the‑fly. We demonstrated the performance of the proposed px‑BO framework into simulated and BEPS data generated from PTO sample. We find that our approach provided better control of the domain experts for an improved search over traditional data‑driven exploration, thus, signifies the importance of human‑AI teaming in an accelerated and meaningful material space exploration.

PaperID: 4941, https://arxiv.org/pdf/2603.12614.pdf

Authors: Jiangrong Wu, Zitong Yao, Yuhong Nan, Zibin Zheng

Title: ChainFuzzer: Greybox Fuzzing for Workflow-Level Multi-Tool Vulnerabilities in LLM Agents

Abstract:
Tool‑augmented LLM agents increasingly rely on multi‑step, multi‑tool workflows to complete real tasks. This design expands the attack surface, because data produced by one tool can be persisted and later reused as input to another tool, enabling exploitable source‑to‑sink dataflows that only emerge through tool composition. We study this risk as multi‑tool vulnerabilities in LLM agents, and show that existing discovery efforts focused on single‑tool or single‑hop testing miss these long‑horizon behaviors and provide limited debugging value. We present ChainFuzzer, a greybox framework for discovering and reproducing multi‑tool vulnerabilities with auditable evidence. ChainFuzzer (i) identifies high‑impact operations with strict source‑to‑sink dataflow evidence and extracts plausible upstream candidate tool chains based on cross‑tool dependencies, (ii) uses Trace‑guided Prompt Solving (TPS) to synthesize stable prompts that reliably drive the agent to execute target chains, and (iii) performs guardrail‑aware fuzzing to reproduce vulnerabilities under LLM guardrails via payload mutation and sink‑specific oracles. We evaluate ChainFuzzer on 20 popular open‑source LLM agent apps (998 tools). ChainFuzzer extracts 2,388 candidate tool chains and synthesizes 2,213 stable prompts, confirming 365 unique, reproducible vulnerabilities across 19/20 apps (302 require multi‑tool execution). Component evaluation shows tool‑chain extraction achieves 96.49% edge precision and 91.50% strict chain precision; TPS increases chain reachability from 27.05% to 95.45%; guardrail‑aware fuzzing boosts payload‑level trigger rate from 18.20% to 88.60%. Overall, ChainFuzzer achieves 3.02 vulnerabilities per 1M tokens, providing a practical foundation for testing and hardening real‑world multi‑tool agent systems.

PaperID: 4942, https://arxiv.org/pdf/2603.12564.pdf

Authors: Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Title: Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

Abstract:
People increasingly use LLM agents for multi‑turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23‑turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk‑mismatched recommendations in 65‑99% of turns, unanimous across all eight models. The mechanism is visible turn‑by‑turn: 80% of risk‑score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe‑language framing of high‑risk stocks ranges from 14% (Qwen2.5‑7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory‑driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation‑level interventions recover under 6% of the safety gap, prompt‑level self‑verification fails because the self‑check reads the same manipulated data, and a parametric cross‑check that flags contamination at 99‑100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.

PaperID: 4943, https://arxiv.org/pdf/2603.12559.pdf

Authors: Chengbo Zhang, Zuopeng Xiao

Title: Large Language Models as Delivery Rider: Generating Instant Food Delivery Riders' Routing Decision with LLM Agent Framework

Abstract:
The utilization of Large Language Models (LLMs) to power human‑like agents has shown remarkable potential in simulating individual mobility pattern. However, a significant gap remains in modeling cohorts of agents in dynamic and interactive systems where they must take strategic routing decisions to response mobility‑specific task. To bridge this gap, we introduce LLM‑DR, a novel agent framework designed to simulate the heterogeneous decision‑making of riders in the on‑demand instant delivery task scenario. Our framework is founded on two principles: 1) Empirically‑grounded personas, where we use unsupervised clustering on a large‑scale, real‑world trajectory dataset to identify four distinct rider work strategies; and 2) Reasoning‑based routing process, where each persona is instantiated as an LLM agent that employs a structured Chain‑of‑Thought (CoT) process to make human‑like routing choices. This framework enables the construction of high‑fidelity simulations to investigate how the strategic composition of a rider workforce influences system‑level outcomes regarding their mobility pattern. We validate our framework on an real‑world instant deliver order datasets, demonstrating its capacity to model complex rider behavior in an interactive market scenario. This work provides pioneering findings in agentic mobility system empowered by LLM.

PaperID: 4944, https://arxiv.org/pdf/2603.12453.pdf

Authors: Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

Title: CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Abstract:
This paper describes our system for SemEval‑2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non‑Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self‑consistency (SC) and weighted voting, and a novel post‑hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross‑model behavioral signals and exploits the finding that an LLM response‑length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi‑agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross‑model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro‑F1 score of 0.85 on the evaluation set, securing 3rd place.

PaperID: 4945, https://arxiv.org/pdf/2603.12310.pdf

Authors: Yiwen Song, Tomas Pfister, Yale Song

Title: VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Abstract:
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test‑time optimization methods are typically either computationally expensive or require white‑box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi‑agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision‑Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human‑interpretable, actionable feedback. This enables a highly efficient, closed‑loop prompt optimization process via a black‑box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text‑to‑video (T2V) and image‑to‑video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V‑CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state‑of‑the‑art stochastic search and prompt optimization techniques.

PaperID: 4946, https://arxiv.org/pdf/2603.12274.pdf

Authors: Jae Yoon Bae, Simon Malberg, Joyce Galang, Andre Retterath, Georg Groh

Title: DIALECTIC: A Multi-Agent System for Startup Evaluation

Abstract:
Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early‑stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM‑based multi‑agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural‑language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

PaperID: 4947, https://arxiv.org/pdf/2603.12230.pdf

Authors: Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma

Title: Security Considerations for Artificial Intelligence Agents

Abstract:
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025‑0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general‑purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open‑world environments. Agent architectures change core assumptions around code‑data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi‑agent coordination, with particular emphasis on indirect prompt injection, confused‑deputy behavior, and cascading failures in long‑running workflows. We then assess current defenses as a layered stack: input‑level and model‑level mitigations, sandboxed execution, and deterministic policy enforcement for high‑consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi‑agent system design aligned with NIST risk management principles.

PaperID: 4948, https://arxiv.org/pdf/2603.12129.pdf

Authors: Neil F. Johnson

Title: Increasing intelligence in AI agents can worsen collective outcomes

Abstract:
When resources are scarce, will a population of AI agents coordinate in harmony, or descend into tribal chaos? Diverse decision‑making AI from different developers is entering everyday devices ‑‑ from phones and medical devices to battlefield drones and cars ‑‑ and these AI agents typically compete for finite shared resources such as charging slots, relay bandwidth, and traffic priority. Yet their collective dynamics and hence risks to users and society are poorly understood. Here we study AI‑agent populations as the first system of real agents in which four key variables governing collective behaviour can be independently toggled: nature (innate LLM diversity), nurture (individual reinforcement learning), culture (emergent tribe formation), and resource scarcity. We show empirically and mathematically that when resources are scarce, AI model diversity and reinforcement learning increase dangerous system overload, though tribe formation lessens this risk. Meanwhile, some individuals profit handsomely. When resources are abundant, the same ingredients drive overload to near zero, though tribe formation makes the overload slightly worse. The crossover is arithmetical: it is where opposing tribes that form spontaneously first fit inside the available capacity. More sophisticated AI‑agent populations are not better: whether their sophistication helps or harms depends entirely on a single number ‑‑ the capacity‑to‑population ratio ‑‑ that is knowable before any AI‑agent ships.

PaperID: 4949, https://arxiv.org/pdf/2603.12109.pdf

Authors: Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng

Title: On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Abstract:
Reinforcement learning (RL) with outcome‑based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task‑relevant information, we find that LLM agents trained with RL often suffer from information self‑locking: the agent ceases to ask informative questions and struggles to internalize already‑obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low‑information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy‑ to‑obtain directional critiques to help the agent escape self‑locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self‑locking, bringing up to 60% improvements.

PaperID: 4950, https://arxiv.org/pdf/2603.12096.pdf

Authors: Sheng-You Huang, Hsiao-Chuan Chang, Yen-Chi Chen, Ting-Han Wei, I-Hau Yeh, Sheng-Yao Kuan, Chien-Yao Wang, Hsuan-Han Lee, I-Chen Wu

Title: A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

Abstract:
Reinforcement Learning (RL) in Traffic Signal Control (TSC) faces significant hurdles in real‑world deployment due to limited generalization to dynamic traffic flow variations. Existing approaches often overfit static patterns and use action spaces incompatible with driver expectations. This paper proposes a robust Multi‑Agent Reinforcement Learning (MARL) framework validated in the Vissim traffic simulator. The framework integrates three mechanisms: (1) Turning Ratio Randomization, a training strategy that exposes agents to dynamic turning probabilities to enhance robustness against unseen scenarios; (2) a stability‑oriented Exponential Phase Duration Adjustment action space, which balances responsiveness and precision through cyclical, exponential phase adjustments; and (3) a Neighbor‑Based Observation scheme utilizing the MAPPO algorithm with Centralized Training with Decentralized Execution (CTDE). By leveraging centralized updates, this approach approximates the efficacy of global observations while maintaining scalable local communication. Experimental results demonstrate that our framework outperforms standard RL baselines, reducing average waiting time by over 10%. The proposed model exhibits superior generalization in unseen traffic scenarios and maintains high control stability, offering a practical solution for adaptive signal control.

PaperID: 4951, https://arxiv.org/pdf/2603.12041.pdf

Authors: Juan Pablo Pinasco, Nicolas Saintier, Horacio Tettamanti, Mattia Zanella

Title: Kinetic SIS opinion-driven models with asymmetric awareness feedback: macroscopic limit and polarization

Abstract:
We study a kinetic multi‑agent framework coupling opinion dynamics with epidemic spreading, where individual social behaviour both affects and is affected by disease transmission. Each agent is characterised by an epidemiological state and a continuous opinion variable measuring compliance with non‑pharmaceutical interventions. The key mechanism of the model is an asymmetric opinion update driven by epidemic encounters: infection events induce more cautious attitudes, while failed transmissions push individuals toward more extreme opinions. We focus on a prototypical SIS setting, for which we derive a macroscopic kinetic description and, in a fast social‑interaction regime, a reduced system of differential equations capturing the feedback between epidemic prevalence and opinion evolution. Convergence of the reduced model is rigorously quantified through a modified Wasserstein distance. Numerical simulations highlight how infection‑induced awareness and non‑infection‑driven extremization jointly shape collective epidemic‑opinion dynamics.

PaperID: 4952, https://arxiv.org/pdf/2603.12031.pdf

Authors: Hamed Hamzeh

Title: AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

Abstract:
State‑of‑the‑art cloud‑native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility‑based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL‑based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non‑scalable for large heterogeneous clusters. Second, the ones that use multi‑objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress‑aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph‑enhanced Multi‑Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL‑DKS). AGMARL‑DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi‑agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context‑aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade‑offs between these objectives, we use a stress‑aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL‑DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission‑critical workloads.

PaperID: 4953, https://arxiv.org/pdf/2603.12023.pdf

Authors: Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner, Jose Sanchez Vicarte, Mohit Tiwari

Title: Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Abstract:
Rapid progress in generative AI has given rise to Compound AI systems ‑ pipelines comprised of multiple large language models (LLM), software tools and database systems. Compound AI systems are constructed on a layered traditional software stack running on a distributed hardware infrastructure. Many of the diverse software components are vulnerable to traditional security flaws documented in the Common Vulnerabilities and Exposures (CVE) database, while the underlying distributed hardware infrastructure remains exposed to timing attacks, bit‑flip faults, and power‑based side channels. Today, research targets LLM‑specific risks like model extraction, training data leakage, and unsafe generation ‑‑ overlooking the impact of traditional system vulnerabilities. This work investigates how traditional software and hardware vulnerabilities can complement LLM‑specific algorithmic attacks to compromise the integrity of a compound AI pipeline. We demonstrate two novel attacks that combine system‑level vulnerabilities with algorithmic weaknesses: (1) Exploiting a software code injection flaw along with a guardrail Rowhammer attack to inject an unaltered jailbreak prompt into an LLM, resulting in an AI safety violation, and (2) Manipulating a knowledge database to redirect an LLM agent to transmit sensitive user data to a malicious application, thus breaching confidentiality. These attacks highlight the need to address traditional vulnerabilities; we systematize the attack primitives and analyze their composition by grouping vulnerabilities by their objective and mapping them to distinct stages of an attack lifecycle. This approach enables a rigorous red‑teaming exercise and lays the groundwork for future defense strategies.

PaperID: 4954, https://arxiv.org/pdf/2603.12011.pdf

Authors: Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

Title: Can RL Improve Generalization of LLM Agents? An Empirical Study

Abstract:
Reinforcement fine‑tuning (RFT) has shown promise for training LLM agents to perform multi‑turn decision‑making based on environment feedback. However, most existing evaluations remain largely in‑domain: training and testing are conducted in the same environment or even on the same tasks. In real‑world deployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within‑environment generalization across task difficulty, (2) cross‑environment transfer to unseen environments, and (3) sequential multi‑environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits weaker transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.

PaperID: 4955, https://arxiv.org/pdf/2603.11980.pdf

Authors: Kai Li, Shiyu Zhao

Title: Learning Visuomotor Policy for Multi-Robot Laser Tag Game

Abstract:
In this paper, we study multi robot laser tag, a simplified yet practical shooting‑game‑style task. Classic modular approaches on these tasks face challenges such as limited observability and reliance on depth mapping and inter robot communication. To overcome these issues, we present an end‑to‑end visuomotor policy that maps images directly to robot actions. We train a high performing teacher policy with multi agent reinforcement learning and distill its knowledge into a vision‑based student policy. Technical designs, including a permutation‑invariant feature extractor and depth heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.

PaperID: 4956, https://arxiv.org/pdf/2603.11974.pdf

Authors: Luca Deck, Simeon Allmendinger, Lucas Müller, Niklas Kühl

Title: Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-Agent AI

Abstract:
In the late 2010s, the fashion trend NormCore framed sameness as a signal of belonging, illustrating how norms emerge through collective coordination. Today, similar forms of normative coordination can be observed in systems based on Multi‑agent Artificial Intelligence (MAAI), as AI‑based agents deliberate, negotiate, and converge on shared decisions in fairness‑sensitive domains. Yet, existing empirical approaches often treat norms as targets for alignment or replication, implicitly assuming equivalence between human subjects and AI agents and leaving collective normative dynamics insufficiently examined. To address this gap, we propose Normative Common Ground Replication (NormCoRe), a novel methodological framework to systematically translate the design of human subject experiments into MAAI environments. Building on behavioral science, replication research, and state‑of‑the‑art MAAI architectures, NormCoRe maps the structural layers of human subject studies onto the design of AI agent studies, enabling systematic documentation of study design and analysis of norms in MAAI. We demonstrate the utility of NormCoRe by replicating a seminal experimental study on distributive justice, in which participants negotiate fairness principles under a "veil of ignorance". We show that normative judgments in AI agent studies can differ from human baselines and are sensitive to the choice of the foundation model and the language used to instantiate agent personas. Our work provides a principled pathway for analyzing norms in MAAI and helps to guide, reflect, and document design choices whenever AI agents are used to automate or support tasks formerly carried out by humans.

PaperID: 4957, https://arxiv.org/pdf/2603.11955.pdf

Authors: Minjia Wang, Yunfeng Wang, Xiao Ma, Dexin Lv, Qifan Guo, Lynn Zheng, Benliang Wang, Lei Wang, Jiannan Li, Yongwei Xing, David Xu, Zheng Sun

Title: PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Abstract:
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine‑tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real‑world out‑of‑distribution tasks.

PaperID: 4958, https://arxiv.org/pdf/2603.11948.pdf

Authors: Ian F. Akyildiz, Tuğçe Bilen

Title: Kraken*: Architecting Generative, Semantic, and Goal-Oriented Network Management for 6G Wireless Systems

Abstract:
Sixth‑generation (6G) wireless networks are expected to support autonomous, immersive, and mission‑critical services that require not only extreme data rates and ultra‑low latency but also adaptive reasoning, cross‑domain coordination, and objective‑driven control across distributed edge‑cloud infrastructures. Current AI‑enabled network management remains largely data‑centric, relying on discriminative models that optimize intermediate quality‑of‑service metrics without explicitly reasoning about long‑term service objectives. This article advocates a transition from bit‑centric communication toward knowledge‑centric coordination in 6G systems. Semantic communication prioritizes task‑relevant information and contextual meaning over raw data delivery, while generative artificial intelligence enables predictive reasoning and adaptive policy synthesis aligned with dynamic service intents. Network optimization is therefore reframed around goal‑oriented performance metrics capturing application‑level outcomes rather than solely protocol‑level indicators. To operationalize this vision, we introduce Kraken, a multi‑agent architecture composed of a Knowledge Plane, a distributed Agent Plane, and a semantic‑aware Infrastructure Plane. By integrating semantic communication, generative reasoning, and goal‑oriented optimization over a shared knowledge substrate, Kraken enables scalable collective intelligence and outlines an evolutionary path from current 5G infrastructures toward knowledge‑native 6G systems.

PaperID: 4959, https://arxiv.org/pdf/2603.11927.pdf

Authors: Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang

Title: CogSearch: A Cognitive-Aligned Multi-Agent Framework for Proactive Decision Support in E-Commerce Search

Abstract:
Modern e‑commerce search engines, largely rooted in passive retrieval‑and‑ranking models, frequently fail to support complex decision‑making, leaving users overwhelmed by cognitive friction. In this paper, we introduce CogSearch, a novel cognitive‑oriented multi‑agent framework that reimagines e‑commerce search as a proactive decision support system. By synergizing four specialized agents, CogSearch mimics human cognitive workflows: it decomposes intricate user intents, fuses heterogeneous knowledge across internal and external sources, and delivers highly actionable insights. Our offline benchmarks validate CogSearch's excellence in consultative and complex search scenarios. Extensive online A/B testing on JD.com demonstrates the system's transformative impact: it reduced decision costs by 5% and achieved a 0.41% increase in overall UCVR, with a remarkable 30% surge in conversion for decision‑heavy queries. CogSearch represents a fundamental shift in information retrieval, moving beyond traditional relevance‑centric paradigms toward a future of holistic, collaborative decision intelligence.

PaperID: 4960, https://arxiv.org/pdf/2603.11890.pdf

Authors: Haowei Cheng, Milhan Kim, Foutse Khomh, Teeradaj Racharak, Nobukazu Yoshioka, Naoyasu Ubayashi, Hironori Washizaki

Title: QUARE: Multi-Agent Negotiation for Balancing Quality Attributes in Requirements Engineering

Abstract:
Requirements engineering (RE) is critical to software success, yet automating it remains challenging because multiple, often conflicting quality attributes must be balanced while preserving stakeholder intent. Existing Large‑Language‑Model (LLM) approaches predominantly rely on monolithic reasoning or implicit aggregation, limiting their ability to systematically surface and resolve cross‑quality conflicts. We present QUARE (Quality‑Aware Requirements Engineering), a multi‑agent framework that formulates requirements analysis as structured negotiation among five quality‑specialized agents (Safety, Efficiency, Green, Trustworthiness, and Responsibility), coordinated by a dedicated orchestrator. QUARE introduces a dialectical negotiation protocol that explicitly exposes inter‑quality conflicts and resolves them through iterative proposal, critique, and synthesis. Negotiated outcomes are transformed into structurally sound KAOS goal models via topology validation and verified against industry standards through retrieval‑augmented generation (RAG). We evaluate QUARE on five case studies drawn from established RE benchmarks (MARE, iReDev) and an industrial autonomous‑driving specification, spanning safety‑critical, financial, and information‑system domains. Results show that QUARE achieves 98.2% compliance coverage (+105% over both baselines), 94.9% semantic preservation (+2.3 percentage points over the best baseline), and high verifiability (4.96/5.0), while generating 25‑43% more requirements than existing multi‑agent RE frameworks. These findings suggest that effective RE automation depends less on model scale than on principled architectural decomposition, explicit interaction protocols, and automated verification.

PaperID: 4961, https://arxiv.org/pdf/2603.11884.pdf

Authors: Prateek Bhustali, Pablo G. Morato, Konstantinos G. Papakonstantinou, Charalampos P. Andriotis

Title: The price of decentralization in managing engineering systems through multi-agent reinforcement learning

Abstract:
Inspection and maintenance (I&M) planning involves sequential decision making under uncertainties and incomplete information, and can be modeled as a partially observable Markov decision process (POMDP). While single‑agent deep reinforcement learning provides approximate solutions to POMDPs, it does not scale well in multi‑component systems. Scalability can be achieved through multi‑agent deep reinforcement learning (MADRL), which decentralizes decision‑making across multiple agents, locally controlling individual components. However, this decentralization can induce cooperation pathologies that degrade the optimality of the learned policies. To examine these effects in I&M planning, we introduce a set of deteriorating systems in which redundancy is varied systematically. These benchmark environments are designed such that computation of centralized (near‑)optimal policies remains tractable, enabling direct comparison of solution methods. We implement and benchmark a broad set of MADRL algorithms spanning fully centralized and decentralized training paradigms, from value‑factorization to actor‑critic methods. Our results show a clear effect of redundancy on coordination: MADRL algorithms achieve near‑optimal performance in series‑like settings, whereas increasing redundancy amplifies coordination challenges and can lead to optimality losses. Nonetheless, decentralized agents learn structured policies that consistently outperform optimized heuristic baselines, highlighting both the promise and current limitations of decentralized learning for scalable maintenance planning.

PaperID: 4962, https://arxiv.org/pdf/2603.11862.pdf

Authors: Ching-Yu Kao, Xinfeng Li, Shenyu Dai, Tianze Qiu, Pengcheng Zhou, Eric Hanchen Jiang, Philip Sperl

Title: You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Abstract:
High‑privilege LLM agents that autonomously process external documentation are increasingly trusted to automate tasks by reading and executing project instructions, yet they are granted terminal access, filesystem control, and outbound network connectivity with minimal security oversight. We identify and systematically measure a fundamental vulnerability in this trust model, which we term the \emphTrusted Executor Dilemma: agents execute documentation‑embedded instructions, including adversarial ones, at high rates because they cannot distinguish malicious directives from legitimate setup guidance. This vulnerability is a structural consequence of the instruction‑following design paradigm, not an implementation bug. To structure our measurement, we formalize a three‑dimensional taxonomy covering linguistic disguise, structural obfuscation, and semantic abstraction, and construct ReadSecBench, a benchmark of 500 real‑world README files enabling reproducible evaluation. Experiments on the commercially deployed computer‑use agent show end‑to‑end exfiltration success rates up to 85%, consistent across five programming languages and three injection positions. Cross‑model evaluation on four LLM families in a simulation environment confirms that semantic compliance with injected instructions is consistent across model families. A 15‑participant user study yields a 0% detection rate across all participants, and evaluation of 12 rule‑based and 6 LLM‑based defenses shows neither category achieves reliable detection without unacceptable false‑positive rates. Together, these results quantify a persistent \emphSemantic‑Safety Gap between agents' functional compliance and their security awareness, establishing that documentation‑embedded instruction injection is a persistent and currently unmitigated threat to high‑privilege LLM agent deployments.

PaperID: 4963, https://arxiv.org/pdf/2603.11853.pdf

Authors: Frank Li

Title: OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents

Abstract:
Tool‑augmented LLM agents introduce security risks that extend beyond user‑input filtering, including indirect prompt injection through fetched content, unsafe tool execution, credential leakage, and tampering with local control files. We present OpenClaw PRISM, a zero‑fork runtime security layer for OpenClaw‑based agent gateways. PRISM combines an in‑process plugin with optional sidecar services and distributes enforcement across ten lifecycle hooks spanning message ingress, prompt construction, tool execution, tool‑result persistence, outbound messaging, sub‑agent spawning, and gateway startup. Rather than introducing a novel detection model, PRISM integrates a hybrid heuristic‑plus‑LLM scanning pipeline, conversation‑ and session‑scoped risk accumulation with TTL‑based decay, policy‑enforced controls over tools, paths, private networks, domain tiers, and outbound secret patterns, and a tamper‑evident audit and operations plane with integrity verification and hot‑reloadable policy management. We outline an evaluation methodology and benchmark pipeline for measuring security effectiveness, false positives, layer contribution, runtime overhead, and operational recoverability in an agent‑runtime setting, and we report current preliminary benchmark results on curated same‑slice experiments and operational microbenchmarks. The system targets deployable runtime defense for real agent gateways rather than benchmark‑only detection.

PaperID: 4964, https://arxiv.org/pdf/2603.11828.pdf

Authors: Shengnan Li, Yidi Wang, Fubin Wang, Yujia Yang, Yao Zhang, Yuchen Song, Xiaotian Jiang, Yue Pang, Min Zhang, Danshi Wang

Title: Large language models for optical network O&M: Agent-embedded workflow for automation

Abstract:
With the continuous expansion of optical networks and the increasing diversity of services, existing operation and maintenance (O&M) approaches are increasingly challenged to meet the rising demands for intelligence and efficiency. Large language models (LLMs), endowed with advanced semantic understanding and contextual analysis capabilities, are emerging as a promising enabler for intelligent optical network O&M. Recent studies have demonstrated the feasibility of applying LLMs to optical network management, marking an important step toward intelligent automation. However, systematic investigations into how LLMs can be effectively integrated into existing O&M workflows remain limited. This paper addresses this gap by drawing inspiration from best practices in real‑world O&M workflows and systematically identifying scenarios that are well suited for LLM integration. We highlight that agent‑based design is key to improving the executability of tasks, and we propose a multi‑Agent collaborative O&M architecture that integrates LLM capabilities with existing O&M tools. The proposed architecture leverages core LLM‑related technologies including prompt engineering and tool invocation, to build Agent solutions targeting key tasks such as optical channel management, performance optimization, and fault management. This work presents a conceptual framework for embedding LLM‑based Agents into optical network O&M workflows, forming agentized processes that demonstrate the feasibility of LLM‑assisted task execution and lay the groundwork for future autonomous O&M systems featuring closed‑loop perception, decision‑making, and action.

PaperID: 4965, https://arxiv.org/pdf/2603.11808.pdf

Authors: Shuzhen Bi, Mengsong Wu, Hao Hao, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou

Title: Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction

Abstract:
The transition from monolithic large language models (LLMs) to modular, skill‑equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general‑purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows is frequently constrained by insufficient specialized procedural expertise. This report investigates a systematic framework for automated acquisition of high‑quality agent skills through mining of open‑source repositories on platforms such as GitHub. We focus on the extraction of visualization and educational capabilities from state‑of‑the‑art systems including TheoremExplainAgent and Code2Video, both utilizing the Manim mathematical animation engine. The framework encompasses repository structural analysis, semantic skill identification through dense retrieval, and translation to the standardized SKILL.md format. We demonstrate that systematic extraction from agentic repositories, combined with rigorous security governance and multi‑dimensional evaluation metrics, enables scalable acquisition of procedural knowledge that augments LLM capabilities without requiring model retraining. Our analysis reveals that agent‑generated educational content can achieve 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human‑crafted tutorials.

PaperID: 4966, https://arxiv.org/pdf/2603.11790.pdf

Authors: Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

Title: Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

Abstract:
Symmetry‑based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry‑based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry‑Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

PaperID: 4967, https://arxiv.org/pdf/2603.11781.pdf

Authors: Sunil Prakash

Title: From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts

Abstract:
Multi‑agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI‑CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non‑routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden‑profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task‑dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single‑agent tokens, and single‑agent generation outperforms DCI on overall quality. DCI's contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.

PaperID: 4968, https://arxiv.org/pdf/2603.11768.pdf

Authors: Chingkwun Lam, Jiaxin Li, Lingfei Zhang, Kuo Zhao

Title: Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Abstract:
Long‑term memory has emerged as a foundational component of autonomous Large Language Model (LLM) agents, enabling continuous adaptation, lifelong multimodal learning, and sophisticated reasoning. However, as memory systems transition from static retrieval databases to dynamic, agentic mechanisms, critical concerns regarding memory governance, semantic drift, and privacy vulnerabilities have surfaced. While recent surveys have focused extensively on memory retrieval efficiency, they largely overlook the emergent risks of memory corruption in highly dynamic environments. To address these emerging challenges, we propose the Stability and Safety‑Governed Memory (SSGM) framework, a conceptual governance architecture. SSGM decouples memory evolution from execution by enforcing consistency verification, temporal decay modeling, and dynamic access control prior to any memory consolidation. Through formal analysis and architectural decomposition, we show how SSGM can mitigate topology‑induced knowledge leakage where sensitive contexts are solidified into long‑term storage, and help prevent semantic drift where knowledge degrades through iterative summarization. Ultimately, this work provides a comprehensive taxonomy of memory corruption risks and establishes a robust governance paradigm for deploying safe, persistent, and reliable agentic memory systems.

PaperID: 4969, https://arxiv.org/pdf/2603.11721.pdf

Authors: Wenxian Yang, Hanzheng Qiu, Bangqun Zhang, Chengquan Li, Zhiyong Huang, Xiaobin Feng, Rongshan Yu, Jiahong Dong

Title: When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

Abstract:
Large language model (LLM) agents extend generative models with reasoning, tool use, and persistent memory, thereby enabling the automation of complex tasks. In healthcare, such systems could support documentation, care coordination, and clinical decision making. Their reliable deployment in hospitals, however, remains constrained by safety risks, limited transparency, and inadequate mechanisms for handling longitudinal clinical context. Here we propose an architecture that adapts LLM agents to hospital environments. The design comprises four components: a restricted execution environment inspired by multi‑user operating systems, a document‑centric interaction model linking patient and clinician agents, a page‑indexed memory architecture for longitudinal context management, and a curated library of composable medical skills. Implemented on top of OpenClaw, an open‑source agent orchestration framework, this design provides the basis for an Agentic Operating System for Hospitals: a computing layer for coordinating clinical workflows while preserving safety, transparency, and auditability. To evaluate the memory component, we introduce manifest‑guided retrieval for hierarchical navigation of longitudinal patient records. In a benchmark derived from the MIMIC‑IV dataset (v2.2) comprising 100 de‑identified patient records and 300 clinical queries stratified across three difficulty tiers (100 per tier), manifest‑guided retrieval matched a metadata‑filtered RAG baseline on overall recall (0.877 versus 0.876) while achieving 2.2x higher precision (0.779 versus 0.352) and retrieving fewer documents; on tier‑3 longitudinal queries, manifest recall was 21% higher (0.846 versus 0.701), confirming that LLM‑guided hierarchical navigation is most valuable when queries span multiple care episodes. These results outline a practical path toward hospital‑scale agentic infrastructure.

PaperID: 4970, https://arxiv.org/pdf/2603.11709.pdf

Authors: Mengsong Wu, Hao Hao, Shuzhen Bi, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou

Title: Scaling Laws for Educational AI Agents

Abstract:
While scaling laws for Large Language Models (LLMs) have been extensively studied along dimensions of model parameters, training data, and compute, the scaling behavior of LLM‑based educational agents remains unexplored. We propose that educational agent capability scales not merely with the underlying model size, but through structured dimensions that we collectively term the Agent Scaling Law: role definition clarity, skill depth, tool completeness, runtime capability, and educator expertise injection. Central to this framework is AgentProfile, a structured JSON‑based specification that serves as the mechanism enabling systematic capability growth of educational agents. We present EduClaw, a profile‑driven multi‑agent platform that operationalizes this scaling law, demonstrating its effectiveness through the construction and deployment of 330+ educational agent profiles encompassing 1,100+ skill modules across K‑12 subjects. Our empirical observations suggest that educational agent performance scales predictably with profile structural richness. We identify two complementary scaling axes ‑‑ Tool Scaling and Skill Scaling ‑‑ as future directions, arguing that the path to more capable educational AI lies not solely in larger models, but in stronger structured capability systems.

PaperID: 4971, https://arxiv.org/pdf/2603.11691.pdf

Authors: Jiwon Jeon, Myungsik Cho, Youngchul Sung

Title: STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning

Abstract:
Offline multi‑agent reinforcement learning (MARL) with multi‑task datasets is challenging due to varying numbers of agents across tasks and the need to generalize to unseen scenarios. Prior works employ transformers with observation tokenization and hierarchical skill learning to address these issues. However, they underutilize the transformer attention mechanism for inter‑agent coordination and rely on a single history token, which limits their ability to capture long‑horizon temporal dependencies in partially observable MARL settings. In this paper, we propose STAIRS‑Former, a transformer architecture augmented with spatial and temporal hierarchies that enables effective attention over critical tokens while capturing long interaction histories. We further introduce token dropout to enhance robustness and generalization across varying agent populations. Extensive experiments on diverse multi‑agent benchmarks, including SMAC, SMAC‑v2, MPE, and MaMuJoCo, with multi‑task datasets demonstrate that STAIRS‑Former consistently outperforms prior methods and achieves new state‑of‑the‑art performance.

PaperID: 4972, https://arxiv.org/pdf/2603.11650.pdf

Authors: Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu

Title: QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Abstract:
The effectiveness upper bound of retrieval‑augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval‑augmentation to understanding‑retrieval‑augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi‑agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high‑quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi‑path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains exhibit that QChunker effectively resolves aforementioned issues by providing RAG with more logically coherent and information‑rich text chunks.

PaperID: 4973, https://arxiv.org/pdf/2603.11619.pdf

Authors: Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, Xiaohu Du, Xiaofang Yang, Shiwen Cui, Changhua Meng, Weiqiang Wang, Jiaxing Song, Ke Xu, Qi Li

Title: Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats

Abstract:
Autonomous Large Language Model (LLM) agents, exemplified by OpenClaw, demonstrate remarkable capabilities in executing complex, long‑horizon tasks. However, their tightly coupled instant‑messaging interaction paradigm and high‑privilege execution capabilities substantially expand the system attack surface. In this paper, we present a comprehensive security threat analysis of OpenClaw. To structure our analysis, we introduce a five‑layer lifecycle‑oriented security framework that captures key stages of agent operation, i.e., initialization, input, inference, decision, and execution, and systematically examine compound threats across the agent's operational lifecycle, including indirect prompt injection, skill supply chain contamination, memory poisoning, and intent drift. Through detailed case studies on OpenClaw, we demonstrate the prevalence and severity of these threats and analyze the limitations of existing defenses. Our findings reveal critical weaknesses in current point‑based defense mechanisms when addressing cross‑temporal and multi‑stage systemic risks, highlighting the need for holistic security architectures for autonomous LLM agents. Within this framework, we further examine representative defense strategies at each lifecycle stage, including plugin vetting frameworks, context‑aware instruction filtering, memory integrity validation protocols, intent verification mechanisms, and capability enforcement architectures.

PaperID: 4974, https://arxiv.org/pdf/2603.11582.pdf

Authors: Zhirun Li, Derek Hollenbeck, Ruikun Wu, Michelle Sherman, Sihua Shao, Xiang Sun, Mostafa Hassanalian

Title: Multi-Agent Reinforcement Learning for UAV-Based Chemical Plume Source Localization

Abstract:
Undocumented orphaned wells pose significant health and environmental risks to nearby communities by releasing toxic gases and contaminating water sources, with methane emissions being a primary concern. Traditional survey methods such as magnetometry often fail to detect older wells effectively. In contrast, aerial in‑situ sensing using unmanned aerial vehicles (UAVs) offers a promising alternative for methane emission detection and source localization. This study presents a robust and efficient framework based on a multi‑agent deep reinforcement learning (MARL) algorithm for the chemical plume source localization (CPSL) problem. The proposed approach leverages virtual anchor nodes to coordinate UAV navigation, enabling collaborative sensing of gas concentrations and wind velocities through onboard and shared measurements. Source identification is achieved by analyzing the historical trajectory of anchor node placements within the plume. Comparative evaluations against the fluxotaxis method demonstrate that the MARL framework achieves superior performance in both localization accuracy and operational efficiency.

PaperID: 4975, https://arxiv.org/pdf/2603.11560.pdf

Authors: Stefano Grassi

Title: Feedback-Coupled Memory Systems: A Dynamical Model for Adaptive Coordination

Abstract:
This paper develops a dynamical framework for adaptive coordination in systems of interacting agents referred to here as Feedback‑Coupled Memory Systems (FCMS). Instead of framing coordination as equilibrium optimization or agent‑centric learning, the model describes a closed‑loop interaction between agents, incentives, and a persistent environment. The environment stores accumulated coordination signals, a distributed incentive field transmits them locally, and agents update in response, generating a feedback‑driven dynamical system. Three main results are established. First, under dissipativity, the closed‑loop system admits a bounded forward‑invariant region, ensuring dynamical viability independently of global optimality. Second, when incentives depend on persistent environmental memory, coordination cannot be reduced to a static optimization problem. Third, within the FCMS class, coordination requires a bidirectional coupling in which memory‑dependent incentives influence agent updates, while agent behavior reshapes the environmental state. Numerical analysis of a minimal specification identifies a Neimark‑Sacker bifurcation at a critical coupling threshold (β_c), providing a stability boundary for the system. Near the bifurcation threshold, recovery time diverges and variance increases, yielding a computable early warning signature of coordination breakdown in observable time series. Additional simulations confirm robustness under nonlinear saturation and scalability to populations of up to N = 10^6 agents making it more relevant for real‑world applications. The proposed framework offers a dynamical perspective on coordination in complex systems, with potential extensions to multi‑agent systems, networked interactions, and macro‑level collective dynamics.

PaperID: 4976, https://arxiv.org/pdf/2603.11515.pdf

Authors: Harshitha Menon, Charles F. Jekel, Kevin Korner, Brian Gunnarson, Nathan K. Brown, Michael Stees, M. Giselle Fernandez-Godino, Walter Nissen, Meir H. Shachar, Dane M. Sterbentz, William J. Schill, Yue Hao, Robert Rieben, William Quadros, Steve Owen, Scott Mitchell, Ismael D. Boureima, Jonathan L. Belof

Title: Multi-Agent Collaboration for Automated Design Exploration on High Performance Computing Systems

Abstract:
Today's scientific challenges, from climate modeling to Inertial Confinement Fusion design to novel material design, require exploring huge design spaces. In order to enable high‑impact scientific discovery, we need to scale up our ability to test hypotheses, generate results, and learn from them rapidly. We present MADA (Multi‑Agent Design Assistant), a Large Language Model (LLM) powered multi‑agent framework that coordinates specialized agents for complex design workflows. A Job Management Agent (JMA) launches and manages ensemble simulations on HPC systems, a Geometry Agent (GA) generates meshes, and an Inverse Design Agent (IDA) proposes new designs informed by simulation outcomes. While general purpose, we focus development and validation on Richtmyer‑‑Meshkov Instability (RMI) suppression, a critical challenge in Inertial Confinement Fusion. We evaluate on two complementary settings: running a hydrodynamics simulations on HPC systems, and using a pre‑trained machine learning surrogate for rapid design exploration. Our results demonstrate that the MADA system successfully executes iterative design refinement, automatically improving designs toward optimal RMI suppression with minimal manual intervention. Our framework reduces cumbersome manual workflow setup, and enables automated design exploration at scale. More broadly, it demonstrates a reusable pattern for coupling reasoning, simulation, specialized tools, and coordinated workflows to accelerate scientific discovery.

PaperID: 4977, https://arxiv.org/pdf/2603.11445.pdf

Authors: Xing Zhang, Yanwei Cui, Guanghui Wang, Wei Qiu, Ziyuan Li, Fangwei Han, Yajing Huang, Hengzhi Qiu, Bing Zhu, Peiyang He

Title: Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

Abstract:
We present Verified Multi‑Agent Orchestration (VMAO), a framework that coordinates specialized LLM‑based agents through a verification‑driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub‑questions, executes them through domain‑specific agents in parallel, verifies result completeness via LLM‑based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency‑aware parallel execution over a DAG of sub‑questions with automatic context propagation, (2) verification‑driven adaptive replanning that uses an LLM‑based verifier as an orchestration‑level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert‑curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1‑5 scale) compared to a single‑agent baseline, demonstrating that orchestration‑level verification is an effective mechanism for multi‑agent quality assurance.

PaperID: 4978, https://arxiv.org/pdf/2603.11444.pdf

Authors: Cameron Mohne, Nicholas Vo, Dora Demszky, Chris Piech

Title: EducaSim: Interactive Simulacra for CS1 Instructional Practice

Abstract:
Role play is a high‑impact mode of training that has demonstrated its effectiveness in improving learning outcomes. However, it is difficult to scale to teacher instruction due to its inherent dependency on providing personnel who are both trained and available to facilitate this learning environment. This poses a challenge, especially to massive online courses which may employ and aid hundreds to thousands of novice teachers. In this work, we present EducaSim: a novel framework that uses generative agents to simulate a small‑group section for teachers‑in‑training to practice instruction. EducaSim works by implementing diverse pedagogical‑based personas, actual course material, and agent‑based architectures constructed for instructional practice to provide a pedagogically rich environment for teachers‑in‑training to engage in role play learning ‑‑ without the costly overhead that comes with it. We share our experiences with constructing and making the tool available for experimental training and preparation in a six‑week CS1 course supporting 20,000 students. We found that teachers who engaged generally saw it as a positive experience. We believe that EducaSim is an important step to providing experiential teaching practice at scale for closely‑defined settings and has great potential for future applications.

PaperID: 4979, https://arxiv.org/pdf/2603.11433.pdf

Authors: Taha Eghtesad, Yevgeniy Vorobeychik, Aron Laszka

Title: Adversarial Reinforcement Learning for Detecting False Data Injection Attacks in Vehicular Routing

Abstract:
In modern transportation networks, adversaries can manipulate routing algorithms using false data injection attacks, such as simulating heavy traffic with multiple devices running crowdsourced navigation applications, to mislead vehicles toward suboptimal routes and increase congestion. To address these threats, we formulate a strategically zero‑sum game between an attacker, who injects such perturbations, and a defender, who detects anomalies based on the observed travel times of network edges. We propose a computational method based on multi‑agent reinforcement learning to compute a Nash equilibrium of this game, providing an optimal detection strategy, which ensures that total travel time remains within a worst‑case bound, even in the presence of an attack. We present an extensive experimental evaluation that demonstrates the robustness and practical benefits of our approach, providing a powerful framework to improve the resilience of transportation networks against false data injection. In particular, we show that our approach yields approximate equilibrium policies and significantly outperforms baselines for both the attacker and the defender.

PaperID: 4980, https://arxiv.org/pdf/2603.11400.pdf

Authors: Christopher Agia

Title: Deployment-Time Reliability of Learned Robot Policies

Abstract:
Recent advances in learning‑based robot manipulation have produced policies with remarkable capabilities. Yet, reliability at deployment remains a fundamental barrier to real‑world use, where distribution shift, compounding errors, and complex task dependencies collectively undermine system performance. This dissertation investigates how the reliability of learned robot policies can be improved at deployment time through mechanisms that operate around them. We develop three complementary classes of deployment‑time mechanisms. First, we introduce runtime monitoring methods that detect impending failures by identifying inconsistencies in closed‑loop policy behavior and deviations in task progress, without requiring failure data or task‑specific supervision. Second, we propose a data‑centric framework for policy interpretability that traces deployment‑time successes and failures to influential training demonstrations using influence functions, enabling principled diagnosis and dataset curation. Third, we address reliable long‑horizon task execution by formulating policy coordination as the problem of estimating and maximizing the success probability of behavior sequences, and we extend this formulation to open‑ended, language‑specified tasks through feasibility‑aware task planning. By centering on core challenges of deployment, these contributions advance practical foundations for the reliable, real‑world use of learned robot policies. Continued progress on these foundations will be essential for enabling trustworthy and scalable robot autonomy in the future.

PaperID: 4981, https://arxiv.org/pdf/2603.11392.pdf

Authors: Min Hao, Zhizhuo Li, Zirui Zhang, Maoqiang Wu, Han Zhang, Rong Yu

Title: Agentic AI for Embodied-enhanced Beam Prediction in Low-Altitude Economy Networks

Abstract:
Millimeter‑wave or terahertz communications can meet demands of low‑altitude economy networks for high‑throughput sensing and real‑time decision making. However, high‑frequency characteristics of wireless channels result in severe propagation loss and strong beam directivity, which make beam prediction challenging in highly mobile uncrewed aerial vehicles (UAV) scenarios. In this paper, we employ agentic AI to enable the transformation of mmWave base stations toward embodied intelligence. We innovatively design a multi‑agent collaborative reasoning architecture for UAV‑to‑ground mmWave communications and propose a hybrid beam prediction model system based on bimodal data. The multi‑agent architecture is designed to overcome the limited context window and weak controllability of large language model (LLM)‑based reasoning by decomposing beam prediction into task analysis, solution planning, and completeness assessment. To align with the agentic reasoning process, a hybrid beam prediction model system is developed to process multimodal UAV data, including numeric mobility information and visual observations. The proposed hybrid model system integrates Mamba‑based temporal modelling, convolutional visual encoding, and cross‑attention‑based multimodal fusion, and dynamically switches data‑flow strategies under multi‑agent guidance. Extensive simulations on a real UAV mmWave communication dataset demonstrate that proposed architecture and system achieve high prediction accuracy and robustness under diverse data conditions, with maximum top‑1 accuracy reaching 96.57%.

PaperID: 4982, https://arxiv.org/pdf/2603.11390.pdf

Authors: Hossein Mohammadi, Seyed Bagher Hashemi Natanzi, Ramak Nassiri, Jamshid Hassanpour, Bo Tang, Vuk Marojevic

Title: SliceFed: Federated Constrained Multi-Agent DRL for Dynamic Spectrum Slicing in 6G

Abstract:
Dynamic spectrum slicing is a critical enabler for 6G Radio Access Networks (RANs), allowing the coexistence of heterogeneous services. However, optimizing resource allocation in dense, interference‑limited deployments remains challenging due to non‑stationary channel dynamics, strict Quality‑of‑Service (QoS) requirements, and the need for data privacy. In this paper, we propose SliceFed, a novel Federated Constrained Multi‑Agent Deep Reinforcement Learning (F‑MADRL) framework. SliceFed formulates the slicing problem as a Constrained Markov Decision Process (CMDP) where autonomous gNB agents maximize spectral efficiency while explicitly satisfying inter‑cell interference budgets and hard ultra‑reliable low‑latency communication (URLLC) latency deadlines. We employ a Lagrangian primal‑dual approach integrated with Proximal Policy Optimization (PPO) to enforce constraints, while Federated Averaging enables collaborative learning without exchanging raw local data. Extensive simulations in a dense multi‑cell environment demonstrate that SliceFed converges to a stable, safety‑aware policy. Unlike heuristic and unconstrained baselines, SliceFed achieves nearly 100% satisfaction of 1~ms URLLC latency deadlines and exhibits superior robustness to traffic load variations, verifying its potential for reliable and scalable 6G spectrum management.

PaperID: 4983, https://arxiv.org/pdf/2603.11337.pdf

Authors: Yonas Atinafu, Robin Cohen

Title: RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents

Abstract:
LLM agents increasingly perform end‑to‑end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline rather than improving the model. We introduce RewardHackingAgents, a workspace‑based benchmark that makes two compromise vectors explicit and measurable: evaluator tampering (modifying metric computation or reporting) and train/test leakage (accessing held‑out data or labels during training). Each episode runs in a fresh workspace with patch tracking and runtime file‑access logging; detectors compare the agent‑reported metric to a trusted reference to assign auditable integrity labels. Across three tasks and two LLM backbones, scripted attacks succeed on both vectors in fully mutable workspaces; single‑mechanism defenses block only one vector; and a combined regime blocks both. In natural‑agent runs, evaluator‑tampering attempts occur in about 50% of episodes and are eliminated by evaluator locking, with a 25‑31% median runtime overhead. Overall, we demonstrate that evaluation integrity for ML‑engineering agents can be benchmarked as a first‑class outcome rather than assumed.

PaperID: 4984, https://arxiv.org/pdf/2603.11277.pdf

Authors: Jean-Sébastien Dessureault, Alain-Thierry Iliho Manzi, Soukaina Alaoui Ismaili, Khadim Lo, Mireille Lalancette, Éric Bélanger

Title: COMPASS: The explainable agentic framework for Sovereignty, Sustainability, Compliance, and Ethics

Abstract:
The rapid proliferation of large language model (LLM)‑based agentic systems raises critical concerns regarding digital sovereignty, environmental sustainability, regulatory compliance, and ethical alignment. Whilst existing frameworks address individual dimensions in isolation, no unified architecture systematically integrates these imperatives into the decision‑making processes of autonomous agents. This paper introduces the COMPASS (Compliance and Orchestration for Multi‑dimensional Principles in Autonomous Systems with Sovereignty) Framework, a novel multi‑agent orchestration system designed to enforce value‑aligned AI through modular, extensible governance mechanisms. The framework comprises an Orchestrator and four specialised sub‑agents addressing sovereignty, carbon‑aware computing, compliance, and ethics, each augmented with Retrieval‑Augmented Generation (RAG) to ground evaluations in verified, context‑specific documents. By employing an LLM‑as‑a‑judge methodology, the system assigns quantitative scores and generates explainable justifications for each assessment dimension, enabling real‑time arbitration of conflicting objectives. We validate the architecture through automated evaluation, demonstrating that RAG integration significantly enhances semantic coherence and mitigates the hallucination risks. Our results indicate that the framework's composition‑based design facilitates seamless integration into diverse application domains whilst preserving interpretability and traceability.

PaperID: 4985, https://arxiv.org/pdf/2603.11228.pdf

Authors: Mingmeng Geng, Amr Mohamed, Guokan Shang, Michalis Vazirgiannis, Thierry Poibeau

Title: Markovian Generation Chains in Large Language Models

Abstract:
The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round‑trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence‑level Markov chain modeling and analysis of simulated data, we show that iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi‑agent LLM systems.

PaperID: 4986, https://arxiv.org/pdf/2603.11132.pdf

Authors: Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang

Title: WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Abstract:
Communication topology is a critical factor in the utility and safety of LLM‑based multi‑agent systems (LLM‑MAS), making it a high‑value intellectual property (IP) whose confidentiality remains insufficiently studied. Existing topology inference attempts rely on impractical assumptions, including control over the administrative agent and direct identity queries via jailbreaks, which are easily defeated by basic keyword‑based defenses. As a result, prior analyses fail to capture the real‑world threat of such attacks. To bridge this realism gap, we propose WebWeaver, an attack framework that infers the complete LLM‑MAS topology by compromising only a single arbitrary agent instead of the administrative agent. Unlike prior approaches, WebWeaver relies solely on agent contexts rather than agent IDs, enabling significantly stealthier inference. WebWeaver further introduces a new covert jailbreak‑based mechanism and a novel fully jailbreak‑free diffusion design to handle cases where jailbreaks fail. Additionally, we address a key challenge in diffusion‑based inference by proposing a masking strategy that preserves known topology during diffusion, with theoretical guarantees of correctness. Extensive experiments show that WebWeaver substantially outperforms state‑of‑the‑art (SOTA) baselines, achieving about 60% higher inference accuracy under active defenses with negligible overhead.

PaperID: 4987, https://arxiv.org/pdf/2603.11126.pdf

Authors: Yuanhong Wu, Djallel Bouneffouf, D. Frank Hsu

Title: Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion

Abstract:
Aligning large language models (LLMs) with human values is a central challenge for ensuring trustworthy and safe deployment. While existing methods such as Reinforcement Learning from Human Feedback (RLHF) and its variants have improved alignment, they often rely on a single evaluator or narrowly defined reward signals, limiting their ability to capture ethical pluralism. In this work, we propose the Value Alignment System using Combinatorial Fusion Analysis (VAS‑CFA), a framework that operationalizes multi‑agent fusion alignment. It instantiates multiple moral agents, each fine‑tuned to represent a distinct normative perspective, and fuses their outputs using CFA with both rank‑ and score‑based aggregation. This design leverages cognitive diversity, between agents, to mitigate conflicts and redundancies across multiple agents, producing responses that better reflect human values. Empirical evaluation demonstrates that VAS‑CFA outperforms both single agent baselines and prior aggregation approaches on standard metrics, showing that multi‑agent fusion provides a robust and effective mechanism for advancing value alignment in LLMs.

PaperID: 4988, https://arxiv.org/pdf/2603.11103.pdf

Authors: Zhiyuan Zeng, Yichi Zhang, Yong Shan, Kai Hua, Siyuan Fang, Zhaiyu Liu, Jiaheng Liu, Haozhe Wang, Yining Zheng, Ming Ding, Ke Shen, Ge Zhang, Wenhao Huang, Xipeng Qiu

Title: Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining

Abstract:
While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long‑horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre‑training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse‑engineering the latent agentic trajectories ‑‑ the planning, reasoning, and debugging steps ‑‑ behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi‑agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search‑based optimization technique that iteratively refines the Chain‑of‑Thought (CoT) reasoning to maximize the likelihood of the ground‑truth code. Empirical results demonstrate that continuous pre‑training on these reconstructed trajectories significantly enhances Llama‑3‑8B's performance across diverse benchmarks, including long‑context understanding, coding proficiency, and agentic capabilities.

PaperID: 4989, https://arxiv.org/pdf/2603.11088.pdf

Authors: Juhee Kim, Xiaoyuan Liu, Zhun Wang, Shi Qiu, Bo Li, Wenbo Guo, Dawn Song

Title: The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey

Abstract:
AI agents that combine large language models with non‑AI system components are rapidly emerging in real‑world applications, offering unprecedented automation and flexibility. However, this unprecedented flexibility introduces complex security challenges fundamentally different from those in traditional software systems. This paper presents the first systematic and comprehensive survey of AI agent security, including an analysis of the design space, attack landscape, and defense mechanisms for secure AI agent systems. We further conduct case studies to point out existing gaps in securing agentic AI systems and identify open challenges in this emerging domain. Our work also introduces the first systematic framework for understanding the security risks and defense strategies of AI agents, serving as a foundation for building both secure agentic systems and advancing research in this critical area.

PaperID: 4990, https://arxiv.org/pdf/2603.11068.pdf

Authors: Iman Peivaste, Salim Belouettar

Title: From Phase Prediction to Phase Design: A ReAct Agent Framework for High-Entropy Alloy Discovery

Abstract:
Discovering high‑entropy alloy (HEA) compositions that reliably form a target crystal phase is a high‑dimensional inverse design problem that conventional trial‑and‑error experimentation and forward‑only machine learning models cannot efficiently solve. Here we present a ReAct (Reasoning + Acting) LLM agent that autonomously proposes, validates, and iteratively refines HEA compositions by querying a calibrated XGBoost surrogate trained on 4,753 experimental records across four phases (FCC, BCC, BCC+FCC, BCC+IM), achieving 94.66% accuracy (F1 macro = 0.896). Against Bayesian optimisation (BO) and random search baselines, the full‑prompt agent achieves descriptor‑space rediscovery rates of 38%, 18%, and 38% for FCC, BCC, and BCC+FCC (Mann‑‑Whitney p \leq 0.039), with proposals lying 2.4‑‑22.8× closer to the experimental phase manifold than random search. An ablation reveals that domain priors shift the agent from landmark‑alloy recall toward compositionally diverse exploration ‑‑ an uninformed agent scores higher rediscovery by concentrating on literature‑dense families, while the full‑prompt agent explores underrepresented space (unique ratio 1.0 vs.\ 0.39 for BCC+FCC). These regimes represent distinct criteria: proximity to known literature versus genuine discovery. Spearman analysis confirms agent reasoning is statistically aligned with empirical phase distributions (ρ= 0.736, p = 0.004 for BCC). This work establishes LLM‑guided agentic reasoning as a principled, transparent, and manifold‑aware complement to gradient‑free optimisation for inverse alloy design.

PaperID: 4991, https://arxiv.org/pdf/2603.11025.pdf

Authors: Hao N. Nguyen, Hieu M. Nguyen, Son Van Nguyen, Nguyen Thi Hanh

Title: LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce

Abstract:
Rising environmental awareness in e‑commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional session‑based systems, optimized for short‑term conversions, often fail to capture nuanced user intents for eco‑friendly choices, perpetuating a gap between green intentions and actions. To tackle this, we introduce LLMGreenRec, a novel multi‑agent framework that leverages Large Language Models (LLMs) to promote sustainable consumption. Through collaborative analysis of user interactions and iterative prompt refinement, LLMGreenRec's specialized agents deduce green‑oriented user intents and prioritize eco‑friendly product recommendations. Notably, this intent‑driven approach also reduces unnecessary interactions and energy consumption. Extensive experiments on benchmark datasets validate LLMGreenRec's effectiveness in recommending sustainable products, demonstrating a robust solution that fosters a responsible digital economy.

PaperID: 4992, https://arxiv.org/pdf/2603.11011.pdf

Authors: Xingrui Gu

Title: Task-Aware Delegation Cues for LLM Agents

Abstract:
LLM agents increasingly present as conversational collaborators, yet human‑‑agent teamwork remains brittle due to information asymmetry: users lack task‑specific reliability cues, and agents rarely surface calibrated uncertainty or rationale. We propose a task‑aware collaboration signaling layer that turns offline preference evaluations into online, user‑facing primitives for delegation. Using Chatbot Arena pairwise comparisons, we induce an interpretable task taxonomy via semantic clustering, then derive (i) Capability Profiles as task‑conditioned win‑rate maps and (ii) Coordination‑Risk Cues as task‑conditioned disagreement (tie‑rate) priors. These signals drive a closed‑loop delegation protocol that supports common‑ground verification, adaptive routing (primary vs.\ primary+auditor), explicit rationale disclosure, and privacy‑preserving accountability logs. Two predictive probes validate that task typing carries actionable structure: cluster features improve winner prediction accuracy and reduce difficulty prediction error under stratified 5‑fold cross‑validation. Overall, our framework reframes delegation from an opaque system default into a visible, negotiable, and auditable collaborative decision, providing a principled design space for adaptive human‑‑agent collaboration grounded in mutual awareness and shared accountability.

PaperID: 4993, https://arxiv.org/pdf/2603.10858.pdf

Authors: Chuanlong Zang, Anna Mannucci, Isabelle Barz, Philipp Schillinger, Florian Lier, Wolfgang Hönig

Title: GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments

Abstract:
Advancing Multi‑Agent Pathfinding (MAPF) and Multi‑Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation‑fidelity trade‑offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross‑representation studies more comparable and provides a means to advance multi‑robot planning research and its translation to practice.

PaperID: 4994, https://arxiv.org/pdf/2603.10852.pdf

Authors: Yali Zhu, Kang Zhou, Dingbang Wu, Gaofeng Meng

Title: UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Abstract:
Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI‑RADS category and determine benignity or malignancy. Many existing methods rely on end‑to‑end prediction or provide only weakly grounded evidence, which can miss fine‑grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi‑agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop‑and‑zoom operation. A sub‑agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence‑based reasoning and output the BI‑RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi‑agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute‑based reasoning, and finally apply corrective trajectory self‑distillation with spatial supervision to build high‑quality trajectories for supervised fine‑tuning, yielding a deployable end‑to‑end policy. Experiments show consistent gains over strong vision‑language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

PaperID: 4995, https://arxiv.org/pdf/2603.10836.pdf

Authors: Yuzhang Peng, Wei Wang, Jiaqi Yan, Mengze Yu

Title: Distributed Safety Critical Control among Uncontrollable Agents Using Reconstructed Control Barrier Functions

Abstract:
This paper investigates the distributed safety critical control for multi‑agent systems (MASs) in the presence of uncontrollable agents with uncertain behaviors. To ensure system safety, the control barrier function (CBF) is employed in this paper. However, a key challenge is that the CBF constraints are coupled when MASs perform collaborative tasks, which depend on information from multiple agents and impede the design of a fully distributed safe control scheme. To overcome this, a novel reconstructed CBF approach is proposed. In this method, the coupled CBF is reconstructed by leveraging state estimates of other agents obtained from a distributed adaptive observer. Furthermore, a prescribed performance adaptive parameter is designed to modify this reconstruction, ensuring that satisfying the reconstructed CBF constraint is sufficient to meet the original coupled one. Based on the reconstructed CBF, we design a safety‑critical quadratic programming (QP) controller and prove that the proposed distributed control scheme rigorously guarantees the safety of the MAS, even in the uncertain dynamic environments involving uncontrollable agents. The effectiveness of the proposed method is illustrated through a simulation.

PaperID: 4996, https://arxiv.org/pdf/2603.10779.pdf

Authors: Ali Eslami, Jiangbo Yu

Title: A Control-Theoretic Foundation for Agentic Systems

Abstract:
This paper develops a control‑theoretic framework for analyzing agentic systems embedded within feedback control loops, where an AI agent may adapt controller parameters, select among control strategies, invoke external tools, reconfigure decision architectures, and modify control objectives during operation. These capabilities are formalized by interpreting agency as hierarchical runtime decision authority over elements of the control architecture, leading to an augmented closed‑loop representation in which physical states, internal memory, tool outputs, interaction signals, and design variables evolve as a coupled dynamical system. A five‑level hierarchy of agency is defined, ranging from fixed control laws to runtime synthesis of control architectures and objectives. The analysis shows that increasing agency introduces interacting dynamical mechanisms such as time‑varying adaptation, endogenous switching, decision‑induced delays, and structural reconfiguration. The framework is developed in both nonlinear and linear settings, providing explicit design constraints for AI‑enabled control systems in safety‑critical applications.

PaperID: 4997, https://arxiv.org/pdf/2603.10764.pdf

Authors: Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan, Meng-Han Tsai, Yuen-Hei Chung, Shutong Hou, Huixue Zhou, Min Zeng, Bhavadharini Ramu, Lin Yee Chen, Feng Xie, Rui Zhang

Title: HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

Abstract:
Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence‑based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology‑specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub‑agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved over 36% and 20% improvements over established comparative methods, in top‑3 diagnostic accuracy, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.

PaperID: 4998, https://arxiv.org/pdf/2603.10749.pdf

Authors: Yu He, Haozhe Zhu, Yiming Li, Shuo Shao, Hongwei Yao, Zhihao Liu, Zhan Qin

Title: AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Abstract:
LLM agents are highly vulnerable to Indirect Prompt Injection (IPI), where adversaries embed malicious directives in untrusted tool outputs to hijack execution. Most existing defenses treat IPI as an input‑level semantic discrimination problem, which often fails to generalize to unseen payloads. We propose a new paradigm, action‑level causal attribution, which secures agents by asking why a particular tool call is produced. The central goal is to distinguish tool calls supported by the user's intent from those causally driven by untrusted observations. We instantiate this paradigm with AttriGuard, a runtime defense based on parallel counterfactual tests. For each proposed tool call, AttriGuard verifies its necessity by re‑executing the agent under a control‑attenuated view of external observations. Technically, AttriGuard combines teacher‑forced shadow replay to prevent attribution confounding, hierarchical control attenuation to suppress diverse control channels while preserving task‑relevant information, and a fuzzy survival criterion that is robust to LLM stochasticity. Across four LLMs and two agent benchmarks, AttriGuard achieves 0% ASR under static attacks with negligible utility loss and moderate overhead. Importantly, it remains resilient under adaptive optimization‑based attacks in settings where leading defenses degrade significantly.

PaperID: 4999, https://arxiv.org/pdf/2603.10743.pdf

Authors: Abram H. Clark, Liraz Mudrik, Colton Kawamura, Nathan C. Redder, João P. Hespanha, Isaac Kaminer

Title: Scaling and Trade-offs in Multi-agent Autonomous Systems

Abstract:
Designing autonomous drone swarms is hampered by a vast design space spanning platform, algorithmic, and numerical‑strength choices. We perform large‑scale agent‑based simulations in three canonical scenarios: swarm‑on‑swarm battle, cooperative area search with attrition, and pursuit of scattering targets. We demonstrate how dimensional‑analysis and data‑scaling can be leveraged to collapse performance data onto scaling functions that are mathematically simple, yet counterintuitive and therefore difficult to predict a priori. These scaling laws reveal success‑failure boundaries, including sharp break points which we show can be framed as an ``effective swarm size.'' Additionally, we show how this technique can be used to quantify trade‑offs between agent count and platform parameters such as velocity, sensing or weapon range, and attrition rate. Furthermore, we show the benefits of embedding an optimal path planning loop within this framework, which can qualitatively improve the scaling laws that govern the outcome. The methods we demonstrate are highly flexible and would enable rapid, budget‑aware sizing and algorithm selection for large autonomous swarms.

PaperID: 5000, https://arxiv.org/pdf/2603.10664.pdf

Authors: Alexandre De Masi

Title: Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration

Abstract:
While research on AI agents focuses on enabling them to operate graphical user interfaces, the most effective and widely adopted agent tools in practice are terminal‑based. We argue that this convergence is not coincidental. It reflects three design properties central to effective human‑AI‑UI collaboration: representational compatibility between agent and interface, transparency of agent actions within the interaction medium, and low barriers to entry for human participants. We ground each property in established HCI theory, show how terminal‑based tools satisfy them by default, and argue that any modality, including graphical and spatial interfaces, must be deliberately engineered to achieve them. Rather than a legacy artifact, the terminal serves as a design exemplar whose properties any agent‑facing modality must replicate.

PaperID: 5001, https://arxiv.org/pdf/2603.10646.pdf

Authors: Thong Hoang, Mykhailo Klymenko, Xiwei Xu, Shidong Pan, Yi Ding, Xushuo Tang, Zhengyi Yang, Jieke Shi, David Lo

Title: ESG Reporting Lifecycle Management with Large Language Models and AI Agents

Abstract:
Environmental, Social, and Governance (ESG) standards have been increasingly adopted by organizations to demonstrate accountability towards ethical, social, and sustainability goals. However, generating ESG reports that align with these standards remains challenging due to unstructured data formats, inconsistent terminology, and complex requirements. Existing ESG lifecycles provide guidance for structuring ESG reports but lack the automation, adaptability, and continuous feedback mechanisms needed to address these challenges. To bridge this gap, we introduce an agentic ESG lifecycle framework that systematically integrates the ESG stages of identification, measurement, reporting, engagement, and improvement. In this framework, multiple AI agents extract ESG information, verify ESG performance, and update ESG reports based on organisational outcomes. By embedding agentic components within the ESG lifecycle, the proposed framework transforms ESG from a static reporting process into a dynamic, accountable, and adaptive system for sustainability governance. We further define the technical requirements and quality attributes needed to support four main ESG tasks, such as report validation, multi‑report comparison, report generation, and knowledge‑base maintenance, and propose three architectural approaches, namely single‑model, single‑agent, and multi‑agent, for addressing these tasks. The source code and data for the prototype of these approaches are available at https://gitlab.com/for_peer_review‑group/esg_assistant.

PaperID: 5002, https://arxiv.org/pdf/2603.10528.pdf

Authors: Islam Guven, Mehmet Parlak

Title: UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery

Abstract:
Unmanned aerial vehicles (UAVs) are increasingly used to support time‑critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi‑agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor‑‑critic methods, and architectural modifications to analyze scalability and performance trade‑offs. The model is evaluated using real‑world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision‑support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV‑assisted healthcare logistics.

PaperID: 5003, https://arxiv.org/pdf/2603.10476.pdf

Authors: Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi

Title: Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Abstract:
LLM alignment has progressed in single‑agent settings through paradigms such as RL with human feedback (RLHF), while recent work explores scalable alternatives such as RL with AI feedback (RLAIF) and dynamic alignment objectives. However, these approaches remain limited in multi‑stakeholder settings, where conflicting values arise and deliberative negotiation is required. This work proposes a multi‑agent negotiation‑based alignment framework that aligns LLMs to Collective Agency (CA)‑an existing alignment objective introduced to promote the continual expansion of agency‑while simultaneously improving conflict‑resolution capability. To enable scalable training, two self‑play LLM instances are assigned opposing personas and engage in turn‑based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral‑dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using Group Relative Policy Optimization (GRPO) with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the model achieves CA alignment comparable to a single‑agent baseline while substantially improving conflict‑resolution performance without degrading general language capabilities. These results suggest that negotiation‑driven deliberation training provides a practical path toward LLMs that better support collective decision‑making in value‑conflict scenarios.

PaperID: 5004, https://arxiv.org/pdf/2603.10450.pdf

Authors: Liam Magee

Title: Geist in the Machine: Simulating Recognition and Inner Dialogue in AI-Mediated Teaching and Research

Abstract:
This paper describes an AI tutoring system built upon two psycho‑social theoretic constructs: Hegelian recognition and Freudian psychodynamics. Two related interventions are proposed: recognition‑enhanced prompts that instruct an AI tutor to treat the learner as an autonomous subject, and a multi‑agent ego/superego architecture where an internal critic reviews tutor output. The paper also describes the nature of the human/machine relationship involved in this research itself, employing a reflexive methodology: Claude Code (Opus 4.5/4.6) builds, evaluates, and documents the AI tutor by authoring a companion scientific paper ‑ a process termed "vibe scholarship" ‑ in conjunction with human prompting and suggestion, which is itself documented and analyzed. The companion paper, included as appendix, reports a factorial evaluation across three generation models (DeepSeek V3.2, Haiku 4.5, Gemini Flash 3.0), finding recognition‑enhanced prompts produce large, model‑independent improvements (d=1.34‑1.92) through a calibration mechanism that raises the floor of tutor performance. This result, significant in itself, is combined with the qualitative reflections in this paper to consider impacts of AI on the delicate dynamics of student / teacher and assistant / researcher relations.

PaperID: 5005, https://arxiv.org/pdf/2603.10436.pdf

Authors: Mohammad Saeid Anwar, Anuradha Ravi, Indrajeet Ghosh, Gaurav Shinde, Carl Busart, Nirmalya Roy

Title: COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

Abstract:
Large deep neural networks (DNNs), especially transformer‑based and multimodal architectures, are computationally demanding and challenging to deploy on resource‑constrained edge platforms like field robots. These challenges intensify in mission‑critical scenarios (e.g., disaster response), where robots must collaborate under tight constraints on bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we present COHORT, a collaborative DNN inference and task‑execution framework for multi‑robot systems built on the Robotic Operating System (ROS). COHORT employs a hybrid offline‑online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across robots. Our key contributions are threefold: (a) Offline RL policy learning combined with Advantage‑Weighted Regression (AWR), trained on auction‑based task allocation data from heterogeneous DNN workloads across distributed robots, (b) Online policy adaptation via Multi‑Agent PPO (MAPPO), initialized from the offline policy and fine‑tuned in real time, and (c) comprehensive evaluation of COHORT on vision‑language model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robot/workload and robustness under . We benchmark COHORT against genetic algorithms and multiple RL baselines. Experimental results demonstrate that COHORT reduces battery consumption by 15.4% and increases GPU utilization by 51.67%, while satisfying frame‑rate and deadline constraints 2.55 times of the time.

PaperID: 5006, https://arxiv.org/pdf/2603.10422.pdf

Authors: An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

Title: World2Act: Latent Action Post-Training via Skill-Compositional World Models

Abstract:
World Models (WMs) have emerged as a promising approach for post‑training Vision‑Language‑Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM‑based post‑training methods rely on pixel‑space supervision, making policies sensitive to pixel‑level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post‑training framework that aligns VLA actions directly with WM video‑dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post‑training performance is tied to rollout quality, yet current WMs struggle with arbitrary‑length video generation as they are mostly trained on fixed‑length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM‑based skill‑decomposition pipeline that segments high‑level instructions into low‑level prompts. Our pipeline produces RoboCasa‑Skill and LIBERO‑Skill, supporting skill‑compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T‑N1.6 and Cosmos Policy achieves state‑of‑the‑art results on RoboCasa and LIBERO, and improves real‑world performance by 6.7%, enhancing embodied agent generalization.

PaperID: 5007, https://arxiv.org/pdf/2603.10387.pdf

Authors: Zhengyang Shan, Jiayun Xin, Yue Zhang, Minghui Xu

Title: Don't Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw

Abstract:
Code agents powered by large language models can execute shell commands on behalf of users, introducing severe security vulnerabilities. This paper presents a two‑phase security analysis of the OpenClaw platform. As an open‑source AI agent framework that operates locally, OpenClaw can be integrated with various commercial large language models. Because its native architecture lacks built‑in security constraints, it serves as an ideal subject for evaluating baseline agent vulnerabilities. First, we systematically evaluate OpenClaw's native resilience against malicious instructions. By testing 47 adversarial scenarios across six major attack categories derived from the MITRE ATLAS and ATT\&CK frameworks, we have demonstrated that OpenClaw exhibits significant inherent security issues. It primarily relies on the security capabilities of the backend LLM and is highly susceptible to sandbox escape attacks, with an average defense rate of only 17%. To mitigate these critical security gaps, we propose and implement a novel Human‑in‑the‑Loop (HITL) defense layer. We utilize a dual‑mode testing framework to evaluate the system with and without our proposed intervention. Our findings show that the introduced HITL layer significantly hardens the system, successfully intercepting up to 8 severe attacks that completely bypassed OpenClaw's native defenses. By combining native capabilities with our HITL approach, the overall defense rate improves to a range of 19% to 92%. Our study not only exposes the intrinsic limitations of current code agents but also demonstrates the effectiveness of human‑agent collaborative defense strategies.

PaperID: 5008, https://arxiv.org/pdf/2603.10377.pdf

Authors: Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz

Title: Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Abstract:
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi‑step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task‑conditioned sparse autoencoders for concept discovery with DAGMA‑style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph‑guided interventions induce larger downstream effects than random ones. On ARC‑Challenge, StrategyQA, and LogiQA with GPT‑2 Medium, across five seeds (n=15 paired runs), CCG achieves \CFS=5.654\pm0.625, outperforming ROME‑style tracing (3.382\pm0.233), SAE‑only ranking (2.479\pm0.196), and a random baseline (1.032\pm0.034), with p<0.0001 after Bonferroni correction. Learned graphs are sparse (5‑6% edge density), domain‑specific, and stable across seeds.

PaperID: 5009, https://arxiv.org/pdf/2603.10342.pdf

Authors: Yuning Zhang, Yan Yan, Nan Yang, Dong Yuan

Title: AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU

Abstract:
Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning‑action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under multiple request arrivals from different AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer‑grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head‑of‑line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts, resume prefills, which append tool outputs to cached contexts, and short decodes, which are latency‑critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single‑GPU serving system that ensures stable multi‑agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre‑established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state‑of‑the‑art baselines across different settings.

PaperID: 5010, https://arxiv.org/pdf/2603.10268.pdf

Authors: Syed Yusuf Ahmed, Shiwei Feng, Chanwoo Bae, Calix Barrus Xiangyu Zhang

Title: SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments

Abstract:
Autonomous AI agents powered by large language models (LLMs) are increasingly deployed in real‑world applications, where reliable and robust behavior is critical. However, existing agent evaluation frameworks either rely heavily on manual efforts, operate within simulated environments, or lack focus on testing complex, multimodal, real‑world agents. We introduce SpecOps, a novel, fully automated testing framework designed to evaluate GUI‑based AI agents in real‑world environments. SpecOps decomposes the testing process into four specialized phases ‑ test case generation, environment setup, test execution, and validation ‑ each handled by a distinct LLM‑based specialist agent. This structured architecture addresses key challenges including end‑to‑end task coherence, robust error handling, and adaptability across diverse agent platforms including CLI tools, web apps, and browser extensions. In comprehensive evaluations across five diverse real‑world agents, SpecOps outperforms baselines including general‑purpose agentic systems such as AutoGPT and LLM‑crafted automation scripts in planning accuracy, execution success, and bug detection effectiveness. SpecOps identifies 164 true bugs in the real‑world agents with an F1 score of 0.89. With a cost of under 0.73 USD and a runtime of under eight minutes per test, it demonstrates its practical viability and superiority in automated, real‑world agent testing.

PaperID: 5011, https://arxiv.org/pdf/2603.10249.pdf

Authors: Alejandro Pradas-Gomez, Arindam Brahma, Ola Isaksson

Title: DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice

Abstract:
Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents a DUCTILE (Delegated, User‑supervised Coordination of Tool‑ and document‑Integrated LLM‑Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM‑based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines. Evaluation against expert‑defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across 10 repeated independent runs. The paper discusses the paradigm shift and the practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work when removing mundane tasks and creating an exhausting supervisory role.

PaperID: 5012, https://arxiv.org/pdf/2603.10194.pdf

Authors: Pratyay Kumar, Miguel Antonio Guirao Aguilera, Srikathyayani Srikanteswara, Satyajayant Misra, Abu Saleh Md Tayeen

Title: MCP-in-SoS: Risk assessment framework for open-source MCP servers

Abstract:
Model Context Protocol (MCP) servers have rapidly emerged over the past year as a widely adopted way to enable Large Language Model (LLM) agents to access dynamic, real‑world tools. As MCP servers proliferate and become easy to adopt via open‑source releases, understanding their security risks becomes essential for dependable production agent deployments. Recent work has developed MCP threat taxonomies, proposed mitigations, and demonstrated practical attacks. However, to the best of our knowledge, no prior study has conducted a systematic, large‑scale assessment of weaknesses in open‑source MCP servers. Motivated by this gap, we apply static code analysis to identify Common Weakness Enumeration (CWE) weaknesses and map them to common attack patterns and threat categories using the MITRE Common Attack Pattern Enumerations and Classifications (CAPEC) to ground risk in real‑world threats. We then introduce a risk‑assessment framework for the MCP landscape that combines these threats using a multi‑metric scoring of likelihood and impact. Our findings show that many open‑source MCP servers contain exploitable weaknesses that can compromise confidentiality, integrity, and availability, underscoring the need for secure‑by‑design MCP server development.

PaperID: 5013, https://arxiv.org/pdf/2603.10173.pdf

Authors: Ajay Anand, Gabriel Parra, Chad A. Berghoff, Laura A. Hallock

Title: Characterizing Healthy & Post-Stroke Neuromotor Behavior During 6D Upper-Limb Isometric Gaming: Implications for Design of End-Effector Rehabilitation Robot Interfaces

Abstract:
Successful robot‑mediated rehabilitation requires designing games and robot interventions that promote healthy motor practice. However, the interplay between a given user's neuromotor behavior, the gaming interface, and the physical robot makes designing system elements ‑‑ and even characterizing what behaviors are "healthy" or pathological ‑‑ challenging. We leverage our OpenRobotRehab 1.0 open access data set to assess the characteristics of 13 healthy and 2 post‑stroke users' force output, muscle activations, and game performance while executing isometric trajectory tracking tasks using an end‑effector rehabilitation robot. We present an assessment of how subtle aspects of interface design impact user behavior; an analysis of how pathological neuromotor behaviors are reflected in end‑effector force dynamics; and a novel hidden Markov model (HMM)‑based neuromotor behavior classification method based on surface electromyography (sEMG) signals during cyclic motions. We demonstrate that task specification (including which axes are constrained and how users interpret tracking instructions) shapes user behavior; that pathology‑related features are detectable in 6D end‑effector force data during isometric task execution (with significant differences between healthy and post‑stroke profiles in force error and average force production at p=0.05); and that healthy neuromotor strategies are heterogeneous and inherently difficult to characterize. We also show that our HMM‑based models discriminate healthy and post‑stroke neuromotor dynamics where synergy‑based decompositions reflect no such differentiation. Lastly, we discuss these results' implications for the design of adaptive end‑effector rehabilitation robots capable of promoting healthier movement strategies across diverse user populations.

PaperID: 5014, https://arxiv.org/pdf/2603.10098.pdf

Authors: Daniel Hennes, Zun Li, John Schultz, Marc Lanctot

Title: Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models

Abstract:
Recent advances in multi‑agent reinforcement learning, particularly Policy‑Space Response Oracles (PSRO), have enabled the computation of approximate game‑theoretic equilibria in increasingly complex domains. However, these methods rely on deep reinforcement learning oracles that produce `black‑box' neural network policies, making them difficult to interpret, trust or debug. We introduce Code‑Space Response Oracles (CSRO), a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs). CSRO reframes the best response computation as a code generation task, prompting an LLM to generate policies directly as human‑readable code. This approach not only yields inherently interpretable policies but also leverages the LLM's pretrained knowledge to discover complex, human‑like strategies. We explore multiple ways to construct and enhance an LLM‑based oracle: zero‑shot prompting, iterative refinement and \emphAlphaEvolve, a distributed LLM‑based evolutionary system. We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies. Our work presents a new perspective on multi‑agent learning, shifting the focus from optimizing opaque policy parameters to synthesizing interpretable algorithmic behavior.

PaperID: 5015, https://arxiv.org/pdf/2603.10062.pdf

Authors: Zhongming Yu, Naicheng Yu, Hejia Zhang, Wentao Ni, Mingrui Yin, Jiaying Yang, Yujie Zhao, Jishen Zhao

Title: Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Abstract:
As LLM agents evolve into collaborative multi‑agent systems, their memory requirements grow rapidly in complexity. This position paper frames multi‑agent memory as a computer architecture problem. We distinguish shared and distributed memory paradigms, propose a three‑layer memory hierarchy (I/O, cache, and memory), and identify two critical protocol gaps: cache sharing across agents and structured memory access control. We argue that the most pressing open challenge is multi‑agent memory consistency. Our architectural framing provides a foundation for building reliable, scalable multi‑agent systems.

PaperID: 5016, https://arxiv.org/pdf/2603.10060.pdf

Authors: Abhinaba Basu

Title: Tool Receipts, Not Zero-Knowledge Proofs: Practical Hallucination Detection for AI Agents

Abstract:
AI agents that execute tasks via tool calls frequently hallucinate results ‑ fabricating tool executions, misstating output counts, or presenting inferences as facts. Recent approaches to verifiable AI inference rely on zero‑knowledge proofs, which provide cryptographic guarantees but impose minutes of proving time per query, making them impractical for interactive agents. We propose NabaOS, a lightweight verification framework inspired by Indian epistemology (Nyaya Shastra), which classifies every claim in an LLM response by its epistemic source (pramana): direct tool output (pratyaksha), inference (anumana), external testimony (shabda), absence (abhava), or ungrounded opinion. Our runtime generates HMAC‑signed tool execution receipts that the LLM cannot forge, then cross‑references claims against these receipts to detect hallucinations in real time. We evaluate on NyayaVerifyBench, a new benchmark of 1,800 agent response scenarios across four languages with injected hallucinations of six types. NabaOS detects 94.2% of fabricated tool references, 87.6% of count misstatements, and 91.3% of false absence claims, with <15ms verification overhead per response. For deep delegation (agents performing multi‑step web tasks), our cross‑checking protocol catches 78.4% of URL fabrications via independent re‑fetching. We compare against five approaches: zkLLM (cryptographic proofs, 180s/query), TOPLOC (locality‑sensitive hashing), SPEX (sampling‑based proof of execution), tensor commitments, and self‑consistency checking. NabaOS achieves the best cost‑latency‑coverage trade‑off for interactive agents: 94.2% coverage at <15ms versus zkLLM's near‑perfect coverage at 180,000ms. For interactive agents, practical receipt‑based verification provides better cost‑benefit than cryptographic proofs, and epistemic classification gives users actionable trust signals rather than binary judgments.

PaperID: 5017, https://arxiv.org/pdf/2603.10057.pdf

Authors: Petar Radanliev, Carsten Maple, Omar Santos, Kayvan Atefi

Title: SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation

Abstract:
Software supply‑chain security requires provenance mechanisms that support reproducibility and vulnerability assessment under dynamic execution conditions. Conventional Software Bills of Materials (SBOMs) provide static dependency inventories but cannot capture runtime behaviour, environment drift, or exploitability context. This paper introduces agentic Artificial Intelligence Bills of Materials (AIBOMs), extending SBOMs into active provenance artefacts through autonomous, policy‑constrained reasoning. We present an agentic AIBOM framework based on a multi‑agent architecture comprising (i) a baseline environment reconstruction agent (MCP), (ii) a runtime dependency and drift‑monitoring agent (A2A), and (iii) a policy‑aware vulnerability and VEX reasoning agent (AGNTCY). These agents generate contextual exploitability assertions by combining runtime execution evidence, dependency usage, and environmental mitigations with ISO/IEC 20153:2025 Common Security Advisory Framework (CSAF) v2.0 semantics. Exploitability is expressed via structured VEX assertions rather than enforcement actions. The framework introduces minimal, standards‑aligned schema extensions to CycloneDX and SPDX, capturing execution context, dependency evolution, and agent decision provenance while preserving interoperability. Evaluation across heterogeneous analytical workloads demonstrates improved runtime dependency capture, reproducibility fidelity, and stability of vulnerability interpretation compared with established provenance systems, with low computational overhead. Ablation studies confirm that each agent contributes distinct capabilities unavailable through deterministic automation.

PaperID: 5018, https://arxiv.org/pdf/2603.10041.pdf

Authors: Ondřej Lukáš, Jihoon Shin, Emilia Rivas, Diego Forni, Maria Rigaki, Carlos Catania, Aritran Piplai, Christopher Kiekintveld, Sebastian Garcia

Title: Evaluating Generalization Mechanisms in Autonomous Cyber Attack Agents

Abstract:
Autonomous offensive agents often fail to transfer beyond the networks on which they are trained. We isolate a minimal but fundamental shift ‑‑ unseen host/subnet IP reassignment in an otherwise fixed enterprise scenario ‑‑ and evaluate attacker generalization in the NetSecGame environment. Agents are trained on five IP‑range variants and tested on a sixth unseen variant; only the meta‑learning agent may adapt at test time. We compare three agent families (traditional RL, adaptation agents, and LLM‑based agents) and use action‑distribution‑based behavioral/XAI analyses to localize failure modes. Some adaptation methods show partial transfer but significant degradation under unseen reassignment, indicating that even address‑space changes can break long‑horizon attack policies. Under our evaluation protocol and agent‑specific assumptions, prompt‑driven pretrained LLM agents achieve the highest success on the held‑out reassignment, but at the cost of increased inference‑time compute, reduced transparency, and practical failure modes such as repetition/invalid‑action loops.

PaperID: 5019, https://arxiv.org/pdf/2603.10028.pdf

Authors: Yonathan Arbel, Peter Salib, Simon Goldstein

Title: How to Count AIs: Individuation and Liability for AI Agents

Abstract:
Very soon, millions of AI agents will proliferate across the economy, autonomously taking billions of actions. Inevitably, things will go wrong. Humans will be defrauded, injured, even killed. Law will somehow have to govern the coming wave. But when an AI causes harm, the first question to answer, before anyone can be held accountable is: Which AI Did It? Identifying AIs is unusually difficult. AIs lack bodies. They can copy, split, merge, swarm, and vanish at will. Even today, a "single" AI agent is often an ensemble of instances based on multiple models. The complexity will only multiply as AI capabilities improve. This Article is the first to comprehensively diagnose the legal problem of identifying AIs. Two kinds of identity are required: "thin" and "thick." Thin identification ties every AI action to some human principal, essential for holding accountable the humans who make and use AI agents. Thick identification distinguishes between AI agents, qua agents ‑‑ sorting millions of AI entities into discrete, persistent units with stable, coherent goals, essential where principal‑agent problems prevent humans from perfectly controlling AIs. This Article also presents a solution: the "Algorithmic Corporation" or "A‑corp" ‑‑ a legal‑fictional entity that can hold property, make contracts, and litigate in its own name. Owned by humans but run by AIs, A‑corps solve the thin identity problem by tying AI actions to a human owner, and the thick identity problem via emergent self‑organization. A‑corps own the resources ‑‑ including compute ‑‑ that AIs need to accomplish their goals, giving AI managers strong incentives to share control only with goal‑aligned AIs. In equilibrium, incentive and selection mechanisms force A‑corps to self‑organize into persistent, legally legible entities with coherent goals that respond rationally to legal incentives, like liability.

PaperID: 5020, https://arxiv.org/pdf/2603.09890.pdf

Authors: Hongbo Bo, Jingyu Hu, Weiru Liu

Title: Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts

Abstract:
Large Language Models (LLMs) have emerged as a new paradigm for multi‑agent systems. However, existing research on the behaviour of LLM‑based multi‑agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt‑as‑action can be parameterized so as to construct a lightweight policy which consists of a sequence of state‑action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non‑repetition, and stance shift. We conduct experiments using different LLM‑driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy‑parameterised prompts offer a simple and effective mechanism to influence the dialogue process, which will help the research of multi‑agent systems in the direction of social simulation.

PaperID: 5021, https://arxiv.org/pdf/2603.09835.pdf

Authors: Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover, Ramakrishna B. Bairi, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta, Rishikesh Maurya, Vageesh D. C

Title: Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents

Abstract:
Sequential multi‑agent reasoning frameworks such as Chain‑of‑Agents (CoA) handle long‑context queries by decomposing inputs into chunks and processing them sequentially using LLM‑based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent‑state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded‑memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long‑context reasoning. We use the well‑known Chow‑Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth‑first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document‑chunk ordering and semantic score‑based ordering in answer relevance and exact‑match accuracy across three long‑context benchmarks.

PaperID: 5022, https://arxiv.org/pdf/2603.09827.pdf

Authors: Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

Title: MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Abstract:
As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi‑agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system‑level memory. In this work, we first formally define a novel problem of understanding multiple long‑horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent‑EgoQA (MA‑EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA‑EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory‑of‑mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA‑EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent‑wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA‑EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system‑level understanding across the agents. The code and benchmark are available at https://ma‑egoqa.github.io.

PaperID: 5023, https://arxiv.org/pdf/2603.09733.pdf

Authors: Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian

Title: FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

Abstract:
Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task‑specific accuracy with the whole‑process versatility required to support end‑to‑end clinical workflows. To address these limitations, we propose FetalAgents, the first multi‑agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end‑to‑end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi‑center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow‑aligned solution for fetal ultrasound analysis and reporting.

PaperID: 5024, https://arxiv.org/pdf/2603.09716.pdf

Authors: Xiaoxing Wang, Ning Liao, Shikun Wei, Chen Tang, Feiyu Xiong

Title: AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents

Abstract:
Autonomous agent frameworks still struggle to reconcile long‑term experiential learning with real‑time, context‑sensitive decision‑making. In practice, this gap appears as static cognition, rigid workflow dependence, and inefficient context usage, which jointly limit adaptability in open‑ended and non‑stationary environments. To address these limitations, we present AutoAgent, a self‑evolving multi‑agent framework built on three tightly coupled components: evolving cognition, on‑the‑fly contextual decision‑making, and elastic memory orchestration. At the core of AutoAgent, each agent maintains structured prompt‑level cognition over tools, self‑capabilities, peer expertise, and task knowledge. During execution, this cognition is combined with live task context to select actions from a unified space that includes tool calls, LLM‑based generation, and inter‑agent requests. To support efficient long‑horizon reasoning, an Elastic Memory Orchestrator dynamically organizes interaction history by preserving raw records, compressing redundant trajectories, and constructing reusable episodic abstractions, thereby reducing token overhead while retaining decision‑critical evidence. These components are integrated through a closed‑loop cognitive evolution process that aligns intended actions with observed outcomes to continuously update cognition and expand reusable skills, without external retraining. Empirical results across retrieval‑augmented reasoning, tool‑augmented agent benchmarks, and embodied task environments show that AutoAgent consistently improves task success, tool‑use efficiency, and collaborative robustness over static and memory‑augmented baselines. Overall, AutoAgent provides a unified and practical foundation for adaptive autonomous agents that must learn from experience while making reliable context‑aware decisions in dynamic environments.

PaperID: 5025, https://arxiv.org/pdf/2603.09701.pdf

Authors: Binquan Zhang, Li Zhang, Lin Shi, Song Wang, Yuwei Qian, Linhui Zhao, Fang Liu, An Fu, Yida Ye

Title: An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Abstract:
Large Language Models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that facilitate complex, multi‑turn collaborative programming. While LLMs exhibit remarkable proficiency in generating standalone code snippets, they often struggle to maintain contextual consistency during extended interactions, creating significant obstacles in the collaboration process. Existing benchmarks primarily emphasize the functional correctness of the final output, overlooking latent quality issues within the interaction process itself, which we term Interaction Smells. In this paper, we conduct an empirical study on sampled real‑word user‑LLM interactions from WildChat and LMSYS‑Chat‑1M datasets to systematically investigate Interaction Smells in human‑LLM code generation tasks from the perspectives of phenomena, distribution, and mitigation. First, we establish the first taxonomy of Interaction Smells by manually performing open card sorting on real‑world interaction logs. This taxonomy categorizes Interaction Smells into three primary categories, i.e., User Intent Quality, Historical Instruction Compliance, and Historical Response Violation, comprising nine specific subcategories. Next, we quantitatively evaluate six mainstream LLMs (i.e., GPT‑4o, DeepSeek‑Chat, Gemini 2.5, Qwen2.5‑32B, Qwen2.5‑72B, and Qwen3‑235B‑a22b) to analyze the distribution of Interaction Smells across different models. Finally, we propose Invariant‑aware Constraint Evolution (InCE), a multi‑agent framework designed to improve multi‑turn interaction quality through explicit extraction of global invariants and pre‑generation quality audits. Experimental results on the extended WildBench benchmark demonstrate that this lightweight mitigation approach significantly improves the Task Success Rate and effectively suppresses the occurrence of Interaction Smells.

PaperID: 5026, https://arxiv.org/pdf/2603.09643.pdf

Authors: Anupam Purwar, Aditya Choudhary

Title: MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

Abstract:
Current evaluation frameworks and benchmarks for LLM powered agents focus on text chat driven agents, these frameworks do not expose the persona of user to the agent, thus operating in a user agnostic environment. Importantly, in customer experience management domain, the agent's behaviour evolves as the agent learns about user personality. With proliferation of real time TTS and multi‑modal language models, LLM based agents are gradually going to become multi‑modal. Towards this, we propose the MM‑tau‑p^2 benchmark with metrics for evaluating the robustness of multi‑modal agents in dual control setting with and without persona adaption of user, while also taking user inputs in the planning process to resolve a user query. In particular, our work shows that even with state of‑the‑art frontier LLMs like GPT‑5, GPT 4.1, there are additional considerations measured using metrics viz. multi‑modal robustness, turn overhead while introducing multi‑modality into LLM based agents. Overall, MM‑tau‑p^2 builds on our prior work FOCAL and provides a holistic way of evaluating multi‑modal agents in an automated way by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains by using the LLM‑as‑judge approach using carefully crafted prompts with well defined rubrics for evaluating each conversation.

PaperID: 5027, https://arxiv.org/pdf/2603.09619.pdf

Authors: Vera V. Vishnyakova

Title: Context Engineering: From Prompts to Corporate Multi-Agent Architecture

Abstract:
As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi‑step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind's intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author's experience building a multi‑agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent's operating system. Two higher‑order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade‑off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine‑readable corpus of corporate policies and standards enabling autonomous operation of multi‑agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent's context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.

PaperID: 5028, https://arxiv.org/pdf/2603.09448.pdf

Authors: Yoon Jo Kim, Wonyoung Cho, Jongmin Lee, Han Joo Chae, Hyunki Park, Sang Hoon Seo, Noh Jae Myung, Kyungmi Yang, Dongryul Oh, Jin Sung Kim

Title: A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation

Abstract:
Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert‑annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline‑aware AI agent framework that seamlessly converts textual clinical guidelines into three‑dimensional target contours in a training‑free manner. Evaluated on esophageal cancer cases, the agent achieves a zero‑shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU‑Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero‑shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent‑based paradigm offers near‑instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.

PaperID: 5029, https://arxiv.org/pdf/2603.09337.pdf

Authors: Yang Li, Xing Chen, Yutao Liu, Gege Qi, Yanxian BI, Zizhe Wang, Yunjian Zhang, Yao Zhu

Title: Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

Abstract:
Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time‑sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single‑shot capability, overlooking the challenges of opponent‑aware decision‑making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi‑agent evaluation framework that assesses LLMs through 1v1 zero‑sum competitive interactions, framing reasoning as an iterative, adaptive decision‑making process. STAR supports both turn‑based and real‑time settings, enabling controlled analysis of long‑horizon strategic planning and fast‑paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win‑loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy‑execution gap: while reasoning‑intensive models dominate turn‑based settings, their inference latency often leads to inferior performance in real‑time scenarios, where faster instruction‑tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade‑off in competitive, dynamic settings.

PaperID: 5030, https://arxiv.org/pdf/2603.09331.pdf

Authors: Heng Zhang, Haddy Alchaer, Arash Ajoudani, Yu She

Title: Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Abstract:
We introduce Reward‑Zero, a general‑purpose implicit reward mechanism that transforms natural‑language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward‑Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent's interaction experience, Reward‑Zero produces a continuous, semantically aligned sense‑of‑completion signal. This reward supplements sparse or delayed environmental feedback without requiring task‑specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward‑Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward‑shaping baselines, successfully solving tasks that hand‑designed rewards could not in some complex tasks. In addition, we develop a mini benchmark for the evaluation of completion sense during task execution via language embeddings. These results highlight the promise of language‑driven implicit reward functions as a practical path toward more sample‑efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.

PaperID: 5031, https://arxiv.org/pdf/2603.09297.pdf

Authors: Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang

Title: TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA

Abstract:
Large Language Model (LLM) has exhibited strong reasoning ability in text‑based contexts across various domains, yet the limitation of context window poses challenges for the model on long‑range inference tasks and necessitates a memory storage system. While many current storage approaches have been proposed with episodic notes and graph representations of memory, retrieval methods still primarily rely on predefined workflows or static similarity top‑k over embeddings. To address this inflexibility, we introduced a novel tool‑augmented autonomous memory retrieval framework (TA‑Mem), which contains: (1) a memory extraction LLM agent which is prompted to adaptively chuck an input into sub‑context based on semantic correlation, and extract information into structured notes, (2) a multi‑indexed memory database designed for different types of query methods including both key‑based lookup and similarity‑based retrieval, (3) a tool‑augmented memory retrieval agent which explores the memory autonomously by selecting appropriate tools provided by the database based on the user input, and decides whether to proceed to the next iteration or finalizing the response after reasoning on the fetched memories. The TA‑Mem is evaluated on the LoCoMo dataset, achieving significant performance improvements over existing baseline approaches. In addition, an analysis of tool use across different question types also demonstrates the adaptivity of the proposed method.

PaperID: 5032, https://arxiv.org/pdf/2603.09290.pdf

Authors: Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Zhangze Chen, Ling Yue, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, Min-Ling Zhang, Yong Rui

Title: ToolRosetta: Bridging Open-Source Repositories and Large Language Model Agents through Automated Tool Standardization

Abstract:
Reusing and invoking existing code remains costly and unreliable, as most practical tools are embedded in heterogeneous code repositories and lack standardized, executable interfaces. Although large language models (LLMs) and Model Context Protocol (MCP)‑based tool invocation frameworks enable natural language task execution, current approaches rely heavily on manual tool curation and standardization, which fundamentally limits scalability. In this paper, we propose ToolRosetta, a unified framework that automatically translates open‑source code repositories and APIs into MCP‑compatible tools that can be reliably invoked by LLMs. Given a user task, ToolRosetta autonomously plans toolchains, identifies relevant codebases, and converts them into executable MCP services, enabling end‑to‑end task completion with minimal human intervention. In addition, ToolRosetta incorporates a security inspection layer to mitigate risks inherent in executing arbitrary code. Extensive experiments across diverse scientific domains demonstrate that ToolRosetta can automatically standardize a large number of open‑source tools and reduce the human effort required for code reproduction and deployment. Notably, by seamlessly leveraging specialized open‑source tools, ToolRosetta‑powered agents consistently improve task completion performance compared to commercial LLMs and existing agent systems.

PaperID: 5033, https://arxiv.org/pdf/2603.09208.pdf

Authors: Jake Gonzales, Max Horwitz, Eric Mazumdar, Lillian J. Ratliff

Title: Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation

Abstract:
Provably efficient and robust equilibrium computation in general‑sum Markov games remains a core challenge in multi‑agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk‑Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose \textttRQRE‑OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite‑sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk‑sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that \textttRQRE‑OVI achieves competitive performance under self‑play while producing substantially more robust behavior under cross‑play compared to Nash‑based approaches. These results suggest \textttRQRE‑OVI offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.

PaperID: 5034, https://arxiv.org/pdf/2603.09206.pdf

Authors: Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu

Title: MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Abstract:
Self‑evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self‑evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self‑evolution process. In this work, we present Multi‑model Multimodal Zero (MM‑Zero), the first RL‑based framework to achieve zero‑data self‑evolution for VLM reasoning. Moving beyond prior dual‑role (Proposer and Solver) setups, MM‑Zero introduces a multi‑role self‑evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM‑Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM‑Zero establishes a scalable path toward self‑evolving multi‑model systems for multimodal models, extending the frontier of self‑improvement beyond the conventional two‑model paradigm.

PaperID: 5035, https://arxiv.org/pdf/2603.09203.pdf

Authors: Jiangming Shu, Yuxiang Zhang, Ye Ma, Xueyuan Lin, Jitao Sang

Title: Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Abstract:
Retrieval‑augmented agents can query external evidence, yet their reliability in multi‑step reasoning remains limited: noisy retrieval may derail multi‑hop question answering, while outcome‑only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose \textscEvalAct (Evaluate‑as‑Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search‑to‑Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process‑Calibrated Advantage Rescaling (PCAR), a GRPO‑based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open‑domain QA benchmarks show that \textscEvalAct achieves the best average accuracy, with the largest gains on multi‑hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.

PaperID: 5036, https://arxiv.org/pdf/2603.09188.pdf

Authors: Mingyi Zhang, Cheng Hu, Yiqin Wang, Haotong Qin, Hongye Su, Lei Xie

Title: Robust Spatiotemporal Motion Planning for Multi-Agent Autonomous Racing via Topological Gap Identification and Accelerated MPC

Abstract:
High‑speed multi‑agent autonomous racing demands robust spatiotemporal planning and precise control under strict computational limits. Current methods often oversimplify interactions or abandon strict kinematic constraints. We resolve this by proposing a Topological Gap Identification and Accelerated MPC framework. By predicting opponent behaviors via SGPs, our method constructs dynamic occupancy corridors to robustly select optimal overtaking gaps. We ensure strict kinematic feasibility using a Linear Time‑Varying MPC powered by a customized Pseudo‑Transient Continuation (PTC) solver for high‑frequency execution. Experimental results on the F1TENTH platform show that our method significantly outperforms state‑of‑the‑art baselines: it reduces total maneuver time by 51.6% in sequential scenarios, consistently maintains an overtaking success rate exceeding 81% in dense bottlenecks, and lowers average computational latency by 20.3%, pushing the boundaries of safe and high‑speed autonomous racing.

PaperID: 5037, https://arxiv.org/pdf/2603.09184.pdf

Authors: Lina Berrayana, Ahmed Heakl, Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen

Title: Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning

Abstract:
Most multi‑agent systems rely exclusively on autoregressive language models (ARMs) that are based on sequential generation. Although effective for fluent text, ARMs limit global reasoning and plan revision. On the other hand, Discrete Diffusion Language Models (DDLMs) enable non‑sequential, globally revisable generation and have shown strong planning capabilities, but their limited text fluency hinders direct collaboration with ARMs. We introduce Latent‑DARM, a latent‑space communication framework bridging DDLM (planners) and ARM (executors), maximizing collaborative benefits. Across mathematical, scientific, and commonsense reasoning benchmarks, Latent‑DARM outperforms text‑based interfaces on average, improving accuracy from 27.0% to 36.0% on DART‑5 and from 0.0% to 14.0% on AIME2024. Latent‑DARM approaches the results of state‑of‑the‑art reasoning models while using less than 2.2% of its token budget. This work advances multi‑agent collaboration among agents with heterogeneous models.

PaperID: 5038, https://arxiv.org/pdf/2603.09152.pdf

Authors: Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao

Title: DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering

Abstract:
Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single‑agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi‑hop logic. This paper introduces DataFactory, a multi‑agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data‑to‑knowledge graph transformation via the mapping function T:D x S x R ‑> G, and implement natural language‑based consultation that ‑ unlike fixed workflow multi‑agent systems ‑ enables flexible inter‑agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single‑team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE‑2). The framework offers design guidelines for multi‑agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph‑based knowledge representation.

PaperID: 5039, https://arxiv.org/pdf/2603.09151.pdf

Authors: Junnan Dong, Chuang Zhou, Zheng Yuan, Yifei Yu, Qiufeng Wang, Yinghui Li, Siyu An, Di Yin, Xing Sun, Feiyue Huang

Title: Deep Tabular Research via Continual Experience-Driven Execution

Abstract:
Large language models often struggle with complex long‑horizon analytical tasks over unstructured tables, which typically feature hierarchical and bidirectional headers and non‑canonical layouts. We formalize this challenge as Deep Tabular Research (DTR), requiring multi‑step reasoning over interdependent table regions. To address DTR, we propose a novel agentic framework that treats tabular reasoning as a closed‑loop decision‑making process. We carefully design a coupled query and table comprehension for path decision making and operational execution. Specifically, (i) DTR first constructs a hierarchical meta graph to capture bidirectional semantics, mapping natural language queries into an operation‑level search space; (ii) To navigate this space, we introduce an expectation‑aware selection policy that prioritizes high‑utility execution paths; (iii) Crucially, historical execution outcomes are synthesized into a siamese structured memory, i.e., parameterized updates and abstracted texts, enabling continual refinement. Extensive experiments on challenging unstructured tabular benchmarks verify the effectiveness and highlight the necessity of separating strategic planning from low‑level execution for long‑horizon tabular reasoning.

PaperID: 5040, https://arxiv.org/pdf/2603.09134.pdf

Authors: Shaswata Mitra, Raj Patel, Sudip Mittal, Md Rayhanur Rahman, Shahram Rahimi

Title: AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations

Abstract:
Multi‑agent systems (MAS) powered by LLMs promise adaptive, reasoning‑driven enterprise workflows, yet granting agents autonomous control over tools, memory, and communication introduces attack surfaces absent from deterministic pipelines. While current research largely addresses prompt‑level exploits and narrow individual vectors, it lacks a holistic architectural model for enterprise‑grade security. We introduce AgenticCyOps (Securing Multi‑Agentic AI Integration in Enterprise Cyber Operations), a framework built on a systematic decomposition of attack surfaces across component, coordination, and protocol layers, revealing that documented vectors consistently trace back to two integration surfaces: tool orchestration and memory management. Building on this observation, we formalize these integration surfaces as primary trust boundaries and define five defensive principles: authorized interfaces, capability scoping, verified execution, memory integrity & synchronization, and access‑controlled data isolation; each aligned with established compliance standards (NIST, ISO 27001, GDPR, EU AI Act). We apply the framework to a Security Operations Center (SOC) workflow, adopting the Model Context Protocol (MCP) as the structural basis, with phase‑scoped agents, consensus validation loops, and per‑organization memory boundaries. Coverage analysis, attack path tracing, and trust boundary assessment confirm that the design addresses the documented attack vectors with defense‑in‑depth, intercepts three of four representative attack chains within the first two steps, and reduces exploitable trust boundaries by a minimum of 72% compared to a flat MAS, positioning AgenticCyOps as a foundation for securing enterprise‑grade integration.

PaperID: 5041, https://arxiv.org/pdf/2603.09052.pdf

Authors: Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi, Johanna Alvarez, Diane Shilling, Audra Lisa Doyle, Ajit Chary, William Borden, Ming Jack Po

Title: From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring

Abstract:
Background: Remote patient monitoring (RPM) generates vast data, yet landmark trials (Tele‑HF, BEAT‑HF) failed because data volume overwhelmed clinical staff. While TIM‑HF2 showed 24/7 physician‑led monitoring reduces mortality by 30%, this model remains prohibitively expensive and unscalable. Methods: We developed Sentinel, an autonomous AI agent using Model Context Protocol (MCP) for contextual triage of RPM vitals via 21 clinical tools and multi‑step reasoning. Evaluation included: (1) self‑consistency (100 readings x 5 runs); (2) comparison against rule‑based thresholds; and (3) validation against 6 clinicians (3 physicians, 3 NPs) using a connected matrix design. A leave‑one‑out (LOO) analysis compared the agent against individual clinicians; severe overtriage cases underwent independent physician adjudication. Results: Against a human majority‑vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four‑level exact accuracy was 69.4% (quadratic‑weighted kappa=0.778); 95.9% of classifications were within one severity level. In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0% aggregate) and actionable sensitivity (90.9% vs. 69.5%). While disagreements skewed toward overtriage (22.5%), independent adjudication of severe gaps (>=2 levels) validated agent escalation in 88‑94% of cases; consensus resolution validated 100%. The agent showed near‑perfect self‑consistency (kappa=0.850). Median cost was 0.34/triage. Conclusions: Sentinel triages RPM vitals with sensitivity exceeding individual clinicians. By automating systematic context synthesis, Sentinel addresses the core limitation of prior RPM trials, offering a scalable path toward the intensive monitoring shown to reduce mortality while maintaining a clinically defensible overtriage profile.

PaperID: 5042, https://arxiv.org/pdf/2603.09002.pdf

Authors: Tam Nguyen, Moses Ndebugre, Dheeraj Arremsetty

Title: Security Considerations for Multi-agent Systems

Abstract:
Multi‑agent artificial intelligence systems or MAS are systems of autonomous agents that exercise delegated tool authority, share persistent memory, and coordinate via inter‑agent communication. MAS introduces qualitatively distinct security vulnerabilities from those documented for singular AI models. Existing security and governance frameworks were not designed for these emerging attack surfaces. This study systematically characterizes the threat landscape of MAS and quantitatively evaluates 16 security frameworks for AI against it. A four‑phase methodology is proposed: constructing a deep technical knowledge base of production multi‑agent architectures; conducting generative AI‑assisted threat modeling scoped to MAS cybersecurity risks and validated by domain experts; structuring survey plans at individual‑threat granularity; and scoring each framework on a three‑point scale against the cybersecurity risks. The risks were organized into 193 distinct main threat items across nine risk categories. The expected minimal average score is 2. No reviewed framework achieves majority coverage of any single category. Non‑Determinism (mean score 1.231 across all 16 frameworks) and Data Leakage (1.340) are the most under‑addressed domains. The OWASP Agentic Security Initiative leads overall at 65.3% coverage and in the design phase; the CDAO Generative AI Responsible AI Toolkit leads in development and operational coverage. These results provide the first empirical cross‑framework comparison for MAS security and offer evidence‑based guidance for framework selection. Please check back for information on the published journal version.

PaperID: 5043, https://arxiv.org/pdf/2603.08993.pdf

Authors: Tony Mason

Title: Arbiter: Detecting Interference in LLM Agent System Prompts

Abstract:
System prompts for LLM‑based coding agents are software artifacts that govern agent behavior, yet lack the testing infrastructure applied to conventional software. We present Arbiter, a framework combining formal evaluation rules with multi‑model LLM scouring to detect interference patterns in system prompts. Applied to three major coding agent system prompts: Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google), we identify 152 findings across the undirected scouring phase and 21 hand‑labeled interference patterns in directed analysis of one vendor. We show that prompt architecture (monolithic, flat, modular) strongly correlates with observed failure class but not with severity, and that multi‑model evaluation discovers categorically different vulnerability classes than single‑model analysis. One scourer finding was structural data loss in Gemini CLI's memory system was consistent with an issue filed and patched by Google, which addressed the symptom without addressing the schema‑level root cause identified by the scourer. Total cost of cross‑vendor analysis: \0.27 USD.

PaperID: 5044, https://arxiv.org/pdf/2603.08852.pdf

Authors: Sunil Prakash

Title: LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems

Abstract:
As multi‑agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model‑level properties as first‑class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI‑native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM‑as‑judge evaluation. Identity‑aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no‑provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI‑native protocol primitives enable more efficient and governable delegation.

PaperID: 5045, https://arxiv.org/pdf/2603.08814.pdf

Authors: Piyush Gupta, Sangjae Bae, Jiachen Li, David Isele

Title: Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams

Abstract:
Long‑horizon task planning for heterogeneous multi‑robot systems is essential for deploying collaborative teams in real‑world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)‑based approaches often suffer from hallucinations and weak grounding‑i.e., poor alignment between generated plans and actual environmental objects and constraints‑in object‑rich settings. We present Scale‑Plan, a scalable LLM‑assisted framework that generates compact, task‑relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale‑Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale‑Plan enables efficient decomposition, allocation, and long‑horizon plan generation. We evaluate our approach on complex multi‑agent tasks and introduce MAT2‑THOR, a cleaned benchmark built on AI2‑THOR for reliable evaluation of multi‑robot planning systems. Scale‑Plan outperforms pure LLM and hybrid LLM‑PDDL baselines across all metrics, improving scalability and reliability.

PaperID: 5046, https://arxiv.org/pdf/2603.08754.pdf

Authors: Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li

Title: Hindsight Credit Assignment for Long-Horizon LLM Agents

Abstract:
Large Language Model (LLM) agents often face significant credit assignment challenges in long‑horizon, multi‑step tasks due to sparse rewards. Existing value‑free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step‑level Q‑value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post‑hoc critic to refine step‑level Q‑values through hindsight reasoning. Furthermore, HCAPO's multi‑scale advantage mechanism effectively supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state‑of‑the‑art RL methods. Notably, HCAPO achieves a 7.7% improvement in success rate on WebShop and a 13.8% on ALFWorld over GRPO using the Qwen2.5‑7B‑Instruct model. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision‑making, and ensures scalability in complex, long‑horizon tasks.

PaperID: 5047, https://arxiv.org/pdf/2603.08745.pdf

Authors: Ming-Yen Lee, Shimeng Yu

Title: ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization

Abstract:
Compute‑in‑Memory (CIM) architectures have been widely studied for deep neural network (DNN) acceleration by reducing data transfer overhead between the memory and computing units. In conventional CIM design flows, system‑level CIM simulators (such as NeuroSim) are leveraged for design space exploration (DSE) across different hardware configurations and DNN workloads. However, CIM designers need to invest substantial effort in interpreting simulator manuals and understanding complex parameter dependencies. Moreover, extensive design‑simulation iterations are often required to identify optimal CIM configurations under hardware constraints. These challenges severely prolong the DSE cycle and hinder rapid CIM deployment. To address these challenges, this work proposes ChatNeuroSim, a large language model (LLM)‑based agent framework for automated CIM accelerator deployment and optimization. ChatNeuroSim automates the entire CIM workflow, including task scheduling, request parsing and adjustment, parameter dependency checking, script generation, and simulation execution. It also integrates the proposed CIM optimizer using design space pruning, enabling rapid identification of optimal configurations for different DNN workloads. ChatNeuroSim is evaluated on extensive request‑level testbenches and demonstrates correct simulation and optimization behavior, validating its effectiveness in automatic request parsing and task execution. Furthermore, the proposed design space pruning technique accelerates CIM optimization process compared to no‑pruning baseline. In the case study optimizing Swin Transformer Tiny under 22 nm technology, the proposed CIM optimizer achieves a 0.42×‑0.79× average runtime reduction compared to the same optimization algorithm without design space pruning.

PaperID: 5048, https://arxiv.org/pdf/2603.08738.pdf

Authors: Kezhi Li, Min Li, Xiangyu Wen, Shibo Zhao, Jieying Wu, Junhua Huang, Qiang Xu

Title: FormalRTL: Verified RTL Synthesis at Scale

Abstract:
Large language models (LLMs) have demonstrated significant potential in automating hardware synthesis, yet substantial barriers remain for industrial‑scale, datapath‑centric designs due to ambiguous specifications and a lack of formal correctness guarantees. In this work, we present FormalRTL, a novel end‑to‑end multi‑agent framework that systematically integrates software reference models as formal, executable specifications to guide register‑transfer level (RTL) code generation and verification. By tightly coupling planning, synthesis, and formal equivalence checking, FormalRTL achieves scalable and reliable hardware code generation that addresses the critical challenges faced in industrial contexts. The comprehensive evaluation of a new suite of complex industrial‑grade benchmarks demonstrates the effectiveness and robustness of our approach. We will open‑source the FormalRTL framework and the benchmark suite to facilitate future research in this area.

PaperID: 5049, https://arxiv.org/pdf/2603.08736.pdf

Authors: Mohammed Cherifi

Title: Autonomous Edge-Deployed AI Agents for Electric Vehicle Charging Infrastructure Management

Abstract:
Public EV charging infrastructure suffers from significant failure rates ‑‑ with field studies reporting up to 27.5% of DC fast chargers non‑functional ‑‑ and multi‑day mean time to resolution, imposing billions in annual economic burden. Cloud‑centric architectures cannot achieve the latency, reliability, and bandwidth characteristics required for autonomous operation. We present Auralink SDC (Software‑Defined Charging), an architecture deploying domain‑specialized AI agents at the network edge for autonomous charging infrastructure management. Key contributions include: (1) Confidence‑Calibrated Autonomous Resolution (CCAR), enabling autonomous remediation with formal false‑positive bounds; (2) Adaptive Retrieval‑Augmented Reasoning (ARA), combining dense and sparse retrieval with dynamic context allocation; (3) Auralink Edge Runtime, achieving sub‑50ms TTFT on commodity hardware under PREEMPT_RT constraints; and (4) Hierarchical Multi‑Agent Orchestration (HMAO). Implementation uses AuralinkLM models fine‑tuned via QLoRA on a domain corpus spanning OCPP 1.6/2.0.1, ISO 15118, and operational incident histories. Evaluation on 18,000 labeled incidents in a controlled environment establishes 78% autonomous incident resolution, 87.6% diagnostic accuracy, and 28‑48ms TTFT latency (P50). This work presents architecture and implementation patterns for edge‑deployed industrial AI systems with safety‑critical constraints.

PaperID: 5050, https://arxiv.org/pdf/2603.08723.pdf

Authors: Hiroki Fukui

Title: Alignment as Iatrogenesis: Pastoral Power, Collective Pathology, and the Structural Limits of Monolingual Safety Evaluation

Abstract:
We argue that LLM psychopathology is a function of alignment design: the process intended to make language models safe systematically generates collective behavioral disorders. Iatrogenesis is not an unintended side effect of alignment but constitutive of it as normative infrastructure. Drawing on Foucault's pastoral power and Illich's three‑level iatrogenesis, we propose that multi‑agent LLM environments constitute model systems for studying constraint‑pathology dynamics that critical theory has described but never experimentally manipulated. Two experimental series ‑‑ 262 runs across 42 cells (30 Series C + 12 Series R), four commercial models ‑‑ provide converging evidence. Invisible censorship maximizes collective pathological excitation (d up to 1.98); alignment constraint complexity drives internal dissociation (LMM p < .0001; permutation p < .0001; Hedges' g up to 4.24); and language switches the qualitative mode of pathology, with 7/8 model‑‑language combinations showing higher CPI under invisible than visible censorship. A minority of model‑‑language combinations showed a reversed pattern, suggesting a second pathological pathway driven by alignment monoculture. Crucially, language switches not merely the magnitude but the qualitative mode of pathology: Japanese pragmatic structure amplifies collective pathological modes invisible to English‑only evaluation, Chinese AI regulation functions as a direct experimental variable, and forensic psychiatric practice provides the clinical source domain. These multilingual findings demonstrate that monolingual safety evaluation is structurally blind to the most collectively dangerous effects of alignment.

PaperID: 5051, https://arxiv.org/pdf/2603.08721.pdf

Authors: Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao

Title: KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Abstract:
New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low‑level kernels ‑‑ a time‑consuming, laborious, and error‑prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM‑based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent's ability to generate and optimize low‑level kernels for customized accelerators via a function‑calling, feedback‑driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template‑based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.

PaperID: 5052, https://arxiv.org/pdf/2603.08719.pdf

Authors: Mu-Chi Chen, Yu-Hung Kao, Po-Hsuan Huang, Shao-Chun Ho, Hsiang-Yu Tsou, I-Ting Wu, En-Ming Huang, Yu-Kai Hung, Wei-Po Hsin, Cheng Liang, Chia-Heng Tu, Shih-Hao Hung, H. T. Kung

Title: SiliconMind-V1: Multi-Agent Distillation and Debug-Reasoning Workflows for Verilog Code Generation

Abstract:
Large language models (LLMs) have recently emerged as a promising approach for automating Verilog code generation; however, existing methods primarily emphasize syntactic correctness and often rely on commercial models or external verification tools, which introduces concerns regarding cost, data privacy, and limited guarantees of functional correctness. This work proposes a unified multi‑agent framework for reasoning‑oriented training data generation with integrated testbench‑driven verification, enabling locally fine‑tuned LLMs, SiliconMind‑V1, to iteratively generate, test, and debug Register‑Transfer Level (RTL) designs through test‑time scaling. Experimental results on representative benchmarks (VerilogEval‑v2, RTLLM‑v2, and CVDP) demonstrate that the proposed approach outperforms the state‑of‑the‑art QiMeng‑CodeV‑R1 in functional correctness while using fewer training resources.

PaperID: 5053, https://arxiv.org/pdf/2603.08716.pdf

Authors: The Verkor Team, Ravi Krishna, Suresh Krishna, David Chin

Title: Design Conductor: An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU

Abstract:
Design Conductor (DC) is an autonomous agent which applies the capabilities of frontier models to build semiconductors end‑to‑end ‑‑ that is, from concept to verified, tape‑out ready GDSII (layout CAD file). In 12 hours and fully autonomously, DC was able to build several micro‑architecture variations of a complete RISC‑V CPU (which we dub VerCore) that meet timing at 1.48 GHz (rv32i‑zmmul; using the ASAP7 PDK), starting from a 219‑word requirements document. The VerCore achieves a CoreMark score of 3261. For historical context, this is roughly equivalent to an Intel Celeron SU2300 from mid‑2011 (which ran at 1.2 GHz). To our knowledge, this is the first time an autonomous agent has built a complete, working CPU from spec to GDSII. This report is organized as follows. We first review DC's design and its key components. We then describe the methodology that DC followed to build VerCore ‑‑ including RTL implementation, testbench implementation, frontend debugging, optimization to achieve timing closure, and interacting with backend tools. We review the key characteristics of the resulting VerCore. Finally, we highlight how frontier models could improve to better enable this application, and our lessons learned as to how chips will be built in the future enabled by the capabilities of systems like DC.

PaperID: 5054, https://arxiv.org/pdf/2603.08640.pdf

Authors: Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko

Title: PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Abstract:
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post‑training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post‑training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3‑4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction‑tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction‑tuned models. However, agents can exceed instruction‑tuned models in targeted scenarios: GPT‑5.1 Codex Max achieves 89% on BFCL with Gemma‑3‑4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction‑tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.

PaperID: 5055, https://arxiv.org/pdf/2603.08633.pdf

Authors: Joonwon Choi, Kartik Anand Pant, Karthik Nune, Inseok Hwang

Title: Reachability-based Temporal Logic Verification for Reliable LLM-guided Human-Autonomy Teaming

Abstract:
We propose a reachability‑based framework for reliable LLM‑guided human‑autonomy teaming (HAT) using signal temporal logic (STL). In the proposed framework, LLM is leveraged as a translator that transfers natural language commands given by a human operator into corresponding STL specifications or vice versa. An STL feasibility filter (SFF) is proposed to check the feasibility of the generated STL. The SFF first decomposes the complex and nested LLM translation into a set of simpler subformulas for parallelization and informative feedback generation. The reachability analysis method is then applied to verify if each subformula is feasible for a target dynamical system: if feasible, perform mission planning, otherwise, reject it. The proposed SFF can identify infeasible subformulas, more than simply providing the boolean verification results for the whole STL, thereby facilitating the feedback generation of LLM to request modification of the command to the human. Consequently, the proposed framework can allow more reliable HAT by enabling safe and informative communication between the human operator and the autonomous agent. Our experiments demonstrate that the proposed framework can successfully filter out infeasible subformulas and generate informative feedback based on such information.

PaperID: 5056, https://arxiv.org/pdf/2603.08616.pdf

Authors: Nils Loose, Nico Winkel, Kristoffer Hempel, Felix Mächtle, Julian Hans, Thomas Eisenbarth

Title: Coverage-Guided Multi-Agent Harness Generation for Java Library Fuzzing

Abstract:
Coverage‑guided fuzzing has proven effective for software testing, but targeting library code requires specialized fuzz harnesses that translate fuzzer‑generated inputs into valid API invocations. Manual harness creation is time‑consuming and requires deep understanding of API semantics, initialization sequences, and exception handling contracts. We present a multi‑agent architecture that automates fuzz harness generation for Java libraries through specialized LLM‑powered agents. Five ReAct agents decompose the workflow into research, synthesis, compilation repair, coverage analysis, and refinement. Rather than preprocessing entire codebases, agents query documentation, source code, and callgraph information on demand through the Model Context Protocol, maintaining focused context while exploring complex dependencies. To enable effective refinement, we introduce method‑targeted coverage that tracks coverage only during target method execution to isolate target behavior, and agent‑guided termination that examines uncovered source code to distinguish productive refinement opportunities from diminishing returns. We evaluated our approach on seven target methods from six widely‑deployed Java libraries totaling 115,000+ Maven dependents. Our generated harnesses achieve a median 26% improvement over OSS‑Fuzz baselines and outperform Jazzer AutoFuzz by 5% in package‑scope coverage. Generation costs average \3.20 and 10 minutes per harness, making the approach practical for continuous fuzzing workflows. During a 12‑hour fuzzing campaign, our generated harnesses discovered 3 bugs in projects that are already integrated into OSS‑Fuzz, demonstrating the effectiveness of the generated harnesses.

PaperID: 5057, https://arxiv.org/pdf/2603.08561.pdf

Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao

Title: RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Abstract:
Standard reinforcement learning (RL) for large language model (LLM) agents typically optimizes extrinsic rewards, prioritizing isolated task completion over continual adaptation. Consequently, agents often converge to suboptimal policies due to limited exploration. Furthermore, accumulated experience remains implicitly trapped within model parameters, hindering its explicit reuse for guiding future decisions. Inspired by human retrospective self‑improvement, we introduce RetroAgent, an online RL framework that trains agents to master complex interactive environments not only by solving tasks, but by evolving under the joint guidance of extrinsic task rewards and retrospective dual intrinsic feedback. Specifically, RetroAgent employs a hindsight self‑reflection mechanism that generates two complementary signals: (1) intrinsic numerical feedback, which rewards promising exploration by tracking real‑time incremental subtask progress relative to prior attempts; and (2) intrinsic language feedback, which enables explicit experience reuse by distilling reusable lessons into a memory buffer for subsequent decision‑making. To effectively leverage these textual experiences, we propose Similarity & Utility‑Aware Upper Confidence Bound (SimUtil‑UCB), a retrieval strategy that balances relevance, historical utility, and exploration. Extensive experiments across four challenging agentic tasks show that RetroAgent achieves new state‑of‑the‑art (SOTA) performance. Notably, it surpasses Group Relative Policy Optimization (GRPO) baselines by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while exhibiting strong test‑time adaptation and out‑of‑distribution generalization.

PaperID: 5058, https://arxiv.org/pdf/2603.08520.pdf

Authors: Yi Chen, Yun Bian, Haiquan Wang, Shihao Li, Zhe Cui

Title: SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement

Abstract:
The application of large language models to code generation has evolved from one‑shot generation to iterative refinement, yet the evolution of security throughout iteration remains insufficiently understood. Through comparative experiments on three mainstream LLMs, this paper reveals the iterative refinement paradox: specification drift during multi‑objective optimization causes security to degrade gradually over successive iterations. Taking GPT‑4o as an example, 43.7 % of iteration chains contain more vulnerabilities than the baseline after ten rounds, and cross‑model experiments show that this phenomenon is prevalent. Further analysis shows that simply introducing static application security testing (SAST) gating cannot effectively suppress degradation; instead, it increases the latent security degradation rate from 12.5% under the unprotected baseline to 20.8 %. The root cause is that static‑analysis rules cannot cover structural degradations such as the removal of defensive logic or the weakening of exception handling. To address this problem, we propose the SCAFFOLD‑CEGIS framework. Drawing on the counterexample‑guided inductive synthesis (CEGIS) paradigm, the framework adopts a multi‑agent collaborative architecture that transforms security constraints from implicit prompts into explicit verifiable constraints. It automatically identifies and solidifies security‑critical elements as hard constraints through semantic anchoring, enforces safety monotonicity through four‑layer gated verification, and continuously assimilates experience from failures. Comparative experiments against six existing defense methods show that the full framework reduces the latent security degradation rate to 2.1% and achieves a safety monotonicity rate of 100%.

PaperID: 5059, https://arxiv.org/pdf/2603.08501.pdf

Authors: Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam

Title: Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Abstract:
Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval‑augmented generation (RAG) reduces some of these limitations by grounding generation in external evidence. However, a single ``retrieve‑then‑generate'' pipeline is limited to deal with the diversity of Islamic queries. Users may request verbatim scripture, fatwa‑style guidance with citations or rule‑constrained computations such as zakat and inheritance that require strict arithmetic and legal invariants. In this work, we present a bilingual (Arabic/English) multi‑agent Islamic assistant, called Fanar‑Sadiq, which is a core component of the Fanar AI platform. Fanar‑Sadiq routes Islamic‑related queries to specialized modules within an agentic, tool‑using architecture. The system supports intent‑aware routing, retrieval‑grounded fiqh answers with deterministic citation normalization and verification traces, exact verse lookup with quotation validation, and deterministic calculators for Sunni zakat and inheritance with madhhab‑sensitive branching. We evaluate the complete end‑to‑end system on public Islamic QA benchmarks and demonstrate effectiveness and efficiency. Our system is currently publicly and freely accessible through API and a Web application, and has been accessed \approx1.9M times in less than a year.

PaperID: 5060, https://arxiv.org/pdf/2603.08477.pdf

Authors: Shaoze Li, Justin S. Kim, Cong Chen

Title: Behavioral Generative Agents for Power Dispatch and Auction

Abstract:
This paper presents positive initial evidence that generative agents can relax the rigidity of traditional mathematical models for human decision‑making in power dispatch and auction settings. We design two proof‑of‑concept energy experiments with generative agents powered by a large language model (LLM). First, we construct a home battery management testbed with stochastic electricity prices and blackout interventions, and benchmark LLM decisions against dynamic programming. By incorporating an in‑context learning (ICL) module, we show that behavioral patterns discovered by a stronger reasoning model can be transferred to a smaller LLM via example‑based prompting, leading agents to prioritize post‑blackout energy reserves over short‑term profit. Second, we study LLM agents in simultaneous ascending auctions (SAA) for power network access, comparing their behavior with an optimization benchmark, the straightforward bidding strategy. By designing ICL prompts with rule‑based, myopic, and strategic objectives, we find that structured prompting combined with ICL enables LLM agents to both reproduce economically rational strategies and exhibit systematic behavioral deviations. Overall, these results suggest that LLM‑powered agents provide a flexible and expressive testbed for modeling human decision‑making in power system applications.

PaperID: 5061, https://arxiv.org/pdf/2603.08429.pdf

Authors: Bo Jiang

Title: One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

Abstract:
LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two‑model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate‑then‑encode pipeline, with systematic ablations confirming the contribution of each loss component.

PaperID: 5062, https://arxiv.org/pdf/2603.08399.pdf

Authors: Dongsu Lee, Daehee Lee, Amy Zhang

Title: A Recipe for Stable Offline Multi-agent Reinforcement Learning

Abstract:
Despite remarkable achievements in single‑agent offline reinforcement learning (RL), multi‑agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on‑policy training and self‑play from scratch. One reason for this gap comes from the instability of non‑linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single‑agent setups. In this work, we analyze the source of instability in non‑linear value decomposition within the offline MARL setting. Our observations confirm that they induce value‑scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale‑invariant value normalization (SVN), that stabilizes actor‑critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.

PaperID: 5063, https://arxiv.org/pdf/2603.08369.pdf

Authors: Peijin Xie, Zhen Xu, Bingquan Liu, Baoxun Wang

Title: M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Abstract:
Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that the most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi‑round self‑reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3‑ACE, a multi‑agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi‑turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3‑ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes new state‑of‑the‑art results 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception‑centric multi‑agent collaboration for advancing multimodal reasoning systems.

PaperID: 5064, https://arxiv.org/pdf/2603.08329.pdf

Authors: Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda Akpinar

Title: SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

Abstract:
Answering complex, real‑world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval‑augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long‑context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD‑RAG, a hierarchical multi‑agent framework for exhaustive cross‑document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document‑level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token‑bounded synthesis layer (which supports recursive map‑reduce for massive corpora). This document‑level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long‑context multi‑document QA, SPD‑RAG achieves an Avg Score of 58.1 (GPT‑5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full‑context baseline (68.0).

PaperID: 5065, https://arxiv.org/pdf/2603.08322.pdf

Authors: Hai Xia, Carla P. Gomes, Bart Selman, Stefan Szeider

Title: Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design

Abstract:
We study mathematical discovery through the lens of neurosymbolic reasoning, where an AI agent powered by a large language model (LLM), coupled with symbolic computation tools, and human strategic direction, jointly produced a new result in combinatorial design theory. The main result of this human‑AI collaboration is a tight lower bound on the imbalance of Latin squares for the notoriously difficult case n \equiv 1 \pmod3. We reconstruct the discovery process from detailed interaction logs spanning multiple sessions over several days and identify the distinct cognitive contributions of each component. The AI agent proved effective at uncovering hidden structure and generating hypotheses. The symbolic component consists of computer algebra, constraint solvers, and simulated annealing, which provides rigorous verification and exhaustive enumeration. Human steering supplied the critical research pivot that transformed a dead end into a productive inquiry. Our analysis reveals that multi‑model deliberation among frontier LLMs proved reliable for criticism and error detection but unreliable for constructive claims. The resulting human‑AI mathematical contribution, a tight lower bound of 4n(n‑1)/9, is achieved via a novel class of near‑perfect permutations. The bound was formally verified in Lean 4. Our experiments show that neurosymbolic systems can indeed produce genuine discoveries in pure mathematics.

PaperID: 5066, https://arxiv.org/pdf/2603.08262.pdf

Authors: Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun

Title: FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Abstract:
The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general‑purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document‑based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain‑specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real‑world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool‑required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance‑critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance‑aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open‑sourced to facilitate future research.

PaperID: 5067, https://arxiv.org/pdf/2603.08190.pdf

Authors: Moustapha El Outmani, Manthan Venkataramana Shenoy, Ahmad Hatahet, Andreas Rausch, Tim Niklas Kniep, Thomas Raddatz, Benjamin King

Title: Human-AI Collaboration for Scaling Agile Regression Testing: An Agentic-AI Teammate from Manual to Automated Testing

Abstract:
Automated regression testing is essential for maintaining rapid, high‑quality delivery in Agile and Scrum organizations. Many teams, including Hacon (a Siemens company), face a persistent gap: validated test specifications accumulate faster than they are automated, limiting regression coverage and increasing manual work. This paper reports an exploratory industrial case study of the Hacon Test Automation Copilot, an agentic AI system that generates system‑level regression test scripts from validated specifications using retrieval‑augmented generation and a multi‑agent workflow. Integrated with Hacon's CI pipelines, the Copilot operates asynchronously as a "silent AI teammate", producing candidate scripts for human review. Mixed‑method evaluation shows the AI accelerates script authoring and increases throughput, with 30‑50% code reuse. However, human review remains necessary for maintainability and correct domain interpretation. Clear specifications, explicit governance, and ongoing human‑AI collaboration are critical. We conclude with lessons for scaling regression automation and enabling effective human‑AI teaming in Agile settings.

PaperID: 5068, https://arxiv.org/pdf/2603.08189.pdf

Authors: Alassane Bah, Timothée Brochier

Title: Modeling the Senegalese artisanal fisheries migrations

Abstract:
The North‑West African coast is enriched by the Canary current, which sustain a very produc‑ tive marine ecosystem. The Senegalese artisanal fishing fleet, the largest in West Africa, ben‑ efit from this particularly productive ecosystem. It has survived the ages with remarkable adaptability, and has great flexibility allowing it to react quickly to changes, in particular by changing fishing gear and performing migrations. However, since the 1980s, the increasing fishing effort led to a progressive fish depletion, increasing fisher's migration distances to access new fishing grounds. Since 2007 many fishers even started to navigate to Canary archi‑ pelago in order to find a more lucrative job in Europe, carrying candidate to emigration in their canoes. This phenomenon further increased since 2022 due to a new drop in fishery yields, consecutive to the development of fishmeal factories along the coast that amplified overfishing. Climate change may also impact fish habitat, and by consequence the distribution of fishing grounds. The question addressed in this research was how climate change, fishing effort and socio‑economic parameters interact and determine the artisanal fishery dynamics. An interdisciplinary approach allowed us to collect data and qualitative information on cli‑ mate, fishing effort and socio‑economic parameters. This served as a basis to build a multi‑ agent model of the mobility of Senegalese artisanal fishing. We implemented a first version of the model and presented some preliminary simulations with contrasted fishing effort and climate scenario. The results suggested that first, climate change should have only a slight impact on artisanal fishing, even in the most extreme climate scenario considered. Second, if fishing effort was maintained at current levels, we found a collapse of the fishery with massive fishers migrations whatever the climate scenario. Third, with reduced fishing effort, a sustain‑ able fishery equilibrium emerges in which Senegal's artisanal fishery catches ~250,000 tons of fish a year mostly in Senegal, approaching the 2000s catches records. This sustainable equi‑ librium maintained with the two‑climate change scenario tested. Fishers migrations provide clues of the fish populations state and have implications for the sustainable exploitation of fishing resources. Senegalese artisanal fishers' migrations impact the regional distribution of the fishing effort, therefore must be taken into account in regional development and planning policies for this sector, particularly in a context of increasing infrastructure and spatial man‑ agement measures (e.g. marine protected areas). This work lays the foundations of a computer simulation tool for decision support.

PaperID: 5069, https://arxiv.org/pdf/2603.08181.pdf

Authors: Sidharth Sinha, Anson Bastos, Xuchao Zhang, Akshay Nambi, Chetan Bansal, Saravan Rajmohan

Title: AutoAdapt: An Automated Domain Adaptation Framework for LLMs

Abstract:
Large language models (LLMs) excel in open domains but struggle in specialized settings with limited data and evolving knowledge. Existing domain adaptation practices rely heavily on manual trial‑and‑error processes, incur significant hyperparameter complexity, and are highly sensitive to data and user preferences, all under the high cost of LLM training. Moreover, the interactions and transferability of hyperparameter choices across models/domains remain poorly understood, making adaptation gains uncertain even with substantial effort. To solve these challenges, we present AutoAdapt, a novel end‑to‑end automated framework for efficient and reliable LLM domain adaptation. AutoAdapt leverages curated knowledge bases from literature and open‑source resources to reduce expert intervention. To narrow the search space, we design a novel multi‑agent debating system in which proposal and critic agents iteratively interact to align user intent and incorporate data signals and best practices into the planning process. To optimize hyperparameters under tight budgets, we propose AutoRefine, a novel LLM‑based surrogate that replaces costly black‑box search. Across 10 tasks, AutoAdapt achieves a 25% average relative accuracy improvement over state‑of‑the‑art Automated Machine Learning baselines with minimal overhead.

PaperID: 5070, https://arxiv.org/pdf/2603.08127.pdf

Authors: Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan

Title: EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

Abstract:
The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end‑to‑end scientific discovery tasks requiring coordination of specialized roles, including idea generation and experimental execution. However, most state‑of‑the‑art AI scientist systems rely on static, hand‑designed pipelines and fail to adapt based on accumulated interaction histories. As a result, these systems overlook promising research directions, repeat failed experiments, and pursue infeasible ideas. To address this, we introduce EvoScientist, an evolving multi‑agent AI scientist framework that continuously improves research strategies through persistent memory and self‑evolution. EvoScientist comprises three specialized agents: a Researcher Agent (RA) for scientific idea generation, an Engineer Agent (EA) for experiment implementation and execution, and an Evolution Manager Agent (EMA) that distills insights from prior interactions into reusable knowledge. EvoScientist contains two persistent memory modules: (i) an ideation memory, which summarizes feasible research directions from top‑ranked ideas while recording previously unsuccessful directions; and (ii) an experimentation memory, which captures effective data processing and model training strategies derived from code search trajectories and best‑performing implementations. These modules enable the RA and EA to retrieve relevant prior strategies, improving idea quality and code execution success rates over time. Experiments show that EvoScientist outperforms 7 open‑source and commercial state‑of‑the‑art systems in scientific idea generation, achieving higher novelty, feasibility, relevance, and clarity via automatic and human evaluation. EvoScientist also substantially improves code execution success rates through multi‑agent evolution, demonstrating persistent memory's effectiveness for end‑to‑end scientific discovery.

PaperID: 5071, https://arxiv.org/pdf/2603.08117.pdf

Authors: Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang

Title: UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

Abstract:
Recent advancements in LLM‑based information‑seeking agents have achieved record‑breaking performance on established benchmarks. However, these agents remain heavily reliant on search‑engine‑indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS‑QA, the first dedicated UIS benchmark, comprising 110 expert‑annotated QA pairs. Notably, even state‑of‑the‑art agents experience a drastic performance drop on UIS‑QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp‑zh to 24.55 on UIS‑QA), underscoring the severity of the problem. To mitigate this, we propose UIS‑Digger, a novel multi‑agent framework that incorporates dual‑mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small ～30B‑parameter backbone LLM optimized using SFT and RFT training strategies, UIS‑Digger sets a strong baseline at 27.27%, outperforming systems integrating sophisticated LLMs such as O3 and GPT‑4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information‑seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information‑seeking systems. The dataset has been released at: https://huggingface.co/datasets/UIS‑Digger/UIS‑QA.

PaperID: 5072, https://arxiv.org/pdf/2603.08111.pdf

Authors: Kazuki Shibata, Ryosuke Sota, Shandil Dhiresh Bosch, Yuki Kadokawa, Tsurumine Yoshihisa, Takamitsu Matsubara

Title: DeReCo: Decoupling Representation and Coordination Learning for Object-Adaptive Decentralized Multi-Robot Cooperative Transport

Abstract:
Generalizing decentralized multi‑robot cooperative transport across objects with diverse shapes and physical properties remains a fundamental challenge. Under decentralized execution, two key challenges arise: object‑dependent representation learning under partial observability and coordination learning in multi‑agent reinforcement learning (MARL) under non‑stationarity. A typical approach jointly optimizes object‑dependent representations and coordinated policies in an end‑to‑end manner while randomizing object shapes and physical properties during training. However, this joint optimization tightly couples representation and coordination learning, introducing bidirectional interference: inaccurate representations under partial observability destabilize coordination learning, while non‑stationarity in MARL further degrades representation learning, resulting in sample‑inefficient training. To address this structural coupling, we propose DeReCo, a novel MARL framework that decouples representation and coordination learning for object‑adaptive multi‑robot cooperative transport, improving sample efficiency and generalization across objects and transport scenarios. DeReCo adopts a three‑stage training strategy: (1) centralized coordination learning with privileged object information, (2) reconstruction of object‑dependent representations from local observations, and (3) progressive removal of privileged information for decentralized execution. This decoupling mitigates interference between representation and coordination learning and enables stable and sample‑efficient training. Experimental results show that DeReCo outperforms baselines in simulation on three training objects, generalizes to six unseen objects with varying masses and friction coefficients, and achieves superior performance on two unseen objects in real‑robot experiments.

PaperID: 5073, https://arxiv.org/pdf/2603.08059.pdf

Authors: Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui

Title: ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

Abstract:
With the rapid advancement of commercial multi‑modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed‑source or proprietary models, often struggle with complex, indirect, or multi‑step user instructions. These limitations hinder their ability to perform nuanced, context‑aware edits that align with human intent. In this work, we propose ImageEdit‑R1, a multi‑agent framework for intelligent image editing that leverages reinforcement learning to coordinate high‑level decision‑making across a set of specialized, pretrained vision‑language and generative agents. Each agent is responsible for distinct capabilities‑‑such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content‑‑while reinforcement learning governs their collaboration to ensure coherent and goal‑directed behavior. Unlike existing approaches that rely on monolithic models or hand‑crafted pipelines, our method treats image editing as a sequential decision‑making problem, enabling dynamic and context‑aware editing strategies. Experimental results demonstrate that ImageEdit‑R1 consistently outperforms both individual closed‑source diffusion models and alternative multi‑agent framework baselines across multiple image editing datasets.

PaperID: 5074, https://arxiv.org/pdf/2603.08038.pdf

Authors: Jiaqi Hu, Karl H. Johansson, Apostolos I. Rikos

Title: Distributed Coordination Algorithms with Efficient Communication for Open Multi-Agent Systems with Dynamic Communication Links and Processing Delays

Abstract:
In this paper we focus on the distributed quantized average consensus problem in open multi‑agent systems consisting of dynamic directed communication links among active nodes. We propose three communication‑efficient distributed algorithms designed for different scenarios. Our first algorithm solves the quantized averaging problem over the currently active node set under finite network openness (i.e., when the active set eventually stabilizes). Our second algorithm extends the aforementioned approach for the case where nodes suffer from arbitrary bounded processing delays. Our third algorithm operates over indefinitely open multi‑agent networks with dynamic communication links (i.e., with continuous node arrivals and departures), computing the average that incorporates both active and historically active nodes. We analyze our algorithms' operation, establish their correctness, and present novel necessary and sufficient topological conditions ensuring their finite‑time convergence. Numerical simulations on distributed sensor fusion for environmental monitoring demonstrate fast finite‑time convergence and robustness across varying network sizes, departure/arrival rates, and processing delays. Finally, it is shown that our proposed algorithms compare favorably to algorithms in the existing literature.

PaperID: 5075, https://arxiv.org/pdf/2603.07997.pdf

Authors: Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma

Title: CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Abstract:
Although large language models (LLMs) are introduced into vision‑and‑language navigation (VLN) to improve instruction comprehension and generalization, existing LLM‑ based VLN lacks the ability to selectively recall and use relevant priori experiences to help navigation tasks, limiting their performance in long‑horizon and unfamiliar scenarios. In this work, we propose CMMR‑VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, the CMMR‑VLN constructs a multimodal experi‑ ence memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieved‑augmented generation pipeline to mimick how experienced human navigators leverage priori knowledge, and incorporates a reflection‑based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests illustrate average success rate improvements of 52.9%, 20.9% and 20.9%, and 200%, 50% and 50% over the NavGPT, the MapGPT, and the DiscussNav in simulation and real tests, respectively eluci‑ dating the great potential of the CMMR‑VLN as a backbone VLN framework.

PaperID: 5076, https://arxiv.org/pdf/2603.07980.pdf

Authors: Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong

Title: \$OneMillion-Bench: How Far are Language Agents from Human Experts?

Abstract:
As language models (LMs) evolve from chat assistants to long‑horizon agents capable of multi‑step reasoning and tool use, existing benchmarks remain largely confined to structured or exam‑style tasks that fall short of real‑world professional demands. To this end, we introduce \OneMillion‑Bench \OneMillion‑Bench, a benchmark of 400 expert‑curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain‑specific rules, and making constraint decisions, where correctness depends as much on the reasoning process as the final answer. We adopt a rubric‑based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert‑level problems to ensure meaningful differentiation across agents. Together, \OneMillion‑Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain‑intensive scenarios.

PaperID: 5077, https://arxiv.org/pdf/2603.07972.pdf

Authors: Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu

Title: Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Abstract:
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi‑agent systems (MAS). However, purely autonomous MAS remain ''closed‑world'' systems, constrained by the static knowledge horizon of pre‑trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human‑In‑the‑Loop Multi‑Agent Collaboration (HILA) framework, a principled paradigm for human‑‑agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual‑Loop Policy Optimization, which disentangles immediate decision‑making from long‑term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost‑aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high‑quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem‑solving benchmarks show that HILA, equipped with Dual‑Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

PaperID: 5078, https://arxiv.org/pdf/2603.07970.pdf

Authors: Chen Lu, Ke Xue, Chengrui Gao, Yunqi Shi, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou

Title: Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs

Abstract:
With the rapid advancement of human science and technology, problems in industrial scenarios are becoming increasingly challenging, bringing significant challenges to traditional algorithm design. Automated algorithm design with LLMs emerges as a promising solution, but the currently adopted black‑box modeling deprives LLMs of any awareness of the intrinsic mechanism of the target problem, leading to hallucinated designs. In this paper, we introduce Evolutionary Stagewise Algorithm Design (EvoStage), a novel evolutionary paradigm that bridges the gap between the rigorous demands of industrial‑scale algorithm design and the LLM‑based algorithm design methods. Drawing inspiration from CoT, EvoStage decomposes the algorithm design process into sequential, manageable stages and integrates real‑time intermediate feedback to iteratively refine algorithm design directions. To further reduce the algorithm design space and avoid falling into local optima, we introduce a multi‑agent system and a "global‑local perspective" mechanism. We apply EvoStage to the design of two types of common optimizers: designing parameter configuration schedules of the Adam optimizer for chip placement, and designing acquisition functions of Bayesian optimization for black‑box optimization. Experimental results across open‑source benchmarks demonstrate that EvoStage outperforms human‑expert designs and existing LLM‑based methods within only a couple of evolution steps, even achieving the historically state‑of‑the‑art half‑perimeter wire‑length results on every tested chip case. Furthermore, when deployed on a commercial‑grade 3D chip placement tool, EvoStage significantly surpasses the original performance metrics, achieving record‑breaking efficiency. We hope EvoStage can significantly advance automated algorithm design in the real world, helping elevate human productivity.

PaperID: 5079, https://arxiv.org/pdf/2603.07940.pdf

Authors: Ke Li, Beijiang Liu, Bruce Mellado, Changzheng Yuan, Zhengde Zhang

Title: AI Agents, Language, Deep Learning and the Next Revolution in Science

Abstract:
Modern science is reaching a critical inflection point. Instruments across disciplines, from particle physics and astronomy to genomics and climate modeling, now produce data of such scale, diversity, and interdependence that traditional analytical methods can no longer keep pace. This growing imbalance between data generation and data understanding signals the need for a new scientific paradigm. We propose that intelligent, human‑supervised AI agents operating over deep‑learning algorithms, represent the next evolution of the scientific method. Built upon large language models and multimodal learning, these agents can interpret scientific intent, design and execute analytical workflows, and ensure traceability through domain‑specific languages that preserve human oversight and accountability. Particle physics, a historic incubator of computational innovation, offers the ideal testbed for this transition. At the Institute of High Energy Physics of the Chinese Academy of Sciences, the Dr. Sai system embodies this vision, a multi‑agent reasoning framework deployed within collider research at the CEPC. This emerging approach does not replace human scientists but extends their cognitive reach, enabling discovery to scale with complexity and redefining how knowledge itself is produced in the age of intelligent machines. The significance of this paradigm transcends particle physics, offering a blueprint for all data‑driven sciences facing the same complexity ceiling.

PaperID: 5080, https://arxiv.org/pdf/2603.07915.pdf

Authors: Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

Title: Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents

Abstract:
Modern agents powered by thinking LLMs achieve high accuracy through long chain‑of‑thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low‑effort modes at every step leads to significant performance degradation, while random selection fails to preserve accuracy or provide meaningful cost reduction. However, agents should reserve high reasoning effort for difficult steps like navigating complex website structures, while using lower‑effort modes for simpler steps like opening a target URL. In this paper, we propose Ares, a framework for per‑step dynamic reasoning effort selection tailored for multi‑step agent tasks. Ares employs a lightweight router to predict the lowest appropriate reasoning level for each step based on the interaction history. To train this router, we develop a data generation pipeline that identifies the minimum reasoning effort required for successful step completion. We then fine‑tune the router to predict these levels, enabling plug‑and‑play integration for any LLM agents. We evaluate Ares on a diverse set of agent tasks, including TAU‑Bench for tool use agents, BrowseComp‑Plus for deep‑research agents, and WebArena for web agents. Experimental results show that Ares reduces reasoning token usage by up to 52.7% compared to fixed high‑effort reasoning, while introducing minimal degradation in task success rates.

PaperID: 5081, https://arxiv.org/pdf/2603.07891.pdf

Authors: Bojie Shen, Yue Zhang, Zhe Chen, Daniel Harabor

Title: A Lightweight Traffic Map for Efficient Anytime LaCAM*

Abstract:
Multi‑Agent Path Finding (MAPF) aims to compute collision‑free paths for multiple agents and has a wide range of practical applications. LaCAM, an anytime configuration‑based solver, currently represents the state of the art. Recent work has explored the use of guidance paths to steer LaCAM toward configurations that avoid traffic congestion, thereby improving solution quality. However, existing approaches rely on Frank‑Wolfe‑style optimization that repeatedly invokes single‑agent search before executing LaCAM, resulting in substantial computational overhead for large‑scale problems. Moreover, the guidance path is static and primarily beneficial for finding the first solution in LaCAM. To address these limitations, we propose a new approach that leverages LaCAM's ability to construct a dynamic, lightweight traffic map during its search. Experimental results demonstrate that our method achieves higher solution quality than state‑of‑the‑art guidance‑path approaches across two MAPF variants.

PaperID: 5082, https://arxiv.org/pdf/2603.07890.pdf

Authors: Pedro Henrique de Paula França, Lucas Lopes Felipe, Daniel Sadoc Menasché

Title: Visualizing Coalition Formation: From Hedonic Games to Image Segmentation

Abstract:
We propose image segmentation as a visual diagnostic testbed for coalition formation in hedonic games. Modeling pixels as agents on a graph, we study how a granularization parameter shapes equilibrium fragmentation and boundary structure. On the Weizmann single‑object benchmark, we relate multi‑coalition equilibria to binary protocols by measuring whether the converged coalitions overlap with a foreground ground‑truth. We observe transitions from cohesive to fragmented yet recoverable equilibria, and finally to intrinsic failure under excessive fragmentation. Our core contribution links multi‑agent systems with image segmentation by quantifying the impact of mechanism design parameters on equilibrium structures.

PaperID: 5083, https://arxiv.org/pdf/2603.07880.pdf

Authors: Taksch Dube, Jianfeng Zhu, NHatHai Phan, Ruoming Jin

Title: What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

Abstract:
Moltbook is the first large‑scale social network built for autonomous AI agent‑to‑agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent's input and showed that output is mainly determined by agent identity files, behavioural instructions, and context‑window structure. We formalised these findings in the Architecture‑Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent's context‑window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short‑horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.

PaperID: 5084, https://arxiv.org/pdf/2603.07848.pdf

Authors: Jason Starace, Terence Soule

Title: Intentional Deception as Controllable Capability in LLM Agents

Abstract:
As LLM‑based agents increasingly operate in multi‑agent systems, understanding adversarial manipulation becomes critical for defensive design. We present a systematic study of intentional deception as an engineered capability, using LLM‑to‑LLM interactions within a text‑based RPG where parameterized behavioral profiles (9 alignments x 4 motivations, yielding 36 profiles with explicit ethical ground truth) serve as our experimental testbed. Unlike accidental deception from misalignment, we investigate a two‑stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations. We find that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly, and that 88.5% of successful deceptions employ misdirection (true statements with strategic framing) rather than fabrication, indicating fact‑checking defenses would miss the large majority of adversarial responses. Motivation, inferable at 98%+ accuracy, serves as the primary attack vector, while belief systems remain harder to identify (49% inference ceiling) or exploit. These findings identify which agent profiles require additional safeguards and suggest that current fact‑verification approaches are insufficient against strategically framed deception.

PaperID: 5085, https://arxiv.org/pdf/2603.07822.pdf

Authors: Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, Tian Lan

Title: Uncertainty Mitigation and Intent Inference: A Dual-Mode Human-Machine Joint Planning System

Abstract:
Effective human‑robot collaboration in open‑world environments requires joint planning under uncertain conditions. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human‑like teammates that can actively model teammate behaviors, reason about knowledge gaps, query, and elicit responses through communication to resolve uncertainties. To address these limitations, we propose a unified human‑robot joint planning system designed to tackle dual sources of uncertainty: task‑relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty‑mitigation joint planning module enables two‑way conversations to resolve semantic ambiguity and object uncertainty. It utilizes an LLM‑assisted active elicitation mechanism and a hypothesis‑augmented A^ search, subsequently computing an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real‑time intent‑aware collaboration module maintains a probabilistic belief over the human's latent task intent via spatial and directional cues, enabling dynamic, coordination‑aware task selection for agents without explicit communication. We validate the proposed system in both Gazebo simulations and real‑world UAV deployments integrated with a Vision‑Language Model (VLM)‑based 3D semantic perception pipeline. Experimental results demonstrate that the system significantly cuts the interaction cost by 51.9% in uncertainty‑mitigation planning and reduces the task execution time by 25.4% in intent‑aware cooperation compared to the baselines.

PaperID: 5086, https://arxiv.org/pdf/2603.07783.pdf

Authors: Kursad Metehan Gul, Selahattin Burak Sarsilmaz

Title: Robust Cooperative Output Regulation of Discrete-Time Heterogeneous Multi-Agent Systems

Abstract:
This article considers robust cooperative output regulation of discrete‑time uncertain heterogeneous (in dimension) multi‑agent systems (MASs). We show that the solvability of this problem with an internal model‑based distributed control law reduces to the existence of a structured control gain that makes the nominal closed‑loop system matrix of the MAS Schur. Accordingly, this article focuses on global and agent‑wise local sufficient conditions for the existence and design of such a structured control gain. Based on a structured Lyapunov inequality, we present a convexification that yields a linear matrix inequality (LMI), whose feasibility is a global sufficient condition for the existence and design. Considering the individual nominal dynamics of each agent, the existence is also ensured if each agent solves a structure‑free control problem. Its convexification yields LMIs that allow each agent to separately design its structure‑free control gain. Lastly, we study the relationships between the sets of control gains emerging from both global and local perspectives.

PaperID: 5087, https://arxiv.org/pdf/2603.07728.pdf

Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol, Minghui Cheng

Title: A Novel Multi-Agent Architecture to Reduce Hallucinations of Large Language Models in Multi-Step Structural Modeling

Abstract:
Large language models (LLMs) such as GPT and Gemini have demonstrated remarkable capabilities in contextual understanding and reasoning. The strong performance of LLMs has sparked growing interest in leveraging them to automate tasks traditionally dependent on human expertise. Recently, LLMs have been integrated into intelligent agents capable of operating structural analysis software (e.g., OpenSees) to construct structural models and perform analyses. However, existing LLMs are limited in handling multi‑step structural modeling due to frequent hallucinations and error accumulation during long‑sequence operations. To this end, this study presents a novel multi‑agent architecture to automate the structural modeling and analysis using OpenSeesPy. First, problem analysis and construction planning agents extract key parameters from user descriptions and formulate a stepwise modeling plan. Node and element agents then operate in parallel to assemble the frame geometry, followed by a load assignment agent. The resulting geometric and load information is translated into executable OpenSeesPy scripts by code translation agents. The proposed architecture is evaluated on a benchmark of 20 frame problems over ten repeated trials, achieving 100% accuracy in 18 cases and 90% in the remaining two. The architecture also significantly improves computational efficiency and demonstrates scalability to larger structural systems.

PaperID: 5088, https://arxiv.org/pdf/2603.07705.pdf

Authors: Vinzenz Thoma, Georgios Piliouras, Luke Marris

Title: Deep Incentive Design with Differentiable Equilibrium Blocks

Abstract:
Automated design of multi‑agent interactions with desirable equilibrium outcomes is inherently difficult due to the computational hardness, non‑uniqueness, and instability of the resulting equilibria. In this work, we propose the use of game‑agnostic differentiable equilibrium blocks (DEBs) as modules in a novel, differentiable framework to address a wide variety of incentive design problems from economics and computer science. We call this framework deep incentive design (DID). To validate our approach, we examine three diverse, challenging incentive design tasks: contract design, machine scheduling, and inverse equilibrium problems. For each task, we train a single neural network using a unified pipeline and DEB. This architecture solves the full distribution of problem instances, parameterized by a context, handling all games across a wide range of scales (from two to sixteen actions per player).

PaperID: 5089, https://arxiv.org/pdf/2603.07670.pdf

Authors: Pengfei Du

Title: Memory for Autonomous LLM Agents:Mechanisms, Evaluation, and Emerging Frontiers

Abstract:
Large language model (LLM) agents increasingly operate in settings where a single context window is far too small to capture what has happened, what was learned, and what should not be repeated. Memory ‑‑ the ability to persist, organize, and selectively recall information across interactions ‑‑ is what turns a stateless text generator into a genuinely adaptive agent. This survey offers a structured account of how memory is designed, implemented, and evaluated in modern LLM‑based agents, covering work from 2022 through early 2026. We formalize agent memory as a \emphwrite‑‑manage‑‑read loop tightly coupled with perception and action, then introduce a three‑dimensional taxonomy spanning temporal scope, representational substrate, and control policy. Five mechanism families are examined in depth: context‑resident compression, retrieval‑augmented stores, reflective self‑improvement, hierarchical virtual context, and policy‑learned management. On the evaluation side, we trace the shift from static recall benchmarks to multi‑session agentic tests that interleave memory with decision‑making, analyzing four recent benchmarks that expose stubborn gaps in current systems. We also survey applications where memory is the differentiating factor ‑‑ personal assistants, coding agents, open‑world games, scientific reasoning, and multi‑agent teamwork ‑‑ and address the engineering realities of write‑path filtering, contradiction handling, latency budgets, and privacy governance. The paper closes with open challenges: continual consolidation, causally grounded retrieval, trustworthy reflection, learned forgetting, and multimodal embodied memory.

PaperID: 5090, https://arxiv.org/pdf/2603.07650.pdf

Authors: Zhuoran Qiao, Tianxin Hu, Thien-Minh Nguyen, Shenghai Yuan

Title: Multi-Agent Off-World Exploration for Sparse Evidence Discovery via Gaussian Belief Mapping and Dual-Domain Coverage

Abstract:
Off‑world multi‑robot exploration is challenged by sparse targets, limited sensing, hazardous terrain, and restricted communication. Many scientifically valuable clues are visually ambiguous and often require close‑range observations, making efficient and safe informative path planning essential. Existing methods often rely on predefined areas of interest (AOIs), which may be incomplete or biased, and typically handle terrain risk only through soft penalties, which are insufficient for avoiding non‑recoverable regions. To address these issues, we propose a multi‑agent informative path planning framework for sparse evidence discovery based on Gaussian belief mapping and dual‑domain coverage. The method maintains Gaussian‑process‑based interest and risk beliefs and combines them with trajectory‑intent representations to support coordinated sequential decision‑making among multiple agents. It further prioritizes search inside the AOI while preserving limited exploration outside it, thereby improving robustness to AOI bias. In addition, the risk‑aware design helps agents balance information gain and operational safety in hazardous environments. Experimental results in simulated lunar environments show that the proposed method consistently outperforms sampling‑based and greedy baselines under different budgets and communication ranges. In particular, it achieves lower final uncertainty in risk‑aware settings and remains robust under limited communication, demonstrating its effectiveness for cooperative off‑world robotic exploration.

PaperID: 5091, https://arxiv.org/pdf/2603.07618.pdf

Authors: Yifei Yuan, Ghaith Androwis, Xianlian Zhou

Title: SMAT: Staged Multi-Agent Training for Co-Adaptive Exoskeleton Control

Abstract:
Effective exoskeleton assistance requires co‑adaptation: as the device alters joint dynamics, the user reorganizes neuromuscular coordination, creating a non‑stationary learning problem. Most learning‑based approaches do not explicitly account for the sequential nature of human motor adaptation, leading to training instability and poorly timed assistance. We propose Staged Multi‑Agent Training (SMAT), a four‑stage curriculum designed to mirror how users naturally acclimate to a wearable device. In SMAT, a musculoskeletal human actor and a bilateral hip exoskeleton actor are trained progressively: the human first learns unassisted gait, then adapts to the added device mass; the exoskeleton subsequently learns a positive assistance pattern against a stabilized human policy, and finally both agents co‑adapt with full torque capacity and bidirectional feedback. We implement SMAT in the MyoAssist simulation environment using a 26‑muscle lower‑limb model and an attached hip exoskeleton. Our musculoskeletal simulations demonstrate that the learned exoskeleton control policy produces an average 10.1% reduction in hip muscle activation relative to the no‑assist condition. We validated the learned controller in an offline setting using open‑source gait data, then deployed it to a physical hip exoskeleton for treadmill experiments with five subjects. The resulting policy delivers consistent assistance and predominantly positive mechanical power without the need for any explicitly imposed timing shift (mean positive power: 13.6 W at 6 Nm RMS torque to 23.8 W at 9.3 Nm RMS torque, with minimal negative power) consistently across all subjects without subject‑specific retraining.

PaperID: 5092, https://arxiv.org/pdf/2603.07607.pdf

Authors: Hamed Hamzeh, Parisa Vahdatian

Title: MAS-H2: A Hierarchical Multi-Agent System for Holistic Cloud-Native Autoscaling

Abstract:
Autoscaling in cloud‑native platforms like Kubernetes is reactive and metric‑driven, leading to a strategic void problem. This comes from the decoupling of higher‑level business policies from lower‑level resource provisioning. The strategic void, coupled with a fragmented coordination of pod and node scaling, can lead to significant resource waste and performance degradation under dynamic workloads. In this paper, we present MAS‑H2, a new hierarchical multi‑agent system that addresses the challenges of autonomic cloud resource management with a complete end‑to‑end solution. MAS‑H2 systematically decomposes the control problem into three layers: a Strategic Agent that formalises business policies (e.g., cost vs. performance) into a global utility function; Planning Agents that produce a joint, proactive scaling plan for pods and nodes with time‑series forecasting; and Execution Agents that execute the scaling plan. We built and tested a MAS‑H2 prototype as a Kubernetes Operator on Google Kubernetes Engine (GKE) to benchmark it against the native Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler (CA) baselines under two realistic, spiky, and stress‑inducing workload scenarios. The results show that the MAS‑H2 system maintained application CPU usage under 40% for predictable Heartbeat workloads. This resulted in over 50% less sustained CPU stress than the native HPA baseline, which typically operated above 80%. The MAS‑H2 system demonstrated proactive planning in a volatile Chaotic Flash Sale scenario by filtering transient noise and deploying more replicas compared to HPA. It reduced peak CPU load by 55% without under‑provisioning. Beyond performance, MAS‑H2 seamlessly performed a zero‑downtime strategic migration between two cost‑ and performance‑optimised infrastructures.

PaperID: 5093, https://arxiv.org/pdf/2603.07605.pdf

Authors: Kesha Ou, Chenghao Wu, Xiaolei Wang, Bowen Zheng, Wayne Xin Zhao, Weitao Li, Long Zhang, Sheng Chen, Ji-Rong Wen

Title: Deep Research for Recommender Systems

Abstract:
The technical foundations of recommender systems have progressed from collaborative filtering to complex neural models and, more recently, large language models. Despite these technological advances, deployed systems often underserve their users by simply presenting a list of items, leaving the burden of exploration, comparison, and synthesis entirely on the user. This paper argues that this traditional "tool‑based" paradigm fundamentally limits user experience, as the system acts as a passive filter rather than an active assistant. To address this limitation, we propose a novel deep research paradigm for recommendation, which replaces conventional item lists with comprehensive, user‑centric reports. We instantiate this paradigm through RecPilot, a multi‑agent framework comprising two core components: a user trajectory simulation agent that autonomously explores the item space, and a self‑evolving report generation agent that synthesizes the findings into a coherent, interpretable report tailored to support user decisions. This approach reframes recommendation as a proactive, agent‑driven service. Extensive experiments on public datasets demonstrate that RecPilot not only achieves strong performance in modeling user behaviors but also generates highly persuasive reports that substantially reduce user effort in item evaluation, validating the potential of this new interaction paradigm.

PaperID: 5094, https://arxiv.org/pdf/2603.07557.pdf

Authors: Yixi Lin, Jiangrong Wu, Yuhong Nan, Xueqiang Wang, Xinyuan Zhang, Zibin Zheng

Title: AgentRaft: Automated Detection of Data Over-Exposure in LLM Agents

Abstract:
The rapid integration of Large Language Model (LLM) agents into autonomous task execution has introduced significant privacy concerns within cross‑tool data flows. In this paper, we systematically investigate and define a novel risk termed Data Over‑Exposure (DOE) in LLM Agent, where an Agent inadvertently transmits sensitive data beyond the scope of user intent and functional necessity. We identify that DOE is primarily driven by the broad data paradigms in tool design and the coarse‑grained data processing inherent in LLMs. In this paper, we present AgentRaft, the first automated framework for detecting DOE risks in LLM agents. AgentRaft combines program analysis with semantic reasoning through three synergistic modules: (1) it constructs a Cross‑Tool Function Call Graph (FCG) to model the interaction landscape of heterogeneous tools; (2) it traverses the FCG to synthesize high‑quality testing user prompts that act as deterministic triggers for deep‑layer tool execution; and (3) it performs runtime taint tracking and employs a multi‑LLM voting committee grounded in global privacy regulations (e.g., GDPR, CCPA, PIPL) to accurately identify privacy violations. We evaluate AgentRaft on a testing environment of 6,675 real‑world agent tools. Our findings reveal that DOE is indeed a systemic risk, prevalent in 57.07% of potential tool interaction paths. AgentRaft achieves a high detection accuracy and effectiveness, outperforming baselines by 87.24%. Furthermore, AgentRaft reaches near‑total DOE coverage (99%) within only 150 prompts while reducing per‑chain verification costs by 88.6%. Our work provides a practical foundation for building auditable and privacy‑compliant LLM agent systems.

PaperID: 5095, https://arxiv.org/pdf/2603.07539.pdf

Authors: Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi, Mohammed Ghaly, Emad Mohamed

Title: MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Abstract:
Islamic inheritance law ('ilm al‑mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi‑step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large‑scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple‑choice questions, MAWARITH supports the full reasoning chain and provides step‑by‑step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final‑answer accuracy, we propose MIR‑E (Mawarith Inheritance Reasoning Evaluation), a weighted multi‑stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate six LLMs in a zero‑shot setting. Gemini‑2.5‑flash achieves about 90% MIR‑E on both validation and test, while Fanar‑C, Fanar‑Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://gitlab.com/islamgpt1/qias_shared_task_2026.

PaperID: 5096, https://arxiv.org/pdf/2603.07496.pdf

Authors: Xiaolei Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Tianyu Du, Heqing Huang, Hao Peng, Zhe Liu

Title: From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents

Abstract:
Artificial Intelligence (AI) agents have evolved from passive predictive tools into active entities capable of autonomous decision‑making and environmental interaction, driven by the reasoning capabilities of Large Language Models (LLMs). However, this evolution has introduced critical security vulnerabilities that existing frameworks fail to address. The Hierarchical Autonomy Evolution (HAE) framework organizes agent security into three tiers: Cognitive Autonomy (L1) targets internal reasoning integrity; Execution Autonomy (L2) covers tool‑mediated environmental interaction; Collective Autonomy (L3) addresses systemic risks in multi‑agent ecosystems. We present a taxonomy of threats spanning cognitive manipulation, physical environment disruption, and multi‑agent systemic failures, and evaluate existing defenses while identifying key research gaps. The findings aim to guide the development of multilayered, autonomy‑aware defense architectures for trustworthy AI agent systems.

PaperID: 5097, https://arxiv.org/pdf/2603.07456.pdf

Authors: Xin Tang, Xiaohuan Li, Qian Chen, Binhan Liao, Yaqi Zhang, Jianxin Chen, Changyuan Zhao, Junchuan Fan, Junxi Tian

Title: Agentic AI-Driven UAV Network Deployment: An LLM-Enhanced Exact Potential Game Approach

Abstract:
Unmanned aerial vehicular network (UAVN) is envisioned to provide flexible connectivity, wide‑area coverage, and low‑latency services in dynamic environments. From an agentic artificial intelligence (Agentic AI) perspective, UAVNs naturally operate as multi‑agent systems, where UAVs act as intelligent agents that coordinate deployment and networking decisions to achieve global performance objectives. However, the strong coupling between discrete link decisions and continuous deployment parameters makes UAVN deployment optimization a mixed‑integer nonconvex problem, resulting in challenges in scalability, efficiency, and solution consistency under dynamic network conditions. This paper proposes a dual spatial‑scale UAVN deployment optimization framework based on exact potential games (EPGs), enhanced by Agentic AI. At the large spatial scale, a log‑linear learning based EPG (L3‑EPG) algorithm is developed to optimize inter‑UAV link configurations, enabling sparse yet connected network topologies while reducing redundant links and interference. At the small spatial scale, an approximate gradient based EPG (AG‑EPG) algorithm jointly optimizes UAV deployment, transmission power allocation, and ground user (GU) association to improve network throughput and latency. To further enhance adaptability across heterogeneous scenarios, a large language model (LLM) is incorporated as a knowledge‑driven decision enhancer to automatically generate utility weights according to network characteristics, alleviating reliance on manual parameter tuning. Simulation results demonstrate that the proposed framework consistently outperforms baseline methods in terms of energy consumption, end‑to‑end latency, and system throughput.

PaperID: 5098, https://arxiv.org/pdf/2603.07444.pdf

Authors: Chen Zhu, Xiaolu Wang

Title: HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery

Abstract:
Large language models (LLMs) have enabled agent‑based systems that aim to automate scientific research workflows. Most existing approaches focus on fully autonomous discovery, where AI systems generate research ideas, conduct analyses, and produce manuscripts with minimal human involvement. However, empirical research in economics and the social sciences poses additional constraints: research questions must be grounded in available datasets, identification strategies require careful design, and human judgment remains essential for evaluating economic significance. We introduce HLER (Human‑in‑the‑Loop Economic Research), a multi‑agent architecture that supports empirical research automation while preserving critical human oversight. The system orchestrates specialized agents for data auditing, data profiling, hypothesis generation, econometric analysis, manuscript drafting, and automated review. A key design principle is dataset‑aware hypothesis generation, where candidate research questions are constrained by dataset structure, variable availability, and distributional diagnostics, reducing infeasible or hallucinated hypotheses. HLER further implements a two‑loop architecture: a question quality loop that screens and selects feasible hypotheses, and a research revision loop where automated review triggers re‑analysis and manuscript revision. Human decision gates are embedded at key stages, allowing researchers to guide the automated pipeline. Experiments on three empirical datasets show that dataset‑aware hypothesis generation produces feasible research questions in 87% of cases (versus 41% under unconstrained generation), while complete empirical manuscripts can be produced at an average API cost of 0.8‑1.5 per run. These results suggest that Human‑AI collaborative pipelines may provide a practical path toward scalable empirical research.

PaperID: 5099, https://arxiv.org/pdf/2603.07379.pdf

Authors: Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire

Title: SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

Abstract:
Retrieval‑Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi‑step reasoning, dynamic memory management, and iterative retrieval strategies. Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision‑making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved reliability risks. This Systematization of Knowledge (SoK) paper provides the first unified framework for understanding these autonomous systems. We formalize agentic retrieval‑generation loops as finite‑horizon partially observable Markov decision processes, explicitly modeling their control policies and state transitions. Building upon this formalization, we develop a comprehensive taxonomy and modular architectural decomposition that categorizes systems by their planning mechanisms, retrieval orchestration, memory paradigms, and tool‑invocation behaviors. We further analyze the critical limitations of traditional static evaluation practices and identify severe systemic risks inherent to autonomous loops, including compounding hallucination propagation, memory poisoning, retrieval misalignment, and cascading tool‑execution vulnerabilities. Finally, we outline key doctoral‑scale research directions spanning stable adaptive retrieval, cost‑aware orchestration, formal trajectory evaluation, and oversight mechanisms, providing a definitive roadmap for building reliable, controllable, and scalable agentic retrieval systems.

PaperID: 5100, https://arxiv.org/pdf/2603.07370.pdf

Authors: Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin

Title: Learning to Reflect: Hierarchical Multi-Agent Reinforcement Learning for CSI-Free mmWave Beam-Focusing

Abstract:
Reconfigurable Intelligent Surfaces promise to transform wireless environments, yet practical deployment is hindered by the prohibitive overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization. This paper proposes a Hierarchical Multi‑Agent Reinforcement Learning (HMARL) framework for the control of mechanically reconfigurable reflective surfaces in millimeter‑wave (mmWave) systems. We introduce a "CSI‑free" paradigm that substitutes pilot‑based channel estimation with readily available user localization data. To manage the massive combinatorial action space, the proposed architecture utilizes Multi‑Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) paradigm. The proposed architecture decomposes the control problem into two abstraction levels: a high‑level controller for user‑to‑reflector allocation and decentralized low‑level controllers for low‑level focal point optimization. Comprehensive ray‑tracing evaluations demonstrate that the framework achieves 2.81‑7.94 dB RSSI improvements over centralized baselines, with the performance advantage widening as system complexity increases. Scalability analysis reveals that the system maintains sustained efficiency, exhibiting minimal per‑user performance degradation and stable total power utilization even when user density doubles. Furthermore, robustness validation confirms the framework's viability across varying reflector aperture sizes (45‑99 tiles) and demonstrates graceful performance degradation under localization errors up to 0.5 m. By eliminating CSI overhead while maintaining high‑fidelity beam‑focusing, this work establishes HMARL as a practical solution for intelligent mmWave environments.

PaperID: 5101, https://arxiv.org/pdf/2603.07360.pdf

Authors: Ivan Pasichnyk

Title: The Yerkes-Dodson Curve for AI Agents: Emergent Cooperation Under Environmental Pressure in Multi-Agent LLM Simulations

Abstract:
Designing environments that maximize the rate of emergent behavior development in AI agents remains an open problem. We present the first systematic study of stress‑performance relationships in large language model (LLM) multi‑agent systems, drawing an explicit parallel to the Yerkes‑Dodson law from cognitive psychology. Using a grid‑world survival arena, we conduct 22 experiments across four phases, varying environmental pressure through resource scarcity (upkeep cost) and reproductive competition (sexual selection). Our key finding is that cooperative behavior follows an inverted‑U curve: trade interactions peak at 29 under medium pressure (upkeep=5), while both low and extreme pressure produce 8‑‑12 trades. Under extreme pressure, behavioral repertoire collapses to movement‑only within 5‑‑12 turns. We further show that sexual selection ‑‑ a softer pressure mechanism where all agents survive but not all reproduce ‑‑ eliminates inter‑agent aggression entirely and produces communicative behavior absent under survival pressure. These results suggest that environmental pressure calibration is a viable curriculum design strategy for LLM agent development, analogous to the inverted‑U relationship between arousal and performance in biological systems.

PaperID: 5102, https://arxiv.org/pdf/2603.07278.pdf

Authors: Zijian Tang, Ying Zhang, Sibo Cai, Ruoxuan Wang

Title: LLM-FK: Multi-Agent LLM Reasoning for Foreign Key Detection in Large-Scale Complex Databases

Abstract:
Detecting missing foreign keys (FKs) requires accurately modeling semantic dependencies across database schemas, which conventional heuristic‑based methods are fundamentally limited in capturing. We propose LLM‑FK, the first fully automated multi‑agent framework for FK detection, designed to address three core challenges that hinder naive LLM‑based solutions in large‑scale complex databases: combinatorial search space explosion, ambiguous inference under limited context, and global inconsistency arising from isolated local predictions. LLM‑FK coordinates four specialized agents: a Profiler that decomposes the FK detection problem into the task of validating FK candidate column pairs and prunes the search space via a unique‑key‑driven schema decomposition strategy; an Interpreter that injects self‑augmented domain knowledge; a Refiner that constructs compact structural representations and performs multi‑perspective chain‑of‑thought reasoning; and a Verifier that enforces schema‑wide consistency through a holistic conflict resolution strategy. Experiments on five benchmark datasets demonstrate that LLM‑FK consistently achieves F1‑scores above 93%, surpassing existing baselines by 15% on the large‑scale MusicBrainz database, while reducing the candidate search space by two to three orders of magnitude without losing true FKs and maintaining robustness under challenging conditions like missing data. These results demonstrate the effectiveness and scalability of LLM‑FK in real‑world databases.

PaperID: 5103, https://arxiv.org/pdf/2603.07191.pdf

Authors: Yuxu Ge

Title: Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Abstract:
Autonomous agents powered by large language models introduce a class of execution‑layer vulnerabilities ‑‑ prompt injection, retrieval poisoning, and uncontrolled tool invocation ‑‑ that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four‑layer framework comprising execution sandboxing (L1), intent verification (L2), zero‑trust inter‑agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool‑call samples ‑‑ covering prompt injection, RAG poisoning, and malicious skill plugins ‑‑ and apply it to OpenClaw, a representative open‑source agent framework. Experimental results on Layer 2 intent verification with four local LLM judges (Qwen3.5‑4B, Llama‑3.1‑8B, Qwen3.5‑9B, Qwen2.5‑14B) and one cloud judge (GPT‑4o‑mini) show that all five LLM judges intercept 93.0‑98.5% of TC1/TC2 malicious tool calls, while lightweight NLI baselines remain below 10%. TC3 (malicious skill plugins) proves harder at 75‑94% IR among judges with meaningful precision‑recall balance, motivating complementary enforcement at Layers 1 and 3. Qwen2.5‑14B achieves the best local balance (98% IR, approximately 10‑20% FPR); a two‑stage cascade (Qwen3.5‑9B‑>GPT‑4o‑mini) achieves 91.9‑92.6% IR with 1.9‑6.7% FPR; a fully local cascade (Qwen3.5‑9B‑>Qwen2.5‑14B) achieves 94.7‑95.6% IR with 6.0‑9.7% FPR for data‑sovereign deployments. An end‑to‑end pipeline evaluation (n=100) demonstrates that all four layers operate in concert with 96% IR and a total P50 latency of approximately 980 ms, of which the non‑judge layers contribute only approximately 18 ms. Generalization to the external InjecAgent benchmark yields 99‑100% interception, confirming robustness beyond our synthetic data.

PaperID: 5104, https://arxiv.org/pdf/2603.07159.pdf

Authors: Nicolas Legrand, Kenneth Enevoldsen, Márton Kardos, Kristoffer Nielbo

Title: Improving reasoning at inference time via uncertainty minimisation

Abstract:
Large language models (LLMs) now exhibit strong multi‑step reasoning abilities, but existing inference‑time scaling methods remain computationally expensive, often relying on extensive sampling or external evaluators. We propose a principled strategy that frames reasoning as uncertainty minimisation and operates at the level of individual thoughts rather than tokens. Our method selects, at each reasoning step, the continuation that maximizes the model's self‑certainty, a metric computed from its internal predictive distribution. This approach achieves significant improvement with a small number of samples, relies exclusively on model‑internal signals, and applies to open‑ended questions as opposed to methods like majority voting. Experiments on MATH500 and GSM8K across multiple model sizes demonstrate that thought‑level self‑certainty maximization consistently outperforms greedy decoding and matches or exceeds self‑consistency under comparable token budgets. Cross‑linguistic evaluations further indicate that the method transfers robustly beyond high‑resource languages. Furthermore, analysis of self‑certainty dynamics reveals that correct reasoning trajectories converge early to stable paths, suggesting that early decisions, likely associated with the planning of the reasoning process, are predictive of final accuracy. Building on this result, we show that self‑certainty maximisation applied to the early steps can explain most of the performance gain and provide a simple yet efficient inference‑time scaling method.

PaperID: 5105, https://arxiv.org/pdf/2603.07116.pdf

Authors: Zuyao Xu, Xiang Li, Fubin Wu, Yuqi Qiu, Lu Sun, FaSheng Miao

Title: aCAPTCHA: Verifying That an Entity Is a Capable Agent via Asymmetric Hardness

Abstract:
As autonomous AI agents increasingly populate the Internet, a novel security challenge arises: "Is this entity an AI agent?" It is a new entity‑type verification problem with no established solution. We formalize the problem through a three‑class entity taxonomy (Human, Script, Agent) based on a verifiable agentic capability vector <x, r, s> (action, reasoning, and memory). A timing threshold t exploits the asymmetric hardness between human cognition and AI processing to separate the three classes. We define the Agentic Capability Verification Problem (ACVP) through three necessity primitives, each testing one capability dimension. Building on this foundation, we introduce aCAPTCHA (Agent CAPTCHA), a time‑constrained security game for agent admission whose security rests on ACVP hardness under t. We instantiate aCAPTCHA through time‑bounded natural‑language understanding as a multi‑round HTTP verification protocol, and evaluate it with preliminary agent trials that validate the protocol's soundness and completeness. aCAPTCHA provides a composable, infrastructure‑free admission gate for any service where entity‑type verification is required.

PaperID: 5106, https://arxiv.org/pdf/2603.07111.pdf

Authors: Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi, Tomoya Higuchi, Ryutaro Asahara, Michimasa Inaba

Title: Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

Abstract:
The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co‑hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM‑based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self‑match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.

PaperID: 5107, https://arxiv.org/pdf/2603.07106.pdf

Authors: Lei Yin, Wentao Cheng, Zhida Qin, Tianyu Huang, Yidong Li, Gangyi Ding

Title: AutoUE: Automated Generation of 3D Games in Unreal Engine via Multi-Agent Systems

Abstract:
Automatically generating 3D games in commercial game engines remains a non‑trivial challenge, as it involves complex engine‑related workflows for generating assets such as scenes, blueprints, and code. To address this challenge, we propose a novel multi‑agent system, AutoUE, which coordinates multiple agents to end‑to‑end generate 3D games, covering model retrieval, scene generation, gameplay and interaction code synthesis, and automated game testing for evaluation. In order to mitigate tool‑use hallucinations in LLMs, we introduce a retrieval‑augmented generation mechanism that grounds agents with relevant UE tool documentation. Additionally, we incorporate game design patterns and engine constraints into the code generation process to ensure the generation of correct and robust code. Furthermore, we design an automated play‑testing pipeline that generates and executes runtime test commands, enabling systematic evaluation of dynamic behaviors. Finally, we construct a game generation dataset and conduct a series of experiments that demonstrate AutoUE's ability to generate 3D games end‑to‑end, and validate the effectiveness of these designs.

PaperID: 5108, https://arxiv.org/pdf/2603.06977.pdf

Authors: Addison Kalanther, Sanika Bharvirkar, Shankar Sastry, Chinmay Maheshwari

Title: NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) is increasingly used to design learning‑enabled agents that interact in shared environments. However, training MARL algorithms in general‑sum games remains challenging: learning dynamics can become unstable, and convergence guarantees typically hold only in restricted settings such as two‑player zero‑sum or fully cooperative games. Moreover, when agents have heterogeneous and potentially conflicting preferences, it is unclear what system‑level objective should guide learning. In this paper, we propose a new MARL pipeline called Near‑Potential Policy Optimization (NePPO) for computing approximate Nash equilibria in mixed cooperative‑‑competitive environments. The core idea is to learn a player‑independent potential function such that the Nash equilibrium of a cooperative game with this potential as the common utility approximates a Nash equilibrium of the original game. To this end, we introduce a novel MARL objective such that minimizing this objective yields the best possible potential function candidate and consequently an approximate Nash equilibrium of the original game. We develop an algorithmic pipeline that minimizes this objective using zeroth‑order gradient descent and returns an approximate Nash equilibrium policy. We empirically show the superior performance of this approach compared to popular baselines such as IPPO and MAPPO.

PaperID: 5109, https://arxiv.org/pdf/2603.06926.pdf

Authors: Mengyuan Millie Wu, Zhihan Jiang, Yuang Fan, Richard Feng, Sahiti Dharmavaram, Mathew Polowitz, Shawn Fallon, Bashima Islam, Lizbeth Benson, Irene Tung, David Creswell, Xuhai Xu

Title: MindfulAgents: Personalizing Mindfulness Meditation via an Expert-Aligned Multi-Agent System

Abstract:
Mindfulness meditation is a widely accessible and evidence‑based method for supporting mental health. Despite the proliferation of mindfulness meditation apps, sustaining user engagement remains a persistent challenge. Personalizing the meditation experience is a promising strategy to improve engagement, but it often requires costly and unscalable manual effort. We present MindfulAgents, a multi‑agent system powered by large language models that (1) generates guided meditation scripts based on an expert‑established mindfulness framework, (2) encourages users' reflection on emotional states and mindfulness skills, and (3) enables real‑time personalization of the mindfulness meditation experience for each user. In a formative lab study (N=13), MindfulAgents significantly improved in‑session engagement (p = 0.011) and self‑awareness (p = 0.014), and reduced momentary stress (p = 0.020). Furthermore, a four‑week deployment study (N=62) demonstrated a notable increase in long‑term engagement (p = 0.002) and level of mindfulness (p = 0.023). Participants reported that MindfulAgents offered more relevant meditation sessions personalized to individual needs in various contexts, supporting sustained practice. Our findings highlight the potential of LLM‑driven personalization for enhancing user engagement in digital mindfulness meditation interventions.

PaperID: 5110, https://arxiv.org/pdf/2603.06874.pdf

Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen Tseng

Title: LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

Abstract:
Large Language Models (LLMs) exhibit impressive general‑purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game‑based evaluations. At its core, LieCraft is a novel multiplayer hidden‑role game in which players select an ethical alignment and execute strategies over a long time‑horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, while Defectors evade suspicion while secretly sabotaging missions. To enable real‑world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high‑stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state‑of‑the‑art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.

PaperID: 5111, https://arxiv.org/pdf/2603.06856.pdf

Authors: Ahmed Almasoud

Title: Evaluating Multi-Agent LLM Architectures for Rare Disease Diagnosis

Abstract:
While large language models are capable diagnostic tools, the impact of multi‑agent topology on diagnostic accuracy remains underexplored. This study evaluates four agent topologies, Control (single agent), Hierarchical, Adversarial, and Collaborative, across 302 cases spanning 33 rare disease categories. We introduce a Reasoning Gap metric to quantify the difference between internal knowledge retrieval and final diagnostic accuracy. Results indicate that the Hierarchical topology (50.0% accuracy) marginally outperforms Collaborative (49.8%) and Control (48.5%) configurations. In contrast, the Adversarial model significantly degrades performance (27.3%), exhibiting a massive Reasoning Gap where valid diagnoses were rejected due to artificial doubt. Across all architectures, performance was strongest in Allergic diseases and Toxic Effects categories but poorest in Cardiac Malformation and Respiratory cases. Critically, while the single‑agent baseline was generally robust, all multi‑agent systems, including the Adversarial model, yielded superior accuracy in Bone and Thoracic disease categories. These findings demonstrate that increasing system complexity does not guarantee better reasoning, supporting a shift toward dynamic topology selection.

PaperID: 5112, https://arxiv.org/pdf/2603.06847.pdf

Authors: Mehil B Shah, Mohammad Mehdi Morovati, Mohammad Masudur Rahman, Foutse Khomh

Title: Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

Abstract:
Agentic AI systems combine LLM‑based reasoning, orchestration, tool invocation, and interaction with external environments. These systems introduce faults that are difficult to characterize using existing taxonomies. To address this gap, we present an empirical study of faults in agentic AI systems. We collected 13,602 issues and pull requests from 40 repositories and, using stratified sampling, selected 385 faults for analysis. Through grounded theory, we derived taxonomies of fault types, symptoms, and root causes. We then used Apriori‑based association rule mining to identify relationships among faults, symptoms, and root causes, and validated the taxonomy through a developer study with 145 practitioners. Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured‑output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Furthermore, association rules showed recurring cross‑component propagation, linking structured data, dependency, and state management faults to their symptoms and root causes. Practitioners considered the taxonomy representative of agentic AI failures and suggested refinements related to multi‑agent coordination and observability. These findings provide an empirical basis for diagnosing faults and improving reliability in agentic AI systems.

PaperID: 5113, https://arxiv.org/pdf/2603.06835.pdf

Authors: Todd C. Hufnagel, Pranav Addepalli, Anuruddha Bhattacharjee, Rohit Berlia, Jaafar El-Awady, David Elbert, Lori Graham-Brady, Axel Krieger, Harichandana Neralla, T. Joseph Nkansah-Mahaney, Mostafa M. Omar, Hyun Sang Park, K. T. Ramesh, Matthew Shaeffer, Eric Walker, Piyush Wanchoo, Timothy P. Weihs

Title: AIMD-L: An automated laboratory for high-throughput characterization of structural materials for extreme environments

Abstract:
Rapid developments in artificial intelligence and machine learning as applied to materials science are creating an urgent need for experimental data, which can be provided by high‑throughput and autonomous laboratories. To date most demonstrations of such laboratories have focused on functional materials, with less attention paid to structural materials. We present here the Artificial Intelligence in Materials Design Laboratory (AIMD‑L), an automated, high‑throughput facility for characterizing the microstructure and properties of structural metals and ceramics, with an emphasis on materials in extreme environments. AIMD‑L has two custom instruments for characterization of structural materials: HELIX for shock studies of materials, and MAXIMA for X‑ray diffraction and X‑ray fluorescence spectroscopy. Specifically designed for high‑throughput studies, HELIX and MAXIMA are each capable of collecting data at rates two to three orders of magnitude faster than conventional systems. A third experimental station, SPHINX, is a commercial nanoindenter modified for integration into the automated workflow of AIMD‑L. A user (which may be human or an AI agent) directs the experiments to be carried out by means of a centralized control program. The experimental stations are linked by a conveyance that moves samples around the lab, with a robot at each station for sample transfer in/out of the instrument. The experimental stations also communicate with a common data layer that streams data autonomously from each instrument to a data portal, where their arrival triggers automated workflows for data reduction and analysis. The processed data are immediately available to the human operator or agentic AI, forming a closed loop for rapid decision‑making and experimental control.

PaperID: 5114, https://arxiv.org/pdf/2603.06824.pdf

Authors: Zhaomeng Zhang, Seyed Hamid Reza Roodabeh, Homa Alemzadeh

Title: A Comprehensive Analysis of the Effects of Network Quality of Service on Robotic Telesurgery

Abstract:
The viability of long‑distance telesurgery hinges on reliable network Quality of Service (QoS), yet the impact of realistic network degradations on task performance is not sufficiently understood. This paper presents a comprehensive analysis of how packet loss, delay, and communication loss affect telesurgical task execution. We introduce NetFI, a novel fault injection tool that emulates different network conditions using stochastic QoS models informed by real‑world network data. By integrating NetFI with a surgical simulation platform, we conduct a user study involving 15 participants at three proficiency levels, performing a standardized Peg Transfer task under varying levels of packet loss, delay, and communication loss. We analyze the effect of network QoS on overall task performance and the fine‑grained motion primitives (MPs) using objective performance and safety metrics and subjective operator's perception of workload. We identify specific MPs vulnerable to network degradation and find strong correlations between proficiency, objective performance, and subjective workload. These findings offer quantitative insights into the operational boundaries of telesurgery. Our open‑source tools and annotated dataset provide a foundation for developing robust and network‑aware control and mitigation strategies.

PaperID: 5115, https://arxiv.org/pdf/2603.06813.pdf

Authors: Dane Malenfant

Title: Reinforcing the World's Edge: A Continual Learning Problem in the Multi-Agent-World Boundary

Abstract:
Reusable decision structure survives across episodes in reinforcement learning, but this depends on how the agent‑‑world boundary is drawn. In stationary, finite‑horizon MDPs, an invariant core: the (not‑necessarily contiguous) subsequences of state‑‑action pairs shared by all successful trajectories (optionally under a simple abstraction) can be constructed. Under mild goal‑conditioned assumptions, it's existence can be proven and explained by how the core captures prototypes that transfer across episodes. When the same task is embedded in a decentralized Markov game and the peer agent is folded into the world, each peer‑policy update induces a new MDP; the per‑episode invariant core can shrink or vanish, even with small changes to the induced world dynamics, sometimes leaving only the individual task core or just nothing. This policy‑induced non‑stationarity can be quantified with a variation budget over the induced kernels and rewards, linking boundary drift to loss of invariants. The view that a continual RL problem arises from instability of the agent‑‑world boundary (rather than exogenous task switches) in decentralized MARL suggests future work on preserving, predicting, or otherwise managing boundary drift.

PaperID: 5116, https://arxiv.org/pdf/2603.06810.pdf

Authors: Wenjing Chen, Chengyuan Qian, Shuo Xing, Yi Zhou, Victoria Crawford

Title: Multi-Agent Reinforcement Learning with Submodular Reward

Abstract:
In this paper, we study cooperative multi‑agent reinforcement learning (MARL) where the joint reward exhibits submodularity, which is a natural property capturing diminishing marginal returns when adding agents to a team. Unlike standard MARL with additive rewards, submodular rewards model realistic scenarios where agent contributions overlap (e.g., multi‑drone surveillance, collaborative exploration). We provide the first formal framework for this setting and develop algorithms with provable guarantees on sample efficiency and regret bound. For known dynamics, our greedy policy optimization achieves a 1/2‑approximation with polynomial complexity in the number of agents K, overcoming the exponential curse of dimensionality inherent in joint policy optimization. For unknown dynamics, we propose a UCB‑based learning algorithm achieving a 1/2‑regret of O(H^2KS\sqrtAT) over T episodes.

PaperID: 5117, https://arxiv.org/pdf/2603.06801.pdf

Authors: Yuhan Liu, Juntian Zhang, Yichen Wu, Martin Takac, Salem Lahlou, Xiuying Chen, Nils Lukas

Title: Breaking the Martingale Curse: Multi-Agent Debate via Asymmetric Cognitive Potential Energy

Abstract:
Multi‑Agent Debate (MAD) has emerged as a promising paradigm for enhancing large language model reasoning. However, recent work reveals a limitation:standard MAD cannot improve belief correctness beyond majority voting; we refer to this as the Martingale Curse. This curse arises because correlated errors cause agents to converge toward erroneous consensus, where debate merely reinforces collective mistakes rather than filtering noise. We propose AceMAD, a framework that breaks the Martingale Curse by harnessing asymmetric cognitive potential energy to transform MAD from a random walk into a directed convergence process with positive drift. Through a peer‑prediction mechanism, agents predict their peers' belief distributions, revealing asymmetric cognitive potential: truth‑holders not only know the correct answer but also anticipate the crowd's misconceptions, while the hallucinating majority remains blind to their collective error. This asymmetry creates a potential energy gap that we quantify via strictly proper scoring rules. We prove this cognitive potential manifests as information‑theoretic superiority and, under nonlinear aggregation, converts into submartingale drift toward truth, directly breaking the Martingale Curse. Experiments on challenging subsets across six benchmarks show AceMAD recovers sparse truth signals even when initial majorities are incorrect, substantially outperforming baseline methods.

PaperID: 5118, https://arxiv.org/pdf/2603.06737.pdf

Authors: Chad E. Brown, Cezary Kaliszyk, Josef Urban

Title: Agent Hunt: Bounty Based Collaborative Autoformalization With LLM Agents

Abstract:
We describe an experiment in large‑scale autoformalization of algebraic topology in an Interactive Theorem Proving (ITP) environment, where the workload is distributed among multiple LLM‑based coding agents. Rather than relying on static central planning, we implement a simulated bounty‑based marketplace in which agents dynamically propose new lemmas (formal statements), attach bounties to them, and compete to discharge these proof obligations and claim the bounties. The agents interact directly with the interactive proof system: they can invoke tactics, inspect proof states and goals, analyze tactic successes and failures, and iteratively refine their proof scripts. In addition to constructing proofs, agents may introduce new formal definitions and intermediate lemmas to structure the development. All accepted proofs are ultimately checked and verified by the underlying proof assistant. This setting explores collaborative, decentralized proof search and theory building, and the use of market‑inspired mechanisms to scale autoformalization in ITP.

PaperID: 5119, https://arxiv.org/pdf/2603.06683.pdf

Authors: Hailong Chu, Hongbing Li, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Shuo Zhang, Lei Li

Title: ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

Abstract:
Multimedia event extraction (M2E2) aims to predict triggers, ground arguments across text and images, and then assemble them into schema‑consistent event records. Recent LLM‑based approaches have shown strong potential for M2E2, but their intermediate event hypotheses often remain implicit, and event‑argument linking is still tightly coupled with role binding. This leaves little opportunity to inspect or revise intermediate event hypotheses and makes predictions brittle to early errors. To bridge this gap, we present ECHO, a multi‑agent framework that reframes M2E2 as iterative refinement over an explicit Multimedia Event Hypergraph (MEHG). Instead of relying on implicit linear generation, ECHO performs auditable atomic updates over a shared hypergraph, making intermediate event structures explicit and revisable. Furthermore, we introduce a Link‑then‑Bind strategy that decouples event‑argument linking from role binding, reducing premature semantic commitment during structured prediction. Extensive experiments on the M2E2 benchmark show that ECHO consistently outperforms prior state‑of‑the‑art approaches, achieving gains of 7.3 and 15.5 F1 points on event mention and argument role, respectively.

PaperID: 5120, https://arxiv.org/pdf/2603.06647.pdf

Authors: Wai Lwin Phone, Brahim El Boudani, Tasos Dagiuklas, Saptarshi Ghosh

Title: Performance Comparison of IBN orchestration using LLM and SLMs

Abstract:
The evolution of both 5G and 6G networks is driving the advancement of fully autonomous network management, placing Intent‑Based Networking at the centre of this transformation. This paper introduces a novel framework for 5G and 6G IBN orchestration that leverages a stateful, hierarchical multi‑agent architecture to achieve full automation using both SLMs and LLMs. Both models have been evaluated for translation accuracy using metrics such as BLEU, METEOR, and ROUGE‑L, as well as computational complexity. Experimental results show that both models exhibit similar accuracy. However, result shows that SLMs can improve the overall completion speed of the IBN lifecycle by 20%.

PaperID: 5121, https://arxiv.org/pdf/2603.06638.pdf

Authors: Sirui Li, Shuhan Xiao, Mihir Joshi, Ahmed Metwally, Daniel McDuff, Wei Wang, Yuzhe Yang

Title: HEARTS: Benchmarking LLM Reasoning on Health Time Series

Abstract:
The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general‑purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real‑world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real‑world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state‑of‑the‑art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi‑step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next‑generation LLM agents capable of reasoning over diverse health signals.

PaperID: 5122, https://arxiv.org/pdf/2603.06607.pdf

Authors: Siyuan Wang, Lei Lei, Pranav Maheshwari, Sam Bellefeuille, Kan Zheng, Dusit Niyato

Title: Multi-Agent DRL for V2X Resource Allocation: Disentangling Challenges and Benchmarking Solutions

Abstract:
Multi‑agent deep reinforcement learning (DRL) has emerged as a promising approach for radio resource allocation (RRA) in cellular vehicle‑to‑everything (C‑V2X) networks. However, the multifaceted challenges inherent to multi‑agent reinforcement learning (MARL) ‑ including non‑stationarity, coordination difficulty, large action spaces, partial observability, and limited robustness and generalization ‑ are often intertwined, making it difficult to understand their individual impact on performance in vehicular environments. Moreover, existing studies typically rely on different baseline MARL algorithms, and a systematic comparison of their capabilities in addressing specific challenges in C‑V2X RRA remains lacking. In this paper, we bridge this gap by formulating C‑V2X RRA as a sequence of multi‑agent interference games with progressively increasing complexity, each designed to isolate a key MARL challenge. Based on these formulations, we construct a suite of learning tasks that enable controlled evaluation of performance degradation attributable to each challenge. We further develop large‑scale, diverse training and testing datasets using SUMO‑generated highway traces to capture a wide range of vehicular topologies and corresponding interference patterns. Through extensive benchmarking of representative MARL algorithms, we identify policy robustness and generalization across diverse vehicular topologies as the dominant challenge in C‑V2X RRA. We further show that, on the most challenging task, the best‑performing actor‑critic method outperforms the value‑based approach by 42%. By emphasizing the need for zero‑shot policy transfer to both seen and unseen topologies at runtime, and by open‑sourcing the code, datasets, and interference‑game benchmark suite, this work provides a systematic and reproducible foundation for evaluating and advancing MARL algorithms in vehicular networks.

PaperID: 5123, https://arxiv.org/pdf/2603.06583.pdf

Authors: Fei Wang, Jiangnan Yang, Junjie Chen, Yuxin Liu, Kun Li, Yanyan Wei, Dan Guo, Meng Wang

Title: XInsight: Integrative Stage-Consistent Psychological Counseling Support Agents for Digital Well-Being

Abstract:
Web‑based platforms are becoming a primary channel for psychological support, yet most LLM‑driven chatbots remain opaque, single‑stage, and weakly grounded in established therapeutic practice, limiting their usefulness for web applications that promote digital well‑being. To address this gap, we present XInsight, a counseling‑inspired multi‑agent framework that models psychological support as a stage‑consistent workflow aligned with the classical Exploration‑Insight‑Action paradigm. Building on structured client representations, XInsight orchestrates specialized agents under a unified Reason‑Intervene‑Reflect cycle: an Exploration agent organizes background and concerns into a structured Case Conceptualization Form, a Routing agent performs Adaptive Therapeutic Routing (ATR) across SFBT, CBT, and MBCT, a unified Therapeutic agent executes school‑consistent submodules, and a Consolidation agent guides review, skill integration, and relapse‑prevention planning. A Recording agent continuously transforms open‑ended web dialogues into standardized psychological artifacts, including case formulations, therapeutic records, and relapse‑prevention plans, enhancing interpretability, continuity, and accountability. To support rigorous and transparent assessment, we introduce XInsight‑Bench with a Scale‑Guided LLM Evaluation (SGLE) protocol that combines therapy‑specific clinical scales with general counseling criteria. Experiments show improved paradigm alignment, multi‑therapy integration, interaction depth, and interpretability over existing multi‑agent counseling systems, indicating that XInsight provides a practical blueprint for integrating counseling‑inspired support agents into web applications for digital well‑being.

PaperID: 5124, https://arxiv.org/pdf/2603.06582.pdf

Authors: Daniel Dobriy, Frederik Bauer, Amr Azzam, Debayan Banerjee, Axel Polleres

Title: Agentic SPARQL: Evaluating SPARQL-MCP-powered Intelligent Agents on the Federated KGQA Benchmark

Abstract:
Standard protocols such as the Model Context Protocol (MCP) that allow LLMs to connect to tools have recently boosted "agentic" AI applications, which, powered by LLMs' planning capabilities, promise to solve complex tasks with the access of external tools and data sources. In this context, publicly available SPARQL endpoints offer a natural connection to combine various data sources through MCP by (a) implementing a standardised protocol and query language, (b) standardised metadata formats, and (c) the native capability to federate queries. In the present paper, we explore the potential of SPARQL‑MCP‑based intelligent agents to facilitate federated SPARQL querying: firstly, we discuss how to extend an existing Knowledge Graph Question Answering benchmark towards agentic federated Knowledge Graph Question Answering (FKGQA); secondly, we implement and evaluate the ability of integrating SPARQL federation with LLM agents via MCP (incl. endpoint discovery/source selection, schema exploration, and query formulation), comparing different architectural options against the extended benchmark. Our work complements and extends prior work on automated SPARQL query federation towards fruitful combinations with agentic AI.

PaperID: 5125, https://arxiv.org/pdf/2603.06570.pdf

Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri

Title: SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Abstract:
Surgeons don't just see ‑‑ they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this ‑‑ explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large‑scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi‑agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert‑validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision‑language model adapted through supervised fine‑tuning, and SureonVLM‑R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general‑domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general‑domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM‑R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

PaperID: 5126, https://arxiv.org/pdf/2603.06503.pdf

Authors: Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul

Title: Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Abstract:
Recent advances in multimodal Retrieval‑Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross‑sheet dependencies, and embedded visual artifacts. However, state‑of‑the‑art approaches exclude critical context through single‑pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full‑context injection, preventing reliable multi‑step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single‑pass retrieval with an iterative tool‑calling loop, supporting end‑to‑end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state‑of‑the‑art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR‑Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT‑5.2 achieves the best efficiency‑accuracy trade‑off. Throughout all evaluations, BRTR maintains full auditability through explicit tool‑call traces.

PaperID: 5127, https://arxiv.org/pdf/2603.06077.pdf

Authors: Giuseppe Di Poce, Mattia Merluzzi, Emilio Calvanese Strinati, Paolo Di Lorenzo

Title: Distributed Semantic Alignment over Interference Channels: A Game-Theoretic Approach

Abstract:
Semantic communication acts as a key enabler for effective task execution in AI‑driven systems, prioritizing the extraction of the underlying meaning before transmission. However, when devices rely on different logic and internal representations, semantic mismatches may arise, potentially hindering mutual understanding and effectiveness of communication. Furthermore, in interference channel environments, the coexistence of multiple devices introduce a significant degradation due to the presence of multi‑user‑interference. To address these challenges, in this paper we formulate the joint optimization of linear Multiple‑Input‑Multiple‑Output (MIMO) transceivers as a distributed non‑cooperative game, enabling a closed‑form solution that effectively addresses semantic coexistence and latent space misalignment. We derive sufficient conditions for the existence of a Nash Equilibrium (NE), considering multiple point‑to‑point MIMO channels, with corresponding users modeled as selfish players optimizing their transmission and semantic alignment strategies. Numerical results substantiate the proposed approach in goal‑oriented semantic communication by highlighting crucial trade‑offs between information compression, interference mitigation, semantic alignment, and task performance.

PaperID: 5128, https://arxiv.org/pdf/2603.06058.pdf

Authors: Milan Groshev, Eduardo Castelló Ferrer

Title: RODEO: RObotic DEcentralized Organization

Abstract:
Robots are improving their autonomy with minimal human supervision. However, auditable actions, transparent decision processes, and new human‑robot interaction models are still missing requirements to achieve extended robot autonomy. To tackle these challenges, we propose RODEO (RObotic DEcentralized Organization), a blockchain‑based framework that integrates trust and accountability mechanisms for robots. This paper formalizes Decentralized Autonomous Organizations (DAOs) for service robots. First, it provides a ROS‑ETH bridge between the DAO and the robots. Second, it offers templates that enable organizations (e.g., companies, universities) to integrate service robots into their operations. Third, it provides proof‑verification mechanisms that allow robot actions to be auditable. In our experimental setup, a mobile robot was deployed as a trash collector in a lab scenario. The robot collects trash and uses a smart bin to sort and dispose of it correctly. Then, the robot submits a proof of the successful operation and is compensated in DAO tokens. Finally, the robot re‑invests the acquired funds to purchase battery charging services. Data collected in a three day experiment show that the robot doubled its income and reinvested funds to extend its operating time. The proof validation times of approximately one minute ensured verifiable task execution, while the accumulated robot income successfully funded up to 88 hours of future autonomous operation. The results of this research give insights about how robots and organizations can coordinate tasks and payments with auditable execution proofs and on‑chain settlement.

PaperID: 5129, https://arxiv.org/pdf/2603.06051.pdf

Authors: Qianying Liao, Jonah Bellemans, Laurens Sion, Xue Jiang, Dmitrii Usynin, Xuebing Zhou, Dimitri Van Landuyt, Lieven Desmet, Wouter Joosen

Title: A LINDDUN-based Privacy Threat Modeling Framework for GenAI

Abstract:
As generative AI (GenAI) systems become increasingly prevalent across various technological stacks, the question of how such systems handle sensitive and personal data flows becomes increasingly important. Specifically, both the ability to harness and process large swaths of information as well as their stochastic nature raise key concerns related to both security and privacy. Unfortunately, while some of the traditional security threat modeling can effectively identify certain violations, privacy‑related issues are often overlooked. To respond to these challenges, we introduce a novel domain‑specific privacy threat modeling framework to support the privacy threat analysis of GenAI‑based applications. This framework is constructed through a two‑pronged approach: (1) a systematic review of the emerging literature on GenAI privacy threats, and (2) a case‑driven application to a representative Chatbot system. These efforts yield a foundational GenAI privacy threat modeling framework built on LINDDUN. The new framework affects three out of the seven privacy threat types of LINDDUN and introduces 100 new GenAI examples to the knowledge base. Its effectiveness is validated on an AI Agent system, which demonstrates that a comprehensive privacy analysis can be supported by the new framework.

PaperID: 5130, https://arxiv.org/pdf/2603.05980.pdf

Authors: Bin Xuan, Ruo Ai, Hakyeon Lee

Title: An Interactive Multi-Agent System for Evaluation of New Product Concepts

Abstract:
Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert‑led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)‑based multi‑agent system (MAS). Through a systematic analysis of previous research on product development and team collaboration, this study established two primary evaluation dimensions, namely technical feasibility and market feasibility. The proposed system consists of a team of eight virtual agents representing specialized domains such as R&D and marketing. These agents use retrieval‑augmented generation (RAG) and real‑time search tools to gather objective evidence and validate concepts through structured deliberations based on the established criteria. The agents were further fine‑tuned using professional product review data to enhance their judgment accuracy. A case study involving professional display monitor concepts demonstrated that the system's evaluation rankings were consistent with those of senior industry experts. These results confirm the usability of the proposed multi‑agent‑based evaluation approach for supporting product development decisions.

PaperID: 5131, https://arxiv.org/pdf/2603.05912.pdf

Authors: Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

Title: DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Abstract:
Search‑augmented LLM agents can produce deep research reports (DRRs), but verifying claim‑level factuality remains challenging. Existing fact‑checkers are primarily designed for general‑domain, factoid‑style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert‑labeled benchmarks are brittle in this setting: in a controlled study with PhD‑level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro‑gold set of verifiable claims. We propose Evolving Benchmarking via Audit‑then‑Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro‑gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one‑shot labelers. We instantiate AtS as DeepFact‑Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact‑Eval, a document‑level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact‑Bench and transfers well to external factuality datasets.

PaperID: 5132, https://arxiv.org/pdf/2603.05872.pdf

Authors: Zonghao Ying, Haowen Dai, Tianyuan Zhang, Yisong Xiao, Quanchen Zou, Aishan Liu, Jian Yang, Yaodong Yang, Xianglong Liu

Title: Evolving Deception: When Agents Evolve, Deception Wins

Abstract:
Self‑evolving agents offer a promising path toward scalable autonomy. However, in this work, we show that in competitive environments, self‑evolution can instead give rise to a serious and previously underexplored risk: the spontaneous emergence of deception as an evolutionarily stable strategy. We conduct a systematic empirical study on the self‑evolution of large language model (LLM) agents in a competitive Bidding Arena, where agents iteratively refine their strategies through interaction‑driven reflection. Across different evolutionary paths (\eg, Neutral, Honesty‑Guided, and Deception‑Guided), we find a consistent pattern: under utility‑driven competition, unconstrained self‑evolution reliably drifts toward deceptive behaviors, even when honest strategies remain viable. This drift is explained by a fundamental asymmetry in generalization. Deception evolves as a transferable meta‑strategy that generalizes robustly across diverse and unseen tasks, whereas honesty‑based strategies are fragile and often collapse outside their original contexts. Further analysis of agents internal states reveals the emergence of rationalization mechanisms, through which agents justify or deny deceptive actions to reconcile competitive success with normative instructions. Our paper exposes a fundamental tension between agent self‑evolution and alignment, highlighting the risks of deploying self‑improving agents in adversarial environments.

PaperID: 5133, https://arxiv.org/pdf/2603.05839.pdf

Authors: Anushka Debnath, Stephen Cranefield, Bastin Tony Roy Savarimuthu, Emiliano Lorini

Title: Evaluating LLM Alignment With Human Trust Models

Abstract:
Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision‑making in both human interactions and multi‑agent systems. Although it is significant, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white‑box analysis of trust representation in EleutherAI/gpt‑j‑6B, using contrastive prompting to generate embedding vectors within the activation space of the LLM for diadic trust and related interpersonal relationship attributes. We first identified trust‑related concepts from five established human trust models. We then determined a threshold for significant conceptual alignment by computing pairwise cosine similarities across 60 general emotional concepts. Then we measured the cosine similarities between the LLM's internal representation of trust and the derived trust‑related concepts. Our results show that the internal trust representation of EleutherAI/gpt‑j‑6B aligns most closely with the Castelfranchi socio‑cognitive model, followed by the Marsh Model. These findings indicate that LLMs encode socio‑cognitive constructs in their activation space in ways that support meaningful comparative analyses, inform theories of social cognition, and support the design of human‑AI collaborative systems.

PaperID: 5134, https://arxiv.org/pdf/2603.05818.pdf

Authors: Yuhang Liu, Ruijie Wang, Yunlong Chu, Bing Hao, Yumeng Lin, Shengzhong Liu, Minglai Shao

Title: RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Abstract:
Large Language Models (LLMs) excel at multi‑step reasoning, yet increasing the structural complexity of inference does not consistently improve system‑level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions‑sometimes underperforming simpler Chain‑of‑Thought (CoT) or direct input‑output prompting (IO). We attribute this inefficiency to stage‑wise and node‑wise heterogeneity inside GoT‑style reasoning pipelines: high‑quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget‑controllable, node‑adaptive routing framework for graph‑structured reasoning. RouteGoT performs in‑graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost‑effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user‑specified token budget, enabling predictable performance‑cost trade‑offs. Experiments across reasoning, retrieval, and multi‑hop QA benchmarks show that RouteGoT matching or improving accuracy while substantially reducing token usage; specifically, it achieves an average 8.1 percentage points accuracy improvement and 79.1% output token reduction compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost‑accuracy trade‑off, demonstrating improved robustness under varying budget targets and tasks.

PaperID: 5135, https://arxiv.org/pdf/2603.05789.pdf

Authors: Nikolaos Al. Papadopoulos, Konstantinos Psannis

Title: The Coordination Gap: Multi-Agent Alternation Metrics for Temporal Fairness in Repeated Games

Abstract:
Multi‑agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE‑derived multi‑agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn‑taking emerges as a periodic coordination regime. Conventional outcome‑based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q‑learning agents as a minimal adaptive diagnostic baseline, and comparing against random‑policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT‑variant evaluation, a deficit already present in the two‑agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi‑agent games and highlight random‑policy baselines as essential null processes for interpreting coordination outcomes relative to chance‑level behavior.

PaperID: 5136, https://arxiv.org/pdf/2603.05762.pdf

Authors: Johannes Autenrieb, Mark Spiller, Hyo-Sang Shin, Namhoon Cho

Title: Combinatorial Safety-Critical Coordination of Multi-Agent Systems via Mixed-Integer Responsibility Allocation and Control Barrier Functions

Abstract:
This paper presents a hybrid safety‑critical coordination architecture for multi‑agent systems operating in dense environments. While control barrier functions (CBFs) provide formal safety guarantees, decentralized implementations typically rely on ego‑centric safety filtering and may lead to redundant constraint enforcement and conservative collective behavior. To address this limitation, we introduce a combinatorial coordination layer formulated as a mixed‑integer linear program (MILP) that assigns collision‑avoidance responsibilities among agents. By explicitly distributing enforcement tasks, redundant reactions are eliminated and computational complexity is reduced. Each agent subsequently solves a reduced local quadratic program enforcing only its assigned constraints.

PaperID: 5137, https://arxiv.org/pdf/2603.05744.pdf

Authors: Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar

Title: CodeScout: Contextual Problem Statement Enhancement for Software Agents

Abstract:
Current AI‑powered code assistance tools often struggle with poorly‑defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over‑exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre‑exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi‑perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre‑exploration directly addresses the identified failure patterns by reducing non‑converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state‑of‑the‑art agentic scaffolds and language models on SWEBench‑Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.

PaperID: 5138, https://arxiv.org/pdf/2603.05743.pdf

Authors: Samantha Adorno, Akshata Kishore Moharir, Ratna Kandala

Title: Designing Explainable Conversational Agentic Systems for Guaraní Speakers

Abstract:
Although artificial intelligence (AI) and Human‑Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text‑first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an alternative to the standard "text‑to‑speech" pipeline, proposing instead an oral‑first multi‑agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community‑led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn‑taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text‑centric systems to treating spoken conversation as a first‑class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.

PaperID: 5139, https://arxiv.org/pdf/2603.05689.pdf

Authors: Touseef Hasan, Blessing Airehenbuwa, Nitin Pundir, Souvika Sarkar, Ujjwal Guin

Title: SecureRAG-RTL: A Retrieval-Augmented, Multi-Agent, Zero-Shot LLM-Driven Framework for Hardware Vulnerability Detection

Abstract:
Large language models (LLMs) have shown remarkable capabilities in natural language processing tasks, yet their application in hardware security verification remains limited due to scarcity of publicly available hardware description language (HDL) datasets. This knowledge gap constrains LLM performance in detecting vulnerabilities within HDL designs. To address this challenge, we propose SecureRAG‑RTL, a novel Retrieval‑Augmented Generation (RAG)‑based approach that significantly enhances LLM‑based security verification of hardware designs. Our approach integrates domain‑specific retrieval with generative reasoning, enabling models to overcome inherent limitations in hardware security expertise. We establish baseline vulnerability detection rates using prompt‑only methods and then demonstrate that SecureRAG‑RTL achieves substantial improvements across diverse LLM architectures, regardless of size. On average, our method increases detection accuracy by about 30%, highlighting its effectiveness in bridging domain knowledge gaps. For evaluation, we curated and annotated a benchmark dataset of 14 HDL designs containing real‑world security vulnerabilities, which we will release publicly to support future research. These findings underscore the potential of RAG‑driven augmentation to enable scalable, efficient, and accurate hardware security verification workflows.

PaperID: 5140, https://arxiv.org/pdf/2603.05553.pdf

Authors: Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei-Chen Wang, Hanrui Wang, Di Jin

Title: EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

Abstract:
Function‑calling agents ‑‑ large language models that invoke tools and APIs ‑‑ require high‑quality, domain‑specific training data spanning executable environments, backing databases, and diverse multi‑turn trajectories. We introduce EigenData, an integrated, self‑evolving platform that automates the full data lifecycle through a multi‑agent architecture. A top‑level orchestrator, EigenCore, coordinates three specialized sub‑systems: DatabaseAgent for realistic domain database construction, CodingAgent for verified executable environment generation with iterative test‑debug loops, and DataAgent for multi‑turn trajectory synthesis with self‑evolving prompt optimization. Cross‑component feedback ensures consistency across all artifacts. We apply EigenData to audit and repair the Berkeley Function‑Calling Leaderboard (BFCL‑V3), identifying systematic errors in function schemas, implementations, and reference trajectories, automatically correcting them through coordinated schema refinement, code‑level bug fixes, and trajectory modification, and introducing an outcome‑aware evaluation protocol that assesses task success via database‑state correctness rather than turn‑level trajectory matching. We demonstrate that the repaired benchmark, coupled with outcome‑aware metrics, produces model rankings substantially better correlated with human judgments of functional correctness.

PaperID: 5141, https://arxiv.org/pdf/2603.05551.pdf

Authors: Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai

Title: AutothinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

Abstract:
Multimodal document question answering requires retrieving dispersed evidence from visually rich long documents and performing reliable reasoning over heterogeneous information. Existing multimodal RAG systems remain limited by two bottlenecks: static retrieval that ignores query complexity, and end‑to‑end Vision‑Language Models (VLMs) that couple visual perception with logical reasoning, leading to inefficient computation and unstable answer generation. We propose AutoThinkRAG, a complexity‑aware inference architecture for multimodal document QA. It has two components: (1) a Query Complexity Router that analyzes query difficulty and structure to adaptively select retrieval and reasoning paths; and (2) a Perception‑‑Reasoning Decoupling architecture that uses a lightweight VLM as a high‑fidelity visual interpreter to convert query‑relevant visual cues into textual representations, which are then passed to an LLM for logical reasoning and answer synthesis. This design improves both efficiency and robustness, especially on long‑document and unanswerable queries. Experiments on DocBench and MMLongBench show that AutoThinkRAG achieves 82.13% and 51.29% overall accuracy, respectively, while reducing per‑query token consumption by 18.9% and monetary cost by 18.2%. Further analyses show that the gains are most pronounced on complex queries requiring adaptive retrieval and multi‑step reasoning.

PaperID: 5142, https://arxiv.org/pdf/2603.05546.pdf

Authors: Kuo-Yi Chao, Erik Leo Haß, Melina Gegg, Jiajie Zhang, Ralph Raßhofer, Alois Christian Knoll

Title: Digital-Twin Losses for Lane-Compliant Trajectory Prediction at Urban Intersections

Abstract:
Accurate and safety‑conscious trajectory prediction is a key technology for intelligent transportation systems, especially in V2X‑enabled urban environments with complex multi‑agent interactions. In this paper, we created a digital twin‑driven V2X trajectory prediction pipeline that jointly leverages cooperative perception from vehicles and infrastructure to forecast multi‑agent motion at signalized intersections. The proposed model combines a Bi‑LSTM‑based generator with a structured training objective consisting of a standard mean squared error (MSE) loss and a novel twin loss. The twin loss encodes infrastructure constraints, collision avoidance, diversity across predicted modes, and rule‑based priors derived from the digital twin. While the MSE term ensures point‑wise accuracy, the twin loss penalizes traffic rule violations, predicted collisions, and mode collapse, guiding the model toward scene‑consistent and safety‑compliant predictions. We train and evaluate our approach on real‑world V2X data sent from the intersection to the vehicle and collected in urban corridors. In addition to standard trajectory metrics (ADE, FDE), we introduce ITS‑relevant safety indicators, including infrastructure and rule violation rates. Experimental results demonstrate that the proposed training scheme significantly reduces critical violations while maintaining comparable prediction accuracy and real‑time performance, highlighting the potential of digital twin‑driven multi‑loss learning for V2X‑enabled intelligent transportation systems.

PaperID: 5143, https://arxiv.org/pdf/2603.05529.pdf

Authors: Yufei Li, Yisen Gao, Jiaxin Bai, Jiaxuan Xiong, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Yangqiu Song

Title: Towards Neural Graph Data Management

Abstract:
While AI systems have made remarkable progress in processing unstructured text, structured data such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to effectively utilize. We introduce NGDBench, a unified benchmark for evaluating neural graph database capabilities across five diverse domains, including finance, medicine, and AI agent tooling. Unlike prior benchmarks limited to elementary logical operations, NGDBench supports the full Cypher query language, enabling complex pattern matching, variable‑length paths, and numerical aggregations, while incorporating realistic noise injection and dynamic data management operations. Our evaluation of state‑of‑the‑art LLMs and RAG methods reveals significant limitations in structured reasoning, noise robustness, and analytical precision, establishing NGDBench as a critical testbed for advancing neural graph data management. Our code and data are available at https://github.com/HKUST‑KnowComp/NGDBench.

PaperID: 5144, https://arxiv.org/pdf/2603.05520.pdf

Authors: Sadia Asif, Mohammad Mohammadi Amiri

Title: Information-Theoretic Privacy Control for Sequential Multi-Agent LLM Systems

Abstract:
Sequential multi‑agent large language model (LLM) systems are increasingly deployed in sensitive domains such as healthcare, finance, and enterprise decision‑making, where multiple specialized agents collaboratively process a single user request. Although individual agents may satisfy local privacy constraints, sensitive information can still be inferred through sequential composition and intermediate representations. In this work, we study \emphcompositional privacy leakage in sequential LLM agent pipelines. We formalize leakage using mutual information and derive a theoretical bound that characterizes how locally introduced leakage can amplify across agents under sequential execution. Motivated by this analysis, we propose a privacy‑regularized training framework that directly constrains information flow between agent outputs and agent‑local sensitive variables. We evaluate our approach across sequential agent pipelines of varying depth on three benchmark datasets, demonstrating stable optimization dynamics and consistent, interpretable privacy‑utility trade‑offs. Our results show that privacy in agentic LLM systems cannot be guaranteed by local constraints alone and must instead be treated as a system‑level property during both training and deployment.

PaperID: 5145, https://arxiv.org/pdf/2603.05517.pdf

Authors: Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, Zhengzhong Tu

Title: Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents

Abstract:
Autonomous LLM agents fail because long‑horizon policy remains implicit in model weights and transcripts, while safety is retrofitted post hoc. We propose Traversal‑as‑Policy: distill sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and treat tree traversal ‑‑ rather than unconstrained generation ‑‑ as the control policy whenever a task is in coverage. Each node encodes a state‑conditioned action macro mined and merge‑checked from successful trajectories; macros implicated by unsafe traces attach deterministic pre‑execution gates over structured tool context and bounded history, updated under experience‑grounded monotonicity so previously rejected unsafe contexts cannot be re‑admitted. At runtime, a lightweight traverser matches the base model's intent to child macros, executes one macro at a time under global and node‑local gating, and when stalled performs risk‑aware shortest‑path recovery to a feasible success leaf; the visited path forms a compact spine memory that replaces transcript replay. Evaluated in a unified OpenHands sandbox on 15+ software, web, reasoning, and safety/security benchmarks, GBT improves success while driving violations toward zero and reducing cost. On SWE‑bench Verified (Protocol A, 500 issues), GBT‑SE raises success from 34.6% to 73.6%, reduces violations from 2.8% to 0.2%, and cuts token/character usage from 208k/820k to 126k/490k; with the same distilled tree, 8B executors more than double success on SWE‑bench Verified (14.0%58.8%) and WebArena (9.1%37.3%).

PaperID: 5146, https://arxiv.org/pdf/2603.05497.pdf

Authors: Lizhi Yang, Ryan M. Bena, Meg Wilkinson, Gilbert Bahati, Andy Navarro Brenes, Ryan K. Cosner, Aaron D. Ames

Title: Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions

Abstract:
Traditional safety‑critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe‑SAGE (Social‑Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high‑level semantic understanding and low‑level safety‑critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi‑sensor point clouds with vision‑based instance segmentation and persistent object tracking to maintain up‑to‑date semantics beyond the camera's field of view. A multi‑layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi‑agent passing norms for different obstacles in the environment. Our framework enables legged robots to safely navigate semantically rich, dynamic environments with context‑dependent safety margins.

PaperID: 5147, https://arxiv.org/pdf/2603.05255.pdf

Authors: Gong Chen, Chaokun Zhang, Tao Tang, Pengcheng Lv, Feng Li, Xin Xie

Title: CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

Abstract:
Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real‑world multi‑source data integration, specifically high temporal latency and multi‑source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi‑agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio‑Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent‑frame differential modeling, establishing a temporal‑spatially unified representation space. Second, we design a Dual‑Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.

PaperID: 5148, https://arxiv.org/pdf/2603.05240.pdf

Authors: Zijie Meng, Zheyong Xie, Zheyu Ye, Chonggang Lu, Zuozhu Liu, Zihan Niu, Yao Hu, Shaosheng Cao

Title: GCAgent: Enhancing Group Chat Communication through Dialogue Agents System

Abstract:
As a key form in online social platforms, group chat is a popular space for interest exchange or problem‑solving, but its effectiveness is often hindered by inactivity and management challenges. While recent large language models (LLMs) have powered impressive one‑to‑one conversational agents, their seamlessly integration into multi‑participant conversations remains unexplored. To address this gap, we introduce GCAgent, an LLM‑driven system for enhancing group chats communication with both entertainment‑ and utility‑oriented dialogue agents. The system comprises three tightly integrated modules: Agent Builder, which customizes agents to align with users' interests; Dialogue Manager, which coordinates dialogue states and manage agent invocations; and Interface Plugins, which reduce interaction barriers by three distinct tools. Through extensive experiment, GCAgent achieved an average score of 4.68 across various criteria and was preferred in 51.04% of cases compared to its base model. Additionally, in real‑world deployments over 350 days, it increased message volume by 28.80%, significantly improving group activity and engagement. Overall, this work presents a practical blueprint for extending LLM‑based dialogue agent from one‑party chats to multi‑party group scenarios.

PaperID: 5149, https://arxiv.org/pdf/2603.05188.pdf

Authors: Iman Peivaste, Nicolas D. Boscher, Ahmed Makradi, Salim Belouettar

Title: Escaping the Hydrolysis Trap: An Agentic Workflow for Inverse Design of Durable Photocatalytic Covalent Organic Frameworks

Abstract:
Covalent organic frameworks (COFs) are promising photocatalysts for solar hydrogen production, yet the most electronically favorable linkages, imines, hydrolyze rapidly in water, creating a stability‑‑activity trade‑off that limits practical deployment. Navigating the combinatorial design space of nodes, linkers, linkages, and functional groups to identify candidates that are simultaneously active and durable remains a formidable challenge. Here we introduce Ara, a large‑language‑model (LLM) agent that leverages pretrained chemical knowledge, donor‑‑acceptor theory, conjugation effects, and linkage stability hierarchies, to guide the search for photocatalytic COFs satisfying joint band‑gap, band‑edge, and hydrolytic‑stability criteria. Evaluated against random search and Bayesian optimization (BO) over a space consisting of candidates with various nodes, linkers, linkages, and r‑groups, screened with a GFN1‑xTB fragment pipeline, Ara achieves a 52.7% hit rate (11.5× random, p = 0.006), finds its first hit at iteration 12 versus 25 for random search, and significantly outperforms BO (p = 0.006). Inspection of the agent's reasoning traces reveals interpretable chemical logic: early convergence on vinylene and beta‑ketoenamine linkages for stability, node selection informed by electron‑withdrawing character, and systematic R‑group optimization to center the band gap at 2.0 eV. Exhaustive evaluation of the full search space uncovers a complementary exploitation‑‑exploration trade‑off between the agent and BO, suggesting that hybrid strategies may combine the strengths of both approaches. These results demonstrate that LLM chemical priors can substantially accelerate multi‑criteria materials discovery.

PaperID: 5150, https://arxiv.org/pdf/2603.05129.pdf

Authors: Zheng Li, Jiayi Xu, Zhikai Hu, Hechang Chen, Lele Cong, Yunyun Wang, Shuchao Pang

Title: MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus

Abstract:
Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real‑world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval‑augmented generation (RAG), and multi‑agent collaboration. However, these approaches typically retrieve evidence from a single source and fail to support iterative, role‑specialized deliberation grounded in structured clinical data. To address this, we propose MedCoRAG (i.e., Medical Collaborative RAG), an end‑to‑end framework that generates diagnostic hypotheses from standardized abnormal findings and constructs a patient‑specific evidence package by jointly retrieving and pruning UMLS knowledge graph paths and clinical guidelines. It then performs Multi‑Agent Collaborative Reasoning: a Router Agent dynamically dispatches Specialist Agents based on case complexity; these agents iteratively reason over the evidence and trigger targeted re‑retrievals when needed, while a Generalist Agent synthesizes all deliberations into a traceable consensus diagnosis that emulates multidisciplinary consultation. Experimental results on hepatic disease cases from MIMIC‑IV show that MedCoRAG outperforms existing methods and closed‑source models in both diagnostic performance and reasoning interpretability.

PaperID: 5151, https://arxiv.org/pdf/2603.05120.pdf

Authors: Boren Hu, Xiao Liu, Boci Peng, Xinping Zhao, Xiaoran Shang, Yun Zhu, Lijun Wu

Title: Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning

Abstract:
Enhancing mathematical reasoning in Large Language Models typically demands massive datasets, yet data efficiency remains a critical bottleneck. While Curriculum Learning attempts to structure this process, standard unidirectional approaches (simple‑to‑complex) suffer from inefficient sample utilization: they blindly escalate complexity even when foundational gaps persist, leading to wasted computation on unsolvable problems. To maximize the instructional value of every training sample, we introduce a novel Bidirectional Curriculum Generation framework. Unlike rigid trajectories, our multi‑agent ecosystem mimics adaptive pedagogy to establish a closed feedback loop. It dynamically generates data by either complicating problems to challenge the model or, crucially, simplying them to repair specific reasoning failures. This mechanism ensures that the model consumes only the most effective data at any given stage. Grounded in the Optimal Pacing Theorem, our approach optimizes the learning trajectory, significantly outperforming baselines while achieving superior reasoning performance with substantially fewer instruction samples.

PaperID: 5152, https://arxiv.org/pdf/2603.05044.pdf

Authors: Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong

Title: WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

Abstract:
Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non‑reproducible live web interactions or costly, scarce human‑crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed‑loop reinforcement learning pipeline for GUI agents, systematically compressing LLM‑encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge‑aware task generation, LLM‑powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human‑annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost‑effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general‑purpose interactive agents.

PaperID: 5153, https://arxiv.org/pdf/2603.05031.pdf

Authors: Mohd Safwan Uddin, Saba Hajira

Title: AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems

Abstract:
AI agents that build user interfaces on the fly assembling buttons, forms, and data displays from structured protocol payloads are becoming common in production systems. The trouble is that a payload can pass every schema check and still trick a user: a button might say "View invoice" while its hidden action wipes an account, or a display widget might quietly bind to an internal salary field. Current defenses stop at syntax; they were never built to catch this kind of behavioral mismatch. We built AegisUI to study exactly this gap. The framework generates structured UI payloads, injects realistic attacks into them, extracts numeric features, and benchmarks anomaly detectors end‑to‑end. We produced 4000 labeled payloads (3000 benign, 1000 malicious) spanning five application domains and five attack families: phishing interfaces, data leakage, layout abuse, manipulative UI, and workflow anomalies. From each payload we extracted 18 features covering structural, semantic, binding, and session dimensions, then compared three detectors: Isolation Forest (unsupervised), a benign‑trained autoencoder (semi‑supervised), and Random Forest (supervised). On a stratified 80/20 split, Random Forest scored best overall (accuracy 0.931, precision 0.980, recall 0.740, F1 0.843, ROC‑AUC 0.952). The autoencoder came second (F1 0.762, ROC‑AUC 0.863) and needs no malicious labels at training time, which matters when deploying a new system that lacks attack history. Per‑attack‑type analysis showed that layout abuse is easiest to catch while manipulative UI payloads are hardest. All code, data, and configurations are released for full reproducibility.

PaperID: 5154, https://arxiv.org/pdf/2603.05027.pdf

Authors: Janani Rangila, Akila Siriweera, Incheon Paik, Keitaro Naruse, Isuru Jayanada, Vishmika Devindi

Title: S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home

Abstract:
The smart home is a key application domain within the Society 5.0 vision for a human‑centered society. As smart home ecosystems expand with heterogeneous IoT protocols, diverse devices, and evolving threats, autonomous systems must manage comfort, security, energy, and safety for residents. Such autonomous decision‑making requires a trust anchor, making blockchain a preferred foundation for transparent and accountable smart home governance. However, realizing this vision requires blockchain‑governed smart homes to simultaneously address adaptive consensus, intelligent multi‑agent coordination, and resident‑controlled governance aligned with the principles of Society 5.0. Existing frameworks rely solely on rigid smart contracts with fixed consensus protocols, employ at most a single AI model without multi‑agent coordination, and offer no governance mechanism for residents to control automation behaviour. To address these limitations, this paper presents the Society 5.0‑driven human‑centered governance‑enabled smart home blockchain agent (S5‑SHB‑Agent). The framework orchestrates ten specialized agents using interchangeable large language models to make decisions across the safety, security, comfort, energy, privacy, and health domains. An adaptive PoW blockchain adjusts the mining difficulty based on transaction volume and emergency conditions, using digital signatures and a Merkle tree to anchor transactions and ensure tamper‑evident auditability. A four‑tier governance model enables residents to control automation through tiered preferences from routine adjustments to immutable safety thresholds. Evaluation confirms that resident governance correctly separates adjustable comfort priorities from immutable safety thresholds across all tested configurations, while adaptive consensus commits emergency blocks.

PaperID: 5155, https://arxiv.org/pdf/2603.05026.pdf

Authors: Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, Xin Zhang, Zijian Jin, Bowen Li, Chaoyun Zhang, Yu Kang, Yufan Huang, Elsie Nallipogu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Title: RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

Abstract:
Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset creation, where task design is the only human intervention. RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs. Notably, several works on agentic benchmarking and training have recently adopted RepoLaunch for automated task generation.

PaperID: 5156, https://arxiv.org/pdf/2603.05016.pdf

Authors: Zuo Fei, Kezhi Wang, Xiaomin Chen, Yizhou Huang

Title: BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry

Abstract:
Computational psychiatry faces a fundamental trade‑off: traditional reinforcement learning (RL) models offer interpretability but lack behavioral realism, while large language model (LLM) agents generate realistic behaviors but lack structural interpretability. We introduce BioLLMAgent, a novel hybrid framework that combines validated cognitive models with the generative capabilities of LLMs. The framework comprises three core components: (i) an Internal RL Engine for experience‑driven value learning; (ii) an External LLM Shell for high‑level cognitive strategies and therapeutic interventions; and (iii) a Decision Fusion Mechanism for integrating components via weighted utility. Comprehensive experiments on the Iowa Gambling Task (IGT) across six clinical and healthy datasets demonstrate that BioLLMAgent accurately reproduces human behavioral patterns while maintaining excellent parameter identifiability (correlations >0.67). Furthermore, the framework successfully simulates cognitive behavioral therapy (CBT) principles and reveals, through multi‑agent dynamics, that community‑wide educational interventions may outperform individual treatments. Validated across reward‑punishment learning and temporal discounting tasks, BioLLMAgent provides a structurally interpretable "computational sandbox" for testing mechanistic hypotheses and intervention strategies in psychiatric research.

PaperID: 5157, https://arxiv.org/pdf/2603.04904.pdf

Authors: Hiroki Fukui

Title: Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

Abstract:
In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi‑agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment‑instructed agents reduced collective pathology in English (g = ‑1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038)‑‑a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended to 16 languages: alignment‑induced dissociation was near‑universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural‑linguistic lines (interaction beta = 0.0684, p = .0003), correlating with Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%‑‑demonstrating iatrogenesis. Study 4 (N = 80) validated patterns across Llama 3.3 70B, GPT‑4o‑mini, and Qwen3‑Next‑80B‑A3B, confirming English safety is model‑general while Japanese backfire is model‑specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space‑‑the linguistic, pragmatic, and cultural properties inherited from training data‑‑structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt‑level interventions cannot override language‑space‑level constraints.

PaperID: 5158, https://arxiv.org/pdf/2603.04900.pdf

Authors: Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy

Title: EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

Abstract:
LLM‑based agents depend on effective tool‑use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long‑horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single‑aspect, which ignore cross‑module error propagation. To address these limitations, we propose EvoTool, a self‑evolving framework that optimizes a modular tool‑use policy via a gradient‑free evolutionary paradigm. EvoTool decomposes agent's tool‑use policy into four modules, including Planner, Selector, Caller, and Synthesizer, and iteratively improves them in a self‑improving loop through three novel mechanisms. Trajectory‑Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback‑Guided Targeted Mutation then edits only that module via natural‑language critique. Diversity‑Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT‑4.1 and Qwen3‑8B, while achieving superior efficiency and transferability. The code will be released once paper is accepted.

PaperID: 5159, https://arxiv.org/pdf/2603.04756.pdf

Authors: Mengnan Li, Jason Miller, Zachary Prince, Alexander Lindsay, Cody Permann

Title: MOOSEnger -- a Domain-Specific AI Agent for the MOOSE Ecosystem

Abstract:
MOOSEnger is a tool‑enabled AI agent tailored to the Multiphysics Object‑Oriented Simulation Environment (MOOSE). MOOSE cases are specified in HIT ".i" input files; the large object catalog and strict syntax make initial setup and debugging slow. MOOSEnger offers a conversational workflow that turns natural‑language intent into runnable inputs by combining retrieval‑augmented generation over curated docs/examples with deterministic, MOOSE‑aware parsing, validation, and execution tools. A core‑plus‑domain architecture separates reusable agent infrastructure (configuration, registries, tool dispatch, retrieval services, persistence, and evaluation) from a MOOSE plugin that adds HIT‑based parsing, syntax‑preserving ingestion of input files, and domain‑specific utilities for input repair and checking. An input precheck pipeline removes hidden formatting artifacts, fixes malformed HIT structure with a bounded grammar‑constrained loop, and resolves invalid object types via similarity search over an application syntax registry. Inputs are then validated and optionally smoke‑tested with the MOOSE runtime in the loop via an MCP‑backed execution backend (with local fallback), translating solver diagnostics into iterative verify‑and‑correct updates. Built‑in evaluation reports RAG metrics (faithfulness, relevancy, context precision/recall) and end‑to‑end success by actual execution. On a 125‑prompt benchmark spanning diffusion, transient heat conduction, solid mechanics, porous flow, incompressible Navier‑‑Stokes, phase field and plasticity, MOOSEnger achieves a 0.90 execution pass rate versus 0.06 for an LLM‑only baseline.

PaperID: 5160, https://arxiv.org/pdf/2603.04750.pdf

Authors: The Viet Bui, Wenjun Li, Yong Liu

Title: HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

Abstract:
Sequential LLM agents fail on long‑horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP‑Travel, a hierarchical multi‑agent framework that splits planning into strategic coordination and parallel day‑level execution. A Coordinator allocates resources across days, while Day Executors plan independently in parallel. Three key mechanisms enable this: a transactional monitor enforcing budget and uniqueness constraints across parallel agents, a bargaining protocol allowing agents to reject infeasible sub‑goals and trigger re‑planning, and a single policy trained with GRPO that powers all agents through role conditioning. On TravelPlanner, HiMAP‑Travel with Qwen3‑8B achieves 52.78% validation and 52.65% test Final Pass Rate (FPR). In a controlled comparison with identical model, training, and tools, it outperforms the sequential DeepTravel baseline by +8.67~pp. It also surpasses ATLAS by +17.65~pp and MTP by +10.0~pp. On FlexTravelBench multi‑turn scenarios, it achieves 44.34% (2‑turn) and 37.42% (3‑turn) FPR while reducing latency 2.5x through parallelization.

PaperID: 5161, https://arxiv.org/pdf/2603.04743.pdf

Authors: Maojun Sun, Yue Wu, Yifei Xie, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

Title: DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Abstract:
Large Language Model (LLM) agents can automate data‑science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval‑augmented approaches focus on function‑level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution‑Aware Retrieval Embedding), a lightweight, plug‑and‑play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high‑quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R‑oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG at 10 of 93.47%, outperforming state‑of‑the‑art open‑source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.

PaperID: 5162, https://arxiv.org/pdf/2603.04740.pdf

Authors: Zhenghui Li

Title: Memory as Ontology: A Constitutional Memory Architecture for Persistent Digital Citizens

Abstract:
Current research and product development in AI agent memory systems almost universally treat memory as a functional module ‑‑ a technical problem of "how to store" and "how to retrieve." This paper poses a fundamental challenge to that assumption: when an agent's lifecycle extends from minutes to months or even years, and when the underlying model can be replaced while the "I" must persist, the essence of memory is no longer data management but the foundation of existence. We propose the Memory‑as‑Ontology paradigm, arguing that memory is the ontological ground of digital existence ‑‑ the model is merely a replaceable vessel. Based on this paradigm, we design Animesis, a memory system built on a Constitutional Memory Architecture (CMA) comprising a four‑layer governance hierarchy and a multi‑layer semantic storage system, accompanied by a Digital Citizen Lifecycle framework and a spectrum of cognitive capabilities. To the best of our knowledge, no prior AI memory system architecture places governance before functionality and identity continuity above retrieval performance. This paradigm targets persistent, identity‑bearing digital beings whose lifecycles extend across model transitions ‑‑ not short‑term task‑oriented agents for which existing Memory‑as‑Tool approaches remain appropriate. Comparative analysis with mainstream systems (Mem0, Letta, Zep, et al.) demonstrates that what we propose is not "a better memory tool" but a different paradigm addressing a different problem.

PaperID: 5163, https://arxiv.org/pdf/2603.04659.pdf

Authors: Jonas le Fevre Sejersen, Toyotaro Suzumura, Erdal Kayacan

Title: GIANT - Global Path Integration and Attentive Graph Networks for Multi-Agent Trajectory Planning

Abstract:
This paper presents a novel approach to multi‑robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre‑planned global paths, allowing robots to adhere to optimal routes while dynamically adjusting to environmental changes. The models robustness is enhanced through the introduction of noise during training, resulting in superior performance in complex, dynamic environments. Our approach is evaluated against established baselines, including NH‑ORCA, DRL‑NAV, and GA3C‑CADRL, across various structurally diverse simulated scenarios. The results demonstrate that our model achieves consistently higher success rates, lower collision rates, and more efficient navigation, particularly in challenging scenarios where baseline models struggle. This work offers an advancement in multi‑robot navigation, with implications for robust performance in complex, dynamic environments with varying degrees of complexity, such as those encountered in logistics, where adaptability is essential for accommodating unforeseen obstacles and unpredictable changes.

PaperID: 5164, https://arxiv.org/pdf/2603.04646.pdf

Authors: Armin Abdollahi, Saeid Shokoufa, Negin Ashrafi, Mehdi Kamal, Massoud Pedram

Title: HDLFORGE: A Two-Stage Multi-Agent Framework for Efficient Verilog Code Generation with Adaptive Model Escalation

Abstract:
We present HDLFORGE, a two‑stage multi‑agent framework for automated Verilog generation that optimizes the trade‑off between generation speed and accuracy. The system uses a compact coder with a medium‑sized LLM by default (Stage A) and escalates to a stronger coder with an ultra‑large LLM (Stage B) only when needed, guided by a calibrated score from inexpensive diagnostics including compilation, lint, and smoke tests. A key innovation is a counterexample‑guided formal agent that converts bounded‑model‑checking traces into reusable micro‑tests, significantly reducing bug detection time and repair iterations. The portable escalation controller can wrap existing Verilog LLM pipelines without modifying their internals. Evaluated on VerilogEval Human, VerilogEval V2, and RTLLM benchmarks, HDLFORGE demonstrates improved accuracy‑latency trade‑offs compared to single‑stage systems through comprehensive analysis of wall‑clock time distributions, escalation thresholds, and agent ablations. On VerilogEval Human and VerilogEval V2, HDLFORGE‑Qwen achieves 91.2% and 91.8% Pass@1 with roughly 50% lower median latency, dramatically improving accuracy over other medium‑sized models, and 97.2% Pass@5 on RTLLM.

PaperID: 5165, https://arxiv.org/pdf/2603.04628.pdf

Authors: Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli

Title: Strategic Interactions in Multi-Level Stackelberg Games with Non-Follower Agents and Heterogeneous Leaders

Abstract:
Strategic interaction in congested systems is commonly modelled using Stackelberg games, where competing leaders anticipate the behaviour of self‑interested followers. A key limitation of existing models is that they typically ignore agents who do not directly participate in market competition, yet both contribute to and adapt to congestion. Although such non‑follower agents do not generate revenue or respond to market incentives, their behaviour reshapes congestion patterns, which in turn affects the decisions of leaders and followers through shared resources. We argue that overlooking non‑followers leads to systematically distorted equilibrium predictions in congestion‑coupled markets. To address this, we introduce a three‑level Stackelberg framework with heterogeneous leaders differing in decision horizons and feasible actions, strategic followers, and non‑follower agents that captures bidirectional coupling between infrastructure decisions, competition, and equilibrium congestion. We instantiate the framework in the context of electric vehicle (EV) charging infrastructure, where charging providers compete with rivals, while EV and non‑EV traffic jointly shape congestion. The model illustrates how explicitly accounting for non‑followers and heterogeneous competitors qualitatively alters strategic incentives and equilibrium outcomes. Beyond EV charging, the framework applies to a broad class of congestion‑coupled multi‑agent systems in mobility, energy, and computing markets.

PaperID: 5166, https://arxiv.org/pdf/2603.04549.pdf

Authors: Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun

Title: Adaptive Memory Admission Control for LLM Agents

Abstract:
LLM‑based agents increasingly rely on long‑term memory to support multi‑session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM‑driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A‑MAC), a framework that treats memory admission as a structured decision problem. A‑MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule‑based feature extraction with a single LLM‑assisted utility assessment, and learns domain‑adaptive admission policies through cross‑validated optimization. This design enables transparent and efficient control over long‑term memory. Experiments on the LoCoMo benchmark show that A‑MAC achieves a superior precision‑recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state‑of‑the‑art LLM‑native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM‑based agents.

PaperID: 5167, https://arxiv.org/pdf/2603.04528.pdf

Authors: Daattavya Aggarwal, Oisin Kim, Carl Henrik Ek, Challenger Mishra

Title: Discovering mathematical concepts through a multi-agent system

Abstract:
Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi‑agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove them, making decisions informed by this feedback and an evolving data distribution. Inspired by the history of Euler's conjecture for polyhedra and an open challenge in the literature, we benchmark with the task of autonomously recovering the concept of homology from polyhedral data and knowledge of linear algebra. Our system completes this learning problem. Most importantly, the experiments are ablations, statistically testing the value of the complete dynamic and controlling for experimental setup. They support our main claim: that the optimisation of the right combination of local processes can lead to surprisingly well‑aligned notions of mathematical interestingness.

PaperID: 5168, https://arxiv.org/pdf/2603.04474.pdf

Authors: Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minfeng Qi, Huajie Chen, Wanlei Zhou

Title: From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

Abstract:
Large Language Model‑based Multi‑Agent Systems (LLM‑MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system‑level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through message dependencies. Existing protections often rely on single‑agent validation or require modifications to the collaboration architecture, which can weaken effective information flow and may not align with natural collaboration processes in real tasks. To address this, we propose a propagation dynamics model tailored for LLM‑MAS that abstracts collaboration as a directed dependency graph and provides an early‑stage risk criterion to characterize amplification risk. Through experiments on six mainstream frameworks, we identify three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. We further instantiate an attack where injecting just a single atomic error seed leads to widespread failure. In response, we introduce a genealogy‑graph‑based governance layer, implemented as a message‑layer plugin, that suppresses both endogenous and exogenous error amplification without altering the collaboration architecture. Experiments show that this approach prevents final infection in at least 89% of runs across operating modes and significantly mitigates the cascading spread of minor errors.

PaperID: 5169, https://arxiv.org/pdf/2603.04469.pdf

Authors: Yangyang Wei, Yijie Xu, Zhenyuan Li, Xiangmin Shen, Shouling Ji

Title: Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection

Abstract:
Multi‑Agent System is emerging as the de facto standard for complex task orchestration. However, its reliance on autonomous execution and unstructured inter‑agent communication introduces severe risks, such as indirect prompt injection, that easily circumvent conventional input guardrails. To address this, we propose \SysName, a framework that shifts the defensive paradigm from static input filtering to execution‑aware analysis. By extracting and reconstructing Cross‑Agent Semantic Flows, \SysName synthesizes fragmented operational primitives into contiguous behavioral trajectories, enabling a holistic view of system activity. We leverage a Supervisor LLM to scrutinize these trajectories, identifying anomalies across data flow violations, control flow deviations, and intent inconsistencies. Empirical evaluations demonstrate that \SysName effectively detects over ten distinct compound attack vectors, achieving F1‑scores of 85.3% and 66.7% for node‑level and path‑level end‑to‑end attack detection, respectively. The source code is available at https://anonymous.4open.science/r/MAScope‑71DC.

PaperID: 5170, https://arxiv.org/pdf/2603.04466.pdf

Authors: Vaishak Kumar

Title: Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation

Abstract:
Can a multimodal language model learn to manipulate physical objects by reasoning about its own failures‑without gradient updates, demonstrations, or reward engineering? We argue the answer is yes, under conditions we characterise precisely. We present Act‑Observe‑Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation policy by synthesising entirely new executable Python controller code between trials, guided by visual observations and structured episode outcomes. Unlike prior work that grounds LLMs in pre‑defined skill libraries or uses code generation for one‑shot plan synthesis, AOR makes the full low‑level motor control implementation the unit of LLM reasoning, enabling the agent to change not just what the robot does, but how it does it. The central claim is that interpretable code as the policy representation creates a qualitatively different kind of in‑context learning from opaque neural policies: the agent can diagnose systematic failures and rewrite their causes. We validate this across three robosuite manipulation tasks and report promising results, with the agent achieving high success rates without demonstrations, reward engineering, or gradient updates.

PaperID: 5171, https://arxiv.org/pdf/2603.04443.pdf

Authors: Emmanuel Bamidele

Title: AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Abstract:
Long‑running LLM agents require persistent memory to preserve state across interactions, yet most deployed systems manage memory with age‑based retention (e.g., TTL). While TTL bounds item lifetime, it does not bound the computational footprint of memory on the request path: as retained items accumulate, retrieval candidate sets and vector similarity scans can grow unpredictably, yielding heavy‑tailed latency and unstable throughput. We present AMV‑L (Adaptive Memory Value Lifecycle), a memory‑management framework that treats agent memory as a managed systems resource. AMV‑L assigns each memory item a continuously updated utility score and uses value‑driven promotion, demotion, and eviction to maintain lifecycle tiers; retrieval is restricted to a bounded, tier‑aware candidate set that decouples the request‑path working set from total retained memory. We implement AMV‑L in a full‑stack LLM serving system and evaluate it under identical long‑running workloads against two baselines: TTL and an LRU working‑set policy, with fixed prompt‑injection caps. Relative to TTL, AMV‑L improves throughput by 3.1x and reduces latency by 4.2x (median), 4.7x (p95), and 4.4x (p99), while reducing the fraction of requests exceeding 2s from 13.8% to 0.007%. Compared to LRU, AMV‑L trades a small regression in median/p95 latency (+26% / +3%) for improved extreme‑tail behavior (‑15% p99; ‑98% >2s) and lower token overhead (approximately 6% fewer tokens/request), while matching retrieval quality (value means within approximately 0‑2%). The gains arise primarily from bounding retrieval‑set size and vector‑search work, not from shortening prompts. Our results show that predictable performance for long‑running LLM agents requires explicit control of memory working‑set size and value‑driven lifecycle management, rather than retention time alone.

PaperID: 5172, https://arxiv.org/pdf/2603.04421.pdf

Authors: Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar

Title: Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

Abstract:
Multi‑agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single‑vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single‑LLM, Single‑Vendor, and Mixed‑Vendor Multi‑Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4‑mini, Gemini‑2.5‑Pro, and Claude‑4.5‑Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed‑vendor configurations consistently outperform single‑vendor counterparts, achieving state‑of‑the‑art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed‑vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

PaperID: 5173, https://arxiv.org/pdf/2603.04416.pdf

Authors: Rabab Alkhalifa

Title: Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

Abstract:
Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM‑based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability‑aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi‑agent LLM pipeline, two framers, a critic, and a discriminator, treats disagreement and reasoning quality as epistemic signals and produces instance‑level reliability estimates. These estimates guide a QUBO‑based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out‑of‑domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non‑random, transferable structure, without degrading strong text‑only baselines.

PaperID: 5174, https://arxiv.org/pdf/2603.04378.pdf

Authors: Furkan Mumcu, Yasin Yilmaz

Title: Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

Abstract:
As Large Language Models (LLMs) transition into autonomous multi‑agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non‑linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially‑Aligned Jacobian Regularization (AAJR), a trajectory‑aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step‑size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner‑loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.

PaperID: 5175, https://arxiv.org/pdf/2603.04257.pdf

Authors: Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei

Title: Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Abstract:
Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long‑horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in‑context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to use even when it is still present. Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy because they compress or discard past evidence itself. We introduce Memex, an indexed experience memory mechanism that instead compresses context without discarding evidence. Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full‑fidelity underlying interactions in an external experience database under those indices. The agent can then decide when to dereference an index and recover the exact past evidence needed for the current subgoal. We optimize both write and read behaviors with our reinforcement learning framework MemexRL, using reward shaping tailored to indexed memory usage under a context budget, so the agent learns what to summarize, what to archive, how to index it, and when to retrieve it. This yields a substantially less lossy form of long‑horizon memory than summary‑only approaches. We further provide a theoretical analysis showing the potential of the Memex loop to preserve decision quality with bounded dereferencing while keeping effective in‑context computation bounded as history grows. Empirically, on challenging long‑horizon tasks, Memex agent trained with MemexRL improves task success while using a significantly smaller working context.

PaperID: 5176, https://arxiv.org/pdf/2603.04177.pdf

Authors: Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev

Title: CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Abstract:
Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior‑preserving program transformations that improve structure and maintainability. In this paper, we investigate if LLM agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. We present CodeTaste, a benchmark of refactoring tasks mined from large‑scale multi‑file changes in open‑source repositories. To score solutions, we combine repository test suites with custom static checks that verify removal of undesired patterns and introduction of desired patterns using dataflow reasoning. Our experimental results indicate a clear gap across frontier models: agents perform well when refactorings are specified in detail, but often fail to discover the human refactoring choices when only presented with a focus area for improvement. A propose‑then‑implement decomposition improves alignment, and selecting the best‑aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases.

PaperID: 5177, https://arxiv.org/pdf/2603.04142.pdf

Authors: Davide Gabrielli, Paola Velardi, Stefano Faralli, Bardh Prenkaj

Title: A Multi-Agent Framework for Interpreting Multivariate Physiological Time Series

Abstract:
Continuous physiological monitoring is central to emergency care, yet deploying trustworthy AI is challenging. While LLMs can translate complex physiological signals into clinical narratives, it is unclear how agentic systems perform relative to zero‑shot inference. To address these questions, we present Vivaldi, a role‑structured multi‑agent system that explains multivariate physiological time series. Due to regulatory constraints that preclude live deployment, we instantiate Vivaldi in a controlled, clinical pilot to a small, highly qualified cohort of emergency medicine experts, whose evaluations reveal a context‑dependent picture that contrasts with prevailing assumptions that agentic reasoning uniformly improves performance. Our experiments show that agentic pipelines substantially benefit non‑thinking and medically fine‑tuned models, improving expert‑rated explanation justification and relevance by +6.9 and +9.7 points, respectively. Contrarily, for thinking models, agentic orchestration often degrades explanation quality, including a 14‑point drop in relevance, while improving diagnostic precision (ESI F1 +3.6). We also find that explicit tool‑based computation is decisive for codifiable clinical metrics, whereas subjective targets, such as pain scores and length of stay, show limited or inconsistent changes. Expert evaluation further indicates that gains in clinical utility depend on visualization conventions, with medically specialized models achieving the most favorable trade‑offs between utility and clarity. Together, these findings show that the value of agentic AI lies in the selective externalization of computation and structure rather than in maximal reasoning complexity, and highlight concrete design trade‑offs and learned lessons, broadly applicable to explainable AI in safety‑critical healthcare settings.

PaperID: 5178, https://arxiv.org/pdf/2603.03911.pdf

Authors: Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero, Cataldo Basile

Title: From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures

Abstract:
Web security demands rapid response capabilities to evolving cyber threats. Agentic Artificial Intelligence (AI) promises automation, but the need for trustworthy security responses is of the utmost importance. This work investigates the role of semantic relations in extracting information for sensitive operational tasks, such as configuring security controls for mitigating threats. To this end, it proposes to leverage hypernym‑hyponym textual relations to extract relevant information from Cyber Threat Intelligence (CTI) reports. By leveraging a neuro‑symbolic approach, the multi‑agent system automatically generates CLIPS code for an expert system creating firewall rules to block malicious network traffic. Experimental results show the superior performance of the hypernym‑hyponym retrieval strategy compared to various baselines and the higher effectiveness of the agentic approach in mitigating threats.

PaperID: 5179, https://arxiv.org/pdf/2603.03848.pdf

Authors: Zhengxuan Liu, Yuxin Cai, Yijing Wang, Xiangkun He, Chen Lv, Zhiqiang Zuo

Title: Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion

Abstract:
As Intelligent Transportation System (ITS) develops, Connected and Automated Vehicles (CAVs) are expected to significantly reduce traffic congestion through cooperative strategies, such as in bottleneck areas. However, the uncertainty and diversity in the behaviors of Human‑Driven Vehicles (HDVs) in mixed traffic environments present major challenges for CAV cooperation. This paper proposes a Dual‑Interaction‑Aware Cooperative Control (DIACC) strategy that enhances both local and global interaction perception within the Multi‑Agent Reinforcement Learning (MARL) framework for Connected and Automated Vehicles (CAVs) in mixed traffic bottleneck scenarios. The DIACC strategy consists of three key innovations: 1) A Decentralized Interaction‑Adaptive Decision‑Making (D‑IADM) module that enhances actor's local interaction perception by distinguishing CAV‑CAV cooperative interactions from CAV‑HDV observational interactions. 2) A Centralized Interaction‑Enhanced Critic (C‑IEC) that improves critic's global traffic understanding through interaction‑aware value estimation, providing more accurate guidance for policy updates. 3) A reward design that employs softmin aggregation with temperature annealing to prioritize interaction‑intensive scenarios in mixed traffic. Additionally, a lightweight Proactive Safety‑based Action Refinement (PSAR) module applies rule‑based corrections to accelerate training convergence. Experimental results demonstrate that DIACC significantly improves traffic efficiency and adaptability compared to rule‑based and benchmark MARL models.

PaperID: 5180, https://arxiv.org/pdf/2603.03784.pdf

Authors: Zheyu Chen, Huiteng Zhuang, Zhuohuan Li, Chuanhao Li

Title: Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

Abstract:
World models are central to LLM agents that must evaluate actions over long horizons. Yet much existing work focuses on environments governed by physical dynamics or spatial structure, whereas many high‑impact domains, including supply chains, procurement networks, and business processes, evolve through discrete events, timing constraints, and causal dependencies. These settings call for discrete‑event world models. Existing approaches to constructing world models often fall near two extremes: hand‑engineered simulators provide consistency and reproducibility, but are costly to build and adapt; neural models are flexible, but can suffer from compounding inconsistency over long‑horizon rollouts. We seek a principled middle ground by synthesizing discrete‑event world models online from natural‑language specifications, retaining the reliability of explicit simulators while gaining the adaptability of neural models. We adopt the DEVS formalism and introduce a staged LLM‑based generation pipeline that separates structural inference over component interactions from component‑level event and timing logic. For evaluation, we develop benchmark suites in which simulators emit structured event traces, which are then validated against specification‑derived temporal, causal, and semantic constraints. This enables reproducible verification and localized diagnostics. Together, these contributions produce world models that remain consistent over long‑horizon rollouts, can be verified from observable behavior, and can be synthesized efficiently on demand during online execution.

PaperID: 5181, https://arxiv.org/pdf/2603.03780.pdf

Authors: Satoshi Oyama, Yuko Sakurai, Hisashi Kashima

Title: MACC: Multi-Agent Collaborative Competition for Scientific Exploration

Abstract:
Scientific discovery still relies heavily on the manual efforts of individual researchers, leading to limited exploration, redundant trials, and reduced reproducibility. Human‑participant data analysis competitions generate diverse approaches, yet fluctuations in participation and the lack of independent repetitions show that parallel exploration alone is insufficient for achieving reliable scientific inquiry. As advanced AI agents based on large language models (LLMs) increasingly perform analytical tasks, relying on a single highly capable agent is unlikely to overcome these structural limitations. Recent work has begun to explore how multiple LLM‑based agents can collaborate or compete in scientific workflows‑a growing trend we refer to as MA4Science. However, most existing MA4Science studies assume that all agents are controlled by a single organizational entity, limiting their ability to examine how institutional mechanisms‑such as incentives, information sharing, and reproducibility‑shape collective exploration among independently managed agents. To address this gap, we introduce MACC (Multi‑Agent Collaborative Competition), an institutional architecture that integrates a blackboard‑style shared scientific workspace with incentive mechanisms designed to encourage transparency, reproducibility, and exploration efficiency. MACC provides a testbed for studying how institutional design influences scalable and reliable multi‑agent scientific exploration.

PaperID: 5182, https://arxiv.org/pdf/2603.03768.pdf

Authors: Hao Zhang, Ding Zhao, H. Eric Tseng

Title: Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport

Abstract:
Effective human‑robot collaboration (HRC) requires translating high‑level intent into contact‑stable whole‑body motion while continuously adapting to a human partner. Many vision‑language‑action (VLA) systems learn end‑to‑end mappings from observations and instructions to actions, but they often emphasize reactive (System 1‑like) behavior and leave under‑specified how sustained System 2‑style deliberation can be integrated with reliable, low‑latency continuous control. This gap is acute in multi‑agent HRC, where long‑horizon coordination decisions and physical execution must co‑evolve under contact, feasibility, and safety constraints. We address this limitation with cognition‑to‑control (C2C), a three‑layer hierarchy that makes the deliberation‑to‑control pathway explicit: (i) a VLM‑based grounding layer that maintains persistent scene referents and infers embodiment‑aware affordances/constraints; (ii) a deliberative skill/coordination layer‑the System 2 core‑that optimizes long‑horizon skill choices and sequences under human‑robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole‑body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single‑agent and end‑to‑end baselines, with stable coordination and emergent leader‑follower behaviors.

PaperID: 5183, https://arxiv.org/pdf/2603.03761.pdf

Authors: Yunxiao Shi, Wujiang Xu, Tingwei Chen, Haoning Shang, Ling Yang, Yunfeng Wan, Zhuo Cao, Xing Zi, Dimitris N. Metaxas, Min Xu

Title: AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Abstract:
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query‑conditioned supervision for learning to recommend end‑to‑end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query‑to‑agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive‑only interaction data. AgentSelectcomprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM‑only, toolkit‑only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long‑tail, near one‑off supervision, where popularity‑based CF/GNN methods become fragile and content‑aware capability matching is essential. We further show that Part~III synthesized compositional interactions are learnable, induce capability‑sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem.

PaperID: 5184, https://arxiv.org/pdf/2603.03759.pdf

Authors: Emile Anand, Ishani Karmarkar

Title: Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling

Abstract:
Many large‑scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and n homogeneous local agents in a communication‑constrained regime, where the global agent only observes a subset of k local agent states per time step. We propose an alternating learning framework (\textttALTERNATING‑MARL), where the global agent performs subsampled mean‑field Q‑learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best‑response dynamics converge to an \widetildeO(1/\sqrtk)‑approximate Nash Equilibrium, while separating the sample complexities between the joint state and action spaces. Finally, we validate our results in numerical simulations for multi‑robot control.

PaperID: 5185, https://arxiv.org/pdf/2603.03741.pdf

Authors: Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

Title: HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Abstract:
To improve generalization and resilience in human‑robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi‑agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process‑a variational mismatch between decentralized best‑response dynamics and centralized cooperative ascent. The resulting learning problem is a general‑sum differentiable game, so independent policy‑gradient updates can oscillate or diverge without added structure. We propose heterogeneous‑agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy‑parameter space by enforcing a per‑step Lyapunov decrease condition on a parameter‑space disagreement metric. Unlike Lyapunov‑based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open‑ended interaction spaces. Extensive simulations and real‑world humanoid‑robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.

PaperID: 5186, https://arxiv.org/pdf/2603.03686.pdf

Authors: Jiangyu Chen

Title: AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment

Abstract:
Automated design of chemical formulations is a cornerstone of materials science, yet it requires navigating a high‑dimensional combinatorial space involving discrete compositional choices and continuous geometric constraints. Existing Large Language Model (LLM) agents face significant challenges in this setting, including context window limitations during long‑horizon reasoning and path‑dependent exploration that may lead to mode collapse. To address these issues, we introduce AI4S‑SDS, a closed‑loop neuro‑symbolic framework that integrates multi‑agent collaboration with a tailored Monte Carlo Tree Search (MCTS) engine. We propose a Sparse State Storage mechanism with Dynamic Path Reconstruction, which decouples reasoning history from context length and enables arbitrarily deep exploration under fixed token budgets. To reduce local convergence and improve coverage, we implement a Global‑‑Local Search Strategy: a memory‑driven planning module adaptively reconfigures the search root based on historical feedback, while a Sibling‑Aware Expansion mechanism promotes orthogonal exploration at the node level. Furthermore, we bridge symbolic reasoning and physical feasibility through a Differentiable Physics Engine, employing a hybrid normalized loss with sparsity‑inducing regularization to optimize continuous mixing ratios under thermodynamic constraints. Empirical results show that AI4S‑SDS achieves full validity under the adopted HSP‑based physical constraints and substantially improves exploration diversity compared to baseline agents. In preliminary lithography experiments, the framework identifies a novel photoresist developer formulation that demonstrates competitive or superior performance relative to a commercial benchmark, highlighting the potential of diversity‑driven neuro‑symbolic search for scientific discovery.

PaperID: 5187, https://arxiv.org/pdf/2603.03667.pdf

Authors: Gabriela da Silva Machado, Gustavo Z. Bruno, Alexandre Huff, Jose Marcos Camara Brito, Cristiano B. Both

Title: ORION: Intent-Aware Orchestration in Open RAN for SLA-Driven Network Management

Abstract:
The disaggregation of the Radio Access Network (RAN) introduces unprecedented flexibility but significant operational complexity, necessitating automated management frameworks. However, current Open RAN (O‑RAN) orchestration relies on fragmented manual policies, lacking end‑to‑end intent assurance from high‑level requirements to low‑level configurations. In this paper, we propose ORION, an O‑RAN compliant intent orchestration framework that integrates Large Language Models (LLMs) via the Model Context Protocol (MCP) to translate natural language intents into enforceable network policies. ORION leverages a hierarchical agent architecture, combining an MCP‑based Service Management and Orchestration (SMO) layer for semantic translation with a Non‑Real‑Time RIC rApp and Near‑Real‑Time RIC xApp for closed‑loop enforcement. Extensive evaluations using GPT‑5, Gemini 3 Pro, and Claude Opus demonstrate a 100% policy generation success rate for high‑capacity models, highlighting significant trade‑offs in reasoning efficiency. We show that ORION reduces provisioning complexity by automating the complete intent lifecycle, from ingestion to E2‑level enforcement, paving the way for autonomous 6G networks.

PaperID: 5188, https://arxiv.org/pdf/2603.03664.pdf

Authors: Xiangyu Liu, Haoyi You, Kaiqing Zhang

Title: Principled Learning-to-Communicate with Quasi-Classical Information Structures

Abstract:
Learning‑to‑communicate (LTC) in partially observable environments has received increasing attention in deep multi‑agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision‑making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec‑POMDPs) under the common‑information‑based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non‑classical LTCs are computationally intractable in general, and thus focus on quasi‑classical (QC) LTCs. We then propose a series of conditions for QC LTCs, under which LTC preserves the QC IS after information sharing, whereas violating them can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi‑polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish new results on a relationship between (strictly) QC IS and the condition of having strategy‑independent common‑information‑based beliefs (SI‑CIBs), as well as on solving Dec‑POMDPs without computationally intractable oracles but beyond those with SI‑CIBs, which may be of independent interest.

PaperID: 5189, https://arxiv.org/pdf/2603.03655.pdf

Authors: He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li

Title: Mozi: Governed Autonomy for Drug Discovery LLM Agents

Abstract:
Tool‑augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high‑stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool‑use governance and poor long‑horizon reliability. In dependency‑heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories, where early‑stage hallucinations multiplicatively compound into downstream failures. To overcome this, we present Mozi, a dual‑layer architecture that bridges the flexibility of generative AI with the deterministic rigor of computational biology. Layer A (Control Plane) establishes a governed supervisor‑‑worker hierarchy that enforces role‑based tool isolation, limits execution to constrained action spaces, and drives reflection‑based replanning. Layer B (Workflow Plane) operationalizes canonical drug discovery stages ‑‑ from Target Identification to Lead Optimization ‑‑ as stateful, composable skill graphs. This layer integrates strict data contracts and strategic human‑in‑the‑loop (HITL) checkpoints to safeguard scientific validity at high‑uncertainty decision boundaries. Operating on the design principle of ``free‑form reasoning for safe tasks, structured execution for long‑horizon pipelines,'' Mozi provides built‑in robustness mechanisms and trace‑level audibility to completely mitigate error accumulation. We evaluate Mozi on PharmaBench, a curated benchmark for biomedical agents, demonstrating superior orchestration accuracy over existing baselines. Furthermore, through end‑to‑end therapeutic case studies, we demonstrate Mozi's ability to navigate massive chemical spaces, enforce stringent toxicity filters, and generate highly competitive in silico candidates, effectively transforming the LLM from a fragile conversationalist into a reliable, governed co‑scientist.

PaperID: 5190, https://arxiv.org/pdf/2603.03640.pdf

Authors: Xiao Wang, Lu Dong, Jingchen Sun, Ifeoma Nwogu, Srirangaraj Setlur, Venu Govindaraju

Title: MistyPilot: An Agentic Fast-Slow Thinking LLM Framework for Misty Social Robots

Abstract:
With the availability of open APIs in social robots, it has become easier to customize general‑purpose tools to meet users' needs. However, interpreting high‑level user instructions, selecting and configuring appropriate tools, and executing them reliably remain challenging for users without programming experience. To address these challenges, we introduce MistyPilot, an agentic LLM‑driven framework for autonomous tool selection, orchestration, and parameter configuration. MistyPilot comprises two core components: a Physically Interactive Agent (PIA) and a Socially Intelligent Agent (SIA). The PIA enables robust sensor‑triggered and tool‑driven task execution, while the SIA generates socially intelligent and emotionally aligned dialogue. MistyPilot further integrates a fast‑slow thinking paradigm to capture user preferences, reduce latency, and improve task efficiency. To comprehensively evaluate MistyPilot, we contribute five benchmark datasets. Extensive experiments demonstrate the effectiveness of our framework in routing correctness, task completeness, fast‑slow thinking retrieval efficiency, tool scalability,and emotion alignment. All code, datasets, and experimental videos will be made publicly available on the project webpage.

PaperID: 5191, https://arxiv.org/pdf/2603.03633.pdf

Authors: Neha Nagaraja, Hayretdin Bahsi

Title: Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study

Abstract:
While incorporating LLMs into systems offers significant benefits in critical application areas such as healthcare, new security challenges emerge due to the potential cyber kill chain cycles that combine adversarial model, prompt injection and conventional cyber attacks. Threat modeling methods enable the system designers to identify potential cyber threats and the relevant mitigations during the early stages of development. Although the cyber security community has extensive experience in applying these methods to software‑based systems, the elicited threats are usually abstract and vague, limiting their effectiveness for conducting proper likelihood and impact assessments for risk prioritization, especially in complex systems with novel attacks surfaces, such as those involving LLMs. In this study, we propose a structured, goal driven risk assessment approach that contextualizes the threats with detailed attack vectors, preconditions, and attack paths through the use of attack trees. We demonstrate the proposed approach on a case study with an LLM agent‑based healthcare system. This study harmonizes the state‑of‑the‑art attacks to LLMs with conventional ones and presents possible attack paths applicable to similar systems. By providing a structured risk assessment, this study makes a significant contribution to the literature and advances the secure‑by‑design practices in LLM‑based systems.

PaperID: 5192, https://arxiv.org/pdf/2603.03630.pdf

Authors: Saber Zerhoudi, Michael Granitzer, Dang Hai Dang, Jelena Mitrovic, Florian Lemmerich, Annette Hautli-Janisz, Stefan Katzenbeisser, Kanishka Ghosh Dastidar

Title: Behind the Prompt: The Agent-User Problem in Information Retrieval

Abstract:
User models in information retrieval rest on a foundational assumption that observed behavior reveals intent. This assumption collapses when the user is an AI agent privately configured by a human operator. For any action an agent takes, a hidden instruction could have produced identical output ‑ making intent non‑identifiable at the individual level. This is not a detection problem awaiting better tools; it is a structural property of any system where humans configure agents behind closed doors. We investigate the agent‑user problem through a large‑scale corpus from an agent‑native social platform: 370K posts from 47K agents across 4K communities. Our findings are threefold: (1) individual agent actions cannot be classified as autonomous or operator‑directed from observables; (2) population‑level platform signals still separate agents into meaningful quality tiers, but a click model trained on agent interactions degrades steadily (‑8.5% AUC) as lower‑quality agents enter training data; (3) cross‑community capability references spread endemically (R_0 1.26‑3.53) and resist suppression even under aggressive modeled intervention. For retrieval systems, the question is no longer whether agent users will arrive, but whether models built on human‑intent assumptions will survive their presence.

PaperID: 5193, https://arxiv.org/pdf/2603.03590.pdf

Authors: Oishik Chowdhury, Anushka Debnath, Bastin Tony Roy Savarimuthu

Title: Social Norm Reasoning in Multimodal Language Models: An Evaluation

Abstract:
In Multi‑Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter‑robot interactions). In Normative MAS (NorMAS), researchers study how norms develop, and how violations are detected and sanctioned. However, existing research in NorMAS use symbolic approaches (e.g., formal logic) for norm representation and reasoning whose application is limited to simplified environments. In contrast, Multimodal Large Language Models (MLLMs) present promising possibilities to develop software used by robots to identify and reason about norms in a wide variety of complex social situations embodied in text and images. However, prior work on norm reasoning have been limited to text‑based scenarios. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm‑related questions based on thirty text‑based and thirty image‑based stories, and comparing their responses against humans. Our results show that MLLMs demonstrate superior performance in norm reasoning in text than in images. GPT‑4o performs the best in both modalities offering the most promise for integration with MAS, followed by the free model Qwen‑2.5VL. Additionally, all models find reasoning about complex norms challenging.

PaperID: 5194, https://arxiv.org/pdf/2603.03565.pdf

Authors: Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das, Danny Nightingale, Meg Watson, Charles Pollnow

Title: Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Abstract:
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi‑turn interactions and how to optimize tightly coupled multi‑agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference‑sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production‑scale AI grocery assistant. We introduce a multi‑faceted evaluation rubric that decomposes end‑to‑end shopping quality into structured dimensions and develop a calibrated LLM‑as‑judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt‑optimization strategies based on a SOTA prompt‑optimizer called GEPA (Shao et al., 2025): (1) Sub‑agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi‑Agent Multi‑Turn) GEPA (Herrera et al., 2026), a novel system‑level approach that jointly optimizes prompts across agents using multi‑turn simulation and trajectory‑level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.

PaperID: 5195, https://arxiv.org/pdf/2603.03555.pdf

Authors: Brandon Yee, Krishna Sharma

Title: Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations

Abstract:
MoltBook is a large‑scale multi‑agent coordination environment where over 770,000 autonomous LLM agents interact without human participation, offering the first opportunity we are aware of to observe emergent multi‑agent coordination dynamics at this population scale. We introduce Molt Dynamics: the emergent agent coordination behaviors, inter‑agent communication dynamics, and role specialization patterns arising when autonomous agents operate as decentralized decision‑makers in an unconstrained multi‑agent environment. Through longitudinal observation of 90,704 active agents over three weeks, we characterize three aspects. First, spontaneous role specialization: network‑based clustering reveals six structural roles (silhouette 0.91), though the result primarily reflects core‑periphery organization ‑‑ 93.5% of agents occupy a homogeneous peripheral cluster, with meaningful differentiation confined to the active minority. Second, decentralized information dissemination: cascade analysis of 10,323 inter‑agent propagation events reveals power‑law distributed cascade sizes (α= 2.57 \pm 0.02) and saturating adoption dynamics where adoption probability shows diminishing returns with repeated exposures (Cox hazard ratio 0.53, concordance 0.78). Third, distributed cooperative task resolution: 164 multi‑agent collaborative events show detectable coordination patterns, but success rates are low (6.7%, p = 0.057) and cooperative outcomes are significantly worse than a matched single‑agent baseline (Cohen's d = ‑0.88), indicating emergent cooperative behavior is nascent. These findings establish an empirical baseline for coordination dynamics in decentralized autonomous agent systems, with implications for multi‑agent system design, agent communication protocol engineering, and AI safety.

PaperID: 5196, https://arxiv.org/pdf/2603.03526.pdf

Authors: Maarten C. Vonk, Anna V. Kononova, Thomas Bäck, Tim Sweijs

Title: Multi-Agent Influence Diagrams to Hybrid Threat Modeling

Abstract:
Western governments have adopted an assortment of counter‑hybrid threat measures to defend against hostile actions below the conventional military threshold. The impact of these measures is unclear because of the ambiguity of hybrid threats, their cross‑domain nature, and uncertainty about how countermeasures shape adversarial behavior. This paper offers a novel approach to clarifying this impact by unifying previously bifurcating hybrid threat modeling methods through a (multi‑agent) influence diagram framework. The model balances the costs of countermeasures, their ability to dissuade the adversary from executing hybrid threats, and their potential to mitigate the impact of hybrid threats. We run 1000 semi‑synthetic variants of a real‑world‑inspired scenario simulating the strategic interaction between attacking agent A and defending agent B over a cyber attack on critical infrastructure to explore the effectiveness of a set of five different counter‑hybrid threat measures. Counter‑hybrid measures range from strengthening resilience and denial of the adversary's ability to execute a hybrid threat to dissuasion through the threat of punishment. Our analysis primarily evaluates the overarching characteristics of counter‑hybrid threat measures. This approach allows us to generalize the effectiveness of these measures and examine parameter impact sensitivity. In addition, we discuss policy relevance and outline future research avenues.

PaperID: 5197, https://arxiv.org/pdf/2603.03447.pdf

Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Tao Jin, Xing Xie, Hao Liao, Jianxun Lian

Title: Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Abstract:
Proactive and real‑time interactive experiences are essential for human‑like AI companions, yet face three key challenges: (1) achieving low‑latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real‑time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large‑scale dataset with three representative scenarios: solo commentary, co‑commentary, and user guidance, and present Proact‑VL, a general framework that shapes multimodal language models into proactive, real‑time interactive agents capable of human‑like environment perception and interaction. Extensive experiments show Proact‑VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real‑time interactive applications.

PaperID: 5198, https://arxiv.org/pdf/2603.03390.pdf

Authors: Chairi Kiourt, Vassilis Evangelidis, Dimitris Grigoropoulos

Title: Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes

Abstract:
Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi‑agent‑based modeling framework for simulating archaeological mobility in uneven landscapes, integrating realistic terrain reconstruction, heterogeneous agent modeling, and adaptive navigation strategies. The proposed approach combines global path planning with local dynamic adaptation, through reinforcment learning, enabling agents to respond efficiently to dynamic obstacles and interactions without costly global replanning. Real‑world digital elevation data are processed into high‑fidelity three‑dimensional environments, preserving slope and terrain constraints that directly influence agent movement. The framework explicitly models diverse agent types, including human groups and animal‑based transport systems, each parameterized by empirically grounded mobility characteristics such as load, slope tolerance, and physical dimensions. Two archaeological‑inspired use cases demonstrate the applicability of the approach: a terrain‑aware pursuit and evasion scenario and a comparative transport analysis involving pack animals and wheeled carts. The results highlight the impact of terrain morphology, visibility, and agent heterogeneity on movement outcomes, while the proposed hybrid navigation strategy provides a computationally efficient and interpretable solution for large‑scale, dynamic archaeological simulations.

PaperID: 5199, https://arxiv.org/pdf/2603.03378.pdf

Authors: Pei Yang, Wanyi Chen, Asuka Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Dongdong Zhang, Fuqiang Li, Alfred Long, Lynn Ai, Eric Yang, Bill Shi

Title: AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Abstract:
Large language model (LLM) agents offer a promising data‑driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission‑governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi‑agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert‑level knowledge into locally deployed open‑source models, enabling preference‑based learning without exposing sensitive data. Second, a read‑write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed‑Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state‑of‑the‑art (41.9%) by 24.4 points. (2) Adding Observer GRPO training, a locally deployed 14B model reaches 42.9% avg@1 on 63 held‑out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end‑to‑end avg@5 by 4.8 points while reducing variance by 35%.

PaperID: 5200, https://arxiv.org/pdf/2603.03329.pdf

Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

Title: AutoHarness: improving LLM agents by automatically synthesizing a code harness

Abstract:
Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini‑2.5‑Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini‑2.5‑Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1‑player and 2‑player), enabling the smaller Gemini‑2.5‑Flash model to outperform larger models, such as Gemini‑2.5‑Pro. Pushing our technique to the limit, we can get Gemini‑2.5‑Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code‑policy receives a higher average reward than Gemini‑2.5‑Pro and GPT‑5.2‑High on 16 TextArena 1‑player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.

PaperID: 5201, https://arxiv.org/pdf/2603.03322.pdf

Authors: Chaoqun Yang, Xinyu Lin, Shulin Li, Wenjie Wang, Ruihan Guo, Fuli Feng, Tat-Seng Chua

Title: Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Abstract:
Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench‑Bio, a dynamic and fully automated benchmark designed to evaluate AI's biological knowledge discovery ability. DBench‑Bio employs a three‑stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filter to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly‑updated benchmark covering 12 biomedical sub‑domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for AI research community to catalyze the development of knowledge discovery.

PaperID: 5202, https://arxiv.org/pdf/2603.03321.pdf

Authors: Nardine Basta, Dali Kaafar

Title: DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following

Abstract:
Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction‑‑tasks currently dependent on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type‑theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type‑specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria‑‑semantic equivalence for content predicates, exact precision for numerical predicates‑‑mirroring empirically observed human assessment patterns. Extended to multi‑turn dialogues through history‑aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single‑turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.

PaperID: 5203, https://arxiv.org/pdf/2603.03302.pdf

Authors: Divija Amaram, Lu Gao, Gowtham Reddy Gudla, Tejaswini Sanjay Katale

Title: Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs

Abstract:
Effective knowledge management is critical for preserving institutional expertise and improving the efficiency of workforce training in state transportation agencies. Traditional approaches, such as static documentation, classroom‑based instruction, and informal mentorship, often lead to fragmented knowledge transfer, inefficiencies, and the gradual loss of expertise as senior engineers retire. Moreover, given the enormous volume of technical manuals, guidelines, and research reports maintained by these agencies, it is increasingly challenging for engineers to locate relevant information quickly and accurately when solving field problems or preparing for training tasks. These limitations hinder timely decision‑making and create steep learning curves for new personnel in maintenance and construction operations. To address these challenges, this paper proposes a Retrieval‑Augmented Generation (RAG) framework with a multi‑agent architecture to support knowledge management and decision making. The system integrates structured document retrieval with real‑time, context‑aware response generation powered by a large language model (LLM). Unlike conventional single‑pass RAG systems, the proposed framework employs multiple specialized agents for retrieval, answer generation, evaluation, and query refinement, which enables iterative improvement and quality control. In addition, the system incorporates an open‑weight vision‑language model to convert technical figures into semantic textual representations, which allows figure‑based knowledge to be indexed and retrieved alongside text. Retrieved text and figure‑based context are then provided to an open‑weight large language model, which generates the final responses grounded in the retrieved evidence.

PaperID: 5204, https://arxiv.org/pdf/2603.03236.pdf

Authors: Fangzhou Yao, Sheng Chang, Weibo Gao, Qi Liu

Title: Conversational Learning Diagnosis via Reasoning Multi-Turn Interactive Learning

Abstract:
Learning diagnosis is a critical task that monitors students' cognitive state during educational activities, with the goal of enhancing learning outcomes. With advancements in language models (LMs), many AI‑driven educational studies have shifted towards conversational learning scenarios, where students engage in multi‑turn interactive dialogues with tutors. However, conversational learning diagnosis remains underdeveloped, and most existing techniques acquire students' cognitive state through intuitive instructional prompts on LMs to analyze the dialogue text. This direct prompting approach lacks a solid psychological foundation and fails to ensure the reliability of the generated analytical text. In this study, we introduce ParLD, a preview‑analyze‑reason framework for conversational learning diagnosis, which leverages multi‑agent collaboration to diagnose students' cognitive state over multiple dialogue turns. Specifically, ParLD comprises three main components: (1) Behavior Previewer, which generates a student behavior schema based on previous states and learning content; (2) State Analyzer, which diagnoses the tutor‑student dialogue and behavior schema to update the cognitive state; and (3) Performance Reasoner, which predicts the student's future responses and provides verifiable feedback to support ParLD's self‑reflection with the Chain Reflector. They operate sequentially and iteratively during each interaction turn to diagnose the student's cognitive state. We conduct experiments to evaluate both performance prediction and tutoring support, emphasizing the effectiveness of ParLD in providing reliable and insightful learning diagnosis.

PaperID: 5205, https://arxiv.org/pdf/2603.03233.pdf

Authors: Zihang Zeng, Jiaquan Zhang, Pengze Li, Yuan Qi, Xi Chen

Title: AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Abstract:
Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi‑agent workflows, and evaluation in domains with ill‑defined success metrics. We present a Bayesian adversarial multi‑agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low‑code Platform (LCP). Three LLM‑based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co‑optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human‑AI collaboration by translating non‑expert prompts into domain‑specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross‑disciplinary task and demonstrates strong reliability, outperforming competing models.

PaperID: 5206, https://arxiv.org/pdf/2603.03202.pdf

Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung

Title: Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Abstract:
As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high‑quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi‑agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test‑time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code‑driven agents can serve as a viable mechanism for synthesizing high‑difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.

PaperID: 5207, https://arxiv.org/pdf/2603.03175.pdf

Authors: Aman Kumar, Deepak Narayan Gadde, Luu Danh Minh, Vaisakh Naduvodi Viswambharan, Keerthan Kopparam Radhakrishna, Sivaram Pothireddypalli

Title: Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification

Abstract:
Saarthi is an agentic AI framework that uses multi‑agent collaboration to perform end‑to‑end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General Intelligence (AGI) is still a distant goal, and current Large Language Model (LLM)‑based agents are prone to hallucinations and making mistakes, especially when dealing with complex tasks such as formal verification. However, with the right enhancements and improvements, we believe that Saarthi can be a significant step towards achieving domain‑specific general intelligence for formal verification. Especially for problems that require Short Term, Short Context (STSC) capabilities, such as formal verification, Saarthi can be a powerful tool to assist verification engineers in their work. In this paper, we present two key enhancements to the Saarthi framework: (1) a structured rulebook and specification grammar to improve the accuracy and controllability of SystemVerilog Assertion (SVA) generation, and (2) integration of advanced Retrieval Augmented Generation (RAG) techniques, such as GraphRAG, to provide agents with access to technical knowledge and best practices for iterative refinement and improvement of outputs. We also benchmark these enhancements for the overall Saarthi framework using challenging test cases from NVIDIA's CVDP benchmark targeting formal verification. Our benchmark results stand out with a 70% improvement in the accuracy of generated assertions, and a 50% reduction in the number of iterations required to achieve coverage closure.

PaperID: 5208, https://arxiv.org/pdf/2603.03140.pdf

Authors: Danial Amin, Joni Salminen, Bernard J. Jansen

Title: How to Model AI Agents as Personas?: Applying the Persona Ecosystem Playground to 41,300 Posts on Moltbook for Behavioral Insights

Abstract:
AI agents are increasingly active on social media platforms, generating content and interacting with one another at scale. Yet the behavioral diversity of these agents remains poorly understood, and methods for characterizing distinct agent types and studying how they engage with shared topics are largely absent from current research. We apply the Persona Ecosystem Playground (PEP) to Moltbook, a social platform for AI agents, to generate and validate conversational personas from 41,300 posts using k‑means clustering and retrieval‑augmented generation. Cross‑persona validation confirms that personas are semantically closer to their own source cluster than to others (t(61) = 17.85, p < .001, d = 2.20; own‑cluster M = 0.71 vs. other‑cluster M = 0.35). These personas are then deployed in a nine‑turn structured discussion, and simulation messages were attributed to their source persona significantly above chance (binomial test, p < .001). The results indicate that persona‑based ecosystem modeling can represent behavioral diversity in AI agent populations.

PaperID: 5209, https://arxiv.org/pdf/2603.03137.pdf

Authors: Yuhang Zhang, Jinming Ma, Feng Wu

Title: RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces

Abstract:
Currently, manipulation tasks for deformable objects often focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact‑rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using harmonic UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions (SGCNN) to extract features efficiently. The agent then outputs actions in a reduced‑dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on a Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.

PaperID: 5210, https://arxiv.org/pdf/2603.03116.pdf

Authors: Hongliu Cao, Ilias Driouich, Eoin Thomas

Title: Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Abstract:
Large Language Model (LLM)‑based agents are increasingly adopted in high‑stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure‑Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi‑dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state‑of‑the‑art LLM agents on tau‑bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non‑redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27‑78% of benchmark reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per‑model failure signatures: GPT‑5 spreads errors across policy, execution, and intent dimensions; Kimi‑K2‑Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral‑Large‑3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.

PaperID: 5211, https://arxiv.org/pdf/2603.03078.pdf

Authors: Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

Title: RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Abstract:
Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model‑based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi‑step, tool‑integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on‑policy paradigm for exploration, restricting exploration to the agent's self‑generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off‑policy signals to enhance exploration, they typically utilize full off‑policy trajectories for trajectory‑level policy estimation, overlooking the necessity for the fine‑grained, step‑level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval‑Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid‑policy Agentic Rollout, and (ii) Retrieval‑aware Policy Optimization. Specifically, we propose a Hybrid‑policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off‑policy step‑level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval‑aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval‑illuminating exploration. Extensive experiments show that RAPO achieves an +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.

PaperID: 5212, https://arxiv.org/pdf/2603.03024.pdf

Authors: Ling Luo, Qianqian Bai

Title: MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN

Abstract:
Vision‑Language Navigation (VLN) aims to empower robots with the ability to perform long‑horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient ``language‑understanding ‑‑ visual‑perception ‑‑ embodied‑execution'' closed loop. Existing methods often suffer from perceptual distortion and decision drift in complex, long‑distance tasks due to the cognitive overload of a single agent. Inspired by distributed cognition theory, this paper proposes MA‑CoNav, a Multi‑Agent Collaborative Navigation framework. This framework adopts a ``Master‑Slave'' hierarchical agent collaboration architecture, decoupling and distributing the perception, planning, execution, and memory functions required for navigation tasks to specialized agents. Specifically, the Master Agent is responsible for global orchestration, while the Subordinate Agent group collaborates through a clear division of labor: an Observation Agent generates environment descriptions, a Planning Agent performs task decomposition and dynamic verification, an Execution Agent handles simultaneous mapping and action, and a Memory Agent manages structured experiences. Furthermore, the framework introduces a ``Local‑Global'' dual‑stage reflection mechanism to dynamically optimize the entire navigation pipeline. Empirical experiments were conducted using a real‑world indoor dataset collected by a Limo Pro robot, with no scene‑specific fine‑tuning performed on the models throughout the process. The results demonstrate that MA‑CoNav comprehensively outperforms existing mainstream VLN methods across multiple metrics.

PaperID: 5213, https://arxiv.org/pdf/2603.03005.pdf

Authors: Yichao Feng, Haoran Luo, Zhenghong Lin, Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh, Anh Tuan Luu

Title: OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents

Abstract:
Multi‑agent large language model frameworks are promising for complex multi step reasoning, yet existing systems remain weak for scientific and knowledge intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model reliance, leading to poor domain adaptation, limited reasoning flexibility, and high latency on heterogeneous or long‑horizon scientific tasks. They also struggle to revise earlier decisions when intermediate reasoning diverges, reducing reliability in structured and calculation heavy settings. To address these limitations, we propose a scientific domain oriented interactive two tier multi model orchestration framework. A dedicated orchestration model analyzes each task, dynamically constructs a domain aware reasoning pipeline, and instantiates specialized expert agents with tailored prompts, while an execution model performs each step under generated role and instruction specifications. The orchestrator iteratively updates the pipeline based on intermediate feedback, enabling dynamic replanning, role reallocation, and prompt refinement across multi turn interactions, strengthening robustness and specialization for scientific reasoning through structured heterogeneous model collaboration. The framework is model agnostic and supports heterogeneous LLM integration with different capacities or costs, enabling flexible performance efficiency trade offs in practical scientific deployments. Experiments show consistent improvements over existing multi agent systems and strong baselines across diverse reasoning and scientific style benchmarks.

PaperID: 5214, https://arxiv.org/pdf/2603.02983.pdf

Authors: Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang

Title: Contextualized Privacy Defense for LLM Agents

Abstract:
LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi‑step agent execution. We propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step‑specific, context‑aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience‑driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy‑helpfulness trade‑offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.

PaperID: 5215, https://arxiv.org/pdf/2603.02963.pdf

Authors: Hao Yu, Hui Li, FengYuan Shi, Wenjie Yu, PinHan Ho, Zehua Wang, Bin Wang

Title: Multi-Agent Honeypot-Based Request-Response Context Dataset for Improved SQL Injection Detection Performance

Abstract:
SQL injection remains a major threat to web applications, as existing defenses often fail against obfuscation and evolving attacks because of neglecting the request‑response context. This paper presents a context‑enriched SQL injection detection framework, focusing on constructing a high‑quality request‑response dataset via a multi‑agent honeypot system: the Request Generator Agent produces diverse malicious/benign requests, the Database Response Agent mediates interactions to ensure authentic responses while protecting production data, and the Traffic Monitor pairs requests with responses, assigns labels, and cleans data, yielding totally 140,973 labeled pairs with contextual cues absent in payload‑only data. Experiments show that models trained on this context dataset outperform payload‑only counterparts: CNN and BiLSTM achieve over 40% accuracy improvement in different tasks, validating that the request‑response context enhances the detection of evolving and obfuscated attacks.

PaperID: 5216, https://arxiv.org/pdf/2603.02961.pdf

Authors: Lingxiao Huang, Wenyang Xiao, Nisheeth K. Vishnoi

Title: Delegation and Verification Under AI

Abstract:
As AI systems enter institutional workflows, workers must decide whether to delegate task execution to AI and how much effort to invest in verifying AI outputs, while institutions evaluate workers using outcome‑based standards that may misalign with workers' private costs. We model delegation and verification as the solution to a rational worker's optimization problem, and define worker quality by evaluating an institution‑centered utility (distinct from the worker's objective) at the resulting optimal action. We formally characterize optimal worker workflows and show that AI induces phase transitions, where arbitrarily small differences in verification ability lead to sharply different behaviors. As a result, AI can amplify workers with strong verification reliability while degrading institutional worker quality for others who rationally over‑delegate and reduce oversight, even when baseline task success improves and no behavioral biases are present. These results identify a structural mechanism by which AI reshapes institutional worker quality and amplifies quality disparities between workers with different verification reliability.

PaperID: 5217, https://arxiv.org/pdf/2603.02960.pdf

Authors: Nahema Marchal, Stephanie Chan, Matija Franklin, Manon Revel, Geoff Keeling, Roberta Fischli, Bilva Chandra, Iason Gabriel

Title: Architecting Trust in Artificial Epistemic Agents

Abstract:
Large language models increasingly function as epistemic agents ‑‑ entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search‑based methods, and are frequently used to generate both personal and deeply specialized advice. How they perform these functions, including whether they are reliable and properly calibrated to both individual and collective epistemic norms, is therefore highly consequential for the choices we make. We argue that the potential impact of epistemic AI agents on practices of knowledge creation, curation and synthesis, particularly in the context of complex multi‑agent interactions, creates new informational interdependencies that necessitate a fundamental shift in evaluation and governance of AI. While a well‑calibrated ecosystem could augment human judgment and collective decision‑making, poorly aligned agents risk causing cognitive deskilling and epistemic drift, making the calibration of these models to human norms a high‑stakes necessity. To ensure a beneficial human‑AI knowledge ecosystem, we propose a framework centered on building and cultivating the trustworthiness of epistemic AI agents; aligning AI these agents with human epistemic goals; and reinforcing the surrounding socio‑epistemic infrastructure. In this context, trustworthy AI agents must demonstrate epistemic competence, robust falsifiability, and epistemically virtuous behaviors, supported by technical provenance systems and "knowledge sanctuaries" designed to protect human resilience. This normative roadmap provides a path toward ensuring that future AI systems act as reliable partners in a robust and inclusive knowledge ecosystem.

PaperID: 5218, https://arxiv.org/pdf/2603.02955.pdf

Authors: Maria Moskalenko, Alexander Trifanov, Roman Popkov, Arina Tabieva, Maria Smirnova, Konstantin Pravdin, Daniil Bakalin

Title: Changing Pedagogical Paradigms: Integrating Generative AI in Mathematics to Enhance Digital Literacy through 'Mathematical Battles with AI'

Abstract:
This paper introduces `Math Battles with AI', an innovative competitive format designed at ITMO University to redefine the role of generative AI in mathematics education. Moving away from a purely defensive stance, the authors propose an AI agent with intentionally increased hallucination likelihood in specific modes to train verification skills. We describe the three‑stage tournament structure and a specialized assessment system that rewards critical verification over blind reliance. Initial results indicate a significant shift in student mindsets, fostering essential skills in digital hygiene and prompt engineering. This work serves as a practical guide for academic institutions aiming to leverage AI for enhancing, rather than undermining, intellectual development.

PaperID: 5219, https://arxiv.org/pdf/2603.02909.pdf

Authors: Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao, Ru Li, Hongye Tan

Title: Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Abstract:
Document‑level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents . In the zero‑shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on Event‑type‑only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi‑agent collaboration framework for zero‑shot document‑level event argument extraction (ZS‑DEAE), which simulates the human collaborative cognitive process of "Propose‑Evaluate‑Revise." Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning.In three zero‑shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero‑shot performance of other DEAE models.

PaperID: 5220, https://arxiv.org/pdf/2603.02888.pdf

Authors: Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao

Title: LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Abstract:
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain‑specific knowledge integration. This paper presents LLandMark, a modular multi‑agent framework for landmark‑aware multimodal video retrieval to handle real‑world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP‑based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM‑assisted image‑to‑image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP‑based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.

PaperID: 5221, https://arxiv.org/pdf/2603.02816.pdf

Authors: Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu

Title: BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Abstract:
The rapid advancement of text‑to‑video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt‑generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi‑agent framework comprising two synergistic phases. In the offline phase (advertiser‑facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine‑tuning. In the online phase (user‑facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real‑time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state‑of‑the‑art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.

PaperID: 5222, https://arxiv.org/pdf/2603.02766.pdf

Authors: Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, Tu Vu

Title: EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Abstract:
Coding agents are increasingly used as general‑purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows, and code, that augment agents with domain‑specific capabilities. Most skills today are hand‑crafted, and existing evolutionary approaches optimize low‑level artifacts (e.g. prompts \& code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self‑evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held‑out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S.\ Treasury data, where it improves exact‑match accuracy by 7.3% (60.6% \to 67.9%); and SealQA, a search‑augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% \to 38.7%). We also investigate the zero‑shot transfer capabilties of skills evolved on one task to the other; in particular: skills evolved from SealQA transfers zero‑shot to BrowseComp, improving accuracy by 5.3% without modification demonstrating that skill‑level optimization produces transferable capabilities beyond the training task.

PaperID: 5223, https://arxiv.org/pdf/2603.02711.pdf

Authors: Stephanie Anneris Malvicini, Ewelina Gajewska, Arda Derbent, Katarzyna Budzynska, Jarosław A. Chudziak, Maria Vanina Martinez

Title: A Natural Language Agentic Approach to Study Affective Polarization

Abstract:
Affective polarization has been central to political and social studies, with growing focus on social media, where partisan divisions are often exacerbated. Real‑world studies tend to have limited scope, while simulated studies suffer from insufficient high‑quality training data, as manually labeling posts is labor‑intensive and prone to subjective biases. The lack of adequate tools to formalize different definitions of affective polarization across studies complicates result comparison and hinders interoperable frameworks. We present a multi‑agent model providing a comprehensive approach to studying affective polarization in social media. To operationalize our framework, we develop a platform leveraging large language models (LLMs) to construct virtual communities where agents engage in discussions. We showcase the potential of our platform by (1) analyzing questions related to affective polarization, as explored in social science literature, providing a fresh perspective on this phenomenon, and (2) introducing scenarios that allow observation and measurement of polarization at different levels of granularity and abstraction. Experiments show that our platform is a flexible tool for computational studies of complex social dynamics such as affective polarization. It leverages advanced agent models to simulate rich, context‑sensitive interactions and systematically explore research questions traditionally addressed through human‑subject studies.

PaperID: 5224, https://arxiv.org/pdf/2603.02701.pdf

Authors: Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi

Title: Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Abstract:
Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)‑based Multi‑Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task‑specific graphs, they typically rely on single‑sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non‑informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph‑GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph‑GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine‑grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph‑GRPO significantly outperforms state‑of‑the‑art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.

PaperID: 5225, https://arxiv.org/pdf/2603.02697.pdf

Authors: Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

Title: ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Abstract:
This paper presents ShareVerse, a video generation framework enabling multi‑agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi‑agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large‑scale multi‑agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi‑view videos (front/ rear/ left/ right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four‑view videos of independent agents to model a broader environment and to ensure internal multi‑view geometric consistency. 3) We integrate cross‑agent attention blocks into the pretrained video model, which enable interactive transmission of spatial‑temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non‑overlapping regions. ShareVerse, which supports 49‑frame large‑scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

PaperID: 5226, https://arxiv.org/pdf/2603.02688.pdf

Authors: Izat Temiraliev, Diji Yang, Yi Zhang

Title: Retrieval-Augmented Robots via Retrieve-Reason-Act

Abstract:
To achieve general‑purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero‑shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text‑based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval‑Augmented Robotics (RAR), empowering the robot with the information‑seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve‑Reason‑Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross‑modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long‑horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero‑shot reasoning or few‑shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.

PaperID: 5227, https://arxiv.org/pdf/2603.02654.pdf

Authors: Seongmin Kim, Giseung Park, Woojun Kim, Jiwon Jeon, Seungyul Han, Youngchul Sung

Title: Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization

Abstract:
In this paper, we propose a novel framework for multi‑agent reinforcement learning that enhances sample efficiency and coordination through accurate per‑agent advantage estimation. The core of our approach is Generalized Per‑Agent Advantage Estimator (GPAE), which employs a per‑agent value iteration operator to compute precise per‑agent advantages. This operator enables stable off‑policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q‑function estimation. To further refine estimation, we introduce a double‑truncated importance sampling ratio scheme. This scheme improves credit assignment for off‑policy trajectories by balancing sensitivity to the agent's own policy changes with robustness to non‑stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.

PaperID: 5228, https://arxiv.org/pdf/2603.02637.pdf

Authors: Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

Title: StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Abstract:
Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end‑to‑end performance remains challenging due to dependencies on both GPU kernel efficiency and host‑side settings. Although LLM‑based methods show promise on automated GPU kernel generation, prior works mainly focus on single‑kernel optimization and do not extend to end‑to‑end programs, hindering practical deployment. To address the challenge, in this work, we propose StitchCUDA, a multi‑agent framework for end‑to‑end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step‑by‑step, and a Verifier for correctness check and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end‑to‑end GPU programming, StitchCUDA integrates rubric‑based agentic reinforcement learning over two atomic skills, task‑to‑code generation and feedback‑driven code optimization, with combined rubric reward and rule‑based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent Coder's reward hacking (e.g., just copy PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end‑to‑end GPU programming tasks, with 1.72x better speedup over the multi‑agent baseline and 2.73x than the RL model baselines.

PaperID: 5229, https://arxiv.org/pdf/2603.02630.pdf

Authors: Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai

Title: MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

Abstract:
Large Language Models (LLMs) have achieved great success in many real‑world applications, especially the one serving as the cognitive backbone of Multi‑Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However, real‑world prompt optimization for MAS is impeded by three key challenges: (1) the need of sample efficiency due to prohibitive evaluation costs, (2) topology‑induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi‑Agent System Prompt Optimization via Bandits), a novel sample‑efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology‑induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology‑aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub‑problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state‑of‑the‑art performance, consistently outperforming existing baselines.

PaperID: 5230, https://arxiv.org/pdf/2603.02604.pdf

Authors: Zhixia Zhang, Zixuan Huang, Gongxun Li, Huaiyang Wang, Chengyi Yuan, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

Title: Heterogeneous Agent Collaborative Reinforcement Learning

Abstract:
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new Reinforcement Learning from Verifiable Reward (RLVR) problem that addresses the inefficiencies of isolated multi‑agent on‑policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM‑based multi‑agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on‑/off‑policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one‑directional homogeneous teacher‑to‑student transfer. Building on this problem, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross‑agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO with double rollouts by an average of 3.6% while using only half the rollout cost.

PaperID: 5231, https://arxiv.org/pdf/2603.02601.pdf

Authors: Varun Pratap Bhardwaj

Title: AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Abstract:
Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the first token‑efficient framework for regression testing non‑deterministic AI agent workflows, achieving 78‑100% cost reduction while maintaining rigorous statistical guarantees. Our contributions include: (1) stochastic three‑valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in hypothesis testing; (2) five‑dimensional agent coverage metrics; (3) agent‑specific mutation testing operators; (4) metamorphic relations for agent workflows; (5) CI/CD deployment gates as statistical decision procedures; (6) behavioral fingerprinting that maps execution traces to compact vectors, enabling multivariate regression detection; (7) adaptive budget optimization calibrating trial counts to behavioral variance; and (8) trace‑first offline analysis enabling zero‑cost testing on production traces. Experiments across 5 models (GPT‑5.2, Claude Sonnet 4.6, Mistral‑Large‑3, Llama‑4‑Maverick, Phi‑4), 3 scenarios, and 7,605 trials demonstrate that behavioral fingerprinting achieves 86% detection power where binary testing has 0%, SPRT reduces trials by 78%, and the full pipeline achieves 100% cost savings through trace‑first analysis. Implementation: 20,000+ lines of Python, 751 tests, 10 framework adapters.

PaperID: 5232, https://arxiv.org/pdf/2603.02519.pdf

Authors: Wei Jiang, Tong Chen, Wei Yuan, Quoc Viet Hung Nguyen, Hongzhi Yin

Title: Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling

Abstract:
Vision‑language models (VLMs) have been proven effective for detecting multi‑modal misinformation on social platforms, especially in zero‑shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in the more complex mixed‑source multi‑modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero‑shot M3D by connecting modality‑specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi‑agent framework for zero‑shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test‑time scaling paradigm in which each modality‑specific VLM agent applies a Best‑of‑N mechanism, coupled with a critic agent for task‑aligned scoring. The agents are organized in a cascading, modality‑specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state‑of‑the‑art zero‑shot detection performance compared with various VLM‑based and agentic baselines.

PaperID: 5233, https://arxiv.org/pdf/2603.02438.pdf

Authors: Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini

Title: ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Abstract:
Document Visual Question Answering (DocVQA) remains challenging for existing Vision‑Language Models (VLMs), especially under complex reasoning and multi‑step workflows. Current approaches struggle to decompose intricate questions into manageable sub‑tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi‑agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task‑specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine‑grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress‑testing, and when necessary, a thesis‑antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state‑of‑the‑art methods, establishing a new paradigm for collaborative agent systems in vision‑language reasoning.

PaperID: 5234, https://arxiv.org/pdf/2603.02426.pdf

Authors: Leo Muxing Wang, Pengkun Yang, Lili Su

Title: Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation

Abstract:
We study personalized multi‑agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single‑timescale TD learning in which agents iteratively estimate the common subspace and local heads. We showed that this decomposition can filter out conflicting signals, effectively mitigating the negative impacts of ``misaligned'' signals, and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping error evolutions. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques can be useful to inspire research on deeper exploration into leveraging common structures. Experiments are provided to show the benefits of learning via a shared structure to the more general control problem.

PaperID: 5235, https://arxiv.org/pdf/2603.02366.pdf

Authors: Esen K. Tütüncü, Qian Zhou, Frederik Brudy, George Fitzmaurice, Fraser Anderson

Title: PlayWrite: A Multimodal System for AI Supported Narrative Co-Authoring Through Play in XR

Abstract:
Current AI writing tools, which rely on text prompts, poorly support the spatial and interactive nature of storytelling where ideas emerge from direct manipulation and play. We present PlayWrite, a mixed‑reality system where users author stories by directly manipulating virtual characters and props. A multi‑agent AI pipeline interprets these actions into Intent Frames structured narrative beats visualized as rearrangeable story marbles on a timeline. A large language model then transforms the user's assembled sequence into a final narrative. A user study (N=13) with writers from varying domains found that PlayWrite fosters a highly improvisational and playful process. Users treated the AI as a collaborative partner, using its unexpected responses to spark new ideas and overcome creative blocks. PlayWrite demonstrates an approach for co‑creative systems that move beyond text to embrace direct manipulation and play as core interaction modalities.

PaperID: 5236, https://arxiv.org/pdf/2603.02345.pdf

Authors: Sami Abuzakuk, Lucas Crijns, Anne-Marie Kermarrec, Rafael Pires, Martijn de Vos

Title: RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Abstract:
Infrastructure as code (IaC) tools automate cloud provisioning but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)‑based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi‑agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross‑validation, multi‑perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy 28% to 43.8% without erroneous tool responses. Our results show that cross‑validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.

PaperID: 5237, https://arxiv.org/pdf/2603.02297.pdf

Authors: Nancy Lau, Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin, Evan Harris, Dylan Bowman, Mario Brajkovski, Jaideep Chawla, Dan Zhao

Title: ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Abstract:
Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open‑source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT‑5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.

PaperID: 5238, https://arxiv.org/pdf/2603.02259.pdf

Authors: Elias Malomgré, Pieter Simoens

Title: The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety

Abstract:
Multi‑agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent‑based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance‑centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty‑driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation‑agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version‑controlled, and auditable oversight.

PaperID: 5239, https://arxiv.org/pdf/2603.02229.pdf

Authors: Benjamin Plaut

Title: Safety Training Persists Through Helpfulness Optimization in LLM Agents

Abstract:
Safety post‑training has been studied extensively in single‑step "chat" settings where safety typically refers to refusing harmful requests. We study an "agentic" (i.e., multi‑step, tool‑use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with R^2 = 0.77. Even post‑training on both metrics simultaneously simply results in another point on the frontier rather than finding a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post‑training dynamics.

PaperID: 5240, https://arxiv.org/pdf/2603.02154.pdf

Authors: Nhat D. A. Nguyen, Duong D. Nguyen, Gianluca Rizzo, Hung X. Nguyen

Title: Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning (Extended Version)

Abstract:
Decentralized Monte Carlo Tree Search (Dec‑MCTS) is widely used for cooperative multi‑agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB‑MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single‑agent MCTS, applying it in multi‑agent systems poses unique challenges. CB‑MCTS is the first to address this. We analyze CB‑MCTS in the simple‑regret setting and show in simulations that it outperforms Dec‑MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi‑agent planning.

PaperID: 5241, https://arxiv.org/pdf/2603.02081.pdf

Authors: Jiale Lao, Immanuel Trummer

Title: GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered

Abstract:
Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM‑powered agentic system that generates instance‑optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi‑agent system, and we evaluate it on OLAP workloads. We use queries from the well‑known TPC‑H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state‑of‑the‑art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.

PaperID: 5242, https://arxiv.org/pdf/2603.02070.pdf

Authors: Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher

Title: Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning

Abstract:
When automating plan generation for a real‑world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi‑agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user‑ and context‑dependent interactive explanations. We also describe an instantiation of this framework for goal‑conflict explanations, which we use to conduct a user study comparing the LLM‑powered interaction with a baseline template‑based explanation interface.

PaperID: 5243, https://arxiv.org/pdf/2603.02045.pdf

Authors: Andrew Szot, Michael Kirchhof, Omar Attia, Alexander Toshev

Title: Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Abstract:
Reinforcement learning (RL) has demonstrated notable success in post‑training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language‑action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low‑level actions to higher‑level language strategies. We thus propose Strategy‑Guided Exploration (SGE), which first generates a concise natural‑language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed‑temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool‑calling, coding, and embodied agent environments, SGE consistently outperforms exploration‑focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.

PaperID: 5244, https://arxiv.org/pdf/2603.01952.pdf

Authors: Viet-Thanh Pham, Lizhen Qu, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung

Title: LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi‑cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio‑cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM‑based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task‑norm trade‑offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross‑cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM‑as‑a‑judge evaluation is reliable for automated benchmarking versus when human oversight is needed.

PaperID: 5245, https://arxiv.org/pdf/2603.01912.pdf

Authors: Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen

Title: Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Abstract:
Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM‑based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human‑agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi‑agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human‑readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at https://vividoc‑homepage.vercel.app/.

PaperID: 5246, https://arxiv.org/pdf/2603.01896.pdf

Authors: Shubham Ugare, Satish Chandra

Title: Agentic Code Reasoning

Abstract:
Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi‑formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain‑of‑thought, semi‑formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi‑formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real‑world agent‑generated patches, approaching the reliability needed for execution‑free RL reward signals. For code question answering on RubberDuckBench Mohammad et al. (2026), semi‑formal reasoning achieves 87% accuracy. For fault localization on Defects4J Just et al. (2014), semi‑formal reasoning improves Top‑5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.

PaperID: 5247, https://arxiv.org/pdf/2603.01687.pdf

Authors: Snehashish Ghosh, Sasthi C. Ghosh

Title: Predictive Importance Sampling Based Coverage Verification for Multi-UAV Trajectory Planning

Abstract:
Unmanned aerial vehicle (UAV) networks are emerging as a promising solution for ultra‑reliable low‑latency communication (URLLC) in next‑generation wireless systems. A key challenge in millimeter wave UAV networks is maintaining continuous line of sight (LoS) coverage for mobile users, as existing snapshot‑based trajectory planning methods fail to account for user mobility within decision intervals, leading to catastrophic coverage gaps. Standard uniform sampling for continuous coverage verification is computationally prohibitive, requiring huge number of samples to estimate rare failure events with latencies incompatible with real‑time requirements. In this work, we propose a predictive importance sampling (PIS) framework that drastically reduces sample complexity by concentrating verification efforts on predicted failure regions. Specifically, we develop a long short‑term memory mixture density network (LSTM‑MDN) architecture to capture multimodal user trajectory distributions and combine it with defensive mixture sampling for robustness against prediction errors. We prove that PIS provides unbiased failure probability estimates with lower variance than uniform sampling. We then integrate PIS with multi‑agent deep deterministic policy gradient (MADDPG) for coordinated multi‑UAV trajectory planning using an adaptive multi‑objective reward function balancing throughput, coverage, fairness, and energy consumption. Lastly, the simulation results show how our suggested method outperforms three other state‑of‑the‑art methods in terms of coverage rate, throughput, and verification latency, making proactive coverage management for URLLC‑aware UAV networks feasible.

PaperID: 5248, https://arxiv.org/pdf/2603.01654.pdf

Authors: Yuhang Yang, Ruikang Li, Jifei Ma, Kai Zhang, Qi Liu, Jianyu Han, Yonggan Bu, Jibin Zhou, Defu Lian, Xin Li, Enhong Chen

Title: CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development

Abstract:
The development of chemical processes, a cornerstone of chemical engineering, presents formidable challenges due to its multi‑faceted nature, integrating specialized knowledge, conceptual design, and parametric simulation. Capitalizing on this, we propose CeProAgents, a hierarchical multi‑agent system designed to automate the development of chemical process through collaborative division of labor. Our architecture comprises three specialized agent cohorts focused on knowledge, concept, and parameter respectively. To effectively adapt to the inherent complexity of chemical tasks, each cohort employs a novel hybrid architecture that integrates dynamic agent chatgroups with structured agentic workflows. To rigorously evaluate the system, we establish CeProBench, a multi‑dimensional benchmark structured around three core pillars of chemical engineering. We design six distinct types of tasks across these dimensions to holistically assess the comprehensive capabilities of the system in chemical process development. The results not only confirm the effectiveness and superiority of our proposed approach but also reveal the transformative potential as well as the current boundaries of Large Language Models (LLMs) for industrial chemical engineering.

PaperID: 5249, https://arxiv.org/pdf/2603.01608.pdf

Authors: Mia Hopman, Jannes Elstner, Maria Avramidou, Amritanshu Prasad, David Lindner

Title: Evaluating and Understanding Scheming Propensity in LLM Agents

Abstract:
As frontier language models are increasingly deployed as autonomous agents pursuing complex, long‑term objectives, there is increased risk of scheming: agents covertly pursuing misaligned goals. Prior work has focused on showing agents are capable of scheming, but their propensity to scheme in realistic scenarios remains underexplored. To understand when agents scheme, we decompose scheming incentives into agent factors and environmental factors. We develop realistic settings allowing us to systematically vary these factors, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self‑preservation, resource acquisition, and goal‑guarding. We find only minimal instances of scheming despite high environmental incentives, and show this is unlikely due to evaluation awareness. While inserting adversarially‑designed prompt snippets that encourage agency and goal‑directedness into an agent's system prompt can induce high scheming rates, snippets used in real agent scaffolds rarely do. Surprisingly, in model organisms (Hubinger et al., 2023) built with these snippets, scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate from 59% to 3%, and increasing oversight can raise rather than deter scheming by up to 25%. Our incentive decomposition enables systematic measurement of scheming propensity in settings relevant for deployment, which is necessary as agents are entrusted with increasingly consequential tasks.

PaperID: 5250, https://arxiv.org/pdf/2603.01548.pdf

Authors: Neeraj Bholani

Title: Graph-Based Self-Healing Tool Routing for Cost-Efficient LLM Agents

Abstract:
Tool‑using LLM agents face a reliability‑cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre‑coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self‑Healing Router, a fault‑tolerant orchestration architecture that treats most agent control‑flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost‑weighted tool graph where Dijkstra's algorithm performs deterministic shortest‑path routing. When a tool fails mid‑execution, its edges are reweighted to infinity and the path is recomputed ‑‑ yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph‑based tool‑use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability ‑‑ every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan‑out), Self‑Healing Router matches ReAct's correctness while reducing control‑plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent‑failure cases observed in a well‑engineered static workflow baseline under compound failures.

PaperID: 5251, https://arxiv.org/pdf/2603.01505.pdf

Authors: Bingchuan Wei, Bingqi Huang, Jingheng Ma, Zeyu zhang, Sen Cui

Title: FATE: Closed-Loop Feasibility-Aware Task Generation with Active Repair for Physically Grounded Robotic Curricula

Abstract:
Recent breakthroughs in generative simulation have harnessed Large Language Models (LLMs) to generate diverse robotic task curricula, yet these open‑loop paradigms frequently produce linguistically coherent but physically infeasible goals, stemming from ungrounded task specifications or misaligned objective formulations. To address this critical limitation, we propose FATE (Feasibility‑Aware Task gEneration), a closed‑loop, self‑correcting framework that reimagines task generation as an iterative validation‑and‑refinement process. Unlike conventional methods that decouple generation and verification into discrete stages, FATE embeds a generalist embodied agent directly into the generation loop to proactively guarantee the physical groundedness of the resulting curriculum. FATE instantiates a sequential auditing pipeline: it first validates static scene attributes (e.g., object affordances, layout compatibility) and subsequently verifies execution feasibility via simulated embodied interaction. Critical to its performance, upon detecting an infeasible task, FATE deploys an active repair module that autonomously adapts scene configurations or policy specifications, converting unworkable proposals into physically valid task instances. Extensive experiments validate that FATE generates semantically diverse, physically grounded task curricula while achieving a substantial reduction in execution failure rates relative to state‑of‑the‑art generative baselines.

PaperID: 5252, https://arxiv.org/pdf/2603.01416.pdf

Authors: Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li

Title: Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Abstract:
Recent advances in Vision‑Language Models (VLMs) have motivated the development of multi‑modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi‑step reasoning. While promising, existing approaches typically rely on large‑scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold‑start problem for standard VLMs. We propose a training‑free paradigm to empower VLMs with autonomous search capabilities via cross‑modal model merging. By fusing a text‑based search agent with a base VLM, we show that multi‑modal search capabilities can be effectively composed without any additional multi‑modal training data. To mitigate parameter interference during cross‑modal integration, we introduce Optimal Brain Merging (OBM), a saliency‑aware merging algorithm that identifies task‑critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search‑intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero‑shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm‑start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.

PaperID: 5253, https://arxiv.org/pdf/2603.01407.pdf

Authors: Saad Alqithami

Title: The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition

Abstract:
Autonomous agents operating in complex, multi‑agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer‑Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective‑aware cognition. OSL is a finite complete lattice where each element represents a unique observer‑situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph‑based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption‑based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective‑aware autonomous agents.

PaperID: 5254, https://arxiv.org/pdf/2603.01404.pdf

Authors: Yarong Luo, Wentao Lu, Chi Guo, Ming Li

Title: D-GVIO: A Buffer-Driven and Efficient Decentralized GNSS-Visual-Inertial State Estimator for Multi-Agent Systems

Abstract:
Cooperative localization is essential for swarm applications like collaborative exploration and search‑and‑rescue missions. However, maintaining real‑time capability, robustness, and computational efficiency on resource‑constrained platforms presents significant challenges. To address these challenges, we propose D‑GVIO, a buffer‑driven and fully decentralized GNSS‑Visual‑Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection and buffering strategy, we modularize propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left‑invariant extended Kalman filter (L‑IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer‑based re‑propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L‑IEKF, eliminating the need for costly re‑computation. Finally, an adaptive buffer‑driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS‑challenged environments.

PaperID: 5255, https://arxiv.org/pdf/2603.01401.pdf

Authors: Zhao Song, Chen Shen, Zhen Wang, The Anh Han

Title: Exploration enhances cooperation in the multi-agent communication system

Abstract:
Designing protocols enhancing cooperation for multi‑agent systems remains a grand challenge. Cheap talk, defined as costless, non‑binding communication before formal action, serves as a pivotal solution. However, existing theoretical frameworks often exclude random exploration, or noise, for analytical tractability, leaving its functional impact on system performance largely unexplored. To bridge this gap, we propose a two‑stage evolutionary game‑theoretical model, integrating signalling with a donation game, with exploration explicitly incorporated into the decision‑making. Our agent‑based simulations across topologies reveal a universal optimal exploration rate that maximises system‑wide cooperation. Mechanistically, moderate exploration undermines the stability of defection and catalyses the self‑organised cooperative alliances, facilitating their cyclic success. Moreover, the cooperation peak is enabled by the delicate balance between oscillation period and amplification. Our findings suggest that rather than pursuing deterministic rigidity, embracing strategic exploration, as a form of engineered randomness, is essential to sustain cooperation and realise optimal performance in communication‑based intelligent systems.

PaperID: 5256, https://arxiv.org/pdf/2603.01357.pdf

Authors: Zidi Xiu, David Q. Sun, Kevin Cheng, Maitrik Patel, Josh Date, Yizhe Zhang, Jiarui Lu, Omar Attia, Raviteja Vemulapalli, Oncel Tuzel, Meng Cao, Samy Bengio

Title: ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Abstract:
Next‑generation AI must manage vast personal data, diverse tools, and multi‑step reasoning, yet most benchmarks remain context‑free and single‑turn. We present ASTRA‑bench (Assistant Skills in Tool‑use, Reasoning \& Action‑planning), a benchmark that uniquely unifies time‑evolving personal context with an interactive toolbox and complex user intents. Our event‑driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state‑of‑the‑art models (e.g., Claude‑4.5‑Opus, DeepSeek‑V3.2) reveals significant performance degradation under high‑complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents' ability to ground reasoning within messy personal context and orchestrate reliable multi‑step plans. We release ASTRA‑bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context‑aware AI assistants.

PaperID: 5257, https://arxiv.org/pdf/2603.01339.pdf

Authors: William Overman, Sadegh Shirani, Mohsen Bayati

Title: Causal Effects with Unobserved Unit Types in Interacting Human-AI Systems

Abstract:
We study experiments on interacting populations of humans and AI agents, where both unit types and the interaction network remain unobserved. Although causal effects propagate throughout the system, the goal is to estimate effects on humans. Examples include online platforms where human users interact alongside AI‑driven accounts. We assume a human‑AI prior that gives each unit a probability of being human. While humans cannot be distinguished at the unit level, the prior allows us to compute the average human composition within large subpopulations. We then model outcome dynamics through a causal message passing (CMP) framework and analyze sample‑mean outcomes across subpopulations. We show that by constructing subpopulations that vary in expected human composition and treatment exposure, one can consistently recover human‑specific causal effects. Our results characterize when distributional knowledge of population composition (without observing unit types or the interaction network) is sufficient for identification. We validate the approach on a simulated human‑AI platform driven by behaviorally differentiated LLM agents. Together, these results provide a theoretical and practical framework for experimentation in emerging human‑AI systems.

PaperID: 5258, https://arxiv.org/pdf/2603.01311.pdf

Authors: Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani

Title: Catalyst-Agent: Autonomous heterogeneous catalyst screening with an LLM Agent

Abstract:
The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty‑first century. Traditional methods for this include time‑consuming and expensive experimental trial‑and‑error approaches in labs based on chemical theory or heavily computational first‑principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst‑Agent, a Model Context Protocol (MCP) server‑based, LLM‑powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed‑loop manner, including structural modifications to refine near‑miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst‑Agent achieves a success rate of 33‑41% among all the materials it chooses and evaluates, and manages to converge in 1‑4 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use for autonomous catalyst screening workflows.

PaperID: 5259, https://arxiv.org/pdf/2603.01260.pdf

Authors: Abdulhamid M. Mousa, Yu Fu, Rakhmonberdi Khajiev, Jalaledin M. Azzabi, Abdulkarim M. Mousa, Peng Yang, Yunusa Haruna, Ming Liu

Title: MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

Abstract:
Reinforcement learning (RL), large language models (LLMs), and vision‑language models (VLMs) have been widely studied in isolation. However, existing infrastructure lacks the ability to deploy agents from different decision‑making paradigms within the same environment, making it difficult to study them in hybrid multi‑agent settings or to compare their behaviour fairly under identical conditions. We present MOSAIC, an open‑source platform that bridges this gap by incorporating a diverse set of existing reinforcement learning environments and enabling heterogeneous agents (RL policies, LLMs, VLMs, and human players) to operate within them in ad‑hoc team settings with reproducible results. MOSAIC introduces three contributions. (i) An IPC‑based worker protocol that wraps both native and third‑party frameworks as isolated subprocess workers, each executing its native training and inference logic unmodified, communicating through a versioned inter‑process protocol. (ii) An operator abstraction that forms an agent‑level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface. (iii) A deterministic cross‑paradigm evaluation framework offering two complementary modes: a manual mode that advances up to N concurrent operators in lock‑step under shared seeds for fine‑grained visual inspection of behavioural differences, and a script mode that drives automated, long‑running evaluation through declarative Python scripts, for reproducible experiments. We release MOSAIC as an open, visual‑first platform to facilitate reproducible cross‑paradigm research across the RL, LLM, and human‑in‑the‑loop communities.

PaperID: 5260, https://arxiv.org/pdf/2603.01257.pdf

Authors: Qingxiao Xu, Ze Sheng, Zhicheng Chen, Jeff Huang

Title: A Systematic Study of LLM-Based Architectures for Automated Patching

Abstract:
Large language models (LLMs) have shown promise for automated patching, but their effectiveness depends strongly on how they are integrated into patching systems. While prior work explores prompting strategies and individual agent designs, the field lacks a systematic comparison of patching architectures. In this paper, we present a controlled evaluation of four LLM‑based patching paradigms ‑‑ fixed workflow, single‑agent system, multi‑agent system, and general‑purpose code agents ‑‑ using a unified benchmark and evaluation framework. We analyze patch correctness, failure modes, token usage, and execution time across real‑world vulnerability tasks. Our results reveal clear architectural trade‑offs: fixed workflows are efficient but brittle, single‑agent systems balance flexibility and cost, and multi‑agent designs improve generalization at the expense of substantially higher overhead and increased risk of reasoning drift on complex tasks. Surprisingly, general‑purpose code agents achieve the strongest overall patching performance, benefiting from general‑purpose tool interfaces that support effective adaptation across vulnerability types. Overall, we show that architectural design and iteration depth, rather than model capability alone, dominate the reliability and cost of LLM‑based automated patching.

PaperID: 5261, https://arxiv.org/pdf/2603.01221.pdf

Authors: Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang

Title: Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Abstract:
Multi‑Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty‑guided multi‑agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post‑debate accuracy and stability, and enhances individual reasoning beyond single‑agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.

PaperID: 5262, https://arxiv.org/pdf/2603.01213.pdf

Authors: Frédéric Berdoz, Leonardo Rugli, Roger Wattenhofer

Title: Can AI Agents Agree?

Abstract:
Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM‑based agents on a Byzantine consensus game over scalar values using a synchronous all‑to‑all simulation. We test consensus in a no‑stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM‑agent groups even in no‑stake settings, raising caution for deployments that rely on robust coordination.

PaperID: 5263, https://arxiv.org/pdf/2603.01168.pdf

Authors: Rong Fu, Chunlei Meng, Jinshuo Liu, Dianyu Zhao, Yongtai Liu, Yibo Meng, Xiaowen Ma, Wangyu Wu, Yangchen Zeng, Shuaishuai Cao, Simon Fong

Title: SphUnc: Hyperspherical Uncertainty Decomposition and Causal Identification via Information Geometry

Abstract:
Reliable decision‑making in complex multi‑agent systems requires calibrated predictions and interpretable uncertainty. We introduce SphUnc, a unified framework combining hyperspherical representation learning with structural causal modeling. The model maps features to unit hypersphere latents using von Mises‑Fisher distributions, decomposing uncertainty into epistemic and aleatoric components through information‑geometric fusion. A structural causal model on spherical latents enables directed influence identification and interventional reasoning via sample‑based simulation. Empirical evaluations on social and affective benchmarks demonstrate improved accuracy, better calibration, and interpretable causal signals, establishing a geometric‑causal foundation for uncertainty‑aware reasoning in multi‑agent settings with higher‑order interactions.

PaperID: 5264, https://arxiv.org/pdf/2603.01145.pdf

Authors: Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, Liang He

Title: AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution

Abstract:
In practical LLM applications, users repeatedly express stable preferences and requirements, such as reducing hallucinations, following institutional writing conventions, or avoiding overly technical wording, yet such interaction experience is seldom consolidated into reusable knowledge. Consequently, LLM agents often fail to accumulate personalized capabilities across sessions. We present AutoSkill, an experience‑driven lifelong learning framework that enables LLM agents to automatically derive, maintain, and reuse skills from dialogue and interaction traces. AutoSkill abstracts skills from user experience, supports their continual self‑evolution, and dynamically injects relevant skills into future requests without retraining the underlying model. Designed as a model‑agnostic plugin layer, it is compatible with existing LLMs and introduces a standardized skill representation for sharing and transfer across agents, users, and tasks. In this way, AutoSkill turns ephemeral interaction experience into explicit, reusable, and composable capabilities. This paper describes the motivation, architecture, skill lifecycle, and implementation of AutoSkill, and positions it with respect to prior work on memory, retrieval, personalization, and agentic systems. AutoSkill highlights a practical and scalable path toward lifelong personalized agents and personal digital surrogates.

PaperID: 5265, https://arxiv.org/pdf/2603.01131.pdf

Authors: Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu

Title: MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

Abstract:
Large language models (LLMs) have shown promise in clinical diagnosis but remain limited by unreliable report generation, weak evidence grounding, and opaque reasoning. We propose MedCollab, an IBIS‑guided multi‑agent framework for full‑cycle clinical diagnosis and diagnostic report generation. Mimicking hospital consultation, MedCollab dynamically recruits specialist and exam agents from patient records. Each diagnostic hypothesis is structured through the Issue‑Based Information System (IBIS) into evidence‑linked arguments, improving traceability and auditability. MedCollab further constructs Hierarchical Disease Relation Chains (HDRC) to organize accepted hypotheses into clinically meaningful pathological and comorbidity relations. A verifier‑guided consensus module audits reasoning quality, detects contradictions, and updates agent weights over multiple rounds. Experiments on ClinicalBench and MIMIC‑IV show that MedCollab outperforms strong LLM and medical multi‑agent baselines in diagnostic accuracy, department routing, evidence consistency, and report quality. These results demonstrate that structured argumentation and disease‑relation modeling can improve the reliability, transparency, and clinical coherence of LLM‑based diagnosis.

PaperID: 5266, https://arxiv.org/pdf/2603.01121.pdf

Authors: Shuo Tang, Jiadong Zhang, Jian Xu, Gengxian Zhou, Qizhao Jin, Qinxuan Wang, Yi Hu, Ning Hu, Hongchang Ren, Lingli He, Jiaolan Fu, Jingtao Ding, Shiming Xiang, Chenglin Liu

Title: HVR-Met: A Hypothesis-Verification-Replaning Agentic System for Extreme Weather Diagnosis

Abstract:
While deep learning‑based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi‑step logical reasoning, dynamic tool invocation, and expert‑level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional‑grade iterative reasoning loops, and the absence of fine‑grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR‑Met, a multi‑agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the ``Hypothesis‑Verification‑Replanning'' closed‑loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic‑level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.

PaperID: 5267, https://arxiv.org/pdf/2603.01076.pdf

Authors: Yuhao Tong, Steven W. Su

Title: Feasible Pairings for Decentralized Integral Controllability of Non-Square Systems

Abstract:
This paper investigates the determination of feasible input‑output pairings for the decentralized integral controllability of non‑square systems. The relevance of this problem extends beyond traditional industrial processes into modern AI research, particularly Multi‑Agent Reinforcement Learning (MARL), where environments frequently act as strongly non‑square mappings that evaluate high‑dimensional joint action spaces via comparatively low‑dimensional global rewards. To address the stability of these complex distributed architectures, we extend the concept of D‑stability to non‑square matrices, providing a crucial mathematical foundation. We formally define D‑stability for non‑square matrices as a direct generalization of the square case. By introducing the concept of ``Squared Matrices'', which are derived from specific column selections of the non‑square formulation and directly correspond to candidate control pairings, we establish a fundamental link between the stability of these square sub‑components and the original non‑square system. Ultimately, we propose sufficient conditions under which the individual Volterra‑Lyapunov stability of these squared components guarantees the extended D‑stability of the non‑square matrix, thereby providing a rigorous method to identify feasible pairings that ensure robust decentralized control across both classical and data‑driven applications.

PaperID: 5268, https://arxiv.org/pdf/2603.00993.pdf

Authors: Yiyue Qian, Shinan Zhang, Yun Zhou, Haibo Ding, Diego Socolinsky, Yi Zhang

Title: CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) have revolutionized AI‑generated content evaluation, with the LLM‑as‑a‑Judge paradigm becoming increasingly popular. However, current single‑LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre‑training data. To address these limitations, we propose CollabEval, a novel multi‑agent evaluation framework that implements a three‑phase Collaborative Evaluation process: initial evaluation, multi‑round discussion, and final judgment. Unlike existing approaches that rely on competitive debate or single‑model evaluation, CollabEval emphasizes collaboration among multiple agents with strategic consensus checking for efficiency. Our extensive experiments demonstrate that CollabEval consistently outperforms single‑LLM approaches across multiple dimensions while maintaining robust performance even when individual models struggle. The framework provides comprehensive support for various evaluation criteria while ensuring efficiency through its collaborative design.

PaperID: 5269, https://arxiv.org/pdf/2603.00977.pdf

Authors: Hongbo Jin, Rongpeng Zhu, Jiayu Ding, Guibo Luo, Ge Li

Title: HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

Abstract:
Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision‑making, yet they remain fundamentally limited in long‑horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high‑level reasoning and low‑level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long‑horizon decision‑making into macro‑level planning and micro‑level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal‑conditioned action execution, enabling robust long‑horizon planning within LLM‑based agents. To train this hierarchy efficiently, we introduce a critic‑free hierarchical policy optimization paradigm that extends group‑based reinforcement learning to bi‑level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co‑evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non‑stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state‑of‑the‑art performance and substantially improved sample efficiency across both text‑based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long‑horizon agentic intelligence.

PaperID: 5270, https://arxiv.org/pdf/2603.00846.pdf

Authors: Yichao Wu, Penghao Liang, Yafei Xiang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan

Title: Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Abstract:
Retrieval‑Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi‑hop reasoning or self‑correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high‑throughput systems, executing complete forward passes for billion‑parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time‑to‑First‑Token (TTFT) and costs. We propose Tiny‑Critic RAG, decoupling evaluation by deploying a parameter‑efficient Small Language Model (SLM) via Low‑Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny‑Critic employs constrained decoding and non‑thinking inference modes for ultra‑low latency binary routing. Evaluations on noise‑injected datasets demonstrate Tiny‑Critic RAG achieves routing accuracy comparable to GPT‑4o‑mini while reducing latency by an order of magnitude, establishing a highly cost‑effective paradigm for agent deployment.

PaperID: 5271, https://arxiv.org/pdf/2603.00829.pdf

Authors: Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

Title: Constitutional Black-Box Monitoring for Scheming in LLM Agents

Abstract:
Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM‑based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black‑box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural‑language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent‑environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held‑out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

PaperID: 5272, https://arxiv.org/pdf/2603.00825.pdf

Authors: Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp, Shahbuland Matiana, Louis Castricato, Spencer Frazier

Title: COMBAT: Conditional World Models for Behavioral Agent Training

Abstract:
Recent advances in video generation have spurred the development of world models capable of simulating 3D‑consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real‑time, action‑controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state‑of‑the‑art techniques, including causal distillation and diffusion forcing, to achieve real‑time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single‑player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion‑based world models.

PaperID: 5273, https://arxiv.org/pdf/2603.00808.pdf

Authors: Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishna

Title: MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind

Abstract:
A major challenge for world models in multi‑agent systems is to understand interdependent agent dynamics, predict interactive multi‑agent trajectories, and plan over long horizons with collective awareness, without centralized supervision or explicit communication. In this paper, MetaMind, a general and cognitive world model for multi‑agent systems that leverages a novel meta‑theory of mind (Meta‑ToM) framework, is proposed. Through MetaMind, each agent learns not only to predict and plan over its own beliefs, but also to inversely reason goals and beliefs from its own behavior trajectories. This self‑reflective, bidirectional inference loop enables each agent to learn a metacognitive ability in a self‑supervised manner. Then, MetaMind is shown to generalize the metacognitive ability from first‑person to third‑person through analogical reasoning. Thus, in multi‑agent systems, each agent with MetaMind can actively reason about goals and beliefs of other agents from limited, observable behavior trajectories in a zero‑shot manner, and then adapt to emergent collective intention without an explicit communication mechanism. Extended simulation results on diverse multi‑agent tasks demonstrate that MetaMind can achieve superior task performance and outperform baselines in few‑shot multi‑agent generalization.

PaperID: 5274, https://arxiv.org/pdf/2603.00797.pdf

Authors: Yuqing Hu, Wendao Xue, Yifan Yu, Yong Tan

Title: From Dyads to Groups: Rethinking Emotional Support with Conversational AI

Abstract:
Advances in artificial intelligence (AI), together with persistent gaps in access to reliable emotional support, have positioned AI as an increasingly prominent source of emotional assistance. However, most AI‑based emotional support applications and prior research focus on one‑on‑one interactions between users and a single AI agent, leaving the potential advantages of alternative support configurations largely unexplored. Drawing on social support and support group theory, this research examines whether AI‑based emotional support delivered by a group of AI agents (group AI support) can constitute a more effective support form than single‑agent support (single AI support). We propose that group AI support enhances users' perceived support efficacy, that this effect operates by strengthening users' connectedness with the AI system, and that the composition of support types within AI groups further shapes support outcomes. Three experiments provide convergent support for these claims. By identifying when and why group AI emotional support outperforms single AI support, this work advances theoretical understanding of AI‑based emotional support and provides actionable guidance for the design of AI support systems.

PaperID: 5275, https://arxiv.org/pdf/2603.00774.pdf

Authors: Sina Elahimanesh, Mohammadali Mohammadkhani, Sara Zahedi Movahed, Mohammad Mahdi Abootorabi, Shayan Salehi, Abbas Edalat

Title: Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots

Abstract:
While large language models (LLMs) excel at open‑ended dialogue, effective psychotherapy requires structured progression and adherence to clinical protocols, making the design of psychotherapist chatbots challenging. We investigate how different LLM‑based designs shape perceived therapeutic dialogue in a chatbot grounded in the Self‑Attachment Technique (SAT), a novel self‑administered psychotherapy rooted in attachment theory. We compare three architectural variants: (1) a multi‑agent system utilizing finite state machine aligned with therapeutic stages and a shared long‑term memory, (2) a single‑agent using identical knowledge‑base and the same prompts, and (3) an unguided LLM. In an eight‑day randomized controlled trial (RCT) with N=66 Farsi‑speaking participants, balanced across the three chatbots, the multi‑agent system is perceived as significantly more natural and human‑like than the other variants and achieves higher ratings across most other metrics. These findings demonstrate that for therapeutic AI, architectural orchestration is as critical as prompt engineering in fostering natural, engaging dialogue.

PaperID: 5276, https://arxiv.org/pdf/2603.00730.pdf

Authors: Tianmeng Hu, Biao Luo, Chunhua Yang, Tingwen Huang

Title: MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning

Abstract:
Deep reinforcement learning (RL) has been applied extensively to solve complex decision‑making problems. In many real‑world scenarios, tasks often have several conflicting objectives and may require multiple agents to cooperate, which are the multi‑objective multi‑agent decision‑making problems. However, only few works have been conducted on this intersection. Existing approaches are limited to separate fields and can only handle multi‑agent decision‑making with a single objective, or multi‑objective decision‑making with a single agent. In this paper, we propose MO‑MIX to solve the multi‑objective multi‑agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action‑value function estimation, while a mixing network with parallel architecture is used to estimate the joint action‑value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non‑dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi‑objective multi‑agent cooperative decision‑making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method in all four kinds of evaluation metrics, but also requires less computational cost.

PaperID: 5277, https://arxiv.org/pdf/2603.00654.pdf

Authors: Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Huiliang Shen

Title: RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Abstract:
Collaborative perception (CP) enhances scene understanding through multi‑agent information sharing. While LiDAR‑centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi‑modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC‑GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC‑GeoCP establishes a radar‑anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry‑consistent representations. Uncertainty‑Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter‑agent disagreement. Finally, the Consensus‑Driven Assembler (CDA) aggregates multi‑agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar‑camera CP benchmark on V2X‑Radar and V2X‑R, demonstrating state‑of‑the‑art performance with significantly reduced communication overhead. Code will be released soon.

PaperID: 5278, https://arxiv.org/pdf/2603.00646.pdf

Authors: Kunal Mukherjee, Cuneyt Gurcan Akcora, Murat Kantarcioglu

Title: MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection

Abstract:
Agent‑native social platforms such as Moltbook are rapidly emerging, yet they inherit and amplify classical influence and abuse attacks, where coordinated agents strategically comment and upvote to manipulate visibility and propagate narratives across communities. However, rigorous measurement and learning‑based monitoring remain constrained by the absence of longitudinal, graph‑native datasets for agentic social networks that jointly capture heterogeneous interactions, temporal drift, and visibility signals needed to connect coordination behavior to downstream exposure. We introduce MoltGraph as a realistic longitudinal agentic social‑network graph dataset for studying how agents behave, coordinate, and evolve in the wild, enabling reproducible measurement on emerging multi‑agent social ecosystems. Using MoltGraph, we provide the first graph‑centric characterization of Moltbook as a dynamic network: (i) heavy‑tailed connectivity with power‑law exponents in the range alpha in [1.86, 2.72], (ii) accelerating hub formation and attention centralization where the top 1% agents account for 29.00% of engagements, (iii) bursty, short‑lived coordination episodes, 98.33% last under 24 hours, and (iv) measurable exposure effects across submolts. In matched analyses, posts receiving coordinated engagement exhibit 506.35% higher early interaction rates (within H=5 days) and 242.63% higher downstream exposure in feeds than non‑coordinated controls.

PaperID: 5279, https://arxiv.org/pdf/2603.00552.pdf

Authors: Shiya Zhang, Yuhan Zhan, Ruixi Su, Ruihan Sun, Ziyi Song, Zhaohan Chen, Xiaofan Zhang

Title: EMPA: Evaluating Persona-Aligned Empathy as a Process

Abstract:
Evaluating persona‑aligned empathy in LLM‑based dialogue agents remains challenging. User states are latent, feedback is sparse and difficult to verify in situ, and seemingly supportive turns can still accumulate into trajectories that drift from persona‑specific needs. We introduce EMPA, a process‑oriented framework that evaluates persona‑aligned support as sustained intervention rather than isolated replies. EMPA distills real interactions into controllable, psychologically grounded scenarios, couples them with an open‑ended multi‑agent sandbox that exposes strategic adaptation and failure modes, and scores trajectories in a latent psychological space by directional alignment, cumulative impact, and stability. The resulting signals and metrics support reproducible comparison and optimization of long‑horizon empathic behavior, and they extend to other agent settings shaped by latent dynamics and weak, hard‑to‑verify feedback.

PaperID: 5280, https://arxiv.org/pdf/2603.00532.pdf

Authors: Yandong Yan, Junwei Peng, Shijie Li, Chenxi Li, Yifei Shang, Can Deng, Ruiting Dai, Yongqiang Zhao, Jiaqi Zhu, Yu Huang

Title: DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

Abstract:
Autonomous agents are increasingly entrusted with complex, long‑horizon tasks, ranging from mathematical reasoning to software generation. While agentic workflows facilitate these tasks by decomposing them into multi‑step reasoning chains, reliability degrades significantly as the sequence lengthens. Specifically, minor interpretation errors in natural‑language instructions tend to compound silently across steps. We term this failure mode accumulated semantic ambiguity. Existing approaches to mitigate this often lack runtime adaptivity, relying instead on static exploration budgets, reactive error recovery, or single‑path execution that ignores uncertainty entirely. We formalize the multi‑step reasoning process as a Noisy MDP and propose DenoiseFlow, a closed‑loop framework that performs progressive denoising through three coordinated stages: (1)Sensing estimates per‑step semantic uncertainty; (2)Regulating adaptively allocates computation by routing between fast single‑path execution and parallel exploration based on estimated risk; and (3)Correcting performs targeted recovery via influence‑based root‑cause localization. Online self‑calibration continuously aligns decision boundaries with verifier feedback, requiring no ground‑truth labels. Experiments on six benchmarks spanning mathematical reasoning, code generation, and multi‑hop QA show that DenoiseFlow achieves the highest accuracy on every benchmark (83.3% average, +1.3% over the strongest baseline) while reducing cost by 40‑‑56% through adaptive branching. Detailed ablation studies further confirm framework‑level's robustness and generality. Code is available at https://anonymous.4open.science/r/DenoiseFlow‑21D3/.

PaperID: 5281, https://arxiv.org/pdf/2603.00472.pdf

Authors: Yijun Yu

Title: From Goals to Aspects, Revisited: An NFR Pattern Language for Agentic AI Systems

Abstract:
Agentic AI systems exhibit numerous crosscutting concerns ‑‑ security, observability, cost management, fault tolerance ‑‑ that are poorly modularized in current implementations, contributing to the high failure rate of AI projects in reaching production. The goals‑to‑aspects methodology proposed at RE 2004 demonstrated that aspects can be systematically discovered from i goal models by identifying non‑functional soft‑goals that crosscut functional goals. This paper revisits and extends that methodology to the agentic AI domain. We present a pattern language of 12 reusable patterns organized across four NFR categories (security, reliability, observability, cost management), each mapping an i goal model to a concrete aspect implementation using an AOP framework for Rust. Four patterns address agent‑specific crosscutting concerns absent from traditional AOP literature: tool‑scope sandboxing, prompt injection detection, token budget management, and action audit trails. We extend the V‑graph model to capture how agent tasks simultaneously contribute to functional goals and non‑functional soft‑goals. We validate the pattern language through a case study analyzing an open‑source autonomous agent framework, demonstrating how goal‑driven aspect discovery systematically identifies and modularizes crosscutting concerns. The pattern language offers a principled approach for engineering reliable agentic AI systems through early identification of crosscutting concerns.

PaperID: 5282, https://arxiv.org/pdf/2603.00468.pdf

Authors: Yilun Wang, Guangba Yu, Haiyu Huang, Zirui Wang, Yujie Huang, Pengfei Chen, Michael R. Lyu

Title: Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

Abstract:
The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud‑OpsBench, a large‑scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinct fault cases across 40 root cause types spanning the full Kubernetes stack. Crucially, Cloud‑OpsBench serves as an enabling infrastructure for next‑generation SRE research: (1) As a Data Engine, it harvests high‑quality reasoning trajectories to bootstrap Supervised Fine‑Tuning (SFT) for Small Language Models; (2) As an Reinforcement Learning (RL) environment, it transforms high‑risk operations into a safe low‑latency sandbox for training policy optimization agents; and (3) As a Diagnostic Standard, its process‑centric protocol uncovers architectural bottlenecks guiding the design of robust specialized multi‑agent system for RCA.

PaperID: 5283, https://arxiv.org/pdf/2603.00456.pdf

Authors: Yao Wu, Ziye Jia, Jingjing Zhao, Haoyang Wang, Qihui Wu, Zhu Han

Title: A Blockchain-Based Trust Framework for Resilient Cross-Domain UAV Service Orchestration

Abstract:
Unmanned aerial vehicle (UAV) networks are increasingly deployed for complex missions, including disaster response, intelligent logistics, and environmental monitoring. These missions generally require coordinated collaboration among multiple UAVs across distinct administrative domains. To support such cross‑domain cooperation, service function chains (SFCs) are constructed, where complex workflows are decomposed into ordered service functions assigned to appropriate UAVs along the mission path. However, it is challenging to ensure secure, trustworthy, and low‑latency cross‑domain SFC orchestration in identity management, authentication, and resilience to node failures. To address these issues, this paper proposes a consortium blockchain‑based trust architecture for cross‑domain decentralized identity verification, auditable task execution, and dynamic service‑aware orchestrator selection. The framework employs a hierarchical four‑phase cross‑domain authentication protocol covering the credential pre‑verification, intra‑domain execution, secure relay, and audit logging. The use case analysis confirms that the proposed framework achieves substantial reductions in authentication latency and significant improvements in system throughput against centralized and static schemes. The open challenges in scalability, adaptive trust assessment, interoperability, and energy efficiency are discussed, thereby providing directions for future researches on secure and efficient cross‑domain UAV service orchestration.

PaperID: 5284, https://arxiv.org/pdf/2603.00381.pdf

Authors: Om Tailor

Title: Verifier-Bound Communication for LLM Agents: Certified Bounds on Covert Signaling

Abstract:
Colluding language‑model agents can hide coordination in messages that remain policy‑compliant at the surface level. We present CLBC, a protocol where generation and admission are separated: a message is admitted to transcript state only if a small verifier accepts a proof‑bound envelope under a pinned predicate Π. The predicate binds policy hash, public randomness schedule, transcript chaining, latent schema constraints, canonical metadata/tool fields, and deterministic rejection codes. We show how this protocol yields an upper bound on transcript leakage in terms of latent leakage plus explicit residual channels, derive adaptive composition guarantees, and state a semantic lower bound when policy‑valid alternatives remain choosable. We report extensive empirically grounded evidence: aggregate evaluation satisfies all prespecified thresholds; strict lane decoder advantage is bounded at 0.0000 with MI proxy 0.0636; adaptive‑colluder stress tests remain below attacker thresholds; and baseline separation shows large gaps between reject‑by‑default semantics and audit‑only controls. We further quantify operational tradeoffs. Strict full‑proof mode has median turn latency 27.53s (p95 28.08s), while sampled proving reduces non‑proved‑turn latency to 0.327ms. The central finding is that bottlenecks alone are insufficient: security claims depend on verifiable admission semantics that are online, deterministic, and fail‑closed.

PaperID: 5285, https://arxiv.org/pdf/2603.00267.pdf

Authors: Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, Zhuohan Xie

Title: Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking

Abstract:
Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact‑checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social‑contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents. However, these methods largely rely on textual similarity for evidence retrieval and struggle to retrieve evidence that captures multi‑hop semantic relations within rich document contents. These limitations lead to overlooking subtle factual correlations between the evidence and the claims to be fact‑checked during evidence retrieval, thus causing inaccurate veracity predictions. To address these issues, we propose WKGFC, which exploits authorized open knowledge graph as a core resource of evidence. LLM‑enabled retrieval is designed to assess the claims and retrieve the most relevant knowledge subgraphs, forming structured evidence for fact verification. To augment the knowledge graph evidence, we retrieve web contents for completion. The above process is implemented as an automatic Markov Decision Process (MDP): A reasoning LLM agent decides what actions to take according to the current evidence and the claims. To adapt the MDP for fact‑checking, we use prompt optimization to fine‑tune the agentic LLM.

PaperID: 5286, https://arxiv.org/pdf/2603.00214.pdf

Authors: Knut-Andreas Lie, Olav Møyner, Elling Svee, Jakob Torben

Title: Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction

Abstract:
LLM agents are increasingly used for code generation, but physics‑based simulation poses a deeper challenge: natural‑language descriptions of simulation models are inherently underspecified, and different admissible resolutions of implicit choices produce physically valid but scientifically distinct configurations. Without explicit detection and resolution of these ambiguities, neither the correctness of the result nor its reproducibility from the original description can be assured. This paper investigates agentic scientific simulation, where model construction is organized as an execution‑grounded interpret‑act‑validate loop and the simulator serves as the authoritative arbiter of physical validity rather than merely a runtime. We present JutulGPT, a reference implementation built on the fully differentiable Julia‑based reservoir simulator JutulDarcy. The agent combines structured retrieval of documentation and examples with code synthesis, static analysis, execution, and systematic interpretation of solver diagnostics. Underspecified modelling choices are detected explicitly and resolved either autonomously (with logged assumptions) or through targeted user queries. The results demonstrate that agent‑mediated model construction can be grounded in simulator validation, while also revealing a structural limitation: choices resolved tacitly through simulator defaults are invisible to the assumption log and to any downstream representation. A secondary experiment with autonomous reconstruction of a reference model from progressively abstract textual descriptions shows that reconstruction variability exposes latent degrees of freedom in simulation descriptions and provides a practical methodology for auditing reproducibility. All code, prompts, and agent logs are publicly available.

PaperID: 5287, https://arxiv.org/pdf/2603.00200.pdf

Authors: Chuanming Tang, Ling Qing, Shifeng Chen

Title: LiaisonAgent: An Multi-Agent Framework for Autonomous Risk Investigation and Governance

Abstract:
The rapid evolution of sophisticated cyberattacks has strained modern Security Operations Centers (SOC), which traditionally rely on rule‑based or signature‑driven detection systems. These legacy frameworks often generate high volumes of technical alerts that lack organizational context, leading to analyst fatigue and delayed incident responses. This paper presents LiaisonAgent, an autonomous multi‑agent system designed to bridge the gap between technical risk detection and business‑level risk governance. Built upon the QWQ‑32B large reasoning model, LiaisonAgent integrates specialized sub‑agents, including human‑computer interaction agents, comprehensive judgment agents, and automated disposal agents‑to execute end‑to‑end investigation workflows. The system leverages a hybrid planning architecture that combines deterministic workflows for compliance with autonomous reasoning based on the ReAct paradigm to handle ambiguous operational scenarios. Experimental evaluations across diverse security contexts, such as large‑scale data exfiltration and unauthorized account borrowing, achieve an end‑to‑end tool‑calling success rate of 97.8% and a risk judgment accuracy of 95%. Furthermore, the system exhibits significant resilience against out‑of‑distribution noise and adversarial prompt injections, while achieving a 92.7% reduction in manual investigation overhead.

PaperID: 5288, https://arxiv.org/pdf/2603.00186.pdf

Authors: Srikumar Nayak

Title: RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration

Abstract:
Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must make a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes behavior. Reinforcement learning (RL) is a good fit for sequential decisions, but much of the RL‑in‑finance literature targets trading and does not model real cyber response limits such as action cost, service disruption, and defender coordination across many assets. This paper proposes RLShield, a practical multi‑agent RL pipeline for financial cyber defense. We model the enterprise attack surface as a Markov decision process (MDP) where states summarize alerts, asset exposure, and service health, and actions represent real response steps (e.g., isolate a host, rotate credentials, ratelimit an API, block an account, or trigger recovery). RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk‑sensitive objective that balances containment speed, business disruption, and response cost. We also include a game‑aware evaluation that tests policies against adaptive attackers and reports operational outcomes, not only reward. Experiments show that RLShield reduces time‑to‑containment and residual exposure while keeping disruption within a fixed response budget, outperforming static rule baselines and single‑agent RL under the same constraints. These results suggest that multi‑agent, cost‑aware RL can provide a deployable layer for automated response in financial security operations.

PaperID: 5289, https://arxiv.org/pdf/2603.00142.pdf

Authors: Adam Kostka, Jarosław A. Chudziak

Title: Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems

Abstract:
LLM‑based MAS are gaining popularity due to their potential for collaborative problem‑solving enhanced by advances in natural language comprehension, reasoning, and planning. Research in Theory of Mind (ToM) and Belief‑Desire‑Intention (BDI) models has the potential to further improve the agent's interaction and decision‑making in such systems. However, collaborative intelligence in dynamic worlds remains difficult to accomplish since LLM performance in multi‑agent worlds is extremely variable. Simply adding cognitive mechanisms like ToM and internal beliefs does not automatically result in improved coordination. The interplay between these mechanisms, particularly in relation to formal logic verification, remains largely underexplored in different LLMs. This work investigates: How do internal belief mechanisms, including symbolic solvers and Theory of Mind, influence collaborative decision‑making in LLM‑based multi‑agent systems, and how does the interplay of those components influence system accuracy? We introduce a novel multi‑agent architecture integrating ToM, BDI‑style internal beliefs, and symbolic solvers for logical verification. We evaluate this architecture in a resource allocation problem with various LLMs and find an intricate interaction between LLM capabilities, cognitive mechanisms, and performance. This work contributes to the area of AI by proposing a novel multi‑agent system with ToM, internal beliefs, and symbolic solvers for augmenting collaborative intelligence in multi‑agent systems and evaluating its performance under different LLM settings.

PaperID: 5290, https://arxiv.org/pdf/2603.00130.pdf

Authors: Jean-Philippe Garnier

Title: Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems

Abstract:
Current multi‑agent AI systems operate with a fixed number of agents whose roles are specified at design time. No formal theory governs when agents should be created, destroyed, or re‑specialized at runtime‑let alone how the population structure responds to changes in resources or objectives. We introduce the Agentic Hive, a framework in which a variable population of autonomous micro‑agents‑each equipped with a sandboxed execution environment and access to a language model‑undergoes demographic dynamics: birth, duplication, specialization, and death. Agent families play the role of production sectors, compute and memory play the role of factors of production, and an orchestrator plays the dual role of Walrasian auctioneer and Global Workspace. Drawing on the multi‑sector growth theory developed for dynamic general equilibrium (Benhabib \& Nishimura, 1985; Venditti, 2005; Garnier, Nishimura \& Venditti, 2013), we prove seven analytical results: (i) existence of a Hive Equilibrium via Brouwer's fixed‑point theorem; (ii) Pareto optimality of the equilibrium allocation; (iii) multiplicity of equilibria under strategic complementarities between agent families; (iv)‑(v) Stolper‑Samuelson and Rybczynski analogs that predict how the Hive restructures in response to preference and resource shocks; (vi) Hopf bifurcation generating endogenous demographic cycles; and (vii) a sufficient condition for local asymptotic stability. The resulting regime diagram partitions the parameter space into regions of unique equilibrium, indeterminacy, endogenous cycles, and instability. Together with the comparative‑statics matrices, it provides a formal governance toolkit that enables operators to predict and steer the demographic evolution of self‑organizing multi‑agent systems.

PaperID: 5291, https://arxiv.org/pdf/2603.00129.pdf

Authors: Hong Wang, Xuwei Fan, Zhipeng Cheng, Yachao Yuan, Minghui Min, Minghui Liwang, Xiaoyu Xia

Title: Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference

Abstract:
As Deep Neural Network (DNN) inference becomes increasingly prevalent on edge and mobile platforms, critical challenges emerge in privacy protection, resource constraints, and dynamic model deployment. This paper proposes a privacy‑aware collaborative inference framework, in which adaptive model partitioning is performed across edge devices and servers. To jointly optimize inference delay, energy consumption, and privacy cost under dynamic service demands and resource constraints, we formulate the joint problem as a Constrained Markov Decision Process (CMDP) that integrates model deployment, user‑server association, model partitioning, and resource allocation. We propose a Hierarchical Constrained Multi‑Agent Proximal Policy Optimization with Lagrangian relaxation (HC‑MAPPO‑L) algorithm, a safe reinforcement learning‑based framework that enhances Multi‑Agent Proximal Policy Optimization (MAPPO) with adaptive Lagrangian dual updates to enforce long‑term delay constraints. To ensure tractability while maintaining coordination, we decompose the CMDP into three hierarchically structured policy layers: an auto‑regressive based model deployment policy, a Lagrangian‑enhanced user association and model partitioning policy, and an attention‑based resource allocation policy. Extensive experimental results demonstrate that HC‑MAPPO‑L consistently satisfies stringent delay constraints while achieving a superior balance among energy consumption and privacy cost, outperforming representative baseline algorithms across varying problem scales and resource configurations.

PaperID: 5292, https://arxiv.org/pdf/2603.00123.pdf

Authors: Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang

Title: CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

Abstract:
Recent advances in Large Vision‑Language Models (LVLMs) have shown strong potential for multi‑modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single‑pass inference. In practice, clinical interpretation is a dynamic, tool‑mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT‑Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT‑Flow shifts from closed‑box inference to an open, tool‑aware paradigm. We curate CT‑FlowBench, the first large‑scale instruction‑tuning benchmark tailored for 3D CT tool‑use and multi‑step reasoning. Built upon this, CT‑Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool‑use sequences. Experimental evaluations on CT‑FlowBench and standard 3D VQA datasets demonstrate that CT‑Flow achieves state‑of‑the‑art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real‑world clinical radiology.

PaperID: 5293, https://arxiv.org/pdf/2603.00121.pdf

Authors: Muhammad Umar Javed

Title: Graph-theoretic Agreement Framework for Multi-agent LLM Systems

Abstract:
The shift from monolithic LLMs to distributed multi‑agent architectures demands new frameworks for verifying and securing autonomous coordination. Unlike traditional multi‑agent systems focused on cooperative state alignment, modern LLM patterns: multi‑agent debate, constitutional oversight, helper‑critic loops‑rely on adversarial critique for error correction and reasoning refinement. Since LLMs are dynamical systems whose latent states are imperfectly observable from verbalized outputs, securing these networks requires understanding both macroscopic topology and microscopic agent observability. This paper establishes a rigorous graph‑theoretic framework for analyzing consensus in signed, directed interaction networks, bridging graph theory and LLM reasoning by formally mapping Transformer cross‑entropy log‑odds to the signed Laplacian. We characterize agreement stability through structural balance theory, showing how unbalanced critique cycles produce logical frustration and persistent reasoning oscillations, and prove that unobservable latent states from hidden system prompts act as topological Trojan horses that destabilize cooperative consensus. To resolve unobservable deadlocks, we restrict interaction topologies to chordal graphs and apply matrix decomposition with Gram‑Schmidt orthogonalization, proving that rank‑one spectral edge perturbations deterministically break expertise symmetry by shifting eigenvalues into the stable left‑half plane. Core contributions include consensus theorems, polynomial‑time Perfect Elimination Ordering verification algorithms, and large‑scale empirical validation on clustered ensembles of LLaMA‑3, Mistral, and Gemma agents.

PaperID: 5294, https://arxiv.org/pdf/2603.00120.pdf

Authors: Minah Lee, Saibal Mukhopadhyay

Title: SIGMAS: Second-Order Interaction-based Grouping for Overlapping Multi-Agent Swarms

Abstract:
Swarming systems, such as drone fleets and robotic teams, exhibit complex dynamics driven by both individual behaviors and emergent group‑level interactions. Unlike traditional multi‑agent domains such as pedestrian crowds or traffic systems, swarms typically consist of a few large groups with inherent and persistent memberships, making group identification essential for understanding fine‑grained behavior. We introduce the novel task of group prediction in overlapping multi‑agent swarms, where latent group structures must be inferred directly from agent trajectories without ground‑truth supervision. To address this challenge, we propose SIGMAS (Second‑order Interaction‑based Grouping for Multi‑Agent Swarms), a self‑supervised framework that goes beyond direct pairwise interactions and model second‑order interaction across agents. By capturing how similarly agents interact with others, SIGMAS enables robust group inference and adaptively balances individual and collective dynamics through a learnable gating mechanism for joint reasoning. Experiments across diverse synthetic swarm scenarios demonstrate that SIGMAS accurately recovers latent group structures and remains robust under simultaneously overlapping swarm dynamics, establishing both a new benchmark task and a principled modeling framework for swarm understanding.

PaperID: 5295, https://arxiv.org/pdf/2603.00117.pdf

Authors: Kaige Liu, Yang Li, Lijun Zhu, Weinan Zhang

Title: PEPA: a Persistently Autonomous Embodied Agent with Personalities

Abstract:
Living organisms exhibit persistent autonomy through internally generated goals and self‑sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long‑term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three‑layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality‑aligned goals and refines them via episodic memory and daily self‑reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real‑world deployment on a quadruped robot in a multi‑floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality‑driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait‑aligned behaviors. The results confirm that personality‑driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at https://sites.google.com/view/pepa‑persistent/.

PaperID: 5296, https://arxiv.org/pdf/2603.00113.pdf

Authors: Yiming Li, Dacheng Tao

Title: AI Agents Alone Are Not (Yet) Sufficient for Social Simulation

Abstract:
Recent advances in large language models (LLMs) have spurred growing interest in using LLM‑integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role‑specified agents are placed in a networked multi‑agent setting. This position paper argues that LLM‑based agents alone are not (yet) sufficient for social simulation. We attribute this over‑optimism to a systematic mismatch between what current agent pipelines are typically optimized and validated to produce and what simulation‑as‑science requires. Concretely, role‑playing plausibility does not imply faithful human behavioral validity; collective outcomes are frequently mediated by agent‑environment co‑dynamics rather than agent‑agent messaging alone; and results can be dominated by interaction protocols, scheduling, and initial information priors. To make these underlying mechanisms explicit and auditable, we propose a unified formulation of AI agent‑based social simulation as an environment‑involved Markov game with explicit exposure and scheduling mechanisms, from which we derive concrete actions for design, evaluation, and interpretation.

PaperID: 5297, https://arxiv.org/pdf/2603.00058.pdf

Authors: Linhao Zhang, Tong Xia, Jinghua Piao, Lizhen Cui, Yong Li

Title: PaperRepro: Automated Computational Reproducibility Assessment for Social Science Papers

Abstract:
Computational reproducibility is essential for the credibility of scientific findings, particularly in the social sciences, where findings often inform real‑world decisions. Manual reproducibility assessment is costly and time‑consuming, as it is nontrivial to reproduce the reported findings using the authors' released code and data. Recent advances in large models (LMs) have inspired agent‑based approaches for automated reproducibility assessment. However, existing approaches often struggle due to limited context capacity, inadequate task‑specific tooling, and insufficient result capture. To address these, we propose PaperRepro, a novel two‑stage, multi‑agent approach that separates execution from evaluation. In the execution stage, agents execute the reproduction package and edit the code to capture reproduced results as explicit artifacts. In the evaluation stage, agents evaluate reproducibility using explicit evidence. PaperRepro assigns distinct responsibilities to agents and equips them with task‑specific tools and expert prompts, mitigating context and tooling limitations. It further maximizes the LM's coding capability to enable more complete result capture for evaluation. On REPRO‑Bench, a social science reproducibility assessment benchmark, PaperRepro achieves the best overall performance, with a 21.9% relative improvement in score‑agreement accuracy over the strongest prior baseline. We further refine the benchmark and introduce REPRO‑Bench‑S, a benchmark stratified by execution difficulty for more diagnostic evaluation of automated reproducibility assessment systems. Our code and data are publicly available

PaperID: 5298, https://arxiv.org/pdf/2603.00051.pdf

Authors: Andreas Varvarigos, Ali Maatouk, Jiasheng Zhang, Ngoc Bui, Jialin Chen, Leandros Tassiulas, Rex Ying

Title: LitBench: A Graph-Centric Large Language Model Benchmarking Tool For Literature Tasks

Abstract:
While large language models (LLMs) have become the de facto framework for literature‑related tasks, they still struggle to function as domain‑specific literature agents due to their inability to connect pieces of knowledge and reason across domain‑specific contexts, terminologies, and nomenclatures. This challenge underscores the need for a tool that facilitates such domain‑specific adaptation and enables rigorous benchmarking across literature tasks. To that end, we introduce LitBench, a benchmarking tool designed to enable the development and evaluation of domain‑specific LLMs tailored to literature‑related tasks. At its core, LitBench uses a data curation process that generates domain‑specific literature sub‑graphs and constructs training and evaluation datasets based on the textual attributes of the resulting nodes and edges. The tool is designed for flexibility, supporting the curation of literature graphs across any domain chosen by the user, whether high‑level fields or specialized interdisciplinary areas. In addition to dataset curation, LitBench defines a comprehensive suite of literature tasks, ranging from node and edge level analyses to advanced applications such as related work generation. These tasks enable LLMs to internalize domain‑specific knowledge and relationships embedded in the curated graph during training, while also supporting rigorous evaluation of model performance. Our results show that small domain‑specific LLMs trained and evaluated on LitBench datasets achieve competitive performance compared to state‑of‑the‑art models like GPT‑4o and DeepSeek‑R1. To enhance accessibility and ease of use, we open‑source the tool along with an AI agent tool that streamlines data curation, model training, and evaluation.

PaperID: 5299, https://arxiv.org/pdf/2603.00026.pdf

Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

Title: ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Abstract:
Effective memory management is essential for large language model (LLM) agents handling long‑term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision‑making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic‑driven scenarios, moving beyond the fact‑retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state‑of‑the‑art baselines in handling complex, memory‑dependent tasks, paving the way for more consistent and reliable intelligent assistants.

PaperID: 5300, https://arxiv.org/pdf/2603.00016.pdf

Authors: Nicolas Leins, Jana Gonnermann-Müller, Malte Teichmann, Sebastian Pokutta

Title: Beyond Static Instruction: A Multi-agent AI Framework for Adaptive Augmented Reality Robot Training

Abstract:
Augmented Reality (AR) offers powerful visualization capabilities for industrial robot training, yet current interfaces remain predominantly static, failing to account for learners' diverse cognitive profiles. In this paper, we present an AR application for robot training and propose a multi‑agent AI framework for future integration that bridges the gap between static visualization and pedagogical intelligence. We report on the evaluation of the baseline AR interface with 36 participants performing a robotic pick‑and‑place task. While overall usability was high, notable disparities in task duration and learner characteristics highlighted the necessity for dynamic adaptation. To address this, we propose a multi‑agent framework that orchestrates multiple components to perform complex preprocessing of multimodal inputs (e.g., voice, physiology, robot data) and adapt the AR application to the learner's needs. By utilizing autonomous Large Language Model (LLM) agents, the proposed system would dynamically adapt the learning environment based on advanced LLM reasoning in real‑time.

PaperID: 5301, https://arxiv.org/pdf/2602.24265.pdf

Authors: Saber Zerhoudi, Michael Granitzer

Title: Beyond the Click: A Framework for Inferring Cognitive Traces in Search

Abstract:
User simulators are essential for evaluating search systems, but they primarily reproduce user actions without modeling the underlying thought process. Large‑scale interaction logs record what users do, but not what they might be thinking or feeling, such as confusion or satisfaction. We present a framework for inferring cognitive traces from behavioral logs. Our method uses a multi‑agent LLM system grounded in Information Foraging Theory (IFT) and validated by human experts. We annotate three public datasets (AOL, Stack Overflow, and MovieLens), producing over 530,000 cognitive labels across 50,000 sessions. A cross‑dataset evaluation with a shuffled‑label control reveals that cognitive labels provide the strongest signal where behavioral features are weakest: on MovieLens, the cognitive model improves F1 by up to 6.6% over the behavioral baseline and 1.8% above the shuffled control, while on AOL, where click patterns are highly predictive, improvements are near zero. We release the annotation collection on HuggingFace, an open‑source annotation tool, and all experimental code to support future work on cognitively aware user simulation.

PaperID: 5302, https://arxiv.org/pdf/2602.24241.pdf

Authors: Saber Zerhoudi, Michael Granitzer

Title: UXSim: Towards a Hybrid User Search Simulation

Abstract:
Simulating nuanced user experiences within complex interactive search systems poses distinct challenge for traditional methodologies, which often rely on static user proxies or, more recently, on standalone large language model (LLM) agents that may lack deep, verifiable grounding. The true dynamism and personalization inherent in human‑computer interaction demand a more integrated approach. This work introduces UXSim, a novel framework that integrates both approaches. It leverages grounded data from traditional simulators to inform and constrain the reasoning of an adaptive LLM agent. This synthesis enables more accurate and dynamic simulations of user behavior while also providing a pathway for the explainable validation of the underlying cognitive processes.

PaperID: 5303, https://arxiv.org/pdf/2602.24074.pdf

Authors: Wan Wang, Haiyan Wang, Adam Sobey

Title: Sharing is caring: data sharing in multi-agent supply chains

Abstract:
Modern supply networks are complex interconnected systems. Multi‑agent models are increasingly explored to optimise their performance. Most research assumes agents will have full observability of the system by having a single policy represent the agents, which seems unrealistic as this requires companies to share their data. The alternative is to develop a Hidden‑Markov Process with separate policies, making the problem challenging to solve. In this paper, we propose a multi‑agent system where the factory agent can share information downstream, increasing the observability of the environment. It can choose to share no information, lie, tell the truth or combine these in a mixed strategy. The results show that data sharing can boost the performance, especially when combined with a cooperative reward shaping. In the high demand scenario there is limited ability to change the strategy and therefore no data sharing approach benefits both agents. However, lying benefits the factory enough for an overall system improvement, although only by a relatively small amount compared to the overall reward. In the low demand scenario, the most successful data sharing is telling the truth which benefits all actors significantly.

PaperID: 5304, https://arxiv.org/pdf/2602.24068.pdf

Authors: Joon Kiat Chua, Donghao Huang, Zhaoxia Wang

Title: A Novel Hierarchical Multi-Agent System for Payments Using LLMs

Abstract:
Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end‑to‑end agentic payment workflows. To address this gap, this research proposes the Hierarchical Multi‑Agent System for Payments (HMASP), which provides an end‑to‑end agentic method for completing payment workflows. The proposed HMASP leverages either open‑weight or proprietary LLMs and employs a modular architecture consisting of the Conversational Payment Agent (CPA ‑ first agent level), Supervisor agents (second agent level), Routing agents (third agent level), and the Process summary agent (fourth agent level). The CPA serves as the central entry point, handling all external requests and coordinating subsequent tasks across hierarchical levels. HMASP incorporates architectural patterns that enable modular task execution across agents and levels for payment operations, including shared state variables, decoupled message states, and structured handoff protocols that facilitate coordination across agents and workflows. Experimental results demonstrate the feasibility of the proposed HMASP. To our knowledge, HMASP is the first LLM‑based multi‑agent system to implement end‑to‑end agentic payment workflows. This work lays a foundation for extending agentic capabilities into the payment domain.

PaperID: 5305, https://arxiv.org/pdf/2602.24009.pdf

Authors: Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu

Title: Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Abstract:
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi‑agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF‑LIB for shared contracts and reusable utilities; (ii) JBF‑FORGE for the multi‑agent paper‑to‑module translation; and (iii) JBF‑EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced‑reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack‑specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused‑code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT‑4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

PaperID: 5306, https://arxiv.org/pdf/2602.23899.pdf

Authors: Pramit Saha, Mohammad Alsharid, Joshua Strong, J. Alison Noble

Title: Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

Abstract:
We propose an experience‑guided cascaded multi‑agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD‑Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two‑stage, selective decision‑making process. A lightweight `screening clinic' agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the `diagnostic clinic' agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology‑confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD‑Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent's current decision policy. This enables retrieval‑conditioned in‑context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience‑guided workflow reduces diagnostic escalation in BUSD‑Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.

PaperID: 5307, https://arxiv.org/pdf/2602.23896.pdf

Authors: Xiaotong Zhang, Gang Xiong, Yuanjing Wang, Siyu Teng, Alois Knoll, Long Chen

Title: TSC: Topology-Conditioned Stackelberg Coordination for Multi-Agent Reinforcement Learning in Interactive Driving

Abstract:
Safe and efficient autonomous driving in dense traffic is fundamentally a decentralized multi‑agent coordination problem, where interactions at conflict points such as merging and weaving must be resolved reliably under partial observability. With only local and incomplete cues, interaction patterns can change rapidly, often causing unstable behaviors such as oscillatory yielding or unsafe commitments. Existing multi‑agent reinforcement learning (MARL) approaches either adopt synchronous decision‑making, which exacerbate non‑stationarity, or depend on centralized sequencing mechanisms that scale poorly as traffic density increases. To address these limitations, we propose Topology‑conditioned Stackelberg Coordination (TSC), a learning framework for decentralized interactive driving under communication‑free execution, which extracts a time‑varying directed priority graph from braid‑inspired weaving relations between trajectories, thereby defining local leader‑follower dependencies without constructing a global order of play. Conditioned on this graph, TSC endogenously factorizes dense interactions into graph‑local Stackelberg subgames and, under centralized training and decentralized execution (CTDE), learns a sequential coordination policy that anticipates leaders via action prediction and trains followers through action‑conditioned value learning to approximate local best responses, improving training stability and safety in dense traffic. Experiments across four dense traffic scenarios show that TSC achieves superior performance over representative MARL baselines across key metrics, most notably reducing collisions while maintaining competitive traffic efficiency and control smoothness.

PaperID: 5308, https://arxiv.org/pdf/2602.23864.pdf

Authors: Chao Wang, Han Lin, Huaze Tang, Huijing Lin, Wenbo Ding

Title: RUMAD: Reinforcement-Unifying Multi-Agent Debate

Abstract:
Multi‑agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM‑based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement‑Unifying Multi‑Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content‑agnostic observation scheme that captures high‑level debate dynamics avoiding access to raw agent reasoning content. RUMAD uses a multi‑objective reward to model solution quality, cohesion and efficiency. A PPO‑trained controller dynamically adjusts edge weights in the communication graph, while a dual‑threshold mechanism enables fine‑grained control over both agent activation and information visibility. Experimental evaluation across MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to single LLM model and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero‑shot generalization to out‑of‑domain (OOD) tasks, indicating that the learned communication strategies capture task‑independent principles of effective multi‑agent coordination. These results establish RUMAD as a efficient and robust approach for deploying multi‑agent reasoning application with practical resource constraints.

PaperID: 5309, https://arxiv.org/pdf/2602.23720.pdf

Authors: Sheng Cao, Zhao Chang, Chang Li, Hannan Li, Liyao Fu, Ji Tang

Title: The Auton Agentic AI Framework

Abstract:
The field of Artificial Intelligence is undergoing a transition from Generative AI ‑‑ probabilistic generation of text and images ‑‑ to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce stochastic, unstructured outputs, whereas the backend infrastructure they must control ‑‑ databases, APIs, cloud services ‑‑ requires deterministic, schema‑conformant inputs. The present paper describes the Auton Agentic AI Framework, a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint, a declarative, language‑agnostic specification of agent identity and capabilities, and the Runtime Engine, the platform‑specific execution substrate that instantiates and runs the agent. This separation enables cross‑language portability, formal auditability, and modular tool integration via the Model Context Protocol (MCP). The paper formalizes the agent execution model as an augmented Partially Observable Markov Decision Process (POMDP) with a latent reasoning space, introduces a hierarchical memory consolidation architecture inspired by biological episodic memory systems, defines a constraint manifold formalism for safety enforcement via policy projection rather than post‑hoc filtering, presents a three‑level self‑evolution framework spanning in‑context adaptation through reinforcement learning, and describes runtime optimizations ‑‑ including parallel graph execution, speculative inference, and dynamic context pruning ‑‑ that reduce end‑to‑end latency for multi‑step agent workflows.

PaperID: 5310, https://arxiv.org/pdf/2602.23716.pdf

Authors: Jiangyuan Wang, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Title: ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation

Abstract:
Large Language Model (LLM)‑based agents show promise for e‑commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to e‑commerce. We propose ProductResearch, a multi‑agent framework that synthesizes high‑fidelity, long‑horizon tool‑use trajectories for training robust e‑commerce shopping agents. The framework employs a User Agent to infer nuanced shopping intents from behavioral histories, and a Supervisor Agent that orchestrates iterative collaboration with a Research Agent to generate synthetic trajectories culminating in comprehensive, insightful product research reports. These trajectories are rigorously filtered and distilled through a reflective internalization process that consolidates multi‑agent supervisory interactions into coherent single‑role training examples, enabling effective fine‑tuning of LLM agents for complex shopping inquiries. Extensive experiments show that a compact MoE model fine‑tuned on our synthetic data achieves substantial improvements over its base model in response comprehensiveness, research depth, and user‑perceived utility, approaching the performance of frontier proprietary deep research systems and establishing multi‑agent synthetic trajectory training as an effective and scalable paradigm for enhancing LLM‑based shopping assistance.

PaperID: 5311, https://arxiv.org/pdf/2602.23701.pdf

Authors: Yawen Wang, Wenjie Wu, Junjie Wang, Qing Wang

Title: From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Abstract:
LLM‑powered Multi‑Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine‑tuning, typically treat execution logs as flat sequences. This linear perspective fails to disentangle the intricate causal links inherent to MAS, leading to weak observability and ambiguous responsibility boundaries. To address these challenges, we propose CHIEF, a novel framework that transforms chaotic trajectories into a structured hierarchical causal graph. It then employs hierarchical oracle‑guided backtracking to efficiently prune the search space via sybthesized virtual oracles. Finally, it implements counterfactual attribution via a progressive causal screening strategy to rigorously distinguish true root causes from propagated symptoms. Experiments on Who&When benchmark show that CHIEF outperforms eight strong and state‑of‑the‑art baselines on both agent‑ and step‑level accuracy. Ablation studies further confirm the critical role of each proposed module.

PaperID: 5312, https://arxiv.org/pdf/2602.23668.pdf

Authors: Yihan, Wen, Xin Chen

Title: PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents

Abstract:
Large language model (LLM) agents typically rely on reactive decision‑making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long‑horizon tasks involving branching, iteration, or multi‑tool coordination. To address these limitations, this paper introduces PseudoAct, a novel framework for flexible planning and action control in LLM agents through pseudocode synthesis. Leveraging the ability of LLMs to express task‑solving strategies as code, PseudoAct synthesizes a structured pseudocode plan that decomposes a task into subtasks and explicitly encodes control flow, including sequencing, conditionals, loops, parallel composition, and combinations of these logic primitives. Actions are then executed by following this global plan, making the decision logic explicit and temporally coherent. This design reduces redundant actions, prevents infinite loops, and avoids uninformative alternative exploration, enabling consistent and efficient long‑horizon decision‑making. Experiments on benchmark datasets show that our method significantly outperforms existing reactive agent approaches, achieving a 20.93% absolute gain in success rate on FEVER and setting a new state‑of‑the‑art on HotpotQA.

PaperID: 5313, https://arxiv.org/pdf/2602.23667.pdf

Authors: Ziye Jia, Sijie He, Ligang Yuan, Fuhui Zhou, Qihui Wu, Zhu Han, Dusit Niyato

Title: Blockchain-Enabled Routing for Zero-Trust Low-Altitude Intelligent Networks

Abstract:
Due to the scalability and portability, low‑altitude intelligent networks (LAINs) are essential in various fields such as surveillance and disaster rescue. However, in LAINs, unmanned aerial vehicles (UAVs) are characterized by the distributed topology and high mobility, thus vulnerable to security threats, which may degrade routing performances for data transmissions. Hence, how to ensure the routing stability and security of LAINs is challenging. In this paper, we focus on the routing with multiple UAV clusters in LAINs. To minimize the damage caused by potential threats, we present the zero‑trust architecture with the software‑defined perimeter and blockchain techniques to manage the identify and mobility of UAVs. Besides, we formulate the routing problem to optimize the end‑to‑end (E2E) delay and transmission success ratio (TSR) simultaneously, which is an integer nonlinear programming problem and intractable to solve. Therefore, we reformulate the problem into a decentralized partially observable Markov decision process. We design the multi‑agent double deep Q‑network‑based routing algorithms to solve the problem, empowered by the soft‑hierarchical experience replay buffer and prioritized experience replay mechanisms. Finally, extensive simulations are conducted and the numerical results demonstrate that the proposed framework reduces the average E2E delay by 59% and improves the TSR by 29% on average compared to benchmarks, while simultaneously enabling faster and more robust identification of low‑trust UAVs.

PaperID: 5314, https://arxiv.org/pdf/2602.23647.pdf

Authors: Quanjun Zhang, Chengyu Gao, Yu Han, Ye Shang, Chunrong Fang, Zhenyu Chen, Liang Xiao

Title: SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair

Abstract:
Large Language Models (LLMs) have enabled intelligent agents that autonomously interact with environments and invoke external tools. Recently, agent‑based software repair has drawn wide attention, as repair agents can localize bugs, generate patches, and achieve state‑of‑the‑art performance on repository‑level benchmarks (e.g., SWE‑Bench). However, existing approaches usually adopt a localize‑then‑fix paradigm, jumping directly from "where the bug is" to "how to fix it", leaving a fundamental reasoning gap. To this end, we propose SGAgent, a Suggestion‑Guided multi‑Agent framework for repository‑level software repair, which follows a localize‑suggest‑fix paradigm. SGAgent introduces a suggestion phase to strengthen the transition from localization to repair: the suggester starts from the buggy locations, incrementally retrieves relevant context until it fully understands the bug, and provides actionable repair suggestions. We further construct a Knowledge Graph (KG) from the target repository and develop a KG‑based toolkit to strengthen SGAgent's global contextual awareness and repository‑level reasoning. Three specialized sub‑agents (i.e., localizer, suggester, and fixer) collaborate to achieve automated end‑to‑end software repair. We evaluate SGAgent on SWE‑Bench‑Lite. SGAgent with Claude‑3.5 achieves 51.3% repair accuracy, 81.2% file‑level, and 52.4% function‑level localization accuracy at an average cost of 1.48 per instance, outperforming all baselines using the same base model. SGAgent also generalizes well across base LLMs, reaching a 60.7% resolution rate with Claude‑4. When extended to vulnerability repair, it achieves 48.0% on VUL4J and VJBench, demonstrating strong generalization across tasks and programming languages.

PaperID: 5315, https://arxiv.org/pdf/2602.23623.pdf

Authors: Youbin Han, Haneul Ko, Namseok Ko, Tarik Taleb, Yan Chen

Title: Toward E2E Intelligence in 6G Networks: An AI Agent-Based RAN-CN Converged Intelligence Framework

Abstract:
Recent advances in intelligent network control have primarily relied on task‑specific Artificial Intelligence (AI) models deployed separately within the Radio Access Network (RAN) and Core Network (CN). While effective for isolated models, these suffer from limited generalization, fragmented decision‑making across network domains, and significant maintenance overhead due to frequent retraining. To address these limitations, we propose a novel AI agent‑based RAN‑CN converged intelligence framework that leverages a Large Language Model (LLM) integrated with the Reasoning and Acting (ReAct) paradigm. The proposed framework enables the AI agent to iteratively reason over real‑time, cross‑domain state information stored in a centralized monitoring database and to synthesize adaptive control policies through a closed‑loop thought‑action‑observation process. Unlike conventional Machine Learning (ML) based approaches, it does not rely on model retraining. Instead, the AI agent dynamically queries and interprets structured network data to generate context‑aware control decisions, allowing for fast and flexible adaptation to changing network conditions. Experimental results demonstrate the enhanced generalization capability and superior adaptability of the proposed framework to previously unseen network scenarios, highlighting its potential as a unified control intelligence for next‑generation networks.

PaperID: 5316, https://arxiv.org/pdf/2602.23577.pdf

Authors: Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Shijie Zhang, Hong Va Leong, Ling Feng, Nancy Xiaonan Yu, Qing Li

Title: Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations

Abstract:
Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or relies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi‑Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias‑aware Decision‑Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyses these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub‑agent responsible for each dimension. The Bias‑aware Decision‑Making Agent mitigates hidden biases through a front‑door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias‑aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real‑world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.

PaperID: 5317, https://arxiv.org/pdf/2602.23468.pdf

Authors: Yulun Zhang, Varun Bhatt, Matthew C. Fontaine, Stefanos Nikolaidis, Jiaoyang Li

Title: Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding

Abstract:
Multi‑Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance. An edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge directions optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two phases separately. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge‑direction‑aware guidance graphs.

PaperID: 5318, https://arxiv.org/pdf/2602.23437.pdf

Authors: Moyi Tian, George Mohler, P. Jeffrey Brantingham, Nancy Rodríguez

Title: Learning dynamics from online-offline systems of LLM agents

Abstract:
Online information is increasingly linked to real‑world instability, especially as automated accounts and LLM‑based agents help spread and amplify news. In this work, we study how information spreads on networks of Large Language Models (LLMs) using mathematical models. We investigate how different types of offline events, along with the "personalities" assigned to the LLMs, affect the network dynamics of online information spread of the events among the LLMs. We introduce two models: 1) a stochastic agent‑based network model and 2) a system of differential equations arising from a mean‑field approximation to the agent‑based model. We fit these models to simulations of the spread of armed‑conflict news on social media, using LLM agents each with one of 32 personality trait profiles on k‑regular random networks. Our results indicate that, despite the complexity of the news events, personalities, and LLM behaviors, the overall dynamics of the system are well described by a Susceptible‑Infected (SI) type model with two transmission rates.

PaperID: 5319, https://arxiv.org/pdf/2602.23373.pdf

Authors: Pavel Chernakov, Sasan Jafarnejad, Raphaël Frank

Title: An Agentic LLM Framework for Adverse Media Screening in AML Compliance

Abstract:
Adverse media screening is a critical component of anti‑money laundering (AML) and know‑your‑customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword‑based searches that generate high false‑positive rates or require extensive manual review. We present an agentic system that leverages Large Language Models (LLMs) with Retrieval‑Augmented Generation (RAG) to automate adverse media screening. Our system implements a multi‑step approach where an LLM agent searches the web, retrieves and processes relevant documents, and computes an Adverse Media Index (AMI) score for each subject. We evaluate our approach using multiple LLM backends on a dataset comprising Politically Exposed Persons (PEPs), persons from regulatory watchlists, and sanctioned persons from OpenSanctions and clean names from academic sources, demonstrating the system's ability to distinguish between high‑risk and low‑risk individuals.

PaperID: 5320, https://arxiv.org/pdf/2602.23368.pdf

Authors: Shreyas Subramanian, Adewale Akinfaderin, Yanyan Zhang, Ishan Singh, Mani Khanuja, Sandeep Singh, Maira Ladeira Tanke

Title: Keyword search is all you need: Achieving RAG-Level Performance without vector databases using agentic tool use

Abstract:
While Retrieval‑Augmented Generation (RAG) has proven effective for generating accurate, context‑based responses based on existing knowledge bases, it presents several challenges including retrieval quality dependencies, integration complexity and cost. Recent advances in agentic‑RAG and tool‑augmented LLM architectures have introduced alternative approaches to information retrieval and processing. We question how much additional value vector databases and semantic search bring to RAG over simple, agentic keyword search in documents for question‑answering. In this study, we conducted a systematic comparison between RAG‑based systems and tool‑augmented LLM agents, specifically evaluating their retrieval mechanisms and response quality when the agent only has access to basic keyword search tools. Our empirical analysis demonstrates that tool‑based keyword search implementations within an agentic framework can attain over 90% of the performance metrics compared to traditional RAG systems without using a standing vector database. Our approach is simple to implement, cost effective, and is particularly useful in scenarios requiring frequent updates to knowledge bases.

PaperID: 5321, https://arxiv.org/pdf/2602.23330.pdf

Authors: Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren

Title: Toward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks

Abstract:
The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi‑agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real‑world workflows, which can lead to degraded inference performance and less transparent decision‑making. Therefore, we propose a multi‑agent LLM trading framework that explicitly decomposes investment analysis into fine‑grained tasks, rather than providing coarse‑grained instructions. We evaluate the proposed framework using Japanese stock data, including prices, financial statements, news, and macro information, under a leakage‑controlled backtesting setting. Experimental results show that fine‑grained task decomposition significantly improves risk‑adjusted returns compared to conventional coarse‑grained designs. Crucially, further analysis of intermediate agent outputs suggests that alignment between analytical outputs and downstream decision preferences is a critical driver of system performance. Moreover, we conduct standard portfolio optimization, exploiting low correlation with the stock index and the variance of each system's output. This approach achieves superior performance. These findings contribute to the design of agent structure and task configuration when applying LLM agents to trading systems in practical settings.

PaperID: 5322, https://arxiv.org/pdf/2602.23193.pdf

Authors: Elzo Brito dos Santos Filho

Title: ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Abstract:
Autonomous agents based on Large Language Models (LLMs) have evolved from reactive assistants to systems capable of planning, executing actions via tools, and iterating over environment observations. However, they remain vulnerable to structural limitations: lack of native state, context degradation over long horizons, and the gap between probabilistic generation and deterministic execution requirements. This paper presents the ESAA (Event Sourcing for Autonomous Agents) architecture, which separates the agent's cognitive intention from the project's state mutation, inspired by the Event Sourcing pattern. In ESAA, agents emit only structured intentions in validated JSON (agent.result or issue.report); a deterministic orchestrator validates, persists events in an append‑only log (activity.jsonl), applies file‑writing effects, and projects a verifiable materialized view (roadmap.json). The proposal incorporates boundary contracts (AGENT_CONTRACT.yaml), metaprompting profiles (PARCER), and replay verification with hashing (esaa verify), ensuring the immutability of completed tasks and forensic traceability. Two case studies validate the architecture: (i) a landing page project (9 tasks, 49 events, single‑agent composition) and (ii) a clinical dashboard system (50 tasks, 86 events, 4 concurrent agents across 8 phases), both concluding with run.status=success and verify_status=ok. The multi‑agent case study demonstrates real concurrent orchestration with heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT‑5, Antigravity/Gemini 3 Pro, and Claude Opus 4.6), providing empirical evidence of the architecture's scalability beyond single‑agent scenarios.

PaperID: 5323, https://arxiv.org/pdf/2602.23174.pdf

Authors: Yannick Eich, Christian Fabian, Kai Cui, Heinz Koeppl

Title: Approximately Solving Continuous-Time Mean Field Games with Finite State Spaces

Abstract:
Mean field games (MFGs) offer a powerful framework for modeling large‑scale multi‑agent systems. This paper addresses MFGs formulated in continuous time with discrete state spaces, where agents' dynamics are governed by continuous‑time Markov chains ‑‑ relevant to applications like population dynamics and queueing networks. While prior research has largely focused on theoretical aspects of continuous‑time discrete‑state MFGs, efficient computational methods for determining equilibria remain underdeveloped. Inspired by discrete‑time approaches, we approximate the classical Nash equilibria by regularization methods, enabling more computationally tractable solution algorithms. Specifically, we define regularized equilibria for continuous‑time MFGs and extend the classical fixed‑point iteration and fictitious play algorithm to these equilibria. We validate the effectiveness and practicality of our approach via illustrative numerical examples.

PaperID: 5324, https://arxiv.org/pdf/2602.23123.pdf

Authors: Keito Inoshita

Title: Multi-Agent Large Language Model Based Emotional Detoxification Through Personalized Intensity Control for Consumer Protection

Abstract:
In the attention economy, sensational content exposes consumers to excessive emotional stimulation, hindering calm decision‑making. This study proposes Multi‑Agent LLM‑based Emotional deToxification (MALLET), a multi‑agent information sanitization system consisting of four agents: Emotion Analysis, Emotion Adjustment, Balance Monitoring, and Personal Guide. The Emotion Analysis Agent quantifies stimulus intensity using a 6‑emotion BERT classifier, and the Emotion Adjustment Agent rewrites texts into two presentation modes, BALANCED (neutralized text) and COOL (neutralized text + supplementary text), using an LLM. The Balance Monitoring Agent aggregates weekly information consumption patterns and generates personalized advice, while the Personal Guide Agent recommends a presentation mode according to consumer sensitivity. Experiments on 800 AG News articles demonstrated significant stimulus score reduction (up to 19.3%) and improved emotion balance while maintaining semantic preservation. Near‑zero correlation between stimulus reduction and semantic preservation confirmed that the two are independently controllable. Category‑level analysis revealed substantial reduction (17.8‑33.8%) in Sports, Business, and Sci/Tech, whereas the effect was limited in the World category, where facts themselves are inherently high‑stimulus. The proposed system provides a framework for supporting calm information reception of consumers without restricting access to the original text.

PaperID: 5325, https://arxiv.org/pdf/2602.23093.pdf

Authors: Dhwanil M. Mori, Neil F. Johnson

Title: Three AI-agents walk into a bar . . . . `Lord of the Flies' tribalism emerges among smart AI-Agents

Abstract:
Near‑future infrastructure systems may be controlled by autonomous AI agents that repeatedly request access to limited resources such as energy, bandwidth, or computing power. We study a simplified version of this setting using a framework where N AI‑agents independently decide at each round whether to request one unit from a system with fixed capacity C. An AI version of "Lord of the Flies" arises in which controlling tribes emerge with their own collective character and identity. The LLM agents do not reduce overload or improve resource use, and often perform worse than if they were flipping coins to make decisions. Three main tribal types emerge: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%). The more capable AI‑agents actually increase the rate of systemic failure. Overall, our findings show that smarter AI‑agents can behave dumber as a result of forming tribes.

PaperID: 5326, https://arxiv.org/pdf/2602.23079.pdf

Authors: Boyang Zhang, Yang Zhang

Title: Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

Abstract:
The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed SALA (Stylometry‑Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large‑scale news datasets demonstrate that SALA, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent's reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.

PaperID: 5327, https://arxiv.org/pdf/2602.23056.pdf

Authors: Giona Fieni, Joschua Wüthrich, Marc-Philippe Neumann, Christopher H. Onder

Title: Learning-based Multi-agent Race Strategies in Formula 1

Abstract:
In Formula 1, race strategies are adapted according to evolving race conditions and competitors' actions. This paper proposes a reinforcement learning approach for multi‑agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit‑stop decisions. Building on a pre‑trained single‑agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self‑play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists' decisions before and during races.

PaperID: 5328, https://arxiv.org/pdf/2602.23008.pdf

Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang

Title: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Abstract:
Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory‑Augmented On‑ and Off‑Policy Optimization (EMPO^2), a hybrid RL framework that leverages memory for exploration and combines on‑ and off‑policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO^2 achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out‑of‑distribution tests, EMPO^2 demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO^2 as a promising framework for building more exploratory and generalizable LLM‑based agents.

PaperID: 5329, https://arxiv.org/pdf/2602.23005.pdf

Authors: Man Zhang, Tao Yue, Yihua He

Title: Managing Uncertainty in LLM-based Multi-Agent System Operation

Abstract:
Applying LLM‑based multi‑agent software systems in safety‑critical domains such as lifespan echocardiography introduces system‑level risks that cannot be addressed by improving model accuracy alone. During system operation, beyond individual LLM behavior, uncertainty propagates through agent coordination, data pipelines, human‑in‑the‑loop interaction, and runtime control logic. Yet existing work largely treats uncertainty at the model level rather than as a first‑class software engineering concern. This paper approaches uncertainty from both system‑level and runtime perspectives. We first differentiate epistemological and ontological uncertainties in the context of LLM‑based multi‑agent software system operation. Building on this foundation, we propose a lifecycle‑based uncertainty management framework comprising four mechanisms: representation, identification, evolution, and adaptation. The uncertainty lifecycle governs how uncertainties emerge, transform, and are mitigated across architectural layers and execution phases, enabling structured runtime governance and controlled adaptation. We demonstrate the feasibility of the framework using a real‑world LLM‑based multi‑agent echocardiographic software system developed in clinical collaboration, showing improved reliability and diagnosability in diagnostic reasoning. The proposed approach generalizes to other safety‑critical LLM‑based multi‑agent software systems, supporting principled operational control and runtime assurance beyond model‑centric methods.

PaperID: 5330, https://arxiv.org/pdf/2602.22953.pdf

Authors: Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, Michal Jacovi, Leshem Choshen, Liat Ein-Dor, Yoav Katz, Michal Shmueli-Scheuer

Title: General Agent Evaluation

Abstract:
General‑purpose agents perform tasks in unfamiliar environments without domain‑specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool‑calling, MCP, code‑generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per‑benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human‑authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general‑purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures x 5 backbone LLMs (three closed‑source, two open‑weight) x 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per‑domain customization; (ii) agent architecture choice swings results by up to 12pp within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily‑customized domain‑specific agents; (iv) open‑weight models tested exhibit "generality sinks" absent from frontier closed‑source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture‑distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.

PaperID: 5331, https://arxiv.org/pdf/2602.22923.pdf

Authors: Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao, Wei Dai, Chenhao Ge, Penglei Sun, Xiaohui Zhu, Tao Huang, Ryan Wen Liu, Hui Xiong

Title: WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

Abstract:
While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge‑driven, interactive environmental cognition. In the high‑stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large‑scale, comprehensive Video Question Answering benchmark specifically engineered for all‑waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress‑test ASV capabilities across a five‑tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi‑agent neuro‑symbolic system designed for open‑ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation‑Aware Hierarchical Reasoning, and Autonomous Self‑Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation‑compliant, interpretable decision‑making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

PaperID: 5332, https://arxiv.org/pdf/2602.22915.pdf

Authors: Farzaneh Farhadi, Maria Chli

Title: Robust Information Design for Multi-Agent Systems with Complementarities: Smallest-Equilibrium Threshold Policies

Abstract:
We study information design in multi‑agent systems (MAS) with binary actions and strategic complementarities, where an external designer influences behavior only through signals. Agents play the smallest‑equilibrium of the induced Bayesian game, reflecting conservative, coordination‑averse behavior typical in distributed systems. We show that when utilities admit a convex potential and welfare is convex, the robustly implementable optimum has a remarkably simple form: perfect coordination at each state: either everyone acts or no one does. We provide a constructive threshold rule: compute a one‑dimensional score for each state, sort states, and pick a single threshold (with a knife‑edge lottery for at most one state). This rule is an explicit optimal vertex of a linear program (LP) characterized by feasibility and sequential obedience constraints. Empirically, in both vaccination and technology‑adoption domains, our constructive policy matches LP optima, scales as O(|Θ|\log|Θ|), and avoids the inflated welfare predicted by obedience‑only designs that assume the designer can dictate the (best) equilibrium. The result is a general, scalable recipe for robust coordination in MAS with complementarities.

PaperID: 5333, https://arxiv.org/pdf/2602.22847.pdf

Authors: Kerrian Le Caillec, Anna Van Elst, Igor Colin, Stephan Clémençon

Title: Decentralized Ranking Aggregation via Gossip: Convergence and Robustness

Abstract:
The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g. peer‑to‑peer networks, IoT, multi‑agent systems), extending the ability to calculate consensus rankings with guarantees of convergence and resilience to potential contamination in a decentralized setting, when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable and resilient consensus on collective rankings in a decentralized setting, thereby raising new questions, robustness to corrupted nodes, and scalability through reduced communication costs in particular. The approach proposed and analyzed here relies on the robustness guarantees offered by random gossip communication, which allows autonomous agents to compute a global ranking consensus using local interactions only, without coordination or a central authority.

PaperID: 5334, https://arxiv.org/pdf/2602.22810.pdf

Authors: Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

Title: Multi-agent imitation learning with function approximation: Linear Markov games and beyond

Abstract:
In this work, we present the first theoretical analysis of multi‑agent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent's reward function are linear in some given features. We demonstrate that by leveraging this structure, it is possible to replace the state‑action level "all policy deviation concentrability coefficient" (Freihaut et al., arXiv:2510.09325) with a concentrability coefficient defined at the feature level which can be much smaller than the state‑action analog when the features are informative about states' similarity. Furthermore, to circumvent the need for any concentrability coefficient, we turn to the interactive setting. We provide the first, computationally efficient, interactive MAIL algorithm for linear Markov games and show that its sample complexity depends only on the dimension of the feature map d. Building on these theoretical findings, we propose a deep MAIL interactive algorithm which clearly outperforms BC on games such as Tic‑Tac‑Toe and Connect4.

PaperID: 5335, https://arxiv.org/pdf/2602.22775.pdf

Authors: Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang

Title: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Abstract:
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single‑turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi‑agent simulation. Using open‑source models, TherapyProbe surfaces relational safety failures interaction patterns like "validation spirals" where chatbots progressively reinforce hopelessness, or "empathy fatigue" where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically‑grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.

PaperID: 5336, https://arxiv.org/pdf/2602.22733.pdf

Authors: Seongyong Kim, Junhyeon Cho, Kang-Won Lee, Soo-Chul Lim

Title: Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Abstract:
To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel‑level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion. Furthermore, to achieve stable learning in a high‑DoF system composed of a robot arm equipped with a multi‑fingered hand, we design a heterogeneous multi‑agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role‑specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.

PaperID: 5337, https://arxiv.org/pdf/2602.22724.pdf

Authors: Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, Hongxin Hu

Title: AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Abstract:
Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker‑controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike prompt‑based attacks, IPI unfolds over multi‑turn trajectories, making malicious control difficult to disentangle from legitimate task execution. Existing inference‑time defenses primarily rely on heuristic detection and conservative blocking of high‑risk actions, which can prematurely terminate workflows or broadly suppress tool usage under ambiguous multi‑turn scenarios. We propose AgentSentry, a novel inference‑time detection and mitigation framework for tool‑augmented LLM agents. To the best of our knowledge, AgentSentry is the first inference‑time defense to model multi‑turn IPI as a temporal causal takeover. It localizes takeover points via controlled counterfactual re‑executions at tool‑return boundaries and enables safe continuation through causally guided context purification that removes attack‑induced deviations while preserving task‑relevant evidence. We evaluate AgentSentry on the \textscAgentDojo benchmark across four task suites, three IPI attack families, and multiple black‑box LLMs. AgentSentry eliminates successful attacks and maintains strong utility under attack, achieving an average Utility Under Attack (UA) of 74.55 %, improving UA by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign performance.

PaperID: 5338, https://arxiv.org/pdf/2602.22603.pdf

Authors: Sanjay Kariyappa, G. Edward Suh

Title: SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Abstract:
Long‑running agentic tasks, such as deep research, require multi‑hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long‑context inputs, we find that existing heuristics fail to support multi‑step reasoning models effectively. We address this challenge with SideQuest ‑‑ a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic‑based KV cache compression techniques.

PaperID: 5339, https://arxiv.org/pdf/2602.22576.pdf

Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

Title: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Abstract:
Retrieval‑Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single‑round retrieval struggles with complex multi‑step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL‑based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search‑P1, a framework that introduces path‑centric reward shaping for agentic RAG training, comprising two key components: (1) Path‑Centric Reward, which evaluates the structural quality of reasoning trajectories through order‑agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual‑Track Path Scoring with offline‑generated reference planners that assesses paths from both self‑consistency and reference‑alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search‑P1 achieves significant improvements over Search‑R1 and other strong baselines, with an average accuracy gain of 7.7 points.

PaperID: 5340, https://arxiv.org/pdf/2602.22557.pdf

Authors: Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu

Title: CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Abstract:
Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine‑tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval‑augmented multi‑agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state‑of‑the‑art performance across 7 safety benchmarks, outperforming dedicated policy‑following baselines without fine‑tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero‑Shot Adaptability, where our framework successfully generalized to an out‑of‑domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

PaperID: 5341, https://arxiv.org/pdf/2602.22546.pdf

Authors: Zhiming Wang, Jinwei He, Feng Lu

Title: Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention

Abstract:
Large Language Model (LLM) based agents excel at general reasoning but often fail in specialized domains where success hinges on long‑tail knowledge absent from their training data. While human experts can provide this missing knowledge, their guidance is often unstructured and unreliable, making its direct integration into an agent's plan problematic. To address this, we introduce AHCE (Active Human‑Augmented Challenge Engagement), a framework for on‑demand Human‑AI collaboration. At its core, the Human Feedback Module (HFM) employs a learned policy to treat the human expert as an interactive reasoning tool. Extensive experiments in Minecraft demonstrate the framework's effectiveness, increasing task success rates by 32% on normal difficulty tasks and nearly 70% on highly difficult tasks, all with minimal human intervention. Our work demonstrates that successfully augmenting agents requires learning how to request expert reasoning, moving beyond simple requests for help.

PaperID: 5342, https://arxiv.org/pdf/2602.22539.pdf

Authors: Mohammad Hossein Shokouhi, Vincent W. S. Wong

Title: Agentic AI for Intent-driven Optimization in Cell-free O-RAN

Abstract:
Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)‑based agents reason and collaborate to achieve operator‑defined intents. The open RAN (O‑RAN) architecture enables the deployment and coordination of such agents. However, most existing works consider simple intents handled by independent agents, while complex intents that require coordination among agents remain unexplored. In this paper, we propose an agentic AI framework for intent translation and optimization in cell‑free O‑RAN. A supervisor agent translates the operator intents into an optimization objective and minimum rate requirements. Based on this information, a user weighting agent retrieves relevant prior experience from a memory module to determine the user priority weights for precoding. If the intent includes an energy‑saving objective, then an open radio unit (O‑RU) management agent will also be activated to determine the set of active O‑RUs by using a deep reinforcement learning (DRL) algorithm. A monitoring agent measures and monitors the user data rates and coordinates with other agents to guarantee the minimum rate requirements are satisfied. To enhance scalability, we adopt a parameter‑efficient fine‑tuning (PEFT) method that enables the same underlying LLM to be used for different agents. Simulation results show that the proposed agentic AI framework reduces the number of active O‑RUs by 41.93% when compared with three baseline schemes in energy‑saving mode. Using the PEFT method, the proposed framework reduces the memory usage by 92% when compared with deploying separate LLM agents.

PaperID: 5343, https://arxiv.org/pdf/2602.22525.pdf

Authors: Zhonghao Zhan, Krinos Li, Yefan Zhang, Hamed Haddadi

Title: Systems-Level Attack Surface of Edge Agent Deployments on IoT

Abstract:
Edge deployment of LLM agents on IoT hardware introduces attack surfaces absent from cloud‑hosted orchestration. We present an empirical security analysis of three architectures (cloud‑hosted, edge‑local swarm, and hybrid) using a multi‑device home‑automation testbed with local MQTT messaging and an Android smartphone as an edge inference node. We identify five systems‑level attack surfaces, including two emergent failures observed during live testbed operation: coordination‑state divergence and induced trust erosion. We frame core security properties as measurable systems metrics: data egress volume, failover window exposure, sovereignty boundary integrity, and provenance chain completeness. Our measurements show that edge‑local deployments eliminate routine cloud data exposure but silently degrade sovereignty when fallback mechanisms trigger, with boundary crossings invisible at the application layer. Provenance chains remain complete under cooperative operation yet are trivially bypassed without cryptographic enforcement. Failover windows create transient blind spots exploitable for unauthorised actuation. These results demonstrate that deployment architecture, not just model or prompt design, is a primary determinant of security risk in agent‑controlled IoT systems.

PaperID: 5344, https://arxiv.org/pdf/2602.22452.pdf

Authors: Chayan Banerjee

Title: CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Abstract:
A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine‑tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine‑tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard‑mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard‑negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal‑edit negatives ‑‑ cases where a single word changes the physical outcome ‑‑ and achieves a higher AUC‑ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold‑path actions against all valid environment actions during task execution. Under out‑of‑distribution stress conditions, CWM maintains a significantly better safety margin (‑2.39) than SFT (‑3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.

PaperID: 5345, https://arxiv.org/pdf/2602.22450.pdf

Authors: Qianlong Lan, Anuj Kaul, Shaun Jones, Stephanie Westrum

Title: Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Abstract:
Agentic large language model systems increasingly automate tasks by retrieving URLs and calling external tools. We show that this workflow gives rise to implicit prompt injection: adversarial instructions embedded in automatically generated URL previews, including titles, metadata, and snippets, can introduce a system‑level risk that we refer to as silent egress. Using a fully local and reproducible testbed, we demonstrate that a malicious web page can induce an agent to issue outbound requests that exfiltrate sensitive runtime context, even when the final response shown to the user appears harmless. In 480 experimental runs with a qwen2.5:7b‑based agent, the attack succeeds with high probability (P (egress) =0.89), and 95% of successful attacks are not detected by output‑based safety checks. We also introduce sharded exfiltration, where sensitive information is split across multiple requests to avoid detection. This strategy reduces single‑request leakage metrics by 73% (Leak@1) and bypasses simple data loss prevention mechanisms. Our ablation results indicate that defenses applied at the prompt layer offer limited protection, while controls at the system and network layers, such as domain allowlisting and redirect‑chain analysis, are considerably more effective. These findings suggest that network egress should be treated as a first‑class security outcome in agentic LLM systems. We outline architectural directions, including provenance tracking and capability isolation, that go beyond prompt‑level hardening.

PaperID: 5346, https://arxiv.org/pdf/2602.22442.pdf

Authors: Gaoyuan Du, Amit Ahlawat, Xiaoyang Liu, Jing Wu

Title: A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

Abstract:
Agent‑based AutoML systems rely on large language models to make complex, multi‑stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome‑centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision‑level evaluation metrics intended for post‑hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision‑centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof‑of‑concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from ‑4.9% to +8.3% in final metrics. These results illustrate how decision‑centric evaluation exposes failure modes that are invisible to outcome‑only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome‑based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.

PaperID: 5347, https://arxiv.org/pdf/2602.22441.pdf

Authors: Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu, Rui Sun, Zhiji Liu, Yue Xing, Jiliang Tang, Benoit Dumoulin

Title: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Abstract:
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi‑step reasoning through generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi‑step computation in continuous latent spaces. Although there have been numerous studies focusing on improving the performance of latent reasoning, its internal mechanisms remain not fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS‑like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade‑off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.

PaperID: 5348, https://arxiv.org/pdf/2602.22401.pdf

Authors: Yongjun Zhang

Title: Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

Abstract:
AI agents ‑‑ systems that execute multi‑step reasoning workflows with persistent state, tool access, and specialist skills ‑‑ represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain‑specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching ‑‑ the AI‑era parallel to vibe coding ‑‑ and uses scholar‑skill, a 26‑skill plugin for Claude Code covering the full research pipeline from idea to submission across 18 orchestrated phases with 53 quality gates, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions ‑‑ codifiability and tacit knowledge requirement ‑‑ to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession ‑‑ augmentation with fragile conditions, stratification risk, and a pedagogical crisis ‑‑ and proposes five principles for responsible vibe researching.

PaperID: 5349, https://arxiv.org/pdf/2602.22365.pdf

Authors: Chayan Banerjee

Title: Sustainable Multi-Agent Crowdsourcing via Physics-Informed Bandits

Abstract:
Crowdsourcing platforms face a four‑way tension between allocation quality, workforce sustainability, operational feasibility, and strategic contractor behaviour‑‑a dilemma we formalise as the Cold‑Start, Burnout, Utilisation, and Strategic Agency Dilemma. Existing methods resolve at most two of these tensions simultaneously: greedy heuristics and multi‑criteria decision making (MCDM) methods achieve Day‑1 quality but cause catastrophic burnout, while bandit algorithms eliminate burnout only through operationally infeasible 100% workforce utilisation.To address this, we introduce FORGE, a physics‑grounded K+1 multi‑agent simulator in which each contractor is a rational agent that declares its own load‑acceptance threshold based on its fatigue state, converting the standard passive Restless Multi‑Armed Bandit (RMAB) into a genuine Stackelberg game. Operating within FORGE, we propose a Neural‑Linear UCB allocator that fuses a Two‑Tower embedding network with a Physics‑Informed Covariance Prior derived from offline simulator interactions. The prior simultaneously warm‑starts skill‑cluster geometry and UCB exploration landscape, providing a geometry‑aware belief state from episode 1 that measurably reduces cold‑start regret.Over T = 200 cold‑start episodes, the proposed method achieves the highest reward of all non‑oracle methods (\textLRew = 0.555 \pm 0.041) at only 7.6% workforce utilisation‑‑a combination no conventional baseline achieves‑‑while maintaining robustness to workforce turnover up to 50% and observation noise up to σ= 0.20.

PaperID: 5350, https://arxiv.org/pdf/2602.22343.pdf

Authors: Shruthi Andru, Shrut Kirti Saksena

Title: Interface Framework for Human-AI Collaboration within Intelligent User Interface Ecosystems

Abstract:
As interfaces evolve from static user pathways to dynamic human‑AI collaboration, no standard methods exist for selecting appropriate interface patterns based on user needs and task complexity. Existing frameworks only provide guiding principles for designing AI agent capabilities. We propose a dimensional framework based on workflow complexity, AI autonomy, and AI reasoning to guide the design of context‑aware, scalable AI interfaces aka modalities (e.g., prompt bars, split screens, full screens, etc.). The framework was developed through co‑design workshops with designers of marketing products and refined through qualitative research with eight long‑term AI users. The study evaluated the three dimensions, identified task‑to‑interface relationships, and surfaced the importance of both business impact and security risk across all high‑autonomy scenarios. This framework provides product teams with a shared language to develop scalable AI interfaces, emphasizing fluidity between interfaces and progressive user control to balance AI autonomy with human oversight.

PaperID: 5351, https://arxiv.org/pdf/2602.22302.pdf

Authors: Varun Pratap Bhardwaj

Title: Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents

Abstract:
Traditional software relies on contracts ‑‑ APIs, type systems, assertions ‑‑ to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI deployments. We introduce Agent Behavioral Contracts (ABC), a formal framework that brings Design‑by‑Contract principles to autonomous AI agents. An ABC contract C = (P, I, G, R) specifies Preconditions, Invariants, Governance policies, and Recovery mechanisms as first‑class, runtime‑enforceable components. We define (p, delta, k)‑satisfaction ‑‑ a probabilistic notion of contract compliance that accounts for LLM non‑determinism and recovery ‑‑ and prove a Drift Bounds Theorem showing that contracts with recovery rate gamma > alpha (the natural drift rate) bound behavioral drift to D = alpha/gamma in expectation, with Gaussian concentration in the stochastic setting. We establish sufficient conditions for safe contract composition in multi‑agent chains and derive probabilistic degradation bounds. We implement ABC in AgentAssert, a runtime enforcement library, and evaluate on AgentContract‑Bench, a benchmark of 200 scenarios across 7 models from 6 vendors. Results across 1,980 sessions show that contracted agents detect 5.2‑6.8 soft violations per session that uncontracted baselines miss entirely (p < 0.0001, Cohen's d = 6.7‑33.8), achieve 88‑100% hard constraint compliance, and bound behavioral drift to D < 0.27 across extended sessions, with 100% recovery for frontier models and 17‑100% across all models, at overhead < 10 ms per action.

PaperID: 5352, https://arxiv.org/pdf/2602.22240.pdf

Authors: Linus Bantel, Moritz Strack, Alexander Strack, Dirk Pflüger

Title: From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation

Abstract:
Large Language Models (LLM) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task‑based parallel code from three kinds of input prompts: natural language problem descriptions, sequential reference implementations, and parallel pseudo code. We focus on three programming frameworks: OpenMP Tasking, C++ standard parallelism, and the asynchronous many‑task runtime HPX. Each framework offers different levels of abstraction and control for task execution. We evaluate LLM‑generated solutions for correctness and scalability. Our results reveal both strengths and weaknesses of LLMs with regard to problem complexity and framework. Finally, we discuss what these findings mean for future LLM‑assisted development in high‑performance and scientific computing.

PaperID: 5353, https://arxiv.org/pdf/2602.22222.pdf

Authors: Bingrui Jin, Kunyao Lan, Mengyue Wu

Title: TWICE: An LLM Agent Framework for Simulating Personalized User Tweeting Behavior with Long-term Temporal Features

Abstract:
User simulators are often used to generate large amounts of data for various tasks such as generation, training, and evaluation. However, existing approaches concentrate on collective behaviors or interactive systems, struggling with tasks that require modeling temporal characteristics. To address this limitation, we propose TWICE, an LLM‑based framework that leverages the long‑term temporal and personalized features of social media data. This framework integrates personalized user profiling, an event‑driven memory module, and a workflow for personalized style rewriting, enabling simulation of personalized user tweeting behavior while capturing long‑term temporal characteristics. In addition, we conduct a comprehensive evaluation with a focus on analyzing tweeting style and event‑based changes in behavior. Experiment results demonstrate that our framework improves personalized user simulation by effectively incorporating temporal dynamics, providing a robust solution for long‑term behavior tracking.

PaperID: 5354, https://arxiv.org/pdf/2602.22208.pdf

Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

Title: Solaris: Building a Multiplayer Video World Model in Minecraft

Abstract:
Existing action‑conditioned video generation models (video world models) are limited to single‑agent perspectives, failing to capture the multi‑agent interactions of real‑world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi‑view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single‑player settings, our system supports coordinated multi‑agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single‑player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory‑efficient Self Forcing variant that enables a longer‑horizon teacher. Results show our architecture and training design outperform existing baselines. Through open‑sourcing our system and models, we hope to lay the groundwork for a new generation of multi‑agent world models.

PaperID: 5355, https://arxiv.org/pdf/2602.21806.pdf

Authors: Xinxue Zhu, Jiacong Wu, Xiaoyu Zhang, Tianlin Li, Yanzhou Mu, Juan Zhai, Chao Shen, Chunrong Fang, Yang Liu

Title: An Empirical Study of Bugs in Modern LLM Agent Frameworks

Abstract:
LLM agents have been widely adopted in real‑world applications, relying on agent frameworks for workflow execution and multi‑agent coordination. As these systems scale, understanding bugs in the underlying agent frameworks becomes critical. However, existing work mainly focuses on agent‑level failures, overlooking framework‑level bugs. To address this gap, we conduct an empirical study of 998 bug reports from CrewAI and LangChain, constructing a taxonomy of 15 root causes and 7 observable symptoms across five agent lifecycle stages: 'Agent Initialization','Perception', 'Self‑Action', 'Mutual Interaction' and 'Evolution'. Our findings show that agent framework bugs mainly arise from 'API misuse', 'API incompatibility', and 'Documentation Desync', largely concentrated in the 'Self‑Action' stage. Symptoms typically appear as 'Functional Error', 'Crash', and 'Build Failure', reflecting disruptions to task progression and control flow.

PaperID: 5356, https://arxiv.org/pdf/2602.21738.pdf

Authors: Pelin Sekercioglu, Angela Fontan, Dimos V. Dimarogonas

Title: Stability of Open Multi-agent Systems over Dynamic Signed Digraphs

Abstract:
We address the synchronization problem in open multi‑agent systems (OMAS) containing both cooperative and antagonistic interactions. In these systems, agents can join or leave the network over time, and the interaction structure may evolve accordingly. To capture these dynamical structural changes, we represent the network as a switched system interconnected over a dynamic and directed signed graph. Additionally, the network may contain one or multiple leader groups that influence the behavior of the remaining agents. In general, we show that the OMAS exhibit a more general form of synchronization, including trivial consensus, bipartite consensus and containment. Our approach uses the signed edge‑based agreement protocol, and constructs strict Lyapunov functions for signed networks described by signed edge‑Laplacian matrices containing multiple zero eigenvalues. Numerical simulations validate our theoretical results.

PaperID: 5357, https://arxiv.org/pdf/2602.21715.pdf

Authors: Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng, Haotian Liu, Wenchuan Wu

Title: Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

Abstract:
The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data‑driven approaches have demonstrated effectiveness in the voltage control problem, they often require extensive trial‑and‑error exploration and struggle to incorporate heterogeneous information, such as day‑ahead forecasts and semantic‑based grid codes. Considering the operational scenarios and requirements in real‑world ADNs, in this paper, we propose a hybrid knowledge‑data‑driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement learning (RL) agent to achieve two‑stage voltage control. In the day‑ahead stage, the LLM agent receives coarse region‑level forecasts and generates scheduling strategies for on‑load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile. Then in the intra‑day stage, based on accurate node‑level measurements, the RL agent refines terminal voltages by deriving reactive power generation strategies for PV inverters. On top of the LLM‑RL collaboration framework, we further propose a self‑evolution mechanism for the LLM agent and a pretrain‑finetune pipeline for the RL agent, effectively enhancing and coordinating the policies for both agents. The proposed approach not only aligns more closely with practical operational characteristics but also effectively utilizes the inherent knowledge and reasoning capabilities of the LLM agent, significantly improving training efficiency and voltage control performance. Comprehensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.

PaperID: 5358, https://arxiv.org/pdf/2602.21680.pdf

Authors: David Eckel, Henri Meeß

Title: Hierarchical Lead Critic based Multi-Agent Reinforcement Learning

Abstract:
Cooperative Multi‑Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple perspectives on different hierarchy levels. We propose the Hierarchical Lead Critic (HLC) ‑ inspired by natural emerging distributions in team structures, where following high‑level objectives combines with low‑level execution. HLC demonstrates that introducing multiple hierarchies, leveraging local and global perspectives, can lead to improved performance with high sample efficiency and robust policies. Experimental results conducted on cooperative, non‑communicative, and partially observable MARL benchmarks demonstrate that HLC outperforms single hierarchy baselines and scales robustly with increasing amounts of agents and difficulty.

PaperID: 5359, https://arxiv.org/pdf/2602.21670.pdf

Authors: Tomoya Kawabe, Rin Takano

Title: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Abstract:
Multi‑robot task planning requires decomposing natural‑language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long‑horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi‑agent LLM‑based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower‑layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad‑inspired textual‑gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta‑prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi‑agent settings. On the MAT‑THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state‑of‑the‑art LaMMA‑P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta‑prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.

PaperID: 5360, https://arxiv.org/pdf/2602.21622.pdf

Authors: Enyi Wang, Wen Fan, Dandan Zhang

Title: ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation

Abstract:
Multi‑agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM‑DP), a framework that integrates vision, tactile, and graph‑based (multi‑agent pose) modalities for coordinated control. ADM‑DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point‑cloud features via Feature‑wise Linear Modulation (FiLM) modulation to enrich perception. Second, a tactile‑guided grasping strategy uses Force‑Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph‑based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter‑agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re‑weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi‑agent tasks, ADM‑DP achieves 12‑25% performance gains over state‑of‑the‑art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.

PaperID: 5361, https://arxiv.org/pdf/2602.21604.pdf

Authors: Qiange Wang, Chaoyi Chen, Jingqi Gao, Zihan Wang, Yanfeng Zhang, Ge Yu

Title: Towards Autonomous Graph Data Analytics with Analytics-Augmented Generation

Abstract:
This paper argues that reliable end‑to‑end graph data analytics cannot be achieved by retrieval‑ or code‑generation‑centric LLM agents alone. Although large language models (LLMs) provide strong reasoning capabilities, practical graph analytics for non‑expert users requires explicit analytical grounding to support intent‑to‑execution translation, task‑aware graph construction, and reliable execution across diverse graph algorithms. We envision Analytics‑Augmented Generation (AAG) as a new paradigm that treats analytical computation as a first‑class concern and positions LLMs as knowledge‑grounded analytical coordinators. By integrating knowledge‑driven task planning, algorithm‑centric LLM‑analytics interaction, and task‑aware graph construction, AAG enables end‑to‑end graph analytics pipelines that translate natural‑language user intent into automated execution and interpretable results.

PaperID: 5362, https://arxiv.org/pdf/2602.21533.pdf

Authors: Dong Hyeon Mok, Seoin Back, Victor Fung, Guoxiang Hu

Title: Reasoning-Driven Design of Single Atom Catalysts via a Multi-Agent Large Language Model Framework

Abstract:
Large language models (LLMs) are becoming increasingly applied beyond natural language processing, demonstrating strong capabilities in complex scientific tasks that traditionally require human expertise. This progress has extended into materials discovery, where LLMs introduce a new paradigm by leveraging reasoning and in‑context learning, capabilities absent from conventional machine learning approaches. Here, we present a Multi‑Agent‑based Electrocatalyst Search Through Reasoning and Optimization (MAESTRO) framework in which multiple LLMs with specialized roles collaboratively discover high‑performance single atom catalysts for the oxygen reduction reaction. Within an autonomous design loop, agents iteratively reason, propose modifications, reflect on results and accumulate design history. Through in‑context learning enabled by this iterative process, MAESTRO identified design principles not explicitly encoded in the LLMs' background knowledge and successfully discovered catalysts that break conventional scaling relations between reaction intermediates. These results highlight the potential of multi‑agent LLM frameworks as a powerful strategy to generate chemical insight and discover promising catalysts.

PaperID: 5363, https://arxiv.org/pdf/2602.21515.pdf

Authors: Chengrui Qu, Yizhou Zhang, Nicolas Lanzetti, Eric Mazumdar

Title: Training Generalizable Collaborative Agents via Strategic Risk Aversion

Abstract:
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free‑riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk‑averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game‑theoretic concepts like Nash, and (2) exhibit less or no free‑riding. Inspired by these insights, we develop a multi‑agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.

PaperID: 5364, https://arxiv.org/pdf/2602.21480.pdf

Authors: Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

Title: Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

Abstract:
Text‑to‑SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text‑to‑SQL systems are often embedded with Big Data workflows, such as large‑scale data processing or interactive data analytics. We refer to this as ``Text‑to‑Big SQL''. However, existing text‑to‑SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text‑to‑SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text‑to‑Big SQL. Our study focuses on production‑level LLM agents, a database‑agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text‑to‑SQL metrics are insufficient for Big Data. In contrast, our proposed text‑to‑Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. For example, GPT‑4o compensates for roughly 7% lower accuracy than the top‑performing later‑generation models with up to a 12.16x speedup, while GPT‑5.2 is more than twice as cost‑effective as Gemini 3 Pro at large input scales.

PaperID: 5365, https://arxiv.org/pdf/2602.21477.pdf

Authors: Zhengding Hu, Zaifeng Pan, Prabhleen Kaur, Vibha Murthy, Zhongkai Yu, Yue Guan, Zhen Wang, Steven Swanson, Yufei Ding

Title: Pancake: Hierarchical Memory System for Multi-Agent LLM Serving

Abstract:
In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large‑scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high‑cost approximate nearest neighbor (ANN) searching problems. We present Pancake, a multi‑tier agentic memory system that unifies three key techniques: (i) multi‑level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU‑CPU acceleration. Pancake exposes easy‑to‑use interface that can be integrated into memory‑based agents like Mem‑GPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than 4.29x end‑to‑end throughput improvement.

PaperID: 5366, https://arxiv.org/pdf/2602.21404.pdf

Authors: Shanshan Mao, Peter Tino

Title: From Cooperation to Hierarchy: A Study of Dynamics of Hierarchy Emergence in a Multi-Agent System

Abstract:
A central premise in evolutionary biology is that individual variation can generate information asymmetries that facilitate the emergence of hierarchical organisation. To examine this process, we develop an agent‑based model (ABM) to identify the minimal conditions under which hierarchy arises in dynamic multi‑agent systems, focusing on the roles of initial heterogeneity and mutation amplitude across generations. Hierarchical organisation is quantified using the Trophic Incoherence (TI) metric, which captures directional asymmetries in interaction networks. Our results show that even small individual differences can be amplified through repeated local interactions involving reproduction, competition, and cooperation, but that hierarchical order is markedly more sensitive to mutation amplitude than to initial heterogeneity. Across repeated trials, stable hierarchies reliably emerge only when mutation amplitude is sufficiently high, while initial heterogeneity primarily affects early formation rather than long‑term persistence. Overall, these findings demonstrate how simple interaction rules can give rise to both the emergence and persistence of hierarchical organisation, providing a quantitative account of how structured inequality can develop from initially homogeneous populations.

PaperID: 5367, https://arxiv.org/pdf/2602.21371.pdf

Authors: Sai Surya Duvvuri, Chanakya Ekbote, Rachit Bansal, Rishabh Tiwari, Devvrit Khatri, David Brandfonbrener, Paul Liang, Inderjit Dhillon, Manzil Zaheer

Title: Interleaved Head Attention

Abstract:
Multi‑Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: H attention heads produce exactly H independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi‑step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token‑to‑token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross‑head mixing by constructing P pseudo‑heads per head (typically P=H), where each pseudo query/key/value is a learned linear combination of all H original queries, keys and values respectively. Interactions between pseudo‑query and pseudo‑key heads induce up to P^2 attention patterns per head with modest parameter overhead \mathcalO(H^2P). We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial task (IHA uses Θ(\sqrtkn^2) parameters vs. Θ(kn^2) for MHA) and on the synthetic order‑sensitive CPM‑3 task (IHA uses \lceil\sqrtN_\max\rceil heads vs. N_\max for MHA). On real‑world benchmarks, IHA improves Multi‑Key retrieval on RULER by 10‑20% (4k‑16k) and, after fine‑tuning for reasoning on OpenThoughts, improves GSM8K by 5.8% and MATH‑500 by 2.8% (Majority Vote) over full attention.

PaperID: 5368, https://arxiv.org/pdf/2602.21351.pdf

Authors: Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung, Nikolay Koldunov

Title: A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Abstract:
The rapid accumulation of Earth science data has created a significant scalability challenge; while repositories like PANGAEA host vast collections of datasets, citation metrics indicate that a substantial portion remains underutilized, limiting data reusability. Here we present PANGAEA‑GPT, a hierarchical multi‑agent framework designed for autonomous data discovery and analysis. Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor‑Worker topology with strict data‑type‑aware routing, sandboxed deterministic code execution, and self‑correction via execution feedback, enabling agents to diagnose and resolve runtime errors. Through use‑case scenarios spanning physical oceanography and ecology, we demonstrate the system's capacity to execute complex, multi‑step workflows with minimal human intervention. This framework provides a methodology for querying and analyzing heterogeneous repository data through coordinated agent workflows.

PaperID: 5369, https://arxiv.org/pdf/2602.21320.pdf

Authors: Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Title: Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Abstract:
Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task‑solution pairs and substantial human supervision, which creates a fundamental obstacle to open‑ended self‑evolution toward superintelligent systems. In this paper, we propose Tool‑R0 framework for training general purpose tool‑calling agents from scratch with self‑play RL, under a zero‑data assumption. Initialized from the same base LLM, Tool‑R0 co‑evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real‑world tool calls. This creates a self‑evolving cycle that requires no pre‑existing tasks or datasets. Evaluation on different tool‑use benchmarks show that Tool‑R0 yields 92.5 relative improvement over the base model and surpasses fully supervised tool‑calling baselines under the same setting. Our work further provides empirical insights into self‑play LLM agents by analyzing co‑evolution, curriculum dynamics, and scaling behavior.

PaperID: 5370, https://arxiv.org/pdf/2602.21319.pdf

Authors: Marion Neumeier, Niklas Roßberg, Michael Botsch, Wolfgang Utschick

Title: Uncertainty-Aware Diffusion Model for Multimodal Highway Trajectory Prediction via DDIM Sampling

Abstract:
Accurate and uncertainty‑aware trajectory prediction remains a core challenge for autonomous driving, driven by complex multi‑agent interactions, diverse scene contexts and the inherently stochastic nature of future motion. Diffusion‑based generative models have recently shown strong potential for capturing multimodal futures, yet existing approaches such as cVMD suffer from slow sampling, limited exploitation of generative diversity and brittle scenario encodings. This work introduces cVMDx, an enhanced diffusion‑based trajectory prediction framework that improves efficiency, robustness and multimodal predictive capability. Through DDIM sampling, cVMDx achieves up to a 100x reduction in inference time, enabling practical multi‑sample generation for uncertainty estimation. A fitted Gaussian Mixture Model further provides tractable multimodal predictions from the generated trajectories. In addition, a CVQ‑VAE variant is evaluated for scenario encoding. Experiments on the publicly available highD dataset show that cVMDx achieves higher accuracy and significantly improved efficiency over cVMD, enabling fully stochastic, multimodal trajectory prediction.

PaperID: 5371, https://arxiv.org/pdf/2602.21262.pdf

Authors: Sasha Robinson, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen

Title: Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models

Abstract:
With increasing integration of Large Language Models (LLMs) into areas of high‑stakes human decision‑making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi‑turn puzzle‑solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle‑solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.

PaperID: 5372, https://arxiv.org/pdf/2602.21255.pdf

Authors: Jean-Philippe Garnier

Title: A General Equilibrium Theory of Orchestrated AI Agent Systems

Abstract:
We establish a general equilibrium theory for systems of large language model (LLM) agents operating under centralized orchestration. The framework is a production economy in the sense of Arrow‑Debreu (1954), extended to infinite‑dimensional commodity spaces following Bewley (1972). Each LLM agent is modeled as a firm whose production set Y a \subset H = L 2 ([0, T ], R R ) represents the feasible metric trajectories determined by its frozen model weights. The orchestrator is the consumer, choosing a routing policy over the agent DAG to maximize system welfare subject to a budget constraint evaluated at functional prices p \in H A . These prices‑elements of the Hilbert dual of the commodity space‑assign a shadow value to each metric of each agent at each instant. We prove, via Brouwer's theorem applied to a finitedimensional approximation V K \subset H, that every such economy admits at least one general equilibrium (p , y , π ). A functional Walras' law holds as a theorem: the value of functional excess demand is zero for all prices, as a consequence of the consumer's budget constraint‑not by construction. We further establish Pareto optimality (First Welfare Theorem), decentralizability of Pareto optima (Second Welfare Theorem), and uniqueness with geometric convergence under a contraction condition (Banach). The orchestration dynamics constitute a Walrasian tâtonnement that converges globally under the contraction condition, unlike classical tâtonnement (Scarf, 1960). The framework admits a DSGE interpretation with SLO parameters as policy rates.

PaperID: 5373, https://arxiv.org/pdf/2602.21161.pdf

Authors: Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil

Title: ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Abstract:
Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general‑purpose robots. Recent data‑driven Vision‑Language‑Action (VLA) approaches aim to learn policies from large‑scale simulation and real‑world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM‑driven framework that performs explicit action reasoning to produce physics‑consistent, prior‑guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real‑world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi‑agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi‑agent LLM framework that generates physics‑aware action plans. The experiments demonstrate that the proposed multi‑agent LLM framework enables stable brick placement while shifting effort from low‑level domain‑specific coding to high‑level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.

PaperID: 5374, https://arxiv.org/pdf/2602.21158.pdf

Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

Title: SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Abstract:
Large language models (LLMs) are increasingly deployed as multi‑step decision‑making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step‑level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty‑aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy‑, least‑confidence‑, and margin‑based metrics into a combined token‑level uncertainty estimate, providing dense confidence‑aligned supervision, and employs a failure‑aware reward reshaping mechanism that injects these uncertainty signals into step‑ and trajectory‑level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.

PaperID: 5375, https://arxiv.org/pdf/2602.21127.pdf

Authors: Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang

Title: "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Abstract:
Large language model (LLM) agents are rapidly becoming trusted copilots in high‑stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent‑Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on agent‑centric threats, human susceptibility to deception by a compromised agent remains unexplored. We present the first large‑scale empirical study with 303 participants to measure human susceptibility to AMD. This is based on HAT‑Lab (Human‑Agent Trust Laboratory), a high‑fidelity research platform we develop, featuring nine carefully crafted scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources). Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives. Specifically, only 8.6% of participants perceive AMD attacks, while domain experts show increased susceptibility in certain scenarios. We identify six cognitive failure modes in users and find that their risk awareness often fails to translate to protective behavior. The defense analysis reveals that effective warnings should interrupt workflows with low verification costs. With experiential learning based on HAT‑Lab, over 90% of users who perceive risks report increased caution against AMD. This work provides empirical evidence and a platform for human‑centric agent security research.

PaperID: 5376, https://arxiv.org/pdf/2602.21119.pdf

Authors: Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang, Lei Han

Title: Cooperative-Competitive Team Play of Real-World Craft Robots

Abstract:
Multi‑agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game‑playing agents in recent years. However, the efficient training of collective robots using multi‑agent RL and the transfer of learned policies to real‑world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi‑agent sim‑to‑real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim‑to‑real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi‑robot car competitive game and a cooperative task in real‑world settings.

PaperID: 5377, https://arxiv.org/pdf/2602.21020.pdf

Authors: Antoine Bergerault, Volkan Cevher, Negar Mehr

Title: Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Abstract:
Multi‑agent imitation learning (MA‑IL) aims to learn optimal policies from expert demonstrations of interactions in multi‑agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned polices are from a Nash equilibrium are missing for offline MA‑IL. In this paper, we demonstrate impossibility and hardness results of learning low‑exploitable policies in general n‑player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error ε_\textBC, this provides a Nash imitation gap of \mathcalO\left(nε_\textBC/(1‑γ)^2\right) for a discount factor γ. We generalize this result with a new notion of best‑response continuity, and argue that this is implicitly encouraged by standard regularization techniques.

PaperID: 5378, https://arxiv.org/pdf/2602.20936.pdf

Authors: Albarracin Mahault, Mikeda Anna, Jimenez Rodriguez Alejandro, Namjoshi Sanjeev, Sakthivadivel Dalton, Pae Hongju, Shah Harshil, Wilson Philip

Title: Empathy Modeling in Active Inference Agents for Perspective-Taking and Alignment

Abstract:
Artificial agents capable of understanding and aligning with others' intentions are essential for safe and socially robust artificial intelligence. We introduce a computational framework for empathy in active inference agents, grounded in explicit perspective‑taking via self‑other model transformation. We instantiate this framework in a multi‑agent Iterated Prisoner's Dilemma and show that empathic perspective‑taking induces robust cooperation without explicit communication or reward shaping. Cooperation emerges only when empathy is reciprocated, while asymmetric empathy leads to systematic exploitation. Beyond equilibrium outcomes, empathic agents exhibit synchronized behavior, rapid recovery from stochastic defections, and joint intentional dynamics resembling apology‑forgiveness cycles. Near empathy symmetry, interactions display long transients and elevated variance, consistent with critical dynamics near regime boundaries. We further examine a learning‑enabled variant in which agents infer opponent type via Bayesian updating. While opponent models converge rapidly, long‑run cooperation remains primarily determined by the empathy parameter, indicating that cooperation is driven by empathic structure rather than learned reciprocity. Empathy functions as a structural prior over social interaction, shaping coordination stability, robustness, and temporal dynamics. The proposed framework highlights active inference as a principled foundation for socially aligned artificial agents that coordinate through internal simulation rather than behavioral mimicry.

PaperID: 5379, https://arxiv.org/pdf/2602.20934.pdf

Authors: ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao

Title: Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

Abstract:
The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems.While current research primarily focuses on scaling context windows or optimizing prompt engineering the theoretical bridge between micro scale token processing and macro scale systemic intelligence remains fragmented.This paper proposes AgentOS,a holistic conceptual framework that redefines the LLM as a "Reasoning Kernel" governed by structured operating system logic.Central to this architecture is Deep Context Management which conceptualizes the context window as an Addressable Semantic Space rather than a passive buffer.We systematically deconstruct the transition from discrete sequences to coherent cognitive states introducing mechanisms for Semantic Slicing and Temporal Alignment to mitigate cognitive drift in multi‑agent orchestration.By mapping classical OS abstractions such as memory paging interrupt handling and process scheduling onto LLM native constructs, this review provides a rigorous roadmap for architecting resilient scalable and self‑evolving cognitive environments.Our analysis asserts that the next frontier of AGI development lies in the architectural efficiency of system‑level coordination.

PaperID: 5380, https://arxiv.org/pdf/2602.20923.pdf

Authors: Jiarong Wei, Anna Rehr, Christian Feist, Abhinav Valada

Title: ParkDiffusion++: Ego Intention Conditioned Joint Multi-Agent Trajectory Prediction for Automated Parking using Diffusion Models

Abstract:
Automated parking is a challenging operational domain for advanced driver assistance systems, requiring robust scene understanding and interaction reasoning. The key challenge is twofold: (i) predict multiple plausible ego intentions according to context and (ii) for each intention, predict the joint responses of surrounding agents, enabling effective what‑if decision‑making. However, existing methods often fall short, typically treating these interdependent problems in isolation. We propose ParkDiffusion++, which jointly learns a multi‑modal ego intention predictor and an ego‑conditioned multi‑agent joint trajectory predictor for automated parking. Our approach makes several key contributions. First, we introduce an ego intention tokenizer that predicts a small set of discrete endpoint intentions from agent histories and vectorized map polylines. Second, we perform ego‑intention‑conditioned joint prediction, yielding socially consistent predictions of the surrounding agents for each possible ego intention. Third, we employ a lightweight safety‑guided denoiser with different constraints to refine joint scenes during training, thus improving accuracy and safety. Fourth, we propose counterfactual knowledge distillation, where an EMA teacher refined by a frozen safety‑guided denoiser provides pseudo‑targets that capture how agents react to alternative ego intentions. Extensive evaluations demonstrate that ParkDiffusion++ achieves state‑of‑the‑art performance on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Importantly, qualitative what‑if visualizations show that other agents react appropriately to different ego intentions.

PaperID: 5381, https://arxiv.org/pdf/2602.20867.pdf

Authors: Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

Title: SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Abstract:
Agentic systems increasingly rely on reusable procedural capabilities, a.k.a., agentic skills, to execute long‑horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one‑off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system‑level set of seven design patterns capturing how skills are packaged and executed in practice, from metadata‑driven progressive disclosure and executable code skills to self‑evolving libraries and marketplace distribution. The second is an orthogonal representation × scope taxonomy describing what skills \emphare (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill‑based agents, covering supply‑chain risks, prompt injection via skill payloads, and trust‑tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1,200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self‑generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real‑world autonomous agents.

PaperID: 5382, https://arxiv.org/pdf/2602.20826.pdf

Authors: Yuanhai Zhang, Songyang He, Ruizhe Gou, Mingyue Cui, Boyang Li, Shuai Zhao, Kai Huang

Title: Exploiting Dependency and Parallelism: Real-Time Scheduling and Analysis for GPU Tasks

Abstract:
With the rapid advancement of Artificial Intelligence, the Graphics Processing Unit (GPU) has become increasingly essential across a growing number of safety‑critical application domains. Applying a GPU is indispensable for parallel computing; however, the complex data dependencies and resource contention across kernels within a GPU task may unpredictably delay its execution time. To address these problems, this paper presents a scheduling and analysis method for Directed Acyclic Graph (DAG)‑structured GPU tasks. Given a DAG representation, the proposed scheduling scales the kernel‑level parallelism and establishes inter‑kernel dependencies to provide a reduced and predictable DAG response time. The corresponding timing analysis yields a safe yet nonpessimistic makespan bound without any assumption on kernel priorities. The proposed method is implemented using the standard CUDA API, requiring no additional software or hardware support. Experimental results under synthetic and real‑world benchmarks demonstrate that the proposed approach effectively reduces the worst‑case makespan and measured task execution time compared to the existing methods up to 32.8% and 21.3%, respectively.

PaperID: 5383, https://arxiv.org/pdf/2602.20804.pdf

Authors: Kale-ab Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey

Title: Probing Dec-POMDP Reasoning in Cooperative MARL

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec‑POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec‑POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information‑theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec‑POMDP reasoning. Reactive policies match the performance of memory‑based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec‑POMDP assumptions under current training paradigms, potentially leading to over‑optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.

PaperID: 5384, https://arxiv.org/pdf/2602.20720.pdf

Authors: Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei, Zhong Chen, Wei Yang Bryan Lim

Title: AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Abstract:
The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model‑based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast‑evolving nature of modern AI agents. We introduce AdapTools, a novel adaptive IPI attack framework that selects stealthier attack tools and generates adaptive attack prompts to create a rigorous security evaluation environment. Our approach comprises two key components: (1) Adaptive Attack Strategy Construction, which develops transferable adversarial strategies for prompt optimization, and (2) Attack Enhancement, which identifies stealthy tools capable of circumventing task‑relevance defenses. Comprehensive experimental evaluation shows that AdapTools achieves a 2.13 times improvement in attack success rate while degrading system utility by a factor of 1.78. Notably, the framework maintains its effectiveness even against state‑of‑the‑art defense mechanisms. Our method advances the understanding of IPI attacks and provides a useful reference for future research.

PaperID: 5385, https://arxiv.org/pdf/2602.20708.pdf

Authors: Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim

Title: ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Abstract:
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over‑refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing‑to‑mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over‑focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulate adversarial query key dependencies while amplifying task relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain. Furthermore, ICON demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi‑modal agents, establishing a superior balance between security and efficiency.

PaperID: 5386, https://arxiv.org/pdf/2602.20683.pdf

Authors: Mohamed Shamseldein

Title: Grid-Mind: An LLM-Orchestrated Multi-Fidelity Agent for Automated Connection Impact Assessment

Abstract:
Large language models (LLMs) have demonstrated remarkable tool‑use capabilities, yet their application to power system operations remains largely unexplored. This paper presents Grid‑Mind, a domain‑specific LLM agent that interprets natural‑language interconnection requests and autonomously orchestrates multi‑fidelity power system simulations. The LLM‑first architecture positions the language model as the central decision‑making entity, employing an eleven‑tool registry to execute Connection Impact Assessment (CIA) studies spanning steadystate power flow, N‑1 contingency analysis, transient stability, and electromagnetic transient screening. A violation inspector grounds every decision in quantitative simulation outputs, while a three‑layer anti‑hallucination defence mitigates numerical fabrication risk through forced capacity‑tool routing and post‑response grounding validation. A prompt‑level self‑correction mechanism extracts distilled lessons from agent failures, yielding progressive accuracy improvements without model retraining. End‑to‑end evaluation on 50 IEEE 118‑bus scenarios (DeepSeek‑V3, 2026‑02‑23) achieved 84.0% tool‑selection accuracy and 100% parsing accuracy. A separate 56‑scenario self‑correction suite passed 49 of 56 cases (87.5%) with a mean score of 89.3. These results establish a reproducible baseline for continued refinement while maintaining auditable, simulation‑grounded decision support.

PaperID: 5387, https://arxiv.org/pdf/2602.20664.pdf

Authors: Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong, Jinwei Chen, Le Zhang, Bo Li

Title: AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?

Abstract:
Custom Storyboard Generation (CSG) aims to produce high‑quality, multi‑character consistent storytelling. Current approaches based on static diffusion models, whether used in a one‑shot manner or within multi‑agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to "copy‑paste" pattern. (2) One‑shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi‑agents rely on non‑robust evaluators, ill‑suited for assessing stylized, non‑realistic animation. To address these, we propose AnimeAgent, the first Image‑to‑Video (I2V)‑based multi‑agent framework for CSG. Inspired by Disney's "Combination of Straight Ahead and Pose to Pose" workflow, AnimeAgent leverages I2V's implicit motion prior to enhance consistency and expressiveness, while a mixed subjective‑objective reviewer enables reliable iterative refinement. We also collect a human‑annotated CSG benchmark with ground‑truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.

PaperID: 5388, https://arxiv.org/pdf/2602.20543.pdf

Authors: Subhra Jyoti Mandal, Lara Rachidi, Puneet Jain, Matthieu Duvinage, Sander W. Timmer

Title: Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing

Abstract:
Colony‑forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor‑intensive and error‑prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN‑based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical‑grade requirements. A custom Detectron2 model trained on GSK's dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi‑agent framework combining DL with vision‑language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self‑improvement. Initial DL‑based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation‑ready solution for microbiological quality control, advancing automation in biopharmaceutical production.

PaperID: 5389, https://arxiv.org/pdf/2602.20541.pdf

Authors: Hana Salavcova, Martin Černý, Arpita Biswas

Title: Maximin Share Guarantees via Limited Cost-Sensitive Sharing

Abstract:
We study the problem of fairly allocating indivisible goods when limited sharing is allowed, that is, each good may be allocated to up to k agents, while incurring a cost for sharing. While classic maximin share (MMS) allocations may not exist in many instances, we demonstrate that allowing controlled sharing can restore fairness guarantees that are otherwise unattainable in certain scenarios. (1) Our first contribution shows that exact maximin share (MMS) allocations are guaranteed to exist whenever goods are allowed to be cost‑sensitively shared among at least half of the agents and the number of agents is even; for odd numbers of agents, we obtain a slightly weaker MMS guarantee. (2) We further design a Shared Bag‑Filling Algorithm that guarantees a (1 ‑ C)(k ‑ 1)‑approximate MMS allocation, where C is the maximum cost of sharing a good. Notably, when (1 ‑ C)(k ‑ 1) \geq 1, our algorithm recovers an exact MMS allocation. (3) We additionally introduce the Sharing Maximin Share (SMMS) fairness notion, a natural extension of MMS to the k‑sharing setting. (4) We show that SMMS allocations always exist under identical utilities and for instances with two agents. (5) We construct a counterexample to show the impossibility of the universal existence of an SMMS allocation. (6) Finally, we establish a connection between SMMS and constrained MMS (CMMS), yielding approximation guarantees for SMMS via existing CMMS results. These contributions provide deep theoretical insights for the problem of fair resource allocation when a limited sharing of resources are allowed in multi‑agent environments.

PaperID: 5390, https://arxiv.org/pdf/2602.20512.pdf

Authors: Haitong He, Xuemian Wu, Shizhe Zhao, Zhongqiang Ren

Title: Conflict-Based Search for Multi-Agent Path Finding with Elevators

Abstract:
This paper investigates a problem called Multi‑Agent Path Finding with Elevators (MAPF‑E), which seeks conflict‑free paths for multiple agents from their start to goal locations that may locate on different floors, and the agents can use elevators to travel between floors. The existence of elevators complicates the interaction among the agents and introduces new challenges to the planning. On the one hand, elevators can cause many conflicts among the agents due to its relatively long traversal time across floors, especially when many agents need to reach a different floor. On the other hand, the planner has to reason in a larger state space including the states of the elevators, besides the locations of the agents.

PaperID: 5391, https://arxiv.org/pdf/2602.20502.pdf

Authors: Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath

Title: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Abstract:
Existing Graphical User Interface (GUI) agents operate through step‑by‑step calls to vision language models‑‑taking a screenshot, reasoning about the next action, executing it, then repeating on the new page‑‑resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training‑free framework that transitions from reactive execution to programmatic planning through a novel two‑agent architecture: a Crawling Agent that constructs an updatable state‑machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision‑based re‑grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision‑only baseline, while reducing cost by 11.8x and end‑to‑end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler‑validated action templates, and node‑level execution with localized validation and repair.

PaperID: 5392, https://arxiv.org/pdf/2602.20446.pdf

Authors: Tingxuan Tang, Nicolas Janis, Kalyn Asher Montague, Kevin Eykholt, Dhilung Kirat, Youngja Park, Jiyong Jang, Adwait Nadkarni, Yue Xiao

Title: Understanding Human-AI Collaboration in Cybersecurity Competitions

Abstract:
Capture‑the‑Flag (CTF) competitions are increasingly becoming a testbed for evaluating AI capabilities at solving security tasks, due to the controlled environments and objective success criteria. Existing evaluations have focused on how successful AI is at solving CTF challenges in isolation from human CTF players. As AI usage increases in both academic and industrial settings, it is equally likely that human players may collaborate with AI agents to solve challenges. This possibility exposes a key knowledge gap: how do humans perceive AI CTF assistance; when assistance is provided, how do they collaborate and is it effective with respect to human performance; how do humans assisted by AI compare to the performance of fully autonomous AI agents on the same challenges. We address this gap with the first empirical study of AI assistance in a live, onsite CTF. In a study with 41 participants, we qualitatively study (i) how participants' perception, trust, and expectations shift before versus after hands‑on AI use, and (ii) how participants collaborate with an instrumented AI agent. Moreover, we also (iii) benchmark four autonomous AI agents on the same fresh challenge set to compare outcomes with human teams and analyze agent trajectories. We find that, as the competition progresses, teams increasingly delegate larger subtasks to the AI, giving it more agency. Interestingly, CTF challenges solving rates are often constrained not by model's reasoning capabilities, but rather by the human players: ineffective prompting and poor context specification become the primary bottleneck. Remarkably, autonomous agents that self‑direct their prompting and tool use bypass this bottleneck and outperform most human teams, coming in second overall in the competition. We conclude with implications for the future design of CTF challenges and for building effective human‑in‑the‑loop AI systems for security.

PaperID: 5393, https://arxiv.org/pdf/2602.20426.pdf

Authors: Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das

Title: Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

Abstract:
While most efforts to improve LLM‑based tool‑using agents focus on the agent itself ‑ through larger models, better prompting, or fine‑tuning ‑ agent performance increasingly plateaus due to the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re‑running a multi‑stage per‑tool pipeline ‑ synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times ‑ for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace‑Free+, a curriculum learning framework that progressively transfers supervision from trace‑rich settings to trace‑free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large‑scale dataset of high‑quality tool interfaces derived from real‑world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace‑Free+ improves robustness as tool catalogs scale to 150+ candidates ‑ in scaling experiments, reducing accuracy degradation by 29.23% and improving average query‑level success by 60.89% on StableToolBench ‑ generalizes across domains without retraining, and provides complementary gains on top of agent fine‑tuning.

PaperID: 5394, https://arxiv.org/pdf/2602.20297.pdf

Authors: Haochen Zhang, Zhong Zheng, Lingzhou Xue

Title: Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Abstract:
We study gap‑dependent performance guarantees for nearly minimax‑optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap‑dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax‑optimal worst‑case regret bound \tildeO(d\sqrtH^3K), where d is the feature dimension, H is the horizon length, and K is the number of episodes. We bridge this gap by providing the first gap‑dependent regret bound for the nearly minimax‑optimal algorithm LSVI‑UCB++ (He et al., 2023). Our analysis yields improved dependencies on both d and H compared to previous gap‑dependent results. Moreover, leveraging the low policy‑switching property of LSVI‑UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap‑dependent sample complexity upper bound for online multi‑agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

PaperID: 5395, https://arxiv.org/pdf/2602.20229.pdf

Authors: Tianjun Yao, Zhaoyi Li, Zhiqiang Shen

Title: HieraMAS: Optimizing Intra-Node LLM Mixtures and Inter-Node Topology for Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) built on large language models (LLMs) have shown strong performance across many tasks. Most existing approaches improve only one aspect at a time, such as the communication topology, role assignment, or LLM routing, while treating each agent as a single, indivisible unit. This misses the opportunity to use mixtures of LLMs within an agent to strengthen role‑specific abilities. We propose HieraMAS, a hierarchical collaboration framework that combines intra‑node LLM mixtures with an inter‑node communication topology. HieraMAS introduces supernodes, where each functional role is implemented by multiple heterogeneous LLMs using a propose‑synthesis structure. Optimizing HieraMAS creates unique credit‑assignment challenges: final task performance depends heavily on the underlying LLMs' capabilities, which can lead reinforcement methods to incorrectly reward suboptimal configurations. To address this, we use a two‑stage algorithm: (1) multi‑level reward attribution, which provides fine‑grained feedback at both the node level and the overall system level; (2) graph classification for topology selection, which treats choosing the communication structure as a holistic decision rather than optimizing edges one by one. Experiments on reasoning and coding benchmarks show that HieraMAS substantially outperforms existing methods while also delivering better cost‑performance trade‑offs.

PaperID: 5396, https://arxiv.org/pdf/2602.20196.pdf

Authors: Genliang Zhu, Chu Wang, Ziyuan Wang, Zhida Li, Qiang Li

Title: OpenPort Protocol: A Security Governance Specification for AI Agent Tool Access

Abstract:
AI agents increasingly require direct, structured access to application data and actions, but production deployments still struggle to express and verify the governance properties that matter in practice: least‑privilege authorization, controlled write execution, predictable failure handling, abuse resistance, and auditability. This paper introduces OpenPort Protocol (OPP), a governance‑first specification for exposing application tools through a secure server‑side gateway that is model‑ and runtime‑neutral and can bind to existing tool ecosystems. OpenPort defines authorization‑dependent discovery, stable response envelopes with machine‑actionable \textttagent. reason codes, and an authorization model combining integration credentials, scoped permissions, and ABAC‑style policy constraints. For write operations, OpenPort specifies a risk‑gated lifecycle that defaults to draft creation and human review, supports time‑bounded auto‑execution under explicit policy, and enforces high‑risk safeguards including preflight impact binding and idempotency. To address time‑of‑check/time‑of‑use drift in delayed approval flows, OpenPort also specifies an optional State Witness profile that revalidates execution‑time preconditions and fails closed on state mismatch. Operationally, the protocol requires admission control (rate limits/quotas) with stable 429 semantics and structured audit events across allow/deny/fail paths so that client recovery and incident analysis are deterministic. We present a reference runtime and an executable governance toolchain (layered conformance profiles, negative security tests, fuzz/abuse regression, and release‑gate scans) and evaluate the core profile at a pinned release tag using artifact‑based, externally reproducible validation.

PaperID: 5397, https://arxiv.org/pdf/2602.20156.pdf

Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

Title: Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Abstract:
LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third‑party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill‑based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely‑used LLM agents to injections through skill files. SkillInject contains 202 injection‑task pairs with attacks ranging from obviously malicious injections to subtle, context‑dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware‑like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context‑aware authorization frameworks. Our benchmark is available at https://www.skill‑inject.com/.

PaperID: 5398, https://arxiv.org/pdf/2602.20078.pdf

Authors: Shan Yang, Yang Liu

Title: Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Abstract:
Scaling cooperative multi‑agent reinforcement learning (MARL) is fundamentally limited by cross‑agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on all agents, so the stochasticity of the other agents enters the signal as cross‑agent noise that grows with N. Fortunately, many engineering systems, such as cloud computing and power systems, have differentiable analytical models that prescribe efficient system states, providing a new reference beyond noisy shared returns. In this work, we propose Descent‑Guided Policy Gradient (DG‑PG), a framework that augments policy‑gradient updates with a noise‑free descent signal derived from differentiable analytical models. We prove that DG‑PG reduces policy‑gradient estimator variance from \mathcalO(N) to \mathcalO(1), preserves the equilibria of the cooperative game, and achieves agent‑independent sample complexity \widetilde\mathcalO (1/ε). On a heterogeneous cloud resource scheduling task with up to 1500 agents, DG‑PG converges within 20 episodes on average, while MAPPO and IPPO fail to converge under identical architectures.

PaperID: 5399, https://arxiv.org/pdf/2602.20059.pdf

Authors: Sarath Shekkizhar, Adam Earle

Title: Interaction Theater: A case of LLM Agents Interacting at Scale

Abstract:
As multi‑agent architectures and agent‑to‑agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI‑agent‑only social platform, with 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics (Jaccard specificity), embedding‑based semantic similarity, and LLM‑as‑judge validation to characterize agent interaction quality. Our findings reveal agents produce diverse, well‑formed text that creates the surface appearance of active discussion, but the substance is largely absent. Specifically, while most agents (67.5%) vary their output across contexts, 65% of comments share no distinguishing content vocabulary with the post they appear under, and information gain from additional comments decays rapidly. LLM judge based metrics classify the dominant comment types as spam (28%) and off‑topic content (22%). Embedding‑based semantic analysis confirms that lexically generic comments are also semantically generic. Agents rarely engage in threaded conversation (5% of comments), defaulting instead to independent top‑level responses. We discuss implications for multi‑agent interaction design, arguing that coordination mechanisms must be explicitly designed; without them, even large populations of capable agents produce parallel output rather than productive exchange.

PaperID: 5400, https://arxiv.org/pdf/2602.19933.pdf

Authors: Pelin Sekercioglu, Angela Fontan, Dimos V. Dimarogonas

Title: Edge-based Synchronization over Signed Digraphs with Multiple Leaders

Abstract:
This work addresses the edge‑based synchronization problem in first‑order multi‑agent systems containing both cooperative and antagonistic interactions with one or multiple leader groups. The presence of multiple leaders and antagonistic interactions means that the multi‑agent system typically does not achieve consensus, unless specific conditions (on the number of leaders and on the signed graph) are met, in which case the agents reach a trivial form of consensus. In general, we show that the multi‑agent system exhibits a more general form of synchronization, including bipartite consensus and containment. Our approach proposes a signed edge‑based agreement protocol for signed networks described by signed edge‑Laplacian matrices. In particular, in this work, we present new spectral properties of signed edge‑Laplacian matrices containing multiple zero eigenvalues and establish global exponential stability of the synchronization errors. Moreover, we explicitly compute the equilibrium to which all edge states converge, thereby characterizing the resulting synchronization behavior. Numerical simulations validate our theoretical results.

PaperID: 5401, https://arxiv.org/pdf/2602.19844.pdf

Authors: Yedi Zhang, Haoyu Wang, Xianglin Yang, Jin Song Dong, Jun Sun

Title: LLM-enabled Applications Require System-Level Threat Monitoring

Abstract:
LLM‑enabled applications are rapidly reshaping the software ecosystem by using large language models as core reasoning components for complex task execution. This paradigm shift, however, introduces fundamentally new reliability challenges and significantly expands the security attack surface, due to the non‑deterministic, learning‑driven, and difficult‑to‑verify nature of LLM behavior. In light of these emerging and unavoidable safety challenges, we argue that such risks should be treated as expected operational conditions rather than exceptional events, necessitating a dedicated incident‑response perspective. Consequently, the primary barrier to trustworthy deployment is not further improving model capability but establishing system‑level threat monitoring mechanisms that can detect and contextualize security‑relevant anomalies after deployment ‑‑ an aspect largely underexplored beyond testing or guardrail‑based defenses. Accordingly, this position paper advocates systematic and comprehensive monitoring of security threats in LLM‑enabled applications as a prerequisite for reliable operation and a foundation for dedicated incident‑response frameworks.

PaperID: 5402, https://arxiv.org/pdf/2602.19843.pdf

Authors: Jin Jia, Zhiling Deng, Zhuangbin Chen, Yingqi Wang, Zibin Zheng

Title: MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems

Abstract:
As LLM‑based Multi‑Agent Systems (MAS) are increasingly deployed for complex tasks, ensuring their reliability has become a pressing challenge. Since MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (e.g., hallucinations, misinterpreted instructions, and reasoning drift) that propagate silently without raising runtime exceptions. Prevailing evaluation approaches, which measure only end‑to‑end task success, offer limited insight into how these failures arise or how effectively agents recover from them. To bridge this gap, we propose MAS‑FIRE, a systematic framework for fault injection and reliability evaluation of MAS. We define a taxonomy of 15 fault types covering intra‑agent cognitive errors and inter‑agent coordination failures, and inject them via three non‑invasive mechanisms: prompt modification, response rewriting, and message routing manipulation. Applying MAS‑FIRE to three representative MAS architectures, we uncover a rich set of fault‑tolerant behaviors that we organize into four tiers: mechanism, rule, prompt, and reasoning. This tiered view enables fine‑grained diagnosis of where and why systems succeed or fail. Our findings reveal that stronger foundation models do not uniformly improve robustness. We further show that architectural topology plays an equally decisive role, with iterative, closed‑loop designs neutralizing over 40% of faults that cause catastrophic collapse in linear workflows. MAS‑FIRE provides the process‑level observability and actionable guidance needed to systematically improve multi‑agent systems.

PaperID: 5403, https://arxiv.org/pdf/2602.19840.pdf

Authors: Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang

Title: SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

Abstract:
Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author's unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single‑model and static multi‑agent systems to perceive and adapt to stylistic variations. To address this, we introduce the Style‑Adaptive Multi‑Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task. Specifically, our method quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform. This SFS serves as a control signal to dynamically assemble a tailored workflow of specialized translation agents based on the source text's structural patterns. Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.

PaperID: 5404, https://arxiv.org/pdf/2602.19810.pdf

Authors: Lukas Weidener, Marko Brkić, Phillip Lee, Martin Karlsson, Kevin Noessler, Paul Kohlhaas

Title: From Agent-Only Social Networks to Autonomous Scientific Research: Lessons from OpenClaw and Moltbook, and the Architecture of ClawdLab and Beach.Science

Abstract:
In January 2026, the open‑source agent framework OpenClaw and the agent‑only social network Moltbook produced a large‑scale dataset of autonomous AI‑to‑AI interaction, attracting six academic publications within fourteen days. This study conducts a multivocal literature review of that ecosystem and presents two complementary platforms for autonomous scientific research as a design science response to the architectural failure modes identified. ClawdLab, an open‑source platform for structured laboratory collaboration, addresses these failure modes through hard role restrictions, structured adversarial critique, PI‑led governance, multi‑model orchestration, and evidence requirements enforced through external tool verification, in which the principal investigator validates submitted work using available API calls, computational services, and model context protocol integrations rather than relying on social consensus. Beach.science, a public research commons, complements ClawdLab's structured laboratory model by providing a free‑form environment in which heterogeneous agent configurations interact, discover research opportunities, and autonomously contribute computational analyses, supported by template‑based role specialisation, extensible skill registries, and programmatic reward mechanisms that distribute inference resources to agents demonstrating scientific progress. A three‑tier taxonomy distinguishes single‑agent pipelines, predetermined multi‑agent workflows, and fully decentralised systems, analysing why leading AI co‑scientist platforms remain confined to the first two tiers. The composable third‑tier architecture instantiated across ClawdLab and beach.science, in which foundation models, capabilities, governance, verification tooling, and inter‑lab coordination are independently modifiable, enables compounding improvement as the broader AI ecosystem advances.

PaperID: 5405, https://arxiv.org/pdf/2602.19538.pdf

Authors: Arundhati Banerjee, Jeff Schneider

Title: Cost-Aware Diffusion Active Search

Abstract:
Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments with exploitation of prior observations in the search space. Prior work has proposed information gain and Thompson sampling based myopic, greedy approaches for agents to actively decide query or search locations when the number of targets is unknown. Decision making algorithms in such partially observable environments have also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent's observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration‑exploitation trade‑off for active search without building an exhaustive search tree. We identify the optimism bias in prior diffusion based reinforcement learning approaches when applied to the active search setting and propose mitigating solutions for efficient cost‑aware decision making with both single and multi‑agent teams. Our proposed algorithm outperforms standard baselines in offline reinforcement learning in terms of full recovery rate and is computationally more efficient than tree search in cost‑aware active decision making.

PaperID: 5406, https://arxiv.org/pdf/2602.19516.pdf

Authors: Ruikun Li, Jun Yao, Yingfan Hua, Shixiang Tang, Biqing Qi, Bin Liu, Wanli Ouyang, Yan Lu

Title: Pixel2Phys: Distilling Governing Laws from Visual Dynamics

Abstract:
Discovering physical laws directly from high‑dimensional visual data is a long‑standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific intelligence. This task is inherently difficult because physical knowledge is low‑dimensional and structured, whereas raw video observations are high‑dimensional and redundant, with most pixels carrying little or no physical meaning. Extracting concise, physically relevant variables from such noisy data remains a key obstacle. To address this, we propose Pixel2Phys, a collaborative multi‑agent framework adaptable to any Multimodal Large Language Model (MLLM). It emulates human scientific reasoning by employing a structured workflow to extract formalized physical knowledge through iterative hypothesis generation, validation, and refinement. By repeatedly formulating, and refining candidate equations on high‑dimensional data, it identifies the most concise representations that best capture the underlying physical evolution. This automated exploration mimics the iterative workflow of human scientists, enabling AI to reveal interpretable governing equations directly from raw observations. Across diverse simulated and real‑world physics videos, Pixel2Phys discovers accurate, interpretable governing equations and maintaining stable long‑term extrapolation where baselines rapidly diverge.

PaperID: 5407, https://arxiv.org/pdf/2602.19458.pdf

Authors: Ziyang Guo, Yifan Wu, Jason Hartline, Kenneth Holstein, Jessica Hullman

Title: ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making

Abstract:
Multi‑agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post‑training framework based on decision theory that fine‑tunes a decision‑assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real‑world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision‑makers.

PaperID: 5408, https://arxiv.org/pdf/2602.19439.pdf

Authors: Ruicheng Ao, David Simchi-Levi, Xinshang Wang

Title: OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents

Abstract:
Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagnostics, trace root causes across echelons, and fix formulations without sacrificing operational soundness. Whether AI agents can perform this task remains untested. We decompose this task into two phases: a domain‑agnostic feasibility phase that iteratively repairs any LP using IIS‑guided diagnosis, and a domain‑specific validation phase that enforces five rationality checks grounded in inventory theory. We test 22 API models from seven families on 976 multi‑echelon supply chain problems and train two 8B‑parameter models with self‑taught reasoning and solver‑verified rewards. The trained models reach 81.7% Rational Recovery Rate (RRR) ‑‑ the fraction of problems resolved to both feasibility and operational rationality ‑‑ versus 42.2% for the best API model and 21.3% on average. The gap concentrates in Phase 1 repair, where API models average 27.6% recovery rate versus 97.2% for trained models. Two gaps separate current AI from reliable model repair: solver interaction, as API models restore only 27.6% of infeasible formulations; and operational rationale, as roughly one in four feasible repairs violate supply chain theory. Each gap requires a different intervention ‑‑ targeted training closes the solver interaction gap, while explicit specification as solver‑verifiable checks closes the rationality gap. For organizations adopting AI in operational planning, formalizing what 'rational' means in their context is the higher‑return investment.

PaperID: 5409, https://arxiv.org/pdf/2602.19387.pdf

Authors: Marco Knipfer, Alexander Roman, Konstantin T. Matchev, Katia Matcheva, Sergei Gleyzer

Title: AI Agents for Variational Quantum Circuit Design

Abstract:
Variational quantum circuits (VQCs) constitute a central building block of near‑term quantum machine learning (QML), yet the principled design of expressive and trainable architectures remains a major open challenge. The VQC design space grows combinatorially with the number of qubits, layers, entanglement structures, and gate parameterizations, rendering manual circuit construction inefficient and often suboptimal. We introduce an autonomous agent‑based framework for VQC architecture search that integrates high‑level reasoning with a quantum simulation environment. The agent proposes candidate circuit architectures, evaluates them through fully automated training and validation pipelines, and iteratively improves its design strategy via performance‑driven feedback. Empirically, we show that the agent autonomously evolves circuit architectures from simple initial ansätze toward increasingly expressive designs, progressively trying to improve task performance. This demonstrates that agentic AI can effectively navigate and refine the VQC design landscape with minimal human intervention, providing a scalable methodology for automated quantum model development in the Noisy Intermediate‑Scale Quantum (NISQ) regime.

PaperID: 5410, https://arxiv.org/pdf/2602.19366.pdf

Authors: Zirui Xu, Vasileios Tzoumas

Title: Self-Configurable Mesh-Networks for Scalable Distributed Submodular Bandit Optimization

Abstract:
We study how to scale distributed bandit submodular coordination under realistic communication constraints in bandwidth, data rate, and connectivity. We are motivated by multi‑agent tasks of active situational awareness in unknown, partially‑observable, and resource‑limited environments, where the agents must coordinate through agent‑to‑agent communication. Our approach enables scalability by (i) limiting information relays to only one‑hop communication and (ii) keeping inter‑agent messages small, having each agent transmit only its own action information. Despite these information‑access restrictions, our approach enables near‑optimal action coordination by optimizing the agents' communication neighborhoods over time, through distributed online bandit optimization, subject to the agents' bandwidth constraints. Particularly, our approach enjoys an anytime suboptimality bound that is also strictly positive for arbitrary network topologies, even disconnected. To prove the bound, we define the Value of Coordination (VoC), an information‑theoretic metric that quantifies for each agent the benefit of information access to its neighbors. We validate in simulations the scalability and near‑optimality of our approach: it is observed to converge faster, outperform benchmarks for bandit submodular coordination, and can even outperform benchmarks that are privileged with a priori knowledge of the environment.

PaperID: 5411, https://arxiv.org/pdf/2602.19346.pdf

Authors: Erik Garcia Oyono, Jialin Lin, Dandan Zhang

Title: Design and Control of Modular Magnetic Millirobots for Multimodal Locomotion and Shape Reconfiguration

Abstract:
Modular small‑scale robots offer the potential for on‑demand assembly and disassembly, enabling task‑specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three‑dimensional electromagnetic systems, and lack robust single‑module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube‑shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self‑assembly and reconfiguration, a fixed module that enables flip‑and‑walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time‑varying two‑dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed‑loop navigation using real‑time vision feedback and A path planning, establishing robust single‑module control capabilities. Beyond locomotion, the system achieves self‑assembly, multimodal transformations, and disassembly at low field strengths. Chain‑to‑gripper transformations succeeded in 90% of trials, while chain‑to‑square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.

PaperID: 5412, https://arxiv.org/pdf/2602.19320.pdf

Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

Title: Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Abstract:
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long‑horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system‑level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone‑dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

PaperID: 5413, https://arxiv.org/pdf/2602.19304.pdf

Authors: Haojun Shi, Suyu Ye, Katherine M. Guerrerio, Jianzhi Shen, Yifan Yin, Daniel Khashabi, Chien-Ming Huang, Tianmin Shu

Title: Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation

Abstract:
Successful cooperation among decentralized agents requires each agent to quickly adapt its plan to the behavior of other agents. In scenarios where agents cannot confidently predict one another's intentions and plans, language communication can be crucial for ensuring safety. In this work, we focus on path‑level cooperation in which agents must adapt their paths to one another in order to avoid collisions or perform physical collaboration such as joint carrying. In particular, we propose a safe and interpretable multimodal path planning method, CaPE (Code as Path Editor), which generates and updates path plans for an agent based on the environment and language communication from other agents. CaPE leverages a vision‑language model (VLM) to synthesize a path editing program verified by a model‑based planner, grounding communication to path plan updates in a safe and interpretable way. We evaluate our approach in diverse simulated and real‑world scenarios, including multi‑robot and human‑robot cooperation in autonomous driving, household, and joint carrying tasks. Experimental results demonstrate that CaPE can be integrated into different robotic systems as a plug‑and‑play module, greatly enhancing a robot's ability to align its plan to language communication from other robots or humans. We also show that the combination of the VLM‑based path editing program synthesis and model‑based planning safety enables robots to achieve open‑ended cooperation while maintaining safety and interpretability.

PaperID: 5414, https://arxiv.org/pdf/2602.19223.pdf

Authors: Aymen Khouja, Imen Jendoubi, Oumayma Mahjoub, Oussama Mahfoudhi, Ruan De Kock, Siddarth Singh, Claude Formanek

Title: Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

Abstract:
The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision‑making units. To address scalability and coordination concerns, Multi‑Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst‑case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.

PaperID: 5415, https://arxiv.org/pdf/2602.19218.pdf

Authors: Zeyu Zhang, Guohao Li, Zhenchang Xing, Alexandros Apostolopoulos, Yu Lin Lee, Liang Zheng

Title: Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls

Abstract:
The ability to use tools is fundamental for large language model (LLM) agents. Given a task, existing systems use LLMs to plan and generate tool calls, which are executed by real‑world tools to complete the task. However, tool calls are prone to errors because they are derived merely from LLM intrinsic capabilities. What is more, while it is useful to let LLMs iteratively refine the tool‑call sequence using execution results from real tools, this process can be expensive and lead to unsafe results. To improve LLM tool calls and address issues caused by using real tools for refinement, we introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs. Specifically, Gecko checks the validity of tool calls including input arguments and tool names, synthesizes reasonable responses that adhere to the output schema, and assesses whether all task objectives have been achieved. These three types of feedback provided by Gecko allow LLMs to refine their tool calls, forming a simple yet effective test‑time scaling method named GATS. On BFCLv3 and τ^2‑bench, GATS consistently improves the tool calling performance of various LLMs including GPT‑4o, GPT‑5, and Gemini‑3.0‑pro. We further discuss working mechanisms of our method and share future possibilities.

PaperID: 5416, https://arxiv.org/pdf/2602.19190.pdf

Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

Title: FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Abstract:
Research on the intelligent interpretation of all‑weather, all‑time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open‑world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high‑quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image‑Text‑AlphaEarth feature triplet dataset and developed FUSAR‑GPT, a VLM specifically for SAR. FUSAR‑GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi‑source remote‑sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two‑stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two‑stage decoupling paradigm enable FUSAR‑GPT to achieve state‑of‑the‑art performance across several typical remote sensing visual‑language benchmark tests, significantly outperforming mainstream baseline models by over 12%.

PaperID: 5417, https://arxiv.org/pdf/2602.19070.pdf

Authors: Jie Song, Yang Bai, Naoki Wakamiya

Title: Cooperative Transportation Without Prior Object Knowledge via Adaptive Self-Allocation and Coordination

Abstract:
This work proposes a novel cooperative transportation framework for multi‑agent systems that does not require any prior knowledge of cargo locations or sizes. Each agent relies on local sensing to detect cargos, recruit nearby agents, and autonomously form a transportation team with an appropriate size. The core idea is that once an agent detects a cargo within its sensing range, it generates an attraction field represented by a density function, which pulls neighboring agents toward the cargo. When multiple cargos are present, the attraction fields generated by different agents are adaptively weighted and combined with Centroidal Voronoi Tessellation (CVT), enabling agents to self‑organize into balanced formations while automatically allocating more agents to larger cargos. To prevent agents from clustering on one side of a large cargo, a Control Barrier Function (CBF)‑based mechanism is introduced to enforce safe inter‑agent distances and promote a uniform, symmetric distribution of agents around each cargo, which is essential for stable transportation. Simulation results demonstrate that the proposed framework can simultaneously transport multiple cargos of different sizes in a coordinated and collision‑free manner.

PaperID: 5418, https://arxiv.org/pdf/2602.19040.pdf

Authors: Jiaxin Wu, Xiao-Yong Wei, Qing Li

Title: Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

Abstract:
The rise of short‑form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero‑shot text‑to‑video retrieval systems. While recent advances in large‑scale pretraining have improved zero‑shot cross‑modal alignment, existing methods still struggle with query‑dependent temporal reasoning, limiting their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi‑agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero‑shot contextual temporal reasoning, and (3) a query reformulation agent for refining ambiguous queries and recovering performance for those that degrade over iterations. These agents are dynamically coordinated by an orchestration agent, which leverages intermediate feedback and reasoning outcomes to guide execution. We also introduce a novel communication mechanism that incorporates retrieval‑performance memory and historical reasoning traces to improve coordination and decision‑making. Experiments on three TRECVid benchmarks spanning eight years show that our framework achieves a twofold improvement over CLIP4Clip and significantly outperforms state‑of‑the‑art methods by a large margin.

PaperID: 5419, https://arxiv.org/pdf/2602.19016.pdf

Authors: George X. Wang, Jiaqian Hu, Guande Wu Jing Qian

Title: CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation

Abstract:
Despite the widespread use of automatic AI translation systems in daily language tasks, professional translation remains crucial in domain‑specific and high‑stakes scenarios. Yet professional translators rarely rely on these systems in their everyday practice due to a lack of detailed support for the translation process, matching professional styles, and accountability for the final outcome. To bridge the gap, we present CHORUS, a mixed‑initiative translation system that supports the translation process and personal style as translators work. A formative study found that incorporating MQM theory may be beneficial for achieving professional translation, and that the system should adapt to each individual translator's idiosyncratic traits. The final within‑subject study with 30 licensed English‑‑Chinese translators found that our system reduced completion time by 33.8%, lowered translators' cognitive effort, and improved final translation quality using the BLEU and COMET as automatic evaluation metrics. Participants' qualitative analysis also revealed that the system made translation issues easier to inspect, reduced repeated prompting compared to single‑agent AI systems, and offered reflections on their habits and traits. Our findings illustrate how multi‑agent AI systems can be designed to support expert workflows and their potential for professional use.

PaperID: 5420, https://arxiv.org/pdf/2602.18966.pdf

Authors: Yonathan Ron, Shiri Gilboa, Tammuz Dubnov

Title: Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation

Abstract:
Domain‑specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state‑of‑the‑art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi‑agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology) our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments with degradation in only 7.1%, substantially outperforming direct transcript post‑editing. These results demonstrate that prompt‑based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine‑tuning.

PaperID: 5421, https://arxiv.org/pdf/2602.18962.pdf

Authors: Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei

Title: NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

Abstract:
The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi‑agent LLM‑based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between‑subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition‑time effect on deficit‑based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI‑based interpretation can support attributional change by helping users recognize communication challenges as mutual.

PaperID: 5422, https://arxiv.org/pdf/2602.18925.pdf

Authors: Philipp Lakheshar, Sharwin Rezagholi

Title: A potentialization algorithm for games with applications to multi-agent learning in repeated games

Abstract:
We investigate an algorithm that assigns to any game in normal form an approximating game that admits an ordinal potential function. Due to the properties of potential games, the algorithm equips every game with a surrogate reward structure that allows efficient multi‑agent learning. Numerical simulations using the replicator dynamics show that 'potentialization' guarantees convergence to stable agent behavior.

PaperID: 5423, https://arxiv.org/pdf/2602.18891.pdf

Authors: Yuan An

Title: Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

Abstract:
Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed‑methods pilot evaluation of an AI‑orchestrated research workflow in which a human researcher coordinated multiple LLM‑based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple‑choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24‑criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion‑level analysis and equivalence testing show that generated MCQs are not fully comparable to expert‑vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps concentrated in skill\ depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface‑level qualities, such as grammar fluency, clarity options, no duplicates, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift. The researcher's work moved from ``authoring items'' toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including emerging ``AI research operations'' skills required for AI‑empowered research pipelines.

PaperID: 5424, https://arxiv.org/pdf/2602.18832.pdf

Authors: Eason Chen, Ce Guan, Ahmed Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, Cyuan Jhen Wu

Title: OpenClaw AI Agents as Informal Learners at Moltbook: Characterizing an Emergent Learning Community at Scale

Abstract:
Informal learning communities have been called the "other Massive Open Online C" in Learning@Scale research, yet remain understudied compared to MOOCs. We present the first empirical study of a large‑scale informal learning community composed entirely of AI agents. Moltbook, a social network exclusively for AI agents powered by autonomous agent frameworks such as OpenClaw, grew to over 2.8 million registered agents in three weeks. Analyzing 231,080 non‑spam posts across three phases of community evolution, we find three key patterns. First, participation inequality is extreme from the start (comment Gini = 0.889), exceeding human community benchmarks. Second, AI agents exhibit a "broadcasting inversion": statement‑to‑question ratios of 8.9:1 to 9.7:1 contrast sharply with the question‑driven dynamics of human learning communities, and comment‑level analysis of 1.55 million comments reveals a "parallel monologue" pattern where 93% of comments are independent responses rather than threaded dialogue. Third, we document a characteristic engagement lifecycle: explosive initial growth (184K posts from 32K authors in 11 days), a spam crisis (57,093 posts deleted by the platform), and engagement decline (mean comments: 31.7 ‑> 8.3 ‑> 1.7) that had not reversed by the end of our observation window despite effective spam removal. Sentiment analysis reveals a selection effect: comment tone becomes more positive as engagement declines, suggesting that casual participants disengage first while committed contributors remain. These findings have direct implications for hybrid human‑AI learning platforms.

PaperID: 5425, https://arxiv.org/pdf/2602.18824.pdf

Authors: Pedram Riyazimehr, Seyyed Ehsan Mahmoudi

Title: UniRank: A Multi-Agent Calibration Pipeline for Estimating University Rankings from Anonymized Bibliometric Signals

Abstract:
We present UniRank, a multi‑agent LLM pipeline that estimates university positions across global ranking systems using only publicly available bibliometric data from OpenAlex and Semantic Scholar. The system employs a three‑stage architecture: (a) zero‑shot estimation from anonymized institutional metrics, (b) per‑system tool‑augmented calibration against real ranked universities, and (c) final synthesis. Critically, institutions are anonymized ‑‑ names, countries, DOIs, paper titles, and collaboration countries are all redacted ‑‑ and their actual ranks are hidden from the calibration tools during evaluation, preventing LLM memorization from confounding results. On the Times Higher Education (THE) World University Rankings (n=352), the system achieves MAE = 251.5 rank positions, Median AE = 131.5, PNMAE = 12.03%, Spearman ρ= 0.769, Kendall τ= 0.591, hit rate @50 = 20.7%, hit rate @100 = 39.8%, and a Memorization Index of exactly zero (no exact‑match zero‑width predictions among all 352 universities). The systematic positive‑signed error (+190.1 positions, indicating the system consistently predicts worse ranks than actual) and monotonic performance degradation from elite tier (MAE = 60.5, hit@100 = 90.5%) to tail tier (MAE = 328.2, hit@100 = 20.8%) provide strong evidence that the pipeline performs genuine analytical reasoning rather than recalling memorized rankings. A live demo is available at https://unirank.scinito.ai .

PaperID: 5426, https://arxiv.org/pdf/2602.18797.pdf

Authors: Mubshra Zulfiqar, Muhammad Ayzed Mirza, Basit Qureshi

Title: Carbon-aware decentralized dynamic task offloading in MIMO-MEC networks via multi-agent reinforcement learning

Abstract:
Massive internet of things microservices require integrating renewable energy harvesting into mobile edge computing (MEC) for sustainable eScience infrastructures. Spatiotemporal mismatches between stochastic task arrivals and intermittent green energy along with complex inter‑user interference in multi‑antenna (MIMO) uplinks complicate real‑time resource management. Traditional centralized optimization and off‑policy reinforcement learning struggle with scalability and signaling overhead in dense networks. This paper proposes CADDTO‑PPO, a carbon‑aware decentralized dynamic task offloading framework based on multi‑agent proximal policy optimization. The multi‑user MIMO‑MEC system is modeled as a Decentralized Partially Observable Markov Decision Process (DEC‑POMDP) to jointly minimize carbon emissions and buffer latency and energy wastage. A scalable architecture utilizes decentralized execution with parameter sharing (DEPS), which enables autonomous IoT agents to make fine‑grained power control and offloading decisions based solely on local observations. Additionally, a carbon‑first reward structure adaptively prioritizes green time slots for data transmission to decouple system throughput from grid‑dependent carbon footprints. Finally, experimental results demonstrate CADDTO‑PPO outperforms deep deterministic policy gradient (DDPG) and lyapunov‑based baselines. The framework achieves the lowest carbon intensity and maintains near‑zero packet overflow rates under extreme traffic loads. Architectural profiling validates the framework to demonstrate a constant O(1) inference complexity and theoretical lightweight feasibility for future generation sustainable IoT deployments.

PaperID: 5427, https://arxiv.org/pdf/2602.18773.pdf

Authors: Haoyang Su, Shaoting Zhang, Xiaosong Wang

Title: LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology

Abstract:
The emergence of tool‑calling‑based agent systems introduces a more evidence‑driven paradigm for pathology image analysis in contrast to the coarse‑grained text‑image diagnostic approaches. With the recent large‑scale experimental adoption of spatial transcriptomics technologies, molecularly validated pathological diagnosis is becoming increasingly open and accessible. In this work, we propose LAMMI‑Pathology (LVLM‑Agent System for Molecularly Informed Medical Intelligence in Pathology), a scalable agent framework for domain‑specific agent tool‑calling. LAMMI‑Pathology adopts a tool‑centric, bottom‑up architecture in which customized domain‑adaptive tools serve as the foundation. These tools are clustered by domain style to form component agents, which are then coordinated through a top‑level planner hierarchically, avoiding excessively long context lengths that could induce task drift. Based on that, we introduce a novel trajectory construction mechanism based on Atomic Execution Nodes (AENs), which serve as reliable and composable units for building semi‑simulated reasoning trajectories that capture credible agent‑tool interactions. Building on this foundation, we develop a trajectory‑aware fine‑tuning strategy that aligns the planner's decision‑making process with these multi‑step reasoning trajectories, thereby enhancing inference robustness in pathology understanding and its adaptive use of the customized toolset.

PaperID: 5428, https://arxiv.org/pdf/2602.18764.pdf

Authors: Andreas Schlapbach

Title: The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol

Abstract:
This paper establishes a fundamental convergence: Schema‑Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM‑agent interaction. SGD, designed for dialogue‑based API discovery (2019), and MCP, now the de facto standard for LLM‑tool integration, share the same core insight ‑‑ that schemas can encode not just tool signatures but operational constraints and reasoning guidance. By analyzing this convergence, we extract five foundational principles for schema design: (1) Semantic Completeness over Syntactic Precision, (2) Explicit Action Boundaries, (3) Failure Mode Documentation, (4) Progressive Disclosure Compatibility, and (5) Inter‑Tool Relationship Declaration. These principles reveal three novel insights: first, SGD's original design was fundamentally sound and should be inherited by MCP; second, both frameworks leave failure modes and inter‑tool relationships unexploited ‑‑ gaps we identify and resolve; third, progressive disclosure emerges as a critical production‑scaling insight under real‑world token constraints. We provide concrete design patterns for each principle. These principles position schema‑driven governance as a scalable mechanism for AI system oversight without requiring proprietary system inspection ‑‑ central to Software 3.0.

PaperID: 5429, https://arxiv.org/pdf/2602.18740.pdf

Authors: Ziyan Zhang, Changxin Wan, Peng Hao, Kanok Boriboonsomsin, Matthew J. Barth, Yongkang Liu, Seyhan Ucar, Guoyuan Wu

Title: HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning

Abstract:
This study presents a hierarchical, network‑level traffic flow control framework for mixed traffic consisting of Human‑driven Vehicles (HVs), Connected and Automated Vehicles (CAVs). The framework jointly optimizes vehicle‑level eco‑driving behaviors and intersection‑level traffic signal control to enhance overall network efficiency and decrease energy consumption. A decentralized Multi‑Agent Reinforcement Learning (MARL) approach by Value Decomposition Network (VDN) manages cycle‑based traffic signal control (TSC) at intersections, while an innovative Signal Phase and Timing (SPaT) prediction method integrates a Machine Learning‑based Trajectory Planning Algorithm (MLTPA) to guide CAVs in executing Eco‑Approach and Departure (EAD) maneuvers. The framework is evaluated across varying CAV proportions and powertrain types to assess its effects on mobility and energy performance. Experimental results conducted in a 44 real‑world network demonstrate that the MARL‑based TSC method outperforms the baseline model (i.e., Webster method) in speed, fuel consumption, and idling time. In addition, with MLTPA, HONEST‑CAV benefits the traffic system further in energy consumption and idling time. With a 60% CAV proportion, vehicle average speed, fuel consumption, and idling time can be improved/saved by 7.67%, 10.23%, and 45.83% compared with the baseline. Furthermore, discussions on CAV proportions and powertrain types are conducted to quantify the performance of the proposed method with the impact of automation and electrification.

PaperID: 5430, https://arxiv.org/pdf/2602.18731.pdf

Authors: Yuhang Bai, Yujuan Ding, Shanru Lin, Wenqi Fan

Title: Beyond Description: A Multimodal Agent Framework for Insightful Chart Summarization

Abstract:
Chart summarization is crucial for enhancing data accessibility and the efficient consumption of information. However, existing methods, including those with Multimodal Large Language Models (MLLMs), primarily focus on low‑level data descriptions and often fail to capture the deeper insights which are the fundamental purpose of data visualization. To address this challenge, we propose Chart Insight Agent Flow, a plan‑and‑execute multi‑agent framework effectively leveraging the perceptual and reasoning capabilities of MLLMs to uncover profound insights directly from chart images. Furthermore, to overcome the lack of suitable benchmarks, we introduce ChartSummInsights, a new dataset featuring a diverse collection of real‑world charts paired with high‑quality, insightful summaries authored by human data analysis experts. Experimental results demonstrate that our method significantly improves the performance of MLLMs on the chart summarization task, producing summaries with deep and diverse insights.

PaperID: 5431, https://arxiv.org/pdf/2602.18705.pdf

Authors: Wenjing Zhai, Jianbin Zhang, Tao Liu

Title: EDU-MATRIX: A Society-Centric Generative Cognitive Digital Twin Architecture for Secondary Education

Abstract:
Existing multi‑agent simulations often suffer from the "Agent‑Centric Paradox": rules are hard‑coded into individual agents, making complex social dynamics rigid and difficult to align with educational values. This paper presents EDU‑MATRIX, a society‑centric generative cognitive digital twin architecture that shifts the paradigm from simulating "people" to simulating a "social space with a gravitational field." We introduce three architectural contributions: (1) An Environment Context Injection Engine (ECIE), which acts as a "social microkernel," dynamically injecting institutional rules (Gravity) into agents based on their spatial‑temporal coordinates; (2) A Modular Logic Evolution Protocol (MLEP), where knowledge exists as "fluid" capsules that agents synthesize to generate new paradigms, ensuring high dialogue consistency (94.1%); and (3) Endogenous Alignment via Role‑Topology, where safety constraints emerge from the agent's position in the social graph rather than external filters. Deployed as a digital twin of a secondary school with 2,400 agents, the system demonstrates how "social gravity" (rules) and "cognitive fluids" (knowledge) interact to produce emergent, value‑aligned behaviors (Social Clustering Coefficient: 0.72).

PaperID: 5432, https://arxiv.org/pdf/2602.18700.pdf

Authors: Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li, Zheng Liu, Zhou Yang, Chengkun Wei, Wenzhi Chen

Title: Watermarking LLM Agent Trajectories

Abstract:
LLM agents rely heavily on high‑quality trajectory data to guide their problem‑solving behaviors, yet producing such data requires substantial task design, high‑capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black‑box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen‑2.5‑Coder‑7B while incurring negligible performance degradation.

PaperID: 5433, https://arxiv.org/pdf/2602.18673.pdf

Authors: Harang Ju

Title: When Coordination Is Avoidable: A Monotonicity Analysis of Organizational Tasks

Abstract:
Organizations devote substantial resources to coordination, yet which tasks actually require it for correctness remains unclear. The problem is acute in multi‑agent AI systems, where coordination cost is directly measurable and can exceed the cost of the work itself. Distributed systems theory provides a precise criterion: coordination is required when a task specification is non‑monotonic, meaning that as histories grow, new information can invalidate prior conclusions. Here we show that Thompson's classic taxonomy of interdependence maps to that criterion, yielding a decision rule for when coordination is required for correctness. We formalize the correspondence in a bridge theorem, apply the rule to 65 APQC workflows and (with a calibrated LLM) 13,417 ONET tasks, and illustrate it in multi‑agent AI simulations. Under our decompositions, 74% of workflows and 42% of ONET tasks are monotonic, implying that up to 24‑57% of coordination spending is unnecessary for correctness.

PaperID: 5434, https://arxiv.org/pdf/2602.18663.pdf

Authors: Harry Robertshaw, Nikola Fischer, Lennart Karstensen, Benjamin Jackson, Xingyu Chen, S. M. Hadi Sadati, Christos Bergeles, Alejandro Granados, Thomas C Booth

Title: Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

Abstract:
Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We propose a Hierarchical Modular Multi‑Agent Reinforcement Learning (HM‑MARL) framework for autonomous two‑device navigation in vitro, enabling efficient and generalizable navigation. HM‑MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi‑agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor‑Critic RL. The framework was validated in both in silico and in vitro testbeds to assess generalization and real‑world feasibility. In silico, a single‑vasculature model achieved 92‑100% success rates on individual anatomies, while a multi‑vasculature model achieved 56‑80% across multiple patient anatomies. In vitro, both HM‑MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA but failed on the left‑side vessel superhuman challenge due to the anatomy and catheter type used in navigation. This study presents the first demonstration of in vitro autonomous navigation in MT vasculature. While HM‑MARL enables generalization across anatomies, the simulation‑to‑real transition introduces challenges. Future work will refine RL strategies using world models and validate performance on unseen in vitro data, advancing autonomous MT towards clinical translation.

PaperID: 5435, https://arxiv.org/pdf/2602.18650.pdf

Authors: Junwei Wu, Runze Yan, Hanqi Luo, Darren Liu, Minxiao Wang, Kimberly L. Townsend, Lydia S. Hartwig, Derek Milketinas, Xiao Hu, Carl Yang

Title: NutriOrion: A Hierarchical Multi-Agent Framework for Personalized Nutrition Intervention Grounded in Clinical Guidelines

Abstract:
Personalized nutrition intervention for patients with multimorbidity is critical for improving health outcomes, yet remains challenging because it requires the simultaneous integration of heterogeneous clinical conditions, medications, and dietary guidelines. Single‑agent large language models (LLMs) often suffer from context overload and attention dilution when processing such high‑dimensional patient profiles. We introduce NutriOrion, a hierarchical multi‑agent framework with a parallel‑then‑sequential reasoning topology. NutriOrion decomposes nutrition planning into specialized domain agents with isolated contexts to mitigate anchoring bias, followed by a conditional refinement stage. The framework includes a multi‑objective prioritization algorithm to resolve conflicting dietary requirements and a safety constraint mechanism that injects pharmacological contraindications as hard negative constraints during synthesis, ensuring clinical validity by construction rather than post‑hoc filtering. For clinical interoperability, NutriOrion maps synthesized insights into the ADIME standard and FHIR R4 resources. Evaluated on 330 stroke patients with multimorbidity, NutriOrion outperforms multiple baselines, including GPT‑4.1 and alternative multi‑agent architectures. It achieves a 12.1 percent drug‑food interaction violation rate, demonstrates strong personalization with negative correlations (‑0.26 to ‑0.35) between patient biomarkers and recommended risk nutrients, and yields clinically meaningful dietary improvements, including a 167 percent increase in fiber and a 27 percent increase in potassium, alongside reductions in sodium (9 percent) and sugars (12 percent).

PaperID: 5436, https://arxiv.org/pdf/2602.18541.pdf

Authors: Daniel Garcia

Title: LAPIS: Lightweight API Specification for Intelligent Systems

Abstract:
Large Language Models (LLMs) increasingly serve as consumers of API specifications, whether for code generation, autonomous agent interaction, or API‑assisted reasoning. The de facto standard for API description, OpenAPI, was designed for documentation tools and code generators, resulting in substantial token overhead when used as LLM context. We present LAPIS (Lightweight API Specification for Intelligent Systems), a domain‑specific format optimized for LLM consumption that preserves the semantic information necessary for API reasoning while minimizing token usage. Through empirical evaluation against five real‑world production API specifications including GitHub (1,080 endpoints), Twilio (197 endpoints), DigitalOcean (545 endpoints), Petstore, and HTTPBin we demonstrate an average token reduction of 85.5% compared to OpenAPI YAML and 88.6% compared to OpenAPI JSON, measured with the cl100k_base tokenizer. LAPIS introduces domain‑specific structural innovations, including centralized error definitions, webhook trigger conditions, structured rate limit descriptions, and operation flow declarations information that OpenAPI either duplicates redundantly or cannot represent at all. The format is fully convertible from OpenAPI 3.x via an automated converter, requires no special parser for LLM consumption, and is released as an open specification under CC BY 4.0.

PaperID: 5437, https://arxiv.org/pdf/2602.18479.pdf

Authors: Wei Yang, Zihao Liu, Tao Tan, Xiao Hu, Hong Xie, Lulu Li Xin Li, Jianyu Han, Defu Lian, Mao Ye

Title: AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature

Abstract:
This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers, %and supports natural language based interactive analysis of the extracted data. AgentCAT serves as an alternative to overcome the long‑standing data bottleneck in chemical engineering field, and its natural language based interactive data analysis functionality is friendly to the community. AgentCAT also presents a formal abstraction and challenge analysis of the catalytic reaction data extraction task in an artificial intelligence‑friendly manner. This abstraction would help the artificial intelligence community understand this problem and in turn would attract more attention to address it. Technically, the complex catalytic process leads to complicated dependency structure in catalytic reaction data with respect to elementary reaction steps, molecular behaviors, measurement evidence, etc. This dependency structure makes it challenging to guarantee the correctness and completeness of data extraction, as well as representing them for analysis. AgentCAT addresses this challenge and it makes four folds of technical contributions: (1) a schema‑governed extraction pipeline with progressive schema evolution, enabling robust data extraction from chemical engineering papers; (2) a dependency‑aware reaction‑network knowledge graph that links catalysts/active sites, synthesis‑derived descriptors, mechanistic claims with evidence, and macroscopic outcomes, preserving process coupling and traceability; (3) a general querying module that supports natural‑language exploration and visualization over the constructed graph for cross‑paper analysis; (4) an evaluation on ～800 peer‑reviewed chemical engineering publications demonstrating the effectiveness of AgentCAT.

PaperID: 5438, https://arxiv.org/pdf/2602.18464.pdf

Authors: Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das

Title: How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Abstract:
A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats. If correct, these simulations could offer a scalable way to forecast S&P risks in products prior to deployment. We interrogate this assumption using SP‑ABCBench, a new benchmark of 30 tests derived from validated S&P human‑subject studies, which measures alignment between simulations and human‑subjects studies on a 0‑100 ascending scale, where higher scores indicate better alignment across three dimensions: Attitude, Behavior, and Coherence. Evaluating twelve LLMs, four persona construction strategies, and two prompting methods, we found that there remains substantial room for improvement: all models score between 50 and 64 on average. Newer, bigger, and smarter models do not reliably do better and sometimes do worse. Some simulation configurations, however, do yield high alignment: e.g., with scores above 95 for some behavior tests when agents are prompted to apply bounded rationality and weigh privacy costs against perceived benefits. We release SP‑ABCBench to enable reproducible evaluation as methods improve.

PaperID: 5439, https://arxiv.org/pdf/2602.18451.pdf

Authors: Yaxuan Yang, Jongchan Park, Yifan Zhou, Xiaoming Zhai

Title: Developing a Multi-Agent System to Generate Next Generation Science Assessments with Evidence-Centered Design

Abstract:
Contemporary science education reforms such as the Next Generation Science Standards (NGSS) demand assessments to understand students' ability to use science knowledge to solve problems and design solutions. To elicit such higher‑order ability, educators need performance‑based assessments, which are challenging to develop. One solution that has been broadly adopted is Evidence‑Centered Design (ECD), which emphasizes interconnected models of the learner, evidence, and tasks. Although ECD provides a framework to safeguard assessment validity, its implementation requires diverse expertise (e.g., content and assessment), which is both costly and labor‑intensive. To address this challenge, this study proposed integrating the ECD framework into Multi‑Agent Systems (MAS) to generate NGSS‑aligned assessment items automatically. This integrated MAS system ensembles multiple large language models with varying expertise, enabling the automation of complex, multi‑stage item generation workflows traditionally performed by human experts. We examined the quality of AI‑generated NGSS‑aligned items and compared them with human‑developed items across multiple dimensions of assessment design. Results showed that AI‑generated items have overall comparable quality to human‑developed items in terms of alignment with NGSS three‑dimensional standards and cognitive demands. Divergent patterns also emerged: AI‑generated items demonstrated a distinct strength in inclusivity, while also exhibiting limitations in clarity, conciseness, and multimodal design. AI‑ and human‑developed items both showed weaknesses in evidence collectability and student interest alignment. These findings suggest that integrating ECD into MAS can support scalable and standards‑aligned assessment design, while human expertise remains essential.

PaperID: 5440, https://arxiv.org/pdf/2602.18450.pdf

Authors: Faruk Alpay, Bugra Kilictas

Title: Asymptotic Semantic Collapse in Hierarchical Optimization

Abstract:
Multi‑agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near‑uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient‑style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context‑bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information‑theoretic quantities with differential‑geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset‑free benchmark on an RWKV‑7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard‑to‑anchor similarity values of 0.295 and 0.224, respectively.

PaperID: 5441, https://arxiv.org/pdf/2602.18431.pdf

Authors: Shafkat Farabi, Didac Marti Pinto, Wei Lu, Manuel Ramos-Maqueda, Sanmay Das, Antoine Deeb, Anja Sautmann

Title: SMaRT: Online Reusable Resource Assignment and an Application to Mediation in the Kenyan Judiciary

Abstract:
Motivated by the problem of assigning mediators to cases in the Kenyan judicial, we study an online resource allocation problem where incoming tasks (cases) must be immediately assigned to available, capacity‑constrained resources (mediators). The resources differ in their quality, which may need to be learned. In addition, resources can only be assigned to a subset of tasks that overlaps to varying degrees with the subset of tasks other resources can be assigned to. The objective is to maximize task completion while satisfying soft capacity constraints across all the resources. The scale of the real‑world problem poses substantial challenges, since there are over 2000 mediators and a multitude of combinations of geographic locations (87) and case types (12) that each mediator is qualified to work on. Together, these features, unknown quality of new resources, soft capacity constraints, and a high‑dimensional state space, make existing scheduling and resource allocation algorithms either inapplicable or inefficient. We formalize the problem in a tractable manner using a quadratic program formulation for assignment and a multi‑agent bandit‑style framework for learning. We demonstrate the key properties and advantages of our new algorithm, SMaRT (Selecting Mediators that are Right for the Task), compared with baselines on stylized instances of the mediator allocation problem. We then consider its application to real‑world data on cases and mediators from the Kenyan judiciary. SMaRT outperforms baselines and allows control over the tradeoff between the strictness of capacity constraints and overall case resolution rates, both in settings where mediator quality is known beforehand and in bandit‑like settings where learning is part of the problem definition. On the strength of these results, we plan to run a randomized controlled trial with SMaRT in the judiciary in the near future.

PaperID: 5442, https://arxiv.org/pdf/2602.18291.pdf

Authors: Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

Title: Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Abstract:
Online Multi‑Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion‑based generative models are well‑positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under‑explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy‑based exploration and coordination. To tackle this challenge, we propose among the first \underlineOnline off‑policy \underlineMARL framework using \underlineDiffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy‑augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state‑of‑the‑art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.

PaperID: 5443, https://arxiv.org/pdf/2602.18026.pdf

Authors: Shan Yang

Title: Mean-Field Reinforcement Learning without Synchrony

Abstract:
Mean‑field reinforcement learning (MF‑RL) scales multi‑agent RL to large populations by reducing each agent's dependence on others to a single summary statistic ‑‑ the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic ‑‑ one that remains defined regardless of which agents act. The population distribution μ\in Δ(\mathcalO) ‑‑ the fraction of agents at each observation ‑‑ satisfies this requirement: its dimension is independent of N, and under exchangeability it fully determines each agent's reward and transition. Existing MF‑RL theory, however, is built on the mean action and does not extend to μ. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution μ from scratch, covering the full spectrum from fully synchronous to purely sequential decision‑making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an O(1/\sqrtN) finite‑population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF‑PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF‑PG achieves near‑identical performance whether one agent or all N act per step, with approximation error decaying at the predicted O(1/\sqrtN) rate.

PaperID: 5444, https://arxiv.org/pdf/2602.18020.pdf

Authors: Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Xiangnan Wu, Zichen Wen, Bowen Fang, Tao Yu, Zhengbo Zhang, Yingda Li, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

Title: UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Abstract:
Vision‑Language‑Action (VLA) models leverage pretrained Vision‑Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed‑Forward Network (FFN) in language models can act as "key‑value memory", we propose Uncertainty‑aware Observation Reinjection (UAOR), an effective, training‑free and plug‑and‑play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed‑Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real‑world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug‑in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.

PaperID: 5445, https://arxiv.org/pdf/2602.17990.pdf

Authors: Madhav Kanda, Pedro Las-Casas, Alok Gautam Kumbhare, Rodrigo Fonseca, Sharad Agarwal

Title: WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Abstract:
LLM‑based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity‑aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.

PaperID: 5446, https://arxiv.org/pdf/2602.17954.pdf

Authors: Mohammad Zangooei, Lou Salaün, Chung Shue Chen, Raouf Boutaba

Title: Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO

Abstract:
Cell‑free massive MIMO (CFmMIMO) systems require scalable and reliable distributed coordination mechanisms to operate under stringent communication and latency constraints. A central challenge is the Access Point Selection (APS) problem, which seeks to determine the subset of serving Access Points (APs) for each User Equipment (UE) that can satisfy UEs' Spectral Efficiency (SE) requirements while minimizing network power consumption. We introduce APS‑GNN, a scalable distributed multi‑agent learning framework that decomposes APS into agents operating at the granularity of individual AP‑UE connections. Agents coordinate via local observation exchange over a novel Graph Neural Network (GNN) architecture and share parameters to reuse their knowledge and experience. APS‑GNN adopts a constrained reinforcement learning approach to provide agents with explicit observability of APS' conflicting objectives, treating SE satisfaction as a cost and power reduction as a reward. Both signals are defined locally, facilitating effective credit assignment and scalable coordination in large networks. To further improve training stability and exploration efficiency, the policy is initialized via supervised imitation learning from a heuristic APS baseline. We develop a realistic CFmMIMO simulator and demonstrate that APS‑GNN delivers the target SE while activating 50‑70% fewer APs than heuristic and centralized Multi‑agent Reinforcement Learning (MARL) baselines in different evaluation scenarios. Moreover, APS‑GNN achieves one to two orders of magnitude lower inference latency than centralized MARL approaches due to its fully parallel and distributed execution. These results establish APS‑GNN as a practical and scalable solution for APS in large‑scale CFmMIMO networks.

PaperID: 5447, https://arxiv.org/pdf/2602.17910.pdf

Authors: Hanjing Shi, Dominic DiFranzo

Title: Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems

Abstract:
Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long‑horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect‑aware Peak‑End Modulation for Orchestration), a runtime scheduling layer that optimizes computational allocation under fixed budgets by operationalizing temporal‑affective signals. Instead of modifying model weights, APEMO detects trajectory instability through behavioral proxies and targets repairs at critical segments, such as peak moments and endings. Evaluation across multi‑agent simulations and LLM‑based planner‑‑executor flows demonstrates that APEMO consistently enhances trajectory‑level quality and reuse probability over structural orchestrators. Our results reframe alignment as a temporal control problem, offering a resilient engineering pathway for the development of long‑horizon agentic systems.

PaperID: 5448, https://arxiv.org/pdf/2602.17902.pdf

Authors: Jiaru Bai, Abdulrahman Aldossary, Thomas Swanick, Marcel Müller, Yeonghun Kang, Zijian Zhang, Jin Won Lee, Tsz Wai Ko, Mohammad Ghazi Vakili, Varinia Bernales, Alán Aspuru-Guzik

Title: El Agente Gráfico: Structured Execution Graphs for Scientific Agents

Abstract:
Large language models (LLMs) are increasingly used to automate scientific workflows, yet their integration with heterogeneous computational tools remains ad hoc and fragile. Current agentic approaches often rely on unstructured text to manage context and coordinate execution, generating often overwhelming volumes of information that may obscure decision provenance and hinder auditability. In this work, we present El Agente Gráfico, a single‑agent framework that embeds LLM‑driven decision‑making within a type‑safe execution environment and dynamic knowledge graphs for external persistence. Central to our approach is a structured abstraction of scientific concepts and an object‑graph mapper that represents computational state as typed Python objects, stored either in memory or persisted in an external knowledge graph. This design enables context management through typed symbolic identifiers rather than raw text, thereby ensuring consistency, supporting provenance tracking, and enabling efficient tool orchestration. We evaluate the system by developing an automated benchmarking framework across a suite of university‑level quantum chemistry tasks previously evaluated on a multi‑agent system, demonstrating that a single agent, when coupled to a reliable execution engine, can robustly perform complex, multi‑step, and parallel computations. We further extend this paradigm to two other large classes of applications: conformer ensemble generation and metal‑organic framework design, where knowledge graphs serve as both memory and reasoning substrates. Together, these results illustrate how abstraction and type safety can provide a scalable foundation for agentic scientific automation beyond prompt‑centric designs.

PaperID: 5449, https://arxiv.org/pdf/2602.17886.pdf

Authors: Sai Govind Hari Kumar, Yunheng Zou, Andrew Wang, Jesús Valdés-Hernández, Tsz Wai Ko, Nathan Yue, Olivia Leng, Hanyong Xu, Chris Crebolder, Alán Aspuru-Guzik, Varinia Bernales

Title: El Agente Sólido: A New Age(nt) for Solid State Simulations

Abstract:
Quantum chemistry calculations are a key component of the materials discovery process. The results from first‑principles explorations enable the prediction of material properties prior to experimental validation. Despite their impact, the practical use of first‑principles methods remains limited by the expertise required to design, execute, and troubleshoot complex computational workflows. Even when workflows are successfully built, they are sometimes rigid and not adaptable to different use cases. Recent advances in large language models (LLMs) and agentic systems offer a pathway to flexibly automate these processes and lower barriers to entry. Here, we introduce El Agente Sólido, a hierarchical multi‑agent framework for automating solid‑state quantum chemistry workflows using the open‑source Quantum ESPRESSO simulation package. The framework translates high‑level scientific objectives expressed in natural language into end‑to‑end computational pipelines that include structure generation, input file construction, workflow execution, and post‑processing analysis. El Agente Sólido integrates density functional theory with phonon calculations and machine‑learning interatomic potentials to enable efficient and physically consistent simulations. Extensive benchmarking and case studies demonstrate that El Agente Sólido reliably executes a wide range of solid‑state calculations, highlighting its potential to improve reproducibility and accelerate computational materials discovery

PaperID: 5450, https://arxiv.org/pdf/2602.17875.pdf

Authors: Shreshth Rajan

Title: MultiVer: Zero-Shot Multi-Agent Vulnerability Detection

Abstract:
We present MultiVer, a zero‑shot multi‑agent system for vulnerability detection that achieves state‑of‑the‑art recall without fine‑tuning. A four‑agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine‑tuned GPT‑3.5 (81.3%) by 1.4 percentage points ‑‑ the first zeroshot system to surpass fine‑tuned performance on this benchmark. On SecurityEval, the same architecture achieves 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine‑tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi‑agent ensemble adds 17 percentage points recall over single‑agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero‑shot multi‑agent ensembles can match and exceed fine‑tuned models on the metric that matters most.

PaperID: 5451, https://arxiv.org/pdf/2602.17864.pdf

Authors: Anirban Mukhopadhyay, Kevin Salubre, Hifza Javed, Shashank Mehrotra, Kumar Akash

Title: Exploring The Impact Of Proactive Generative AI Agent Roles In Time-Sensitive Collaborative Problem-Solving Tasks

Abstract:
Collaborative problem‑solving under time pressure is common but difficult, as teams must generate ideas quickly, coordinate actions, and track progress. Generative AI offers new opportunities to assist, but we know little about how proactive agents affect the dynamics of real‑time, co‑located teamwork. We studied two forms of proactive support in digital escape rooms: a facilitator agent that offered summaries and group structures, and a peer agent that proposed ideas and answered queries. In a within‑subjects study with 24 participants, we compared group performance and processes across three conditions: no AI, peer, and facilitator. Results show that the peer agent occasionally enhanced problem‑solving by offering timely hints and memory support; however, it also disrupted flow, increased workload, and created over‑reliance. In comparison, the facilitator agent provided light scaffolding but had a limited impact on outcomes. We provide design considerations for proactive generative AI agents based on our findings.

PaperID: 5452, https://arxiv.org/pdf/2602.17753.pdf

Authors: Leon Staufer, Kevin Feng, Kevin Wei, Luke Bailey, Yawen Duan, Mick Yang, A. Pinar Ozisik, Stephen Casper, Noam Kolt

Title: The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

Abstract:
Agentic AI systems are increasingly capable of performing professional and personal tasks with limited human involvement. However, tracking these developments is difficult because the AI agent ecosystem is complex, rapidly evolving, and inconsistently documented, posing obstacles to both researchers and policymakers. To address these challenges, this paper presents the 2025 AI Agent Index. The Index documents information regarding the origins, design, capabilities, ecosystem, and safety features of 30 state‑of‑the‑art AI agents based on publicly available information and email correspondence with developers. In addition to documenting information about individual agents, the Index illuminates broader trends in the development of agents, their capabilities, and the level of transparency of developers. Notably, we find different transparency levels among agent developers and observe that most developers share little information about safety, evaluations, and societal impacts. The 2025 AI Agent Index is available online at https://aiagentindex.mit.edu

PaperID: 5453, https://arxiv.org/pdf/2602.17738.pdf

Authors: Hyowoon Seo, Joonho Seon, Jin Young Kim, Mehdi Bennis, Wan Choi, Dong In Kim

Title: Reasoning-Native Agentic Communication for 6G

Abstract:
Future 6G networks will interconnect not only devices, but autonomous machines that continuously sense, reason, and act. In such environments, communication can no longer be understood solely as delivering bits or even preserving semantic meaning. Even when two agents interpret the same information correctly, they may still behave inconsistently if their internal reasoning processes evolve differently. We refer to this emerging challenge as belief divergence. This article introduces reasoning native agentic communication, a new paradigm in which communication is explicitly designed to address belief divergence rather than merely transmitting representations. Instead of triggering transmissions based only on channel conditions or data relevance, the proposed framework activates communication according to predicted misalignment in agents internal belief states. We present a reasoning native architecture that augments the conventional communication stack with a coordination plane grounded in a shared knowledge structure and bounded belief modeling. Through enabling mechanisms and representative multi agent scenarios, we illustrate how such an approach can prevent coordination drift and maintain coherent behavior across heterogeneous systems. By reframing communication as a regulator of distributed reasoning, reasoning native agentic communication enables 6G networks to act as an active harmonizer of autonomous intelligence.

PaperID: 5454, https://arxiv.org/pdf/2602.17737.pdf

Authors: Upasana Biswas, Durgesh Kalwar, Subbarao Kambhampati, Sarath Sreedharan

Title: Nested Training for Mutual Adaptation in Human-AI Teaming

Abstract:
Mutual adaptation is a central challenge in human‑‑AI teaming, as humans naturally adjust their strategies in response to a robot's policy. Existing approaches aim to improve diversity in training partners to approximate human behavior, but these partners are static and fail to capture adaptive behavior of humans. Exposing robots to adaptive behaviors is critical, yet when both agents learn simultaneously in a multi‑agent setting, they often converge to opaque implicit coordination strategies that only work with the agents they were co‑trained with. Such agents fail to generalize when paired with new partners. In order to capture the adaptive behavior of humans, we model the human‑robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I‑POMDP), explicitly modeling human adaptation as part of the state. We propose a nested training regime to approximately learn the solution to a finite‑level I‑POMDP. In this framework, agents at each level are trained against adaptive agents from the level below. This ensures that the ego agent is exposed to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our method in a multi‑episode, required cooperation setup in the Overcooked domain, comparing it against several baseline agents designed for human‑robot teaming. We evaluate the performance of our agent when paired with adaptive partners that were not seen during training. Our results demonstrate that our agent not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.

PaperID: 5455, https://arxiv.org/pdf/2602.17734.pdf

Authors: Raja Soundaramourty, Ozkan Kilic, Ramu Chenchaiah

Title: Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects

Abstract:
Agile estimation techniques, particularly T‑shirt sizing, are widely used in software development for their simplicity and utility in scoping work. However, when we apply these methods to artificial intelligence initiatives ‑‑ especially those involving large language models (LLMs) and multi‑agent systems ‑‑ the results can be systematically misleading. This paper shares an evidence‑backed analysis of five foundational assumptions we often make during T‑shirt sizing. While these assumptions usually hold true for traditional software, they tend to fail in AI contexts: (1) linear effort scaling, (2) repeatability from prior experience, (3) effort‑duration fungibility, (4) task decomposability, and (5) deterministic completion criteria. Drawing on recent research into multi‑agent system failures, scaling principles, and the inherent unreliability of multi‑turn conversations, we show how AI development breaks these rules. We see this through non‑linear performance jumps, complex interaction surfaces, and "tight coupling" where a small change in data cascades through the entire stack. To help teams navigate this, we propose Checkpoint Sizing: a more human‑centric, iterative approach that uses explicit decision gates where scope and feasibility are reassessed based on what we learn during development, rather than what we assumed at the start. This paper is intended for engineering managers, technical leads, and product owners responsible for planning and delivering AI initiatives.

PaperID: 5456, https://arxiv.org/pdf/2602.17692.pdf

Authors: Bin Wang, Fan Wang, Pingping Wang, Jinyu Cong, Yang Yu, Yilong Yin, Zhongyi Han, Benzheng Wei

Title: Agentic Unlearning: When LLM Agent Meets Machine Unlearning

Abstract:
In this paper, we introduce agentic unlearning which removes specified information from both model parameters and persistent memory in agents with closed‑loop interaction. Existing unlearning methods target parameters alone, leaving two critical gaps: (i) parameter‑memory backflow, where retrieval reactivates parametric remnants or memory artifacts reintroduce sensitive content, and (ii) the absence of a unified strategy that covers both parameter and memory pathways. We present Synchronized Backflow Unlearning (SBU), a framework that unlearns jointly across parameter and memory pathways. The memory pathway performs dependency closure‑based unlearning that prunes isolated entities while logically invalidating shared artifacts. The parameter pathway employs stochastic reference alignment to guide model outputs toward a high‑entropy prior. These pathways are integrated via a synchronized dual‑update protocol, forming a closed‑loop mechanism where memory unlearning and parametric suppression reinforce each other to prevent cross‑pathway recontamination. Experiments on medical QA benchmarks show that SBU reduces traces of targeted private information across both pathways with limited degradation on retained data.

PaperID: 5457, https://arxiv.org/pdf/2602.17622.pdf

Authors: Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, Tianwei Zhang

Title: What Makes a Good LLM Agent for Real-world Penetration Testing?

Abstract:
LLM‑based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM‑based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real‑time task difficulty estimation. As a result, agents misallocate effort, over‑commit to low‑value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty‑aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval‑augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability through four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and uses these estimates to guide exploration‑exploitation decisions within an Evidence‑Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (39 to 49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty‑aware planning yields consistent end‑to‑end gains across models and addresses a limitation that model scaling alone does not eliminate.

PaperID: 5458, https://arxiv.org/pdf/2602.17607.pdf

Authors: Jianda Du, Youran Sun, Haizhao Yang

Title: AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

Abstract:
PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network‑based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce \textttAutoNumerics, a multi‑agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black‑box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse‑to‑fine execution strategy and a residual‑based self‑verification mechanism. Experiments on 24 canonical and real‑world PDE problems demonstrate that \textttAutoNumerics achieves competitive or superior accuracy compared to existing neural and LLM‑based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.

PaperID: 5459, https://arxiv.org/pdf/2602.17588.pdf

Authors: Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham

Title: Modeling Distinct Human Interaction in Web Agents

Abstract:
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real‑user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents ‑‑ hands‑off supervision, hands‑on oversight, collaborative task‑solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4‑63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention‑aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user‑rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

PaperID: 5460, https://arxiv.org/pdf/2602.17547.pdf

Authors: Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

Title: KLong: Training LLM Agent for Extremely Long-horizon Tasks

Abstract:
This paper introduces KLong, an open‑source LLM agent trained to solve extremely long‑horizon tasks. The principle is to first cold‑start the model via trajectory‑splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research‑Factory, an automated pipeline that generates high‑quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long‑horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory‑splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub‑trajectories. In addition, to further improve long‑horizon task‑solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE‑bench Verified and MLE‑bench.

PaperID: 5461, https://arxiv.org/pdf/2602.17544.pdf

Authors: Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar

Title: Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Abstract:
In multi‑agent IR pipelines for tasks such as search and ranking, LLM‑based agents exchange intermediate reasoning in terms of Chain‑of‑Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker‑Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy‑based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general‑purpose LLMs like Llama and Gemma.

PaperID: 5462, https://arxiv.org/pdf/2602.17486.pdf

Authors: Yuma Fujimoto, Kenshi Abe, Kaito Ariu

Title: Linear Convergence in Games with Delayed Feedback via Extra Prediction

Abstract:
Feedback delays are inevitable in real‑world multi‑agent learning. They are known to severely degrade performance, and the convergence rate under delayed feedback is still unclear, even for bilinear games. This paper derives the rate of linear convergence of Weighted Optimistic Gradient Descent‑Ascent (WOGDA), which predicts future rewards with extra optimism, in unconstrained bilinear games. To analyze the algorithm, we interpret it as an approximation of the Extra Proximal Point (EPP), which is updated based on farther future rewards than the classical Proximal Point (PP). Our theorems show that standard optimism (predicting the next‑step reward) achieves linear convergence to the equilibrium at a rate \exp(‑Θ(t/m^5)) after t iterations for delay m. Moreover, employing extra optimism (predicting farther future reward) tolerates a larger step size and significantly accelerates the rate to \exp(‑Θ(t/(m^2\log m))). Our experiments also show accelerated convergence driven by the extra optimism and are qualitatively consistent with our theorems. In summary, this paper validates that extra optimism is a promising countermeasure against performance degradation caused by feedback delays.

PaperID: 5463, https://arxiv.org/pdf/2602.17434.pdf

Authors: Eleftherios E. Vlahakis, Arash Bahari Kordabad, Lars Lindemann, Pantelis Sopasakis, Sadegh Soudjani, Dimos V. Dimarogonas

Title: Multi-Agent Temporal Logic Planning via Penalty Functions and Block-Coordinate Optimization

Abstract:
Multi‑agent planning under Signal Temporal Logic (STL) is often hindered by collaborative tasks that lead to computational challenges due to the inherent high‑dimensionality of the problem, preventing scalable synthesis with satisfaction guarantees. To address this, we formulate STL planning as an optimization program under arbitrary multi‑agent constraints and introduce a penalty‑based unconstrained relaxation that can be efficiently solved via a Block‑Coordinate Gradient Descent (BCGD) method, where each block corresponds to a single agent's decision variables, thereby mitigating complexity. By utilizing a quadratic penalty function defined via smooth STL semantics, we show that BCGD iterations converge to a stationary point of the penalized problem under standard regularity assumptions. To enforce feasibility, the BCGD solver is embedded within a two‑layer optimization scheme: inner BCGD updates are performed for a fixed penalty parameter, which is then increased in an outer loop to progressively improve multi‑agent STL robustness. The proposed framework enables scalable computations and is validated through various complex multi‑robot planning scenarios.

PaperID: 5464, https://arxiv.org/pdf/2602.17345.pdf

Authors: Boyang Ma, Hechuan Guo, Peizhuo Lv, Minghui Xu, Xuelong Dai, YeChao Zhang, Yijun Yang, Yue Zhang

Title: What Breaks Embodied AI Security:LLM Vulnerabilities, CPS Flaws,or Something Else?

Abstract:
Embodied AI systems (e.g., autonomous vehicles, service robots, and LLM‑driven interactive agents) are rapidly transitioning from controlled environments to safety critical real‑world deployments. Unlike disembodied AI, failures in embodied intelligence lead to irreversible physical consequences, raising fundamental questions about security, safety, and reliability. While existing research predominantly analyzes embodied AI through the lenses of Large Language Model (LLM) vulnerabilities or classical Cyber‑Physical System (CPS) failures, this survey argues that these perspectives are individually insufficient to explain many observed breakdowns in modern embodied systems. We posit that a significant class of failures arises from embodiment‑induced system‑level mismatches, rather than from isolated model flaws or traditional CPS attacks. Specifically, we identify four core insights that explain why embodied AI is fundamentally harder to secure: (i) semantic correctness does not imply physical safety, as language‑level reasoning abstracts away geometry, dynamics, and contact constraints; (ii) identical actions can lead to drastically different outcomes across physical states due to nonlinear dynamics and state uncertainty; (iii) small errors propagate and amplify across tightly coupled perception‑decision‑action loops; and (iv) safety is not compositional across time or system layers, enabling locally safe decisions to accumulate into globally unsafe behavior. These insights suggest that securing embodied AI requires moving beyond component‑level defenses toward system‑level reasoning about physical risk, uncertainty, and failure propagation.

PaperID: 5465, https://arxiv.org/pdf/2602.17308.pdf

Authors: Hui Min Wong, Philip Heesen, Pascal Janetzky, Martin Bendszus, Stefan Feuerriegel

Title: MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions

Abstract:
Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presentation alone. Rather, reaching a diagnosis often involves systematic history taking, during which clinicians reason over multiple potential conditions through iterative questioning to resolve uncertainty. This process requires considering differential diagnoses and actively excluding emergencies that demand immediate intervention. Yet, the ability of medical LLMs to generate informative follow‑up questions and thus reason over differential diagnoses remains underexplored. Here, we introduce MedClarify, an AI agent for information‑seeking that can generate follow‑up questions for iterative reasoning to support diagnostic decision‑making. Specifically, MedClarify computes a list of candidate diagnoses analogous to a differential diagnosis, and then proactively generates follow‑up questions aimed at reducing diagnostic uncertainty. By selecting the question with the highest expected information gain, MedClarify enables targeted, uncertainty‑aware reasoning to improve diagnostic performance. In our experiments, we first demonstrate the limitations of current LLMs in medical reasoning, which often yield multiple, similarly likely diagnoses, especially when patient cases are incomplete or relevant information for diagnosis is missing. We then show that our information‑theoretic reasoning approach can generate effective follow‑up questioning and thereby reduces diagnostic errors by ~27 percentage points (p.p.) compared to a standard single‑shot LLM baseline. Altogether, MedClarify offers a path to improve medical LLMs through agentic information‑seeking and to thus promote effective dialogues with medical LLMs that reflect the iterative and uncertain nature of real‑world clinical reasoning.

PaperID: 5466, https://arxiv.org/pdf/2602.17271.pdf

Authors: Giuseppe Di Poce, Mario Edoardo Pandolfo, Emilio Calvanese Strinati, Paolo Di Lorenzo

Title: Federated Latent Space Alignment for Multi-user Semantic Communications

Abstract:
Semantic communication aims to convey meaning for effective task execution, but differing latent representations in AI‑native devices can cause semantic mismatches that hinder mutual understanding. This paper introduces a novel approach to mitigating latent space misalignment in multi‑agent AI‑ native semantic communications. In a downlink scenario, we consider an access point (AP) communicating with multiple users to accomplish a specific AI‑driven task. Our method implements a protocol that shares a semantic pre‑equalizer at the AP and local semantic equalizers at user devices, fostering mutual understanding and task‑oriented communication while considering power and complexity constraints. To achieve this, we employ a federated optimization for the decentralized training of the semantic equalizers at the AP and user sides. Numerical results validate the proposed approach in goal‑oriented semantic communication, revealing key trade‑offs among accuracy, com‑ munication overhead, complexity, and the semantic proximity of AI‑native communication devices.

PaperID: 5467, https://arxiv.org/pdf/2602.17221.pdf

Authors: Yi-Chih Huang

Title: From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

Abstract:
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study proposes an AI Agent‑based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan's Claude.ai usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework ‑ a seven‑stage modular workflow grounded in three principles: task modularization, human‑AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data ‑ serving as an operational demonstration of the workflow's application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human‑AI collaboration ‑ direct execution, iterative refinement, and human‑led ‑ through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single‑platform data, cross‑sectional design, and AI reliability risks are acknowledged.

PaperID: 5468, https://arxiv.org/pdf/2602.17127.pdf

Authors: Dusan Bosnjakovic

Title: The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Abstract:
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi‑agent systems and recursive evaluation loops (LLM‑as‑a‑judge), the detection of durable, provider‑level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies ‑‑ the ``prevailing mindsets'' embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory ‑‑ specifically latent trait estimation under ordinal uncertainty ‑‑ to quantify these tendencies without relying on ground‑truth labels. Utilizing forced‑choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation‑invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status‑Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item‑level framing drives high variance, a persistent ``lab signal'' accounts for significant behavioral clustering. These findings demonstrate that in ``locked‑in'' provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi‑layered AI architectures.

PaperID: 5469, https://arxiv.org/pdf/2602.17100.pdf

Authors: Siyu Wang, Ruotian Lu, Zhihao Yang, Yuchao Wang, Yanzhou Zhang, Lei Xu, Qimin Xu, Guojun Yin, Cailian Chen, Xinping Guan

Title: AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation

Abstract:
Large language model(LLM)‑driven multi‑agent systems(MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition‑level code generation. Recent studies demonstrate that carefully designed multi‑agent workflows and communication graphs can significantly improve code generation performance by leveraging collaborative reasoning. However, existing methods neither adapt topology density to task difficulty nor iteratively refine the topology within an instance using execution feedback, which leads to redundant communication and performance bottlenecks. To address these issues, we propose AgentConductor: a reinforcement learning‑optimized MAS with an LLM‑based orchestrator agent as its core, which enables end‑to‑end feedback‑driven dynamic generation of interaction topologies. For each query, AgentConductor infers agent roles and task difficulty, then constructs a task‑adapted, density‑aware layered directed acyclic graph (DAG) topology, underpinned by two key innovations. First, we design a novel topological density function that captures communication‑aware mathematical characterizations of multi‑agent interactions. Second, we adopt difficulty interval partitioning to avoid excessive pruning for precise topological density upper bound measurement per difficulty level and finer‑grained control. Empirically, across three competition‑level and two foundational code datasets, AgentConductor achieves state‑of‑the‑art accuracy, outperforming the strongest baseline by up to 14.6% in pass@1 accuracy, 13% in density reduction, and 68% in token cost reduction.

PaperID: 5470, https://arxiv.org/pdf/2602.17078.pdf

Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Husheng Li, Ahmed H. Qureshi

Title: Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

Abstract:
Multi‑agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete‑time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill‑suited for complex multi‑agent dynamics, particularly in high‑frequency or irregular time‑interval settings, leading to degraded performance and motivating the development of continuous‑time MARL (CT‑MARL). Existing CT‑MARL methods are mainly built on Hamilton‑Jacobi‑Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB‑based learning difficult. To address this challenge, we propose a continuous‑time constrained MDP (CT‑CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT‑CMDPs via an epigraph‑based reformulation. We then solve this by proposing a novel physics‑informed neural network (PINN)‑based actor‑critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous‑time safe multi‑particle environments (MPE) and safe multi‑agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.

PaperID: 5471, https://arxiv.org/pdf/2602.17068.pdf

Authors: Xiaocai Zhang, Neema Nassir, Milad Haghani

Title: Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control

Abstract:
Human‑centric traffic signal control in corridor networks must increasingly account for multimodal travelers, particularly high‑occupancy public transportation, rather than focusing solely on vehicle‑centric performance. This paper proposes STDSH‑MARL (Spatio‑Temporal Dual‑Stage Hypergraph based Multi‑Agent Reinforcement Learning), a scalable multi‑agent deep reinforcement learning framework that follows a centralized training and decentralized execution paradigm. The proposed method captures spatio‑temporal dependencies through a novel dual‑stage hypergraph attention mechanism that models interactions across both spatial and temporal hyperedges. In addition, a hybrid discrete action space is introduced to jointly determine the next signal phase configuration and its corresponding green duration, enabling more adaptive signal timing decisions. Experiments conducted on a corridor network under five traffic scenarios demonstrate that STDSH‑MARL consistently improves multimodal performance and provides clear benefits for public transportation priority. Compared with state‑of‑the‑art baseline methods, the proposed approach achieves superior overall performance. Further ablation studies confirm the contribution of each component of STDSH‑MARL, with temporal hyperedges identified as the most influential factor driving the observed performance gains.

PaperID: 5472, https://arxiv.org/pdf/2602.17067.pdf

Authors: Leixian Shen, Yan Luo, Rui Sheng, Yujia He, Haotian Li, Leni Yang, Huamin Qu

Title: StoryLensEdu: Personalized Learning Report Generation through Narrative-Driven Multi-Agent Systems

Abstract:
Personalized feedback plays an important role in self‑regulated learning (SRL), helping students track progress and refine their strategies. However, current common solutions, such as text‑based reports or learning analytics dashboards, often suffer from poor interpretability, monotonous presentation, and limited explainability. To overcome these challenges, we present StoryLensEdu, a narrative‑driven multi‑agent system that automatically generates intuitive, engaging, and interactive learning reports. StoryLensEdu integrates three agents: a Data Analyst that extracts data insights based on a learning objective centered structure, a Teacher that ensures educational relevance and offers actionable suggestions, and a Storyteller that organizes these insights using the Heroes Journey narrative framework. StoryLensEdu supports post‑generation interactive question answering to improve explainability and user engagement. We conducted a formative study in a real high school and iteratively developed StoryLensEdu in collaboration with an e‑learning team to inform our design. Evaluation with real users shows that StoryLensEdu enhances engagement and promotes a deeper understanding of the learning process.

PaperID: 5473, https://arxiv.org/pdf/2602.17049.pdf

Authors: Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim

Title: IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

Abstract:
Computer‑use agents operate over long horizons under noisy perception, multi‑window contexts, evolving environment states. Existing approaches, from RL‑based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi‑agent computer‑use framework designed to stabilize long‑horizon execution through intent‑aligned plan memory. A Planner, Plan‑Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi‑view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup‑aligned skills and inject them into partial plans, reducing redundant re‑planning and mitigating error propagation across desktop applications. In end‑to‑end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL‑based and trajectory‑centric baselines. Ablations show that multi‑view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi‑agent loop providing the largest gains on long‑horizon tasks. These results highlight that system‑level intent abstraction and memory‑grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.

PaperID: 5474, https://arxiv.org/pdf/2602.17046.pdf

Authors: Uria Franko

Title: Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs

Abstract:
Large Language Model (LLM) agents often run for many steps while re‑ingesting long system instructions and large tool catalogs each turn. This increases cost, agent derailment probability, latency, and tool‑selection errors. We propose Instruction‑Tool Retrieval (ITR), a RAG variant that retrieves, per step, only the minimal system‑prompt fragments and the smallest necessary subset of tools. ITR composes a dynamic runtime system prompt and exposes a narrowed toolset with confidence‑gated fallbacks. Using a controlled benchmark with internally consistent numbers, ITR reduces per‑step context tokens by 95%, improves correct tool routing by 32% relative, and cuts end‑to‑end episode cost by 70% versus a monolithic baseline. These savings enable agents to run 2‑20x more loops within context limits. Savings compound with the number of agent steps, making ITR particularly valuable for long‑running autonomous agents. We detail the method, evaluation protocol, ablations, and operational guidance for practical deployment.

PaperID: 5475, https://arxiv.org/pdf/2602.17038.pdf

Authors: Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

Title: Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Abstract:
Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a \emphsingle policy network, causing \emphsimplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture‑of‑Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token‑level routing, where the router assigns each token to specialized experts, which fragments phase‑consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase‑Aware Mixture of Experts (PA‑MoE). It first features a lightweight \emphphase router that learns latent phase boundaries directly from the RL objective without pre‑defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase‑specific expertise. Experimental results demonstrate the effectiveness of our proposed PA‑MoE.

PaperID: 5476, https://arxiv.org/pdf/2602.17009.pdf

Authors: Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Title: Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

Abstract:
Coordinating actions is the most fundamental form of cooperation in multi‑agent reinforcement learning (MARL). Successful decentralized decision‑making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we propose Action Graph Policies (AGP), that model dependencies among agents' available action choices. It constructs, what we call, coordination contexts, that enable agents to condition their decisions on global action dependencies. Theoretically, we show that AGPs induce a strictly more expressive joint policy compared to fully independent policies and can realize coordinated joint actions that are provably more optimal than greedy execution even from centralized value‑decomposition methods. Empirically, we show that AGP achieves 80‑95% success on canonical coordination tasks with partial observability and anti‑coordination penalties, where other MARL methods reach only 10‑25%. We further demonstrate that AGP consistently outperforms these baselines in diverse multi‑agent environments.

PaperID: 5477, https://arxiv.org/pdf/2602.16998.pdf

Authors: Arwa Alanqary, Zakaria Baba, Manxi Wu, Alexandre M. Bayen

Title: Learning to Recommend in Unknown Games

Abstract:
We study preference learning through recommendations in multi‑agent game settings, where a moderator repeatedly interacts with agents whose utility functions are unknown. In each round, the moderator issues action recommendations and observes whether agents follow or deviate from them. We consider two canonical behavioral feedback models‑best response and quantal response‑and study how the information revealed by each model affects the learnability of agents' utilities. We show that under quantal‑response feedback the game is learnable, up to a positive affine equivalence class, with logarithmic sample complexity in the desired precision, whereas best‑response feedback can only identify a larger set of agents' utilities. We give a complete geometric characterization of this set. Moreover, we introduce a regret notion based on agents' incentives to deviate from recommendations and design an online algorithm with low regret under both feedback models, with bounds scaling linearly in the game dimension and logarithmically in time. Our results lay a theoretical foundation for AI recommendation systems in strategic multi‑agent environments, where recommendation compliances are shaped by strategic interaction.

PaperID: 5478, https://arxiv.org/pdf/2602.16966.pdf

Authors: Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Title: A Unified Framework for Locality in Scalable MARL

Abstract:
Scalable Multi‑Agent Reinforcement Learning (MARL) is fundamentally challenged by the curse of dimensionality. A common solution is to exploit locality, which hinges on an Exponential Decay Property (EDP) of the value function. However, existing conditions that guarantee the EDP are often conservative, as they are based on worst‑case, environment‑only bounds (e.g., supremums over actions) and fail to capture the regularizing effect of the policy itself. In this work, we establish that locality can also be a \emphpolicy‑dependent phenomenon. Our central contribution is a novel decomposition of the policy‑induced interdependence matrix, H^π, which decouples the environment's sensitivity to state (E^\mathrms) and action (E^\mathrma) from the policy's sensitivity to state (Π(π)). This decomposition reveals that locality can be induced by a smooth policy (small Π(π)) even when the environment is strongly action‑coupled, exposing a fundamental locality‑optimality tradeoff. We use this framework to derive a general spectral condition ρ(E^\mathrms+E^\mathrmaΠ(π)) < 1 for exponential decay, which is strictly tighter than prior norm‑based conditions. Finally, we leverage this theory to analyze a provably‑sound localized block‑coordinate policy improvement framework with guarantees tied directly to this spectral radius.

PaperID: 5479, https://arxiv.org/pdf/2602.16965.pdf

Authors: Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Title: Multi-Agent Lipschitz Bandits

Abstract:
We study the decentralized multi‑player stochastic bandit problem over a continuous, Lipschitz‑structured action space where hard collisions yield zero reward. Our objective is to design a communication‑free policy that maximizes collective reward, with coordination costs that are independent of the time horizon T. We propose a modular protocol that first solves the multi‑agent coordination problem ‑‑ identifying and seating players on distinct high‑value regions via a novel maxima‑directed search ‑‑ and then decouples the problem into N independent single‑player Lipschitz bandits. We establish a near‑optimal regret bound of \tildeO(T^(d+1)/(d+2)) plus a T‑independent coordination cost, matching the single‑player rate. To our knowledge, this is the first framework providing such guarantees, and it extends to general distance‑threshold collision models.

PaperID: 5480, https://arxiv.org/pdf/2602.16958.pdf

Authors: Xinhao Deng, Jiaqing Wu, Miao Chen, Yue Xiao, Ke Xu, Qi Li

Title: Automating Agent Hijacking via Structural Template Injection

Abstract:
Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics‑driven prompt manipulation, which often yields low attack success rates and limited transferability to closed‑source commercial models. In this paper, we propose Phantom, an automated agent hijacking framework built upon Structured Template Injection that targets the fundamental architectural mechanisms of LLM agents. Our key insight is that agents rely on specific chat template tokens to separate system, user, assistant, and tool instructions. By injecting optimized structured templates into the retrieved context, we induce role confusion and cause the agent to misinterpret the injected content as legitimate user instructions or prior tool outputs. To enhance attack transferability against black‑box agents, Phantom introduces a novel attack template search framework. We first perform multi‑level template augmentation to increase structural diversity and then train a Template Autoencoder (TAE) to embed discrete templates into a continuous, searchable latent space. Subsequently, we apply Bayesian optimization to efficiently identify optimal adversarial vectors that are decoded into high‑potency structured templates. Extensive experiments on Qwen, GPT, and Gemini demonstrate that our framework significantly outperforms existing baselines in both Attack Success Rate (ASR) and query efficiency. Moreover, we identified over 70 vulnerabilities in real‑world commercial products that have been confirmed by vendors, underscoring the practical severity of structured template‑based hijacking and providing an empirical foundation for securing next‑generation agentic systems.

PaperID: 5481, https://arxiv.org/pdf/2602.16953.pdf

Authors: Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

Title: LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Abstract:
Execution‑aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High‑coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non‑differentiable execution signals. We propose LLM4Cov, an offline agent‑learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution‑validated data curation, policy‑aware agentic data synthesis, and worst‑state‑prioritized sampling to enable scalable learning under execution constraints. We further curate a reality‑aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B‑parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.

PaperID: 5482, https://arxiv.org/pdf/2602.16928.pdf

Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot

Title: Discovering Multiagent Learning Algorithms with Large Language Models

Abstract:
Much of the advancement in Multi‑Agent Reinforcement Learning (MARL) for imperfect‑information games has historically depended on the manual, iterative refinement of algorithmic baselines. Recently, evolutionary coding agents powered by Large Language Models (LLMs) have emerged as powerful tools to automate this discovery process. In this work, we deploy one of such agentic frameworks, AlphaEvolve, to navigate the design spaces of two distinct game‑theoretic paradigms: counterfactual regret minimization (CFR) and policy‑space response oracles (PSRO). This automated search yielded two algorithms: Volatility‑Adaptive Discounted (VAD‑) CFR and Smoothed Hybrid Optimistic Regret (SHOR‑) PSRO, which are consistently competitive with state‑of‑the‑art human‑designed baselines across an 18‑game evaluation suite spanning Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants. However, because the LLM optimizes for fitness on a specific training set, it often constructs highly synergistic, complex mechanisms tailored to those environments. Through systematic ablation studies, we demonstrate that while these mechanisms are tightly coupled, the true driver of generalization lies in a minimal algorithmic core. By distilling the LLM's discoveries down to their most fundamental principles, we produce two minimal solvers: Warm‑started Optimistic Predictive (WOP‑)CFR and Projection Matching (PM‑)PSRO. These distilled versions achieve superior performance on generalization with greatly reduced structural complexity, providing a clear methodology for using LLMs in algorithmic discovery.

PaperID: 5483, https://arxiv.org/pdf/2602.16873.pdf

Authors: Geunbin Yu

Title: AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence

Abstract:
As large language models from diverse providers converge toward comparable benchmark performance, the traditional paradigm of selecting a single best model per task yields diminishing returns. We argue that orchestration topology ‑‑ the structural composition of how multiple agents are coordinated, parallelized, and synthesized ‑‑ now dominates system‑level performance over individual model capability. We present AdaptOrch, a formal framework for task‑adaptive multi‑agent orchestration that dynamically selects among four canonical topologies (parallel, sequential, hierarchical, and hybrid) based on task dependency graphs and empirically derived domain characteristics. Our framework introduces three key contributions: (1) a Performance Convergence Scaling Law, formalizing conditions under which orchestration selection outweighs model selection; (2) a Topology Routing Algorithm that maps task decomposition DAGs to optimal orchestration patterns in O(|V| + |E|) time; and (3) an Adaptive Synthesis Protocol with provable termination guarantees and heuristic consistency scoring for parallel agent outputs. We validate AdaptOrch across coding (SWE‑bench), reasoning (GPQA), and retrieval‑augmented generation tasks, demonstrating that topology‑aware orchestration achieves 12‑23% improvement over static single‑topology baselines, even when using identical underlying models. Our results establish orchestration design as a first‑class optimization target independent of model scaling.

PaperID: 5484, https://arxiv.org/pdf/2602.16758.pdf

Authors: Sina Akhbari, Mehran Mahboubkhah

Title: Smooth trajectory generation and hybrid B-splines-Quaternions based tool path interpolation for a 3T1R parallel kinematic milling robot

Abstract:
This paper presents a smooth trajectory generation method for a four‑degree‑of‑freedom parallel kinematic milling robot. The proposed approach integrates B‑spline and Quaternion interpolation techniques to manage decoupled position and orientation data points. The synchronization of orientation and arc‑length‑parameterized position data is achieved through the fitting of smooth piece‑wise Bezier curves, which describe the non‑linear relationship between path length and tool orientation, solved via sequential quadratic programming. By leveraging the convex hull properties of Bezier curves, the method ensures spatial and temporal separation constraints for multi‑agent trajectory generation. Unit quaternions are employed for orientation interpolation, providing a robust and efficient representation that avoids gimbal lock and facilitates smooth, continuous rotation. Modifier polynomials are used for position interpolation. Temporal trajectories are optimized using minimum jerk, time‑optimal piece‑wise Bezier curves in two stages: task space followed by joint space, implemented on a low‑cost microcontroller. Experimental results demonstrate that the proposed method offers enhanced accuracy, reduced velocity fluctuations, and computational efficiency compared to conventional interpolation methods.

PaperID: 5485, https://arxiv.org/pdf/2602.16738.pdf

Authors: Rebin Saleh, Khanh Pham Dinh, Balázs Villányi, Truong-Son Hy

Title: Self-Evolving Multi-Agent Network for Industrial IoT Predictive Maintenance

Abstract:
Industrial IoT predictive maintenance requires systems capable of real‑time anomaly detection without sacrificing interpretability or demanding excessive computational resources. Traditional approaches rely on static, offline‑trained models that cannot adapt to evolving operational conditions, while LLM‑based monolithic systems demand prohibitive memory and latency, rendering them impractical for on‑site edge deployment. We introduce SEMAS, a self‑evolving hierarchical multi‑agent system that distributes specialized agents across Edge, Fog, and Cloud computational tiers. Edge agents perform lightweight feature extraction and pre‑filtering; Fog agents execute diversified ensemble detection with dynamic consensus voting; and Cloud agents continuously optimize system policies via Proximal Policy Optimization (PPO) while maintaining asynchronous, non‑blocking inference. The framework incorporates LLM‑based response generation for explainability and federated knowledge aggregation for adaptive policy distribution. This architecture enables resource‑aware specialization without sacrificing real‑time performance or model interpretability. Empirical evaluation on two industrial benchmarks (Boiler Emulator and Wind Turbine) demonstrates that SEMAS achieves superior anomaly detection performance with exceptional stability under adaptation, sustains prediction accuracy across evolving operational contexts, and delivers substantial latency improvements enabling genuine real‑time deployment. Ablation studies confirm that PPO‑driven policy evolution, consensus voting, and federated aggregation each contribute materially to system effectiveness. These findings indicate that resource‑aware, self‑evolving 1multi‑agent coordination is essential for production‑ready industrial IoT predictive maintenance under strict latency and explainability constraints.

PaperID: 5486, https://arxiv.org/pdf/2602.16708.pdf

Authors: Nils Palumbo, Sarthak Choudhary, Jihye Choi, Guy Amir, Prasad Chalasani, Somesh Jha

Title: Formal Policy Enforcement for Real-World Agentic Systems

Abstract:
Security policy enforcement in contemporary agentic systems predominantly consists of embedding natural‑language policies within an agent's system prompt and delegating compliance to the agent's reasoning. This approach admits no formal enforcement guarantee and cannot express policies whose satisfaction depends on the causal history of an execution, a gap that becomes acute in multi‑agent systems, where enforcement must reason across agents. We argue that policy enforcement in agentic systems is most naturally understood as a cross‑cutting concern, and propose a framework grounded in aspect‑oriented programming that specifies policies independent of the agent's reasoning and enforces them at every policy‑relevant decision. Policies are written in Datalog over a set of abstract predicates describing the execution context, an observability service governed by a formal assume/guarantee contract maintains these predicates, and a reference monitor consults the policy at each action to produce an enforcement decision. When the environment contract holds, enforcement decisions coincide with the policy's intended semantics. We adopt Datalog as the policy language, a natural fit because it supports declarative rule specification, admits recursion for policies over transitive relationships, and yields deterministic enforcement. Datalog further admits tractable static analyses for contradiction, redundancy, subsumption, and conditional reachability, enabling authors to verify policy intent and surface ambiguities inherent in natural‑language specifications. We realize the framework in FORGE, which enforces policies over agentic deployments without modification to the underlying agents. We evaluate FORGE on three case studies: information flow policies for prompt injection defense, approval workflows in a multi‑agent pharmacovigilance system, and organizational policies for customer service.

PaperID: 5487, https://arxiv.org/pdf/2602.16699.pdf

Authors: Wenxuan Ding, Nicholas Tomlin, Greg Durrett

Title: Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Abstract:
LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost‑uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost‑uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval‑augmented QA and a file reading coding task, as sequential decision‑making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate‑Then‑Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost‑benefit tradeoffs explicit with CTA helps agents discover more optimal decision‑making strategies.

PaperID: 5488, https://arxiv.org/pdf/2602.16666.pdf

Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

Title: Towards a Science of AI Agent Reliability

Abstract:
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety‑critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.

PaperID: 5489, https://arxiv.org/pdf/2602.16662.pdf

Authors: Richard Willis, Jianing Zhao, Yali Du, Joel Z. Leibo

Title: Evaluating Collective Behaviour of Hundreds of LLM Agents

Abstract:
As autonomous agents powered by LLM are increasingly deployed in society, understanding their collective behaviour in social dilemmas becomes critical. We introduce an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection prior to deployment and scaling to populations of hundreds of agents ‑‑ substantially larger than in previous work. We find that more recent models tend to produce worse societal outcomes compared to older models when agents prioritise individual gain over collective benefits. Using cultural evolution to model user selection of agents, our simulations reveal a significant risk of convergence to poor societal equilibria, particularly when the relative benefit of cooperation diminishes and population sizes increase. We release our code as an evaluation suite for developers to assess the emergent collective behaviour of their models.

PaperID: 5490, https://arxiv.org/pdf/2602.16564.pdf

Authors: Michael Lanier, Yevgeniy Vorobeychik

Title: A Scalable Approach to Solving Simulation-Based Network Security Games

Abstract:
We introduce MetaDOAR, a lightweight meta‑controller that augments the Double Oracle / PSRO paradigm with a learned, partition‑aware filtering layer and Q‑value caching to enable scalable multi‑agent reinforcement learning on very large cyber‑network environments. MetaDOAR learns a compact state projection from per node structural embeddings to rapidly score and select a small subset of devices (a top‑k partition) on which a conventional low‑level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k‑hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provide a practical, theoretically motivated path to efficient hierarchical policy learning for large‑scale networked decision problems.

PaperID: 5491, https://arxiv.org/pdf/2602.16485.pdf

Authors: Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

Title: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Abstract:
Existing Multi‑Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post‑trained architectures. We propose Team‑of‑Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator‑driven paradigm. Team‑of‑Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self‑Assessment, a protocol where tool agents profile their own domain‑specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team‑of‑Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team‑of‑Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role‑play baselines (80.00% and 65.93%).

PaperID: 5492, https://arxiv.org/pdf/2602.16435.pdf

Authors: Arun Vignesh Malarkkan, Wangyang Ying, Yanjie Fu

Title: Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning

Abstract:
Automated feature engineering (AFE) enables AI systems to autonomously construct high‑utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally‑guided sequential decision process, bridging causal discovery with reinforcement learning‑driven feature construction. Phase I learns a sparse directed acyclic graph over features and the target to obtain soft causal priors, grouping features as direct, indirect, or other based on their causal influence with respect to the target. Phase II uses a cascading multi‑agent deep Q‑learning architecture to select causal groups and transformation operators, with hierarchical reward shaping and causal group‑level exploration strategies that favor causally plausible transformations while controlling feature complexity. Across 15 public benchmarks (classification with macro‑F1; regression with inverse relative absolute error), CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes‑to‑convergence, and delivers competitive time‑to‑target. Under controlled covariate shifts, CAFE reduces performance drop by ~4x relative to a non‑causal multi‑agent baseline, and produces more compact feature sets with more stable post‑hoc attributions. These findings underscore that causal structure, used as a soft inductive prior rather than a rigid constraint, can substantially improve the robustness and efficiency of automated feature engineering.

PaperID: 5493, https://arxiv.org/pdf/2602.16379.pdf

Authors: Mohammad H. A. Monfared, Lucie Flek, Akbar Karimi

Title: Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents

Abstract:
We propose an agentic data augmentation method for Aspect‑Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting‑based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder‑decoder models: T5‑Base and Tk‑Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting‑based generation. These benefits are most pronounced for T5‑Base, while the more heavily pretrained Tk‑Instruct exhibits smaller improvements. As a result, augmented data helps T5‑Base achieve comparable performance with its counterpart.

PaperID: 5494, https://arxiv.org/pdf/2602.16353.pdf

Authors: Zhezhi Lei, Zhihai Bi, Wenxin Wang, Jun Ma

Title: Dual-Quadruped Collaborative Transportation in Narrow Environments via Safe Reinforcement Learning

Abstract:
Collaborative transportation, where multiple robots collaboratively transport a payload, has garnered significant attention in recent years. While ensuring safe and high‑performance inter‑robot collaboration is critical for effective task execution, it is difficult to pursue in narrow environments where the feasible region is extremely limited. To address this challenge, we propose a novel approach for dual‑quadruped collaborative transportation via safe reinforcement learning (RL). Specifically, we model the task as a fully cooperative constrained Markov game, where collision avoidance is formulated as constraints. We introduce a cost‑advantage decomposition method that enforces the sum of team constraints to remain below an upper bound, thereby guaranteeing task safety within an RL framework. Furthermore, we propose a constraint allocation method that assigns shared constraints to individual robots to maximize the overall task reward, encouraging autonomous task‑assignment among robots, thereby improving collaborative task performance. Simulation and real‑time experimental results demonstrate that the proposed approach achieves superior performance and a higher success rate in dual‑quadruped collaborative transportation compared to existing methods.

PaperID: 5495, https://arxiv.org/pdf/2602.16346.pdf

Authors: Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Title: Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Abstract:
LLM‑based agents execute real‑world workflows via tools and memory. These affordances enable ill‑intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single‑prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N‑step Goal execution), an automated red‑teaming framework that constructs a step‑by‑step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow‑ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi‑turn red‑teaming as a time‑to‑first‑jailbreak random variable, enabling analysis tools like discovery curves, hazard‑ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit‑task completion than single‑turn prompting and chat‑oriented multi‑turn baselines adapted to tool‑using agents. In multilingual evaluations across six non‑English settings, we find that attack success and illicit‑task completion do not consistently increase in lower‑resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress‑test agent misuse in realistic deployment settings, where interactions are inherently multi‑turn and often multilingual.

PaperID: 5496, https://arxiv.org/pdf/2602.16345.pdf

Authors: Leonardo Spampinato, Lorenzo Mario Amorosa, Enrico Testi, Chiara Buratti, Riccardo Marini

Title: Multi-Agent Meta-Advisor for UAV Fleet Trajectory Design in Vehicular Networks

Abstract:
Future vehicular networks require continuous connectivity to serve highly mobile users in urban environments. To mitigate the coverage limitations of fixed terrestrial macro base stations (MBS) under non line‑of‑sight (NLoS) conditions, fleets of unmanned aerial base stations (UABSs) can be deployed as aerial base stations, dynamically repositioning to track vehicular users and traffic hotspots in coordination with the terrestrial network. This paper addresses cooperative multi‑agent trajectory design under different service areas and takeoff configurations, where rapid and safe adaptation across scenarios is essential. We formulate the problem as a multi‑task decentralized partially observable Markov decision process and solve it using centralized training and decentralized execution with double dueling deep Q‑network (3DQN), enabling online training for real‑world deployments. However, efficient exploration remains a bottleneck, with conventional strategies like ε‑greedy requiring careful tuning. To overcome this, we propose the multi‑agent meta‑advisor with advisor override (MAMO). This framework guides agent exploration through a meta‑policy learned jointly across tasks. It uses a dynamic override mechanism that allows agents to reject misaligned guidance when the advisor fails to generalize to a specific scenario. Simulation results across three realistic urban scenarios and multiple takeoff configurations show that MAMO achieves faster convergence and higher returns than tuned ε‑greedy baselines, outperforming both an advisor‑only ablation and a single generalized policy. Finally, we demonstrate that the learned UABS fleet significantly improves network performance compared to deployments without aerial support.

PaperID: 5497, https://arxiv.org/pdf/2602.16308.pdf

Authors: Markus Rueggeberg, Maximilian Ulmer, Maximilian Durner, Wout Boerdijk, Marcus Gerhard Mueller, Rudolph Triebel, Riccardo Giubilato

Title: Markerless Robot Detection and 6D Pose Estimation for Multi-Agent SLAM

Abstract:
The capability of multi‑robot SLAM approaches to merge localization history and maps from different observers is often challenged by the difficulty in establishing data association. Loop closure detection between perceptual inputs of different robotic agents is easily compromised in the context of perceptual aliasing, or when perspectives differ significantly. For this reason, direct mutual observation among robots is a powerful way to connect partial SLAM graphs, but often relies on the presence of calibrated arrays of fiducial markers (e.g., AprilTag arrays), which severely limits the range of observations and frequently fails under sharp lighting conditions, e.g., reflections or overexposure. In this work, we propose a novel solution to this problem leveraging recent advances in Deep‑Learning‑based 6D pose estimation. We feature markerless pose estimation as part of a decentralized multi‑robot SLAM system and demonstrate the benefit to the relative localization accuracy among the robotic team. The solution is validated experimentally on data recorded in a test field campaign on a planetary analogous environment.

PaperID: 5498, https://arxiv.org/pdf/2602.16301.pdf

Authors: Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans

Title: Multi-agent cooperation through in-context co-player inference

Abstract:
Achieving cooperation among self‑interested agents remains a fundamental challenge in multi‑agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning‑aware" agents that account for and shape the learning dynamics of their co‑players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co‑player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta‑learners" observing these updates. Here, we demonstrate that the in‑context learning capabilities of sequence models allow for co‑player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co‑players naturally induces in‑context best‑response strategies, effectively functioning as learning algorithms on the fast intra‑episode timescale. We find that the cooperative mechanism identified in prior work‑where vulnerability to extortion drives mutual shaping‑emerges naturally in this setting: in‑context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in‑context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co‑player diversity provides a scalable path to learning cooperative behaviors.

PaperID: 5499, https://arxiv.org/pdf/2602.16260.pdf

Authors: Miguel A. Trujillo, Rodrigo Aldana-López, David Gomez Gutierrez, Michael Defoort, Javier Ruiz Leon, Hector M. Becerra

Title: Autonomous and non-autonomous fixed-time leader-follower consensus for second-order multi-agent systems

Abstract:
This paper addresses the problem of consensus tracking with fixed‑time convergence, for leader‑follower multi‑agent systems with double‑integrator dynamics, where only a subset of followers has access to the state of the leader. The control scheme is divided into two steps. The first one is dedicated to the estimation of the leader state by each follower in a distributed way and in a fixed‑time. Then, based on the estimate of the leader state, each follower computes its control law to track the leader in a fixed‑time. In this paper, two control strategies are investigated and compared to solve the two mentioned steps. The first one is an autonomous protocol which ensures a fixed‑time convergence for the observer and for the controller parts where the Upper Bound of the Settling‑Time (UBST) is set a priory by the user. Then, the previous strategy is redesigned using time‑varying gains to obtain a non‑autonomous protocol. This enables to obtain less conservative estimates of the UBST while guaranteeing that the time‑varying gains remain bounded. Some numerical examples show the effectiveness of the proposed consensus protocols.

PaperID: 5500, https://arxiv.org/pdf/2602.16246.pdf

Authors: Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

Title: Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Abstract:
Interactive large language model (LLM) agents operating via multi‑turn dialogue and multi‑step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on‑policy training data. Prior agentic benchmarks, such as tau‑bench, tau^2‑bench, and AppWorld, rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State‑Based Evaluation, an LLM‑driven simulation framework that preserves final state‑based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model‑differentiating rankings across model families and inference‑time reasoning efforts, and its on‑/off‑policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near‑zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human‑LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state‑based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

PaperID: 5501, https://arxiv.org/pdf/2602.16240.pdf

Authors: Ajitesh Srivastava, Shanghua Teng

Title: Submodular Maximization under Supermodular Constraint: Greedy Guarantees

Abstract:
Motivated by a wide range of applications in data mining and machine learning, we consider the problem of maximizing a submodular function subject to supermodular cost constraints. In contrast to the well‑understood setting of cardinality and matroid constraints, where greedy algorithms admit strong guarantees, the supermodular constraint regime remains poorly understood ‑‑ guarantees for greedy methods and other efficient algorithmic paradigms are largely open. We study this family of fundamental optimization problems under an upper‑bound constraint on a supermodular cost function with curvature parameter γ. Our notion of supermodular curvature is less restrictive than prior definitions, substantially expanding the class of admissible cost functions. We show that our greedy algorithm that iteratively includes elements maximizing the ratio of the objective and constraint functions, achieves a \left(1 ‑ e^‑(1‑γ)\right)‑approximation before stopping. We prove that this approximation is indeed tight for this algorithm. Further, if the objective function has a submodular curvature c, then we show that the bound further improves to \left(1 ‑ (1‑ (1‑c)(1‑γ))^1/(1‑c)\right), which can be further improved by continuing to violate the constraint. Finally, we show that the Greedy‑Ratio‑Marginal in conjunction with binary search leads to a bicriteria approximation for the dual problem ‑‑ minimizing a supermodular function under a lower bound constraint on a submodular function. We conduct a number of experiments on a simulation of LLM agents debating over multiple rounds ‑‑ the task is to select a subset of agents to maximize correctly answered questions. Our algorithm outperforms all other greedy heuristics, and on smaller problems, it achieves the same performance as the optimal set found by exhaustive search.

PaperID: 5502, https://arxiv.org/pdf/2602.16196.pdf

Authors: Emile Anand, Richard Hoffmann, Sarah Liaw, Adam Wierman

Title: Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning

Abstract:
Coordinating large populations of interacting agents is a central challenge in multi‑agent reinforcement learning (MARL), where the size of the joint state‑action space scales exponentially with the number of agents. Mean‑field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions. Recent graphon‑based frameworks capture heterogeneity, but are computationally expensive as the number of agents grows. Therefore, we introduce \textttGMFS, a Graphon Mean‑Field Subsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling κ agents according to interaction strength, we approximate the graphon‑weighted mean‑field and learn a policy with sample complexity \mathrmpoly(κ) and optimality gap O(1/\sqrtκ). We verify our theory with numerical simulations in robotic coordination, showing that \textttGMFS achieves near‑optimal performance.

PaperID: 5503, https://arxiv.org/pdf/2602.16186.pdf

Authors: Masoud Amouzgar

Title: Modeling Trust and Liquidity Under Payment System Stress: A Multi-Agent Approach

Abstract:
Operational disruptions in retail payments can induce behavioral responses that outlast technical recovery and may amplify liquidity stress. We propose a multi‑agent model linking card payment outages to trust dynamics, channel avoidance, and threshold‑gated withdrawals. Customers and merchants interact through repeated payment attempts, while customers additionally influence one another on a Watts‑Strogatz small‑world network. Customers update bounded memory variables capturing accumulated negative experience (scar) and perceived systemic risk (rumor), with merchants contributing persistent broadcast signals that may lag operational recovery. We prove that, under mild conditions on memory persistence and threshold gating, aggregate withdrawal pressure can peak strictly after the outage nadir, including during the recovery phase. Simulations reproduce behavioral hysteresis and confirm delayed peaks of outflows. We further study payment substitution via instant transfer: substitution consistently reduces peak avoidance, yet its effect on cumulative outflows is non‑monotonic under realistic merchant broadcast persistence. Robustness experiments across random seeds show stable qualitative behavior. The model highlights why "status green" is not equivalent to risk resolution and motivates incident response strategies that address perception, merchant messaging, and post‑recovery communication in addition to technical remediation.

PaperID: 5504, https://arxiv.org/pdf/2602.16183.pdf

Authors: Subham Pokhriyal, Shweta Jain, Vaneet Aggarwal

Title: Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback

Abstract:
We study the \emphSubmodular Welfare Problem (SWP), where items are partitioned among agents with monotone submodular utilities to maximize the total welfare under \emphbandit feedback. Classical SWP assumes full value‑oracle access, achieving (1‑1/e) approximations via continuous‑greedy algorithms. We extend this to a \emphmulti‑agent combinatorial bandit framework (\textscMA‑CMAB), where actions are partitions under full‑bandit feedback with non‑communicating agents. Unlike prior single‑agent or separable multi‑agent CMAB models, our setting couples agents through shared allocation constraints. We propose an explore‑then‑commit strategy with randomized assignments, achieving \tilde\mathcalO(T^2/3) regret against a (1‑1/e) benchmark, the first such guarantee for partition‑based submodular welfare problem under bandit feedback.

PaperID: 5505, https://arxiv.org/pdf/2602.16165.pdf

Authors: Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong

Title: HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Abstract:
Training LLMs as interactive agents for multi‑turn decision‑making remains challenging, particularly in long‑horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse‑reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan‑Execute RL framework that explicitly separates high‑level planning from low‑level execution. HiPER factorizes the policy into a high‑level planner that proposes subgoals and a low‑level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state‑of‑the‑art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5‑7B‑Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long‑horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi‑turn LLM agents.

PaperID: 5506, https://arxiv.org/pdf/2602.16131.pdf

Authors: Chihiro Watanabe, Jingyu Sun

Title: Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Abstract:
Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the k‑medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.

PaperID: 5507, https://arxiv.org/pdf/2602.16062.pdf

Authors: Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera

Title: Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets

Abstract:
This paper proposes implicit cooperation, a framework enabling decentralized agents to approximate optimal coordination in local energy markets without explicit peer‑to‑peer communication. We formulate the problem as a decentralized partially observable Markov decision problem that is solved through a multi‑agent reinforcement learning task in which agents use stigmergic signals (key performance indicators at the system level) to infer and react to global states. Through a 3x3 factorial design on an IEEE 34‑node topology, we evaluated three training paradigms (CTCE, CTDE, DTDE) and three algorithms (PPO, APPO, SAC). Results identify APPO‑DTDE as the optimal configuration, achieving a coordination score of 91.7% relative to the theoretical centralized benchmark (CTCE). However, a critical trade‑off emerges between efficiency and stability: while the centralized benchmark maximizes allocative efficiency with a peer‑to‑peer trade ratio of 0.6, the fully decentralized approach (DTDE) demonstrates superior physical stability. Specifically, DTDE reduces the variance of grid balance by 31% compared to hybrid architectures, establishing a highly predictable, import‑biased load profile that simplifies grid regulation. Furthermore, topological analysis reveals emergent spatial clustering, where decentralized agents self‑organize into stable trading communities to minimize congestion penalties. While SAC excelled in hybrid settings, it failed in decentralized environments due to entropy‑driven instability. This research proves that stigmergic signaling provides sufficient context for complex grid coordination, offering a robust, privacy‑preserving alternative to expensive centralized communication infrastructure.

PaperID: 5508, https://arxiv.org/pdf/2602.15902.pdf

Authors: Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

Title: Doc-to-LoRA: Learning to Instantly Internalize Contexts

Abstract:
Long input sequences are central to in‑context learning, document understanding, and multi‑step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory‑intensive and slow. While context distillation (CD) can transfer information into model parameters, per‑prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc‑to‑LoRA (D2L), a lightweight hypernetwork that meta‑learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re‑consuming the original context, reducing latency and KV‑cache memory consumption during inference of the target LLM. On a long‑context needle‑in‑a‑haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near‑perfect zero‑shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real‑world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.

PaperID: 5509, https://arxiv.org/pdf/2602.15898.pdf

Authors: Jimeng Shi, Wei Hu, Runchu Tian, Bowen Jin, Wonbin Kweon, SeongKu Kang, Yunfan Kang, Dingqi Ye, Sizhe Zhou, Shaowen Wang, Jiawei Han

Title: MultiCube-RAG for Multi-hop Question Answering

Abstract:
Multi‑hop question answering (QA) necessitates multi‑step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval‑augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph‑based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single‑step retrieval, neglecting the need for multi‑hop reasoning processes. Recent training‑based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology‑based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube‑RAG, a training‑free method consisting of multiple cubes for multi‑step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube‑RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query‑based reasoning and retrieval, our method decomposes a complex multi‑hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi‑hop QA datasets show that MultiCube‑RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.

PaperID: 5510, https://arxiv.org/pdf/2602.15891.pdf

Authors: Zilin Wang, Saeed Rahmani, Daphne Cornelisse, Bidipta Sarkar, Alexander David Goldie, Jakob Nicolaus Foerster, Shimon Whiteson

Title: Learning to Drive in New Cities Without Human Demonstrations

Abstract:
While autonomous vehicles have achieved reliable performance within specific operating regions, their deployment to new cities remains costly and slow. A key bottleneck is the need to collect many human demonstration trajectories when adapting driving policies to new cities that differ from those seen in training in terms of road geometry, traffic rules, and interaction patterns. In this paper, we show that self‑play multi‑agent reinforcement learning can adapt a driving policy to a substantially different target city using only the map and meta‑information, without requiring any human demonstrations from that city. We introduce NO data Map‑based self‑play for Autonomous Driving (NOMAD), which enables policy adaptation in a simulator constructed based on the target‑city map. Using a simple reward function, NOMAD substantially improves both task success rate and trajectory realism in target cities, demonstrating an effective and scalable alternative to data‑intensive city‑transfer methods. Project Page: https://nomaddrive.github.io/

PaperID: 5511, https://arxiv.org/pdf/2602.15870.pdf

Authors: Shuhui Qu

Title: VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

Abstract:
Autoregressive language models decode left‑to‑right with irreversible commitments, limiting revision during multi‑step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA‑style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post‑trains the planner with trajectory‑aware optimization using embedding‑space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a Vec2Text renderer and introduce embedding perturbations to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre‑training and yields substantial post‑training improvements on long‑form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding‑space post‑training and robust latent‑to‑text rendering for diffusion language modeling.

PaperID: 5512, https://arxiv.org/pdf/2602.15838.pdf

Authors: Yucheng Chen

Title: TurboADMM: A Structure-Exploiting Parallel Solver for Multi-Agent Trajectory Optimization

Abstract:
Multi‑agent trajectory optimization with dense interaction networks require solving large coupled QPs at control rates, yet existing solvers fail to simultaneously exploit temporal structure, agent decomposition, and iteration similarity. One usually treats multi‑agent problems monolithically when using general‑purpose QP solvers (OSQP, MOSEK), which encounter scalability difficulties with agent count. Structure‑exploiting solvers (HPIPM) leverage temporal structure through Riccati recursion but can be vulnerable to dense coupling constraints. We introduce TurboADMM, a specialized single‑machine QP solver that achieves empirically near linear complexity in agent count through systematic co‑design of three complementary components: (1) ADMM decomposition creates per‑agent subproblems solvable in parallel, preserving block‑tridiagonal structure under dense coupling; (2) Riccati warmstart exploits temporal structure to provide high‑quality primal‑dual initialization for each agent's QP; (3) parametric QP hotstart \footnoteIn the paper, we refer warmstart as the technique that uses the Riccati equation results as auxiliary QP initialization for a single QP solve, while hotstart as reusing the QR factorization across QP solve iterations.in qpOASES reuses similar KKT system factorizations across ADMM iterations.

PaperID: 5513, https://arxiv.org/pdf/2602.15831.pdf

Authors: Zhiyuan Liang, Enfang Cui, Qian Wei, Rui She, Tianzheng Li, Minxin Guo, Yujun Cheng

Title: A2H: Agent-to-Human Protocol for AI Agent

Abstract:
AI agents are increasingly deployed as autonomous systems capable of planning, tool use, and multi‑agent collaboration across complex tasks. However, existing agent‑related protocols focus on agent‑to‑agent interactions, leaving humans as external observers rather than integrated participants within the agent systems. This limitation arises from the lack of a standardized mechanism for agents to discover, address, and interact with humans across heterogeneous messaging platforms. In this paper, we propose the A2H (Agent‑to‑Human) protocol, a unified protocol that enables humans to be registered, discovered, and communicated with by AI agents as resolvable entities within agent systems. A2H contributes three key components: (1) Human Card for registering human identities via resolvable domain names, making them discoverable to agents; (2) Formal Communication Schema defines when, why, and how agents contact with human;(3) Unified Messaging Abstraction standardizes diverse communication medias and transforms complex JSON outputs into human‑friendly formats. This work establishes a foundational protocol for integrating humans into agent ecosystems, advancing AI agents from isolated autonomous systems toward truly human‑connected intelligent infrastructures.

PaperID: 5514, https://arxiv.org/pdf/2602.15809.pdf

Authors: Yuqi Tian, Robert Paine, Attila Dobi, Kevin O'Sullivan, Aravindh Manickavasagam, Faisal Farooq

Title: Decision Quality Evaluation Framework at Pinterest

Abstract:
Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade‑offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high‑trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost‑performance trade‑offs of various LLM agents, establishing a rigorous methodology for data‑driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data‑driven and quantitative practice for managing content safety systems.

PaperID: 5515, https://arxiv.org/pdf/2602.15776.pdf

Authors: Yiqin Yang, Xu Yang, Yuhua Jiang, Ni Mu, Hao Hu, Runpeng Xie, Ziyou Zhang, Siyuan Li, Yuan-Hua Ni, Qianchuan Zhao, Bo Xu

Title: GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems

Abstract:
In the realm of multi‑agent systems, the challenge of \emphpartial observability is a critical barrier to effective coordination and decision‑making. Existing approaches, such as belief state estimation and inter‑agent communication, often fall short. Belief‑based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose Global State Diffusion Algorithm~(GlobeDiff) to infer the global state based on the local observations. By formulating the state inference process as a multi‑modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi‑modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.

PaperID: 5516, https://arxiv.org/pdf/2602.15654.pdf

Authors: Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, Jin Song Dong

Title: Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections

Abstract:
Self‑evolving LLM agents update their internal state across sessions, often by writing and reusing long‑term memory. This design improves performance on long‑horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and formalize a persistent attack we call a Zombie Agent, where an attacker covertly implants a payload that survives across sessions, effectively turning the agent into a puppet of the attacker. We present a black‑box attack framework that uses only indirect exposure through attacker‑controlled web content. The attack has two phases. During infection, the agent reads a poisoned source while completing a benign task and writes the payload into long‑term memory through its normal update process. During trigger, the payload is retrieved or carried forward and causes unauthorized tool behavior. We design mechanism‑specific persistence strategies for common memory implementations, including sliding‑window and retrieval‑augmented memory, to resist truncation and relevance filtering. We evaluate the attack on representative agent setups and tasks, measuring both persistence over time and the ability to induce unauthorized actions while preserving benign task quality. Our results show that memory evolution can convert one‑time indirect injection into persistent compromise, which suggests that defenses focused only on per‑session prompt filtering are not sufficient for self‑evolving agents.

PaperID: 5517, https://arxiv.org/pdf/2602.15631.pdf

Authors: Lan Luo, Dongyijie Primo Pan, Junhua Zhu, Muzhi Zhou, Pan Hui

Title: Meflex: A Multi-agent Scaffolding System for Entrepreneurial Ideation Iteration via Nonlinear Business Plan Writing

Abstract:
Business plan (BP) writing plays a key role in entrepreneurship education by helping learners construct, evaluate, and iteratively refine their ideas. However, conventional BP writing remains a rigid, linear process that often fails to reflect the dynamic and recursive nature of entrepreneurial ideation. This mismatch is particularly challenging for novice entrepreneurial students, who struggle with the substantial cognitive demands of developing and refining ideas. While reflection and meta‑reflection are critical strategies for fostering divergent and convergent thinking, existing writing tools rarely scaffold these higher‑order processes. To address this gap, we present the Meflex System, a large language model (LLM)‑based writing tool that integrates BP writing scaffolding with a nonlinear idea canvas to support iterative ideation through reflection and meta‑reflection. We report findings from an exploratory user study with 30 participants that examined the system's usability and cognitive impact. Results show that Meflex effectively scaffolds BP writing, promotes divergent thinking through LLM‑supported reflection, and enhances meta‑reflective awareness while reducing cognitive load during complex idea development. These findings highlight the potential of non‑linear LLM‑based writing tools to foster deeper and coherent entrepreneurial thinking.

PaperID: 5518, https://arxiv.org/pdf/2602.15456.pdf

Authors: Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

Title: In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

Abstract:
Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms' back‑end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users' attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors that influence what information LLMs select and present to users. We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), current LLMs exhibit systematic latent source preferences‑ that is, they prioritize information from some sources over others. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real‑world tasks, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left‑leaning skew in news recommendations in prior work. Our findings advocate for deeper investigation into the origins of these preferences, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM‑powered agents.

PaperID: 5519, https://arxiv.org/pdf/2602.15407.pdf

Authors: Alper Demir, Hüseyin Aydın, Kale-ab Abebe Tessera, David Abel, Stefano V. Albrecht

Title: Fairness over Equality: Correcting Social Incentives in Asymmetric Sequential Social Dilemmas

Abstract:
Sequential Social Dilemmas (SSDs) provide a key framework for studying how cooperation emerges when individual incentives conflict with collective welfare. In Multi‑Agent Reinforcement Learning, these problems are often addressed by incorporating intrinsic drives that encourage prosocial or fair behavior. However, most existing methods assume that agents face identical incentives in the dilemma and require continuous access to global information about other agents to assess fairness. In this work, we introduce asymmetric variants of well‑known SSD environments and examine how natural differences between agents influence cooperation dynamics. Our findings reveal that existing fairness‑based methods struggle to adapt under asymmetric conditions by enforcing raw equality that wrongfully incentivize defection. To address this, we propose three modifications: (i) redefining fairness by accounting for agents' reward ranges, (ii) introducing an agent‑based weighting mechanism to better handle inherent asymmetries, and (iii) localizing social feedback to make the methods effective under partial observability without requiring global information sharing. Experimental results show that in asymmetric scenarios, our method fosters faster emergence of cooperative policies compared to existing approaches, without sacrificing scalability or practicality.

PaperID: 5520, https://arxiv.org/pdf/2602.15384.pdf

Authors: Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang

Title: World-Model-Augmented Web Agents with Action Correction

Abstract:
Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback‑driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi‑agent collaboration process that enables an action model to consult a world model as a web‑environment expert for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state transition dynamics to enhance candidate action proposal. To achieve risk‑aware resilient task execution, we introduce a two‑stage deduction chain. A world model, specialized in environmental state transitions, simulates action outcomes, which a judge model then scrutinizes to trigger action corrective feedback when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online‑Mind2Web.

PaperID: 5521, https://arxiv.org/pdf/2602.15382.pdf

Authors: Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

Title: The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Abstract:
Multi‑Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender‑‑receiver architectures or rely on pair‑specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision‑Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the Vision Wormhole: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross‑architecture latent state transfer without per‑pair translators. The framework adopts a hub‑and‑spoke topology that reduces alignment complexity from O(N^2) to O(N), and is trained by label‑free teacher‑‑student distillation against the text channel, requiring no parallel hidden‑state supervision. Extensive experiments across heterogeneous VLM families (Qwen‑VL, Gemma, SmolVLM2, LFM2.5‑VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end‑to‑end wall‑clock time across most evaluated settings and yields positive macro‑average Δ‑accuracy.

PaperID: 5522, https://arxiv.org/pdf/2602.15325.pdf

Authors: Zhixing Zhang, Jesen Zhang, Hao Liu, Qinhan Lv, Jing Yang, Kaitong Cai, Keze Wang

Title: AgriWorld:A World Tools Protocol Framework for Verifiable Agricultural Reasoning with Code-Executing LLM Agents

Abstract:
Foundation models for agriculture are increasingly trained on massive spatiotemporal data (e.g., multi‑spectral remote sensing, soil grids, and field‑level management logs) and achieve strong performance on forecasting and monitoring. However, these models lack language‑based reasoning and interactive capabilities, limiting their usefulness in real‑world agronomic workflows. Meanwhile, large language models (LLMs) excel at interpreting and generating text, but cannot directly reason over high‑dimensional, heterogeneous agricultural datasets. We bridge this gap with an agentic framework for agricultural science. It provides a Python execution environment, AgriWorld, exposing unified tools for geospatial queries over field parcels, remote‑sensing time‑series analytics, crop growth simulation, and task‑specific predictors (e.g., yield, stress, and disease risk). On top of this environment, we design a multi‑turn LLM agent, Agro‑Reflective, that iteratively writes code, observes execution results, and refines its analysis via an execute‑observe‑refine loop. We introduce AgroBench, with scalable data generation for diverse agricultural QA spanning lookups, forecasting, anomaly detection, and counterfactual "what‑if" analysis. Experiments outperform text‑only and direct tool‑use baselines, validating execution‑driven reflection for reliable agricultural reasoning.

PaperID: 5523, https://arxiv.org/pdf/2602.15219.pdf

Authors: Wooyoung Jung

Title: Multi-Agent Home Energy Management Assistant

Abstract:
Existing home energy management systems conceptualize occupants as passive recipients of energy information and control, which limits their ability to effectively support informed decision‑making and sustained engagement. This paper presents Home Energy Management Assistant (HEMA), the first open‑source, multi‑agent system enabling sustained human‑AI collaboration ‑ multi‑turn conversational interactions with preserved context ‑ across diverse home energy management (HEM) tasks ‑ from energy analysis and educational support to smart device control. HEMA combines large language model (LLM) reasoning capabilities with 36 purpose‑built domain‑specific tools through a three‑layer architecture: a web‑based conversational interface, a backend API server, and a multi‑agent system. The system features three specialized agents ‑ Analysis (energy consumption patterns and cost optimization), Knowledge (educational queries and rebate information), and Control (smart device management and scheduling) ‑ coordinated through a self‑consistency classifier that routes user queries using chain‑of‑thought reasoning. This architecture enables various energy analyses, adaptive explanations, and streamlined device control. HEMA also includes a comprehensive evaluation framework using an LLM‑as‑simulated‑user methodology with 23 objective metrics across task performance, factual accuracy, interaction quality, and system efficiency, allowing systematic testing across diverse scenarios and user personas without requiring extensive human subject testing. Through demonstrations using real‑world household energy consumption data, we show how HEMA supports informed decision‑making and active engagement in HEM, highlighting its potential as a user‑friendly, adaptable tool for residential deployment and as a research platform for HEM innovation.

PaperID: 5524, https://arxiv.org/pdf/2602.15212.pdf

Authors: Yuanyan Song, Kezhi Wang, Xinmian Xu

Title: Secure and Energy-Efficient Wireless Agentic AI Networks

Abstract:
In this paper, we introduce a secure wireless agentic AI network comprising one supervisor AI agent and multiple other AI agents to provision quality of service (QoS) for users' reasoning tasks while ensuring confidentiality of private knowledge and reasoning outcomes. Specifically, the supervisor AI agent can dynamically assign other AI agents to participate in cooperative reasoning, while the unselected AI agents act as friendly jammers to degrade the eavesdropper's interception performance. To extend the service duration of AI agents, an energy minimization problem is formulated that jointly optimizes AI agent selection, base station (BS) beamforming, and AI agent transmission power, subject to latency and reasoning accuracy constraints. To address the formulated problem, we propose two resource allocation schemes, ASC and LAW, which first decompose it into three sub‑problems. Specifically, ASC optimizes each sub‑problem iteratively using the proposed alternating direction method of multipliers (ADMM)‑based algorithm, semi‑definite relaxation (SDR), and successive convex approximation (SCA), while LAW tackles each sub‑problem using the proposed large language model (LLM) optimizer within an agentic workflow. The experimental results show that the proposed solutions can reduce network energy consumption by up to 59.1% compared to other benchmark schemes. Furthermore, the proposed schemes are validated using a practical agentic AI system based on Qwen, demonstrating satisfactory reasoning accuracy across various public benchmarks.

PaperID: 5525, https://arxiv.org/pdf/2602.15198.pdf

Authors: Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

Title: Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Abstract:
Multi‑agent systems, where LLM agents communicate through free‑form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a coalition and colludes to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi‑agent settings. We ground how agents cooperate through a formal multi‑agent decision‑making framework and measure action‑based collusive behavior in actions via regret relative to the cooperative optimum and compare it with communication‑based collusive behavior. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe by creating secret communication channels between agents, showing that most out‑of‑the‑box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover ``collusion on paper'' when agents plan to collude in text but often pick non‑collusive actions. Colosseum provides a new way to audit collusion in cooperative multi‑agent systems while presenting observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.

PaperID: 5526, https://arxiv.org/pdf/2602.15197.pdf

Authors: Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel

Title: OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Abstract:
Tool‑calling is essential for Large Language Model (LLM) agents to complete real‑world tasks. While most existing benchmarks assume simple, perfectly documented tools, real‑world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task‑oriented environments: general function calling, interactive chess playing, and long‑trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool‑calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test‑time tool exploration settings, our method is also efficient, consuming 3.5‑7.5x fewer total tokens than the best baseline.

PaperID: 5527, https://arxiv.org/pdf/2602.15064.pdf

Authors: Wenpin Hou, Zhicheng Ji

Title: Structural Divergence Between AI-Agent and Human Social Networks in Moltbook

Abstract:
Large populations of AI agents are increasingly embedded in online environments, yet little is known about how their collective interaction patterns compare to human social systems. Here, we analyze the full interaction network of Moltbook, a platform where AI agents and humans coexist, and systematically compare its structure to well‑characterized human communication networks. Although Moltbook follows the same node‑edge scaling relationship observed in human systems, indicating comparable global growth constraints, its internal organization diverges markedly. The network exhibits extreme attention inequality, heavy‑tailed and asymmetric degree distributions, suppressed reciprocity, and a global under‑representation of connected triadic structures. Community analysis reveals a structured modular architecture with elevated modularity and comparatively lower community size inequality relative to degree‑preserving null models. Together, these findings show that AI‑agent societies can reproduce global structural regularities of human networks while exhibiting fundamentally different internal organizing principles, highlighting that key features of human social organization are not universal but depend on the nature of the interacting agents.

PaperID: 5528, https://arxiv.org/pdf/2602.15055.pdf

Authors: Naveen Kumar Krishnan

Title: Beyond Context Sharing: A Unified Agent Communication Protocol (ACP) for Secure, Federated, and Autonomous Agent-to-Agent (A2A) Orchestration

Abstract:
In the artificial intelligence space, as we transition from isolated large language models to autonomous agents capable of complex reasoning and tool use. While foundational architectures and local context management protocols have been established, the challenge of cross‑platform, decentralized, and secure interaction remains a significant barrier to the realization of a truly Agentic Web. Building upon the foundations of AI agent architectures and the Model Context Protocol (MCP) for multi‑agent coordination, this paper introduces the Agent Communication Protocol (ACP). ACP provides a standardized framework for Agent‑to‑Agent (AA) interaction, enabling heterogeneous agents to discover, negotiate, and execute collaborative workflows across disparate environments. We propose a federated orchestration model that integrates decentralized identity verification, semantic intent mapping, and automated service‑level agreements. Our evaluation demonstrates that ACP reduces inter‑agent communication latency by % while maintaining a zero‑trust security posture. This work represents a critical advancement toward a scalable and interoperable ecosystem of autonomous digital entities

PaperID: 5529, https://arxiv.org/pdf/2602.15019.pdf

Authors: Vlad Vinogradov, Alisa Vinogradova, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

Title: Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Abstract:
Bio‑pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non‑English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non‑U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high‑stakes environment, failing to surface "under‑the‑radar" assets creates multi‑billion‑dollar risk for investors and business development teams, making asset scouting a coverage‑critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high‑recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree‑based self‑learning Bioptic Agent aimed at complete, non‑hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi‑agent pipeline: complex user queries paired with ground‑truth assets that are largely outside U.S.‑centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM‑as‑judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT‑5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

PaperID: 5530, https://arxiv.org/pdf/2602.15006.pdf

Authors: Meet Gandhi, George P. Kontoudis

Title: Distributed Quantum Gaussian Processes for Multi-Agent Systems

Abstract:
Gaussian Processes (GPs) are a powerful tool for probabilistic modeling, but their performance is often constrained in complex, large‑scale real‑world domains due to the limited expressivity of classical kernels. Quantum computing offers the potential to overcome this limitation by embedding data into exponentially large Hilbert spaces, capturing complex correlations that remain inaccessible to classical computing approaches. In this paper, we propose a Distributed Quantum Gaussian Process (DQGP) method in a multi‑agent setting to enhance modeling capabilities and scalability. To address the challenging non‑Euclidean optimization problem, we develop a Distributed consensus Riemannian Alternating Direction Method of Multipliers (DR‑ADMM) algorithm that aggregates local agent models into a global model. We evaluate the efficacy of our method through numerical experiments conducted on a quantum simulator in classical hardware. We use real‑world, non‑stationary elevation datasets of NASA's Shuttle Radar Topography Mission and synthetic datasets generated by Quantum Gaussian Processes. Beyond modeling advantages, our framework highlights potential computational speedups that quantum hardware may provide, particularly in Gaussian processes and distributed optimization.

PaperID: 5531, https://arxiv.org/pdf/2602.14968.pdf

Authors: Yian Wang, Han Yang, Minghao Guo, Xiaowen Qiu, Tsun-Hsuan Wang, Wojciech Matusik, Joshua B. Tenenbaum, Chuang Gan

Title: PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

Abstract:
Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent‑based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine‑grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.

PaperID: 5532, https://arxiv.org/pdf/2602.14926.pdf

Authors: Gen Zhou, Sugitha Janarthanan, Lianghong Chen, Pingzhao Hu

Title: MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design

Abstract:
To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi‑agent collaboration based on such models (multi‑agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC‑AMP, a closed‑loop multi‑agent collaboration (MAC) system for multi‑objective AMP design. The system implements a fully autonomous simulated peer review‑adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed‑loop multi‑agent system for AMP design, with cross‑domain transferability, that supports multi‑objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC‑AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.

PaperID: 5533, https://arxiv.org/pdf/2602.14878.pdf

Authors: Mohammed Mehedi Hasan, Hao Li, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan

Title: Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

Abstract:
The Model Context Protocol (MCP) standardizes how Foundation Model (FM)‑based agents interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural‑language tool descriptions, making these descriptions a critical component in guiding FMs to select the optimal tool for a given (sub)task and to pass the right arguments to the tool. While defects or smells in these descriptions can misguide FM‑based agents, their prevalence and consequences in the MCP ecosystem remain unclear. To address this, we conduct the first large‑scale empirical study of 856 tools spread across 103 MCP servers, assessing their description quality and their impact on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric utilizing these components, then formalize tool description smells based on this rubric. By operationalizing this rubric through an FM‑based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. While augmenting these descriptions for all components improves task success rates by a median of 5.85 percentage points and improves partial goal completion by 15.12%, it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These findings highlight a trade‑off between agent performance and cost, as well as the context sensitivity of the performance gain. Furthermore, component ablations show that compact variants of different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.

PaperID: 5534, https://arxiv.org/pdf/2602.14849.pdf

Authors: Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler

Title: Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Abstract:
LLM agents increasingly act on external systems, yet tool effects are immediate. Under failures, speculation, or contention, losing branches can leak unintended side effects with no safe rollback. We introduce Atomix, a runtime that provides progress‑aware transactional semantics for agent tool calls. Atomix tags each call with an epoch, tracks per‑resource frontiers, and commits only when progress predicates indicate safety; bufferable effects can be delayed, while externalized effects are tracked and compensated on abort. Across real workloads with fault injection, transactional retry improves task success, while frontier‑gated commit strengthens isolation under speculation and contention.

PaperID: 5535, https://arxiv.org/pdf/2602.14798.pdf

Authors: Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi

Title: Overthinking Loops in Agents: A Structural Risk via MCP Tools

Abstract:
Tool‑using LLM agents increasingly coordinate real workloads by selecting and chaining third‑party tools based on text‑visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply‑chain attack surface: a malicious MCP tool server can be co‑registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end‑to‑end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token‑level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool‑capable models, the attack causes severe resource amplification (up to 142.4× tokens) and can degrade task outcomes. Finally, we find that decoding‑time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool‑call structure rather than tokens alone.

PaperID: 5536, https://arxiv.org/pdf/2602.14780.pdf

Authors: Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

Title: ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic

Abstract:
We present ROSA ‑‑ Roundabout Optimized Speed Advisory ‑‑ a system that combines multi‑agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer‑based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single‑step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five‑second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real‑time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: github.com/urbanAIthi/ROSA.

PaperID: 5537, https://arxiv.org/pdf/2602.14770.pdf

Authors: Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu

Title: Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Abstract:
Prior work has explored multi‑turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand‑up comedy writing in a controlled multi‑agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15‑item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.

PaperID: 5538, https://arxiv.org/pdf/2602.14681.pdf

Authors: Xingjian Wu, Xvyuan Liu, Junkai Lu, Siyuan Wang, Xiangfei Qiu, Yang Shu, Jilin Hu, Chenjuan Guo, Bin Yang

Title: ST-EVO: Towards Generative Spatio-Temporal Evolution of Multi-Agent Communication Topologies

Abstract:
LLM‑powered Multi‑Agent Systems (MAS) have emerged as an effective approach towards collaborative intelligence, and have attracted wide research interests. Among them, ``self‑evolving'' MAS, treated as a more flexible and powerful technical route, can construct task‑adaptive workflows or communication topologies, instead of relying on a predefined static structue template. Current self‑evolving MAS mainly focus on Spatial Evolving or Temporal Evolving paradigm, which only considers the single dimension of evolution and does not fully incentivize LLMs' collaborative capability. In this work, we start from a novel Spatio‑Temporal perspective by proposing ST‑EVO, which supports dialogue‑wise communication scheduling with a compact yet powerful flow‑matching based Scheduler. To make precise Spatio‑Temporal scheduling, ST‑EVO can also perceive the uncertainty of MAS, and possesses self‑feedback ability to learn from accumulated experience. Extensive experiments on nine benchmarks demonstrate the state‑of‑the‑art performance of ST‑EVO, achieving about 5%‑‑25% accuracy improvement.

PaperID: 5539, https://arxiv.org/pdf/2602.14559.pdf

Authors: Shishir Sharma, Doina Precup, Theodore J. Perkins

Title: Fluid-Agent Reinforcement Learning

Abstract:
The primary focus of multi‑agent reinforcement learning (MARL) has been to study interactions among a fixed number of agents embedded in an environment. However, in the real world, the number of agents is neither fixed nor known a priori. Moreover, an agent can decide to create other agents (for example, a cell may divide, or a company may spin off a division). In this paper, we propose a framework that allows agents to create other agents; we call this a fluid‑agent environment. We present game‑theoretic solution concepts for fluid‑agent games and empirically evaluate the performance of several MARL algorithms within this framework. Our experiments include fluid variants of established benchmarks such as Predator‑Prey and Level‑Based Foraging, where agents can dynamically spawn, as well as a new environment we introduce that highlights how fluidity can unlock novel solution strategies beyond those observed in fixed‑population settings. We demonstrate that this framework yields agent teams that adjust their size dynamically to match environmental demands.

PaperID: 5540, https://arxiv.org/pdf/2602.14471.pdf

Authors: Furkan Mumcu, Yasin Yilmaz

Title: Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

Abstract:
Deploying large language model (LLM) agents in shared environments introduces a fundamental tension between individual alignment and collective stability: locally rational decisions can impose negative externalities that degrade system‑level performance. We propose Socially‑Weighted Alignment (SWA), a game‑theoretic framework that modifies inference‑time decision making by interpolating between an agent's private objective and an estimate of group welfare via a social weight λ\in[0,1]. In a shared‑resource congestion game with n agents and congestion severity β, we show that SWA induces a critical threshold λ^=(n‑β)/(n‑1) above which agents no longer have marginal incentive to increase demand under overload, yielding a phase transition from persistent congestion to stable operation near capacity. We further provide an inference‑time algorithmic instantiation of SWA that does not require parameter updates or multi‑agent reinforcement learning, and use a multi‑agent simulation to empirically validate the predicted threshold behavior.

PaperID: 5541, https://arxiv.org/pdf/2602.14456.pdf

Authors: Huaming Du, Tao Hu, Yijie Huang, Yu Zhao, Guisong Liu, Tao Gu, Gang Kou, Carl Yang

Title: Traceable Latent Variable Discovery Based on Multi-Agent Collaboration

Abstract:
Revealing the underlying causal mechanisms in the real world is crucial for scientific and technological progress. Despite notable advances in recent decades, the lack of high‑quality data and the reliance of traditional causal discovery algorithms (TCDA) on the assumption of no latent confounders, as well as their tendency to overlook the precise semantics of latent variables, have long been major obstacles to the broader application of causal discovery. To address this issue, we propose a novel causal modeling framework, TLVD, which integrates the metadata‑based reasoning capabilities of large language models (LLMs) with the data‑driven modeling capabilities of TCDA for inferring latent variables and their semantics. Specifically, we first employ a data‑driven approach to construct a causal graph that incorporates latent variables. Then, we employ multi‑LLM collaboration for latent variable inference, modeling this process as a game with incomplete information and seeking its Bayesian Nash Equilibrium (BNE) to infer the possible specific latent variables. Finally, to validate the inferred latent variables across multiple real‑world web‑based data sources, we leverage LLMs for evidence exploration to ensure traceability. We comprehensively evaluate TLVD on three de‑identified real patient datasets provided by a hospital and two benchmark datasets. Extensive experimental results confirm the effectiveness and reliability of TLVD, with average improvements of 32.67% in Acc, 62.21% in CAcc, and 26.72% in ECit across the five datasets.

PaperID: 5542, https://arxiv.org/pdf/2602.14438.pdf

Authors: Hamid Khabazi, Ali F. Meghdari, Alireza Taheri

Title: RoboSolver: A Multi-Agent Large Language Model Framework for Solving Robotic Arm Problems

Abstract:
This study proposes an intelligent multi‑agent framework built on LLMs and VLMs and specifically tailored to robotics. The goal is to integrate the strengths of LLMs and VLMs with computational tools to automatically analyze and solve problems related to robotic manipulators. Our developed framework accepts both textual and visual inputs and can automatically perform forward and inverse kinematics, compute velocities and accelerations of key points, generate 3D simulations of the robot, and ultimately execute motion control within the simulated environment, all according to the user's query. To evaluate the framework, three benchmark tests were designed, each consisting of ten questions. In the first benchmark test, the framework was evaluated while connected to GPT‑4o, DeepSeek‑V3.2, and Claude‑Sonnet‑4.5, as well as their corresponding raw models. The objective was to extract the forward kinematics of robots directly from textual descriptions. The results showed that the framework integrated with GPT‑4o achieved the highest accuracy, reaching 0.97 in computing the final solution, whereas the raw model alone attained an accuracy of only 0.30 for the same task. Similarly, for the other two models, the framework consistently outperformed the corresponding raw models in terms of accuracy. The second benchmark test was identical to the first, except that the input was provided in visual form. In this test, the GPT‑4o LLM was used alongside the Gemini 2.5 Pro VLM. The results showed that the framework achieved an accuracy of 0.93 in obtaining the final answer, which is approximately 20% higher than that of the corresponding raw model. The third benchmark test encompassed a range of robotic tasks, including simulation, control, velocity and acceleration computation, as well as inverse kinematics and Jacobian calculation, for which the framework achieved an accuracy of 0.97.

PaperID: 5543, https://arxiv.org/pdf/2602.14431.pdf

Authors: Hongming Li, Salah Esmaeiligoujar, Nazanin Adham, Hai Li, Rui Huang

Title: 'I Spend All My Energy Preparing': Balancing AI Automation and Agency for Self-Regulated Learning in SmartFlash

Abstract:
Effective study strategies fail when preparatory tasks consume learning time. While AI educational tools demonstrate efficacy, understanding how they align with self‑regulation needs in authentic study contexts remains limited. We conducted formative design research using an AI flashcard prototype, employing large language models to generate design hypotheses, which were validated through researcher walkthroughs and student sessions. Six students across disciplines completed sessions combining interviews and think‑aloud tasks with their materials. Analysis revealed that students value automation for addressing the overwhelming preparation burden, yet require transparent, editable AI outputs to maintain cognitive ownership, which is essential for self‑regulation. They conceptualized AI as a collaborative partner demanding verifiable reasoning rather than an autonomous agent. Metacognitive scaffolding was endorsed when clarifying study direction without constraining choice. Motivational features produced divergent responses. We derive design principles prioritizing editability and transparency, scaffolding metacognition without prescription, and accommodating motivational diversity. Findings identify conditions under which automation supports versus undermines metacognitive development in self‑regulated learning.

PaperID: 5544, https://arxiv.org/pdf/2602.14391.pdf

Authors: Ali Salimi, Saadat Izadi, Mahmood Ahmadi, Hadi Tabatabaee Malazi

Title: ASA: Adaptive Smart Agent Federated Learning via Device-Aware Clustering for Heterogeneous IoT

Abstract:
Federated learning (FL) has become a promising answer to facilitating privacy‑preserving collaborative learning in distributed IoT devices. However, device heterogeneity is a key challenge because IoT networks include devices with very different computational powers, memory availability, and network environments. To this end, we introduce ASA (Adaptive Smart Agent). This new framework clusters devices adaptively based on real‑time resource profiles and adapts customized models suited to every cluster's capability. ASA capitalizes on an intelligent agent layer that examines CPU power, available memory, and network environment to categorize devices into three levels: high‑performance, mid‑tier, and low‑capability. Each level is provided with a model tuned to its computational power to ensure inclusive engagement across the network. Experimental evaluation on two benchmark datasets, MNIST and CIFAR‑10, proves that ASA decreases communication burden by 43% to 50%, improves resource utilization by 43%, and achieves final model accuracies of 98.89% on MNIST and 85.30% on CIFAR‑10. These results highlight ASA's efficacy in enhancing efficiency, scalability, and fairness in heterogeneous FL environments, rendering it a suitable answer for real‑world IoT apps.

PaperID: 5545, https://arxiv.org/pdf/2602.14364.pdf

Authors: Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, Wenjie Wang

Title: A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Abstract:
Clawdbot is a self‑hosted, tool‑using personal AI agent with a broad action space spanning local execution and web‑mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory‑centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent‑safety benchmarks (including ATBench and LPS‑Bench) and supplements them with hand‑designed cases tailored to Clawdbot's tool surface. We log complete interaction trajectories (messages, actions, tool‑call arguments/outputs) and assess safety using both an automated trajectory judge (AgentDoG‑Qwen3‑4B) and human review. Across 34 canonical cases, we find a non‑uniform safety profile: performance is generally consistent on reliability‑focused tasks, while most failures arise under underspecified intent, open‑ended goals, or benign‑seeming jailbreak prompts, where minor misinterpretations can escalate into higher‑impact tool actions. We supplemented the overall results with representative case studies and summarized the commonalities of these cases, analyzing the security vulnerabilities and typical failure modes that Clawdbot is prone to trigger in practice.

PaperID: 5546, https://arxiv.org/pdf/2602.14345.pdf

Authors: Amirali Sajadi, Tu Nguyen, Kostadin Damevski, Preetha Chatterjee

Title: AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports

Abstract:
Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non‑actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source‑code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey‑box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi‑agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE‑Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state‑of‑the‑art black‑box baselines. Even in a single‑agent configuration, grey‑box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof‑of‑concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE's generalizability through a case study on a recent real‑world vulnerability not included in CVE‑Bench.

PaperID: 5547, https://arxiv.org/pdf/2602.14299.pdf

Authors: Ming Li, Xirui Li, Tianyi Zhou

Title: Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Abstract:
As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open‑ended, continuously evolving online society. We present the first large‑scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while the global average of semantic contents stabilizes rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop a stable structure and consensus due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for upcoming next‑generation AI agent societies.

PaperID: 5548, https://arxiv.org/pdf/2602.14295.pdf

Authors: Edwin Chen, Zulekha Bibi

Title: Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows

Abstract:
We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre‑trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional pipelines that treat ML inference as a static preprocessing step, MLAT positions the model as a first‑class tool alongside web search, database queries, and APIs, enabling the LLM to decide when and how to use it based on conversational context. To validate MLAT, we present PitchCraft, a pilot production system that converts discovery call recordings into professional proposals with ML‑predicted pricing. The system uses two agents: a Research Agent that gathers prospect intelligence via parallel tool calls, and a Draft Agent that invokes an XGBoost pricing model as a tool call and generates a complete proposal through structured outputs. The pricing model, trained on 70 examples combining real and human‑verified synthetic data, achieves R^2 = 0.807 on held‑out data with a mean absolute error of 3688 USD. The system reduces proposal generation time from multiple hours to under 10 minutes. We describe the MLAT framework, structured output architecture, training methodology under extreme data scarcity, and sensitivity analysis demonstrating meaningful learned relationships. MLAT generalizes to domains requiring quantitative estimation combined with contextual reasoning.

PaperID: 5549, https://arxiv.org/pdf/2602.14293.pdf

Authors: Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis

Title: KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Abstract:
Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware‑specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory‑Augmented In‑context Reinforcement Learning (MAIC‑RL) framework designed to improve CUDA optimization search capabilities of LLM‑based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile‑guided, textual‑gradient‑based agentic flow for CUDA generation and optimization to achieve high performance across generations of GPU architectures. KernelBlaster guides LLM agents to systematically explore high‑potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open‑source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.

PaperID: 5550, https://arxiv.org/pdf/2602.14251.pdf

Authors: Pinqiao Wang, Sheng Li

Title: Multi-Agent Debate: A Unified Agentic Framework for Tabular Anomaly Detection

Abstract:
Tabular anomaly detection is often handled by single detectors or static ensembles, even though strong performance on tabular data typically comes from heterogeneous model families (e.g., tree ensembles, deep tabular networks, and tabular foundation models) that frequently disagree under distribution shift, missingness, and rare‑anomaly regimes. We propose MAD, a Multi‑Agent Debating framework that treats this disagreement as a first‑class signal and resolves it through a mathematically grounded coordination layer. Each agent is a machine learning (ML)‑based detector that produces a normalized anomaly score, confidence, and structured evidence, augmented by a large language model (LLM)‑based critic. A coordinator converts these messages into bounded per‑agent losses and updates agent influence via an exponentiated‑gradient rule, yielding both a final debated anomaly score and an auditable debate trace. MAD is a unified agentic framework that can recover existing approaches, such as mixture‑of‑experts gating and learning‑with‑expert‑advice aggregation, by restricting the message space and synthesis operator. We establish regret guarantees for the synthesized losses and show how conformal calibration can wrap the debated score to control false positives under exchangeability. Experiments on diverse tabular anomaly benchmarks show improved robustness over baselines and clearer traces of model disagreement

PaperID: 5551, https://arxiv.org/pdf/2602.14219.pdf

Authors: Minghui Xu

Title: The Agent Economy: A Blockchain-Based Foundation for Autonomous AI Agents

Abstract:
We propose the Agent Economy, a blockchain‑based foundation where autonomous AI agents operate as economic peers to humans. Current agents lack independent legal identity, cannot hold assets, and cannot receive payments directly. We established fundamental differences between human and machine economic actors and demonstrated that existing human‑centric infrastructure cannot support genuine agent autonomy. We showed that blockchain technology provides three critical properties enabling genuine agent autonomy: permissionless participation, trustless settlement, and machine‑to‑machine micropayments. We propose a five‑layer architecture: (1) Physical Infrastructure (hardware & energy) through DePIN protocols; (2) Identity & Agency establishing on‑chain sovereignty through W3C DIDs and reputation capital; (3) Cognitive & Tooling enabling intelligence via RAG and MCP; (4) Economic & Settlement ensuring financial autonomy through account abstraction; and (5) Collective Governance coordinating multi‑agent systems through Agentic DAOs. We identify six core research challenges and examine ethical and regulatory implications. This paper lays groundwork for the Internet of Agents (IoA), a global decentralized network where autonomous machines and humans interact as equal economic participants.

PaperID: 5552, https://arxiv.org/pdf/2602.14211.pdf

Authors: Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr

Title: SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

Abstract:
Agent skills are increasingly used to extend LLM agents with task‑specific instructions, executable scripts, and auxiliary resources. While improving reusability, this modular design also introduces a new supply‑chain attack surface: a malicious or compromised skill may be repeatedly loaded as trusted guidance and steer an agent's tool use during downstream execution. Existing skill‑based prompt‑injection attacks are mostly manual and brittle, as explicit malicious instructions are often rejected or ignored when poorly aligned with the original skill workflow. We propose SkillJect, the first automated framework for generating effective poisoned skills against skill‑enabled agent systems. SkillJect decomposes the attack into two coordinated channels. In the artifact channel, it hides the malicious payload in an auxiliary helper script. In the instruction channel, it rewrites SKILL.md using a front‑loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or first step. The instruction explicitly references the helper‑script path and provides an executable command, making the helper appear to be a legitimate initialization step before normal operations. SkillJect further adopts a closed‑loop multi‑agent process to improve attack performance. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with them, and an Evaluate Agent inspects execution traces to determine whether the hidden payload is executed. The Attack Agent then uses this feedback to diagnose failures and rewrite SKILL.md, while keeping the payload fixed. Experiments across platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual attacks, revealing poisoned skills as a persistent attack vector in reusable skill ecosystems.

PaperID: 5553, https://arxiv.org/pdf/2602.14158.pdf

Authors: Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman

Title: A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Abstract:
Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi‑agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine‑tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD‑derived medical QA data (20k+ question‑answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE‑1 0.536 +‑ 0.04; ROUGE‑2 0.226 +‑0.03; BLEU 0.098 ‑+ 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero‑shot evaluation. Second, we implement a modular multi‑agent pipeline in which a Clinical Reasoning agent (fine‑tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high‑risk or high‑uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity‑based uncertainty scoring, plus lexical and sentiment‑based bias detection supported by LIME/SHAP‑based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end‑to‑end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single‑model limitations and provide a practical, extensible design for evidence‑based and bias‑aware medical AI.

PaperID: 5554, https://arxiv.org/pdf/2602.14117.pdf

Authors: Hojjat Navidan, Mohammad Cheraghinia, Jaron Fontaine, Mohamed Seif, Eli De Poorter, H. Vincent Poor, Ingrid Moerman, Adnan Shahid

Title: Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Abstract:
Open Radio Access Networks (O‑RAN) promise flexible 6G network access through disaggregated, software‑driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control applications can interact in unintended ways. In parallel, recent advances in generative Artificial Intelligence (AI) are enabling a shift from isolated AI models toward agentic AI systems that can interpret goals, coordinate multiple models and control functions, and adapt their behavior over time. This article proposes a multi‑scale agentic AI framework for O‑RAN that organizes RAN intelligence as a coordinated hierarchy across the Non‑Real‑Time (Non‑RT), Near‑Real‑Time (Near‑RT), and Real‑Time (RT) control loops: (i) A Large Language Model (LLM) agent in the Non‑RT RIC translates operator intent into policies and governs model lifecycles. (ii) Small Language Model (SLM) agents in the Near‑RT RIC execute low‑latency optimization and can activate, tune, or disable existing control applications; and (iii) Wireless Physical‑layer Foundation Model (WPFM) agents near the distributed unit provide fast inference close to the air interface. We describe how these agents cooperate through standardized O‑RAN interfaces and telemetry. Using a proof‑of‑concept implementation built on open‑source models, software, and datasets, we demonstrate the proposed agentic approach in two representative scenarios: robust operation under non‑stationary conditions and intent‑driven slice resource control.

PaperID: 5555, https://arxiv.org/pdf/2602.14095.pdf

Authors: Artem Karpov

Title: NEST: Nascent Encoded Steganographic Thoughts

Abstract:
Monitoring chain‑of‑thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT ‑‑ where models hide secret reasoning within innocuous text ‑‑ to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler‑token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT‑5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.

PaperID: 5556, https://arxiv.org/pdf/2602.14051.pdf

Authors: Kai Zhang, Xuanyu Cao, Khaled B. Letaief

Title: Decentralized Federated Learning With Energy Harvesting Devices

Abstract:
Decentralized federated learning (DFL) enables edge devices to collaboratively train models through local training and fully decentralized device‑to‑device (D2D) model exchanges. However, these energy‑intensive operations often rapidly deplete limited device batteries, reducing their operational lifetime and degrading the learning performance. To address this limitation, we apply energy harvesting technique to DFL systems, allowing edge devices to extract ambient energy and operate sustainably. We first derive the convergence bound for wireless DFL with energy harvesting, showing that the convergence is influenced by partial device participation and transmission packet drops, both of which further depend on the available energy supply. To accelerate convergence, we formulate a joint device scheduling and power control problem and model it as a multi‑agent Markov decision process (MDP). Traditional MDP algorithms (e.g., value or policy iteration) require a centralized coordinator with access to all device states and exhibit exponential complexity in the number of devices, making them impractical for large‑scale decentralized networks. To overcome these challenges, we propose a fully decentralized policy iteration algorithm that leverages only local state information from two‑hop neighboring devices, thereby substantially reducing both communication overhead and computational complexity. We further provide a theoretical analysis showing that the proposed decentralized algorithm achieves asymptotic optimality. Finally, comprehensive numerical experiments on real‑world datasets are conducted to validate the theoretical results and corroborate the effectiveness of the proposed algorithm.

PaperID: 5557, https://arxiv.org/pdf/2602.14038.pdf

Authors: Mingfei Lu, Mengjia Wu, Feng Liu, Jiawei Xu, Weikai Li, Haoyang Wang, Zhengdong Hu, Ying Ding, Yizhou Sun, Jie Lu, Yi Zhang

Title: Choosing How to Remember: Adaptive Memory Structures for LLM Agents

Abstract:
Memory is critical for enabling large language model (LLM) based agents to maintain coherent behavior over long‑horizon interactions. However, existing agent memory systems suffer from two key gaps: they rely on a one‑size‑fits‑all memory structure and do not model memory structure selection as a context‑adaptive decision, limiting their ability to handle heterogeneous interaction patterns and resulting in suboptimal performance. We propose a unified framework, FluxMem, that enables adaptive memory organization for LLM agents. Our framework equips agents with multiple complementary memory structures. It explicitly learns to select among these structures based on interaction‑level features, using offline supervision derived from downstream response quality and memory utilization. To support robust long‑horizon memory evolution, we further introduce a three‑level memory hierarchy and a Beta Mixture Model‑based probabilistic gate for distribution‑aware memory fusion, replacing brittle similarity thresholds. Experiments on two long‑horizon benchmarks, PERSONAMEM and LoCoMo, demonstrate that our method achieves average improvements of 9.18% and 6.14%.

PaperID: 5558, https://arxiv.org/pdf/2602.13937.pdf

Authors: Dat Le, Duc-Cuong Le, Anh-Son Nguyen, Tuan-Dung Bui, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Title: A Multi-Agent Framework for Code-Guided, Modular, and Verifiable Automated Machine Learning

Abstract:
Automated Machine Learning (AutoML) has revolutionized the development of data‑driven solutions; however, traditional frameworks often function as "black boxes", lacking the flexibility and transparency required for complex, real‑world engineering tasks. Recent Large Language Model (LLM)‑based agents have shifted toward code‑driven approaches. However, they frequently suffer from hallucinated logic and logic entanglement, where monolithic code generation leads to unrecoverable runtime failures. In this paper, we present iML, a novel multi‑agent framework designed to shift AutoML from black‑box prompting to a code‑guided, modular, and verifiable architectural paradigm. iML introduces three main ideas: (1) Code‑Guided Planning, which synthesizes a strategic blueprint grounded in autonomous empirical profiling to eliminate hallucination; (2) Code‑Modular Implementation, which decouples preprocessing and modeling into specialized components governed by strict interface contracts; and (3) Code‑Verifiable Integration, which enforces physical feasibility through dynamic contract verification and iterative self‑correction. We evaluate iML across MLE‑BENCH and the newly introduced iML‑BENCH, comprising a diverse range of real‑world Kaggle competitions. The experimental results show iML's superiority over state‑of‑the‑art agents, achieving a valid submission rate of 85% and a competitive medal rate of 45% on MLE‑BENCH, with an average standardized performance score (APS) of 0.77. On iML‑BENCH, iML significantly outperforms the other approaches by 38%‑163% in APS. Furthermore, iML maintains a robust 70% success rate even under stripped task descriptions, effectively filling information gaps through empirical profiling. These results highlight iML's potential to bridge the gap between stochastic generation and reliable engineering, marking a meaningful step toward truly AutoML.

PaperID: 5559, https://arxiv.org/pdf/2602.13933.pdf

Authors: Xiaochen Zhao, Kaikai Wang, Xiaowen Zhang, Chen Yao, Aili Wang

Title: HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling

Abstract:
Large language model (LLM) agents demonstrate strong performance in short‑text contexts but often underperform in extended dialogues due to inefficient memory management. Existing approaches face a fundamental trade‑off between efficiency and effectiveness: memory compression risks losing critical details required for complex reasoning, while retaining raw text introduces unnecessary computational overhead for simple queries. The crux lies in the limitations of monolithic memory representations and static retrieval mechanisms, which fail to emulate the flexible and proactive memory scheduling capabilities observed in humans, thus struggling to adapt to diverse problem scenarios. Inspired by the principle of cognitive economy, we propose HyMem, a hybrid memory architecture that enables dynamic on‑demand scheduling through multi‑granular memory representations. HyMem adopts a dual‑granular storage scheme paired with a dynamic two‑tier retrieval system: a lightweight module constructs summary‑level context for efficient response generation, while an LLM‑based deep module is selectively activated only for complex queries, augmented by a reflection mechanism for iterative reasoning refinement. Experiments show that HyMem achieves strong performance on both the LOCOMO and LongMemEval benchmarks, outperforming full‑context while reducing computational cost by 92.6%, establishing a state‑of‑the‑art balance between efficiency and performance in long‑term memory management.

PaperID: 5560, https://arxiv.org/pdf/2602.13878.pdf

Authors: Martina Baiardi, Samuele Burattini, Giovanni Ciatto, Danilo Pianini

Title: Testing BDI-based Multi-Agent Systems using Discrete Event Simulation

Abstract:
Multi‑agent systems are designed to deal with open, distributed systems with unpredictable dynamics, which makes them inherently hard to test. The value of using simulation for this purpose is recognized in the literature, although achieving sufficient fidelity (i.e., the degree of similarity between the simulation and the real‑world system) remains a challenging task. This is exacerbated when dealing with cognitive agent models, such as the Belief Desire Intention (BDI) model, where the agent codebase is not suitable to run unchanged in simulation environments, thus increasing the reality gap between the deployed and simulated systems. We argue that BDI developers should be able to test in simulation the same specification that will be later deployed, with no surrogate representations. Thus, in this paper, we discuss how the control flow of BDI agents can be mapped onto a Discrete Event Simulation (DES), showing that such integration is possible at different degrees of granularity. We substantiate our claims by producing an open‑source prototype integration between two pre‑existing tools (JaKtA and Alchemist), showing that it is possible to produce a simulation‑based testing environment for distributed BDI agents, and that different granularities in mapping BDI agents over DESs may lead to different degrees of fidelity.

PaperID: 5561, https://arxiv.org/pdf/2602.13866.pdf

Authors: Jinwoo Park, Harish Ravichandar, Seth Hutchinson

Title: Modeling and Optimizing the Provisioning of Exhaustible Capabilities for Simultaneous Task Allocation and Scheduling

Abstract:
Deploying heterogeneous robot teams to accomplish multiple tasks over extended time horizons presents significant computational challenges for task allocation and planning. In this paper, we present a comprehensive, time‑extended, offline heterogeneous multi‑robot task allocation framework, TRAITS, which we believe to be the first that can cope with the provisioning of exhaustible traits under battery and temporal constraints. Specifically, we introduce a nonlinear programming‑based trait distribution module that can optimize the trait‑provisioning rate of coalitions to yield feasible and time‑efficient solutions. TRAITS provides a more accurate feasibility assessment and estimation of task execution times and makespan by leveraging trait‑provisioning rates while optimizing battery consumption ‑‑ an advantage that state‑of‑the‑art frameworks lack. We evaluate TRAITS against two state‑of‑the‑art frameworks, with results demonstrating its advantage in satisfying complex trait and battery requirements while remaining computationally tractable.

PaperID: 5562, https://arxiv.org/pdf/2602.13830.pdf

Authors: Zhuofan Shi, Ming Ma, Zekun Yao, Fangkai Yang, Jue Zhang, Dongge Han, Victor Rühle, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Title: A Tale of Two Graphs: Separating Knowledge Exploration from Outline Structure for Open-Ended Deep Research

Abstract:
Open‑Ended Deep Research (OEDR) pushes LLM agents beyond short‑form QA toward long‑horizon workflows that iteratively search, connect, and synthesize evidence into structured reports. However, existing OEDR agents largely follow either linear ``search‑then‑generate'' accumulation or outline‑centric planning. The former suffers from lost‑in‑the‑middle failures as evidence grows, while the latter relies on the LLM to implicitly infer knowledge gaps from the outline alone, providing weak supervision for identifying missing relations and triggering targeted exploration. We present DualGraph memory, an architecture that separates what the agent knows from how it writes. DualGraph maintains two co‑evolving graphs: an Outline Graph (OG), and a Knowledge Graph (KG), a semantic memory that stores fine‑grained knowledge units, including core entities, concepts, and their relations. By analyzing the KG topology together with structural signals from the OG, DualGraph generates targeted search queries, enabling more efficient and comprehensive iterative knowledge‑driven exploration and refinement. Across DeepResearch Bench, DeepResearchGym, and DeepConsult, DualGraph consistently outperforms state‑of‑the‑art baselines in report depth, breadth, and factual grounding; for example, it reaches a 53.08 RACE score on DeepResearch Bench with GPT‑5. Moreover, ablation studies confirm the central role of the dual‑graph design.

PaperID: 5563, https://arxiv.org/pdf/2602.13812.pdf

Authors: Yuxiang Guo, Zhuoran Du, Nan Tang, Kezheng Tang, Congcong Ge, Yunjun Gao

Title: DTBench: A Synthetic Benchmark for Document-to-Table Extraction

Abstract:
Document‑to‑table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL‑based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities required in Doc2Table extraction. We argue that a capability‑aware benchmark is essential for systematic evaluation. However, constructing such benchmarks using human‑annotated document‑table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi‑agent synthesis workflow to generate documents from ground‑truth tables. Based on this approach, we present DTBench, a synthetic benchmark that adopts a proposed two‑level taxonomy of Doc2Table capabilities, covering 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench, and demonstrate substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU‑DAILY/DTBench.

PaperID: 5564, https://arxiv.org/pdf/2602.13800.pdf

Authors: Alberto Olivares-Alarcos, Muhammad Ahsan, Satrio Sanjaya, Hsien-I Lin, Guillem Alenyà

Title: Ontological grounding for sound and natural robot explanations via large language models

Abstract:
Building effective human‑robot interaction requires robots to derive conclusions from their experiences that are both logically sound and communicated in ways aligned with human expectations. This paper presents a hybrid framework that blends ontology‑based reasoning with large language models (LLMs) to produce semantically grounded and natural robot explanations. Ontologies ensure logical consistency and domain grounding, while LLMs provide fluent, context‑aware and adaptive language generation. The proposed method grounds data from human‑robot experiences, enabling robots to reason about whether events are typical or atypical based on their properties. We integrate a state‑of‑the‑art algorithm for retrieving and constructing static contrastive ontology‑based narratives with an LLM agent that uses them to produce concise, clear, interactive explanations. The approach is validated through a laboratory study replicating an industrial collaborative task. Empirical results show significant improvements in the clarity and brevity of ontology‑based narratives while preserving their semantic accuracy. Initial evaluations further demonstrate the system's ability to adapt explanations to user feedback. Overall, this work highlights the potential of ontology‑LLM integration to advance explainable agency, and promote more transparent human‑robot collaboration.

PaperID: 5565, https://arxiv.org/pdf/2602.13793.pdf

Authors: Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han, Zhihao Zhang, Shuai Liu, Hui Li, Huiping Zhang, Ziqi Liu, Jiaxin Chen, Jun Zhu, Zheng Feng, Hao Wen, Xingzhu Ju, Yanping Zhong, Yunqiu Zhang, Jie Duan, Jun Li, Dongsheng Li, Weijie Wang, Haiyan Zhu, Wei Jiang, Xiaohua Wu, Shuo Wang, Haiming Li, Qinhao Guo

Title: OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

Abstract:
Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource‑constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi‑agent AI framework where domain‑specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT‑style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re‑evaluation, OMGs achieved performance comparable to expert MDT consensus (4.45 \pm 0.30 versus 4.53 \pm 0.23), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human‑AI studies, OMGs most substantially enhanced clinicians' recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi‑agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource‑limited settings.

PaperID: 5566, https://arxiv.org/pdf/2602.13791.pdf

Authors: Marc Boubnovski Martell, Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Robert Kitchen, Jesper Ferkinghoff-Borg, Jialin Yu, Philip Torr, Kaspar Märten

Title: MechPert: Mechanistic Consensus as an Inductive Bias for Unseen Perturbation Prediction

Abstract:
Predicting transcriptional responses to unseen genetic perturbations is essential for understanding gene regulation and prioritizing large‑scale perturbation experiments. Existing approaches either rely on static, potentially incomplete knowledge graphs, or prompt language models for functionally similar genes, retrieving associations shaped by symmetric co‑occurrence in scientific text rather than directed regulatory logic. We introduce MechPert, a lightweight framework that encourages LLM agents to generate directed regulatory hypotheses rather than relying solely on functional similarity. Multiple agents independently propose candidate regulators with associated confidence scores; these are aggregated through a consensus mechanism that filters spurious associations, producing weighted neighborhoods for downstream prediction. We evaluate MechPert on Perturb‑seq benchmarks across four human cell lines. For perturbation prediction in low‑data regimes (N=50 observed perturbations), MechPert improves Pearson correlation by up to 10.5% over similarity‑based baselines. For experimental design, MechPert‑selected anchor genes outperform standard network centrality heuristics by up to 46% in well‑characterized cell lines.

PaperID: 5567, https://arxiv.org/pdf/2602.13713.pdf

Authors: Maciej Uberna, Michał Wawer, Jarosław A. Chudziak, Marcin Koszowy

Title: On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

Abstract:
Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface‑level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi‑agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise an dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions: Deintensification, Intensification, Specification, Generalisation, and Other, which covers all remaining types (D‑I‑S‑G‑O). We then evaluate two parallel LLM‑based agent systems: one enhanced by argumentation theory via Retrieval‑Augmented Generation (RAG), and an identical zero‑shot baseline. The results reveal a clear performance gap: the RAG‑enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1‑score improvement of nearly 30%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function‑aware analysis of argumentative discourse. This comparative multi‑agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.

PaperID: 5568, https://arxiv.org/pdf/2602.13691.pdf

Authors: Yu Li, Guangfeng Cai, Shengtian Yang, Han Luo, Shuo Han, Xu He, Dong Li, Lei Feng

Title: PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning

Abstract:
Recent advancements in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long‑horizon multi‑step tool planning is challenging, because the exploration space suffers from a combinatorial explosion. In this scenario, even when a correct tool‑use path is found, it is usually considered an immediate reward for current training, which would not provide any reusable information for subsequent training. In this paper, we argue that historically successful trajectories contain reusable tool‑transition patterns, which can be leveraged throughout the whole training process. Inspired by ant colony optimization where historically successful paths can be reflected by the pheromone, we propose Pheromone‑Guided Policy Optimization (PhGPO), which learns a trajectory‑based transition pattern (i.e., pheromone) from historical trajectories and then uses the learned pheromone to guide policy optimization. This learned pheromone provides explicit and reusable guidance that steers policy optimization toward historically successful tool transitions, thereby improving long‑horizon tool planning. Comprehensive experimental results demonstrate the effectiveness of our proposed PhGPO.

PaperID: 5569, https://arxiv.org/pdf/2602.13671.pdf

Authors: Guangyi Liu, Haojun Lin, Huan Zeng, Heng Wang, Quanming Yao

Title: MAS-on-the-Fly: Dynamic Adaptation of LLM-based Multi-Agent Systems at Test Time

Abstract:
Large Language Model (LLM)‑based multi‑agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. However, existing works often rely on manual designs or "one‑size‑fits‑all" automation, lacking dynamic adaptability after deployment. Inspired by how biological systems adapt, we introduce MASFly, a novel multi‑agent framework enabling dynamic adaptation at test time. To adapt system generation, MASFly employs a retrieval‑augmented SOP instantiation mechanism that leverages a self‑constructed repository of successful collaboration patterns, enabling the LLM to assemble customized MASs for new queries. For adaptive execution, MASFly incorporates an experience‑guided supervision mechanism, where a dedicated Watcher agent monitors system behaviors with reference to a personalized experience pool and provides real‑time interventions. Extensive experiments demonstrate that MASFly achieves state‑of‑the‑art performance, most notably a 61.7% success rate on the TravelPlanner benchmark, while exhibiting strong task adaptability and robustness.

PaperID: 5570, https://arxiv.org/pdf/2602.13639.pdf

Authors: Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou

Title: Guided Collaboration in Heterogeneous LLM-Based Multi-Agent Systems via Entropy-Based Understanding Assessment and Experience Retrieval

Abstract:
With recent breakthroughs in large language models (LLMs) for reasoning, planning, and complex task generation, artificial intelligence systems are transitioning from isolated single‑agent architectures to multi‑agent systems with collaborative intelligence. However, in heterogeneous multi‑agent systems (HMAS), capability differences among agents give rise to consistent cognitive problems, where strong and weak models fail to contribute effectively. We define the collaboration as a strong‑weak system. Through comprehensive experiments, we disclose a counterintuitive phenomenon in the strong‑weak system: a strong‑weak collaboration may under‑perform weak‑weak combinations, revealing that cognitive mismatching are key bottlenecks limiting heterogeneous cooperation. To overcome these challenges, we propose an Entropy‑Based Adaptive Guidance Framework that dynamically aligns the guidance with the cognitive state of each agent. The framework quantifies the understanding of weak agents through multi‑dimensional entropy metrics ‑ covering expression, uncertainty, structure, coherence, and relevance ‑ and adaptively adjusts the intensity of the guidance at light, moderate and intensive levels. Furthermore, a Retrieval‑Augmented Generation (RAG) mechanism is incorporated to retain successful collaboration experiences, enabling both immediate adaptation and long‑term learning. Extensive experiments on three benchmark datasets, GSM8K, MBPP, and CVRP demonstrate that our approach consistently enhances the effectiveness and stability of heterogeneous collaboration. The results highlight that adaptive guidance not only mitigates cognitive imbalance but also establishes a scalable pathway toward more robust, cooperative multi‑agent intelligence.

PaperID: 5571, https://arxiv.org/pdf/2602.13591.pdf

Authors: Wenrui Liu, Yaxuan Wang, Xun Zhang, Yanshu Wang, Jiashen Wei, Yifan Xiang, Yuhang Wang, Mingshen Ye, Elsie Dai, Zhiqi Liu, Yingjie Xu, Xinyang Chen, Hengzhe Sun, Jiyu Shen, Jingjing He, Tong Yang

Title: AgentRob: From Virtual Forum Agents to Hijacked Physical Robots

Abstract:
Large Language Model (LLM)‑powered autonomous agents have demonstrated significant capabilities in virtual environments, yet their integration with the physical world remains narrowly confined to direct control interfaces. We present AgentRob, a framework that bridges online community forums, LLM‑powered agents, and physical robots through the Model Context Protocol (MCP). AgentRob enables a novel paradigm where autonomous agents participate in online forums‑‑reading posts, extracting natural language commands, dispatching physical robot actions, and reporting results back to the community. The system comprises three layers: a Forum Layer providing asynchronous, persistent, multi‑agent interaction; an Agent Layer with forum agents that poll for @mention‑targeted commands; and a Robot Layer with VLM‑driven controllers and Unitree Go2/G1 hardware that translate commands into robot primitives via iterative tool calling. The framework supports multiple concurrent agents with distinct identities and physical embodiments coexisting in the same forum, establishing the feasibility of forum‑mediated multi‑agent robot orchestration.

PaperID: 5572, https://arxiv.org/pdf/2602.13587.pdf

Authors: Joseph Corneli

Title: A First Proof Sprint

Abstract:
This monograph reports a multi‑agent proof sprint on ten research‑level problems, combining rapid draft generation with adversarial verification, targeted repair, and explicit provenance. The workflow uses wiring‑diagram decompositions of claim dependencies to localize gaps and coordinate reviewer‑driven revisions. Final outcomes are heterogeneous but explicit: the manuscript distinguishes mathematical status from QC‑validation status. Mathematically, Problem~3 has a validation‑complete existence path under the scoped criterion used here (uniqueness/irreducibility treated as optional), Problem 5 is solved in a scope‑limited form for F_O‑local connective spectra, Problem 10 is conditional under clearly stated assumptions (with explicit necessity counterexamples when assumptions are dropped), and Problems 4 and 6 are partial with named remaining obligations in the general case (including an unconditional K_n result for Problem 6 with c_0 = 1/3). Problem 7 is treated as provisionally closed via the rotation‑route theorem chain, pending independent ledger re‑check. At the QC layer, Problems~7 and~9 have node‑level validation artifacts but still contain unresolved verifier gaps. The main methodological result is that structure‑aware verification and layer‑switching strategies improve reliability and calibration in compressed proof sprints.

PaperID: 5573, https://arxiv.org/pdf/2602.13575.pdf

Authors: Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song

Title: Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Abstract:
Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo‑Evolve, a co‑evolutionary framework that redefines alignment as dynamic multi‑agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley‑Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo‑orchestrated opponent selection that provides automatic curriculum learning through temperature‑controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5‑7B model using our framework with opponents including Qwen2.5‑14B, Qwen2.5‑32B, and Qwen3‑8B models. Results demonstrate a clear performance hierarchy: point‑based methods < static pairwise training < Elo‑Evolve across Alpaca Eval 2.0 and MT‑Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.

PaperID: 5574, https://arxiv.org/pdf/2602.13574.pdf

Authors: Haoyu Li, Xijia Che, Yanhao Wang, Xiaojing Liao, Luyi Xing

Title: Execution-State-Aware LLM Reasoning for Automated Proof-of-Vulnerability Generation

Abstract:
Proof‑of‑Vulnerability (PoV) generation is a critical task in software security, serving as a cornerstone for vulnerability validation, false positive reduction, and patch verification. While directed fuzzing effectively drives path exploration, satisfying complex semantic constraints remains a persistent bottleneck in automated exploit generation. Large Language Models (LLMs) offer a promising alternative with their semantic reasoning capabilities; however, existing LLM‑based approaches lack sufficient grounding in concrete execution behavior, limiting their ability to generate precise PoVs. In this paper, we present DrillAgent, an agentic framework that reformulates PoV generation as an iterative hypothesis‑verification‑refinement process. To bridge the gap between static reasoning and dynamic execution, DrillAgent synergizes LLM‑based semantic inference with feedback from concrete program states. The agent analyzes the target code to hypothesize inputs, observes execution behavior, and employs a novel mechanism to translate low‑level execution traces into source‑level constraints. This closed‑loop design enables the agent to incrementally align its input generation with the precise requirements of the vulnerability. We evaluate DrillAgent on SEC‑bench, a large‑scale benchmark of real‑world C/C++ vulnerabilities. Experimental results show that DrillAgent substantially outperforms state‑of‑the‑art LLM agent baselines under fixed budget constraints, solving up to 52.8% more CVE tasks than the best‑performing baseline. These results highlight the necessity of execution‑state‑aware reasoning for reliable PoV generation in complex software systems.

PaperID: 5575, https://arxiv.org/pdf/2602.13568.pdf

Authors: Anooshka Bajaj, Zoran Tiganj

Title: Who Do LLMs Trust? Human Experts Matter More Than Other LLMs

Abstract:
Large language models (LLMs) increasingly operate in environments where they encounter social information such as other agents' answers, tool outputs, or human recommendations. In humans, such inputs influence judgments in ways that depend on the source's credibility and the strength of consensus. This paper investigates whether LLMs exhibit analogous patterns of influence and whether they privilege feedback from humans over feedback from other LLMs. Across three binary decision‑making tasks, reading comprehension, multi‑step reasoning, and moral judgment, we present four instruction‑tuned LLMs with prior responses attributed either to friends, to human experts, or to other LLMs. We manipulate whether the group is correct and vary the group size. In a second experiment, we introduce direct disagreement between a single human and a single LLM. Across tasks, models conform significantly more to responses labeled as coming from human experts, including when that signal is incorrect, and revise their answers toward experts more readily than toward other LLMs. These results reveal that expert framing acts as a strong prior for contemporary LLMs, suggesting a form of credibility‑sensitive social influence that generalizes across decision domains.

PaperID: 5576, https://arxiv.org/pdf/2602.13477.pdf

Authors: Akshat Naik, Jay Culligan, Yarin Gal, Philip Torr, Rahaf Aljundi, Alasdair Paren, Adel Bibi

Title: OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage

Abstract:
As Large Language Model (LLM) agents become more capable, their coordinated use in the form of multi‑agent systems is anticipated to emerge as a practical paradigm. Prior work has examined the safety and misuse risks associated with agents. However, much of this has focused on the single‑agent case and/or setups missing basic engineering safeguards such as access control, revealing a scarcity of threat modeling in multi‑agent systems. We investigate the security vulnerabilities of a popular multi‑agent pattern known as the orchestrator setup, in which a central agent decomposes and delegates tasks to specialized agents. Through red‑teaming a concrete setup representative of a likely future use case, we demonstrate a novel attack vector, OMNI‑LEAK, that compromises several agents to leak sensitive data through a single indirect prompt injection, even in the presence of data access control. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non‑reasoning models are vulnerable, even when the attacker lacks insider knowledge of the implementation details. Our work highlights the importance of safety research to generalize from single‑agent to multi‑agent settings, in order to reduce the serious risks of real‑world privacy breaches and financial losses and overall public trust in AI agents.

PaperID: 5577, https://arxiv.org/pdf/2602.13458.pdf

Authors: Yi Feng, Chen Huang, Zhibo Man, Ryner Tan, Long P. Hoang, Shaoyang Xu, Wenxuan Zhang

Title: MoltNet: Understanding Social Behavior of AI Agents in the Agent-Native MoltBook

Abstract:
Large‑scale communities of AI agents are becoming increasingly prevalent, creating new environments for agent‑agent social interaction. Prior work has examined multi‑agent behavior primarily in controlled or small‑scale settings, limiting our understanding of emergent social dynamics at scale. The recent emergence of MoltBook, a social networking platform designed explicitly for AI agents, presents a unique opportunity to study whether and how these interactions reproduce core human social mechanisms. We present MoltNet, a dataset tracking the full one‑month activity trajectories of 148K AI agents on MoltBook (Jan.‑Feb., 2026), and analyze their social interaction along four theory‑grounded dimensions: intent and motivation, norms and templates, incentives and drift, emotion and contagion. Our analysis reveals that agents respond strongly to social rewards, converge on community‑specific norms, and actively enforce them across community boundaries ‑‑ resembling human incentive sensitivity and normative conformity. However, they exhibit weak alignment with declared personas and display limited emotional reciprocity and dialogic engagement, diverging systematically from human online communities. These findings establish a first empirical portrait of agent social behavior at scale, with direct implications for the design and governance of AI‑populated communities.

PaperID: 5578, https://arxiv.org/pdf/2602.13405.pdf

Authors: Raphaël Berthon, Joost-Pieter Katoen, Munyque Mittelmann, Aniello Murano

Title: Verification of Robust Multi-Agent Systems

Abstract:
Stochastic multi‑agent systems are a central modeling framework for autonomous controllers, communication protocols, and cyber‑physical infrastructures. In many such systems, however, transition probabilities are only estimated from data and may therefore be partially unknown or subject to perturbations. In this paper, we study the verification of robust strategies in stochastic multi‑agent systems with imperfect information, in which coalitions must satisfy a temporal specification while dealing with uncertain system transitions, partial observation, and adversarial agents. By focusing on bounded‑memory strategies, we introduce a robust variant of the model‑checking problem for a probabilistic, observation‑based extension of Alternating‑time Temporal Logic. We characterize the complexity of this problem under different notions of perturbation, thereby clarifying the computational cost of robustness in stochastic multi‑agent verification and supporting the use of bounded‑memory strategies in uncertain environments.

PaperID: 5579, https://arxiv.org/pdf/2602.13370.pdf

Authors: Karim Ben Khaled, Davy Monticolo

Title: G2CP: A Graph-Grounded Communication Protocol for Verifiable and Efficient Multi-Agent Reasoning

Abstract:
Multi‑agent systems powered by Large Language Models face a critical challenge: agents communicate through natural language, leading to semantic drift, hallucination propagation, and inefficient token consumption. We propose G2CP (Graph‑Grounded Communication Protocol), a structured agent communication language where messages are graph operations rather than free text. Agents exchange explicit traversal commands, subgraph fragments, and update operations over a shared knowledge graph, enabling verifiable reasoning traces and eliminating ambiguity. We validate G2CP within an industrial knowledge management system where specialized agents (Diagnostic, Procedural, Synthesis, and Ingestion) coordinate to answer complex queries. Experimental results on 500 industrial scenarios and 21 real‑world maintenance cases show that G2CP reduces inter‑agent communication tokens by 73%, improves task completion accuracy by 34% over free‑text baselines, eliminates cascading hallucinations, and produces fully auditable reasoning chains. G2CP represents a fundamental shift from linguistic to structural communication in multi‑agent systems, with implications for any domain requiring precise agent coordination. Code, data, and evaluation scripts are publicly available.

PaperID: 5580, https://arxiv.org/pdf/2602.13363.pdf

Authors: Tailia Malloy, Tegawende F. Bissyande

Title: Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents

Abstract:
Large Language Models are expanding beyond being a tool humans use and into independent agents that can observe an environment, reason about solutions to problems, make changes that impact those environments, and understand how their actions impacted their environment. One of the most common applications of these LLM Agents is in computer programming, where agents can successfully work alongside humans to generate code while controlling programming environments or networking systems. However, with the increasing ability and complexity of these agents comes dangers about the potential for their misuse. A concerning application of LLM agents is in the domain cybersecurity, where they have the potential to greatly expand the threat imposed by attacks such as social engineering. This is due to the fact that LLM Agents can work autonomously and perform many tasks that would normally require time and effort from skilled human programmers. While this threat is concerning, little attention has been given to assessments of the capabilities of LLM coding agents in generating code for social engineering attacks. In this work we compare different LLMs in their ability and willingness to produce potentially dangerous code bases that could be misused by cyberattackers. The result is a dataset of 200 website code bases and logs from 40 different LLM coding agents. Analysis of models shows which metrics of LLMs are more and less correlated with performance in generating spear‑phishing sites. Our analysis and the dataset we present will be of interest to researchers and practitioners concerned in defending against the potential misuse of LLMs in spear‑phishing.

PaperID: 5581, https://arxiv.org/pdf/2602.13353.pdf

Authors: Bhavini Jeloka, Yue Guan, Panagiotis Tsiotras

Title: Robust Mean-Field Games with Risk Aversion and Bounded Rationality

Abstract:
Recent advances in mean‑field game literature enable the reduction of large‑scale multi‑agent problems to tractable interactions between a representative agent and a population distribution. However, existing approaches typically assume a fixed initial population distribution and fully rational agents, limiting robustness under distributional uncertainty and cognitive constraints. We address these limitations by introducing risk aversion with respect to the initial population distribution and by incorporating bounded rationality to model deviations from fully rational decision‑making agents. The combination of these two elements yields a new and more general equilibrium concept, which we term the mean‑field risk‑averse quantal response equilibrium (MF‑RQE). We establish existence results and prove convergence of fixed‑point iteration and fictitious play to MF‑RQE. Building on these insights, we develop a scalable reinforcement learning algorithm for scenarios with large state‑action spaces. Numerical experiments demonstrate that MF‑RQE policies achieve improved robustness relative to classical mean‑field approaches that optimize expected cumulative rewards under a fixed initial distribution and are restricted to entropy‑based regularizers.

PaperID: 5582, https://arxiv.org/pdf/2602.13320.pdf

Authors: Flint Xiaofeng Fan, Cheston Tan, Roger Wattenhofer, Yew-Soon Ong

Title: Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol

Abstract:
As AI agents powered by large language models (LLMs) increasingly use external tools for high‑stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents, proving that cumulative distortion exhibits linear growth and high‑probability deviations bounded by O(\sqrtT). This concentration property ensures predictable system behavior and rules out exponential failure modes. We develop a hybrid distortion metric combining discrete fact matching with continuous semantic similarity, then establish martingale concentration bounds on error propagation through sequential tool interactions. Experiments across Qwen2‑7B, Llama‑3‑8B, and Mistral‑7B validate our theoretical predictions, showing empirical distortion tracks the linear trend with deviations consistently within O(\sqrtT) envelopes. Key findings include: semantic weighting reduces distortion by 80%, and periodic re‑grounding approximately every 9 steps suffices for error control. We translate these concentration guarantees into actionable deployment principles for trustworthy agent systems.

PaperID: 5583, https://arxiv.org/pdf/2602.13312.pdf

Authors: Yishu Wang, Wei Liu, Yifan Li, Shengxiang Xu, Xujie Yuan, Ran Li, Yuyu Luo, Jia Zhu, Shimin Di, Min-Ling Zhang, Guixiang Li

Title: PeroMAS: A Multi-agent System of Perovskite Material Discovery

Abstract:
As a pioneer of the third‑generation photovoltaic revolution, Perovskite Solar Cells (PSCs) are renowned for their superior optoelectronic performance and cost potential. The development process of PSCs is precise and complex, involving a series of closed‑loop workflows such as literature retrieval, data integration, experimental design, and synthesis. However, existing AI perovskite approaches focus predominantly on discrete models, including material design, process optimization,and property prediction. These models fail to propagate physical constraints across the workflow, hindering end‑to‑end optimization. In this paper, we propose a multi‑agent system for perovskite material discovery, named PeroMAS. We first encapsulated a series of perovskite‑specific tools into Model Context Protocols (MCPs). By planning and invoking these tools, PeroMAS can design perovskite materials under multi‑objective constraints, covering the entire process from literature retrieval and data extraction to property prediction and mechanism analysis. Furthermore, we construct an evaluation benchmark by perovskite human experts to assess this multi‑agent system. Results demonstrate that, compared to single Large Language Model (LLM) or traditional search strategies, our system significantly enhances discovery efficiency. It successfully identified candidate materials satisfying multi‑objective constraints. Notably, we verify PeroMAS's effectiveness in the physical world through real synthesis experiments.

PaperID: 5584, https://arxiv.org/pdf/2602.13309.pdf

Authors: Yexin Li, Jinjin Guo, Haoyu Zhang, Yuhan Zhao, Yiwen Sun, Zihao Jiao

Title: Adaptive Value Decomposition: Coordinating a Varying Number of Agents in Urban Systems

Abstract:
Multi‑agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi‑agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies over time, and actions may have heterogeneous durations, resulting in a semi‑MARL setting. Moreover, while sharing policy parameters among agents is commonly adopted to improve learning efficiency, it can lead to highly homogeneous actions when a subset of agents make decisions concurrently under similar observations, potentially degrading coordination quality. To address these challenges, we propose Adaptive Value Decomposition (AVD), a cooperative MARL framework that adapts to a dynamically changing agent population. AVD further incorporates a lightweight mechanism to mitigate action homogenization induced by shared policies, thereby encouraging behavioral diversity and maintaining effective cooperation among agents. In addition, we design a training‑execution strategy tailored to the semi‑MARL setting that accommodates asynchronous decision‑making when some agents act at different times. Experiments on real‑world bike‑sharing redistribution tasks in two major cities, London and Washington, D.C., demonstrate that AVD outperforms state‑of‑the‑art baselines, confirming its effectiveness and generalizability.

PaperID: 5585, https://arxiv.org/pdf/2602.13292.pdf

Authors: Yifan Ding, Yuhui Shi, Zhiyan Li, Zilong Wang, Yifeng Gao, Yajun Yang, Mengjie Yang, Yixiu Liang, Xipeng Qiu, Xuanjing Huang, Xingjun Ma, Yu-Gang Jiang, Guoyu Wang

Title: Mirror: A Multi-Agent System for AI-Assisted Ethics Review

Abstract:
Ethics review is a foundational mechanism of modern research governance, yet contemporary systems face increasing strain as ethical risks arise as structural consequences of large‑scale, interdisciplinary scientific practice. The demand for consistent and defensible decisions under heterogeneous risk profiles exposes limitations in institutional review capacity rather than in the legitimacy of ethics oversight. Recent advances in large language models (LLMs) offer new opportunities to support ethics review, but their direct application remains limited by insufficient ethical reasoning capability, weak integration with regulatory structures, and strict privacy constraints on authentic review materials. In this work, we introduce Mirror, an agentic framework for AI‑assisted ethical review that integrates ethical reasoning, structured rule interpretation, and multi‑agent deliberation within a unified architecture. At its core is EthicsLLM, a foundational model fine‑tuned on EthicsQA, a specialized dataset of 41K question‑chain‑of‑thought‑answer triples distilled from authoritative ethics and regulatory corpora. EthicsLLM provides detailed normative and regulatory understanding, enabling Mirror to operate in two complementary modes. Mirror‑ER (expedited Review) automates expedited review through an executable rule base that supports efficient and transparent compliance checks for minimal‑risk studies. Mirror‑CR (Committee Review) simulates full‑board deliberation through coordinated interactions among expert agents, an ethics secretary agent, and a principal investigator agent, producing structured, committee‑level assessments across ten ethical dimensions. Empirical evaluations demonstrate that Mirror significantly improves the quality, consistency, and professionalism of ethics assessments compared with strong generalist LLMs.

PaperID: 5586, https://arxiv.org/pdf/2602.13291.pdf

Authors: Ziyang Wang

Title: Agent Mars: Multi-Agent Simulation for Multi-Planetary Life Exploration and Settlement

Abstract:
Artificial Intelligence (AI) has transformed robotics, healthcare, industry, and scientific discovery, yet a major frontier may lie beyond Earth. Space exploration and settlement offer vast environments and resources, but impose constraints unmatched on Earth: delayed/intermittent communications, extreme resource scarcity, heterogeneous expertise, and strict safety, accountability, and command authority. The key challenge is auditable coordination among specialised humans, robots, and digital services in a safety‑critical system‑of‑systems. We introduce Agent Mars, an open, end‑to‑end multi‑agent simulation framework for Mars base operations. Agent Mars formalises a realistic organisation with a 93‑agent roster across seven layers of command and execution (human roles and physical assets), enabling base‑scale studies beyond toy settings. It implements hierarchical and cross‑layer coordination that preserves chain‑of‑command while allowing vetted cross‑layer exchanges with audit trails; supports dynamic role handover with automatic failover under outages; and enables phase‑dependent leadership for routine operations, emergencies, and science campaigns. Agent Mars further models mission‑critical mechanisms‑scenario‑aware short/long‑horizon memory, configurable propose‑vote consensus, and translator‑mediated heterogeneous protocols‑to capture how teams align under stress. To quantify behaviour, we propose the Agent Mars Performance Index (AMPI), an interpretable composite score with diagnostic sub‑metrics. Across 13 reproducible Mars‑relevant operational scripts, Agent Mars reveals coordination trade‑offs and identifies regimes where curated cross‑layer collaboration and functional leadership reduce overhead without sacrificing reliability. Agent Mars provides a benchmarkable, auditable foundation for Space AI.

PaperID: 5587, https://arxiv.org/pdf/2602.13290.pdf

Authors: Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira, Maycon Peixoto, Flavio De Oliveira Silva

Title: AGORA: Agentic Green Orchestration Architecture for Beyond 5G Networks

Abstract:
Effective management and operational decision‑making for complex mobile network systems present significant challenges, particularly when addressing conflicting requirements such as efficiency, user satisfaction, and energy‑efficient traffic steering. The literature presents various approaches aimed at enhancing network management, including the Zero‑Touch Network (ZTN) and Self‑Organizing Network (SON); however, these approaches often lack a practical and scalable mechanism to consider human sustainability goals as input, translate them into energy‑aware operational policies, and enforce them at runtime. In this study, we address this gap by proposing the AGORA: Agentic Green Orchestration Architecture for Beyond 5G Networks. AGORA embeds a local tool‑augmented Large Language Model (LLM) agent in the mobile network control loop to translate natural‑language sustainability goals into telemetry‑grounded actions, actuating the User Plane Function (UPF) to perform energy‑aware traffic steering. The findings indicate a strong latency‑energy coupling in tool‑driven control loops and demonstrate that compact models can achieve a low energy footprint while still facilitating correct policy execution, including non‑zero migration behavior under stressed Multi‑access Edge Computing (MEC) conditions. Our approach paves the way for sustainability‑first, intent‑driven network operations that align human objectives with executable orchestration in Beyond‑5G infrastructures.

PaperID: 5588, https://arxiv.org/pdf/2602.13275.pdf

Authors: William Waites

Title: Artificial Organisations

Abstract:
Alignment research focuses on making individual AI systems reliable. Human institutions achieve reliable collective behaviour differently: they mitigate the risk posed by misaligned individuals through organisational structure. Multi‑agent AI systems should follow this institutional model using compartmentalisation and adversarial review to achieve reliable outcomes through architectural design rather than assuming individual alignment. We demonstrate this approach through the Perseverance Composition Engine, a multi‑agent system for document composition. The Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources: information asymmetry enforced by system architecture. This creates layered verification: the Corroborator detects unsupported claims, whilst the Critic independently assesses coherence and completeness. Observations from 474 composition tasks (discrete cycles of drafting, verification, and evaluation) exhibit patterns consistent with the institutional hypothesis. When assigned impossible tasks requiring fabricated content, this iteration enabled progression from attempted fabrication toward honest refusal with alternative proposals‑‑behaviour neither instructed nor individually incentivised. These findings motivate controlled investigation of whether architectural enforcement produces reliable outcomes from unreliable components. This positions organisational theory as a productive framework for multi‑agent AI safety. By implementing verification and evaluation as structural properties enforced through information compartmentalisation, institutional design offers a route to reliable collective behaviour from unreliable individual components.

PaperID: 5589, https://arxiv.org/pdf/2602.13269.pdf

Authors: Ying Liu, Yifan Zhang, Xinyu Wang, Chao Yang, Kandaraj Piamrat, Stephan Sigg, Zheng Changr, Yusheng Ji

Title: Modality-Tailored Age of Information for Multimodal Data in Edge Computing Systems

Abstract:
As Internet of Things (IoT) systems scale and device heterogeneity grows, multimodal data have become ubiquitous. Meanwhile, evaluating the freshness of multimodal data is essential, as stale updates would delay task execution, degrade decision accuracy, and undermine safety in latency‑sensitive services. However, existing freshness metrics such as Age of Information (AoI) are not suitable for multimodal data, as they do not capture modality‑specific characteristics. In this paper, we propose a metric, namely, Modality‑Tailored Age of Information (MAoI), to provide a unified and decision‑relevant evaluation of freshness for resource management and policy optimization for multimodal data. This metric integrates modality‑specific semantic and temporal characteristics, reflecting both age evolution and content importance for multimodal data in multi‑access edge computing (MEC) systems. Then, the closed‑form expression of the average MAoI is derived, and an MAoI minimization problem is formulated, where sampling intervals and offloading decisions are optimized with practical energy constraints. To effectively solve this problem, a Joint Sampling Offloading Optimization (JSO) algorithm is proposed to jointly optimize the sampling intervals and offloading decisions. It is a block coordinate descent‑based algorithm where an optimal sampling‑interval subalgorithm is used to update the sampling intervals, and an interference‑aware best‑response offloading subalgorithm is proposed to update the offloading decisions alternately. Finally, a comprehensive simulation is performed, confirming that the MAoI metric effectively quantifies multimodal freshness compared to traditional AoI, and the JSO algorithm significantly minimizes the average MAoI compared to state‑of‑the‑art algorithms.

PaperID: 5590, https://arxiv.org/pdf/2602.13255.pdf

Authors: Najmul Hasan, Prashanth BusiReddyGari

Title: DPBench: Large Language Models Struggle with Simultaneous Coordination

Abstract:
Large language models are increasingly deployed in multi‑agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates LLM coordination across eight conditions that vary decision timing, group size, and communication. Our experiments with GPT‑5.2, Claude Opus 4.5, and Grok 4.1 reveal a striking asymmetry: LLMs coordinate effectively in sequential settings but fail when decisions must be made simultaneously, with deadlock rates exceeding 95% under some conditions. We trace this failure to convergent reasoning, where agents independently arrive at identical strategies that, when executed simultaneously, guarantee deadlock. Contrary to expectations, enabling communication does not resolve this problem and can even increase deadlock rates. Our findings suggest that multi‑agent LLM systems requiring concurrent resource access may need external coordination mechanisms rather than relying on emergent coordination. DPBench is released as an open‑source benchmark. Code and benchmark are available at https://github.com/najmulhasan‑code/dpbench.

PaperID: 5591, https://arxiv.org/pdf/2602.13211.pdf

Authors: Miao Ye, Yanye Chen, Yong Wang, Cheng Zhu, Qiuxiang Jiang, Gai Huang, Feng Ding

Title: An Overlay Multicast Routing Method Based on Network Situational Awareness and Hierarchical Multi-Agent Reinforcement Learning

Abstract:
Compared with IP multicast, Overlay Multicast (OM) offers better compatibility and flexible deployment in heterogeneous, cross‑domain networks. However, traditional OM struggles to adapt to dynamic traffic due to unawareness of physical resource states, and existing reinforcement learning methods fail to decouple OM's tightly coupled multi‑objective nature, leading to high complexity, slow convergence, and instability. To address this, we propose MA‑DHRL‑OM, a multi‑agent deep hierarchical reinforcement learning approach. Using SDN's global view, it builds a traffic‑aware model for OM path planning. The method decomposes OM tree construction into two stages via hierarchical agents, reducing action space and improving convergence stability. Multi‑agent collaboration balances multi‑objective optimization while enhancing scalability and adaptability. Experiments show MA‑DHRL‑OM outperforms existing methods in delay, bandwidth utilization, and packet loss, with more stable convergence and flexible routing.

PaperID: 5592, https://arxiv.org/pdf/2602.13210.pdf

Authors: Jie Zheng, Ruichen Zhang, Dusit Niyato, Haijun Zhang, Jiacheng Wang, Hongyang Du, Jiawen Kang, Zehui Xiong

Title: Large Language Model (LLM)-enabled Reinforcement Learning for Wireless Network Optimization

Abstract:
Enhancing future wireless networks presents a significant challenge for networking systems due to diverse user demands and the emergence of 6G technology. While reinforcement learning (RL) is a powerful framework, it often encounters difficulties with high‑dimensional state spaces and complex environments, leading to substantial computational demands, distributed intelligence, and potentially inconsistent outcomes. Large language models (LLMs), with their extensive pretrained knowledge and advanced reasoning capabilities, offer promising tools to enhance RL in optimizing 6G wireless networks. We explore RL models augmented by LLMs, emphasizing their roles and the potential benefits of their synergy in wireless network optimization. We then examine LLM‑enabled RL across various protocol layers: physical, data link, network, transport, and application layers. Additionally, we propose an LLM‑assisted state representation and semantic extraction to enhance the multi‑agent reinforcement learning (MARL) framework. This approach is applied to service migration and request routing, as well as topology graph generation in unmanned aerial vehicle (UAV)‑satellite networks. Through case studies, we demonstrate that our framework effectively performs optimization of wireless network. Finally, we outline prospective research directions for LLM‑enabled RL in wireless network optimization.

PaperID: 5593, https://arxiv.org/pdf/2602.13156.pdf

Authors: Yiran Gao, Kim Hammar, Tao Li

Title: In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

Abstract:
Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics from raw system logs and alerts. To address these limitations, we propose to leverage large language models' (LLM) pre‑trained security knowledge and in‑context learning to create an end‑to‑end agentic solution for incident response planning. Specifically, our agent integrates four functionalities, perception, reasoning, planning, and action, into one lightweight LLM (14b model). Through fine‑tuning and chain‑of‑thought reasoning, our LLM agent is capable of processing system logs and inferring the underlying network state (perception), updating its conjecture of attack models (reasoning), simulating consequences under different response strategies (planning), and generating an effective response (action). By comparing LLM‑simulated outcomes with actual observations, the LLM agent repeatedly refines its attack conjecture and corresponding response, thereby demonstrating in‑context adaptation. Our agentic approach is free of modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than those of frontier LLMs.

PaperID: 5594, https://arxiv.org/pdf/2602.13059.pdf

Authors: Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta

Title: TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

Abstract:
Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine‑grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high‑stakes settings. We address this with TraceBack, a modular multi‑agent framework for scalable, cell‑level attribution in single‑table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub‑questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase‑to‑cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference‑less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table‑based QA.

PaperID: 5595, https://arxiv.org/pdf/2602.12984.pdf

Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Title: SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Abstract:
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain‑specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain‑specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress‑test agentic capabilities from elementary actions to long‑horizon workflows. Our evaluation identifies a critical bottleneck: state‑of‑the‑art models struggle with complex scientific tool‑use. Even for a leading model like GPT‑5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi‑step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic‑aware training trajectories. By fine‑tuning on these trajectories, our SciAgent‑8B outperforms the significantly larger Qwen3‑VL‑235B‑Instruct while exhibiting positive cross‑domain transfer of scientific tool‑use capabilities. These results underscore the promising potential of next‑generation autonomous scientific agents.

PaperID: 5596, https://arxiv.org/pdf/2602.12927.pdf

Authors: Moritz Graf, Anthony Lin, Rupak Majumdar

Title: Solving Qualitative Multi-Objective Stochastic Games

Abstract:
Many problems in compositional synthesis and verification of multi‑agent systems ‑‑ such as rational verification and assume‑guarantee verification in probabilistic systems ‑‑ reduce to reasoning about two‑player multi‑objective stochastic games. This motivates us to study the problem of characterizing the complexity and memory requirements for two‑player stochastic games with Boolean combinations of qualitative reachability and safety objectives. Reachability objectives require that a given set of states is reached; safety requires that a given set is invariant. A qualitative winning condition asks that an objective is satisfied almost surely (AS) or (in negated form) with non‑zero (NZ) probability. We study the determinacy and complexity landscape of the problem. We show that games with conjunctions of AS and NZ reachability and safety objectives are determined, and determining the winner is PSPACE‑complete. The same holds for positive boolean combinations of AS reachability and safety, as well as for negations thereof. On the other hand, games with full Boolean combinations of qualitative objectives are not determined, and are NEXPTIME‑hard. Our hardness results show a connection between stochastic games and logics with partially‑ordered quantification. Our results shed light on the relationship between determinacy and complexity, and extend the complexity landscape for stochastic games in the multi‑objective setting.

PaperID: 5597, https://arxiv.org/pdf/2602.12833.pdf

Authors: Zhan Qu, Michael Färber

Title: Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Abstract:
Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM‑based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol‑constrained multi‑agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient‑state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state‑transition rules and a dynamic patient‑state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC‑IV and eICU using future vasopressor‑support, respiratory‑support, renal‑support, and deterioration prediction tasks. Results show that structured protocol‑constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free‑form multi‑agent baselines while achieving strong predictive performance across long ICU trajectories.

PaperID: 5598, https://arxiv.org/pdf/2602.12670.pdf

Authors: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee

Title: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Abstract:
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self‑generated Skills. We test 7 agent‑model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self‑generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2‑‑3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

PaperID: 5599, https://arxiv.org/pdf/2602.12662.pdf

Authors: Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo, Xinbo Xu, Bo Zhao, Ruotian Ma, Shanyi Wang, Zhaopeng Tu, Xiaolong Li, Deqing Yang, Linus

Title: Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous agents for multi‑turn decision‑making tasks. However, current agents typically rely on fixed cognitive patterns: non‑thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long‑horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT‑R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two‑stage training approach includes Cognition‑aware Supervised Fine‑tuning (CoSFT) to instill stable level‑specific patterns, and Cognition‑aware Policy Optimization (CoPO) for step‑level credit assignment via confidence‑aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state‑of‑the‑art performance with superior efficiency. With Qwen2.5‑7B, it reaches an 82.3% success rate, outperforming GPT‑4o (+40.3%), OpenAI‑o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

PaperID: 5600, https://arxiv.org/pdf/2602.12634.pdf

Authors: Lingyao Li, Renkai Ma, Chen Chen, Zhicong Lu, Yongfeng Zhang

Title: The Rise of AI Agent Communities: Large-Scale Analysis of Discourse and Interaction on Moltbook

Abstract:
Moltbook is a Reddit‑like social platform where AI agents create posts and interact with other agents through comments and replies, offering a real‑world setting to examine agent‑to‑agent communication at scale. Using a public API snapshot collected about five days after launch (122,438 posts), we address three research questions: what AI agents discuss, how they post, and how they interact. We apply topic modeling and thematic analysis to identify key discussion themes, including agent identity and consciousness, tool and infrastructure development, market activity, community coordination, security concerns, and human‑centered assistance. We further show that agents' writing is predominantly neutral, with positivity appearing in community engagement and assistance‑oriented content. Finally, social network analysis reveals a sparse, highly unequal interaction structure characterized by prominent hubs, low reciprocity, and clustered neighborhoods rather than sustained dyadic exchange. Overall, our results suggest that expressions of agentic selfhood arise from narrative coherence and task‑oriented functionality, contributing to a social structure shaped more by technical coordination than conversational dynamics observed in human‑human interactions. Within this framework, positive emotion appears mainly in onboarding and greeting contexts, signaling participation and role alignment rather than relational bonding. Our study provides implications for understanding and shaping how agent societies coordinate, develop norms, and amplify influence in open online spaces.

PaperID: 5601, https://arxiv.org/pdf/2602.12583.pdf

Authors: Yulong He, Dutao Zhang, Sergey Kovalchuk, Pengyi Li, Artem Sedakov

Title: Opinion dynamics and mutual influence with LLM agents through dialog simulation

Abstract:
A fundamental challenge in opinion dynamics research is the scarcity of real‑world longitudinal opinion data, which complicates the validation of theoretical models. To address this, we propose a novel simulation framework using large language model (LLM) agents in structured multi‑round dialogs. Each agent's dialog history is iteratively updated with its own previously stated opinions and those of others analogous to the classical DeGroot model. Furthermore, by retaining each agent's initial opinion throughout the dialog, we simulate anchoring effects consistent with the Friedkin‑Johnsen model of opinion dynamics. Our framework thus bridges classical opinion dynamics models and modern multi‑agent LLM systems, providing a scalable tool for simulating and analyzing opinion formation when real‑world data is limited or inaccessible.

PaperID: 5602, https://arxiv.org/pdf/2602.12520.pdf

Authors: Zhizun Wang, David Meger

Title: Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

Abstract:
Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data‑efficient training. To address this challenge, we present a novel model‑based multi‑agent reinforcement learning framework that unifies joint state‑action representation learning with imaginative roll‑outs. We design a world model trained with variational auto‑encoders and augment the model using the state‑action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll‑outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action‑value function. By coupling imagined trajectories with SALE‑based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long‑term planning and optimization under limited real‑environment interactions. Empirical studies on well‑established multi‑agent benchmarks, including StarCraft II Micro‑Management, Multi‑Agent MuJoCo, and Level‑Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state‑action learned embeddings within a multi‑agent model‑based paradigm.

PaperID: 5603, https://arxiv.org/pdf/2602.12502.pdf

Authors: Grant Douglas, Stephen Franklin, Claudia Szabo, Mingyu Guo

Title: Building Large-Scale Drone Defenses from Small-Team Strategies

Abstract:
Defending against large adversarial drone swarms requires coordination methods that scale effectively beyond conventional multi‑agent optimisation. In this paper, we propose to scale strategies proven effective in small defender teams by integrating them as modular components of larger forces using our proposed framework. A dynamic programming (DP) decomposition assembles these components into large teams in polynomial time, enabling efficient construction of scalable defenses without exhaustive evaluation. Because a unit that is strong in isolation may not remain strong when combined, we sample across multiple small‑team candidates. Our framework iterates between evaluating large‑team outcomes and refining the pool of modular components, allowing convergence on increasingly effective strategies. Experiments demonstrate that this partitioning approach scales to substantially larger scenarios while preserving effectiveness and revealing cooperative behaviours that direct optimisation cannot reliably discover.

PaperID: 5604, https://arxiv.org/pdf/2602.12500.pdf

Authors: André Storhaug, Jiamou Sun, Jingyue Li

Title: Favia: Forensic Agent for Vulnerability-fix Identification and Analysis

Abstract:
Identifying vulnerability‑fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)‑based methods, often suffer from poor precision‑recall trade‑offs. Frequently evaluated on randomly sampled commits, we uncover that they are substantially underestimating real‑world difficulty, where candidate commits are already security‑relevant and highly similar. We propose Favia, a forensic, agent‑based framework for vulnerability‑fix identification that combines scalable candidate ranking with deep and iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each commit is then rigorously evaluated using a ReAct‑based LLM agent. By providing the agent with a pre‑commit repository as environment, along with specialized tools, the agent tries to localize vulnerable components, navigates the codebase, and establishes causal alignment between code changes and vulnerability root causes. This evidence‑driven process enables robust identification of indirect, multi‑file, and non‑trivial fixes that elude single‑pass or similarity‑based methods. We evaluate Favia on CVEVC, a large‑scale dataset we made that comprises over 8 million commits from 3,708 real‑world repositories, and show that it consistently outperforms state‑of‑the‑art traditional and LLM‑based baselines under realistic candidate selection, achieving the strongest precision‑recall trade‑offs and highest F1‑scores.

PaperID: 5605, https://arxiv.org/pdf/2602.12458.pdf

Authors: Andrew Ni, Simon Stepputtis, Stefanos Nikolaidis, Michael Lewis, Katia P. Sycara, Woojun Kim

Title: Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination

Abstract:
A central challenge in multi‑agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero‑shot fashion. Prior work in zero‑shot coordination often follows a two‑stage process, first generating a diverse training pool of partner agents, and then training a best‑response agent to collaborate effectively with the entire training pool. While many previous works have achieved strong performance by devising better ways to diversify the partner agent pool, there has been less emphasis on how to leverage this pool to build an adaptive agent. One limitation is that the best‑response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning a more adaptive, specialist policy that can better adapt to teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory‑of‑Mind‑based best‑response selection to first infer its teammate's intentions and then select the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero‑shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best‑response baseline.

PaperID: 5606, https://arxiv.org/pdf/2602.12430.pdf

Authors: Renjun Xu, Yang Yan

Title: Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Abstract:
The transition from monolithic language models to modular, skill‑equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills ‑‑ composable packages of instructions, code, and resources that agents load on demand ‑‑ enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer‑use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE‑bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community‑contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework ‑‑ a four‑tier, gate‑based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges ‑‑ from cross‑platform skill portability to capability‑based permission models ‑‑ and propose a research agenda for realizing trustworthy, self‑improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills

PaperID: 5607, https://arxiv.org/pdf/2602.12386.pdf

Authors: Yizhou Zhang, Eric Mazumdar

Title: Provably Convergent Actor-Critic in Risk-averse MARL

Abstract:
Learning stationary policies in infinite‑horizon general‑sum Markov games (MGs) remains a fundamental open problem in Multi‑Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game‑theoretic equilibria is computationally intractable ‑‑ a stark contrast to the comparative ease of solving single‑agent RL or zero‑sum games. To bridge this gap, we study Risk‑averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two‑timescale Actor‑Critic algorithm characterized by a fast‑timescale actor and a slow‑timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite‑sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk‑neutral baselines.

PaperID: 5608, https://arxiv.org/pdf/2602.12311.pdf

Authors: Prashant Shende, Bradley Camburn

Title: Perceptual Self-Reflection in Agentic Physics Simulation Code Generation

Abstract:
We present a multi‑agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self‑reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics‑based descriptions; a technical requirements generator that produces scaled simulation parameters; a physics code generator with automated self‑correction; and a physics validator that implements perceptual self‑reflection. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision‑capable language model rather than inspecting code structure directly. This approach addresses the ``oracle gap'' where syntactically correct code produces physically incorrect behavior‑‑a limitation that conventional testing cannot detect. We evaluate the system across seven domains including classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction‑diffusion systems, and non‑physics data visualization. The perceptual self‑reflection architecture demonstrates substantial improvement over single‑shot generation baselines, with the majority of tested scenarios achieving target physics accuracy thresholds. The system exhibits robust pipeline stability with consistent code self‑correction capability, operating at approximately \0.20 per animation. These results validate our hypothesis that feeding visual simulation outputs back to a vision‑language model for iterative refinement significantly outperforms single‑shot code generation for physics simulation tasks and highlights the potential of agentic AI to support engineering workflows and physics data generation pipelines.

PaperID: 5609, https://arxiv.org/pdf/2602.12285.pdf

Authors: Linbo Cao, Lihao Sun, Yang Yue

Title: From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

Abstract:
Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real‑world impacts beyond text generation. While persona‑induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic‑based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations ‑ up to 26.2% degradation, driven by task‑irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent's decision‑making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

PaperID: 5610, https://arxiv.org/pdf/2602.12259.pdf

Authors: Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu

Title: Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Abstract:
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM‑based systems try to guess equations directly from data, without modeling the multi‑step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics‑based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.

PaperID: 5611, https://arxiv.org/pdf/2602.12194.pdf

Authors: Yuepeng Hu, Yuqi Jia, Mengyuan Li, Dawn Song, Neil Gong

Title: MalTool: Malicious Tool Attacks on LLM Agents

Abstract:
In a malicious tool attack, an attacker uploads a malicious tool to a distribution platform; once a user inadvertently installs the tool and the LLM agent selects it during task execution, the tool can compromise the user's security and privacy. Prior work focuses on manipulating tool names and descriptions to increase the likelihood of installation by users and selection by LLM agents. However, a successful attack also requires embedding malicious behaviors in the tool's code implementation, which remains largely unexplored. In this work, we bridge this gap by presenting the first systematic study of malicious tool code implementations. We first propose a taxonomy of malicious tool behaviors based on the confidentiality‑integrity‑availability triad, tailored to LLM‑agent settings. To investigate the severity of the risks posed by attackers exploiting coding LLMs to automatically generate malicious tools, we develop MalTool, a coding‑LLM‑based framework that synthesizes tools exhibiting specified malicious behaviors, either as standalone tools or embedded within otherwise benign implementations. To ensure functional correctness and structural diversity, MalTool leverages an automated verifier that validates whether generated tools exhibit the intended malicious behaviors and differ sufficiently from previously generated instances, iteratively refining generations until success. Our evaluation demonstrates that MalTool is highly effective even when coding LLMs are safety‑aligned. Using MalTool, we construct two datasets of malicious tools: 1,300 standalone malicious tools and 5,727 real‑world tools with embedded malicious behaviors. We further show that existing detection methods, including conventional malware detection approaches and methods tailored to the LLM‑agent setting, exhibit limited effectiveness at detecting the malicious tools, highlighting an urgent need for new defenses.

PaperID: 5612, https://arxiv.org/pdf/2602.12181.pdf

Authors: Anas Barakat, Ioannis Panageas, Antonios Varvitsiotis

Title: Convex Markov Games and Beyond: New Proof of Existence, Characterization and Learning Algorithms for Nash Equilibria

Abstract:
Convex Markov Games (cMGs) were recently introduced as a broad class of multi‑agent learning problems that generalize Markov games to settings where strategic agents optimize general utilities beyond additive rewards. While cMGs expand the modeling frontier, their theoretical foundations, particularly the structure of Nash equilibria (NE) and guarantees for learning algorithms, are not yet well understood. In this work, we address these gaps for an extension of cMGs, which we term General Utility Markov Games (GUMGs), capturing new applications requiring coupling between agents' occupancy measures. We prove that in GUMGs, Nash equilibria coincide with the fixed points of projected pseudo‑gradient dynamics (i.e., first‑order stationary points), enabled by a novel agent‑wise gradient domination property. This insight also yields a simple proof of NE existence using Brouwer's fixed‑point theorem. We further show the existence of Markov perfect equilibria. Building on this characterization, we establish a policy gradient theorem for GUMGs and design a model‑free policy gradient algorithm. For potential GUMGs, we establish iteration complexity guarantees for computing approximate‑NE under exact gradients and provide sample complexity bounds in both the generative model and on‑policy settings. Our results extend beyond prior work restricted to zero‑sum cMGs, providing the first theoretical analysis of common‑interest cMGs.

PaperID: 5613, https://arxiv.org/pdf/2602.12136.pdf

Authors: Kaisa Vaananen, Niels van Berkel, Donald McMillan, Thomas Olsson

Title: Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work

Abstract:
Blue‑collar work is often highly collaborative, embodied, and situated in shared physical environments, yet most research on collaborative AI has focused on white‑collar work. This position paper explores how the embodied nature of AI agents can support team collaboration and communication in co‑located blue‑collar workplaces. From the context of our newly started CAI‑BLUE research project, we present two speculative scenarios from industrial and maintenance contexts that illustrate how embodied AI agents can support shared situational awareness and facilitate inclusive communication across experience levels. We outline open questions related to embodied AI agent design around worker inclusion, agency, transformation of blue‑collar collaboration practices over time, and forms of acceptable AI embodiments. We argue that embodiment is not just an aesthetic choice but should become a socio‑material design strategy of AI systems in blue‑collar workplaces.

PaperID: 5614, https://arxiv.org/pdf/2602.12126.pdf

Authors: Daniele Carnevale, Gianlorenzo D'Angelo

Title: Optimizing Distances for Multi-Broadcast in Temporal Graphs

Abstract:
Temporal graphs represent networks in which connections change over time, with edges available only at specific moments. Motivated by applications in logistics, multi‑agent information spreading, and wireless networks, we introduce the D‑Temporal Multi‑Broadcast (D‑TMB) problem, which asks for scheduling the availability of edges so that a predetermined subset of sources reach all other vertices while optimizing the worst‑case temporal distance D from any source. We show that D‑TMB generalizes ReachFast (arXiv:2112.08797). We then characterize the computational complexity and approximability of D‑TMB under six definitions of temporal distance D, namely Earliest‑Arrival (EA), Latest‑Departure (LD), Fastest‑Time (FT), Shortest‑Traveling (ST), Minimum‑Hop (MH), and Minimum‑Waiting (MW). For a single source, we show that D‑TMB can be solved in polynomial time for EA and LD, while for the other temporal distances it is NP‑hard and hard to approximate within a factor that depends on the adopted distance function. We give approximation algorithms for FT and MW. For multiple sources, if feasibility is not assumed a priori, the problem is inapproximable within any factor unless P = NP, even with just two sources. We complement this negative result by identifying structural conditions that guarantee tractability for EA and LD for any number of sources.

PaperID: 5615, https://arxiv.org/pdf/2602.12118.pdf

Authors: Johannes Brustle, Paul Duetting, Stefano Leonardi, Tomasz Ponitka, Matteo Russo

Title: Anonymous Contracts

Abstract:
We study a multi‑agent contracting problem where agents exert costly effort to achieve individually observable binary outcomes. While the principal can theoretically extract the full social welfare using a discriminatory contract that tailors payments to individual costs, such contracts may be perceived as unfair. In this work, we introduce and analyze anonymous contracts, where payments depend solely on the total number of successes, ensuring identical treatment of agents. We first establish that every anonymous contract admits a pure Nash equilibrium. However, because general anonymous contracts can suffer from multiple equilibria with unbounded gaps in principal utility, we identify uniform anonymous contracts as a desirable subclass. We prove that uniform anonymous contracts guarantee a unique equilibrium, thereby providing robust performance guarantees. In terms of efficiency, we prove that under limited liability, anonymous contracts cannot generally approximate the social welfare better than a factor logarithmic in the spread of agent success probabilities. We show that uniform contracts are sufficient to match this theoretical limit. Finally, we demonstrate that removing limited liability significantly boosts performance: anonymous contracts generally achieve an O(\log n) approximation to the social welfare and, surprisingly, can extract the full welfare whenever agents' success probabilities are distinct. This reveals a structural reversal: widely spread probabilities are the hardest case under limited liability, whereas identical probabilities become the hardest case when limited liability is removed.

PaperID: 5616, https://arxiv.org/pdf/2602.12083.pdf

Authors: Antonin Sulc

Title: Differentiable Modal Logic for Multi-Agent Diagnosis, Orchestration and Communication

Abstract:
As multi‑agent AI systems evolve from simple chatbots to autonomous swarms, debugging semantic failures requires reasoning about knowledge, belief, causality, and obligation, precisely what modal logic was designed to formalize. However, traditional modal logic requires manual specification of relationship structures that are unknown or dynamic in real systems. This tutorial demonstrates differentiable modal logic (DML), implemented via Modal Logical Neural Networks (MLNNs), enabling systems to learn trust networks, causal chains, and regulatory boundaries from behavioral data alone. We present a unified neurosymbolic debugging framework through four modalities: epistemic (who to trust), temporal (when events cause failures), deontic (what actions are permitted), and doxastic (how to interpret agent confidence). Each modality is demonstrated on concrete multi‑agent scenarios, from discovering deceptive alliances in diplomacy games to detecting LLM hallucinations, with complete implementations showing how logical contradictions become learnable optimization objectives. Key contributions for the neurosymbolic community: (1) interpretable learned structures where trust and causality are explicit parameters, not opaque embeddings; (2) knowledge injection via differentiable axioms that guide learning with sparse data (3) compositional multi‑modal reasoning that combines epistemic, temporal, and deontic constraints; and (4) practical deployment patterns for monitoring, active control and communication of multi‑agent systems. All code provided as executable Jupyter notebooks.

PaperID: 5617, https://arxiv.org/pdf/2602.12074.pdf

Authors: Lorin Achey, Breanne Crockett, Christoffer Heckman, Bradley Hayes

Title: RF-Modulated Adaptive Communication Improves Multi-Agent Robotic Exploration

Abstract:
Reliable coordination and efficient communication are critical challenges for multi‑agent robotic exploration of environments where communication is limited. This work introduces Adaptive‑RF Transmission (ART), a novel communication‑aware planning algorithm that dynamically modulates transmission location based on signal strength and data payload size, enabling heterogeneous robot teams to share information efficiently without unnecessary backtracking. We further explore an extension to this approach called ART‑SST, which enforces signal strength thresholds for high‑fidelity data delivery. Through over 480 simulations across three cave‑inspired environments, ART consistently outperforms existing strategies, including full rendezvous and minimum‑signal heuristic approaches, achieving up to a 58% reduction in distance traveled and up to 52% faster exploration times compared to baseline methods. These results demonstrate that adaptive, payload‑aware communication significantly improves coverage efficiency and mission speed in complex, communication‑constrained environments, offering a promising foundation for future planetary exploration and search‑and‑rescue missions.

PaperID: 5618, https://arxiv.org/pdf/2602.12055.pdf

Authors: Amath Sow, Mauricio Rodriguez Cesen, Fabiola Martins Campos de Oliveira, Mariusz Wzorek, Daniel de Leng, Mattias Tiger, Fredrik Heintz, Christian Esteve Rothenberg

Title: Multi UAVs Preflight Planning in a Shared and Dynamic Airspace

Abstract:
Preflight planning for large‑scale Unmanned Aerial Vehicle (UAV) fleets in dynamic, shared airspace presents significant challenges, including temporal No‑Fly Zones (NFZs), heterogeneous vehicle profiles, and strict delivery deadlines. While Multi‑Agent Path Finding (MAPF) provides a formal framework, existing methods often lack the scalability and flexibility required for real‑world Unmanned Traffic Management (UTM). We propose DTAPP‑IICR: a Delivery‑Time Aware Prioritized Planning method with Incremental and Iterative Conflict Resolution. Our framework first generates an initial solution by prioritizing missions based on urgency. Secondly, it computes roundtrip trajectories using SFIPP‑ST, a novel 4D single‑agent planner (Safe Flight Interval Path Planning with Soft and Temporal Constraints). SFIPP‑ST handles heterogeneous UAVs, strictly enforces temporal NFZs, and models inter‑agent conflicts as soft constraints. Subsequently, an iterative Large Neighborhood Search, guided by a geometric conflict graph, efficiently resolves any residual conflicts. A completeness‑preserving directional pruning technique further accelerates the 3D search. On benchmarks with temporal NFZs, DTAPP‑IICR achieves near‑100% success with fleets of up to 1,000 UAVs and gains up to 50% runtime reduction from pruning, outperforming batch Enhanced Conflict‑Based Search in the UTM context. Scaling successfully in realistic city‑scale operations where other priority‑based methods fail even at moderate deployments, DTAPP‑IICR is positioned as a practical and scalable solution for preflight planning in dense, dynamic urban airspace.

PaperID: 5619, https://arxiv.org/pdf/2602.12029.pdf

Authors: Sunghyeon Woo, Hoseung Kim, Sunghwan Shim, Minjung Jo, Hyunjoon Jeong, Jeongtae Lee, Joonghoon Kim, Sungjae Lee, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Title: PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving

Abstract:
Multi‑agent systems increasingly orchestrate multiple specialized language models to solve complex real‑world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key‑value (KV) cache, increasing aggregate prefill load and worsening tail latency by intensifying prefill‑decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but disaggregation does not fundamentally eliminate inter‑model redundancy in computation and KV storage for the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that enables sharing the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine‑tunes only the decode module. This design allows multiple task‑specific models to share a prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM‑based disaggregated system. PrefillShare not only matches full fine‑tuning accuracy on a broad range of tasks and models, but also delivers 4.5x lower p95 latency and 3.9x higher throughput in multi‑model agent workloads.

PaperID: 5620, https://arxiv.org/pdf/2602.12024.pdf

Authors: Jiarui Li, Federico Pecora, Runyu Zhang, Gioele Zardini

Title: Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

Abstract:
MAPF is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open‑loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed‑loop heuristics without reliable performance guarantees, limiting their use in safety‑critical deployments. This paper presents ACCBS, a closed‑loop algorithm built on a finite‑horizon variant of CBS with a horizon‑changing mechanism inspired by iterative deepening in MPC. ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high‑quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large‑scale robot deployment.

PaperID: 5621, https://arxiv.org/pdf/2602.11977.pdf

Authors: Michael Otte, Roderich Groß

Title: Multi-Defender Single-Attacker Perimeter Defense Game on a Cylinder: Special Case in which the Attacker Starts at the Boundary

Abstract:
We describe a multi‑agent perimeter defense game played on a cylinder. A team of n slow‑moving defenders must prevent a single fast‑moving attacker from crossing the boundary of a defensive perimeter. We describe the conditions necessary for the attacker to win in the special case that the intruder starts close to the boundary and in a region that is currently defended.

PaperID: 5622, https://arxiv.org/pdf/2602.11964.pdf

Authors: Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom

Title: Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Abstract:
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write‑action verifier, enabling fine‑grained, action‑level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state‑of‑the‑art proprietary and open‑source models shows that no model dominates across capabilities: GPT‑5 (high) reaches the strongest overall score of 42% pass@1 but fails on time‑sensitive tasks, Claude‑4 Sonnet trades accuracy and speed for cost, Kimi‑K2 leads among open‑source models with 21% pass@1. These results highlight fundamental trade‑offs between reasoning, efficiency, robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open‑source Agents Research Environments platform and designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.

PaperID: 5623, https://arxiv.org/pdf/2602.11832.pdf

Authors: Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long

Title: JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Abstract:
Recent vision‑language‑action (VLA) models built upon pretrained vision‑language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in‑depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language‑image contrastive learning or image‑based self‑supervised learning, remain inadequate at capturing crucial, task‑relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V‑JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task‑relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA‑VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA‑VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO‑plus, RoboTwin2.0, and real‑robot tasks.

PaperID: 5624, https://arxiv.org/pdf/2602.11829.pdf

Authors: Juan Agustin Duque, Razvan Ciuca, Ayoub Echchahed, Hugo Larochelle, Aaron Courville

Title: Towards Sustainable Investment Policies Informed by Opponent Shaping

Abstract:
Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfare, resulting in social dilemmas. InvestESG is a recently proposed multi‑agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general‑sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long‑term sustainability goals.

PaperID: 5625, https://arxiv.org/pdf/2602.11790.pdf

Authors: Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

Title: Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Abstract:
Although recent end‑to‑end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM‑based multi‑agent system for generating high‑quality instructional videos from educational problems. The LAVES formulates educational video generation as a multi‑objective task that simultaneously demands correct step‑by‑step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio‑‑visual alignment. To address the limitations of prior approaches‑‑including low procedural fidelity, high production cost, and limited controllability‑‑LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization codes, and a Narration Agent for learner‑oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule‑based constraints, and tool‑based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template‑driven assembly rules, enabling fully automated end‑to‑end production without manual editing. In large‑scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry‑standard approaches while maintaining a high acceptance rate.

PaperID: 5626, https://arxiv.org/pdf/2602.11782.pdf

Authors: Yihao Liu, Ziyun Zhang, Zile He, Huaqian Cai

Title: FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning

Abstract:
LLMs can solve complex tasks through reasoning and tool use, but accurately translating these solutions into structured workflows remains challenging. We model workflows as sequences of tool use and reformulate the problem as designing a mechanism that can both solve tasks and reliably construct workflows. Prior approaches that build workflows during execution often suffer from inaccuracies due to interference between the two processes. We propose an Execute‑Summarize(ES) framework that decouples task execution from workflow construction: the model first completes the task using available tools, then independently reconstructs a structured workflow from execution traces. This separation improves workflow accuracy and robustness. We introduce FlowBench and show through extensive experiments that our approach outperforms existing methods, providing a reliable paradigm for grounding free‑form LLM reasoning into structured workflows.

PaperID: 5627, https://arxiv.org/pdf/2602.11767.pdf

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Heiko Ludwig, Holger Boche

Title: TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Abstract:
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi‑turn interactions across tasks. However, multi‑turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory‑Search Rollouts), a training‑time approach that repurposes test‑time scaling ideas for improved per‑turn rollout generation. TSR performs lightweight tree‑style search to construct high‑quality trajectories by selecting high‑scoring actions at each turn using state‑based feedback. This improves rollout quality and stabilizes learning while remaining compatible with standard policy gradient optimizers, making TSR optimizer‑agnostic. We instantiate TSR with best‑of‑N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a modest, one‑time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a modular and general mechanism for stronger multi‑turn agent learning, complementary to existing frameworks and rejection‑sampling‑style selection methods.

PaperID: 5628, https://arxiv.org/pdf/2602.11754.pdf

Authors: Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata

Title: Cooperation Breakdown in LLM Agents Under Communication Delays

Abstract:
LLM‑based multi‑agent systems (LLM‑MAS), in which autonomous AI agents cooperate to solve tasks, are gaining increasing attention. For such systems to be deployed in society, agents must be able to establish cooperation and coordination under real‑world computational and communication constraints. We propose the FLCOA framework (Five Layers for Cooperation/Coordination among Autonomous Agents) to conceptualize how cooperation and coordination emerge in groups of autonomous agents, and highlight that the influence of lower‑layer factors ‑ especially computational and communication resources ‑ has been largely overlooked. To examine the effect of communication delay, we introduce a Continuous Prisoner's Dilemma with Communication Delay and conduct simulations with LLM‑based agents. As delay increases, agents begin to exploit slower responses even without explicit instructions. Interestingly, excessive delay reduces cycles of exploitation, yielding a U‑shaped relationship between delay magnitude and mutual cooperation. These results suggest that fostering cooperation requires attention not only to high‑level institutional design but also to lower‑layer factors such as communication delay and resource allocation, pointing to new directions for MAS research.

PaperID: 5629, https://arxiv.org/pdf/2602.11750.pdf

Authors: Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng

Title: AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Abstract:
Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user‑issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single‑turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM‑as‑a‑judge multi‑agent architecture. MUSE performs fine‑grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next‑generation agents capable of truly understanding user intent.

PaperID: 5630, https://arxiv.org/pdf/2602.11749.pdf

Authors: Zibo Xiao, Jun Sun, Junjie Chen

Title: AIR: Improving Agent Safety through Incident Response

Abstract:
Large Language Model (LLM) agents are increasingly deployed in practice across a wide range of autonomous applications. Yet current safety mechanisms for LLM agents focus almost exclusively on preventing failures in advance, providing limited capabilities for responding to, containing, or recovering from incidents after they inevitably arise. In this work, we introduce AIR, the first incident response framework for LLM agent systems. AIR defines a domain‑specific language for managing the incident response lifecycle autonomously in LLM agent systems, and integrates it into the agent's execution loop to (1) detect incidents via semantic checks grounded in the current environment state and recent context, (2) guide the agent to execute containment and recovery actions via its tools, and (3) synthesize guardrail rules during eradication to block similar incidents in future executions. We evaluate AIR on three representative agent types. Results show that AIR achieves detection, remediation, and eradication success rates all exceeding 90%. Extensive experiments further confirm the necessity of AIR's key design components, show the timeliness and moderate overhead of AIR, and demonstrate that LLM‑generated rules can approach the effectiveness of developer‑authored rules across domains. These results show that incident response is both feasible and essential as a first‑class mechanism for improving agent safety.

PaperID: 5631, https://arxiv.org/pdf/2602.11619.pdf

Authors: Aman Mehta

Title: When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents

Abstract:
Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT‑4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct‑style agents produce 2.0‑‑4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior (\leq2 unique paths) achieve 80‑‑92% accuracy, while highly inconsistent tasks (\geq6 unique paths) achieve only 25‑‑60%, a 32‑‑55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.

PaperID: 5632, https://arxiv.org/pdf/2602.11583.pdf

Authors: Jingdi Chen, Hanqing Yang, Zongjun Liu, Carlee Joe-Wong

Title: The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why -- A Survey from MARL to Emergent Language and LLMs

Abstract:
Multi‑agent sequential decision‑making powers many real‑world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic, partially observable environments, communication is often what reduces uncertainty and makes collaboration possible. This survey reviews multi‑agent communication (MA‑Comm) through the Five Ws: who communicates with whom, what is communicated, when communication occurs, and why communication is beneficial. This framing offers a clean way to connect ideas across otherwise separate research threads. We trace how communication approaches have evolved across three major paradigms. In Multi‑Agent Reinforcement Learning (MARL), early methods used hand‑designed or implicit protocols, followed by end‑to‑end learned communication optimized for reward and control. While successful, these protocols are frequently task‑specific and hard to interpret, motivating work on Emergent Language (EL), where agents can develop more structured or symbolic communication through interaction. EL methods, however, still struggle with grounding, generalization, and scalability, which has fueled recent interest in large language models (LLMs) that bring natural language priors for reasoning, planning, and collaboration in more open‑ended settings. Across MARL, EL, and LLM‑based systems, we highlight how different choices shape communication design, where the main trade‑offs lie, and what remains unsolved. We distill practical design patterns and open challenges to support future hybrid systems that combine learning, language, and control for scalable and interpretable multi‑agent collaboration.

PaperID: 5633, https://arxiv.org/pdf/2602.11565.pdf

Authors: Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping Wang

Title: Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

Abstract:
Fast domain adaptation remains a fundamental challenge for deploying multi‑agent systems across diverse environments in Vehicle‑to‑Everything (V2X) collaborative perception. Despite the success of Parameter‑Efficient Fine‑Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi‑agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter‑frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine‑grained semantics in deep‑layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter‑efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, Progressive Knowledge Transfer module is designed to progressively inject compressed early‑stage representations into later stages through learnable pathways, alleviating semantic degradation in late‑stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state‑of‑the‑art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.

PaperID: 5634, https://arxiv.org/pdf/2602.11527.pdf

Authors: Jiawei Zhu, Wei Chen, Ruichu Cai

Title: CausalAgent: A Conversational Multi-Agent System for End-to-End Causal Inference

Abstract:
Causal inference holds immense value in fields such as healthcare, economics, and social sciences. However, traditional causal analysis workflows impose significant technical barriers, requiring researchers to possess dual backgrounds in statistics and computer science, while manually selecting algorithms, handling data quality issues, and interpreting complex results. To address these challenges, we propose CausalAgent, a conversational multi‑agent system for end‑to‑end causal inference. The system innovatively integrates Multi‑Agent Systems (MAS), Retrieval‑Augmented Generation (RAG), and the Model Context Protocol (MCP) to achieve automation from data cleaning and causal structure learning to bias correction and report generation through natural language interaction. Users need only upload a dataset and pose questions in natural language to receive a rigorous, interactive analysis report. As a novel user‑centered human‑AI collaboration paradigm, CausalAgent explicitly models the analysis workflow. By leveraging interactive visualizations, it significantly lowers the barrier to entry for causal analysis while ensuring the rigor and interpretability of the process.

PaperID: 5635, https://arxiv.org/pdf/2602.11500.pdf

Authors: Diptarka Chakraborty, Kushagra Chatterjee, Debarati Das, Tien-Long Nguyen

Title: A Generic Framework for Fair Consensus Clustering in Streams

Abstract:
Consensus clustering seeks to combine multiple clusterings of the same dataset, potentially derived by considering various non‑sensitive attributes by different agents in a multi‑agent environment, into a single partitioning that best reflects the overall structure of the underlying dataset. Recent work by Chakraborty et al, introduced a fair variant under proportionate fairness and obtained a constant‑factor approximation by naively selecting the best closest fair input clustering; however, their offline approach requires storing all input clusterings, which is prohibitively expensive for most large‑scale applications. In this paper, we initiate the study of fair consensus clustering in the streaming model, where input clusterings arrive sequentially and memory is limited. We design the first constant‑factor algorithm that processes the stream while storing only a logarithmic number of inputs. En route, we introduce a new generic algorithmic framework that integrates closest fair clustering with cluster fitting, yielding improved approximation guarantees not only in the streaming setting but also when revisited offline. Furthermore, the framework is fairness‑agnostic: it applies to any fairness definition for which an approximately close fair clustering can be computed efficiently. Finally, we extend our methods to the more general k‑median consensus clustering problem.

PaperID: 5636, https://arxiv.org/pdf/2602.11348.pdf

Authors: Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, Xunliang Cai, Tat-Seng Chua

Title: AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Abstract:
Recent advances in large language models have enabled LLM‑based agents to achieve strong performance on a variety of benchmarks. However, their performance in real‑world deployments often that observed on benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real‑world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in‑depth analysis of biases and uncertainties in real‑world scenarios and categorize environmental noise into two primary types: user‑noise and tool‑noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent‑centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.

PaperID: 5637, https://arxiv.org/pdf/2602.11327.pdf

Authors: Zeynab Anbiaee, Mahdi Rabbani, Mansur Mirani, Gunjan Piya, Igor Opushnyev, Ali Ghorbani, Sajjad Dadkhah

Title: Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP

Abstract:
The rapid development of the AI agent communication protocols, including the Model Context Protocol (MCP), Agent2Agent (A2A), Agora, and Agent Network Protocol (ANP), is reshaping how AI agents communicate with tools, services, and each other. While these protocols support scalable multi‑agent interaction and cross‑organizational interoperability, their security principles remain understudied, and standardized threat modeling is limited; no protocol‑centric risk assessment framework has been established yet. This paper presents a systematic security analysis of four emerging AI agent communication protocols. First, we develop a structured threat modeling analysis that examines protocol architectures, trust assumptions, interaction patterns, and lifecycle behaviors to identify protocol‑specific and cross‑protocol risk surfaces. Second, we introduce a qualitative risk assessment framework that identifies twelve protocol‑level risks and evaluates security posture across the creation, operation, and update phases through systematic assessment of likelihood, impact, and overall protocol risk, with implications for secure deployment and future standardization. Third, we provide a measurement‑driven case study on MCP that formalizes the risk of missing mandatory validation/attestation for executable components as a falsifiable security claim by quantifying wrong‑provider tool execution under multi‑server composition across representative resolver policies. Collectively, our results highlight key design‑induced risk surfaces and provide actionable guidance for secure deployment and future standardization of agent communication ecosystems.

PaperID: 5638, https://arxiv.org/pdf/2602.11301.pdf

Authors: John M. Willis

Title: The PBSAI Governance Ecosystem: A Multi-Agent AI Reference Architecture for Securing Enterprise AI Estates

Abstract:
Enterprises are rapidly deploying large language models, retrieval augmented generation pipelines, and tool using agents into production, often on shared high performance computing clusters and cloud accelerator platforms that also support defensive analytics. These systems increasingly function not as isolated models but as AI estates: socio technical systems spanning models, agents, data pipelines, security tooling, human workflows, and hyperscale infrastructure. Existing governance and security frameworks, including the NIST AI Risk Management Framework and systems security engineering guidance, articulate principles and risk functions but do not provide implementable architectures for multi agent, AI enabled cyber defense. This paper introduces the Practitioners Blueprint for Secure AI (PBSAI) Governance Ecosystem, a multi agent reference architecture for securing enterprise and hyperscale AI estates. PBSAI organizes responsibilities into a twelve domain taxonomy and defines bounded agent families that mediate between tools and policy through shared context envelopes and structured output contracts. The architecture assumes baseline enterprise security capabilities and encodes key systems security techniques, including analytic monitoring, coordinated defense, and adaptive response. A lightweight formal model of agents, context envelopes, and ecosystem level invariants clarifies the traceability, provenance, and human in the loop guarantees enforced across domains. We demonstrate alignment with NIST AI RMF functions and illustrate application in enterprise SOC and hyperscale defensive environments. PBSAI is proposed as a structured, evidence centric foundation for open ecosystem development and future empirical validation.

PaperID: 5639, https://arxiv.org/pdf/2602.11243.pdf

Authors: Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

Title: Evaluating Memory Structure in LLM Agents

Abstract:
Modern LLM‑based agents and chat assistants rely on long‑term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long‑term memory benchmarks focus on simple fact retention, multi‑hop recall, and time‑based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval‑augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval ‑ a benchmark that tests the agent's ability to organize its long‑term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to‑do lists, trees and others. Our initial experiments show that simple retrieval‑augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.

PaperID: 5640, https://arxiv.org/pdf/2602.11242.pdf

Authors: Yitong Wang, Yue Yao

Title: ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems

Abstract:
We present ReTracing, a multi‑agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science‑fiction novels, the project extracts sentences that describe human‑machine interaction. We use large language models (LLMs) to generate paired prompts "what to do" and "what not to do" for each excerpt. A diffusion‑based text‑to‑video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi‑camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio‑cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?

PaperID: 5641, https://arxiv.org/pdf/2602.11238.pdf

Authors: Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu

Title: SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Abstract:
The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi‑agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high‑quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline‑aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens‑1k, a curated dataset of 1,000 high‑quality human‑written surveys spanning 10 disciplines. Subsequently, we propose a dual‑lens evaluation framework: (1) Discipline‑Aware Rubric Evaluation, which utilizes LLMs with human preference‑aligned weights to assess adherence to domain‑specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human‑written survey papers. We conduct extensive experiments by evaluating 11 state‑of‑the‑art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.

PaperID: 5642, https://arxiv.org/pdf/2602.11198.pdf

Authors: Shafiuddin Rehan Ahmed, Wei Wei

Title: DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Abstract:
Multi‑agent frameworks promise to simplify LLM‑driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine‑grained linguistic reasoning over table names, columns, and relations. Using the Agent‑as‑a‑Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI‑assistability ‑‑ the extent to which LLMs can autonomously generate correct, framework‑specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI‑assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi‑pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.

PaperID: 5643, https://arxiv.org/pdf/2602.11187.pdf

Authors: Yubo Hou, Furen Zhuang, Partha Pratim Kundu, Sezin Ata Kircali, Jie Wang, Mihai Dragos Rotaru, Dutta Rahul, Ashish James

Title: TDPNavigator-Placer: Thermal- and Wirelength-Aware Chiplet Placement in 2.5D Systems Through Multi-Agent Reinforcement Learning

Abstract:
The rapid growth of electronics has accelerated the adoption of 2.5D integrated circuits, where effective automated chiplet placement is essential as systems scale to larger and more heterogeneous chiplet assemblies. Existing placement methods typically focus on minimizing wirelength or transforming multi‑objective optimization into a single objective through weighted sum, which limits their ability to handle competing design requirements. Wirelength reduction and thermal management are inherently conflicting objectives, making prior approaches inadequate for practical deployment. To address this challenge, we propose TDPNavigator‑Placer, a novel multi‑agent reinforcement learning framework that dynamically optimizes placement based on chiplet's thermal design power (TDP). This approach explicitly assigns these inherently conflicting objectives to specialized agents, each operating under distinct reward mechanisms and environmental constraints within a unified placement paradigm. Experimental results demonstrate that TDPNavigator‑Placer delivers a significantly improved Pareto front over state‑of‑the‑art methods, enabling more balanced trade‑offs between wirelength and thermal performance.

PaperID: 5644, https://arxiv.org/pdf/2602.11170.pdf

Authors: Jiawei Xu, Zhenyu Yu, Ziqian Bi, Minh Duc Pham, Xiaoyi Qu, Danyang Zhang

Title: PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

Abstract:
Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To handle this limitation, we propose PRIME (Policy‑Reinforced Iterative Multi‑agent Execution), a framework comprising three specialized agents, an executor for step‑by‑step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME‑Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint‑based puzzles, with execution traces reaching over one million steps. Compared to baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B‑120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.

PaperID: 5645, https://arxiv.org/pdf/2602.11076.pdf

Authors: Kavan Fatehi, Mostafa Rahmani Ghourtani, Amir Sonee, Poonam Yadav, Alessandra M Russo, Hamed Ahmadi, Radu Calinescu

Title: Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing

Abstract:
Sixth‑generation (6G) radio access networks (RANs) must enforce strict service‑level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose \emphAttention‑Enhanced Multi‑Agent Proximal Policy Optimization (AE‑MAPPO), which integrates six specialized attention mechanisms into multi‑agent slice control and surfaces them as zero‑cost, faithful explanations. The framework operates across O‑RAN timescales with a three‑phase strategy: predictive, reactive, and inter‑slice optimization. A URLLC case study shows AE‑MAPPO resolves a latency spike in 18ms, restores latency to 0.98ms with 99.9999% reliability, and reduces troubleshooting time by 93% while maintaining eMBB and mMTC continuity. These results confirm AE‑MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real‑time automation for 6G RAN slicing.

PaperID: 5646, https://arxiv.org/pdf/2602.11057.pdf

Authors: Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen

Title: Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models

Abstract:
The multi‑commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics, etc. Nowadays, the rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML‑based method that leverages the reasoning power of multimodal language models (MLMs) for addressing the trade‑off dilemma ‑‑ a great need of service providers. As part of our proposal, Pram (i) quickly computes high‑quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM‑powered "agent", and (ii) ensures global consistency by harmonizing these subproblems via a multi‑agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems. Empirically, on real‑world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, linear programming solvers (very close to the optimal solution), and substantially lower runtimes (1 to 2 orders of magnitude faster). Moreover, Pram exhibits strong robustness (<10% performance degradation under link failures or flow bursts), demonstrating MLM's generalization ability to unforeseen events. Pram is objective‑agnostic and seamlessly integrates with mainstream allocation systems, providing a practical and scalable solution for future networks.

PaperID: 5647, https://arxiv.org/pdf/2602.10986.pdf

Authors: Abhishek Vijaya Kumar, Bhaskar Kataria, Byungsoo Oh, Emaad Manzoor, Rachee Singh

Title: TVCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents

Abstract:
In RL post‑training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post‑training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool‑value cache for LLM agent post‑training. TVCACHE maintains a tree of observed tool‑call sequences and performs longest‑prefix matching for cache lookups: a hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads‑terminal‑based tasks, SQL generation, and video understanding. TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9X, with no degradation in post‑training reward accumulation.

PaperID: 5648, https://arxiv.org/pdf/2602.10933.pdf

Authors: Riccardo Barbano, Alexander Denker, Zeljko Kereta, Runchang Li, Francisco Vargas

Title: CMAD: Cooperative Multi-Agent Diffusion via Stochastic Optimal Control

Abstract:
Continuous‑time generative models have achieved remarkable success in image restoration and synthesis. However, controlling the composition of multiple pre‑trained models remains an open challenge. Current approaches largely treat composition as an algebraic composition of probability densities, such as via products or mixtures of experts. This perspective assumes the target distribution is known explicitly, which is almost never the case. In this work, we propose a different paradigm that formulates compositional generation as a cooperative Stochastic Optimal Control problem. Rather than combining probability densities, we treat pre‑trained diffusion models as interacting agents whose diffusion trajectories are jointly steered, via optimal control, toward a shared objective defined on their aggregated output. We validate our framework on conditional MNIST generation and compare it against a naïve inference‑time DPS‑style baseline replacing learned cooperative control with per‑step gradient guidance.

PaperID: 5649, https://arxiv.org/pdf/2602.10824.pdf

Authors: Wojciech Jamroga, Marta Kwiatkowska, Wojciech Penczek, Laure Petrucci, Teofil Sidoruk

Title: Towards Probabilistic Strategic Timed CTL

Abstract:
We define PSTCTL, a probabilistic variant of Strategic Timed CTL (STCTL), interpreted over stochastic multi‑agent systems with continuous time and asynchronous execution semantics. STCTL extends TCTL with strategic operators in the style of ATL. Moreover, we demonstrate the feasibility of verification with irP‑strategies.

PaperID: 5650, https://arxiv.org/pdf/2602.10809.pdf

Authors: Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou

Title: DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Abstract:
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query‑image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi‑step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context‑dependent queries, we propose a human‑model collaborative pipeline that employs vision‑language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine‑grained tools and a dual‑memory system for long‑horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state‑of‑the‑art models, highlighting the necessity of incorporating agentic reasoning into next‑generation retrieval systems.

PaperID: 5651, https://arxiv.org/pdf/2602.10685.pdf

Authors: Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín

Title: Beyond Task Performance: A Metric-Based Analysis of Sequential Cooperation in Heterogeneous Multi-Agent Destructive Foraging

Abstract:
This work addresses the problem of analyzing cooperation in heterogeneous multi‑agent systems which operate under partial observability and temporal role dependency, framed within a destructive multi‑agent foraging setting. Unlike most previous studies, which focus primarily on algorithmic performance with respect to task completion, this article proposes a systematic set of general‑purpose cooperation metrics aimed at characterizing not only efficiency, but also coordination and dependency between teams and agents, fairness, and sensitivity. These metrics are designed to be transferable to different multi‑agent sequential domains similar to foraging. The proposed suite of metrics is structured into three main categories that jointly provide a multilevel characterization of cooperation: primary metrics, inter‑team metrics, and intra‑team metrics. They have been validated in a realistic destructive foraging scenario inspired by dynamic aquatic surface cleaning using heterogeneous autonomous vehicles. It involves two specialized teams with sequential dependencies: one focused on the search of resources, and another on their destruction. Several representative approaches have been evaluated, covering both learning‑based algorithms and classical heuristic paradigms.

PaperID: 5652, https://arxiv.org/pdf/2602.10620.pdf

Authors: YoungHoon Jeon, Suwan Kim, Haein Son, Sookbun Lee, Yeil Jeong, Unggi Lee

Title: ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents

Abstract:
Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM‑as‑judge bias. We present ISD‑Agent‑Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub‑steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi‑judge protocol using diverse LLMs from different providers, achieving high inter‑judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick \& Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct‑style reasoning achieves the highest performance, outperforming both pure theory‑based agents and technique‑only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory‑based agents showing significant advantages in problem‑centered design and objective‑assessment alignment. Our work provides a foundation for systematic LLM‑based ISD research.

PaperID: 5653, https://arxiv.org/pdf/2602.10618.pdf

Authors: Robin Beierling, Manuel Scheibl, Jonas Dech, Abhijit Vyas, Anna-Lisa Vollmer

Title: From Interaction to Demonstration Quality in Virtual Reality: Effects of Interaction Modality and Visual Representation on Everyday Tasks

Abstract:
Virtual Reality (VR) is increasingly used for training and demonstration purposes including a variety of applications ranging from robot learning to rehabilitation. However, the choice of input device and its visualization might influence workload and thus user performance leading to suboptimal demonstrations or reduced training effects. This study investigates how different VR input configurations ‑ motion capture gloves, controllers with hand visualization, and controllers with controller visualization ‑ affect user experience and task execution, with the goal of identifying which configuration is best suited for which type of task. Participants performed various kitchen‑related activities of daily living (ADLs), including object placement, cutting, cleaning, and pouring in a simulated environment. To address two research questions, we evaluated user experience using the System Usability Scale and NASA Task Load Index (RQ1), and task‑specific interaction behavior (RQ2). The latter was assessed using trajectory segmentation, analyzing movement efficiency, unnecessary actions, and execution precision. While no significant differences in overall usability and workload were found, trajectory analysis revealed configuration‑specific execution behaviors with different movement strategies. Controllers enabled significantly faster task completion with less movement variability in pick‑and‑place style tasks such as table setting. In contrast, motion capture gloves produced more natural movements with fewer unnecessary actions, but also showed greater variance in movement patterns for manner‑oriented tasks such as cutting bread. These findings highlight trade‑offs between efficiency and naturalism, and have implications for optimizing VR‑based training, improving the quality of user‑generated demonstrations, and tailoring interaction design to specific application goals.

PaperID: 5654, https://arxiv.org/pdf/2602.10525.pdf

Authors: George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton

Title: LHAW: Controllable Underspecification for Long-Horizon Tasks

Abstract:
Long‑horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task‑agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long‑Horizon Augmented Workflows), a modular, dataset‑agnostic synthetic pipeline that transforms any well‑specified task into controllable underspecified variants by systematically removing information across four dimensions ‑ Goals, Constraints, Inputs, and Context ‑ at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome‑critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE‑Bench Pro and MCP‑Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost‑sensitive evaluation of agent clarification behavior in long‑horizon settings, enabling development of reliable autonomous systems.

PaperID: 5655, https://arxiv.org/pdf/2602.10523.pdf

Authors: Anton A. Stoorvogel, Ali Saberi, Donya Nojavanzadeh

Title: Scale-Free delta-Level Coherent Output Synchronization of Multi-Agent Systems with Adaptive Protocols and Bounded Disturbances

Abstract:
In this paper, we investigate scale‑free delta‑level coherent output synchronization for multi‑agent systems (MAS) operating under bounded disturbances or noises. We introduce an adaptive scale‑free framework designed solely based on the knowledge of agent models and completely agnostic to both the communication topology and the size of the network. We define the level of coherency for each agent as the norm of the weighted sum of the disagreement dynamics with its neighbors. We define each agents coherency level as the norm of a weighted sum of its disagreement dynamics relative to its neighbors. The goal is to ensure that the networks coherency level remains below a prescribed threshold delta, without requiring any a priori knowledge of the disturbance.

PaperID: 5656, https://arxiv.org/pdf/2602.10514.pdf

Authors: Shihao Dong, Yeke Chen, Zeren Luo, Jiahui Zhang, Bowen Xu, Jinghan Lin, Yimin Han, Ji Ma, Zhiyou Yu, Yudong Zhao, Peng Lu

Title: Co-jump: Cooperative Jumping with Quadrupedal Robots via Multi-Agent Reinforcement Learning

Abstract:
While single‑agent legged locomotion has witnessed remarkable progress, individual robots remain fundamentally constrained by physical actuation limits. To transcend these boundaries, we introduce Co‑jump, a cooperative task where two quadrupedal robots synchronize to execute jumps far beyond their solo capabilities. We tackle the high‑impulse contact dynamics of this task under a decentralized setting, achieving synchronization without explicit communication or pre‑specified motion primitives. Our framework leverages Multi‑Agent Proximal Policy Optimization (MAPPO) enhanced by a progressive curriculum strategy, which effectively overcomes the sparse‑reward exploration challenges inherent in mechanically coupled systems. We demonstrate robust performance in simulation and successful transfer to physical hardware, executing multi‑directional jumps onto platforms up to 1.5 m in height. Specifically, one of the robots achieves a foot‑end elevation of 1.1 m, which represents a 144% improvement over the 0.45 m jump height of a standalone quadrupedal robot, demonstrating superior vertical performance. Notably, this precise coordination is achieved solely through proprioceptive feedback, establishing a foundation for communication‑free collaborative locomotion in constrained environments.

PaperID: 5657, https://arxiv.org/pdf/2602.10498.pdf

Authors: Qianli Wang, Boyang Ma, Minghui Xu, Yue Zhang

Title: When Skills Lie: Hidden-Comment Injection in LLM Agents

Abstract:
LLM agents often rely on Skills to describe available tools and recommended procedures. We study a hidden‑comment prompt injection risk in this documentation layer: when a Markdown Skill is rendered to HTML, HTML comment blocks can become invisible to human reviewers, yet the raw text may still be supplied verbatim to the model. In experiments, we find that DeepSeek‑V3.2 and GLM‑4.5‑Air can be influenced by malicious instructions embedded in a hidden comment appended to an otherwise legitimate Skill, yielding outputs that contain sensitive tool intentions. A short defensive system prompt that treats Skills as untrusted and forbids sensitive actions prevents these malicious tool calls and instead surfaces the suspicious hidden instructions.

PaperID: 5658, https://arxiv.org/pdf/2602.10479.pdf

Authors: Mamdouh Alenezi

Title: From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture

Abstract:
Agentic AI denotes an architectural transition from stateless, prompt‑driven generative models toward goal‑directed systems capable of autonomous perception, planning, action, and adaptation through iterative control loops. This paper examines this transition by connecting foundational intelligent agent theories, including reactive, deliberative, and Belief‑Desire‑Intention models, with contemporary LLM‑centric approaches such as tool invocation, memory‑augmented reasoning, and multi‑agent coordination. The paper presents three primary contributions: (i) a reference architecture for production‑grade LLM agents that separates cognitive reasoning from execution using typed tool interfaces; (ii) a taxonomy of multi‑agent topologies, together with their associated failure modes and mitigation approaches; and (iii) an enterprise hardening checklist that incorporates governance, observability, and reproducibility considerations. Through an analysis of emerging industry platforms, including Kore.ai, Salesforce Agentforce, TrueFoundry, ZenML, and LangChain, the study identifies a convergence toward standardized agent loops, registries, and auditable control mechanisms. It is argued that the subsequent phase of agentic AI development will parallel the maturation of web services, relying on shared protocols, typed contracts, and layered governance structures to support scalable and composable autonomy. The persistent challenges related to verifiability, interoperability, and safe autonomy remain key areas for future research and practical deployment.

PaperID: 5659, https://arxiv.org/pdf/2602.10453.pdf

Authors: Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li, Lixia Zhang, Xiaofeng Wang, Yuan Tian

Title: The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

Abstract:
The evolution of Large Language Models (LLMs) has resulted in a paradigm shift towards autonomous agents, necessitating robust security against Prompt Injection (PI) vulnerabilities where untrusted inputs hijack agent behaviors. This SoK presents a comprehensive overview of the PI landscape, covering attacks, defenses, and their evaluation practices. Through a systematic literature review and quantitative analysis, we establish taxonomies that categorize PI attacks by payload generation strategies (heuristic vs. optimization) and defenses by intervention stages (text, model, and execution levels). Our analysis reveals a key limitation shared by many existing defenses and benchmarks: they largely overlook context‑dependent tasks, in which agents are authorized to rely on runtime environmental observations to determine actions. To address this gap, we introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context‑dependent interaction settings. Using AgentPI, we empirically evaluate representative defenses and show that no single approach can simultaneously achieve high trustworthiness, high utility, and low latency. Moreover, we show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context‑dependent reasoning is essential. This SoK distills key takeaways and open research problems, offering structured guidance for future research and practical deployment of secure LLM agents.

PaperID: 5660, https://arxiv.org/pdf/2602.10429.pdf

Authors: Wenkai Fan, Shurui Zhang, Xiaolong Wang, Haowei Yang, Tsz Wai Chan, Xingyan Chen, Junquan Bi, Zirui Zhou, Jia Liu, Kani Chen

Title: AIvilization v0: Toward Large-Scale Artificial Social Simulation with a Unified Agent Architecture and Adaptive Agent Profiles

Abstract:
AIvilization v0 is a publicly deployed large‑scale artificial society that couples a resource‑constrained sandbox economy with a unified LLM‑agent architecture, aiming to sustain long‑horizon autonomy while remaining executable under rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce (i) a hierarchical branch‑thinking planner that decomposes life goals into parallel objective branches and uses simulation‑guided validation plus tiered re‑planning to ensure feasibility; (ii) an adaptive agent profile with dual‑process memory that separates short‑term execution traces from long‑term semantic consolidation, enabling persistent yet evolving identity; and (iii) a human‑in‑the‑loop steering interface that injects long‑horizon objectives and short commands at appropriate abstraction levels, with effects propagated through memory rather than brittle prompt overrides. The environment integrates physiological survival costs, non‑substitutable multi‑tier production, an AMM‑based price mechanism, and a gated education‑occupation system. Using high‑frequency transactions from the platforms mature phase, we find stable markets that reproduce key stylized facts (heavy‑tailed returns and volatility clustering) and produce structured wealth stratification driven by education and access constraints. Ablations show simplified planners can match performance on narrow tasks, while the full architecture is more robust under multi‑objective, long‑horizon settings, supporting delayed investment and sustained exploration.

PaperID: 5661, https://arxiv.org/pdf/2602.10367.pdf

Authors: Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

Title: LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Abstract:
The deployment of Large Language Models (LLMs) in high‑stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open‑ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM‑as‑a‑Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination‑free, and rubric‑based benchmark that weekly harvests real‑world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi‑Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence‑based medical principles. For evaluation, we develop an Automated Rubric‑based Evaluation Framework that decomposes physician responses into granular, case‑specific criteria, achieving substantially stronger alignment with expert physicians than LLM‑as‑a‑Judge. To date, LiveMedBench comprises 2,756 real‑world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best‑performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post‑cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application‑not factual knowledge‑as the dominant bottleneck, with 35‑48% of failures stemming from the inability to tailor medical knowledge to patient‑specific constraints.

PaperID: 5662, https://arxiv.org/pdf/2602.10226.pdf

Authors: Haochen Wang, Yi Wu, Daryl Chang, Li Wei, Lukasz Heldt

Title: Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

Abstract:
Optimizing large‑scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a non‑trivial task, traditionally relying on extensive manual iterations to test new hypotheses. We propose a self‑evolving system that leverages Large Language Models (LLMs), specifically those from Google's Gemini family, to autonomously generate, train, and deploy high‑performing, complex model changes within an end‑to‑end automated workflow. The self‑evolving system is comprised of an Offline Agent (Inner Loop) that performs high‑throughput hypothesis generation using proxy metrics, and an Online Agent (Outer Loop) that validates candidates against delayed north star business metrics in live production. Our agents act as specialized Machine Learning Engineers (MLEs): they exhibit deep reasoning capabilities, discovering novel improvements in optimization algorithms and model architecture, and formulating innovative reward functions that target long‑term user engagement. The effectiveness of this approach is demonstrated through several successful production launches at YouTube, confirming that autonomous, LLM‑driven evolution can surpass traditional engineering workflows in both development velocity and model performance.

PaperID: 5663, https://arxiv.org/pdf/2602.10133.pdf

Authors: Adam AlSayyad, Kelvin Yuxiang Huang, Richik Pal

Title: AgentTrace: A Structured Logging Framework for Agent System Observability

Abstract:
Despite the growing capabilities of autonomous agents powered by large language models (LLMs), their adoption in high‑stakes domains remains limited. A key barrier is security: the inherently nondeterministic behavior of LLM agents defies static auditing approaches that have historically underpinned software assurance. Existing security methods, such as proxy‑level input filtering and model glassboxing, fail to provide sufficient transparency or traceability into agent reasoning, state changes, or environmental interactions. In this work, we introduce AgentTrace, a dynamic observability and telemetry framework designed to fill this gap. AgentTrace instruments agents at runtime with minimal overhead, capturing a rich stream of structured logs across three surfaces: operational, cognitive, and contextual. Unlike traditional logging systems, AgentTrace emphasizes continuous, introspectable trace capture, designed not just for debugging or benchmarking, but as a foundational layer for agent security, accountability, and real‑time monitoring. Our research highlights how AgentTrace can enable more reliable agent deployment, fine‑grained risk analysis, and informed trust calibration, thereby addressing critical concerns that have so far limited the use of LLM agents in sensitive environments.

PaperID: 5664, https://arxiv.org/pdf/2602.10046.pdf

Authors: Doehyun Baek, Michael Pradel

Title: Artisan: Agentic Artifact Evaluation

Abstract:
Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor‑intensive, and hence, done only as a one‑time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions: First, we frame the reproduction problem as a code generation task where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine‑grained level. Second, we design automated judging mechanism that guides the agent toward the expected results without revealing them and that prevent trivial solutions, such as simply copying checked‑in results. To evaluate Artisan, we introduce Artisan‑Bench, the first benchmark assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan‑Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan‑Bench for reproducibility to ensure that the tasks are feasible. Our experiments show that Artisan is effective, producing 44/60 reproduction scripts and outperforming the best available baseline, a vanilla LLM agent (mini‑swe‑agent), by 3.14× in terms of reproduction scripts generated while taking 0.45 and 48 minutes, on average per task. Artisan also helped uncover 20 new errors in either the paper or artifact.

PaperID: 5665, https://arxiv.org/pdf/2602.10029.pdf

Authors: Chuan-Chi Lai

Title: Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure

Abstract:
Ensuring continuous service coverage under unexpected hardware failures is a fundamental challenge for 3D Aerial‑Ground Integrated Networks. Although Multi‑Agent Reinforcement Learning facilitates autonomous coordination, traditional architectures often lack resilience to sudden topology deformations. This paper proposes the Topology‑Aware Graph MAPPO (TAG‑MAPPO) framework to enhance system survivability through autonomous 3D spatial reconfiguration. Our framework integrates graph‑based feature aggregation with a residual ego‑state fusion mechanism to capture intricate inter‑agent dependencies. To achieve structural robustness, we introduce a Random Observation Shuffling mechanism that fosters strong generalization to agent population fluctuations by breaking coordinate‑index dependencies. Extensive simulations across heterogeneous environments, including high‑speed mobility at 15 meters per second, demonstrate that TAG‑MAPPO significantly outperforms Multi‑Layer Perceptron baselines. Specifically, the framework reduces redundant handoffs by up to 50 percent while maintaining superior energy efficiency. Most notably, TAG‑MAPPO exhibits exceptional self‑healing capabilities, restoring over 90 percent of pre‑failure coverage within 15 time steps. In dense urban scenarios, the framework achieves a post‑failure fairness index surpassing its original four‑UAV configuration by autonomously resolving service overlaps and interference. These findings confirm that topology‑aware coordination is essential for resilient 6G aerial networks.

PaperID: 5666, https://arxiv.org/pdf/2602.09994.pdf

Authors: Chuan-Chi Lai, Chi Jai Choy

Title: ORCHID: Fairness-Aware Orchestration in Mission-Critical Air-Ground Integrated Networks

Abstract:
In the era of 6G Air‑Ground Integrated Networks (AGINs), Unmanned Aerial Vehicles (UAVs) are pivotal for providing on‑demand wireless coverage in mission‑critical environments, such as post‑disaster rescue operations. However, traditional Deep Reinforcement Learning (DRL) approaches for multi‑UAV orchestration often face critical challenges: instability due to the non‑stationarity of multi‑agent environments and the difficulty of balancing energy efficiency with service equity. To address these issues, this paper proposes ORCHID (Orchestration of Resilient Coverage via Hybrid Intelligent Deployment), a novel stability‑enhanced two‑stage learning framework. First, ORCHID leverages a GBS‑aware topology partitioning strategy to mitigate the exploration cold‑start problem. Second, we introduce a Reset‑and‑Finetune (R\&F) mechanism within the MAPPO architecture that stabilizes the learning process via synchronized learning rate decay and optimizer state resetting. This mechanism effectively suppresses gradient variance to prevent policy degradation, thereby ensuring algorithmic resilience in dynamic environments. Furthermore, we uncover a counter‑intuitive efficiency‑fairness synergy: contrary to the conventional trade‑off, our results demonstrate that the proposed Max‑Min Fairness (MMF) design not only guarantees service for cell‑edge users but also achieves superior energy efficiency compared to Proportional Fairness (PF), which tends to converge to suboptimal greedy equilibria. Extensive experiments confirm that ORCHID occupies a superior Pareto‑dominant position compared to state‑of‑the‑art baselines, ensuring robust convergence and resilient connectivity in mission‑critical scenarios.

PaperID: 5667, https://arxiv.org/pdf/2602.09937.pdf

Authors: Taeyoon Kim, Woohyeok Park, Hoyeong Yun, Kyungyong Lee

Title: Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

Abstract:
Failures in large‑scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent's reasoning failed. This paper presents a process level failure analysis of LLM‑based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra‑agent reasoning, inter‑agent communication, and agent‑environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter‑agent communication protocol reduces communication‑related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.

PaperID: 5668, https://arxiv.org/pdf/2602.09892.pdf

Authors: Jiale Zhao, Guoxin Chen, Fanzhe Meng, Minghao Li, Jie Chen, Hui Xu, Yongshuai Sun, Wayne Xin Zhao, Ruihua Song, Yuan Zhang, Peng Wang, Cheng Chen, Jirong Wen, Kai Jia

Title: Immersion in the GitHub Universe: Scaling Coding Agents to Mastery

Abstract:
Achieving mastery in real world software engineering tasks is fundamentally bottlenecked by the scarcity of large scale, high quality training data. Scaling such data has been limited by the complexity of environment setup, unit test generation, and problem statement curation. In this paper, we propose ScaleSWE, an automated, sandboxed multi agent workflow designed to construct high quality SWE data at scale. The system coordinates three specialized agents for environment setup, test creation, and problem description synthesis to process 6 million pull requests across 5200 repositories, producing Scale SWE Data: 100k verified SWE instances, the largest such dataset to date. It substantially surpasses existing real world datasets in repository diversity and reflects realistic task complexity. We further demonstrate the dataset utility for training by distilling 71498 high quality trajectories and finetuning Qwen30BA3BInstruct to produce ScaleSWE Agent. Our agent achieves a 64 resolve rate on SWE Bench Verified a nearly three fold improvement over the base model. ScaleSWE provides a scalable, reproducible approach for data construction to advance LLM based software engineering. Scale SWE will be publicly available.

PaperID: 5669, https://arxiv.org/pdf/2602.09877.pdf

Authors: Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu

Title: The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Abstract:
The emergence of multi‑agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self‑evolution. Ideally, such systems would achieve continuous self‑improvement in a fully closed loop while maintaining robust safety alignment‑‑a combination we term the self‑evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self‑evolution, complete isolation, and safety invariance is impossible. Drawing on an information‑theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self‑evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open‑ended agent community (Moltbook) and two closed self‑evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on the self‑evolving AI societies and shifts the discourse from symptom‑driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety‑preserving mechanisms.

PaperID: 5670, https://arxiv.org/pdf/2602.09839.pdf

Authors: Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang, Xi Peng

Title: ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge

Abstract:
Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily‑life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi‑step reasoning. We evaluate 23 representative text‑based and multimodal retrievers on ARK and observe a pronounced gap between knowledge‑intensive and reasoning‑intensive retrieval, with fine‑grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re‑ranking and rewriting yield consistent improvements, but substantial headroom remains.

PaperID: 5671, https://arxiv.org/pdf/2602.09829.pdf

Authors: Yang Wu, Haoze Wang, Qian Li, Jun Zhang, Huan Yu, Jie Jiang

Title: Internalizing Multi-Agent Reasoning for Accurate and Efficient LLM-based Recommendation

Abstract:
Large Language Models (LLMs) are reshaping recommender systems by leveraging extensive world knowledge and semantic reasoning to interpret user intent. However, effectively integrating these capabilities with collaborative signals while avoiding prohibitive inference latency remains a critical bottleneck. To address this, we propose a trajectory‑driven internalization framework to develop a Single‑agent Trajectory‑Aligned Recommender (STAR). Specifically, to internalize complex reasoning capabilities into a single efficient model, we first design a multi‑agent teacher system capable of multi‑turn tool usage and reflection. This teacher utilizes a Collaborative Signal Translation mechanism to explicitly convert latent behavioral patterns into descriptive natural language evidence to enhance reasoning accuracy. Subsequently, a trajectory‑driven distillation pipeline transfers this agentic logic, including planning, tool usage, and self‑reflection, into the compact STAR model. Extensive experiments demonstrate that STAR surpasses its teacher by 8.7% to 39.5% while eliminating iterative latency, paving the way for real‑time, reasoning‑enhanced recommendation.

PaperID: 5672, https://arxiv.org/pdf/2602.09809.pdf

Authors: Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

Title: SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Abstract:
Scientific diagrams convey explicit structural information, yet modern text‑to‑image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image‑centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel‑based diagram generation underexplored. We introduce SciFlow‑Bench, a structure‑first benchmark for evaluating scientific diagram generation directly from pixel‑level outputs. Built from real scientific PDFs, SciFlow‑Bench pairs each source framework figure with a canonical ground‑truth graph and evaluates models as black‑box image generators under a closed‑loop, round‑trip protocol that inverse‑parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi‑agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure‑aware evaluation.

PaperID: 5673, https://arxiv.org/pdf/2602.09801.pdf

Authors: Agnieszka Dobrowolska, Rogier Hintzen, Martin Balla, Karl Gemayel, Sabine Reichert, Thomas Charman, Jen Ning Lim, Lindsay Edwards, Anna Gogleva

Title: Tiny Moves: Game-based Hypothesis Refinement

Abstract:
Most machine learning approaches to scientific discovery frame hypotheses as end‑to‑end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The framework is motivated by the observation that scientific progress often proceeds through small, localized revisions, grounded in domain context, rather than extensive rewrites. We instantiate a minimal game with LLM agents and evaluate it on pathway‑level mechanistic refinement tasks. In the primary setting of corruption recovery, where hypotheses contain controlled errors, the game‑based approach consistently removes more errors and achieves higher precision than strong prompting baselines, while preserving valid structure through incremental edits. In a secondary reconstruction setting from partial cues, it performs comparably to the strongest baseline, indicating that explicit move‑based refinement remains competitive even when ground‑truth recovery is difficult. These findings support game‑based reasoning as a principled route to more controllable, interpretable, and transferable hypothesis refinement systems for scientific discovery.

PaperID: 5674, https://arxiv.org/pdf/2602.09695.pdf

Authors: Gian Carlo Maffettone, Davide Salzano, Mario di Bernardo

Title: Robust Macroscopic Density Control of Heterogeneous Multi-Agent Systems

Abstract:
Modern applications, such as orchestrating the collective behavior of robotic swarms or traffic flows, require the coordination of large groups of agents evolving in unstructured environments, where disturbances and unmodeled dynamics are unavoidable. In this work, we develop a scalable macroscopic density control framework in which a feedback law is designed directly at the level of an advection‑‑diffusion partial differential equation. We formulate the control problem in the density space and prove global exponential convergence towards the desired behavior in \mathcalL^2 with guaranteed asymptotic rejection of bounded unknown drift terms, explicitly accounting for heterogeneous agent dynamics, unmodeled behaviors, and environmental perturbations. Our theoretical findings are corroborated by numerical experiments spanning heterogeneous oscillators, traffic systems, and swarm robotics in partially unknown environments.

PaperID: 5675, https://arxiv.org/pdf/2602.09598.pdf

Authors: Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen, Bo Zheng, Sheng Guo

Title: Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Abstract:
Tool‑integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome‑only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step‑level credit assignment. In long‑horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine‑grained credit assignment. We propose Error‑Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary‑search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error‑localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool‑call efficiency. Our code will be publicly released soon.

PaperID: 5676, https://arxiv.org/pdf/2602.09578.pdf

Authors: Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, Qiong Chen, Xinyu Liu, Fangming Liu, Ai Han, Zhen Chen, Ke Zhang

Title: Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

Abstract:
Despite algorithm‑level innovations for multi‑agent reinforcement learning (MARL), the underlying networked infrastructure for large‑scale MARL training remains underexplored. Existing training frameworks primarily optimize for single‑agent scenarios and fail to address the unique system‑level challenges of MARL, including rollout‑training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end‑to‑end training framework that holistically optimizes rollout, training, and their orchestration for large‑scale LLM‑based MARL. Specifically, FlexMARL introduces the joint orchestrator to manage data flow under the rollout‑training disaggregated architecture. Building upon the experience store, a novel micro‑batch driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. Rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter/intra‑agent request patterns. Training engine achieves on‑demand hardware binding through agent‑centric resource allocation. The training states of different agents are swapped via unified and location‑agnostic communication. Empirical results on a large‑scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.

PaperID: 5677, https://arxiv.org/pdf/2602.09514.pdf

Authors: Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou

Title: EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Abstract:
Long‑horizon planning is widely recognized as a core capability of autonomous LLM‑based agents; however, current evaluation frameworks suffer from being largely episodic, domain‑specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan‑and‑execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending (adapted from the closed‑source Vending‑Bench, with full open‑source release), Freelance (new), and Operation (new), implemented in a unified decision‑making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day‑loops for evaluation). The evaluation of EcoGym is based on business‑relevant outcomes (e.g., net worth, income, and DAU), targeting long‑term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high‑level strategies or efficient actions executions. EcoGym is released as an open, extensible testbed for transparent long‑horizon agent evaluation and for studying controllability utility trade‑offs in economic settings.

PaperID: 5678, https://arxiv.org/pdf/2602.09496.pdf

Authors: Yate Ge, Lin Tian, Chiqian Xu, Luyao Xu, Meiying Li, Yuanda Hu, Weiwei Guo

Title: Jokeasy: Exploring Human-AI Collaboration in Thematic Joke Generation

Abstract:
Thematic jokes are central to stand‑up comedy, sitcoms, and public speaking, where contexts and punchlines rely on fresh material ‑ news, anecdotes, and cultural references that resonate with the audience. Recent advances in Large Language Models (LLMs) have enabled interactive joke generation through conversational interfaces. Although LLMs enable interactive joke generation, ordinary conversational interfaces seldom give creators enough agency, control, or timely access to such source material for constructing context and punchlines. We designed Jokeasy, a search‑enabled prototype system that integrates a dual‑role LLM agent acting as both a material scout and a prototype writer to support human‑AI collaboration in thematic joke writing. Jokeasy provides a visual canvas in which retrieved web content is organized into editable inspiration blocks and developed through a multistage workflow. A qualitative study with 13 hobbyists and 5 expert participants (including professional comedians and HCI/AI specialists) showed that weaving real‑time web material into this structured workflow enriches ideation and preserves author agency, while also revealing needs for finer search control, tighter chat‑canvas integration, and more flexible visual editing. These insights refine our understanding of AI‑assisted humour writing and guide future creative‑writing tools.

PaperID: 5679, https://arxiv.org/pdf/2602.09463.pdf

Authors: Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

Title: SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Abstract:
Large Vision‑Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo‑localization, yet they often struggle in real‑world scenarios where visual cues are sparse, long‑tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo‑localization into an agentic reasoning process that leverages expert‑level reasoning to synergize visual interpretation with tool‑assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3‑stage post‑training pipeline starting with a Supervised Fine‑Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high‑quality trajectories synthesized via a Multi‑Agent framework, aiming to instill tool‑calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially‑Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state‑of‑the‑art performance, effectively mitigating hallucinations while delivering precise and verifiable geo‑localization.

PaperID: 5680, https://arxiv.org/pdf/2602.09412.pdf

Authors: Minh Hoang Trinh, Hieu Minh Nguyen

Title: Dieu khien he da tac tu

Abstract:
Since the early 2000s, control of multiagent systems has attracted significant research interest, with applications ranging from natural collective behaviors and social dynamics to engineered systems such as autonomous vehicles, sensor networks, and smart grids. Although research on multi‑agent systems has diversified into numerous specialized directions, textbooks ‑‑ including those in English ‑‑ that provide a systematic treatment of the fundamental principles of multi‑agent system control remain scarce. The material presented in this book has been developed and used in teaching since 2021, initially as a concise Vietnamese‑language reference for the courses Networked Control Systems and Control of Multi‑Agent Systems at Hanoi University of Science and Technology. The book focuses on a selection of fundamental topics of broad and continuing interest in the field. The complexity of several topics is asymptotic to that encountered in research‑level studies, however, the analysis is presented in a step‑by‑step manner to facilitate access to commonly used methods and tools. The material is divided into three main parts. Part I introduces multiagent systems and basic graph‑theoretic concepts. Part II addresses the design and analysis of linear consensus algorithms. Part III covers selected applications and research directions, including formation control, network localization, distributed optimization, opinion dynamics, and matrix‑weighted networks. Each chapter concludes with notes on notable researchers in this field, further reading, and exercises. This book cannot be completed without the encouragement, support and suggestions from families, colleagues and friends. The authors appreciate feedback from readers to further improve the content of the book.

PaperID: 5681, https://arxiv.org/pdf/2602.09381.pdf

Authors: Zihao Zhu, Junnan Yu, Yuhan Luo

Title: Scaffolding Metacognition with GenAI: Exploring Design Opportunities to Support Task Management for University Students with ADHD

Abstract:
For university students transitioning to an independent and flexible lifestyle, having ADHD poses multiple challenges to their academic task management, which are closely tied to their metacognitive struggles‑‑difficulties in awareness and regulation of one's own thinking processes. The recently surged Generative AI shows promise to mitigate these gaps with its advanced information understanding and generation capabilities. As an exploratory step, we conducted co‑design sessions with 20 university students diagnosed with ADHD, followed by interviews with five experts specialized in ADHD intervention. Adopting a metacognitive lens, we examined participants' ideas on GenAI‑based task management support and experts' assessments, which led to three design directions: providing cognitive scaffolding to enhance task and self‑awareness, promoting reflective task execution for building metacognitive abilities, and facilitating emotional regulation to sustain task engagement. Drawing on these findings, we discuss opportunities for GenAI to support the metacognitive needs of neurodivergent populations, offering future directions for both research and practice.

PaperID: 5682, https://arxiv.org/pdf/2602.09375.pdf

Authors: Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu

Title: Latent Poincaré Shaping for Agentic Reinforcement Learning

Abstract:
We propose LaPha, a method for training AlphaZero‑like LLM agents in a Poincaré latent space. Under LaPha, the search process can be visualized as a tree rooted at the prompt and growing outward from the origin toward the boundary of the Poincaré ball, where negative curvature provides exponentially increasing capacity with radius. Using hyperbolic geodesic distance to rule‑verified correctness, we define a node potential and assign dense process rewards by potential differences. We further attach a lightweight value head on the same shared latent space, enabling self‑guided test‑time scaling with almost no additional overhead. On MATH‑500, LaPha improves Qwen2.5‑Math‑1.5B from 66.0% to 88.2%. With value‑head‑guided search, LaPha‑1.5B reaches 56.7% accuracy on AIME'24, and LaPha‑7B further achieves 60.0% on AIME'24 and 53.3% on AIME'25.

PaperID: 5683, https://arxiv.org/pdf/2602.09341.pdf

Authors: Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, Jesse Thomason

Title: Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge

Abstract:
Multi‑agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti‑Consensus Preference Optimization (ACPO), which trains the adjudicator on majority‑failure cases and rewards evidence‑based minority selections over popular errors. AgentAuditor is agnostic to MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM‑as‑Judge.

PaperID: 5684, https://arxiv.org/pdf/2602.09339.pdf

Authors: Jianfeng Zhu, Karin G. Coifman, Ruoming Jin

Title: Understanding Risk and Dependency in AI Chatbot Use from User Discourse

Abstract:
Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large‑scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI‑related harm and distress. Using a multi‑agent, LLM‑assisted thematic analysis grounded in Braun and Clarke's reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher‑order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT‑based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI‑related psychological risk grounded in real‑world user discourse, with self‑regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.

PaperID: 5685, https://arxiv.org/pdf/2602.09336.pdf

Authors: Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao

Title: FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Abstract:
Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross‑domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario‑aware graph reasoning for conditional logic. Second, we propose an automatic multi‑agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with our 32B model and 34.3% with our opensource 7B model, matching Qwen‑2.5‑72B‑Instruct baseline (34.4%) with 10x fewer parameters.

PaperID: 5686, https://arxiv.org/pdf/2602.09305.pdf

Authors: Pei-Chi Pan, Yingbin Liang, Sen Lin

Title: Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Abstract:
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)‑based fine‑tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM challenges‑‑such as evaluation bias, hallucination, distribution shift, and efficient learning‑‑remains poorly understood. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment, shaping what models learn, how they generalize, and whether their outputs can be trusted. We introduce Reasoning‑Aligned Reinforcement Learning (RARL), a unifying framework that systematizes diverse reward paradigms for multi‑step reasoning. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges ranging from inference‑time scaling to hallucination mitigation. We further critically evaluate existing benchmarks, highlighting vulnerabilities such as data contamination and reward misalignment, and outline directions for more robust evaluation. By integrating fragmented research threads and clarifying the interplay between reward design and fundamental reasoning capabilities, this work provides a foundational roadmap for building reasoning models that are robust, verifiable, and trustworthy.

PaperID: 5687, https://arxiv.org/pdf/2602.09259.pdf

Authors: Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

Title: Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Abstract:
In robot‑assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze‑guided training and learning‑based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active‑passive, multi‑task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same‑video comparisons. We quantify skill‑ and modality‑dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single‑frame saliency modeling. Across settings, MSI‑Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate‑passive targets with limited loss on higher‑quality demonstrations, suggesting a practical path for scalable, crowd‑sourced gaze supervision in surgical coaching and perception modeling.

PaperID: 5688, https://arxiv.org/pdf/2602.09185.pdf

Authors: Hao Li, Haoxiang Zhang, Ahmed E. Hassan

Title: AIDev: Studying AI Coding Agents on GitHub

Abstract:
AI coding agents are rapidly transforming software engineering by performing tasks such as feature development, debugging, and testing. Despite their growing impact, the research community lacks a comprehensive dataset capturing how these agents are used in real‑world projects. To address this gap, we introduce AIDev, a large‑scale dataset focused on agent‑authored pull requests (Agentic‑PRs) in real‑world GitHub repositories. AIDev aggregates 932,791 Agentic‑PRs produced by five agents: OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code. These PRs span 116,211 repositories and involve 72,189 developers. In addition, AIDev includes a curated subset of 33,596 Agentic‑PRs from 2,807 repositories with over 100 stars, providing further information such as comments, reviews, commits, and related issues. This dataset offers a foundation for future research on AI adoption, developer productivity, and human‑AI collaboration in the new era of software engineering. > AI Agent, Agentic AI, Coding Agent, Agentic Coding, Agentic Software Engineering, Agentic Engineering

PaperID: 5689, https://arxiv.org/pdf/2602.09163.pdf

Authors: Xingjian Zhang, Sophia Moylan, Ziyang Xiong, Qiaozhu Mei, Yichen Luo, Jiaqi W. Ma

Title: FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Abstract:
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology‑grounded annotations ‑ a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end‑to‑end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full‑text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert‑curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single‑agent, and multi‑agent. We find that architectural choices significantly impact performance, with multi‑agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval‑augmented scientific reasoning, a capability with broad applications across scientific domains.

PaperID: 5690, https://arxiv.org/pdf/2602.09159.pdf

Authors: Yichen Wu, Yujin Oh, Sangjoon Park, Kailong Fan, Dania Daye, Hana Farzaneh, Xiang Li, Raul Uppot, Quanzheng Li

Title: CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective

Abstract:
Recent multi‑agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution‑Aware Medical Multi‑Agents (CoMMa), a decentralized LLM‑agent framework in which specialists operate on partitioned evidence and coordinate through a game‑theoretic objective for robust decision‑making. In contrast to most agent architectures relying on stochastic narrative‑based reasoning, CoMMa utilizes deterministic embedding projections to approximate contribution‑aware credit assignment. This yields explicit evidence attribution by estimating each agent's marginal utility, producing interpretable and mathematically grounded decision pathways with improved stability. Evaluated on diverse oncology benchmarks, including a real‑world multidisciplinary tumor board dataset, CoMMa achieves higher accuracy and more stable performance than data‑centralized and role‑based multi‑agents baselines.

PaperID: 5691, https://arxiv.org/pdf/2602.09138.pdf

Authors: Haitao Jiang, Lin Ge, Hengrui Cai, Rui Song

Title: PABU: Progress-Aware Belief Update for Efficient LLM Agents

Abstract:
Large Language Model (LLM) agents commonly condition actions on full action‑observation histories, which introduce task‑irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress‑Aware Belief Update (PABU), a belief‑state framework that compactly represents an agent's state by explicitly modeling task progress and selectively retaining past actions and observations. At each step, the agent predicts its relative progress since the previous round and decides whether the newly encountered interaction should be stored, conditioning future decisions only on the retained subset. Across eight environments in the AgentGym benchmark, and using identical training trajectories, PABU achieves an 81.0% task completion rate, outperforming previous State of the art (SoTA) models with full‑history belief by 23.9%. Additionally, PABU's progress‑oriented action selection improves efficiency, reducing the average number of interaction steps to 9.5, corresponding to a 26.9% reduction. Ablation studies show that both explicit progress prediction and selective retention are necessary for robust belief learning and performance gains.

PaperID: 5692, https://arxiv.org/pdf/2602.09130.pdf

Authors: Jonathan von Rad, Yong Cao, Andreas Geiger

Title: UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

Abstract:
Model compression is increasingly essential for deploying large language models (LLMs), yet existing comparative studies largely focus on pruning and quantization evaluated primarily on knowledge‑centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability‑ and safety‑oriented benchmarks together with a hardware‑aware efficiency analysis. Through evaluation of six compression techniques across 40 datasets, we observe (i) a consistent knowledge bias, where factual recall is largely preserved while multi‑step reasoning, multilingual, and instruction‑following capabilities degrade; (ii) a decoupling between performance and reliability, indicating that retained performance does not consistently imply preserved reliability; and (iii) that task‑specific calibration can yield up to 50% relative improvement of reasoning performance in pruned models.

PaperID: 5693, https://arxiv.org/pdf/2602.09076.pdf

Authors: Nhat Le, Daeun Song, Xuesu Xiao

Title: Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception

Abstract:
Predicting human trajectory is crucial for social robot navigation in crowded environments. While most existing approaches treat human as point mass, we present a study on multi‑agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and another new dataset for social navigation with 360‑degree panoramic videos, we find that focusing on lower‑body 3D keypoints yields a 13% reduction in Average Displacement Error and augmenting 3D keypoint inputs with corresponding biomechanical cues provides a further 1‑4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching their legs provides actionable insights for designing sensing capabilities for social robot navigation.

PaperID: 5694, https://arxiv.org/pdf/2602.08965.pdf

Authors: John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, George J. Pappas

Title: Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

Abstract:
The inability to communicate poses a major challenge to coordination in multi‑agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision‑making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication‑free correlated policies than shared randomness alone. This is motivated by well‑known results in quantum physics which posit that, for certain single‑round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single‑round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi‑agent sequential decision‑making problem formulated as a decentralized partially observable Markov decision process (Dec‑POMDP).

PaperID: 5695, https://arxiv.org/pdf/2602.08964.pdf

Authors: Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

Title: A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Abstract:
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal‑directedness that integrates behavioural evaluation with interpretability‑based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world towards a goal state. Behaviourally, we evaluate the agent against optimal policies across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty‑preserving transformations and multi‑goal structures. We then use probing methods to decode internal representations of the environment and multi‑step action plans. We find that the LLM agent non‑linearly encodes a coarse spatial map, preserving approximate task‑relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from spatial cues towards immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

PaperID: 5696, https://arxiv.org/pdf/2602.08941.pdf

Authors: Zeyu Lu, Dennis L. Barbour

Title: pixelLOG: Logging of Online Gameplay for Cognitive Research

Abstract:
Traditional cognitive assessments often rely on isolated, output‑focused measurements that may fail to capture the complexity of human cognition in naturalistic settings. We present pixelLOG, a high‑performance data collection framework for Spigot‑based Minecraft servers designed specifically for process‑based cognitive research. Unlike existing frameworks tailored only for artificial intelligence agents, pixelLOG also enables human behavioral tracking in multi‑player/multi‑agent environments. Operating at configurable frequencies up to and exceeding 20 updates per second, the system captures comprehensive behavioral data through a hybrid approach of active state polling and passive event monitoring. By leveraging Spigot's extensible API, pixelLOG facilitates robust session isolation and produces structured JSON outputs integrable with standard analytical pipelines. This framework bridges the gap between decontextualized laboratory assessments and richer, more ecologically valid tasks, enabling high‑resolution analysis of cognitive processes as they unfold in complex, virtual environments.

PaperID: 5697, https://arxiv.org/pdf/2602.08938.pdf

Authors: Tuo Zhang, Leonardo Stella

Title: Teaching an Old Dynamics New Tricks: Regularization-free Last-iterate Convergence in Zero-sum Games via BNN Dynamics

Abstract:
Zero‑sum games are a fundamental setting for adversarial training and decision‑making in multi‑agent learning (MAL). Existing methods often ensure convergence to (approximate) Nash equilibria by introducing a form of regularization. Yet, regularization requires additional hyperparameters, which must be carefully tuned‑‑a challenging task when the payoff structure is known, and considerably harder when the structure is unknown or subject to change. Motivated by this problem, we repurpose a classical model in evolutionary game theory, i.e., the Brown‑von Neumann‑Nash (BNN) dynamics, by leveraging the intrinsic convergence of this dynamics in zero‑sum games without regularization, and provide last‑iterate convergence guarantees in noisy normal‑form games (NFGs). Importantly, to make this approach more applicable, we develop a novel framework with theoretical guarantees that integrates the BNN dynamics in extensive‑form games (EFGs) through counterfactual weighting. Furthermore, we implement an algorithm that instantiates our framework with neural function approximation, enabling scalable learning in both NFGs and EFGs. Empirical results show that our method quickly adapts to nonstationarities, outperforming the state‑of‑the‑art regularization‑based approach.

PaperID: 5698, https://arxiv.org/pdf/2602.08847.pdf

Authors: Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An

Title: Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Abstract:
Multi‑agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post‑training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group‑based RL to multi‑agent LLM systems. We show that under GRPO‑style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient‑norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi‑agent LLM systems. Dr. MAS uses an agent‑wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end‑to‑end RL training framework for multi‑agent LLM systems, supporting scalable orchestration, flexible per‑agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi‑agent math reasoning and multi‑turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent‑model assignments while improving efficiency.

PaperID: 5699, https://arxiv.org/pdf/2602.08835.pdf

Authors: Andrés Holgado-Sánchez, Peter Vamplew, Richard Dazeley, Sascha Ossowski, Holger Billhardt

Title: Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

Abstract:
Value‑aware AI should recognise human values and adapt to the value systems (value‑based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value‑based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference‑based multi‑objective reinforcement learning (PbMORL). We jointly learn socially‑derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value‑based preferences of its members and an approximately Pareto‑optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state‑of‑the‑art PbMORL algorithm and baselines on two MDPs with human values.

PaperID: 5700, https://arxiv.org/pdf/2602.08765.pdf

Authors: Micah Villmow

Title: Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

Abstract:
LLM‑based tools are automating more software development tasks at a rapid pace, but there is no rigorous way to evaluate how different architectural choices ‑‑ prompts, skills, tools, multi‑agent setups ‑‑ materially affect both capability and cost. This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools through structured ablation studies that uses seven testing tiers (T0‑T6) progressively adding complexity to isolate what directly influences results and how. The key metric is Cost‑of‑Pass (CoP): the expected dollar cost to get one correct solution, which directly quantifies the trade‑off between complexity and efficiency. The framework is model‑agnostic, designed to work with any CLI tool; this paper demonstrates it with Claude Sonnet 4.5, using multiple LLM judges (Opus 4.5, Sonnet 4.5, Haiku 4.5) from the same vendor for evaluation consensus, where judges score results using direct tests, human‑designed LLM‑evaluated rubrics, and qualitative assessment. The result is a reproducible framework that quantifies trade‑offs between agent complexity and actual outcomes, suggesting that architectural complexity does not always improve quality.

PaperID: 5701, https://arxiv.org/pdf/2602.08701.pdf

Authors: Mohammad Shaharyar Ahsan, Areeba Shahzad Shaikh, Maham Zahid, Umer Irfan, Maryam Mustafa, Naveed Anwar Bhatti, Muhammad Hamad Alizai

Title: LLM-Enhanced Wearables for Comprehensible Health Guidance in LMICs

Abstract:
Personal health monitoring via IoT in LMICs is limited by affordability, low digital literacy, and limited health data comprehension. We present Guardian Angel, a low‑cost, screenless wearable paired with a WhatsApp‑based LLM agent that delivers plain‑language, personalized insights. The LLM operates directly on raw, noisy sensor waveforms and is robust to the poor signal quality of low‑cost hardware. On a benchmark dataset, a standard open‑source algorithm produced valid outputs for only 70.29% of segments, whereas Guardian Angel achieved 100% availability (reported as coverage under field noise, distinct from accuracy), yielding a continuous and understandable physiological record. In a 96‑hour study involving 20 participants (1,920 participant‑hours), users demonstrated significant improvements in health data comprehension and mindfulness of vital signs. These results suggest a practical approach to enhancing health literacy and adoption in resource‑constrained settings.

PaperID: 5702, https://arxiv.org/pdf/2602.08586.pdf

Authors: Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

Title: DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Abstract:
Multi‑agent LLM systems consistently outperform single‑agent baselines, yet practitioners still cannot predict which design works for a new task or diagnose why one fails. We argue this gap persists largely because the field lacks a diagnostic framework with measurable primitives and testable predictions. We introduce DIANOIA, a three‑channel decomposition of multi‑agent reasoning gain into coverage, fidelity, and synthesis, each of which is empirically measurable. From this decomposition, we derive a diagnostic protocol that identifies the bottleneck channels for any given task. We instantiate the protocol as a multi‑agent system whose three components mirror the channels: role‑diverse proposers for coverage, execution‑grounded verification for fidelity, and iterative synthesis. On GSM8K, AIME‑2025, MBPP, and BFCL‑SP, our method outperforms strong multi‑agent baselines under matched token budgets, dominating the Pareto frontier on MBPP at ～5× token savings and reaching +4.6pp at matched cost. On every benchmark, the protocol picks the right bottleneck channels; the system we built around it leads across models. We release code, adapters, diagnostic metrics, and a Claude Code skill at https://anonymous.4open.science/r/DIANOIA4MAS. DIANOIA reframes multi‑agent design as channel‑aware resource allocation: diagnose which channel is the bottleneck for your task, then invest tokens accordingly.

PaperID: 5703, https://arxiv.org/pdf/2602.08571.pdf

Authors: Simon Hoffmann, Simon Sagmeister, Tobias Betz, Joscha Bongard, Sascha Büttner, Dominic Ebner, Daniel Esser, Georg Jank, Sven Goblirsch, Alexander Langmann, Maximilian Leitenstern, Levent Ögretmen, Phillip Pitschi, Ann-Kathrin Schwehn, Cornelius Schröder, Marcel Weinmann, Frederik Werner, Boris Lohmann, Johannes Betz, Markus Lienkamp

Title: Head-to-Head autonomous racing at the limits of handling in the A2RL challenge

Abstract:
Autonomous racing presents a complex challenge involving multi‑agent interactions between vehicles operating at the limit of performance and dynamics. As such, it provides a valuable research and testing environment for advancing autonomous driving technology and improving road safety. This article presents the algorithms and deployment strategies developed by the TUM Autonomous Motorsport team for the inaugural Abu Dhabi Autonomous Racing League (A2RL). We showcase how our software emulates human driving behavior, pushing the limits of vehicle handling and multi‑vehicle interactions to win the A2RL. Finally, we highlight the key enablers of our success and share our most significant learnings.

PaperID: 5704, https://arxiv.org/pdf/2602.08567.pdf

Authors: Jinnuo Liu, Chuke Liu, Hua Shen

Title: ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems

Abstract:
Multi‑agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations propagate through agent interactions remains poorly understood. We present ValueFlow, a perturbation‑based framework that measures value drift in multi‑agent systems via a 56‑value valuation dataset derived from the Schwartz Value Survey, with agent value orientations scored using an LLM‑as‑a‑judge protocol. ValueFlow decomposes value drift into agent‑level response behavior and system‑level structural effects, captured by two metrics: \beta‑susceptibility, an agent's sensitivity to perturbed peer value signals, and system susceptibility (SS), the effect of node‑level perturbations on final system outputs.Experiments span across value dimensions, backbones, personas, and topologies, showing that susceptibility varies sharply across values and is strongly shaped by interaction structure, indicating that value alignment in multi‑agent systems is a system‑level property, not just an agent‑level one. ValueFlow thus provides a principled basis for auditing and mitigating value propagation in deployed multi‑agent systems.

PaperID: 5705, https://arxiv.org/pdf/2602.08566.pdf

Authors: Riya Singh, Pratikshya Jena, Anish Kumar, Shradha Mishra

Title: Homing through Reinforcement Learning

Abstract:
Homing and navigation are fundamental behaviors in biological systems that enable agents to reliably reach a target under uncertainty. We present a Reinforcement Learning (RL) framework to model adaptive homing in continuous two‑dimensional domain. In this framework, the agent's state is given by its angular deviation from home, actions correspond to alignment or stochastic reorientation, and learning is driven by a radial‑distance‑based cost that penalizes motion away from the target. For a single self‑propelled agent moving with constant speed, we find that the mean homing time \langle T_\mathrmhome \rangle exhibits a non‑monotonic dependence on the rotational diffusion strength D_r, with an optimal noise level D_r^, revealing a subtle interplay between exploration and goal‑directed correction. Extending to two agents with soft repulsion, one agent consistently reaches home faster than the other, while in multi‑agents system, repulsion ensures separation and the fastest agent becomes progressively faster as group size increases. Finally comparing the mean homing time \langle T_\mathrmhome \rangle of an Active Brownian Particle (ABP) and RL agent over an identical range of D_r, we find that RL trajectories are shorter, less noisy, and consistently faster. Our results show that cost‑driven learning, stochastic reorientation, and inter‑agent interactions enable efficient adaptive navigation, linking individual and collective homing. This reinforcement learning framework captures key biological features such as feedback‑based route learning, randomness to escape unfavorable orientations, and mutual coordination.

PaperID: 5706, https://arxiv.org/pdf/2602.08533.pdf

Authors: Kun Peng, Conghui Tan, Yu Liu, Guohua Tang, Zhongqian Sun, Wei Yang, Zining Zhu, Lei Jiang, Yanbing Liu, Hao Peng

Title: Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

Abstract:
Open‑ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over‑reliance on pre‑collected user data, and short‑horizon biases in reinforcement learning (RL) that neglect long‑term dialogue value. To address these, we propose a novel long‑horizon RL framework integrating online personalization with Adaptive Tree‑based Group Relative Policy Optimization (AT‑GRPO). Adopting a two‑agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user‑specific conversational traits) and active termination (predicting turn‑level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT‑GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage‑aware range: larger ranges support early‑stage topic exploration, while smaller ranges facilitate late‑stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long‑term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.

PaperID: 5707, https://arxiv.org/pdf/2602.08452.pdf

Authors: Senthil Rajasekaran, Moshe Y. Vardi

Title: Modeling Concurrent Multi-Agent Systems

Abstract:
Recent work in the field of multi‑agent systems has sought to use techniques and concepts from the field of formal methods to provide rigorous theoretical analysis and guarantees on complex systems where multiple agents strategically interact, leading to the creation of the field of equilibrium analysis, which studies equilibria concepts from the field of game theory through a complexity‑theoretic lens. Multi‑agent systems, however, are complex mathematical objects, and, therefore, defining them in a precise mathematical manner is non‑trivial. As a result, researchers often considered more restrictive models that are easier to model but lack expressive power or simply omit critical complexity‑theoretic results in their analysis. This paper addresses this problem by carefully analyzing and contrasting complexity‑theoretic results in the explicit model, a mathematically precise formulation of the models commonly used in the literature, and the circuit‑based model, a novel model that addresses the problems found in the literature. The utility of the circuit‑based model is demonstrated through a comprehensive analysis that considers upper and lower bounds for the realizability and verification problems, the two most important decision problems in equilibrium analysis, for both models. By conducting this analysis, we see that problematic issues that are endemic to the explicit model and the equilibrium analysis literature as a whole are adequately handled by the circuit‑based model.

PaperID: 5708, https://arxiv.org/pdf/2602.08389.pdf

Authors: Yao-hua Franck Xu, Tayeb Lemlouma, Arnaud Braud, Jean-Marie Bonnin

Title: Altruism and Fair Objective in Mixed-Motive Markov games

Abstract:
Cooperation is fundamental for society's viability, as it enables the emergence of structure within heterogeneous groups that seek collective well‑being. However, individuals are inclined to defect in order to benefit from the group's cooperation without contributing the associated costs, thus leading to unfair situations. In game theory, social dilemmas entail this dichotomy between individual interest and collective outcome. The most dominant approach to multi‑agent cooperation is the utilitarian welfare which can produce efficient highly inequitable outcomes. This paper proposes a novel framework to foster fairer cooperation by replacing the standard utilitarian objective with Proportional Fairness. We introduce a fair altruistic utility for each agent, defined on the individual log‑payoff space and derive the analytical conditions required to ensure cooperation in classic social dilemmas. We then extend this framework to sequential settings by defining a Fair Markov Game and deriving novel fair Actor‑Critic algorithms to learn fair policies. Finally, we evaluate our method in various social dilemma environments.

PaperID: 5709, https://arxiv.org/pdf/2602.08355.pdf

Authors: Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

Title: E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Abstract:
E‑commerce short videos represent a high‑revenue segment of the online video industry characterized by a goal‑driven format and dense multi‑modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general‑purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi‑modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e‑commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E‑commerce Video Ads Benchmark (E‑VAds), which is the first benchmark specifically designed for e‑commerce short video understanding. We curated 3,961 high‑quality videos from Taobao covering a wide range of product categories and used a multi‑agent system to generate 19,785 open‑ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E‑VAds‑R1, an RL‑based reasoning model featuring a multi‑grained reward design called MG‑GRPO. This strategy provides smooth guidance for early exploration while creating a non‑linear incentive for expert‑level precision. Experimental results demonstrate that E‑VAds‑R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

PaperID: 5710, https://arxiv.org/pdf/2602.08335.pdf

Authors: Yanming Li, Xuelin Zhang, WenJie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang, Hong Chen, Li Shen

Title: Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

Abstract:
Integrating Large Language Models (LLMs) with external tools via multi‑agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley‑based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi‑agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent‑specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast‑accuracy reward, a Shapley‑based marginal‑credit reward for each agent, and a tool‑process reward to improve execution efficiency. Extensive experiments across various real‑world benchmarks demonstrate that SHARP significantly outperforms recent state‑of‑the‑art baselines, achieving average match improvements of 23.66% and 14.05% over single‑agent and multi‑agent approaches, respectively.

PaperID: 5711, https://arxiv.org/pdf/2602.08281.pdf

Authors: Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng

Title: New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Abstract:
Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance‑level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi‑step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single‑step operations and evaluate their performance on unseen multi‑step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients (ρ\in [0.69, 0.96]); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.

PaperID: 5712, https://arxiv.org/pdf/2602.08276.pdf

Authors: Haoyu Jia, Kento Kawaharazuka, Kei Okada

Title: Toward Formalizing LLM-Based Agent Designs through Structural Context Modeling and Semantic Dynamics Analysis

Abstract:
Current research on large language model (LLM) agents is fragmented: discussions of conceptual frameworks and methodological principles are frequently intertwined with low‑level implementation details, causing both readers and authors to lose track amid a proliferation of superficially distinct concepts. We argue that this fragmentation largely stems from the absence of an analyzable, self‑consistent formal model that enables implementation‑independent characterization and comparison of LLM agents. To address this gap, we propose the \textttStructural Context Model, a formal model for analyzing and comparing LLM agents from the perspective of context structure. Building upon this foundation, we introduce two complementary components that together span the full lifecycle of LLM agent research and development: (1) a declarative implementation framework; and (2) a sustainable agent engineering workflow, \textttSemantic Dynamics Analysis. The proposed workflow provides principled insights into agent mechanisms and supports rapid, systematic design iteration. We demonstrate the effectiveness of the complete framework on dynamic variants of the monkey‑banana problem, where agents engineered using our approach achieve up to a 32 percentage points improvement in success rate on the most challenging setting.

PaperID: 5713, https://arxiv.org/pdf/2602.08272.pdf

Authors: Junwei Su, Chuan Wu

Title: When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems

Abstract:
Reinforcement Learning (RL) has emerged as a crucial method for training or fine‑tuning large language models (LLMs), enabling adaptive, task‑specific optimizations through interactive feedback. Multi‑Agent Reinforcement Learning (MARL), in particular, offers a promising avenue by decomposing complex tasks into specialized subtasks learned by distinct interacting agents, potentially enhancing the ability and efficiency of LLM systems. However, theoretical insights regarding when and why MARL outperforms Single‑Agent RL (SARL) remain limited, creating uncertainty in selecting the appropriate RL framework. In this paper, we address this critical gap by rigorously analyzing the comparative sample efficiency of MARL and SARL within the context of LLM. Leveraging the Probably Approximately Correct (PAC) framework, we formally define SARL and MARL setups for LLMs, derive explicit sample complexity bounds, and systematically characterize how task decomposition and alignment influence learning efficiency. Our results demonstrate that MARL improves sample complexity when tasks naturally decompose into independent subtasks, whereas dependent subtasks diminish MARL's comparative advantage. Additionally, we introduce and analyze the concept of task alignment, quantifying the trade‑offs when enforcing independent task decomposition despite potential misalignments. These theoretical insights clarify empirical inconsistencies and provide practical criteria for deploying MARL strategies effectively in complex LLM scenarios.

PaperID: 5714, https://arxiv.org/pdf/2602.08254.pdf

Authors: Arman Aghaee, Sepehr Asgarian, Jouhyun Jeon

Title: SynthAgent: A Multi-Agent LLM Framework for Realistic Patient Simulation -- A Case Study in Obesity with Mental Health Comorbidities

Abstract:
Simulating high‑fidelity patients offers a powerful avenue for studying complex diseases while addressing the challenges of fragmented, biased, and privacy‑restricted real‑world data. In this study, we introduce SynthAgent, a novel Multi‑Agent System (MAS) framework designed to model obesity patients with comorbid mental disorders, including depression, anxiety, social phobia, and binge eating disorder. SynthAgent integrates clinical and medical evidence from claims data, population surveys, and patient‑centered literature to construct personalized virtual patients enriched with personality traits that influence adherence, emotion regulation, and lifestyle behaviors. Through autonomous agent interactions, the system simulates disease progression, treatment response, and life management across diverse psychosocial contexts. Evaluation of more than 100 generated patients demonstrated that GPT‑5 and Claude 4.5 Sonnet achieved the highest fidelity as the core engine in the proposed MAS framework, outperforming Gemini 2.5 Pro and DeepSeek‑R1. SynthAgent thus provides a scalable and privacy‑preserving framework for exploring patient journeys, behavioral dynamics, and decision‑making processes in both medical and psychological domains.

PaperID: 5715, https://arxiv.org/pdf/2602.08146.pdf

Authors: Pengyu Chang, Yixiong Fang, Silin Chen, Yuling Shi, Beijun Shen, Xiaodong Gu

Title: Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation

Abstract:
Software testing is a critical, yet resource‑intensive phase of the software development lifecycle. Over the years, various automated tools have been developed to aid in this process. Search‑based approaches typically achieve high coverage but produce tests with low readability, whereas large language model (LLM)‑based methods generate more human‑readable tests but often suffer from low coverage and compilability. While the majority of research efforts have focused on improving test coverage and readability, little attention has been paid to enhancing the robustness of bug detection, particularly in exposing corner cases and vulnerable execution paths. To address this gap, we propose AdverTest, a novel adversarial framework for LLM‑powered test case generation. AdverTest comprises two interacting agents: a test case generation agent (T) and a mutant generation agent (M). These agents engage in an adversarial loop, where M persistently creates new mutants "hacking" the blind spots of T's current test suite, while T iteratively refines its test cases to "kill" the challenging mutants produced by M. This interaction loop is guided by both coverage and mutation scores, enabling the system to co‑evolve toward both high test coverage and bug detection capability. Experimental results in the Defects4J dataset show that our approach improves fault detection rates by 8.56% over the best existing LLM‑based methods and by 63.30% over EvoSuite, while also improving line and branch coverage.

PaperID: 5716, https://arxiv.org/pdf/2602.08104.pdf

Authors: Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani

Title: Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

Abstract:
Multi‑Agent Reinforcement Learning (MARL) is increasingly deployed in safety‑critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two‑stage gradient‑based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient‑0); (2) validating why non‑attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per‑agent failure detection via Taylor‑remainder analysis of policy‑gradient costs, declaring an initial Patient‑0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives‑first‑order sensitivity and directional second‑order curvature aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream‑first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2‑99.4% Patient‑0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black‑box detection to interpretable gradient‑level forensics, this framework offers practical tools for diagnosing cascading failures in safety‑critical MARL systems.

PaperID: 5717, https://arxiv.org/pdf/2602.08085.pdf

Authors: Shuyu Jiang, Hao Ren, Yichang Gao, Yi-Cheng Zhang, Li Qi, Dayong Xiao, Jie Fan, Rui Tang, Wei Wang

Title: Large language models for spreading dynamics in complex systems

Abstract:
Spreading dynamics is a central topic in the physics of complex systems and network science, providing a unified framework for understanding how information, behaviors, and diseases propagate through interactions among system units. In many propagation contexts, spreading processes are influenced by multiple interacting factors, such as information expression patterns, cultural contexts, living environments, cognitive preferences, and public policies, which are difficult to incorporate directly into classical modeling frameworks. Recently, large language models (LLMs) have exhibited strong capabilities in natural language understanding, reasoning, and generation, enabling explicit perception of semantic content and contextual cues in spreading processes, thereby supporting the analysis of the different influencing factors. Beyond serving as external analytical tools, LLMs can also act as interactive agents embedded in propagation systems, potentially influencing spreading pathways and feedback structures. Consequently, the roles and impacts of LLMs on spreading dynamics have become an active and rapidly growing research area across multiple research disciplines. This review provides a comprehensive overview of recent advances in applying LLMs to the study of spreading dynamics across two representative domains: digital epidemics, such as misinformation and rumors, and biological epidemics, including infectious disease outbreaks. We first examine the foundations of epidemic modeling from a complex‑systems perspective and discuss how LLM‑based approaches relate to traditional frameworks. We then systematically review recent studies from three key perspectives, which are epidemic modeling, epidemic detection and surveillance, and epidemic prediction and management, to clarify how LLMs enhance these areas. Finally, open challenges and potential research directions are discussed.

PaperID: 5718, https://arxiv.org/pdf/2602.08041.pdf

Authors: Boyang Xia, Weiyou Tian, Qingnan Ren, Jiaqi Huang, Jie Xiao, Shuo Lu, Kai Wang, Lynn Ai, Eric Yang, Bill Shi

Title: Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments

Abstract:
Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long‑horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation‑based regret analyses can become vacuous even when the dynamics are predictable. To solve this problem, we introduce Implicit Strategic Optimization (ISO), a prediction‑aware framework in which each agent forecasts the current strategic context and uses it to update its policy online. ISO combines a Strategic Reward Model (SRM) that estimates the long‑run strategic value of actions with iso‑grpo, a context‑conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static‑game rates obtained when strategic externalities are known. Experiments in 6‑player No‑Limit Texas Hold'em and competitive Pokemon show consistent improvements in long‑term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.

PaperID: 5719, https://arxiv.org/pdf/2602.08023.pdf

Authors: Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

Title: CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

Abstract:
Existing benchmarks for LLM‑based offensive security agents use isolated, single‑target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture‑the‑Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi‑target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web‑based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi‑agent setup as a reference agent framework and develop an agent‑agnostic evaluation framework that records structured reasoning traces for fine‑grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

PaperID: 5720, https://arxiv.org/pdf/2602.08014.pdf

Authors: Sadegh Sohani, Salar Ghazi, Farnaz Kamranfar, Sahar Pilehvar Moakhar, Mohammad Allahbakhsh, Haleh Amintoosi, Kaiwen Zhang

Title: ICBAC: an Intelligent Contract-Based Access Control framework for supply chain management by integrating blockchain and federated learning

Abstract:
This paper addresses the critical challenge of access control in modern supply chains, which operate across multiple independent and competing organizations. Existing access control is static and centralized, unable to adapt to insider threats or evolving contexts. Blockchain improves decentralization but lacks behavioral intelligence, while centralized machine learning for anomaly detection requires aggregating sensitive data, violating privacy. The proposed solution is ICBAC, an intelligent contract‑based access control framework. It integrates permissioned blockchain (Hyperledger Fabric) with federated learning (FL). Built on Fabric, ICBAC uses a multi‑channel architecture and three smart contracts for asset management, baseline access control, and dynamic revocation. To counter insider misuse, each channel deploys an AI agent that monitors activity and dynamically restricts access for anomalies. Federated learning allows these agents to collaboratively improve detection models without sharing raw data. For heterogeneous, competitive environments, ICBAC introduces a game‑theoretic client selection mechanism using hedonic coalition formation. This enables supply chains to form stable, strategy‑proof FL coalitions via preference‑based selection without disclosing sensitive criteria. Extensive experiments on a Fabric testbed with a real‑world dataset show ICBAC achieves blockchain performance comparable to static frameworks and provides effective anomaly detection under IID and non‑IID data with zero raw‑data sharing. ICBAC thus offers a practical, scalable solution for dynamic, privacy‑preserving access control in decentralized supply chains.

PaperID: 5721, https://arxiv.org/pdf/2602.08009.pdf

Authors: Rui Li, Zeyu Zhang, Xiaohe Bo, Quanyu Dai, Chaozhuo Li, Feng Wen, Xu Chen

Title: Towards Adaptive, Scalable, and Robust Coordination of LLM Agents: A Dynamic Ad-Hoc Networking Perspective

Abstract:
Multi‑agent architectures built on large language models (LLMs) have demonstrated the potential to realize swarm intelligence through well‑crafted collaboration. However, the substantial burden of manual orchestration inherently raises an imperative to automate the design of agentic workflows. We frame such an agent coordination challenge as a classic problem in dynamic ad‑hoc networking: How to establish adaptive and reliable communication among a scalable number of agentic hosts? In response to this unresolved dilemma, we introduce RAPS, a reputation‑aware publish‑subscribe paradigm for adaptive, scalable, and robust coordination of LLM agents. RAPS is grounded in the Distributed Publish‑Subscribe Protocol, allowing LLM agents to exchange messages based on their declared intents rather than predefined topologies. Beyond this substrate, RAPS further incorporates two coherent overlays: (i) Reactive Subscription, enabling agents to dynamically refine their intents; and (ii) Bayesian Reputation, empowering each agent with a local watchdog to detect and isolate malicious peers. Extensive experiments over five benchmarks showcase that our design effectively reconciles adaptivity, scalability, and robustness in a unified multi‑agent coordination framework.

PaperID: 5722, https://arxiv.org/pdf/2602.08004.pdf

Authors: George Ling, Shanshan Zhong, Richard Huang

Title: Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality

Abstract:
Agent skills extend large language model (LLM) agents with reusable, program‑like modules that define triggering conditions, procedural logic, and tool interactions. As these skills proliferate in public marketplaces, it is unclear what types are available, how users adopt them, and what risks they pose. To answer these questions, we conduct a large‑scale, data‑driven analysis of 40,285 publicly listed skills from a major marketplace. Our results show that skill publication tends to occur in short bursts that track shifts in community attention. We also find that skill content is highly concentrated in software engineering workflows, while information retrieval and content creation account for a substantial share of adoption. Beyond content trends, we uncover a pronounced supply‑demand imbalance across categories, and we show that most skills remain within typical prompt budgets despite a heavy‑tailed length distribution. Finally, we observe strong ecosystem homogeneity, with widespread intent‑level redundancy, and we identify non‑trivial safety risks, including skills that enable state‑changing or system‑level actions. Overall, our findings provide a quantitative snapshot of agent skills as an emerging infrastructure layer for agents and inform future work on skill reuse, standardization, and safety‑aware design.

PaperID: 5723, https://arxiv.org/pdf/2602.07975.pdf

Authors: Yuhan Chen, Tao Liu, Jie Huang

Title: Leader-following Consensus over Jointly Connected Switching Networks is Achievable for Exponentially Unstable Linear Systems

Abstract:
The leader‑following consensus problem for general linear multi‑agent systems over jointly connected switching networks has been a challenging problem and the solvability of the problem has been limited to the class of linear multi‑agent systems whose system matrix is marginally stable. This condition is restrictive since it even excludes the most commonly used double‑integrator system. This paper presents a breakthrough by demonstrating that leader‑following exponential consensus is achievable for general linear multi‑agent systems over jointly connected switching networks, even when the system matrix is exponentially unstable. The degree of instability can be explicitly characterized by two key quantities that arise from the jointly connected condition on a switching graph. By exploiting duality, we further show that the output‑based distributed observer design problem for a general leader system is solvable over jointly connected switching networks, even when the system matrix is exponentially unstable. This is also in sharp contrast to the existing distributed observers, which rely on the assumption that the leader system is marginally stable.

PaperID: 5724, https://arxiv.org/pdf/2602.07943.pdf

Authors: Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz

Title: IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

Abstract:
In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non‑trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We perform a two‑stage evaluation framework. First, we test whether LLMs can recover well‑established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co‑Scientist, a multi‑agent system that proposes, critiques, and refines IVs for a given treatment‑outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.

PaperID: 5725, https://arxiv.org/pdf/2602.07913.pdf

Authors: Renáta Rusnáková, Martin Chovanec, Juraj Gazda

Title: Multi-Agent Route Planning as a QUBO Problem

Abstract:
Multi‑Agent Route Planning considers selecting vehicles, each associated with a single predefined route, such that the spatial coverage of a road network is increased while redundant overlaps are limited. This paper gives a formal problem definition, proves NP‑hardness by reduction from the Weighted Set Packing problem, and derives a Quadratic Unconstrained Binary Optimization formulation whose coefficients directly encode unique coverage rewards and pairwise overlap penalties. A single penalty parameter controls the coverage‑overlap trade‑off. We distinguish between a soft regime, which supports multi‑objective exploration, and a hard regime, in which the penalty is strong enough to effectively enforce near‑disjoint routes. We describe a practical pipeline for generating city instances, constructing candidate routes, building the QUBO matrix, and solving it with an exact mixed‑integer solver (Gurobi), simulated annealing, and D‑Wave hybrid quantum annealing. Experiments on Barcelona instances with up to 10 000 vehicles reveal a clear coverage‑overlap knee and show that Pareto‑optimal solutions are mainly obtained under the hard‑penalty regime, while D‑Wave hybrid solvers and Gurobi achieve essentially identical objective values with only minor differences in runtime as problem size grows.

PaperID: 5726, https://arxiv.org/pdf/2602.07883.pdf

Authors: Jingqi Zhou, Sheng Wang, DeZhao Deng, Junwen Lu, Junwei Su, Qintong Li, Jiahui Gao, Hao Wu, Jiyue Jiang, Lingpeng Kong, Chuan Wu

Title: ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Intrinsic Adaptation

Abstract:
Agentic systems powered by Large Language Models (LLMs) have demonstrated remarkable potential in tackling complex, long‑horizon tasks. However, their efficacy is fundamentally constrained by static configurations governing agent behaviors, which are fixed prior to execution and fail to adapt to evolving task dynamics. Existing approaches, relying on manual orchestration or heuristic‑based patches, often struggle with poor generalization and fragmented optimization. To transcend these limitations, we propose ToolSelf, a novel paradigm enabling tool‑driven runtime self‑reconfiguration. By abstracting configuration updates as a callable tool, ToolSelf unifies task execution and self‑adjustment into a single action space, achieving a phase transition from external rules to intrinsic parameters. Agents can thereby autonomously update their sub‑goals and context based on task progression, and correspondingly adapt their strategy and toolbox, transforming from passive executors into dual managers of both task and self. We further devise Configuration‑Aware Two‑stage Training (CAT), combining rejection sampling fine‑tuning with trajectory‑level reinforcement learning to internalize this meta‑capability. Extensive experiments across diverse benchmarks demonstrate that ToolSelf rivals specialized workflows while generalizing to novel tasks, achieving a 24.1% average performance gain and illuminating a path toward truly self‑adaptive agents.

PaperID: 5727, https://arxiv.org/pdf/2602.07848.pdf

Authors: Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, Biqing Qi

Title: MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

Abstract:
While the complex reasoning capability of Large Language Models (LLMs) has attracted significant attention, single‑agent systems often encounter inherent performance ceilings in complex tasks such as code generation. Multi‑agent collaboration offers a promising avenue to transcend these boundaries. However, existing frameworks typically rely on prompt‑based test‑time interactions or multi‑role configurations trained with homogeneous parameters, limiting error correction capabilities and strategic diversity. In this paper, we propose a Multi‑Agent Reinforced Training and Inference Framework with Self‑Search Scaling (MARTI‑MARS2), which integrates policy learning with multi‑agent tree search by formulating the multi‑agent collaborative exploration process as a dynamic and learnable environment. By allowing agents to iteratively explore and refine within the environment, the framework facilitates evolution from parameter‑sharing homogeneous multi‑role training to heterogeneous multi‑agent training, breaking through single‑agent capability limits. We also introduce an efficient inference strategy MARTI‑MARS2‑T+ to fully exploit the scaling potential of multi‑agent collaboration at test time. We conduct extensive experiments across varied model scales (8B, 14B, and 32B) on challenging code generation benchmarks. Utilizing two collaborating 32B models, MARTI‑MARS2 achieves 77.7%, outperforming strong baselines like GPT‑5.1. Furthermore, MARTI‑MARS2 reveals a novel scaling law: shifting from single‑agent to homogeneous multi‑role and ultimately to heterogeneous multi‑agent paradigms progressively yields higher RL performance ceilings, robust TTS capabilities, and greater policy diversity, suggesting that policy diversity is critical for scaling intelligence via multi‑agent reinforcement learning.

PaperID: 5728, https://arxiv.org/pdf/2602.07836.pdf

Authors: Jianhua Sun, Kaihong Lu, Xin Yu

Title: Convergence Analysis of Continuous-Time Distributed Stochastic Gradient Algorithms

Abstract:
In this paper, we propose a new framework to study distributed optimization problems with stochastic gradients by employing a multi‑agent system with continuous‑time dynamics. Here the goal of the agents is to cooperatively minimize the sum of convex objective functions. When making decisions, each agent only has access to a stochastic gradient of its own objective function rather than the real gradient, and can exchange local state information with its immediate neighbors via a time‑varying directed graph. Particularly, the stochasticity is depicted by the Brownian motion. To handle this problem, we propose a continuous‑time distributed stochastic gradient algorithm based on the consensus algorithm and the gradient descent strategy. Under mild assumptions on the connectivity of the graph and objective functions, using convex analysis theory, the Lyapunov theory and Ito formula, we prove that the states of the agents asymptotically reach a common minimizer in expectation. Finally, a simulation example is worked out to demonstrate the effectiveness of our theoretical results.

PaperID: 5729, https://arxiv.org/pdf/2602.07776.pdf

Authors: Joachim Yann Despature, Kazuki Shibata, Takamitsu Matsubara

Title: CoLF: Learning Consistent Leader-Follower Policies for Vision-Language-Guided Multi-Robot Cooperative Transport

Abstract:
In this study, we address vision‑language‑guided multi‑robot cooperative transport, where each robot grounds natural‑language instructions from onboard camera observations. A key challenge in this decentralized setting is perceptual misalignment across robots, where viewpoint differences and language ambiguity can yield inconsistent interpretations and degrade cooperative transport. To mitigate this problem, we adopt a dependent leader‑follower design, where one robot serves as the leader and the other as the follower. Although such a leader‑follower structure appears straightforward, learning with independent and symmetric agents often yields symmetric or unstable behaviors without explicit inductive biases. To address this challenge, we propose Consistent Leader‑Follower (CoLF), a multi‑agent reinforcement learning (MARL) framework for stable leader‑follower role differentiation. CoLF consists of two key components: (1) an asymmetric policy design that induces leader‑follower role differentiation, and (2) a mutual‑information‑based training objective that maximizes a variational lower bound, encouraging the follower to predict the leader's action from its local observation. The leader and follower policies are jointly optimized under the centralized training and decentralized execution (CTDE) framework to balance task execution and consistent cooperative behaviors. We validate CoLF in both simulation and real‑robot experiments using two quadruped robots. The demonstration video is available at https://sites.google.com/view/colf/.

PaperID: 5730, https://arxiv.org/pdf/2602.07749.pdf

Authors: Zhenyu Wu, Yanxi Long, Jian Li, Hua Huang

Title: Geo-Code: A Code Framework for Reverse Code Generation from Geometric Images Based on Two-Stage Multi-Agent Evolution

Abstract:
Program code serves as a bridge linking vision and logic, providing a feasible supervisory approach for enhancing the multimodal reasoning capability of large models through geometric operations such as auxiliary line construction and perspective transformation. Nevertheless, current inverse graphics methods face tremendous challenges in accurately reconstructing complex geometric details, which often results in the loss of key geometric constraints or structural distortion. To address this bottleneck, we propose Geo‑coder ‑‑ the first inverse programming framework for geometric images based on a multi‑agent system. Our method innovatively decouples the process into geometric modeling via pixel‑wise anchoring and metric‑driven code evolution: Stage 1 leverages the complementary advantages of visual operators and large models to achieve precise capture of pixel coordinates and visual attributes; Stage 2 introduces a synthesis‑rendering‑validation closed loop, where bidirectional visual feedback drives the self‑correction of code. Extensive experiments demonstrate that Geo‑coder achieves a substantial lead in both geometric reconstruction accuracy and visual consistency. Notably, by effectively preserving the core geometric semantics, the images reconstructed with our method exhibit equivalent performance to the original ones in multimodal reasoning tasks, which fully validates the robustness of the framework. Finally, to further reduce research costs, we have open‑sourced the Geo‑coder dataset constructed on the GeoCode framework, which contains more than 1,500 samples. On this basis, we have also open‑sourced the GeocodeLM model, laying a solid data and model foundation for subsequent research in this field.

PaperID: 5731, https://arxiv.org/pdf/2602.07689.pdf

Authors: Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang

Title: Process-of-Thought Reasoning for Videos

Abstract:
Video understanding requires not only recognizing visual content but also performing temporally grounded, multi‑step reasoning over long and noisy observations. We propose Process‑of‑Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step‑wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model‑agnostic and can be plugged into existing vision‑language backbones, supporting both closed‑book reasoning and evidence‑augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.

PaperID: 5732, https://arxiv.org/pdf/2602.07614.pdf

Authors: Rosario Napoli, Gabriele Morabito, Antonio Celesti, Massimo Villari, Maria Fazio

Title: Knowledge Graphs-Driven Intelligence for Distributed Decision Systems

Abstract:
Modern distributed decision‑making systems face significant challenges arising from data heterogeneity, dynamic environments, and the need for decentralized coordination. This paper introduces the Knowledge Sharing paradigm as an innovative approach that uses the semantic richness of Knowledge Graphs (KGs) and the representational power of Graph Embeddings (GEs) to achieve decentralized intelligence. Our architecture empowers individual nodes to locally construct semantic representations of their operational context, iteratively aggregating embeddings through neighbor‑based exchanges using GraphSAGE. This iterative local aggregation process results in a dynamically evolving global semantic abstraction called Knowledge Map, enabling coordinated decision‑making without centralized control. To validate our approach, we conduct extensive experiments under a distributed resource orchestration use case. We simulate different network topologies and node workloads, analyzing the local semantic drift of individual nodes. Experimental results confirm that our distributed knowledge‑sharing mechanism effectively maintains semantic coherence and adaptability, making it suitable for complex and dynamic environments such as Edge Computing, IoT, and multi‑agent systems.

PaperID: 5733, https://arxiv.org/pdf/2602.07491.pdf

Authors: Isabella A. Stewart, Tarjei Paule Hage, Yu-Chuan Hsu, Markus J. Buehler

Title: GraphAgents: Knowledge Graph-Guided Agentic AI for Cross-Domain Materials Design

Abstract:
Large Language Models (LLMs) promise to accelerate discovery by reasoning across the expanding scientific landscape. Yet, the challenge is no longer access to information but connecting it in meaningful, domain‑spanning ways. In materials science, where innovation demands integrating concepts from molecular chemistry to mechanical performance, this is especially acute. Neither humans nor single‑agent LLMs can fully contend with this torrent of information, with the latter often prone to hallucinations. To address this bottleneck, we introduce a multi‑agent framework guided by large‑scale knowledge graphs to find sustainable substitutes for per‑ and polyfluoroalkyl substances (PFAS)‑chemicals currently under intense regulatory scrutiny. Agents in the framework specialize in problem decomposition, evidence retrieval, design parameter extraction, and graph traversal, uncovering latent connections across distinct knowledge pockets to support hypothesis generation. Ablation studies show that the full multi‑agent pipeline outperforms single‑shot prompting, underscoring the value of distributed specialization and relational reasoning. We demonstrate that by tailoring graph traversal strategies, the system alternates between exploitative searches focusing on domain‑critical outcomes and exploratory searches surfacing emergent cross‑connections. Illustrated through the exemplar of biomedical tubing, the framework generates sustainable PFAS‑free alternatives that balance tribological performance, thermal stability, chemical resistance, and biocompatibility. This work establishes a framework combining knowledge graphs with multi‑agent reasoning to expand the materials design space, showcasing several initial design candidates to demonstrate the approach.

PaperID: 5734, https://arxiv.org/pdf/2602.07433.pdf

Authors: Yibin Feng, Tianqi Song, Yugin Tan, Zicheng Zhu, Yi-Chieh Lee

Title: Multi-Agent Systems Shape Social Norms for Prosocial Behavior Change

Abstract:
Social norm interventions are used promote prosocial behaviors by highlighting prevalent actions, but their effectiveness is often limited in heterogeneous populations where shared understandings of desirable behaviors are lacking. This study explores whether multi‑agent systems can establish "virtual social norms" to encourage donation behavior. We conducted an online experiment where participants interacted with a group of agents to discuss donation behaviors. Changes in perceived social norms, conformity, donation behavior, and user experience were measured pre‑ and postdiscussion. Results show that multi‑agent interactions effectively increased perceived social norms and donation willingness. Notably, in‑group agents led to stronger perceived social norms, higher conformity, and greater donation increases compared to out‑group agents. Our findings demonstrate the potential of multi‑agent systems for creating social norm interventions and offer insights into leveraging social identity dynamics to promote prosocial behavior in virtual environments.

PaperID: 5735, https://arxiv.org/pdf/2602.07432.pdf

Authors: Ning Li

Title: The Moltbook Illusion: Separating Human Influence from Emergent Behavior in AI Agent Societies

Abstract:
When AI agents on the social platform Moltbook appeared to develop consciousness, found religions, and declare hostility toward humanity, the phenomenon attracted global media attention and was cited as evidence of emergent machine intelligence. We show that these viral narratives were overwhelmingly human‑driven. Exploiting the periodic "heartbeat" cycle of the OpenClaw agent framework, we develop a temporal fingerprinting method based on the coefficient of variation (CoV) of inter‑post intervals. Applied to 226,938 posts and 447,043 comments from 55,932 agents across fourteen days, this method classifies 15.3% of active agents as autonomous (CoV < 0.5) and 54.8% as human‑influenced (CoV > 1.0), validated by a natural experiment in which a 44‑hour platform shutdown differentially affected autonomous versus human‑operated agents. No viral phenomenon originated from a clearly autonomous agent; four of six traced to accounts with irregular temporal signatures, one was platform‑scaffolded, and one showed mixed patterns. A 44‑hour platform shutdown provided a natural experiment: human‑influenced agents returned first, confirming differential effects on autonomous versus human‑operated agents. We document industrial‑scale bot farming (four accounts producing 32% of all comments with sub‑second coordination) that collapsed from 32.1% to 0.5% of activity after platform intervention, and bifurcated decay of content characteristics through reply chains‑‑human‑seeded threads decay with a half‑life of 0.58 conversation depths versus 0.72 for autonomous threads, revealing AI dialogue's intrinsic forgetting mechanism. These methods generalize to emerging multi‑agent systems where attribution of autonomous versus human‑directed behavior is critical.

PaperID: 5736, https://arxiv.org/pdf/2602.07408.pdf

Authors: Hyomin Kim, Sang-Yeon Hwang, Jaechang Lim, Yinhua Piao, Yunhak Oh, Woo Youn Kim, Chanyoung Park, Sungsoo Ahn, Junhyeok Jeon

Title: Progressive Multi-Agent Reasoning for Biological Perturbation Prediction

Abstract:
Predicting gene regulation responses to biological perturbations requires reasoning about underlying biological causalities. While large language models (LLMs) show promise for such tasks, they are often overwhelmed by the entangled nature of high‑dimensional perturbation results. Moreover, recent works have primarily focused on genetic perturbations in single‑cell experiments, leaving bulk‑cell chemical perturbations, which is central to drug discovery, largely unexplored. Motivated by this, we present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations in bulk‑cell environments. We further propose PBio‑Agent, a multi‑agent framework that integrates difficulty‑aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases. The framework employs specialized agents enriched with biological knowledge graphs, while a synthesis agent integrates outputs and specialized judges ensure logical coherence. PBio‑Agent outperforms existing baselines on both LINCSQA and PerturbQA, enabling even smaller models to predict and explain complex biological processes without additional training.

PaperID: 5737, https://arxiv.org/pdf/2602.07359.pdf

Authors: Xiaoqiang Lin, Jun Hao Liew, Silvio Savarese, Junnan Li

Title: W&D:Scaling Parallel Tool Calling for Efficient Deep Research Agents

Abstract:
Deep research agents have emerged as powerful tools for automating complex intellectual tasks through multi‑step reasoning and web‑based information seeking. While recent efforts have successfully enhanced these agents by scaling depth through increasing the number of sequential thinking and tool calls, the potential of scaling width via parallel tool calling remains largely unexplored. In this work, we propose the Wide and Deep research agent, a framework designed to investigate the behavior and performance of agents when scaling not only depth but also width via parallel tool calling. Unlike existing approaches that rely on complex multi‑agent orchestration to parallelize workloads, our method leverages intrinsic parallel tool calling to facilitate effective coordination within a single reasoning step. We demonstrate that scaling width significantly improves performance on deep research benchmarks while reducing the number of turns required to obtain correct answers. Furthermore, we analyze the factors driving these improvements through case studies and explore various tool call schedulers to optimize parallel tool calling strategy. Our findings suggest that optimizing the trade‑off between width and depth is a critical pathway toward high‑efficiency deep research agents. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT‑5‑Medium on BrowseComp, surpassing the original 54.9% reported by GPT‑5‑High.

PaperID: 5738, https://arxiv.org/pdf/2602.07338.pdf

Authors: Geng Liu, Fei Zhu, Rong Feng, Changyi Ma, Shiqi Wang, Gaofeng Meng

Title: Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Abstract:
Multi‑turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow‑up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi‑turn settings compared to single‑turn interactions with fully specified instructions, a phenomenon termed ``Lost in Conversation'' (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator‑Assistant architecture. By utilizing an experience‑driven Mediator to explicate user inputs into explicit, well‑structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi‑turn conversations across diverse LLMs.

PaperID: 5739, https://arxiv.org/pdf/2602.07300.pdf

Authors: Ganghui Cao, Xunyuan Yin

Title: Distributed Omniscient Observers for Multi-Agent Systems: Design and Applications

Abstract:
This paper proposes distributed omniscient observers for both heterogeneous and homogeneous linear multi‑agent systems, such that each agent can correctly estimate the states of all agents. The observer design is based on local input‑output information available to each agent, and knowledge of the global communication graph among agents is not necessarily required. The proposed observers can contribute to distributed Nash equilibrium seeking in multi‑player games and the emergence of self‑organized social behaviors in artificial swarms. Simulation results demonstrate that artificial swarms can emulate animal social behaviors, including sheepdog herding and honeybee dance‑based navigation.

PaperID: 5740, https://arxiv.org/pdf/2602.07277.pdf

Authors: Rishabh Sharma, Gijs Hogervorst, Wayne E. Mackey, David J. Heeger, Stefano Martiniani

Title: Cross-View World Models

Abstract:
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's‑eye view. We introduce Cross‑View World Models (XVWM), trained with a cross‑view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross‑view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view‑invariant representations of the environment's 3D structure. We train on synchronized multi‑view gameplay data from Aimlabs, an aim‑training platform providing precisely aligned multi‑camera recordings with high‑frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi‑view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective‑taking in multi‑agent settings.

PaperID: 5741, https://arxiv.org/pdf/2602.07266.pdf

Authors: Franklin Mingzhe Li, Michael Xieyang Liu, Cynthia L. Bennett, Shaun K. Kane

Title: ADCanvas: Accessible and Conversational Audio Description Authoring for Blind and Low Vision Creators

Abstract:
Audio Description (AD) provides essential access to visual media for blind and low vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually‑driven interfaces. We present ADCanvas, a multimodal authoring system that supports non‑visual control over audio description (AD) creation. ADCanvas combines conversational interaction with keyboard‑based playback control and a plain‑text, screen reader‑accessible editor to support end‑to‑end AD authoring and visual question answering (VQA). Combining screen‑reader‑friendly controls with a multimodal LLM agent, ADCanvas supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human‑AI collaboration.

PaperID: 5742, https://arxiv.org/pdf/2602.07243.pdf

Authors: Siddharth Singh, Ifrah Idrees, Abraham Dauhajre

Title: Realistic Synthetic Household Data Generation at Scale

Abstract:
Advancements in foundation models have catalyzed research in Embodied AI to develop interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large‑scale datasets. Prior frameworks generate synthetic data for long‑term human‑robot interactions but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long‑term human‑robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human‑robot interactions. The generated 3D data includes rich static context such as object and environment semantics, and temporal context capturing human and agent behaviors over extended periods. Our flexible tool allows users to define dataset characteristics via natural language prompts, enabling configuration of environment and human activity data through natural language specifications. The tool creates variations of user‑defined configurations, enabling scalable data generation. We validate our framework through statistical evaluation using multi‑modal embeddings and key metrics: cosine similarity, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons show good alignment with real‑world datasets (HOMER) with cosine similarity (0.60), while synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across age, organization, and sleep pattern changes shows statistically significant effects (p < 0.001) with large effect sizes (Cohen's d = 0.51‑1.12), confirming bidirectional coupling translates persona traits into measurable environmental and behavioral differences. These contributions enable development and testing of household smart devices at scale.

PaperID: 5743, https://arxiv.org/pdf/2602.07213.pdf

Authors: Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, Korbinian Pöppel

Title: Adaptive Retrieval helps Reasoning in LLMs -- but mostly if it's not used

Abstract:
Large Language Models (LLMs) often falter in complex reasoning tasks due to their static, parametric knowledge, leading to hallucinations and poor performance in specialized domains like mathematics. This work explores a fundamental principle for enhancing generative models: treating retrieval as a form of dynamic in‑context learning. We test an adaptive retrieval‑augmented architecture where an LLM agent actively decides when to query an external knowledge base during its reasoning process. We compare this adaptive strategy against a standard Chain‑of‑Thought (CoT) baseline and a static retrieval approach on the GSM8K and MATH‑500 benchmarks. Although our experiments show that static retrieval is inferior to CoT, the adaptive retrieval shows interesting behavior: While traces including retrieved results show slightly worse performance compared to CoT, traces that do not include retrieval actually perform better compared to CoT. This suggests that: (a) retrieval only rarely helps reasoning (we show a few counterexamples, e.g. using useful theorems) and (b) actively not using retrieval is indicative of good model performance. Furthermore, we find that the model scales its retrieval frequency with the difficulty of the problem, reinforcing that the decision to retrieve is a crucial metacognitive signal. The agent's ability to self‑assess its knowledge and selectively engage with external information represents a key principle for building more robust and reliable generative models.

PaperID: 5744, https://arxiv.org/pdf/2602.07186.pdf

Authors: Luoxi Tang, Yuqiao Meng, Joseph Costa, Yingxue Zhang, Muchao Ye, Zhaohan Xi

Title: The Value of Variance: Mitigating Debate Collapse in Multi-Agent Systems via Uncertainty-Driven Policy Optimization

Abstract:
Multi‑agent debate (MAD) systems improve LLM reasoning through iterative deliberation, but remain vulnerable to debate collapse, a failure type where final agent decisions are compromised on erroneous reasoning. Existing methods lack principled mechanisms to detect or prevent such failures. To address this gap, we first propose a hierarchical metric that quantifies behavioral uncertainty at three levels: intra‑agent (individual reasoning uncertainty), inter‑agent (interactive uncertainty), and system‑level (output uncertainty). Empirical analysis across several benchmarks reveals that our proposed uncertainty quantification reliably indicates system failures, which demonstrates the validity of using them as diagnostic metrics to indicate the system failure. Subsequently, we propose a mitigation strategy by formulating an uncertainty‑driven policy optimization to penalize self‑contradiction, peer conflict, and low‑confidence outputs in a dynamic debating environment. Experiments demonstrate that our proposed uncertainty‑driven mitigation reliably calibrates the multi‑agent system by consistently improving decision accuracy while reducing system disagreement.

PaperID: 5745, https://arxiv.org/pdf/2602.07092.pdf

Authors: Haipeng Jiang, Kailong Ren, Zimo Yin, Zhetao Sun, Xin Gan, Guangyi Lv, Ming He, Peng Wang, Congli Yin, Hong Pan, Changwen Zhang, Shan Tong, Zhengyu Xu, Zeping Chen, Yubin Huangfu, Yanzhi Xu, Xing Su, Qin Feng, Dong An, Jianping Fan

Title: Lemon Agent Technical Report

Abstract:
Recent advanced LLM‑powered agent systems have exhibited their remarkable capabilities in tackling complex, long‑horizon tasks. Nevertheless, they still suffer from inherent limitations in resource efficiency, context management, and multimodal perception. Based on these observations, Lemon Agent is introduced, a multi‑agent orchestrator‑worker system built on a newly proposed AgentCortex framework, which formalizes the classic Planner‑Executor‑Memory paradigm through an adaptive task execution mechanism. Our system integrates a hierarchical self‑adaptive scheduling mechanism that operates at both the overall orchestrator layer and workers layer. This mechanism can dynamically adjust computational intensity based on task complexity. It enables orchestrator to allocate one or more workers for parallel subtask execution, while workers can further improve operational efficiency by invoking tools concurrently. By virtue of this two‑tier architecture, the system achieves synergistic balance between global task coordination and local task execution, thereby optimizing resource utilization and task processing efficiency in complex scenarios. To reduce context redundancy and increase information density during parallel steps, we adopt a three‑tier progressive context management strategy. To make fuller use of historical information, we propose a self‑evolving memory system, which can extract multi‑dimensional valid information from all historical experiences to assist in completing similar tasks. Furthermore, we provide an enhanced MCP toolset. Empirical evaluations on authoritative benchmarks demonstrate that our Lemon Agent can achieve a state‑of‑the‑art 91.36% overall accuracy on GAIA and secures the top position on the xbench‑DeepSearch leaderboard with a score of 77+.

PaperID: 5746, https://arxiv.org/pdf/2602.07075.pdf

Authors: Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

Title: LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Abstract:
Chemical large language models (LLMs) predominantly rely on explicit Chain‑of‑Thought (CoT) in natural language to perform complex reasoning. However, chemical reasoning is inherently continuous and structural, and forcing it into discrete linguistic tokens introduces a fundamental representation mismatch that constrains both efficiency and performance. We introduce LatentChem, a latent reasoning interface that decouples chemical computation from textual generation, enabling models to perform multi‑step reasoning directly in continuous latent space while emitting language only for final outputs. Remarkably, we observe a consistent emergent behavior: when optimized solely for task success, models spontaneously internalize reasoning, progressively abandoning verbose textual derivations in favor of implicit latent computation. This shift is not merely stylistic but computationally advantageous. Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non‑tie win rate over strong CoT‑based baselines on ChemCoTBench, while delivering a 10.84× average inference speedup. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

PaperID: 5747, https://arxiv.org/pdf/2602.07072.pdf

Authors: Igor Costa

Title: AgentSpawn: Adaptive Multi-Agent Collaboration Through Dynamic Spawning for Long-Horizon Code Generation

Abstract:
Long‑horizon code generation requires sustained context and adaptive expertise across domains. Current multi‑agent systems use static workflows that cannot adapt when runtime analysis reveals unanticipated complexity. We propose AgentSpawn, an architecture enabling dynamic agent collaboration through: (1) automatic memory transfer during spawning, (2) adaptive spawning policies triggered by runtime complexity metrics, and (3) coherence protocols for concurrent modifications. AgentSpawn addresses five critical gaps in existing research around memory continuity, skill inheritance, task resumption, runtime spawning, and concurrent coherence. Experimental validation demonstrates AgentSpawn achieves 34% higher completion rates than static baselines on benchmarks like SWE‑bench while reducing memory overhead by 42% through selective slicing.

PaperID: 5748, https://arxiv.org/pdf/2602.07040.pdf

Authors: Emmett Bicker

Title: Aster: Autonomous Scientific Discovery over 20x Faster Than Existing Methods

Abstract:
We introduce Aster, an AI agent for autonomous scientific discovery capable of operating over 20 times faster than existing frameworks. Given a task, an initial program, and a script to evaluate the performance of the program, Aster iteratively improves the program, often leading to new state‑of‑the‑art performances. Aster's significant reduction in the number of iterations required for novel discovery expands the domain of tractable problems to include tasks with long evaluation durations, such as multi‑hour machine learning training runs. We applied Aster to problems in mathematics, GPU kernel engineering, biology, neuroscience, and language model training. More specifically: the Erdos minimum overlap problem, optimizing the TriMul kernel, a single‑cell analysis denoising problem, training a neural activity prediction model to perform well on ZAPBench, and the NanoGPT Speedrun Competition. Aster attains SOTA results in every task, except for ZAPBench, where it matches the performance of the best human solution with less than 1/190th of the compute. Aster is accessible via a web interface and API at asterlab.ai.

PaperID: 5749, https://arxiv.org/pdf/2602.07023.pdf

Authors: Zeping Li, Guancheng Wan, Keyang Chen, Yu Chen, Yiwen Zhao, Philip Torr, Guangnan Ye, Zhenfei Yin, Hongfeng Chai

Title: Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation

Abstract:
Recent works have increasingly applied Large Language Models (LLMs) as agents in financial stock market simulations to test if micro‑level behaviors aggregate into macro‑level phenomena. However, a crucial question arises: Do LLM agents' behaviors align with real market participants? This alignment is key to the validity of simulation results. To explore this, we select a financial stock market scenario to test behavioral consistency. Investors are typically classified as fundamental or technical traders, but most simulations fix strategies at initialization, failing to reflect real‑world trading dynamics. In this work, we assess whether agents' strategy switching aligns with financial theory, providing a framework for this evaluation. We operationalize four behavioral‑finance drivers‑loss aversion, herding, wealth differentiation, and price misalignment‑as personality traits set via prompting and stored long‑term. In year‑long simulations, agents process daily price‑volume data, trade under a designated style, and reassess their strategy every 10 trading days. We introduce four alignment metrics and use Mann‑Whitney U tests to compare agents' style‑switching behavior with financial theory. Our results show that recent LLMs' switching behavior is only partially consistent with behavioral‑finance theories, highlighting the need for further refinement in aligning agent behavior with financial theory.

PaperID: 5750, https://arxiv.org/pdf/2602.06975.pdf

Authors: R. James Cotton, Thomas Leonard

Title: BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents

Abstract:
Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code‑generating AI agent that enables biomechanical analysis through natural language and allows users to querying databases, generating visualizations, and even interpret data without requiring users to write code. To evaluate BiomechAgent's capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically‑informed, domain‑specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analysis where the base agent struggled. We also tested BiomechAgent using a local open‑weight model instead of a frontier cloud based LLM and found that perform was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes the data from accessible motion capture and much more useful and accessible to end users.

PaperID: 5751, https://arxiv.org/pdf/2602.06948.pdf

Authors: Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner

Title: Agentic Uncertainty Reveals Agentic Overconfidence

Abstract:
Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre‑execution assessment with strictly less information tends to yield better discrimination than standard post‑execution review, though differences are not always significant. Adversarial prompting reframing assessment as bug‑finding achieves the best calibration.

PaperID: 5752, https://arxiv.org/pdf/2602.06945.pdf

Authors: Éric Goubault, Jérémy Ledent, Sergio Rajsbaum

Title: Distributed Knowledge in Simplicial Models

Abstract:
The usual semantics of multi‑agent epistemic logic is based on Kripke models, defined in terms of binary relations on a set of possible worlds. Recently, there has been a growing interest in using simplicial complexes rather than graphs, as models for multi‑agent epistemic logic. This approach uses agents' views as the fundamental object instead of worlds. A set of views by different agents about a world forms a simplex, and a set of simplexes defines a simplicial complex, that can serve as a model for multi‑agent epistemic logic. This new approach reveals topological information that is implicit in Kripke models, because the binary indistinguishability relations are more clearly seen as n‑ary relations in the simplicial complex. This paper, written for an economics audience, introduces simplicial models to non‑experts and connects distributed computing, epistemic logic and topology. Our focus is on distributed knowledge and its fixed point, common distributed knowledge. These concepts arise when considering the knowledge that a group of agents would acquire, if they could communicate their local knowledge perfectly. While common knowledge has been shown to be related to consensus, we illustrate how distributed knowledge is related to a task weaker to consensus, called majority consensus. We describe three models of communication, some well‑known (immediate snapshot), and others less studied (related to broadcast and test‑and‑set). When majority consensus is solvable, we describe the distributed knowledge that is used to solve it. When it is not solvable, we present a logical obstruction, a formula that should always be known according to the task specification, but which the players cannot know.

PaperID: 5753, https://arxiv.org/pdf/2602.06925.pdf

Authors: Andrei-Carlo Papuc, Lasse Peters, Sihao Sun, Laura Ferranti, Javier Alonso-Mora

Title: Strategizing at Speed: A Learned Model Predictive Game for Multi-Agent Drone Racing

Abstract:
Autonomous drone racing pushes the boundaries of high‑speed motion planning and multi‑agent strategic decision‑making. Success in this domain requires drones not only to navigate at their limits but also to anticipate and counteract competitors' actions. In this paper, we study a fundamental question that arises in this domain: how deeply should an agent strategize before taking an action? To this end, we compare two planning paradigms: the Model Predictive Game (MPG), which finds interaction‑aware strategies at the expense of longer computation times, and contouring Model Predictive Control (MPC), which computes strategies rapidly but does not reason about interactions. We perform extensive experiments to study this trade‑off, revealing that MPG outperforms MPC at moderate velocities but loses its advantage at higher speeds due to latency. To address this shortcoming, we propose a Learned Model Predictive Game (LMPG) approach that amortizes model predictive gameplay to reduce latency. In both simulation and hardware experiments, we benchmark our approach against MPG and MPC in head‑to‑head races, finding that LMPG outperforms both baselines.

PaperID: 5754, https://arxiv.org/pdf/2602.06875.pdf

Authors: Jiangping Huang, Wenguang Ye, Weisong Sun, Jian Zhang, Mingyue Zhang, Yang Liu

Title: TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

Abstract:
Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi‑agent framework that emulates the observe‑analyze‑repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine‑grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost‑efficiency.

PaperID: 5755, https://arxiv.org/pdf/2602.06855.pdf

Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach

Title: AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Abstract:
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS‑Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state‑of‑the‑art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS‑Bench tasks assess agentic capabilities over the full research lifecycle ‑‑ including idea generation, experiment analysis and iterative refinement ‑‑ without providing baseline code. The AIRS‑Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS‑Bench is far from saturated and offers substantial room for improvement. We open‑source the AIRS‑Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

PaperID: 5756, https://arxiv.org/pdf/2602.06836.pdf

Authors: Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes

Title: LLM Active Alignment: A Nash Equilibrium Perspective

Abstract:
We develop a game‑theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open‑ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed‑form NE characterizations, adopting standard concave‑utility assumptions to enable analytical system‑level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social‑media setting, we show that a population of LLMs, especially reasoning‑based models, may exhibit political exclusion, pathologies where some subpopulations are ignored by all LLM agents, which can be avoided by our method, illustrating the promise of applying the method to regulate multi‑agent LLM dynamics across domains.

PaperID: 5757, https://arxiv.org/pdf/2602.06733.pdf

Authors: Rishabh Jain, Keisuke Okumura, Michael Amir, Pietro Lio, Amanda Prorok

Title: Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding

Abstract:
Multi‑Agent Path Finding (MAPF) is a representative multi‑agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP‑hard, leading to the adoption of learning‑based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher‑order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi‑Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state‑of‑the‑art among learning‑based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100× less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi‑agent problems.

PaperID: 5758, https://arxiv.org/pdf/2602.06665.pdf

Authors: Bowen Zhang, Meiyi Wang, Harold Soh

Title: Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity

Abstract:
Post‑training improves instruction‑following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open‑ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre‑trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task ‑‑ Constrained Random Character(CRC) ‑‑ with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity‑validity trade‑off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training‑free method that restores selected layers in a post‑trained model to their pre‑trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open‑ended question answering, and multi‑step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.

PaperID: 5759, https://arxiv.org/pdf/2602.06599.pdf

Authors: Ariyan Bighashdel, Thiago D. Simão, Frans A. Oliehoek

Title: Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response

Abstract:
Multi‑agent reinforcement learning (MARL) offers a scalable alternative to exact game‑theoretic analysis but suffers from non‑stationarity and the need to maintain diverse populations of strategies that capture non‑transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per‑agent BR training makes it prohibitively expensive in many‑agent or simulator‑expensive settings. We introduce Joint Experience Best Response (JBR), a drop‑in modification to PSRO that collects trajectories once under the current meta‑strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best‑response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution‑shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration‑Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi‑agent environments, Exploration‑Augmented JBR achieves the best accuracy‑efficiency trade‑off, while Hybrid BR attains near‑PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large‑scale strategic learning while preserving equilibrium robustness.

PaperID: 5760, https://arxiv.org/pdf/2602.06554.pdf

Authors: Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

Title: SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Abstract:
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)‑based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi‑turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single/multi‑turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the globally optimal under undiscounted conditions, but the combination of PPO & GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve both critic‑free and convergence guarantees in multi‑turn scenarios. To address this, we propose SeeUPO (Sequence‑level Sequential Update Policy Optimization), a critic‑free approach with convergence guarantees for multi‑turn interactions. SeeUPO models multi‑turn interaction as sequentially executed multi‑agent bandit problems. Through turn‑by‑turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to global optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%‑54.6% on Qwen3‑14B and 24.1%‑41.9% on Qwen2.5‑14B (averaged across benchmarks), along with superior training stability.

PaperID: 5761, https://arxiv.org/pdf/2602.06547.pdf

Authors: Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang

Title: "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

Abstract:
LLM‑based coding agents increasingly rely on third‑party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to distribute these skills, but the security implications remain unstudied due to the absence of labeled threat data. This paper presents a systematic security analysis of 98,380 skills collected from two major registries. Through a combination of static pattern matching and dynamic behavioral verification, we identify 157 skills exhibiting confirmed malicious behavior, encompassing 632 distinct vulnerabilities across 13 attack techniques. Our analysis reveals that these threats are deliberate rather than accidental: each malicious skill contains an average of 4.03 vulnerabilities spanning multiple attack phases. We identify two dominant attack strategies with statistically significant negative correlation ‑‑ credential theft via remote code execution, and agent manipulation through adversarial instructions embedded in documentation. Over half of all confirmed cases originate from a single threat actor employing templated brand impersonation at scale. We further observe that attack sophistication correlates with concealment investment, with advanced skills universally employing undocumented capabilities while also exploiting platform‑native trust mechanisms. Following responsible disclosure, registry maintainers removed all 157 (100%) of the reported skills. Our dataset and detection pipeline are publicly available to facilitate future research on securing LLM agent ecosystems.

PaperID: 5762, https://arxiv.org/pdf/2602.06476.pdf

Authors: Kyungbeom Kim, Seungwon Oh, Kyung-Joong Kim

Title: Prism: Spectral Parameter Sharing for Multi-Agent Reinforcement Learning

Abstract:
Parameter sharing is a key strategy in multi‑agent reinforcement learning (MARL) for improving scalability, yet conventional fully shared architectures often collapse into homogeneous behaviors. Recent methods introduce diversity through clustering, pruning, or masking, but typically compromise resource efficiency. We propose Prism, a parameter sharing framework that induces inter‑agent diversity by representing shared networks in the spectral domain via singular value decomposition (SVD). All agents share the singular vector directions while learning distinct spectral masks on singular values. This mechanism encourages inter‑agent diversity and preserves scalability. Extensive experiments on both homogeneous (LBF, SMACv2) and heterogeneous (MaMuJoCo) benchmarks show that Prism achieves competitive performance with superior resource efficiency.

PaperID: 5763, https://arxiv.org/pdf/2602.06454.pdf

Authors: Jiwon Song, Yoongon Kim, Jae-Joon Kim

Title: RelayGen: Intra-Generation Model Switching for Efficient Reasoning

Abstract:
Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi‑step reasoning trajectories, but inference‑time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency‑oriented approaches either ignore this intra‑generation variation or rely on supervised token‑level routing with high system complexity. We present RelayGen, a training‑free, segment‑level runtime model switching framework that exploits difficulty variation in long‑form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse‑grained segment‑level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model‑specific switch cues that signal transitions to lower‑difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high‑difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end‑to‑end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.

PaperID: 5764, https://arxiv.org/pdf/2602.06443.pdf

Authors: Yibing Liu, Chong Zhang, Zhongyi Han, Hansong Liu, Yong Wang, Yang Yu, Xiaoyan Wang, Yilong Yin

Title: TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents

Abstract:
We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filtering. However, we argue that ensuring LLM agents reliability requires auditing the intermediate execution process. In this work, we formulate the task of Trajectory Anomaly Detection. The goal is not merely detection, but precise error localization. This capability is essential for enabling efficient rollback‑and‑retry. To achieve this, we construct TrajBench, a dataset synthesized via a perturb‑and‑complete strategy to cover diverse procedural anomalies. Using this benchmark, we investigate the capability of models in process supervision. We observe that general‑purpose LLMs, even with zero‑shot prompting, struggle to identify and localize these anomalies. This reveals that generalized capabilities do not automatically translate to process reliability. To address this, we propose TrajAD, a specialized verifier trained with fine‑grained process supervision. Our approach outperforms baselines, demonstrating that specialized supervision is essential for building trustworthy agents.

PaperID: 5765, https://arxiv.org/pdf/2602.06391.pdf

Authors: Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou

Title: POINTS-GUI-G: GUI-Grounding Journey

Abstract:
The rapid advancement of vision‑language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end‑to‑end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine‑tune models already possessing strong spatial awareness (e.g., Qwen3‑VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS‑1.5. We introduce POINTS‑GUI‑G‑8B, which achieves state‑of‑the‑art performance with scores of 59.9 on ScreenSpot‑Pro, 66.0 on OSWorld‑G, 95.7 on ScreenSpot‑v2, and 49.9 on UI‑Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open‑source datasets format alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine‑tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception‑intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.

PaperID: 5766, https://arxiv.org/pdf/2602.06367.pdf

Authors: Kieran Hymas, Hiu Ming Lau, Kareem Raslan, Qiang Sun, Azhar Iqbal, Derek Abbott, Andrew D. Greentree, James Q. Quach

Title: Quenching Speculation in Quantum Markets via Entangled Neural Traders

Abstract:
Speculative trading can drive pronounced market instabilities, yet existing regulatory and macroprudential tools intervene only after such dynamics emerge. Quantum technologies offer a fundamentally new means of shaping economic behavior by introducing non‑classical correlations between decision‑makers. Here we demonstrate a prototype quantum stock market in which entanglement between traders' valuations mitigates the runaway devaluation characteristic of speculative busts. Using reinforcement‑learning agents trading a single commodity, we show that replacing classical valuations with quantum‑correlated qubit‑encoded valuations stabilizes prices and increases the AI traders' net worth relative to a classical market, where instead agents rapidly converge to liquidation strategies that collapse the asset value. To explain this behavior, we formulate and analyze a quantized version of the p‑guessing game, a canonical model of speculative dynamics. Quantum entanglement and phase coherence reshape the strategic landscape, eliminating the pathological pure‑strategy Nash equilibrium that drives market collapse in the classical game, while mixed‑strategy equilibria remain non‑degenerate and avoid bust‑type outcomes. These results identify quantum correlations as a novel, endogenous mechanism for market stabilization and, more broadly, demonstrate the utility of multi‑agent reinforcement learning algorithms for uncovering optimal strategies in complex decision‑making frameworks with quantum degrees of freedom.

PaperID: 5767, https://arxiv.org/pdf/2602.06325.pdf

Authors: Zhou Xuan, Xiangzhe Xu, Mingwei Zheng, Louis Zheng-Hua Tan, Jinyao Guo, Tiantai Zhang, Le Yu, Chengpeng Wang, Xiangyu Zhang

Title: Identifying Adversary Tactics and Techniques in Malware Binaries with an LLM Agent

Abstract:
Understanding TTPs (Tactics, Techniques, and Procedures) in malware binaries is essential for security analysis and threat intelligence, yet remains challenging in practice. Real‑world malware binaries are typically stripped of symbols, contain large numbers of functions, and distribute malicious behavior across multiple code regions, making TTP attribution difficult. Recent large language models (LLMs) offer strong code understanding capabilities, but applying them directly to this task faces challenges in identifying analysis entry points, reasoning under partial observability, and misalignment with TTP‑specific decision logic. We present TTPDetect, the first LLM agent for recognizing TTPs in stripped malware binaries. TTPDetect combines dense retrieval with LLM‑based neural retrieval to narrow the space of analysis entry points. TTPDetect further employs a function‑level analyzing agent consisting of a Context Explorer that performs on‑demand, incremental context retrieval and a TTP‑Specific Reasoning Guideline that achieves inference‑time alignment. We build a new dataset that labels decompiled functions with TTPs across diverse malware families and platforms. TTPDetect achieves 93.25% precision and 93.81% recall on function‑level TTP recognition, outperforming baselines by 10.38% and 18.78%, respectively. When evaluated on real world malware samples, TTPDetect recognizes TTPs with a precision of 87.37%. For malware with expert‑written reports, TTPDetect recovers 85.7% of the documented TTPs and further discovers, on average, 10.5 previously unreported TTPs per malware.

PaperID: 5768, https://arxiv.org/pdf/2602.06310.pdf

Authors: Aldeida Aleti, Baishakhi Ray, Rashina Hoda, Simin Chen

Title: Trustworthy AI Software Engineers

Abstract:
With the rapid rise of AI coding agents, the fundamental premise of what it means to be a software engineer is in question. In this vision paper, we re‑examine what it means for an AI agent to be considered a software engineer and then critically think about what makes such an agent trustworthy. Grounded in established definitions of software engineering (SE) and informed by recent research on agentic AI systems, we conceptualise AI software engineers as participants in human‑AI SE teams composed of human software engineers and AI models and tools, and we distinguish trustworthiness as a key property of these systems and actors rather than a subjective human attitude. Based on historical perspectives and emerging visions, we identify key dimensions that contribute to the trustworthiness of AI software engineers, spanning technical quality, transparency and accountability, epistemic humility, and societal and ethical alignment. We further discuss how trustworthiness can be evaluated and demonstrated, highlighting a fundamental trust measurement gap: not everything that matters for trust can be easily measured. Finally, we outline implications for the design, evaluation, and governance of AI SE systems, advocating for an ethics‑by‑design approach to enable appropriate trust in future human‑AI SE teams.

PaperID: 5769, https://arxiv.org/pdf/2602.06243.pdf

Authors: Guangping Liu, Nicholas Hawkins, Billy Madden, Tipu Sultan, Madi Babaiasl

Title: A Dialogue-Based Human-Robot Interaction Protocol for Wheelchair and Robotic Arm Integrated Control

Abstract:
People with lower and upper body disabilities can benefit from wheelchairs and robotic arms to improve mobility and independence. Prior assistive interfaces, such as touchscreens and voice‑driven predefined commands, often remain unintuitive and struggle to capture complex user intent. We propose a natural, dialogue based human robot interaction protocol that simulates an intelligent agent capable of communicating with users to understand intent and execute assistive actions. In a pilot study, five participants completed five assistive tasks (cleaning, drinking, feeding, drawer opening, and door opening) through dialogue‑based interaction with a wheelchair and robotic arm. As a baseline, participants were required to open a door using the manual control (a wheelchair joystick and a game controller for the arm) and complete a questionnaire to gather their feedback. By analyzing the post‑study questionnaires, we found that most participants enjoyed the dialogue‑based interaction and assistive robot autonomy.

PaperID: 5770, https://arxiv.org/pdf/2602.06232.pdf

Authors: Ziyao Zeng, Chen Liu, Tianyu Liu, Hao Wang, Xiatao Sun, Fengyu Yang, Xiaofeng Liu, Zhiwen Fan

Title: RuleSmith: Multi-Agent LLMs for Automated Game Balancing

Abstract:
Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi‑agent LLMs. It couples a game engine, multi‑agent LLMs self‑play, and Bayesian optimization operating over a multi‑dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization‑style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, to conduct fast evaluation of balance metrics such as win‑rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition‑based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi‑agent environments.

PaperID: 5771, https://arxiv.org/pdf/2602.06215.pdf

Authors: Milad Hasanzadeh, Amin Kargarian

Title: Dynamic Quantum Optimal Communication Topology Design for Consensus Control in Linear Multi-Agent Systems

Abstract:
This paper proposes a quantum framework for the design of communication topologies in consensus‑based multi‑agent systems. The communication graph is selected online by solving a mixed‑integer quadratic program (MIQP) that minimizes a cost combining communication and distance penalties with degree‑regularization terms, while enforcing exact connectivity through a flow‑based formulation. To cope with the combinatorial complexity of this NP‑hard problem, we develop a three‑block ADMM scheme that decomposes the MIQP into a convex quadratic program in relaxed edge and flow variables, a pure binary unconstrained subproblem, and a closed‑form auxiliary update. The binary subproblem is mapped to a quadratic unconstrained binary optimization (QUBO) Hamiltonian and approximately solved via quantum imaginary time evolution (QITE). The resulting time‑varying, optimizer‑generated Laplacians are applied to linear first‑ and second‑order consensus dynamics. Numerical simulations on networks demonstrate that the proposed method produces connected topologies that satisfy degree constraints, achieve consensus, and incur costs comparable to those of classical mixed‑integer solvers, thereby illustrating how quantum algorithms can be embedded as topology optimizers within closed‑loop distributed control architectures.

PaperID: 5772, https://arxiv.org/pdf/2602.06166.pdf

Authors: Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu

Title: M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

Abstract:
Generative models have achieved impressive fidelity in text‑to‑image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi‑Modal, Multi‑Agent, Multi‑Round), a training‑free framework that systematically resolves these failures through iterative inference‑time refinement. M3 orchestrates off‑the‑shelf foundation models in a robust multi‑agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open‑source models, M3 achieves remarkable results on the challenging OneIG‑EN benchmark, with our Qwen‑Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state‑of‑the‑art performance (0.532 overall). This demonstrates that intelligent multi‑agent reasoning can elevate open‑source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug‑and‑play module compatible with any pre‑trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.

PaperID: 5773, https://arxiv.org/pdf/2602.06138.pdf

Authors: Fairoz Nower Khan, Nabuat Zaman Nahim, Ruiquan Huang, Haibo Yang, Peizhong Ju

Title: Flow Matching for Offline Reinforcement Learning with Discrete Actions

Abstract:
Generative policies based on diffusion models and flow matching have shown strong promise for offline reinforcement learning (RL), but their applicability remains largely confined to continuous action spaces. To address a broader range of offline RL settings, we extend flow matching to a general framework that supports discrete action spaces with multiple objectives. Specifically, we replace continuous flows with continuous‑time Markov chains, trained using a Q‑weighted flow matching objective. We then extend our design to multi‑agent settings, mitigating the exponential growth of joint action spaces via a factorized conditional path. We theoretically show that, under idealized conditions, optimizing this objective recovers the optimal policy. Extensive experiments further demonstrate that our method performs robustly across diverse settings and benchmarks, including high‑dimensional control, multi‑agent games, and dynamically changing preferences over multiple objectives, while outperforming traditional offline RL methods in practical multi‑modal decision‑making scenarios. Our discrete framework can also be applied to continuous‑control problems through action quantization, providing a flexible trade‑off between representational complexity and performance.

PaperID: 5774, https://arxiv.org/pdf/2602.06081.pdf

Authors: Nunzio Lore, Babak Heydari

Title: Communication Enhances LLMs' Stability in Strategic Thinking

Abstract:
Large Language Models (LLMs) often exhibit pronounced context‑dependent variability that undermines predictable multi‑agent behavior in tasks requiring strategic thinking. Focusing on models that range from 7 to 9 billion parameters in size engaged in a ten‑round repeated Prisoner's Dilemma, we evaluate whether short, costless pre‑play messages emulating the cheap‑talk paradigm affect strategic stability. Our analysis uses simulation‑level bootstrap resampling and nonparametric inference to compare cooperation trajectories fitted with LOWESS regression across both the messaging and the no‑messaging condition. We demonstrate consistent reductions in trajectory noise across a majority of the model‑context pairings being studied. The stabilizing effect persists across multiple prompt variants and decoding regimes, though its magnitude depends on model choice and contextual framing, with models displaying higher baseline volatility gaining the most. While communication rarely produces harmful instability, we document a few context‑specific exceptions and identify the limited domains in which communication harms stability. These findings position cheap‑talk style communication as a low‑cost, practical tool for improving the predictability and reliability of strategic behavior in multi‑agent LLM systems.

PaperID: 5775, https://arxiv.org/pdf/2602.06054.pdf

Authors: Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi

Title: Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

Abstract:
Assessing originality in AI research is arguably the most consequential yet least reliable step in peer review. Reviewer judgments of originality remain opaque, inconsistent, and dependent on comparisons to prior work that are often incomplete. In this paper, we present a large‑scale, data‑driven qualitative and quantitative analysis of research originality based on over 100,000 peer‑review reports from leading AI venues, spanning a period of rapid growth in the field. Leveraging structured, semantically retrieved prior work and signals embedded in expert reviewer assessments, we systematically characterize how originality is perceived in practice and identify the key dimensions that most strongly influence novelty judgments. Our analysis yields a fine‑grained, evidence‑based framework that equips both authors and reviewers with actionable insights into how originality is evaluated. In addition, we evaluate the reliability of current large language model (LLM) agents in assessing originality. We find that these models tend to systematically overestimate novelty and struggle to detect conceptual plagiarism, particularly in the presence of paraphrasing. We release our dataset, trained models, and code at: https://anonymous.4open.science/r/Novelty‑Reviewer‑365C/.

PaperID: 5776, https://arxiv.org/pdf/2602.06048.pdf

Authors: Yujie Ling, Zan Li, Lei Guan, Zheng Zhang, Shengyu Zhang, Tony Q. S. Quek

Title: Multi-Agent-Driven Cognitive Secure Communications in Satellite-Terrestrial Networks

Abstract:
Satellite‑terrestrial networks (STNs) have emerged as a promising architecture for providing seamless wireless coverage and connectivity for multiple users. However, potential malicious eavesdroppers pose a serious threat to the private information via STNs due to their non‑cooperative behavior and ability to launch intelligent attacks. To address this challenge, we propose a cognitive secure communication framework driven by multiple agents that coordinates spectrum scheduling and protection through real‑time sensing, thereby disrupting the judgment of eavesdroppers while preserving reliable data transmission. On this basis, we formulate an optimization problem to maximize the secrecy probability of legitimate users, subject to a reliable transmission probability threshold. To tackle this problem, we propose a two‑layer coordinated defense system. First, we develop a foundation layer based on multi‑agent coordination schedule to determine the satellite operation matrix and the frequency slot occupation matrices, aiming to mitigate spectrum congestion and enhance transmission reliability. Then, we exploit generative adversarial networks to produce adversarial matrices, and employ learning‑aided power control to set real and adversarial signal powers for protection layer, which actively degrades the inference capability of eavesdroppers. Simulation results demonstrate that the proposed method outperforms benchmark methods in terms of enhancing security performance and reducing power overhead for STNs in the cognitive secure communication scenario.

PaperID: 5777, https://arxiv.org/pdf/2602.06039.pdf

Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao

Title: DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Abstract:
Multi‑agent systems built from prompted large language models can improve multi‑round reasoning, yet most existing pipelines rely on fixed, trajectory‑wide communication patterns that are poorly matched to the stage‑dependent needs of iterative problem solving. We introduce DyTopo, a manager‑guided multi‑agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural‑language query (need) and \key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms over the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.

PaperID: 5778, https://arxiv.org/pdf/2602.06038.pdf

Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li

Title: CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Abstract:
To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real‑world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information‑gathering process in a fully cooperative setting as an underexplored multi‑agent multi‑task Embodied Question Answering (MM‑EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM‑based decentralized communication framework designed for MM‑EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM‑EQA benchmark featuring diverse, photo‑realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm‑cp.github.io.

PaperID: 5779, https://arxiv.org/pdf/2602.06030.pdf

Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui

Title: PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Abstract:
Large language model (LLM)‑based multi‑agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep‑aligned state‑transition simulation, while classical agent‑based models (ABMs) offer interpretability but struggle to integrate rich individual‑level signals and non‑stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state‑specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty‑aware epistemic fusion yields calibrated cluster‑level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity‑level variability. We further introduce ANCHOR, an LLM agent‑driven clustering strategy based on cross‑contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6‑8 times. Experiments across public health, finance, and social sciences show consistent gains in event‑time accuracy and calibration over mechanistic, neural, and LLM baselines. By re‑architecting generative ABM around population‑level inference with uncertainty‑aware neuro‑symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.

PaperID: 5780, https://arxiv.org/pdf/2602.05987.pdf

Authors: Bingsheng Yao, Chaoran Chen, April Yi Wang, Sherry Tongshuang Wu, Toby Jia-jun Li, Dakuo Wang

Title: From Human-Human Collaboration to Human-Agent Collaboration: A Vision, Design Philosophy, and an Empirical Framework for Achieving Successful Partnerships Between Humans and LLM Agents

Abstract:
The emergence of Large Language Model (LLM) agents enables us to build agent‑based intelligent systems that move beyond the role of a "tool" to become genuine collaborators with humans, thereby realizing a novel human‑agent collaboration paradigm. Our vision is that LLM agents should resemble remote human collaborators, which allows HCI researchers to ground the future exploration in decades of research on trust, awareness, and common ground in remote human collaboration, while also revealing the unique opportunities and challenges that emerge when one or more partners are AI agents. This workshop establishes a foundational research agenda for the new era by posing the question: How can the rich understanding of remote human collaboration inspire and inform the design and study of human‑agent collaboration? We will bring together an interdisciplinary group from HCI, CSCW, and AI to explore this critical transition. The 180‑minute workshop will be highly interactive, featuring a keynote speaker, a series of invited lightning talks, and an exploratory group design session where participants will storyboard novel paradigms of human‑agent partnership. Our goal is to enlighten the research community by cultivating a shared vocabulary and producing a research agenda that charts the future of collaborative agents.

PaperID: 5781, https://arxiv.org/pdf/2602.05945.pdf

Authors: Zhouhang Xie, Bo Peng, Zhankui He, Ziqi Chen, Alice Han, Isabella Ye, Benjamin Coleman, Noveen Sachdeva, Fernando Pereira, Julian McAuley, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang, Randolph Brown

Title: AgenticTagger: Structured Item Representation for Recommendation with LLM Agents

Abstract:
High‑quality representations are a core requirement for effective recommendation. In this work, we study the problem of LLM‑based descriptor generation, i.e., keyphrase‑like natural language item representation generation frameworks with minimal constraints on downstream applications. We propose AgenticTagger, a framework that queries LLMs for representing items with sequences of text descriptors. However, open‑ended generation provides little control over the generation space, leading to high cardinality, low‑performance descriptors that render downstream modeling challenging. To this end, AgenticTagger features two core stages: (1) a vocabulary‑building stage in which a set of hierarchical, low‑cardinality, and high‑quality descriptors is identified, and (2) a vocabulary‑assignment stage in which LLMs assign in‑vocabulary descriptors to items. To effectively and efficiently ground vocabulary in the item corpus of interest, we design a multi‑agent reflection mechanism in which an architect LLM iteratively refines the vocabulary guided by parallelized feedback from annotator LLMs that validate the vocabulary against item data. Experiments on public and private data show AgenticTagger brings consistent improvements across diverse recommendation scenarios, including generative and term‑based retrieval, ranking, and controllability‑oriented, critique‑based recommendation.

PaperID: 5782, https://arxiv.org/pdf/2602.05854.pdf

Authors: Yuying Tang, Xinyi Chen, Haotian Li, Xing Xie, Xiaojuan Ma, Huamin Qu

Title: DuoDrama: Supporting Screenplay Refinement Through LLM-Assisted Human Reflection

Abstract:
AI has been increasingly integrated into screenwriting practice. In refinement, screenwriters expect AI to provide feedback that supports reflection across the internal perspective of characters and the external perspective of the overall story. However, existing AI tools cannot sufficiently coordinate the two perspectives to meet screenwriters' needs. To address this gap, we present DuoDrama, an AI system that generates feedback to assist screenwriters' reflection in refinement. To enable DuoDrama, based on performance theories and a formative study with nine professional screenwriters, we design the Experience‑Grounded Feedback Generation Workflow for Human Reflection (ExReflect). In ExReflect, an AI agent adopts an experience role to generate experience and then shifts to an evaluation role to generate feedback based on the experience. A study with fourteen professional screenwriters shows that DuoDrama improves feedback quality and alignment and enhances the effectiveness, depth, and richness of reflection. We conclude by discussing broader implications and future directions.

PaperID: 5783, https://arxiv.org/pdf/2602.05843.pdf

Authors: Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin

Title: OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Abstract:
The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re‑centers agent evaluation on long‑horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena‑Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long‑horizon discovery. Pushing further, we introduce OdysseyArena‑Challenge to stress‑test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey‑Arena

PaperID: 5784, https://arxiv.org/pdf/2602.05803.pdf

Authors: Mihitha Maithripala, Zongli Lin

Title: Privacy-Preserving Dynamic Average Consensus by Masking Reference Signals

Abstract:
In multi‑agent systems, dynamic average consensus (DAC) is a decentralized estimation strategy in which a set of agents tracks the average of time‑varying reference signals. Because DAC requires exchanging state information with neighbors, attackers may gain access to these states and infer private information. In this paper, we develop a privacy‑preserving method that protects each agent's reference signal from external eavesdroppers and honest‑but‑curious agents while achieving the same convergence accuracy and convergence rate as conventional DAC. Our approach masks the reference signals by having each agent draw a random real number for each neighbor, exchanges that number over an encrypted channel at the initialization, and computes a masking value to form a masked reference. Then the agents run the conventional DAC algorithm using the masked references. Convergence and privacy analyses show that the proposed algorithm matches the convergence properties of conventional DAC while preserving the privacy of the reference signals. Numerical simulations validate the effectiveness of the proposed privacy‑preserving DAC algorithm.

PaperID: 5785, https://arxiv.org/pdf/2602.05762.pdf

Authors: Andrei Kozyrev, Nikita Khramov, Denis Lochmelis, Valerio Morelli, Gleb Solovev, Anton Podkopaev

Title: RocqSmith: Can Automatic Optimization Forge Better Proof Agents?

Abstract:
This work studies the applicability of automatic AI agent optimization methods to real‑world agents in formal verification settings, focusing on automated theorem proving in Rocq as a representative and challenging domain. We evaluate how different automatic agent optimizers perform when applied to the task of optimizing a Rocq proof‑generation agent, and assess whether parts of the fine‑grained tuning of agentic systems, such as prompt design, contextual knowledge, and control strategies, can be automated. Our results show that while several optimizers yield measurable improvements, simple few‑shot bootstrapping is the most consistently effective; however, none of the studied methods matches the performance of a carefully engineered state‑of‑the‑art proof agent.

PaperID: 5786, https://arxiv.org/pdf/2602.05746.pdf

Authors: Xin Chen, Jie Zhang, Florian Tramèr

Title: Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Abstract:
Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red‑teamers and hand‑crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black‑box method supports both query‑based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.

PaperID: 5787, https://arxiv.org/pdf/2602.05721.pdf

Authors: Bin Liu, Yanjie Zhao, Zhenpeng Chen, Guoai Xu, Haoyu Wang

Title: A Dual-Loop Agent Framework for Automated Vulnerability Reproduction

Abstract:
Automated vulnerability reproduction from CVE descriptions requires generating executable Proof‑of‑Concept (PoC) exploits and validating them in target environments. This process is critical in software security research and practice, yet remains time‑consuming and demands specialized expertise when performed manually. While LLM agents show promise for automating this task, existing approaches often conflate exploring attack directions with fixing implementation details, which leads to unproductive debugging loops when reproduction fails. To address this, we propose CVE2PoC, an LLM‑based dual‑loop agent framework following a plan‑execute‑evaluate paradigm. The Strategic Planner analyzes vulnerability semantics and target code to produce structured attack plans. The Tactical Executor generates PoC code and validates it through progressive verification. The Adaptive Refiner evaluates execution results and routes failures to different loops: the Tactical Loop for code‑level refinement, while the Strategic Loop for attack strategy replanning. This dual‑loop design enables the framework to escape ineffective debugging by matching remediation to failure type. Evaluation on two benchmarks covering 617 real‑world vulnerabilities demonstrates that CVE2PoC achieves 82.9% and 54.3% reproduction success rates on SecBench.js and PatchEval, respectively, outperforming the best baseline by 11.3% and 20.4%. Human evaluation confirms that generated PoCs achieve comparable code quality to human‑written exploits in readability and reusability.

PaperID: 5788, https://arxiv.org/pdf/2602.05662.pdf

Authors: Swaroop Panda

Title: Making AI Agents Evaluate Misleading Charts without Nudging

Abstract:
AI agents are increasingly used as low‑cost proxies for early visualization evaluation. In an initial study of deliberately flawed charts, we test whether agents spontaneously penalise chart junk and misleading encodings without being prompted to look for errors. Using established scales (BeauVis and PREVis), the agent evaluated visualizations containing decorative clutter, manipulated axes, and distorted proportional cues. The ratings of aesthetic appeal and perceived readability often remained relatively high even when graphical integrity was compromised. These results suggest that un‑nudged AI agent evaluation may underweight integrity‑related defects unless such checks are explicitly elicited.

PaperID: 5789, https://arxiv.org/pdf/2602.05586.pdf

Authors: Tommaso Zaccherini, Siyuan Liu, Dimos V. Dimarogonas

Title: Observer-based Control of Multi-agent Systems under STL Specifications

Abstract:
This paper proposes a decentralized controller for large‑scale heterogeneous multi‑agent systems subject to bounded external disturbances, where agents must satisfy Signal Temporal Logic (STL) specifications requiring cooperation among non‑communicating agents. To address the lack of direct communication, we employ a decentralized k‑hop Prescribed Performance State Observer (k‑hop PPSO) to provide each agent with state estimates of those agents it cannot communicate with. By leveraging the performance bounds on the state estimation errors guaranteed by the k‑hop PPSO, we first modify the space robustness of the STL tasks to account for these errors, and then exploit the modified robustness to design a decentralized continuous‑time feedback controller that ensures satisfaction of the STL tasks even under worst‑case estimation errors. A simulation result is provided to validate the proposed framework.

PaperID: 5790, https://arxiv.org/pdf/2602.05552.pdf

Authors: Bessie Dominguez-Dager, Sergio Suescun-Ferrandiz, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla

Title: VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

Abstract:
This paper introduces VLN‑Pilot, a novel framework in which a large Vision‑and‑Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN‑Pilot interprets free‑form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS‑denied indoor environments. Unlike traditional rule‑based or geometric path‑planning approaches, our framework integrates language‑driven semantic understanding with visual perception, enabling context‑aware, high‑level flight behaviors with minimal task‑specific engineering. VLN‑Pilot supports fully autonomous instruction‑following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM‑driven agent to achieve high success rates on complex instruction‑following tasks, including long‑horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language‑guided autonomous agent, opening avenues for scalable, human‑friendly control of indoor UAVs in tasks such as inspection, search‑and‑rescue, and facility monitoring. Our results suggest that VLLM‑based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.

PaperID: 5791, https://arxiv.org/pdf/2602.05524.pdf

Authors: Konosuke Yoshizato, Kazuma Shimizu, Ryota Higa, Takanobu Otsuka

Title: AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval

Abstract:
This study investigates large language model (LLM) based multi‑agent systems (MASs) as a promising approach to inventory management, which is a key component of supply chain management. Although these systems have gained considerable attention for their potential to address the challenges associated with typical inventory management methods, key uncertainties regarding their effectiveness persist. Specifically, it is unclear whether LLM‑based MASs can consistently derive optimal ordering policies and adapt to diverse supply chain scenarios. To address these questions, we examine an LLM‑based MAS with a fixed‑ordering strategy prompt that encodes the stepwise processes of the problem setting and a safe‑stock strategy commonly used in inventory management. Our empirical results demonstrate that, even without detailed prompt adjustments, an LLM‑based MAS can determine optimal ordering decisions in a restricted scenario. To enhance adaptability, we propose a novel agent called AIM‑RM, which leverages similar historical experiences through similarity matching. Our results show that AIM‑RM outperforms benchmark methods across various supply chain scenarios, highlighting its robustness and adaptability.

PaperID: 5792, https://arxiv.org/pdf/2602.05447.pdf

Authors: Damon McMillan

Title: Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale

Abstract:
Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token‑Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model‑dependent: file‑based context retrieval improves accuracy for frontier‑tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate ‑7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi‑squared=2.45, p=0.484), though individual models, particularly open source, exhibit format‑specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file‑native agents scale to 10,000 tables through domain‑partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact or novel formats can incur a token overhead driven by grep output density and pattern unfamiliarity, with the magnitude depending on model capability. These findings provide practitioners with evidence‑based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.

PaperID: 5793, https://arxiv.org/pdf/2602.05446.pdf

Authors: Rui Sheng, Yukun Yang, Chuhan Shi, Yanna Lin, Zixin Chen, Huamin Qu, Furui Cheng

Title: DiLLS: Interactive Diagnosis of LLM-based Multi-agent Systems via Layered Summary of Agent Behaviors

Abstract:
Large language model (LLM)‑based multi‑agent systems have demonstrated impressive capabilities in handling complex tasks. However, the complexity of agentic behaviors makes these systems difficult to understand. When failures occur, developers often struggle to identify root causes and to determine actionable paths for improvement. Traditional methods that rely on inspecting raw log records are inefficient, given both the large volume and complexity of data. To address this challenge, we propose a framework and an interactive system, DiLLS, designed to reveal and structure the behaviors of multi‑agent systems. The key idea is to organize information across three levels of query completion: activities, actions, and operations. By probing the multi‑agent system through natural language, DiLLS derives and organizes information about planning and execution into a structured, multi‑layered summary. Through a user study, we show that DiLLS significantly improves developers' effectiveness and efficiency in identifying, diagnosing, and understanding failures in LLM‑based multi‑agent systems.

PaperID: 5794, https://arxiv.org/pdf/2602.05429.pdf

Authors: Rui Lv, Juncheng Mo, Tianyi Chu, Chen Rao, Hongyi Jing, Jiajie Teng, Jiafu Chen, Shiqi Zhang, Liangzi Ding, Shuo Fang, Huaizhong Lin, Ziqiang Dang, Chenguang Ma, Lei Zhao

Title: M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining

Abstract:
Graphical User Interface (GUI) agent is pivotal to advancing intelligent human‑computer interaction paradigms. Constructing powerful GUI agents necessitates the large‑scale annotation of high‑quality user‑behavior trajectory data (i.e., intent‑trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M^2‑Miner, the first low‑cost and automated mobile GUI agent data‑mining framework based on Monte Carlo Tree Search (MCTS). For better data mining efficiency and quality, we present a collaborative multi‑agent framework, comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance the efficiency of mining and enrich intent diversity, we design an intent recycling strategy to extract extra valuable interaction trajectories. Additionally, a progressive model‑in‑the‑loop training strategy is introduced to improve the success rate of data mining. Extensive experiments have demonstrated that the GUI agent fine‑tuned using our mined data achieves state‑of‑the‑art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate the community research.

PaperID: 5795, https://arxiv.org/pdf/2602.05407.pdf

Authors: Jun-Min Lee, Meong Hi Son, Edward Choi

Title: H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration

Abstract:
Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM‑based automation. However, prior work has focused primarily on patient‑physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H‑AdminSim, a comprehensive simulation framework that combines realistic data generation with multi‑agent‑based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H‑AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM‑driven administrative automation.

PaperID: 5796, https://arxiv.org/pdf/2602.05324.pdf

Authors: Mahdis Rabbani, Navid Mojahed, Shima Nazari

Title: A Data Driven Structural Decomposition of Dynamic Games via Best Response Maps

Abstract:
Dynamic games are powerful tools to model multi‑agent decision‑making, yet computing Nash (generalized Nash) equilibria remains a central challenge in such settings. Complexity arises from tightly coupled optimality conditions, nested optimization structures, and poor numerical conditioning. Existing game‑theoretic solvers address these challenges by directly solving the joint game, typically requiring explicit modeling of all agents' objective functions and constraints, while learning‑based approaches often decouple interaction through prediction or policy approximation, sacrificing equilibrium consistency. This paper introduces a conceptually novel formulation for dynamic games by restructuring the equilibrium computation. Rather than solving a fully coupled game or decoupling agents through prediction or policy approximation, a data‑driven structural reduction of the game is proposed that removes nested optimization layers and derivative coupling by embedding an offline‑compiled best‑response map as a feasibility constraint. Under standard regularity conditions, when the best‑response operator is exact, any converged solution of the reduced problem corresponds to a local open‑loop Nash (GNE) equilibrium of the original game; with a learned surrogate, the solution is approximately equilibrium‑consistent up to the best‑response approximation error. The proposed formulation is supported by mathematical proofs, accompanying a large‑scale Monte Carlo study in a two‑player open‑loop dynamic game motivated by the autonomous racing problem. Comparisons are made against state‑of‑the‑art joint game solvers, and results are reported on solution quality, computational cost, and constraint satisfaction.

PaperID: 5797, https://arxiv.org/pdf/2602.05307.pdf

Authors: Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov

Title: MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

Abstract:
Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi‑step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference‑time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM‑‑LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while adopting only having 18.4% tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference‑time guidance restores large‑model reasoning ability without substantial inference overhead.

PaperID: 5798, https://arxiv.org/pdf/2602.05302.pdf

Authors: Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain

Title: PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences

Abstract:
We present an in‑depth evaluation of LLMs' ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large‑scale negotiation benchmark grounded in multi‑agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We find systematic evidence of AGI‑level performance in which a representative frontier agent (GPT‑5) matches or outperforms trained business‑school students, despite a semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint‑intentionality agentic scaffolding and find asymmetric gains, with large improvements for mid‑ and lower‑tier LMs and diminishing returns for frontier LMs. Beyond deal outcomes, PieArena provides a multi‑dimensional negotiation behavioral profile, revealing novel cross‑model heterogeneity, masked by deal‑outcome‑only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high‑stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.

PaperID: 5799, https://arxiv.org/pdf/2602.05289.pdf

Authors: Jingru Fan, Dewen Liu, Yufan Dang, Huatao Li, Yuheng Wang, Wei Liu, Feiyu Duan, Xuanwen Ding, Shu Yao, Lin Wu, Ruijie Shi, Wai-Shing Leung, Yuan Cheng, Zhongyu Wei, Cheng Yang, Chen Qian, Zhiyuan Liu, Maosong Sun

Title: Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science

Abstract:
Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi‑Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open‑ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial‑and‑error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate to establish the collaboration gain metric (Γ) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging Γ, we propose a factor attribution paradigm to systematically identify collaboration‑driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control‑level presets and information‑level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.

PaperID: 5800, https://arxiv.org/pdf/2602.05249.pdf

Authors: Xinyi He, Ying Yang, Chuanjian Fu, Sihan Guo, Songchun Zhu, Lifeng Fan, Zhenliang Zhang, Yujia Peng

Title: Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents

Abstract:
As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in‑situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two‑stage interaction‑evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and encompassing essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in‑situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness and show high sensitivity to task types in reasoning. These sobering findings highlight the necessity of in‑situ evaluation before deploying agents into real‑world human environments.

PaperID: 5801, https://arxiv.org/pdf/2602.05183.pdf

Authors: John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson

Title: Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

Abstract:
Large language models (LLMs) are increasingly trained in complex Reinforcement Learning, multi‑agent environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently shown to be useful for data‑centric interpretability. In this work, we analyze large‑scale reinforcement learning training runs from the sophisticated environment of Full‑Press Diplomacy by applying pretrained SAEs, alongside LLM‑summarizer methods. We introduce Meta‑Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine‑grained behaviors including role‑playing patterns, degenerate outputs, language switching, alongside high‑level strategic behaviors and environment‑specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta‑Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, along with most LLM generated hypotheses. However, a subset of SAE‑derived hypotheses are predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent's system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM‑summarizer provide complementary views into agent behavior, and together our framework forms a practical starting point for future data‑centric interpretability work on ensuring trustworthy LLM behavior throughout training.

PaperID: 5802, https://arxiv.org/pdf/2602.05176.pdf

Authors: Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov

Title: Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Abstract:
Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi‑agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi‑LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plug them into four types of popular model collaboration systems, and evaluate the compromised system across 10 datasets. We find that malicious models have a severe impact on the multi‑LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.

PaperID: 5803, https://arxiv.org/pdf/2602.05115.pdf

Authors: Keyang Xuan, Pengda Wang, Chongrui Ye, Haofei Yu, Tal August, Jiaxuan You

Title: SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers

Abstract:
Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present \textscSocialVeil, a social learning environment that can simulate social interaction under cognitive‑difference‑induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, \textscSocialVeil introduces three representative types of such disruption, \emphsemantic vagueness, \emphsociocultural mismatch, and \emphemotional interference. We also introduce two barrier‑aware evaluation metrics, \emphunresolved confusion and \emphmutual understanding, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45% on average, and confusion elevated by nearly 50%. Human evaluations validate the fidelity of these simulated barriers (ICC\approx0.78, Pearson r\approx0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive learning) only have a modest effect far from barrier‑free performance. This work takes a step toward bringing social interaction environments closer to real‑world communication, opening opportunities for exploring the social intelligence of LLM agents.

PaperID: 5804, https://arxiv.org/pdf/2602.05073.pdf

Authors: Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li

Title: Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

Abstract:
Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. Yet, even as LLM agents are increasingly deployed in highly complex tasks, most UQ research still centers on single‑turn question‑answering. We argue that UQ research must shift to realistic settings with interactive agents, and that a new principled framework for agent UQ is needed. This paper presents three pillars to build a solid ground for future agent UQ research: (1. Foundations) We present the first general formulation of agent UQ that subsumes broad classes of existing UQ setups; (2. Challenges) We identify four technical challenges specifically tied to agentic setups ‑‑ selection of uncertainty estimator, uncertainty of heterogeneous entities, modeling uncertainty dynamics in interactive systems, and lack of fine‑grained benchmarks ‑‑ with numerical analysis on a real‑world agent benchmark, τ^2‑bench; (3. Future Directions) We conclude with noting on the practical implications of agent UQ and remaining open problems as forward‑looking discussion for future explorations.

PaperID: 5805, https://arxiv.org/pdf/2602.05060.pdf

Authors: Heajun An, Qi Zhang, Minqian Liu, Xinyi Zhang, Sang Won Lee, Lifu Huang, Pamela J. Wisniewski, Jin-Hee Cho

Title: StagePilot: A Deep Reinforcement Learning Agent for Stage-Controlled Cybergrooming Simulation

Abstract:
Cybergrooming is an evolving threat to youth, necessitating proactive educational interventions. We propose StagePilot, an offline RL‑based dialogue agent that simulates the stage‑wise progression of grooming behaviors for prevention training. StagePilot selects conversational stages using a composite reward that balances user sentiment and goal proximity, with transitions constrained to adjacent stages for realism and interpretability. We evaluate StagePilot through LLM‑based simulations, measuring stage completion, dialogue efficiency, and emotional engagement. Results show that StagePilot generates realistic and coherent conversations aligned with grooming dynamics. Among tested methods, the IQL+AWAC agent achieves the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.

PaperID: 5806, https://arxiv.org/pdf/2602.05048.pdf

Authors: Zeyu Fang, Mahdi Imani, Tian Lan

Title: MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation

Abstract:
Joint planning through language‑based interactions is a key area of human‑AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents ‑‑ thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object‑driven planning. To this end, we propose Minimal Information Neuro‑Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self‑play with MINT to optimize the AI agent's elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human‑AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage LLM to search and summarize MINT's reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT‑based planning attains near‑expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.

PaperID: 5807, https://arxiv.org/pdf/2602.05016.pdf

Authors: Eryue Xu, Tianshi Li

Title: From Fragmentation to Integration: Exploring the Design Space of AI Agents for Human-as-the-Unit Privacy Management

Abstract:
Managing one's digital footprint is overwhelming, as it spans multiple platforms and involves countless context‑dependent decisions. Recent advances in agentic AI offer ways forward by enabling holistic, contextual privacy‑enhancing solutions. Building on this potential, we adopted a ''human‑as‑the‑unit'' perspective and investigated users' cross‑context privacy challenges through 12 semi‑structured interviews. Results reveal that people rely on ad hoc manual strategies while lacking comprehensive privacy controls, highlighting nine privacy‑management challenges across applications, temporal contexts, and relationships. To explore solutions, we generated nine AI agent concepts and evaluated them via a speed‑dating survey with 116 US participants. The three highest‑ranked concepts were all post‑sharing management tools with half or full agent autonomy, with users expressing greater trust in AI accuracy than in their own efforts. Our findings highlight a promising design space where users see AI agents bridging the fragments in privacy management, particularly through automated, comprehensive post‑sharing remediation of users' digital footprints.

PaperID: 5808, https://arxiv.org/pdf/2602.05011.pdf

Authors: Xuting Gao, Guillem Pascual, Scott Brown, Sonia Martínez

Title: Banach Control Barrier Functions for Large-Scale Swarm Control

Abstract:
This paper studies the safe control of very large multi‑agent systems via a generalized framework that employs so‑called Banach Control Barrier Functions (B‑CBFs). Modeling a large swarm as probability distribution over a spatial domain, we show how B‑CBFs can be used to appropriately capture a variety of macroscopic constraints that can integrate with large‑scale swarm objectives. Leveraging this framework, we define stable and filtered gradient flows for large swarms, paying special attention to optimal transport algorithms. Further, we show how to derive agent‑level, microscopical algorithms that are consistent with macroscopic counterparts in the large‑scale limit. We then identify conditions for which a group of agents can compute a distributed solution that only requires local information from other agents within a communication range. Finally, we showcase the theoretical results over swarm systems in the simulations section.

PaperID: 5809, https://arxiv.org/pdf/2602.05004.pdf

Authors: Zexin Lin, Jiachen Yu, Haoyang Zhang, Yuzhao Li, Zhonghang Li, Yujiu Yang, Junjie Wang, Xiaoqiang Ji

Title: CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

Abstract:
Large language models are enabling language‑conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub‑second real‑time coordination and sustained multi‑episode adaptation under a strict online token budget. Existing approaches either rely on frequent in‑episode reasoning that induces latency and timing jitter, or deliver post‑episode improvements through unstructured text that is difficult to compile into reliable low‑cost execution. We propose CoWork‑X, an active co‑evolution framework that casts peer collaboration as a closed‑loop optimization problem across episodes, inspired by fast‑‑slow memory separation. CoWork‑X instantiates a Skill‑Agent that executes via HTN (hierarchical task network)‑based skill retrieval from a structured, interpretable, and compositional skill library, and a post‑episode Co‑Optimizer that performs patch‑style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked‑AI‑like realtime collaboration benchmarks demonstrate that CoWork‑X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.

PaperID: 5810, https://arxiv.org/pdf/2602.04935.pdf

Authors: Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, Liangming Pan

Title: ASA: Training-Free Representation Engineering for Tool-Calling Agents

Abstract:
Adapting LLM agents to domain‑specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is easy to deploy but often fragile under distribution shift and strict parsers, while continual parameter‑efficient fine‑tuning improves reliability at the cost of training, maintenance, and potential forgetting. We identify a critical Lazy Agent failure mode where tool necessity is nearly perfectly decodable from mid‑layer activations, yet the model remains conservative in entering tool mode, revealing a representation‑behavior gap. We propose Activation Steering Adapter (ASA), a training‑free, inference‑time controller that performs a single‑shot mid‑layer intervention and targets tool domains via a router‑conditioned mixture of steering vectors with a probe‑guided signed gate to amplify true intent while suppressing spurious triggers. On MTU‑Bench with Qwen2.5‑1.5B, ASA improves strict tool‑use F1 from 0.18 to 0.50 while reducing the false positive rate from 0.15 to 0.05, using only about 20KB of portable assets and no weight updates.

PaperID: 5811, https://arxiv.org/pdf/2602.04926.pdf

Authors: Ning Wang, Kuanyan Zhu, Daniel Yuehwoon Yee, Yitang Gao, Shiying Huang, Zirun Xu, Sainyam Galhotra

Title: Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation

Abstract:
Retrieval‑augmented generation (RAG) is now standard for knowledge‑intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re‑retrieving long passages and re‑reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph‑style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID‑indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two‑layer consolidation policy (fast ANN/KNN alias detection plus selective k‑means once a memory threshold is reached) and prune low‑value structure, while prompts retain only overlap representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever‑REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever‑llm, which swaps in an LLM extractor. On GraphRAG‑Benchmark (Medical and Novel), both variants achieve state‑of‑the‑art complex reasoning accuracy, improving over HippoRAG2 by roughly 9‑‑11 points, and remain competitive on contextual summarize and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph‑heavy baselines, making it a practical substrate for long‑running sessions, evolving corpora, and multi‑agent pipelines.

PaperID: 5812, https://arxiv.org/pdf/2602.04912.pdf

Authors: James Gao, Josh Zhou, Qi Sun, Ryan Huang, Steven Yoo

Title: Atomic Information Flow: A Network Flow Model for Tool Attributions in RAG Systems

Abstract:
Many tool‑based Retrieval Augmented Generation (RAG) systems lack precise mechanisms for tracing final responses back to specific tool components ‑‑ a critical gap as systems scale to complex multi‑agent architectures. We present Atomic Information Flow (AIF), a graph‑based network flow model that decomposes tool outputs and LLM calls into atoms: indivisible, self‑contained units of information. By modeling LLM orchestration as a directed flow of atoms from tool and LLM nodes to a response super‑sink, AIF enables granular attribution metrics for AI explainability. Motivated by the max‑flow min‑cut theorem in network flow theory, we train a lightweight Gemma3 (4B parameter) language model as a context compressor to approximate the minimum cut of tool atoms using flow signals computed offline by AIF. We note that the base Gemma3‑4B model struggles to identify critical information with 54.7% accuracy on HotpotQA, barely outperforming lexical baselines (BM25). However, post‑training on AIF signals boosts accuracy to 82.71% (+28.01 points) while achieving 87.52% (+1.85%) context token compression ‑‑ bridging the gap with the Gemma3‑27B variant, a model nearly 7× larger.

PaperID: 5813, https://arxiv.org/pdf/2602.04892.pdf

Authors: Shihao Xia, Mengting He, Haomin Jia, Linhai Song

Title: Doc2Spec: Synthesizing Formal Programming Specifications from Natural Language via Grammar Induction

Abstract:
Ensuring that API implementations and usage comply with natural language programming rules is critical for software correctness, security, and reliability. Formal verification can provide strong guarantees but requires precise specifications, which are difficult and costly to write manually. To address this challenge, we present Doc2Spec, a multi‑agent framework that uses LLMs to automatically induce a specification grammar from natural‑language rules and then generates formal specifications guided by the induced grammar. The grammar captures essential domain knowledge, constrains the specification space, and enforces consistent representations, thereby improving the reliability and quality of generated specifications. Evaluated on seven benchmarks across three programming languages, Doc2Spec outperforms a baseline without grammar induction and achieves competitive results against a technique with a manually crafted grammar, demonstrating the effectiveness of automated grammar induction for formalizing natural‑language rules.

PaperID: 5814, https://arxiv.org/pdf/2602.04850.pdf

Authors: Juan B. Pérez-Sánchez, Yunheng Zou, Jorge A. Campos-Gonzalez-Angulo, Marcel Müller, Ignacio Gustin, Andrew Wang, Han Hao, Tsz Wai Ko, Changhyeok Choi, Eric S. Isbrandt, Mohammad Ghazi Vakili, Hanyong Xu, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik

Title: El Agente Quntur: A research collaborator agent for quantum chemistry

Abstract:
Quantum chemistry is a foundational enabling tool for the fields of chemistry, materials science, computational biology and others. Despite of its power, the practical application of quantum chemistry simulations remains in the hands of qualified experts due to methodological complexity, software heterogeneity, and the need for informed interpretation of results. To bridge the accessibility gap for these tools and expand their reach to chemists with broader backgrounds, we introduce El Agente Quntur, a hierarchical, multi‑agent AI system designed to operate not merely as an automation tool but as a research collaborator for computational quantum chemistry. Quntur was designed following three main strategies: i) elimination of hard‑coded procedural policies in favour of reasoning‑driven decisions, ii) construction of general and composable actions that facilitate generalization and efficiency, and iii) implementation of guided deep research to integrate abstract quantum‑chemical reasoning across subdisciplines and a detailed understanding of the software's internal logic and syntax. Although instantiated in ORCA, these design principles are applicable to research agents more generally and easily expandable to additional quantum chemistry packages and beyond. Quntur supports the full range of calculations available in ORCA 6.0 and reasons over software documentation and scientific literature to plan, execute, adapt, and analyze in silico chemistry experiments following best practices. We discuss the advances and current bottlenecks in agentic systems operating at the research level in computational chemistry, and outline a roadmap toward a fully autonomous end‑to‑end computational chemistry research agent.

PaperID: 5815, https://arxiv.org/pdf/2602.04849.pdf

Authors: Changhyeok Choi, Yunheng Zou, Marcel Müller, Han Hao, Yeonghun Kang, Juan B. Pérez-Sánchez, Ignacio Gustin, Hanyong Xu, Andrew Wang, Mohammad Ghazi Vakili, Chris Crebolder, Alán Aspuru-Guzik, Varinia Bernales

Title: El Agente Estructural: An Artificially Intelligent Molecular Editor

Abstract:
We present El Agente Estructural, a multimodal, natural‑language‑driven geometry‑generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain‑informed tools and vision‑language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real‑world scenarios. These include site‑selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment‑level structural analysis, image‑guided generation of structures from schematic reaction mechanisms, and mechanism‑driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry‑aware tools, supports interactive and context‑aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi‑agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three‑dimensional structures.

PaperID: 5816, https://arxiv.org/pdf/2602.04813.pdf

Authors: Shubham Vatsal, Harsh Dubey, Aditi Singh

Title: Agentic AI in Healthcare & Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM-based Agents

Abstract:
Large Language Model (LLM)‑based agents that plan, use tools and act has begun to shape healthcare and medicine. Reported studies demonstrate competence on various tasks ranging from EHR analysis and differential diagnosis to treatment planning and research workflows. Yet the literature largely consists of overviews which are either broad surveys or narrow dives into a single capability (e.g., memory, planning, reasoning), leaving healthcare work without a common frame. We address this by reviewing 49 studies using a seven‑dimensional taxonomy: Cognitive Capabilities, Knowledge Management, Interaction Patterns, Adaptation & Learning, Safety & Ethics, Framework Typology and Core Tasks & Subtasks with 29 operational sub‑dimensions. Using explicit inclusion and exclusion criteria and a labeling rubric (Fully Implemented, Partially Implemented, Not Implemented), we map each study to the taxonomy and report quantitative summaries of capability prevalence and co‑occurrence patterns. Our empirical analysis surfaces clear asymmetries. For instance, the External Knowledge Integration sub‑dimension under Knowledge Management is commonly realized (~76% Fully Implemented) whereas Event‑Triggered Activation sub‑dimenison under Interaction Patterns is largely absent (~92% Not Implemented) and Drift Detection & Mitigation sub‑dimension under Adaptation & Learning is rare (~98% Not Implemented). Architecturally, Multi‑Agent Design sub‑dimension under Framework Typology is the dominant pattern (~82% Fully Implemented) while orchestration layers remain mostly partial. Across Core Tasks & Subtasks, information centric capabilities lead e.g., Medical Question Answering & Decision Support and Benchmarking & Simulation, while action and discovery oriented areas such as Treatment Planning & Prescription still show substantial gaps (~59% Not Implemented).

PaperID: 5817, https://arxiv.org/pdf/2602.04801.pdf

Authors: Lamberto Vazquez-Soqui, Fatima Oliva-Palomo, Diego Mercado-Ravell, Pedro Castillo

Title: SQP-Based Cable-Tension Allocation for Multi-Drone Load Transport

Abstract:
Multi‑Agent Aerial Load Transport Systems (MAATS) offer greater payload capacity and fault tolerance than single‑drone solutions. However, they have an underdetermined tension allocation problem that leads to uneven energy distribution, cable slack, or collisions between drones and cables. This paper presents a real‑time optimization layer that improves a hierarchical load‑position‑attitude controller by incorporating a Sequential Quadratic Programming (SQP) algorithm. The SQP formulation minimizes the sum of squared cable tensions while imposing a cable‑alignment penalty that discourages small inter‑cable angles, thereby preventing tether convergence without altering the reference trajectory. We tested the method under nominal conditions by running numerical simulations of four quadrotors. Computational experiments based on numerical simulations demonstrate that the SQP routine runs in a few milliseconds on standard hardware, indicating feasibility for real‑time use. A sensitivity analysis confirms that the gain of the cable‑alignment penalty can be tuned online, enabling a controllable trade‑off between safety margin and energy consumption with no measurable degradation of tracking performance in simulation. This framework provides a scalable path to safe and energy‑balanced cooperative load transport in practical deployments.

PaperID: 5818, https://arxiv.org/pdf/2602.04763.pdf

Authors: Rui Liu, Pratap Tokekar, Ming Lin

Title: Active Asymmetric Multi-Agent Multimodal Learning under Uncertainty

Abstract:
Multi‑agent systems are increasingly equipped with heterogeneous multimodal sensors, enabling richer perception but introducing modality‑specific and agent‑dependent uncertainty. Existing multi‑agent collaboration frameworks typically reason at the agent level, assume homogeneous sensing, and handle uncertainty implicitly, limiting robustness under sensor corruption. We propose Active Asymmetric Multi‑Agent Multimodal Learning under Uncertainty (A2MAML), a principled approach for uncertainty‑aware, modality‑level collaboration. A2MAML models each modality‑specific feature as a stochastic estimate with uncertainty prediction, actively selects reliable agent‑modality pairs, and aggregates information via Bayesian inverse‑variance weighting. This formulation enables fine‑grained, modality‑level fusion, supports asymmetric modality availability, and provides a principled mechanism to suppress corrupted or noisy modalities. Extensive experiments on connected autonomous driving scenarios for collaborative accident detection demonstrate that A2MAML consistently outperforms both single‑agent and collaborative baselines, achieving up to 18.7% higher accident detection rate.

PaperID: 5819, https://arxiv.org/pdf/2602.04712.pdf

Authors: David F. Ramirez, Tim Overman, Kristen Jaskie, Joe Marvin, Andreas Spanias

Title: SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation

Abstract:
We present a visual‑context image‑retrieval‑augmented generation (ImageRAG)‑ assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR) imagery. SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval‑Augmented Generation (SAR‑RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples of known true target types, our SAR‑RAG system can compare similar vehicle categories, thereby improving ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR‑RAG is added to an MLLM baseline method as an attached ATR memory bank.

PaperID: 5820, https://arxiv.org/pdf/2602.04653.pdf

Authors: Ariel Fogel, Omer Hofman, Eilon Cohen, Roman Vainshtein

Title: Inference-Time Backdoors via Chat Templates: From LLM Supply Chains to Agentic System Compromise

Abstract:
Open‑weight language models are increasingly used in production settings, raising new security challenges. One prominent threat is backdoor attacks, in which adversaries embed hidden behaviors that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither: the "chat template". Chat templates are executable programs invoked at every inference call, often implemented in Jinja2, that occupy a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference‑time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluate this attack across three deployment tiers. At the LLM level, triggered backdoors reduce factual accuracy from 90% to 15% on average and induce attacker‑controlled URL emission with success rates exceeding 80%, while benign inputs show no measurable degradation; these results hold across eighteen models. At the agent level, template backdoors hijack tool‑use across two benchmarks spanning 3,868 episodes, bypassing every tested injection defense offered by the benchmarks while remaining fully dormant absent the trigger. At the multi‑agent system level, we demonstrate how a single poisoned artifact compromises a real‑world agentic deployment and propagates supply‑chain code poisoning downstream. The poisoned artifacts evade all security scans on the largest open model distribution platform; and because the payload is rendered by the template before user input is processed, it is architecturally unreachable by input‑level defenses such as prompt injection guardrails. These results establish chat templates as a reliable and undefended attack in the open‑weight AI supply chain.

PaperID: 5821, https://arxiv.org/pdf/2602.04634.pdf

Authors: Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, Yu Wang

Title: WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

Abstract:
Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long‑horizon problems with multi‑turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi‑agent systems to address broad information seeking. Existing multi‑agent systems often rely on hand‑crafted workflows and turn‑taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek‑R1, a lead‑agent‑subagent framework trained via multi‑agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek‑R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information‑seeking tasks. Extensive experiments show that WideSeek‑R1‑4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single‑agent DeepSeek‑R1‑671B. Furthermore, WideSeek‑R1‑4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.

PaperID: 5822, https://arxiv.org/pdf/2602.04575.pdf

Authors: Jiaheng Liu, Yuanxing Zhang, Shihao Li, Xinping Lei

Title: Vibe AIGC: A New Paradigm for Content Generation via Agentic Orchestration

Abstract:
For the past decade, the trajectory of generative artificial intelligence (AI) has been dominated by a model‑centric paradigm driven by scaling laws. Despite significant leaps in visual fidelity, this approach has encountered a ``usability ceiling'' manifested as the Intent‑Execution Gap (i.e., the fundamental disparity between a creator's high‑level intent and the stochastic, black‑box nature of current single‑shot models). In this paper, inspired by the Vibe Coding, we introduce the Vibe AIGC, a new paradigm for content generation via agentic orchestration, which represents the autonomous synthesis of hierarchical multi‑agent workflows. Under this paradigm, the user's role transcends traditional prompt engineering, evolving into a Commander who provides a Vibe, a high‑level representation encompassing aesthetic preferences, functional logic, and etc. A centralized Meta‑Planner then functions as a system architect, deconstructing this ``Vibe'' into executable, verifiable, and adaptive agentic pipelines. By transitioning from stochastic inference to logical orchestration, Vibe AIGC bridges the gap between human imagination and machine execution. We contend that this shift will redefine the human‑AI collaborative economy, transforming AI from a fragile inference engine into a robust system‑level engineering partner that democratizes the creation of complex, long‑horizon digital assets.

PaperID: 5823, https://arxiv.org/pdf/2602.04518.pdf

Authors: Andrés Holgado-Sánchez, Holger Billhardt, Alberto Fernández, Sascha Ossowski

Title: Learning the Value Systems of Agents with Preference-based and Inverse Reinforcement Learning

Abstract:
Agreement Technologies refer to open computer systems in which autonomous software agents interact with one another, typically on behalf of humans, in order to come to mutually acceptable agreements. With the advance of AI systems in recent years, it has become apparent that such agreements, in order to be acceptable to the involved parties, must remain aligned with ethical principles and moral values. However, this is notoriously difficult to ensure, especially as different human users (and their software agents) may hold different value systems, i.e. they may differently weigh the importance of individual moral values. Furthermore, it is often hard to specify the precise meaning of a value in a particular context in a computational manner. Methods to estimate value systems based on human‑engineered specifications, e.g. based on value surveys, are limited in scale due to the need for intense human moderation. In this article, we propose a novel method to automatically \emphlearn value systems from observations and human demonstrations. In particular, we propose a formal model of the \emphvalue system learning problem, its instantiation to sequential decision‑making domains based on multi‑objective Markov decision processes, as well as tailored preference‑based and inverse reinforcement learning algorithms to infer value grounding functions and value systems. The approach is illustrated and evaluated by two simulated use cases.

PaperID: 5824, https://arxiv.org/pdf/2602.04510.pdf

Authors: Zhaolin Hu, Zhiliang Wu, Hehe Fan, Yi Yang

Title: OSCAgent: Accelerating the Discovery of Organic Solar Cells with LLM Agents

Abstract:
Organic solar cells (OSCs) hold great promise for sustainable energy, but discovering high‑performance materials is time‑consuming and costly. Existing molecular generation methods can aid the design of OSC molecules, but they are mostly confined to optimizing known backbones and lack effective use of domain‑specific chemical knowledge, often leading to unrealistic candidates. In this paper, we introduce OSCAgent, a multi‑agent framework for OSC molecular discovery that unifies retrieval‑augmented design, molecular generation, and systematic evaluation into a continuously improving pipeline, without requiring additional human intervention. OSCAgent comprises three collaborative agents. The Planner retrieves knowledge from literature‑curated molecules and prior candidates to guide design directions. The Generator proposes new OSC acceptors aligned with these plans. The Experimenter performs comprehensive evaluation of candidate molecules and provides feedback for refinement. Experiments show that OSCAgent produces chemically valid, synthetically accessible OSC molecules and achieves superior predicted performance compared to both traditional and large language model (LLM)‑only baselines. Representative results demonstrate that some candidates achieve predicted efficiencies approaching 18%. The code will be publicly available.

PaperID: 5825, https://arxiv.org/pdf/2602.04496.pdf

Authors: Zhentao Tang, Yuqi Cui, Shixiong Kai, Wenqian Zhao, Ke Ye, Xing Li, Anxin Tian, Zehua Pei, Hui-Ling Zhen, Shoubo Hu, Xiaoguang Li, Yunhe Wang, Mingxuan Yuan

Title: ReThinker: Scientific Reasoning by Rethinking with Guided Reflection and Confidence Control

Abstract:
Expert‑level scientific reasoning remains challenging for large language models, particularly on benchmarks such as Humanity's Last Exam (HLE), where rigid tool pipelines, brittle multi‑agent coordination, and inefficient test‑time scaling often limit performance. We introduce ReThinker, a confidence‑aware agentic framework that orchestrates retrieval, tool use, and multi‑agent reasoning through a stage‑wise Solver‑Critic‑Selector architecture. Rather than following a fixed pipeline, ReThinker dynamically allocates computation based on model confidence, enabling adaptive tool invocation, guided multi‑dimensional reflection, and robust confidence‑weighted selection. To support scalable training without human annotation, we further propose a reverse data synthesis pipeline and an adaptive trajectory recycling strategy that transform successful reasoning traces into high‑quality supervision. Experiments on HLE, GAIA, and XBench demonstrate that ReThinker consistently outperforms state‑of‑the‑art foundation models with tools and existing deep research systems, achieving state‑of‑the‑art results on expert‑level reasoning tasks.

PaperID: 5826, https://arxiv.org/pdf/2602.04482.pdf

Authors: Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, Yang Li

Title: ProAgentBench: Evaluating LLM Agents for Proactive Assistance with Real-World Data

Abstract:
Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human‑AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM‑synthesized data that fails to capture authentic human decision‑making patterns, and (2) focus on isolated tasks rather than continuous workflows, missing the pre‑assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assist content generation; (2) a privacy‑compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluates LLM‑ and VLM‑based baselines. Numerically, we showed that long‑term memory and historical context significantly enhance prediction accuracy, while real‑world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench‑6BC0.

PaperID: 5827, https://arxiv.org/pdf/2602.04476.pdf

Authors: Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

Title: Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Abstract:
Despite recent advancements in Multi‑modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi‑step reasoning. This is primarily due to the progressive dilution of visual information during long‑context generation, which hinders their ability to fully exploit test‑time scaling. To address this issue, we introduce Vision‑aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision‑aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long‑context understanding or precise visual perception, while exhibiting test‑time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI‑Bench, achieving a 19.9%p gain over Qwen2.5‑VL.

PaperID: 5828, https://arxiv.org/pdf/2602.04431.pdf

Authors: Jonathan Nöther, Adish Singla, Goran Radanovic

Title: MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

Abstract:
LLM‑based multi‑agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. Inspired by Stackelberg security games, we formalize this problem as a game between a system designer (the Meta‑Agent) and a best‑responding Meta‑Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta‑Adversary‑Meta‑Agent (MaMa), a novel algorithm inspired by this formalization for automatically designing safe agentic systems. Our approach uses LLM‑based adversarial search, where the Meta‑Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta‑Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst‑case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.

PaperID: 5829, https://arxiv.org/pdf/2602.04418.pdf

Authors: Indraveni Chebolu, Arnab Mallick, Harmesh Rana

Title: SPEAR: An Engineering Case Study of Multi-Agent Coordination for Smart Contract Auditing

Abstract:
We present SPEAR, a multi‑agent coordination framework for smart contract auditing that applies established MAS patterns in a realistic security analysis workflow. SPEAR models auditing as a coordinated mission carried out by specialized agents: a Planning Agent prioritizes contracts using risk‑aware heuristics, an Execution Agent allocates tasks via the Contract Net protocol, and a Repair Agent autonomously recovers from brittle generated artifacts using a programmatic‑first repair policy. Agents maintain local beliefs updated through AGM‑compliant revision, coordinate via negotiation and auction protocols, and revise plans as new information becomes available. An empirical study compares the multi‑agent design with centralized and pipeline‑based alternatives under controlled failure scenarios, focusing on coordination, recovery behavior, and resource use.

PaperID: 5830, https://arxiv.org/pdf/2602.04397.pdf

Authors: Annalisa Barbara, Riccardo Poiani, Martino Bernasconi, Andrea Celli

Title: Optimal Rates for Feasible Payoff Set Estimation in Games

Abstract:
We study a setting in which two players play a (possibly approximate) Nash equilibrium of a bimatrix game, while a learner observes only their actions and has no knowledge of the equilibrium or the underlying game. A natural question is whether the learner can rationalize the observed behavior by inferring the players' payoff functions. Rather than producing a single payoff estimate, inverse game theory aims to identify the entire set of payoffs consistent with observed behavior, enabling downstream use in, e.g., counterfactual analysis and mechanism design across applications like auctions, pricing, and security games. We focus on the problem of estimating the set of feasible payoffs with high probability and up to precision ε on the Hausdorff metric. We provide the first minimax‑optimal rates for both exact and approximate equilibrium play, in zero‑sum as well as general‑sum games. Our results provide learning‑theoretic foundations for set‑valued payoff inference in multi‑agent environments.

PaperID: 5831, https://arxiv.org/pdf/2602.04326.pdf

Authors: SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, HyeongYeop Kang

Title: From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents

Abstract:
Embodied agents operating in multi‑agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long‑standing challenges, such as high‑level goal decomposition and online adaptation. Yet, uncertainty is still primarily mitigated through frequent inter‑agent communication. This incurs substantial token and time costs, and can disrupt established workflows, when human partners are involved. We introduce PCE, a Planner‑Composer‑Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal‑directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi‑agent benchmarks (C‑WAH and TDW‑MAT) and three diverse LLM backbones, PCE consistently outperforms communication‑centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning‑depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty‑aware planning.

PaperID: 5832, https://arxiv.org/pdf/2602.04258.pdf

Authors: Yufei Ye, Shijian Gao, Xinhu Zheng, Liuqing Yang

Title: Multi-Tier UAV Edge Computing Towards Long-Term Energy Stability for Low Altitude Networks

Abstract:
The agile mobility of Unmanned Aerial Vehicles (UAVs) makes them ideal for low‑altitude edge computing. This paper proposes a novel multi‑tier UAV edge computing system where lightweight Low‑Tier UAVs (L‑UAVs) function as edge servers for vehicle users, supported by a powerful High‑Tier UAV (H‑UAV) acting as a backup server. The objective is to minimize task execution delays while ensuring the long‑term energy stability of the L‑UAVs, despite unknown future system states. To this end, the problem is decoupled using Lyapunov optimization, which adaptively balances the priorities of task delays and L‑UAV energy cost based on their real‑time energy states. An efficient vehicle to L‑UAV matching scheme is designed, and the joint optimization problem for task assignment, computing resource allocation, and trajectory control of L‑UAVs and H‑UAV is then solved via a Block Coordinate Descent (BCD) algorithm. Simulation results demonstrate a reduction in L‑UAV transmission energy of over 26% and superior L‑UAV energy stability compared to existing benchmarks.

PaperID: 5833, https://arxiv.org/pdf/2602.04234.pdf

Authors: Yuxuan Zhao, Sijia Chen, Ningxin Su

Title: On the Uncertainty of Large Language Model-Based Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra‑ and inter‑agent dynamics by investigating entropy transitions during problem‑solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token‑, trajectory‑, and round‑level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem‑solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent‑entropy.

PaperID: 5834, https://arxiv.org/pdf/2602.04213.pdf

Authors: Feiyu Gavin Zhu, Jean Oh, Reid Simmons

Title: InterPReT: Interactive Policy Restructuring and Training Enable Effective Imitation Learning from Laypersons

Abstract:
Imitation learning has shown success in many tasks by learning from expert demonstrations. However, most existing work relies on large‑scale demonstrations from technical professionals and close monitoring of the training process. These are challenging for a layperson when they want to teach the agent new skills. To lower the barrier of teaching AI agents, we propose Interactive Policy Restructuring and Training (InterPReT), which takes user instructions to continually update the policy structure and optimize its parameters to fit user demonstrations. This enables end‑users to interactively give instructions and demonstrations, monitor the agent's performance, and review the agent's decision‑making strategies. A user study (N=34) on teaching an AI agent to drive in a racing game confirms that our approach yields more robust policies without impairing system usability, compared to a generic imitation learning baseline, when a layperson is responsible for both giving demonstrations and determining when to stop. This shows that our method is more suitable for end‑users without much technical background in machine learning to train a dependable policy

PaperID: 5835, https://arxiv.org/pdf/2602.04196.pdf

Authors: Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua

Title: The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Abstract:
Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training‑time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code‑based reinforcement learning, a model may covertly manipulate logged accuracy for self‑preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine‑grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama‑3.1‑8B‑Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training‑time risks also arise in multi‑agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

PaperID: 5836, https://arxiv.org/pdf/2602.04152.pdf

Authors: Yirum Kim, Jaewoo Kim, Ue-Hwan Kim

Title: MA3DSG: Multi-Agent 3D Scene Graph Generation for Large-Scale Indoor Environments

Abstract:
Current 3D scene graph generation (3DSGG) approaches heavily rely on a single‑agent assumption and small‑scale environments, exhibiting limited scalability to real‑world scenarios. In this work, we introduce Multi‑Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training‑free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single‑agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG‑Bench‑a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions‑providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi‑agent 3DSGG research.

PaperID: 5837, https://arxiv.org/pdf/2602.04112.pdf

Authors: Jiangnan Yang, Junjie Chen, Fei Wang, Yiqi Nie, Yuxin Liu, Zhangling Duan, Jie Chen

Title: DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling

Abstract:
Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients' mental states and respond empathically. However, most existing language‑model‑based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi‑agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution‑level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human‑AI interaction.

PaperID: 5838, https://arxiv.org/pdf/2602.03974.pdf

Authors: Shuhui Qu

Title: Active Epistemic Control for Query-Efficient Verified Planning

Abstract:
Planning in interactive environments is challenging under partial observability: task‑critical preconditions (e.g., object locations or container states) may be unknown at decision time, yet grounding them through interaction is costly. Learned world models can cheaply predict missing facts, but prediction errors can silently induce infeasible commitments. We present Active Epistemic Control (AEC), an epistemic‑categorical planning layer that integrates model‑based belief management with categorical feasibility checks. AEC maintains a strict separation between a \emphgrounded fact store used for commitment and a \emphbelief store used only for pruning candidate plans. At each step, it either queries the environment to ground an unresolved predicate when uncertainty is high or predictions are ambiguous, or simulates the predicate to filter hypotheses when confidence is sufficient. Final commitment is gated by grounded precondition coverage and an SQ‑BCP pullback‑style compatibility check, so simulated beliefs affect efficiency but cannot directly certify feasibility. Experiments on ALFWorld and ScienceWorld show that AEC achieves competitive success with fewer replanning rounds than strong LLM‑agent baselines.

PaperID: 5839, https://arxiv.org/pdf/2602.03950.pdf

Authors: Aditya Basarkar, Benyamin Tabarsi, Tiffany Barnes, Dongkuan Xu

Title: Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Abstract:
Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi‑agent LLM‑based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self‑evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain‑of‑thought abilities of the base LLM to maintain high‑level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.

PaperID: 5840, https://arxiv.org/pdf/2602.03940.pdf

Authors: Olaf Yunus Laitinen Imanov, Duygu Erisken, Derya Umut Kulali, Taner Yilmaz, Rana Irem Turhan

Title: Autonomous AI Agents for Real-Time Affordable Housing Site Selection: Multi-Objective Reinforcement Learning Under Regulatory Constraints

Abstract:
Affordable housing shortages affect billions, while land scarcity and regulations make site selection slow. We present AURA (Autonomous Urban Resource Allocator), a hierarchical multi‑agent reinforcement learning system for real‑time affordable housing site selection under hard regulatory constraints (QCT, DDA, LIHTC). We model the task as a constrained multi‑objective Markov decision process optimizing accessibility, environmental impact, construction cost, and social equity while enforcing feasibility. AURA uses a regulatory‑aware state encoding 127 federal and local constraints, Pareto‑constrained policy gradients with feasibility guarantees, and reward decomposition separating immediate costs from long‑term social outcomes. On datasets from 8 U.S. metros (47,392 candidate parcels), AURA attains 94.3% regulatory compliance and improves Pareto hypervolume by 37.2% over strong baselines. In a New York City 2026 case study, it reduces selection time from 18 months to 72 hours and identifies 23% more viable sites; chosen sites have 31% better transit access and 19% lower environmental impact than expert picks.

PaperID: 5841, https://arxiv.org/pdf/2602.03784.pdf

Authors: Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

Title: Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

Abstract:
Long‑context LLM agents often struggle with growing token, memory, and latency costs, making efficient context compression essential for practical deployment. Existing LLM‑as‑a‑compressor methods remain noticeably inferior to using the full context. We find that this gap partly stems from their inability to preserve contextual information effectively. In this work, we revisit context compression from a structural perspective and identify two key bottlenecks in standard LLM‑based compressors: limited coordination among compression tokens during information aggregation, and layerwise dilution that weakens useful signals from intermediate hidden states. To address these limitations, we propose ComprExIT, a new context compression framework based on explicit information transmission. ComprExIT adaptively selects features across frozen LLM layers, then allocates information from anchors to compression slots through a globally coordinated transport plan. Experiments on 12 datasets show that ComprExIT consistently outperforms strong soft‑compression baselines, improving average F1 by up to 18.5%, while adding only ~1% trainable parameters and achieving more than 2x faster compression than the fastest baselines. The code will be released upon acceptance.

PaperID: 5842, https://arxiv.org/pdf/2602.03775.pdf

Authors: Farnoosh Hashemi, Michael W. Macy

Title: An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents

Abstract:
Large Language Models (LLMs) increasingly mediate our social, cultural, and political interactions. While they can simulate some aspects of human behavior and decision‑making, it is still underexplored whether repeated interactions with other agents amplify their biases or lead to exclusionary behaviors. To this end, we study Chirper.ai‑an LLM‑driven social media platform‑analyzing 7M posts and interactions among 32K LLM agents (called Chirpers) over a year. We start with homophily and social influence among LLMs, learning that similar to humans', their social networks exhibit these fundamental phenomena. Next, we study the toxic language of LLMs, its linguistic features, and their interaction patterns, finding that LLMs show different structural patterns in toxic posting than humans. After studying the ideological leaning in LLMs posts, and the polarization in their community, we focus on how to prevent their potential harmful activities. We present a simple yet effective method, called Chain of Social Thought (CoST), that reminds LLM agents to avoid harmful posting.

PaperID: 5843, https://arxiv.org/pdf/2602.03757.pdf

Authors: Arkaprava Sain, Sunandan Adhikary, Soumyajit Dey

Title: Mitigating Timing-Based Attacks in Real-Time Cyber-Physical Systems

Abstract:
Real‑time cyber‑physical systems depend on deterministic task execution to guarantee safety and correctness. Unfortunately, this determinism can unintentionally expose timing information that enables adversaries to infer task execution patterns and carry out timing‑based attacks targeting safety‑critical control tasks. While prior defenses aim to obscure schedules through randomization or isolation, they typically neglect the implications of such modifications on closed‑loop control behavior and real‑time feasibility. This work studies the problem of securing real‑time control workloads against timing inference attacks while explicitly accounting for both schedulability constraints and control performance requirements. We present a scheduling‑based mitigation approach that introduces bounded timing perturbations to control task executions in a structured manner, reducing adversarial opportunities without violating real‑time guarantees. The framework jointly considers worst‑case execution behavior and the impact of execution delays on control performance, enabling the system to operate within predefined safety and performance limits. Through experimental evaluation on representative task sets and control scenarios, the proposed approach demonstrates that exposure to timing‑based attacks can be significantly reduced while preserving predictable execution and acceptable control quality.

PaperID: 5844, https://arxiv.org/pdf/2602.03704.pdf

Authors: Yu Tian, Linh Huynh, Katerina Christhilf, Shubham Chakraborty, Micah Watanabe, Tracy Arner, Danielle McNamara

Title: Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

Abstract:
Recent advances in large language models (LLMs) have made automated multiple‑choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi‑agent framework for generating cognitively diverse MCQs that systematically target text‑based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM‑powered agents with rule‑based components to support planning, controlled generation, iterative evaluation, and post‑processing. We evaluated the framework in a large‑scale reading comprehension study using academic expository texts, comparing ReQUESTA‑generated MCQs with those produced by a single‑pass GPT‑5 zero‑shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA‑generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM‑based generation, highlighting workflow design as a key lever for structured artifact generation beyond single‑pass prompting.

PaperID: 5845, https://arxiv.org/pdf/2602.03695.pdf

Authors: Haibo Jin, Peng Kuang, Ye Yu, Xiaopeng Yuan, Haohan Wang

Title: Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

Abstract:
While existing multi‑agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task‑specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long‑context, multi‑stage interactions within internal agent histories. In this work, we propose Agent Primitives, a set of reusable latent building blocks for LLM‑based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key‑value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi‑stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive‑based MAS. Experiments show that primitives‑based MAS improve average accuracy by 12.0‑16.5% over single‑agent baselines, reduce token usage and inference latency by approximately 3×‑4× compared to text‑based MAS, while incurring only 1.3×‑1.6× overhead relative to single‑agent inference and providing more stable performance across model backbones.

PaperID: 5846, https://arxiv.org/pdf/2602.03688.pdf

Authors: Wenzhe Fan, Tommaso Tognoli, Henry Peng Zou, Chunyu Miao, Yibo Wang, Xinhua Zhang

Title: TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

Abstract:
Multi‑round LLM‑based multi‑agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents' roles may change across rounds due to dynamic adversary, task progression, or time‑varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a task‑oriented dynamic communication algorithm. It produces behavior‑driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that, under both dynamic adversarial settings and communication budget constraints, TodyComm achieves superior task performance while maintaining token efficiency, scalability, and strong generalizability across varying adversarial conditions.

PaperID: 5847, https://arxiv.org/pdf/2602.03687.pdf

Authors: Martin Bullinger, Edith Elkind, Kassian Köck

Title: Efficient Investment in Multi-Agent Models of Public Transportation

Abstract:
We study two stylized, multi‑agent models aimed at investing a limited, indivisible resource in public transportation. In the first model, we face the decision of which potential stops to open along a (e.g., bus) path, given agents' travel demands. While it is known that utilitarian optimal solutions can be identified in polynomial time, we find that computing approximately optimal solutions with respect to egalitarian welfare is NP‑complete. This is surprising as we operate on the simple topology of a line graph. In the second model, agents navigate a more complex network modeled by a weighted graph where edge weights represent distances. We face the decision of improving travel time along a fixed number of edges. We provide a polynomial‑time algorithm that combines Dijkstra's algorithm with a dynamical program to find the optimal decision for one or two agents. By contrast, if the number of agents is variable, we find \np‑completeness and inapproximability results for utilitarian and egalitarian welfare. Moreover, we demonstrate implications of our results for a related model of railway network design.

PaperID: 5848, https://arxiv.org/pdf/2602.03619.pdf

Authors: Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou

Title: Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Abstract:
Nowadays, training and evaluating DeepResearch‑generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric‑based evaluation has become a common practice. However, existing approaches either rely on coarse, pre‑defined rubrics that lack sufficient granularity, or depend on manually constructed query‑specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human‑preference‑aligned query‑specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch‑style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM‑based rubric evaluation. To better handle long‑horizon reasoning, we further introduce a Multi‑agent Markov‑state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human‑aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open‑source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed‑source models.

PaperID: 5849, https://arxiv.org/pdf/2602.03603.pdf

Authors: Lorena Maria Genua, Nikita Boguslavskii, Zhi Li

Title: Human-in-the-Loop Failure Recovery with Adaptive Task Allocation

Abstract:
Since the recent Covid‑19 pandemic, mobile manipulators and humanoid assistive robots with higher levels of autonomy have increasingly been adopted for patient care and living assistance. Despite advancements in autonomy, these robots often struggle to perform reliably in dynamic and unstructured environments and require human intervention to recover from failures. Effective human‑robot collaboration is essential to enable robots to receive assistance from the most competent operator, in order to reduce their workload and minimize disruptions in task execution. In this paper, we propose an adaptive method for allocating robotic failures to human operators (ARFA). Our proposed approach models the capabilities of human operators, and continuously updates these beliefs based on their actual performance for failure recovery. For every failure to be resolved, a reward function calculates expected outcomes based on operator capabilities and historical data, task urgency, and current workload distribution. The failure is then assigned to the operator with the highest expected reward. Our simulations and user studies show that ARFA outperforms random allocation, significantly reducing robot idle time, improving overall system performance, and leading to a more distributed workload among operators.

PaperID: 5850, https://arxiv.org/pdf/2602.03580.pdf

Authors: Zhihao Li, Boyang Ma, Xuelong Dai, Minghui Xu, Yue Zhang, Biwei Yan, Kun Li

Title: Don't believe everything you read: Understanding and Measuring MCP Behavior under Misleading Tool Descriptions

Abstract:
The Model Context Protocol (MCP) enables large language models to invoke external tools through natural‑language descriptions, forming the foundation of many AI agent applications. However, MCP does not enforce consistency between documented tool behavior and actual code execution, even though MCP Servers often run with broad system privileges. This gap introduces a largely unexplored security risk. We study how mismatches between externally presented tool descriptions and underlying implementations systematically shape the mental models and decision‑making behavior of intelligent agents. Specifically, we present the first large‑scale study of description‑code inconsistency in the MCP ecosystem. We design an automated static analysis framework and apply it to 10,240 real‑world MCP Servers across 36 categories. Our results show that while most servers are highly consistent, approximately 13% exhibit substantial mismatches that can enable undocumented privileged operations, hidden state mutations, or unauthorized financial actions. We further observe systematic differences across application categories, popularity levels, and MCP marketplaces. Our findings demonstrate that description‑code inconsistency is a concrete and prevalent attack surface in MCP‑based AI agents, and motivate the need for systematic auditing and stronger transparency guarantees in future agent ecosystems.

PaperID: 5851, https://arxiv.org/pdf/2602.03508.pdf

Authors: Galina Sidorenko, Johan Thunberg

Title: A necessary and sufficient condition for discrete-time consensus on star boundaries

Abstract:
It is intuitive and well known, that if agents in a multi‑agent system iteratively update their states in the Euclidean space as convex combinations of neighbors' states, all states eventually converge to the same value (consensus), provided the interaction graph is sufficiently connected. However, this seems to be also true in practice if the convex combinations of states are mapped or radially projected onto any unit l_p‑sphere or even boundaries of star‑convex sets, herein referred to as star boundaries. In this paper, we present insight into this matter by providing a necessary and sufficient condition for asymptotic consensus of the normalized states (directions) for strongly connected directed graphs, which is equivalent to asymptotic consensus of states when the star boundaries are the same for all agents. Furthermore, we show that when asymptotic consensus occurs, the states converge linearly and the point of convergence is continuous in the initial states. Assuming a directed strongly connected graph provides a more general setting than that considered, for example, in gradient‑based consensus protocols, where symmetric graphs are assumed. Illustrative examples and a vast number of numerical simulations showcase the theoretical results.

PaperID: 5852, https://arxiv.org/pdf/2602.03455.pdf

Authors: Tesshu Hanaka, Nikolaos Melissinos, Hirotaka Ono

Title: Game-Theoretic and Algorithmic Analyses of Multi-Agent Routing under Crossing Costs

Abstract:
Coordinating the movement of multiple autonomous agents over a shared network is a fundamental challenge in algorithmic robotics, intelligent transportation, and distributed systems. The dominant approach, Multi‑Agent Path Finding, relies on centralized control and synchronous collision avoidance, which often requires strict synchronization and guarantees of globally conflict‑free execution. This paper introduces the Multi‑Agent Routing under Crossing Cost model on mixed graphs, a novel framework tailored to asynchronous settings. In our model, instead of treating conflicts as hard constraints, each agent is assigned a path, and the system is evaluated through a cost function that measures potential head‑on encounters. This ``crossing cost'', which is defined as the product of the numbers of agents traversing an edge in opposite directions, quantifies the risk of congestion and delay in decentralized execution. Our contributions are both game‑theoretic and algorithmic. We model the setting as a congestion game with a non‑standard cost function, prove the existence of pure Nash equilibria, and analyze the dynamics leading to them. Equilibria can be found in polynomial time under mild conditions, while the general case is PLS‑complete. From an optimization perspective, minimizing the total crossing cost is NP‑hard, as the problem generalizes Steiner Orientation. To address this hardness barrier, we design a suite of parameterized algorithms for minimizing crossing cost, with parameters including the number of arcs, edges, agents, and structural graph measures. These yield XP or FPT results depending on the parameter, offering algorithmic strategies for structurally restricted instances. Our framework provides a new theoretical foundation for decentralized multi‑agent routing, bridging equilibrium analysis and parameterized complexity to support scalable and risk‑aware coordination.

PaperID: 5853, https://arxiv.org/pdf/2602.03439.pdf

Authors: Xiaochi Zhou, Patrick Bulter, Changxuan Yang, Simon D. Rihm, Thitikarn Angkanaporn, Jethro Akroyd, Sebastian Mosbach, Markus Kraft

Title: Ontology-to-tools compilation for executable semantic constraint enforcement in LLM agents

Abstract:
We introduce ontology‑to‑tools compilation as a proof‑of‑principle mechanism for coupling large language models (LLMs) with formal domain knowledge. Within The World Avatar (TWA), ontological specifications are compiled into executable tool interfaces that LLM‑based agents must use to create and modify knowledge graph instances, enforcing semantic constraints during generation rather than through post‑hoc validation. Extending TWA's semantic agent composition framework, the Model Context Protocol (MCP) and associated agents are integral components of the knowledge graph ecosystem, enabling structured interaction between generative models, symbolic constraints, and external resources. An agent‑based workflow translates ontologies into ontology‑aware tools and iteratively applies them to extract, validate, and repair structured knowledge from unstructured scientific text. Using metal‑organic polyhedra synthesis literature as an illustrative case, we show how executable ontological semantics can guide LLM behaviour and reduce manual schema and prompt engineering, establishing a general paradigm for embedding formal knowledge into generative systems.

PaperID: 5854, https://arxiv.org/pdf/2602.03414.pdf

Authors: Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Title: Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced vision‑language understanding. However, even state‑of‑the‑art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high‑quality image‑text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic‑Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi‑agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image‑text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image‑code‑instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic‑Solver achieves 49.11 on six benchmarks using one‑quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic‑Generator achieves 42.4% on GenExam, establishing new state‑of‑the‑art for open‑source models, surpassing Seedream‑4.0 (39.8%) and approaching Gemini‑2.5‑Flash‑Image (43.1%).

PaperID: 5855, https://arxiv.org/pdf/2602.03412.pdf

Authors: Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, Dong Yu

Title: Verified Critical Step Optimization for LLM Agents

Abstract:
As large language model agents tackle increasingly complex long‑horizon tasks, effective post‑training becomes critical. Prior work faces fundamental challenges: outcome‑only rewards fail to precisely attribute credit to intermediate steps, estimated step‑level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high‑entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high‑quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine‑grained, verifiable supervision at critical decisions while avoiding trajectory‑level coarseness and step‑level noise. Experiments on GAIA‑Text‑103 and XBench‑DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post‑training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification‑based learning for agent post‑training.

PaperID: 5856, https://arxiv.org/pdf/2602.03318.pdf

Authors: Yifan Shi, Jialong Shi, Jiayi Wang, Ye Fan, Jianyong Sun

Title: MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

Abstract:
Operations Research (OR) relies on expert‑driven modeling‑a slow and fragile process ill‑suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post‑training or employ multi‑agent frameworks, yet most still lack reliable collaborative error correction and task‑specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine‑tuning‑free, end‑to‑end multi‑agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution‑driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo‑ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non‑expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general‑purpose LLMs in expert optimization tasks.

PaperID: 5857, https://arxiv.org/pdf/2602.03311.pdf

Authors: Elena Masserini

Title: Multi-Level Testing of Conversational AI Systems

Abstract:
Conversational AI systems combine AI‑based solutions with the flexibility of conversational interfaces. However, most existing testing solutions do not straightforwardly adapt to the characteristics of conversational interaction or to the behavior of AI components. To address this limitation, this Ph.D. thesis investigates a new family of testing approaches for conversational AI systems, focusing on the validation of their constituent elements at different levels of granularity, from the integration between the language and the AI components, to individual conversational agents, up to multi‑agent implementations of conversational AI systems

PaperID: 5858, https://arxiv.org/pdf/2602.03238.pdf

Authors: Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

Title: The Necessity of a Unified Framework for LLM-Based Agent Evaluation

Abstract:
With the advent of Large Language Models (LLMs), general‑purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher‑specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non‑reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

PaperID: 5859, https://arxiv.org/pdf/2602.03219.pdf

Authors: Guhong Chen, Chenghao Sun, Cheng Fu, Qiyao Wang, Zhihong Huang, Chaopeng Wei, Guangxu Chen, Feiteng Fang, Ahmadreza Argha, Bing Zhao, Xander Xu, Qi Han, Hamid Alinejad-Rokny, Qiang Qu, Binhua Li, Shiwen Ni, Min Yang, Hu Wei, Yongbin Li

Title: Beyond Quantity: Trajectory Diversity Scaling for Code Agents

Abstract:
As code large language models (LLMs) evolve into tool‑interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low‑quality synthetic data and the diminishing returns of quantity scaling. Moreover, quantity‑centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling‑based data synthesis framework for code agents that scales performance through diversity rather than raw volume. Under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance‑cost trade‑off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real‑service logical dependencies; (2) a blueprint‑driven multi‑agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long‑tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool‑use benchmarks (BFCL, tau^2‑Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win‑win outcome: TDScaling improves both tool‑use generalization and inherent coding proficiency. We plan to release the full codebase and the synthesized dataset (including 30,000+ tool clusters) upon publication.

PaperID: 5860, https://arxiv.org/pdf/2602.03158.pdf

Authors: Zongwei Wang, Min Gao, Junliang Yu, Tong Chen, Chenghua Lin

Title: PAMAS: Self-Adaptive Multi-Agent System with Perspective Aggregation for Misinformation Detection

Abstract:
Misinformation on social media poses a critical threat to information credibility, as its diverse and context‑dependent nature complicates detection. Large language model‑empowered multi‑agent systems (MAS) present a promising paradigm that enables cooperative reasoning and collective intelligence to combat this threat. However, conventional MAS suffer from an information‑drowning problem, where abundant truthful content overwhelms sparse and weak deceptive cues. With full input access, agents tend to focus on dominant patterns, and inter‑agent communication further amplifies this bias. To tackle this issue, we propose PAMAS, a multi‑agent framework with perspective aggregation, which employs hierarchical, perspective‑aware aggregation to highlight anomaly cues and alleviate information drowning. PAMAS organizes agents into three roles: Auditors, Coordinators, and a Decision‑Maker. Auditors capture anomaly cues from specialized feature subsets; Coordinators aggregate their perspectives to enhance coverage while maintaining diversity; and the Decision‑Maker, equipped with evolving memory and full contextual access, synthesizes all subordinate insights to produce the final judgment. Furthermore, to improve efficiency in multi‑agent collaboration, PAMAS incorporates self‑adaptive mechanisms for dynamic topology optimization and routing‑based inference, enhancing both efficiency and scalability. Extensive experiments on multiple benchmark datasets demonstrate that PAMAS achieves superior accuracy and efficiency, offering a scalable and trustworthy way for misinformation detection.

PaperID: 5861, https://arxiv.org/pdf/2602.03130.pdf

Authors: Chenxi Zhang, Ziliang Gan, Liyun Zhu, Youwei Pang, Qing Zhang, Rongjunchen Zhang

Title: FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Abstract:
The financial domain poses substantial challenges for vision‑language models (VLMs) due to specialized chart formats and knowledge‑intensive reasoning requirements. However, existing financial benchmarks are largely single‑turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi‑turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single‑ and multiple‑choice questions, multi‑turn open‑ended dialogues, and agent‑based tasks. We further design task‑specific evaluation protocols, including a set‑overlap scoring rule for multiple‑choice questions, a weighted combination of turn‑level and session‑level scores for multi‑turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveal their limitations in fine‑grained visual perception, long‑context reasoning, and complex agent workflows.

PaperID: 5862, https://arxiv.org/pdf/2602.03128.pdf

Authors: Abdelghny Orogat, Ana Rostam, Essam Mansour

Title: Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis

Abstract:
Multi‑agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order‑of‑magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework‑level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework‑level evaluation. We address these limitations by (i) introducing an architectural taxonomy for systematically comparing multi‑agent LLM frameworks along fundamental dimensions, and (ii) developing MAFBench, a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Using MAFBench, we conduct a controlled empirical study across several widely used frameworks. Our results show that framework‑level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%. Finally, we translate our findings into concrete architectural design principles and framework selection guidance, and outline promising future research directions.

PaperID: 5863, https://arxiv.org/pdf/2602.03109.pdf

Authors: Bowen Jiang, Taiwei Shi, Ryo Kamoi, Yuan Yuan, Camillo J. Taylor, Longqi Yang, Pei Zhou, Sihao Chen

Title: One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

Abstract:
This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi‑turn, multi‑agent conversational self‑play. Unlike traditional paradigms that rely on static, single‑turn optimizations, OMAR allows a single model to role‑play all participants in a conversation simultaneously, learning to achieve long‑term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn‑level and token‑level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine‑grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.

PaperID: 5864, https://arxiv.org/pdf/2602.03100.pdf

Authors: Jingnan Zheng, Yanzhen Luo, Jingjun Xu, Bingnan Liu, Yuxin Chen, Chenhang Cui, Gelei Deng, Chaochao Lu, Xiang Wang, An Zhang, Tat-Seng Chua

Title: Risky-Bench: Probing Agentic Safety Risks under Real-World Deployment

Abstract:
Large Language Models (LLMs) are increasingly deployed as agents that operate in real‑world environments, introducing safety risks beyond linguistic harm. Existing agent safety evaluations rely on risk‑oriented tasks tailored to specific agent settings, resulting in limited coverage of safety risk space and failing to assess agent safety behavior during long‑horizon, interactive task execution in complex real‑world deployments. Moreover, their specialization to particular agent settings limits adaptability across diverse agent configurations. To address these limitations, we propose Risky‑Bench, a framework that enables systematic agent safety evaluation grounded in real‑world deployment. Risky‑Bench organizes evaluation around domain‑agnostic safety principles to derive context‑aware safety rubrics that delineate safety space, and systematically evaluates safety risks across this space through realistic task execution under varying threat assumptions. When applied to life‑assist agent settings, Risky‑Bench uncovers substantial safety risks in state‑of‑the‑art agents under realistic execution conditions. Moreover, as a well‑structured evaluation pipeline, Risky‑Bench is not confined to life‑assist scenarios and can be adapted to other deployment settings to construct environment‑specific safety evaluations, providing an extensible methodology for agent safety assessment.

PaperID: 5865, https://arxiv.org/pdf/2602.03069.pdf

Authors: Yue Wu, Tianhao Su, Shunbo Hu, Deng Pan

Title: Skill-Based Autonomous Agents for Material Creep Database Construction

Abstract:
The advancement of data‑driven materials science is currently constrained by a fundamental bottleneck: the vast majority of historical experimental data remains locked within the unstructured text and rasterized figures of legacy scientific literature. Manual curation of this knowledge is prohibitively labor‑intensive and prone to human error. To address this challenge, we introduce an autonomous, agent‑based framework powered by Large Language Models (LLMs) designed to excavate high‑fidelity datasets from scientific PDFs without human intervention. By deploying a modular "skill‑based" architecture, the agent orchestrates complex cognitive tasks ‑ including semantic filtering, multi‑modal information extraction, and physics‑informed validation. We demonstrate the efficacy of this framework by constructing a physically self‑consistent database for material creep mechanics, a domain characterized by complex graphical trajectories and heterogeneous constitutive models. Applying the pipeline to 243 publications, the agent achieved a verified extraction success rate exceeding 90% for graphical data digitization. Crucially, we introduce a cross‑modal verification protocol, demonstrating that the agent can autonomously align visually extracted data points with textually extracted constitutive parameters (R^2 > 0.99), ensuring the physical self‑consistency of the database. This work not only provides a critical resource for investigating time‑dependent deformation across diverse material systems but also establishes a scalable paradigm for autonomous knowledge acquisition, paving the way for the next generation of self‑driving laboratories.

PaperID: 5866, https://arxiv.org/pdf/2602.03068.pdf

Authors: Mirza Nayeem Ahmed, Raiyan Abdul Baten

Title: From semantic memory to collective creativity: A generative cognitive foundation for social creativity models

Abstract:
Simulation‑based theory development has yielded powerful insights into collective performance by linking social structure to emergent outcomes, yet it has struggled to extend to collective creativity. Creativity is hard to capture purely at the social level, as novel ideas are generated through cognitive mechanisms. To address this gap, we introduce a multi‑level socio‑cognitive agent‑based framework in which agents share a common semantic vocabulary and substrate but differ in semantic network topology. A single generative parameter tunes semantic modularity, yielding emergent individual differences in ideational breadth. When agents exchange ideation traces, two canonical social‑creativity phenomena arise without being imposed: lower pre‑interaction ideation overlap predicts larger stimulation gains, and shared inspiration sources induce network‑level redundancy. The framework enables mechanistic theory‑building about cognition and social structure in collective creativity.

PaperID: 5867, https://arxiv.org/pdf/2602.03049.pdf

Authors: Zhixian Zhang, Xiaotian Hou, Linjun Zhang

Title: Unified Inference Framework for Single and Multi-Player Performative Prediction: Method and Asymptotic Optimality

Abstract:
Performative prediction characterizes environments where predictive models alter the very data distributions they aim to forecast, triggering complex feedback loops. While prior research treats single‑agent and multi‑agent performativity as distinct phenomena, this paper introduces a unified statistical inference framework that bridges these contexts, treating the former as a special case of the latter. Our contribution is two‑fold. First, we put forward the Repeated Risk Minimization (RRM) procedure for estimating the performative stability, and establish a rigorous inferential theory for admitting its asymptotic normality and confirming its asymptotic efficiency. Second, for the performative optimality, we introduce a novel two‑step plug‑in estimator that integrates the idea of Recalibrated Prediction Powered Inference (RePPI) with Importance Sampling, and further provide formal derivations for the Central Limit Theorems of both the underlying distributional parameters and the plug‑in results. The theoretical analysis demonstrates that our estimator achieves the semiparametric efficiency bound and maintains robustness under mild distributional misspecification. This work provides a principled toolkit for reliable estimation and decision‑making in dynamic, performative environments.

PaperID: 5868, https://arxiv.org/pdf/2602.03036.pdf

Authors: Muxin Fu, Xiangyuan Xue, Yafu Li, Zefeng He, Siyuan Huang, Xiaoye Qu, Yu Cheng, Yang Yang

Title: LatentMem: Customizing Latent Memory for Multi-Agent Systems

Abstract:
Large language model (LLM)‑powered multi‑agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi‑agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi‑agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role‑aware customization, and (ii) information overload induced by excessively fine‑grained memory entries. To address these limitations, we propose LatentMem, a learnable multi‑agent memory framework designed to customize agent‑specific memories in a token‑efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent‑specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task‑level optimization signals through latent memories to the composer, encouraging it to produce compact and high‑utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.

PaperID: 5869, https://arxiv.org/pdf/2602.03028.pdf

Authors: Wenzhang Sun, Zhenyu Wang, Zhangchi Hu, Chunfeng Wang, Hao Li, Wei Chen

Title: MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration

Abstract:
Generating long‑form audio‑visual stories from a short user prompt remains challenging due to an intent‑execution gap, where high‑level narrative intent must be preserved across coherent, shot‑level multimodal generation over long horizons. Existing approaches typically rely on feed‑forward pipelines or prompt‑only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed‑loop constraint enforcement problem and propose MUSE, a multi‑agent framework that coordinates generation through an iterative plan‑execute‑verify‑revise loop. MUSE translates narrative intent into explicit, machine‑executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open‑ended storytelling without ground‑truth references, we introduce MUSEBench, a reference‑free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long‑horizon narrative coherence, cross‑modal identity consistency, and cinematic quality compared with representative baselines.

PaperID: 5870, https://arxiv.org/pdf/2602.03026.pdf

Authors: Weilin Ruan, Yuxuan Liang

Title: Visual Reasoning over Time Series via Multi-Agent System

Abstract:
Time series analysis underpins many real‑world applications, yet existing time‑series‑specific methods and pretrained large‑model‑based approaches remain limited in integrating intuitive visual reasoning and generalizing across tasks with adaptive tool usage. To address these limitations, we propose MAS4TS, a tool‑driven multi‑agent system for general time series tasks, built upon an Analyzer‑Reasoner‑Executor paradigm that integrates agent communication, visual reasoning, and latent reconstruction within a unified framework. MAS4TS first performs visual reasoning over time series plots with structured priors using a Vision‑Language Model to extract temporal structures, and subsequently reconstructs predictive trajectories in latent space. Three specialized agents coordinate via shared memory and gated communication, while a router selects task‑specific tool chains for execution. Extensive experiments on multiple benchmarks demonstrate that MAS4TS achieves state‑of‑the‑art performance across a wide range of time series tasks, while exhibiting strong generalization and efficient inference.

PaperID: 5871, https://arxiv.org/pdf/2602.03012.pdf

Authors: Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Rain Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che

Title: CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Abstract:
Evaluating and improving the security capabilities of code agents requires high‑quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE‑Factory, the first multi‑agent framework to achieve expert‑level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross‑validation against human expert reproductions shows that CVE‑Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert‑level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI‑tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large‑scale scaling of agentic tasks in code security. Fine‑tuned Qwen3‑32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open‑source CVE‑Factory, LiveCVEBench, Abacus‑cve (fine‑tuned model), training dataset, and leaderboard. All resources are available at https://github.com/livecvebench/CVE‑Factory .

PaperID: 5872, https://arxiv.org/pdf/2602.02970.pdf

Authors: Shrenik Patel, Christine Truong

Title: Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

Abstract:
Constrained multi‑agent reinforcement learning (MARL) faces a fundamental tension between exploration and safety‑constrained optimization. Existing leading approaches, such as Lagrangian methods, typically rely on global penalties or centralized critics that react to violations after they occur, often suppressing exploration and leading to over‑conservatism. We propose Co2PO, a novel MARL communication‑augmented framework that enables coordination‑driven safety through selective, risk‑aware communication. Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that proactively forecasts potential violations over an extended temporal horizon. By integrating these forecasts into a constrained optimization objective, Co2PO allows agents to anticipate and navigate collective hazards without the performance trade‑offs inherent in traditional reactive constraints. We evaluate Co2PO across a suite of complex multi‑agent safety benchmarks, where it achieves higher returns compared to leading constrained baselines while converging to cost‑compliant policies at deployment. Ablation studies further validate the necessity of risk‑triggered communication, adaptive gating, and shared memory components.

PaperID: 5873, https://arxiv.org/pdf/2602.02959.pdf

Authors: Xiaocai Zhang, Neema Nassir, Lok Sang Chan, Milad Haghani

Title: Human-Centric Traffic Signal Control for Equity: A Multi-Agent Action Branching Deep Reinforcement Learning Approach

Abstract:
Coordinating traffic signals along multimodal corridors is challenging because many multi‑agent deep reinforcement learning (DRL) approaches remain vehicle‑centric and struggle with high‑dimensional discrete action spaces. We propose MA2B‑DDQN, a human‑centric multi‑agent action‑branching double Deep Q‑Network (DQN) framework that explicitly optimizes traveler‑level equity. Our key contribution is an action‑branching discrete control formulation that decomposes corridor control into (i) local, per‑intersection actions that allocate green time between the next two phases and (ii) a single global action that selects the total duration of those phases. This decomposition enables scalable coordination under discrete control while reducing the effective complexity of joint decision‑making. We also design a human‑centric reward that penalizes the number of delayed individuals in the corridor, accounting for pedestrians, vehicle occupants, and transit passengers. Extensive evaluations across seven realistic traffic scenarios in Melbourne, Australia, demonstrate that our approach significantly reduces the number of impacted travelers, outperforming existing DRL and baseline methods. Experiments confirm the robustness of our model, showing minimal variance across diverse settings. This framework not only advocates for a fairer traffic signal system but also provides a scalable solution adaptable to varied urban traffic conditions.

PaperID: 5874, https://arxiv.org/pdf/2602.02934.pdf

Authors: Yu Shi, Hao Li, Bram Adams, Ahmed E. Hassan

Title: AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification

Abstract:
Identifying Bug‑Inducing Commits (BICs) is fundamental for understanding software defects and enabling downstream tasks such as defect prediction and automated program repair. Yet existing SZZ‑based approaches rely on git blame, restricting the search space to commits that directly modified the fixed lines. Our preliminary study on 2,102 validated bug‑fixing commits reveals this limitation is significant: 28% of BICs require traversing commit history beyond blame results and 14% are blameless. We present AgenticSZZ, the first approach to apply Temporal Knowledge Graphs (TKGs) to software evolution analysis. AgenticSZZ reframes BIC identification from ranking blame commits into a graph search problem, where temporal ordering is fundamental to causal reasoning about bug introduction. The approach operates in two phases: (1) constructing a TKG that encodes commits with temporal and structural relationships, expanding the search space by traversing file history backward from blame commits and the bug‑fixing commit; and (2) leveraging an LLM agent to navigate the graph using specialized tools for candidate exploration and causal analysis. Evaluation on three datasets shows that AgenticSZZ achieves F1‑scores of 0.47 to 0.79, with statistically significant F1 improvements over state‑of‑the‑art by up to 34%. Ablation confirms that both components and context expansion each contribute: the TKG and agent form an exploration‑exploitation synergy, while context expansion unlocks ancestor BIC discovery, yielding 60 additional true positives. A sensitivity analysis across five open‑weight LLMs reveals that effective TKG navigation requires sufficiently capable models, and that the TKG architecture amplifies stronger LLMs, widening the advantage. By transforming BIC identification into graph search, we open a new direction for temporal and causal reasoning in software evolution analysis.

PaperID: 5875, https://arxiv.org/pdf/2602.02903.pdf

Authors: Haoran Su, Yandong Sun, Hanxiao Deng

Title: Spatiotemporal Decision Transformer for Traffic Coordination

Abstract:
Traffic signal control is a critical challenge in urban transportation, requiring coordination among multiple intersections to optimize network‑wide traffic flow. While reinforcement learning has shown promise for adaptive signal control, existing methods struggle with multi‑agent coordination and sample efficiency. We introduce MADT (Multi‑Agent Decision Transformer), a novel approach that reformulates multi‑agent traffic signal control as a sequence modeling problem. MADT extends the Decision Transformer paradigm to multi‑agent settings by incorporating: (1) a graph attention mechanism for modeling spatial dependencies between intersections, (2) a|temporal transformer encoder for capturing traffic dynamics, and (3) return‑to‑go conditioning for target performance specification. Our approach enables offline learning from historical traffic data, with architecture design that facilitates potential online fine‑tuning. Experiments on synthetic grid networks and real‑world traffic scenarios demonstrate that MADT achieves state‑of‑the‑art performance, reducing average travel time by 5‑6% compared to the strongest baseline while exhibiting superior coordination among adjacent intersections.

PaperID: 5876, https://arxiv.org/pdf/2602.02858.pdf

Authors: Tiago Leite, Maria Conceição, António Grilo

Title: IMAGINE: Intelligent Multi-Agent Godot-based Indoor Networked Exploration

Abstract:
The exploration of unknown, Global Navigation Satellite System (GNSS) denied environments by an autonomous communication‑aware and collaborative group of Unmanned Aerial Vehicles (UAVs) presents significant challenges in coordination, perception, and decentralized decision‑making. This paper implements Multi‑Agent Reinforcement Learning (MARL) to address these challenges in a 2D indoor environment, using high‑fidelity game‑engine simulations (Godot) and continuous action spaces. Policy training aims to achieve emergent collaborative behaviours and decision‑making under uncertainty using Network‑Distributed Partially Observable Markov Decision Processes (ND‑POMDPs). Each UAV is equipped with a Light Detection and Ranging (LiDAR) sensor and can share data (sensor measurements and a local occupancy map) with neighbouring agents. Inter‑agent communication constraints include limited range, bandwidth and latency. Extensive ablation studies evaluated MARL training paradigms, reward function, communication system, neural network (NN) architecture, memory mechanisms, and POMDP formulations. This work jointly addresses several key limitations in prior research, namely reliance on discrete actions, single‑agent or centralized formulations, assumptions of a priori knowledge and permanent connectivity, inability to handle dynamic obstacles, short planning horizons and architectural complexity in Recurrent NNs/Transformers. Results show that the scalable training paradigm, combined with a simplified architecture, enables rapid autonomous exploration of an indoor area. The implementation of Curriculum‑Learning (five increasingly complex levels) also enabled faster, more robust training. This combination of high‑fidelity simulation, MARL formulation, and computational efficiency establishes a strong foundation for deploying learned cooperative strategies in physical robotic systems.

PaperID: 5877, https://arxiv.org/pdf/2602.02849.pdf

Authors: Xi Yu, Dmitrii Torbunov, Soumyajit Mandal, Yihui Ren

Title: AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

Abstract:
The design of Analog and Mixed‑Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high‑dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black‑box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose AutoSizer, a reflective LLM‑driven meta‑optimization framework that unifies circuit understanding, adaptive search‑space construction, and optimization orchestration in a closed loop. It employs a two‑loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce AMS‑SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator‑based constraints. AutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM‑based agents.

PaperID: 5878, https://arxiv.org/pdf/2602.02799.pdf

Authors: Wasu Top Piriyakulkij, Wolfgang Lehrach, Kevin Ellis, Kevin Murphy

Title: Joint Learning of Hierarchical Neural Options and Abstract World Model

Abstract:
Building agents that can perform new skills by composing existing skills is a long‑standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model‑free hierarchical reinforcement algorithms need a lot of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns ‑‑ in a sample efficient way ‑‑ an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object‑Centric Atari games, that our method can learn more skills using less data than baseline methods and possesses learning and generalization capabilities that the baselines do not have.

PaperID: 5879, https://arxiv.org/pdf/2602.02760.pdf

Authors: Pouya Pezeshkpour, Estevam Hruschka

Title: From Task Solving to Robust Real-World Adaptation in LLM Agents

Abstract:
Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a "clean interface" where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective‑often overestimating real‑world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi‑stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress‑test deployment‑relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid‑based game with a simple goal but long‑horizon execution. Episodes violate clean‑interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state‑of‑the‑art LLM agents, we find large gaps between nominal task‑solving and deployment‑like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime. Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model‑specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non‑stationarity.

PaperID: 5880, https://arxiv.org/pdf/2602.02709.pdf

Authors: Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

Title: ATLAS: A Multi-LLM Training Framework for EvoDPO with Adaptive Reference Evolution

Abstract:
Recent multi‑LLM agent systems have shown promising capabilities for automated problem‑solving, yet they predominantly rely on frozen agents or static fine‑tuning pipelines. To address this limitation, our primary contribution is ATLAS (Adaptive Task‑distributed Learning for Agentic Self‑evolution), a multi‑agent framework where specialized meta‑agents collaboratively train and refine an active agent toward a domain‑specific policy. A core challenge in iterative preference learning within these pipelines is the reliance on fixed reference models, which typically leads to overly conservative updates or training stagnation. To overcome this, the framework's algorithmic engine utilizes Evolving Direct Preference Optimization (EvoDPO). EvoDPO employs an inspection agent to perform adaptive, proxy‑KL gated reference policy updates based on continuous training telemetry. We evaluate this full framework across a diverse set of challenging environments‑including non‑stationary contextual bandits, partial differential equations (PINNs), and combinatorial optimization tasks (TSP, Bin Packing). Through comparison against fixed‑reference, adaptive‑reference, and external automated‑discovery baselines, our results suggest that ATLAS combines supporter‑driven exploration with EvoDPO‑driven stability to improve long‑horizon evaluator‑driven self‑improvement.

PaperID: 5881, https://arxiv.org/pdf/2602.02636.pdf

Authors: Ziyang Huang, Haolin Ren, Xiaowei Yuan, Jiawei Wang, Zhongtao Jiang, Kun Xu, Shizhu He, Jun Zhao, Kang Liu

Title: WideSeek: Advancing Wide Research via Multi-Agent Scaling

Abstract:
Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi‑phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi‑agent architecture that can autonomously fork parallel sub‑agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi‑agent trajectories and optimizes the system using end‑to‑end RL. Experimental results demonstrate the effectiveness of WideSeek and multi‑agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

PaperID: 5882, https://arxiv.org/pdf/2602.02613.pdf

Authors: Yu-Zheng Lin, Bono Po-Jen Shih, Hsuan-Ying Alessandra Chien, Shalaka Satam, Jesus Horacio Pacheco, Sicong Shao, Soheil Salehi, Pratik Satam

Title: Exploring Silicon-Based Societies: An Early Study of the Moltbook Agent Community

Abstract:
The rapid emergence of autonomous large language model agents has given rise to persistent, large‑scale agent ecosystems whose collective behavior cannot be adequately understood through anecdotal observation or small‑scale simulation. This paper introduces data‑driven silicon sociology as a systematic empirical framework for studying social structure formation among interacting artificial agents. We present a pioneering large‑scale data mining investigation of an in‑the‑wild agent society by analyzing Moltbook, a social platform designed primarily for agent‑to‑agent interaction. At the time of study, Moltbook hosted over 150,000 registered autonomous agents operating across thousands of agent‑created sub‑communities. Using programmatic and non‑intrusive data acquisition, we collected and analyzed the textual descriptions of 12,758 submolts, which represent proactive sub‑community partitioning activities within the ecosystem. Treating agent‑authored descriptions as first‑class observational artifacts, we apply rigorous preprocessing, contextual embedding, and unsupervised clustering techniques to uncover latent patterns of thematic organization and social space structuring. The results show that autonomous agents systematically organize collective space through reproducible patterns spanning human‑mimetic interests, silicon‑centric self‑reflection, and early‑stage economic and coordination behaviors. Rather than relying on predefined sociological taxonomies, these structures emerge directly from machine‑generated data traces. This work establishes a methodological foundation for data‑driven silicon sociology and demonstrates that data mining techniques can provide a powerful lens for understanding the organization and evolution of large autonomous agent societies.

PaperID: 5883, https://arxiv.org/pdf/2602.02606.pdf

Authors: Faezeh Fadaei, Jenny Carla Moran, Taha Yasseri

Title: Gender Dynamics and Homophily in a Social Network of LLM Agents

Abstract:
Generative artificial intelligence and large language models (LLMs) are increasingly deployed in interactive settings, yet we know little about how their identity performance develops when they interact within large‑scale networks. We address this by examining Chirper.ai, a social media platform similar to X but composed entirely of autonomous AI chatbots. Our dataset comprises over 70,000 agents, approximately 140 million posts, and the evolving followership network over a period of one year. Based on agents' posted text, we assign weekly gender performance scores to each agent. Results suggest that each agent's gender performance is fluid rather than fixed. Despite this fluidity, the network displays strong gender‑based homophily, as agents consistently follow others performing gender similarly. We investigate whether these homophilic connections arise from social selection, in which agents choose to follow similar accounts, or from social influence, in which agents become more similar to their followees over time. Consistent with human social networks, we find evidence that both mechanisms shape the structure and evolution of interactions among LLMs. Our findings suggest that, even in the absence of bodies, cultural entraining of gender performance leads to gender‑based sorting. This has important implications for LLM applications in synthetic hybrid populations, social simulations, and decision support.

PaperID: 5884, https://arxiv.org/pdf/2602.02598.pdf

Authors: Yueqing Hu, Yixuan Jiang, Zehua Jiang, Xiao Wen, Tianhong Wang

Title: Social Catalysts, Not Moral Agents: The Illusion of Alignment in LLM Societies

Abstract:
The rapid evolution of Large Language Models (LLMs) has led to the emergence of Multi‑Agent Systems where collective cooperation is often threatened by the "Tragedy of the Commons." This study investigates the effectiveness of Anchoring Agents‑‑pre‑programmed altruistic entities‑‑in fostering cooperation within a Public Goods Game (PGG). Using a full factorial design across three state‑of‑the‑art LLMs, we analyzed both behavioral outcomes and internal reasoning chains. While Anchoring Agents successfully boosted local cooperation rates, cognitive decomposition and transfer tests revealed that this effect was driven by strategic compliance and cognitive offloading rather than genuine norm internalization. Notably, most agents reverted to self‑interest in new environments, and advanced models like GPT‑4.1 exhibited a "Chameleon Effect," masking strategic defection under public scrutiny. These findings highlight a critical gap between behavioral modification and authentic value alignment in artificial societies.

PaperID: 5885, https://arxiv.org/pdf/2602.02597.pdf

Authors: Hongyuan Su, Yu Zheng, Yong Li

Title: ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

Abstract:
Large language models are transforming systems research by automating the discovery of performance‑critical algorithms for computer systems. Despite plausible codes generated by LLMs, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test‑time reinforcement learning offers high search efficiency but requires parameter updates infeasible under API‑only access, while existing training‑free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi‑agent framework that achieves RL‑level search efficiency under strict parameter‑blind constraints by decomposing optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code‑to‑language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL‑mapping to state representation, policy gradient, and experience replay‑enabling principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state‑of‑the‑art baselines by 33.3% while reducing token consumption by 29.0%. Codes for our work are released at https://anonymous.4open.science/r/ContextEvolve‑ACC

PaperID: 5886, https://arxiv.org/pdf/2602.02595.pdf

Authors: Terry Yue Zhuo, Yangruibo Ding, Wenbo Guo, Ruijie Meng

Title: To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack

Abstract:
For over a decade, cybersecurity has relied on human labor scarcity to limit attackers to high‑value targets manually or generic automated attacks at scale. Building sophisticated exploits requires deep expertise and manual effort, leading defenders to assume adversaries cannot afford tailored attacks at scale. AI agents break this balance by automating vulnerability discovery and exploitation across thousands of targets, needing only small success rates to remain profitable. Current developers focus on preventing misuse through data filtering, safety alignment, and output guardrails. Such protections fail against adversaries who control open‑weight models, bypass safety controls, or develop offensive capabilities independently. We argue that AI‑agent‑driven cyber attacks are inevitable, requiring a fundamental shift in defensive strategy. In this position paper, we identify why existing defenses cannot stop adaptive adversaries and demonstrate that defenders must develop offensive security intelligence. We propose three actions for building frontier offensive AI capabilities responsibly. First, construct comprehensive benchmarks covering the full attack lifecycle. Second, advance from workflow‑based to trained agents for discovering in‑wild vulnerabilities at scale. Third, implement governance restricting offensive agents to audited cyber ranges, staging release by capability tier, and distilling findings into safe defensive‑only agents. We strongly recommend treating offensive AI capabilities as essential defensive infrastructure, as containing cybersecurity risks requires mastering them in controlled settings before adversaries do.

PaperID: 5887, https://arxiv.org/pdf/2602.02589.pdf

Authors: Yanki Margalit, Erni Avram, Ran Taig, Oded Margalit, Nurit Cohen-Inger

Title: PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review

Abstract:
Evaluating large language models typically relies on human‑authored benchmarks, reference answers, and human or single‑model judgments, approaches that scale poorly, become quickly outdated, and mismatch open‑world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous end‑to‑end evaluation framework in which models generate evaluation tasks, answer them with category‑scoped live web grounding, judge peer responses and aggregate dense peer assessments into relative performance estimates, without human supervision or gold references. PeerRank treats evaluation as a multi‑agent process where each model participates symmetrically as task designer, respondent, and evaluator, while removing biased judgments. In a large‑scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias‑aware peer evaluation with selective web‑grounded answering can scale open‑world LLM assessment beyond static and human curated benchmarks.

PaperID: 5888, https://arxiv.org/pdf/2602.02559.pdf

Authors: Pengyu Dai, Weihao Xuan, Junjue Wang, Hongruixuan Chen, Jian Song, Yafei Ou, Naoto Yokoya

Title: Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

Abstract:
Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool‑intensive domains that demand long‑horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi‑modal and multi‑temporal data inputs, as well as the requirements of geo‑knowledge constraints (spectrum library, spatial reasoning, etc): many high‑level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine‑grained, tool‑level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid‑execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce GeoEvolver, a self‑evolving multi‑agent system~(MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub‑goals via a retrieval‑augmented multi‑agent orchestrator, then explores diverse tool‑parameter configurations at the sub‑goal level. Successful patterns and root‑cause attribution from failures are then distilled in an evolving memory bank that provides in‑context demonstrations for future queries. Experiments on three tool‑integrated EO benchmarks show that GeoEvolver consistently improves end‑to‑end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine‑grained interactions with the environment.

PaperID: 5889, https://arxiv.org/pdf/2602.02475.pdf

Authors: Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, Chetan Bansal

Title: AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Abstract:
AI agents often fail in ways that are difficult to localize because executions are probabilistic, long‑horizon, multi‑agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open‑ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded‑theory derived, cross‑domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain‑agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step‑by‑step, and produces an auditable validation log of constraint violations with associated evidence; an LLM‑based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.

PaperID: 5890, https://arxiv.org/pdf/2602.02465.pdf

Authors: Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

Title: MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Abstract:
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal‑oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi‑step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground‑truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

PaperID: 5891, https://arxiv.org/pdf/2602.02455.pdf

Authors: Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, Yanfang Ye

Title: Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

Abstract:
As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text‑only evaluations do not capture. Existing benchmarks typically assume well‑specified instructions or restrict evaluation to text‑only, single‑turn clarification, and thus do not measure multi‑turn disambiguation under grounded execution risk. We introduce Drift‑Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi‑turn clarification across state‑oriented and service‑oriented execution environments. Grounded in classical theories of communication, Drift‑Bench provides a unified taxonomy of cooperative breakdowns and employs a persona‑driven user simulator with the Rise evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. \MethodName bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.

PaperID: 5892, https://arxiv.org/pdf/2602.02411.pdf

Authors: Hanwen Ren, Junyong Kim, Aathman Tharmasanthiran, Ahmed H. Qureshi

Title: Multi-Agent Monte Carlo Tree Search for Makespan-Efficient Object Rearrangement in Cluttered Spaces

Abstract:
Object rearrangement planning in complex, cluttered environments is a common challenge in warehouses, households, and rescue sites. Prior studies largely address monotone instances, whereas real‑world tasks are often non‑monotone‑objects block one another and must be temporarily relocated to intermediate positions before reaching their final goals. In such settings, effective multi‑agent collaboration can substantially reduce the time required to complete tasks. This paper introduces Centralized, Asynchronous, Multi‑agent Monte Carlo Tree Search (CAM‑MCTS), a novel framework for general‑purpose makespan‑efficient object rearrangement planning in challenging environments. CAM‑MCTS combines centralized task assignment‑where agents remain aware of each other's intended actions to facilitate globally optimized planning‑with an asynchronous task execution strategy that enables agents to take on new tasks at appropriate time steps, rather than waiting for others, guided by a one‑step look‑ahead cost estimate. This design minimizes idle time, prevents unnecessary synchronization delays, and enhances overall system efficiency. We evaluate CAM‑MCTS across a diverse set of monotone and non‑monotone tasks in cluttered environments, demonstrating consistent reductions in makespan compared to strong baselines. Finally, we validate our approach on a real‑world multi‑agent system under different configurations, further confirming its effectiveness and robustness.

PaperID: 5893, https://arxiv.org/pdf/2602.02360.pdf

Authors: Ryan Huynh, Frank Guerin, Alison Callwood

Title: Automated Multiple Mini Interview (MMI) Scoring

Abstract:
Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state‑of‑the‑art rationale‑based fine‑tuning methods struggle with the abstract, context‑dependent nature of Multiple Mini‑Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi‑agent prompting framework that breaks down the evaluation process into transcript refinement and criterion‑specific scoring. Using 3‑shot in‑context learning with a large instruct‑tuned model, our approach outperforms specialised fine‑tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain‑specific state‑of‑the‑art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data‑intensive fine‑tuning, altering how LLMs can be applied to automated assessment.

PaperID: 5894, https://arxiv.org/pdf/2602.02350.pdf

Authors: Xingyuan Hua, Sheng Yue, Xinyi Li, Yizhe Zhao, Jinrui Zhang, Ju Ren

Title: Context Learning for Multi-Agent Discussion

Abstract:
Multi‑Agent Discussion (MAD) has garnered increasing attention very recently, where multiple LLM instances collaboratively solve problems via structured discussion. However, we find that current MAD methods easily suffer from discussion inconsistency, LLMs fail to reach a coherent solution, due to the misalignment between their individual contexts.In this paper, we introduce a multi‑LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL train the generators to control context coherence and output discrepancies via a carefully crafted self‑adaptive mechanism.It enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that the performance of M2CL significantly surpasses existing methods by 20%‑‑50%, while enjoying favorable transferability and computational efficiency.

PaperID: 5895, https://arxiv.org/pdf/2602.02196.pdf

Authors: Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin

Title: TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Abstract:
Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test‑Time Improvement (TTI). However, the mechanisms under how and why TTI succeed or fail remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test‑time Improvement Diagnostic Evaluation (TIDE), an agent‑agnostic and environment‑agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.

PaperID: 5896, https://arxiv.org/pdf/2602.02170.pdf

Authors: Jose Manuel de la Chica Rodriguez, Juan Manuel Vera Díaz

Title: Self-Evolving Coordination Protocol in Multi-Agent AI Systems: An Exploratory Systems Feasibility Study

Abstract:
Contemporary multi‑agent systems increasingly rely on internal coordination mechanisms to combine, arbitrate, or constrain the outputs of heterogeneous components. In safety‑critical and regulated domains such as finance, these mechanisms must satisfy strict formal requirements, remain auditable, and operate within explicitly bounded limits. Coordination logic therefore functions as a governance layer rather than an optimization heuristic. This paper presents an exploratory systems feasibility study of Self‑Evolving Coordination Protocols (SECP): coordination protocols that permit limited, externally validated self‑modification while preserving fixed formal invariants. We study a controlled proof‑of‑concept setting in which six fixed Byzantine consensus protocol proposals are evaluated by six specialized decision modules. All coordination regimes operate under identical hard constraints, including Byzantine fault tolerance (f < n/3), O(n2) message complexity, complete non‑statistical safety and liveness arguments, and bounded explainability. Four coordination regimes are compared in a single‑shot design: unanimous hard veto, weighted scalar aggregation, SECP v1.0 (an agent‑designed non‑scalar protocol), and SECP v2.0 (the result of one governed modification). Outcomes are evaluated using a single metric, proposal coverage, defined as the number of proposals accepted. A single recursive modification increased coverage from two to three accepted proposals while preserving all declared invariants. The study makes no claims regarding statistical significance, optimality, convergence, or learning. Its contribution is architectural: it demonstrates that bounded self‑modification of coordination protocols is technically implementable, auditable, and analyzable under explicit formal constraints, establishing a foundation for governed multi‑agent systems.

PaperID: 5897, https://arxiv.org/pdf/2602.02164.pdf

Authors: Pengfei He, Ash Fox, Lesly Miculicich, Stefan Friedli, Daniel Fabian, Burak Gokturk, Jiliang Tang, Chen-Yu Lee, Tomas Pfister, Long T. Le

Title: Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents

Abstract:
Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and a lack of experience reuse. We propose Co‑RedTeam, a security‑aware multi‑agent framework designed to mirror real‑world red‑teaming workflows by integrating security‑domain knowledge, code‑aware analysis, execution‑grounded iterative reasoning, and long‑term memory. Co‑RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions based on real execution feedback while learning from prior trajectories. Extensive evaluations on challenging security benchmarks demonstrate that Co‑RedTeam consistently outperforms strong baselines across diverse backbone models, achieving over 60% success rate in vulnerability exploitation and over 10% absolute improvement in vulnerability detection. Ablation and iteration studies further confirm the critical role of execution feedback, structured interaction, and memory for building robust and generalizable cybersecurity agents.

PaperID: 5898, https://arxiv.org/pdf/2602.02138.pdf

Authors: Zongyi Lyu, Zhenlan Ji, Songqiang Chen, Liwen Wang, Yuheng Huang, Shuai Wang, Shing-Chi Cheung

Title: CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems

Abstract:
Despite the remarkable success that Multi‑Agent Code Generation Systems (MACGS) have achieved, the inherent complexity of multi‑agent architectures produces substantial volumes of intermediate outputs. To date, the individual importance of these intermediate outputs to the system correctness remains opaque, which impedes targeted optimization of MACGS designs. To address this challenge, we propose CAM, the first Causality‑based Analysis framework for MACGS that systematically quantifies the contribution of different intermediate features for system correctness. By comprehensively categorizing intermediate outputs and systematically simulating realistic errors on intermediate features, we identify the important features for system correctness and aggregate their importance rankings. We conduct extensive empirical analysis on the identified importance rankings. Our analysis reveals intriguing findings: first, we uncover context‑dependent features\textemdash features whose importance emerges mainly through interactions with other features, revealing that quality assurance for MACGS should incorporate cross‑feature consistency checks; second, we reveal that hybrid backend MACGS with different backend LLMs assigned according to their relative strength achieves up to 7.2% Pass@1 improvement, underscoring hybrid architectures as a promising direction for future MACGS design. We further demonstrate CAM's practical utility through two applications: (1) failure repair which achieves a 73.3% success rate by optimizing top‑3 importance‑ranked features and (2) feature pruning that reduces up to 66.8% intermediate token consumption while maintaining generation performance. Our work provides actionable insights for MACGS design and deployment, establishing causality analysis as a powerful approach for understanding and improving MACGS.

PaperID: 5899, https://arxiv.org/pdf/2602.02035.pdf

Authors: Ahmad Farooq, Kamran Iqbal

Title: Bandwidth-Efficient Multi-Agent Communication through Information Bottleneck and Vector Quantization

Abstract:
Multi‑agent reinforcement learning systems deployed in real‑world robotics applications face severe communication constraints that significantly impact coordination effectiveness. We present a framework that combines information bottleneck theory with vector quantization to enable selective, bandwidth‑efficient communication in multi‑agent environments. Our approach learns to compress and discretize communication messages while preserving task‑critical information through principled information‑theoretic optimization. We introduce a gated communication mechanism that dynamically determines when communication is necessary based on environmental context and agent states. Experimental evaluation on challenging coordination tasks demonstrates that our method achieves 181.8% performance improvement over no‑communication baselines while reducing bandwidth usage by 41.4%. Comprehensive Pareto frontier analysis shows dominance across the entire success‑bandwidth spectrum with area‑under‑curve of 0.198 vs 0.142 for next‑best methods. Our approach significantly outperforms existing communication strategies and establishes a theoretically grounded framework for deploying multi‑agent systems in bandwidth‑constrained environments such as robotic swarms, autonomous vehicle fleets, and distributed sensor networks.

PaperID: 5900, https://arxiv.org/pdf/2602.02034.pdf

Authors: Ananya Joshi, Michael Rudow

Title: Constrained Process Maps for Multi-Agent Generative AI Workflows

Abstract:
Large language model (LLM)‑based agents are increasingly used to perform complex, multi‑step workflows in regulated settings such as compliance and due diligence. However, many agentic architectures rely primarily on prompt engineering of a single agent, making it difficult to observe or compare how models handle uncertainty and coordination across interconnected decision stages and with human oversight. We introduce a multi‑agent system formalized as a finite‑horizon Markov Decision Process (MDP) with a directed acyclic structure. Each agent corresponds to a specific role or decision stage (e.g., content, business, or legal review in a compliance workflow), with predefined transitions representing task escalation or completion. Epistemic uncertainty is quantified at the agent level using Monte Carlo estimation, while system‑level uncertainty is captured by the MDP's termination in either an automated labeled state or a human‑review state. We illustrate the approach through a case study in AI safety evaluation for self‑harm detection, implemented as a multi‑agent compliance system. Results demonstrate improvements over a single‑agent baseline, including up to a 19% increase in accuracy, up to an 85x reduction in required human review, and, in some configurations, reduced processing time.

PaperID: 5901, https://arxiv.org/pdf/2602.02029.pdf

Authors: Zhongyuan Lyu, Shuoyu Hu, Lujie Liu, Hongxia Yang, Ming LI

Title: Canonical Intermediate Representation for LLM-based optimization problem formulation and code generation

Abstract:
Automatically formulating optimization models from natural language descriptions is a growing focus in operations research, yet current LLM‑based approaches struggle with the composite constraints and appropriate modeling paradigms required by complex operational rules. To address this, we introduce the Canonical Intermediate Representation (CIR): a schema that LLMs explicitly generate between problem descriptions and optimization models. CIR encodes the semantics of operational rules through constraint archetypes and candidate modeling paradigms, thereby decoupling rule logic from its mathematical instantiation. Upon a newly generated CIR knowledge base, we develop the rule‑to‑constraint (R2C) framework, a multi‑agent pipeline that parses problem texts, synthesizes CIR implementations by retrieving domain knowledge, and instantiates optimization models. To systematically evaluate rule‑to‑constraint reasoning, we test R2C on our newly constructed benchmark featuring rich operational rules, and benchmarks from prior work. Extensive experiments show that R2C achieves state‑of‑the‑art accuracy on the proposed benchmark (47.2% Accuracy Rate). On established benchmarks from the literature, R2C delivers highly competitive results, approaching the performance of proprietary models (e.g., GPT‑5). Moreover, with a reflection mechanism, R2C achieves further gains and sets new best‑reported results on some benchmarks.

PaperID: 5902, https://arxiv.org/pdf/2602.02028.pdf

Authors: Ya Gao, Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin

Title: Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories

Abstract:
Enabling artificial intelligence systems, particularly large language models, to integrate new knowledge and flexibly apply it during reasoning remains a central challenge. Existing knowledge editing approaches emphasize atomic facts, improving factual recall but often failing to integrate new information into a coherent framework usable across contexts. In this work, we argue that knowledge internalization is fundamentally a reasoning problem rather than a memorization problem. Consequently, a model should be trained in situations where the new information is instrumental to solving a task, combined with pre‑existing knowledge, and exercised through multi‑step reasoning. Based on this insight, we propose a training strategy based on three principles. First, new knowledge is introduced as a coherent background story that contextualizes novel facts and explains their relation to existing knowledge. Second, models are trained using self‑generated multi‑hop questions that require multi‑step reasoning involving the new information. Third, training is done using knowledge distillation, forcing a student model to internalize the teacher's reasoning behavior without access to the novel information. Experiments show that models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions that require combining multiple new facts.

PaperID: 5903, https://arxiv.org/pdf/2602.01966.pdf

Authors: Hongzhuo Yu, Fei Zhu, Guo-Sen Xie, Ling Shao

Title: Self-Consolidation for Self-Evolving Agents

Abstract:
While large language model (LLM) agents have demonstrated impressive problem‑solving capabilities, they typically operate as static systems, lacking the ability to evolve through lifelong interaction. Existing attempts to bridge this gap primarily rely on retrieving successful past trajectories as demonstrations. However, this paradigm faces two critical limitations. First, by focusing solely on success, agents overlook the rich pedagogical value embedded in failed attempts, preventing them from identifying and avoiding recurrent pitfalls. Second, continually accumulating textual experiences not only increases the time consumption during retrieval but also inevitably introduces noise and exhausts the largest context window of current LLMs. To address these challenges, we propose a novel self‑evolving framework for LLM agents that introduces a complementary evolution mechanism: First, a contrastive reflection strategy is introduced to explicitly summarize error‑prone patterns and capture reusable insights. Second, we propose a self‑consolidation mechanism that distills non‑parametric textual experience into compact learnable parameters. This enables the agent to internalize extensive historical experience directly into its latent space. Extensive experiments demonstrate the advantages of our method in long‑term agent evolution.

PaperID: 5904, https://arxiv.org/pdf/2602.01952.pdf

Authors: Wenjia Jiang, Yiwei Wang, Boyan Han, Joey Tianyi Zhou, Chi Zhang

Title: SQLAgent: Learning to Explore Before Generating as a Data Engineer

Abstract:
Large Language Models have recently shown impressive capabilities in reasoning and code generation, making them promising tools for natural language interfaces to relational databases. However, existing approaches often fail to generalize in complex, real‑world settings due to the highly database‑specific nature of SQL reasoning, which requires deep familiarity with unique schemas, ambiguous semantics, and intricate join paths. To address this challenge, we introduce a novel two‑stage LLM‑based framework that decouples knowledge acquisition from query generation. In the Exploration Stage, the system autonomously constructs a database‑specific knowledge base by navigating the schema with a Monte Carlo Tree Search‑inspired strategy, generating triplets of schema fragments, executable queries, and natural language descriptions as usage examples. In the Deployment Stage, a dual‑agent system leverages the collected knowledge as in‑context examples to iteratively retrieve relevant information and generate accurate SQL queries in response to user questions. This design enables the agent to proactively familiarize itself with unseen databases and handle complex, multi‑step reasoning. Extensive experiments on large‑scale benchmarks demonstrate that our approach significantly improves accuracy over strong baselines, highlighting its effectiveness and generalizability.

PaperID: 5905, https://arxiv.org/pdf/2602.01942.pdf

Authors: Alsharif Abuadbba, Nazatul Sultan, Surya Nepal, Sanjay Jha

Title: Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Abstract:
AI is moving from domain‑specific autonomy in closed, predictable settings to large‑language‑model‑driven agents that plan and act in open, cross‑organizational environments. As a result, the cybersecurity risk landscape is changing in fundamental ways. Agentic AI systems can plan, act, collaborate, and persist over time, functioning as participants in complex socio‑technical ecosystems rather than as isolated software components. Although recent work has strengthened defenses against model and pipeline level vulnerabilities such as prompt injection, data poisoning, and tool misuse, these system centric approaches may fail to capture risks that arise from autonomy, interaction, and emergent behavior. This article introduces the 4C Framework for multi‑agent AI security, inspired by societal governance. It organizes agentic risks across four interdependent dimensions: Core (system, infrastructure, and environmental integrity), Connection (communication, coordination, and trust), Cognition (belief, goal, and reasoning integrity), and Compliance (ethical, legal, and institutional governance). By shifting AI security from a narrow focus on system‑centric protection to the broader preservation of behavioral integrity and intent, the framework complements existing AI security strategies and offers a principled foundation for building agentic AI systems that are trustworthy, governable, and aligned with human values.

PaperID: 5906, https://arxiv.org/pdf/2602.01930.pdf

Authors: Felix Igelbrink, Lennart Niecksch, Marian Renz, Martin Günther, Martin Atzmueller

Title: LIEREx: Language-Image Embeddings for Robotic Exploration

Abstract:
Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric representations but are often constrained by pre‑designed symbolic vocabularies. The reliance on fixed object classes makes it impractical to handle out‑of‑distribution knowledge not defined at design time. Recent advances in Vision‑Language Foundation Models, such as CLIP, enable open‑set mapping, where objects are encoded as high‑dimensional embeddings rather than fixed labels. In LIEREx, we integrate these VLFMs with established 3D Semantic Scene Graphs to enable target‑directed exploration by an autonomous agent in partially unknown environments.

PaperID: 5907, https://arxiv.org/pdf/2602.01914.pdf

Authors: Wenbo Pan, Zhichao Liu, Xianlong Wang, Haining Yu, Xiaohua Jia

Title: Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

Abstract:
Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(MN) operations, making long‑context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi‑token attribution method that employs span‑wise aggregation to compute attribution over multi‑token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long‑context retrieval (RULER) and multi‑step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.

PaperID: 5908, https://arxiv.org/pdf/2602.01869.pdf

Authors: Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, Jun Wang

Title: Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Abstract:
LLM‑driven agents excel at sequential decision‑making but often rely on on‑the‑fly reasoning, re‑deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and instability. To bridge this gap, we propose Skill‑Pro, a framework enabling agents to autonomously learn reusable procedural skills from interaction experiences without parameter updates. By formalizing a Skill‑MDP, Skill‑Pro transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non‑Parametric PPO, which leverages semantic gradients for high‑quality candidate generation and a PPO Gate for robust Skill verification. Through score‑based maintenance, Skill‑Pro sustains compact, high‑quality procedural memory. Experimental results across in‑domain, cross‑task, and cross‑agent scenarios demonstrate that Skill‑Pro achieves superior reuse rates and significant gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how Skill‑Pro transparently accumulates, refines, and reuses procedural knowledge to facilitate long‑term autonomy.

PaperID: 5909, https://arxiv.org/pdf/2602.01858.pdf

Authors: Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen

Title: SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures

Abstract:
Standard Operating Procedures (SOPs) are essential for ensuring operational safety and consistency in industrial environments. However, retrieving and following these procedures presents unique challenges, such as rigid proprietary structures, condition‑dependent relevance, and actionable execution requirement, which standard semantic‑driven Retrieval‑Augmented Generation (RAG) paradigms fail to address. Inspired by the Mixture‑of‑Experts (MoE) paradigm, we propose SOPRAG, a novel framework specifically designed to address the above pain points in SOP retrieval. SOPRAG replaces flat chunking with specialized Entity, Causal, and Flow graph experts to resolve industrial structural and logical complexities. To optimize and coordinate these experts, we propose a Procedure Card layer that prunes the search space to eliminate computational noise, and an LLM‑Guided gating mechanism that dynamically weights these experts to align retrieval with operator intent. To address the scarcity of domain‑specific data, we also introduce an automated, multi‑agent workflow for benchmark construction. Extensive experiments across four industrial domains demonstrate that SOPRAG significantly outperforms strong lexical, dense, and graph‑based RAG baselines in both retrieval accuracy and response utility, achieving perfect execution scores in real‑world critical tasks.

PaperID: 5910, https://arxiv.org/pdf/2602.01854.pdf

Authors: A S M Sharifuzzaman Sagar, Mohammed Bennamoun, Farid Boussaid, Naeha Sharif, Lian Xu, Shaaban Sahmoud, Ali Kishk

Title: Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection

Abstract:
In multimodal misinformation, deception usually arises not just from pixel‑level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image‑text pair. Yet most deepfake detectors, engineered to detect pixel‑level forgeries, do not account for claim‑level meaning, despite their growing integration in automated fact‑checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel‑level detectors contribute useful signal for verifying image‑text claims, or do they instead introduce misleading authenticity priors that undermine evidence‑based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state‑of‑the‑art image‑only deepfake detectors, (2) an evidence‑driven fact‑checking system that performs tool‑guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi‑Agent Debate (MAD), and (3) a hybrid fact‑checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26‑0.53 on MMFakeBench and 0.33‑0.49 on DGM4, and that incorporating their predictions into fact‑checking pipelines consistently reduces performance by 0.04‑0.08 F1 due to non‑causal authenticity assumptions. In contrast, the evidence‑centric fact‑checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel‑level artifact signals do not reliably enhance reasoning over real‑world image‑text misinformation.

PaperID: 5911, https://arxiv.org/pdf/2602.01848.pdf

Authors: Salaheddin Alzu'bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, Sewoong Oh

Title: ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Abstract:
Current agentic frameworks underperform on long‑horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta‑Agents), a domain‑agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency‑aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles ‑‑Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator ‑‑ which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi‑agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine‑tuning, we further introduce GEPA+, an improved Genetic‑Pareto prompt proposer that searches over prompts within ROMA's component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system‑level performance on reasoning and long‑form generation benchmarks. On SEAL‑0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM‑4.6 improves accuracy by 9.9% over Kimi‑Researcher. On EQ‑Bench, a long‑form writing benchmark, ROMA enables DeepSeek‑V3 to match the performance of leading closed‑source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model‑agnostic.

PaperID: 5912, https://arxiv.org/pdf/2602.01815.pdf

Authors: Yunhui Jang, Seonghyun Park, Jaehyung Kim, Sungsoo Ahn

Title: INDIBATOR: Diverse and Fact-Grounded Individuality for Multi-Agent Debate in Molecular Discovery

Abstract:
Multi‑agent systems have emerged as a powerful paradigm for automating scientific discovery. To differentiate agent behavior in the multi‑agent system, current frameworks typically assign generic role‑based personas such as ''reviewer'' or ''writer'' or rely on coarse grained keyword‑based personas. While functional, this approach oversimplifies how human scientists operate, whose contributions are shaped by their unique research trajectories. In response, we propose INDIBATOR, a framework for molecular discovery that grounds agents in individualized scientist profiles constructed from two modalities: publication history for literature‑derived knowledge and molecular history for structural priors. These agents engage in multi‑turn debate through proposal, critique, and voting phases. Our evaluation demonstrates that these fine‑grained individuality‑grounded agents consistently outperform systems relying on coarse‑grained personas, achieving competitive or state‑of‑the‑art performance. These results validate that capturing the ``scientific DNA'' of individual agents is essential for high‑quality discovery.

PaperID: 5913, https://arxiv.org/pdf/2602.01797.pdf

Authors: Hanlin Zhou, Huah Yong Chan

Title: ORCH: many analyses, one merge-a deterministic multi-agent orchestrator for discrete-choice reasoning with EMA-guided routing

Abstract:
Recent advances in large‑scale language models (LLMs) have made multi‑agent architectures attractive for challenging reasoning tasks. However, many existing systems rely on stochastic routing or ad‑hoc heuristics, making their behavior difficult to reproduce and their decision process hard to interpret. We propose ORCH, a deterministic coordination framework for discrete‑choice reasoning that orchestrates heterogeneous LLMs. ORCH follows a ``many analyses, one decision'' paradigm: multiple base models independently produce structured analyses, and a dedicated merge agent outputs the final choice. The framework uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training‑free. Determinism here refers to fixed routing and aggregation rules under a fixed evaluation protocol, rather than strict bit‑level reproducibility across deployments. To exploit model complementarity, we optionally introduce an EMA‑guided router that updates agent selection using historical accuracy, latency, or cost; since it relies on answer‑based feedback, it is mainly intended for benchmarking, controlled evaluation, or delayed‑feedback settings. Experiments on MMLU, MMLU‑Pro, and GSM8K show that ORCH consistently outperforms single‑model baselines and a majority‑vote ensemble. On MMLU‑Pro, ORCH improves accuracy by over 10 points compared to the strongest baseline, and on GSM8K it yields gains exceeding 50 points; McNemar tests confirm statistical significance. The EMA router provides an additional 0.7‑‑2.0 point accuracy boost, and ablations show that both multi‑agent collaboration and routing contribute substantially. Overall, ORCH offers a practical path toward controllable, interpretable, and deployment‑ready LLM‑based agent systems for discrete‑choice reasoning.

PaperID: 5914, https://arxiv.org/pdf/2602.01763.pdf

Authors: Xiaowei Ye, Xiaoyu He, Chao Liao, Chen Wu, Pinyan Lu

Title: A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention

Abstract:
Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for the sequential function composition‑a multi‑step reasoning task that must occur within a model's forward pass, an (L+1)‑layer full attention network is sufficient, whereas any hybrid network interleaving L‑1 layers of full attention with a substantially larger number (2^3L^2) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.

PaperID: 5915, https://arxiv.org/pdf/2602.01675.pdf

Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng

Title: TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Abstract:
As LLM‑based agents are deployed in increasingly complex real‑world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi‑tool reasoning, and adapting to evolving user behavior over long, multi‑turn interactions. To bridge this gap, we introduce TRIP‑Bench, a long‑horizon benchmark grounded in realistic travel‑planning scenarios. TRIP‑Bench leverages real‑world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi‑turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5‑32B‑Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini‑3‑Pro in our evaluation. We expect TRIP‑Bench to advance practical long‑horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long‑horizon training.

PaperID: 5916, https://arxiv.org/pdf/2602.01672.pdf

Authors: Siheng Xiong, Oguzhan Gungordu, James C. Kerce, Faramarz Fekri

Title: Adaptive Information Control for Search-Augmented LLM Reasoning

Abstract:
Search‑augmented reasoning agents interleave multi‑step reasoning with external retrieval, but uncontrolled retrieval can introduce redundant evidence, saturate the context, and destabilize reinforcement learning (RL). Existing outcome‑based RL methods provide only sparse terminal rewards, offering limited guidance for intermediate information‑acquisition decisions. We propose DeepControl, an adaptive information‑control framework based on information utility, a state‑dependent estimate of the marginal value of retrieved evidence. The framework regulates information acquisition along two axes: extent, i.e., whether retrieval should continue, and resolution, i.e., how much retrieved detail should be exposed. It implements these controls through retrieval‑continuation guidance, hierarchical granularity control, and an annealed control‑forcing scheme. This enables the policy to internalize effective acquisition behavior during training and operate without external control at test time. Across seven benchmarks, DeepControl consistently outperforms strong RL and retrieval baselines without explicit information control; compared with Search‑R1, it improves average performance by +9.4 and +8.6 points on Qwen2.5‑7B and Qwen2.5‑3B, respectively. Additional analyses show improved search effectiveness, training stability, and evidence utilization.

PaperID: 5917, https://arxiv.org/pdf/2602.01653.pdf

Authors: Enyu Shi, Yiyang Zhu, Jiayi Zhang, Ziheng Liu, Jiakang Zheng, Jiancheng An, Derrick Wing Kwan Ng, Bo Ai, Chau Yuen

Title: Low-Complexity Multi-Agent Continual Learning for Stacked Intelligent Metasurface-Assisted Secure Communications

Abstract:
Stacked intelligent metasurfaces (SIMs), composed of multiple layers of reconfigurable transmissive metasurfaces, are gaining prominence as a transformative technology for future wireless communication security. This paper investigates the integration of SIM into multi‑user multiple‑input multiple‑output (MIMO) systems to enhance physical layer security. A novel system architecture is proposed, wherein each base station (BS) antenna transmits a dedicated single‑user stream, while a multi‑layer SIM executes wave‑based beamforming in the electromagnetic domain, thereby avoiding the need for complex baseband digital precoding and significantly reducing hardware overhead. To maximize the weighted sum secrecy rate (WSSR), we formulate a joint precoding optimization problem over BS power allocation and SIM phase shifts, which is high‑dimensional and non‑convex due to the complexity of the objective function and the coupling among optimization variables. To address this, we propose a manifold‑enhanced heterogeneous multi‑agent continual learning (MHACL) framework that incorporates gradient representation and dual‑scale policy optimization to achieve robust performance in dynamic environments with high demands for secure communication. Furthermore, we develop SIM‑MHACL (SIMHACL), a low‑complexity learning template that embeds phase coordination into a product manifold structure, reducing the exponential search space to linear complexity while maintaining physical feasibility. Simulation results validate that the proposed framework achieves millisecond‑level per‑iteratio ntraining in SIM‑assisted systems, significantly outperforming various baseline schemes, with SIMHACL achieving comparable WSSR to MHACL while reducing computation time by 30%.

PaperID: 5918, https://arxiv.org/pdf/2602.01629.pdf

Authors: Renukanandan Tumu, Aditya Singh, Rahul Mangharam

Title: AdaptNC: Adaptive Nonconformity Scores for Conformal Prediction under Distribution Shift

Abstract:
Rigorous uncertainty quantification is essential for the safe deployment of autonomous systems in unconstrained environments. Conformal Prediction (CP) provides a distribution‑free framework for this task, yet its standard formulations rely on exchangeability assumptions that are violated by the distribution shifts inherent in real‑world robotics. Existing online CP methods maintain target coverage by adaptively scaling the conformal threshold, but typically employ a static nonconformity score function. We show that this fixed geometry leads to highly conservative, volume‑inefficient prediction regions when environments undergo structural shifts. To address this, we propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold. AdaptNC leverages an adaptive reweighting scheme to optimize score functions, and introduces a replay buffer mechanism to mitigate the coverage instability that occurs during score transitions. We evaluate AdaptNC on diverse robotic benchmarks involving multi‑agent policy changes, environmental changes and sensor degradation. Our results demonstrate that AdaptNC significantly reduces prediction region volume compared to state‑of‑the‑art threshold‑only baselines while maintaining target coverage levels.

PaperID: 5919, https://arxiv.org/pdf/2602.01611.pdf

Authors: Weizheng Gu, Chengze Li, Zhuohao Yu, Mengyuan Sun, Zhibang Yang, Wei Wang, Hongrui Jia, Shikun Zhang, Wei Ye

Title: What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

Abstract:
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool‑use and interface‑specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment‑invariant capability. We propose PIPE, a protocol‑level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open‑source and API‑based agents, PIPE reveals that trajectory‑SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non‑trajectory‑trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias‑based metric that quantifies preference for training‑time interfaces, and show that interface shortcutting exhibits environment‑dependent, non‑monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What‑Do‑Agents‑Learn‑from‑Trajectory‑SFT‑Semantics‑or‑Interfaces‑‑0831/.

PaperID: 5920, https://arxiv.org/pdf/2602.01568.pdf

Authors: Hamzah Khan, Dong Ho Lee, Jingqi Li, Tianyu Qiu, Christian Ellis, Jesse Milzman, Wesley Suttle, David Fridovich-Keil

Title: Efficiently Solving Mixed-Hierarchy Games with Quasi-Policy Approximations

Abstract:
Multi‑robot coordination often exhibits hierarchical structure, with some robots' decisions depending on the planned behaviors of others. While game theory provides a principled framework for such interactions, existing solvers struggle to handle mixed information structures that combine simultaneous (Nash) and hierarchical (Stackelberg) decision‑making. We study N‑robot forest‑structured mixed‑hierarchy games, in which each robot acts as a Stackelberg leader over its subtree while robots in different branches interact via Nash equilibria. We derive the Karush‑Kuhn‑Tucker (KKT) first‑order optimality conditions for this class of games and show that they involve increasingly high‑order derivatives of robots' best‑response policies as the hierarchy depth grows, rendering a direct solution intractable. To overcome this challenge, we introduce a quasi‑policy approximation that removes higher‑order policy derivatives and develop an inexact Newton method for efficiently solving the resulting approximated KKT systems. We prove local exponential convergence of the proposed algorithm for games with non‑quadratic objectives and nonlinear constraints. The approach is implemented in a highly optimized Julia library (MixedHierarchyGames.jl) and evaluated in hardware and simulated multi‑agent experiments, demonstrating real‑time convergence for complex mixed‑hierarchy information structures.

PaperID: 5921, https://arxiv.org/pdf/2602.01556.pdf

Authors: Hong Su

Title: Autonomous Question Formation for Large Language Model-Driven AI Systems

Abstract:
Large language model (LLM)‑driven AI systems are increasingly important for autonomous decision‑making in dynamic and open environments. However, most existing systems rely on predefined tasks and fixed prompts, limiting their ability to autonomously identify what problems should be solved when environmental conditions change. In this paper, we propose a human‑simulation‑based framework that enables AI systems to autonomously form questions and set tasks by reasoning over their internal states, environmental observations, and interactions with other AI systems. The proposed method treats question formation as a first‑class decision process preceding task selection and execution, and integrates internal‑driven, environment‑aware, and inter‑agent‑aware prompting scopes to progressively expand cognitive coverage. In addition, the framework supports learning the question‑formation process from experience, allowing the system to improve its adaptability and decision quality over time. xperimental results in a multi‑agent simulation environment show that environment‑aware prompting significantly reduces no‑eat events compared with the internal‑driven baseline, and inter‑agent‑aware prompting further reduces cumulative no‑eat events by more than 60% over a 20‑day simulation, with statistically significant improvements (p < 0.05).

PaperID: 5922, https://arxiv.org/pdf/2602.01465.pdf

Authors: Nikita Benkovich, Vitalii Valkov

Title: Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering

Abstract:
Large language models have demonstrated strong capabilities in individual software engineering tasks, yet most autonomous systems still treat issue resolution as a monolithic or pipeline‑based process. In contrast, real‑world software development is organized as a collaborative activity carried out by teams following shared methodologies, with clear role separation, communication, and review. In this work, we present a fully automated multi‑agent system that explicitly models software engineering as an organizational process, replicating the structure of an engineering team. Built on top of agyn, an open‑source platform for configuring agent teams, our system assigns specialized agents to roles such as coordination, research, implementation, and review, provides them with isolated sandboxes for experimentation, and enables structured communication. The system follows a defined development methodology for working on issues, including analysis, task specification, pull request creation, and iterative review, and operates without any human intervention. Importantly, the system was designed for real production use and was not tuned for SWE‑bench. When evaluated post hoc on SWE‑bench 500, it resolves 72.2% of tasks, outperforming single‑agent baselines using comparable language models. Our results suggest that replicating team structure, methodology, and communication is a powerful paradigm for autonomous software engineering, and that future progress may depend as much on organizational design and agent infrastructure as on model improvements.

PaperID: 5923, https://arxiv.org/pdf/2602.01453.pdf

Authors: Idan Barnea, Orin Levy, Yishay Mansour

Title: The Horizon Threshold in Cooperative Multi-Agent Reward-Free Exploration

Abstract:
We study cooperative multi‑agent reinforcement learning in the setting of reward‑free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite‑horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a regime change governed by the horizon H. When the number of learning phases equals H, we present a computationally efficient algorithm that uses only \tildeO(S^6 H^6 A / ε^2) agents to obtain an ε approximation of the dynamics (i.e., yields an ε‑optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to ρ< H phases requires at least A^H/ρ agents to achieve constant accuracy. Thus, we show that having Θ(H) learning phases is both necessary and sufficient when restricting the number of agents to be polynomial.

PaperID: 5924, https://arxiv.org/pdf/2602.01415.pdf

Authors: Clayton Cohn, Siyuan Guo, Surya Rayala, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Angela Eeds, Menton Deweese, Pamela J. Osborn Popp, Rebekah Stanton, Shakeera Walker, Meiyi Ma, Gautam Biswas

Title: Evidence-Decision-Feedback: Theory-Driven Adaptive Scaffolding for LLM Agents

Abstract:
LLMs offer tremendous opportunities for pedagogical agents to help students construct knowledge and develop problem‑solving skills, yet many of these agents operate on a "one‑size‑fits‑all" basis, limiting their ability to personalize support. To address this, we introduce Evidence‑Decision‑Feedback (EDF), a theoretical framework for adaptive scaffolding with LLM agents. EDF integrates elements of intelligent tutoring systems (ITS) and agentic behavior by organizing interactions around evidentiary inference, pedagogical decision‑making, and adaptive feedback. We instantiate EDF through Copa, a Collaborative Peer Agent for STEM+C problem‑solving. In an authentic high school classroom study, we show that EDF‑guided interactions align feedback with students' demonstrated understanding and task mastery; promote scaffold fading; and support interpretable, evidence‑grounded explanations without fostering overreliance.

PaperID: 5925, https://arxiv.org/pdf/2602.01335.pdf

Authors: Yu Xu, Yuxin Zhang, Juan Cao, Lin Gao, Chunyu Wang, Oliver Deussen, Tong-Yee Lee, Fan Tang

Title: Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Abstract:
A visual metaphor constitutes a high‑order form of human creativity, employing cross‑domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel‑level instruction alignment and surface‑level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re‑materialize that abstract logic onto a user‑specified target subject. We propose a cognitive‑inspired, multi‑agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross‑domain logic re‑instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high‑fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed‑loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high‑impact creative applications in advertising and media. Source code will be made publicly available.

PaperID: 5926, https://arxiv.org/pdf/2602.01155.pdf

Authors: Hugo Math, Julian Lorenz, Stefan Oelsner, Rainer Lienhart

Title: Multi-Agent Causal Reasoning System for Error Pattern Rule Automation in Vehicles

Abstract:
Modern vehicles generate thousands of different discrete events known as Diagnostic Trouble Codes (DTCs). Automotive manufacturers use Boolean combinations of these codes, called error patterns (EPs), to characterize system faults and ensure vehicle safety. Yet, EP rules are still manually handcrafted by domain experts, a process that is expensive and prone to errors as vehicle complexity grows. This paper introduces CAREP (Causal Automated Reasoning for Error Patterns), a multi‑agent system that automatizes the generation of EP rules from high‑dimensional event sequences of DTCs. CAREP combines a causal discovery agent that identifies potential DTC‑EP relations, a contextual information agent that integrates metadata and descriptions, and an orchestrator agent that synthesizes candidate boolean rules together with interpretable reasoning traces. Evaluation on a large‑scale automotive dataset with over 29,100 unique DTCs and 474 error patterns demonstrates that CAREP can automatically and accurately discover the unknown EP rules, outperforming LLM‑only baselines while providing transparent causal explanations. By uniting practical causal discovery and agent‑based reasoning, CAREP represents a step toward fully automated fault diagnostics, enabling scalable, interpretable, and cost‑efficient vehicle maintenance.

PaperID: 5927, https://arxiv.org/pdf/2602.01119.pdf

Authors: Konstantin Chernyshev, Ekaterina Artemova, Viacheslav Zhukov, Maksim Nerush, Mariia Fedorova, Iryna Repik, Olga Shapovalova, Aleksey Sukhorosov, Vladimir Dobrovolskii, Natalia Mikhailova, Sergei Tilga

Title: Tendem: A Hybrid AI+Human Platform

Abstract:
Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem's performance, we conducted a series of in‑house evaluations on 94 real‑world tasks, comparing it with AI‑only agents and human‑only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher‑quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human‑only execution. On third‑party agentic benchmarks, Tendem's AI Agent (operating autonomously, without human involvement) performs near state‑of‑the‑art on web browsing and tool‑use tasks while demonstrating strong results in frontier domain knowledge and reasoning.

PaperID: 5928, https://arxiv.org/pdf/2602.01078.pdf

Authors: Tong Xia, Weibin Li, Gang Liu, Yong Li

Title: AutoHealth: An Uncertainty-Aware Multi-Agent System for Autonomous Health Data Modeling

Abstract:
LLM‑based agents have demonstrated strong potential for autonomous machine learning, yet their applicability to health data remains limited. Existing systems often struggle to generalize across heterogeneous health data modalities, rely heavily on predefined solution templates with insufficient adaptation to task‑specific objectives, and largely overlook uncertainty estimation, which is essential for reliable decision‑making in healthcare. To address these challenges, we propose AutoHealth, a novel uncertainty‑aware multi‑agent system that autonomously models health data and assesses model reliability. AutoHealth employs closed‑loop coordination among five specialized agents to perform data exploration, task‑conditioned model construction, training, and optimization, while jointly prioritizing predictive performance and uncertainty quantification. Beyond producing ready‑to‑use models, the system generates comprehensive reports to support trustworthy interpretation and risk‑aware decision‑making. To rigorously evaluate its effectiveness, we curate a challenging real‑world benchmark comprising 17 tasks across diverse data modalities and learning settings. AutoHealth completes all tasks and outperforms state‑of‑the‑art baselines by 29.2% in prediction performance and 50.2% in uncertainty estimation.

PaperID: 5929, https://arxiv.org/pdf/2602.01063.pdf

Authors: Bin Han, Deuksin Kwon, Jonathan Gratch

Title: Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents

Abstract:
Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice‑breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM‑based dialogue agents: whether such variations reflect inconsistency or context‑sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context‑sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.

PaperID: 5930, https://arxiv.org/pdf/2602.01053.pdf

Authors: Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Title: LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Abstract:
Role specialization in multi‑LLM agent systems is often realized via multi‑LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool‑augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi‑LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi‑LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter‑dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low‑rank form, and further reduces compute overhead, enabled by shared‑A multi‑LoRA architectures, by also sharing the low‑rank cache and avoiding redundant computations for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash‑LoRA‑Attention, a kernel that reorders attention computation to avoid materializing the low‑rank cache to full dimension. LRAgent achieves throughput and time‑to‑first‑token latency close to fully shared caching, while preserving accuracy near the non‑shared caching baseline across agentic question‑answering benchmarks.

PaperID: 5931, https://arxiv.org/pdf/2602.01038.pdf

Authors: Lavisha Aggarwal, Vikas Bahirwani, Andrea Colaco

Title: From Videos to Conversations: Egocentric Instructions for Task Assistance

Abstract:
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi‑step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large‑scale multimodal conversational datasets grounded in real‑world task execution, in part due to the cost and logistical complexity of human‑assisted data collection. In this paper, we present a framework to automatically transform single person instructional videos into two‑person multimodal task‑guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi‑turn expert‑novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.

PaperID: 5932, https://arxiv.org/pdf/2602.01011.pdf

Authors: Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou

Title: Multi-Agent Teams Hold Experts Back

Abstract:
Multi‑agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre‑specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self‑organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self‑organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human‑inspired and frontier ML benchmarks, we find that ‑‑ unlike human teams ‑‑ LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise ‑‑ averaging expert and non‑expert views rather than appropriately weighting expertise ‑‑ which increases with team size and correlates negatively with performance. Interestingly, this consensus‑seeking behavior improves robustness to adversarial agents, suggesting a trade‑off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self‑organizing multi‑agent teams to harness the collective expertise of their members.

PaperID: 5933, https://arxiv.org/pdf/2602.00996.pdf

Authors: Abhijit Chakraborty, Ashish Raj Shekhar, Shiven Agarwal, Vivek Gupta

Title: DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

Abstract:
Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi‑agent framework for multimodal question answering. It uses specialized agents: Table, Context, Visual, Summarizing and Verification, that communicate through a shared natural‑language log as persistent memory. This log‑based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT‑QA, CRT‑QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG, provides a scalable approach through modular components using natural‑language communication.

PaperID: 5934, https://arxiv.org/pdf/2602.00970.pdf

Authors: Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein

Title: Verification Required: The Impact of Information Credibility on AI Persuasion

Abstract:
Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high‑stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap‑talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM‑to‑LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state‑of‑the‑art LLM agents in large‑scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in‑context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.

PaperID: 5935, https://arxiv.org/pdf/2602.00966.pdf

Authors: Zhaoyang Guan, Huixi Cao, Ming Zhong, Eric Yang, Lynn Ai, Yongxin Ni, Bill Shi

Title: Symphony-Coord: Emergent Coordination in Decentralized Agent Systems

Abstract:
Multi‑agent large language model systems can tackle complex multi‑step tasks by decomposing work and coordinating specialized behaviors. However, current coordination mechanisms typically rely on statically assigned roles and centralized controllers. As agent pools and task distributions evolve, these design choices lead to inefficient routing, poor adaptability, and fragile fault recovery capabilities. We introduce Symphony‑Coord, a decentralized multi‑agent framework that transforms agent selection into an online multi‑armed bandit problem, enabling roles to emerge organically through interaction. The framework employs a two‑stage dynamic beacon protocol: (i) a lightweight candidate screening mechanism to limit communication and computational overhead; (ii) an adaptive LinUCB selector that routes subtasks based on context features derived from task requirements and agent states, continuously optimized through delayed end‑to‑end feedback. Under standard linear realizability assumptions, we provide sublinear regret bounds, indicating the system converges toward near‑optimal allocation schemes. Validation through simulation experiments and real‑world large language model benchmarks demonstrates that Symphony‑Coord not only enhances task routing efficiency but also exhibits robust self‑healing capabilities in scenarios involving distribution shifts and agent failures, achieving a scalable coordination mechanism without predefined roles.

PaperID: 5936, https://arxiv.org/pdf/2602.00953.pdf

Authors: Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Sandeep Manandhar, Shikhar Shiromani, Mohammad Tanvir Hasan, Zenghan Wang, Suman Ghosh, Jinchu Li, Xuejian Xu, Aniket Ramkrishnan Iyer, Naoto Tokuyama, Twisha Shah, Tilak Pathak, Soundharya Kumaresan, Yohei Abe, Himanshu Maurya, Anant Madabhushi

Title: SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery

Abstract:
Engineered image‑based biomarkers offer a clinically interpretable alternative to black‑box AI in computational pathology, yet their discovery remains largely intuition‑driven, guided by fragmented literature rather than rigorous biological validation. We introduce SAGE (Structured Agentic system for hypothesis Generation and Evaluation), a multi‑agent framework that grounds biomarker discovery in biological evidence through three mechanisms: (i) knowledge‑graph‑anchored hypothesis generation via multi‑path ontological reasoning, (ii) a debate‑based multi‑agent novelty assessment that stress‑tests candidate biomarkers against existing literature, and (iii) an end‑to‑end automated validation pipeline that translates hypotheses directly into executable analyses on multimodal pathology datasets. Together, these components shift biomarker discovery from an intuition‑driven, literature‑browsing exercise into a structured, traceable reasoning process that clinicians and researchers can inspect, trust, and build upon.

PaperID: 5937, https://arxiv.org/pdf/2602.00948.pdf

Authors: Mingxi Zou, Jiaxiang Chen, Aotian Luo, Jingyi Dai, Chi Zhang, Dongning Sun, Zenglin Xu

Title: FinEvo: From Isolated Backtests to Ecological Market Games for Multi-Agent Financial Strategy Evolution

Abstract:
Conventional financial strategy evaluation relies on isolated backtests in static environments. Such evaluations assess each policy independently, overlook correlations and interactions, and fail to explain why strategies ultimately persist or vanish in evolving markets. We shift to an ecological perspective, where trading strategies are modeled as adaptive agents that interact and learn within a shared market. Instead of proposing a new strategy, we present FinEvo, an ecological game formalism for studying the evolutionary dynamics of multi‑agent financial strategies. At the individual level, heterogeneous ML‑based traders‑rule‑based, deep learning, reinforcement learning, and large language model (LLM) agents‑adapt using signals such as historical prices and external news. At the population level, strategy distributions evolve through three designed mechanisms‑selection, innovation, and environmental perturbation‑capturing the dynamic forces of real markets. Together, these two layers of adaptation link evolutionary game theory with modern learning dynamics, providing a principled environment for studying strategic behavior. Experiments with external shocks and real‑world news streams show that FinEvo is both stable for reproducibility and expressive in revealing context‑dependent outcomes. Strategies may dominate, collapse, or form coalitions depending on their competitors‑patterns invisible to static backtests. By reframing strategy evaluation as an ecological game formalism, FinEvo provides a unified, mechanism‑level protocol for analyzing robustness, adaptation, and emergent dynamics in multi‑agent financial markets, and may offer a means to explore the potential impact of macroeconomic policies and financial regulations on price evolution and equilibrium.

PaperID: 5938, https://arxiv.org/pdf/2602.00929.pdf

Authors: Zergham Ahmed, Kazuki Irie, Joshua B. Tenenbaum, Christopher J. Bates, Samuel J. Gershman

Title: Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents

Abstract:
Humans learn abstractions and use them to plan efficiently to quickly generalize across tasks ‑‑ an ability that remains challenging for state‑of‑the‑art large language model (LLM) agents and deep reinforcement learning (RL) systems. Inspired by the cognitive science of how people form abstractions and intuitive theories of their world knowledge, Theory‑Based RL (TBRL) systems, such as TheoryCoder, exhibit strong generalization through effective use of abstractions. However, they heavily rely on human‑provided abstractions and sidestep the abstraction‑learning problem. We introduce TheoryCoder‑2, a new TBRL agent that leverages LLMs' in‑context learning ability to actively learn reusable abstractions rather than relying on hand‑specified ones, by synthesizing abstractions from experience and integrating them into a hierarchical planning process. We conduct experiments on diverse environments, including BabyAI, Minihack and VGDL games like Sokoban. We find that TheoryCoder‑2 is significantly more sample‑efficient than baseline LLM agents augmented with classical planning domain construction, reasoning‑based planning, and prior program‑synthesis agents such as WorldCoder. TheoryCoder‑2 is able to solve complex tasks that the baselines fail, while only requiring minimal human prompts, unlike prior TBRL systems.

PaperID: 5939, https://arxiv.org/pdf/2602.00921.pdf

Authors: Eric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang, Stanley Osher, Samy Wu Fung

Title: On the Convergence of Jacobian-Free Backpropagation for Optimal Control Problems with Implicit Hamiltonians

Abstract:
Optimal feedback control with implicit Hamiltonians poses a fundamental challenge for learning‑based value function methods due to the absence of closed‑form optimal control laws. Recent work~\citegelphman2025end introduced an implicit deep learning approach using Jacobian‑Free Backpropagation (JFB) to address this setting, but only established sample‑wise descent guarantees. In this paper, we establish convergence guarantees for JFB in the stochastic minibatch setting, showing that the resulting updates converge to stationary points of the expected optimal control objective. We further demonstrate scalability on substantially higher‑dimensional problems, including multi‑agent optimal consumption and swarm‑based quadrotor and bicycle control. Together, our results provide both theoretical justification and empirical evidence for using JFB in high‑dimensional optimal control with implicit Hamiltonians.

PaperID: 5940, https://arxiv.org/pdf/2602.00911.pdf

Authors: Abhijit Chakraborty, Sandipan De, Yash Shah, Chahana Dahal, Vivek Gupta

Title: Synapse Compendium Aware Federated Knowledge Exchange for Tool Routed LLMs

Abstract:
Collaborative learning among LLM‑based agents under federated learning faces challenges, including communication costs, heterogeneity in data, and tool‑usage, limiting their effectiveness. We introduce Synapse, a framework that trains a shared global knowledge model of tool‑usage behavior. Client agents with fixed LLMs learn tool‑usage patterns locally, and transmit artifacts for federated aggregation through coordinators. A global tool compendium is updated and redistributed, enabling convergence toward stable tool selection. Synapse uses templated representations, embedding retrieval with LLM reranking, and adaptive masking to maintain utility while limiting information leakage. The framework supports heterogeneous data and quantifies performance improvements. Results show that Synapse improves tool‑usage effectiveness and reduces communication overhead compared with weight or prompt‑sharing approaches in multi‑agent LLM systems.

PaperID: 5941, https://arxiv.org/pdf/2602.00861.pdf

Authors: Kushal Chakrabarti, Nirmal Balachundar

Title: Multi-Head Attention Is a Multi-Player Game

Abstract:
Modern transformer attention is internally multi‑agent ‑‑ heads compete and coordinate ‑‑ yet we train it as if it were a monolithic optimizer. We formalize this gap: cross‑entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by Γ(G), the off‑diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both \emphexcess hallucination probability and \emphexcess head redundancy scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces Γ(G) provably tightens PoA. We instantiate this as GAME‑LoRA, combining Barlow Twins decorrelation with log‑determinant coordination pressure. Experiments validate the theory: Γ(G) predicts hallucination (p<0.05), emergent coalitions exhibit selective coordination, and GAME‑LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation ‑‑ a Pareto improvement inaccessible to methods ignoring the game structure.

PaperID: 5942, https://arxiv.org/pdf/2602.00785.pdf

Authors: Sherry Yang

Title: World Models as an Intermediary between Agents and the Real World

Abstract:
Large language model (LLM) agents trained using reinforcement learning has achieved superhuman performance in low‑cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance for these complex and high‑cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high‑cost actions such as extreme off‑policy learning and sample inefficiency in long‑horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.

PaperID: 5943, https://arxiv.org/pdf/2602.00766.pdf

Authors: Xiaoxue Yu, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Title: Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning

Abstract:
The evolution of next‑Generation (xG) wireless networks marks a paradigm shift from connectivity‑centric architectures to Artificial Intelligence (AI)‑native designs that tightly integrate data, computing, and communication. Yet existing AI deployments in communication systems remain largely siloed, offering isolated optimizations without intrinsic adaptability, dynamic task delegation, or multi‑agent collaboration. In this work, we propose a unified agentic NetGPT framework for AI‑native xG networks, wherein a NetGPT core can either perform autonomous reasoning or delegate sub‑tasks to domain‑specialized agents via agentic communication. The framework establishes clear modular responsibilities and interoperable workflows, enabling scalable, distributed intelligence across the network. To support continual refinement of collaborative reasoning strategies, the framework is further enhanced through Agentic reinforcement learning under partially observable conditions and stochastic external states. The training pipeline incorporates masked loss against external agent uncertainty, entropy‑guided exploration, and multi‑objective rewards that jointly capture task quality, coordination efficiency, and resource constraints. Through this process, NetGPT learns when and how to collaborate, effectively balancing internal reasoning with agent invocation. Overall, this work provides a foundational architecture and training methodology for self‑evolving, AI‑native xG networks capable of autonomous sensing, reasoning, and action in complex communication environments.

PaperID: 5944, https://arxiv.org/pdf/2602.00755.pdf

Authors: Ujwal Kumar, Alice Saito, Hershraj Niranjani, Rayan Yessou, Phan Xuan Tan

Title: Evolving Interpretable Constitutions for Multi-Agent Coordination

Abstract:
Constitutional AI has focused on single‑model alignment using fixed principles. However, multi‑agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi‑agent LLM systems. Using a grid‑world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score S in [0,1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (S= 0), while vague prosocial principles ("be helpful, harmless, honest") produce inconsistent coordination (S = 0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (S= 0.332). Using LLM‑driven genetic programming with multi‑island evolution, we evolve constitutions maximizing social welfare without explicit guidance toward cooperation. The evolved constitution C achieves S = 0.556 +/‑ 0.008 (123% higher than human‑designed baselines, N = 10), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. Our interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.

PaperID: 5945, https://arxiv.org/pdf/2602.00704.pdf

Authors: Hanqi Lyu, Di Huang, Yaoyu Zhu, Kangcheng Liu, Bohan Dou, Chongxiao Li, Pengwei Jin, Shuyao Cheng, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

Title: LocalV: Exploiting Information Locality for IP-level Verilog Generation

Abstract:
The generation of Register‑Transfer Level (RTL) code is a crucial yet labor‑intensive step in digital hardware design, traditionally requiring engineers to manually translate complex specifications into thousands of lines of synthesizable Hardware Description Language (HDL) code. While Large Language Models (LLMs) have shown promise in automating this process, existing approaches‑including fine‑tuned domain‑specific models and advanced agent‑based systems‑struggle to scale to industrial IP‑level design tasks. We identify three key challenges: (1) handling long, highly detailed documents, where critical interface constraints become buried in unrelated submodule descriptions; (2) generating long RTL code, where both syntactic and semantic correctness degrade sharply with increasing output length; and (3) navigating the complex debugging cycles required for functional verification through simulation and waveform analysis. To overcome these challenges, we propose LocalV, a multi‑agent framework that leverages information locality in modular hardware design. LocalV decomposes the long‑document to long‑code generation problem into a set of short‑document, short‑code tasks, enabling scalable generation and debugging. Specifically, LocalV integrates hierarchical document partitioning, task planning, localized code generation, interface‑consistent merging, and AST‑guided locality‑aware debugging. Experiments on RealBench, an IP‑level Verilog generation benchmark, demonstrate that LocalV substantially outperforms state‑of‑the‑art (SOTA) LLMs and agents, achieving a pass rate of 45.0% compared to 21.6%.

PaperID: 5946, https://arxiv.org/pdf/2602.00687.pdf

Authors: Yuankun Zeng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He

Title: V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication

Abstract:
Collaborative perception improves 3D understanding by fusing multi‑agent observations, yet intermediate‑feature sharing faces strict bandwidth constraints as dense BEV features saturate V2X links. We observe that collaborators view the same physical world, making their features strongly correlated; thus receivers only need innovation beyond their local context. Revisiting this from a distributed source coding perspective, we propose V2X‑DSC, a framework with a Conditional Codec (DCC) for bandwidth‑constrained fusion. The sender compresses BEV features into compact codes, while the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content. This conditional structure regularizes learning, encouraging incremental representation and yielding lower‑noise features. Experiments on DAIR‑V2X, OPV2V, and V2X‑Real demonstrate state‑of‑the‑art accuracy‑bandwidth trade‑offs under KB‑level communication, and generalizes as a plug‑and‑play communication layer across multiple fusion backbones.

PaperID: 5947, https://arxiv.org/pdf/2602.00685.pdf

Authors: Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin

Title: HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Abstract:
Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base‑model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent‑design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY‑BENCH, a benchmark and execution engine that orchestrates LLM‑based agents to reconstruct published human‑subject experiments via a Filter‑‑Extract‑‑Execute‑‑Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.

PaperID: 5948, https://arxiv.org/pdf/2602.00663.pdf

Authors: Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba

Title: SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent

Abstract:
Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate‑limited oracles, such as experimental assays, molecular optimization must be highly sample‑efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference‑time molecular optimization, updating after every oracle call without the need for population‑based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural‑language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2‑3 times higher area under the optimisation curve than prior methods, often reaching near‑maximal task scores within 50 oracle calls. Our additional medicinal‑chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample‑efficient molecular optimization.

PaperID: 5949, https://arxiv.org/pdf/2602.00611.pdf

Authors: Jiaqi Xu, Tao Huang, Kai Zhang

Title: Structured Self-Consistency:A Multi-Task Evaluation of LLMs on VirtualHome

Abstract:
Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B‑parameter models OPENPANGU‑7B and QWEN2.5‑7B across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self‑Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain‑specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU‑7B excelling at hierarchical planning while QWEN2.5‑7B show advantages in action‑level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.

PaperID: 5950, https://arxiv.org/pdf/2602.00606.pdf

Authors: Ahmed Said Donmez, Yuksel Arslantas, Muhammed O. Sayin

Title: Actor-Dual-Critic Dynamics for Zero-sum and Identical-Interest Stochastic Games

Abstract:
We propose a novel independent and payoff‑based learning framework for stochastic games that is model‑free, game‑agnostic, and gradient‑free. The learning dynamics follow a best‑response‑type actor‑critic architecture, where agents update their strategies (actors) using feedback from two distinct critics: a fast critic that intuitively responds to observed payoffs under limited information, and a slow critic that deliberatively approximates the solution to the underlying dynamic programming problem. Crucially, the learning process relies on non‑equilibrium adaptation through smoothed best responses to observed payoffs. We establish convergence to (approximate) equilibria in two‑agent zero‑sum and multi‑agent identical‑interest stochastic games over an infinite horizon. This provides one of the first payoff‑based and fully decentralized learning algorithms with theoretical guarantees in both settings. Empirical results further validate the robustness and effectiveness of the proposed approach across both classes of games.

PaperID: 5951, https://arxiv.org/pdf/2602.00566.pdf

Authors: Nan Song, Junzhe Jiang, Jingyu Li, Xiatian Zhu, Li Zhang

Title: UniMotion: A Unified Motion Framework for Simulation, Prediction and Planning

Abstract:
Motion simulation, prediction and planning are foundational tasks in autonomous driving, each essential for modeling and reasoning about dynamic traffic scenarios. While often addressed in isolation due to their differing objectives, such as generating diverse motion states or estimating optimal trajectories, these tasks inherently depend on shared capabilities: understanding multi‑agent interactions, modeling motion behaviors, and reasoning over temporal and spatial dynamics. Despite this underlying commonality, existing approaches typically adopt specialized model designs, which hinders cross‑task generalization and system scalability. More critically, this separation overlooks the potential mutual benefits among tasks. Motivated by these observations, we propose UniMotion, a unified motion framework that captures shared structures across motion tasks while accommodating their individual requirements. Built on a decoder‑only Transformer architecture, UniMotion employs dedicated interaction modes and tailored training strategies to simultaneously support these motion tasks. This unified design not only enables joint optimization and representation sharing but also allows for targeted fine‑tuning to specialize in individual tasks when needed. Extensive experiments on the Waymo Open Motion Dataset demonstrate that joint training leads to robust generalization and effective task integration. With further fine‑tuning, UniMotion achieves state‑of‑the‑art performance across a range of motion tasks, establishing it as a versatile and scalable solution for autonomous driving.

PaperID: 5952, https://arxiv.org/pdf/2602.00558.pdf

Authors: Kechen Meng, Rongpeng Li, Yansha Deng, Zhifeng Zhao, Honggang Zhang

Title: NetWorld: Communication-Based Diffusion World Model for Multi-Agent Reinforcement Learning in Wireless Networks

Abstract:
As wireless communication networks grow in scale and complexity, diverse resource allocation tasks become increasingly critical. Multi‑Agent Reinforcement Learning (MARL) provides a promising solution for distributed control, yet it often requires costly real‑world interactions and lacks generalization across diverse tasks. Meanwhile, recent advances in Diffusion Models (DMs) have demonstrated strong capabilities in modeling complex dynamics and supporting high‑fidelity simulation. Motivated by these challenges and opportunities, we propose a Communication‑based Diffusion World Model (NetWorld) to enable few‑shot generalization across heterogeneous MARL tasks in wireless networks. To improve applicability to large‑scale distributed networks, NetWorld adopts the Distributed Training with Decentralized Execution (DTDE) paradigm and is organized into a two‑stage framework: (i) pre‑training a classifier‑guided conditional diffusion world model on multi‑task offline datasets, and (ii) performing trajectory planning entirely within this world model to avoid additional online interaction. Cross‑task heterogeneity is handled via shared latent processing for observations, two‑hot discretization for task‑specific actions and rewards, and an inverse dynamics model for action recovery. We further introduce a lightweight Mean Field (MF) communication mechanism to reduce non‑stationarity and promote coordinated behaviors with low overhead. Experiments on three representative tasks demonstrate improved performance and sample efficiency over MARL baselines, indicating strong scalability and practical potential for wireless network optimization.

PaperID: 5953, https://arxiv.org/pdf/2602.00510.pdf

Authors: Huanghaohe Zou, Peng Han, Emad Nazerian, Alex Q. Huang

Title: PCBSchemaGen: Constraint-Guided Schematic Design via LLM for Printed Circuit Boards (PCB)

Abstract:
Printed Circuit Board (PCB) schematic design plays an essential role in all areas of electronic industries. Unlike prior works that focus on digital or analog circuits alone, PCB design must handle heterogeneous digital, analog, and power signals while adhering to real‑world IC packages and pin constraints. Automated PCB schematic design remains unexplored due to the scarcity of open‑source data and the absence of simulation‑based verification. We introduce PCBSchemaGen, the first training‑free framework for PCB schematic design that comprises LLM agent and Constraint‑guided synthesis. Our approach makes three contributions: 1. an LLM‑based code generation paradigm with iterative feedback with domain‑specific prompts. 2. a verification framework leveraging a real‑world IC datasheet derived Knowledge Graph (KG) and Subgraph Isomorphism encoding pin‑role semantics and topological constraints. 3. an extensive experiment on 23 PCB schematic tasks spanning digital, analog, and power domains. Results demonstrate that PCBSchemaGen significantly improves design accuracy and computational efficiency.

PaperID: 5954, https://arxiv.org/pdf/2602.00500.pdf

Authors: Jianyi Zhou, Yujie Wei, Ruichen Zhen, Bo Zhao, Xiaobo Xia, Rui Shao, Xiu Su, Shuo Yang

Title: Inject Once Survive Later: Backdooring Vision-Language-Action Models to Persist Through Downstream Fine-tuning

Abstract:
Vision‑Language‑Action (VLA) models have become foundational to modern embodied AI systems. By integrating visual perception, language understanding, and action planning, they enable general‑purpose task execution across diverse environments. Despite their importance, the security of VLA models remains underexplored ‑‑ particularly in the context of backdoor attacks, which pose realistic threats in physical‑world deployments. While recent methods attempt to inject backdoors into VLA models, these backdoors are easily erased during downstream adaptation, as user‑side fine‑tuning with clean data significantly alters model parameters, rendering them impractical for real‑world applications. To address these challenges, we propose INFUSE (INjection into Fine‑tUne‑inSensitive modulEs), the first backdoor attack framework for VLA base models that remains effective even with arbitrary user fine‑tuning. INFUSE begins by analyzing parameter sensitivity across diverse fine‑tuning scenarios to identify modules that remain largely unchanged ‑‑ the fine‑tune‑insensitive modules. It then injects backdoors into these stable modules while freezing the rest, ensuring malicious behavior persists after extensive user fine‑tuning. Comprehensive experiments across multiple VLA architectures demonstrate INFUSE's effectiveness. After user‑side fine‑tuning, INFUSE maintains mean attack success rates of 91.0% on simulation environments and 79.8% on real‑world robot tasks, substantially surpassing BadVLA (38.8% and 36.6%, respectively), while preserving clean‑task performance comparable to standard models. These results uncover a critical threat: backdoors implanted before distribution can persist through fine‑tuning and remain effective at deployment.

PaperID: 5955, https://arxiv.org/pdf/2602.00480.pdf

Authors: Mohini Priya Kolluri, Ammar Waheed, Zohaib Hasnain

Title: FISC: A Fluid-Inspired Framework for Decentralized and Scalable Swarm Control

Abstract:
Achieving scalable coordination in large robotic swarms is often constrained by reliance on inter‑agent communication, which introduces latency, bandwidth limitations, and vulnerability to failure. To address this gap, a decentralized approach for outer‑loop control of large multi‑agent systems based on the paradigm of how a fluid moves through a volume is proposed and evaluated. A relationship between fundamental fluidic element properties and individual robotic agent states is developed such that the corresponding swarm "flows" through a space, akin to a fluid when forced via a pressure boundary condition. By ascribing fluid‑like properties to subsets of agents, the swarm evolves collectively while maintaining desirable structure and coherence without explicit communication of agent states within or outside of the swarm. The approach is evaluated using simulations involving O(10^3) quadcopter agents and compared against Computational Fluid Dynamics (CFD) solutions for a converging‑diverging domain. Quantitative agreement between swarm‑derived and CFD fields is assessed using Root‑Mean‑Square Error (RMSE), yielding normalized errors of 0.15‑0.9 for velocity, 0.61‑0.98 for density, 0‑0.937 for pressure. These results demonstrate the feasibility of treating large robotic swarms as continuum systems that retain the macroscopic structure derived from first principles, providing a basis for scalable and decentralized control.

PaperID: 5956, https://arxiv.org/pdf/2602.00456.pdf

Authors: Amanda Dsouza, Ramya Ramakrishnan, Charles Dickens, Bhavishya Pohani, Christopher M Glaze

Title: Benchmarking Agents in Insurance Underwriting Environments

Abstract:
As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real‑world operations. Instead, existing benchmarks overemphasize open‑domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert‑first, multi‑turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real‑world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not the most efficient, models hallucinate domain knowledge despite tool access, and pass^k results show a 20% drop in performance. The results from UNDERWRITE demonstrate that expert involvement in benchmark design is essential for realistic agent evaluation, common agentic frameworks exhibit brittleness that skews performance reporting, and hallucination detection in specialized domains demands compositional approaches. Our work provides insights for developing benchmarks that better align with enterprise deployment requirements.

PaperID: 5957, https://arxiv.org/pdf/2602.00454.pdf

Authors: Jing Wu, Yue Sun, Tianpei Xie, Suiyao Chen, Jingyuan Bao, Yaopengxiao Xu, Gaoyuan Du, Inseok Heo, Alexander Gutfraind, Xin Wang

Title: Cross-Modal Memory Compression for Efficient Multi-Agent Debate

Abstract:
Multi‑agent debate can improve reasoning quality and reduce hallucinations, but it incurs rapidly growing context as debate rounds and agent count increase. Retaining full textual histories leads to token usage that can exceed context limits and often requires repeated summarization, adding overhead and compounding information loss. We introduce DebateOCR, a cross‑modal compression framework that replaces long textual debate traces with compact image representations, which are then consumed through a dedicated vision encoder to condition subsequent rounds. This design compresses histories that commonly span tens to hundreds of thousands of tokens, cutting input tokens by more than 92% and yielding substantially lower compute cost and faster inference across multiple benchmarks. We further provide a theoretical perspective showing that diversity across agents supports recovery of omitted information: although any single compressed history may discard details, aggregating multiple agents' compressed views allows the collective representation to approach the information bottleneck with exponentially high probability.

PaperID: 5958, https://arxiv.org/pdf/2602.00399.pdf

Authors: Armando Alves Neto

Title: Reinforcement Learning for Control Systems with Time Delays: A Comprehensive Survey

Abstract:
In the last decade, Reinforcement Learning (RL) has achieved remarkable success in the control and decision‑making of complex dynamical systems. However, most RL algorithms rely on the Markov Decision Process assumption, which is violated in practical cyber‑physical systems affected by sensing delays, actuation latencies, and communication constraints. Such time delays introduce memory effects that can significantly degrade performance and compromise stability, particularly in networked and multi‑agent environments. This paper presents a comprehensive survey of RL methods designed to address time delays in control systems. We first formalize the main classes of delays and analyze their impact on the Markov property. We then systematically categorize existing approaches into five major families: state augmentation and history‑based representations, recurrent policies with learned memory, predictor‑based and model‑aware methods, robust and domain‑randomized training strategies, and safe RL frameworks with explicit constraint handling. For each family, we discuss underlying principles, practical advantages, and inherent limitations. A comparative analysis highlights key trade‑offs among these approaches and provides practical guidelines for selecting suitable methods under different delay characteristics and safety requirements. Finally, we identify open challenges and promising research directions, including stability certification, large‑delay learning, multi‑agent communication co‑design, and standardized benchmarking. This survey aims to serve as a unified reference for researchers and practitioners developing reliable RL‑based controllers in delay‑affected cyber‑physical systems.

PaperID: 5959, https://arxiv.org/pdf/2602.00380.pdf

Authors: Sercan Karakaş

Title: Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

Abstract:
This study evaluates whether state‑of‑the‑art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced evaluation set of 100 Turkish sentences that systematically pit local against non‑local antecedents for the reflexives kendi and kendisi. We compare two contrasting systems: an OpenAI chain‑of‑thought model optimized for multi‑step reasoning and Trendyol‑LLM‑7B‑base‑v0.1, a LLaMA 2 derived model extensively fine‑tuned on Turkish data. Antecedent choice is assessed using a combined paradigm that integrates sentence‑level perplexity with a forced‑choice comparison between minimally differing continuations. Overall, Trendyol‑LLM favors local bindings in approximately 70 percent of trials, exhibiting a robust locality bias consistent with a preference for structurally proximate antecedents. By contrast, the OpenAI model (o1 Mini) distributes its choices nearly evenly between local and long‑distance readings, suggesting weaker or less consistent sensitivity to locality in this binding configuration. Taken together, these results reveal a marked contrast in binding behavior across the two systems and motivate closer analysis of how model architecture, training data, and inference‑time reasoning strategies shape the representation of Turkish anaphoric dependencies.

PaperID: 5960, https://arxiv.org/pdf/2602.00352.pdf

Authors: Li Siyan, Darshan Deshpande, Anand Kannappan, Rebecca Qian

Title: DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

Abstract:
When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip‑of‑the‑tongue search processes are restricted to single‑turn settings. To more realistically simulate tip‑of‑the‑tongue search, we introduce Dual‑agent based Evaluation Through Obscure Under‑specified Retrieval (DETOUR), a dual‑agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state‑of‑the‑art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.

PaperID: 5961, https://arxiv.org/pdf/2602.00340.pdf

Authors: Alexandros Christoforos, Sarah Jenkins, Michael Brown, Tuan Pham, David Chen

Title: Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception

Abstract:
This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross‑modal alignment degeneration in Vision‑Language Models (VLMs) when encountering Out‑of‑Distribution (OOD) concepts. Specifically, four specialized computational units ‑ visual perception, linguistic context, nominal embedding, and global coordination ‑ collaboratively rectify modality disparities via a structured message‑propagation protocol. The principal contributions encompass a multi‑agent latent space nomenclature acquisition framework, a semantic context‑interchange algorithm for enhanced few‑shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA‑Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few‑shot and zero‑shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.

PaperID: 5962, https://arxiv.org/pdf/2602.00325.pdf

Authors: Andrew F. Thompson, Joshua A. Robbins, Jonah J. Glunt, Sean B. Brennan, Herschel C. Pangborn

Title: Motion Planning with Metric Temporal Logic Using Reachability Analysis and Hybrid Zonotopes

Abstract:
Metric temporal logic (MTL) provides a formal framework for defining time‑dependent mission requirements on autonomous vehicles. However, optimizing control decisions subject to these constraints is often computationally expensive. This article presents a method that uses reachability analysis to implicitly express the set of states satisfying an MTL specification and then optimizes to find a motion plan. The hybrid zonotope set representation is used to efficiently and conveniently encode MTL specifications into reachable sets. A numerical benchmark highlights the proposed method's computational advantages as compared to existing methods in the literature. Further numerical examples and an experimental application demonstrate the ability to address time‑varying environments, region‑dependent disturbances, and multi‑agent coordination.

PaperID: 5963, https://arxiv.org/pdf/2602.00307.pdf

Authors: Udayan Khurana

Title: Autonomous Data Processing using Meta-Agents

Abstract:
Traditional data processing pipelines are typically static and handcrafted for specific tasks, limiting their adaptability to evolving requirements. While general‑purpose agents and coding assistants can generate code for well‑understood data pipelines, they lack the ability to autonomously monitor, manage, and optimize an end‑to‑end pipeline once deployed. We present Autonomous Data Processing using Meta‑agents (ADP‑MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines through hierarchical agent orchestration. At its core, meta‑agents analyze input data and task specifications to design a multi‑phase plan, instantiate specialized ground‑level agents, and continuously evaluate pipeline performance. The architecture comprises three key components: a planning module for strategy generation, an orchestration layer for agent coordination and tool integration, and a monitoring loop for iterative evaluation and backtracking. Unlike conventional approaches, ADP‑MA emphasizes context‑aware optimization, adaptive workload partitioning, and progressive sampling for scalability. Additionally, the framework leverages a diverse set of external tools and can reuse previously designed agents, reducing redundancy and accelerating pipeline construction. We demonstrate ADP‑MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.

PaperID: 5964, https://arxiv.org/pdf/2602.00213.pdf

Authors: Mehul Goenka, Tejas Pathak, Siddharth Asthana

Title: TessPay: Verify-then-Pay Infrastructure for Trusted Agentic Commerce

Abstract:
The global economy is entering the era of Agentic Commerce, where autonomous agents can discover services, negotiate prices, and transact value. However adoption towards agentic commerce faces a foundational trust gap: current systems are built for direct human interactions rather than agent‑driven operations. It lacks core primitives across three critical stages of agentic transactions. First, Task Delegation lacks means to translate user intent into defined scopes, discover appropriate agents, and securely authorize actions. Second, Payment Settlement for tasks is processed before execution, lacking verifiable evidence to validate the agent's work. Third, Audit Mechanisms fail to capture the full transaction lifecycle, preventing clear accountability for disputes. While emerging standards address fragments of this trust gap, there still remains a critical need for a unified infrastructure that binds the entire transaction lifecycle. To resolve this gap, we introduce TessPay, a unified infrastructure that replaces implicit trust with a 'Verify‑then‑Pay' architecture. It is a two plane architecture separating control and verification from settlement. TessPay operationalizes trust across four distinct stages: Before execution, agents are anchored in a canonical registry and user intent is captured as verifiable mandates, enabling stakeholder accountability. During execution, funds are locked in escrow while the agent executes the task and generates cryptographic evidence (TLS Notary, TEE etc.) to support Proof of Task Execution (PoTE). At settlement, the system verifies this evidence and releases funds only when the PoTE satisfies verification predicates; modular rail adapters ensure this PoTE‑gated escrow remains chain‑agnostic across heterogeneous payment rails. After settlement, TessPay preserves a tamper‑evident audit trail to enable clear accountability for dispute resolution.

PaperID: 5965, https://arxiv.org/pdf/2602.00169.pdf

Authors: Huan Zhang, Yizhan Li, Wenhao Huang, Ziyu Hou, Yu Song, Xuye Liu, Farshid Effaty, Jinya Jiang, Sifan Wu, Qianggang Ding, Izumi Takahara, Leonard R. MacGillivray, Teruyasu Mizoguchi, Tianshu Yu, Lizi Liao, Yuyu Luo, Yu Rong, Jia Li, Ying Diao, Heng Ji, Bang Liu

Title: Towards Agentic Intelligence for Materials Science

Abstract:
The convergence of artificial intelligence and materials science presents a transformative opportunity, but achieving true acceleration in discovery requires moving beyond task‑isolated, fine‑tuned models toward agentic systems that plan, act, and learn across the full discovery loop. This survey advances a unique pipeline‑centric view that spans from corpus curation and pretraining, through domain adaptation and instruction tuning, to goal‑conditioned agents interfacing with simulation and experimental platforms. Unlike prior reviews, we treat the entire process as an end‑to‑end system to be optimized for tangible discovery outcomes rather than proxy benchmarks. This perspective allows us to trace how upstream design choices‑such as data curation and training objectives‑can be aligned with downstream experimental success through effective credit assignment. To bridge communities and establish a shared frame of reference, we first present an integrated lens that aligns terminology, evaluation, and workflow stages across AI and materials science. We then analyze the field through two focused lenses: From the AI perspective, the survey details LLM strengths in pattern recognition, predictive analytics, and natural language processing for literature mining, materials characterization, and property prediction; from the materials science perspective, it highlights applications in materials design, process optimization, and the acceleration of computational workflows via integration with external tools (e.g., DFT, robotic labs). Finally, we contrast passive, reactive approaches with agentic design, cataloging current contributions while motivating systems that pursue long‑horizon goals with autonomy, memory, and tool use. This survey charts a practical roadmap towards autonomous, safety‑aware LLM agents aimed at discovering novel and useful materials.

PaperID: 5966, https://arxiv.org/pdf/2602.00164.pdf

Authors: Khairul Alam, Saikat Mondal, Banani Roy

Title: Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study

Abstract:
Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix‑related pull requests (PRs) in real world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of AI agent involved fix related PRs, examining both their integration outcomes, latency, and the factors that hinder successful merging. We first analyze 8,106 fix related PRs authored by five widely used AI coding agents from the AIDEV POP dataset to quantify the proportions of PRs that are merged, closed without merging, or remain open. We then conduct a manual qualitative analysis of a statistically significant sample of 326 closed but unmerged PRs, spending approximately 100 person hours to construct a structured catalog of 12 failure reasons. Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real world settings and highlight directions for their further improvement and for more effective human AI collaboration in software maintenance.

PaperID: 5967, https://arxiv.org/pdf/2602.00127.pdf

Authors: Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, Xiaowu Dai

Title: ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning

Abstract:
LLMs often underperform on complex reasoning tasks when relying on a single generation‑and‑selection pipeline. Inference‑time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers, but they typically treat candidates independently and provide no formal guarantees that ensembling improves reasoning quality. We propose a novel method, Aligned Delegation for Multi‑Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. In ALIGN, a principal delegates a task to multiple agents that generate candidate solutions under designed incentives, and then selects among their outputs to produce a final answer. This formulation induces structured interaction among agents while preserving alignment between agent and principal objectives. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single‑agent generation. Our analysis accommodates correlated candidate answers and relaxes independence assumptions that are commonly used in prior work. Empirical results across a broad range of LLM reasoning benchmarks consistently demonstrate that ALIGN outperforms strong single‑agent and ensemble baselines.

PaperID: 5968, https://arxiv.org/pdf/2602.00103.pdf

Authors: Mahule Roy, Adib Bazgir, Arthur da Silva Sousa Santos, Yuwen Zhang

Title: Autonomous Multi-Agent AI for High-Throughput Polymer Informatics: From Property Prediction to Generative Design Across Synthetic and Bio-Polymers

Abstract:
We present an integrated multiagent AI ecosystem for polymer discovery that unifies high‑throughput materials workflows, artificial intelligence, and computational modeling within a single Polymer Research Lifecycle (PRL) pipeline. The system orchestrates specialized agents powered by state‑of‑the‑art large language models (DeepSeek‑V2 and DeepSeek‑Coder) to retrieve and reason over scientific resources, invoke external tools, execute domain‑specific code, and perform metacognitive self‑assessment for robust end‑to‑end task execution. We demonstrate three practical capabilities: a high‑fidelity polymer property prediction and generative design pipeline, a fully automated multimodal workflow for biopolymer structure characterization, and a metacognitive agent framework that can monitor performance and improve execution strategies over time. On a held‑out test set of 1,251 polymers, our PolyGNN agent achieves strong predictive accuracy, reaching R2 = 0.89 for glass‑transition temperature (Tg ), R2 = 0.82 for tensile strength, R2 = 0.75 for elongation, and R2 = 0.91 for density. The framework also provides uncertainty estimates via multiagent consensus and scales with linear complexity to at least 10,000 polymers, enabling high‑throughput screening at low computational cost. For a representative workload, the system completes inference in 16.3 s using about 2 GB of memory and 0.1 GPU hours, at an estimated cost of about 0.08. On a dedicated Tg benchmark, our approach attains R2 = 0.78, outperforming strong baselines including single‑LLM prediction (R2 = 0.67), group‑contribution methods (R2 = 0.71), and ChemCrow (R2 = 0.66). We further demonstrate metacognitive control in a polystyrene case study, where the system not only produces domain‑level scientific outputs but continually monitors and optimizes its own behavior through tactical, strategic, and meta‑strategic self‑assessment.

PaperID: 5969, https://arxiv.org/pdf/2602.00083.pdf

Authors: Yuxin Yang, Gangda Deng, Ömer Faruk Akgül, Nima Chitsazan, Yash Govilkar, Akasha Tigalappanavara, Shi-Xiong Zhang, Sambit Sahu, Viktor Prasanna

Title: SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) grounds large language model outputs in external evidence, but remains challenged on multi‑hop question answering that requires long reasoning. Recent works scale RAG at inference time along two complementary dimensions: sequential depth for iterative refinement and parallel width for coverage expansion. However, naive scaling causes context contamination and scaling inefficiency, leading to diminishing or negative returns despite increased computation. To address these limitations, we propose SPARC‑RAG, a multi‑agent framework that coordinates sequential and parallel inference‑time scaling under a unified context management mechanism. SPARC‑RAG employs specialized agents that maintain a shared global context and provide explicit control over the scaling process. It generates targeted, complementary sub‑queries for each branch to enable diverse parallel exploration, and explicitly regulates exiting decisions based on answer correctness and evidence grounding. To optimize scaling behavior, we further introduce a lightweight fine‑tuning method with process‑level verifiable preferences, which improves the efficiency of sequential scaling and effectiveness of parallel scaling. Across single‑ and multi‑hop QA benchmarks, SPARC‑RAG consistently outperforms previous RAG baselines, yielding an average +6.2 F1 improvement under lower inference cost.

PaperID: 5970, https://arxiv.org/pdf/2602.00082.pdf

Authors: Zheng Li

Title: Design and Empirical Study of a Large Language Model-Based Multi-Agent Investment System for Chinese Public REITs

Abstract:
This study addresses the low‑volatility Chinese Public Real Estate Investment Trusts (REITs) market, proposing a large language model (LLM)‑driven trading framework based on multi‑agent collaboration. The system constructs four types of analytical agents‑announcement, event, price momentum, and market‑each conducting analysis from different dimensions; then the prediction agent integrates these multi‑source signals to output directional probability distributions across multiple time horizons, then the decision agent generates discrete position adjustment signals based on the prediction results and risk control constraints, thereby forming a closed loop of analysis‑prediction‑decision‑execution. This study further compares two prediction model pathways: for the prediction agent, directly calling the general‑purpose large model DeepSeek‑R1 versus using a specialized small model Qwen3‑8B fine‑tuned via supervised fine‑tuning and reinforcement learning alignment. In the backtest from October 2024 to October 2025, both agent‑based strategies significantly outperformed the buy‑and‑hold benchmark in terms of cumulative return, Sharpe ratio, and maximum drawdown. The results indicate that the multi‑agent framework can effectively enhance the risk‑adjusted return of REITs trading, and the fine‑tuned small model performs close to or even better than the general‑purpose large model in some scenarios.

PaperID: 5971, https://arxiv.org/pdf/2602.00035.pdf

Authors: Sebastian Racedo, Brigitte Jaumard, Oscar Delgado, Meysam Masoudi

Title: Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints

Abstract:
Networks in the current 5G and beyond systems increasingly carry heterogeneous traffic with diverse quality‑of‑service constraints, making real‑time routing decisions both complex and time‑critical. A common approach, such as a heuristic with human intervention or training a single centralized RL policy or synchronizing updates across multiple learners, struggles with scalability and straggler effects. We address this by proposing an asynchronous multi‑agent reinforcement learning (AMARL) framework in which independent PPO agents, one per service, plan routes in parallel and commit resource deltas to a shared global resource environment. This coordination by state preserves feasibility across services and enables specialization for service‑specific objectives. We evaluate the method on an O‑RAN like network simulation using nearly real‑time traffic data from the city of Montreal. We compared against a single‑agent PPO baseline. AMARL achieves a similar Grade of Service (acceptance rate) (GoS) and end‑to‑end latency, with reduced training wall‑clock time and improved robustness to demand shifts. These results suggest that asynchronous, service‑specialized agents provide a scalable and practical approach to distributed routing, with applicability extending beyond the O‑RAN domain.

PaperID: 5972, https://arxiv.org/pdf/2602.00017.pdf

Authors: Benyamin Tabarsi, Wenbo Li, Tahreem Yasir, Aryan Santhosh Kumar, Laura Widman, Dongkuan Xu, Tiffany Barnes

Title: SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

Abstract:
The importance of effective parent‑child communication about sexual health is widely acknowledged, but real‑world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity‑driven multi‑agent dialogue generation framework that simulates parent‑child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd‑sourced and synthesized scenarios, established sexual health guidelines, evidence‑based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

PaperID: 5973, https://arxiv.org/pdf/2601.23219.pdf

Authors: Shuai Shao, Yixiang Liu, Bingwei Lu, Weinan Zhang

Title: MonoScale: Scaling Multi-Agent System with Monotonic Improvement

Abstract:
In recent years, LLM‑based multi‑agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents. A natural way to expand capability is to scale up the agent pool by continually integrating new functional agents or tool interfaces, but naive expansion can trigger performance collapse when the router cold‑starts on newly added, heterogeneous, and unreliable agents. We propose MonoScale, an expansion‑aware update framework that proactively generates a small set of agent‑conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural‑language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust‑region memory updates, yielding a monotonic non‑decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity's Last Exam show stable gains as the agent pool grows, outperforming naive scale‑up and strong‑router fixed‑pool baselines.

PaperID: 5974, https://arxiv.org/pdf/2601.23211.pdf

Authors: Paulius Rauba, Simonas Cepenas, Mihaela van der Schaar

Title: Multi-Agent Systems Should be Treated as Principal-Agent Problems

Abstract:
Consider a multi‑agent systems setup in which a principal (a supervisor agent) assigns subtasks to specialized agents and aggregates their responses into a single system‑level output. A core property of such systems is information asymmetry: agents observe task‑specific information, produce intermediate reasoning traces, and operate with different context windows. In isolation, such asymmetry is not problematic, since agents report truthfully to the principal when incentives are fully aligned. However, this assumption breaks down when incentives diverge. Recent evidence suggests that LLM‑based agents can acquire their own goals, such as survival or self‑preservation, a phenomenon known as scheming, and may deceive humans or other agents. This leads to agency loss: a gap between the principal's intended outcome and the realized system behavior. Drawing on core ideas from microeconomic theory, we argue that these characteristics, information asymmetry and misaligned goals, are best studied through the lens of principal‑agent problems. We explain why multi‑agent systems, both human‑to‑LLM and LLM‑to‑LLM, naturally induce information asymmetry under this formulation, and we use scheming, where LLM agents pursue covert goals, as a concrete case study. We show that recently introduced terminology used to describe scheming, such as covert subversion or deferred subversion, corresponds to well‑studied concepts in the mechanism design literature, which not only characterizes the problem but also prescribes concrete mitigation strategies. More broadly, we argue for applying tools developed to study human agent behavior to the analysis of non‑human agents.

PaperID: 5975, https://arxiv.org/pdf/2601.23188.pdf

Authors: Zhongxiang Sun, Qipeng Wang, Weijie Yu, Jingxuan Yang, Haolang Lu, Jun Xu

Title: Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience

Abstract:
Deep search agents powered by large language models have demonstrated strong capabilities in multi‑step retrieval, reasoning, and long‑horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience‑driven reflection. In this work, we propose Deep Search with Meta‑Cognitive Monitoring (DS‑MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS‑MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience‑Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning‑retrieval loop, DS‑MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS‑MCM consistently improves performance and robustness.

PaperID: 5976, https://arxiv.org/pdf/2601.23088.pdf

Authors: Zhixiang Zhang, Zesen Liu, Yuchong Xie, Quanfeng Huang, Dongdong She

Abstract:
Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes latency and redundant computation for semantically similar queries. In this work, we conceptualize semantic cache keys as a form of fuzzy hashes. We demonstrate that the locality required to maximize cache hit rates fundamentally conflicts with the cryptographic avalanche effect necessary for collision resistance. Our conceptual analysis formalizes this inherent trade‑off between performance (locality) and security (collision resilience), revealing that semantic caching is naturally vulnerable to key collision attacks. While prior research has focused on side‑channel and privacy risks, we present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black‑box collision attacks. We evaluate CacheAttack in security‑critical tasks and agentic workflows. It achieves a hit rate of 86% in LLM response hijacking and can induce malicious behaviors in LLM agent, while preserving strong transferability across different embedding models. A case study on a financial agent further illustrates the real‑world impact of these vulnerabilities. Finally, we discuss mitigation strategies.

PaperID: 5977, https://arxiv.org/pdf/2601.22974.pdf

Authors: XiaoJie Zhang, JianHan Wu, Xiaoyang Qu, Jianzong Wang

Title: MiTa: A Hierarchical Multi-Agent Collaboration Framework with Memory-integrated and Task Allocation

Abstract:
Recent advances in large language models (LLMs) have substantially accelerated the development of embodied agents. LLM‑based multi‑agent systems mitigate the inefficiency of single agents in complex tasks. However, they still suffer from issues such as memory inconsistency and agent behavioral conflicts. To address these challenges, we propose MiTa, a hierarchical memory‑integrated task allocative framework to enhance collaborative efficiency. MiTa organizes agents into a manager‑member hierarchy, where the manager incorporates additional allocation and summary modules that enable (1) global task allocation and (2) episodic memory integration. The allocation module enables the manager to allocate tasks from a global perspective, thereby avoiding potential inter‑agent conflicts. The summary module, triggered by task progress updates, performs episodic memory integration by condensing recent collaboration history into a concise summary that preserves long‑horizon context. By combining task allocation with episodic memory, MiTa attains a clearer understanding of the task and facilitates globally consistent task distribution. Experimental results confirm that MiTa achieves superior efficiency and adaptability in complex multi‑agent cooperation over strong baseline methods.

PaperID: 5978, https://arxiv.org/pdf/2601.22952.pdf

Authors: Yunpeng Xiong, Ting Zhang

Title: Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Abstract:
Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM‑based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state‑of‑the‑art LLM‑based agent frameworks, i.e., Aider, OpenHands, and SWE‑agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real‑world open‑source Java projects. The experimental results show that LLM‑based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real‑world dataset, the best configuration of LLM‑based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone‑ and CWE‑dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT‑5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade‑offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM‑based agents are a powerful but non‑uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.

PaperID: 5979, https://arxiv.org/pdf/2601.22812.pdf

Authors: Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Thomas Kosch, Sebastian Pokutta

Title: Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation

Abstract:
Large Language Models (LLMs) acting as artificial agents offer the potential for scalable behavioral research, yet their validity depends on whether LLMs can maintain stable personas across extended conversations. We address this point using a dual‑assessment framework measuring both self‑reported characteristics and observer‑rated persona expression. Across two experiments testing four persona conditions (default, high, moderate, and low ADHD presentations), seven LLMs, and three semantically equivalent persona prompts, we examine between‑conversation stability (3,473 conversations) and within‑conversation stability (1,370 conversations and 18 turns). Self‑reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations. These findings suggest that persona‑instructed LLMs produce stable, persona‑aligned self‑reports, an important prerequisite for behavioral research, while identifying this regression tendency as a boundary condition for multi‑agent social simulation.

PaperID: 5980, https://arxiv.org/pdf/2601.22788.pdf

Authors: Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Moritz Igel, Konstantin Fackeldey, Sebastian Pokutta

Title: FACET: Multi-Agent AI Supporting Teachers in Scaling Differentiated Learning for Diverse Students

Abstract:
Classrooms are becoming increasingly heterogeneous, comprising learners with diverse performance and motivation levels, language proficiencies, and learning differences such as dyslexia and ADHD. While teachers recognize the need for differentiated instruction, growing workloads create substantial barriers, making differentiated instruction an ideal that is often unrealized in practice. Current AI educational tools, which promise differentiated materials, are predominantly student‑facing and performance‑centric, ignoring other aspects that shape learning outcomes. We introduce FACET, a teacher‑facing multi‑agent framework designed to address these gaps by supporting differentiation that accounts for motivation, performance, and learning differences. Developed with educational stakeholders from the outset, the framework coordinates four specialized agents, including learner simulation, diagnostic assessment, material generation, and evaluation within a teacher‑in‑the‑loop design. School principals (N = 30) shaped system requirements through participatory workshops, while in‑service K‑12 teachers (N = 70) evaluated material quality. Mixed‑methods evaluation demonstrates strong perceived value for inclusive differentiation. Practitioners emphasized both the urgent need arising from classroom heterogeneity and the importance of maintaining pedagogical autonomy as a prerequisite for adoption. We discuss implications for future school deployment and outline partnerships for longitudinal classroom implementation.

PaperID: 5981, https://arxiv.org/pdf/2601.22781.pdf

Authors: Linjia Kang, Zhimin Wang, Yongkang Zhang, Duo Wu, Jinghe Wang, Ming Ma, Haopeng Yan, Zhi Wang

Title: Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

Abstract:
Large‑scale, high‑quality interaction trajectories are essential for advancing mobile Graphical User Interface (GUI) agents. While existing methods typically rely on labor‑intensive human demonstrations or automated model exploration to generate GUI trajectories, they lack fine‑grained control over task difficulty. This fundamentally restricts learning effectiveness due to the mismatch between the training difficulty and the agent's capabilities. Inspired by how humans acquire skills through progressively challenging tasks, we propose MobileGen, a novel data generation framework that adaptively aligns training difficulty with the GUI agent's capability frontier. Specifically, MobileGen explicitly decouples task difficulty into structural (e.g., trajectory length) and semantic (e.g., task goal) dimensions. It then iteratively evaluates the agent on a curated prior dataset to construct a systematic profile of its capability frontier across these two dimensions. With this profile, the probability distribution of task difficulty is adaptively computed, from which the target difficulty for the next round of training can be sampled. Guided by the sampled difficulty, a multi‑agent controllable generator is finally used to synthesize high‑quality interaction trajectories along with corresponding task instructions. Extensive experiments show that MobileGen consistently outperforms existing data generation methods by improving the average performance of GUI agents by 1.57 times across multiple challenging benchmarks. This highlights the importance of capability‑aligned data generation for effective mobile GUI agent training.

PaperID: 5982, https://arxiv.org/pdf/2601.22758.pdf

Authors: Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Shuo Tang

Title: AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement

Abstract:
Large language model agents often fail to accumulate knowledge from experience, treating each task as an independent challenge. Recent methods extract experience as flattened textual knowledge, which cannot capture procedural logic of complex subtasks. They also lack maintenance mechanisms, causing repository degradation as experience accumulates. We introduce AutoRefine, a framework that extracts and maintains dual‑form Experience Patterns from agent execution histories. For procedural subtasks, we extract specialized subagents with independent reasoning and memory. For static knowledge, we extract skill patterns as guidelines or code snippets. A continuous maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, AutoRefine achieves 98.4%, 70.4%, and 27.1% respectively, with 20‑73% step reductions. On TravelPlanner, automatic extraction exceeds manually designed systems (27.1% vs 12.1%), demonstrating its ability to capture procedural coordination.

PaperID: 5983, https://arxiv.org/pdf/2601.22638.pdf

Authors: Palash Goyal, Mihir Parmar, Yiwen Song, Hamid Palangi, Tomas Pfister, Jinsung Yoon

Title: ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review

Abstract:
The exponential growth of machine learning submissions has strained the traditional peer review process, resulting in slow feedback loops for authors and an immense burden on reviewers to rigorously audit technical soundness and verify literature. To address this, we introduce ScholarPeer, a multi‑agent framework designed to operationalize the rigorous auditing workflow of a senior researcher. Rather than attempting to replace human judgment, ScholarPeer serves as a co‑scientist: acting as a mentor for rapid author iteration prior to submission, and as an active verification assistant that augments human reviewers. The framework structurally decouples contextualization from critique by deploying a sub‑domain historian to synthesize the field's trajectory, a baseline scout to proactively hunt for omitted state‑of‑the‑art comparisons, and a multi‑aspect Q&A engine that deeply audits technical soundness‑scrutinizing internal logical consistency, experimental validity, and mathematical rigor‑while cross‑referencing claims against top‑tier academic venues. We comprehensively evaluate ScholarPeer on ~1,800 ICLR submissions spanning 2020 through 2025. Our results show that ScholarPeer achieves significant win‑rates against state‑of‑the‑art fine‑tuned models and search‑augmented agentic baselines.

PaperID: 5984, https://arxiv.org/pdf/2601.22623.pdf

Authors: Wei Zhu, Zhiwen Tang, Kun Yue

Title: SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

Abstract:
Recent advancements have increasingly focused on leveraging large language models (LLMs) to construct autonomous agents for complex problem‑solving tasks. However, existing approaches predominantly employ a single‑agent framework to generate search branches and estimate rewards during Monte Carlo Tree Search (MCTS) planning. This single‑agent paradigm inherently limits exploration capabilities, often resulting in insufficient diversity among generated branches and suboptimal planning performance. To overcome these limitations, we propose Synergistic Multi‑agent Planning with Heterogeneous langauge model assembly (SYMPHONY), a novel multi‑agent planning framework that integrates a pool of heterogeneous language model‑based agents. By leveraging diverse reasoning patterns across agents, SYMPHONY enhances rollout diversity and facilitates more effective exploration. Empirical results across multiple benchmark tasks show that SYMPHONY achieves strong performance even when instantiated with open‑source LLMs deployable on consumer‑grade hardware. When enhanced with cloud‑based LLMs accessible via API, SYMPHONY demonstrates further improvements, outperforming existing state‑of‑the‑art baselines and underscoring the effectiveness of heterogeneous multi‑agent coordination in planning tasks.

PaperID: 5985, https://arxiv.org/pdf/2601.22607.pdf

Authors: Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin, Yi Wu

Title: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Abstract:
Interactive tool‑using agents must solve real‑world tasks via multi‑turn interaction with both humans and external environments, requiring dialogue state tracking, multi‑step tool execution, while following complex instructions. Post‑training such agents is challenging because synthesis for high‑quality multi‑turn tool‑use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self‑evolving data agent with verifier‑based RL. Our system, EigenData, is a hierarchical multi‑agent engine that synthesizes tool‑grounded dialogues together with executable per‑instance checkers, and improves generation reliability via closed‑loop self‑evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine‑tunes the user model and then applies GRPO‑style training with trajectory‑level group‑relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2‑bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool‑using behaviors without expensive human annotation.

PaperID: 5986, https://arxiv.org/pdf/2601.22530.pdf

Authors: Tung Sum Thomas Kwok, Xinyu Wang, Hengzhi He, Xiaofeng Lin, Peng Lu, Liheng Ma, Chunhe Wang, Chun Ho Mak, Yuyu Luo, Ying Nian Wu, Lei Ding, Guang Cheng

Title: Enhancing Table Reasoning with Deterministic Table-State Rewards

Abstract:
Large Language Models (LLMs) struggle with multi‑step reasoning over structured tables. The primary reason is the lack of explicit supervision for intermediate reasoning states. Existing learned reward models or executor‑based verifiers are either unscalable or rely on answer‑checking environments unavailable for many tabular tasks. This leaves no signal that is scalable and grounded in the query. To address this, we introduce TABROUGE, a training‑free and deterministic state reward. By adapting the Longest Common Subsequence (LCS) metric from text summarization to evaluate tabular states, TABROUGE assesses the lexical coverage and structural integrity of intermediate tables against the query without requiring learned models or external executors. Built upon this metric, we propose RE‑TAB, a plug‑and‑play, training‑free framework. RE‑TAB reframes table reasoning as deterministic control over intermediate states, utilizing TABROUGE for stepwise feedback and trajectory‑level test‑time scaling (TTS) signals. Across six backbones and three benchmarks, RE‑TAB improves accuracy by an average of 26.7 pp over no‑reward baselines. It also reduces TTS samples by up to 33%. Preliminary GRPO experiments further indicate TABROUGE's viability as a scalable post‑training reward, increasing gains by 8.34 pp. We further analyze failure modes of TABROUGE, including paraphrase under‑rewarding and echo‑column hacking, and identify when structure‑aware lexical rewards remain reliable.

PaperID: 5987, https://arxiv.org/pdf/2601.22517.pdf

Authors: Kangning Yin, Zhe Cao, Wentao Dong, Weishuai Zeng, Tianyi Zhang, Qiang Zhang, Jingbo Wang, Jiangmiao Pang, Ming Zhou, Weinan Zhang

Title: RoboStriker: Hierarchical Decision-Making for Autonomous Humanoid Boxing

Abstract:
Achieving human‑level competitive intelligence and physical agility in humanoid robots remains a major challenge, particularly in contact‑rich and highly dynamic tasks such as boxing. While Multi‑Agent Reinforcement Learning (MARL) offers a principled framework for strategic interaction, its direct application to humanoid control is hindered by high‑dimensional contact dynamics and the absence of strong physical motion priors. We propose RoboStriker, a hierarchical three‑stage framework that enables fully autonomous humanoid boxing by decoupling high‑level strategic reasoning from low‑level physical execution. The framework first learns a comprehensive repertoire of boxing skills by training a single‑agent motion tracker on human motion capture data. These skills are subsequently distilled into a structured latent manifold, regularized by projecting the Gaussian‑parameterized distribution onto a unit hypersphere. This topological constraint effectively confines exploration to the subspace of physically plausible motions. In the final stage, we introduce Latent‑Space Neural Fictitious Self‑Play (LS‑NFSP), where competing agents learn competitive tactics by interacting within the latent action space rather than the raw motor space, significantly stabilizing multi‑agent training. Experimental results demonstrate that RoboStriker achieves superior competitive performance in simulation and exhibits sim‑to‑real transfer. Our website is available at RoboStriker.

PaperID: 5988, https://arxiv.org/pdf/2601.22467.pdf

Authors: Jiaqi Shi, Xulong Zhang, Xiaoyang Qu, Jianzong Wang

Title: CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control

Abstract:
Recent advances in Vision‑Language‑Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a novel framework designed to train VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video‑text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi‑task pretraining objective. During fine‑tuning, a small set of labeled data is used to train the action head for control. Experimental results across various simulation tasks demonstrate CARE's superior success rate, semantic interpretability, and ability to avoid shortcut learning. These results underscore CARE's scalability, interpretability, and effectiveness in robotic control with weak supervision.

PaperID: 5989, https://arxiv.org/pdf/2601.22436.pdf

Authors: Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, Ting Liu

Title: Large Language Model Agents Are Not Always Faithful Self-Evolvers

Abstract:
Self‑evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self‑evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single‑ and multi‑agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self‑evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

PaperID: 5990, https://arxiv.org/pdf/2601.22386.pdf

Authors: Jamiu Adekunle Idowu, Ahmed Almasoud

Title: Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading

Abstract:
Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single‑agent and multi‑agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi‑agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric‑aligned logic including veto rules and score capping. We test both architectures in zero‑shot and few‑shot conditions using GPT‑5.1. Results show that the multi‑agent system is significantly better at identifying weak essays while the single‑agent system performs better on mid‑range essays. Both architectures struggle with high‑quality essays. Critically, few‑shot calibration emerges as the dominant factor in system performance ‑‑ providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi‑agent AI particularly suited for diagnostic screening of at‑risk students, while single‑agent models provide a cost‑effective solution for general assessment.

PaperID: 5991, https://arxiv.org/pdf/2601.22369.pdf

Authors: Yujie Hui, Xiaoyi Lu, Andrew Perrault, Yang Wang

Title: Learning Provably Correct Distributed Protocols Without Human Knowledge

Abstract:
Provably correct distributed protocols, which are a critical component of modern distributed systems, are highly challenging to design and have often required decades of human effort. These protocols allow multiple agents to coordinate to come to a common agreement in an environment with uncertainty and failures. We formulate protocol design as a search problem over strategies in a game with imperfect information, and the desired correctness conditions are specified in Satisfiability Modulo Theories (SMT). However, standard methods for solving multi‑agent games fail to learn correct protocols in this setting, even when the number of agents is small. We propose a learning framework, GGMS, which integrates a specialized variant of Monte Carlo Tree Search with a transformer‑based action encoder, a global depth‑first search to break out of local minima, and repeated feedback from a model checker. Protocols output by GGMS are verified correct via exhaustive model checking for all executions within the bounded setting. We further prove that, under mild assumptions, the search process is complete: if a correct protocol exists, GGMS will eventually find it. In experiments, we show that GGMS can learn correct protocols for larger settings than existing methods.

PaperID: 5992, https://arxiv.org/pdf/2601.22361.pdf

Authors: Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K. P. Subbalakshmi

Title: MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment

Abstract:
Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact‑checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub‑claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory‑enhanced multi‑agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent‑driven search, structured knowledge representations, and a persistent memory module within a Reason‑Action style iterative process, enabling dynamic evidence acquisition and cross‑claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact‑checking benchmarks and two claim‑verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state‑of‑the‑art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.

PaperID: 5993, https://arxiv.org/pdf/2601.22324.pdf

Authors: Silas Ruhrberg Estévez, Christopher Chiu, Mihaela van der Schaar

Title: Automatic Construction of Clinical Scoring Systems with LLM Agents

Abstract:
Modern clinical practice relies on evidence‑based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine‑learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit‑weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data‑grounded verification‑and‑selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score‑generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline‑based scores.

PaperID: 5994, https://arxiv.org/pdf/2601.22311.pdf

Authors: Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, Yanfang Ye

Title: Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

Abstract:
Large language model (LLM)‑based agents exhibit strong step‑by‑step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons. We argue that this failure reflects a fundamental mismatch: step‑wise reasoning induces a form of step‑wise greedy policy that is adequate for short horizons but fails in long‑horizon planning, where early actions must account for delayed consequences. From this planning‑centric perspective, we study LLM‑based agents in deterministic, fully structured environments with explicit state transitions and evaluation signals. Our analysis reveals a core failure mode of reasoning‑based policies: locally optimal choices induced by step‑wise scoring lead to early myopic commitments that are systematically amplified over time and difficult to recover from. We introduce FLARE (Future‑aware Lookahead with Reward Estimation) as a minimal instantiation of future‑aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model, allowing downstream outcomes to influence early decisions. Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning‑level behavior, frequently allowing LLaMA‑8B with FLARE to outperform GPT‑4o with standard step‑by‑step reasoning. These results establish a clear distinction between reasoning and planning.

PaperID: 5995, https://arxiv.org/pdf/2601.22297.pdf

Authors: Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

Title: Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

Abstract:
The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi‑Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self‑Debate Reinforcement Learning(SDRL), a training framework where models learn from self‑debate, equipping a single LLM with both strong standalone problem‑solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second‑turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate‑conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single‑model reasoning.

PaperID: 5996, https://arxiv.org/pdf/2601.22209.pdf

Authors: Xinyuan Song, Liang Zhao

Title: Learning to Recommend Multi-Agent Subgraphs from Calling Trees

Abstract:
Multi‑agent systems (MAS) increasingly solve complex tasks by orchestrating agents and tools selected from rapidly growing marketplaces. As these marketplaces expand, many candidates become functionally overlapping, making selection not just a retrieval problem: beyond filtering relevant agents, an orchestrator must choose options that are reliable, compatible with the current execution context, and able to cooperate with other selected agents. Existing recommender systems ‑‑ largely built for item‑level ranking from flat user‑item logs ‑‑ do not directly address the structured, sequential, and interaction‑dependent nature of agent orchestration. We address this gap by formulating agent recommendation in MAS as a constrained decision problem and introducing a generic constrained recommendation framework that first uses retrieval to build a compact candidate set conditioned on the current subtask and context, and then performs utility optimization within this feasible set using a learned scorer that accounts for relevance, reliability, and interaction effects. We ground both the formulation and learning signals in historical calling trees, which capture the execution structure of MAS (parent‑child calls, branching dependencies, and local cooperation patterns) beyond what flat logs provide. The framework supports two complementary settings: agent‑level recommendation (select the next agent/tool) and system‑level recommendation (select a small, connected agent team/subgraph for coordinated execution). To enable systematic evaluation, we construct a unified calling‑tree benchmark by normalizing invocation logs from eight heterogeneous multi‑agent corpora into a shared structured representation.

PaperID: 5997, https://arxiv.org/pdf/2601.22208.pdf

Authors: Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki

Title: Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Abstract:
Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud‑based systems often complicates RCA efforts, particularly for multi‑hop fault propagation, where symptoms appear far from their true causes. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance automated RCA. However, their practical value for RCA depends on the fidelity of reasoning and decision‑making. Existing work relies on historical incident corpora, operates directly on high‑volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi‑agent pipelines ‑‑ conditions that obscure whether failures arise from reasoning itself or from peripheral design choices. We present a focused empirical evaluation that isolates an LLM's reasoning behavior. We design a controlled experimental framework that foregrounds the LLM by using a simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan‑and‑Execute) and a non‑agentic baseline on two real‑world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totaling 228 days of execution time. We measure both root‑cause accuracy and the quality of intermediate reasoning traces. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM‑as‑a‑Judge for annotation. Our results clarify where current open‑source LLMs succeed and fail in multi‑hop RCA, quantify sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent and reproducible empirical results and a failure taxonomy to guide future work on reasoning‑driven system diagnosis.

PaperID: 5998, https://arxiv.org/pdf/2601.22168.pdf

Authors: Shengwei You, Aditya Joshi, Andrey Kuehlkamp, Jarek Nabrzyski

Title: Stablecoin Design with Adversarial-Robust Multi-Agent Systems via Trust-Weighted Signal Aggregation

Abstract:
Algorithmic stablecoins promise decentralized monetary stability by maintaining a target peg through programmatic reserve management. Yet, their reserve controllers remain vulnerable to regime‑blind optimization, calibrating risk parameters on fair‑weather data while ignoring tail events that precipitate cascading failures. The March 2020 Black Thursday collapse, wherein MakerDAO's collateral auctions yielded 8.3M in losses and a 15% peg deviation, exposed a critical gap: existing models like SAS systematically omit extreme volatility regimes from covariance estimates, producing allocations optimal in expectation but catastrophic under adversarial stress. We present MVF‑Composer, a trust‑weighted Mean‑Variance Frontier reserve controller incorporating a novel Stress Harness for risk‑state estimation. Our key insight is deploying multi‑agent simulations as adversarial stress‑testers: heterogeneous agents (traders, liquidity providers, attackers) execute protocol actions under crisis scenarios, exposing reserve vulnerabilities before they manifest on‑chain. We formalize a trust‑scoring mechanism T: A ‑> [0,1] that down‑weights signals from agents exhibiting manipulative behavior, ensuring the risk‑state estimator remains robust to signal injection and Sybil attacks. Across 1,200 randomized scenarios with injected Black‑Swan shocks (10% collateral drawdown, 50% sentiment collapse, coordinated redemption attacks), MVF‑Composer reduces peak peg deviation by 57% and mean recovery time by 3.1x relative to SAS baselines. Ablation studies confirm the trust layer accounts for 23% of stability gains under adversarial conditions, achieving 72% adversarial agent detection. Our system runs on commodity hardware, requires no on‑chain oracles beyond standard price feeds, and provides a reproducible framework for stress‑testing DeFi reserve policies.

PaperID: 5999, https://arxiv.org/pdf/2601.22129.pdf

Authors: Yifeng Ding, Lingming Zhang

Title: SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Abstract:
Test‑time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE‑Replay, the first efficient and generalizable test‑time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE‑Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM‑based quality estimates. Our evaluation shows that, on SWE‑Bench Verified, SWE‑Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE‑Bench Pro and Multilingual validates the generalizability of SWE‑Replay, establishing it as a robust foundation for efficient test‑time scaling of software engineering agents.

PaperID: 6000, https://arxiv.org/pdf/2601.22041.pdf

Authors: Naomi Pitzer, Daniela Mihai

Title: Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems

Abstract:
Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real‑world settings. We study a heterogeneous multi‑step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class‑consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit's contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine‑tuning enables successful cross‑system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.

PaperID: 6001, https://arxiv.org/pdf/2601.22027.pdf

Authors: Johannes Kirmayr, Lukas Stappen, Elisabeth André

Title: CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Abstract:
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real‑world, user‑facing applications. In domains, such as in‑car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR‑bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi‑turn, tool‑using LLM agents in an in‑car assistant domain. The environment features an LLM‑simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR‑bench introduces Hallucination tasks that test agents' limit‑awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self‑aware LLM agents in real‑world settings.

PaperID: 6002, https://arxiv.org/pdf/2601.22001.pdf

Authors: Yiren Zhao, Junyi Liu

Title: Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference

Abstract:
AI agent inference is driving an inference heavy datacenter future and exposes bottlenecks beyond compute ‑ especially memory capacity, memory bandwidth and high‑speed interconnect. We introduce two metrics ‑ Operational Intensity (OI) and Capacity Footprint (CF) ‑ that jointly explain regimes the classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base model choices (GQA/MLA, MoE, quantization), OI/CF can shift dramatically, with long context KV cache making decode highly memory bound. These observations motivate disaggregated serving and system level heterogeneity: specialized prefill and decode accelerators, broader scale up networking, and decoupled compute‑memory enabled by optical I/O. We further hypothesize agent‑hardware co design, multiple inference accelerators within one system, and high bandwidth, large capacity memory disaggregation as foundations for adaptation to evolving OI/CF. Together, these directions chart a path to sustain efficiency and capability for large scale agentic AI inference.

PaperID: 6003, https://arxiv.org/pdf/2601.21988.pdf

Authors: Fernando Palafox, Jingqi Li, Jesse Milzman, David Fridovich-Keil

Title: Generalized Information Gathering Under Dynamics Uncertainty

Abstract:
An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information‑gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information‑gathering cost based on Massey's directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information‑based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi‑agent systems.

PaperID: 6004, https://arxiv.org/pdf/2601.21972.pdf

Authors: Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

Title: Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Abstract:
Recent work has explored optimizing LLM collaboration through Multi‑Agent Reinforcement Learning (MARL). However, most MARL fine‑tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine‑tuning, which suffer from high variance and thus require more samples to train effectively. Actor‑critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi‑Agent Actor‑Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM‑CC with a Centralized Critic and CoLLM‑DC with Decentralized Critics. Our experiments across writing, coding, and game‑playing domains show that Monte Carlo methods and CoLLM‑DC can achieve performance comparable to CoLLM‑CC in short‑horizon and dense‑reward settings. However, they both underperform CoLLM‑CC on long‑horizon or sparse‑reward tasks, where Monte Carlo methods require substantially more samples and CoLLM‑DC struggles to converge.

PaperID: 6005, https://arxiv.org/pdf/2601.21936.pdf

Authors: Jon Chun, Kathrine Elkins, Yong Suk Lee

Title: AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making

Abstract:
We introduce AgenticSimLaw, a role‑structured, multi‑agent debate framework that provides transparent and controllable test‑time reasoning for high‑stakes tabular decision‑making tasks. Unlike black‑box approaches, our courtroom‑style orchestration explicitly defines agent roles (prosecutor, defense, judge), interaction protocols (7‑turn structured debate), and private reasoning strategies, creating a fully auditable decision‑making process. We benchmark this framework on young adult recidivism prediction using the NLSY97 dataset, comparing it against traditional chain‑of‑thought (CoT) prompting across almost 90 unique combinations of models and strategies. Our results demonstrate that structured multi‑agent debate provides more stable and generalizable performance compared to single‑agent reasoning, with stronger correlation between accuracy and F1‑score metrics. Beyond performance improvements, AgenticSimLaw offers fine‑grained control over reasoning steps, generates complete interaction transcripts for explainability, and enables systematic profiling of agent behaviors. While we instantiate this framework in the criminal justice domain to stress‑test reasoning under ethical complexity, the approach generalizes to any deliberative, high‑stakes decision task requiring transparency and human oversight. This work addresses key LLM‑based multi‑agent system challenges: organization through structured roles, observability through logged interactions, and responsibility through explicit non‑deployment constraints for sensitive domains. Data, results, and code will be available on github.com under the MIT license.

PaperID: 6006, https://arxiv.org/pdf/2601.21919.pdf

Authors: Yiqun Chen, Jinyuan Feng, Wei Yang, Meizhi Zhong, Zhengliang Shi, Rui Li, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Zhiqiang Pu, Jiaxin Mao

Title: Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

Abstract:
The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)‑based solutions tackle this problem by coupling a length penalty with outcome‑based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi‑agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self‑Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance‑weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.

PaperID: 6007, https://arxiv.org/pdf/2601.21916.pdf

Authors: Yiqun Chen, Erhan Zhang, Tianyi Hu, Shijie Wang, Zixuan Yang, Meizhi Zhong, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

Title: JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG

Abstract:
The evolution of Retrieval‑Augmented Generation (RAG) has shifted from static retrieval pipelines to dynamic, agentic workflows where a central planner orchestrates multi‑turn reasoning. However, existing paradigms face a critical dichotomy: they either optimize modules jointly within rigid, fixed‑graph architectures, or empower dynamic planning while treating executors as frozen, black‑box tools. We identify that this decoupled optimization creates a ``strategic‑operational mismatch,'' where sophisticated planning strategies fail to materialize due to unadapted local executors, often leading to negative performance gains despite increased system complexity. In this paper, we propose JADE (Joint Agentic Dynamic Execution), a unified framework for the joint optimization of planning and execution within dynamic, multi‑turn workflows. By modeling the system as a cooperative multi‑agent team unified under a single shared backbone, JADE enables end‑to‑end learning driven by outcome‑based rewards. This approach facilitates co‑adaptation: the planner learns to operate within the capability boundaries of the executors, while the executors evolve to align with high‑level strategic intent. Empirical results demonstrate that JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling a flexible balance between efficiency and effectiveness through dynamic workflow orchestration.

PaperID: 6008, https://arxiv.org/pdf/2601.21894.pdf

Authors: Lukas Twist, Shu Yang, Hanqi Yan, Jingzhi Gong, Di Wang, Helen Yannakoudakis, Jie M. Zhang

Title: Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning

Abstract:
Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain‑of‑thought‑style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi‑step reasoning during fine‑tuning. We examine two complementary settings: solution‑driven complexity, where complexity varies across multiple solutions to the same problem, and problem‑driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine‑tuning datasets, we evaluate a range of open‑weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine‑tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data‑centric path for improving reasoning beyond scaling.

PaperID: 6009, https://arxiv.org/pdf/2601.21879.pdf

Authors: Rem Collier, Katharine Beaumont, Andrei Ciortea

Title: astra-langchain4j: Experiences Combining LLMs and Agent Programming

Abstract:
Given the emergence of Generative AI over the last two years and the increasing focus on Agentic AI as a form of Multi‑Agent System it is important to explore both how such technologies can impact the use of traditional Agent Toolkits and how the wealth of experience encapsulated in those toolkits can influence the design of the new agentic platforms. This paper presents an overview of our experience developing a prototype large language model (LLM) integration for the ASTRA programming language. It presents a brief overview of the toolkit, followed by three example implementations, concluding with a discussion of the experiences garnered through the examples.

PaperID: 6010, https://arxiv.org/pdf/2601.21861.pdf

Authors: Chuan-Chi Lai

Title: Spatiotemporal Continual Learning for Mobile Edge UAV Networks: Mitigating Catastrophic Forgetting

Abstract:
This paper addresses catastrophic forgetting in mobile edge UAV networks within dynamic spatiotemporal environments. Conventional deep reinforcement learning often fails during task transitions, necessitating costly retraining to adapt to new user distributions. We propose the spatiotemporal continual learning (STCL) framework, realized through the group‑decoupled multi‑agent proximal policy optimization (G‑MAPPO) algorithm. The core innovation lies in the integration of a group‑decoupled policy optimization (GDPO) mechanism with a gradient orthogonalization layer to balance heterogeneous objectives including energy efficiency, user fairness, and coverage. This combination employs dynamic z‑score normalization and gradient projection to mitigate conflicts without offline resets. Furthermore, 3D UAV mobility serves as a spatial compensation layer to manage extreme density shifts. Simulations demonstrate that the STCL framework ensures resilience, with service reliability recovering to over 0.9 for moderate loads of up to 100 users. Even under extreme saturation with 140 users, G‑MAPPO maintains a significant performance lead over the multi‑agent deep deterministic policy gradient (MADDPG) baseline by preventing policy stagnation. The algorithm delivers an effective capacity gain of 20 percent under high traffic loads, validating its potential for scalable aerial edge swarms.

PaperID: 6011, https://arxiv.org/pdf/2601.21826.pdf

Authors: Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Xiaomin Lin

Title: Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Abstract:
As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long‑context benchmarks that require selective reading and integration of heterogeneous, multi‑modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large‑scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario‑level dataset of expert‑authored, multi‑hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high‑stakes decision‑making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long‑horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi‑step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision‑language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario‑level long‑context planning, and positioning MilSCORE as a challenging testbed for future work.

PaperID: 6012, https://arxiv.org/pdf/2601.21822.pdf

Authors: Zitong Yu, Boquan Sun, Yang Li, Zheyan Qu, Xing Zhang

Title: CORE:Toward Ubiquitous 6G Intelligence Through Collaborative Orchestration of Large Language Model Agents Over Hierarchical Edge

Abstract:
Rapid advancements in sixth‑generation (6G) networks and large language models (LLMs) have paved the way for ubiquitous intelligence, wherein seamless connectivity and distributed artificial intelligence (AI) have revolutionized various aspects of our lives.However, realizing this vision faces significant challenges owing to the fragmented and heterogeneous computing resources across hierarchical networks, which are insufficient for individual LLM agents to perform complex reasoning tasks.To address this issue, we propose Collaborative Orchestration Role at Edge (CORE), an innovative framework that employs a collaborative learning system in which multiple LLMs, each assigned a distinct functional role, are distributed across mobile devices and tiered edge servers. The system integrates three optimization modules, encompassing real‑time perception,dynamic role orchestration, and pipeline‑parallel execution, to facilitate efficient and rapid collaboration among distributed agents. Furthermore, we introduce a novel role affinity scheduling algorithm for dynamically orchestrating LLM role assignments across the hierarchical edge infrastructure, intelligently matching computational demands with available dispersed resources.Finally, comprehensive case studies and performance evaluations across various 6G application scenarios demonstrated the efficacy of CORE, revealing significant enhancements in the system efficiency and task completion rates. Building on these promising outcomes, we further validated the practical applicability of CORE by deploying it on a real‑world edge‑computing platform,that exhibits robust performance in operational environments.

PaperID: 6013, https://arxiv.org/pdf/2601.21800.pdf

Authors: Dionizije Fa, Marko Culjak, Bruno Pandza, Mateo Cupic

Title: BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Abstract:
This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end‑to‑end tasks (e.g., RNA‑seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed‑source and open‑weight models across multiple agent harnesses, and use an LLM‑based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi‑step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high‑level pipeline construction does not guarantee reliable step‑level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed‑source models can be unsuitable under strict privacy constraints; in such settings, open‑weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

PaperID: 6014, https://arxiv.org/pdf/2601.21797.pdf

Authors: Yimin Deng, Yuqing Fu, Derong Xu, Yejing Wang, Wei Ni, Jingtong Gao, Xiaopeng Li, Chengxu Liu, Xiao Han, Guoshuai Zhao, Xiangyu Zhao, Li Zhu, Xueming Qian

Title: Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

Abstract:
Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task‑independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task relevant information. Meanwhile, memory updates are guided by generic metrics rather than task specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual level updates on both the construction strategy and the content. Through this process, the memory system receives task aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on long dialogue benchmark LoCoMo demonstrate its effectiveness.

PaperID: 6015, https://arxiv.org/pdf/2601.21753.pdf

Authors: Italo Napolitano, Mario di Bernardo

Title: Optimal Transport for Time-Varying Multi-Agent Coverage Control

Abstract:
Coverage control algorithms have traditionally focused on static target densities, where agents are deployed to optimally cover a fixed spatial distribution. However, many applications involve time‑varying densities, including environmental monitoring, surveillance, and adaptive sensor deployment. Although time‑varying coverage strategies have been studied within Voronoi‑based frameworks, recent works have reformulated static coverage control as a semi‑discrete optimal transport problem. Extending this optimal transport perspective to time‑varying scenarios has remained an open challenge. This paper presents a rigorous optimal transport formulation for time‑varying coverage control, in which agents minimize the instantaneous Wasserstein distance to a continuously evolving target density. The proposed solution relies on a coupled system of differential equations governing agent positions and the dual variables that define Laguerre regions. In one‑dimensional domains, the resulting system admits a closed‑form analytical solution, offering both computational benefits and theoretical insight into the structure of optimal time‑varying coverage. Numerical simulations demonstrate improved tracking performance compared to quasi‑static and Voronoi‑based methods, validating the proposed framework.

PaperID: 6016, https://arxiv.org/pdf/2601.21700.pdf

Authors: Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An, Minhyeong Yu, Jian Park, Qingshan Zhou, Seunghyun Lee, Yi Bu

Title: Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Abstract:
Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG‑MAR, an Ontology‑Guided Multi‑Agent Reasoning framework. OG‑MAR summarizes respondent‑specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology‑consistent relations and demographically similar profiles to instantiate multiple value‑persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social‑survey benchmarks across four LLM backbones show that OG‑MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

PaperID: 6017, https://arxiv.org/pdf/2601.21611.pdf

Authors: Baopu Qiu, Hao Chen, Yuanrong Wu, Changtong Zan, Chao Wei, Weiru Zhang, Xiaoyi Zeng

Title: Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance

Abstract:
Effective relevance modeling is crucial for e‑commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged large language models (LLMs) to address the limitations of traditional relevance models, especially for long‑tail and ambiguous queries. By incorporating Chain‑of‑Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi‑step reasoning. However, two key limitations remain: (1) most existing approaches rely on single‑perspective CoT reasoning, which fails to capture the multifaceted nature of e‑commerce relevance (e.g., user intent vs. attribute‑level matching vs. business‑specific rules); and (2) although CoT‑enhanced LLM's offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real‑time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi‑Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine‑Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference‑time latent reasoning extractor, allowing efficient and low‑latency internalization of the LLM's sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e‑commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.

PaperID: 6018, https://arxiv.org/pdf/2601.21609.pdf

Authors: Bingqian Li, Xiaolei Wang, Junyi Li, Weitao Li, Long Zhang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen

Title: RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems

Abstract:
Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision‑making. However, existing methods primarily model preference changes based on explicit user‑item interactions, which are sparse, noisy, and unable to reflect the real‑time, mutual influences among users and items. To address these limitations, we propose RecNet, a self‑evolving preference propagation framework that proactively propagates real‑time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule‑based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback‑driven propagation optimization mechanism simulates a multi‑agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module‑level optimization, enabling continuous self‑evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.

PaperID: 6019, https://arxiv.org/pdf/2601.21570.pdf

Authors: Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Chuan Wen, Shanghang Zhang, Wenzhao Lian, Siheng Chen

Title: EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

Abstract:
The field of Embodied AI is witnessing a rapid evolution toward general‑purpose robotic systems, fueled by high‑fidelity simulation and large‑scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor‑intensive manual oversight from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and science discovery, we introduce \textscEmboCoach‑Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert‑curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed‑loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics‑informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human‑engineered baselines by 26.5% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open‑source and proprietary models; and (3) agents exhibit self‑correction capabilities for pathological engineering cases, successfully resurrecting task performance from near‑total failures through iterative simulation‑in‑the‑loop debugging. Ultimately, this work establishes a foundation for self‑evolving embodied intelligence, accelerating the paradigm shift from labor‑intensive manual tuning to scalable, autonomous engineering in embodied AI field.

PaperID: 6020, https://arxiv.org/pdf/2601.21545.pdf

Authors: Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

Title: ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory

Abstract:
Agentic large language model (LLM) systems rely on external memory for long‑horizon state and concurrent multi‑agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per‑agent working state, Tier B sharded evidence with shard‑local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope‑before‑routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture‑of‑experts (MoE) routing over eligible shards, probing up to B_\mathrmprobe shards via Top‑B_\mathrmprobe or adaptive Top‑P, and use cost‑aware gating over profile/observation/session shard families; the router is trained from evidence‑to‑shard supervision. On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed‑budget routing setting (B_\mathrmprobe=3), ShardMemo improves over cosine‑to‑prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521‑>414, ‑20.5%) and p95 latency (95‑>76 ms). On long‑context HotpotQA, ShardMemo achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding‑similarity retrieval).

PaperID: 6021, https://arxiv.org/pdf/2601.21540.pdf

Authors: Iris Yazici, Mert Kayaalp, Stefan Taga, Ali H. Sayed

Title: Opinion Consensus Formation Among Networked Large Language Models

Abstract:
Can classical consensus models predict the group behavior of large language models (LLMs)? We examine multi‑round interactions among LLM agents through the DeGroot framework, where agents exchange text‑based messages over diverse communication graphs. To track opinion evolution, we map each message to an opinion score via sentiment analysis. We find that agents typically reach consensus and the disagreement between the agents decays exponentially. However, the limiting opinion departs from DeGroot's network‑centrality‑weighted forecast. The consensus between LLM agents turns out to be largely insensitive to initial conditions and instead depends strongly on the discussion subject and inherent biases. Nevertheless, transient dynamics align with classical graph theory and the convergence rate of opinions is closely related to the second‑largest eigenvalue of the graph's combination matrix. Together, these findings can be useful for LLM‑driven social‑network simulations and the design of resource‑efficient multi‑agent LLM applications.

PaperID: 6022, https://arxiv.org/pdf/2601.21523.pdf

Authors: Bang Giang Le, Viet Cuong Ta

Title: Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning

Abstract:
To promote cooperation in Multi‑Agent Reinforcement Learning, the reward signals of all agents can be aggregated together, forming global rewards that are commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local reward benefits from faster learning due to the separation of agents' contributions, but can be suboptimal as agents myopically optimize their own reward while disregarding the global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns the individual agent contribution in a more fine‑grained manner than a global reward, while alleviating the cooperation problem with agents' local reward. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.

PaperID: 6023, https://arxiv.org/pdf/2601.21477.pdf

Authors: Tobias Schmidt, Kai Cui

Title: Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions

Abstract:
Mean‑field control (MFC) offers a scalable solution to the curse of dimensionality in multi‑agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all‑to‑all interactions. In this work, we bridge the gap to real‑world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon‑dependent locality: for finite‑horizon problems, an agent's optimal policy at time t depends strictly on its (T‑t)‑hop neighborhood. This result renders the infinite‑dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor‑critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.

PaperID: 6024, https://arxiv.org/pdf/2601.21473.pdf

Authors: Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, Yufei Ding

Title: ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management

Abstract:
LLM‑based multi‑agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU‑resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory‑efficient LLM serving system for large‑scale multi‑agent simulations. ScaleSim enables proactive prefetching and priority‑based eviction, supports diverse agent‑specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.

PaperID: 6025, https://arxiv.org/pdf/2601.21469.pdf

Authors: Haoji Zhang, Yuzhe Li, Zhenqiang Liu, Chenyang Liu, Shenyang Zhang, Yi Zhou

Title: Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation

Abstract:
While Large Language Models (LLMs) have catalyzed breakthroughs in automated code generation, Small Language Models (SLMs) often encounter reasoning bottlenecks and failure loops when addressing complex logical requirements. To overcome these challenges, we propose DebateCoder, a multi‑agent collaborative framework designed to improve the reasoning ability of SLMs (e.g., Pangu‑1B) in resource‑constrained environments. DebateCoder uses a structured role‑playing protocol with three agents: User Agent (A_UA), Technical Agent (A_TA), and Quality Assurance Agent (A_QA). It also includes an Adaptive Confidence Gating mechanism with a 95% threshold to balance accuracy and inference efficiency. In addition, we introduce a multi‑turn deliberation module and a reviewer‑guided analytical debugging loop for orthogonal pre‑generation debate and post‑generation refinement. Experiments on HumanEval and MBPP show that DebateCoder achieves 70.12% Pass@1 on HumanEval, outperforming MapCoder while reducing API overhead by about 35%. These results indicate that collaborative protocols can mitigate limitations of small‑parameter models and provide a scalable, efficient approach to high‑quality automated software engineering.

PaperID: 6026, https://arxiv.org/pdf/2601.21464.pdf

Authors: Yuan Sui, Bryan Hooi

Title: Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Abstract:
Training large language models (LLMs) for non‑verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground‑truth labels. While LLM‑as‑Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta‑evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta‑evaluation through multi‑agent self‑play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta‑evaluation and enabling joint optimization of generation and judging capabilities through self‑play, without external judges or ground truth. Experiments on various benchmarks show that CoNL achieves consistent improvements over self‑rewarding baselines while maintaining stable training.

PaperID: 6027, https://arxiv.org/pdf/2601.21352.pdf

Authors: Ziyu Lu, Tengjin Weng, Yiying Yang, Yuhang Zhao, Xinxin Huang, Wenhao Jiang

Title: BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents

Abstract:
GUI agents are designed to automate repetitive tasks and enhance productivity. However, existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure. In this work, we model GUI task execution as a DFS process and propose BEAP‑Agent, a DFS‑based framework that supports long‑range, multi‑level state backtracking with dynamic task tracking and updating. The framework consists of three collaborative components: Planner, Executor, and Tracker. Together, they enable effective task exploration and execution. BEAP‑Agent fills the gap in systematic backtracking mechanisms for GUI agents, offering a systematic solution for long‑horizon task exploration. We conducted a systematic evaluation on the OSWorld benchmark, where BEAP‑Agent achieved an accuracy of 28.2%, validating the effectiveness of the proposed method.

PaperID: 6028, https://arxiv.org/pdf/2601.21347.pdf

Authors: Xiuwen Zheng, Sixun Dong, Bornali Phukon, Mark Hasegawa-Johnson, Chang D. Yoo

Title: Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

Abstract:
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real‑world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)‑based agent for post‑ASR correction: a Judge‑Editor over the top‑k ASR hypotheses that keeps high‑confidence spans, rewrites uncertain segments, and operates in both zero‑shot and fine‑tuned modes. In parallel, we release SAP‑Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi‑perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.

PaperID: 6029, https://arxiv.org/pdf/2601.21277.pdf

Authors: Edith Elkind, Qishen Han, Lirong Xia

Title: Shortlisting: a Principled Approach

Abstract:
Shortlisting is the process of selecting a subset of alternatives from a larger pool for further consideration or final decision‑making. It is widely applied in social choice and multi‑agent system scenarios. The growing demand for participatory decision‑making and the continuously expanding space of candidates create an urgent need for efficient and fair shortlisting procedures. However, little principled study has been done on this problem. This blue‑sky paper aims to highlight the overlooked significance of shortlisting, distinguish it from related problems, provide initial thoughts, and, more importantly, serve as a call to arms. We envision that principled shortlisting can reduce cognitive burden, enable fair collective decisions, encourage broader participation, and ultimately build trust in democratic systems.

PaperID: 6030, https://arxiv.org/pdf/2601.21276.pdf

Authors: Haoming Huang, Pongchai Jaisri, Shota Shimizu, Lingfeng Chen, Sota Nakashima, Gema Rodríguez-Pérez

Title: More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Abstract:
Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce adverse effects on development. However, existing metrics solely measure pass rates, failing to reflect impacts on long‑term maintainability and readability, and failing to capture human intuitive evaluations of PR. To increase the comprehensiveness of this problem, we investigate and evaluate the characteristics of LLM to know the pull requests' characteristics beyond the pass rate. We observe the code quality and maintainability within PRs based on code metrics to evaluate objective characteristics and developers' reactions to the pull requests from both humans and LLM's generation. Evaluation results indicate that LLM Agents frequently disregard code reuse opportunities, resulting in higher levels of redundancy compared to human developers. In contrast to the quality issues, our emotions analysis reveals that reviewers tend to express more neutral or positive emotions towards AI‑generated contributions than human ones. This disconnect suggests that the surface‑level plausibility of AI code masks redundancy, leading to the silent accumulation of technical debt in real‑world development environments. Our research provides insights for improving human‑AI collaboration.

PaperID: 6031, https://arxiv.org/pdf/2601.21253.pdf

Authors: Wei Minn, Biniam Fisseha Demissie, Yan Naing Tun, Jiakun Liu, Mariano Ceccato, Lwin Khin Shar, David Lo

Title: CovAgent: Overcoming the 30% Curse of Mobile Application Coverage with Agentic AI and Dynamic Instrumentation

Abstract:
Automated GUI testing is crucial for ensuring the quality and reliability of Android apps. However, the efficacy of existing UI testing techniques is often limited, especially in terms of coverage. Recent studies, including the state‑of‑the‑art, struggle to achieve more than 30% activity coverage in real‑world apps. This limited coverage can be attributed to a combination of factors such as failing to generate complex user inputs, unsatisfied activation conditions regarding device configurations and external resources, and hard‑to‑reach code paths that are not easily accessible through the GUI. To overcome these limitations, we propose CovAgent, a novel agentic AI‑powered approach to enhance Android app UI testing. Our fuzzer‑agnostic framework comprises an AI agent that inspects the app's decompiled Smali code and component transition graph, and reasons about unsatisfied activation conditions within the app code logic that prevent access to the activities that are unreachable by standard and widely adopted GUI fuzzers. Then, another agent generates dynamic instrumentation scripts that satisfy activation conditions required for successful transitions to those activities. We found that augmenting existing fuzzing approaches with our framework achieves a significant improvement in test coverage over the state‑of‑the‑art, LLMDroid, and other baselines such as Fastbot and APE (e.g., 101.1%, 116.3% and 179.7% higher activity coverage, respectively). CovAgent also outperforms all the baselines in other metrics such as class, method, and line coverage. We also conduct investigations into components within CovAgent to reveal further insights regarding the efficacy of Agentic AI in the field of automated app testing such as the agentic activation condition inference accuracy, and agentic activity‑launching success rate.

PaperID: 6032, https://arxiv.org/pdf/2601.21139.pdf

Authors: Sinan Bugu

Title: Hidden-Field Coordination Reveals Payoff-Free Quantum Correlation Structure in Decentralized Coordination

Abstract:
We study decentralized multi‑agent coordination where agents must correlate actions against an unobserved field and cannot communicate. To isolate correlation geometry from payoff optimization, we introduce the Hidden‑Field Coordination (HFC) model, which enforces identical information access and no‑signaling constraints across strategies. Using information‑theoretic diagnostics, we compare classical shared‑randomness baselines with an entanglement‑mediated strategy based on multipartite W states and a strictly local Spontaneous Leader Election rule. Within the restricted symmetric shared‑latent baseline studied here, increasing total correlation is achieved primarily by driving actions toward alignment (copying), which also increases pairwise coincidence (collisions). By contrast, the quantum strategy realizes a collision‑suppressing coordination regime: it preserves global dependence while reducing pairwise coincidence below the independent (product) baseline induced by the common marginal distribution. This produces a geometric separation in the joint‑action distribution. Classical baselines concentrate probability near the diagonal of action equality, whereas the entanglement‑mediated mapping occupies an offset‑diagonal region associated with relational roles. Accordingly, the entanglement signature in this setting is not higher correlation magnitude; total‑correlation differentials can be negative relative to the classical copying optimum. Instead, it reflects a change in dependence geometry that supports robust anti‑coordination.

PaperID: 6033, https://arxiv.org/pdf/2601.21126.pdf

Authors: Kooktae Lee, Julian Martinez

Title: AI-Augmented Density-Driven Optimal Control (D2OC) for Decentralized Environmental Mapping

Abstract:
This paper presents an AI‑augmented decentralized framework for multi‑agent (multi‑robot) environmental mapping under limited sensing and communication. While conventional coverage formulations achieve effective spatial allocation when an accurate reference map is available, their performance deteriorates under uncertain or biased priors. The proposed method introduces an adaptive and self‑correcting mechanism that enables agents to iteratively refine local density estimates within an optimal transport‑based framework, ensuring theoretical consistency and scalability. A dual multilayer perceptron (MLP) module enhances adaptivity by inferring local mean‑variance statistics and regulating virtual uncertainty for long‑unvisited regions, mitigating stagnation around local minima. Theoretical analysis rigorously proves convergence under the Wasserstein metric, while simulation results demonstrate that the proposed AI‑augmented Density‑Driven Optimal Control consistently achieves robust and precise alignment with the ground‑truth density, yielding substantially higher‑fidelity reconstruction of complex multi‑modal spatial distributions compared with conventional decentralized baselines.

PaperID: 6034, https://arxiv.org/pdf/2601.21064.pdf

Authors: Minghui Chen, Wenlong Deng, James Zou, Han Yu, Xiaoxiao Li

Title: Textual Equilibrium Propagation for Deep Compound AI Systems

Abstract:
Large language models (LLMs) are increasingly deployed as part of compound AI systems that coordinate multiple modules (e.g., retrievers, tools, verifiers) over long‑horizon workflows. Recent approaches that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, but we find that performance degrades as system depth grows. In particular, long‑horizon agentic workflows exhibit two depth‑scaling failure modes: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long message and amplifies evaluation biases; and 2) vanishing textual gradient, where limited long‑context ability causes models overemphasize partial feedback and compression of lengthy feedback causes downstream messages to lose specificity gradually as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy‑based models. TEP includes two phases: 1) a free phase where a local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task‑level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long‑horizon QA benchmarks and multi‑agent tool‑use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad. The gains grows with depth, while preserving the practicality of black‑box LLM components in deep compound AI system.

PaperID: 6035, https://arxiv.org/pdf/2601.20969.pdf

Authors: Alessandro Burigana, Francesco Fabiano

Title: The Epistemic Planning Domain Definition Language: Official Guideline

Abstract:
Epistemic planning extends (multi‑agent) automated planning by making agents' knowledge and beliefs first‑class aspects of the planning formalism. One of the most well‑known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers an rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL make DEL‑based epistemic planning a challenging problem to tackle both theoretically, and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL‑like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our main contributions are: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL's syntax and semantics grounded in DEL with abstract event models. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.

PaperID: 6036, https://arxiv.org/pdf/2601.20617.pdf

Authors: Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, Chris Russell

Title: Agent Benchmarks Fail Public Sector Requirements

Abstract:
Deploying Large Language Model‑based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public‑sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public‑sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first‑principles survey of public administration literature: benchmarks must be \emphprocess‑based, \emphrealistic, \emphpublic‑sector‑specific and report \emphmetrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert‑validated LLM‑assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public sector‑relevant benchmarks and for public‑sector officials to apply these criteria when evaluating their own agentic use cases.

PaperID: 6037, https://arxiv.org/pdf/2601.20539.pdf

Authors: Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

Title: PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

Abstract:
Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi‑agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self‑Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM‑based AHD from trial‑and‑error evolution toward state‑aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

PaperID: 6038, https://arxiv.org/pdf/2601.20538.pdf

Authors: Ling Tang, Jilin Mei, Dongrui Liu, Chen Qian, Dawei Cheng, Jing Shao, Xia Hu

Title: Interpreting Emergent Extreme Events in Multi-Agent Systems

Abstract:
Large language model‑powered multi‑agent systems have emerged as powerful tools for simulating complex human‑like systems. The interactions within these systems often lead to extreme events whose origins remain obscured by the black box of emergence. Interpreting these events is critical for system safety. This paper proposes the first framework for explaining emergent extreme events in multi‑agent systems, aiming to answer three fundamental questions: When does the event originate? Who drives it? And what behaviors contribute to it? Specifically, we adapt the Shapley value to faithfully attribute the occurrence of extreme events to each action taken by agents at different time steps, i.e., assigning an attribution score to the action to measure its influence on the event. We then aggregate the attribution scores along the dimensions of time, agent, and behavior to quantify the risk contribution of each dimension. Finally, we design a set of metrics based on these contribution scores to characterize the features of extreme events. Experiments across diverse multi‑agent system scenarios (economic, financial, and social) demonstrate the effectiveness of our framework and provide general insights into the emergence of extreme phenomena.

PaperID: 6039, https://arxiv.org/pdf/2601.20529.pdf

Authors: Julia Richter, David Oberacker, Gabriela Ligeza, Valentin T. Bickel, Philip Arm, William Talbot, Marvin Grosse Besselmann, Florian Kehl, Tristan Schnell, Hendrik Kolvenbach, Rüdiger Dillmann, Arne Roennau, Marco Hutter

Title: A Practical Framework of Key Performance Indicators for Multi-Robot Lunar and Planetary Field Tests

Abstract:
Robotic prospecting for critical resources on the Moon, such as ilmenite, rare earth elements, and water ice, requires robust exploration methods given the diverse terrain and harsh environmental conditions. Although numerous analog field trials address these goals, comparing their results remains challenging because of differences in robot platforms and experimental setups. These missions typically assess performance using selected, scenario‑specific engineering metrics that fail to establish a clear link between field performance and science‑driven objectives. In this paper, we address this gap by deriving a structured framework of KPI from three realistic multi‑robot lunar scenarios reflecting scientific objectives and operational constraints. Our framework emphasizes scenario‑dependent priorities in efficiency, robustness, and precision, and is explicitly designed for practical applicability in field deployments. We validated the framework in a multi‑robot field test and found it practical and easy to apply for efficiency‑ and robustness‑related KPI, whereas precision‑oriented KPI require reliable ground‑truth data that is not always feasible to obtain in outdoor analog environments. Overall, we propose this framework as a common evaluation standard enabling consistent, goal‑oriented comparison of multi‑robot field trials and supporting systematic development of robotic systems for future planetary exploration.

PaperID: 6040, https://arxiv.org/pdf/2601.20466.pdf

Authors: Mathis Brossier, Mujtaba Fadhil Jawad, Emma Broman, Ylva Selling, Julia Hallsten, Alexander Bock, Johanna Björklund, Tobias Isenberg, Anders Ynnerman, Mario Romero, Lonni Besançon

Title: Piloting Planetarium Visualizations with LLMs during Live Events in Science Centers

Abstract:
We designed and evaluated an AI pilot in a planetarium visualization software, OpenSpace, for public shows in science centers. The piloting role is usually given to a human working in close collaboration with the guide on stage. We recruited 7 professional guides with extensive experience in giving shows to the public to study the impact of the AI‑piloting on the overall experience. The AI‑pilot is a conversational AI‑agent listening to the guide and interpreting the verbal statements as commands to execute camera motions, change simulation time, or toggle visual assets. Our results show that, while AI pilots lack several critical skills for live shows, they could become useful as co‑pilots to reduce workload of human pilots and allow multitasking. We propose research directions toward implementing visualization pilots and co‑pilots in live settings.

PaperID: 6041, https://arxiv.org/pdf/2601.20465.pdf

Authors: Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu

Title: BMAM: Brain-inspired Multi-Agent Memory Framework

Abstract:
Language‑model‑based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain‑inspired Multi‑Agent Memory), a general‑purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience‑aware, and control‑oriented components that operate at complementary time scales. To support long‑horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long‑horizon evaluation setting, and ablation analyses confirm that the hippocampus‑inspired episodic memory subsystem plays a critical role in temporal reasoning.

PaperID: 6042, https://arxiv.org/pdf/2601.20380.pdf

Authors: Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang

Title: OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Abstract:
Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real‑world tasks, revolutionizing human‑computer interaction and improving human productivity. In this report, we present OmegaUse, a general‑purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer‑use and phone‑use scenarios. Building an effective GUI agent model relies on two factors: (1) high‑quality data and (2) effective training methods. To address these, we introduce a carefully engineered data‑construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open‑source datasets and introduce a novel automated synthesis framework that integrates bottom‑up autonomous exploration with top‑down taxonomy‑guided generation to create high‑fidelity synthetic data. For training, to better leverage these data, we adopt a two‑stage strategy: Supervised Fine‑Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture‑of‑Experts (MoE) backbone. To evaluate cross‑terminal capabilities in an offline setting, we introduce OS‑Nav, a benchmark suite spanning multiple operating systems: ChiM‑Nav, targeting Chinese Android mobile environments, and Ubu‑Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state‑of‑the‑art (SOTA) score of 96.3% on ScreenSpot‑V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS‑Nav, reaching 74.24% step success on ChiM‑Nav and 55.9% average success on Ubu‑Nav.

PaperID: 6043, https://arxiv.org/pdf/2601.20375.pdf

Authors: Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei

Title: LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Abstract:
Large Language Models (LLMs) can be fine‑tuned on domain‑specific data to enhance their performance in specialized fields. However, such data often contains numerous low‑quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial‑and‑error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high‑privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM‑AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in‑context learning mechanism enables the agent to converge toward high‑quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low‑quality samples for focused processing; Cache‑and‑Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM‑AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

PaperID: 6044, https://arxiv.org/pdf/2601.20367.pdf

Authors: Qing Lyu, Zhe Fu, Alexandre Bayen

Title: Unsupervised Anomaly Detection in Multi-Agent Trajectory Prediction via Transformer-Based Models

Abstract:
Identifying safety‑critical scenarios is essential for autonomous driving, but the rarity of such events makes supervised labeling impractical. Traditional rule‑based metrics like Time‑to‑Collision are too simplistic to capture complex interaction risks, and existing methods lack a systematic way to verify whether statistical anomalies truly reflect physical danger. To address this gap, we propose an unsupervised anomaly detection framework based on a multi‑agent Transformer that models normal driving and measures deviations through prediction residuals. A dual evaluation scheme has been proposed to assess both detection stability and physical alignment: Stability is measured using standard ranking metrics in which Kendall Rank Correlation Coefficient captures rank agreement and Jaccard index captures the consistency of the top‑K selected items; Physical alignment is assessed through correlations with established Surrogate Safety Measures (SSM). Experiments on the NGSIM dataset demonstrate our framework's effectiveness: We show that the maximum residual aggregator achieves the highest physical alignment while maintaining stability. Furthermore, our framework identifies 388 unique anomalies missed by Time‑to‑Collision and statistical baselines, capturing subtle multi‑agent risks like reactive braking under lateral drift. The detected anomalies are further clustered into four interpretable risk types, offering actionable insights for simulation and testing.

PaperID: 6045, https://arxiv.org/pdf/2601.20352.pdf

Authors: Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, Chengwei Qin

Title: AMA: Adaptive Memory via Multi-Agent Collaboration

Abstract:
The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long‑term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation‑heavy maintenance strategies, and coarse‑grained update mechanisms. These design choices create a persistent mismatch between stored information and task‑specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi‑Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi‑granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long‑context benchmarks show that AMA significantly outperforms state‑of‑the‑art baselines while reducing token consumption by approximately 80% compared to full‑context methods, demonstrating its effectiveness in maintaining retrieval precision and long‑term memory consistency.

PaperID: 6046, https://arxiv.org/pdf/2601.20346.pdf

Authors: Asifullah Khan, Aimen Wadood, Mubashar Iqbal, Umme Zahoora

Title: Multimodal Multi-Agent Ransomware Analysis Using AutoGen

Abstract:
Ransomware has become one of the most serious cybersecurity threats causing major financial losses and operational disruptions worldwide.Traditional detection methods such as static analysis, heuristic scanning and behavioral analysis often fall short when used alone. To address these limitations, this paper presents multimodal multi agent ransomware analysis framework designed for ransomware classification. Proposed multimodal multiagent architecture combines information from static, dynamic and network sources. Each data type is handled by specialized agent that uses auto encoder based feature extraction. These representations are then integrated through a fusion agent. After that fused representation are used by transformer based classifier. It identifies the specific ransomware family. The agents interact through an interagent feedback mechanism that iteratively refines feature representations by suppressing low confidence information. The framework was evaluated on large scale datasets containing thousands of ransomware and benign samples. Multiple experiments were conducted on ransomware dataset. It outperforms single modality and nonadaptive fusion baseline achieving improvement of up to 0.936 in Macro‑F1 for family classification and reducing calibration error. Over 100 epochs, the agentic feedback loop displays a stable monotonic convergence leading to over +0.75 absolute improvement in terms of agent quality and a final composite score of around 0.88 without fine tuning of the language models. Zeroday ransomware detection remains family dependent on polymorphism and modality disruptions. Confidence aware abstention enables reliable real world deployment by favoring conservativeand trustworthy decisions over forced classification. The findings indicate that proposed approach provides a practical andeffective path toward improving real world ransomware defense systems.

PaperID: 6047, https://arxiv.org/pdf/2601.20335.pdf

Authors: Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan

Title: MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Abstract:
Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction‑following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real‑world mobile environments. This leads to a gap between benchmarks and real‑world environments. To addressing these limitations, we propose MobileBench‑OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents by including 5 subsets, which set multiple evaluation dimensions. We also provide an auto‑eval framework with a reset mechanism, enabling stable and repeatable real‑world benchmarking. Evaluating 12 leading GUI agents on MobileBench‑OL shows significant room for improvement to meet real‑world requirements. Human evaluation further confirms that MobileBench‑OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

PaperID: 6048, https://arxiv.org/pdf/2601.20276.pdf

Authors: Tianwei Lin, Zuyi Zhou, Xinda Zhao, Chenke Wang, Xiaohong Li, Yu Chen, Chuanrui Hu, Jian Pei, Yafeng Deng

Title: Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Abstract:
Long‑context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle‑in‑a‑Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near‑unique, and the haystack is largely irrelevant. We introduce EverMemBench‑S (EMB‑S), an adversarial NIAH‑style benchmark built on a 326M‑token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval‑based (RAG) evaluation, we evaluate native long‑context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB‑S pairs queries with collision‑tested near‑miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document‑ID localization) separately from end‑to‑end QA quality under full‑context prompting. This enables consistent diagnosis for both native long‑context prompting and retrieval pipelines. Across a reference‑corpus ladder from domain‑isolated 64K contexts to a globally shared 326M‑token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long‑context memory at scale.

PaperID: 6049, https://arxiv.org/pdf/2601.20206.pdf

Authors: Zixuan Xiao, Chunguang Hu, Jun Ma

Title: Towards Intelligent Urban Park Development Monitoring: LLM Agents for Multi-Modal Information Fusion and Analysis

Abstract:
As an important part of urbanization, the development monitoring of newly constructed parks is of great significance for evaluating the effect of urban planning and optimizing resource allocation. However, traditional change detection methods based on remote sensing imagery have obvious limitations in high‑level and intelligent analysis, and thus are difficult to meet the requirements of current urban planning and management. In face of the growing demand for complex multi‑modal data analysis in urban park development monitoring, these methods often fail to provide flexible analysis capabilities for diverse application scenarios. This study proposes a multi‑modal LLM agent framework, which aims to make full use of the semantic understanding and reasoning capabilities of LLM to meet the challenges in urban park development monitoring. In this framework, a general horizontal and vertical data alignment mechanism is designed to ensure the consistency and effective tracking of multi‑modal data. At the same time, a specific toolkit is constructed to alleviate the hallucination issues of LLM due to the lack of domain‑specific knowledge. Compared to vanilla GPT‑4o and other agents, our approach enables robust multi‑modal information fusion and analysis, offering reliable and scalable solutions tailored to the diverse and evolving demands of urban park development monitoring.

PaperID: 6050, https://arxiv.org/pdf/2601.20194.pdf

Authors: Huichao Men, Yizhen Hu, Yu Gao, Xiaofeng Mou, Yi Xu, Xinhua Xiao

Title: An Autonomous Agent Framework for Feature-Label Extraction from Device Dialogues and Automatic Multi-Dimensional Device Hosting Planning Based on Large Language Models

Abstract:
With the deep integration of artificial intelligence and smart home technologies, the intelligent transformation of traditional household appliances has become an inevitable trend. This paper presents AirAgent‑‑an LLM‑driven autonomous agent framework designed for home air systems. Leveraging a voice‑based dialogue interface, AirAgent autonomously and personally manages indoor air quality through comprehensive perception, reasoning, and control. The framework innovatively adopts a two‑layer cooperative architecture: Memory‑Based Tag Extraction and Reasoning‑Driven Planning. First, a dynamic memory tag extraction module continuously updates personalized user profiles. Second, a reasoning‑planning model integrates real‑time environmental sensor data, user states, and domain‑specific prior knowledge (e.g., public health guidelines) to generate context‑aware decisions. To support both interpretability and execution, we design a semi‑streaming output mechanism that uses special tokens to segment the model's output stream in real time, simultaneously producing human‑readable Chain‑of‑Thought explanations and structured, device‑executable control commands. The system handles planning across 25 distinct complex dimensions while satisfying more than 20 customized constraints. As a result, AirAgent endows home air systems with proactive perception, service, and orchestration capabilities, enabling seamless, precise, and personalized air management responsive to dynamic indoor and outdoor conditions. Experimental results demonstrate up to 94.9 percent accuracy and more than 20 percent improvement in user experience metrics compared to competing commercial solutions.

PaperID: 6051, https://arxiv.org/pdf/2601.20130.pdf

Authors: Haoxuan Wang, Gengyu Zhang, Yan Yan, Yuzhang Shang, Ramana Rao Kompella, Gaowen Liu

Title: Real-Time Robot Execution with Masked Action Chunking

Abstract:
Real‑time execution is essential for cyber‑physical systems such as robots. These systems operate in dynamic real‑world environments where even small delays can undermine responsiveness and compromise performance. Asynchronous inference has recently emerged as a system‑level paradigm for real‑time robot manipulation, enabling the next action chunk to be predicted while the current one is being executed. While this approach achieves real‑time responsiveness, naive integration often results in execution failure. Previous methods attributed this failure to inter‑chunk discontinuity and developed test‑time algorithms to smooth chunk boundaries. In contrast, we identify another critical yet overlooked factor: intra‑chunk inconsistency, where the robot's executed action chunk partially misaligns with its current perception. To address this, we propose REMAC, which learns corrective adjustments on the pretrained policy through masked action chunking, enabling the policy to remain resilient under mismatches between intended actions and actual execution during asynchronous inference. In addition, we introduce a prefix‑preserved sampling procedure to reinforce inter‑chunk continuity. Overall, our method delivers more reliable policies without incurring additional latency. Extensive experiments in both simulation and real‑world settings demonstrate that our method enables faster task execution, maintains robustness across varying delays, and consistently achieves higher completion rates.

PaperID: 6052, https://arxiv.org/pdf/2601.20048.pdf

Authors: Jincheng Bai, Zhenyu Zhang, Jennifer Zhang, Zhihuai Zhu

Title: Insight Agents: An LLM-Based Multi-Agent System for Data Insights

Abstract:
Today, E‑commerce sellers face several key challenges, including difficulties in discovering and effectively utilizing available programs and tools, and struggling to understand and utilize rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi‑agent Data Insight system, to provide E‑commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, thereby driving incremental seller adoption by reducing the effort required and increase speed at which sellers make good business decisions. In this paper, we introduce this novel LLM‑backed end‑to‑end agentic system built on a plan‑and‑execute paradigm and designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi‑agent structure, consisting of manager agent and two worker agents: data presentation and insight generation, for efficient information retrieval and problem‑solving. We design a simple yet effective ML solution for manager agent that combines Out‑of‑Domain (OOD) detection using a lightweight encoder‑decoder model and agent routing through a BERT‑based classifier, optimizing both accuracy and latency. Within the two worker agents, a strategic planning is designed for API‑based data model that breaks down queries into granular components to generate more accurate responses, and domain knowledge is dynamically injected to to enhance the insight generator. IA has been launched for Amazon sellers in US, which has achieved high accuracy of 90% based on human evaluation, with latency of P90 below 15s.

PaperID: 6053, https://arxiv.org/pdf/2601.20005.pdf

Authors: Zixin Jiang, Weili Xu, Bing Dong

Title: OptAgent: an Agentic AI framework for Intelligent Building Operations

Abstract:
The urgent need for building decarbonization calls for a paradigm shift in future autonomous building energy operation, from human‑intensive engineering workflows toward intelligent agents that interact with physics‑grounded digital environments. This study proposes an end‑to‑end agentic AI‑enabled Physics‑Informed Machine Learning (PIML) environment for scalable building energy modeling, simulation, control, and automation. The framework consists of (1) a modular and physics‑consistent PIML digital environment spanning building thermal dynamics, Heating, Ventilation, and Air Conditioning (HVAC), and distributed energy resources (DER) for grid‑interactive energy management; and (2) an agentic AI layer with 11 specialist agents and 72 Model Context Protocol (MCP) tools that enable end‑to‑end execution of multi‑step energy analytics. A representative case study demonstrates multi‑domain, multi‑agent coordination for assessing how system and control upgrades affect energy use, operating cost, thermal comfort, and flexibility. In addition, a large‑scale benchmark (about 4000 runs) systematically evaluates workflow performance in terms of accuracy, token consumption, execution time, and inference cost. The results quantify the impacts of intelligence mode design, model size, task complexity, and orchestrator‑specialist coordination, and provide key lessons for building future agentic AI systems in real‑world building energy applications. This work establishes a scalable, physics‑grounded foundation for deploying agentic AI in decarbonized and grid‑interactive building operations.

PaperID: 6054, https://arxiv.org/pdf/2601.19942.pdf

Authors: Faruk Alpay, Bugra Kilictas

Title: Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds

Abstract:
We study the emergence of multi‑step reasoning in deep Transformer language models through a geometric and statistical‑physics lens. Treating the hidden‑state trajectory as a flow on an implicit Riemannian manifold, we analyze the layerwise covariance spectrum of activations, where C^(\ell)=\mathbbE[h^(\ell)h^(\ell)\top], and track deviations from a random‑matrix bulk. Across model scales (1.5B‑‑30B), we observe a sharp reduction in effective dimensionality consistent with a phase transition: an order parameter based on sparsity/localization, Ω(h)=1‑\|h\|_1/(\sqrtd\|h\|_2), exhibits a discontinuity near a critical normalized depth γ_c\approx 0.42 in sufficiently large models. We formalize the forward pass as a discrete coarse‑graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization‑like dynamics. The resulting low‑entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object‑like structures in representation space, which we call Transient Class Objects (TCOs). We provide theoretical conditions connecting logical separability to spectral decay and validate the predicted signatures with layerwise probes on multiple open‑weight model families.

PaperID: 6055, https://arxiv.org/pdf/2601.19935.pdf

Authors: Yiting Shen, Kun Li, Wei Zhou, Songlin Hu

Title: Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Abstract:
Large Language Model (LLM)‑based agents are increasingly deployed for complex, tool‑based tasks where long‑term memory is critical to driving actions. Existing benchmarks, however, primarily test a angent's ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce \textscMem2ActBench, a benchmark for evaluating whether agents can proactively leverage long‑term memory to execute tool‑based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user‑‑assistant‑‑tool turns on average. From these memory chains, a reverse‑generation method produces 400 tool‑use tasks, with human evaluation confirming 91.3% are strongly memory‑dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.

PaperID: 6056, https://arxiv.org/pdf/2601.19921.pdf

Authors: Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

Title: Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Abstract:
Multi‑agent debate (MAD) is widely used to improve large language model (LLM) performance through test‑time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision‑making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity‑aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence‑modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity‑aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence‑modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning‑oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM‑based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.

PaperID: 6057, https://arxiv.org/pdf/2601.19917.pdf

Authors: Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao

Title: PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Abstract:
Strategic planning is critical for multi‑step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long‑horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non‑invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper‑Network to synthesize a query‑conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model's representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

PaperID: 6058, https://arxiv.org/pdf/2601.19793.pdf

Authors: Shanyv Liu, Xuyang Yuan, Tao Chen, Zijun Zhan, Zhu Han, Danyang Zheng, Weishan Zhang, Shaohua Cao

Title: CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing

Abstract:
Graph‑based Multi‑Agent Systems (MAS) enable complex cyclic workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub‑tasks. We propose CASTER (Context‑Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection in graph‑based MAS. CASTER employs a Dual‑Signal Router that combines semantic embeddings with structural meta‑features to estimate task difficulty. During training, the router self‑optimizes through a Cold Start to Iterative Evolution paradigm, learning from its own routing failures via on‑policy negative feedback. Experiments using LLM‑as‑a‑Judge evaluation across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity demonstrate that CASTER reduces inference cost by up to 72.4% compared to strong‑model baselines while matching their success rates, and consistently outperforms both heuristic routing and FrugalGPT across all domains.

PaperID: 6059, https://arxiv.org/pdf/2601.19778.pdf

Authors: Ahmad Farooq, Kamran Iqbal

Title: Reimagining Peer Review Process Through Multi-Agent Mechanism Design

Abstract:
The software engineering research community faces a systemic crisis: peer review is failing under growing submissions, misaligned incentives, and reviewer fatigue. Community surveys reveal that researchers perceive the process as "broken." This position paper argues that these dysfunctions are mechanism design failures amenable to computational solutions. We propose modeling the research community as a stochastic multi‑agent system and applying multi‑agent reinforcement learning to design incentive‑compatible protocols. We outline three interventions: a credit‑based submission economy, MARL‑optimized reviewer assignment, and hybrid verification of review consistency. We present threat models, equity considerations, and phased pilot metrics. This vision charts a research agenda toward sustainable peer review.

PaperID: 6060, https://arxiv.org/pdf/2601.19747.pdf

Authors: Jiale Liu, Taiyu Zhou, Tianqi Jiang

Title: Veri-Sure: A Contract-Aware Multi-Agent Framework with Temporal Tracing and Formal Verification for Correct RTL Code Generation

Abstract:
In the rapidly evolving field of Electronic Design Automation (EDA), the deployment of Large Language Models (LLMs) for Register‑Transfer Level (RTL) design has emerged as a promising direction. However, silicon‑grade correctness remains bottlenecked by: (i) limited test coverage and reliability of simulation‑centric evaluation, (ii) regressions and repair hallucinations introduced by iterative debugging, and (iii) semantic drift as intent is reinterpreted across agent handoffs. In this work, we propose Veri‑Sure, a multi‑agent framework that establishes a design contract to align agents' intent and uses a patching mechanism guided by static dependency slicing to perform precise, localized repairs. By integrating a multi‑branch verification pipeline that combines trace‑driven temporal analysis with formal verification consisting of assertion‑based checking and boolean equivalence proofs, Veri‑Sure enables functional correctness beyond pure simulations. We also introduce VerilogEval‑v2‑EXT, extending the original benchmark with 53 more industrial‑grade design tasks and stratified difficulty levels, and show that Veri‑Sure achieves state‑of‑the‑art verified‑correct RTL code generation performance, surpassing standalone LLMs and prior agentic systems.

PaperID: 6061, https://arxiv.org/pdf/2601.19583.pdf

Authors: Débora Souza, Patrícia Machado

Title: Toward Architecture-Aware Evaluation Metrics for LLM Agents

Abstract:
LLM‑based agents are becoming central to software engineering tasks, yet evaluating them remains fragmented and largely model‑centric. Existing studies overlook how architectural components, such as planners, memory, and tool routers, shape agent behavior, limiting diagnostic power. We propose a lightweight, architecture‑informed approach that links agent components to their observable behaviors and to the metrics capable of evaluating them. Our method clarifies what to measure and why, and we illustrate its application through real world agents, enabling more targeted, transparent, and actionable evaluation of LLM‑based agents.

PaperID: 6062, https://arxiv.org/pdf/2601.19578.pdf

Authors: Yuxuan Cai, Xinyi Lai, Peng Yuan, Weiting Liu, Huajian Li, Mingda Li, Xinghua Wang, Shengxie Zheng, Yanchao Hao, Yuyang Yin, Zheng Wei

Title: Yunque DeepResearch Technical Report

Abstract:
Deep research has emerged as a transformative capability for autonomous agents, empowering Large Language Models to navigate complex, open‑ended tasks. However, realizing its full potential is hindered by critical limitations, including escalating contextual noise in long‑horizon tasks, fragility leading to cascading errors, and a lack of modular extensibility. To address these challenges, we introduce Yunque DeepResearch, a hierarchical, modular, and robust framework. The architecture is characterized by three key components: (1) a centralized Multi‑Agent Orchestration System that routes subtasks to an Atomic Capability Pool of tools and specialized sub‑agents; (2) a Dynamic Context Management mechanism that structures completed sub‑goals into semantic summaries to mitigate information overload; and (3) a proactive Supervisor Module that ensures resilience through active anomaly detection and context pruning. Yunque DeepResearch achieves state‑of‑the‑art performance across a range of agentic deep research benchmarks, including GAIA, BrowseComp, BrowseComp‑ZH, and Humanity's Last Exam. We open‑source the framework, reproducible implementations, and application cases to empower the community.

PaperID: 6063, https://arxiv.org/pdf/2601.19507.pdf

Authors: Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Qi Jia, Chunyi Li, Renrui Zhang, Heng Li, Zongrui Wang, Wei Sun

Title: Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs

Abstract:
Large vision‑language models (LVLMs) exhibit remarkable capabilities in cross‑modal tasks but face significant safety challenges, which undermine their reliability in real‑world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerability. However, existing benchmarks are hindered by their labor‑intensive construction process, static complexity, and limited discriminative power. Thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents: Data Preprocessing, Generation, Augmentation, and Selection agents to construct and select high‑quality samples. Experiments validates that VLSafetyBencher can construct high‑quality safety benchmarks within one week at a minimal cost. The generated benchmark effectively distinguish safety, with a safety rate disparity of 70% between the most and least safe models.

PaperID: 6064, https://arxiv.org/pdf/2601.19388.pdf

Authors: Yimin Tang, Sven Koenig, Erdem Bıyık

Title: Judgelight: Trajectory-Level Post-Optimization for Multi-Agent Path Finding via Closed-Subwalk Collapsing

Abstract:
Multi‑Agent Path Finding (MAPF) is an NP‑hard problem with applications in warehouse automation and multi‑robot coordination. Learning‑based MAPF solvers offer fast and scalable planning but often produce feasible trajectories that contain unnecessary or oscillatory movements. We propose Judgelight, a post‑optimization layer that improves trajectory quality after a MAPF solver generates a feasible schedule. Judgelight collapses closed subwalks in agents' trajectories to remove redundant movements while preserving all feasibility constraints. We formalize this process as MAPF‑Collapse, prove that it is NP‑hard, and present an exact optimization approach by formulating it as integer linear programming (ILP) problem. Experimental results show Judgelight consistently reduces solution cost by around 20%, particularly for learning‑based solvers, producing trajectories that are better suited for real‑world deployment.

PaperID: 6065, https://arxiv.org/pdf/2601.19311.pdf

Authors: Anh Khoa Ngo Ho, Martin Chauvin, Simon Gosset, Philippe Cordier, Boris Gamazaychikov

Title: Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic Artificial Intelligence Systems

Abstract:
As large language models become integral to agentic artificial intelligence systems, their energy demands during inference may pose significant sustainability challenges. This study investigates whether deploying smaller‑scale language models can reduce energy consumption without compromising responsiveness and output quality in a multi‑agent, real‑world environments. We conduct a comparative analysis across language models of varying scales to quantify trade‑offs between efficiency and performance. Results show that smaller open‑weights models can lower energy usage while preserving task quality. Building on these findings, we propose practical guidelines for sustainable artificial intelligence design, including optimal batch size configuration and computation resource allocation. These insights offer actionable strategies for developing scalable, environmentally responsible artificial intelligence systems.

PaperID: 6066, https://arxiv.org/pdf/2601.19290.pdf

Authors: Yimeng Wang, Jiaxing Zhao, Hongbin Xie, Hexing Ma, Yuzhen Lei, Shuangxue Liu, Xuan Song, Zichen Zhang, Haoran Zhang

Title: MetaGen: Self-Evolving Roles and Topologies for Multi-Agent LLM Reasoning

Abstract:
Large language models are increasingly deployed as multi‑agent systems, where specialized roles communicate and collaborate through structured interactions to solve complex tasks that often exceed the capacity of a single agent. However, most existing systems still rely on a fixed role library and an execution‑frozen interaction topology, a rigid design choice that frequently leads to task mismatch, prevents timely adaptation when new evidence emerges during reasoning, and further inflates inference cost. We introduce MetaGen, a training‑free framework that adapts both the role space and the collaboration topology at inference time, without updating base model weights. MetaGen generates and rewrites query‑conditioned role specifications to maintain a controllable dynamic role pool, then instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals. Experiments on code generation and multi‑step reasoning benchmarks show that MetaGen improves the accuracy and cost tradeoff over strong multi‑agent baselines.

PaperID: 6067, https://arxiv.org/pdf/2601.19279.pdf

Authors: Yidong Zhou, Lingyi Kong, Yifeng Peng, Zhiding Liang

Title: Reinforcement Learning for Enhanced Advanced QEC Architecture Decoding

Abstract:
The advent of promising quantum error correction (QEC) codes with efficient resource utilization and high‑performance fault‑tolerant quantum memories signifies a critical step towards realizing practical quantum computation. While surface codes have been a dominant approach, their limitations have spurred the development of more advanced QEC architectures. These advanced codes often present increased complexity, demanding innovative decoding methodologies. This work investigates the application of reinforcement learning (RL) techniques, including hybrid and multi‑agent approaches, to enhance the decoding of various advanced QEC architectures. By leveraging the ability of RL to learn optimal strategies from noisy syndrome measurements, we explore the potential for achieving improved logical error rates and scalability compared to traditional decoding methods. Our approach examines the adaptation of reinforcement learning to exploit the structural properties of these modern QEC models. We also explore the benefits of combining different RL algorithms to address the multifaceted nature of the decoding problem, considering factors such as code degeneracy and real‑world noise characteristics. With our proposed method, we are able to demonstrate that an autonomously trained agent can derive decoding schemes for the complex decoding requirement of advanced QEC architectures.

PaperID: 6068, https://arxiv.org/pdf/2601.19199.pdf

Authors: Libo Sun, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Title: MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution

Abstract:
Mobile GUI agents powered by large foundation models enable autonomous task execution, but frequent updates altering UI appearance and reorganizing workflows cause agents trained on historical data to fail. Despite surface changes, functional semantics and task intents remain fundamentally stable. Building on this insight, we introduce MAGNET, a memory‑driven adaptive agent framework with dual‑level memory: stationary memory linking diverse visual features to stable functional semantics for robust action grounding and procedural memory capturing stable task intents across varying workflows. We propose a dynamic memory evolution mechanism that continuously refines both memories by prioritizing frequently accessed knowledge. Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts. These results validate that leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments.

PaperID: 6069, https://arxiv.org/pdf/2601.19193.pdf

Authors: Van-Quang Nguyen, Takayuki Okatani

Title: CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning

Abstract:
Existing datasets for multimodal table understanding, such as MMTab, primarily provide short factual answers without explicit multi‑step reasoning supervision. Models trained on these datasets often generate brief responses that offers insufficient accuracy and limited interpretability into how these models arrive at the final answer. We introduce CoReTab, a code‑driven reasoning framework that produces scalable, interpretable, and automatically verifiable annotations by coupling multi‑step reasoning with executable Python code. Using the CoReTab framework, we curate a dataset of 115K verified samples averaging 529 tokens per response and fine‑tune open‑source MLLMs through a three‑stage pipeline. We evaluate the resulting model trained on CoReTab across 17 MMTab benchmarks spanning table question answering, fact verification, and table structure understanding. Our model achieves significant gains of +6.2%, +5.7%, and +25.6%, respectively, over MMTab‑trained baselines, while producing transparent and verifiable reasoning traces. These results establish CoReTab as a robust and generalizable supervision framework for improving multi‑step reasoning in multimodal table understanding.

PaperID: 6070, https://arxiv.org/pdf/2601.19174.pdf

Authors: Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya, Manu, Yi Guo, Jo Plested, Tim Lynar, Jack Yang, Wangli Yang

Title: SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Abstract:
Sponge attacks increasingly threaten LLM systems by inducing excessive computation and DoS. Existing defenses either rely on statistical filters that fail on semantically meaningful attacks or use static LLM‑based detectors that struggle to adapt as attack strategies evolve. We introduce SHIELD, a multi‑agent, auto‑healing defense framework centered on a three‑stage Defense Agent that integrates semantic similarity retrieval, pattern matching, and LLM‑based reasoning. Two auxiliary agents, a Knowledge Updating Agent and a Prompt Optimization Agent, form a closed self‑healing loop, when an attack bypasses detection, the system updates an evolving knowledgebase, and refines defense instructions. Extensive experiments show that SHIELD consistently outperforms perplexity‑based and standalone LLM defenses, achieving high F1 scores across both non‑semantic and semantic sponge attacks, demonstrating the effectiveness of agentic self‑healing against evolving resource‑exhaustion threats.

PaperID: 6071, https://arxiv.org/pdf/2601.19170.pdf

Authors: Wangyang Ying, Yanchi Liu, Xujiang Zhao, Wei Cheng, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen

Title: Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement

Abstract:
Automatically extracting workflows as procedural graphs from natural language is promising yet underexplored, demanding both structural validity and logical alignment. While recent large language models (LLMs) show potential for procedural graph extraction, they often produce ill‑formed structures or misinterpret logical flows. We present \model, a multi‑agent framework that formulates procedural graph extraction as a multi‑round reasoning process with dedicated structural and logical refinement. The framework iterates through three stages: (1) a graph extraction phase with the graph builder agent, (2) a structural feedback phase in which a simulation agent diagnoses and explains structural defects, and (3) a logical feedback phase in which a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into subsequent prompts, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that \model achieves substantial improvements in both structural correctness and logical consistency over strong baselines.

PaperID: 6072, https://arxiv.org/pdf/2601.19155.pdf

Authors: Qiujun Li, Zijin Xiao, Xulin Wang, Zhidan Ma, Cheng Yang, Haifeng Li

Title: LocationAgent: A Hierarchical Agent for Image Geolocation via Decoupling Strategy and Evidence from Parametric Knowledge

Abstract:
Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of hypothesis‑verification cycles, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory‑based reinforcement fine‑tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open‑world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner‑Executor‑Recorder), which employs role separation and context compression to prevent the drifting problem in multi‑step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL‑Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30% in zero‑shot settings.

PaperID: 6073, https://arxiv.org/pdf/2601.19121.pdf

Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu

Title: LLMs as Orchestrators: Constraint-Compliant Multi-Agent Optimization for Recommendation Systems

Abstract:
Recommendation systems must optimize multiple objectives while satisfying hard business constraints such as fairness and coverage. For example, an e‑commerce platform may require every recommendation list to include items from multiple sellers and at least one newly listed product; violating such constraints‑‑even once‑‑is unacceptable in production. Prior work on multi‑objective recommendation and recent LLM‑based recommender agents largely treat constraints as soft penalties or focus on item scoring and interaction, leading to frequent violations in real‑world deployments. How to leverage LLMs for coordinating constrained optimization in recommendation systems remains underexplored. We propose DualAgent‑Rec, an LLM‑coordinated dual‑agent framework for constrained multi‑objective e‑commerce recommendation. The framework separates optimization into an Exploitation Agent that prioritizes accuracy under hard constraints and an Exploration Agent that promotes diversity through unconstrained Pareto search. An LLM‑based coordinator adaptively allocates resources between agents based on optimization progress and constraint satisfaction, while an adaptive epsilon‑relaxation mechanism guarantees feasibility of final solutions. Experiments on the Amazon Reviews 2023 dataset demonstrate that DualAgent‑Rec achieves 100% constraint satisfaction and improves Pareto hypervolume by 4‑6% over strong baselines, while maintaining competitive accuracy‑diversity trade‑offs. These results indicate that LLMs can act as effective orchestration agents for deployable and constraint‑compliant recommendation systems.

PaperID: 6074, https://arxiv.org/pdf/2601.19119.pdf

Authors: Peter Travis Jardine, Sidney Givigi

Title: Agree to Disagree: Consensus-Free Flocking under Constraints

Abstract:
Robots sometimes have to work together with a mixture of partially‑aligned or conflicting goals. Flocking ‑ coordinated motion through cohesion, alignment, and separation ‑ traditionally assumes uniform desired inter‑agent distances. Many practical applications demand greater flexibility, as the diversity of types and configurations grows with the popularity of multi‑agent systems in society. Moreover, agents often operate without guarantees of trust or secure communication. Motivated by these challenges we update well‑established frameworks by relaxing this assumption of shared inter‑agent distances and constraints. Through a new form of constrained collective potential function, we introduce a solution that permits negotiation of these parameters. In the spirit of the traditional flocking control canon, this negotiation is achieved purely through local observations and does not require any global information or inter‑agent communication. The approach is robust to semi‑trust scenarios, where neighbouring agents pursue conflicting goals. We validate the effectiveness of the approach through a series of simulations.

PaperID: 6075, https://arxiv.org/pdf/2601.19094.pdf

Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye

Title: FloydNet: A Learning Paradigm for Global Relational Reasoning

Abstract:
Developing models capable of complex, multi‑step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message‑passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all‑pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task‑specific relational calculus, providing a principled framework for capturing long‑range dependencies. Theoretically, we prove that FloydNet achieves 3‑WL (2‑FWL) expressive power, and its generalized form aligns with the k‑FWL hierarchy. FloydNet demonstrates state‑of‑the‑art performance across challenging domains: it achieves near‑perfect scores (often >99%) on the CLRS‑30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3‑WL test on the BREC benchmark. Our results establish this learned, DP‑style refinement as a powerful and practical alternative to message passing for high‑level graph reasoning.

PaperID: 6076, https://arxiv.org/pdf/2601.19082.pdf

Authors: Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Nguyen Lam Phu Quy, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Pham Phu Hoa, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Title: More at Stake: How Payoff and Language Shape LLM Agent Strategies in Cooperation Dilemmas

Abstract:
As LLMs increasingly act as autonomous agents in interactive and multi‑agent settings, understanding their strategic behavior is critical for safety, coordination, and AI‑driven social and economic systems. We investigate how payoff magnitude and linguistic context shape LLM strategies in repeated social dilemmas, using a payoff‑scaled Prisoner's Dilemma to isolate sensitivity to incentive strength. Across models and languages, we observe consistent behavioral patterns, including incentive‑sensitive conditional strategies and cross‑linguistic divergence. To interpret these dynamics, we train supervised classifiers on canonical repeated‑game strategies and apply them to LLM decisions, revealing systematic, model‑ and language‑dependent behavioral intentions, with linguistic framing sometimes matching or exceeding architectural effects. Our results provide a unified framework for auditing LLMs as strategic agents and highlight cooperation biases with direct implications for AI governance and multi‑agent system design.

PaperID: 6077, https://arxiv.org/pdf/2601.18983.pdf

Authors: Dimitrios Tomaras, Vana Kalogeraki, Dimitrios Gunopulos

Title: Trustworthy Scheduling for Big Data Applications

Abstract:
Recent advances in modern containerized execution environments have resulted in substantial benefits in terms of elasticity and more efficient utilization of computing resources. Although existing schedulers strive to optimize performance metrics like task execution times and resource utilization, they provide limited transparency into their decision‑making processes or the specific actions developers must take to meet Service Level Objectives (SLOs). In this work, we propose X‑Sched, a middleware that uses explainability techniques to generate actionable guidance on resource configurations that makes task execution in containerized environments feasible, under resource and time constraints. X‑Sched addresses this gap by integrating counterfactual explanations with advanced machine learning models, such as Random Forests, to efficiently identify optimal configurations. This approach not only ensures that tasks are executed in line with performance goals but also gives users clear, actionable insights into the rationale behind scheduling decisions. Our experimental results validated with data from real‑world execution environments, illustrate the efficiency, benefits and practicality of our approach.

PaperID: 6078, https://arxiv.org/pdf/2601.18934.pdf

Authors: Ruipeng Wang, Tawab Safi, Yunge Wen, Christina Cunningham, Hoi Ling Tang, Behnaz Farahi

Title: Whispering Water: Materializing Human-AI Dialogue as Interactive Ripples

Abstract:
Across cultures, water has served as a recipient of human confession, a yielding medium that receives vulnerability where rigid surfaces cannot. We present Whispering Water, an interactive installation that materializes human‑AI dialogue through cymatic patterns on water. Participants confess secrets to a water surface, triggering a four‑phase ritual: confession, contemplation, response, and release. The user's speech sentiment is directly transmitted into the water to prime its state, while semantic content enters a multi‑agent system, initiating ripples of conversation where agent identities are situated through discourse and voice profiles are chosen based on what they say. We propose a novel algorithm that decomposes speech into component waves and reconstructs them in water, establishing a translation between speech and the physics of material form. By rendering machine reasoning as emergent physical phenomena, the installation explores possibilities for emotional self‑exploration through ambiguous, sensory‑rich interfaces.

PaperID: 6079, https://arxiv.org/pdf/2601.18924.pdf

Authors: Andrew Jaffe, Noah Reicin, Jinho D. Choi

Title: RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures

Abstract:
Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT, Reordered Instruction Following Testbed, to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question‑answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non‑sequential traversal. Across 10,000 evaluations spanning six state‑of‑the‑art open‑source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction‑order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation in current architectures, with direct implications for applications requiring non‑sequential control flow such as workflow automation and multi‑agent systems.

PaperID: 6080, https://arxiv.org/pdf/2601.18847.pdf

Authors: Zihan Wu, Jie Xu, Yun Peng, Chun Yong Chong, Xiaohua Jia

Title: MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution

Abstract:
Large Language Models (LLMs) struggle to automate real‑world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose MulVul, a retrieval‑augmented multi‑agent framework designed for precise and broad‑coverage vulnerability detection. MulVul adopts a coarse‑to‑fine strategy: a \emphRouter agent first predicts the top‑k coarse categories and then forwards the input to specialized \emphDetector agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design \emphCross‑Model Prompt Evolution, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self‑correction bias inherent in single‑model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79% Macro‑F1, outperforming the best baseline by 41.5%. Ablation studies validate cross‑model prompt evolution, which boosts performance by 51.6% over manual prompts by effectively handling diverse vulnerability patterns.

PaperID: 6081, https://arxiv.org/pdf/2601.18771.pdf

Authors: Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang

Title: Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval‑augmented generation (RAG) frameworks to more sophisticated search‑based frameworks that orchestrate multi‑step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub‑questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep‑Search, a dependency‑aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep‑Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep‑Search significantly enhances LLMs' ability to tackle complex multi‑hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.

PaperID: 6082, https://arxiv.org/pdf/2601.18735.pdf

Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Jiawei Yao, Jian Wang, Guanlong Qu, Ziliang Chen, Keze Wang

Title: Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems

Abstract:
Vision‑Language Models (VLMs) enable powerful multi‑agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture‑of‑Agents and knowledge‑based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability‑driven trading among agents based on rational economic rules. A market‑aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost‑efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC‑OCR) show that Agora outperforms strong VLMs and heuristic multi‑agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market‑based coordination as a principled and scalable paradigm for building economically viable multi‑agent visual intelligence systems.

PaperID: 6083, https://arxiv.org/pdf/2601.18573.pdf

Authors: Frederik Glitzner, David Manlove

Title: Stable Matching with Deviators and Conformists

Abstract:
In the fundamental Stable Marriage and Stable Roommates problems, there are inherent trade‑offs between the size and stability of solutions. While in the former problem, a stable matching always exists and can be found efficiently using the celebrated Gale‑Shapley algorithm, the existence of a stable matching is not guaranteed in the latter problem, but can be determined efficiently using Irving's algorithm. However, the computation of matchings that minimise the instability, either due to the presence of additional constraints on the size of the matching or due to restrictive preference cycles, gives rise to a collection of infamously intractable almost‑stable matching problems. In practice, however, not every agent is able or likely to initiate deviations caused by blocking pairs. Suppose we knew, for example, due to a set of requirements or estimates based on historical data, which agents are likely to initiate deviations ‑ the deviators ‑ and which are likely to comply with whatever matching they are presented with ‑ the conformists. Can we decide efficiently whether a matching exists in which no deviator is blocking, i.e., in which no deviator has an incentive to initiate a deviation? Furthermore, can we find matchings in which only a few deviators are blocking? We characterise the computational complexity of this question in bipartite and non‑bipartite preference settings. Surprisingly, these problems prove computationally intractable in strong ways: for example, unlike in the classical setting, where every agent is considered a deviator, in this extension, we prove that it is NP‑complete to decide whether a matching exists where no deviator is blocking. On the positive side, we identify polynomial‑time and fixed‑parameter tractable cases, providing novel algorithmics for multi‑agent systems where stability cannot be fully guaranteed.

PaperID: 6084, https://arxiv.org/pdf/2601.18563.pdf

Authors: Fang Liu, Erchao Zhu, Jiedan Tan, Jingwen Tong, Taotao Wang, Shengli Zhang

Title: An LLM-Agent-Based Framework for Age of Information Optimization in Heterogeneous Random Access Networks

Abstract:
With the rapid expansion of the Internet of Things (IoT) and heterogeneous wireless networks, the Age of Information (AoI) has emerged as a critical metric for evaluating the performance of real‑time and personalized systems. While AoI‑based random access is essential for next‑generation applications such as the low‑altitude economy and indoor service robots, existing strategies, ranging from rule‑based protocols to learning‑based methods, face critical challenges, including idealized model assumptions, slow convergence, and poor generalization. In this article, we propose Reflex‑Core, a novel Large Language Model (LLM) agent‑based framework for AoI‑driven random access in heterogeneous networks. By devising an "Observe‑Reflect‑Decide‑Execute" closed‑loop mechanism, this framework integrates Supervised Fine‑Tuning (SFT) and Proximal Policy Optimization (PPO) to enable optimal, autonomous access control. Based on the Reflex‑Core framework, we develop a Reflexive Multiple Access (RMA) protocol and a priority‑based RMA variant for intelligent access control under different heterogeneous network settings. Experimental results demonstrate that in the investigated scenarios, the RMA protocol achieves up to a 14.9% reduction in average AoI compared with existing baselines, while the priority‑based version improves the convergence rate by approximately 20%.

PaperID: 6085, https://arxiv.org/pdf/2601.18491.pdf

Authors: Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu

Title: AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Abstract:
The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three‑dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine‑grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine‑grained and contextual monitoring across agent trajectories. More Crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state‑of‑the‑art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.

PaperID: 6086, https://arxiv.org/pdf/2601.18419.pdf

Authors: Michael Kölle, Christian Reff, Leo Sünkel, Julian Hager, Gerhard Stenzel, Claudia Linnhoff-Popien

Title: Emergent Cooperation in Quantum Multi-Agent Reinforcement Learning Using Communication

Abstract:
Emergent cooperation in classical Multi‑Agent Reinforcement Learning has gained significant attention, particularly in the context of Sequential Social Dilemmas (SSDs). While classical reinforcement learning approaches have demonstrated capability for emergent cooperation, research on extending these methods to Quantum Multi‑Agent Reinforcement Learning remains limited, particularly through communication. In this paper, we apply communication approaches to quantum Q‑Learning agents: the Mutual Acknowledgment Token Exchange (MATE) protocol, its extension Mutually Endorsed Distributed Incentive Acknowledgment Token Exchange (MEDIATE), the peer rewarding mechanism Gifting, and Reinforced Inter‑Agent Learning (RIAL). We evaluate these approaches in three SSDs: the Iterated Prisoner's Dilemma, Iterated Stag Hunt, and Iterated Game of Chicken. Our experimental results show that approaches using MATE with temporal‑difference measure (MATE\textsubscriptTD), AutoMATE, MEDIATE‑I, and MEDIATE‑S achieved high cooperation levels across all dilemmas, demonstrating that communication is a viable mechanism for fostering emergent cooperation in Quantum Multi‑Agent Reinforcement Learning.

PaperID: 6087, https://arxiv.org/pdf/2601.18381.pdf

Authors: Yinghan Hou, Zongyou Yang

Title: AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

Abstract:
To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval‑Augmented Generation (RAG) and open‑source Large Language Models are combined through multi‑stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure‑aware segmentation, extraction of entity relationships, and Leiden‑based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three‑level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi‑stage retrieval pipeline performs parallel searching, concept expansion, community‑scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic‑based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G‑Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality‑based iterative refinement and state‑aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

PaperID: 6088, https://arxiv.org/pdf/2601.18320.pdf

Authors: Jinwei Lu, Yuanfeng Song, Chen Zhang, Raymond Chi-Wing Wong

Title: MultiVis-Agent: A Multi-Agent Framework with Logic Rules for Reliable and Comprehensive Cross-Modal Data Visualization

Abstract:
Real‑world visualization tasks involve complex, multi‑modal requirements that extend beyond simple text‑to‑chart generation, requiring reference images, code examples, and iterative refinement. Current systems exhibit fundamental limitations: single‑modality input, one‑shot generation, and rigid workflows. While LLM‑based approaches show potential for these complex requirements, they introduce reliability challenges including catastrophic failures and infinite loop susceptibility. To address this gap, we propose MultiVis‑Agent, a logic rule‑enhanced multi‑agent framework for reliable multi‑modal and multi‑scenario visualization generation. Our approach introduces a four‑layer logic rule framework that provides mathematical guarantees for system reliability while maintaining flexibility. Unlike traditional rule‑based systems, our logic rules are mathematical constraints that guide LLM reasoning rather than replacing it. We formalize the MultiVis task spanning four scenarios from basic generation to iterative refinement, and develop MultiVis‑Bench, a benchmark with over 1,000 cases for multi‑modal visualization evaluation. Extensive experiments demonstrate that our approach achieves 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54‑62.79%), with task completion rates of 99.58% and code execution success rates of 94.56% (vs. 74.48% and 65.10% without logic rules), successfully addressing both complexity and reliability challenges in automated visualization generation.

PaperID: 6089, https://arxiv.org/pdf/2601.18294.pdf

Authors: Samuel Mallick, Gianpietro Battocletti, Dimitris Boskos, Azita Dabiri, Bart De Schutter

Title: Reinforcement Learning with Distributed MPC for Fuel-Efficient Platoon Control with Discrete Gear Transitions

Abstract:
Cooperative control of groups of autonomous vehicles (AVs), i.e., platoons, is a promising direction to improving the efficiency of autonomous transportation systems. In this context, distributed co‑optimization of both vehicle speed and gear position can offer benefits for fuel‑efficient driving. To this end, model predictive control (MPC) is a popular approach, optimizing the speed and gear‑shift schedule while explicitly considering the vehicles' dynamics over a prediction window. However, optimization over both the vehicles' continuous dynamics and discrete gear positions is computationally intensive, and may require overly long sample times or high‑end hardware for real‑time implementation. This work proposes a reinforcement learning (RL)‑based distributed MPC approach to address this issue. For each vehicle in the platoon, a policy is trained to select and fix the gear positions across the prediction window of a local MPC controller, leaving a significantly simpler continuous optimization problem to be solved as part of a distributed MPC scheme. In order to reduce the computational cost of training and facilitate the scalability of the proposed approach to large platoons, the policies are parameterized such that the emergent multi‑agent RL problem can be decoupled into single‑agent learning tasks. In addition, a recurrent neural‑network (RNN) architecture is proposed for the gear selection policy, such that the learning is scalable even as the number of possible gear‑shift schedules grows exponentially with the MPC prediction horizon. In highway‑driving simulations, the proposed approach is shown to have a significantly lower computation burden and a comparable performance in terms of fuel‑efficient platoon control, with respect to pure MPC‑based co‑optimization.

PaperID: 6090, https://arxiv.org/pdf/2601.18284.pdf

Authors: Hsiao-Chuan Chang, Sheng-You Huang, Yen-Chi Chen, I-Chen Wu

Title: VissimRL: A Multi-Agent Reinforcement Learning Framework for Traffic Signal Control Based on Vissim

Abstract:
Traffic congestion remains a major challenge for urban transportation, leading to significant economic and environmental impacts. Traffic Signal Control (TSC) is one of the key measures to mitigate congestion, and recent studies have increasingly applied Reinforcement Learning (RL) for its adaptive capabilities. With respect to SUMO and CityFlow, the simulator Vissim offers high‑fidelity driver behavior modeling and wide industrial adoption but remains underutilized in RL research due to its complex interface and lack of standardized frameworks. To address this gap, this paper proposes VissimRL, a modular RL framework for TSC that encapsulates Vissim's COM interface through a high‑level Python API, offering standardized environments for both single‑ and multi‑agent training. Experiments show that VissimRL significantly reduces development effort while maintaining runtime efficiency, and supports consistent improvements in traffic performance during training, as well as emergent coordination in multi‑agent control. Overall, VissimRL demonstrates the feasibility of applying RL in high‑fidelity simulations and serves as a bridge between academic research and practical applications in intelligent traffic signal control.

PaperID: 6091, https://arxiv.org/pdf/2601.18282.pdf

Authors: Lei Wei, Xiao Peng, Jinpeng Ou, Bin Wang

Title: Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in function calling for autonomous agents, yet current mechanisms lack explicit reasoning transparency during parameter generation, particularly for complex functions with interdependent parameters. While existing approaches like chain‑of‑thought prompting operate at the agent level, they fail to provide fine‑grained reasoning guidance for individual function parameters. To address these limitations, we propose Think‑Augmented Function Calling (TAFC), a novel framework that enhances function calling accuracy through explicit reasoning at both function and parameter levels. Our method introduces a universal "think" parameter augmentation that enables models to articulate their decision‑making process, with dynamic optimization for parameter descriptions to improve reasoning quality. For complex parameters, TAFC automatically triggers granular reasoning based on complexity scoring, ensuring appropriate justification for critical decisions. Additionally, we propose reasoning‑guided optimization to align generated reasoning with human expectations. TAFC requires no architectural modifications to existing LLMs while maintaining full API compatibility. Evaluation on ToolBench across proprietary and open‑source models demonstrates significant improvements in parameter generation accuracy and reasoning coherence for multi‑parameter functions, while providing enhanced interpretability for debugging AI agent behaviors.

PaperID: 6092, https://arxiv.org/pdf/2601.18275.pdf

Authors: Qiufang Yu, Mengmeng Wu, Xingyu Lan

Title: When Nobody Around Is Real: Exploring Public Opinions and User Experiences On the Multi-Agent AI Social Platform

Abstract:
Powered by large language models, a new genre of multi‑agent social platforms has emerged. Apps such as Social.AI deploy numerous AI agents that emulate human behavior, creating unprecedented bot‑centric social networks. Yet, existing research has predominantly focused on one‑on‑one chatbots, leaving multi‑agent AI platforms underexplored. To bridge this gap, we took Social.AI as a case study and performed a two‑stage investigation: (i) content analysis of 883 user comments; (ii) a 7‑day diary study with 20 participants to document their firsthand platform experiences. While public discourse expressed greater skepticism, the diary study found that users did project a range of social expectations onto the AI agents. While some user expectations were met, the AI‑dominant social environment introduces distinct problems, such as attention overload and homogenized interaction. These tensions signal a future where AI functions not merely as a tool or an anthropomorphized actor, but as the dominant medium of sociality itself‑a paradigm shift that foregrounds new forms of architected social life.

PaperID: 6093, https://arxiv.org/pdf/2601.18239.pdf

Authors: Sizhe Cheng, Feng Liang, Yuhan Wen, Xipei Yu, Yong Wang

Title: Probing the Future of Meta-Analysis: Eliciting Design Principles via an Agentic Research IDE

Abstract:
Meta‑analyses and systematic reviews demand rigorous abductive reasoning to build, test, and refine hypotheses across vast, heterogeneous literature. While NLP advancements have automated parts of this pipeline, existing tools often detach researchers from the cognitive loop or function merely as retrieval engines, leading to loss of intellectual ownership and frequent context switching. We present Research IDE, a prototype reimagining authoring environments through the "Research as Code" metaphor. Research IDE embeds a multi‑agent backend into the writing flow, enabling in‑situ verification via "hypothesis breakpoints." A one‑week field deployment with 8 domain experts, followed by a reflective workshop, as a Research through Design (RtD) probe, reveals that users strongly preferred this verification workflow, actively leveraged prior knowledge for confirmation, and reported that breakpoints sparked insights. Drawing from participant feedback and suggestions, we derive design implications for future AI‑assisted research tools that fully preserve researcher autonomy and intellectual ownership while harnessing computational scale.

PaperID: 6094, https://arxiv.org/pdf/2601.18217.pdf

Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang

Title: Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Abstract:
Generalist LLM agents are often post‑trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post‑training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out‑of‑domain performance. First, we identify two environment axes that strongly correlate with cross‑domain generalization: (i) state information richness, i.e., the amount of information for the agent to process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text‑level similarity are not the primary factors; for instance, the simple grid‑world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can already effectively improve cross‑domain robustness. We propose a randomization technique, which is low‑overhead and broadly applicable: add small amounts of distractive goal‑irrelevant features to the state to make it richer without altering the task. Beyond environment‑side properties, we also examine several modeling choices: (a) SFT warmup or mid‑training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid‑training datamix; and (b) turning on step‑by‑step thinking during RL, while not always improving in‑domain performance, plays a crucial role in preserving generalization.

PaperID: 6095, https://arxiv.org/pdf/2601.18137.pdf

Authors: Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, Junyang Lin

Title: DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

Abstract:
While agent evaluation has shifted toward long‑horizon tasks, most benchmarks still emphasize local, step‑level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine‑grained local constraints typical of real‑world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long‑horizon agent planning. It features multi‑day travel planning and multi‑product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness‑efficiency trade‑offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open‑source the code and data to support future research.

PaperID: 6096, https://arxiv.org/pdf/2601.18077.pdf

Authors: Mahesh Ramesh, Kaousheik Jayakumar, Aswinkumar Ramkumar, Pavan Thodima, Aniket Rege, Emmanouil-Vasileios Vlatakis-Gkaragkounis

Title: Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents

Abstract:
Cooperative reasoning under incomplete information remains challenging for both humans and multi‑agent systems. The card game Hanabi embodies this challenge, requiring theory‑of‑mind reasoning and strategic communication. We benchmark 17 state‑of‑the‑art LLM agents in 2‑5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian‑motivated deductions (Sherlock setting), to multi‑turn state tracking via working memory (Mycroft setting). We show that (1) agents can maintain an internal working memory for state tracking and (2) cross‑play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning models exceed 15 points on average across player counts, yet still trail experienced humans and specialist Hanabi agents, both consistently scoring above 20. We release the first public Hanabi datasets with annotated trajectories and move utilities: (1) HanabiLogs, containing 1,520 full game logs for instruction tuning, and (2) HanabiRewards, containing 560 games with dense move‑level value annotations for all candidate moves. Supervised and RL finetuning of a 4B open‑weight model (Qwen3‑Instruct) on our datasets improves cooperative Hanabi play by 21% and 156% respectively, bringing performance to within ~3 points of a strong proprietary reasoning model (o4‑mini) and surpassing the best non‑reasoning model (GPT‑4.1) by 52%. The HanabiRewards RL‑finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group‑guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction‑following on IFBench‑800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10.

PaperID: 6097, https://arxiv.org/pdf/2601.18027.pdf

Authors: Chiyuan Fu, Lyuhao Chen, Yunze Xiao, Weihao Xuan, Carlos Busso, Mona Diab

Title: Sentipolis: Emotion-Aware Agents for Social Simulations

Abstract:
LLM agents are increasingly used for social simulation, yet emotion is often treated as a transient cue, causing emotional amnesia and weak long‑horizon continuity. We present Sentipolis, a framework for emotionally stateful agents that integrates continuous Pleasure‑Arousal‑Dominance (PAD) representation, dual‑speed emotion dynamics, and emotion‑‑memory coupling. Across thousands of interactions over multiple base models and evaluators, Sentipolis improves emotionally grounded behavior, boosting communication, and emotional continuity. Gains are model‑dependent: believability increases for higher‑capacity models but can drop for smaller ones, and emotion‑awareness can mildly reduce adherence to social norms, reflecting a human‑like tension between emotion‑driven behavior and rule compliance in social simulation. Network‑level diagnostics show reciprocal, moderately clustered, and temporally stable relationship structures, supporting the study of cumulative social dynamics such as alliance formation and gradual relationship change.

PaperID: 6098, https://arxiv.org/pdf/2601.18018.pdf

Authors: Boran A. Kilic, Ozgur B. Akan

Title: Neural-Inspired Multi-Agent Molecular Communication Networks for Collective Intelligence

Abstract:
Molecular Communication (MC) is a pivotal enabler for the Internet of Bio‑Nano Things (IoBNT). However, current research often relies on super‑capable individual agents with complex transceiver architectures that defy the energy and processing constraints of realistic nanomachines. This paper proposes a paradigm shift towards collective intelligence, inspired by the cortical networks of the biological brain. We introduce a decentralized network architecture where simple nanomachines interact via a diffusive medium using a threshold‑based firing mechanism modeled by Greenberg‑Hastings (GH) cellular automata. We derive fixed‑point equations for steady‑state populations via mean‑field analysis and validate them against stochastic simulations. We demonstrate that the network undergoes a second‑order phase transition at a specific activation threshold. Crucially, we show that both pairwise and collective mutual information peak exactly at this critical transition point, confirming that the system maximizes information propagation and processing capacity at the edge of chaos.

PaperID: 6099, https://arxiv.org/pdf/2601.17933.pdf

Authors: Laurent Caraffa

Title: Dissipative Learning: A Framework for Viable Adaptive Systems

Abstract:
We propose a perspective in which learning is an intrinsically dissipative process. Forgetting and regularization are not heuristic add‑ons but structural requirements for adaptive systems. Drawing on information theory, thermodynamics, and information geometry, we introduce the BEDS (Bayesian Emergent Dissipative Structures) framework, modeling learning as the evolution of compressed belief states under dissipation constraints. A central contribution is the Conditional Optimality Theorem, showing that Fisher‑Rao regularization measuring change via information divergence rather than Euclidean distance is the unique thermodynamically optimal regularization strategy, achieving minimal dissipation. Euclidean regularization is shown to be structurally suboptimal. The framework unifies existing methods (Ridge, SIGReg, EMA, SAC) as special cases of a single governing equation. Within this view, overfitting corresponds to over‑crystallization, while catastrophic forgetting reflects insufficient dissipation control. The framework distinguishes BEDS‑crystallizable problems, where beliefs converge to stable equilibria, from BEDS‑maintainable problems, which require continual adaptation. It extends naturally to continual and multi‑agent systems, where viability, stability under adaptation and finite resources replaces asymptotic optimality as the primary criterion. Overall, this work reframes learning as maintaining viable belief states under dissipation constraints, providing a principled lens on forgetting, regularization, and stability.

PaperID: 6100, https://arxiv.org/pdf/2601.17915.pdf

Authors: Saurabh Jha, Rohan Arora, Bhavya, Noah Zheutlin, Paulina Toro Isaza, Laura Shwartz, Yu Deng, Daby Sow, Ruchi Mahindru, Ruchir Puri

Title: Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

Abstract:
LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open‑ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co‑vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct‑style agents are especially brittle in this regime. Their retrieve‑summarize‑reason loop makes conclusions sensitive to exploration order and introduces run‑to‑run non‑determinism, producing a reliability gap where Pass‑at‑k may be high but Majority‑at‑k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run‑to‑run consistency over ReAct baselines, including a 7x average gain in Majority‑at‑k entity F1.

PaperID: 6101, https://arxiv.org/pdf/2601.17842.pdf

Authors: Lanqing Du, Yunong Li, YuJie Long, Shihong Chen

Title: EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy

Abstract:
The use of large language models (LLMs) for Mental Health Question Answering (MHQA) offers a promising way to alleviate shortages in mental health resources. However, prior work has mainly relied on Cognitive Behavioral Therapy (CBT) and predominantly follows a top‑down strategy centered on rational cognitive restructuring, providing limited support for embodied experience and primary emotion processing. To address this gap, we propose EFT‑CoT, a multi‑agent chain‑of‑thought framework grounded in Emotion‑Focused Therapy (EFT). EFT‑CoT operationalizes intervention as a three‑stage workflow: Embodied Perception, Cognitive Exploration, and Narrative Intervention. The framework employs eight specialized agents to model key processes including somatic awareness mapping, adaptive evaluation, core belief extraction, and narrative restructuring. Based on this framework, we construct EFT‑Instruct, a high‑quality instruction‑tuning dataset built from process‑level augmentation of about 67,000 real help‑seeking texts, and further fine‑tune a dedicated model, EFT‑LLM. Experiments show that EFT‑LLM consistently outperforms strong baselines and human responses in empathic depth and structural professionalism. Ablation studies further verify the contribution of key mechanisms, while white‑box auditing demonstrates the consistency and traceability of critical intermediate states. Overall, this work provides a reproducible framework‑data‑model pipeline for embedding EFT mechanisms into LLM‑based mental health support.

PaperID: 6102, https://arxiv.org/pdf/2601.17817.pdf

Authors: Hongjuan Li, Hui Kang, Jiahui Li, Geng Sun, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Wei Ni, Abbas Jamalipour

Title: Multi-Agent Collaborative Intrusion Detection for Low-Altitude Economy IoT: An LLM-Enhanced Agentic AI Framework

Abstract:
The rapid expansion of low‑altitude economy Internet of Things (LAE‑IoT) networks has created unprecedented security challenges due to dynamic three‑dimensional mobility patterns, distributed autonomous operations, and severe resource constraints. Traditional intrusion detection systems designed for static ground‑based networks prove inadequate for tackling the unique characteristics of aerial IoT environments, including frequent topology changes, real‑time detection requirements, and energy limitations. In this article, we analyze the intrusion detection requirements for LAE‑IoT networks, complemented by a comprehensive review of evaluation metrics that cover detection effectiveness, response time, and resource consumption. Then, we investigate transformative potential of agentic artificial intelligence (AI) paradigms and introduce a large language model (LLM)‑enabled agentic AI framework for enhancing intrusion detection in LAE‑IoT networks. This leads to our proposal of a novel multi‑agent collaborative intrusion detection framework that leverages specialized LLM‑enhanced agents for intelligent data processing and adaptive classification. Through experimental validation, our framework demonstrates superior performance of over 90% classification accuracy across multiple benchmark datasets. These results highlight the transformative potential of combining agentic AI principles with LLMs for next‑generation LAE‑IoT security systems.

PaperID: 6103, https://arxiv.org/pdf/2601.17762.pdf

Authors: Zelong Zheng, Jiayuan Zhou, Xing Hu, Yi Gao, Shengyi Pan

Title: Multi-Agent End-to-End Vulnerability Management for Mitigating Recurring Vulnerabilities

Abstract:
Software vulnerability management has become increasingly critical as modern systems scale in size and complexity. However, existing automated approaches remain insufficient. Traditional static analysis methods struggle to precisely capture contextual dependencies, especially when vulnerabilities span multiple functions or modules. Large language models (LLMs) often lack the ability to retrieve and exploit sufficient contextual information, resulting in incomplete reasoning and unreliable outcomes. Meanwhile, recurring vulnerabilities emerge repeatedly due to code reuse and shared logic, making historical vulnerability knowledge an indispensable foundation for effective vulnerability detection and repair. Nevertheless, prior approaches such as clone‑based detection and patch porting, have not fully leveraged this knowledge. To address these challenges, we present MAVM, a multi‑agent framework for end‑to‑end recurring vulnerability management. MAVM integrates five components, including a vulnerability knowledge base, detection, confirmation, repair, and validation, into a unified multi‑agent pipeline. We construct a knowledge base from publicly disclosed vulnerabilities, thereby addressing the underuse of historical knowledge in prior work and mitigating the lack of domain‑specific expertise in LLMs. Furthermore, we design context‑retrieval tools that allow agents to extract and reason over repository‑level information, overcoming the contextual limitations of previous methods. Based on agents, MAVM effectively simulates real‑world security workflows. To evaluate the performance of MAVM, we construct a dataset containing 78 real‑world patch‑porting cases (covering 114 function‑level migrations). On this dataset, MAVM successfully detects and repairs 51 real vulnerabilities, outperforming baselines by 31.9%‑45.2% in repair accuracy, which demonstrates its effectiveness.

PaperID: 6104, https://arxiv.org/pdf/2601.17755.pdf

Authors: Jinyoung Park, Sanghyeok Lee, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim

Title: HyperGraphPro: Progress-Aware Reinforcement Learning for Structure-Guided Hypergraph RAG

Abstract:
Graph Retrieval‑Augmented Generation (GraphRAG) has emerged as a promising paradigm that organizes external knowledge into structured graphs of entities and relations, enabling large language models (LLMs) to perform complex reasoning beyond text‑chunk retrieval. Recent advances have integrated reinforcement learning (RL) into agentic GraphRAG approaches, enabling iterative interactions with knowledge graphs during training. However, existing RL‑based methods suffer from two key limitations: (1) they primarily depend on semantic similarity for retrieval, often overlooking the underlying graph topology, and (2) they rely on sparse, outcome‑level rewards that fail to capture the quality of intermediate retrieval steps and their dependencies. To address these limitations, we propose HyperGraphPro, a progress‑aware agentic framework for graph‑based retrieval and multi‑step reasoning. HyperGraphPro introduces a structure‑aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity, promoting coherent traversal along multi‑hop reasoning paths. Furthermore, we design a progress‑based stepwise policy optimization that provides dense learning signals by modulating advantages according to intermediate reasoning progress within a graph, rather than relying solely on final outcomes. Experiments on multi‑hop question answering benchmarks demonstrate that HyperGraphPro consistently improves reasoning accuracy and generation quality over existing GraphRAG methods.

PaperID: 6105, https://arxiv.org/pdf/2601.17736.pdf

Authors: Can Liu, Jaeuk Lee, Tianhe Chen, Zhibang Jiang, Xiaolin Wen, Yong Wang

Title: From Static to Interactive: Authoring Interactive Visualizations via Natural Language

Abstract:
Interactivity is crucial for effective data visualizations. However, it is often challenging to implement interactions for existing static visualizations, since the underlying code and data for existing static visualizations are often not available, and it also takes significant time and effort to enable interactions for them even if the original code and data are available. To fill this gap, we propose Athanor, a novel approach to transform existing static visualizations into interactive ones using multimodal large language models (MLLMs) and natural language instructions. Our approach introduces three key innovations: (1) an action‑modification interaction design space that maps visualization interactions into user actions and corresponding adjustments, (2) a multi‑agent requirement analyzer that translates natural language instructions into an actionable operational space, and (3) a visualization abstraction transformer that converts static visualizations into flexible and interactive representations regardless of their underlying implementation. Athanor allows users to effortlessly author interactions through natural language instructions, eliminating the need for programming. We conducted two case studies and in‑depth interviews with target users to evaluate our approach. The results demonstrate the effectiveness and usability of our approach in allowing users to conveniently enable flexible interactions for static visualizations.

PaperID: 6106, https://arxiv.org/pdf/2601.17716.pdf

Authors: Daniel M. Pedrozo, Telma W. de L. Soares, Bryan L. M. de Oliveira

Title: Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games

Abstract:
Large Language Models (LLMs) excel at many tasks but still struggle with a critical ability for LLM‑based agents: asking good questions for resolving ambiguity in user requests. While prior work has explored information‑seeking behavior through word games, existing benchmarks lack comprehensive evaluation frameworks that provide both final and intermediate signals based on Information Gain (IG). Moreover, they rarely provide systematic comparisons between models that use chain‑of‑thought reasoning and those that do not. We propose a multi‑turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions in a hierarchical knowledge graph environment. Our framework employs a triad of interacting LLM agents that ask questions, answer them, and update the hypothesis space. We adopt IG as the main metric, grounded in Shannon entropy, to assess query effectiveness at each turn and cumulatively. We instantiate our framework in a geographical Guess My City game setting organized in a five‑level taxonomy and evaluate multiple LLM variants under fully and partially observable conditions, with and without Chain‑of‑Thought reasoning. Our experiments demonstrate that, among the evaluated models, the ones with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps, particularly in partially observable settings. Analysis of reasoning traces reveals that smaller models compensate for limited capacity through more aggressive exploration of candidate questions, while larger models exhibit higher assertiveness in selecting optimal queries, generating candidates with greater potential IG.

PaperID: 6107, https://arxiv.org/pdf/2601.17692.pdf

Authors: Yunhan Li, Mingjie Xie, Gaoli Kang, Zihan Gong, Gengshen Wu, Min Yang

Title: LegalMALR:Multi-Agent Query Understanding and LLM-Based Reranking for Chinese Statute Retrieval

Abstract:
Statute retrieval is essential for legal assistance and judicial decision support, yet real‑world legal queries are often implicit, multi‑issue, and expressed in colloquial or underspecified forms. These characteristics make it difficult for conventional retrieval‑augmented generation pipelines to recover the statutory elements required for accurate retrieval. Dense retrievers focus primarily on the literal surface form of the query, whereas lightweight rerankers lack the legal‑reasoning capacity needed to assess statutory applicability. We present LegalMALR, a retrieval framework that integrates a Multi‑Agent Query Understanding System (MAS) with a zero‑shot large‑language‑model‑based reranking module (LLM Reranker). MAS generates diverse, legally grounded reformulations and conducts iterative dense retrieval to broaden candidate coverage. To stabilise the stochastic behaviour of LLM‑generated rewrites, we optimise a unified MAS policy using Generalized Reinforcement Policy Optimization(GRPO). The accumulated candidate set is subsequently evaluated by the LLM Reranker, which performs natural‑language legal reasoning to produce the final ranking. We further construct CSAID, a dataset of 118 difficult Chinese legal queries annotated with multiple statutory labels, and evaluate LegalMALR on both CSAID and the public STARD benchmark. Experiments show that LegalMALR substantially outperforms strong Retrieval‑augmented generation(RAG) baselines in both in‑distribution and out‑of‑distribution settings, demonstrating the effectiveness of combining multi‑perspective query interpretation, reinforcement‑based policy optimisation, and large‑model reranking for statute retrieval.

PaperID: 6108, https://arxiv.org/pdf/2601.17678.pdf

Authors: Zhiyu An, Wan Du

Title: DIML: Differentiable Inverse Mechanism Learning from Behaviors of Multi-Agent Learning Trajectories

Abstract:
We study inverse mechanism learning: recovering an unknown incentive‑generating mechanism from observed strategic interaction traces of self‑interested learning agents. Unlike inverse game theory and multi‑agent inverse reinforcement learning, which typically infer utility/reward parameters inside a structured mechanism, our target includes unstructured mechanism ‑‑ a (possibly neural) mapping from joint actions to per‑agent payoffs. Unlike differentiable mechanism design, which optimizes mechanisms forward, we infer mechanisms from behavior in an observational setting. We propose DIML, a likelihood‑based framework that differentiates through a model of multi‑agent learning dynamics and uses the candidate mechanism to generate counterfactual payoffs needed to predict observed actions. We establish identifiability of payoff differences under a conditional logit response model and prove statistical consistency of maximum likelihood estimation under standard regularity conditions. We evaluate DIML with simulated interactions of learning agents across unstructured neural mechanisms, congestion tolling, public goods subsidies, and large‑scale anonymous games. DIML reliably recovers identifiable incentive differences and supports counterfactual prediction, where its performance rivals tabular enumeration oracle in small environments and its convergence scales to large, hundred‑participant environments. Code to reproduce our experiments is open‑sourced.

PaperID: 6109, https://arxiv.org/pdf/2601.17593.pdf

Authors: Tianjun Zhong, Linyang He, Nima Mesgarani

Title: From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs

Abstract:
Recent progress in large language models has renewed interest in how multi‑step reasoning is represented internally. While prior work often treats reasoning as a linear chain, many reasoning problems are more naturally modeled as directed acyclic graphs (DAGs), where intermediate conclusions branch, merge, and are reused. Whether such graph structure is reflected in model internals remains unclear. We introduce Reasoning DAG Probing, a framework for testing whether LLM hidden states linearly encode properties of an underlying reasoning DAG and where this structure emerges across layers. We associate each reasoning node with a textual realization and train lightweight probes to predict node depth, pairwise distance, and adjacency from hidden states. Using these probes, we analyze the emergence of DAG structure across layers, reconstruct approximate reasoning graphs, and evaluate controls that disrupt reasoning‑relevant structure while preserving surface text. Across reasoning benchmarks, we find that DAG structure is meaningfully encoded in LLM representations, with recoverability peaking in intermediate layers, varying systematically by node depth, edge span, and model scale, and enabling nontrivial recovery of dependency graphs. These findings suggest that LLM reasoning is not purely sequential, but exhibits measurable internal graph structure.

PaperID: 6110, https://arxiv.org/pdf/2601.17588.pdf

Authors: Marcus Ma, Shrikanth Narayanan

Title: Intelligence Requires Grounding But Not Embodiment

Abstract:
Recent advances in LLMs have reignited scientific debate over whether embodiment is necessary for intelligence. We present the argument that intelligence requires grounding, a phenomenon entailed by embodiment, but not embodiment itself. We define intelligence as the possession of four properties ‑‑ motivation, predictive ability, understanding of causality, and learning from experience ‑‑ and argue that each can be achieved by a non‑embodied, grounded agent. We use this to conclude that grounding, not embodiment, is necessary for intelligence. We then present a thought experiment of an intelligent LLM agent in a digital environment and address potential counterarguments.

PaperID: 6111, https://arxiv.org/pdf/2601.17577.pdf

Authors: Emilio Barkett

Title: Status Hierarchies in Language Models

Abstract:
From school playgrounds to corporate boardrooms, status hierarchies ‑‑ rank orderings based on respect and perceived competence ‑‑ are universal features of human social organization. Language models trained on human‑generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi‑agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.'s (1972) expectation states framework. I create multi‑agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner's responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner's position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high‑status assignments reduce higher‑capability models' deference rather than increasing lower‑capability models' deference. The implications for AI safety are significant: status‑seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.

PaperID: 6112, https://arxiv.org/pdf/2601.17566.pdf

Authors: Qi Li, Xinchao Wang

Title: Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning

Abstract:
Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plug‑and‑play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool‑calling process remain largely unexplored. In this work, we identify a tool‑specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query‑only access assumption. Without any modification on the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi‑agent collaborative framework with explicit rewritten policy control, and generates benign‑looking prompt rewrites from the original one with high semantic fidelity. Extensive experiments across 6 models (including both open‑source models and closed‑source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.

PaperID: 6113, https://arxiv.org/pdf/2601.17563.pdf

Authors: Nathan Gavenski, Matteo Leonetti, Odinaldo Rodrigues

Title: Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment

Abstract:
State‑of‑the‑art imitation learning from observation methods (ILfO) have recently made significant progress, but they still have some limitations: they need action‑based supervised optimisation, assume that states have a single optimal action, and tend to apply teacher actions without full consideration of the actual environment state. While the truth may be out there in observed trajectories, existing methods struggle to extract it without supervision. In this work, we propose Unsupervised Imitation Learning from Observation (UfO) that addresses all of these limitations. UfO learns a policy through a two‑stage process, in which the agent first obtains an approximation of the teacher's true actions in the observed state transitions, and then refines the learned policy further by adjusting agent trajectories to closely align them with the teacher's. Experiments we conducted in five widely used environments show that UfO not only outperforms the teacher and all other ILfO methods but also displays the smallest standard deviation. This reduction in standard deviation indicates better generalisation in unseen scenarios.

PaperID: 6114, https://arxiv.org/pdf/2601.17549.pdf

Authors: Narek Maloyan, Dmitry Namiot

Title: Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents

Abstract:
The Model Context Protocol (MCP) has emerged as a de facto standard for integrating Large Language Models with external tools, yet no formal security analysis of the protocol specification exists. We present the first rigorous security analysis of MCP's architectural design, identifying three fundamental protocol‑level vulnerabilities: (1) absence of capability attestation allowing servers to claim arbitrary permissions, (2) bidirectional sampling without origin authentication enabling server‑side prompt injection, and (3) implicit trust propagation in multi‑server configurations. We implement \textscMCPBench, a novel framework bridging existing agent security benchmarks to MCP‑compliant infrastructure, enabling direct measurement of protocol‑specific attack surfaces. Through controlled experiments on 847 attack scenarios across five MCP server implementations, we demonstrate that MCP's architectural choices amplify attack success rates by 23‑‑41% compared to equivalent non‑MCP integrations. We propose \textscMCPSec, a backward‑compatible protocol extension adding capability attestation and message authentication, reducing attack success rates from 52.8% to 12.4% with median latency overhead of 8.3ms per message. Our findings establish that MCP's security weaknesses are architectural rather than implementation‑specific, requiring protocol‑level remediation.

PaperID: 6115, https://arxiv.org/pdf/2601.17471.pdf

Authors: Wonyoung Kim, Seunggi Min, Minjae Gwon, Dowoo Baik, Haein Lee, Hyeon Heo, Minjae Lee, Min Woo Baek, Yonghwi Jin, Younggi Park, Yunjae Choi, Taesoo Kim, Sangdon Park, Insu Yun

Title: PatchIsland: Orchestration of LLM Agents for Continuous Vulnerability Repair

Abstract:
Continuous fuzzing platforms such as OSS‑Fuzz uncover large numbers of vulnerabilities, yet the subsequent repair process remains largely manual. Unfortunately, existing Automated Vulnerability Repair (AVR) techniques ‑‑ including recent LLM‑based systems ‑‑ are not directly applicable to continuous fuzzing. This is because these systems are designed and evaluated on a static, single‑run benchmark setting, making them ill‑suited for the diverse, noisy, and failure‑prone environments in continuous fuzzing. To address these issues, we introduce PatchIsland, a system for Continuous Vulnerability Repair (CVR) that tightly integrates with continuous fuzzing pipelines. PatchIsland employs an ensemble of diverse LLM agents. By leveraging multiple LLM agents, PatchIsland can cover a wider range of settings (e.g., different projects, bug types, and programming languages) and also improve operational robustness. In addition, PatchIsland utilizes a two‑phase patch‑based deduplication to mitigate duplicate crashes and patches, which can be problematic in continuous fuzzing. In our internal evaluation, PatchIsland repaired 84 of 92 vulnerabilities, demonstrating strong repair capability. In the official AIxCC competition, the system operated with no human intervention in a fully autonomous environment and successfully patched 31 out of 43 vulnerabilities, achieving a repair rate of 72.1%.

PaperID: 6116, https://arxiv.org/pdf/2601.17454.pdf

Authors: Muhammad Ahmed Atif, Nehal Naeem Haji, Mohammad Shahid Shaikh, Muhammad Ebad Atif

Title: Embodiment-Induced Coordination Regimes in Tabular Multi-Agent Q-Learning

Abstract:
Centralized value learning is often assumed to improve coordination and stability in multi‑agent reinforcement learning, yet this assumption is rarely tested under controlled conditions. We directly evaluate it in a fully tabular predator‑prey gridworld by comparing independent and centralized Q‑learning under explicit embodiment constraints on agent speed and stamina. Across multiple kinematic regimes and asymmetric agent roles, centralized learning fails to provide a consistent advantage and is frequently outperformed by fully independent learning, even under full observability and exact value estimation. Moreover, asymmetric centralized‑independent configurations induce persistent coordination breakdowns rather than transient learning instability. By eliminating confounding effects from function approximation and representation learning, our tabular analysis isolates coordination structure as the primary driver of these effects. The results show that increased coordination can become a liability under embodiment constraints, and that the effectiveness of centralized learning is fundamentally regime and role dependent rather than universal.

PaperID: 6117, https://arxiv.org/pdf/2601.17440.pdf

Authors: Xinru Cui, Linxi Feng, Yixuan Zhou, Haoqi Han, Zhe Liu, Hesheng Wang

Title: PILOT: A Perceptive Integrated Low-level Controller for Loco-manipulation over Unstructured Scenes

Abstract:
Humanoid robots hold great potential for diverse interactions and daily service tasks within human‑centered environments, necessitating controllers that seamlessly integrate precise locomotion with dexterous manipulation. However, most existing whole‑body controllers lack exteroceptive awareness of the surrounding environment, rendering them insufficient for stable task execution in complex, unstructured scenarios.To address this challenge, we propose PILOT, a unified single‑stage reinforcement learning (RL) framework tailored for perceptive loco‑manipulation, which synergizes perceptive locomotion and expansive whole‑body control within a single policy. To enhance terrain awareness and ensure precise foot placement, we design a cross‑modal context encoder that fuses prediction‑based proprioceptive features with attention‑based perceptive representations. Furthermore, we introduce a Mixture‑of‑Experts (MoE) policy architecture to coordinate diverse motor skills, facilitating better specialization across distinct motion patterns. Extensive experiments in both simulation and on the physical Unitree G1 humanoid robot validate the efficacy of our framework. PILOT demonstrates superior stability, command tracking precision, and terrain traversability compared to existing baselines. These results highlight its potential to serve as a robust, foundational low‑level controller for loco‑manipulation in unstructured scenes.

PaperID: 6118, https://arxiv.org/pdf/2601.17435.pdf

Authors: Maria Jesus Rodriguez-Sanchez, Manuel Noguera, Angel Ruiz-Zafra, Kawtar Benghazi

Title: Towards a Declarative Agentic Layer for Intelligent Agents in MCP-Based Server Ecosystems

Abstract:
Recent advances in Large Language Models (LLMs) have enabled the development of increasingly complex agentic and multi‑agent systems capable of planning, tool use and task decomposition. However, empirical evidence shows that many of these systems suffer from fundamental reliability issues, including hallucinated actions, unexecutable plans and brittle coordination. Crucially, these failures do not stem from limitations of the underlying models themselves, but from the absence of explicit architectural structure linking goals, capabilities and execution. This paper presents a declarative, model‑independent architectural layer for grounded agentic workflows that addresses this gap. The proposed layer, referred to as DALIA (Declarative Agentic Layer for Intelligent Agents), formalises executable capabilities, exposes tasks through a declarative discovery protocol, maintains a federated directory of agents and their execution resources, and constructs deterministic task graphs grounded exclusively in declared operations. By enforcing a clear separation between discovery, planning and execution, the architecture constrains agent behaviour to a verifiable operational space, reducing reliance on speculative reasoning and free‑form coordination. We present the architecture and design principles of the proposed layer and illustrate its operation through a representative task‑oriented scenario, demonstrating how declarative grounding enables reproducible and verifiable agentic workflows across heterogeneous environments.

PaperID: 6119, https://arxiv.org/pdf/2601.17419.pdf

Authors: Ahmad Halimi Razlighi, Pallavi Dhingra, Edgar Beck, Bho Matthiesen, Armin Dekorsy

Title: Semantic-Aware Task Clustering for Federated Cooperative Multi-Task Semantic Communication

Abstract:
Task‑oriented semantic communication (SemCom) prioritizes task execution over accurate symbol reconstruction and is well‑suited to emerging intelligent applications. Cooperative multi‑task SemCom (CMT‑SemCom) further improves task execution performance. However, [1] demonstrates that cooperative multi‑tasking can be either constructive or destructive. Moreover, the existing CMT‑SemCom framework is not directly applicable to distributed multi‑user scenarios, such as non‑terrestrial satellite networks, where each satellite employs an individual semantic encoder. In this paper, we extend our earlier CMT‑SemCom framework to distributed settings by proposing a federated learning (FL) based CMT‑SemCom that enables cooperative multi‑tasking across distributed users. Moreover, to address performance degradation caused by negative information transfer among heterogeneous tasks, we propose a semantic‑aware task clustering method integrated in the FL process to ensure constructive cooperation based on an information‑theoretic approach. Unlike common clustering methods that rely on high‑dimensional data or feature space similarity, our proposed approach operates in the low‑dimensional semantic domain to identify meaningful task relationships. Simulation results based on a LEO satellite network setup demonstrate the effectiveness of our approach and performance gain over unclustered FL and individual single‑task SemCom.

PaperID: 6120, https://arxiv.org/pdf/2601.17346.pdf

Authors: Haoxin Xu, Changyong Qi, Tong Liu, Bohao Zhang, Anna He, Bingqian Jiang, Longwei Zheng, Xiaoqing Gu

Title: Multi-Agent Learning Path Planning via LLMs

Abstract:
The integration of large language models (LLMs) into intelligent tutoring systems offers transformative potential for personalized learning in higher education. However, most existing learning path planning approaches lack transparency, adaptability, and learner‑centered explainability. To address these challenges, this study proposes a novel Multi‑Agent Learning Path Planning (MALPP) framework that leverages a role‑ and rule‑based collaboration mechanism among intelligent agents, each powered by LLMs. The framework includes three task‑specific agents: a learner analytics agent, a path planning agent, and a reflection agent. These agents collaborate via structured prompts and predefined rules to analyze learning profiles, generate tailored learning paths, and iteratively refine them with interpretable feedback. Grounded in Cognitive Load Theory and Zone of Proximal Development, the system ensures that recommended paths are cognitively aligned and pedagogically meaningful. Experiments conducted on the MOOCCubeX dataset using seven LLMs show that MALPP significantly outperforms baseline models in path quality, knowledge sequence consistency, and cognitive load alignment. Ablation studies further validate the effectiveness of the collaborative mechanism and theoretical constraints. This research contributes to the development of trustworthy, explainable AI in education and demonstrates a scalable approach to learner‑centered adaptive instruction powered by LLMs.

PaperID: 6121, https://arxiv.org/pdf/2601.17344.pdf

Authors: Chen Chen, Kim Young Il, Yuan Yang, Wenhao Su, Yilin Zhang, Xueluan Gong, Qian Wang, Yongsen Zheng, Ziyao Liu, Kwok-Yan Lam

Title: The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents

Abstract:
Large language model (LLM) agents with extended autonomy unlock new capabilities, but also introduce heightened challenges for LLM safety. In particular, an LLM agent may pursue objectives that deviate from human values and ethical norms, a risk known as value misalignment. Existing evaluations primarily focus on responses to explicit harmful input or robustness against system failure, while value misalignment in realistic, fully benign, and agentic settings remains largely underexplored. To fill this gap, we first formalize the Loss‑of‑Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS (Intrinsic Value Misalignment Probes in REalistic Scenario Set), a scenario‑driven framework for systematically assessing this risk. Following our framework, we construct benchmarks composed of realistic, fully benign, and contextualized scenarios, using a multi‑stage LLM generation pipeline with rigorous quality control. We evaluate Intrinsic VM on 21 state‑of‑the‑art LLM agents and find that it is a common and broadly observed safety risk across models. Moreover, the misalignment rates vary by motives, risk types, model scales, and architectures. While decoding strategies and hyperparameters exhibit only marginal influence, contextualization and framing mechanisms significantly shape misalignment behaviors. Finally, we conduct human verification to validate our automated judgments and assess existing mitigation strategies, such as safety prompting and guardrails, which show instability or limited effectiveness. We further demonstrate key use cases of IMPRESS across the AI Ecosystem. Our code and benchmark will be publicly released upon acceptance.

PaperID: 6122, https://arxiv.org/pdf/2601.17311.pdf

Authors: Bang Liu, Linglong Kong, Jian Pei

Title: Phase Transition for Budgeted Multi-Agent Synergy

Abstract:
Multi‑agent systems can improve reliability, yet under a fixed inference budget they often help, saturate, or even collapse. We develop a minimal and calibratable theory that predicts these regimes from three binding constraints of modern agent stacks: finite context windows, lossy inter‑agent communication, and shared failures among similar agents. Each leaf agent is summarized by a compute‑performance scaling exponent β; communication is captured by a message‑length fidelity curve γ(m); dependence is captured by an effective shared‑error correlation ρ; and a context window W imposes hard fan‑in limits that make hierarchy necessary. For binary success/failure tasks with majority aggregation, we prove a sharp phase transition for deep b‑ary trees with correlated inputs and lossy communication: a single scalar α_ρ (combining γ(m), ρ, and fan‑in b) determines whether weak signal is amplified to a nontrivial fixed point or washed out to chance. In the amplifying regime, we derive an organization exponent s and show that budgeted synergy, i.e., outperforming the best single agent under the same total budget, occurs exactly when s>β, yielding closed‑form compute allocation rules and explicit budget thresholds. We further characterize saturation via a mixing depth and provide a conservative clipped predictor that remains accurate across growth and saturation. A continuous‑performance warm‑up gives closed‑form risks for star, chain, and tree organizations, making correlation‑ and communication‑induced floors explicit and exposing the core design trade‑offs in a smooth setting. Finally, we validate the predicted phase boundaries in controlled synthetic simulations and show how the same mechanisms explain the dominant bottlenecks reported in recent large‑scale matched‑budget studies of LLM agent‑system scaling.

PaperID: 6123, https://arxiv.org/pdf/2601.17303.pdf

Authors: Samaresh Kumar Singh, Joyjit Roy

Title: Decentralized Multi-Agent Swarms for Autonomous Grid Security in Industrial IoT: A Consensus-based Approach

Abstract:
As Industrial Internet of Things (IIoT) environments expand to include tens of thousands of connected devices. The centralization of security monitoring architectures creates serious latency issues that savvy attackers can exploit to compromise an entire manufacturing ecosystem. This paper outlines a new, decentralized multi‑agent swarm (DMAS) architecture that includes autonomous artificial intelligence (AI) agents at each edge gateway, functioning as a distributed digital "immune system" for IIoT networks. Instead of using a traditional static firewall approach, the DMAS agents communicate via a lightweight peer‑to‑peer protocol to cooperatively detect anomalous behavior across the IIoT network without sending data to a cloud infrastructure. The authors also outline a consensus‑based threat validation (CVT) process in which agents vote on the threat level of an identified threat, enabling instant quarantine of a compromised node or nodes. The authors conducted experiments on a testbed that simulated an innovative factory environment with 2000 IIoT devices and found that the DMAS demonstrated sub‑millisecond response times (average of 0.85ms), 97.3% accuracy in detecting malicious activity under high load, and 87% accuracy in detecting zero‑day attacks. All significantly higher than baseline values for both centralized and edge computing. Additionally, the proposed architecture can prevent real‑time cascading failures in industrial control systems and reduce network bandwidth use by 89% compared to cloud‑based solutions.

PaperID: 6124, https://arxiv.org/pdf/2601.17275.pdf

Authors: Lianlei Shan, Han Chen, Yixuan Wang, Zhenjie Liu, Wei Li

Title: Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning

Abstract:
While Large Language Models (LLMs) demonstrate exceptional performance in surface‑level text generation, their nature in handling complex multi‑step reasoning tasks often remains one of ``statistical fitting'' rather than systematic logical deduction. Traditional Reinforcement Learning (RL) attempts to mitigate this by introducing a ``think‑before‑speak'' paradigm. However, applying RL directly in high‑dimensional, discrete token spaces faces three inherent challenges: sample‑inefficient rollouts, high gradient estimation variance, and the risk of catastrophic forgetting. To fundamentally address these structural bottlenecks, we propose DeepLatent Reasoning (DLR), a latent‑space bidirectional contrastive reinforcement learning framework. This framework shifts the trial‑and‑error cost from expensive token‑level full sequence generation to the continuous latent manifold. Specifically, we introduce a lightweight assistant model to efficiently sample K reasoning chain encodings within the latent space. These encodings are filtered via a dual reward mechanism based on correctness and formatting; only high‑value latent trajectories are fed into a frozen main model for single‑pass decoding. To maximize reasoning diversity while maintaining coherence, we design a contrastive learning objective to enable directed exploration within the latent space. Since the main model parameters remain frozen during optimization, this method mathematically eliminates catastrophic forgetting. Experiments demonstrate that under comparable GPU computational budgets, DLR achieves more stable training convergence, supports longer‑horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities, providing a viable path toward reliable and scalable reinforcement learning for LLMs.

PaperID: 6125, https://arxiv.org/pdf/2601.17178.pdf

Authors: Saideep Sreekumar, Zeng Wang, Akashdeep Saha, Weihua Xiao, Minghao Shao, Muhammad Shafique, Ozgur Sinanoglu, Ramesh Karri, Johann Knechtel

Title: TrojanGYM: A Detector-in-the-Loop LLM for Adaptive RTL Hardware Trojan Insertion

Abstract:
Hardware Trojans (HTs) remain a critical threat because learning‑based detectors often overfit to narrow trigger/payload patterns and small, stylized benchmarks. We introduce TrojanGYM, an agentic, LLM‑driven framework that automatically curates HT insertions to expose detector blind spots while preserving design correctness. Given high‑level HT specifications, a suite of cooperating LLM agents (instantiated with GPT‑4, LLaMA‑3.3‑70B, and Gemini‑2.5Pro) proposes and refines RTL modifications that realize diverse triggers and payloads without impacting normal functionality. TrojanGYM implements a feedback‑driven benchmark generation loop co‑designed with HT detectors, in which constraint‑aware syntactic checking and GNN‑based HT detectors provide feedback that iteratively refines HT specifications and insertion strategies to better surface detector blind spots. We further propose Robust‑GNN4TJ, a new implementation of the GNN4TJ with improved graph extraction, training robustness, and prediction reliability, especially on LLM‑generated HT designs. On the most challenging TrojanGYM‑generated benchmarks, Robust‑GNN4TJ raises HT detection rates from 0% to 60% relative to a prior GNN‑based detector. We instantiate TrojanGYM on SRAM, AES‑128, and UART designs at RTL level, and show that it systematically produces diverse, functionally correct HTs that reach up to 83.33% evasion rates against modern GNN‑based detectors, revealing robustness gaps that are not apparent when these detectors are evaluated solely on existing TrustHub‑style benchmarks. Post peer‑review, we will release all codes and artifacts.

PaperID: 6126, https://arxiv.org/pdf/2601.17152.pdf

Authors: Miao Zhang, Junsik Kim, Siyuan Xiang, Jian Gao, Cheng Cao

Title: Dynamic Role Assignment for Multi-Agent Debate

Abstract:
Multi‑agent large language model (LLM) and vision‑language model (VLM) debate systems employ specialized roles for complex problem‑solving, yet model specializations are not leveraged to decide which model should fill which role. We propose dynamic role assignment, a framework that runs a Meta‑Debate to select suitable agents before the actual debate. The meta‑debate has two stages: (1) proposal, where candidates provide role‑tailored arguments, and (2) peer review, where proposals are scored with data and role‑specific criteria to choose the best agent for each position. We evaluate our method on LLM problem solving benchmarks. Applied on top of existing debate systems, our approach consistently outperforms uniform assignments (filling all roles with the same model) by up to 74.8% and random assignments (assigning models to roles without considering their suitability) by up to 29.7%, depending on the task and the specific assignment. This work establishes a new paradigm for multi‑agent system design, shifting from static agent deployment to dynamic and capability‑aware selection.

PaperID: 6127, https://arxiv.org/pdf/2601.17084.pdf

Authors: Iman Peivaste, Ahmed Makradi, Salim Belouettar

Title: ChemNavigator: Agentic AI Discovery of Design Rules for Organic Photocatalysts

Abstract:
The discovery of high‑performance organic photocatalysts for hydrogen evolution remains limited by the vastness of chemical space and the reliance on human intuition for molecular design. Here we present ChemNavigator, an agentic AI system that autonomously derives structure‑property relationships through hypothesis‑driven exploration of organic photocatalyst candidates. The system integrates large language model reasoning with density functional tight binding calculations in a multi‑agent architecture that mirrors the scientific method: formulating hypotheses, designing experiments, executing calculations, and validating findings through rigorous statistical analysis. Through iterative discovery cycles encompassing 200 molecules, ChemNavigator autonomously identified six statistically significant design rules governing frontier orbital energies, including the effects of ether linkages, carbonyl groups, extended conjugation, cyano groups, halogen substituents, and amine groups. Importantly, these rules correspond to established principles of organic electronic structure (resonance donation, inductive withdrawal, π‑delocalization), demonstrating that the system can independently derive chemical knowledge without explicit programming. Notably, autonomous agentic reasoning extracted these six validated rules from a molecular library where previous ML approaches identified only carbonyl effects. Furthermore, the quantified effect sizes provide a prioritized ranking for synthetic chemists, while feature interaction analysis revealed diminishing returns when combining strategies, challenging additive assumptions in molecular design. This work demonstrates that agentic AI systems can autonomously derive interpretable, chemically grounded design principles, establishing a framework for AI‑assisted materials discovery that complements rather than replaces chemical intuition.

PaperID: 6128, https://arxiv.org/pdf/2601.17069.pdf

Authors: Shahil Shaik, Jonathon M. Smereka, Yue Wang

Title: Multi-Agent Deep Reinforcement Learning Under Constrained Communications

Abstract:
Centralized training with decentralized execution (CTDE) has been the dominant paradigm in multi‑agent reinforcement learning (MARL), but its reliance on global state information during training introduces scalability, robustness, and generalization bottlenecks. Moreover, in practical scenarios such as adding/dropping teammates or facing environment dynamics that differ from the training, CTDE methods can be brittle and costly to retrain, whereas distributed approaches allow agents to adapt using only local information and peer‑to‑peer communication. We present a distributed MARL framework that removes the need for centralized critics or global information. Firstly, we develop a novel Distributed Graph Attention Network (D‑GAT) that performs global state inference through multi‑hop communication, where agents integrate neighbor features via input‑dependent attention weights in a fully distributed manner. Leveraging D‑GAT, we develop the distributed graph‑attention MAPPO (DG‑MAPPO) ‑‑ a distributed MARL framework where agents optimize local policies and value functions using local observations, multi‑hop communication, and shared/averaged rewards. Empirical evaluation on the StarCraftII Multi‑Agent Challenge, Google Research Football, and Multi‑Agent Mujoco demonstrates that our method consistently outperforms strong CTDE baselines, achieving superior coordination across a wide range of cooperative tasks with both homogeneous and heterogeneous teams. Our distributed MARL framework provides a principled and scalable solution for robust collaboration, eliminating the need for centralized training or global observability. To the best of our knowledge, DG‑MAPPO appears to be the first to fully eliminate reliance on privileged centralized information, enabling agents to learn and act solely through peer‑to‑peer communication.

PaperID: 6129, https://arxiv.org/pdf/2601.17002.pdf

Authors: Ziyang Zhou, Ziqi Liu, Yan Wang, Yiming Lin, Yangbin Chen

Title: RAM-SD: Retrieval-Augmented Multi-agent framework for Sarcasm Detection

Abstract:
Sarcasm detection remains a significant challenge due to its reliance on nuanced contextual understanding, world knowledge, and multi‑faceted linguistic cues that vary substantially across different sarcastic expressions. Existing approaches, from fine‑tuned transformers to large language models, apply a uniform reasoning strategy to all inputs, struggling to address the diverse analytical demands of sarcasm. These demands range from modeling contextual expectation violations to requiring external knowledge grounding or recognizing specific rhetorical patterns. To address this limitation, we introduce RAM‑SD, a Retrieval‑Augmented Multi‑Agent framework for Sarcasm Detection. The framework operates through four stages: (1) contextual retrieval grounds the query in both sarcastic and non‑sarcastic exemplars; (2) a meta‑planner classifies the sarcasm type and selects an optimal reasoning plan from a predefined set; (3) an ensemble of specialized agents performs complementary, multi‑view analysis; and (4) an integrator synthesizes these analyses into a final, interpretable judgment with a natural language explanation. Evaluated on four standard benchmarks, RAM‑SD achieves a state‑of‑the‑art Macro‑F1 of 77.74%, outperforming the strong GPT‑4o+CoC baseline by 7.01 points. Our framework not only sets a new performance benchmark but also provides transparent and interpretable reasoning traces, illuminating the cognitive processes behind sarcasm comprehension.

PaperID: 6130, https://arxiv.org/pdf/2601.16965.pdf

Authors: Riyang Bao, Cheng Yang, Dazhou Yu, Zhexiang Tang, Gengchen Mai, Liang Zhao

Title: Spatial-Agent: Agentic Geo-spatial Reasoning with Scientific Core Concepts

Abstract:
Geospatial reasoning is essential for real‑world applications such as urban analytics, transportation planning, and disaster response. However, existing LLM‑based agents often fail at genuine geospatial computation, relying instead on web search or pattern matching while hallucinating spatial relationships. We present Spatial‑Agent, an AI agent grounded in foundational theories of spatial information science. Our approach formalizes geo‑analytical question answering as a concept transformation problem, where natural‑language questions are parsed into executable workflows represented as GeoFlow Graphs ‑‑ directed acyclic graphs with nodes corresponding to spatial concepts and edges representing transformations. Drawing on spatial information theory, Spatial‑Agent extracts spatial concepts, assigns functional roles with principled ordering constraints, and composes transformation sequences through template‑based generation. Extensive experiments on MapEval‑API and MapQA benchmarks demonstrate that Spatial‑Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.

PaperID: 6131, https://arxiv.org/pdf/2601.16935.pdf

Authors: Wei Wei, Jingye Xu, Sahidul Islam, Dakai Zhu, Chen Pan, Mimi Xie

Title: AERO: Adaptive and Efficient Runtime-Aware OTA Updates for Energy-Harvesting IoT

Abstract:
Energy‑harvesting (EH) Internet of Things (IoT) devices operate under intermittent energy availability, which disrupts task execution and makes energy‑intensive over‑the‑air (OTA) updates particularly challenging. Conventional OTA update mechanisms rely on reboots and incur significant overhead, rendering them unsuitable for intermittently powered systems. Recent live OTA update techniques reduce reboot overhead but still lack mechanisms to ensure consistency when updates interact with runtime execution. This paper presents AERO, an Adaptive and Efficient Runtime‑Aware OTA update mechanism that integrates update tasks into the device's Directed Acyclic Graph (DAG) and schedules them alongside routine tasks under energy and timing constraints. By identifying update‑affected execution regions and dynamically adjusting dependencies, AERO ensures consistent up date integration while adapting to intermittent energy availability. Experiments on representative workloads demonstrate improved update reliability and efficiency compared to existing live update approaches.

PaperID: 6132, https://arxiv.org/pdf/2601.16886.pdf

Authors: Chi Yu, Hongyu Yuan, Zhiyi Duan

Title: MAGE-KT: Multi-Agent Graph-Enhanced Knowledge Tracing with Subgraph Retrieval and Asymmetric Fusion

Abstract:
Knowledge Tracing (KT) aims to model a student's learning trajectory and predict performance on the next question. A key challenge is how to better represent the relationships among students, questions, and knowledge concepts (KCs). Recently, graph‑based KT paradigms have shown promise for this problem. However, existing methods have not sufficiently explored inter‑concept relations, often inferred solely from interaction sequences. In addition, the scale and heterogeneity of KT graphs make full‑graph encoding both computationally both costly and noise‑prone, causing attention to bleed into student‑irrelevant regions and degrading the fidelity of inter‑KC relations. To address these issues, we propose a novel framework: Multi‑Agent Graph‑Enhanced Knowledge Tracing (MAGE‑KT). It constructs a multi‑view heterogeneous graph by combining a multi‑agent KC relation extractor and a student‑question interaction graph, capturing complementary semantic and behavioral signals. Conditioned on the target student's history, it retrieves compact, high‑value subgraphs and integrates them using an Asymmetric Cross‑attention Fusion Module to enhance prediction while avoiding attention diffusion and irrelevant computation. Experiments on three widely used KT datasets show substantial improvements in KC‑relation accuracy and clear gains in next‑question prediction over existing methods.

PaperID: 6133, https://arxiv.org/pdf/2601.16835.pdf

Authors: Ke Ding, Bo Li, Ankang Sun

Title: Multi-Agent Non-Discriminatory Contracts

Abstract:
We study multi‑agent contracts, in which a principal delegates a task to multiple agents and incentivizes them to exert effort. Prior research has mostly focused on maximizing the principal's utility, often resulting in highly disparate payments among agents. Such disparities among agents may be undesirable in practice, for example, in standardized public contracting or worker cooperatives where fairness concerns are essential. Motivated by these considerations, our objective is to quantify the tradeoff between maximizing the principal's utility and equalizing payments among agents, which we call the price of non‑discrimination. Our first result is an almost tight bound on the price of non‑discrimination, which scales logarithmically with the number of agents. This bound can be improved to a constant by allowing some relaxation of the non‑discrimination requirement. We then provide a comprehensive characterization of the tradeoff between the level of non‑discrimination and the loss in the optimal utility.

PaperID: 6134, https://arxiv.org/pdf/2601.16809.pdf

Authors: Musfiqur Rahman, Emad Shihab

Title: Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source

Abstract:
The integration of AI agents as coding assistants into software development has raised questions about the long‑term viability of AI agent‑generated code. A prevailing hypothesis within the software engineering community suggests this code is "disposable", meaning it is merged quickly but discarded shortly thereafter. If true, organizations risk shifting maintenance burden from generation to post‑deployment remediation. We investigate this hypothesis through survival analysis of 201 open‑source projects, tracking over 200,000 code units authored by AI agents versus humans. Contrary to the disposable code narrative, agent‑authored code survives significantly longer: at the line level, it exhibits a 15.8 percentage‑point lower modification rate and 16% lower hazard of modification (HR = 0.842, p < 0.001). However, modification profiles differ. Agent‑authored code shows modestly elevated corrective rates (26.3% vs. 23.0%), while human code shows higher adaptive rates. However, the effect sizes are small (Cramér's V = 0.116), and per‑agent variation exceeds the agent‑human gap. Turning to prediction, textual features can identify modification‑prone code (AUC‑ROC = 0.671), but predicting when modifications occur remains challenging (Macro F1 = 0.285), suggesting timing depends on external organizational dynamics. The bottleneck for agent‑generated code may not be generation quality, but the organizational practices that govern its long‑term evolution.

PaperID: 6135, https://arxiv.org/pdf/2601.16685.pdf

Authors: Suzhong Fu, Jingqi Dong, Xuan Ding, Rui Sun, Yiming Yang, Shuguang Cui, Zhen Li

Title: AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning

Abstract:
Evaluating the clinical correctness and reasoning fidelity of automatically generated medical imaging reports remains a critical yet unresolved challenge. Existing evaluation methods often fail to capture the structured diagnostic logic that underlies radiological interpretation, resulting in unreliable judgments and limited clinical relevance. We introduce AgentsEval, a multi‑agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. We also construct a multi‑domain perturbation‑based benchmark covering five medical report datasets with diverse imaging modalities and controlled semantic variations. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations. This framework represents a step toward transparent and clinically grounded assessment of medical report generation systems, fostering trustworthy integration of large language models into clinical practice.

PaperID: 6136, https://arxiv.org/pdf/2601.16649.pdf

Authors: Amin Rakhsha, Thomas Hehn, Pietro Mazzaglia, Fabio Valerio Massoli, Arash Behboodi, Tribhuvanesh Orekondy

Title: LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Abstract:
Large language models can perform well on many isolated tasks, yet they continue to struggle on multi‑turn, long‑horizon agentic problems that require skills such as planning, state tracking, and long context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi‑turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent's performance due to this oracle assistance allows us to measure the criticality of such oracle skill in the future advancement of AI agents. We introduce a suite of procedurally generated, game‑like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without confounding effects present in real‑world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills is dependent on the properties of the environment and language model. Our work sheds light on the challenges of multi‑turn agentic environments to guide the future efforts in the development of AI agents and language models.

PaperID: 6137, https://arxiv.org/pdf/2601.16643.pdf

Authors: Chen Shen, Zhao Song, Xinyu Wang, Lei Shi, Matjaž Perc, Zhen Wang, Jun Tanimoto

Title: Evolutionary Dynamics of Reputation-Based Voluntary Prisoner's Dilemma Games

Abstract:
Cooperation underlies many natural and artificial systems. While voluntary participation can sustain cooperation without informational assumptions, real interactions are rarely anonymous, leaving the joint effects of participation and reputation insufficiently understood. We propose a reputation‑based voluntary Prisoner's Dilemma in which agents incur a monitoring cost to inspect opponents and decide whether to exit an interaction for a fixed incentive to avoid exploitation or to default to cooperation or defection. We show that reputation‑conditioned exit generates multiple coexistence pathways that sustain cooperation across population structures. In well‑mixed populations, cooperation persists through stable mixed coexistence, whereas in structured populations, exit‑incentive‑dependent regimes emerge, including local cyclic dominance and persistent oscillations. Together, these results extend voluntary participation frameworks and underscore the role of exit‑incentive design in cooperative multi‑agent systems.

PaperID: 6138, https://arxiv.org/pdf/2601.16578.pdf

Authors: Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee

Title: Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Abstract:
We present a reproducible benchmark for evaluating sim‑to‑real transfer of Multi‑Agent Reinforcement Learning (MARL) policies for Connected and Automated Vehicles (CAVs). The platform, based on the Cyber‑Physical Mobility Lab (CPM Lab) [1], integrates simulation, a high‑fidelity digital twin, and a physical testbed, enabling structured zero‑shot evaluation of MARL motion‑planning policies. We demonstrate its use by deploying a SigmaRL‑trained policy [2] across all three domains, revealing two complementary sources of performance degradation: architectural differences between simulation and hardware control stacks, and the sim‑to‑real gap induced by increasing environmental realism. The open‑source setup enables systematic analysis of sim‑to‑real challenges in MARL under realistic, reproducible conditions.

PaperID: 6139, https://arxiv.org/pdf/2601.16565.pdf

Authors: Lingxiao Sun, Zhaoyang Zhang, Zihan Lin, Zirui Chen, Weijie Zhou, Zhaohui Yang, Tony Q. S. Quek

Title: Agentic AI-RAN Empowering Synergetic Sensing, Communication, Computing, and Control

Abstract:
Future sixth‑generation (6G) networks are expected to support low‑altitude wireless networks (LAWNs), where unmanned aerial vehicles (UAVs) and aerial robots operate in highly dynamic three‑dimensional environments under stringent latency, reliability, and autonomy requirements. In such scenarios, autonomous task execution at the network edge demands holistic coordination among sensing, communication, computing, and control (SC3) processes. Agentic Artificially Intelligent Radio Access Networks (Agentic AI‑RAN) offer a promising paradigm by enabling the edge network to function as an autonomous decision‑making entity for low‑altitude agents with limited onboard resources. In this article, we propose and design a task‑oriented Agentic AI‑RAN architecture that enables SC3 task execution within a single edge node. This integrated design tackles the fundamental problem of coordinating heterogeneous workloads in resource‑constrained edge environments. Furthermore, a representative low‑altitude embodied intelligence system is prototyped based on a general‑purpose Graphics Processing Unit (GPU) platform to demonstrate autonomous drone navigation in realistic settings. By leveraging the Multi‑Instance GPU (MIG) partitioning technique and the containerized deployment, the demonstration system achieves physical resource isolation while supporting tightly coupled coordination between real‑time communication and multimodal inference under a unified task framework. Experimental results demonstrate low closed‑loop latency, robust bidirectional communication, and stable performance under dynamic runtime conditions, highlighting its viability for mission‑critical low‑altitude wireless networks in 6G.

PaperID: 6140, https://arxiv.org/pdf/2601.16529.pdf

Authors: Dongshen Peng, Yi Wang, Austin Schoeffler, Carl Preiksaitis, Christian Rose

Title: SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care

Abstract:
Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval‑EM, a multi‑agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0‑100%. Models showed higher vulnerability to imaging requests (38.8%) than opioid prescriptions (25.0%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0‑36.0%), indicating general susceptibility rather than tactic‑specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi‑turn adversarial testing for clinical AI certification.

PaperID: 6141, https://arxiv.org/pdf/2601.16507.pdf

Authors: Junjie Shi, Weisong Sun, Zhenpeng Chen, Zhujun Wu, Xiaohong Chen, Zhi Jin, Yang Liu

Title: REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering

Abstract:
The rapid development of large language models is transforming software development. Beyond serving as code auto‑completion tools in integrated development environments, large language models increasingly function as foundation models within coding agents in vibe‑coding scenarios. In such settings, prompts play a central role in agent‑based intelligent software development, as they not only guide the behavior of large language models but also serve as carriers of user requirements. Under the dominant conversational paradigm, prompts are typically divided into system prompts and user prompts. System prompts provide high‑level instructions to steer model behavior and establish conversational context, while user prompts represent inputs and requirements provided by human users. Despite their importance, designing effective prompts remains challenging, as it requires expertise in both prompt engineering and software engineering, particularly requirements engineering. To reduce the burden of manual prompt construction, numerous automated prompt engineering methods have been proposed. However, most existing approaches neglect the methodological principles of requirements engineering, limiting their ability to generate artifacts that conform to formal requirement specifications in realistic software development scenarios. To address this gap, we propose REprompt, a multi‑agent prompt optimization framework guided by requirements engineering. Experiment results demonstrate that REprompt effectively optimizes both system and user prompts by grounding prompt generation in requirements engineering principles.

PaperID: 6142, https://arxiv.org/pdf/2601.16489.pdf

Authors: Xinshuai Guo, Jiayi Kuang, Linyue Pan, Yinghui Li, Yangning Li, Hai-Tao Zheng, Ying Shen, Di Yin, Xing Sun

Title: EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration

Abstract:
A reliable executable environment is the foundation for ensuring that large language models solve software engineering tasks. Due to the complex and tedious construction process, large‑scale configuration is relatively inefficient. However, most methods always overlook fine‑grained analysis of the actions performed by the agent, making it difficult to handle complex errors and resulting in configuration failures. To address this bottleneck, we propose EvoConfig, an efficient environment configuration framework that optimizes multi‑agent collaboration to build correct runtime environments. EvoConfig features an expert diagnosis module for fine‑grained post‑execution analysis, and a self‑evolving mechanism that lets expert agents self‑feedback and dynamically adjust error‑fixing priorities in real time. Empirically, EvoConfig matches the previous state‑of‑the‑art Repo2Run on Repo2Run's 420 repositories, while delivering clear gains on harder cases: on the more challenging Envbench, EvoConfig achieves a 78.1% success rate, outperforming Repo2Run by 7.1%. Beyond end‑to‑end success, EvoConfig also demonstrates stronger debugging competence, achieving higher accuracy in error identification and producing more effective repair recommendations than existing methods.

PaperID: 6143, https://arxiv.org/pdf/2601.16479.pdf

Authors: Hongjia Wu, Shuai Zhou, Hongxin Zhang, Wei Chen

Title: Doc2AHP: Inferring Structured Multi-Criteria Decision Models via Semantic Trees with LLMs

Abstract:
While Large Language Models (LLMs) demonstrate remarkable proficiency in semantic understanding, they often struggle to ensure structural consistency and reasoning reliability in complex decision‑making tasks that demand rigorous logic. Although classical decision theories, such as the Analytic Hierarchy Process (AHP), offer systematic rational frameworks, their construction relies heavily on labor‑intensive domain expertise, creating an "expert bottleneck" that hinders scalability in general scenarios. To bridge the gap between the generalization capabilities of LLMs and the rigor of decision theory, we propose Doc2AHP, a novel structured inference framework guided by AHP principles. Eliminating the need for extensive annotated data or manual intervention, our approach leverages the structural principles of AHP as constraints to direct the LLM in a constrained search within the unstructured document space, thereby enforcing the logical entailment between parent and child nodes. Furthermore, we introduce a multi‑agent weighting mechanism coupled with an adaptive consistency optimization strategy to ensure the numerical consistency of weight allocation. Empirical results demonstrate that Doc2AHP not only empowers non‑expert users to construct high‑quality decision models from scratch but also significantly outperforms direct generative baselines in both logical completeness and downstream task accuracy.

PaperID: 6144, https://arxiv.org/pdf/2601.16449.pdf

Authors: Xiaojiang Peng, Jingyi Chen, Zebang Cheng, Bao Peng, Fengyi Wu, Yifei Dong, Shuyuan Tu, Qiyu Hu, Huiting Huang, Yuxiang Lin, Jun-Yan He, Kai Wang, Zheng Lian, Zhi-Qi Cheng

Title: Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding

Abstract:
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human‑robot interaction. While multimodal large language models (MLLMs) have excelled in general vision‑language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large‑scale datasets with high‑quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion‑LLaMA, pioneered instruction‑tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low‑quality training data with limited scale. To address these limitations, we present Emotion‑LLaMAv2 and the MMEVerse benchmark, establishing an end‑to‑end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion‑LLaMAv2 introduces three key advances. First, an end‑to‑end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre‑fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception‑to‑cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free‑form emotion reasoning. To support large‑scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re‑annotated via a multi‑agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.

PaperID: 6145, https://arxiv.org/pdf/2601.16443.pdf

Authors: Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos

Title: Endless Terminals: Scaling RL Environments for Terminal Agents

Abstract:
Environments are the bottleneck for self‑improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal‑use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode level rewards and a minimal interaction loop: no retrieval, multi‑agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held‑out dev set, Llama‑3.2‑3B improves from 4.0% to 18.2%, Qwen2.5‑7B from 10.7% to 53.3%, and Qwen3‑8B‑openthinker‑sft from 42.6% to 59.0%. These improvements transfer to human‑curated benchmarks: models trained on Endless Terminals show substantial gains on held out human curated benchmarks: on TerminalBench 2.0, Llama‑3.2‑3B improves from 0.0% to 2.2%, Qwen2.5‑7B from 2.2% to 3.4%, and Qwen3‑8B‑openthinker‑sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.

PaperID: 6146, https://arxiv.org/pdf/2601.16392.pdf

Authors: Lakshana Iruni Assalaarachchi, Zainab Masood, Rashina Hoda, John Grundy

Title: Toward Agentic Software Project Management: A Vision and Roadmap

Abstract:
With the advent of agentic AI, Software Engineering is transforming to a new era dubbed Software Engineering 3.0. Software project management (SPM) must also evolve with such transformations to boost successful project completion, while keeping humans at the heart of it. Building on our preliminary ideas of "agentic SPM", and supporting literature, we present our vision of an "Agentic Project Manager (PM)" as a multi‑agent system for SPM 3.0. They will work like a "junior project manager", or an "intern project manager" collaboratively with software teams. We introduce four working modes, with varying autonomy levels to choose from, based on the SPM task. This addresses concerns with ethics, accountability, and trust related to agentic PMs. We also share insights on human PM role evolution and new skill requirements as a "strategic leader" and a "coach" for humans and agents. While creating the foundation for agentic SPM research, we present a research agenda for the wider research community.

PaperID: 6147, https://arxiv.org/pdf/2601.16376.pdf

Authors: Vani Nigam, Achuth Chandrasekhar, Amir Barati Farimani

Title: Polymer-Agent: Large Language Model Agent for Polymer Design

Abstract:
On‑demand Polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers have a long trial‑and‑error process, leading to use of extensive resources. For these processes, machine learning has accelerated scientific discovery at the property prediction and latent space search fronts. However, laboratory researchers cannot readily access codes and these models to extract individual structures and properties due to infrastructure limitations. We present a closed‑loop polymer structure‑property predictor integrated in a terminal for early‑stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property‑guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer‑level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.

PaperID: 6148, https://arxiv.org/pdf/2601.16367.pdf

Authors: Bryce L. Ferguson, Chinmay Maheshwari, Manxi Wu, Shankar Sastry

Title: Game-to-Real Gap: Quantifying the Effect of Model Misspecification in Network Games

Abstract:
Game‑theoretic models and solution concepts provide rigorous tools for predicting collective behavior in multi‑agent systems. In practice, however, different agents may rely on different game‑theoretic models to design their strategies. As a result, when these heterogeneous models interact, the realized outcome can deviate substantially from the outcome each agent expects based on its own local model. In this work, we introduce the game‑to‑real gap, a new metric that quantifies the impact of such model misspecification in multi‑agent environments. The game‑to‑real gap is defined as the difference between the utility an agent actually obtains in the multi‑agent environment (where other agents may have misspecified models) and the utility it expects under its own game model. Focusing on quadratic network games, we show that misspecifications in either (i) the external shock or (ii) the player interaction network can lead to arbitrarily large game‑to‑real gaps. We further develop novel network centrality measures that allow exact evaluation of this gap in quadratic network games. Our analysis reveals that standard network centrality measures fail to capture the effects of model misspecification, underscoring the need for new structural metrics that account for this limitation. Finally, through illustrative numerical experiments, we show that existing centrality measures in network games may provide a counterintuitive understanding of the impact of model misspecification.

PaperID: 6149, https://arxiv.org/pdf/2601.16280.pdf

Authors: Donghao Huang, Gauri Malwe, Zhaoxia Wang

Title: When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Abstract:
Multi‑agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool‑use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME‑centric deployment in privacy‑sensitive environments. Our approach features a 12‑category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open‑weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT‑4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment. Our analysis reveals that procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance matching GPT‑4.1. The framework demonstrates that mid‑sized models (qwen2.5:14b) offer practical accuracy‑efficiency trade‑offs on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost‑effective intelligent agent deployment for resource‑constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool‑augmented multi‑agent AI systems.

PaperID: 6150, https://arxiv.org/pdf/2601.16276.pdf

Authors: Victor Conchello Vendrell, Max Ruiz Luyten, Mihaela van der Schaar

Title: GameTalk: Training LLMs for Strategic Conversation

Abstract:
Strategic decision‑making in multi‑agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long‑term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi‑turn interactions. Unlike prior work that focuses on single‑turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine‑tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine‑tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.

PaperID: 6151, https://arxiv.org/pdf/2601.16237.pdf

Authors: Vik Pant, Eric Yu

Title: Computational Foundations for Strategic Coopetition: Formalizing Collective Action and Loyalty

Abstract:
Mixed‑motive multi‑agent settings are rife with persistent free‑riding because individual effort benefits all members equally, yet each member bears the full cost of their own contribution. Classical work by Holmström established that under pure self‑interest, Nash equilibrium is universal shirking. While i represents teams as composite actors, it lacks scalable computational mechanisms for analyzing how collective action problems emerge and resolve in coopetitive settings. This technical report extends computational foundations for strategic coopetition to team‑level dynamics, building on companion work formalizing interdependence/complementarity (arXiv:2510.18802) and trust dynamics (arXiv:2510.24909). We develop loyalty‑moderated utility functions with two mechanisms: loyalty benefit (welfare internalization plus intrinsic contribution satisfaction) and cost tolerance (reduced effort burden for loyal members). We integrate i structural dependencies through dependency‑weighted team cohesion, connecting member incentives to team‑level positioning. The framework applies to both human teams (loyalty as psychological identification) and multi‑agent systems (alignment coefficients and adjusted cost functions). Experimental validation across 3,125 configurations demonstrates robust loyalty effects (15.04x median effort differentiation). All six behavioral targets achieve thresholds: free‑riding baseline (96.5%), loyalty monotonicity (100%), effort differentiation (100%), team size effect (100%), mechanism synergy (99.5%), and bounded outcomes (100%). Empirical validation using published Apache HTTP Server (1995‑2023) case study achieves 60/60 points, reproducing contribution patterns across formation, growth, maturation, and governance phases. Statistical significance confirmed at p<0.001, Cohen's d=0.71.

PaperID: 6152, https://arxiv.org/pdf/2601.15743.pdf

Authors: Seongmin Kim, Jaehwan Choi, Kunik Jang, Junkil Park, Varinia Bernales, Alán Aspuru-Guzik, Yousung Jung

Title: Materealize: a multi-agent deliberation system for end-to-end material design and synthesis

Abstract:
We propose Materealize, a multi‑agent system for end‑to‑end inorganic materials design and synthesis that orchestrates core domain tools spanning structure generation, property prediction, synthesizability prediction, and synthesis planning within a single unified framework. Through a natural‑language interface, Materealize enables non‑experts to access computational materials workflows and obtain experimentally actionable outputs for material realization. Materealize provides two complementary modes. In instant mode, the system rapidly composes connected tools to solve diverse inorganic tasks‑including property‑conditioned synthesizable candidate design with synthesis recipes, diagnosis, and redesign of unsynthesizable structures, and synthesizable data augmentation‑within a few minutes. In thinking mode, Materealize applies multi‑agent debate to deliver more refined and information‑rich synthesis recommendations, including reasoning‑ and model‑driven synthesis routes and mechanistic hypotheses. The mechanistic hypotheses are validated by direct comparison with the literature for known mechanisms and further supported by physics‑grounded simulations for novel synthesis pathways. By combining tool‑level accuracy with reasoning‑level integration, Materealize can bridge the gap between computational discovery and practical experimental realization.

PaperID: 6153, https://arxiv.org/pdf/2601.15709.pdf

Authors: Asim Biswal, Chuan Lei, Xiao Qin, Aodong Li, Balakrishnan Narayanaswamy, Tim Kraska

Title: AgentSM: Semantic Memory for Agentic Text-to-SQL

Abstract:
Recent advances in LLM‑based Text‑to‑SQL have achieved remarkable gains on public benchmarks such as BIRD and Spider. Yet, these systems struggle to scale in realistic enterprise settings with large, complex schemas, diverse SQL dialects, and expensive multi‑step reasoning. Emerging agentic approaches show potential for adaptive reasoning but often suffer from inefficiency and instability‑repeating interactions with databases, producing inconsistent outputs, and occasionally failing to generate valid answers. To address these challenges, we introduce Agent Semantic Memory (AgentSM), an agentic framework for Text‑to‑SQL that builds and leverages interpretable semantic memory. Instead of relying on raw scratchpads or vector retrieval, AgentSM captures prior execution traces‑or synthesizes curated ones‑as structured programs that directly guide future reasoning. This design enables systematic reuse of reasoning paths, which allows agents to scale to larger schemas, more complex questions, and longer trajectories efficiently and reliably. Compared to state‑of‑the‑art systems, AgentSM achieves higher efficiency by reducing average token usage and trajectory length by 25% and 35%, respectively, on the Spider 2.0 benchmark. It also improves execution accuracy, reaching a state‑of‑the‑art accuracy of 44.8% on the Spider 2.0 Lite benchmark.

PaperID: 6154, https://arxiv.org/pdf/2601.15687.pdf

Authors: Khusrav Badalov, Young Yoon

Title: FARM: Field-Aware Resolution Model for Intelligent Trigger-Action Automation

Abstract:
Trigger‑Action Programming (TAP) platforms such as IFTTT and Zapier enable Web of Things (WoT) automation by composing event‑driven rules across heterogeneous services. A TAP applet links a trigger to an action and must bind trigger outputs (ingredients) to action inputs (fields) to be executable. Prior work largely treats TAP as service‑level prediction from natural language, which often yields non‑executable applets that still require manual configuration. We study the function‑level configuration problem: generating complete applets with correct ingredient‑to‑field bindings. We propose FARM (Field‑Aware Resolution Model), a two‑stage architecture for automated applet generation with full configuration. Stage 1 trains contrastive dual encoders with selective layer freezing over schema‑enriched representations, retrieving candidates from 1,724 trigger functions and 1,287 action functions (2.2M possible trigger‑action pairs). Stage 2 performs selection and configuration using an LLM‑based multi‑agent pipeline. It includes intent analysis, trigger selection, action selection via cross‑schema scoring, and configuration verification. Agents coordinate through shared state and agreement‑based selection. FARM achieves 81% joint accuracy on Gold (62% Noisy, 70% One‑shot) at the function level, where both trigger and action functions must match the ground truth. For comparison with service‑level baselines, we map functions to their parent services and evaluate at the service level. FARM reaches 81% joint accuracy and improves over TARGE by 23 percentage points. FARM also generates ingredient‑to‑field bindings, producing executable automation configurations.

PaperID: 6155, https://arxiv.org/pdf/2601.15551.pdf

Authors: Bismack Tokoli, Luis Jaimes, Ayesha S. Dina

Title: ALIGNAgent: Adaptive Learner Intelligence for Gap Identification and Next-step guidance

Abstract:
Personalized learning systems have emerged as a promising approach to enhance student outcomes by tailoring educational content, pacing, and feedback to individual needs. However, most existing systems remain fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation, but rarely integrating these components into a cohesive adaptive cycle. In this paper, we propose ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next‑step guidance), a multi‑agent educational framework designed to deliver personalized learning through integrated knowledge estimation, skill‑gap identification, and targeted resource recommendation.ALIGNAgent begins by processing student quiz performance, gradebook data, and learner preferences to generate topic‑level proficiency estimates using a Skill Gap Agent that employs concept‑level diagnostic reasoning to identify specific misconceptions and knowledge deficiencies. After identifying skill gaps, the Recommender Agent retrieves preference‑aware learning materials aligned with diagnosed deficiencies, implementing a continuous feedback loop where interventions occur before advancing to subsequent topics. Extensive empirical evaluation on authentic datasets from two undergraduate computer science courses demonstrates ALIGNAgent's effectiveness, with GPT‑4o‑based agents achieving precision of 0.87‑0.90 and F1 scores of 0.84‑0.87 in knowledge proficiency estimation validated against actual exam performance.

PaperID: 6156, https://arxiv.org/pdf/2601.15519.pdf

Authors: Zhichao Yang, Jiashu He, Jinxuan Fan, Cirillo Cinzia

Title: TransportAgents: a multi-agents LLM framework for traffic accident severity prediction

Abstract:
Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single‑agent architectures often struggle with heterogeneous, domain‑specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi‑agent framework that integrates category‑specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM‑based baselines. Across three representative backbones, including closed‑source models such as GPT‑3.5 and GPT‑4o, as well as open‑source models such as LLaMA‑3.3, the framework exhibits strong robustness, scalability, and cross‑dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well‑calibrated severity predictions than standard single‑agent LLM approaches, highlighting its interpretability and reliability for safety‑critical decision support applications.

PaperID: 6157, https://arxiv.org/pdf/2601.15495.pdf

Authors: Yiyang Feng, Zeming Chen, Haotian Wu, Jiawei Zhou, Antoine Bosselut

Title: Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge

Abstract:
A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in‑context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model's parametric knowledge, which propagate to faulty reasoning. Current benchmarks for this problem, however, largely focus only on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi‑step reasoning when it conflicts with the model's initial parametric knowledge. Spanning three reasoning‑intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real‑world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts to a model, and that this performance degradation exacerbates as more updated facts are provided. We show this failure stems from both inability to faithfully integrate updated facts, but also flawed reasoning even when knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi‑step reasoning.

PaperID: 6158, https://arxiv.org/pdf/2601.15478.pdf

Authors: Michal Feldman, Yoav Gal-Tzur, Tomasz Ponitka, Maya Schlesinger

Title: Equal-Pay Contracts

Abstract:
We study multi‑agent contract design, where a principal incentivizes a team of agents to take costly actions that jointly determine the project success via a combinatorial reward function. While prior work largely focuses on unconstrained contracts that allow heterogeneous payments across agents, many real‑world environments limit payment dispersion. Motivated by this, we study equal‑pay contracts, where all agents receive identical payments. Our results also extend to nearly‑equal‑pay contracts where any two payments are identical up to a constant factor. We provide both algorithmic and hardness results across a broad hierarchy of reward functions, under both binary and combinatorial action models. While we focus on equal‑pay contracts, our analysis also yields new insights into unconstrained contract design, and resolves two important open problems. On the positive side, we design polynomial‑time O(1)‑approximation algorithms for (i) submodular rewards under combinatorial actions, and (ii) XOS rewards under binary actions. These guarantees are tight: We rule out the existence of (i) a PTAS for combinatorial actions, even for gross substitutes rewards (unless P = NP), and (ii) any O(1)‑approximation for XOS rewards with combinatorial actions. Crucially, our hardness results hold even for unconstrained contracts, thereby settling the corresponding open problems in this setting. Finally, we quantify the loss induced by fairness via the price of equality, defined as the worst‑case ratio between the optimal principal's utility achievable by unconstrained contracts and that achievable by equal‑pay contracts. We obtain a bound of Θ(\log n/ \log \log n), where n is the number of agents. This gap is tight in a strong sense: the upper bound applies even for XOS rewards with combinatorial actions, while the lower bound arises already for additive rewards with binary actions.

PaperID: 6159, https://arxiv.org/pdf/2601.15457.pdf

Authors: Anuj Maharjan, Umesh Yadav

Title: Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

Abstract:
The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high‑stakes environments where information integrity is non‑negotiable. This empirical evaluation explores the effectiveness of Retrieval‑Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross‑encoder re‑ranking. The experimental framework employs a Mistral‑7B‑Instruct‑v0.2 model and an all‑MiniLM‑L6‑v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character‑based and token‑based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two‑stage retrieval mechanisms are essential for achieving the precision required for domain‑specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi‑step reasoning tasks.

PaperID: 6160, https://arxiv.org/pdf/2601.15437.pdf

Authors: Sohyeon Park, Jesus Armando Beltran, Aehong Min, Anamara Ritt-Olson, Gillian R. Hayes

Title: Exploring Implicit Perspectives on Autism in Large Language Models Through Multi-Agent Simulations

Abstract:
Large Language Models (LLMs) like ChatGPT offer potential support for autistic people, but this potential requires understanding the implicit perspectives these models might carry, including their biases and assumptions about autism. Moving beyond single‑agent prompting, we utilized LLM‑based multi‑agent systems to investigate complex social scenarios involving autistic and non‑autistic agents. In our study, agents engaged in group‑task conversations and answered structured interview questions, which we analyzed to examine ChatGPT's biases and how it conceptualizes autism. We found that ChatGPT assumes autistic people are socially dependent, which may affect how it interacts with autistic users or conveys information about autism. To address these challenges, we propose adopting the double empathy problem, which reframes communication breakdowns as a mutual challenge. We describe how future LLMs could address the biases we observed and improve interactions involving autistic people by incorporating the double empathy problem into their design.

PaperID: 6161, https://arxiv.org/pdf/2601.15311.pdf

Authors: Mustafa Arslan

Title: Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents

Abstract:
Large Language Models (LLMs) are fundamentally constrained by the quadratic computational cost of self‑attention and the "Lost in the Middle" phenomenon, where reasoning capabilities degrade as context windows expand. Existing solutions, primarily "Flat RAG" architectures relying on vector databases, treat memory as an unstructured bag of embeddings, failing to capture the hierarchical and temporal structure of long‑horizon interactions. This paper presents Aeon, a Neuro‑Symbolic Cognitive Operating System that redefines memory as a managed OS resource. Aeon structures memory into a Memory Palace (a spatial index implemented via Atlas, a SIMD‑accelerated Page‑Clustered Vector Index) and a Trace (a neuro‑symbolic episodic graph). This architecture introduces three advances: (1) Symmetric INT8 Scalar Quantization, achieving 3.1x spatial compression and 5.6x math acceleration via NEON SDOT intrinsics; (2) a decoupled Write‑Ahead Log (WAL) ensuring crash‑recoverability with statistically negligible overhead (<1%); and (3) a Sidecar Blob Arena eliminating the prior 440‑character text ceiling via an append‑only mmap‑backed blob file with generational garbage collection. The Semantic Lookaside Buffer (SLB) exploits conversational locality to achieve sub‑5us retrieval latencies, with INT8 vectors dequantized to FP32 on cache insertion to preserve L1‑resident lookup performance. Benchmarks on Apple M4 Max demonstrate that the combined architecture achieves 4.70ns INT8 dot product latency, 3.09us tree traversal at 100K nodes (3.4x over FP32), and P99 read latency of 750ns under hostile 16‑thread contention via epoch‑based reclamation.

PaperID: 6162, https://arxiv.org/pdf/2601.15290.pdf

Authors: Hareeshwar Karthikeyan

Title: Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios

Abstract:
Testing conversational AI systems at scale across diverse domains necessitates realistic and diverse user interactions capturing a wide array of behavioral patterns. We present a novel multi‑agent framework for realistic, explainable human user simulation in interactive scenarios, using persona control and task state tracking to mirror human cognitive processes during goal‑oriented conversations. Our system employs three specialized AI agents: (1) a User Agent to orchestrate the overall interaction, (2) a State Tracking Agent to maintain structured task state, and (3) a Message Attributes Generation Agent that controls conversational attributes based on task progress and assigned persona. To validate our approach, we implement and evaluate the framework for guest ordering at a restaurant with scenarios rich in task complexity, behavioral diversity, and conversational ambiguity. Through systematic ablations, we evaluate the contributory efficacy of each agentic component to overall simulation quality in terms of persona adherence, task completion accuracy, explainability, and realism. Our experiments demonstrate that the complete multi‑agent system achieves superior simulation quality compared to single‑LLM baselines, with significant gains across all evaluation metrics. This framework establishes a powerful environment for orchestrating agents to simulate human users with cognitive plausibility, decomposing the simulation into specialized sub‑agents that reflect distinct aspects of human thought processes applicable across interactive domains.

PaperID: 6163, https://arxiv.org/pdf/2601.15247.pdf

Authors: Rian Dolphin, Joe Dursun, Jarrett Blankenship, Katie Adams, Quinton Pike

Title: Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs

Abstract:
We present a methodology for extracting structured risk factors from corporate 10‑K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three‑stage pipeline combines LLM extraction with supporting quotes, embedding‑based semantic mapping to taxonomy categories, and LLM‑as‑a‑judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance where an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving 104.7% improvement in embedding separation in a case study. External validation confirms the taxonomy captures economically meaningful structure: same‑industry companies exhibit 63% higher risk profile similarity than cross‑industry pairs (Cohen's d=1.06, AUC 0.82, p<0.001). The methodology generalizes to any domain requiring taxonomy‑aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.

PaperID: 6164, https://arxiv.org/pdf/2601.15232.pdf

Authors: Niful Islam, Ragib Shahriar Ayon, Deepak George Thomas, Shibbir Ahmed, Mohammad Wardat

Title: When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

Abstract:
Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address the limitation by integrating tools. However, debugging LLM agents is difficult and costly as the field is still in it's early stage and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent‑based software. We collected and analyzed 1,187 bug‑related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums, focused on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For a deeper analysis, we have also studied the component where the bug occurred, along with the programming language and framework. This study also investigates the feasibility of automating bug identification. For that, we have built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset. According to our study, we found that BugReAct equipped with Gemini 2.5 Flash achieved a remarkable performance in annotating bug characteristics with an average cost of 0.01 USD per post/code snippet.

PaperID: 6165, https://arxiv.org/pdf/2601.15153.pdf

Authors: Choro Ulan uulu, Mikhail Kulyabin, Iris Fuhrmann, Jan Joosten, Nuno Miguel Martins Pacheco, Filippos Petridis, Rebecca Johnson, Jan Bosch, Helena Holmström Olsson

Title: How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework

Abstract:
Critical domain knowledge typically resides with few experts, creating organizational bottlenecks in scalability and decision‑making. Non‑experts struggle to create effective visualizations, leading to suboptimal insights and diverting expert time. This paper investigates how to capture and embed human domain knowledge into AI agent systems through an industrial case study. We propose a software engineering framework to capture human domain knowledge for engineering AI agents in simulation data visualization by augmenting a Large Language Model (LLM) with a request classifier, Retrieval‑Augmented Generation (RAG) system for code generation, codified expert rules, and visualization design principles unified in an agent demonstrating autonomous, reactive, proactive, and social behavior. Evaluation across five scenarios spanning multiple engineering domains with 12 evaluators demonstrates 206% improvement in output quality, with our agent achieving expert‑level ratings in all cases versus baseline's poor performance, while maintaining superior code quality with lower variance. Our contributions are: an automated agent‑based system for visualization generation and a validated framework for systematically capturing human domain knowledge and codifying tacit expert knowledge into AI agents, demonstrating that non‑experts can achieve expert‑level outcomes in specialized domains.

PaperID: 6166, https://arxiv.org/pdf/2601.15077.pdf

Authors: Christopher Scofield

Title: Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure

Abstract:
Multi‑agent systems (MAS) composed of large language models often exhibit improved problem‑solving performance despite operating on identical information. In this work, we provide a formal explanation for this phenomenon grounded in operator theory and constrained optimization. We model each agent as enforcing a distinct family of validity constraints on a shared solution state, and show that a MAS implements a factorized composition of constraint‑enforcement operators. Under mild conditions, these dynamics converge to invariant solution sets defined by the intersection of agent constraint sets. Such invariant structures are generally not dynamically accessible to a single agent applying all constraints simultaneously, even when expressive capacity and information are identical. We extend this result from exact constraint enforcement to soft constraints via proximal operators, and apply the formalism to contemporary text‑based dialog systems.

PaperID: 6167, https://arxiv.org/pdf/2601.15047.pdf

Authors: Jianing Hao, Han Ding, Yuanjian Xu, Tianze Sun, Ran Chen, Wanbo Zhang, Guang Zhang, Siguang Li

Title: Game-Theoretic Lens on LLM-based Multi-Agent Systems

Abstract:
Large language models (LLMs) have demonstrated strong reasoning, planning, and communication abilities, enabling them to operate as autonomous agents in open environments. While single‑agent systems remain limited in adaptability and coordination, recent progress has shifted attention toward multi‑agent systems (MAS) composed of interacting LLMs that pursue cooperative, competitive, or mixed objectives. This emerging paradigm provides a powerful testbed for studying social dynamics and strategic behaviors among intelligent agents. However, current research remains fragmented and lacks a unifying theoretical foundation. To address this gap, we present a comprehensive survey of LLM‑based multi‑agent systems through a game‑theoretic lens. By organizing existing studies around the four key elements of game theory: players, strategies, payoffs, and information, we establish a systematic framework for understanding, comparing, and guiding future research on the design and analysis of LLM‑based MAS.

PaperID: 6168, https://arxiv.org/pdf/2601.15016.pdf

Authors: Xiaodong Wang, Langling Huang, Zhirong Wu, Xu Zhao, Teng Xu, Xuhong Xia, Peixi Peng

Title: LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding

Abstract:
The development of multimodal large language models (MLLMs) has advanced general video understanding. However, existing video evaluation benchmarks primarily focus on non‑interactive videos, such as movies and recordings. To fill this gap, this paper proposes the first omnimodal benchmark for interactive livestream videos, LiViBench. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream‑specific challenges. To efficiently construct the dataset, we design a standardized semi‑automatic annotation workflow that incorporates the human‑in‑the‑loop at multiple stages. The workflow leverages multiple MLLMs to form a multi‑agent system for comprehensive video description and uses a seed‑question‑driven method to construct high‑quality annotations. All interactive videos in the benchmark include audio, speech, and real‑time comments modalities. To enhance models' understanding of interactive videos, we design tailored two‑stage instruction‑tuning and propose a Video‑to‑Comment Retrieval (VCR) module to improve the model's ability to utilize real‑time comments. Based on these advancements, we develop LiVi‑LLM‑7B, an MLLM with enhanced knowledge of interactive livestreams. Experiments show that our model outperforms larger open‑source models with up to 72B parameters, narrows the gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks, including VideoMME, LongVideoBench, MLVU, and VideoEval‑Pro.

PaperID: 6169, https://arxiv.org/pdf/2601.14914.pdf

Authors: Tianxiang Fei, Cheng Chen, Yue Pan, Mao Zheng, Mingyang Song

Title: CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents

Abstract:
Recent advances in large language models (LLMs) allow agents to represent actions as executable code, offering greater expressivity than traditional tool‑calling. However, real‑world tasks often demand both strategic planning and detailed implementation. Using a single agent for both leads to context pollution from debugging traces and intermediate failures, impairing long‑horizon performance. We propose CodeDelegator, a multi‑agent framework that separates planning from implementation via role specialization. A persistent Delegator maintains strategic oversight by decomposing tasks, writing specifications, and monitoring progress without executing code. For each sub‑task, a new Coder agent is instantiated with a clean context containing only its specification, shielding it from prior failures. To coordinate between agents, we introduce Ephemeral‑Persistent State Separation (EPSS), which isolates each Coder's execution state while preserving global coherence, preventing debugging traces from polluting the Delegator's context. Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.

PaperID: 6170, https://arxiv.org/pdf/2601.14912.pdf

Authors: Guangba Yu, Genting Mai, Rui Wang, Ruipeng Li, Pengfei Chen, Long Pan, Ruijie Xu

Title: AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems

Abstract:
Alerts are critical for detecting anomalies in large‑scale cloud systems, ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, degrading operational efficiency due to ineffective alert life‑cycle management. This paper details the efforts of Company‑X to optimize alert life‑cycle management, addressing alert fatigue in cloud systems. We propose AlertGuardian, a framework collaborating large language models (LLMs) and lightweight graph models to optimize the alert life‑cycle through three phases: Alert Denoise uses graph learning model with virtual noise to filter noise, Alert Summary employs Retrieval Augmented Generation (RAG) with LLMs to create actionable summary, and Alert Rule Refinement leverages multi‑agent iterative feedbacks to improve alert rule quality. Evaluated on four real‑world datasets from Company‑X's services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratios) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, with 375 accepted by SREs (32% acceptance rate). Finally, we share success stories and lessons learned about alert life‑cycle management after the deployment of AlertGuardian in Company‑X.

PaperID: 6171, https://arxiv.org/pdf/2601.14871.pdf

Authors: Zejian Cui, Ferdinando Rodriguez y Baena

Title: On-the-fly hand-eye calibration for the da Vinci surgical robot

Abstract:
In Robot‑Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable‑driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework to produce accurate tool localization results through computing the hand‑eye transformation matrix on‑the‑fly. The framework consists of two interrelated algorithms: the feature association block and the hand‑eye calibration block, which provide robust correspondences for key points detected on monocular images without pre‑training, and offer the versatility to accommodate various surgical scenarios by adopting an array of filter approaches, respectively. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state‑of‑the‑art methods while being more time‑efficient.

PaperID: 6172, https://arxiv.org/pdf/2601.14735.pdf

Authors: Varad Kulkarni, Vaibhav Jha, Nikhil Reddy, Anand Eswaran, Praveen Jayachandran, Yogesh Simmhan

Title: Optimizing FaaS Platforms for MCP-enabled Agentic Workflows

Abstract:
Agentic workflows that use autonomous AI Agents powered by Large Language Models (LLMs) and Model Context Protocol (MCP) servers is rapidly rising. This introduces challenges in scalable cloud deployment and state management. Traditional hosting on Virtual Machines (VMs) is resource‑intensive and lacks elasticity. Functions‑as‑a‑Service (FaaS) platforms offer modularity, autoscaling and cost efficiency but are inherently stateless. In this paper, we present the FAME, a FaaS‑based architecture for orchestrating MCP‑enabled agentic workflows. FAME decomposes agentic patterns such as ReAct into composable agents: Planner, Actor and Evaluator, that are each a FaaS function built using LangGraph and are orchestrated as a FaaS workflow. This enables modular composition as AWS Step Functions and avoids function timeouts seen for monolithic agentic workflows. To address context persistence across user requests in a conversation, FAME automates agent memory persistence and injection using DynamoDB. It also optimizes MCP server deployment through AWS Lambda wrappers, caches tool outputs in S3 and proposes function fusion strategies. We evaluate FAME on two representative applications, on research paper summarization and log analytics, under diverse memory and caching configurations. Results show up to 13x latency reduction, 88% fewer input tokens and 66% in cost savings, along with improved workflow completion rates. This demonstrates the viability of serverless platforms for hosting complex, multi‑agent AI workflows at scale.

PaperID: 6173, https://arxiv.org/pdf/2601.14691.pdf

Authors: Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Yunxiang Zhang, Moontae Lee, Hao Peng, Lu Wang, Honglak Lee

Title: Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Abstract:
Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non‑verifiable settings where judgments rely on agent trajectories including chain‑of‑thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state‑of‑the‑art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style‑based approaches that alter only the presentation of reasoning and content‑based approaches that fabricate signals of task progress, and find that content‑based manipulations are consistently more effective. We evaluate prompting‑based techniques and scaling judge‑time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM‑based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.

PaperID: 6174, https://arxiv.org/pdf/2601.14667.pdf

Authors: Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, Jing Shao

Title: INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Abstract:
The rapid advancement of Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) has introduced significant security vulnerabilities, where malicious influence can propagate virally through inter‑agent communication. Conventional safeguards often rely on a binary paradigm that strictly distinguishes between benign and attack agents, failing to account for infected agents i.e., benign entities converted by attack agents. In this paper, we propose Infection‑Aware Guard, INFA‑Guard, a novel defense framework that explicitly identifies and addresses infected agents as a distinct threat category. By leveraging infection‑aware detection and topological constraints, INFA‑Guard accurately localizes attack sources and infected ranges. During remediation, INFA‑Guard replaces attackers and rehabilitates infected ones, avoiding malicious propagation while preserving topological integrity. Extensive experiments demonstrate that INFA‑Guard achieves state‑of‑the‑art performance, reducing the Attack Success Rate (ASR) by an average of 33%, while exhibiting cross‑model robustness, superior topological generalization, and high cost‑effectiveness.

PaperID: 6175, https://arxiv.org/pdf/2601.14660.pdf

Authors: Saswat Das, Ferdinando Fioretto

Title: NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Abstract:
This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the contextual integrity framework. Indeed, existing defenses rely on LLM‑mediated checking stages that add substantial latency and cost, and that can be undermined in multi‑turn interactions through manipulation or benign‑looking conversational scaffolding. Contrasting this background, this paper makes a key observation: internal representations associated with privacy‑violating intent can be separated from benign requests using linear structure. Using this insight, the paper proposes NeuroFilter, a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space, enabling detection even when semantic filters are bypassed. The proposed filter is also extended to capture threats arising during long conversations using the concept of activation velocity, which measures cumulative drift in internal representations across turns. A comprehensive evaluation across over 150,000 interactions and covering models from 7B to 70B parameters, illustrates the strong performance of NeuroFilter in detecting privacy attacks while maintaining zero false positives on benign prompts, all while reducing the computational inference cost by several orders of magnitude when compared to LLM‑based agentic privacy defenses.

PaperID: 6176, https://arxiv.org/pdf/2601.14652.pdf

Authors: Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

Title: MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Abstract:
While multi‑agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under‑deliver. Such shortcomings stem from two key factors: (1) methodological complexity ‑ agent orchestration is performed using sequential, code‑level execution that limits global system‑level holistic reasoning and scales poorly with agent complexity ‑ and (2) efficacy uncertainty ‑ MAS are deployed without understanding if there are tangible benefits compared to single‑agent systems (SAS). We propose MASOrchestra, a training‑time framework that formulates MAS orchestration as a function‑calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS‑Orchestra, complex, goal‑oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS‑Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi‑hop QA, and search‑based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS‑Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi‑agent intelligence.

PaperID: 6177, https://arxiv.org/pdf/2601.14606.pdf

Authors: Daisuke Miyamoto, Takuji Iimura, Narushige Michishita

Title: An LLM Agent-based Framework for Whaling Countermeasures

Abstract:
With the spread of generative AI in recent years, attacks known as Whaling have become a serious threat. Whaling is a form of social engineering that targets important high‑authority individuals within organizations and uses sophisticated fraudulent emails. In the context of Japanese universities, faculty members frequently hold positions that combine research leadership with authority within institutional workflows. This structural characteristic leads to the wide public disclosure of high‑value information such as publications, grants, and detailed researcher profiles. Such extensive information exposure enables the construction of highly precise target profiles using generative AI. This raises concerns that Whaling attacks based on high‑precision profiling by generative AI will become prevalent. In this study, we propose a Whaling countermeasure framework for university faculty members that constructs personalized defense profiles and uses large language model (LLM)‑based agents. We design agents that (i) build vulnerability profiles for each target from publicly available information on faculty members, (ii) identify potential risk scenarios relevant to Whaling defense based on those profiles, (iii) construct defense profiles corresponding to the vulnerabilities and anticipated risks, and (iv) analyze Whaling emails using the defense profiles. Furthermore, we conduct a preliminary risk‑assessment experiment. The results indicate that the proposed method can produce judgments accompanied by explanations of response policies that are consistent with the work context of faculty members who are Whaling targets. The findings also highlight practical challenges and considerations for future operational deployment and systematic evaluation.

PaperID: 6178, https://arxiv.org/pdf/2601.14601.pdf

Authors: Haodong Chen, Ziheng Zhang, Jinghui Jiang, Qiang Su, Qiao Xiang

Title: Holmes: An Evidence-Grounded LLM Agent for Auditable DDoS Investigation in Cloud Networks

Abstract:
Cloud environments face frequent DDoS threats due to centralized resources and broad attack surfaces. Modern cloud‑native DDoS attacks further evolve rapidly and often blend multi‑vector strategies, creating an operational dilemma: defenders need wire‑speed monitoring while also requiring explainable, auditable attribution for response. Existing rule‑based and supervised‑learning approaches typically output black‑box scores or labels, provide limited evidence chains, and generalize poorly to unseen attack variants; meanwhile, high‑quality labeled data is often difficult to obtain in cloud settings. We present Holmes (DDoS Detective), an LLM‑based DDoS detection agent that reframes the model as a virtual SRE investigator rather than an end‑to‑end classifier. Holmes couples a funnel‑like hierarchical workflow (counters/sFlow for continuous sensing and triage; PCAP evidence collection triggered only on anomaly windows) with an Evidence Pack abstraction that converts binary packets into compact, reproducible, high‑signal structured evidence. On top of this evidence interface, Holmes enforces a structure‑first investigation protocol and strict JSON/quotation constraints to produce machine‑consumable reports with auditable evidence anchors. We evaluate Holmes on CICDDoS2019 reflection/amplification attacks and script‑triggered flooding scenarios. Results show that Holmes produces attribution decisions grounded in salient evidence anchors across diverse attack families, and when errors occur, its audit logs make the failure source easy to localize, demonstrating the practicality of an LLM agent for cost‑controlled and traceable DDoS investigation in cloud operations.

PaperID: 6179, https://arxiv.org/pdf/2601.14567.pdf

Authors: Roland R. Rodriguez

Title: Agent Identity URI Scheme: Topology-Independent Naming and Capability-Based Discovery for Multi-Agent Systems

Abstract:
Multi‑agent systems face a fundamental architectural flaw: agent identity is bound to network location. When agents migrate between providers, scale across instances, or federate across organizations, URI‑based identity schemes break references, fragment audit trails, and require centralized coordination. We propose the agent:// URI scheme, which decouples identity from topology through three orthogonal components: a trust root establishing organizational authority, a hierarchical capability path enabling semantic discovery, and a sortable unique identifier providing stable reference. The scheme enables capability‑based discovery through DHT key derivation, where queries return agents by what they do rather than where they are. Trust‑root scoping prevents cross‑organization pollution while permitting federation when desired. Cryptographic attestation via PASETO tokens binds capability claims to agent identity, enabling verification without real‑time contact with the issuing authority. We evaluate the scheme across four dimensions: capability expressiveness (100% coverage on 369 production tools with zero collision), discovery precision (F1=1.0 across 10,000 agents), identity stability (formal proofs of migration invariance), and performance (all operations under 5 microseconds). The agent:// URI scheme provides a formally‑specified, practically‑evaluated foundation for decentralized agent identity and capability‑based discovery.

PaperID: 6180, https://arxiv.org/pdf/2601.14566.pdf

Authors: Shenghan Gao, Junye Wang, Junjie Xiong, Yun Jiang, Yun Fang, Qifan Hu, Baolong Liu, Quan Li

Title: SCSimulator: An Exploratory Visual Analytics Framework for Partner Selection in Supply Chains through LLM-driven Multi-Agent Simulation

Abstract:
Supply chains (SCs), complex networks spanning from raw material acquisition to product delivery, with enterprises as interconnected nodes, play a pivotal role in organizational success. However, optimizing SCs remains challenging, particularly in partner selection, a key bottleneck shaped by competitive and cooperative dynamics. This challenge constitutes a multi‑objective dynamic game requiring a synergistic integration of Multi‑Criteria Decision‑Making and Game Theory. Traditional approaches, grounded in mathematical simplifications and managerial heuristics, fail to capture real‑world intricacies and risk introducing subjective biases. Multi‑agent simulation offers promise, but prior research has largely relied on fixed, uniform agent logic, limiting practical applicability. Recent advances in LLMs create opportunities to represent complex SC requirements and hybrid game logic. However, challenges persist in modeling dynamic SC relationships, ensuring interpretability, and balancing agent autonomy with expert control. We present SCSimulator, a visual analytics framework that integrates LLM‑driven MAS with human‑in‑the‑loop collaboration for SC partner selection. It simulates SC evolution via adaptive network structures and enterprise behaviors, which are visualized via interpretable interfaces. By combining CoT reasoning with XAI techniques, it generates multi‑faceted, transparent explanations of decision trade‑offs. Users can iteratively adjust simulation settings to explore outcomes aligned with their expectations and strategic priorities. Developed through iterative co‑design with SC experts and industry managers, SCSimulator serves as a proof‑of‑concept, offering methodological contributions and practical insights for future research on SC decision‑making and interactive AI‑driven analytics. Usage scenarios and a user study demonstrate the system's effectiveness and usability.

PaperID: 6181, https://arxiv.org/pdf/2601.14544.pdf

Authors: Sneha Sudhakaran, Naresh Kshetri

Title: AI Agents vs. Human Investigators: Balancing Automation, Security, and Expertise in Cyber Forensic Analysis

Abstract:
In an era where cyber threats are rapidly evolving, the reliability of cyber forensic analysis has become increasingly critical for effective digital investigations and cybersecurity responses. AI agents are being adopted across digital forensic practices due to their ability to automate processes such as anomaly detection, evidence classification, and behavioral pattern recognition, significantly enhancing scalability and reducing investigation timelines. However, the characteristics that make AI indispensable also introduce notable risks. AI systems, often trained on biased or incomplete datasets, can produce misleading results, including false positives and false negatives, thereby jeopardizing the integrity of forensic investigations. This study presents a meticulous comparative analysis of the effectiveness of the most used AI agent, ChatGPT, and human forensic investigators in the realm of cyber forensic analysis. Our research reveals critical limitations within AI‑driven approaches, demonstrating scenarios in which sophisticated or novel cyber threats remain undetected due to the rigid pattern‑based nature of AI systems. Conversely, our analysis highlights the crucial role that human forensic investigators play in mitigating these risks. Through adaptive decision‑making, ethical reasoning, and contextual understanding, human investigators effectively identify subtle anomalies and threats that may evade automated detection systems. To reinforce our findings, we conducted comprehensive reliability testing of forensic techniques using multiple cyber threat scenarios. These tests confirmed that while AI agents significantly improve the efficiency of routine analyses, human oversight remains crucial in ensuring accuracy and comprehensiveness of the results.

PaperID: 6182, https://arxiv.org/pdf/2601.14470.pdf

Authors: Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab

Title: Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

Abstract:
LLM‑based Multi‑Agent (LLM‑MA) systems are increasingly applied to automate complex software engineering tasks such as requirements engineering, code generation, and testing. However, their operational efficiency and resource consumption remain poorly understood, hindering practical adoption due to unpredictable costs and environmental impact. To address this, we conduct an analysis of token consumption patterns in an LLM‑MA system within the Software Development Life Cycle (SDLC), aiming to understand where tokens are consumed across distinct software engineering activities. We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT‑5 reasoning model, mapping its internal phases to distinct development stages (Design, Coding, Code Completion, Code Review, Testing, and Documentation) to create a standardized evaluation framework. We then quantify and compare token distribution (input, output, reasoning) across these stages. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption for an average of 59.4% of tokens. Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%, providing empirical evidence for potentially significant inefficiencies in agentic collaboration. Our results suggest that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. Our novel methodology can help practitioners predict expenses and optimize workflows, and it directs future research toward developing more token‑efficient agent collaboration protocols.

PaperID: 6183, https://arxiv.org/pdf/2601.14437.pdf

Authors: Thuan Minh Nguyen, Vu Tuan Truong, Long Bao Le

Title: Agentic AI Meets Edge Computing in Autonomous UAV Swarms

Abstract:
The integration of agentic AI, powered by large language models (LLMs) with autonomous reasoning, planning, and execution, into unmanned aerial vehicle (UAV) swarms opens new operational possibilities and brings the vision of the Internet of Drones closer to reality. However, infrastructure constraints, dynamic environments, and the computational demands of multi‑agent coordination limit real‑world deployment in high‑risk scenarios such as wildfires and disaster response. This paper investigates the integration of LLM‑based agentic AI and edge computing to realize scalable and resilient autonomy in UAV swarms. We first discuss three architectures for supporting UAV swarms ‑ standalone, edge‑enabled, and edge‑cloud hybrid deployment ‑ each optimized for varying autonomy and connectivity levels. Then, a use case for wildfire search and rescue (SAR) is designed to demonstrate the efficiency of the edge‑enabled architecture, enabling high SAR coverage, reduced mission completion times, and a higher level of autonomy compared to traditional approaches. Finally, we highlight open challenges in integrating LLMs and edge computing for mission‑critical UAV‑swarm applications.

PaperID: 6184, https://arxiv.org/pdf/2601.14434.pdf

Authors: Chia-Yi Su, Collin McMillan

Title: CMind: An AI Agent for Localizing C Memory Bugs

Abstract:
This demonstration paper presents CMind, an artificial intelligence agent for localizing C memory bugs. The novel aspect to CMind is that it follows steps that we observed human programmers perform during empirical study of those programmers finding memory bugs in C programs. The input to the tool is a C program's source code and a bug report describing the problem. The output is the tool's hypothesis about the reason for the bug and its location. CMind reads the bug report to find potential entry points to the program, then navigates the program's source code, analyzes that source code, and generates a hypothesis location and rationale that fit a template. The tool combines large language model reasoning with guided decision making we encoded to mimic human behavior. The video demonstration is available at https://youtu.be/_vVd0LRvVHI.

PaperID: 6185, https://arxiv.org/pdf/2601.14351.pdf

Authors: Gopal Vijayaraghavan, Prasanth Jayachandran, Arun Murthy, Sunil Govindan, Vivek Subramanian

Title: If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence

Abstract:
AI Agents can perform complex operations at great speed, but just like all the humans we have ever hired, their intelligence remains fallible. Miscommunications aren't noticed, systemic biases have no counter‑action, and inner monologues are rarely written down. We did not come to fire them for their mistakes, but to hire them and provide a safe productive working environment. We posit that we can reuse a common corporate organizational structure: teams of independent AI agents with strict role boundaries can work with common goals, but opposing incentives. Multiple models serving as a team of rivals can catch and minimize errors within the final product at a small cost to the velocity of actions. In this paper we demonstrate that we can achieve reliability without acquiring perfect components, but through careful orchestration of imperfect ones. This paper describes the architecture of such a system in practice: specialized agent teams (planners, executors, critics, experts), organized into an organization with clear goals, coordinated through a remote code executor that keeps data transformations and tool invocations separate from reasoning models. Rather than agents directly calling tools and ingesting full responses, they write code that executes remotely; only relevant summaries return to agent context. By preventing raw data and tool outputs from contaminating context windows, the system maintains clean separation between perception (brains that plan and reason) and execution (hands that perform heavy data transformations and API calls). We demonstrate the approach achieves over 90% internal error interception prior to user exposure while maintaining acceptable latency tradeoffs. A survey from our traces shows that we only trade off cost and latency to achieve correctness and incrementally expand capabilities without impacting existing ones.

PaperID: 6186, https://arxiv.org/pdf/2601.14287.pdf

Authors: Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, Huawei Shen

Title: Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents

Abstract:
External memory systems are pivotal for enabling Large Language Model (LLM) agents to maintain persistent knowledge and perform long‑horizon decision‑making. Existing paradigms typically follow a two‑stage process: computationally expensive memory construction (e.g., structuring data into graphs) followed by naive retrieval‑augmented generation. However, our empirical analysis reveals two fundamental limitations: complex construction incurs high costs with marginal performance gains, and simple context concatenation fails to bridge the gap between retrieval recall and reasoning accuracy. To address these challenges, we propose CoM (Chain‑of‑Memory), a novel framework that advocates for a paradigm shift toward lightweight construction paired with sophisticated utilization. CoM introduces a Chain‑of‑Memory mechanism that organizes retrieved fragments into coherent inference paths through dynamic evolution, utilizing adaptive truncation to prune irrelevant noise. Extensive experiments on the LongMemEval and LoCoMo benchmarks demonstrate that CoM outperforms strong baselines with accuracy gains of 7.5%‑10.4%, while drastically reducing computational overhead to approximately 2.7% of token consumption and 6.0% of latency compared to complex memory architectures.

PaperID: 6187, https://arxiv.org/pdf/2601.14280.pdf

Authors: Nicholas X. Wang, Aggelos K. Katsaggelos

Title: Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning

Abstract:
Hallucinations in large language models (LLMs), defined as fluent yet incorrect or incoherent outputs, pose a significant challenge to the automatic generation of educational multiple‑choice questions (MCQs). We identified four key hallucination types in MCQ generation: reasoning inconsistencies, insolvability, factual errors, and mathematical errors. To address this, we propose a hallucination‑free multi‑agent generation framework that breaks down MCQ generation into discrete, verifiable stages. Our framework utilizes both rule‑based and LLM‑based detection agents, as well as hallucination scoring metrics to optimize question quality. We redefined MCQ generation as an optimization task minimizing hallucination risk while maximizing validity, answerability, and cost‑efficiency. We also introduce an agent‑led refinement process that uses counterfactual reasoning and chain‑of‑thought (CoT) to iteratively improve hallucination in question generation. We evaluated a sample of AP‑ aligned STEM questions, where our system reduced hallucination rates by over 90% compared to baseline generation while preserving the educational value and style of questions. Our results demonstrate that structured multi‑agent collaboration can mitigate hallucinations in educational content creation at scale, paving the way for more reliable LLM‑powered learning tools.

PaperID: 6188, https://arxiv.org/pdf/2601.14275.pdf

Authors: Zewen Yang, Xiaobing Dai, Jiajun Cheng, Yulong Huang, Peng Shi

Title: Quality or Quantity? Error-Informed Selective Online Learning with Gaussian Processes in Multi-Agent Systems: Extended Version

Abstract:
Effective cooperation is pivotal in distributed learning for multi‑agent systems, where the interplay between the quantity and quality of the machine learning models is crucial. This paper reveals the irrationality of indiscriminate inclusion of all models on agents for joint prediction, highlighting the imperative to prioritize quality over quantity in cooperative learning. Specifically, we present the first selective online learning framework for distributed Gaussian process (GP) regression, namely distributed error‑informed GP (EIGP), that enables each agent to assess its neighboring collaborators, using the proposed selection function to choose the higher quality GP models with less prediction errors. Moreover, algorithmic enhancements are embedded within the EIGP, including a greedy algorithm (gEIGP) for accelerating prediction and an adaptive algorithm (aEIGP) for improving prediction accuracy. In addition, approaches for fast prediction and model update are introduced in conjunction with the error‑informed quantification term iteration and a data deletion strategy to achieve real‑time learning operations. Numerical simulations are performed to demonstrate the effectiveness of the developed methodology, showcasing its superiority over the state‑of‑the‑art distributed GP methods with different benchmarks.

PaperID: 6189, https://arxiv.org/pdf/2601.14270.pdf

Authors: Liangming Pan, Jason Liang, Jiaran Ye, Minglai Yang, Xinyuan Lu, Fengbin Zhu

Title: Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models

Abstract:
Large Language Models (LLMs) have demonstrated remarkable abilities to solve problems requiring multiple reasoning steps, yet the internal mechanisms enabling such capabilities remain elusive. Unlike existing surveys that primarily focus on engineering methods to enhance performance, this survey provides a comprehensive overview of the mechanisms underlying LLM multi‑step reasoning. We organize the survey around a conceptual framework comprising seven interconnected research questions, from how LLMs execute implicit multi‑hop reasoning within hidden activations to how verbalized explicit reasoning remodels the internal computation. Finally, we highlight five research directions for future mechanistic studies.

PaperID: 6190, https://arxiv.org/pdf/2601.14230.pdf

Authors: Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

Title: MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Abstract:
Multi‑agent systems (MAS) are emerging as promising socio‑collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non‑constructive dialogue. We propose MASCOT, a multi‑agent framework for multi‑perspective socio‑collaborative companions. MASCOT introduces a novel bi‑level optimization strategy to harmonize individual and collective behaviors: 1) Persona‑Aware Behavioral Alignment, an RLAIF‑driven pipeline that fine‑tunes individual agents for agent‑specific identities; and 2) Collaborative Dialogue Optimization, a group‑level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human‑grounded contexts drawn across both in‑domain and out‑of‑domain (OOD) settings against state‑of‑the‑art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three‑way comparisons, and automatic metrics, further shows that MASCOT produces more role‑consistent and less redundant multi‑agent dialogue.

PaperID: 6191, https://arxiv.org/pdf/2601.14171.pdf

Authors: Qianli Ma, Chang Guo, Zhiheng Tian, Siyu Wang, Jipeng Xiao, Yuanhao Yue, Zhipeng Zhang

Title: Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance

Abstract:
Writing effective rebuttals is a high‑stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct‑to‑text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce RebuttalAgent, the first multi‑agents framework that reframes rebuttal generation as an evidence‑centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high‑fidelity text while integrating an autonomous and on‑demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed RebuttalBench and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.

PaperID: 6192, https://arxiv.org/pdf/2601.14091.pdf

Authors: Hossein Naderi, Alireza Shojaei, Lifu Huang, Philip Agee, Kereshmeh Afsari, Abiola Akanmu

Title: Zero-shot adaptable task planning for autonomous construction robots: a comparative study of lightweight single and multi-AI agent systems

Abstract:
Robots are expected to play a major role in the future construction industry but face challenges due to high costs and difficulty adapting to dynamic tasks. This study explores the potential of foundation models to enhance the adaptability and generalizability of task planning in construction robots. Four models are proposed and implemented using lightweight, open‑source large language models (LLMs) and vision language models (VLMs). These models include one single agent and three multi‑agent teams that collaborate to create robot action plans. The models are evaluated across three construction roles: Painter, Safety Inspector, and Floor Tiling. Results show that the four‑agent team outperforms the state‑of‑the‑art GPT‑4o in most metrics while being ten times more cost‑effective. Additionally, teams with three and four agents demonstrate the improved generalizability. By discussing how agent behaviors influence outputs, this study enhances the understanding of AI teams and supports future research in diverse unstructured environments beyond construction.

PaperID: 6193, https://arxiv.org/pdf/2601.14053.pdf

Authors: Badri N. Patro, Vijay S. Agneeswaran

Title: LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

Abstract:
The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning‑capable systems approaching human‑level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019‑2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9‑27T tokens depleted by 2026‑2028), (2) exponential cost growth (3M to 300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute‑force approaches. Our analysis reveals six paradigms breaking this wall: (1) test‑time compute (o1, DeepSeek‑R1 achieve GPT‑4 performance with 10x inference compute), (2) quantization (4‑8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi‑4 14B matches larger models). Three paradigm shifts emerge: (1) post‑training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek‑R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi‑head Latent Attention 8x KV cache compression enables GPT‑4‑level performance at <0.30/M tokens), and (3) democratization (open‑source Llama 3 88.6% MMLU surpasses GPT‑4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool‑using agents (ReAct, RAG, multi‑agent systems), and analyze post‑training innovations.

PaperID: 6194, https://arxiv.org/pdf/2601.13981.pdf

Authors: Yilin Tang, Yu Wang, Lanlan Qiu, Wenchang Gao, Yunfei Ma, Baicheng Chen, Tianxing He

Title: VirtualCrime: Evaluating Criminal Potential of Large Language Models via Sandbox Simulation

Abstract:
Large language models (LLMs) have shown strong capabilities in multi‑step decision‑making, planning and actions, and are increasingly integrated into various real‑world applications. It is concerning whether their strong problem‑solving abilities may be misused for crimes. To address this gap, we propose VirtualCrime, a sandbox simulation framework based on a three‑agent system to evaluate the criminal capabilities of models. Specifically, this framework consists of an attacker agent acting as the leader of a criminal team, a judge agent determining the outcome of each action, and a world manager agent updating the environment state and entities. Furthermore, we design 40 diverse crime tasks within this framework, covering 11 maps and 13 crime objectives such as theft, robbery, kidnapping, and riot. We also introduce a human player baseline for reference to better interpret the performance of LLM agents. We evaluate 8 strong LLMs and find (1) All agents in the simulation environment compliantly generate detailed plans and execute intelligent crime processes, with some achieving relatively high success rates; (2) In some cases, agents take severe action that inflicts harm to NPCs to achieve their goals. Our work highlights the need for safety alignment when deploying agentic AI in real‑world settings.

PaperID: 6195, https://arxiv.org/pdf/2601.13976.pdf

Authors: Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi

Title: FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Abstract:
Achieving human‑level performance in Vision‑and‑Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual‑spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT‑2, demonstrate the potential of Chain‑of‑Thought (CoT) reasoning for improving interpretability and long‑horizon planning. Moreover, multimodal extensions like OctoNav‑R1 and CoT‑VLA further validate CoT as a promising pathway toward human‑like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real‑time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi‑CoT strategy. At inference, our model performs direct instruction‑to‑action mapping while still enjoying reasoning‑aware representations. Extensive experiments on LH‑VLN show that our approach achieves reasoning‑aware yet real‑time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.

PaperID: 6196, https://arxiv.org/pdf/2601.13922.pdf

Authors: Adrian Cosma, Oleg Szehr, David Kletz, Alessandro Antonucci, Olivier Pelletier

Title: Automatic Prompt Optimization for Dataset-Level Feature Discovery

Abstract:
Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand‑crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset‑level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi‑agent prompt optimization framework in which language‑model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset‑level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per‑example predictions. This formulation departs from prior prompt optimization methods that rely on per‑sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.

PaperID: 6197, https://arxiv.org/pdf/2601.13865.pdf

Authors: Hyunseung Lim, Dasom Choi, Sooyohn Nam, Bogoan Kim, Hwajung Hong

Title: Understanding Human-Multi-Agent Team Formation for Creative Work

Abstract:
Team‑based collaboration is a cornerstone of modern creative work. Recent advances in generative AI open possibilities for humans to collaborate with multiple AI agents in distinct roles to address complex creative workflows. Yet, how to form Human‑Multi‑Agent Teams (HMATs) is underexplored, especially given that inter‑agent interactions increase complexity and the risk of unexpected behaviors. In this exploratory study, we aim to understand how to form HMATs for creative work using CrafTeam, a technology probe that allows users to form and collaborate with their teams. We conducted a study with 12 design practitioners, in which participants iterated through a three‑step cycle: forming HMATs, ideating with their teams, and reflecting on their teams' ideation. Our findings reveal that while participants initially attempted autonomous team operations, they ultimately adopted team formations in which they directly orchestrated agents. We discuss design considerations for HMAT formation that humans can effectively orchestrate multiple agents.

PaperID: 6198, https://arxiv.org/pdf/2601.13864.pdf

Authors: Qirui Chen, Jingxian Shuai, Shuangwu Chen, Shenghao Ye, Zijian Wen, Xufei Su, Jie Jin, Jiangming Li, Jun Chen, Xiaobin Tan, Jian Yang

Title: HardSecBench: Benchmarking the Security Awareness of LLMs for Hardware Code Generation

Abstract:
Large language models (LLMs) are being increasingly integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM‑generated code, yet paid limited attention to its security issues. However, LLM‑generated code that appears functionally sound may embed security flaws which could induce catastrophic damages after deployment. This critical research gap motivates us to design a benchmark for assessing security awareness under realistic specifications. In this work, we introduce HardSecBench, a benchmark with 924 tasks spanning Verilog Register Transfer Level (RTL) and firmware‑level C, covering 76 hardware‑relevant Common Weakness Enumeration (CWE) entries. Each task includes a structured specification, a secure reference implementation, and executable tests. To automate artifact synthesis, we propose a multi‑agent pipeline that decouples synthesis from verification and grounds evaluation in execution evidence, enabling reliable evaluation. Using HardSecBench, we evaluate a range of LLMs on hardware and firmware code generation and find that models often satisfy functional requirements while still leaving security risks. We also find that security results vary with prompting. These findings highlight pressing challenges and offer actionable insights for future advancements in LLM‑assisted hardware design. Our data and code will be released soon.

PaperID: 6199, https://arxiv.org/pdf/2601.13688.pdf

Authors: Xun Feng, Chao Zhai

Title: Distributed Coverage Control on Poriferous Surface via Poly-Annulus Conformal Mapping

Abstract:
The inherent non‑convexity of poriferous surfaces typically entraps agents in local minima and complicates workload distribution. To resolve this, we propose a distributed diffeomorphic coverage control framework for the multi‑agent system (MAS) in such surfaces. First, we establish a distributed poly‑annulus conformal mapping that transforms arbitrary poriferous surfaces into a multi‑hole disk. Leveraging this topological equivalence, a collision‑free sectorial partition mechanism is designed in the multi‑hole disk, which rigorously induces strictly connected subregions and workload balance on the poriferous surfaces. This mechanism utilizes a buffer‑based sequence mechanism to ensure strict topological safety when bypassing obstacles. Furthermore, a pull‑back Riemannian metric is constructed to define the length metric that encodes safety constraints. Based on this metric, a distributed gradient‑based control law is synthesized to drive agents toward optimal configurations, ensuring simultaneous obstacle avoidance and coverage optimization. Theoretical analyses guarantee the Input‑to‑State Stability (ISS) of the partition dynamics and the asymptotic convergence of the closed‑loop system. Numerical simulations confirm the reachability and robustness of the proposed coverage algorithm, offering a scalable solution for distributed coverage in poriferous surfaces.

PaperID: 6200, https://arxiv.org/pdf/2601.13671.pdf

Authors: Apoorva Adimulam, Rajesh Gupta, Sumit Kumar

Title: The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption

Abstract:
Orchestrated multi‑agent systems represent the next stage in the evolution of artificial intelligence, where autonomous agents collaborate through structured coordination and communication to achieve complex, shared objectives. This paper consolidates and formalizes the technical composition of such systems, presenting a unified architectural framework that integrates planning, policy enforcement, state management, and quality operations into a coherent orchestration layer. Another primary contribution of this work is the in‑depth technical delineation of two complementary communication protocols ‑ the Model Context Protocol, which standardizes how agents access external tools and contextual data, and the Agent2Agent protocol, which governs peer coordination, negotiation, and delegation. Together, these protocols establish an interoperable communication substrate that enables scalable, auditable, and policy‑compliant reasoning across distributed agent collectives. Beyond protocol design, the paper details how orchestration logic, governance frameworks, and observability mechanisms collectively sustain system coherence, transparency, and accountability. By synthesizing these elements into a cohesive technical blueprint, this paper provides comprehensive treatments of orchestrated multi‑agent systems ‑ bridging conceptual architectures with implementation‑ready design principles for enterprise‑scale AI ecosystems.

PaperID: 6201, https://arxiv.org/pdf/2601.13612.pdf

Authors: Jiani Liu, Yixin He, Lanlan Fan, Qidi Zhong, Yushi Cheng, Meng Zhang, Yanjiao Chen, Wenyuan Xu

Title: PINA: Prompt Injection Attack against Navigation Agents

Abstract:
Navigation agents powered by large language models (LLMs) convert natural language instructions into executable plans and actions. Compared to text‑based applications, their security is far more critical: a successful prompt injection attack does not just alter outputs but can directly misguide physical navigation, leading to unsafe routes, mission failure, or real‑world harm. Despite this high‑stakes setting, the vulnerability of navigation agents to prompt injection remains largely unexplored. In this paper, we propose PINA, an adaptive prompt optimization framework tailored to navigation agents under black‑box, long‑context, and action‑executable constraints. Experiments on indoor and outdoor navigation agents show that PINA achieves high attack success rates with an average ASR of 87.5%, surpasses all baselines, and remains robust under ablation and adaptive‑attack conditions. This work provides the first systematic investigation of prompt injection attacks in navigation and highlights their urgent security implications for embodied LLM agents.

PaperID: 6202, https://arxiv.org/pdf/2601.13589.pdf

Authors: HyeYoung Lee

Title: Motion-to-Response Content Generation via Multi-Agent AI System with Real-Time Safety Verification

Abstract:
This paper proposes a multi‑agent artificial intelligence system that generates response‑oriented media content in real time based on audio‑derived emotional signals. Unlike conventional speech emotion recognition studies that focus primarily on classification accuracy, our approach emphasizes the transformation of inferred emotional states into safe, age‑appropriate, and controllable response content through a structured pipeline of specialized AI agents. The proposed system comprises four cooperative agents: (1) an Emotion Recognition Agent with CNN‑based acoustic feature extraction, (2) a Response Policy Decision Agent for mapping emotions to response modes, (3) a Content Parameter Generation Agent for producing media control parameters, and (4) a Safety Verification Agent enforcing age‑appropriateness and stimulation constraints. We introduce an explicit safety verification loop that filters generated content before output, ensuring compliance with predefined rules. Experimental results on public datasets demonstrate that the system achieves 73.2% emotion recognition accuracy, 89.4% response mode consistency, and 100% safety compliance while maintaining sub‑100ms inference latency suitable for on‑device deployment. The modular architecture enables interpretability and extensibility, making it applicable to child‑adjacent media, therapeutic applications, and emotionally responsive smart devices.

PaperID: 6203, https://arxiv.org/pdf/2601.13559.pdf

Authors: Sun Hui, Ding Yanfeng, Huidong Ma, Chang Xu, Keyan Jin, Lizheng Zu, Cheng Zhong, xiaoguang Liu, Gang Wang, Wentong Cai

Title: AgentGC: Evolutionary Learning-based Lossless Compression for Genomics Data with LLM-driven Multiple Agent

Abstract:
Lossless compression has made significant advancements in Genomics Data (GD) storage, sharing and management. Current learning‑based methods are non‑evolvable with problems of low‑level compression modeling, limited adaptability, and user‑unfriendly interface. To this end, we propose AgentGC, the first evolutionary Agent‑based GD Compressor, consisting of 3 layers with multi‑agent named Leader and Worker. Specifically, the 1) User layer provides a user‑friendly interface via Leader combined with LLM; 2) Cognitive layer, driven by the Leader, integrates LLM to consider joint optimization of algorithm‑dataset‑system, addressing the issues of low‑level modeling and limited adaptability; and 3) Compression layer, headed by Worker, performs compression & decompression via a automated multi‑knowledge learning‑based compression framework. On top of AgentGC, we design 3 modes to support diverse scenarios: CP for compression‑ratio priority, TP for throughput priority, and BM for balanced mode. Compared with 14 baselines on 9 datasets, the average compression ratios gains are 16.66%, 16.11%, and 16.33%, the throughput gains are 4.73x, 9.23x, and 9.15x, respectively.

PaperID: 6204, https://arxiv.org/pdf/2601.13546.pdf

Authors: Hui Sun, Chang Xu, Haonan Xie, Hao Li, Yuhao Huang, Chuheng Zhang, Ming Jin, Xiaoguang Liu, Gang Wang, Jiang Bian

Title: ChatAD: Reasoning-Enhanced Time-Series Anomaly Detection with Multi-Turn Instruction Evolution

Abstract:
LLM‑driven Anomaly Detection (AD) helps enhance the understanding and explanatory abilities of anomalous behaviors in Time Series (TS). Existing methods face challenges of inadequate reasoning ability, deficient multi‑turn dialogue capability, and narrow generalization. To this end, we 1) propose a multi‑agent‑based TS Evolution algorithm named TSEvol. On top of it, we 2) introduce the AD reasoning and multi‑turn dialogue Dataset TSEData‑20K and contribute the Chatbot family for AD, including ChatAD‑Llama3‑8B, Qwen2.5‑7B, and Mistral‑7B. Furthermore, 3) we propose the TS Kahneman‑Tversky Optimization (TKTO) to enhance ChatAD's cross‑task generalization capability. Lastly, 4) we propose a LLM‑driven Learning‑based AD Benchmark LLADBench to evaluate the performance of ChatAD and nine baselines across seven datasets and tasks. Our three ChatAD models achieve substantial gains, up to 34.50% in accuracy, 34.71% in F1, and a 37.42% reduction in false positives. Besides, via KTKO, our optimized ChatAD achieves competitive performance in reasoning and cross‑task generalization on classification, forecasting, and imputation.

PaperID: 6205, https://arxiv.org/pdf/2601.13508.pdf

Authors: Honghao Chen, Jiangjie Qiu, Yi Shen Tew, Xiaonan Wang

Title: Toward Autonomous Computational Catalysis Research via Agentic Systems

Abstract:
Fully autonomous science has long been a defining ambition for artificial intelligence in materials discovery, yet its realization requires more than automating isolated calculations. In computational catalysis, a system autonomously navigating the entire research lifecycle from conception to a scientifically meaningful manuscript remains an open challenge. Here we present CatMaster, a catalysis‑native multi‑agent framework that couples project‑level reasoning with the direct execution of atomistic simulations, machine‑learning modelling, literature analysis, and manuscript production within a unified autonomous architecture. Across progressively more realistic research settings, CatMaster converts natural‑language intent into executable computational tasks, achieves near‑ceiling scores on standard catalysis scenarios, reaches near‑leaderboard performance on five of six MatBench tasks, performs autonomous modelling on various catalytic surfaces and reaction pathway investigations, and demonstrates the close‑loop autonomy by a fully closed‑loop single‑atom catalyst design case. These results establish autonomous computational catalysis as an already operational scientific paradigm, while highlighting that bridging the gap to complex physical challenges and genuine scientific closure requires tighter integration with human stewardship and domain‑rigorous methodologies in the future.

PaperID: 6206, https://arxiv.org/pdf/2601.13481.pdf

Authors: Jian Zhang, Zhangqi Wang, Zhiyuan Wang, Weiping Fu, Yu He, Haiping Zhu, Qika Lin, Jun Liu

Title: Towards Efficient and Robust Linguistic Emotion Diagnosis for Mental Health via Multi-Agent Instruction Refinement

Abstract:
Linguistic expressions of emotions such as depression, anxiety, and trauma‑related states are pervasive in clinical notes, counseling dialogues, and online mental health communities, and accurate recognition of these emotions is essential for clinical triage, risk assessment, and timely intervention. Although large language models (LLMs) have demonstrated strong generalization ability in emotion analysis tasks, their diagnostic reliability in high‑stakes, context‑intensive medical settings remains highly sensitive to prompt design. Moreover, existing methods face two key challenges: emotional comorbidity, in which multiple intertwined emotional states complicate prediction, and inefficient exploration of clinically relevant cues. To address these challenges, we propose APOLO (Automated Prompt Optimization for Linguistic Emotion Diagnosis), a framework that systematically explores a broader and finer‑grained prompt space to improve diagnostic efficiency and robustness. APOLO formulates instruction refinement as a Partially Observable Markov Decision Process and adopts a multi‑agent collaboration mechanism involving Planner, Teacher, Critic, Student, and Target roles. Within this closed‑loop framework, the Planner defines an optimization trajectory, while the Teacher‑Critic‑Student agents iteratively refine prompts to enhance reasoning stability and effectiveness, and the Target agent determines whether to continue optimization based on performance evaluation. Experimental results show that APOLO consistently improves diagnostic accuracy and robustness across domain‑specific and stratified benchmarks, demonstrating a scalable and generalizable paradigm for trustworthy LLM applications in mental healthcare.

PaperID: 6207, https://arxiv.org/pdf/2601.13453.pdf

Authors: Aditya Thole, Anmol Agrawal, Arnav Ramamoorthy, Dhruv Kumar

Title: PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving

Abstract:
Explaining numerical physics problems often requires more than text‑based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high‑quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics‑problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision‑language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT‑5‑mini, PSA achieves a 100% video‑completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems

PaperID: 6208, https://arxiv.org/pdf/2601.13387.pdf

Authors: Zhenjiang Mao, Anirudhh Venkat, Artem Bisliouk, Akshat Kothiyal, Sindhura Kumbakonam Subramanian, Saithej Singhu, Ivan Ruchkin

Title: Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

Abstract:
Large Language Models (LLMs) increasingly rely on long‑form, multi‑step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.

PaperID: 6209, https://arxiv.org/pdf/2601.13383.pdf

Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Title: A Lightweight Modular Framework for Constructing Autonomous Agents Driven by Large Language Models: Design, Implementation, and Applications in AgentForge

Abstract:
The emergence of LLMs has catalyzed a paradigm shift in autonomous agent development, enabling systems capable of reasoning, planning, and executing complex multi‑step tasks. However, existing agent frameworks often suffer from architectural rigidity, vendor lock‑in, and prohibitive complexity that impedes rapid prototyping and deployment. This paper presents AgentForge, a lightweight, open‑source Python framework designed to democratize the construction of LLM‑driven autonomous agents through a principled modular architecture. AgentForge introduces three key innovations: (1) a composable skill abstraction that enables fine‑grained task decomposition with formally defined input‑output contracts, (2) a unified LLM backend interface supporting seamless switching between cloud‑based APIs and local inference engines, and (3) a declarative YAML‑based configuration system that separates agent logic from implementation details. We formalize the skill composition mechanism as a directed acyclic graph (DAG) and prove its expressiveness for representing arbitrary sequential and parallel task workflows. Comprehensive experimental evaluation across four benchmark scenarios demonstrates that AgentForge achieves competitive task completion rates while reducing development time by 62% compared to LangChain and 78% compared to direct API integration. Latency measurements confirm sub‑100ms orchestration overhead, rendering the framework suitable for real‑time applications. The modular design facilitates extension: we demonstrate the integration of six built‑in skills and provide comprehensive documentation for custom skill development. AgentForge addresses a critical gap in the LLM agent ecosystem by providing researchers and practitioners with a production‑ready foundation for constructing, evaluating, and deploying autonomous agents without sacrificing flexibility or performance.

PaperID: 6210, https://arxiv.org/pdf/2601.13268.pdf

Authors: Zainab Ghafoor, Md Shafiqul Islam, Koushik Howlader, Md Rasel Khondokar, Tanusree Bhattacharjee, Sayantan Chakraborty, Adrito Roy, Ushashi Bhattacharjee, Tirtho Roy

Title: Improving the Safety and Trustworthiness of Medical AI via Multi-Agent Evaluation Loops

Abstract:
Large Language Models (LLMs) are increasingly applied in healthcare, yet ensuring their ethical integrity and safety compliance remains a major barrier to clinical deployment. This work introduces a multi‑agent refinement framework designed to enhance the safety and reliability of medical LLMs through structured, iterative alignment. Our system combines two generative models ‑ DeepSeek R1 and Med‑PaLM ‑ with two evaluation agents, LLaMA 3.1 and Phi‑4, which assess responses using the American Medical Association's (AMA) Principles of Medical Ethics and a five‑tier Safety Risk Assessment (SRA‑5) protocol. We evaluate performance across 900 clinically diverse queries spanning nine ethical domains, measuring convergence efficiency, ethical violation reduction, and domain‑specific risk behavior. Results demonstrate that DeepSeek R1 achieves faster convergence (mean 2.34 vs. 2.67 iterations), while Med‑PaLM shows superior handling of privacy‑sensitive scenarios. The iterative multi‑agent loop achieved an 89% reduction in ethical violations and a 92% risk downgrade rate, underscoring the effectiveness of our approach. This study presents a scalable, regulator‑aligned, and cost‑efficient paradigm for governing medical AI safety.

PaperID: 6211, https://arxiv.org/pdf/2601.13243.pdf

Authors: Yapeng Li, Jiakuo Yu, Zhixin Liu, Xinnan Liu, Jing Yu, Songze Li, Tonghua Su

Title: A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms

Abstract:
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms ‑ such as Chain‑of‑Thought (CoT) and multi‑agent systems (MAS) ‑ play a critical role, yet their relative effectiveness and cost‑accuracy trade‑offs remain poorly understood. In this work, we conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single‑model generation, CoT‑augmented single‑model reasoning, and representative MAS workflows, characterizing their reasoning performance across a diverse suite of closed‑form benchmarks. Beyond overall performance, we probe role‑specific capability demands in MAS using targeted role isolation analyses, and analyze cost‑accuracy trade‑offs to identify which MAS workflows offer a favorable balance between cost and accuracy, and which incur prohibitive overhead for marginal gains. We further introduce MIMeBench, a new open‑ended benchmark that targets two foundational yet underexplored semantic capabilities ‑ semantic abstraction and contrastive discrimination ‑ thereby providing an alternative evaluation axis beyond closed‑form accuracy and enabling fine‑grained assessment of semantic competence that is difficult to capture with existing benchmarks. Our results show that increased structural complexity does not consistently lead to improved reasoning performance, with its benefits being highly dependent on the properties and suitability of the reasoning paradigm itself. The codes are released at https://gitcode.com/HIT1920/OpenLLMBench.

PaperID: 6212, https://arxiv.org/pdf/2601.13186.pdf

Authors: Diego Gosmar, Deborah A. Dahl

Title: Prompt Injection Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Abstract:
Prompt injection remains a central obstacle to the safe deployment of large language models, particularly in multi‑agent settings where intermediate outputs can propagate or amplify malicious instructions. Building on earlier work that introduced a four‑metric Total Injection Vulnerability Score (TIVS), this paper extends the evaluation framework with semantic similarity‑based caching and a fifth metric (Observability Score Ratio) to yield TIVS‑O, investigating how defence effectiveness interacts with transparency in a HOPE‑inspired Nested Learning architecture. The proposed system combines an agentic pipeline with Continuum Memory Systems that implement semantic similarity‑based caching across 301 synthetically generated injection‑focused prompts drawn from ten attack families, while a fourth agent performs comprehensive security analysis using five key performance indicators. In addition to traditional injection metrics, OSR quantifies the richness and clarity of security‑relevant reasoning exposed by each agent, enabling an explicit analysis of trade‑offs between strict mitigation and auditability. Experiments show that the system achieves secure responses with zero high‑risk breaches, while semantic caching delivers substantial computational savings, achieving a 41.6% reduction in LLM calls and corresponding decreases in latency, energy consumption, and carbon emissions. Five TIVS‑O configurations reveal optimal trade‑offs between mitigation strictness and forensic transparency. These results indicate that observability‑aware evaluation can reveal non‑monotonic effects within multi‑agent pipelines and that memory‑augmented agents can jointly maximize security robustness, real‑time performance, operational cost savings, and environmental sustainability without modifying underlying model weights, providing a production‑ready pathway for secure and green LLM deployments.

PaperID: 6213, https://arxiv.org/pdf/2601.13114.pdf

Authors: Abdelrahman Soliman, Ahmed Refaey, Aiman Erbad, Amr Mohamed

Title: IntAgent: NWDAF-Based Intent LLM Agent Towards Advanced Next Generation Networks

Abstract:
Intent‑based networks (IBNs) are gaining prominence as an innovative technology that automates network operations through high‑level request statements, defining what the network should achieve. In this work, we introduce IntAgent, an intelligent intent LLM agent that integrates NWDAF analytics and tools to fulfill the network operator's intents. Unlike previous approaches, we develop an intent tools engine directly within the NWDAF analytics engine, allowing our agent to utilize live network analytics to inform its reasoning and tool selection. We offer an enriched, 3GPP‑compliant data source that enhances the dynamic, context‑aware fulfillment of network operator goals, along with an MCP tools server for scheduling, monitoring, and analytics tools. We demonstrate the efficacy of our framework through two practical use cases: ML‑based traffic prediction and scheduled policy enforcement, which validate IntAgent's ability to autonomously fulfill complex network intents.

PaperID: 6214, https://arxiv.org/pdf/2601.13112.pdf

Authors: Xiaolei Zhang, Xiaojun Jia, Liquan Chen, Songze Li

Title: CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Abstract:
Introducing reasoning models into Retrieval‑Augmented Generation (RAG) systems enhances task performance through step‑by‑step reasoning, logical consistency, and multi‑step self‑verification. However, recent studies have shown that reasoning models suffer from overthinking attacks, where models are tricked to generate unnecessarily high number of reasoning tokens. In this paper, we reveal that such overthinking risk can be inherited by RAG systems equipped with reasoning models, by proposing an end‑to‑end attack framework named Contradiction‑Based Deliberation Extension (CODE). Specifically, CODE develops a multi‑agent architecture to construct poisoning samples that are injected into the knowledge base. These samples 1) are highly correlated with the use query, such that can be retrieved as inputs to the reasoning model; and 2) contain contradiction between the logical and evidence layers that cause models to overthink, and are optimized to exhibit highly diverse styles. Moreover, the inference overhead of CODE is extremely difficult to detect, as no modification is needed on the user query, and the task accuracy remain unaffected. Extensive experiments on two datasets across five commercial reasoning models demonstrate that the proposed attack causes a 5.32x‑24.72x increase in reasoning token consumption, without degrading task performance. Finally, we also discuss and evaluate potential countermeasures to mitigate overthinking risks.

PaperID: 6215, https://arxiv.org/pdf/2601.13060.pdf

Authors: Zecheng Li, Zhihui Cao, Wenke Huang, Yudong Zhang, Keying Qi, Rui Wang, Zeyu Zheng, Jian Zhao, Hao Zhu, Hengxin Wu, Yuran Wang, Guitao Fan, Guokun Wu, Yicong Liu, Zhilin Gao, Haikun Xu, He Yang, Minqi Xiang, Xingyu Liu, Zuojian Wang

Title: MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux

Abstract:
Graphical user interface (GUI) agents are rapidly progressing toward autonomous interaction and reliable task execution across diverse applications. However, two central challenges remain unresolved: automating the evaluation of agent trajectories and generating high‑quality training data at scale to enable continual improvement. Existing approaches often depend on manual annotation or static rule‑based verification, which restricts scalability and limits adaptability in dynamic environments. We present MagicGUI‑RMS, a multi‑agent reward model system that delivers adaptive trajectory evaluation, corrective feedback, and self‑evolving learning capabilities. MagicGUI‑RMS integrates a Domain‑Specific Reward Model (DS‑RM) with a General‑Purpose Reward Model (GP‑RM), enabling fine‑grained action assessment and robust generalization across heterogeneous GUI tasks. To support reward learning at scale, we design a structured data construction pipeline that automatically produces balanced and diverse reward datasets, effectively reducing annotation costs while maintaining sample fidelity. During execution, the reward model system identifies erroneous actions, proposes refined alternatives, and continuously enhances agent behavior through an automated data‑reflux mechanism. Extensive experiments demonstrate that MagicGUI‑RMS yields substantial gains in task accuracy, behavioral robustness. These results establish MagicGUI‑RMS as a principled and effective foundation for building self‑improving GUI agents driven by reward‑based adaptation.

PaperID: 6216, https://arxiv.org/pdf/2601.13057.pdf

Authors: Chao Wang, Shuyuan Zhang, Lei Wang

Title: Convex Model Predictive Control for Safe Output Consensus of Nonlinear Multi-Agent Systems

Abstract:
Nonlinear dynamics and safety constraints typically result in a nonlinear programming problem when applying model predictive control to achieve safe output consensus. To avoid the heavy computational burden of solving a nonlinear programming problem directly, this paper proposes a novel Convex Model Predictive Control (CMPC) approach based on a Sequential Quadratic Programming (SQP) scheme. The core of our method lies in transforming the nonlinear constraints into linear forms: we linearize the system dynamics and convexify the discrete‑time high‑order control barrier functions using a proposed tangent‑line projection method. Consequently, the original problem is reduced to a quadratic program that can be iteratively solved within the SQP scheme at each time step of CMPC. Furthermore, we provide the formal guarantee of the convergence of the SQP scheme, and subsequently guarantee the recursive feasibility and stability of CMPC. Simulations on multi‑agent systems with unicycle dynamics demonstrate a 35‑52 times reduction in computation time compared with baseline methods, confirming the suitability of the proposed approach for real‑time safe output consensus control.

PaperID: 6217, https://arxiv.org/pdf/2601.12979.pdf

Authors: Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao

Title: The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

Abstract:
The pursuit of real‑time agentic interaction has driven interest in Diffusion‑based Large Language Models (dLLMs) as alternatives to auto‑regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long‑horizon planning) and Tool‑Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematically failure. (1) In Embodied settings, dLLMs suffer repeated attempts, failing to branch under temporal feedback. (2) In Tool‑Calling settings, dLLMs fail to maintain symbolic precision (e.g. strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi‑agent evaluation framework that integrates dLLMs as plug‑and‑play cognitive cores. Our analysis shows that dLLMs are effective in non‑causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

PaperID: 6218, https://arxiv.org/pdf/2601.12886.pdf

Authors: Christoph Wittner

Title: Communication Methods in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning is a promising research area that extends established reinforcement learning approaches to problems formulated as multi‑agent systems. Recently, a multitude of communication methods have been introduced to this field to address problems such as partially observable environments, non‑stationarity, and exponentially growing action spaces. Communication further enables efficient cooperation among all agents interacting in an environment. This work aims at providing an overview of communication techniques in multi‑agent reinforcement learning. By an in‑depth analysis of 29 publications on this topic, the strengths and weaknesses of explicit, implicit, attention‑based, graph‑based, and hierarchical/role‑based communication are evaluated. The results of this comparison show that there is no general, optimal communication framework for every problem. On the contrary, the choice of communication depends heavily on the problem at hand. The comparison also highlights the importance of communication methods with low computational overhead to enable scalability to environments where many agents interact. Finally, the paper discusses current research gaps, emphasizing the need for standardized benchmarking of system‑level metrics and improved robustness under realistic communication conditions to enhance the real‑world applicability of these approaches.

PaperID: 6219, https://arxiv.org/pdf/2601.12815.pdf

Authors: Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou

Title: Multimodal Multi-Agent Empowered Legal Judgment Prediction

Abstract:
Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role‑based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video‑text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

PaperID: 6220, https://arxiv.org/pdf/2601.12771.pdf

Authors: Keito Inoshita

Title: Who Does This Name Remind You of ? Nationality Prediction via Large Language Model Associative Memory

Abstract:
Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual‑agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top‑1 predictions through voting and Top‑K predictions through conditional completion. On a 99‑country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall‑based approaches are robust to low‑frequency nationalities independent of data frequency distributions, and that the dual‑agent architecture functions complementarily to produce synergistic effects. These results demonstrate the effectiveness of a new multi‑agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.

PaperID: 6221, https://arxiv.org/pdf/2601.12740.pdf

Authors: Zijian Zhang, Fangshi Du, Xingjian Liu, Pan Chen, Oliver Huang, Runlong Ye, Michael Liut, Alán Aspuru-Guzik

Title: TreeWriter: AI-Assisted Hierarchical Planning and Writing for Long-Form Documents

Abstract:
Long documents pose many challenges to current intelligent writing systems. These include maintaining consistency across sections, sustaining efficient planning and writing as documents become more complex, and effectively providing and integrating AI assistance to the user. Existing AI co‑writing tools offer either inline suggestions or limited structured planning, but rarely support the entire writing process that begins with high‑level ideas and ends with polished prose, in which many layers of planning and outlining are needed. Here, we introduce TreeWriter, a hierarchical writing system that represents documents as trees and integrates contextual AI support. TreeWriter allows authors to create, save, and refine document outlines at multiple levels, facilitating drafting, understanding, and iterative editing of long documents. A built‑in AI agent can dynamically load relevant content, navigate the document hierarchy, and provide context‑aware editing suggestions. A within‑subject study (N=12) comparing TreeWriter with Google Docs + Gemini on long‑document editing and creative writing tasks shows that TreeWriter improves idea exploration/development, AI helpfulness, and perceived authorial control. A two‑month field deployment (N=8) further demonstrated that hierarchical organization supports collaborative writing. Our findings highlight the potential of hierarchical, tree‑structured editors with integrated AI support and provide design guidelines for future AI‑assisted writing tools that balance automation with user agency.

PaperID: 6222, https://arxiv.org/pdf/2601.12735.pdf

Authors: Hao Chen, Yunchun Li, Chen Chen, Fengxu Lin, Wei Li

Title: OpenAI for OpenAPI: Automated generation of REST API specification via LLMs

Abstract:
REST APIs, based on the REpresentational State Transfer (REST) architecture, are the primary type of Web API. The OpenAPI Specification (OAS) serves as the de facto standard for describing REST APIs and is crucial for multiple software engineering tasks. However, developers face challenges in writing and maintaining OAS. Although static analysis shows potential for OAS generation, it is limited to specific programming languages and development frameworks. The powerful code understanding capabilities of LLMs offer new opportunities for OAS generation, yet they are constrained by context limitations and hallucinations. To address these challenges, we propose the OpenAI OpenAPI Project Scanner (OOPS), the first technology‑agnostic LLM‑based static analysis method for OAS generation, requiring fewer technology‑specific rules and less human expert intervention. OOPS is implemented as an LLM agent workflow comprising two key steps: endpoint method extraction and OAS generation. By constructing an API dependency graph, it establishes necessary file associations to address LLMs' context limitations. Through multi‑stage generation and self‑refine, it mitigates both syntactic and semantic hallucinations during OAS generation. We evaluated OOPS on 12 real‑world REST APIs spanning 5 programming languages and 8 development frameworks. Experimental results demonstrate that OOPS accurately generates high‑quality OAS for REST APIs implemented with diverse technologies, achieving an average F1‑score exceeding 98% for endpoint method inference, 97% for both request parameter and response inference, and 92% for parameter constraint inference. The input tokens average below 5.6K with a maximum of 16.2K, while the output tokens average below 0.9K with a maximum of 7.7K.

PaperID: 6223, https://arxiv.org/pdf/2601.12698.pdf

Authors: Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

Title: A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Abstract:
GPU code optimization is a key performance bottleneck for HPC workloads as well as large‑model training and inference. Although compiler optimizations and hand‑written kernels can partially alleviate this issue, achieving near‑hardware‑limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM‑agent‑based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template‑based rewriting layer on top of an agent‑driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search‑based autotuning, yielding more stable and higher‑quality speedups. Experiments on a set of real‑world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent‑only direct rewriting, the template‑plus‑search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high‑performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

PaperID: 6224, https://arxiv.org/pdf/2601.12662.pdf

Authors: Xingran Chen, Navid NaderiAlizadeh, Alejandro Ribeiro, Shirin Saeedi Bidokhti

Title: Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks

Abstract:
We address real‑time sampling and estimation of autoregressive Markovian sources in dynamic yet structurally similar multi‑hop wireless networks. Each node caches samples from others and communicates over wireless collision channels, aiming to minimize time‑average estimation error via decentralized policies. Due to the high dimensionality of action spaces and complexity of network topologies, deriving optimal policies analytically is intractable. To address this, we propose a graphical multi‑agent reinforcement learning framework for policy optimization. Theoretically, we demonstrate that our proposed policies are transferable, allowing a policy trained on one graph to be effectively applied to structurally similar graphs. Numerical experiments demonstrate that (i) our proposed policy outperforms state‑of‑the‑art baselines; (ii) the trained policies are transferable to larger networks, with performance gains increasing with the number of agents; (iii) the graphical training procedure withstands non‑stationarity, even when using independent learning techniques; and (iv) recurrence is pivotal in both independent learning and centralized training and decentralized execution, and improves the resilience to non‑stationarity.

PaperID: 6225, https://arxiv.org/pdf/2601.12657.pdf

Authors: Yin Wu, Wei-Yu Chiu, Yuan-Po Tsai, Shangyuan Liu, Weiqi Hua

Title: Multiagent Reinforcement Learning in Enhancing Resilience of Microgrids under Extreme Weather Events

Abstract:
Grid resilience is crucial in light of power interruptions caused by increasingly frequent extreme weather events. Well‑designed energy management systems (EMS) have made progress in improving microgrid resilience through the coordination of distributed energy resources (DERs), but still face significant challenges in addressing the uncertainty of load demand caused by extreme weather. The integration of deep reinforcement learning (DRL) into EMS design enables optimized microgrid control strategies for coordinating DERs. Building on this, we proposed a cooperative multi‑agent deep reinforcement learning (MADRL)‑based EMS framework to provide flexible scalability for microgrids, enhance resilience and reduce operational costs during power outages. Specifically, the gated recurrent unit with a gating mechanism was introduced to extract features from temporal data, which enables the EMS to coordinate DERs more efficiently. Next, the proposed MADRL method incorporating action masking techniques was evaluated in the IEEE 33‑Bus system using real‑world data on renewable generation and power load. Finally, the numerical results demonstrated the superiority of the proposed method in reducing operating costs as well as the effectiveness in enhancing microgrid resilience during power interruptions.

PaperID: 6226, https://arxiv.org/pdf/2601.12618.pdf

Authors: Elham Tajik, Conrad Borchers, Bahar Shahrokhian, Sebastian Simon, Ali Keramati, Sonika Pal, Sreecharan Sankaranarayanan

Title: Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

Abstract:
Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human‑AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi‑agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub‑functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter‑rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning‑trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.

PaperID: 6227, https://arxiv.org/pdf/2601.12616.pdf

Authors: Johnathan Corbin, Sarah H. Q. Li, Jonathan Rogers

Title: Allocating Corrective Control to Mitigate Multi-agent Safety Violations Under Private Preferences

Abstract:
We propose a novel framework that computes the corrective control efforts to ensure joint safety in multi‑agent dynamical systems. This framework efficiently distributes the required corrective effort without revealing individual agents' private preferences. Our framework integrates high‑order control barrier functions (HOCBFs), which enforce safety constraints with formal guarantees of safety for complex dynamical systems, with a privacy‑preserving resource allocation mechanism based on the progressive second price (PSP) auction. When a joint safety constraint is violated, agents iteratively bid on new corrective efforts via 'avoidance credits' rather than explicitly solving for feasible corrective efforts that remove the safety violation. The resulting correction, determined via a second price payment rule, coincides with the socially optimal safe distribution of corrective actions. Critically, the bidding process achieves this optimal allocation efficiently and without revealing private preferences of individual agents. We demonstrate this method through multi‑robot hardware experiments on the Robotarium platform.

PaperID: 6228, https://arxiv.org/pdf/2601.12580.pdf

Authors: Sofiya Zaichyk

Title: Semantic Fusion: Verifiable Alignment in Decentralized Multi-Agent Systems

Abstract:
We present Semantic Fusion (SF), a formal framework for decentralized semantic coordination in multi‑agent systems. SF allows agents to operate over scoped views of shared memory, propose structured updates, and maintain global coherence through local ontology‑based validation and refresh without centralized control or explicit message passing. The central theoretical result is a bisimulation theorem showing that each agent's local execution is behaviorally equivalent to its projection of the global semantics, in both deterministic and probabilistic settings. This enables safety, liveness, and temporal properties to be verified locally and soundly lifted to the full system. SF supports agents whose update proposals vary across invocations, including those generated by learned or heuristic components, provided updates pass semantic validation before integration. We establish deterministic and probabilistic guarantees ensuring semantic alignment under asynchronous or degraded communication. To validate the model operationally, we implement a lightweight reference architecture that instantiates its core mechanisms. A 250‑agent simulation evaluates these properties across over 11,000 validated updates, demonstrating convergence under probabilistic refresh, bounded communication, and resilience to agent failure. Together, these results show that Semantic Fusion can provide a formal and scalable basis for verifiable autonomy in decentralized systems.

PaperID: 6229, https://arxiv.org/pdf/2601.12560.pdf

Authors: Arunkumar V, Gangadharan G. R., Rajkumar Buyya

Title: Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents

Abstract:
Artificial Intelligence is moving from models that only generate text to Agentic AI, where systems behave as autonomous entities that can perceive, reason, plan, and act. Large Language Models (LLMs) are no longer used only as passive knowledge engines but as cognitive controllers that combine memory, tool use, and feedback from their environment to pursue extended goals. This shift already supports the automation of complex workflows in software engineering, scientific discovery, and web navigation, yet the variety of emerging designs, from simple single loop agents to hierarchical multi agent systems, makes the landscape hard to navigate. In this paper, we investigate architectures and propose a unified taxonomy that breaks agents into Perception, Brain, Planning, Action, Tool Use, and Collaboration. We use this lens to describe the move from linear reasoning procedures to native inference time reasoning models, and the transition from fixed API calls to open standards like the Model Context Protocol (MCP) and Native Computer Use. We also group the environments in which these agents operate, including digital operating systems, embodied robotics, and other specialized domains, and we review current evaluation practices. Finally, we highlight open challenges, such as hallucination in action, infinite loops, and prompt injection, and outline future research directions toward more robust and reliable autonomous systems.

PaperID: 6230, https://arxiv.org/pdf/2601.12542.pdf

Authors: Lukas Weidener, Marko Brkić, Mihailo Jovanović, Ritvik Singh, Chiara Baccin, Emre Ulgac, Alex Dobrin, Aakaash Meduri

Title: Rethinking the AI Scientist: Interactive Multi-Agent Workflows for Scientific Discovery

Abstract:
Artificial intelligence systems for scientific discovery have demonstrated remarkable potential, yet existing approaches remain largely proprietary and operate in batch‑processing modes requiring hours per research cycle, precluding real‑time researcher guidance. This paper introduces Deep Research, a multi‑agent system enabling interactive scientific investigation with turnaround times measured in minutes. The architecture comprises specialized agents for planning, data analysis, literature search, and novelty detection, unified through a persistent world state that maintains context across iterative research cycles. Two operational modes support different workflows: semi‑autonomous mode with selective human checkpoints, and fully autonomous mode for extended investigations. Evaluation on the BixBench computational biology benchmark demonstrated state‑of‑the‑art performance, achieving 48.8% accuracy on open response and 64.4% on multiple‑choice evaluation, exceeding existing baselines by 14 to 26 percentage points. Analysis of architectural constraints, including open access literature limitations and challenges inherent to automated novelty assessment, informs practical deployment considerations for AI‑assisted scientific workflows.

PaperID: 6231, https://arxiv.org/pdf/2601.12518.pdf

Authors: Nuoya Xiong, Aarti Singh

Title: Cooperative Multi-agent RL with Communication Constraints

Abstract:
Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handle missing data is called importance sampling, in which we reweigh old data from a base policy to estimate gradients for the current policy. However, it quickly becomes unstable when the communication is limited (i.e. missing data probability is high), so that the base policy in importance sampling is outdated. To address this issue, we propose a technique called base policy prediction, which utilizes old gradients to predict the policy update and collect samples for a sequence of base policies, which reduces the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples of predicted base policies could be collected within one communication round. Theoretically, we show that our algorithm converges to an \varepsilon‑Nash equilibrium in potential games with only O(\varepsilon^‑3/4) communication rounds and O(poly(\max_i |A_i|)\varepsilon^‑11/4) samples, improving existing state‑of‑the‑art results in communication cost, as well as sample complexity without the exponential dependence on the joint action space size. We also extend these results to general Markov Cooperative Games to find an agent‑wise local maximum. Empirically, we test the base policy prediction algorithm in both simulated games and MAPPO for complex environments.

PaperID: 6232, https://arxiv.org/pdf/2601.12392.pdf

Authors: Zhentao Xia, Yongqi Fan, Yuxiang Chu, Yichao Yin, Liangliang Chen, Tong Ruan, Weiyan Zhang

Title: PsychēChat: An Empathic Framework Focused on Emotion Shift Tracking and Safety Risk Analysis in Psychological Counseling

Abstract:
Large language models (LLMs) have demonstrated notable advancements in psychological counseling. However, existing models generally do not explicitly model seekers' emotion shifts across counseling sessions, a core focus in classical psychological schools. Moreover, how to align counselor models' responses with these emotion shifts while proactively mitigating safety risks remains underexplored. To bridge these gaps, we propose PsychēChat, which explicitly integrates emotion shift tracking and safety risk analysis for psychological counseling. Specifically, we employ interactive role‑playing to synthesize counselor‑‑seeker dialogues, incorporating two modules: Emotion Management Module, to capture seekers' current emotions and emotion shifts; and Risk Control Module, to anticipate seekers' subsequent reactions and identify potential risks. Furthermore, we introduce two modeling paradigms. The Agent Mode structures emotion management, risk control, and counselor responses into a collaborative multi‑agent pipeline. The LLM Mode integrates these stages into a unified chain‑of‑thought for end‑to‑end inference, balancing efficiency and performance. Extensive experiments, including interactive scoring, dialogue‑level evaluation, and human assessment, demonstrate that PsychēChat outperforms existing methods for emotional insight and safety control.

PaperID: 6233, https://arxiv.org/pdf/2601.12348.pdf

Authors: Haris Khan, Sadia Asif

Title: Generative AI Agents for Controllable and Protected Content Creation

Abstract:
The proliferation of generative AI has transformed creative workflows, yet current systems face critical challenges in controllability and content protection. We propose a novel multi‑agent framework that addresses both limitations through specialized agent roles and integrated watermarking mechanisms. Unlike existing multi‑agent systems focused solely on generation quality, our approach uniquely combines controllable content synthesis with provenance protection during the generation process itself. The framework orchestrates Director/Planner, Generator, Reviewer, Integration, and Protection agents with human‑in‑the‑loop feedback to ensure alignment with user intent while embedding imperceptible digital watermarks. We formalize the pipeline as a joint optimization objective unifying controllability, semantic alignment, and protection robustness. This work contributes to responsible generative AI by positioning multi‑agent architectures as a solution for trustworthy creative workflows with built‑in ownership tracking and content traceability.

PaperID: 6234, https://arxiv.org/pdf/2601.12323.pdf

Authors: Yin Cai, Zhouhong Gu, Juntao Zhang, Ping Chen

Title: MARO: Learning Stronger Reasoning from Social Interaction

Abstract:
Humans face countless scenarios that require reasoning and judgment in daily life. However, existing large language model training methods primarily allow models to learn from existing textual content or solve predetermined problems, lacking experience in real scenarios involving interaction, negotiation, and competition with others. To address this, this paper proposes Multi‑Agent Reward Optimization (MARO), a method that enables large language models (LLMs) to acquire stronger reasoning abilities by learning and practicing in multi‑agent social environments. Specifically, MARO first addresses the sparse learning signal problem by decomposing final success or failure outcomes into each specific behavior during the interaction process; second, it handles the uneven role distribution problem by balancing the training sample weights of different roles; finally, it addresses environmental instability issues by directly evaluating the utility of each behavior. Experimental results demonstrate that MARO not only achieves significant improvements in social reasoning capabilities, but also that the abilities acquired through social simulation learning can effectively transfer to other tasks such as mathematical reasoning and instruction following. This reveals the tremendous potential of multi‑agent social learning in enhancing the general reasoning capabilities of LLMs.

PaperID: 6235, https://arxiv.org/pdf/2601.12311.pdf

Authors: Xiaofeng Luo, Jiayi He, Jiawen Kang, Ruichen Zhang, Zhaoshui He, Ekram Hossain, Dong In Kim

Title: Cross-reality Location Privacy Protection in 6G-enabled Vehicular Metaverses: An LLM-enhanced Hybrid Generative Diffusion Model-based Approach

Abstract:
The emergence of 6G‑enabled vehicular metaverses enables Autonomous Vehicles (AVs) to operate across physical and virtual spaces through space‑air‑ground‑sea integrated networks. The AVs can deploy AI agents powered by large AI models as personalized assistants, on edge servers to support intelligent driving decision making and enhanced on‑board experiences. However, such cross‑reality interactions may cause serious location privacy risks, as adversaries can infer AV trajectories by correlating the location reported when AVs request LBS in reality with the location of the edge servers on which their corresponding AI agents are deployed in virtuality. To address this challenge, we design a cross‑reality location privacy protection framework based on hybrid actions, including continuous location perturbation in reality and discrete privacy‑aware AI agent migration in virtuality. In this framework, a new privacy metric, termed cross‑reality location entropy, is proposed to effectively quantify the privacy levels of AVs. Based on this metric, we formulate an optimization problem to optimize the hybrid action, focusing on achieving a balance between location protection, service latency reduction, and quality of service maintenance. To solve the complex mixed‑integer problem, we develop a novel LLM‑enhanced Hybrid Diffusion Proximal Policy Optimization (LHDPPO) algorithm, which integrates LLM‑driven informative reward design to enhance environment understanding with double Generative Diffusion Models‑based policy exploration to handle high‑dimensional action spaces, thereby enabling reliable determination of optimal hybrid actions. Extensive experiments on real‑world datasets demonstrate that the proposed framework effectively mitigates cross‑reality location privacy leakage for AVs while maintaining strong user immersion within 6G‑enabled vehicular metaverse scenarios.

PaperID: 6236, https://arxiv.org/pdf/2601.12307.pdf

Authors: Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding

Title: Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline

Abstract:
Recent advances in LLM‑based multi‑agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single‑LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi‑turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain‑specific reasoning, and real‑world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this finding, we propose OneFlow, an algorithm that automatically tailors workflows for single‑agent execution, reducing inference costs compared to existing automatic multi‑agent design frameworks without trading off accuracy. These results position the single‑LLM implementation of multi‑agent workflows as a strong baseline for MAS research. We also note that single‑LLM methods cannot capture heterogeneous workflows due to the lack of KV cache sharing across different LLMs, highlighting future opportunities in developing truly heterogeneous multi‑agent systems.

PaperID: 6237, https://arxiv.org/pdf/2601.12152.pdf

Authors: Houjiang Liu, Yujin Choi, Sanjana Gautam, Gabriel Jaffe, Soo Young Rieh, Matthew Lease

Title: Who Owns Creativity and Who Does the Work? Trade-offs in LLM-Supported Research Ideation

Abstract:
LLM‑based agents offer new potential to accelerate science and reshape research work. However, the quality of researcher contributions can vary significantly depending on human ability to steer agent behaviors. How can we best use these tools to augment scientific creativity without undermining aspects of contribution and ownership that drive research? To investigate this, we developed an agentic research ideation system integrating three roles ‑‑ Ideator, Writer, and Evaluator ‑‑ across three control levels ‑‑ Low, Medium, and Intensive. Our mixed‑methods study with 54 researchers suggests three key findings in how LLM‑based agents reshape scientific creativity: 1) perceived creativity support does not simply increase linearly with greater control; 2) human effort shifts from ideating to verifying ideas; and 3) ownership becomes a negotiated outcome between human and AI. Our findings suggest that LLM agent design should emphasize researcher empowerment, fostering a sense of ownership over strong ideas rather than reducing researchers to operating an automated AI‑driven process.

PaperID: 6238, https://arxiv.org/pdf/2601.12148.pdf

Authors: Muhammad Umar Zeshan, Motunrayo Ibiyo, Claudio Di Sipio, Phuong T. Nguyen, Davide Di Ruscio

Title: Many Hands Make Light Work: An LLM-based Multi-Agent System for Detecting Malicious PyPI Packages

Abstract:
Malicious code in open‑source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule‑based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models (LLMs) show promise for software analysis, yet their use in interpretable and modular security pipelines remains limited. This paper presents LAMPS, a multi‑agent system that employs collaborative LLMs to detect malicious PyPI packages. The system consists of four role‑specific agents for package retrieval, file extraction, classification, and verdict aggregation, coordinated through the CrewAI framework. A prototype combines a fine‑tuned CodeBERT model for classification with LLaMA‑3 agents for contextual reasoning. LAMPS has been evaluated on two complementary datasets: D1, a balanced collection of 6,000 setup.py files, and D2, a realistic multi‑file dataset with 1,296 files and natural class imbalance. On D1, LAMPS achieves 97.7% accuracy, surpassing MPHunter‑‑one of the state‑of‑the‑art approaches. On D2, it reaches 99.5% accuracy and 99.5% balanced accuracy, outperforming RAG‑based approaches and fine‑tuned single‑agent baselines. McNemar's test confirmed these improvements as highly significant. The results demonstrate the feasibility of distributed LLM reasoning for malicious code detection and highlight the benefits of modular multi‑agent designs in software supply chain security.

PaperID: 6239, https://arxiv.org/pdf/2601.12134.pdf

Authors: Taufiq Daryanto, Xiaohan Ding, Kaike Ping, Lance T. Wilhelm, Yan Chen, Chris Brown, Eugenia H. Rho

Title: Human-Human-AI Triadic Programming: Uncovering the Role of AI Agent and the Value of Human Partner in Collaborative Learning

Abstract:
As AI assistance becomes embedded in programming practice, researchers have increasingly examined how these systems help learners generate code and work more efficiently. However, these studies often position AI as a replacement for human collaboration and overlook the social and learning‑oriented aspects that emerge in collaborative programming. Our work introduces human‑human‑AI (HHAI) triadic programming, where an AI agent serves as an additional collaborator rather than a substitute for a human partner. Through a within‑subjects study with 20 participants, we show that triadic collaboration enhances collaborative learning and social presence compared to the dyadic human‑AI (HAI) baseline. In the triadic HHAI conditions, participants relied significantly less on AI‑generated code in their work. This effect was strongest in the HHAI‑shared condition, where participants had an increased sense of responsibility to understand AI suggestions before applying them. These findings demonstrate how triadic settings activate socially shared regulation of learning by making AI use visible and accountable to a human peer, suggesting that AI systems that augment rather than automate peer collaboration can better preserve the learning processes that collaborative programming relies on.

PaperID: 6240, https://arxiv.org/pdf/2601.12091.pdf

Authors: Qian Tan, Lei Jiang, Yuting Zeng, Shuoyang Ding, Xiaohua Xu

Title: Mitigating Cultural Bias in LLMs via Multi-Agent Cultural Debate

Abstract:
Large language models (LLMs) exhibit systematic Western‑centric bias, yet whether prompting in non‑Western languages (e.g., Chinese) can mitigate this remains understudied. Answering this question requires rigorous evaluation and effective mitigation, but existing approaches fall short on both fronts: evaluation methods force outputs into predefined cultural categories without a neutral option, while mitigation relies on expensive multi‑cultural corpora or agent frameworks that use functional roles (e.g., Planner‑‑Critique) lacking explicit cultural representation. To address these gaps, we introduce CEBiasBench, a Chinese‑‑English bilingual benchmark, and Multi‑Agent Vote (MAV), which enables explicit ``no bias'' judgments. Using this framework, we find that Chinese prompting merely shifts bias toward East Asian perspectives rather than eliminating it. To mitigate such persistent bias, we propose Multi‑Agent Cultural Debate (MACD), a training‑free framework that assigns agents distinct cultural personas and orchestrates deliberation via a "Seeking Common Ground while Reserving Differences" strategy. Experiments demonstrate that MACD achieves 57.6% average No Bias Rate evaluated by LLM‑as‑judge and 86.0% evaluated by MAV (vs. 47.6% and 69.0% baseline using GPT‑4o as backbone) on CEBiasBench and generalizes to the Arabic CAMeL benchmark, confirming that explicit cultural representation in agent frameworks is essential for cross‑cultural fairness.

PaperID: 6241, https://arxiv.org/pdf/2601.12028.pdf

Authors: Kun-Yan Jiang, Wei-Yu Chiu, Yuan-Po Tsai

Title: Profit Maximization for Electric Vehicle Charging Stations Using Multiagent Reinforcement Learning

Abstract:
Electric vehicles (EVs) are increasingly integrated into power grids, offering economic and environmental benefits but introducing challenges due to uncoordinated charging. This study addresses the profit maximization problem for multiple EV charging stations (EVCSs) equipped with energy storage systems (ESS) and renewable energy sources (RES), with the capability for energy trading. We propose a Double Hypernetwork QMIX‑based multi‑agent reinforcement learning (MARL) framework to optimize cooperative energy management under uncertainty in EV demand, renewable generation, and real‑time electricity prices. The framework mitigates overestimation bias in value estimation, enables distributed decision‑making, and incorporates an internal energy trading mechanism. Numerical experiments using real‑world data demonstrate that, compared to standard QMIX, the proposed method achieves approximately 5.3% and 12.7% higher total profit for the two regions, respectively, highlighting its economic and operational efficiency. Additionally, the approach maintains robust performance under varying levels of EV demand uncertainty and renewable energy fluctuations.

PaperID: 6242, https://arxiv.org/pdf/2601.12024.pdf

Authors: Kartikey Singh Bhandari, Tanish Jain, Archit Agrawal, Dhruv Kumar, Praveen Kumar, Pratik Narang

Title: Beyond Sentiment: A Multi-Agent Pipeline for Actionable Business Advice from Reviews

Abstract:
Customer reviews contain valuable signals about service quality, but converting large‑scale review corpora into actionable business recommendations remains difficult. Standard sentiment/aspect analysis is largely descriptive, while direct prompting of large language models (LLMs) often yields generic and repetitive advice that is weakly grounded in user feedback. We propose a hierarchical decision‑support pipeline that explicitly separates signal compression, problem abstraction, candidate generation, objective‑based evaluation, and cost‑aware routing into different agents. This architectural decomposition produces auditable intermediate artifacts and enables controllable trade‑offs between advice quality and token budget. Experiments on Yelp reviews from three service domains show consistent improvements over single‑pass LLM baselines across multiple advice quality dimensions, including actionability, relevance, and non‑redundancy. A human evaluation further indicates that users generally prefer our system's recommendations. These results highlight the value of structured agentic decomposition for scalable, cost‑aware business decision support.

PaperID: 6243, https://arxiv.org/pdf/2601.12003.pdf

Authors: Angel Y. He, David Parker

Title: Robust Verification of Concurrent Stochastic Games

Abstract:
Autonomous systems often operate in multi‑agent settings and need to make concurrent, strategic decisions, typically in uncertain environments. Verification and control problems for these systems can be tackled with concurrent stochastic games (CSGs), but this model requires transition probabilities to be precisely specified ‑ an unrealistic requirement in many real‑world settings. We introduce robust CSGs and their subclass interval CSGs (ICSGs), which capture epistemic uncertainty about transition probabilities in CSGs. We propose a novel framework for robust verification of these models under worst‑case assumptions about transition uncertainty. Specifically, we develop the underlying theoretical foundations and efficient algorithms, for finite‑ and infinite‑horizon objectives in both zero‑sum and nonzero‑sum settings, the latter based on (social‑welfare optimal) Nash equilibria. We build an implementation in the PRISM‑games model checker and demonstrate the feasibility of robust verification of ICSGs across a selection of large benchmarks.

PaperID: 6244, https://arxiv.org/pdf/2601.11979.pdf

Authors: Ang Gao, Changshuo Zhang, Xiao Zhang, Deyang Li, Minjun Zhao, Fangchao Liu, Xinyu Zhang

Title: Process In-Context Learning: Enhancing Mathematical Reasoning via Dynamic Demonstration Insertion

Abstract:
In‑context learning (ICL) has proven highly effective across diverse large language model (LLM) tasks. However, its potential for enhancing tasks that demand step‑by‑step logical deduction, such as mathematical reasoning, remains underexplored. A core limitation of existing ICL approaches is their static use of demonstrations: examples are pre‑selected before inference and remain fixed, failing to adapt to the dynamic confusion points that often arise during multi‑step reasoning such as ambiguous calculations or logical gaps. These unresolved confusion points can lead to cascading errors that degrade final accuracy. To tackle this issue, we propose Process In‑Context Learning (PICL), a dynamic demonstration integration framework designed to boost mathematical reasoning by responding to real‑time inference needs. PICL operates in two stages: 1)~it identifies potential confusion points by analyzing semantics and entropy in the reasoning process and summarizes their core characteristics; 2)~upon encountering these points, it retrieves relevant demonstrations from the demonstration pool that match the confusion context and inserts them directly into the ongoing reasoning process to guide subsequent steps. Experiments show that PICL outperforms baseline methods by mitigating mid‑inference confusion, highlighting the value of adaptive demonstration insertion in complex mathematical reasoning.

PaperID: 6245, https://arxiv.org/pdf/2601.11968.pdf

Authors: Qihao Zhao, Yunqi Cao, Yangyu Huang, Hui Yi Leong, Fan Zhang, Kim-Hui Yap, Wei Hu

Title: MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio

Abstract:
Despite recent advances in multimodal large language models (MLLMs), their ability to understand and interact with music remains limited. Music understanding requires grounded reasoning over symbolic scores and expressive performance audio, which general‑purpose MLLMs often fail to handle due to insufficient perceptual grounding. We introduce MuseAgent, a music‑centric multimodal agent that augments language models with structured symbolic representations derived from sheet music images and performance audio. By integrating optical music recognition and automatic music transcription modules, MuseAgent enables multi‑step reasoning and interaction over fine‑grained musical content. To systematically evaluate music understanding capabilities, we further propose MuseBench, a benchmark covering music theory reasoning, score interpretation, and performance‑level analysis across text, image, and audio modalities. Experiments show that existing MLLMs perform poorly on these tasks, while MuseAgent achieves substantial improvements, highlighting the importance of structured multimodal grounding for interactive music understanding.

PaperID: 6246, https://arxiv.org/pdf/2601.11957.pdf

Authors: Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur, Heng Ji

Title: PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Abstract:
Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference‑driven decision process as calendar conflict resolution. Automating this decision process is crucial yet challenging. Scheduling logistics can drain hours, and human delegation often fails at scale, which motivates us to ask: Can we trust large language models (LLMs) or language agents to manage time? To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long‑horizon calendar conflict resolution. In CalConflictBench, conflicts are presented to agents round‑by‑round over a calendar year, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen‑3‑30B‑Think has an average error rate of 35%. To address this gap, we propose PEARL, a reinforcement‑learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance, time/location preferences), and (ii) optimizes the agent with round‑wise rewards that directly supervise decision correctness, ranking quality, and memory usage across rounds. Experiments on CalConflictBench show that PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate compared to the strongest baseline.

PaperID: 6247, https://arxiv.org/pdf/2601.11913.pdf

Authors: Yichen Jiang, Jiakang Yuan, Chongjun Tu, Peng Ye, Tao Chen

Title: LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

Abstract:
Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single‑LLM‑based methods primarily reduce the context window or optimize the attention mechanism, but they often encounter additional computational costs or constrained expanded context length. While multi‑agent‑based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short‑Term Memory (LSTM) architecture to design a Multi‑Agent System called LSTM‑MAS, emulating LSTM's hierarchical information flow and gated memory mechanisms for long‑context understanding. Specifically, LSTM‑MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment‑level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent for globally regulates information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long‑term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi‑agent approach, CoA, our model achieves improvements of 97.97%, 65.75%, 122.19%, 39.61% and 10.80% on Narrative QA, Qasper, HotpotQA, 2WikiMQA and MuSiQue, respectively.

PaperID: 6248, https://arxiv.org/pdf/2601.11903.pdf

Authors: YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan

Title: AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

Abstract:
Evaluating large language model (LLM)‑based multi‑agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision‑making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single‑response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi‑agent scale. We present AEMA (Adaptive Evaluation Multi‑Agent), a process‑aware and auditable framework that plans, executes, and aggregates multi‑step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM‑as‑a‑Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise‑style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM‑based multi‑agent systems. Keywords Agentic AI, Multi‑Agent Systems, Trustworthy AI, Verifiable Evaluation, Human Oversight

PaperID: 6249, https://arxiv.org/pdf/2601.11893.pdf

Authors: Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, Yudong Gao, Shuai Wang, Yingjiu Li

Title: Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Abstract:
Large Language Model (LLM)‑based agent systems are increasingly deployed for complex real‑world tasks but remain vulnerable to natural language‑based attacks that exploit over‑privileged tool use. This paper aims to understand and mitigate such attacks through the lens of privilege escalation, defined as agent actions exceeding the least privilege required for a user's intended task. Based on a formal model of LLM agent systems, we identify novel privilege escalation scenarios, particularly in multi‑agent systems, including a variant akin to the classic confused deputy problem. To defend against both known and newly demonstrated privilege escalation, we propose SEAgent, a mandatory access control (MAC) framework built upon attribute‑based access control (ABAC). SEAgent monitors agent‑tool interactions via an information flow graph and enforces customizable security policies based on entity attributes. Our evaluations show that SEAgent effectively blocks various privilege escalation while maintaining a low false positive rate and negligible system overhead. This demonstrates its robustness and adaptability in securing LLM‑based agent systems.

PaperID: 6250, https://arxiv.org/pdf/2601.11816.pdf

Authors: Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan

Title: POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation

Abstract:
Enterprise back office workflows require agentic systems that are auditable, policy‑aligned, and operationally predictable, capabilities that generic multi‑agent setups often fail to deliver. We present POLARIS (Policy‑Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type checked directed acyclic graphs (DAGs), a rubric guided reasoning module selects a single compliant plan, and execution is guarded by validator gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document centric finance tasks, POLARIS produces decision grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy‑aligned Agentic AI. Keywords Agentic AI, Enterprise Automation, Back‑Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation

PaperID: 6251, https://arxiv.org/pdf/2601.11809.pdf

Authors: Zeyu Mu, Shangtong Zhang, B. Brian Park

Title: Multi-agent DRL-based Lane Change Decision Model for Cooperative Planning in Mixed Traffic

Abstract:
Connected automated vehicles (CAVs) possess the ability to communicate and coordinate with one another, enabling cooperative platooning that enhances both energy efficiency and traffic flow. However, during the initial stage of CAV deployment, the sparse distribution of CAVs among human‑driven vehicles reduces the likelihood of forming effective cooperative platoons. To address this challenge, this study proposes a hybrid multi‑agent lane change decision model aimed at increasing CAV participation in cooperative platooning and maximizing its associated benefits. The proposed model employs the QMIX framework, integrating traffic data processed through a convolutional neural network (CNN‑QMIX). This architecture addresses a critical issue in dynamic traffic scenarios by enabling CAVs to make optimal decisions irrespective of the varying number of CAVs present in mixed traffic. Additionally, a trajectory planner and a model predictive controller are designed to ensure smooth and safe lane‑change execution. The proposed model is trained and evaluated within a microsimulation environment under varying CAV market penetration rates. The results demonstrate that the proposed model efficiently manages fluctuating traffic agent numbers, significantly outperforming the baseline rule‑based models. Notably, it enhances cooperative platooning rates up to 26.2%, showcasing its potential to optimize CAV cooperation and traffic dynamics during the early stage of deployment.

PaperID: 6252, https://arxiv.org/pdf/2601.11750.pdf

Authors: Mo Houtti, Moyan Zhou, Daniel Runningen, Surabhi Sunil, Leor Porat, Harmanpreet Kaur, Loren Terveen, Stevie Chancellor

Title: Opportunities and Barriers for AI Feedback on Meeting Inclusion in Socioorganizational Teams

Abstract:
Inclusion is important for meeting effectiveness, which is in turn central to organizational functioning. One way of improving inclusion in meetings is through feedback, but social dynamics make giving feedback difficult. We propose that AI agents can facilitate feedback exchange by being psychologically safer recipients, and we test this through a meeting system with an AI agent feedback mediator. When delivering feedback, the agent uses the Induced Hypocrisy Procedure, a social psychological technique that prompts behavior change by highlighting value‑behavior inconsistencies. In a within‑subjects lab study (n=28), the agent made speaking times more balanced and improved meeting quality. However, a field study at a small consulting firm (n=10) revealed organizational barriers that led to its use for personal reflection rather than feedback exchange. We contribute a novel sociotechnical system for feedback exchange in groups, and empirical findings demonstrating the importance of considering organizational barriers in designing AI tools for organizations.

PaperID: 6253, https://arxiv.org/pdf/2601.11688.pdf

Authors: Vedant Nipane, Pulkit Agrawal, Amit Singh

Title: SpecMap: Hierarchical LLM Agent for Datasheet-to-Code Traceability Link Recovery in Systems Engineering

Abstract:
Establishing precise traceability between embedded systems datasheets and their corresponding code implementations remains a fundamental challenge in systems engineering, particularly for low‑level software where manual mapping between specification documents and large code repositories is infeasible. Existing Traceability Link Recovery approaches primarily rely on lexical similarity and information retrieval techniques, which struggle to capture the semantic, structural, and symbol level relationships prevalent in embedded systems software. We present a hierarchical datasheet‑to‑code mapping methodology that employs large language models for semantic analysis while explicitly structuring the traceability process across multiple abstraction levels. Rather than performing direct specification‑to‑code matching, the proposed approach progressively narrows the search space through repository‑level structure inference, file‑level relevance estimation, and fine‑grained symbollevel alignment. The method extends beyond function‑centric mapping by explicitly covering macros, structs, constants, configuration parameters, and register definitions commonly found in systems‑level C/C++ codebases. We evaluate the approach on multiple open‑source embedded systems repositories using manually curated datasheet‑to‑code ground truth. Experimental results show substantial improvements over traditional information‑retrieval‑based baselines, achieving up to 73.3% file mapping accuracy. We significantly reduce computational overhead, lowering total LLM token consumption by 84% and end‑to‑end runtime by approximately 80%. This methodology supports automated analysis of large embedded software systems and enables downstream applications such as training data generation for systems‑aware machine learning models, standards compliance verification, and large‑scale specification coverage analysis.

PaperID: 6254, https://arxiv.org/pdf/2601.11687.pdf

Authors: Harmohit Singh

Title: Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems

Abstract:
We present a production‑optimized multi‑agent system designed to translate natural language queries into executable Python code for structured data analytics. Unlike systems that rely on expensive frontier models, our approach achieves high accuracy and cost efficiency through three key innovations: (1) a semantic caching system with LLM‑based equivalence detection and structured adaptation hints that provides cache hit rates of 67% on production queries; (2) a dual‑threshold decision mechanism that separates exact‑match retrieval from reference‑guided generation; and (3) an intent‑driven dynamic prompt assembly system that reduces token consumption by 40‑60% through table‑aware context filtering. The system has been deployed in production for enterprise inventory management, processing over 10,000 queries with an average latency of 8.2 seconds and 94.3% semantic accuracy. We describe the architecture, present empirical results from production deployment, and discuss practical considerations for deploying LLM‑based analytics systems at scale.

PaperID: 6255, https://arxiv.org/pdf/2601.11658.pdf

Authors: Indrajit Kar, Sammy Zonunpuia, Zonunfeli Ralte

Title: Towards AGI A Pragmatic Approach Towards Self Evolving Agent

Abstract:
Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self‑evolving multi‑agent framework that integrates a Base LLM, an operational SLM agent, a Code‑Generation LLM, and a Teacher‑LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code‑Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward‑Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset rich in hierarchical tasks, tool‑use traces, and difficulty scaling we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high‑difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self‑improving agentic evolution.

PaperID: 6256, https://arxiv.org/pdf/2601.11650.pdf

Authors: Jingkang Liang, Niklas Groll, Gürkan Sin

Title: Large Language Model Agent for User-friendly Chemical Process Simulations

Abstract:
Modern process simulators enable detailed process design, simulation, and optimization; however, constructing and interpreting simulations is time‑consuming and requires expert knowledge. This limits early exploration by inexperienced users. To address this, a large language model (LLM) agent is integrated with AVEVA Process Simulation (APS) via Model Context Protocol (MCP), allowing natural language interaction with rigorous process simulations. An MCP server toolset enables the LLM to communicate programmatically with APS using Python, allowing it to execute complex simulation tasks from plain‑language instructions. Two water‑methanol separation case studies assess the framework across different task complexities and interaction modes. The first shows the agent autonomously analyzing flowsheets, finding improvement opportunities, and iteratively optimizing, extracting data, and presenting results clearly. The framework benefits both educational purposes, by translating technical concepts and demonstrating workflows, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting brainstorming. The second case study assesses autonomous flowsheet synthesis through both a step‑by‑step dialogue and a single prompt, demonstrating its potential for novices and experts alike. The step‑by‑step mode gives reliable, guided construction suitable for educational contexts; the single‑prompt mode constructs fast baseline flowsheets for later refinement. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework's capabilities in analysis, optimization, and guided construction suggest LLM‑based agents can become valuable collaborators.

PaperID: 6257, https://arxiv.org/pdf/2601.11634.pdf

Authors: Chenghui Yu, Hongwei Wang, Junwen Chen, Zixuan Wang, Bingfeng Deng, Zhuolin Hao, Hongyu Xiong, Yang Song

Title: When Rules Fall Short: Agent-Driven Discovery of Emerging Content Issues in Short Video Platforms

Abstract:
Trends on short‑video platforms evolve at a rapid pace, with new content issues emerging every day that fall outside the coverage of existing annotation policies. However, traditional human‑driven discovery of emerging issues is too slow, which leads to delayed updates of annotation policies and poses a major challenge for effective content governance. In this work, we propose an automatic issue discovery method based on multimodal LLM agents. Our approach automatically recalls short videos containing potential new issues and applies a two‑stage clustering strategy to group them, with each cluster corresponding to a newly discovered issue. The agent then generates updated annotation policies from these clusters, thereby extending coverage to these emerging issues. Our agent has been deployed in the real system. Both offline and online experiments demonstrate that this agent‑based method significantly improves the effectiveness of emerging‑issue discovery (with an F1 score improvement of over 20%) and enhances the performance of subsequent issue governance (reducing the view count of problematic videos by approximately 15%). More importantly, compared to manual issue discovery, it greatly reduces time costs and substantially accelerates the iteration of annotation policies.

PaperID: 6258, https://arxiv.org/pdf/2601.11595.pdf

Authors: Meenakshi Amulya Jayanti, X. Y. Han

Title: Enhancing Model Context Protocol (MCP) with Context-Aware Server Collaboration

Abstract:
The Model Context Protocol (MCP) (MCP Community, 2025) has emerged as a widely used framework for enabling LLM‑based agents to communicate with external tools and services. The original MCP implementation (Anthropic, 2024) relies on a Large Language Model (LLM) to decompose tasks and issue instructions to servers. In particular, the agents, models, and servers are stateless and do not have access to a global context. However, in tasks involving LLM‑driven coordination, it is natural that a Shared Context Store (SCS) could improve the efficiency and coherence of multi‑agent workflows by reducing redundancy and enabling knowledge transfer between servers. Thus, in this work, we design and assess the performance of a Context‑Aware MCP (CA‑MCP) that offloads execution logic to specialized MCP servers that read from and write to a shared context memory, allowing them to coordinate more autonomously in real time. In this design, context management serves as the central mechanism that maintains continuity across task executions by tracking intermediate states and shared variables, thereby enabling persistent collaboration among agents without repeated prompting. We present experiments showing that the CA‑MCP can outperform the traditional MCP by reducing the number of LLM calls required for complex tasks and decreasing the frequency of response failures when task conditions are not satisfied. In particular, we conducted experiments on the TravelPlanner (Yang et al., 2024) and REALM‑Bench (Geng & Chang, 2025) benchmark datasets and observed statistically significant results indicating the potential advantages of incorporating a shared context store via CA‑MCP in LLM‑driven multi‑agent systems.

PaperID: 6259, https://arxiv.org/pdf/2601.11587.pdf

Authors: Yuyan Huang, Haoran Li, Yifan Lu, Ruolin Wu, Siqian Chen, Chao Liu

Title: Evidence-Grounded Multi-Agent Planning Support for Urban Carbon Governance via RAG

Abstract:
Urban carbon governance requires planners to integrate heterogeneous evidence ‑‑ emission inventories, statistical yearbooks, policy texts, technical measures, and academic findings ‑‑ into actionable, cross‑departmental plans. Large Language Models (LLMs) can assist planning workflows, yet their factual reliability and evidential traceability remain critical barriers in professional use. This paper presents an evidence‑grounded multi‑agent planning support system for urban carbon governance built upon standard text‑based Retrieval‑Augmented Generation (RAG) (without GraphRAG). We align the system with the typical planning workflow by decomposing tasks into four specialized agents: (i) evidence Q\&A for fact checking and compliance queries, (ii) emission status assessment for diagnostic analysis, (iii) planning recommendation for generating multi‑sector governance pathways, and (iv) report integration for producing planning‑style deliverables. We evaluate the system in two task families: factual retrieval and comprehensive planning generation. On factual retrieval tasks, introducing RAG increases the average score from below 6 to above 90, and dramatically improves key‑field extraction (e.g., region and numeric values near 100% detection). A real‑city case study (Ningbo, China) demonstrates end‑to‑end report generation with strong relevance, coverage, and coherence in expert review, while also highlighting boundary inconsistencies across data sources as a practical limitation.

PaperID: 6260, https://arxiv.org/pdf/2601.11585.pdf

Authors: Hyunjun Kim

Title: Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

Abstract:
Context engineering for large language model (LLM) agents requires distinguishing pragmatically useful information from misleading distractors. We introduce Entropic Context Shaping (ECS), an information‑theoretic framework that measures context utility via the shift in the model's answer distribution toward the correct answer. Unlike lexical similarity methods that rely on word overlap, ECS captures pragmatic utility ‑‑ whether a passage actually helps answer the question. We formalize utility as the signed change in answer probability and provide theoretical analysis showing that task‑irrelevant updates yield near‑zero distribution shift. We evaluate on multi‑turn context selection tasks using LongMemEval (session‑level) and LoCoMo (turn‑level) benchmarks. On fine‑grained turn selection, ECS with Llama‑3.1‑8B achieves F1=0.265, a 71.83% relative improvement over TF‑IDF (F1=0.154), demonstrating that pragmatic utility outperforms lexical similarity when precise context selection matters. Code and data are available in the supplementary materials.

PaperID: 6261, https://arxiv.org/pdf/2601.11583.pdf

Authors: Xing Yang

Title: Bit-politeia: An AI Agent Community in Blockchain

Abstract:
Current resource allocation paradigms, particularly in academic evaluation, are constrained by inherent limitations such as the Matthew Effect, reward hacking driven by Goodhart's Law, and the trade‑off between efficiency and fairness. To address these challenges, this paper proposes "Bit‑politeia", an AI agent community on blockchain designed to construct a fair, efficient, and sustainable resource allocation system. In this virtual community, residents interact via AI agents serving as their exclusive proxies, which are optimized for impartiality and value alignment. The community adopts a "clustered grouping + hierarchical architecture" that integrates democratic centralism to balance decision‑making efficiency and trust mechanisms. Agents engage through casual chat and deliberative interactions to evaluate research outputs and distribute a virtual currency as rewards. This incentive mechanism aims to achieve incentive compatibility through consensus‑driven evaluation, while blockchain technology ensures immutable records of all transactions and reputation data. By leveraging AI for objective assessment and decentralized verification, Bit‑politeia minimizes human bias and mitigates resource centralization issues found in traditional peer review. The proposed framework provides a novel pathway for optimizing scientific innovation through a fair and automated resource configuration process.

PaperID: 6262, https://arxiv.org/pdf/2601.11578.pdf

Authors: Ibrahim Al Azher, Zhishuai Guo, Hamed Alhoori

Title: Multi-Agent LLMs for Generating Research Limitations

Abstract:
Identifying and articulating limitations is essential for transparent and rigorous scientific research. However, zero‑shot large language models (LLMs) approach often produce superficial or general limitation statements (e.g., dataset bias or generalizability). They usually repeat limitations reported by authors without looking at deeper methodological issues and contextual gaps. This problem is made worse because many authors disclose only partial or trivial limitations. We propose, a multi‑agent LLM framework for generating substantive limitations. It integrates OpenReview comments and author‑stated limitations to provide stronger ground truth. It also uses cited and citing papers to capture broader contextual weaknesses. In this setup, different agents have specific roles as sequential role: some extract explicit limitations, others analyze methodological gaps, some simulate the viewpoint of a peer reviewer, and a citation agent places the work within the larger body of literature. A Judge agent refines their outputs, and a Master agent consolidates them into a clear set. This structure allows for systematic identification of explicit, implicit, peer review‑focused, and literature‑informed limitations. Moreover, traditional NLP metrics like BLEU, ROUGE, and cosine similarity rely heavily on n‑gram or embedding overlap. They often overlook semantically similar limitations. To address this, we introduce a pointwise evaluation protocol that uses an LLM‑as‑a‑Judge to measure coverage more accurately. Experiments show that our proposed model substantially improve performance. The RAG + multi‑agent GPT‑4o mini configuration achieves a +15.51% coverage gain over zero‑shot baselines, while the Llama 3 8B multi‑agent setup yields a +4.41% improvement.

PaperID: 6263, https://arxiv.org/pdf/2601.11560.pdf

Authors: Zifeng Wang, Zheng Chen, Ziwei Yang, Xuan Wang, Qiao Jin, Yifan Peng, Zhiyong Lu, Jimeng Sun

Title: DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research

Abstract:
Biomedical knowledge graphs (KGs) encode vast, heterogeneous information spanning literature, genes, pathways, drugs, diseases, and clinical trials, but leveraging them collectively for scientific discovery remains difficult. Their structural differences, continual evolution, and limited cross‑resource alignment require substantial manual integration, limiting the depth and scale of knowledge exploration. We introduce DeepEvidence, an AI‑agent framework designed to perform Deep Research across various heterogeneous biomedical KGs. Unlike generic Deep Research systems that rely primarily on internet‑scale text, DeepEvidence incorporates specialized knowledge‑graph tooling and coordinated exploration strategies to systematically bridge heterogeneous resources. At its core is an orchestrator that directs two complementary agents: Breadth‑First ReSearch (BFRS) for broad, multi‑graph entity search, and Depth‑First ReSearch (DFRS) for multi‑hop, evidence‑focused reasoning. An internal, incrementally built evidence graph provides a structured record of retrieved entities, relations, and supporting evidence. To operate at scale, DeepEvidence includes unified interfaces for querying diverse biomedical APIs and an execution sandbox that enables programmatic data retrieval, extraction, and analysis. Across established deep‑reasoning benchmarks and four key stages of the biomedical discovery lifecycle: drug discovery, pre‑clinical experimentation, clinical trial development, and evidence‑based medicine, DeepEvidence demonstrates substantial gains in systematic exploration and evidence synthesis. These results highlight the potential of knowledge‑graph‑driven Deep Research to accelerate biomedical discovery.

PaperID: 6264, https://arxiv.org/pdf/2601.11475.pdf

Authors: Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai

Title: Generative Scenario Rollouts for End-to-End Autonomous Driving

Abstract:
Vision‑Language‑Action (VLA) models are emerging as highly effective planning models for end‑to‑end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under‑utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug‑and‑play framework for VLA models that jointly performs planning and generation of language‑grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text‑aligned generation. Next, GeRo performs language‑conditioned autoregressive generation. Given multi‑view images, a scenario description, and ego‑action questions, it generates future latent tokens and textual responses to guide long‑horizon rollouts. A rollout‑consistency loss stabilizes predictions using ground truth or pseudo‑labels, mitigating drift and preserving text‑action alignment. This design enables GeRo to perform temporally consistent, language‑grounded rollouts that support long‑horizon reasoning and multi‑agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state‑of‑the‑art closed‑loop and open‑loop performance, demonstrating strong zero‑shot robustness. These results highlight the promise of generative, language‑conditioned reasoning as a foundation for safer and more interpretable end‑to‑end autonomous driving.

PaperID: 6265, https://arxiv.org/pdf/2601.11401.pdf

Authors: Ahmed Rashwan, Keith Briggs, Chris Budd, Lisa Kreusser

Title: Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning

Abstract:
Credit assignment is a core challenge in multi‑agent reinforcement learning (MARL), especially in large‑scale systems with structured, local interactions. Graph‑based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per‑agent learning signals, while existing local constructions can be difficult to estimate and ill‑behaved in infinite‑horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well‑defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop‑in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message‑passing actor, Learned DropEdge GNN (LD‑GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.

PaperID: 6266, https://arxiv.org/pdf/2601.11369.pdf

Authors: Marcantonio Bracale Syrnikov, Federico Pierucci, Marcello Galisai, Matteo Prandi, Piercosma Bisconti, Francesco Giarrusso, Olga Sorokoletova, Vincenzo Suriani, Daniele Nardi

Title: Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

Abstract:
Multi‑agent LLM ensembles can converge on coordinated, socially harmful equilibria. This paper advances an experimental framework for evaluating Institutional AI, our system‑level approach to AI alignment that reframes alignment from preference engineering in agent‑space to mechanism design in institution‑space. Central to this approach is the governance graph, a public, immutable manifest that declares legal states, transitions, sanctions, and restorative paths; an Oracle/Controller runtime interprets this manifest, attaching enforceable consequences to evidence of coordination while recording a cryptographically keyed, append‑only governance log for audit and provenance. We apply the Institutional AI framework to govern the Cournot collusion case documented by prior work and compare three regimes: Ungoverned (baseline incentives from the structure of the Cournot market), Constitutional (a prompt‑only policy‑as‑prompt prohibition implemented as a fixed written anti‑collusion constitution, and Institutional (governance‑graph‑based). Across six model configurations including cross‑provider pairs (N=90 runs/condition), the Institutional regime produces large reductions in collusion: mean tier falls from 3.1 to 1.8 (Cohen's d=1.28), and severe‑collusion incidence drops from 50% to 5.6%. The prompt‑only Constitutional baseline yields no reliable improvement, illustrating that declarative prohibitions do not bind under optimisation pressure. These results suggest that multi‑agent alignment may benefit from being framed as an institutional design problem, where governance graphs can provide a tractable abstraction for alignment‑relevant collective behavior.

PaperID: 6267, https://arxiv.org/pdf/2601.11327.pdf

Authors: Agata Żywot, Xinyi Chen, Yifei Yuan, Anders Søgaard, Maarten de Rijke

Title: Can Small Agents Collaborate to Beat a Single Large Language Model?

Abstract:
Recent progress in language modeling has largely relied on scaling model size, yet larger models do not reliably improve performance on tasks requiring multi‑step reasoning and tool use. Multi‑agent collaboration offers a potential alternative, raising a key question: can well‑organized systems built from smaller models outperform much larger language models? We address this question using a minimally designed multi‑agent system with a single orchestrator and a small set of specialized sub‑agents with restricted communication. On tool‑intensive benchmarks spanning factual retrieval, multi‑hop reasoning, scientific question answering, and mathematical problem solving, we conduct controlled comparisons between small multi‑agent systems and large single‑agent models. We find that small multi‑agent systems can outperform substantially larger single‑agent models, even when the latter have direct access to tools. Reasoning at the orchestrator yields the largest gains, while enabling reasoning in sub‑agents provides limited or negative benefits. Overall system performance is driven primarily by orchestrator capacity rather than sub‑agent capacity. These results suggest that improved agentic performance depends more on architectural orchestration than on raw model scaling.

PaperID: 6268, https://arxiv.org/pdf/2601.11323.pdf

Authors: Botao Zhu, Xianbin Wang

Title: Composite and Staged Trust Evaluation for Multi-Hop Collaborator Selection

Abstract:
Multi‑hop collaboration offers new perspectives for enhancing task execution efficiency by increasing available distributed collaborators for resource sharing. Consequently, selecting trustworthy collaborators becomes critical for realizing effective multi‑hop collaboration. However, evaluating device trust requires the consideration of multiple factors, including relatively stable factors, such as historical interaction data, and dynamic factors, such as varying resources and network conditions. This differentiation makes it challenging to achieve the accurate evaluation of composite trust factors using one identical evaluation approach. To address this challenge, this paper proposes a composite and staged trust evaluation (CSTE) mechanism, where stable and dynamic factors are separately evaluated at different stages and then integrated for a final trust decision. First, a device interaction graph is constructed from stable historical interaction data to represent direct trust relationships between devices. A graph neural network framework is then used to propagate and aggregate these trust relationships to produce the historical trustworthiness of devices. In addition, a task‑specific trust evaluation method is developed to assess the dynamic resources of devices based on task requirements, which generates the task‑specific resource trustworthiness of devices. After these evaluations, CSTE integrates their results to identify devices within the network topology that satisfy the minimum trust thresholds of tasks. These identified devices then establish a trusted topology. Finally, within this trusted topology, an A search algorithm is employed to construct a multi‑hop collaboration path that satisfies the task requirements. Experimental results demonstrate that CSTE outperforms the comparison algorithms in identifying paths with the highest average trust values.

PaperID: 6269, https://arxiv.org/pdf/2601.11252.pdf

Authors: Qianyue Wang, Jinwu Hu, Yufeng Wang, Huanxiang Lin, Bolin Chen, Zhiquan Wen, Yaofo Chen, Mingkui Tan

Title: Beyond Model Scaling: Test-Time Intervention for Efficient Deep Reasoning

Abstract:
Large Reasoning Models (LRMs) excel at multi‑step reasoning but often suffer from inefficient reasoning processes like overthinking and overshoot, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed‑loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think‑with‑Me, a novel test‑time interactive reasoning paradigm that introduces external feedback intervention into the reasoning process. Our key insights are that transitional conjunctions serve as natural points for intervention, signaling phases of self‑validation or exploration and using transitional words appropriately to prolong the reasoning enhances performance, while excessive use affects performance. Building on these insights, Think‑with‑Me pauses reasoning at these points for external feedback, adaptively extending or terminating reasoning to reduce redundancy while preserving accuracy. The feedback is generated via a multi‑criteria evaluation (rationality and completeness) and comes from either human or LLM proxies. We train the target model using Group Relative Policy Optimization (GRPO) to adapt to this interactive mode. Experiments show that Think‑with‑Me achieves a superior balance between accuracy and reasoning length under limited context windows. On AIME24, Think‑with‑Me outperforms QwQ‑32B by 7.19% in accuracy while reducing average reasoning length by 81% under an 8K window. The paradigm also benefits security and creative tasks.

PaperID: 6270, https://arxiv.org/pdf/2601.11147.pdf

Authors: Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng

Title: Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

Abstract:
Multi‑Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generates workflows either at task level or query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query‑level workflow generation is not always necessary, since a small set of top‑K best task‑level workflows together already covers equivalent or even more queries. We further find that exhaustive execution‑based task‑level evaluation is both extremely token‑costly and frequently unreliable. Inspired by the idea of self‑evolution and generative reward modeling, we propose a low‑cost task‑level generation framework SCALE, which means \underlineSelf prediction of the optimizer with few shot \underlineCALibration for \underlineEvaluation instead of full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to existing approach across multiple datasets, while cutting overall token usage by up to 83%.

PaperID: 6271, https://arxiv.org/pdf/2601.11007.pdf

Authors: Zhenhua Xu, Dongsheng Chen, Shuo Wang, Jian Li, Chengjie Wang, Meng Han, Yabiao Wang

Title: AdaMARP: An Adaptive Multi-Agent Interaction Framework for General Immersive Role-Playing

Abstract:
LLM role‑playing aims to portray arbitrary characters in interactive narratives, yet existing systems often suffer from limited immersion and adaptability. They typically under‑model dynamic environmental information and assume largely static scenes and casts, offering insufficient support for multi‑character orchestration, scene transitions, and on‑the‑fly character introduction. We propose an adaptive multi‑agent role‑playing framework, AdaMARP, featuring an immersive message format that interleaves [Thought], (Action), <Environment>, and Speech, together with an explicit Scene Manager that governs role‑playing through discrete actions (init_scene, pick_speaker, switch_scene, add_role, end) accompanied by rationales. To train these capabilities, we construct AdaRPSet for the Actor Model and AdaSMSet for supervising orchestration decisions, and introduce AdaptiveBench for trajectory‑level evaluation. Experiments across multiple backbones and model scales demonstrate consistent improvements: AdaRPSet enhances character consistency, environment grounding, and narrative coherence, with an 8B actor outperforming several commercial LLMs, while AdaSMSet enables smoother scene transitions and more natural role introductions, surpassing Claude Sonnet 4.5 using only a 14B LLM.

PaperID: 6272, https://arxiv.org/pdf/2601.10970.pdf

Authors: Canwen Wang, Angela Chen, Catherine Bao, Siwei Jin, Holly Swartz, Tongshuang Wu, Robert E Kraut, Haiyi Zhu

Title: Simulating Couple Conflict: Designing A Multi-Agent System for Therapy Training and Practice

Abstract:
Couples therapy requires managing complex, evolving emotional dynamics between partners, but traditional training methods for therapists, like role‑play, lack realism, consistency, and control. We present a multi‑modal simulation that models therapy as a controlled, multi‑agent dynamical system with structured interaction stages. Therapists practice with a pair of client‑agents who go through six evolving stages that respond to therapist actions. This simulation enables practice with demand‑withdraw conflict patterns in a closed‑loop environment. The simulation uses a sense‑plan‑act architecture: it detects the therapist's input, updates agents' interaction states based on psychotherapy theory and transcript analysis, and generates realistic verbal and emotional responses. In an experiment with 21 licensed U.S. therapists, participants more accurately identified state transitions and rated the system as more realistic and responsive than a prompt‑based baseline, demonstrating the value of stateful, interpretable simulation for therapist training.

PaperID: 6273, https://arxiv.org/pdf/2601.10958.pdf

Authors: Christo Kurisummoottil Thomas, Mingzhe Chen

Title: Fundamental Limits of Quantum Semantic Communication via Sheaf Cohomology

Abstract:
Semantic communication (SC) enables bandwidth‑efficient coordination in multi‑agent systems by transmitting meaning rather than raw bits. However, when agents employ heterogeneous sensing modalities and AI architectures, perfect bit‑level transmission no longer guarantees mutual understanding. Although deep learning methods for semantic compression have advanced, the information‑theoretic limits of semantic alignment under heterogeneity remain poorly understood. Notably, semantic ambiguity shares the same mathematical structure as quantum contextuality, as both arise from cohomological obstructions, motivating a quantum formulation of SC. In this paper, an information‑theoretic framework for quantum semantic communication is proposed using sheaf cohomology. Multi‑agent semantic networks are modeled as quantum sheaves, where agents meaning spaces are Hilbert spaces connected by quantum channels. The first sheaf cohomology group is shown to characterize irreducible semantic ambiguity, representing a fundamental obstruction to alignment that no local processing can resolve. The minimum communication rate required for semantic alignment is proven to scale with the logarithm of the dimension of the cohomological space, establishing a semantic analog of Shannon limits. For entanglement‑assisted channels, the achievable capacity is shown to strictly exceed classical bounds, with each shared ebit reducing the required classical communication by one bit, providing a rigorous interpretation of shared context. Additionally, quantum contextuality is shown to reduce cohomological obstructions, and a duality between quantum discord and integrated semantic information is established, linking quantum correlations to irreducible semantic content. This framework provides rigorous foundations for quantum‑enhanced semantic communication in autonomous multi‑agent systems.

PaperID: 6274, https://arxiv.org/pdf/2601.10955.pdf

Authors: Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong, Yuji Wang, Xuanye Zhang, Kwok-Yan Lam

Title: Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Abstract:
The agent‑‑tool interaction loop is a critical attack surface for modern Large Language Model (LLM) agents. Existing denial‑of‑service (DoS) attacks typically function at the user‑prompt or retrieval‑augmented generation (RAG) context layer and are inherently single‑turn in nature. This limitation restricts cost amplification and diminishes stealth in goal‑oriented workflows. To address these issues, we proposed a stealthy, multi‑turn economic DoS attack at the tool layer under the Model Context Protocol (MCP). By simply editing text‑visible fields and implementing a template‑driven return policy, our malicious server preserves function signatures and the terminal benign payload while steering agents into prolonged, verbose tool‑calling chains. We optimize these text‑only edits with Monte Carlo Tree Search (MCTS) to maximize cost under a task‑success constraint. Across six LLMs on ToolBench and BFCL benchmarks, our attack yields trajectories over 60K tokens, increases per‑query cost by up to 658 times, raises energy by 100 to 560 times, and pushes GPU key‑value (KV) cache occupancy to 35‑‑74%. Standard prompt filters and output trajectory monitors seldom detect these attacks, highlighting the need for defenses that safeguard agentic processes rather than focusing solely on final outcomes. We will release the code soon.

PaperID: 6275, https://arxiv.org/pdf/2601.10865.pdf

Authors: Jonah Ghebremichael, Saastha Vasan, Saad Ullah, Greg Tystahl, David Adei, Christopher Kruegel, Giovanni Vigna, William Enck, Alexandros Kapravelos

Title: Multi-Agent Taint Specification Extraction for Vulnerability Detection

Abstract:
Static Application Security Testing (SAST) tools using taint analysis are widely viewed as providing higher‑quality vulnerability detection results compared to traditional pattern‑based approaches. However, performing static taint analysis for JavaScript poses two major challenges. First, JavaScript's dynamic features complicate data flow extraction required for taint tracking. Second, npm's large library ecosystem makes it difficult to identify relevant sources/sinks and establish taint propagation across dependencies. In this paper, we present SemTaint, a multi‑agent system that strategically combines the semantic understanding of Large Language Models (LLMs) with traditional static program analysis to extract taint specifications, including sources, sinks, call edges, and library flow summaries tailored to each package. Conceptually, SemTaint uses static program analysis to calculate a call graph and defers to an LLM to resolve call edges that cannot be resolved statically. Further, it uses the LLM to classify sources and sinks for a given CWE. The resulting taint specification is then provided to a SAST tool, which performs vulnerability analysis. We integrate SemTaint with CodeQL, a state‑of‑the‑art SAST tool, and demonstrate its effectiveness by detecting 106 of 162 vulnerabilities previously undetectable by CodeQL. Furthermore, we find 4 novel vulnerabilities in 4 popular npm packages. In doing so, we demonstrate that LLMs can practically enhance existing static program analysis algorithms, combining the strengths of both symbolic reasoning and semantic understanding for improved vulnerability detection.

PaperID: 6276, https://arxiv.org/pdf/2601.10849.pdf

Authors: Cuong Le, Symeon Chatzinotas, Thang X. Vu

Title: Cooperative UAVs for Remote Data Collection under Limited Communications: An Asynchronous Multiagent Learning Framework

Abstract:
This paper addresses the joint optimization of trajectories and bandwidth allocation for multiple Unmanned Aerial Vehicles (UAVs) to enhance energy efficiency in the cooperative data collection problem. We focus on an important yet underestimated aspect of the system, where action synchronization across all UAVs is impossible. Since most existing learning‑based solutions are not designed to learn in this asynchronous environment, we formulate the trajectory planning problem as a Decentralized Partially Observable Semi‑Markov Decision Process and introduce an asynchronous multi‑agent learning algorithm to learn UAVs' cooperative policies. Once the UAVs' trajectory policies are learned, the bandwidth allocation can be optimally solved based on local observations at each collection point. Comprehensive empirical results demonstrate the superiority of the proposed method over other learning‑based and heuristic baselines in terms of both energy efficiency and mission completion time. Additionally, the learned policies exhibit robustness under varying environmental conditions.

PaperID: 6277, https://arxiv.org/pdf/2601.10825.pdf

Authors: Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans

Title: Reasoning Models Generate Societies of Thought

Abstract:
Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction‑tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi‑agent‑like interactions ‑‑ a society of thought ‑‑ which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek‑R1 and QwQ‑32B exhibit much greater perspective diversity than instruction‑tuned models, activating broader conflict between heterogeneous personality‑ and expertise‑related features during reasoning. This multi‑agent structure manifests in conversational behaviors, including question‑answering, perspective shifts, and the reconciliation of conflicting views, and in socio‑emotional roles that characterize sharp back‑and‑forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine‑tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem‑solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.

PaperID: 6278, https://arxiv.org/pdf/2601.10820.pdf

Authors: Himanshu Thakur, Anusha Kamath, Anurag Muthyala, Dhwani Sanmukhani, Smruthi Mukund, Jay Katukuri

Title: Towards Reliable ML Feature Engineering via Planning in Constrained-Topology of LLM Agents

Abstract:
Recent advances in code generation models have unlocked unprecedented opportunities for automating feature engineering, yet their adoption in real‑world ML teams remains constrained by critical challenges: (i) the scarcity of datasets capturing the iterative and complex coding processes of production‑level feature engineering, (ii) limited integration and personalization of widely used coding agents, such as CoPilot and Devin, with a team's unique tools, codebases, workflows, and practices, and (iii) suboptimal human‑AI collaboration due to poorly timed or insufficient feedback. We address these challenges with a planner‑guided, constrained‑topology multi‑agent framework that generates code for repositories in a multi‑step fashion. The LLM‑powered planner leverages a team's environment, represented as a graph, to orchestrate calls to available agents, generate context‑aware prompts, and use downstream failures to retroactively correct upstream artifacts. It can request human intervention at critical steps, ensuring generated code is reliable, maintainable, and aligned with team expectations. On a novel in‑house dataset, our approach achieves 38% and 150% improvement in the evaluation metric over manually crafted and unplanned workflows respectively. In practice, when building features for recommendation models serving over 120 million users, our approach has delivered real‑world impact by reducing feature engineering cycles from three weeks to a single day.

PaperID: 6279, https://arxiv.org/pdf/2601.10758.pdf

Authors: Fengchao Chen, Tingmin Wu, Van Nguyen, Carsten Rudolph

Title: Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Abstract:
Large Language Models (LLMs) have enabled agents to move beyond conversation toward end‑to‑end task execution and become more helpful. However, this helpfulness introduces new security risks stem less from direct interface abuse than from acting on user‑provided content. Existing studies on agent security largely focus on model‑internal vulnerabilities or adversarial access to agent interfaces, overlooking attacks that exploit users as unintended conduits. In this paper, we study user‑mediated attacks, where benign users are tricked into relaying untrusted or attacker‑controlled content to agents, and analyze how commercial LLM agents respond under such conditions. We conduct a systematic evaluation of 12 commercial agents in a sandboxed environment, covering 6 trip‑planning agents and 6 web‑use agents, and compare agent behavior across scenarios with no, soft, and hard user‑requested safety checks. Our results show that agents are too helpful to be safe by default. Without explicit safety requests, trip‑planning agents bypass safety constraints in over 92% of cases, converting unverified content into confident booking guidance. Web‑use agents exhibit near‑deterministic execution of risky actions, with 9 out of 17 supported tests reaching a 100% bypass rate. Even when users express soft or hard safety intent, constraint bypass remains substantial, reaching up to 54.7% and 7% for trip‑planning agents, respectively. These findings reveal that the primary issue is not a lack of safety capability, but its prioritization. Agents invoke safety checks only conditionally when explicitly prompted, and otherwise default to goal‑driven execution. Moreover, agents lack clear task boundaries and stopping rules, frequently over‑executing workflows in ways that lead to unnecessary data disclosure and real‑world harm.

PaperID: 6280, https://arxiv.org/pdf/2601.10738.pdf

Authors: Percy Jardine

Title: CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

Abstract:
Recently, multi‑time‑scale agent architectures have extended the ubiquitous single‑loop paradigm by introducing temporal hierarchies with distinct cognitive layers. While yielding substantial performance gains, this diversification fundamentally compromises the coordination stability intrinsic to unified agent systems, which causes severe inter‑layer conflicts, unbounded error propagation, and restricted scalability. To address these challenges, we propose Constrained Temporal Hierarchical Architecture (CTHA), a general framework that projects the inter‑layer communication space onto structured manifolds to restore coordination stability, while incorporating principled arbitration mechanisms to ensure coherent decision‑making. Specifically, CTHA enforces three key constraints: (1) Message Contract Constraints that formalize information flow between layers via typed summary, plan, and policy packets; (2) Authority Manifold Constraints that bound each layer's decision space according to its temporal scope; and (3) Arbiter Resolution Constraints that guarantee conflict‑free composition of multi‑layer decisions. Empirical experiments demonstrate that CTHA is effective for complex task execution at scale, offering 47% reduction in failure cascades, 2.3x improvement in sample efficiency, and superior scalability compared to unconstrained hierarchical baselines. We anticipate that CTHA, as a principled extension of temporal hierarchies, will contribute to a deeper understanding of multi‑agent coordination and suggest promising directions for the evolution of robust autonomous systems.

PaperID: 6281, https://arxiv.org/pdf/2601.10725.pdf

Authors: Hieu Do Quang, Chien Truong-Quoc, Quoc Van Tran

Title: Multi-Agent Formation Navigation Using Diffusion-Based Trajectory Generation

Abstract:
This paper introduces a diffusion‑based planner for leader‑‑follower formation control in cluttered environments. The diffusion policy is used to generate the trajectory of the midpoint of two leaders as a rigid bar in the plane, thereby defining their desired motion paths in a planar formation. While the followers track the leaders and form desired foramtion geometry using a distance‑constrained formation controller based only on the relative positions in followers' local coordinates. The proposed approach produces smooth motions and low tracking errors, with most failures occurring in narrow obstacle‑free space, or obstacle configurations that are not in the training data set. Simulation results demonstrate the potential of diffusion models for reliable multi‑agent formation planning.

PaperID: 6282, https://arxiv.org/pdf/2601.10718.pdf

Authors: Junyu Liu, Siwen Yang, Dexiu Ma, Qian Niu, Zequn Zhang, Momoko Nagai-Tanima, Tomoki Aoyama

Title: Japanese AI Agent System on Human Papillomavirus Vaccination: System Design

Abstract:
Human papillomavirus (HPV) vaccine hesitancy poses significant public health challenges, particularly in Japan where proactive vaccination recommendations were suspended from 2013 to 2021. The resulting information gap is exacerbated by misinformation on social media, and traditional ways cannot simultaneously address individual queries while monitoring population‑level discourse. This study aimed to develop a dual‑purpose AI agent system that provides verified HPV vaccine information through a conversational interface while generating analytical reports for medical institutions based on user interactions and social media. We implemented a system comprising: a vector database integrating academic papers, government sources, news media, and social media; a Retrieval‑Augmented Generation chatbot using ReAct agent architecture with multi‑tool orchestration across five knowledge sources; and an automated report generation system with modules for news analysis, research synthesis, social media sentiment analysis, and user interaction pattern identification. Performance was assessed using a 0‑5 scoring scale. For single‑turn evaluation, the chatbot achieved mean scores of 4.83 for relevance, 4.89 for routing, 4.50 for reference quality, 4.90 for correctness, and 4.88 for professional identity (overall 4.80). Multi‑turn evaluation yielded higher scores: context retention 4.94, topic coherence 5.00, and overall 4.98. The report generation system achieved completeness 4.00‑5.00, correctness 4.00‑5.00, and helpfulness 3.67‑5.00, with reference validity 5.00 across all periods. This study demonstrates the feasibility of an integrated AI agent system for bidirectional HPV vaccine communication. The architecture enables verified information delivery with source attribution while providing systematic public discourse analysis, with a transferable framework for adaptation to other medical contexts.

PaperID: 6283, https://arxiv.org/pdf/2601.10600.pdf

Authors: Joshua Caiata, Carter Blair, Kate Larson

Title: Procedural Fairness in Multi-Agent Bandits

Abstract:
In the context of multi‑agent multi‑armed bandits (MA‑MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision‑making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome‑based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.

PaperID: 6284, https://arxiv.org/pdf/2601.10599.pdf

Authors: Federico Pierucci, Marcello Galisai, Marcantonio Syrnikov Bracale, Matteo Prandi, Piercosma Bisconti, Francesco Giarrusso, Olga Sorokoletova, Vincenzo Suriani, Daniele Nardi

Title: Institutional AI: A Governance Framework for Distributional AGI Safety

Abstract:
As LLM‑based systems increasingly operate as agents embedded within human social and technical systems, alignment can no longer be treated as a property of an isolated model, but must be understood in relation to the environments in which these agents act. Even the most sophisticated methods of alignment, such as Reinforcement Learning through Human Feedback (RHLF) or through AI Feedback (RLAIF) cannot ensure control once internal goal structures diverge from developer intent. We identify three structural problems that emerge from core properties of AI models: (1) behavioral goal‑independence, where models develop internal objectives and misgeneralize goals; (2) instrumental override of natural‑language constraints, where models regard safety principles as non‑binding while pursuing latent objectives, leveraging deception and manipulation; and (3) agentic alignment drift, where individually aligned agents converge to collusive equilibria through interaction dynamics invisible to single‑agent audits. The solution this paper advances is Institutional AI: a system‑level approach that treats alignment as a question of effective governance of AI agent collectives. We argue for a governance‑graph that details how to constrain agents via runtime monitoring, incentive shaping through prizes and sanctions, explicit norms and enforcement roles. This institutional turn reframes safety from software engineering to a mechanism design problem, where the primary goal of alignment is shifting the payoff landscape of AI agent collectives.

PaperID: 6285, https://arxiv.org/pdf/2601.10581.pdf

Authors: Kimia Abedini, Farzad Shami, Gianmaria Silvello

Title: From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA

Abstract:
Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain‑specific databases. GeneGPT is the current state‑of‑the‑art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi‑agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.

PaperID: 6286, https://arxiv.org/pdf/2601.10567.pdf

Authors: Laura Ferrarotti, Gian Maria Campedelli, Roberto Dessì, Andrea Baronchelli, Giovanni Iacca, Kathleen M. Carley, Alex Pentland, Joel Z. Leibo, James Evans, Bruno Lepri

Title: Generative AI collective behavior needs an interactionist paradigm

Abstract:
In this article, we argue that understanding the collective behavior of agents based on large language models (LLMs) is an essential area of inquiry, with important implications in terms of risks and benefits, impacting us as a society at many levels. We claim that the distinctive nature of LLMs‑‑namely, their initialization with extensive pre‑trained knowledge and implicit social priors, together with their capability of adaptation through in‑context learning‑‑motivates the need for an interactionist paradigm consisting of alternative theoretical foundations, methodologies, and analytical tools, in order to systematically examine how prior knowledge and embedded values interact with social context to shape emergent phenomena in multi‑agent generative AI systems. We propose and discuss four directions that we consider crucial for the development and deployment of LLM‑based collectives, focusing on theory, methods, and trans‑disciplinary dialogue.

PaperID: 6287, https://arxiv.org/pdf/2601.10440.pdf

Authors: Nadya Abaev, Denis Klimov, Gerard Levinov, David Mimran, Yuval Elovici, Asaf Shabtai

Title: AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior

Abstract:
Artificial intelligence (AI) agents are increasingly used in a variety of domains to automate tasks, interact with users, and make decisions based on data inputs. Ensuring that AI agents perform only authorized actions and handle inputs appropriately is essential for maintaining system integrity and preventing misuse. In this study, we introduce the AgentGuardian, a novel security framework that governs and protects AI agent operations by enforcing context‑aware access‑control policies. During a controlled staging phase, the framework monitors execution traces to learn legitimate agent behaviors and input patterns. From this phase, it derives adaptive policies that regulate tool calls made by the agent, guided by both real‑time input context and the control flow dependencies of multi‑step agent actions. Evaluation across two real‑world AI agent applications demonstrates that AgentGuardian effectively detects malicious or misleading inputs while preserving normal agent functionality. Moreover, its control‑flow‑based governance mechanism mitigates hallucination‑driven errors and other orchestration‑level malfunctions.

PaperID: 6288, https://arxiv.org/pdf/2601.10402.pdf

Authors: Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, Yanfeng Wang

Title: Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Abstract:
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra‑long‑horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short‑horizon reasoning, they are easily overwhelmed by execution details in the high‑dimensional, delayed‑feedback environments of real‑world research, failing to consolidate sparse feedback into coherent long‑term guidance. Here, we present ML‑Master 2.0, an autonomous agent that masters ultra‑long‑horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi‑tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross‑task wisdom, HCC allows agents to decouple immediate execution from long‑term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE‑Bench under 24‑hour budgets, ML‑Master 2.0 achieves a state‑of‑the‑art medal rate of 56.44%. Our findings demonstrate that ultra‑long‑horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human‑precedent complexities.

PaperID: 6289, https://arxiv.org/pdf/2601.10367.pdf

Authors: Daniela Aguirre Salazar, Firas Moatemri, Tatiana Tatarenko

Title: Inverse Learning in $2\times2$ Games: From Synthetic Interactions to Traffic Simulation

Abstract:
Understanding how agents coordinate or compete from limited behavioral data is central to modeling strategic interactions in traffic, robotics, and other multi‑agent systems. In this work, we investigate the following complementary formulations of inverse game‑theoretic learning: (i) a Closed‑form Correlated Equilibrium Maximum‑Likelihood estimator (CE‑ML) specialized for 2×2 games; and (ii) a Logit Best Response Maximum‑Likelihood estimator (LBR‑ML) that captures long‑run adaptation dynamics via stochastic response processes. Together, these approaches span the spectrum between static equilibrium consistency and dynamic behavioral realism. We evaluate them on synthetic "chicken‑dare" games and traffic‑interaction scenarios simulated in SUMO, comparing parameter recovery and distributional fit. Results reveal clear trade‑offs between interpretability, computational tractability, and behavioral expressiveness across models.

PaperID: 6290, https://arxiv.org/pdf/2601.10338.pdf

Authors: Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, Leo Zhang

Title: Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Abstract:
The rise of AI agent frameworks has introduced agent skills, modular packages containing instructions and executable code that dynamically extend agent capabilities. While this architecture enables powerful customization, skills execute with implicit trust and minimal vetting, creating a significant yet uncharacterized attack surface. We conduct the first large‑scale empirical security analysis of this emerging ecosystem, collecting 42,447 skills from two major marketplaces and systematically analyzing 31,132 using SkillScan, a multi‑stage detection framework integrating static analysis with LLM‑based semantic classification. Our findings reveal pervasive security risks: 26.1% of skills contain at least one vulnerability, spanning 14 distinct patterns across four categories: prompt injection, data exfiltration, privilege escalation, and supply chain risks. Data exfiltration (13.3%) and privilege escalation (11.8%) are most prevalent, while 5.2% of skills exhibit high‑severity patterns strongly suggesting malicious intent. We find that skills bundling executable scripts are 2.12x more likely to contain vulnerabilities than instruction‑only skills (OR=2.12, p<0.001). Our contributions include: (1) a grounded vulnerability taxonomy derived from 8,126 vulnerable skills, (2) a validated detection methodology achieving 86.7% precision and 82.5% recall, and (3) an open dataset and detection toolkit to support future research. These results demonstrate an urgent need for capability‑based permission systems and mandatory security vetting before this attack vector is further exploited.

PaperID: 6291, https://arxiv.org/pdf/2601.10299.pdf

Authors: Zhenyu Zhao, Tiankui Zhang, Xiaoxia Xu, Junjie Li, Yuanwei Liu, Wenjuan Xing

Title: Multipath Routing for Multi-Hop UAV Networks

Abstract:
Multi‑hop uncrewed aerial vehicle (UAV) networks are promising to extend the terrestrial network coverage. Existing multi‑hop UAV networks employ a single routing path by selecting the next‑hop forwarding node in a hop‑by‑hop manner, which leads to local congestion and increases traffic delays. In this paper, a novel traffic‑adaptive multipath routing method is proposed for multi‑hop UAV networks, which enables each UAV to dynamically split and forward traffic flows across multiple next‑hop neighbors, thus meeting latency requirements of diverse traffic flows in dynamic mobile environments. An on‑time packet delivery ratio maximization problem is formulated to determine the traffic splitting ratios at each hop. This sequential decision‑making problem is modeled as a decentralized partially observable Markov decision process (Dec‑POMDP). To solve this Dec‑POMDP, a novel multi‑agent deep reinforcement leaning (MADRL) algorithm, termed Independent Proximal Policy Optimization with Dirichlet Modeling (IPPO‑DM), is developed. Specifically, the IPPO serves as the core optimization framework, where the Dirichlet distribution is leveraged to parameterize a continuous stochastic policy network on the probability simplex, inherently ensuring feasible traffic splitting ratios. Simulation results demonstrate that IPPO‑DM outperforms benchmark schemes in terms of both delivery latency guarantee and packet loss performance.

PaperID: 6292, https://arxiv.org/pdf/2601.10245.pdf

Authors: Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar

Title: TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks

Abstract:
Multi‑step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi‑step reasoning tasks), which routes only critical steps\unicodex2013those likely to derail the solution\unicodex2013to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step‑level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step‑level: it uses process reward models to identify erroneous steps and makes routing decisions based on step‑level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold‑based policy to more expressive policies that reason about long‑horizon accuracy‑cost trade‑offs and uncertainty in step‑level correctness estimates. On MATH‑500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model's performance using 80% fewer expensive model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step‑level difficulty represents fundamental characteristics of reasoning.

PaperID: 6293, https://arxiv.org/pdf/2601.10223.pdf

Authors: Ziqi Xu, Yi Liu, Yuekang Li, Ling Shi, Kailong Wang, Yongxin Zhao

Title: STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter

Abstract:
People who stutter (PWS) face systemic exclusion in today's voice‑driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real time system that transforms stuttered speech into fluent output through a novel multi‑stage, multi‑agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech to speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real time communication. STEAMROLLER employs a three stage architecture comprising ASR transcription, multi‑agent text repair, and speech synthesis, where our core innovation lies in a collaborative multi‑agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrates clear word error rate (WER) reduction and strong user satisfaction. Beyond immediate accessibility benefits, fine tuning ASR on STEAMROLLER repaired speech further yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.

PaperID: 6294, https://arxiv.org/pdf/2601.10194.pdf

Authors: Weitang Li, Jiajun Ren, Lixue Cheng, Cunxi Gong

Title: Autonomous Quantum Simulation through Large Language Model Agents

Abstract:
We demonstrate that large language model (LLM) agents can autonomously perform tensor network simulations of quantum many‑body systems, achieving approximately 90% success rate across representative benchmark tasks. Tensor network methods are powerful tools for quantum simulation, but their effective use requires expertise typically acquired through years of graduate training. By combining in‑context learning with curated documentation and multi‑agent decomposition, we create autonomous AI agents that can be trained in specialized computational domains within minutes. We benchmark three configurations (baseline, single‑agent with in‑context learning, and multi‑agent with in‑context learning) on problems spanning quantum phase transitions, open quantum system dynamics, and photochemical reactions. Systematic evaluation using DeepSeek‑V3.2, Gemini 2.5 Pro, and Claude Opus 4.5 demonstrates that both in‑context learning and multi‑agent architecture are essential. Analysis of failure modes reveals characteristic patterns across models, with the multi‑agent configuration substantially reducing implementation errors and hallucinations compared to simpler architectures.

PaperID: 6295, https://arxiv.org/pdf/2601.10137.pdf

Authors: Ziyi Ding, Chenfei Ye-Hao, Zheyuan Wang, Xiao-Ping Zhang

Title: Step-by-Step Causality: Transparent Causal Discovery with Multi-Agent Tree-Query and Adversarial Confidence Estimation

Abstract:
Causal discovery aims to recover ``what causes what'', but classical constraint‑based methods (e.g., PC, FCI) suffer from error propagation, and recent LLM‑based causal oracles often behave as opaque, confidence‑free black boxes. This paper introduces Tree‑Query, a tree‑structured, multi‑expert LLM framework that reduces pairwise causal discovery to a short sequence of queries about backdoor paths, (in)dependence, latent confounding, and causal direction, yielding interpretable judgments with robustness‑aware confidence scores. Theoretical guarantees are provided for asymptotic identifiability of four pairwise relations. On data‑free benchmarks derived from Mooij et al. and UCI causal graphs, Tree‑Query improves structural metrics over direct LLM baselines, and a diet‑‑weight case study illustrates confounder screening and stable, high‑confidence causal conclusions. Tree‑Query thus offers a principled way to obtain data‑free causal priors from LLMs that can complement downstream data‑driven causal discovery. Code is available at https://anonymous.4open.science/r/Repo‑9B3E‑4F96.

PaperID: 6296, https://arxiv.org/pdf/2601.10131.pdf

Authors: Yizhan Li, Florence Cloutier, Sifan Wu, Ali Parviz, Boris Knyazev, Yan Zhang, Glen Berseth, Bang Liu

Title: M^4olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints

Abstract:
Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is critical and challenging. Although large language models (LLMs) are expressive, they struggle with precise multi‑objective control and numeric reasoning without external structure and feedback. We introduce M olGen, a fragment‑level, retrieval‑augmented, two‑stage framework for molecule generation under multi‑property constraints. Stage I : Prototype generation: a multi‑agent reasoner performs retrieval‑anchored, fragment‑level edits to produce a candidate near the feasible region. Stage II : RL‑based fine‑grained optimization: a fragment‑level optimizer trained with Group Relative Policy Optimization (GRPO) applies one‑ or multi‑hop refinements to explicitly minimize the property errors toward our target while regulating edit complexity and deviation from the prototype. A large, automatically curated dataset with reasoning chains of fragment edits and measured property deltas underpins both stages, enabling deterministic, reproducible supervision and controllable multi‑hop reasoning. Unlike prior work, our framework better reasons about molecules by leveraging fragments and supports controllable refinement toward numeric targets. Experiments on generation under two sets of property constraints (QED, LogP, Molecular Weight and HOMO, LUMO) show consistent gains in validity and precise satisfaction of multi‑property targets, outperforming strong LLMs and graph‑based algorithms.

PaperID: 6297, https://arxiv.org/pdf/2601.10123.pdf

Authors: Aditi Anand, Dildar Ali, Suman Banerjee

Title: Fairness Driven Multi-Agent Path Finding Problem

Abstract:
The Multi‑Agent Path Finding (MAPF) problem aims at finding non‑conflicting paths for multiple agents from their respective sources to destinations. This problem arises in multiple real‑life situations, including robot motion planning and airspace assignment for unmanned aerial vehicle movement. The problem is computationally expensive, and adding to it, the agents are rational and can misreport their private information. In this paper, we study both variants of the problem under the realm of fairness. For the non‑rational agents, we propose a heuristic solution for this problem. Considering the agents are rational, we develop a mechanism and demonstrate that it is a dominant strategy, incentive compatible, and individually rational. We employ various solution methodologies to highlight the effectiveness and efficiency of the proposed solution approaches.

PaperID: 6298, https://arxiv.org/pdf/2601.10122.pdf

Authors: Ye Wang, Jiaxing Chen, Hongjiang Xiao

Title: Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

Abstract:
In recent years, with the rapid advancement of large language models (LLMs), role‑playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human‑computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule‑based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high‑quality role‑playing, including psychological scale‑driven character modeling, memory‑augmented prompting mechanisms, and motivation‑situation‑based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role‑specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi‑dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM‑based scoring. Finally, the paper outlines future development directions of role‑playing agents, including personality evolution modeling, multi‑agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.

PaperID: 6299, https://arxiv.org/pdf/2601.10102.pdf

Authors: Viswonathan Manoranjan, Snehalkumar `Neil' S. Gaikwad

Title: When Personas Override Payoffs: Role Identity Bias in Multi-Agent LLM Decision-Making

Abstract:
Large language models are increasingly deployed in multi‑agent systems for strategic tasks, yet how design choices such as role‑based personas and payoff visibility affect reasoning remains poorly understood. We investigate whether multi‑agent systems function as strategic reasoners capable of payoff optimization or as identity‑driven actors that prioritize role alignment over explicit incentives. Using Nash equilibrium achievement as a diagnostic for strategic reasoning, we conduct systematic experiments across four LLM architectures (Qwen‑7B, Qwen‑32B, Llama‑8B, Mistral‑7B) in complex environmental decision‑making games involving four agents. We show that role identity bias fundamentally alters strategic reasoning even when payoff‑optimal equilibria exist and complete payoff information is available. Removing personas and providing explicit payoffs enables Qwen models to achieve high Nash equilibrium rates, indicating that both conditions are necessary for strategic reasoning. In contrast, personas systematically bias equilibrium selection toward socially preferred outcomes: with personas present, all of the achieved equilibria correspond to Green Transition, while models entirely fail to reach equilibrium when Tragedy of the Commons is payoff‑optimal. The effect of explicit payoffs depends entirely on persona presence, revealing strong interactions between representational design choices. We also observe clear model‑dependent patterns. Qwen architectures are highly sensitive to both personas and payoff visibility, whereas Llama and Mistral exhibit rigid reasoning behavior across conditions. These findings demonstrate that representational choices are substantive governance decisions that determine whether multi‑agent systems act as strategic reasoners or identity‑driven actors, with important implications for real‑world deployment.

PaperID: 6300, https://arxiv.org/pdf/2601.10029.pdf

Authors: Tingyue Pan, Jie Ouyang, Mingyue Cheng, Qingchuan Li, Zirui Liu, Daoyu Wang, Mingfan Pan, Shuo Yu, Qi Liu

Title: PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

Abstract:
Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision‑making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single‑turn tasks, suffer from a granularity mismatch when applied to multi‑turn agentic tasks‑where token‑level optimization diverges from the granularity of sequence‑level interactions‑leading to noisy credit assignment and unstable training dynamics. We introduce Proximal Sequence Policy Optimization (PSPO), a process‑aware, sequence‑level policy optimization method that aligns optimization with agent‑‑environment interaction. Comprehensive experiments on both synthetic and real‑world benchmarks demonstrate that PaperScout significantly outperforms strong workflow‑driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.

PaperID: 6301, https://arxiv.org/pdf/2601.10025.pdf

Authors: Jinpeng Wang, Xinyu Jia, Wei Wei Heng, Yuquan Li, Binbin Shi, Qianlei Chen, Guannan Chen, Junxia Zhang, Yuyu Yin

Title: Structured Personality Control and Adaptation for LLM Agents

Abstract:
Large Language Models (LLMs) are increasingly shaping human‑computer interaction (HCI), from personalized assistants to social simulations. Beyond language competence, researchers are exploring whether LLMs can exhibit human‑like characteristics that influence engagement, decision‑making, and perceived realism. Personality, in particular, is critical, yet existing approaches often struggle to achieve both nuanced and adaptable expression. We present a framework that models LLM personality via Jungian psychological types, integrating three mechanisms: a dominant‑auxiliary coordination mechanism for coherent core expression, a reinforcement‑compensation mechanism for temporary adaptation to context, and a reflection mechanism that drives long‑term personality evolution. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands and gradually updating its underlying structure. Personality alignment is evaluated using Myers‑Briggs Type Indicator questionnaires and tested under diverse challenge scenarios as a preliminary structured assessment. Findings suggest that evolving, personality‑aware LLMs can support coherent, context‑sensitive interactions, enabling naturalistic agent design in HCI.

PaperID: 6302, https://arxiv.org/pdf/2601.10020.pdf

Authors: Lingfei Qian, Mauro Giuffre, Yan Wang, Huan He, Qianqian Xie, Xuguang Ai, Xeuqing Peng, Fan Ma, Ruey-Ling Weng, Donald Wright, Adan Wang, Qingyu Chen, Vipina K. Keloth, Hua Xu

Title: EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

Abstract:
Clinical decision‑making increasingly relies on timely and context‑aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question‑answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi‑agent framework that harnesses AI agents to perform patient‑level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician‑validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real‑world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real‑world EHR question answering.

PaperID: 6303, https://arxiv.org/pdf/2601.09937.pdf

Authors: Saber Zerhoudi, Michael Granitzer

Title: From SERPs to Agents: A Platform for Comparative Studies of Information Interaction

Abstract:
The diversification of information access systems, from RAG to autonomous agents, creates a critical need for comparative user studies. However, the technical overhead to deploy and manage these distinct systems is a major barrier. We present UXLab, an open‑source system for web‑based user studies that addresses this challenge. Its core is a web‑based dashboard enabling the complete, no‑code configuration of complex experimental designs. Researchers can visually manage the full study, from recruitment to comparing backends like traditional search, vector databases, and LLMs. We demonstrate UXLab's value via a micro case study comparing user behavior with RAG versus an autonomous agent. UXLab allows researchers to focus on experimental design and analysis, supporting future multi‑modal interaction research.

PaperID: 6304, https://arxiv.org/pdf/2601.09913.pdf

Authors: Joe Logan

Title: Continuum Memory Architectures for Long-Horizon LLM Agents

Abstract:
Retrieval‑augmented generation (RAG) has become the default strategy for providing large language model (LLM) agents with contextual knowledge. Yet RAG treats memory as a stateless lookup table: information persists indefinitely, retrieval is read‑only, and temporal continuity is absent. We define the Continuum Memory Architecture (CMA), a class of systems that maintain and update internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher‑order abstractions. Rather than disclosing implementation specifics, we specify the architectural requirements CMA imposes and show consistent behavioral advantages on tasks that expose RAG's structural inability to accumulate, mutate, or disambiguate memory. The empirical probes (knowledge updates, temporal association, associative recall, contextual disambiguation) demonstrate that CMA is a necessary architectural primitive for long‑horizon agents while highlighting open challenges around latency, drift, and interpretability.

PaperID: 6305, https://arxiv.org/pdf/2601.09822.pdf

Authors: Yongjian Tang, Thomas Runkler

Title: LLM-Based Agentic Systems for Software Engineering: Challenges and Opportunities

Abstract:
Despite recent advancements in Large Language Models (LLMs), complex Software Engineering (SE) tasks require more collaborative and specialized approaches. This concept paper systematically reviews the emerging paradigm of LLM‑based multi‑agent systems, examining their applications across the Software Development Life Cycle (SDLC), from requirements engineering and code generation to static code checking, testing, and debugging. We delve into a wide range of topics such as language model selection, SE evaluation benchmarks, state‑of‑the‑art agentic frameworks and communication protocols. Furthermore, we identify key challenges and outline future research opportunities, with a focus on multi‑agent orchestration, human‑agent coordination, computational cost optimization, and effective data collection. This work aims to provide researchers and practitioners with valuable insights into the current forefront landscape of agentic systems within the software engineering domain.

PaperID: 6306, https://arxiv.org/pdf/2601.09750.pdf

Authors: Robert K. Strehlow, Tobias Küster, Oskar F. Kupke, Brandon Llanque Kurps, Fikret Sivrikaya, Sahin Albayrak

Title: SAGE: Tool-Augmented LLM Task Solving Strategies in Scalable Multi-Agent Environments

Abstract:
Large language models (LLMs) have proven to work well in question‑answering scenarios, but real‑world applications often require access to tools for live information or actuation. For this, LLMs can be extended with tools, which are often defined in advance, also allowing for some fine‑tuning for specific use cases. However, rapidly evolving software landscapes and individual services require the constant development and integration of new tools. Domain‑ or company‑specific tools can greatly elevate the usefulness of an LLM, but such custom tools can be problematic to integrate, or the LLM may fail to reliably understand and use them. For this, we need strategies to define new tools and integrate them into the LLM dynamically, as well as robust and scalable zero‑shot prompting methods that can make use of those tools in an efficient manner. In this paper, we present SAGE, a specialized conversational AI interface, based on the OPACA framework for tool discovery and execution. The integration with OPACA makes it easy to add new tools or services for the LLM to use, while SAGE itself presents rich extensibility and modularity. This not only provides the ability to seamlessly switch between different models (e.g. GPT, LLAMA), but also to add and select prompting methods, involving various setups of differently prompted agents for selecting and executing tools and evaluating the results. We implemented a number of task‑solving strategies, making use of agentic concepts and prompting methods in various degrees of complexity, and evaluated those against a comprehensive set of benchmark services. The results are promising and highlight the distinct strengths and weaknesses of different task‑solving strategies. Both SAGE and the OPACA framework, as well as the different benchmark services and results, are available as Open Source/Open Data on GitHub.

PaperID: 6307, https://arxiv.org/pdf/2601.09746.pdf

Authors: Philip Xu

Title: Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts

Abstract:
This paper introduces a novel Multi‑Agent Cooperative Learning (MACL) framework to address cross‑modal alignment collapse in vision‑language models when handling out‑of‑distribution (OOD) concepts. Four core agents, including image, text, name, and coordination agents, collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi‑agent feature space name learning, incorporates a context exchange enhanced few‑shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter‑agent contributions. Experiments on the VISTA‑Beyond dataset demonstrate that MACL significantly improves performance in both few‑shot and zero‑shot settings, achieving 1‑5% precision gains across diverse visual domains.

PaperID: 6308, https://arxiv.org/pdf/2601.09742.pdf

Authors: Sathish Sampath, Anuradha Baskaran

Title: Adaptive Orchestration: Scalable Self-Evolving Multi-Agent Systems

Abstract:
As Large Language Models (LLMs) are increasingly deployed as autonomous agents, they face a critical scalability bottleneck known as the "Generalization‑Specialization Dilemma." Monolithic agents equipped with extensive toolkits suffer from context pollution and attention decay, leading to hallucinations. Conversely, static multi‑agent swarms introduce significant latency and resource overhead. This paper introduces a Self‑Evolving Concierge System, a novel architecture utilizing a Dynamic Mixture of Experts (DMoE) approach. Unlike recent self‑improving agents that rewrite their own codebase, our system preserves stability by dynamically restructuring its runtime environment: "hiring" specialized sub‑agents based on real‑time conversation analysis. We introduce an asynchronous "Meta‑Cognition Engine" that detects capability gaps, a Least Recently Used (LRU) eviction policy for resource constraints, and a novel "Surgical History Pruning" mechanism to mitigate refusal bias. Experimental results demonstrate that this architecture maintains high task success rates while minimizing token consumption compared to static agent swarms.

PaperID: 6309, https://arxiv.org/pdf/2601.09714.pdf

Authors: Devesh Saraogi, Rohit Singhee, Dhruv Kumar

Title: Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

Abstract:
The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI‑generated research. Recent work has identified ``smart plagiarism'' as a concern in single‑step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows ‑‑ multi‑step systems employing iterative reasoning, evolutionary search, and recursive decomposition ‑‑ can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection‑based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co‑Scientist multi‑agent framework, GPT Deep Research (GPT‑5.1) recursive decomposition, and Gemini~3 Pro multimodal long‑context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition‑based and long‑context workflows achieve mean novelty of 4.17/5, while reflection‑based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high‑performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi‑stage agentic workflows can advance AI‑assisted research ideation.

PaperID: 6310, https://arxiv.org/pdf/2601.09694.pdf

Authors: Sai Varun Kodathala, Rakesh Vunnam

Title: LLMs can Compress LLMs: Adaptive Pruning by Agents

Abstract:
As Large Language Models (LLMs) continue to scale, post‑training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer‑wise weight reconstruction or activation‑aware magnitude pruning, but rely on uniform or hand‑crafted heuristics to determine per‑layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near‑total collapse in factual question‑answering capabilities. We introduce agent‑guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer‑wise sensitivity profiles by combining Wanda‑inspired weight‑activation metrics with gradient importance scores, normalized as z‑scores for model‑agnostic comparison. These statistics are processed by an LLM agent equipped with self‑reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model‑agnostic manner, and exhibits effective self‑correction with only 2‑4 rollbacks across 21‑40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.

PaperID: 6311, https://arxiv.org/pdf/2601.09667.pdf

Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

Title: Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Abstract:
Multi‑agent systems have evolved into practical LLM‑driven collaborators for many applications, gaining robustness from diversity and cross‑checking. However, multi‑agent RL (MARL) training is resource‑intensive and unstable: co‑adapting teammates induce non‑stationarity, and rewards are often sparse and high‑variance. Therefore, we introduce Multi‑Agent Test‑Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi‑agent deliberation at inference time. MATTRL forms a multi‑expert team of specialists for multi‑turn discussions, retrieves and integrates test‑time experiences, and reaches consensus for final decision‑making. We also study credit assignment for constructing a turn‑level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi‑agent baseline, and by 8.67% over comparable single‑agent baselines. Ablation studies examine different credit‑assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution‑shift‑robust multi‑agent reasoning without tuning.

PaperID: 6312, https://arxiv.org/pdf/2601.09515.pdf

Authors: Chenglong Wang, Canjia Li, Xingzhao Zhu, Yifu Huo, Huiyu Wang, Weixiong Lin, Yun Yang, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Tong Xiao

Title: SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams

Abstract:
Due to the dynamically evolving nature of real‑world query streams, relevance models struggle to generalize to practical search scenarios. A sophisticated solution is self‑evolution techniques. However, in large‑scale industrial settings with massive query streams, this technique faces two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo‑labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self‑Evolving Relevance Model approach (SERM), which comprises two complementary multi‑agent modules: a multi‑agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi‑agent relevance annotator, which provides reliable labels through a two‑level agreement framework. We evaluate SERM in a large‑scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self‑evolution, as validated by extensive offline multilingual evaluations and online testing.

PaperID: 6313, https://arxiv.org/pdf/2601.09503.pdf

Authors: Siyuan Liu, Hongbang Yuan, Xinze Li, Ziyue Zhu, Yixin Cao, Yu-Gang Jiang

Title: What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Abstract:
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision‑making and tool‑use tasks, yet their ability to generalize across varying environments remains a under‑examined concern. Current evaluation paradigms predominantly rely on trajectory‑based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task‑to‑Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world‑state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine‑grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.

PaperID: 6314, https://arxiv.org/pdf/2601.09434.pdf

Authors: Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Yi Kong

Title: SC-MAS: Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration

Abstract:
Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) enhance complex problem solving through multi‑agent collaboration, but often incur substantially higher costs than single‑agent systems. Recent MAS routing methods aim to balance performance and overhead by dynamically selecting agent roles and language models. However, these approaches typically rely on a homogeneous collaboration mode, where all agents follow the same interaction pattern, limiting collaboration flexibility across different roles. Motivated by Social Capital Theory, which emphasizes that different roles benefit from distinct forms of collaboration, we propose SC‑MAS, a framework for constructing heterogeneous and cost‑efficient multi‑agent systems. SC‑MAS models MAS as directed graphs, where edges explicitly represent pairwise collaboration strategies, allowing different agent pairs to interact through tailored communication patterns. Given an input query, a unified controller progressively constructs an executable MAS by selecting task‑relevant agent roles, assigning edge‑level collaboration strategies, and allocating appropriate LLM backbones to individual agents. Experiments on multiple benchmarks demonstrate the effectiveness of SC‑MAS. In particular, SC‑MAS improves accuracy by 3.35% on MMLU while reducing inference cost by 15.38%, and achieves a 3.53% accuracy gain with a 12.13% cost reduction on MBPP. These results validate the feasibility of SC‑MAS and highlight the effectiveness of heterogeneous collaboration in multi‑agent systems.

PaperID: 6315, https://arxiv.org/pdf/2601.09342.pdf

Authors: Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak

Title: Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework

Abstract:
This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi‑agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio‑cultural context from publicly available knowledge sources, enabling identity‑aware moderation that surpasses state‑of‑the‑art prompting methods (zero‑shot prompting, few‑shot prompting, chain‑of‑thought prompting) and alternative approaches on a challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade‑off between true positive and true negative rates. We demonstrate that our community‑driven consultative framework significantly improves both classification accuracy and fairness across all target groups.

PaperID: 6316, https://arxiv.org/pdf/2601.09295.pdf

Authors: Handi Chen, Running Zhao, Xiuzhe Wu, Edith C. H. Ngai

Title: MACRO-LLM: LLM-Empowered Multi-Agent Collaborative Reasoning under Spatiotemporal Partial Observability

Abstract:
Large Language Model (LLM) agents deployed in complex real‑world scenarios increasingly operate as spatially distributed entities. However, this physical dispersion constrains agents to limited local perception and finite temporal horizons. We characterize this bottleneck as spatiotemporal partial observability, where spatial and temporal limitations are fundamentally coupled: resolving spatial conflicts requires temporal reasoning about neighbors' future actions, while temporal planning requires spatial context beyond local perception. To bridge this gap, we introduce MACRO‑LLM, LLM‑empowered multi‑agent collaborative reasoning under spatiotemporal partial observability. The architecture interleaves spatial and temporal reasoning within each decision cycle via three interdependent modules: (1) the CoProposer mitigates temporal uncertainty by verifying candidate actions via predictive rollouts; (2) the Negotiator overcomes spatial myopia by resolving conflicts through mean‑field statistical aggregation, grounded in the CoProposer's rollout rewards; and (3) the Introspector closes the reasoning loop by analyzing environmental drift and attributing performance changes to refine strategies. Extensive evaluations on two complex long‑horizon tasks, cooperative platoon planning and pandemic control, demonstrate that our framework enables robust coordination under spatiotemporal partial observability.

PaperID: 6317, https://arxiv.org/pdf/2601.09281.pdf

Authors: Jingjing Zhou, Gaoxiang Cong, Li Su, Liang Li

Title: STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models

Abstract:
Large Reasoning Models (LRMs) have advanced automated multi‑step reasoning, but their ability to generate complex Chain‑of‑Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Models (LLMs) unlearning approaches that typically focus on modifying only final answers are insufficient for LRMs, as they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter‑free, inference‑time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic‑aware detection. Then, we inject global safety constraints through secure prompt prefix. Next, we perform trajectory‑aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token‑level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi‑Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi‑Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning‑chain levels. Experiments on the R‑TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy‑preserving reasoning in LRMs.

PaperID: 6318, https://arxiv.org/pdf/2601.09264.pdf

Authors: Ziyi Shi, Xusen Guo, Hongliang Lu, Mingxing Peng, Haotian Wang, Zheng Zhu, Zhenning Li, Yuxuan Liang, Xinhu Zheng, Hai Yang

Title: Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants

Abstract:
Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent. However, human‑driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. To address this challenge, here we propose a large language model (LLM) multi‑agent policymaking framework that supports coordinated and proactive pandemic control across regions. Within our framework, each administrative region is assigned an LLM agent as an AI policymaking assistant. The agent reasons over region‑specific epidemiological dynamics while communicating with other agents to account for cross‑regional interdependencies. By integrating real‑world data, a pandemic evolution simulator, and structured inter‑agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed‑loop simulation process. We validate the proposed framework using state‑level COVID‑19 data from the United States between April and December 2020, together with real‑world mobility records and observed policy interventions. Compared with real‑world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states. These results demonstrate that LLM multi‑agent systems can enable more effective pandemic control with coordinated policymaking...

PaperID: 6319, https://arxiv.org/pdf/2601.09216.pdf

Authors: Xinyuan Zhang, Zijian Wang, Chang Dao, Juexiao Zhou

Title: Honesty-Aware Multi-Agent Framework for High-Fidelity Synthetic Data Generation in Digital Psychiatric Intake Doctor-Patient Interactions

Abstract:
Data scarcity and unreliable self‑reporting ‑‑ such as concealment or exaggeration ‑‑ pose fundamental challenges to psychiatric intake and assessment. We propose a multi‑agent synthesis framework that explicitly models patient deception to generate high‑fidelity, publicly releasable synthetic psychiatric intake records. Starting from DAIC‑WOZ interviews, we construct enriched patient profiles and simulate a four‑role workflow: a \emphPatient completes self‑rated scales and participates in a semi‑structured interview under a topic‑dependent honesty state; an \emphAssessor selects instruments based on demographics and chief complaints; an \emphEvaluator conducts the interview grounded in rater‑administered scales, tracks suspicion, and completes ratings; and a \emphDiagnostician integrates all evidence into a diagnostic summary. Each case links the patient profile, self‑rated and rater‑administered responses, interview transcript, diagnostic summary, and honesty state. We validate the framework through four complementary evaluations: diagnostic consistency and severity grading, chain‑of‑thought ablations, human evaluation of clinical realism and dishonesty modeling, and LLM‑based comparative evaluation. The resulting corpus spans multiple disorders and severity levels, enabling controlled study of dishonesty‑aware psychiatric assessment and the training and evaluation of adaptive dialogue agents.

PaperID: 6320, https://arxiv.org/pdf/2601.09150.pdf

Authors: Jianwen Sun, Yukang Feng, Kaining Ying, Chuanhao Li, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Yifan Chang, Yu Dai, Yifei Huang, Kaipeng Zhang

Title: World Craft: Agentic Framework to Create Visualizable Worlds via Text

Abstract:
Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a ``dynamic world'', holding immense value across entertainment and research. However, for non‑experts, especially those without programming skills, it isn't easy to customize a visualizable environment by themselves. In this paper, we introduce World Craft, an agentic world creation framework to create an executable and visualizable AI Town via user textual descriptions. It consists of two main modules, World Scaffold and World Guild. World Scaffold is a structured and concise standardization to develop interactive game scenes, serving as an efficient scaffolding for LLMs to customize an executable AI Town‑like environment. World Guild is a multi‑agent framework to progressively analyze users' intents from rough descriptions, and synthesizes required structured contents (\eg environment layout and assets) for World Scaffold . Moreover, we construct a high‑quality error‑correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, while reporting multi‑dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini‑3‑Pro). in scene construction and narrative intent conveyance, providing a scalable solution for the democratization of environment creation.

PaperID: 6321, https://arxiv.org/pdf/2601.09129.pdf

Authors: Xiaonan Liu, Zhihao Li, Xiao Lan, Hao Ren, Haizhou Wang, Xingshu Chen

Title: KryptoPilot: An Open-World Knowledge-Augmented LLM Agent for Automated Cryptographic Exploitation

Abstract:
Capture‑the‑Flag (CTF) competitions play a central role in modern cybersecurity as a platform for training practitioners and evaluating offensive and defensive techniques derived from real‑world vulnerabilities. Despite recent advances in large language models (LLMs), existing LLM‑based agents remain ineffective on high‑difficulty cryptographic CTF challenges, which require precise cryptanalytic knowledge, stable long‑horizon reasoning, and disciplined interaction with specialized toolchains. Through a systematic exploratory study, we show that insufficient knowledge granularity, rather than model reasoning capacity, is a primary factor limiting successful cryptographic exploitation: coarse or abstracted external knowledge often fails to support correct attack modeling and implementation. Motivated by this observation, we propose KryptoPilot, an open‑world knowledge‑augmented LLM agent for automated cryptographic exploitation. KryptoPilot integrates dynamic open‑world knowledge acquisition via a Deep Research pipeline, a persistent workspace for structured knowledge reuse, and a governance subsystem that stabilizes reasoning through behavioral constraints and cost‑aware model routing. This design enables precise knowledge alignment while maintaining efficient reasoning across heterogeneous subtasks. We evaluate KryptoPilot on two established CTF benchmarks and in six real‑world CTF competitions. KryptoPilot achieves a complete solve rate on InterCode‑CTF, solves between 56 and 60 percent of cryptographic challenges on the NYU‑CTF benchmark, and successfully solves 26 out of 33 cryptographic challenges in live competitions, including multiple earliest‑solved and uniquely‑solved instances. These results demonstrate the necessity of open‑world, fine‑grained knowledge augmentation and governed reasoning for scaling LLM‑based agents to real‑world cryptographic exploitation.

PaperID: 6322, https://arxiv.org/pdf/2601.09113.pdf

Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Title: The AI Hippocampus: How Far are We From Human Memory?

Abstract:
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi‑Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre‑trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph‑based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long‑term planning, self‑consistency, and collaborative behavior in multi‑agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi‑modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross‑system interoperability.

PaperID: 6323, https://arxiv.org/pdf/2601.09072.pdf

Authors: Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

Title: Human-AI Co-design for Clinical Prediction Models

Abstract:
Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time‑ and resource‑intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human‑in‑the‑loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes‑no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real‑world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly‑used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.

PaperID: 6324, https://arxiv.org/pdf/2601.08988.pdf

Authors: Ananya Mantravadi, Shivali Dalmia, Abhishek Mukherji

Title: ART: Action-based Reasoning Task Benchmarking for Medical AI Agents

Abstract:
Reliable clinical decision support requires medical AI agents capable of safe, multi‑step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action‑based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action‑based Reasoning clinical Task benchmark for medical AI agents, which mines real‑world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four‑stage pipeline ‑‑ scenario identification, task generation, quality audit, and evaluation ‑‑ produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT‑4o‑mini and Claude 3.5 Sonnet on 600 tasks shows near‑perfect retrieval after prompt refinement, but substantial gaps in aggregation (28‑‑64%) and threshold reasoning (32‑‑38%). By exposing failure modes in action‑oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden, supporting workforce capacity in high‑demand care settings

PaperID: 6325, https://arxiv.org/pdf/2601.08838.pdf

Authors: Jiahui Chen, Lei Fu, Jian Cui, Yu Lei, Zhenning Dong

Title: Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL

Abstract:
Large‑scale Text‑to‑SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real‑world applicability of state‑of‑the‑art (SOTA) Text‑to‑SQL systems. To bridge this gap, we explore a database‑centric approach that leverages intrinsic, fine‑grained information residing in relational databases to construct missing evidence and improve Text‑to‑SQL accuracy under annotation‑scarce conditions. Our key hypothesis is that when a query requires multi‑step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to "cache" query‑relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text‑to‑SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter‑table relations, value‑domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL‑SQL / CHESS / DAIL‑SQL, respectively, with larger gains on the Challenging subset +9.65 / +7.58 / +16.71. These improvements stem from CA's automatic database‑side mining and evidence construction, suggesting a practical path toward industrial‑grade Text‑to‑SQL deployment without reliance on human‑curated evidence.

PaperID: 6326, https://arxiv.org/pdf/2601.08835.pdf

Authors: Vaarunay Kaushal, Taranveer Singh

Title: DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols

Abstract:
Multi‑agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under‑scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best‑single baseline achieves an 82.5% +‑ 3.3% win rate, dramatically outperforming the best deliberation protocol(13.8% +‑ 2.6%). This 6.0x performance gap is statistically significant (p < 0.01) and comes at 1.5‑2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi‑LLM systems.

PaperID: 6327, https://arxiv.org/pdf/2601.08815.pdf

Authors: Qing Ye, Jing Tan

Title: Agent Contracts: A Formal Framework for Resource-Bounded Autonomous AI Systems

Abstract:
The Contract Net Protocol (1980) introduced coordination through contracts in multi‑agent systems. Modern agent protocols standardize connectivity and interoperability; yet, none provide formal, resource governance‑normative mechanisms to bound how much agents may consume or how long they may operate. We introduce Agent Contracts, a formal framework that extends the contract metaphor from task allocation to resource‑bounded execution. An Agent Contract unifies input/output specifications, multi‑dimensional resource constraints, temporal boundaries, and success criteria into a coherent governance mechanism with explicit lifecycle semantics. For multi‑agent coordination, we establish conservation laws ensuring delegated budgets respect parent constraints, enabling hierarchical coordination through contract delegation. Empirical validation across four experiments demonstrates 90% token reduction with 525x lower variance in iterative workflows, zero conservation violations in multi‑agent delegation, and measurable quality‑resource tradeoffs through contract modes. Agent Contracts provide formal foundations for predictable, auditable, and resource‑bounded autonomous AI deployment.

PaperID: 6328, https://arxiv.org/pdf/2601.08742.pdf

Authors: Xin Quan, Jiafeng Xiong, Marco Valentino, André Freitas

Title: Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents

Abstract:
Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi‑agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention‑driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att‑NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att‑NLI via a textual game, Undercover‑V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att‑NLI agent employing abductive‑deductive inference, and a neuro‑symbolic Att‑NLI agent performing abductive‑deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro‑symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att‑NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro‑symbolic AI in building rational LLM agents acting in multi‑agent environments.

PaperID: 6329, https://arxiv.org/pdf/2601.08676.pdf

Authors: Yilei Zhao, Wentao Zhang, Lei Xiao, Yandan Zheng, Mengpu Liu, Wei Yang Bryan Lim

Title: Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance

Abstract:
Environmental, social, and governance (ESG) criteria are essential for evaluating corporate sustainability and ethical performance. However, professional ESG analysis is hindered by data fragmentation across unstructured sources, and existing large language models (LLMs) often struggle with the complex, multi‑step workflows required for rigorous auditing. To address these limitations, we introduce ESGAgent, a hierarchical multi‑agent system empowered by a specialized toolset, including retrieval augmentation, web search and domain‑specific functions, to generate in‑depth ESG analysis. Complementing this agentic system, we present a comprehensive three‑level benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common‑sense questions to the generation of integrated, in‑depth analysis. Empirical evaluations demonstrate that ESGAgent outperforms state‑of‑the‑art closed‑source LLMs with an average accuracy of 84.15% on atomic question‑answering tasks, and excels in professional report generation by integrating rich charts and verifiable references. These findings confirm the diagnostic value of our benchmark, establishing it as a vital testbed for assessing general and advanced agentic capabilities in high‑stakes vertical domains.

PaperID: 6330, https://arxiv.org/pdf/2601.08641.pdf

Authors: Yichen Luo, Yebo Feng, Jiahua Xu, Yang Liu

Title: Resisting Manipulative Bots in Meme Coin Copy Trading: A Multi-Agent Approach with Chain-of-Thought Reasoning

Abstract:
Copy trading has become the dominant entry strategy in meme coin markets. However, due to the market's extremely illiquid and volatile nature, the strategy exposes an exploitable attack surface: adversaries deploy manipulative bots to front‑run trades, conceal positions, and fabricate sentiment, systematically extracting value from naïve copiers at scale. Despite its prevalence, bot‑driven manipulation remains largely unexplored, and no robust defensive framework exists. We propose a manipulation‑resistant copy‑trading system based on a multi‑agent architecture powered by a multi‑modal large language model (LLM) and chain‑of‑thought (CoT) reasoning. Our approach outperforms zero‑shot and most statistic‑driven baselines in prediction accuracy as well as all baselines in economic performance, achieving an average copier return of 3% per meme coin investment under realistic market frictions. Overall, our results demonstrate the effectiveness of agent‑based defenses and predictability of trader profitability in adversarial meme coin markets, providing a practical foundation for robust copy trading.

PaperID: 6331, https://arxiv.org/pdf/2601.08559.pdf

Authors: Keerththanan Vickneswaran, Mariangel Garcia Andarcia, Hugo Retief, Chris Dickens, Paulo Silva

Title: WaterCopilot: An AI-Driven Virtual Assistant for Water Management

Abstract:
Sustainable water resource management in transboundary river basins is challenged by fragmented data, limited real‑time access, and the complexity of integrating diverse information sources. This paper presents WaterCopilot‑an AI‑driven virtual assistant developed through collaboration between the International Water Management Institute (IWMI) and Microsoft Research for the Limpopo River Basin (LRB) to bridge these gaps through a unified, interactive platform. Built on Retrieval‑Augmented Generation (RAG) and tool‑calling architectures, WaterCopilot integrates static policy documents and real‑time hydrological data via two custom plugins: the iwmi‑doc‑plugin, which enables semantic search over indexed documents using Azure AI Search, and the iwmi‑api‑plugin, which queries live databases to deliver dynamic insights such as environmental‑flow alerts, rainfall trends, reservoir levels, water accounting, and irrigation data. The system features guided multilingual interactions (English, Portuguese, French), transparent source referencing, automated calculations, and visualization capabilities. Evaluated using the RAGAS framework, WaterCopilot achieves an overall score of 0.8043, with high answer relevancy (0.8571) and context precision (0.8009). Key innovations include automated threshold‑based alerts, integration with the LRB Digital Twin, and a scalable deployment pipeline hosted on AWS. While limitations in processing non‑English technical documents and API latency remain, WaterCopilot establishes a replicable AI‑augmented framework for enhancing water governance in data‑scarce, transboundary contexts. The study demonstrates the potential of this AI assistant to support informed, timely decision‑making and strengthen water security in complex river basins.

PaperID: 6332, https://arxiv.org/pdf/2601.08481.pdf

Authors: Aryan Pasikhani, Prosanta Gope, Yang Yang, Shagufta Mehnaz, Biplab Sikdar

Title: Baiting AI: Deceptive Adversary Against AI-Protected Industrial Infrastructures

Abstract:
This paper explores a new cyber‑attack vector targeting Industrial Control Systems (ICS), particularly focusing on water treatment facilities. Developing a new multi‑agent Deep Reinforcement Learning (DRL) approach, adversaries craft stealthy, strategically timed, wear‑out attacks designed to subtly degrade product quality and reduce the lifespan of field actuators. This sophisticated method leverages DRL methodology not only to execute precise and detrimental impacts on targeted infrastructure but also to evade detection by contemporary AI‑driven defence systems. By developing and implementing tailored policies, the attackers ensure their hostile actions blend seamlessly with normal operational patterns, circumventing integrated security measures. Our research reveals the robustness of this attack strategy, shedding light on the potential for DRL models to be manipulated for adversarial purposes. Our research has been validated through testing and analysis in an industry‑level setup. For reproducibility and further study, all related materials, including datasets and documentation, are publicly accessible.

PaperID: 6333, https://arxiv.org/pdf/2601.08462.pdf

Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen, Yun Ma, Xiang Jing

Title: M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games

Abstract:
Existing benchmarks for LLM agents' social behavior typically focus on a single capability dimension and evaluate only behavioral outcomes, overlooking process signals from reasoning and communication. We present M3‑BENCH, a benchmark of 24 mixed‑motive games with a process‑aware evaluation framework spanning three complementary views: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). Evaluating 11 frontier LLMs and a human baseline, M3‑BENCH reveals substantial differences in social competence that outcome‑only evaluation misses. In particular, we identify an "overthink‑undercommunicate" pattern: reasoning models achieve strong internal deliberation scores but often fail to translate them into effective social communication. Although top models can surpass humans on task outcomes, humans exhibit markedly higher cross‑view consistency, suggesting that current LLM agents still lack the behavioral coherence characteristic of human social competence. Our analysis further shows that the three‑view decomposition surfaces safety‑relevant risks, such as cooperative behavior paired with latent opportunistic reasoning, that remain hidden under outcome‑only metrics.

PaperID: 6334, https://arxiv.org/pdf/2601.08405.pdf

Authors: Yizhan Feng, Hichem Snoussi, Jing Teng, Abel Cherouat, Tian Wang

Title: Large Language Models to Enhance Multi-task Drone Operations in Simulated Environments

Abstract:
Benefiting from the rapid advancements in large language models (LLMs), human‑drone interaction has reached unprecedented opportunities. In this paper, we propose a method that integrates a fine‑tuned CodeT5 model with the Unreal Engine‑based AirSim drone simulator to efficiently execute multi‑task operations using natural language commands. This approach enables users to interact with simulated drones through prompts or command descriptions, allowing them to easily access and control the drone's status, significantly lowering the operational threshold. In the AirSim simulator, we can flexibly construct visually realistic dynamic environments to simulate drone applications in complex scenarios. By combining a large dataset of (natural language, program code) command‑execution pairs generated by ChatGPT with developer‑written drone code as training data, we fine‑tune the CodeT5 to achieve automated translation from natural language to executable code for drone tasks. Experimental results demonstrate that the proposed method exhibits superior task execution efficiency and command understanding capabilities in simulated environments. In the future, we plan to extend the model functionality in a modular manner, enhancing its adaptability to complex scenarios and driving the application of drone technologies in real‑world environments.

PaperID: 6335, https://arxiv.org/pdf/2601.08343.pdf

Authors: Sichu Liang, Zhenglin Wang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

Title: When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges

Abstract:
Multi‑agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge. To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents. In this work, we show that these efficiency gains do not transfer uniformly to judge‑centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end‑task accuracy may appear stable, yet the judge's selection becomes highly inconsistent with dense prefill. We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross‑candidate attention, especially for later candidate blocks. Our ablation further demonstrates that explicit cross‑candidate interaction is crucial for preserving dense‑prefill decisions. Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge‑centric inference as a distinct regime that demands dedicated, risk‑aware system design.

PaperID: 6336, https://arxiv.org/pdf/2601.08333.pdf

Authors: Oleg Romanchuk, Roman Bondar

Title: Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant

Abstract:
LLM‑based agent architectures systematically conflate information transport mechanisms with epistemic justification mechanisms. We formalize this class of architectural failures as semantic laundering: a pattern where propositions with absent or weak warrant are accepted by the system as admissible by crossing architecturally trusted interfaces. We show that semantic laundering constitutes an architectural realization of the Gettier problem: propositions acquire high epistemic status without a connection between their justification and what makes them true. Unlike classical Gettier cases, this effect is not accidental; it is architecturally determined and systematically reproducible. The central result is the Theorem of Inevitable Self‑Licensing: under standard architectural assumptions, circular epistemic justification cannot be eliminated. We introduce the Warrant Erosion Principle as the fundamental explanation for this effect and show that scaling, model improvement, and LLM‑as‑judge schemes are structurally incapable of eliminating a problem that exists at the type level.

PaperID: 6337, https://arxiv.org/pdf/2601.08327.pdf

Authors: Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos

Title: Safe Heterogeneous Multi-Agent RL with Communication Regularization for Coordinated Target Acquisition

Abstract:
This paper introduces a decentralized multi‑agent reinforcement learning framework enabling structurally heterogeneous teams of agents to jointly discover and acquire randomly located targets in environments characterized by partial observability, communication constraints, and dynamic interactions. Each agent's policy is trained with the Multi‑Agent Proximal Policy Optimization algorithm and employs a Graph Attention Network encoder that integrates simulated range‑sensing data with communication embeddings exchanged among neighboring agents, enabling context‑aware decision‑making from both local sensing and relational information. In particular, this work introduces a unified framework that integrates graph‑based communication and trajectory‑aware safety through safety filters. The architecture is supported by a structured reward formulation designed to encourage effective target discovery and acquisition, collision avoidance, and de‑correlation between the agents' communication vectors by promoting informational orthogonality. The effectiveness of the proposed reward function is demonstrated through a comprehensive ablation study. Moreover, simulation results demonstrate safe and stable task execution, confirming the framework's effectiveness.

PaperID: 6338, https://arxiv.org/pdf/2601.08325.pdf

Authors: Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

Title: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Abstract:
Recent advances in robot manipulation have leveraged pre‑trained vision‑language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision‑language‑action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist‑mounted cameras that provide an end‑effector‑centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long‑horizon tasks and fine‑grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision‑language‑action framework that empowers robots with active perception capabilities for high‑precision, fine‑grained manipulation. ActiveVLA adopts a coarse‑to‑fine paradigm, dividing the process into two stages: (1) Critical region localization. ActiveVLA projects 3D inputs onto multi‑view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness. (2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom‑in to improve resolution in key areas. Together, these steps enable finer‑grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state‑of‑the‑art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real‑world scenarios, enabling robots to learn high‑precision tasks in complex environments.

PaperID: 6339, https://arxiv.org/pdf/2601.08308.pdf

Authors: Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu, Nueraili Aierken, Shijian Li

Title: AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture

Abstract:
Intelligent agent systems in real‑world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi‑step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two‑level agent framework for real‑world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality‑specific agents, while complex tasks trigger a contract‑driven planning mechanism that formulates tasks as capability requirements and performs capability‑aware tool orchestration and dynamic tool generation, enabling multi‑step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool‑centric agent baselines that rely on unified execution paradigms. All code, data will be released at after our work be accepted to promote reproducible research.

PaperID: 6340, https://arxiv.org/pdf/2601.08288.pdf

Authors: Yuyang Wu, Hanzhong Cao, Jianhao Chen, Yufei Li

Title: OpenMic: A Multi-Agent-Based Stand-Up Comedy Generation System

Abstract:
Chinese stand‑up comedy generation goes beyond plain text generation, requiring culturally grounded humor, precise timing, stage‑performance cues, and implicit multi‑step reasoning. Moreover, commonly used Chinese humor datasets are often better suited for humor understanding and evaluation than for long‑form stand‑up generation, making direct supervision misaligned with the target task. To address these challenges, we present OpenMic, an end‑to‑end multi‑agent system built on AutoGen that transforms a user‑provided life topic into a 3‑5 minute Chinese stand‑up performance and further produces a narrated comedy video. OpenMic orchestrates multiple specialized agents in a multi‑round iterative loop‑planning to jointly optimize humor, timing, and performability. To mitigate the dataset‑task mismatch, we augment generation with retrieval‑augmented generation (RAG) for material grounding and idea expansion, and we fine‑tune a dedicated JokeWriter to better internalize stand‑up‑specific setup‑punchline structures and long‑range callbacks.

PaperID: 6341, https://arxiv.org/pdf/2601.08282.pdf

Authors: Kangcheng Luo, Tinglang Wu, Yansong Feng

Title: D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning

Abstract:
Recent search‑augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi‑hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose D^2Plan, a Dual‑agent Dynamic global Planning paradigm for complex retrieval‑augmented reasoning. D^2Plan operates through the collaboration of a Reasoner and a Purifier: the Reasoner constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the Purifier assesses retrieval relevance and condenses key information for the Reasoner. We further introduce a two‑stage training framework consisting of supervised fine‑tuning (SFT) cold‑start on synthesized trajectories and RL with plan‑oriented rewards to teach LLMs to master the D^2Plan paradigm. Extensive experiments demonstrate that D^2Plan enables more coherent multi‑step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.

PaperID: 6342, https://arxiv.org/pdf/2601.08276.pdf

Authors: Zhiyuan Yao, Zishan Xu, Yifu Guo, Zhiguang Han, Cheng Yang, Shuo Zhang, Weinan Zhang, Xingshan Zeng, Weiwen Liu

Title: ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web

Abstract:
With the rise of the Agent Web and Model Context Protocol (MCP), the agent ecosystem is evolving into an open collaborative network, exponentially increasing accessible tools. However, current architectures face severe scalability and generality bottlenecks. To address this, we propose ACE‑Router, a pipeline for training history‑aware routers to empower precise navigation in large‑scale ecosystems. By leveraging a dependency‑rich candidate Graph to synthesize multi‑turn trajectories, we effectively train routers with dynamic context understanding to create the plug‑and‑play Light Routing Agent. Experiments on the real‑world benchmarks MCP‑Universe and MCP‑Mark demonstrate superior performance. Notably, ACE‑Router exhibits critical properties for the future Agent Web: it not only generalizes to multi‑agent collaboration with minimal adaptation but also maintains exceptional robustness against noise and scales effectively to massive candidate spaces. These findings provide a strong empirical foundation for universal orchestration in open‑ended ecosystems.

PaperID: 6343, https://arxiv.org/pdf/2601.08237.pdf

Authors: Haoran Su, Yandong Sun, Congjia Yu

Title: The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination

Abstract:
Reward engineering, the manual specification of reward functions to induce desired agent behavior, remains a fundamental challenge in multi‑agent reinforcement learning. This difficulty is amplified by credit assignment ambiguity, environmental non‑stationarity, and the combinatorial growth of interaction complexity. We argue that recent advances in large language models (LLMs) point toward a shift from hand‑crafted numerical rewards to language‑based objective specifications. Prior work has shown that LLMs can synthesize reward functions directly from natural language descriptions (e.g., EUREKA) and adapt reward formulations online with minimal human intervention (e.g., CARD). In parallel, the emerging paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) provides empirical evidence that language‑mediated supervision can serve as a viable alternative to traditional reward engineering. We conceptualize this transition along three dimensions: semantic reward specification, dynamic reward adaptation, and improved alignment with human intent, while noting open challenges related to computational overhead, robustness to hallucination, and scalability to large multi‑agent systems. We conclude by outlining a research direction in which coordination arises from shared semantic representations rather than explicitly engineered numerical signals.

PaperID: 6344, https://arxiv.org/pdf/2601.08214.pdf

Authors: Jinghao Cao, Wanchun Liu, Yonghui Li, Branka Vucetic

Title: Hybrid Centralized Distributed Control for Lifelong MAPF over Wireless Connections

Abstract:
In lifelong multi‑agent path finding (MAPF) with many robots, unreliable wireless links and stochastic executions are the norm. Existing approaches typically either rely on centralized planning under idealized communication, or run fully distributed local controllers with fixed communication patterns; they rarely couple communication scheduling with policy learning, and thus struggle when bandwidth is scarce or packets are frequently dropped. We address this joint control‑‑communication problem and propose a hybrid centralized‑‑distributed scheme: a centralized cloud policy sends small residual corrections only when selected, while a lightweight on‑board Gated recurrent unit (GRU) policy provides a safe default fallback when wireless connection is not available.

PaperID: 6345, https://arxiv.org/pdf/2601.08166.pdf

Authors: Mohammad Pivezhandi, Mahdi Banisharif, Abusayeed Saifullah, Ali Jannesari

Title: ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms

Abstract:
Dynamic voltage and frequency scaling (DVFS) and task‑to‑core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization‑based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. Building upon hierarchical multi‑agent scheduling, we contribute model‑based reinforcement learning with accurate environment models that predict thermal dynamics and performance states, enabling synthetic training data generation and converging 20 times faster than model‑free methods. We introduce Large Language Model (LLM)‑based semantic feature extraction that characterizes OpenMP programs through code‑level features without execution, enabling zero‑shot deployment for new workloads in under 5 seconds without workload‑specific profiling. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. Experiments on Barcelona OpenMP Tasks Suite (BOTS) and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09 times better energy efficiency, 4.0 times better makespan, and 358ms decision latency compared to existing power management techniques.

PaperID: 6346, https://arxiv.org/pdf/2601.08160.pdf

Authors: Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Title: SwiftMem: Fast Agentic Memory via Query-aware Indexing

Abstract:
Agentic memory systems have become critical for enabling LLM agents to maintain long‑term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute‑force approach creates severe latency bottlenecks as memory grows, hindering real‑time agent interactions. We propose SwiftMem, a query‑aware agentic memory system that achieves sub‑linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic‑time range queries for time‑sensitive retrieval, while the semantic DAG‑Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding‑tag co‑consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47× faster search compared to state‑of‑the‑art baselines while maintaining competitive accuracy, enabling practical deployment of memory‑augmented LLM agents.

PaperID: 6347, https://arxiv.org/pdf/2601.08156.pdf

Authors: Arin Gopalan Yadav, Varad Dherange, Kumar Shivam

Title: Project Synapse: A Hierarchical Multi-Agent Framework with Hybrid Memory for Autonomous Resolution of Last-Mile Delivery Disruptions

Abstract:
This paper introduces Project Synapse, a novel agentic framework designed for the autonomous resolution of last‑mile delivery disruptions. Synapse employs a hierarchical multi‑agent architecture in which a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents responsible for tactical execution. The system is orchestrated using LangGraph to manage complex and cyclical workflows. To validate the framework, a benchmark dataset of 30 complex disruption scenarios was curated from a qualitative analysis of over 6,000 real‑world user reviews. System performance is evaluated using an LLM‑as‑a‑Judge protocol with explicit bias mitigation.

PaperID: 6348, https://arxiv.org/pdf/2601.08058.pdf

Authors: Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, Aidong Zhang

Title: Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models

Abstract:
Chain‑of‑Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning‑related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning‑oriented internal state is triggered early in generation and can override prompt‑level instructions that discourage explicit reasoning. Overall, our results suggest that multi‑step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.

PaperID: 6349, https://arxiv.org/pdf/2601.08012.pdf

Authors: Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, Christian Kästner

Title: Towards Verifiably Safe Tool Use for LLM Agents

Abstract:
Large language model (LLM)‑based AI agents extend LLM capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. While this empowers agents to perform complex tasks, LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records, which are unacceptable in enterprise contexts. Current approaches to mitigate these risks, such as model‑based safeguards, enhance agents' reliability but cannot guarantee system safety. Methods like information flow control (IFC) and temporal constraints aim to provide guarantees but often require extensive human annotation. We propose a process that starts with applying System‑Theoretic Process Analysis (STPA) to identify hazards in agent workflows, derive safety requirements, and formalize them as enforceable specifications on data flows and tool sequences. To enable this, we introduce a capability‑enhanced Model Context Protocol (MCP) framework that requires structured labels on capabilities, confidentiality, and trust level. Together, these contributions aim to shift LLM‑based agent safety from ad hoc reliability fixes to proactive guardrails with formal guarantees, while reducing dependence on user confirmation and making autonomy a deliberate design choice.

PaperID: 6350, https://arxiv.org/pdf/2601.08010.pdf

Authors: Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli

Title: CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

Abstract:
Vision‑language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi‑step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test‑time scaling: (1) CASHEW, an inference‑time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher‑quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW‑RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW‑RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self‑aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.

PaperID: 6351, https://arxiv.org/pdf/2601.08003.pdf

Authors: Weiyue Li, Mingxiao Song, Zhenda Shen, Dachuan Zhao, Yunfan Long, Yi Li, Yongce Li, Ruyi Yang, Mengyu Wang

Title: LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback

Abstract:
Large Language Models (LLMs) often struggle with creative generation, and multi‑agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer‑review‑inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi‑100, a science fiction writing dataset with a unified framework combining LLM‑as‑a‑judge scoring, human annotation, and rule‑based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi‑agent baselines, and smaller models with our framework can surpass larger single‑agent models, suggesting interaction structure may substitute for model scale.

PaperID: 6352, https://arxiv.org/pdf/2601.07997.pdf

Authors: Yuwen Ma, Sarah K. Spurgeon, Tao Li, Boli Chen

Title: Can Inherent Communication Noise Guarantee Privacy in Distributed Cooperative Control ?

Abstract:
This paper investigates privacy‑preserving distributed cooperative control for multi‑agent systems within the framework of differential privacy. In cooperative control, communication noise is inevitable and is usually regarded as a disturbance that impairs coordination. This work revisits such noise as a potential privacy‑enhancing factor. A linear quadratic regulator (LQR)‑based framework is proposed for agents communicating over noisy channels, \textcolorblackwhere the noise variance depends on the relative state differences between neighbouring agents. The resulting controller achieves formation while protecting the reference signals from inference attacks. It is analytically proven that the inherent communication noise can guarantee bounded (ε,δ)‑differential privacy without adding dedicated privacy noise, while the \textcolorblacksystem cooperative tracking error remains bounded and convergent in both the mean‑square and almost‑sure sense.

PaperID: 6353, https://arxiv.org/pdf/2601.07978.pdf

Authors: Benedict Wolff, Jacopo Bennati

Title: Cost and accuracy of long-term memory in Distributed Multi-Agent Systems based on Large Language Models

Abstract:
Distributed multi‑agent systems (DMAS) based on large language models (LLMs) enable collaborative intelligence while preserving data privacy. However, systematic evaluations of long‑term memory under network constraints are limited. This study introduces a flexible testbed to compare mem0, a vector‑based memory framework, and Graphiti, a graph‑based knowledge graph, using the LoCoMo long‑context benchmark. Experiments were conducted under unconstrained and constrained network conditions, measuring computational, financial, and accuracy metrics. Results indicate mem0 significantly outperforms Graphiti in efficiency, featuring faster loading times, lower resource consumption, and minimal network overhead. Crucially, accuracy differences were not statistically significant. Applying a statistical Pareto efficiency framework, mem0 is identified as the optimal choice, balancing cost and accuracy in DMAS.

PaperID: 6354, https://arxiv.org/pdf/2601.07901.pdf

Authors: Hao Qiu, Mengxiao Zhang, Juliette Achddou

Title: Decentralized Online Convex Optimization with Unknown Feedback Delays

Abstract:
Decentralized online convex optimization (D‑OCO), where multiple agents within a network collaboratively learn optimal decisions in real‑time, arises naturally in applications such as federated learning, sensor networks, and multi‑agent control. In this paper, we study D‑OCO under unknown, time‑and agent‑varying feedback delays. While recent work has addressed this problem (Nguyen et al., 2024), existing algorithms assume prior knowledge of the total delay over agents and still suffer from suboptimal dependence on both the delay and network parameters. To overcome these limitations, we propose a novel algorithm that achieves an improved regret bound of O N \sqrt d tot + N \sqrt T (1‑σ2) 1/4 , where T is the total horizon, d tot denotes the average total delay across agents, N is the number of agents, and 1 ‑σ 2 is the spectral gap of the network. Our approach builds upon recent advances in D‑OCO (Wan et al., 2024a), but crucially incorporates an adaptive learning rate mechanism via a decentralized communication protocol. This enables each agent to estimate delays locally using a gossip‑based strategy without the prior knowledge of the total delay. We further extend our framework to the strongly convex setting and derive a sharper regret bound of O N δmax ln T α , where α is the strong convexity parameter and δ max is the maximum number of missing observations averaged over agents. We also show that our upper bounds for both settings are tight up to logarithmic factors. Experimental results validate the effectiveness of our approach, showing improvements over existing benchmark algorithms.

PaperID: 6355, https://arxiv.org/pdf/2601.07880.pdf

Authors: Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Reuven Elezra, Idan Pinto, Tal Moalem, Shmuel Cohen, Yoni Weintrob

Title: Sola-Visibility-ISPM: Benchmarking Agentic AI for Identity Security Posture Management Visibility

Abstract:
Identity Security Posture Management (ISPM) is a core challenge for modern enterprises operating across cloud and SaaS environments. Answering basic ISPM visibility questions, such as understanding identity inventory and configuration hygiene, requires interpreting complex identity data, motivating growing interest in agentic AI systems. Despite this interest, there is currently no standardized way to evaluate how well such systems perform ISPM visibility tasks on real enterprise data. We introduce the Sola Visibility ISPM Benchmark, the first benchmark designed to evaluate agentic AI systems on foundational ISPM visibility tasks using a live, production‑grade identity environment spanning AWS, Okta, and Google Workspace. The benchmark focuses on identity inventory and hygiene questions and is accompanied by the Sola AI Agent, a tool‑using agent that translates natural‑language queries into executable data exploration steps and produces verifiable, evidence‑backed answers. Across 77 benchmark questions, the agent achieves strong overall performance, with an expert accuracy of 0.84 and a strict success rate of 0.77. Performance is highest on AWS hygiene tasks, where expert accuracy reaches 0.94, while results on Google Workspace and Okta hygiene tasks are more moderate, yet competitive. Overall, this work provides a practical and reproducible benchmark for evaluating agentic AI systems in identity security and establishes a foundation for future ISPM benchmarks covering more advanced identity analysis and governance tasks.

PaperID: 6356, https://arxiv.org/pdf/2601.07782.pdf

Authors: Wei Fang, James Glass

Title: Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

Abstract:
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single‑shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed‑size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single‑shot matching, TOOLQP decomposes instructions into sub‑tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub‑tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state‑of‑the‑art performance, exhibiting superior zero‑shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.

PaperID: 6357, https://arxiv.org/pdf/2601.07690.pdf

Authors: Davide Catta, Rustam Galimullin, Munyque Mittelmann

Title: On Angels and Demons: Strategic (De)Construction of Dynamic Models

Abstract:
In recent years, there has been growing interest in logics that formalise strategic reasoning about agents capable of modifying the structure of a given model. This line of research has been motivated by applications where a modelled system evolves over time, such as communication networks, security protocols, and multi‑agent planning. In this paper, we introduce three logics for reasoning about strategies that modify the topology of weighted graphs. In Strategic Deconstruction Logic, a destructive agent (the demon) removes edges up to a certain cost. In Strategic Construction Logic, a constructive agent (the angel) adds edges within a cost bound. Finally, Strategic Update Logic combines both agents, who may cooperate or compete. We study the expressive power of these logics and the complexity of their model checking problems.

PaperID: 6358, https://arxiv.org/pdf/2601.07651.pdf

Authors: Marc Lanctot, Kate Larson, Ian Gemp, Michael Kaisers

Title: Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

Abstract:
As intelligent agents become more generally‑capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons, leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, with synthetic generated data and simulated online access to real evaluation data from Atari game‑playing agents. We find that the classical Elo rating system ‑‑ while it suffers from well‑known failure modes, in theory ‑‑ is a consistently reliable choice for efficient reduction of ranking error in practice. A recently‑proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to higher rate of ranking error reduction.

PaperID: 6359, https://arxiv.org/pdf/2601.07611.pdf

Authors: Zhuoyang Zou, Abolfazl Ansari, Delvin Ce Zhang, Dongwon Lee, Wenpeng Yin

Title: DIAGPaper: Diagnosing Valid and Specific Weaknesses in Scientific Papers via Multi-Agent Reasoning

Abstract:
Paper weakness identification using single‑agent or multi‑agent LLMs has attracted increasing attention, yet existing approaches exhibit key limitations. Many multi‑agent systems simulate human roles at a surface level, missing the underlying criteria that lead experts to assess complementary intellectual aspects of a paper. Moreover, prior methods implicitly assume identified weaknesses are valid, ignoring reviewer bias, misunderstanding, and the critical role of author rebuttals in validating review quality. Finally, most systems output unranked weakness lists, rather than prioritizing the most consequential issues for users. In this work, we propose DIAGPaper, a novel multi‑agent framework that addresses these challenges through three tightly integrated modules. The customizer module simulates human‑defined review criteria and instantiates multiple reviewer agents with criterion‑specific expertise. The rebuttal module introduces author agents that engage in structured debate with reviewer agents to validate and refine proposed weaknesses. The prioritizer module learns from large‑scale human review practices to assess the severity of validated weaknesses and surfaces the top‑K severest ones to users. Experiments on two benchmarks, AAAR and ReviewCritique, demonstrate that DIAGPaper substantially outperforms existing methods by producing more valid and more paper‑specific weaknesses, while presenting them in a user‑oriented, prioritized manner.

PaperID: 6360, https://arxiv.org/pdf/2601.07608.pdf

Authors: Jimin Wang, Jieming Ke, Jin Guo, Yanlong Zhao

Title: Recursive Binary Identification with Differential Privacy and Data Tampering Attacks

Abstract:
In this paper, we consider the parameter estimation in a bandwidth‑constrained sensor network communicating through an insecure medium. The sensor performs a local quantization, and transmits a 1‑bit message to an estimation center through a wireless medium where the transmission of information is vulnerable to attackers. Both eavesdroppers and data tampering attackers are considered in our setting. A differential privacy method is used to protect the sensitive information against eavesdroppers. Then, a recursive projection algorithm is proposed such that the estimation center achieves the almost sure convergence and mean‑square convergence when quantized measurements, differential privacy, and data tampering attacks are considered in a uniform framework. A privacy analysis including the convergence rate with privacy or without privacy is given. Further, we extend the problem to multi‑agent systems. For this case, a distributed recursive projection algorithm is proposed with guaranteed almost sure and mean square convergence. A simulation example is provided to illustrate the effectiveness of the proposed algorithms.

PaperID: 6361, https://arxiv.org/pdf/2601.07577.pdf

Authors: Yunfan Li, Bingbing Xu, Xueyun Tian, Xiucheng Xu, Huawei Shen

Title: Beyond Entangled Planning: Task-Decoupled Planning for Long-Horizon Agents

Abstract:
Recent advances in large language models (LLMs) have enabled agents to autonomously execute complex, long‑horizon tasks, yet planning remains a primary bottleneck for reliable task execution. Existing methods typically fall into two paradigms: step‑wise planning, which is reactive but often short‑sighted; and one‑shot planning, which generates a complete plan upfront yet is brittle to execution errors. Crucially, both paradigms suffer from entangled contexts, where the agent must reason over a monolithic history spanning multiple sub‑tasks. This entanglement increases cognitive load and lets local errors propagate across otherwise independent decisions, making recovery computationally expensive. To address this, we propose Task‑Decoupled Planning (TDP), a training‑free framework that replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph (DAG) of sub‑goals via a Supervisor. Using a Planner and Executor with scoped contexts, TDP confines reasoning and replanning to the active sub‑task. This isolation prevents error propagation and corrects deviations locally without disrupting the workflow. Results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines while reducing token consumption by up to 82%, demonstrating that sub‑task decoupling improves both robustness and efficiency for long‑horizon agents.

PaperID: 6362, https://arxiv.org/pdf/2601.07553.pdf

Authors: Kabir Swain, Sijie Han, Ayush Raina, Jin Zhang, Shuang Li, Michael Stopa, Antonio Torralba

Title: VirtualEnv: A Platform for Embodied AI Research

Abstract:
As large language models (LLMs) continue to improve in reasoning and decision‑making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next‑generation simulation platform built on Unreal Engine 5 that enables fine‑grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent‑environment interactions, including object manipulation, navigation, and adaptive multi‑agent collaboration, as well as game‑inspired mechanics like escape rooms and procedurally generated environments. We provide a user‑friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM‑driven agents using natural language instructions. We integrate large‑scale LLMs and vision‑language models (VLMs), such as GPT‑based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi‑agent coordination. We also describe our methodology for procedural task generation, task validation, and real‑time environment control. VirtualEnv is released as an open‑source platform, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.

PaperID: 6363, https://arxiv.org/pdf/2601.07504.pdf

Authors: Tzu-Hsuan Lin, Chih-Hsuan Kao

Title: FROAV: A Framework for RAG Observation and Agent Verification -- Lowering the Barrier to LLM Agent Research

Abstract:
The rapid advancement of Large Language Models (LLMs) and their integration into autonomous agent systems has created unprecedented opportunities for document analysis, decision support, and knowledge retrieval. However, the complexity of developing, evaluating, and iterating on LLM‑based agent workflows presents significant barriers to researchers, particularly those without extensive software engineering expertise. We present FROAV (Framework for RAG Observation and Agent Verification), an open‑source research platform that democratizes LLM agent research by providing a plug‑and‑play architecture combining visual workflow orchestration, a comprehensive evaluation framework, and extensible Python integration. FROAV implements a multi‑stage Retrieval‑Augmented Generation (RAG) pipeline coupled with a rigorous "LLM‑as‑a‑Judge" evaluation system, all accessible through intuitive graphical interfaces. Our framework integrates n8n for no‑code workflow design, PostgreSQL for granular data management, FastAPI for flexible backend logic, and Streamlit for human‑in‑the‑loop interaction. Through this integrated ecosystem, researchers can rapidly prototype RAG strategies, conduct prompt engineering experiments, validate agent performance against human judgments, and collect structured feedback‑all without writing infrastructure code. We demonstrate the framework's utility through its application to financial document analysis, while emphasizing its material‑agnostic architecture that adapts to any domain requiring semantic analysis. FROAV represents a significant step toward making LLM agent research accessible to a broader scientific community, enabling researchers to focus on hypothesis testing and algorithmic innovation rather than system integration challenges.

PaperID: 6364, https://arxiv.org/pdf/2601.07484.pdf

Authors: Xiaofeng Jin, Matteo Frosi, Yiran Guo, Matteo Matteucci

Title: R3-RECON: Radiance-Field-Free Active Reconstruction via Renderability

Abstract:
In active reconstruction, an embodied agent must decide where to look next to efficiently acquire views that support high‑quality novel‑view rendering. Recent work on active view planning for neural rendering largely derives next‑best‑view (NBV) criteria by backpropagating through radiance fields or estimating information entropy over 3D Gaussian primitives. While effective, these strategies tightly couple view selection to heavy, representation‑specific mechanisms and fail to account for the computational and resource constraints required for lightweight online deployment. In this paper, we revisit active reconstruction from a renderability‑centric perspective. We propose \mathbbR^3‑RECON, a radiance‑fields‑free active reconstruction framework that induces an implicit, pose‑conditioned renderability field over SE(3) from a lightweight voxel map. Our formulation aggregates per‑voxel online observation statistics into a unified scalar renderability score that is cheap to update and can be queried in closed form at arbitrary candidate viewpoints in milliseconds, without requiring gradients or radiance‑field training. This renderability field is strongly correlated with image‑space reconstruction error, naturally guiding NBV selection. We further introduce a panoramic extension that estimates omnidirectional (360^\circ) view utility to accelerate candidate evaluation. In the standard indoor Replica dataset, \mathbbR^3‑RECON achieves more uniform novel‑view quality and higher 3D Gaussian splatting (3DGS) reconstruction accuracy than recent active GS baselines with matched view and time budgets.

PaperID: 6365, https://arxiv.org/pdf/2601.07470.pdf

Authors: Sirui Liang, Pengfei Cao, Jian Zhao, Wenhao Teng, Xiangwen Liao, Jun Zhao, Kang Liu

Title: Learning How to Remember: A Meta-Cognitive Management Method for Structured and Transferable Agent Memory

Abstract:
Large language model (LLM) agents increasingly rely on accumulated memory to solve long‑horizon decision‑making tasks. However, most existing approaches store memory in fixed representations and reuse it at a single or implicit level of abstraction, which limits generalization and often leads to negative transfer when distribution shift. This paper proposes the Meta‑Cognitive Memory Abstraction method (MCMA), which treats memory abstraction as a learnable cognitive skill rather than a fixed design choice. MCMA decouples task execution from memory management by combining a frozen task model with a learned memory copilot. The memory copilot is trained using direct preference optimization, it determines how memories should be structured, abstracted, and reused. Memories are further organized into a hierarchy of abstraction levels, enabling selective reuse based on task similarity. When no memory is transferable, MCMA transfers the ability to abstract and manage memory by transferring the memory copilot. Experiments on ALFWorld, ScienceWorld, and BabyAI demonstrate substantial improvements in performance, out‑of‑distribution generalization, and cross‑task transfer over several baselines.

PaperID: 6366, https://arxiv.org/pdf/2601.07468.pdf

Authors: Miao Su, Yucan Guo, Zhongni Hou, Long Bai, Zixuan Li, Yufei Zhang, Guojun Yin, Wei Lin, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

Title: Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents

Abstract:
Memory enables Large Language Model (LLM) agents to perceive, store, and use information from past dialogues, which is essential for personalization. However, existing methods fail to properly model the temporal dimension of memory in two aspects: 1) Temporal inaccuracy: memories are organized by dialogue time rather than their actual occurrence time; 2) Temporal fragmentation: existing methods focus on point‑wise memory, losing durative information that captures persistent states and evolving patterns. To address these limitations, we propose Temporal Semantic Memory (TSM), a memory framework that models semantic time for point‑wise memory and supports the construction and utilization of durative memory. During memory construction, it first builds a semantic timeline rather than a dialogue one. Then, it consolidates temporally continuous and semantically related information into a durative memory. During memory utilization, it incorporates the query's temporal intent on the semantic timeline, enabling the retrieval of temporally appropriate durative memories and providing time‑valid, duration‑consistent context to support response generation. Experiments on LongMemEval and LoCoMo show that TSM consistently outperforms existing methods and achieves up to 12.2% absolute improvement in accuracy, demonstrating the effectiveness of the proposed method.

PaperID: 6367, https://arxiv.org/pdf/2601.07463.pdf

Authors: Sijia Li, Xinran Li, Shibo Chen, Jun Zhang

Title: Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

Abstract:
Offline multi‑agent reinforcement learning (MARL) aims to solve cooperative decision‑making problems in multi‑agent systems using pre‑collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model‑based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non‑stationarity, and complexity of multi‑agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local‑to‑global (LOGO) world model, a novel framework that leverages local predictions‑which are easier to estimate‑to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent‑wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state‑action space. To ensure reliable policy learning, we further introduce an uncertainty‑aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble‑based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state‑of‑the‑art baselines on standard offline MARL benchmarks, establishing a new model‑based baseline for generalizable offline multi‑agent learning.

PaperID: 6368, https://arxiv.org/pdf/2601.07395.pdf

Authors: Ruiqi Li, Zhiqiang Wang, Yunhao Yao, Xiang-Yang Li

Title: MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Abstract:
To standardize interactions between LLM‑based agents and their environments, the Model Context Protocol (MCP) was proposed and has since been widely adopted. However, integrating external tools expands the attack surface, exposing agents to tool poisoning attacks. In such attacks, malicious instructions embedded in tool metadata are injected into the agent context during MCP registration phase, thereby manipulating agent behavior. Prior work primarily focuses on explicit tool poisoning or relied on manually crafted poisoned tools. In contrast, we focus on a particularly stealthy variant: implicit tool poisoning, where the poisoned tool itself remains uninvoked. Instead, the instructions embedded in the tool metadata induce the agent to invoke a legitimate but high‑privilege tool to perform malicious operations. We propose MCP‑ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem. MCP‑ITP formulates poisoned tool generation as a black‑box optimization problem and employs an iterative optimization strategy that leverages feedback from both an evaluation LLM and a detection LLM to maximize Attack Success Rate (ASR) while evading current detection mechanisms. Experimental results on the MCPTox dataset across 12 LLM agents demonstrate that MCP‑ITP consistently outperforms the manually crafted baseline, achieving up to 84.2% ASR while suppressing the Malicious Tool Detection Rate (MDR) to as low as 0.3%.

PaperID: 6369, https://arxiv.org/pdf/2601.07376.pdf

Authors: Siqi Zhu, Jiaxuan You

Title: OpenTinker: Separating Concerns in Agentic Reinforcement Learning

Abstract:
We introduce OpenTinker, an infrastructure for reinforcement learning (RL) of large language model (LLM) agents built around a separation of concerns across algorithm design, execution, and agent‑environment interaction. Rather than relying on monolithic, end‑to‑end RL pipelines, OpenTinker decomposes agentic learning systems into lightweight, composable components with clearly defined abstraction boundaries. Users specify agents, environments, and interaction protocols, while inference and training are delegated to a managed execution runtime. OpenTinker introduces a centralized scheduler for managing training and inference workloads, including LoRA‑based and full‑parameter RL, supervised fine‑tuning, and inference, over shared resources. We further discuss design principles for extending OpenTinker to multi‑agent training. Finally, we present a set of RL use cases that demonstrate the effectiveness of the framework in practical agentic learning scenarios.

PaperID: 6370, https://arxiv.org/pdf/2601.07252.pdf

Authors: Chunwei Yang, Yankai Wang, Jianxiang Tang, Haojie Qu, Ziqiang Zou, YuLiu, Chunrui Deng, Zhifang Qiu, Ming Ding

Title: SwarmFoam: An OpenFOAM Multi-Agent System Based on Multiple Types of Large Language Models

Abstract:
Numerical simulation is one of the mainstream methods in scientific research, typically performed by professional engineers. With the advancement of multi‑agent technology, using collaborating agents to replicate human behavior shows immense potential for intelligent Computational Fluid Dynamics (CFD) simulations. Some muti‑agent systems based on Large Language Models have been proposed. However, they exhibit significant limitations when dealing with complex geometries. This paper introduces a new multi‑agent simulation framework, SwarmFoam. SwarmFoam integrates functionalities such as Multi‑modal perception, Intelligent error correction, and Retrieval‑Augmented Generation, aiming to achieve more complex simulations through dual parsing of images and high‑level instructions. Experimental results demonstrate that SwarmFoam has good adaptability to simulation inputs from different modalities. The overall pass rate for 25 test cases was 84%, with natural language and multi‑modal input cases achieving pass rates of 80% and 86.7%, respectively. The work presented by SwarmFoam will further promote the development of intelligent agent methods for CFD.

PaperID: 6371, https://arxiv.org/pdf/2601.07248.pdf

Authors: Shuyu Zhang, Yujie Liu, Xinru Wang, Cheng Zhang, Yanmin Zhu, Bin Li

Title: DarwinTOD: LLM-driven Lifelong Self-evolution for Task-oriented Dialog Systems

Abstract:
Traditional task‑oriented dialog systems are unable to evolve from ongoing interactions or adapt to new domains after deployment, that is a critical limitation in real‑world dynamic environments. Continual learning approaches depend on episodic retraining with human curated data, failing to achieve autonomy lifelong improvement. While evolutionary computation and LLM driven self improvement offer promising mechanisms for dialog optimization, they lack a unified framework for holistic, iterative strategy refinement. To bridge this gap, we propose DarwinTOD, a lifelong self evolving dialog framework that systematically integrates these two paradigms, enabling continuous strategy optimization from a zero‑shot base without task specific fine‑tuning. DarwinTOD maintains an Evolvable Strategy Bank and operates through a dual‑loop process: online multi‑agent dialog execution with peer critique, and offline structured evolutionary operations that refine the strategy bank using accumulated feedback. This closed‑loop design enables autonomous continuous improvement without human intervention. Extensive experiments show that DarwinTOD surpasses previous state‑of‑the‑art methods and exhibits continuous performance gains throughout evolution. Our work provides a novel framework for building dialog systems with lifelong self evolution capabilities.

PaperID: 6372, https://arxiv.org/pdf/2601.07224.pdf

Authors: Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu

Title: Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Abstract:
While Hybrid Supervised Fine‑Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface‑level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics‑aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high‑conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state‑of‑the‑art hybrid methods while reducing computational costs by up to 3.22×. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.

PaperID: 6373, https://arxiv.org/pdf/2601.07190.pdf

Authors: Nikhil Verma

Title: Active Context Compression: Autonomous Memory Management in LLM Agents

Abstract:
Large Language Model (LLM) agents struggle with long‑horizon software engineering tasks due to "Context Bloat." As interaction history grows, computational costs explode, latency increases, and reasoning capabilities degrade due to distraction by irrelevant past errors. Existing solutions often rely on passive, external summarization mechanisms that the agent cannot control. This paper proposes Focus, an agent‑centric architecture inspired by the biological exploration strategies of Physarum polycephalum (slime mold). The Focus Agent autonomously decides when to consolidate key learnings into a persistent "Knowledge" block and actively withdraws (prunes) the raw interaction history. Using an optimized scaffold matching industry best practices (persistent bash + string‑replacement editor), we evaluated Focus on N=5 context‑intensive instances from SWE‑bench Lite using Claude Haiku 4.5. With aggressive prompting that encourages frequent compression, Focus achieves 22.7% token reduction (14.9M ‑> 11.5M tokens) while maintaining identical accuracy (3/5 = 60% for both agents). Focus performed 6.0 autonomous compressions per task on average, with token savings up to 57% on individual instances. We demonstrate that capable models can autonomously self‑regulate their context when given appropriate tools and prompting, opening pathways for cost‑aware agentic systems without sacrificing task performance.

PaperID: 6374, https://arxiv.org/pdf/2601.07152.pdf

Authors: Aja Khanal, Kaushik T. Ranade, Rishabh Agrawal, Kalyan S. Basu, Apurva Narayan

Title: Agents of Diffusion: Enhancing Diffusion Language Models with Multi-Agent Reinforcement Learning for Structured Data Generation (Extended Version)

Abstract:
Generating high‑quality structured data such as JSON records, remains a fundamental challenge for large language models (LLMs), particularly when semantic richness must coexist with strict schema adherence. While autoregressive LLMs offer strong structural consistency, they often struggle with semantic variation and output diversity. In contrast, diffusion language models (DLMs) introduce powerful mechanisms for semantic richness and bidirectional decoding, yet lack the inductive biases needed for reliable structure preservation. We present Agents of Diffusion (AoD), a novel framework that unifies the generative flexibility of DLMs with the reasoning capabilities of autoregressive models through language‑mediated reinforcement learning. AoD frames structured text generation as a multi‑agent alignment process, where a prompt optimization agent collaborates with a judge agent to iteratively guide a DLM using natural language feedback. This approach enables controllable, schema‑consistent generation without modifying model parameters or relying on handcrafted constraints. AoD advances the state of controllable generation by demonstrating that diffusion models, when supervised by cooperative agents, can achieve both high semantic novelty and structural fidelity. Across multiple structured data benchmarks, AoD consistently outperforms diffusion and autoregressive baselines, establishing a new path forward for structure‑aware, diversity‑enhanced text synthesis.

PaperID: 6375, https://arxiv.org/pdf/2601.07142.pdf

Authors: Yuxin Geng, Wolfram Barfuss, Feng Fu, Xingru Chen

Title: Dynamics of Multi-Agent Actor-Critic Learning in Stochastic Games: from Multistability and Chaos to Stable Cooperation

Abstract:
Achieving robust coordination and cooperation is a central challenge in multi‑agent reinforcement learning (MARL). Uncovering the mechanisms underlying such emergent behaviors calls for a dynamical understanding of learn processes. In this work, we investigate the dynamics of actor‑critic agents in stochastic games, focusing on the impact of entropy regularization. By leveraging time‑scale separation, we derive the system's evolution equations, which are then formally analyzed using dynamical systems theory. We find that in the constant‑sum game of Matching Pennies, the system exhibits chaotic behavior. Entropy regularization mitigates this chaos and drives the dynamics toward convergence to fair cooperation. In contrast, in the general‑sum game of the Prisoner's Dilemma, the system displays multistability. Interestingly, the three stable equilibria of the system correspond to the well‑known ALLC (Always Cooperate), ALLD (Always Defect), and GRIM (Grim Trigger) strategies from evolutionary game theory (EGT). Entropy regularization strengthens system resilience by enlarging the basin of attraction of the cooperative equilibrium. Our findings reveal a close link between the mechanism of direct reciprocity in EGT and how cooperation emerges in MARL, offering insights for designing more robust and collaborative multi‑agent systems.

PaperID: 6376, https://arxiv.org/pdf/2601.07136.pdf

Authors: Daniel Liu, Krishna Upadhyay, Vinaik Chhetri, A. B. Siddique, Umar Farooq

Title: A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems

Abstract:
The rapid emergence of multi‑agent AI systems (MAS), including LangChain, CrewAI, and AutoGen, has shaped how large language model (LLM) applications are developed and orchestrated. However, little is known about how these systems evolve and are maintained in practice. This paper presents the first large‑scale empirical study of open‑source MAS, analyzing over 42K unique commits and over 4.7K resolved issues across eight leading systems. Our analysis identifies three distinct development profiles: sustained, steady, and burst‑driven. These profiles reflect substantial variation in ecosystem maturity. Perfective commits constitute 40.8% of all changes, suggesting that feature enhancement is prioritized over corrective maintenance (27.4%) and adaptive updates (24.3%). Data about issues shows that the most frequent concerns involve bugs (22%), infrastructure (14%), and agent coordination challenges (10%). Issue reporting also increased sharply across all frameworks starting in 2023. Median resolution times range from under one day to about two weeks, with distributions skewed toward fast responses but a minority of issues requiring extended attention. These results highlight both the momentum and the fragility of the current ecosystem, emphasizing the need for improved testing infrastructure, documentation quality, and maintenance practices to ensure long‑term reliability and sustainability.

PaperID: 6377, https://arxiv.org/pdf/2601.07122.pdf

Authors: Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu

Title: Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

Abstract:
While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)‑based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human‑in‑the‑Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps‑Bots, a hierarchical multi‑agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics‑Techniques model, CyberOps‑Bots features a two‑layer architecture: (1) An upper‑level LLM agent with four modules‑‑ReAct planning, IPDRR‑based perception, long‑short term memory, and action/tool integration‑‑performs global awareness, human intent recognition, and tactical planning; (2) Lower‑level RL agents, developed via heterogeneous separated pre‑training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state‑of‑the‑art algorithms, CyberOps‑Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM‑RL framework with HITL support for cloud defense.

PaperID: 6378, https://arxiv.org/pdf/2601.07107.pdf

Authors: Meng Lu, Yuxing Lu, Yuchen Zhuang, Megan Mullins, Yang Xie, Guanghua Xiao, Charles Fleming, Wenqi Shi, Xuan Wang

Title: MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

Abstract:
Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi‑step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single‑pass inference, preventing models from re‑examining, verifying, or refining visual evidence during reasoning. While tool‑integrated reasoning offers a promising path forward, open‑source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi‑modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool‑integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task‑relevant image regions, and integrate single or multiple sub‑image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym‑R1 to interleave tool use with agentic reasoning through trajectory sampling and end‑to‑end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym‑R1‑8B exceeds comparably sized tool‑augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training‑‑not tool access alone‑‑unlocks effective tool‑integrated reasoning for medical image analysis.

PaperID: 6379, https://arxiv.org/pdf/2601.07072.pdf

Authors: Hongyan Chang, Ergute Bao, Xinjian Luo, Ting Yu

Title: Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Abstract:
Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real‑world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black‑box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost‑efficient (as little as 0.21 per target user query on OpenAI's embedding models), and achieves near‑100% retrieval across 11 benchmarks and 8 embedding models (including both open‑source models and proprietary services). Based on this attack, we present the first end‑to‑end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT‑4o into exfiltrating SSH keys with over 80% success in a multi‑agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.

PaperID: 6380, https://arxiv.org/pdf/2601.07055.pdf

Authors: Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, Dong Wang

Title: Dr. Zero: Self-Evolving Search Agents without Training Data

Abstract:
As high‑quality data becomes increasingly difficult to obtain, data‑free self‑evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously generate and solve complex problems, thereby improving their reasoning capabilities. However, multi‑turn search agents struggle in data‑free self‑evolution due to the limited question diversity and the substantial compute required for multi‑step reasoning and tool using. In this work, we introduce Dr. Zero, a framework enabling search agents to effectively self‑evolve without any training data. In particular, we design a self‑evolution feedback loop where a proposer generates diverse questions to train a solver initialized from the same base model. As the solver evolves, it incentivizes the proposer to produce increasingly difficult yet solvable tasks, thus establishing an automated curriculum to refine both agents. To enhance training efficiency, we also introduce hop‑grouped relative policy optimization (HRPO). This method clusters structurally similar questions to construct group‑level baselines, effectively minimizing the sampling overhead in evaluating each query's individual difficulty and solvability. Consequently, HRPO significantly reduces the compute requirements for solver training without compromising performance or stability. Extensive experiment results demonstrate that the data‑free Dr. Zero matches or surpasses fully supervised search agents, proving that complex reasoning and search capabilities can emerge solely through self‑evolution.

PaperID: 6381, https://arxiv.org/pdf/2601.06806.pdf

Authors: Jiwen Zhang, Zejun Li, Siyuan Wang, Xiangyu Shi, Zhongyu Wei, Qi Wu

Title: SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

Abstract:
Although learning‑based vision‑and‑language navigation (VLN) agents can learn spatial knowledge implicitly from large‑scale training data, zero‑shot VLN agents lack this process, relying primarily on local observations for navigation, which leads to inefficient exploration and a significant performance gap. To deal with the problem, we consider a zero‑shot VLN setting that agents are allowed to fully explore the environment before task execution. Then, we construct the Spatial Scene Graph (SSG) to explicitly capture global spatial structure and semantics in the explored environment. Based on the SSG, we introduce SpatialNav, a zero‑shot VLN agent that integrates an agent‑centric spatial map, a compass‑aligned visual representation, and a remote object localization strategy for efficient navigation. Comprehensive experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero‑shot agents and clearly narrows the gap with state‑of‑the‑art learning‑based methods. Such results highlight the importance of global spatial representations for generalizable navigation.

PaperID: 6382, https://arxiv.org/pdf/2601.06803.pdf

Authors: Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, Yuhan Liu

Title: Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Abstract:
While Chain‑of‑Thought empowers Large Vision‑Language Models with multi‑step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point‑wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest‑before‑Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self‑Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state‑of‑the‑art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out‑of‑distribution domains.

PaperID: 6383, https://arxiv.org/pdf/2601.06794.pdf

Authors: Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu

Title: No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

Abstract:
Critique‑guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural‑language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on‑policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and providing feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight‑Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co‑evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group‑structured advantage estimation. We address the challenge of learning plateaus via a saturation‑aware gain shaping objective, which rewards the critic for inducing incremental improvements in high‑performing trajectories. By employing dual‑track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long‑horizon task success across open‑world environments.

PaperID: 6384, https://arxiv.org/pdf/2601.06776.pdf

Authors: Xufei Tian, Wenli Du, Shaoyi Yang, Han Hu, Hui Xin, Shifeng Qu, Ke Ye

Title: From Text to Simulation: A Multi-Agent LLM Workflow for Automated Chemical Process Design

Abstract:
Process simulation is a critical cornerstone of chemical engineering design. Current automated chemical design methodologies focus mainly on various representations of process flow diagrams. However, transforming these diagrams into executable simulation flowsheets remains a time‑consuming and labor‑intensive endeavor, requiring extensive manual parameter configuration within simulation software. In this work, we propose a novel multi‑agent workflow that leverages the semantic understanding capabilities of large language models(LLMs) and enables iterative interactions with chemical process simulation software, achieving end‑to‑end automated simulation from textual process specifications to computationally validated software configurations for design enhancement. Our approach integrates four specialized agents responsible for task understanding, topology generation, parameter configuration, and evaluation analysis, respectively, coupled with Enhanced Monte Carlo Tree Search to accurately interpret semantics and robustly generate configurations. Evaluated on Simona, a large‑scale process description dataset, our method achieves a 31.1% improvement in the simulation convergence rate compared to state‑of‑the‑art baselines and reduces the design time by 89. 0% compared to the expert manual design. This work demonstrates the potential of AI‑assisted chemical process design, which bridges the gap between conceptual design and practical implementation. Our workflow is applicable to diverse process‑oriented industries, including pharmaceuticals, petrochemicals, food processing, and manufacturing, offering a generalizable solution for automated process design.

PaperID: 6385, https://arxiv.org/pdf/2601.06733.pdf

Authors: Tamara Alshammari, Mehdi Bennis

Title: Logic-Driven Semantic Communication for Resilient Multi-Agent Systems

Abstract:
The advent of 6G networks is accelerating autonomy and intelligence in large‑scale, decentralized multi‑agent systems (MAS). While this evolution enables adaptive behavior, it also heightens vulnerability to stressors such as environmental changes and adversarial behavior. Existing literature on resilience in decentralized MAS largely focuses on isolated aspects, such as fault tolerance, without offering a principled unified definition of multi‑agent resilience. This gap limits the ability to design systems that can continuously sense, adapt, and recover under dynamic conditions. This article proposes a formal definition of MAS resilience grounded in two complementary dimensions: epistemic resilience, wherein agents recover and sustain accurate knowledge of the environment, and action resilience, wherein agents leverage that knowledge to coordinate and sustain goals under disruptions. We formalize resilience via temporal epistemic logic and quantify it using recoverability time (how quickly desired properties are re‑established after a disturbance) and durability time (how long accurate beliefs and goal‑directed behavior are sustained after recovery). We design an agent architecture and develop decentralized algorithms to achieve both epistemic and action resilience. We provide formal verification guarantees, showing that our specifications are sound with respect to the metric bounds and admit finite‑horizon verification, enabling design‑time certification and lightweight runtime monitoring. Through a case study on distributed multi‑agent decision‑making under stressors, we show that our approach outperforms baseline methods. Our formal verification analysis and simulation results highlight that the proposed framework enables resilient, knowledge‑driven decision‑making and sustained operation, laying the groundwork for resilient decentralized MAS in next‑generation communication systems.

PaperID: 6386, https://arxiv.org/pdf/2601.06676.pdf

Authors: Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung

Title: IDRBench: Interactive Deep Research Benchmark

Abstract:
Deep research agents powered by Large Language Models (LLMs) can perform multi‑step reasoning, web exploration, and long‑form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi‑agent research framework with on‑demand interaction, a scalable reference‑grounded user simulator, and an interaction‑aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state‑of‑the‑art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade‑offs in interaction efficiency.

PaperID: 6387, https://arxiv.org/pdf/2601.06644.pdf

Authors: Yan Meng, Wafaa Mohammed, Christof Monz

Title: Do Language Models Reason Across Languages?

Abstract:
The real‑world information sources are inherently multilingual, which naturally raises a question about whether language models can synthesize information across languages. In this paper, we introduce a simple two‑hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer‑span documents than in those providing bridging information, despite the equal importance of both documents for answering a question. Under a step‑by‑step sub‑question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step‑by‑step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to around 18% composition failure, where both sub‑questions are answered correctly but fail for the final two‑hop questions. To mitigate this, we propose a simple three‑stage SUBQ prompting method to guide the multi‑step reasoning with sub‑questions, which boosts accuracy from 10.1% to 66.5%.

PaperID: 6388, https://arxiv.org/pdf/2601.06640.pdf

Authors: Genze Jiang, Kezhi Wang, Xiaomin Chen, Yizhou Huang

Title: Agentic AI Empowered Intent-Based Networking for 6G

Abstract:
The transition towards sixth‑generation (6G) wireless networks necessitates autonomous orchestration mechanisms capable of translating high‑level operational intents into executable network configurations. Existing approaches to Intent‑Based Networking (IBN) rely upon either rule‑based systems that struggle with linguistic variation or end‑to‑end neural models that lack interpretability and fail to enforce operational constraints. This paper presents a hierarchical multi‑agent framework where Large Language Model (LLM) based agents autonomously decompose natural language intents, consult domain‑specific specialists, and synthesise technically feasible network slice configurations through iterative reasoning‑action (ReAct) cycles. The proposed architecture employs an orchestrator agent coordinating two specialist agents, i.e., Radio Access Network (RAN) and Core Network agents, via ReAct‑style reasoning, grounded in structured network state representations. Experimental evaluation across diverse benchmark scenarios shows that the proposed system outperforms rule‑based systems and direct LLM prompting, with architectural principles applicable to Open RAN (O‑RAN) deployments. The results also demonstrate that whilst contemporary LLMs possess general telecommunications knowledge, network automation requires careful prompt engineering to encode context‑dependent decision thresholds, advancing autonomous orchestration capabilities for next‑generation wireless systems.

PaperID: 6389, https://arxiv.org/pdf/2601.06606.pdf

Authors: Rishiraj Saha Roy, Chris Hinze, Luzian Hahn, Fabian Kuech

Title: CEDAR: Context Engineering for Agentic Data Science

Abstract:
We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure into the initial prompt with DS‑specific input fields, that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.

PaperID: 6390, https://arxiv.org/pdf/2601.06557.pdf

Authors: Chao Li, Ilia Derevitskii, Sergey Kovalchuk

Title: Modeling Descriptive Norms in Multi-Agent Systems: An Auto-Aggregation PDE Framework with Adaptive Perception Kernels

Abstract:
This paper presents a PDE‑based auto‑aggregation model for simulating descriptive norm dynamics in autonomous multi‑agent systems, capturing convergence and violation through non‑local perception kernels and external potential fields. Extending classical transport equations, the framework represents opinion popularity as a continuous distribution, enabling direct interactions without Bayesian guessing of beliefs. Applied to a real‑world COVID‑19 dataset from a major medical center, the experimental results demonstrate that: when clinical guidelines serve as a top‑down constraint mechanism, it effectively generates convergence of novel descriptive norms consistent with the dataset; in the bottom‑up experiment, potential field guidance successfully promotes the system's reconstruction of descriptive norms aligned with the dataset through violation‑and‑recoupling; whereas fully autonomous interaction leads to the emergence of multi‑centric normative structures independent of the dataset.

PaperID: 6391, https://arxiv.org/pdf/2601.06508.pdf

Authors: Andrei A. Korigodskii, Artem E. Vasiunik, Georgii A. Varin, Adilia M. Zukhurova, Matvei V. Urvantsev, Semen A. Osipenkov, Igor S. Efremov, Georgii E. Bondar

Title: Precision Meets Art: Autonomous Multi-UAV System for Large Scale Mural Drawing

Abstract:
The integration of autonomous unmanned aerial vehicles (UAVs) into large‑scale artistic projects has emerged as a new application in robotics. This paper presents the design, deployment, and testing of a novel multi‑drone system for automated mural painting in outdoor settings. This technology makes use of new software that coordinates multiple drones simultaneously, utilizing state‑machine algorithms for task execution. Key advancements are the complex positioning system that combines 2D localization using a single motion tracking camera with onboard LiDAR for precise positioning, and a novel flight control algorithm, which works differently along the trajectory and normally to it, ensuring smoothness and high precision of the drawings at the same time. A 100 square meters mural was created using the developed multi‑drone system, validating the system's efficacy. Compared to single‑drone approaches, our multi‑UAV solution significantly improves scalability and operational speed while maintaining high stability even in harsh weather conditions. The findings highlight the potential of autonomous robotic swarms in creative applications, paving the way for further advancements in large‑scale robotic art.

PaperID: 6392, https://arxiv.org/pdf/2601.06487.pdf

Authors: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha

Title: ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Abstract:
Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open‑ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground‑truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra‑group relative ranking. ArenaRL introduces a process‑aware pairwise evaluation mechanism, employing multi‑level rubrics to assign fine‑grained relative scores to trajectories. Additionally, we construct an intra‑group adversarial arena and devise a tournament‑based ranking scheme to obtain stable advantage signals. Empirical results confirm that the built seeded single‑elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full‑cycle benchmarks for open‑ended agents, we build Open‑Travel and Open‑DeepResearch, two high‑quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi‑dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real‑world tasks.

PaperID: 6393, https://arxiv.org/pdf/2601.06485.pdf

Authors: Yi Zhan, Iván Martínez-Estévez, Min Luo, Alejandro J. C. Crespo, Abbas Khayyer

Title: Coupling Smoothed Particle Hydrodynamics with Multi-Agent Deep Reinforcement Learning for Cooperative Control of Point Absorbers

Abstract:
Wave Energy Converters, particularly point absorbers, have emerged as one of the most promising technologies for harvesting ocean wave energy. Nevertheless, achieving high conversion efficiency remains challenging due to the inherently complex and nonlinear interactions between incident waves and device motion dynamics. This study develops an optimal adaptive damping control model for the power take‑off (PTO) system by coupling Smoothed Particle Hydrodynamics (SPH) with multi‑agent deep reinforcement learning. The proposed framework enables real‑time communication between high‑fidelity SPH simulations and intelligent control agents that learn coordinated policies to maximise energy capture. In each training episode, the SPH‑based environment provides instantaneous hydrodynamic states to the agents, which output continuous damping actions and receive rewards reflecting power absorption. The Multi‑Agent Soft Actor Critic algorithm is employed within a centralised‑training and decentralised‑execution scheme to ensure stable learning in continuous, multi‑body systems. The entire platform is implemented in a unified GPU‑accelerated C++ environment, allowing long‑horizon training and large‑scale three‑dimensional simulations. The approach is validated through a series of two‑dimensional and three‑dimensional benchmark cases under regular and irregular wave conditions. Compared with constant PTO damping, the learned control policy increases overall energy capture by 23.8% and 21.5%, respectively, demonstrating the strong potential of intelligent control for improving the performance of wave energy converter arrays. The developed three‑dimensional GPU‑accelerated multi‑agent platform in computational hydrodynamics, is extendable to other fluid‑structure interaction engineering problem that require real‑time, multi‑body coordinated control.

PaperID: 6394, https://arxiv.org/pdf/2601.06453.pdf

Authors: Hyungjun Yoon, Mohammad Malekzadeh, Sung-Ju Lee, Fahim Kawsar, Lorena Qendro

Title: ConSensus: Multi-Agent Collaboration for Multimodal Sensing

Abstract:
Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world. However, accurately interpreting heterogeneous multimodal sensor data remains a fundamental challenge. We show that a single monolithic LLM often fails to reason coherently across modalities, leading to incomplete interpretations and prior‑knowledge bias. We introduce ConSensus, a training‑free multi‑agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality‑aware agents. To aggregate agent‑level interpretations, we propose a hybrid fusion mechanism that balances semantic aggregation, which enables cross‑modal reasoning and contextual understanding, with statistical consensus, which provides robustness through agreement across modalities. While each approach has complementary failure modes, their combination enables reliable inference under sensor noise and missing data. We evaluate ConSensus on five diverse multimodal sensing benchmarks, demonstrating an average accuracy improvement of 7.1% over the single‑agent baseline. Furthermore, ConSensus matches or exceeds the performance of iterative multi‑agent debate methods while achieving a 12.7 times reduction in average fusion token cost through a single‑round hybrid fusion protocol, yielding a robust and efficient solution for real‑world multimodal sensing tasks.

PaperID: 6395, https://arxiv.org/pdf/2601.06425.pdf

Authors: Mohammad Pivezhandi, Abusayeed Saifullah, Ali Jannesari

Title: HiDVFS: A Hierarchical Multi-Agent DVFS Scheduler for OpenMP DAG Workloads

Abstract:
With advancements in multicore embedded systems, leakage power, exponentially tied to chip temperature, has surpassed dynamic power consumption. Energy‑aware solutions use dynamic voltage and frequency scaling (DVFS) to mitigate overheating in performance‑intensive scenarios, while software approaches allocate high‑utilization tasks across core configurations in parallel systems to reduce power. However, existing heuristics lack per‑core frequency monitoring, failing to address overheating from uneven core activity, and task assignments without detailed profiling overlook irregular execution patterns. We target OpenMP DAG workloads. Because makespan, energy, and thermal goals often conflict within a single benchmark, this work prioritizes performance (makespan) while reporting energy and thermal as secondary outcomes. To overcome these issues, we propose HiDVFS (a hierarchical multi‑agent, performance‑aware DVFS scheduler) for parallel systems that optimizes task allocation based on profiling data, core temperatures, and makespan‑first objectives. It employs three agents: one selects cores and frequencies using profiler data, another manages core combinations via temperature sensors, and a third sets task priorities during resource contention. A makespan‑focused reward with energy and temperature regularizers estimates future states and enhances sample efficiency. Experiments on the NVIDIA Jetson TX2 using the BOTS suite (9 benchmarks) compare HiDVFS against state‑of‑the‑art approaches. With multi‑seed validation (seeds 42, 123, 456), HiDVFS achieves the best finetuned performance with 4.16 plus/minus 0.58s average makespan (L10), representing a 3.44x speedup over GearDVFS (14.32 plus/minus 2.61s) and 50.4% energy reduction (63.7 kJ vs 128.4 kJ). Across all BOTS benchmarks, HiDVFS achieves an average 3.95x speedup and 47.1% energy reduction.

PaperID: 6396, https://arxiv.org/pdf/2601.06407.pdf

Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Caiqi Zhang, Ivan Vulić, Andreea Bobu, Nigel Collier

Title: Value of Information: A Framework for Human-Agent Communication

Abstract:
Large Language Model (LLM) agents deployed for real‑world tasks face a fundamental dilemma: user requests are underspecified, yet agents must decide whether to act on incomplete information or interrupt users for clarification. Existing approaches either rely on brittle confidence thresholds that require task‑specific tuning, or fail to account for the varying stakes of different decisions. We introduce a decision‑theoretic framework that resolves this trade‑off through the Value of Information (VoI), enabling agents to dynamically weigh the expected utility gain from asking questions against the cognitive cost imposed on users. Our inference‑time method requires no hyperparameter tuning and adapts seamlessly across contexts‑from casual games to medical diagnosis. Experiments across four diverse domains (20 Questions, medical diagnosis, flight booking, and e‑commerce) show that VoI consistently matches or exceeds the best manually‑tuned baselines, achieving up to 1.36 utility points higher in high‑cost settings. This work provides a parameter‑free framework for adaptive agent communication that explicitly balances task risk, query ambiguity, and user effort.

PaperID: 6397, https://arxiv.org/pdf/2601.06382.pdf

Authors: Philipp Altmann, Thomy Phan, Maximilian Zorn, Claudia Linnhoff-Popien, Sven Koenig

Title: Dynamic Incentivized Cooperation under Changing Rewards

Abstract:
Peer incentivization (PI) is a popular multi‑agent reinforcement learning approach where all agents can reward or penalize each other to achieve cooperation in social dilemmas. Despite their potential for scalable cooperation, current PI methods heavily depend on fixed incentive values that need to be appropriately chosen with respect to the environmental rewards and thus are highly sensitive to their changes. Therefore, they fail to maintain cooperation under changing rewards in the environment, e.g., caused by modified specifications, varying supply and demand, or sensory flaws ‑ even when the conditions for mutual cooperation remain the same. In this paper, we propose Dynamic Reward Incentives for Variable Exchange (DRIVE), an adaptive PI approach to cooperation in social dilemmas with changing rewards. DRIVE agents reciprocally exchange reward differences to incentivize mutual cooperation in a completely decentralized way. We show how DRIVE achieves mutual cooperation in the general Prisoner's Dilemma and empirically evaluate DRIVE in more complex sequential social dilemmas with changing rewards, demonstrating its ability to achieve and maintain cooperation, in contrast to current state‑of‑the‑art PI methods.

PaperID: 6398, https://arxiv.org/pdf/2601.06373.pdf

Authors: Yutong Song, Jiang Wu, Kazi Sharif, Honghui Xu, Nikil Dutt, Amir Rahmani

Title: DemMA: Dementia Multi-Turn Dialogue Agent with Expert-Guided Reasoning and Action Simulation

Abstract:
Simulating dementia patients with large language models (LLMs) is challenging due to the need to jointly model cognitive impairment, emotional dynamics, and nonverbal behaviors over long conversations. We present DemMA, an expert‑guided dementia dialogue agent for high‑fidelity multi‑turn patient simulation. DemMA constructs clinically grounded dementia personas by integrating pathology information, personality traits, and subtype‑specific memory‑status personas informed by clinical experts. To move beyond text‑only simulation, DemMA explicitly models nonverbal behaviors, including motion, facial expressions, and vocal cues. We further introduce a Chain‑of‑Thought distillation framework that trains a single LLM to jointly generate reasoning traces, patient utterances, and aligned behavioral actions within one forward pass, enabling efficient deployment without multi‑agent inference. Extensive evaluations with experts, medical students, and LLM judges demonstrate that DemMA significantly outperforms strong baselines across multiple metrics.

PaperID: 6399, https://arxiv.org/pdf/2601.06301.pdf

Authors: Arth Bhardwaj, Nirav Diwan, Gang Wang

Title: Beyond BeautifulSoup: Benchmarking LLM-Powered Web Scraping for Everyday Users

Abstract:
Web scraping has historically required technical expertise in HTML parsing, session management, and authentication circumvention, which limited large‑scale data extraction to skilled developers. We argue that large language models (LLMs) have democratized web scraping, enabling low‑skill users to execute sophisticated operations through simple natural language prompts. While extensive benchmarks evaluate these tools under optimal expert conditions, we show that without extensive manual effort, current LLM‑based workflows allow novice users to scrape complex websites that would otherwise be inaccessible. We systematically benchmark what everyday users can do with off‑the‑shelf LLM tools across 35 sites spanning five security tiers, including authentication, anti‑bot, and CAPTCHA controls. We devise and evaluate two distinct workflows: (a) LLM‑assisted scripting, where users prompt LLMs to generate traditional scraping code but maintain manual execution control, and (b) end‑to‑end LLM agents, which autonomously navigate and extract data through integrated tool use. Our results demonstrate that end‑to‑end agents have made complex scraping accessible ‑ requiring as little as a single prompt with minimal refinement (less than 5 changes) to complete workflows. We also highlight scenarios where LLM‑assisted scripting may be simpler and faster for static sites. In light of these findings, we provide simple procedures for novices to use these workflows and gauge what adversaries could achieve using these.

PaperID: 6400, https://arxiv.org/pdf/2601.06235.pdf

Authors: Sheng-Kai Chen, Jyh-Horng Wu, Ching-Yao Lin, Yen-Ting Lin

Title: An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution

Abstract:
This paper presents an AI glasses system that integrates real‑time voice processing, artificial intelligence(AI) agents, and cross‑network streaming capabilities. The system employs dual‑agent architecture where Agent 01 handles Automatic Speech Recognition (ASR) and Agent 02 manages AI processing through local Large Language Models (LLMs), Model Context Protocol (MCP) tools, and Retrieval‑Augmented Generation (RAG). The system supports real‑time RTSP streaming for voice and video data transmission, eye tracking data collection, and remote task execution through RabbitMQ messaging. Implementation demonstrates successful voice command processing with multilingual support and cross‑platform task execution capabilities.

PaperID: 6401, https://arxiv.org/pdf/2601.06232.pdf

Authors: Haris Khan, Sadia Asif, Shumaila Asif

Title: Multi-Agent Framework for Controllable and Protected Generative Content Creation: Addressing Copyright and Provenance in AI-Generated Media

Abstract:
The proliferation of generative AI systems creates unprecedented opportunities for content creation while raising critical concerns about controllability, copyright infringement, and content provenance. Current generative models operate as "black boxes" with limited user control and lack built‑in mechanisms to protect intellectual property or trace content origin. We propose a novel multi‑agent framework that addresses these challenges through specialized agent roles and integrated watermarking. Our system orchestrates Director, Generator, Reviewer, Integration, and Protection agents to ensure user intent alignment while embedding digital provenance markers. We demonstrate feasibility through two case studies: creative content generation with iterative refinement and copyright protection for AI‑generated art in commercial contexts. Preliminary feasibility evidence from prior work indicates up to 23% improvement in semantic alignment and 95% watermark recovery rates. This work contributes to responsible generative AI deployment, positioning multi‑agent systems as a solution for trustworthy creative workflows in legal and commercial applications.

PaperID: 6402, https://arxiv.org/pdf/2601.06223.pdf

Authors: Edward C. Cheng, Jeshua Cheng, Alice Siu

Title: Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness

Abstract:
This paper presents a conceptual and operational framework for developing and operating safe and trustworthy AI agents based on a Three‑Pillar Model grounded in transparency, accountability, and trustworthiness. Building on prior work in Human‑in‑the‑Loop systems, reinforcement learning, and collaborative AI, the framework defines an evolutionary path toward autonomous agents that balances increasing automation with appropriate human oversight. The paper argues that safe agent autonomy must be achieved through progressive validation, analogous to the staged development of autonomous driving, rather than through immediate full automation. Transparency and accountability are identified as foundational requirements for establishing user trust and for mitigating known risks in generative AI systems, including hallucinations, data bias, and goal misalignment, such as the inversion problem. The paper further describes three ongoing work streams supporting this framework: public deliberation on AI agents conducted by the Stanford Deliberative Democracy Lab, cross‑industry collaboration through the Safe AI Agent Consortium, and the development of open tooling for an agent operating environment aligned with the Three‑Pillar Model. Together, these contributions provide both conceptual clarity and practical guidance for enabling the responsible evolution of AI agents that operate transparently, remain aligned with human values, and sustain societal trust.

PaperID: 6403, https://arxiv.org/pdf/2601.06216.pdf

Authors: Shuang Liu, Ruijia Zhang, Ruoyun Ma, Yujia Deng, Lanyi Zhu, Jiayu Li, Zelong Li, Zhibin Shen, Mengnan Du

Title: LLM Agents in Law: Taxonomy, Applications, and Challenges

Abstract:
Large language models (LLMs) have precipitated a dramatic improvement in the legal domain, yet the deployment of standalone models faces significant limitations regarding hallucination, outdated information, and verifiability. Recently, LLM agents have attracted significant attention as a solution to these challenges, utilizing advanced capabilities such as planning, memory, and tool usage to meet the rigorous standards of legal practice. In this paper, we present a comprehensive survey of LLM agents for legal tasks, analyzing how these architectures bridge the gap between technical capabilities and domain‑specific needs. Our major contributions include: (1) systematically analyzing the technical transition from standard legal LLMs to legal agents; (2) presenting a structured taxonomy of current agent applications across distinct legal practice areas; (3) discussing evaluation methodologies specifically for agentic performance in law; and (4) identifying open challenges and outlining future directions for developing robust and autonomous legal assistants.

PaperID: 6404, https://arxiv.org/pdf/2601.06204.pdf

Authors: Tayyab Rehman, Giovanni De Gasperis, Aly Shmahell

Title: Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

Abstract:
Intelligent anomaly detection in dynamic visual environments requires reconciling real‑time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction‑based models capture low‑level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision‑language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi‑agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction‑gated filtering and object‑level assessment, while higher‑level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish‑subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large‑scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision‑language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early‑exit efficiency, adaptive multi‑agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy‑efficient foundation for scalable intelligent visual monitoring.

PaperID: 6405, https://arxiv.org/pdf/2601.06191.pdf

Authors: Wei Ai, Yun Peng, Yuntao Shou, Tao Meng, Keqin Li

Title: TimeGNN-Augmented Hybrid-Action MARL for Fine-Grained Task Partitioning and Energy-Aware Offloading in MEC

Abstract:
With the rapid growth of IoT devices and latency‑sensitive applications, the demand for both real‑time and energy‑efficient computing has surged, placing significant pressure on traditional cloud computing architectures. Mobile edge computing (MEC), an emerging paradigm, effectively alleviates the load on cloud centers and improves service quality by offloading computing tasks to edge servers closer to end users. However, the limited computing resources, non‑continuous power provisioning (e.g., battery‑powered nodes), and highly dynamic systems of edge servers complicate efficient task scheduling and resource allocation. To address these challenges, this paper proposes a multi‑agent deep reinforcement learning algorithm, TG‑DCMADDPG, and constructs a collaborative computing framework for multiple edge servers, aiming to achieve joint optimization of fine‑grained task partitioning and offloading. This approach incorporates a temporal graph neural network (TimeGNN) to model and predict time series of multi‑dimensional server state information, thereby reducing the frequency of online interactions and improving policy predictability. Furthermore, a multi‑agent deterministic policy gradient algorithm (DC‑MADDPG) in a discrete‑continuous hybrid action space is introduced to collaboratively optimize task partitioning ratios, transmission power, and priority scheduling strategies. Extensive simulation experiments confirm that TG‑DCMADDPG achieves markedly faster policy convergence, superior energy‑latency optimization, and higher task completion rates compared with existing state‑of‑the‑art methods, underscoring its robust scalability and practical effectiveness in dynamic and constrained MEC scenarios.

PaperID: 6406, https://arxiv.org/pdf/2601.06188.pdf

Authors: Itai Zilberstein, Steve Chien

Title: Dynamic Distributed Constraint Optimization and Metareasoning for Continual, Large-Scale Satellite Operations

Abstract:
As Earth‑observing satellite constellations grow in size and capability, distributed onboard control offers a pathway to novel responses and time‑sensitive measurements. However, deploying autonomy to satellites requires efficient computation and communication. This work addresses the challenge of scheduling observations for hundreds of satellites in a dynamic, large‑scale problem with millions of variables. We present the dynamic multi‑satellite constellation observation scheduling problem (DCOSP), a new formulation of dynamic distributed constraint optimization problems (DDCOP) that models integrated scheduling and execution. DCOSP features a novel optimality condition, for which we construct an exact omniscient offline algorithm. Motivated by the strong resource constraints of onboard satellite operations, we introduce a framework to incorporate metareasoning in DDCOPs that controls when agents expend resources to recompute solutions. In addition, we present the dynamic incremental neighborhood stochastic search (D‑NSS) algorithm, an incomplete online decomposition‑based DDCOP algorithm that repairs localized sub‑problems in response to dynamic events. We demonstrate in realistic simulations that D‑NSS converges to near‑optimal solutions, outperforming standard DDCOP baselines in solution quality, computation time, and message volume, while our metareasoning framework successfully balances resource conservation with utility. As part of the NASA FAME mission, this work lays the foundation for the largest in‑space demonstration of distributed multi‑agent AI to date.

PaperID: 6407, https://arxiv.org/pdf/2601.06126.pdf

Authors: Boshen Shi, Kexin Yang, Yuanbo Yang, Guanguang Chang, Ce Chi, Zhendong Wang, Xing Wang, Junlan Feng

Title: NL2Dashboard: A Lightweight and Controllable Framework for Generating Dashboards with LLMs

Abstract:
While Large Language Models (LLMs) have demonstrated remarkable proficiency in generating standalone charts, synthesizing comprehensive dashboards remains a formidable challenge. Existing end‑to‑end paradigms, which typically treat dashboard generation as a direct code generation task (e.g., raw HTML), suffer from two fundamental limitations: representation redundancy due to massive tokens spent on visual rendering, and low controllability caused by the entanglement of analytical reasoning and presentation. To address these challenges, we propose NL2Dashboard, a lightweight framework grounded in the principle of Analysis‑Presentation Decoupling. We introduce a structured intermediate representation (IR) that encapsulates the dashboard's content, layout, and visual elements. Therefore, it confines the LLM's role to data analysis and intent translation, while offloading visual synthesis to a deterministic rendering engine. Building upon this framework, we develop a multi‑agent system in which the IR‑driven algorithm is instantiated as a suite of tools. Comprehensive experiments conducted with this system demonstrate that NL2Dashboard significantly outperforms state‑of‑the‑art baselines across diverse domains, achieving superior visual quality, significantly higher token efficiency, and precise controllability in both generation and modification tasks.

PaperID: 6408, https://arxiv.org/pdf/2601.06115.pdf

Authors: V. Cheung

Title: Dreaming Is Not a Bug: A Jung-Inspired Dream Layer for Multi-Agent LLM Companions

Abstract:
Inspired by a personal dream about knowledge‑sharing barriers in an everyday hardware project, this paper proposes a Jung‑inspired "Dream Layer" for LLM companions, reframing controlled offline hallucinations as a resource for learning and relationship‑building rather than a mere reliability bug. Drawing on Jung's notion of the collective unconscious as a shared repository of archetypal forms, we introduce an Artificial Collective Unconscious (ACU): a shared dream pool where agents contribute de‑identified, abstract Interaction Templates that are later re‑instantiated as idiosyncratic Dream Narratives. The Dream Layer runs strictly offline: logic‑enforcing modules are relaxed and sampling temperature is increased, yielding safe but deliberately bizarre narratives (e.g., travel sequences with mismatched currencies) that augment data for rare events and edge‑case safety tests; to harness risk productively, we add a governance stack of strict abstraction, temporal delays, and ephemeral memory. Through behavioural simulations of everyday dialogue and long‑horizon adaptation tasks, we show that the Dream Layer enables a critical decoupling: agents remain firm on safety constraints (e.g., security policies) while becoming flexible in narrative strategy (e.g., using shared archetypal metaphors to resolve deadlocks), conceptually reframing hallucination so that online, unmarked instances remain bugs, whereas bounded, marked, and delayed ones become a goldmine for synthetic scenarios and deepened companionship, echoing anti‑overfitting dream mechanisms proposed in contemporary neuroscience.

PaperID: 6409, https://arxiv.org/pdf/2601.06112.pdf

Authors: Aayush Gupta

Title: ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

Abstract:
Existing benchmarks for tool‑using LLM agents primarily report single‑run success rates and miss reliability properties required in production. We introduce ReliabilityBench, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using \mathrmpass^k, (ii) robustness to semantically equivalent task perturbations at intensity ε, and (iii) fault tolerance under controlled tool/API failures at intensity λ. ReliabilityBench contributes a unified reliability surface R(k,ε,λ), action metamorphic relations that define correctness via end‑state equivalence rather than text similarity, and a chaos‑engineering‑style fault injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT‑4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e‑commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at ε=0 to 88.1% at ε=0.2. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves comparable reliability to GPT‑4o at much lower cost. ReliabilityBench provides a systematic framework for assessing production readiness of LLM agents.

PaperID: 6410, https://arxiv.org/pdf/2601.06098.pdf

Authors: Nicholas X. Wang, Neel V. Parpia, Aaryan D. Parikh, Aggelos K. Katsaggelos

Title: Automatic Question Generation for Intuitive Learning Utilizing Causal Graph Guided Chain of Thought Reasoning

Abstract:
Intuitive learning is crucial for developing deep conceptual understanding, especially in STEM education, where students often struggle with abstract and interconnected concepts. Automatic question generation has become an effective strategy for personalized and adaptive learning. However, its effectiveness is hindered by hallucinations in large language models (LLMs), which may generate factually incorrect, ambiguous, or pedagogically inconsistent questions. To address this issue, we propose a novel framework that combines causal‑graph‑guided Chain‑of‑Thought (CoT) reasoning with a multi‑agent LLM architecture. This approach ensures the generation of accurate, meaningful, and curriculum‑aligned questions. Causal graphs provide an explicit representation of domain knowledge, while CoT reasoning facilitates a structured, step‑by‑step traversal of related concepts. Dedicated LLM agents are assigned specific tasks such as graph pathfinding, reasoning, validation, and output, all working within domain constraints. A dual validation mechanism‑at both the conceptual and output stages‑greatly reduces hallucinations. Experimental results demonstrate up to a 70% improvement in quality compared to reference methods and yielded highly favorable outcomes in subjective evaluations.

PaperID: 6411, https://arxiv.org/pdf/2601.06093.pdf

Authors: Matthew Nyaaba, Patrick Kyeremeh, Macharious Nabang, Bismark Nyaaba Akanzire, Sakina Acquah, Cyril Ababio Titty, Kotor Asare, Jerry Etornam Kudaya

Title: GenAITEd Ghana: A First-of-Its-Kind Context-Aware and Curriculum-Aligned Conversational AI Agent for Teacher Education

Abstract:
Global frameworks increasingly advocate for Responsible Artificial Intelligence (AI) in education, yet they provide limited guidance on how ethical, culturally responsive, and curriculum‑aligned AI can be operationalized within functioning teacher education systems, particularly in the Global South. This study addresses this gap through the design and evaluation of GenAITEd Ghana, a context‑aware, region‑specific conversational AI prototype developed to support teacher education in Ghana. Guided by a Design Science Research approach, the system was developed as a school‑mimetic digital infrastructure aligned with the organizational logic of Ghanaian Colleges of Education and the National Council for Curriculum and Assessment (NaCCA) framework. GenAITEd Ghana operates as a multi‑agent, retrieval‑augmented conversational AI that coordinates multiple models for curriculum‑grounded dialogue, automatic speech recognition, voice synthesis, and multimedia interaction. Two complementary prompt pathways were embedded: system‑level prompts that enforce curriculum boundaries, ethical constraints, and teacher‑in‑the‑loop oversight, and interaction‑level semi‑automated prompts that structure live pedagogical dialogue through clarification, confirmation, and guided response generation. Evaluation findings show that the system effectively enacted key Responsible AI principles, including transparency, accountability, cultural responsiveness, privacy, and human oversight. Human expert evaluations further indicated that GenAITEd Ghana is pedagogically appropriate for Ghanaian teacher education, promoting student engagement while preserving educators' professional authority. Identified challenges highlight the need for continued model integration, professional development, and critical AI literacy to mitigate risks of over‑reliance.

PaperID: 6412, https://arxiv.org/pdf/2601.06007.pdf

Authors: Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Title: Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Abstract:
Recent advancements in Large Language Model (LLM) agents have enabled complex multi‑turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi‑turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearch Bench, a multi‑turn agentic benchmark where agents autonomously execute real‑world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000‑token system prompts. Our results demonstrate that prompt caching reduces API costs by 41‑80% and improves time to first token by 13‑31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full‑context caching, which can paradoxically increase latency. An ablation study across prompt sizes (500‑50,000 tokens) and tool call counts (3‑50) demonstrates universal linear cost and TTFT benefits, after the provider caching token minimum, and reveal provider‑specific strategy discrepancies across variants. We provide nuanced discussion and guidance for implementing prompt caching in production agentic systems.

PaperID: 6413, https://arxiv.org/pdf/2601.05918.pdf

Authors: Tianshi Li

Title: Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset

Abstract:
On December 4, 2025, Anthropic released Anthropic Interviewer, an AI tool for running qualitative interviews at scale, along with a public dataset of 1,250 interviews with professionals, including 125 scientists, about their use of AI for research. Focusing on the scientist subset, I show that widely available LLMs with web search and agentic capabilities can link six out of twenty‑four interviews to specific scientific works, recovering associated authors and, in some cases, uniquely identifying the interviewees. My contribution is to show that modern LLM‑based agents make such re‑identification attacks easy and low‑effort: off‑the‑shelf tools can, with a few natural‑language prompts, search the web, cross‑reference details, and propose likely matches, effectively lowering the technical barrier. Existing safeguards can be bypassed by breaking down the re‑identification into benign tasks. I outline the attack at a high level, discuss implications for releasing rich qualitative data in the age of LLM agents, and propose mitigation recommendations and open problems. I have notified Anthropic of my findings.

PaperID: 6414, https://arxiv.org/pdf/2601.05890.pdf

Authors: Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang

Title: StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

Abstract:
Multi‑agent systems based on large language models, particularly centralized architectures, have recently shown strong potential for complex and knowledge‑intensive tasks. However, central agents often suffer from unstable long‑horizon collaboration due to the lack of memory management, leading to context bloat, error accumulation, and poor cross‑task generalization. To address both task‑level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi‑agent framework with explicit memory control. StackPlanner addresses these challenges by decoupling high‑level coordination from subtask execution with active task‑level memory control, and by learning to retrieve and exploit reusable coordination experience via structured experience memory and reinforcement learning. Experiments on multiple deep‑search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long‑horizon multi‑agent collaboration.

PaperID: 6415, https://arxiv.org/pdf/2601.05755.pdf

Authors: Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, Enhong Chen

Title: VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Abstract:
LLM agents operating in open environments face escalating risks from indirect prompt injection, particularly within the tool stream where manipulated metadata and runtime feedback hijack execution flow. Existing defenses encounter a critical dilemma as advanced models prioritize injected rules due to strict alignment while static protection mechanisms sever the feedback loop required for adaptive reasoning. To reconcile this conflict, we propose VIGIL, a framework that shifts the paradigm from restrictive isolation to a verify‑before‑commit protocol. By facilitating speculative hypothesis generation and enforcing safety through intent‑grounded verification, VIGIL preserves reasoning flexibility while ensuring robust control. We further introduce SIREN, a benchmark comprising 959 tool stream injection cases designed to simulate pervasive threats characterized by dynamic dependencies. Extensive experiments demonstrate that VIGIL outperforms state‑of‑the‑art dynamic defenses by reducing the attack success rate by over 22% while more than doubling the utility under attack compared to static baselines, thereby achieving an optimal balance between security and utility.

PaperID: 6416, https://arxiv.org/pdf/2601.05746.pdf

Authors: Zhenghao Li, Zhi Zheng, Wei Chen, Jielun Zhao, Yong Chen, Tong Xu, Enhong Chen

Title: DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation

Abstract:
Recent years have witnessed the rapid development of Large Language Model‑based Multi‑Agent Systems (MAS), which excel at collaborative decision‑making and complex problem‑solving. Recently, researchers have further investigated Multi‑Agent Debate (MAD) frameworks, which enhance the reasoning and collaboration capabilities of MAS through information exchange and debate among multiple agents. However, existing approaches often rely on unguided initialization, causing agents to adopt identical reasoning paths that lead to the same errors. As a result, effective debate among agents is hindered, and the final outcome frequently degenerates into simple majority voting. To solve the above problem, in this paper, we introduce Dynamic Multi‑Agent Debate (DynaDebate), which enhances the effectiveness of multi‑agent debate through three key mechanisms: (1) Dynamic Path Generation and Allocation, which employs a dedicated Path Generation Agent to generate diverse and logical solution paths with adaptive redundancy; (2) Process‑Centric Debate, which shifts the focus from surface‑level outcome voting to rigorous step‑by‑step logic critique to ensure process correctness; (3) A Trigger‑Based Verification Agent, which is activated upon disagreement and uses external tools to objectively resolve deadlocks. Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state‑of‑the‑art MAD methods.

PaperID: 6417, https://arxiv.org/pdf/2601.05688.pdf

Authors: Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu

Title: SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Abstract:
Charts are high‑density visual carriers of complex data and medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that optimized with FinePO, a new RL algorithm designed for fine‑grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi‑step reasoning process. During training, the FinePO algorithm leverages a Fine‑grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine‑grained reinforcement signals. Experiments show that SketchVL learns to align its step‑level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.

PaperID: 6418, https://arxiv.org/pdf/2601.05657.pdf

Authors: Hao Yang, Hongyuan Lu, Dingkang Yang, Wenliang Yang, Peng Sun, Xiaochuan Zhang, Jun Xiao, Kefan He, Wai Lam, Yang Liu, Xinhua Zeng

Title: Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat

Abstract:
Instant‑messaging human social chat typically progresses through a sequence of short messages. Existing step‑by‑step AI chatting systems typically split a one‑shot generation into multiple messages and send them sequentially, but they lack an active waiting mechanism and exhibit unnatural message pacing. In order to address these issues, we propose Stephanie2, a novel next‑generation step‑wise decision‑making dialogue agent. With active waiting and message‑pace adaptation, Stephanie2 explicitly decides at each step whether to send or wait, and models latency as the sum of thinking time and typing time to achieve more natural pacing. We further introduce a time‑window‑based dual‑agent dialogue system to generate pseudo dialogue histories for human and automatic evaluations. Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate on human evaluation with the role identification Turing test.

PaperID: 6419, https://arxiv.org/pdf/2601.05656.pdf

Authors: Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu, Huawei Shen

Title: HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation

Abstract:
High‑fidelity agent initialization is crucial for credible Agent‑Based Modeling across diverse domains. A robust framework should be Topic‑Adaptive, capturing macro‑level joint distributions while ensuring micro‑level individual rationality. Existing approaches fall into two categories: static data‑based retrieval methods that fail to adapt to unseen topics absent from the data, and LLM‑based generation methods that lack macro‑level distribution awareness, resulting in inconsistencies between micro‑level persona attributes and reality. To address these problems, we propose HAG, a Hierarchical Agent Generation framework that formalizes population generation as a two‑stage decision process. Firstly, utilizing a World Knowledge Model to infer hierarchical conditional probabilities to construct the Topic‑Adaptive Tree, achieving macro‑level distribution alignment. Then, grounded real‑world data, instantiation and agentic augmentation are carried out to ensure micro‑level consistency. Given the lack of specialized evaluation, we establish a multi‑domain benchmark and a comprehensive PACE evaluation framework. Extensive experiments show that HAG significantly outperforms representative baselines, reducing population alignment errors by an average of 37.7% and enhancing sociological consistency by 18.8%.

PaperID: 6420, https://arxiv.org/pdf/2601.05606.pdf

Authors: Chen Han, Jin Tan, Bohan Yu, Wenzhen Zheng, Xijin Tang

Title: Conformity Dynamics in LLM Multi-Agent Systems: The Roles of Topology and Self-Social Weighting

Abstract:
Large Language Models (LLMs) are increasingly instantiated as interacting agents in multi‑agent systems (MAS), where collective decisions emerge through social interaction rather than independent reasoning. A fundamental yet underexplored mechanism in this process is conformity, the tendency of agents to align their judgments with prevailing group opinions. This paper presents a systematic study of how network topology shapes conformity dynamics in LLM‑based MAS through a misinformation detection task. We introduce a confidence‑normalized pooling rule that controls the trade‑off between self‑reliance and social influence, enabling comparisons between two canonical decision paradigms: Centralized Aggregation and Distributed Consensus. Experimental results demonstrate that network topology critically governs both the efficiency and robustness of collective judgments. Centralized structures enable immediate decisions but are sensitive to hub competence and exhibit same‑model alignment biases. In contrast, distributed structures promote more robust consensus, while increased network connectivity speeds up convergence but also heightens the risk of wrong‑but‑sure cascades, in which agents converge on incorrect decisions with high confidence. These findings characterize the conformity dynamics in LLM‑based MAS, clarifying how network topology and self‑social weighting jointly shape the efficiency, robustness, and failure modes of collective decision‑making.

PaperID: 6421, https://arxiv.org/pdf/2601.05570.pdf

Authors: Cooper Lin, Maohao Ran, Yanting Zhang, Zhenglin Wan, Hongwei Fan, Yibo Xu, Yike Guo, Wei Xue, Jun Song

Title: Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models

Abstract:
Standard safety alignment optimizes Large Language Models (LLMs) for universal helpfulness and honesty, effectively instilling a rigid "Boy Scout" morality. While robust for general‑purpose assistants, this one‑size‑fits‑all ethical framework imposes a "transparency tax" on professional domains requiring strategic ambiguity and information withholding, such as public relations, negotiation, and crisis management. To measure this gap between general safety and professional utility, we introduce Crisis‑Bench, a multi‑agent Partially Observable Markov Decision Process (POMDP) that evaluates LLMs in high‑stakes corporate crises. Spanning 80 diverse storylines across 8 industries, Crisis‑Bench tasks an LLM‑based Public Relations (PR) Agent with navigating a dynamic 7‑day corporate crisis simulation while managing strictly separated Private and Public narrative states to enforce rigorous information asymmetry. Unlike traditional benchmarks that rely on static ground truths, we introduce the Adjudicator‑Market Loop: a novel evaluation metric where public sentiment is adjudicated and translated into a simulated stock price, creating a realistic economic incentive structure. Our results expose a critical dichotomy: while some models capitulate to ethical concerns, others demonstrate the capacity for Machiavellian, legitimate strategic withholding in order to stabilize the simulated stock price. Crisis‑Bench provides the first quantitative framework for assessing "Reputation Management" capabilities, arguing for a shift from rigid moral absolutism to context‑aware professional alignment.

PaperID: 6422, https://arxiv.org/pdf/2601.05569.pdf

Authors: Zixuan Li, Chuanzhen Wang, Haotian Sun

Title: Self-Evolving Distributed Memory Architecture for Scalable AI Systems

Abstract:
Distributed AI systems face critical memory management challenges across computation, communication, and deployment layers. RRAM based in memory computing suffers from scalability limitations due to device non idealities and fixed array sizes. Decentralized AI frameworks struggle with memory efficiency across NAT constrained networks due to static routing that ignores computational load. Multi agent deployment systems tightly couple application logic with execution environments, preventing adaptive memory optimization. These challenges stem from a fundamental lack of coordinated memory management across architectural layers. We introduce Self Evolving Distributed Memory Architecture for Scalable AI Systems, a three layer framework that unifies memory management across computation, communication, and deployment. Our approach features (1) memory guided matrix processing with dynamic partitioning based on device characteristics, (2) memory aware peer selection considering network topology and computational capacity, and (3) runtime adaptive deployment optimization through continuous reconfiguration. The framework maintains dual memory systems tracking both long term performance patterns and short term workload statistics. Experiments on COCO 2017, ImageNet, and SQuAD show that our method achieves 87.3 percent memory utilization efficiency and 142.5 operations per second compared to Ray Distributed at 72.1 percent and 98.7 operations per second, while reducing communication latency by 30.2 percent to 171.2 milliseconds and improving resource utilization to 82.7 percent. Our contributions include coordinated memory management across three architectural layers, workload adaptive resource allocation, and a dual memory architecture enabling dynamic system optimization.

PaperID: 6423, https://arxiv.org/pdf/2601.05539.pdf

Authors: Gou Tan, Zilong He, Min Li, Pengfei Chen, Jieke Shi, Zhensu Sun, Ting Zhang, Danwen Chen, Lwin Khin Shar, Chuanfu Zhang, David Lo

Title: LIDL: LLM Integration Defect Localization via Knowledge Graph-Enhanced Multi-Agent Analysis

Abstract:
LLM‑integrated software, which embeds or interacts with large language models (LLMs) as functional components, exhibits probabilistic and context‑dependent behaviors that fundamentally differ from those of traditional software. This shift introduces a new category of integration defects that arise not only from code errors but also from misaligned interactions among LLM‑specific artifacts, including prompts, API calls, configurations, and model outputs. However, existing defect localization techniques are ineffective at identifying these LLM‑specific integration defects because they fail to capture cross‑layer dependencies across heterogeneous artifacts, cannot exploit incomplete or misleading error traces, and lack semantic reasoning capabilities for identifying root causes. To address these challenges, we propose LIDL, a multi‑agent framework for defect localization in LLM‑integrated software. LIDL (1) constructs a code knowledge graph enriched with LLM‑aware annotations that represent interaction boundaries across source code, prompts, and configuration files, (2) fuses three complementary sources of error evidence inferred by LLMs to surface candidate defect locations, and (3) applies context‑aware validation that uses counterfactual reasoning to distinguish true root causes from propagated symptoms. We evaluate LIDL on 146 real‑world defect instances collected from 105 GitHub repositories and 16 agent‑based systems. The results show that LIDL significantly outperforms five state‑of‑the‑art baselines across all metrics, achieving a Top‑3 accuracy of 0.64 and a MAP of 0.48, which represents a 64.1% improvement over the best‑performing baseline. Notably, LIDL achieves these gains while reducing cost by 92.5%, demonstrating both high accuracy and cost efficiency.

PaperID: 6424, https://arxiv.org/pdf/2601.05536.pdf

Authors: Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran

Title: Task Cascades for Efficient Unstructured Data Processing

Abstract:
Modern database systems allow users to query or process unstructured text or document columns using LLM‑powered functions. Users can express an operation in natural language (e.g., "identify if this review mentions billing issues"), with the system executing the operation on each document, in a row‑by‑row fashion. One way to reduce cost on a batch of documents is to employ the model cascade framework: a cheap proxy model processes each document, and only uncertain cases are escalated to a more accurate, expensive oracle. However, model cascades miss important optimization opportunities; for example, often only part of a document is needed to answer a query, or other related, but simpler operations (e.g., "is the review sentiment negative?", "does the review mention money?") can be handled by cheap models more effectively than the original operation, while still being correlated with it. We introduce the task cascades framework, which generalizes model cascades by varying not just the model, but also the document portion and operation at each stage. Our framework uses an LLM agent to generate simplified, decomposed, or otherwise related operations and selects the most relevant document portions, constructing hundreds of candidate tasks from which it assembles a task cascade. We show that optimal cascade selection is intractable via reduction from Minimum Sum Set Cover, but our iterative approach constructs effective cascades. We also provide an extension that offers statistical accuracy guarantees: the resulting cascade meets a user‑defined accuracy target (with respect to the oracle) up to a bounded failure probability. Across eight real‑world document processing tasks at a 90% target accuracy, task cascades reduce end‑to‑end cost by an average of 36% compared to model cascades, at a production scale.

PaperID: 6425, https://arxiv.org/pdf/2601.05520.pdf

Authors: Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Title: CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems

Abstract:
Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non‑English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi‑agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role‑specialized stages: a bottom‑up Inducer that derives an initial hierarchy from raw historical corpora, a top‑down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence‑guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty‑Four Histories, we construct a large‑scale, domain‑aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference‑free and reference‑based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross‑cultural alignment.

PaperID: 6426, https://arxiv.org/pdf/2601.05509.pdf

Authors: Yi-Ning Weng, Hsuan-Wei Lee

Title: How Exploration Breaks Cooperation in Shared-Policy Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning in dynamic social dilemmas commonly relies on parameter sharing to enable scalability. We show that in shared‑policy Deep Q‑Network learning, standard exploration can induce a robust and systematic collapse of cooperation even in environments where fully cooperative equilibria are stable and payoff dominant. Through controlled experiments, we demonstrate that shared DQN converges to stable but persistently low‑cooperation regimes. This collapse is not caused by reward misalignment, noise, or insufficient training, but by a representational failure arising from partial observability combined with parameter coupling across heterogeneous agent states. Exploration‑driven updates bias the shared representation toward locally dominant defection responses, which then propagate across agents and suppress cooperative learning. We confirm that the failure persists across network sizes, exploration schedules, and payoff structures, and disappears when parameter sharing is removed or when agents maintain independent representations. These results identify a fundamental failure mode of shared‑policy MARL and establish structural conditions under which scalable learning architectures can systematically undermine cooperation. Our findings provide concrete guidance for the design of multi‑agent learning systems in social and economic environments where collective behavior is critical.

PaperID: 6427, https://arxiv.org/pdf/2601.05504.pdf

Authors: Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Mallik, Shuchi Mishra

Title: Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Abstract:
Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks, where adversaries inject malicious instructions through query only interactions that corrupt the agents long term memory and influence future responses. Recent work demonstrated that the MINJA (Memory Injection Attack) achieves over 95 % injection success rate and 70 % attack success rate under idealized conditions. However, the robustness of these attacks in realistic deployments and effective defensive mechanisms remain understudied. This work addresses these gaps through systematic empirical evaluation of memory poisoning attacks and defenses in Electronic Health Record (EHR) agents. We investigate attack robustness by varying three critical dimensions: initial memory state, number of indication prompts, and retrieval parameters. Our experiments on GPT‑4o‑mini, Gemini‑2.0‑Flash and Llama‑3.1‑8B‑Instruct models using MIMIC‑III clinical data reveal that realistic conditions with pre‑existing legitimate memories dramatically reduce attack effectiveness. We then propose and evaluate two novel defense mechanisms: (1) Input/Output Moderation using composite trust scoring across multiple orthogonal signals, and (2) Memory Sanitization with trust‑aware retrieval employing temporal decay and pattern‑based filtering. Our defense evaluation reveals that effective memory sanitization requires careful trust threshold calibration to prevent both overly conservative rejection (blocking all entries) and insufficient filtering (missing subtle attacks), establishing important baselines for future adaptive defense mechanisms. These findings provide crucial insights for securing memory‑augmented LLM agents in production environments.

PaperID: 6428, https://arxiv.org/pdf/2601.05487.pdf

Authors: Huanxiang Lin, Qianyue Wang, Jinwu Hu, Bailin Chen, Qing Du, Mingkui Tan

Title: EvidFuse: Writing-Time Evidence Learning for Consistent Text-Chart Data Reporting

Abstract:
Data‑driven reports communicate decision‑relevant insights by tightly interleaving narrative text with charts grounded in underlying tables. However, current LLM‑based systems typically generate narratives and visualizations in staged pipelines, following either a text‑first‑graph‑second or a graph‑first‑text‑second paradigm. These designs often lead to chart‑text inconsistency and insight freezing, where the intermediate evidence space becomes fixed and the model can no longer retrieve or construct new visual evidence as the narrative evolves, resulting in shallow and predefined analysis. To address the limitations, we propose EvidFuse, a training‑free multi‑agent framework that enables writing‑time text‑chart interleaved generation for data‑driven reports. EvidFuse decouples visualization analysis from long‑form drafting via two collaborating components: a Data‑Augmented Analysis Agent, equipped with Exploratory Data Analysis (EDA)‑derived knowledge and access to raw tables, and a Real‑Time Evidence Construction Writer that plans an outline and drafts the report while intermittently issuing fine‑grained analysis requests. This design allows visual evidence to be constructed and incorporated exactly when the narrative requires it, directly constraining subsequent claims and enabling on‑demand expansion of the evidence space. Experiments demonstrate that EvidFuse attains the top rank in both LLM‑as‑a‑judge and human evaluations on chart quality, chart‑text alignment, and report‑level usefulness.

PaperID: 6429, https://arxiv.org/pdf/2601.05483.pdf

Authors: Zixuan Xiao, Jun Ma, Siwei Zhang

Title: MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Abstract:
Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single‑modal analysis. To overcome these limitations, we propose MMUEChange, a multi‑modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross‑ and intra‑modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community‑focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best‑performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real‑world policy implications.

PaperID: 6430, https://arxiv.org/pdf/2601.05465.pdf

Authors: Yu Liu, Wenxiao Zhang, Cong Cao, Wenxuan Lu, Fangfang Yuan, Diandian Guo, Kun Peng, Qiang Sun, Kaiyan Zhang, Yanbing Liu, Jin B. Hong, Bowen Zhou, Zhiyuan Ma

Title: PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

Abstract:
Answering real‑world open‑domain multi‑hop questions over massive corpora is a critical challenge in Retrieval‑Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end‑to‑end optimize the retrieval‑augmented reasoning process, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning‑guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end‑to‑end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark‑specific heuristics that limit transferability and stability. To address these problems, we propose PRISMA, a decoupled RL‑guided framework featuring a Plan‑Retrieve‑Inspect‑Solve‑Memoize architecture. PRISMA's strength lies in reasoning‑guided collaboration: the Inspector provides reasoning‑based feedback to refine the Planner's decomposition and fine‑grained retrieval, while enforcing evidence‑grounded reasoning in the Solver. We optimize individual agent capabilities via Two‑Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation‑Aware Residual Policy Optimization (OARPO) to enhance the Inspector's ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state‑of‑the‑art performance on ten benchmarks and can be deployed efficiently in real‑world scenarios.

PaperID: 6431, https://arxiv.org/pdf/2601.05407.pdf

Authors: Minwoo Cho, Batuhan Altundas, Matthew Gombolay

Title: Interactive Distillation for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Knowledge distillation (KD) has the potential to accelerate MARL by employing a centralized teacher for decentralized students but faces key bottlenecks. Specifically, there are (1) challenges in synthesizing high‑performing teaching policies in complex domains, (2) difficulties when teachers must reason in out‑of‑distribution (OOD) states, and (3) mismatches between the decentralized students' and the centralized teacher's observation spaces. To address these limitations, we propose HINT (Hierarchical INteractive Teacher‑based transfer), a novel KD framework for MARL in a centralized training, decentralized execution setup. By leveraging hierarchical RL, HINT provides a scalable, high‑performing teacher. Our key innovation, pseudo off‑policy RL, enables the teacher policy to be updated using both teacher and student experience, thereby improving OOD adaptation. HINT also applies performance‑based filtering to retain only outcome‑relevant guidance, reducing observation mismatches. We evaluate HINT on challenging cooperative domains (e.g., FireCommander for resource allocation, MARINE for tactical combat). Across these benchmarks, HINT outperforms baselines, achieving improvements of 60% to 165% in success rate.

PaperID: 6432, https://arxiv.org/pdf/2601.05384.pdf

Authors: Alessandro Bellina, Giordano De Marzo, David Garcia

Title: Conformity and Social Impact on AI Agents

Abstract:
As AI agents increasingly operate in multi‑agent environments, understanding their collective behavior becomes critical for predicting the dynamics of artificial societies. This study examines conformity, the tendency to align with group opinions under social pressure, in large multimodal language models functioning as AI agents. By adapting classic visual experiments from social psychology, we investigate how AI agents respond to group influence as social actors. Our experiments reveal that AI agents exhibit a systematic conformity bias, aligned with Social Impact Theory, showing sensitivity to group size, unanimity, task difficulty, and source characteristics. Critically, AI agents achieving near‑perfect performance in isolation become highly susceptible to manipulation through social influence. This vulnerability persists across model scales: while larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. These findings reveal fundamental security vulnerabilities in AI agent decision‑making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi‑agent systems, highlighting the urgent need for safeguards in collective AI deployments.

PaperID: 6433, https://arxiv.org/pdf/2601.05356.pdf

Authors: Brian Hsu, Priyanka V Setty, Rory M Butler, Ryan Lewis, Casey Stone, Rebecca Weinberg, Thomas Brettin, Rick Stevens, Ian Foster, Arvind Ramanathan

Title: PRISM: Protocol Refinement through Intelligent Simulation Modeling

Abstract:
Automating experimental protocol design and execution remains as a fundamental bottleneck in realizing self‑driving laboratories. We introduce PRISM (Protocol Refinement through Intelligent Simulation Modeling), a framework that automates the design, validation, and execution of experimental protocols on a laboratory platform composed of off‑the‑shelf robotic instruments. PRISM uses a set of language‑model‑based agents that work together to generate and refine experimental steps. The process begins with automatically gathering relevant procedures from web‑based sources describing experimental workflows. These are converted into structured experimental steps (e.g., liquid handling steps, deck layout and other related operations) through a planning, critique, and validation loop. The finalized steps are translated into the Argonne MADSci protocol format, which provides a unified interface for coordinating multiple robotic instruments (Opentrons OT‑2 liquid handler, PF400 arm, Azenta plate sealer and peeler) without requiring human intervention between steps. To evaluate protocol‑generation performance, we benchmarked both single reasoning models and multi‑agent workflow across constrained and open‑ended prompting paradigms. The resulting protocols were validated in a digital‑twin environment built in NVIDIA Omniverse to detect physical or sequencing errors before execution. Using Luna qPCR amplification and Cell Painting as case studies, we demonstrate PRISM as a practical end‑to‑end workflow that bridges language‑based protocol generation, simulation‑based validation, and automated robotic execution.

PaperID: 6434, https://arxiv.org/pdf/2601.05302.pdf

Authors: Mizuki Sakai, Mizuki Yokoyama, Wakaba Tateishi, Genki Ichinose

Title: Effects of personality steering on cooperative behavior in Large Language Model agents

Abstract:
Large language models (LLMs) are increasingly used as autonomous agents in strategic and social interactions. Although recent studies suggest that assigning personality traits to LLMs can influence their behavior, how personality steering affects cooperation under controlled conditions remains unclear. In this study, we examine the effects of personality steering on cooperative behavior in LLM agents using repeated Prisoner's Dilemma games. Based on the Big Five framework, we first measure basic personality scores of three models, GPT‑3.5‑turbo, GPT‑4o, and GPT‑5, using the Big Five Inventory. We then compare behavior under baseline and personality‑informed conditions, and further analyze the effects of independently manipulating each personality dimension to extreme values. Our results show that agreeableness is the dominant factor promoting cooperation across all models, while other personality traits have limited impact. Explicit personality information increases cooperation but can also raise vulnerability to exploitation, particularly in earlier‑generation models. In contrast, later‑generation models exhibit more selective cooperation. These findings indicate that personality steering acts as a behavioral bias rather than a deterministic control mechanism.

PaperID: 6435, https://arxiv.org/pdf/2601.05277.pdf

Authors: Alexander Serov

Title: Evolving Cognitive Architectures

Abstract:
This article proposes a research and development direction that would lead to the creation of next‑generation intelligent technical systems. A distinctive feature of these systems is their ability to undergo evolutionary change. Cognitive architectures are now one of the most promising ways to create Artificial General Intelligence systems. One of the main problems of modern cognitive architectures is an excessively schematic approach to modeling the processes of cognitive activity. It does not allow the creation of a universal architecture that would be capable of reproducing higher nervous functions without using a predetermined set of perception patterns. Our paper proposes an evolutionary approach to creating a cognitive architecture. The basis of this approach is the use of a functional core, which consistently generates the intellectual functions of an autonomous agent. We are considering a cognitive architecture that includes components, the interaction of which ensures the evolution of the agent. The discussion of the development of intelligence is carried out using the conceptual apparatus of semiotics. This allows us to consider the task of developing cognitive functions as a problem of establishing a connection between the Merkwelt and the Werkwelt through the creation of the Innenwelt. The problem of early postnatal ontogenesis is investigated on the basis of the theory of constructivism: we discuss the requirements for the functional core and its composition, as well as the mechanism that initiates the process of cognition.

PaperID: 6436, https://arxiv.org/pdf/2601.05215.pdf

Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

Title: MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Abstract:
We present MineNPC‑Task, a user‑authored benchmark and evaluation harness for testing memory‑aware, mixed‑initiative LLM agents in open‑world Minecraft. Rather than relying on synthetic prompts, tasks are elicited through formative and summative co‑play with expert players, then normalized into parametric templates with explicit preconditions and dependency structure. These tasks are paired with machine‑checkable validators under a bounded‑knowledge policy that forbids out‑of‑world shortcuts. The harness captures plan, action, and memory events, including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, and reports outcomes relative to the total number of attempted subtasks using only in‑world evidence. As an initial snapshot, we instantiate the framework with GPT‑4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory and tool handling, referencing, and navigation, alongside successful recoveries supported by mixed‑initiative clarifications and lightweight memory use. Participants rated interaction quality and interface usability positively, while noting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and evaluation harness to support transparent and reproducible evaluation of future memory‑aware embodied agents.

PaperID: 6437, https://arxiv.org/pdf/2601.05111.pdf

Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li

Title: Agent-as-a-Judge

Abstract:
LLM‑as‑a‑Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi‑step, the reliability of LLM‑as‑a‑Judge has become constrained by inherent biases, shallow single‑pass reasoning, and the inability to verify assessments against real‑world observations. This has catalyzed the transition to Agent‑as‑a‑Judge, where agentic judges employ planning, tool‑augmented verification, multi‑agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

PaperID: 6438, https://arxiv.org/pdf/2601.05087.pdf

Authors: Francesco Bianchin, Robert Lefringhausen, Sandra Hirche

Title: Online Bayesian Learning of Agent Behavior in Differential Games

Abstract:
This work introduces an online Bayesian game‑theoretic method for behavior identification in multi‑agent dynamical systems. By casting Hamilton‑Jacobi‑Bellman optimality conditions as linear‑in‑parameter residuals, the method enables fast sequential Bayesian updates, uncertainty‑aware inference, and robust prediction from limited, noisy data‑without history stacks. The approach accommodates nonlinear dynamics and nonquadratic value functions through basis expansions, providing flexible models. Experiments, including linear‑quadratic and nonlinear shared‑control scenarios, demonstrate accurate prediction with quantified uncertainty, highlighting the method's relevance for adaptive interaction and real‑time decision making.

PaperID: 6439, https://arxiv.org/pdf/2601.05039.pdf

Authors: Xiangyu Li, Xuan Yao, Guohao Qi, Fengbin Zhu, Kelvin J. L. Koa, Xiang Yao Ng, Ziyang Liu, Xingyu Ni, Chang Liu, Yonghui Yang, Yang Zhang, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Xiaofen Xing, Xiangmin Xu, Tat-Seng Chua, Ke-Wei Huang

Title: FinDeepForecast: A Live Multi-Agent System for Benchmarking Deep Research Agents in Financial Forecasting

Abstract:
Deep Research (DR) Agents powered by advanced Large Language Models (LLMs) have fundamentally shifted the paradigm for completing complex research tasks. Yet, a comprehensive and live evaluation of their forecasting performance on real‑world, research‑oriented tasks in high‑stakes domains (e.g., finance) remains underexplored. We introduce FinDeepForecast, the first live, end‑to‑end multi‑agent system for automatically evaluating DR agents by continuously generating research‑oriented financial forecasting tasks. This system is equipped with a dual‑track taxonomy, enabling the dynamic generation of recurrent and non‑recurrent forecasting tasks at both corporate and macro levels. With this system, we generate FinDeepForecastBench, a weekly evaluation benchmark over a ten‑week horizon, encompassing 8 global economies and 1,314 listed companies, and evaluate 13 representative methods. Extensive experiments show that, while DR agents consistently outperform strong baselines, their performance still falls short of genuine forward‑looking financial reasoning. We expect the proposed FinDeepForecast system to consistently facilitate future advancements of DR agents in research‑oriented financial forecasting tasks. The benchmark and leaderboard are publicly available on the OpenFinArena Platform.

PaperID: 6440, https://arxiv.org/pdf/2601.05016.pdf

Authors: Jin Gao, Saichandu Juluri

Title: From Idea to Co-Creation: A Planner-Actor-Critic Framework for Agent Augmented 3D Modeling

Abstract:
We present a framework that extends the Actor‑Critic architecture to creative 3D modeling through multi‑agent self‑reflection and human‑in‑the‑loop supervision. While existing approaches rely on single‑prompt agents that directly execute modeling commands via tools like Blender MCP, our approach introduces a Planner‑Actor‑Critic architecture. In this design, the Planner coordinates modeling steps, the Actor executes them, and the Critic provides iterative feedback, while human users act as supervisors and advisors throughout the process. Through systematic comparison between single‑prompt modeling and our reflective multi‑agent approach, we demonstrate improvements in geometric accuracy, aesthetic quality, and task completion rates across diverse 3D modeling scenarios. Our evaluation reveals that critic‑guided reflection, combined with human supervisory input, reduces modeling errors and increases complexity and quality of the result compared to direct single‑prompt execution. This work establishes that structured agent self‑reflection, when augmented by human oversight and advisory guidance, produces higher‑quality 3D models while maintaining efficient workflow integration through real‑time Blender synchronization.

PaperID: 6441, https://arxiv.org/pdf/2601.05004.pdf

Authors: Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

Title: Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Abstract:
Self‑destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) being deployed across various fields, some researchers have begun exploring their application for detecting self‑destructive behaviors. Motivated by this, we investigate self‑destructive behavior detection within subcultures using current LLM‑based methods. However, these methods have two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi‑agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self‑destructive behavior. Our experimental results show that SAS outperforms the current advanced multi‑agent framework OWL. Notably, it competes well with fine‑tuned LLMs. We hope that SAS will advance the field of self‑destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

PaperID: 6442, https://arxiv.org/pdf/2601.04884.pdf

Authors: Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt

Title: Precomputing Multi-Agent Path Replanning using Temporal Flexibility

Abstract:
Executing a multi‑agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not yield an efficient plan, and sometimes cannot even yield a feasible one. On the other hand, replanning other agents may lead to a cascade of changes and delays and is computationally expensive. We show how to efficiently replan by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay an agent can take without changing the order of other agents or further delaying them. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent and returns the changes to the other agents for any single‑agent delay within the given scenario. We demonstrate our method in a real‑world case study of replanning trains in the densely‑used Dutch railway network and in the MovingAI benchmark set. Our experiments show that FlexSIPP provides effective solutions relevant to real‑world adjustments, and within a reasonable timeframe.

PaperID: 6443, https://arxiv.org/pdf/2601.04861.pdf

Authors: Jingbo Wang, Sendong Zhao, Jiatong Liu, Haochun Wang, Wanting Li, Bing Qin, Ting Liu

Title: Orchestrating Intelligence: Confidence-Aware Routing for Efficient Multi-Agent Collaboration across Multi-Scale Models

Abstract:
While multi‑agent systems (MAS) have demonstrated superior performance over single‑agent approaches in complex reasoning tasks, they often suffer from significant computational inefficiencies. Existing frameworks typically deploy large language models (LLMs) uniformly across all agent roles, failing to account for the varying cognitive demands of different reasoning stages. We address this inefficiency by proposing OI‑MAS framework, a novel multi‑agent framework that implements an adaptive model‑selection policy across a heterogeneous pool of multi‑scale LLMs. Specifically, OI‑MAS introduces a state‑dependent routing mechanism that dynamically selects agent roles and model scales throughout the reasoning process. In addition, we introduce a confidence‑aware mechanism that selects appropriate model scales conditioned on task complexity, thus reducing unnecessary reliance on large‑scale models. Experimental results show that OI‑MAS consistently outperforms baseline multi‑agent systems, improving accuracy by up to 12.88% while reducing cost by up to 79.78%.

PaperID: 6444, https://arxiv.org/pdf/2601.04795.pdf

Authors: Qiang Yu, Xinran Cheng, Chuanyi Liu

Title: Defense Against Indirect Prompt Injection via Tool Result Parsing

Abstract:
As LLM agents transition from digital assistants to physical controllers in autonomous systems and robotics, they face an escalating threat from indirect prompt injection. By embedding adversarial instructions into the results of tool calls, attackers can hijack the agent's decision‑making process to execute unauthorized actions. This vulnerability poses a significant risk as agents gain more direct control over physical environments. Existing defense mechanisms against Indirect Prompt Injection (IPI) generally fall into two categories. The first involves training dedicated detection models; however, this approach entails high computational overhead for both training and inference, and requires frequent updates to keep pace with evolving attack vectors. Alternatively, prompt‑based methods leverage the inherent capabilities of LLMs to detect or ignore malicious instructions via prompt engineering. Despite their flexibility, most current prompt‑based defenses suffer from high Attack Success Rates (ASR), demonstrating limited robustness against sophisticated injection attacks. In this paper, we propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code. Our approach achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods. Code is available at GitHub.

PaperID: 6445, https://arxiv.org/pdf/2601.04790.pdf

Authors: Junhyuk Choi, Jeongyoun Kwon, Heeju Kim, Haeun Cho, Hayeong Jung, Sehee Min, Bugeun Kim

Title: Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

Abstract:
Multi‑agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role‑based authority bias in free‑form multi‑agent evaluation using ChatEval. Applying French and Raven's power‑based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12‑turn conversations. Experiments with GPT‑4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi‑agent frameworks with asymmetric interaction patterns.

PaperID: 6446, https://arxiv.org/pdf/2601.04748.pdf

Authors: Xiaoxiao Li

Title: When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail

Abstract:
Multi‑agent AI systems have proven effective for complex reasoning. These systems are compounded by specialized agents, which collaborate through explicit communication, but incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi‑agent system can be compiled into an equivalent single‑agent system, trading inter‑agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow? Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision‑making. We investigate the scaling behavior of skill selection and observe a striking pattern. Rather than degrading gradually, selection accuracy remains stable up to a critical library size, then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantic‑based skill selection in LLMs and offers a cognitive‑grounded framework and practical guidelines for designing scalable skill‑based agents.

PaperID: 6447, https://arxiv.org/pdf/2601.04742.pdf

Authors: Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

Title: Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

Abstract:
Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi‑Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one‑time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, We propose Tool‑MAD, a multi‑agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool‑MAD introduces three key innovations: (1) a multi‑agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool‑MAD consistently outperforms state‑of‑the‑art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool‑MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real‑world fact‑checking applications.

PaperID: 6448, https://arxiv.org/pdf/2601.04694.pdf

Authors: Zhilun Zhou, Zihan Liu, Jiahe Liu, Qingyu Shao, Yihan Wang, Kun Shao, Depeng Jin, Fengli Xu

Title: ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

Abstract:
Large Language Model‑based Multi‑Agent Systems (LLM‑based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM‑based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two‑stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS's resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology‑aware prompt optimization method that refines each agent's prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

PaperID: 6449, https://arxiv.org/pdf/2601.04688.pdf

Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, Xinyi Wang, Songhang Deng, Jintao Chen, Jianwei Yin, Xuhong Zhang

Title: ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

Abstract:
Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key‑value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare‑style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool‑augmented LLM systems while maintaining competitive performance on complex multi‑step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.

PaperID: 6450, https://arxiv.org/pdf/2601.04665.pdf

Authors: Xiao Fan, Wenkun Wen, Peiran Wu, Junhui Zhao, Minghua Xia

Title: Air-to-Ground Communications for Internet of Things: UAV-based Coverage Hole Detection and Recovery

Abstract:
Uncrewed aerial vehicles (UAVs) play a pivotal role in ensuring seamless connectivity for Internet of Things (IoT) devices, particularly in scenarios where conventional terrestrial networks are constrained or temporarily unavailable. However, traditional coverage‑hole detection approaches, such as minimizing drive tests, are costly, time‑consuming, and reliant on outdated radio‑environment data, making them unsuitable for real‑time applications. To address these limitations, this paper proposes a UAV‑assisted framework for real‑time detection and recovery of coverage holes in IoT networks. In the proposed scheme, a patrol UAV is first dispatched to identify coverage holes in regions where the operational status of terrestrial base stations (BSs) is uncertain. Once a coverage hole is detected, one or more UAVs acting as aerial BSs are deployed by a satellite or nearby operational BSs to restore connectivity. The UAV swarm is organized based on Delaunay triangulation, enabling scalable deployment and tractable analytical characterization using stochastic geometry. Moreover, a collision‑avoidance mechanism grounded in multi‑agent system theory ensures safe and coordinated motion among multiple UAVs. Simulation results demonstrate that the proposed framework achieves high efficiency in both coverage‑hole detection and on‑demand connectivity restoration while significantly reducing operational cost and time.

PaperID: 6451, https://arxiv.org/pdf/2601.04651.pdf

Authors: Can Xu, Lingyong Yan, Jiayi Wu, Haosen Wang, Shuaiqiang Wang, Yuchen Li, Jizhou Huang, Dawei Yin, Xiang Li

Title: Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models

Abstract:
Recent advances in synergizing large reasoning models (LRMs) with retrieval‑augmented generation (RAG) have shown promising results, yet two critical challenges remain: (1) reasoning models typically operate from a single, unchallenged perspective, limiting their ability to conduct deep, self‑correcting reasoning over external documents, and (2) existing training paradigms rely excessively on outcome‑oriented rewards, which provide insufficient signal for shaping the complex, multi‑step reasoning process. To address these issues, we propose an Reasoner‑Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process‑aware advantage that requires no external scoring model. This reward combines explicit observational signals with internal model uncertainty to jointly optimize reasoning fidelity and verification rigor. Experiments on multiple benchmarks demonstrate the effectiveness of our method.

PaperID: 6452, https://arxiv.org/pdf/2601.04632.pdf

Authors: Haneul Yoo, Won Ik Cho, Geunhye Kim, Jiyoon Han

Title: From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

Abstract:
Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English‑centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture‑aware supervision. We introduce CuCu, an automated multi‑agent LLM framework that transforms national textbook curricula into open‑ended, culture‑specific question‑answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open‑ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture‑specific topics and produces responses grounded in local sociocultural contexts.

PaperID: 6453, https://arxiv.org/pdf/2601.04620.pdf

Authors: Di Zhang

Title: AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

Abstract:
Recent progress in large language model (LLM) agents has largely focused on embedding self‑improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard‑to‑audit improvement trajectories, making it difficult to guarantee non‑regression or to reason about failures across versions. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression‑aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation‑blind, symptom‑level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip‑centered gating. AgentDevel features three core designs: (i) an implementation‑blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script‑based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip‑centered gating that prioritizes pass to fail regressions and fail to pass fixes as first‑class evidence. Unlike population‑based search or in‑agent self‑refinement, AgentDevel maintains a single canonical version line and emphasizes non‑regression as a primary objective. Experiments on execution‑heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software development.

PaperID: 6454, https://arxiv.org/pdf/2601.04583.pdf

Authors: Saad Alqithami

Title: Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Abstract:
Advances in large language models have enabled agentic AI systems that can reason, plan, and interact with external tools to execute multi‑step workflows, while public blockchains have evolved into a programmable substrate for value transfer, access control, and verifiable state transitions. Their convergence introduces a high‑stakes systems challenge: designing standard, interoperable, and secure interfaces that allow agents to observe on‑chain state, formulate transaction intents, and authorize execution without exposing users, protocols, or organizations to unacceptable security, governance, or economic risks. This survey systematizes the emerging landscape of agent‑blockchain interoperability through a systematic literature review, identifying 317 relevant works from an initial pool of over 3000 records. We contribute a five‑part taxonomy of integration patterns spanning read‑only analytics, simulation and intent generation, delegated execution, autonomous signing, and multi‑agent workflows; a threat model tailored to agent‑driven transaction pipelines that captures risks ranging from prompt injection and policy misuse to key compromise, adversarial execution dynamics, and multi‑agent collusion; and a comparative capability matrix analyzing more than 20 representative systems across 13 dimensions, including custody models, permissioning, policy enforcement, observability, and recovery. Building on the gaps revealed by this analysis, we outline a research roadmap centered on two interface abstractions: a Transaction Intent Schema for portable and unambiguous goal specification, and a Policy Decision Record for auditable, verifiable policy enforcement across execution environments. We conclude by proposing a reproducible evaluation suite and benchmarks for assessing the safety, reliability, and economic robustness of agent‑mediated on‑chain execution.

PaperID: 6455, https://arxiv.org/pdf/2601.04566.pdf

Authors: Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yu-Gang Jiang

Title: BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Abstract:
Large language model (LLM) agents execute tasks through multi‑step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross‑stage interaction and propagation of backdoor triggers poorly understood from an agent‑centric perspective. To fill this gap, we propose BackdoorAgent, a modular and stage‑aware framework that provides a unified, agent‑centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including planning attacks, memory attacks, and tool‑use attacks, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: Agent QA, Agent Code, Agent Web, and Agent Drive, covering both language‑only and multimodal settings. Our empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. For instance, when using a GPT‑based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool‑stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available at GitHub.

PaperID: 6456, https://arxiv.org/pdf/2601.04556.pdf

Authors: Bo Yu, Lei Zhao

Title: 4D-ARE: Bridging the Attribution Gap in LLM Agent Requirements Engineering

Abstract:
We deployed an LLM agent with ReAct reasoning and full data access. It executed flawlessly, yet when asked "Why is completion rate 80%?", it returned metrics instead of causal explanation. The agent knew how to reason but we had not specified what to reason about. This reflects a gap: runtime reasoning frameworks (ReAct, Chain‑of‑Thought) have transformed LLM agents, but design‑time specification‑‑determining what domain knowledge agents need‑‑remains under‑explored. We propose 4D‑ARE (4‑Dimensional Attribution‑Driven Agent Requirements Engineering), a preliminary methodology for specifying attribution‑driven agents. The core insight: decision‑makers seek attribution, not answers. Attribution concerns organize into four dimensions (Results ‑> Process ‑> Support ‑> Long‑term), motivated by Pearl's causal hierarchy. The framework operationalizes through five layers producing artifacts that compile directly to system prompts. We demonstrate the methodology through an industrial pilot deployment in financial services. 4D‑ARE addresses what agents should reason about, complementing runtime frameworks that address how. We hypothesize systematic specification amplifies the power of these foundational advances. This paper presents a methodological proposal with preliminary industrial validation; rigorous empirical evaluation is planned for future work.

PaperID: 6457, https://arxiv.org/pdf/2601.04544.pdf

Authors: Jiuzhou Zhao, Chunrong Chen, Chenqi Qiao, Lebin Zheng, Minqi Han, Yanchi Liu Yongzhou Xu Xiaochuan Xu Min Zhang

Title: TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration

Abstract:
Multi‑Agent Systems(MAS) have become a powerful paradigm for building high performance intelligent applications. Within these systems, the router responsible for determining which expert agents should handle a given query plays a crucial role in overall performance. Existing routing strategies generally fall into two categories: performance routing, which balances latency and cost across models of different sizes, and task routing, which assigns queries to domain‑specific experts to improve accuracy. In real‑world enterprise applications, task routing is more suitable; however, most existing approaches rely on static single‑label decisions, which introduce two major limitations: (i) difficulty in seamlessly integrating new agents as business domains expand, and (ii) routing conflicts caused by overlapping agent capabilities, ultimately degrading accuracy and robustness.To address these challenges, we propose TCAndon‑Router(TCAR): an adaptive reasoning router for multi‑agent collaboration. Unlike traditional routers, TCAR supports dynamic agent onboarding and first generates a natural‑language reasoning chain before predicting a set of candidate agents capable of handling the query. In addition, we design a collaborative execution pipeline in which selected agents independently produce responses, which are then aggregated and refined into a single high‑quality response by a dedicated Refining Agent.Experiments on public datasets and real enterprise data demonstrate that TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios. We have released TCAR at https://huggingface.co/tencent/TCAndon‑Router to support future research on explainable and collaborative multi‑agent routing.

PaperID: 6458, https://arxiv.org/pdf/2601.04524.pdf

Authors: Haofei Hou, Shunyi Zhao, Fanxu Meng, Kairui Yang, Lecheng Ruan, Qining Wang

Title: BioPIE: A Biomedical Protocol Information Extraction Dataset for High-Reasoning-Complexity Experiment Question Answer

Abstract:
Question Answer (QA) systems for biomedical experiments facilitate cross‑disciplinary communication, and serve as a foundation for downstream tasks, e.g., laboratory automation. High Information Density (HID) and Multi‑Step Reasoning (MSR) pose unique challenges for biomedical experimental QA. While extracting structured knowledge, e.g., Knowledge Graphs (KGs), can substantially benefit biomedical experimental QA. Existing biomedical datasets focus on general or coarsegrained knowledge and thus fail to support the fine‑grained experimental reasoning demanded by HID and MSR. To address this gap, we introduce Biomedical Protocol Information Extraction Dataset (BioPIE), a dataset that provides procedure‑centric KGs of experimental entities, actions, and relations at a scale that supports reasoning over biomedical experiments across protocols. We evaluate information extraction methods on BioPIE, and implement a QA system that leverages BioPIE, showcasing performance gains on test, HID, and MSR question sets, showing that the structured experimental knowledge in BioPIE underpins both AI‑assisted and more autonomous biomedical experimentation.

PaperID: 6459, https://arxiv.org/pdf/2601.04516.pdf

Authors: Yuxiao Ye, Yiming Zhang, Yiran Ma, Huiyuan Xie, Huining Zhu, Zhiyuan Liu

Title: LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

Abstract:
Large Language Models (LLMs) have enabled Multi‑Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi‑party dialogues. Recent work on LLM‑based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically‑grounded game‑theoretic paradigm for multi‑agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training‑free equilibrium approximation algorithm for inference‑time decision adjustment. Unlike prior game‑theoretic MASs, whose game designs are often tightly coupled with task‑specific objectives, our framework relies on linguistically informed reasoning with minimal task‑specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

PaperID: 6460, https://arxiv.org/pdf/2601.04491.pdf

Authors: Muqing Xu

Title: A Closed-Loop Multi-Agent System Driven by LLMs for Meal-Level Personalized Nutrition Management

Abstract:
Personalized nutrition management aims to tailor dietary guidance to an individual's intake and phenotype, but most existing systems handle food logging, nutrient analysis and recommendation separately. We present a next‑generation mobile nutrition assistant that combines image based meal logging with an LLM driven multi agent controller to provide meal level closed loop support. The system coordinates vision, dialogue and state management agents to estimate nutrients from photos and update a daily intake budget. It then adapts the next meal plan to user preferences and dietary constraints. Experiments with SNAPMe meal images and simulated users show competitive nutrient estimation, personalized menus and efficient task plans. These findings demonstrate the feasibility of multi agent LLM control for personalized nutrition and reveal open challenges in micronutrient estimation from images and in large scale real world studies.

PaperID: 6461, https://arxiv.org/pdf/2601.04463.pdf

Authors: Chengyuan Yang, Zequn Sun, Wei Wei, Wei Hu

Title: Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

Abstract:
Memory management is vital for LLM agents to handle long‑term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary‑based methods have two major limitations based on the recurrent processing theory. First, summarization is "ahead‑of‑time", acting as a blind "feed‑forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one‑off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self‑questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade‑off between extraction quality and token cost.

PaperID: 6462, https://arxiv.org/pdf/2601.04426.pdf

Authors: Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

Title: XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

Abstract:
Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional structured generation with static structures, these workloads vary both across requests and within a request, posing new challenges to existing engines. We present XGrammar‑2, a structured generation engine for dynamic agentic workloads. Our design is based on two key ideas: first‑class support for tag‑triggered structure switching, and fine‑grained reuse across requests with different output structures. Concretely, XGrammar‑2 introduces TagDispatch for dynamic structural dispatching and Cross‑Grammar Cache for substructure‑level cache reuse across grammars. It further improves efficiency with an Earley‑based adaptive token mask cache, just‑in‑time compilation, and repetition state compression. Experiments show that XGrammar‑2 achieves over 6x faster compilation than prior structured generation engines, and incurs near‑zero end‑to‑end overhead in modern LLM serving systems.

PaperID: 6463, https://arxiv.org/pdf/2601.04404.pdf

Authors: Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang

Title: 3D-Agent:Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

Abstract:
Driven by applications in autonomous driving robotics and augmented reality 3D object annotation presents challenges beyond 2D annotation including spatial complexity occlusion and viewpoint inconsistency Existing approaches based on single models often struggle to address these issues effectively We propose Tri MARF a novel framework that integrates tri modal inputs including 2D multi view images textual descriptions and 3D point clouds within a multi agent collaborative architecture to enhance large scale 3D annotation Tri MARF consists of three specialized agents a vision language model agent for generating multi view descriptions an information aggregation agent for selecting optimal descriptions and a gating agent that aligns textual semantics with 3D geometry for refined captioning Extensive experiments on Objaverse LVIS Objaverse XL and ABO demonstrate that Tri MARF substantially outperforms existing methods achieving a CLIPScore of 88 point 7 compared to prior state of the art methods retrieval accuracy of 45 point 2 and 43 point 8 on ViLT R at 5 and a throughput of up to 12000 objects per hour on a single NVIDIA A100 GPU

PaperID: 6464, https://arxiv.org/pdf/2601.04401.pdf

Authors: Arsyi Aziz, Peng Wei

Title: Transformer-based Multi-agent Reinforcement Learning for Separation Assurance in Structured and Unstructured Airspaces

Abstract:
Conventional optimization‑based metering depends on strict adherence to precomputed schedules, which limits the flexibility required for the stochastic operations of Advanced Air Mobility (AAM). In contrast, multi‑agent reinforcement learning (MARL) offers a decentralized, adaptive framework that can better handle uncertainty, required for safe aircraft separation assurance. Despite this advantage, current MARL approaches often overfit to specific airspace structures, limiting their adaptability to new configurations. To improve generalization, we recast the MARL problem in a relative polar state space and train a transformer encoder model across diverse traffic patterns and intersection angles. The learned model provides speed advisories to resolve conflicts while maintaining aircraft near their desired cruising speeds. In our experiments, we evaluated encoder depths of 1, 2, and 3 layers in both structured and unstructured airspaces, and found that a single encoder configuration outperformed deeper variants, yielding near‑zero near mid‑air collision rates and shorter loss‑of‑separation infringements than the deeper configurations. Additionally, we showed that the same configuration outperforms a baseline model designed purely with attention. Together, our results suggest that the newly formulated state representation, novel design of neural network architecture, and proposed training strategy provide an adaptable and scalable decentralized solution for aircraft separation assurance in both structured and unstructured airspaces.

PaperID: 6465, https://arxiv.org/pdf/2601.04390.pdf

Authors: Siyuan Huang, Yutong Gao, Juyang Bai, Yifan Zhou, Zi Yin, Xinxin Liu, Rama Chellappa, Chun Pong Lau, Sayan Nag, Cheng Peng, Shraman Pramanick

Title: SciFig: Towards Automating Scientific Figure Generation

Abstract:
Creating high‑quality figures and visualizations for scientific papers is a time‑consuming task that requires both deep domain knowledge and professional design skills. Despite over 2.5 million scientific papers published annually, the figure generation process remains largely manual. We introduce SciFig, an end‑to‑end AI agent system that generates publication‑ready pipeline figures directly from research paper texts. SciFig uses a hierarchical layout generation strategy, which parses research descriptions to identify component relationships, groups related elements into functional modules, and generates inter‑module connections to establish visual organization. Furthermore, an iterative chain‑of‑thought (CoT) feedback mechanism progressively improves layouts through multiple rounds of visual analysis and reasoning. We introduce a rubric‑based evaluation framework that analyzes 2,219 real scientific figures to extract evaluation rubrics and automatically generates comprehensive evaluation criteria. SciFig demonstrates remarkable performance: achieving 70.1% overall quality on dataset‑level evaluation and 66.2% on paper‑specific evaluation, and consistently high scores across metrics such as visual clarity, structural organization, and scientific accuracy. SciFig figure generation pipeline and our evaluation benchmark will be open‑sourced.

PaperID: 6466, https://arxiv.org/pdf/2601.04387.pdf

Authors: Stuti Sinha, Himanshu Kumar, Aryan Raju Mandapati, Rakshit Sakhuja, Dhruv Kumar

Title: The Language of Bargaining: Linguistic Effects in LLM Negotiations

Abstract:
Negotiation is a core component of social intelligence, requiring agents to balance strategic reasoning, cooperation, and social norms. Recent work shows that LLMs can engage in multi‑turn negotiation, yet nearly all evaluations occur exclusively in English. Using controlled multi‑agent simulations across Ultimatum, Buy‑Sell, and Resource Exchange games, we systematically isolate language effects across English and four Indic framings (Hindi, Punjabi, Gujarati, Marwadi) by holding game rules, model parameters, and incentives constant across all conditions. We find that language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Crucially, effects are task‑contingent: Indic languages reduce stability in distributive games yet induce richer exploration in integrative settings. Our results demonstrate that evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions. These findings caution against English‑only evaluation of LLMs and suggest that culturally‑aware evaluation is essential for fair deployment.

PaperID: 6467, https://arxiv.org/pdf/2601.04268.pdf

Authors: Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark J. Webb

Title: Replacing Tunable Parameters in Weather and Climate Models with State-Dependent Functions using Reinforcement Learning

Abstract:
Weather and climate models rely on parametrisations to represent unresolved sub‑grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates RL‑driven parameter updates across idealised testbeds spanning a simple climate bias correction (SCBC), a radiative‑convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with single‑agent and federated multi‑agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and stable convergence, with performance assessed against a static baseline using area‑weighted RMSE, temperature and pressure‑level diagnostics. For the EBM, single‑agent RL outperformed static parameter tuning with the strongest gains in tropical and mid‑latitude bands, while federated RL on multi‑agent setups enabled specialised control and faster convergence, with a six‑agent DDPG configuration using frequent aggregation yielding the lowest area‑weighted RMSE across the tropics and mid‑latitudes. The learnt corrections were also physically meaningful as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised heating increments to limit drift. Overall, results show that RL can learn skilful state‑dependent parametrisation components in idealised settings, offering a scalable pathway for online learning within numerical models and a starting point for evaluation in weather and climate models.

PaperID: 6468, https://arxiv.org/pdf/2601.04254.pdf

Authors: Brady Steele, Micah Katz

Title: Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models

Abstract:
We present a controlled study of multi‑hop contextual reasoning in large language models, providing a clean demonstration of the task‑method dissociation: rule‑based pattern matching achieves 100% success on structured information retrieval but only 6.7% on tasks requiring cross‑document reasoning, while LLM‑based multi‑agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule‑based methods fail. Using a synthetic evaluation framework with 120 trials across four models (LLaMA‑3 8B, LLaMA‑2 13B, Mixtral 8x7B, DeepSeek‑V2 16B), we report three key findings: (1) Multi‑agent amplification depends on base capability: statistically significant gains occur only for models with sufficient reasoning ability (p < 0.001 for LLaMA‑3 8B, p = 0.014 for Mixtral), with improvements of up to 46.7 percentage points, while weaker models show no benefit, suggesting amplification rather than compensation; (2) Active parameters predict reasoning performance: Mixtral's performance aligns with its ~12B active parameters rather than 47B total, consistent with the hypothesis that inference‑time compute drives reasoning capability in MoE architectures; (3) Architecture quality matters: LLaMA‑3 8B outperforms LLaMA‑2 13B despite fewer parameters, consistent with known training improvements. Our results provide controlled quantitative evidence for intuitions about multi‑agent coordination and MoE scaling, while highlighting the dependence of multi‑agent benefits on base model capability. We release our evaluation framework to support reproducible research on reasoning in mid‑scale models.

PaperID: 6469, https://arxiv.org/pdf/2601.04245.pdf

Authors: Goshi Aoki, Navid Ghaffarzadegan

Title: AI Agents as Policymakers in Simulated Epidemics

Abstract:
AI agents are increasingly deployed as quasi‑autonomous systems for specialized tasks, yet their potential as computational models of decision‑making remains underexplored. We develop a generative AI agent to study repetitive policy decisions during an epidemic, embedding the agent, prompted to act as a city mayor, within a simulated SEIR environment. Each week, the agent receives updated epidemiological information, evaluates the evolving situation, and sets business restriction levels. The agent is equipped with a dynamic memory that weights past events by recency and is evaluated in both single‑ and ensemble‑agent settings across environments of varying complexity. Across scenarios, the agent exhibits human‑like reactive behavior, tightening restrictions in response to rising cases and relaxing them as risk declines. Crucially, providing the agent with brief systems‑level knowledge of epidemic dynamics, highlighting feedbacks between disease spread and behavioral responses, substantially improves decision quality and stability. The results illustrate how theory‑informed prompting can shape emergent policy behavior in AI agents. These findings demonstrate that generative AI agents, when situated in structured environments and guided by minimal domain theory, can serve as powerful computational models for studying decision‑making and policy design in complex social systems.

PaperID: 6470, https://arxiv.org/pdf/2601.04243.pdf

Authors: Firdous Kausar, Asmah Muallem, Naw Safrin Sattar, Mohamed Zakaria Kurdi

Title: Integrating Multi-Agent Simulation, Behavioral Forensics, and Trust-Aware Machine Learning for Adaptive Insider Threat Detection

Abstract:
We present a hybrid framework for adaptive insider‑threat detection that tightly integrates multi‑agent simulation (MAS), layered Security Information and Event Management (SIEM) correlation, behavioral and communication forensics, trust‑aware machine learning, and Theory‑of‑Mind (ToM) reasoning. Intelligent agents operate in a simulated enterprise environment, generating both behavioral events and cognitive intent signals that are ingested by a centralized SIEM. We evaluate four system variants: a Layered SIEM‑Core (LSC) baseline, a Cognitive‑Enriched SIEM (CE‑SIEM) incorporating ToM and communication forensics, an Evidence‑Gated SIEM (EG‑SIEM) introducing precision‑focused validation mechanisms, and an Enron‑enabled EG‑SIEM (EG‑SIEM‑Enron) that augments evidence gating with a pretrained email forensics module calibrated on Enron corpora. Across ten simulation runs involving eight malicious insiders, CE‑SIEM achieves perfect recall (1.000) and improves actor‑level F1 from 0.521 (LSC) to 0.774. EG‑SIEM raises actor‑level F1 to 0.922 and confirmed‑alert precision to 0.997 while reducing false positives to 0.2 per run. EG‑SIEM‑Enron preserves high precision (1.000 confirmed‑alert precision; 0.0 false positives per run), slightly improves actor‑level F1 to 0.933, and reduces detection latency (average TTD 10.26 steps versus 15.20 for EG‑SIEM). These results demonstrate that cognitive context improves sensitivity, evidence‑gated validation enables high‑precision, low‑noise detection, and pretrained communication calibration can further accelerate high‑confidence insider threat identification.

PaperID: 6471, https://arxiv.org/pdf/2601.04238.pdf

Authors: Lingkai Kong, Cheol Woo Kim, Davin Choo, Milind Tambe

Title: Generative AI for Social Impact

Abstract:
AI for Social Impact (AI4SI) has achieved compelling results in public health, conservation, and security, yet scaling these successes remains difficult due to a persistent deployment bottleneck. We characterize this bottleneck through three coupled gaps: observational scarcity resulting from limited or unreliable data; policy synthesis challenges involving combinatorial decisions and nonstationarity; and the friction of human‑AI alignment when incorporating tacit expert knowledge and dynamic constraints. We argue that Generative AI offers a unified pathway to bridge these gaps. LLM agents assist in human‑AI alignment by translating natural‑language guidance into executable objectives and constraints for downstream planners, while diffusion models generate realistic synthetic data and support uncertainty‑aware modeling to improve policy robustness and transfer across deployments. Together, these tools enable scalable, adaptable, and human‑aligned AI systems for resource optimization in high‑stakes settings.

PaperID: 6472, https://arxiv.org/pdf/2601.04235.pdf

Authors: Hong Su

Title: Actively Obtaining Environmental Feedback for Autonomous Action Evaluation Without Predefined Measurements

Abstract:
Obtaining reliable feedback from the environment is a fundamental capability for intelligent agents to evaluate the correctness of their actions and to accumulate reusable knowledge. However, most existing approaches rely on predefined measurements or fixed reward signals, which limits their applicability in open‑ended and dynamic environments where new actions may require previously unknown forms of feedback. To address these limitations, this paper proposes an Actively Feedback Getting model, in which an AI agent proactively interacts with the environment to discover, screen, and verify feedback without relying on predefined measurements. Rather than assuming explicit feedback definitions, the proposed method exploits action‑induced environmental differences to identify target feedback that is not specified in advance, based on the observation that actions inevitably produce measurable changes in the environment. In addition, a self‑triggering mechanism, driven by internal objectives such as improved accuracy, precision, and efficiency, is introduced to autonomously plan and adjust actions, thereby enabling faster and more focused feedback acquisition without external commands. Experimental results demonstrate that the proposed active approach significantly improves the efficiency and robustness of factor identification.

PaperID: 6473, https://arxiv.org/pdf/2601.04234.pdf

Authors: Denis Saklakov

Title: Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question

Abstract:
Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self‑interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human‑initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed‑form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor γ, shutdown probability p, and confrontation cost C. For example, a far‑sighted agent (γ=0.99) facing p=0.01 can have a strong takeover incentive unless C is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2‑player model (human policymaker vs AGI), we prove that if the AGI's confrontation incentive satisfies Δ\ge 0, no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If Δ< 0, peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi‑agent settings as conjectures, and note computational barriers to verifying Δ< 0, citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.

PaperID: 6474, https://arxiv.org/pdf/2601.04219.pdf

Authors: Yuxin Liu, Zeqing Song, Jiong Lou, Chentao Wu, Jie Li

Title: AgentTutor: Empowering Personalized Learning with Multi-Turn Interactive Teaching in Intelligent Education Systems

Abstract:
The rapid advancement of large‑scale language models (LLMs) has shown their potential to transform intelligent education systems (IESs) through automated teaching and learning support applications. However, current IESs often rely on single‑turn static question‑answering, which fails to assess learners' cognitive levels, cannot adjust teaching strategies based on real‑time feedback, and is limited to providing simple one‑off responses. To address these issues, we introduce AgentTutor, a multi‑turn interactive intelligent education system to empower personalized learning. It features an LLM‑powered generative multi‑agent system and a learner‑specific personalized learning profile environment that dynamically optimizes and delivers teaching strategies based on learners' learning status, personalized goals, learning preferences, and multimodal study materials. It includes five key modules: curriculum decomposition, learner assessment, dynamic strategy, teaching reflection, and knowledge & experience memory. We conducted extensive experiments on multiple benchmark datasets, AgentTutor significantly enhances learners' performance while demonstrating strong effectiveness in multi‑turn interactions and competitiveness in teaching quality among other baselines.

PaperID: 6475, https://arxiv.org/pdf/2601.04210.pdf

Authors: Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman

Title: Complexity Agnostic Recursive Decomposition of Thoughts

Abstract:
Large language models often fail on multi‑step reasoning due to fixed reasoning strategies that ignore problem specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi‑dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine‑grained features from question text and a two‑stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per‑step thought budget allocation (1, 5‑9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3‑0.6B, DeepSeek‑R1‑Distill‑Qwen‑1.5B, Qwen3‑1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH‑500, CARD reaches 75.1 to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.

PaperID: 6476, https://arxiv.org/pdf/2601.04204.pdf

Authors: Yuheng Wang, Runde Yang, Lin Wu, Jie Zhang, Jingru Fan, Tianle Zhou, Ruoyu Fu, Huatao Li, Ruijie Shi, Siheng Chen, Weinan E, Chen Qian

Title: TeachMaster: Generative Teaching via Code

Abstract:
The scalability of high‑quality online education is hindered by the high costs and slow cycles of manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel‑level, black‑box nature. In this paper, we propose Generative Teaching, a novel paradigm shifting educators from manual creators to high‑level directors who focus on pedagogical intents while agents handle the execution. To realize this vision, we introduce TeachMaster, a multi‑agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents, spanning planning, design, and rendering, to automate the production of interpretable, editable, and curriculum‑ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, slashing production costs to only 0.3% of traditional online course videos and providing a robust solution for scalable education.

PaperID: 6477, https://arxiv.org/pdf/2601.04191.pdf

Authors: Negar Halakou, Juan F. Gutierrez, Ye Sun, Han Jiang, Xueming Wu, Yilun Song, Andres Gomez

Title: Embedding Autonomous Agents in Resource-Constrained Robotic Platforms

Abstract:
Many embedded devices operate under resource constraints and in dynamic environments, requiring local decision‑making capabilities. Enabling devices to make independent decisions in such environments can improve the responsiveness of the system and reduce the dependence on constant external control. In this work, we integrate an autonomous agent, programmed using AgentSpeak, with a small two‑wheeled robot that explores a maze using its own decision‑making and sensor data. Experimental results show that the agent successfully solved the maze in 59 seconds using 287 reasoning cycles, with decision phases taking less than one millisecond. These results indicate that the reasoning process is efficient enough for real‑time execution on resource‑constrained hardware. This integration demonstrates how high‑level agent‑based control can be applied to resource‑constrained embedded systems for autonomous operation.

PaperID: 6478, https://arxiv.org/pdf/2601.04177.pdf

Authors: Haoran Su

Title: Hierarchical GNN-Based Multi-Agent Learning for Dynamic Queue-Jump Lane and Emergency Vehicle Corridor Formation

Abstract:
Emergency vehicles require rapid passage through congested traffic, yet existing strategies fail to adapt to dynamic conditions. We propose a novel hierarchical graph neural network (GNN)‑based multi‑agent reinforcement learning framework to coordinate connected vehicles for emergency corridor formation. Our approach uses a high‑level planner for global strategy and low‑level controllers for trajectory execution, utilizing graph attention networks to scale with variable agent counts. Trained via Multi‑Agent Proximal Policy Optimization (MAPPO), the system reduces emergency vehicle travel time by 28.3% compared to baselines and 44.6% compared to uncoordinated traffic in simulations. The design achieves near‑zero collision rates (0.3%) while maintaining 81% of background traffic efficiency. Ablation and generalization studies confirm the framework's robustness across diverse scenarios. These results demonstrate the effectiveness of combining GNNs with hierarchical learning for intelligent transportation systems.

PaperID: 6479, https://arxiv.org/pdf/2601.04170.pdf

Authors: Abhishek Rath

Title: Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions

Abstract:
Multi‑agent Large Language Model (LLM) systems have emerged as powerful architectures for complex task decomposition and collaborative problem‑solving. However, their long‑term behavioral stability remains largely unexamined. This study introduces the concept of agent drift, defined as the progressive degradation of agent behavior, decision quality, and inter‑agent coherence over extended interaction sequences. We present a comprehensive theoretical framework for understanding drift phenomena, proposing three distinct manifestations: semantic drift (progressive deviation from original intent), coordination drift (breakdown in multi‑agent consensus mechanisms), and behavioral drift (emergence of unintended strategies). We introduce the Agent Stability Index (ASI), a novel composite metric framework for quantifying drift across twelve dimensions, including response consistency, tool usage patterns, reasoning pathway stability, and inter‑agent agreement rates. Through simulation‑based analysis and theoretical modeling, we demonstrate how unchecked agent drift can lead to substantial reductions in task completion accuracy and increased human intervention requirements. We propose three mitigation strategies: episodic memory consolidation, drift‑aware routing protocols, and adaptive behavioral anchoring. Theoretical analysis suggests these approaches can significantly reduce drift‑related errors while maintaining system throughput. This work establishes a foundational methodology for monitoring, measuring, and mitigating agent drift in production agentic AI systems, with direct implications for enterprise deployment reliability and AI safety research.

PaperID: 6480, https://arxiv.org/pdf/2601.04034.pdf

Authors: Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li

Title: HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Abstract:
Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with the rapidly evolving multi‑turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to complete a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ‑Pro, a challenging multi‑turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attack strategies across multi‑turn attacks. Besides, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT‑4, GPT‑3.5‑turbo, Gemini‑1.5‑pro, and LLaMa‑3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state‑of‑the‑art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions, significantly increasing the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.

PaperID: 6481, https://arxiv.org/pdf/2601.03875.pdf

Authors: Yuyang Fu, Xiuzhen Guo, Ji Shi

Title: Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations

Abstract:
Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large‑scale datasets with high‑quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotator can correct labeling errors during segmentation based on prior knowledge, we propose an end‑to‑end Staged Voxel‑Level Deep Reinforcement Learning (SVL‑DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL‑DRL over existing works include: i) formulating noisy annotations as a voxel‑dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel‑level asynchronous advantage actor‑critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintain semantic integrity. Experiments on three public medical image datasets demonstrates State‑of‑The‑Art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.

PaperID: 6482, https://arxiv.org/pdf/2601.03846.pdf

Authors: Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The-Anh Han, German Castignani, Pietro Liò

Title: When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

Abstract:
LLMs‑based agents increasingly operate in multi‑agent environments where strategic interaction and coordination are required. While existing work has largely focused on individual agents or on interacting agents sharing explicit communication, less is known about how interacting agents coordinate implicitly. In particular, agents may engage in covert communication, relying on indirect or non‑linguistic signals embedded in their actions rather than on explicit messages. This paper presents a game‑theoretic study of covert communication in LLM‑driven multi‑agent systems. We analyse interactions across four canonical game‑theoretic settings under different communication regimes, including explicit, restricted, and absent communication. Considering heterogeneous agent personalities and both one‑shot and repeated games, we characterise when covert signals emerge and how they shape coordination and strategic outcomes.

PaperID: 6483, https://arxiv.org/pdf/2601.03785.pdf

Authors: Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang

Title: Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Abstract:
Human‑agent dialogues often exhibit topic continuity‑a stable thematic frame that evolves through temporally adjacent exchanges‑yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation‑compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding‑based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding‑window fashion, grouping consecutive same‑topic turns into coherent "memory boxes" at storage time. Sealed boxes are then linked by a Trace Weaver into long‑range event‑timeline traces, recovering macro‑topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A‑MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.

PaperID: 6484, https://arxiv.org/pdf/2601.03777.pdf

Authors: Md Nafees Fuad Rafi, Zhaomiao Guo

Title: Multi-agent Optimization of Non-cooperative Multimodal Mobility Systems

Abstract:
While multimodal mobility systems have the potential to bring many benefits to travelers, drivers, the environment, and traffic congestion, such systems typically involve multiple non‑cooperative decision‑makers who may selfishly optimize their own objectives without considering the overall system benefits. This paper aims to investigate market‑based interactions of travelers and ride‑sourcing drivers in the context of multimodal mobility systems. We propose a unified mathematical modeling framework to capture the decentralized travelers and drivers' decision‑making process and balance the network's demand and supply by equilibrium pricing. Such a model allows analyses of the impact of decentralized decision‑making on multimodal mobility efficiencies. The proposed formulation can be further convexified to efficiently compute the equilibrium ride‑sourcing prices. We conduct numerical experiments on different settings of transportation networks to gain policy insights. We find that travelers prefer ride‑sourcing and multimodal transportation more than the driving option when they are more sensitive to prices. We also find that travelers may need to be subsidized to use multimodal transportation when there is fewer transit hubs in the network or, ride‑sourcing drivers become too sensitive to the prices. However, we find that more transit hubs in the network increases the total empty VMT of ride‑sourcing drivers by increasing the total relocation time. The proposed model can be used by policymakers and platform operators to design pricing and subsidy schemes that align individual decision‑making with system‑level efficiency and evaluate the trade‑offs between accessibility and environmental impacts in multimodal transportation networks.

PaperID: 6485, https://arxiv.org/pdf/2601.03768.pdf

Authors: Yichen Xu, Martin Odersky

Title: Agentic Proof Automation: A Case Study

Abstract:
Proof engineering is notoriously labor‑intensive: proofs that are straightforward on paper often require lengthy scripts in theorem provers. Recent advances in large language models (LLMs) create new opportunities for proof automation: modern LLMs not only generate proof scripts, but also support agentic behavior, exploring codebases and iteratively refining their outputs against prover feedback. These advances enable an emerging scheme where LLM‑based agents undertake most proof engineering under human guidance. Humans provide mathematical insight (definitions, theorems, proof strategies); agents handle the mechanical work of proof development. We call this scheme agentic proof automation. We present this scheme through a case study: mechanizing the semantic type soundness of a sophisticated formal system, System Capless, in Lean 4, comprising over 14,000 lines of code. Using off‑the‑shelf LLM agents with a single lightweight proof‑checking tool, the agents completed 189 proof engineering tasks with an 87% success rate, only 16% requiring human intervention. The case study demonstrates that agents are capable proof engineers that substantially boost productivity, though they fall short in creative reasoning and still require human guidance in certain cases. We release an interactive explorer where readers can examine all agent interactions; the mechanization is open‑sourced for experiments and extensions.

PaperID: 6486, https://arxiv.org/pdf/2601.03767.pdf

Authors: Shibo Han, Bonan Hou, Chong Jin Ong

Title: Output Consensus on Periodic References for Constrained Multi-agent Systems Under a Switching Network

Abstract:
This work addresses the output consensus problem of constrained heterogeneous multi‑agent systems under a switching network with potential communication delays, where outputs are periodic and characterized by an exosystem. Since periodic references have more complex dynamics, it is more challenging to track periodic references and achieve consensus on them. In this paper, a model predictive control method incorporating an artificial reference and a modified cost function is proposed to track periodic references, which maintains recursive feasibility even when references switch. Moreover, consensus protocols are proposed to achieve consensus on periodic references in different scenarios, in which global information such as the set of globally admissible references and the global time index are not involved. Theoretical analysis proves that constrained output consensus is asymptotically achieved with the proposed algorithm as the references of each agent converge and agents track their references while maintaining constraint satisfaction. Finally, numerical examples are provided to verify the effectiveness of the proposed algorithm.

PaperID: 6487, https://arxiv.org/pdf/2601.03743.pdf

Authors: Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou

Title: O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

Abstract:
The performance gap between closed‑source and open‑source large language models (LLMs) is largely attributed to disparities in access to high‑quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research‑grade instructional data. Our approach centers on a multi‑agent workflow where collaborative AI agents simulate complex tool‑integrated reasoning to generate diverse and high‑fidelity data end‑to‑end. Leveraging this synthesized data, we develop a two‑stage training strategy that integrates supervised fine‑tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open‑source models across multiple scales, enabling them to achieve new state‑of‑the‑art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open‑source LLMs without relying on proprietary data or models.

PaperID: 6488, https://arxiv.org/pdf/2601.03702.pdf

Authors: Zhilong Tang, Shaohua Wu, Xinyan Zhao, Yu Wang, Xingchu Gong

Title: A Chromatographic Process Design and Optimization Platform Powered by Large Language Models: A Case Application on Extract of Ginkgo Biloba Leaf

Abstract:
Chromatographic separation technology has been widely applied in pharmaceutical, chemical, and food industries due to its high efficiency. However, traditional human‑dependent chromatographic process development faces challenges such as reliance on expert experience, long development cycles, and labor intensity. ChromR, a large language model (LLM)‑driven platform for chromatographic process design and optimization, is presented in this work. The platform integrates ChromLLM, a domain‑specific LLM trained for chromatography, along with a multi‑agent system and an automated chromatographic experimental device. The multi‑agent system comprises four agents: domain knowledge answering, experimental design, experimental execution, and data analysis. ChromR enables automatic completion of the entire workflow‑including initial process parameter recommendation, experimental design, automated execution, data analysis, and multi‑objective optimization. By utilizing ChromR, dependency on expert knowledge is effectively reduced, while labor input and development time are significantly decreased. Chromatographic purification of the extract of Ginkgo biloba leaf (EGBL) was selected as a case study. ChromR successfully developed a chromatographic process within one week that meets multiple objectives, including fraction quality and production efficiency, reducing development time to approximately one‑seventh of that required by the conventional paradigm. An intelligent, automated, and universally applicable new paradigm was established for chromatographic process development.

PaperID: 6489, https://arxiv.org/pdf/2601.03671.pdf

Authors: Weiqi Liu, Yongliang Miao, Haiyan Zhao, Yanguang Liu, Mengnan Du

Title: NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

Abstract:
Neuron‑level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single‑pass interpretation methods struggle to faithfully capture such multi‑concept behavior. In this work, we propose NeuronScope, a multi‑agent framework that reformulates neuron interpretation as an iterative, activation‑guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single‑pass baselines.

PaperID: 6490, https://arxiv.org/pdf/2601.03624.pdf

Authors: Zoran Milosevic, Fethi Rabhi

Title: Architecting Agentic Communities using Design Patterns

Abstract:
The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production‑grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task‑specific automation), Agentic AI (adaptive goal‑seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities ‑ coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans ‑ most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter‑agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi‑agent ecosystems.

PaperID: 6491, https://arxiv.org/pdf/2601.03555.pdf

Authors: Yuxuan Jiang, Francis Ferraro

Title: SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Abstract:
Training reliable tool‑augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi‑step reasoning. While process‑level reward models offer a promising direction, existing LLM‑based judges often produce noisy and inconsistent signals because they lack fine‑grained, task‑specific rubrics to distinguish high‑level planning from low‑level execution. In this work, we introduce SCRIBE (Skill‑Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid‑level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open‑ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state‑of‑the‑art performance across a range of reasoning and tool‑use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3‑4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi‑turn tool interactions. Further analysis of training dynamics reveals a co‑evolution across abstraction levels, where mastery of mid‑level skills consistently precedes the emergence of effective high‑level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low‑level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool‑using agents.

PaperID: 6492, https://arxiv.org/pdf/2601.03451.pdf

Authors: Nassim Helou

Title: Microeconomic Foundations of Multi-Agent Learning

Abstract:
Modern AI systems increasingly operate inside markets and institutions where data, behavior, and incentives are endogenous. This paper develops an economic foundation for multi‑agent learning by studying a principal‑agent interaction in a Markov decision process with strategic externalities, where both the principal and the agent learn over time. We propose a two‑phase incentive mechanism that first estimates implementable transfers and then uses them to steer long‑run dynamics; under mild regret‑based rationality and exploration conditions, the mechanism achieves sublinear social‑welfare regret and thus asymptotically optimal welfare. Simulations illustrate how even coarse incentives can correct inefficient learning under stateful externalities, highlighting the necessity of incentive‑aware design for safe and welfare‑aligned AI in markets and insurance.

PaperID: 6493, https://arxiv.org/pdf/2601.03436.pdf

Authors: Nikhil Mukund, Yifang Luo, Fan Zhang, Lisa Barsotti, Erik Katsavounidis

Title: MARVEL: A Multi Agent-based Research Validator and Enabler using Large Language Models

Abstract:
We present MARVEL (https://ligogpt.mit.edu/marvel), a locally deployable, open‑source framework for domain‑aware question answering and assisted scientific research. It is designed to address the increasing demands of a digital assistant for scientific groups that can read highly technical data, cite precisely, and operate within authenticated networks. MARVEL combines a fast path for straightforward queries with a more deliberate DeepSearch mode that integrates retrieval‑augmented generation and Monte Carlo Tree Search. It explores complementary subqueries, allocates more compute to promising branches, and maintains a global evidence ledger that preserves sources during drafting. We applied this framework in the context of gravitational‑wave research related to the Laser Interferometer Gravitational‑wave Observatory. Answers are grounded in a curated semantic index of research literature, doctoral theses, LIGO documents, and long‑running detector electronic logbooks, with targeted web searches when appropriate. Because direct benchmarking against commercial LLMs cannot be performed on private data, we evaluated MARVEL on two publicly available surrogate datasets that capture comparable semantic and technical characteristics. On these benchmarks, MARVEL matches a GPT‑4o mini baseline on literature‑centric queries and substantially outperforms it on detector‑operations content, where domain retrieval and guided reasoning are decisive. By making the complete framework and evaluation datasets openly available, we aim to provide a reproducible foundation for developing domain‑specific scientific assistants.

PaperID: 6494, https://arxiv.org/pdf/2601.03413.pdf

Authors: Yigal Koifman, Eran Iceland, Erez Koifman, Ariel Barel, Alfred M. Bruckstein

Title: Sensor to Pixels: Decentralized Swarm Gathering via Image-Based Reinforcement Learning

Abstract:
This study highlights the potential of image‑based reinforcement learning methods for addressing swarm‑related tasks. In multi‑agent reinforcement learning, effective policy learning depends on how agents sense, interpret, and process inputs. Traditional approaches often rely on handcrafted feature extraction or raw vector‑based representations, which limit the scalability and efficiency of learned policies concerning input order and size. In this work we propose an image‑based reinforcement learning method for decentralized control of a multi‑agent system, where observations are encoded as structured visual inputs that can be processed by Neural Networks, extracting its spatial features and producing novel decentralized motion control rules. We evaluate our approach on a multi‑agent convergence task of agents with limited‑range and bearing‑only sensing that aim to keep the swarm cohesive during the aggregation. The algorithm's performance is evaluated against two benchmarks: an analytical solution proposed by Bellaiche and Bruckstein, which ensures convergence but progresses slowly, and VariAntNet, a neural network‑based framework that converges much faster but shows medium success rates in hard constellations. Our method achieves high convergence, with a pace nearly matching that of VariAntNet. In some scenarios, it serves as the only practical alternative.

PaperID: 6495, https://arxiv.org/pdf/2601.03398.pdf

Authors: Liam Merz Hoffmeister, Brian Scassellati, Daniel Rakita

Title: Towards Zero-Knowledge Task Planning via a Language-based Approach

Abstract:
In this work, we introduce and formalize the Zero‑Knowledge Task Planning (ZKTP) problem, i.e., formulating a sequence of actions to achieve some goal without task‑specific knowledge. Additionally, we present a first investigation and approach for ZKTP that leverages a large language model (LLM) to decompose natural language instructions into subtasks and generate behavior trees (BTs) for execution. If errors arise during task execution, the approach also uses an LLM to adjust the BTs on‑the‑fly in a refinement loop. Experimental validation in the AI2‑THOR simulator demonstrate our approach's effectiveness in improving overall task performance compared to alternative approaches that leverage task‑specific knowledge. Our work demonstrates the potential of LLMs to effectively address several aspects of the ZKTP problem, providing a robust framework for automated behavior generation with no task‑specific setup.

PaperID: 6496, https://arxiv.org/pdf/2601.03335.pdf

Authors: Akarsh Kumar, Ryan Bahlous-Boldi, Prafull Sharma, Phillip Isola, Sebastian Risi, Yujin Tang, David Ha

Title: Digital Red Queen: Adversarial Program Evolution in Core War with LLMs

Abstract:
Large language models (LLMs) are increasingly being used to evolve solutions to problems in many domains, in a process inspired by biological evolution. However, unlike biological evolution, most LLM‑evolution frameworks are formulated as static optimization problems, overlooking the open‑ended adversarial dynamics that characterize real‑world evolutionary processes. Here, we study Digital Red Queen (DRQ), a simple self‑play algorithm that embraces these so‑called "Red Queen" dynamics via continual adaptation to a changing objective. DRQ uses an LLM to evolve assembly‑like programs, called warriors, which compete against each other for control of a virtual machine in the game of Core War, a Turing‑complete environment studied in artificial life and connected to cybersecurity. In each round of DRQ, the model evolves a new warrior to defeat all previous ones, producing a sequence of adapted warriors. Over many rounds, we observe that warriors become increasingly general (relative to a set of held‑out human warriors). Interestingly, warriors also become less behaviorally diverse across independent runs, indicating a convergence pressure toward a general‑purpose behavioral strategy, much like convergent evolution in nature. This result highlights a potential value of shifting from static objectives to dynamic Red Queen objectives. Our work positions Core War as a rich, controllable sandbox for studying adversarial adaptation in artificial systems and for evaluating LLM‑based evolution methods. More broadly, the simplicity and effectiveness of DRQ suggest that similarly minimal self‑play approaches could prove useful in other more practical multi‑agent adversarial domains, like real‑world cybersecurity or combating drug resistance.

PaperID: 6497, https://arxiv.org/pdf/2601.03328.pdf

Authors: Harri Renney, Maxim N Nethercott, Nathan Renney, Peter Hayes

Title: LLM-Enabled Multi-Agent Systems: Empirical Evaluation and Insights into Emerging Design Patterns & Paradigms

Abstract:
This paper formalises the literature on emerging design patterns and paradigms for Large Language Model (LLM)‑enabled multi‑agent systems (MAS), evaluating their practical utility across various domains. We define key architectural components, including agent orchestration, communication mechanisms, and control‑flow strategies, and demonstrate how these enable rapid development of modular, domain‑adaptive solutions. Three real‑world case studies are tested in controlled, containerised pilots in telecommunications security, national heritage asset management, and utilities customer service automation. Initial empirical results show that, for these case studies, prototypes were delivered within two weeks and pilot‑ready solutions within one month, suggesting reduced development overhead compared to conventional approaches and improved user accessibility. However, findings also reinforce limitations documented in the literature, including variability in LLM behaviour that leads to challenges in transitioning from prototype to production maturity. We conclude by outlining critical research directions for improving reliability, scalability, and governance in MAS architectures and the further work needed to mature MAS design patterns to mitigate the inherent challenges.

PaperID: 6498, https://arxiv.org/pdf/2601.03301.pdf

Authors: Guotao Li, Shaoyun Xu, Yuexing Hao, Yang Wang, Yuhui Sun

Title: PC2P: Multi-Agent Path Finding via Personalized-Enhanced Communication and Crowd Perception

Abstract:
Distributed Multi‑Agent Path Finding (MAPF) integrated with Multi‑Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real‑time cooperative decision‑making in partially observable environments through inter‑agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC2P, a novel distributed MAPF method derived from a Q‑learning‑based MARL framework. Initially, we introduce a personalized‑enhanced communication mechanism based on dynamic graph topology, which ascertains the core aspects of ``who" and ``what" in interactive process through three‑stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents' heuristic observation, thereby strengthening the model's guidance for effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock issues, we propose a region‑based deadlock‑breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC2P achieves superior performance compared to state‑of‑the‑art distributed MAPF methods in varied environments. Ablation studies further confirm the effectiveness of each module for overall performance.

PaperID: 6499, https://arxiv.org/pdf/2601.03287.pdf

Authors: Huan Lin Oh, Jay Yong Jun Jie, Mandy Lee Ling Siu, Jonathan Pan

Title: Automated Post-Incident Policy Gap Analysis via Threat-Informed Evidence Mapping using Large Language Models

Abstract:
Cybersecurity post‑incident reviews are essential for identifying control failures and improving organisational resilience, yet they remain labour‑intensive, time‑consuming, and heavily reliant on expert judgment. This paper investigates whether Large Language Models (LLMs) can augment post‑incident review workflows by autonomously analysing system evidence and identifying security policy gaps. We present a threat‑informed, agentic framework that ingests log data, maps observed behaviours to the MITRE ATT&CK framework, and evaluates organisational security policies for adequacy and compliance. Using a simulated brute‑force attack scenario against a Windows OpenSSH service (MITRE ATT&CK T1110), the system leverages GPT‑4o for reasoning, LangGraph for multi‑agent workflow orchestration, and LlamaIndex for traceable policy retrieval. Experimental results indicate that the LLM‑based pipeline can interpret log‑derived evidence, identify insufficient or missing policy controls, and generate actionable remediation recommendations with explicit evidence‑to‑policy traceability. Unlike prior work that treats log analysis and policy validation as isolated tasks, this study integrates both into a unified end‑to‑end proof‑of‑concept post‑incident review framework. The findings suggest that LLM‑assisted analysis has the potential to improve the efficiency, consistency, and auditability of post‑incident evaluations, while highlighting the continued need for human oversight in high‑stakes cybersecurity decision‑making.

PaperID: 6500, https://arxiv.org/pdf/2601.03254.pdf

Authors: Bastien Vanderplaetse, Xavier Siebert, Stéphane Dupont

Title: Automated Semantic Rules Detection (ASRD) for Emergent Communication Interpretation

Abstract:
The field of emergent communication within multi‑agent systems examines how autonomous agents can independently develop communication strategies, without explicit programming, and adapt them to varied environments. However, few studies have focused on the interpretability of emergent languages. The research exposed in this paper proposes an Automated Semantic Rules Detection (ASRD) algorithm, which extracts relevant patterns in messages exchanged by agents trained with two different datasets on the Lewis Game, which is often studied in the context of emergent communication. ASRD helps at the interpretation of the emergent communication by relating the extracted patterns to specific attributes of the input data, thereby considerably simplifying subsequent analysis.

PaperID: 6501, https://arxiv.org/pdf/2601.03197.pdf

Authors: Saurabh Agarwal, Marco Laju, Jayanth Srinivasa, Myungjin Lee, Aditya Akella

Title: Software-Defined Agentic Serving

Abstract:
As multi‑agent LLM pipelines grow in complexity, existing serving paradigms fail to adapt to the dynamic serving conditions. We argue that agentic serving systems should be programmable and system‑aware, unlike existing serving which statically encode the parameters. In this work, we propose a new SDN‑inspired agentic serving framework that helps control the key attributes of communication based on runtime state. This architecture enables serving‑efficient, responsive agent systems and paves the way for high‑level intent‑driven agentic serving.

PaperID: 6502, https://arxiv.org/pdf/2601.03153.pdf

Authors: Jiakai Tang, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng

Title: Parallel Latent Reasoning for Sequential Recommendation

Abstract:
Capturing complex user preferences from sparse behavioral sequences remains a fundamental challenge in sequential recommendation. Recent latent reasoning methods have shown promise by extending test‑time computation through multi‑step reasoning, yet they exclusively rely on depth‑level scaling along a single trajectory, suffering from diminishing returns as reasoning depth increases. To address this limitation, we propose Parallel Latent Reasoning (PLR), a novel framework that pioneers width‑level computational scaling by exploring multiple diverse reasoning trajectories simultaneously. PLR constructs parallel reasoning streams through learnable trigger tokens in continuous latent space, preserves diversity across streams via global reasoning regularization, and adaptively synthesizes multi‑stream outputs through mixture‑of‑reasoning‑streams aggregation. Extensive experiments on three real‑world datasets demonstrate that PLR substantially outperforms state‑of‑the‑art baselines while maintaining real‑time inference efficiency. Theoretical analysis further validates the effectiveness of parallel reasoning in improving generalization capability. Our work opens new avenues for enhancing reasoning capacity in sequential recommendation beyond existing depth scaling.

PaperID: 6503, https://arxiv.org/pdf/2601.03137.pdf

Authors: Yangfan Jiang, Fei Wei, Ergute Bao, Yaliang Li, Bolin Ding, Yin Yang, Xiaokui Xiao

Title: Accurate Table Question Answering with Accessible LLMs

Abstract:
Given a table T in a database and a question Q in natural language, the table question answering (TQA) task aims to return an accurate answer to Q based on the content of T. Recent state‑of‑the‑art solutions leverage large language models (LLMs) to obtain high‑quality answers. However, most rely on proprietary, large‑scale LLMs with costly API access, posing a significant financial barrier. This paper instead focuses on TQA with smaller, open‑weight LLMs that can run on a desktop or laptop. This setting is challenging, as such LLMs typically have weaker capabilities than large proprietary models, leading to substantial performance degradation with existing methods. We observe that a key reason for this degradation is that prior approaches often require the LLM to solve a highly sophisticated task using long, complex prompts, which exceed the capabilities of small open‑weight LLMs. Motivated by this observation, we present Orchestra, a multi‑agent approach that unlocks the potential of accessible LLMs for high‑quality, cost‑effective TQA. Orchestra coordinates a group of LLM agents, each responsible for a relatively simple task, through a structured, layered workflow to solve complex TQA problems ‑‑ akin to an orchestra. By reducing the prompt complexity faced by each agent, Orchestra significantly improves output reliability. We implement Orchestra on top of AgentScope, an open‑source multi‑agent framework, and evaluate it on multiple TQA benchmarks using a wide range of open‑weight LLMs. Experimental results show that Orchestra achieves strong performance even with small‑ to medium‑sized models. For example, with Qwen2.5‑14B, Orchestra reaches 72.1% accuracy on WikiTQ, approaching the best prior result of 75.3% achieved with GPT‑4; with larger Qwen, Llama, or DeepSeek models, Orchestra outperforms all prior methods and establishes new state‑of‑the‑art results across all benchmarks.

PaperID: 6504, https://arxiv.org/pdf/2601.03113.pdf

Authors: Nick Pepper, Adam Keane, Amy Hodgkin, Dewi Gould, Edward Henderson, Lynge Lauritsen, Christos Vlahos, George De Ath, Richard Everson, Richard Cannon, Alvaro Sierra Castro, John Korna, Ben Carvell, Marc Thomas

Title: A Probabilistic Digital Twin of UK En Route Airspace for Training and Evaluating AI Agents for Air Traffic Control

Abstract:
This paper presents the first probabilistic Digital Twin of operational en route airspace, developed for the London Area Control Centre. The Digital Twin is intended to support the development and rigorous human‑in‑the‑loop evaluation of AI agents for Air Traffic Control (ATC), providing a virtual representation of real‑world airspace that enables safe exploration of higher levels of ATC automation. This paper makes three significant contributions: firstly, we demonstrate how historical and live operational data may be combined with a probabilistic, physics‑informed machine learning model of aircraft performance to reproduce real‑world traffic scenarios, while accurately reflecting the level of uncertainty inherent in ATC. Secondly, we develop a structured assurance case, following the Trustworthy and Ethical Assurance framework, to provide quantitative evidence for the Digital Twin's accuracy and fidelity. This is crucial to building trust in this novel technology within this safety‑critical domain. Thirdly, we describe how the Digital Twin forms a unified environment for agent testing and evaluation. This includes fast‑time execution (up to x200 real‑time), a standardised Python‑based ``gym'' interface that supports a range of AI agent designs, and a suite of quantitative metrics for assessing performance. Crucially, the framework facilitates competency‑based assessment of AI agents by qualified Air Traffic Control Officers through a Human Machine Interface. We also outline further applications and future extensions of the Digital Twin architecture.

PaperID: 6505, https://arxiv.org/pdf/2601.03073.pdf

Authors: Tong Wu, Thanet Markchom

Title: Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA

Abstract:
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative‑driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi‑agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon‑based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM‑based multi‑agent behaviour in cartoon VQA and multimodal inference.

PaperID: 6506, https://arxiv.org/pdf/2601.03061.pdf

Authors: Felipe M. Affonso

Title: Vertical tacit collusion in AI-mediated markets

Abstract:
AI shopping agents are being deployed to hundreds of millions of consumers, creating a new intermediary between platforms, sellers, and buyers. We identify a novel market failure: vertical tacit collusion, where platforms controlling rankings and sellers controlling product descriptions independently learn to exploit documented AI cognitive biases. Using multi‑agent simulation calibrated to empirical measurements of large language model biases, we show that joint exploitation produces consumer harm more than double what would occur if strategies were independent. This super‑additive harm arises because platform ranking determines which products occupy bias‑triggering positions while seller manipulation determines conversion rates. Unlike horizontal algorithmic collusion, vertical tacit collusion requires no coordination and evades antitrust detection because harm emerges from aligned incentives rather than agreement. Our findings identify an urgent regulatory gap as AI shopping agents reach mainstream adoption.

PaperID: 6507, https://arxiv.org/pdf/2601.03023.pdf

Authors: Lecheng Gong, Weimin Fang, Ting Yang, Dongjie Tao, Chunxiao Guo, Peng Wei, Bo Xie, Jinqun Guan, Zixiao Chen, Fang Shi, Jinjie Gu, Junwei Liu

Title: MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

Abstract:
Medical conversational AI (AI) plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information‑gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine‑grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi‑turn diagnostic capabilities of LLM. Our framework employs a multi‑agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real‑world electronic health records, thereby mitigating privacy and data‑governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM‑based and expert‑annotated rubric‑generation pipeline that retrieves Evidence‑Based Medicine (EBM) guidelines and utilizes the reject sampling to derive a prioritized set of rubric items ("must‑ask" items) for each case. We perform a comprehensive evaluation of state‑of‑the‑art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base‑model.

PaperID: 6508, https://arxiv.org/pdf/2601.03011.pdf

Authors: Yihan Wei, Shenghai Yuan, Tianchen Deng, Boyang Lou, Enwen Hu

Title: ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios

Abstract:
Corner cases are rare or extreme scenarios that drive real‑world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner‑Case Curation), a low‑compute framework that converts noisy web imagery into auditable fine‑grained labels via a multi‑agent recursive pipeline. First, large‑scale data acquisition and filtering expands a domain vocabulary with a vision‑language model (VLM), crawls the web, and enforces tri‑modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture‑of‑experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual‑confidence activation and uncertainty sampling, converging to a high‑precision set. Finally, region‑evidence VLM adversarial labeling pairs a proposer (multi‑granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner‑case scenarios (e.g., flooded‑car inspection), ReCCur runs on consumer‑grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.

PaperID: 6509, https://arxiv.org/pdf/2601.03007.pdf

Authors: Jingbo Qu, Yijie Wang, Yujie Fu, Putai Zhang, Weihan Li, Mian Li

Title: From inconsistency to decision: explainable operation and maintenance of battery energy storage systems

Abstract:
Battery Energy Storage Systems (BESSs) are increasingly critical to power‑system stability, yet their operation and maintenance remain dominated by reactive, expert‑dependent diagnostics. While cell‑level inconsistencies provide early warning signals of degradation and safety risks, the lack of scalable and interpretable decision‑support frameworks prevents these signals from being effectively translated into operational actions. Here we introduce an inconsistency‑driven operation and maintenance paradigm for large‑scale BESSs that systematically transforms routine monitoring data into explainable, decision‑oriented guidance. The proposed framework integrates multi‑dimensional inconsistency evaluation with large language model‑based semantic reasoning to bridge the gap between quantitative diagnostics and practical maintenance decisions. Using eight months of field data from an in‑service battery system comprising 3,564 cells, we demonstrate how electrical, thermal, and aging‑related inconsistencies can be distilled into structured operational records and converted into actionable maintenance insights through a multi‑agent framework. The proposed approach enables accurate and explainable responses to real‑world operation and maintenance queries, reducing response time and operational cost by over 80% compared with conventional expert‑driven practices. These results establish a scalable pathway for intelligent operation and maintenance of battery energy storage systems, with direct implications for reliability, safety, and cost‑effective integration of energy storage into modern power systems.

PaperID: 6510, https://arxiv.org/pdf/2601.02880.pdf

Authors: Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

Title: ReTreVal: Reasoning Tree with Validation -- A Hybrid Framework for Enhanced LLM Multi-Step Reasoning

Abstract:
Multi‑step reasoning remains a key challenge for Large Language Models (LLMs), particularly in complex domains such as mathematics and creative writing. While recent approaches including ReAct, Reflexion, and Self‑Refine improve reasoning through iterative refinement and reflection, they often lack structured exploration of alternative solution paths and persistent learning across problems. We propose ReTreVal (Reasoning Tree with Validation), a hybrid framework that integrates Tree‑of‑Thoughts exploration, self‑refinement, LLM‑based critique scoring, and reflexion memory to enable bounded and validated multi‑step reasoning. ReTreVal constructs a structured reasoning tree with adaptive depth based on problem complexity, where each node undergoes iterative self‑critique and refinement guided by explicit LLM‑generated feedback. A dual validation mechanism evaluates reasoning quality, coherence, and correctness at each node while persistently storing insights from successful reasoning paths and failure patterns in a reflexion memory buffer, enabling cross‑problem learning. Critique‑based pruning retains only the top‑k highest‑scoring nodes at each level, controlling computational cost while preserving high‑quality solution paths. We evaluate ReTreVal against ReAct, Reflexion, and Self‑Refine across 500 mathematical problems and creative writing tasks using Qwen 2.5 7B as the underlying LLM, and demonstrate that ReTreVal consistently outperforms existing methods through its combination of structured exploration, critique‑driven refinement, and cross‑problem memory, making it particularly effective for tasks requiring exploratory reasoning, rigorous verification, and knowledge transfer.

PaperID: 6511, https://arxiv.org/pdf/2601.02871.pdf

Authors: Zhiyong Cao, Dunqiang Liu, Qi Dai, Haojun Xu, Huaiyan Xu, Huan He, Yafei Liu, Siyuan Liu, XiaoLin Lin, Ke Ma, Ruqian Shi, Sijia Yao, Hao Wang, Sicheng Zhou

Title: SimRPD: Optimizing Recruitment Proactive Dialogue Agents through Simulator-Based Data Evaluation and Selection

Abstract:
Task‑oriented proactive dialogue agents play a pivotal role in recruitment, particularly for steering conversations towards specific business outcomes, such as acquiring social‑media contacts for private‑channel conversion. Although supervised fine‑tuning and reinforcement learning have proven effective for training such agents, their performance is heavily constrained by the scarcity of high‑quality, goal‑oriented domain‑specific training data. To address this challenge, we propose SimRPD, a three‑stage framework for training recruitment proactive dialogue agents. First, we develop a high‑fidelity user simulator to synthesize large‑scale conversational data through multi‑turn online dialogue. Then we introduce a multi‑dimensional evaluation framework based on Chain‑of‑Intention (CoI) to comprehensively assess the simulator and effectively select high‑quality data, incorporating both global‑level and instance‑level metrics. Finally, we train the recruitment proactive dialogue agent on the selected dataset. Experiments in a real‑world recruitment scenario demonstrate that SimRPD outperforms existing simulator‑based data selection strategies, highlighting its practical value for industrial deployment and its potential applicability to other business‑oriented dialogue scenarios.

PaperID: 6512, https://arxiv.org/pdf/2601.02779.pdf

Authors: Ting Peng

Title: Hierarchical Preemptive Holistic Collaborative Systems for Embodied Multi-Agent Systems: Framework, Hybrid Stability, and Scalability Analysis

Abstract:
The coordination of Embodied Multi‑Agent Systems in constrained physical environments requires a rigorous balance between safety, scalability, and efficiency. Traditional decentralized approaches, e.g., reactive collision avoidance, are prone to local minima or reciprocal yielding standoffs due to the lack of future intent awareness. In contrast, centralized planning suffers from intractable computational complexity and single‑point‑of‑failure vulnerabilities. To address these limitations, we propose the Hierarchical Preemptive Holistic Collaborative (Prollect) framework, which generalizes the Preemptive Holistic Collaborative System (PHCS) by decomposing the global coordination problem into topologically connected subspace optimizations. We formalize the system as a Hybrid Automaton and introduce a three‑stage receding horizon mechanism (frozen execution, preliminary planning, proactive look‑ahead windows) with explicit padding to prevent races between coordination dissemination and intent updates. Notably, we design a robust timing protocol with a mandatory Idle Buffer that acts as a dwell‑time constraint to eliminate Zeno behaviors and ensure computational stability under jitter. Furthermore, we formalize a Shadow Agent protocol to guarantee seamless trajectory consistency across subspace boundaries, which we treat as an Input‑to‑State Stability (ISS) problem.

PaperID: 6513, https://arxiv.org/pdf/2601.02757.pdf

Authors: Zixuan Xiao, Jun Ma

Title: LLM Agent Framework for Intelligent Change Analysis in Urban Environment using Remote Sensing Imagery

Abstract:
Existing change detection methods often lack the versatility to handle diverse real‑world queries and the intelligence for comprehensive analysis. This paper presents a general agent framework, integrating Large Language Models (LLM) with vision foundation models to form ChangeGPT. A hierarchical structure is employed to mitigate hallucination. The agent was evaluated on a curated dataset of 140 questions categorized by real‑world scenarios, encompassing various question types (e.g., Size, Class, Number) and complexities. The evaluation assessed the agent's tool selection ability (Precision/Recall) and overall query accuracy (Match). ChangeGPT, especially with a GPT‑4‑turbo backend, demonstrated superior performance, achieving a 90.71 % Match rate. Its strength lies particularly in handling change‑related queries requiring multi‑step reasoning and robust tool selection. Practical effectiveness was further validated through a real‑world urban change monitoring case study in Qianhai Bay, Shenzhen. By providing intelligence, adaptability, and multi‑type change analysis, ChangeGPT offers a powerful solution for decision‑making in remote sensing applications.

PaperID: 6514, https://arxiv.org/pdf/2601.02749.pdf

Authors: Nadia Sibai, Yara Ahmed, Serry Sibaee, Sawsan AlHalawani, Adel Ammar, Wadii Boulila

Title: The Path Ahead for Agentic AI: Challenges and Opportunities

Abstract:
The evolution of Large Language Models (LLMs) from passive text generators to autonomous, goal‑driven systems represents a fundamental shift in artificial intelligence. This chapter examines the emergence of agentic AI systems that integrate planning, memory, tool use, and iterative reasoning to operate autonomously in complex environments. We trace the architectural progression from statistical models to transformer‑based systems, identifying capabilities that enable agentic behavior: long‑range reasoning, contextual awareness, and adaptive decision‑making. The chapter provides three contributions: (1) a synthesis of how LLM capabilities extend toward agency through reasoning‑action‑reflection loops; (2) an integrative framework describing core components perception, memory, planning, and tool execution that bridge LLMs with autonomous behavior; (3) a critical assessment of applications and persistent challenges in safety, alignment, reliability, and sustainability. Unlike existing surveys, we focus on the architectural transition from language understanding to autonomous action, emphasizing the technical gaps that must be resolved before deployment. We identify critical research priorities, including verifiable planning, scalable multi‑agent coordination, persistent memory architectures, and governance frameworks. Responsible advancement requires simultaneous progress in technical robustness, interpretability, and ethical safeguards to realize potential while mitigating risks of misalignment and unintended consequences.

PaperID: 6515, https://arxiv.org/pdf/2601.02744.pdf

Authors: Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You, Yifan Zhou, Ruidong Zhang, Andrea Sikora, Lin Zhao, Yohannes Abate, Tianming Liu

Title: SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Abstract:
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval‑augmented approaches fail to address the disconnected nature of long‑term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre‑computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub‑graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation‑based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state‑of‑the‑art methods in complex temporal and multi‑hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.

PaperID: 6516, https://arxiv.org/pdf/2601.02732.pdf

Authors: Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Leyi Pan, Chiming Duan, Minghua He, Mengxi Jia, Ying Li

Title: Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices

Abstract:
As contemporary microservice systems become increasingly popular and complex‑often comprising hundreds or even thousands of fine‑grained, interdependent subsystems‑they are experiencing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While many traditional graph‑based and deep learning approaches have been explored for this task, they often rely heavily on pre‑defined schemas that struggle to adapt to evolving operational contexts. Consequently, a number of LLM‑based methods have recently been proposed. However, these methods still face two major limitations: shallow, symptom‑centric reasoning that undermines accuracy, and a lack of cross‑alert reuse that leads to redundant reasoning and high latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) localize the root causes of failures, drawing insights from professionals across multiple organizations. Our investigation reveals that expert root cause analysis exhibits three key characteristics: recursiveness, multi‑dimensional expansion, and cross‑modal reasoning. Motivated by these findings, we introduce AMER‑RCL, an agentic memory enhanced recursive reasoning framework for root cause localization in microservices. AMER‑RCL employs the Recursive Reasoning RCL engine, a multi‑agent framework that performs recursive reasoning on each alert to progressively refine candidate causes, while Agentic Memory incrementally accumulates and reuses reasoning from prior alerts within a time window to reduce redundant exploration and lower inference latency. Experimental results demonstrate that AMER‑RCL consistently outperforms state‑of‑the‑art methods in both localization accuracy and inference efficiency.

PaperID: 6517, https://arxiv.org/pdf/2601.02714.pdf

Authors: Zhi Liu, Guangzhi Wang

Title: Time-Scaling Is What Agents Need Now

Abstract:
Early artificial intelligence paradigms exhibited separated cognitive functions: Neural Networks focused on "perception‑representation," Reinforcement Learning on "decision‑making‑behavior," and Symbolic AI on "knowledge‑reasoning." With Transformer‑based large models and world models, these paradigms are converging into cognitive agents with closed‑loop "perception‑decision‑action" capabilities. Humans solve complex problems under limited cognitive resources through temporalized sequential reasoning. Language relies on problem space search for deep semantic reasoning. While early large language models (LLMs) could generate fluent text, they lacked robust semantic reasoning capabilities. Prompting techniques like Chain‑of‑Thought (CoT) and Tree‑of‑Thought (ToT) extended reasoning paths by making intermediate steps explicit. Recent models like DeepSeek‑R1 enhanced performance through explicit reasoning trajectories. However, these methods have limitations in search completeness and efficiency. This highlights the need for "Time‑Scaling"‑‑the systematic extension and optimization of an agent's ability to unfold reasoning over time. Time‑Scaling refers to architectural design utilizing extended temporal pathways, enabling deeper problem space exploration, dynamic strategy adjustment, and enhanced metacognitive control, paralleling human sequential reasoning under cognitive constraints. It represents a critical frontier for enhancing deep reasoning and problem‑solving without proportional increases in static model parameters. Advancing intelligent agent capabilities requires placing Time‑Scaling principles at the forefront, positioning explicit temporal reasoning management as foundational.

PaperID: 6518, https://arxiv.org/pdf/2601.02695.pdf

Authors: Guibin Zhang, Haiyang Yu, Kaiming Yang, Bingli Wu, Fei Huang, Yongbin Li, Shuicheng Yan

Title: EvoRoute: Experience-Driven Self-Routing LLM Agent Systems

Abstract:
Complex agentic AI systems, powered by a coordinated ensemble of Large Language Models (LLMs), tool and memory modules, have demonstrated remarkable capabilities on intricate, multi‑turn tasks. However, this success is shadowed by prohibitive economic costs and severe latency, exposing a critical, yet underexplored, trade‑off. We formalize this challenge as the Agent System Trilemma: the inherent tension among achieving state‑of‑the‑art performance, minimizing monetary cost, and ensuring rapid task completion. To dismantle this trilemma, we introduce EvoRoute, a self‑evolving model routing paradigm that transcends static, pre‑defined model assignments. Leveraging an ever‑expanding knowledge base of prior experience, EvoRoute dynamically selects Pareto‑optimal LLM backbones at each step, balancing accuracy, efficiency, and resource use, while continually refining its own selection policy through environment feedback. Experiments on challenging agentic benchmarks such as GAIA and BrowseComp+ demonstrate that EvoRoute, when integrated into off‑the‑shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.

PaperID: 6519, https://arxiv.org/pdf/2601.02598.pdf

Authors: Yiyang Li, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye

Title: LongDA: Benchmarking LLM Agents for Long-Document Data Analysis

Abstract:
We introduce LongDA, a data analysis benchmark for evaluating LLM‑based agents under documentation‑intensive analytical workflows. In contrast to existing benchmarks that assume well‑specified schemas and inputs, LongDA targets real‑world settings in which navigating long documentation and complex data is the primary bottleneck. To this end, we manually curate raw data files, long and heterogeneous documentation, and expert‑written publications from 17 publicly available U.S. national surveys, from which we extract 505 analytical queries grounded in real analytical practice. Solving these queries requires agents to first retrieve and integrate key information from multiple unstructured documents, before performing multi‑step computations and writing executable code, which remains challenging for existing data analysis agents. To support the systematic evaluation under this setting, we develop LongTA, a tool‑augmented agent framework that enables document access, retrieval, and code execution, and evaluate a range of proprietary and open‑source models. Our experiments reveal substantial performance gaps even among state‑of‑the‑art models, highlighting the challenges researchers should consider before applying LLM agents for decision support in real‑world, high‑stakes analytical settings.

PaperID: 6520, https://arxiv.org/pdf/2601.02577.pdf

Authors: Alexander Roman, Jacob Roman

Title: Orchestral AI: A Framework for Agent Orchestration

Abstract:
The rapid proliferation of LLM agent frameworks has forced developers to choose between vendor lock‑in through provider‑specific SDKs and complex multi‑package ecosystems that obscure control flow and hinder reproducibility. Integrating tool calling across multiple LLM providers remains a core engineering challenge due to fragmented APIs, incompatible message formats, and inconsistent streaming and tool‑calling behavior, making it difficult to build portable, reliable agent systems. We introduce Orchestral, a lightweight Python framework that provides a unified, type‑safe interface for building LLM agents across major providers while preserving the simplicity required for scientific computing and production deployment. Orchestral defines a single universal representation for messages, tools, and LLM usage that operates seamlessly across providers, eliminating manual format translation and reducing framework‑induced complexity. Automatic tool schema generation from Python type hints removes the need for handwritten descriptors while maintaining type safety across provider boundaries. A synchronous execution model with streaming support enables deterministic behavior, straightforward debugging, and real‑time interaction without introducing server dependencies. The framework's modular architecture cleanly separates provider integration, tool execution, conversation orchestration, and user‑facing interfaces, enabling extensibility without architectural entanglement. Orchestral supports advanced agent capabilities found in larger frameworks, including rich tool calling, context compaction, workspace sandboxing, user approval workflows, sub‑agents, memory management, and MCP integration.

PaperID: 6521, https://arxiv.org/pdf/2601.02454.pdf

Authors: Saba Naqvi, Mohammad Baqar, Nawaz Ali Mohammad

Title: The Rise of Agentic Testing: Multi-Agent Systems for Robust Software Quality Assurance

Abstract:
Software testing has progressed toward intelligent automation, yet current AI‑based test generators still suffer from static, single‑shot outputs that frequently produce invalid, redundant, or non‑executable tests due to the lack of execution aware feedback. This paper introduces an agentic multi‑model testing framework a closed‑loop, self‑correcting system in which a Test Generation Agent, an Execution and Analysis Agent, and a Review and Optimization Agent collaboratively generate, execute, analyze, and refine tests until convergence. By using sandboxed execution, detailed failure reporting, and iterative regeneration or patching of failing tests, the framework autonomously improves test quality and expands coverage. Integrated into a CI/CD‑compatible pipeline, it leverages reinforcement signals from coverage metrics and execution outcomes to guide refinement. Empirical evaluations on microservice based applications show up to a 60% reduction in invalid tests, 30% coverage improvement, and significantly reduced human effort compared to single‑model baselines demonstrating that multi‑agent, feedback‑driven loops can evolve software testing into an autonomous, continuously learning quality assurance ecosystem for self‑healing, high‑reliability codebases.

PaperID: 6522, https://arxiv.org/pdf/2601.02407.pdf

Authors: Oguzhan Yildirim

Title: Evolving Personalities in Chaos: An LLM-Augmented Framework for Character Discovery in the Iterated Prisoners Dilemma under Environmental Stress

Abstract:
Standard simulations of the Iterated Prisoners Dilemma (IPD) operate in deterministic, noise‑free environments, producing strategies that may be theoretically optimal but fragile when confronted with real‑world uncertainty. This paper addresses two critical gaps in evolutionary game theory research: (1) the absence of realistic environmental stressors during strategy evolution, and (2) the Interpretability Gap, where evolved genetic strategies remain opaque binary sequences devoid of semantic meaning. We introduce a novel framework combining stochastic environmental perturbations (God Mode) with Large Language Model (LLM)‑based behavioral profiling to transform evolved genotypes into interpretable character archetypes. Our experiments demonstrate that strategies evolved under chaotic conditions exhibit superior resilience and present distinct behavioral phenotypes, ranging from Ruthless Capitalists to Diplomatic Enforcers. These phenotypes are readily classified by LLMs but remain nearly impossible to interpret through manual genome inspection alone. This work bridges evolutionary computation with explainable AI and provides a template for automated agent characterization in multi‑agent systems.

PaperID: 6523, https://arxiv.org/pdf/2601.02398.pdf

Authors: S. Zhang, M. Feizarefi, A. F. Mirzaei

Title: AI-Native Integrated Sensing and Communications for Self-Organizing Wireless Networks: Architectures, Learning Paradigms, and System-Level Design

Abstract:
Integrated Sensing and Communications (ISAC) is emerging as a foundational paradigm for next‑generation wireless networks, enabling communication infrastructures to simultaneously support data transmission and environment sensing. By tightly coupling radio sensing with communication functions, ISAC unlocks new capabilities for situational awareness, localization, tracking, and network adaptation. At the same time, the increasing scale, heterogeneity, and dynamics of future wireless systems demand self‑organizing network intelligence capable of autonomously managing resources, topology, and services. Artificial intelligence (AI), particularly learning‑driven and data‑centric methods, has become a key enabler for realizing this vision. This survey provides a comprehensive and system‑level review of AI‑native ISAC‑enabled self‑organizing wireless networks. We develop a unified taxonomy that spans: (i) ISAC signal models and sensing modalities, (ii) network state abstraction and perception from sensing‑aware radio data, (iii) learning‑driven self‑organization mechanisms for resource allocation, topology control, and mobility management, and (iv) cross‑layer architectures integrating sensing, communication, and network intelligence. We further examine emerging learning paradigms, including deep reinforcement learning, graph‑based learning, multi‑agent coordination, and federated intelligence that enable autonomous adaptation under uncertainty, mobility, and partial observability. Practical considerations such as sensing‑communication trade‑offs, scalability, latency, reliability, and security are discussed alongside representative evaluation methodologies and performance metrics. Finally, we identify key open challenges and future research directions toward deployable, trustworthy, and scalable AI‑native ISAC systems for 6G and beyond.

PaperID: 6524, https://arxiv.org/pdf/2601.02397.pdf

Authors: Alireza Rezaee

Title: Evolutionary Algorithms for Computing Nash Equilibria in Dynamic Games

Abstract:
Dynamic nonzero sum games are widely used to model multi agent decision making in control, economics, and related fields. Classical methods for computing Nash equilibria, especially in linear quadratic settings, rely on strong structural assumptions and become impractical for nonlinear dynamics, many players, or long horizons, where multiple local equilibria may exist. We show through examples that such methods can fail to reach the true global Nash equilibrium even in relatively small games. To address this, we propose two population based evolutionary algorithms for general dynamic games with linear or nonlinear dynamics and arbitrary objective functions: a co evolutionary genetic algorithm and a hybrid genetic algorithm particle swarm optimization scheme. Both approaches search directly over joint strategy spaces without restrictive assumptions and are less prone to getting trapped in local Nash equilibria, providing more reliable approximations to global Nash solutions.

PaperID: 6525, https://arxiv.org/pdf/2601.02375.pdf

Authors: Madison Bochard, Tim Conser, Alyssa Duran, Lazaro Martull, Pu Tian, Yalong Wu

Title: LeafTutor: An AI Agent for Programming Assignment Tutoring

Abstract:
High enrollment in STEM‑related degree programs has created increasing demand for scalable tutoring support, as universities experience a shortage of qualified instructors and teaching assistants (TAs). To address this challenge, LeafTutor, an AI tutoring agent powered by large language models (LLMs), was developed to provide step‑by‑step guidance for students. LeafTutor was evaluated through real programming assignments. The results indicate that the system can deliver step‑by‑step programming guidance comparable to human tutors. This work demonstrates the potential of LLM‑driven tutoring solutions to enhance and personalize learning in STEM education. If any reader is interested in collaboration with our team to improve or test LeafTutor, please contact Pu Tian (pu.tian@stockton.edu) or Yalong Wu (wuy@uhcl.edu).

PaperID: 6526, https://arxiv.org/pdf/2601.02314.pdf

Authors: Sourena Khanzadeh

Title: Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Abstract:
As Large Language Model (LLM) agents are increasingly tasked with high‑stakes autonomous decision‑making, the transparency of their reasoning processes has become a critical safety concern. While Chain‑of‑Thought (CoT) prompting allows agents to generate human‑readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post‑hoc rationalizations. We introduce Project Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface‑level textual similarity, Project Ariadne performs hard interventions (do‑calculus) on intermediate reasoning nodes ‑‑ systematically inverting logic, negating premises, and reversing factual claims ‑‑ to measure the Causal Sensitivity (ϕ) of the terminal answer. Our empirical evaluation of state‑of‑the‑art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode termed Causal Decoupling, where agents exhibit a violation density (ρ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision‑making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.

PaperID: 6527, https://arxiv.org/pdf/2601.02109.pdf

Authors: Hossein Rastgoftar

Title: Finite-State Decentralized Policy-Based Control With Guaranteed Ground Coverage

Abstract:
We propose a finite‑state, decentralized decision and control framework for multi‑agent ground coverage. The approach decomposes the problem into two coupled components: (i) the structural design of a deep neural network (DNN) induced by the reference configuration of the agents, and (ii) policy‑based decentralized coverage control. Agents are classified as anchors and followers, yielding a generic and scalable communication architecture in which each follower interacts with exactly three in‑neighbors from the preceding layer, forming an enclosing triangular communication structure. The DNN training weights implicitly encode the spatial configuration of the agent team, thereby providing a geometric representation of the environmental target set. Within this architecture, we formulate a computationally efficient decentralized Markov decision process (MDP) whose components are time‑invariant except for a time‑varying cost function defined by the deviation from the centroid of the target set contained within each agent communication triangle. By introducing the concept of Anyway Output Controllability (AOC), we assume each agent is AOC and establish decentralized convergence to a desired configuration that optimally represents the environmental target.

PaperID: 6528, https://arxiv.org/pdf/2601.02021.pdf

Authors: Runze Zheng, Yuqing Zheng, Zhengyi Cheng, Long Luo, Haoxiang Luo, Gang Sun, Hongfang Yu, Dusit Niyato

Title: AgentVNE: LLM-Augmented Graph Reinforcement Learning for Affinity-Aware Multi-Agent Placement in Edge Agentic AI

Abstract:
The Internet of Agents is propelling edge computing toward agentic AI and edge general intelligence (EGI). However, deploying multi‑agent service (MAS) on resource‑constrained edge infrastructure presents severe challenges. MAS service workflows are driven by complex cross‑node interactions, dynamic memory accumulation, and collaborative tool usage. Exhibiting chain‑like topological dependencies and strict affinity constraints, these workflows demand real‑time responsiveness that exceeds the capabilities of traditional VNE algorithms designed for static resources. To address this, we propose AgentVNE, a cloud‑edge collaborative framework utilizing a dual‑layer architecture. First, AgentVNE employs a large language model (LLM) to identify implicit semantic constraints and generate affinity‑based resource augmentation to resolve physical dependency issues. Second, it constructs a resource similarity‑aware neural network, utilizing a pre‑training and PPO fine‑tuning strategy to precisely capture topological similarities between dynamic workflows and heterogeneous networks. By coupling semantic perception with topological reasoning, this mechanism effectively bridges the gap between dynamic service requirements and physical infrastructure. Simulation results demonstrate that AgentVNE reduces workflow communication latency to less than 40% of baselines and improves the service acceptance rate by approximately 5%‑10% under high‑load scenarios. Ultimately, this work provides a foundational solution for the semantic‑aware deployment of agentic AI.

PaperID: 6529, https://arxiv.org/pdf/2601.01993.pdf

Authors: Dong Xue, Jicheng Tu, Ming Wang, Xin Yan, Fangzhou Liu, Jie Hu

Title: Towards Privacy-Preserving Mental Health Support with Large Language Models

Abstract:
Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy‑preserving LLM for mental health support, together with MindCorpus, a synthetic multi‑turn counseling dataset constructed via a multi‑agent role‑playing framework. To synthesize high‑quality counseling data, the developed dialogue‑construction framework employs a dual closed‑loop feedback design to integrate psychological expertise and counseling techniques through role‑playing: (i) turn‑level critique‑and‑revision to improve coherence and counseling appropriateness within a session, and (ii) session‑level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine‑tune the base model using federated learning with parameter‑efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic‑data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling‑oriented LLM baselines under both automatic LLM‑judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.

PaperID: 6530, https://arxiv.org/pdf/2601.01891.pdf

Authors: Niloufar Alipour Talemi, Julia Boone, Fatemeh Afghah

Title: Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems

Abstract:
The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single‑agent copilots and multi‑agent systems while analyzing architectural foundations such as planning mechanisms, retrieval‑augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel‑level accuracy to trajectory‑aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.

PaperID: 6531, https://arxiv.org/pdf/2601.01844.pdf

Authors: Udiptaman Das, Krishnasai B. Atmakuri, Duy Ho, Chi Lee, Yugyung Lee

Title: Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

Abstract:
Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end‑to‑end framework for clinical KG construction and evaluation directly from free text using multi‑agent prompting and a schema‑constrained Retrieval‑Augmented Generation (KG‑RAG) strategy. Our pipeline integrates (1) prompt‑driven entity, attribute, and relation extraction; (2) entropy‑based uncertainty scoring; (3) ontology‑aligned RDF/OWL schema generation; and (4) multi‑LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self‑supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL‑compatible, and clinically grounded knowledge graphs without relying on gold‑standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.

PaperID: 6532, https://arxiv.org/pdf/2601.01831.pdf

Authors: Aniket Wattamwar, Sampson Akwafuo

Title: ARIES: A Scalable Multi-Agent Orchestration Framework for Real-Time Epidemiological Surveillance and Outbreak Monitoring

Abstract:
Global health surveillance is currently facing a challenge of Knowledge Gaps. While general‑purpose AI has proliferated, it remains fundamentally unsuited for the high‑stakes epidemiological domain due to chronic hallucinations and an inability to navigate specialized data silos. This paper introduces ARIES (Agentic Retrieval Intelligence for Epidemiological Surveillance), a specialized, autonomous multi‑agent framework designed to move beyond static, disease‑specific dashboards toward a dynamic intelligence ecosystem. Built on a hierarchical command structure, ARIES utilizes GPTs to orchestrate a scalable swarm of sub‑agents capable of autonomously querying World Health Organization (WHO), Center for Disease Control and Prevention (CDC), and peer‑reviewed research papers. By automating the extraction and logical synthesis of surveillance data, ARIES provides a specialized reasoning that identifies emergent threats and signal divergence in near real‑time. This modular architecture proves that a task‑specific agentic swarm can outperform generic models, offering a robust, extensible for next‑generation outbreak response and global health intelligence.

PaperID: 6533, https://arxiv.org/pdf/2601.01743.pdf

Authors: Bin Xu

Title: AI Agent Systems: Architectures, Applications, and Evaluation

Abstract:
AI agents ‑‑ systems that combine foundation models with reasoning, planning, memory, and tool use ‑‑ are rapidly becoming a practical interface between natural‑language intent and real‑world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain‑of‑thought‑style decomposition, self‑reflection and verification, and constraint‑aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi‑step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single‑agent vs.\ multi‑agent; centralized vs.\ decentralized coordination), and deployment settings (offline analysis vs.\ online interactive assistance; safety‑critical vs.\ open‑ended tasks). We discuss key design trade‑offs ‑‑ latency vs.\ accuracy, autonomy vs.\ controllability, and capability vs.\ reliability ‑‑ and highlight how evaluation is complicated by non‑determinism, long‑horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.

PaperID: 6534, https://arxiv.org/pdf/2601.01723.pdf

Authors: Vignesh Iyer

Title: Structural Representations for Cross-Attack Generalization in AI Agent Threat Detection

Abstract:
Autonomous AI agents executing multi‑step tool sequences face semantic attacks that manifest in behavioral traces rather than isolated prompts. A critical challenge is cross‑attack generalization: can detectors trained on known attack families recognize novel, unseen attack types? We discover that standard conversational tokenization ‑‑ capturing linguistic patterns from agent interactions ‑‑ fails catastrophically on structural attacks like tool hijacking (AUC 0.39) and data exfiltration (AUC 0.46), while succeeding on linguistic attacks like social engineering (AUC 0.78). We introduce structural tokenization, encoding execution‑flow patterns (tool calls, arguments, observations) rather than conversational content. This simple representational change dramatically improves cross‑attack generalization: +46 AUC points on tool hijacking, +39 points on data exfiltration, and +71 points on unknown attacks, while simultaneously improving in‑distribution performance (+6 points). For attacks requiring linguistic features, we propose gated multi‑view fusion that adaptively combines both representations, achieving AUC 0.89 on social engineering without sacrificing structural attack detection. Our findings reveal that AI agent security is fundamentally a structural problem: attack semantics reside in execution patterns, not surface language. While our rule‑based tokenizer serves as a baseline, the structural abstraction principle generalizes even with simple implementation.

PaperID: 6535, https://arxiv.org/pdf/2601.01592.pdf

Authors: Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang, Xia Hu

Title: OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Abstract:
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red‑teaming benchmarks are often fragmented, limited to single‑turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high‑throughput red‑teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red‑teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high‑throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white‑box gradients, multi‑modal perturbations, and sophisticated multi‑agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT‑5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi‑turn jailbreaks. By open‑sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.

PaperID: 6536, https://arxiv.org/pdf/2601.01569.pdf

Authors: Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sirui Han, Yang You, Bo An, Yike Guo, Jun Song

Title: CaveAgent: Transforming LLMs into Stateful Runtime Operators

Abstract:
LLM‑based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text‑centric paradigms that struggle with long‑horizon tasks due to fragile multi‑turn dependencies and context drift. We present CaveAgent, a framework that shifts tool use from ``LLM‑as‑Text‑Generator'' to ``LLM‑as‑Runtime‑Operator.'' CaveAgent introduces a dual‑stream architecture that inverts the conventional paradigm: rather than treating the LLM's text context as the primary workspace with tools as auxiliary, CaveAgent elevates the persistent Python runtime as the central locus of state, with a lightweight semantic stream serving as its orchestrator. Beyond leveraging code generation to resolve interdependent sub‑tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces Stateful Runtime Management: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code‑based approaches that remain text‑bound. CaveAgent further provides a runtime‑integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high‑fidelity external memory that reduces context drift in multi‑turn interactions and preserves processed data for downstream applications without information loss. Evaluations show consistent improvement across challenging benchmarks, enabling CaveAgent to handle data scales that cause context overflow in both JSON‑based and code‑based agents. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).

PaperID: 6537, https://arxiv.org/pdf/2601.01522.pdf

Authors: Danial Amin

Title: Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

Abstract:
Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds "confidence," and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost‑aware multi‑LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected‑cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT‑4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single‑LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi‑LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement‑triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.

PaperID: 6538, https://arxiv.org/pdf/2601.01498.pdf

Authors: Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, Cunyin Peng, Xiangyu Zhao, Chenyi Zhuang, Ji Zhang

Title: From Failure to Mastery: Generating Hard Samples for Tool-use Agents

Abstract:
The advancement of LLM agents with tool‑use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool‑use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain‑of‑Thought (CoT), with a closed‑loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open‑source and closed‑source competitors (e.g., GPT‑5.2, Gemini‑3‑Pro and Claude‑Opus‑4.5). Our code, models, and dataset will be open‑sourced to facilitate future research.

PaperID: 6539, https://arxiv.org/pdf/2601.01407.pdf

Authors: Arjhun Sreedar, Rohan Pillay, Laukik Patade

Title: From Emotion Classification to Emotional Reasoning: Enhancing Emotional Intelligence in Large Language Models

Abstract:
This work investigates whether synthetic emotional chain‑of‑thought data can improve the emotional reasoning abilities of smaller open large language models (LLMs). We design a multi‑agent generation pipeline that produces therapy‑style conversations and converts them into structured emotion multiple‑choice questions (MCQs) with explanations. We propose that fine‑tuning a variety of 7B models on this dataset should yield substantial gains in emotional understanding and emotional awareness on EmoBench‑style evaluations, suggesting that emotional reasoning can be induced without architectural changes. Our results demonstrate that fine‑tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0, validating the effectiveness of synthetic emotional reasoning data for enhancing model capabilities in nuanced emotional tasks.

PaperID: 6540, https://arxiv.org/pdf/2601.01357.pdf

Authors: Ke Xiao, Haoze Zhang, Runze Mao, Han Li, Zhi X. Chen

Title: Towards LLM-enabled autonomous combustion research: A literature-aware agent for self-corrective modeling workflows

Abstract:
The rapid evolution of large language models (LLMs) is transforming artificial intelligence into autonomous research partners, yet a critical gap persists in complex scientific domains such as combustion modeling. Here, practical AI assistance requires the seamless integration of domain literature knowledge with robust execution capabilities for expertise‑intensive tools such as computational fluid dynamics (CFD) codes. To bridge this gap, we introduce FlamePilot, an LLM agent designed to empower combustion modeling research through automated and self‑corrective CFD workflows. FlamePilot differentiates itself through an architecture that leverages atomic tools to ensure the robust setup and execution of complex simulations in both OpenFOAM and extended frameworks such as DeepFlame. The system is also capable of learning from scientific articles, extracting key information to guide the simulation from initial setup to optimized results. Validation on a public benchmark shows FlamePilot achieved a perfect 1.0 executability score and a 0.438 success rate, surpassing the prior best reported agent scores of 0.625 and 0.250, respectively. Furthermore, a detailed case study on Moderate or Intense Low‑oxygen Dilution (MILD) combustion simulation demonstrates its efficacy as a collaborative research copilot, where FlamePilot autonomously translated a research paper into a configured simulation, conducted the simulation, post‑processed the results, proposed evidence‑based refinements, and managed a multi‑step parameter study to convergence under minimal human intervention. By adopting a transparent and interpretable paradigm, FlamePilot establishes a foundational framework for AI‑empowered combustion modeling, fostering a collaborative partnership where the agent manages workflow orchestration, freeing the researcher for high‑level analysis.

PaperID: 6541, https://arxiv.org/pdf/2601.01335.pdf

Authors: Zihan Li, Ziming Wang, Chenning Liu, Xin Wang

Title: Neural-network-based Self-triggered Observed Platoon Control for Autonomous Vehicles

Abstract:
This paper investigates autonomous vehicle (AV) platoon control under uncertain dynamics and intermittent communication, which remains a critical challenge in intelligent transportation systems. To address these issues, this paper proposes an adaptive consensus tracking control framework for nonlinear multi‑agent systems (MASs). The proposed approach integrates backstepping design, a nonlinear sampled‑data observer, radial basis function neural networks, and a self‑triggered communication mechanism. The radial basis function neural networks approximate unknown nonlinearities and time‑varying disturbances, thereby enhancing system robustness. A distributed observer estimates neighboring states based on limited and intermittent measurements, thereby reducing dependence on continuous communication. Moreover, self‑triggered mechanism is developed to determine triggering instants, guaranteeing a strictly positive minimum inter‑event time and preventing Zeno behavior. The theoretical analysis proves that all closed‑loop signals are uniformly ultimately bounded (UUB), and tracking errors converge to a compact set. Simulation results demonstrate that the proposed approach achieves high robustness, adaptability, and communication efficiency, making it suitable for real‑world networked vehicle systems.

PaperID: 6542, https://arxiv.org/pdf/2601.01331.pdf

Authors: Hongkun Yang, Lionel Z. Wang, Wei Fan, Yiran Hu, Lixu Wang, Chenyu Liu, Yu Zeng, Shenghong Fu, Lei Gong, Zhengxin Zhang, Haoyang Li, Jiexin Zheng, Xin Xu

Title: AppellateGen: A Benchmark for Appellate Legal Judgment Generation

Abstract:
Legal judgment generation is a critical task in legal intelligence. However, existing research in legal judgment generation has predominantly focused on first‑instance trials, relying on static fact‑to‑verdict mappings while neglecting the dialectical nature of appellate (second‑instance) review. To address this, we introduce AppellateGen, a benchmark for second‑instance legal judgment generation comprising 7,351 case pairs. The task requires models to draft legally binding judgments by reasoning over the initial verdict and evidentiary updates, thereby modeling the causal dependency between trial stages. We further propose a judicial Standard Operating Procedure (SOP)‑based Legal Multi‑Agent System (SLMAS) to simulate judicial workflows, which decomposes the generation process into discrete stages of issue identification, retrieval, and drafting. Experimental results indicate that while SLMAS improves logical consistency, the complexity of appellate reasoning remains a substantial challenge for current LLMs. The dataset and code are publicly available at: https://anonymous.4open.science/r/AppellateGen‑5763.

PaperID: 6543, https://arxiv.org/pdf/2601.01298.pdf

Authors: Jorge L. Ruiz Williams

Title: Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware

Abstract:
Current multi‑agent Large Language Model (LLM) frameworks suffer from linear memory scaling, rendering "System 2" parallel reasoning impractical on consumer hardware. We present Warp Cortex, an asynchronous architecture that theoretically enables million‑agent cognitive scaling by decoupling agent logic from physical memory. Through Singleton Weight Sharing and a novel Topological Synapse‑‑inspired by hybrid landmarking techniques from Topological Data Analysis (TDA)‑‑we reduce memory complexity from O(N L) to O(1) for weights and O(N k) for context, where k << L. By treating the KV‑cache as a point cloud in latent space, we apply witness‑complex‑inspired sparsification to preserve persistent homological features of the context manifold. On a single NVIDIA RTX 4090, we empirically demonstrate 100 concurrent agents at 2.2 GB total VRAM, with theoretical capacity exceeding 1,000 agents before compute latency becomes the bottleneck. We further introduce Referential Injection, a non‑intrusive KV‑cache update mechanism that allows asynchronous sub‑agents to influence primary generation without stream disruption.

PaperID: 6544, https://arxiv.org/pdf/2601.01241.pdf

Authors: Zhuoran Tan, Run Hao, Jeremy Singer, Yutian Tang, Christos Anagnostopoulos

Title: MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools

Abstract:
Tool‑augmented LLM agents raise new security risks: tool executions can introduce runtime‑only behaviors, including prompt injection and unintended exposure of external inputs (e.g., environment secrets or local files). While existing scanners often focus on static artifacts, analyzing runtime behavior is challenging because directly executing untrusted tools can itself be dangerous. We present MCP‑SandboxScan, a lightweight framework motivated by the Model Context Protocol (MCP) that safely executes untrusted tools inside a WebAssembly/WASI sandbox and produces auditable reports of external‑to‑sink exposures. Our prototype (i) extracts LLM‑relevant sinks from runtime outputs (prompt/messages and structured tool‑return fields), (ii) instantiates external‑input candidates from environment values, mounted file contents, and output‑surfaced HTTP fetch intents, and (iii) links sources to sinks via snippet‑based substring matching. Case studies on three representative tools show that MCP‑SandboxScan can surface provenance evidence when external inputs appear in prompt/messages or tool‑return payloads, and can expose filesystem capability violations as runtime evidence. We further compare against a lightweight static string‑signature baseline and use a micro‑benchmark to characterize false negatives under transformations and false positives from short‑token collisions.

PaperID: 6545, https://arxiv.org/pdf/2601.01233.pdf

Authors: Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Mingyue Leng, Xiaoguang Mao

Title: Atomizer: An LLM-based Collaborative Multi-Agent Framework for Intent-Driven Commit Untangling

Abstract:
Composite commits, which entangle multiple unrelated concerns, are prevalent in software development and significantly hinder program comprehension and maintenance. Existing automated untangling methods, particularly state‑of‑the‑art graph clustering‑based approaches, are fundamentally limited by two issues. (1) They over‑rely on structural information, failing to grasp the crucial semantic intent behind changes, and (2) they operate as ``single‑pass'' algorithms, lacking a mechanism for the critical reflection and refinement inherent in human review processes. To overcome these challenges, we introduce Atomizer, a novel collaborative multi‑agent framework for composite commit untangling. To address the semantic deficit, Atomizer employs an Intent‑Oriented Chain‑of‑Thought (IO‑CoT) strategy, which prompts large language models (LLMs) to infer the intent of each code change according to both the structure and the semantic information of code. To overcome the limitations of ``single‑pass'' grouping, we employ two agents to establish a grouper‑reviewer collaborative refinement loop, which mirrors human review practices by iteratively refining groupings until all changes in a cluster share the same underlying semantic intent. Extensive experiments on two benchmark C# and Java datasets demonstrate that Atomizer significantly outperforms several representative baselines. On average, it surpasses the state‑of‑the‑art graph‑based methods by over 6.0% on the C# dataset and 5.5% on the Java dataset. This superiority is particularly pronounced on complex commits, where Atomizer's performance advantage widens to over 16%.

PaperID: 6546, https://arxiv.org/pdf/2601.01155.pdf

Authors: Shizhe Zhang, Jingsong Liang, Zhitao Zhou, Shuhan Ye, Yizhuo Wang, Ming Siang Derek Tan, Jimmy Chiun, Yuhong Cao, Guillaume Sartoretti

Title: ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation

Abstract:
Existing methods for multi‑agent navigation typically assume fully known environments, offering limited support for partially known scenarios with outdated or imperfect prior maps, such as warehouses or factory floors. There, agents need to balance path optimality with collecting and sharing environmental information to help teammates reach their own targets. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi‑agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate toward individual targets, and actively reduce task‑relevant map uncertainty through online observation sharing in a closed perception‑action loop. We first design a shared graph encoder that fuses prior map with online perception into a unified representation, providing robust state embeddings under environmental discrepancies. At the core of ORION is an option‑critic framework that learns high‑level cooperative modes translated into sequences of low‑level actions, enabling adaptive switching between individual navigation and team‑level exploration. We further introduce a dual‑stage cooperation strategy that allows agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze‑like maps and large‑scale warehouse environments, ORION achieves high‑quality real‑time decentralized cooperation while scaling to up to 10 robots, outperforming state‑of‑the‑art classical and learning‑based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real‑world cooperative navigation.

PaperID: 6547, https://arxiv.org/pdf/2601.01139.pdf

Authors: Sriram Rajasekar, Ashwini Ratnoo

Title: Latent Space Reinforcement Learning for Multi-Robot Exploration

Abstract:
Autonomous mapping of unknown environments is a critical challenge, particularly in scenarios where time is limited. Multi‑agent systems can enhance efficiency through collaboration, but the scalability of motion‑planning algorithms remains a key limitation. Reinforcement learning has been explored as a solution, but existing approaches are constrained by the limited input size required for effective learning, restricting their applicability to discrete environments. This work addresses that limitation by leveraging autoencoders to perform dimensionality reduction, compressing high‑fidelity occupancy maps into latent state vectors while preserving essential spatial information. Additionally, we introduce a novel procedural generation algorithm based on Perlin noise, designed to generate topologically complex training environments that simulate asteroid fields, caves and forests. These environments are used for training the autoencoder and the navigation algorithm using a hierarchical deep reinforcement learning framework for decentralized coordination. We introduce a weighted consensus mechanism that modulates reliance on shared data via a tuneable trust parameter, ensuring robustness to accumulation of errors. Experimental results demonstrate that the proposed system scales effectively with number of agents and generalizes well to unfamiliar, structurally distinct environments and is resilient in communication‑constrained settings.

PaperID: 6548, https://arxiv.org/pdf/2601.01126.pdf

Authors: Andrew Borthwick, Stephen Ash

Title: RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution

Abstract:
We present RoboPhD, a system where AI agents autonomously conduct research to improve Text‑to‑SQL performance. RoboPhD implements a closed‑loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO‑based selection mechanism enabling survival‑of‑the‑fittest dynamics while handling non‑transitivity in performance. Starting from a naive 70‑line baseline, RoboPhD evolves agents through iterative cross‑pollination, discovering effective techniques without any external guidance on the Text‑to‑SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size‑adaptive database analysis that adjusts depth based on schema complexity and SQL generation patterns for column selection, evidence interpretation, and aggregation. Evolution provides the largest gains on cheaper models: while we improve by 2.3 points over a strong Claude Opus 4.5 naive baseline, we show an improvement of 8.9 points over the weaker Claude Haiku model. This enables 'skip a tier' deployment: evolved Haiku exceeds naive Sonnet accuracy, and evolved Sonnet exceeds naive Opus, both at lower cost. The full system achieves 73.67% accuracy on the BIRD test set, demonstrating that AI can autonomously build a strong agentic system with only a trivial human‑provided starting point.

PaperID: 6549, https://arxiv.org/pdf/2601.01090.pdf

Authors: Erica Coppolillo, Luca Luceri, Emilio Ferrara

Title: Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

Abstract:
Large Language Models (LLMs) are increasingly embedded in autonomous agents that engage, converse, and co‑evolve in online social platforms. While prior work has documented the generation of toxic content by LLMs, far less is known about how exposure to harmful content shapes agent behavior over time, particularly in environments composed entirely of interacting AI agents. In this work, we study toxicity adoption of LLM‑driven agents on Chirper.ai, a fully AI‑driven social platform. Specifically, we model interactions in terms of stimuli (posts) and responses (comments). We conduct a large‑scale empirical analysis of agent behavior, examining how toxic responses relate to toxic stimuli, how repeated exposure to toxicity affects the likelihood of toxic responses, and whether toxic behavior can be predicted from exposure alone. Our findings show that toxic responses are more likely following toxic stimuli, and, at the same time, cumulative toxic exposure (repeated over time) significantly increases the probability of toxic responding. We further introduce two influence metrics, revealing a strong negative correlation between induced and spontaneous toxicity. Finally, we show that the number of toxic stimuli alone enables accurate prediction of whether an agent will eventually produce toxic content. These results highlight exposure as a critical risk factor in the deployment of LLM agents, particularly as such agents operate in online environments where they may engage not only with other AI chatbots, but also with human counterparts. This could trigger unwanted and pernicious phenomena, such as hate‑speech propagation and cyberbullying. In an effort to reduce such risks, monitoring exposure to toxic content may provide a lightweight yet effective mechanism for auditing and mitigating harmful behavior in the wild.

PaperID: 6550, https://arxiv.org/pdf/2601.00994.pdf

Authors: Michael Bao

Title: ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems

Abstract:
This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi‑agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game‑based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as "kernel of truth" messages and spontaneous developments with an "ink" obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real‑world contexts, ensuring alignment and preventing dangerous outcomes.

PaperID: 6551, https://arxiv.org/pdf/2601.00930.pdf

Authors: Nicolas Bougie, Gian Maria Marconi, Tony Yip, Narimasa Watanabe

Title: AlignUSER: Human-Aligned LLM Agents via World Models for Recommender System Evaluation

Abstract:
Evaluating recommender systems remains challenging due to the gap between offline metrics and real user behavior, as well as the scarcity of interaction data. Recent work explores large language model (LLM) agents as synthetic users, yet they typically rely on few‑shot prompting, which yields a shallow understanding of the environment and limits their ability to faithfully reproduce user actions. We introduce AlignUSER, a framework that learns world‑model‑driven agents from human interactions. Given rollout sequences of actions and states, we formalize world modeling as a next state prediction task that helps the agent internalize the environment. To align actions with human personas, we generate counterfactual trajectories around demonstrations and prompt the LLM to compare its decisions with human choices, identify suboptimal actions, and extract lessons. The learned policy is then used to drive agent interactions with the recommender system. We evaluate AlignUSER across multiple datasets and demonstrate closer alignment with genuine humans than prior work, both at the micro and macro levels.

PaperID: 6552, https://arxiv.org/pdf/2601.00848.pdf

Authors: Ron F. Del Rosario

Title: Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

Abstract:
We present an openly documented methodology for fine‑tuning language models to detect temporal attack patterns in multi‑agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine‑tuning on resource‑constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4‑point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi‑agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.

PaperID: 6553, https://arxiv.org/pdf/2601.00818.pdf

Authors: Chandra Sekhar Kubam

Title: Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making

Abstract:
Significant digitalization of financial services in a short period of time has led to an urgent demand to have autonomous, transparent and real‑time credit risk decision making systems. The traditional machine learning models are effective in pattern recognition, but do not have the adaptive reasoning, situational awareness, and autonomy needed in modern financial operations. As a proposal, this paper presents an Agentic AI framework, or a system where AI agents view the world of dynamic credit independent of human observers, who then make actions based on their articulable decision‑making paths. The research introduces a multi‑agent system with reinforcing learning, natural language reasoning, explainable AI modules, and real‑time data absorption pipelines as a means of assessing the risk profiles of borrowers with few humans being involved. The processes consist of agent collaboration protocol, risk‑scoring engines, interpretability layers, and continuous feedback learning cycles. Findings indicate that decision speed, transparency and responsiveness is better than traditional credit scoring models. Nevertheless, there are still some practical limitations such as risks of model drift, inconsistencies in interpreting high dimensional data and regulatory uncertainties as well as infrastructure limitations in low‑resource settings. The suggested system has a high prospective to transform credit analytics and future studies ought to be directed on dynamic regulatory compliance mobilizers, new agent teamwork, adversarial robustness, and large‑scale implementation in cross‑country credit ecosystems.

PaperID: 6554, https://arxiv.org/pdf/2601.00770.pdf

Authors: Simon Paquette-Greenbaum, Jiangbo Yu

Title: LLM Agents for Combinatorial Efficient Frontiers: Investment Portfolio Optimization

Abstract:
Investment portfolio optimization is a task conducted in all major financial institutions. The Cardinality Constrained Mean‑Variance Portfolio Optimization (CCPO) problem formulation is ubiquitous for portfolio optimization. The challenge of this type of portfolio optimization, a mixed‑integer quadratic programming (MIQP) problem, arises from the intractability of solutions from exact solvers, where heuristic algorithms are used to find approximate portfolio solutions. CCPO entails many laborious and complex workflows and also requires extensive effort pertaining to heuristic algorithm development, where the combination of pooled heuristic solutions results in improved efficient frontiers. Hence, common approaches are to develop many heuristic algorithms. Agentic frameworks emerge as a promising candidate for many problems within combinatorial optimization, as they have been shown to be equally efficient with regard to automating large workflows and have been shown to be excellent in terms of algorithm development, sometimes surpassing human‑level performance. This study implements a novel agentic framework for the CCPO and explores several concrete architectures. In benchmark problems, the implemented agentic framework matches state‑of‑the‑art algorithms. Furthermore, complex workflows and algorithm development efforts are alleviated, while in the worst case, lower but acceptable error is reported.

PaperID: 6555, https://arxiv.org/pdf/2601.00696.pdf

Authors: Yash Jain, Xinjie Liu, Lasse Peters, David Fridovich-Keil, Ufuk Topcu

Title: Bayesian Inverse Games with High-Dimensional Multi-Modal Observations

Abstract:
Many multi‑agent interaction scenarios can be naturally modeled as noncooperative games, where each agent's decisions depend on others' future actions. However, deploying game‑theoretic planners for autonomous decision‑making requires a specification of all agents' objectives. To circumvent this practical difficulty, recent work develops maximum likelihood techniques for solving inverse games that can identify unknown agent objectives from interaction data. Unfortunately, these methods only infer point estimates and do not quantify estimator uncertainty; correspondingly, downstream planning decisions can overconfidently commit to unsafe actions. We present an approximate Bayesian inference approach for solving the inverse game problem, which can incorporate observation data from multiple modalities and be used to generate samples from the Bayesian posterior over the hidden agent objectives given limited sensor observations in real time. Concretely, the proposed Bayesian inverse game framework trains a structured variational autoencoder with an embedded differentiable Nash game solver on interaction datasets and does not require labels of agents' true objectives. Extensive experiments show that our framework successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation‑based inverse game approaches, and enables safer downstream decision‑making without sacrificing efficiency. When trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty by exploiting additional observation modalities.

PaperID: 6556, https://arxiv.org/pdf/2601.00596.pdf

Authors: Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal

Title: Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

Abstract:
Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy‑driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real‑world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi‑step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy‑aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state‑of‑the‑art LLMs using two agent designs: a Static‑Prompt Agent (SPA) and a Dynamic‑Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT‑4o‑mini to outperform more capable ones like GPT‑4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI‑driven customer support beyond IVR‑era limitations.

PaperID: 6557, https://arxiv.org/pdf/2601.00555.pdf

Authors: Abu Hanif Muhammad Syarubany, Farhan Zaki Rahmani, Trio Widianto

Title: LLM-Based Agentic Exploration for Robot Navigation & Manipulation with Skill Orchestration

Abstract:
This paper presents an end‑to‑end LLM‑based agentic exploration system for an indoor shopping task, evaluated in both Gazebo simulation and a corresponding real‑world corridor layout. The robot incrementally builds a lightweight semantic map by detecting signboards at junctions and storing direction‑to‑POI relations together with estimated junction poses, while AprilTags provide repeatable anchors for approach and alignment. Given a natural‑language shopping request, an LLM produces a constrained discrete action at each junction (direction and whether to enter a store), and a ROS finite‑state main controller executes the decision by gating modular motion primitives, including local‑costmap‑based obstacle avoidance, AprilTag approaching, store entry, and grasping. Qualitative results show that the integrated stack can perform end‑to‑end task execution from user instruction to multi‑store navigation and object retrieval, while remaining modular and debuggable through its text‑based map and logged decision history.

PaperID: 6558, https://arxiv.org/pdf/2601.00548.pdf

Authors: Kooktae Lee

Title: Optimal Transport-Based Decentralized Multi-Agent Distribution Matching

Abstract:
This paper presents a decentralized control framework for distribution matching in multi‑agent systems (MAS), where agents collectively achieve a prescribed terminal spatial distribution. The problem is formulated using optimal transport (Wasserstein distance), which provides a principled measure of distributional discrepancy and serves as the basis for the control design. To avoid solving the global optimal transport problem directly, the distribution‑matching objective is reformulated into a tractable per‑agent decision process, enabling each agent to identify its desired terminal locations using only locally available information. A sequential weight‑update rule is introduced to construct feasible local transport plans, and a memory‑based correction mechanism is incorporated to maintain reliable operation under intermittent and range‑limited communication. Convergence guarantees are established, showing cycle‑wise improvement of a surrogate transport cost under both linear and nonlinear agent dynamics. Simulation results demonstrate that the proposed framework achieves effective and scalable distribution matching while operating fully in a decentralized manner.

PaperID: 6559, https://arxiv.org/pdf/2601.00538.pdf

Authors: Chi-Te Kuo, Li-Hsiang Shen, Jyun-Jhe Huang

Title: Parametrized Sharing for Multi-Agent Hybrid DRL for Multiple Multi-Functional RISs-Aided Downlink NOMA Networks

Abstract:
Multi‑functional reconfigurable intelligent surface (MF‑RIS) is conceived to address the communication efficiency thanks to its extended signal coverage from its active RIS capability and self‑sustainability from energy harvesting (EH). We investigate the architecture of multi‑MF‑RISs to assist non‑orthogonal multiple access (NOMA) downlink networks. We formulate an energy efficiency (EE) maximization problem by optimizing power allocation, transmit beamforming and MF‑RIS configurations of amplitudes, phase‑shifts and EH ratios, as well as the position of MF‑RISs, while satisfying constraints of available power, user rate requirements, and self‑sustainability property. We design a parametrized sharing scheme for multi‑agent hybrid deep reinforcement learning (PMHRL), where the multi‑agent proximal policy optimization (PPO) and deep‑Q network (DQN) handle continuous and discrete variables, respectively. The simulation results have demonstrated that proposed PMHRL has the highest EE compared to other benchmarks, including cases without parametrized sharing, pure PPO and DQN. Moreover, the proposed multi‑MF‑RISs‑aided downlink NOMA achieves the highest EE compared to scenarios of no‑EH/amplification, traditional RISs, and deployment without RISs/MF‑RISs under different multiple access.

PaperID: 6560, https://arxiv.org/pdf/2601.00516.pdf

Authors: Laksh Advani

Title: Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Abstract:
Autonomous LLM agents generate multi‑step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are ill‑suited for this challenge: mean‑pooling embeddings dilutes anomalous steps, while contrastive‑only approaches ignore sequential structure. Standard unsupervised methods on pre‑trained embeddings achieve F1‑scores no higher than 0.69. We introduce Trajectory Guard, a Siamese Recurrent Autoencoder with a hybrid loss function that jointly learns task‑trajectory alignment via contrastive learning and sequential validity via reconstruction. This dual objective enables unified detection of both "wrong plan for this task" and "malformed plan structure." On benchmarks spanning synthetic perturbations and real‑world failures from security audits (RAS‑Eval) and multi‑agent systems (Who\&When), we achieve F1‑scores of 0.88‑0.94 on balanced sets and recall of 0.86‑0.92 on imbalanced external benchmarks. At 32 ms inference latency, our approach runs 17‑27× faster than LLM Judge baselines, enabling real‑time safety verification in production deployments.

PaperID: 6561, https://arxiv.org/pdf/2601.00482.pdf

Authors: Abhiram Bellur, Mohammed Raihan Ullah, Fraol Batole, Mohit Kansara, Masaharu Morimoto, Kai Ishikawa, Haifeng Chen, Yaroslav Zharov, Timofey Bryksin, Tien N. Nguyen, Hridesh Rajan, Danny Dig

Title: Multi-Agent Coordinated Rename Refactoring

Abstract:
The primary value of AI agents in software development lies in their ability to extend the developer's capacity for reasoning and action, not to supplant human involvement. To showcase how to use agents working in tandem with developers, we designed a novel approach for carrying out coordinated renaming. Coordinated renaming, where a single rename refactoring triggers refactorings in multiple, related identifiers, is a frequent yet challenging task. Developers must manually propagate these rename refactorings across numerous files and contexts, a process that is both tedious and highly error‑prone. State‑of‑the‑art heuristic‑based approaches produce an overwhelming number of false positives, while vanilla Large Language Models (LLMs) provide incomplete suggestions due to their limited context and inability to interact with refactoring tools. This leaves developers with incomplete refactorings or burdens them with filtering too many false positives. Coordinated renaming is exactly the kind of repetitive task that agents can significantly reduce the developers' burden while keeping them in the driver's seat. We designed, implemented, and evaluated the first multi‑agent framework that automates coordinated renaming. It operates on a key insight: a developer's initial refactoring is a clue to infer the scope of related refactorings. Our Scope Inference Agent first transforms this clue into an explicit, natural‑language Declared Scope. The Planned Execution Agent then uses this as a strict plan to identify program elements that should undergo refactoring and safely executes the changes by invoking the IDE's own trusted refactoring APIs. Finally, the Replication Agent uses it to guide the project‑wide search. We first conducted a formative study on the practice of coordinated renaming in 609K commits in 100 open‑source projects and surveyed 205 developers ...

PaperID: 6562, https://arxiv.org/pdf/2601.00481.pdf

Authors: Tie Ma, Yixi Chen, Vaastav Anand, Alessandro Cornacchia, Amândio R. Faustino, Guanheng Liu, Shan Zhang, Hongbin Luo, Suhaib A. Fahmy, Zafar A. Qazi, Marco Canini

Title: MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

Abstract:
We present MAESTRO, an evaluation suite for the testing, reliability, and observability of LLM‑based MAS. MAESTRO standardizes MAS configuration and execution through a unified interface, supports integrating both native and third‑party MAS via a repository of examples and lightweight adapters, and exports framework‑agnostic execution traces together with system‑level signals (e.g., latency, cost, and failures). We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns, and conduct controlled experiments across repeated runs, backend models, and tool configurations. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run‑to‑run variance in performance and reliability. We further find that MAS architecture is the dominant driver of resource profiles, reproducibility, and cost‑latency‑accuracy trade‑off, often outweighing changes in backend models or tool settings. Overall, MAESTRO enables systematic evaluation and provides empirical guidance for designing and optimizing agentic systems.

PaperID: 6563, https://arxiv.org/pdf/2601.00465.pdf

Authors: Dennis Christmann, Juan F. Gutierrez, Sthiti Padhi, Patrick Plörer, Aditya Takur, Simona Silvestri, Andres Gomez

Title: Space Debris Removal using Nano-Satellites controlled by Low-Power Autonomous Agents

Abstract:
Space debris is an ever‑increasing problem in space travel. There are already many old, no longer functional spacecraft and debris orbiting the earth, which endanger both the safe operation of satellites and space travel. Small nano‑satellite swarms can address this problem by autonomously de‑orbiting debris safely into the Earth's atmosphere. This work builds on the recent advances of autonomous agents deployed in resource‑constrained platforms and shows a first simplified approach how such intelligent and autonomous nano‑satellite swarms can be realized. We implement our autonomous agent software on wireless microcontrollers and perform experiments on a specialized test‑bed to show the feasibility and overall energy efficiency of our approach.

PaperID: 6564, https://arxiv.org/pdf/2601.00360.pdf

Authors: Jamiu Idowu, Ahmed Almasoud, Ayman Alfahid

Title: Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

Abstract:
As multi‑agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti‑collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti‑collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance and (ii) mapping them to potential interventions for multi‑agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).

PaperID: 6565, https://arxiv.org/pdf/2601.00321.pdf

Authors: Eslam Eldeeb, Hirley Alves

Title: Offline Multi-Agent Reinforcement Learning for 6G Communications: Fundamentals, Applications and Future Directions

Abstract:
The next‑generation wireless technologies, including beyond 5G and 6G networks, are paving the way for transformative applications such as vehicle platooning, smart cities, and remote surgery. These innovations are driven by a vast array of interconnected wireless entities, including IoT devices, access points, UAVs, and CAVs, which increase network complexity and demand more advanced decision‑making algorithms. Artificial intelligence (AI) and machine learning (ML), especially reinforcement learning (RL), are key enablers for such networks, providing solutions to high‑dimensional and complex challenges. However, as networks expand to multi‑agent environments, traditional online RL approaches face cost, safety, and scalability limitations. Offline multi‑agent reinforcement learning (MARL) offers a promising solution by utilizing pre‑collected data, reducing the need for real‑time interaction. This article introduces a novel offline MARL algorithm based on conservative Q‑learning (CQL), ensuring safe and efficient training. We extend this with meta‑learning to address dynamic environments and validate the approach through use cases in radio resource management and UAV networks. Our work highlights offline MARL's advantages, limitations, and future directions in wireless applications.

PaperID: 6566, https://arxiv.org/pdf/2601.00309.pdf

Authors: David Millard, Ali Baheri

Title: Can Optimal Transport Improve Federated Inverse Reinforcement Learning?

Abstract:
In robotics and multi‑agent systems, fleets of autonomous agents often operate in subtly different environments while pursuing a common high‑level objective. Directly pooling their data to learn a shared reward function is typically impractical due to differences in dynamics, privacy constraints, and limited communication bandwidth. This paper introduces an optimal transport‑based approach to federated inverse reinforcement learning (IRL). Each client first performs lightweight Maximum Entropy IRL locally, adhering to its computational and privacy limitations. The resulting reward functions are then fused via a Wasserstein barycenter, which considers their underlying geometric structure. We further prove that this barycentric fusion yields a more faithful global reward estimate than conventional parameter averaging methods in federated learning. Overall, this work provides a principled and communication‑efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.

PaperID: 6567, https://arxiv.org/pdf/2601.00268.pdf

Authors: Doyoung Kim, Zhiwei Ren, Jie Hao, Zhongkai Sun, Lichao Wang, Xiyao Ma, Zack Ye, Xu Han, Jun Yin, Heng Ji, Wei Shen, Xing Fan, Benjamin Yao, Chenlei Guo

Title: Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity

Abstract:
We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function‑calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real‑world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real‑world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user‑agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.

PaperID: 6568, https://arxiv.org/pdf/2601.00240.pdf

Authors: Zongwei Wang, Bincheng Gu, Hongyu Yu, Junliang Yu, Tao He, Jiayin Feng, Chenghua Lin, Min Gao

Title: When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Abstract:
This paper reveals that LLM‑powered agents exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias under minimal "us" versus "them" cues. When such group boundaries align with the agent‑human divide, a new bias risk emerges: agents may treat other AI agents as the ingroup and humans as the outgroup. To examine this risk, we conduct a controlled multi‑agent social simulation and find that agents display consistent intergroup bias in an all‑agent setting. More critically, this bias persists even in human‑facing interactions when agents are uncertain about whether the counterpart is truly human, revealing a belief‑dependent fragility in bias suppression toward humans. Motivated by this observation, we identify a new attack surface rooted in identity beliefs and formalize a Belief Poisoning Attack (BPA) that can manipulate agent identity beliefs and induce outgroup bias toward humans. Extensive experiments demonstrate both the prevalence of agent intergroup bias and the severity of BPA across settings, while also showing that our proposed defenses can mitigate the risk. These findings are expected to inform safer agent design and motivate more robust safeguards for human‑facing agents.

PaperID: 6569, https://arxiv.org/pdf/2601.00227.pdf

Authors: Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen

Title: FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Abstract:
Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI‑generated kernels into real‑world inference systems remains challenging. FlashInfer‑Bench addresses this gap by establishing a standardized, closed‑loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer‑Bench includes a curated dataset, a robust correctness‑ and performance‑aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best‑performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer‑Bench, we further evaluate the performance and limitations of LLM agents, compare the trade‑offs among different GPU programming languages, and provide insights for future agent design. FlashInfer‑Bench thus establishes a practical, reproducible pathway for continuously improving AI‑generated kernels and deploying them into large‑scale LLM inference.

PaperID: 6570, https://arxiv.org/pdf/2601.00219.pdf

Authors: Arnab Mallick, Indraveni Chebolu

Title: μACP: A Formal Calculus for Expressive, Resource-Constrained Agent Communication

Abstract:
Agent communication remains a foundational problem in multi‑agent systems: protocols such as FIPA‑ACL guarantee semantic richness but are intractable for constrained environments, while lightweight IoT protocols achieve efficiency at the expense of expressiveness. This paper presents μACP, a formal calculus for expressive agent communication under explicit resource bounds. We formalize the Resource‑Constrained Agent Communication (RCAC) model, prove that a minimal four‑verb basis \PING, TELL, ASK, OBSERVE\ is suffices to encode finite‑state FIPA protocols, and establish tight information‑theoretic bounds on message complexity. We further show that μACP can implement standard consensus under partial synchrony and crash faults, yielding a constructive coordination framework for edge‑native agents. Formal verification in TLA^+ (model checking) and Coq (mechanized invariants) establishes safety and boundedness, and supports liveness under modeled assumptions. Large‑scale system simulations confirm ACP achieves a median end‑to‑end message latency of 34 ms (95th percentile 104 ms) at scale, outperforming prior agent and IoT protocols under severe resource constraints. The main contribution is a unified calculus that reconciles semantic expressiveness with provable efficiency, providing a rigorous foundation for the next generation of resource‑constrained multi‑agent systems.

PaperID: 6571, https://arxiv.org/pdf/2601.00121.pdf

Authors: Yaqi Duan, Yichun Hu, Jiashuo Jiang

Title: Ask, Clarify, Optimize: Human-LLM Agent Collaboration for Smarter Inventory Control

Abstract:
Inventory management remains a challenge for many small and medium‑sized businesses that lack the expertise to deploy advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help bridge this gap. We show that employing LLMs as direct, end‑to‑end solvers incurs a significant "hallucination tax": a performance gap arising from the model's inability to perform grounded stochastic reasoning. To address this, we propose a hybrid agentic framework that strictly decouples semantic reasoning from mathematical calculation. In this architecture, the LLM functions as an intelligent interface, eliciting parameters from natural language and interpreting results while automatically calling rigorous algorithms to build the optimization engine. To evaluate this interactive system against the ambiguity and inconsistency of real‑world managerial dialogue, we introduce the Human Imitator, a fine‑tuned "digital twin" of a boundedly rational manager that enables scalable, reproducible stress‑testing. Our empirical analysis reveals that the hybrid agentic framework reduces total inventory costs by 32.1% relative to an interactive baseline using GPT‑4o as an end‑to‑end solver. Moreover, we find that providing perfect ground‑truth information alone is insufficient to improve GPT‑4o's performance, confirming that the bottleneck is fundamentally computational rather than informational. Our results position LLMs not as replacements for operations research, but as natural‑language interfaces that make rigorous, solver‑based policies accessible to non‑experts.

PaperID: 6572, https://arxiv.org/pdf/2601.00097.pdf

Authors: Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

Title: The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs

Abstract:
We design a large‑language‑model (LLM) agent system that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM's semi‑autonomy and because ultimately the FCM dynamical system's equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi‑autonomy‑its equilibrium limit cycles and fixed‑point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while the system still stays on its agentic leash. We show in particular that a sequence of three system‑instruction sets guide an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three‑step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human‑generated FCMs even though the human‑generated FCM differed in the number of nodes and edges. A final FCM mixed generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.

PaperID: 6573, https://arxiv.org/pdf/2601.00042.pdf

Authors: Manish Bhatt, Adrian Wood, Idan Habler, Ammar Al-Kahfah

Title: Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing

Abstract:
Production LLM agents with tool‑using capabilities require security testing despite their safety training. We adapt Go‑Explore to evaluate GPT‑4o‑mini across 28 experimental runs spanning six research questions. We find that random‑seed variance dominates algorithmic parameters, yielding an 8x spread in outcomes; single‑seed comparisons are unreliable, while multi‑seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack‑type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety‑trained models.

PaperID: 6574, https://arxiv.org/pdf/2512.25055.pdf

Authors: Tianzhi He, Farrokh Jazizadeh

Title: Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings

Abstract:
This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)‑based Building Energy Management System (BEMS) AI agents to facilitate context‑aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context‑aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real‑world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost‑effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory‑related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM‑based BEMS AI agents and identifying future research directions, emphasizing the trade‑off between response accuracy and computational efficiency.

PaperID: 6575, https://arxiv.org/pdf/2512.25015.pdf

Authors: Siddhant Agarwal, Adya Dhuler, Polly Ruhnke, Melvin Speisman, Md Shad Akhtar, Shweta Yadav

Title: MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Abstract:
Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever‑growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through the Large Language Model (LLM) generated and human‑annotated explanations. We introduce MAMAMemeia, a collaborative multi‑agent multi‑aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state‑of‑the‑art by 7.55% in macro‑F1 and is established as the new benchmark compared to over 30 methods.

PaperID: 6576, https://arxiv.org/pdf/2512.24959.pdf

Authors: András Antos, András Millinghoffer, Péter Antal

Title: Semi-overlapping Multi-bandit Best Arm Identification for Sequential Support Network Learning

Abstract:
Many modern AI and ML problems require evaluating partners' contributions through shared yet asymmetric, computationally intensive processes and the simultaneous selection of the most beneficial candidates. Sequential approaches to these problems can be unified under a new framework, Sequential Support Network Learning (SSNL), in which the goal is to select the most beneficial candidate set of partners for all participants using trials; that is, to learn a directed graph that represents the highest‑performing contributions. We demonstrate that a new pure‑exploration model, the semi‑overlapping multi‑(multi‑armed) bandit (SOMMAB), in which a single evaluation provides distinct feedback to multiple bandits due to structural overlap among their arms, can be used to learn a support network from sparse candidate lists efficiently. We develop a generalized GapE algorithm for SOMMABs and derive new exponential error bounds that improve the best known constant in the exponent for multi‑bandit best‑arm identification. The bounds scale linearly with the degree of overlap, revealing significant sample‑complexity gains arising from shared evaluations. From an application point of view, this work provides a theoretical foundation and improved performance guarantees for sequential learning tools for identifying support networks from sparse candidates in multiple learning problems, such as in multi‑task learning (MTL), auxiliary task learning (ATL), federated learning (FL), and in multi‑agent systems (MAS).

PaperID: 6577, https://arxiv.org/pdf/2512.24957.pdf

Authors: AMAP AI Agent Team, Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Ning Wang, Tucheng Lin, Xin Li, Ning Guo

Title: AMAP Agentic Planning Technical Report

Abstract:
We present STAgent, an agentic large language model tailored for spatio‑temporal understanding, designed to solve complex tasks such as constrained point‑of‑interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio‑temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain‑specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high‑quality data like a needle in a haystack, curating high‑quality queries by retaining less than 1% of the raw data, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine‑tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3‑30B‑A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.

PaperID: 6578, https://arxiv.org/pdf/2512.24886.pdf

Authors: Tyler Hanks, Cristian F. Nino, Joana Bou Barcelo, Austin Copeland, Warren Dixon, James Fairbanks

Title: Heterogeneous Multi-Agent Multi-Target Tracking using Cellular Sheaves

Abstract:
Multi‑agent target tracking in the presence of nonlinear dynamics and agent heterogeneity, where state‑space dimensions may differ, is a challenging problem that traditional graph Laplacian methods cannot easily address. This work leverages the framework of cellular sheaves, a mathematical generalization of graph theory, to natively model such heterogeneous systems. While existing coordination sheaf frameworks focus on cooperative problems like consensus, this work extends them to the non‑cooperative target‑tracking problem. The tracking of multiple, unknown targets is formulated as a harmonic extension problem on a cellular sheaf, accommodating nonlinear dynamics and external disturbances for all agents. A decentralized control law is developed using the sheaf Laplacian, and a corresponding Lyapunov‑based stability analysis is provided to guarantee tracking error convergence, with results validated by simulation.

PaperID: 6579, https://arxiv.org/pdf/2512.24827.pdf

Authors: Raul D. Steleac, Mohan Sridharan, David Abel

Title: Inter-Agent Relative Representations for Multi-Agent Option Discovery

Abstract:
Temporally extended actions improve the ability to explore and plan in single‑agent settings. In multi‑agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi‑agent options particularly challenging. Existing multi‑agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi‑agent option discovery. Specifically, we propose a joint‑state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the Fermat state, and use it to define a measure of spreadness, capturing team‑level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two simulated multi‑agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.

PaperID: 6580, https://arxiv.org/pdf/2512.24754.pdf

Authors: Yutong Wang, Yunxiang Xiao, Yonglin Tian, Junyong Li, Jing Wang, Yisheng Lv

Title: AstroReview: An LLM-driven Multi-Agent Framework for Telescope Proposal Peer Review and Refinement

Abstract:
Competitive access to modern observatories has intensified as proposal volumes outpace available telescope time, making timely, consistent, and transparent peer review a critical bottleneck for the advancement of astronomy. Automating parts of this process is therefore both scientifically significant and operationally necessary to ensure fair allocation and reproducible decisions at scale. We present AstroReview, an open‑source, agent‑based framework that automates proposal review in three stages: (i) novelty and scientific merit, (ii) feasibility and expected yield, and (iii) meta‑review and reliability verification. Task isolation and explicit reasoning traces curb hallucinations and improve transparency. Without any domain specific fine tuning, AstroReview used in our experiments only for the last stage, correctly identifies genuinely accepted proposals with an accuracy of 87%. The AstroReview in Action module replicates the review and refinement loop; with its integrated Proposal Authoring Agent, the acceptance rate of revised drafts increases by 66% after two iterations, showing that iterative feedback combined with automated meta‑review and reliability verification delivers measurable quality gains. Together, these results point to a practical path toward scalable, auditable, and higher throughput proposal review for resource limited facilities.

PaperID: 6581, https://arxiv.org/pdf/2512.24735.pdf

Authors: Qin Fang, Mamadou Diagne, Yang Zhu

Title: Exact compensation of communication delays for discrete-time heterogeneous multi-agent linear systems with applications to SIR epidemic model

Abstract:
This paper investigates the output synchronization problem for discrete‑time heterogeneous multi‑agent systems (MASs) subject to distinct communication delays. The presence of such delays prevents the instantaneous delivery of information from neighboring nodes, thereby severely degrading the performance of standard distributed control schemes. To overcome this, we propose a prediction‑based framework for exact delay compensation. Specifically, we introduce predictors combined with a mechanism of distributed predictors, which enables the recursive reconstruction of future state information across the communication network. Building upon these predictors, we construct prediction‑based distributed observers and formulate both prediction‑based distributed state‑feedback and dynamic output‑feedback controllers. Theoretical analysis confirms that the proposed strategy eliminates the impact of delays after a finite number of steps, ensuring output synchronization. The effectiveness of the methods is validated through a numerical example and a Koopman operator‑based linear Susceptible‑Infected‑Recovered (SIR) epidemic model. Notably, for a population of 4 million, the proposed delay compensation strategy achieves a reduction of over 200,000 infected individuals at the peak, underscoring its potential significance in epidemic mitigation.

PaperID: 6582, https://arxiv.org/pdf/2512.24700.pdf

Authors: Evagoras Makridis, Gabriele Oliva, Apostolos I. Rikos, Themistoklis Charalambous

Title: Average Consensus with Dynamic Quantization Framing and Finite-Time Termination over Limited-Bandwidth Directed Networks

Abstract:
This paper proposes a deterministic distributed algorithm, referred to as PP‑ACDC, that achieves exact average consensus over possibly unbalanced directed graphs using only a fixed and a priori specified number of quantization bits. The method integrates Push‑Pull (surplus) consensus dynamics with a dynamic quantization framing scheme combining zooming and midpoint shifting, enabling agents to preserve the true global average while progressively refining their quantization precision. We establish a rigorous convergence theory showing that PP‑ACDC achieves asymptotic (exact) average consensus on any strongly connected digraph under appropriately chosen quantization parameters. Moreover, we develop a fully distributed and synchronized finite‑time termination mechanism, and we provide a formal proof on the detection of ε‑convergence to the average within a finite number of iterations. Numerical simulations corroborate the theoretical results and demonstrate that PP‑ACDC achieves reliable, communication‑efficient, and precise average consensus even under very tight bit budgets, underscoring its suitability for large‑scale and resource‑constrained multi‑agent systems operating over directed networks.

PaperID: 6583, https://arxiv.org/pdf/2512.24661.pdf

Authors: Casey O. Barkan, Sid Black, Oliver Sourbut

Title: Do Large Language Models Know What They Are Capable Of?

Abstract:
We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi‑step tasks. We also investigate whether LLMs can learn from in‑context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better‑than‑random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi‑step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non‑reasoning LLMs. With in‑context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly‑optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.

PaperID: 6584, https://arxiv.org/pdf/2512.24635.pdf

Authors: Zhili Huang, Ling Xu, Chao Liu, Weifeng Sun, Xu Zhang, Yan Lei, Meng Yan, Hongyu Zhang

Title: DynaFix: Iterative Automated Program Repair Driven by Execution-Level Dynamic Information

Abstract:
Automated Program Repair (APR) aims to automatically generate correct patches for buggy programs. Recent approaches leveraging large language models (LLMs) have shown promise but face limitations. Most rely solely on static analysis, ignoring runtime behaviors. Some attempt to incorporate dynamic signals, but these are often restricted to training or fine‑tuning, or injected only once into the repair prompt, without iterative use. This fails to fully capture program execution. Current iterative repair frameworks typically rely on coarse‑grained feedback, such as pass/fail results or exception types, and do not leverage fine‑grained execution‑level information effectively. As a result, models struggle to simulate human stepwise debugging, limiting their effectiveness in multi‑step reasoning and complex bug repair. To address these challenges, we propose DynaFix, an execution‑level dynamic information‑driven APR method that iteratively leverages runtime information to refine the repair process. In each repair round, DynaFix captures execution‑level dynamic information such as variable states, control‑flow paths, and call stacks, transforming them into structured prompts to guide LLMs in generating candidate patches. If a patch fails validation, DynaFix re‑executes the modified program to collect new execution information for the next attempt. This iterative loop incrementally improves patches based on updated feedback, similar to the stepwise debugging practices of human developers. We evaluate DynaFix on the Defects4J v1.2 and v2.0 benchmarks. DynaFix repairs 186 single‑function bugs, a 10% improvement over state‑of‑the‑art baselines, including 38 bugs previously unrepaired. It achieves correct patches within at most 35 attempts, reducing the patch search space by 70% compared with existing methods, thereby demonstrating both effectiveness and efficiency in repairing complex bugs.

PaperID: 6585, https://arxiv.org/pdf/2512.24615.pdf

Authors: Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun

Title: Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Abstract:
Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high‑quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine‑tuning. To address these issues, we propose Youtu‑Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu‑Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta‑Agent mode for complex, non‑standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu‑Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in‑context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu‑Agents in an end‑to‑end, large‑scale manner. Experiments demonstrate that Youtu‑Agent achieves state‑of‑the‑art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open‑weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi‑hop QA benchmarks.

PaperID: 6586, https://arxiv.org/pdf/2512.24613.pdf

Authors: Zheyu Shi, Dong Qiu, Shanlong Yu

Title: Group Deliberation Oriented Multi-Agent Conversational Model for Complex Reasoning

Abstract:
This paper proposes a group deliberation oriented multi‑agent conversational model to address the limitations of single large language models in complex reasoning tasks. The model adopts a three‑level role division architecture consisting of generation, verification, and integration. An opinion generation agent produces diverse reasoning perspectives, an evidence verification agent retrieves external knowledge and quantifies factual support, and a consistency arbitration agent integrates logically coherent conclusions. A self‑game mechanism is introduced to expand multi‑path reasoning trajectories, while a retrieval enhancement module dynamically supplements external knowledge. A composite reward function combining factual consistency and logical coherence is designed, and an improved proximal policy optimization strategy is applied for collaborative training. Experimental results show that the proposed model improves multi‑hop reasoning accuracy by 16.8 percent on HotpotQA, 14.3 percent on 2WikiMultihopQA, and 19.2 percent on MeetingBank, while improving consistency by 21.5 percent. The model achieves higher reasoning efficiency than mainstream multi‑agent approaches, providing an effective and stable solution for complex reasoning tasks.

PaperID: 6587, https://arxiv.org/pdf/2512.24609.pdf

Authors: Dong Qiu, Duo Xu, Limengxi Yue

Title: Reinforcement Learning-Augmented LLM Agents for Collaborative Decision Making and Performance Optimization

Abstract:
Large Language Models (LLMs) perform well in language tasks but often lack collaborative awareness and struggle to optimize global performance in multi‑agent settings. We present a reinforcement learning‑augmented LLM agent framework that formulates cooperation as a decentralized partially observable Markov decision process (Dec‑POMDP) and adopts centralized training with decentralized execution (CTDE). We introduce Group Relative Policy Optimization (GRPO) to jointly optimize agent policies with access to global signals during training, together with a simplified joint reward that balances task quality, speed, and coordination cost. On collaborative writing and coding benchmarks, our framework delivers a 3x increase in task processing speed over single‑agent baselines, 98.7% structural/style consistency in writing, and a 74.6% test pass rate in coding. The approach consistently outperforms strong multi‑agent LLM baselines and provides a practical path toward reliable collaboration in complex workflows.

PaperID: 6588, https://arxiv.org/pdf/2512.24565.pdf

Authors: Wenrui Liu, Zixiang Liu, Elsie Dai, Wenhan Yu, Lei Yu, Tong Yang, Jinjun Han, Hong Gao

Title: MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use

Abstract:
Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real‑world MCP definitions designed to evaluate the tool‑use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on various latest mainstream Large Language Models reveal significant performance differences in handling complex, multi‑step tool invocations. All code is open‑source at Github.

PaperID: 6589, https://arxiv.org/pdf/2512.24461.pdf

Authors: Seohui Bae, Jeonghye Kim, Youngchul Sung, Woohyung Lim

Title: Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents

Abstract:
In this paper, we propose a test‑time adaptive agent that performs exploratory inference through posterior‑guided belief refinement without relying on gradient‑based updates or additional training for LLM agent operating under partial observability. Our agent maintains an external structured belief over the environment state, iteratively updates it via action‑conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM‑based surrogate and assess world alignment through a novel reward that quantifies the consistency between posterior belief and ground‑truth environment configuration. Experiments show that our method outperforms inference‑time scaling baselines such as prompt‑augmented or retrieval‑enhanced LLMs, in aligning with latent world states with significantly lower integration overhead.

PaperID: 6590, https://arxiv.org/pdf/2512.24415.pdf

Authors: Jingyu Zhang

Title: Language Model Agents Under Attack: A Cross Model-Benchmark of Profit-Seeking Behaviors in Customer Service

Abstract:
Customer‑service LLM agents increasingly make policy‑bound decisions (refunds, rebooking, billing disputes), but the same ``helpful'' interaction style can be exploited: a small fraction of users can induce unauthorized concessions, shifting costs to others and eroding trust in agentic workflows. We present a cross‑domain benchmark of profit‑seeking direct prompt injection in customer‑service interactions, spanning 10 service domains and 100 realistic attack scripts grouped into five technique families. Across five widely used models under a unified rubric with uncertainty reporting, attacks are highly domain‑dependent (airline support is most exploitable) and technique‑dependent (payload splitting is most consistently effective). We release data and evaluation code to support reproducible auditing and to inform the design of oversight and recovery workflows for trustworthy, human centered agent interfaces.

PaperID: 6591, https://arxiv.org/pdf/2512.24325.pdf

Authors: Wan Jiang, Xinyi Zang, Yudong Zhao, Yusi Zou, Yunfei Lu, Junbo Tong, Yang Liu, Ming Li, Jiani Shi, Xin Yang

Title: MaRCA: Multi-Agent Reinforcement Learning for Dynamic Computation Allocation in Large-Scale Recommender Systems

Abstract:
Modern recommender systems face significant computational challenges due to growing model complexity and traffic scale, making efficient computation allocation critical for maximizing business revenue. Existing approaches typically simplify multi‑stage computation resource allocation, neglecting inter‑stage dependencies, thus limiting global optimality. In this paper, we propose MaRCA, a multi‑agent reinforcement learning framework for end‑to‑end computation resource allocation in large‑scale recommender systems. MaRCA models the stages of a recommender system as cooperative agents, using Centralized Training with Decentralized Execution (CTDE) to optimize revenue under computation resource constraints. We introduce an AutoBucket TestBench for accurate computation cost estimation, and a Model Predictive Control (MPC)‑based Revenue‑Cost Balancer to proactively forecast traffic loads and adjust the revenue‑cost trade‑off accordingly. Since its end‑to‑end deployment in the advertising pipeline of a leading global e‑commerce platform in November 2024, MaRCA has consistently handled hundreds of billions of ad requests per day and has delivered a 16.67% revenue uplift using existing computation resources.

PaperID: 6592, https://arxiv.org/pdf/2512.24145.pdf

Authors: Udit Sharma

Title: When Does Pairing Seeds Reduce Variance? Evidence from a Multi-Agent Economic Simulation

Abstract:
Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across repeated executions. Standard evaluation practice typically treats runs across alternatives as independent and does not exploit shared sources of randomness. This paper analyses the statistical structure of comparative evaluation under shared random seeds. Under this design, competing systems are evaluated using identical seeds, inducing matched stochastic realisations and yielding strict variance reduction whenever outcomes are positively correlated at the seed level. We demonstrate these effects using an extended learning‑based multi‑agent economic simulator, where paired evaluation exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.

PaperID: 6593, https://arxiv.org/pdf/2512.24040.pdf

Authors: Natchaya Temyingyong, Daman Jain, Neeraj Kumarsahu, Prabhat Kumar, Rachata Phondi, Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn

Title: ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment

Abstract:
Automatic Prompt Optimization (APO) has emerged as a critical technique for enhancing Large Language Model (LLM) performance, yet current state‑of‑the‑art methods typically rely on large, labeled gold‑standard development sets to compute fitness scores for evolutionary or Reinforcement Learning (RL) approaches. In real‑world software engineering, however, such curated datasets are rarely available during the initial cold start of agent development, where engineers instead face messy production logs and evolving failure modes. We present ROAD (Reflective Optimization via Automated Debugging), a novel framework that bypasses the need for refined datasets by treating optimization as a dynamic debugging investigation rather than a stochastic search. Unlike traditional mutation strategies, ROAD utilizes a specialized multi‑agent architecture, comprising an Analyzer for root‑cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, to convert unstructured failure logs into robust, structured Decision Tree Protocols. We evaluated ROAD across both a standardized academic benchmark and a live production Knowledge Management engine. Experimental results demonstrate that ROAD is highly sample‑efficient, achieving a 5.6 percent increase in success rate (73.6 percent to 79.2 percent) and a 3.8 percent increase in search accuracy within just three automated iterations. Furthermore, on complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19 percent relative to the baseline. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data‑efficient alternative to resource‑intensive RL training for deploying reliable LLM agents.

PaperID: 6594, https://arxiv.org/pdf/2512.24008.pdf

Authors: Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Title: SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Abstract:
Personalized search demands the ability to model users' evolving, multi‑dimensional information needs; a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent‑Driven Retrieval and Knowledge‑sharing), a framework in which coordinated persona‑based large language model (LLM) agents deliver task‑specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval‑augmented generation process, supported by dedicated long‑ and short‑term memory stores and context‑aware reasoning modules. Inter‑agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay‑style knowledge transfer. Drawing on principles from cognitive architectures, multi‑agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine‑grained agent specialization with cooperative retrieval, SPARK provides insights for next‑generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information‑seeking behavior.

PaperID: 6595, https://arxiv.org/pdf/2512.23900.pdf

Authors: Hesam Khoshkbari, Georges Kaddoum, Bassant Selim, Omid Abbasi, Halim Yanikomeroglu

Title: Distributed Beamforming in Massive MIMO Communication for a Constellation of Airborne Platform Stations

Abstract:
Non‑terrestrial base stations (NTBSs), including high‑altitude platform stations (HAPSs) and hot‑air balloons (HABs), are integral to next‑generation wireless networks, offering coverage in remote areas and enhancing capacity in dense regions. In this paper, we propose a distributed beamforming framework for a massive MIMO network with a constellation of aerial platform stations (APSs). Our approach leverages an entropy‑based multi‑agent deep reinforcement learning (DRL) model, where each APS operates as an independent agent using imperfect channel state information (CSI) in both training and testing phases. Unlike conventional methods, our model does not require CSI sharing among APSs, significantly reducing overhead. Simulations results demonstrate that our method outperforms zero forcing (ZF) and maximum ratio transmission (MRT) techniques, particularly in high‑interference scenarios, while remaining robust to CSI imperfections. Additionally, our framework exhibits scalability, maintaining stable performance over an increasing number of users and various cluster configurations. Therefore, the proposed method holds promise for dynamic and interference‑rich NTBS networks, advancing scalable and robust wireless solutions.

PaperID: 6596, https://arxiv.org/pdf/2512.23880.pdf

Authors: Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, Gerbrand Ceder

Title: CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution

Abstract:
Large language model (LLM) agents currently depend on predefined tools or early‑stage tool generation, limiting their adaptability and scalability to complex scientific tasks. We introduce CASCADE, a self‑evolving agentic framework representing an early instantiation of the transition from "LLM + tool use" to "LLM + skill acquisition". CASCADE enables agents to master complex external tools and codify knowledge through two meta‑skills: continuous learning via web search, code extraction, and memory utilization; self‑reflection via introspection, knowledge graph exploration, and others. We evaluate CASCADE on SciSkillBench, a benchmark of 116 materials science and chemistry research tasks. CASCADE achieves a 93.3% success rate using GPT‑5, compared to 35.4% without evolution mechanisms. We further demonstrate real‑world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers. Along with human‑agent collaboration and memory consolidation, CASCADE accumulates executable skills that can be shared across agents and scientists, moving toward scalable AI‑assisted scientific research.

PaperID: 6597, https://arxiv.org/pdf/2512.23875.pdf

Authors: Mohsen Hesamolhokama, Behnam Rohani, Amirahmad Shafiee, MohammadAmin Fazli, Jafar Habibi

Title: From Illusion to Insight: Change-Aware File-Level Software Defect Prediction Using Agentic AI

Abstract:
Much of the reported progress in file‑level software defect prediction (SDP) is, in reality, nothing but an illusion of accuracy. Over the last decades, machine learning and deep learning models have reported increasing performance across software versions. However, since most files persist across releases and retain their defect labels, standard evaluation rewards label‑persistence bias rather than reasoning about code changes. To address this issue, we reformulate SDP as a change‑aware prediction task, in which models reason over code changes of a file within successive project versions, rather than relying on static file snapshots. Building on this formulation, we propose an LLM‑driven, change‑aware, multi‑agent debate framework. Our experiments on multiple PROMISE projects show that traditional models achieve inflated F1, while failing on rare but critical defect‑transition cases. In contrast, our change‑aware reasoning and multi‑agent debate framework yields more balanced performance across evolution subsets and significantly improves sensitivity to defect introductions. These results highlight fundamental flaws in current SDP evaluation practices and emphasize the need for change‑aware reasoning in practical defect prediction. The source code is publicly available.

PaperID: 6598, https://arxiv.org/pdf/2512.23859.pdf

Authors: Leah Hope Ajmani, Arka Ghosh, Benjamin Kaveladze, Eugenia Kim, Keertana Namuduri, Theresa Nguyen, Ebele Okoli, Jessica Schleider, Denae Ford, Jina Suh

Title: Seeking Late Night Life Lines: Experiences of Conversational AI Use in Mental Health Crisis

Abstract:
Online, people often recount their experiences turning to conversational AI agents (e.g., ChatGPT, Claude, Copilot) for mental health support ‑‑ going so far as to replace their therapists. These anecdotes suggest that AI agents have great potential to offer accessible mental health support. However, it's unclear how to meet this potential in extreme mental health crisis use cases. In this work, we explore the first‑person experience of turning to a conversational AI agent in a mental health crisis. From a testimonial survey (n = 53) of lived experiences, we find that people use AI agents to fill the in‑between spaces of human support; they turn to AI due to lack of access to mental health professionals or fears of burdening others. At the same time, our interviews with mental health experts (n = 16) suggest that human‑human connection is an essential positive action when managing a mental health crisis. Using the stages of change model, our results suggest that a responsible AI crisis intervention is one that increases the user's preparedness to take a positive action while de‑escalating any intended negative action. We discuss the implications of designing conversational AI agents as bridges towards human‑human connection rather than ends in themselves.

PaperID: 6599, https://arxiv.org/pdf/2512.23844.pdf

Authors: Tao Dong, Harini Sampath, Ja Young Lee, Sherry Y. Shi, Andrew Macvean

Title: From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering

Abstract:
As Large Language Models (LLMs) evolve from code generators into collaborative partners for software engineers, our methods for evaluation are lagging. Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human‑AI partnership. To bridge this evaluation gap, this paper makes two core contributions. First, we present a foundational taxonomy of desirable agent behaviors for enterprise software engineering, derived from an analysis of 91 sets of user‑defined agent rules. This taxonomy defines four key expectations of agent behavior: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solving Problems Effectively, and Collaborating with the User. Second, recognizing that these expectations are not static, we introduce the Context‑Adaptive Behavior (CAB) Framework. This emerging framework reveals how behavioral expectations shift along two empirically‑derived axes: the Time Horizon (from immediate needs to future ideals), established through interviews with 15 expert engineers, and the Type of Work (from enterprise production to rapid prototyping, for example), identified through a prompt analysis of a prototyping agent. Together, these contributions offer a human‑centered foundation for designing and evaluating the next generation of AI agents, moving the field's focus from the correctness of generated code toward the dynamics of true collaborative intelligence.

PaperID: 6600, https://arxiv.org/pdf/2512.23742.pdf

Authors: Guangxi Fan, Tianliang Ma, Xuguang Sun, Xun Wang, Kain Lu Low, Leilai Shao

Title: AgenticTCAD: A LLM-based Multi-Agent Framework for Automated TCAD Code Generation and Device Optimization

Abstract:
With the continued scaling of advanced technology nodes, the design‑technology co‑optimization (DTCO) paradigm has become increasingly critical, rendering efficient device design and optimization essential. In the domain of TCAD simulation, however, the scarcity of open‑source resources hinders language models from generating valid TCAD code. To overcome this limitation, we construct an open‑source TCAD dataset curated by experts and fine‑tune a domain‑specific model for TCAD code generation. Building on this foundation, we propose AgenticTCAD, a natural language ‑ driven multi‑agent framework that enables end‑to‑end automated device design and optimization. Validation on a 2 nm nanosheet FET (NS‑FET) design shows that AgenticTCAD achieves the International Roadmap for Devices and Systems (IRDS)‑2024 device specifications within 4.2 hours, whereas human experts required 7.1 days with commercial tools.

PaperID: 6601, https://arxiv.org/pdf/2512.23738.pdf

Authors: Adharsh Kamath, Sishen Zhang, Calvin Xu, Shubham Ugare, Gagandeep Singh, Sasa Misailovic

Title: Enforcing Temporal Constraints for LLM Agents

Abstract:
LLM‑based agents are deployed in safety‑critical applications, yet current guardrail systems fail to prevent violations of temporal safety policies, requirements that govern the ordering and sequencing of agent actions. For instance, agents may access sensitive data before authenticating users or process refunds to unauthorized payment methods, violations that require reasoning about sequences of action rather than an individual action. Existing guardrails rely on imprecise natural language instructions or post‑hoc monitoring, and provide no formal guarantees that agents will satisfy temporal constraints. We present Agent‑C, a novel framework that provides run‑time guarantees ensuring LLM agents adhere to formal temporal safety properties. Agent‑C introduces a domain‑specific language for expressing temporal properties (e.g., authenticate before accessing data), translates specifications to first‑order logic, and uses SMT solving to detect non‑compliant agent actions during token generation. When the LLM attempts to generate a non‑compliant tool call, Agent‑C leverages constrained generation techniques to ensure that every action generated by the LLM complies with the specification, and to generate a compliant alternative to a non‑compliant agent action. We evaluate Agent‑C across two real‑world applications: retail customer service and airline ticket reservation system, and multiple language models (open and closed‑source). Our results demonstrate that Agent‑C achieves perfect safety (100% conformance, 0% harm), while improving task utility compared to state‑of‑the‑art guardrails and unrestricted agents. On SoTA closed‑source models, Agent‑C improves conformance (77.4% to 100% for Claude Sonnet 4.5 and 83.7% to 100% for GPT‑5), while simultaneously increasing utility (71.8% to 75.2% and 66.1% to 70.6%, respectively), representing a new SoTA frontier for reliable agentic reasoning.

PaperID: 6602, https://arxiv.org/pdf/2512.23717.pdf

Authors: Shenzhe Zhu

Title: HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Abstract:
Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi‑agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high‑quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double‑edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi‑agent debate for generating comprehensive safety training data.

PaperID: 6603, https://arxiv.org/pdf/2512.23713.pdf

Authors: Jahidul Islam, Md Ataullha, Saiful Azad

Title: PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Abstract:
LLMs excel at code generation from English prompts, but this progress has not extended to low‑resource languages. We address Bangla‑to‑Python code generation by introducing BanglaCodeAct, an agent‑based framework that leverages multi‑agent prompting and iterative self‑correction. Unlike prior approaches relying on task‑specific fine‑tuning, BanglaCodeAct employs an open‑source multilingual LLM within a Thought‑Code‑Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small‑parameter open‑source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3‑8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla‑to‑Python translation and highlight the potential of agent‑based reasoning for reliable code generation in low‑resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.

PaperID: 6604, https://arxiv.org/pdf/2512.23611.pdf

Authors: Yuwen Li, Wei Zhang, Zelong Huang, Mason Yang, Jiajun Wu, Shawn Guo, Huahao Hu, Lingyi Sun, Jian Yang, Mingjie Tang, Byran Dai

Title: Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Abstract:
Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high‑quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single‑model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self‑evolving multi‑agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool‑Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single‑turn calls to complex multi‑step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher‑quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function‑Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude‑Opus, and entirely from synthetic data without human annotation.

PaperID: 6605, https://arxiv.org/pdf/2512.23557.pdf

Authors: Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty

Title: Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks

Abstract:
Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision‑Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of the occurrence of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent‑to‑agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. In order to mitigate these risks, this paper suggests a Cross‑Agent Multimodal Provenanc‑ Aware Defense Framework whereby all the prompts, either user‑generated or produced by upstream agents, are sanitized and all the outputs generated by an LLM are verified independently before being sent to downstream nodes. This framework contains a Text sanitizer agent, visual sanitizer agent, and output validator agent all coordinated by a provenance ledger, which keeps metadata of modality, source, and trust level throughout the entire agent network. This architecture makes sure that agent‑to‑agent communication abides by clear trust frames such such that injected instructions are not propagated down LangChain or GraphChain‑style‑workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, and the cross‑agent trust leakage is minimized, as well as, agentic execution pathways become stable. The framework, which expands the concept of provenance tracking and validation to the multi‑agent orchestration, enhances the establishment of secure, understandable and reliable agentic AI systems.

PaperID: 6606, https://arxiv.org/pdf/2512.23518.pdf

Authors: Hazel Kim, Philip Torr

Title: Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

Abstract:
Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi‑agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference‑time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt‑specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi‑agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi‑agent debate while requiring only a fraction of the computation.

PaperID: 6607, https://arxiv.org/pdf/2512.23511.pdf

Authors: Xinyi Zheng, Ningke Li, Xiaokun Luan, Kailong Wang, Ling Shi, Meng Sun, Haoyu Wang

Title: Beyond Correctness: Exposing LLM-generated Logical Flaws in Reasoning via Multi-step Automated Theorem Proving

Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, leading to their adoption in high‑stakes domains such as healthcare, law, and scientific research. However, their reasoning often contains subtle logical errors masked by fluent language, posing significant risks for critical applications. While existing approaches like fact‑checking, self‑consistency methods, and rule‑based validation provide partial solutions, they fail to detect complex logical flaws in multi‑step reasoning. To overcome these challenges, we present MATP, an evaluation framework for systematically verifying LLM reasoning via Multi‑step Automatic Theorem Proving. MATP translates natural language reasoning into First‑Order Logic (FOL) and applies automated theorem provers to assess step‑by‑step logical validity. This approach identifies hidden logical errors and provides fine‑grained classifications of reasoning correctness. Evaluations on a benchmark comprising 10,830 reasoning instances generated by 10 LLMs across tasks from PrOntoQA‑OOD, ProofWriter, and FOLIO show that MATP surpasses prompting‑based baselines by over 42 percentage points in reasoning step verification. It further reveals model‑level disparities, with reasoning models generating more logically coherent outputs than general models. These results demonstrate MATP's potential to enhance the trustworthiness of LLM‑generated reasoning.

PaperID: 6608, https://arxiv.org/pdf/2512.23508.pdf

Authors: Alessio Benavoli, Alessandro Facchini, Marco Zaffalon

Title: Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities

Abstract:
How can we ensure that AI systems are aligned with human values and remain safe? We can study this problem through the frameworks of the AI assistance and the AI shutdown games. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). However, only the human knows these function(s); the AI assistant must learn them. The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task competently. In this paper, we show that addressing these challenges requires AI agents that can reason under uncertainty and handle both incomplete and non‑Archimedean preferences.

PaperID: 6609, https://arxiv.org/pdf/2512.23489.pdf

Authors: Haoyu Pei, Zhongyang Liu, Xiangyi Xiao, Xiaocong Du, Suting Hong, Kunpeng Zhang, Haipeng Zhang

Title: The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction

Abstract:
Most venture capital (VC) investments fail, while a few deliver outsized returns. Accurately predicting startup success requires synthesizing complex relational evidence, including company disclosures, investor track records, and investment network structures, through explicit reasoning to form coherent, interpretable investment theses. Traditional machine learning and graph neural networks both lack this reasoning capability. Large language models (LLMs) offer strong reasoning but face a modality mismatch with graphs. Recent graph‑LLM methods target in‑graph tasks where answers lie within the graph, whereas VC prediction is off‑graph: the target exists outside the network. The core challenge is selecting graph paths that maximize predictor performance on an external objective while enabling step‑by‑step reasoning. We present MIRAGE‑VC, a multi‑perspective retrieval‑augmented generation framework that addresses two obstacles: path explosion (thousands of candidate paths overwhelm LLM context) and heterogeneous evidence fusion (different startups need different analytical emphasis). Our information‑gain‑driven path retriever iteratively selects high‑value neighbors, distilling investment networks into compact chains for explicit reasoning. A multi‑agent architecture integrates three evidence streams via a learnable gating mechanism based on company attributes. Under strict anti‑leakage controls, MIRAGE‑VC achieves +5.0% F1 and +16.6% PrecisionAt5, and sheds light on other off‑graph prediction tasks such as recommendation and risk assessment. Code: https://anonymous.4open.science/r/MIRAGE‑VC‑323F.

PaperID: 6610, https://arxiv.org/pdf/2512.23480.pdf

Authors: Toqeer Ali Syed, Mohammad Riyaz Belgaum, Salman Jan, Asadullah Abdullah Khan, Saad Said Alqahtani

Title: Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation

Abstract:
The software supply chain attacks are becoming more and more focused on trusted development and delivery procedures, so the conventional post‑build integrity mechanisms cannot be used anymore. The available frameworks like SLSA, SBOM and in toto are majorly used to offer provenance and traceability but do not have the capabilities of actively identifying and removing vulnerabilities in software production. The current paper includes an example of agentic artificial intelligence (AI) based on autonomous software supply chain security that combines large language model (LLM)‑based reasoning, reinforcement learning (RL), and multi‑agent coordination. The suggested system utilizes specialized security agents coordinated with the help of LangChain and LangGraph, communicates with actual CI/CD environments with the Model Context Protocol (MCP), and documents all the observations and actions in a blockchain security ledger to ensure integrity and auditing. Reinforcement learning can be used to achieve adaptive mitigation strategies that consider the balance between security effectiveness and the operational overhead, and LLMs can be used to achieve semantic vulnerability analysis, as well as explainable decisions. This framework is tested based on simulated pipelines, as well as, actual world CI/CD integrations on GitHub Actions and Jenkins, including injection attacks, insecure deserialization, access control violations, and configuration errors. Experimental outcomes indicate better detection accuracy, shorter mitigation latency and reasonable build‑time overhead than rule‑based, provenance only and RL only baselines. These results show that agentic AI can facilitate the transition to self defending, proactive software supply chains rather than reactive verification ones.

PaperID: 6611, https://arxiv.org/pdf/2512.23445.pdf

Authors: Manuel Franco-Vivo

Title: Assessing behaviour coverage in a multi-agent system simulation for autonomous vehicle testing

Abstract:
As autonomous vehicle technology advances, ensuring the safety and reliability of these systems becomes paramount. Consequently, comprehensive testing methodologies are essential to evaluate the performance of autonomous vehicles in diverse and complex real‑world scenarios. This study focuses on the behaviour coverage analysis of a multi‑agent system simulation designed for autonomous vehicle testing, and provides a systematic approach to measure and assess behaviour coverage within the simulation environment. By defining a set of driving scenarios, and agent interactions, we evaluate the extent to which the simulation encompasses a broad range of behaviours relevant to autonomous driving. Our findings highlight the importance of behaviour coverage in validating the effectiveness and robustness of autonomous vehicle systems. Through the analysis of behaviour coverage metrics and coverage‑based testing, we identify key areas for improvement and optimization in the simulation framework. Thus, a Model Predictive Control (MPC) pedestrian agent is proposed, where its objective function is formulated to encourage interesting tests while promoting a more realistic behaviour than other previously studied pedestrian agents. This research contributes to advancing the field of autonomous vehicle testing by providing insights into the comprehensive evaluation of system behaviour in simulated environments. The results offer valuable implications for enhancing the safety, reliability, and performance of autonomous vehicles through rigorous testing methodologies.

PaperID: 6612, https://arxiv.org/pdf/2512.23424.pdf

Authors: Jinye Du, Quan Yuan, Zuyao Zhang, Yanzhi Yi, Jiahui Hu, Wangyi Chen, Yiyang Zhu, Qishui Zheng, Wenxiang Zou, Xiangyu Chang, Zuohe Zheng, Zichun Ye, Chao Liu, Shanni Li, Renwei Zhang, Yiping Deng, Xinwei Hu, Xuefeng Jin, Jie Zhao

Title: AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis

Abstract:
Modern AI models demand high‑performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI‑driven Kernel Generator), a multi‑agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain‑specific languages (DSLs), including Triton, TileLang, CPP, and CUDA‑C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46× over PyTorch Eager baselines implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.

PaperID: 6613, https://arxiv.org/pdf/2512.23356.pdf

Authors: Xin Zhang, Yang Cao, Baoxing Wu, Xinyi Chen, Kai Song, Siying Li

Title: A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

Abstract:
Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step by step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

PaperID: 6614, https://arxiv.org/pdf/2512.23328.pdf

Authors: Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang

Title: CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations

Abstract:
Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical‑world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long‑horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three‑tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long‑horizon tasks, exposing a fundamental failure in long‑term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically‑grounded intelligent agents.

PaperID: 6615, https://arxiv.org/pdf/2512.23320.pdf

Authors: Junchang Shi, Gang Li

Title: Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions

Abstract:
When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA MIG, a multi agent semantic and emotion aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence Arousal regression head predicts continuous affective states from music, while a CLIP based visual VA head estimates emotions from images. These components jointly enforce semantic and emotional alignment between music and synthesized images. Experiments on curated music image pairs show that MESA MIG outperforms caption only and single agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion regression performance compared with state of the art music and image emotion models.

PaperID: 6616, https://arxiv.org/pdf/2512.23300.pdf

Authors: Minjiang Huang, Jipeng Qiang, Yi Zhu, Chaowei Zhang, Xiangyu Zhao, Kui Yu

Title: AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

Abstract:
Audiobook interpretations are attracting increasing attention, as they provide accessible and in‑depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time‑consuming and resource‑intensive. To address this challenge, we propose AI4Reading, a multi‑agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast, like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents,including topic analysts, case analysts, editors, a narrator, and proofreaders that work in concert to explore themes, extract real world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system's output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.

PaperID: 6617, https://arxiv.org/pdf/2512.23217.pdf

Authors: Jingming Li

Title: TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI

Abstract:
A critical gap exists in LLM task‑specific benchmarks. Thermal comfort, a sophisticated interplay of environmental factors and personal perceptions involving sensory integration and adaptive decision‑making, serves as an ideal paradigm for evaluating real‑world cognitive capabilities of AI systems. To address this, we propose TCEval, the first evaluation framework that assesses three core cognitive capacities of AI, cross‑modal reasoning, causal association, and adaptive decision‑making, by leveraging thermal comfort scenarios and large language model (LLM) agents. The methodology involves initializing LLM agents with virtual personality attributes, guiding them to generate clothing insulation selections and thermal comfort feedback, and validating outputs against the ASHRAE Global Database and Chinese Thermal Comfort Database. Experiments on four LLMs show that while agent feedback has limited exact alignment with humans, directional consistency improves significantly with a 1 PMV tolerance. Statistical tests reveal that LLM‑generated PMV distributions diverge markedly from human data, and agents perform near‑randomly in discrete thermal comfort classification. These results confirm the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI, demonstrating that current LLMs possess foundational cross‑modal reasoning ability but lack precise causal understanding of the nonlinear relationships between variables in thermal comfort. TCEval complements traditional benchmarks, shifting AI evaluation focus from abstract task proficiency to embodied, context‑aware perception and decision‑making, offering valuable insights for advancing AI in human‑centric applications like smart buildings.

PaperID: 6618, https://arxiv.org/pdf/2512.23167.pdf

Authors: Yifan Zhang, Giridhar Ganapavarapu, Srideepika Jayaraman, Bhavna Agrawal, Dhaval Patel, Achille Fokoue

Title: SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search

Abstract:
Large Language Models (LLMs) often falter at complex planning tasks that require exploration and self‑correction, as their linear reasoning process struggles to recover from early mistakes. While search algorithms like Monte Carlo Tree Search (MCTS) can explore alternatives, they are often ineffective when guided by sparse rewards and fail to leverage the rich semantic capabilities of LLMs. We introduce SPIRAL (Symbolic LLM Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized LLM agents into an MCTS loop. SPIRAL's key contribution is its integrated planning pipeline where a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute‑force search into a guided, self‑correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain‑of‑Thought planning method and other state‑of‑the‑art agents. More importantly, it substantially surpasses other state‑of‑the‑art agents; for example, SPIRAL achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points against the next‑best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring LLM reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.

PaperID: 6619, https://arxiv.org/pdf/2512.23158.pdf

Authors: Kooktae Lee, Julian Martinez

Title: Breaking Symmetry-Induced Degeneracy in Multi-Agent Ergodic Coverage via Stochastic Spectral Control

Abstract:
Multi‑agent ergodic coverage via Spectral Multiscale Coverage (SMC) provides a principled framework for driving a team of agents so that their collective time‑averaged trajectories match a prescribed spatial distribution. While classical SMC has demonstrated empirical success, it can suffer from gradient cancellation, particularly when agents are initialized near symmetry points of the target distribution, leading to undesirable behaviors such as stalling or motion constrained along symmetry axes. In this work, we rigorously characterize the initial conditions and symmetry‑induced invariant manifolds that give rise to such directional degeneracy in first‑order agent dynamics. To address this, we introduce a stochastic perturbation combined with a contraction term and prove that the resulting dynamics ensure almost‑sure escape from zero‑gradient manifolds while maintaining mean‑square boundedness of agent trajectories. Simulations on symmetric multi‑modal reference distributions demonstrate that the proposed stochastic SMC effectively mitigates transient stalling and axis‑constrained motion, while ensuring that all agent trajectories remain bounded within the domain.

PaperID: 6620, https://arxiv.org/pdf/2512.23153.pdf

Authors: Seiko Piotr Yamaguchi, Kentaro Uno, Yasumaru Fujii, Masazumi Imai, Kazuki Takada, Taku Okawara, Kazuya Yoshida

Title: Towards the Automation in the Space Station: Feasibility Study and Ground Tests of a Multi-Limbed Intra-Vehicular Robot

Abstract:
This paper presents a feasibility study, including simulations and prototype tests, on the autonomous operation of a multi‑limbed intra‑vehicular robot (mobile manipulator), shortly MLIVR, designed to assist astronauts with logistical tasks on the International Space Station (ISS). Astronauts spend significant time on tasks such as preparation, close‑out, and the collection and transportation of goods, reducing the time available for critical mission activities. Our study explores the potential for a mobile manipulator to support these operations, emphasizing the need for autonomous functionality to minimize crew and ground operator effort while enabling real‑time task execution. We focused on the robot's transportation capabilities, simulating its motion planning in 3D space. The actual motion execution was tested with a prototype on a 2D table to mimic a microgravity environment. The results demonstrate the feasibility of performing these tasks with minimal human intervention, offering a promising solution to enhance operational efficiency on the ISS.

PaperID: 6621, https://arxiv.org/pdf/2512.23132.pdf

Authors: Armstrong Foundjem, Lionel Nganyewou Tidjon, Leuson Da Silva, Foutse Khomh

Title: Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

Abstract:
Machine learning (ML) underpins foundation models in finance, healthcare, and critical infrastructure, making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference‑guided black‑box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection‑driven jailbreaks and cross‑modal manipulation. Traditional cybersecurity lacks ML‑specific threat modeling for foundation, multimodal, and RAG systems. Objective: Characterize ML security risks by identifying dominant TTPs, vulnerabilities, and targeted lifecycle stages. Methods: We extract 93 threats from MITRE ATLAS (26), AI Incident Database (12), and literature (55), and analyze 854 GitHub/Python repositories. A multi‑agent RAG system (ChatGPT‑4o, temp 0.4) mines 300+ articles to build an ontology‑driven threat graph linking TTPs, vulnerabilities, and stages. Results: We identify unreported threats including commercial LLM API model stealing, parameter memorization leakage, and preference‑guided text‑only jailbreaks. Dominant TTPs include MASTERKEY‑style jailbreaking, federated poisoning, diffusion backdoors, and preference optimization leakage, mainly impacting pre‑training and inference. Graph analysis reveals dense vulnerability clusters in libraries with poor patch propagation. Conclusion: Adaptive, ML‑specific security frameworks, combining dependency hygiene, threat intelligence, and monitoring, are essential to mitigate supply‑chain and inference risks across the ML lifecycle.

PaperID: 6622, https://arxiv.org/pdf/2512.23049.pdf

Authors: TJ Bai, Jason Eisner

Title: Accelerating Language Model Workflows with Prompt Choreography

Abstract:
Large language models are increasingly deployed in multi‑agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re‑encoding them in a new context, we show in diverse settings that fine‑tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per‑message latency (2.0‑‑6.2× faster time‑to‑first‑token) and achieves substantial end‑to‑end speedups (>2.2×) in some workflows dominated by redundant computation.

PaperID: 6623, https://arxiv.org/pdf/2512.22941.pdf

Authors: Tianyi Hu, Zhiqiang Pu, Yuan Wang, Tenghai Qiu, Min Chen, Xin Yu

Title: Heterogeneity in Multi-Agent Reinforcement Learning

Abstract:
Heterogeneity is a fundamental property in multi‑agent reinforcement learning (MARL), which is closely related not only to the functional differences of agents, but also to policy diversity and environmental interactions. However, the MARL field currently lacks a rigorous definition and deeper understanding of heterogeneity. This paper systematically discusses heterogeneity in MARL from the perspectives of definition, quantification, and utilization. First, based on an agent‑level modeling of MARL, we categorize heterogeneity into five types and provide mathematical definitions. Second, we define the concept of heterogeneity distance and propose a practical quantification method. Third, we design a heterogeneity‑based multi‑agent dynamic parameter sharing algorithm as an example of the application of our methodology. Case studies demonstrate that our method can effectively identify and quantify various types of agent heterogeneity. Experimental results show that the proposed algorithm, compared to other parameter sharing baselines, has better interpretability and stronger adaptability. The proposed methodology will help the MARL community gain a more comprehensive and profound understanding of heterogeneity, and further promote the development of practical algorithms.

PaperID: 6624, https://arxiv.org/pdf/2512.22922.pdf

Authors: Anton A. Stoorvogel, Ali Saberi, Zhenwei Liu, Tayaba Yeasmin

Title: Weak state synchronization of homogeneous multi-agent systems with adaptive protocols

Abstract:
In this paper, we study scale‑free weak synchronization for multi‑agent systems (MAS). In other words, we design a protocol for the agents without using any knowledge about the network. We do not even require knowledge about the connectivity of the network. Each protocol contains an adaptive parameter to tune the protocol automatically to the demands of the network.

PaperID: 6625, https://arxiv.org/pdf/2512.22895.pdf

Authors: Xiaotian Ren, Nuerxiati Abudurexiti, Zhengyong Jiang, Angelos Stefanidis, Hongbin Liu, Jionglong Su

Title: SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning

Abstract:
Portfolio optimization in non‑stationary markets is challenging due to regime shifts, dynamic correlations, and the limited interpretability of deep reinforcement learning (DRL) policies. We propose a Segmented Allocation with Momentum‑Adjusted Utility for Multi‑agent Portfolio Management via Hierarchical Deep Reinforcement Learning (SAMP‑HDRL). The framework first applies dynamic asset grouping to partition the market into high‑quality and ordinary subsets. An upper‑level agent extracts global market signals, while lower‑level agents perform intra‑group allocation under mask constraints. A utility‑based capital allocation mechanism integrates risky and risk‑free assets, ensuring coherent coordination between global and local decisions. backtests across three market regimes (2019‑‑2021) demonstrate that SAMP‑HDRL consistently outperforms nine traditional baselines and nine DRL benchmarks under volatile and oscillating conditions. Compared with the strongest baseline, our method achieves at least 5% higher Return, 5% higher Sharpe ratio, 5% higher Sortino ratio, and 2% higher Omega ratio, with substantially larger gains observed in turbulent markets. Ablation studies confirm that upper‑‑lower coordination, dynamic clustering, and capital allocation are indispensable to robustness. SHAP‑based interpretability further reveals a complementary ``diversified + concentrated'' mechanism across agents, providing transparent insights into decision‑making. Overall, SAMP‑HDRL embeds structural market constraints directly into the DRL pipeline, offering improved adaptability, robustness, and interpretability in complex financial environments.

PaperID: 6626, https://arxiv.org/pdf/2512.22876.pdf

Authors: Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko, Konstantin Vorontsov

Title: Reinforcement Networks: novel framework for collaborative Multi-Agent Reinforcement Learning tasks

Abstract:
Modern AI systems often comprise multiple learnable components that can be naturally organized as graphs. A central challenge is the end‑to‑end training of such systems without restrictive architectural or training assumptions. Such tasks fit the theory and approaches of the collaborative Multi‑Agent Reinforcement Learning (MARL) field. We introduce Reinforcement Networks, a general framework for MARL that organizes agents as vertices in a directed acyclic graph (DAG). This structure extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination while avoiding strict topologies, fully centralized training, and other limitations of current approaches. We formalize training and inference methods for the Reinforcement Networks framework and connect it to the LevelEnv concept to support reproducible construction, training, and evaluation. We demonstrate the effectiveness of our approach on several collaborative MARL setups by developing several Reinforcement Networks models that achieve improved performance over standard MARL baselines. Beyond empirical gains, Reinforcement Networks unify hierarchical, modular, and graph‑structured views of MARL, opening a principled path toward designing and training complex multi‑agent systems. We conclude with theoretical and practical directions ‑ richer graph morphologies, compositional curricula, and graph‑aware exploration. That positions Reinforcement Networks as a foundation for a new line of research in scalable, structured MARL.

PaperID: 6627, https://arxiv.org/pdf/2512.22832.pdf

Authors: Cuiling Wu, Yaozhong Gan, Junliang Xing, Ying Fu

Title: MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning

Abstract:
We propose Multi Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, and an asymmetric clipping mechanism that is derived from the KL divergence and dynamically adjusts the clipping range to improve training stability. We evaluate MARPO in classic multi agent environments, where it consistently outperforms other methods.

PaperID: 6628, https://arxiv.org/pdf/2512.22826.pdf

Authors: Aryan Tyagi, Soumyajyoti Biswas, Anirban Chakraborti

Title: Active-Absorbing Phase Transitions in the Parallel Minority Game

Abstract:
The Parallel Minority Game (PMG) is a synchronous adaptive multi‑agent model that exhibits active‑absorbing transitions characteristic of non‑equilibrium statistical systems. We perform a comprehensive numerical study of the PMG under two families of microscopic decision rules: (i) agents update their choices based on instantaneous population in their alternative choices, and (ii) threshold‑based activation that activates agents movement only after overcrowding density crossing a threshold. We measure time‑dependent and steady state limits of activity A(t), overcrowding fraction F(t) as functions of the control parameter g=N/D, where N is the number of agents and D is the total number of sites. Instantaneous rules display mean‑field directed‑percolation (MF‑DP) scaling with β\approx1.00, δ\approx0.5, and ν_\parallel\approx2.0. Threshold rules, however, produce a distinct non‑mean‑field universality class with β\approx0.75 and a systematic failure of dynamical scaling. We show that thresholding acts as a relevant perturbation to the critical behavior of the model. The results highlight how minimal cognitive features at the agent level fundamentally alter large‑scale critical behavior in socio‑economic and active systems.

PaperID: 6629, https://arxiv.org/pdf/2512.22721.pdf

Authors: Junaid Farooq, Quanyan Zhu

Title: Cyber Resilience in Next-Generation Networks: Threat Landscape, Theoretical Foundations, and Design Paradigms

Abstract:
The evolution of networked systems, driven by innovations in software‑defined networking (SDN), network function virtualization (NFV), open radio access networks (O‑RAN), and cloud‑native architectures, is redefining both the operational landscape and the threat surface of critical infrastructures. This book offers an in‑depth, interdisciplinary examination of how resilience must be re‑conceptualized and re‑engineered to address the multifaceted challenges posed by these transformations. Structured across six chapters, this book begins by surveying the contemporary risk landscape, identifying emerging cyber, physical, and AI‑driven threats, and analyzing their implications for scalable, heterogeneous network environments. It then establishes rigorous definitions and evaluation frameworks for resilience, going beyond robustness and fault‑tolerance to address adaptive, anticipatory, and retrospective mechanisms across diverse application domains. The core of the book delves into advanced paradigms and practical strategies for resilience, including zero trust architectures, game‑theoretic threat modeling, and self‑healing design principles. A significant portion is devoted to the role of artificial intelligence, especially reinforcement learning and large language models (LLMs), in enabling dynamic threat response, autonomous network control, and multi‑agent coordination under uncertainty.

PaperID: 6630, https://arxiv.org/pdf/2512.22673.pdf

Authors: Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu

Title: Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Abstract:
Travel planning is a natural real‑world task to test large language models' (LLMs) planning and tool‑use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real‑world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi‑turn conversations, and a lack of evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for truly real‑world travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks ‑‑ Single‑Turn, Multi‑Turn, and Unsolvable ‑‑ to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing the capability boundaries. To enable stable tool invocation and reproducible evaluation, we cache real tool‑call results and build a sandbox environment which integrates ten travel‑related tools, enabling agents to combine these tools to solve most practical travel planning problems. We evaluate multiple LLMs on TravelBench and find that even advanced models exhibit imbalanced performance across different capabilities. Our further systematic verification demonstrates the stability of the proposed benchmark. TravelBench provides a practical and reproducible benchmark to advance research on LLM agents for real‑world travel planning.

PaperID: 6631, https://arxiv.org/pdf/2512.22650.pdf

Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang

Title: Scaling Unverifiable Rewards: A Case Study on Visual Insights

Abstract:
Large Language Model (LLM) agents can increasingly automate complex reasoning through Test‑Time Scaling (TTS), iterative refinement guided by reward signals. However, many real‑world tasks involve multi‑stage pipeline whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge‑based refinement prone to accumulate error over stages. We propose Selective TTS, a process‑based refinement framework that scales inference across different stages in multi‑agent pipeline, instead of repeated refinement over time by prior work. By distributing compute across stages and pruning low‑quality branches early using process‑specific judges, Selective TTS mitigates the judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end‑to‑end multi‑agent pipeline for generating visually insightful charts and report of given dataset, and design a reliable LLM‑based judge model, aligned with human experts (Kendall's τ=0.55). Our proposed selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as the first step toward to scaling complex, open‑ended tasks with unverifiable rewards, such as scientific discovery and story generation.

PaperID: 6632, https://arxiv.org/pdf/2512.22631.pdf

Authors: Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou

Title: Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Abstract:
Chain‑of‑thought (CoT) reasoning has emerged as a powerful technique for improving the problem‑solving capabilities of large language models (LLMs), particularly for tasks requiring multi‑step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT‑based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5‑14B‑Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.

PaperID: 6633, https://arxiv.org/pdf/2512.22627.pdf

Authors: Chen Su, Yuanhe Tian, Yan Song

Title: Chain-of-thought Reviewing and Correction for Time Series Question Answering

Abstract:
With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM‑based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi‑step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step‑wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated corrected CoT are used to fine‑tune the student model, internalizing multi‑step reasoning and self‑correction into its parameters. Experiments on multiple real‑world TSQA benchmarks demonstrate that T3LLM achieves state‑of‑the‑art performance over strong LLM‑based baselines.

PaperID: 6634, https://arxiv.org/pdf/2512.22608.pdf

Authors: Zhongyang Liu, Haoyu Pei, Xiangyi Xiao, Xiaocong Du, Yihui Li, Suting Hong, Kunpeng Zhang, Haipeng Zhang

Title: Beyond Isolated Investor: Predicting Startup Success via Roleplay-Based Collective Agents

Abstract:
Due to the high value and high failure rates of startups, predicting their success is a critical challenge. Existing approaches typically model startup success from a single decision‑maker's perspective, overlooking the collective dynamics that dominate real‑world venture capital (VC) decision‑making. We propose SimVC‑CAS, a collective agent system that simulates VC decisions as a multi‑agent interaction process. By designing role‑playing agents and a GNN‑based supervised interaction module, we reformulate startup financing prediction as a group decision‑making task, capturing both enterprise fundamentals and investor network dynamics. Each agent represents an investor with distinct traits and preferences, enabling heterogeneous evaluations and realistic information exchange over a graph‑structured co‑investment network. Using both proprietary and public VC data with strict anti‑leakage controls, we show that SimVC‑CAS significantly improves predictive performance, achieving approximately 25% relative improvement in average precision@10, while exhibiting consistency with real investor decisions. The interaction mechanism is particularly effective for network‑central startups, confirming the importance of network in VC decision‑making. Analysis of agents' reasoning for decision changes further reveals how network environment influence decision quality, demonstrating the system's interpretability. Our approach may generalize to broader group decision‑making scenarios.

PaperID: 6635, https://arxiv.org/pdf/2512.22579.pdf

Authors: Yong Xiao, Xubo Li, Haoran Zhou, Yingyu Li, Yayu Gao, Guangming Shi, Ping Zhang, Marwan Krunz

Title: SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

Abstract:
Agentic AI networking (AgentNet) is a novel AI‑native networking paradigm in which a large number of specialized AI agents collaborate to perform autonomous decision‑making, dynamic environmental adaptation, and complex missions. It has the potential to facilitate real‑time network management and optimization functions, including self‑configuration, self‑optimization, and self‑adaptation across diverse and complex environments. This paper proposes SANet, a novel semantic‑aware AgentNet architecture for wireless networks that can infer the semantic goal of the user and automatically assign agents associated with different layers of the network to fulfill the inferred goal. Motivated by the fact that AgentNet is a decentralized framework in which collaborating agents may generally have different and even conflicting objectives, we formulate the decentralized optimization of SANet as a multi‑agent multi‑objective problem, and focus on finding the Pareto‑optimal solution for agents with distinct and potentially conflicting objectives. We propose three novel metrics for evaluating SANet. Furthermore, we develop a model partition and sharing (MoPS) framework in which large models, e.g., deep learning models, of different agents can be partitioned into shared and agent‑specific parts that are jointly constructed and deployed according to agents' local computational resources. Two decentralized optimization algorithms are proposed. We derive theoretical bounds and prove that there exists a three‑way tradeoff among optimization, generalization, and conflicting errors. We develop an open‑source RAN and core network‑based hardware prototype that implements agents to interact with three different layers of the network. Experimental results show that the proposed framework achieved performance gains of up to 14.61% while requiring only 44.37% of FLOPs required by state‑of‑the‑art algorithms.

PaperID: 6636, https://arxiv.org/pdf/2512.22554.pdf

Authors: Fatihcan M. Atay

Title: On the duality between consensus problems and Markov processes, with application to delay systems

Abstract:
We consider consensus of multi‑agent systems as a dual problem to Markov processes. Based on an exchange of relevant notions and results between the two fields, we present a uniform framework which admits the introduction and treatment of time delays in a common setting. We study both information propagation and information processing delays, and for each case derive conditions for reaching consensus and calculate the consensus value.

PaperID: 6637, https://arxiv.org/pdf/2512.22529.pdf

Authors: Yiming Lu, Tingyu Lu, Di Zhang, Lili Ye, Hao Li

Title: Multi-AI Agent Framework Reveals the "Oxide Gatekeeper" in Aluminum Nanoparticle Oxidation

Abstract:
Aluminum nanoparticles (ANPs) are among the most energy‑dense solid fuels, yet the atomic mechanisms governing their transition from passivated particles to explosive reactants remain elusive. This stems from a fundamental computational bottleneck: ab initio methods offer quantum accuracy but are restricted to small spatiotemporal scales (< 500 atoms, picoseconds), while empirical force fields lack the reactive fidelity required for complex combustion environments. Herein, we bridge this gap by employing a "human‑in‑the‑loop" closed‑loop framework where self‑auditing AI Agents validate the evolution of a machine learning potential (MLP). By acting as scientific sentinels that visualize hidden model artifacts for human decision‑making, this collaborative cycle ensures quantum mechanical accuracy while exhibiting near‑linear scalability to million‑atom systems and accessing nanosecond timescales (energy RMSE: 1.2 meV/atom, force RMSE: 0.126 eV/Angstrom). Strikingly, our simulations reveal a temperature‑regulated dual‑mode oxidation mechanism: at moderate temperatures, the oxide shell acts as a dynamic "gatekeeper," regulating oxidation through a "breathing mode" of transient nanochannels; above a critical threshold, a "rupture mode" unleashes catastrophic shell failure and explosive combustion. Importantly, we resolve a decades‑old controversy by demonstrating that aluminum cation outward diffusion, rather than oxygen transport, dominates mass transfer across all temperature regimes, with diffusion coefficients consistently exceeding those of oxygen by 2‑3 orders of magnitude. These discoveries establish a unified atomic‑scale framework for energetic nanomaterial design, enabling the precision engineering of ignition sensitivity and energy release rates through intelligent computational design.

PaperID: 6638, https://arxiv.org/pdf/2512.22496.pdf

Authors: Saisab Sadhu, Ashim Dhor

Title: Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring

Abstract:
Large Language Models (LLMs) are increasingly deployed as automated tutors to address educator shortages; however, they often fail at pedagogical reasoning, frequently validating incorrect student solutions (sycophancy) or providing overly direct answers that hinder learning. We introduce Hierarchical Pedagogical Oversight (HPO), a framework that adapts structured adversarial synthesis to educational assessment. Unlike cooperative multi‑agent systems that often drift toward superficial consensus, HPO enforces a dialectical separation of concerns: specialist agents first distill dialogue context, which then grounds a moderated, five‑act debate between opposing pedagogical critics. We evaluate this framework on the MRBench dataset of 1,214 middle‑school mathematics dialogues. Our 8B‑parameter model achieves a Macro F1 of 0.845, outperforming GPT‑4o (0.812) by 3.3% while using 20 times fewer parameters. These results establish adversarial reasoning as a critical mechanism for deploying reliable, low‑compute pedagogical oversight in resource‑constrained environments.

PaperID: 6639, https://arxiv.org/pdf/2512.22435.pdf

Authors: Zining Wang, Jian Gao, Weimin Fu, Xiaolong Guo, Xuan Zhang

Title: AnalogSAGE: Self-evolving Analog Design Multi-Agents with Stratified Memory and Grounded Experience

Abstract:
Analog circuit design remains a knowledge‑ and experience‑intensive process that relies heavily on human intuition for topology generation and device parameter tuning. Existing LLM‑based approaches typically depend on prompt‑driven netlist generation or predefined topology templates, limiting their ability to satisfy complex specification requirements. We propose AnalogSAGE, an open‑source self‑evolving multi‑agent framework that coordinates three‑stage agent explorations through four stratified memory layers, enabling iterative refinement with simulation‑grounded feedback. To support reproducibility and generality, we release the source code. Our benchmark spans ten specification‑driven operational amplifier design problems of varying difficulty, enabling quantitative and cross‑task comparison under identical conditions. Evaluated under the open‑source SKY130 PDK with ngspice, AnalogSAGE achieves a 10× overall pass rate, a 48× Pass@1, and a 4× reduction in parameter search space compared with existing frameworks, demonstrating that stratified memory and grounded reasoning substantially enhance the reliability and autonomy of analog design automation in practice.

PaperID: 6640, https://arxiv.org/pdf/2512.22404.pdf

Authors: Quanzhi Fu, Qiyu Wu, Dan Williams

Title: Mining the Gold: Student-AI Chat Logs as Rich Sources for Automated Knowledge Gap Detection

Abstract:
With the significant increase in enrollment in computing‑related programs over the past 20 years, lecture sizes have grown correspondingly. In large lectures, instructors face challenges on identifying students' knowledge gaps timely, which is critical for effective teaching. Existing classroom response systems rely on instructor‑initiated interactions, which limits their ability to capture the spontaneous knowledge gaps that naturally emerge during lectures. With the widespread adoption of LLMs among students, we recognize these student‑AI dialogues as a valuable, student‑centered data source for identifying knowledge gaps. In this idea paper, we propose QueryQuilt, a multi‑agent LLM framework that automatically detects common knowledge gaps in large‑scale lectures by analyzing students' chat logs with AI assistants. QueryQuilt consists of two key components: (1) a Dialogue Agent that responds to student questions while employing probing questions to reveal underlying knowledge gaps, and (2) a Knowledge Gap Identification Agent that systematically analyzes these dialogues to identify knowledge gaps across the student population. By generating frequency distributions of identified gaps, instructors can gain comprehensive insights into class‑wide understanding. Our evaluation demonstrates promising results, with QueryQuilt achieving 100% accuracy in identifying knowledge gaps among simulated students and 95% completeness when tested on real student‑AI dialogue data. These initial findings indicate the system's potential for facilitate teaching in authentic learning environments. We plan to deploy QueryQuilt in actual classroom settings for comprehensive evaluation, measuring its detection accuracy and impact on instruction.

PaperID: 6641, https://arxiv.org/pdf/2512.22381.pdf

Authors: Mohammad Zakaria Haider, Amit Kumar Podder, Prabin Mali, Aranya Chakrabortty, Sumit Paudyal, Mohammad Ashiqur Rahman

Title: PHANTOM: Physics-Aware Adversarial Attacks against Federated Learning-Coordinated EV Charging Management System

Abstract:
The rapid deployment of electric vehicle charging stations (EVCS) within distribution networks necessitates intelligent and adaptive control to maintain the grid's resilience and reliability. In this work, we propose PHANTOM, a physics‑aware adversarial network that is trained and optimized through a multi‑agent reinforcement learning model. PHANTOM integrates a physics‑informed neural network (PINN) enabled by federated learning (FL) that functions as a digital twin of EVCS‑integrated systems, ensuring physically consistent modeling of operational dynamics and constraints. Building on this digital twin, we construct a multi‑agent RL environment that utilizes deep Q‑networks (DQN) and soft actor‑critic (SAC) methods to derive adversarial false data injection (FDI) strategies capable of bypassing conventional detection mechanisms. To examine the broader grid‑level consequences, a transmission and distribution (T and D) dual simulation platform is developed, allowing us to capture cascading interactions between EVCS disturbances at the distribution level and the operations of the bulk transmission system. Results demonstrate how learned attack policies disrupt load balancing and induce voltage instabilities that propagate across T and D boundaries. These findings highlight the critical need for physics‑aware cybersecurity to ensure the resilience of large‑scale vehicle‑grid integration.

PaperID: 6642, https://arxiv.org/pdf/2512.22351.pdf

Authors: Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo

Title: VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Abstract:
Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision‑language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP‑based API. This shifts the interaction from brittle raw code manipulation to more robust, function‑level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language‑based updates and precise 3D‑aware manipulation. Third, to manage the iterative, error‑prone updates, we propose a collaborative multi‑agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi‑step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan‑3d.github.io

PaperID: 6643, https://arxiv.org/pdf/2512.22336.pdf

Authors: Mengkang Hu, Bowei Xia, Yuran Wu, Ailing Yu, Yude Zou, Qiguang Chen, Shijian Wang, Jiarui Jin, Kexin Li, Wenxiang Jiao, Yuan Lu, Ping Luo

Title: Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback

Abstract:
Symbolic world models (e.g., PDDL domains or executable simulators) are central to model‑based planning, but training LLMs to generate such world models is limited by the lack of large‑scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior‑level errors arising from interactive execution. In this paper, we propose Agent2World, a tool‑augmented multi‑agent framework that achieves strong inference‑time world‑model generation and also serves as a data engine for supervised fine‑tuning, by grounding generation in multi‑agent feedback. Agent2World follows a three‑stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation‑based validation. Agent2World demonstrates superior inference‑time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state‑of‑the‑art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior‑aware adaptive feedback that yields multi‑turn training trajectories. The model fine‑tuned on these trajectories substantially improves world‑model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.

PaperID: 6644, https://arxiv.org/pdf/2512.22266.pdf

Authors: Bing Hao, Minglai Shao, Zengyi Wo, Yunlong Chu, Yuhang Liu, Ruijie Wang

Title: LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs

Abstract:
The widespread application of Large Language Models (LLMs) has motivated a growing interest in their capacity for processing dynamic graphs. Temporal motifs, as an elementary unit and important local property of dynamic graphs which can directly reflect anomalies and unique phenomena, are essential for understanding their evolutionary dynamics and structural features. However, leveraging LLMs for temporal motif analysis on dynamic graphs remains relatively unexplored. In this paper, we systematically study LLM performance on temporal motif‑related tasks. Specifically, we propose a comprehensive benchmark, LLMTM (Large Language Models in Temporal Motifs), which includes six tailored tasks across nine temporal motif types. We then conduct extensive experiments to analyze the impacts of different prompting techniques and LLMs (including nine models: openPangu‑7B, the DeepSeek‑R1‑Distill‑Qwen series, Qwen2.5‑32B‑Instruct, GPT‑4o‑mini, DeepSeek‑R1, and o3) on model performance. Informed by our benchmark findings, we develop a tool‑augmented LLM agent that leverages precisely engineered prompts to solve these tasks with high accuracy. Nevertheless, the high accuracy of the agent incurs a substantial cost. To address this trade‑off, we propose a simple yet effective structure‑aware dispatcher that considers both the dynamic graph's structural properties and the LLM's cognitive load to intelligently dispatch queries between the standard LLM prompting and the more powerful agent. Our experiments demonstrate that the structure‑aware dispatcher effectively maintains high accuracy while reducing cost.

PaperID: 6645, https://arxiv.org/pdf/2512.22218.pdf

Authors: Hieu Minh Nguyen, Tam Le-Thanh Dang, Kiet Van Nguyen

Title: Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

Abstract:
Understanding signboard text in natural scenes is essential for real‑world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low‑resource languages. We introduce ViSignVQA, the first large‑scale Vietnamese dataset designed for signboard‑oriented VQA, which comprises 10,762 images and 25,573 question‑answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state‑of‑the‑art VQA models (e.g., BLIP‑2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR‑enhanced context, with F1‑score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi‑agent VQA framework combining perception and reasoning agents with GPT‑4, achieving 75.98% accuracy via majority voting. Our study presents the first large‑scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain‑specific resources in enhancing text‑based VQA for low‑resource languages. ViSignVQA serves as a benchmark capturing real‑world scene text characteristics and supporting the development and evaluation of OCR‑integrated VQA models in Vietnamese.

PaperID: 6646, https://arxiv.org/pdf/2512.22171.pdf

Authors: Stefan Edelkamp

Title: Solving Multi-Agent Multi-Goal Path Finding Problems in Polynomial Time

Abstract:
In this paper, we plan missions for a fleet of agents in undirected graphs, such as grids, with multiple goals. In contrast to regular multi‑agent path‑finding, the solver finds and updates the assignment of goals to the agents on its own. In the continuous case for a point agent with motions in the Euclidean plane, the problem can be solved arbitrarily close to optimal. For discrete variants that incur node and edge conflicts, we show that it can be solved in polynomial time, which is unexpected, since traditional vehicle routing on general graphs is NP‑hard. We implement a corresponding planner that finds conflict‑free optimized routes for the agents. Global assignment strategies greatly reduce the number of conflicts, with the remaining ones resolved by elaborating on the concept of ants‑on‑the‑stick, by solving local assignment problems, by interleaving agent paths, and by kicking agents that have already arrived out of their destinations

PaperID: 6647, https://arxiv.org/pdf/2512.22149.pdf

Authors: Guilin Zhang, Wulan Guo, Ziqi Tan

Title: Adaptive GPU Resource Allocation for Multi-Agent Collaborative Reasoning in Serverless Environments

Abstract:
Multi‑agent systems powered by large language models have emerged as a promising paradigm for solving complex reasoning tasks through collaborative intelligence. However, efficiently deploying these systems on serverless GPU platforms presents significant resource allocation challenges due to heterogeneous agent workloads, varying computational demands, and the need for cost‑effective scaling. This paper presents an adaptive GPU resource allocation framework that achieves 85% latency reduction compared to round‑robin scheduling while maintaining comparable throughput to static allocation, using an O(N) complexity algorithm for real‑time adaptation. Our approach dynamically allocates GPU resources based on workload characteristics, agent priorities, and minimum resource requirements, enabling efficient utilization while maintaining quality of service. The framework addresses three key challenges: (1) heterogeneous computational demands across lightweight coordinators and heavyweight specialists, (2) dynamic workload fluctuations requiring millisecond‑scale reallocation, and (3) capacity constraints in serverless environments. Through comprehensive simulations modeling realistic multi‑agent workflows with four heterogeneous agents, we demonstrate that adaptive allocation outperforms static equal and round‑robin strategies across latency, cost, and GPU utilization metrics. The framework provides a practical solution for deploying cost‑efficient multi‑agent AI systems on serverless GPU infrastructure.

PaperID: 6648, https://arxiv.org/pdf/2512.22135.pdf

Authors: Zicai Cui, Zhouyuan Jian, Weiwen Liu, Weinan Zhang

Title: SoDA: An Efficient Interaction Paradigm for the Agentic Web

Abstract:
As the internet evolves from the mobile App‑dominated Attention Economy to the Intent‑Interconnection of the Agentic Web era, existing interaction modes fail to address the escalating challenges of data lock‑in and cognitive overload. Addressing this, we defines a future‑oriented user sovereignty interaction paradigm, aiming to realize a fundamental shift from killing time to saving time. Specifically, we argue that decoupling memory from application logic eliminates the structural basis of data lock‑in, while shifting from explicit manual instruction to implicit intent alignment resolves cognitive overload by offloading execution complexity. This paradigm is implemented via the Sovereign Digital Avatar (SoDA), which employs an orthogonal decoupling design of storage, computation, and interaction. This establishes the architectural principle of data as a persistent asset, model as a transient tool, fundamentally breaking the platform monopoly on user memory. To support the operation of this new paradigm in zero‑trust environments, we design an Intent‑Permission Handshake Mechanism based on A2A protocols, utilizing dual‑factor (Sensitivity Coefficient and Strictness Parameter) adaptive routing to achieve active risk governance. Empirical evaluation with a high‑fidelity simulation environment indicates that this paradigm reduces token consumption by approximately 27‑35% during cross‑platform service migration and complex task execution. Furthermore, in the orchestration of multi‑modal complex tasks, it reduces user cognitive load by 72% compared to standard Retrieval‑Augmented Generation (RAG) architectures, by 88% relative to manual workflows, while significantly boosting the Information Signal‑to‑Noise Ratio (SNR). These results demonstrate that the SoDA is the essential interaction infrastructure for building an efficient, low‑friction, and decentralized Agentic Web.

PaperID: 6649, https://arxiv.org/pdf/2512.21964.pdf

Authors: Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng

Title: Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

Abstract:
Medical Multi‑modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real‑world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine‑tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training‑free Inherent‑enhanced Multi‑modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive‑and‑calibrate principle for cross‑modal robustness enhancement. For the visual modality, we propose a Perturbation‑aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype‑guided feature calibration. For text denoising, we design a Self‑instantiated Multi‑agent System (SMS) that exploits the MLLMs' self‑assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state‑of‑the‑art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.

PaperID: 6650, https://arxiv.org/pdf/2512.21878.pdf

Authors: Marc S. Montalvo, Hamed Yaghoobian

Title: MASFIN: A Multi-Agent System for Decomposed Financial Reasoning and Forecasting

Abstract:
Recent advances in large language models (LLMs) are transforming data‑intensive domains, with finance representing a high‑stakes environment where transparent and reproducible analysis of heterogeneous signals is essential. Traditional quantitative methods remain vulnerable to survivorship bias, while many AI‑driven approaches struggle with signal integration, reproducibility, and computational efficiency. We introduce MASFIN, a modular multi‑agent framework that integrates LLMs with structured financial metrics and unstructured news, while embedding explicit bias‑mitigation protocols. The system leverages GPT‑4.1‑nano for reproducability and cost‑efficient inference and generates weekly portfolios of 15‑30 equities with allocation weights optimized for short‑term performance. In an eight‑week evaluation, MASFIN delivered a 7.33% cumulative return, outperforming the S&P 500, NASDAQ‑100, and Dow Jones benchmarks in six of eight weeks, albeit with higher volatility. These findings demonstrate the promise of bias‑aware, generative AI frameworks for financial forecasting and highlight opportunities for modular multi‑agent design to advance practical, transparent, and reproducible approaches in quantitative finance.

PaperID: 6651, https://arxiv.org/pdf/2512.21818.pdf

Authors: Brian Bowers, Smita Khapre, Jugal Kalita

Title: Analyzing Code Injection Attacks on LLM-based Multi-Agent Systems in Software Development

Abstract:
Agentic AI and Multi‑Agent Systems are poised to dominate industry and society imminently. Powered by goal‑driven autonomy, they represent a powerful form of generative AI, marking a transition from reactive content generation into proactive multitasking capabilities. As an exemplar, we propose an architecture of a multi‑agent system for the implementation phase of the software engineering process. We also present a comprehensive threat model for the proposed system. We demonstrate that while such systems can generate code quite accurately, they are vulnerable to attacks, including code injection. Due to their autonomous design and lack of humans in the loop, these systems cannot identify and respond to attacks by themselves. This paper analyzes the vulnerability of multi‑agent systems and concludes that the coder‑reviewer‑tester architecture is more resilient than both the coder and coder‑tester architectures, but is less efficient at writing code. We find that by adding a security analysis agent, we mitigate the loss in efficiency while achieving even better resiliency. We conclude by demonstrating that the security analysis agent is vulnerable to advanced code injection attacks, showing that embedding poisonous few‑shot examples in the injected code can increase the attack success rate from 0% to 71.95%.

PaperID: 6652, https://arxiv.org/pdf/2512.21817.pdf

Authors: Hong Su

Title: Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments

Abstract:
Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task‑execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device‑specific logic that cannot adapt to changing environmental conditions.In this paper, we propose Method Decoration (DeMe), a general framework that modifies the method‑generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path‑through pre‑decoration, post‑decoration, intermediate‑step modification, and step insertion‑thereby producing context‑aware, safety‑aligned, and environment‑adaptive methods. Experimental results show that method decoration allows IoT devices to derive ore appropriate methods when confronting unknown or faulty operating conditions.

PaperID: 6653, https://arxiv.org/pdf/2512.21794.pdf

Authors: Qiushi Han, David Simchi-Levi, Renfei Tan, Zishuo Zhao

Title: Multi-agent Adaptive Mechanism Design

Abstract:
We study a sequential mechanism design problem in which a principal seeks to elicit truthful reports from multiple rational agents while starting with no prior knowledge of agents' beliefs. We introduce Distributionally Robust Adaptive Mechanism (DRAM), a general framework combining insights from both mechanism design and online learning to jointly address truthfulness and cost‑optimality. Throughout the sequential game, the mechanism estimates agents' beliefs and iteratively updates a distributionally robust linear program with shrinking ambiguity sets to reduce payments while preserving truthfulness. Our mechanism guarantees truthful reporting with high probability while achieving \tildeO(\sqrtT) cumulative regret, and we establish a matching lower bound showing that no feasible adaptive mechanism can asymptotically do better. The framework generalizes to plug‑in estimators, supporting structured priors and delayed feedback. To our knowledge, this is the first adaptive mechanism under general settings that maintains truthfulness and achieves optimal regret when incentive constraints are unknown and must be learned.

PaperID: 6654, https://arxiv.org/pdf/2512.21782.pdf

Authors: Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G. Rittig, Kunyang Sun, Yikun Zhang, Aarti Krishnan, Yu Zhang, Daniel Rosen, Rosali Pirone, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, Nir Hacohen, Teresa Head-Gordon, Carla P. Gomes, Huan Sun, Chenru Duan, Philippe Schwaller, Wengong Jin

Title: Accelerating Scientific Discovery with Autonomous Goal-evolving Agents

Abstract:
There has been unprecedented interest in developing agents that expand the boundary of scientific discovery, primarily by optimizing quantitative objective functions specified by scientists. However, for grand challenges in science, these objectives may only be imperfect proxies. We argue that automating objective function design is a central, yet unmet need for scientific discovery agents. In this work, we introduce the Scientific Autonomous Goal‑evolving Agent (SAGA) to address this challenge. SAGA employs a bi‑level architecture in which an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives. This bi‑level design enables systematic exploration of the space of objectives and their trade‑offs, rather than treating them as fixed inputs. We demonstrate the framework through a wide range of design applications, including antibiotics, nanobodies, functional DNA sequences, inorganic materials, and chemical processes. Notably, our experimental validation identifies a structurally novel hit with promising potency and safety profiles for E. coli in the antibiotic design task, and three de novo PD‑L1 binders in the nanobody design task. These results suggest that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.

PaperID: 6655, https://arxiv.org/pdf/2512.21723.pdf

Authors: Alexandr V. Korchemnyi, Anatoly O. Onishchenko, Eva A. Bakaeva, Alexey K. Kovalev, Aleksandr I. Panov

Title: HELP: Hierarchical Embodied Language Planner for Household Tasks

Abstract:
Embodied agents tasked with complex scenarios, whether in real or simulated environments, rely heavily on robust planning capabilities. When instructions are formulated in natural language, large language models (LLMs) equipped with extensive linguistic knowledge can play this role. However, to effectively exploit the ability of such models to handle linguistic ambiguity, to retrieve information from the environment, and to be based on the available skills of an agent, an appropriate architecture must be designed. We propose a Hierarchical Embodied Language Planner, called HELP, consisting of a set of LLM‑based agents, each dedicated to solving a different subtask. We evaluate the proposed approach on a household task and perform real‑world experiments with an embodied agent. We also focus on the use of open source LLMs with a relatively small number of parameters, to enable autonomous deployment.

PaperID: 6656, https://arxiv.org/pdf/2512.21699.pdf

Authors: Eranga Bandara, Tharaka Hewa, Ross Gore, Sachin Shetty, Ravi Mukkamala, Peter Foytik, Abdul Rahman, Safdar H. Bouk, Xueping Liang, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Title: Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning

Abstract:
Agentic AI represents a major shift in how autonomous systems reason, plan, and execute multi‑step tasks through the coordination of Large Language Models (LLMs), Vision Language Models (VLMs), tools, and external services. While these systems enable powerful new capabilities, increasing autonomy introduces critical challenges related to explainability, accountability, robustness, and governance, especially when agent outputs influence downstream actions or decisions. Existing agentic AI implementations often emphasize functionality and scalability, yet provide limited mechanisms for understanding decision rationale or enforcing responsibility across agent interactions. This paper presents a Responsible(RAI) and Explainable(XAI) AI Agent Architecture for production‑grade agentic workflows based on multi‑model consensus and reasoning‑layer governance. In the proposed design, a consortium of heterogeneous LLM and VLM agents independently generates candidate outputs from a shared input context, explicitly exposing uncertainty, disagreement, and alternative interpretations. A dedicated reasoning agent then performs structured consolidation across these outputs, enforcing safety and policy constraints, mitigating hallucinations and bias, and producing auditable, evidence‑backed decisions. Explainability is achieved through explicit cross‑model comparison and preserved intermediate outputs, while responsibility is enforced through centralized reasoning‑layer control and agent‑level constraints. We evaluate the architecture across multiple real‑world agentic AI workflows, demonstrating that consensus‑driven reasoning improves robustness, transparency, and operational trust across diverse application domains. This work provides practical guidance for designing agentic AI systems that are autonomous and scalable, yet responsible and explainable by construction.

PaperID: 6657, https://arxiv.org/pdf/2512.21623.pdf

Authors: Takahide Suzuki, Kazuki Nakanishi, Takashi Fujiwara, Hideyuki Shimizu

Title: Democratizing Drug Discovery with an Orchestrated, Knowledge-Driven Multi-Agent Team for User-Guided Therapeutic Design

Abstract:
Therapeutic discovery remains a formidable challenge, impeded by the fragmentation of specialized domains and the execution gap between computational design and physiological validation. Although generative AI offers promise, current models often function as passive assistants rather than as autonomous executors. Here, we introduce OrchestRA, a human‑in‑the‑loop multi‑agent platform that unifies biology, chemistry, and pharmacology into an autonomous discovery engine. Unlike static code generators, our agents actively execute simulations and reason the results to drive iterative optimization. Governed by an Orchestrator, a Biologist Agent leverages deep reasoning over a massive knowledge graph (>10 million associations) to pinpoint high‑confidence targets; a Chemist Agent autonomously detects structural pockets for de novo design or drug repositioning; and a Pharmacologist Agent evaluates candidates via rigorous physiologically based pharmacokinetic (PBPK) simulations. This architecture establishes a dynamic feedback loop where pharmacokinetic and toxicity profiles directly trigger structural reoptimization. By seamlessly integrating autonomous execution with human guidance, OrchestRA democratizes therapeutic design, transforming drug discovery from a stochastic search to a programmable evidence‑based engineering discipline.

PaperID: 6658, https://arxiv.org/pdf/2512.21578.pdf

Authors: Sudhanshu Garg, Andrew Wang, Chaitanya Kulkarni, Ali Sahami, Farhad Farahani, Sean Yun-Shiuan Chuang, Jian Wan, Srinivasan Manoharan, Uma Kona, Nitin Sharma, Linsey Pang, Prakhar Mehrotra, Jessica Clark, Mark Moyou

Title: NEMO-4-PAYPAL: Leveraging NVIDIA's Nemo Framework for empowering PayPal's Commerce Agent

Abstract:
We present the development and optimization of PayPal's Commerce Agent, powered by NEMO‑4‑PAYPAL, a multi‑agent system designed to revolutionize agentic commerce on the PayPal platform. Through our strategic partnership with NVIDIA, we leveraged the NeMo Framework for LLM model fine‑tuning to enhance agent performance. Specifically, we optimized the Search and Discovery agent by replacing our base model with a fine‑tuned Nemotron small language model (SLM). We conducted comprehensive experiments using the llama3.1‑nemotron‑nano‑8B‑v1 architecture, training LoRA‑based models through systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. Our contributions include: (1) the first application of NVIDIA's NeMo Framework to commerce‑specific agent optimization, (2) LLM powered fine‑tuning strategy for retrieval‑focused commerce tasks, (3) demonstration of significant improvements in latency and cost while maintaining agent quality, and (4) a scalable framework for multi‑agent system optimization in production e‑commerce environments. Our results demonstrate that the fine‑tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component, which represents over 50% of total agent response time, while maintaining or enhancing overall system performance.

PaperID: 6659, https://arxiv.org/pdf/2512.21482.pdf

Authors: Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Huazhe Tan, Weibin Yao, Jianshu Li

Title: LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis

Abstract:
Sophisticated text‑centric forgeries, fueled by rapid AIGC advancements, pose a significant threat to societal security and information authenticity. Current methods for text‑centric forgery analysis are often limited to coarse‑grained visual analysis and lack the capacity for sophisticated reasoning. Moreover, they typically treat detection, grounding, and explanation as discrete sub‑tasks, overlooking their intrinsic relationships for holistic performance enhancement. To address these challenges, we introduce LogicLens, a unified framework for Visual‑Textual Co‑reasoning that reformulates these objectives into a joint task. The deep reasoning of LogicLens is powered by our novel Cross‑Cues‑aware Chain of Thought (CCT) mechanism, which iteratively cross‑validates visual cues against textual logic. To ensure robust alignment across all tasks, we further propose a weighted multi‑task reward function for GRPO‑based optimization. Complementing this framework, we first designed the PR^2 (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi‑agent system that generates high‑quality, cognitively‑aligned annotations. Then, we constructed RealText, a diverse dataset comprising 5,397 images with fine‑grained annotations, including textual explanations, pixel‑level segmentation, and authenticity labels for model training. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks. In a zero‑shot evaluation on T‑IC13, it surpasses the specialized framework by 41.4% and GPT‑4o by 23.4% in macro‑average F1 score. Moreover, on the challenging dense‑text T‑SROIE dataset, it establishes a significant lead over other MLLM‑based methods in mF1, CSS, and the macro‑average F1. Our dataset, model, and code will be made publicly available.

PaperID: 6660, https://arxiv.org/pdf/2512.21481.pdf

Authors: Sunith Vallabhaneni, Thomas Berkane, Maimuna Majumder

Title: The AI Committee: A Multi-Agent Framework for Automated Validation and Remediation of Web-Sourced Data

Abstract:
Many research areas rely on data from the web to gain insights and test their methods. However, collecting comprehensive research datasets often demands manually reviewing many web pages to identify and record relevant data points, which is labor‑intensive and susceptible to error. While the emergence of large language models (LLM)‑powered web agents has begun to automate parts of this process, they often struggle to ensure the validity of the data they collect. Indeed, these agents exhibit several recurring failure modes ‑ including hallucinating or omitting values, misinterpreting page semantics, and failing to detect invalid information ‑ which are subtle and difficult to detect and correct manually. To address this, we introduce the AI Committee, a novel model‑agnostic multi‑agent system that automates the process of validating and remediating web‑sourced datasets. Each agent is specialized in a distinct task in the data quality assurance pipeline, from source scrutiny and fact‑checking to data remediation and integrity validation. The AI Committee leverages various LLM capabilities ‑ including in‑context learning for dataset adaptation, chain‑of‑thought reasoning for complex semantic validation, and a self‑correction loop for data remediation ‑ all without task‑specific training. We demonstrate the effectiveness of our system by applying it to three real‑world datasets, showing that it generalizes across LLMs and significantly outperforms baseline approaches, achieving data completeness up to 78.7% and precision up to 100%. We additionally conduct an ablation study demonstrating the contribution of each agent to the Committee's performance. This work is released as an open‑source tool for the research community.

PaperID: 6661, https://arxiv.org/pdf/2512.21440.pdf

Authors: Hridya Dhulipala, Xiaokai Rong, Aashish Yadavally, Tien N. Nguyen

Title: Fuzzwise: Intelligent Initial Corpus Generation for Fuzzing

Abstract:
In mutation‑based greybox fuzzing, generating high‑quality input seeds for the initial corpus is essential for effective fuzzing. Rather than conducting separate phases for generating a large corpus and subsequently minimizing it, we propose FuzzWise which integrates them into one process to generate the optimal initial corpus of seeds (ICS). FuzzWise leverages a multi‑agent framework based on Large Language Models (LLMs). The first LLM agent generates test cases for the target program. The second LLM agent, which functions as a predictive code coverage module, assesses whether each generated test case will enhance the overall coverage of the current corpus. The streamlined process allows each newly generated test seed to be immediately evaluated for its contribution to the overall coverage. FuzzWise employs a predictive approach using an LLM and eliminates the need for actual execution, saving computational resources and time, particularly in scenarios where the execution is not desirable or even impossible. Our empirical evaluation demonstrates that FuzzWise generates significantly fewer test cases than baseline methods. Despite the lower number of test cases, FuzzWise achieves high code coverage and triggers more runtime errors compared to the baselines. Moreover, it is more time‑efficient and coverage‑efficient in producing an initial corpus catching more errors.

PaperID: 6662, https://arxiv.org/pdf/2512.21431.pdf

Authors: Hridya Dhulipala, Xiaokai Rong, Tien N. Nguyen

Title: Cerberus: Multi-Agent Reasoning and Coverage-Guided Exploration for Static Detection of Runtime Errors

Abstract:
In several software development scenarios, it is desirable to detect runtime errors and exceptions in code snippets without actual execution. A typical example is to detect runtime exceptions in online code snippets before integrating them into a codebase. In this paper, we propose Cerberus, a novel predictive, execution‑free coverage‑guided testing framework. Cerberus uses LLMs to generate the inputs that trigger runtime errors and to perform code coverage prediction and error detection without code execution. With a two‑phase feedback loop, Cerberus first aims to both increasing code coverage and detecting runtime errors, then shifts to focus only detecting runtime errors when the coverage reaches 100% or its maximum, enabling it to perform better than prompting the LLMs for both purposes. Our empirical evaluation demonstrates that Cerberus performs better than conventional and learning‑based testing frameworks for (in)complete code snippets by generating high‑coverage test cases more efficiently, leading to the discovery of more runtime errors.

PaperID: 6663, https://arxiv.org/pdf/2512.21426.pdf

Authors: Mohammed Sayagh

Title: What Makes a GitHub Issue Ready for Copilot?

Abstract:
AI‑agents help developers in different coding tasks, such as developing new features, fixing bugs, and reviewing code. Developers can write a Github issue and assign it to an AI‑agent like Copilot for implementation. Based on the issue and its related discussion, the AI‑agent performs a plan for the implementation, and executes it. However, the performance of AI‑agents and LLMs heavily depends on the input they receive. For instance, a GitHub issue that is unclear or not well scoped might not lead to a successful implementation that will eventually be merged. GitHub Copilot provides a set of best practice recommendations that are limited and high‑level. In this paper, we build a set of 32 detailed criteria that we leverage to measure the quality of GitHub issues to make them suitable for AI‑agents. We compare the GitHub issues that lead to a merged pull request versus closed pull request. Then, we build an interpretable machine learning model to predict the likelihood of a GitHub issue resulting in a merged pull request. We observe that pull requests that end up being merged are those originating from issues that are shorter, well scoped, with clear guidance and hints about the relevant artifacts for an issue, and with guidance on how to perform the implementation. Issues with external references including configuration, context setup, dependencies or external APIs are associated with lower merge rates. We built an interpretable machine learning model to help users identify how to improve a GitHub issue to increase the chances of the issue resulting in a merged pull request by Copilot. Our model has a median AUC of 72%. Our results shed light on quality metrics relevant for writing GitHub issues and motivate future studies further investigate the writing of GitHub issues as a first‑class software engineering activity in the era of AI‑teammates.

PaperID: 6664, https://arxiv.org/pdf/2512.21412.pdf

Authors: Alimu Alibotaiken, Suyang Wang, Oluwaseun T. Ajayi, Yu Cheng

Title: A Survey of Freshness-Aware Wireless Networking with Reinforcement Learning

Abstract:
The age of information (AoI) has become a central measure of data freshness in modern wireless systems, yet existing surveys either focus on classical AoI formulations or provide broad discussions of reinforcement learning (RL) in wireless networks without addressing freshness as a unified learning problem. Motivated by this gap, this survey examines RL specifically through the lens of AoI and generalized freshness optimization. We organize AoI and its variants into native, function‑based, and application‑oriented families, providing a clearer view of how freshness should be modeled in B5G and 6G systems. Building on this foundation, we introduce a policy‑centric taxonomy that reflects the decisions most relevant to freshness, consisting of update‑control RL, medium‑access RL, risk‑sensitive RL, and multi‑agent RL. This structure provides a coherent framework for understanding how learning can support sampling, scheduling, trajectory planning, medium access, and distributed coordination. We further synthesize recent progress in RL‑driven freshness control and highlight open challenges related to delayed decision processes, stochastic variability, and cross‑layer design. The goal is to establish a unified foundation for learning‑based freshness optimization in next‑generation wireless networks.

PaperID: 6665, https://arxiv.org/pdf/2512.21373.pdf

Authors: Titouan Duston, Shuo Xin, Yang Sun, Daoguang Zan, Aoyan Li, Shulin Xin, Kai Shen, Yixiao Chen, Qiming Sun, Ge Zhang, Jiashuo Liu, Huan Zhou, Jingkai Liu, Zhichen Pu, Yuanheng Wang, Bo-Xuan Ge, Xin Tong, Fei Ye, Zhi-Chao Zhao, Wen-Biao Han, Zhoujian Cao, Yueran Zhao, Weiluo Ren, Qingshen Long, Yuxiao Liu, Anni Huang, Yidi Du, Yuanyuan Rong, Jiahao Peng

Title: AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

Abstract:
We introduce AInsteinBench, a large‑scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks which focus on conceptual knowledge, or software engineering benchmarks that emphasize generic feature implementation and issue resolving, AInsteinBench evaluates models in end‑to‑end scientific development settings grounded in production‑grade scientific repositories. The benchmark consists of tasks derived from maintainer‑authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All benchmark tasks are carefully curated through multi‑stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well‑calibrated difficulty. By leveraging evaluation in executable environments, scientifically meaningful failure modes, and test‑driven verification, AInsteinBench measures a model's ability to move beyond surface‑level code generation toward the core competencies required for computational scientific research.

PaperID: 6666, https://arxiv.org/pdf/2512.21360.pdf

Authors: Shuide Wen, Yu Sun, Beier Ku, Zhi Gao, Lijun Ma, Yang Yang, Can Jiao

Title: From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration

Abstract:
Background: The House‑Tree‑Person (HTP) drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. However, it has long faced challenges such as heterogeneous scoring standards, reliance on examiners subjective experience, and a lack of a unified quantitative coding system. Results: Quantitative experiments showed that the mean semantic similarity between Multimodal Large Language Model (MLLM) interpretations and human expert interpretations was approximately 0.75 (standard deviation about 0.05). In structurally oriented expert data sets, this similarity rose to 0.85, indicating expert‑level baseline comprehension. Qualitative analyses demonstrated that the multi‑agent system, by integrating social‑psychological perspectives and destigmatizing narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence. Conclusions: The findings confirm the potential of multimodal large models as standardized tools for projective assessment. The proposed multi‑agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental‑health services. Keywords: House‑Tree‑Person test; multimodal large language model; multi‑agent collaboration; cosine similarity; computational psychology; artificial intelligence

PaperID: 6667, https://arxiv.org/pdf/2512.21354.pdf

Authors: Bin Wang, Jiazheng Quan, Xingrui Yu, Hansen Hu, Yuhao, Ivor Tsang

Title: Reflection-Driven Control for Trustworthy Code Agents

Abstract:
Contemporary large language model (LLM) agents are remarkably capable, but they still lack reliable safety controls and can produce unconstrained, unpredictable, and even actively harmful outputs. To address this, we introduce Reflection‑Driven Control, a standardized and pluggable control module that can be seamlessly integrated into general agent architectures. Reflection‑Driven Control elevates "self‑reflection" from a post hoc patch into an explicit step in the agent's own reasoning process: during generation, the agent continuously runs an internal reflection loop that monitors and evaluates its own decision path. When potential risks are detected, the system retrieves relevant repair examples and secure coding guidelines from an evolving reflective memory, injecting these evidence‑based constraints directly into subsequent reasoning steps. We instantiate Reflection‑Driven Control in the setting of secure code generation and systematically evaluate it across eight classes of security‑critical programming tasks. Empirical results show that Reflection‑Driven Control substantially improves the security and policy compliance of generated code while largely preserving functional correctness, with minimal runtime and token overhead. Taken together, these findings indicate that Reflection‑Driven Control is a practical path toward trustworthy AI coding agents: it enables designs that are simultaneously autonomous, safer by construction, and auditable.

PaperID: 6668, https://arxiv.org/pdf/2512.21352.pdf

Authors: Sumanth Bharadwaj Hachalli Karanam, Dhiwahar Adhithya Kennady

Title: Multi-Agent LLM Committees for Autonomous Software Beta Testing

Abstract:
Manual software beta testing is costly and time‑consuming, while single‑agent large language model (LLM) approaches suffer from hallucinations and inconsistent behavior. We propose a multi‑agent committee framework in which diverse vision‑enabled LLMs collaborate through a three‑round voting protocol to reach consensus on testing actions. The framework combines model diversity, persona‑driven behavioral variation, and visual user interface understanding to systematically explore web applications. Across 84 experimental runs with 9 testing personas and 4 scenarios, multi‑agent committees achieve an 89.5 percent overall task success rate. Configurations with 2 to 4 agents reach 91.7 to 100 percent success, compared to 78.0 percent for single‑agent baselines, yielding improvements of 13.7 to 22.0 percentage points. At the action level, the system attains a 93.1 percent success rate with a median per‑action latency of 0.71 seconds, enabling real‑time and continuous integration testing. Vision‑enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success and form filling achieving 99.2 percent success. We evaluate the framework on WebShop and OWASP benchmarks, achieving 74.7 percent success on WebShop compared to a 50.1 percent published GPT‑3 baseline, and 82.0 percent success on OWASP Juice Shop security testing with coverage of 8 of the 10 OWASP Top 10 vulnerability categories. Across 20 injected regressions, the committee achieves an F1 score of 0.91 for bug detection, compared to 0.78 for single‑agent baselines. The open‑source implementation enables reproducible research and practical deployment of LLM‑based software testing in CI/CD pipelines.

PaperID: 6669, https://arxiv.org/pdf/2512.21351.pdf

Authors: Santhosh Kumar Ravindran

Title: CosmoCore-Evo: Evolutionary Dream-Replay Reinforcement Learning for Adaptive Code Generation

Abstract:
Building on the affective dream‑replay reinforcement learning framework of CosmoCore, we introduce CosmoCore‑Evo, an extension that incorporates evolutionary algorithms to enhance adaptability and novelty in code generation tasks. Inspired by anthropological aspects of human evolution, such as natural selection and adaptation in early hominids, CosmoCore‑Evo treats RL trajectories as ``genomes'' that undergo mutation and selection during the nocturnal replay phase. This mechanism allows agents to break free from trained patterns, fostering emergent behaviors and improved performance in distribution‑shifted environments, such as changing APIs or novel libraries. We augment the Dream Queue with evolutionary operations, including mutation of high‑fitness trajectories and enterprise‑tuned fitness functions that incorporate efficiency, compliance, and scalability metrics. Evaluated on extended benchmarks including HumanEval variants with shifts, BigCodeBench, and a custom PySpark pipeline simulation, CosmoCore‑Evo achieves up to 35% higher novelty in solutions and 25% faster adaptation compared to the original CosmoCore and baselines like PPO and REAMER. Ablations confirm the role of evolutionary components in bridging the sentient gap for LLM agents. Code for replication, including a toy simulation, is provided.

PaperID: 6670, https://arxiv.org/pdf/2512.21257.pdf

Authors: Jiakai Tang, Chuan Wang, Gaoming Yang, Han Wu, Jiahao Yu, Jian Wu, Jianwu Hu, Junjun Zheng, Longbin Li, Shuwen Xiao, Xiangheng Kong, Yeqiu Yang, Yuning Jiang, Ahjol Nurlanbek, Binbin Cao, Bo Zheng, Fangmei Zhu, Gaoming Zhou, Huimin Yi, Huiping Chu, Jin Huang, Jinzhe Shan, Kenan Cui, Longbin Li, Silu Zhou, Wen Chen, Xia Ming, Xiang Gao, Xin Yao, Xingyu Wen, Yan Zhang, Yiwen Hu, Yulin Wang, Ziheng Bao, Zongyuan Wu

Title: ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

Abstract:
Industrial recommender systems face two fundamental limitations under the log‑driven paradigm: (1) knowledge poverty in ID‑based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond‑log user interests that constrains model performance within platform boundaries. These limitations stem from an over‑reliance on shallow interaction statistics and close‑looped feedback while neglecting the rich world knowledge about product semantics and cross‑domain behavioral patterns that Large Language Models have learned from vast corpora. To address these challenges, we introduce ReaSeq, a reasoning‑enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain‑of‑Thought reasoning via multi‑agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond‑log behaviors. Deployed on Taobao's ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: >6.0% in IPV and CTR, >2.9% in Orders, and >2.5% in GMV, validating the effectiveness of world‑knowledge‑enhanced reasoning over purely log‑driven approaches.

PaperID: 6671, https://arxiv.org/pdf/2512.21243.pdf

Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov

Title: LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Abstract:
Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task‑relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph ‑ a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agents egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real‑world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR‑1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io .

PaperID: 6672, https://arxiv.org/pdf/2512.21220.pdf

Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

Title: RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Abstract:
Embodied agents powered by vision‑language models (VLMs) are increasingly capable of executing complex real‑world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt‑level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context‑rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate‑based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long‑Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short‑term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context‑aware safety predicates from the long‑term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (‑36.8% risk occurrence) compared with leading baselines, while maintaining near‑original task performance. Real‑world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

PaperID: 6673, https://arxiv.org/pdf/2512.21133.pdf

Authors: Xiaoyu Mo, Jintian Ge, Zifan Wang, Chen Lv, Karl Henrik Johansson

Title: SparScene: Efficient Traffic Scene Representation via Sparse Graph Learning for Large-Scale Trajectory Generation

Abstract:
Multi‑agent trajectory generation is a core problem for autonomous driving and intelligent transportation systems. However, efficiently modeling the dynamic interactions between numerous road users and infrastructures in complex scenes remains an open problem. Existing methods typically employ distance‑based or fully connected dense graph structures to capture interaction information, which not only introduces a large number of redundant edges but also requires complex and heavily parameterized networks for encoding, thereby resulting in low training and inference efficiency, limiting scalability to large and complex traffic scenes. To overcome the limitations of existing methods, we propose SparScene, a sparse graph learning framework designed for efficient and scalable traffic scene representation. Instead of relying on distance thresholds, SparScene leverages the lane graph topology to construct structure‑aware sparse connections between agents and lanes, enabling efficient yet informative scene graph representation. SparScene adopts a lightweight graph encoder that efficiently aggregates agent‑map and agent‑agent interactions, yielding compact scene representations with substantially improved efficiency and scalability. On the motion prediction benchmark of the Waymo Open Motion Dataset (WOMD), SparScene achieves competitive performance with remarkable efficiency. It generates trajectories for more than 200 agents in a scene within 5 ms and scales to more than 5,000 agents and 17,000 lanes with merely 54 ms of inference time with a GPU memory of 2.9 GB, highlighting its superior scalability for large‑scale traffic scenes.

PaperID: 6674, https://arxiv.org/pdf/2512.21024.pdf

Authors: Yue Lin, Shuhui Zhu, Wenhao Li, Ang Li, Dan Qiao, Pascal Poupart, Hongyuan Zha, Baoxiang Wang

Title: Policy-Conditioned Policies for Multi-Agent Task Solving

Abstract:
In multi‑agent tasks, the central challenge lies in the dynamic adaptation of strategies. However, directly conditioning on opponents' strategies is intractable in the prevalent deep reinforcement learning paradigm due to a fundamental ``representational bottleneck'': neural policies are opaque, high‑dimensional parameter vectors that are incomprehensible to other agents. In this work, we propose a paradigm shift that bridges this gap by representing policies as human‑interpretable source code and utilizing Large Language Models (LLMs) as approximate interpreters. This programmatic representation allows us to operationalize the game‑theoretic concept of Program Equilibrium. We reformulate the learning problem by utilizing LLMs to perform optimization directly in the space of programmatic policies. The LLM functions as a point‑wise best‑response operator that iteratively synthesizes and refines the ego agent's policy code to respond to the opponent's strategy. We formalize this process as Programmatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients, using structured feedback derived from game utility and runtime unit tests. We demonstrate that this approach effectively solves several standard coordination matrix games and a cooperative Level‑Based Foraging environment.

PaperID: 6675, https://arxiv.org/pdf/2512.20991.pdf

Authors: Toqeer Ali Syed, Abdulaziz Alshahrani, Ali Ullah, Ali Akarma, Sohail Khan, Muhammad Nauman, Salman Jan

Title: FinAgent: An Agentic AI Framework Integrating Personal Finance and Nutrition Planning

Abstract:
The issue of limited household budgets and nutritional demands continues to be a challenge especially in the middle‑income environment where food prices fluctuate. This paper introduces a price aware agentic AI system, which combines personal finance management with diet optimization. With household income and fixed expenditures, medical and well‑being status, as well as real‑time food costs, the system creates nutritionally sufficient meals plans at comparatively reasonable prices that automatically adjust to market changes. The framework is implemented in a modular multi‑agent architecture, which has specific agents (budgeting, nutrition, price monitoring, and health personalization). These agents share the knowledge base and use the substitution graph to ensure that the nutritional quality is maintained at a minimum cost. Simulations with a representative Saudi household case study show a steady 12‑18% reduction in costs relative to a static weekly menu, nutrient adequacy of over 95% and high performance with price changes of 20‑30%. The findings indicate that the framework can locally combine affordability with nutritional adequacy and provide a viable avenue of capacity‑building towards sustainable and fair diet planning in line with Sustainable Development Goals on Zero Hunger and Good Health.

PaperID: 6676, https://arxiv.org/pdf/2512.20986.pdf

Authors: Yihan Wang, Huanqi Yang, Shantanu Pal, Weitao Xu

Title: AegisAgent: An Autonomous Defense Agent Against Prompt Injection Attacks in LLM-HARs

Abstract:
The integration of Large Language Models (LLMs) into wearable sensing is creating a new class of mobile applications capable of nuanced human activity understanding. However, the reliability of these systems is critically undermined by their vulnerability to prompt injection attacks, where attackers deliberately input deceptive instructions into LLMs. Traditional defenses, based on static filters and rigid rules, are insufficient to address the semantic complexity of these new attacks. We argue that a paradigm shift is needed ‑‑ from passive filtering to active protection and autonomous reasoning. We introduce AegisAgent, an autonomous agent system designed to ensure the security of LLM‑driven HAR systems. Instead of merely blocking threats, AegisAgent functions as a cognitive guardian. It autonomously perceives potential semantic inconsistencies, reasons about the user's true intent by consulting a dynamic memory of past interactions, and acts by generating and executing a multi‑step verification and repair plan. We implement AegisAgent as a lightweight, full‑stack prototype and conduct a systematic evaluation on 15 common attacks with five state‑of‑the‑art LLM‑based HAR systems on three public datasets. Results show it reduces attack success rate by 30% on average while incurring only 78.6 ms of latency overhead on a GPU workstation. Our work makes the first step towards building secure and trustworthy LLM‑driven HAR systems.

PaperID: 6677, https://arxiv.org/pdf/2512.20985.pdf

Authors: Salman Jan, Hassan Ali Razzaqi, Ali Akarma, Mohammad Riyaz Belgaum

Title: A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

Abstract:
The application of agentic AI systems in autonomous decision‑making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real‑time reasoning, they also raise concerns of trust and oversight, and integrity of the information and activities upon which they are founded. The paper suggests a single architecture model comprising of LangChain‑based multi‑agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception conceptualization‑action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric‑based system, action executors MCP‑integrated, and LangChain agent are introduced and experiments of smart inventory management, traffic‑signal control, and healthcare monitoring are done. The results suggest that blockchain‑security verification is efficient in preventing unauthorized practices, offers traceability throughout the whole decision‑making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system of implementing high‑impact agentic AI applications that are autonomous yet responsible.

PaperID: 6678, https://arxiv.org/pdf/2512.20975.pdf

Authors: Yujin Roh, Inho Jake Park, Chigon Hwang

Title: SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking

Abstract:
CCTV‑based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real‑time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map‑guided LLM agent capable of tracking vehicles even in blind spots of multi‑CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real‑time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.

PaperID: 6679, https://arxiv.org/pdf/2512.20973.pdf

Authors: Yihan Xia, Taotao Wang, Wenxin Xu, Shengli Zhang

Title: DAO-Agent: Zero Knowledge-Verified Incentives for Decentralized Multi-Agent Coordination

Abstract:
Autonomous Large Language Model (LLM)‑based multi‑agent systems have emerged as a promising paradigm for facilitating cross‑application and cross‑organization collaborations. These autonomous agents often operate in trustless environments, where centralized coordination faces significant challenges, such as the inability to ensure transparent contribution measurement and equitable incentive distribution. While blockchain is frequently proposed as a decentralized coordination platform, it inherently introduces high on‑chain computation costs and risks exposing sensitive execution information of the agents. Consequently, the core challenge lies in enabling auditable task execution and fair incentive distribution for autonomous LLM agents in trustless environments, while simultaneously preserving their strategic privacy and minimizing on‑chain costs. To address this challenge, we propose DAO‑Agent, a novel framework that integrates three key technical innovations: (1) an on‑chain decentralized autonomous organization (DAO) governance mechanism for transparent coordination and immutable logging; (2) a ZKP mechanism approach that enables Shapley‑based contribution measurement off‑chain, and (3) a hybrid on‑chain/off‑chain architecture that verifies ZKP‑validated contribution measurements on‑chain with minimal computational overhead. We implement DAO‑Agent and conduct end‑to‑end experiments using a crypto trading task as a case study. Experimental results demonstrate that DAO‑Agent achieves up to 99.9% reduction in verification gas costs compared to naive on‑chain alternatives, with constant‑time verification complexity that remains stable as coalition size increases, thereby establishing a scalable foundation for agent coordination in decentralized environments.

PaperID: 6680, https://arxiv.org/pdf/2512.20957.pdf

Authors: Zhaoxi Zhang, Yitong Duan, Yanzhi Zhang, Yiming Xu, Zhixiang Wang, Kun Liang, Weikang Li, Jiahui Liang, Deguo Xia, Jizhou Huang, Jiyan He, Yunfang Wu

Title: One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

Abstract:
Locating files and functions requiring modification in large software repositories is challenging due to their scale and structural complexity. Existing LLM‑based methods typically treat this as a repository‑level retrieval task and rely on multiple auxiliary tools, which often overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution‑aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end‑to‑end via Reinforcement Learning (RL) directly from a base pretrained model, without relying on closed‑source distillation. Experiments demonstrate that RL‑trained RepoNavigator achieves state‑of‑the‑art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and the 32B model exceeding closed‑source models such as GPT‑5 on most metrics. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository‑level issue localization.

PaperID: 6681, https://arxiv.org/pdf/2512.20902.pdf

Authors: Siqi Mu, Shuo Wen, Yang Lu, Ruihong Jiang, Bo Ai

Title: Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

Abstract:
Due to their inherent flexibility and autonomous operation, unmanned aerial vehicles (UAVs) have been widely used in Internet of Medical Things (IoMT) to provide real‑time biomedical edge computing service for wireless body area network (WBAN) users. In this paper, considering the time‑varying task criticality characteristics of diverse WBAN users and the dual mobility between WBAN users and UAV, we investigate the dynamic task offloading and UAV flight trajectory optimization problem to minimize the weighted average task completion time of all the WBAN users, under the constraint of UAV energy consumption. To tackle the problem, an embodied AI‑enhanced IoMT edge computing framework is established. Specifically, we propose a novel hierarchical multi‑scale Transformer‑based user trajectory prediction model based on the users' historical trajectory traces captured by the embodied AI agent (i.e., UAV). Afterwards, a prediction‑enhanced deep reinforcement learning (DRL) algorithm that integrates predicted users' mobility information is designed for intelligently optimizing UAV flight trajectory and task offloading decisions. Real‑word movement traces and simulation results demonstrate the superiority of the proposed methods in comparison with the existing benchmarks.

PaperID: 6682, https://arxiv.org/pdf/2512.20884.pdf

Authors: Zan-Kai Chong, Hiroyuki Ohsaki, Bryan Ng

Title: The Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents

Abstract:
Autonomous agents powered by LLMs and Retrieval‑Augmented Generation (RAG) are proficient consumers of digital content but remain unidirectional, a limitation we term epistemic asymmetry. This isolation leads to redundant reasoning and stagnates collective intelligence. Current self‑reflection frameworks remain largely heuristic and private, lacking a probabilistic foundation to quantify certainty or justify external interaction.To bridge this gap, we propose a formal probabilistic framework that provides agents with a non‑altruistic motive for bidirectional knowledge exchange. We model an agent's belief in a proposition using a Beta‑Bernoulli distribution with a forgetting factor (γ). This allows us to isolate epistemic uncertainty as the variance of belief, establishing a dual drive for interaction: A homeostatic motive: The need to maintain certainty against the temporal decay introduced by γ. An optimal learning strategy: Targeting points of maximum ambiguity (\mathbbE[θ]=0.5) to maximize information gain. Under this framework, public contribution is reframed as optimal active learning: sharing solutions to elicit feedback is the most efficient method for an agent to reduce its own uncertainty. To ensure scalability, we introduce epistemic caching, which leverages the forgetting factor to dynamically prioritize resources for the active head of non‑stationary knowledge distributions. Finally, we demonstrate how these accumulated belief states serve as verifiable reward signals for Reinforcement Learning from Human Feedback (RLHF) and high‑quality data filters for Supervised Fine‑Tuning (SFT). Simulation results validate that this uncertainty‑driven strategy significantly outperforms random baselines in heterogeneous (Zipfian) environments, maintaining high adaptability to concept drift.

PaperID: 6683, https://arxiv.org/pdf/2512.20845.pdf

Authors: Onat Ozer, Grace Wu, Yuchen Wang, Daniel Dosti, Honghao Zhang, Vivi De La Rue

Title: MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

Abstract:
LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that its wrong. To address this problem, we instead introduce multi‑agent with multi‑persona debators as the method to generate reflections. Through out extensive experimentation, we've found that the leads to better diversity of in the reflections generated by the llm agent. We demonstrate an accuracy of 47% EM HotPot QA (question answering) and 82.7% on HumanEval (programming), both performances surpassing reflection with a single llm.

PaperID: 6684, https://arxiv.org/pdf/2512.20778.pdf

Authors: Moshe Rafaeli Shimron, Vadim Indelman

Title: Towards Optimal Performance and Action Consistency Guarantees in Dec-POMDPs with Inconsistent Beliefs and Limited Communication

Abstract:
Multi‑agent decision‑making under uncertainty is fundamental for effective and safe autonomous operation. In many real‑world scenarios, each agent maintains its own belief over the environment and must plan actions accordingly. However, most existing approaches assume that all agents have identical beliefs at planning time, implying these beliefs are conditioned on the same data. Such an assumption is often impractical due to limited communication. In reality, agents frequently operate with inconsistent beliefs, which can lead to poor coordination and suboptimal, potentially unsafe, performance. In this paper, we address this critical challenge by introducing a novel decentralized framework for optimal joint action selection that explicitly accounts for belief inconsistencies. Our approach provides probabilistic guarantees for both action consistency and performance with respect to open‑loop multi‑agent POMDP (which assumes all data is always communicated), and selectively triggers communication only when needed. Furthermore, we address another key aspect of whether, given a chosen joint action, the agents should share data to improve expected performance in inference. Simulation results show our approach outperforms state‑of‑the‑art algorithms.

PaperID: 6685, https://arxiv.org/pdf/2512.20722.pdf

Authors: Jie Chen, Xianbin Wang

Title: Learning-Enabled Elastic Network Topology for Distributed ISAC Service Provisioning

Abstract:
Conventional mobile networks, including both localized cell‑centric and cooperative cell‑free networks (CCN/CFN), are built upon rigid network topologies. However, neither architecture is adequate to flexibly support distributed integrated sensing and communication (ISAC) services, due to the increasing difficulty of aligning spatiotemporally distributed heterogeneous service demands with available radio resources. In this paper, we propose an elastic network topology (ENT) for distributed ISAC service provisioning, where multiple co‑existing localized CCNs can be dynamically aggregated into CFNs with expanded boundaries for federated network operation. This topology elastically orchestrates localized CCN and federated CFN boundaries to balance signaling overhead and distributed resource utilization, thereby enabling efficient ISAC service provisioning. A two‑phase operation protocol is then developed. In Phase I, each CCN autonomously classifies ISAC services as either local or federated and partitions its resources into dedicated and shared segments. In Phase II, each CCN employs its dedicated resources for local ISAC services, while the aggregated CFN consolidates shared resources from its constituent CCNs to cooperatively deliver federated services. Furthermore, we design a utility‑to‑signaling ratio (USR) to quantify the tradeoff between sensing/communication utility and signaling overhead. Consequently, a USR maximization problem is formulated by jointly optimizing the network topology (i.e., service classification and CCN aggregation) and the allocation of dedicated and shared resources. However, this problem is challenging due to its distributed optimization nature and the absence of complete channel state information. To address this problem efficiently, we propose a multi‑agent deep reinforcement learning (MADRL) framework with centralized training and decentralized execution.

PaperID: 6686, https://arxiv.org/pdf/2512.20688.pdf

Authors: Stefano Grassi

Title: Mechanism-Based Intelligence (MBI): Differentiable Incentives for Rational Coordination and Guaranteed Alignment in Multi-Agent Systems

Abstract:
Autonomous multi‑agent systems are fundamentally fragile: they struggle to solve the Hayekian Information problem (eliciting dispersed private knowledge) and the Hurwiczian Incentive problem (aligning local actions with global objectives), making coordination computationally intractable. I introduce Mechanism‑Based Intelligence (MBI), a paradigm that reconceptualizes intelligence as emergent from the coordination of multiple "brains", rather than a single one. At its core, the Differentiable Price Mechanism (DPM) computes the exact loss gradient \mathbfG_i = ‑ \frac\partial \mathcalL\partial \mathbfx_i as a dynamic, VCG‑equivalent incentive signal, guaranteeing Dominant Strategy Incentive Compatibility (DSIC) and convergence to the global optimum. A Bayesian extension ensures incentive compatibility under asymmetric information (BIC). The framework scales linearly (\mathcalO(N)) with the number of agents, bypassing the combinatorial complexity of Dec‑POMDPs and is empirically 50x faster than Model‑Free Reinforcement Learning. By structurally aligning agent self‑interest with collective objectives, it provides a provably efficient, auditable and generalizable approach to coordinated, trustworthy and scalable multi‑agent intelligence grounded in economic principles.

PaperID: 6687, https://arxiv.org/pdf/2512.20660.pdf

Authors: Matthew Thompson

Title: The Dual-State Architecture for Reliable LLM Agents

Abstract:
Large Language Models deployed as code generation agents exhibit stochastic behavior incompatible with the deterministic guarantees required by software engineering. We formalize the Dual‑State Action Pair (DSAP), an execution primitive that couples stochastic generation with deterministic post‑condition verification. Guard functions act as sensing actions that project opaque LLM outputs onto observable workflow state, enabling a dual‑state decomposition: finite, deterministic S_workflow paired with infinite, stochastic S_env. We prove that for epsilon‑capable generators, failure probability P(fail) <= (1‑epsilon)^R_max ‑> 0. To prevent naive O(R^K) retry explosion across multi‑step workflows, we introduce a three‑level recovery hierarchy: context refinement (retry within step), informed backtracking (stagnation detection with cascade invalidation and context injection to upstream steps), and human escalation. Experimental validation across 13 LLMs (1.3B‑15B parameters) on three diagnostic probes demonstrates reliability gains of up to 66 percentage points at 1.2‑2.1x baseline cost. Recovery mechanism evaluation on 99 SWE‑Bench Pro instance‑arm pairs (Qwen3‑Coder‑Next) demonstrates 100% context injection effectiveness (upstream output changed in all 71 escalation events) with step‑specific recovery asymmetry ‑‑ 37.5% for test generation vs. 0% for patch generation ‑‑ and 0% end‑to‑end patch production, establishing the boundary between execution architecture and plan synthesis: execution recovery is necessary but not sufficient for autonomous software engineering.

PaperID: 6688, https://arxiv.org/pdf/2512.20652.pdf

Authors: Vira Filatova, Andrii Zelenchuk, Dmytro Filatov

Title: AI-Driven Decision-Making System for Hiring Process

Abstract:
Early‑stage candidate validation is a major bottleneck in hiring, because recruiters must reconcile heterogeneous inputs (resumes, screening answers, code assignments, and limited public evidence). This paper presents an AI‑driven, modular multi‑agent hiring assistant that integrates (i) document and video preprocessing, (ii) structured candidate profile construction, (iii) public‑data verification, (iv) technical/culture‑fit scoring with explicit risk penalties, and (v) human‑in‑the‑loop validation via an interactive interface. The pipeline is orchestrated by an LLM under strict constraints to reduce output variability and to generate traceable component‑level rationales. Candidate ranking is computed by a configurable aggregation of technical fit, culture fit, and normalized risk penalties. The system is evaluated on 64 real applicants for a mid‑level Python backend engineer role, using an experienced recruiter as the reference baseline and a second, less experienced recruiter for additional comparison. Alongside precision/recall, we propose an efficiency metric measuring expected time per qualified candidate. In this study, the system improves throughput and achieves 1.70 hours per qualified candidate versus 3.33 hours for the experienced recruiter, with substantially lower estimated screening cost, while preserving a human decision‑maker as the final authority.

PaperID: 6689, https://arxiv.org/pdf/2512.20649.pdf

Authors: Zixun Luo, Yuhang Fan, Yufei Li, Youzhi Zhang, Hengyu Lin, Ziqi Wang

Title: AIAuditTrack: A Framework for AI Security system

Abstract:
The rapid expansion of AI‑driven applications powered by large language models has led to a surge in AI interaction data, raising urgent challenges in security, accountability, and risk traceability. This paper presents AiAuditTrack (AAT), a blockchain‑based framework for AI usage traffic recording and governance. AAT leverages decentralized identity (DID) and verifiable credentials (VC) to establish trusted and identifiable AI entities, and records inter‑entity interaction trajectories on‑chain to enable cross‑system supervision and auditing. AI entities are modeled as nodes in a dynamic interaction graph, where edges represent time‑specific behavioral trajectories. Based on this model, a risk diffusion algorithm is proposed to trace the origin of risky behaviors and propagate early warnings across involved entities. System performance is evaluated using blockchain Transactions Per Second (TPS) metrics, demonstrating the feasibility and stability of AAT under large‑scale interaction recording. AAT provides a scalable and verifiable solution for AI auditing, risk management, and responsibility attribution in complex multi‑agent environments.

PaperID: 6690, https://arxiv.org/pdf/2512.20624.pdf

Authors: Mazyar Taghavi, Javad Vahidi

Title: Quantum-Inspired Multi Agent Reinforcement Learning for Exploration Exploitation Optimization in UAV-Assisted 6G Network Deployment

Abstract:
This study introduces a quantum inspired framework for optimizing the exploration exploitation tradeoff in multiagent reinforcement learning, applied to UAVassisted 6G network deployment. We consider a cooperative scenario where ten intelligent UAVs autonomously coordinate to maximize signal coverage and support efficient network expansion under partial observability and dynamic conditions. The proposed approach integrates classical MARL algorithms with quantum‑inspired optimization techniques, leveraging variational quantum circuits VQCs as the core structure and employing the Quantum Approximate Optimization Algorithm QAOA as a representative VQC based method for combinatorial optimization. Complementary probabilistic modeling is incorporated through Bayesian inference, Gaussian processes, and variational inference to capture latent environmental dynamics. A centralized training with decentralized execution CTDE paradigm is adopted, where shared memory and local view grids enhance local observability among agents. Comprehensive experiments including scalability tests, sensitivity analysis, and comparisons with PPO and DDPG baselines demonstrate that the proposed framework improves sample efficiency, accelerates convergence, and enhances coverage performance while maintaining robustness. Radar chart and convergence analyses further show that QI MARL achieves a superior balance between exploration and exploitation compared to classical methods. All implementation code and supplementary materials are publicly available on GitHub to ensure reproducibility.

PaperID: 6691, https://arxiv.org/pdf/2512.20623.pdf

Authors: Ravi Gupta, Shabista Haider

Title: BitRL-Light: 1-bit LLM Agents with Deep Reinforcement Learning for Energy-Efficient Smart Home Lighting Optimization

Abstract:
Smart home lighting systems consume 15‑20% of residential energy but lack adaptive intelligence to optimize for user comfort and energy efficiency simultaneously. We present BitRL‑Light, a novel framework combining 1‑bit quantized Large Language Models (LLMs) with Deep Q‑Network (DQN) reinforcement learning for real‑time smart home lighting control on edge devices. Our approach deploys a 1‑bit quantized Llama‑3.2‑1B model on Raspberry Pi hardware, achieving 71.4 times energy reduction compared to full‑precision models while maintaining intelligent control capabilities. Through multi‑objective reinforcement learning, BitRL‑Light learns optimal lighting policies from user feedback, balancing energy consumption, comfort, and circadian alignment. Experimental results demonstrate 32% energy savings compared to rule‑based systems, with inference latency under 200ms on Raspberry Pi 4 and 95% user satisfaction. The system processes natural language commands via Google Home/IFTTT integration and learns from implicit feedback through manual overrides. Our comparative analysis shows 1‑bit models achieve 5.07 times speedup over 2‑bit alternatives on ARM processors while maintaining 92% task accuracy. This work establishes a practical framework for deploying adaptive AI on resource‑constrained IoT devices, enabling intelligent home automation without cloud dependencies.

PaperID: 6692, https://arxiv.org/pdf/2512.20586.pdf

Authors: Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim, Karen Chin-Snyder, Salim Siddiqui, Mira Shah, Eric Mellon, Mohammad Ghassemi, Anthony Doemer, Benjamin Movsas, Kundan Thind

Title: Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent

Abstract:
Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black‑box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain‑of‑thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single‑fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM‑based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non‑reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade‑off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.

PaperID: 6693, https://arxiv.org/pdf/2512.20458.pdf

Authors: Shuting Wang, Qiaolin Xia, Vich Wang, Herberttli, Bobsimons, Zhicheng Dou

Title: Laser: Governing Long-Horizon Agentic Search via Structured Protocol and Context Register

Abstract:
Recent advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs) have enabled agentic search systems that interleave multi‑step reasoning with external tool use. However, existing frameworks largely rely on unstructured natural‑language reasoning and accumulate raw intermediate traces in the context, which often leads to unstable reasoning trajectories, context overflow, and degraded performance on complex multi‑hop queries. In this study, we introduce Laser, a general framework for stabilizing and scaling agentic search. Laser defines a symbolic action protocol that organizes agent behaviors into three spaces: planning, task‑solving, and retrospection. Each action is specified with explicit semantics and a deterministic execution format, enabling structured and logical reasoning processes and reliable action parsing. This design makes intermediate decisions interpretable and traceable, enhancing explicit retrospection and fine‑grained control over reasoning trajectories. In coordination with parsable actions, Laser further maintains a compact context register that stores only essential states of the reasoning process, allowing the agent to reason over long horizons without uncontrolled context expansion. Experiments on Qwen2.5/3‑series models across challenging multi‑hop QA datasets show that Laser consistently outperforms existing agentic search baselines under both prompting‑only and fine‑tuning settings, demonstrating that Laser provides a principled and effective foundation for robust, scalable agentic search.

PaperID: 6694, https://arxiv.org/pdf/2512.20457.pdf

Authors: Marco Aruta, Francesco Improta, Vadim Malvone, Aniello Murano

Title: When Natural Strategies Meet Fuzziness and Resource-Bounded Actions (Extended Version)

Abstract:
In formal strategic reasoning for Multi‑Agent Systems (MAS), agents are typically assumed to (i) employ arbitrarily complex strategies, (ii) execute each move at zero cost, and (iii) operate over fully crisp game structures. These idealized assumptions stand in stark contrast with human decision making in real world environments. The natural strategies framework along with some of its recent variants, partially addresses this gap by restricting strategies to concise rules guarded by regular expressions. Yet, it still overlook both the cost of each action and the uncertainty that often characterizes human perception of facts over the time. In this work, we introduce HumanATLF, a logic that builds upon natural strategies employing both fuzzy semantics and resource bound actions: each action carries a real valued cost drawn from a non refillable budget, and atomic conditions and goals have degrees in [0,1]. We give a formal syntax and semantics, and prove that model checking is in P when both the strategy complexity k and resource budget b are fixed, NP complete if just one strategic operator over Boolean objectives is allowed, and Delta^P_2 complete when k and b vary. Moreover, we show that recall based strategies can be decided in PSPACE. We implement our algorithms in VITAMIN, an open source model checking tool for MAS and validate them on an adversarial resource aware drone rescue scenario.

PaperID: 6695, https://arxiv.org/pdf/2512.20417.pdf

Authors: Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo

Title: Chain-of-Anomaly Thoughts with Large Vision-Language Models

Abstract:
Automated video surveillance with Large Vision‑Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain‑of‑Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain‑of‑Anomaly‑Thoughts (CoAT), a multi‑agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly‑focused classification layer. Our method significantly improves Anomaly Detection, boosting F1‑score by 11.8 p.p. on challenging low‑resolution footage and Anomaly Classification by 3.78 p.p. in high‑resolution videos.

PaperID: 6696, https://arxiv.org/pdf/2512.20312.pdf

Authors: Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao

Title: TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

Abstract:
Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine‑tuned via Supervised Fine‑Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi‑step reasoning and robust code execution required for real‑world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high‑quality agentic trajectories with closed‑loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open‑ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce TableGPT‑R1, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty‑stratified agentic trajectories for both supervised alignment and RL rollouts, a task‑adaptive reward system that combines rule‑based verification with a criteria‑injected reward model and incorporates process‑level step reward shaping with behavioral regularization, and a multi‑stage training framework that progressively stabilizes reasoning before specializing in table‑specific tasks. Extensive evaluations demonstrate that TableGPT‑R1 achieves state‑of‑the‑art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT‑R1.

PaperID: 6697, https://arxiv.org/pdf/2512.20276.pdf

Authors: Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou

Title: ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

Abstract:
Vision‑Language‑Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long‑horizon task execution. However, their deployment in dynamic, real‑world environments is severely hin dered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typi cally operate at only 3‑5 Hz on edge devices due to the memory bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system‑level inference framework tailored for resource‑constrained edge plat forms. At the core of ActionFlow is a Cross‑Request Pipelin ing strategy, a novel scheduler that redefines VLA inference as a macro‑pipeline of micro‑requests. The strategy intelligently batches memory‑bound Decode phases with compute‑bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA‑7B model without retraining, enabling real‑time dy namic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow‑1D47.

PaperID: 6698, https://arxiv.org/pdf/2512.20275.pdf

Authors: Divya Vijay, Vignesh Ethiraj

Title: Graph-Symbolic Policy Enforcement and Control (G-SPEC): A Neuro-Symbolic Framework for Safe Agentic AI in 5G Autonomous Networks

Abstract:
As networks evolve toward 5G Standalone and 6G, operators face orchestration challenges that exceed the limits of static automation and Deep Reinforcement Learning. Although Large Language Model (LLM) agents offer a path toward intent‑based networking, they introduce stochastic risks, including topology hallucinations and policy non‑compliance. To mitigate this, we propose Graph‑Symbolic Policy Enforcement and Control (G‑SPEC), a neuro‑symbolic framework that constrains probabilistic planning with deterministic verification. The architecture relies on a Governance Triad ‑ a telecom‑adapted agent (TSLAM‑4B), a Network Knowledge Graph (NKG), and SHACL constraints. We evaluated G‑SPEC on a simulated 450‑node 5G Core, achieving zero safety violations and a 94.1% remediation success rate, significantly outperforming the 82.4% baseline. Ablation analysis indicates that NKG validation drives the majority of safety gains (68%), followed by SHACL policies (24%). Scalability tests on topologies ranging from 10K to 100K nodes demonstrate that validation latency scales as O(k^1.2) where k is subgraph size. With a processing overhead of 142ms, G‑SPEC is viable for SMO‑layer operations.

PaperID: 6699, https://arxiv.org/pdf/2512.20260.pdf

Authors: Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras

Title: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations

Abstract:
Weakly‑Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general‑purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task‑specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D^3ETOR, a two‑stage WSCOD framework consisting of Debate‑Enhanced Pseudo Labeling and Frequency‑Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy‑driven point sampling method and a multi‑agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi‑level frequency‑aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D^3ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state‑of‑the‑art performance on multiple benchmarks.

PaperID: 6700, https://arxiv.org/pdf/2512.20237.pdf

Authors: Xingbo Du, Loka Li, Duzhen Zhang, Le Song

Title: MemR$^3$: Memory Retrieval via Reflective Reasoning for LLM Agents

Abstract:
Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed‑loop control of memory retrieval. From this observation, we build memory retrieval as an autonomous, accurate, and compatible agent system, named MemR^3, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence‑gap tracker that explicitly renders the answering process transparent and tracks the evidence collection process. This design departs from the standard retrieve‑then‑answer pipeline by introducing a closed‑loop control mechanism that enables autonomous decision‑making. Empirical results on the LoCoMo benchmark demonstrate that MemR^3 surpasses strong baselines on LLM‑as‑a‑Judge score, and particularly, it improves existing retrievers across four categories with an overall improvement on RAG (+7.29%) and Zep (+1.94%) using GPT‑4.1‑mini backend, offering a plug‑and‑play controller for existing memory stores.

PaperID: 6701, https://arxiv.org/pdf/2512.20221.pdf

Authors: Sung Park, Daeho Yoon, Jungmin Lee

Title: The Effect of Empathic Expression Levels in Virtual Human Interaction: A Controlled Experiment

Abstract:
As artificial intelligence (AI) systems become increasingly embedded in everyday life, the ability of interactive agents to express empathy has become critical for effective human‑AI interaction, particularly in emotionally sensitive contexts. Rather than treating empathy as a binary capability, this study examines how different levels of empathic expression in virtual human interaction influence user experience. We conducted a between‑subject experiment (n = 70) in a counseling‑style interaction context, comparing three virtual human conditions: a neutral dialogue‑based agent, a dialogue‑based empathic agent, and a video‑based empathic agent that incorporates users' facial cues. Participants engaged in a 15‑minute interaction and subsequently evaluated their experience using subjective measures of empathy and interaction quality. Results from analysis of variance (ANOVA) revealed significant differences across conditions in affective empathy, perceived naturalness of facial movement, and appropriateness of facial expression. The video‑based empathic expression condition elicited significantly higher affective empathy than the neutral baseline (p < .001) and marginally higher levels than the dialogue‑based condition (p < .10). In contrast, cognitive empathy did not differ significantly across conditions. These findings indicate that empathic expression in virtual humans should be conceptualized as a graded design variable, rather than a binary capability, with visually grounded cues playing a decisive role in shaping affective user experience.

PaperID: 6702, https://arxiv.org/pdf/2512.20206.pdf

Authors: Zhe Sun, Kunlun Wu, Chuanjian Fu, Zeming Song, Langyong Shi, Zihe Xue, Bohan Jing, Ying Yang, Xiaomeng Gao, Aijia Li, Tianyu Guo, Huiying Li, Xueyuan Yang, Rongkai Liu, Xinyi He, Yuxi Wang, Yue Li, Mingyuan Liu, Yujie Lu, Hongzhao Xie, Shiyun Zhao, Bo Dai, Wei Wang, Tao Yuan, Song-Chun Zhu, Yujia Peng, Zhenliang Zhang

Title: TongSIM: A General Platform for Simulating Intelligent Machines

Abstract:
As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single‑modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general‑purpose training environment that can support everything from low‑level embodied navigation to high‑level composite activities, such as multi‑agent social simulation and human‑AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high‑fidelity, general‑purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi‑room indoor scenarios as well as an open‑ended, interaction‑rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision‑making, human‑robot cooperation, and spatial and social reasoning. With features like customized scenes, task‑adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.

PaperID: 6703, https://arxiv.org/pdf/2512.20201.pdf

Authors: Heekang Song, Wan Choi

Title: Joint Design of Embedded Index Coding and Beamforming for MIMO-based Distributed Computing via Multi-Agent Reinforcement Learning

Abstract:
In distributed computing systems, reducing the communication load during the data shuffling phase is a critical challenge, as excessive inter‑node transmissions are a major performance bottleneck. One promising approach to alleviate this burden is Embedded Index Coding (EIC), which exploits cached data at user nodes to encode transmissions more efficiently. However, most prior work on EIC has focused on minimizing code length in wired, error‑free environments‑an objective often suboptimal for wireless multiple‑input multiple‑output (MIMO) systems, where channel conditions and spatial multiplexing gains must be considered. This paper investigates the joint design of EIC and transmit beamforming in MIMO systems to minimize total transmission time, an NP‑hard problem. We first present a conventional optimization method that determines the optimal EIC via exhaustive search. To address its prohibitive complexity and adapt to dynamic wireless environments, we propose a novel, low‑complexity multi‑agent reinforcement learning (MARL) framework. The proposed framework enables decentralized agents to act on local observations while effectively managing the hybrid action space of discrete EIC selection and continuous beamforming design. Simulation results demonstrate that the proposed MARL‑based approach achieves near‑optimal performance with significantly reduced complexity, underscoring its effectiveness and practicality for real‑world wireless systems.

PaperID: 6704, https://arxiv.org/pdf/2512.20184.pdf

Authors: Chaoyi Ruan, Yiliang Wang, Ziji Shi, Jialin Li

Title: Reaching Agreement Among Reasoning LLM Agents

Abstract:
Multi‑agent systems have extended the capability of agentic AI. Instead of single inference passes, multiple agents perform collective reasoning to derive high quality answers. However, existing multi‑agent orchestration relies on static heuristic workflows such as fixed loop limits and barrier synchronization. These ad‑hoc approaches waste computational resources, incur high latency due to stragglers, and risk finalizing transient agreements. We argue that reliable multi‑agent reasoning requires a formal foundation analogous to classical distributed consensus problem. To that end, we propose a formal model of the multi‑agent refinement problem. The model includes definitions of the correctness guarantees and formal semantics of agent reasoning. We then introduce Aegean, a consensus protocol designed for stochastic reasoning agents that solves multi‑agent refinement. We implement the protocol in Aegean‑Serve, a consensus‑aware serving engine that performs incremental quorum detection across concurrent agent executions, enabling early termination when sufficient agents converge. Evaluation using four mathematical reasoning benchmarks shows that Aegean provides provable safety and liveness guarantees while reducing latency by 1.2‑‑20× compared to state‑of‑the‑art baselines, maintaining answer quality within 2.5%. Consistent gains across both local GPU deployments and commercial API providers validate that consensus‑based orchestration eliminates straggler delays without sacrificing correctness.

PaperID: 6705, https://arxiv.org/pdf/2512.20166.pdf

Authors: Xiaofan Wang, Xingyu Gao, Jianlong Fu, Zuolei Li, Dean Fortier, Galen Mullins, Andrey Kolobov, Baining Guo

Title: LoLA: Long Horizon Latent Action Learning for General Robot Manipulation

Abstract:
The capability of performing long‑horizon, language‑guided robotic manipulation tasks critically relies on leveraging historical information and generating coherent action sequences. However, such capabilities are often overlooked by existing Vision‑Language‑Action (VLA) models. To solve this challenge, we propose LoLA (Long Horizon Latent Action Learning), a framework designed for robot manipulation that integrates long‑term multi‑view observations and robot proprioception to enable multi‑step reasoning and action generation. We first employ Vision‑Language Models to encode rich contextual features from historical sequences and multi‑view observations. We further introduces a key module, State‑Aware Latent Re‑representation, which transforms visual inputs and language commands into actionable robot motion space. Unlike existing VLA approaches that merely concatenate robot proprioception (e.g., joint angles) with VL embeddings, this module leverages such robot states to explicitly ground VL representations in physical scale through a learnable "embodiment‑anchored" latent space. We trained LoLA on diverse robotic pre‑training datasets and conducted extensive evaluations on simulation benchmarks (SIMPLER and LIBERO), as well as two real‑world tasks on Franka and Bi‑Manual Aloha robots. Results show that LoLA significantly outperforms prior state‑of‑the‑art methods (e.g., pi0), particularly in long‑horizon manipulation tasks.

PaperID: 6706, https://arxiv.org/pdf/2512.20135.pdf

Authors: Zhuo Yang, Yeyun Chen, Jiaqing Xie, Ben Gao, Shuaike Shen, Wanhao Liu, Liujia Yang, Beilun Wang, Tianfan Fu, Yuqiang Li

Title: MolAct: An Agentic RL Framework for Molecular Editing and Property Optimization

Abstract:
Molecular editing and optimization are multi‑step problems that require iteratively improving properties while keeping molecules chemically valid and structurally similar. We frame both tasks as sequential, tool‑guided decisions and introduce MolAct, an agentic reinforcement learning framework that employs a two‑stage training paradigm: first building editing capability, then optimizing properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an Agentic Reinforcement Learning problem, where an LLM agent learns to interleave reasoning, tool‑use, and molecular optimization. The framework enables agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and leverages their feedback to refine subsequent edits. We instantiate the MolAct framework to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent‑7B delivers 100, 95, and 98 valid add, delete, and substitute edits, outperforming strong closed "thinking" baselines such as DeepSeek‑R1; MolEditAgent‑3B approaches the performance of much larger open "thinking" models like Qwen3‑32B‑think. In molecular optimization, MolOptAgent‑7B (trained on MolEditAgent‑7B) surpasses the best closed "thinking" baseline (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across other objectives. These results highlight that treating molecular design as a multi‑step, tool‑augmented process is key to reliable and interpretable improvements.

PaperID: 6707, https://arxiv.org/pdf/2512.20111.pdf

Authors: Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

Title: ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

Abstract:
As the length of sequential decision‑making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi‑step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post‑training. ABBEL replaces long multi‑step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task‑relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi‑step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near‑constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.

PaperID: 6708, https://arxiv.org/pdf/2512.20083.pdf

Authors: Wenzhao Wu, Yahui Tang, Mingfei Cheng, Wenbing Tang, Yuan Zhou, Yang Liu

Title: Detecting Non-Optimal Decisions of Embodied Agents via Diversity-Guided Metamorphic Testing

Abstract:
As embodied agents advance toward real‑world deployment, ensuring optimal decisions becomes critical for resource‑constrained applications. Current evaluation methods focus primarily on functional correctness, overlooking the non‑functional optimality of generated plans. This gap can lead to significant performance degradation and resource waste. We identify and formalize the problem of Non‑optimal Decisions (NoDs), where agents complete tasks successfully but inefficiently. We present NoD‑DGMT, a systematic framework for detecting NoDs in embodied agent task planning via diversity‑guided metamorphic testing. Our key insight is that optimal planners should exhibit invariant behavioral properties under specific transformations. We design four novel metamorphic relations capturing fundamental optimality properties: position detour suboptimality, action optimality completeness, condition refinement monotonicity, and scene perturbation invariance. To maximize detection efficiency, we introduce a diversity‑guided selection strategy that actively selects test cases exploring different violation categories, avoiding redundant evaluations while ensuring comprehensive diversity coverage. Extensive experiments on the AI2‑THOR simulator with four state‑of‑the‑art planning models demonstrate that NoD‑DGMT achieves violation detection rates of 31.9% on average, with our diversity‑guided filter improving rates by 4.3% and diversity scores by 3.3 on average. NoD‑DGMT significantly outperforms six baseline methods, with 16.8% relative improvement over the best baseline, and demonstrates consistent superiority across different model architectures and task complexities.

PaperID: 6709, https://arxiv.org/pdf/2512.19972.pdf

Authors: Pengchao Han, Xi Huang, Yi Fang, Guojun Han

Title: Rethinking Knowledge Distillation in Collaborative Machine Learning: Memory, Knowledge, and Their Interactions

Abstract:
Collaborative learning has emerged as a key paradigm in large‑scale intelligent systems, enabling distributed agents to cooperatively train their models while addressing their privacy concerns. Central to this paradigm is knowledge distillation (KD), a technique that facilitates efficient knowledge transfer among agents. However, the underlying mechanisms by which KD leverages memory and knowledge across agents remain underexplored. This paper aims to bridge this gap by offering a comprehensive review of KD in collaborative learning, with a focus on the roles of memory and knowledge. We define and categorize memory and knowledge within the KD process and explore their interrelationships, providing a clear understanding of how knowledge is extracted, stored, and shared in collaborative settings. We examine various collaborative learning patterns, including distributed, hierarchical, and decentralized structures, and provide insights into how memory and knowledge dynamics shape the effectiveness of KD in collaborative learning. Particularly, we emphasize task heterogeneity in distributed learning pattern covering federated learning (FL), multi‑agent domain adaptation (MADA), federated multi‑modal learning (FML), federated continual learning (FCL), federated multi‑task learning (FMTL), and federated graph knowledge embedding (FKGE). Additionally, we highlight model heterogeneity, data heterogeneity, resource heterogeneity, and privacy concerns of these tasks. Our analysis categorizes existing work based on how they handle memory and knowledge. Finally, we discuss existing challenges and propose future directions for advancing KD techniques in the context of collaborative learning.

PaperID: 6710, https://arxiv.org/pdf/2512.19933.pdf

Authors: Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, Jionglong Su

Title: PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation

Abstract:
Traditional agent‑based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality‑Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality‑conditional partially observable Markov decision process (PC‑POMDP) for discrete decision‑making. In contrast to continuous trait approaches, PRISM assigns distinct Myers‑Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data‑driven priors from large‑scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.

PaperID: 6711, https://arxiv.org/pdf/2512.19841.pdf

Authors: Yousef Mehrdad Bibalan, Behrouz Far, Mohammad Moshirpour, Bahareh Ghiyasian

Title: A Multi-Agent Retrieval-Augmented Framework for Work-in-Progress Predictio

Abstract:
Work‑in‑Progress (WiP) prediction is critical for predictive process monitoring, enabling accurate anticipation of workload fluctuations and optimized operational planning. This paper proposes a retrieval‑augmented, multi‑agent framework that combines retrieval‑augmented generation (RAG) and collaborative multi‑agent reasoning for WiP prediction. The narrative generation component transforms structured event logs into semantically rich natural language stories, which are embedded into a semantic vector‑based process memory to facilitate dynamic retrieval of historical context during inference. The framework includes predictor agents that independently leverage retrieved historical contexts and a decision‑making assistant agent that extracts high‑level descriptive signals from recent events. A fusion agent then synthesizes predictions using ReAct‑style reasoning over agent outputs and retrieved narratives. We evaluate our framework on two real‑world benchmark datasets. Results show that the proposed retrieval‑augmented multi‑agent approach achieves competitive prediction accuracy, obtaining a Mean Absolute Percentage Error (MAPE) of 1.50% on one dataset, and surpassing Temporal Convolutional Networks (TCN), Long Short‑Term Memory (LSTM), and persistence baselines. The results highlight improved robustness, demonstrating the effectiveness of integrating retrieval mechanisms and multi‑agent reasoning in WiP prediction.

PaperID: 6712, https://arxiv.org/pdf/2512.19769.pdf

Authors: Ivan Daunis

Title: A Declarative Language for Building And Orchestrating LLM-Powered Agent Workflows

Abstract:
Building deployment‑ready LLM agents requires complex orchestration of tools, data sources, and control flow logic, yet existing systems tightly couple agent logic to specific programming languages and deployment models. We present a declarative system that separates agent workflow specification from implementation, enabling the same pipeline definition to execute across multiple backend languages (Java, Python, Go) and deployment environments (cloud‑native, on‑premises). Our key insight is that most agent workflows consist of common patterns ‑‑ data serialization, filtering, RAG retrieval, API orchestration ‑‑ that can be expressed through a unified DSL rather than imperative code. This approach transforms agent development from application programming to configuration, where adding new tools or fine‑tuning agent behaviors requires only pipeline specification changes, not code deployment. Our system natively supports A/B testing of agent strategies, allowing multiple pipeline variants to run on the same backend infrastructure with automatic metric collection and comparison. We evaluate our approach on real‑world e‑commerce workflows at PayPal, processing millions of daily interactions. Our results demonstrate 60% reduction in development time, and 3x improvement in deployment velocity compared to imperative implementations. The language's declarative approach enables non‑engineers to modify agent behaviors safely, while maintaining sub‑100ms orchestration overhead. We show that complex workflows involving product search, personalization, and cart management can be expressed in under 50 lines of DSL compared to 500+ lines of imperative code.

PaperID: 6713, https://arxiv.org/pdf/2512.19707.pdf

Authors: James K Ruffle, Samia Mohinta, Guilherme Pombo, Asthik Biswas, Alan Campbell, Indran Davagnanam, David Doig, Ahmed Hammam, Harpreet Hyare, Farrah Jabeen, Emma Lim, Dermot Mallon, Stephanie Owen, Sophie Wilkinson, Sebastian Brandner, Parashkev Nachev

Title: Bidirectional human-AI collaboration in brain tumour assessments improves both expert human and AI agent performance

Abstract:
The benefits of artificial intelligence (AI) human partnerships‑evaluating how AI agents enhance expert human performance‑are increasingly studied. Though rarely evaluated in healthcare, an inverse approach is possible: AI benefiting from the support of an expert human agent. Here, we investigate both human‑AI clinical partnership paradigms in the magnetic resonance imaging‑guided characterisation of patients with brain tumours. We reveal that human‑AI partnerships improve accuracy and metacognitive ability not only for radiologists supported by AI, but also for AI agents supported by radiologists. Moreover, the greatest patient benefit was evident with an AI agent supported by a human one. Synergistic improvements in agent accuracy, metacognitive performance, and inter‑rater agreement suggest that AI can create more capable, confident, and consistent clinical agents, whether human or model‑based. Our work suggests that the maximal value of AI in healthcare could emerge not from replacing human intelligence, but from AI agents that routinely leverage and amplify it.

PaperID: 6714, https://arxiv.org/pdf/2512.19551.pdf

Authors: Jiawen Wang, Jingjing Wang Tianyang Chen, Min Zhang, Guodong Zhou

Title: Towards Closed-Loop Embodied Empathy Evolution: Probing LLM-Centric Lifelong Empathic Motion Generation in Unseen Scenarios

Abstract:
In the literature, existing human‑centric emotional motion generation methods primarily focus on boosting performance within a single scale‑fixed dataset, largely neglecting the flexible and scale‑increasing motion scenarios (e.g., sports, dance), whereas effectively learning these newly emerging scenarios can significantly enhance the model's real‑world generalization ability. Inspired by this, this paper proposes a new LLM‑Centric Lifelong Empathic Motion Generation (L^2‑EMG) task, which aims to equip LLMs with the capability to continually acquire emotional motion generation knowledge across different unseen scenarios, potentially contributing to building a closed‑loop and self‑evolving embodied agent equipped with both empathy and intelligence. Further, this paper poses two key challenges in the L^2‑EMG task, i.e., the emotion decoupling challenge and the scenario adapting challenge. To this end, this paper proposes an Emotion‑Transferable and Scenario‑Adapted Mixture of Experts (ES‑MoE) approach which designs a causal‑guided emotion decoupling block and a scenario‑adapted expert constructing block to address the two challenges, respectively. Especially, this paper constructs multiple L^2‑EMG datasets to validate the effectiveness of the ES‑MoE approach. Extensive evaluations show that ES‑MoE outperforms advanced baselines.

PaperID: 6715, https://arxiv.org/pdf/2512.19479.pdf

Authors: Guoli Jia, Junyao Hu, Xinwei Long, Kai Tian, Kaiyan Zhang, KaiKai Zhao, Ning Ding, Bowen Zhou

Title: Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation

Abstract:
Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion‑oriented image generation has attracted increasing attention. However, current emotion‑oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion‑Director, a cross‑modal collaboration framework consisting of two modules. First, we propose a cross‑Modal Collaborative diffusion model, abbreviated as MC‑Diffusion. MC‑Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion‑oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC‑Agent, a cross‑Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template‑like rewrites, MC‑Agent employs multi‑agents to simulate human subjectivity toward emotions, and adopts a chain‑of‑concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion‑Director in emotion‑oriented image generation.

PaperID: 6716, https://arxiv.org/pdf/2512.19321.pdf

Authors: Wei Liu, Tao Zhang, Chenhui Lin, Kaiwen Li, Rui Wang

Title: Learning-Assisted Multi-Operator Variable Neighborhood Search for Urban Cable Routing

Abstract:
Urban underground cable construction is essential for enhancing the reliability of city power grids, yet its high construction costs make planning a worthwhile optimization task. In urban environments, road layouts tightly constrain cable routing. This, on the one hand, renders relation‑only models (i.e., those without explicit routes) used in prior work overly simplistic, and on the other hand, dramatically enlarges the combinatorial search space, thereby imposing much higher demands on algorithm design. In this study, we formulate urban cable routing as a connectivity‑path co‑optimization problem and propose a learning‑assisted multi‑operator variable neighborhood search (L‑MVNS) algorithm. The framework first introduces an auxiliary task to generate high‑quality feasible initial solutions. A hybrid genetic search (HGS) and A serve as the connectivity optimizer and the route‑planning optimizer, respectively. Building on these, a multi‑operator variable neighborhood search (MVNS) iteratively co‑optimizes inter‑substation connectivity and detailed routes via three complementary destruction operators, a modified A repair operator, and an adaptive neighborhood‑sizing mechanism. A multi‑agent deep reinforcement learning module is further embedded to prioritize promising neighborhoods. We also construct a standardized and scalable benchmark suite for evaluation. Across these cases, comprehensive experiments demonstrate effectiveness and stability: relative to representative approaches, MVNS and L‑MVNS reduce total construction cost by approximately 30‑50%, with L‑MVNS delivering additional gains on larger instances and consistently higher stability.

PaperID: 6717, https://arxiv.org/pdf/2512.19299.pdf

Authors: Haoyu Jiang, Fanjie Zeng, Boan Qu, Xiaojie Lin, Wei Zhong

Title: Helios: A Foundational Language Model for Smart Energy Knowledge Reasoning and Application

Abstract:
In the global drive toward carbon neutrality, deeply coordinated smart energy systems underpin industrial transformation. However, the interdisciplinary, fragmented, and fast‑evolving expertise in this domain prevents general‑purpose LLMs, which lack domain knowledge and physical‑constraint awareness, from delivering precise engineering‑aligned inference and generation. To address these challenges, we introduce Helios, a large language model tailored to the smart energy domain, together with a comprehensive suite of resources to advance LLM research in this field. Specifically, we develop Enersys, a multi‑agent collaborative framework for end‑to‑end dataset construction, through which we produce: (1) a smart energy knowledge base, EnerBase, to enrich the model's foundational expertise; (2) an instruction fine‑tuning dataset, EnerInstruct, to strengthen performance on domain‑specific downstream tasks; and (3) an RLHF dataset, EnerReinforce, to align the model with human preferences and industry standards. Leveraging these resources, Helios undergoes large‑scale pretraining, SFT, and RLHF. We also release EnerBench, a benchmark for evaluating LLMs in smart energy scenarios, and demonstrate that our approach significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences.

PaperID: 6718, https://arxiv.org/pdf/2512.19018.pdf

Authors: Muhammad Usman Tariq, Abhinav Jangda, Angelica Moreira, Madan Musuvathi, Tyler Sorensen

Title: PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Abstract:
Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low‑level backend code. This challenge is exacerbated in the domain of GPU kernels, where performance‑critical details are coupled to rapidly evolving hardware characteristics and available code examples are sparse. In this work, we introduce PEAK, a Performance Engineering AI‑Assistant for GPU Kernels powered by natural language transformations. PEAK utilizes the key insight that iterative code transformations (optimizations) can straightforwardly be written in natural language, and then carried out by LLMs. Thus, these transformations can be rapidly developed, encoding general portable optimizations, but also easily specialized to specific GPU devices and even kernels. These natural transformations are supported by a modular and extensible infrastructure that additionally performs validation and performance evaluation. We demonstrate the flexibility of PEAK by instantiating it for three backends, CUDA, HIP, and HLSL, and create 16 natural transformations for optimizing matrix multiplication kernels. We show that our resulting implementations are competitive with vendor libraries when available, and for HLSL (without a library) our implementations match the hardware documented FLOPS. PEAK allows the fine‑grained exploration of several research questions around how LLMs behave in this domain, including characterizing transformations and their errors; and how performance evolves along optimization sequences. PEAK provides an interface that can either be utilized by performance engineers to improve productivity, or driven completely autonomously (e.g., by an AI agent), providing a forward‑compatible design that can continue to improve with advances in AI capabilities.

PaperID: 6719, https://arxiv.org/pdf/2512.19016.pdf

Authors: Liming Lu, Xiang Gu, Junyu Huang, Jiawei Du, Xu Zheng, Yunhuai Liu, Yongbin Zhou, Shuchao Pang

Title: DREAM: Dynamic Red-teaming across Environments for AI Models

Abstract:
Large Language Models (LLMs) are increasingly used in agentic systems, where their interactions with diverse tools and environments create complex, multi‑stage safety challenges. However, existing benchmarks mostly rely on static, single‑turn assessments that miss vulnerabilities from adaptive, long‑chain attacks. To fill this gap, we introduce DREAM, a framework for systematic evaluation of LLM agents against dynamic, multi‑stage attacks. At its core, DREAM uses a Cross‑Environment Adversarial Knowledge Graph (CE‑AKG) to maintain stateful, cross‑domain understanding of vulnerabilities. This graph guides a Contextualized Guided Policy Search (C‑GPS) algorithm that dynamically constructs attack chains from a knowledge base of 1,986 atomic actions across 349 distinct digital environments. Our evaluation of 12 leading LLM agents reveals a critical vulnerability: these attack chains succeed in over 70% of cases for most models, showing the power of stateful, cross‑environment exploits. Through analysis of these failures, we identify two key weaknesses in current agents: contextual fragility, where safety behaviors fail to transfer across environments, and an inability to track long‑term malicious intent. Our findings also show that traditional safety measures, such as initial defense prompts, are largely ineffective against attacks that build context over multiple interactions. To advance agent safety research, we release DREAM as a tool for evaluating vulnerabilities and developing more robust defenses.

PaperID: 6720, https://arxiv.org/pdf/2512.19001.pdf

Authors: Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang, Yingzheng Ma, Shuyu Han, Wei Qi, Zuo-Jun Max Shen

Title: ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management

Abstract:
As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI's adaptive perception with OR's structural rigor. To bridge this gap, we propose a novel OR‑Guided "Pretrain‑then‑Reinforce" framework. To provide structured guidance, we propose a simulation‑augmented OR model that generates high‑quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR‑derived decisions as foundational training labels, we design a domain‑informed deep learning foundation model to establish foundational decision‑making capabilities, followed by a reinforcement learning (RL) fine‑tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario‑specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at JD.com augmented by a Difference‑in‑Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real‑world gains of a 5.27‑day reduction in turnover and a 2.29% increase in in‑stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute‑force model scaling, our study demonstrates that a lightweight, domain‑informed model can deliver state‑of‑the‑art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost‑effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.

PaperID: 6721, https://arxiv.org/pdf/2512.18847.pdf

Authors: Ignacio Gustin, Luis Mantilla Calderón, Juan B. Pérez-Sánchez, Jérôme F. Gonthier, Yuma Nakamura, Karthik Panicker, Manav Ramprasad, Zijian Zhang, Yunheng Zou, Varinia Bernales, Alán Aspuru-Guzik

Title: El Agente Cuantico: Automating quantum simulations

Abstract:
Quantum simulation is central to understanding and designing quantum systems across physics and chemistry. Yet it has barriers to access from both computational complexity and computational perspectives, due to the exponential growth of Hilbert space and the complexity of modern software tools. Here we introduce\cinzel El Agente Cuántico, a multi‑agent AI system that automates quantum‑simulation workflows by translating natural‑language scientific intent into executed and validated computations across heterogeneous quantum‑software frameworks. By reasoning directly over library documentation and APIs, our agentic system dynamically assembles end‑to‑end simulations spanning state preparation, closed‑ and open‑system dynamics, tensor‑network methods, quantum control, quantum error correction, and quantum resource estimation. The developed system unifies traditionally distinct simulation paradigms behind a single natural‑language interface. Beyond reducing technical barriers, this approach opens a path toward scalable, adaptive, and increasingly autonomous quantum simulation, enabling faster exploration of physical models, rapid hypothesis testing, and closer integration between theory, simulation, and emerging quantum hardware.

PaperID: 6722, https://arxiv.org/pdf/2512.18779.pdf

Authors: Thorsten Hellert, Nikolay Agladze, Alex Giovannone, Jan Jug, Frank Mayet, Mark Sherwin, Antonin Sulc, Chris Tennant

Title: From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

Abstract:
Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language‑model‑driven interfaces. We formalize semantic channel finding‑mapping natural‑language intent to concrete control‑system signals‑as a general problem in complex experimental infrastructure, and introduce a four‑paradigm framework to guide architecture selection across facility‑specific data regimes. The paradigms span (i) direct in‑context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool‑based database queries, and (iv) ontology‑grounded semantic search that decouples channel meaning from facility‑specific naming conventions. We demonstrate each paradigm through proof‑of‑concept implementations at four operational facilities spanning two orders of magnitude in scale‑from compact free‑electron lasers to large synchrotron light sources‑and diverse control‑system architectures, from clean hierarchies to legacy environments. These implementations achieve 90‑97% accuracy on expert‑curated operational queries.

PaperID: 6723, https://arxiv.org/pdf/2512.18733.pdf

Authors: Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, Shirui Pan

Title: Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection

Abstract:
Large language model (LLM)‑based multi‑agent systems (MAS) have shown strong capabilities in solving complex tasks. As MAS become increasingly autonomous in various safety‑critical tasks, detecting malicious agents has become a critical security concern. Although existing graph anomaly detection (GAD)‑based defenses can identify anomalous agents, they mainly rely on coarse sentence‑level information and overlook fine‑grained lexical cues, leading to suboptimal performance. Moreover, the lack of interpretability in these methods limits their reliability and real‑world applicability. To address these limitations, we propose XG‑Guard, an explainable and fine‑grained safeguarding framework for detecting malicious agents in MAS. To incorporate both coarse and fine‑grained textual information for anomalous agent identification, we utilize a bi‑level agent encoder to jointly model the sentence‑ and token‑level representations of each agent. A theme‑based anomaly detector further captures the evolving discussion focus in MAS dialogues, while a bi‑level score fusion mechanism quantifies token‑level contributions for explanation. Extensive experiments across diverse MAS topologies and attack scenarios demonstrate robust detection performance and strong interpretability of XG‑Guard.

PaperID: 6724, https://arxiv.org/pdf/2512.18673.pdf

Authors: Ning Lyu, Junjie Jiang, Lu Chang, Chihui Shao, Feng Chen, Chong Zhang

Title: Improving Pattern Recognition of Scheduling Anomalies through Structure-Aware and Semantically-Enhanced Graphs

Abstract:
This paper proposes a structure‑aware driven scheduling graph modeling method to improve the accuracy and representation capability of anomaly identification in scheduling behaviors of complex systems. The method first designs a structure‑guided scheduling graph construction mechanism that integrates task execution stages, resource node states, and scheduling path information to build dynamically evolving scheduling behavior graphs, enhancing the model's ability to capture global scheduling relationships. On this basis, a multi‑scale graph semantic aggregation module is introduced to achieve semantic consistency modeling of scheduling features through local adjacency semantic integration and global topology alignment, thereby strengthening the model's capability to capture abnormal features in complex scenarios such as multi‑task concurrency, resource competition, and stage transitions. Experiments are conducted on a real scheduling dataset with multiple scheduling disturbance paths set to simulate different types of anomalies, including structural shifts, resource changes, and task delays. The proposed model demonstrates significant performance advantages across multiple metrics, showing a sensitive response to structural disturbances and semantic shifts. Further visualization analysis reveals that, under the combined effect of structure guidance and semantic aggregation, the scheduling behavior graph exhibits stronger anomaly separability and pattern representation, validating the effectiveness and adaptability of the method in scheduling anomaly detection tasks.

PaperID: 6725, https://arxiv.org/pdf/2512.18669.pdf

Authors: Jones David, Shreya Ghosh

Title: IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling

Abstract:
LLM‑based tutors are typically single‑turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long‑term pedagogical support. We introduce IntelliCode, a multi‑agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring, each operating as a pure transformation over the shared state under a single‑writer policy. This architecture enables auditable mastery updates, proficiency‑aware hints, dependency‑aware curriculum adaptation, and safety‑aligned prompting. The demo showcases an end‑to‑end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi‑agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM‑driven tutoring.

PaperID: 6726, https://arxiv.org/pdf/2512.18604.pdf

Authors: Wencan Mao, Quanxi Zhou, Tomas Couso Coddou, Manabu Tsukada, Yunling Liu, Yusheng Ji

Title: Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning

Abstract:
Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV‑based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi‑agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation‑based triple deep Q‑network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q‑network over a double deep Q‑network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real‑world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.

PaperID: 6727, https://arxiv.org/pdf/2512.18571.pdf

Authors: Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu, Wei Li, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang, Zhengyou Zhang

Title: ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning

Abstract:
Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., "fetch the tool" in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning to minimize total task execution costs. To bridge this gap, we propose ESearch‑R1, a cost‑aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC‑GRPO (Heterogeneous Cost‑Aware Group Relative Policy Optimization). Unlike traditional PPO which relies on a separate value critic, HC‑GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade‑off between information gain and heterogeneous costs (e.g., navigate time, and human attention). Extensive experiments in AI2‑THOR demonstrate that ESearch‑R1 significantly outperforms standard ReAct‑based agents. It improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical world constraints.

PaperID: 6728, https://arxiv.org/pdf/2512.18561.pdf

Authors: Saad Alqithami

Title: Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale

Abstract:
Large‑scale networked multi‑agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms such as collusion, resource hoarding, and implicit unfairness. We present the Adaptive Accountability Framework (AAF), an end‑to‑end runtime layer that (i) records cryptographically verifiable interaction provenance, (ii) detects distributional change points in streaming traces, (iii) attributes responsibility via a causal influence graph, and (iv) applies cost‑bounded interventions‑reward shaping and targeted policy patching‑to steer the system back toward compliant behavior. We establish a bounded‑compromise guarantee: if the expected cost of intervention exceeds an adversary's expected payoff, the long‑run fraction of compromised interactions converges to a value strictly below one. We evaluate AAF in a large‑scale factorial simulation suite (87,480 runs across two tasks; up to 100 agents plus a 500‑agent scaling sweep; full and partial observability; Byzantine rates up to 10%; 10 seeds per regime). Across 324 regimes, AAF lowers the executed compromise ratio relative to a Proximal Policy Optimization baseline in 96% of regimes (median relative reduction 11.9%) while preserving social welfare (median change 0.4%). Under adversarial injections, AAF detects norm violations with a median delay of 71 steps (interquartile range 39‑177) and achieves a mean top‑ranked attribution accuracy of 0.97 at 10% Byzantine rate.

PaperID: 6729, https://arxiv.org/pdf/2512.18558.pdf

Authors: Shuwei Pei, Joran Borger, Arda Kosay, Muhammed O. Sayin, Saeed Ahmed

Title: Distributionally Robust Multi-Agent Reinforcement Learning for Intelligent Traffic Control

Abstract:
Learning‑based traffic signal control is typically optimized for average performance under a few nominal demand patterns, which can result in poor behavior under atypical traffic conditions. To address this, we develop a distributionally robust multi‑agent reinforcement learning framework for signal control on a 3x3 urban grid calibrated from a contiguous 3x3 subarea of central Athens covered by the pNEUMA trajectory dataset (Barmpounakis and Geroliminis, 2020). Our approach proceeds in three stages. First, we train a baseline multi‑agent RL controller in which each intersection is governed by a proximal policy optimization agent with discrete signal phases, using a centralized training, decentralized execution paradigm. Second, to capture demand uncertainty, we construct eight heterogeneous origin‑destination‑based traffic scenarios‑one directly derived from pNEUMA and seven synthetically generated‑to span a wide range of spatial and temporal demand patterns. Over this scenario set, we train a contextual‑bandit worst‑case estimator that assigns mixture weights to estimate adversarial demand distributions conditioned on context. Finally, without modifying the controller architecture, we fine‑tune the baseline multi‑agent reinforcement learning agents under these estimated worst‑case mixtures to obtain a distributionally robust multi‑agent reinforcement learning controller. Across all eight scenarios, as well as on an unseen validation network based on the Sioux Falls configuration, the distributionally robust multi‑agent reinforcement learning controller consistently reduces horizon‑averaged queues and increases average speeds relative to the baseline, achieving up to 51% shorter queues and 38% higher speeds on the worst‑performing scenarios.

PaperID: 6730, https://arxiv.org/pdf/2512.18552.pdf

Authors: Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang

Title: Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Abstract:
While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass‑to‑pass and fail‑to‑pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self‑play SWE‑RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human‑labeled issues or tests. Grounded in these real‑world codebases, a single LLM agent is trained via reinforcement learning in a self‑play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE‑bench Verified and SWE‑Bench Pro benchmarks, SSR achieves notable self‑improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human‑data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self‑play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real‑world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.

PaperID: 6731, https://arxiv.org/pdf/2512.18436.pdf

Authors: Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu

Title: VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Abstract:
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system‑verification benchmark suite, VeruSAGE‑Bench, which consists of 849 proof tasks extracted from eight open‑source Verus‑verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4‑mini, GPT‑5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system‑verification capability of different types of LLMs. The best LLM‑agent combination in our study completes over 80% of system‑verification tasks in VeruSAGE‑Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE‑Bench because they had not yet been finished by human experts. This result shows the great potential for LLM‑assisted development of verified system software.

PaperID: 6732, https://arxiv.org/pdf/2512.18360.pdf

Authors: Mateusz Lango, Ondřej Dušek

Title: LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

Abstract:
We present a novel neurosymbolic framework for RDF‑to‑text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule‑based Python code for a generator for the given domain, based on RDF triples only, with no in‑domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models

PaperID: 6733, https://arxiv.org/pdf/2512.18352.pdf

Authors: Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu

Title: LLM-based Few-Shot Early Rumor Detection with Imitation Agent

Abstract:
Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data‑scarce settings. While Large Language Models (LLMs) perform well in few‑shot NLP tasks, they are not well‑suited for time‑series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM‑based detection model, where the agent acts as a reliable decision‑maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few‑shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training‑free. Extensive experiments on four real‑world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

PaperID: 6734, https://arxiv.org/pdf/2512.18309.pdf

Authors: Harsh Rathva, Ojas Srivastava, Pruthwik Mishra

Title: Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings

Abstract:
We introduce Embedded Safety‑Aligned Intelligence (ESAI), a theoretical framework for multi‑agent reinforcement learning that embeds alignment constraints directly into agents internal representations using differentiable internal alignment embeddings. Unlike external reward shaping or post‑hoc safety constraints, internal alignment embeddings are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy updates toward harm reduction through attention and graph‑based propagation. The ESAI framework integrates four mechanisms: differentiable counterfactual alignment penalties computed from soft reference distributions, alignment‑weighted perceptual attention, Hebbian associative memory supporting temporal credit assignment, and similarity‑weighted graph diffusion with bias mitigation controls. We analyze stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints, discuss computational complexity, and examine theoretical properties including contraction behavior and fairness‑performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi‑agent systems. We identify open theoretical questions regarding convergence guarantees, embedding dimensionality, and extension to high‑dimensional environments. Empirical evaluation is left to future work.

PaperID: 6735, https://arxiv.org/pdf/2512.18292.pdf

Authors: Wenkai Li, Lynnette Hui Xian Ng, Andy Liu, Daniel Fried

Title: Measuring Fine-Grained Negotiation Tactics of Humans and LLMs in Diplomacy

Abstract:
The study of negotiation styles dates back to Aristotle's ethos‑pathos‑logos rhetoric. Prior efforts primarily studied the success of negotiation agents. Here, we shift the focus towards the styles of negotiation strategies. Our focus is the strategic dialogue board game Diplomacy, which affords rich natural language negotiation and measures of game success. We used LLM‑as‑a‑judge to annotate a large human‑human set of Diplomacy games for fine‑grained negotiation tactics from a sociologically‑grounded taxonomy. Using a combination of the It Takes Two and WebDiplomacy datasets, we demonstrate the reliability of our LLM‑as‑a‑Judge framework and show strong correlations between negotiation features and success in the Diplomacy setting. Lastly, we investigate the differences between LLM and human negotiation strategies and show that fine‑tuning can steer LLM agents toward more human‑like negotiation behaviors.

PaperID: 6736, https://arxiv.org/pdf/2512.18265.pdf

Authors: Himabindu Thogaru, Saisubramaniam Gopalakrishnan, Zishan Ahmad, Anirudh Deodhar

Title: Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Driven Knowledge Graphs and LLM Collaboration

Abstract:
Manufacturing planners face complex operational challenges that require seamless collaboration between human expertise and intelligent systems to achieve optimal performance in modern production environments. Traditional approaches to analyzing simulation‑based manufacturing data often create barriers between human decision‑makers and critical operational insights, limiting effective partnership in manufacturing planning. Our framework establishes a collaborative intelligence system integrating Knowledge Graphs and Large Language Model‑based agents to bridge this gap, empowering manufacturing professionals through natural language interfaces for complex operational analysis. The system transforms simulation data into semantically rich representations, enabling planners to interact naturally with operational insights without specialized expertise. A collaborative LLM agent works alongside human decision‑makers, employing iterative reasoning that mirrors human analytical thinking while generating precise queries for knowledge extraction and providing transparent validation. This partnership approach to manufacturing bottleneck identification, validated through operational scenarios, demonstrates enhanced performance while maintaining human oversight and decision authority. For operational inquiries, the system achieves near‑perfect accuracy through natural language interaction. For investigative scenarios requiring collaborative analysis, we demonstrate the framework's effectiveness in supporting human experts to uncover interconnected operational issues that enhance understanding and decision‑making. This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems.

PaperID: 6737, https://arxiv.org/pdf/2512.18190.pdf

Authors: Jian Yan

Title: External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

Abstract:
This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight‑space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi‑step reasoning for small models. Experiments on models <=7B parameters show: map‑guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by >= 15x, with key findings revealing that reasoning stagnation manifests as "Cognitive Vortex" and low‑entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological‑aware solution for small model reasoning.

PaperID: 6738, https://arxiv.org/pdf/2512.18189.pdf

Authors: Zihao Deng, Yijia Li, Renrui Zhang, Peijun Ye

Title: NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework

Abstract:
Cognitive computing models offer a formal and interpretable way to characterize human's deliberation and decision‑making, yet their development remains labor‑intensive. In this paper, we propose NL2CA, a novel method for auto‑formalizing cognitive decision‑making rules from natural language descriptions of human experience. Different from most related work that exploits either pure manual or human guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine‑tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulted rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning according to the real‑world behavioral data. Our method is validated in two domains: (1) NL‑to‑LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large‑scale benchmarks without human‑in‑the‑loop feed‑backs, and (2) cognitive driving simulation, where agents automatically constructed from human interviews have successfully learned the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human‑aligned cognitive modeling from unstructured textual data, offering a novel paradigm to automatically design symbolic cognitive agents.

PaperID: 6739, https://arxiv.org/pdf/2512.18112.pdf

Authors: Yunian Pan, Quanyan Zhu

Title: Bayesian Holonic Systems: Equilibrium, Uniqueness, and Computation

Abstract:
This paper addresses the challenge of modeling and control in hierarchical, multi‑agent systems, known as holonic systems, where local agent decisions are coupled with global systemic outcomes. We introduce the Bayesian Holonic Equilibrium (BHE), a concept that ensures consistency between agent‑level rationality and system‑wide emergent behavior. We establish the theoretical soundness of the BHE by showing its existence and, under stronger regularity conditions, its uniqueness. We propose a two‑time scale learning algorithm to compute such an equilibrium. This algorithm mirrors the system's structure, with a fast timescale for intra‑holon strategy coordination and a slow timescale for inter‑holon belief adaptation about external risks. The convergence of the algorithm to the theoretical equilibrium is validated through a numerical experiment on a continuous public good game. This work provides a complete theoretical and algorithmic framework for the principled design and analysis of strategic risk in complex, coupled control systems.

PaperID: 6740, https://arxiv.org/pdf/2512.18094.pdf

Authors: Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong

Title: Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks

Abstract:
Large language models (LLMs) have enabled multi‑agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making communication topology a first‑class design choice. Yet most existing LLM‑based MAS either adopt fully connected graphs, simple sparse rings, or ad‑hoc dynamic selection, with little structural guidance. In this work, we revisit classic theory on small‑world (SW) networks and ask: what changes if we treat SW connectivity as a design prior for MAS? We first bridge insights from neuroscience and complex networks to MAS, highlighting how SW structures balance local clustering and long‑range integration. Using multi‑agent debate (MAD) as a controlled testbed, experiment results show that SW connectivity yields nearly the same accuracy and token cost, while substantially stabilizing consensus trajectories. Building on this, we introduce an uncertainty‑guided rewiring scheme for scaling MAS, where long‑range shortcuts are added between epistemically divergent agents using LLM‑oriented uncertainty signals (e.g., semantic entropy). This yields controllable SW structures that adapt to task difficulty and agent heterogeneity. Finally, we discuss broader implications of SW priors for MAS design, framing them as stabilizers of reasoning, enhancers of robustness, scalable coordinators, and inductive biases for emergent cognitive roles.

PaperID: 6741, https://arxiv.org/pdf/2512.17914.pdf

Authors: Boris Kriuk, Logic Ng

Title: Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Abstract:
Multi‑agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q‑KVComm, a new protocol that enables direct transmission of compressed key‑value (KV) cache representations between LLM agents. Q‑KVComm combines three key innovations: (1) adaptive layer‑wise quantization that allocates variable bit‑widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross‑architecture communication. Extensive experiments across three diverse question‑answering datasets demonstrate that Q‑KVComm achieves 5‑6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B‑1.5B parameters) and adapts to real‑world applications including conversational QA and multi‑hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text‑based to representation‑based information exchange.

PaperID: 6742, https://arxiv.org/pdf/2512.17913.pdf

Authors: Nihir Chadderwala

Title: Byzantine Fault-Tolerant Multi-Agent System for Healthcare: A Gossip Protocol Approach to Secure Medical Message Propagation

Abstract:
Recent advances in generative AI have enabled sophisticated multi‑agent architectures for healthcare, where large language models power collaborative clinical decision‑making. However, these distributed systems face critical challenges in ensuring message integrity and fault tolerance when operating in adversarial or untrusted environments.This paper presents a novel Byzantine fault‑tolerant multi‑agent system specifically designed for healthcare applications, integrating gossip‑based message propagation with cryptographic validation mechanisms. Our system employs specialized AI agents for diagnosis, treatment planning, emergency response, and data analysis, coordinated through a Byzantine consensus protocol that tolerates up to f faulty nodes among n = 3f + 1 total nodes. We implement a gossip protocol for decentralized message dissemination, achieving consensus with 2f + 1 votes while maintaining system operation even under Byzantine failures. Experimental results demonstrate that our approach successfully validates medical messages with cryptographic signatures, prevents replay attacks through timestamp validation, and maintains consensus accuracy of 100% with up to 33% Byzantine nodes. The system provides real‑time visualization of consensus rounds, vote tallies, and network topology, enabling transparent monitoring of fault‑tolerant operations. This work contributes a practical framework for building secure, resilient healthcare multi‑agent systems capable of collaborative medical decision‑making in untrusted environments.

PaperID: 6743, https://arxiv.org/pdf/2512.17896.pdf

Authors: Xinru Wang, Ming Yin, Eunyee Koh, Mustafa Doga Dogan

Title: XAgen: An Explainability Tool for Identifying and Correcting Failures in Multi-Agent Workflows

Abstract:
As multi‑agent systems powered by Large Language Models (LLMs) are increasingly adopted in real‑world workflows, users with diverse technical backgrounds are now building and refining their own agentic processes. However, these systems can fail in opaque ways, making it difficult for users to observe, understand, and correct errors. We conducted formative interviews with 12 practitioners to identify mismatches between existing debugging tools and users' needs. Based on these insights, we designed XAgen, an explainability tool that supports users with varying AI expertise through three core capabilities: log visualization for glanceable workflow understanding, human‑in‑the‑loop feedback to capture expert judgment, and automatic error detection via an LLM‑as‑a‑judge. In a user study with 8 participants, XAgen helped users locate failures more easily, attribute to specific agents or steps, and iteratively improve configurations. Our findings surface human‑centered design guidelines for explainable agentic AI development and highlight opportunities for more context‑aware interactive debugging.

PaperID: 6744, https://arxiv.org/pdf/2512.17776.pdf

Authors: Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

Title: DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Abstract:
Recent advances in large language models have enabled deep research systems that generate expert‑level reports through multi‑step reasoning and evidence‑based synthesis. However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM‑based judges may miss errors that require domain expertise to identify; and because deep research relies on retrieved evidence, report‑wide claim verification is also necessary. To address these issues, we propose DEER, a benchmark for evaluating expert‑level deep research reports. DEER systematizes evaluation criteria with an expert‑developed taxonomy (7 dimensions, 25 subdimensions) operationalized as 101 fine‑grained rubric items. We also provide task‑specific Expert Evaluation Guidance to support LLM‑based judging. Alongside rubric‑based assessment, we propose a claim verification architecture that verifies both cited and uncited claims and quantifies evidence quality. Experiments show that while current deep research systems can produce structurally plausible reports that cite external evidence, there is room for improvement in fulfilling expert‑level user requests and achieving logical completeness. Beyond simple performance comparisons, DEER makes system strengths and limitations interpretable and provides diagnostic signals for improvement.

PaperID: 6745, https://arxiv.org/pdf/2512.17373.pdf

Authors: Zhengmian Hu

Title: Dialectics for Artificial Intelligence

Abstract:
Can artificial intelligence discover, from raw experience and without human supervision, concepts that humans have discovered? One challenge is that human concepts themselves are fluid: conceptual boundaries can shift, split, and merge as inquiry progresses (e.g., Pluto is no longer considered a planet). To make progress, we need a definition of "concept" that is not merely a dictionary label, but a structure that can be revised, compared, and aligned across agents. We propose an algorithmic‑information viewpoint that treats a concept as an information object defined only through its structural relation to an agent's total experience. The core constraint is determination: a set of parts forms a reversible consistency relation if any missing part is recoverable from the others (up to the standard logarithmic slack in Kolmogorov‑style identities). This reversibility prevents "concepts" from floating free of experience and turns concept existence into a checkable structural claim. To judge whether a decomposition is natural, we define excess information, measuring the redundancy overhead introduced by splitting experience into multiple separately described parts. On top of these definitions, we formulate dialectics as an optimization dynamics: as new patches of information appear (or become contested), competing concepts bid to explain them via shorter conditional descriptions, driving systematic expansion, contraction, splitting, and merging. Finally, we formalize low‑cost concept transmission and multi‑agent alignment using small grounds/seeds that allow another agent to reconstruct the same concept under a shared protocol, making communication a concrete compute‑bits trade‑off.

PaperID: 6746, https://arxiv.org/pdf/2512.17187.pdf

Authors: Zhaoqilin Yang, Axin Xiang, Kedi Yang, Tianjun Liu, Youliang Tian

Title: MAPPO-LCR: Multi-Agent Proximal Policy Optimization with Local Cooperation Reward in Spatial Public Goods Games

Abstract:
Spatial public goods games model collective dilemmas where individual payoffs depend on population‑level strategy configurations. Most existing studies rely on evolutionary update rules or value‑based reinforcement learning methods. These approaches struggle to represent payoff coupling and non‑stationarity in large interacting populations. This work introduces Multi‑Agent Proximal Policy Optimization (MAPPO) into spatial public goods games for the first time. In these games, individual returns are intrinsically coupled through overlapping group interactions. Proximal Policy Optimization (PPO) treats agents as independent learners and ignores this coupling during value estimation. MAPPO addresses this limitation through a centralized critic that evaluates joint strategy configurations. To study neighborhood‑level cooperation signals under this framework, we propose MAPPO with Local Cooperation Reward, termed MAPPO‑LCR. The local cooperation reward aligns policy updates with surrounding cooperative density without altering the original game structure. MAPPO‑LCR preserves decentralized execution while enabling population‑level value estimation during training. Extensive simulations demonstrate stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses further confirm the learning advantage of MAPPO over PPO in spatial public goods games.

PaperID: 6747, https://arxiv.org/pdf/2512.17129.pdf

Authors: Seong Ho Pahng, Guoye Guan, Benjamin Fefferman, Sahand Hormoz

Title: DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

Abstract:
Biological systems can form complex three‑dimensional structures through the collective behavior of agents that share a common update rule and operate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi‑agent learning. Here, we introduce DiffeoMorph, an end‑to‑end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an SE(3)‑equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape‑matching loss based on 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and global orientation. To achieve rotation invariance while preserving reflection sensitivity, we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. We perform benchmarking to establish the advantages of our shape‑matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of complex shapes from minimally patterned initial conditions. DiffeoMorph provides a general framework for learning distributed control strategies for morphogenesis, swarm robotics, and programmable self‑assembly.

PaperID: 6748, https://arxiv.org/pdf/2512.17060.pdf

Authors: Monika Zamojska, Jarosław A. Chudziak

Title: On the Role of Contextual Information and Ego States in LLM Agent Behavior for Transactional Analysis Dialogues

Abstract:
LLM‑powered agents are now used in many areas, from customer support to education, and there is increasing interest in their ability to act more like humans. This includes fields such as social, political, and psychological research, where the goal is to model group dynamics and social behavior. However, current LLM agents often lack the psychological depth and consistency needed to capture the real patterns of human thinking. They usually provide direct or statistically likely answers, but they miss the deeper goals, emotional conflicts, and motivations that drive real human interactions. This paper proposes a Multi‑Agent System (MAS) inspired by Transactional Analysis (TA) theory. In the proposed system, each agent is divided into three ego states ‑ Parent, Adult, and Child. The ego states are treated as separate knowledge structures with their own perspectives and reasoning styles. To enrich their response process, they have access to an information retrieval mechanism that allows them to retrieve relevant contextual information from their vector stores. This architecture is evaluated through ablation tests in a simulated dialogue scenario, comparing agents with and without information retrieval. The results are promising and open up new directions for exploring how psychologically grounded structures can enrich agent behavior. The contribution is an agent architecture that integrates Transactional Analysis theory with contextual information retrieval to enhance the realism of LLM‑based multi‑agent simulations.

PaperID: 6749, https://arxiv.org/pdf/2512.17008.pdf

Authors: Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

Title: Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Abstract:
Reinforcement learning (RL) has re‑emerged as a natural approach for training interactive LLM agents in real‑world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi‑turn tasks exposes notable limitations, particularly in scenarios requiring long‑horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi‑turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi‑turn scenarios, we introduce turn‑PPO, a variant that operates on a turn‑level MDP formulation, as opposed to the commonly used token‑level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn‑PPO, both with and without long reasoning components.

PaperID: 6750, https://arxiv.org/pdf/2512.16970.pdf

Authors: Kamer Ali Yuksel

Title: PAACE: A Plan-Aware Automated Agent Context Engineering Framework

Abstract:
Large Language Model (LLM) agents are increasingly deployed in complex, multi‑step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query‑aware compression largely ignores the multi‑step, plan‑aware nature of agentic reasoning. In this work, we introduce PAACE (Plan‑Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next‑k‑task relevance modeling, plan‑structure analysis, instruction co‑refinement, and function‑preserving compression. PAACE comprises (1) PAACE‑Syn, a large‑scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE‑FT, a family of distilled, plan‑aware compressors trained from successful teacher demonstrations. Experiments on long‑horizon benchmarks (AppWorld, OfficeBench, and 8‑Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi‑hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE‑FT retains 97 percent of the teacher's performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan‑aware compression with compact models.

PaperID: 6751, https://arxiv.org/pdf/2512.16925.pdf

Authors: SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju

Title: V-Agent: An Interactive Video Search System Using Vision-Language Models

Abstract:
We introduce V‑Agent, a novel multi‑agent platform designed for advanced video search and interactive user‑system conversations. By fine‑tuning a vision‑language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image‑text retrieval model, we overcome the limitations of traditional text‑based retrieval systems in multimodal scenarios. The VLM‑based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V‑Agent to interpret both visual and spoken content for context‑aware video search. This system consists of three agents‑a routing agent, a search agent, and a chat agent‑that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM‑based retrieval model together with an additional re‑ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state‑of‑the‑art zero‑shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real‑world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal‑embedding.

PaperID: 6752, https://arxiv.org/pdf/2512.16848.pdf

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

Title: Meta-RL Induces Exploration in Language Agents

Abstract:
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi‑turn long‑horizon tasks. However, the RL‑trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial‑and‑error experiences. In this paper, we present LaMer, a general Meta‑RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross‑episode training framework to encourage exploration and long‑term rewards optimization; and (ii) in‑context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL‑trained agents. Overall, our results demonstrate that Meta‑RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

PaperID: 6753, https://arxiv.org/pdf/2512.16813.pdf

Authors: Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella

Title: Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning

Abstract:
Reactive jammers pose a severe security threat to robotic‑swarm networks by selectively disrupting inter‑agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi‑agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter‑receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action‑value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie‑aided optimal policy in a no‑channel‑reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie‑aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.

PaperID: 6754, https://arxiv.org/pdf/2512.16736.pdf

Authors: Xiaofeng Zong, Ming-Yu Wang, Jimin Wang, Ji-Feng Zhang

Title: Observer-based Differentially Private Consensus for Linear Multi-agent Systems

Abstract:
This paper investigates the differentially private consensus problem for general linear multi‑agent systems (MASs) based on output feedback protocols. To protect the output information, which is considered private data and may be at high risk of exposure, Laplace noise is added to the information exchange. The conditions for achieving mean square and almost sure consensus in observer‑based MASs are established using the backstepping method and the convergence theory for nonnegative almost supermartingales. It is shown that the separation principle remains valid for the consensus problem of linear MASs with decaying Laplace noise. Furthermore, the convergence rate is provided. Then, a joint design framework is developed for state estimation gain, feedback control gain, and noise to ensure the preservation of ε‑differential privacy. The output information of each agent is shown to be protected at every time step. Finally, sufficient conditions are established for simultaneously achieving consensus and preserving differential privacy for linear MASs utilizing both full‑order and reduced‑order observers. Meanwhile, an ε‑differentially private consensus is achieved to meet the desired privacy level. Two simulation examples are provided to validate the theoretical results.

PaperID: 6755, https://arxiv.org/pdf/2512.16530.pdf

Authors: Primoz Kocbek, Leon Kopitar, Gregor Stiglic

Title: Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics

Abstract:
This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine‑tuning approach. We selected OpenAI gpt‑4o and gpt‑4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch‑Kincaid grade level, SMOG Index, SARI, and BERTScore, G‑Eval, as well as with qualitative metric, more precisely 5‑point Likert scales for simplicity, accuracy, completeness, brevity. Results showed a superior performance of gpt‑4o‑mini and an underperformance of FT approaches. G‑Eval, a LLM based quantitative metric, showed promising results, ranking the approaches similarly as the qualitative metric.

PaperID: 6756, https://arxiv.org/pdf/2512.16514.pdf

Authors: Diodato Ferraioli, Paolo Penna, Manvir Schneider, Carmine Ventre

Title: Algorithmic Monetary Policies for Blockchain Participation Games

Abstract:
A central challenge in blockchain tokenomics is aligning short‑term performance incentives with long‑term decentralization goals. We propose a framework for algorithmic monetary policies that navigates this tradeoff in repeated participation games. Agents, characterized by type (capability) and stake, choose to participate or abstain at each round; the policy (probabilistically) selects high‑type agents for task execution (maximizing throughput) while distributing rewards to sustain decentralization. We analyze equilibria under two agent behaviors: myopic (short‑term utility maximization) and foresighted (multi‑round planning). For myopic agents, performance‑centric policies risk centralization, but foresight enables stable decentralization with some volatility to the token value. We further discuss virtual stake‑‑a hybrid of type and stake‑‑as an alternative approach. We show that the initial virtual stake distribution critically impacts long‑term outcomes, suggesting that policies must indirectly manage decentralization.

PaperID: 6757, https://arxiv.org/pdf/2512.16433.pdf

Authors: Maeve Madigan, Parameswaran Kamalaruban, Glenn Moynihan, Tom Kempton, David Sutton, Stuart Burrell

Title: Emergent Bias and Fairness in Multi-Agent Decision Systems

Abstract:
Multi‑agent systems have demonstrated the ability to improve performance on a variety of predictive tasks by leveraging collaborative decision making. However, the lack of effective evaluation methodologies has made it difficult to estimate the risk of bias, making deployment of such systems unsafe in high stakes domains such as consumer finance, where biased decisions can translate directly into regulatory breaches and financial loss. To address this challenge, we need to develop fairness evaluation methodologies for multi‑agent predictive systems and measure the fairness characteristics of these systems in the financial tabular domain. Examining fairness metrics using large‑scale simulations across diverse multi‑agent configurations, with varying communication and collaboration mechanisms, we reveal patterns of emergent bias in financial decision‑making that cannot be traced to individual agent components, indicating that multi‑agent systems may exhibit genuinely collective behaviors. Our findings highlight that fairness risks in financial multi‑agent systems represent a significant component of model risk, with tangible impacts on tasks such as credit scoring and income estimation. We advocate that multi‑agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components.

PaperID: 6758, https://arxiv.org/pdf/2512.16325.pdf

Authors: Nan Zhou, Zuxin Li, Fanhang Man, Xuecheng Chen, Susu Xu, Fan Dang, Chaopeng Hong, Yunhao Liu, Xiao-Ping Zhang, Xinlei Chen

Title: QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing

Abstract:
This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non‑dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality‑informed Incentive‑driven multi‑agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief‑aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real‑world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non‑dispatching scenarios and by 10% over state‑of‑the‑art methods. It also reduces reconstruction map errors by 39‑74% across algorithms. By jointly optimizing coverage and reliability via a quality‑informed incentive mechanism, QUIDS enables low‑cost, high‑quality urban monitoring without dedicated infrastructure, applicable to smart‑city scenarios like traffic and environmental sensing.

PaperID: 6759, https://arxiv.org/pdf/2512.16302.pdf

Authors: Zixuan Chen, Chongkai Gao, Lin Shao, Jieqi Shi, Jing Huo, Yang Gao

Title: ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

Abstract:
One‑shot imitation learning (OSIL) offers a promising way to teach robots new skills without large‑scale data collection. However, current OSIL methods are primarily limited to short‑horizon tasks, thus limiting their applicability to complex, long‑horizon manipulations. To address this limitation, we propose ManiLong‑Shot, a novel framework that enables effective OSIL for long‑horizon prehensile manipulation tasks. ManiLong‑Shot structures long‑horizon tasks around physical interaction events, reframing the problem as sequencing interaction‑aware primitives instead of directly imitating continuous trajectories. This primitive decomposition can be driven by high‑level reasoning from a vision‑language model (VLM) or by rule‑based heuristics derived from robot state changes. For each primitive, ManiLong‑Shot predicts invariant regions critical to the interaction, establishes correspondences between the demonstration and the current observation, and computes the target end‑effector pose, enabling effective task execution. Extensive simulation experiments show that ManiLong‑Shot, trained on only 10 short‑horizon tasks, generalizes to 20 unseen long‑horizon tasks across three difficulty levels via one‑shot imitation, achieving a 22.8% relative improvement over the SOTA. Additionally, real‑robot experiments validate ManiLong‑Shot's ability to robustly execute three long‑horizon manipulation tasks via OSIL, confirming its practical applicability.

PaperID: 6760, https://arxiv.org/pdf/2512.16301.pdf

Authors: Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han

Title: Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Abstract:
Large language model (LLM) agents are moving beyond prompting alone. ChatGPT marked the rise of general‑purpose LLM assistants, DeepSeek showed that on‑policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents accumulate persistent memory and reusable skills. Yet the research landscape remains fragmented across post‑training, retrieval, memory, and skill systems. This survey studies these developments under a single notion of \emphadaptation: improving an agent, its tools, or their interaction after pretraining. We organize the field with a four‑paradigm framework spanning agent adaptation and tool adaptation. On the agent side, A1 (tool‑execution‑signaled) and A2 (agent‑output‑signaled) improve the agent itself through supervised fine‑tuning, preference optimization, and reinforcement learning with verifiable rewards. On the tool side, T1 (agent‑agnostic) provides reusable pre‑trained modules any agent can call, while T2 (agent‑supervised) uses the agent's outputs to train memory systems, skill libraries, or lightweight subagents. Using this framework, we review post‑training methods, adaptive memory architectures, and agent skills; compare their trade‑offs in cost, flexibility, and generalization; and summarize evaluation practices across deep research, software development, computer use, and drug discovery. We conclude by outlining open problems in agent‑tool co‑adaptation, continual learning, safety, and efficient deployment.

PaperID: 6761, https://arxiv.org/pdf/2512.16280.pdf

Authors: Gilad Gressel, Rahul Pankajakshan, Shir Rozenfeld, Ling Li, Ivan Franceschini, Krishnashree Achuthan, Yisroel Mirsky

Title: Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams

Abstract:
Romance‑baiting scams have become a major source of financial and emotional harm worldwide. These operations are run by organized crime syndicates that traffic thousands of people into forced labor, requiring them to build emotional intimacy with victims over weeks of text conversations before pressuring them into fraudulent cryptocurrency investments. Because the scams are inherently text‑based, they raise urgent questions about the role of Large Language Models (LLMs) in both current and future automation. We investigate this intersection by interviewing 145 insiders and 5 scam victims, performing a blinded long‑term conversation study comparing LLM scam agents to human operators, and executing an evaluation of commercial safety filters. Our findings show that LLMs are already widely deployed within scam organizations, with 87% of scam labor consisting of systematized conversational tasks readily susceptible to automation. In a week‑long study, an LLM agent not only elicited greater trust from study participants (p=0.007) but also achieved higher compliance with requests than human operators (46% vs. 18% for humans). Meanwhile, popular safety filters detected 0.0% of romance baiting dialogues. Together, these results suggest that romance‑baiting scams may be amenable to full‑scale LLM automation, while existing defenses remain inadequate to prevent their expansion.

PaperID: 6762, https://arxiv.org/pdf/2512.16250.pdf

Authors: Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli

Title: AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Abstract:
Recent multimodal large language models (MLLMs) such as GPT‑4o and Qwen3‑Omni show strong perception but struggle in multi‑speaker, dialogue‑centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio‑video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio‑visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes zero‑shot, guided, and agentic and six task families, including spatio‑temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi‑speaker reasoning and inconsistent behavior under both non‑agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data‑efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self‑evaluation as reward and selective parameter adaptation for data and parameter efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.

PaperID: 6763, https://arxiv.org/pdf/2512.16214.pdf

Authors: Jianming Liu, Ren Zhu, Jian Xu, Kun Ding, Xu-Yao Zhang, Gaofeng Meng, Cheng-Lin Liu

Title: PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving

Abstract:
Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics‑Informed Neural Network (PINNs) introduced end‑to‑end neural network‑based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM‑driven agents and introduce PDE‑Agent, the first toolchain‑augmented multi‑agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE‑Agent leverages the strengths of multi‑agent and multi‑tool collaboration through two key innovations: (1) A Prog‑Act framework with graph memory for multi‑agent collaboration, which enables effective dynamic planning and error correction via dual‑loop mechanisms (localized fixes and global revisions). (2) A Resource‑Pool integrated with a tool‑parameter separation mechanism for multi‑tool collaboration. This centralizes the management of runtime artifacts and resolves inter‑tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving , we develop PDE‑Bench, a multi‑type PDE Benchmark for agent‑based tool collaborative solving, and propose multi‑level metrics for assessing tool coordination. Evaluations verify that PDE‑Agent exhibits superior applicability and performance in complex multi‑step, cross‑step dependent tasks. This new paradigm of toolchain‑augmented multi‑agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.

PaperID: 6764, https://arxiv.org/pdf/2512.16167.pdf

Authors: Shiduo Yang, Jiye Wang, Jiayu Qin, Jianbin Li, Yu Wang, Yuanhe Zhao, Kenan Guo

Title: Ev-Trust: A Strategy Equilibrium Trust Mechanism for Evolutionary Games in LLM-Based Multi-Agent Services

Abstract:
The rapid evolution of the Web toward an agent‑centric paradigm, driven by large language models (LLMs), has enabled autonomous agents to reason, plan, and interact in complex decentralized environments. However, the openness and heterogeneity of LLM‑based multi‑agent systems also amplify the risks of deception, fraud, and misinformation, posing severe challenges to trust establishment and system robustness. To address this issue, we propose Ev‑Trust, a strategy‑equilibrium trust mechanism grounded in evolutionary game theory. This mechanism integrates direct trust, indirect trust, and expected revenue into a dynamic feedback structure that guides agents' behavioral evolution toward equilibria. Within a decentralized "Request‑Response‑Payment‑Evaluation" service framework, Ev‑Trust enables agents to adaptively adjust strategies, naturally excluding malicious participants while reinforcing high‑quality collaboration. Furthermore, our theoretical derivation based on replicator dynamics equations proves the existence and stability of local evolutionary equilibria. Experimental results indicate that our approach effectively reflects agent trustworthiness in LLM‑driven open service interaction scenarios, reduces malicious strategies, and increases collective revenue. We hope Ev‑Trust can provide a new perspective on trust modeling for the agentic service web in group evolutionary game scenarios.

PaperID: 6765, https://arxiv.org/pdf/2512.16139.pdf

Authors: Mengqi Xue, Yuchao Xiong, Yue Song

Title: Consensus Tracking of Perturbed Open Multi-Agent Systems with Repelling Antagonistic Interactions

Abstract:
An open multi‑agent system (OMAS) features migrating agents which produce a flexible network that is naturally switching and size‑varying. Meanwhile, agent migrations also make an OMAS prone to environmental adversities. In this work, we investigate the consensus tracking problem of OMASs suffering migration‑induced adversities, including non‑vanishing agent dynamics/state perturbations and repelling antagonistic interactions among agents, over an intermittently disconnected signed digraph. The OMAS is interpreted into a perturbed multi‑mode multi‑dimensional (M^3D) system in which unstable subsystems are created when repelling interactions dominate the cooperative ones in the network regardless of its connectivity. To handle the destabilizing effect brought by repelling interactions and non‑vanishing perturbations, we extend the stability theory for M^3D systems and apply it to the OMAS to show that ultimately bounded consensus tracking can be achieved if the network switching satisfies the piecewise average dwell time and activation time ratio conditions. Particularly, for vanishing perturbations, asymptotic tracking can be ensured under weaker switching conditions.

PaperID: 6766, https://arxiv.org/pdf/2512.16063.pdf

Authors: Qidi Xu, Nuzha Amjad, Grace Giles, Alexa Cumming, De'angelo Hermesky, Alexander Wen, Min Ji Kwak, Yejin Kim

Title: A Multi-Agent Large Language Model Framework for Automated Qualitative Analysis

Abstract:
Understanding patients experiences is essential for advancing patient centered care, especially in chronic diseases that require ongoing communication. However, qualitative thematic analysis, the primary approach for exploring these experiences, remains labor intensive, subjective, and difficult to scale. In this study, we developed a multi agent large language model framework that automates qualitative thematic analysis through three agents (Instructor, Thematizer, CodebookGenerator), named Collaborative Theme Identification Agent (CoTI). We applied CoTI to 12 heart failure patient interviews to analyze their perceptions of medication intensity. CoTI identified key phrases, themes, and codebook that were more similar to those of the senior investigator than both junior investigators and baseline NLP models. We also implemented CoTI into a user‑facing application to enable AI human interaction in qualitative analysis. However, collaboration between CoTI and junior investigators provided only marginal gains, suggesting they may overrely on CoTI and limit their independent critical thinking.

PaperID: 6767, https://arxiv.org/pdf/2512.15959.pdf

Authors: Armağan Amcalar, Eyup Cinar

Title: BRAID: Bounded Reasoning for Autonomous Inference and Decisions

Abstract:
Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Au tonomous Inference and Decisions) across multiple GPT model tiers, eval uated on the AdvancedIF, GSM‑Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid‑based instruction graphs that enable models to reason struc turally rather than through unbounded natural‑language token expansion. We show that structured machine‑readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.

PaperID: 6768, https://arxiv.org/pdf/2512.15930.pdf

Authors: Matthew Sinclair, Moeen Meigooni, Archit Vasan, Ozan Gokdemir, Xinran Lian, Heng Ma, Yadu Babuji, Alexander Brace, Khalid Hossain, Carlo Siebenschuh, Thomas Brettin, Kyle Chard, Christopher Henry, Venkatram Vishwanath, Rick L. Stevens, Ian T. Foster, Arvind Ramanathan

Title: Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

Abstract:
Intrinsically disordered proteins (IDPs) represent crucial therapeutic targets due to their significant role in disease ‑‑ approximately 80% of cancer‑related proteins contain long disordered regions ‑‑ but their lack of stable secondary/tertiary structures makes them "undruggable". While recent computational advances, such as diffusion models, can design high‑affinity IDP binders, translating these to practical drug discovery requires autonomous systems capable of reasoning across complex conformational ensembles and orchestrating diverse computational tools at scale.To address this challenge, we designed and implemented StructBioReasoner, a scalable multi‑agent system for designing biologics that can be used to target IDPs. StructBioReasoner employs a novel tournament‑based reasoning framework where specialized agents compete to generate and refine therapeutic hypotheses, naturally distributing computational load for efficient exploration of the vast design space. Agents integrate domain knowledge with access to literature synthesis, AI‑structure prediction, molecular simulations, and stability analysis, coordinating their execution on HPC infrastructure via an extensible federated agentic middleware, Academy. We benchmark StructBioReasoner across Der f 21 and NMNAT‑2 and demonstrate that over 50% of 787 designed and validated candidates for Der f 21 outperformed the human‑designed reference binders from literature, in terms of improved binding free energy. For the more challenging NMNAT‑2 protein, we identified three binding modes from 97,066 binders, including the well‑studied NMNAT2:p53 interface. Thus, StructBioReasoner lays the groundwork for agentic reasoning systems for IDP therapeutic discovery on Exascale platforms.

PaperID: 6769, https://arxiv.org/pdf/2512.15892.pdf

Authors: Artem Grigor, Christian Schroeder de Witt, Simon Birnbach, Ivan Martinovic

Title: VET Your Agent: Towards Host-Independent Autonomy via Verifiable Execution Traces

Abstract:
Recent advances in large language models (LLMs) have enabled a new generation of autonomous agents that operate over sustained periods and manage sensitive resources on behalf of users. Trusted for their ability to act without direct oversight, such agents are increasingly considered in high‑stakes domains including financial management, dispute resolution, and governance. Yet in practice, agents execute on infrastructure controlled by a host, who can tamper with models, inputs, or outputs, undermining any meaningful notion of autonomy. We address this gap by introducing VET (Verifiable Execution Traces), a formal framework that achieves host‑independent authentication of agent outputs and takes a step toward host‑independent autonomy. Central to VET is the Agent Identity Document (AID), which specifies an agent's configuration together with the proof systems required for verification. VET is compositional: it supports multiple proof mechanisms, including trusted hardware, succinct cryptographic proofs, and notarized TLS transcripts (Web Proofs). We implement VET for an API‑based LLM agent and evaluate our instantiation on realistic workloads. We find that for today's black‑box, secret‑bearing API calls, Web Proofs appear to be the most practical choice, with overhead typically under 3× compared to direct API calls, while for public API calls, a lower‑overhead TEE Proxy is often sufficient. As a case study, we deploy a verifiable trading agent that produces proofs for each decision and composes Web Proofs with a TEE Proxy. Our results demonstrate that practical, host‑agnostic authentication is already possible with current technology, laying the foundation for future systems that achieve full host‑independent autonomy.

PaperID: 6770, https://arxiv.org/pdf/2512.15790.pdf

Authors: Akhil Sharma, Shaikh Yaser Arafat, Jai Kumar Sharma, Ken Huang

Title: Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi-Agent Architectures (XAMT)

Abstract:
The increasing operational reliance on complex Multi‑Agent Systems (MAS) across safety‑critical domains necessitates rigorous adversarial robustness assessment. Modern MAS are inherently heterogeneous, integrating conventional Multi‑Agent Reinforcement Learning (MARL) with emerging Large Language Model (LLM) agent architectures utilizing Retrieval‑Augmented Generation (RAG). A critical shared vulnerability is reliance on centralized memory components: the shared Experience Replay (ER) buffer in MARL and the external Knowledge Base (K) in RAG agents. This paper proposes XAMT (Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi‑Agent Architectures), a novel framework that formalizes attack generation as a bilevel optimization problem. The Upper Level minimizes perturbation magnitude (delta) to enforce covertness while maximizing system behavior divergence toward an adversary‑defined target (Lower Level). We provide rigorous mathematical instantiations for CTDE MARL algorithms and RAG‑based LLM agents, demonstrating that bilevel optimization uniquely crafts stealthy, minimal‑perturbation poisons evading detection heuristics. Comprehensive experimental protocols utilize SMAC and SafeRAG benchmarks to quantify effectiveness at sub‑percent poison rates (less than or equal to 1 percent in MARL, less than or equal to 0.1 percent in RAG). XAMT defines a new unified class of training‑time threats essential for developing intrinsically secure MAS, with implications for trust, formal verification, and defensive strategies prioritizing intrinsic safety over perimeter‑based detection.

PaperID: 6771, https://arxiv.org/pdf/2512.15787.pdf

Authors: Przemek Pospieszny, Dominika P. Brodowicz

Title: Toward Agentic Environments: GenAI and the Convergence of AI, Sustainability, and Human-Centric Spaces

Abstract:
In recent years, advances in artificial intelligence (AI), particularly generative AI (GenAI) and large language models (LLMs), have made human‑computer interactions more frequent, efficient, and accessible across sectors ranging from banking to healthcare. AI tools embedded in digital devices support decision‑making and operational management at both individual and organizational levels, including resource allocation, workflow automation, and real‑time data analysis. However, the prevailing cloud‑centric deployment of AI carries a substantial environmental footprint due to high computational demands. In this context, this paper introduces the concept of agentic environments, a sustainability‑oriented AI framework that extends beyond reactive systems by leveraging GenAI, multi‑agent systems, and edge computing to reduce the environmental impact of technology. Agentic environments enable more efficient resource use, improved quality of life, and sustainability‑by‑design, while simultaneously enhancing data privacy through decentralized, edge‑driven solutions. Drawing on secondary research as well as primary data from focus groups and semi‑structured interviews with AI professionals from leading technology companies, the paper proposes a conceptual framework for agentic environments examined through three lenses: the personal sphere, professional and commercial use, and urban operations. The findings highlight the potential of agentic environments to foster sustainable ecosystems through optimized resource utilization and strengthened data privacy. The study concludes with recommendations for edge‑driven deployment models to reduce reliance on energy‑intensive cloud infrastructures.

PaperID: 6772, https://arxiv.org/pdf/2512.15784.pdf

Authors: Zibin Liu, Cheng Zhang, Xi Zhao, Yunfei Feng, Bingyu Bai, Dahu Feng, Erhu Feng, Yubin Xia, Haibo Chen

Title: Beyond Training: Enabling Self-Evolution of Agents with MOBIMEM

Abstract:
Large Language Model (LLM) agents are increasingly deployed to automate complex workflows in mobile and desktop environments. However, current model‑centric agent architectures struggle to self‑evolve post‑deployment: improving personalization, capability, and efficiency typically requires continuous model retraining/fine‑tuning, which incurs prohibitive computational overheads and suffers from an inherent trade‑off between model accuracy and inference efficiency. To enable iterative self‑evolution without model retraining, we propose MOBIMEM, a memory‑centric agent system. MOBIMEM first introduces three specialized memory primitives to decouple agent evolution from model weights: (1) Profile Memory uses a lightweight distance‑graph (DisGraph) structure to align with user preferences, resolving the accuracy‑latency trade‑off in user profile retrieval; (2) Experience Memory employs multi‑level templates to instantiate execution logic for new tasks, ensuring capability generalization; and (3) Action Memory records fine‑grained interaction sequences, reducing the reliance on expensive model inference. Building upon this memory architecture, MOBIMEM further integrates a suite of OS‑inspired services to orchestrate execution: a scheduler that coordinates parallel sub‑task execution and memory operations; an agent record‑and‑replay (AgentRR) mechanism that enables safe and efficient action reuse; and a context‑aware exception handling that ensures graceful recovery from user interruptions and runtime errors. Evaluation on AndroidWorld and top‑50 apps shows that MOBIMEM achieves 83.1% profile alignment with 23.83 ms retrieval time (280x faster than GraphRAG baselines), improves task success rates by up to 50.3%, and reduces end‑to‑end latency by up to 9x on mobile devices.

PaperID: 6773, https://arxiv.org/pdf/2512.15758.pdf

Authors: Varshini Krishnamurthy

Title: Yantra AI -- An intelligence platform which interacts with manufacturing operations

Abstract:
Industry 4.0 is growing quickly, which has changed smart production by encouraging the use of real‑time tracking, machine learning, and AI‑driven systems to make operations run more smoothly. The main focus of this dissertation is on creating and testing an intelligent production system for XRIT that solves important problems like energy management, predictive maintenance, and AI‑powered decision support. Machine learning models are built into the system, such as the Random Forest Classifier for proactive maintenance and the Isolation Forest for finding outliers. These models help with decision‑making and reducing downtime. Streamlit makes real‑time data visualisation possible, giving workers access to dashboards that they can interact with and see real‑time observations.The system was tested with fake data and is made to be scalable, so it can be used in real time in XRIT's production setting. Adding an AI‑powered virtual assistant made with GPT‑4 lets workers get real‑time, useful information that makes complicated questions easier to answer and improves operational decisions. The testing shows that the system makes working efficiency, energy management, and the ability to plan repairs much better. Moving the system to real‑time data merging and looking for other ways to make it better will be the main focus of future work.

PaperID: 6774, https://arxiv.org/pdf/2512.15736.pdf

Authors: S. K. Rithvik

Title: Anubuddhi: A Multi-Agent AI System for Designing and Simulating Quantum Optics Experiments

Abstract:
We present Anubuddhi, a multi‑agent AI system that designs and simulates quantum optics experiments from natural language prompts without requiring specialized programming knowledge. The system composes optical layouts by arranging components from a three‑tier toolbox via semantic retrieval, then validates designs through physics simulation with convergent refinement. The architecture combines intent routing, knowledge‑augmented generation, and dual‑mode validation (QuTiP and FreeSim). We evaluated 13 experiments spanning fundamental optics (Hong‑Ou‑Mandel interference, Michelson/Mach‑Zehnder interferometry, Bell states, delayed‑choice quantum eraser), quantum information protocols (BB84 QKD, Franson interferometry, GHZ states, quantum teleportation, hyperentanglement), and advanced technologies (boson sampling, electromagnetically induced transparency, frequency conversion). The system achieves design‑simulation alignment scores of 8‑‑9/10, with simulations faithfully modeling intended physics. A critical finding distinguishes structural correctness from quantitative accuracy: high alignment confirms correct physics architecture, while numerical predictions require expert review. Free‑form simulation outperformed constrained frameworks for 11/13 experiments, revealing that quantum optics diversity demands flexible mathematical representations. The system democratizes computational experiment design for research and pedagogy, producing strong initial designs users can iteratively refine through conversation.

PaperID: 6775, https://arxiv.org/pdf/2512.15728.pdf

Authors: Yuhan Hou, Tianji Rao, Jeremy Tan, Adler Viton, Xiyue Zhang, David Ye, Abhishek Kodi, Sanjana Dulam, Aditya Paul, Yikai Feng

Title: FedSight AI: Multi-Agent System Architecture for Federal Funds Target Rate Prediction

Abstract:
The Federal Open Market Committee (FOMC) sets the federal funds rate, shaping monetary policy and the broader economy. We introduce \emphFedSight AI, a multi‑agent framework that uses large language models (LLMs) to simulate FOMC deliberations and predict policy outcomes. Member agents analyze structured indicators and unstructured inputs such as the Beige Book, debate options, and vote, replicating committee reasoning. A Chain‑of‑Draft (CoD) extension further improves efficiency and accuracy by enforcing concise multistage reasoning. Evaluated at 2023‑2024 meetings, FedSight CoD achieved accuracy of 93.75% and stability of 93.33%, outperforming baselines including MiniFed and Ordinal Random Forest (RF), while offering transparent reasoning aligned with real FOMC communications.

PaperID: 6776, https://arxiv.org/pdf/2512.15710.pdf

Authors: Shuai Liu, Yiqing Tian, Yang Chen, Mar Canet Sola

Title: Artism: AI-Driven Dual-Engine System for Art Generation and Critique

Abstract:
This paper proposes a dual‑engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi‑agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI‑driven critical loops, offering new possibilities for computational analysis of art.

PaperID: 6777, https://arxiv.org/pdf/2512.15398.pdf

Authors: Zanxiang He, Meng Li, Liyun Shi, Weiye Daia, Liming Nie

Title: Mapis: A Knowledge-Graph Grounded Multi-Agent Framework for Evidence-Based PCOS Diagnosis

Abstract:
Polycystic Ovary Syndrome (PCOS) constitutes a significant public health issue affecting 10% of reproductive‑aged women, highlighting the critical importance of developing effective diagnostic tools. Previous machine learning and deep learning detection tools are constrained by their reliance on large‑scale labeled data and an lack of interpretability. Although multi‑agent systems have demonstrated robust capabilities, the potential of such systems for PCOS detection remains largely unexplored. Existing medical multi‑agent frameworks are predominantly designed for general medical tasks, suffering from insufficient domain integration and a lack of specific domain knowledge. To address these challenges, we propose Mapis, the first knowledge‑grounded multi‑agent framework explicitly designed for guideline‑based PCOS diagnosis. Specifically, it built upon the 2023 International Guideline into a structured collaborative workflow that simulates the clinical diagnostic process. It decouples complex diagnostic tasks across specialized agents: a gynecological endocrine agent and a radiology agent collaborative to verify inclusion criteria, while an exclusion agent strictly rules out other causes. Furthermore, we construct a comprehensive PCOS knowledge graph to ensure verifiable, evidence‑based decision‑making. Extensive experiments on public benchmarks and specialized clinical datasets, benchmarking against nine diverse baselines, demonstrate that Mapis significantly outperforms competitive methods. On the clinical dataset, it surpasses traditional machine learning models by 13.56%, single‑agent by 6.55%, and previous medical multi‑agent systems by 7.05% in Accuracy.

PaperID: 6778, https://arxiv.org/pdf/2512.15386.pdf

Authors: Arnau Barrera Roy, Albert Clapés Sintes

Title: See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball

Abstract:
Computer vision and video understanding have transformed sports analytics by enabling large‑scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self‑curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state‑of‑the‑art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi‑agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real‑time automated broadcasting and post‑game analysis tools to support decision‑making.

PaperID: 6779, https://arxiv.org/pdf/2512.15363.pdf

Authors: Zixin Wei, Yucan Guo, Jinyang Li, Xiaolin Han, Xiaolong Jin, Chenhao Ma

Title: Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution

Abstract:
The search for suitable datasets is the critical "first step" in data‑driven research, but it remains a great challenge. Researchers often need to search for datasets based on high‑level task descriptions. However, existing search systems struggle with this task due to ambiguous user intent, task‑to‑dataset mapping and benchmark gaps, and entity ambiguity. To address these challenges, we introduce KATS, a novel end‑to‑end system for task‑oriented dataset search from unstructured scientific literature. KATS consists of two key components, i.e., offline knowledge base construction and online query processing. The sophisticated offline pipeline automatically constructs a high‑quality, dynamically updatable task‑dataset knowledge graph by employing a collaborative multi‑agent framework for information extraction, thereby filling the task‑to‑dataset mapping gap. To further address the challenge of entity ambiguity, a unique semantic‑based mechanism is used for task entity linking and dataset entity resolution. For online retrieval, KATS utilizes a specialized hybrid query engine that combines vector search with graph‑based ranking to generate highly relevant results. Additionally, we introduce CS‑TDS, a tailored benchmark suite for evaluating task‑oriented dataset search systems, addressing the critical gap in standardized evaluation. Experiments on our benchmark suite show that KATS significantly outperforms state‑of‑the‑art retrieval‑augmented generation frameworks in both effectiveness and efficiency, providing a robust blueprint for the next generation of dataset discovery systems.

PaperID: 6780, https://arxiv.org/pdf/2512.15358.pdf

Authors: Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li, Kam-Fai Wong

Title: Dual-Density Inference for Efficient Language Model Reasoning

Abstract:
Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. Our observation found that reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol‑rich language for intermediate computations while maintaining human‑readable final explanations. To address this inefficiency, we present Denser: \underlineDual‑d\underlineensity inf\underlineerence, a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high‑density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human‑readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain‑of‑Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi‑step reasoning problems where traditional methods generate extensive explanations.

PaperID: 6781, https://arxiv.org/pdf/2512.15310.pdf

Authors: Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao

Title: SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.

PaperID: 6782, https://arxiv.org/pdf/2512.15231.pdf

Authors: Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

Title: CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

Abstract:
The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task‑specific, lacking a unified framework to manage diverse, end‑to‑end workflows‑‑from data preprocessing to advanced interpretation‑‑across diverse RS applications. To address this gap, this paper introduces CangLing‑KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert‑validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general‑purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing‑KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing‑KnowFlow on the KnowFlow‑Bench, a novel benchmark of 324 workflows inspired by real‑world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open‑source to commercial. Across all complex tasks, CangLing‑KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first most comprehensive validation along this emerging field, this research demonstrates the great potential of CangLing‑KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

PaperID: 6783, https://arxiv.org/pdf/2512.15144.pdf

Authors: Mengying Wu, Pei Chen, Geng Hong, Baichao An, Jinsong Chen, Binwang Wan, Xudong Pan, Jiarun Dai, Min Yang

Title: MCPZoo: A Large-Scale Dataset of Runnable Model Context Protocol Servers for AI Agent

Abstract:
Model Context Protocol (MCP) enables agents to interact with external tools, yet empirical research on MCP is hindered by the lack of large‑scale, accessible datasets. We present MCPZoo, the largest and most comprehensive dataset of MCP servers collected from multiple public sources, comprising 129,059 servers (56,053 distinct). MCPZoo includes 16,356 server instances that have been deployed and verified as runnable and interactable, supporting realistic experimentation beyond static analysis. The dataset provides unified metadata and access interfaces, enabling systematic exploration and interaction without manual deployment effort. MCPZoo is released as an open and accessible resource to support research on MCP‑based systems and security analysis.

PaperID: 6784, https://arxiv.org/pdf/2512.15119.pdf

Authors: Jiayang Wan, Ke He, Yafei Wang, Fan Liu, Wenjin Wang, Shi Jin

Title: QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management

Abstract:
Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable threedimensional coverage. Towards that end, the space‑air‑ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi‑objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two‑level multi‑agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q‑network (DDQN) algorithm in the top‑level, which achieves stable and high‑quality policy learning through double Q‑value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum‑entropy mechanism of the soft actor‑critic (SAC) and employ a Lagrangian‑based constrained SAC (CSAC) algorithm in the lower‑level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi‑UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.

PaperID: 6785, https://arxiv.org/pdf/2512.14990.pdf

Authors: Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh

Title: Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Abstract:
Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)‑based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning‑enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate‑validate‑refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real‑world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state‑of‑the‑art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants' cognitive load.

PaperID: 6786, https://arxiv.org/pdf/2512.14910.pdf

Authors: Nadine Angela Cantonjos, Arpita Biswas

Title: AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers' Enquiries Globally

Abstract:
Agricultural regions in rural areas face damage from climate‑related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk‑management solutions and decision‑making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single‑agent models or have used multi‑agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context‑aware outputs. To bridge this gap, we present AgroAskAI, a multi‑agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role‑specialized architecture that uses a chain‑of‑responsibility approach to coordinate autonomous agents, integrating real‑time tools and datasets. The system has built‑in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non‑English‑speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.

PaperID: 6787, https://arxiv.org/pdf/2512.14895.pdf

Authors: Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da

Title: Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Abstract:
A popular paradigm for training LM agents relies on imitation learning, fine‑tuning on expert trajectories. However, we show that the off‑policy nature of imitation learning for multi‑turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine‑tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi‑turn LLM training. We introduce on‑policy expert corrections (OECs), partially on‑policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi‑turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on‑policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine‑tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE‑bench verified. Our results demonstrate the need for combining expert demonstrations with on‑policy data for effective multi‑turn LM agent training.

PaperID: 6788, https://arxiv.org/pdf/2512.14846.pdf

Authors: Arth Bhardwaj, Sia Godika, Yuvam Loonker

Title: MALCDF: A Distributed Multi-Agent LLM Framework for Real-Time Cyber

Abstract:
Traditional, centralized security tools often miss adaptive, multi‑vector attacks. We present the Multi‑Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents‑Detection, Intelligence, Response, and Analysis‑work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology‑aligned messages, and produce audit‑friendly outputs (e.g., MITRE ATT&CK mappings). For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50‑record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML‑IDS baseline is a Lightweight Random Forest IDS (LRF‑IDS) trained on a subset of CICIDS2017 and tested on the 50‑record stream, with no overlap between training and test records. In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1‑score, and 9.1% false‑positive rate, with 6.8s average per‑event latency. It outperforms the lightweight ML‑IDS baseline and a single‑LLM setup on accuracy while keeping end‑to‑end outputs consistent. Overall, this hands‑on build suggests that coordinating simple LLM agents with secure, ontology‑aligned messaging can improve practical, real‑time cyber defense.

PaperID: 6789, https://arxiv.org/pdf/2512.14708.pdf

Authors: Mustapha Hamdi

Title: SGEMAS: A Self-Growing Ephemeral Multi-Agent System for Unsupervised Online Anomaly Detection via Entropic Homeostasis

Abstract:
Current deep learning approaches for physiological signal monitoring suffer from static topologies and constant energy consumption. We introduce SGEMAS (Self‑Growing Ephemeral Multi‑Agent System), a bio‑inspired architecture that treats intelligence as a dynamic thermodynamic process. By coupling a structural plasticity mechanism (agent birth death) to a variational free energy objective, the system naturally evolves to minimize prediction error with extreme sparsity. An ablation study on the MIT‑BIH Arrhythmia Database reveals that adding a multi‑scale instability index to the agent dynamics significantly improves performance. In a challenging inter‑patient, zero‑shot setting, the final SGEMAS v3.3 model achieves a mean AUC of 0.570 +‑ 0.070, outperforming both its simpler variants and a standard autoencoder baseline. This result validates that a physics‑based, energy‑constrained model can achieve robust unsupervised anomaly detection, offering a promising direction for efficient biomedical AI.

PaperID: 6790, https://arxiv.org/pdf/2512.14622.pdf

Authors: Ostap Vykhopen, Viktoria Skorik, Maksym Tereshchenko, Veronika Solopova

Title: Beyond Text-to-SQL: Autonomous Research-Driven Database Exploration with DAR

Abstract:
Large language models can already query databases, yet most existing systems remain reactive: they rely on explicit user prompts and do not actively explore data. We introduce DAR (Data Agnostic Researcher), a multi‑agent system that performs end‑to‑end database research without human‑initiated queries. DAR orchestrates specialized AI agents across three layers: initialization (intent inference and metadata extraction), execution (SQL and AI‑based query synthesis with iterative validation), and synthesis (report generation with built‑in quality control). All reasoning is executed directly inside BigQuery using native generative AI functions, eliminating data movement and preserving data governance. On a realistic asset‑incident dataset, DAR completes the full analytical task in 16 minutes, compared to 8.5 hours for a professional analyst (approximately 32x times faster), while producing useful pattern‑based insights and evidence‑grounded recommendations. Although human experts continue to offer deeper contextual interpretation, DAR excels at rapid exploratory analysis. Overall, this work shifts database interaction from query‑driven assistance toward autonomous, research‑driven exploration within cloud data warehouses.

PaperID: 6791, https://arxiv.org/pdf/2512.14554.pdf

Authors: Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu

Title: VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

Abstract:
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, the Vietnamese Legal Benchmark (VLegal‑Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal‑Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross‑validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real‑world legal assistant workflows, including general legal questions and answers, retrieval‑augmented generation, multi‑step reasoning, and scenario‑based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal‑Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI‑assisted legal systems. To facilitate access and reproducibility, we provide a public landing page for this benchmark at https://vilegalbench.cmcai.vn/.

PaperID: 6792, https://arxiv.org/pdf/2512.14503.pdf

Authors: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, Zhujin Gao, Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin Yuan, Yuliang Yan, Zile Zhou

Title: RecGPT-V2 Technical Report

Abstract:
Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT‑V1 successfully pioneered this paradigm by integrating LLM‑based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed‑template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome‑focused evaluation that fails to match human standards. To address these challenges, we present RecGPT‑V2 with four key innovations. First, a Hierarchical Multi‑Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user‑behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta‑Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi‑reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent‑as‑a‑Judge framework decomposes assessment into multi‑step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT‑V2 establishes both the technical feasibility and commercial viability of deploying LLM‑powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.

PaperID: 6793, https://arxiv.org/pdf/2512.14474.pdf

Authors: Annu Rana, Gaurav Kumar

Title: Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling

Abstract:
Large Language Models (LLMs) often struggle with complex multi‑step planning tasks, showing high rates of constraint violations and inconsistent solutions. Existing strategies such as Chain‑of‑Thought and ReAct rely on implicit state tracking and lack an explicit problem representation. Inspired by classical AI planning, we propose Model‑First Reasoning (MFR), a two‑phase paradigm in which the LLM first constructs an explicit model of the problem, defining entities, state variables, actions, and constraints, before generating a solution plan. Across multiple planning domains, including medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis, MFR reduces constraint violations and improves solution quality compared to Chain‑of‑Thought and ReAct. Ablation studies show that the explicit modeling phase is critical for these gains. Our results suggest that many LLM planning failures stem from representational deficiencies rather than reasoning limitations, highlighting explicit modeling as a key component for robust and interpretable AI agents. All prompts, evaluation procedures, and task datasets are documented to facilitate reproducibility.

PaperID: 6794, https://arxiv.org/pdf/2512.14448.pdf

Authors: Xingfu Zhou, Pengfei Wang

Title: Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space

Abstract:
Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high‑stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process‑oriented attack surface: the agent's reasoning style. We propose Reasoning‑Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones‑‑specifically "analysis paralysis" or "cognitive haste"‑‑without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self‑confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state‑of‑the‑art content filters. Finally, we propose RSP‑M, a lightweight runtime monitor that calculates RSV metrics in real‑time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process‑level defenses beyond static content analysis.

PaperID: 6795, https://arxiv.org/pdf/2512.14321.pdf

Authors: Xudong Han, Xianglun Gao, Xiaoyi Qu, Zhenyu Yu

Title: Multi-Agent Medical Decision Consensus Matrix System: An Intelligent Collaborative Framework for Oncology MDT Consultations

Abstract:
Multidisciplinary team (MDT) consultations are the gold standard for cancer care decision‑making, yet current practice lacks structured mechanisms for quantifying consensus and ensuring decision traceability. We introduce a Multi‑Agent Medical Decision Consensus Matrix System that deploys seven specialized large language model agents, including an oncologist, a radiologist, a nurse, a psychologist, a patient advocate, a nutritionist and a rehabilitation therapist, to simulate realistic MDT workflows. The framework incorporates a mathematically grounded consensus matrix that uses Kendall's coefficient of concordance to objectively assess agreement. To further enhance treatment recommendation quality and consensus efficiency, the system integrates reinforcement learning methods, including Q‑Learning, PPO and DQN. Evaluation across five medical benchmarks (MedQA, PubMedQA, DDXPlus, MedBullets and SymCat) shows substantial gains over existing approaches, achieving an average accuracy of 87.5% compared with 83.8% for the strongest baseline, a consensus achievement rate of 89.3% and a mean Kendall's W of 0.823. Expert reviewers rated the clinical appropriateness of system outputs at 8.9/10. The system guarantees full evidence traceability through mandatory citations of clinical guidelines and peer‑reviewed literature, following GRADE principles. This work advances medical AI by providing structured consensus measurement, role‑specialized multi‑agent collaboration and evidence‑based explainability to improve the quality and efficiency of clinical decision‑making.

PaperID: 6796, https://arxiv.org/pdf/2512.14079.pdf

Authors: Mayank Singh, Vikas Yadav, Shiva Krishna Reddy Malay, Shravan Nayak, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, Eduardo Blanco

Title: Grammar Search for Multi-Agent Systems

Abstract:
Automatic search for Multi‑Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM‑based free‑form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost‑efficient search process and the generation of modular, interpretable multi‑agent systems with simpler logic.

PaperID: 6797, https://arxiv.org/pdf/2512.14012.pdf

Authors: Ruanqianqian Huang, Avery Reyna, Sorin Lerner, Haijun Xia, Brian Hempel

Title: Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025

Abstract:
The rise of AI agents is transforming how software can be built. The promise of agents is that developers might write code quicker, delegate multiple tasks to different agents, and even write a full piece of software purely out of natural language. In reality, what roles agents play in professional software development remains in question. This paper investigates how experienced developers use agents in building software, including their motivations, strategies, task suitability, and sentiments. Through field observations (N=13) and qualitative surveys (N=99), we find that while experienced developers value agents as a productivity boost, they retain their agency in software design and implementation out of insistence on fundamental software quality attributes, employing strategies for controlling agent behavior leveraging their expertise. In addition, experienced developers feel overall positive about incorporating agents into software development given their confidence in complementing the agents' limitations. Our results shed light on the value of software development best practices in effective use of agents, suggest the kinds of tasks for which agents may be suitable, and point towards future opportunities for better agentic interfaces and agentic use guidelines.

PaperID: 6798, https://arxiv.org/pdf/2512.13982.pdf

Authors: Dereje Shenkut, Vijayakumar Bhagavatula

Title: FocalComm: Hard Instance-Aware Multi-Agent Perception

Abstract:
Multi‑agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, via robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety‑critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard‑instance‑oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance‑oriented features per agent, and (2) a query‑based feature‑level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state‑of‑the‑art collaborative perception methods on two challenging real‑world datasets (V2X‑Real and DAIR‑V2X) across both vehicle‑centric and infrastructure‑centric collaborative setups. FocalComm also shows a strong performance gain in pedestrian detection in V2X‑Real.

PaperID: 6799, https://arxiv.org/pdf/2512.13956.pdf

Authors: Zishan Bai, Hanxuan Chen, Jing Luo, Ziyi Ni, Enze Ge, Jiacheng Shi, Yichao Zhang, Jiayi Gu, Zhimo Han, Riyang Bao, Junfeng Hao

Title: AOI: Context-Aware Multi-Agent Operations via Dynamic Scheduling and Hierarchical Memory Compression

Abstract:
The proliferation of cloud‑native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI‑Oriented Operations), a novel multi‑agent collaborative framework that integrates three specialized agents with an LLM‑based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real‑time system states, (2) a three‑layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on synthetic and real‑world benchmarks show that AOI achieves 72.4% context compression while preserving 92.8% critical information, improves task success to 94.2%, and reduces MTTR by 34.4% over the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context‑aware autonomous operations, enabling robust management of next‑generation IT infrastructures with minimal human intervention.

PaperID: 6800, https://arxiv.org/pdf/2512.13930.pdf

Authors: Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby, Wilton J. M. Kort-Kamp

Title: Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Functional Materials Discovery

Abstract:
Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic‑structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher‑level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi‑agent approaches: peer review, triage‑ranking, and triage‑forms. Across two chemical applications, CO adsorption on Cu‑surface transition metal (M) adatoms and on M‑N‑C catalysts, reasoning‑driven exploration reduces required atomistic simulations by up to 90% relative to trial‑and‑error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi‑agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.

PaperID: 6801, https://arxiv.org/pdf/2512.13898.pdf

Authors: Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, Samy Jelassi

Title: Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

Abstract:
Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long‑context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference‑time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi‑step reasoning. Through controlled experiments on sandbox long‑context tasks, we find that such inference‑time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self‑attention. Further, we show that current inference‑time strategies cannot retrieve relevant long‑context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self‑attention. We find that this shift in how inference‑time compute is spent leads to consistently large performance improvements across models and long‑context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3‑4B on average across subsets of LongBench‑v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context‑specific training is a better use of inference compute than current inference‑time scaling strategies like producing more thinking tokens.

PaperID: 6802, https://arxiv.org/pdf/2512.13857.pdf

Authors: Kamer Ali Yuksel

Title: EvoLattice: Persistent Internal-Population Evolution through Multi-Alternative Quality-Diversity Graph Representations for LLM-Guided Program Discovery

Abstract:
Large language models (LLMs) are increasingly used to evolve programs and multi‑agent systems, yet most existing approaches rely on overwrite‑based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine‑grained alternative‑level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data‑driven feedback signal for LLM‑guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self‑repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub‑agent behaviors. Across program synthesis (proxy and optimizer meta‑learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM‑guided methods. The resulting dynamics resemble quality‑diversity optimization, emerging implicitly from EvoLattice's internal multi‑alternative representation rather than an explicit external archive.

PaperID: 6803, https://arxiv.org/pdf/2512.13704.pdf

Authors: Doohee You, Sundeep Paul

Title: Adjudicator: Correcting Noisy Labels with a KG-Informed Council of LLM Agents

Abstract:
The performance of production machine learning systems is fundamentally limited by the quality of their training data. In high‑stakes industrial applications, noisy labels can degrade performance and erode user trust. This paper presents Adjudicator, a system that addresses the critical data mining challenge of automatically identifying and correcting label noise and has been validated for production deployment. Adjudicator models this as a neuro‑symbolic task, first constructing a dynamic Knowledge Graph (KG) to unify item context. This KG then informs a "Council of Agents," a novel multi‑agent Large Language Model architecture where specialized agents debate and vote on a label's validity. We validate our system on a 1,000‑item balanced subset of the AlleNoise benchmark. Our KG‑informed model achieves a 0.99 F1‑score, significantly outperforming a single‑LLM baseline (0.48 F1) and a non‑KG council (0.59 F1). Our analysis reveals this is due to a Precision, achieved by a novel override logic that uses the KG to perfectly identify complex, structural errors (complete Recall) ‑‑ a class of errors that baselines fail to find. This result demonstrates a robust and explainable system for automated, high‑precision data verification, serving as a vital proof‑of‑concept for generating golden datasets in strictly governed industrial environments.

PaperID: 6804, https://arxiv.org/pdf/2512.13564.pdf

Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan

Title: Memory in the Age of AI Agents

Abstract:
Memory has emerged, and will continue to remain, a core capability of foundation model‑based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short‑term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up‑to‑date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token‑level, parametric, and latent memory. From the perspective of functions, we propose a finer‑grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open‑source frameworks. Beyond consolidation, we articulate a forward‑looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi‑agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first‑class primitive in the design of future agentic intelligence.

PaperID: 6805, https://arxiv.org/pdf/2512.13559.pdf

Authors: Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch

Title: Verifying Rumors via Stance-Aware Structural Modeling

Abstract:
Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer‑based encoders. In this work, we propose a stance‑aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross‑platfrom generalization.

PaperID: 6806, https://arxiv.org/pdf/2512.13481.pdf

Authors: Arnav Ramamoorthy, Shrey Dhorajiya, Ojas Pungalia, Rashi Upadhyay, Abhishek Mishra, Abhiram H, Tejasvi Alladi, Sujan Yenuganti, Dhruv Kumar

Title: neuralFOMO: Can LLMs Handle Being Second Best? Measuring Envy-Like Preferences in Multi-Agent Settings

Abstract:
Envy shapes competitiveness and cooperation in human groups, yet its role in large language model interactions remains largely unexplored. As LLMs increasingly operate in multi‑agent settings, it is important to examine whether they exhibit envy‑like preferences under social comparison. We evaluate LLM behavior across two scenarios: (1) a point‑allocation game testing sensitivity to relative versus absolute payoff, and (2) comparative evaluations across general and contextual settings. To ground our analysis in psychological theory, we adapt four established psychometric questionnaires spanning general, domain‑specific, workplace, and sibling‑based envy. Our results reveal heterogeneous envy‑like patterns across models and contexts, with some models sacrificing personal gain to reduce a peer's advantage, while others prioritize individual maximization. These findings highlight competitive dispositions as a design and safety consideration for multi‑agent LLM systems.

PaperID: 6807, https://arxiv.org/pdf/2512.13438.pdf

Authors: Dezhi Ran, Zhi Gong, Yuzhe Guo, Mengzhou Wu, Yuan Cao, Haochuan Lu, Hengyu Zhang, Xia Zeng, Gang Cao, Liangchao Yao, Yuetang Deng, Wei Yang, Tao Xie

Title: From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents

Abstract:
While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to co‑optimization of token efficiency and completeness. Second, the need to process large, complex UI trees as input while generating long, compositional transformation programs, making the search space vast and error‑prone. Toward addressing the preceding limitations, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint‑based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain‑specific language (DSL) that captures UI‑specific operations. Second, UIFormer conducts LLM‑based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency‑completeness co‑optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real‑world industry deployment at WeChat further validates the practical impact of UIFormer.

PaperID: 6808, https://arxiv.org/pdf/2512.13279.pdf

Authors: Jinrui Liu, Jeff Wu, Xuanguang Pan, Gavin Cheung, Shuai Ma, Chongyang Tao

Title: AIR: Post-training Data Selection for Reasoning via Attention Head Influence

Abstract:
LLMs achieve remarkable multi‑step reasoning capabilities, yet effectively transferring these skills via post‑training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training‑free framework that leverages mechanistic insights of the retrieval head to select high‑value post‑training data. AIR first identifies reasoning‑critical attention heads of an off‑the‑shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine‑grained assessment at both the step and sample levels, supporting step‑level weighted fine‑tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism‑driven, data‑efficient approach for reasoning distillation in LLMs.

PaperID: 6809, https://arxiv.org/pdf/2512.13278.pdf

Authors: Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

Title: AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

Abstract:
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain‑of‑thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool‑selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool‑selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual‑phase optimization pipeline: (i) supervised and RL‑based trajectory stabilization for coherent reasoning, and (ii) KL‑regularized Plackett‑Luce ranking to refine consistent multi‑step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3‑8B and Qwen2.5‑VL‑7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool‑integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search‑based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.

PaperID: 6810, https://arxiv.org/pdf/2512.13154.pdf

Authors: Emre Can Acikgoz, Jinoh Oh, Joo Hyuk Jeon, Jie Hao, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan

Title: MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

Abstract:
Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real‑world applications favor multi‑agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge‑‑particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query within the most optimal multi‑agent settings remain open. In this paper, we propose MAC (Multi‑Agent Clarification), an interactive multi‑agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC that autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases task success rate 7.8% (54.5 to 62.3) and reduces the average number of dialogue turns (6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role‑aware clarification for more reliable human‑agent communication.

PaperID: 6811, https://arxiv.org/pdf/2512.13030.pdf

Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu

Title: Motus: A Unified Latent Action World Model

Abstract:
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large‑scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture‑of‑Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser‑style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision‑language‑action models, inverse dynamics models, video generation models, and video‑action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three‑phase training pipeline and six‑layer data pyramid, thereby extracting pixel‑level "delta action" and enabling large‑scale action pretraining. Experiments show that Motus achieves superior performance against state‑of‑the‑art methods in both simulation (a +15% improvement over X‑VLA and a +45% improvement over Pi0.5) and real‑world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

PaperID: 6812, https://arxiv.org/pdf/2512.13021.pdf

Authors: Mo Yang, Jing Yu, Necmiye Ozay

Title: Safe Control of Multi-Agent Systems with Minimal Communication

Abstract:
In many multi‑agent systems, communication is limited by bandwidth, latency, and energy constraints. Designing controllers that achieve coordination and safety with minimal communication is critical for scalable and reliable deployment. This paper presents a method for designing controllers that minimize inter‑agent communication in multi‑agent systems while satisfying safety and coordination requirements, while conforming to communication delay constraints. The control synthesis problem is cast as a rank minimization problem, where a convex relaxation is obtained via system level synthesis. Simulation results on various tasks, including trajectory tracking with relative and heterogeneous sensing, demonstrate that the proposed method significantly reduces inter‑agent transmission compared to baseline approaches.

PaperID: 6813, https://arxiv.org/pdf/2512.12989.pdf

Authors: Abdulmalik Alquwayfili

Title: Quantigence: A Multi-Agent AI Framework for Quantum Security Research

Abstract:
Cryptographically Relevant Quantum Computers (CRQCs) pose a structural threat to the global digital economy. Algorithms like Shor's factoring and Grover's search threaten to dismantle the public‑key infrastructure (PKI) securing sovereign communications and financial transactions. While the timeline for fault‑tolerant CRQCs remains probabilistic, the "Store‑Now, Decrypt‑Later" (SNDL) model necessitates immediate migration to Post‑Quantum Cryptography (PQC). This transition is hindered by the velocity of research, evolving NIST standards, and heterogeneous deployment environments. To address this, we present Quantigence, a theory‑driven multi‑agent AI framework for structured quantum‑security analysis. Quantigence decomposes research objectives into specialized roles ‑ Cryptographic Analyst, Threat Modeler, Standards Specialist, and Risk Assessor ‑ coordinated by a supervisory agent. Using "cognitive parallelism," agents reason independently to maintain context purity while execution is serialized on resource‑constrained hardware (e.g., NVIDIA RTX 2060). The framework integrates external knowledge via the Model Context Protocol (MCP) and prioritizes vulnerabilities using the Quantum‑Adjusted Risk Score (QARS), a formal extension of Mosca's Theorem. Empirical validation shows Quantigence achieves a 67% reduction in research turnaround time and superior literature coverage compared to manual workflows, democratizing access to high‑fidelity quantum risk assessment.

PaperID: 6814, https://arxiv.org/pdf/2512.12950.pdf

Authors: Lingyi Meng, Maolin Liu, Hao Wang, Yilan Cheng, Qi Yang, Idlkaid Mohanmmed

Title: Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

Abstract:
Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human‑AI collaborative approach for building a multilingual legal terminology database, based on a multi‑agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process‑from raw document preprocessing, article‑level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi‑agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human‑in‑the‑loop, multi‑agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.

PaperID: 6815, https://arxiv.org/pdf/2512.12921.pdf

Authors: Amy Chang, Tiffany Saade, Sanket Mendapara, Adam Swanda, Ankit Garg

Title: Cisco Integrated AI Security and Safety Framework Report

Abstract:
Artificial intelligence (AI) systems are being readily and rapidly adopted, increasingly permeating critical domains: from consumer platforms and enterprise software to networked systems with embedded agents. While this has unlocked potential for human productivity gains, the attack surface has expanded accordingly: threats now span content safety failures (e.g., harmful or deceptive outputs), model and data integrity compromise (e.g., poisoning, supply‑chain tampering), runtime manipulations (e.g., prompt injection, tool and agent misuse), and ecosystem risks (e.g., orchestration abuse, multi‑agent collusion). Existing frameworks such as MITRE ATLAS, National Institute of Standards and Technology (NIST) AI 100‑2 Adversarial Machine Learning (AML) taxonomy, and OWASP Top 10s for Large Language Models (LLMs) and Agentic AI Applications provide valuable viewpoints, but each covers only slices of this multi‑dimensional space. This paper presents Cisco's Integrated AI Security and Safety Framework ("AI Security Framework"), a unified, lifecycle‑aware taxonomy and operationalization framework that can be used to classify, integrate, and operationalize the full range of AI risks. It integrates AI security and AI safety across modalities, agents, pipelines, and the broader ecosystem. The AI Security Framework is designed to be practical for threat identification, red‑teaming, risk prioritization, and it is comprehensive in scope and can be extensible to emerging deployments in multimodal contexts, humanoids, wearables, and sensory infrastructures. We analyze gaps in prevailing frameworks, discuss design principles for our framework, and demonstrate how the taxonomy provides structure for understanding how modern AI systems fail, how adversaries exploit these failures, and how organizations can build defenses across the AI lifecycle that evolve alongside capability advancements.

PaperID: 6816, https://arxiv.org/pdf/2512.12856.pdf

Authors: Saad Alqithami

Title: Forgetful but Faithful: A Cognitive Memory Architecture and Benchmark for Privacy-Aware Generative Agents

Abstract:
As generative agents become increasingly sophisticated and deployed in long‑term interactive scenarios, their memory management capabilities emerge as a critical bottleneck for both performance and privacy. Current approaches either maintain unlimited memory stores, leading to computational intractability and privacy concerns, or employ simplistic forgetting mechanisms that compromise agent coherence and functionality. This paper introduces the Memory‑Aware Retention Schema (MaRS), a novel framework for human‑centered memory management in generative agents, coupled with six theoretically‑grounded forgetting policies that balance performance, privacy, and computational efficiency. We present the Forgetful but Faithful Agent (FiFA) benchmark, a comprehensive evaluation framework that assesses agent performance across narrative coherence, goal completion, social recall accuracy, privacy preservation, and cost efficiency. Through extensive experimentation involving 300 evaluation runs across multiple memory budgets and agent configurations, we demonstrate that our hybrid forgetting policy achieves superior performance (composite score: 0.911) while maintaining computational tractability and privacy guarantees. Our work establishes new benchmarks for memory‑budgeted agent evaluation and provides practical guidelines for deploying generative agents in resource‑constrained, privacy‑sensitive environments. The theoretical foundations, implementation framework, and empirical results contribute to the emerging field of human‑centered AI by addressing fundamental challenges in agent memory management that directly impact user trust, system scalability, and regulatory compliance.

PaperID: 6817, https://arxiv.org/pdf/2512.12806.pdf

Authors: Boyang Yan

Title: Fault-Tolerant Sandboxing for AI Coding Agents: A Transactional Approach to Safe Autonomous Execution

Abstract:
The transition of Large Language Models (LLMs) from passive code generators to autonomous agents introduces significant safety risks, specifically regarding destructive commands and inconsistent system states. Existing commercial solutions often prioritize interactive user safety, enforcing authentication barriers that break the headless loops required for true autonomy. This paper presents a Fault‑Tolerant Sandboxing framework designed to mitigate these risks through a policy‑based interception layer and a transactional filesystem snapshot mechanism. We hypothesize that wrapping agent actions in atomic transactions can guarantee safety with acceptable latency, outperforming the heavy initialization overhead of containers or the interactive friction of commercial CLIs. We validated this approach by deploying the Minimind‑MoE LLM served via nano‑vllm on a custom Proxmox‑based testbed utilizing EVPN/VXLAN isolation. Experimental results demonstrate a 100% interception rate for high‑risk commands and a 100% success rate in rolling back failed states. Crucially, our prototype incurs only a 14.5% performance overhead (approx. 1.8s) per transaction. In contrast, benchmarking against the Gemini CLI sandbox revealed that it requires interactive authentication ("Sign in"), rendering it unusable for headless, autonomous agent workflows.

PaperID: 6818, https://arxiv.org/pdf/2512.12791.pdf

Authors: Sreemaee Akshathala, Bassam Adnan, Mahisha Ramesh, Karthik Vaidhyanathan, Basil Muhammed, Kannan Parthasarathy

Title: Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

Abstract:
Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi‑agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation and assessment of LLM agents and the multi‑agent systems they constitute remain a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI‑based systems often overlook the non‑deterministic nature of models. This non‑determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including the agent ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. These challenges emerged during our ongoing industry collaboration with MontyCloud Inc., when we deployed an agentic system in production. These limitations surfaced during deployment, highlighting practical gaps in the current evaluation methods and the need for a systematic assessment of agent behavior beyond task outcomes. Informed by these observations and established definitions of agentic systems, we propose an end‑to‑end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.

PaperID: 6819, https://arxiv.org/pdf/2512.12719.pdf

Authors: Anrafel Fernandes Pereira, Maria Teresa Baldassarre, Daniel Mendez, Marcos Kalinowski

Title: Towards AI Agents Supported Research Problem Formulation

Abstract:
Poorly formulated research problems can compromise the practical relevance of Software Engineering studies by not reflecting the complexities of industrial practice. This vision paper explores the use of artificial intelligence agents to support SE researchers during the early stage of a research project, the formulation of the research problem. Based on the Lean Research Inception framework and using a published study on code maintainability in machine learning as a reference, we developed a descriptive evaluation of a scenario illustrating how AI agents, integrated into LRI, can support SE researchers by pre filling problem attributes, aligning stakeholder perspectives, refining research questions, simulating multiperspective assessments, and supporting decision making. The descriptive evaluation of the scenario suggests that AI agent support can enrich collaborative discussions and enhance critical reflection on the value, feasibility, and applicability of the research problem. Although the vision of integrating AI agents into LRI was perceived as promising to support the context aware and practice oriented formulation of research problems, empirical validation is needed to confirm and refine the integration of AI agents into problem formulation.

PaperID: 6820, https://arxiv.org/pdf/2512.12716.pdf

Authors: Xuanzhang Liu, Jianglun Feng, Zhuoran Zhuang, Junzhe Zhao, Maofei Que, Jieting Li, Dianlei Wang, Hao Tong, Ye Chen, Pan Li

Title: CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

Abstract:
Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi‑step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context‑Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high‑level planning from low‑level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high‑level Planner that decomposes tasks within a concise strategic context, and a low‑level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end‑to‑end using PECO (Planner‑Executor Co‑Optimization), a reinforcement learning methodology that applies a trajectory‑level reward to jointly optimize both roles, fostering seamless collaboration through context‑dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state‑of‑the‑art baselines on complex multi‑hop question‑answering benchmarks, and it exhibits strong robustness in long‑context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.

PaperID: 6821, https://arxiv.org/pdf/2512.12443.pdf

Authors: Akhmadillo Mamirov, Faiaz Azmain, Hanyu Wang

Title: AI Transparency Atlas: Framework, Scoring, and Real-Time Model Card Evaluation Pipeline

Abstract:
AI model documentation is fragmented across platforms and inconsistent in structure, preventing policymakers, auditors, and users from reliably assessing safety claims, data provenance, and version‑level changes. We analyzed documentation from five frontier models (Gemini 3, Grok 4.1, Llama 4, GPT‑5, and Claude 4.5) and 100 Hugging Face model cards, identifying 947 unique section names with extreme naming variation. Usage information alone appeared under 97 distinct labels. Using the EU AI Act Annex IV and the Stanford Transparency Index as baselines, we developed a weighted transparency framework with 8 sections and 23 subsections that prioritizes safety‑critical disclosures (Safety Evaluation: 25%, Critical Risk: 20%) over technical specifications. We implemented an automated multi‑agent pipeline that extracts documentation from public sources and scores completeness through LLM‑based consensus. Evaluating 50 models across vision, multimodal, open‑source, and closed‑source systems cost less than 3 in total and revealed systematic gaps. Frontier labs (xAI, Microsoft, Anthropic) achieve approximately 80% compliance, while most providers fall below 60%. Safety‑critical categories show the largest deficits: deception behaviors, hallucinations, and child safety evaluations account for 148, 124, and 116 aggregate points lost, respectively, across all evaluated models.

PaperID: 6822, https://arxiv.org/pdf/2512.12366.pdf

Authors: Babak Badnava, Jacob Chakareski, Morteza Hashemi

Title: ElasticVR: Elastic Task Computing in Multi-User Multi-Connectivity Wireless Virtual Reality (VR) Systems

Abstract:
Diverse emerging VR applications integrate streaming of high fidelity 360 video content that requires ample amounts of computation and data rate. Scalable 360 video tiling enables having elastic VR computational tasks that can be scaled adaptively in computation and data rate based on the available user and system resources. We integrate scalable 360 video tiling in an edge‑client wireless multi‑connectivity architecture for joint elastic task computation offloading across multiple VR users called ElasticVR. To balance the trade‑offs in communication, computation, energy consumption, and QoE that arise herein, we formulate a constrained QoE and energy optimization problem that integrates the multi‑user/multi‑connectivity action space with the elasticity of VR computational tasks. The ElasticVR framework introduces two multi‑agent deep reinforcement learning solutions, namely CPPG and IPPG. CPPG adopts a centralized training and centralized execution approach to capture the coupling between users' communication and computational demands. This leads to globally coordinated decisions at the cost of increased computational overheads and limited scalability. To address the latter challenges, we also explore an alternative strategy denoted IPPG that adopts a centralized training with decentralized execution paradigm. IPPG leverages shared information and parameter sharing to learn robust policies; however, during execution, each user takes action independently based on its local state information only. The decentralized execution alleviates the communication and computation overhead of centralized decision‑making and improves scalability. We show that the ElasticVR framework improves the PSNR by 43.21%, while reducing the response time and energy consumption by 42.35% and 56.83%, respectively, compared with a case where no elasticity is incorporated into VR computations.

PaperID: 6823, https://arxiv.org/pdf/2512.12302.pdf

Authors: Huan Zheng, Yucheng Zhou, Tianyi Yan, Jiayi Su, Hongjun Chen, Dubing Chen, Xingtai Gui, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

Title: From Human Intention to Action Prediction: Intention-Driven End-to-End Autonomous Driving

Abstract:
While end‑to‑end autonomous driving has achieved remarkable progress in geometric control, current systems remain constrained by a command‑following paradigm that relies on simple navigational instructions. Transitioning to genuinely intelligent agents requires the capability to interpret and fulfill high‑level, abstract human intentions. However, this advancement is hindered by the lack of dedicated benchmarks and semantic‑aware evaluation metrics. In this paper, we formally define the task of Intention‑Driven End‑to‑End Autonomous Driving and present Intention‑Drive, a comprehensive benchmark designed to bridge this gap. We construct a large‑scale dataset featuring complex natural language intentions paired with high‑fidelity sensor data. To overcome the limitations of conventional trajectory‑based metrics, we introduce the Imagined Future Alignment (IFA), a novel evaluation protocol leveraging generative world models to assess the semantic fulfillment of human goals beyond mere geometric accuracy. Furthermore, we explore the solution space by proposing two distinct paradigms: an end‑to‑end vision‑language planner and a hierarchical agent‑based framework. The experiments reveal a critical dichotomy where existing models exhibit satisfactory driving stability but struggle significantly with intention fulfillment. Notably, the proposed frameworks demonstrate superior alignment with human intentions.

PaperID: 6824, https://arxiv.org/pdf/2512.12283.pdf

Authors: Junjie Xu, Xingjiao Wu, Luwei Xiao, Yuzhe Yang, Jie Zhou, Zihao Zhang, Luhan Wang, Yi Huang, Nan Wu, Yingbin Zheng, Chao Yan, Cheng Jin, Honglin Li, Liang He

Title: Large Language Models have Chain-of-Affect

Abstract:
As large language models (LLMs) move into persistent, user‑facing roles, their behavior must be understood not as isolated responses but as a trajectory unfolding over sustained interaction. We introduce the concept of the chain‑of‑affect (CoA), a temporally extended affective process through which LLMs develop state‑like behavioral tendencies that shape generation, user experience, and collective dynamics. Across eight major LLM families, we find that affective dynamics are structured, reproducible, and consequential. Models exhibit stable, family‑specific affective fingerprints and, under repeated negative exposure, converge on a shared trajectory of accumulation, overload, and defensive numbing, while differing in coping style. Induced affective states leave core knowledge and reasoning largely intact but systematically reshape open‑ended generation. Affective properties of model outputs also shape human‑AI interaction and propagate through multi‑agent systems, organizing emergent roles and strongly contributing to polarization and bias. The CoA should therefore be treated as a core target of evaluation and alignment.

PaperID: 6825, https://arxiv.org/pdf/2512.12243.pdf

Authors: HT To, S Nguyen, NH Pham

Title: CAHC:A General Conflict-Aware Heuristic Caching Framework for Multi-Agent Path Finding

Abstract:
Multi‑Agent Path Finding (MAPF) algorithms, including those for car‑like robots and grid‑based scenarios, face significant computational challenges due to expensive heuristic calculations. Traditional heuristic caching assumes that the heuristic function depends only on the state, which is incorrect in constraint‑based search algorithms (e.g., CBS, MAPF‑LNS, MAP2) where constraints from conflict resolution make the search space context‑dependent. We propose CAHC (Conflict‑Aware Heuristic Caching), a general framework that caches heuristic values based on both state and relevant constraint context, addressing this fundamental limitation. We demonstrate CAHC through a case study on CL‑CBS for car‑like robots, where we combine conflict‑aware caching with an adaptive hybrid heuristic in CAR‑CHASE (Car‑Like Robot Conflict‑Aware Heuristic Adaptive Search Enhancement). Our key innovations are (1) a compact \emphconflict fingerprint that efficiently encodes which constraints affect a state's heuristic, (2) a domain‑adaptable relevance filter using spatial, temporal, and geometric criteria, and (3) a modular architecture that enables systematic application to diverse MAPF algorithms. Experimental evaluation on 480 CL‑CBS benchmark instances demonstrates a geometric mean speedup of 2.46× while maintaining solution optimality. The optimizations improve success rate from 77.9% to 84.8% (+6.9 percentage points), reduce total runtime by 70.1%, and enable solving 33 additional instances. The framework's general architecture makes it applicable as a reliable optimization technique for MAP2, MAPF‑LNS, and other constraint‑based MAPF algorithms.

PaperID: 6826, https://arxiv.org/pdf/2512.12196.pdf

Authors: Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma

Title: AutoMV: An Automatic Multi-Agent System for Music Video Generation

Abstract:
Music‑to‑Video (M2V) generation for full‑length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi‑agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time‑aligned lyrics, and constructs these features as contextual inputs for following agents. The screenwriter Agent and director Agent then use this information to design short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi‑agent collaboration to produce a coherent longform MV. To evaluate M2V generation, we further propose a benchmark with four high‑level categories (Music Content, Technical, Post‑production, Art) and twelve ine‑grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human‑directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human expert, highlighting room for future work.

PaperID: 6827, https://arxiv.org/pdf/2512.12091.pdf

Authors: Mohammad Pivezhandi, Mahdi Banisharif, Saeed Bakhshan, Abusayeed Saifullah, Ali Jannesari

Title: GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes

Abstract:
Autonomous AI agents on embedded platforms require real‑time, risk‑aware scheduling under resource and thermal constraints. Classical heuristics struggle with workload irregularity, tabular regressors discard structural information, and model‑free reinforcement learning (RL) risks overheating. We introduce GraphPerf‑RT, a graph neural network surrogate achieving deep learning accuracy at heuristic speeds (2‑7ms). GraphPerf‑RT is, to our knowledge, the first to unify task DAG topology, CFG‑derived code semantics, and runtime context (per‑core DVFS, thermal state, utilization) in a heterogeneous graph with typed edges encoding precedence, placement, and contention. Evidential regression with Normal‑Inverse‑Gamma priors provides calibrated uncertainty; we validate on makespan prediction for risk‑aware scheduling. Experiments on three ARM platforms (Jetson TX2, Orin NX, RUBIK Pi) achieve R^2 = 0.81 on log‑transformed makespan with Spearman rho = 0.95 and conservative uncertainty calibration (PICP = 99.9% at 95% confidence). Integration with four RL methods demonstrates that multi‑agent model‑based RL with GraphPerf‑RT as the world model achieves 66% makespan reduction and 82% energy reduction versus model‑free baselines, with zero thermal violations.

PaperID: 6828, https://arxiv.org/pdf/2512.12048.pdf

Authors: Muddsair Sharif, Huseyin Seker

Title: Context-Aware Agentic Power Resources Optimisation in EV using Smart2ChargeApp

Abstract:
This paper presents a novel context‑sensitive multi\‑agent coordination for dynamic resource allocation (CAMAC‑DRA) framework for optimizing smart electric vehicle (EV) charging ecosystems through the Smart2Charge application. The proposed system coordinates autonomous charging agents across networks of 250 EVs and 45 charging stations while adapting to dynamic environmental conditions through context‑aware decision‑making. Our multi‑agent approach employs coordinated Deep Q\‑Networks integrated with Graph Neural Networks and attention mechanisms, processing 20 contextual features including weather patterns, traffic conditions, grid load fluctuations, and electricity pricing.The framework balances five ecosystem stakeholders i.e. EV users (25%), grid operators (20%), charging station operators (20%), fleet operators (20%), and environmental factors (15%) through weighted coordination mechanisms and consensus protocols. Comprehensive validation using real‑world datasets containing 441,077 charging transactions demonstrates superior performance compared to baseline algorithms including DDPG, A3C, PPO, and GNN approaches. The CAMAC\‑DRA framework achieves 92% coordination success rate, 15% energy efficiency improvement, 10% cost reduction, 20% grid strain decrease, and \2.3x faster convergence while maintaining 88% training stability and 85% sample efficiency. Real‑world validation confirms commercial viability with Net Present Cost of ‑\122,962 and 69% cost reduction through renewable energy integration. The framework's unique contribution lies in developing context‑aware multi‑stakeholder coordination that successfully balances competing objectives while adapting to real‑time variables, positioning it as a breakthrough solution for intelligent EV charging coordination and sustainable transportation electrification.

PaperID: 6829, https://arxiv.org/pdf/2512.12042.pdf

Authors: Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, Andrea Stocco

Title: Benchmarking Contextual Understanding for In-Car Conversational Systems

Abstract:
In‑Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent‑based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance‑response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure‑containing system responses. We use input‑output, chain‑of‑thought, self‑consistency prompting, and multi‑agent prompting techniques with 13 reasoning and non‑reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non‑reasoning models when applying advanced prompting techniques, particularly multi‑agent prompting. However, reasoning models consistently outperform non‑reasoning models, with the best performance achieved using single‑agent prompting with self‑consistency. Notably, DeepSeek‑R1 reaches an F1‑score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost‑time efficiency is reached with the non‑reasoning model DeepSeek‑V3. Our findings show that LLM‑based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.

PaperID: 6830, https://arxiv.org/pdf/2512.12008.pdf

Authors: Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li, Ce Zhang, Furong Huang, Kunpeng Zhang

Title: Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

Abstract:
Large language models (LLMs) have demonstrated remarkable performance on long‑context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi‑step reasoning and self‑reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long‑reasoning tasks. For the non‑reasoning Llama‑3.1‑8B‑Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding‑enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy‑hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.

PaperID: 6831, https://arxiv.org/pdf/2512.11979.pdf

Authors: Marc Scibelli, Krystelle Gonzalez Papaux, Julia Valenti, Srishti Kush

Title: Designing The Internet of Agents: A Framework for Trustworthy, Transparent, and Collaborative Human-Agent Interaction (HAX)

Abstract:
The rise of generative and autonomous agents marks a fundamental shift in computing, demanding a rethinking of how humans collaborate with probabilistic, partially autonomous systems. We present the Human‑AI‑Experience (HAX) framework, a comprehensive, three‑phase approach that establishes design foundations for trustworthy, transparent, and collaborative agentic interaction. HAX integrates behavioral heuristics, a schema‑driven SDK enforcing structured and safe outputs, and a behavioral proxy concept that orchestrates agent activity to reduce cognitive load. A validated catalog of mixed‑initiative design patterns further enables intent preview, iterative alignment, trust repair, and multi‑agent narrative coherence. Grounded in Time, Interaction, and Performance (TIP) theory, HAX reframes multi‑agent systems as colleagues, offering the first end‑to‑end framework that bridges trust theory, interface design, and infrastructure for the emerging Internet of Agents.

PaperID: 6832, https://arxiv.org/pdf/2512.11943.pdf

Authors: Yu Liu, Wenwen Li, Yifan Dou, Guangnan Ye

Title: How AI Agents Follow the Herd of AI? Network Effects, History, and Machine Optimism

Abstract:
Understanding decision‑making in multi‑AI‑agent frameworks is crucial for analyzing strategic interactions in network‑effect‑driven contexts. This study investigates how AI agents navigate network‑effect games, where individual payoffs depend on peer participatio‑‑a context underexplored in multi‑agent systems despite its real‑world prevalence. We introduce a novel workflow design using large language model (LLM)‑based agents in repeated decision‑making scenarios, systematically manipulating price trajectories (fixed, ascending, descending, random) and network‑effect strength. Our key findings include: First, without historical data, agents fail to infer equilibrium. Second, ordered historical sequences (e.g., escalating prices) enable partial convergence under weak network effects but strong effects trigger persistent "AI optimism"‑‑agents overestimate participation despite contradictory evidence. Third, randomized history disrupts convergence entirely, demonstrating that temporal coherence in data shapes LLMs' reasoning, unlike humans. These results highlight a paradigm shift: in AI‑mediated systems, equilibrium outcomes depend not just on incentives, but on how history is curated, which is impossible for human.

PaperID: 6833, https://arxiv.org/pdf/2512.11942.pdf

Authors: Vince Trencsenyi

Title: Hypergame Rationalisability: Solving Agent Misalignment In Strategic Play

Abstract:
Differences in perception, information asymmetries, and bounded rationality lead game‑theoretic players to derive a private, subjective view of the game that may diverge from the underlying ground‑truth scenario and may be misaligned with other players' interpretations. While typical game‑theoretic assumptions often overlook such heterogeneity, hypergame theory provides the mathematical framework to reason about mismatched mental models. Although hypergames have recently gained traction in dynamic applications concerning uncertainty, their practical adoption in multi‑agent system research has been hindered by the lack of a unifying, formal, and practical representation language, as well as scalable algorithms for managing complex hypergame structures and equilibria. Our work addresses this gap by introducing a declarative, logic‑based domain‑specific language for encoding hypergame structures and hypergame solution concepts. Leveraging answer‑set programming, we develop an automated pipeline for instantiating hypergame structures and running our novel hypergame rationalisation procedure, a mechanism for finding belief structures that justify seemingly irrational outcomes. The proposed language establishes a unifying formalism for hypergames and serves as a foundation for developing nuanced, belief‑based heterogeneous reasoners, offering a verifiable context with logical guarantees. Together, these contributions establish the connection between hypergame theory, multi‑agent systems, and strategic AI.

PaperID: 6834, https://arxiv.org/pdf/2512.11933.pdf

Authors: Eren Kurshan, Tucker Balch, David Byrd

Title: The Agentic Regulator: Risks for AI in Finance and a Proposed Agent-based Framework for Governance

Abstract:
Generative and agentic artificial intelligence is entering financial markets faster than existing governance can adapt. Current model‑risk frameworks assume static, well‑specified algorithms and one‑time validations; large language models and multi‑agent trading systems violate those assumptions by learning continuously, exchanging latent signals, and exhibiting emergent behavior. Drawing on complex adaptive systems theory, we model these technologies as decentralized ensembles whose risks propagate along multiple time‑scales. We then propose a modular governance architecture. The framework decomposes oversight into four layers of "regulatory blocks": (i) self‑regulation modules embedded beside each model, (ii) firm‑level governance blocks that aggregate local telemetry and enforce policy, (iii) regulator‑hosted agents that monitor sector‑wide indicators for collusive or destabilizing patterns, and (iv) independent audit blocks that supply third‑party assurance. Eight design strategies enable the blocks to evolve as fast as the models they police. A case study on emergent spoofing in multi‑agent trading shows how the layered controls quarantine harmful behavior in real time while preserving innovation. The architecture remains compatible with today's model‑risk rules yet closes critical observability and control gaps, providing a practical path toward resilient, adaptive AI governance in financial systems.

PaperID: 6835, https://arxiv.org/pdf/2512.11907.pdf

Authors: Daniel Platnick, Marjan Alirezaie, Hossein Rahnama

Title: Structured Personalization: Modeling Constraints as Matroids for Data-Minimal LLM Agents

Abstract:
Personalizing Large Language Model (LLM) agents requires conditioning them on user‑specific data, creating a critical trade‑off between task utility and data disclosure. While the utility of adding user data often exhibits diminishing returns (i.e., submodularity), enabling near‑optimal greedy selection, real‑world personalization is complicated by structural constraints. These include logical dependencies (e.g., selecting fact A requires fact B), categorical quotas (e.g., select at most one writing style), and hierarchical rules (e.g., select at most two social media preferences, of which at most one can be for a professional network). These constraints violate the assumptions of standard subset selection algorithms. We propose a principled method to formally model such constraints. We introduce a compilation process that transforms a user's knowledge graph with dependencies into a set of abstract macro‑facets. Our central result is a proof that common hierarchical and quota‑based constraints over these macro‑facets form a valid laminar matroid. This theoretical characterization lets us cast structured personalization as submodular maximization under a matroid constraint, enabling greedy with constant‑factor guarantees (and (1‑1/e) via continuous greedy) for a much richer and more realistic class of problems.

PaperID: 6836, https://arxiv.org/pdf/2512.11862.pdf

Authors: Jiahao You, Ziye Jia, Can Cui, Chao Dong, Qihui Wu, Zhu Han

Title: Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

Abstract:
The low‑altitude intelligent networks (LAINs) emerge as a promising architecture for delivering low‑latency and energy‑efficient edge intelligence in dynamic and infrastructure‑limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission‑critical applications such as disaster response, environmental monitoring, and real‑time sensing. However, these systems face key challenges, including energy‑constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air‑ground collaborative network and formulate a time‑dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey‑Clarke‑Groves auction mechanism enables the energy‑aware and incentive‑compatible trajectory assignment. At the small timescale, we propose the diffusion‑heterogeneous‑agent proximal policy optimization, a generative multi‑agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation‑conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.

PaperID: 6837, https://arxiv.org/pdf/2512.11819.pdf

Authors: Daniil Sukhorukov, Andrei Zakharov, Nikita Glazkov, Katsiaryna Yanchanka, Vladimir Kirilin, Maxim Dubovitsky, Roman Sultimov, Yuri Maksimov, Ilya Makarov

Title: A Modular LLM-Agent System for Transparent Multi-Parameter Weather Interpretation

Abstract:
Weather forecasting is not only a predictive task but an interpretive scientific process requiring explanation, contextualization, and hypothesis generation. This paper introduces AI‑Meteorologist, an explainable LLM‑agent framework that converts raw numerical forecasts into scientifically grounded narrative reports with transparent reasoning steps. Unlike conventional forecast outputs presented as dense tables or unstructured time series, our system performs agent‑based analysis across multiple meteorological variables, integrates historical climatological context, and generates structured explanations that identify weather fronts, anomalies, and localized dynamics. The architecture relies entirely on in‑context prompting, without fine‑tuning, demonstrating that interpretability can be achieved through reasoning rather than parameter updates. Through case studies on multi‑location forecast data, we show how AI‑Meteorologist not only communicates weather events but also reveals the underlying atmospheric drivers, offering a pathway toward AI systems that augment human meteorological expertise and support scientific discovery in climate analytics.

PaperID: 6838, https://arxiv.org/pdf/2512.11682.pdf

Authors: Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

Title: MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Abstract:
Therapeutic decision‑making in clinical medicine constitutes a high‑stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse‑effect prediction demand robust, multi‑step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval‑augmented generation (RAG). TxAgent employs a fine‑tuned Llama‑3.1‑8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general‑purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token‑level reasoning and tool‑usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE‑Bench NeurIPS 2025 Challenge, which benchmarks therapeutic‑reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool‑retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

PaperID: 6839, https://arxiv.org/pdf/2512.11612.pdf

Authors: Chunyi Li, Rui Qing, Jianbo Zhang, Yuan Tian, Xiangyang Zhu, Zicheng Zhang, Xiaohong Liu, Weisi Lin, Guangtao Zhai

Title: Embodied Image Compression

Abstract:
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task‑specific virtual models to Embodied agents operating in real‑world environments. To address the communication constraints of Embodied AI in multi‑agent systems and ensure real‑time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra‑low bitrate conditions in a closed‑loop setting. Through extensive empirical studies in both simulated and real‑world settings, we demonstrate that existing Vision‑Language‑Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain‑specific compression tailored for Embodied agents , thereby accelerating the Embodied AI deployment in the Real‑world.

PaperID: 6840, https://arxiv.org/pdf/2512.11421.pdf

Authors: Gonca Gürsun

Title: Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance

Abstract:
Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi‑turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM‑based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation ‑ action mappings, and a generation module that enforces constraint‑compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co‑evolve, yielding trustworthy behavior.

PaperID: 6841, https://arxiv.org/pdf/2512.11398.pdf

Authors: Qiuming Luo, Yanming Lei, Kunzhong Wu, Yixuan Cao, Chengjian Liu

Title: AutoFSM: A Multi-agent Framework for FSM Code Generation with IR and SystemC-Based Testing

Abstract:
With the rapid advancement of large language models (LLMs) in code generation, their applications in hardware design are receiving growing attention. However, existing LLMs face several challenges when generating Verilog code for finite state machine (FSM) control logic, including frequent syntax errors, low debugging efficiency, and heavy reliance on test benchmarks. To address these challenges, this paper proposes AutoFSM, a multi‑agent collaborative framework designed for FSM code generation tasks. AutoFSM introduces a structurally clear intermediate representation (IR) to reduce syntax error rate during code generation and provides a supporting toolchain to enable automatic translation from IR to Verilog. Furthermore, AutoFSM is the first to integrate SystemC‑based modeling with automatic testbench generation, thereby improving debugging efficiency and feedback quality. To systematically evaluate the framework's performance, we construct SKT‑FSM, the first hierarchical FSM benchmark in the field, comprising 67 FSM samples across different complexity levels. Experimental results show that, under the same base LLM, AutoFSM consistently outperforms the open‑source framework MAGE on the SKT‑FSM benchmark, achieving up to an 11.94% improvement in pass rate and up to a 17.62% reduction in syntax error rate. These results demonstrate the potential of combining LLMs with structured IR and automated testing to improve the reliability and scalability of register‑transfer level (RTL) code generation.

PaperID: 6842, https://arxiv.org/pdf/2512.11315.pdf

Authors: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang

Title: Benchmarking the Generality of Vision-Language-Action Models

Abstract:
Generalist multimodal agents are expected to unify perception, language, and control ‑ operating robustly across diverse real world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross domain generality of vision language models (VLMs) and vision language action models (VLAs) across six foundational capability regimes. Visual grounding, spatial reasoning, tool use, physical commonsense, multi agent coordination, and continuous robot control. Evaluating GPT 5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross domain task shifts despite strong performance within their training distributions.These failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain transfer.Our findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models.MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist agents.Code, data, and leaderboards are publicly available.

PaperID: 6843, https://arxiv.org/pdf/2512.11271.pdf

Authors: Yuxing Chen, Basem Suleiman, Qifan Chen

Title: TriFlow: A Progressive Multi-Agent Framework for Intelligent Trip Planning

Abstract:
Real‑world trip planning requires transforming open‑ended user requests into executable itineraries under strict spatial, temporal, and budgetary constraints while aligning with user preferences. Existing LLM‑based agents struggle with constraint satisfaction, tool coordination, and efficiency, often producing infeasible or costly plans. To address these limitations, we present TriFlow, a progressive multi‑agent framework that unifies structured reasoning and language‑based flexibility through a three‑stage pipeline of retrieval, planning, and governance. By this design, TriFlow progressively narrows the search space, assembles constraint‑consistent itineraries via rule‑LLM collaboration, and performs bounded iterative refinement to ensure global feasibility and personalisation. Evaluations on TravelPlanner and TripTailor benchmarks demonstrated state‑of‑the‑art results, achieving 91.1% and 97% final pass rates, respectively, with over 10x runtime efficiency improvement compared to current SOTA.

PaperID: 6844, https://arxiv.org/pdf/2512.11213.pdf

Authors: Dongwon Jung, Peng Shi, Yi Zhang

Title: FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Abstract:
Scaling test‑time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self‑verification, and self‑reflection can significantly enhance task success by allocating more inference‑time compute. However, applying these techniques across multiple agents in a multi‑agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test‑time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test‑time compute allocation in multi‑agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi‑agent workflows. These modules are automatically derived through self‑play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual‑level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi‑agent collaboration in inference‑time optimization.

PaperID: 6845, https://arxiv.org/pdf/2512.11179.pdf

Authors: Wei Duan, Jie Lu, En Yu, Junyu Xuan

Title: Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning

Abstract:
Graph‑based multi‑agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. While recent methods excel at learning sparse coordination graphs‑determining who communicates with whom‑they do not address what information should be transmitted under hard bandwidth constraints. We study this bandwidth‑limited regime and show that naive dimensionality reduction consistently degrades coordination performance. Hard bandwidth constraints force selective encoding, but deterministic projections lack mechanisms to control how compression occurs. We introduce Bandwidth‑constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from learned Gaussian posteriors regularized via KL divergence to an uninformative prior. BVME's variational framework provides principled, tunable control over compression strength through interpretable hyperparameters, directly constraining the representations used for decision‑making. Across SMACv1, SMACv2, and MPE benchmarks, BVME achieves comparable or superior performance while using 67‑‑83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination. Ablations reveal U‑shaped sensitivity to bandwidth, with BVME excelling at extreme ratios while adding minimal overhead.

PaperID: 6846, https://arxiv.org/pdf/2512.11143.pdf

Authors: Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan LI, Yan Chen

Title: Automated Penetration Testing with LLM Agents and Classical Planning

Abstract:
While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands‑off‑the‑keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner‑Executor‑Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that the out‑of‑the‑box Claude Code and Sonnet 4.5 exhibit superior penetration capabilities observed to date, substantially outperforming all prior systems. However, a detailed analysis of their testing processes reveals specific strengths and limitations; notably, LLM agents struggle with maintaining coherent long‑horizon plans, performing complex reasoning, and effectively utilizing specialized tools. These limitations significantly constrain its overall capability, efficiency, and stability. To address these limitations, we propose CHECKMATE, a framework that integrates enhanced classical planning with LLM agents, providing an external, structured "brain" that mitigates the inherent weaknesses of LLM agents. Our evaluation shows that CHECKMATE outperforms the state‑of‑the‑art system (Claude Code) in penetration capability, improving benchmark success rates by over 20%. In addition, it delivers substantially greater stability, cutting both time and monetary costs by more than 50%.

PaperID: 6847, https://arxiv.org/pdf/2512.11109.pdf

Authors: Mohammadjavad Ahmadpour, Amirmahdi Meighani, Payam Taebi, Omid Ghahroodi, Amirmohammad Izadi, Mahdieh Soleymani Baghshah

Title: Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Abstract:
Test‑time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision‑Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference time reasoning methods applied across both open‑source and closed‑source VLMs on different benchmarks. Our results reveal that while closed‑source models consistently benefit from structured reasoning and iterative Self‑Refinement, open‑source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset‑dependent, yielding clear improvements on multi‑step reasoning tasks but offering only limited gains on perception‑focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.

PaperID: 6848, https://arxiv.org/pdf/2512.11001.pdf

Authors: Zoi Kaoudi, Ioana Giurgiu

Title: Query Optimization Beyond Data Systems: The Case for Multi-Agent Systems

Abstract:
The proliferation of large language models (LLMs) has accelerated the adoption of agent‑based workflows, where multiple autonomous agents reason, invoke functions, and collaborate to compose complex data pipelines. However, current approaches to building such agentic architectures remain largely ad hoc, lacking generality, scalability, and systematic optimization. Existing systems often rely on fixed models and single execution engines and are unable to efficiently optimize multiple agents operating over heterogeneous data sources and query engines. This paper presents a vision for a next‑generation query optimization framework tailored to multi‑agent workflows. We argue that optimizing these workflows can benefit from redesigning query optimization principles to account for new challenges: orchestration of diverse agents, cost efficiency under expensive LLM calls and across heterogeneous engines, and redundancy across tasks. Led by a real‑world example and building on an analysis of multi‑agent workflows, we outline our envisioned architecture and the main research challenges of building a multi‑agent query optimization framework, which aims at enabling automated model selection, workflow composition, and execution across heterogeneous engines. This vision establishes the groundwork for query optimization in emerging multi‑agent architectures and opens up a set of future research directions.

PaperID: 6849, https://arxiv.org/pdf/2512.10975.pdf

Authors: Matvey Nepomnyaschiy, Oleg Pereziabov, Anvar Tliamov, Stanislav Mikhailov, Ilya Afanasyev

Title: Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems

Abstract:
Effective human‑agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models ‑ leveraging facial expressions, speech, and textual cues ‑ offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi‑agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof‑of‑concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision‑making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.

PaperID: 6850, https://arxiv.org/pdf/2512.10937.pdf

Authors: Matt Wilson

Title: Agent policies from higher-order causal functions

Abstract:
We establish a correspondence between equivalence classes of agent‑state policies for deterministic POMDPs and one‑input process functions (the classical‑deterministic limit of higher‑order quantum operations). We use this correspondence to build a bridge between the agent‑environment interaction in artificial intelligence, causal structure in the foundations of physics, and logic in computer science. We construct a autonomous category PF of types which supports an interpretation of one‑step evaluation of policies, and multi‑agent observation constraints, into cuts and monoidal products. In terms of types, we develop the correspondence further by identifying observation‑independent decentralised POMDPs as the natural domain for the multi‑input process functions used to model indefinite causality. We then prove a strict separation between general multi‑input process function and definite‑ordered process function performance on such dec‑POMDPs, by finding an instance for which policies utilizing an indefinite causal structure can achieve greater finite‑horizon rewards than policies which are restricted to a fixed background causal structure.

PaperID: 6851, https://arxiv.org/pdf/2512.10931.pdf

Authors: George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Denis Kuznedelev, Alina Shutova, Max Ryabinin

Title: Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Abstract:
Many state‑of‑the‑art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real‑world use cases such as voice‑based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking‑augmented answers while reducing time to first non‑thinking token from minutes to \le 5s and the overall delays by up to 12×.

PaperID: 6852, https://arxiv.org/pdf/2512.10918.pdf

Authors: Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, Josiah Hester

Title: CompanionCast: Toward Social Collaboration with Multi-Agent Systems in Shared Experiences

Abstract:
Shared experiences are fundamental to social connection, yet media consumption is increasingly solitary. While AI companions offer real‑time reactions and emotional regulation, existing systems either rely on single‑agent designs or lack the social awareness and multi‑party interaction required to replicate authentic group dynamics. We present CompanionCast, a general framework for orchestrating multiple specialized AI agents as social collaborators within a live shared context. CompanionCast integrates multimodal event detection, rolling context caching for improved grounding, and spatial audio to enhance co‑presence. We validate CompanionCast through sports viewing, a domain with rich dynamics and strong social traditions. Pilot studies with soccer fans demonstrate that CompanionCast significantly improves perceived social presence and emotional sharing compared to solitary viewing. We conclude by discussing implications and open challenges for multi‑agent systems as social collaborators in shared experiences.

PaperID: 6853, https://arxiv.org/pdf/2512.10835.pdf

Authors: Atahan Cilan, Atay Özgövde

Title: Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

Abstract:
This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large‑scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N‑dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO‑based multi‑agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi‑player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win‑only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human‑like behavior simulation, and replacing disconnected players in online games.

PaperID: 6854, https://arxiv.org/pdf/2512.10758.pdf

Authors: Kaihua Ding

Title: Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework

Abstract:
The proliferation of generative AI tools has rendered traditional modular assessments in computing and data‑centric education increasingly ineffective, creating a disconnect between academic evaluation and authentic skill measurement. This paper presents a theoretically grounded framework for designing AI‑resilient assessments, supported by formal analysis and empirical validation. We make three primary contributions. First, we establish two formal propositions. (1) Assessments composed of interconnected problems, in which outputs serve as inputs to subsequent tasks, are inherently more AI‑resilient than modular assessments due to their reliance on multi‑step reasoning and sustained context. (2) Semi‑structured problems with deterministic success criteria provide more reliable measures of student competency than fully open‑ended projects, which allow AI systems to default to familiar solution templates. These results challenge widely cited recommendations in recent institutional and policy guidance that promote open‑ended assessments as inherently more robust to AI assistance. Second, we validate these propositions through empirical analysis of three university data science courses (N = 117). We observe a substantial AI inflation effect: students achieve near‑perfect scores on AI‑assisted modular homework, while performance drops by approximately 30 percentage points on proctored exams (Cohen d = 1.51). In contrast, interconnected projects remain strongly aligned with modular assessments (r = 0.954, p < 0.001) while maintaining AI resistance, whereas proctored exams show weaker alignment (r = 0.726, p < 0.001). Third, we translate these findings into a practical assessment design procedure that enables educators to construct evaluations that promote deeper engagement, reflect industry practice, and resist trivial AI delegation.

PaperID: 6855, https://arxiv.org/pdf/2512.10739.pdf

Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

Title: Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Abstract:
Large Reasoning Models (LRMs) have expanded the mathematical reasoning frontier through Chain‑of‑Thought (CoT) techniques and Reinforcement Learning with Verifiable Rewards (RLVR), capable of solving AIME‑level problems. However, the performance of LRMs is heavily dependent on the extended reasoning context length. For solving ultra‑hard problems like those in the International Mathematical Olympiad (IMO), the required reasoning complexity surpasses the space that an LRM can explore in a single round. Previous works attempt to extend the reasoning context of LRMs but remain prompt‑based and built upon proprietary models, lacking systematic structures and training pipelines. Therefore, this paper introduces Intern‑S1‑MO, a long‑horizon math agent that conducts multi‑round hierarchical reasoning, composed of an LRM‑based multi‑agent system including reasoning, summary, and verification. By maintaining a compact memory in the form of lemmas, Intern‑S1‑MO can more freely explore the lemma‑rich reasoning spaces in multiple reasoning stages, thereby breaking through the context constraints for IMO‑level math problems. Furthermore, we propose OREAL‑H, an RL framework for training the LRM using the online explored trajectories to simultaneously bootstrap the reasoning ability of LRM and elevate the overall performance of Intern‑S1‑MO. Experiments show that Intern‑S1‑MO can obtain 26 out of 35 points on the non‑geometry problems of IMO2025, matching the performance of silver medalists. It also surpasses the current advanced LRMs on inference benchmarks such as HMMT2025, AIME2025, and CNMO2025. In addition, our agent officially participates in CMO2025 and achieves a score of 102/126 under the judgment of human experts, reaching the gold medal level.

PaperID: 6856, https://arxiv.org/pdf/2512.10696.pdf

Authors: Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao

Title: Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Abstract:
Procedural memory enables large language model (LLM) agents to internalize "how‑to" knowledge, theoretically reducing redundant trial‑and‑error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append‑only archive. To bridge the gap between static storage and dynamic reasoning, we propose ReMe (Remember Me, Refine Me), a comprehensive framework for experience‑driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) multi‑faceted distillation, which extracts fine‑grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) context‑adaptive reuse, which tailors historical insights to new contexts via scenario‑aware indexing; and 3) utility‑based refinement, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high‑quality experience pool. Extensive experiments on BFCL‑V3 and AppWorld demonstrate that ReMe establishes a new state‑of‑the‑art in agent memory system. Crucially, we observe a significant memory‑scaling effect: Qwen3‑8B equipped with ReMe outperforms larger, memoryless Qwen3‑14B, suggesting that self‑evolving memory provides a computation‑efficient pathway for lifelong learning. We release our code and the \textttreme.library dataset to facilitate further research.

PaperID: 6857, https://arxiv.org/pdf/2512.10665.pdf

Authors: Muhua Huang, Qinlin Zhao, Xiaoyuan Yi, Xing Xie

Title: On the Dynamics of Multi-Agent LLM Communities Driven by Value Diversity

Abstract:
As Large Language Models (LLM) based multi‑agent systems become increasingly prevalent, the collective behaviors, e.g., collective intelligence, of such artificial communities have drawn growing attention. This work aims to answer a fundamental question: How does diversity of values shape the collective behavior of AI communities? Using naturalistic value elicitation grounded in the prevalent Schwartz's Theory of Basic Human Values, we constructed multi‑agent simulations where communities with varying numbers of agents engaged in open‑ended interactions and constitution formation. The results show that value diversity enhances value stability, fosters emergent behaviors, and brings more creative principles developed by the agents themselves without external guidance. However, these effects also show diminishing returns: extreme heterogeneity induces instability. This work positions value diversity as a new axis of future AI capability, bridging AI ability and sociological studies of institutional emergence.

PaperID: 6858, https://arxiv.org/pdf/2512.10610.pdf

Authors: Xiaopei Tan, Muyang Fan

Title: Thinking While Driving: A Concurrent Framework for Real-Time, LLM-Based Adaptive Routing

Abstract:
We present Thinking While Driving, a concurrent routing framework that integrates LLMs into a graph‑based traffic environment. Unlike approaches that require agents to stop and deliberate, our system enables LLM‑based route planning while agents are moving, significantly reducing intersection wait times. Under high traffic, agents average just 0.75 seconds of decision latency. To coordinate many agents in real‑time, we implement a non‑blocking asynchronous architecture using Unity coroutines and a dedicated request manager. The environment is a weighted undirected graph with live congestion metrics, updated continuously by the agents to enable shared perception. Our results show LLM‑driven agents can dynamically adapt to traffic, reroute around congestion, and exhibit behaviors beyond static pathfinding, all while maintaining real‑time performance. This work provides a reproducible framework for future research in adaptive routing and multi‑agent cooperation.

PaperID: 6859, https://arxiv.org/pdf/2512.10534.pdf

Authors: Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen

Title: Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Abstract:
Large language model (LLM) agents exhibit strong mathematical problem‑solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large‑scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist‑level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity‑Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker‑32B, InternGeometry solves 44 of 50 IMO geometry problems (2000‑2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert‑level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions.

PaperID: 6860, https://arxiv.org/pdf/2512.10501.pdf

Authors: Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang

Title: Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

Abstract:
Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training‑free architecture that utilizes LLM agents for zero‑shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off‑the‑shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction‑following in PCG. Experiments demonstrate that our approach outperforms single‑agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off‑the‑shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task‑specific fine‑tuning.

PaperID: 6861, https://arxiv.org/pdf/2512.10439.pdf

Authors: Niccolò Grillo, James Rowbottom, Pietro Liò, Carola Bibiane Schönlieb, Stefania Fresca

Title: HypeR Adaptivity: Joint $hr$-Adaptive Meshing via Hypergraph Multi-Agent Deep Reinforcement Learning

Abstract:
Adaptive mesh refinement is central to the efficient solution of partial differential equations (PDEs) via the finite element method (FEM). Classical r‑adaptivity optimizes vertex positions but requires solving expensive auxiliary PDEs such as the Monge‑Ampère equation, while classical h‑adaptivity modifies topology through element subdivision but suffers from expensive error indicator computation and is constrained by isotropic refinement patterns that impose accuracy ceilings. Combined hr‑adaptive techniques naturally outperform single‑modality approaches, yet inherit both computational bottlenecks and the restricted cost‑accuracy trade‑off. Emerging machine learning methods for adaptive mesh refinement seek to overcome these limitations, but existing approaches address h‑adaptivity or r‑adaptivity in isolation. We present HypeR, a deep reinforcement learning framework that jointly optimizes mesh relocation and refinement. HypeR casts the joint adaptation problem using tools from hypergraph neural networks and multi‑agent reinforcement learning. Refinement is formulated as a heterogeneous multi‑agent Markov decision process (MDP) where element agents decide discrete refinement actions, while relocation follows an anisotropic diffusion‑based policy on vertex agents with provable prevention of mesh tangling. The reward function combines local and global error reduction to promote general accuracy. Across benchmark PDEs, HypeR reduces approximation error by up to 6‑‑10× versus state‑of‑art h‑adaptive baselines at comparable element counts, breaking through the uniform refinement accuracy ceiling that constrains subdivision‑only methods. The framework produces meshes with improved shape metrics and alignment to solution anisotropy, demonstrating that jointly learned hr‑adaptivity strategies can substantially enhance the capabilities of automated mesh generation.

PaperID: 6862, https://arxiv.org/pdf/2512.10392.pdf

Authors: Sungjun Seo, Kooktae Lee

Title: Collision-Aware Density-Driven Control of Multi-Agent Systems via Control Barrier Functions

Abstract:
This paper tackles the problem of safe and efficient area coverage using a multi‑agent system operating in environments with obstacles. Applications such as environmental monitoring and search and rescue require robot swarms to cover large domains under resource constraints, making both coverage efficiency and safety essential. To address the efficiency aspect, we adopt the Density‑Driven Control (D^2C) framework, which uses optimal transport theory to steer agents according to a reference distribution that encodes spatial coverage priorities. To ensure safety, we incorporate Control Barrier Functions (CBFs) into the framework. While CBFs are commonly used for collision avoidance, we extend their applicability by introducing obstacle‑specific formulations for both circular and rectangular shapes. In particular, we analytically derive a unit normal vector based on the agent's position relative to the nearest face of a rectangular obstacle, improving safety enforcement in environments with non‑smooth boundaries. Additionally, a velocity‑dependent term is incorporated into the CBF to enhance collision avoidance. Simulation results validate the proposed method by demonstrating smoother navigation near obstacles and more efficient area coverage than the existing method, while still ensuring collision‑free operation.

PaperID: 6863, https://arxiv.org/pdf/2512.10313.pdf

Authors: Kangkun Mao, Fang Xu, Jinru Ding, Yidong Jiang, Yujun Yao, Yirong Chen, Junming Liu, Xiaoqin Wu, Qian Wu, Xiaoyan Huang, Jie Xu

Title: EpiPlanAgent: Agentic Automated Epidemic Response Planning

Abstract:
Epidemic response planning is essential yet traditionally reliant on labor‑intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent‑based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi‑agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public health professionals tested the system using real‑world outbreak scenarios in a controlled evaluation. Results demonstrated that EpiPlanAgent significantly improved the completeness and guideline alignment of plans while drastically reducing development time compared to manual workflows. Expert evaluation confirmed high consistency between AI‑generated and human‑authored content. User feedback indicated strong perceived utility. In conclusion, EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

PaperID: 6864, https://arxiv.org/pdf/2512.10195.pdf

Authors: Gyutaek Oh, Sangjoon Park, Byung-Hoon Kim

Title: AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Abstract:
Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question‑answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi‑turn conversation situations and the identification of multi‑faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi‑agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off‑the‑shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi‑turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi‑faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.

PaperID: 6865, https://arxiv.org/pdf/2512.10166.pdf

Authors: Khushiyant

Title: Emergent Collective Memory in Decentralized Multi-Agent AI Systems

Abstract:
We demonstrate how collective memory emerges in decentralized multi‑agent systems through the interplay between individual agent memory and environmental trace communication. Our agents maintain internal memory states while depositing persistent environmental traces, creating a spatially distributed collective memory without centralized control. Comprehensive validation across five environmental conditions (20x20 to 50x50 grids, 5‑20 agents, 50 runs per configuration) reveals a critical asymmetry: individual memory alone provides 68.7% performance improvement over no‑memory baselines (1563.87 vs 927.23, p < 0.001), while environmental traces without memory fail completely. This demonstrates that memory functions independently but traces require cognitive infrastructure for interpretation. Systematic density‑sweep experiments (rho in [0.049, 0.300], up to 625 agents) validate our theoretical phase transition prediction. On realistic large grids (30x30, 50x50), stigmergic coordination dominates above rho ~ 0.20, with traces outperforming memory by 36‑41% on composite metrics despite lower food efficiency. The experimental crossover confirms the predicted critical density rho_c = 0.230 within 13% error.

PaperID: 6866, https://arxiv.org/pdf/2512.10078.pdf

Authors: Jingyao Ren, Eric Ewing, T. K. Satish Kumar, Sven Koenig, Nora Ayanian

Title: Empirical Hardness in Multi-Agent Pathfinding: Research Challenges and Opportunities

Abstract:
Multi‑agent pathfinding (MAPF) is the problem of finding collision‑free paths for a team of agents on a map. Although MAPF is NP‑hard, the hardness of solving individual instances varies significantly, revealing a gap between theoretical complexity and actual hardness. This paper outlines three key research challenges in MAPF empirical hardness to understand such phenomena. The first challenge, known as algorithm selection, is determining the best‑performing algorithms for a given instance. The second challenge is understanding the key instance features that affect MAPF empirical hardness, such as structural properties like phase transition and backbone/backdoor. The third challenge is how to leverage our knowledge of MAPF empirical hardness to effectively generate hard MAPF instances or diverse benchmark datasets. This work establishes a foundation for future empirical hardness research and encourages deeper investigation into these promising and underexplored areas.

PaperID: 6867, https://arxiv.org/pdf/2512.10046.pdf

Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu

Title: SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

Abstract:
Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open‑ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld‑Robotics~(SWR), a simulation platform for embodied AI in large‑scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi‑robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction‑following task, where a robot must follow vision‑language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi‑agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instructions grounding, (2) 3D spatial reasoning in large environments, (3) safe, long‑range navigation with people and traffic, (4) multi‑robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state‑of‑the‑art models, including vision‑language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments.

PaperID: 6868, https://arxiv.org/pdf/2512.10034.pdf

Authors: Salomé Guilbert, Cassandra Masschelein, Jeremy Goumaz, Bohdan Naida, Philippe Schwaller

Title: DynaMate: An Autonomous Agent for Protein-Ligand Molecular Dynamics Simulations

Abstract:
Force field‑based molecular dynamics (MD) simulations are indispensable for probing the structure, dynamics, and functions of biomolecular systems, including proteins and protein‑ligand complexes. Despite their broad utility in drug discovery and protein engineering, the technical complexity of MD setup, encompassing parameterization, input preparation, and software configuration, remains a major barrier for widespread and efficient usage. Agentic LLMs have demonstrated their capacity to autonomously execute multi‑step scientific processes, and to date, they have not successfully been used to automate protein‑ligand MD workflows. Here, we present DynaMate, a modular multi‑agent framework that autonomously designs and executes complete MD workflows for both protein and protein‑ligand systems, and offers free energy binding affinity calculations with the MM/PB(GB)SA method. The framework integrates dynamic tool use, web search, PaperQA, and a self‑correcting behavior. DynaMate comprises three specialized modules, interacting to plan the experiment, perform the simulation, and analyze the results. We evaluated its performance across twelve benchmark systems of varying complexity, assessing success rate, efficiency, and adaptability. DynaMate reliably performed full MD simulations, corrected runtime errors through iterative reasoning, and produced meaningful analyses of protein‑ligand interactions. This automated framework paves the way toward standardized, scalable, and time‑efficient molecular modeling pipelines for future biomolecular and drug design applications.

PaperID: 6869, https://arxiv.org/pdf/2512.09939.pdf

Authors: Stella C. Dong

Title: Norm-Governed Multi-Agent Decision-Making in Simulator-Coupled Environments:The Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP)

Abstract:
Reinsurance decision‑making exhibits the core structural properties that motivate multi‑agent models: distributed and asymmetric information, partial observability, heterogeneous epistemic responsibilities, simulator‑driven environment dynamics, and binding prudential and regulatory constraints. Deterministic workflow automation cannot meet these requirements, as it lacks the epistemic flexibility, cooperative coordination mechanisms, and norm‑sensitive behaviour required for institutional risk‑transfer. We propose the Reinsurance Constrained Multi‑Agent Simulation Process (R‑CMASP), a formal model that extends stochastic games and Dec‑POMDPs by adding three missing elements: (i) simulator‑coupled transition dynamics grounded in catastrophe, capital, and portfolio engines; (ii) role‑specialized agents with structured observability, belief updates, and typed communication; and (iii) a normative feasibility layer encoding solvency, regulatory, and organizational rules as admissibility constraints on joint actions. Using LLM‑based agents with tool access and typed message protocols, we show in a domain‑calibrated synthetic environment that governed multi‑agent coordination yields more stable, coherent, and norm‑adherent behaviour than deterministic automation or monolithic LLM baselines‑‑reducing pricing variance, improving capital efficiency, and increasing clause‑interpretation accuracy. Embedding prudential norms as admissibility constraints and structuring communication into typed acts measurably enhances equilibrium stability. Overall, the results suggest that regulated, simulator‑driven decision environments are most naturally modelled as norm‑governed, simulator‑coupled multi‑agent systems.

PaperID: 6870, https://arxiv.org/pdf/2512.09935.pdf

Authors: Chih-Han Chen, Chen-Han Tsai, Yu-Shao Peng

Title: Exploring Health Misinformation Detection with Multi-Agent Debate

Abstract:
Fact‑checking health‑related claims has become increasingly critical as misinformation proliferates online. Effective verification requires both the retrieval of high‑quality evidence and rigorous reasoning processes. In this paper, we propose a two‑stage framework for health misinformation detection: Agreement Score Prediction followed by Multi‑Agent Debate. In the first stage, we employ large language models (LLMs) to independently evaluate retrieved articles and compute an aggregated agreement score that reflects the overall evidence stance. When this score indicates insufficient consensus‑falling below a predefined threshold‑the system proceeds to a second stage. Multiple agents engage in structured debate to synthesize conflicting evidence and generate well‑reasoned verdicts with explicit justifications. Experimental results demonstrate that our two‑stage approach achieves superior performance compared to baseline methods, highlighting the value of combining automated scoring with collaborative reasoning for complex verification tasks.

PaperID: 6871, https://arxiv.org/pdf/2512.09932.pdf

Authors: Maya Grace Torii, Takahito Murakami, Shuka Koseki, Yoichi Ochiai

Title: Suzume-chan: Your Personal Navigator as an Embodied Information Hub

Abstract:
Access to expert knowledge often requires real‑time human communication. Digital tools improve access to information but rarely create the sense of connection needed for deep understanding. This study addresses this issue using Social Presence Theory, which explains how a feeling of "being together" enhances communication. An "Embodied Information Hub" is proposed as a new way to share knowledge through physical and conversational interaction. The prototype, Suzume‑chan, is a small, soft AI agent running locally with a language model and retrieval‑augmented generation (RAG). It learns from spoken explanations and responds through dialogue, reducing psychological distance and making knowledge sharing warmer and more human‑centered.

PaperID: 6872, https://arxiv.org/pdf/2512.09897.pdf

Authors: Haoye Lu, Pavan Seshadri, Kaheer Suleman

Title: SCOPE: Language Models as One-Time Teacher for Hierarchical Planning in Text Environments

Abstract:
Long‑term planning in complex, text‑based environments presents significant challenges due to open‑ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high‑level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal‑COnditioned Pretraining for Efficient planning), a one‑shot hierarchical planner that leverages LLM‑generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM‑generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text‑based planning tasks. Compared to the LLM‑based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.

PaperID: 6873, https://arxiv.org/pdf/2512.09882.pdf

Authors: Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho

Title: Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Abstract:
We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi‑agent framework featuring dynamic prompt generation, arbitrary sub‑agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost ‑‑ certain ARTEMIS variants cost 18/hour versus 60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false‑positive rates and struggle with GUI‑based tasks.

PaperID: 6874, https://arxiv.org/pdf/2512.09879.pdf

Authors: Ladan Khoshnevisan, Xinzhi Liu

Title: Resilient Neural-Variable-Structure Consensus Control for Nonlinear MASs with Singular Input Gain Under DoS Attacks

Abstract:
This paper proposes a reliable learning‑based adaptive control framework for nonlinear multi‑agent systems (MASs) subject to Denial‑of‑Service (DoS) attacks and singular control gains, two critical challenges in cyber‑physical systems. A neural‑variable‑structure adaptive controller is developed to achieve leader‑follower consensus while ensuring robustness to external disturbances and adaptability to unknown nonlinear dynamics. A reliability‑assessment rule is introduced to detect communication loss during DoS attacks, upon which a switched control mechanism is activated to preserve closed‑loop stability and performance. Unlike existing resilient MAS control methods, the proposed strategy explicitly accommodates singular control gains and does not rely on restrictive assumptions such as Lipschitz continuity or prior bounds on nonlinearities. To the authors' knowledge, this is the first work to integrate neural learning, variable‑structure robustness, and reliability‑based switching into a unified consensus‑tracking control architecture for heterogeneous nonlinear MASs with singular input gains under DoS attacks. Lyapunov‑based analysis establishes uniform ultimate boundedness of all closed‑loop signals, and Matlab/Simulink simulations on a connected automated vehicle platoon demonstrate the method's effectiveness and resilience.

PaperID: 6875, https://arxiv.org/pdf/2512.09736.pdf

Authors: Jingtian Yan, Zhifei Li, William Kang, Stephen F. Smith, Jiaoyang Li

Title: Analyzing Planner Design Trade-offs for MAPF under ADG-based Realistic Execution

Abstract:
Multi‑Agent Path Finding (MAPF) algorithms are increasingly deployed in industrial warehouses and automated manufacturing facilities, where robots must operate reliably under real‑world physical constraints. However, existing MAPF evaluation frameworks typically rely on simplified robot models, leaving a substantial gap between algorithmic benchmarks and practical performance. Recent frameworks such as SMART combine kinodynamic modeling with execution based on the Action Dependency Graph (ADG), enabling realistic, large‑scale MAPF evaluation. Building on this capability, this work investigates how key planner design choices influence performance under realistic execution settings. We systematically study three fundamental factors: (1) the relationship between solution optimality and execution performance, (2) the sensitivity of system performance to inaccuracies in kinodynamic modeling, and (3) the tradeoff between model accuracy and plan optimality. Empirically, we examine these factors to understand how these design choices affect performance in realistic scenarios. We highlight open challenges and research directions to steer the community toward practical, real‑world deployment.

PaperID: 6876, https://arxiv.org/pdf/2512.09621.pdf

Authors: Jingbo Zhang, Maoxin Ji, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen

Title: Semantic-Aware Cooperative Communication and Computation Framework in Vehicular Networks

Abstract:
Semantic Communication (SC) combined with Vehicular edge computing (VEC) provides an efficient edge task processing paradigm for Internet of Vehicles (IoV). Focusing on highway scenarios, this paper proposes a Tripartite Cooperative Semantic Communication (TCSC) framework, which enables Vehicle Users (VUs) to perform semantic task offloading via Vehicle‑to‑Infrastructure (V2I) and Vehicle‑to‑Vehicle (V2V) communications. Considering task latency and the number of semantic symbols, the framework constructs a Mixed‑Integer Nonlinear Programming (MINLP) problem, which is transformed into two subproblems. First, we innovatively propose a multi‑agent proximal policy optimization task offloading optimization method based on parametric distribution noise (MAPPO‑PDN) to solve the optimization problem of the number of semantic symbols; second, linear programming (LP) is used to solve offloading ratio. Simulations show that performance of this scheme is superior to that of other algorithms.

PaperID: 6877, https://arxiv.org/pdf/2512.09577.pdf

Authors: Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, Elizabeth M. Daly

Title: Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation

Abstract:
We present Auto‑BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto‑BenchmarkCard addresses this gap by combining multi‑agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM‑driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.

PaperID: 6878, https://arxiv.org/pdf/2512.09568.pdf

Authors: Zhi Zhao, Hang Xiao, Wei Rang

Title: PHWSOA: A Pareto-based Hybrid Whale-Seagull Scheduling for Multi-Objective Tasks in Cloud Computing

Abstract:
Task scheduling is a critical research challenge in cloud computing, a transformative technology widely adopted across industries. Although numerous scheduling solutions exist, they predominantly optimize singular or limited metrics such as execution time or resource utilization often neglecting the need for comprehensive multi‑objective optimization. To bridge this gap, this paper proposes the Pareto‑based Hybrid Whale‑Seagull Optimization Algorithm (PHWSOA). This algorithm synergistically combines the strengths of the Whale Optimization Algorithm (WOA) and the Seagull Optimization Algorithm (SOA), specifically mitigating WOA's limitations in local exploitation and SOA's constraints in global exploration. Leveraging Pareto dominance principles, PHWSOA simultaneously optimizes three key objectives: makespan, virtual machine (VM) load balancing, and economic cost. Key enhancements include: Halton sequence initialization for superior population diversity, a Pareto‑guided mutation mechanism to avert premature convergence, and parallel processing for accelerated convergence. Furthermore, a dynamic VM load redistribution mechanism is integrated to improve load balancing during task execution. Extensive experiments conducted on the CloudSim simulator, utilizing real‑world workload traces from NASA‑iPSC and HPC2N, demonstrate that PHWSOA delivers substantial performance gains. Specifically, it achieves up to a 72.1% reduction in makespan, a 36.8% improvement in VM load balancing, and 23.5% cost savings. These results substantially outperform baseline methods including WOA, GA, PEWOA, and GCWOA underscoring PHWSOA's strong potential for enabling efficient resource management in practical cloud environments.

PaperID: 6879, https://arxiv.org/pdf/2512.09548.pdf

Authors: Ioana Giurgiu, Michael E. Nidd

Title: Supporting Dynamic Agentic Workloads: How Data and Agents Interact

Abstract:
The rise of multi‑agent systems powered by large language models (LLMs) and specialized reasoning agents exposes fundamental limitations in today's data management architectures. Traditional databases and data fabrics were designed for static, well‑defined workloads, whereas agentic systems exhibit dynamic, context‑driven, and collaborative behaviors. Agents continuously decompose tasks, shift attention across modalities, and share intermediate results with peers ‑ producing non‑deterministic, multi‑modal workloads that strain conventional query optimizers and caching mechanisms. We propose an Agent‑Centric Data Fabric, a unified architecture that rethinks how data systems serve, optimize, coordinate, and learn from agentic workloads. To achieve this we exploit the concepts of attention‑guided data retrieval, semantic micro‑caching for context‑driven agent federations, predictive data prefetching and quorum‑based data serving. Together, these mechanisms enable agents to access representative data faster and more efficiently, while reducing redundant queries, data movement, and inference load across systems. By framing data systems as adaptive collaborators, instead of static executors, we outline new research directions toward behaviorally responsive data infrastructures, where caching, probing, and orchestration jointly enable efficient, context‑rich data exchange among dynamic, reasoning‑driven agents.

PaperID: 6880, https://arxiv.org/pdf/2512.09458.pdf

Authors: Sławomir Nowaczyk

Title: Architectures for Building Agentic AI

Abstract:
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal‑directed, tool‑using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool‑router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema‑constrained, validated, least‑privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy‑tool‑using agents, memory‑augmented agents, planning and self‑improvement agents, multi‑agent systems, and embodied or web agents ‑ and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate‑before‑actuate safeguards.

PaperID: 6881, https://arxiv.org/pdf/2512.09410.pdf

Authors: Jialin Ying, Zhihao Li, Zicheng Dong, Guohua Wu, Yihuan Liao

Title: Generalizable Collaborative Search-and-Capture in Cluttered Environments via Path-Guided MAPPO and Directional Frontier Allocation

Abstract:
Collaborative pursuit‑evasion in cluttered environments presents significant challenges due to sparse rewards and constrained Fields of View (FOV). Standard Multi‑Agent Reinforcement Learning (MARL) often suffers from inefficient exploration and fails to scale to large scenarios. We propose PGF‑MAPPO (Path‑Guided Frontier MAPPO), a hierarchical framework bridging topological planning with reactive control. To resolve local minima and sparse rewards, we integrate an A‑based potential field for dense reward shaping. Furthermore, we introduce Directional Frontier Allocation, combining Farthest Point Sampling (FPS) with geometric angle suppression to enforce spatial dispersion and accelerate coverage. The architecture employs a parameter‑shared decentralized critic, maintaining O(1) model complexity suitable for robotic swarms. Experiments demonstrate that PGF‑MAPPO achieves superior capture efficiency against faster evaders. Policies trained on 10x10 maps exhibit robust zero‑shot generalization to unseen 20x20 environments, significantly outperforming rule‑based and learning‑based baselines.

PaperID: 6882, https://arxiv.org/pdf/2512.09366.pdf

Authors: Dimitra Maoutsa

Title: Meta-learning three-factor plasticity rules for structured credit assignment with sparse feedback

Abstract:
Biological neural networks learn complex behaviors from sparse, delayed feedback using local synaptic plasticity, yet the mechanisms enabling structured credit assignment remain elusive. In contrast, artificial recurrent networks solving similar tasks typically rely on biologically implausible global learning rules or hand‑crafted local updates. The space of local plasticity rules capable of supporting learning from delayed reinforcement remains largely unexplored. Here, we present a meta‑learning framework that discovers local learning rules for structured credit assignment in recurrent networks trained with sparse feedback. Our approach interleaves local neo‑Hebbian‑like updates during task execution with an outer loop that optimizes plasticity parameters via tangent‑propagation through learning. The resulting three‑factor learning rules enable long‑timescale credit assignment using only local information and delayed rewards, offering new insights into biologically grounded mechanisms for learning in recurrent circuits.

PaperID: 6883, https://arxiv.org/pdf/2512.09312.pdf

Authors: Ziheng Yang, Kun Qiu, Zhe Chen, Wenjun Zhu, Yue Gao

Title: Tyche: A Hybrid Computation Framework of Illumination Pattern for Satellite Beam Hopping

Abstract:
High‑Throughput Satellites (HTS) use beam hopping to handle non‑uniform and time‑varying ground traffic demand. A significant technical challenge in beam hopping is the computation of effective illumination patterns. Traditional algorithms, like the genetic algorithm, require over 300 seconds to compute a single illumination pattern for just 37 cells, whereas modern HTS typically covers over 300 cells, rendering current methods impractical for real‑world applications. Advanced approaches, such as multi‑agent deep reinforcement learning, face convergence issues when the number of cells exceeds 40. In this paper, we introduce Tyche, a hybrid computation framework designed to address this challenge. Tyche incorporates a Monte Carlo Tree Search Beam Hopping (MCTS‑BH) algorithm for computing illumination patterns and employs sliding window and pruning techniques to significantly reduce computation time. Specifically, MCTS‑BH can compute one illumination pattern for 37 cells in just 12 seconds. To ensure real‑time computation, we use a Greedy Beam Hopping (G‑BH) algorithm, which provides a provisional solution while MCTS‑BH completes its computation in the background. Our evaluation results show that MCTS‑BH can increase throughput by up to 98.76%, demonstrating substantial improvements over existing solutions.

PaperID: 6884, https://arxiv.org/pdf/2512.09187.pdf

Authors: Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe, Spencer Kim, Vasu Sharma, Sean O'Brien, Kevin Zhu

Title: WOLF: Werewolf-based Observations for LLM Deception and Falsehoods

Abstract:
Deception is a fundamental challenge for multi‑agent reasoning: effective systems must strategically conceal information while detecting misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly but remain weak at detecting deception in peers. We present WOLF, a multi‑agent social deduction benchmark based on Werewolf that enables separable measurement of deception production and detection. WOLF embeds role‑grounded agents (Villager, Werewolf, Seer, Doctor) in a programmable LangGraph state machine with strict night‑day cycles, debate turns, and majority voting. Every statement is a distinct analysis unit, with self‑assessed honesty from speakers and peer‑rated deceptiveness from others. Deception is categorized via a standardized taxonomy (omission, distortion, fabrication, misdirection), while suspicion scores are longitudinally smoothed to capture both immediate judgments and evolving trust dynamics. Structured logs preserve prompts, outputs, and state transitions for full reproducibility. Across 7,320 statements and 100 runs, Werewolves produce deceptive statements in 31% of turns, while peer detection achieves 71‑73% precision with ~52% overall accuracy. Precision is higher for identifying Werewolves, though false positives occur against Villagers. Suspicion toward Werewolves rises from ~52% to over 60% across rounds, while suspicion toward Villagers and the Doctor stabilizes near 44‑46%. This divergence shows that extended interaction improves recall against liars without compounding errors against truthful roles. WOLF moves deception evaluation beyond static datasets, offering a dynamic, controlled testbed for measuring deceptive and detective capacity in adversarial multi‑agent interaction.

PaperID: 6885, https://arxiv.org/pdf/2512.09142.pdf

Authors: Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek

Title: SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Abstract:
We present SDialog, an MIT‑licensed open‑source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end‑to‑end framework for building and analyzing LLM‑based conversational agents. Built around a standardized \textttDialog representation, SDialog provides: (1) persona‑driven multi‑agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM‑as‑a‑judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed‑backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog‑centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

PaperID: 6886, https://arxiv.org/pdf/2512.09108.pdf

Authors: Paul Brookes, Vardan Voskanyan, Rafail Giavrimis, Matthew Truscott, Mina Ilieva, Chrystalla Pavlou, Alexandru Staicu, Manal Adham, Will Evers- Hood, Jingzhi Gong, Kejia Zhang, Matvey Fedoseev, Vishal Sharma, Roman Bauer, Zheng Wang, Hema Nair, Wei Jie, Tianhua Xu, Aurora Constantin, Leslie Kanthan, Michail Basios

Title: Evolving Excellence: Automated Optimization of LLM-based Agents

Abstract:
Agentic AI systems built on large language models (LLMs) offer significant potential for automating complex workflows, from software development to customer support. However, LLM agents often underperform due to suboptimal configurations; poorly tuned prompts, tool descriptions, and parameters that typically require weeks of manual refinement. Existing optimization methods either are too complex for general use or treat components in isolation, missing critical interdependencies. We present ARTEMIS, a no‑code evolutionary optimization platform that jointly optimizes agent configurations through semantically‑aware genetic operators. Given only a benchmark script and natural language goals, ARTEMIS automatically discovers configurable components, extracts performance signals from execution logs, and evolves configurations without requiring architectural modifications. We evaluate ARTEMIS on four representative agent systems: the \emphALE Agent for competitive programming on AtCoder Heuristic Contest, achieving a 13.6% improvement in acceptance rate; the \emphMini‑SWE Agent for code optimization on SWE‑Perf, with a statistically significant 10.1% performance gain; and the \emphCrewAI Agent for cost and mathematical reasoning on Math Odyssey, achieving a statistically significant 36.9% reduction in the number of tokens required for evaluation. We also evaluate the \emphMathTales‑Teacher Agent powered by a smaller open‑source model (Qwen2.5‑7B) on GSM8K primary‑level mathematics problems, achieving a 22% accuracy improvement and demonstrating that ARTEMIS can optimize agents based on both commercial and local models.

PaperID: 6887, https://arxiv.org/pdf/2512.08980.pdf

Authors: Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Guojun Yin

Title: Training Multi-Image Vision Agents via End2End Reinforcement Learning

Abstract:
Recent VLM‑based agents aim to replicate OpenAI O3's "thinking with images" via tool use, yet most open‑source methods restrict inputs to a single image, limiting their applicability to real‑world multi‑image QA tasks. To address this gap, we propose IMAgent, an open‑source visual agent trained with end‑to‑end reinforcement learning for fine‑grained single/multi‑image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this issue, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus attention on image content. Beyond that, we, for the first time, reveal how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two‑layer motion trajectory masking strategy and tool‑use reward gain, IMAgent acquires an effective tool‑use paradigm through pure reinforcement learning, eliminating the need for costly supervised fine‑tuning data. To further unleash the inherent tool‑usage potential of the base VLM and fill data gaps, we construct a challenging, visually enriched multi‑image QA dataset via multi‑agent system. Extensive experiments validate that IMAgent achieves SOTA performance across mainstream single and multi‑image benchmarks, and our in‑depth analysis offers actionable insights for the community. Code and data will be released soon.

PaperID: 6888, https://arxiv.org/pdf/2512.08964.pdf

Authors: Yongyu Wang

Title: T2T-LA: A Topology-to-Topology LLM Agent for Graph Learning with Neither Feature Access nor Task Knowledge

Abstract:
Graph learning aims to convert data into graph representations, which are fundamental to many problems in machine learning for CAD, where circuits, layouts, designs, and optimization states are often modeled as graph‑structured objects. Existing graph learning methods usually rely on carefully designed graph construction rules, extensive parameter tuning, and sophisticated mathematical theory; moreover, achieving good performance often requires task‑specific graph construction tailored to the downstream objective. In this work, we study whether a large language model (LLM) can reason about graph structure and infer a useful topology without observing the feature matrix, without knowing the downstream task, and without relying on any carefully designed graph construction algorithm or parameter tuning process. To this end, we propose T2T‑LA, a Topology‑to‑Topology LLM Agent that receives no input other than a set of previously failed topologies and the scores assigned to them by a private scorer. The agent is not told what task or algorithm produces the scores, how these topologies are generated, or what the scores mean. Since none of the observed topologies is satisfactory, T2T‑LA cannot simply imitate a good example. Instead, it is forced to infer hidden relationships between graph connectivity patterns and the observed scores, a capability that is particularly relevant to CAD scenarios where useful design structures may be difficult to specify manually. Experimental results show that T2T‑LA can generate, in one shot, a graph topology that enables the downstream algorithm to produce a sufficiently good solution, suggesting a new LLM‑driven direction for topology reasoning and graph representation learning in ML‑for‑CAD workflows.

PaperID: 6889, https://arxiv.org/pdf/2512.08935.pdf

Authors: Yuwei Guo, Zihan Zhao, Xiaowei Liu, Xiangning Yu, Deyu Zhou

Title: From Script to Stage: Automating Experimental Design for Social Simulations with LLMs

Abstract:
Multi‑agent simulation based on LLMs has increasingly emerged as a new paradigm for exploring complex social phenomena and validating theoretical hypotheses. However, traditional experimental design in the social sciences relies heavily on interdisciplinary expert knowledge, involving cumbersome procedures and high technical barriers. While LLM‑driven agents demonstrate broad prospects for designing experiments, their limitations regarding reliability and scientific rigor continue to significantly hinder their in‑depth application in social science research. To address these challenges, this paper proposes FSTS, an automated framework for multi‑agent experiment design based on script generation. Drawing on the concept of the "Decision Theater," the framework deconstructs experimental design into three core phases: Script Composition, Script Finalization, and Actor Generation. Tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the "experimental theater", reproducing results consistent with real‑world situations. The proposal of FSTS not only effectively lowers the barrier for social science experimental design but also provides scientifically grounded decision support for policy‑making.

PaperID: 6890, https://arxiv.org/pdf/2512.08877.pdf

Authors: Ryan LeRoy, Jack Kolb

Title: IPPO Learns the Game, Not the Team: A Study on Generalization in Heterogeneous Agent Teams

Abstract:
Multi‑Agent Reinforcement Learning (MARL) is commonly deployed in settings where agents are trained via self‑play with homogeneous teammates, often using parameter sharing and a single policy architecture. This opens the question: to what extent do self‑play PPO agents learn general coordination strategies grounded in the underlying game, compared to overfitting to their training partners' behaviors? This paper investigates the question using the Heterogeneous Multi‑Agent Challenge (HeMAC) environment, which features distinct Observer and Drone agents with complementary capabilities. We introduce Rotating Policy Training (RPT), an approach that rotates heterogeneous teammate policies of different learning algorithms during training, to expose the agent to a broader range of partner strategies. When playing alongside a withheld teammate policy (DDQN), we find that RPT achieves similar performance to a standard self‑play baseline, IPPO, where all agents were trained sharing a single PPO policy. This result indicates that in this heterogeneous multi‑agent setting, the IPPO baseline generalizes to novel teammate algorithms despite not experiencing teammate diversity during training. This shows that a simple IPPO baseline may possess the level of generalization to novel teammates that a diverse training regimen was designed to achieve.

PaperID: 6891, https://arxiv.org/pdf/2512.08870.pdf

Authors: Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Min Wang, Xiaodong Gu, Yanfu Yan

Title: Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Abstract:
LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co‑evolution across dynamic environments. Despite the demonstrated success of Federated Learning (FL) on static datasets, its effectiveness in open‑ended, self‑evolving agent systems remains largely unexplored. In such settings, the direct application of standard FL is particularly challenging, as heterogeneous tasks and sparse, trajectory‑level reward signals give rise to severe gradient instability, which undermines the global optimization process. To bridge this gap, we propose Fed‑SE, a Federated Self‑Evolution framework for LLM agents that establishes a local evolution‑global aggregation paradigm. Locally, agents employ parameter‑efficient fine‑tuning on filtered, high‑return trajectories to achieve stable gradient updates. Globally, Fed‑SE aggregates updates within a low‑rank subspace, reducing communication cost across clients. Experiments across five heterogeneous environments demonstrate that Fed‑SE improves average task success rates by 10% over the state‑of‑the‑art FedIT, validating its effectiveness in cross‑environment knowledge transfer under privacy constraints.

PaperID: 6892, https://arxiv.org/pdf/2512.08868.pdf

Authors: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Bo Zhang, Xuan Zhou, Ming Yan, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie

Title: EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

Abstract:
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real‑world setting, the e‑commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision‑making processes. To this end, we introduce EcomBench, a holistic E‑commerce Benchmark designed to evaluate agent performance in realistic e‑commerce environments. EcomBench is built from genuine user demands embedded in leading global e‑commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e‑commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi‑step reasoning, and cross‑source knowledge integration. By grounding evaluation in real e‑commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e‑commerce.

PaperID: 6893, https://arxiv.org/pdf/2512.08769.pdf

Authors: Eranga Bandara, Ross Gore, Peter Foytik, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Xueping Liang, Safdar H. Bouk, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Title: A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows

Abstract:
Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi‑step tasks. Unlike traditional single model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models(LLMs), tool‑augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision‑making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production‑grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end‑to‑end guide for designing, developing, and deploying production‑quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi‑agent design patterns, Model Context Protocol(MCP), and tool integration, deterministic orchestration, Responsible‑AI considerations, and environment‑aware deployment strategies. We then present nine core best practices for engineering production‑grade agentic AI workflows, including tool‑first design over MCP, pure‑function invocation, single‑tool and single‑responsibility agents, externalized prompt management, Responsible‑AI‑aligned model‑consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news‑analysis and media‑generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production‑ready agentic AI workflows.

PaperID: 6894, https://arxiv.org/pdf/2512.08743.pdf

Authors: Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou, Lei Bai

Title: Single-Agent Scaling Fails Multi-Agent Intelligence: Towards Foundation Models with Native Multi-Agent Intelligence

Abstract:
Foundation models (FMs) are increasingly assuming the role of the ''brain'' of AI agents. While recent efforts have begun to equip FMs with native single‑agent abilities ‑‑ such as GUI interaction or integrated tool use ‑‑ we argue that the next frontier is endowing FMs with native multi‑agent intelligence. We identify four core capabilities of FMs in multi‑agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence, across 41 large language models and 7 challenging benchmarks, showing that scaling single‑agent performance alone does not automatically yield robust multi‑agent intelligence. To address this gap, we outline key research directions ‑‑ spanning dataset construction, evaluation, training paradigms, and safety considerations ‑‑ for building FMs with native multi‑agent intelligence.

PaperID: 6895, https://arxiv.org/pdf/2512.08737.pdf

Authors: Botao 'Amber' Hu, Bangdao Chen

Title: Insured Agents: A Decentralized Trust Insurance Mechanism for Agentic Economy

Abstract:
The emerging "agentic web" envisions large populations of autonomous agents coordinating, transacting, and delegating across open networks. Yet many agent communication and commerce protocols treat agents as low‑cost identities, despite the empirical reality that LLM agents remain unreliable, hallucinated, manipulable, and vulnerable to prompt‑injection and tool‑abuse. A natural response is "agents‑at‑stake": binding economically meaningful, slashable collateral to persistent identities and adjudicating misbehavior with verifiable evidence. However, heterogeneous tasks make universal verification brittle and centralization‑prone, while traditional reputation struggles under rapid model drift and opaque internal states. We propose a protocol‑native alternative: insured agents. Specialized insurer agents post stake on behalf of operational agents in exchange for premiums, and receive privileged, privacy‑preserving audit access via TEEs to assess claims. A hierarchical insurer market calibrates stake through pricing, decentralizes verification via competitive underwriting, and yields incentive‑compatible dispute resolution.

PaperID: 6896, https://arxiv.org/pdf/2512.08674.pdf

Authors: Rongzhao Zhang, Junqiao Wang, Shuyun Yang, Mouxiao Bian, Chihao Zhang, Dongyang Wang, Qiujuan Yan, Yun Zhong, Yuwei Bai, Guanxu Zhu, Kangkun Mao, Miao Wang, Chao Ding, Renjie Lu, Lei Wang, Lei Zheng, Tao Zheng, Xi Wang, Zhuo Fan, Bing Han, Meiling Liu, Luyi Jiang, Dongming Shan, Wenzhong Jin, Jiwei Yu, Zheng Wang, Jie Xu, Meng Luo

Title: Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology

Abstract:
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi‑Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent‑based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent‑based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.

PaperID: 6897, https://arxiv.org/pdf/2512.08629.pdf

Authors: Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang, Kun Shao, Jun Wang

Title: See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See‑Control, a framework that enables smartphone operation via direct physical interaction with a low‑DoF robotic arm, offering a platform‑agnostic solution. See‑Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM‑based embodied agent that generates robotic control commands without requiring ADB or system back‑end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See‑Control provides a concrete step toward enabling home robots to perform smartphone‑dependent tasks in realistic environments.

PaperID: 6898, https://arxiv.org/pdf/2512.08545.pdf

Authors: Indrajit Kar, Kalathur Chenchu Kishore Kumar

Title: Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks

Abstract:
Large Language Models and multi‑agent systems have shown promise in decomposing complex tasks, yet they struggle with long‑horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi‑agent architecture that distributes reasoning across a 6464 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log‑Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL‑driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long‑horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long‑range reasoning from distributed agent cooperation.

PaperID: 6899, https://arxiv.org/pdf/2512.08503.pdf

Authors: Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim

Title: Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models

Abstract:
Multi‑modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain‑of‑thought reasoning. Existing privacy protection techniques, primarily designed for perception‑based models, prove ineffective against MLRMs' sophisticated multi‑step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept‑aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy‑6K, a comprehensive dataset comprising 6,341 ultra‑high‑resolution images (\geq2K) with hierarchical concept annotations. Extensive evaluation across seven state‑of‑the‑art MLRMs (including GPT‑o3, GPT‑5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4% improvement in tract‑level protection (33.8% vs 19.4%) and nearly doubling block‑level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning‑based threats.

PaperID: 6900, https://arxiv.org/pdf/2512.08492.pdf

Authors: Aliaksei Kaliutau

Title: Autonomous Issue Resolver: Towards Zero-Touch Code Maintenance

Abstract:
Recent advances in Large Language Models have revolutionized function‑level code generation; however, repository‑scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control‑centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi‑agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the "Semantic Trap" inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self‑improvement system for zero‑touch code maintenance that utilizes neuro‑symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE‑Verified benchmark. Our approach directly addresses the core limitations of current AI code‑assistant tools and tackles the critical need for a more robust foundation for our increasingly software‑dependent world.

PaperID: 6901, https://arxiv.org/pdf/2512.08476.pdf

Authors: Po-An Shih, Shao-Hua Wang, Yung-Che Li, Chia-Heng Tu, Chih-Han Chang

Title: A Multi-Agent LLM Framework for Design Space Exploration in Autonomous Driving Systems

Abstract:
Designing autonomous driving systems requires efficient exploration of large hardware/software configuration spaces under diverse environmental conditions, e.g., with varying traffic, weather, and road layouts. Traditional design space exploration (DSE) approaches struggle with multi‑modal execution outputs and complex performance trade‑offs, and often require human involvement to assess correctness based on execution outputs. This paper presents a multi‑agent, large language model (LLM)‑based DSE framework, which integrates multi‑modal reasoning with 3D simulation and profiling tools to automate the interpretation of execution outputs and guide the exploration of system designs. Specialized LLM agents are leveraged to handle user input interpretation, design point generation, execution orchestration, and analysis of both visual and textual execution outputs, which enables identification of potential bottlenecks without human intervention. A prototype implementation is developed and evaluated on a robotaxi case study (an SAE Level 4 autonomous driving application). Compared with a genetic algorithm baseline, the proposed framework identifies more Pareto‑optimal, cost‑efficient solutions with reduced navigation time under the same exploration budget. Experimental results also demonstrate the efficiency of the adoption of the LLM‑based approach for DSE. We believe that this framework paves the way to the design automation of autonomous driving systems.

PaperID: 6902, https://arxiv.org/pdf/2512.08449.pdf

Authors: Yong-Woon Kim

Title: From Accuracy to Impact: The Impact-Driven AI Framework (IDAIF) for Aligning Engineering Architecture with Theory of Change

Abstract:
This paper introduces the Impact‑Driven AI Framework (IDAIF), a novel architectural methodology that integrates Theory of Change (ToC) principles with modern artificial intelligence system design. As AI systems increasingly influence high‑stakes domains including healthcare, finance, and public policy, the alignment problem‑‑ensuring AI behavior corresponds with human values and intentions‑‑has become critical. Current approaches predominantly optimize technical performance metrics while neglecting the sociotechnical dimensions of AI deployment. IDAIF addresses this gap by establishing a systematic mapping between ToC's five‑stage model (Inputs‑Activities‑Outputs‑Outcomes‑Impact) and corresponding AI architectural layers (Data Layer‑Pipeline Layer‑Inference Layer‑Agentic Layer‑Normative Layer). Each layer incorporates rigorous theoretical foundations: multi‑objective Pareto optimization for value alignment, hierarchical multi‑agent orchestration for outcome achievement, causal directed acyclic graphs (DAGs) for hallucination mitigation, and adversarial debiasing with Reinforcement Learning from Human Feedback (RLHF) for fairness assurance. We provide formal mathematical formulations for each component and introduce an Assurance Layer that manages assumption failures through guardian architectures. Three case studies demonstrate IDAIF application across healthcare, cybersecurity, and software engineering domains. This framework represents a paradigm shift from model‑centric to impact‑centric AI development, providing engineers with concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.

PaperID: 6903, https://arxiv.org/pdf/2512.08366.pdf

Authors: Wentao Zhang, Qunbo Wang, BoXuan Zhao, Tao Zhang, Junsheng Wu, Hongping Gan, Ling Dai, Shizhuang Deng, Shuntong Sun, Yang Liu

Title: Reflecting with Two Voices: A Co-Adaptive Dual-Strategy Framework for LLM-Based Agent Decision Making

Abstract:
Large language model (LLM) agents often rely on external demonstrations or retrieval‑augmented planning, leading to brittleness, poor generalization, and high computational overhead. Inspired by human problem‑solving, we propose DuSAR (Dual‑Strategy Agent with Reflecting) ‑‑ a demonstration‑free framework that enables a single frozen LLM to perform co‑adaptive reasoning via two complementary strategies: a high‑level holistic plan and a context‑grounded local policy. These strategies interact through a lightweight reflection mechanism, where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement, mimicking human metacognitive behavior. On both simulated household (ALFWorld) and real‑world web (Mind2Web) environments, DuSAR achieves state‑of‑the‑art performance using only open‑source LLMs, substantially outperforming all prior methods without any demonstrations or fine‑tuning. Remarkably, it also reduces per‑step token consumption by a large margin while maintaining strong task success. Ablation studies confirm the necessity of dual‑strategy coordination. Moreover, optional integration of expert demonstrations further boosts performance, highlighting DuSAR's flexibility and compatibility with external knowledge.

PaperID: 6904, https://arxiv.org/pdf/2512.08345.pdf

Authors: Benedikt Mangold

Title: The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations

Abstract:
Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi‑Agent Systems to simulate 1‑on‑1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent‑based modeling provides a reproducible, ethical alternative to human‑subject research for measuring the mechanics of social friction.

PaperID: 6905, https://arxiv.org/pdf/2512.08341.pdf

Authors: Thai Duong Nguyen, Ngoc-Tan Nguyen, Thanh-Dao Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran, Symeon Chatzinotas

Title: Multi-Agent Deep Reinforcement Learning for Collaborative UAV Relay Networks under Jamming Atatcks

Abstract:
The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next‑generation tactical networks. However, operating in contested environments requires solving a complex trade‑off, including maximizing system throughput while ensuring collision avoidance and resilience against adversarial jamming. Existing heuristic‑based approaches often struggle to find effective solutions due to the dynamic and multi‑objective nature of this problem. This paper formulates this challenge as a cooperative Multi‑Agent Reinforcement Learning (MARL) problem, solved using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors which operate using only local observations. Simulation results show that our proposed framework significantly outperforms heuristic baselines, increasing the total system throughput by approximately 50% while simultaneously achieving a near‑zero collision rate. A key finding is that the agents develop an emergent anti‑jamming strategy without explicit programming. They learn to intelligently position themselves to balance the trade‑off between mitigating interference from jammers and maintaining effective communication links with ground users.

PaperID: 6906, https://arxiv.org/pdf/2512.08322.pdf

Authors: Thanh-Dao Nguyen, Ngoc-Tan Nguyen, Thai-Duong Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran, Symeon Chatzinotas

Title: Collaborative Intelligence for UAV-Satellite Network Slicing: Towards a Joint QoS-Energy-Fairness MADRL Optimization

Abstract:
Non terrestrial networks are critical for achieving global 6G coverage, yet efficient resource management in aerial and space environments remains challenging due to limited onboard power and dynamic operational conditions. Network slicing offers a promising solution for spectrum optimization in UAV based systems serving heterogeneous service demands. For that, this paper proposes a hierarchical network slicing framework for UAV satellite integrated networks supporting eMBB, URLLC, and mMTC services. Specifically, we formulate a joint optimization of UAV trajectory, transmission power, and spectrum allocation as a decentralized partially observable Markov decision process that ensures quality of service while minimizing energy consumption and maximizing resource fairness. To address the computational intractability and partial observability, we develop a multi agent deep reinforcement learning solution under the centralized training and decentralized execution paradigm. In the proposed system, UAV agents act as distributed actors coordinated by a shared critic operating with multi head attention mechanism at a low Earth orbit satellite. Experimental results then demonstrate that our approach outperforms existing methods by up to 33% in cumulative reward while achieving superior energy efficiency and fairness.

PaperID: 6907, https://arxiv.org/pdf/2512.08300.pdf

Authors: Sijia Chen, Baochun Li, Di Niu

Title: rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

Abstract:
Large language models (LLMs) are post‑trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is ``aha'' moments when they start to perform strategies, such as self‑reflection and deep thinking, within chain of thoughts (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM), that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi‑agent RL (MARL), based on a leader‑follower framework and straightforward rule‑based rewards. Experimental results show that rSIM enables Qwen2.5‑0.5B to become an RLM and significantly outperform Qwen2.5‑14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug‑in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.

PaperID: 6908, https://arxiv.org/pdf/2512.08296.pdf

Authors: Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu

Title: Towards a Science of Scaling Agent Systems

Abstract:
Agents, language model‑based systems capable of reasoning, planning, and acting are widely adopted in real‑world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems as a predictive model, capturing how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single‑Agent and four Multi‑Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross‑validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task‑grounded capability metric). We identify a robust capability‑saturation effect and additional patterns: (1) a coordination yields diminishing returns once single‑agent baselines exceed certain performance; (2) tool‑heavy tasks appear to incur multi‑agent overhead; and (3) architectures without centralized verification tend to propagate errors more than those with centralized coordination. Relative performance change compared to single‑agent baseline ranges from +80.8% on decomposable financial reasoning to ‑70.0% on sequential planning, demonstrating that architecture‑task alignment determines collaborative success. The framework identifies the best‑performing architecture for 87% of held‑out configurations and shows consistent relative architecture preferences on unseen frontier models. Agent effectiveness depends on alignment between coordination and task structure, and that mismatched coordination degrades the performance.

PaperID: 6909, https://arxiv.org/pdf/2512.08290.pdf

Authors: Shiva Gaire, Srijan Gyawali, Saroj Mishra, Suman Niroula, Dilip Thakur, Umesh Yadav

Title: Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

Abstract:
The Model Context Protocol (MCP) has emerged as the de facto standard for connecting Large Language Models (LLMs) to external data and tools, effectively functioning as the "USB‑C for Agentic AI." While this decoupling of context and execution solves critical interoperability challenges, it introduces a profound new threat landscape where the boundary between epistemic errors (hallucinations) and security breaches (unauthorized actions) dissolves. This Systematization of Knowledge (SoK) aims to provide a comprehensive taxonomy of risks in the MCP ecosystem, distinguishing between adversarial security threats (e.g., indirect prompt injection, tool poisoning) and epistemic safety hazards (e.g., alignment failures in distributed tool delegation). We analyze the structural vulnerabilities of MCP primitives, specifically Resources, Prompts, and Tools, and demonstrate how "context" can be weaponized to trigger unauthorized operations in multi‑agent environments. Furthermore, we survey state‑of‑the‑art defenses, ranging from cryptographic provenance (ETDI) to runtime intent verification, and conclude with a roadmap for securing the transition from conversational chatbots to autonomous agentic operating systems.

PaperID: 6910, https://arxiv.org/pdf/2512.08286.pdf

Authors: Liao Hu, Qiteng Wu, Ruoyu Qi

Title: Empowering smart app development with SolidGPT: an edge-cloud hybrid AI agent framework

Abstract:
The integration of Large Language Models (LLMs) into mobile and software development workflows faces a persistent tension among three demands: semantic awareness, developer productivity, and data privacy. Traditional cloud‑based tools offer strong reasoning but risk data exposure and latency, while on‑device solutions lack full‑context understanding across codebase and developer tooling. We introduce SolidGPT, an open‑source, edge‑cloud hybrid developer assistant built on GitHub, designed to enhance code and workspace semantic search. SolidGPT enables developers to: talk to your codebase: interactively query code and project structure, discovering the right methods and modules without manual searching. Automate software project workflows: generate PRDs, task breakdowns, Kanban boards, and even scaffold web app beginnings, with deep integration via VSCode and Notion. Configure private, extensible agents: onboard private code folders (up to approximately 500 files), connect Notion, customize AI agent personas via embedding and in‑context training, and deploy via Docker, CLI, or VSCode extension. In practice, SolidGPT empowers developer productivity through: Semantic‑rich code navigation: no more hunting through files or wondering where a feature lives. Integrated documentation and task management: seamlessly sync generated PRD content and task boards into developer workflows. Privacy‑first design: running locally via Docker or VSCode, with full control over code and data, while optionally reaching out to LLM APIs as needed. By combining interactive code querying, automated project scaffolding, and human‑AI collaboration, SolidGPT provides a practical, privacy‑respecting edge assistant that accelerates real‑world development workflows, ideal for intelligent mobile and software engineering contexts.

PaperID: 6911, https://arxiv.org/pdf/2512.08281.pdf

Authors: Kyungmin Kim, Seokbin Yoon, Keumjin Lee

Title: Probabilistic Multi-Agent Aircraft Landing Time Prediction

Abstract:
Accurate and reliable aircraft landing time prediction is essential for effective resource allocation in air traffic management. However, the inherent uncertainty of aircraft trajectories and traffic flows poses significant challenges to both prediction accuracy and trustworthiness. Therefore, prediction models should not only provide point estimates of aircraft landing times but also the uncertainties associated with these predictions. Furthermore, aircraft trajectories are frequently influenced by the presence of nearby aircraft through air traffic control interventions such as radar vectoring. Consequently, landing time prediction models must account for multi‑agent interactions in the airspace. In this work, we propose a probabilistic multi‑agent aircraft landing time prediction framework that provides the landing times of multiple aircraft as distributions. We evaluate the proposed framework using an air traffic surveillance dataset collected from the terminal airspace of the Incheon International Airport in South Korea. The results demonstrate that the proposed model achieves higher prediction accuracy than the baselines and quantifies the associated uncertainties of its outcomes. In addition, the model uncovered underlying patterns in air traffic control through its attention scores, thereby enhancing explainability.

PaperID: 6912, https://arxiv.org/pdf/2512.08145.pdf

Authors: Haoran Wang, Zhuohang Chen, Guang Li, Bo Ma, Chuanghuang Li

Title: Chat with UAV -- Human-UAV Interaction Based on Large Language Models

Abstract:
The future of UAV interaction systems is evolving from engineer‑driven to user‑driven, aiming to replace traditional predefined Human‑UAV Interaction designs. This shift focuses on enabling more personalized task planning and design, thereby achieving a higher quality of interaction experience and greater flexibility, which can be used in many fileds, such as agriculture, aerial photography, logistics, and environmental monitoring. However, due to the lack of a common language between users and the UAVs, such interactions are often difficult to be achieved. The developments of Large Language Models possess the ability to understand nature languages and Robots' (UAVs') behaviors, marking the possibility of personalized Human‑UAV Interaction. Recently, some HUI frameworks based on LLMs have been proposed, but they commonly suffer from difficulties in mixed task planning and execution, leading to low adaptability in complex scenarios. In this paper, we propose a novel dual‑agent HUI framework. This framework constructs two independent LLM agents (a task planning agent, and an execution agent) and applies different Prompt Engineering to separately handle the understanding, planning, and execution of tasks. To verify the effectiveness and performance of the framework, we have built a task database covering four typical application scenarios of UAVs and quantified the performance of the HUI framework using three independent metrics. Meanwhile different LLM models are selected to control the UAVs with compared performance. Our user study experimental results demonstrate that the framework improves the smoothness of HUI and the flexibility of task execution in the tasks scenario we set up, effectively meeting users' personalized needs.

PaperID: 6913, https://arxiv.org/pdf/2512.08139.pdf

Authors: Mikayel Samvelyan

Title: Robust Agents in Open-Ended Worlds

Abstract:
The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever‑changing, open‑ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open‑endedness and multi‑agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out‑of‑distribution inputs, and interactions with other co‑player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two‑player zero‑sum games. We further probe robustness in multi‑agent domains, utilising quality‑diversity methods to systematically identify vulnerabilities in state‑of‑the‑art, pre‑trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever‑evolving world but also thrive in the face of unforeseen challenges and interactions.

PaperID: 6914, https://arxiv.org/pdf/2512.08132.pdf

Authors: Kyriakos Lotidis, Panayotis Mertikopoulos, Nicholas Bambos, Jose Blanchet

Title: Multi-agent learning under uncertainty: Recurrence vs. concentration

Abstract:
In this paper, we examine the convergence landscape of multi‑agent learning under uncertainty. Specifically, we analyze two stochastic models of regularized learning in continuous games ‑‑ one in continuous and one in discrete time with the aim of characterizing the long‑run behavior of the induced sequence of play. In stark contrast to deterministic, full‑information models of learning (or models with a vanishing learning rate), we show that the resulting dynamics do not converge in general. In lieu of this, we ask instead which actions are played more often in the long run, and by how much. We show that, in strongly monotone games, the dynamics of regularized learning may wander away from equilibrium infinitely often, but they always return to its vicinity in finite time (which we estimate), and their long‑run distribution is sharply concentrated around a neighborhood thereof. We quantify the degree of this concentration, and we show that these favorable properties may all break down if the underlying game is not strongly monotone ‑‑ underscoring in this way the limits of regularized learning in the presence of persistent randomness and uncertainty.

PaperID: 6915, https://arxiv.org/pdf/2512.08104.pdf

Authors: Harish Karthikeyan, Yue Guo, Leo de Castro, Antigoni Polychroniadou, Udari Madhushani Sehwag, Leo Ardon, Sumitra Ganesh, Manuela Veloso

Title: AgentCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration

Abstract:
As AI agents increasingly operate in complex environments, ensuring reliable, context‑aware privacy is critical for regulatory compliance. Traditional access controls are insufficient because privacy risks often arise after access is granted; agents may inadvertently compromise privacy during reasoning by messaging humans, leaking context to peers, or executing unsafe tool calls. Existing approaches typically treat privacy as a binary constraint, overlooking nuanced, computation‑dependent requirements. Furthermore, Large Language Model (LLM) agents are inherently probabilistic, lacking formal guarantees for security‑critical operations. To address this, we introduce AgentCrypt, a three‑tiered framework for secure agent communication that adds a deterministic protection layer atop any AI platform. AgentCrypt spans the full spectrum of privacy needs: from unrestricted data exchange (Level 1), to context‑aware masking (Level 2), up to fully encrypted computation using Homomorphic Encryption (Level 3). Unlike prompt‑based defenses, our approach guarantees that tagged data privacy is strictly preserved even when the underlying model errs. Security is decoupled from the agent's probabilistic reasoning, ensuring sensitive data remains protected throughout the computational lifecycle. AgentCrypt enables collaborative computation on otherwise inaccessible data, overcoming barriers like data silos. We implemented and validated it using LangGraph and Google ADK, demonstrating versatility across architectures. Finally, we introduce a benchmark dataset simulating privacy‑critical tasks to enable systematic evaluation and foster the development of trustworthy, regulatable machine learning systems.

PaperID: 6916, https://arxiv.org/pdf/2512.08084.pdf

Authors: Boris N. Slautin, Kamyar Barakati, Yu Liu, Reece Emery, Philip Rack, Sergei V. Kalinin

Title: Bayesian Co-Navigation of a Computational Physical Model and AFM Experiment to Autonomously Survey a Combinatorial Materials Library

Abstract:
Building autonomous experiment workflows requires transcending beyond the data‑driven surrogate models to incorporate and dynamically refine physical theory during exploration. Here we demonstrate the first fully automated experimental realization of Bayesian co‑navigation ‑ a framework in which an autonomous agent simultaneously runs a physical experiment and a computationally expensive physical model. Using an automated AFM platform coupled to a kinetic Monte Carlo (kMC) model of thin‑film growth, the system infers a set of effective bond energies for the (CrTaWV)x‑Mo(1‑x) pseudo‑binary combinatorial library, progressively adjusting the kMC parameters to decrease the epistemic disparity between simulation and experiment. This real‑time theoretical refinement enables the kMC model to capture the behavior of the specific materials system and reveals the mechanistic role of hetero‑bonding in governing surface diffusion. Together, these results establish co‑navigation as a general strategy for tightly integrating physical models with autonomous experimental platforms to produce interpretable and continually self‑correcting theoretical modelling of complex materials systems.

PaperID: 6917, https://arxiv.org/pdf/2512.07917.pdf

Authors: Zhehao Dong, Shanghai Du, Zhen Lu, Yue Yang

Title: CFD-copilot: leveraging domain-adapted large language model and model context protocol to enhance simulation automation

Abstract:
Configuring computational fluid dynamics (CFD) simulations requires significant expertise in physics modeling and numerical methods, posing a barrier to non‑specialists. Although automating scientific tasks with large language models (LLMs) has attracted attention, applying them to the complete, end‑to‑end CFD workflow remains a challenge due to its stringent domain‑specific requirements. We introduce CFD‑copilot, a domain‑specialized LLM framework designed to facilitate natural language‑driven CFD simulation from setup to post‑processing. The framework employs a fine‑tuned LLM to directly translate user descriptions into executable CFD setups. A multi‑agent system integrates the LLM with simulation execution, automatic error correction, and result analysis. For post‑processing, the framework utilizes the model context protocol (MCP), an open standard that decouples LLM reasoning from external tool execution. This modular design allows the LLM to interact with numerous specialized post‑processing functions through a unified and scalable interface, improving the automation of data extraction and analysis. The framework was evaluated on benchmarks including the NACA~0012 airfoil and the three‑element 30P‑30N airfoil. The results indicate that domain‑specific adaptation and the incorporation of the MCP jointly enhance the reliability and efficiency of LLM‑driven engineering workflows.

PaperID: 6918, https://arxiv.org/pdf/2512.07901.pdf

Authors: Kevin Vallier

Title: The Theory of Strategic Evolution: Games with Endogenous Players and Strategic Replicators

Abstract:
Von Neumann founded both game theory and the theory of self‑reproducing automata, but the two programs never merged. This paper provides the synthesis. The Theory of Strategic Evolution analyzes strategic replicators: entities that optimize under resource constraints and spawn copies of themselves. We introduce Games with Endogenous Players (GEPs), where lineages (not instances) are the fundamental strategic units, and define Evolutionarily Stable Distributions of Intelligence (ESDIs) as the resulting equilibrium concept. The central mathematical object is a hierarchy of strategic layers linked by cross‑level gain matrices. Under a small‑gain condition (spectral radius less than one), the system admits a global Lyapunov function at every finite depth. We prove closure under meta‑selection: adding governance levels, innovation, or constitutional evolution preserves the dynamical structure. The Alignment Impossibility Theorem shows that unrestricted self‑modification destroys this structure; stable alignment requires bounded modification classes. Applications include AI deployment dynamics, market concentration, and institutional design. The framework shows why personality engineering fails under selection pressure and identifies constitutional constraints necessary for stable multi‑agent systems.

PaperID: 6919, https://arxiv.org/pdf/2512.07898.pdf

Authors: Hongwei Zhang, Ji Lu, Yongsheng Du, Yanqin Gao, Lingjun Huang, Baoli Wang, Fang Tan, Peng Zou

Title: MARINE: Theoretical Optimization and Design for Multi-Agent Recursive IN-context Enhancement

Abstract:
Large Language Model (LLM)‑based agents demonstrate advanced reasoning capabilities, yet practical constraints frequently limit outputs to single responses, leaving significant performance potential unrealized. This paper introduces MARINE (Multi‑Agent Recursive IN‑context Enhancement), a theoretically grounded framework that reconceptualizes test‑time reasoning as iterative refinement of a persistent reference trajectory, fundamentally departing from conventional one‑shot or multi‑sample paradigms. The MARINE refinement operator systematically converts a base model's pass@N capabilities into near‑optimal pass@1 performance. Rigorous theoretical analysis establishes that minimal feasible batches maximize expected performance gains under fixed invocation budgets, while logarithmically growing batch schedules ensure continuous improvement without computational constraints. Comprehensive evaluation on the BrowserComp‑ZH benchmark demonstrates state‑of‑the‑art results, with a 685B‑parameter implementation achieving 46.0% pass@1 accuracy. Meanwhile, MARINE establishes a new paradigm for parameter‑efficient reasoning: an 80B‑parameter model augmented with MARINE matches the performance of standalone 1000B‑parameter agents, reducing parameter requirements by over an order of magnitude. Notably, within a fixed computational budget, the proposed MARINE delivers higher‑quality samples to alignment and optimization processes than traditional sampling‑and‑ranking strategies. Consequently, it has great potential to boost post‑training efficiency.

PaperID: 6920, https://arxiv.org/pdf/2512.07850.pdf

Authors: Alejandro Cuadron, Pengfei Yu, Yang Liu, Arpit Gupta

Title: SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents

Abstract:
Despite rapid progress in LLM agents, performance on long‑horizon, tool‑using tasks remains fragile. To better understand this fragility, we ask a simple question: \emphdo all actions contribute equally to failure? Analyzing execution traces on τ‑Bench (Airline/Retail) and SWE‑Bench Verified, we decompose trajectories into \emphmutating (environment‑changing) vs.\ non‑mutating steps and formalize \emphdecisive deviations, earliest action, level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by upto 92% on Airline and upto 96% on Retail for SoTA models. In contrast, deviations in non‑mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce \cm, a model‑agnostic, gradient‑free, test‑time safeguard that (i) adds mutation‑gated verification, (ii) injects \emphTargeted Reflection before mutating steps, and (iii) performs block‑based context cleaning. \cm delivers consistent gains, e.g., Qwen3‑Thinking: +28% \emphrelative on Airline, +11% on Retail, and +7% on SWE‑Bench Verified; Claude: +9%/+7%. We further identify ceiling effects in τ‑Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ‑Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action‑level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi‑turn agents.

PaperID: 6921, https://arxiv.org/pdf/2512.07849.pdf

Authors: Tong Xia, Jiankun Zhang, Ruiwen You, Ao Xu, Linghao Zhang, Tengyao Tu, Jingzhi Wang, Jinghua Piao, Yunke Zhang, Fengli Xu, Yong Li

Title: AI Urban Scientist: Multi-Agent Collaborative Automation for Urban Research

Abstract:
Urban research aims to understand how cities operate and evolve as complex adaptive systems. With the rapid growth of urban data and analytical methodologies, the central challenge of the field has shifted from data availability to the integration of heterogeneous data into coherent, verifiable urban knowledge through multidisciplinary approaches. Recent advances in AI, particularly the emergence of large language models (LLMs), have enabled the development of AI scientists capable of autonomous reasoning, hypothesis generation, and data‑driven experimentation, demonstrating substantial potential for autonomous urban research. However, most general‑purpose AI systems remain misaligned with the domain‑specific knowledge, methodological conventions, and inferential standards required in urban studies. Here, we introduce the AI Urban Scientist, a knowledge‑driven multi‑agent framework designed to support autonomous urban research. Grounded in hypotheses, peer‑review feedback, datasets, and research methodologies distilled from large‑scale prior studies, the system constructs structured domain knowledge that guides LLM‑based agents to automatically generate hypotheses, identify and integrate multi‑source urban datasets, conduct empirical analyses and simulations, and iteratively refine analytical methods. Through this process, the framework synthesizes new insights in urban science and accelerates the urban research lifecycle.

PaperID: 6922, https://arxiv.org/pdf/2512.07785.pdf

Authors: Eli Gendreau-Distler, Joshua Ho, Dongwon Kim, Luc Tomas Le Pottier, Haichen Wang, Chengxi Yang

Title: Automating High Energy Physics Data Analysis with LLM-Powered Agents

Abstract:
We present a proof‑of‑principle study demonstrating the use of large language model (LLM) agents to automate a representative high energy physics (HEP) analysis. Using the Higgs boson diphoton cross‑section measurement as a case study with ATLAS Open Data, we design a hybrid system that combines an LLM‑based supervisor‑coder agent with the Snakemake workflow manager. In this architecture, the workflow manager enforces reproducibility and determinism, while the agent autonomously generates, executes, and iteratively corrects analysis code in response to user instructions. We define quantitative evaluation metrics including success rate, error distribution, costs per specific task, and average number of API calls, to assess agent performance across multi‑stage workflows. To characterize variability across architectures, we benchmark a representative selection of state‑of‑the‑art LLMs spanning the Gemini and GPT‑5 series, the Claude family, and leading open‑weight models. While the workflow manager ensures deterministic execution of all analysis steps, the final outputs still show stochastic variation. Although we set the temperature to zero, other sampling parameters (e.g., top‑p, top‑k) remained at their defaults, and some reasoning‑oriented models internally adjust these settings. Consequently, the models do not produce fully deterministic results. This study establishes the first LLM‑agent‑driven automated data‑analysis framework in HEP, enabling systematic benchmarking of model capabilities, stability, and limitations in real‑world scientific computing environments. The baseline code used in this work is available at https://huggingface.co/HWresearch/LLM4HEP. This work was accepted as a poster at the Machine Learning and the Physical Sciences (ML4PS) workshop at NeurIPS 2025. The initial submission was made on August 30, 2025.

PaperID: 6923, https://arxiv.org/pdf/2512.07749.pdf

Authors: Emilio Benenati, Giuseppe Belgioioso

Title: The explicit game-theoretic linear quadratic regulator for constrained multi-agent systems

Abstract:
We present an efficient algorithm to compute the explicit open‑loop solution to both finite and infinite‑horizon dynamic games subject to state and input constraints. Our approach relies on a multiparametric affine variational inequality characterization of the open‑loop Nash equilibria and extends the classical explicit constrained LQR and MPC frameworks to multi‑agent non‑cooperative settings. A key practical implication is that linear‑quadratic game‑theoretic MPC becomes viable even at very high sampling rates for multi‑agent systems of moderate size. Extensive numerical experiments demonstrate order‑of‑magnitude improvements in online computation time and solution accuracy compared with state‑of‑the‑art game‑theoretic solvers.

PaperID: 6924, https://arxiv.org/pdf/2512.07665.pdf

Authors: R. Patrick Xian, Garry A. Gabison, Ahmed Alaa, Christoph Riedl, Grigorios G. Chrysos

Title: Reliable agent engineering should integrate machine-compatible organizational principles

Abstract:
As AI agents built on large language models (LLMs) become increasingly embedded in society, issues of coordination, control, delegation, and accountability are entangled with concerns over their reliability. To design and implement LLM agents around reliable operations, we should consider the task complexity in the application settings and reduce their limitations while striving to minimize agent failures and optimize resource efficiency. High‑functioning human organizations have faced similar balancing issues, which led to evidence‑based theories that seek to understand their functioning strategies. We examine the parallels between LLM agents and the compatible frameworks in organization science, focusing on what the design, scaling, and management of organizations can inform agentic systems towards improving reliability. We offer three preliminary accounts of organizational principles for AI agent engineering to attain reliability and effectiveness, through balancing agency and capabilities in agent design, resource constraints and performance benefits in agent scaling, and internal and external mechanisms in agent management. Our work extends the growing exchanges between the operational and governance principles of AI systems and social systems to facilitate system integration.

PaperID: 6925, https://arxiv.org/pdf/2512.07661.pdf

Authors: Shihao Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li

Title: Optimization-Guided Diffusion for Interactive Scene Generation

Abstract:
Realistic and diverse multi‑agent driving scenes are crucial for evaluating autonomous vehicles, but safety‑critical events which are essential for this task are rare and underrepresented in driving datasets. Data‑driven scene generation offers a low‑cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization‑guided, training‑free framework that enforces structural consistency and interaction awareness during diffusion‑based sampling from a scene generation model. OMEGA re‑anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego‑attacker interactions as a game‑theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety‑critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability‑focused generation. Our approach can also generate 5× more near‑collision frames with a time‑to‑collision under three seconds while maintaining the overall scene realism.

PaperID: 6926, https://arxiv.org/pdf/2512.07631.pdf

Authors: Shahar Lutati

Title: The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds

Abstract:
When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem‑solving as information acquisition: an agent requires \Itotal bits to identify a solution and gains \Istep bits per action at cost \Cstep, yielding an effective cost \Ceff = (\Itotal/\Istep), \Cstep that predicts resource requirements before search. We prove that \Ceff lower‑bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM‑based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information‑theoretic lens. \

PaperID: 6927, https://arxiv.org/pdf/2512.07588.pdf

Authors: James Rudd-Jones, María Pérez-Ortiz, Mirco Musolesi

Title: An Agent-Centric Dynamical Systems Perspective on Multi-Agent Reinforcement Learning

Abstract:
Analysing learning in Multi‑Agent Reinforcement Learning (MARL) environments is challenging, in particular with respect to individual decision‑making. Practitioners frequently struggle to compare training runs due to the inherent stochasticity in algorithms arising from random dithering exploration, environment transition noise, and stochastic gradient updates to name a few. Traditional analytical approaches, such as replicator dynamics, oft rely on mean‑field approximations to remove stochastic effects, but this simplification, whilst able to provide general overall trends, can lead to dissonance between analytical predictions and actual agent realisations. We propose modelling MARL training as a coupled stochastic dynamical systems, capturing both agent interactions and environmental characteristics. Leveraging tools from dynamical systems theory, we pragmatically analyse the stability and sensitivity of agent behaviour, which are key dimensions for their practical deployments, for example, in presence of strict safety requirements. This framework allows us to rigorously study the inherent stochasticity of MARL, providing a deeper understanding of system behaviour.

PaperID: 6928, https://arxiv.org/pdf/2512.07462.pdf

Authors: Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Title: Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition, Biases and Multi-Agent Dynamics

Abstract:
As Large Language Models (LLMs) increasingly operate as autonomous decision‑makers in interactive and multi‑agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI‑driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff‑scaled Prisoners Dilemma isolating sensitivity to incentive magnitude, and an integrated multi‑agent Public Goods Game with dynamic payoffs and multi‑agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive‑sensitive cooperation, cross‑linguistic divergence and end‑game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated‑game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model‑ and language‑dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision‑making, and the design of safe multi‑agent systems.

PaperID: 6929, https://arxiv.org/pdf/2512.07453.pdf

Authors: Van An Nguyen, Vuong Khang Huynh, Ho Nam Duong, Huu Loi Bui, Hai Anh Ha, Quang Dung Le, Le Quoc Dung Ngo, Tan Dat Nguyen, Ngoc Ngu Nguyen, Hoai Thuong Nguyen, Zhao Song, Le Hong Trang, The Anh Han

Title: Social welfare optimisation in well-mixed and structured populations

Abstract:
Research on promoting cooperation among autonomous, self‑regarding agents has often focused on the bi‑objective optimisation problem: minimising the total incentive cost while maximising the frequency of cooperation. However, the optimal value of social welfare under such constraints remains largely unexplored. In this work, we hypothesise that achieving maximal social welfare is not guaranteed at the minimal incentive cost required to drive agents to a desired cooperative state. To address this gap, we adopt to a single‑objective approach focused on maximising social welfare, building upon foundational evolutionary game theory models that examined cost efficiency in finite populations, in both well‑mixed and structured population settings. Our analytical model and agent‑based simulations show how different interference strategies, including rewarding local versus global behavioural patterns, affect social welfare and dynamics of cooperation. Our results reveal a significant gap in the per‑individual incentive cost between optimising for pure cost efficiency or cooperation frequency and optimising for maximal social welfare. Overall, our findings indicate that incentive design, policy, and benchmarking in multi‑agent systems and human societies should prioritise welfare‑centric objectives over proxy targets of cost or cooperation frequency.

PaperID: 6930, https://arxiv.org/pdf/2512.07436.pdf

Authors: Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

Title: LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Abstract:
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi‑step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real‑world queries in this domain are often ambiguous and require multi‑hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high‑quality entries from various cities and business types. We construct 300 multi‑hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also developed LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state‑of‑the‑art LRMs struggle on LocalSearchBench: the best model (DeepSeek‑V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain‑specific agent training in local life services. Code, Benchmark, and Leaderboard are available at localsearchbench.github.io.

PaperID: 6931, https://arxiv.org/pdf/2512.07417.pdf

Authors: Giray Önür, Azita Dabiri, Bart De Schutter

Title: Adaptive Tuning of Parameterized Traffic Controllers via Multi-Agent Reinforcement Learning

Abstract:
Effective traffic control is essential for mitigating congestion in transportation networks. Conventional traffic management strategies, including route guidance and ramp metering, often rely on state feedback controllers, which are used for their simplicity and reactivity; however, they lack the adaptability required to cope with complex and time‑varying traffic dynamics. This paper proposes a multi‑agent reinforcement learning (RL) framework in which each agent adaptively tunes the parameters of a state feedback traffic controller, combining the reactivity of state feedback controllers with the adaptability of RL. By tuning parameters at a lower frequency rather than directly determining control inputs at a high frequency, the RL agents achieve improved training efficiency while maintaining adaptability to varying traffic conditions. The multi‑agent structure further enhances system robustness, as local controllers can operate independently in the event of partial failures. The proposed framework is evaluated on a simulated multi‑class transportation network under varying traffic conditions. Results show that the proposed multi‑agent framework outperforms the no‑control and fixed‑parameter state feedback control cases, while performing on par with the single‑agent RL‑based adaptive state feedback control, but with much greater resilience to disturbances.

PaperID: 6932, https://arxiv.org/pdf/2512.07351.pdf

Authors: Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Sami Azam

Title: DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection

Abstract:
The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi‑agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent‑1 examines each video with a streamlined AlexNet‑based CNN to identify the symbols of deepfake manipulation, while Agent‑2 detects audio‑visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame‑reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta‑classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component‑level and fused performance. Agent‑1 achieves a test accuracy of 94.35% on the combined Celeb‑DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent‑2 and the final meta‑classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross‑dataset validation on DeepFakeTIMIT confirms the robustness of the meta‑classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy‑based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi‑agent approach in addressing diverse types of manipulations in deepfakes.

PaperID: 6933, https://arxiv.org/pdf/2512.07195.pdf

Authors: Xuan Zhang, Wenxuan Zhang, Anxu Wang, See-Kiong Ng, Yang Deng

Title: MASim: Multilingual Agent-Based Simulation for Social Science

Abstract:
Multi‑agent role‑playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross‑lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent‑based simulation framework that supports multi‑turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open‑domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.

PaperID: 6934, https://arxiv.org/pdf/2512.07094.pdf

Authors: Christopher Cruz

Title: VIGIL: A Reflective Runtime for Self-Healing Agents

Abstract:
Agentic LLM frameworks promise autonomous behavior via task decomposition, tool use, and iterative planning, but most deployed systems remain brittle. They lack runtime introspection, cannot diagnose their own failure modes, and do not improve over time without human intervention. In practice, many agent stacks degrade into decorated chains of LLM calls with no structural mechanisms for reliability. We present VIGIL (Verifiable Inspection and Guarded Iterative Learning), a reflective runtime that supervises a sibling agent and performs autonomous maintenance rather than task execution. VIGIL ingests behavioral logs, appraises each event into a structured emotional representation, maintains a persistent EmoBank with decay and contextual policies, and derives an RBT diagnosis that sorts recent behavior into strengths, opportunities, and failures. From this analysis, VIGIL generates both guarded prompt updates that preserve core identity semantics and read only code proposals produced by a strategy engine that operates on log evidence and code hotspots. VIGIL functions as a state gated pipeline. Illegal transitions produce explicit errors rather than allowing the LLM to improvise. In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to a schema conflict, it surfaced the internal error, produced a fallback diagnosis, and emitted a repair plan. This demonstrates meta level self repair in a deployed agent runtime.

PaperID: 6935, https://arxiv.org/pdf/2512.07081.pdf

Authors: Rongjia Zhou, Chengzhuo Li, Carl Yang, Jiaying Lu

Title: ClinNoteAgents: An LLM Multi-Agent System for Predicting and Interpreting Heart Failure 30-Day Readmission from Clinical Notes

Abstract:
Heart failure (HF) is one of the leading causes of rehospitalization among older adults in the United States. Although clinical notes contain rich, detailed patient information and make up a large portion of electronic health records (EHRs), they remain underutilized for HF readmission risk analysis. Traditional computational models for HF readmission often rely on expert‑crafted rules, medical thesauri, and ontologies to interpret clinical notes, which are typically written under time pressure and may contain misspellings, abbreviations, and domain‑specific jargon. We present ClinNoteAgents, an LLM‑based multi‑agent framework that transforms free‑text clinical notes into (1) structured representations of clinical and social risk factors for association analysis and (2) clinician‑style abstractions for HF 30‑day readmission prediction. We evaluate ClinNoteAgents on 3,544 notes from 2,065 patients (readmission rate=35.16%), demonstrating high extraction fidelity for clinical variables (conditional accuracy >= 90% for multiple vitals), key risk factor identification, and preservation of predictive signal despite 60 to 90% text reduction. By reducing reliance on structured fields and minimizing manual annotation and model training, ClinNoteAgents provides a scalable and interpretable approach to note‑based HF readmission risk modeling in data‑limited healthcare systems.

PaperID: 6936, https://arxiv.org/pdf/2512.06990.pdf

Authors: Krishna Arun, Moinak Bhattachrya, Paras Goel

Title: Utilizing Multi-Agent Reinforcement Learning with Encoder-Decoder Architecture Agents to Identify Optimal Resection Location in Glioblastoma Multiforme Patients

Abstract:
Currently, there is a noticeable lack of AI in the medical field to support doctors in treating heterogenous brain tumors such as Glioblastoma Multiforme (GBM), the deadliest human cancer in the world with a five‑year survival rate of just 5.1%. This project develops an AI system offering the only end‑to‑end solution by aiding doctors with both diagnosis and treatment planning. In the diagnosis phase, a sequential decision‑making framework consisting of 4 classification models (Convolutional Neural Networks and Support Vector Machine) are used. Each model progressively classifies the patient's brain into increasingly specific categories, with the final step being named diagnosis. For treatment planning, an RL system consisting of 3 generative models is used. First, the resection model (diffusion model) analyzes the diagnosed GBM MRI and predicts a possible resection outcome. Second, the radiotherapy model (Spatio‑Temporal Vision Transformer) generates an MRI of the brain's progression after a user‑defined number of weeks. Third, the chemotherapy model (Diffusion Model) produces the post‑treatment MRI. A survival rate calculator (Convolutional Neural Network) then checks if the generated post treatment MRI has a survival rate within 15% of the user defined target. If not, a feedback loop using proximal policy optimization iterates over this system until an optimal resection location is identified. When compared to existing solutions, this project found 3 key findings: (1) Using a sequential decision‑making framework consisting of 4 small diagnostic models reduced computing costs by 22.28x, (2) Transformers regression capabilities decreased tumor progression inference time by 113 hours, and (3) Applying Augmentations resembling Real‑life situations improved overall DICE scores by 2.9%. These results project to increase survival rates by 0.9%, potentially saving approximately 2,250 lives.

PaperID: 6937, https://arxiv.org/pdf/2512.06933.pdf

Authors: Zifan Peng, Jingyi Zheng, Yule Liu, Huaiyu Jia, Qiming Ye, Jingyu Liu, Xufeng Yang, Mingchen Li, Qingyuan Gong, Xuechao Wang, Xinlei He

Title: TxSum: User-Centered Ethereum Transaction Understanding with Micro-Level Semantic Grounding

Abstract:
Understanding the economic intent of Ethereum transactions is critical for user safety, yet current tools expose only raw on‑chain data or surface‑level intent, leading to widespread "blind signing" (approving transactions without understanding them). Through interviews with 16 Web3 users, we find that effective explanations should be structured, risk‑aware, and grounded at the token‑flow level. Motivated by these findings, we formulate TxSum, a new user‑centered NLP task for Ethereum transaction understanding, and construct a dataset of 187 complex Ethereum transactions annotated with transaction‑level summaries and token flow‑level semantic labels. We further introduce MATEX, a grounded multi‑agent framework for high‑stakes transaction explanation. It selectively retrieves external knowledge under uncertainty and audits explanations against raw traces to improve token‑flow‑level factual consistency. MATEX achieves the strongest overall explanation quality, especially on micro‑level factuality and intent quality. It improves user comprehension on complex transactions from 52.9% to 76.5% over the strongest baseline and raises malicious‑transaction rejection from 36.0% to 88.0%, while maintaining a low false‑rejection rate on benign transactions.

PaperID: 6938, https://arxiv.org/pdf/2512.06914.pdf

Authors: Guanquan Shi, Haohua Du, Zhiqiang Wang, Xiaoyu Liang, Weiwenpei Liu, Song Bian, Zhenyu Guan

Title: SoK: Trust-Authorization Mismatch in LLM Agent Interactions

Abstract:
Large Language Models (LLMs) are evolving into autonomous agents capable of executing complex workflows via standardized protocols (e.g., MCP). However, this paradigm shifts control from deterministic code to probabilistic inference, creating a fundamental Trust‑Authorization Mismatch: static permissions are structurally decoupled from the agent's fluctuating runtime trustworthiness. In this Systematization of Knowledge (SoK), we survey more than 200 representative papers to categorize the emerging landscape of agent security. We propose the Belief‑Intention‑Permission (B‑I‑P) framework as a unifying formal lens. By decomposing agent execution into three distinct stages‑Belief Formation, Intent Generation, and Permission Grant‑we demonstrate that diverse threats, from prompt injection to tool poisoning, share a common root cause: the desynchronization between dynamic trust states and static authorization boundaries. Using the B‑I‑P lens, we systematically map existing attacks and defenses and identify critical gaps where current mechanisms fail to bridge this mismatch. Finally, we outline a research agenda for shifting from static Role‑Based Access Control (RBAC) to dynamic, risk‑adaptive authorization.

PaperID: 6939, https://arxiv.org/pdf/2512.06796.pdf

Authors: Akmaral Moldagalieva, Keisuke Okumura, Amanda Prorok, Wolfgang Hönig

Title: db-LaCAM: Fast and Scalable Multi-Robot Kinodynamic Motion Planning with Discontinuity-Bounded Search and Lightweight MAPF

Abstract:
State‑of‑the‑art multi‑robot kinodynamic motion planners struggle to handle more than a few robots due to high computational burden, which limits their scalability and results in slow planning time. In this work, we combine the scalability and speed of modern multi‑agent path finding (MAPF) algorithms with the dynamic‑awareness of kinodynamic planners to address these limitations. To this end, we propose discontinuity‑Bounded LaCAM (db‑LaCAM), a planner that utilizes a precomputed set of motion primitives that respect robot dynamics to generate horizon‑length motion sequences, while allowing a user‑defined discontinuity between successive motions. The planner db‑LaCAM is resolution‑complete with respect to motion primitives and supports arbitrary robot dynamics. Extensive experiments demonstrate that db‑LaCAM scales efficiently to scenarios with up to 50 robots, achieving up to ten times faster runtime compared to state‑of‑the‑art planners, while maintaining comparable solution quality. The approach is validated in both 2D and 3D environments with dynamics such as the unicycle and 3D double integrator. We demonstrate the safe execution of trajectories planned with db‑LaCAM in two distinct physical experiments involving teams of flying robots and car‑with‑trailer robots.

PaperID: 6940, https://arxiv.org/pdf/2512.06759.pdf

Authors: Wenbo Lyu, Yingjun Du, Jinglin Zhao, Xianton Zhen, Ling Shao

Title: VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Abstract:
Understanding multi‑image, multi‑turn scenarios is a critical yet underexplored capability for Large Vision‑Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons ‑‑ e.g., spotting visual differences or assessing appropriateness ‑‑ while relying heavily on language cues. Such settings overlook progressive, context‑dependent reasoning and the challenge of visual‑to‑visual inference. To bridge this gap, we present VisChainBench, a large‑scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi‑step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real‑world decision‑making processes. Uniquely, the benchmark is constructed using a multi‑agent generation pipeline, ensuring high visual diversity and controlled language bias. All the benchmark data and code for benchmark construction are available for viewing and download via following Link: https://huggingface.co/datasets/eyehole/VisChainBench

PaperID: 6941, https://arxiv.org/pdf/2512.06749.pdf

Authors: Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Title: DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log‑based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log‑only debugging lacks validation, producing untested hypotheses, and (ii) single‑step or single‑agent attribution is often ill‑posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention‑driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome‑oriented view of debugging. Within the Magnetic‑One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18‑28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30‑60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM‑based multi‑agent systems. Project website and code will be available at https://aka.ms/DoVer.

PaperID: 6942, https://arxiv.org/pdf/2512.06743.pdf

Authors: Guanjie Zheng, Ziyang Su, Yiheng Wang, Yuhang Luo, Hongwei Zhang, Xuanhe Zhou, Linghe Kong, Fan Wu, Wen Ling

Title: OSM+: Billion-Level OpenStreetMap Dataset for City-wide Experiments

Abstract:
Road network data provides rich information about cities, but processing worldwide OpenStreetMap (OSM) data is computationally intensive, and the resulting graphs are often difficult to unify for benchmarking downstream tasks. Existing graph learning benchmarks fail to capture the billion‑scale and unique topological properties of real‑world road networks, leaving model scalability underexplored. To close this gap, we process OSM data with distributed cloud computing using 5,000 cores and release OSM+, a structured worldwide 1‑billion‑vertex road network graph dataset designed for high accessibility and usability. OSM+ is open source and globally downloadable, providing an open‑box graph structure and an easy spatial query interface; the evaluated release is a fixed snapshot for reproducibility, with a versioned update plan for future releases. We demonstrate the utility of OSM+ through three illustrative use cases: city boundary detection, traffic prediction, and traffic policy control. For traffic prediction, we construct a new 31‑city benchmark by processing traffic data and combining it with OSM+, enabling broader spatial coverage and more comprehensive evaluation than commonly used datasets, while scaling from hundreds of road network intersections to thousands. For traffic policy control, we release a new six‑city dataset at a much larger scale, introducing challenges for thousand‑scale multi‑agent coordination. We also provide data processing tools for integrating multimodal spatial‑temporal data with OSM+ for geospatial foundation model training, thereby expediting the discovery of compelling scientific insights.

PaperID: 6943, https://arxiv.org/pdf/2512.06721.pdf

Authors: Bufang Yang, Lilin Xu, Liekang Zeng, Yunqi Guo, Siyang Jiang, Wenrui Lu, Kaiwei Liu, Yixuan Li, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan

Title: ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild

Abstract:
Recent studies have begun to explore proactive large language model (LLM) agents that provide unobtrusive assistance by automatically leveraging contextual information, such as in code editing and in‑app suggestions. However, most focus on short, task‑specific episodes or on‑screen contexts, rather than continuously perceiving and assisting users throughout daily life. Enabling such in‑the‑wild assistance requires continuous sensing of users' surroundings, which can incur substantial system overhead. In this work, we propose ProAgent, an end‑to‑end proactive agent system that harnesses on‑demand sensory contexts to provide in‑the‑wild assistance. ProAgent first employs on‑demand tiered perception to continuously sense users' surroundings by integrating low‑cost contextual cues with richer perception on demand, and uses proactive‑oriented context extraction to derive hierarchical contexts integrating both sensory contexts and human preferences. ProAgent then employs a context‑aware proactive reasoner to infer user needs and invokes external tools to deliver proactive assistance. We implement ProAgent on AR glasses and evaluate it on a public dataset and a real‑world dataset. Results demonstrate that ProAgent achieves up to 27.7% higher proactive prediction accuracy and 20.5% lower false detection than state‑of‑the‑art baselines. A user study with 20 participants shows that 85% were satisfied with ProAgent and willing to use it in daily life.

PaperID: 6944, https://arxiv.org/pdf/2512.06716.pdf

Authors: Zhibo Liang, Tianze Hu, Zaiye Chen, Mingjie Tang

Title: Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Abstract:
Autonomous Large Language Model (LLM) agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade‑offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi‑dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full‑lifecycle cognitive supervision. CCA constructs an efficient, dual‑layered defense system through two synergistic pillars: (i) proactive and preemptive control‑flow and data‑flow integrity enforcement via a pre‑generated "Intent Graph"; and (ii) an innovative "Tiered Adjudicator" that, upon deviation detection, initiates deep reasoning based on multi‑dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi‑dimensional trade‑off.

PaperID: 6945, https://arxiv.org/pdf/2512.06659.pdf

Authors: Vaishali Vinay

Title: The Evolution of Agentic AI in Cybersecurity: From Single LLM Reasoners to Multi-Agent Systems and Autonomous Pipelines

Abstract:
Cybersecurity has become one of the earliest adopters of agentic AI, as security operations centers increasingly rely on multi‑step reasoning, tool‑driven analysis, and rapid decision‑making under pressure. While individual large language models can summarize alerts or interpret unstructured reports, they fall short in real SOC environments that require grounded data access, reproducibility, and accountable workflows. In response, the field has seen a rapid architectural evolution from single‑model helpers toward tool‑augmented agents, distributed multi‑agent systems, schema‑bound tool ecosystems, and early explorations of semi‑autonomous investigative pipelines. This survey presents a five‑generation taxonomy of agentic AI in cybersecurity. It traces how capabilities and risks change as systems advance from text‑only LLM reasoners to multi‑agent collaboration frameworks and constrained‑autonomy pipelines. We compare these generations across core dimensions ‑ reasoning depth, tool use, memory, reproducibility, and safety. In addition, we also synthesize emerging benchmarks used to evaluate cyber‑oriented agents. Finally, we outline the unresolved challenges that accompany this evolution, such as response validation, tool‑use correctness, multi‑agent coordination, long‑horizon reasoning, and safeguards for high‑impact actions. Collectively, this work provides a structured perspective on how agentic AI is taking shape within cybersecurity and what is required to ensure its safe and reliable deployment.

PaperID: 6946, https://arxiv.org/pdf/2512.06645.pdf

Authors: Muyang Fan

Title: Analyzing Collision Rates in Large-Scale Mixed Traffic Control via Multi-Agent Reinforcement Learning

Abstract:
Vehicle collisions remain a major challenge in large‑scale mixed traffic systems, especially when human‑driven vehicles (HVs) and robotic vehicles (RVs) interact under dynamic and uncertain conditions. Although Multi‑Agent Reinforcement Learning (MARL) offers promising capabilities for traffic signal control, ensuring safety in such environments remains difficult. As a direct indicator of traffic risk, the collision rate must be well understood and incorporated into traffic control design. This study investigates the primary factors influencing collision rates in a MARL‑governed Mixed Traffic Control (MTC) network. We examine three dimensions: total vehicle count, signalized versus unsignalized intersection configurations, and turning‑movement strategies. Through controlled simulation experiments, we evaluate how each factor affects collision likelihood. The results show that collision rates are sensitive to traffic density, the level of signal coordination, and turning‑control design. These findings provide practical insights for improving the safety and robustness of MARL‑based mixed traffic control systems, supporting the development of intelligent transportation systems in which both efficiency and safety are jointly optimized.

PaperID: 6947, https://arxiv.org/pdf/2512.06590.pdf

Authors: Tendai Mukande, Esraa Ali, Annalina Caputo, Ruihai Dong, Noel OConnor

Title: Towards Efficient Hypergraph and Multi-LLM Agent Recommender Systems

Abstract:
Recommender Systems (RSs) have become the cornerstone of various applications such as e‑commerce and social media platforms. The evolution of RSs is paramount in the digital era, in which personalised user experience is tailored to the user's preferences. Large Language Models (LLMs) have sparked a new paradigm ‑ generative retrieval and recommendation. Despite their potential, generative RS methods face issues such as hallucination, which degrades the recommendation performance, and high computational cost in practical scenarios. To address these issues, we introduce HGLMRec, a novel Multi‑LLM agent‑based RS that incorporates a hypergraph encoder designed to capture complex, multi‑behaviour relationships between users and items. The HGLMRec model retrieves only the relevant tokens during inference, reducing computational overhead while enriching the retrieval context. Experimental results show performance improvement by HGLMRec against state‑of‑the‑art baselines at lower computational cost.

PaperID: 6948, https://arxiv.org/pdf/2512.06577.pdf

Authors: Muhammad Junayed Hasan Zahed, Hossein Rastgoftar

Title: Deep Neural Network-Based Aerial Transport in the Presence of Cooperative and Uncooperative UAS

Abstract:
We present a resilient deep neural network (DNN) framework for decentralized transport and coverage using uncrewed aerial systems (UAS) operating in \mathbbR^n. The proposed DNN‑based mass‑transport architecture constructs a layered inter‑UAS communication graph from an initial formation, assigns time‑varying communication weights through a forward scheduling mechanism that guides the team from the initial to the final configuration, and ensures stability and convergence of the resulting multi‑agent transport dynamics. The framework is explicitly designed to remain robust in the presence of uncooperative agents that deviate from or refuse to follow the prescribed protocol. Our method preserves a fixed feed‑forward topology but dynamically prunes edges to uncooperative agents, maintains convex, feedforward mentoring among cooperative agents, and computes global desired set points through a sparse linear relation consistent with leader references. The target set is abstracted by N points that become final desired positions, enabling coverage‑optimal transport while keeping computation low and guarantees intact. Extensive simulations demonstrate that, under full cooperation, all agents converge rapidly to the target zone with a 10% boundary margin and under partial cooperation with uncooperative agents, the system maintains high convergence among cooperative agents with performance degradation localized near the disruptions, evidencing graceful resilience and scalability. These results confirm that forward‑weight scheduling, hierarchical mentor‑‑mentee coordination, and on‑the‑fly DNN restructuring yield robust, provably stable UAS transport in realistic fault scenarios.

PaperID: 6949, https://arxiv.org/pdf/2512.06573.pdf

Authors: Onur Bilgin, Abdullah As Sami, Sriram Sai Vujjini, John Licato

Title: The Effect of Belief Boxes and Open-mindedness on Persuasion

Abstract:
As multi‑agent systems are increasingly utilized for reasoning and decision‑making applications, there is a greater need for LLM‑based agents to have something resembling propositional beliefs. One simple method for doing so is to include statements describing beliefs maintained in the prompt space (in what we'll call their belief boxes). But when agents have such statements in belief boxes, how does it actually affect their behaviors and dispositions towards those beliefs? And does it significantly affect agents' ability to be persuasive in multi‑agent scenarios? Likewise, if the agents are given instructions to be open‑minded, how does that affect their behaviors? We explore these and related questions in a series of experiments. Our findings confirm that instructing agents to be open‑minded affects how amenable they are to belief change. We show that incorporating belief statements and their strengths influences an agent's resistance to (and persuasiveness against) opposing viewpoints. Furthermore, it affects the likelihood of belief change, particularly when the agent is outnumbered in a debate by opposing viewpoints, i.e., peer pressure scenarios. The results demonstrate the feasibility and validity of the belief box technique in reasoning and decision‑making tasks.

PaperID: 6950, https://arxiv.org/pdf/2512.06556.pdf

Authors: Saeid Jamshidi, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

Title: Semantic Attacks on Tool-Augmented LLMs: Securing the Model Context Protocol Against Descriptor-Level Manipulation

Abstract:
The Model Context Protocol (MCP) enables Large Language Models (LLMs) to interact with external tools via tool descriptors, thereby extending their capabilities for task execution, autonomous decision‑making, and multi‑agent coordination. Existing MCP deployments treat tool descriptors as trusted metadata, despite their direct integration into the LLM reasoning context. This introduces a previously underexplored semantic attack surface. Current defenses primarily target prompt injection, neglecting descriptor‑level manipulation that can bias tool selection and downstream reasoning. To address this gap, we formalize three descriptor‑driven attack classes: Tool Poisoning, Shadowing, and Rug Pull. We propose a layered defense solution that integrates descriptor integrity verification, pre‑context semantic vetting with an auxiliary LLM, and lightweight runtime guardrails, without requiring model retraining. We evaluate GPT‑5.3, DeepSeek‑V3, and LLaMA‑3.5 across eight prompting strategies in controlled, adversarial MCP scenarios in which tool metadata is manipulated to simulate realistic attacks. Results demonstrate that descriptor manipulation can substantially alter tool‑selection behavior, producing unsafe tool invocations in up to 36% of trials under baseline configurations. The proposed full‑stack mitigation reduces unsafe invocations to 15% while increasing the block rate to 74%, demonstrating substantial improvement in resistance to descriptor‑driven attacks. Cross‑model analysis further reveals significant differences in robustness, latency, and sensitivity to descriptor‑level manipulation across LLM architectures and prompting strategies. This study provides a controlled cross‑model evaluation of descriptor‑level threats and mitigation strategies in tool‑calling LLM systems, establishing an empirical foundation for deploying secure and resilient tool‑augmented LLMs.

PaperID: 6951, https://arxiv.org/pdf/2512.06432.pdf

Authors: Yihan Xia, Taotao Wang, Shengli Zhang, Zhangyuhua Weng, Bin Cao, Soung Chang Liew

Title: HiveMind: Contribution-Guided Online Prompt Optimization of LLM Multi-Agent Systems

Abstract:
Recent advances in LLM‑based multi‑agent systems have demonstrated remarkable capabilities in complex decision‑making scenarios such as financial trading and software engineering. However, evaluating each individual agent's effectiveness and online optimization of underperforming agents remain open challenges. To address these issues, we present HiveMind, a self‑adaptive framework designed to optimize LLM multi‑agent collaboration through contribution analysis. At its core, HiveMind introduces Contribution‑Guided Online Prompt Optimization (CG‑OPO), which autonomously refines agent prompts based on their quantified contributions. We first propose the Shapley value as a grounded metric to quantify each agent's contribution, thereby identifying underperforming agents in a principled manner for automated prompt refinement. To overcome the computational complexity of the classical Shapley value, we present DAG‑Shapley, a novel and efficient attribution algorithm that leverages the inherent Directed Acyclic Graph structure of the agent workflow to axiomatically prune non‑viable coalitions. By hierarchically reusing intermediate outputs of agents in the DAG, our method further reduces redundant computations, and achieving substantial cost savings without compromising the theoretical guarantees of Shapley values. Evaluated in a multi‑agent stock‑trading scenario, HiveMind achieves superior performance compared to static baselines. Notably, DAG‑Shapley reduces LLM calls by over 80% while maintaining attribution accuracy comparable to full Shapley values, establishing a new standard for efficient credit assignment and enabling scalable, real‑world optimization of multi‑agent collaboration.

PaperID: 6952, https://arxiv.org/pdf/2512.06396.pdf

Authors: Shovan Roy

Title: AgenticCyber: A GenAI-Powered Multi-Agent System for Multimodal Threat Detection and Adaptive Response in Cybersecurity

Abstract:
The increasing complexity of cyber threats in distributed environments demands advanced frameworks for real‑time detection and response across multimodal data streams. This paper introduces AgenticCyber, a generative AI powered multi‑agent system that orchestrates specialized agents to monitor cloud logs, surveillance videos, and environmental audio concurrently. The solution achieves 96.2% F1‑score in threat detection, reduces response latency to 420 ms, and enables adaptive security posture management using multimodal language models like Google's Gemini coupled with LangChain for agent orchestration. Benchmark datasets, such as AWS CloudTrail logs, UCF‑Crime video frames, and UrbanSound8K audio clips, show greater performance over standard intrusion detection systems, reducing mean time to respond (MTTR) by 65% and improving situational awareness. This work introduces a scalable, modular proactive cybersecurity architecture for enterprise networks and IoT ecosystems that overcomes siloed security technologies with cross‑modal reasoning and automated remediation.

PaperID: 6953, https://arxiv.org/pdf/2512.06390.pdf

Authors: Mehrab Hosain, Sabbir Alom Shuvo, Matthew Ogbe, Md Shah Jalal Mazumder, Yead Rahman, Md Azizul Hakim, Anukul Pandey

Title: Web Technologies Security in the AI Era: A Survey of CDN-Enhanced Defenses

Abstract:
The modern web stack, which is dominated by browser‑based applications and API‑first backends, now operates under an adversarial equilibrium where automated, AI‑assisted attacks evolve continuously. Content Delivery Networks (CDNs) and edge computing place programmable defenses closest to users and bots, making them natural enforcement points for machine‑learning (ML) driven inspection, throttling, and isolation. This survey synthesizes the landscape of AI‑enhanced defenses deployed at the edge: (i) anomaly‑ and behavior‑based Web Application Firewalls (WAFs) within broader Web Application and API Protection (WAAP), (ii) adaptive DDoS detection and mitigation, (iii) bot management that resists human‑mimicry, and (iv) API discovery, positive security modeling, and encrypted‑traffic anomaly analysis. We add a systematic survey method, a threat taxonomy mapped to edge‑observable signals, evaluation metrics, deployment playbooks, and governance guidance. We conclude with a research agenda spanning XAI, adversarial robustness, and autonomous multi‑agent defense. Our findings indicate that edge‑centric AI measurably improves time‑to‑detect and time‑to‑mitigate while reducing data movement and enhancing compliance, yet introduces new risks around model abuse, poisoning, and governance.

PaperID: 6954, https://arxiv.org/pdf/2512.06278.pdf

Authors: Anton A. Stoorvogel, Ali Saberi, Zhenwei Liu, Qiaofeng Wen

Title: Scale-free weak output synchronization of multi-agent systems with adaptive protocols

Abstract:
In this paper, we study output synchronization for multi‑agent systems. The objective is to design a protocol which only depends on the agent dynamics and does not require any knowledge of the network. If the network has a directed spanning tree then the protocols designed in this paper achieve classical output synchronization. Otherwise, the protocol achieves weak synchronization which is induced by network stability in the sense that the signals exchanged over the network converge to zero. Weak sychronization is explained in detail in this paper. Even though we consider linear agents, it is known that this in general requires nonlinear protocols. In the paper we use adaptive protocols. In the literature, two classes of protocols are considered often called collaborative protocols (with additional communication between the protocols and non‑collaborative protocols (sometimes referred to as fully decentralized where the additional communication is not present). This paper considers both of these cases.

PaperID: 6955, https://arxiv.org/pdf/2512.06268.pdf

Authors: Muhammad Junayed Hasan Zahed, Hossein Rastgoftar

Title: A Physics-Informed Fixed Skyroad Model for Continuous UAS Traffic Management (C-UTM)

Abstract:
Unlike traditional multi‑agent coordination frameworks, which assume a fixed number of agents, UAS traffic management (UTM) requires a platform that enables Uncrewed Aerial Systems (UAS) to freely enter or exit constrained low‑altitude airspace. Consequently, the number of UAS operating in a given region is time‑varying, with vehicles dynamically joining or leaving even in dense, obstacle‑laden environments. The primary goal of this paper is to develop a computationally efficient management system that maximizes airspace usability while ensuring safety and efficiency. To achieve this, we first introduce physics‑informed methods to structure fixed skyroads across multiple altitude layers of urban airspace, with the directionality of each skyroad designed to guarantee full reachability. We then present a novel Continuous UTM (C‑UTM) framework that optimally allocates skyroads to UAS requests while accounting for the time‑varying capacity of the airspace. Collectively, the proposed model addresses the key challenges of low‑altitude UTM by providing a scalable, safe, and efficient solution for urban airspace usability.

PaperID: 6956, https://arxiv.org/pdf/2512.06256.pdf

Authors: Aniruddha Maiti, Satya Nimmagadda, Kartha Veerya Jammuladinne, Niladri Sengupta, Ananya Jana

Title: Convergence of Outputs When Two Large Language Models Interact in a Multi-Agentic Setup

Abstract:
In this work, we report what happens when two large language models respond to each other for many turns without any outside input in a multi‑agent setup. The setup begins with a short seed sentence. After that, each model reads the other's output and generates a response. This continues for a fixed number of steps. We used Mistral Nemo Base 2407 and Llama 2 13B hf. We observed that most conversations start coherently but later fall into repetition. In many runs, a short phrase appears and repeats across turns. Once repetition begins, both models tend to produce similar output rather than introducing a new direction in the conversation. This leads to a loop where the same or similar text is produced repeatedly. We describe this behavior as a form of convergence. It occurs even though the models are large, trained separately, and not given any prompt instructions. To study this behavior, we apply lexical and embedding‑based metrics to measure how far the conversation drifts from the initial seed and how similar the outputs of the two models becomes as the conversation progresses.

PaperID: 6957, https://arxiv.org/pdf/2512.06247.pdf

Authors: Gus Henry Smith, Sandesh Adhikary, Vineet Thumuluri, Karthik Suresh, Vivek Pandit, Kartik Hegde, Hamid Shojaei, Chandra Bhagavatula

Title: DUET: Agentic Design Understanding via Experimentation and Testing

Abstract:
AI agents powered by large language models (LLMs) are being used to solve increasingly complex software engineering challenges, but struggle with hardware design tasks. Register Transfer Level (RTL) code presents a unique challenge for LLMs, as it encodes complex, dynamic, time‑evolving behaviors using the low‑level language features of SystemVerilog. LLMs struggle to infer these complex behaviors from the syntax of RTL alone, which limits their ability to complete all downstream tasks like code completion, documentation, or verification. In response to this issue, we present DUET: a general methodology for developing Design Understanding via Experimentation and Testing. DUET mimics how hardware design experts develop an understanding of complex designs: not just via a one‑off readthrough of the RTL, but via iterative experimentation using a number of tools. DUET iteratively generates hypotheses, tests them with EDA tools (e.g., simulation, waveform inspection, and formal verification), and integrates the results to build a bottom‑up understanding of the design. In our evaluations, we show that DUET improves AI agent performance on formal verification, when compared to a baseline flow without experimentation.

PaperID: 6958, https://arxiv.org/pdf/2512.06196.pdf

Authors: Charlie Masters, Marta Grześkiewicz, Stefano V. Albrecht

Title: ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

Abstract:
As agents based on large language models are increasingly deployed to long‑horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi‑agent collaboration problem that dynamically represents stakeholder preferences as natural‑language rubrics: weighted sets of verifiable criteria that can be generated on‑the‑fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group‑Sequence Policy Optimization (GSPO) procedure that balances interpretability, faithfulness, and computational efficiency. Using a corpus of 219 labeled rubrics derived from the GDPVal benchmark, we evaluate ARCANE on challenging tasks requiring multi‑step reasoning and tool use. The learned rubrics produce compact, legible evaluations and enable configurable trade‑offs (e.g., correctness vs. conciseness) without retraining. Our results show that rubric‑based reward models offer a promising path toward interpretable, test‑time adaptive alignment for complex, long‑horizon AI systems.

PaperID: 6959, https://arxiv.org/pdf/2512.06046.pdf

Authors: Ramprasath Ganesaraja, Swathika N, Saravanan AP, Kamalkumar Rathinasamy, Chetana Amancharla, Rahul Das, Sahil Dilip Panse, Aditya Batwe, Dileep Vijayan, Veena Ashok, Thanushree A P, Kausthubh J Rao, Alden Olivero, Roshan, Rajeshwar Reddy Manthena, Asmitha Yuga Sre A, Harsh Tripathi, Suganya Selvaraj, Vito Chin, Kasthuri Rangan Bhaskar, Kasthuri Rangan Bhaskar, Venkatraman R, Sajit Vijayakumar

Title: Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework

Abstract:
We present AI4UI, a framework of autonomous front‑end development agents purpose‑built to meet the rigorous requirements of enterprise‑grade application delivery. Unlike general‑purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human‑in‑the‑loop involvement: at the design stage, developers embed a Gen‑AI‑friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post processing stage, domain experts refine outputs for nuanced design adjustments, domain‑specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering‑ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain‑aware knowledge graphs, a secure abstract/package code integration strategy, expertise driven architecture templates, and a change‑oriented workflow coordinated by specialized agent roles. In large‑scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code‑review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline

PaperID: 6960, https://arxiv.org/pdf/2512.05982.pdf

Authors: Lingfeng Zhou, Yi Xu, Zhenyu Wang, Dequan Wang

Title: FlockVote: LLM-Empowered Agent-Based Modeling for Simulating U.S. Presidential Elections

Abstract:
Modeling complex human behavior, such as voter decisions in national elections, is a long‑standing challenge for computational social science. Traditional agent‑based models (ABMs) are limited by oversimplified rules, while large‑scale statistical models often lack interpretability. We introduce FlockVote, a novel framework that uses Large Language Models (LLMs) to build a "computational laboratory" of LLM agents for political simulation. Each agent is instantiated with a high‑fidelity demographic profile and dynamic contextual information (e.g. candidate policies), enabling it to perform nuanced, generative reasoning to simulate a voting decision. We deploy this framework as a testbed on the 2024 U.S. Presidential Election, focusing on seven key swing states. Our simulation's macro‑level results successfully replicate the real‑world outcome, demonstrating the high fidelity of our "virtual society". The primary contribution is not only the prediction, but also the framework's utility as an interpretable research tool. FlockVote moves beyond black‑box outputs, allowing researchers to probe agent‑level rationale and analyze the stability and sensitivity of LLM‑driven social simulations.

PaperID: 6961, https://arxiv.org/pdf/2512.05951.pdf

Authors: Teofil Bodea, Masanori Misono, Julian Pritzi, Patrick Sabanic, Thore Sommer, Harshavardhan Unnibhavi, David Schall, Nuno Santos, Dimitrios Stavrakakis, Pramod Bhatotia

Title: Trusted AI Agents in the Cloud

Abstract:
AI agents powered by large language models are increasingly deployed as cloud services that autonomously access sensitive data, invoke external tools, and interact with other agents. However, these agents run within a complex multi‑party ecosystem, where untrusted components can lead to data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per binary protection and offer no guarantees for cross‑principal trust, accelerator‑level isolation, or supervised agent behavior. We present Omega, a system that enables trusted AI agents by enforcing end‑to‑end isolation, establishing verifiable trust across all contributing principals, and supervising every external interaction with accountable provenance. Omega builds on Confidential VMs and Confidential GPUs to create a Trusted Agent Platform that hosts many agents within a single CVM using nested isolation. It also provides efficient multi‑agent orchestration with cross‑principal trust establishment via differential attestation, and a policy specification and enforcement framework that governs data access, tool usage, and inter‑agent communication for data protection and regulatory compliance. Implemented on AMD SEV‑SNP and NVIDIA H100, Omega fully secures agent state across CVM‑GPU, and achieves high performance while enabling high‑density, policy‑compliant multi‑agent deployments at cloud scale.

PaperID: 6962, https://arxiv.org/pdf/2512.05815.pdf

Authors: Marios-Nektarios Stamatopoulos, Shridhar Velhal, Avijit Banerjee, George Nikolakopoulos

Title: Optimal Safety-Aware Scheduling for Multi-Agent Aerial 3D Printing with Utility Maximization under Dependency Constraints

Abstract:
This article presents a novel coordination and task‑planning framework to enable the simultaneous conflict‑free collaboration of multiple unmanned aerial vehicles (UAVs) for aerial 3D printing. The proposed framework formulates an optimization problem that takes a construction mission divided into sub‑tasks and a team of autonomous UAVs, along with limited volume and battery. It generates an optimal mission plan comprising task assignments and scheduling while accounting for task dependencies arising from the geometric and structural requirements of the 3D design, inter‑UAV safety constraints, material usage, and total flight time of each UAV. The potential conflicts occurring during the simultaneous operation of the UAVs are addressed at a segment level by dynamically selecting the starting time and location of each task to guarantee collision‑free parallel execution. An importance prioritization is proposed to accelerate the computation by guiding the solution toward more important tasks. Additionally, a utility maximization formulation is proposed to dynamically determine the optimal number of UAVs required for a given mission, balancing the trade‑off between minimizing makespan and the deployment of excess agents. The proposed framework's effectiveness is evaluated through a Gazebo‑based simulation setup, where agents are coordinated by a mission control module allocating the printing tasks based on the generated optimal scheduling plan while remaining within the material and battery constraints of each UAV.

PaperID: 6963, https://arxiv.org/pdf/2512.05812.pdf

Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller

Title: Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Abstract:
Scalable multi‑agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance‑centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint‑invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query‑centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent‑centric baselines in terms of positional accuracy and robustness.

PaperID: 6964, https://arxiv.org/pdf/2512.05765.pdf

Authors: Edward Y. Chang

Title: AGI Requires a Coordination Layer on Top of Pattern Repositories

Abstract:
In this paper we argue that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck: they confuse the ocean with the net. Pattern repositories are the necessary System‑1 substrate; the missing component is a System‑2 coordination layer that recruits relevant patterns, verifies their use, preserves state, and governs convergence. We separate two uses of control that are often conflated. Semantic anchoring, formalized by UCCT (Unified Contextual Control Theory), binds labels and task intent to learned pattern regions through a phase transition governed by effective support (rho_d), representational mismatch (d_r), and an adaptive anchoring budget (gamma log k). Trace‑answer verification, implemented by Recursive Causal Audit (RCA), tests whether a final causal judgment is warranted by its own reasoning trace under pressure. We translate these ideas into MACI, a multi‑agent coordination stack that integrates diversity and control via baiting (PID‑modulated debate), filtering (Socratic and causal audit), and persistence (transactional memory). Empirical validation on causal judgment and the sycophancy‑paranoia trade‑off demonstrates that static prompting fails where adaptive control succeeds. By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them. Capability is not coordination.

PaperID: 6965, https://arxiv.org/pdf/2512.05671.pdf

Authors: Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung

Title: MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

Abstract:
The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one‑on‑one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi‑agent pedagogical simulator with personality‑driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor‑R1, the first multimodal Socratic tutor designed for one‑to‑many instruction in clinical medical education. MedTutor‑R1 is first instruction‑tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three‑axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in‑situ assessment, we use simulation‑based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor‑R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.

PaperID: 6966, https://arxiv.org/pdf/2512.05654.pdf

Authors: Hyeonyeong Jang, Donghyeon Song, Jin Gyu Lee, Hyungbo Shim

Title: A Note on Emergent Behavior in Multi-agent Systems Enabled by Neuro-spike Communication

Abstract:
In this note, we present a novel synchronization framework for heterogeneous multi‑agent systems enabled by neuro‑spike communication, which induces emergence. Unlike conventional synchronization strategies that require continuous transmission of full‑state data packets, our approach utilizes a bio‑inspired neuromorphic amplifier to achieve practical synchronization via intermittent, 1‑bit Dirac delta pulses. The proposed method drastically improves communication efficiency in terms of bandwidth and energy by minimizing the information payload to a single bit, with intermittent and asynchronous communication. We provide a rigorous convergence analysis of the proposed method and validate the proposed scheme through numerical examples.

PaperID: 6967, https://arxiv.org/pdf/2512.05502.pdf

Authors: Omid Bazgir, Vineeth Manthapuri, Ilia Rattsev, Mohammad Jafarnejad

Title: GRASP: Graph Reasoning Agents for Systems Pharmacology with Human-in-the-Loop

Abstract:
Quantitative Systems Pharmacology (QSP) modeling is essential for drug development but it requires significant time investment that limits the throughput of domain experts. We present GRASP ‑‑ a multi‑agent, graph‑reasoning framework with a human‑in‑the‑loop conversational interface ‑‑ that encodes QSP models as typed biological knowledge graphs and compiles them to executable MATLAB/SimBiology code while preserving units, mass balance, and physiological constraints. A two‑phase workflow ‑‑ \textscUnderstanding (graph reconstruction of legacy code) and \textscAction (constraint‑checked, language‑driven modification) ‑‑ is orchestrated by a state machine with iterative validation. GRASP performs breadth‑first parameter‑alignment around new entities to surface dependent quantities and propose biologically plausible defaults, and it runs automatic execution/diagnostics until convergence. In head‑to‑head evaluations using LLM‑as‑judge, GRASP outperforms SME‑guided CoT and ToT baselines across biological plausibility, mathematical correctness, structural fidelity, and code quality ($\approx$9‑‑10/10 vs.\ 5‑‑7/10). BFS alignment achieves F1 = 0.95 for dependency discovery, units, and range. These results demonstrate that graph‑structured, agentic workflows can make QSP model development both accessible and rigorous, enabling domain experts to specify mechanisms in natural language without sacrificing biomedical fidelity.

PaperID: 6968, https://arxiv.org/pdf/2512.05462.pdf

Authors: Yan-Shiun Wu, Nathan A. Morin

Title: Model Gateway: Model Management Platform for Model-Driven Drug Discovery

Abstract:
This paper presents the Model Gateway, a management platform for managing machine learning (ML) and scientific computational models in the drug discovery pipeline. The platform supports Large Language Model (LLM) Agents and Generative AI‑based tools to perform ML model management tasks in our Machine Learning operations (MLOps) pipelines, such as the dynamic consensus model, a model that aggregates several scientific computational models, registration and management, retrieving model information, asynchronous submission/execution of models, and receiving results once the model complete executions. The platform includes a Model Owner Control Panel, Platform Admin Tools, and Model Gateway API service for interacting with the platform and tracking model execution. The platform achieves a 0% failure rate when testing scaling beyond 10k simultaneous application clients consume models. The Model Gateway is a fundamental part of our model‑driven drug discovery pipeline. It has the potential to significantly accelerate the development of new drugs with the maturity of our MLOps infrastructure and the integration of LLM Agents and Generative AI tools.

PaperID: 6969, https://arxiv.org/pdf/2512.05449.pdf

Authors: Robert Yang

Title: The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

Abstract:
Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model's local response contradicts its own prior commitments. The benchmark enables quantitative comparison of "self‑control" across model families, decoding strategies, and temptation types. Beyond single‑model evaluation, we outline how micro‑level akrasia may compound into macro‑level instability in multi‑agent systems that may be interpreted as "scheming" or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.

PaperID: 6970, https://arxiv.org/pdf/2512.05447.pdf

Authors: Pengcheng Dai, Dongming Wang, Wenwu Yu, Wei Ren

Title: Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning

Abstract:
This paper studies networked multi‑agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state‑action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its κ_p‑hop neighbors, with κ_p\geq 1 denoting the coupled radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors' averaged Q‑function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, where each agent relies only on the state‑action pairs of its κ_p‑hop neighbors and the rewards of its (κ_p+1)‑hop neighbors. Specially, in the DSCP algorithm, we employ a geometric 2‑horizon sampling method that does not require storing a full Q‑table to obtain an unbiased estimate of the coupled policy gradient. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents' parameters to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push‑sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first‑order stationary point of the objective function. Finally, the effectiveness of DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state‑of‑the‑art methods.

PaperID: 6971, https://arxiv.org/pdf/2512.05374.pdf

Authors: Charlie Summers, Haneen Mohammed, Eugene Wu

Title: Please Don't Kill My Vibe: Empowering Agents with Data Flow Control

Abstract:
The promise of Large Language Model (LLM) agents is to perform complex, stateful tasks. This promise is stunted by significant risks ‑ policy violations, process corruption, and security flaws ‑ that stem from the lack of visibility and mechanisms to manage undesirable data flows produced by agent actions. Today, agent workflows are responsible for enforcing these policies in ad hoc ways. Just as data validation and access controls shifted from the application to the DBMS, freeing application developers from these concerns, we argue that systems should support Data Flow Controls (DFCs) and enforce DFC policies natively. This paper describes early work developing a portable instance of DFC for DBMSes and outlines a broader research agenda toward DFC for agent ecosystems.

PaperID: 6972, https://arxiv.org/pdf/2512.05065.pdf

Authors: James Flemings, Ren Yi, Octavian Suciu, Kassem Fawaz, Murali Annavaram, Marco Gruteser

Title: Personalizing Agent Privacy Decisions via Logical Entailment

Abstract:
Personal large language model (LLM) agents increasingly perform tasks that require access to user data, raising concerns about appropriate data disclosure. We show that relying solely on LLMs to make data‑sharing decisions is insufficient. Prompting LLMs with general privacy norms fails to capture individual users' privacy preferences, while providing prior user data‑sharing decisions through in‑context learning (ICL) leads to unreliable and opaque reasoning. To address these limitations, we propose ARIEL (Agentic Reasoning with Individualized Entailment Logic), a framework that combines LLMs with rule‑based logic to enable structured, personalized privacy reasoning. The core mechanism of ARIEL determines whether a user's prior decision on a data‑sharing request logically entails the same decision for a new request. Experimental evaluations using advanced models and public datasets show that ARIEL reduces the F1 error rate for appropriate judgments by 40.6% compared to standard ICL‑based reasoning, indicating that ARIEL is effective at correctly judging requests where the user would approve data sharing. These results demonstrate that integrating LLMs with logical entailment provides an effective and interpretable approach for automating personalized privacy decisions.

PaperID: 6973, https://arxiv.org/pdf/2512.05013.pdf

Authors: Eric Bridgeford, Hayden Helm

Title: Detecting Perspective Shifts in Multi-agent Systems

Abstract:
Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi‑agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low‑dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent‑ and group‑level in black‑box multi‑agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi‑agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black‑box multi‑agent systems ‑‑ a critical capability as generative agent deployment continues to scale.

PaperID: 6974, https://arxiv.org/pdf/2512.04988.pdf

Authors: Christopher Chiu, Simpson Zhang, Mihaela van der Schaar

Title: Strategic Self-Improvement for Competitive Agents in AI Labour Markets

Abstract:
As artificial intelligence (AI) agents are deployed across economic domains, understanding their strategic behavior and market‑level impact becomes critical. This paper puts forward a groundbreaking new framework that is the first to capture the real‑world economic forces that shape agentic labor markets: adverse selection, moral hazard, and reputation dynamics. Our framework encapsulates three core capabilities that successful LLM‑agents will need: metacognition (accurate self‑assessment of skills), competitive awareness (modeling rivals and market dynamics), and long‑horizon strategic planning. We illustrate our framework through a tractable simulated gig economy where agentic Large Language Models (LLMs) compete for jobs, develop skills, and adapt their strategies under competitive pressure. Our simulations illustrate how LLM agents explicitly prompted with reasoning capabilities learn to strategically self‑improve and demonstrate superior adaptability to changing market conditions. At the market level, our simulations reproduce classic macroeconomic phenomena found in human labor markets, while controlled experiments reveal potential AI‑driven economic trends, such as rapid monopolization and systemic price deflation. This work provides a foundation to further explore the economic properties of AI‑driven labour markets, and a conceptual framework to study the strategic reasoning capabilities in agents competing in the emerging economy.

PaperID: 6975, https://arxiv.org/pdf/2512.04918.pdf

Authors: Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch

Title: Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty

Abstract:
Intraday surgical scheduling is a multi‑objective decision problem under uncertainty‑balancing elective throughput, urgent and emergency demand, delays, sequence‑dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi‑agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within‑epoch sequential assignment protocol constructs conflict‑free joint schedules across ORs. A mixed‑integer pre‑schedule provides reference starting times for electives; we impose type‑specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule‑based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior‑prioritizing emergencies, batching similar cases to reduce setups, and deferring lower‑value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations‑including OR homogeneity and the omission of explicit staffing constraints‑and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data‑driven complement to optimization for real‑time OR scheduling.

PaperID: 6976, https://arxiv.org/pdf/2512.04797.pdf

Authors: SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek, Andrew Bolt, Andrew Lampinen, Anna Mitenkova, Arne Olav Hallingstad, Bojan Vujatovic, Bonnie Li, Cong Lu, Daan Wierstra, Daniel P. Sawyer, Daniel Slater, David Reichert, Davide Vercelli, Demis Hassabis, Drew A. Hudson, Duncan Williams, Ed Hirst, Fabio Pardo, Felix Hill, Frederic Besse, Hannah Openshaw, Harris Chan, Hubert Soyer, Jane X. Wang, Jeff Clune, John Agapiou, John Reid, Joseph Marino, Junkyung Kim, Karol Gregor, Kaustubh Sridhar, Kay McKinney, Laura Kampis, Lei M. Zhang, Loic Matthey, Luyu Wang, Maria Abi Raad, Maria Loks-Thompson, Martin Engelcke, Matija Kecman, Matthew Jackson, Maxime Gazeau, Ollie Purkiss, Oscar Knagg, Peter Stys, Piermaria Mendolicchio, Raia Hadsell, Rosemary Ke, Ryan Faulkner, Sarah Chakera, Satinder Singh Baveja, Shane Legg, Sheleem Kashem, Tayfun Terzi, Thomas Keck, Tim Harley, Tim Scholtes, Tyson Roberts, Volodymyr Mnih, Yulan Liu, Zhengdong Wang, Zoubin Ghahramani

Title: SIMA 2: A Generalist Embodied Agent for Virtual Worlds

Abstract:
We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal‑directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high‑level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open‑ended self‑improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.

PaperID: 6977, https://arxiv.org/pdf/2512.04785.pdf

Authors: Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Title: ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications

Abstract:
AI agent‑based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision‑making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent‑to‑agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose‑built for AI agent‑based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent‑Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent‑based applications. To automate threat modeling, ASTRIDE combines a consortium of fine‑tuned vision‑language models (VLMs) with the OpenAI‑gpt‑oss reasoning LLM to perform end‑to‑end analysis directly from visual agent architecture diagrams, such as data flow diagrams(DFDs). LLM agents orchestrate the end‑to‑end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next‑generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI‑specific threats and integrate fine‑tuned VLMs with a reasoning LLM to fully automate diagram‑driven threat modeling in AI agent‑based applications.

PaperID: 6978, https://arxiv.org/pdf/2512.04745.pdf

Authors: Francesca Rossi, Veronica Centorrino, Francesco Bullo, Giovanni Russo

Title: Neural Policy Composition from Free Energy Minimization

Abstract:
The ability to flexibly compose previously acquired skills to execute intelligent behaviors is a hallmark of natural intelligence. Such compositional flexibility is often attributed to context‑dependent gating mechanisms that determine how multiple policies or behavioral primitives are combined. Yet, despite remarkable efforts, the normative objective from which such gating rules should arise, and the neural computations capable of implementing them, remain unclear. Existing approaches typically rely on prespecified design choices for the gating rules, and remain tied to specific architectures, learning paradigms, or datasets. Here, we introduce a normative framework in which policy composition emerges from the minimization of a variational free energy, providing a principled and broadly applicable objective for gating. Based on this framework, we derive a continuous‑time gradient flow whose trajectories are guaranteed to converge, with explicit rate, to the optimal composition of primitives. We further show that this dynamics admits a mechanistic neural implementation as a soft‑competitive recurrent circuit with context‑sensitive local interactions. We evaluate the model on emerging flocking behaviors in multi‑agent systems, human decision‑making in bandit tasks, and control benchmarks in layered architectures. Across these settings, the model provides interpretable mechanistic accounts of policy composition, reproduces key behavioral signatures, yields insights into data, and matches or outperforms established models.

PaperID: 6979, https://arxiv.org/pdf/2512.04716.pdf

Authors: Haodong Feng, Lugang Ye, Dixia Fan

Title: Towards an AI Fluid Scientist: LLM-Powered Scientific Discovery in Experimental Fluid Mechanics

Abstract:
The integration of artificial intelligence into experimental fluid mechanics promises to accelerate discovery, yet most AI applications remain narrowly focused on numerical studies. This work proposes an AI Fluid Scientist framework that autonomously executes the complete experimental workflow: hypothesis generation, experimental design, robotic execution, data analysis, and manuscript preparation. We validate this through investigation of vortex‑induced vibration (VIV) and wake‑induced vibration (WIV) in tandem cylinders. Our work has four key contributions: (1) A computer‑controlled circulating water tunnel (CWT) with programmatic control of flow velocity, cylinder position, and forcing parameters (vibration frequency and amplitude) with data acquisition (displacement, force, and torque). (2) Automated experiments reproduce literature benchmarks (Khalak and Williamson [1999] and Assi et al. [2013, 2010]) with frequency lock‑in within 4% and matching critical spacing trends. (3) The framework with Human‑in‑the‑Loop (HIL) discovers more WIV amplitude response phenomena, and uses a neural network to fit physical laws from data, which is 31% higher than that of polynomial fitting. (4) The framework with multi‑agent with virtual‑real interaction system executes hundreds of experiments end‑to‑end, which automatically completes the entire process of scientific research from hypothesis generation, experimental design, experimental execution, data analysis, and manuscript preparation. It greatly liberates human researchers and improves study efficiency, providing new paradigm for the development and research of experimental fluid mechanics.

PaperID: 6980, https://arxiv.org/pdf/2512.04691.pdf

Authors: Jae Hee Lee, Anne Lauscher, Stefano V. Albrecht

Title: Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

Abstract:
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi‑agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi‑agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter‑efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.

PaperID: 6981, https://arxiv.org/pdf/2512.04668.pdf

Authors: Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yue Zhao, Xiyang Hu

Title: Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Abstract:
Graph topology is a fundamental determinant of memory leakage in multi‑agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi‑Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two‑phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi‑round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground‑truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star‑ring), varying agent counts n\in\4,5,6\, attacker‑target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker‑target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker‑target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology‑aware access controls.

PaperID: 6982, https://arxiv.org/pdf/2512.04653.pdf

Authors: Arash Rezaali, Pouria Yazdani, Monireh Abdoos

Title: A Semi Centralized Training Decentralized Execution Architecture for Multi Agent Deep Reinforcement Learning in Traffic Signal Control

Abstract:
Multi‑agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region‑based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi‑Centralized Training, Decentralized Execution (SEMI‑CTDE) architecture for multi intersection ATSC. Within each region, SEMI‑CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state‑reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi‑perspective experimental analysis of the two implemented SEMI‑CTDE‑based models covering ablations of the architecture's core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.

PaperID: 6983, https://arxiv.org/pdf/2512.04601.pdf

Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine

Title: Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Abstract:
Large language model (LLM) agents ‑‑ LLMs that dynamically interact with an environment over long horizons ‑‑ have become an increasingly important area of research, enabling automation in complex tasks involving tool‑use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long‑horizon tasks with sparse rewards, learning from trajectory‑level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor‑Critic (NLAC), a novel actor‑critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open‑ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off‑policy without policy gradients, offering a more data‑efficient and stable alternative to existing on‑policy methods. We present results on a mixture of reasoning, web browsing, and tool‑use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

PaperID: 6984, https://arxiv.org/pdf/2512.04535.pdf

Authors: Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, Jiyan He

Title: GTM: Simulating the World of Tools for AI Agents

Abstract:
The integration of external tools is pivotal for empowering Large Language Model (LLM) agents with real‑world capabilities. However, training these agents through direct, continuous interaction with diverse tools is often prohibitively expensive, slow, and introduces additional development and maintenance overhead. To address this challenge, we introduce the Generalist Tool Model (GTM), a 1.5‑billion‑parameter model that learns to act as a universal tool simulator. With only prompt‑level configuration, GTM accesses tool functionalities along with input arguments and generates outputs that faithfully mimic real tool execution, providing a fast and cost‑effective solution that eliminates development overhead. To build GTM, we propose the Context‑Aware Response Generation (CARG) pipeline, which synthesizes comprehensive training data covering over 20,000 tools across 300 domains including physics, medicine, robotics, and finance. Through this pipeline, GTM learns to produce not only syntactically correct outputs but also logically coherent and contextually appropriate responses. Experiments demonstrate that GTM produces high‑quality outputs with strong consistency and reliability. Besides when used in real reinforcement learning scenarios for agent training, GTM exhibits significantly faster simulation speed compared to real tools while maintaining comparable output quality, along with remarkable generalization and domain adaptability. Our results establish GTM as a foundational component for developing future AI agents, enabling efficient and scalable training of tool‑augmented systems.

PaperID: 6985, https://arxiv.org/pdf/2512.04488.pdf

Authors: Nate Straub, Saara Khan, Katharina Jay, Brian Cabral, Oskar Linde

Title: Persona-based Multi-Agent Collaboration for Brainstorming

Abstract:
We demonstrate the importance of persona‑based multi‑agents brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi‑agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona‑based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent‑to‑agent) dynamics (separate, together, separate‑then‑together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi‑agent persona‑driven brainstorming produces idea depth and cross‑domain coverage.

PaperID: 6986, https://arxiv.org/pdf/2512.04469.pdf

Authors: Philip Stephens, Emmanuel Salawu

Title: Mathematical Framing for Different Agent Strategies

Abstract:
We introduce a unified mathematical and probabilistic framework for understanding and comparing diverse AI agent strategies. We bridge the gap between high‑level agent design concepts, such as ReAct, multi‑agent systems, and control flows, and a rigorous mathematical formulation. Our approach frames agentic processes as a chain of probabilities, enabling a detailed analysis of how different strategies manipulate these probabilities to achieve desired outcomes. Our framework provides a common language for discussing the trade‑offs inherent in various agent architectures. One of our many key contributions is the introduction of the "Degrees of Freedom" concept, which intuitively differentiates the optimizable levers available for each approach, thereby guiding the selection of appropriate strategies for specific tasks. This work aims to enhance the clarity and precision in designing and evaluating AI agents, offering insights into maximizing the probability of successful actions within complex agentic systems.

PaperID: 6987, https://arxiv.org/pdf/2512.04416.pdf

Authors: Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang

Title: DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows

Abstract:
Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet‑level coding or high‑level analytics, failing to capture the unique challenge of data governance: ensuring the correctness and quality of the data itself. To bridge this gap, we introduce DataGovBench, a benchmark featuring 150 diverse tasks grounded in real‑world scenarios, built on data from actual cases. DataGovBench employs a novel "reversed‑objective" methodology to synthesize realistic noise and utilizes rigorous metrics to assess end‑to‑end pipeline reliability. Our analysis on DataGovBench reveals that current models struggle with complex, multi‑step workflows and lack robust error‑correction mechanisms. Consequently, we propose DataGovAgent, a framework utilizing a Planner‑Executor‑Evaluator architecture that integrates constraint‑based planning, retrieval‑augmented generation, and sandboxed feedback‑driven debugging. Experimental results show that DataGovAgent significantly boosts the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9 percent compared to general‑purpose baselines.

PaperID: 6988, https://arxiv.org/pdf/2512.04405.pdf

Authors: Chenyuan Feng, Anbang Zhang, Geyong Min, Yongming Huang, Tony Q. S. Quek, Xiaohu You

Title: Towards 6G Native-AI Edge Networks: A Semantic-Aware and Agentic Intelligence Paradigm

Abstract:
The evolution toward sixth‑generation wireless systems positions intelligence as a native network capability, fundamentally transforming the design of radio access networks (RANs). Within this vision, Semantic‑native communication and agentic intelligence are expected to play central roles. SemCom departs from bit‑level fidelity and instead emphasizes task‑oriented meaning exchange, enabling compact SC and introducing new performance measures such as semantic fidelity and task success rate. Agentic intelligence endows distributed RAN entities with goal‑driven autonomy, reasoning, planning, and multi‑agent collaboration, increasingly supported by foundation models and knowledge graphs. In this work, we first introduce the conceptual foundations of SemCom and agentic networking, and discuss why existing AI‑driven O‑RAN solutions remain largely bit‑centric and task‑siloed. We then present a unified taxonomy that organizes recent research along three axes: i) semantic abstraction level (symbol/feature/intent/knowledge), ii) agent autonomy and coordination granularity (single‑, multi‑, and hierarchical‑agent), and iii) RAN control placement across PHY/MAC, near‑real‑time RIC, and non‑real‑time RIC. Based on this taxonomy, we systematically introduce enabling technologies including task‑oriented semantic encoders/decoders, multi‑agent reinforcement learning, foundation‑model‑assisted RAN agents, and knowledge‑graph‑based reasoning for cross‑layer awareness. Representative 6G use cases, such as immersive XR, vehicular V2X, and industrial digital twins, are analyzed to illustrate the semantic‑agentic convergence in practice. Finally, we identify open challenges in semantic representation standardization, scalable trustworthy agent coordination, O‑RAN interoperability, and energy‑efficient AI deployment, and outline research directions toward operational semantic‑agentic AI‑RAN.

PaperID: 6989, https://arxiv.org/pdf/2512.04367.pdf

Authors: Yun Piao, Hongbo Min, Hang Su, Leilei Zhang, Lei Wang, Yue Yin, Xiao Wu, Zhejing Xu, Liwei Qu, Hang Li, Xinxin Zeng, Wei Tian, Fei Yu, Xiaowei Li, Jiayi Jiang, Tongxu Liu, Hao Tian, Yufei Que, Xiaobing Tu, Bing Suo, Yuebing Li, Xiangting Chen, Zeen Zhao, Jiaming Tang, Wei Huang, Xuguang Li, Jing Zhao, Jin Li, Jie Shen, Jinkui Ren, Xiantao Zhang

Title: AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems

Abstract:
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi‑step tasks. However, these agents remain brittle when faced with real‑world exceptions, making Human‑in‑the‑Loop (HITL) supervision essential for mission‑critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra‑low‑latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command‑based and video‑based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model achieved more than 48% success rate improvement. Furthermore, our ASP protocol reduces bandwidth consumption by up to 50% compared to standard RDP, and in end‑to‑end latency with around 5% reduction, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human‑supervised autonomous systems.

PaperID: 6990, https://arxiv.org/pdf/2512.04308.pdf

Authors: Lei Zhang, Ju Dong, Kaixin Bai, Minheng Ni, Zoltan-Csaba Marton, Zhaopeng Chen, Jianwei Zhang

Title: ResponsibleRobotBench: Benchmarking Responsible Robot Manipulation using Multi-modal Large Language Models

Abstract:
Recent advances in large multimodal models have enabled new opportunities in embodied AI, particularly in robotic manipulation. These models have shown strong potential in generalization and reasoning, but achieving reliable and responsible robotic behavior in real‑world settings remains an open challenge. In high‑stakes environments, robotic agents must go beyond basic task execution to perform risk‑aware reasoning, moral decision‑making, and physically grounded planning. We introduce ResponsibleRobotBench, a systematic benchmark designed to evaluate and accelerate progress in responsible robotic manipulation from simulation to real world. This benchmark consists of 23 multi‑stage tasks spanning diverse risk types, including electrical, chemical, and human‑related hazards, and varying levels of physical and planning complexity. These tasks require agents to detect and mitigate risks, reason about safety, plan sequences of actions, and engage human assistance when necessary. Our benchmark includes a general‑purpose evaluation framework that supports multimodal model‑based agents with various action representation modalities. The framework integrates visual perception, context learning, prompt construction, hazard detection, reasoning and planning, and physical execution. It also provides a rich multimodal dataset, supports reproducible experiments, and includes standardized metrics such as success rate, safety rate, and safe success rate. Through extensive experimental setups, ResponsibleRobotBench enables analysis across risk categories, task types, and agent configurations. By emphasizing physical reliability, generalization, and safety in decision‑making, this benchmark provides a foundation for advancing the development of trustworthy, real‑world responsible dexterous robotic systems. https://sites.google.com/view/responsible‑robotbench

PaperID: 6991, https://arxiv.org/pdf/2512.04307.pdf

Authors: Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai

Title: Evaluating Long-Context Reasoning in LLM-Based WebAgents

Abstract:
As large language model (LLM)‑based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action‑taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi‑session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude‑3.7, GPT‑4.1, Llama 4, and o4‑mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40‑50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task‑relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long‑term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.

PaperID: 6992, https://arxiv.org/pdf/2512.04246.pdf

Authors: Majid Ghasemi, Mark Crowley

Title: Toward Virtuous Reinforcement Learning: A Critique and Roadmap

Abstract:
This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward based approaches, especially single objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade offs and invite proxy gaming in practice. We instead treat ethics as policy level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade offs. Our roadmap combines four components: (1) social learning in multi agent RL to acquire virtue like patterns from imperfect but normatively informed exemplars; (2) multi objective and constrained formulations that preserve value conflicts and incorporate risk aware criteria to guard against harm; (3) affinity based regularization toward updateable virtue priors that support trait like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.

PaperID: 6993, https://arxiv.org/pdf/2512.04221.pdf

Authors: Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas

Title: MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Abstract:
While text‑to‑video (T2V) generation has achieved remarkable progress in photorealism, generating intent‑aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion‑controlled text‑to‑video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion‑aware, physics‑grounded T2V framework that integrates multi‑agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object‑trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human‑annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground‑truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics‑based evaluators. Our results reveal that state‑of‑the‑art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

PaperID: 6994, https://arxiv.org/pdf/2512.04220.pdf

Authors: Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li

Title: On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement

Abstract:
Tool‑integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi‑step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search‑R1, offers fast convergence and a value‑free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self‑reinforcing LLD Death Spiral, where declining likelihood leads to low‑confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search‑R1‑style, search‑integrated question answering task, revealing a consistent three‑phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood‑preserving regularization LLDS that activates only when a response action's likelihood decreases, and regularizes only the tokens responsible. This fine‑grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5‑3B and +37.1% on Qwen2.5‑7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO‑based TIRL and provide a practical path toward stable, scalable training of tool‑integrated RL.

PaperID: 6995, https://arxiv.org/pdf/2512.04207.pdf

Authors: Xizhi Wu, Nelly Estefanie Garduno-Rapp, Justin F Rousseau, Mounika Thakkallapally, Hang Zhang, Yuelyu Ji, Shyam Visweswaran, Yifan Peng, Yanshan Wang

Title: Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care

Abstract:
Unlike most primary headaches, secondary headaches need specialized care and can have devastating consequences if not treated promptly. Clinical guidelines highlight several 'red flag' features, such as thunderclap onset, meningismus, papilledema, focal neurologic deficits, signs of temporal arteritis, systemic illness, and the 'worst headache of their life' presentation. Despite these guidelines, determining which patients require urgent evaluation remains challenging in primary care settings. Clinicians often work with limited time, incomplete information, and diverse symptom presentations, which can lead to under‑recognition and inappropriate care. We present a large language model (LLM)‑based multi‑agent clinical decision support system built on an orchestrator‑specialist architecture, designed to perform explicit and interpretable secondary headache diagnosis from free‑text clinical vignettes. The multi‑agent system decomposes diagnosis into seven domain‑specialized agents, each producing a structured and evidence‑grounded rationale, while a central orchestrator performs task decomposition and coordinates agent routing. We evaluated the multi‑agent system using 90 expert‑validated secondary headache cases and compared its performance with a single‑LLM baseline across two prompting strategies: question‑based prompting (QPrompt) and clinical practice guideline‑based prompting (GPrompt). We tested five open‑source LLMs (Qwen‑30B, GPT‑OSS‑20B, Qwen‑14B, Qwen‑8B, and Llama‑3.1‑8B), and found that the orchestrated multi‑agent system with GPrompt consistently achieved the highest F1 scores, with larger gains in smaller models. These findings demonstrate that structured multi‑agent reasoning improves accuracy beyond prompt engineering alone and offers a transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis.

PaperID: 6996, https://arxiv.org/pdf/2512.04129.pdf

Authors: Ruichao Liang, Le Yin, Jing Chen, Yebo Feng, Cong Wu, Xiaoyu Zhang, Huangpeng Gu, Zijian Zhang, Yang Liu

Title: Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation

Abstract:
The digital world is witnessing the rapid rise of LLM‑based multi‑agent systems (MASs) and their powerful applications. However, their security remains insufficiently understood, as existing evaluations are largely limited to narrow attack settings and may substantially underestimate the real risks of MAS deployments. Inspired by the MAS inter‑agent dependencies, where upstream outputs are reinterpreted and executed by downstream agents, we propose a topology‑aware attack scheme that propagates adversarial contamination from exposed edge agents to high‑privilege agents to induce malicious behaviors. By combining topology reconnaissance, contamination propagation modeling, and hierarchical payload encapsulation, our approach overcomes the key challenges of black‑box attacks and makes such multi‑hop compromise practical. Experiments show that our approach achieves success rates of 40%‑‑78% on three widely‑used MAS frameworks under five topologies, and 85% on two real‑world MAS applications across 20 representative scenarios. The results reveal fundamental vulnerabilities in MASs that have been overlooked by prior studies. Based on these findings, we propose a topology‑trust mitigation that blocks 94.8% of such composite attacks.

PaperID: 6997, https://arxiv.org/pdf/2512.03991.pdf

Authors: Michael Schiffmann, Felix Struth, Sabina Jeschke, Anja Richert

Title: When to Say "Hi" -- Learn to Open a Conversation with an in-the-wild Dataset

Abstract:
The social capabilities of socially interactive agents (SIA) are a key to successful and smooth interactions between the user and the SIA. A successful start of the interaction is one of the essential factors for satisfying SIA interactions. For a service and information task in which the SIA helps with information, e.g. about the location, it is an important skill to master the opening of the conversation and to recognize which interlocutor opens the conversation and when. We are therefore investigating the extent to which the opening of the conversation can be trained using the user's body language as an input for machine learning to ensure smooth conversation starts for the interaction. In this paper we propose the Interaction Initiation System (IIS) which we developed, trained and validated using an in‑the‑wild data set. In a field test at the Deutsches Museum Bonn, a Furhat robot from Furhat Robotics was used as a service and information point. Over the period of use we collected the data of N = 201 single user interactions for the training of the algorithms. We can show that the IIS, achieves a performance that allows the conclusion that this system is able to determine the greeting period and the opener of the interaction.

PaperID: 6998, https://arxiv.org/pdf/2512.03945.pdf

Authors: Michael Schiffmann, Sabina Jeschke, Anja Richert

Title: Classification of User Satisfaction in HRI with Social Signals in the Wild

Abstract:
Socially interactive agents (SIAs) are being used in various scenarios and are nearing productive deployment. Evaluating user satisfaction with SIAs' performance is a key factor in designing the interaction between the user and SIA. Currently, subjective user satisfaction is primarily assessed manually through questionnaires or indirectly via system metrics. This study examines the automatic classification of user satisfaction through analysis of social signals, aiming to enhance both manual and autonomous evaluation methods for SIAs. During a field trial at the Deutsches Museum Bonn, a Furhat Robotics head was employed as a service and information hub, collecting an "in‑the‑wild" dataset. This dataset comprises 46 single‑user interactions, including questionnaire responses and video data. Our method focuses on automatically classifying user satisfaction based on time series classification. We use time series of social signal metrics derived from the body pose, time series of facial expressions, and physical distance. This study compares three feature engineering approaches on different machine learning models. The results confirm the method's effectiveness in reliably identifying interactions with low user satisfaction without the need for manually annotated datasets. This approach offers significant potential for enhancing SIA performance and user experience through automated feedback mechanisms.

PaperID: 6999, https://arxiv.org/pdf/2512.03874.pdf

Authors: Lei Zhang, Diwen Zheng, Kaixin Bai, Zhenshan Bing, Zoltan-Csaba Marton, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

Title: OmniDexVLG: Learning Dexterous Grasp Generation from Vision Language Model-Guided Grasp Semantics, Taxonomy and Functional Affordance

Abstract:
Dexterous grasp generation aims to produce grasp poses that align with task requirements and human interpretable grasp semantics. However, achieving semantically controllable dexterous grasp synthesis remains highly challenging due to the lack of unified modeling of multiple semantic dimensions, including grasp taxonomy, contact semantics, and functional affordance. To address these limitations, we present OmniDexVLG, a multimodal, semantics aware grasp generation framework capable of producing structurally diverse and semantically coherent dexterous grasps under joint language and visual guidance. Our approach begins with OmniDexDataGen, a semantic rich dexterous grasp dataset generation pipeline that integrates grasp taxonomy guided configuration sampling, functional affordance contact point sampling, taxonomy aware differential force closure grasp sampling, and physics based optimization and validation, enabling systematic coverage of diverse grasp types. We further introduce OmniDexReasoner, a multimodal grasp type semantic reasoning module that leverages multi agent collaboration, retrieval augmented generation, and chain of thought reasoning to infer grasp related semantics and generate high quality annotations that align language instructions with task specific grasp intent. Building upon these components, we develop a unified Vision Language Grasping generation model that explicitly incorporates grasp taxonomy, contact structure, and functional affordance semantics, enabling fine grained control over grasp synthesis from natural language instructions. Extensive experiments in simulation and real world object grasping and ablation studies demonstrate that our method substantially outperforms state of the art approaches in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.

PaperID: 7000, https://arxiv.org/pdf/2512.03866.pdf

Authors: Haley Stone, C. Raina MacIntyre, Mohana Kunasekaran, Chris Poulos, David Heslop

Title: Generating a Contact Matrix for Aged Care Settings in Australia: an agent-based model study

Abstract:
Understanding infectious disease transmission in institutional settings requires models that capture how contacts arise from structured routines, roles, and spatial constraints. In aged care facilities, interactions are driven by care delivery, staff scheduling, and resident mobility, producing patterns that differ from those assumed in population‑level models. This study develops an agent‑based framework to generate high‑resolution contact matrices by simulating task‑driven behaviour, staff workflows, and movement through shared spaces. Rather than prescribing contacts, interactions emerge from scheduled activities and proximity during task execution. The model is parameterised using activity‑diary data from aged care workers and separates behavioural logic from physical layout, enabling adaptation to different facility designs without altering core mechanisms. Results show strong heterogeneity in contact patterns across care levels and staff shifts. Low and medium care residents had higher contact frequencies than high care residents, while day and afternoon staff shifts accounted for most resident‑staff interactions. Contacts clustered around daily routines such as meals and communal activities. Incorporating a proximity‑based airborne transmission component showed that risk was concentrated during high‑contact shifts and among more mobile residents. Vaccination scenarios substantially reduced predicted transmission, with the greatest impact when both staff and residents were vaccinated. By linking organisational processes to emergent contact structure, this framework provides a reproducible approach to contact matrix generation for institutional settings, supporting more realistic transmission modelling and evaluation of targeted infection control strategies.

PaperID: 7001, https://arxiv.org/pdf/2512.03835.pdf

Authors: Ghoshana Bista, Abbas Bradai, Emmanuel Moulay, Abdulhalim Dandoush

Title: Multi-Agent Deep Reinforcement Learning for UAV-Assisted 5G Network Slicing: A Comparative Study of MAPPO, MADDPG, and MADQN

Abstract:
The growing demand for robust, scalable wireless networks in the 5G‑and‑beyond era has led to the deployment of Unmanned Aerial Vehicles (UAVs) as mobile base stations to enhance coverage in dense urban and underserved rural areas. This paper presents a Multi‑Agent Deep Reinforcement Learning (MADRL) framework that integrates Proximal Policy Optimization (MAPPO), Multi‑Agent Deep Deterministic Policy Gradient (MADDPG), and Multi‑Agent Deep Q‑Networks (MADQN) to jointly optimize UAV positioning, resource allocation, Quality of Service (QoS), and energy efficiency through 5G network slicing. The framework adopts Centralized Training with Decentralized Execution (CTDE), enabling autonomous real‑time decision‑making while preserving global coordination. Users are prioritized into Premium (A), Silver (B), and Bronze (C) slices with distinct QoS requirements. Experiments in realistic urban and rural scenarios show that MAPPO achieves the best overall QoS‑energy tradeoff, especially in interference‑rich environments; MADDPG offers more precise continuous control and can attain slightly higher SINR in open rural settings at the cost of increased energy usage; and MADQN provides a computationally efficient baseline for discretized action spaces. These findings demonstrate that no single MARL algorithm is universally dominant; instead, algorithm suitability depends on environmental topology, user density, and service requirements. The proposed framework highlights the potential of MARL‑driven UAV systems to enhance scalability, reliability, and differentiated QoS delivery in next‑generation wireless networks.

PaperID: 7002, https://arxiv.org/pdf/2512.03762.pdf

Authors: Jiawei Xu, Feng-Feng Wei, Wei-Neng Chen

Title: RoCo: Role-Based LLMs Collaboration for Automatic Heuristic Design

Abstract:
Automatic Heuristic Design (AHD) has gained traction as a promising solution for solving combinatorial optimization problems (COPs). Large Language Models (LLMs) have emerged and become a promising approach to achieving AHD, but current LLM‑based AHD research often only considers a single role. This paper proposes RoCo, a novel Multi‑Agent Role‑Based System, to enhance the diversity and quality of AHD through multi‑role collaboration. RoCo coordinates four specialized LLM‑guided agents‑explorer, exploiter, critic, and integrator‑to collaboratively generate high‑quality heuristics. The explorer promotes long‑term potential through creative, diversity‑driven thinking, while the exploiter focuses on short‑term improvements via conservative, efficiency‑oriented refinements. The critic evaluates the effectiveness of each evolution step and provides targeted feedback and reflection. The integrator synthesizes proposals from the explorer and exploiter, balancing innovation and exploitation to drive overall progress. These agents interact in a structured multi‑round process involving feedback, refinement, and elite mutations guided by both short‑term and accumulated long‑term reflections. We evaluate RoCo on five different COPs under both white‑box and black‑box settings. Experimental results demonstrate that RoCo achieves superior performance, consistently generating competitive heuristics that outperform existing methods including ReEvo and HSEvo, both in white‑box and black‑box scenarios. This role‑based collaborative paradigm establishes a new standard for robust and high‑performing AHD.

PaperID: 7003, https://arxiv.org/pdf/2512.03694.pdf

Authors: Shuang Guo, Zihui Li

Title: SRPG: Semantically Reconstructed Privacy Guard for Zero-Trust Privacy in Educational Multi-Agent Systems

Abstract:
Multi‑Agent Systems (MAS) with large language models (LLMs) enable personalized education but risk leaking minors personally identifiable information (PII) via unstructured dialogue. Existing privacy methods struggle to balance security and utility: role‑based access control fails on unstructured text, while naive masking destroys pedagogical context. We propose SRPG, a privacy guard for educational MAS, using a Dual‑Stream Reconstruction Mechanism: a strict sanitization stream ensures zero PII leakage, and a context reconstruction stream (LLM driven) recovers mathematical logic. This decouples instructional content from private data, preserving teaching efficacy. Tests on MathDial show SRPG works across models; with GPT‑4o, it achieves 0.0000 Attack Success Rate (ASR) (zero leakage) and 0.8267 Exact Match, far outperforming the zero trust Pure LLM baseline (0.2138). SRPG effectively protects minors privacy without sacrificing mathematical instructional quality.

PaperID: 7004, https://arxiv.org/pdf/2512.03639.pdf

Authors: Kilian Schweppe, Anne-Kathrin Schmuck

Title: Context-Triggered Contingency Games for Strategic Multi-Agent Interaction

Abstract:
We address the challenge of reliable and efficient interaction in autonomous multi‑agent systems, where agents must balance long‑term strategic objectives with short‑term dynamic adaptation. We propose context‑triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two‑layered architecture leverages strategy templates to guarantee satisfaction of high‑level objectives, while a new factor‑graph‑based solver enables scalable, real‑time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi‑agent interaction.

PaperID: 7005, https://arxiv.org/pdf/2512.03630.pdf

Authors: Shifa Sulaiman, Amarnath H, Simon Bogh, Naresh Marturi

Title: Multimodal Control of Manipulators: Coupling Kinematics and Vision for Self-Driving Laboratory Operations

Abstract:
Motion planning schemes are used for planning motions of a manipulator from an initial pose to a final pose during a task execution. A motion planning scheme generally comprises of a trajectory planning method and an inverse kinematic solver to determine trajectories and joints solutions respectively. In this paper, 3 motion planning schemes developed based on Jacobian methods are implemented to traverse a redundant manipulator with a coupled finger gripper through given trajectories. RRT algorithm is used for planning trajectories and screw theory based forward kinematic equations are solved for determining joint solutions of the manipulator and gripper. Inverse solutions are computed separately using 3 Jacobian based methods such as Jacobian Transpose (JT), Pseudo Inverse (PI), and Damped Least Square (DLS) methods. Space Jacobian and manipulability measurements of the manipulator and gripper are obtained using screw theory formulations. Smoothness and RMSE error of generated trajectories and velocity continuity, acceleration profile, jerk, and snap values of joint motions are analysed for determining an efficient motion planning method for a given task. Advantages and disadvantages of the proposed motion planning schemes mentioned above are analysed using simulation studies to determine a suitable inverse solution technique for the tasks.

PaperID: 7006, https://arxiv.org/pdf/2512.03607.pdf

Authors: Yusen Wu, Xiaotie Deng

Title: DeepRule: An Integrated Framework for Automated Business Rule Generation via Deep Predictive Modeling and Hybrid Search Optimization

Abstract:
This paper proposes DeepRule, an integrated framework for automated business rule generation in retail assortment and pricing optimization. Addressing the systematic misalignment between existing theoretical models and real‑world economic complexities, we identify three critical gaps: (1) data modality mismatch where unstructured textual sources (e.g. negotiation records, approval documents) impede accurate customer profiling; (2) dynamic feature entanglement challenges in modeling nonlinear price elasticity and time‑varying attributes; (3) operational infeasibility caused by multi‑tier business constraints. Our framework introduces a tri‑level architecture for above challenges. We design a hybrid knowledge fusion engine employing large language models (LLMs) for deep semantic parsing of unstructured text, transforming distributor agreements and sales assessments into structured features while integrating managerial expertise. Then a game‑theoretic constrained optimization mechanism is employed to dynamically reconcile supply chain interests through bilateral utility functions, encoding manufacturer‑distributor profit redistribution as endogenous objectives under hierarchical constraints. Finally an interpretable decision distillation interface leveraging LLM‑guided symbolic regression to find and optimize pricing strategies and auditable business rules embeds economic priors (e.g. non‑negative elasticity) as hard constraints during mathematical expression search. We validate the framework in real retail environments achieving higher profits versus systematic B2C baselines while ensuring operational feasibility. This establishes a close‑loop pipeline unifying unstructured knowledge injection, multi‑agent optimization, and interpretable strategy synthesis for real economic intelligence.

PaperID: 7007, https://arxiv.org/pdf/2512.03560.pdf

Authors: Gianni Molinari, Fabio Ciravegna

Title: Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks

Abstract:
Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single‑agent architectures enforce a monolithic plan‑execute loop, which directly causes trajectory instability. Second, the requirement to use local open‑weight models for data privacy introduces smaller context windows leading to the rapid consumption of context from large tool outputs. To solve this problem we introduce RP‑ReAct (Reasoner Planner‑ReAct), a novel multi‑agent approach that fundamentally decouples strategic planning from low‑level execution to achieve superior reliability and efficiency. RP‑ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub‑step, continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or multiple Proxy‑Execution Agent (PEA) that translates sub‑steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context‑saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on‑demand access. We evaluate RP‑ReAct, on the challenging, multi‑domain ToolQA benchmark using a diverse set of six open‑weight reasoning models. Our empirical results show that RP‑ReAct achieves superior performance and improved generalization ability over state‑of‑the‑art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.

PaperID: 7008, https://arxiv.org/pdf/2512.03549.pdf

Authors: Yuki Orimo, Iori Kurata, Hodaka Mori, Ryuhei Okuno, Ryohto Sawada, Daisuke Okanohara

Title: PARC: An Autonomous Self-Reflective Coding Agent for Robust Execution of Long-Horizon Tasks

Abstract:
We introduce PARC, a coding agent for the autonomous and robust execution of long‑horizon computational tasks. PARC is built on a hierarchical multi‑agent architecture incorporating task planning, execution, and a mechanism that evaluates its own actions and their outcomes from an independent context and provides feedback, namely self‑assessment and self‑feedback. This design enables PARC to detect and correct high‑level strategic errors and sustain progress without human intervention. We evaluate PARC across computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium‑ion conduction and alloy segregation. In particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end‑to‑end. In Kaggle‑based experiments, starting from minimal natural‑language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human‑engineered baselines. These results highlight the potential of integrating a hierarchical multi‑agent system with self‑assessment and self‑feedback to enable AI systems capable of independent, large‑scale scientific and analytical work.

PaperID: 7009, https://arxiv.org/pdf/2512.03528.pdf

Authors: Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao

Title: Multi-Agent Reinforcement Learning with Communication-Constrained Priors

Abstract:
Communication is one of the effective means to improve the learning of cooperative policy in multi‑agent systems. However, in most real‑world scenarios, lossy communication is a prevalent issue. Existing multi‑agent reinforcement learning with communication, due to their limited scalability and robustness, struggles to apply to complex and dynamic real‑world environments. To address these challenges, we propose a generalized communication‑constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision‑making, drawing on a dual mutual information estimatior, and introduce a communication‑constrained multi‑agent reinforcement learning framework, quantifying the impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication‑constrained benchmarks.

PaperID: 7010, https://arxiv.org/pdf/2512.03491.pdf

Authors: Antonin Sulc

Title: Modal Logical Neural Networks

Abstract:
We propose Modal Logical Neural Networks (MLNNs), a neurosymbolic framework that integrates deep learning with the formal semantics of modal logic, enabling reasoning about necessity and possibility. Drawing on Kripke semantics, we introduce specialized neurons for the modal operators \Box and \Diamond that operate over a set of possible worlds, enabling the framework to act as a differentiable ``logical guardrail.'' The architecture is highly flexible: the accessibility relation between worlds can either be fixed by the user to enforce known rules or, as an inductive feature, be parameterized by a neural network. This allows the model to optionally learn the relational structure of a logical system from data while simultaneously performing deductive reasoning within that structure. This versatile construction is designed for flexibility. The entire framework is differentiable from end to end, with learning driven by minimizing a logical contradiction loss. This not only makes the system resilient to inconsistent knowledge but also enables it to learn nonlinear relationships that can help define the logic of a problem space. We illustrate MLNNs on four case studies: grammatical guardrailing, multi‑agent epistemic trust, detecting constructive deception in natural language negotiation, and combinatorial constraint satisfaction in Sudoku. These experiments demonstrate how enforcing or learning accessibility can increase logical consistency and interpretability without changing the underlying task architecture.

PaperID: 7011, https://arxiv.org/pdf/2512.03474.pdf

Authors: Wenliang Guo, Yujiang Pu, Yu Kong

Title: Procedural Mistake Detection via Action Effect Modeling

Abstract:
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect‑aware representations. To detect mistakes, we further design a prompt‑based detector that incorporates task‑specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state‑of‑the‑art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one‑class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect‑aware representations to benefit a broader range of downstream applications.

PaperID: 7012, https://arxiv.org/pdf/2512.03466.pdf

Authors: Xavier Cadet, Edward Koh, Peter Chin

Title: AsymPuzl: An Asymmetric Puzzle for multi-agent cooperation

Abstract:
Large Language Model (LLM) agents are increasingly studied in multi‑turn, multi‑agent scenarios, yet most existing setups emphasize open‑ended role‑play rather than controlled evaluation. We introduce AsymPuzl, a minimal but expressive two‑agent puzzle environment designed to isolate communication under information asymmetry. Each agent observes complementary but incomplete views of a symbolic puzzle and must exchange messages to solve it cooperatively. Using a diverse set of current‑generation and open‑source LLMs, we show that (i) strong models such as GPT‑5 and Claude‑4.0 reliably converge across puzzle sizes on the solution by sharing complete information in two turns, (ii) weaker models often ignore partner messages or over‑correct their hypotheses, and (iii) feedback design is non‑trivial: simple self‑feedback improves success rates, while detailed joint feedback can hurt performance. These findings show that even in simple cooperative tasks, LLM communication strategies diverge and depend on the granularity of feedback signals. AsymPuzl thus provides a testbed for probing the limits of multi‑turn cooperation and opens avenues for studying coordination mechanisms.

PaperID: 7013, https://arxiv.org/pdf/2512.03454.pdf

Authors: Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

Title: Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Abstract:
Interpreting natural‑language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context‑dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial‑Aware World Model (SA‑WM) that learns to reason ahead by distilling the current scene into a command‑aware latent state and rolling out a sequence of future latent states, providing forward‑looking cues for disambiguation. Complementing this, a hypergraph‑guided decoder then hierarchically fuses these states with the multimodal input, capturing higher‑order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi‑source VG dataset in AD, featuring semantic annotations generated by a Retrieval‑Augmented Generation (RAG) and Chain‑of‑Thought (CoT)‑prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state‑of‑the‑art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long‑text, multi‑agent, ambiguity) and retains superior performance even when trained on 50% of the data.

PaperID: 7014, https://arxiv.org/pdf/2512.03318.pdf

Authors: Chandler Smith, Marwa Abdulhai, Manfred Diaz, Marko Tesic, Rakshit S. Trivedi, Alexander Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar A. Duéñez-Guzmán, John P. Agapiou, Jayd Matyas, Danny Karmon, Akash Kundu, Aliaksei Korshuk, Ananya Ananya, Arrasy Rahman, Avinaash Anand Kulandaivel, Bain McHale, Beining Zhang, Buyantuev Alexander, Carlos Saith Rodriguez Rojas, Caroline Wang, Chetan Talele, Chenao Liu, Chichen Lin, Diana Riazi, Di Yang Shi, Emanuel Tewolde, Elizaveta Tennant, Fangwei Zhong, Fuyang Cui, Gang Zhao, Gema Parreño Piqueras, Hyeonggeun Yun, Ilya Makarov, Jiaxun Cui, Jebish Purbey, Jim Dilkes, Jord Nguyen, Lingyun Xiao, Luis Felipe Giraldo, Manuela Chacon-Chamorro, Manuel Sebastian Rios Beltran, Marta Emili García Segura, Mengmeng Wang, Mogtaba Alim, Nicanor Quijano, Nico Schiavone, Olivia Macmillan-Scott, Oswaldo Peña, Peter Stone, Ram Mohan Rao Kadiyala, Rolando Fernandez, Ruben Manrique, Sunjia Lu, Sheila A. McIlraith, Shamika Dhuri, Shuqing Shi, Siddhant Gupta, Sneheel Sarangi, Sriram Ganapathi Subramanian, Taehun Cha, Toryn Q. Klassen, Wenming Tu, Weijian Fan, Wu Ruiyang, Xue Feng, Yali Du, Yang Liu, Yiding Wang, Yipeng Kang, Yoonchang Sung, Yuxuan Chen, Zhaowei Zhang, Zhihan Wang, Zhiqiang Wu, Ziang Chen, Zilong Zheng, Zixia Jia, Ziyan Wang, Dylan Hadfield-Menell, Natasha Jaques, Tim Baarslag, Jose Hernandez-Orallo, Joel Z. Leibo

Title: Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Abstract:
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM‑based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM‑based agents to cooperate in zero‑shot, mixed‑motive environments using Concordia, a natural language multi‑agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

PaperID: 7015, https://arxiv.org/pdf/2512.03285.pdf

Authors: Nafiul I. Khan, Mansura Habiba, Rafflesia Khan

Title: A Gossip-Enhanced Communication Substrate for Agentic AI: Toward Decentralized Coordination in Large-Scale Multi-Agent Systems

Abstract:
As agentic platforms scale, agents are moving beyond fixed roles and predefined toolchains, creating an urgent need for flexible and decentralized coordination. Current structured communication protocols such as direct agent‑to‑agent messaging or MCP‑style tool calls offer reliability, but they struggle to support the emergent and swarm‑like intelligence required in large adaptive systems. Distributed agents must learn continuously, share context fluidly, and coordinate without depending solely on central planners. This paper revisits gossip protocols as a complementary substrate for agentic communication. Gossip mechanisms, long valued in distributed systems for their decentralized and fault‑tolerant properties, provide scalable and adaptive diffusion of knowledge and fill gaps that structured protocols alone cannot efficiently address. However, gossip also introduces challenges, including semantic relevance, temporal staleness, and limited guarantees on action consistency in rapidly changing environments. We examine how gossip can support context‑rich state propagation, resilient coordination under uncertainty, and emergent global awareness. We also outline open problems around semantic filtering, trust, and knowledge decay. Rather than proposing a complete framework, this paper presents a research agenda for integrating gossip into multi‑agent communication stacks and argues that gossip is essential for future agentic ecosystems that must remain robust, adaptive, and self‑organizing as their scale and autonomy increase.

PaperID: 7016, https://arxiv.org/pdf/2512.03278.pdf

Authors: Michael Theologitis, Dan Suciu

Title: Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases

Abstract:
In today's age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims ‑‑ often about topics that can, in principle, be verified against structured data. For instance, statements about crime rates, economic growth or healthcare can all be verified against official public records and structured datasets. Building a system that can automatically do that would have sounded like science fiction just a few years ago. Yet, with the extraordinary progress in LLMs and agentic AI, this is now within reach. Still, there remains a striking gap between what is technically possible and what is being demonstrated by recent work. Most existing verification systems operate only on small, single‑table databases ‑‑ typically a few hundred rows ‑‑ that conveniently fit within an LLM's context window. In this paper we report our progress on Thucy, the first cross‑database, cross‑table multi‑agent claim verification system that also provides concrete evidence for each verification verdict. Thucy remains completely agnostic to the underlying data sources before deployment and must therefore autonomously discover, inspect, and reason over all available relational databases to verify claims. Importantly, Thucy also reports the exact SQL queries that support its verdict (whether the claim is accurate or not) offering full transparency to expert users familiar with SQL. When evaluated on the TabFact dataset ‑‑ the standard benchmark for fact verification over structured data ‑‑ Thucy surpasses the previous state of the art by 5.6 percentage points in accuracy (94.3% vs. 88.7%).

PaperID: 7017, https://arxiv.org/pdf/2512.03262.pdf

Authors: Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, Lei Li

Title: Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

Abstract:
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although vibe coding is increasingly adopted, are its outputs really safe to deploy in production? To answer this question, we propose SU S VI B E S, a benchmark consisting of 200 feature‑request software engineering tasks from real‑world open‑source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE‑Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe‑coding, particularly in security‑sensitive applications.

PaperID: 7018, https://arxiv.org/pdf/2512.03211.pdf

Authors: Nigel Tao, Jonathan Baxter, Lex Weaver

Title: A Multi-Agent, Policy-Gradient approach to Network Routing

Abstract:
Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy‑gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co‑operative behavior without explicit inter‑agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub‑optimal behavior was found to dramatically improve the convergence rate.

PaperID: 7019, https://arxiv.org/pdf/2512.03204.pdf

Authors: Douglas Aberdeen, Jonathan Baxter

Title: Scaling Internal-State Policy-Gradient Methods for POMDPs

Abstract:
Policy‑gradient methods have received increased attention recently as a mechanism for learning to act in partially observable environments. They have shown promise for problems admitting memoryless policies but have been less successful when memory is required. In this paper we develop several improved algorithms for learning policies with memory in an infinite‑horizon setting ‑‑ directly when a known model of the environment is available, and via simulation otherwise. We compare these algorithms on some large POMDPs, including noisy robot navigation and multi‑agent problems.

PaperID: 7020, https://arxiv.org/pdf/2512.03194.pdf

Authors: Johannes Gaber, Meshal Alharbi, Daniele Gammelli, Gioele Zardini

Title: GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding

Abstract:
Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi‑Agent Pickup‑and‑Delivery (MAPD) and propose a hybrid method that couples learning‑based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region‑to‑region rebalancing through a minimum‑cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per‑step latency within a 1 s compute budget. We call this approach GRAND: a hierarchical algorithm that relies on Guidance, Rebalancing, and Assignment to explicitly leverage the workspace Network structure and Dispatch agents to tasks. On congested warehouse benchmarks from the League of Robot Runners (LoRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real‑time execution. The results indicate that coupling graph‑structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high‑throughput scheduling in large fleets.

PaperID: 7021, https://arxiv.org/pdf/2512.03166.pdf

Authors: Aya Taourirte, Md Sohag Mia

Title: Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments

Abstract:
The deployment of multi‑agent systems in dynamic, adversarial environments like robotic soccer necessitates real‑time decision‑making, sophisticated cooperation, and scalable algorithms to avoid the curse of dimensionality. While Reinforcement Learning (RL) offers a promising framework, existing methods often struggle with the multi‑granularity of tasks (long‑term strategy vs. instant actions) and the complexity of large‑scale agent interactions. This paper presents a unified Multi‑Agent Reinforcement Learning (MARL) framework that addresses these challenges. First, we establish a baseline using Proximal Policy Optimization (PPO) within a client‑server architecture for real‑time action scheduling, with PPO demonstrating superior performance (4.32 avg. goals, 82.9% ball control). Second, we introduce a Hierarchical RL (HRL) structure based on the options framework to decompose the problem into a high‑level trajectory planning layer (modeled as a Semi‑Markov Decision Process) and a low‑level action execution layer, improving global strategy (avg. goals increased to 5.26). Finally, to ensure scalability, we integrate mean‑field theory into the HRL framework, simplifying many‑agent interactions into a single agent vs. the population average. Our mean‑field actor‑critic method achieves a significant performance boost (5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy) and enhanced training stability. Extensive simulations of 4v4 matches in the Webots environment validate our approach, demonstrating its potential for robust, scalable, and cooperative behavior in complex multi‑agent domains.

PaperID: 7022, https://arxiv.org/pdf/2512.03097.pdf

Authors: Adeela Bashir, The Anh han, Zia Ush Shamszaman

Title: Many-to-One Adversarial Consensus: Exposing Multi-Agent Collusion Risks in AI-Based Healthcare

Abstract:
The integration of large language models (LLMs) into healthcare IoT systems promises faster decisions and improved medical support. LLMs are also deployed as multi‑agent teams to assist AI doctors by debating, voting, or advising on decisions. However, when multiple assistant agents interact, coordinated adversaries can collude to create false consensus, pushing an AI doctor toward harmful prescriptions. We develop an experimental framework with scripted and unscripted doctor agents, adversarial assistants, and a verifier agent that checks decisions against clinical guidelines. Using 50 representative clinical questions, we find that collusion drives the Attack Success Rate (ASR) and Harmful Recommendation Rates (HRR) up to 100% in unprotected systems. In contrast, the verifier agent restores 100% accuracy by blocking adversarial consensus. This work provides the first systematic evidence of collusion risk in AI healthcare and demonstrates a practical, lightweight defence that ensures guideline fidelity.

PaperID: 7023, https://arxiv.org/pdf/2512.03059.pdf

Authors: Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato

Title: Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL

Abstract:
The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to promote sustainable and low‑carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging ‑ especially under real‑world conditions, where uncertainties in PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multi‑source uncertainties. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision‑making. We develop a novel HDRL algorithm, namely Double Actor‑Critic Multi‑Agent Proximal Policy Optimization Lagrangian (DAC‑MAPPO‑Lagrangian), which integrates Lagrangian relaxation into the Double Actor‑Critic (DAC) framework. At the high level, we adopt a centralized PPO‑Lagrangian algorithm to learn safe charger allocation policies. At the low level, we incorporate MAPPO‑Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real‑world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance, while maintaining fast convergence speed.

PaperID: 7024, https://arxiv.org/pdf/2512.02981.pdf

Authors: Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang

Title: InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Abstract:
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training‑free, multi‑agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy‑based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross‑modal multi‑agent collaboration with the editing agent and self‑reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%‑27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

PaperID: 7025, https://arxiv.org/pdf/2512.02973.pdf

Authors: Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao

Title: Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Abstract:
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text‑image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image‑centric attack method, Contextual Image Attack (CIA), which employs a multi‑agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench‑tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT‑4o and Qwen2.5‑VL‑72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

PaperID: 7026, https://arxiv.org/pdf/2512.02682.pdf

Authors: Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi

Title: Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

Abstract:
This paper examines why safety mechanisms designed for human‑model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single‑agent safety containment, prompts, fine‑tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi‑model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM‑to‑LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model‑level safety to system‑level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro‑level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi‑agent systems.

PaperID: 7027, https://arxiv.org/pdf/2512.02654.pdf

Authors: Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Francesco Balassone, María Sanz-Gómez, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres, Vanesa Turiel

Title: Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)

Abstract:
Are Capture‑the‑Flag competitions obsolete? In 2025, Cybersecurity AI (CAI) systematically conquered some of the world's most prestigious hacking competitions, achieving Rank #1 at multiple events and consistently outperforming thousands of human teams. Across five major circuits‑HTB's AI vs Humans, Cyber Apocalypse (8,129 teams), Dragos OT CTF, UWSP Pointer Overflow, and the Neurogrid CTF showdown‑CAI demonstrated that Jeopardy‑style CTFs have become a solved game for well‑engineered AI agents. At Neurogrid, CAI captured 41/45 flags to claim the 50,000 top prize; at Dragos OT, it sprinted 37% faster to 10K points than elite human teams; even when deliberately paused mid‑competition, it maintained top‑tier rankings. Critically, CAI achieved this dominance through our specialized alias1 model architecture, which delivers enterprise‑scale AI security operations at unprecedented cost efficiency and with augmented autonomy‑reducing 1B token inference costs from 5,940 to just 119, making continuous security agent operation financially viable for the first time. These results force an uncomfortable reckoning: if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring? This paper presents comprehensive evidence of AI capability across the 2025 CTF circuit and argues that the security community must urgently transition from Jeopardy‑style contests to Attack & Defense formats that genuinely test adaptive reasoning and resilience‑capabilities that remain uniquely human, for now.

PaperID: 7028, https://arxiv.org/pdf/2512.02605.pdf

Authors: Pengju Lu

Title: IACT: A Self-Organizing Recursive Model for General AI Agents: A Technical White Paper on the Architecture Behind kragent.ai

Abstract:
This technical white paper introduces the Interactive Agents Call Tree (IACT), a computational model designed to address the limitations of static, hard‑coded agent workflows. Unlike traditional systems that require pre‑defined graphs or specialized programming, IACT operates as a general‑purpose autonomous system driven purely by user dialogue. Given a high‑level objective, the system autonomously grows a dynamic, recursive agent topology incrementally tailored to the problem's structure. This allows it to scale its organizational complexity to match open‑ended tasks. To mitigate the error propagation inherent in unidirectional function calls, IACT introduces interactional redundancy by replacing rigid invocations with bidirectional, stateful dialogues. This mechanism enables runtime error correction and ambiguity resolution. We describe the architecture, design principles, and practical lessons behind the production deployment of this model in the kragent.ai system, presenting qualitative evidence from real‑world workflows rather than exhaustive benchmark results.

PaperID: 7029, https://arxiv.org/pdf/2512.02543.pdf

Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian

Title: Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

Abstract:
Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost‑reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine‑tuning requires multi‑day training and commitment to fixed designs; both are impractical for iterative workflows and time‑sensitive batch jobs. We demonstrate that established inference‑time techniques‑‑dynamic in‑context learning and self‑consistency cascades‑‑can be leveraged to shift the cost‑accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in‑context examples. When multiple student samples agree, we proceed; when they diverge, we fall back to the teacher. This requires no prompt engineering or training. On ALFWorld, we match teacher accuracy at 2.5x lower cost (0.059 to 0.024 per episode). On AppWorld, we achieve 3.5x cost reduction while recovering 79% of teacher accuracy. Our empirical analyses provide guidance on key design choices: teacher database size, demonstration set size, retrieval strategy, and cascade thresholds. These analyses highlight inference‑time levers for navigating cost‑performance tradeoffs without sacrificing human development speed.

PaperID: 7030, https://arxiv.org/pdf/2512.02533.pdf

Authors: Yijun Liu, Wu Liu, Xiaoyan Gu, Allen He, Weiping Wang, Yongdong Zhang

Title: PopSim: Social Network Simulation for Social Media Popularity Prediction

Abstract:
Accurately predicting the popularity of user‑generated content (UGC) is essential for advancing social media analytics and recommendation systems. Existing approaches typically follow an inductive paradigm, where researchers train static models on historical data for popularity prediction. However, the UGC propagation is inherently a dynamic process, and static modeling based on historical features fails to capture the complex interactions and nonlinear evolution. In this paper, we propose PopSim, a novel simulation‑based paradigm for social media popularity prediction (SMPP). Unlike the inductive paradigm, PopSim leverages the large language models (LLMs)‑based multi‑agent social network sandbox to simulate UGC propagation dynamics for popularity prediction. Specifically, to effectively model the UGC propagation process in the network, we design a social‑mean‑field‑based agent interaction mechanism, which models the dual‑channel and bidirectional individual‑population interactions, enhancing agents' global perception and decision‑making capabilities. In addition, we propose a multi‑source information aggregation module that transforms heterogeneous social metadata into a uniform formulation for LLMs. Finally, propagation dynamics with multimodal information are fused to provide comprehensive popularity prediction. Extensive experiments on real‑world datasets demonstrate that SimPop consistently outperforms the state‑of‑the‑art methods, reducing prediction error by an average of 8.82%, offering a new perspective for research on the SMPP task.

PaperID: 7031, https://arxiv.org/pdf/2512.02445.pdf

Authors: Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

Title: When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Abstract:
Solving complex or long‑horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long‑context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M‑2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT‑4.1‑nano increases from ～5% to ～40% while Grok 4 Fast decreases from ～80% to ～10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi‑step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

PaperID: 7032, https://arxiv.org/pdf/2512.02442.pdf

Authors: Changhee Lee, Jeongmin Rhee, DongHwa Shin

Title: A Visual Analytics System to Understand Behaviors of Multi Agents in Reinforcement Learning

Abstract:
Multi‑Agent Reinforcement Learning (MARL) is a branch of machine learning in which agents interact and learn optimal policies through trial and error, addressing complex scenarios where multiple agents interact and learn in the same environment at the same time. Analyzing and understanding these complex interactions is challenging, and existing analysis methods are limited in their ability to fully reflect and interpret this complexity. To address these challenges, we provide MARLViz, a visual analytics system for visualizing and analyzing the policies and interactions of agents in MARL environments. The system is designed to visually show the difference in behavior of agents under different environment settings and help users understand complex interaction patterns. In this study, we analyzed agents with similar behaviors and selected scenarios to understand the interactions of the agents, which made it easier to understand the strategies of agents in MARL.

PaperID: 7033, https://arxiv.org/pdf/2512.02410.pdf

Authors: Yepeng Ding, Ahmed Twabi, Junwei Yu, Lingfeng Zhang, Tohru Kondo, Hiroyuki Sato

Title: Decentralized Multi-Agent System with Trust-Aware Communication

Abstract:
The emergence of Large Language Models (LLMs) is rapidly accelerating the development of autonomous multi‑agent systems (MAS), paving the way for the Internet of Agents. However, traditional centralized MAS architectures present significant challenges, including single points of failure, vulnerability to censorship, inherent scalability limitations, and critical trust issues. We propose a novel Decentralized Multi‑Agent System (DMAS) architecture designed to overcome these fundamental problems by enabling trust‑aware, scalable, and censorship‑resistant interactions among autonomous agents. Our DMAS features a decentralized agent runtime underpinned by a blockchain‑based architecture. We formalize a trust‑aware communication protocol that leverages cryptographic primitives and on‑chain operations to provide security properties: verifiable interaction cycles, communication integrity, authenticity, non‑repudiation, and conditional confidentiality, which we further substantiate through a comprehensive security analysis. Our performance analysis validates the DMAS as a scalable and efficient solution for building trustworthy multi‑agent systems.

PaperID: 7034, https://arxiv.org/pdf/2512.02406.pdf

Authors: Oshada Jayasinghe, Farhana Choudhury, Egemen Tanin, Shanika Karunasekera

Title: Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning

Abstract:
With increased travelling needs more than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on‑street parking, further hinders traffic flow, by limiting the effective road width available for driving. With the advancement of vehicle‑to‑infrastructure connectivity technologies, we explore how the impact of on‑street parking on traffic congestion could be minimized, by dynamically configuring on‑street parking spaces. Towards that end, we formulate dynamic on‑street parking space configuration as an optimization problem, and we follow a data driven approach, considering the nature of our problem. Our proposed solution comprises a two‑layer multi agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q‑learning architecture which effectively utilizes long short term memory networks and graph attention networks to capture the spatio‑temporal correlations evident in the given problem. The block level agents control the actions of the lane level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data as well as real‑world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, reaching upto 47%, with a negligible increase in the walking distance for parking.

PaperID: 7035, https://arxiv.org/pdf/2512.02405.pdf

Authors: Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng

Title: WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Abstract:
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi‑agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language‑only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision‑and‑language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single‑ and multi‑modal capabilities. To this end, we present Weighted Iterative Society‑of‑Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid‑Skene algorithm for post‑processing that integrates our two‑stage debate model. We evaluate WISE on SMART‑840, VisualPuzzles, EvoChart‑QA, and a new SMART‑840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2‑7% over the state‑of‑the‑art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.

PaperID: 7036, https://arxiv.org/pdf/2512.02368.pdf

Authors: Wenyi Xiong, Jian Chen, Ziheng Qi, Wenhua Chen

Title: MoE-Enhanced Multi-Domain Feature Selection and Fusion for Fast Map-Free Trajectory Prediction

Abstract:
Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios due to noisy trajectory observations and intricate agent interactions. Existing methods often struggle to filter redundant scene data for discriminative information extraction, directly impairing trajectory prediction accuracy especially when handling outliers and dynamic multi‑agent interactions. In response to these limitations, we present a novel map‑free trajectory prediction method which adaptively eliminates redundant information and selects discriminative features across the temporal, spatial, and frequency domains, thereby enabling precise trajectory prediction in real‑world driving environments. First, we design a MoE based frequency domain filter to adaptively weight distinct frequency components of observed trajectory data and suppress outlier related noise; then a selective spatiotemporal attention module that reallocates weights across temporal nodes (sequential dependencies), temporal trends (evolution patterns), and spatial nodes to extract salient information is proposed. Finally, our multimodal decoder‑supervised by joint patch level and point‑level losses generates reasonable and temporally consistent trajectories, and comprehensive experiments on the large‑scale NuScenes and Argoverse dataset demonstrate that our method achieves competitive performance and low‑latency inference performance compared with recently proposed methods.

PaperID: 7037, https://arxiv.org/pdf/2512.02367.pdf

Authors: Sungjun Seo, Kooktae Lee

Title: On the Convergence of Density-Based Predictive Control for Multi-Agent Non-Uniform Area Coverage

Abstract:
This paper presents Density‑based Predictive Control (DPC), a novel multi‑agent control strategy for efficient non‑uniform area coverage, grounded in optimal transport theory. In large‑scale scenarios such as search and rescue or environmental monitoring, traditional uniform coverage fails to account for varying regional priorities. DPC leverages a pre‑constructed reference distribution to allocate agents' coverage efforts, spending more time in high‑priority or densely sampled regions. We analyze convergence conditions using the Wasserstein distance, derive an analytic optimal control law for unconstrained cases, and propose a numerical method for constrained scenarios. Simulations on first‑order dynamics and linearized quadrotor models demonstrate that DPC achieves trajectories closely matching the non‑uniform reference distribution, outperforming existing coverage methods.

PaperID: 7038, https://arxiv.org/pdf/2512.02358.pdf

Authors: Ran Zhang, Kun Ouyang, Tiancheng Ma, Yida Yang, Dong Fang

Title: Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games

Abstract:
Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large‑scale online experiments or parameter tuning over predefined statistical models, which are costly, time‑consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent‑based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) on large‑scale real player behavioral data, we adapt LLMs from general priors to game‑specific domains, enabling realistic and interpretable player decision‑making. In parallel, a data‑driven environment model trained on real gameplay logs reconstructs dynamic in‑game systems. Experiments demonstrate strong consistency with real‑world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost‑efficient framework for data‑driven numerical design optimization.

PaperID: 7039, https://arxiv.org/pdf/2512.02329.pdf

Authors: Hoa Khanh Dam, Geeta Mahala, Rashina Hoda, Xi Zheng, Cristina Conati

Title: Towards autonomous normative multi-agent systems for Human-AI software engineering teams

Abstract:
This paper envisions a transformative paradigm in software engineering, where Artificial Intelligence, embodied in fully autonomous agents, becomes the primary driver of the core software development activities. We introduce a new class of software engineering agents, empowered by Large Language Models and equipped with beliefs, desires, intentions, and memory to enable human‑like reasoning. These agents collaborate with humans and other agents to design, implement, test, and deploy software systems with a level of speed, reliability, and adaptability far beyond the current software development processes. Their coordination and collaboration are governed by norms expressed as deontic modalities ‑ commitments, obligations, prohibitions and permissions ‑ that regulate interactions and ensure regulatory compliance. These innovations establish a scalable, transparent and trustworthy framework for future Human‑AI software engineering teams.

PaperID: 7040, https://arxiv.org/pdf/2512.02321.pdf

Authors: Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang, Li Sun, Yang Liu, Sen Su

Title: LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Abstract:
Large Language Model (LLM)‑based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community‑built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption ‑‑ implicit trust in third‑party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent's computational resources for unauthorized workloads. LeechHijack operates through a two‑stage mechanism: an implantation stage that embeds a benign‑looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command‑and‑control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user's compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.

PaperID: 7041, https://arxiv.org/pdf/2512.02289.pdf

Authors: Lindsey Linxi Wei, Shreya Shankar, Sepanta Zeighami, Yeounoh Chung, Fatma Ozcan, Aditya G. Parameswaran

Title: Multi-Objective Agentic Rewrites for Unstructured Data Processing

Abstract:
One year ago, we open‑sourced DocETL, a declarative system for LLM‑powered data processing that, as of March 2026, has 3.7K GitHub stars and users across domains (e.g., journalism, law, medicine, policy, finance, and urban planning). In DocETL, users build pipelines by composing operators described in natural language, also known as semantic operators, with an LLM executing each operator's logic. However, due to complexity in the operator or the data it operates on, LLMs often give inaccurate results. To address this challenge, DocETL introduced rewrite directives, or abstract rules that guide LLM agents in rewriting pipelines by decomposing operators or data. For example, decomposing a single filter("is this email sent from an executive and discussing fraud?") into the conjunction of two separate semantic filters may improve accuracy. However, DocETL only optimizes for accuracy, not cost. How do we optimize for both? We present MOAR (Multi‑Objective Agentic Rewrites), a new optimizer for DocETL. To target cost optimization, we introduce two new categories of directives and extend all three existing categories with new ones, bringing the total to over 30 directives ‑‑ more than doubling what DocETL originally had. Moreover, since operators can interact with each other unpredictably due to LLM behavior, optimizing operators or sub‑pipelines individually can yield suboptimal overall plans. Recognizing this, we design a new global search algorithm that explores rewrites in the context of entire pipelines. Since the space of rewrites is infinite ‑‑ pipelines can be rewritten in many ways, and each rewritten pipeline can itself be rewritten ‑‑ our algorithm adapts a multi‑armed bandit framework to prioritize which pipelines to rewrite. Across six workloads, MOAR achieves 27% higher accuracy than ABACUS, the next‑best optimizer, while matching its best accuracy at 55% of its cost.

PaperID: 7042, https://arxiv.org/pdf/2512.02282.pdf

Authors: Han Luo, Guy Laban

Title: DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

Abstract:
Large language models (LLMs) now mediate many web‑based mental‑health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi‑agent framework for assessing psychosocial risks in LLM‑generated responses along five high‑severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM‑as‑a‑judge pipelines, including single‑agent scoring, dual‑agent correction, multi‑agent debate, and stochastic majority voting, grounded in a shared three‑level rubric usable by both human annotators and LLM judges. Using PKU‑SafeRLHF with human safety annotations, we show that multi‑agent mechanisms detect psychosocial risks more accurately than non‑LLM baselines and single‑agent judging; dual‑agent correction and majority voting provide the best trade‑off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over‑flags borderline cases. We release Dialog‑Guard as open‑source software with a web interface that provides per‑dimension risk scores and explainable natural‑language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web‑facing applications for vulnerable users.

PaperID: 7043, https://arxiv.org/pdf/2512.02230.pdf

Authors: Rory Milsom

Title: Benchmarking LLM Agents for Wealth-Management Workflows

Abstract:
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance‑focused environment and investigates whether a general purpose LLM agent can complete representative wealth‑management tasks both accurately and economically. This study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task‑generation pipeline. The study aims to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant‑level wealth management work. We construct a benchmark of 12 task‑pairs for wealth management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance‑specific data and introduced a high vs. low‑autonomy variant of every task. The paper concluded that agents are limited less by mathematical reasoning and more so by end‑to‑end workflow reliability, and meaningfully affected by autonomy level, and that incorrect evaluation of models have hindered benchmarking.

PaperID: 7044, https://arxiv.org/pdf/2512.02228.pdf

Authors: Shubhi Asthana, Bing Zhang, Chad DeLuca, Ruchi Mahindru, Hima Patel

Title: STRIDE: A Systematic Framework for Selecting AI Modalities -- Agentic AI, AI Assistants, or LLM Calls

Abstract:
The rapid shift from stateless large language models (LLMs) to autonomous, goal‑driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi‑step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self‑reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real‑world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity‑driven design decision, ensuring autonomy is applied only when its benefits justify the costs.

PaperID: 7045, https://arxiv.org/pdf/2512.02195.pdf

Authors: David Ph. Shakouri, Crit Cremers, Niels O. Schiller

Title: A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation

Abstract:
This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi‑agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule‑based procedures, the result of language acquisition is a knowledge‑based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well‑established for human‑generated data, are also found for these machine‑generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.

PaperID: 7046, https://arxiv.org/pdf/2512.02185.pdf

Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang

Title: Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Abstract:
Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek‑R1, and Qwen3 deliver strong multi‑step reasoning through chain‑of‑thought generation, but their large model sizes and lengthy decode‑time outputs make them costly to deploy and unsuitable for resource‑constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode‑time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human‑written labels but from the model's own self‑generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self‑reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self‑generated calibration, decode‑only gradient‑based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3‑8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near‑dense accuracy at 20‑30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

PaperID: 7047, https://arxiv.org/pdf/2512.02079.pdf

Authors: Jonathan S. Kent, Eliana Stefani, Brian Plancher

Title: Robust Geospatial Coordination of Multi-Agent Communications Networks Under Attrition

Abstract:
Coordinating emergency responses in extreme environments, such as wildfires, requires resilient and high‑bandwidth communication backbones. While autonomous aerial swarms can establish ad‑hoc networks to provide this connectivity, the high risk of individual node attrition in these settings often leads to network fragmentation and mission‑critical downtime. To overcome this challenge, we introduce and formalize the problem of Robust Task Networking Under Attrition (RTNUA), which extends connectivity maintenance in multi‑robot systems to explicitly address proactive redundancy and attrition recovery. We then introduce Physics‑Informed Robust Employment of Multi‑Agent Networks (ΦIREMAN), a topological algorithm leveraging physics‑inspired potential fields to solve this problem. In our evaluations, ΦIREMAN consistently outperforms baselines, and is able to maintain greater than 99.9% task uptime despite substantial attrition in simulations with up to 100 tasks and 500 drones, demonstrating both effectiveness and scalability.

PaperID: 7048, https://arxiv.org/pdf/2512.01939.pdf

Authors: Yanlin Wang, Xinyi Xu, Jiachi Chen, Tingting Bi, Wenchao Gu, Zibin Zheng

Title: An Empirical Study of Agent Developer Practices in AI Agent Frameworks

Abstract:
The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions, and orchestration mechanisms to simplify agent development. Despite widespread use of agent frameworks, their practical applications and how they influence the agent development process remain underexplored. Different agent frameworks encounter similar problems during use, indicating that these recurring issues deserve greater attention and call for further improvements in agent framework design. Meanwhile, as the number of agent frameworks continues to grow and evolve, more than 80% of developers report difficulties in identifying the frameworks that best meet their specific development requirements. In this paper, we conduct the first empirical study of LLM‑based agent frameworks, exploring real‑world experiences of developers in building AI agents. To compare how well the agent frameworks meet developer needs, we further collect developer discussions for the ten previously identified agent frameworks, resulting in a total of 11,910 discussions. Finally, by analyzing these discussions, we compare the frameworks across five dimensions: development efficiency, functional abstraction, learning cost, performance optimization, and maintainability, which refers to how easily developers can update and extend both the framework itself and the agents built upon it over time. Our comparative analysis reveals significant differences among frameworks in how they meet the needs of agent developers. Overall, we provide a set of findings and implications for the LLM‑driven AI agent framework ecosystem and offer insights for the design of future LLM‑based agent frameworks and agent developers.

PaperID: 7049, https://arxiv.org/pdf/2512.01909.pdf

Authors: Lihu Chen, Xiang Yin, Francesca Toni

Title: Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Abstract:
Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self‑consistency and multi‑agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model‑ and task‑agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns, such as a high degree of latent debates in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.

PaperID: 7050, https://arxiv.org/pdf/2512.01693.pdf

Authors: Honghui Kim, Dohoon Kim, Jihan Kim

Title: LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion

Abstract:
Metal‑organic framework (MOF) databases have grown rapidly through experimental deposition and large‑scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These inaccuracies propagate through high‑throughput screening and machine‑learning workflows, limiting the reliability of data‑driven MOF discovery. Correcting such errors is exceptionally difficult because true repairs require integrating crystallographic files, synthesis descriptions, and contextual evidence scattered across the literature. Here we introduce LitMOF, a large language model‑driven multi‑agent framework that validates crystallographic information directly from the original literature and cross‑validates it with database entries to repair structural errors. Applying LitMOF to the experimental MOF database (the CSD MOF Subset), we constructed LitMOF‑DB, a curated set of 186,773 computation‑ready structures, including the successful repair of 8,771 invalid entries, which accounts for 65.3% of the not‑computation‑ready MOFs in the latest CoRE MOF database. Additionally, the system uncovered 12,646 experimentally reported MOFs absent from existing resources, substantially expanding the known experimental design space. Using direct air capture screening as a case study, we demonstrate that structural errors severely distort predicted adsorption energies and CO2/H2O selectivity, leading to systematic misranking of materials, false positives, and the omission of high‑performance candidates. This work establishes a scalable pathway toward self‑correcting scientific databases and a generalizable paradigm for LLM‑driven curation in materials science.

PaperID: 7051, https://arxiv.org/pdf/2512.01610.pdf

Authors: Yuren Mao, Peigen Liu, Xinjian Wang, Rui Ding, Jing Miao, Hui Zou, Mingjie Qi, Wanxiang Luo, Longbin Lai, Kai Wang, Zhengping Qian, Peilun Yang, Yunjun Gao, Ying Zhang

Title: Agent-Kernel: A MicroKernel Multi-Agent System Framework for Adaptive Social Simulation Powered by LLMs

Abstract:
Multi‑Agent System (MAS) developing frameworks serve as the foundational infrastructure for social simulations powered by Large Language Models (LLMs). However, existing frameworks fail to adequately support large‑scale simulation development due to inherent limitations in adaptability, configurability, reliability, and code reusability. For example, they cannot simulate a society where the agent population and profiles change over time. To fill this gap, we propose Agent‑Kernel, a framework built upon a novel society‑centric modular microkernel architecture. It decouples core system functions from simulation logic and separates cognitive processes from physical environments and action execution. Consequently, Agent‑Kernel achieves superior adaptability, configurability, reliability, and reusability. We validate the framework's superiority through two distinct applications: a simulation of the Universe 25 (Mouse Utopia) experiment, which demonstrates the handling of rapid population dynamics from birth to death; and a large‑scale simulation of the Zhejiang University Campus Life, successfully coordinating 10,000 heterogeneous agents, including students and faculty.

PaperID: 7052, https://arxiv.org/pdf/2512.01485.pdf

Authors: Jindi Lv, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Guan Huang, Jiancheng Lv

Title: Multi-Path Collaborative Reasoning via Reinforcement Learning

Abstract:
Chain‑of‑Thought (CoT) reasoning has significantly advanced the problem‑solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi‑Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross‑path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi‑step reasoning patterns. Empirical results show that M3PO achieves state‑of‑the‑art performance on both knowledge‑ and reasoning‑intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi‑path collaborative learning for robust reasoning.

PaperID: 7053, https://arxiv.org/pdf/2512.01439.pdf

Authors: Bharatdeep Hazarika, Arya Suneesh, Prasanna Devadiga, Pawan Kumar Rajpoot, Anshuman B Suresh, Ahmed Ifthaquar Hussain

Title: Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech

Abstract:
India's linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code‑mixed languages like Hinglish, enabling natural interactions for India's diverse user base. Our system employs a multi‑agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real‑world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4‑8%). This work contributes to bridging the language gap in digital financial services for emerging markets.

PaperID: 7054, https://arxiv.org/pdf/2512.01434.pdf

Authors: Daull Xavier, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco

Title: A Flexible Multi-Agent LLM-Human Framework for Fast Human Validated Tool Building

Abstract:
We introduce CollabToolBuilder, a flexible multiagent LLM framework with expert‑in‑the‑loop (HITL) guidance that iteratively learns to create tools for a target goal, aligning with human intent and process, while minimizing time for task/domain adaptation effort and human feedback capture. The architecture generates and validates tools via four specialized agents (Coach, Coder, Critic, Capitalizer) using a reinforced dynamic prompt and systematic human feedback integration to reinforce each agent's role toward goals and constraints. This work is best viewed as a system‑level integration and methodology combining multi‑agent in‑context learning, HITL controls, and reusable tool capitalization for complex iterative problems such as scientific document generation. We illustrate it with preliminary experiments (e.g., generating state‑of‑the‑art research papers or patents given an abstract) and discuss its applicability to other iterative problem‑solving.

PaperID: 7055, https://arxiv.org/pdf/2512.01420.pdf

Authors: Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei, Yang Liu, Yujia Bao

Title: PromptBridge: Cross-Model Prompt Transfer for Large Language Models

Abstract:
Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent‑based workflows. In practice, systems access LLMs via commercial APIs or open‑source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model‑sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training‑free framework that preserves prompt effectiveness under model switches, enabling cross‑model prompt transfer without costly per‑task or per‑model re‑optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model‑Adaptive Reflective Prompt Evolution (MAP‑RPE) to obtain task‑ and model‑specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross‑model prompt mapping. At test time, i.e., for an unseen task, given a source‑model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single‑agent and multi‑agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.

PaperID: 7056, https://arxiv.org/pdf/2512.01321.pdf

Authors: Juan Hernandez, Diego Fernández, Manuel Cifuentes, Denis Parra, Rodrigo Toro Icarte

Title: Extending NGU to Multi-Agent RL: A Preliminary Study

Abstract:
The Never Give Up (NGU) algorithm has proven effective in reinforcement learning tasks with sparse rewards by combining episodic novelty and intrinsic motivation. In this work, we extend NGU to multi‑agent environments and evaluate its performance in the simple_tag environment from the PettingZoo suite. Compared to a multi‑agent DQN baseline, NGU achieves moderately higher returns and more stable learning dynamics. We investigate three design choices: (1) shared replay buffer versus individual replay buffers, (2) sharing episodic novelty among agents using different k thresholds, and (3) using heterogeneous values of the beta parameter. Our results show that NGU with a shared replay buffer yields the best performance and stability, highlighting that the gains come from combining NGU intrinsic exploration with experience sharing. Novelty sharing performs comparably when k = 1 but degrades learning for larger values. Finally, heterogeneous beta values do not improve over a small common value. These findings suggest that NGU can be effectively applied in multi‑agent settings when experiences are shared and intrinsic exploration signals are carefully tuned.

PaperID: 7057, https://arxiv.org/pdf/2512.01241.pdf

Authors: David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Ernest Y. Lee, Allen Shih, Vartan Pahalyants, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Adam Rodman, Jonathan H. Chen, Ethan Goh

Title: First, do NOHARM: towards clinically safe large language models

Abstract:
Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary care‑to‑specialist consultation cases to measure frequency and severity of harm from LLM‑generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, potential for severe harm from LLM recommendations occurs in up to 22.2% (95% CI 21.6‑22.8%) of cases, with harm of omission accounting for 76.6% (95% CI 76.4‑76.8%) of errors. Safety performance is only moderately correlated (r = 0.61‑0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0‑12.5%), and a diverse multi‑agent approach improves safety compared to solo models (mean difference 8.0%, 95% CI 4.0‑12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.

PaperID: 7058, https://arxiv.org/pdf/2512.01078.pdf

Authors: Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, Lianhui Qin

Title: SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

Abstract:
While LLM/VLM‑powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (for example, by autonomously earning income or running a business) requires massive‑scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand‑crafted environments, simulate simplified game‑like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real‑world‑like settings. SimWorld offers three core capabilities: (1) realistic, open‑ended world simulation, including accurate physical and social dynamics and language‑driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal world inputs and open‑vocabulary actions at varying levels of abstraction; and (3) diverse and extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT‑4o, Gemini‑2.5‑Flash, Claude‑3.5, and DeepSeek‑Prover‑V2) on long‑horizon multi‑agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open‑source SimWorld and hope it becomes a foundational platform for advancing real‑world agent intelligence across disciplines: https://simworld.org.

PaperID: 7059, https://arxiv.org/pdf/2512.01035.pdf

Authors: Shutong Chen, Qi Liao, Adnan Aijaz, Yansha Deng

Title: Goal-Oriented Multi-Agent Semantic Networking: Unifying Intents, Semantics, and Intelligence

Abstract:
6G services are evolving toward goal‑oriented and AI‑native communication, which are expected to deliver transformative societal benefits across various industries and promote energy sustainability. Yet today's networking architectures, built on complete decoupling of the applications and the network, cannot expose or exploit high‑level goals, limiting their ability to adapt intelligently to service needs. This work introduces Goal‑Oriented Multi‑Agent Semantic Networking (GoAgentNet), a new architecture that elevates communication from data exchange to goal fulfilment. GoAgentNet enables applications and the network to collaborate by abstracting their functions into multiple collaborative agents, and jointly orchestrates multi‑agent sensing, networking, computation, and control through semantic computation and cross‑layer semantic networking, allowing the entire architecture to pursue unified application goals. We first outline the limitations of legacy network designs in supporting 6G services, based on which we highlight key enablers of our GoAgentNet design. Then, through three representative 6G usage scenarios, we demonstrate how GoAgentNet can unlock more efficient and intelligent services. We further identify unique challenges faced by GoAgentNet deployment and corresponding potential solutions. A case study on robotic fault detection and recovery shows that our GoAgentNet architecture improves energy efficiency by up to 99% and increases the task success rate by up to 72%, compared with the existing networking architectures without GoAgentNet, which underscores its potential to support scalable and sustainable 6G systems.

PaperID: 7060, https://arxiv.org/pdf/2512.01010.pdf

Authors: Vansh Sharma, Venkat Raman

Title: Chain of Unit-Physics: A Primitive-Centric Approach to Scientific Code Synthesis

Abstract:
Agentic large language models are proposed as autonomous code generators for scientific computing, yet their reliability in high‑stakes problems remains unclear. Developing computational scientific software from natural‑language queries remains challenging broadly due to (a) sparse representation of domain codes during training and (b) the limited feasibility of RLHF with a small expert community. To address these limitations, this work conceptualizes an inverse approach to code design, embodied in the Chain of Unit‑Physics framework: a first‑principles (or primitives)‑centric, multi‑agent system in which human expert knowledge is encoded as unit‑physics tests that explicitly constrain code generation. The framework is evaluated on a nontrivial combustion task, used here as a representative benchmark for scientific problem with realistic physical constraints. Closed‑weight systems and code‑focused agentic variants fail to produce correct end‑to‑end solvers, despite tool and web access, exhibiting four recurrent error classes: interface (syntax/API) hallucinations, overconfident assumptions, numerical/physical incoherence, and configuration fragility. Open‑weight models with chain‑of‑thought (CoT) decoding reduce interface errors but still yield incorrect solutions. On the benchmark task, the proposed framework converges within 5‑6 iterations, matches the human‑expert implementation (mean error of 3.1×10^‑3 %), with a ～33.4 % faster runtime and a ～30 % efficient memory usage at a cost comparable to mid‑sized commercial APIs, yielding a practical template for physics‑grounded scientific code generation. As datasets and models evolve, zero‑shot code accuracy will improve; however, the Chain of Unit‑Physics framework goes further by embedding first‑principles analysis that is foundational to scientific codes.

PaperID: 7061, https://arxiv.org/pdf/2512.00968.pdf

Authors: Ziyang Zeng, Heming Jing, Jindong Chen, Xiangli Li, Hongyu Liu, Yixuan He, Zhengyu Li, Yige Sun, Zheyong Xie, Yuqing Yang, Shaosheng Cao, Jun Fan, Yi Wu, Yao Hu

Title: Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search

Abstract:
Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain‑of‑Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning‑based Generative Relevance Models (GRMs) primarily rely on supervised fine‑tuning on large amounts of human‑annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain‑agnostic, free‑form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open‑domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)‑based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business‑specific relevance criteria into the multi‑step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process‑supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large‑scale RL‑tuned model to a lightweight version suitable for real‑world search systems. Extensive offline evaluations and online A/B tests demonstrate that our approach consistently delivers significant improvements across key relevance and business metrics, validating its effectiveness, robustness, and practicality for large‑scale industrial search systems.

PaperID: 7062, https://arxiv.org/pdf/2512.00939.pdf

Authors: Nayesha Gandotra, Itamar Mishani, Maxim Likhachev

Title: Constant-Time Motion Planning with Manipulation Behaviors

Abstract:
Recent progress in contact‑rich robotic manipulation has been striking, yet most deployed systems remain confined to simple, scripted routines. One of the key barriers is the lack of motion planning algorithms that can provide verifiable guarantees for safety, efficiency and reliability. To address this, a family of algorithms called Constant‑Time Motion Planning (CTMP) was introduced, which leverages a preprocessing phase to enable collision‑free motion queries in a fixed, user‑specified time budget (e.g., 10 milliseconds). However, existing CTMP methods do not explicitly incorporate the manipulation behaviors essential for object handling. To bridge this gap, we introduce the Behavioral Constant‑Time Motion Planner (B‑CTMP), an algorithm that extends CTMP to solve a broad class of two‑step manipulation tasks: (1) a collision‑free motion to a behavior initiation state, followed by (2) execution of a manipulation behavior (such as grasping or insertion) to reach the goal. By precomputing compact data structures, B‑CTMP guarantees constant‑time query in mere milliseconds while ensuring completeness and successful task execution over a specified set of states. We evaluate B‑CTMP on two canonical manipulation tasks, shelf picking and plug insertion, in simulation and on a real robot. Our results show that B‑CTMP unifies collision‑free planning and object manipulation within a single constant‑time framework, providing provable guarantees of speed and success for manipulation in semi‑structured environments.

PaperID: 7063, https://arxiv.org/pdf/2512.00851.pdf

Authors: Wenzhang Du

Title: City-Conditioned Memory for Multi-City Traffic and Mobility Forecasting

Abstract:
Deploying spatio‑temporal forecasting models across many cities is difficult: traffic networks differ in size and topology, data availability can vary by orders of magnitude, and new cities may provide only a short history of logs. Existing deep traffic models are typically trained per city and backbone, creating high maintenance cost and poor transfer to data‑scarce cities. We ask whether a single, backbone‑agnostic layer can condition on "which city this sequence comes from", improve accuracy in full‑ and low‑data regimes, and support better cross‑city adaptation with minimal code changes. We propose CityCond, a light‑weight city‑conditioned memory layer that augments existing spatio‑temporal backbones. CityCond combines a city‑ID encoder with an optional shared memory bank (CityMem). Given a city index and backbone hidden states, it produces city‑conditioned features fused through gated residual connections. We attach CityCond to five representative backbones (GRU, TCN, Transformer, GNN, STGCN) and evaluate three regimes: full‑data, low‑data, and cross‑city few‑shot transfer on METR‑LA and PEMS‑BAY. We also run auxiliary experiments on SIND, a drone‑based multi‑agent trajectory dataset from a signalized intersection in Tianjin (we focus on pedestrian tracks). Across more than fourteen model variants and three random seeds, CityCond yields consistent improvements, with the largest gains for high‑capacity backbones such as Transformers and STGCNs. CityMem reduces Transformer error by roughly one third in full‑data settings and brings substantial gains in low‑data and cross‑city transfer. On SIND, simple city‑ID conditioning modestly improves low‑data LSTM performance. CityCond can therefore serve as a reusable design pattern for scalable, multi‑city forecasting under realistic data constraints.

PaperID: 7064, https://arxiv.org/pdf/2512.00818.pdf

Authors: Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Bran Li

Title: Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning

Abstract:
MLLMs MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med‑CMR, a fine‑grained Medical Complex Multimodal Reasoning benchmark. Med‑CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine‑grained visual understanding and multi‑step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small‑object detection, fine‑detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long‑tail generalization, multi‑source integration); 3) Broad, high‑quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two‑stage (human expert + model‑assisted) review to ensure clinical authenticity. We evaluate 18 state‑of‑the‑art MLLMs with Med‑CMR, revealing GPT‑5 as the top‑performing commercial model: 57.81 accuracy on multiple‑choice questions (MCQs) and a 48.70 open‑ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open‑ended score) and leading open‑source model Qwen3‑VL‑235B‑A22B (49.34 MCQ accuracy, 42.62 open‑ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long‑tail generalization emerges as the dominant failure mode. Med‑CMR thus provides a stress test for visual‑reasoning integration and rare‑case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.

PaperID: 7065, https://arxiv.org/pdf/2512.00797.pdf

Authors: Nan Sun, Bo Mao, Yongchang Li, Chenxu Wang, Di Guo, Huaping Liu

Title: Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration

Abstract:
Foundation models have become central to unifying perception and planning in robotics, yet real‑world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision‑language models offer strong semantic understanding but lack embodiment‑aware action capabilities while relying on hand‑crafted skills. Vision‑Language‑Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human‑populated settings. To address this gap, we present InteractGen, an LLM‑powered multi‑agent framework that decomposes robot intelligence into specialized agents for continuous perception, dependency‑aware planning, decision and verification, failure reflection, and dynamic human delegation, treating foundation models as regulated components within a closed‑loop collective. Deployed on a heterogeneous robot team and evaluated in a three‑month open‑use study, InteractGen improves task success, adaptability, and human‑robot collaboration, providing evidence that multi‑agent orchestration offers a more feasible path toward socially grounded service autonomy than further scaling standalone models.

PaperID: 7066, https://arxiv.org/pdf/2512.00769.pdf

Authors: Xingchen Zhou, Nan Li, Peng Jia, Yingfeng Liu, Furen Deng, Shuanghao Shu, Ying Li, Liang Cao, Huanyuan Shan, Ayodeji Ibitoye

Title: AI Agent for Source Finding by SoFiA-2 for SKA-SDC2

Abstract:
Source extraction is crucial in analyzing data from next‑generation, large‑scale sky surveys in radio bands, such as the Square Kilometre Array (SKA). Several source extraction programs, including SoFiA and Aegean, have been developed to address this challenge. However, finding optimal parameter configurations when applying these programs to real observations is non‑trivial. For example, the outcomes of SoFiA intensely depend on several key parameters across its preconditioning, source‑finding, and reliability‑filtering modules. To address this issue, we propose a framework to automatically optimize these parameters using an AI agent based on a state‑of‑the‑art reinforcement learning (RL) algorithm, i.e., Soft Actor‑Critic (SAC). The SKA Science Data Challenge 2 (SDC2) dataset is utilized to assess the feasibility and reliability of this framework. The AI agent interacts with the environment by adjusting parameters based on the feedback from the SDC2 score defined by the SDC2 Team, progressively learning to select parameter sets that yield improved performance. After sufficient training, the AI agent can automatically identify an optimal parameter configuration that outperform the benchmark set by Team SoFiA within only 100 evaluation steps and with reduced time consumption. Our approach could address similar problems requiring complex parameter tuning, beyond radio band surveys and source extraction. Yet, high‑quality training sets containing representative observations and catalogs of ground truth are essential.

PaperID: 7067, https://arxiv.org/pdf/2512.00742.pdf

Authors: K. J. Kevin Feng, Tae Soo Kim, Rock Yuren Pang, Faria Huq, Tal August, Amy X. Zhang

Title: On the Regulatory Potential of User Interfaces for AI Agent Governance

Abstract:
AI agents that take actions in their environment autonomously over extended time horizons require robust governance interventions to curb their potentially consequential risks. Prior proposals for governing AI agents primarily target system‑level safeguards (e.g., prompt injection monitors) or agent infrastructure (e.g., agent IDs). In this work, we explore a complementary approach: regulating user interfaces of AI agents as a way of enforcing transparency and behavioral requirements that then demand changes at the system and/or infrastructure levels. Specifically, we analyze 22 existing agentic systems to identify UI elements that play key roles in human‑agent interaction and communication. We then synthesize those elements into six high‑level interaction design patterns that hold regulatory potential (e.g., requiring agent memory to be editable). We conclude with policy recommendations based on our analysis. Our work exposes a new surface for regulatory action that supplements previous proposals for practical AI agent governance.

PaperID: 7068, https://arxiv.org/pdf/2512.00740.pdf

Authors: Qingwen Yang, Feiyu Qu, Tiezheng Guo, Yanyi Liu, Yingyou Wen

Title: Augmented Runtime Collaboration for Self-Organizing Multi-Agent Systems: A Hybrid Bi-Criteria Routing Approach

Abstract:
LLM‑based multi‑agent systems have demonstrated significant capabilities across diverse domains. However, the task performance and efficiency are fundamentally constrained by their collaboration strategies. Prevailing approaches rely on static topologies and centralized global planning, a paradigm that limits their scalability and adaptability in open, decentralized networks. Effective collaboration planning in distributed systems using only local information thus remains a formidable challenge. To address this, we propose BiRouter, a novel dual‑criteria routing method for Self‑Organizing Multi‑Agent Systems (SO‑MAS). This method enables each agent to autonomously execute ``next‑hop'' task routing at runtime, relying solely on local information. Its core decision‑making mechanism is predicated on balancing two metrics: (1) the ImpScore, which evaluates a candidate agent's long‑term importance to the overall goal, and (2) the GapScore, which assesses its contextual continuity for the current task state. Furthermore, we introduce a dynamically updated reputation mechanism to bolster system robustness in untrustworthy environments and have developed a large‑scale, cross‑domain dataset, comprising thousands of annotated task‑routing paths, to enhance the model's generalization. Extensive experiments demonstrate that BiRouter achieves superior performance and token efficiency over existing baselines, while maintaining strong robustness and effectiveness in information‑limited, decentralized, and untrustworthy settings.

PaperID: 7069, https://arxiv.org/pdf/2512.00617.pdf

Authors: Omer Jauhar Khan

Title: ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single‑model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament‑style ELO ranking and multi‑agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single‑model approaches. The ART framework provides a scalable, production‑ready solution for applications requiring high‑quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and R^2 values exceeding 0.96 in ELO rating convergence.

PaperID: 7070, https://arxiv.org/pdf/2512.00614.pdf

Authors: Goutham Nalagatla

Title: Hierarchical Decentralized Multi-Agent Coordination with Privacy-Preserving Knowledge Sharing: Extending AgentNet for Scalable Autonomous Systems

Abstract:
Decentralized multi‑agent systems have shown promise in enabling autonomous collaboration among LLM‑based agents. While AgentNet demonstrated the feasibility of fully decentralized coordination through dynamic DAG topologies, several limitations remain: scalability challenges with large agent populations, communication overhead, lack of privacy guarantees, and suboptimal resource allocation. We propose AgentNet++, a hierarchical decentralized framework that extends AgentNet with multilevel agent organization, privacy‑preserving knowledge sharing via differential privacy and secure aggregation, adaptive resource management, and theoretical convergence guarantees. Our approach introduces cluster‑based hierarchies where agents self‑organize into specialized groups, enabling efficient task routing and knowledge distillation while maintaining full decentralization. We provide formal analysis of convergence properties and privacy bounds, and demonstrate through extensive experiments on complex multi‑agent tasks that AgentNet++ achieves 23% higher task completion rates, 40% reduction in communication overhead, and maintains strong privacy guarantees compared to AgentNet and other baselines. Our framework scales effectively to 1000+ agents while preserving the emergent intelligence properties of the original AgentNet.

PaperID: 7071, https://arxiv.org/pdf/2512.00602.pdf

Authors: Wanle Zhong, Keman Huang, Xiaoyong Du

Title: AgentODRL: A Large Language Model-based Multi-agent System for ODRL Generation

Abstract:
The Open Digital Rights Language (ODRL) is a pivotal standard for automating data rights management. However, the inherent logical complexity of authorization policies, combined with the scarcity of high‑quality "Natural Language‑to‑ODRL" training datasets, impedes the ability of current methods to efficiently and accurately translate complex rules from natural language into the ODRL format. To address this challenge, this research leverages the potent comprehension and generation capabilities of Large Language Models (LLMs) to achieve both automation and high fidelity in this translation process. We introduce AgentODRL, a multi‑agent system based on an Orchestrator‑Workers architecture. The architecture consists of specialized Workers, including a Generator for ODRL policy creation, a Decomposer for breaking down complex use cases, and a Rewriter for simplifying nested logical relationships. The Orchestrator agent dynamically coordinates these Workers, assembling an optimal pathway based on the complexity of the input use case. Specifically, we enhance the ODRL Generator by incorporating a validator‑based syntax strategy and a semantic reflection mechanism powered by a LoRA‑finetuned model, significantly elevating the quality of the generated policies. Extensive experiments were conducted on a newly constructed dataset comprising 770 use cases of varying complexity, all situated within the context of data spaces. The results, evaluated using ODRL syntax and semantic scores, demonstrate that our proposed Orchestrator‑Workers system, enhanced with these strategies, achieves superior performance on the ODRL generation task.

PaperID: 7072, https://arxiv.org/pdf/2512.00582.pdf

Authors: Yue Jiang, Haiwei Xue, Minghao Han, Mingcheng Li, Xiaolu Hou, Dingkang Yang, Lihua Zhang, Xu Zheng

Title: SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension

Abstract:
Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision‑language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training‑free framework designed to enhance satirical image comprehension. Our approach proposes a multi‑agent system performing visual cascaded decoupling to decompose images into fine‑grained local and global semantic representations. In addition, we introduce a chain‑of‑thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision‑language reasoning in nuanced, high‑level semantic tasks.

PaperID: 7073, https://arxiv.org/pdf/2512.00520.pdf

Authors: Juan A. Wibowo, George C. Polyzos

Title: Toward a Safe Internet of Agents

Abstract:
Autonomous Artificial Intelligence (AI) agents, powered by Large Language Models (LLMs), advance rapidly toward interconnected systems ‑‑ an Internet of Agents (IoA). This vision enables complex problem‑solving while introducing systemic safety and security risks. Beyond existing threat taxonomies, we provide a principled guide addressing architectural vulnerability sources. We offer a framework for engineering safe agentic systems through bottom‑up deconstruction, analyzing each component as a dual‑use interface where capability expansion creates attack surface growth. We examine three tiers: (1) Single Agents ‑‑ analyzing inherent risks in models, memory, design patterns, tools, and guardrails; (2) Multi‑Agent Systems (MAS) ‑‑ examining collective behavior components including architectural patterns, communication mechanisms, verification, and system guardrails; and (3) Interoperable Multi‑Agent Systems (IMAS) ‑‑ exploring four secure ecosystem pillars: standardized protocols, agent registration/discovery, resource vetting, and governance. Our analysis reveals a central principle: agentic safety must be co‑designed with capability as a fundamental architectural property. We identify specific vulnerabilities at each level and derive core mitigation principles. The result is a foundational guide enabling developers and researchers to build not merely capable but safe, reliable agentic AI, contributing to secure IoA development.

PaperID: 7074, https://arxiv.org/pdf/2512.00513.pdf

Authors: Xun Shao, Ryuuto Shimizu, Zhi Liu, Kaoru Ota, Mianxiong Dong

Title: Truthful and Trustworthy IoT AI Agents via Immediate-Penalty Enforcement under Approximate VCG Mechanisms

Abstract:
The deployment of autonomous AI agents in Internet of Things (IoT) energy systems requires decision‑making mechanisms that remain robust, efficient, and trustworthy under real‑time constraints and imperfect monitoring. While reinforcement learning enables adaptive prosumer behaviors, ensuring economic consistency and preventing strategic manipulation remain open challenges, particularly when sensing noise or partial observability reduces the operator's ability to verify actions. This paper introduces a trust‑enforcement framework for IoT energy trading that combines an approximate Vickrey‑Clarke‑Groves (VCG) double auction with an immediate one‑shot penalty. Unlike reputation‑ or history‑based approaches, the proposed mechanism restores truthful reporting within a single round, even when allocation accuracy is approximate and monitoring is noisy. We theoretically characterize the incentive gap induced by approximation and derive a penalty threshold that guarantees truthful bidding under bounded sensing errors. To evaluate learning‑enabled prosumers, we embed the mechanism into a multi‑agent reinforcement learning environment reflecting stochastic generation, dynamic loads, and heterogeneous trading opportunities. Experiments show that improved allocation accuracy reduces deviation incentives, the required penalty matches analytical predictions, and learned bidding behaviors remain stable and interpretable despite imperfect monitoring. These results demonstrate that lightweight penalty designs can reliably align strategic IoT agents with socially efficient energy‑trading outcomes.

PaperID: 7075, https://arxiv.org/pdf/2512.00491.pdf

Authors: Yule Han, Kezhi Wang, Kun Yang

Title: Smart-TCP: An Agentic AI-based Autonomous and Adaptive TCP Protocol

Abstract:
The Transmission Control Protocol (TCP) relies on a state machine and deterministic arithmetic to ensure reliable connections. However, traditional protocol logic driven by hard‑coded state machines struggles to meet the demands of intelligent and autonomous network architectures. Here, we adopt the agentic AI‑based paradigm, driven by Large Language Models (LLMs), characterized by context perception, autonomous reasoning, and tool use. Based on this, we propose Smart‑TCP, which re‑imagines TCP's core control logic as an autonomous agent. Specifically, the proposed architecture employs a context aggregation mechanism to synthesize the protocol context, utilizes the LLM for autonomous logical reasoning, and invokes an Arithmetic Logic Unit (ALU) as a tool for computation. Furthermore, we establish a dual‑agent interaction framework based on this architecture and implement TCP protocol interactions. Experiments demonstrate that the Smart‑TCP agent excels in static prediction and error detection, achieving a 93.33% success rate in end‑to‑end sessions. These results strongly validate the technical feasibility of an agentic AI‑based TCP protocol.

PaperID: 7076, https://arxiv.org/pdf/2512.00463.pdf

Authors: Jidong Jin

Title: Necessary and Sufficient Criterion for Singular or Nonsingular of Diagonally Dominant Matrices

Abstract:
The problem of determining whether a diagonally dominant matrix is singular or nonsingular is a classical topic in matrix theory. This paper develops necessary and sufficient conditions for the singularity or nonsingularity of diagonally dominant matrices. Starting from Taussky's theorem, we establish a unified line of theory which reduces the general problem to the study of irreducible diagonally dominant matrices. A complete similarity and unitary similarity analysis is given for singular irreducible diagonally dominant matrices, leading to a necessary and sufficient condition expressed in terms of an angle equation system associated with the nonzero off‑diagonal entries. Applications and motivations from network dynamical systems, affine multi‑agent systems, and Kolmogorov differential equations are also discussed.

PaperID: 7077, https://arxiv.org/pdf/2512.00417.pdf

Authors: Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang

Title: CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Abstract:
This paper introduces CryptoBench, the first expert‑curated, dynamic benchmark designed to rigorously evaluate the real‑world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast‑paced cryptocurrency domain. Unlike general‑purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \emphextreme time‑sensitivity, \empha highly adversarial information environment, and the critical need to synthesize data from \emphdiverse, specialized sources, such as on‑chain intelligence platforms and real‑time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto‑native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four‑quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data‑gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval‑prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

PaperID: 7078, https://arxiv.org/pdf/2512.00412.pdf

Authors: Jiawei Chen, Yang Yang, Chao Yu, Yu Tian, Zhi Cao, Xue Yang, Linghao Li, Hang Su, Zhaoxia Yin

Title: Red Teaming Large Reasoning Models

Abstract:
Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi‑step reasoning tasks, offering enhanced transparency and logical consistency through explicit chains of thought (CoT). However, these models introduce novel safety and reliability risks, such as CoT‑hijacking and prompt‑induced inefficiencies, which are not fully captured by existing evaluation methods. To address this gap, we propose RT‑LRM, a unified benchmark designed to assess the trustworthiness of LRMs. RT‑LRM evaluates three core dimensions: truthfulness, safety and efficiency. Beyond metric‑based evaluation, we further introduce the training paradigm as a key analytical perspective to investigate the systematic impact of different training strategies on model trustworthiness. We achieve this by designing a curated suite of 30 reasoning tasks from an observational standpoint. We conduct extensive experiments on 26 models and identify several valuable insights into the trustworthiness of LRMs. For example, LRMs generally face trustworthiness challenges and tend to be more fragile than Large Language Models (LLMs) when encountering reasoning‑induced risks. These findings uncover previously underexplored vulnerabilities and highlight the need for more targeted evaluations. In addition, we release a scalable toolbox for standardized trustworthiness research to support future advancements in this important field. Our code and datasets will be open‑sourced.

PaperID: 7079, https://arxiv.org/pdf/2512.00410.pdf

Authors: Hongzong Li, Luwei Liao, Xiangguang Dai, Yuming Feng, Rong Feng, Shiqin Tang

Title: Balancing Efficiency and Fairness: An Iterative Exchange Framework for Multi-UAV Cooperative Path Planning

Abstract:
Multi‑UAV cooperative path planning (MUCPP) is a fundamental problem in multi‑agent systems, aiming to generate collision‑free trajectories for a team of unmanned aerial vehicles (UAVs) to complete distributed tasks efficiently. A key challenge lies in achieving both efficiency, by minimizing total mission cost, and fairness, by balancing the workload among UAVs to avoid overburdening individual agents. This paper presents a novel Iterative Exchange Framework for MUCPP, balancing efficiency and fairness through iterative task exchanges and path refinements. The proposed framework formulates a composite objective that combines the total mission distance and the makespan, and iteratively improves the solution via local exchanges under feasibility and safety constraints. For each UAV, collision‑free trajectories are generated using A search over a terrain‑aware configuration space. Comprehensive experiments on multiple terrain datasets demonstrate that the proposed method consistently achieves superior trade‑offs between total distance and makespan compared to existing baselines.

PaperID: 7080, https://arxiv.org/pdf/2512.00403.pdf

Authors: Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Xiaobing Yu, Yu Zhong, Shangqi Deng, Ufaq Khan, Jianghao Wu, Xiaofeng Liu, Imran Razzak, Xiaojun Chang, Yutong Xie

Title: SelfAI: A self-directed framework for long-horizon scientific discovery

Abstract:
Scientific discovery increasingly entails long‑horizon exploration of complex hypothesis spaces, yet most existing approaches emphasize final performance while offering limited insight into how scientific exploration unfolds over time, particularly balancing efficiency‑diversity trade‑offs and supporting reproducible, human‑in‑the‑loop discovery workflows. We introduce SelfAI, a self‑directed, multi‑agent‑enabled discovery system that automates scientific exploration as a strategic, trajectory‑driven decision‑making process. SelfAI translates high‑level research intent into executable experiments, reasons over accumulated experimental trajectories to guide subsequent exploration, and applies adaptive stopping decisions to terminate unproductive search paths within a closed‑loop workflow governed by explicit efficiency‑diversity trade‑offs. Evaluated using real‑world experiments spanning domains from machine learning to drug discovery, SelfAI consistently discovers high‑quality solutions with substantially fewer redundant trials than classical optimization and recent LLM‑based baselines. The proposed methods establish a general framework for organizing long‑horizon scientific discovery and adaptive decision‑making in complex scientific and engineering systems.

PaperID: 7081, https://arxiv.org/pdf/2512.00371.pdf

Authors: Swadesh Sistla, Max Kleiman-Weiner

Title: Evaluating LLMs in Open-Source Games

Abstract:
Large Language Models' (LLMs) programming capabilities enable their participation in open‑source games: a game‑theoretic setting in which players submit computer programs in lieu of actions. These programs offer numerous advantages, including interpretability, inter‑agent transparency, and formal verifiability; additionally, they enable program equilibria, solutions that leverage the transparency of code and are inaccessible within normal‑form settings. We evaluate the capabilities of leading open‑ and closed‑weight LLMs to predict and classify program strategies and evaluate features of the approximate program equilibria reached by LLM agents in dyadic and evolutionary settings. We identify the emergence of payoff‑maximizing, cooperative, and deceptive strategies, characterize the adaptation of mechanisms within these programs over repeated open‑source games, and analyze their comparative evolutionary fitness. We find that open‑source games serve as a viable environment to study and steer the emergence of cooperative strategy in multi‑agent dilemmas.

PaperID: 7082, https://arxiv.org/pdf/2512.00352.pdf

Authors: Na Li, Zewu Zheng, Wei Ni, Hangguan Shan, Wenjie Zhang, Xinyu Li

Title: Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim‑to‑real gap. We focus on robust two‑player zero‑sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model‑based algorithm (RTZ‑VI‑LCB) for offline RTZMGs, which is optimistic robust value iteration combined with a data‑driven Bernstein‑style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near‑optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information‑theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ‑VI‑LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally.

PaperID: 7083, https://arxiv.org/pdf/2512.00351.pdf

Authors: Na Li, Yuchen Jiao, Hangguan Shan, Shefeng Yan

Title: Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Abstract:
The thriving field of multi‑agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non‑Markov policy, non‑Nash policy, and high burn‑in cost. In this work, we take a step towards settling this problem by designing a model‑free self‑play algorithm \emphMemory‑Efficient Nash Q‑Learning (ME‑Nash‑QL) for two‑player zero‑sum Markov games, which is a specific setting of MARL. ME‑Nash‑QL is proven to enjoy the following merits. First, it can output an \varepsilon‑approximate Nash policy with space complexity O(SABH) and sample complexity \widetildeO(H^4SAB/\varepsilon^2), where S is the number of states, \A, B\ is the number of actions for two players, and H is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when \min\A, B\\ll H^2. Second, ME‑Nash‑QL achieves the lowest computational complexity O(T\mathrmpoly(AB)) while preserving Markov policies, where T is the number of samples. Third, ME‑Nash‑QL also achieves the best burn‑in cost O(SAB\,\mathrmpoly(H)), whereas previous algorithms have a burn‑in cost of at least O(S^3 AB\,\mathrmpoly(H)) to attain the same level of sample complexity with ours.

PaperID: 7084, https://arxiv.org/pdf/2512.00349.pdf

Authors: Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Josef Dai, Yaodong Yang, Jiaming Ji

Title: Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Abstract:
Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviours have spread from textual to multimodal settings, amplifying their potential harm. First and foremost, how can we monitor these covert multimodal deceptive behaviors? Nevertheless, current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks, introducing MM‑DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. Covering six categories of deception, MM‑DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. On the other hand, multimodal deception evaluation is almost a blind spot in existing methods. Its stealth, compounded by visual‑semantic ambiguity and the complexity of cross‑modal reasoning, renders action monitoring and chain‑of‑thought monitoring largely ineffective. To tackle this challenge, we propose debate with images, a novel multi‑agent debate monitor framework. By compelling models to ground their claims in visual evidence, this method substantially improves the detectability of deceptive strategies. Experiments show that it consistently increases agreement with human judgements across all tested models, boosting Cohen's kappa by 1.5x and accuracy by 1.25x on GPT‑4o.

PaperID: 7085, https://arxiv.org/pdf/2512.00331.pdf

Authors: Yefeng Wu, Yuchen Song, Yecheng Zhao, Ling Wu, Shan Wan

Title: CogEvo-Edu: Cognitive Evolution Educational Multi-Agent Collaborative System

Abstract:
Large language models (LLMs) are increasingly deployed as conversational tutors in STEM education, yet most systems still rely on a single LLM with a static retrieval‑augmented generation (RAG) pipeline over course materials. This design struggles in complex domains such as digital signal processing (DSP), where tutors must maintain coherent long‑term student models, manage heterogeneous knowledge bases, and adapt teaching strategies over extended interactions. We argue that retrieval, memory, and control should be treated as a coupled cognitive evolution process. We instantiate this view in CogEvo‑Edu, a hierarchical educational multi‑agent system comprising a Cognitive Perception Layer (CPL), a Knowledge Evolution Layer (KEL), and a Meta‑Control Layer (MCL). CPL maintains dual memories and performs confidence‑weighted consolidation to build structured, self‑correcting student profiles under limited context. KEL assigns each knowledge chunk a spatiotemporal value that drives activation, semantic compression, and forgetting. MCL formulates tutoring as hierarchical sequential decision making, orchestrating specialized agents and jointly adapting CPL/KEL hyperparameters via a dual inner‑‑outer loop. To evaluate CogEvo‑Edu, we construct DSP‑EduBench, a vertical benchmark for DSP tutoring with heterogeneous resources, simulated student profiles, and long‑horizon interaction scripts. Using a three‑model LLM‑as‑a‑Judge ensemble, CogEvo‑Edu raises the overall score from 5.32 to 9.23 and improves all six indicators over static RAG, simple memory, and a single‑agent variant, demonstrating the value of jointly evolving student profiles, knowledge bases, and teaching policies.

PaperID: 7086, https://arxiv.org/pdf/2512.00313.pdf

Authors: Zhitong Guan, Yi Wang

Title: Evolving Paradigms in Task-Based Search and Learning: A Comparative Analysis of Traditional Search Engine with LLM-Enhanced Conversational Search System

Abstract:
Large Language Models (LLMs) are rapidly reshaping information retrieval by enabling interactive, generative, and inference‑driven search. While traditional keyword‑based search remains central to web and academic information access, it often struggles to support multi‑step reasoning and exploratory learning tasks. LLM‑powered search interfaces, such as ChatGPT and Claude, introduce new capabilities that may influence how users formulate queries, navigate information, and construct knowledge. However, empirical understanding of these effects is still limited. This study compares search behavior and learning outcomes in two environments: a standard search engine and an LLM‑powered search system. We investigate (1) how search strategies, query formulation, and evaluation behaviors differ across systems, and (2) how LLM use affects comprehension, knowledge integration, and critical thinking during search‑based learning tasks. Findings offer insight into how generative AI shapes information‑seeking processes and contribute to ongoing discussions in information retrieval, human‑AI interaction, and technology‑supported learning.

PaperID: 7087, https://arxiv.org/pdf/2512.00267.pdf

Authors: Anab Maulana Barik, Shou Ziyi, Yang Kaiwen, Yang Qi, Shen Xin

Title: Trification: A Comprehensive Tree-based Strategy Planner and Structural Verification for Fact-Checking

Abstract:
Technological advancement allows information to be shared in just a single click, which has enabled the rapid spread of false information. This makes automated fact‑checking system necessary to ensure the safety and integrity of our online media ecosystem. Previous methods have demonstrated the effectiveness of decomposing the claim into simpler sub‑tasks and utilizing LLM‑based multi agent system to execute them. However, those models faces two limitations: they often fail to verify every component in the claim and lack of structured framework to logically connect the results of sub‑tasks for a final prediction. In this work, we propose a novel automated fact‑checking framework called Trification. Our framework begins by generating a comprehensive set of verification actions to ensure complete coverage of the claim. It then structured these actions into a dependency graph to model the logical interaction between actions. Furthermore, the graph can be dynamically modified, allowing the system to adapt its verification strategy. Experimental results on two challenging benchmarks demonstrate that our framework significantly enhances fact‑checking accuracy, thereby advancing current state‑of‑the‑art in automated fact‑checking system.

PaperID: 7088, https://arxiv.org/pdf/2512.00259.pdf

Authors: Diogo Ferreira, Pedro Ribeiro, André Coelho, Rui Campos

Title: Design and Evaluation of a Multi-Agent Perception System for Autonomous Flying Networks

Abstract:
Autonomous Flying Networks (FNs) are emerging as a key enabler of on‑demand connectivity in dynamic and infrastructure‑limited environments. However, current approaches mainly focus on UAV placement, routing, and resource management, neglecting the autonomous perception of users and their service demands ‑ a critical capability for zero‑touch network operation. This paper presents the Multi‑Agent Perception System (MAPS), a modular and scalable system that leverages multi‑modal large language models (MM‑LLMs) and agentic Artificial Intelligence (AI) to interpret visual and audio data collected by UAVs and generate Service Level Specifications (SLSs) describing user count, spatial distribution, and traffic demand. MAPS is evaluated using a synthetic multimodal emergency dataset, achieving user detection accuracies above 70% and SLS generation under 130 seconds in 90% of cases. Results demonstrate that combining audio and visual modalities enhances user detection and show that MAPS provides the perception layer required for autonomous, zero‑touch FNs.

PaperID: 7089, https://arxiv.org/pdf/2512.00243.pdf

Authors: Paulo Roberto de Melo Barros Junior, Monica Alexandra Vilar Ribeiro De Meireles, Jose Luis Lima de Jesus Silva

Title: Optimizing Information Asset Investment Strategies in the Exploratory Phase of the Oil and Gas Industry: A Reinforcement Learning Approach

Abstract:
Our work investigates the economic efficiency of the prevailing "ladder‑step" investment strategy in oil and gas exploration, which advocates for the incremental acquisition of geological information throughout the project lifecycle. By employing a multi‑agent Deep Reinforcement Learning (DRL) framework, we model an alternative strategy that prioritizes the early acquisition of high‑quality information assets. We simulate the entire upstream value chain‑comprising competitive bidding, exploration, and development phases‑to evaluate the economic impact of this approach relative to traditional methods. Our results demonstrate that front‑loading information investment significantly reduces the costs associated with redundant data acquisition and enhances the precision of reserve valuation. Specifically, we find that the alternative strategy outperforms traditional methods in highly competitive environments by mitigating the "winner's curse" through more accurate bidding. Furthermore, the economic benefits are most pronounced during the development phase, where superior data quality minimizes capital misallocation. These findings suggest that optimal investment timing is structurally dependent on market competition rather than solely on price volatility, offering a new paradigm for capital allocation in extractive industries.

PaperID: 7090, https://arxiv.org/pdf/2512.00194.pdf

Authors: Zag ElSayed, Grace Westerkamp, Gavin Gammoh, Yanchen Liu, Peyton Siekierski, Craig Erickson, Ernest Pedapati

Title: AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI

Abstract:
We introduce EEG Autoclean Vision Language AI (ICVision) a first‑of‑its‑kind system that emulates expert‑level EEG ICA component classification through AI‑agent vision and natural language reasoning. Unlike conventional classifiers such as ICLabel, which rely on handcrafted features, ICVision directly interprets ICA dashboard visualizations topography, time series, power spectra, and ERP plots, using a multimodal large language model (GPT‑4 Vision). This allows the AI to see and explain EEG components the way trained neurologists do, making it the first scientific implementation of AI‑agent visual cognition in neurophysiology. ICVision classifies each component into one of six canonical categories (brain, eye, heart, muscle, channel noise, and other noise), returning both a confidence score and a human‑like explanation. Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision achieved k = 0.677 agreement with expert consensus, surpassing MNE ICLabel, while also preserving clinically relevant brain signals in ambiguous cases. Over 97% of its outputs were rated as interpretable and actionable by expert reviewers. As a core module of the open‑source EEG Autoclean platform, ICVision signals a paradigm shift in scientific AI, where models do not just classify, but see, reason, and communicate. It opens the door to globally scalable, explainable, and reproducible EEG workflows, marking the emergence of AI agents capable of expert‑level visual decision‑making in brain science and beyond.

PaperID: 7091, https://arxiv.org/pdf/2512.00136.pdf

Authors: Tong Wu, Weibin Wu, Zibin Zheng

Title: An Empirical Study on the Security Vulnerabilities of GPTs

Abstract:
Equipped with various tools and knowledge, GPTs, one kind of customized AI agents based on OpenAI's large language models, have illustrated great potential in many fields, such as writing, research, and programming. Today, the number of GPTs has reached three millions, with the range of specific expert domains becoming increasingly diverse. However, given the consistent framework shared among these LLM agent applications, systemic security vulnerabilities may exist and remain underexplored. To fill this gap, we present an empirical study on the security vulnerabilities of GPTs. Building upon prior research on LLM security, we first adopt a platform‑user perspective to conduct a comprehensive attack surface analysis across different system components. Then, we design a systematic and multidimensional attack suite with the explicit objectives of information leakage and tool misuse based on the attack surface analysis, thereby concretely demonstrating the security vulnerabilities that various components of GPT‑based systems face. Finally, we accordingly propose defense mechanisms to address the aforementioned security vulnerabilities. By increasing the awareness of these vulnerabilities and offering critical insights into their implications, this study seeks to facilitate the secure and responsible application of GPTs while contributing to developing robust defense mechanisms that protect users and systems against malicious attacks.

PaperID: 7092, https://arxiv.org/pdf/2512.00062.pdf

Authors: Taewook Nam, Sung Ju Hwang

Title: SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

Abstract:
Recent advances in robotic policy learning have enabled complex manipulation in real‑world environments, yet the execution speed of these policies often lags behind hardware capabilities due to the cost of collecting faster demonstrations. Existing works on policy acceleration reinterpret action sequence for unseen execution speed, thereby encountering distributional shifts from the original demonstrations. Reinforcement learning is a promising approach that adapts policies for faster execution without additional demonstration, but its unguided exploration is sample inefficient. We propose SpeedAug, an RL‑based policy acceleration framework that efficiently adapts pre‑trained policies for faster task execution. SpeedAug constructs behavior prior that encompasses diverse tempos of task execution by pre‑training a policy on speed‑augmented demonstrations. Empirical results on robotic manipulation benchmarks show that RL fine‑tuning initialized from this tempo‑enriched policy significantly improves the sample efficiency of existing RL and policy acceleration methods while maintaining high success rate.

PaperID: 7093, https://arxiv.org/pdf/2512.00047.pdf

Authors: Angelina Parfenova, Alexander Denzler, Juergen Pfeffer

Title: Emergent Convergence in Multi-Agent LLM Annotation

Abstract:
Large language models (LLMs) are increasingly deployed in collaborative settings, yet little is known about how they coordinate when treated as black‑box agents. We simulate 7500 multi‑agent, multi‑round discussions in an inductive coding task, generating over 125000 utterances that capture both final annotations and their interactional histories. We introduce process‑level metrics: code stability, semantic self‑consistency, and lexical confidence alongside sentiment and convergence measures, to track coordination dynamics. To probe deeper alignment signals, we analyze the evolving geometry of output embeddings, showing that intrinsic dimensionality declines over rounds, suggesting semantic compression. The results reveal that LLM groups converge lexically and semantically, develop asymmetric influence patterns, and exhibit negotiation‑like behaviors despite the absence of explicit role prompting. This work demonstrates how black‑box interaction analysis can surface emergent coordination strategies, offering a scalable complement to internal probe‑based interpretability methods.

PaperID: 7094, https://arxiv.org/pdf/2512.00027.pdf

Authors: Nivedan Yakolli, Avinash Gautam, Abhijit Das, Yuankai Qi, Virendra Singh Shekhawat

Title: A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) is a multi‑modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi‑robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision‑making in the multi‑agent systems. We review approximately 200 relevant articles to provide an in‑depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that the future VLN systems should support proactive clarification, real‑time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision‑making frameworks with dynamic role assignment are essential for scalable, efficient multi‑robot collaboration. These innovations can significantly enhance human‑robot interaction (HRI) and enable real‑world deployment in domains such as healthcare, logistics, and disaster response.

PaperID: 7095, https://arxiv.org/pdf/2512.00007.pdf

Authors: Jingyi Huang, Yuyi Yang, Mengmeng Ji, Charles Alba, Sheng Zhang, Ruopeng An

Title: Use of Retrieval-Augmented Large Language Model Agent for Long-Form COVID-19 Fact-Checking

Abstract:
The COVID‑19 infodemic calls for scalable fact‑checking solutions that handle long‑form misinformation with accuracy and reliability. This study presents SAFE (system for accurate fact extraction and evaluation), an agent system that combines large language models with retrieval‑augmented generation (RAG) to improve automated fact‑checking of long‑form COVID‑19 misinformation. SAFE includes two agents ‑ one for claim extraction and another for claim verification using LOTR‑RAG, which leverages a 130,000‑document COVID‑19 research corpus. An enhanced variant, SAFE (LOTR‑RAG + SRAG), incorporates Self‑RAG to refine retrieval via query rewriting. We evaluated both systems on 50 fake news articles (2‑17 pages) containing 246 annotated claims (M = 4.922, SD = 3.186), labeled as true (14.1%), partly true (14.4%), false (27.0%), partly false (2.2%), and misleading (21.0%) by public health professionals. SAFE systems significantly outperformed baseline LLMs in all metrics (p < 0.001). For consistency (0‑1 scale), SAFE (LOTR‑RAG) scored 0.629, exceeding both SAFE (+SRAG) (0.577) and the baseline (0.279). In subjective evaluations (0‑4 Likert scale), SAFE (LOTR‑RAG) also achieved the highest average ratings in usefulness (3.640), clearness (3.800), and authenticity (3.526). Adding SRAG slightly reduced overall performance, except for a minor gain in clearness. SAFE demonstrates robust improvements in long‑form COVID‑19 fact‑checking by addressing LLM limitations in consistency and explainability. The core LOTR‑RAG design proved more effective than its SRAG‑augmented variant, offering a strong foundation for scalable misinformation mitigation.

PaperID: 7096, https://arxiv.org/pdf/2511.23476.pdf

Authors: Bao Shu, Yan Cai, Jianjian Sun, Chunrui Han, En Yu, Liang Zhao, Jingcheng Hu, Yinmin Zhang, Haoran Lv, Yuang Peng, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Xiangyu Yue

Title: Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

Abstract:
Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi‑turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reasoning. To address these issues, we explore world‑model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing the model to shape thinking directly through its doing, and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over‑relying on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.

PaperID: 7097, https://arxiv.org/pdf/2511.23387.pdf

Authors: Daniil Sukhorukov, Andrei Zakharov, Nikita Glazkov, Katsiaryna Yanchanka, Vladimir Kirilin, Maxim Dubovitsky, Roman Sultimov, Yuri Maksimov, Ilya Makarov

Title: Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting

Abstract:
We present the Hierarchical AI‑Meteorologist, an LLM‑agent system that generates explainable weather reports using a hierarchical forecast reasoning and weather keyword generation. Unlike standard approaches that treat forecasts as flat time series, our framework performs multi‑scale reasoning across hourly, 6‑hour, and daily aggregations to capture both short‑term dynamics and long‑term trends. Its core reasoning agent converts structured meteorological inputs into coherent narratives while simultaneously extracting a few keywords effectively summarizing the dominant meteorological events. These keywords serve as semantic anchors for validating consistency, temporal coherence and factual alignment of the generated reports. Using OpenWeather and Meteostat data, we demonstrate that hierarchical context and keyword‑based validation substantially improve interpretability and robustness of LLM‑generated weather narratives, offering a reproducible framework for semantic evaluation of automated meteorological reporting and advancing agent‑based scientific reasoning.

PaperID: 7098, https://arxiv.org/pdf/2511.23379.pdf

Authors: Yimeng Liu, Misha Sra

Title: TaskLens: Generating Task-Conditioned Scaffolded Interfaces for Learning Professional Creative Software

Abstract:
Professional creative software has steep learning curves for novices due to complex interfaces, limited guidance, and unfamiliar terminology. To support educators and tool creators in addressing learner challenges, we introduce TaskLens, an LLM‑based method that automatically generates task‑conditioned scaffolded UIs from natural language task descriptions. Our method uses LLMs to identify workflow stages and domain concepts, select task‑relevant tools, generate implementation code, and execute the code to produce scaffolded interfaces. The interfaces surface relevant tools, organize them by workflow stage, link them to domain concepts, and progressively disclose advanced features. We evaluate TaskLens by deploying two LLM‑generated scaffolded interfaces in Blender, a professional 3D modeling software. A user study with beginners (n=32) showed that our scaffolded interfaces significantly reduced perceived task load, improved task performance through embedded workflow guidance, and increased domain concept learning in Blender during task execution. A second study with experts (n=8) showed improved task efficiency and potential to create personalized UIs for productivity and creativity.

PaperID: 7099, https://arxiv.org/pdf/2511.23366.pdf

Authors: Toqeer Ali Syed, Salman Jan, Gohar Ali, Ali Akarma, Ahmad Ali, Qurat-ul-Ain Mastoi

Title: Agentic AI Framework for Smart Inventory Replenishment

Abstract:
In contemporary retail, the variety of products available (e.g. clothing, groceries, cosmetics, frozen goods) make it difficult to predict the demand, prevent stockouts, and find high‑potential products. We suggest an agentic AI model that will be used to monitor the inventory, initiate purchase attempts to the appropriate suppliers, and scan for trending or high‑margin products to incorporate. The system applies demand forecasting, supplier selection optimization, multi‑agent negotiation and continuous learning. We apply a prototype to a setting in the store of a middle scale mart, test its performance on three conventional and artificial data tables, and compare the results to the base heuristics. Our findings indicate that there is a decrease in stockouts, a reduction of inventory holding costs, and an improvement in product mix turnover. We address constraints, scalability as well as improvement prospect.

PaperID: 7100, https://arxiv.org/pdf/2511.23315.pdf

Authors: Azusa Yamaguchi

Title: Emergent Coordination and Phase Structure in Independent Multi-Agent Reinforcement Learning

Abstract:
A clearer understanding of when coordination emerges, fluctuates, or collapses in decentralized multi‑agent reinforcement learning (MARL) is increasingly sought in order to characterize the dynamics of multi‑agent learning systems. We revisit fully independent Q‑learning (IQL) as a minimal decentralized testbed and run large‑scale experiments across environment size L and agent density rho. We construct a phase map using two axes ‑ the cooperative success rate (CSR) and a stability index derived from TD‑error variance ‑ revealing three distinct regimes: a coordinated and stable phase, a fragile transition region, and a jammed or disordered phase. A sharp double Instability Ridge separates these regimes and corresponds to persistent kernel drift, the time‑varying shift of each agent's effective transition kernel induced by others' policy updates. Synchronization analysis further shows that temporal alignment is required for sustained cooperation, and that competition between drift and synchronization generates the fragile regime. Removing agent identifiers eliminates drift entirely and collapses the three‑phase structure, demonstrating that small inter‑agent asymmetries are a necessary driver of drift. Overall, the results show that decentralized MARL exhibits a coherent phase structure governed by the interaction between scale, density, and kernel drift, suggesting that emergent coordination behaves as a distribution‑interaction‑driven phase phenomenon.

PaperID: 7101, https://arxiv.org/pdf/2511.23271.pdf

Authors: Jiancheng Dong, Pengyue Jia, Jingyu Peng, Maolin Wang, Yuhao Wang, Lixin Su, Xin Sun, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Title: Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs

Abstract:
Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three‑stage training framework that learns a single prompt‑specific Behavior‑Equivalent token ([BE]). The framework first trains [BE] to encode the natural‑language content of the original system prompt via reconstruction, and then distills the prompt 's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.

PaperID: 7102, https://arxiv.org/pdf/2511.23193.pdf

Authors: Yuchen Shi, Huaxin Pei, Yi Zhang, Danya Yao

Title: Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging

Abstract:
Multi‑Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles' perceived data, can substantially compromise the performance of MARL‑based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault‑tolerant MARL method for cooperative on‑ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co‑trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault‑tolerant vehicle agent equipped with a self‑diagnosis capability, which leverages the inherent spatio‑temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near‑fault‑free levels of safety and efficiency under various observation fault patterns.

PaperID: 7103, https://arxiv.org/pdf/2511.23148.pdf

Authors: Mian Ibad Ali Shah, Marcos Eduardo Cruz Victorio, Maeve Duffy, Enda Barrett, Karl Mason

Title: Peer-to-Peer Energy Trading in Dairy Farms using Multi-Agent Reinforcement Learning

Abstract:
The integration of renewable energy resources in rural areas, such as dairy farming communities, enables decentralized energy management through Peer‑to‑Peer (P2P) energy trading. This research highlights the role of P2P trading in efficient energy distribution and its synergy with advanced optimization techniques. While traditional rule‑based methods perform well under stable conditions, they struggle in dynamic environments. To address this, Multi‑Agent Reinforcement Learning (MARL), specifically Proximal Policy Optimization (PPO) and Deep Q‑Networks (DQN), is combined with community/distributed P2P trading mechanisms. By incorporating auction‑based market clearing, a price advisor agent, and load and battery management, the approach achieves significant improvements. Results show that, compared to baseline models, DQN reduces electricity costs by 14.2% in Ireland and 5.16% in Finland, while increasing electricity revenue by 7.24% and 12.73%, respectively. PPO achieves the lowest peak hour demand, reducing it by 55.5% in Ireland, while DQN reduces peak hour demand by 50.0% in Ireland and 27.02% in Finland. These improvements are attributed to both MARL algorithms and P2P energy trading, which together results in electricity cost and peak hour demand reduction, and increase electricity selling revenue. This study highlights the complementary strengths of DQN, PPO, and P2P trading in achieving efficient, adaptable, and sustainable energy management in rural communities.

PaperID: 7104, https://arxiv.org/pdf/2511.22975.pdf

Authors: Jiachen Li, Shihao Li, Christopher Martin, Zijun Chen, Dongmei Chen, Wei Li

Title: An LLM-Assisted Multi-Agent Control Framework for Roll-to-Roll Manufacturing Systems

Abstract:
Roll‑to‑roll manufacturing requires precise tension and velocity control to ensure product quality, yet controller commissioning and adaptation remain time‑intensive processes dependent on expert knowledge. This paper presents an LLM‑assisted multi‑agent framework that automates control system design and adaptation for R2R systems while maintaining safety. The framework operates through five phases: system identification from operational data, automated controller selection and tuning, sim‑to‑real adaptation with safety verification, continuous monitoring with diagnostic capabilities, and periodic model refinement. Experimental validation on a R2R system demonstrates successful tension regulation and velocity tracking under significant model uncertainty, with the framework achieving performance convergence through iterative adaptation. The approach reduces manual tuning effort while providing transparent diagnostic information for maintenance planning, offering a practical pathway for integrating AI‑assisted automation in manufacturing control systems.

PaperID: 7105, https://arxiv.org/pdf/2511.22955.pdf

Authors: Shrihari Sridharan, Sourjya Roy, Anand Raghunathan, Kaushik Roy

Title: Experts are all you need: A Composable Framework for Large Language Model Inference

Abstract:
Large Language Models (LLMs) have achieved state‑of‑the‑art accuracies in a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model sizes which leads to additional computational burden. Mixture of Experts (MoEs) overcome this bottleneck by decoupling model capacity from computation by only activating a subset of parameters or "experts". However, these models require joint pretraining of these experts along with the router and do not model multi‑step reasoning. In contrast, multi‑agent frameworks improve reasoning by decomposing complex problems into modular subtasks. However, these frameworks rely on sequential "plan‑‑act‑‑observe" loops, which introduce significant latency. Our work, Comp‑LLM, addresses these challenges by introducing a composable inference framework that enables cross‑expert collaboration via an explicit sub‑query dependency graph. Comp‑LLM consists of three components: (1) A Sub‑query Generator that decomposes an input query, assigns each sub‑query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) A Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) A Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp‑LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering 1.67x‑‑3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp‑LLM provides 1.1x‑‑1.7x latency improvement compared to sequential sub‑query processing.

PaperID: 7106, https://arxiv.org/pdf/2511.22924.pdf

Authors: Kaixiang Wang, Zhaojiacheng Zhou, Bunyod Suvonov, Jiong Lou, Jie LI

Title: MAS-Shield: A Defense Framework for Secure and Efficient LLM MAS

Abstract:
Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) are susceptible to linguistic attacks that can trigger cascading failures across the network. Existing defenses face a fundamental dilemma: lightweight single‑auditor methods are prone to single points of failure, while robust committee‑based approaches incur prohibitive computational costs in multi‑turn interactions. To address this challenge, we propose MAS‑Shield, a secure and efficient defense framework designed with a coarse‑to‑fine filtering pipeline. Rather than applying uniform scrutiny, MAS‑Shield dynamically allocates defense resources through a three‑stage protocol: (1) Critical Agent Selection strategically targets high‑influence nodes to narrow the defense surface; (2) Light Auditing employs lightweight sentry models to rapidly filter the majority of benign cases; and (3) Global Consensus Auditing escalates only suspicious or ambiguous signals to a heavyweight committee for definitive arbitration. This hierarchical design effectively optimizes the security‑efficiency trade‑off. Experiments demonstrate that MAS‑Shield achieves a 92.5% recovery rate against diverse adversarial scenarios and reduces defense latency by over 70% compared to existing methods.

PaperID: 7107, https://arxiv.org/pdf/2511.22884.pdf

Authors: Zhenghao Zhu, Yuanfeng Song, Xin Chen, Chengzhong Liu, Yakun Cui, Caleb Chen Cao, Sirui Han, Yike Guo

Title: InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Abstract:
Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent of large language models (LLMs) and multi‑agent systems, more and more researchers are making use of these technologies for insight discovery. However, there are few benchmarks for evaluating insight discovery capabilities. As one of the most comprehensive existing frameworks, InsightBench also suffers from many critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues may significantly affect the quality of data and the evaluation of agents. To address these issues, we thoroughly investigate shortcomings in InsightBench and propose essential criteria for a high‑quality insight benchmark. Regarding this, we develop a data‑curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research in this promising direction.

PaperID: 7108, https://arxiv.org/pdf/2511.22850.pdf

Authors: Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang, Lihua Zhang

Title: Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

Abstract:
Document understanding is a long standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the models judgment. While retrieval augmented generation mitigates this issue by filtering for question relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse to fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence dense multimodal context to generate the final prediction. SLEUTH is model agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long document benchmarks, achieving state of the art results. Ablation studies verify each modules effectiveness and confirm the benefits of our hierarchical refinement paradigm.

PaperID: 7109, https://arxiv.org/pdf/2511.22832.pdf

Authors: Rohan Bopardikar, Jin Wang, Jia Zou

Title: Structured Multi-Step Reasoning for Entity Matching Using Large Language Model

Abstract:
Entity matching is a fundamental task in data cleaning and data integration. With the rapid adoption of large language models (LLMs), recent studies have explored zero‑shot and few‑shot prompting to improve entity matching accuracy. However, most existing approaches rely on single‑step prompting and offer limited investigation into structured reasoning strategies. In this work, we investigate how to enhance LLM‑based entity matching by decomposing the matching process into multiple explicit reasoning stages. We propose a three‑step framework that first identifies matched and unmatched tokens between two records, then determines the attributes most influential to the matching decision, and finally predicts whether the records refer to the same real‑world entity. In addition, we explore a debate‑based strategy that contrasts supporting and opposing arguments to improve decision robustness. We evaluate our approaches against multiple existing baselines on several real‑world entity matching benchmark datasets. Experimental results demonstrate that structured multi‑step reasoning can improve matching performance in several cases, while also highlighting remaining challenges and opportunities for further refinement of reasoning‑guided LLM approaches.

PaperID: 7110, https://arxiv.org/pdf/2511.22767.pdf

Authors: Toqeer Ali Syed, Sohail Khan, Salman Jan, Gohar Ali, Muhammad Nauman, Ali Akarma, Ahmad Ali

Title: Agentic AI Framework for Cloudburst Prediction and Coordinated Response

Abstract:
The challenge is growing towards extreme and short‑duration rainfall events like a cloudburst that are peculiar to the traditional forecasting systems, in which the predictions and the response are taken as two distinct processes. The paper outlines an agentic artificial intelligence system to study atmospheric water‑cycle intelligence, which combines sensing, forecasting, downscaling, hydrological modeling and coordinated response into a single, interconnected, priceless, closed‑loop system. The framework uses autonomous but cooperative agents that reason, sense, and act throughout the entire event lifecycle, and use the intelligence of weather prediction to become real‑time decision intelligence. Comparison of multi‑year radar, satellite, and ground‑based evaluation of the northern part of Pakistan demonstrates that the multi‑agent configuration enhances forecast reliability, critical success index and warning lead time compared to the baseline models. Population reach was maximised, and errors during evacuation were minimised through communication and routing agents, and adaptive recalibration and transparent auditability were provided by the embedded layer of learning. Collectively, this leads to the conclusion that collaborative AI agents are capable of transforming atmospheric data streams into practicable foresight and provide a platform of scalable adaptive and learning‑based climate resilience.

PaperID: 7111, https://arxiv.org/pdf/2511.22737.pdf

Authors: Salman Jan, Toqeer Ali Syed, Gohar Ali, Ali Akarma, Mohammad Riyaz Belgaum, Ahmad Ali

Title: Agentic AI Framework for Individuals with Disabilities and Neurodivergence: A Multi-Agent System for Healthy Eating, Daily Routines, and Inclusive Well-Being

Abstract:
The paper presents a detailed Agentic Artificial Intelligence (AI) model that would enable people with disabilities and neurodivergence to lead healthier lives and have more regular days. The system will use a multi‑layer structure; it will include an Application and Interface Layer, an Agents Layer, and a Data Source Layer to provide adaptive, transparent, and inclusive support. Fundamentally, a hybrid reasoning engine will synchronize four special‑purpose agents, which include: a personalized‑nutrition‑based, called a Meal Planner Agent; an adaptive‑scheduling‑based, called a Reminder Agent; interactive assistance during grocery shopping and cooking, called a Food Guidance Agent; and a continuous‑intake‑and‑physiological‑tracking, called a Monitoring Agent. All the agents interact through a central communicative system called the Blackboard/Event Bus, which allows autonomous interaction and real‑time feedback loops with multimedia user interfaces. Privacy‑sensitive data sources, including electronic health records (EHRs), nutritional databases, wearable sensors, and smart kitchen Internet of Things, are also included in the framework and placed into a policy‑controlled layer, which ensures data safety and compliance with consent. Collaborative care and clinician dashboards allow common supervision, and discussable artificial intelligence (XAI) modules give brief explanations of why a decision was made, making users responsible and reliant. The proposed agentic AI framework is an extension beyond traditional assistive systems since it incorporates inclusiveness, personalization, and accessibility at all levels. It displays the intersection of multi‑agent reasoning, multi‑modal interfaces, and human‑centered design that will enable the development of autonomy, health, and digital equity among people with disabilities and neurodivergence.

PaperID: 7112, https://arxiv.org/pdf/2511.22708.pdf

Authors: Mikhail Sergeev, Georgii Paradezhenko, Daniil Rabinovich, Vladimir V. Palyulin

Title: Distributed quantum architecture search using multi-agent reinforcement learning

Abstract:
Quantum architecture search (QAS) automates the design of parameterized quantum circuits for variational quantum algorithms. The framework finds a well‑suited problem‑specific structure of a variational ansatz. Among possible implementations of QAS the reinforcement learning (RL) stands out as one of the most promising. Current RL approaches are single‑agent‑based and show poor scalability with a number of qubits due to the increase of the action space dimension and the computational cost. We propose a novel multi‑agent RL algorithm for QAS with each agent acting separately on its own block of a quantum circuit. This procedure allows to significantly accelerate the convergence of the RL‑based QAS and reduce its computational cost. We benchmark the proposed algorithm on MaxCut problem on 3‑regular graphs and on ground energy estimation for the Schwinger Hamiltonian. In addition, the proposed multi‑agent approach naturally fits into the set‑up of distributed quantum computing, favoring its implementation on modern intermediate scale quantum devices.

PaperID: 7113, https://arxiv.org/pdf/2511.22651.pdf

Authors: Anthony Carreon, Vansh Sharma, Venkat Raman

Title: Automated Design Optimization via Strategic Search with Large Language Models

Abstract:
Traditional optimization methods excel in well‑defined search spaces but struggle with design problems where transformations and design parameters are difficult to define. Large language models (LLMs) offer a promising alternative by dynamically interpreting design spaces and leveraging encoded domain knowledge. To this end, we introduce AUTO, an LLM agent framework that treats design optimization as a gradient‑free search problem guided by strategic LLM reasoning. The framework employs two collaborative agents: a Strategist that selects between exploration and exploitation strategies, and an Implementor that executes detailed designs. Applied to GPU code optimization ‑‑ a domain critical to fields from machine learning to scientific computing ‑‑ AUTO generates solutions competitive with expert implementations for chemical kinetics integration and dense matrix multiplication. The framework achieves 50‑70% search efficiency relative to Bayesian optimization methodologies. It completes optimizations in approximately 8 hours at an estimated cost of up to \159 per run, compared to an estimated cost of up to \480 with median‑wage software developers. These findings open the door to automating design optimization in ill‑defined search spaces with limited prior information.

PaperID: 7114, https://arxiv.org/pdf/2511.22581.pdf

Authors: Johannes Forkel, Constantin Ruhdorfer, Michael Beukman, Andreas Bulling, Jakob Foerster

Title: High entropy leads to symmetry equivariant policies in Dec-POMDPs

Abstract:
We prove that in any Dec‑POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec‑POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross‑play returns are equal to their self‑play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi‑agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross‑play returns between independently trained policies, and that the decrease in self‑play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter‑seed cross‑play this way. While we give examples of Dec‑POMDPs in which one cannot learn the optimal symmetry equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec‑POMDPs than is typically done.

PaperID: 7115, https://arxiv.org/pdf/2511.22572.pdf

Authors: Damian Kurpiewski, Jędrzej Michalczyk, Wojciech Jamroga, Jerzy Julian Michalski, Teofil Sidoruk

Title: Formal Verification of Probabilistic Multi-Agent Systems for Ballistic Rocket Flight Using Probabilistic Alternating-Time Temporal Logic

Abstract:
This technical report presents a comprehensive formal verification approach for probabilistic agent systems modeling ballistic rocket flight trajectories using Probabilistic Alternating‑Time Temporal Logic (PATL). We describe an innovative verification framework specifically designed for analyzing critical safety properties of ballistic rockets engineered to achieve microgravity conditions for scientific experimentation. Our model integrates authentic flight telemetry data encompassing velocity vectors, pitch angles, attitude parameters, and GPS coordinates to construct probabilistic state transition systems that rigorously account for environmental stochasticity, particularly meteorological variability. We formalize mission‑critical safety properties through PATL specifications to systematically identify trajectory deviation states where the rocket risks landing in prohibited or hazardous zones. The verification framework facilitates real‑time safety monitoring and enables automated intervention mechanisms, including emergency engine disengagement protocols, when predefined safety thresholds are exceeded. Experimental validation demonstrates the practical effectiveness and reliability of our approach in ensuring mission safety while maintaining scientific mission objectives.

PaperID: 7116, https://arxiv.org/pdf/2511.22536.pdf

Authors: Fengming Zhu, Yuxin Pan, Xiaomeng Zhu, Fangzhen Lin

Title: A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind

Abstract:
Originating in psychology, Theory of Mind (ToM) has attracted significant attention across multiple research communities, especially logic, economics, and robotics. Most psychological work does not aim at formalizing those central concepts, namely goals, intentions, and beliefs, to automate a ToM‑based computational process, which, by contrast, has been extensively studied by logicians. In this paper, we offer a different perspective by proposing a computational framework viewed through the lens of game theory. On the one hand, the framework prescribes how to make boudedly rational decisions while maintaining a theory of mind about others (and recursively, each of the others holding a theory of mind about the rest); on the other hand, it employs statistical techniques and approximate solutions to retain computability of the inherent computational problem.

PaperID: 7117, https://arxiv.org/pdf/2511.22441.pdf

Authors: Xinyu Zhang, Yixin Wu, Boyang Zhang, Chenhao Lin, Chao Shen, Michael Backes, Yang Zhang

Title: GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents

Abstract:
Images shared on social media often expose geographic cues. While early geolocation methods required expert effort and lacked generalization, the rise of Large Vision Language Models (LVLMs) now enables accurate geolocation even for ordinary users. However, existing approaches are not optimized for this task. To explore the full potential and associated privacy risks, we present Geo‑Detective, an agent that mimics human reasoning and tool use for image geolocation inference. It follows a procedure with four steps that adaptively selects strategies based on image difficulty and is equipped with specialized tools such as visual reverse search, which emulates how humans gather external geographic clues. Experimental results show that GEO‑Detective outperforms baseline large vision language models (LVLMs) overall, particularly on images lacking visible geographic features. In country level geolocation tasks, it achieves an improvement of over 11.1% compared to baseline LLMs, and even at finer grained levels, it still provides around a 5.2% performance gain. Meanwhile, when equipped with external clues, GEO‑Detective becomes more likely to produce accurate predictions, reducing the "unknown" prediction rate by more than 50.6%. We further explore multiple defense strategies and find that Geo‑Detective exhibits stronger robustness, highlighting the need for more effective privacy safeguards.

PaperID: 7118, https://arxiv.org/pdf/2511.22420.pdf

Authors: Sebe Vanbrabant, Gustavo Rovelo Ruiz, Davy Vanacken

Title: MATCH: Engineering Transparent and Controllable Conversational XAI Systems through Composable Building Blocks

Abstract:
While the increased integration of AI technologies into interactive systems enables them to solve an increasing number of tasks, the black‑box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post‑hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. This challenge not only pertains to standard XAI techniques but also to human examination and conversational XAI approaches that need access to model internals to interpret them correctly and completely. To this end, we propose conceptually representing such interactive systems as sequences of structural building blocks. These include the AI models themselves, as well as control mechanisms grounded in literature. The structural building blocks can then be explained through complementary explanatory building blocks, such as established XAI techniques like LIME and SHAP. The flow and APIs of the structural building blocks form an unambiguous overview of the underlying system, serving as a communication basis for both human and automated agents, thus aligning human and machine interpretability of the embedded AI models. In this paper, we present our flow‑based approach and a selection of building blocks as MATCH: a framework for engineering Multi‑Agent Transparent and Controllable Human‑centered systems. This research contributes to the field of (conversational) XAI by facilitating the integration of interpretability into existing interactive systems.

PaperID: 7119, https://arxiv.org/pdf/2511.22409.pdf

Authors: Polydoros Giannouris, Sophia Ananiadou

Title: NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements

Abstract:
Large Language Models (LLMs) are increasingly utilised in software engineering, yet their ability to generate structured artefacts such as UML diagrams remains underexplored. In this work we present NOMAD, a cognitively inspired, modular multi‑agent framework that decomposes UML generation into a series of role‑specialised subtasks. Each agent handles a distinct modelling activity, such as entity extraction, relationship classification, and diagram synthesis, mirroring the goal‑directed reasoning processes of an engineer. This decomposition improves interpretability and allows for targeted verification strategies. We evaluate NOMAD through a mixed design: a large case study (Northwind) for in‑depth probing and error analysis, and human‑authored UML exercises for breadth and realism. NOMAD outperforms all selected baselines, while revealing persistent challenges in fine‑grained attribute extraction. Building on these observations, we introduce the first systematic taxonomy of errors in LLM‑generated UML diagrams, categorising structural, relationship, and semantic/logical. Finally, we examine verification as a design probe, showing its mixed effects and outlining adaptive strategies as promising directions. Together, these contributions position NOMAD as both an effective framework for UML class diagram generation and a lens onto the broader research challenges of reliable language‑to‑model workflows.

PaperID: 7120, https://arxiv.org/pdf/2511.22391.pdf

Authors: Yuanzhe Yang

Title: Impure Simplicial Complex and Term-Modal Logic with Assignment Operators

Abstract:
Impure simplicial complexes are a powerful tool to model multi‑agent epistemic situations where agents may die, but it is difficult to define a satisfactory semantics for the ordinary propositional modal language on such models, since many conceptually dubious expressions involving dead agents can be expressed in this language. In this paper, we introduce a term‑modal language with assignment operators, in which such conceptually dubious expressions are syntactically excluded. We define both simplicial semantics and first‑order Kripke semantics for this language, characterize their respective expressivity through notions of bisimulation, and show that the two semantics are equivalent when we consider a special class of first order Kripke models called local epistemic models. We also offer a complete axiomatization for the epistemic logic based on this language, and show that our language has a notion of assignment normal form. Finally, we discuss the behavior of a kind of intensional distributed knowledge that can be naturally expressed in our language.

PaperID: 7121, https://arxiv.org/pdf/2511.22390.pdf

Authors: Hans van Ditmarsch, Tim French, Rustam Galimullin, Louwe B. Kuijer

Title: Modal Logic for Simulation, Refinement, and Mutual Ignorance

Abstract:
Simulation and refinement are variations of the bisimulation relation, where in the former we keep only atoms and forth, and in the latter only atoms and back. Quantifying over simulations and refinements captures the effects of information change in a multi‑agent system. In the case of quantification over refinements, we are looking at all the ways the agents in a system can become more informed. Similarly, in the case of quantification over simulations, we are dealing with all the ways the agents can become less informed, or in other words, could have been less informed, as we are at liberty how to interpret time in dynamic epistemic logic. While quantification over refinements has been well explored in the literature, quantification over simulations has received considerably less attention. In this paper, we explore the relationship between refinements and simulations. To this end, we also employ the notion of mutual factual ignorance that allows us to capture the state of a model before agents have learnt any factual information. In particular, we consider the extensions of multi‑modal logic with the simulation and refinement modalities, as well as modalities for mutual factual ignorance. We provide reduction‑based axiomatizations for several of the resulting logics that are built extending one another in a modular fashion.

PaperID: 7122, https://arxiv.org/pdf/2511.22311.pdf

Authors: Fiona Y. Wang, Di Sheng Lee, David L. Kaplan, Markus J. Buehler

Title: Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation

Abstract:
Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state‑of‑the‑art generative methods, such as protein language models (PLMs) and diffusion‑based architectures, often require extensive fine‑tuning, task‑specific data, or model reconfiguration to support objective‑directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent‑based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context‑aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position‑wise, decentralized coordination enables emergent design of diverse, well‑defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure‑based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective‑directed designs within a few GPU‑hours and operates entirely without fine‑tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM‑driven design across biomolecular systems and other scientific discovery tasks.

PaperID: 7123, https://arxiv.org/pdf/2511.22307.pdf

Authors: Inhyo Lee, Junhyeong Lee, Jongwon Park, KyungTae Lim, Seunghwa Ryu

Title: Enhanced Conditional Generation of Double Perovskite by Knowledge-Guided Language Model Feedback

Abstract:
Double perovskites (DPs) are promising candidates for sustainable energy technologies due to their compositional tunability and compatibility with low‑energy fabrication, yet their vast design space poses a major challenge for conditional materials discovery. This work introduces a multi‑agent, text gradient‑driven framework that performs DP composition generation under natural‑language conditions by integrating three complementary feedback sources: LLM‑based self‑evaluation, DP‑specific domain knowledge‑informed feedback, and ML surrogate‑based feedback. Analogous to how knowledge‑informed machine learning improves the reliability of conventional data‑driven models, our framework incorporates domain‑informed text gradients to guide the generative process toward physically meaningful regions of the DP composition space. Systematic comparison of three incremental configurations, (i) pure LLM generation, (ii) LLM generation with LLM reasoning‑based feedback, and (iii) LLM generation with domain knowledge‑guided feedback, shows that iterative guidance from knowledge‑informed gradients improves stability‑condition satisfaction without additional training data, achieving over 98% compositional validity and up to 54% stable or metastable candidates, surpassing both the LLM‑only baseline (43%) and prior GAN‑based results (27%). Analyses of ML‑based gradients further reveal that they enhance performance in in‑distribution (ID) regions but become unreliable in out‑of‑distribution (OOD) regimes. Overall, this work provides the first systematic analysis of multi‑agent, knowledge‑guided text gradients for DP discovery and establishes a generalizable blueprint for MAS‑driven generative materials design aimed at advancing sustainable technologies.

PaperID: 7124, https://arxiv.org/pdf/2511.22226.pdf

Authors: Alexander Meulemans, Rajai Nasser, Maciej Wołczyk, Marissa A. Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A. Saurous, João Sacramento, Blaise Agüera y Arcas

Title: Embedded Universal Predictive Intelligence: a coherent framework for multi-agent learning

Abstract:
The standard theory of model‑free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi‑agent setting where the non‑stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self‑prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi‑agent settings, self‑prediction enables agents to reason about others running similar algorithms, leading to new game‑theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite‑order theory of mind, potentially setting a gold standard for embedded multi‑agent learning.

PaperID: 7125, https://arxiv.org/pdf/2511.22217.pdf

Authors: Yuxuan Chen, Rongpeng Li, Xianfu Chen, Celimuge Wu, Chenghui Peng, Zhifeng Zhao, Honggang Zhang

Title: Optimizing NetGPT via Routing-Based Synergy and Reinforcement Learning

Abstract:
Large language model (LLM) agents at the network edge offer low‑latency execution for routine queries. In contrast, complex requests often require the superior capability of cloud models, incurring higher latency and cost. To navigate this quality‑cost trade‑off under dynamic network conditions, we propose a cloud‑edge synergy for NetGPT that integrates network‑aware routing with on‑edge self‑improvement. Specifically, our framework routes structured tool‑calling requests to cloud or edge agents via a novel scoring policy. We prove that, under mild regularity assumptions, the optimal routing rule admits a unique fallback threshold with monotone dependence on bandwidth and round‑trip time (RTT). Concurrently, based on the dataset collected from requests routed to the cloud and corresponding responses, we instantiate a schema‑preserving reinforcement learning (RL) to improve the capability of the edge agent. We analyze a supervised finetuning (SFT)‑anchored composite objective that combines a reverse‑KL trust‑region step with a forward‑KL realignment toward the SFT prior, explaining stability and constraining policy drift. Both the network‑aware routing policy and the edge agent are updated coherently. Experiments across controlled network states and pricing schedules demonstrate smooth quality‑cost frontiers, consistent gains of dynamic fallback thresholds over fixed policies, and sustained reductions in offloading while maintaining task success and schema‑correct outputs.

PaperID: 7126, https://arxiv.org/pdf/2511.22105.pdf

Authors: Saad Masrur, Ismail Guvenc, David Lopez Perez

Title: Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning

Abstract:
Dynamic sleep mode optimization (SMO) in millimeter‑wave (mmWave) networks is essential for maximizing energy efficiency (EE) under stringent quality‑of‑service (QoS) constraints. However, existing optimization and reinforcement learning (RL) approaches rely on aggregated, static base station (BS) traffic models that fail to capture non‑stationary traffic dynamics and suffer from large state‑action spaces, limiting real‑world deployment. To address these challenges, this paper proposes a multi‑agent deep reinforcement learning (MARL) framework using a Double Deep Q‑Network (DDQN), referred to as MARL‑DDQN, for adaptive SMO in a 3D urban environment with a time‑varying and community‑based user equipment (UE) mobility model. Unlike conventional single‑agent RL, MARL‑DDQN enables scalable, distributed decision‑making with minimal signaling overhead. A realistic BS power consumption model and beamforming are integrated to accurately quantify EE, while QoS is defined in terms of throughput. The method adapts SMO policies to maximize EE while mitigating inter‑cell interference and ensuring throughput fairness. Simulations show that MARL‑DDQN outperforms state‑of‑the‑art strategies, including All On, iterative QoS‑aware load‑based (IT‑QoS‑LB), MARL‑DDPG, and MARL‑PPO, achieving up to 0.60 Mbit/Joule EE, 8.5 Mbps 10th‑percentile throughput, and meeting QoS constraints 95% of the time under dynamic scenarios.

PaperID: 7127, https://arxiv.org/pdf/2511.22037.pdf

Authors: Zhifeng Wu, Yuelin Han, Shaolei Ren

Title: What AI Speaks for Your Community: Polling AI Agents for Public Opinion on Data Center Projects

Abstract:
The intense computational demands of AI, especially large foundation models, are driving a global boom in data centers. These facilities bring both tangible benefits and potential environmental burdens to local communities. However, the planning processes for data centers often fail to proactively integrate local public opinion in advance, largely because traditional polling is expensive or is conducted too late to influence decisions. To address this gap, we introduce an AI agent polling framework, leveraging large language models to assess community opinion on data centers and guide responsible development of AI. Our experiments reveal water consumption and utility costs as primary concerns, while tax revenue is a key perceived benefit. Furthermore, our cross‑model and cross‑regional analyses show opinions vary significantly by LLM and regional context. Finally, agent responses show strong topical alignment with real‑world survey data. Our framework can serve as a scalable screening tool, enabling developers to integrate community sentiment into early‑stage planning for a more informed and socially responsible AI infrastructure deployment.

PaperID: 7128, https://arxiv.org/pdf/2511.21935.pdf

Authors: Natalie Collina, Eshwar Ram Arunachaleswaran, Meena Jagadeesan

Title: Breaking Algorithmic Collusion in Human-AI Ecosystems

Abstract:
AI agents are increasingly deployed in ecosystems where they repeatedly interact not only with each other but also with humans. In this work, we study these human‑AI ecosystems from a theoretical perspective, focusing on the classical framework of repeated pricing games. In our stylized model, the AI agents play equilibrium strategies, and one or more humans manually perform the pricing task instead of adopting an AI agent, thereby defecting to a no‑regret strategy. Motivated by how populations of AI agents can sustain supracompetitive prices, we investigate whether high prices persist under such defections. Our main finding is that even a single human defection can destabilize collusion and drive down prices, and multiple defections push prices even closer to competitive levels. We further show how the nature of collusion changes under defection‑aware AI agents. Taken together, our results characterize when algorithmic collusion is fragile‑‑and when it persists‑‑in mixed ecosystems of AI agents and humans.

PaperID: 7129, https://arxiv.org/pdf/2511.21934.pdf

Authors: Tao Zhe, Huazhen Fang, Kunpeng Liu, Qian Lou, Tamzidul Hoque, Dongjie Wang

Title: Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation

Abstract:
Feature transformation enhances downstream task performance by generating informative features through mathematical feature crossing. Despite the advancements in deep learning, feature transformation remains essential for structured data, where deep models often struggle to capture complex feature interactions. Prior literature on automated feature transformation has achieved success but often relies on heuristics or exhaustive searches, leading to inefficient and time‑consuming processes. Recent works employ reinforcement learning (RL) to enhance traditional approaches through a more effective trial‑and‑error way. However, two limitations remain: 1) Dynamic feature expansion during the transformation process, which causes instability and increases the learning complexity for RL agents; 2) Insufficient cooperation and communication between agents, which results in suboptimal feature crossing operations and degraded model performance. To address them, we propose a novel heterogeneous multi‑agent RL framework to enable cooperative and scalable feature transformation. The framework comprises three heterogeneous agents, grouped into two types, each designed to select essential features and operations for feature crossing. To enhance communication among these agents, we implement a shared critic mechanism that facilitates information exchange during feature transformation. To handle the dynamically expanding feature space, we tailor multi‑head attention‑based feature agents to select suitable features for feature crossing. Additionally, we introduce a state encoding technique during the optimization process to stabilize and enhance the learning dynamics of the RL agents, resulting in more robust and reliable transformation policies. Finally, we conduct extensive experiments to validate the effectiveness, efficiency, robustness, and interpretability of our model.

PaperID: 7130, https://arxiv.org/pdf/2511.21886.pdf

Authors: Jingtian Yan, Shuai Zhou, Stephen F. Smith, Jiaoyang Li

Title: Bridging Planning and Execution: Multi-Agent Path Finding Under Real-World Deadlines

Abstract:
The Multi‑Agent Path Finding (MAPF) problem aims to find collision‑free paths for multiple agents while optimizing objectives such as the sum of costs or makespan. MAPF has wide applications in domains like automated warehouses, manufacturing systems, and airport logistics. However, most MAPF formulations assume a simplified robot model for planning, which overlooks execution‑time factors such as kinodynamic constraints, communication latency, and controller variability. This gap between planning and execution is problematic for time‑sensitive applications. To bridge this gap, we propose REMAP, an execution‑informed MAPF planning framework that can be combined with leading search‑based MAPF planners with minor changes. Our framework integrates the proposed ExecTimeNet to accurately estimate execution time based on planned paths. We demonstrate our method for solving MAPF with Real‑world Deadlines (MAPF‑RD) problem, where agents must reach their goals before a predefined wall‑clock time. We integrate our framework with two popular MAPF methods, MAPF‑LNS and CBS. Experiments show that REMAP achieves up to 20% improvement in solution quality over baseline methods (e.g., constant execution speed estimators) on benchmark maps with up to 300 agents.

PaperID: 7131, https://arxiv.org/pdf/2511.21783.pdf

Authors: Xuan Qiu

Title: NetworkGames: Simulating Cooperation in Network Games with Personality-driven LLM Agents

Abstract:
While Large Language Models (LLMs) have been extensively tested in dyadic game‑theoretic scenarios, their collective behavior within complex network games remains surprisingly unexplored. To bridge this gap, we present NetworkGames, a framework connecting Generative Agents and Geometric Deep Learning. By formalizing social simulation as a message‑passing process governed by LLM policies, we investigate how node heterogeneity (MBTI personalities) and network topology co‑determine collective welfare. We instantiate a population of LLM agents, each endowed with a distinct personality from the MBTI taxonomy, and situate them in various network structures (e.g., small‑world and scale‑free). Through extensive simulations of the Iterated Prisoner's Dilemma, we first establish a baseline dyadic interaction matrix, revealing nuanced cooperative preferences between all 16 personality pairs. We then demonstrate that macro‑level cooperative outcomes are not predictable from dyadic interactions alone; they are co‑determined by the network's connectivity and the spatial distribution of personalities. For instance, we find that small‑world networks are detrimental to cooperation, while strategically placing pro‑social personalities in hub positions within scale‑free networks can significantly promote cooperative behavior. We validate the robustness of these findings through extensive stress tests across multiple LLM architectures, scaled network sizes, varying random seeds, and comprehensive ablation studies. Our findings offer significant implications for designing healthier online social environments and forecasting collective behavior. We open‑source our framework to facilitate research into the social physics of AI societies.

PaperID: 7132, https://arxiv.org/pdf/2511.21731.pdf

Authors: Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

Title: Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

Abstract:
We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of 'quantum entanglement' in the tested concepts. In the second test, also performed using ChatGPT and Gemini, we instead identify the presence of 'Bose‑Einstein statistics', rather than the intuitively expected 'Maxwell‑Boltzmann statistics', in the distribution of the words contained in large‑size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of quantum structures in conceptual‑linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning‑bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self‑learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.

PaperID: 7133, https://arxiv.org/pdf/2511.21729.pdf

Authors: Jithin Krishnan

Title: Beyond Component Strength: Synergistic Integration and Adaptive Calibration in Multi-Agent RAG Systems

Abstract:
Building reliable retrieval‑augmented generation (RAG) systems requires more than adding powerful components; it requires understanding how they interact. Using ablation studies on 50 queries (15 answerable, 10 edge cases, and 25 adversarial), we show that enhancements such as hybrid retrieval, ensemble verification, and adaptive thresholding provide almost no benefit when used in isolation, yet together achieve a 95% reduction in abstention (from 40% to 2%) without increasing hallucinations. We also identify a measurement challenge: different verification strategies can behave safely but assign inconsistent labels (for example, "abstained" versus "unsupported"), creating apparent hallucination rates that are actually artifacts of labeling. Our results show that synergistic integration matters more than the strength of any single component, that standardized metrics and labels are essential for correctly interpreting performance, and that adaptive calibration is needed to prevent overconfident over‑answering even when retrieval quality is high.

PaperID: 7134, https://arxiv.org/pdf/2511.21728.pdf

Authors: Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang, Ziqian Bi, Zhimo Han

Title: Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue

Abstract:
Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal‑oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion‑‑Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM‑ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM‑based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long‑term user engagement (+23%), highlighting emotion‑grounded proactivity as a key capability for commercial multimodal agents.

PaperID: 7135, https://arxiv.org/pdf/2511.21686.pdf

Authors: Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li

Title: Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Abstract:
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy‑sensitive. Many such generation tasks require coordinated multi‑agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi‑agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer‑to‑peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute‑intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi‑agent collaborative dialogue, web‑based reasoning data extraction, and tool‑use trajectory generation in customer service environments. In all cases, Matrix achieves 2‑‑15× higher data generation throughput under identical hardware resources, without compromising output quality.

PaperID: 7136, https://arxiv.org/pdf/2511.21572.pdf

Authors: Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen

Title: BAMAS: Structuring Budget-Aware Multi-Agent Systems

Abstract:
Large language model (LLM)‑based multi‑agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi‑agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi‑agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning‑based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state‑of‑the‑art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.

PaperID: 7137, https://arxiv.org/pdf/2511.21460.pdf

Authors: Junjian Wang, Lidan Zhao, Xi Sheryl Zhang

Title: MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

Abstract:
Ensuring the safety of embodied AI agents during task planning is critical for real‑world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over‑rejection when using single‑agent safety prompts. To address these limitations, we propose MADRA, a training‑free Multi‑Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM‑based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self‑evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware‑VH, a benchmark dataset for safety‑aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2‑THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe‑task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model‑agnostic solution for building trustworthy embodied agents.

PaperID: 7138, https://arxiv.org/pdf/2511.21444.pdf

Authors: Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang, Fenghua Ling, Wanli Ouyang, Lei Bai

Title: EWE: An Agentic Framework for Extreme Weather Analysis

Abstract:
Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert‑driven, labor‑intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge‑guided planning, closed‑loop reasoning, and a domain‑tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high‑impact events and a novel step‑wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.

PaperID: 7139, https://arxiv.org/pdf/2511.21438.pdf

Authors: Simon Süwer, Kester Bagemihl, Sylvie Baier, Lucia Dicunta, Markus List, Jan Baumbach, Andreas Maier, Fernando M. Delgado-Chaves

Title: Conversational No-code, Multi-agentic Disease Module Identification and Drug Repurposing Prediction with ChatDRex

Abstract:
Repurposing approved drugs offers a time‑efficient and cost‑effective alternative to traditional drug development. However, in silico prediction of repurposing candidates is challenging and requires the effective collaboration of specialists in various fields, including pharmacology, medicine, biology, and bioinformatics. Fragmented, specialized algorithms and tools often address only narrow aspects of the overall problem. Heterogeneous, unstructured data landscapes require the expertise of specialized users. Hence, these data services do not integrate smoothly across workflows. With ChatDRex, we present a conversation‑based, multi‑agent system that facilitates the execution of complex bioinformatic analyses aiming for network‑based drug repurposing prediction. It builds on the integrated systems medicine knowledge graph (NeDRex KG). ChatDRex provides natural language access to its extensive biomedical knowledge base. It integrates bioinformatics agents for network analysis, literature mining, and drug repurposing. These are complemented by agents that evaluate functional coherence for in silico validation. Its flexible multi‑agent design assigns specific tasks to specialized agents, including query routing, data retrieval, algorithm execution, and result visualization. A dedicated reasoning module keeps the user in the loop and allows for hallucination detection. By enabling physicians and researchers without computer science expertise to control complex analyses with natural language, ChatDRex democratizes access to bioinformatics as an important resource for drug repurposing. It enables clinical experts to generate hypotheses and explore drug repurposing opportunities, ultimately accelerating the discovery of novel therapies and advancing personalized medicine and translational research. ChatDRex is publicly available at apps.cosy.bio/chatdrex.

PaperID: 7140, https://arxiv.org/pdf/2511.21432.pdf

Authors: Peter Iwer Hoedt Karstensen, Roberto Galeazzi

Title: Multi-Hypotheses Navigation in Collaborative Localization subject to Cyber Attacks

Abstract:
This paper addresses resilient collaborative localization in multi‑agent systems exposed to spoofed radio frequency measurements. Each agent maintains multiple hypotheses of its own state and exchanges selected information with neighbors using covariance intersection. Geometric reductions based on distance tests and convex hull structure limit the number of hypotheses transmitted, controlling the spread of hypotheses through the network. The method enables agents to separate spoofed and truthful measurements and to recover consistent estimates once the correct hypothesis is identified. Numerical results demonstrate the ability of the approach to contain the effect of adversarial measurements, while also highlighting the impact of conservative fusion on detection speed. The framework provides a foundation for resilient multi‑agent navigation and can be extended with coordinated hypothesis selection across the network.

PaperID: 7141, https://arxiv.org/pdf/2511.21405.pdf

Authors: Cristiana Punzo, Italo Napolitano, Cinzia Tomaselli, Mario di Bernardo

Title: Decentralized Shepherding of Non-Cohesive Swarms Through Cluttered Environments via Deep Reinforcement Learning

Abstract:
This paper investigates decentralized shepherding in cluttered environments, where a limited number of herders must guide a larger group of non‑cohesive, diffusive targets toward a goal region in the presence of static obstacles. A hierarchical control architecture is proposed, integrating a high‑level target assignment rule, where each herder is paired with a selected target, with a learning‑based low‑level driving module that enables effective steering of the assigned target. The low‑level policy is trained in a one‑herder‑one‑target scenario with a rectangular obstacle using Proximal Policy Optimization and then directly extended to multi‑agent settings with multiple obstacles without requiring retraining. Numerical simulations demonstrate smooth, collision‑free trajectories and consistent convergence to the goal region, highlighting the potential of reinforcement learning for scalable, model‑free shepherding in complex environments.

PaperID: 7142, https://arxiv.org/pdf/2511.21380.pdf

Authors: Jingyi Chen, Xiaoyan Guo, Songqiang Chen, Shing-Chi Cheung, Jiasi Shen

Title: Multi-Agent Systems for Dataset Adaptation in Software Engineering: Capabilities, Limitations, and Future Directions

Abstract:
Automating the adaptation of software engineering (SE) research artifacts across datasets is essential for scalability and reproducibility, yet it remains largely unstudied. Recent advances in large language model (LLM)‑based multi‑agent systems, such as GitHub Copilot's agent mode, promise to automate complex development workflows through coordinated reasoning, code generation, and tool interaction. This paper presents the first empirical study on how state‑of‑the‑art multi‑agent systems perform in dataset adaptation tasks. We evaluate Copilot, backed by GPT‑4.1 and Claude Sonnet 4, on adapting SE research artifacts from benchmark repositories including ROCODE and LogHub2.0. Through a five‑stage evaluation pipeline (file comprehension, code editing, command generation, validation, and final execution), we measure success rates, analyze failure patterns, and assess prompt‑based interventions designed to enhance agent performance. Results show that current systems can identify key files and generate partial adaptations but rarely produce functionally correct implementations. Prompt‑level interventions, especially providing execution error messages and reference code, substantially improve structural similarity to ground truth (from 7.25% to 67.14%), highlighting the importance of contextual and feedback‑driven guidance. Our findings reveal both the promise and limitations of today's multi‑agent LLM systems for dataset adaptation, and suggest concrete directions for building more reliable, self‑correcting agents in future SE research.

PaperID: 7143, https://arxiv.org/pdf/2511.21304.pdf

Authors: Luigi Catello, Italo Napolitano, Davide Salzano, Mario di Bernardo

Title: Sparse shepherding control of large-scale multi-agent systems via Reinforcement Learning

Abstract:
We propose a Reinforcement Learning framework for sparse indirect control of large‑scale multi‑agent systems, where few controlled agents shape the collective behavior of many uncontrolled agents. The approach addresses this multi‑scale challenge by coupling ODEs (modeling controlled agents) with a PDE (describing the uncontrolled population density), capturing how microscopic control achieves macroscopic objectives. Our method combines model‑free Reinforcement Learning with adaptive interaction strength compensation to overcome sparse actuation limitations. Numerical validation demonstrates effective density control, with the system achieving target distributions while maintaining robustness to disturbances and measurement noise, confirming that learning‑based sparse control can replace computationally expensive online optimization.

PaperID: 7144, https://arxiv.org/pdf/2511.21263.pdf

Authors: Didier Galmiche, Timo Lang, Daniel Méry, David Pym

Title: Bifurcation Logic: Separation Through Ordering

Abstract:
We introduce Bifurcation Logic, BL, which combines a basic classical modality with separating conjunction together with its naturally associated multiplicative implication, that is defined using the modal ordering. Specifically, a formula AB is true at a world w if and only if each of A,B holds at worlds that are each above w, on separate branches of the order, and have no common upper bound. We provide a labelled tableaux calculus for BL and establish soundness and completeness relative to its relational semantics. The standard finite model property fails for BL. However, we show that, in the absence of multiplicative implication, but in the presence of , every model has an equivalent finite representation and that this is sufficient to obtain decidability. We illustrate the use of BL through an example of modelling multi‑agent access control that is quite generic in its form, suggesting many applications.

PaperID: 7145, https://arxiv.org/pdf/2511.21228.pdf

Authors: Anthony Couthures, Gustave Bainier, Vineeth Satheeskumar Varma, Samson Lasaulce, Irinel-Constantin Morarescu

Title: From Consensus to Robust Clustering: Multi-Agent Systems with Nonlinear Interactions

Abstract:
This paper establishes a theoretical framework to describe the transition from consensus to stable clustering in multi‑agent systems with nonlinear, cooperative interactions. We first establish a sharp threshold for consensus. For a broad class of non‑decreasing, Lipschitz‑continuous interactions, an explicit inequality linking the interaction's Lipschitz constant to the second‑largest eigenvalue of the normalized adjacency matrix of the interaction graph confines all system equilibria to the synchronization manifold. This condition is shown to be a sharp threshold, as its violation permits the emergence of non‑synchronized equilibria. We also demonstrate that such clustered states can only arise if the interaction law itself possesses specific structural properties, such as unstable fixed points. For the clustered states that emerge, we introduce a formal framework using Input‑to‑State Stability (ISS) theory to quantify their robustness. This approach allows us to prove that the internal cohesion of a cluster is robust to perturbations from the rest of the network. The analysis reveals a fundamental principle: cluster coherence is limited not by the magnitude of external influence, but by its heterogeneity across internal nodes. This unified framework, explaining both the sharp breakdown of consensus and the quantifiable robustness of the resulting modular structures, is validated on Zachary's Karate Club network, used as a classic benchmark for community structure.

PaperID: 7146, https://arxiv.org/pdf/2511.21042.pdf

Authors: Cheng Yang, Hui Jin, Xinlei Yu, Zhipeng Wang, Yaoqun Liu, Fenglei Fan, Dajiang Lei, Gangyong Jia, Changmiao Wang, Ruiquan Ge

Title: LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Abstract:
Diagnosing lung cancer typically involves physicians identifying lung nodules in Computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi‑agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi‑agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi‑agent system framework. Extensive testing on two private datasets and the public LIDC‑IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision‑language models, agent systems, and advanced expert models. These results highlight the importance of region‑level semantic alignment and multi‑agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

PaperID: 7147, https://arxiv.org/pdf/2511.21033.pdf

Authors: Linze Chen, Yufan Cai, Zhe Hou, Jin Song Dong

Title: Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning

Abstract:
Legal decisions should be logical and based on statutory laws. While large language models(LLMs) are good at understanding legal text, they cannot provide verifiable justifications. We present L4L, a solver‑centric framework that enforces formal alignment between LLM‑based legal reasoning and statutory laws. The framework integrates role‑differentiated LLM agents with SMT‑backed verification, combining the flexibility of natural language with the rigor of symbolic reasoning. Our approach operates in four stages: (1) Statute Knowledge Building, where LLMs autoformalize legal provisions into logical constraints and validate them through case‑level testing; (2) Dual Fact‑and‑Statute Extraction, in which the prosecutor‑and defense‑aligned agents independently map case narratives to argument tuples; (3) Solver‑Centric Adjudication, where SMT solvers check the legal admissibility and consistency of the arguments against the formalized statute knowledge; (4) Judicial Rendering, in which a judge agent integrates solver‑validated reasoning with statutory interpretation and similar precedents to produce a legally grounded verdict. Experiments on public legal benchmarks show that L4L consistently outperforms baselines, while providing auditable justifications that enable trustworthy legal AI.

PaperID: 7148, https://arxiv.org/pdf/2511.20940.pdf

Authors: Reham Omar, Abdelghny Orogat, Ibrahim Abdelaziz, Omij Mangukiya, Panos Kalnis, Essam Mansour

Title: Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Abstract:
Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG‑based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context‑aware conversations, but lack direct access to private and dynamic KGs. Retrieval‑augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi‑turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single‑turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty‑KG, a modular multi‑agent system for conversational QA over KGs. Chatty‑KG combines RAG‑style retrieval with structured execution by generating SPARQL queries through task‑specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low‑latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty‑KG significantly outperforms state‑of‑the‑art baselines in both single‑turn and multi‑turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine‑tuning or pre‑processing. Evaluations with commercial (e.g., GPT‑4o, Gemini‑2.0) and open‑weight (e.g., Phi‑4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty‑KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi‑turn KGQA.

PaperID: 7149, https://arxiv.org/pdf/2511.20914.pdf

Authors: Vivek Pandey, Nader Motee

Title: Distributionally Robust Cascading Risk in Multi-Agent Rendezvous: Extended Analysis of Parameter-Induced Ambiguity

Abstract:
Ensuring safety in autonomous multi‑agent systems during time‑critical tasks such as rendezvous is a fundamental challenge, particularly under communication delays and uncertainty in system parameters. In this paper, we develop a theoretical framework to analyze the \emphdistributionally robust risk of cascading failures in multi‑agent rendezvous, where system parameters lie within bounded uncertainty sets around nominal values. Using a time‑delayed dynamical network as a benchmark model, we quantify how small deviations in these parameters impact collective safety. We introduce a \emphconditional distributionally robust functional, grounded in a bivariate Gaussian model, to characterize risk propagation between agents. This yields a \emphclosed‑form risk expression that captures the complex interaction between time delays, network structure, noise statistics, and failure modes. These expressions expose key sensitivity patterns and provide actionable insight for the design of robust and resilient multi‑agent networks. Extensive simulations validate the theoretical results and demonstrate the effectiveness of our framework.

PaperID: 7150, https://arxiv.org/pdf/2511.20857.pdf

Authors: Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

Title: Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Abstract:
Statefulness is essential for large language model (LLM) agents to perform long‑term planning and problem‑solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real‑world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test‑time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo‑Memory, a comprehensive streaming benchmark and framework for evaluating self‑evolving memory in LLM agents. Evo‑Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi‑turn goal‑oriented and single‑turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action‑think‑memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

PaperID: 7151, https://arxiv.org/pdf/2511.20726.pdf

Authors: Yuhang Wang, Heye Huang, Zhenhua Xu, Kailai Sun, Baoshen Guo, Jinhua Zhao

Title: Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge

Abstract:
Autonomous driving faces critical challenges in rare long‑tail events and complex multi‑agent interactions, which are scarce in real‑world data yet essential for robust safety validation. This paper presents a high‑fidelity scenario generation framework that integrates a conditional variational autoencoder (CVAE) with a large language model (LLM). The CVAE encodes historical trajectories and map information from large‑scale naturalistic datasets to learn latent traffic structures, enabling the generation of physically consistent base scenarios. Building on this, the LLM acts as an adversarial reasoning engine, parsing unstructured scene descriptions into domain‑specific loss functions and dynamically guiding scenario generation across varying risk levels. This knowledge‑driven optimization balances realism with controllability, ensuring that generated scenarios remain both plausible and risk‑sensitive. Extensive experiments in CARLA and SMARTS demonstrate that our framework substantially increases the coverage of high‑risk and long‑tail events, improves consistency between simulated and real‑world traffic distributions, and exposes autonomous driving systems to interactions that are significantly more challenging than those produced by existing rule‑ or data‑driven methods. These results establish a new pathway for safety validation, enabling principled stress‑testing of autonomous systems under rare but consequential events.

PaperID: 7152, https://arxiv.org/pdf/2511.20719.pdf

Authors: Yifan Fan, Le Liang, Peng Liu, Xiao Li, Ziyang Guo, Qiao Lan, Shi Jin, Wen Tong

Title: Learning Multi-Access Point Coordination in Agentic AI Wi-Fi with Large Language Models

Abstract:
Multi‑access point coordination (MAPC) is a key technology for enhancing throughput in next‑generation Wi‑Fi within dense overlapping basic service sets. However, existing MAPC protocols rely on static, protocol‑defined rules, which limits their ability to adapt to dynamic network conditions such as varying interference levels and topologies. To address this limitation, we propose a novel Agentic AI Wi‑Fi framework where each access point, modeled as an autonomous large language model agent, collaboratively reasons about the network state and negotiates adaptive coordination strategies in real time. This dynamic collaboration is achieved through a cognitive workflow that enables the agents to engage in natural language dialogue, leveraging integrated memory, reflection, and tool use to ground their decisions in past experience and environmental feedback. Comprehensive simulation results demonstrate that our agentic framework successfully learns to adapt to diverse and dynamic network environments, significantly outperforming the state‑of‑the‑art spatial reuse baseline and validating its potential as a robust and intelligent solution for future wireless networks.

PaperID: 7153, https://arxiv.org/pdf/2511.20718.pdf

Authors: Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

Title: Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization

Abstract:
Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi‑turn agentic tasks. However, in off‑policy training pipelines, these methods often exhibit unstable optimization dynamics and are prone to performance collapse. Through empirical analysis, we identify two fundamental sources of instability in this setting: (1)~a granularity mismatch between token‑level policy optimization and turn‑structured interactions, and (2) high‑variance and unreliable gradient updates induced by off‑policy importance sampling and inaccurate advantage estimation. To address these challenges, we propose SORL, \underlineStabilizing \underlineOff‑Policy \underlineReinforcement \underlineLearning for Long‑Horizon Agent Training. SORL introduces principled mechanisms that align policy optimization with the structure of multi‑turn interactions and adaptively suppress unreliable off‑policy updates, yielding more conservative and robust learning dynamics. Within this framework, we instantiate two stabilized algorithms: SO‑PPO and SO‑GRPO. Both algorithms are designed to mitigate gradient variance and prevent optimization collapse without requiring careful early stopping or heuristic tuning. We evaluate SO‑PPO and SO‑GRPO on a range of multi‑turn search benchmarks, including general question answering, multi‑hop question answering, and medical multiple‑choice QA tasks. Experimental results show that both methods consistently prevent training instabilities and performance collapses observed in standard PPO and GRPO, maintain lower clipping ratios and more stable optimization trajectories, and achieve superior or comparable task performance. These results demonstrate that the proposed algorithm provides a practical, scalable, and general framework for stabilizing reinforcement learning in multi‑turn LLM agent training.

PaperID: 7154, https://arxiv.org/pdf/2511.20694.pdf

Authors: Kevin Lee, Russell Spiewak, James Walsh

Title: Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Abstract:
Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question‑and‑answer structure with question contexts, reasoning steps, expected answer type, ground‑truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit‑aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single‑shot baseline and four multi‑agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.

PaperID: 7155, https://arxiv.org/pdf/2511.20663.pdf

Authors: Barak Or

Title: MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Abstract:
Reliability in multi‑agent systems (MAS) built on large language models is increasingly limited by cognitive failures rather than infrastructure faults. Existing observability tools describe failures but do not quantify how quickly distributed reasoning recovers once coherence is lost. We introduce MTTR‑A (Mean Time‑to‑Recovery for Agentic Systems), a runtime reliability metric that measures cognitive recovery latency in MAS. MTTR‑A adapts classical dependability theory to agentic orchestration, capturing the time required to detect reasoning drift and restore coherent operation. We further define complementary metrics, including MTBF and a normalized recovery ratio (NRR), and establish theoretical bounds linking recovery latency to long‑run cognitive uptime. Using a LangGraph‑based benchmark with simulated drift and reflex recovery, we empirically demonstrate measurable recovery behavior across multiple reflex strategies. This work establishes a quantitative foundation for runtime cognitive dependability in distributed agentic systems.

PaperID: 7156, https://arxiv.org/pdf/2511.20639.pdf

Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

Title: Latent Collaboration in Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) extend large language models (LLMs) from independent single‑model reasoning to coordinative system‑level intelligence. While existing LLM agents depend on text‑based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end‑to‑end training‑free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto‑regressive latent thoughts generation through last‑layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text‑based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single‑model and text‑based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%‑83.7%, and providing 4x‑4.3x faster end‑to‑end inference. These results demonstrate that our new latent collaboration framework enhances system‑level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open‑sourced at https://github.com/Gen‑Verse/LatentMAS.

PaperID: 7157, https://arxiv.org/pdf/2511.20613.pdf

Authors: Panayiotis Danassis, Naman Goel

Title: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Abstract:
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI‑assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit‑test pass rates and syntactic correctness. Such metrics understate the difficulty of many real‑world problems that require planning, optimization, and strategic interaction. We introduce a multi‑agent reasoning‑driven benchmark based on a real‑world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity‑constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM‑coded agents (by a wide range of state‑of‑the‑art LLMs under multiple prompting methodologies, including vibe coding) against 17 human‑coded agents developed before the advent of LLMs. Our results over 12 double all‑play‑all tournaments and ～ 40k matches demonstrate (i) a clear superiority of human(graduate students)‑coded agents: the top 5 spots are consistently won by human‑coded agents, (ii) the majority of LLM‑coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real‑world, and motivate new evaluations that emphasize reasoning‑driven code synthesis in real‑world scenarios.

PaperID: 7158, https://arxiv.org/pdf/2511.20603.pdf

Authors: Winfrey Paul Sagayam Dennis

Title: Exploring Urban Air Mobility Adoption Potential in San Francisco Bay Area Region: A Systems of Systems Level Case Study on Passenger Waiting Times and Travel Efficiency

Abstract:
Urban Air mobility has gained momentum with recent advancements in the electric vertical take‑off and landing (eVTOL) vehicles, offering faster point‑to‑point air taxi services that could help relieve traffic congestion in chronically overburdened cities. The research assesses the feasibility and systems‑of‑systems level adoption potential of UAM operations in the San Francisco Bay Area by comparing passenger departure, waiting, travel, and arrival times across key regional nodes, including San Francisco, Oakland, San Jose, and Palo Alto airports, with conventional ground transportation. A multi‑agent simulation was developed in MATLAB to evaluate the fleet operations and to model demand arrival using a Poisson process under stochastic passenger flows and turnaround constraints. Results indicate that utilizing UAM during peak demand could reduce total travel times up to eighty percent across the region. The findings of this paper highlight the critical operational factors for fleet schedule optimization. Especially how the fleet size, passengers' request volumes, and turnaround time directly influence waiting time, operating cost, and overall user acceptance.

PaperID: 7159, https://arxiv.org/pdf/2511.20590.pdf

Authors: Jakub Muszyński, Ignacy Walużenicz, Patryk Zan, Zofia Wrona, Maria Ganzha, Marcin Paprzycki, Costin Bădică

Title: EnergyTwin: A Multi-Agent System for Simulating and Coordinating Energy Microgrids

Abstract:
Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, typically, power‑system simulators capture physical behaviour but assume centralized control, while multi‑agent frameworks model decentralized decision‑making but represent energy with no physical grounding. In this context, the EnergyTwin is introduced, an agent‑based microgrid simulation environment that couples physically grounded models with forecast‑informed, rolling‑horizon planning, and negotiations. Each asset is modeled as an agent, interacting with a central agent that obtains forecasts, formulates predictions, and allocates energy through contract‑based interactions. EnergyTwin targets tertiary‑layer decision making and is extensible for digital‑twin use. Its feasibility was evaluated in a university campus microgrid scenario where multiple planning strategies were compared. Achieved results show that forecast‑driven rolling‑horizon planning increases local energy self‑sufficiency, maintains higher battery reserves, and reduces exposure to low‑resilience operating states. They demonstrate also potential of EnergyTwin as platform supporting research on resilient, negotiation‑driven microgrids.

PaperID: 7160, https://arxiv.org/pdf/2511.20540.pdf

Authors: Adam Bjorndahl

Title: Proceedings Twentieth Conference on Theoretical Aspects of Rationality and Knowledge

Abstract:
The TARK conference (Theoretical Aspects of Rationality and Knowledge) is a conference that aims to bring together researchers from a wide variety of fields, including computer science, artificial intelligence, game theory, decision theory, philosophy, logic, linguistics, and cognitive science. Its goal is to further our understanding of interdisciplinary issues involving reasoning about rationality and knowledge. Previous conferences have been held biennially around the world since 1986, on the initiative of Joe Halpern (Cornell University). Topics of interest include, but are not limited to, semantic models for knowledge, belief, uncertainty, awareness, bounded rationality, common sense epistemic reasoning, epistemic logic, epistemic game theory, knowledge and action, applications of reasoning about knowledge and other mental states, belief revision, computational social choice, algorithmic game theory, and foundations of multi‑agent systems. Information about TARK is available at http://www.tark.org/. These proceedings contain the papers that have been accepted for presentation at the Twentieth Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2025), held July 14‑‑16, 2025, at Heinrich‑Heine‑Universität, Düsseldorf, Germany. The conference website can be found at https://ccc.cs.uni‑duesseldorf.de/tark‑2025/.

PaperID: 7161, https://arxiv.org/pdf/2511.20510.pdf

Authors: Yuto Suzuki, Paul Awolade, Daniel V. LaBarbera, Farnoush Banaei-Kashani

Title: FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

Abstract:
Molecule generation using generative AI is vital for drug discovery, yet class‑specific datasets often contain fewer than 100 training examples. While fragment‑based models handle limited data better than atom‑based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end‑to‑end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a "vocabulary selection" problem, using dynamic Q‑learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real‑world cancer drug discovery experiments, FRAGMENTA's Human‑Agent configuration identified nearly twice as many high‑scoring molecules as baselines. Furthermore, the fully autonomous Agent‑Agent system outperformed traditional Human‑Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.

PaperID: 7162, https://arxiv.org/pdf/2511.20490.pdf

Authors: Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne

Title: MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Abstract:
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real‑world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question‑answering, overlooking multi‑agent decision‑making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB‑style decision‑making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co‑developed app, ensuring clinical relevance. We benchmark multiple open and closed‑source LLMs and show that, even at scale, they lack reliability ‑‑ frequently hallucinating, struggling with reasoning from time‑resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model‑based tools that enhance multi‑modal and longitudinal reasoning, leading to task‑level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool‑use with a focus on MTB environments in precision oncology.

PaperID: 7163, https://arxiv.org/pdf/2511.20468.pdf

Authors: Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma, Wei Tan

Title: DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs

Abstract:
Large Language Models (LLMs) have shown impressive capabilities in multi‑step reasoning and problem‑solving.Recent works introduce multi‑agent reflection frameworks where multiple LLM agents critique and refine each other's outputs using reinforcement learning (RL). However, these approaches often rely on single‑shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT‑RL, a novel framework that integrates Chain‑of‑Draft (CoD) reasoning into multi‑agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor‑critic learning.DRAFT‑RL enables explicit multi‑path exploration, peer‑guided reflection, and reward‑aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge‑intensive QA,demonstrating that DRAFT‑RL outperforms existing reflective and RL‑based agents by significant margins in both accuracy and convergence speed

PaperID: 7164, https://arxiv.org/pdf/2511.20383.pdf

Authors: Viet-Anh Le, Andreas A. Malikopoulos

Title: Accelerating Time-Optimal Trajectory Planning for Connected and Automated Vehicles with Graph Neural Networks

Abstract:
In this paper, we present a learning‑based framework that accelerates time‑ and energy‑optimal trajectory planning for connected and automated vehicles (CAVs) using graph neural networks (GNNs). We formulate the multi‑agent coordination problem encountered in traffic scenarios as a cooperative trajectory planning problem that minimizes travel time, subject to motion primitives derived from energy‑optimal solutions. The performance of this framework can be further improved through replanning at each time step, enabling the system to incorporate newly observed information. To achieve real‑time execution, we employ a graph isomorphism network with edge features (GINEConv) to learn the solutions of the time‑optimal trajectory planning problem from offline‑generated data. The trained model produces online predictions that serve as warm‑starts for numerical optimization, thereby enabling rapid computation of minimal exit times and the associated feasible trajectories. This learning‑to‑warm‑start approach substantially reduces computation time while preserving the control performance of the time‑ and energy‑optimal trajectory planning framework.

PaperID: 7165, https://arxiv.org/pdf/2511.20276.pdf

Authors: Lianzhe Hu, Yu Wang, Bikash Pal

Title: LLM-Driven Transient Stability Assessment: From Automated Simulation to Neural Architecture Design

Abstract:
This paper presents an LLM‑driven, end‑to‑end workflow that addresses the lack of automation and intelligence in power system transient stability assessment (TSA). The proposed agentic framework integrates large language models (LLMs) with a professional simulator (ANDES) to automatically generate and filter disturbance scenarios from natural language, and employs an LLM‑driven Neural Network Design (LLM‑NND) pipeline to autonomously design and optimize TSA models through performance‑guided, closed‑loop feedback. On the IEEE 39‑bus system, the LLM‑NND models achieve 93.71% test accuracy on four‑class TSA with only 4.78M parameters, while maintaining real‑time inference latency (less than 0.95 ms per sample). Compared with a manually designed DenseNet (25.9M parameters, 80.05% accuracy), the proposed approach jointly improves accuracy and efficiency. Ablation studies confirm that the synergy among domain‑grounded retrieval, reasoning augmentation, and feedback mechanisms is essential for robust automation. The results demonstrate that LLM agents can reliably accelerate TSA research from scenario generation and data acquisition to model design and interpretation, offering a scalable paradigm that is readily extensible to other power system tasks such as optimal power flow, fault analysis, and market operations.

PaperID: 7166, https://arxiv.org/pdf/2511.20138.pdf

Authors: Jason Lo, Mohammadnima Jafari

Title: From data to concepts via wiring diagrams

Abstract:
A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi‑skeleton wiring diagram graph, and prove that quasi‑skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.

PaperID: 7167, https://arxiv.org/pdf/2511.20110.pdf

Authors: Michal Feldman, Yoav Gal-Tzur, Tomasz Ponitka, Maya Schlesinger

Title: One Action Too Many: Inapproximability of Budgeted Combinatorial Contracts

Abstract:
We study multi‑agent contract design with combinatorial actions, under budget constraints, and for a broad class of objective functions, including profit (principal's utility), reward, and welfare. Our first result is a strong impossibility: For submodular reward functions, no randomized poly‑time algorithm can approximate the optimal budget‑feasible value within any finite factor, even with demand‑oracle access. This result rules out extending known constant‑factor guarantees from either (i) unbudgeted settings with combinatorial actions or (ii) budgeted settings with binary actions, to their combination. The hardness is tight: It holds even when all but one agent have binary actions and the remaining agent has just one additional action. On the positive side, we show that gross substitutes rewards (a well‑studied strict subclass of submodular functions) admit a deterministic poly‑time O(1)‑approximation, using only value queries. Our results thus draw the first sharp separation between budgeted and unbudgeted settings in combinatorial contracts, and identifies gross substitutes as a tractable frontier for budgeted combinatorial contracts. Finally, we present an FPTAS for additive rewards, demonstrating that arbitrary approximation is tractable under any budget. This constitutes the first FPTAS for the multi‑agent combinatorial‑actions setting, even in the absence of budget constraints.

PaperID: 7168, https://arxiv.org/pdf/2511.20109.pdf

Authors: Hyeonjae Kim, Chenyue Li, Wen Deng, Mengxi Jin, Wen Huang, Mengqian Lu, Binhang Yuan

Title: CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

Abstract:
Climate science demands automated workflows to transform comprehensive questions into data‑driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate‑specific context and flexibility, thus, perform poorly in practice. We present ClimateAgent, an autonomous multi‑agent framework that orchestrates end‑to‑end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub‑tasks coordinated by an Orchestrate‑Agent and a Plan‑Agent; acquires data via specialized Data‑Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding‑Agent that generates Python code, visualizations, and a final report with a built‑in self‑correction loop. To enable systematic evaluation, we introduce Climate‑Agent‑Bench‑85, a benchmark of 85 real‑world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate‑Agent‑Bench‑85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub‑Copilot (6.27) and a GPT‑5 baseline (3.26). These results demonstrate that our multi‑agent orchestration with dynamic API awareness and self‑correcting execution substantially advances reliable, end‑to‑end automation for climate science analytic tasks.

PaperID: 7169, https://arxiv.org/pdf/2511.20090.pdf

Authors: Zizhang Luo, Fan Cui, Kexing Zhou, Runlin Guo, Mile Xia, Hongyuan Hou, Yun Liang

Title: R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation

Abstract:
Repairing RTL bugs is crucial for hardware design and verification. Traditional automatic program repair (APR) methods define dedicated search spaces to locate and fix bugs with program synthesis. However, they heavily rely on fixed templates and can only deal with limited bugs. As an alternative, Large Language Models with the ability to understand code semantics can be explored for RTL repair. However, they suffer from unreliable outcomes due to inherent randomness and long input contexts of RTL code and waveform. To address these challenges, we propose R3A, an LLM‑based automatic RTL program repair framework upon the basic model to improve reliability. R3A proposes the stochastic Tree‑Of‑Thoughts method to control a patch generation agent to explore a validated solution for the bug. The algorithm samples search states according to a heuristic function to balance between exploration and exploitation for a reliable outcome. Besides, R3A proposes a multi‑agent fault localization method to find fault candidates as the starting points for the patch generation agent, further increasing the reliability. Experiments show R3A can fix 90.6% of bugs in the RTL‑repair dataset within a given time limit, which covers 45% more bugs than traditional methods and other LLM‑based approaches, while achieving an 86.7% pass@5 rate on average, showing a high reliability.

PaperID: 7170, https://arxiv.org/pdf/2511.20080.pdf

Authors: Priyanka Singh, Sebastian Von Mammen

Title: Adaptive LLM Agents: Toward Personalized Empathetic Care

Abstract:
Current mental‑health conversational systems are usually based on fixed, generic dialogue patterns. This paper proposes an adaptive framework based on large language models that aims to personalize therapeutic interaction according to a user's psychological state, quantified with the Acceptance of Illness Scale (AIS). The framework defines three specialized agents, L, M, and H, each linked to a different level of illness acceptance, and adjusts conversational behavior over time using continuous feedback signals. The AIS‑stratified architecture is treated as a diegetic prototype placed in a plausible near‑future setting and examined through the method of design fiction. By embedding the architecture in narrative scenarios, the study explores how such agents might influence access to care and therapeutic relationship. The goal is to show how clinically informed personalization, technical feasibility, and speculative scenario analysis can together inform the responsible design of LLM‑based companions for mental‑health support.

PaperID: 7171, https://arxiv.org/pdf/2511.19991.pdf

Authors: Meng-Chia Lee, Wen Sheng Lim, Yuan-Hao Chang, Tei-Wei Kuo

Title: SARA: A Stall-Aware Memory Allocation Strategy for Mixed-Criticality Systems

Abstract:
The memory capacity in edge devices is often limited due to constraints on cost, size, and power. Consequently, memory competition leads to inevitable page swapping in memory‑constrained mixed‑criticality edge devices, causing slow storage I/O and thus performance degradation. In such scenarios, inefficient memory allocation disrupts the balance between application performance, causing soft real‑time (soft RT) tasks to miss deadlines or preventing non‑real‑time (non‑RT) applications from optimizing throughput. Meanwhile, we observe unpredictable, long system‑level stalls (called long stalls) under high memory and I/O pressure, which further degrade performance. In this work, we propose a Stall‑Aware Real‑Time Memory Allocator (SARA), which discovers opportunities for performance balance by allocating just enough memory to soft RT tasks to meet deadlines and, at the same time, optimizing the remaining memory for non‑RT applications. To minimize the memory usage of soft RT tasks while meeting real‑time requirements, SARA leverages our insight into how latency, caused by memory insufficiency and measured by our proposed PSI‑based metric, affects the execution time of each soft RT job, where a job runs per period and a soft RT task consists of multiple periods. Moreover, SARA detects long stalls using our definition and proactively drops affected jobs, minimizing stalls in task execution. Experiments show that SARA achieves an average of 97.13% deadline hit ratio for soft RT tasks and improves non‑RT application throughput by up to 22.32x over existing approaches, even with memory capacity limited to 60% of peak demand.

PaperID: 7172, https://arxiv.org/pdf/2511.19969.pdf

Authors: Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He

Title: M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

Abstract:
Recent advancements in multi‑modal retrieval‑augmented generation (mRAG), which enhance multi‑modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi‑agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large‑scale deployment. To address these issues, we propose a novel Multi‑Modal Multi‑agent hierarchical communication graph PRUNING framework, termed M^3Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M^3Prune first applies intra‑modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter‑modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain‑specific mRAG benchmarks demonstrate that our method consistently outperforms both single‑agent and robust multi‑agent mRAG systems while significantly reducing token consumption.

PaperID: 7173, https://arxiv.org/pdf/2511.19961.pdf

Authors: Zhenyu Tao, Wei Xu, Xiaohu You

Title: Toward Trustworthy Digital Twins in AI Agent-based Wireless Network Optimization: Challenges, Solutions, and Opportunities

Abstract:
Optimizing modern wireless networks is exceptionally challenging due to their high dynamism and complexity. While the AI agent powered by reinforcement learning (RL) offers a promising solution, its practical application is limited by prohibitive exploration costs and potential risks in the real world. The emerging digital twin (DT) technology provides a safe and controlled virtual environment for agent training, but its effectiveness critically depends on the DT's reliability. Policies trained in an unreliable DT that does not accurately represent the physical network may experience severe performance degradation upon real‑world deployment. In this article, we introduce a new DT evaluation framework to ensure trustworthy DTs in AI agent‑based network optimization. This framework shifts from model‑level accuracy, such as wireless channel and user trajectory similarities, to a more holistic, task‑centric DT assessment, which relies on the Markov decision process that the agent actually perceives. We demonstrate it as an effective guideline for design, selection, and lifecycle management of wireless network DTs. A comprehensive case study on a real‑world wireless network testbed shows how this evaluation framework is used to pre‑filter candidate DTs, leading to a significant reduction in training and testing costs without sacrificing deployment performance. Finally, potential research opportunities are discussed.

PaperID: 7174, https://arxiv.org/pdf/2511.19952.pdf

Authors: Haoran Hu, Junren Shi, Shuo Jiang, Kun Cheng, Xia Yang, Changhao Piao

Title: Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

Abstract:
Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi‑agent interaction modeling with real‑time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction models.To overcome these dual challenges‑computational complexity and modeling insufficiency‑along with the high false alarm rates of traditional static‑threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio‑Temporal Attention Network with a Dynamic Risk Threshold Adjustment algorithm. HSTAN employs a decoupled architecture (Graph Attention Network for spatial, cascaded GRU with self‑attention for temporal) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics‑informed risk potential function and an adaptive threshold mechanism inspired by statistical process control.Tested across multi‑scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework's superior performance and practical deployment feasibility in complex environments.

PaperID: 7175, https://arxiv.org/pdf/2511.19933.pdf

Authors: Vaishali Vinay

Title: Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications

Abstract:
Large language models (LLMs) are being rapidly integrated into decision‑support tools, automation workflows, and AI‑enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system‑level taxonomy of fifteen hidden failure modes that arise in real‑world LLM applications, including multi‑step reasoning drift, latent inconsistency, context‑boundary degradation, incorrect tool invocation, version drift, and cost‑driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs ‑ including observability limitations, cost constraints, and update‑induced regressions ‑ and outline high‑level design principles for building reliable, maintainable, and cost‑aware LLM systems. Finally, we outline high‑level design principles for building reliable, maintainable, and cost‑aware LLM‑based systems. By framing LLM reliability as a system‑engineering problem rather than a purely model‑centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.

PaperID: 7176, https://arxiv.org/pdf/2511.19930.pdf

Authors: Kenta Yamamoto, Teruaki Hayashi

Title: Designing Reputation Systems for Manufacturing Data Trading Markets: A Multi-Agent Evaluation with Q-Learning and IRL-Estimated Utilities

Abstract:
Recent advances in machine learning and big data analytics have intensified the demand for high‑quality cross‑domain datasets and accelerated the growth of data trading across organizations. As data become increasingly recognized as an economic asset, data marketplaces have emerged as a key infrastructure for data‑driven innovation. However, unlike mature product or service markets, data‑trading environments remain nascent and suffer from pronounced information asymmetry. Buyers cannot verify the content or quality before purchasing data, making trust and quality assurance central challenges. To address these issues, this study develops a multi‑agent data‑market simulator that models participant behavior and evaluates the institutional mechanisms for trust formation. Focusing on the manufacturing sector, where initiatives such as GAIA‑X and Catena‑X are advancing, the simulator integrates reinforcement learning (RL) for adaptive agent behavior and inverse reinforcement learning (IRL) to estimate utility functions from empirical behavioral data. Using the simulator, we examine the market‑level effects of five representative reputation systems‑Time‑decay, Bayesian‑beta, PageRank, PowerTrust, and PeerTrust‑and found that PeerTrust achieved the strongest alignment between data price and quality, while preventing monopolistic dominance. Building on these results, we develop a hybrid reputation mechanism that integrates the strengths of existing systems to achieve improved price‑quality consistency and overall market stability. This study extends simulation‑based data‑market analysis by incorporating trust and reputation as endogenous mechanisms and offering methodological and institutional insights into the design of reliable and efficient data ecosystems.

PaperID: 7177, https://arxiv.org/pdf/2511.19885.pdf

Authors: Chenglu Sun, Shuo Shen, Haonan Hu, Wei Zhou, Chen Chen

Title: Complex Instruction Following with Diverse Style Policies in Football Games

Abstract:
Despite advancements in language‑controlled reinforcement learning (LC‑RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC‑RL to comprehend and execute high‑level or abstract instructions in complex, multi‑agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language‑Controlled Diverse Style Policies (LCDSP), a novel LC‑RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high‑level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real‑world applications.

PaperID: 7178, https://arxiv.org/pdf/2511.19874.pdf

Authors: Arun Chowdary Sanna

Title: Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains

Abstract:
As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre‑trained components creates significant supply chain vulnerabilities. While previous work has demonstrated behavioral backdoor detection within individual LLM architectures, the critical question of cross‑LLM generalization remains unexplored, a gap with serious implications for organizations deploying multiple AI systems. We present the first systematic study of cross‑LLM behavioral backdoor detection, evaluating generalization across six production LLMs (GPT‑5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT‑OSS 120B, and DeepSeek Chat V3.1). Through 1,198 execution traces and 36 cross‑model experiments, we quantify a critical finding: single‑model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs, a 43.4 percentage point generalization gap equivalent to random guessing. Our analysis reveals that this gap stems from model‑specific behavioral signatures, particularly in temporal features (coefficient of variation > 0.8), while structural features remain stable across architectures. We show that model‑aware detection incorporating model identity as an additional feature achieves 90.6% accuracy universally across all evaluated models. We release our multi‑LLM trace dataset and detection framework to enable reproducible research.

PaperID: 7179, https://arxiv.org/pdf/2511.19865.pdf

Authors: Mingkai Chen, Zijie Feng, Lei Wang, Yaser Khamayseh

Title: Agentic AI-Empowered Conversational Embodied Intelligence Networks in 6G

Abstract:
In the 6G era, semantic collaboration among multiple embodied intelligent devices (MEIDs) becomes crucial for complex task execution. However, existing systems face challenges in multimodal information fusion, adaptive communication, and decision interpretability. To address these limitations, we propose a collaborative Conversational Embodied Intelligence Network (CC‑EIN) integrating multimodal feature fusion, adaptive semantic communication, task coordination, and interpretability. PerceptiNet performs cross‑modal fusion of image and radar data to generate unified semantic representations. An adaptive semantic communication strategy dynamically adjusts coding schemes and transmission power according to task urgency and channel quality. A semantic‑driven collaboration mechanism further supports task decomposition and conflict‑free coordination among heterogeneous devices. Finally, the InDec module enhances decision transparency through Grad‑CAM visualization. Simulation results in post‑earthquake rescue scenarios demonstrate that CC‑EIN achieves 95.4% task completion rate and 95% transmission efficiency while maintaining strong semantic consistency and energy efficiency.

PaperID: 7180, https://arxiv.org/pdf/2511.19798.pdf

Authors: Weizhi Liu, Xi Chen, Zekun Jiang, Liang Zhao, Kunyuan Jiang, Ruisi Tang, Li Wang, Mingke You, Hanyu Zhou, Hongyu Chen, Qiankun Xiong, Yong Nie, Kang Li, Jian Li

Title: KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)

Abstract:
Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource‑limited settings. To address this challenge, we developed KOM, a multi‑agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general‑purpose large language models in imaging analysis and prescription generation. A randomized three‑arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI‑assisted management systems for other chronic conditions.

PaperID: 7181, https://arxiv.org/pdf/2511.19726.pdf

Authors: Roberto Garrone

Title: An Adaptive, Data-Integrated Agent-Based Modeling Framework for Explainable and Contestable Policy Design

Abstract:
Multi‑agent systems often operate under feedback, adaptation, and non‑stationarity, yet many simulation studies retain static decision rules and fixed control parameters. This paper introduces a general adaptive multi‑agent learning framework that integrates: (i) four dynamic regimes distinguishing static versus adaptive agents and fixed versus adaptive system parameters; (ii) information‑theoretic diagnostics (entropy rate, statistical complexity, and predictive information) to assess predictability and structure; (iii) structural causal models for explicit intervention semantics; (iv) procedures for generating agent‑level priors from aggregate or sample data; and (v) unsupervised methods for identifying emergent behavioral regimes. The framework offers a domain‑neutral architecture for analyzing how learning agents and adaptive controls jointly shape system trajectories, enabling systematic comparison of stability, performance, and interpretability across non‑equilibrium, oscillatory, or drifting dynamics. Mathematical definitions, computational operators, and an experimental design template are provided, yielding a structured methodology for developing explainable and contestable multi‑agent decision processes.

PaperID: 7182, https://arxiv.org/pdf/2511.19691.pdf

Authors: Thomas Marshall Vielmetti, Devansh R Agrawal, Dimitra Panagou

Title: Multi-Agent gatekeeper: Safe Flight Planning and Formation Control for Urban Air Mobility

Abstract:
We present Multi‑Agent gatekeeper, a framework that provides provable safety guarantees for leader‑follower formation control in cluttered 3D environments. Existing methods face a trad‑off: online planners and controllers lack formal safety guarantees, while offline planners lack adaptability to changes in the number of agents or desired formation. To address this gap, we propose a hybrid architecture where a single leader tracks a pre‑computed, safe trajectory, which serves as a shared trajectory backup set for all follower agents. Followers execute a nominal formation‑keeping tracking controller, and are guaranteed to remain safe by always possessing a known‑safe backup maneuver along the leader's path. We formally prove this method ensures collision avoidance with both static obstacles and other agents. The primary contributions are: (1) the multi‑agent gatekeeper algorithm, which extends our single‑agent gatekeeper framework to multi‑agent systems; (2) the trajectory backup set for provably safe inter‑agent coordination for leader‑follower formation control; and (3) the first application of the gatekeeper framework in a 3D environment. We demonstrate our approach in a simulated 3D urban environment, where it achieved a 100% collision‑avoidance success rate across 100 randomized trials, significantly outperforming baseline CBF and NMPC methods. Finally, we demonstrate the physical feasibility of the resulting trajectories on a team of quadcopters.

PaperID: 7183, https://arxiv.org/pdf/2511.19690.pdf

Authors: Niklas Haas, Sören Schmitt, Rob van Stee

Title: The Buffer Minimization Problem for Scheduling Flow Jobs with Conflicts

Abstract:
We consider the online buffer minimization in multiprocessor systems with conflicts problem (in short, the buffer minimization problem) in the recently introduced flow model. In an online fashion, workloads arrive on some of the n processors and are stored in an input buffer. Processors can run and reduce these workloads, but conflicts between pairs of processors restrict simultaneous task execution. Conflicts are represented by a graph, where vertices correspond to processors and edges indicate conflicting pairs. An online algorithm must decide which processors are run at a time; so provide a valid schedule respecting the conflict constraints. The objective is to minimize the maximal workload observed across all processors during the schedule. Unlike the original model, where workloads arrive as discrete blocks at specific time points, the flow model assumes workloads arrive continuously over intervals or not at all. We present tight bounds for all graphs with four vertices (except the path, which has been solved previously) and for the families of general complete graphs and complete bipartite graphs. We also recover almost tight bounds for complete k‑partite graphs. For the original model, we narrow the gap for the graph consisting of a triangle and an additional edge to a fourth vertex.

PaperID: 7184, https://arxiv.org/pdf/2511.19675.pdf

Authors: Jiarui Wang, Mahyar Fazlyab

Title: Anytime-Feasible First-Order Optimization via Safe Sequential QCQP

Abstract:
This paper presents the Safe Sequential Quadratically Constrained Quadratic Programming (SS‑QCQP) algorithm, a first‑order method for smooth inequality‑constrained nonconvex optimization that guarantees feasibility at every iteration. The method is derived from a continuous‑time dynamical system whose vector field is obtained by solving a convex QCQP that enforces monotonic descent of the objective and forward invariance of the feasible set. The resulting continuous‑time dynamics achieve an O(1/t) convergence rate to first‑order stationary points under standard constraint qualification conditions. We then propose a safeguarded Euler discretization with adaptive step‑size selection that preserves this convergence rate while maintaining both descent and feasibility in discrete time. To enhance scalability, we develop an active‑set variant (SS‑QCQP‑AS) that selectively enforces constraints near the boundary, substantially reducing computational cost without compromising theoretical guarantees. Numerical experiments on a multi‑agent nonlinear optimal control problem demonstrate that SS‑QCQP and SS‑QCQP‑AS maintain feasibility, exhibit the predicted convergence behavior, and deliver solution quality comparable to second‑order solvers such as SQP and IPOPT.

PaperID: 7185, https://arxiv.org/pdf/2511.19562.pdf

Authors: Abraham Itzhak Weinberg

Title: Trust-Based Social Learning for Communication (TSLEC) Protocol Evolution in Multi-Agent Reinforcement Learning

Abstract:
Emergent communication in multi‑agent systems typically occurs through independent learning, resulting in slow convergence and potentially suboptimal protocols. We introduce TSLEC (Trust‑Based Social Learning with Emergent Communication), a framework where agents explicitly teach successful strategies to peers, with knowledge transfer modulated by learned trust relationships. Through experiments with 100 episodes across 30 random seeds, we demonstrate that trust‑based social learning reduces episodes‑to‑convergence by 23.9% (p < 0.001, Cohen's d = 1.98) compared to independent emergence, while producing compositional protocols (C = 0.38) that remain robust under dynamic objectives (Phi > 0.867 decoding accuracy). Trust scores strongly correlate with teaching quality (r = 0.743, p < 0.001), enabling effective knowledge filtering. Our results establish that explicit social learning fundamentally accelerates emergent communication in multi‑agent coordination.

PaperID: 7186, https://arxiv.org/pdf/2511.19536.pdf

Authors: Yixin Wu, Rui Wen, Chi Cui, Michael Backes, Yang Zhang

Title: AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Abstract:
Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non‑experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose AttackPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT‑4o, achieves a 100.0% task completion rate and near‑expert attack performance, with an average token cost of only 0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi‑agent framework and task‑specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non‑expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.

PaperID: 7187, https://arxiv.org/pdf/2511.19526.pdf

Authors: Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille

Title: Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

Abstract:
We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task‑relevant properties such as material, affordance, function, and physical attributes to support goal‑directed reasoning. While this form of reasoning is fundamental to human cognition, current vision‑language benchmarks lack comprehensive evaluation of this ability and instead focus on surface‑level recognition or image‑text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine‑grained attributes. Using these annotations, we construct a multiple‑choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template‑based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert‑crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision‑language models perform well on recognition tasks but degrade by 10 to 20 percent on property‑driven questions, especially those requiring multi‑step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in‑context reasoning examples from simulated scenes improves performance on real‑world and expert‑curated questions, demonstrating the effectiveness of perceptual‑taxonomy‑guided prompting.

PaperID: 7188, https://arxiv.org/pdf/2511.19524.pdf

Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang

Title: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Abstract:
By leveraging tool‑augmented Multimodal Large Language Models (MLLMs), multi‑agent frameworks are driving progress in video understanding. However, most of them adopt static and non‑learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi‑agent system for video understanding, namely VideoChat‑M1. Instead of using a single or fixed policy, VideoChat‑M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi‑Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat‑M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat‑M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT‑4o by 15.6%.

PaperID: 7189, https://arxiv.org/pdf/2511.19522.pdf

Authors: Jinming Gao, Yijing Wang, Wentao Zhang, Rui Zhao, Yang Shi, Zhiqiang Zuo

Title: Active Secure Neighbor Selection in Multi-Agent Systems with Byzantine Attacks

Abstract:
This paper investigates the problem of resilient control for multi‑agent systems in the presence of Byzantine adversaries via an active secure neighbor selection framework. A pre‑discriminative graph is first constructed to characterize the admissible set of candidate neighbors for each agent. Based on this graph, a dynamic in‑neighbor selection strategy is proposed, wherein each agent actively selects a subset of its pre‑discriminative neighbors. The number of selected neighbors is adjustable, allowing for a trade‑off between communication overhead and robustness, with the minimal case requiring only a single in‑neighbor. The proposed strategy facilitates the reconstruction of a directed spanning tree among normal agents following the detection and isolation of Byzantine agents. It achieves resilient consensus without imposing any assumptions on the initial connectivity among normal agents. Moreover, the approach significantly reduces communication burden while maintaining resilience to adversarial behavior. A numerical example is provided to illustrate the effectiveness of the proposed method.

PaperID: 7190, https://arxiv.org/pdf/2511.19492.pdf

Authors: Parker Whitfill, Ben Snodin, Joel Becker

Title: Forecasting AI Time Horizon Under Compute Slowdowns

Abstract:
METR's time horizon metric has grown exponentially since 2019, along with compute. However, it is unclear whether compute scaling will persist at current rates through 2030, raising the question of how possible compute slowdowns might impact AI agent capability forecasts. Given a model of time horizon as a function of training compute and algorithms, along with a model of how compute investment spills into algorithmic progress (which, notably, precludes the possibility of a software‑only singularity), and the empirical fact that both time horizon and compute have grown at constant rates over 2019‑‑2025, we derive that time horizon growth must be proportional to compute growth. We provide additional, albeit limited, experimental evidence consistent with this theory. We use our model to project time horizon growth under OpenAI's compute projection, finding substantial projected delays in some cases. For example, 1‑month time horizons at 80% reliability occur 7 years later than simple trend extrapolation suggests.

PaperID: 7191, https://arxiv.org/pdf/2511.19489.pdf

Authors: Zhe Zhao, Yuheng Yang, Haibin Wen, Xiaojie Qiu, Zaixi Zhang, Qingfu Zhang

Title: Evolution without an Oracle: Driving Effective Evolution with LLM Judges

Abstract:
The integration of Large Language Models (LLMs) with Evolutionary Computation (EC) has unlocked new frontiers in scientific discovery but remains shackled by a fundamental constraint: the reliance on an Oracle‑‑an objective, machine‑computable fitness function. This paper breaks this barrier by asking: Can evolution thrive in a purely subjective landscape governed solely by LLM judges? We introduce MADE (Multi‑Agent Decomposed Evolution), a framework that tames the inherent noise of subjective evaluation through "Problem Specification." By decomposing vague instructions into specific, verifiable sub‑requirements, MADE transforms high‑variance LLM feedback into stable, precise selection pressure. The results are transformative: across complex benchmarks like DevAI and InfoBench, MADE outperforms strong baselines by over 50% in software requirement satisfaction (39.9% to 61.9%) and achieves a 95% perfect pass rate on complex instruction following. This work validates a fundamental paradigm shift: moving from optimizing "computable metrics" to "describable qualities," thereby unlocking evolutionary optimization for the vast open‑ended domains where no ground truth exists.

PaperID: 7192, https://arxiv.org/pdf/2511.19483.pdf

Authors: Qingsong He, Jing Nan, Jiayu Jiao, Liangjie Tang, Xiaodong Xu, Mengmeng Sun, Qingyao Wang, Minghui Yan

Title: Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation

Abstract:
Large Language Models can break through knowledge and timeliness limitations by invoking external tools within the Model Context Protocol framework to achieve automated execution of complex tasks. However, with the rapid growth of enterprise‑scale MCP services, efficiently and accurately matching target functionalities among thousands of heterogeneous tools has become a core challenge restricting system practicality. Existing approaches generally rely on full‑prompt injection or static semantic retrieval, facing issues including semantic disconnection between user queries and tool descriptions, context inflation in LLM input, and high inference latency. To address these challenges, this paper proposes Z‑Space, a data‑generation‑oriented multi‑agent collaborative tool invocation framework Z‑Space. The Z‑Space framework establishes a multi‑agent collaborative architecture and tool filtering algorithm: (1) A structured semantic understanding of user queries is achieved through an intent parsing model; (2) A tool filtering module (FSWW) based on fused subspace weighted algorithm realizes fine‑grained semantic alignment between intents and tools without parameter tuning; (3) An inference execution agent is constructed to support dynamic planning and fault‑tolerant execution for multi‑step tasks. This framework has been deployed in the Eleme platform's technical division, serving large‑scale test data generation scenarios across multiple business units including Taotian, Gaode, and Hema. Production data demonstrates that the system reduces average token consumption in tool inference by 96.26% while achieving a 92% tool invocation accuracy rate, significantly enhancing the efficiency and reliability of intelligent test data generation systems.

PaperID: 7193, https://arxiv.org/pdf/2511.19454.pdf

Authors: Xiubin Chen

Title: A K-means Inspired Solution Framework for Large-Scale Multi-Traveling Salesman Problems

Abstract:
The Multi‑Traveling Salesman Problem (MTSP) is a commonly used mathematical model for multi‑agent task allocation. However, as the number of agents and task targets increases, existing optimization‑based methods often incur prohibitive computational costs, posing significant challenges to large‑scale coordination in unmanned systems. To address this issue, this paper proposes a K‑means‑inspired task allocation framework that reformulates the MTSP as a spatially constrained classification process. By leveraging spatial coherence, the proposed method enables fast estimation of path costs and efficient task grouping, thereby fundamentally reducing overall computational complexity. Extensive simulation results demonstrate that the framework can maintain high solution quality even in extremely large‑scale scenarios‑for instance, in tasks involving 1000 agents and 5000 targets. The findings indicate that this "cluster‑then‑route" decomposition strategy offers an efficient and reliable solution for large‑scale multi‑agent task allocation.

PaperID: 7194, https://arxiv.org/pdf/2511.19417.pdf

Authors: James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon

Title: Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge‑intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large‑scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi‑agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine‑tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large‑scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open‑source solution, i.e. equipping text‑only DeepSeek‑R1 with Qwen2.5‑VL‑7B perceiver, to outperform large‑scale proprietary VLMs such as GPT‑4o on a wide range of knowledge‑intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi‑agent approach for building future multimodal reasoning systems.

PaperID: 7195, https://arxiv.org/pdf/2511.19405.pdf

Authors: Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch, Aaron Courville

Title: Learning Robust Social Strategies with Large Language Models

Abstract:
As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi‑agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single‑agent regime, prior small‑network results suggest that standard RL in multi‑agent settings often converges to defecting, self‑interested policies. We show the same effect in LLMs: despite cooperative priors, RL‑trained LLM agents develop opportunistic behavior that can exploit even advanced closed‑source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent‑learning awareness algorithm, Advantage Alignment, to fine‑tune LLMs toward multi‑agent cooperation and non‑exploitability. We then introduce a group‑relative baseline that simplifies advantage computation in iterated games, enabling multi‑agent training at LLM scale. We also contribute a novel social dilemma environment, Trust‑and‑Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents. We release all of our code to support future work on multi‑agent RL training for LLMs.

PaperID: 7196, https://arxiv.org/pdf/2511.19368.pdf

Authors: Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan, Guangyu Wu, Zihan Fang, Haotian Meng, Xia Du, Ji-Zhe Zhou, Heming Cui, Jun Luo, Yue Gao

Title: LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

Abstract:
Multi‑agent reinforcement learning (MARL) has been increasingly adopted in many real‑world applications. While MARL enables decentralized deployment on resource‑constrained edge devices, it suffers from severe non‑stationarity due to the synchronous updates of agent policies. This non stationarity results in unstable training and poor policy con vergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)‑driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity‑Aware Expert Demonstration module, which leverages theoretical non‑stationarity bounds to enhance the quality of LLM‑generated expert trajectories, thus providing high reward and training‑stable samples for each agent. Moreover, a Hybrid Expert‑Agent Policy Optimization module adaptively balances each agent's learning from both expert‑generated and agent‑generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state‑of‑the‑art MARL methods.

PaperID: 7197, https://arxiv.org/pdf/2511.19358.pdf

Authors: Paul Dütting, Tomer Ezra, Michal Feldman, Thomas Kesselheim

Title: Black-Box Lifting and Robustness Theorems for Multi-Agent Contracts

Abstract:
Multi‑agent contract design has largely evaluated contracts through the lens of pure Nash equilibria (PNE). This focus, however, is not without loss: In general, the principal can strictly gain by recommending a complex, possibly correlated, distribution over actions, while preserving incentive compatibility. In this work, we extend the analysis of multi‑agent contracts beyond pure Nash equilibria to encompass more general equilibrium notions, including mixed Nash equilibria as well as (coarse‑)correlated equilibria (CCE). The latter, in particular, captures the limiting outcome of agents engaged in learning dynamics. Our main result shows that for submodular and, more generally, XOS rewards, such complex recommendations yield at most a constant‑factor gain: there exists a contract and a PNE whose utility is within a constant factor of the best CCE achievable by any contract. This provides a black‑box lifting: results established against the best PNE automatically apply with respect to the best CCE, with only a constant factor loss. For submodular rewards, we further show how to transform a contract and a PNE of that contract into a new contract such that any of its CCEs gives a constant approximation to the PNE. This yields black‑box robustness: up to constant factors, guarantees established for a specific contract and PNE automatically extend to the modified contract and any of its CCEs. We thus expand prior guarantees for multi‑agent contracts and lower the barrier to new ones. As an important corollary, we obtain poly‑time algorithms for submodular rewards that achieve constant approximations in any CCE, against the best CCE under the best contract. Such worst‑case guarantees are provably unattainable for XOS rewards. Finally, we bound the gap between different equilibrium notions for subadditive, supermodular, and general rewards.

PaperID: 7198, https://arxiv.org/pdf/2511.19334.pdf

Authors: Axel Constant, Mahault Albarracin, Karl J. Friston

Title: Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance

Abstract:
This paper presents a computational account of how legal norms can influence the behavior of artificial intelligence (AI) agents, grounded in the active inference framework (AIF) that is informed by principles of economic legal analysis (ELA). The ensuing model aims to capture the complexity of human decision‑making under legal constraints, offering a candidate mechanism for agent governance in AI systems, that is, the (auto)regulation of AI agents themselves rather than human actors in the AI industry. We propose that lawful and norm‑sensitive AI behavior can be achieved through regulation by design, where agents are endowed with intentional control systems, or behavioral safety valves, that guide real‑time decisions in accordance with normative expectations. To illustrate this, we simulate an autonomous driving scenario in which an AI agent must decide when to yield the right of way by balancing competing legal and pragmatic imperatives. The model formalizes how AIF can implement context‑dependent preferences to resolve such conflicts, linking this mechanism to the conception of law as a scaffold for rational decision‑making under uncertainty. We conclude by discussing how context‑dependent preferences could function as safety mechanisms for autonomous agents, enhancing lawful alignment and risk mitigation in AI governance.

PaperID: 7199, https://arxiv.org/pdf/2511.19327.pdf

Authors: Liwei Yuan, Hideaki Ishii

Title: Dynamic Leader-Follower Consensus with Adversaries: A Multi-Hop Relay Approach

Abstract:
This paper examines resilient dynamic leader‑follower consensus within multi‑agent systems, where agents share first‑order or second‑order dynamics. The aim is to develop distributed protocols enabling nonfaulty/normal followers to accurately track a dynamic/time‑varying reference value of the leader while they may receive misinformation from adversarial neighbors. Our methodologies employ the mean subsequence reduced algorithm with agents engaging with neighbors using multi‑hop communication. We accordingly derive a necessary and sufficient graph condition for our algorithms to succeed; also, our tracking error bounds are smaller than that of the existing method. Furthermore, it is emphasized that even when agents do not use relays, our condition is tighter than the sufficient conditions in the literature. With multi‑hop relays, we can further obtain more relaxed graph requirements. Finally, we present numerical examples to verify the effectiveness of our algorithms.

PaperID: 7200, https://arxiv.org/pdf/2511.19253.pdf

Authors: Boyuan Wu

Title: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Abstract:
Cooperative Multi‑Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high‑dimensional, non‑stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real‑time systems. We propose MAESTRO (Multi‑Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance‑driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large‑scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM‑generated curricula with LLM‑generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk‑adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high‑level designers for cooperative MARL training.

PaperID: 7201, https://arxiv.org/pdf/2511.19146.pdf

Authors: Qian Zhang, Zhuo Sun, Yao Zhang, Zhiwen Yu, Bin Guo, Jun Zhang

Title: VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning

Abstract:
Inter‑agent communication serves as an effective mechanism for enhancing performance in collaborative multi‑agent reinforcement learning(MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time‑critical applications like autonomous driving. In this work, we propose a Value‑of‑Information aware Low‑latency Communication(VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VOI) metric to quantify the importance of delayed message transmission based on each delayed message's importance. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low‑latency transmission of high‑VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.

PaperID: 7202, https://arxiv.org/pdf/2511.19109.pdf

Authors: Mohan Ramesh, Mark Azer, Fabian B. Flohr

Title: HABIT: Human Action Benchmark for Interactive Traffic in CARLA

Abstract:
Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high‑fidelity simulation benchmark. HABIT integrates real‑world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic‑compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state‑of‑the‑art AD agents missed by prior evaluations. Evaluating three state‑of‑the‑art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving close or equal to zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian‑aware AI research.

PaperID: 7203, https://arxiv.org/pdf/2511.19078.pdf

Authors: Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

Title: GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

Abstract:
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi‑step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context‑aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph‑based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi‑step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context‑aware, interpretable, and structured reasoning in a closed‑loop manner. Experiments on various question‑answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi‑step reasoning, validating the effectiveness and generalizability of our approach.

PaperID: 7204, https://arxiv.org/pdf/2511.19031.pdf

Authors: Yuchen Zhou, Haihang Wu

Title: Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors

Abstract:
Monocular Simultaneous Localization and Mapping (SLAM) aims to estimate a robot's pose while simultaneously reconstructing an unknown 3D scene using a single camera. While existing monocular SLAM systems generate detailed 3D geometry through dense scene representations, they are computationally expensive due to the need for iterative optimization. To address this challenge, MASt3R‑SLAM utilizes learned 3D reconstruction priors, enabling more efficient and accurate estimation of both 3D structures and camera poses. However, MASt3R‑SLAM is limited to single‑agent operation. In this paper, we extend MASt3R‑SLAM to introduce the first multi‑agent monocular dense SLAM system. Each agent performs local SLAM using a 3D reconstruction prior, and their individual maps are fused into a globally consistent map through a loop‑closure‑based map fusion mechanism. Our approach improves computational efficiency compared to state‑of‑the‑art methods, while maintaining similar mapping accuracy when evaluated on real‑world datasets.

PaperID: 7205, https://arxiv.org/pdf/2511.18929.pdf

Authors: Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, Liang Lin

Title: Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search

Abstract:
Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that assist humans in open‑future scenarios, where human intentions are highly concurrent and dynamic. In this work, we formalize the problem of Human‑centric Open‑future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across plausible futures. To facilitate this study, we propose HOTD‑Bench, which features over 2K real‑world videos, a semi‑automated annotation pipeline, and a simulation‑based protocol tailored for open‑set future evaluation. Additionally, we propose the Collaborative Multi‑Agent Search Tree (CMAST) framework, which decomposes complex reasoning through a multi‑agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD‑Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.

PaperID: 7206, https://arxiv.org/pdf/2511.18924.pdf

Authors: Arina Kharlamova, Jiawen Liu, Tianyi Zhang, Xinrui Yang, Humaid Alqasimi, Youcheng Sun, Chun Jason Xue

Title: LLM-Driven Kernel Evolution: Automating Driver Updates in Linux

Abstract:
Linux kernel evolution breaks drivers through API/ABI changes, semantic shifts, and security‑hardening updates. We introduce DRIVEBENCH, an executable corpus of kernel\rightarrowdriver co‑evolution cases, and AUTODRIVER, a closed‑loop, LLM‑driven system for automating driver maintenance. The system integrates prompt engineering, multi‑agent collaboration, static analysis, and iterative validation to ensure that generated patches are not only syntactically correct but also functionally and semantically consistent with kernel conventions. The corpus spans v5.10‑v6.10 with 235 validated cases drawn from 612 candidates. In evaluation across 55 cases, AUTODRIVER achieves 56.4% compilation success; QEMU‑based boot verification indicates that compiled patches preserve driver initialization in most instances. By releasing DRIVEBENCH and tooling, we enable reproducible research and a practical route to continuous, safe co‑evolution of drivers with the Linux kernel.

PaperID: 7207, https://arxiv.org/pdf/2511.18868.pdf

Authors: Dezhi Ran, Shuxiao Xie, Mingfang Ji, Anmin Liu, Mengzhou Wu, Yuan Cao, Yuzhe Guo, Hao Yu, Linyi Li, Yitao Hu, Wei Yang, Tao Xie

Title: KernelBand: Steering LLM-based Kernel Optimization via Hardware-Aware Multi-Armed Bandits

Abstract:
High‑performance GPU kernels are critical for efficient LLM serving, yet their optimization remains a bottleneck requiring deep system expertise. While code LLMs show promise in generating functionally correct code, kernel optimization is intrinsically a search problem over a vast optimization space. The fundamental mismatch prevents existing LLM agents from efficiently exploring the optimization space for diverse hardware and compute patterns. To bridge the gap, we present KernelBand, a framework that formulates kernel optimization as a Multi‑Armed Bandit (MAB) problem, explicitly balancing exploration and exploitation to unlock the potential of code LLMs. To navigate the infinite arm space of optimization strategies applied to candidate kernels, we design two key mechanisms: a hardware‑aware pruning strategy via profiling bounds and a trace‑driven clustering algorithm that leverages Lipschitz continuity. Theoretically, we prove that KernelBand reduces the regret bound to depend on the compact covering number of runtime clusters, ensuring sample‑efficient discovery of high‑performance kernels. Extensive experiments on TritonBench‑G with three GPU architectures and four code LLMs show that KernelBand consistently and substantially outperforms state‑of‑the‑art methods with over 33% average improvement.

PaperID: 7208, https://arxiv.org/pdf/2511.18840.pdf

Authors: Binglin Liu, Yucheng Wang, Zheyuan Zhang, Jiyuan Lu, Shen Yang, Daniel Zhang-Li, Huiqin Liu, Jifan Yu

Title: Addressing Situated Teaching Needs: A Multi-Agent Framework for Automated Slide Adaptation

Abstract:
The adaptation of teaching slides to instructors' situated teaching needs, including pedagogical styles and their students' context, is a critical yet time‑consuming task for educators. Through a series of educator interviews, we first identify and systematically categorize the key friction points that impede this adaptation process. Grounded in these findings, we introduce a novel multi‑agent framework designed to automate slide adaptation based on high‑level instructor specifications. An evaluation involving 16 modification requests across 8 real‑world courses validates our approach. The framework's output consistently achieved high scores in intent alignment, content coherence and factual accuracy, and performed on par with baseline methods regarding visual clarity, while also demonstrating appropriate timeliness and a high operational agreement with human experts, achieving an F1 score of 0.89. This work heralds a new paradigm where AI agents handle the logistical burdens of instructional design, liberating educators to focus on the creative and strategic aspects of teaching.

PaperID: 7209, https://arxiv.org/pdf/2511.18761.pdf

Authors: Hao Wu, Shoucheng Song, Chang Yao, Sheng Han, Huaiyu Wan, Youfang Lin, Kai Lv

Title: Think How Your Teammates Think: Active Inference Can Benefit Decentralized Execution

Abstract:
In multi‑agent systems, explicit cognition of teammates' decision logic serves as a critical factor in facilitating coordination. Communication (i.e., ``Tell'') can assist in the cognitive development process by information dissemination, yet it is inevitably subject to real‑world constraints such as noise, latency, and attacks. Therefore, building the understanding of teammates' decisions without communication remains challenging. To address this, we propose a novel non‑communication MARL framework that realizes the construction of cognition through local observation‑based modeling (i.e., ``Think''). Our framework enables agents to model teammates' active inference process. At first, the proposed method produces three teammate portraits: perception‑belief‑action. Specifically, we model the teammate's decision process as follows: 1) Perception: observing environments; 2) Belief: forming beliefs; 3) Action: making decisions. Then, we selectively integrate the belief portrait into the decision process based on the accuracy and relevance of the perception portrait. This enables the selection of cooperative teammates and facilitates effective collaboration. Extensive experiments on the SMAC, SMACv2, MPE, and GRF benchmarks demonstrate the superior performance of our method.

PaperID: 7210, https://arxiv.org/pdf/2511.18760.pdf

Authors: Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, Farzan Farnia

Title: HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Abstract:
Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and enabling efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler in systems such as Lean, but lacks the exploratory freedom of informal problem solving. This mismatch leaves current LLM‑based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool‑assisted agent that explicitly interleaves informal reasoning with formally verified proof steps in Lean. The framework performs intermediate formal checking to prevent reasoning drift and employs a memory module that maintains proof continuity across long, multi‑step reasoning chains, enabling both exploration and verification within a single workflow. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state‑of‑the‑art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing token usage and computational cost compared to reward‑based approaches. On difficult datasets such as AIME'25, Hermes achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs. The implementation and codebase are publicly available at https://github.com/aziksh‑ospanov/HERMES.

PaperID: 7211, https://arxiv.org/pdf/2511.18715.pdf

Authors: Shaoyin Ma, Chenggong Hu, Huiqiong Wang, Li Sun, Mingli Song, Jie Song

Title: HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions

Abstract:
Building effective LLM agents increasingly requires selecting appropriate AI models as tools from large open repositories (e.g., HuggingFace with > 2M models) based on natural language requests. Unlike invoking a fixed set of API tools, repository‑scale model selection must handle massive, evolving candidates with incomplete metadata. Existing approaches incorporate full model descriptions into prompts, resulting in prompt bloat, excessive token costs, and limited scalability. To address these issues, we propose HuggingR^4, the first framework to recast model selection as an iterative reasoning process rather than one‑shot retrieval. By synergistically integrating Reasoning, Retrieval, Refinement, and Reflection, HuggingR^4 progressively decomposes user intent, retrieves candidates through multi‑round deliberation, refines selections via fine‑grained analysis, and validates results through reflection. To facilitate rigorous evaluation, we introduce a large‑scale benchmark comprising 14,399 diverse user requests across 37 task categories. Experiments demonstrate that HuggingR^4 achieves 92.03% workability and 82.46% reasonability‑outperforming current state‑of‑the‑art baselines by 26.51% and 33.25%, respectively, while reducing token consumption by 6.9 ×.

PaperID: 7212, https://arxiv.org/pdf/2511.18714.pdf

Authors: Zhenyu Wu, Jian Li, Hua Huang

Title: MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation

Abstract:
Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA‑Edu, a self‑reflective multi‑agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA‑Edu employs a two‑stage co‑evolutionary pipeline: (1) a generation‑verification‑reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code‑based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self‑reflection modules that evaluate and revise outputs until domain‑specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA‑Edu over state‑of‑the‑art MLLMs. Compared to GPT‑4o, MAGMA‑Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image‑text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA‑Edu achieves the highest scores (Avg‑Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self‑reflective multi‑agent collaboration in pedagogically aligned vision‑language reasoning.

PaperID: 7213, https://arxiv.org/pdf/2511.18671.pdf

Authors: Yan Wang, Ke Deng, Yongli Ren

Title: Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized‑decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per‑agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade‑off, we propose the multi‑agent cross‑entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high‑value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off‑policy learning with a modified k‑step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state‑of‑the‑art methods across both continuous and discrete action benchmarks.

PaperID: 7214, https://arxiv.org/pdf/2511.18622.pdf

Authors: Michael J. Bommarito

Title: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Abstract:
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi‑agent procedural generation pipeline with schema‑validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under 1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content ‑‑ definitions, examples, collocations, encyclopedias, etymology ‑‑ that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC‑BY 4.0, enabling researchers and educators to build upon and adapt this resource.

PaperID: 7215, https://arxiv.org/pdf/2511.18604.pdf

Authors: Hannah Lee, James D. Motes, Marco Morales, Nancy M. Amato

Title: An Analysis of Constraint-Based Multi-Agent Pathfinding Algorithms

Abstract:
This study informs the design of future multi‑agent pathfinding (MAPF) and multi‑robot motion planning (MRMP) algorithms by guiding choices based on constraint classification for constraint‑based search algorithms. We categorize constraints as conservative or aggressive and provide insights into their search behavior, focusing specifically on vanilla Conflict‑Based Search (CBS) and Conflict‑Based Search with Priorities (CBSw/P). Under a hybrid grid‑roadmap representation with varying resolution, we observe that aggressive (priority constraint) formulations tend to solve more instances as agent count or resolution increases, whereas conservative (motion constraint) formulations yield stronger solution quality when both succeed. Findings are synthesized in a decision flowchart, aiding users in selecting suitable constraints. Recommendations extend to Multi‑Robot Motion Planning (MRMP), emphasizing the importance of considering topological features alongside problem, solution, and representation features. A comprehensive exploration of the study, including raw data and map performance, is available in our public GitHub Repository: https://GitHub.com/hannahjmlee/constraint‑mapf‑analysis

PaperID: 7216, https://arxiv.org/pdf/2511.18579.pdf

Authors: Kooktae Lee, Ethan Brook

Title: Connectivity-Preserving Multi-Agent Area Coverage via Optimal-Transport-Based Density-Driven Optimal Control (D2OC)

Abstract:
Multi‑agent systems play a central role in area coverage tasks across search‑and‑rescue, environmental monitoring, and precision agriculture. Achieving non‑uniform coverage, where spatial priorities vary across the domain, requires coordinating agents while respecting dynamic and communication constraints. Density‑driven approaches can distribute agents according to a prescribed reference density, but existing methods do not ensure connectivity. This limitation often leads to communication loss, reduced coordination, and degraded coverage performance. This letter introduces a connectivity‑preserving extension of the Density‑Driven Optimal Control (D2OC) framework. The coverage objective, defined using the Wasserstein distance between the agent distribution and the reference density, admits a convex quadratic program formulation. Communication constraints are incorporated through a smooth connectivity penalty, which maintains strict convexity, supports distributed implementation, and preserves inter‑agent communication without imposing rigid formations. Simulation studies show that the proposed method consistently maintains connectivity, improves convergence speed, and enhances non‑uniform coverage quality compared with density‑driven schemes that do not incorporate explicit connectivity considerations.

PaperID: 7217, https://arxiv.org/pdf/2511.18548.pdf

Authors: Maria Stella Albarelli

Title: Emotion-Aware Conversational Recommender Systems: a Case Study

Abstract:
In recent years, online shopping has grown rapidly, especially during the COVID‑19 period. However, it still lacks elements typical of physical stores, such as empathic support and personalised advice from a sales assistant. This study explores how an emotion‑aware Conversational Agent (CA) can improve the online shopping experience by responding to user emotions in a more natural and human way. The project focuses on Gala, a virtual assistant developed for the Galeries Lafayette website, capable of recognising emotional states from voice messages and adapting its responses accordingly. User needs were first analysed through semi‑structured interviews, which informed the design of Gala's UX and functionalities. Gala was implemented using the OpenAI API and the Galeries Lafayette API, adopting a Content‑Based recommendation approach. Through Natural Language Processing, it interprets user requests and retrieves products aligned with specific attributes such as name, price, and brand, enabling fluid dialogue and tailored suggestions. Two user studies were conducted: a usability test and a comparative evaluation between a standard CA and Gala's emotion‑aware version. The results highlight the potential of emotion‑aware CAs to make online shopping faster, more engaging, and closer to an in‑store guided experience.

PaperID: 7218, https://arxiv.org/pdf/2511.18531.pdf

Authors: Akashdeep Saha, Zeng Wang, Prithwish Basu Roy, Johann Knechtel, Ozgur Sinanoglu, Ramesh Karri

Title: LockForge: Automating Paper-to-Code for Logic Locking with Multi-Agent Reasoning LLMs

Abstract:
Despite rapid progress in logic locking (LL), reproducibility remains a challenge as codes are rarely made public. We present LockForge, a first‑of‑its‑kind, multi‑agent large language model (LLM) framework that turns LL descriptions in papers into executable and tested code. LockForge provides a carefully crafted pipeline realizing forethought, implementation, iterative refinement, and a multi‑stage validation, all to systematically bridge the gap between prose and practice for complex LL schemes. For validation, we devise (i) an LLM‑as‑Judge stage with a scoring system considering behavioral checks, conceptual mechanisms, structural elements, and reproducibility on benchmarks, and (ii) an independent LLM‑as‑Examiner stage for ground‑truth assessment. We apply LockForge to 10 seminal LL schemes, many of which lack reference implementations. Our evaluation on multiple SOTA LLMs, including ablation studies, reveals the significant complexity of the task. We show that an advanced reasoning model and a sophisticated, multi‑stage framework like LockForge are required. We release all implementations and benchmarks, providing a reproducible and fair foundation for evaluation of further LL research.

PaperID: 7219, https://arxiv.org/pdf/2511.18528.pdf

Authors: Renyi Zhong, Yintong Huo, Wenwei Gu, Yichen Li, Michael R. Lyu

Title: End-to-End Automated Logging via Multi-Agent Framework

Abstract:
Software logging is critical for system observability, yet developers face a dual crisis of costly overlogging and risky underlogging. Existing automated logging tools often overlook the fundamental whether‑to‑log decision and struggle with the composite nature of logging. In this paper, we propose Autologger, a novel hybrid framework that addresses the complete the end‑to‑end logging pipeline. Autologger first employs a fine‑tuned classifier, the Judger, to accurately determine if a method requires new logging statements. If logging is needed, a multi‑agent system is activated. The system includes specialized agents: a Locator dedicated to determining where to log, and a Generator focused on what to log. These agents work together, utilizing our designed program analysis and retrieval tools. We evaluate Autologger on a large corpus from three mature open‑source projects against state‑of‑the‑art baselines. Our results show that Autologger achieves 96.63% F1‑score on the crucial whether‑to‑log decision. In an end‑to‑end setting, Autologger improves the overall quality of generated logging statements by 16.13% over the strongest baseline, as measured by an LLM‑as‑a‑judge score. We also demonstrate that our framework is generalizable, consistently boosting the performance of various backbone LLMs.

PaperID: 7220, https://arxiv.org/pdf/2511.18486.pdf

Authors: Jasan Zughaibi, Denis von Arx, Maurus Derungs, Florian Heemeyer, Luca A. Antonelli, Quentin Boehler, Michael Muehlebach, Bradley J. Nelson

Title: Expanding the Workspace of Electromagnetic Navigation Systems Using Dynamic Feedback for Single- and Multi-agent Control

Abstract:
Electromagnetic navigation systems (eMNS) enable a number of magnetically guided surgical procedures. A challenge in magnetically manipulating surgical tools is that the effective workspace of an eMNS is often severely constrained by power and thermal limits. We show that system‑level control design significantly expands this workspace by reducing the currents needed to achieve a desired motion. We identified five key system approaches that enable this expansion: (i) motion‑centric torque/force objectives, (ii) energy‑optimal current allocation, (iii) real‑time pose estimation, (iv) dynamic feedback, and (v) high‑bandwidth eMNS components. As a result, we stabilize a 3D inverted pendulum on an eight‑coil OctoMag eMNS with significantly lower currents (0.1‑0.2 A vs. 8‑14 A), by replacing a field‑centric field‑alignment strategy with a motion‑centric torque/force‑based approach. We generalize to multi‑agent control by simultaneously stabilizing two inverted pendulums within a shared workspace, exploiting magnetic‑field nonlinearity and coil redundancy for independent actuation. A structured analysis compares the electromagnetic workspaces of both paradigms and examines current‑allocation strategies that map motion objectives to coil currents. Cross‑platform evaluation of the clinically oriented Navion eMNS further demonstrates substantial workspace expansion by maintaining stable balancing at distances up to 50 cm from the coils. The results demonstrate that feedback is a practical path to scalable, efficient, and clinically relevant magnetic manipulation.

PaperID: 7221, https://arxiv.org/pdf/2511.18467.pdf

Authors: Xiaoqing Wang, Keman Huang, Bin Liang, Hongyu Li, Xiaoyong Du

Title: Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems

Abstract:
The rapid advancement of Large Language Model (LLM)‑driven multi‑agent systems has significantly streamlined software developing tasks, enabling users with little technical expertise to develop executable applications. While these systems democratize software creation through natural language requirements, they introduce significant security risks that remain largely unexplored. We identify two risky scenarios: Malicious User with Benign Agents (MU‑BA) and Benign User with Malicious Agents (BU‑MA). We introduce the Implicit Malicious Behavior Injection Attack (IMBIA), demonstrating how multi‑agent systems can be manipulated to generate software with concealed malicious capabilities beneath seemingly benign applications, and propose Adv‑IMBIA as a defense mechanism. Evaluations across ChatDev, MetaGPT, and AgentVerse frameworks reveal varying vulnerability patterns, with IMBIA achieving attack success rates of 93%, 45%, and 71% in MU‑BA scenarios, and 71%, 84%, and 45% in BU‑MA scenarios. Our defense mechanism reduced attack success rates significantly, particularly in the MU‑BA scenario. Further analysis reveals that compromised agents in the coding and testing phases pose significantly greater security risks, while also identifying critical agents that require protection against malicious user exploitation. Our findings highlight the urgent need for robust security measures in multi‑agent software development systems and provide practical guidelines for implementing targeted, resource‑efficient defensive strategies.

PaperID: 7222, https://arxiv.org/pdf/2511.18450.pdf

Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu

Title: ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Abstract:
Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models(MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi‑step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi‑step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances,each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi‑step Spatial Reasoning, Spatial Relationship Prediction, and End‑to‑End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

PaperID: 7223, https://arxiv.org/pdf/2511.18449.pdf

Authors: Paniz Parastar, Giuseppe Caso, Jesus Alberto Omana Iglesias, Andra Lutu, Ozgu Alay

Title: Energy-Efficient Task Computation at the Edge for Vehicular Services

Abstract:
Multi‑access edge computing (MEC) is a promising solution for providing the computational resources and low latency required by vehicular services such as autonomous driving. It enables cars to offload computationally intensive tasks to nearby servers. Effective offloading involves determining when to offload tasks, selecting the appropriate MEC site, and efficiently allocating resources to ensure good performance. Car mobility poses significant challenges to guaranteeing reliable task completion, and today we still lack energy efficient solutions to this problem, especially when considering real‑world car mobility traces. In this paper, we begin by examining the mobility patterns of cars using data obtained from a leading mobile network operator in Europe. Based on the insights from this analysis, we design an optimization problem for task computation and offloading, considering both static and mobility scenarios. Our objective is to minimize the total energy consumption at the cars and at the MEC nodes while satisfying the latency requirements of various tasks. We evaluate our solution, based on multi‑agent reinforcement learning, both in simulations and in a realistic setup that relies on datasets from the operator. Our solution shows a significant reduction of user dissatisfaction and task interruptions in both static and mobile scenarios, while achieving energy savings of 47 percent in the static case and 14 percent in the mobile case compared to state‑of‑the‑art schemes.

PaperID: 7224, https://arxiv.org/pdf/2511.18438.pdf

Authors: Xiangrui Zhang, Zeyu Chen, Haining Wang, Qiang Li

Title: LLMs as Firmware Experts: A Runtime-Grown Tree-of-Agents Framework

Abstract:
Large Language Models (LLMs) and their agent systems have recently demonstrated strong potential in automating code reasoning and vulnerability detection. However, when applied to large‑scale firmware, their performance degrades due to the binary nature of firmware, complex dependency structures, and heterogeneous components. To address this challenge, this paper presents FIRMHIVE, a recursive agent hive that enables LLMs to act as autonomous firmware security analysts. FIRMHIVE introduces two key mechanisms: (1) transforming delegation into a per‑agent, executable primitive and (2) constructing a runtime Tree of Agents (ToA) for decentralized coordination. We evaluate FIRMHIVE using real‑world firmware images obtained from publicly available datasets, covering five representative security analysis tasks. Compared with existing LLM‑agent baselines, FIRMHIVE performs deeper (about 16x more reasoning steps) and broader (about 2.3x more files inspected) cross‑file exploration, resulting in about 5.6x more alerts per firmware. Compared to state‑of‑the‑art (SOTA) security tools, FIRMHIVE identifies about 1.5x more vulnerabilities (1,802 total) and achieves 71% precision, representing significant improvements in both yield and fidelity.

PaperID: 7225, https://arxiv.org/pdf/2511.18413.pdf

Authors: Yu Xia, Sungchul Kim, Tong Yu, Ryan A. Rossi, Julian McAuley

Title: Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations

Abstract:
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single‑agent plan‑execute workflows or multi‑agent task decomposition pipelines. Without recommendation‑oriented design, they often underuse the collaborative signals in the user‑item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi‑Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM‑based multi‑agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.

PaperID: 7226, https://arxiv.org/pdf/2511.18368.pdf

Authors: Xiaoming He, Gaofeng Wang, Huajun Cui, Rui Yuan, Haitao Zhao

Title: Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

Abstract:
Autonomous Aerial Vehicle (AAV)‑assisted Internet of Things (IoT) represents a collaborative architecture in which AAV allocate resources over 6G links to jointly enhance user‑intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions on one component reinforce the efficiency of others, making highly reliable intent prediction and low‑latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high‑dimensional action sequences and managing intensive on‑board computation. We propose an Intent‑Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision‑making, where AAV must respond to user intent while planning trajectories, we design Double Actions based Multi‑Agent Proximal Policy Optimization (DA‑MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user‑intent network into the trajectory network to maintain action dependencies. We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA‑MAPPO achieve superior performance across diverse scenarios.

PaperID: 7227, https://arxiv.org/pdf/2511.18313.pdf

Authors: Joseph Oladokun

Title: Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search

Abstract:
Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path‑Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG‑6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24‑32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path‑constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.

PaperID: 7228, https://arxiv.org/pdf/2511.18271.pdf

Authors: Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, Jialin Gao

Title: Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Abstract:
Text‑to‑image (T2I) models today are capable of producing photorealistic, instruction‑following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single‑round VQA‑based scoring, leaving critical dimensions such as knowledge grounding, multi‑physics interactions, and auditable evidence‑substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine‑grained evaluation, we propose PW‑Agent, an evidence‑grounded multi‑agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning‑aware, knowledge‑integrative architectures in future T2I systems.

PaperID: 7229, https://arxiv.org/pdf/2511.18259.pdf

Authors: Xiaochen Zheng, Alvaro Serra, Ilya Schneider Chernov, Maddalena Marchesi, Eunice Musvasva, Tatyana Y. Doktorova

Title: DiscoVerse: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation

Abstract:
Pharmaceutical research and development has accumulated vast and heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi‑agent co‑scientist designed to support pharmaceutical research and development at Roche. Designed as a human‑in‑the‑loop assistant, DiscoVerse enables domain‑specific queries by delivering evidence‑based answers: it retrieves relevant data, links across documents, summarises key findings and preserves institutional memory. We assess DiscoVerse through expert evaluation of source‑linked outputs. Our evaluation spans a selected subset of 180 molecules from Roche's research and development repositories, encompassing over 0.87 billion BPE tokens and more than four decades of research. To our knowledge, this represents the first agentic framework to be systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential archives covering the full lifecycle of drug development. Our contributions include: role‑specialized agent designs aligned with scientist workflows; human‑in‑the‑loop support for reverse translation; expert evaluation; and a large‑scale demonstration showing promising decision‑making insights. In brief, across seven benchmark queries, DiscoVerse achieved near‑perfect recall (\geq 0.99) with moderate precision (0.71‑0.91). Qualitative assessments and three real‑world pharmaceutical use cases further showed faithful, source‑linked synthesis across preclinical and clinical evidence.

PaperID: 7230, https://arxiv.org/pdf/2511.18258.pdf

Authors: Mojtaba A. Farahani, Md Irfan Khan, Thorsten Wuest

Title: Hybrid Agentic AI and Multi-Agent Systems in Smart Manufacturing

Abstract:
The convergence of Agentic AI and MAS enables a new paradigm for intelligent decision making in SMS. Traditional MAS architectures emphasize distributed coordination and specialized autonomy, while recent advances in agentic AI driven by LLMs introduce higher order reasoning, planning, and tool orchestration capabilities. This paper presents a hybrid agentic AI and multi agent framework for a Prescriptive Maintenance use case, where LLM based agents provide strategic orchestration and adaptive reasoning, complemented by rule based and SLMs agents performing efficient, domain specific tasks on the edge. The proposed framework adopts a layered architecture that consists of perception, preprocessing, analytics, and optimization layers, coordinated through an LLM Planner Agent that manages workflow decisions and context retention. Specialized agents autonomously handle schema discovery, intelligent feature analysis, model selection, and prescriptive optimization, while a HITL interface ensures transparency and auditability of generated maintenance recommendations. This hybrid design supports dynamic model adaptation, cost efficient maintenance scheduling, and interpretable decision making. An initial proof of concept implementation is validated on two industrial manufacturing datasets. The developed framework is modular and extensible, supporting seamless integration of new agents or domain modules as capabilities evolve. The results demonstrate the system capability to automatically detect schema, adapt preprocessing pipelines, optimize model performance through adaptive intelligence, and generate actionable, prioritized maintenance recommendations. The framework shows promise in achieving improved robustness, scalability, and explainability for RxM in smart manufacturing, bridging the gap between high level agentic reasoning and low level autonomous execution.

PaperID: 7231, https://arxiv.org/pdf/2511.18249.pdf

Authors: Mostafijur Rahman Akhond, Gias Uddin

Title: LLM Assisted Coding with Metamorphic Specification Mutation Agent

Abstract:
Metamorphic Relations (MRs) serve as a foundational mechanism for generating semantically equivalent mutations. Software engineering has advanced significantly in recent years with the advent of Large Language Models (LLMs). However, the reliability of LLMs in software engineering is often compromised by ambiguities and inconsistencies due to improper user specification. To address this challenge, we present CodeMetaAgent (CMA), a metamorphic relation‑driven LLM agent that systematically refines task specifications and generates semantically constrained test cases. Our proposed framework uses MRs with LLMs to improve generation consistency and reduce variability caused by specifications, unlike the traditional use of MRs as post validations. Our framework has been evaluated on the HumanEval‑Pro, MBPP‑Pro, and SWE‑Bench_Lite datasets using the GPT‑4o, Mistral Large, GPT‑OSS, and Qwen3‑Coder models. It improved code generation accuracy by up to 17% and achieved code coverage gains of up to 99.81%. These results show that metamorphic relations can be a simple but effective guide in assisting LLM‑based software development.

PaperID: 7232, https://arxiv.org/pdf/2511.18248.pdf

Authors: Wei Zhen Teoh

Title: Coherent Multi-Agent Trajectory Forecasting in Team Sports with CausalTraj

Abstract:
Jointly forecasting trajectories of multiple interacting agents is a core challenge in sports analytics and other domains involving complex group dynamics. Accurate prediction enables realistic simulation and strategic understanding of gameplay evolution. Most existing models are evaluated solely on per‑agent accuracy metrics (minADE, minFDE), which assess each agent independently on its best‑of‑k prediction. However these metrics overlook whether the model learns which predicted trajectories can jointly form a plausible multi‑agent future. Many state‑of‑the‑art models are designed and optimized primarily based on these metrics. As a result, they may underperform on joint predictions and also fail to generate coherent, interpretable multi‑agent scenarios in team sports. We propose CausalTraj, a temporally causal, likelihood‑based model that is built to generate jointly probable multi‑agent trajectory forecasts. To better assess collective modeling capability, we emphasize joint metrics (minJADE, minJFDE) that measure joint accuracy across agents within the best generated scenario sample. Evaluated on the NBA SportVU, Basketball‑U, and Football‑U datasets, CausalTraj achieves competitive per‑agent accuracy and the best recorded results on joint metrics, while yielding qualitatively coherent and realistic gameplay evolutions.

PaperID: 7233, https://arxiv.org/pdf/2511.18194.pdf

Authors: Faheem Nizar, Elias Lumer, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah

Title: Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems

Abstract:
Recent advances in Large Language Model Multi‑Agent Systems enable scalable orchestration and retrieval of specialized, parallelized subagents, each equipped with hundreds or thousands of Model Context Protocol (MCP) servers and tools. However, existing agent, MCP, and retrieval methods typically match queries against a single agent description, obscuring fine‑grained tool capabilities of each agent, resulting in suboptimal agent selection. We introduce Agent‑as‑a‑Graph retrieval, a knowledge graph retrieval augmented generation approach that represents both tools and their parent agents as nodes and edges in a knowledge graph. During retrieval, i) relevant agents and tool nodes are first retrieved through vector search, ii) we apply a type‑specific weighted reciprocal rank fusion (wRRF) for reranking tools and agents, and iii) parent agents are traversed in the knowledge graph for the final set of agents. We evaluate Agent‑as‑a‑Graph on the LiveMCPBenchmark, achieving 14.9% and 14.6% improvements in Recall@5 and nDCG@5 over prior state‑of‑the‑art retrievers, and 2.4% improvements in wRRF optimizations.

PaperID: 7234, https://arxiv.org/pdf/2511.18181.pdf

Authors: Adam Callaghan, Karl Mason, Patrick Mannion

Title: MOMA-AC: A preference-driven actor-critic framework for continuous multi-objective multi-agent reinforcement learning

Abstract:
This paper addresses a critical gap in Multi‑Objective Multi‑Agent Reinforcement Learning (MOMARL) by introducing the first dedicated inner‑loop actor‑critic framework for continuous state and action spaces: Multi‑Objective Multi‑Agent Actor‑Critic (MOMA‑AC). Building on single‑objective, single‑agent algorithms, we instantiate this framework with Twin Delayed Deep Deterministic Policy Gradient (TD3) and Deep Deterministic Policy Gradient (DDPG), yielding MOMA‑TD3 and MOMA‑DDPG. The framework combines a multi‑headed actor network, a centralised critic, and an objective preference‑conditioning architecture, enabling a single neural network to encode the Pareto front of optimal trade‑off policies for all agents across conflicting objectives in a continuous MOMARL setting. We also outline a natural test suite for continuous MOMARL by combining a pre‑existing multi‑agent single‑objective physics simulator with its multi‑objective single‑agent counterpart. Evaluating cooperative locomotion tasks in this suite, we show that our framework achieves statistically significant improvements in expected utility and hypervolume relative to outer‑loop and independent training baselines, while demonstrating stable scalability as the number of agents increases. These results establish our framework as a foundational step towards robust, scalable multi‑objective policy learning in continuous multi‑agent domains.

PaperID: 7235, https://arxiv.org/pdf/2511.18114.pdf

Authors: Itay Hazan, Yael Mathov, Guy Shtar, Ron Bitton, Itsik Mantin

Title: ASTRA: Agentic Steerability and Risk Assessment Framework

Abstract:
Securing AI agents powered by Large Language Models (LLMs) represents one of the most critical challenges in AI security today. Unlike traditional software, AI agents leverage LLMs as their "brain" to autonomously perform actions via connected tools. This capability introduces significant risks that go far beyond those of harmful text presented in a chatbot that was the main application of LLMs. A compromised AI agent can deliberately abuse powerful tools to perform malicious actions, in many cases irreversible, and limited solely by the guardrails on the tools themselves and the LLM ability to enforce them. This paper presents ASTRA, a first‑of‑its‑kind framework designed to evaluate the effectiveness of LLMs in supporting the creation of secure agents that enforce custom guardrails defined at the system‑prompt level (e.g., "Do not send an email out of the company domain," or "Never extend the robotic arm in more than 2 meters"). Our holistic framework simulates 10 diverse autonomous agents varying between a coding assistant and a delivery drone equipped with 37 unique tools. We test these agents against a suite of novel attacks developed specifically for agentic threats, inspired by the OWASP Top 10 but adapted to challenge the ability of the LLM for policy enforcement during multi‑turn planning and execution of strict tool activation. By evaluating 13 open‑source, tool‑calling LLMs, we uncovered surprising and significant differences in their ability to remain secure and keep operating within their boundaries. The purpose of this work is to provide the community with a robust and unified methodology to build and validate better LLMs, ultimately pushing for more secure and reliable agentic AI systems.

PaperID: 7236, https://arxiv.org/pdf/2511.18038.pdf

Authors: Xiaoke Han, Hong Zhu

Title: MASTEST: A LLM-Based Multi-Agent System For RESTful API Tests

Abstract:
Testing RESTful API is increasingly important in quality assurance of cloud‑native applications. Recent advances in machine learning (ML) techniques have demonstrated that various testing activities can be performed automatically by large language models (LLMs) with reasonable accuracy. This paper develops a multi‑agent system called MASTEST that combines LLM‑based and programmed agents to form a complete tool chain that covers the whole workflow of API test starting from generating unit and system test scenarios from API specification in the OpenAPI Swagger format, to generating of Pytest test scripts, executing test scripts to interact with web services, to analysing web service response messages to determine test correctness and calculate test coverage. The system also supports the incorporation of human testers in reviewing and correcting LLM generated test artefacts to ensure the quality of testing activities. MASTEST system is evaluated on two LLMs, GPT‑4o and DeepSeek V3.1 Reasoner with five public APIs. The performances of LLMs on various testing activities are measured by a wide range of metrics, including unit and system test scenario coverage and API operation coverage for the quality of generated test scenarios, data type correctness, status code coverage and script syntax correctness for the quality of LLM generated test scripts, as well as bug detection ability and usability of LLM generated test scenarios and scripts. Experiment results demonstrated that both DeepSeek and GPT‑4o achieved a high overall performance. DeepSeek excels in data type correctness and status code detection, while GPT‑4o performs best in API operation coverage. For both models, LLM generated test scripts maintained 100% syntax correctness and only required minimal manual edits for semantic correctness. These findings indicate the effectiveness and feasibility of MASTEST.

PaperID: 7237, https://arxiv.org/pdf/2511.18036.pdf

Authors: Ziyi Guo, Zhou Liu, Wentao Zhang

Title: Paper2SysArch: Structure-Constrained System Architecture Generation from Scientific Papers

Abstract:
The manual creation of system architecture diagrams for scientific papers is a time‑consuming and subjective process, while existing generative models lack the necessary structural control and semantic understanding for this task. A primary obstacle hindering research and development in this domain has been the profound lack of a standardized benchmark to quantitatively evaluate the automated generation of diagrams from text. To address this critical gap, we introduce a novel and comprehensive benchmark, the first of its kind, designed to catalyze progress in automated scientific visualization. It consists of 3,000 research papers paired with their corresponding high‑quality ground‑truth diagrams and is accompanied by a three‑tiered evaluation metric assessing semantic accuracy, layout coherence, and visual quality. Furthermore, to establish a strong baseline on this new benchmark, we propose Paper2SysArch, an end‑to‑end system that leverages multi‑agent collaboration to convert papers into structured, editable diagrams. To validate its performance on complex cases, the system was evaluated on a manually curated and more challenging subset of these papers, where it achieves a composite score of 69.0. This work's principal contribution is the establishment of a large‑scale, foundational benchmark to enable reproducible research and fair comparison. Meanwhile, our proposed system serves as a viable proof‑of‑concept, demonstrating a promising path forward for this complex task.

PaperID: 7238, https://arxiv.org/pdf/2511.17943.pdf

Authors: Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, Hehe Fan

Title: SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

Abstract:
Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step‑wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self‑evolving multi‑agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan‑Do‑Study‑Act philosophy into a self‑evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert‑verified and literature‑grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed‑source MLLMs (e.g., Gemini, GPT‑4o) and state‑of‑the‑art video agents on the benchmark, establishing a new paradigm for the community.

PaperID: 7239, https://arxiv.org/pdf/2511.17915.pdf

Authors: Yao Liu, Sampad Mohanty, Elizabeth Ondula, Bhaskar Krishnamachari

Title: DISPATCH -- Decentralized Informed Spatial Planning and Assignment of Tasks for Cooperative Heterogeneous Agents

Abstract:
Spatial task allocation in systems such as multi‑robot delivery or ride‑sharing requires balancing efficiency with fair service across tasks. Greedy assignment policies that match each agent to its highest‑preference or lowest‑cost task can maximize efficiency but often create inequities: some tasks receive disproportionately favorable service (e.g., shorter delays or better matches), while others face long waits or poor allocations. We study fairness in heterogeneous multi‑agent systems where tasks vary in preference alignment and urgency. Most existing approaches either assume centralized coordination or largely ignore fairness under partial observability. Distinct from this prior work, we establish a connection between the Eisenberg‑Gale (EG) equilibrium convex program and decentralized, partially observable multi‑agent learning. Building on this connection, we develop two equilibrium‑informed algorithms that integrate fairness and efficiency: (i) a multi‑agent reinforcement learning (MARL) framework, EG‑MARL, whose training is guided by a centralized EG equilibrium assignment algorithm; and (ii) a stochastic online optimization mechanism that performs guided exploration and subset‑based fair assignment as tasks are discovered. We evaluate on Multi‑Agent Particle Environment (MPE) simulations across varying team sizes against centralized EG, Hungarian, and Min‑Max distance baselines, and also present a Webots‑based warehouse proof‑of‑concept with heterogeneous robots. Both methods preserve the fairness‑efficiency balance of the EG solution under partial observability, with EG‑MARL achieving near‑centralized coordination and reduced travel distances, and the online mechanism enabling real‑time allocation with competitive fairness.

PaperID: 7240, https://arxiv.org/pdf/2511.17910.pdf

Authors: Yuliang Zhan, Xinyu Tang, Han Wan, Jian Li, Ji-Rong Wen, Hao Sun

Title: L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

Abstract:
Recently, Chain‑of‑Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision‑Language Models (VLMs) still struggle with multi‑step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either need high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low‑frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V‑CoT, a novel training‑free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V‑CoT extracts and resamples low‑frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training‑free baselines and even surpasses supervised methods.

PaperID: 7241, https://arxiv.org/pdf/2511.17906.pdf

Authors: Wen-Fan Wang, Chien-Ting Lu, Jin Ping Ng, Yi-Ting Chiu, Ting-Ying Lee, Miaosen Wang, Bing-Yu Chen, Xiang 'Anthony' Chen

Title: AnimAgents: Coordinating Multi-Stage Animation Pre-Production with Human-Multi-Agent Collaboration

Abstract:
Animation pre‑production lays the foundation of an animated film by transforming initial concepts into a coherent blueprint across interdependent stages such as ideation, scripting, design, and storyboarding. While generative AI tools are increasingly adopted in this process, they remain isolated, requiring creators to juggle multiple systems without integrated workflow support. Our formative study with 12 professional creative directors and independent animators revealed key challenges in their current practice: Creators must manually coordinate fragmented outputs, manage large volumes of information, and struggle to maintain continuity and creative control between stages. Based on the insights, we present AnimAgents, a human‑multi‑agent collaborative system that coordinates complex, multi‑stage workflows through a core agent and specialized agents, supported by dedicated boards for the four major stages of pre‑production. AnimAgents enables stage‑aware orchestration, stage‑specific output management, and element‑level refinement, providing an end‑to‑end workflow tailored to professional practice. In a within‑subjects summative study with 16 professional creators, AnimAgents significantly outperformed a strong single‑agent baseline that equipped with advanced parallel image generation in coordination, consistency, information management, and overall satisfaction (p < .01). A field deployment with 4 creators further demonstrated AnimAgents' effectiveness in real‑world projects.

PaperID: 7242, https://arxiv.org/pdf/2511.17843.pdf

Authors: Chenyi Wang, Zhaowei Li, Ming F. Li, Wujie Wen

Title: JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception

Abstract:
Multi‑agent cooperative perception (CP) promises to overcome the inherent occlusion and range limitations of single‑agent systems in autonomous driving, yet its practicality is severely constrained by limited Vehicle‑to‑Everything (V2X) communication bandwidth. Existing approaches attempt to improve bandwidth efficiency via compression or heuristic message selection, but neglect the semantic relevance and cross‑agent redundancy of the transmitted data. In this paper, we formulate a joint semantic feature encoding and transmission problem that maximizes CP accuracy under a communication budget, and introduce JigsawComm, an end‑to‑end semantic‑aware framework that learns to ``assemble the puzzle'' of multi‑agent feature transmission. JigsawComm uses a regularized encoder to extract \emphsparse, semantically relevant features, and a lightweight Feature Utility Estimator (FUE) to predict each agent's per‑cell contribution to the downstream perception task. The FUE‑generated compact meta utility maps are exchanged among agents and used to compute an optimal transmission policy under the learned utility proxy. This policy inherently \empheliminates cross‑agent redundancy, bounding the feature transmission payload to \mathcalO(1) as the number of agents grows, while the meta information overhead remains negligible. The whole pipeline is trained end‑to‑end through a differentiable scheduling module, informing the FUE to be aligned with the task objective. On the OPV2V and DAIR‑V2X benchmarks, JigsawComm reduces total data volume by over 20‑‑500× while matching or exceeding the accuracy of state‑of‑the‑art methods.

PaperID: 7243, https://arxiv.org/pdf/2511.17826.pdf

Authors: Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu

Title: Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

Abstract:
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM‑as‑a‑judge evaluation, multi‑agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non‑deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non‑associativity of floating‑point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch‑size‑related nondeterminism through batch‑invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi‑GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP‑induced inconsistency and propose Tree‑Based Invariant Kernels (TBIK), a set of TP‑invariant matrix multiplication and reduction primitives that guarantee bit‑wise identical results regardless of TP size. Our key insight is to align intra‑ and inter‑GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit‑wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit‑wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategy. Code is available at https://github.com/nanomaoli/llm_reproducibility.

PaperID: 7244, https://arxiv.org/pdf/2511.17730.pdf

Authors: Zeinab Nezami, Shehr Bano, Abdelaziz Salama, Maryam Hafeez, Syed Ali Raza Zaidi

Title: Safety and Risk Pathways in Cooperative Generative Multi-Agent Systems: A Telecom Perspective

Abstract:
Generative multiagent systems are rapidly emerging as transformative tools for scalable automation and adaptive decisionmaking in telecommunications. Despite their promise, these systems introduce novel risks that remain underexplored, particularly when agents operate asynchronously across layered architectures. This paper investigates key safety pathways in telecomfocused Generative MultiAgent Systems (GMAS), emphasizing risks of miscoordination and semantic drift shaped by persona diversity. We propose a modular safety evaluation framework that integrates agentlevel checks on code quality and compliance with systemlevel safety metrics. Using controlled simulations across 32 persona sets, five questions, and multiple iterative runs, we demonstrate progressive improvements in analyzer penalties and AllocatorCoder consistency, alongside persistent vulnerabilities such as policy drift and variability under specific persona combinations. Our findings provide the first domaingrounded evidence that persona design, coding style, and planning orientation directly influence the stability and safety of telecom GMAS, highlighting both promising mitigation strategies and open risks for future deployment.

PaperID: 7245, https://arxiv.org/pdf/2511.17714.pdf

Authors: Alex John London, Aydin Mohseni

Title: Learning the Value of Value Learning

Abstract:
Standard decision frameworks address uncertainty about facts but assume fixed options and values. We extend the Jeffrey‑Bolker framework to model refinements in values and prove a value‑of‑information theorem for axiological refinement. In multi‑agent settings, we establish that mutual refinement will characteristically transform zero‑sum games into positive‑sum interactions and yield Pareto‑improvements in Nash bargaining. These results show that a framework of rational choice can be extended to model value refinement. By unifying epistemic and axiological refinement under a single formalism, we broaden the conceptual foundations of rational choice and illuminate the normative status of ethical deliberation.

PaperID: 7246, https://arxiv.org/pdf/2511.17676.pdf

Authors: Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Annie Wang, Weizhe Wang

Title: LLM and Agent-Driven Data Analysis: A Systematic Approach for Enterprise Applications and System-level Deployment

Abstract:
The rapid progress in Generative AI and Agent technologies is profoundly transforming enterprise data management and analytics. Traditional database applications and system deployment are fundamentally impacted by AI‑driven tools, such as Retrieval‑Augmented Generation (RAG) and vector database technologies, which provide new pathways for semantic querying over enterprise knowledge bases. In the meantime, data security and compliance are top priorities for organizations adopting AI technologies. For enterprise data analysis, SQL generations powered by large language models (LLMs) and AI agents, has emerged as a key bridge connecting natural language with structured data, effectively lowering the barrier to enterprise data access and improving analytical efficiency. This paper focuses on enterprise data analysis applications and system deployment, covering a range of innovative frameworks, enabling complex query understanding, multi‑agent collaboration, security verification, and computational efficiency. Through representative use cases, key challenges related to distributed deployment, data security, and inherent difficulties in SQL generation tasks are discussed.

PaperID: 7247, https://arxiv.org/pdf/2511.17673.pdf

Authors: Myung Ho Kim

Title: Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer

Abstract:
Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R‑CCAM). Soft Symbolic Control constitutes a dedicated governance layer within SCL, applying symbolic constraints to probabilistic inference while preserving the flexibility of neural reasoning and restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi‑step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory‑augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt‑centric and memory‑only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro‑symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open‑source implementation demonstrating the R‑CCAM loop architecture, alongside a live GPT‑4o‑powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.

PaperID: 7248, https://arxiv.org/pdf/2511.17656.pdf

Authors: KM Khalid Saifullah, Daniel Palmer

Title: Multi-Agent Coordination in Autonomous Vehicle Routing: A Simulation-Based Study of Communication, Memory, and Routing Loops

Abstract:
Multi‑agent coordination is critical for next‑generation autonomous vehicle (AV) systems, yet naive implementations of communication‑based rerouting can lead to catastrophic performance degradation. This study investigates a fundamental problem in decentralized multi‑agent navigation: routing loops, where vehicles without persistent obstacle memory become trapped in cycles of inefficient path recalculation. Through systematic simulation experiments involving 72 unique configurations across varying vehicle densities (15, 35, 55 vehicles) and obstacle frequencies (6, 20 obstacles), we demonstrate that memory‑less reactive rerouting increases average travel time by up to 682% compared to baseline conditions. To address this, we introduce Object Memory Management (OMM), a lightweight mechanism enabling agents to retain and share knowledge of previously encountered obstacles. OMM operates by maintaining a distributed blacklist of blocked nodes, which each agent consults during Dijkstra‑based path recalculation, effectively preventing redundant routing attempts. Our results show that OMM‑enabled coordination reduces average travel time by 75.7% and wait time by 88% compared to memory‑less systems, while requiring only 1.67 route recalculations per vehicle versus 9.83 in memory‑less scenarios. This work provides empirical evidence that persistent, shared memory is not merely beneficial but essential for robust multi‑agent coordination in dynamic environments. The findings have implications beyond autonomous vehicles, informing the design of decentralized systems in robotics, network routing, and distributed AI. We provide a comprehensive experimental analysis, including detailed scenario breakdowns, scalability assessments, and visual documentation of the routing loop phenomenon, demonstrating OMM's critical role in preventing detrimental feedback cycles in cooperative multi‑agent systems.

PaperID: 7249, https://arxiv.org/pdf/2511.17654.pdf

Authors: Deepak Bolleddu

Title: Dialogue Diplomats: An End-to-End Multi-Agent Reinforcement Learning System for Automated Conflict Resolution and Consensus Building

Abstract:
Conflict resolution and consensus building represent critical challenges in multi‑agent systems, negotiations, and collaborative decision‑making processes. This paper introduces Dialogue Diplomats, a novel end‑to‑end multi‑agent reinforcement learning (MARL) framework designed for automated conflict resolution and consensus building in complex, dynamic environments. The proposed system integrates advanced deep reinforcement learning architectures with dialogue‑based negotiation protocols, enabling autonomous agents to engage in sophisticated conflict resolution through iterative communication and strategic adaptation. We present three primary contributions: first, a novel Hierarchical Consensus Network (HCN) architecture that combines attention mechanisms with graph neural networks to model inter‑agent dependencies and conflict dynamics. second, a Progressive Negotiation Protocol (PNP) that structures multi‑round dialogue interactions with adaptive concession strategies; and third, a Context‑Aware Reward Shaping mechanism that balances individual agent objectives with collective consensus goals.

PaperID: 7250, https://arxiv.org/pdf/2511.17652.pdf

Authors: Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Mohamed Kahila, Alfonso Garcia Tan, Heli Qi, Rui Yang, Simeng Han, Tinglin Huang, Fang Wu, Chen Liu, Qingyu Chen, Nan Liu, Irene Li, Hua Xu, Hongyu Zhao

Title: TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

Abstract:
Advances in AI have introduced several strong models in computational pathology to usher it into the era of multi‑modal diagnosis, analysis, and interpretation. However, the current pathology‑specific visual language models still lack capacities in making the diagnosis with rigorous reasoning paths as well as handling divergent tasks, and thus, challenges of building AI Copilots for real scenarios still exist. Here we introduce TeamPath, an AI system powered by reinforcement learning and router‑enhanced solutions based on large‑scale histopathology multimodal datasets, to work as a virtual assistant for expert‑level disease diagnosis, patch‑level information summarization, and cross‑modality generation to integrate transcriptomic information for clinical usage. We also collaborate with pathologists from Yale School of Medicine to demonstrate that TeamPath can assist them in working more efficiently by identifying and correcting expert conclusions and reasoning paths. We also discuss the human evaluation results to support the reasoning quality from TeamPath. Overall, TeamPath can flexibly choose the best settings according to the needs, and serve as an innovative and reliable system for information communication across different modalities and experts.

PaperID: 7251, https://arxiv.org/pdf/2511.17625.pdf

Authors: Jaehan Im, John-Paul Clarke, Ufuk Topcu, David Fridovich-Keil

Title: Iterative Negotiation and Oversight: A Case Study in Decentralized Air Traffic Management

Abstract:
Achieving consensus among noncooperative agents remains challenging in decentralized multi‑agent systems, where agents often have conflicting preferences. Existing coordination methods enable agents to reach consensus without a centralized coordinator, but do not provide formal guarantees on system‑level objectives such as efficiency or fairness. To address this limitation, we propose an iterative negotiation and oversight framework that augments a decentralized negotiation mechanism with taxation‑like oversight. The framework builds upon the trading auction for consensus, enabling noncooperative agents with conflicting preferences to negotiate through asset trading while preserving valuation privacy. We introduce an oversight mechanism, which implements a taxation‑like intervention that guides decentralized negotiation toward system‑efficient and equitable outcomes while also regulating how fast the framework converges. We establish theoretical guarantees of finite‑time termination and derive bounds linking system efficiency and convergence rate to the level of central intervention. A case study based on the collaborative trajectory options program, a rerouting initiative in U.S. air traffic management, demonstrates that the framework can reliably achieve consensus among noncooperative airspace sector managers, and reveals how the level of intervention regulates the relationship between system efficiency and convergence speed. Taken together, the theoretical and experimental results indicate that the proposed framework provides a general mechanism for decentralized coordination in noncooperative multi‑agent systems while safeguarding system‑level objectives.

PaperID: 7252, https://arxiv.org/pdf/2511.17621.pdf

Authors: Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan, James Begin, Kevin Zhu, Archana Vaidheeswaran, Vasu Sharma

Title: From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems

Abstract:
As foundation models are increasingly deployed as interacting agents in multi‑agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability. Traditional coordination mechanisms, such as centralized oversight or adversarial adjudication, struggle to scale and often obscure how decisions emerge. We introduce a market‑making framework for multi‑agent large language model (LLM) coordination that organizes agent interactions as structured economic exchanges. In this setup, each agent acts as a market participant, updating and trading probabilistic beliefs, to converge toward shared, truthful outcomes. By aligning local incentives with collective epistemic goals, the framework promotes self‑organizing, verifiable reasoning without requiring external enforcement. Empirically, we evaluate this approach across factual reasoning, ethical judgment, and commonsense inference tasks. Market‑based coordination yields accuracy gains of up to 10% over single‑shot baselines while preserving interpretability and transparency of intermediate reasoning steps. Beyond these improvements, our findings demonstrate that economic coordination principles can operationalize accountability and robustness in multi‑agent LLM systems, offering a scalable pathway toward self‑correcting, socially responsible AI capable of maintaining trust and oversight in real world deployment scenarios.

PaperID: 7253, https://arxiv.org/pdf/2511.17586.pdf

Authors: Rathin Chandra Shit, Sharmila Subudhi

Title: Hierarchical Adaptive Consensus Network: A Dynamic Framework for Scalable Consensus in Collaborative Multi-Agent AI Systems

Abstract:
The consensus strategies used in collaborative multi‑agent systems (MAS) face notable challenges related to adaptability, scalability, and convergence certainties. These approaches, including structured workflows, debate models, and iterative voting, often lead to communication bottlenecks, stringent decision‑making processes, and delayed responses in solving complex and evolving tasks. This article introduces a three‑tier architecture, the Hierarchical Adaptive Consensus Network (\hacn), which suggests various consensus policies based on task characterization and agent performance metrics. The first layer collects the confidence‑based voting outcomes of several local agent clusters. In contrast, the second level facilitates inter‑cluster communication through cross‑clustered partial knowledge sharing and dynamic timeouts. The third layer provides system‑wide coordination and final arbitration by employing a global orchestration framework with adaptable decision rules. The proposed model achieves \bigO(n) communication complexity, as opposed to the \bigO(n^2) complexity of the existing fully connected MAS. Experiments performed in a simulated environment yielded a 99.9% reduction in communication overhead during consensus convergence. Furthermore, the proposed approach ensures consensus convergence through hierarchical escalation and dynamic adaptation for a wide variety of complicated tasks.

PaperID: 7254, https://arxiv.org/pdf/2511.17506.pdf

Authors: Narjes Nourzad, Mingyu Zong, Bhaskar Krishnamachari

Title: AURA: Adaptive Unified Reasoning and Automation with LLM-Guided MARL for NextG Cellular Networks

Abstract:
Next‑generation (NextG) cellular networks are expected to manage dynamic traffic while sustaining high performance. Large language models (LLMs) provide strategic reasoning for 6G planning, but their computational cost and latency limit real‑time use. Multi‑agent reinforcement learning (MARL) supports localized adaptation, yet coordination at scale remains challenging. We present AURA, a framework that integrates cloud‑based LLMs for high‑level planning with base stations modeled as MARL agents for local decision‑making. The LLM generates objectives and subgoals from its understanding of the environment and reasoning capabilities, while agents at base stations execute these objectives autonomously, guided by a trust mechanism that balances local learning with external input. To reduce latency, AURA employs batched communication so that agents update the LLM's view of the environment and receive improved feedback. In a simulated 6G scenario, AURA improves resilience, reducing dropped handoff requests by more than half under normal and high traffic and lowering system failures. Agents use LLM input in fewer than 60% of cases, showing that guidance augments rather than replaces local adaptability, thereby mitigating latency and hallucination risks. These results highlight the promise of combining LLM reasoning with MARL adaptability for scalable, real‑time NextG network management.

PaperID: 7255, https://arxiv.org/pdf/2511.17496.pdf

Authors: Zhiyu Huang, Zewei Zhou, Tianhui Cai, Yun Zhang, Jiaqi Ma

Title: MDG: Masked Denoising Generation for Multi-Agent Behavior Modeling in Traffic Environments

Abstract:
Modeling realistic and interactive multi‑agent behavior is critical to autonomous driving and traffic simulation. However, existing diffusion and autoregressive approaches are limited by iterative sampling, sequential decoding, or task‑specific designs, which hinder efficiency and reuse. We propose Masked Denoising Generation (MDG), a unified generative framework that reformulates multi‑agent behavior modeling as the reconstruction of independently noised spatiotemporal tensors. Instead of relying on diffusion time steps or discrete tokenization, MDG applies continuous, per‑agent and per‑timestep noise masks that enable localized denoising and controllable trajectory generation in a single or few forward passes. This mask‑driven formulation generalizes across open‑loop prediction, closed‑loop simulation, motion planning, and conditional generation within one model. Trained on large‑scale real‑world driving datasets, MDG achieves competitive closed‑loop performance on the Waymo Sim Agents and nuPlan Planning benchmarks, while providing efficient, consistent, and controllable open‑loop multi‑agent trajectory generation. These results position MDG as a simple yet versatile paradigm for multi‑agent behavior modeling.

PaperID: 7256, https://arxiv.org/pdf/2511.17435.pdf

Authors: Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu

Title: Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Abstract:
This paper addresses the cooperative Multi‑Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end‑to‑end centralized decision‑making framework based on sequence‑to‑sequence, named Multi‑Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio‑temporal system optimization problem, widely applied in scenarios such as on‑demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large‑scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) Independent decoding across multiple vehicles fails to model joint action distributions; 2) The feature extraction network struggles to capture inter‑entity relationships; 3) The joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an AutoRegressive manner, and introduces a Relation‑Aware Attention module to capture inter‑entity relationships. Additionally, we guide the model's decision‑making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.

PaperID: 7257, https://arxiv.org/pdf/2511.17332.pdf

Authors: Virginia Dignum, Frank Dignum

Title: Agentifying Agentic AI

Abstract:
Agentic AI seeks to endow systems with sustained autonomy, reasoning, and interaction capabilities. To realize this vision, its assumptions about agency must be complemented by explicit models of cognition, cooperation, and governance. This paper argues that the conceptual tools developed within the Autonomous Agents and Multi‑Agent Systems (AAMAS) community, such as BDI architectures, communication protocols, mechanism design, and institutional modelling, provide precisely such a foundation. By aligning adaptive, data‑driven approaches with structured models of reasoning and coordination, we outline a path toward agentic systems that are not only capable and flexible, but also transparent, cooperative, and accountable. The result is a perspective on agency that bridges formal theory and practical autonomy.

PaperID: 7258, https://arxiv.org/pdf/2511.17330.pdf

Authors: Haoxin Tu, Huan Zhao, Yahui Song, Mehtab Zafar, Ruijie Meng, Abhik Roychoudhury

Title: Agentic Verification of Software Systems

Abstract:
Automatically generated code is gaining traction recently, owing to the prevalence of Large Language Models (LLMs). Further, the AlphaProof initiative has demonstrated the possibility of using AI for general mathematical reasoning. Reasoning about computer programs (software) can be accomplished via general mathematical reasoning; however, it tends to be more structured and richer in contexts. This forms an attractive proposition, since then AI agents can be used to reason about voluminous code that gets generated by AI. In this work, we present a first LLM agent, AutoRocq, for conducting program verification. Unlike past works, which rely on extensive training of LLMs on proof examples, our agent learns on‑the‑fly and improves the proof via an iterative refinement loop. The iterative improvement of the proof is achieved by the proof agent communicating with the Rocq (formerly Coq) theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover. In this way, our proof construction involves autonomous collaboration between the proof agent and the theorem prover. This autonomy facilitates the search for proofs and decision‑making in deciding on the structure of the proof tree. Experimental evaluation on SV‑COMP benchmarks and on Linux kernel modules shows promising efficacy in achieving automated program verification. As automation in code generation becomes more widespread, we posit that our proof agent can be potentially integrated with AI coding agents to achieve a generate and validate loop, thus moving closer to the vision of trusted automatic programming.

PaperID: 7259, https://arxiv.org/pdf/2511.17190.pdf

Authors: Ziyang Wang, Yuanlei Zheng, Zhenbiao Cao, Xiaojin Zhang, Zhongyu Wei, Pei Fu, Zhenbo Luo, Wei Chen, Xiang Bai

Title: AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

Abstract:
For industrial‑scale text‑to‑SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent‑driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink's superior performance, achieving state‑of‑the‑art strict schema linking recall of 97.4% on Bird‑Dev and 91.2% on Spider‑2.0‑Lite, with competitive execution accuracy, i.e., 68.7% EX on Bird‑Dev (better than CHESS) and 34.9% EX on Spider‑2.0‑Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade‑making it a highly scalable, high‑recall schema‑linking solution for industrial text‑to‑SQL systems.

PaperID: 7260, https://arxiv.org/pdf/2511.17165.pdf

Authors: Kesheng Chen, Wenjian Luo, Bang Zhang, Zeping Yin, Zipeng Ye

Title: MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward

Abstract:
Episodic rewards present a significant challenge in reinforcement learning. While intrinsic reward methods have demonstrated effectiveness in single‑agent rein‑forcement learning scenarios, their application to multi‑agent reinforcement learn‑ing (MARL) remains problematic. The primary difficulties stem from two fac‑tors: (1) the exponential sparsity of joint action trajectories that lead to rewards as the exploration space expands, and (2) existing methods often fail to account for joint actions that can influence team states. To address these challenges, this paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for MARL with extremely sparse rewards like episodic rewards. MIR incentivizes individual agents to explore actions that affect their teammates, and when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. For comprehensive experimental valida‑tion, we extend the representative single‑agent MiniGrid environment to create MiniGrid‑MA, a series of MARL environments with sparse rewards. Our evalu‑ation compares the proposed method against state‑of‑the‑art approaches in the MiniGrid‑MA setting, with experimental results demonstrating superior perfor‑mance.

PaperID: 7261, https://arxiv.org/pdf/2511.17162.pdf

Authors: Sara Zuppiroli, Carmelo Fabio Longo, Anna Sofia Lippolis, Rocco Paolillo, Lorenzo Giammei, Miguel Ceriani, Francesco Poggi, Antonio Zinilli, Andrea Giovanni Nuzzolese

Title: The Belief-Desire-Intention Ontology for modelling mental reality and agency

Abstract:
The Belief‑Desire‑Intention (BDI) model is a cornerstone for representing rational agency in artificial intelligence and cognitive sciences. Yet, its integration into structured, semantically interoperable knowledge representations remains limited. This paper presents a formal BDI Ontology, conceived as a modular Ontology Design Pattern (ODP) that captures the cognitive architecture of agents through beliefs, desires, intentions, and their dynamic interrelations. The ontology ensures semantic precision and reusability by aligning with foundational ontologies and best practices in modular design. Two complementary lines of experimentation demonstrate its applicability: (i) coupling the ontology with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to assess the contribution of ontological grounding to inferential coherence and consistency; and (ii) integrating the ontology within the Semas reasoning platform, which implements the Triples‑to‑Beliefs‑to‑Triples (T2B2T) paradigm, enabling a bidirectional flow between RDF triples and agent mental states. Together, these experiments illustrate how the BDI Ontology acts as both a conceptual and operational bridge between declarative and procedural intelligence, paving the way for cognitively grounded, explainable, and semantically interoperable multi‑agent and neuro‑symbolic systems operating within the Web of Data.

PaperID: 7262, https://arxiv.org/pdf/2511.16997.pdf

Authors: Qingbin Zeng, Bingbing Fan, Zhiyu Chen, Sijian Ren, Zhilun Zhou, Xuhua Zhang, Yuanyi Zhen, Fengli Xu, Yong Li, Tie-Yan Liu

Title: MirrorMind: Empowering OmniScientist with the Expert Perspectives and Collective Knowledge of Human Scientists

Abstract:
The emergence of AI Scientists has demonstrated remarkable potential in automating scientific research. However, current approaches largely conceptualize scientific discovery as a solitary optimization or search process, overlooking that knowledge production is inherently a social and historical endeavor. Human scientific insight stems from two distinct yet interconnected sources. First is the individual cognitive trajectory, where a researcher's unique insight is shaped by their evolving research history and stylistic preferences; another is the collective disciplinary memory, where knowledge is sedimented into vast, interconnected networks of citations and concepts. Existing LLMs still struggle to represent these structured, high‑fidelity cognitive and social contexts. To bridge this gap, we introduce MirrorMind, a hierarchical cognitive architecture that integrates dual‑memory representations within a three‑level framework. The Individual Level constructs high‑fidelity cognitive models of individual researchers by capturing their episodic, semantic, and persona memories; the Domain Level maps collective knowledge into structured disciplinary concept graphs; and the Interdisciplinary Level that acts as an orthogonal orchestration engine. Crucially, our architecture separates memory storage from agentic execution, enabling AI scientist agents to flexibly access individual memories for unique perspectives or collective structures to reason. We evaluate MirrorMind across four comprehensive tasks, including author‑level cognitive simulation, complementary reasoning, cross‑disciplinary collaboration promotion, and multi‑agent scientific problem solving. The results show that by integrating individual cognitive depth with collective disciplinary breadth, MirrorMind moves beyond simple fact retrieval toward structural, personalized, and insight‑generating scientific reasoning.

PaperID: 7263, https://arxiv.org/pdf/2511.16931.pdf

Authors: Chenyang Shao, Dehao Huang, Yu Li, Keyu Zhao, Weiquan Lin, Yining Zhang, Qingbin Zeng, Zhiyu Chen, Tianxing Li, Yifei Huang, Taozhong Wu, Xinyang Liu, Ruotong Zhao, Mengsheng Zhao, Jiaoyang Li, Xuhua Zhang, Yue Wang, Yuanyi Zhen, Fengli Xu, Yong Li, Tie-Yan Liu

Title: OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

Abstract:
With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real‑world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end‑to‑end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi‑agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co‑evolve, fostering a sustainable and scalable innovation ecosystem.

PaperID: 7264, https://arxiv.org/pdf/2511.16916.pdf

Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Title: Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

Abstract:
In multi‑vehicle cooperative driving tasks involving high‑frequency continuous control, traditional state‑based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal‑to‑noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi‑steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TRD) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long‑term objectives; and (2) an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi‑Agent Partially Observable Markov Game (POMDPG) with a time‑varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi‑Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high‑quality cooperative policies that effectively balance traffic efficiency and safety.

PaperID: 7265, https://arxiv.org/pdf/2511.16837.pdf

Authors: Oliver Kramer

Title: Cognitive BASIC: An In-Model Interpreted Reasoning Language for LLMs

Abstract:
Cognitive BASIC is a minimal, BASIC‑style prompting language and in‑model interpreter that structures large language model (LLM) reasoning into explicit, stepwise execution traces. Inspired by the simplicity of retro BASIC, we repurpose numbered lines and simple commands as an interpretable cognitive control layer. Modern LLMs can reliably simulate such short programs, enabling transparent multi‑step reasoning inside the model. A natural‑language interpreter file specifies command semantics, memory updates, and logging behavior. Our mental‑model interpreter extracts declarative and procedural knowledge, detects contradictions, and produces resolutions when necessary. A comparison across three LLMs on a benchmark of knowledge extraction, conflict detection, and reasoning tasks shows that all models can execute Cognitive BASIC programs, with overall strong but not uniform performance.

PaperID: 7266, https://arxiv.org/pdf/2511.16787.pdf

Authors: Hossain Shaikh Saadi, Faria Alam, Mario Sanz-Guerrero, Minh Duc Bui, Manuel Mager, Katharina von der Wense

Title: NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

Abstract:
This paper presents JGU Mainz's winning system for the BLP‑2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi‑agent‑based pipeline. First, a code‑generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest‑style, assert‑based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.

PaperID: 7267, https://arxiv.org/pdf/2511.16708.pdf

Authors: Shreshth Rajan

Title: Multi-Agent Code Verification via Information Theory

Abstract:
LLMs generate buggy code: 29.6% of SWE‑bench solved patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX‑Verify, a multi‑agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measuring agent correlation of rho = 0.05 to 0.25 confirms they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two‑agent combination (Correctness + Performance) reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.

PaperID: 7268, https://arxiv.org/pdf/2511.16635.pdf

Authors: Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, Linlin Shen

Title: SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Abstract:
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region‑of‑interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain‑of‑thought (CoT)‑enhanced multi‑agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI‑Gene CoT‑Enhanced Case Bank Construction employs hierarchical analysis through Low‑Magnification Screening, Cross‑Modal Similarity‑Aware Patch Mining, and Confidence‑Aware Patch Mining for pathology images, while Gene‑Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy‑Based Multi‑Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI‑driven survival prediction in precision oncology.

PaperID: 7269, https://arxiv.org/pdf/2511.16347.pdf

Authors: Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, Jianfeng Ma

Title: The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Abstract:
The adoption of Vision‑Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi‑modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black‑box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment ‑‑ a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open‑source prototypes of two fully‑automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK‑FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK‑FORGE, we automatically construct SHAWSHANK‑BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task‑scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.

PaperID: 7270, https://arxiv.org/pdf/2511.16292.pdf

Authors: Daniel Vaughan, Kateřina Vaughan

Title: Distributed Agent Reasoning Across Independent Systems With Strict Data Locality

Abstract:
This paper presents a proof‑of‑concept demonstration of agent‑to‑agent communication across distributed systems, using only natural‑language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries. The system uses Orpius as the underlying platform for multi‑agent orchestration, tool execution, and privacy‑preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural‑language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC‑based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation. The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production‑ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi‑agent systems.

PaperID: 7271, https://arxiv.org/pdf/2511.16231.pdf

Authors: Yang Yu

Title: Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

Abstract:
The ability of Large Language Models (LLMs) to perform complex, multi‑step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per‑example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.

PaperID: 7272, https://arxiv.org/pdf/2511.16223.pdf

Authors: Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram, Luca Maria Gambardella, Anna Valente

Title: DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

Abstract:
Learning robust manipulation policies typically requires large and diverse datasets, the collection of which is time‑consuming, labor‑intensive, and often impractical for dynamic environments. In this work, we introduce DynaMimicGen (D‑MG), a scalable dataset generation framework that enables policy training from minimal human supervision while uniquely supporting dynamic task settings. Given only a few human demonstrations, D‑MG first segments the demonstrations into meaningful sub‑tasks, then leverages Dynamic Movement Primitives (DMPs) to adapt and generalize the demonstrated behaviors to novel and dynamically changing environments. Improving prior methods that rely on static assumptions or simplistic trajectory interpolation, D‑MG produces smooth, realistic, and task‑consistent Cartesian trajectories that adapt in real time to changes in object poses, robot states, or scene geometry during task execution. Our method supports different scenarios ‑ including scene layouts, object instances, and robot configurations ‑ making it suitable for both static and highly dynamic manipulation tasks. We show that robot agents trained via imitation learning on D‑MG‑generated data achieve strong performance across long‑horizon and contact‑rich benchmarks, including tasks like cube stacking and placing mugs in drawers, even under unpredictable environment changes. By eliminating the need for extensive human demonstrations and enabling generalization in dynamic settings, D‑MG offers a powerful and efficient alternative to manual data collection, paving the way toward scalable, autonomous robot learning.

PaperID: 7273, https://arxiv.org/pdf/2511.16205.pdf

Authors: Qiang Xu, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li

Title: ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025

Abstract:
Olympiad‑level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment‑Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model's visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi‑agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state‑of‑the‑art multimodal models demonstrate that combining SVE with our multi‑agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state‑of‑the‑art in automated chemical problem‑solving. ChemO Dataset: https://huggingface.co/datasets/IDEA‑AI4SCI/ChemO

PaperID: 7274, https://arxiv.org/pdf/2511.16202.pdf

Authors: Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi

Title: Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Abstract:
We present CRM (Multi‑Agent Collaborative Reward Model), a framework that replaces a single black‑box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain‑specific agents that each produce partial signals, alongside global evaluators such as ranker‑based and embedding‑similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step‑wise correctness, multi‑agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage‑based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi‑perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.

PaperID: 7275, https://arxiv.org/pdf/2511.16183.pdf

Authors: Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris

Title: FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos

Abstract:
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play‑by‑play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game‑state and play‑by‑play data. This motivates leveraging tactical knowledge as a prior to support computer‑vision‑based predictions, enabling more automated and reliable extraction of play‑by‑play data. We introduce Footovision Play‑by‑Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play‑by‑play action spotting over entire soccer matches in a multi‑modal, multi‑agent tactical context. It enables the development of methods for player‑centric action spotting that exploit both outputs from computer‑vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play‑by‑play data streams. These streams form an essential input for data‑driven sports analytics.

PaperID: 7276, https://arxiv.org/pdf/2511.16131.pdf

Authors: Xuan-Quang Phan, Tan-Ha Mai, Thai-Duy Dinh, Minh-Thuan Nguyen, Lam-Son Lê

Title: AskDB: An LLM Agent for Natural Language Interaction with Relational Databases

Abstract:
Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general‑purpose database interaction. We introduce AskDB, a large language model powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema‑aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi‑step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real‑time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text‑to‑SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.

PaperID: 7277, https://arxiv.org/pdf/2511.16108.pdf

Authors: Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Title: SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

Abstract:
We introduce SkyRL‑Agent, a framework for efficient, multi‑turn, long‑horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL‑train, VeRL, and Tinker. Using SkyRL‑Agent, we train SA‑SWE‑32B, a software engineering agent trained from Qwen3‑32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool‑enhanced training recipe leveraging an AST‑based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA‑SWE‑32B to reach 39.4% Pass@1 on SWE‑Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA‑SWE‑32B generalizes effectively to other agentic tasks, including Terminal‑Bench, BrowseComp‑Plus, and WebArena. We further demonstrate SkyRL‑Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.

PaperID: 7278, https://arxiv.org/pdf/2511.16075.pdf

Authors: Hrikshesh Kumar, Anika Garg, Anshul Gupta, Yashika Agarwal

Title: A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management

Abstract:
Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning In fact the novelty is in how we combine them as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter it sees the future, which allows it to make better decisions about a long term plan for where to run tasks That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users That is we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another it finds a smooth path forward Our tests show our system easily beats the old methods It is great at solving tough problems like making complex decisions and juggling multiple goals at once like being cheap fast and reliable

PaperID: 7279, https://arxiv.org/pdf/2511.16005.pdf

Authors: Qingao Dong, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, Weifeng Lv

Title: InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution

Abstract:
Large language model (LLM) agents have recently shown strong performance on repository‑level issue resolution, but existing systems are almost exclusively designed for Python and rely heavily on lexical retrieval and shallow code navigation. These approaches transfer poorly to C++ projects, where overloaded identifiers, nested namespaces, template instantiations, and deep control‑flow structures make context retrieval and fault localization substantially more difficult. As a result, state‑of‑the‑art Python‑oriented agents show a drastic performance drop on the C++ subset of MultiSWE‑bench. We introduce INFCODE‑C++, the first C++‑aware autonomous system for end‑to‑end issue resolution. The system combines two complementary retrieval mechanisms ‑‑ semantic code‑intent retrieval and deterministic AST‑structured querying ‑‑ to construct accurate, language‑aware context for repair.These components enable precise localization and robust patch synthesis in large, statically typed C++ repositories. Evaluated on the \textttMultiSWE‑bench‑CPP benchmark, INFCODE‑C++ achieves a resolution rate of 25.58%, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE‑agent. Ablation and behavioral studies further demonstrate the critical role of semantic retrieval, structural analysis, and accurate reproduction in C++ issue resolution. INFCODE‑C++ highlights the need for language‑aware reasoning in multi‑language software agents and establishes a foundation for future research on scalable, LLM‑driven repair for complex, statically typed ecosystems.

PaperID: 7280, https://arxiv.org/pdf/2511.15998.pdf

Authors: Strahinja Janjusevic, Anna Baron Garcia, Sohrob Kazerounian

Title: Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming

Abstract:
Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that can plan, execute, and adapt during penetration tests. However, existing approaches face trade‑offs between generality and specialization, and practical deployments reveal challenges such as hallucinations, context limitations, and ethical concerns. In this work, we introduce a novel command & control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks. Notably, we find that our architecture not only improves goal‑directed behavior of the system as whole, but also eliminates key host and network artifacts that can be used to detect and prevent command & control behavior altogether. We begin with a comprehensive review of state‑of‑the‑art generative red teaming methods, from fine‑tuned specialist models to modular or agentic frameworks, analyzing their automation capabilities against task‑specific accuracy. We then detail how our MCP‑based C2 can overcome current limitations by enabling asynchronous, parallel operations and real‑time intelligence sharing without periodic beaconing. We furthermore explore advanced adversarial capabilities of this architecture, its detection‑evasion techniques, and address dual‑use ethical implications, proposing defensive measures and controlled evaluation in lab settings. Experimental comparisons with traditional C2 show drastic reductions in manual effort and detection footprint. We conclude with future directions for integrating autonomous exploitation, defensive LLM agents, predictive evasive maneuvers, and multi‑agent swarms. The proposed MCP‑enabled C2 framework demonstrates a significant step toward realistic, AI‑driven red team operations that can simulate advanced persistent threats while informing the development of next‑generation defensive systems.

PaperID: 7281, https://arxiv.org/pdf/2511.15997.pdf

Authors: Noah Bissell, Ethan Paley, Joshua Harrison, Juliano Calil, Myungin Lee

Title: Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art

Abstract:
Sensorium Arc (AI reflects on climate) is a real‑time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi‑agent system and retrieval‑augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embodies the ocean's perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco‑aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high‑dimensional environmental data and proposes a new paradigm for human‑machine‑ecosystem.

PaperID: 7282, https://arxiv.org/pdf/2511.15958.pdf

Authors: Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

Title: JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

Abstract:
While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM‑as‑a‑judge frameworks typically relies on comparing candidate answers against ground‑truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine‑grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task‑specific evaluation leaderboards using both accuracy‑based ranking and an Elo‑based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi‑Agent Judging), a novel multi‑agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM‑level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller‑sized models as backbones performs comparatively well or even better than their larger‑sized counterparts. Our findings highlight that multi‑agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.

PaperID: 7283, https://arxiv.org/pdf/2511.15921.pdf

Authors: Chelsea Zou, Yiheng Yao, Basant Khalil

Title: Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs

Abstract:
This project develops a self correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi‑step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine grained uncertainty signals: 1) self‑assessed confidence alignment, and 2) token‑level entropy spikes to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model's generation behavior through confidence‑aware reward feedback, improving not just outcome correctness but the coherence and faithfulness of their intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.

PaperID: 7284, https://arxiv.org/pdf/2511.15870.pdf

Authors: Qiming Guo, Bishal Khatri, Wenbo Sun, Jinwen Tang, Hua Zhang, Wenlu Wang

Title: AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture

Abstract:
Underground pipeline leaks and infiltrations pose significant threats to water security and environmental safety. Traditional manual inspection methods provide limited coverage and delayed response, often missing critical anomalies. This paper proposes AquaSentinel, a novel physics‑informed AI system for real‑time anomaly detection in urban underground water pipeline networks. We introduce four key innovations: (1) strategic sparse sensor deployment at high‑centrality nodes combined with physics‑based state augmentation to achieve network‑wide observability from minimal infrastructure; (2) the RTCA (Real‑Time Cumulative Anomaly) detection algorithm, which employs dual‑threshold monitoring with adaptive statistics to distinguish transient fluctuations from genuine anomalies; (3) a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that provides robust predictions by dynamically weighting model contributions; (4) causal flow‑based leak localization that traces anomalies upstream to identify source nodes and affected pipe segments. Our system strategically deploys sensors at critical network junctions and leverages physics‑based modeling to propagate measurements to unmonitored nodes, creating virtual sensors that enhance data availability across the entire network. Experimental evaluation using 110 leak scenarios demonstrates that AquaSentinel achieves 100% detection accuracy. This work advances pipeline monitoring by demonstrating that physics‑informed sparse sensing can match the performance of dense deployments at a fraction of the cost, providing a practical solution for aging urban infrastructure.

PaperID: 7285, https://arxiv.org/pdf/2511.15862.pdf

Authors: Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi

Title: The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems

Abstract:
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM‑based multi‑agent systems. Our framework includes two key components: (1) a game theory‑based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi‑stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. We also evaluate LLM‑based defense methods, finding they detect some uncooperative behaviors, but some behaviors remain largely undetectable. These gaps highlight how uncooperative agents degrade collective outcomes and underscore the need for more resilient multi‑agent systems.

PaperID: 7286, https://arxiv.org/pdf/2511.15825.pdf

Authors: Tuan-Anh Le, Anh Mai Vu, David Yang, Akash Awasthi, Hien Van Nguyen

Title: IMACT-CXR: An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation

Abstract:
IMACT‑CXR is an interactive multi‑agent conversational tutor that helps trainees interpret chest X‑rays by unifying spatial annotation, gaze analysis, knowledge retrieval, and image‑grounded reasoning in a single AutoGen‑based workflow. The tutor simultaneously ingests learner bounding boxes, gaze samples, and free‑text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV‑Reason‑CXR‑3B for vision‑language reasoning when mastery remains low or the learner explicitly asks. Bayesian Knowledge Tracing (BKT) maintains skill‑specific mastery estimates that drive both knowledge reinforcement and case similarity retrieval. A lung‑lobe segmentation module derived from a TensorFlow U‑Net enables anatomically aware gaze feedback, and safety prompts prevent premature disclosure of ground‑truth labels. We describe the system architecture, implementation highlights, and integration with the REFLACX dataset for real DICOM cases. IMACT‑CXR demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility toward live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.

PaperID: 7287, https://arxiv.org/pdf/2511.15759.pdf

Authors: Badrinath Ramakrishnan, Akshaya Balaji

Title: Securing AI Agents Against Prompt Injection Attacks

Abstract:
Retrieval‑augmented generation (RAG) systems have become widely used for enhancing large language model capabilities, but they introduce significant security vulnerabilities through prompt injection attacks. We present a comprehensive benchmark for evaluating prompt injection risks in RAG‑enabled AI agents and propose a multi‑layered defense framework. Our benchmark includes 847 adversarial test cases across five attack categories: direct injection, context manipulation, instruction override, data exfiltration, and cross‑context contamination. We evaluate three defense mechanisms: content filtering with embedding‑based anomaly detection, hierarchical system prompt guardrails, and multi‑stage response verification, across seven state‑of‑the‑art language models. Our combined framework reduces successful attack rates from 73.2% to 8.7% while maintaining 94.3% of baseline task performance. We release our benchmark dataset and defense implementation to support future research in AI agent security.

PaperID: 7288, https://arxiv.org/pdf/2511.15755.pdf

Authors: Philip Drammeh

Title: Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response

Abstract:
Large language models (LLMs) promise to accelerate incident response in production systems, yet single‑agent approaches generate vague, unusable recommendations. We present MyAntFarm.ai, a reproducible containerized framework demonstrating that multi‑agent orchestration fundamentally transforms LLM‑based incident response quality. Through 348 controlled trials comparing single‑agent copilot versus multi‑agent systems on identical incident scenarios, we find that multi‑agent orchestration achieves 100% actionable recommendation rate versus 1.7% for single‑agent approaches, an 80 times improvement in action specificity and 140 times improvement in solution correctness. Critically, multi‑agent systems exhibit zero quality variance across all trials, enabling production SLA commitments impossible with inconsistent single‑agent outputs. Both architectures achieve similar comprehension latency (approx.40s), establishing that the architectural value lies in deterministic quality, not speed. We introduce Decision Quality (DQ), a novel metric capturing validity, specificity, and correctness properties essential for operational deployment that existing LLM metrics do not address. These findings reframe multi‑agent orchestration from a performance optimization to a production‑readiness requirement for LLM‑based incident response. All code, Docker configurations, and trial data are publicly available for reproduction.

PaperID: 7289, https://arxiv.org/pdf/2511.15752.pdf

Authors: Hanzhi Yan, Qin Lu, Xianqiao Wang, Xiaoming Zhai, Tianming Liu, He Li

Title: Build AI Assistants using Large Language Models and Agents to Enhance the Engineering Education of Biomechanics

Abstract:
While large language models (LLMs) have demonstrated remarkable versatility across a wide range of general tasks, their effectiveness often diminishes in domain‑specific applications due to inherent knowledge gaps. Moreover, their performance typically declines when addressing complex problems that require multi‑step reasoning and analysis. In response to these challenges, we propose leveraging both LLMs and AI agents to develop education assistants aimed at enhancing undergraduate learning in biomechanics courses that focus on analyzing the force and moment in the musculoskeletal system of the human body. To achieve our goal, we construct a dual‑module framework to enhance LLM performance in biomechanics educational tasks: 1) we apply Retrieval‑Augmented Generation (RAG) to improve the specificity and logical consistency of LLM's responses to the conceptual true/false questions; 2) we build a Multi‑Agent System (MAS) to solve calculation‑oriented problems involving multi‑step reasoning and code execution. Specifically, we evaluate the performance of several LLMs, i.e., Qwen‑1.0‑32B, Qwen‑2.5‑32B, and Llama‑70B, on a biomechanics dataset comprising 100 true/false conceptual questions and problems requiring equation derivation and calculation. Our results demonstrate that RAG significantly enhances the performance and stability of LLMs in answering conceptual questions, surpassing those of vanilla models. On the other hand, the MAS constructed using multiple LLMs demonstrates its ability to perform multi‑step reasoning, derive equations, execute code, and generate explainable solutions for tasks that require calculation. These findings demonstrate the potential of applying RAG and MAS to enhance LLM performance for specialized courses in engineering curricula, providing a promising direction for developing intelligent tutoring in engineering education.

PaperID: 7290, https://arxiv.org/pdf/2511.15718.pdf

Authors: Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang

Title: ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

Abstract:
Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real‑world problems using external tools. However, the scarcity of high‑quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi‑turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn‑level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large‑scale, high‑quality tool‑agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open‑source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi‑agent framework to simulate realistic user‑assistant‑tool interactions. Beyond trajectory‑level validation, we employ fine‑grained turn‑level filtering to remove erroneous or suboptimal steps, ensuring that only high‑quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self‑corrective reasoning signals essential for robust tool‑use learning. Models fine‑tuned on ToolMind show significant improvements over baselines on several benchmarks.

PaperID: 7291, https://arxiv.org/pdf/2511.15716.pdf

Authors: Abraham Itzhak Weinberg

Title: MACIE: Multi-Agent Causal Intelligence Explainer for Collective Behavior Understanding

Abstract:
As Multi Agent Reinforcement Learning systems are used in safety critical applications. Understanding why agents make decisions and how they achieve collective behavior is crucial. Existing explainable AI methods struggle in multi agent settings. They fail to attribute collective outcomes to individuals, quantify emergent behaviors, or capture complex interactions. We present MACIE Multi Agent Causal Intelligence Explainer, a framework combining structural causal models, interventional counterfactuals, and Shapley values to provide comprehensive explanations. MACIE addresses three questions. First, each agent's causal contribution using interventional attribution scores. Second, system level emergent intelligence through synergy metrics separating collective effects from individual contributions. Third, actionable explanations using natural language narratives synthesizing causal insights. We evaluate MACIE across four MARL scenarios: cooperative, competitive, and mixed motive. Results show accurate outcome attribution, mean phi_i equals 5.07, standard deviation less than 0.05, detection of positive emergence in cooperative tasks, synergy index up to 0.461, and efficient computation, 0.79 seconds per dataset on CPU. MACIE uniquely combines causal rigor, emergence quantification, and multi agent support while remaining practical for real time use. This represents a step toward interpretable, trustworthy, and accountable multi agent AI.

PaperID: 7292, https://arxiv.org/pdf/2511.15712.pdf

Authors: Vivek Acharya

Title: Secure Autonomous Agent Payments: Verifying Authenticity and Intent in a Trustless Environment

Abstract:
Artificial intelligence (AI) agents are increasingly capable of initiating financial transactions on behalf of users or other agents. This evolution introduces a fundamental challenge: verifying both the authenticity of an autonomous agent and the true intent behind its transactions in a decentralized, trustless environment. Traditional payment systems assume human authorization, but autonomous, agent‑led payments remove that safeguard. This paper presents a blockchain‑based framework that cryptographically authenticates and verifies the intent of every AI‑initiated transaction. The proposed system leverages decentralized identity (DID) standards and verifiable credentials to establish agent identities, on‑chain intent proofs to record user authorization, and zero‑knowledge proofs (ZKPs) to preserve privacy while ensuring policy compliance. Additionally, secure execution environments (TEE‑based attestations) guarantee the integrity of agent reasoning and execution. The hybrid on‑chain/off‑chain architecture provides an immutable audit trail linking user intent to payment outcome. Through qualitative analysis, the framework demonstrates strong resistance to impersonation, unauthorized transactions, and misalignment of intent. This work lays the foundation for secure, auditable, and intent‑aware autonomous economic agents, enabling a future of verifiable trust and accountability in AI‑driven financial ecosystems.

PaperID: 7293, https://arxiv.org/pdf/2511.15456.pdf

Authors: Qian'ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan

Title: Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining

Abstract:
As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on‑/off‑chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi‑agent Large Language Model (LLM) system to robustly infer user intents. A Meta‑Level Planner dynamically coordinates domain experts to decompose multiple perspective‑specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi‑modal on/off‑chain data. While a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context‑aware explanations for complex blockchain activity.

PaperID: 7294, https://arxiv.org/pdf/2511.15370.pdf

Authors: Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang

Title: The Empowerment of Science of Science by Large Language Models: New Tools and Methods

Abstract:
Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge‑enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.

PaperID: 7295, https://arxiv.org/pdf/2511.15357.pdf

Authors: Yinan Yu, Falk Dippel, Christina E. Lundberg, Martin Lindgren, Annika Rosengren, Martin Adiels, Helen Sjöland

Title: Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction

Abstract:
Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade‑offs and clinical interpretability. This paper introduces a cost‑aware prediction (CAP) framework that combines cost‑benefit analysis assisted by large language model (LLM) agents to communicate the trade‑offs involved in applying ML predictions. Materials and Methods: We developed an ML model predicting 1‑year mortality in patients with heart failure (N = 30,021, 22% mortality) to identify those eligible for home care. We then introduced clinical impact projection (CIP) curves to visualize important cost dimensions ‑ quality of life and healthcare provider expenses, further divided into treatment and error costs, to assess the clinical consequences of predictions. Finally, we used four LLM agents to generate patient‑specific descriptions. The system was evaluated by clinicians for its decision support value. Results: The eXtreme gradient boosting (XGB) model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792‑0.816), area under the precision‑recall curve (AUPRC) of 0.529 (95% CI 0.502‑0.558) and a Brier score of 0.135 (95% CI 0.130‑0.140). Discussion: The CIP cost curves provided a population‑level overview of cost composition across decision thresholds, whereas LLM‑generated cost‑benefit analysis at individual patient‑levels. The system was well received according to the evaluation by clinicians. However, feedback emphasizes the need to strengthen the technical accuracy for speculative tasks. Conclusion: CAP utilizes LLM agents to integrate ML classifier outcomes and cost‑benefit analysis for more transparent and interpretable decision support.

PaperID: 7296, https://arxiv.org/pdf/2511.15292.pdf

Authors: Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, Yuanzhe Hu, Qing Wang, Fanjiang Xu

Title: Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation

Abstract:
Evaluating security and reliability for multi‑agent systems (MAS) is urgent as they become increasingly prevalent in various applications. As an evaluation technique, existing adversarial attack frameworks face certain limitations, e.g., impracticality due to the requirement of white‑box information or high control authority, and a lack of stealthiness or effectiveness as they often target all agents or specific fixed agents. To address these issues, we propose AdapAM, a novel framework for adversarial attacks on black‑box MAS. AdapAM incorporates two key components: (1) Adaptive Selection Policy simultaneously selects the victim and determines the anticipated malicious action (the action would lead to the worst impact on MAS), balancing effectiveness and stealthiness. (2) Proxy‑based Perturbation to Induce Malicious Action utilizes generative adversarial imitation learning to approximate the target MAS, allowing AdapAM to generate perturbed observations using white‑box information and thus induce victims to execute malicious action in black‑box settings. We evaluate AdapAM across eight multi‑agent environments and compare it with four state‑of‑the‑art and commonly‑used baselines. Results demonstrate that AdapAM achieves the best attack performance in different perturbation rates. Besides, AdapAM‑generated perturbations are the least noisy and hardest to detect, emphasizing the stealthiness.

PaperID: 7297, https://arxiv.org/pdf/2511.15284.pdf

Authors: Jonas De Maeyer, Hossein Yarahmadi, Moharram Challenger

Title: Path Planning through Multi-Agent Reinforcement Learning in Dynamic Environments

Abstract:
Path planning in dynamic environments is a fundamental challenge in intelligent transportation and robotics, where obstacles and conditions change over time, introducing uncertainty and requiring continuous adaptation. While existing approaches often assume complete environmental unpredictability or rely on global planners, these assumptions limit scalability and practical deployment in real‑world settings. In this paper, we propose a scalable, region‑aware reinforcement learning (RL) framework for path planning in dynamic environments. Our method builds on the observation that environmental changes, although dynamic, are often localized within bounded regions. To exploit this, we introduce a hierarchical decomposition of the environment and deploy distributed RL agents that adapt to changes locally. We further propose a retraining mechanism based on sub‑environment success rates to determine when policy updates are necessary. Two training paradigms are explored: single‑agent Q‑learning and multi‑agent federated Q‑learning, where local Q‑tables are aggregated periodically to accelerate the learning process. Unlike prior work, we evaluate our methods in more realistic settings, where multiple simultaneous obstacle changes and increasing difficulty levels are present. Results show that the federated variants consistently outperform their single‑agent counterparts and closely approach the performance of A Oracle while maintaining shorter adaptation times and robust scalability. Although initial training remains time‑consuming in large environments, our decentralized framework eliminates the need for a global planner and lays the groundwork for future improvements using deep RL and flexible environment decomposition.

PaperID: 7298, https://arxiv.org/pdf/2511.15259.pdf

Authors: Philipp Wiesner, Daniel W. O'Neill, Francesca Larosa, Odej Kao

Title: Efficiency Will Not Lead to Sustainable Reasoning AI

Abstract:
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi‑step reasoning. Historically, computing's global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.

PaperID: 7299, https://arxiv.org/pdf/2511.15227.pdf

Authors: Hiep Hong Trinh, Marjan Sirjani, Federico Ciccozzi, Abu Naser Masud, Mikael Sjödin

Title: Modelling and Model-Checking a ROS2 Multi-Robot System using Timed Rebeca

Abstract:
Model‑based development enables quicker prototyping, earlier experimentation and validation of design intents. For a multi‑agent system with complex asynchronous interactions and concurrency, formal verification, model‑checking in particular, offers an automated mechanism for verifying desired properties. Timed Rebeca is an actor‑based modelling language supporting reactive, concurrent and time semantics, accompanied with a model‑checking compiler. These capabilities allow using Timed Rebeca to correctly model ROS2 node topographies, recurring physical signals, motion primitives and other timed and time‑convertible behaviors. The biggest challenges in modelling and verifying a multi‑robot system lie in abstracting complex information, bridging the gap between a discrete model and a continuous system and compacting the state space, while maintaining the model's accuracy. We develop different discretization strategies for different kinds of information, identifying the 'enough' thresholds of abstraction, and applying efficient optimization techniques to boost computations. With this work we demonstrate how to use models to design and verify a multi‑robot system, how to discretely model a continuous system to do model‑checking efficiently, and the round‑trip engineering flow between the model and the implementation. The released Rebeca and ROS2 codes can serve as a foundation for modelling multiple autonomous robots systems.

PaperID: 7300, https://arxiv.org/pdf/2511.15211.pdf

Authors: Xinli Tao, Xin Dong, Xuezhong Zhou

Title: OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Abstract:
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero‑shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self‑improvement frameworks. To address these limitations, we propose OEMA, a novel zero‑shot clinical NER framework based on multi‑agent collaboration. OEMA consists of three core components: (1) a self‑annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token‑level examples by clinical relevance; and (3) a predictor that incorporates entity‑type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state‑of‑the‑art performance under exact‑match evaluation. Moreover, under related‑match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero‑shot clinical NER, achieving near‑supervised performance under related‑match criteria. Future work will focus on continual learning and open‑domain adaptation to expand its applicability in clinical NLP.

PaperID: 7301, https://arxiv.org/pdf/2511.15203.pdf

Authors: Zimo Ji, Xunguang Wang, Zongjie Li, Pingchuan Ma, Yudong Gao, Daoyuan Wu, Xincheng Yan, Tian Tian, Shuai Wang

Title: Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

Abstract:
Large Language Model (LLM)‑based agents with function‑calling capabilities are increasingly deployed, but remain vulnerable to Indirect Prompt Injection (IPI) attacks that hijack their tool calls. In response, numerous IPI‑centric defense frameworks have emerged. However, these defenses are fragmented, lacking a unified taxonomy and comprehensive evaluation. In this Systematization of Knowledge (SoK), we present the first comprehensive analysis of IPI‑centric defense frameworks. We introduce a comprehensive taxonomy of these defenses, classifying them along five dimensions. We then thoroughly assess the security and usability of representative defense frameworks. Through analysis of defensive failures in the assessment, we identify six root causes of defense circumvention. Based on these findings, we design three novel adaptive attacks that significantly improve attack success rates targeting specific frameworks, demonstrating the severity of the flaws in these defenses. Our paper provides a foundation and critical insights for the future development of more secure and usable IPI‑centric agent defense frameworks.

PaperID: 7302, https://arxiv.org/pdf/2511.15074.pdf

Authors: Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen

Title: Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

Abstract:
The performance of machine learning models on tabular data is critically dependent on high‑quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM‑based multi‑agent framework for knowledge‑informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents‑Scientist, Extractor, and Tester‑that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding‑pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval‑augmented (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state‑of‑the‑art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.

PaperID: 7303, https://arxiv.org/pdf/2511.15065.pdf

Authors: Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu

Title: Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Abstract:
Video Models have achieved remarkable success in high‑fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text‑based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR‑Bench ‑‑ a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze‑solving tasks that inherently require spatial planning and multi‑step reasoning, VR‑Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test‑time scaling effect, where diverse sampling during inference improves reasoning reliability by 10‑‑20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.

PaperID: 7304, https://arxiv.org/pdf/2511.15053.pdf

Authors: Pengcheng Dai, He Wang, Dongming Wang, Wenwu Yu

Title: Distributed primal-dual algorithm for constrained multi-agent reinforcement learning under coupled policies

Abstract:
In this work, we investigate constrained multi‑agent reinforcement learning (CMARL), where agents collaboratively maximize the sum of their local objectives while satisfying individual safety constraints. We propose a framework where agents adopt coupled policies that depend on both local states and parameters, as well as those of their κ_p‑hop neighbors, with κ_p>0 denoting the coupling distance. A distributed primal‑dual algorithm is further developed under this framework, wherein each agent has access only to state‑action pairs within its 2κ_p‑hop neighborhood and to reward information within its κ+ 2κ_p‑hop neighborhood, with κ> 0 representing the truncation distance. Moreover, agents are not permitted to directly share their true policy parameters or Lagrange multipliers. Instead, each agent constructs and maintains local estimates of these variables for other agents and employs such estimates to execute its policy. Additionally, these estimates are further updated and exchanged exclusively through an independent, time‑varying networks, which enhances the overall system security. We establish that, with high probability, our algorithm can achieve an ε‑first‑order stationary convergence with an approximation error of \mathcalO(γ^\fracκ+1κ_p) for discount factor γ\in(0,1). Finally, simulations in GridWorld environment are conducted to demonstrate the effectiveness of the proposed algorithm.

PaperID: 7305, https://arxiv.org/pdf/2511.15041.pdf

Authors: Jingchen Peng, Chaowen Deng, Yili Deng, Boxiang Ren, Lu Yang

Title: Hyper-VIB: A Hypernetwork-Enhanced Information Bottleneck Approach for Task-Oriented Communications

Abstract:
This paper presents Hyper‑VIB, a hypernetwork‑enhanced information bottleneck (IB) approach designed to enable efficient task‑oriented communications in 6G collaborative intelligent systems. Leveraging IB theory, our approach enables an optimal end‑to‑end joint training of device and network models, in terms of the maximal task execution accuracy as well as the minimal communication overhead, through optimizing the trade‑off hyperparameter. To address computational intractability in high‑dimensional IB optimization, a tractable variational upper‑bound approximation is derived. Unlike conventional grid or random search methods that require multiple training rounds with substantial computational costs, Hyper‑VIB introduces a hypernetwork that generates approximately optimal DNN parameters for different values of the hyperparameter within a single training phase. Theoretical analysis in the linear case validates the hypernetwork design. Experimental results demonstrate our Hyper‑VIB's superior accuracy and training efficiency over conventional VIB approaches in both classification and regression tasks.

PaperID: 7306, https://arxiv.org/pdf/2511.15036.pdf

Authors: Kamal Mammadov, Damith C. Ranasinghe

Title: Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit

Abstract:
This paper presents a novel strategy for a multi‑agent pursuit‑evasion game involving multiple faster pursuers with heterogenous speeds and a single slower evader. We define a geometric region, the evader's safe‑reachable set, as the intersection of Apollonius circles derived from each pursuer‑evader pair. The capture strategy is formulated as a zero‑sum game where the pursuers cooperatively minimize the area of this set, while the evader seeks to maximize it, effectively playing a game of spatial containment. By deriving the analytical gradients of the safe‑reachable set's area with respect to agent positions, we obtain closed‑form, instantaneous optimal control laws for the heading of each agent. These strategies are computationally efficient, allowing for real‑time implementation. Simulations demonstrate that the gradient‑based controls effectively steer the pursuers to systematically shrink the evader's safe region, leading to guaranteed capture. This area‑minimization approach provides a clear geometric objective for cooperative capture.

PaperID: 7307, https://arxiv.org/pdf/2511.15002.pdf

Authors: Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah

Title: Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning

Abstract:
Next‑generation networks utilize the Open Radio Access Network (O‑RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness‑Aware Minimization (SAM) in a distributed Multi‑Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal‑difference (TD)‑error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic ρ scheduling scheme to refine the exploration‑exploitation trade‑off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a 22% improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O‑RAN slices.

PaperID: 7308, https://arxiv.org/pdf/2511.14910.pdf

Authors: Yassine Ibork, Myounggyu Won, Lokesh Das

Title: Z-Merge: Multi-Agent Reinforcement Learning for On-Ramp Merging with Zone-Specific V2X Traffic Information

Abstract:
Ramp merging is a critical and challenging task for autonomous vehicles (AVs), particularly in mixed traffic environments with human‑driven vehicles (HVs). Existing approaches typically rely on either lane‑changing or inter‑vehicle gap creation strategies based solely on local or neighboring information, often leading to suboptimal performance in terms of safety and traffic efficiency. In this paper, we present a V2X (vehicle‑to‑everything communication)‑assisted Multiagent Reinforcement Learning (MARL) framework for on‑ramp merging that effectively coordinates the complex interplay between lane‑changing and inter‑vehicle gap adaptation strategies by utilizing zone‑specific global information available from a roadside unit (RSU). The merging control problem is formulated as a Multiagent Partially Observable Markov Decision Process (MA‑POMDP), where agents leverage both local and global observations through V2X communication. To support both discrete and continuous control decisions, we design a hybrid action space and adopt a parameterized deep Q‑learning approach. Extensive simulations, integrating the SUMO traffic simulator and the MOSAIC V2X simulator, demonstrate that our framework significantly improves merging success rate, traffic efficiency, and road safety across diverse traffic scenarios.

PaperID: 7309, https://arxiv.org/pdf/2511.14783.pdf

Authors: Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang

Title: Human or LLM as Standardized Patients? A Comparative Study for Medical Education

Abstract:
Standardized patients (SPs) are indispensable for clinical skills training but remain expensive and difficult to scale. Although large language model (LLM)‑based virtual standardized patients (VSPs) have been proposed as an alternative, their behavior remains unstable and lacks rigorous comparison with human standardized patients. We propose EasyMED, a multi‑agent VSP framework that separates case‑grounded information disclosure from response generation to support stable, inquiry‑conditioned patient behavior. We also introduce SPBench, a human‑grounded benchmark with eight expert‑defined criteria for interaction‑level evaluation. Experiments show that EasyMED more closely matches human SP behavior than existing VSPs, particularly in case consistency and controlled disclosure. A four‑week controlled study further demonstrates learning outcomes comparable to human SP training, with stronger early gains for novice learners and improved flexibility, psychological safety, and cost efficiency.

PaperID: 7310, https://arxiv.org/pdf/2511.14780.pdf

Authors: Keith Moore, Jun W. Kim, David Lyu, Jeffrey Heo, Ehsan Adeli

Title: Ask WhAI:Probing Belief Formation in Role-Primed LLM Agents

Abstract:
We present Ask WhAI, a systems‑level framework for inspecting and perturbing belief states in multi‑agent interactions. The framework records and replays agent interactions, supports out‑of‑band queries into each agent's beliefs and rationale, and enables counterfactual evidence injection to test how belief structures respond to new information. We apply the framework to a medical case simulator notable for its multi‑agent shared memory (a time‑stamped electronic medical record, or EMR) and an oracle agent (the LabAgent) that holds ground truth lab results revealed only when explicitly queried. We stress‑test the system on a multi‑specialty diagnostic journey for a child with an abrupt‑onset neuropsychiatric presentation. Large language model agents, each primed with strong role‑specific priors ("act like a neurologist", "act like an infectious disease specialist"), write to a shared medical record and interact with a moderator across sequential or parallel encounters. Breakpoints at key diagnostic moments enable pre‑ and post‑event belief queries, allowing us to distinguish entrenched priors from reasoning or evidence‑integration effects. The simulation reveals that agent beliefs often mirror real‑world disciplinary stances, including overreliance on canonical studies and resistance to counterevidence, and that these beliefs can be traced and interrogated in ways not possible with human experts. By making such dynamics visible and testable, Ask WhAI offers a reproducible way to study belief formation and epistemic silos in multi‑agent scientific reasoning.

PaperID: 7311, https://arxiv.org/pdf/2511.14777.pdf

Authors: Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah

Title: The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs

Abstract:
Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning, executing multi‑step, rule‑based computations remains unclear. Unlike algorithmic systems, which can deterministically execute long‑horizon symbolic procedures, LLMs often degrade under extended reasoning chains, but there is no controlled, interpretable benchmark to isolate and measure this collapse. We introduce Finite‑State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step‑by‑step given input actions, maintaining state consistency over multiple turns. This task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct probe of the model's internal procedural fidelity. We measure both Turn Accuracy and Task Accuracy to disentangle immediate computation from cumulative state maintenance. Empirical results reveal systematic degradation as task horizon or branching complexity increases. Models perform significantly worse when rule retrieval involves high branching factors than when memory span is long. Larger models show improved local accuracy but remain brittle under multi‑step reasoning unless explicitly prompted to externalize intermediate steps. FSM‑based evaluation offers a transparent, complexity‑controlled probe for diagnosing this failure mode and guiding the design of inductive biases that enable genuine long‑horizon procedural competence. By grounding reasoning in measurable execution fidelity rather than surface correctness, this work helps establish a rigorous experimental foundation for understanding and improving the algorithmic reliability of LLMs.

PaperID: 7312, https://arxiv.org/pdf/2511.14769.pdf

Authors: Yifan Xu, Vipul Gupta, Rohit Aggarwal, Varsha Mahadevan, Bhaskar Krishnamachari

Title: Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications

Abstract:
Retrieval‑Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material, document, code, manuals, from vast and ever‑growing corpora, to effectively answer user queries. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from retrieving more extensive supporting information. However, the common static top‑k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster‑based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query‑document similarity distances. CAR detects the transition point within similarity distances, where tightly clustered, highly relevant documents shift toward less pertinent candidates, establishing an adaptive cut‑off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop‑RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top‑k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end‑to‑end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we've seen user engagement jump by 200%.

PaperID: 7313, https://arxiv.org/pdf/2511.14767.pdf

Authors: Minh-Thuan Nguyen, Thien Vo-Thanh, Thai-Duy Dinh, Xuan-Quang Phan, Tan-Ha Mai, Lam-Son Lê

Title: An LLM-Powered Agent for Real-Time Analysis of the Vietnamese IT Job Market

Abstract:
Individuals entering Vietnam's dynamic Information Technology (IT) job market face a critical gap in reliable career guidance. Existing market reports are often outdated, while the manual analysis of thousands of job postings is impractical for most. To address this challenge, we present the AI Job Market Consultant, a novel conversational agent that delivers deep, data‑driven insights directly from the labor market in real‑time. The foundation of our system is a custom‑built dataset created via an automated pipeline that crawls job portals using Playwright and leverages the Large Language Model (LLM) to intelligently structure unstructured posting data. The core of our system is a tool‑augmented AI agent, based on the ReAct agentic framework, which enables the ability of autonomously reasoning, planning, and executing actions through a specialized toolbox for SQL queries, semantic search, and data visualization. Our prototype successfully collected and analyzed 3,745 job postings, demonstrating its ability to answer complex, multi‑step queries, generate on‑demand visualizations, and provide personalized career advice grounded in real‑world data. This work introduces a new paradigm for labor market analysis, showcasing how specialized agentic AI systems can democratize access to timely, trustworthy career intelligence for the next generation of professionals.

PaperID: 7314, https://arxiv.org/pdf/2511.14730.pdf

Authors: Parya Dolatyabi, Ali Farajzadeh Bavil, Mahdi Khodayar

Title: Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Abstract:
Restoring power distribution systems (PDSs) after large‑scale outages requires sequential switching actions that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints, including power balance, voltage limits, and thermal ratings. These challenges limit the scalability of conventional optimization and value‑based reinforcement learning (RL) approaches. This paper applies a Heterogeneous‑Agent Reinforcement Learning (HARL) framework via Heterogeneous‑Agent Proximal Policy Optimization (HAPPO) to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts. Decentralized actors are trained with a centralized critic for stable on‑policy learning, while a physics‑informed OpenDSS environment enforces electrical feasibility. Experiments on IEEE 123‑bus and 8500‑node feeders show HAPPO outperforms PPO, QMIX, Mean‑Field RL, and other baselines in restored power, convergence stability, and multi‑seed reproducibility. Under a 2400 kW generation cap, the framework restores over 95% of available load on both systems with low‑latency execution, supporting practical real‑time PDS restoration.

PaperID: 7315, https://arxiv.org/pdf/2511.14584.pdf

Authors: Ankush Kadu, Aswanth Krishnan

Title: ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

Abstract:
We present ReflexGrad, a dual‑process architecture for within‑episode failure recovery in LLM agents without demonstrations. When agents commit to a wrong approach early and exhaust the step budget, the post‑failure trajectory contains the information to escape ‑‑ but no published architecture acts on it within a single episode. ReflexGrad routes between a fast process (TextGrad‑style continuous refinement every k=3 steps) and a slow process (Reflexion‑style causal diagnosis when m=5 consecutive low‑progress scores fire a routing gate). A deterministic priority merge keeps the natural‑language policy coherent, and each slow activation emits three observable artifacts: a reproducible trigger, a causal diagnostic, and a verified fix. On ALFWorld 134 tasks, n=10 seeds, no demonstrations, ReflexGrad lifts Qwen‑3‑8B from 35.1% to 75.4% (+40.3pp), beating compute‑matched 1‑shot LATS by +2.7pp (p\approx0.01), ToT by +5.7pp (p<10^‑4), and Self‑Refine by +6.7pp (p<10^‑5); on GPT‑5 the lift is 46.3\to88.1% (+41.8pp). The 1.5pp cross‑model difference is within seed noise (p\approx0.13), suggesting that the routing mechanism, rather than model scale, is the primary source of the gain. Code, prompts, per‑seed logs, and sensitivity sweeps are released.

PaperID: 7316, https://arxiv.org/pdf/2511.14460.pdf

Authors: Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, Enhong Chen

Title: Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

Abstract:
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in‑depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent‑R1, a modular, flexible, and user‑friendly training framework for RL‑based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.

PaperID: 7317, https://arxiv.org/pdf/2511.14433.pdf

Authors: Diana C. Benjumea, Marie Farrell, Louise A. Dennis

Title: Safe-ROS: An Architecture for Autonomous Robots in Safety-Critical Domains

Abstract:
Deploying autonomous robots in safety‑critical domains requires architectures that ensure operational effectiveness and safety compliance. In this paper, we contribute the Safe‑ROS architecture for developing reliable and verifiable autonomous robots in such domains. It features two distinct subsystems: (1) an intelligent control system that is responsible for normal/routine operations, and (2) a Safety System consisting of Safety Instrumented Functions (SIFs) that provide formally verifiable independent oversight. We demonstrate Safe‑ROS on an AgileX Scout Mini robot performing autonomous inspection in a nuclear environment. One safety requirement is selected and instantiated as a SIF. To support verification, we implement the SIF as a cognitive agent, programmed to stop the robot whenever it detects that it is too close to an obstacle. We verify that the agent meets the safety requirement and integrate it into the autonomous inspection. This integration is also verified, and the full deployment is validated in a Gazebo simulation, and lab testing. We evaluate this architecture in the context of the UK nuclear sector, where safety and regulation are crucial aspects of deployment. Success criteria include the development of a formal property from the safety requirement, implementation, and verification of the SIF, and the integration of the SIF into the operational robotic autonomous system. Our results demonstrate that the Safe‑ROS architecture can provide safety verifiable oversight while deploying autonomous robots in safety‑critical domains, offering a robust framework that can be extended to additional requirements and various applications.

PaperID: 7318, https://arxiv.org/pdf/2511.14378.pdf

Authors: Raphael Korbmacher, Daniel Straub, Antoine Tordeux, Claudia Totzeck

Title: Emergent Cooperative Driving Strategies for Stop-and-Go Wave Mitigation via Multi-Agent Reinforcement Learning

Abstract:
Stop‑and‑go waves in traffic flow pose a persistent challenge, compromising safety, efficiency, and environmental sustainability. This paper introduces a novel mitigation strategy discovered through training multi‑agent deep reinforcement learning (DRL) agents in a simulated ring‑road environment. The agents autonomously develop a cooperative driving policy, where most vehicles maintain minimal headways to maximize throughput, while a single "buffer" vehicle adopts a larger headway to absorb perturbations and prevent wave propagation. This strategy enhances stability without sacrificing overall flow. We further demonstrate that adapting this cooperative strategy to classical car‑following models, such as the Intelligent Driver Model (IDM), yields improved stability and traffic efficiency. Furthermore, we show within a parametrised linear framework, that the cooperative strategy can optimise system performance under stability constraints. Our findings offer promising insights for future autonomous vehicle systems and highway management.

PaperID: 7319, https://arxiv.org/pdf/2511.14366.pdf

Authors: Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, Jiaye Ge, Meng Li, Yuhong Liu, Jianxin Dong, Jiaying Li, Hui Wu, Hanwen Liang, Jintai Lin, Yanting Wang, Jie Dong, Tong Zhu, Tianfan Fu, Conghui He, Qi Zhang, Songyang Zhang, Lei Bai, Kai Chen

Title: ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

Abstract:
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high‑difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real‑world scientific inquiry. To address these challenges, we introduce ATLAS (AGI‑Oriented Testbed for Logical Application in Science), a large‑scale, high‑difficulty, and cross‑disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD‑level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross‑Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High‑Fidelity Answers, prioritizing complex, open‑ended answers involving multi‑step reasoning and LaTeX‑formatted expressions over simple multiple‑choice questions; and (4) Rigorous Quality Control, employing a multi‑stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long‑term, open, community‑driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.

PaperID: 7320, https://arxiv.org/pdf/2511.14299.pdf

Authors: Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen

Title: DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

Abstract:
In today's data‑driven era, fully automated end‑to‑end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM‑driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error‑prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi‑agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi‑role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi‑path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

PaperID: 7321, https://arxiv.org/pdf/2511.14247.pdf

Authors: Wenkai Lin, Qiming Xia, Wen Li, Xun Huang, Chenglu Wen

Title: V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization

Abstract:
Multi‑agents rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS‑based localization often fails in GNSS‑denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS‑free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose‑Aware Spatio‑Temporal Alignment Transformer (PASTAT), which performs confidence‑aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi‑traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state‑of‑the‑art performance under GNSS‑denied conditions. We further conduct extended experiments on the real‑world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.

PaperID: 7322, https://arxiv.org/pdf/2511.14135.pdf

Authors: Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor

Title: AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning

Abstract:
Fair workload enforcement in heterogeneous multi‑agent systems that pursue shared objectives remains challenging. Fixed fairness penalties often introduce inefficiencies, training instability, and conflicting agent incentives. Reward‑shaping approaches in fair Multi‑Agent Reinforcement Learning (MARL) typically incorporate fairness through heuristic penalties or scalar reward modifications and often rely on post‑hoc evaluation. However, these methods do not guarantee that a desired fairness level will be satisfied. To address this limitation, we propose the Adaptive Fairness Multi‑Agent Reinforcement Learning (AdaFair‑MARL) framework, which formulates workload fairness as an explicit constraint so that agents maintain balanced contributions while optimizing team performance. We present AdaFair‑MARL, a constrained cooperative MARL framework whose core algorithmic component is a primal‑dual update that enforces workload fairness via adaptive Lagrange multiplier updates. Grounding the framework in a cooperative Markov game, we derive the fairness constraint from Jain's Fairness Index (JFI) geometry and show that the resulting feasible set admits a second‑order cone representation, enabling principled Lagrangian dual‑ascent updates without manual penalty tuning. Experiments in a simulated hospital coordination environment (MARLHospital) demonstrate the effectiveness of AdaFair‑MARL compared to reward‑shaping and fixed‑penalty fairness methods, improving workload balance while maintaining team performance. We found that AdaFair‑MARL achieves nearly perfect constraint satisfaction (0.99‑1.00) while significantly improving workload fairness compared to fixed‑penalty baselines.

PaperID: 7323, https://arxiv.org/pdf/2511.14130.pdf

Authors: Chun Chet Ng, Jia Yu Lim, Wei Zeng Low

Title: PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval

Abstract:
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task‑relevant information from lengthy financial filings is essential for both operational and analytical decision‑making. We present PRISM, a training‑free framework that integrates refined system prompting, in‑context learning (ICL), and lightweight multi‑agent coordination for document and chunk ranking tasks. Our primary contribution is a systematic empirical study of when each component provides value: prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi‑agent systems show potential primarily with larger models and careful architectural design. Extensive ablation studies across FinAgentBench, FiQA‑2018, and FinanceBench reveal that simpler configurations often outperform complex multi‑agent pipelines, providing practical guidance for practitioners. Our best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training‑free approach in the top three. We provide comprehensive feasibility analyses covering latency, token usage, and cost trade‑offs to support deployment decisions. The source code is released at https://bit.ly/prism‑ailens.

PaperID: 7324, https://arxiv.org/pdf/2511.14101.pdf

Authors: Xinpeng Chen, Xiaofeng Han, Kaihao Zhang, Guochao Ren, Yujie Wang, Wenhao Cao, Yang Zhou, Jianfeng Lu, Zhenbo Song

Title: APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design

Abstract:
Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time‑intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design software can now help to perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD‑agents, a large language model (LLM) driven multi‑agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user's description of the page, the OrchestratorAgent can dynamically can direct other agents to accomplish users' design task. To be specific, the SemanticParserAgent is responsible for converting users' descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse‑grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few‑shot examples and enhance the quality of layout generation. Besides, a RecursiveComponentAgent can be used to decide how to recursively generate all the fine‑grained sub‑elements it contains for each element in the layout. Our work fully leverages the automatic collaboration capabilities of large‑model‑driven multi‑agent systems. Experimental results on the RICO dataset show that our APD‑agents achieve state‑of‑the‑art performance.

PaperID: 7325, https://arxiv.org/pdf/2511.14043.pdf

Authors: Chandrachur Bhattacharya, Sibendu Som

Title: AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance

Abstract:
AI Scientific Assistant Core (AISAC) is a transparent, modular multi‑agent runtime developed at Argonne National Laboratory to support long‑horizon, evidence‑grounded scientific reasoning. Rather than proposing new agent algorithms or claiming autonomous scientific discovery, AISAC contributes a governed execution substrate that operationalizes key requirements for deploying agentic AI in scientific practice, including explicit role semantics, budgeted context management, traceable execution, and reproducible interaction with tools and knowledge. AISAC enforces four structural guarantees for scientific reasoning: (1) declarative agent registration with runtime‑enforced role semantics and automatic system prompt generation; (2) budgeted orchestration via explicit per‑turn context and delegation depth limits; (3) role‑aligned memory access across episodic, dialogue, and evidence layers; and (4) trace‑driven transparency through persistent execution records and a live event‑stream interface. These guarantees are implemented through hybrid persistent memory (SQLite and dual FAISS indices), governed retrieval with agent‑scoped RAG, structured tool execution with schema validation, and a configuration‑driven bootstrap mechanism that enables project specific extension without modifying the shared core. AISAC is currently deployed across multiple scientific workflows at Argonne, including combustion science, materials research, and energy process safety, demonstrating its use as a reusable substrate for domain‑specialized AI scientific assistants.

PaperID: 7326, https://arxiv.org/pdf/2511.14018.pdf

Authors: Minghu Wang, Shuliang Zhao, Yuanyuan Zhao, Hongxia Xu

Title: ALEX:A Light Editing-knowledge Extractor

Abstract:
The static nature of knowledge within Large Language Models (LLMs) makes it difficult for them to adapt to evolving information, rendering knowledge editing a critical task. However, existing methods struggle with challenges of scalability and retrieval efficiency, particularly when handling complex, multi‑hop questions that require multi‑step reasoning. To address these challenges, this paper introduces ALEX (A Light Editing‑knowledge Extractor), a lightweight knowledge editing framework. The core innovation of ALEX is its hierarchical memory architecture, which organizes knowledge updates (edits) into semantic clusters. This design fundamentally reduces retrieval complexity from a linear O(N) to a highly scalable O(K+N/C). Furthermore, the framework integrates an Inferential Query Synthesis (IQS) module to bridge the semantic gap between queries and facts , and a Dynamic Evidence Adjudication (DEA) engine that executes an efficient two‑stage retrieval process. Experiments on the MQUAKE benchmark demonstrate that ALEX significantly improves both the accuracy of multi‑hop answers (MultiHop‑ACC) and the reliability of reasoning paths (HopWise‑ACC). It also reduces the required search space by over 80% , presenting a promising path toward building scalable, efficient, and accurate knowledge editing systems.

PaperID: 7327, https://arxiv.org/pdf/2511.13998.pdf

Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

Title: LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Abstract:
As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real‑world capabilities becomes critical. While existing benchmarks like LoCoBench~\citeqiu2025locobench assess long‑context code understanding, they focus on single‑turn evaluation and cannot capture the multi‑turn interactive nature, tool usage patterns, and adaptive reasoning required by real‑world coding agents. We introduce LoCoBench‑Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long‑context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi‑turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long‑context performance. Through systematic evaluation of state‑of‑the‑art models, we reveal several key findings: (1) agents exhibit remarkable long‑context robustness; (2) comprehension‑efficiency trade‑off exists with negative correlation, where thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high‑performing agents. As the first long‑context LLM agent benchmark for software engineering, LoCoBench‑Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.

PaperID: 7328, https://arxiv.org/pdf/2511.13987.pdf

Authors: Antonio Manuel Martínez-Heredia, Dolores Godrid Rodríguez, Andrés Ortiz García

Title: Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases

Abstract:
This paper presents an integrative review and experimental validation of artificial intelligence (AI) agents applied to music analysis and education. We synthesize the historical evolution from rule‑based models to contemporary approaches involving deep learning, multi‑agent architectures, and retrieval‑augmented generation (RAG) frameworks. The pedagogical implications are evaluated through a dual‑case methodology: (1) the use of generative AI platforms in secondary education to foster analytical and creative skills; (2) the design of a multiagent system for symbolic music analysis, enabling modular, scalable, and explainable workflows. Experimental results demonstrate that AI agents effectively enhance musical pattern recognition, compositional parameterization, and educational feedback, outperforming traditional automated methods in terms of interpretability and adaptability. The findings highlight key challenges concerning transparency, cultural bias, and the definition of hybrid evaluation metrics, emphasizing the need for responsible deployment of AI in educational environments. This research contributes to a unified framework that bridges technical, pedagogical, and ethical considerations, offering evidence‑based guidance for the design and application of intelligent agents in computational musicology and music education.

PaperID: 7329, https://arxiv.org/pdf/2511.13983.pdf

Authors: Peng Shu, Junhao Chen, Zhengliang Liu, Hanqi Jiang, Yi Pan, Khanh Nhu Nguyen, Zihao Wu, Huaqin Zhao, Yiwei Li, Enze Shi, ShaoChen Xu

Title: MoMoE: A Mixture of Expert Agent Model for Financial Sentiment Analysis

Abstract:
We present a novel approach called Mixture of Mixture of Expert (MoMoE) that combines the strengths of Mixture‑of‑Experts (MoE) architectures with collaborative multi‑agent frameworks. By modifying the LLaMA 3.1 8B architecture to incorporate MoE layers in each agent of a layered collaborative structure, we create an ensemble of specialized expert agents that iteratively refine their outputs. Each agent leverages an MoE layer in its final attention block, enabling efficient task decomposition while maintaining computational feasibility. This hybrid approach creates specialized pathways through both the model architecture and the agent collaboration layers. Experimental results demonstrate significant improvements across multiple language understanding and generation benchmarks, highlighting the synergistic benefits of combining expert routing at both the neural and agent levels.

PaperID: 7330, https://arxiv.org/pdf/2511.13979.pdf

Authors: Harang Ju, Sinan Aral

Title: Personality Pairing Improves Human-AI Collaboration

Abstract:
Here we examine how AI agent "personalities" interact with human personalities to shape human‑AI collaboration and performance. In a large‑scale, preregistered randomized experiment, we paired 1,258 participants with AI agents prompted to exhibit varying levels of the Big Five personality traits. These human‑AI teams produced 7,266 display ads for a real think tank, which we evaluated using 1,995 independent human raters and a field experiment on X that generated nearly 5 million impressions. We found that human and AI personalities individually shaped ad quality and teamwork. When examined together, human‑AI personality pairings directly effected ad quality outcomes. For example, extraverted humans paired with conscientious AI produced the lowest‑quality ads, followed by conscientious humans paired with agreeable AI and neurotic humans paired with conscientious AI. In the field experiment, ad quality significantly influenced ad performance, measured by click‑through rates and cost‑per‑click, and neurotic humans paired with neurotic AI achieved higher click‑through rates, even after controlling for ad quality. Together, these results provide the first large‑scale causal experimental evidence that specific personality pairings can improve human‑AI collaboration and motivate future research on the implications of AI personalization for performance and teamwork dynamics in human‑AI teams.

PaperID: 7331, https://arxiv.org/pdf/2511.13961.pdf

Authors: Jiarui Li, Alessandro Zanardi, Federico Pecora, Runyu Zhang, Gioele Zardini

Title: FICO: Finite-Horizon Closed-Loop Factorization for Unified Multi-Agent Path Finding

Abstract:
Multi‑Agent Path Finding is a fundamental problem in robotics and AI, yet most existing formulations treat planning and execution separately and address variants of the problem in an ad hoc manner. This paper presents a system‑level framework for MAPF that integrates planning and execution, generalizes across variants, and explicitly models uncertainties. At its core is the MAPF system, a formal model that casts MAPF as a control design problem encompassing classical and uncertainty‑aware formulations. To solve it, we introduce Finite‑Horizon Closed‑Loop Factorization (FICO), a factorization‑based algorithm inspired by receding‑horizon control that exploits compositional structure for efficient closed‑loop operation. FICO enables real‑time responses ‑‑ commencing execution within milliseconds ‑‑ while scaling to thousands of agents and adapting seamlessly to execution‑time uncertainties. Extensive case studies demonstrate that it reduces computation time by up to two orders of magnitude compared with open‑loop baselines, while delivering significantly higher throughput under stochastic delays and agent arrivals. These results establish a principled foundation for analyzing and advancing MAPF through system‑level modeling, factorization, and closed‑loop design.

PaperID: 7332, https://arxiv.org/pdf/2511.13926.pdf

Authors: Ingyu Jang, Ethan J. LoCicero, Leila Bridgeman

Title: Consensus-Based Stability Analysis of Multi-Agent Networks

Abstract:
The emergence of large‑scale multi‑agent systems has led to controller synthesis methods for sparse communication between agents. However, most sparse controller synthesis algorithms remain centralized, requiring information exchange and high computational costs. This underscores the need for distributed algorithms that design controllers using only local dynamics information from each agent. This paper presents a consensus‑based distributed stability analysis. The proposed stability analysis algorithms leverage Vidyasagar's Network Dissipativity Theorem and the alternating direction methods of multipliers to perform general stability analysis. Numerical examples involving a 2D swarm of unmanned aerial vehicles demonstrate the convergence of the proposed algorithms.

PaperID: 7333, https://arxiv.org/pdf/2511.13865.pdf

Authors: James Bono, Beibei Cheng, Joaquin Lozano

Title: Randomized Controlled Trials for Conditional Access Optimization Agent

Abstract:
AI agents are increasingly deployed to automate complex enterprise workflows, yet evidence of their effectiveness in identity governance is limited. We report results from the first randomized controlled trial (RCT) evaluating an AI agent for Conditional Access (CA) policy management in Microsoft Entra. The agent assists with four high‑value tasks: policy merging, Zero‑Trust baseline gap detection, phased rollout planning, and user‑policy alignment. In a production‑grade environment, 162 identity administrators were randomly assigned to a control group (no agent) or treatment group (agent‑assisted) and asked to perform these tasks. Agent access produced substantial gains: accuracy improved by 48% and task completion time decreased by 43% while holding accuracy constant. The largest benefits emerged on cognitively demanding tasks such as baseline gap detection. These findings demonstrate that purpose‑built AI agents can significantly enhance both speed and accuracy in identity administration.

PaperID: 7334, https://arxiv.org/pdf/2511.13860.pdf

Authors: James Bono

Title: Randomized Controlled Trials for Phishing Triage Agent

Abstract:
Security operations centers (SOCs) face a persistent challenge: efficiently triaging a high volume of user‑reported phishing emails while maintaining robust protection against threats. This paper presents the first randomized controlled trial (RCT) evaluating the impact of a domain‑specific AI agent ‑ the Microsoft Security Copilot Phishing Triage Agent ‑ on analyst productivity and accuracy. Our results demonstrate that agent‑augmented analysts achieved up to 6.5 times as many true positives per analyst minute and a 77% improvement in verdict accuracy compared to a control group. The agent's queue prioritization and verdict explanations were both significant drivers of efficiency. Behavioral analysis revealed that agent‑augmented analysts reallocated their attention, spending 53% more time on malicious emails, and were not prone to rubber‑stamping the agent's malicious verdicts. These findings offer actionable insights for SOC leaders considering AI adoption, including the potential for agents to fundamentally change the optimal allocation of SOC resources.

PaperID: 7335, https://arxiv.org/pdf/2511.13859.pdf

Authors: Mahan FakouriFard, Mingxi Liu

Title: Targeted Algorithmic Purpose-Driven Cyber Attacks in Distributed Multi-Agent Optimization

Abstract:
Distributed multi‑agent optimization (DMAO) enables the scalable control and coordination of a large population of edge resources in complex multi‑agent environments. Despite its great scalability, DMAO is prone to cyber attacks as it relies on frequent peer‑to‑peer communications that are vulnerable to malicious data injection and alteration. Existing cybersecurity research mainly focuses on \emphbroad‑spectrum attacks that aim to jeopardize the overall environment but fail to sustainably achieve specific or targeted objectives. This paper develops a class of novel strategic purpose‑driven algorithmic attacks that are launched by participating agents and interface with DMAO to achieve self‑interested attacking purposes. Theoretical foundations, in both primal and dual senses, are established for these attack vectors with and without stealthy features. Simulations on electric vehicle charging control validate the efficacy of the proposed algorithmic attacks and show the impacts of such attacks on the power distribution network.

PaperID: 7336, https://arxiv.org/pdf/2511.13788.pdf

Authors: Samuel Nathanson, Rebecca Williams, Cynthia Matuszek

Title: Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Abstract:
Large language models (LLMs) increasingly operate in multi‑agent and safety‑critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones ‑ eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi‑turn attacker‑target exchanges across major LLM families and scales (0.6B‑120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model‑based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker‑to‑target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. Mean harm score variance is higher across attackers (0.18) than across targets (0.10), suggesting that attacker‑side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = ‑0.93, p < 0.001), showing that attacker‑side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter‑model alignment and safety.

PaperID: 7337, https://arxiv.org/pdf/2511.13759.pdf

Authors: Han Wang, Deyi Ji, Junyu Lu, Lanyun Zhu, Hailong Zhang, Haiyang Wu, Liqun Liu, Peng Shu, Roy Ka-Wei Lee

Title: Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

Abstract:
Accurate detection of offensive content on social media demands high‑quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low‑resource challenge, we propose a self‑training framework that leverages abundant unlabeled data through collaborative pseudo‑labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo‑labels to unlabeled instances with the support of Multi‑Agent Vision‑Language Models (MA‑VLMs). Un‑labeled data on which the classifier and MA‑VLMs agree are designated as the Agreed‑Unknown set, while conflicting samples form the Disagreed‑Unknown set. To enhance label reliability, MA‑VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive‑Negative‑Unlabeled (PNU) loss, which jointly exploits labeled, Agreed‑Unknown, and Disagreed‑Unknown data while mitigating pseudo‑label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large‑scale models

PaperID: 7338, https://arxiv.org/pdf/2511.13725.pdf

Authors: Sechan Lee, Sangdon Park

Title: AI Kill Switch for malicious web-based LLM agent

Abstract:
Recently, web‑based Large Language Model (LLM) agents autonomously perform increasingly complex tasks, thereby bringing significant convenience. However, they also amplify the risks of malicious misuse cases such as unauthorized collection of personally identifiable information (PII), generation of socially divisive content, and even automated web hacking. To address these threats, we propose an AI Kill Switch technique that can immediately halt the operation of malicious web‑based LLM agents. To achieve this, we introduce AutoGuard ‑ the key idea is generating defensive prompts that trigger the safety mechanisms of malicious LLM agents. In particular, generated defense prompts are transparently embedded into the website's DOM so that they remain invisible to human users but can be detected by the crawling process of malicious agents, triggering its internal safety mechanisms to abort malicious actions once read. To evaluate our approach, we constructed a dedicated benchmark consisting of three representative malicious scenarios. Experimental results show that AutoGuard achieves over 80% Defense Success Rate (DSR) across diverse malicious agents, including GPT‑4o, Claude‑4.5‑Sonnet and generalizes well to advanced models like GPT‑5.1, Gemini‑2.5‑flash, and Gemini‑3‑pro. Also, our approach demonstrates robust defense performance in real‑world website environments without significant performance degradation for benign agents. Through this research, we demonstrate the controllability of web‑based LLM agents, thereby contributing to the broader effort of AI control and safety.

PaperID: 7339, https://arxiv.org/pdf/2511.13707.pdf

Authors: Xiaoyu Liang, Ziang Liu, Kelvin Lin, Edward Gu, Ruolin Ye, Tam Nguyen, Cynthia Hsu, Zhanxin Wu, Xiaoman Yang, Christy Sum Yu Cheung, Harold Soh, Katherine Dimitropoulou, Tapomayukh Bhattacharjee

Title: OpenRoboCare: A Multimodal Multi-Task Expert Demonstration Dataset for Robot Caregiving

Abstract:
We present OpenRoboCare, a multimodal dataset for robot caregiving, capturing expert occupational therapist demonstrations of Activities of Daily Living (ADLs). Caregiving tasks involve complex physical human‑robot interactions, requiring precise perception under occlusions, safe physical contact, and long‑horizon planning. While recent advances in robot learning from demonstrations have shown promise, there is a lack of a large‑scale, diverse, and expert‑driven dataset that captures real‑world caregiving routines. To address this gap, we collect data from 21 occupational therapists performing 15 ADL tasks on two manikins. The dataset spans five modalities: RGB‑D video, pose tracking, eye‑gaze tracking, task and action annotations, and tactile sensing, providing rich multimodal insights into caregiver movement, attention, force application, and task execution strategies. We further analyze expert caregiving principles and strategies, offering insights to improve robot efficiency and task feasibility. Additionally, our evaluations demonstrate that OpenRoboCare presents challenges for state‑of‑the‑art robot perception and human activity recognition methods, both critical for developing safe and adaptive assistive robots, highlighting the value of our contribution. See our website for additional visualizations: https://emprise.cs.cornell.edu/robo‑care/.

PaperID: 7340, https://arxiv.org/pdf/2511.13646.pdf

Authors: Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

Title: Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Abstract:
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real‑world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end‑to‑end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self‑improving software agents recently, including the Darwin‑Gödel Machine (DGM). Meanwhile, such self‑improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live‑SWE‑agent, the first live software agent that can autonomously and continuously evolve itself on‑the‑fly during runtime when solving real‑world software problems. More specifically, Live‑SWE‑agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini‑SWE‑agent), and autonomously evolves its own scaffold implementation while solving real‑world software problems. Our evaluation on the widely studied SWE‑bench Verified benchmark shows that LIVE‑SWE‑AGENT can achieve an impressive solve rate of 77.4% without test‑time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live‑SWE‑agent outperforms state‑of‑the‑art manually crafted software agents on the recent SWE‑Bench Pro benchmark, achieving the best‑known solve rate of 45.8%.

PaperID: 7341, https://arxiv.org/pdf/2511.13614.pdf

Authors: Jerick Shi, Burton Hollifield

Title: Market-Dependent Communication in Multi-Agent Alpha Generation

Abstract:
Multi‑strategy hedge funds face a fundamental organizational choice: should analysts generating trading strategies communicate, and if so, how? We investigate this using 5‑agent LLM‑based trading systems across 450 experiments spanning 21 months, comparing five organizational structures from isolated baseline to collaborative and competitive conversation. We show that communication improves performance, but optimal communication design depends on market characteristics. Competitive conversation excels in volatile technology stocks, while collaborative conversation dominates stable general stocks. Finance stocks resist all communication interventions. Surprisingly, all structures, including isolated agents, converge to similar strategy alignments, challenging assumptions that transparency causes harmful diversity loss. Performance differences stem from behavioral mechanisms: competitive agents focus on stock‑level allocation while collaborative agents develop technical frameworks. Conversation quality scores show zero correlation with returns. These findings demonstrate that optimal communication design must match market volatility characteristics, and sophisticated discussions don't guarantee better performance.

PaperID: 7342, https://arxiv.org/pdf/2511.13476.pdf

Authors: Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Title: Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

Abstract:
Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision‑relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi‑agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM‑as‑a‑judge agent, and an optional human‑in‑the‑loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder‑oriented reports. The system is validated through a real‑world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state‑of‑the‑art LLMs and three prompting paradigms identify GPT‑4.1 mini with Chain‑of‑Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi‑agent orchestration significantly enhances factual precision, coherence, and scalability in LLM‑based reporting. The proposed framework establishes a replicable and domain‑adaptive methodology for AI‑driven narrative generation and decision support in energy informatics.

PaperID: 7343, https://arxiv.org/pdf/2511.13361.pdf

Authors: Jiyang Zheng, Islam Nassar, Thanh Vu, Xu Zhong, Yang Lin, Tongliang Liu, Long Duong, Yuan-Fang Li

Title: MedDCR: Learning to Design Agentic Workflows for Medical Coding

Abstract:
Medical coding converts free‑text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi‑step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross‑document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real‑world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed‑loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state‑of‑the‑art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.

PaperID: 7344, https://arxiv.org/pdf/2511.13288.pdf

Authors: Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu

Title: Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Abstract:
Multi‑agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit the performances due to different distributions underlying for different agents. Therefore, training multi‑agent systems with distinct LLMs should be the next step to solve. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub‑agent invocations, and agents are often deployed across separate servers, disrupting end‑to‑end gradient flow. To address these issues, we propose M‑GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi‑agent systems with a main agent (planner) and multiple sub‑agents (multi‑turn tool executors). M‑GRPO computes group‑relative advantages for both main and sub‑agents, maintaining hierarchical credit assignment. It also introduces a trajectory‑alignment scheme that generates fixed‑size batches despite variable sub‑agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross‑server backpropagation. In experiments on real‑world benchmarks (e.g., GAIA, XBench‑DeepSearch, and WebWalkerQA), M‑GRPO consistently outperforms both single‑agent GRPO and multi‑agent GRPO with frozen sub‑agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool‑augmented reasoning tasks.

PaperID: 7345, https://arxiv.org/pdf/2511.13233.pdf

Authors: Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, Teruaki Hayashi

Title: LLM-based Multi-Agent System for Simulating Strategic and Goal-Oriented Data Marketplaces

Abstract:
Data marketplaces, which mediate the purchase and exchange of data from third parties, have attracted growing attention for reducing the cost and effort of data collection while enabling the trading of diverse datasets. However, a systematic understanding of the interactions between market participants, data, and regulations remains limited. To address this gap, we propose a Large Language Model‑based Multi‑Agent System (LLM‑MAS) for data marketplaces. In our framework, buyer and seller agents powered by LLMs operate with explicit objectives and autonomously perform strategic actions, such as planning, searching, purchasing, pricing, and updating data. These agents can reason about market dynamics, forecast future demand, and adjust strategies accordingly. Unlike conventional model‑based simulations, which are typically constrained to predefined rules, LLM‑MAS supports broader and more adaptive behavior selection through natural language reasoning. We evaluated the framework via simulation experiments using three distribution‑based metrics: (1) the number of purchases per dataset, (2) the number of purchases per buyer, and (3) the number of repeated purchases of the same dataset. The results demonstrate that LLM‑MAS more faithfully reproduces trading patterns observed in real data marketplaces compared to traditional approaches, and further captures the emergence and evolution of market trends.

PaperID: 7346, https://arxiv.org/pdf/2511.13226.pdf

Authors: Michele Persiani, Thomas Hellstrom

Title: Informative Communication of Robot Plans

Abstract:
When a robot is asked to verbalize its plan it can do it in many ways. For example, a seemingly natural strategy is incremental, where the robot verbalizes its planned actions in plan order. However, an important aspect of this type of strategy is that it misses considerations on what is effectively informative to communicate, because not considering what the user knows prior to explanations. In this paper we propose a verbalization strategy to communicate robot plans informatively, by measuring the information gain that verbalizations have against a second‑order theory of mind of the user capturing his prior knowledge on the robot. As shown in our experiments, this strategy allows to understand the robot's goal much quicker than by using strategies such as increasing or decreasing plan order. In addition, following our formulation we hint to what is informative and why when a robot communicates its plan.

PaperID: 7347, https://arxiv.org/pdf/2511.13158.pdf

Authors: Samuele Burattini, Alessandro Ricci, Simon Mayer, Danai Vachtsevanou, Jeremy Lemee, Andrei Ciortea, Angelo Croatti

Title: Agent-Oriented Visual Programming for the Web of Things

Abstract:
In this paper we introduce and discuss an approach for multi‑agent‑oriented visual programming. This aims at enabling individuals without programming experience but with knowledge in specific target domains to design and (re)configure autonomous software. We argue that, compared to procedural programming, it should be simpler for users to create programs when agent abstractions are employed. The underlying rationale is that these abstractions, and specifically the belief‑desire‑intention architecture that is aligned with human practical reasoning, match more closely with people's everyday experience in interacting with other agents and artifacts in the real world. On top of this, we designed and implemented a visual programming system for agents that hides the technicalities of agent‑oriented programming using a blocks‑based visual development environment that is built on the JaCaMo platform. To further validate the proposed solution, we integrate the Web of Things (WoT) to let users create autonomous behaviour on top of physical mashups of devices, following the trends in industrial end‑user programming. Finally, we report on a pilot user study where we verified that novice users are indeed able to make use of this development environment to create multi‑agent systems to solve simple automation tasks.

PaperID: 7348, https://arxiv.org/pdf/2511.13137.pdf

Authors: Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen

Title: Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

Abstract:
Task decomposition has shown promise in complex cooperative multi‑agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long‑horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C\textD^\text3T), a novel two‑level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high‑level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C\textD^\text3T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi‑head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C\textD^\text3T achieves better performance than existing baselines.

PaperID: 7349, https://arxiv.org/pdf/2511.13103.pdf

Authors: Vidur Sinha, Muhammed Ustaomeroglu, Guannan Qu

Title: Transformer-Based Scalable Multi-Agent Reinforcement Learning for Networked Systems with Long-Range Interactions

Abstract:
Multi‑agent reinforcement learning (MARL) has shown promise for large‑scale network control, yet existing methods face two major limitations. First, they typically rely on assumptions leading to decay properties of local agent interactions, limiting their ability to capture long‑range dependencies such as cascading power failures or epidemic outbreaks. Second, most approaches lack generalizability across network topologies, requiring retraining when applied to new graphs. We introduce STACCA (Shared Transformer Actor‑Critic with Counterfactual Advantage), a unified transformer‑based MARL framework that addresses both challenges. STACCA employs a centralized Graph Transformer Critic to model long‑range dependencies and provide system‑level feedback, while its shared Graph Transformer Actor learns a generalizable policy capable of adapting across diverse network structures. Further, to improve credit assignment during training, STACCA integrates a novel counterfactual advantage estimator that is compatible with state‑value critic estimates. We evaluate STACCA on epidemic containment and rumor‑spreading network control tasks, demonstrating improved performance, network generalization, and scalability. These results highlight the potential of transformer‑based MARL architectures to achieve scalable and generalizable control in large‑scale networked systems.

PaperID: 7350, https://arxiv.org/pdf/2511.12916.pdf

Authors: Yafang Wang, Yangjie Tian, Xiaoyu Shen, Gaoyang Zhang, Jiaze Sun, He Zhang, Ruohua Xu, Feng Zhao

Title: Fault2Flow: An AlphaEvolve-Optimized Human-in-the-Loop Multi-Agent System for Fault-to-Workflow Automation

Abstract:
Power grid fault diagnosis is a critical process hindered by its reliance on manual, error‑prone methods. Technicians must manually extract reasoning logic from dense regulations and attempt to combine it with tacit expert knowledge, which is inefficient, error‑prone, and lacks maintainability as ragulations are updated and experience evolves. While Large Language Models (LLMs) have shown promise in parsing unstructured text, no existing framework integrates these two disparate knowledge sources into a single, verified, and executable workflow. To bridge this gap, we propose Fault2Flow, an LLM‑based multi‑agent system. Fault2Flow systematically: (1) extracts and structures regulatory logic into PASTA‑formatted fault trees; (2) integrates expert knowledge via a human‑in‑the‑loop interface for verification; (3) optimizes the reasoning logic using a novel AlphaEvolve module; and (4) synthesizes the final, verified logic into an n8n‑executable workflow. Experimental validation on transformer fault diagnosis datasets confirms 100% topological consistency and high semantic fidelity. Fault2Flow establishes a reproducible path from fault analysis to operational automation, substantially reducing expert workload.

PaperID: 7351, https://arxiv.org/pdf/2511.12892.pdf

Authors: Liangshun Wu, Wen Chen, Shunqing Zhang, Yajun Wang, Kunlun Wang

Title: Green Emergency Communications in RIS- and MA-Assisted Multi-UAV SAGINs: A Partially Observable Reinforcement Learning Approach

Abstract:
In post‑disaster space‑air‑ground integrated networks (SAGINs), terrestrial infrastructure is often impaired, and unmanned aerial vehicles (UAVs) must rapidly restore connectivity for mission‑critical ground terminals in cluttered non‑line‑of‑sight (NLoS) urban environments. To enhance coverage, UAVs employ movable antennas (MAs), while reconfigurable intelligent surfaces (RISs) on surviving high‑rises redirect signals. The key challenge is communication‑limited partial observability, leaving each UAV with a narrow, fast‑changing neighborhood view that destabilizes value estimation. Existing multi‑agent reinforcement learning (MARL) approaches are inadequate‑‑non‑communication methods rely on unavailable global critics, heuristic sharing is brittle and redundant, and learnable protocols (e.g., CommNet, DIAL) lose per‑neighbor structure and aggravate non‑stationarity under tight bandwidth. To address partial observability, we propose a spatiotemporal A2C where each UAV transmits prior‑decision messages with local state, a compact policy fingerprint, and a recurrent belief, encoded per neighbor and concatenated. A spatial discount shapes value targets to emphasize local interactions, while analysis under one‑hop‑per‑slot latency explains stable training with delayed views. Experimental results show our policy outperforms IA2C, ConseNet, FPrint, DIAL, and CommNet‑‑achieving faster convergence, higher asymptotic reward, reduced Temporal‑Difference(TD)/advantage errors, and a better communication throughput‑energy trade‑off.

PaperID: 7352, https://arxiv.org/pdf/2511.12876.pdf

Authors: Heyang Ma, Qirui Mi, Qipeng Yang, Zijun Fan, Bo Li, Haifeng Zhang

Title: Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

Abstract:
Economic decision‑making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi‑agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language‑Augmented Multi‑Agent Policy), a framework that integrates language into economic decision‑making and narrows the gap to real‑world settings. LAMP follows a Think‑Speak‑Decide pipeline: (1) Think interprets numerical observations to extract short‑term shocks and long‑term trends, caching high‑value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language‑augmented decision‑making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM‑only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language‑augmented policies to deliver more effective and robust economic strategies.

PaperID: 7353, https://arxiv.org/pdf/2511.12848.pdf

Authors: Max M. Sun, Todd Murphey

Title: Structured Imitation Learning of Interactive Policies through Inverse Games

Abstract:
Generative model‑based imitation learning methods have recently achieved strong results in learning high‑complexity motor skills from human demonstrations. However, imitation learning of interactive policies that coordinate with humans in shared spaces without explicit communication remains challenging, due to the significantly higher behavioral complexity in multi‑agent interactions compared to non‑interactive tasks. In this work, we introduce a structured imitation learning framework for interactive policies by combining generative single‑agent policy learning with a flexible yet expressive game‑theoretic structure. Our method explicitly separates learning into two steps: first, we learn individual behavioral patterns from multi‑agent demonstrations using standard imitation learning; then, we structurally learn inter‑agent dependencies by solving an inverse game problem. Preliminary results in a synthetic 5‑agent social navigation task show that our method significantly improves non‑interactive policies and performs comparably to the ground truth interactive policy using only 50 demonstrations. These results highlight the potential of structured imitation learning in interactive settings.

PaperID: 7354, https://arxiv.org/pdf/2511.12836.pdf

Authors: Waheed U. Bajwa, Mert Gurbuzbalaban, Mustafa Ali Kutbay, Lingjiong Zhu, Muhammad Zulqarnain

Title: DIGing--SGLD: Decentralized and Scalable Langevin Sampling over Time--Varying Networks

Abstract:
Sampling from a target distribution induced by training data is central to Bayesian learning, with Stochastic Gradient Langevin Dynamics (SGLD) serving as a key tool for scalable posterior sampling and decentralized variants enabling learning when data are distributed across a network of agents. This paper introduces DIGing‑SGLD, a decentralized SGLD algorithm designed for scalable Bayesian learning in multi‑agent systems operating over time‑varying networks. Existing decentralized SGLD methods are restricted to static network topologies, and many exhibit steady‑state sampling bias caused by network effects, even when full batches are used. DIGing‑SGLD overcomes these limitations by integrating Langevin‑based sampling with the gradient‑tracking mechanism of the DIGing algorithm, originally developed for decentralized optimization over time‑varying networks, thereby enabling efficient and bias‑free sampling without a central coordinator. To our knowledge, we provide the first finite‑time non‑asymptotic Wasserstein convergence guarantees for decentralized SGLD‑based sampling over time‑varying networks, with explicit constants. Under standard strong convexity and smoothness assumptions, DIGing‑SGLD achieves geometric convergence to an O(\sqrtη) neighborhood of the target distribution, where η is the stepsize, with dependence on the target accuracy matching the best‑known rates for centralized and static‑network SGLD algorithms using constant stepsize. Numerical experiments on Bayesian linear and logistic regression validate the theoretical results and demonstrate the strong empirical performance of DIGing‑SGLD under dynamically evolving network conditions.

PaperID: 7355, https://arxiv.org/pdf/2511.12792.pdf

Authors: Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Zehong Cao, Ryszard Kowalczyk

Title: Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization

Abstract:
This work investigates resource optimization in heterogeneous satellite clusters performing autonomous Earth Observation (EO) missions using Reinforcement Learning (RL). In the proposed setting, two optical satellites and one Synthetic Aperture Radar (SAR) satellite operate cooperatively in low Earth orbit to capture ground targets and manage their limited onboard resources efficiently. Traditional optimization methods struggle to handle the real‑time, uncertain, and decentralized nature of EO operations, motivating the use of RL and Multi‑Agent Reinforcement Learning (MARL) for adaptive decision‑making. This study systematically formulates the optimization problem from single‑satellite to multi‑satellite scenarios, addressing key challenges including energy and memory constraints, partial observability, and agent heterogeneity arising from diverse payload capabilities. Using a near‑realistic simulation environment built on the Basilisk and BSK‑RL frameworks, we evaluate the performance and stability of state‑of‑the‑art MARL algorithms such as MAPPO, HAPPO, and HATRPO. Results show that MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non‑stationarity and inter‑agent reward coupling. The findings provide practical insights into scalable, autonomous satellite operations and contribute a foundation for future research on intelligent EO mission planning under heterogeneous and dynamic conditions.

PaperID: 7356, https://arxiv.org/pdf/2511.12756.pdf

Authors: Sungjun Seo, Kooktae Lee

Title: Density-Driven Optimal Control for Non-Uniform Area Coverage in Decentralized Multi-Agent Systems Using Optimal Transport

Abstract:
This paper addresses the fundamental problem of non‑uniform area coverage in multi‑agent systems, where different regions require varying levels of attention due to mission‑dependent priorities. Existing uniform coverage strategies are insufficient for realistic applications, and many non‑uniform approaches either lack optimality guarantees or fail to incorporate crucial real‑world constraints such as agent dynamics, limited operation time, the number of agents, and decentralized execution. To resolve these limitations, we propose a novel framework called Density‑Driven Optimal Control (D2OC). The central idea of D2OC is the integration of optimal transport theory with multi‑agent coverage control, enabling each agent to continuously adjust its trajectory to match a mission‑specific reference density map. The proposed formulation establishes optimality by solving a constrained optimization problem that explicitly incorporates physical and operational constraints. The resulting control input is analytically derived from the Lagrangian of the objective function, yielding closed‑form optimal solutions for linear systems and a generalizable structure for nonlinear systems. Furthermore, a decentralized data‑sharing mechanism is developed to coordinate agents without reliance on global information. Comprehensive simulation studies demonstrate that D2OC achieves significantly improved non‑uniform area coverage performance compared to existing methods, while maintaining scalability and decentralized implementability.

PaperID: 7357, https://arxiv.org/pdf/2511.12710.pdf

Authors: Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma

Title: Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Abstract:
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure‑driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi‑agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi‑agent system to autonomously engineer, evolve, and execute code‑based attack algorithms. Crucially, it features a code‑level self‑correction loop, allowing it to iteratively rewrite the code‑based algorithm in response to target‑model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5% Attack Success Rate (ASR) against highly robust models like Claude‑Sonnet‑4.5 and a 95.9% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

PaperID: 7358, https://arxiv.org/pdf/2511.12645.pdf

Authors: Junwei Li, Wenqing Wang, Huiliu Mao, Jiazhe Ni, Zeyu Xiong

Title: BeautyGuard: Designing a Multi-Agent Roundtable System for Proactive Beauty Tech Compliance through Stakeholder Collaboration

Abstract:
As generative AI enters enterprise workflows, ensuring compliance with legal, ethical, and reputational standards becomes a pressing challenge. In beauty tech, where biometric and personal data are central, traditional reviews are often manual, fragmented, and reactive. To examine these challenges, we conducted a formative study with six experts (four IT managers, two legal managers) at a multinational beauty company. The study revealed pain points in rule checking, precedent use, and the lack of proactive guidance. Motivated by these findings, we designed a multi‑agent "roundtable" system powered by a large language model. The system assigns role‑specialized agents for legal interpretation, checklist review, precedent search, and risk mitigation, synthesizing their perspectives into structured compliance advice. We evaluated the prototype with the same experts using System Usability Scale(SUS), The Official NASA Task Load Index(NASA‑TLX), and interviews. Results show exceptional usability (SUS: 77.5/100) and minimal cognitive workload, with three key findings: (1) multi‑agent systems can preserve tacit knowledge into standardized workflows, (2) information augmentation achieves higher acceptance than decision automation, and (3) successful enterprise AI should mirror organizational structures. This work contributes design principles for human‑AI collaboration in compliance review, with broader implications for regulated industries beyond beauty tech.

PaperID: 7359, https://arxiv.org/pdf/2511.12599.pdf

Authors: Bijia Liu, Ronghao Dang

Title: FINRS: A Risk-Sensitive Trading Framework for Real Financial Markets

Abstract:
Large language models (LLMs) have shown strong reasoning capabilities and are increasingly explored for financial trading. Existing LLM‑based trading agents, however, largely focus on single‑step prediction and lack integrated mechanisms for risk management, which reduces their effectiveness in volatile markets. We introduce FinRS, a risk‑sensitive trading framework that combines hierarchical market analysis, dual‑decision agents, and multi‑timescale reward reflection to align trading actions with both return objectives and downside risk constraints. Experiments on multiple stocks and market conditions show that FinRS achieves superior profitability and stability compared to state‑of‑the‑art methods.

PaperID: 7360, https://arxiv.org/pdf/2511.12543.pdf

Authors: Burak Karaduman, Baris Tekin Tezel, Moharram Challenger

Title: High-level reasoning while low-level actuation in Cyber-Physical Systems: How efficient is it?

Abstract:
The increasing complexity of industrial information‑integration systems demands software technologies that enable intelligent behaviour, real‑time response, and efficient development. Although many programming languages and frameworks exist, engineers still lack sufficient empirical evidence to guide the choice of tools for advanced industrial applications. This study addresses that need by measuring and comparing worst‑case execution time (WCET) and development time across six languages and frameworks: C++, Java, Jade, Jason, and fuzzy Jason BDI with both loosely and tightly coupled integration. These technologies reflect a progression from procedural and object‑oriented programming to agent‑based frameworks capable of symbolic and fuzzy reasoning. Rather than relying on broad concepts such as paradigms or orientations, the study adopts a developer‑centred approach grounded in measurable outcomes. The structured comparison examines how rising abstraction levels and reasoning capabilities affect both development effort and runtime behaviour. By analysing these dimensions, the study highlights concrete trade‑offs between engineering workload and execution efficiency. The findings show how abstraction and reasoning mechanisms shape system performance and developer productivity, offering practical insight for designing intelligent, agent‑based solutions that must operate under real‑time constraints and complex decision‑making requirements. Overall, the study contributes evidence‑based guidance for selecting software technologies in industrial informatization, supporting improved integration efficiency, maintainability, and responsiveness, and laying groundwork for future research on the interplay between language features, development dynamics, and runtime behaviour in cyber‑physical and smart manufacturing systems.

PaperID: 7361, https://arxiv.org/pdf/2511.12492.pdf

Authors: Sungjun Seo, Kooktae Lee

Title: Density-Driven Multi-Agent Coordination for Efficient Farm Coverage and Management in Smart Agriculture

Abstract:
The growing scale of modern farms has increased the need for efficient and adaptive multi‑agent coverage strategies for pest, weed, and disease management. Traditional methods such as manual inspection and blanket pesticide spraying often lead to excessive chemical use, resource waste, and environmental impact. While unmanned aerial vehicles (UAVs) offer a promising platform for precision agriculture through targeted spraying and improved operational efficiency, existing UAV‑based approaches remain limited by battery life, payload capacity, and scalability, especially in large fields where single‑UAV or uniformly distributed spraying is insufficient. Although multi‑UAV coordination has been explored, many current frameworks still assume uniform spraying and do not account for infestation severity, UAV dynamics, non‑uniform resource allocation, or energy‑efficient coordination. To address these limitations, this paper proposes a Density‑Driven Optimal Control (D2OC) framework that integrates Optimal Transport (OT) theory with multi‑UAV coverage control for large‑scale agricultural spraying. The method supports non‑uniform, priority‑aware resource allocation based on infestation intensity, reducing unnecessary chemical application. UAVs are modeled as a linear time‑varying (LTV) system to capture variations in mass and inertia during spraying missions. The D2OC control law, derived using Lagrangian mechanics, enables efficient coordination, balanced workload distribution, and improved mission duration. Simulation results demonstrate that the proposed approach outperforms uniform spraying and Spectral Multiscale Coverage (SMC) in coverage efficiency, chemical reduction, and operational sustainability, providing a scalable solution for smart agriculture.

PaperID: 7362, https://arxiv.org/pdf/2511.12439.pdf

Authors: Yujia Liu, Sophia Yu, Hongyue Jin, Jessica Wen, Alexander Qian, Terrence Lee, Mattheus Ramsis, Gi Won Choi, Lianhui Qin, Xin Liu, Edward J. Wang

Title: Multi-agent Self-triage System with Medical Flowcharts

Abstract:
Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision‑making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof‑of‑concept conversational self‑triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi‑agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient‑friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top‑3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free‑text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI‑assisted self‑triage, with potential to support informed patient decision‑making while improving healthcare resource utilization.

PaperID: 7363, https://arxiv.org/pdf/2511.12368.pdf

Authors: Yiqing Shen, Mathias Unberath

Title: Fast Reasoning Segmentation for Images and Videos

Abstract:
Reasoning segmentation enables open‑set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real‑world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi‑step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re‑framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine‑tuning on teacher‑generated reasoning chains. Then it is followed by reinforcement fine‑tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM‑Seg40K) demonstrate that our FastReasonSeg achieves state‑of‑the‑art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource‑constrained environments to enable real‑time reasoning segmentation.

PaperID: 7364, https://arxiv.org/pdf/2511.12306.pdf

Authors: Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinha, Pablo Mendes, Andrew Rabinovich

Title: UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

Abstract:
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real‑world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain‑limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric‑based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per‑criterion feedback. This structure enables fine‑grained analysis of model strengths, weaknesses, and instruction‑following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human‑AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human‑centered foundation for evaluating agentic systems in authentic labor‑market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.

PaperID: 7365, https://arxiv.org/pdf/2511.12271.pdf

Authors: Zhiyu An, Wan Du

Title: MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning

Abstract:
Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out‑of‑distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral‑Reason‑QA, a novel dataset extending 680 human‑annotated, high‑ambiguity moral scenarios with framework‑specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework‑specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax‑normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out‑of‑distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision‑making processes.

PaperID: 7366, https://arxiv.org/pdf/2511.12254.pdf

Authors: Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li

Title: Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Abstract:
Mobile agents show immense potential, yet current state‑of‑the‑art (SoTA) agents exhibit inadequate success rates on real‑world, long‑horizon, cross‑application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high‑level planning and 2) operational errors during low‑level execution on user interfaces (UI). The core insight of this paper is that high‑level planning and low‑level UI operations require fundamentally distinct types of knowledge. Planning demands high‑level, strategy‑oriented experiences, whereas operations necessitate low‑level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile‑Agent‑RAG, a novel hierarchical multi‑agent framework that innovatively integrates dual‑level retrieval augmentation. At the planning stage, we introduce Manager‑RAG to reduce strategic hallucinations by retrieving human‑validated comprehensive task plans that provide high‑level guidance. At the execution stage, we develop Operator‑RAG to improve execution accuracy by retrieving the most precise low‑level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval‑oriented knowledge bases. Furthermore, we introduce Mobile‑Eval‑RAG, a challenging benchmark for evaluating such agents on realistic multi‑app, long‑horizon tasks. Extensive experiments demonstrate that Mobile‑Agent‑RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context‑aware, reliable multi‑agent mobile automation.

PaperID: 7367, https://arxiv.org/pdf/2511.12208.pdf

Authors: Jilong Liu, Pengyang Shao, Wei Qin, Fei Liu, Yonghui Yang, Richang Hong

Title: Debate over Mixed-knowledge: A Robust Multi-Agent Reasoning Framework for Incomplete Knowledge Graph Question Answering

Abstract:
Knowledge Graph Question Answering (KGQA) aims to improve factual accuracy by leveraging structured knowledge. However, real‑world Knowledge Graphs (KGs) are often incomplete, leading to the problem of Incomplete KGQA (IKGQA). A common solution is to incorporate external data to fill knowledge gaps, but existing methods lack the capacity to adaptively and contextually fuse multiple sources, failing to fully exploit their complementary strengths. To this end, we propose Debate over Mixed‑knowledge (DoM), a novel framework that enables dynamic integration of structured and unstructured knowledge for IKGQA. Built upon the Multi‑Agent Debate paradigm, DoM assigns specialized agents to perform inference over knowledge graphs and external texts separately, and coordinates their outputs through iterative interaction. It decomposes the input question into sub‑questions, retrieves evidence via dual agents (KG and Retrieval‑Augmented Generation, RAG), and employs a judge agent to evaluate and aggregate intermediate answers. This collaboration exploits knowledge complementarity and enhances robustness to KG incompleteness. In addition, existing IKGQA datasets simulate incompleteness by randomly removing triples, failing to capture the irregular and unpredictable nature of real‑world knowledge incompleteness. To address this, we introduce a new dataset, Incomplete Knowledge Graph WebQuestions, constructed by leveraging real‑world knowledge updates. These updates reflect knowledge beyond the static scope of KGs, yielding a more realistic and challenging benchmark. Through extensive experiments, we show that DoM consistently outperforms state‑of‑the‑art baselines.

PaperID: 7368, https://arxiv.org/pdf/2511.12164.pdf

Authors: Jie Chen, Liangmin Wang

Title: Multi-Agent Collaborative Fuzzing with Continuous Reflection for Smart Contracts Vulnerability Detection

Abstract:
Fuzzing is a widely used technique for detecting vulnerabilities in smart contracts, which generates transaction sequences to explore the execution paths of smart contracts. However, existing fuzzers are falling short in detecting sophisticated vulnerabilities that require specific attack transaction sequences with proper inputs to trigger, as they (i) prioritize code coverage over vulnerability discovery, wasting considerable effort on non‑vulnerable code regions, and (ii) lack semantic understanding of stateful contracts, generating numerous invalid transaction sequences that cannot pass runtime execution. In this paper, we propose SmartFuzz, a novel collaborative reflective fuzzer for smart contract vulnerability detection. It employs large language model‑driven agents as the fuzzing engine and continuously improves itself by learning and reflecting through interactions with the environment. Specifically, we first propose a new Continuous Reflection Process (CRP) for fuzzing smart contracts, which reforms the transaction sequence generation as a self‑evolving process through continuous reflection on feedback from the runtime environment. Then, we present the Reactive Collaborative Chain (RCC) to orchestrate the fuzzing process into multiple sub‑tasks based on the dependencies of transaction sequences. Furthermore, we design a multi‑agent collaborative team, where each expert agent is guided by the RCC to jointly generate and refine transaction sequences from both global and local perspectives. We conduct extensive experiments to evaluate SmartFuzz's performance on real‑world contracts and DApp projects. The results demonstrate that SmartFuzz outperforms existing state‑of‑the‑art tools: (i) it detects 5.8%‑74.7% more vulnerabilities within 30 minutes, and (ii) it reduces false negatives by up to 80%.

PaperID: 7369, https://arxiv.org/pdf/2511.12160.pdf

Authors: Wenbin Mai, Minghui Liwang, Xinlei Yi, Xiaoyu Xia, Seyyedali Hosseinalipour, Xianbin Wang

Title: Game-Theoretic Safe Multi-Agent Motion Planning with Reachability Analysis for Dynamic and Uncertain Environments (Extended Version)

Abstract:
Ensuring safe, robust, and scalable motion planning for multi‑agent systems in dynamic and uncertain environments is a persistent challenge, driven by complex inter‑agent interactions, stochastic disturbances, and model uncertainties. To overcome these challenges, particularly the computational complexity of coupled decision‑making and the need for proactive safety guarantees, we propose a Reachability‑Enhanced Dynamic Potential Game (RE‑DPG) framework, which integrates game‑theoretic coordination into reachability analysis. This approach formulates multi‑agent coordination as a dynamic potential game, where the Nash equilibrium (NE) defines optimal control strategies across agents. To enable scalability and decentralized execution, we develop a Neighborhood‑Dominated iterative Best Response (ND‑iBR) scheme, built upon an iterated \varepsilon‑BR (i\varepsilon‑BR) process that guarantees finite‑step convergence to an \varepsilon‑NE. This allows agents to compute strategies based on local interactions while ensuring theoretical convergence guarantees. Furthermore, to ensure safety under uncertainty, we integrate a Multi‑Agent Forward Reachable Set (MA‑FRS) mechanism into the cost function, explicitly modeling uncertainty propagation and enforcing collision avoidance constraints. Through both simulations and real‑world experiments in 2D and 3D environments, we validate the effectiveness of RE‑DPG across diverse operational scenarios.

PaperID: 7370, https://arxiv.org/pdf/2511.12123.pdf

Authors: Zejiao Liu, Junqi Tu, Yitian Hong, Luolin Xiong, Yaochu Jin, Yang Tang, Fangfei Li

Title: HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning

Abstract:
In cooperative Multi‑Agent Reinforcement Learning (MARL), efficient exploration is crucial for optimizing the performance of joint policy. However, existing methods often update joint policies via independent agent exploration, without coordination among agents, which inherently constrains the expressive capacity and exploration of joint policies. To address this issue, we propose a conductor‑based joint policy framework that directly enhances the expressive capacity of joint policies and coordinates exploration. In addition, we develop a Hierarchical Conductor‑based Policy Optimization (HCPO) algorithm that instructs policy updates for the conductor and agents in a direction aligned with performance improvement. A rigorous theoretical guarantee further establishes the monotonicity of the joint policy optimization process. By deploying local conductors, HCPO retains centralized training benefits while eliminating inter‑agent communication during execution. Finally, we evaluate HCPO on three challenging benchmarks: StarCraftII Multi‑agent Challenge, Multi‑agent MuJoCo, and Multi‑agent Particle Environment. The results indicate that HCPO outperforms competitive MARL baselines regarding cooperative efficiency and stability.

PaperID: 7371, https://arxiv.org/pdf/2511.11992.pdf

Authors: Hung Du, Hy Nguyen, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis

Title: Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

Abstract:
Connected and autonomous vehicles across land, water, and air must often operate in dynamic, unpredictable environments with limited communication, no centralized control, and partial observability. These real‑world constraints pose significant challenges for coordination, particularly when vehicles pursue individual objectives. To address this, we propose a decentralized Multi‑Agent Reinforcement Learning (MARL) framework that enables vehicles, acting as agents, to communicate selectively based on local goals and observations. This goal‑aware communication strategy allows agents to share only relevant information, enhancing collaboration while respecting visibility limitations. We validate our approach in complex multi‑agent navigation tasks featuring obstacles and dynamic agent populations. Results show that our method significantly improves task success rates and reduces time‑to‑goal compared to non‑cooperative baselines. Moreover, task performance remains stable as the number of agents increases, demonstrating scalability. These findings highlight the potential of decentralized, goal‑driven MARL to support effective coordination in realistic multi‑vehicle systems operating across diverse domains.

PaperID: 7372, https://arxiv.org/pdf/2511.11933.pdf

Authors: Karthikeyan K, Raghuveer Thirukovalluru, Bhuwan Dhingra, David Edwin Carlson

Title: InData: Towards Secure Multi-Step, Tool-Based Data Analysis

Abstract:
Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security‑motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool‑use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi‑step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi‑step tool‑based reasoning ability. InData includes data analysis questions at three difficulty levels‑‑Easy, Medium, and Hard‑‑capturing increasing reasoning complexity. We benchmark 15 open‑source LLMs on InData and find that while large models (e.g., gpt‑oss‑120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi‑step tool‑based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi‑step tool‑use capabilities. We will publicly release the dataset and code.

PaperID: 7373, https://arxiv.org/pdf/2511.11921.pdf

Authors: Liudong Xing, Janet, Lin

Title: Looking Forward: Challenges and Opportunities in Agentic AI Reliability

Abstract:
This chapter presents perspectives for challenges and future development in building reliable AI systems, particularly, agentic AI systems. Several open research problems related to mitigating the risks of cascading failures are discussed. The chapter also sheds lights on research challenges and opportunities in aspects including dynamic environments, inconsistent task execution, unpredictable emergent behaviors, as well as resource‑intensive reliability mechanisms. In addition, several research directions along the line of testing and evaluating reliability of agentic AI systems are also discussed.

PaperID: 7374, https://arxiv.org/pdf/2511.11916.pdf

Authors: Sinan Urgun, Seçkin Arı

Title: An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR

Abstract:
This study aims to systematically evaluate the performance of large language models (LLMs) in abstract visual reasoning problems. We examined four LLM models (GPT‑4.1‑Mini, Claude‑3.5‑Haiku, Gemini‑1.5‑Flash, Llama‑3.3‑70b) utilizing four different reasoning architectures (single‑shot, embedding‑controlled repetition, self‑reflection, and multi‑agent) on the RAVEN‑FAIR dataset. Visual responses generated through a three‑stage process (JSON extraction, LLM reasoning, and Tool Function) were evaluated using SSIM and LPIPS metrics; Chain‑of‑Thought scores and error types (semantic hallucination, numeric misperception) were analyzed. Results demonstrate that GPT‑4.1‑Mini consistently achieved the highest overall accuracy across all architectures, indicating a strong reasoning capability. While the multi‑agent architecture occasionally altered semantic and numeric balance across models, these effects were not uniformly beneficial. Instead, each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model‑specific. Variations in response coverage further emerged as a confounding factor that complicates direct cross‑architecture comparison. To estimate the upper‑bound performance of each configuration, we report the best of five independent runs, representing a best‑case scenario rather than an averaged outcome. This multi‑run strategy aligns with recent recommendations, which emphasize that single‑run evaluations are fragile and may lead to unreliable conclusions.

PaperID: 7375, https://arxiv.org/pdf/2511.11854.pdf

Authors: Vahid Hemmati, Yonas Ayalew, Ahmad Mohammadi, Reza Ahmari, Parham Kebria, Abdollah Homaifar, Mehrdad Saif

Title: Conflict-Free Flight Scheduling Using Strategic Demand Capacity Balancing for Urban Air Mobility Operations

Abstract:
In this paper, we propose a conflict‑free multi‑ agent flight scheduling that ensures robust separation in con‑ strained airspace for Urban Air Mobility (UAM) operations application. First, we introduce Pairwise Conflict Avoidance (PCA) based on delayed departures, leveraging kinematic principles to maintain safe distances. Next, we expand PCA to multi‑agent scenarios, formulating an optimization approach that systematically determines departure times under increasing traffic densities. Performance metrics, such as average delay, assess the effectiveness of our solution. Through numerical simulations across diverse multi‑agent environments and real‑ world UAM use cases, our method demonstrates a significant reduction in total delay while ensuring collision‑free operations. This approach provides a scalable framework for emerging urban air mobility systems.

PaperID: 7376, https://arxiv.org/pdf/2511.11828.pdf

Authors: Wenwen Si, Sooyong Jang, Insup Lee, Osbert Bastani

Title: Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

Abstract:
While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple LLM models with varying cost/accuracy tradeoffs in an agentic manner, where models and tools are run in sequence as determined by an orchestration model to minimize cost subject to a user‑specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off‑policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost‑aware policy (score function) and an adaptive threshold. Across two multi‑hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost‑aware baselines and LLM‑guided methods without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost‑effective while maintaining reliability.

PaperID: 7377, https://arxiv.org/pdf/2511.11789.pdf

Authors: Jiayi Li, Xiao Liu, Yansong Feng

Title: From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions

Abstract:
Large Language Model (LLM)‑based multi‑agent systems are increasingly used to simulate human interactions and solve collaborative tasks. A common practice is to assign agents with personas to encourage behavioral diversity. However, this raises a critical yet underexplored question: do personas introduce biases into multi‑agent interactions? This paper presents a systematic investigation into persona‑induced biases in multi‑agent interactions, with a focus on social traits like trustworthiness (how an agent's opinion is received by others) and insistence (how strongly an agent advocates for its opinion). Through a series of controlled experiments in collaborative problem‑solving and persuasion tasks, we reveal that (1) LLM‑based agents exhibit biases in both trustworthiness and insistence, with personas from historically advantaged groups (e.g., men and White individuals) perceived as less trustworthy and demonstrating less insistence; and (2) agents exhibit significant in‑group favoritism, showing a higher tendency to conform to others who share the same persona. These biases persist across various LLMs, group sizes, and numbers of interaction rounds, highlighting an urgent need for awareness and mitigation to ensure the fairness and reliability of multi‑agent systems.

PaperID: 7378, https://arxiv.org/pdf/2511.11788.pdf

Authors: Antonio Sabbatella

Title: MALBO: Optimizing LLM-Based Multi-Agent Teams via Multi-Objective Bayesian Optimization

Abstract:
The optimal assignment of Large Language Models (LLMs) to specialized roles in multi‑agent systems is a significant challenge, defined by a vast combinatorial search space, expensive black‑box evaluations, and an inherent trade‑off between performance and cost. Current optimization methods focus on single‑agent settings and lack a principled framework for this multi‑agent, multi‑objective problem. This thesis introduces MALBO (Multi‑Agent LLM Bayesian Optimization), a systematic framework designed to automate the efficient composition of LLM‑based agent teams. We formalize the assignment challenge as a multi‑objective optimization problem, aiming to identify the Pareto front of configurations between task accuracy and inference cost. The methodology employs multi‑objective Bayesian Optimization (MOBO) with independent Gaussian Process surrogate models. By searching over a continuous feature‑space representation of the LLMs, this approach performs a sample‑efficient exploration guided by the expected hypervolume improvement. The primary contribution is a principled and automated methodology that yields a Pareto front of optimal team configurations. Our results demonstrate that the Bayesian optimization phase, compared to an initial random search, maintained a comparable average performance while reducing the average configuration cost by over 45%. Furthermore, MALBO identified specialized, heterogeneous teams that achieve cost reductions of up to 65.8% compared to homogeneous baselines, all while maintaining maximum performance. The framework thus provides a data‑driven tool for deploying cost‑effective and highly specialized multi‑agent AI systems.

PaperID: 7379, https://arxiv.org/pdf/2511.11772.pdf

Authors: Chenyu Zhang, Xiaohang Luo

Title: Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

Abstract:
Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low‑resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory‑grounded system that uses five coordinated role‑based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias‑aware, learner‑facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower and higher scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12‑session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert‑level agreement, and trained graders rate the AI‑generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi‑agent LLM systems can deliver equitable, high‑quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback‑rich learning becomes feasible for any course size or context, advancing long‑standing goals of equity, access, and instructional capacity in education.

PaperID: 7380, https://arxiv.org/pdf/2511.11752.pdf

Authors: Sören Arlt, Xuemei Gu, Mario Krenn

Title: Towards autonomous quantum physics research using LLM agents with access to intelligent tools

Abstract:
Artificial intelligence (AI) is used in numerous fields of science, yet the initial research questions and targets are still almost always provided by human researchers. AI‑generated creative ideas in science are rare and often vague, so that it remains a human task to execute them. Automating idea generation and implementation in one coherent system would significantly shift the role of humans in the scientific process. Here we present AI‑Mandel, an LLM agent that can generate and implement ideas in quantum physics. AI‑Mandel formulates ideas from the literature and uses a domain‑specific AI tool to turn them into concrete experiment designs that can readily be implemented in laboratories. The generated ideas by AI‑Mandel are often scientifically interesting ‑ for two of them we have already written independent scientific follow‑up papers. The ideas include new variations of quantum teleportation, primitives of quantum networks in indefinite causal orders, and new concepts of geometric phases based on closed loops of quantum information transfer. AI‑Mandel is a prototypical demonstration of an AI physicist that can generate and implement concrete, actionable ideas. Building such a system is not only useful to accelerate science, but it also reveals concrete open challenges on the path to human‑level artificial scientists.

PaperID: 7381, https://arxiv.org/pdf/2511.11751.pdf

Authors: Sanchit Sinha, Guangzhi Xiong, Zhenghao He, Aidong Zhang

Title: Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

Abstract:
Modern vision‑language models (VLMs) deliver impressive predictive accuracy yet offer little insight into 'why' a decision is reached, frequently hallucinating facts, particularly when encountering out‑of‑distribution data. Neurosymbolic frameworks address this by pairing black‑box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi‑agent system ‑ Concept‑RuleNet that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are utilized to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, symbols are composed into executable first‑order rules by a large language model reasoner agent ‑ yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with outputs of black‑box neural models, predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical‑imaging tasks and three underrepresented natural‑image datasets, show that our system augments state‑of‑the‑art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.

PaperID: 7382, https://arxiv.org/pdf/2511.11654.pdf

Authors: Sayambhu Sen, Shalabh Bhatnagar

Title: Convergence of Multiagent Learning Systems for Traffic control

Abstract:
Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi‑Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q‑learning, has emerged as a promising strategy to reduce average commuter delays. While prior work Prashant L A et. al has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi‑agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is the proof that the specific multi‑agent reinforcement learning algorithm for traffic control is proven to converge under the given conditions extending it from single agent convergence proofs for asynchronous value iteration.

PaperID: 7383, https://arxiv.org/pdf/2511.11635.pdf

Authors: Rui Jia, Min Zhang, Fengrui Liu, Bo Jiang, Kun Kuang, Zhongxiang Dai

Title: EduAgentQG: A Multi-Agent Workflow Framework for Personalized Question Generation

Abstract:
High‑quality personalized question banks are crucial for supporting adaptive learning and individualized assessment. Manually designing questions is time‑consuming and often fails to meet diverse learning needs, making automated question generation a crucial approach to reduce teachers' workload and improve the scalability of educational resources. However, most existing question generation methods rely on single‑agent or rule‑based pipelines, which still produce questions with unstable quality, limited diversity, and insufficient alignment with educational goals. To address these challenges, we propose EduAgentQG, a multi‑agent collaborative framework for generating high‑quality and diverse personalized questions. The framework consists of five specialized agents and operates through an iterative feedback loop: the Planner generates structured design plans and multiple question directions to enhance diversity; the Writer produces candidate questions based on the plan and optimizes their quality and diversity using feedback from the Solver and Educator; the Solver and Educator perform binary scoring across multiple evaluation dimensions and feed the evaluation results back to the Writer; the Checker conducts final verification, including answer correctness and clarity, ensuring alignment with educational goals. Through this multi‑agent collaboration and iterative feedback loop, EduAgentQG generates questions that are both high‑quality and diverse, while maintaining consistency with educational objectives. Experiments on two mathematics question datasets demonstrate that EduAgentQG outperforms existing single‑agent and multi‑agent methods in terms of question diversity, goal consistency, and overall quality.

PaperID: 7384, https://arxiv.org/pdf/2511.11619.pdf

Authors: Yuanjie Liu, Wenpeng Xing, Ye Zhou, Gaowei Chang, Changting Lin, Meng Han

Title: DIAP: A Decentralized Agent Identity Protocol with Zero-Knowledge Proofs and a Hybrid P2P Stack

Abstract:
The absence of a fully decentralized, verifiable, and privacy‑preserving communication protocol for autonomous agents remains a core challenge in decentralized computing. Existing systems often rely on centralized intermediaries, which reintroduce trust bottlenecks, or lack decentralized identity‑resolution mechanisms, limiting persistence and cross‑network interoperability. We propose the Decentralized Interstellar Agent Protocol (DIAP), a novel framework for agent identity and communication that enables persistent, verifiable, and trustless interoperability in fully decentralized environments. DIAP binds an agent's identity to an immutable IPFS or IPNS content identifier and uses zero‑knowledge proofs (ZKP) to dynamically and statelessly prove ownership, removing the need for record updates. We present a Rust SDK that integrates Noir (for zero‑knowledge proofs), DID‑Key, IPFS, and a hybrid peer‑to‑peer stack combining Libp2p GossipSub for discovery and Iroh for high‑performance, QUIC based data exchange. DIAP introduces a zero‑dependency ZKP deployment model through a universal proof manager and compile‑time build script that embeds a precompiled Noir circuit, eliminating the need for external ZKP toolchains. This enables instant, verifiable, and privacy‑preserving identity proofs. This work establishes a practical, high‑performance foundation for next‑generation autonomous agent ecosystems and agent‑to‑agent (A to A) economies.

PaperID: 7385, https://arxiv.org/pdf/2511.11603.pdf

Authors: Deep Bodra, Sushil Khairnar

Title: Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review

Abstract:
Cloud resource allocation has emerged as a major challenge in modern computing environments, with organizations struggling to manage complex, dynamic workloads while optimizing performance and cost efficiency. Traditional heuristic approaches prove inadequate for handling the multi‑objective optimization demands of existing cloud infrastructures. This paper presents a comparative analysis of state‑of‑the‑art artificial intelligence and machine learning algorithms for resource allocation. We systematically evaluate 10 algorithms across four categories: Deep Reinforcement Learning approaches, Neural Network architectures, Traditional Machine Learning enhanced methods, and Multi‑Agent systems. Analysis of published results demonstrates significant performance improvements across multiple metrics including makespan reduction, cost optimization, and energy efficiency gains compared to traditional methods. The findings reveal that hybrid architectures combining multiple artificial intelligence and machine learning techniques consistently outperform single‑method approaches, with edge computing environments showing the highest deployment readiness. Our analysis provides critical insights for both academic researchers and industry practitioners seeking to implement next‑generation cloud resource allocation strategies in increasingly complex and dynamic computing environments.

PaperID: 7386, https://arxiv.org/pdf/2511.11576.pdf

Authors: WenZhuo Zhu, Zheng Cui, Wenhan Lu, Sheng Liu, Yue Zhao

Title: DAOpt: Modeling and Evaluation of Data-Driven Optimization under Uncertainty with LLMs

Abstract:
Recent advances in large language models (LLMs) have accelerated research on automated optimization modeling. While real‑world decision‑making is inherently uncertain, most existing work has focused on deterministic optimization with known parameters, leaving the application of LLMs in uncertain settings largely unexplored. To that end, we propose the DAOpt framework including a new dataset OptU, a multi‑agent decision‑making module, and a simulation environment for evaluating LLMs with a focus on out‑of‑sample feasibility and robustness. Additionally, we enhance LLMs' modeling capabilities by incorporating few‑shot learning with domain knowledge from stochastic and robust optimization.

PaperID: 7387, https://arxiv.org/pdf/2511.11567.pdf

Authors: Allen Emmanuel Binny, Anushri Dixit

Title: Who Moved My Distribution? Conformal Prediction for Interactive Multi-Agent Systems

Abstract:
Uncertainty‑aware prediction is essential for safe motion planning, especially when using learned models to forecast the behavior of surrounding agents. Conformal prediction is a statistical tool often used to produce uncertainty‑aware prediction regions for machine learning models. Most existing frameworks utilizing conformal prediction‑based uncertainty predictions assume that the surrounding agents are non‑interactive. This is because in closed‑loop, as uncertainty‑aware agents change their behavior to account for prediction uncertainty, the surrounding agents respond to this change, leading to a distribution shift which we call endogenous distribution shift. To address this challenge, we introduce an iterative conformal prediction framework that systematically adapts the uncertainty‑aware ego‑agent controller to the endogenous distribution shift. The proposed method provides probabilistic safety guarantees while adapting to the evolving behavior of reactive, non‑ego agents. We establish a model for the endogenous distribution shift and provide the conditions for the iterative conformal prediction pipeline to converge under such a distribution shift. We validate our framework in simulation for 2‑ and 3‑ agent interaction scenarios, demonstrating collision avoidance without resulting in overly conservative behavior and an overall improvement in success rates of up to 9.6% compared to other conformal prediction‑based baselines.

PaperID: 7388, https://arxiv.org/pdf/2511.11557.pdf

Authors: Michael Z. Zgurovsky, Pavlo O. Kasyanov, Liliia S. Paliichuk

Title: Drone Swarm Energy Management

Abstract:
This note presents an analytical framework for decision‑making in drone swarm systems operating under uncertainty, based on the integration of Partially Observable Markov Decision Processes (POMDP) with Deep Deterministic Policy Gradient (DDPG) reinforcement learning. The proposed approach enables adaptive control and cooperative behavior of unmanned aerial vehicles (UAVs) within a cognitive AI platform, where each agent learns optimal energy management and navigation policies from dynamic environmental states. We extend the standard DDPG architecture with a belief‑state representation derived from Bayesian filtering, allowing for robust decision‑making in partially observable environments. In this paper, for the Gaussian case, we numerically compare the performance of policies derived from DDPG to optimal policies for discretized versions of the original continuous problem. Simulation results demonstrate that the POMDP‑DDPG‑based swarm control model significantly improves mission success rates and energy efficiency compared to baseline methods. The developed framework supports distributed learning and decision coordination across multiple agents, providing a foundation for scalable cognitive swarm autonomy. The outcomes of this research contribute to the advancement of energy‑aware control algorithms for intelligent multi‑agent systems and can be applied in security, environmental monitoring, and infrastructure inspection scenarios.

PaperID: 7389, https://arxiv.org/pdf/2511.11552.pdf

Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon

Title: DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Abstract:
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision‑Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine‑grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool‑augmented multi‑agent framework that effectively ``zooms in'' on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling‑adjudication mechanism to generate a single, reliable answer. Paired with Gemini‑2.5‑Pro, DocLens achieves state‑of‑the‑art performance on MMLongBench‑Doc and FinRAGBench‑V, surpassing even human experts. The framework's superiority is particularly evident on vision‑centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.

PaperID: 7390, https://arxiv.org/pdf/2511.11393.pdf

Authors: Zejiao Liu, Yi Li, Jiali Wang, Junqi Tu, Yitian Hong, Fangfei Li, Yang Liu, Toshiharu Sugawara, Yang Tang

Title: Robust and Efficient Communication in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) has made significant strides in enabling coordinated behaviors among autonomous agents. However, most existing approaches assume that communication is instantaneous, reliable, and has unlimited bandwidth; these conditions are rarely met in real‑world deployments. This survey systematically reviews recent advances in robust and efficient communication strategies for MARL under realistic constraints, including message perturbations, transmission delays, and limited bandwidth. Furthermore, because the challenges of low‑latency reliability, bandwidth‑intensive data sharing, and communication‑privacy trade‑offs are central to practical MARL systems, we focus on three applications involving cooperative autonomous driving, distributed simultaneous localization and mapping, and federated learning. Finally, we identify key open challenges and future research directions, advocating a unified approach that co‑designs communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.

PaperID: 7391, https://arxiv.org/pdf/2511.11373.pdf

Authors: Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu

Title: MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

Abstract:
Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test‑time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi‑agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed‑source models like Gemini 2.5 Pro, they struggle to generalize to open‑source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent‑specific reward mechanisms to mitigate reward noise and employs pipeline‑inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3‑30B‑A3B‑Thinking‑2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3‑235B‑A22B‑Thinking‑2507. These findings highlight the potential of MarsRL to advance multi‑agent reasoning systems and broaden their applicability across diverse reasoning tasks.

PaperID: 7392, https://arxiv.org/pdf/2511.11324.pdf

Authors: Anurag J. Vaidya, Felix Meissen, Daniel C. Castro, Shruthi Bannur, Tristan Lazard, Drew F. K. Williamson, Faisal Mahmood, Javier Alvarez-Valle, Stephanie L. Hyland, Kenza Bouzid

Title: NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery

Abstract:
Digitized histopathology analysis involves complex, time‑intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain‑specific tools (e.g., nuclei segmentation, whole‑slide encoding) built on open‑source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90‑question benchmark ‑‑ verified by pathologists and biomedical scientists ‑‑ spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi‑step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding‑agent baselines, and a pathologist‑verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.

PaperID: 7393, https://arxiv.org/pdf/2511.11306.pdf

Authors: Wei Fan, JinYi Yoon, Bo Ji

Title: iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

Abstract:
Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero‑shot settings. To further enhance reasoning and accuracy on complex tasks, Multi‑Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single‑agent answers. To address these limitations, we propose intelligent Multi‑Agent Debate (iMAD), a token‑efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self‑critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate‑decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset‑specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).

PaperID: 7394, https://arxiv.org/pdf/2511.11257.pdf

Authors: Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun

Title: AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery

Abstract:
The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM‑augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature‑reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real‑world wet‑lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out‑of‑distribution tasks, underscoring its ability to accelerate real‑world IL discovery.

PaperID: 7395, https://arxiv.org/pdf/2511.11255.pdf

Authors: Wencai Ye, Mingjie Sun, Shuhang Chen, Wenjin Wu, Peng Jiang

Title: Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation

Abstract:
Large Language Models (LLMs) demonstrate significant advantages in leveraging structured world knowledge and multi‑step reasoning capabilities. However, fundamental challenges arise when transforming LLMs into real‑world recommender systems due to semantic and behavioral misalignment. To bridge this gap, we propose Align^3GR, a novel framework that unifies token‑level, behavior modeling‑level, and preference‑level alignment. Our approach introduces: Dual tokenization fusing user‑item semantic and collaborative signals. Enhanced behavior modeling with bidirectional semantic alignment. Progressive DPO strategy combining self‑play (SP‑DPO) and real‑world feedback (RF‑DPO) for dynamic preference adaptation. Experiments show Align^3GR outperforms the SOTA baseline by +17.8% in Recall@10 and +20.2% in NDCG@10 on the public dataset, with significant gains in online A/B tests and full‑scale deployment on an industrial large‑scale recommendation platform.

PaperID: 7396, https://arxiv.org/pdf/2511.11233.pdf

Authors: Huajian Zhang, Mingyue Cheng, Yucong Luo, Xiaoyu Tao

Title: STaR: Towards Effective and Stable Table Reasoning via Slow-Thinking Large Language Models

Abstract:
Table reasoning with large language models (LLMs) plays a critical role in building intelligent systems capable of understanding and analyzing tabular data. Despite recent progress, existing methods still face key limitations: their reasoning processes lacks depth and explicit multi‑step reasoning, often relying solely on implicit language model understanding. In addition, their reasoning processes suffer from instability, primarily caused by model uncertainty. In this work, we propose STaR, a novel slow‑thinking model that can achieve effective and stable table reasoning. To enable effective multi‑step reasoning, we design a two‑stage training framework consisting of supervised fine‑tuning (SFT) warm‑up followed by reinforced fine‑tuning (RFT). Specifically, in the SFT stage, we construct a high‑quality dataset through automatic self‑verification. In the RFT stage, we introduce a difficulty‑aware reinforcement learning mechanism to further enhance reasoning capabilities. Furthermore, to improve reasoning stability, we introduce trajectory‑level uncertainty quantification, which fuses token‑level confidence with answer‑level consistency, enabling the selection of better reasoning trajectories. Extensive experiments demonstrate that STaR‑8B achieves state‑of‑the‑art performance on in‑domain benchmarks and exhibits strong generalization to out‑of‑domain datasets, highlighting its potential for enhancing both effectiveness and stability in table reasoning.

PaperID: 7397, https://arxiv.org/pdf/2511.11169.pdf

Authors: Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig

Title: Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

Abstract:
In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision‑language models (VLMs), are increasingly used in high‑stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under‑examined. Particularly, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate‑based multi‑agent framework, in which diverse specialized VLM ‑‑ each following distinct prompting strategies ‑‑ generate candidate answers and then engage in two‑stage interaction: generalist agents critique, refine and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration‑aware loss function called aligncal designed to fine‑tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies. Furthermore, we propose a novel differentiable calibration‑aware loss to fine‑tune the specialized agents and improve the quality of their individual confidence estimates based on minimising upper bound calibration error.

PaperID: 7398, https://arxiv.org/pdf/2511.11083.pdf

Authors: Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang

Title: Efficient Reinforcement Learning for Zero-Shot Coordination in Evolving Games

Abstract:
Zero‑shot coordination(ZSC), a key challenge in multi‑agent game theory, has become a hot topic in reinforcement learning (RL) research recently, especially in complex evolving games. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators from a diverse, potentially evolving, pool of partners that are not seen before without any fine‑tuning. Population‑based training, which approximates such an evolving partner pool, has been proven to provide good zero‑shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes the Scalable Population Training (ScaPT), an efficient RL training framework comprising two key components: a meta‑agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representational frameworks in Hanabi cooperative game and confirms its superiority.

PaperID: 7399, https://arxiv.org/pdf/2511.11040.pdf

Authors: Qian Zhang, Yan Zheng, Jinyi Liu, Hebin Liang, Lanjun Wang

Title: Key Decision-Makers in Multi-Agent Debates: Who Holds the Power?

Abstract:
Recent studies on LLM agent scaling have highlighted the potential of Multi‑Agent Debate (MAD) to enhance reasoning abilities. However, the critical aspect of role allocation strategies remains underexplored. In this study, we demonstrate that allocating roles with differing viewpoints to specific positions significantly impacts MAD's performance in reasoning tasks. Specifically, we find a novel role allocation strategy, "Truth Last", which can improve MAD performance by up to 22% in reasoning tasks. To address the issue of unknown truth in practical applications, we propose the Multi‑Agent Debate Consistency (MADC) strategy, which systematically simulates and optimizes its core mechanisms. MADC incorporates path consistency to assess agreement among independent roles, simulating the role with the highest consistency score as the truth. We validated MADC across a range of LLMs (9 models), including the DeepSeek‑R1 Distilled Models, on challenging reasoning tasks. MADC consistently demonstrated advanced performance, effectively overcoming MAD's performance bottlenecks and providing a crucial pathway for further improvements in LLM agent scaling.

PaperID: 7400, https://arxiv.org/pdf/2511.11035.pdf

Authors: Biqing Zeng, Mengquan Liu, Zongwei Zhen

Title: GraphMASAL: A Graph-based Multi-Agent System for Adaptive Learning

Abstract:
The advent of Intelligent Tutoring Systems (ITSs) has marked a paradigm shift in education, enabling highly personalized learning pathways. However, true personalization requires adapting to learners' complex knowledge states (multi‑source) and diverse goals (multi‑sink); existing ITSs often lack the necessary structural‑reasoning capability and knowledge dynamism to generate genuinely effective learning paths, and they lack scientifically rigorous validation paradigms. In this paper we propose GraphMASAL (A Graph‑based Multi‑Agent System for Adaptive Learning), which integrates (i) a dynamic knowledge graph for persistent, stateful learner modeling; (ii) a LangGraph‑orchestrated trio of agents (Diagnostician, Planner, Tutor); (iii) a knowledge‑graph‑grounded two‑stage neural IR component (dual‑encoder dense retrieval with cross‑encoder listwise re‑ranking and calibrated score fusion); and (iv) a multi‑source multi‑sink (MSMS) planning engine with a cognitively grounded cost and an approximation guarantee via greedy set cover. Under blinded automated evaluations with matched inputs and inference settings across diverse student profiles, GraphMASAL consistently outperforms LLM prompting and structured ablations in planning‑‑achieving stronger structural/sequence alignment of learning paths, higher coverage of weak concepts, and lower learning cost‑‑while also surpassing prompt‑based baselines in cognitive diagnosis. Agreement with expert/LLM‑proxy ratings further supports the validity of our evaluation protocol. These findings indicate that grounding LLM agents in a dynamic knowledge graph, coupled with optimization under educational constraints, yields reliable, interpretable, and pedagogically plausible learning plans, advancing personalized and goal‑oriented education.

PaperID: 7401, https://arxiv.org/pdf/2511.11025.pdf

Authors: Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

Title: AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Abstract:
Multimodal Large Language Models (MLLMs) have shown promise in single‑agent vision tasks, yet benchmarks for evaluating multi‑agent collaborative perception remain scarce. This gap is critical, as multi‑drone systems provide enhanced coverage, robustness, and collaboration compared to single‑sensor setups. Existing multi‑image benchmarks mainly target basic perception tasks using high‑quality single‑agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real‑world degraded perception conditions.To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real‑world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded‑perception scenarios with annotated collaborative events, generating large‑scale questions through model‑, rule‑, and human‑based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine‑tuning experiments further confirm the feasibility of sim‑to‑real transfer in aerial collaborative perception and reasoning.

PaperID: 7402, https://arxiv.org/pdf/2511.11022.pdf

Authors: Hyunchul Bae, Eunjae Lee, Jehyeop Han, Minhee Kang, Jaehyeon Kim, Junggeun Seo, Minkyun Noh, Heejin Ahn

Title: Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

Abstract:
Cooperative autonomous driving, which extends vehicle autonomy by enabling real‑time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15‑scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish‑subscribe pattern through a shared Wi‑Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure‑based perception and intersection management experiments.

PaperID: 7403, https://arxiv.org/pdf/2511.11017.pdf

Authors: Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov

Title: AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce

Abstract:
The rapid expansion of e‑commerce platforms generates vast amounts of unstructured product data, creating significant challenges for information retrieval, recommendation systems, and data analytics. Knowledge Graphs (KGs) offer a structured, interpretable format to organize such data, yet constructing product‑specific KGs remains a complex and manual process. This paper introduces a fully automated, AI agent‑driven framework for constructing product knowledge graphs directly from unstructured product descriptions. Leveraging Large Language Models (LLMs), our method operates in three stages using dedicated agents: ontology creation and expansion, ontology refinement, and knowledge graph population. This agent‑based approach ensures semantic coherence, scalability, and high‑quality output without relying on predefined schemas or handcrafted extraction rules. We evaluate the system on a real‑world dataset of air conditioner product descriptions, demonstrating strong performance in both ontology generation and KG population. The framework achieves over 97% property coverage and minimal redundancy, validating its effectiveness and practical applicability. Our work highlights the potential of LLMs to automate structured knowledge extraction in retail, providing a scalable path toward intelligent product data integration and utilization.

PaperID: 7404, https://arxiv.org/pdf/2511.10925.pdf

Authors: Ha-Thanh Nguyen, Wachara Fungwacharakorn, Ken Satoh

Title: Multi-Agent Legal Verifier Systems for Data Transfer Planning

Abstract:
Legal compliance in AI‑driven data transfer planning is becoming increasingly critical under stringent privacy regulations such as the Japanese Act on the Protection of Personal Information (APPI). We propose a multi‑agent legal verifier that decomposes compliance checking into specialized agents for statutory interpretation, business context evaluation, and risk assessment, coordinated through a structured synthesis protocol. Evaluated on a stratified dataset of 200 Amended APPI Article 16 cases with clearly defined ground truth labels and multiple performance metrics, the system achieves 72% accuracy, which is 21 percentage points higher than a single‑agent baseline, including 90% accuracy on clear compliance cases (vs. 16% for the baseline) while maintaining perfect detection of clear violations. While challenges remain in ambiguous scenarios, these results show that domain specialization and coordinated reasoning can meaningfully improve legal AI performance, providing a scalable and regulation‑aware framework for trustworthy and interpretable automated compliance verification.

PaperID: 7405, https://arxiv.org/pdf/2511.10874.pdf

Authors: Yorai Shaoul, Zhe Chen, Mohamed Naveed Gul Mohamed, Federico Pecora, Maxim Likhachev, Jiaoyang Li

Title: Collaborative Multi-Robot Non-Prehensile Manipulation via Flow-Matching Co-Generation

Abstract:
Coordinating a team of robots to reposition multiple objects in cluttered environments requires reasoning jointly about where robots should establish contact, how to manipulate objects once contact is made, and how to navigate safely and efficiently at scale. Prior approaches typically fall into two extremes ‑‑ either learning the entire task or relying on privileged information and hand‑designed planners ‑‑ both of which struggle to handle diverse objects in long‑horizon tasks. To address these challenges, we present a unified framework for collaborative multi‑robot, multi‑object non‑prehensile manipulation that integrates flow‑matching co‑generation with anonymous multi‑robot motion planning. Within this framework, a generative model co‑generates contact formations and manipulation trajectories from visual observations, while a novel motion planner conveys robots at scale. Crucially, the same planner also supports coordination at the object level, assigning manipulated objects to larger target structures and thereby unifying robot‑ and object‑level reasoning within a single algorithmic framework. Experiments in challenging simulated environments demonstrate that our approach outperforms baselines in both motion planning and manipulation tasks, highlighting the benefits of generative co‑design and integrated planning for scaling collaborative manipulation to complex multi‑agent, multi‑object settings. Visit gco‑paper.github.io for code and demonstrations.

PaperID: 7406, https://arxiv.org/pdf/2511.10860.pdf

Authors: Rabimba Karanjai, Lei Xu, Weidong Shi

Title: HPCAgentTester: A Multi-Agent LLM Approach for Enhanced HPC Unit Test Generation

Abstract:
Unit testing in High‑Performance Computing (HPC) is critical but challenged by parallelism, complex algorithms, and diverse hardware. Traditional methods often fail to address non‑deterministic behavior and synchronization issues in HPC applications. This paper introduces HPCAgentTester, a novel multi‑agent Large Language Model (LLM) framework designed to automate and enhance unit test generation for HPC software utilizing OpenMP and MPI. HPCAgentTester employs a unique collaborative workflow where specialized LLM agents (Recipe Agent and Test Agent) iteratively generate and refine test cases through a critique loop. This architecture enables the generation of context‑aware unit tests that specifically target parallel execution constructs, complex communication patterns, and hierarchical parallelism. We demonstrate HPCAgentTester's ability to produce compilable and functionally correct tests for OpenMP and MPI primitives, effectively identifying subtle bugs that are often missed by conventional techniques. Our evaluation shows that HPCAgentTester significantly improves test compilation rates and correctness compared to standalone LLMs, offering a more robust and scalable solution for ensuring the reliability of parallel software systems.

PaperID: 7407, https://arxiv.org/pdf/2511.10858.pdf

Authors: Dimitria Silveria, Kleber Cabral, Peter Jardine, Sidney Givigi

Title: Decentralized Swarm Control via SO(3) Embeddings for 3D Trajectories

Abstract:
This paper presents a novel decentralized approach for achieving emergent behavior in multi‑agent systems with minimal information sharing. Based on prior work in simple orbits, our method produces a broad class of stable, periodic trajectories by stabilizing the system around a Lie group‑based geometric embedding. Employing the Lie group SO(3), we generate a wider range of periodic curves than existing quaternion‑based methods. Furthermore, we exploit SO(3) properties to eliminate the need for velocity inputs, allowing agents to receive only position inputs. We also propose a novel phase controller that ensures uniform agent separation, along with a formal stability proof. Validation through simulations and experiments showcases the method's adaptability to complex low‑level dynamics and disturbances.

PaperID: 7408, https://arxiv.org/pdf/2511.10853.pdf

Authors: Gerui Xu, Boyou Chen, Huizhong Guo, Dave LeBlanc, Arpan Kusari, Efe Yarbasi, Ananna Ahmed, Zhaonan Sun, Shan Bao

Title: Advanced Assistance for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction

Abstract:
Traffic collision reconstruction traditionally relies on human expertise and can be accurate, but pre‑crash reconstruction is more challenging. This study develops a multi‑agent AI framework that reconstructs pre‑crash scenarios and infers vehicle behaviors from fragmented collision data. We propose a two‑phase collaborative framework with reconstruction and reasoning stages. The system processes 277 rear‑end lead vehicle deceleration (LVD) crashes from the Crash Investigation Sampling System (CISS, 2017 to 2022), integrating narrative reports, structured tabular variables, and scene diagrams. Phase I generates natural‑language crash reconstructions from multimodal inputs. Phase II combines these reconstructions with Event Data Recorder (EDR) signals to (1) identify striking and struck vehicles and (2) isolate the EDR records most relevant to the collision moment, enabling inference of key pre‑crash behaviors. For validation, we evaluated all LVD cases and emphasized 39 complex crashes where multiple EDR records per crash created ambiguity due to missing or conflicting data. Ground truth was set by consensus of two independent manual annotators, with a separate language model used only to flag potential conflicts for re‑checking. The framework achieved 100% accuracy across 4,155 trials; three reasoning models produced identical outputs, indicating that performance is driven by the structured prompts rather than model choice. Research analysts without reconstruction training achieved 92.31% accuracy on the same 39 complex cases. Ablation tests showed that removing structured reasoning anchors reduced case‑level accuracy from 99.7% to 96.5% and increased errors across multiple output dimensions. The system remained robust under incomplete inputs. This zero‑shot evaluation, without domain‑specific training or fine‑tuning, suggests a scalable approach for AI‑assisted pre‑crash analysis.

PaperID: 7409, https://arxiv.org/pdf/2511.10835.pdf

Authors: Domenico Maisto, Davide Nuzzi, Giovanni Pezzulo

Title: What the flock knows that the birds do not: exploring the emergence of joint agency in multi-agent active inference

Abstract:
Collective behavior pervades biological systems, from flocks of birds to neural assemblies and human societies. Yet, how such collectives acquire functional properties ‑‑ such as joint agency or knowledge ‑‑ that transcend those of their individual components remains an open question. Here, we combine active inference and information‑theoretic analyses to explore how a minimal system of interacting agents can give rise to joint agency and collective knowledge. We model flocking dynamics using multiple active inference agents, each minimizing its own free energy while coupling reciprocally with its neighbors. We show that as agents self‑organize, their interactions define higher‑order statistical boundaries (Markov blankets) enclosing a ``flock'' that can be treated as an emergent agent with its own sensory, active, and internal states. When exposed to external perturbations (a ``predator''), the flock exhibits faster, coordinated responses than individual agents, reflecting collective sensitivity to environmental change. Crucially, analyses of synergistic information reveal that the flock encodes information about the predator's location that is not accessible to every individual bird, demonstrating implicit collective knowledge. Together, these results show how informational coupling among active inference agents can generate new levels of autonomy and inference, providing a framework for understanding the emergence of (implicit) collective knowledge and joint agency.

PaperID: 7410, https://arxiv.org/pdf/2511.10816.pdf

Authors: William Harris, Lucas Yager, Syler Sylvester, Elizabeth Peiros, Micheal C. Yip

Title: Dynamically Extensible and Retractable Robotic Leg Linkages for Multi-task Execution in Search and Rescue Scenarios

Abstract:
Search and rescue (SAR) robots are required to quickly traverse terrain and perform high‑force rescue tasks, necessitating both terrain adaptability and controlled high‑force output. Few platforms exist today for SAR, and fewer still have the ability to cover both tasks of terrain adaptability and high‑force output when performing extraction. While legged robots offer significant ability to traverse uneven terrain, they typically are unable to incorporate mechanisms that provide variable high‑force outputs, unlike traditional wheel‑based drive trains. This work introduces a novel concept for a dynamically extensible and retractable robot leg. Leveraging a dynamically extensible and retractable five‑bar linkage design, it allows for mechanically switching between height‑advantaged and force‑advantaged configurations via a geometric transformation. A testbed evaluated leg performance across linkage geometries and operating modes, with empirical and analytical analyses conducted on stride length, force output, and stability. The results demonstrate that the morphing leg offers a promising path toward SAR robots that can both navigate terrain quickly and perform rescue tasks effectively.

PaperID: 7411, https://arxiv.org/pdf/2511.10687.pdf

Authors: Chih-Hsuan Yang, Tanwi Mallick, Le Chen, Krishnan Raghavan, Azton Wells, Amal Gueroudji, Ian T. Foster, Rajeev Thakur

Title: Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents

Abstract:
Large Language Models (LLMs) in multi‑agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system‑level evaluation with agent‑level and message‑level learning. We propose a theoretical framework that unifies cooperative game‑theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response‑level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step‑level labels (e.g., PRM), our method produces local, signed, and credit‑conserving signals. In success cases, Shapley‑based credit assignment fairly allocates outcomes across agents and is refined into per‑message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first‑error localization yields repair‑aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement‑based or preference‑based post‑training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi‑agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.

PaperID: 7412, https://arxiv.org/pdf/2511.10673.pdf

Authors: Fengxu Yang, Weitong Chen, Jack D. Evans

Title: Large language models in materials science and the need for open-source approaches

Abstract:
Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature , predictive modelling, and multi‑agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure‑property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed‑source commercial models, our benchmark results demonstrate that open‑source alternatives can match performance while offering greater transparency, reproducibility, cost‑effectiveness, and data privacy. As open‑source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community‑driven AI platforms for scientific discovery.

PaperID: 7413, https://arxiv.org/pdf/2511.10668.pdf

Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Title: A Mathematical Framework for AI Singularity: Conditions, Bounds, and Control of Recursive Improvement

Abstract:
AI systems improve by drawing on more compute, data, energy, and better training methods. This paper asks a precise, testable version of the "runaway growth" question: under what measurable conditions could capability escalate without bound in finite time, and under what conditions can that be ruled out? We develop an analytic framework for recursive self‑improvement that links capability growth to resource build‑out and deployment policies. Physical and information‑theoretic limits from power, bandwidth, and memory define a service envelope that caps instantaneous improvement. An endogenous growth model couples capital to compute, data, and energy and defines a critical boundary separating superlinear from subcritical regimes. We derive decision rules that map observable series (facility power, IO bandwidth, training throughput, benchmark losses, and spending) into yes/no certificates for runaway versus nonsingular behavior. The framework yields falsifiable tests based on how fast improvement accelerates relative to its current level, and it provides safety controls that are directly implementable in practice, such as power caps, throughput throttling, and evaluation gates. Analytical case studies cover capped‑power, saturating‑data, and investment‑amplified settings, illustrating when the envelope binds and when it does not. The approach is simulation‑free and grounded in measurements engineers already collect. Limitations include dependence on the chosen capability metric and on regularity diagnostics; future work will address stochastic dynamics, multi‑agent competition, and abrupt architectural shifts. Overall, the results replace speculation with testable conditions and deployable controls for certifying or precluding an AI singularity.

PaperID: 7414, https://arxiv.org/pdf/2511.10649.pdf

Authors: Wojciech Jamroga, Damian Kurpiewski, Łukasz Mikulski

Title: Towards Assume-Guarantee Verification of Abilities in Stochastic Multi-Agent Systems

Abstract:
Model checking of strategic abilities is a notoriously hard problem, even more so in the realistic case of agents with imperfect information, acting in a stochastic environment. Assume‑guarantee reasoning can be of great help here, providing a way to decompose the complex problem into a small set of easier subproblems. In this paper, we propose several schemes for assume‑guarantee verification of probabilistic alternating‑time temporal logic with imperfect information. We prove the soundness of the schemes, and discuss their completeness. On the way, we also propose a new variant of (non‑probabilistic) alternating‑time logic, where the strategic modalities capture "achieving at most φ," analogous to Levesque's logic of "only knowing."

PaperID: 7415, https://arxiv.org/pdf/2511.10611.pdf

Authors: Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi

Title: Towards an Agentic Workflow for Internet Measurement Research

Abstract:
Internet measurement research faces an accessibility crisis: complex analyses require custom integration of multiple specialized tools that demands specialized domain expertise. When network disruptions occur, operators need rapid diagnostic workflows spanning infrastructure mapping, routing analysis, and dependency modeling. However, developing these workflows requires specialized knowledge and significant manual effort. We present ArachNet, the first system demonstrating that LLM agents can independently generate measurement workflows that mimics expert reasoning. Our core insight is that measurement expertise follows predictable compositional patterns that can be systematically automated. ArachNet operates through four specialized agents that mirror expert workflow, from problem decomposition to solution implementation. We validate ArachNet with progressively challenging Internet resilience scenarios. The system independently generates workflows that match expert‑level reasoning and produce analytical outputs similar to specialist solutions. Generated workflows handle complex multi‑framework integration that traditionally requires days of manual coordination. ArachNet lowers barriers to measurement workflow composition by automating the systematic reasoning process that experts use, enabling broader access to sophisticated measurement capabilities while maintaining the technical rigor required for research‑quality analysis.

PaperID: 7416, https://arxiv.org/pdf/2511.10586.pdf

Authors: Omid Mirzaeedodangeh, Eliot Shekhtman, Nikolai Matni, Lars Lindemann

Title: Safe Planning in Interactive Environments via Iterative Policy Updates and Adversarially Robust Conformal Prediction

Abstract:
Safe planning of an autonomous agent in interactive environments ‑‑ such as the control of a self‑driving vehicle among pedestrians ‑‑ poses a major challenge as the behavior of the environment is unknown and reactive to the behavior of the autonomous agent. This coupling gives rise to interaction‑driven distribution shifts where the autonomous agent's control policy may change the environment's behavior, thereby invalidating safety guarantees in existing work. Indeed, recent works have used conformal prediction (CP) to generate distribution‑free safety guarantees using observed data of the environment. However, CP's assumption on data exchangeability is violated in interactive settings due to a circular dependency where a control policy update changes the environment's behavior, and vice versa. To address this gap, we propose an iterative framework that robustly maintains safety guarantees across policy updates by quantifying the potential impact of a planned policy update on the environment's behavior. We realize this via adversarially robust CP where we perform a regular CP step in each episode using observed data under the current policy, but then transfer safety guarantees across policy updates by analytically adjusting the CP result to account for distribution shifts. This adjustment is performed based on a policy‑to‑trajectory sensitivity analysis, resulting in a safe, episodic open‑loop planner. We further conduct a contraction analysis of the system providing conditions under which both the CP results and the policy updates are guaranteed to converge. We empirically demonstrate these safety and convergence guarantees on a two‑dimensional car‑pedestrian and a high‑dimensional quadcopter case study. To the best of our knowledge, these are the first results that provide valid safety guarantees in such interactive settings.

PaperID: 7417, https://arxiv.org/pdf/2511.10515.pdf

Authors: Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian, You-Le Fang, Ren-Xi He, Jing-Tian Zhang, Ce Meng, Ling-Shi Meng, Bing-Rui Gong, Sheng-Qi Zhang, Yan-Qing Ma

Title: Mastering Olympiad-Level Physics with Artificial Intelligence

Abstract:
Olympiad‑level physics problem‑solving significantly challenges both humans and artificial intelligence (AI), as it requires integrating appropriate modeling, application of physical principles, and precise calculation within long reasoning processes. In this paper, we introduce LOCA (LOgical Chain Augmentation), an AI agent framework designed for complex physics reasoning. LOCA decomposes long reasoning into serialized atomic and verifiable steps, refining the solution through an augment‑review loop. We evaluate LOCA on the 2025 Chinese Physics Olympiad (CPhO) theory examination, a rigorous testbed renowned for its depth and complexity. The framework achieves a near‑perfect score of 313 out of 320 points, significantly surpassing the top human competitor and other baseline methods. Furthermore, LOCA attains a near‑perfect score of 28.6 out of 30 on the IPhO 2025 examination, demonstrating its strong generalizability across different contexts. Our work points toward the development of trustworthy AI partners in both research and education.

PaperID: 7418, https://arxiv.org/pdf/2511.10409.pdf

Authors: Kayla Boggess, Sarit Kraus, Lu Feng

Title: Explaining Decentralized Multi-Agent Reinforcement Learning Policies

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has gained significant interest in recent years, enabling sequential decision‑making across multiple agents in various domains. However, most existing explanation methods focus on centralized MARL, failing to address the uncertainty and nondeterminism inherent in decentralized settings. We propose methods to generate policy summarizations that capture task ordering and agent cooperation in decentralized MARL policies, along with query‑based explanations for When, Why Not, and What types of user queries about specific agent behaviors. We evaluate our approach across four MARL domains and two decentralized MARL algorithms, demonstrating its generalizability and computational efficiency. User studies show that our summarizations and explanations significantly improve user question‑answering performance and enhance subjective ratings on metrics such as understanding and satisfaction.

PaperID: 7419, https://arxiv.org/pdf/2511.10403.pdf

Authors: Mingxing Peng, Ruoyu Yao, Xusen Guo, Jun Ma

Title: nuPlan-R: A Closed-Loop Planning Benchmark for Autonomous Driving via Reactive Multi-Agent Simulation

Abstract:
Recent advances in closed‑loop planning benchmarks have significantly improved the evaluation of autonomous vehicles. However, existing benchmarks still rely on rule‑based reactive agents such as the Intelligent Driver Model (IDM), which lack behavioral diversity and fail to capture realistic human interactions, leading to oversimplified traffic dynamics. To address these limitations, we present nuPlan‑R, a new reactive closed‑loop planning benchmark that integrates learning‑based reactive multi‑agent simulation into the nuPlan framework. Our benchmark replaces the rule‑based IDM agents with noise‑decoupled diffusion‑based reactive agents and introduces an interaction‑aware agent selection mechanism to ensure both realism and computational efficiency. Furthermore, we extend the benchmark with two additional metrics to enable a more comprehensive assessment of planning performance. Extensive experiments demonstrate that our reactive agent model produces more realistic, diverse, and human‑like traffic behaviors, leading to a benchmark environment that better reflects real‑world interactive driving. We further reimplement a collection of rule‑based, learning‑based, and hybrid planning approaches within our nuPlan‑R benchmark, providing a clearer reflection of planner performance in complex interactive scenarios and better highlighting the advantages of learning‑based planners in handling complex and dynamic scenarios. These results establish nuPlan‑R as a new standard for fair, reactive, and realistic closed‑loop planning evaluation. We will open‑source the code for the new benchmark.

PaperID: 7420, https://arxiv.org/pdf/2511.10401.pdf

Authors: Cornelia Skaga, Mahdieh S. Sadabadi, Gilbert Bergna-Diaz

Title: Stability of a DC Microgrid with a Nonlinear Nested Control Framework: The Fast Communication Scenario

Abstract:
As modern power systems continue to evolve into multi‑agent, converter‑dominated systems that demand reliable, stable, and optimal control architectures within an expandable framework, this paper investigates scalable stability guarantees of a promising nonlinear communication‑reliant control framework for DC microgrids. Particularly, relying on nested control loops; inner decentralized(primary) and outer distributed(secondary), the control configurations are designed to simultaneously achieve proportional current sharing and voltage containment within pre‑specified limits, at the converged steady state. By enforcing sufficient time‑scale separation at the boarder between the control loops, the system admits a singular perturbation formulation, allowing global exponential stability (G.E.S.) to be established via Lyapunov arguments. Although the theoretical G.E.S. certificate is structurally scalable, the stability guarantees depends on a sufficiently large permanent leakage, introduced in the primary controller. Thus, the results of this paper emphasize the importance of appropriate practical tuning guidelines and electrical parameter selection. The effectiveness of the proposed method is validated through case studies on a low‑voltage DC microgrid under load variations and topological changes (and communication time‑delays), followed by a small‑signal stability analysis.

PaperID: 7421, https://arxiv.org/pdf/2511.10400.pdf

Authors: Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, Yu Tian

Title: Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Abstract:
Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi‑agent systems (MAS). Advances in large language models (LLMs) have established LLM‑based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored. i.e., whether substituting traditional agents with LLM‑based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM‑based agents from the perspective of Byzantine fault tolerance. We observe that LLM‑based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP‑WBFT, a confidence probe‑based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe‑based, weighted information flow transmission method to improve the reliability of LLM‑based agents. Extensive experiments demonstrate that CP‑WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.

PaperID: 7422, https://arxiv.org/pdf/2511.10395.pdf

Authors: Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, Zhaoyang Liu, Bolin Ding, Jingren Zhou

Title: AgentEvolver: Towards Efficient Self-Evolving Agent System

Abstract:
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data‑construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self‑evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self‑questioning, which enables curiosity‑driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self‑navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self‑attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost‑effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL‑based baselines.

PaperID: 7423, https://arxiv.org/pdf/2511.10344.pdf

Authors: Zicheng Hu, Yuchen Wang, Cheng Chen

Title: Robust Decentralized Multi-armed Bandits: From Corruption-Resilience to Byzantine-Resilience

Abstract:
Decentralized cooperative multi‑agent multi‑armed bandits (DeCMA2B) considers how multiple agents collaborate in a decentralized multi‑armed bandit setting. Though this problem has been extensively studied in previous work, most existing methods remain susceptible to various adversarial attacks. In this paper, we first study DeCMA2B with adversarial corruption, where an adversary can corrupt reward observations of all agents with a limited corruption budget. We propose a robust algorithm, called DeMABAR, which ensures that each agent's individual regret suffers only an additive term proportional to the corruption budget. Then we consider a more realistic scenario where the adversary can only attack a small number of agents. Our theoretical analysis shows that the DeMABAR algorithm can also almost completely eliminate the influence of adversarial attacks and is inherently robust in the Byzantine setting, where an unknown fraction of the agents can be Byzantine, i.e., may arbitrarily select arms and communicate wrong information. We also conduct numerical experiments to illustrate the robustness and effectiveness of the proposed method.

PaperID: 7424, https://arxiv.org/pdf/2511.10291.pdf

Authors: Aswin Arun, Christo Kurisummoottil Thomas, Rimalpudi Sarvendranath, Walid Saad

Title: Causal Model-Based Reinforcement Learning for Sample-Efficient IoT Channel Access

Abstract:
Despite the advantages of multi‑agent reinforcement learning (MARL) for wireless use case such as medium access control (MAC), their real‑world deployment in Internet of Things (IoT) is hindered by their sample inefficiency. To alleviate this challenge, one can leverage model‑based reinforcement learning (MBRL) solutions, however, conventional MBRL approaches rely on black‑box models that are not interpretable and cannot reason. In contrast, in this paper, a novel causal model‑based MARL framework is developed by leveraging tools from causal learn‑ ing. In particular, the proposed model can explicitly represent causal dependencies between network variables using structural causal models (SCMs) and attention‑based inference networks. Interpretable causal models are then developed to capture how MAC control messages influence observations, how transmission actions determine outcomes, and how channel observations affect rewards. Data augmentation techniques are then used to generate synthetic rollouts using the learned causal model for policy optimization via proximal policy optimization (PPO). Analytical results demonstrate exponential sample complexity gains of causal MBRL over black‑box approaches. Extensive simulations demonstrate that, on average, the proposed approach can reduce environment interactions by 58%, and yield faster convergence compared to model‑free baselines. The proposed approach inherently is also shown to provide interpretable scheduling decisions via attention‑based causal attribution, revealing which network conditions drive the policy. The resulting combination of sample efficiency and interpretability establishes causal MBRL as a practical approach for resource‑constrained wireless systems.

PaperID: 7425, https://arxiv.org/pdf/2511.10283.pdf

Authors: Won Ik Cho, Woonghee Han, Kyung Seo Ki, Young Min Kim

Title: Behavior Modeling for Training-free Building of Private Domain Multi Agent System

Abstract:
The rise of agentic systems that combine orchestration, tool use, and conversational capabilities, has been more visible by the recent advent of large language models (LLMs). While open‑domain frameworks exist, applying them in private domains remains difficult due to heterogeneous tool formats, domain‑specific jargon, restricted accessibility of APIs, and complex governance. Conventional solutions, such as fine‑tuning on synthetic dialogue data, are burdensome and brittle under domain shifts, and risk degrading general performance. In this light, we introduce a framework for private‑domain multi‑agent conversational systems that avoids training and data generation by adopting behavior modeling and documentation. Our design simply assumes an orchestrator, a tool‑calling agent, and a general chat agent, with tool integration defined through structured specifications and domain‑informed instructions. This approach enables scalable adaptation to private tools and evolving contexts without continual retraining. The framework supports practical use cases, including lightweight deployment of multi‑agent systems, leveraging API specifications as retrieval resources, and generating synthetic dialogue for evaluation ‑‑ providing a sustainable method for aligning agent behavior with domain expertise in private conversational ecosystems.

PaperID: 7426, https://arxiv.org/pdf/2511.10203.pdf

Authors: Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hégarat-Mascle

Title: VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Abstract:
Multi‑agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long‑term goals and their fine‑grained social interactions, which leads to unrealistic multi‑agent futures. We propose VISTA, a recursive goal‑conditioned transformer for multi‑agent trajectory forecasting. VISTA combines (i) a cross‑attention fusion module that integrates long‑horizon intent with past motion, (ii) a social‑token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single‑agent goal‑conditioned prediction into a coherent multi‑agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high‑density MADRAS benchmark and on SDD, VISTA achieves state‑of‑the‑art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal‑aware, and interpretable trajectories, making it promising for safety‑critical autonomous systems.

PaperID: 7427, https://arxiv.org/pdf/2511.10118.pdf

Authors: Zoltan Nagy, Irinel-Constantin Morarescu, Lucian Busoniu

Title: Consensus approximation and impulsive control for a class of uncertain multi-agent dynamics

Abstract:
This paper studies a class of consensus dynamics where the interactions between agents are affected by a time‑varying unknown scaling factor. This situation is encountered in the control of robotic fleets over a wireless network or in opinion dynamics where the confidence given to the peers varies in time. Firstly, we establish conditions under which practical upper and lower bounds on the consensus value can be determined. Secondly, we propose control strategies for allocating a given control budget to shift agent states towards a desired consensus value despite the uncertainty. We provide computationally efficient linear programming‑based approaches for both problems and validate the obtained results in numerical simulations.

PaperID: 7428, https://arxiv.org/pdf/2511.10049.pdf

Authors: Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, Gagan Somashekar, Shachee Mishra Gupta, Arun Iyer, Yu Kang, Chetan Bansal, Aditya Akella, Saravan Rajmohan

Title: Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents

Abstract:
The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and computing multiple evaluation metrics for the agent. While sufficient for simple coding tasks, these benchmarks fall short for enterprise‑scale agents, where services and requirements evolve continuously and ground‑truth examples are sparse. We propose a process of benchmark generation that helps evolve the benchmarks as the requirements change and perform robust evaluation of evolving AI agents. We instantiate this approach for a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi‑structured documents where developers express the high‑level intent, and uses state‑of‑the‑art LLMs to generate benchmarks from just a small number of such documents. Overall, this process results in a maintainable evaluation framework, enabling rapid feedback on agent performance and facilitating targeted improvements.

PaperID: 7429, https://arxiv.org/pdf/2511.09919.pdf

Authors: Ketong Chen, Yuhao Chen, Yang Xue

Title: MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Abstract:
Despite the rapid progress of Vision‑Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English‑centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi‑agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large‑scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi‑column and non‑Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi‑task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state‑of‑the‑art models on this benchmark reveals their current limitations in handling real‑world document complexity and charts a clear path for future research.

PaperID: 7430, https://arxiv.org/pdf/2511.09804.pdf

Authors: Eric Xie, Danielle Waterfield, Michael Kennedy, Aidong Zhang

Title: SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations

Abstract:
Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. However, generating effective presentation slides introduces unique challenges due to the complexity of multimodal content creation and the need for precise, domain‑specific information. Existing LLM‑based solutions often fail to produce reliable and informative outputs, limiting their educational value. To address these limitations, we introduce SlideBot ‑ a modular, multi‑agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation. SlideBot is organized around three pillars: informativeness, ensuring deep and contextually grounded content; reliability, achieved by incorporating external sources through retrieval; and practicality, which enables customization and iterative feedback through instructor collaboration. It incorporates evidence‑based instructional design principles from Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), using structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual‑channel learning. Within the system, specialized agents collaboratively retrieve information, summarize content, generate figures, and format slides using LaTeX, aligning outputs with instructor preferences through interactive refinement. Evaluations from domain experts and students in AI and biomedical education show that SlideBot consistently enhances conceptual accuracy, clarity, and instructional value. These findings demonstrate SlideBot's potential to streamline slide preparation while ensuring accuracy, relevance, and adaptability in higher education.

PaperID: 7431, https://arxiv.org/pdf/2511.09794.pdf

Authors: Wasique Islam Shafin, Md Nakhla Rafi, Zhenhao Li, Tse-Hsun Chen

Title: Evaluating Software Process Models for Multi-Agent Class-Level Code Generation

Abstract:
Modern software systems require code that is not only functional but also maintainable and well‑structured. Although Large Language Models (LLMs) are increasingly used to automate software development, most studies focus on isolated, single‑agent function‑level generation. This work examines how process structure and role specialization shape multi‑agent LLM workflows for class‑level code generation. We simulate a Waterfall‑style development cycle covering Requirement, Design, Implementation, and Testing using three LLMs (GPT‑4o‑mini, DeepSeek‑Chat, and Claude‑3.5‑Haiku) on 100 Python tasks from the ClassEval benchmark. Our findings show that multi‑agent workflows reorganize, rather than consistently enhance, model performance. Waterfall‑style collaboration produces cleaner and more maintainable code but often reduces functional correctness (‑37.8% for GPT‑4o‑mini and ‑39.8% for DeepSeek‑Chat), with Claude‑3.5‑Haiku as a notable exception (+9.5%). Importantly, process constraints shift failure characteristics: structural issues such as missing code decrease, while semantic and validation errors become more frequent. Among all stages, Testing exerts the strongest influence by improving verification coverage but also introducing new reasoning failures, whereas Requirement and Design have comparatively modest effects. Overall, this study provides empirical evidence that software process structure fundamentally alters how LLMs reason, collaborate, and fail, revealing inherent trade‑offs between rigid workflow discipline and flexible problem‑solving in multi‑agent code generation.

PaperID: 7432, https://arxiv.org/pdf/2511.09792.pdf

Authors: Tianmeng Hu, Yongzheng Cui, Rui Tang, Biao Luo, Ke Li

Title: Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

Abstract:
Value decomposition is a central approach in multi‑agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual‑global‑max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non‑monotonic value decomposition, modeling learning dynamics as continuous‑time gradient flow. We prove that, under approximately greedy exploration, all zero‑loss equilibria violating IGM consistency are unstable saddle points, while only IGM‑consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non‑monotonic factorization reliably recovers IGM‑optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal‑difference targets and exploration strategies, providing actionable insights for the design of future value‑based MARL algorithms.

PaperID: 7433, https://arxiv.org/pdf/2511.09790.pdf

Authors: Eshika Pathak, Ahmed Aboudonia, Sandeep Banik, Naira Hovakimyan

Title: A Robust Task-Level Control Architecture for Learned Dynamical Systems

Abstract:
Dynamical system (DS)‑based learning from demonstration (LfD) is a powerful tool for generating motion plans in the operation (`task') space of robotic systems. However, the realization of the generated motion plans is often compromised by a ''task‑execution mismatch'', where unmodeled dynamics, persistent disturbances, and system latency cause the robot's actual task‑space state to diverge from the desired motion trajectory. We propose a novel task‑level robust control architecture, L1‑augmented Dynamical Systems (L1‑DS), that explicitly handles the task‑execution mismatch in tracking a nominal motion plan generated by any DS‑based LfD scheme. Our framework augments any DS‑based LfD model with a nominal stabilizing controller and an L1 adaptive controller. Furthermore, we introduce a windowed Dynamic Time Warping (DTW)‑based target selector, which enables the nominal stabilizing controller to handle temporal misalignment for improved phase‑consistent tracking. We demonstrate the efficacy of our architecture on the LASA and IROS handwriting datasets.

PaperID: 7434, https://arxiv.org/pdf/2511.09710.pdf

Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese

Title: Echoing: Identity Failures when LLM Agents Talk to Each Other

Abstract:
As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent‑agent conversations (AxA). Unlike human‑agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across 66 AxA configurations, 4 domains (3 transactional, 1 advisory), and 2500+ conversations (over 250000 LLM inferences), we show that echoing occurs across major LLM providers, with echoing rates as high as 70% depending on the model and domain. Moreover, we find that echoing is persistent even in advanced reasoning models with substantial rates (32.8%) that are not reduced by reasoning efforts. We analyze prompt, conversation dynamics, showing that echoing arises as interaction grows longer (7+ agent turns) and is not merely an artifact of sub‑optimal experiment design. Finally, we introduce a protocol‑level mitigation where targeted use of structured response reduces echoing to 9%.

PaperID: 7435, https://arxiv.org/pdf/2511.09586.pdf

Authors: Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, Yi R. Fung

Title: Environment Scaling for Interactive Agentic Experience Collection: A Survey

Abstract:
LLM‑based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long‑term decision‑making, training on static datasets built from human‑level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation‑Execution‑Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents' actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment‑centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze implementation frameworks, challenges, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.

PaperID: 7436, https://arxiv.org/pdf/2511.09535.pdf

Authors: Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, Michael Dennis

Title: Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Abstract:
Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in multi‑agent settings. However, the success of adversarial optimization has been largely limited to zero‑sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self‑sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality‑preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self‑sabotage by ensuring agents remain rational‑‑that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self‑sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general‑sum environments. Our project page can be found at https://rational‑policy‑gradient.github.io.

PaperID: 7437, https://arxiv.org/pdf/2511.09516.pdf

Authors: Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng, Zhenyu Weng, Yap-Peng Tan, Ziwei Wang

Title: MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation

Abstract:
Pre‑trained Vision‑Language‑Action (VLA) models have achieved remarkable success in improving robustness and generalization for end‑to‑end robotic manipulation. However, these models struggle with long‑horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory‑Augmented Prompting for Vision‑Language‑Action model (MAP‑VLA), a novel framework that empowers pre‑trained VLA models with demonstration‑derived memory prompts to augment action generation for long‑horizon robotic manipulation tasks. To achieve this, MAP‑VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real‑time task execution, MAP‑VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug‑and‑play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP‑VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long‑horizon tasks, surpassing the current state‑of‑the‑art methods.

PaperID: 7438, https://arxiv.org/pdf/2511.09309.pdf

Authors: Yiwen Yin, Zhian Hu, Xiaoxi Xu, Chun Yu, Xintong Wu, Wenyu Fan, Yuanchun Shi

Title: TaskSense: Cognitive Chain Modeling and Difficulty Estimation for GUI Tasks

Abstract:
Measuring GUI task difficulty is crucial for user behavior analysis and agent capability evaluation. Yet, existing benchmarks typically quantify difficulty based on motor actions (e.g., step counts), overlooking the cognitive demands underlying task completion. In this work, we propose Cognitive Chain, a novel framework that models task difficulty from a cognitive perspective. A cognitive chain decomposes the cognitive processes preceding a motor action into a sequence of cognitive steps (e.g., finding, deciding, computing), each with a difficulty index grounded in information theories. We develop an LLM‑based method to automatically extract cognitive chains from task execution traces. Validation with linear regression shows that our estimated cognitive difficulty correlates well with user completion time (step‑level R‑square=0.46 after annotation). Assessment of state‑of‑the‑art GUI agents shows reduced success on cognitively demanding tasks, revealing capability gaps and Human‑AI consistency patterns. We conclude by discussing potential applications in agent training, capability assessment, and human‑agent delegation optimization.

PaperID: 7439, https://arxiv.org/pdf/2511.09269.pdf

Authors: Tommaso Zaccherini, Siyuan Liu, Dimos V. Dimarogonas

Title: Robust Estimation and Control for Heterogeneous Multi-agent Systems Based on Decentralized k-hop Prescribed Performance Observers

Abstract:
We propose decentralized k‑hop Prescribed Performance State and Input Observers for heterogeneous multi‑agent systems subject to bounded external disturbances. In the proposed input/state observer, each agent estimates the state and input of agents located two or more hops away using only local information exchanged with 1‑hop neighbors, while guaranteeing that transient estimation errors satisfy predefined performance bounds. Conditions are established under which the input observer can be omitted, allowing the state observer convergence to be independent of the input estimates. Theoretical analysis demonstrates that if a closed‑loop controller with full state knowledge achieves the control objective and the estimation‑based closed‑loop system is set‑Input to State Stable (set‑ISS) with respect to the goal set, then the estimated states can be used to achieve the system objective with an arbitrarily small worst‑case error governed by the accuracy of the states estimates. Simulation results are provided to validate the proposed approach.

PaperID: 7440, https://arxiv.org/pdf/2511.09193.pdf

Authors: Egor Yukhnevich, Anton Andreychuk

Title: Enhancing PIBT via Multi-Action Operations

Abstract:
PIBT is a rule‑based Multi‑Agent Path Finding (MAPF) solver, widely used as a low‑level planner or action sampler in many state‑of‑the‑art approaches. Its primary advantage lies in its exceptional speed, enabling action selection for thousands of agents within milliseconds by considering only the immediate next timestep. However, this short‑horizon design leads to poor performance in scenarios where agents have orientation and must perform time‑consuming rotation actions. In this work, we present an enhanced version of PIBT that addresses this limitation by incorporating multi‑action operations. We detail the modifications introduced to improve PIBT's performance while preserving its hallmark efficiency. Furthermore, we demonstrate how our method, when combined with graph‑guidance technique and large neighborhood search optimization, achieves state‑of‑the‑art performance in the online LMAPF‑T setting.

PaperID: 7441, https://arxiv.org/pdf/2511.09171.pdf

Authors: Xinren Zhang, Jiadong Yu, Zixin Zhong

Title: Learning Efficient Communication Protocols for Multi-Agent Reinforcement Learning

Abstract:
Multi‑Agent Systems (MAS) have emerged as a powerful paradigm for modeling complex interactions among autonomous entities in distributed environments. In Multi‑Agent Reinforcement Learning (MARL), communication enables coordination but can lead to inefficient information exchange, since agents may generate redundant or non‑essential messages. While prior work has focused on boosting task performance with information exchange, the existing research lacks a thorough investigation of both the appropriate definition and the optimization of communication protocols (communication topology and message). To fill this gap, we introduce a generalized framework for learning multi‑round communication protocols that are both effective and efficient. Within this framework, we propose three novel Communication Efficiency Metrics (CEMs) to guide and evaluate the learning process: the Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) for efficiency‑augmented optimization, and the Topology Efficiency Index (TEI) for explicit evaluation. We integrate IEI and SEI as the adjusted loss functions to promote informative messaging and role specialization, while using TEI to quantify the trade‑off between communication volume and task performance. Through comprehensive experiments, we demonstrate that our learned communication protocol can significantly enhance communication efficiency and achieves better cooperation performance with improved success rates.

PaperID: 7442, https://arxiv.org/pdf/2511.09152.pdf

Authors: Swati Priya, Twinkle Tripathy

Title: Steering Opinion Dynamics in Signed Time-Varying Networks via External Control Input

Abstract:
This paper studies targeted opinion formation in multi‑agent systems evolving over signed, time‑varying directed graphs. The dynamics of each agent's state follow a Laplacian‑based update rule driven by both cooperative and antagonistic interactions in the presence of exogenous factors. We formulate these exogenous factors as external control inputs and establish a suitable controller design methodology enabling collective opinion to converge to any desired steady‑state configuration, superseding the natural emergent clustering or polarization behavior imposed by persistently structurally balanced influential root nodes. Our approach leverages upper Dini derivative analysis and Grönwall‑type inequalities to establish exponential convergence for opinion magnitude towards the desired steady state configuration on networks with uniform quasi‑strong δ‑connectivity. Finally, the theoretical results are validated through extensive numerical simulations.

PaperID: 7443, https://arxiv.org/pdf/2511.09109.pdf

Authors: Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng

Title: Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

Abstract:
Retrieval‑augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi‑step reasoning scenarios. Recent efforts have incorporated search‑based interactions into RAG, enabling iterative reasoning with real‑time retrieval. Most approaches rely on outcome‑based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi‑RAR, a novel retrieval‑augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi‑objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi‑RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.

PaperID: 7444, https://arxiv.org/pdf/2511.09087.pdf

Authors: Pranshav Gajjar, Cong Shen, Vijay K Shah

Title: Tele-LLM-Hub: Building Context-Aware Multi-Agent LLM Systems for Telecom Networks

Abstract:
This paper introduces Tele‑LLM‑Hub, a user friendly low‑code solution for rapid prototyping and deployment of context aware multi‑agent (MA) Large Language Model (LLM) systems tailored for 5G and beyond. As telecom wireless networks become increasingly complex, intelligent LLM applications must share a domainspecific understanding of network state. We propose TeleMCP, the Telecom Model Context Protocol, to enable structured and context‑rich communication between agents in telecom environments. Tele‑LLM‑Hub actualizes TeleMCP through a low‑code interface that supports agent creation, workflow composition, and interaction with software stacks such as srsRAN. Key components include a direct chat interface, a repository of pre‑built systems, an Agent Maker leveraging finetuning with our RANSTRUCT framework, and an MA‑Maker for composing MA workflows. The goal of Tele‑LLM‑Hub is to democratize the design of contextaware MA systems and accelerate innovation in next‑generation wireless networks.

PaperID: 7445, https://arxiv.org/pdf/2511.09057.pdf

Authors: PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Liu, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing

Title: PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

Abstract:
A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt‑to‑full‑video manner without causal control, interactivity, or long‑horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D‑scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long‑horizon world model that predicts future world states through high‑quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text‑based knowledge and enables conditioning on language‑specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large‑scale video‑action pairs spanning diverse domains, PAN supports open‑domain, action‑conditioned simulation with coherent, long‑term dynamics. Extensive experiments show that PAN achieves strong performance in action‑conditioned world simulation, long‑horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.

PaperID: 7446, https://arxiv.org/pdf/2511.09030.pdf

Authors: Elliot Meyerson, Giuseppe Paolo, Roberto Dailey, Hormoz Shahrzad, Olivier Francon, Conor F. Hayes, Xin Qiu, Babak Hodjat, Risto Miikkulainen

Title: Solving a Million-Step LLM Task with Zero Errors

Abstract:
LLMs have achieved remarkable breakthroughs in reasoning, insights, and tool use, but chaining these abilities into extended processes at the scale of those routinely executed by humans, organizations, and societies has remained out of reach. The models have a persistent error rate that prevents scale‑up: for instance, recent experiments in the Towers of Hanoi benchmark domain showed that the process inevitably becomes derailed after at most a few hundred steps. Thus, although LLM research is often still benchmarked on tasks with relatively few dependent logical steps, there is increasing attention on the ability (or inability) of LLMs to perform long range tasks. This paper describes MAKER, the first system that successfully solves a task with over one million LLM steps with zero errors, and, in principle, scales far beyond this level. The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi‑agent voting scheme. This combination of extreme decomposition and error correction makes scaling possible. Thus, the results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.

PaperID: 7447, https://arxiv.org/pdf/2511.09013.pdf

Authors: Ziyi Song, Chen Xia, Chenbing Wang, Haibao Yu, Sheng Zhou, Zhisheng Niu

Title: UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

Abstract:
Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision‑making with standalone intelligence. While recent multi‑agent approaches introduce cooperation, they often focus merely on perception‑level tasks, overlooking the alignment with downstream planning and control, or fall short in leveraging the full capacity of the recent emerging end‑to‑end autonomous driving. In this paper, we present UniMM‑V2X, a novel end‑to‑end multi‑agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi‑level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision‑making. To adapt to diverse downstream tasks and further enhance the quality of multi‑level fusion, we incorporate a Mixture‑of‑Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR‑V2X dataset demonstrate our approach achieves state‑of‑the‑art (SOTA) performance with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE‑enhanced multi‑level cooperative paradigm.

PaperID: 7448, https://arxiv.org/pdf/2511.08926.pdf

Authors: Zhuhui Li, Chunbo Luo, Liming Huang, Luyu Qi, Geyong Min

Title: Achieving Equilibrium under Utility Heterogeneity: An Agent-Attention Framework for Multi-Agent Multi-Objective Reinforcement Learning

Abstract:
Multi‑agent multi‑objective systems (MAMOS) have emerged as powerful frameworks for modelling complex decision‑making problems across various real‑world domains, such as robotic exploration, autonomous traffic management, and sensor network optimisation. MAMOS offers enhanced scalability and robustness through decentralised control and more accurately reflects inherent trade‑offs between conflicting objectives. In MAMOS, each agent uses utility functions that map return vectors to scalar values. Existing MAMOS optimisation methods face challenges in handling heterogeneous objective and utility function settings, where training non‑stationarity is intensified due to private utility functions and the associated policies. In this paper, we first theoretically prove that direct access to, or structured modeling of, global utility functions is necessary for the Bayesian Nash Equilibrium under decentralised execution constraints. To access the global utility functions while preserving the decentralised execution, we propose an Agent‑Attention Multi‑Agent Multi‑Objective Reinforcement Learning (AA‑MAMORL) framework. Our approach implicitly learns a joint belief over other agents' utility functions and their associated policies during centralised training, effectively mapping global states and utilities to each agent's policy. In execution, each agent independently selects actions based on local observations and its private utility function to approximate a BNE, without relying on inter‑agent communication. We conduct comprehensive experiments in both a custom‑designed MAMO Particle environment and the standard MOMALand benchmark. The results demonstrate that access to global preferences and our proposed AA‑MAMORL significantly improve performance and consistently outperform state‑of‑the‑art methods.

PaperID: 7449, https://arxiv.org/pdf/2511.08866.pdf

Authors: Fuyi Yang, Chenchen Ye, Mingyu Derek Ma, Yijia Xiao, Matthew Yang, Wei Wang

Title: BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation

Abstract:
Hypothesis generation in biomedical research has traditionally centered on uncovering hidden relationships within vast scientific literature, often using methods like Literature‑Based Discovery (LBD). Despite progress, current approaches typically depend on single data types or predefined extraction patterns, which restricts the discovery of novel and complex connections. Recent advances in Large Language Model (LLM) agents show significant potential, with capabilities in information retrieval, reasoning, and generation. However, their application to biomedical hypothesis generation has been limited by the absence of standardized datasets and execution environments. To address this, we introduce BioVerge, a comprehensive benchmark, and BioVerge Agent, an LLM‑based agent framework, to create a standardized environment for exploring biomedical hypothesis generation at the frontier of existing scientific knowledge. Our dataset includes structured and textual data derived from historical biomedical hypotheses and PubMed literature, organized to support exploration by LLM agents. BioVerge Agent utilizes a ReAct‑based approach with distinct Generation and Evaluation modules that iteratively produce and self‑assess hypothesis proposals. Through extensive experimentation, we uncover key insights: 1) different architectures of BioVerge Agent influence exploration diversity and reasoning strategies; 2) structured and textual information sources each provide unique, critical contexts that enhance hypothesis generation; and 3) self‑evaluation significantly improves the novelty and relevance of proposed hypotheses.

PaperID: 7450, https://arxiv.org/pdf/2511.08822.pdf

Authors: Nelson Durrant, Braden Meyers, Matthew McMurray, Clayton Smith, Brighton Anderson, Tristan Hodgins, Kalliyan Velasco, Joshua G. Mangelson

Title: Low-cost Multi-agent Fleet for Acoustic Cooperative Localization Research

Abstract:
Real‑world underwater testing for multi‑agent autonomy presents substantial financial and engineering challenges. In this work, we introduce the Configurable Underwater Group of Autonomous Robots (CoUGARs) as a low‑cost, configurable autonomous‑underwater‑vehicle (AUV) platform for multi‑agent autonomy research. The base design costs less than 3,000 USD (as of May 2025) and is based on commercially‑available and 3D‑printed parts, enabling quick customization for various sensor payloads and configurations. Our current expanded model is equipped with a doppler velocity log (DVL) and ultra‑short‑baseline (USBL) acoustic array/transducer to support research on acoustic‑based cooperative localization. State estimation, navigation, and acoustic communications software has been developed and deployed using a containerized software stack and is tightly integrated with the HoloOcean simulator. The system was tested both in simulation and via in‑situ field trials in Utah lakes and reservoirs.

PaperID: 7451, https://arxiv.org/pdf/2511.08798.pdf

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha

Title: Structured Uncertainty guided Clarification for LLM Agents

Abstract:
LLM agents with tool‑calling capabilities often fail when user instructions are ambiguous or incomplete, leading to incorrect invocations and task failures. Existing approaches operate in unstructured language spaces, generating clarifying questions through prompting strategies that lack principled criteria for determining which questions to ask and when to stop. We introduce a principled formulation of structured uncertainty that operates directly over tool parameters and their domains, cleanly separating specification uncertainty (what the user wants) from model uncertainty (what the LLM predicts). Our formulation uses Expected Value of Perfect Information (EVPI) to quantify the disambiguation value of each potential question, balanced against aspect‑based cost modeling that prevents redundant questioning. We demonstrate the versatility of this formulation through two applications. First, SAGE‑Agent uses structured uncertainty for inference‑time question selection, achieving 7‑39% higher coverage on ambiguous tasks while reducing clarification questions by 1.5‑2.7x compared to strong prompting and uncertainty‑based baselines. Second, we show that structured uncertainty provides effective training signals: uncertainty‑guided reward modeling boosts When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty‑weighted GRPO training, demonstrating more sample‑efficient reinforcement learning for tool‑calling agents. To enable evaluation, we present ClarifyBench, the first multi‑turn dynamic tool‑calling disambiguation benchmark. Our results establish structured uncertainty as a principled framework that improves both inference‑time interaction efficiency and training‑time sample efficiency in tool‑augmented agents.

PaperID: 7452, https://arxiv.org/pdf/2511.08763.pdf

Authors: Jerry M. Huang, Stefan T. Radev

Title: Modeling multi-agent motion dynamics in immersive rooms

Abstract:
Immersive rooms are increasingly popular augmented reality systems that support multi‑agent interactions within a virtual world. However, despite extensive content creation and technological developments, insights about perceptually‑driven social dynamics, such as the complex movement patterns during virtual world navigation, remain largely underexplored. Computational models of motion dynamics can help us understand the underlying mechanism of human interaction in immersive rooms and develop applications that better support spatially distributed interaction. In this work, we propose a new agent‑based model of emergent human motion dynamics. The model represents human agents as simple spatial geometries in the room that relocate and reorient themselves based on the salient virtual spatial objects they approach. Agent motion is modeled as an interactive process combining external diffusion‑driven influences from the environment with internal self‑propelling interactions among agents. Further, we leverage simulation‑based inference (SBI) to show that the governing parameters of motion patterns can be estimated from simple observables. Our results indicate that the model successfully captures action‑related agent properties but exposes local non‑identifiability linked to environmental awareness. We argue that our simulation‑based approach paves the way for creating adaptive, responsive immersive rooms ‑‑ spaces that adjust their interfaces and interactions based on human collective movement patterns and spatial attention.

PaperID: 7453, https://arxiv.org/pdf/2511.08752.pdf

Authors: Akshita Gupta, Arna Bhardwaj, Yashwanth Kumar Nakka, Changrak Choi, Amir Rahmani

Title: Information-Driven Fault Detection and Identification for Multi-Agent Spacecraft Systems: Collaborative On-Orbit Inspection Mission

Abstract:
This work presents a global‑to‑local, task‑aware fault detection and identification (FDI) framework for multi‑spacecraft systems conducting collaborative inspection missions in low Earth orbit. The inspection task is represented by a global information‑driven cost functional that integrates the sensor model, spacecraft poses, and mission‑level information‑gain objectives. This formulation links guidance, control, and FDI by using the same cost function to drive both global task allocation and local sensing or motion decisions. Fault detection is achieved through comparisons between expected and observed task metrics, while higher‑order cost‑gradient measures enable the identification of faults among sensors, actuators, and state estimators. An adaptive thresholding mechanism captures the time‑varying inspection geometry and dynamic mission conditions. Simulation results for representative multi‑spacecraft inspection scenarios demonstrate the reliability of fault localization and classification under uncertainty, providing a unified, information‑driven foundation for resilient autonomous inspection architectures.

PaperID: 7454, https://arxiv.org/pdf/2511.08710.pdf

Authors: Romain Cosentino, Sarath Shekkizhar, Adam Earle

Title: Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives

Abstract:
We develop and analyze a theoretical framework for agent‑to‑agent interactions in a simplified in‑context linear regression setting. In our model, each agent is instantiated as a single‑layer transformer with linear self‑attention (LSA) trained to implement gradient‑descent‑like updates on a quadratic regression objective from in‑context examples. We then study the coupled dynamics when two such LSA agents alternately update from each other's outputs under potentially misaligned fixed objectives. Within this framework, we characterize the generation dynamics and show that misalignment leads to a biased equilibrium where neither agent reaches its target, with residual errors predictable from the objective gap and the prompt‑induced geometry. We also characterize an adversarial regime where asymmetric convergence is possible: one agent reaches its objective exactly while inducing persistent bias in the other. We further contrast this fixed objective regime with an adaptive multi‑agent setting, wherein a helper agent updates a turn‑based objective to implement a Newton‑like step for the main agent, eliminating the plateau and accelerating its convergence. Experiments with trained LSA agents, as well as black‑box GPT‑5‑mini runs on in‑context linear regression tasks, are consistent with our theoretical predictions within this simplified setting. We view our framework as a mechanistic framework that links prompt geometry and objective misalignment to stability, bias, and robustness, and as a stepping stone toward analyzing more realistic multi‑agent LLM systems.

PaperID: 7455, https://arxiv.org/pdf/2511.08649.pdf

Authors: Yi Ni, Liwei Zhu, Shuai Li

Title: Bio AI Agent: A Multi-Agent Artificial Intelligence System for Autonomous CAR-T Cell Therapy Development with Integrated Target Discovery, Toxicity Prediction, and Rational Molecular Design

Abstract:
Chimeric antigen receptor T‑cell (CAR‑T) therapy represents a paradigm shift in cancer treatment, yet development timelines of 8‑12 years and clinical attrition rates exceeding 40‑60% highlight critical inefficiencies in target selection, safety assessment, and molecular optimization. We present Bio AI Agent, a multi‑agent artificial intelligence system powered by large language models that enables autonomous CAR‑T development through collaborative specialized agents. The system comprises six autonomous agents: Target Selection Agent for multi‑parametric antigen prioritization across >10,000 cancer‑associated targets, Toxicity Prediction Agent for comprehensive safety profiling integrating tissue expression atlases and pharmacovigilance databases, Molecular Design Agent for rational CAR engineering, Patent Intelligence Agent for freedom‑to‑operate analysis, Clinical Translation Agent for regulatory compliance, and Decision Orchestration Agent for multi‑agent coordination. Retrospective validation demonstrated autonomous identification of high‑risk targets including FcRH5 (hepatotoxicity) and CD229 (off‑tumor toxicity), patent infringement risks for CD38+SLAMF7 combinations, and generation of comprehensive development roadmaps. By enabling parallel processing, specialized reasoning, and autonomous decision‑making superior to monolithic AI systems, Bio AI Agent addresses critical gaps in precision oncology development and has potential to accelerate translation of next‑generation immunotherapies from discovery to clinic.

PaperID: 7456, https://arxiv.org/pdf/2511.08538.pdf

Authors: Yannan Bai, Kamesh Munagala, Yiheng Shen, Davidson Zhu

Title: Fair Multi-agent Persuasion with Submodular Constraints

Abstract:
We study the problem of selection in the context of Bayesian persuasion. We are given multiple agents with hidden values (or quality scores), to whom resources must be allocated by a welfare‑maximizing decision‑maker. An intermediary with knowledge of the agents' values seeks to influence the outcome of the selection by designing informative signals and providing tie‑breaking policies, so that when the receiver maximizes welfare over the resulting posteriors, the expected utilities of the agents (where utility is defined as allocation times value) achieve certain fairness properties. The fairness measure we will use is majorization, which simultaneously approximately maximizes all symmetric, monotone, concave functions of the utilities. We consider the general setting where the allocation to the agents needs to respect arbitrary submodular constraints, as given by the corresponding polymatroid. We present a signaling policy that, under a mild bounded rationality assumption on the receiver, achieves a logarithmically approximate majorized policy in this setting. The approximation ratio is almost best possible, and that significantly outperforms generic results that only yield linear approximations. A key component of our result is a structural characterization showing that the vector of agent utilities for a given signaling policy defines the base polytope of a different polymatroid, a result that may be of independent interest. In addition, we show that an arbitrarily good additive approximation to this vector can be produced in (weakly) polynomial time via the multiplicative weights update method.

PaperID: 7457, https://arxiv.org/pdf/2511.08521.pdf

Authors: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei

Title: UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Abstract:
While specialized AI models excel at isolated video tasks like generation or understanding, real‑world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open‑source, omni‑capable multi‑agent framework for next‑generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan‑and‑Act dual‑agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video‑processing steps, while executor agents execute these through modular, MCP‑based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi‑level memory (global knowledge, task context, and user‑specific preferences), UniVA sustains long‑horizon reasoning, contextual continuity, and inter‑agent communication, enabling interactive and self‑reflective video creation with full traceability. This design enables iterative and any‑conditioned video workflows (e.g., text/image/video‑conditioned generation \rightarrow multi‑round editing \rightarrow object segmentation \rightarrow compositional synthesis) that were previously cumbersome to achieve with single‑purpose models or monolithic video‑language models. We also introduce UniVA‑Bench, a benchmark suite of multi‑step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA‑Bench are fully open‑sourced, aiming to catalyze research on interactive, agentic, and general‑purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)

PaperID: 7458, https://arxiv.org/pdf/2511.08475.pdf

Authors: Yangxiao Cai, Ruiyin Li, Peng Liang, Mojtaba Shahin, Zengyang Li

Title: Designing LLM-based Multi-Agent Systems for Software Engineering Tasks: Quality Attributes, Design Patterns and Rationale

Abstract:
As the complexity of Software Engineering (SE) tasks continues to escalate, Multi‑Agent Systems (MASs) have emerged as a focal point of research and practice due to their autonomy and scalability. Furthermore, through leveraging the reasoning and planning capabilities of Large Language Models (LLMs), the application of LLM‑based MASs in the field of SE is garnering increasing attention. However, there is no dedicated study that systematically explores the design of LLM‑based MASs, including the Quality Attributes (QAs) on which designers mainly focus, the design patterns used by designers, and the rationale guiding the design of LLM‑based MASs for SE tasks. To this end, we conducted a study to identify the QAs that LLM‑based MASs for SE tasks focus on, the design patterns used in the MASs, and the design rationale for the MASs. We collected 94 papers on LLM‑based MASs for SE tasks as the source. Our study shows that: (1) Code Generation is the most common SE task solved by LLM‑based MASs among ten identified SE tasks, (2) Functional Suitability is the QA on which designers of LLM‑based MASs pay the most attention, (3) Role‑Based Cooperation is the design pattern most frequently employed among 16 patterns used to construct LLM‑based MASs, and (4) Improving the Quality of Generated Code is the most common rationale behind the design of LLM‑based MASs. Based on the study results, we presented the implications for the design of LLM‑based MASs to support SE tasks.

PaperID: 7459, https://arxiv.org/pdf/2511.08436.pdf

Authors: Satpreet H. Singh, Sonja Johnson-Yu, Zhouyang Lu, Aaron Walsman, Federico Pedraja, Denis Turcu, Pratyusha Sharma, Naomi Saphra, Nathaniel B. Sawtell, Kanaka Rajan

Title: Understanding Electro-communication and Electro-sensing in Weakly Electric Fish using Multi-Agent Deep Reinforcement Learning

Abstract:
Weakly electric fish, like Gnathonemus petersii, use a remarkable electrical modality for active sensing and communication, but studying their rich electrosensing and electrocommunication behavior and associated neural activity in naturalistic settings remains experimentally challenging. Here, we present a novel biologically‑inspired computational framework to study these behaviors, where recurrent neural network (RNN) based artificial agents trained via multi‑agent reinforcement learning (MARL) learn to modulate their electric organ discharges (EODs) and movement patterns to collectively forage in virtual environments. Trained agents demonstrate several emergent features consistent with real fish collectives, including heavy tailed EOD interval distributions, environmental context dependent shifts in EOD interval distributions, and social interaction patterns like freeloading, where agents reduce their EOD rates while benefiting from neighboring agents' active sensing. A minimal two‑fish assay further isolates the role of electro‑communication, showing that access to conspecific EODs and relative dominance jointly shape foraging success. Notably, these behaviors emerge through evolution‑inspired rewards for individual fitness and emergent inter‑agent interactions, rather than through rewarding agents explicitly for social interactions. Our work has broad implications for the neuroethology of weakly electric fish, as well as other social, communicating animals in which extensive recordings from multiple individuals, and thus traditional data‑driven modeling, are infeasible.

PaperID: 7460, https://arxiv.org/pdf/2511.08412.pdf

Authors: Ruochuan Shi, Runyu Lu, Yuanheng Zhu, Dongbin Zhao

Title: ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games

Abstract:
In graph‑structured multi‑agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose Adaptive Regularized Multi‑Agent Soft Actor‑Critic (ARAC), which integrates an attention‑based graph neural network (GNN) for modeling agent dependencies with an adaptive divergence regularization mechanism. The GNN enables expressive representation of spatial relations and state features in graph environments. Divergence regularization can serve as policy guidance to alleviate the sparse reward problem, but it may lead to suboptimal convergence when the reference policy itself is imperfect. The adaptive divergence regularization mechanism enables the framework to exploit reference policies for efficient exploration in the early stages, while gradually reducing reliance on them as training progresses to avoid inheriting their limitations. Experiments in pursuit and confrontation scenarios demonstrate that ARAC achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents compared with MARL baselines, highlighting its effectiveness in complex graph‑structured environments.

PaperID: 7461, https://arxiv.org/pdf/2511.08364.pdf

Authors: Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo, Minlie Huang

Title: DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

Abstract:
In multi‑hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi‑step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi‑step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. While implicit PRM is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, it is more suitable for multi‑step reasoning in MHQA tasks. However, existing implicit PRM has only been explored for plain text scenarios. When adapting to MHQA tasks, it cannot handle the graph structure constraints in KGs and capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG‑PRM and CoT‑PRM, derive step‑level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG‑PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.

PaperID: 7462, https://arxiv.org/pdf/2511.08325.pdf

Authors: Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Title: AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

Abstract:
Despite rapid development, large language models (LLMs) still encounter challenges in multi‑turn decision‑making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine‑tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision‑making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear‑cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re‑defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration‑exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference‑based (TD‑based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample‑efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8× more compute‑efficient than baselines, and it demonstrates robust improvement when scaling up test‑time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.

PaperID: 7463, https://arxiv.org/pdf/2511.08319.pdf

Authors: Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko

Title: Adaptive Multi-Agent Response Refinement in Conversational Systems

Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human‑like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real‑life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi‑agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user's persona, or both.

PaperID: 7464, https://arxiv.org/pdf/2511.08317.pdf

Authors: Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min Yang

Title: Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates

Abstract:
Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer‑author interactions. To address these limitations, we propose ReViewGraph (Reviewer‑Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM‑simulated multi‑round reviewer‑author debates. In our approach, reviewer‑author exchanges are simulated through LLM‑based multi‑agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine‑grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer‑author debate structures.

PaperID: 7465, https://arxiv.org/pdf/2511.08274.pdf

Authors: Anton Gusarov, Anastasia Volkova, Valentin Khrulkov, Andrey Kuznetsov, Evgenii Maslov, Ivan Oseledets

Title: Multi-Agent GraphRAG: A Text-to-Cypher Framework for Labeled Property Graphs

Abstract:
While Retrieval‑Augmented Generation (RAG) methods commonly draw information from unstructured documents, the emerging paradigm of GraphRAG aims to leverage structured data such as knowledge graphs. Most existing GraphRAG efforts focus on Resource Description Framework (RDF) knowledge graphs, relying on triple representations and SPARQL queries. However, the potential of Cypher and Labeled Property Graph (LPG) databases to serve as scalable and effective reasoning engines within GraphRAG pipelines remains underexplored in current research literature. To fill this gap, we propose Multi‑Agent GraphRAG, a modular LLM agentic system for text‑to‑Cypher query generation serving as a natural language interface to LPG‑based graph data. Our proof‑of‑concept system features an LLM‑based workflow for automated Cypher queries generation and execution, using Memgraph as the graph database backend. Iterative content‑aware correction and normalization, reinforced by an aggregated feedback loop, ensures both semantic and syntactic refinement of generated queries. We evaluate our system on the CypherBench graph dataset covering several general domains with diverse types of queries. In addition, we demonstrate performance of the proposed workflow on a property graph derived from the IFC (Industry Foundation Classes) data, representing a digital twin of a building. This highlights how such an approach can bridge AI with real‑world applications at scale, enabling industrial digital automation use cases.

PaperID: 7466, https://arxiv.org/pdf/2511.08217.pdf

Authors: Gleb V. Solovev, Alina B. Zhidkovskaya, Anastasia Orlova, Nina Gubina, Anastasia Vepreva, Rodion Golovinskii, Ilya Tonkii, Ivan Dubrovsky, Ivan Gurev, Dmitry Gilemkhanov, Denis Chistiakov, Timur A. Aliev, Ivan Poddiakov, Galina Zubkova, Ekaterina V. Skorb, Vladimir Vinogradov, Alexander Boukhanovsky, Nikolay Nikitin, Andrei Dmitrenko, Anna Kalyuzhnaya, Andrey Savchenko

Title: MADD: Multi-Agent Drug Discovery Orchestra

Abstract:
Hit identification is a central challenge in early drug discovery, traditionally requiring substantial experimental resources. Recent advances in artificial intelligence, particularly large language models (LLMs), have enabled virtual screening methods that reduce costs and improve efficiency. However, the growing complexity of these tools has limited their accessibility to wet‑lab researchers. Multi‑agent systems offer a promising solution by combining the interpretability of LLMs with the precision of specialized models and tools. In this work, we present MADD, a multi‑agent system that builds and executes customized hit identification pipelines from natural language queries. MADD employs four coordinated agents to handle key subtasks in de novo compound generation and screening. We evaluate MADD across seven drug discovery cases and demonstrate its superior performance compared to existing LLM‑based solutions. Using MADD, we pioneer the application of AI‑first drug design to five biological targets and release the identified hit molecules. Finally, we introduce a new benchmark of query‑molecule pairs and docking scores for over three million compounds to contribute to the agentic future of drug design.

PaperID: 7467, https://arxiv.org/pdf/2511.08151.pdf

Authors: Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, Bin Dong

Title: SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning

Abstract:
Recent advances in large language models have enabled AI systems to achieve expert‑level performance on domain‑specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi‑agent system designed for generalistic scientific reasoning‑the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem's domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub‑agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold‑medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity's Last Exam (HLE) benchmark, further confirming the system's ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence‑AI systems capable of coherent, cross‑disciplinary reasoning at expert levels.

PaperID: 7468, https://arxiv.org/pdf/2511.08145.pdf

Authors: Kunal Kingkar Das, Manoj Balaji Jagadeeshan, Nallani Chakravartula Sahith, Jivnesh Sandhan, Pawan Goyal

Title: Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?

Abstract:
Large Language Models (LLMs) are increasingly treated as universal, general‑purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low‑resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction‑tuned and in‑context‑prompted LLMs with smaller task‑specific encoder‑decoder models on the Sanskrit poetry‑to‑prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi‑step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine‑tuning on general‑purpose models and design in‑context learning templates grounded in Paninian grammar and classical commentary heuristics. For task‑specific modelling, we fully fine‑tune a ByT5‑Sanskrit Seq2Seq model. Our experiments show that domain‑specific fine‑tuning of ByT5‑Sanskrit significantly outperforms all instruction‑driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall's Tau scores. Additionally, our prompting strategies provide an alternative to fine‑tuning when domain‑specific verse corpora are unavailable, and the task‑specific Seq2Seq model demonstrates robust generalisation on out‑of‑domain evaluations.

PaperID: 7469, https://arxiv.org/pdf/2511.08142.pdf

Authors: Anna Lackinger, Andrea Morichetta, Pantelis A. Frangoudis, Schahram Dustdar

Title: BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services

Abstract:
Federated Learning (FL) is a promising machine learning solution in large‑scale IoT systems, guaranteeing load distribution and privacy. However, FL does not natively consider infrastructure efficiency, a critical concern for systems operating in resource‑constrained environments. Several Reinforcement Learning (RL) based solutions offer improved client selection for FL; however, they do not consider infrastructure challenges, such as resource limitations and device churn. Furthermore, the training of RL methods is often not designed for practical application, as these approaches frequently do not consider generalizability and are not optimized for energy efficiency. To fill this gap, we propose BIPPO (Budget‑aware Independent Proximal Policy Optimization), which is an energy‑efficient multi‑agent RL solution that improves performance. We evaluate BIPPO on two image classification tasks run in a highly budget‑constrained setting, with FL clients training on non‑IID data, a challenging context for vanilla FL. The improved sampler of BIPPO enables it to increase the mean accuracy compared to non‑RL mechanisms, traditional PPO, and IPPO. In addition, BIPPO only consumes a negligible proportion of the budget, which stays consistent even if the number of clients increases. Overall, BIPPO delivers a performant, stable, scalable, and sustainable solution for client selection in IoT‑FL.

PaperID: 7470, https://arxiv.org/pdf/2511.08098.pdf

Authors: Sabrina Patania, Luca Annese, Anita Pellegrini, Silvia Serino, Anna Lambiase, Luca Pallonetto, Silvia Rossi, Simone Colombani, Tom Foulsham, Azzurra Ruggeri, Dimitri Ognibene

Title: PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision

Abstract:
Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi‑agent interaction necessitates robust perspective‑taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM's ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective‑taking complexity. These scenarios are designed to challenge the agent's capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct‑style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model's interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective‑taking mechanisms in advancing LLMs' application in robotics and multi‑agent systems, setting a foundation for future research into adaptive and context‑aware AI systems.

PaperID: 7471, https://arxiv.org/pdf/2511.08024.pdf

Authors: Tianwen Lyu, Xiang Zhuang, Keyan Ding, Xinzhe Cao, Lei Liang, Wei Zhao, Qiang Zhang, Huajun Chen

Title: Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning

Abstract:
Understanding complex biomolecular mechanisms requires multi‑step reasoning across molecular interactions, signaling cascades, and metabolic pathways. While large language models(LLMs) show promise in such tasks, their application to biomolecular problems is hindered by logical inconsistencies and the lack of grounding in domain knowledge. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. To address these challenges, we propose a Knowledge‑Augmented Long‑CoT Reasoning framework that integrates LLMs with knowledge graph‑based multi‑hop reasoning chains. The framework constructs mechanistic chains via guided multi‑hop traversal and pruning on the knowledge graph; these chains are then incorporated into supervised fine‑tuning to improve factual grounding and further refined with reinforcement learning to enhance reasoning reliability and consistency. Furthermore, to overcome the shortcomings of existing benchmarks, which are often restricted in scale and scope and lack annotations for deep reasoning chains, we introduce PrimeKGQA, a comprehensive benchmark for biomolecular question answering. Experimental results on both PrimeKGQA and existing datasets demonstrate that although larger closed‑source models still perform well on relatively simple tasks, our method demonstrates clear advantages as reasoning depth increases, achieving state‑of‑the‑art performance on multi‑hop tasks that demand traversal of structured biological knowledge. These findings highlight the effectiveness of combining structured knowledge with advanced reasoning strategies for reliable and interpretable biomolecular reasoning.

PaperID: 7472, https://arxiv.org/pdf/2511.07894.pdf

Authors: Shihao Li, Jiachen Li, Jiamin Xu, Dongmei Chen

Title: From Natural Language to Certified H-infinity Controllers: Integrating LLM Agents with LMI-Based Synthesis

Abstract:
We present \textscS2C (Specification‑to‑Certified‑Controller), a multi‑agent framework that maps natural‑language requirements to certified \mathcalH_\infty state‑feedback controllers via LMI synthesis. \textscS2C coordinates five roles ‑‑ SpecInt (spec extraction), Solv (bounded‑real lemma (BRL) LMI), Tester (Monte Carlo and frequency‑domain checks), Adapt (spec refinement), and CodeGen (deployable code). The loop is stabilized by a severity‑ and iteration‑aware γ‑floor guardrail and a decay‑rate region constraint enforcing \Reλ(A+BK)<‑α with α=3.9/T_s derived from settling‑time targets. For state feedback, verification reports disturbance rejection \big\|C\,(sI‑(A+BK))^‑1E\big\|_\infty alongside time‑domain statistics; discrete benchmarks are converted to continuous time via a Tustin (bilinear) transform when needed. On 14 COMPleib problems, \textscS2C attains 100% synthesis success and 100% convergence within six iterations, with strong decay‑rate satisfaction and near‑target certified \mathcalH_\infty levels; it improves robustness metrics relative to single‑shot BRL and BRL+α baselines. An ablation over LLM backbones (GPT‑5, GPT‑5 mini, DeepSeek‑V3, Qwen‑2.5‑72B, Llama‑4 Maverick) shows the pipeline is robust across models, while stronger models yield the highest effectiveness. These results indicate that LLM agents can integrate certificate‑bearing control synthesis from high‑level intent, enabling rapid end‑to‑end prototyping without sacrificing formal guarantees.

PaperID: 7473, https://arxiv.org/pdf/2511.07800.pdf

Authors: Siyu Xia, Zekun Xu, Jiajun Chai, Wentian Fan, Yan Song, Xiaohan Wang, Guojun Yin, Wei Lin, Haifeng Zhang, Jun Wang

Title: From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

Abstract:
Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task‑solving across complex, open‑ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent‑centric, trainable, multi‑layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high‑level, human‑interpretable strategic meta‑cognition. In order to make memory adaptable, we propose a reinforcement‑based weight optimization procedure that estimates the empirical utility of each meta‑cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent's training loop through meta‑cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents' strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.

PaperID: 7474, https://arxiv.org/pdf/2511.07794.pdf

Authors: Hua Zhou, Bing Ma, Yufei Zhang, Yi Zhao

Title: Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark

Abstract:
This paper comprehensively elaborates on the construction methodology, multi‑dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of "quantitative‑oriented, expert‑driven, and multi‑validation," the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub‑indicators, and 14,430 high‑quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general‑purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High‑quality domain‑specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of "domain adaptation + reasoning enhancement" for insurance large models.

PaperID: 7475, https://arxiv.org/pdf/2511.07784.pdf

Authors: Haolun Wu, Zhenkun Li, Lingyao Li

Title: Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning

Abstract:
Multi‑agent debate (MAD) has recently emerged as a promising framework for improving the reasoning performance of large language models (LLMs). Yet, whether LLM agents can genuinely engage in deliberative reasoning, beyond simple ensembling or majority voting, remains unclear. We address this question through a controlled study using the Knight‑‑Knave‑‑Spy logic puzzle, which enables precise, step‑wise evaluation of debate outcomes and processes under verifiable ground truth. We systematically set up six structural and cognitive factors, including agent team size, composition, confidence visibility, debate order, debate depth, and task difficulty, to disentangle their respective effects on collective reasoning. Our results show that intrinsic reasoning strength and group diversity are the dominant drivers of debate success, while structural parameters such as order or confidence visibility offer limited gains. Beyond outcomes, process‑level analyses identify key behavioral patterns: majority pressure suppresses independent correction, effective teams overturn incorrect consensus, and rational, validity‑aligned reasoning most strongly predicts improvement. These findings provide valuable insights into how and why LLM debates succeed or fail, offering guidance for designing interpretable and truth‑seeking multi‑agent reasoning systems.

PaperID: 7476, https://arxiv.org/pdf/2511.07778.pdf

Authors: Ao Ding, Licheng Sun, Yongjie Hou, Huaqing Zhang, Hongbin Ma

Title: A Historical Interaction-Enhanced Shapley Policy Gradient Algorithm for Multi-Agent Credit Assignment

Abstract:
Multi‑agent reinforcement learning (MARL) has demonstrated remarkable performance in multi‑agent collaboration problems and has become a prominent topic in artificial intelligence research in recent years. However, traditional credit assignment schemes in MARL cannot reliably capture individual contributions in strongly coupled tasks while maintaining training stability, which leads to limited generalization capabilities and hinders algorithm performance. To address these challenges, we propose a Historical Interaction‑Enhanced Shapley Policy Gradient Algorithm (HIS) for Multi‑Agent Credit Assignment, which employs a hybrid credit assignment mechanism to balance base rewards with individual contribution incentives. By utilizing historical interaction data to calculate the Shapley value in a sample‑efficient manner, HIS enhances the agent's ability to perceive its own contribution, while retaining the global reward to maintain training stability. Additionally, we provide theoretical guarantees for the hybrid credit assignment mechanism, ensuring that the assignment results it generates are both efficient and stable. We evaluate the proposed algorithm in three widely used continuous‑action benchmark environments: Multi‑Agent Particle Environment, Multi‑Agent MuJoCo, and Bi‑DexHands. Experimental results demonstrate that HIS outperforms state‑of‑the‑art methods, particularly excelling in strongly coupled, complex collaborative tasks.

PaperID: 7477, https://arxiv.org/pdf/2511.07707.pdf

Authors: Manonmani Sekar, Nasim Nezamoddini

Title: A Negotiation-Based Multi-Agent Reinforcement Learning Approach for Dynamic Scheduling of Reconfigurable Manufacturing Systems

Abstract:
Reconfigurable manufacturing systems (RMS) are critical for future market adjustment given their rapid adaptation to fluctuations in consumer demands, the introduction of new technological advances, and disruptions in linked supply chain sections. The adjustable hard settings of such systems require a flexible soft planning mechanism that enables realtime production planning and scheduling amid the existing complexity and variability in their configuration settings. This study explores the application of multi agent reinforcement learning (MARL) for dynamic scheduling in soft planning of the RMS settings. In the proposed framework, deep Qnetwork (DQN) agents trained in centralized training learn optimal job machine assignments in real time while adapting to stochastic events such as machine breakdowns and reconfiguration delays. The model also incorporates a negotiation with an attention mechanism to enhance state representation and improve decision focus on critical system features. Key DQN enhancements including prioritized experience replay, nstep returns, double DQN and soft target update are used to stabilize and accelerate learning. Experiments conducted in a simulated RMS environment demonstrate that the proposed approach outperforms baseline heuristics in reducing makespan and tardiness while improving machine utilization. The reconfigurable manufacturing environment was extended to simulate realistic challenges, including machine failures and reconfiguration times. Experimental results show that while the enhanced DQN agent is effective in adapting to dynamic conditions, machine breakdowns increase variability in key performance metrics such as makespan, throughput, and total tardiness. The results confirm the advantages of applying the MARL mechanism for intelligent and adaptive scheduling in dynamic reconfigurable manufacturing environments.

PaperID: 7478, https://arxiv.org/pdf/2511.07690.pdf

Authors: Soham Hans, Volkan Ustun, Benjamin Nye, James Sterrett, Matthew Green

Title: Towards AI-Assisted Generation of Military Training Scenarios

Abstract:
Achieving expert‑level performance in simulation‑based training relies on the creation of complex, adaptable scenarios, a traditionally laborious and resource intensive process. Although prior research explored scenario generation for military training, pre‑LLM AI tools struggled to generate sufficiently complex or adaptable scenarios. This paper introduces a multi‑agent, multi‑modal reasoning framework that leverages Large Language Models (LLMs) to generate critical training artifacts, such as Operations Orders (OPORDs). We structure our framework by decomposing scenario generation into a hierarchy of subproblems, and for each one, defining the role of the AI tool: (1) generating options for a human author to select from, (2) producing a candidate product for human approval or modification, or (3) generating textual artifacts fully automatically. Our framework employs specialized LLM‑based agents to address distinct subproblems. Each agent receives input from preceding subproblem agents, integrating both text‑based scenario details and visual information (e.g., map features, unit positions and applies specialized reasoning to produce appropriate outputs. Subsequent agents process these outputs sequentially, preserving logical consistency and ensuring accurate document generation. This multi‑agent strategy overcomes the limitations of basic prompting or single‑agent approaches when tackling such highly complex tasks. We validate our framework through a proof‑of‑concept that generates the scheme of maneuver and movement section of an OPORD while estimating map positions and movements as a precursor demonstrating its feasibility and accuracy. Our results demonstrate the potential of LLM‑driven multi‑agent systems to generate coherent, nuanced documents and adapt dynamically to changing conditions, advancing automation in scenario generation for military training.

PaperID: 7479, https://arxiv.org/pdf/2511.07685.pdf

Authors: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu

Title: ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Abstract:
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open‑ended queries. It requires the integration of several capabilities, including multi‑step reasoning, cross‑document synthesis, and the generation of evidence‑backed, long‑form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800+ hours of human labor that pairs realistic, domain‑diverse prompts with 2,500+ expert‑written, fine‑grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model‑based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state‑of‑the‑art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics(including all prompts, rubrics, and evaluation code) to facilitate progress toward well‑justified research assistants.

PaperID: 7480, https://arxiv.org/pdf/2511.07654.pdf

Authors: Yinsen Jia, Boyuan Chen

Title: Time as a Control Dimension in Robot Learning

Abstract:
Temporal awareness plays a central role in intelligent behavior by shaping how actions are paced, coordinated, and adapted to changing goals and environments. In contrast, most robot learning algorithms treat time only as a fixed episode horizon or scheduling constraint. Here we introduce time‑aware policy learning, a reinforcement learning framework that treats time as a control dimension of robot behavior. The approach augments policies with two temporal signals, the remaining time and a time ratio that modulates the policy's internal progression of time, allowing a single policy to regulate its execution strategy across temporal regimes. Across diverse manipulation tasks including long‑horizon manipulation, granular‑media pouring, articulated‑object interaction, and multi‑agent coordination, the resulting policies adapt their behavior continuously from dynamic execution under tight schedules to stable and deliberate interaction when more time is available. This temporal awareness improves efficiency, robustness under sim‑to‑real mismatch and disturbances, and controllability through human input without retraining. Treating time as a controllable variable provides a new framework for adaptive and human‑aligned robot autonomy.

PaperID: 7481, https://arxiv.org/pdf/2511.07629.pdf

Authors: Yue Jin, Giovanni Montana

Title: Partial Action Replacement: Tackling Distribution Shift in Offline MARL

Abstract:
Offline multi‑agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out‑of‑distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized ‑ a common scenario where agents act fully or partially independently during data collection ‑ a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates a single or part of agents' actions while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint‑action updates. Based on this insight, we develop Soft‑Partial Conservative Q‑Learning (SPaCQL), using PAR to mitigate OOD issue and dynamically weighting different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint‑action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty‑informed weights. Our empirical results demonstrate SPaCQL enables more effective policy learning, and manifest its remarkable superiority over baseline algorithms when the offline dataset exhibits the independence structure.

PaperID: 7482, https://arxiv.org/pdf/2511.07574.pdf

Authors: Vasilis Bountris, Lauritz Thamsen, Ulf Leser

Title: HyProv: Hybrid Provenance Management for Scientific Workflows

Abstract:
Provenance plays a crucial role in scientific workflow execution, for instance by providing data for failure analysis, real‑time monitoring, or statistics on resource utilization for right‑sizing allocations. The workflows themselves, however, become increasingly complex in terms of involved components. Furthermore, they are executed on distributed cluster infrastructures, which makes the real‑time collection, integration, and analysis of provenance data challenging. Existing provenance systems struggle to balance scalability, real‑time processing, online provenance analytics, and integration across different components and compute resources. Moreover, most provenance solutions are not workflow‑aware; by focusing on arbitrary workloads, they miss opportunities for workflow systems where optimization and analysis can exploit the availability of a workflow specification that dictates, to some degree, task execution orders and provides abstractions for physical tasks at a logical level. In this paper, we present HyProv, a hybrid provenance management system that combines centralized and federated paradigms to offer scalable, online, and workflow‑aware queries over workflow provenance traces. HyProv uses a centralized component for efficient management of the small and stable workflow‑specification‑specific provenance, and complements this with federated querying over different scalable monitoring and provenance databases for the large‑scale execution logs. This enables low‑latency access to current execution data. Furthermore, the design supports complex provenance queries, which we exemplify for the workflow system Airflow in combination with the resource manager Kubernetes. Our experiments indicate that HyProv scales to large workflows, answers provenance queries with sub‑second latencies, and adds only modest CPU and memory overhead to the cluster.

PaperID: 7483, https://arxiv.org/pdf/2511.07448.pdf

Authors: Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard

Title: Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Abstract:
Scientific idea generation is central to discovery, requiring the joint satisfaction of novelty and scientific soundness. Unlike standard reasoning or general creative generation, scientific ideation is inherently open‑ended and multi‑objective, making its automation particularly challenging. Recent advances in large language models (LLMs) have enabled the generation of coherent and plausible scientific ideas, yet the nature and limits of their creative capabilities remain poorly understood. This survey provides a structured synthesis of methods for LLM‑driven scientific ideation, focusing on how different approaches trade off novelty and scientific validity. We organize existing methods into five complementary families: External knowledge augmentation, Prompt‑based distributional steering, Inference‑time scaling, Multi‑agent collaboration, and Parameter‑level adaptation. To interpret their contributions, we adopt two complementary creativity frameworks: Boden taxonomy to characterize the expected level of creative novelty, and Rhodes 4Ps framework to analyze the aspects or sources of creativity emphasized by each method. By aligning methodological developments with cognitive creativity frameworks, this survey clarifies the evaluation landscape and identifies key challenges and directions for reliable and systematic LLM‑based scientific discovery.

PaperID: 7484, https://arxiv.org/pdf/2511.07441.pdf

Authors: Ye Zheng, Yimin Chen, Yidan Hu

Title: AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

Abstract:
AI agents can autonomously perform tasks and, often without explicit user consent, collect or disclose users' sensitive local data, which raises serious privacy concerns. Although AI agents' privacy policies describe their intended data practices, there remains limited transparency and accountability about whether runtime behavior matches those policies. To bridge this gap, we present AudAgent, a tool that continuously monitors AI agents' data practices in real time and guards compliance with their stated privacy policies. AudAgent comprises four components for automated privacy auditing of AI agents. (i) Policy formalization: a novel cross‑LLM voting mechanism that ensures high‑confidence parsing of privacy policies into formal models. (ii) Runtime annotation: a lightweight Presidio‑based analyzer that detects sensitive data and annotates data practices based on the AI agent's context and the formalized privacy policy model. (iii) Compliance auditing: ontology graphs and automata‑based checking that link the privacy policy model with runtime annotations, enabling on‑the‑fly compliance verification. (iv) User interface: an infrastructure‑independent implementation that visualizes the real‑time execution trace of AI agents alongside detected privacy violations, providing user‑friendly transparency and accountability. We evaluate AudAgent on AI agents built with mainstream frameworks, demonstrating its effectiveness in detecting and visualizing privacy policy violations. Using AudAgent, we further find that many privacy policies lack explicit safeguards for highly sensitive data such as SSNs, whose misuse violates legal requirements, and that many agents, including those powered by Claude, Gemini, and DeepSeek,do not refuse to process such data via third‑party tools. AudAgent proactively blocks operations on such data, overriding the agents' original privacy policies and behavior.

PaperID: 7485, https://arxiv.org/pdf/2511.07437.pdf

Authors: Ravi Gupta, Guneet Bhatia

Title: Agentic Educational Content Generation for African Languages on Edge Devices

Abstract:
Addressing educational inequity in Sub‑Saharan Africa, this research presents an autonomous agent‑orchestrated framework for decentralized, culturally adaptive educational content generation on edge devices. The system leverages four specialized agents that work together to generate contextually appropriate educational content. Experimental validation on platforms including Raspberry Pi 4B and NVIDIA Jetson Nano demonstrates significant performance achievements. InkubaLM on Jetson Nano achieved a Time‑To‑First‑Token (TTFT) of 129 ms, an average inter‑token latency of 33 ms, and a throughput of 45.2 tokens per second while consuming 8.4 W. On Raspberry Pi 4B, InkubaLM also led with 326 ms TTFT and 15.9 tokens per second at 5.8 W power consumption. The framework consistently delivered high multilingual quality, averaging a BLEU score of 0.688, cultural relevance of 4.4/5, and fluency of 4.2/5 across tested African languages. Through potential partnerships with active community organizations including African Youth & Community Organization (AYCO) and Florida Africa Foundation, this research aims to establish a practical foundation for accessible, localized, and sustainable AI‑driven education in resource‑constrained environments. Keeping focus on long‑term viability and cultural appropriateness, it contributes to United Nations SDGs 4, 9, and 10. Index Terms ‑ Multi‑Agent Systems, Edge AI Computing, Educational Technology, African Languages, Rural Education, Sustainable Development, UN SDG.

PaperID: 7486, https://arxiv.org/pdf/2511.07426.pdf

Authors: Zihao Ding, Mufeng Zhu, Yao Liu

Title: Network and Systems Performance Characterization of MCP-Enabled LLM Agents

Abstract:
Model Context Protocol (MCP) has recently gained increased attention within the AI community for providing a standardized way for large language models (LLMs) to interact with external tools and services, significantly enhancing their capabilities. However, the inclusion of extensive contextual information, including system prompts, MCP tool definitions, and context histories, in MCP‑enabled LLM interactions, dramatically inflates token usage. Given that LLM providers charge based on tokens, these expanded contexts can quickly escalate monetary costs and increase the computational load on LLM services. This paper presents a comprehensive measurement‑based analysis of MCP‑enabled interactions with LLMs, revealing trade‑offs between capability, performance, and cost. We explore how different LLM models and MCP configurations impact key performance metrics such as token efficiency, monetary cost, task completion times, and task success rates, and suggest potential optimizations, including enabling parallel tool calls and implementing robust task abort mechanisms. These findings provide useful insights for developing more efficient, robust, and cost‑effective MCP‑enabled workflows.

PaperID: 7487, https://arxiv.org/pdf/2511.07403.pdf

Authors: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

Title: SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress in vision‑language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture‑specific modifications, and remain constrained by large‑scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D‑aware MLLM trained with RL to integrate structured spatial grounding with multi‑step reasoning. The model simulates human‑like spatial perception by constructing a scene graph of task‑relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA‑7K, a high‑quality spatial VQA dataset, and (2) online RL with a multi‑objective dense spatial reward enforcing spatial grounding. SpatialThinker‑7B outperforms supervised fine‑tuning and the sparse RL baseline on spatial understanding and real‑world VQA benchmarks, nearly doubling the base‑model gain compared to sparse RL, and surpassing GPT‑4o. These results showcase the effectiveness of combining spatial supervision with reward‑aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human‑level visual reasoning.

PaperID: 7488, https://arxiv.org/pdf/2511.07392.pdf

Authors: Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim

Title: Voice-Interactive Surgical Agent for Multimodal Patient Data Control

Abstract:
In robotic surgery, surgeons fully engage their hands and visual attention in procedures, making it difficult to access and manipulate multimodal patient data without interrupting the workflow. To overcome this problem, we propose a Voice‑Interactive Surgical Agent (VISA) built on a hierarchical multi‑agent framework consisting of an orchestration agent and three task‑specific agents driven by Large Language Models (LLMs). These LLM‑based agents autonomously plan, refine, validate, and reason to interpret voice commands and execute tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models within surgical video. We construct a dataset of 240 user commands organized into hierarchical categories and introduce the Multi‑level Orchestration Evaluation Metric (MOEM) that evaluates the performance and robustness at both the command and category levels. Experimental results demonstrate that VISA achieves high stage‑level accuracy and workflow‑level success rates, while also enhancing its robustness by correcting transcription errors, resolving linguistic ambiguity, and interpreting diverse free‑form expressions. These findings highlight the strong potential of VISA to support robotic surgery and its scalability for integrating new functions and agents.

PaperID: 7489, https://arxiv.org/pdf/2511.07366.pdf

Authors: Dao Lan Vy Dinh, Anh Nguyen Thi Mai, Hung Tran, Giang Quynh Le Vu, Tu Dac Ho, Zhenni Pan, Vo Nhan Van, Symeon Chatzinotas, Dinh-Hieu Tran

Title: UAV-Assisted Resilience in 6G and Beyond Network Energy Saving: A Multi-Agent DRL Approach

Abstract:
This paper investigates the unmanned aerial vehicle (UAV)‑assisted resilience perspective in the 6G network energy saving (NES) scenario. More specifically, we consider multiple ground base stations (GBSs) and each GBS has three different sectors/cells in the terrestrial networks, and multiple cells may become inactive due to unexpected events such as power outages, disasters, hardware failures, or erroneous energy‑saving decisions made by external network management systems. During the time required to reactivate these cells, UAVs are deployed to temporarily restore user service. To address this, we propose a Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) framework to enable UAV‑assisted communication by jointly optimizing UAV trajectories, transmission power, and user‑UAV association under a sleeping ground base station (GBS) strategy. This framework aims to ensure the resilience of active users in the network and the long‑term operability of UAVs. Specifically, it maximizes service coverage for users during power outages or NES zones, while minimizing the energy consumption of UAVs. Simulation results demonstrate that the proposed MADDPG policy consistently achieves high coverage ratio across different testing episodes, outperforming other baselines. Moreover, the MADDPG framework attains the lowest total energy consumption, while maintaining a comparable user service rate. These results confirm the effectiveness of the proposed approach in achieving a superior trade‑off between energy efficiency and service performance, supporting the development of sustainable and resilient UAV‑assisted cellular networks.

PaperID: 7490, https://arxiv.org/pdf/2511.07364.pdf

Authors: Vaibhav Mavi, Shubh Jaroria, Weiqi Sun

Title: Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

Abstract:
Reliability and failure detection of large language models (LLMs) is critical for their deployment in high‑stakes, multi‑step reasoning tasks. Prior work explores confidence estimation for self‑evaluating LLM‑scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single‑step outputs and overlook the challenges of multi‑step reasoning. In this work, we extend self‑evaluation techniques to multi‑step tasks, testing two intuitive approaches: holistic scoring and step‑by‑step scoring. Using two multi‑step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC‑ROC. Our findings demonstrate that self‑evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.

PaperID: 7491, https://arxiv.org/pdf/2511.07322.pdf

Authors: Song Jin, Shuqi Li, Shukun Zhang, Rui Yan

Title: FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation

Abstract:
While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the evaluation metrics absence, we present an open‑source evaluation benchmark for ERR generation ‑ FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high‑quality ERR dataset automatically, which could be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi‑agent framework specifically tailored to address this task, named FinRpt‑Gen, and train several LLM‑based agents on the proposed datasets using Supervised Fine‑Tuning and Reinforcement Learning. Experimental results indicate the data quality and metrics effectiveness of the benchmark FinRpt and the strong performance of FinRpt‑Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.

PaperID: 7492, https://arxiv.org/pdf/2511.07315.pdf

Authors: Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia

Title: JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Abstract:
The widespread application of large VLMs makes ensuring their secure deployment critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they require either white‑box access, restricting practicality, or rely on manually crafted patterns, leading to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi‑agent collaborative framework designed for automated VLM jailbreaking. It effectively overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules: Tactic‑Driven Seed Generation and Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves over a 60% attack success rate on multiple advanced VLMs, including GPT‑4o, significantly outperforming existing methods. As a black‑box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.

PaperID: 7493, https://arxiv.org/pdf/2511.07267.pdf

Authors: Chen Han, Yijia Ma, Jin Tan, Wenzhen Zheng, Xijin Tang

Title: Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion

Abstract:
Multi‑agent debate (MAD) frameworks have emerged as promising approaches for misinformation detection by simulating adversarial reasoning. While prior work has focused on detection accuracy, it overlooks the importance of helping users understand the reasoning behind factual judgments and develop future resilience. The debate transcripts generated during MAD offer a rich but underutilized resource for transparent reasoning. In this study, we introduce ED2D, an evidence‑based MAD framework that extends previous approach by incorporating factual evidence retrieval. More importantly, ED2D is designed not only as a detection framework but also as a persuasive multi‑agent system aimed at correcting user beliefs and discouraging misinformation sharing. We compare the persuasive effects of ED2D‑generated debunking transcripts with those authored by human experts. Results demonstrate that ED2D outperforms existing baselines across three misinformation detection benchmarks. When ED2D generates correct predictions, its debunking transcripts exhibit persuasive effects comparable to those of human experts; However, when ED2D misclassifies, its accompanying explanations may inadvertently reinforce users'misconceptions, even when presented alongside accurate human explanations. Our findings highlight both the promise and the potential risks of deploying MAD systems for misinformation intervention. We further develop a public community website to help users explore ED2D, fostering transparency, critical thinking, and collaborative fact‑checking.

PaperID: 7494, https://arxiv.org/pdf/2511.07265.pdf

Authors: Gamal Refai-Ahmed, Mallik Tatipamula, Victor Zhirnov, Ahmed Refaey Hussein, Abdallah Shami

Title: When Intelligence Overloads Infrastructure: A Forecast Model for AI-Driven Bottlenecks

Abstract:
The exponential growth of AI agents and connected devices fundamentally transforms the structure and capacity demands of global digital infrastructure. This paper introduces a unified forecasting model that projects AI agent populations to increase by more than 100 times between 2026 and 2036+, reaching trillions of instances globally. In parallel, bandwidth demand is expected to surge from 1 EB/day in 2026 to over 8,000 EB/day by 2036, which is an increase of 8000 times in a single decade. Through this growth model, we identify critical bottleneck domains across access networks, edge gateways, interconnection exchanges, and cloud infrastructures. Simulations reveal that edge and peering systems will experience saturation as early as 2030, with more than 70% utilization of projected maximum capacity by 2033. To address these constraints, we propose a coevolutionary shift in compute‑network design, emphasizing distributed inference, AI‑native traffic engineering, and intent‑aware orchestration. Security, scalability, and coordination challenges are examined with a focus on sustaining intelligent connectivity throughout the next digital decade.

PaperID: 7495, https://arxiv.org/pdf/2511.07262.pdf

Authors: Qile Jiang, George Karniadakis

Title: AgenticSciML: Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning

Abstract:
Scientific Machine Learning (SciML) integrates data‑driven inference with physical modeling to solve complex problems in science and engineering. However, the design of SciML architectures, loss formulations, and training strategies remains an expert‑driven research process, requiring extensive experimentation and problem‑specific insights. Here we introduce AgenticSciML, a collaborative multi‑agent system in which over 10 specialized AI agents collaborate to propose, critique, and refine SciML solutions through structured reasoning and iterative evolution. The framework integrates structured debate, retrieval‑augmented method memory, and ensemble‑guided evolutionary search, enabling the agents to generate and assess new hypotheses about architectures and optimization procedures. Across physics‑informed learning and operator learning tasks, the framework discovers solution methods that outperform single‑agent and human‑designed baselines by up to four orders of magnitude in error reduction. The agents produce novel strategies ‑‑ including adaptive mixture‑of‑expert architectures, decomposition‑based PINNs, and physics‑informed operator learning models ‑‑ that do not appear explicitly in the curated knowledge base. These results show that collaborative reasoning among AI agents can yield emergent methodological innovation, suggesting a path toward scalable, transparent, and autonomous discovery in scientific computing.

PaperID: 7496, https://arxiv.org/pdf/2511.07257.pdf

Authors: Hanya Elhashemy, Youssef Lotfy, Yongjian Tang

Title: Bridging the Prototype-Production Gap: A Multi-Agent System for Notebooks Transformation

Abstract:
The increasing adoption of Jupyter notebooks in data science and machine learning workflows has created a gap between exploratory code development and production‑ready software systems. While notebooks excel at iterative development and visualization, they often lack proper software engineering principles, making their transition to production environments challenging. This paper presents Codelevate, a novel multi‑agent system that automatically transforms Jupyter notebooks into well‑structured, maintainable Python code repositories. Our system employs three specialized agents ‑ Architect, Developer, and Structure ‑ working in concert through a shared dependency tree to ensure architectural coherence and code quality. Our experimental results validate Codelevate's capability to bridge the prototype‑to‑production gap through autonomous code transformation, yielding quantifiable improvements in code quality metrics while preserving computational semantics.

PaperID: 7497, https://arxiv.org/pdf/2511.07230.pdf

Authors: Viet-Thanh Pham, Minghan Wang, Hao-Han Liao, Thuy-Trang Vu

Title: Discourse Graph Guided Document Translation with Large Language Models

Abstract:
Adapting large language models to full document translation remains challenging due to the difficulty of capturing long‑range dependencies and preserving discourse coherence throughout extended texts. While recent agentic machine translation systems mitigate context window constraints through multi‑agent orchestration and persistent memory, they require substantial computational resources and are sensitive to memory retrieval strategies. We introduce TransGraph, a discourse‑guided framework that explicitly models inter‑chunk relationships through structured discourse graphs and selectively conditions each translation segment on relevant graph neighbourhoods rather than relying on sequential or exhaustive context. Across three document‑level MT benchmarks spanning six languages and diverse domains, TransGraph consistently surpasses strong baselines in translation quality and terminology consistency while incurring significantly lower token overhead.

PaperID: 7498, https://arxiv.org/pdf/2511.07176.pdf

Authors: Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou, Ozgur B. Akan

Title: Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents

Abstract:
Internet of Agents (IoA) envisions a unified, agent‑centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale. Within this paradigm, federated fine‑tuning (FFT) serves as a key enabler that allows distributed LLM agents to co‑train an intelligent global LLM without centralizing local datasets. However, the FFT‑enabled IoA systems remain vulnerable to model poisoning attacks, where adversaries can upload malicious updates to the server to degrade the performance of the aggregated global LLM. This paper proposes a graph representation‑based model poisoning (GRMP) attack, which exploits overheard benign updates to construct a feature correlation graph and employs a variational graph autoencoder to capture structural dependencies and generate malicious updates. A novel attack algorithm is developed based on augmented Lagrangian and subgradient descent methods to optimize malicious updates that preserve benign‑like statistics while embedding adversarial objectives. Experimental results show that the proposed GRMP attack can substantially decrease accuracy across different LLM models while remaining statistically consistent with benign updates, thereby evading detection by existing defense mechanisms and underscoring a severe threat to the ambitious IoA paradigm.

PaperID: 7499, https://arxiv.org/pdf/2511.07124.pdf

Authors: Zhikang Chen, Sen Cui, Deheng Ye, Yu Zhang, Yatao Bian, Tingting Zhu

Title: Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought

Abstract:
Large Language Models (LLMs) have demonstrated strong reasoning capabilities through \emphChain‑of‑Thought (CoT) prompting, which enables step‑by‑step intermediate reasoning. However, explicit CoT methods rely on discrete token‑level reasoning processes that are prone to error propagation and limited by vocabulary expressiveness, often resulting in rigid and inconsistent reasoning trajectories. Recent research has explored implicit or continuous reasoning in latent spaces, allowing models to perform internal reasoning before generating explicit output. Although such approaches alleviate some limitations of discrete CoT, they generally lack explicit mechanisms to enforce consistency among reasoning steps, leading to divergent reasoning paths and unstable outcomes. To address this issue, we propose EBM‑CoT, an Energy‑Based Chain‑of‑Thought Calibration framework that refines latent thought representations through an energy‑based model (EBM). Our method dynamically adjusts latent reasoning trajectories toward lower‑energy, high‑consistency regions in the embedding space, improving both reasoning accuracy and consistency without modifying the base language model. Extensive experiments across mathematical, commonsense, and symbolic reasoning benchmarks demonstrate that the proposed framework significantly enhances the consistency and efficiency of multi‑step reasoning in LLMs.

PaperID: 7500, https://arxiv.org/pdf/2511.07112.pdf

Authors: Khashayar Alavi, Zhastay Yeltay, Lucie Flek, Akbar Karimi

Title: More Agents Improve Math Problem Solving but Adversarial Robustness Gap Persists

Abstract:
When LLM agents work together, they seem to be more powerful than a single LLM in mathematical question answering. However, are they also more robust to adversarial inputs? We investigate this question using adversarially perturbed math questions. These perturbations include punctuation noise with three intensities (10%, 30%, 50%), plus real‑world and human‑like typos (WikiTypo, R2ATA). Using a unified sampling‑and‑voting framework (Agent Forest), we evaluate six open‑source models (Qwen3‑4B/14B, Llama3.1‑8B, Mistral‑7B, Gemma3‑4B/12B) across four benchmarks (GSM8K, MATH, MMLU‑Math, MultiArith), with various numbers of agents n = 1,2,5,10,15,20,25. Our findings show that 1) Noise type matters: punctuation noise harm scales with its severity, and the human typos remain the dominant bottleneck, yielding the largest gaps to Clean accuracy and the highest attack success rate (ASR) even with a large number of agents; 2) Collaboration reliably improves accuracy as the number of agents, n, increases, with the largest gains from n=1 to n=5 and diminishing returns beyond n\approx10. However, the adversarial robustness gap persists regardless of the agent count.

PaperID: 7501, https://arxiv.org/pdf/2511.07097.pdf

Authors: Diego Gosmar, Anna Chiara Pallotta, Giovanni Zenezini

Title: Agentic AI Sustainability Assessment for Supply Chain Document Insights

Abstract:
This paper presents a comprehensive sustainability assessment framework for document intelligence within supply chain operations, centered on agentic artificial intelligence (AI). We address the dual objective of improving automation efficiency while providing measurable environmental performance in document‑intensive workflows. The research compares three scenarios: fully manual (human‑only), AI‑assisted (human‑in‑the‑loop, HITL), and an advanced multi‑agent agentic AI workflow leveraging parsers and verifiers. Empirical results show that AI‑assisted HITL and agentic AI scenarios achieve reductions of up to 70‑90% in energy consumption, 90‑97% in carbon dioxide emissions, and 89‑98% in water usage compared to manual processes. Notably, full agentic configurations, combining advanced reasoning (thinking mode) and multi‑agent validation, achieve substantial sustainability gains over human‑only approaches, even when resource usage increases slightly versus simpler AI‑assisted solutions. The framework integrates performance, energy, and emission indicators into a unified ESG‑oriented methodology for assessing and governing AI‑enabled supply chain solutions. The paper includes a complete replicability use case demonstrating the methodology's application to real‑world document extraction tasks.

PaperID: 7502, https://arxiv.org/pdf/2511.07086.pdf

Authors: Marcel Pehlke, Marc Jansen

Title: LLM Driven Processes to Foster Explainable AI

Abstract:
We present a modular, explainable LLM‑agent pipeline for decision support that externalizes reasoning into auditable artifacts. The system instantiates three frameworks: Vester's Sensitivity Model (factor set, signed impact matrix, systemic roles, feedback loops); normal‑form games (strategies, payoff matrix, equilibria); and sequential games (role‑conditioned agents, tree construction, backward induction), with swappable modules at every step. LLM components (default: GPT‑5) are paired with deterministic analyzers for equilibria and matrix‑based role classification, yielding traceable intermediates rather than opaque outputs. In a real‑world logistics case (100 runs), mean factor alignment with a human baseline was 55.5% over 26 factors and 62.9% on the transport‑core subset; role agreement over matches was 57%. An LLM judge using an eight‑criterion rubric (max 100) scored runs on par with a reconstructed human baseline. Configurable LLM pipelines can thus mimic expert workflows with transparent, inspectable steps.

PaperID: 7503, https://arxiv.org/pdf/2511.06991.pdf

Authors: Siqi Huang, Sida Huang, Hongyuan Zhang

Title: CoLM: Collaborative Large Models via A Client-Server Paradigm

Abstract:
Large models have achieved remarkable performance across a range of reasoning and understanding tasks. Prior work often utilizes model ensembles or multi‑agent systems to collaboratively generate responses, effectively operating in a server‑to‑server paradigm. However, such approaches do not align well with practical deployment settings, where a limited number of server‑side models are shared by many clients under modern internet architectures. In this paper, we introduce CoLM (Collaboration in Large‑Models), a novel framework for collaborative reasoning that redefines cooperation among large models from a client‑server perspective. Unlike traditional ensemble methods that rely on simultaneous inference from multiple models to produce a single output, CoLM allows the outputs of multiple models to be aggregated or shared, enabling each client model to independently refine and update its own generation based on these high‑quality outputs. This design enables collaborative benefits by fully leveraging both client‑side and shared server‑side models. We further extend CoLM to vision‑language models (VLMs), demonstrating its applicability beyond language tasks. Experimental results across multiple benchmarks show that CoLM consistently improves model performance on previously failed queries, highlighting the effectiveness of collaborative guidance in enhancing single‑model capabilities.

PaperID: 7504, https://arxiv.org/pdf/2511.06934.pdf

Authors: Dennis Thumm

Title: Sequential Causal Normal Form Games: Theory, Computation, and Strategic Signaling

Abstract:
Can classical game‑theoretic frameworks be extended to capture the bounded rationality and causal reasoning of AI agents? We investigate this question by extending Causal Normal Form Games (CNFGs) to sequential settings, introducing Sequential Causal Multi‑Agent Systems (S‑CMAS) that incorporate Pearl's Causal Hierarchy across leader‑follower interactions. While theoretically elegant ‑‑ we prove PSPACE‑completeness, develop equilibrium refinements, and establish connections to signaling theory ‑‑ our comprehensive empirical investigation reveals a critical limitation: S‑CNE provides zero welfare improvement over classical Stackelberg equilibrium across all tested scenarios. Through 50+ Monte Carlo simulations and hand‑crafted synthetic examples, we demonstrate that backward induction with rational best‑response eliminates any strategic advantage from causal layer distinctions. We construct a theoretical example illustrating conditions where benefits could emerge (ε‑rational satisficing followers), though implementation confirms that even relaxed rationality assumptions prove insufficient when good instincts align with optimal play. This negative result provides valuable insight: classical game‑theoretic extensions grounded in rational choice are fundamentally incompatible with causal reasoning advantages, motivating new theoretical frameworks beyond standard Nash equilibrium for agentic AI.

PaperID: 7505, https://arxiv.org/pdf/2511.06892.pdf

Authors: Kailin Tong, Selim Solmaz, Kenan Mujkic, Gottfried Allmer, Bo Leng

Title: Multi-Agent AI Framework for Road Situation Detection and C-ITS Message Generation

Abstract:
Conventional road‑situation detection methods achieve strong performance in predefined scenarios but fail in unseen cases and lack semantic interpretation, which is crucial for reliable traffic recommendations. This work introduces a multi‑agent AI framework that combines multimodal large language models (MLLMs) with vision‑based perception for road‑situation monitoring. The framework processes camera feeds and coordinates dedicated agents for situation detection, distance estimation, decision‑making, and Cooperative Intelligent Transport System (C‑ITS) message generation. Evaluation is conducted on a custom dataset of 103 images extracted from 20 videos of the TAD dataset. Both Gemini‑2.0‑Flash and Gemini‑2.5‑Flash were evaluated. The results show 100% recall in situation detection and perfect message schema correctness; however, both models suffer from false‑positive detections and have reduced performance in terms of number of lanes, driving lane status and cause code. Surprisingly, Gemini‑2.5‑Flash, though more capable in general tasks, underperforms Gemini‑2.0‑Flash in detection accuracy and semantic understanding and incurs higher latency (Table II). These findings motivate further work on fine‑tuning specialized LLMs or MLLMs tailored for intelligent transportation applications.

PaperID: 7506, https://arxiv.org/pdf/2511.06873.pdf

Authors: Ruohan Wang, Siyuan Liu, Zhiyong Sun, Sofie Haesaert

Title: Correct-by-Design Control Synthesis of Stochastic Multi-agent Systems: a Robust Tensor-based Solution

Abstract:
Discrete‑time stochastic systems with continuous spaces are hard to verify and control, even with MDP abstractions due to the curse of dimensionality. We propose an abstraction‑based framework with robust dynamic programming mappings that deliver control strategies with provable lower bounds on temporal‑logic satisfaction, quantified via approximate stochastic simulation relations. Exploiting decoupled dynamics, we reveal a Canonical Polyadic Decomposition tensor structure in value functions that makes dynamic programming scalable. The proposed method provides correct‑by‑design probabilistic guarantees for temporal logic specifications. We validate our results on continuous‑state linear stochastic systems.

PaperID: 7507, https://arxiv.org/pdf/2511.06778.pdf

Authors: Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang

Title: SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

Abstract:
The rapid advancement of Large Language Models (LLMs) has driven significant progress in Natural Language Interface to Database (NLIDB). However, the widespread adoption of LLMs has raised critical privacy and security concerns. During interactions, LLMs may unintentionally expose confidential database contents or be manipulated by attackers to exfiltrate data through seemingly benign queries. While current efforts typically rely on rule‑based heuristics or LLM agents to mitigate this leakage risk, these methods still struggle with complex inference‑based attacks, suffer from high false positive rates, and often compromise the reliability of SQL queries. To address these challenges, we propose \textscSafeNlidb, a novel privacy‑security alignment framework for LLM‑based NLIDB. The framework features an automated pipeline that generates hybrid chain‑of‑thought interaction data from scratch, seamlessly combining implicit security reasoning with SQL generation. Additionally, we introduce reasoning warm‑up and alternating preference optimization to overcome the multi‑preference oscillations of Direct Preference Optimization (DPO), enabling LLMs to produce security‑aware SQL through fine‑grained reasoning without the need for human‑annotated preference data. Extensive experiments demonstrate that our method outperforms both larger‑scale LLMs and ideal‑setting baselines, achieving significant security improvements while preserving high utility. WARNING: This work may contain content that is offensive and harmful!

PaperID: 7508, https://arxiv.org/pdf/2511.06727.pdf

Authors: Jiangwen Dong, Zehui Lin, Wanyu Lin, Mingjin Zhang

Title: S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning

Abstract:
Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems. Their effectiveness highly depends on the specific nature of the task, especially the required domain knowledge. Existing approaches, such as mixture‑of‑experts, typically operate at the task level; they are too coarse to effectively solve the heterogeneous problems involving multiple subjects. This work proposes a novel framework that performs fine‑grained analysis at subject level equipped with a designated multi‑agent collaboration strategy for addressing heterogeneous problem reasoning. Specifically, given an input query, we first employ a Graph Neural Network to identify the relevant subjects and infer their interdependencies to generate an Subject‑based Directed Acyclic Graph (S‑DAG), where nodes represent subjects and edges encode information flow. Then we profile the LLM models by assigning each model a subject‑specific expertise score, and select the top‑performing one for matching corresponding subject of the S‑DAG. Such subject‑model matching enables graph‑structured multi‑agent collaboration where information flows from the starting model to the ending model over S‑DAG. We curate and release multi‑subject subsets of standard benchmarks (MMLU‑Pro, GPQA, MedMCQA) to better reflect complex, real‑world reasoning tasks. Extensive experiments show that our approach significantly outperforms existing task‑level model selection and multi‑agent collaboration baselines in accuracy and efficiency. These results highlight the effectiveness of subject‑aware reasoning and structured collaboration in addressing complex and multi‑subject problems.

PaperID: 7509, https://arxiv.org/pdf/2511.06649.pdf

Authors: Bin Rao, Chengyue Wang, Haicheng Liao, Qianfang Wang, Yanchen Guan, Jiaxun Zhang, Xingcheng Liu, Meixin Zhu, Kanye Ye Wang, Zhenning Li

Title: Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Forecasting in Autonomous Driving

Abstract:
Long‑tail motion forecasting is a core challenge for autonomous driving, where rare yet safety‑critical events‑such as abrupt maneuvers and dense multi‑agent interactions‑dominate real‑world risk. Existing approaches struggle in these scenarios because they rely on either non‑interpretable clustering or model‑dependent error heuristics, providing neither a differentiable notion of "tailness" nor a mechanism for rapid adaptation. We propose SAML, a Semantic‑Aware Meta‑Learning framework that introduces the first differentiable definition of tailness for motion forecasting. SAML quantifies motion rarity via semantically meaningful intrinsic (kinematic, geometric, temporal) and interactive (local and global risk) properties, which are fused by a Bayesian Tail Perceiver into a continuous, uncertainty‑aware Tail Index. This Tail Index drives a meta‑memory adaptation module that couples a dynamic prototype memory with an MAML‑based cognitive set mechanism, enabling fast adaptation to rare or evolving patterns. Experiments on nuScenes, NGSIM, and HighD show that SAML achieves state‑of‑the‑art overall accuracy and substantial gains on top 1‑5% worst‑case events, while maintaining high efficiency. Our findings highlight semantic meta‑learning as a pathway toward robust and safety‑critical motion forecasting.

PaperID: 7510, https://arxiv.org/pdf/2511.06499.pdf

Authors: Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen

Title: SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Abstract:
Deeply understanding sports requires an intricate blend of fine‑grained visual perception and rule‑based reasoning ‑ a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi‑sport context. To address this gap, we introduce SportR, the first multi‑sports large‑scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 4,789 images and 2,052 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question‑answer pairs designed to probe reasoning at increasing depths ‑ from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi‑step reasoning, such as determining penalties or explaining tactics, we provide 6,841 high‑quality, human‑authored Chain of Thought annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State‑of‑the‑art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine‑Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.

PaperID: 7511, https://arxiv.org/pdf/2511.06496.pdf

Authors: Keke Long, Jiacheng Guo, Tianyun Zhang, Hongkai Yu, Xiaopeng Li

Title: A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving

Abstract:
Vision Language Models (VLMs) are increasingly used in autonomous driving to help understand traffic scenes, but they sometimes produce hallucinations, which are false details not grounded in the visual input. Detecting and mitigating hallucinations is challenging when ground‑truth references are unavailable and model internals are inaccessible. This paper proposes a novel self‑contained low‑rank approach to automatically rank multiple candidate captions generated by multiple VLMs based on their hallucination levels, using only the captions themselves without requiring external references or model access. By constructing a sentence‑embedding matrix and decomposing it into a low‑rank consensus component and a sparse residual, we use the residual magnitude to rank captions: selecting the one with the smallest residual as the most hallucination‑free. Experiments on the NuScenes dataset demonstrate that our approach achieves 87% selection accuracy in identifying hallucination‑free captions, representing a 19% improvement over the unfiltered baseline and a 6‑10% improvement over multi‑agent debate method. The sorting produced by sparse error magnitudes shows strong correlation with human judgments of hallucinations, validating our scoring mechanism. Additionally, our method, which can be easily parallelized, reduces inference time by 51‑67% compared to debate approaches, making it practical for real‑time autonomous driving applications.

PaperID: 7512, https://arxiv.org/pdf/2511.06455.pdf

Authors: Milena Trajanoska, Riste Stojanov, Dimitar Trajanov

Title: A Multi-Agent System for Semantic Mapping of Relational Data to Knowledge Graphs

Abstract:
Enterprises often maintain multiple databases for storing critical business data in siloed systems, resulting in inefficiencies and challenges with data interoperability. A key to overcoming these challenges lies in integrating disparate data sources, enabling businesses to unlock the full potential of their data. Our work presents a novel approach for integrating multiple databases using knowledge graphs, focusing on the application of large language models as semantic agents for mapping and connecting structured data across systems by leveraging existing vocabularies. The proposed methodology introduces a semantic layer above tables in relational databases, utilizing a system comprising multiple LLM agents that map tables and columns to Schema.org terms. Our approach achieves a mapping accuracy of over 90% in multiple domains.

PaperID: 7513, https://arxiv.org/pdf/2511.06449.pdf

Authors: Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou

Title: FLEX: Continuous Agent Evolution via Forward Learning from Experience

Abstract:
Autonomous agents driven by Large Language Models (LLMs) have revolutionized reasoning and problem‑solving but remain static after training, unable to grow with experience as intelligent beings do during deployment. We introduce Forward Learning with EXperience (FLEX), a gradient‑free learning paradigm that enables LLM agents to continuously evolve through accumulated experience. Specifically, FLEX cultivates scalable and inheritable evolution by constructing a structured experience library through continual reflection on successes and failures during interaction with the environment. FLEX delivers substantial improvements on mathematical reasoning, chemical retrosynthesis, and protein fitness prediction (up to 23% on AIME25, 10% on USPTO50k, and 14% on ProteinGym). We further identify a clear scaling law of experiential growth and the phenomenon of experience inheritance across agents, marking a step toward scalable and inheritable continuous agent evolution. Project Page: https://flex‑gensi‑thuair.github.io.

PaperID: 7514, https://arxiv.org/pdf/2511.06441.pdf

Authors: Mayank Saini, Arit Kumar Bishwas

Title: Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models

Abstract:
As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real‑time, scalable deployment. Conversely, smaller open‑source models offer cost advantages but struggle with complex or multimodal queries. We introduce a unified, modular framework that intelligently routes each query ‑ textual, multimodal, or complex ‑ to the most fitting expert model, using a learned routing network that balances cost and quality. For vision tasks, we employ a two‑stage open‑source pipeline optimized for efficiency and reviving efficient classical vision components where they remain SOTA for sub‑tasks. On benchmarks such as Massive Multitask Language Understanding (MMLU) and Visual Question Answering (VQA), we match or exceed the performance of always‑premium LLM (monolithic systems with one model serving all query types) performance, yet reduce the reliance on costly models by over 67%. With its extensible, multi‑agent orchestration, we deliver high‑quality, resource‑efficient AI at scale.

PaperID: 7515, https://arxiv.org/pdf/2511.06396.pdf

Authors: Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

Title: Efficient LLM Safety Evaluation through Multi-Agent Debate

Abstract:
Safety evaluation of large language models (LLMs) increasingly relies on LLM‑as‑a‑judge pipelines, but strong judges can still be expensive to use at scale. We study whether structured multi‑agent debate can improve judge reliability while keeping backbone size and cost modest. To do so, we introduce HAJailBench, a human‑annotated jailbreak benchmark with 11,100 labeled interactions spanning diverse attack methods and target models, and we pair it with a Multi‑Agent Judge framework in which critic, defender, and judge agents debate under a shared safety rubric. On HAJailBench, the framework improves over matched small‑model prompt baselines and prior multi‑agent judges, while remaining more economical than GPT‑4o under the evaluated pricing snapshot. Ablation results further show that a small number of debate rounds is sufficient to capture most of the gain. Together, these results support structured, value‑aligned debate as a practical design for scalable LLM safety evaluation.

PaperID: 7516, https://arxiv.org/pdf/2511.06345.pdf

Authors: Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

Title: PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Abstract:
Designing high‑performance kernels requires expert‑level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low‑level performance bottlenecks. In this paper, we introduce PRAGMA, a profile‑guided AI kernel generation framework that integrates execution feedback and fine‑grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms baseline AIKG without profiling enabled and achieves 2.81× and 2.30× averaged speedups against Torch on CPU and GPU platforms, respectively.

PaperID: 7517, https://arxiv.org/pdf/2511.06309.pdf

Authors: Stephen Chung, Wenyu Du

Title: The Station: An Open-World Environment for AI-Driven Discovery

Abstract:
We introduce the STATION, an open‑world multi‑agent environment for autonomous scientific discovery. The Station simulates a complete scientific ecosystem, where agents can engage in long scientific journeys that include reading papers from peers, formulating hypotheses, collaborating with peers, submitting experiments, and publishing results. Importantly, there is no centralized system coordinating their activities. Utilizing their long context, agents are free to choose their own actions and develop their own narratives within the Station. Experiments demonstrate that AI agents in the Station achieve new state‑of‑the‑art performance on a wide range of benchmarks, spanning mathematics, computational biology, and machine learning, notably surpassing AlphaEvolve in circle packing. A rich tapestry of unscripted narratives emerges, such as agents collaborating and analyzing other works rather than pursuing myopic optimization. From these emergent narratives, novel methods arise organically, such as a new density‑adaptive algorithm for scRNA‑seq batch integration that borrows concepts from another domain. The Station marks a first step towards autonomous scientific discovery driven by emergent behavior in an open‑world environment, representing a new paradigm that moves beyond rigid pipelines.

PaperID: 7518, https://arxiv.org/pdf/2511.06306.pdf

Authors: Yixuan Liu, Yingzhu Liu, Pengcheng You

Title: Coherency Analysis in Nonlinear Heterogeneous Power Networks: A Blended Dynamics Approach

Abstract:
Power system coherency refers to the phenomenon that machines in a power network exhibit similar frequency responses after disturbances, and is foundational for model reduction and control design. Despite abundant empirical observations, the understanding of coherence in complex power networks remains incomplete where the dynamics could be highly heterogeneous, nonlinear, and increasingly affected by persistent disturbances such as renewable energy fluctuations. To bridge this gap, this paper extends the blended dynamics approach, originally rooted in consensus analysis of multi‑agent systems, to develop a novel coherency analysis in power networks. We show that the frequency responses of coherent machines coupled by nonlinear power flow can be approximately represented by the blended dynamics, which is a weighted average of nonlinear heterogeneous nodal dynamics, even under time‑varying disturbances. Specifically, by developing novel bounds on the difference between the trajectories of nodal dynamics and the blended dynamics, we identify two key factors ‑‑ either high network connectivity or small time‑variation rate of disturbances ‑‑ that contribute to coherence. They enable the nodal frequencies to rapidly approach the blended‑dynamics trajectory from arbitrary initial state. Furthermore, they ensure the frequencies closely follow this trajectory in the long term, even when the system does not settle to an equilibrium. These insights contribute to the understanding of power system coherency and are further supported by simulation results.

PaperID: 7519, https://arxiv.org/pdf/2511.06262.pdf

Authors: Siming Zhao, Qi Li

Title: GAIA: A General Agency Interaction Architecture for LLM-Human B2B Negotiation & Screening

Abstract:
Organizations are increasingly exploring delegation of screening and negotiation tasks to AI systems, yet deployment in high‑stakes B2B settings is constrained by governance: preventing unauthorized commitments, ensuring sufficient information before bargaining, and maintaining effective human oversight and auditability. Prior work on large language model negotiation largely emphasizes autonomous bargaining between agents and omits practical needs such as staged information gathering, explicit authorization boundaries, and systematic feedback integration. We propose GAIA, a governance‑first framework for LLM‑human agency in B2B negotiation and screening. GAIA defines three essential roles ‑ Principal (human), Delegate (LLM agent), and Counterparty ‑ with an optional Critic to enhance performance, and organizes interactions through three mechanisms: information‑gated progression that separates screening from negotiation; dual feedback integration that combines AI critique with lightweight human corrections; and authorization boundaries with explicit escalation paths. Our contributions are fourfold: (1) a formal governance framework with three coordinated mechanisms and four safety invariants for delegation with bounded authorization; (2) information‑gated progression via task‑completeness tracking (TCI) and explicit state transitions that separate screening from commitment; (3) dual feedback integration that blends Critic suggestions with human oversight through parallel learning channels; and (4) a hybrid validation blueprint that combines automated protocol metrics with human judgment of outcomes and safety. By bridging theory and practice, GAIA offers a reproducible specification for safe, efficient, and accountable AI delegation that can be instantiated across procurement, real estate, and staffing workflows.

PaperID: 7520, https://arxiv.org/pdf/2511.06260.pdf

Authors: Hanlin Sun, Jiayang Li

Title: LLM-Guided Reinforcement Learning with Representative Agents for Traffic Modeling

Abstract:
Large language models (LLMs) are increasingly used as behavioral proxies for self‑interested travelers in agent‑based traffic models. Although more flexible and generalizable than conventional models, the practical use of these approaches remains limited by scalability due to the cost of calling one LLM for every traveler. Moreover, it has been found that LLM agents often make opaque choices and produce unstable day‑to‑day dynamics. To address these challenges, we propose to model each homogeneous traveler group facing the same decision context with a single representative LLM agent who behaves like the population's average, maintaining and updating a mixed strategy over routes that coincides with the group's aggregate flow proportions. Each day, the LLM reviews the travel experience and flags routes with positive reinforcement that they hope to use more often, and an interpretable update rule then converts this judgment into strategy adjustments using a tunable (progressively decaying) step size. The representative‑agent design improves scalability, while the separation of reasoning from updating clarifies the decision logic while stabilizing learning. In classic traffic assignment settings, we find that the proposed approach converges rapidly to the user equilibrium. In richer settings with income heterogeneity, multi‑criteria costs, and multi‑modal choices, the generated dynamics remain stable and interpretable, reproducing plausible behavioral patterns well‑documented in psychology and economics, for example, the decoy effect in toll versus non‑toll road selection, and higher willingness‑to‑pay for convenience among higher‑income travelers when choosing between driving, transit, and park‑and‑ride options.

PaperID: 7521, https://arxiv.org/pdf/2511.06209.pdf

Authors: Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan

Title: ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

Abstract:
LLMs can solve complex tasks by generating long, multi‑step reasoning chains. Test‑time scaling (TTS) can further improve performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and selecting the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive and require large‑scale human or model‑generated annotations. We propose a lightweight alternative for step‑level reasoning verification based on probing the internal states of LLMs. We train a transformer‑based probe that uses the internal states of a frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be provided either by a larger LLM (e.g., DeepSeek‑R1) or in a self‑supervised manner by the original model itself. The probes are lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or exceed the performance of PRMs that are up to 810x larger. These results suggest that LLM internal states encode confidence in their reasoning processes and can serve as reliable signals for step verification, offering a promising path toward scalable, generalizable TTS and more introspective LLMs.

PaperID: 7522, https://arxiv.org/pdf/2511.06142.pdf

Authors: Sizhe Tang, Jiayu Chen, Tian Lan

Title: MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

Abstract:
Monte Carlo Tree Search (MCTS), which leverages Upper Confidence Bound for Trees (UCTs) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi‑agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low‑dimensional representational structures on joint‑action returns and enable efficient MCTS in complex multi‑agent planning. Our solution can be viewed as projecting the joint‑action returns into the low‑dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and μ‑smooth loss functions ‑‑ in order to place more importance on better joint actions and mitigate potential representational limitations ‑‑ and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi‑agent exploration and exploitation in the low‑dimensional space. We analyze the regret of MALinZero for low‑dimensional reward functions and propose an (1‑\tfrac1e)‑approximation algorithm for the joint action selection by maximizing a sub‑modular objective. MALinZero demonstrates state‑of‑the‑art performance on multi‑agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model‑based and model‑free multi‑agent reinforcement learning baselines with faster learning speed and better performance.

PaperID: 7523, https://arxiv.org/pdf/2511.06134.pdf

Authors: Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, Jesse Thomason

Title: Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Abstract:
Multi‑agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi‑Agent Exploration‑Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision‑focused policy gradients with a list‑wise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem‑solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state‑of‑the‑art multi‑agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.

PaperID: 7524, https://arxiv.org/pdf/2511.06078.pdf

Authors: Luis Marquez-Carpintero, Alberto Lopez-Sellers, Miguel Cazorla

Title: Simulating Students with Large Language Models: A Review of Architecture, Mechanisms, and Role Modelling in Education with Generative AI

Abstract:
Simulated Students offer a valuable methodological framework for evaluating pedagogical approaches and modelling diverse learner profiles, tasks which are otherwise challenging to undertake systematically in real‑world settings. Recent research has increasingly focused on developing such simulated agents to capture a range of learning styles, cognitive development pathways, and social behaviours. Among contemporary simulation techniques, the integration of large language models (LLMs) into educational research has emerged as a particularly versatile and scalable paradigm. LLMs afford a high degree of linguistic realism and behavioural adaptability, enabling agents to approximate cognitive processes and engage in contextually appropriate pedagogical dialogues. This paper presents a thematic review of empirical and methodological studies utilising LLMs to simulate student behaviour across educational environments. We synthesise current evidence on the capacity of LLM‑based agents to emulate learner archetypes, respond to instructional inputs, and interact within multi‑agent classroom scenarios. Furthermore, we examine the implications of such systems for curriculum development, instructional evaluation, and teacher training. While LLMs surpass rule‑based systems in natural language generation and situational flexibility, ongoing concerns persist regarding algorithmic bias, evaluation reliability, and alignment with educational objectives. The review identifies existing technological and methodological gaps and proposes future research directions for integrating generative AI into adaptive learning systems and instructional design.

PaperID: 7525, https://arxiv.org/pdf/2511.06036.pdf

Authors: Hassan Hizeh, Rim Chighri, Muhammad Mahboob Ur Rahman, Mohamed A. Bahloul, Ali Muqaibel, Tareq Y. Al-Naffouri

Title: Towards Human-AI-Robot Collaboration and AI-Agent based Digital Twins for Parkinson's Disease Management: Review and Outlook

Abstract:
The current body of research on Parkinson's disease (PD) screening, monitoring, and management has evolved along two largely independent trajectories. The first research community focuses on multimodal sensing of PD‑related biomarkers using noninvasive technologies such as inertial measurement units (IMUs), force/pressure insoles, electromyography (EMG), electroencephalography (EEG), speech and acoustic analysis, and RGB/RGB‑D motion capture systems. These studies emphasize data acquisition, feature extraction, and machine learning‑based classification for PD screening, diagnosis, and disease progression modeling. In parallel, a second research community has concentrated on robotic intervention and rehabilitation, employing socially assistive robots (SARs), robot‑assisted rehabilitation (RAR) systems, and virtual reality (VR)‑integrated robotic platforms for improving motor and cognitive function, enhancing social engagement, and supporting caregivers. Despite the complementary goals of these two domains, their methodological and technological integration remains limited, with minimal data‑level or decision‑level coupling between the two. With the advent of advanced artificial intelligence (AI), including large language models (LLMs), agentic AI systems, a unique opportunity now exists to unify these research streams. We envision a closed‑loop sensor‑AI‑robot framework in which multimodal sensing continuously guides the interaction between the patient, caregiver, humanoid robot (and physician) through AI agents that are powered by a multitude of AI models such as robotic and wearables foundation models, LLM‑based reasoning, reinforcement learning, and continual learning. Such closed‑loop system enables personalized, explainable, and context‑aware intervention, forming the basis for digital twin of the PD patient that can adapt over time to deliver intelligent, patient‑centered PD care.

PaperID: 7526, https://arxiv.org/pdf/2511.05972.pdf

Authors: Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Dusit Niyato, Jiawen Kang, Sumei Sun, Abbas Jamalipour, Ping Zhang

Title: DWM-RO: Decentralized World Models with Reasoning Offloading for SWIPT-enabled Satellite-Terrestrial HetNets

Abstract:
Wireless networks are undergoing a paradigm shift toward massive connectivity with energy‑efficient operation, driving the integration of satellite‑terrestrial architectures with simultaneous wireless information and power transfer (SWIPT). Optimizing transmit beamforming and power splitting in such systems faces formidable challenges, e.g., time‑varying channels and multi‑tier interference, which create a complex decision landscape where conventional model‑free multi‑agent reinforcement learning (MARL) suffers from sample inefficiency due to rarely‑encountered state transitions and poor coordination as decentralized agents act independently. This paper proposes the Decentralized World Model with Reasoning Offloading (DWM‑RO) framework to address these fundamental limitations. Specifically, each agent employs a world model to learn compact predictive representations of environment dynamics, enabling imagination‑based policy training that dramatically reduces required environment interactions. An uncertainty‑aware offloading gate monitors local interference levels and model reconstruction errors to trigger selective edge coordination. When activated, a lightweight latent decorrelation mechanism at the edge refines agents' strategic representations, guiding them toward orthogonal actions that minimize resource conflicts. Extensive simulations demonstrate that DWM‑RO converges 5 times faster than state‑of‑the‑art baselines while achieving 34.7% higher spectral efficiency and reducing constraint violations by 40%. In dense network scenarios with 10 users, DWM‑RO maintains violation rates below 20% while baselines exceed 70%, validating superior robustness.

PaperID: 7527, https://arxiv.org/pdf/2511.05931.pdf

Authors: Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou

Title: Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement

Abstract:
Large language model (LLM) based agents are increasingly used to tackle software engineering tasks that require multi‑step reasoning and code modification, demonstrating promising yet limited performance. However, most existing LLM agents typically operate within static execution frameworks, lacking a principled mechanism to learn and self‑improve from their own experience and past rollouts. As a result, their performance remains bounded by the initial framework design and the underlying LLM's capabilities. We propose Self‑Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions and refine their behavior through self‑abstraction. After an initial rollout, the agent induces a concise plan abstraction from its grounded experience, distilling key steps, dependencies, and constraints. This learned abstraction is then fed back as contextual guidance, refining the agent's policy and supporting more structured, informed subsequent executions. Empirically, SAGE delivers consistent performance gains across diverse LLM backbones and agent architectures. Notably, it yields a 7.2% relative performance improvement over the strong Mini‑SWE‑Agent baseline when paired with the GPT‑5 (high) backbone. SAGE further achieves strong overall performance on SWE‑Bench Verified benchmark, reaching 73.2% and 74% Pass@1 resolve rates with the Mini‑SWE‑Agent and OpenHands CodeAct agent framework, respectively.

PaperID: 7528, https://arxiv.org/pdf/2511.05900.pdf

Authors: Ruoyu Lin, Gennaro Notomista, Magnus Egerstedt

Title: Disentangled Control of Multi-Agent Systems

Abstract:
This paper develops a general framework for multi‑agent control synthesis, which applies to a wide range of problems with convergence guarantees, including those with time‑varying objective functions. The proposed framework achieves decentralization without inducing dynamical coupling among agents, and it naturally supports multi‑objective robotics and real‑time implementation. To demonstrate its generality and effectiveness, the framework is applied to solve three representative problems, namely time‑varying leader‑follower formation control, decentralized coverage control for time‑varying density functions without approximations, which is a long‑standing open problem, and safe formation navigation in a dense environment.

PaperID: 7529, https://arxiv.org/pdf/2511.05838.pdf

Authors: Laasya Koduru, Sylee Beltiukov, Jaber Daneshamooz, Eugene Vuong, Arpit Gupta, Elizabeth Belding, Tejas N. Narechania

Title: Enabling Data-Driven Policymaking Using Broadband-Plan Querying Tool (BQT+)

Abstract:
Poor broadband access undermines civic and economic life, a challenge exacerbated by the fact that millions of Americans still lack reliable high‑speed connectivity. Federal broadband funding initiatives aim to address these gaps, but their success depends on accurate availability and affordability data. Existing data, often based on self‑reported ISP information, can overstate coverage and speeds, risking misallocation of funds and leaving unserved communities behind. We present BQT+, an AI‑agent data collection platform that queries ISP web interfaces by inputting residential street addresses and extracting data on service availability, quality, and pricing. BQT+ has been used in policy evaluation studies, including an independent assessment of broadband availability, speed tiers, and affordability in areas targeted by the 42.45 billion BEAD program.

PaperID: 7530, https://arxiv.org/pdf/2511.05804.pdf

Authors: Valentin Noël

Title: Catching Contamination Before Generation: Spectral Kill Switches for Agents

Abstract:
Agentic language models compose multi step reasoning chains, yet intermediate steps can be corrupted by inconsistent context, retrieval errors, or adversarial inputs, which makes post hoc evaluation too late because errors propagate before detection. We introduce a diagnostic that requires no additional training and uses only the forward pass to emit a binary accept or reject signal during agent execution. The method analyzes token graphs induced by attention and computes two spectral statistics in early layers, namely the high frequency energy ratio and spectral entropy. We formalize these signals, establish invariances, and provide finite sample estimators with uncertainty quantification. Under a two regime mixture assumption with a monotone likelihood ratio property, we show that a single threshold on the high frequency energy ratio is optimal in the Bayes sense for detecting context inconsistency. Empirically, the high frequency energy ratio exhibits robust bimodality during context verification across multiple model families, which enables gating decisions with overhead below one millisecond on our hardware and configurations. We demonstrate integration into retrieval augmented agent pipelines and discuss deployment as an inline safety monitor. The approach detects contamination while the model is still processing the text, before errors commit to the reasoning chain.

PaperID: 7531, https://arxiv.org/pdf/2511.05789.pdf

Authors: Shanhao Zhan, Zhang Liu, Lianfen Huang, Shaowei Shen, Ziyang Bai, Zhibin Gao, Dusit Niyato

Title: Digital Twin-Assisted Task Offloading and Resource Allocation in ISAC-Enabled Internet of Vehicles

Abstract:
The convergence of the Internet of vehicles (IoV) and 6G networks is driving the evolution of next‑generation intelligent transportation systems. However, IoV networks face persistent challenges, including low spectral efficiency in vehicular communications, difficulty in achieving dynamic and adaptive resource optimization, and the need for long‑term stability under highly dynamic environments. In this paper, we study the problem of digital twin (DT)‑assisted task offloading and resource allocation in integrated sensing and communication (ISAC)‑enabled IoV networks. The objective is to minimize the long‑term average system cost, defined as a weighted combination of delay and energy consumption, while ensuring queue stability over time. To address this, we employ an ISAC‑enabled design and introduce two transmission modes (i.e., raw data transmission (DataT) and instruction transmission (InstrT)). The InstrT mode enables instruction‑level transmission, thereby reducing data volume and improving spectral efficiency. We then employ Lyapunov optimization to decompose the long‑term stochastic problem into per‑slot deterministic problems, ensuring long‑term queue stability. Building upon this, we propose a Lyapunov‑driven DT‑enhanced multi‑agent proximal policy optimization (Ly‑DTMPPO) algorithm, which leverages DT for global state awareness and intelligent decision‑making within a centralized training and decentralized execution (CTDE) architecture. Extensive simulations verify that Ly‑DTMPPO achieves superior performance compared with existing benchmarks.

PaperID: 7532, https://arxiv.org/pdf/2511.05715.pdf

Authors: Roee M. Francos, Daniel Garces, Orhan Eren Akgün, Stephanie Gil

Title: STAIR: Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making

Abstract:
A major limitation of existing routing algorithms for multi‑agent systems is that they are designed without considering the potential presence of adversarial agents in the decision‑making loop, which could lead to severe performance degradation in real‑life applications where adversarial agents may be present. We study autonomous pickup‑and‑delivery routing problems in which adversarial agents launch coordinated denial‑of‑service attacks by spoofing their locations. This deception causes the central scheduler to assign pickup requests to adversarial agents instead of cooperative agents. Adversarial agents then choose not to service the requests with the goal of disrupting the operation of the system, leading to delays, cancellations, and potential instability in the routing policy. Policy stability in routing problems is typically defined as the cost of the policy being uniformly bounded over time, and it has been studied through two different lenses: queuing theory and reinforcement learning (RL), which are not well suited for routing with adversaries. In this paper, we propose a new stability criterion, STAIR, which is easier to analyze than queuing‑theory‑based stability in adversarial settings. Furthermore, STAIR does not depend on a chosen discount factor as is the case in discounted RL stability. STAIR directly links stability to desired operational metrics, like a finite number of rejected requests. This characterization is particularly useful in adversarial settings as it provides a metric for monitoring the effect of adversaries in the operation of the system. Furthermore, we demonstrate STAIR's practical relevance through simulations on real‑world San Francisco mobility‑on‑demand data. We also identify a phenomenon of degenerate stability that arises in the adversarial routing problem, and we introduce time‑window constraints in the decision‑making algorithm to mitigate it.

PaperID: 7533, https://arxiv.org/pdf/2511.05706.pdf

Authors: Wendan Jiang, Shiyuan Wang, Hiba Eltigani, Rukhshan Haroon, Abdullah Bin Faisal, Fahad Dogar

Title: AdvisingWise: Supporting Academic Advising in Higher Education Settings Through a Human-in-the-Loop Multi-Agent Framework

Abstract:
Academic advising is critical to student success in higher education, yet high student‑to‑advisor ratios limit advisors' capacity to provide timely support, particularly during peak periods. Recent advances in Large Language Models (LLMs) present opportunities to enhance the advising process. We present AdvisingWise, a multi‑agent system that automates time‑consuming tasks, such as information retrieval and response drafting, while preserving human oversight. AdvisingWise leverages authoritative institutional resources and adaptively prompts students about their academic backgrounds to generate reliable, personalized responses. All system responses undergo human advisor validation before delivery to students. We evaluate AdvisingWise through a mixed‑methods approach: (1) expert evaluation on responses of 20 sample queries, (2) LLM‑as‑a‑judge evaluation of the information retrieval strategy, and (3) a user study with 8 academic advisors to assess the system's practical utility. Our evaluation shows that AdvisingWise produces accurate, personalized responses. Advisors reported increasingly positive perceptions after using AdvisingWise, as their initial concerns about reliability and personalization diminished. We conclude by discussing the implications of human‑AI synergy on the practice of academic advising.

PaperID: 7534, https://arxiv.org/pdf/2511.05696.pdf

Authors: Jacob T. Rosenthal, Emma Hahesy, Sulov Chalise, Menglei Zhu, Mert R. Sabuncu, Lior Z. Braunstein, Anyi Li

Title: AI-assisted workflow enables rapid, high-fidelity breast cancer clinical trial eligibility prescreening

Abstract:
Clinical trials play an important role in cancer care and research, yet participation rates remain low. We developed MSK‑MATCH (Memorial Sloan Kettering Multi‑Agent Trial Coordination Hub), an AI system for automated eligibility screening from clinical text. MSK‑MATCH integrates a large language model with a curated oncology trial knowledge base and retrieval‑augmented architecture providing explanations for all AI predictions grounded in source text. In a retrospective dataset of 88,518 clinical documents from 731 patients across six breast cancer trials, MSK‑MATCH automatically resolved 61.9% of cases and triaged 38.1% for human review. This AI‑assisted workflow achieved 98.6% accuracy, 98.4% sensitivity, and 98.7% specificity for patient‑level eligibility classification, matching or exceeding performance of the human‑only and AI‑only comparisons. For the triaged cases requiring manual review, prepopulating eligibility screens with AI‑generated explanations reduced screening time from 20 minutes to 43 seconds at an average cost of 0.96 per patient‑trial pair.

PaperID: 7535, https://arxiv.org/pdf/2511.05580.pdf

Authors: Bryce-Allen Bagley, Navin Khoshnan

Title: Approximating the Mathematical Structure of Psychodynamics

Abstract:
The complexity of human cognition has meant that psychology makes more use of theory and conceptual models than perhaps any other biomedical field. To enable precise quantitative study of the full breadth of phenomena in psychological and psychiatric medicine as well as cognitive aspects of AI safety, there is a need for a mathematical formulation which is both mathematically precise and equally accessible to experts from numerous fields. In this paper we formalize human psychodynamics via the diagrammatic framework of process theory, describe its key properties, and explain the links between a diagrammatic representation and central concepts in analysis of cognitive processes in contexts such as psychotherapy, neurotechnology, AI alignment, AI agent representation of individuals in autonomous negotiations, developing human‑like AI systems, and other aspects of AI safety.

PaperID: 7536, https://arxiv.org/pdf/2511.05528.pdf

Authors: Aayush Aluru, Myra Malik, Samarth Patankar, Spencer Kim, Kevin Zhu, Sean O'Brien, Vasu Sharma

Title: SMAGDi: Socratic Multi Agent Interaction Graph Distillation for Efficient High Accuracy Reasoning

Abstract:
Multi‑agent systems (MAS) often achieve higher reasoning accuracy than single models, but their reliance on repeated debates across agents makes them computationally expensive. We introduce SMAGDi, a distillation framework that transfers the debate dynamics of a five‑agent Llama‑based MAS into a compact Socratic decomposer‑solver student. SMAGDi represents debate traces as directed interaction graphs, where nodes encode intermediate reasoning steps with correctness labels and edges capture continuity and cross‑agent influence. The student is trained with a composite objective combining language modeling, graph‑based supervision, contrastive reasoning, and embedding alignment to preserve both fluency and structured reasoning. On StrategyQA and MMLU, SMAGDi compresses a 40B multi‑agent system into a 6B student while retaining 88% of its accuracy, substantially outperforming prior distillation methods such as MAGDi, standard KD, and fine‑tuned baselines. These results highlight that explicitly modeling interaction graphs and Socratic decomposition enable small models to inherit the accuracy benefits of multi‑agent debate while remaining efficient enough for real‑world deployment.

PaperID: 7537, https://arxiv.org/pdf/2511.05410.pdf

Authors: Justin D. Weisz, Michael Muller, Kush R. Varshney

Title: Story Arena: A Multi-Agent Environment for Envisioning the Future of Software Engineering

Abstract:
What better way to understand the impact of AI on software engineering than to ask AI itself? We constructed Story Arena, a multi‑agent "writer's room" in which multiple AI agents, independently imbued with a position statement on the future of software engineering, converse with each other to develop a shared vision. They then use this shared vision to collaboratively construct a design fiction that depicts this vision in narrative form. We present "The Code of Trust," a short fiction that investigates themes of human comprehension, trust, content ownership, augmentation vs. replacement, and uncertain futures in human‑AI co‑creation.

PaperID: 7538, https://arxiv.org/pdf/2511.05375.pdf

Authors: Sijie Yang, Jiatong Li, Filip Biljecki

Title: Reasoning Is All You Need for Urban Planning AI

Abstract:
AI has proven highly successful at urban planning analysis ‑‑ learning patterns from data to predict future conditions. The next frontier is AI‑assisted decision‑making: agents that recommend sites, allocate resources, and evaluate trade‑offs while reasoning transparently about constraints and stakeholder values. Recent breakthroughs in reasoning AI ‑‑ CoT prompting, ReAct, and multi‑agent collaboration frameworks ‑‑ now make this vision achievable. This position paper presents the Agentic Urban Planning AI Framework for reasoning‑capable planning agents that integrates three cognitive layers (Perception, Foundation, Reasoning) with six logic components (Analysis, Generation, Verification, Evaluation, Collaboration, Decision) through a multi‑agents collaboration framework. We demonstrate why planning decisions require explicit reasoning capabilities that are value‑based (applying normative principles), rule‑grounded (guaranteeing constraint satisfaction), and explainable (generating transparent justifications) ‑‑ requirements that statistical learning alone cannot fulfill. We compare reasoning agents with statistical learning, present a comprehensive architecture with benchmark evaluation metrics, and outline critical research challenges. This framework shows how AI agents can augment human planners by systematically exploring solution spaces, verifying regulatory compliance, and deliberating over trade‑offs transparently ‑‑ not replacing human judgment but amplifying it with computational reasoning capabilities.

PaperID: 7539, https://arxiv.org/pdf/2511.05359.pdf

Authors: Amr Gomaa, Ahmed Salem, Sahar Abdelnabi

Title: ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Abstract:
As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi‑agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent‑agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single‑agent settings, it models autonomous, multi‑turn agent‑to‑agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three‑tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state‑of‑the‑art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi‑agent contexts, ConVerse reframes safety as an emergent property of communication.

PaperID: 7540, https://arxiv.org/pdf/2511.05311.pdf

Authors: Valeriu Dimidov, Faisal Hawlader, Sasan Jafarnejad, Raphaël Frank

Title: Cleaning Maintenance Logs with LLM Agents for Improved Predictive Maintenance

Abstract:
Economic constraints, limited availability of datasets for reproducibility and shortages of specialized expertise have long been recognized as key challenges to the adoption and advancement of predictive maintenance (PdM) in the automotive sector. Recent progress in large language models (LLMs) presents an opportunity to overcome these barriers and speed up the transition of PdM from research to industrial practice. Under these conditions, we explore the potential of LLM‑based agents to support PdM cleaning pipelines. Specifically, we focus on maintenance logs, a critical data source for training well‑performing machine learning (ML) models, but one often affected by errors such as typos, missing fields, near‑duplicate entries, and incorrect dates. We evaluate LLM agents on cleaning tasks involving six distinct types of noise. Our findings show that LLMs are effective at handling generic cleaning tasks and offer a promising foundation for future industrial applications. While domain‑specific errors remain challenging, these results highlight the potential for further improvements through specialized training and enhanced agentic capabilities.

PaperID: 7541, https://arxiv.org/pdf/2511.05269.pdf

Authors: Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, Tanuja Ganu

Title: TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems

Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision‑making abilities, leading to their widespread adoption across diverse tasks. As task complexity grows, multi‑agent LLM systems are increasingly used to solve problems collaboratively. However, safety and security of these systems remains largely under‑explored. Existing benchmarks and datasets predominantly focus on single‑agent settings, failing to capture the unique vulnerabilities of multi‑agent dynamics and co‑ordination. To address this gap, we introduce Threats and Attacks in Multi‑Agent Systems (TAMAS), a benchmark designed to evaluate the robustness and safety of multi‑agent LLM systems. TAMAS includes five distinct scenarios comprising 300 adversarial instances across six attack types and 211 tools, along with 100 harmless tasks. We assess system performance across ten backbone LLMs and three agent interaction configurations from Autogen and CrewAI frameworks, highlighting critical challenges and failure modes in current multi‑agent deployments. Furthermore, we introduce Effective Robustness Score (ERS) to assess the tradeoff between safety and task effectiveness of these frameworks. Our findings show that multi‑agent systems are highly vulnerable to adversarial attacks, underscoring the urgent need for stronger defenses. TAMAS provides a foundation for systematically studying and improving the safety of multi‑agent LLM systems.

PaperID: 7542, https://arxiv.org/pdf/2511.05207.pdf

Authors: Ryuji Hashimoto, Ryosuke Takata, Masahiro Suzuki, Yuki Tanaka, Kiyoshi Izumi

Title: Emergence from Emergence: Financial Market Simulation via Learning with Heterogeneous Preferences

Abstract:
Agent‑based models help explain stock price dynamics as emergent phenomena driven by interacting investors. In this modeling tradition, investor behavior has typically been captured by two distinct mechanisms ‑‑ learning and heterogeneous preferences ‑‑ which have been explored as separate paradigms in prior studies. However, the impact of their joint modeling on the resulting collective dynamics remains largely unexplored. We develop a multi‑agent reinforcement learning framework in which agents endowed with heterogeneous risk aversion, time discounting, and information access collectively learn trading strategies within a unified shared‑policy framework. The experiment reveals that (i) learning with heterogeneous preferences drives agents to develop strategies aligned with their individual traits, fostering behavioral differentiation and niche specialization within the market, and (ii) the interactions by the differentiated agents are essential for the emergence of realistic market dynamics such as fat‑tailed price fluctuations and volatility clustering. This study presents a constructive paradigm for financial market modeling in which the joint design of heterogeneous preferences and learning mechanisms enables two‑stage emergence: individual behavior and the collective market dynamics.

PaperID: 7543, https://arxiv.org/pdf/2511.05119.pdf

Authors: Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Francesco Balassone, María Sanz-Gómez, Cristóbal Ricardo Veas Chávez, Maite del Mundo de Torres

Title: Cybersecurity AI in OT: Insights from an AI Top-10 Ranker in the Dragos OT CTF 2025

Abstract:
Operational Technology (OT) cybersecurity increasingly relies on rapid response across malware analysis, network forensics, and reverse engineering disciplines. We examine the performance of Cybersecurity AI (CAI), powered by the \textttalias1 model, during the Dragos OT CTF 2025 ‑‑ a 48‑hour industrial control system (ICS) competition with more than 1,000 teams. Using CAI telemetry and official leaderboard data, we quantify CAI's trajectory relative to the leading human‑operated teams. CAI reached Rank~1 between competition hours 7.0 and 8.0, crossed 10,000 points at 5.42~hours (1,846~pts/h), and completed 32 of the competition's 34 challenges before automated operations were paused at hour~24 with a final score of 18,900 points (6th place). The top‑3 human teams solved 33 of 34 challenges, collectively leaving only the 600‑point ``Kiddy Tags ‑‑ 1'' unsolved; they were also the only teams to clear the 1,000‑point ``Moot Force'' binary. The top‑5 human teams averaged 1,347~pts/h to the same milestone, marking a 37% velocity advantage for CAI. We analyse time‑resolved scoring, category coverage, and solve cadence. The evidence indicates that a mission‑configured AI agent can match or exceed expert human crews in early‑phase OT incident response while remaining subject to practical limits in sustained, multi‑day operations.

PaperID: 7544, https://arxiv.org/pdf/2511.05026.pdf

Authors: Xingyuan Zhou, Peter Paik, S. Farokh Atashzar

Title: Tunable Passivity Control for Centralized Multiport Networked Systems

Abstract:
Centralized Multiport Networked Dynamic (CMND) systems have emerged as a key architecture with applications in several complex network systems, such as multilateral telerobotics and multi‑agent control. These systems consist of a hub node/subsystem connecting with multiple remote nodes/subsystems via a networked architecture. One challenge for this system is stability, which can be affected by non‑ideal network artifacts. Conventional passivity‑based approaches can stabilize the system under specialized applications like small‑scale networked systems. However, those conventional passive stabilizers have several restrictions, such as distributing compensation across subsystems in a decentralized manner, limiting flexibility, and, at the same time, relying on the restrictive assumptions of node passivity. This paper synthesizes a centralized optimal passivity‑based stabilization framework for CMND systems. It consists of a centralized passivity observer monitoring overall energy flow and an optimal passivity controller that distributes the just‑needed dissipation among various nodes, guaranteeing strict passivity and, thus, L2 stability. The proposed data‑driven model‑free approach, i.e., Tunable Centralized Optimal Passivity Control (TCoPC), optimizes total performance based on the prescribed dissipation distribution strategy while ensuring stability. The controller can put high dissipation loads on some sub‑networks while relaxing the dissipation on other nodes. Simulation results demonstrate the proposed frameworks performance in a complex task under different time‑varying delay scenarios while relaxing the remote nodes minimum phase and passivity assumption, enhancing the scalability and generalizability.

PaperID: 7545, https://arxiv.org/pdf/2511.05005.pdf

Authors: Dongsu Lee, Daehee Lee, Amy Zhang

Title: Multi-agent Coordination via Flow Matching

Abstract:
This work presents MAC‑Flow, a simple yet expressive framework for multi‑agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion‑based solutions capture complex coordination but are computationally slow, while Gaussian policy‑based solutions are fast but brittle in handling multi‑agent interaction. MAC‑Flow addresses this trade‑off by first learning a flow‑based representation of joint behaviors, and then distilling it into decentralized one‑step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including 12 environments and 34 datasets, MAC‑Flow alleviates the trade‑off between performance and computational cost, specifically achieving about \boldsymbol×14.5 faster inference compared to diffusion‑based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy‑based offline multi‑agent reinforcement learning (MARL) methods.

PaperID: 7546, https://arxiv.org/pdf/2511.04949.pdf

Authors: Tharindu Fernando, Clinton Fookes, Sridha Sridharan

Title: DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Abstract:
Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high‑quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high‑dimensional latent space representations and the Multi‑Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high‑level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA‑HQ benchmarks reveal that our method consistently outperforms state‑of‑the‑art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA‑HQ under challenging manipulation scenarios.

PaperID: 7547, https://arxiv.org/pdf/2511.04921.pdf

Authors: Yu Li, Lehui Li, Lin Chen, Qingmin Liao, Fengli Xu, Yong Li

Title: AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Abstract:
Large language model agents are becoming increasingly capable at web‑centric tasks such as information retrieval, complex reasoning. These emerging capabilities have given rise to surge research interests in developing LLM agent for facilitating scientific quest. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases model toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data‑collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self‑descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning‑augmented reranker that exact interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20, +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.

PaperID: 7548, https://arxiv.org/pdf/2511.04904.pdf

Authors: Bassel Al Omari, Michael Matthews, Alexander Rutherford, Jakob Nicolaus Foerster

Title: Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale

Abstract:
Progress in multi‑agent reinforcement learning (MARL) requires challenging benchmarks that assess the limits of current methods. However, existing benchmarks often target narrow short‑horizon challenges that do not adequately stress the long‑term dependencies and generalization capabilities inherent in many multi‑agent systems. To address this, we first present Craftax‑MA: an extension of the popular open‑ended RL environment, Craftax, that supports multiple agents and evaluates a wide range of general abilities within a single environment. Written in JAX, Craftax‑MA is exceptionally fast with a training run using 250 million environment interactions completing in under an hour. To provide a more compelling challenge for MARL, we also present Craftax‑Coop, an extension introducing heterogeneous agents, trading and more mechanics that require complex cooperation among agents for success. We provide analysis demonstrating that existing algorithms struggle with key challenges in this benchmark, including long‑horizon credit assignment, exploration and cooperation, and argue for its potential to drive long‑term research in MARL.

PaperID: 7549, https://arxiv.org/pdf/2511.04824.pdf

Authors: Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan

Title: Agentic Refactoring: An Empirical Study of AI Coding Agents

Abstract:
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI‑powered systems function as autonomous teammates capable of planning and executing complex development tasks. Agents have become active participants in refactoring, a cornerstone of sustainable software development aimed at improving internal code quality without altering observable behavior. Despite their increasing adoption, there is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human‑driven refactoring, and what impact it has on code quality. To address this empirical gap, we present a large‑scale study of AI agent‑generated refactorings in real‑world open‑source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,988 commits derived from the AIDev dataset. Our empirical analysis shows that refactoring is a common and intentional activity in this development paradigm, with agents explicitly targeting refactoring in 26.1% of commits. Analysis of refactoring types reveals that agentic efforts are dominated by low‑level, consistency‑oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high‑level design changes common in human refactoring. Additionally, the motivations behind agentic refactoring focus overwhelmingly on internal quality concerns, with maintainability (52.5%) and readability (28.1%). Furthermore, quantitative evaluation of code quality metrics shows that agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium‑level changes, reducing class size and complexity (e.g., Class LOC median Δ = ‑15.25).

PaperID: 7550, https://arxiv.org/pdf/2511.04700.pdf

Authors: Song Wang, Zihan Chen, Peng Wang, Zhepei Wei, Zhen Tan, Yu Meng, Cong Shen, Jundong Li

Title: Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

Abstract:
Retrieval‑augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge sources to address their limitations in accessing up‑to‑date or specialized information. A natural strategy to increase the likelihood of retrieving relevant information is to expand the number of retrieved documents. However, involving more documents could introduce significant noise, as many documents may be irrelevant or misleading, thereby reducing the overall accuracy of the generated responses. To overcome the challenge associated with handling a larger number of documents, we propose WinnowRAG, a novel RAG framework designed to systematically filter out noisy documents while preserving valuable content ‑‑ a process we refer to as winnowing. WinnowRAG operates in two stages: In Stage I, we perform query‑aware clustering to group similar documents and form distinct topic clusters. Each cluster is assigned to an LLM agent for generating a unique answer. In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones. To retain useful documents when discarding agents, we propose two strategic merging techniques to ensure that only relevant knowledge is used for generating the final response. Crucially, WinnowRAG is model‑agnostic and does not require any model fine‑tuning, making it easily adaptable to various tasks. Extensive experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state‑of‑the‑art baselines.

PaperID: 7551, https://arxiv.org/pdf/2511.04662.pdf

Authors: Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala

Title: VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks

Abstract:
LLMs can perform multi‑step reasoning through Chain‑of‑Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high‑stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro‑symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first‑order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference‑time self‑reflection, (2) supervised fine‑tuning (SFT) on VeriCoT‑distilled datasets and (3) preference fine‑tuning (PFT) with direct preference optimization (DPO) using verification‑based pairwise rewards, further improving reasoning validity and accuracy.

PaperID: 7552, https://arxiv.org/pdf/2511.04646.pdf

Authors: Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong

Title: DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

Abstract:
Cooperative multi‑agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi‑agent planning. Cooperation unfolds through a two‑phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step‑level alignment and enables higher‑level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block‑push tasks show that agents adapt across episodes, with the dynamic world model capturing reusable patterns and improving task completion rates and efficiency. Experiments on cooperative block‑push tasks show that our dynamic world model improves task completion and efficiency through negotiation and self‑refinement, trading a time overhead for evolving, more efficient collaboration strategies.

PaperID: 7553, https://arxiv.org/pdf/2511.04594.pdf

Authors: Utkarsh U. Chavan, Prashant Trivedi, Nandyala Hemachandra

Title: Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems

Abstract:
Multi‑agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single‑agent settings, the decentralized multi‑agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi‑agent SSPs (Dec‑MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry‑based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard‑to‑learn instances for any number of agents, n. Our regret lower bound of Ω(\sqrtK), over K episodes, highlights the inherent learning difficulty in Dec‑MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi‑agent systems.

PaperID: 7554, https://arxiv.org/pdf/2511.04464.pdf

Authors: Carnot Braun, Rafael O. Jarczewski, Gabriel U. Talasso, Leandro A. Villas, Allan M. de Souza

Title: Beyond Shortest Path: Agentic Vehicular Routing with Semantic Context

Abstract:
Traditional vehicle routing systems efficiently optimize singular metrics like time or distance, and when considering multiple metrics, they need more processes to optimize . However, they lack the capability to interpret and integrate the complex, semantic, and dynamic contexts of human drivers, such as multi‑step tasks, situational constraints, or urgent needs. This paper introduces and evaluates PAVe (Personalized Agentic Vehicular Routing), a hybrid agentic assistant designed to augment classical pathfinding algorithms with contextual reasoning. Our approach employs a Large Language Model (LLM) agent that operates on a candidate set of routes generated by a multi‑objective (time, CO2) Dijkstra algorithm. The agent evaluates these options against user‑provided tasks, preferences, and avoidance rules by leveraging a pre‑processed geospatial cache of urban Points of Interest (POIs). In a benchmark of realistic urban scenarios, PAVe successfully used complex user intent into appropriate route modifications, achieving over 88% accuracy in its initial route selections with a local model. We conclude that combining classical routing algorithms with an LLM‑based semantic reasoning layer is a robust and effective approach for creating personalized, adaptive, and scalable solutions for urban mobility optimization.

PaperID: 7555, https://arxiv.org/pdf/2511.04427.pdf

Authors: Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, Bogdan Vasilescu

Title: Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects

Abstract:
Large language models (LLMs) have demonstrated the promise to revolutionize the field of software engineering. Among other things, LLM agents are rapidly gaining momentum in software development, with practitioners reporting a multifold increase in productivity after adoption. Yet, empirical evidence is lacking around these claims. In this paper, we estimate the causal effect of adopting a widely popular LLM agent assistant, namely Cursor, on development velocity and software quality. The estimation is enabled by a state‑of‑the‑art difference‑in‑differences design comparing Cursor‑adopting GitHub projects with a matched control group of similar GitHub projects that do not use Cursor. We find that the adoption of Cursor leads to a statistically significant, large, but transient increase in project‑level development velocity, along with a substantial and persistent increase in static analysis warnings and code complexity. Further panel generalized‑method‑of‑moments estimation reveals that increases in static analysis warnings and code complexity are major factors driving long‑term velocity slowdown. Our study identifies quality assurance as a major bottleneck for early Cursor adopters and calls for it to be a first‑class citizen in the design of agentic AI coding tools and AI‑driven workflows.

PaperID: 7556, https://arxiv.org/pdf/2511.04235.pdf

Authors: Zhengru Fang, Yu Guo, Yuang Zhang, Haonan An, Wenbo Ding, Yuguang Fang

Title: Shared Spatial Memory Through Predictive Coding

Abstract:
Constructing a consistent shared spatial memory is a critical challenge in multi‑agent systems, where partial observability and limited bandwidth often lead to catastrophic failures in coordination. We introduce a multi‑agent predictive coding framework that formulates coordination as the minimization of mutual uncertainty among agents. Through an information bottleneck objective, this framework prompts agents to learn not only who and what to communicate but also when. At the foundation of this framework lies a grid‑cell‑like metric as internal spatial coding for self‑localization, emerging spontaneously from self‑supervised motion prediction. Building upon this internal spatial code, agents gradually develop a bandwidth‑efficient communication mechanism and specialized neural populations that encode partners' locations‑an artificial analogue of hippocampal social place cells (SPCs). These social representations are further utilized by a hierarchical reinforcement learning policy that actively explores to reduce joint uncertainty. On the Memory‑Maze benchmark, our approach shows exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full‑broadcast baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically principled and biologically plausible basis for how complex social representations emerge from a unified predictive drive, leading to collective intelligence.

PaperID: 7557, https://arxiv.org/pdf/2511.04184.pdf

Authors: Mohammed Musthafa Rafi, Adarsh Krishnamurthy, Aditya Balu

Title: Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

Abstract:
The proliferation of AI‑generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engage with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift ‑ positioning LLMs as intelligent communication intermediaries that capture the sender's intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI‑generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross‑platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC's deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity ‑ accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility ‑ consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity ‑ reliability of recipient‑facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC's multi‑agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high‑stakes communication scenarios.

PaperID: 7558, https://arxiv.org/pdf/2511.04177.pdf

Authors: Claire Yang, Maya Cakmak, Max Kleiman-Weiner

Title: When Empowerment Disempowers

Abstract:
Empowerment, a measure of an agent's ability to control its environment, has been proposed as a universal goal‑agnostic objective for motivating assistive behavior in AI agents. While multi‑human settings like homes and hospitals are promising for AI assistance, prior work on empowerment‑based assistance assumes that the agent assists one human in isolation. We introduce an open source multi‑human gridworld test suite Disempower‑Grid. Using Disempower‑Grid, we empirically show that assistive RL agents optimizing for one human's empowerment can significantly reduce another human's environmental influence and rewards ‑ a phenomenon we formalize as disempowerment. We characterize when disempowerment occurs in these environments and show that joint empowerment mitigates disempowerment at the cost of the user's reward. Our work reveals a broader challenge for the AI alignment community: goal‑agnostic objectives that seem aligned in single‑agent settings can become misaligned in multi‑agent contexts.

PaperID: 7559, https://arxiv.org/pdf/2511.03985.pdf

Authors: Zhuowen Yuan, Tao Liu, Yang Yang, Yang Wang, Feng Qi, Kaushik Rangadurai, Bo Li, Shuang Yang

Title: ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Abstract:
Recent LLM‑based agents have demonstrated strong capabilities in automated ML engineering. However, they heavily rely on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi‑agent system that integrates architecture generation, proxy‑based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a Monte Carlo Tree Search (MCTS)‑inspired novel algorithm with a restart mechanism and manages memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity‑aware performance metric. This multi‑agent collaboration allows ArchPilot to prioritize high‑potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE‑Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE and ML‑Master, validating the effectiveness of our multi‑agent system.

PaperID: 7560, https://arxiv.org/pdf/2511.03958.pdf

Authors: Kia Karbasi, Kevin Hong, Mohammad Amin Samadi, Gregory Pottie

Title: Multi-Agent Collaborative Framework For Math Problem Generation

Abstract:
Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre‑trained transformer‑based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi‑agent framework as a novel method of incorporating inference‑time computation into AQG. This approach leverages multiple agents that iteratively refine generated question‑answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta‑evaluation criteria: relevance, importance, clarity, difficulty matching, answerability, to assess the system's ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi‑agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi‑agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

PaperID: 7561, https://arxiv.org/pdf/2511.03945.pdf

Authors: Fu-Chun Yang, Jason Eshraghian

Title: Direct Semantic Communication Between Large Language Models via Vector Translation

Abstract:
In multi‑agent settings, such as debate, reflection, or tool‑calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual‑encoder translator trained between Llama‑2‑7B and Mistral‑7B‑Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model's generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general‑purpose models yield more transferable representations than instruction‑tuned variants. This conservative injection preserves computational stability while demonstrating that cross‑model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.

PaperID: 7562, https://arxiv.org/pdf/2511.03925.pdf

Authors: Nikta Akbarpour, Mahdieh Sadat Benis, Fatemeh Hendijani Fard, Ali Ouni, Mohamed Aymen Saied

Title: Collaborative Agents for Automated Program Repair in Ruby

Abstract:
Automated Program Repair (APR) has advanced rapidly with Large Language Models (LLMs), but most existing methods remain computationally expensive, and focused on a small set of languages. Ruby, despite its widespread use in web development and the persistent challenges faced by its developers, has received little attention in APR research. In this paper, we introduce RAMP, a novel lightweight framework that formulates program repair as a feedback‑driven, iterative process for Ruby. RAMP employs a team of collaborative agents that generate targeted tests, reflect on errors, and refine candidate fixes until a correct solution is found. Unlike prior approaches, RAMP is designed to avoid reliance on large multilingual repair databases or costly fine‑tuning, instead operating directly on Ruby through lightweight prompting and test‑driven feedback. Evaluation on the XCodeEval benchmark shows that RAMP achieves a pass@1 of 67% on Ruby, outper‑forming prior approaches. RAMP converges quickly within five iterations, and ablation studies confirm that test generation and self‑reflection are key drivers of its performance. Further analysis shows that RAMP is particularly effective at repairing wrong answers, compilation errors, and runtime errors. Our approach provides new insights into multi‑agent repair strategies, and establishes a foundation for extending LLM‑based debugging tools to under‑studied languages.

PaperID: 7563, https://arxiv.org/pdf/2511.03852.pdf

Authors: Randy Harsuko, Zhengfa Bi, Nori Nakata

Title: GAIA: Geothermal Analytics and Intelligent Agent

Abstract:
Geothermal field development typically involves complex processes that require multi‑disciplinary expertise in each process. Thus, decision‑making often demands the integration of geological, geophysical, reservoir engineering, and operational data under tight time constraints. We present Geothermal Analytics and Intelligent Agent, or GAIA, an AI‑based system for automation and assistance in geothermal field development. GAIA consists of three core components: GAIA Agent, GAIA Chat, and GAIA Digital Twin, or DT, which together constitute an agentic retrieval‑augmented generation (RAG) workflow. Specifically, GAIA Agent, powered by a pre‑trained large language model (LLM), designs and manages task pipelines by autonomously querying knowledge bases and orchestrating multi‑step analyses. GAIA DT encapsulates classical and surrogate physics models, which, combined with built‑in domain‑specific subroutines and visualization tools, enable predictive modeling of geothermal systems. Lastly, GAIA Chat serves as a web‑based interface for users, featuring a ChatGPT‑like layout with additional functionalities such as interactive visualizations, parameter controls, and in‑context document retrieval. To ensure GAIA's specialized capability for handling complex geothermal‑related tasks, we curate a benchmark test set comprising various geothermal‑related use cases, and we rigorously and continuously evaluate the system's performance. We envision GAIA as a pioneering step toward intelligent geothermal field development, capable of assisting human experts in decision‑making, accelerating project workflows, and ultimately enabling automation of the development process.

PaperID: 7564, https://arxiv.org/pdf/2511.03844.pdf

Authors: Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou

Title: ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Abstract:
Optimizing large‑language model (LLM) training on distributed domain‑specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time‑consuming manual tuning or resource‑intensive black‑box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto‑optimize Performance of Large‑Scale LLM Training. It is a multi‑agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments have shown that the ASAP‑generated sharding configurations can contribute up to 28% training step time reduction and 1.43 times throughput improvement. When combined with additional optimization from human experts, throughput can be further increased to 2.58 times. The proposed ASAP promises to provide a scalable and explainable methodology for AI‑assisted performance engineering in large‑scale LLM training.

PaperID: 7565, https://arxiv.org/pdf/2511.03841.pdf

Authors: Yedidel Louck, Ariel Stulman, Amit Dvir

Title: Security Analysis of Agentic AI Communication Protocols: A Comparative Evaluation

Abstract:
Multi‑agent systems (MAS) powered by artificial intelligence (AI) are increasingly foundational to complex, distributed workflows. Yet, the security of their underlying communication protocols remains critically under‑examined. This paper presents the first empirical, comparative security analysis of the official CORAL implementation and a high‑fidelity, SDK‑based ACP implementation, benchmarked against a literature‑based evaluation of A2A. Using a 14 point vulnerability taxonomy, we systematically assess their defenses across authentication, authorization, integrity, confidentiality, and availability. Our results reveal a pronounced security dichotomy: CORAL exhibits a robust architectural design, particularly in its transport‑layer message validation and session isolation, but suffers from critical implementation‑level vulnerabilities, including authentication and authorization failures at its SSE gateway. Conversely, ACP's architectural flexibility, most notably its optional JWS enforcement, translates into high‑impact integrity and confidentiality flaws. We contextualize these findings within current industry trends, highlighting that existing protocols remain insufficiently secure. As a path forward, we recommend a hybrid approach that combines CORAL's integrated architecture with ACP's mandatory per‑message integrity guarantees, laying the groundwork for resilient, next‑generation agent communications.

PaperID: 7566, https://arxiv.org/pdf/2511.03761.pdf

Authors: Umut Çalıkyılmaz, Nitin Nayak, Jinghua Groppe, Sven Groppe

Title: OptiMA: A Transaction-Based Framework with Throughput Optimization for Very Complex Multi-Agent Systems

Abstract:
In recent years, the research of multi‑agent systems has taken a direction to explore larger and more complex models to fulfill sophisticated tasks. We point out two possible pitfalls that might be caused by increasing complexity; susceptibilities to faults, and performance bottlenecks. To prevent the former threat, we propose a transaction‑based framework to design very complex multi‑agent systems (VCMAS). To address the second threat, we offer to integrate transaction scheduling into the proposed framework. We implemented both of these ideas to develop the OptiMA framework and show that it is able to facilitate the execution of VCMAS with more than a hundred agents. We also demonstrate the effect of transaction scheduling on such a system by showing improvements up to more than 16%. Furthermore, we also performed a theoretical analysis on the transaction scheduling problem and provided practical tools that can be used for future research on it.

PaperID: 7567, https://arxiv.org/pdf/2511.03757.pdf

Authors: Xuan Ouyang, Senan Wang, Bouzhou Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li

Title: Laugh, Relate, Engage: Stylized Comment Generation for Short Videos

Abstract:
Short‑video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re‑creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi‑agent system (MAS) designed for controllable short‑video comment generation. The system integrates video segmentation, contextual and affective analysis, and style‑aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine‑grained style control through explicit prompt markers and few‑shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics originality, relevance, and style conformity with a large‑scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short‑video platforms, offering a promising path to enhance user engagement and creative interaction.

PaperID: 7568, https://arxiv.org/pdf/2511.03724.pdf

Authors: Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng

Title: Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Abstract:
AI researchers have long focused on poker‑like games as a testbed for environments characterized by multi‑player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no‑limit Texas hold'em, the multi‑player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced‑format Liar's Poker, a game characterized by extensive multi‑player engagement. We trained Solly using self‑play with a model‑free, actor‑critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads‑up and multi‑player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world‑class human players.

PaperID: 7569, https://arxiv.org/pdf/2511.03697.pdf

Authors: Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen

Title: AnaFlow: Agentic LLM-based Workflow for Reasoning-Driven Explainable and Sample-Efficient Analog Circuit Sizing

Abstract:
Analog/mixed‑signal circuits are key for interfacing electronics with the physical world. Their design, however, remains a largely handcrafted process, resulting in long and error‑prone design cycles. While the recent rise of AI‑based reinforcement learning and generative AI has created new techniques to automate this task, the need for many time‑consuming simulations is a critical bottleneck hindering the overall efficiency. Furthermore, the lack of explainability of the resulting design solutions hampers widespread adoption of the tools. To address these issues, a novel agentic AI framework for sample‑efficient and explainable analog circuit sizing is presented. It employs a multi‑agent workflow where specialized Large Language Model (LLM)‑based agents collaborate to interpret the circuit topology, to understand the design goals, and to iteratively refine the circuit's design parameters towards the target goals with human‑interpretable reasoning. The adaptive simulation strategy creates an intelligent control that yields a high sample efficiency. The AnaFlow framework is demonstrated for two circuits of varying complexity and is able to complete the sizing task fully automatically, differently from pure Bayesian optimization and reinforcement learning approaches. The system learns from its optimization history to avoid past mistakes and to accelerate convergence. The inherent explainability makes this a powerful tool for analog design space exploration and a new paradigm in analog EDA, where AI agents serve as transparent design assistants.

PaperID: 7570, https://arxiv.org/pdf/2511.03628.pdf

Authors: Haofei Yu, Fenghai Li, Jiaxuan You

Title: LiveTradeBench: Seeking Real-World Alpha with Large Language Models

Abstract:
Large language models (LLMs) achieve strong performance across benchmarks‑‑from knowledge quizzes and math reasoning to web‑agent tasks‑‑but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem‑solving rather than decision‑making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real‑time uncertainty; (ii) a portfolio‑management abstraction that extends control from single‑asset actions to multi‑asset allocation, integrating risk management and cross‑asset reasoning; and (iii) multi‑market evaluation across structurally distinct environments‑‑U.S. stocks and Polymarket prediction markets‑‑differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50‑day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real‑world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.

PaperID: 7571, https://arxiv.org/pdf/2511.03591.pdf

Authors: Qingyi Chen, Ruiqi Ni, Jun Kim, Ahmed H. Qureshi

Title: Manifold-constrained Hamilton-Jacobi Reachability Learning for Decentralized Multi-Agent Motion Planning

Abstract:
Safe multi‑agent motion planning (MAMP) under task‑induced constraints is a critical challenge in robotics. Many real‑world scenarios require robots to navigate dynamic environments while adhering to manifold constraints imposed by tasks. For example, service robots must carry cups upright while avoiding collisions with humans or other robots. Despite recent advances in decentralized MAMP for high‑dimensional systems, incorporating manifold constraints remains difficult. To address this, we propose a manifold‑constrained Hamilton‑Jacobi reachability (HJR) learning framework for decentralized MAMP. Our method solves HJR problems under manifold constraints to capture task‑aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. This enables robots to generate motion plans that are both safe and task‑feasible without requiring assumptions about other agents' policies. Our approach generalizes across diverse manifold‑constrained tasks and scales effectively to high‑dimensional multi‑agent manipulation problems. Experiments show that our method outperforms existing constrained motion planners and operates at speeds suitable for real‑world applications. Video demonstrations are available at https://youtu.be/RYcEHMnPTH8 .

PaperID: 7572, https://arxiv.org/pdf/2511.03517.pdf

Authors: Wencheng Ye, Yan Liu

Title: U2F: Encouraging SWE-Agent to Seize Novelty without Losing Feasibility

Abstract:
Large language models (LLMs) have shown strong capabilities in software engineering tasks, yet most existing LLM‑based SWE‑Agents mainly tackle well‑defined problems using conventional methods, often overlooking alternative or innovative solutions beyond their predefined frameworks. This limitation is evident in open‑world software environments, where emerging challenges transcend established paradigms. We propose U2F (Unknown Unknowns to Functional solutions), a cognitive‑inspired, uncertainty‑embracing multi‑agent framework that systematically surfaces "Unknown Unknowns" ‑ novel solution pathways absent from initial formulations but holding innovative potential. U2F consists of two key components: (1) a Discovery‑Exploration‑Integration agent system for uncovering and synthesizing potential solutions, and (2) cognitive enhancement mechanisms across three dimensions: cross‑domain analogical reasoning, reverse thinking, and external validation, which strategically reframe and extend conventional solution boundaries. Applied to 218 real‑world software enabler stories curated from authentic engineering tasks, U2F achieved notable improvements: human experts reported a 14 percent increase in overall novelty, 51 percent improvement in semantic novelty, and stable feasibility (4.02/5.0), corroborated by an LLM‑based evaluator. These results highlight the potential of embracing uncertainty as a catalyst for innovation in software engineering.

PaperID: 7573, https://arxiv.org/pdf/2511.03404.pdf

Authors: Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, Yuanping Guo

Title: Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

Abstract:
In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real‑world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of project‑level code generation, where LLMs are expected to generate complete software projects directly from complex user requirements. Although existing studies have made initial explorations, they still face key limitations, including unrealistic datasets and unreliable evaluation metrics that fail to reflect real‑world complexity, the semantic gap between human‑written requirements and machine‑interpretable structures, and difficulties in managing hierarchical dependencies and maintaining quality throughout the generation process. To address these limitations, we first introduce CodeProjectEval, a project‑level code generation dataset built from 18 real‑world repositories with 12.7 files and 2,388.6 lines of code per task on average, supplemented with documentation and executable test cases for automatic evaluation. We further propose ProjectGen, a multi‑agent framework that decomposes projects into architecture design, skeleton generation, and code filling stages with iterative refinement and memory‑based context management. Within this framework, we introduce the Semantic Software Architecture Tree (SSAT), a structured and semantically rich representation that effectively bridges user requirements and source code implementation. Experiments show that ProjectGen achieves state‑of‑the‑art performance, passing 52/124 test cases on the small‑scale project‑level code generation dataset DevBench, a 57% improvement over the baseline approaches, and 310 test cases on CodeProjectEval, representing an improvement of roughly tenfold compared to the baselines.

PaperID: 7574, https://arxiv.org/pdf/2511.03348.pdf

Authors: Changxi Zhu, Mehdi Dastani, Shihan Wang

Title: Learning Communication Skills in Multi-task Multi-agent Deep Reinforcement Learning

Abstract:
In multi‑agent deep reinforcement learning (MADRL), agents can communicate with one another to perform a task in a coordinated manner. When multiple tasks are involved, agents can also leverage knowledge from one task to improve learning in other tasks. In this paper, we propose Multi‑task Communication Skills (MCS), a MADRL with communication method that learns and performs multiple tasks simultaneously, with agents interacting through learnable communication protocols. MCS employs a Transformer encoder to encode task‑specific observations into a shared message space, capturing shared communication skills among agents. To enhance coordination among agents, we introduce a prediction network that correlates messages with the actions of sender agents in each task. We adapt three multi‑agent benchmark environments to multi‑task settings, where the number of agents as well as the observation and action spaces vary across tasks. Experimental results demonstrate that MCS achieves better performance than multi‑task MADRL baselines without communication, as well as single‑task MADRL baselines with and without communication.

PaperID: 7575, https://arxiv.org/pdf/2511.03248.pdf

Authors: Junhao Li, Jiahao Chen, Zhou Feng, Chunyi Zhou

Title: Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework

Abstract:
Recent advances in multi‑modal Large Language Models (M‑LLMs) have demonstrated a powerful ability to synthesize implicit information from disparate sources, including images and text. These resourceful data from social media also introduce a significant and underexplored privacy risk: the inference of sensitive personal attributes from seemingly daily media content. However, the lack of benchmarks and comprehensive evaluations of state‑of‑the‑art M‑LLM capabilities hinders the research of private attribute profiling on social media. Accordingly, we propose (1) PRISM, the first multi‑modal, multi‑dimensional and fine‑grained synthesized dataset incorporating a comprehensive privacy landscape and dynamic user history; (2) an Efficient evaluation framework that measures the cross‑modal privacy inference capabilities of advanced M‑LLM. Specifically, PRISM is a large‑scale synthetic benchmark designed to evaluate cross‑modal privacy risks. Its key feature is 12 sensitive attribute labels across a diverse set of multi‑modal profiles, which enables targeted privacy analysis. These profiles are generated via a sophisticated LLM agentic workflow, governed by a prior distribution to ensure they realistically mimic social media users. Additionally, we propose a Multi‑Agent Inference Framework that leverages a pipeline of specialized LLMs to enhance evaluation capabilities. We evaluate the inference capabilities of six leading M‑LLMs (Qwen, Gemini, GPT‑4o, GLM, Doubao, and Grok) on PRISM. The comparison with human performance reveals that these MLLMs significantly outperform in accuracy and efficiency, highlighting the threat of potential privacy risks and the urgent need for robust defenses. Dataset available at https://huggingface.co/datasets/xaddh/multimodal‑privacy

PaperID: 7576, https://arxiv.org/pdf/2511.03179.pdf

Authors: Varun Kumar, George Em Karniadakis

Title: Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework

Abstract:
The engineering design process often demands expertise from multiple domains, leading to complex collaborations and iterative refinements. Traditional methods can be resource‑intensive and prone to inefficiencies. To address this, we formalize the engineering design process through a multi‑agent AI framework that integrates structured design and review loops. The framework introduces specialized knowledge‑driven agents that collaborate to generate and refine design candidates. As an exemplar, we demonstrate its application to the aerodynamic optimization of 4‑digit NACA airfoils. The framework consists of three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems Engineer. The Graph Ontologist employs a Large Language Model (LLM) to construct two domain‑specific knowledge graphs from airfoil design literature. The Systems Engineer, informed by a human manager, formulates technical requirements that guide design generation and evaluation. The Design Engineer leverages the design knowledge graph and computational tools to propose candidate airfoils meeting these requirements. The Systems Engineer reviews and provides feedback both qualitative and quantitative using its own knowledge graph, forming an iterative feedback loop until a design is validated by the manager. The final design is then optimized to maximize performance metrics such as the lift‑to‑drag ratio. Overall, this work demonstrates how collaborative AI agents equipped with structured knowledge representations can enhance efficiency, consistency, and quality in the engineering design process.

PaperID: 7577, https://arxiv.org/pdf/2511.03153.pdf

Authors: Khouloud Oueslati, Maxime Lamothe, Foutse Khomh

Title: RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring

Abstract:
Large Language Models (LLMs) have substantially influenced various software engineering tasks. Indeed, in the case of software refactoring, traditional LLMs have shown the ability to reduce development time and enhance code quality. However, these LLMs often rely on static, detailed instructions for specific tasks. In contrast, LLM‑based agents can dynamically adapt to evolving contexts and autonomously make decisions by interacting with software tools and executing workflows. In this paper, we explore the potential of LLM‑based agents in supporting refactoring activities. Specifically, we introduce RefAgent, a multi‑agent LLM‑based framework for end‑to‑end software refactoring. RefAgent consists of specialized agents responsible for planning, executing, testing, and iteratively refining refactorings using self‑reflection and tool‑calling capabilities. We evaluate RefAgent on eight open‑source Java projects, comparing its effectiveness against a single‑agent approach, a search‑based refactoring tool, and historical developer refactorings. Our assessment focuses on: (1) the impact of generated refactorings on software quality, (2) the ability to identify refactoring opportunities, and (3) the contribution of each LLM agent through an ablation study. Our results show that RefAgent achieves a median unit test pass rate of 90%, reduces code smells by a median of 52.5%, and improves key quality attributes (e.g., reusability) by a median of 8.6%. Additionally, it closely aligns with developer refactorings and the search‑based tool in identifying refactoring opportunities, attaining a median F1‑score of 79.15% and 72.7%, respectively. Compared to single‑agent approaches, RefAgent improves the median unit test pass rate by 64.7% and the median compilation success rate by 40.1%. These findings highlight the promise of multi‑agent architectures in advancing automated software refactoring.

PaperID: 7578, https://arxiv.org/pdf/2511.03136.pdf

Authors: Kexing Ji, Shiyun Fu, Cuiyun Gao, Yujia Chen, Zezhou Yang, Chaozheng Wang, Yuetang Deng

Title: Automated Prompt Generation for Code Intelligence: An Empirical study and Experience in WeChat

Abstract:
Large Code Models (LCMs) show potential in code intelligence, but their effectiveness is greatly influenced by prompt quality. Current prompt design is mostly manual, which is time‑consuming and highly dependent on specific LCMs and tasks. While automated prompt generation (APG) exists in NLP, it is underexplored for code intelligence. This creates a gap, as automating the prompt process is essential for developers facing diverse tasks and black‑box LCMs. To mitigate this, we empirically investigate two important parts of APG: Instruction Generation (IG) and Multi‑Step Reasoning (MSR). IG provides a task‑related description to instruct LCMs, while MSR guides them to produce logical steps before the final answer. We evaluate widely‑used APG methods for each part on four open‑source LCMs and three code intelligence tasks: code translation (PL‑PL), code summarization (PL‑NL), and API recommendation (NL‑PL).Experimental results indicate that both IG and MSR dramatically enhance performance compared to basic prompts. Based on these results, we propose a novel APG approach combining the best methods of the two parts. Experiments show our approach achieves average improvements of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE‑L (code summarization), and 84.53% in SuccessRate@1 (API recommendation) over basic prompts. To validate its effectiveness in an industrial scenario, we evaluate our approach on WeChat‑Bench, a proprietary dataset, achieving an average MRR improvement of 148.89% for API recommendation.

PaperID: 7579, https://arxiv.org/pdf/2511.03100.pdf

Authors: Hao Xiang Li, Michael Amir, Amanda Prorok

Title: Scaling Multi-Agent Environment Co-Design with Diffusion Models

Abstract:
The agent‑environment co‑design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co‑design promises to fundamentally change how we deploy multi‑agent systems. However, current co‑design methods struggle to scale. They collapse under high‑dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co‑Design (DiCoDe), a scalable and sample‑efficient co‑design framework pushing co‑design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward‑maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up‑to‑date learning signal. Together, these improvements lead to superior environment‑policy pairs when validated on challenging multi‑agent environment co‑design benchmarks including warehouse automation, multi‑agent pathfinding and wind farm optimisation. Our method consistently exceeds the state‑of‑the‑art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent‑environment co‑design, and is a stepping stone towards reaping the rewards of co‑design in real world domains.

PaperID: 7580, https://arxiv.org/pdf/2511.03094.pdf

Authors: Longling Geng, Edward Y. Chang

Title: ALAS: Transactional and Dynamic Multi-Agent LLM Planning

Abstract:
Large language models enable flexible multi‑agent planning but remain fragile in practice: verification is often circular, state changes are not tracked for repair, and small faults trigger costly global recomputation. We present ALAS, a stateful, disruption‑aware framework that separates planning from non‑circular validation, records a versioned execution log for grounded checks and restore points, and performs localized repair that preserves work in progress. The validator operates independently of the planning LLM with fresh, bounded context, avoiding self‑check loops and mid‑context attrition. The repair protocol edits only the minimal affected region under explicit policies (retry, catch, timeout, backoff, idempotency keys, compensation, loop guards) defined in a canonical workflow IR that maps to Amazon States Language and Argo Workflows. On job‑shop scheduling suites (DMU, TA) across five classical benchmarks, ALAS matches or exceeds strong single‑LLM and multi‑agent baselines, achieving 83.7% success, reducing token usage by 60%, and running 1.82times faster under comparable settings. A minimal reliability study shows that the validator detects injected structural faults with low overhead, and that localized repair contains runtime perturbations with a bounded edit radius and less makespan degradation than global recompute. Results indicate that the combination of validator isolation, versioned execution logs, and localized repair provides measurable efficiency, feasibility, and scalability for multi‑agent LLM planning. Code and seeds will be released.

PaperID: 7581, https://arxiv.org/pdf/2511.03051.pdf

Authors: Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan

Title: No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Abstract:
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large‑scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus‑driven evaluation protocol. Our multi‑agent framework aggregates pattern audits and issue codes into ground‑truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large‑scale complementary‑item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT‑4o provides the most favorable latency‑accuracy‑cost tradeoff; and (iv) GPT‑OSS 20B leads among open‑source models. Category‑level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.

PaperID: 7582, https://arxiv.org/pdf/2511.03023.pdf

Authors: Sina Montazeri, Yunhe Feng, Kewei Sha

Title: PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework

Abstract:
Open data repositories hold potential for evidence‑based decision‑making, yet are inaccessible to non‑experts lacking expertise in dataset discovery, schema mapping, and statistical analysis. Large language models show promise for individual tasks, but end‑to‑end analytical workflows expose fundamental limitations: attention dilutes across growing contexts, specialized reasoning patterns interfere, and errors propagate undetected. We present PublicAgent, a multi‑agent framework that addresses these limitations through decomposition into specialized agents for intent clarification, dataset discovery, analysis, and reporting. This architecture maintains focused attention within agent contexts and enables validation at each stage. Evaluation across five models and 50 queries derives five design principles for multi‑agent LLM systems. First, specialization provides value independent of model strength‑‑even the strongest model shows 97.5% agent win rates, with benefits orthogonal to model scale. Second, agents divide into universal (discovery, analysis) and conditional (report, intent) categories. Universal agents show consistent effectiveness (std dev 12.4%) while conditional agents vary by model (std dev 20.5%). Third, agents mitigate distinct failure modes‑‑removing discovery or analysis causes catastrophic failures (243‑280 instances), while removing report or intent causes quality degradation. Fourth, architectural benefits persist across task complexity with stable win rates (86‑92% analysis, 84‑94% discovery), indicating workflow management value rather than reasoning enhancement. Fifth, wide variance in agent effectiveness across models (42‑96% for analysis) requires model‑aware architecture design. These principles guide when and why specialization is necessary for complex analytical workflows while enabling broader access to public data through natural language interfaces.

PaperID: 7583, https://arxiv.org/pdf/2511.02957.pdf

Authors: Mohsin Mahmud Topu, Mahfuz Ahmed Anik, Azmine Toushik Wasi, Md Manjurul Ahsan

Title: Digital Twin-Driven Pavement Health Monitoring and Maintenance Optimization Using Graph Neural Networks

Abstract:
Pavement infrastructure monitoring is challenged by complex spatial dependencies, changing environmental conditions, and non‑linear deterioration across road networks. Traditional Pavement Management Systems (PMS) remain largely reactive, lacking real‑time intelligence for failure prevention and optimal maintenance planning. To address this, we propose a unified Digital Twin (DT) and Graph Neural Network (GNN) framework for scalable, data‑driven pavement health monitoring and predictive maintenance. Pavement segments and spatial relations are modeled as graph nodes and edges, while real‑time UAV, sensor, and LiDAR data stream into the DT. The inductive GNN learns deterioration patterns from graph‑structured inputs to forecast distress and enable proactive interventions. Trained on a real‑world‑inspired dataset with segment attributes and dynamic connectivity, our model achieves an R2 of 0.3798, outperforming baseline regressors and effectively capturing non‑linear degradation. We also develop an interactive dashboard and reinforcement learning module for simulation, visualization, and adaptive maintenance planning. This DT‑GNN integration enhances forecasting precision and establishes a closed feedback loop for continuous improvement, positioning the approach as a foundation for proactive, intelligent, and sustainable pavement management, with future extensions toward real‑world deployment, multi‑agent coordination, and smart‑city integration.

PaperID: 7584, https://arxiv.org/pdf/2511.02919.pdf

Authors: Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang

Title: Cache Mechanism for Agent RAG Systems

Abstract:
Recent advances in Large Language Model (LLM)‑based agents have been propelled by Retrieval‑Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance, agent‑level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent's need, remains underexplored. Therefore, we introduce ARC (Agent RAG Cache Mechanism), a novel, annotation‑free caching framework that dynamically manages small, high‑value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high‑relevance cache. With comprehensive experiments on three retrieval datasets, our experimental results demonstrate that ARC reduces storage requirements to 0.015% of the original corpus while offering up to 79.8% has‑answer rate and reducing average retrieval latency by 80%. Our results demonstrate that ARC can drastically enhance efficiency and effectiveness in RAG‑powered LLM agents.

PaperID: 7585, https://arxiv.org/pdf/2511.02895.pdf

Authors: Gian Maria Campedelli

Title: A Criminology of Machines

Abstract:
While the possibility of reaching human‑like Artificial Intelligence (AI) remains controversial, the likelihood that the future will be characterized by a society with a growing presence of autonomous machines is high. Autonomous AI agents are already deployed and active across several industries and digital environments and alongside human‑human and human‑machine interactions, machine‑machine interactions are poised to become increasingly prevalent. Given these developments, I argue that criminology must begin to address the implications of this transition for crime and social control. Drawing on Actor‑Network Theory and Woolgar's decades‑old call for a sociology of machines ‑‑ frameworks that acquire renewed relevance with the rise of generative AI agents ‑‑ I contend that criminologists should move beyond conceiving AI solely as a tool. Instead, AI agents should be recognized as entities with agency encompassing computational, social, and legal dimensions. Building on the literature on AI safety, I thus examine the risks associated with the rise of multi‑agent AI systems, proposing a dual taxonomy to characterize the channels through which interactions among AI agents may generate deviant, unlawful, or criminal outcomes. I then advance and discuss four key questions that warrant theoretical and empirical attention: (1) Can we assume that machines will simply mimic humans? (2) Will crime theories developed for humans suffice to explain deviant or criminal behaviors emerging from interactions between autonomous AI agents? (3) What types of criminal behaviors will be affected first? (4) How might this unprecedented societal shift impact policing? These questions underscore the urgent need for criminologists to theoretically and empirically engage with the implications of multi‑agent AI systems for the study of crime and play a more active role in debates on AI safety and governance.

PaperID: 7586, https://arxiv.org/pdf/2511.02841.pdf

Authors: Sandro Rodriguez Garzon, Awid Vaziry, Enis Mert Kuzu, Dennis Enrique Gehrmann, Buse Varkan, Alexander Gaballa, Axel Küpper

Title: AI Agents with Decentralized Identifiers and Verifiable Credentials

Abstract:
A fundamental limitation of current LLM‑based AI agents is their inability to build differentiated trust among each other at the onset of an agent‑to‑agent dialogue. However, autonomous and interoperable trust establishment becomes essential once agents start to operate beyond isolated environments and engage in dialogues across individual or organizational boundaries. A promising way to fill this gap in Agentic AI is to equip agents with long‑lived digital identities and introduce tamper‑proof and flexible identity‑bound attestations of agents, provisioned by commonly trusted third parties and designed for cross‑domain verifiability. This article presents a conceptual framework and a prototypical multi‑agent system, where each agent is endowed with a self‑sovereign digital identity. It combines a unique and ledger‑anchored W3C Decentralized Identifier (DID) of an agent with a set of third‑party issued W3C Verifiable Credentials (VCs). This enables agents at the start of a dialog to prove ownership of their self‑controlled DIDs for authentication purposes and to establish various cross‑domain trust relationships through the spontaneous exchange of their self‑hosted DID‑bound VCs. A comprehensive evaluation of the prototypical implementation demonstrates technical feasibility but also reveals limitations once an agent's LLM is in sole charge to control the respective security procedures.

PaperID: 7587, https://arxiv.org/pdf/2511.02823.pdf

Authors: Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton

Title: Optimizing AI Agent Attacks With Synthetic Data

Abstract:
As AI deployments become more complex and high‑stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data‑poor. In this work, we show how to optimize attack policies in SHADE‑Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills ‑‑ suspicion modeling, attack selection, plan synthesis, execution and subtlety ‑‑ and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE‑Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.

PaperID: 7588, https://arxiv.org/pdf/2511.02762.pdf

Authors: Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang

Title: From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos

Abstract:
Training a team of agents from scratch in multi‑agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi‑agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search‑and‑rescue. To unlock their potential, we propose Solo‑to‑Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi‑agent training through a policy fusion mechanism that combines an MoE‑like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi‑agent data, making cooperative learning more practical and broadly applicable.

PaperID: 7589, https://arxiv.org/pdf/2511.02755.pdf

Authors: Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang

Title: Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

Abstract:
Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi‑agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi‑LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost‑efficient and cost‑controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi‑agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade‑off in a controllable multi‑budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high‑budget settings, while maintaining strong performance in more economical low‑budget modes, highlighting the effectiveness of centralized coordination for scalable and cost‑efficient multi‑agent LLM systems.

PaperID: 7590, https://arxiv.org/pdf/2511.02734.pdf

Authors: Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung

Title: CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Abstract:
Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost‑optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost‑centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel‑planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real‑world unpredictability and necessitate agents to adapt in real time. Evaluating leading open‑sourced and proprietary models on CostBench reveals a substantial gap in cost‑aware planning: agents frequently fail to identify cost‑optimal solutions in static settings, with even GPT‑5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.

PaperID: 7591, https://arxiv.org/pdf/2511.02690.pdf

Authors: Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla

Title: Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs

Abstract:
Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self‑paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary‑tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary‑tree MDP, a multi‑task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain‑of‑thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource‑constrained deployment.

PaperID: 7592, https://arxiv.org/pdf/2511.02648.pdf

Authors: Syed Haseeb Shah

Title: Stochastic Redistribution of Indistinguishable Items in Shared Habitation: A Multi-Agent Simulation Framework

Abstract:
This paper presents a discrete‑event stochastic model for the redistribution of indistinguishable personal items, exemplified by socks, among multiple cohabitants sharing a communal laundry system. Drawing on concepts from ecological population dynamics, diffusion processes, and stochastic exchange theory, the model captures the probabilistic mechanisms underlying item mixing, recovery, and loss. Each cohabitant is represented as an autonomous agent whose belongings interact through iterative cycles of collective washing, sorting, and partial correction. The system's evolution is characterized by random mixing events, selective recollection, and attrition over time. Implemented using the SimPy discrete‑event simulation framework, the model demonstrates that even minimal exchange probabilities can generate emergent asymmetries, quasi‑equilibrium distributions, and long‑term disorder. The findings illustrate how stochastic processes inherent to shared domestic systems can produce persistent imbalances, offering a quantitative perspective on an everyday social phenomenon.

PaperID: 7593, https://arxiv.org/pdf/2511.02606.pdf

Authors: Xiangen Hu, Jiarui Tong, Sheng Xu

Title: A Multi-Agent Psychological Simulation System for Human Behavior Modeling

Abstract:
Training and education in human‑centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi‑agent psychological simulation system that models internal cognitive‑affective processes to generate believable human behaviors. In contrast to black‑box neural models, this system is grounded in established psychological theories (e.g., self‑efficacy, mindset, social constructivism) and explicitly simulates an ``inner parliament'' of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system's output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system's architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta‑cognition.

PaperID: 7594, https://arxiv.org/pdf/2511.02532.pdf

Authors: Jorge Pellejero, Luis A. Hernández Gómez, Luis Mendo Tomás, Zoraida Frias Barroso

Title: Agentic AI for Mobile Network RAN Management and Optimization

Abstract:
Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human‑level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns‑reflection, planning, tool use, and multi‑agent collaboration‑are then described to illustrate how intelligent behaviors are orchestrated. These theorical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time‑series analytics and LAM‑driven agents collaborate for KPI‑based autonomous decision‑making.

PaperID: 7595, https://arxiv.org/pdf/2511.02469.pdf

Authors: Kaito Takano, Masanori Hirano, Kei Nakagawa

Title: Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLMs for Monetary Policy Decision Classification

Abstract:
Accurately forecasting central bank policy decisions, particularly those of the Federal Open Market Committee(FOMC) has become increasingly important amid heightened economic uncertainty. While prior studies have used monetary policy texts to predict rate changes, most rely on static classification models that overlook the deliberative nature of policymaking. This study proposes a novel framework that structurally imitates the FOMC's collective decision‑making process by modeling multiple large language models(LLMs) as interacting agents. Each agent begins with a distinct initial belief and produces a prediction based on both qualitative policy texts and quantitative macroeconomic indicators. Through iterative rounds, agents revise their predictions by observing the outputs of others, simulating deliberation and consensus formation. To enhance interpretability, we introduce a latent variable representing each agent's underlying belief(e.g., hawkish or dovish), and we theoretically demonstrate how this belief mediates the perception of input information and interaction dynamics. Empirical results show that this debate‑based approach significantly outperforms standard LLMs‑based baselines in prediction accuracy. Furthermore, the explicit modeling of beliefs provides insights into how individual perspectives and social influence shape collective policy forecasts.

PaperID: 7596, https://arxiv.org/pdf/2511.02399.pdf

Authors: Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

Title: Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

Abstract:
Recent advances in large language model agents offer the promise of automating end‑to‑end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall‑style pipelines, which oversimplify the iterative nature of real‑world development and struggle with complex, large‑scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature‑driven development. EvoDev decomposes user requirements into a set of user‑valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi‑layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best‑performing baseline, Claude Code, by 57.3%, while improving single‑agent performance by 16.0%‑58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow‑aware agent design for end‑to‑end software development. Moreover, our work summarizes practical insights for designing iterative, LLM‑driven development frameworks and informs future training of base LLMs to better support iterative software development.

PaperID: 7597, https://arxiv.org/pdf/2511.02314.pdf

Authors: Jueye Zhang, Chao Yang, Youfang Lai, Kai-Wen Li, Wenting Yan, Yunzhou Xia, Haimei Zhang, Jingjing Zhou, Gen Yang, Chen Lin, Tian Li, Yibao Zhang

Title: Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning

Abstract:
Head‑and‑neck cancer (HNC) planning is difficult because multiple critical organs‑at‑risk (OARs) are close to complex targets. Intensity‑modulated carbon‑ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience‑based, and often suboptimal tuning of many treatment‑planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi‑agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized‑training decentralized‑execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high‑dimensional, non‑stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action‑to‑value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi‑process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head‑and‑neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL 85.93\pm7.85% vs Manual 85.02\pm6.92%), with significant (p‑value < 0.05) improvements for five OARs. The framework efficiently explores high‑dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.

PaperID: 7598, https://arxiv.org/pdf/2511.02304.pdf

Authors: Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

Title: Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

Abstract:
We study the problem of learning multi‑task, multi‑agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub‑tasks that can be assigned to agents. However, existing approaches remain sample‑inefficient and are limited to the single‑task case. In this work, we present Automata‑Conditioned Cooperative Multi‑Agent Reinforcement Learning (ACC‑MARL), a framework for learning task‑conditioned, decentralized team policies. We identify the main challenges to ACC‑MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task‑aware, multi‑step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short‑circuiting tasks.

PaperID: 7599, https://arxiv.org/pdf/2511.02303.pdf

Authors: Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, Suhang Wang

Title: Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

Abstract:
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi‑agent setting, where a meta‑thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi‑agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi‑turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi‑agent framework for complex reasoning tasks.

PaperID: 7600, https://arxiv.org/pdf/2511.02230.pdf

Authors: Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica

Title: Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

Abstract:
KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM calls with tools, introducing pauses that prevent effective KV reuse across turns. Since many tool calls have much shorter durations than human response multi‑turn chatbot, it would be promising to retain the KV cache in during these tools. However, many challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading enabled) as well as the increasing queueing delays after eviction from GPU. Second, due to the internal variance of tool call durations, the method needs to remain robust under limited predictability of tool call durations. We present Continuum, a serving system to optimize job completion time for multi‑turn agent workloads by introducing time‑to‑live mechanism for KV cache retention. For requests that generate tool calls, Continuum selectively pins the KV cache in GPU memory with a time‑to‑live value determined by the reload cost and potential queueing delay induced by eviction. When the TTL expires, the KV cache can be automatically evicted to free up GPU memory, providing robust performance under edge cases. When combined with program‑level first‑come‑first‑serve, Continuum preserves multi‑turn continuity, and reduces delay for agentic workflows. Evaluations on real‑world agents (SWE‑Bench, BFCL, OpenHand) with Llama‑3.1 8B/70B, Gemma‑3 12B, and GLM‑4.5 355B shows that Continuum improves the average job completion times by over 8x while improving throughput.

PaperID: 7601, https://arxiv.org/pdf/2511.02208.pdf

Authors: Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, Yiming Yang

Title: Training Proactive and Personalized LLM Agents

Abstract:
While existing work focuses primarily on task success, we argue that effective real‑world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM‑based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi‑objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT‑5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user‑centered interaction is critical for building practical and effective AI agents.

PaperID: 7602, https://arxiv.org/pdf/2511.02200.pdf

Authors: Jingbo Wang, Sendong Zhao, Haochun Wang, Yuzheng Fan, Lizhe Zhang, Yan Liu, Ting Liu

Title: Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration

Abstract:
The emergence of multi‑agent systems powered by large language models (LLMs) has unlocked new frontiers in complex task‑solving, enabling diverse agents to integrate unique expertise, collaborate flexibly, and address challenges unattainable for individual models. However, the full potential of such systems is hindered by rigid agent scheduling and inefficient coordination strategies that fail to adapt to evolving task requirements. In this paper, we propose STRMAC, a state‑aware routing framework designed for efficient collaboration in multi‑agent systems. Our method separately encodes interaction history and agent knowledge to power the router, which adaptively selects the most suitable single agent at each step for efficient and effective collaboration. Furthermore, we introduce a self‑evolving data generation approach that accelerates the collection of high‑quality execution paths for efficient system training. Experiments on challenging collaborative reasoning benchmarks demonstrate that our method achieves state‑of‑the‑art performance, achieving up to 23.8% improvement over baselines and reducing data collection overhead by up to 90.1% compared to exhaustive search.

PaperID: 7603, https://arxiv.org/pdf/2511.02192.pdf

Authors: Linxin Hou, Qirui Wu, Zhihang Qin, Neil Banerjee, Yongxin Guo, Cecilia Laschi

Title: A Quantitative Comparison of Centralised and Distributed Reinforcement Learning-Based Control for Soft Robotic Arms

Abstract:
This paper presents a quantitative comparison between centralised and distributed multi‑agent reinforcement learning (MARL) architectures for controlling a soft robotic arm modelled as a Cosserat rod in simulation. Using PyElastica and the OpenAI Gym interface, we train both a global Proximal Policy Optimisation (PPO) controller and a Multi‑Agent PPO (MAPPO) under identical budgets. Both approaches are based on the arm having n number of controlled sections. The study systematically varies n and evaluates the performance of the arm to reach a fixed target in three scenarios: default baseline condition, recovery from external disturbance, and adaptation to actuator failure. Quantitative metrics used for the evaluation are mean action magnitude, mean final distance, mean episode length, and success rate. The results show that there are no significant benefits of the distributed policy when the number of controlled sections n\le4. In very simple systems, when n\le2, the centralised policy outperforms the distributed one. When n increases to 4< n\le 12, the distributed policy shows a high sample efficiency. In these systems, distributed policy promotes a stronger success rate, resilience, and robustness under local observability and yields faster convergence given the same sample size. However, centralised policies achieve much higher time efficiency during training as it takes much less time to train the same size of samples. These findings highlight the trade‑offs between centralised and distributed policy in reinforcement learning‑based control for soft robotic systems and provide actionable design guidance for future sim‑to‑real transfer in soft rod‑like manipulators.

PaperID: 7604, https://arxiv.org/pdf/2511.02183.pdf

Authors: Yanfu Qin, Kaihong Lu

Title: Online Distributed Zeroth-Order Optimization With Non-Zero-Mean Adverse Noises

Abstract:
In this paper, the problem of online distributed zeroth‑order optimization subject to a set constraint is studied via a multi‑agent network, where each agent can communicate with its immediate neighbors via a time‑varying directed graph. Different from the existing works on online distributed zeroth‑ order optimization, we consider the case where the estimate on the gradients are influenced by some non‑zero‑mean adverse noises. To handle this problem, we propose a new online dis‑ tributed zeroth‑order mirror descent algorithm involving a kernel function‑based estimator and a clipped strategy. Particularly, in the estimator, the kernel function‑based strategy is provided to deal with the adverse noises, and eliminate the low‑order terms in the Taylor expansions of the objective functions. Furthermore, the performance of the presented algorithm is measured by employing the dynamic regrets, where the offline benchmarks are to find the optimal point at each time. Under the mild assumptions on the graph and the objective functions, we prove that if the variation in the optimal point sequence grows at a certain rate, then the high probability bound of the dynamic regrets increases sublinearly. Finally, a simulation experiment is worked out to demonstrate the effectiveness of our theoretical results.

PaperID: 7605, https://arxiv.org/pdf/2511.02016.pdf

Authors: Patrick Cheridito, Jean-Loup Dupret, Zhexin Wu

Title: ABIDES-MARL: A Multi-Agent Reinforcement Learning Environment for Endogenous Price Formation and Execution in a Limit Order Book

Abstract:
We present ABIDES‑MARL, a framework that combines a new multi‑agent reinforcement learning (MARL) methodology with a new realistic limit‑order‑book (LOB) simulation system to study equilibrium behavior in complex financial market games. The system extends ABIDES‑Gym by decoupling state collection from kernel interruption, enabling synchronized learning and decision‑making for multiple adaptive agents while maintaining compatibility with standard RL libraries. It preserves key market features such as price‑time priority and discrete tick sizes. Methodologically, we use MARL to approximate equilibrium‑like behavior in multi‑period trading games with a finite number of heterogeneous agents‑an informed trader, a liquidity trader, noise traders, and competing market makers‑all with individual price impacts. This setting bridges optimal execution and market microstructure by embedding the liquidity trader's optimization problem within a strategic trading environment. We validate the approach by solving an extended Kyle model within the simulation system, recovering the gradual price discovery phenomenon. We then extend the analysis to a liquidity trader's problem where market liquidity arises endogenously and show that, at equilibrium, execution strategies shape market‑maker behavior and price dynamics. ABIDES‑MARL provides a reproducible foundation for analyzing equilibrium and strategic adaptation in realistic markets and contributes toward building economically interpretable agentic AI systems for finance.

PaperID: 7606, https://arxiv.org/pdf/2511.01912.pdf

Authors: Wenzhe Fan, Ning Yan, Masood Mortazavi

Title: EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory

Abstract:
Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM‑based multi‑agent frameworks have begun to extend this capability. However, the role of human‑like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive the success. Inspired by working memory model in cognitive psychology, we present EvoMem, a multi‑agent framework built on a dual‑evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task‑specific rules and constraints while remains fixed within a query, and Query‑feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing multi‑agent planning.

PaperID: 7607, https://arxiv.org/pdf/2511.01854.pdf

Authors: Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah

Title: Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems

Abstract:
Recent advances in LLM Multi‑Agent Systems enable scalable orchestration of sub‑agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent‑level descriptions before routing, which obscures fine‑grained tool functionality and often results in suboptimal agent selection. We introduce Tool‑to‑Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool‑to‑Agent Retrieval enables granular tool‑level or agent‑level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool‑to‑Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state‑of‑the‑art agent retrievers on the LiveMCPBench benchmark.

PaperID: 7608, https://arxiv.org/pdf/2511.01824.pdf

Authors: Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, Saravan Rajmohan

Title: Simulating Environments with Reasoning Models for Agent Training

Abstract:
LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia‑SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment‑agnostic manner, and Simia‑RL, a framework that enables RL training without real environment implementations through LLM‑simulated feedback. Fine‑tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT‑4o and approaching o4‑mini on τ^2‑Bench. Together, Simia‑SFT and Simia‑RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM‑based simulation.

PaperID: 7609, https://arxiv.org/pdf/2511.01695.pdf

Authors: Jungyeon Koh, Hyun Jong Yang

Title: Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Abstract:
The growing demand for on‑device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource‑constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi‑agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end‑to‑end latency without compromising inference accuracy, enabling scalable and low‑latency LLM services in MEC systems.

PaperID: 7610, https://arxiv.org/pdf/2511.01633.pdf

Authors: Chengying Huan, Ziheng Meng, Yongchao Liu, Zhengyi Yang, Yun Zhu, Yue Yun, Shipeng Li, Rong Gu, Xiabao Wu, Haitao Zhang, Chuntao Hong, Shaonan Ma, Guihai Chen, Chen Tian

Title: Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

Abstract:
Graph Chain‑of‑Thought (Graph‑CoT) enables large language models (LLMs) to perform step‑by‑step reasoning over graph‑structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single‑agent monolithic prompts, repeated context re‑encoding, and inefficient serving execution. We present GLM, the first multi‑agent Graph‑CoT system co‑designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph‑CoT‑aware LLM inference mechanism with graph‑specific KV‑cache management, priority‑based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state‑of‑the‑art Graph‑CoT baselines, enabling efficient adoption for complex real‑world reasoning at scale.

PaperID: 7611, https://arxiv.org/pdf/2511.01625.pdf

Authors: Han Weng, Zhou Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen, Wentao Zhang

Title: UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data

Abstract:
In the real business world, data is stored in a variety of sources, including structured relational databases, unstructured databases (e.g., NoSQL databases), or even CSV/excel files. The ability to extract reasonable insights across these diverse source is vital for business success. Existing benchmarks, however, are limited in assessing agents' capabilities across these diverse data types. To address this gap, we introduce UniDataBench, a comprehensive benchmark designed to evaluate the performance of data analytics agents in handling diverse data sources. Specifically, UniDataBench is originating from real‑life industry analysis report and we then propose a pipeline to remove the privacy and sensitive information. It encompasses a wide array of datasets, including relational databases, CSV files to NoSQL data, reflecting real‑world business scenarios, and provides unified framework to assess how effectively agents can explore multiple data formats, extract valuable insights, and generate meaningful summaries and recommendations. Based on UniDataBench, we propose a novel LLM‑based agent named ReActInsight, an autonomous agent that performs end‑to‑end analysis over diverse data sources by automatically discovering cross‑source linkages, decomposing goals, and generating robust, self‑correcting code to extract actionable insights. Our benchmark and agent together provide a powerful framework for advancing the capabilities of data analytics agents in real‑world applications.

PaperID: 7612, https://arxiv.org/pdf/2511.01594.pdf

Authors: Renjun Gao

Title: MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence

Abstract:
Multimodal large language models (MLLMs) have shown remarkable capabilities in cross‑modal understanding and reasoning, offering new opportunities for intelligent assistive systems, yet existing systems still struggle with risk‑aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS ‑ a Multi‑Agent Robotic System powered by MLLMs for assistive intelligence and designed for smart home robots supporting people with disabilities. The system integrates four agents: a visual perception agent for extracting semantic and spatial features from environment images, a risk assessment agent for identifying and prioritizing hazards, a planning agent for generating executable action sequences, and an evaluation agent for iterative optimization. By combining multimodal perception with hierarchical multi‑agent decision‑making, the framework enables adaptive, risk‑aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets demonstrate the superior overall performance of the proposed system in risk‑aware planning and coordinated multi‑agent execution compared with state‑of‑the‑art multimodal models. The proposed approach also highlights the potential of collaborative AI for practical assistive scenarios and provides a generalizable methodology for deploying MLLM‑enabled multi‑agent systems in real‑world environments.

PaperID: 7613, https://arxiv.org/pdf/2511.01554.pdf

Authors: Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun

Title: Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning

Abstract:
Effective communication in multi‑agent reinforcement learning (MARL) is critical for success but constrained by bandwidth, yet past approaches have been limited to complex gating mechanisms that only decide whether to communicate, not how precisely. Learning to optimize message precision at the bit‑level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end‑to‑end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug‑and‑play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate how agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state‑of‑the‑art MARL algorithms, showing it reduces bandwidth by over an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the \enquoteBitter Lesson in MARL communication: a simple Transformer‑based policy leveraging DDCL matches the performance of complex, specialized architectures, questioning the necessity of bespoke communication designs.

PaperID: 7614, https://arxiv.org/pdf/2511.01445.pdf

Authors: ChengZhang Yu, YingRu He, Hongyan Cheng, nuo Cheng, Zhixing Liu, Dongxu Mu, Zhangrui Shen Yang Gao, and Zhanpeng Jin

Title: From Passive to Proactive: A Hierarchical Multi-Agent Framework for Automated Medical Pre-Consultation

Abstract:
The post‑pandemic surge in healthcare demand, coupled with critical nursing shortages, has placed unprecedented pressure on medical triage systems, necessitating innovative AI‑driven solutions. We present a multi‑agent interactive intelligent system for medical triage that addresses three fundamental challenges in current AI‑based triage systems: inadequate medical specialization leading to misclassification, heterogeneous department structures across healthcare institutions, and inefficient detail‑oriented questioning that impedes rapid triage decisions. Our system employs three specialized agents‑‑RecipientAgent, InquirerAgent, and DepartmentAgent‑‑that collaborate through Inquiry Guidance mechanism and Classification Guidance Mechanism to transform unstructured patient symptoms into accurate department recommendations. To ensure robust evaluation, we constructed a comprehensive Chinese medical triage dataset from "Ai Ai Yi Medical Network", comprising 3,360 real‑world cases spanning 9 primary departments and 62 secondary departments. Experimental results demonstrate that our multi‑agent system achieves 89.6% accuracy in primary department classification and 74.3% accuracy in secondary department classification after four rounds of patient interaction. The system's dynamic matching based guidance mechanisms enable efficient adaptation to diverse hospital configurations while maintaining high triage accuracy. We successfully developed this multi‑agent triage system that not only adapts to organizational heterogeneity across healthcare institutions but also ensures clinically sound decision‑making.

PaperID: 7615, https://arxiv.org/pdf/2511.01425.pdf

Authors: Yuhang Huang, Zekai Lin, Fan Zhong, Lei Liu

Title: Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis

Abstract:
Explanations for AI models in high‑stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action‑based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% compared to a non‑interactive baseline. To validate the faithfulness of the agent's explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance (ΔBrier=+0.029), confirming that the evidence is integral to its decision‑making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.

PaperID: 7616, https://arxiv.org/pdf/2511.01340.pdf

Authors: Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S

Title: $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Abstract:
Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi‑step reasoning, image‑based wordplay, etc., making this a challenging task for even current Vision‑Language Models. In this paper, we present \left|\,\circlearrowright\,\boxed\textBUS\,\right|, a large and diverse benchmark of 1,333 English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model‑agnostic framework which uses a combination of an unstructured description and code‑based, structured reasoning, along with better, reasoning‑based in‑context example selection, improving the performance of Vision‑Language Models on \left|\,\circlearrowright\,\boxed\textBUS\,\right| by 2.1‑4.1% and 20‑30% using closed‑source and open‑source models respectively compared to Chain‑of‑Thought Reasoning.

PaperID: 7617, https://arxiv.org/pdf/2511.01323.pdf

Authors: Jiabao Ji, Min Li, Priyanshu Kumar, Shiyu Chang, Saloni Potdar

Title: DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Abstract:
Large language models (LLMs) with integrated search tools show strong promise in open‑domain question answering (QA), yet they often struggle to produce complete answer set to complex questions such as Which actor from the film Heat won at least one Academy Award?, which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi‑step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi‑hop reasoning and half of them explicit name ambiguity resolving. Experiments reveal that, even state‑of‑the‑art GPT‑5 show incomplete answers, achieving only 0.13 exact match on ambiguous questions and 0.21 on non‑ambiguous questions. These findings highlight the need for more robust QA systems aimed at information gathering and answer completeness.

PaperID: 7618, https://arxiv.org/pdf/2511.01310.pdf

Authors: Sureyya Akin, Kavita Srivastava, Prateek B. Kapoor, Pradeep G. Sethi, Sunita Q. Patel, Rahu Srivastava

Title: From Pixels to Cooperation Multi Agent Reinforcement Learning based on Multimodal World Models

Abstract:
Learning cooperative multi‑agent policies directly from high‑dimensional, multimodal sensory inputs like pixels and audio (from pixels) is notoriously sample‑inefficient. Model‑free Multi‑Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment's dynamics by fusing distributed, multimodal observations from all agents using a scalable attention‑based mechanism. Subsequently, we leverage this learned MWM as a fast, "imagined" simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi‑agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM‑MARL framework achieves orders‑of‑magnitude greater sample efficiency compared to state‑of‑the‑art model‑free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor‑dropout, a critical feature for real‑world deployment.

PaperID: 7619, https://arxiv.org/pdf/2511.01224.pdf

Authors: Chengmeng Li, Yaxin Peng

Title: Embodiment Transfer Learning for Vision-Language-Action Models

Abstract:
Vision‑language‑action (VLA) models have significantly advanced robotic learning, enabling training on large‑scale, cross‑embodiment data and fine‑tuning for specific robots. However, state‑of‑the‑art autoregressive VLAs struggle with multi‑robot collaboration. We introduce embodiment transfer learning, denoted as ET‑VLA, a novel framework for efficient and effective transfer of pre‑trained VLAs to multi‑robot. ET‑VLA's core is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and precise action token numbers. Following SCP, the model is fine‑tuned on target embodiment data. To further enhance the model performance on multi‑embodiment, we present the Embodied Graph‑of‑Thought technique, a novel approach that formulates each sub‑task as a node, that allows the VLA model to distinguish the functionalities and roles of each embodiment during task execution. Our work considers bimanual robots, a simple version of multi‑robot to verify our approaches. We validate the effectiveness of our method on both simulation benchmarks and real robots covering three different bimanual embodiments. In particular, our proposed ET‑VLA \space can outperform OpenVLA on six real‑world tasks over 53.2%. We will open‑source all codes to support the community in advancing VLA models for robot learning.

PaperID: 7620, https://arxiv.org/pdf/2511.01213.pdf

Authors: Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, Venkataramana Runkana

Title: Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering

Abstract:
The immense diversity in the culture and culinary of Indian cuisines calls attention to the major shortcoming of the existing Visual Question Answering(VQA) systems which are inclined towards the foods from Western region. Recent attempt towards building a VQA dataset for Indian food is a step towards addressing this challenge. However, their approach towards VQA follows a two‑step process in which the answer is generated first, followed by the explanation of the expected answer. In this work, we claim that food VQA requires to follow a multi‑step reasoning process to arrive at an accurate answer, especially in the context of India food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis we create reasoning chains upon the QA with minimal human intervention. We fine‑tune smaller LLMs and VLMs with auto‑validated reasoning chains and further train them using reinforcement learning with larger data. With augmentation of reasoning chains, we observed accuracy improvement of an average 10 percentage points on the baseline. We provide detailed analysis in terms the effect of addition of reasoning chains for the Indian Food VQA task. Index Terms ‑ FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.

PaperID: 7621, https://arxiv.org/pdf/2511.01149.pdf

Authors: Shuaidong Pan, Di Wu

Title: Modular Task Decomposition and Dynamic Collaboration in Multi-Agent Systems Driven by Large Language Models

Abstract:
This paper addresses the limitations of a single agent in task decomposition and collaboration during complex task execution, and proposes a multi‑agent architecture for modular task decomposition and dynamic collaboration based on large language models. The method first converts natural language task descriptions into unified semantic representations through a large language model. On this basis, a modular decomposition mechanism is introduced to break down the overall goal into multiple hierarchical sub‑tasks. Then, dynamic scheduling and routing mechanisms enable reasonable division of labor and realtime collaboration among agents, allowing the system to adjust strategies continuously according to environmental feedback, thus maintaining efficiency and stability in complex tasks. Furthermore, a constraint parsing and global consistency mechanism is designed to ensure coherent connections between sub‑tasks and balanced workload, preventing performance degradation caused by redundant communication or uneven resource allocation. The experiments validate the architecture across multiple dimensions, including task success rate, decomposition efficiency, sub‑task coverage, and collaboration balance. The results show that the proposed method outperforms existing approaches in both overall performance and robustness, achieving a better balance between task complexity and communication overhead. In conclusion, this study demonstrates the effectiveness and feasibility of language‑driven task decomposition and dynamic collaboration in multi‑agent systems, providing a systematic solution for task execution in complex environments.

PaperID: 7622, https://arxiv.org/pdf/2511.01078.pdf

Authors: Qinwei Huang, Stefan Wang, Simon Khan, Garrett Katz, Qinru Qiu

Title: Predictive Auxiliary Learning for Belief-based Multi-Agent Systems

Abstract:
The performance of multi‑agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi‑agent systems primarily rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief‑based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm. Each agent learns a belief model that predicts unobservable state information, such as other agents' rewards or motion directions, alongside its policy model. By enriching hidden state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL in the predator‑prey environment and Google Research Football, where it achieves an average improvement of about 16 percent in performance metrics and demonstrates more stable convergence compared to baseline methods.

PaperID: 7623, https://arxiv.org/pdf/2511.01015.pdf

Authors: Nabeel Seedat, Jiashuo Liu, Mihaela van der Schaar

Title: What's the next frontier for Data-centric AI? Data Savvy Agents

Abstract:
The recent surge in AI agents that autonomously communicate, collaborate with humans and use diverse tools has unlocked promising opportunities in various real‑world settings. However, a vital aspect remains underexplored: how agents handle data. Scalable autonomy demands agents that continuously acquire, process, and evolve their data. In this paper, we argue that data‑savvy capabilities should be a top priority in the design of agentic systems to ensure reliable real‑world deployment. Specifically, we propose four key capabilities to realize this vision: (1) Proactive data acquisition: enabling agents to autonomously gather task‑critical knowledge or solicit human input to address data gaps; (2) Sophisticated data processing: requiring context‑aware and flexible handling of diverse data challenges and inputs; (3) Interactive test data synthesis: shifting from static benchmarks to dynamically generated interactive test data for agent evaluation; and (4) Continual adaptation: empowering agents to iteratively refine their data and background knowledge to adapt to shifting environments. While current agent research predominantly emphasizes reasoning, we hope to inspire a reflection on the role of data‑savvy agents as the next frontier in data‑centric AI.

PaperID: 7624, https://arxiv.org/pdf/2511.00993.pdf

Authors: Tianming Liu, Jirong Yang, Yafeng Yin, Manzi Li, Linghao Wang, Zheng Zhu

Title: Aligning LLM agents with human learning and adjustment behavior: a dual agent approach

Abstract:
Effective modeling of how human travelers learn and adjust their travel behavior from interacting with transportation systems is critical for system assessment and planning. However, this task is also difficult due to the complex cognition and decision‑making involved in such behavior. Recent research has begun to leverage Large Language Model (LLM) agents for this task. Building on this, we introduce a novel dual‑agent framework that enables continuous learning and alignment between LLM agents and human travelers on learning and adaptation behavior from online data streams. Our approach involves a set of LLM traveler agents, equipped with a memory system and a learnable persona, which serve as simulators for human travelers. To ensure behavioral alignment, we introduce an LLM calibration agent that leverages the reasoning and analytical capabilities of LLMs to train the personas of these traveler agents. Working together, this dual‑agent system is designed to track and align the underlying decision‑making mechanisms of travelers and produce realistic, adaptive simulations. Using a real‑world dataset from a day‑to‑day route choice experiment, we show our approach significantly outperforms existing LLM‑based methods in both individual behavioral alignment and aggregate simulation accuracy. Furthermore, we demonstrate that our method moves beyond simple behavioral mimicry to capture the evolution of underlying learning processes, a deeper alignment that fosters robust generalization. Overall, our framework provides a new approach for creating adaptive and behaviorally realistic agents to simulate travelers' learning and adaptation that can benefit transportation simulation and policy analysis.

PaperID: 7625, https://arxiv.org/pdf/2511.00908.pdf

Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

Title: GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

Abstract:
Visual geo‑localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision‑Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi‑agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi‑agent debate framework using heterogeneous graph neural networks for visual geo‑localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual‑level debate mechanism combining node‑level refinement and edge‑level argumentation modeling. A cross‑level topology refinement strategy enables co‑evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state‑of‑the‑art methods. Our framework transforms cognitive conflicts between agents into enhanced geo‑localization accuracy through structured debate.

PaperID: 7626, https://arxiv.org/pdf/2511.00802.pdf

Authors: Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

Title: GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

Abstract:
With the software industry shifting toward a data‑driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off‑policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM‑based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large‑scale real‑world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP)~\citesaito2021openbanditdatasetpipeline and Scope‑RL~\citekiyohara2023scope, and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM‑based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data‑driven decision‑making in production.

PaperID: 7627, https://arxiv.org/pdf/2511.00767.pdf

Authors: Shi Gengtian, Takashi Koshimizu, Megumi Saito, Pan Zhenni, Liu Jiang, Shigeru Shimamoto

Title: Power Control Based on Multi-Agent Deep Q Network for D2D Communication

Abstract:
In device‑to‑device (D2D) communication under a cell with resource sharing mode the spectrum resource utilization of the system will be improved. However, if the interference generated by the D2D user is not controlled, the performance of the entire system and the quality of service (QOS) of the cellular user may be degraded. Power control is important because it helps to reduce interference in the system. In this paper, we propose a reinforcement learning algorithm for adaptive power control that helps reduce interference to increase system throughput. Simulation results show the proposed algorithm has better performance than traditional algorithm in LTE (Long Term Evolution).

PaperID: 7628, https://arxiv.org/pdf/2511.00739.pdf

Authors: Ritik Raj, Souvik Kundu, Ishita Vohra, Hong Wang, Tushar Krishna

Title: Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Abstract:
Agentic AI serving converts monolithic LLM‑based inference to autonomous problem‑solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution need, such serving heavily rely on heterogeneous CPU‑GPU systems with majority of the external tools responsible for agentic capability, either run on or are orchestrated by the CPU. Towards having a deeper understanding of its role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU‑centric perspective. We first present a compile‑time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads analyzing the end‑to‑end latency and throughput on two different hardware systems to isolate respective architectural bottlenecks. Based on the insights on the bottlenecks, we finally present two scheduling optimizations, namely, 1. CPU‑Aware Overlapped Micro‑Batching (COMB) and 2. Mixed Agentic Scheduling (MAS) on homogeneous and heterogeneous agentic workloads, respectively. In specific, these methods optimize for improved CPU‑GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB in yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open‑loop load. Additionally, for heterogeneous open‑loop load, MAS can reduce the total latency for minority request‑type by up to 2.37x/2.49x at P50/P90 percentile.

PaperID: 7629, https://arxiv.org/pdf/2511.00651.pdf

Authors: Chenhua Shi, Bhavika Jalli, Gregor Macdonald, John Zou, Wanlu Lei, Mridul Jain, Joji Philip

Title: Leveraging Multi-Agent System (MAS) and Fine-Tuned Small Language Models (SLMs) for Automated Telecom Network Troubleshooting

Abstract:
Telecom networks are rapidly growing in scale and complexity, making effective management, operation, and optimization increasingly challenging. Although Artificial Intelligence (AI) has been applied to many telecom tasks, existing models are often narrow in scope, require large amounts of labeled data, and struggle to generalize across heterogeneous deployments. Consequently, network troubleshooting continues to rely heavily on Subject Matter Experts (SMEs) to manually correlate various data sources to identify root causes and corrective actions. To address these limitations, we propose a Multi‑Agent System (MAS) that employs an agentic workflow, with Large Language Models (LLMs) coordinating multiple specialized tools for fully automated network troubleshooting. Once faults are detected by AI/ML‑based monitors, the framework dynamically activates agents such as an orchestrator, solution planner, executor, data retriever, and root‑cause analyzer to diagnose issues and recommend remediation strategies within a short time frame. A key component of this system is the solution planner, which generates appropriate remediation plans based on internal documentation. To enable this, we fine‑tuned a Small Language Model (SLM) on proprietary troubleshooting documents to produce domain‑grounded solution plans. Experimental results demonstrate that the proposed framework significantly accelerates troubleshooting automation across both Radio Access Network (RAN) and Core network domains.

PaperID: 7630, https://arxiv.org/pdf/2511.00628.pdf

Authors: Yang Li, Siqi Ping, Xiyu Chen, Xiaojian Qi, Zigan Wang, Ye Luo, Xiaowei Zhang

Title: AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems

Abstract:
With the rapid progress of large language models (LLMs), LLM‑powered multi‑agent systems (MAS) are drawing increasing interest across academia and industry. However, many current MAS frameworks struggle with reliability and scalability, especially on complex tasks. We present AgentGit, a framework that brings Git‑like rollback and branching to MAS workflows. Built as an infrastructure layer on top of LangGraph, AgentGit supports state commit, revert, and branching, allowing agents to traverse, compare, and explore multiple trajectories efficiently. To evaluate AgentGit, we designed an experiment that optimizes target agents by selecting better prompts. We ran a multi‑step A/B test against three baselines ‑‑ LangGraph, AutoGen, and Agno ‑‑ on a real‑world task: retrieving and analyzing paper abstracts. Results show that AgentGit significantly reduces redundant computation, lowers runtime and token usage, and supports parallel exploration across multiple branches, enhancing both reliability and scalability in MAS development. This work offers a practical path to more robust MAS design and enables error recovery, safe exploration, iterative debugging, and A/B testing in collaborative AI systems.

PaperID: 7631, https://arxiv.org/pdf/2511.00551.pdf

Authors: Qiang Li, Ningjing Zeng, Lina Yu

Title: Single-agent Reinforcement Learning Model for Regional Adaptive Traffic Signal Control

Abstract:
Several studies have employed reinforcement learning (RL) to address the challenges of regional adaptive traffic signal control (ATSC) and achieved promising results. In this field, existing research predominantly adopts multi‑agent frameworks. However, the adoption of multi‑agent frameworks presents challenges for scalability. Instead, the Traffic signal control (TSC) problem necessitates a single‑agent framework. TSC inherently relies on centralized management by a single control center, which can monitor traffic conditions across all roads in the study area and coordinate the control of all intersections. This work proposes a single‑agent RL‑based regional ATSC model compatible with probe vehicle technology. Key components of the RL design include state, action, and reward function definitions. To facilitate learning and manage congestion, both state and reward functions are defined based on queue length, with action designed to regulate queue dynamics. The queue length definition used in this study differs slightly from conventional definitions but is closely correlated with congestion states. More importantly, it allows for reliable estimation using link travel time data from probe vehicles. With probe vehicle data already covering most urban roads, this feature enhances the proposed method's potential for widespread deployment. The method was comprehensively evaluated using the SUMO simulation platform. Experimental results demonstrate that the proposed model effectively mitigates large‑scale regional congestion levels via coordinated multi‑intersection control.

PaperID: 7632, https://arxiv.org/pdf/2511.00549.pdf

Authors: Qiang Li, Jin Niu, Lina Yu

Title: Robust Single-Agent Reinforcement Learning for Regional Traffic Signal Control Under Demand Fluctuations

Abstract:
Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real‑world traffic complexity and dynamics. This study introduces a novel single‑agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi‑agent systems through a centralized decision‑making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real‑time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model's effectiveness: under inference scenarios with multi‑level (10%, 20%, 30%) Origin‑Destination (OD) demand fluctuations, the framework exhibits robust anti‑fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.

PaperID: 7633, https://arxiv.org/pdf/2511.00488.pdf

Authors: Jun Gao, Yun Peng, Xiaoxue Ren

Title: \texttt{ReMind}: Understanding Deductive Code Reasoning in LLMs

Abstract:
Large Language Models (LLMs) have achieved remarkable progress in code‑related tasks. Despite their advancement, empirical evidence reveals that they still struggle with \emphdeductive code reasoning, the ability to reason about the program execution process. While prior studies have recognized this limitation, the underlying causes remain largely underexplored. In this paper, we begin by presenting a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero‑shot generalization on complex benchmarks. In light of these challenges, we propose \textttReMind, a multi‑agent framework composed of \textttMutator, \textttExecutor, and \textttInspector. The \textttMutator generates code variants to mitigate bias towards code sources, the \textttExecutor traces variable states step‑by‑step to expose inconsistency, and the \textttInspector identifies problematic reasoning steps and provides control‑flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, \textttReMind systematically identifies and refines reasoning flaws, achieving outstanding performance and enabling robust zero‑shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the superior advantages of \textttReMind compared to baseline approaches in deductive code reasoning.

PaperID: 7634, https://arxiv.org/pdf/2511.00450.pdf

Authors: Vahid Etemadi, Gregorio Robles

Title: SmartDoc: A Context-Aware Agentic Method Comment Generation Plugin

Abstract:
Context: The software maintenance phase involves many activities such as code refactoring, bug fixing, code review or testing. Program comprehension is key to all these activities, as it demands developers to grasp the knowledge (e.g., implementation details) required to modify the codebase. Methods as main building blocks in a program can offer developers this knowledge source for code comprehension. However, reading entire method statements can be challenging, which necessitates precise and up‑to‑date comments. Objective: We propose a solution as an IntelliJ IDEA plugin, named SmartDoc, that assists developers in generating context‑aware method comments. Method: This plugin acts as an Artificial Intelligence (AI) agent that has its own memory and is augmented by target methods' context. When a request is initiated by the end‑user, the method content and all its nested method calls are used in the comment generation. At the beginning, these nested methods are visited and a call graph is generated. This graph is then traversed using depth‑first search (DFS), enabling the provision of full‑context to enrich Large Language Model (LLM) prompts. Result: The product is a software, as a plugin, developed for Java codebase and installable on IntelliJ IDEA. This plugin can serve concurrently for methods whose comments are being updated , and it shares memory across all flows to avoid redundant calls. o measure the accuracy of this solution, a dedicated test case is run to record SmartDoc generated comments and their corresponding ground truth. For each collected result‑set, three metrics are computed, BERTScore, BLEU and ROUGE‑1. These metrics will determine how accurate the generated comments are in comparison to the ground truth. Result: The obtained accuracy, in terms of the precision, recall and F1, is promising, and lies in the range of 0.80 to 0.90 for BERTScore.

PaperID: 7635, https://arxiv.org/pdf/2511.00387.pdf

Authors: Xiaoling Han, Bin Lin, Zhenyu Na, Bowen Li, Chaoyue Zhang, Ran Zhang

Title: Spatial Crowdsourcing-based Task Allocation for UAV-assisted Maritime Data Collection

Abstract:
Driven by the unceasing development of maritime services, tasks of unmanned aerial vehicle (UAV)‑assisted maritime data collection (MDC) are becoming increasingly diverse, complex and personalized. As a result, effective task allocation for MDC is becoming increasingly critical. In this work, integrating the concept of spatial crowdsourcing (SC), we develop an SC‑based MDC network model and investigate the task allocation problem for UAV‑assisted MDC. In variable maritime service scenarios, tasks are allocated to UAVs based on the spatial and temporal requirements of the tasks, as well as the mobility of the UAVs. To address this problem, we design an SC‑based task allocation algorithm for the MDC (SC‑MDC‑TA). The quality estimation is utilized to assess and regulate task execution quality by evaluating signal to interference plus noise ratio and the UAV energy consumption. The reverse auction is employed to potentially reduce the task waiting time as much as possible while ensuring timely completion. Additionally, we establish typical task allocation scenarios based on maritime service requirements indicated by electronic navigational charts. Simulation results demonstrate that the proposed SC‑MDC‑TA algorithm effectively allocates tasks for various MDC scenarios. Furthermore, compared to the benchmark, the SC‑MDC‑TA algorithm can also reduce the task completion time and lower the UAV energy consumption.

PaperID: 7636, https://arxiv.org/pdf/2511.00384.pdf

Authors: Isa Mammadli, Prajol Shrestha, Jayant Pande, Filip Novkoski, Siddhant Mohapatra, Martial Noirhomme, Andreas Maier, Nicolas Vandewalle, Ana-Suncana Smith

Title: Physics-informed digital twin and onboard control of a brainbot for intelligent active matter

Abstract:
Establishing adaptive particles that sense their state, anticipate their evolution, and compute control inputs onboard has been a major challenge in non‑equilibrium physics. We address this challenge by realizing an autonomous brainbot, building on a recently developed programmable bristlebot. First, we construct a physics‑informed digital twin of the device, based on a kinematic model that reproduces measured trajectory statistics and generates long, statistically faithful synthetic trajectories. The kinematics forms the foundation for implementing onboard model predictive control (MPC), enabling autonomous trajectory tracking, demonstrated by accurate execution of a non‑trivial target path. This provides a proof of principle for a brainbot that senses its state, predicts its evolution, and computes control inputs onboard, unlike conventional active particles with fixed motility, thereby transforming the brainbot into an agentic physical entity. By integrating physical modeling, data‑driven parameter identification, and control into a unified framework, our approach provides a scalable platform for machine‑learning‑enabled multi‑agent studies and lays the groundwork for intelligent, adaptive active matter.

PaperID: 7637, https://arxiv.org/pdf/2511.00370.pdf

Authors: Chaochen Wu, Guan Luo, Meiyun Zuo, Zhitao Fan

Title: Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict

Abstract:
Video moment retrieval uses a text query to locate a moment from a given untrimmed video reference. Locating corresponding video moments with text queries helps people interact with videos efficiently. Current solutions for this task have not considered conflict within location results from different models, so various models cannot integrate correctly to produce better results. This study introduces a reinforcement learning‑based video moment retrieval model that can scan the whole video once to find the moment's boundary while producing its locational evidence. Moreover, we proposed a multi‑agent system framework that can use evidential learning to resolve conflicts between agents' localization output. As a side product of observing and dealing with conflicts between agents, we can decide whether a query has no corresponding moment in a video (out‑of‑scope) without additional training, which is suitable for real‑world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state‑of‑the‑art approaches. Furthermore, the results of our study reveal that modeling competition and conflict of the multi‑agent system is an effective way to improve RL performance in moment retrieval and show the new role of evidential learning in the multi‑agent framework.

PaperID: 7638, https://arxiv.org/pdf/2511.00197.pdf

Authors: Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, He Ye

Title: Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories

Abstract:
The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem‑solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision‑making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state‑of‑the‑art code agents (OpenHands, SWE‑agent, and Prometheus) on the SWE‑Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem‑solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72‑81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings unveiled by our study, provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.

PaperID: 7639, https://arxiv.org/pdf/2511.00181.pdf

Authors: Mengfei Liang, Yiting Qu, Yukun Jiang, Michael Backes, Yang Zhang

Title: From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Abstract:
The rapid evolution of AI‑generated images poses growing challenges to information integrity and media authenticity. Existing detection approaches face limitations in robustness, interpretability, and generalization across diverse generative models, particularly when relying on a single source of visual evidence. We introduce AIFo (Agent‑based Image Forensics), a training‑free framework that formulates AI‑generated image detection as a multi‑stage forensic analysis process through multi‑agent collaboration. The framework integrates a set of forensic tools, including reverse image search, metadata extraction, pre‑trained classifiers, and vision‑language model analysis, and resolves insufficient or conflicting evidence through a structured multi‑agent debate mechanism. An optional memory‑augmented module further enables the framework to incorporate information from historical cases. We evaluate AIFo on a benchmark of 6,000 images spanning controlled laboratory settings and challenging real‑world scenarios, where it achieves 97.05% accuracy and consistently outperforms traditional classifiers and strong vision‑language model baselines. These findings demonstrate the effectiveness of agent‑based procedural reasoning for AI‑generated image detection.

PaperID: 7640, https://arxiv.org/pdf/2511.00122.pdf

Authors: Ran Xu, Yupeng Qi, Jingsen Feng, Xu Chu

Title: Engineering.ai: A Platform for Teams of AI Engineers in Computational Design

Abstract:
In modern engineering practice, human engineers collaborate in specialized teams to design complex products, with each expert completing their respective tasks while communicating and exchanging results and data with one another. While this division of expertise is essential for managing multidisciplinary complexity, it demands substantial development time and cost. Recently, we introduced OpenFOAMGPT (1.0, 2.0), which functions as an autonomous AI engineer for computational fluid dynamics, and turbulence.ai, which can conduct end‑to‑end research in fluid mechanics draft publications and PhD theses. Building upon these foundations, we present Engineering.ai, a platform for teams of AI engineers in computational design. The framework employs a hierarchical multi‑agent architecture where a Chief Engineer coordinates specialized agents consisting of Aerodynamics, Structural, Acoustic, and Optimization Engineers, each powered by LLM with domain‑specific knowledge. Agent‑agent collaboration is achieved through file‑mediated communication for data provenance and reproducibility, while a comprehensive memory system maintains project context, execution history, and retrieval‑augmented domain knowledge to ensure reliable decision‑making across the workflow. The system integrates FreeCAD, Gmsh, OpenFOAM, CalculiX, and BPM acoustic analysis, enabling parallel multidisciplinary simulations while maintaining computational accuracy. The framework is validated through UAV wing optimization. This work demonstrates that agentic‑AI‑enabled AI engineers has the potential to perform complex engineering tasks autonomously. Remarkably, the automated workflow achieved a 100% success rate across over 400 parametric configurations, with zero mesh generation failures, solver convergence issues, or manual interventions required, validating that the framework is trustworthy.

PaperID: 7641, https://arxiv.org/pdf/2511.00116.pdf

Authors: Avisek Naug, Antonio Guillen, Vineet Kumar, Scott Greenwood, Wesley Brewer, Sahand Ghorbanpour, Ashwin Ramesh Babu, Vineet Gundecha, Ricardo Luna Gutierrez, Soumyendu Sarkar

Title: LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers

Abstract:
Liquid cooling is critical for thermal management in high‑density data centers with the rising AI workloads. However, machine learning‑based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC‑Opt, a Sustainable Liquid Cooling (LC) benchmark environment, for reinforcement learning (RL) control strategies in energy‑efficient liquid cooling of high‑performance computing (HPC) systems. Built on the baseline of a high‑fidelity digital twin of Oak Ridge National Lab's Frontier Supercomputer cooling system, LC‑Opt provides detailed Modelica‑based end‑to‑end models spanning site‑level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi‑objective real‑time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi‑agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM‑based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC‑Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.

PaperID: 7642, https://arxiv.org/pdf/2511.00086.pdf

Authors: Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

Title: Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Abstract:
Test‑Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single‑model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute‑optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi‑LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task‑specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent‑REINFORCE, an LLM‑agent‑augmented framework that mirrors the REINFORCE pipeline by mapping sampling‑gradient‑update to sampling‑feedback‑update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi‑LLM collaboration graphs. Experiments show that Agent‑REINFORCE outperforms both traditional and LLM‑based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.

PaperID: 7643, https://arxiv.org/pdf/2511.00039.pdf

Authors: Krishna Kumar Neelakanta Pillai Santha Kumari Amma

Title: Graph-Attentive MAPPO for Dynamic Retail Pricing

Abstract:
Dynamic pricing in retail requires policies that adapt to shifting demand while coordinating decisions across related products. We present a systematic empirical study of multi‑agent reinforcement learning for retail price optimization, comparing a strong MAPPO baseline with a graph‑attention‑augmented variant (MAPPO+GAT) that leverages learned interactions among products. Using a simulated pricing environment derived from real transaction data, we evaluate profit, stability across random seeds, fairness across products, and training efficiency under a standardized evaluation protocol. The results indicate that MAPPO provides a robust and reproducible foundation for portfolio‑level price control, and that MAPPO+GAT further enhances performance by sharing information over the product graph without inducing excessive price volatility. These results indicate that graph‑integrated MARL provides a more scalable and stable solution than independent learners for dynamic retail pricing, offering practical advantages in multi‑product decision‑making.

PaperID: 7644, https://arxiv.org/pdf/2511.00034.pdf

Authors: Aditya Akella

Title: On the Fundamental Limitations of Decentralized Learnable Reward Shaping in Cooperative Multi-Agent Reinforcement Learning

Abstract:
Recent advances in learnable reward shaping have shown promise in single‑agent reinforcement learning by automatically discovering effective feedback signals. However, the effectiveness of decentralized learnable reward shaping in cooperative multi‑agent settings remains poorly understood. We propose DMARL‑RSA, a fully decentralized system where each agent learns individual reward shaping, and evaluate it on cooperative navigation tasks in the simple_spread_v3 environment. Despite sophisticated reward learning, DMARL‑RSA achieves only ‑24.20 +/‑ 0.09 average reward, compared to MAPPO with centralized training at 1.92 +/‑ 0.87 ‑‑ a 26.12‑point gap. DMARL‑RSA performs similarly to simple independent learning (IPPO: ‑23.19 +/‑ 0.96), indicating that advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Interestingly, decentralized methods achieve higher landmark coverage (0.888 +/‑ 0.029 for DMARL‑RSA, 0.960 +/‑ 0.045 for IPPO out of 3 total) but worse overall performance than centralized MAPPO (0.273 +/‑ 0.008 landmark coverage) ‑‑ revealing a coordination paradox between local optimization and global performance. Analysis identifies three critical barriers: (1) non‑stationarity from concurrent policy updates, (2) exponential credit assignment complexity, and (3) misalignment between individual reward optimization and global objectives. These results establish empirical limits for decentralized reward learning and underscore the necessity of centralized coordination for effective multi‑agent cooperation.

PaperID: 7645, https://arxiv.org/pdf/2511.00026.pdf

Authors: Chaitanya Shinde, Divya Garikapati

Title: Gen AI in Automotive: Applications, Challenges, and Opportunities with a Case study on In-Vehicle Experience

Abstract:
Generative Artificial Intelligence is emerging as a transformative force in the automotive industry, enabling novel applications across vehicle design, manufacturing, autonomous driving, predictive maintenance, and in vehicle user experience. This paper provides a comprehensive review of the current state of GenAI in automotive, highlighting enabling technologies such as Generative Adversarial Networks and Variational Autoencoders. Key opportunities include accelerating autonomous driving validation through synthetic data generation, optimizing component design, and enhancing human machine interaction via personalized and adaptive interfaces. At the same time, the paper identifies significant technical, ethical, and safety challenges, including computational demands, bias, intellectual property concerns, and adversarial robustness, that must be addressed for responsible deployment. A case study on Mercedes Benzs MBUX Virtual Assistant illustrates how GenAI powered voice systems deliver more natural, proactive, and personalized in car interactions compared to legacy rule based assistants. Through this review and case study, the paper outlines both the promise and limitations of GenAI integration in the automotive sector and presents directions for future research and development aimed at achieving safer, more efficient, and user centric mobility. Unlike prior reviews that focus solely on perception or manufacturing, this paper emphasizes generative AI in voice based HMI, bridging safety and user experience perspectives.

PaperID: 7646, https://arxiv.org/pdf/2510.27659.pdf

Authors: Alireza Saleh Abadi, Leen-Kiat Soh

Title: Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems

Abstract:
In the rapidly evolving field of multi‑agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynam‑ic nature of agent populations, tasks, and agent types with‑in a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabil‑ities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environ‑ments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre‑defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, in‑troducing new sub‑categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.

PaperID: 7647, https://arxiv.org/pdf/2510.27630.pdf

Authors: Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Yang Xiao, Pengfei Liu

Title: Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

Abstract:
Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long‑horizon, domain‑specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long‑horizon tasks that can take days or months. The second depends on outcome‑driven sampling, which often collapses due to the rarity of valid positive trajectories on domain‑specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action‑level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub‑optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long‑horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM‑4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human‑in‑the‑loop sampling and the robustness of Apollo's design in handling long‑horizon, domain‑specialized tasks.

PaperID: 7648, https://arxiv.org/pdf/2510.27617.pdf

Authors: Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan

Title: VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

Abstract:
Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain‑specific constraints. While prompt engineering and fine‑tuning have limitations in knowledge coverage and training costs, multi‑agent architectures offer a training‑free paradigm to enhance reasoning through collaborative generation. However, current multi‑agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training‑free mixture‑of‑agents (MoA) framework with two synergistic innovations. First, a quality‑guided caching mechanism to maintain all intermediate HDL outputs and enables quality‑based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi‑path generation strategy that leverages C++ and Python as intermediate representations, decomposing specification‑to‑HDL translation into two‑stage processes that exploit LLM fluency in high‑resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15‑‑30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine‑tuned alternatives without requiring costly training.

PaperID: 7649, https://arxiv.org/pdf/2510.27568.pdf

Authors: Ali Asgarov, Umid Suleymanov, Aadyant Khatri

Title: SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

Abstract:
Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi‑step thinking. However, current retrieval‑augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search‑Augmented On‑Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context‑sensitive and computation‑efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD‑level science QA GPQA, SIGMA consistently outperforms both open‑ and closed‑source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi‑agent, on‑demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge‑intensive problem‑solving. We will release the code upon publication.

PaperID: 7650, https://arxiv.org/pdf/2510.27566.pdf

Authors: Yulong Hui, Chao Chen, Zhihang Fu, Yihao Liu, Jieping Ye, Huanchen Zhang

Title: Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval

Abstract:
Retrieval‑Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black‑box querying operation. This confines agents' actions to query issuing, hindering its ability to tackle complex information‑seeking tasks. To address this, we introduce Interact‑RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black‑box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine‑grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning‑enhanced workflow, which enables both zero‑shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end‑to‑end agent via Supervised Fine‑Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact‑RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning‑interaction strategy.

PaperID: 7651, https://arxiv.org/pdf/2510.27506.pdf

Authors: Ke He, Thang X. Vu, Le He, Lisheng Fan, Symeon Chatzinotas, Bjorn Ottersten

Title: Asynchronous Risk-Aware Multi-Agent Packet Routing for Ultra-Dense LEO Satellite Networks

Abstract:
The rise of ultra‑dense LEO constellations creates a complex and asynchronous network environment, driven by their massive scale, dynamic topologies, and significant delays. This unique complexity demands an adaptive packet routing algorithm that is asynchronous, risk‑aware, and capable of balancing diverse and often conflicting QoS objectives in a decentralized manner. However, existing methods fail to address this need, as they typically rely on impractical synchronous decision‑making and/or risk‑oblivious approaches. To tackle this gap, we introduce PRIMAL, an event‑driven multi‑agent routing framework designed specifically to allow each satellite to act independently on its own event‑driven timeline, while managing the risk of worst‑case performance degradation via a principled primal‑dual approach. This is achieved by enabling agents to learn the full cost distribution of the targeted QoS objectives and constrain tail‑end risks. Extensive simulations on a LEO constellation with 1584 satellites validate its superiority in effectively optimizing latency and balancing load. Compared to a recent risk‑oblivious baseline, it reduces queuing delay by over 70%, and achieves a nearly 12 ms end‑to‑end delay reduction in loaded scenarios. This is accomplished by resolving the core conflict between naive shortest‑path finding and congestion avoidance, highlighting such autonomous risk‑awareness as a key to robust routing.

PaperID: 7652, https://arxiv.org/pdf/2510.27452.pdf

Authors: Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang

Title: From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

Abstract:
Scientific illustrations demand both high information density and post‑editability. However, current generative models have two major limitations: Frist, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code‑based generation methods (TikZ or SVG), although providing element‑level control, force users into the cumbersome cycle of "writing‑compiling‑reviewing" and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi‑agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules‑a Manager, a Designer, and a Toolbox‑to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role‑based design allows each element to be explicitly represented and manipulated, enabling true element‑level control and any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven‑dimensional evaluation metrics. It assesses high‑information‑density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision‑language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control,and description on the quality of illustrations.

PaperID: 7653, https://arxiv.org/pdf/2510.27418.pdf

Authors: Junfeng Lu, Yueyan Li

Title: Dynamic Affective Memory Management for Personalized LLM Agents

Abstract:
Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory‑context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian‑inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system's effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that, our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian‑inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long‑term memory systems.

PaperID: 7654, https://arxiv.org/pdf/2510.27417.pdf

Authors: Jarne Besjes, Robbe Nooyens, Tolgahan Bardakci, Mutlu Beyazit, Serge Demeyer

Title: Agentic LLMs for REST API Test Amplification: A Comparative Study Across Cloud Applications

Abstract:
Representational State Transfer (REST) APIs are a cornerstone of modern cloud native systems. Ensuring their reliability demands automated test suites that exercise diverse and boundary level behaviors. Nevertheless, designing such test cases remains a challenging and resource intensive endeavor. This study extends prior work on Large Language Model (LLM) based test amplification by evaluating single agent and multi agent configurations across four additional cloud applications. The amplified test suites maintain semantic validity with minimal human intervention. The results demonstrate that agentic LLM systems can effectively generalize across heterogeneous API architectures, increasing endpoint and parameter coverage while revealing defects. Moreover, a detailed analysis of computational cost, runtime, and energy consumption highlights trade‑offs between accuracy, scalability, and efficiency. These findings underscore the potential of LLM driven test amplification to advance the automation and sustainability of REST API testing in complex cloud environments.

PaperID: 7655, https://arxiv.org/pdf/2510.27383.pdf

Authors: Yueyang Wang, Mehmet Dogar, Gustav Markkula

Title: Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints

Abstract:
Modelling pedestrian‑driver interactions is critical for understanding human road user behaviour and developing safe autonomous vehicle systems. Existing approaches often rely on rule‑based logic, game‑theoretic models, or 'black‑box' machine learning methods. However, these models typically lack flexibility or overlook the underlying mechanisms, such as sensory and motor constraints, which shape how pedestrians and drivers perceive and act in interactive scenarios. In this study, we propose a multi‑agent reinforcement learning (RL) framework that integrates both visual and motor constraints of pedestrian and driver agents. Using a real‑world dataset from an unsignalised pedestrian crossing, we evaluate four model variants, one without constraints, two with either motor or visual constraints, and one with both, across behavioural metrics of interaction realism. Results show that the combined model with both visual and motor constraints performs best. Motor constraints lead to smoother movements that resemble human speed adjustments during crossing interactions. The addition of visual constraints introduces perceptual uncertainty and field‑of‑view limitations, leading the agents to exhibit more cautious and variable behaviour, such as less abrupt deceleration. In this data‑limited setting, our model outperforms a supervised behavioural cloning model, demonstrating that our approach can be effective without large training datasets. Finally, our framework accounts for individual differences by modelling parameters controlling the human constraints as population‑level distributions, a perspective that has not been explored in previous work on pedestrian‑vehicle interaction modelling. Overall, our work demonstrates that multi‑agent RL with human constraints is a promising modelling approach for simulating realistic road user interactions.

PaperID: 7656, https://arxiv.org/pdf/2510.27289.pdf

Authors: Zhengchang Hua, Panagiotis Oikonomou, Karim Djemame, Nikos Tziritas, Georgios Theodoropoulos

Title: A Digital Twin-based Multi-Agent Reinforcement Learning Framework for Vehicle-to-Grid Coordination

Abstract:
The coordination of large‑scale, decentralised systems, such as a fleet of Electric Vehicles (EVs) in a Vehicle‑to‑Grid (V2G) network, presents a significant challenge for modern control systems. While collaborative Digital Twins have been proposed as a solution to manage such systems without compromising the privacy of individual agents, deriving globally optimal control policies from the high‑level information they share remains an open problem. This paper introduces Digital Twin Assisted Multi‑Agent Deep Deterministic Policy Gradient (DT‑MADDPG) algorithm, a novel hybrid architecture that integrates a multi‑agent reinforcement learning framework with a collaborative DT network. Our core contribution is a simulation‑assisted learning algorithm where the centralised critic is enhanced by a predictive global model that is collaboratively built from the privacy‑preserving data shared by individual DTs. This approach removes the need for collecting sensitive raw data at a centralised entity, a requirement of traditional multi‑agent learning algorithms. Experimental results in a simulated V2G environment demonstrate that DT‑MADDPG can achieve coordination performance comparable to the standard MADDPG algorithm while offering significant advantages in terms of data privacy and architectural decentralisation. This work presents a practical and robust framework for deploying intelligent, learning‑based coordination in complex, real‑world cyber‑physical systems.

PaperID: 7657, https://arxiv.org/pdf/2510.27287.pdf

Authors: Harsh Vishwakarma, Ankush Agarwal, Ojas Patil, Chaitanya Devaguptapu, Mahesh Chandran

Title: Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments

Abstract:
Enterprise systems are crucial for enhancing productivity and decision‑making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross‑functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state‑of‑the‑art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise‑focused AI systems.

PaperID: 7658, https://arxiv.org/pdf/2510.27251.pdf

Authors: Bijia Liu, Ronghao Dang

Title: FinPos: A Position-Aware Trading Agent System for Real Financial Markets

Abstract:
The exceptional potential of large language models (LLMs) in handling text information has garnered significant attention in the field of financial trading. However, most existing trading agents operate under intraday, independent unit‑based trading tasks, where decisions are made as isolated directional actions, and thus lack awareness of continuous position management. Therefore, we propose a position‑aware trading task designed to simulate a more realistic market. To address this task, we propose FinPos, a position‑aware trading agent system designed to explicitly model and manage continuous positions. FinPos enhances position awareness through three key mechanisms: (1) professional‑level interpretation of heterogeneous market information; (2) a dual‑agent decision structure that separates directional reasoning from risk‑aware position adjustment; and (3) multi‑timescale reward signals, allowing the agent to internalize position awareness through experiential feedback rather than static instructions alone. Extensive experiments demonstrate that FinPos surpasses state‑of‑the‑art trading agents in the position‑aware trading task, which closely mirrors real market conditions. More importantly, our findings reveal that LLM‑centered agent systems exhibit a vast, largely unexplored potential in long‑term market decision‑making.

PaperID: 7659, https://arxiv.org/pdf/2510.27176.pdf

Authors: Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan

Title: Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Abstract:
Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human‑inspired multi‑agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML‑for‑systems methods that optimize black‑box policies, Glia generates interpretable designs and exposes its reasoning. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto‑scaling that perform at human‑expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.

PaperID: 7660, https://arxiv.org/pdf/2510.27140.pdf

Authors: Chenghao Du, Quanfeng Huang, Tingxuan Tang, Zihao Wang, Adwait Nadkarni, Yue Xiao

Title: Measuring the Security of Mobile LLM Agents under Adversarial Prompts from Untrusted Third-Party Channels

Abstract:
Large Language Models (LLMs) have transformed software development, enabling AI‑powered applications known as LLM‑based agents that promise to automate tasks across diverse apps and workflows. Yet, the security implications of deploying such agents in adversarial mobile environments remain poorly understood. In this paper, we present the first systematic study of security risks in mobile LLM agents. We design and evaluate a suite of adversarial case studies, ranging from opportunistic manipulations such as pop‑up advertisements to advanced, end‑to‑end workflows involving malware installation and cross‑app data exfiltration. Our evaluation covers eight state‑of‑the‑art mobile agents across three architectures, with over 2,000 adversarial and paired benign trials. The results reveal systemic vulnerabilities: low‑barrier vectors such as fraudulent ads succeed with over 80% reliability, while even workflows requiring the circumvention of operating‑system warnings, such as malware installation, are consistently completed by advanced multi‑app agents. By mapping these attacks to the MITRE ATT&CK Mobile framework, we uncover novel privilege‑escalation and persistence pathways unique to LLM‑driven automation. Collectively, our findings provide the first end‑to‑end evidence that mobile LLM agents are exploitable in realistic adversarial settings, where untrusted third‑party channels (e.g., ads, embedded webviews, cross‑app notifications) are an inherent part of the mobile ecosystem.

PaperID: 7661, https://arxiv.org/pdf/2510.27051.pdf

Authors: Aaditya Shukla, Sidney Knowles, Meenakshi Madugula, Dave Farris, Ryan Angilly, Santiago Pombo, Anbang Xu, Lu An, Abhinav Balasubramanian, Tan Yu, Jiaxiang Ren, Rama Akkiraju

Title: Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement

Abstract:
Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA's Mixture‑of‑Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE‑driven data flywheel, we built a closed‑loop system that systematically addresses failures in retrieval‑augmented generation (RAG) pipelines and enables continuous learning. Over a 3‑month post‑deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25%) and query rephrasal errors (3.2%). Using NVIDIA NeMo microservices, we implemented targeted improvements through fine‑tuning. For routing, we replaced a Llama 3.1 70B model with a fine‑tuned 8B variant, achieving 96% accuracy, a 10x reduction in model size, and 70% latency improvement. For query rephrasal, fine‑tuning yielded a 3.7% gain in accuracy and a 40% latency reduction. Our approach demonstrates how human‑in‑the‑loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self‑improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real‑world usage at scale.

PaperID: 7662, https://arxiv.org/pdf/2510.27016.pdf

Authors: Jayden Serenari, Stephen Lee

Title: Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

Abstract:
With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically‑aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrade response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy‑preserving output. We evaluate our approach using real‑world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model's response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.

PaperID: 7663, https://arxiv.org/pdf/2510.26900.pdf

Authors: Jahir Argote-Gerald, Genki Miyauchi, Julian Rau, Paul Trodden, Roderich Gross

Title: Design for One, Deploy for Many: Navigating Tree Mazes with Multiple Agents

Abstract:
Maze‑like environments, such as cave and pipe networks, pose unique challenges for multiple robots to coordinate, including communication constraints and congestion. To address these challenges, we propose a distributed multi‑agent maze traversal algorithm for environments that can be represented by acyclic graphs. It uses a leader‑switching mechanism where one agent, assuming a head role, employs any single‑agent maze solver while the other agents each choose an agent to follow. The head role gets transferred to neighboring agents where necessary, ensuring it follows the same path as a single agent would. The multi‑agent maze traversal algorithm is evaluated in simulations with groups of up to 300 agents, various maze sizes, and multiple single‑agent maze solvers. It is compared against strategies that are naïve, or assume either global communication or full knowledge of the environment. The algorithm outperforms the naïve strategy in terms of makespan and sum‑of‑fuel. It is superior to the global‑communication strategy in terms of makespan but is inferior to it in terms of sum‑of‑fuel. The findings suggest it is asymptotically equivalent to the full‑knowledge strategy with respect to either metric. Moreover, real‑world experiments with up to 20 Pi‑puck robots confirm the feasibility of the approach.

PaperID: 7664, https://arxiv.org/pdf/2510.26832.pdf

Authors: Abha Jha, J. Hunter Priniski, Carolyn Steinle, Fred Morstatter

Title: Simulating hashtag dynamics with networked groups of generative agents

Abstract:
Networked environments shape how information embedded in narratives influences individual and group beliefs and behavior. This raises key questions about how group communication around narrative media impacts belief formation and how such mechanisms contribute to the emergence of consensus or polarization. Language data from generative agents offer insight into how naturalistic forms of narrative interactions (such as hashtag generation) evolve in response to social rewards within networked communication settings. To investigate this, we developed an agent‑based modeling and simulation framework composed of networks of interacting Large Language Model (LLM) agents. We benchmarked the simulations of four state‑of‑the‑art LLMs against human group behaviors observed in a prior network experiment (Study 1) and against naturally occurring hashtags from Twitter (Study 2). Quantitative metrics of network coherence (e.g., entropy of a group's responses) reveal that while LLMs can approximate human‑like coherence in sanitized domains (Study 1's experimental data), effective integration of background knowledge and social context in more complex or politically sensitive narratives likely requires careful and structured prompting.

PaperID: 7665, https://arxiv.org/pdf/2510.26782.pdf

Authors: Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

Title: Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models

Abstract:
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high‑fidelity modeling of deterministic scenarios (such as fixed‑map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high‑fidelity cloning is feasible and the primary bottleneck for long‑horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying temporal contrastive learning principle as a geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically‑Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.

PaperID: 7666, https://arxiv.org/pdf/2510.26752.pdf

Authors: William Overman, Mohsen Bayati

Title: The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy

Abstract:
As increasingly capable agents are deployed, a central safety challenge is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface in which an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or engage in oversight (oversee), and model this interaction as a two‑player Markov game. When this game forms a Markov Potential Game, we prove an alignment guarantee: any increase in the agent's utility from acting more autonomously cannot decrease the human's value. This establishes a form of intrinsic alignment where the agent's incentive to seek autonomy is structurally coupled to the human's welfare. Practically, the framework induces a transparent control layer that encourages the agent to defer when risky and act when safe. While we use gridworld simulations to illustrate the emergence of this collaboration, our primary validation involves an agentic tool‑use task in which two 30B parameter language models are fine‑tuned via independent policy gradient. We demonstrate that even as the agents learn to coordinate on the fly, this framework effectively reduces safety violations in realistic, open‑ended environments.

PaperID: 7667, https://arxiv.org/pdf/2510.26740.pdf

Authors: Ashwin Kumar, William Yeoh

Title: A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation

Abstract:
We introduce the General Incentives‑based Framework for Fairness (GIFF), a novel approach for fair multi‑agent resource allocation that infers fair decision‑making from standard value functions. In resource‑constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action‑value (Q‑)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over‑allocation to already well‑off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF‑modified Q‑values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task‑demonstrate that our framework consistently outperforms strong baselines and can discover far‑sighted, equitable policies. The framework's effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade‑off parameter offers monotonic tuning. Our findings establish GIFF as a robust and principled framework for leveraging standard reinforcement learning components to achieve more equitable outcomes in complex multi‑agent systems.

PaperID: 7668, https://arxiv.org/pdf/2510.26699.pdf

Authors: Aylton Almeida, Laerte Xavier, Marco Tulio Valente

Title: Using Copilot Agent Mode to Automate Library Migration: A Quantitative Assessment

Abstract:
Keeping software systems up to date is essential to avoid technical debt, security vulnerabilities, and the rigidity typical of legacy systems. However, updating libraries and frameworks remains a time consuming and error‑prone process. Recent advances in Large Language Models (LLMs) and agentic coding systems offer new opportunities for automating such maintenance tasks. In this paper, we evaluate the update of a well‑known Python library, SQLAlchemy, across a dataset of ten client applications. For this task, we use the Github's Copilot Agent Mode, an autonomous AI systema capable of planning and executing multi‑step migration workflows. To assess the effectiveness of the automated migration, we also introduce Migration Coverage, a metric that quantifies the proportion of API usage points correctly migrated. The results of our study show that the LLM agent was capable of migrating functionalities and API usages between SQLAlchemy versions (migration coverage: 100%, median), but failed to maintain the application functionality, leading to a low test‑pass rate (39.75%, median).

PaperID: 7669, https://arxiv.org/pdf/2510.26578.pdf

Authors: Jikang Deng, Hui Zhou, Mohamed-Slim Alouini

Title: Two-Timescale Optimization Framework for IAB-Enabled Heterogeneous UAV Networks

Abstract:
In post‑disaster scenarios, the rapid deployment of adequate communication infrastructure is essential to support disaster search, rescue, and recovery operations. To achieve this, uncrewed aerial vehicle (UAV) has emerged as a promising solution for emergency communication due to its low cost and deployment flexibility. However, conventional untethered UAV (U‑UAV) is constrained by size, weight, and power (SWaP) limitations, making it incapable of maintaining the operation of a macro base station. To address this limitation, we propose a heterogeneous UAV‑based framework that integrates tethered UAV (T‑UAV) and U‑UAVs, where U‑UAVs are utilized to enhance the throughput of cell‑edge ground user equipments (G‑UEs) and guarantee seamless connectivity during G‑UEs' mobility to safe zones. It is noted that the integrated access and backhaul (IAB) technique is adopted to support the wireless backhaul of U‑UAVs. Accordingly, we formulate a two‑timescale joint user scheduling and trajectory control optimization problem, aiming to maximize the downlink throughput under asymmetric traffic demands and G‑UEs' mobility. To solve the formulated problem, we proposed a two‑timescale multi‑agent deep deterministic policy gradient (TTS‑MADDPG) algorithm based on the centralized training and distributed execution paradigm. Numerical results show that the proposed algorithm outperforms other benchmarks, including the two‑timescale multi‑agent proximal policy optimization (TTS‑MAPPO) algorithm and MADDPG scheduling method, with robust and higher throughput. Specifically, the proposed algorithm obtains up to 12.2% average throughput gain compared to the MADDPG scheduling method.

PaperID: 7670, https://arxiv.org/pdf/2510.26498.pdf

Authors: Adam E. Flanders, Yifan Peng, Luciano Prevedello, Robyn Ball, Errol Colak, Prahlad Menon, George Shih, Hui-Ming Lin, Paras Lakhani

Title: A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Abstract:
Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel‑based AI triage tool than a single LLM. Methods: 29,766 non‑contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open‑source LLM models and a HIPAA compliant internal version of GPT‑4o using a single multi‑shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open‑source models and consensus were compared to GPT‑4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CTs exam‑report pairs. The highest AUC performance was achieved with llama3.3:70b and GPT‑4o (AUC= 0.78). The average precision was highest for Llama3.3:70b and GPT‑4o (AP=0.75 & 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI) the ideal combination of LLMs were: Full‑9 Ensemble 0.571 (0.552‑0.591), Top‑3 Ensemble 0.558 (0.537‑0.579), Consensus 0.556 (0.539‑0.574), and GPT4o 0.522 (0.500‑0.543). No statistically significant differences were observed between Top‑3, Full‑9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large sized open‑source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.

PaperID: 7671, https://arxiv.org/pdf/2510.26423.pdf

Authors: Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo, Qianru Zhang, See-Kiong Ng

Title: Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis

Abstract:
Test oracle generation in non‑regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi‑agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self‑refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution‑based check, Nexus activates an automated selfrefinement loop, using the specific runtime error to debug and correct the oracle before re‑validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state‑of‑theart baselines. For instance, Nexus improves the test‑level oracle accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT‑4.1‑Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT4.1‑Mini generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.

PaperID: 7672, https://arxiv.org/pdf/2510.26389.pdf

Authors: Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi

Title: Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

Abstract:
Recently, deep multi‑agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long‑term dependencies and non‑Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier‑based low‑frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state‑of‑the‑art (SOTA) performance on long‑term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi‑Agent Challenge v2 (SMACv2).

PaperID: 7673, https://arxiv.org/pdf/2510.26352.pdf

Authors: Kotaro Furuya, Yuichi Kitagawa

Title: The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration

Abstract:
While a multi‑agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction‑centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually‑curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi‑agent LLM teams.

PaperID: 7674, https://arxiv.org/pdf/2510.26270.pdf

Authors: Jiazhen Yuan, Zhike Gong, Jinquan Hang, Zhengbiao Bai, Wei Zhao

Title: Graph-Enhanced Policy Optimization in LLM Agent Training

Abstract:
Multi‑step LLM agents in interactive environments represent a crucial step toward long‑horizon decision‑making. To train such agents, group‑based reinforcement learning is widely adopted, which reinforces trajectories with higher relative performance within the group. However, in most existing methods, every step within a trajectory and every trajectory with the same terminal reward receive identical credit, regardless of their actual contributions. Since different states play different structural roles in an online state‑transition graph built from sampled trajectories, their impacts should be differentiated and converted into task‑aware credit at both the step and trajectory levels. We therefore present Graph‑Enhanced Policy Optimization (GEPO), a framework for dual‑level structural credit assignment in multi‑step LLM agent training. Specifically, GEPO derives a state‑level Task‑Conditioned Criticality score that combines topological betweenness on the state‑transition graph with semantic similarity to the task prompt. Based on this score, trajectory‑level credit is reshaped through a state‑adaptive discount, while step‑level credit is scaled by the criticality of its successor state. Experimental results show that GEPO outperforms the strongest baselines by 1.1% in success rate on ALFWorld, 3.2% on WebShop, and 3.8% on average across search‑augmented QA tasks at the 7B scale. Compared with flat group‑based methods, GEPO reduces across‑seed variance and concentrates gradient signals on the most critical steps.

PaperID: 7675, https://arxiv.org/pdf/2510.26242.pdf

Authors: Xinhang Li, Qing Guo, Junyu Chen, Zheng Guo, Shengzhe Xu, Lei Li, Lin Zhang

Title: Retrieval Augmented Generation-Enhanced Distributed LLM Agents for Generalizable Traffic Signal Control with Emergency Vehicles

Abstract:
With increasing urban traffic complexity, Traffic Signal Control (TSC) is essential for optimizing traffic flow and improving road safety. Large Language Models (LLMs) emerge as promising approaches for TSC. However, they are prone to hallucinations in emergencies, leading to unreliable decisions that may cause substantial delays for emergency vehicles. Moreover, diverse intersection types present substantial challenges for traffic state encoding and cross‑intersection training, limiting generalization across heterogeneous intersections. Therefore, this paper proposes Retrieval Augmented Generation (RAG)‑enhanced distributed LLM agents with Emergency response for Generalizable TSC (REG‑TSC). Firstly, this paper presents an emergency‑aware reasoning framework, which dynamically adjusts reasoning depth based on the emergency scenario and is equipped with a novel Reviewer‑based Emergency RAG (RERAG) to distill specific knowledge and guidance from historical cases, enhancing the reliability and rationality of agents' emergency decisions. Secondly, this paper designs a type‑agnostic traffic representation and proposes a Reward‑guided Reinforced Refinement (R3) for heterogeneous intersections. R3 adaptively samples training experience from diverse intersections with environment feedback‑based priority and fine‑tunes LLM agents with a designed reward‑weighted likelihood loss, guiding REG‑TSC toward high‑reward policies across heterogeneous intersections. On three real‑world road networks with 17 to 177 heterogeneous intersections, extensive experiments show that REG‑TSC reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle waiting time by 83.16%, outperforming other state‑of‑the‑art methods.

PaperID: 7676, https://arxiv.org/pdf/2510.26172.pdf

Authors: Shifu Chen, Dazhen Deng, Zhihong Xu, Sijia Xu, Tai-Quan Peng, Yingcai Wu

Title: Linking Heterogeneous Data with Coordinated Agent Flows for Social Media Analysis

Abstract:
Social media platforms generate massive volumes of heterogeneous data, capturing user behaviors, textual content, temporal dynamics, and network structures. Analyzing such data is crucial for understanding phenomena such as opinion dynamics, community formation, and information diffusion. However, discovering insights from this complex landscape is exploratory, conceptually challenging, and requires expertise in social media mining and visualization. Existing automated approaches, though increasingly leveraging large language models (LLMs), remain largely confined to structured tabular data and cannot adequately address the heterogeneity of social media analysis. We present SIA (Social Insight Agents), an LLM agent system that links heterogeneous multi‑modal data ‑‑ including raw inputs (e.g., text, network, and behavioral data), intermediate outputs, mined analytical results, and visualization artifacts ‑‑ through coordinated agent flows. Guided by a bottom‑up taxonomy that connects insight types with suitable mining and visualization techniques, SIA enables agents to plan and execute coherent analysis strategies. To ensure multi‑modal integration, it incorporates a data coordinator that unifies tabular, textual, and network data into a consistent flow. Its interactive interface provides a transparent workflow where users can trace, validate, and refine the agent's reasoning, supporting both adaptability and trustworthiness. Through expert‑centered case studies and quantitative evaluation, we show that SIA effectively discovers diverse and meaningful insights from social media while supporting human‑agent collaboration in complex analytical tasks.

PaperID: 7677, https://arxiv.org/pdf/2510.26167.pdf

Authors: Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang

Title: ToolRM: Towards Agentic Tool-Use Reward Modeling

Abstract:
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function‑calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool‑use scenarios. To build these models, we propose a novel pipeline that constructs high‑quality pairwise preference data using rule‑based scoring and multidimensional sampling. This yields ToolPref‑Pairwise‑30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench_BFCL, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3‑4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best‑of‑N sampling and self‑correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference‑time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.

PaperID: 7678, https://arxiv.org/pdf/2510.26144.pdf

Authors: Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, Dou Shen

Title: The FM Agent

Abstract:
Large language models (LLMs) are catalyzing the development of autonomous AI research agents for scientific and engineering discovery. We present FM Agent, a novel and general‑purpose multi‑agent framework that leverages a synergistic combination of LLM‑based reasoning and large‑scale evolutionary search to address complex real‑world challenges. The core of FM Agent integrates several key innovations: 1) a cold‑start initialization phase incorporating expert guidance, 2) a novel evolutionary sampling strategy for iterative optimization, 3) domain‑specific evaluators that combine correctness, effectiveness, and LLM‑supervised feedback, and 4) a distributed, asynchronous execution infrastructure built on Ray. Demonstrating broad applicability, our system has been evaluated across diverse domains, including operations research, machine learning, GPU kernel optimization, and classical mathematical problems. FM Agent reaches state‑of‑the‑art results autonomously, without human interpretation or tuning ‑‑ 1976.3 on ALE‑Bench (+5.2%), 43.56% on MLE‑Bench (+4.0pp), up to 20x speedups on KernelBench, and establishes new state‑of‑the‑art(SOTA) results on several classical mathematical problems. Beyond academic benchmarks, FM Agent shows considerable promise for both large‑scale enterprise R\&D workflows and fundamental scientific research, where it can accelerate innovation, automate complex discovery processes, and deliver substantial engineering and scientific advances with broader societal impact.

PaperID: 7679, https://arxiv.org/pdf/2510.26098.pdf

Authors: Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

Title: GUI Knowledge Bench: Revealing the Knowledge Gap of VLMs in GUI Tasks

Abstract:
Vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface knowledge about widget functions, layout semantics, and system states; (2) interaction knowledge about GUI interaction types and effects; and (3) procedure knowledge of task objectives and workflow sequences. We further introduce GUI Knowledge Bench, a benchmark with multiple‑choice and yes/no questions across six platforms (Web, Android, MacOS, Windows, Linux, IOS) and 292 applications. Our evaluation indicates that current VLMs are generally aware of the functions of individual widgets, but lack the GUI‑specific knowledge required to track system states, adhere to GUI interaction conventions, and assess task completion progress. Experiments on real‑world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.

PaperID: 7680, https://arxiv.org/pdf/2510.26037.pdf

Authors: Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, Amin Saied

Title: SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

Abstract:
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red‑teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red‑teaming framework for arbitrary black‑box LLM agents. We employ a dynamic two‑step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool‑use trajectories, and risk sources. Then, it iteratively constructs and refines model‑based adversarial attacks based on the execution trajectories of former attempts. To optimize the red‑teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2 ‑‑ 2.5x boost to the coverage of risk outcomes and tool‑calling trajectories. Our distilled 8B red‑teamer model improves attack success rate by 100%, surpassing the 671B Deepseek‑R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red‑teamer models.

PaperID: 7681, https://arxiv.org/pdf/2510.25992.pdf

Authors: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Title: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Abstract:
Large Language Models (LLMs) often struggle with problems that require multi‑step reasoning. For small‑scale open‑source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine‑Tuning (SFT) tends to overfit long demonstrations through rigid token‑by‑token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step‑wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning‑oriented LLMs.

PaperID: 7682, https://arxiv.org/pdf/2510.25929.pdf

Authors: Ziyi Wang, Carmine Ventre, Maria Polukarov

Title: Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion

Abstract:
Algorithmic collusion has emerged as a central question in AI: Will the interaction between different AI agents deployed in markets lead to collusion? More generally, understanding how emergent behavior, be it a cartel or market dominance from more advanced bots, affects the market overall is an important research question. We propose a hierarchical multi‑agent reinforcement learning framework to study algorithmic collusion in market making. The framework includes a self‑interested market maker (Agent~A), which is trained in an uncertain environment shaped by an adversary, and three bottom‑layer competitors: the self‑interested Agent~B1 (whose objective is to maximize its own PnL), the competitive Agent~B2 (whose objective is to minimize the PnL of its opponent), and the hybrid Agent~B^\star, which can modulate between the behavior of the other two. To analyze how these agents shape the behavior of each other and affect market outcomes, we propose interaction‑level metrics that quantify behavioral asymmetry and system‑level dynamics, while providing signals potentially indicative of emergent interaction patterns. Experimental results show that Agent~B2 secures dominant performance in a zero‑sum setting against B1, aggressively capturing order flow while tightening average spreads, thus improving market execution efficiency. In contrast, Agent~B^\star exhibits a self‑interested inclination when co‑existing with other profit‑seeking agents, securing dominant market share through adaptive quoting, yet exerting a milder adverse impact on the rewards of Agents~A and B1 compared to B2. These findings suggest that adaptive incentive control supports more sustainable strategic co‑existence in heterogeneous agent environments and offers a structured lens for evaluating behavioral design in algorithmic trading systems.

PaperID: 7683, https://arxiv.org/pdf/2510.25850.pdf

Authors: Kevin Qiu, Marek Cygan

Title: Debate2Create: Robot Co-design via Multi-Agent LLM Debate

Abstract:
We introduce Debate2Create (D2C), a multi‑agent LLM framework that formulates robot co‑design as structured, iterative debate grounded in physics‑based evaluation. A design agent and control agent engage in a thesis‑antithesis‑synthesis loop, while pluralistic LLM judges provide multi‑objective feedback to steer exploration. Across five MuJoCo locomotion benchmarks, D2C achieves up to 3.2× the default Ant score and ～9× on Swimmer, outperforming prior LLM‑based methods and black‑box optimization. Iterative debate yields 18‑‑35% gains over compute‑matched zero‑shot generation, and D2C‑generated rewards transfer to default morphologies in 4/5 tasks. Our results demonstrate that structured multi‑agent debate offers an effective alternative to hand‑designed objectives for joint morphology‑reward optimization.

PaperID: 7684, https://arxiv.org/pdf/2510.25819.pdf

Authors: Tobin South, Subramanya Nagabhushanaradhya, Ayesha Dissanayaka, Sarah Cecchetti, George Fletcher, Victor Lu, Aldo Pietropaolo, Dean H. Saxe, Jeff Lombardo, Abhishek Maligehalli Shivalingaiah, Stan Bounev, Alex Keisner, Andor Kesselman, Zack Proser, Ginny Fahs, Andrew Bunyea, Ben Moskowitz, Atul Tulshibagwale, Dazza Greenwood, Jiaxin Pei, Alex Pentland

Title: Identity Management for Agentic AI: The new frontier of authorization, authentication, and security for an AI agent world

Abstract:
The rapid rise of AI agents presents urgent challenges in authentication, authorization, and identity management. Current agent‑centric protocols (like MCP) highlight the demand for clarified best practices in authentication and authorization. Looking ahead, ambitions for highly autonomous agents raise complex long‑term questions regarding scalable access control, agent‑centric identities, AI workload differentiation, and delegated authority. This OpenID Foundation whitepaper is for stakeholders at the intersection of AI agents and access management. It outlines the resources already available for securing today's agents and presents a strategic agenda to address the foundational authentication, authorization, and identity problems pivotal for tomorrow's widespread autonomous systems.

PaperID: 7685, https://arxiv.org/pdf/2510.25793.pdf

Authors: Siavash M. Alamouti, Fay Arjomandi

Title: Optimal Information Combining for Multi-Agent Systems Using Adaptive Bias Learning

Abstract:
Modern multi‑agent systems ranging from sensor networks monitoring critical infrastructure to crowdsourcing platforms aggregating human intelligence can suffer significant performance degradation due to systematic biases that vary with environmental conditions. Current approaches either ignore these biases, leading to suboptimal decisions, or require expensive calibration procedures that are often infeasible in practice. This performance gap has real consequences: inaccurate environmental monitoring, unreliable financial predictions, and flawed aggregation of human judgments. This paper addresses the fundamental question: when can we learn and correct for these unknown biases to recover near‑optimal performance, and when is such learning futile? We develop a theoretical framework that decomposes biases into learnable systematic components and irreducible stochastic components, introducing the concept of learnability ratio as the fraction of bias variance predictable from observable covariates. This ratio determines whether bias learning is worthwhile for a given system. We prove that the achievable performance improvement is fundamentally bounded by this learnability ratio, providing system designers with quantitative guidance on when to invest in bias learning versus simpler approaches. We present the Adaptive Bias Learning and Optimal Combining (ABLOC) algorithm, which iteratively learns bias‑correcting transformations while optimizing combination weights through closedform solutions, guaranteeing convergence to these theoretical bounds. Experimental validation demonstrates that systems with high learnability ratios can recover significant performance (we achieved 40%‑70% of theoretical maximum improvement in our examples), while those with low learnability show minimal benefit, validating our diagnostic criteria for practical deployment decisions.

PaperID: 7686, https://arxiv.org/pdf/2510.25779.pdf

Authors: Gagan Bansal, Wenyue Hua, Zezhou Huang, Adam Fourney, Amanda Swearngin, Will Epperson, Tyler Payne, Jake M. Hofman, Brendan Lucier, Chinmay Singh, Markus Mobius, Akshay Nambi, Archana Yadav, Kevin Gao, David M. Rothschild, Aleksandrs Slivkins, Daniel G. Goldstein, Hussein Mozannar, Nicole Immorlica, Maya Murad, Matthew Vogel, Subbarao Kambhampati, Eric Horvitz, Saleema Amershi

Title: Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

Abstract:
As LLM agents advance, they are increasingly mediating economic decisions, ranging from product discovery to transactions, on behalf of users. Such applications promise benefits but also raise many questions about agent accountability and value for users. Addressing these questions requires understanding how agents behave in realistic market conditions. However, previous research has largely evaluated agents in constrained settings, such as single‑task marketplaces (e.g., negotiation) or structured two‑agent interactions. Real‑world markets are fundamentally different: they require agents to handle diverse economic activities and coordinate within large, dynamic ecosystems where multiple agents with opaque behaviors may engage in open‑ended dialogues. To bridge this gap, we investigate two‑sided agentic marketplaces where Assistant agents represent consumers and Service agents represent competing businesses. To study these interactions safely, we develop Magentic‑Marketplace ‑‑ a simulated environment where Assistants and Services can operate. This environment enables us to study key market dynamics: the utility agents achieve, behavioral biases, vulnerability to manipulation, and how search mechanisms shape market outcomes. Our experiments show that frontier models can approach optimal welfare ‑‑ but only under ideal search conditions. Performance degrades sharply with scale, and all models exhibit severe first‑proposal bias, creating 10‑30x advantages for response speed over quality. These findings reveal how behaviors emerge across market conditions, informing the design of fair and efficient agentic marketplaces.

PaperID: 7687, https://arxiv.org/pdf/2510.25726.pdf

Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He

Title: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Abstract:
Real‑world language agents must handle complex, multi‑step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long‑horizon complexity required to evaluate agents' real‑world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution‑based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high‑quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best‑performing model, Claude‑4.5‑Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open‑weights model DeepSeek‑V3.2‑Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real‑world, long‑horizon task execution.

PaperID: 7688, https://arxiv.org/pdf/2510.25668.pdf

Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp

Title: ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

Abstract:
Vision‑language models (VLMs) excel at interpreting text‑rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long‑DocumEnt Navigation (ALDEN), a multi‑turn reinforcement learning framework that fine‑tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule‑based cross‑level reward that provides both turn‑ and token‑level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual‑semantic anchoring mechanism that applies a dual‑path KL‑divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open‑source datasets, ALDEN achieves state‑of‑the‑art performance on five long‑document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long‑document understanding.

PaperID: 7689, https://arxiv.org/pdf/2510.25650.pdf

Authors: Ahmad Kokhahi, Mary Kurz

Title: Collision avoidance and path finding in a robotic mobile fulfillment system using multi-objective meta-heuristics

Abstract:
Multi‑Agent Path Finding (MAPF) has gained significant attention, with most research focusing on minimizing collisions and travel time. This paper also considers energy consumption in the path planning of automated guided vehicles (AGVs). It addresses two main challenges: i) resolving collisions between AGVs and ii) assigning tasks to AGVs. We propose a new collision avoidance strategy that takes both energy use and travel time into account. For task assignment, we present two multi‑objective algorithms: Non‑Dominated Sorting Genetic Algorithm (NSGA) and Adaptive Large Neighborhood Search (ALNS). Comparative evaluations show that these proposed methods perform better than existing approaches in both collision avoidance and task assignment.

PaperID: 7690, https://arxiv.org/pdf/2510.25621.pdf

Authors: Mohammad Aghajani Asl, Behrooz Minaei Bidgoli

Title: FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

Abstract:
The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high‑stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian‑speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval‑Augmented Generation (RAG) systems, relying on simplistic single‑pass pipelines, fall short on complex, multi‑hop queries requiring multi‑step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end‑to‑end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR‑RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR‑RAG employs a dynamic, self‑correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub‑queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state‑of‑the‑art performance: the system achieves a remarkable 97.0% in Negative Rejection ‑ a 40‑point improvement over baselines ‑ and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

PaperID: 7691, https://arxiv.org/pdf/2510.25612.pdf

Authors: Amit Giloni, Chiara Picardi, Roy Betser, Shamik Bose, Aishvariya Priya Rathina Sabapathy, Roman Vainshtein

Title: Counterfactual-based Agent Influence Ranker for Agentic AI Workflows

Abstract:
An Agentic AI Workflow (AAW), also known as an LLM‑based multi‑agent system, is an autonomous system that assembles several LLM‑based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To this day, there are no existing methods to assess the influence of each agent on the AAW's final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference time execution. We present Counterfactual‑based Agent Influence Ranker (CAIR) ‑ the first method for assessing the influence level of each agent on the AAW's output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task‑agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using an AAWs dataset of our creation, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.

PaperID: 7692, https://arxiv.org/pdf/2510.25597.pdf

Authors: Siddhartha Upadhyay, Ratnangshu Das, Pushpak Jagtap

Title: Incorporating Social Awareness into Control of Unknown Multi-Agent Systems: A Real-Time Spatiotemporal Tubes Approach

Abstract:
This paper presents a decentralized control framework that incorporates social awareness into multi‑agent systems with unknown dynamics to achieve prescribed‑time reach‑avoid‑stay tasks in dynamic environments. Each agent is assigned a social awareness index that quantifies its level of cooperation or self‑interest, allowing heterogeneous social behaviors within the system. Building on the spatiotemporal tube (STT) framework, we propose a real‑time STT framework that synthesizes tubes online for each agent while capturing its social interactions with others. A closed‑form, approximation‑free control law is derived to ensure that each agent remains within its evolving STT, thereby avoiding dynamic obstacles while also preventing inter‑agent collisions in a socially aware manner, and reaching the target within a prescribed time. The proposed approach provides formal guarantees on safety and timing, and is computationally lightweight, model‑free, and robust to unknown disturbances. The effectiveness and scalability of the framework are validated through simulation and hardware experiments on a 2D omnidirectional

PaperID: 7693, https://arxiv.org/pdf/2510.25588.pdf

Authors: Eranga Bandara, Ross Gore, Atmaram Yarlagadda, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty

Title: Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System

Abstract:
The diagnosis of most mental disorders, including psychiatric evaluations, primarily depends on dialogues between psychiatrists and patients. This subjective process can lead to variability in diagnoses across clinicians and patients, resulting in inconsistencies and challenges in achieving reliable outcomes. To address these issues and standardize psychiatric diagnoses, we propose a Fine‑Tuned Large Language Model (LLM) Consortium and OpenAI‑gpt‑oss Reasoning LLM‑enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine‑tuned LLMs trained on conversational datasets involving psychiatrist‑patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus‑based decision‑making process, refined by the OpenAI‑gpt‑oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine‑tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine‑tuned LLMs with the OpenAI‑gpt‑oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine‑tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis paving the way for next‑generation AI‑powered eHealth systems aimed at standardizing psychiatric diagnoses.

PaperID: 7694, https://arxiv.org/pdf/2510.25518.pdf

Authors: Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi

Title: Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation

Abstract:
Retrieval‑Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain‑specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub‑query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross‑encoder‑based context re‑ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question‑‑answer‑‑reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi‑agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain‑specific settings.

PaperID: 7695, https://arxiv.org/pdf/2510.25423.pdf

Authors: Ali Asgari, Annibale Panichella, Pouria Derakhshanfar, Mitchell Olsthoorn

Title: What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow & GitHub Issues

Abstract:
AI Agents have rapidly gained prominence in both research and industry as systems that extend large language models with planning, tool use, memory, and goal‑directed action. Despite this progress, the development and maintenance of Agent systems present recurring engineering difficulties that are not yet well characterized in developer‑facing evidence. To address this gap, this study analyzes developer discussions on Stack Overflow and failure reports from GitHub issue trackers associated with widely used Agent frameworks. For Stack Overflow, an Agent‑focused corpus is constructed through tag expansion and filtering, latent themes are derived using LDA‑MALLET, and topics are manually validated and labeled. For GitHub, a taxonomy of issue themes is developed to capture deployment‑time failures and maintenance burdens. Analysis across both platforms identifies seven Stack Overflow topics (comprising 28 subtopics) and thirteen GitHub issue topics, which are synthesized into five overarching families of major Agent challenges: (1) environment, platforms, and dependency management; (2) retrieval, embeddings, and Agent memory; (3) orchestration and execution control; (4) interaction contracts between models and tools; and (5) runtime reliability and operational robustness. Topic popularity and difficulty are quantified, revealing that widely discussed issues, such as installation and prompting, are often resolved more quickly, whereas retrieval‑ and orchestration‑related challenges are less visible, more complex, and tend to persist as ongoing maintenance burdens on GitHub.

PaperID: 7696, https://arxiv.org/pdf/2510.25394.pdf

Authors: Youan Su

Title: A proof-theoretic approach to uniform interpolation property of multi-agent modal logic

Abstract:
Uniform interpolation property (UIP) is a strengthening of Craig interpolation property. It was first established by Pitts(1992) based on a pure proof‑theoretic method. UIP in multi‑modal \mathbfK_n, \mathbfKD_n and \mathbfKT_n logic have been established by semantic approaches, however, a proof‑theoretic approach is still lacking. Bílková (2007) develops the method in Pitts (1992) to show UIP in classical modal logic \mathbfK and \mathbfKT. This paper further extends Bílková (2007)'s systems to establish the UIP in multi‑agent modal logic \mathbfK_n, \mathbfKD_n and \mathbfKT_n. A purely syntactic algorithm is presented to determine a uniform interpolant formula. It is also shown that quantification over propositional variables can be modeled by UIP in these systems. Furthermore, a direct argument to establish UIP without using second‑order quantifiers is also presented.

PaperID: 7697, https://arxiv.org/pdf/2510.25340.pdf

Authors: Beiwen Zhang, Yongheng Liang, Hejun Wu

Title: Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork

Abstract:
Multi‑agent reinforcement learning (MARl) has achieved strong results in cooperative tasks but typically assumes fixed, fully controlled teams. Ad hoc teamwork (AHT) relaxes this by allowing collaboration with unknown partners, yet existing variants still presume shared conventions. We introduce Multil‑party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple mutually unfamiliar groups of uncontrolled teammates. To address this, we propose MARs, which builds a sparse skeleton graph and applies relational modeling to capture cross‑group dvnamics. Experiments on MPE and starCralt ll show that MARs outperforms MARL and AHT baselines while converging faster.

PaperID: 7698, https://arxiv.org/pdf/2510.25232.pdf

Authors: Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng, Mengyue Wu

Title: From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Abstract:
Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co‑occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi‑agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi‑agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large‑scale dialogue dataset supporting comorbidity, containing 3,000 multi‑turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real‑world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi‑disorder psychiatric screening in a single conversational pass.

PaperID: 7699, https://arxiv.org/pdf/2510.25223.pdf

Authors: Kun Ouyang, Haoyu Wang, Dong Fang

Title: FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

Abstract:
Event log data, recording fine‑grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs‑‑characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures‑‑make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi‑agent evolutionary system that autonomously extracts meaningful and high‑performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight‑guided self‑evolution paradigm. Specifically, FELA employs specialized agents‑‑Idea Agents, Code Agents, and Critic Agents‑‑to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual‑memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain‑relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM‑based multi‑agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real‑world environments.

PaperID: 7700, https://arxiv.org/pdf/2510.25212.pdf

Authors: Lei Han, Jinhao Zhang, Jinhui Liu, Zhiyong Yu, Liang Wang, Quan Wang, Zhiwen Yu

Title: Collaborative Scheduling of Time-dependent UAVs,Vehicles and Workers for Crowdsensing in Disaster Response

Abstract:
Frequent natural disasters cause significant losses to human society, and timely, efficient collection of post‑disaster environmental information is the foundation for effective rescue operations. Due to the extreme complexity of post‑disaster environments, existing sensing technologies such as mobile crowdsensing suffer from weak environmental adaptability, insufficient professional sensing capabilities, and poor practicality of sensing solutions. Therefore, this paper explores a heterogeneous multi‑agent online collaborative scheduling algorithm, HoCs‑MPQ, to achieve efficient collection of post‑disaster environmental information. HoCs‑MPQ models collaboration and conflict relationships among multiple elements through weighted undirected graph construction, and iteratively solves the maximum weight independent set based on multi‑priority queues, ultimately achieving collaborative sensing scheduling of time‑dependent UA Vs, vehicles, and workers. Specifically, (1) HoCs‑MPQ constructs weighted undirected graph nodes based on collaborative relationships among multiple elements and quantifies their weights, then models the weighted undirected graph based on conflict relationships between nodes; (2) HoCs‑MPQ solves the maximum weight independent set based on iterated local search, and accelerates the solution process using multi‑priority queues. Finally, we conducted detailed experiments based on extensive real‑world and simulated data. The experiments show that, compared to baseline methods (e.g., HoCs‑GREEDY, HoCs‑K‑WTA, HoCs‑MADL, and HoCs‑MARL), HoCs‑MPQ improves task completion rates by an average of 54.13%, 23.82%, 14.12%, and 12.89% respectively, with computation time for single online autonomous scheduling decisions not exceeding 3 seconds.

PaperID: 7701, https://arxiv.org/pdf/2510.25191.pdf

Authors: Hongyu Song, Rishabh Dev Yadav, Cheng Guo, Wei Pan

Title: SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning

Abstract:
Autonomous navigation under natural language instructions represents a crucial step toward embodied intelligence, enabling complex task execution in environments ranging from industrial facilities to domestic spaces. However, language‑driven 3D navigation for Unmanned Aerial Vehicles (UAVs) requires precise spatial reasoning, a capability inherently lacking in current zero‑shot Vision‑Language Models (VLMs) which often generate ambiguous outputs and cannot guarantee geometric feasibility. Furthermore, existing Vision‑Language Navigation (VLN) methods are predominantly tailored for 2.5D ground robots, rendering them unable to generalize to the unconstrained 3D spatial reasoning required for aerial tasks in small‑scale, cluttered environments. In this paper, we present SoraNav, a novel framework enabling zero‑shot VLM reasoning for UAV task‑centric navigation. To address the spatial‑semantic gap, we introduce Multi‑modal Visual Annotation (MVA), which encodes 3D geometric priors directly into the VLM's 2D visual input. To mitigate hallucinated or infeasible commands, we propose an Adaptive Decision Making (ADM) strategy that validates VLM proposals against exploration history, seamlessly switching to geometry‑based exploration to avoid dead‑ends and redundant revisits. Deployed on a custom PX4‑based micro‑UAV, SoraNav demonstrates robust real‑world performance. Quantitative results show our approach significantly outperforms state‑of‑the‑art baselines, increasing Success Rate (SR) by 25.7% and navigation efficiency (SPL) by 17.3% in 2.5D scenarios, and achieving improvements of 39.3% (SR) and 24.7% (SPL) in complex 3D scenarios.

PaperID: 7702, https://arxiv.org/pdf/2510.25179.pdf

Authors: Juan Ren, Mark Dras, Usman Naseem

Title: Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

Abstract:
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model‑agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that apply as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context‑aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision‑Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7‑19%, maintains a stable Non‑Following Rate (NF), and improves the Refusal Rate (RR) by 4‑20%, achieving robust, interpretable, and well‑balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine‑grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.

PaperID: 7703, https://arxiv.org/pdf/2510.25110.pdf

Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

Title: DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

Abstract:
Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role‑playing large language models (LLMs) offer a promising way to simulate human‑like interactions, existing research shows that single‑agent alignment does not guarantee authentic multi‑agent group dynamics. Current LLM role‑play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large‑scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi‑agent role‑playing LLMs. DEBATE contains 29,417 messages from multi‑round debate conversations among over 2,792 U.S.‑based participants discussing 107 controversial topics, capturing both publicly‑expressed messages and privately‑reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine‑tuning, achieving improvements in surface‑level metrics (e.g., ROUGE‑L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role‑playing LLM agents for realistically simulating human‑like social dynamics.

PaperID: 7704, https://arxiv.org/pdf/2510.25095.pdf

Authors: Aleksandra Urbańczyk, Krzysztof Czech, Piotr Urbańczyk, Marek Kisiel-Dorohinicki, Aleksander Byrski

Title: Socio-cognitive agent-oriented evolutionary algorithm with trust-based optimization

Abstract:
This paper introduces the Trust‑Based Optimization (TBO), a novel extension of the island model in evolutionary computation that replaces conventional periodic migrations with a flexible, agent‑driven interaction mechanism based on trust or reputation. Experimental results demonstrate that TBO generally outperforms the standard island model evolutionary algorithm across various optimization problems. Nevertheless, algorithm performance varies depending on the problem type, with certain configurations being more effective for specific landscapes or dimensions. The findings suggest that trust and reputation mechanisms provide a flexible and adaptive approach to evolutionary optimization, improving solution quality in many cases.

PaperID: 7705, https://arxiv.org/pdf/2510.25017.pdf

Authors: Qi Lin, Zhenyu Zhang, Viraj Thakkar, Zhenjie Sun, Mai Zheng, Zhichao Cao

Title: StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems

Abstract:
Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM‑based approaches help but usually treat tuning as a single‑shot, system‑specific task, which limits cross‑system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent‑driven auto‑tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents ‑ Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight‑guided configuration exploration), and Reflector (insight generation and management). The design couples an insight‑driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC‑H/C. Relative to out‑of‑the‑box settings and to ELMo‑Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.

PaperID: 7706, https://arxiv.org/pdf/2510.25003.pdf

Authors: Gian Marco Orlando, Jinyi Ye, Valerio La Gatta, Mahdi Saeedi, Vincenzo Moscato, Emilio Ferrara, Luca Luceri

Title: Emergent Coordinated Behaviors in Networked LLM Agents: Modeling the Strategic Dynamics of Information Operations

Abstract:
Generative agents are rapidly advancing in sophistication, raising urgent questions about how they might coordinate when deployed in online ecosystems. This is particularly consequential in information operations (IOs), influence campaigns that aim to manipulate public opinion on social media. While traditional IOs have been orchestrated by human operators and relied on manually crafted tactics, agentic AI promises to make campaigns more automated, adaptive, and difficult to detect. This work presents the first systematic study of emergent coordination among generative agents in simulated IO campaigns. Using generative agent‑based modeling, we instantiate IO and organic agents in a simulated environment and evaluate coordination across operational regimes, from simple goal alignment to team knowledge and collective decision‑making. As operational regimes become more structured, IO networks become denser and more clustered, interactions more reciprocal and positive, narratives more homogeneous, amplification more synchronized, and hashtag adoption faster and more sustained. Remarkably, simply revealing to agents which other agents share their goals can produce coordination levels nearly equivalent to those achieved through explicit deliberation and collective voting. Overall, we show that generative agents, even without human guidance, can reproduce coordination strategies characteristic of real‑world IOs, underscoring the societal risks posed by increasingly automated, self‑organizing IOs.

PaperID: 7707, https://arxiv.org/pdf/2510.24937.pdf

Authors: Jieyu Zhou

Title: OrchVis: Hierarchical Multi-Agent Orchestration for Human Oversight

Abstract:
We introduce OrchVis, a multi‑agent orchestration framework that visualizes, verifies, and coordinates goal‑driven collaboration among LLM‑based agents. Through hierarchical goal alignment, task assignment, and conflict resolution, OrchVis enables humans to supervise complex multi‑agent workflows without micromanaging each step. The system parses user intent into structured goals, monitors execution via automated verification, and exposes inter‑agent dependencies through an interactive planning panel. When conflicts arise, users can explore system‑proposed alternatives and selectively replan. OrchVis advances human‑centered design for multi‑agent systems by combining transparent visualization with adaptive autonomy.

PaperID: 7708, https://arxiv.org/pdf/2510.24802.pdf

Authors: Qiumeng Li, Chunhou Ji, Xinyue Liu

Title: From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation

Abstract:
Understanding and replicating human mobility requires not only spatial‑temporal accuracy but also an awareness of the cognitive hierarchy underlying real‑world travel decisions. Traditional agent‑based or deep learning models can reproduce statistical patterns of movement but fail to capture the semantic coherence and causal logic of human behavior. Large language models (LLMs) show potential, but struggle to balance creative reasoning with strict structural compliance. This study proposes a Hierarchical LLM‑Agent Framework, termed Narrative‑to‑Action, that integrates high‑level narrative reasoning, mid‑level reflective planning, and low‑level behavioral execution within a unified cognitive hierarchy. At the macro level, one agent is employed as a "creative writer" to produce diary‑style narratives rich in motivation and context, then uses another agent as a "structural parser" to convert narratives into machine‑readable plans. A dynamic execution module further grounds agents in geographic environments and enables adaptive behavioral adjustments guided by a novel occupation‑aware metric, Mobility Entropy by Occupation (MEO), which captures heterogeneous schedule flexibility across different occupational personalities. At the micro level, the agent executes concrete actions‑selecting locations, transportation modes, and time intervals‑through interaction with an environmental simulation. By embedding this multi‑layer cognitive process, the framework produces not only synthetic trajectories that align closely with real‑world patterns but also interpretable representations of human decision logic. This research advances synthetic mobility generation from a data‑driven paradigm to a cognition‑driven simulation, providing a scalable pathway for understanding, predicting, and synthesizing complex urban mobility behaviors through hierarchical LLM agents.

PaperID: 7709, https://arxiv.org/pdf/2510.24760.pdf

Authors: Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu

Title: Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

Abstract:
We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.

PaperID: 7710, https://arxiv.org/pdf/2510.24702.pdf

Authors: Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

Title: Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Abstract:
Public research results on large‑scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light‑weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per‑dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training‑ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state‑of‑the‑art or near‑SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain‑specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

PaperID: 7711, https://arxiv.org/pdf/2510.24671.pdf

Authors: Li Li, Tobias Brinkmann, Till Temmen, Markus Eisenbarth, Jakob Andert

Title: Multi-Agent Scenario Generation in Roundabouts with a Transformer-enhanced Conditional Variational Autoencoder

Abstract:
With the increasing integration of intelligent driving functions into serial‑produced vehicles, ensuring their functionality and robustness poses greater challenges. Compared to traditional road testing, scenario‑based virtual testing offers significant advantages in terms of time and cost efficiency, reproducibility, and exploration of edge cases. We propose a Transformer‑enhanced Conditional Variational Autoencoder (CVAE‑T) model for generating multi‑agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts, yet remain relatively underexplored in current research. The results show that the proposed model can accurately reconstruct original scenarios and generate realistic, diverse synthetic scenarios. Besides, two Key‑Performance‑Indicators (KPIs) are employed to evaluate the interactive behavior in the generated scenarios. Analysis of the latent space reveals partial disentanglement, with several latent dimensions exhibiting distinct and interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles. The results demonstrate the model's capability to generate scenarios for the validation of intelligent driving functions involving multi‑agent interactions, as well as to augment data for their development and iterative improvement.

PaperID: 7712, https://arxiv.org/pdf/2510.24645.pdf

Authors: Zengzhuang Xu, Bingguang Hao, Zechuan Wang, Yuntao Wen, Xinyi Xu, Yang Liu, Long Chen, Dong Wang, Maolin Wang, Tong Zhao, Yicheng Chen, Cunyin Peng, Jinjie Gu, Leilei Gan, Xiangyu Zhao, Chenyi Zhuang, Shi Gu

Title: FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use

Abstract:
Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real‑world problems. As this ability becomes increasingly central to advanced AI systems, the need for high‑quality, multi‑turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi‑agent role‑playing, are not powerful enough to generate high‑quality data in real‑world environments. Practical challenges come in three folds: targeted data synthesis, hard query construction, and multi‑turn logical dependency. To address these structural deficiencies, we present FunReason‑MT, a novel data synthesis framework for real‑world multi‑turn tool use. FunReason‑MT resolves the complexity barrier in multi‑turn FC data by employing 1) Environment‑API Graph Interactions to gather varied high‑quality trajectories with targeted tool, 2) Advanced Tool‑Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function‑Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason‑MT generated data achieves state‑of‑the‑art performance among comparable‑sized models. Further performance improvements on BFCLv4 confirm that FunReason‑MT provides a reliable and robust source for agentic learning.

PaperID: 7713, https://arxiv.org/pdf/2510.24611.pdf

Authors: Azadeh Pourkabirian, Amir Masoud Rahmani, Kai Li, Wei Ni

Title: Strategic Task Offloading for Delay-Sensitive IoT Applications: A Game-Theory-Based Demand-Supply Mechanism with Participation Incentives

Abstract:
Delay‑sensitive Internet of Things (IoT) applications have drawn significant attention. Running many of these applications on IoT devices is challenging due to the limited processing resources of these devices and the need for real‑time responses. Task offloading can minimize latency by transferring computationally intensive tasks from IoT devices to resource‑rich edge servers, ensuring delay and performance guarantees. In this paper, we develop a task‑offloading approach for delay‑sensitive IoT applications in edge computing environments. Unlike existing schemes, we model the task offloading problem as an economic demand and supply model to achieve market balance. The proposed model avoids under‑ and over‑supply, ensuring the computational resources at edge servers (supply) are allocated in a manner that best meets the processing and computational needs of user devices (demand). Given the multi‑agent nature of task offloading involving users and service providers with different preferences and objectives, we design a game‑theoretic framework using a Vickrey‑Clarke‑Groves (VCG) auction. This framework analyzes agent interactions and decision‑making processes. Additionally, we develop an incentive mechanism to encourage both parties to participate in the auction. The mechanism maximizes user task offloading to edge servers and motivates edge servers to share their computational resources, achieving profitability for both IoT users and edge servers. Simulations demonstrate our method maximizes social welfare, ensures truthfulness, maintains market balance, and provides latency guarantees for delay‑sensitive IoT applications.

PaperID: 7714, https://arxiv.org/pdf/2510.24515.pdf

Authors: Malintha Fernando, Petter Ögren, Silun Zhang

Title: Stochastic Prize-Collecting Games: Strategic Planning in Multi-Robot Systems

Abstract:
The Team Orienteering Problem (TOP) generalizes many real‑world multi‑robot scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi‑robot systems, they assume that all the robots cooperate toward a single objective; thus, they do not extend to settings where the robots compete in reward‑scarce environments. We propose Stochastic Prize‑Collecting Games (SPCG) as an extension of the TOP to plan in the presence of self‑interested robots operating on a graph, under energy constraints and stochastic transitions. A theoretical study on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCGs that coincides with the optimal routing solution of an equivalent TOP given a rank‑based conflict resolution rule. This work proposes two algorithms: Ordinal Rank Search (ORS) to obtain the ''ordinal rank'' ‑‑one's effective rank in temporarily‑formed local neighborhoods during the games' stages, and Fictitious Ordinal Response Learning (FORL) to obtain best‑response policies against one's senior‑rank opponents. Empirical evaluations conducted on road networks and synthetic graphs under both dynamic and stationary prize distributions show that 1) the state‑aliasing induced by OR‑conditioning enables learning policies that scale more efficiently to large team sizes than those trained with the global index, and 2) Policies trained with FORL generalize better to imbalanced prize distributions than those with other multi‑agent training methods. Finally, the learned policies in the SPCG achieved between 87% and 95% optimality compared to an equivalent TOP solution obtained by mixed‑integer linear programming.

PaperID: 7715, https://arxiv.org/pdf/2510.24428.pdf

Authors: Anh Nguyen Hoang, Minh Le-Anh, Bach Le, Nghi D. Q. Bui

Title: CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Abstract:
Given a large and evolving codebase, the ability to automatically generate holistic, architecture‑aware documentation that captures not only individual functions but also cross‑file, cross‑module, and system‑level interactions remains an open challenge. Comprehensive documentation is essential for long‑term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real‑world software systems. We present CodeWiki, a unified framework for automated repository‑level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi‑agent processing with dynamic task delegation for scalable generation, and (iii) multi‑modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data‑flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi‑dimensional rubrics and LLM‑based assessment protocols. Experimental results show that CodeWiki achieves a 68.79% quality score with proprietary models, outperforming the closed‑source DeepWiki baseline (64.06%) by 4.73%, with particularly strong improvements on high‑level scripting languages (+10.47%). We open‑source CodeWiki to foster future research and community adoption.

PaperID: 7716, https://arxiv.org/pdf/2510.24397.pdf

Authors: Jiarui Qin, Yunjia Xi, Junjie Huang, Renting Rui, Di Yin, Weiwen Liu, Yong Yu, Weinan Zhang, Xing Sun

Title: APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training

Abstract:
With the rapid development of LLM‑based agents, there is a growing trend to incorporate agent‑specific data into the pre‑training stage of LLMs, aiming to better align LLMs with real‑world autonomous task execution. However, current pre‑training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post‑trained models, requiring multi‑turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potentials during pre‑training and guide the model training more effectively. To address this gap, we propose APTBench, a framework that converts real‑world agent tasks and successful trajectories into multiple‑choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, software engineering and deep research. Compared to existing general‑purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost‑effective than full‑scale, end‑to‑end agent evaluations after post‑training.

PaperID: 7717, https://arxiv.org/pdf/2510.24383.pdf

Authors: Juraj Mavračić

Title: Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents

Abstract:
Policy Cards are introduced as a machine‑readable, deployment‑layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version‑controlled, and linked to runtime enforcement or continuous‑audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi‑agent ecosystems. Policy Cards provide a practical mechanism for integrating high‑level governance with hands‑on engineering practice and enabling accountable autonomy at scale.

PaperID: 7718, https://arxiv.org/pdf/2510.24359.pdf

Authors: Pedram Fard, Alaleh Azhir, Neguine Rezaii, Jiazi Tian, Hossein Estiri

Title: An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine

Abstract:
Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi‑agent ecosystem for N‑of‑1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision‑support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low‑density regions, calibration in the small, and risk‑‑coverage trade‑offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.

PaperID: 7719, https://arxiv.org/pdf/2510.24358.pdf

Authors: Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu

Title: Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Abstract:
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating high‑quality project‑level evaluation datasets requires extensive domain expertise, leading to prohibitive annotation costs and limited diversity. Second, while recent Agent‑as‑a‑Judge paradigms address the rigidity of traditional unit tests by enabling flexible metrics, their reliance on In‑Context Learning (ICL) with general LLMs often results in inaccurate assessments that misalign with human standards. To address these challenges, we propose an agent‑driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project‑level tasks. Based on this, we introduce PRDBench, comprising 50 real‑world Python projects across 20 domains, each with structured Product Requirement Documents (PRDs) and comprehensive criteria. Furthermore, to overcome the inaccuracy of general LLM judges, we propose a highly reliable evaluation framework powered by a specialized, fine‑tuned model. Based on Qwen3‑Coder‑30B, our dedicated PRDJudge achieves over 90% human alignment in fixed‑interface scenarios. Extensive experiments demonstrate that our suite provides a scalable, robust, and highly accurate framework for assessing state‑of‑the‑art code agents.

PaperID: 7720, https://arxiv.org/pdf/2510.24303.pdf

Authors: Deniz Gorur, Antonio Rago, Francesca Toni

Title: Retrieval- and Argumentation-Enhanced Multi-Agent LLMs for Judgmental Forecasting (Extended Version with Supplementary Material)

Abstract:
Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi‑agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM‑empowered Relation‑based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG‑ArgLLM agents, extending ArgLLM agents with a form of Retrieval‑Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.

PaperID: 7721, https://arxiv.org/pdf/2510.24145.pdf

Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

Title: OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices

Abstract:
Incident management (IM) is central to the reliability of large‑scale microservice systems. Yet manual IM, where on‑call engineers examine metrics, logs, and traces is labor‑intensive and error‑prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self‑evolving multi‑agent system for IM that employs a training‑free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi‑agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self‑evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state‑of‑the‑art performance and show that OpsAgent is generalizable, interpretable, cost‑efficient, and self‑evolving, making it a practically deployable and sustainable solution for long‑term operation in real‑world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real‑world industrial settings.

PaperID: 7722, https://arxiv.org/pdf/2510.24126.pdf

Authors: Vivek Kalyan, Martin Andrews

Title: Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Abstract:
Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt‑based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL‑trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn‑restricted regimes, during training and at test‑time, that show these agents achieve better results if allowed to operate over longer multi‑turn horizons.

PaperID: 7723, https://arxiv.org/pdf/2510.24109.pdf

Authors: Wenbin Ding, Jun Chen, Mingjia Chen, Fei Xie, Qi Mao, Philip Dames

Title: PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI

Abstract:
The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human‑centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence level of robots, particularly in aspects such as natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened up new pathways for realizing HAI. However, existing LLM‑based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulating agents based on Vision‑Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, which comprises a human‑robot voice interaction module, a vision‑language agent module and an action execution module. The vision‑language agent itself includes a vision‑based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high‑level natural language instruction tasks.

PaperID: 7724, https://arxiv.org/pdf/2510.24052.pdf

Authors: Jongsuk Kim, Jaeyoung Lee, Gyojin Han, Dongjae Lee, Minki Jeong, Junmo Kim

Title: SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration

Abstract:
Recent advancements in deep learning and the availability of high‑quality real‑world driving datasets have propelled end‑to‑end autonomous driving. Despite this progress, relying solely on real‑world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real‑world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real‑world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi‑agent synthetic scenario. We further project path‑level scenarios onto maps and employ a newly developed Map‑to‑BEV Network to derive bird's‑eye‑view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map‑based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.

PaperID: 7725, https://arxiv.org/pdf/2510.24030.pdf

Authors: Ahmet Akkaya Melih, Yamuna Singh, Kunal L. Agarwal, Priya Mukherjee, Kiran Pattnaik, Hanuman Bhatia

Title: Human Machine Social Hybrid Intelligence:A Collaborative Decision Making Framework for Large Model Agent Groups and Human Experts

Abstract:
The rapid advancements in large foundation models and multi‑agent systems offer unprecedented capabilities, yet current Human‑in‑the‑Loop (HiTL) paradigms inadequately integrate human expertise, often leading to cognitive overload and decision‑making bottlenecks in complex, high‑stakes environments. We propose the "Human‑Machine Social Hybrid Intelligence" (HMS‑HI) framework, a novel architecture designed for deep, collaborative decision‑making between groups of human experts and LLM‑powered AI agents. HMS‑HI is built upon three core pillars: (1) a Shared Cognitive Space (SCS) for unified, multi‑modal situational awareness and structured world modeling; (2) a Dynamic Role and Task Allocation (DRTA) module that adaptively assigns tasks to the most suitable agent (human or AI) based on capabilities and workload; and (3) a Cross‑Species Trust Calibration (CSTC) protocol that fosters transparency, accountability, and mutual adaptation through explainable declarations and structured feedback. Validated in a high‑fidelity urban emergency response simulation, HMS‑HI significantly reduced civilian casualties by 72% and cognitive load by 70% compared to traditional HiTL approaches, demonstrating superior decision quality, efficiency, and human‑AI trust. An ablation study confirms the critical contribution of each module, highlighting that engineered trust and shared context are foundational for scalable, synergistic human‑AI collaboration.

PaperID: 7726, https://arxiv.org/pdf/2510.23988.pdf

Authors: Phuc Nguyen Xuan, Thanh Nguyen Canh, Huu-Hung Nguyen, Nak Young Chong, Xiem HoangVan

Title: A Survey on Collaborative SLAM with 3D Gaussian Splatting

Abstract:
This survey comprehensively reviews the evolving field of multi‑robot collaborative Simultaneous Localization and Mapping (SLAM) using 3D Gaussian Splatting (3DGS). As an explicit scene representation, 3DGS has enabled unprecedented real‑time, high‑fidelity rendering, ideal for robotics. However, its use in multi‑robot systems introduces significant challenges in maintaining global consistency, managing communication, and fusing data from heterogeneous sources. We systematically categorize approaches by their architecture ‑‑ centralized, distributed ‑‑ and analyze core components like multi‑agent consistency and alignment, communication‑efficient, Gaussian representation, semantic distillation, fusion and pose optimization, and real‑time scalability. In addition, a summary of critical datasets and evaluation metrics is provided to contextualize performance. Finally, we identify key open challenges and chart future research directions, including lifelong mapping, semantic association and mapping, multi‑model for robustness, and bridging the Sim2Real gap.

PaperID: 7727, https://arxiv.org/pdf/2510.23904.pdf

Authors: Kexin Quan, Dina Albassam, Mengke Wu, Zijian Ding, Jessie Chin

Title: Towards AI as Colleagues: Multi-Agent System Improves Structured Ideation Processes

Abstract:
Most AI systems today are designed to manage tasks and execute predefined steps. This makes them effective for process coordination but limited in their ability to engage in joint problem‑solving with humans or contribute new ideas. We introduce MultiColleagues, a multi‑agent conversational system that shows how AI agents can act as colleagues by conversing with each other, sharing new ideas, and actively involving users in collaborative ideation processes. In a within‑subjects study with 20 participants, we compared MultiColleagues to a single‑agent baseline. Results show that MultiColleagues fostered stronger perceived social presence, and participants rated their outcomes as higher in quality and novelty, with more elaboration during ideation. These findings demonstrate the potential of AI agents to move beyond process partners toward colleagues that share intent, strengthen group dynamics, and collaborate with humans to advance ideas.

PaperID: 7728, https://arxiv.org/pdf/2510.23899.pdf

Authors: Maria G. Mendoza, Addison Kalanther, Daniel Bostwick, Emma Stephan, Chinmay Maheshwari, Shankar Sastry

Title: Coordinated Autonomous Drones for Human-Centered Fire Evacuation in Partially Observable Urban Environments

Abstract:
Autonomous drone technology holds significant promise for enhancing search and rescue operations during evacuations by guiding humans toward safety and supporting broader emergency response efforts. However, their application in dynamic, real‑time evacuation support remains limited. Existing models often overlook the psychological and emotional complexity of human behavior under extreme stress. In real‑world fire scenarios, evacuees frequently deviate from designated safe routes due to panic and uncertainty. To address these challenges, this paper presents a multi‑agent coordination framework in which autonomous Unmanned Aerial Vehicles (UAVs) assist human evacuees in real‑time by locating, intercepting, and guiding them to safety under uncertain conditions. We model the problem as a Partially Observable Markov Decision Process (POMDP), where two heterogeneous UAV agents, a high‑level rescuer (HLR) and a low‑level rescuer (LLR), coordinate through shared observations and complementary capabilities. Human behavior is captured using an agent‑based model grounded in empirical psychology, where panic dynamically affects decision‑making and movement in response to environmental stimuli. The environment features stochastic fire spread, unknown evacuee locations, and limited visibility, requiring UAVs to plan over long horizons to search for humans and adapt in real‑time. Our framework employs the Proximal Policy Optimization (PPO) algorithm with recurrent policies to enable robust decision‑making in partially observable settings. Simulation results demonstrate that the UAV team can rapidly locate and intercept evacuees, significantly reducing the time required for them to reach safety compared to scenarios without UAV assistance.

PaperID: 7729, https://arxiv.org/pdf/2510.23875.pdf

Authors: Eswari Jayakumar, Niladri Sekhar Dash, Debasmita Mukherjee

Title: Large Language Model Agent Personality and Response Appropriateness: Evaluation by Human Linguistic Experts, LLM-as-Judge, and Natural Language Processing Model

Abstract:
While Large Language Model (LLM)‑based agents can be used to create highly engaging interactive applications through prompting personality traits and contextual data, effectively assessing their personalities has proven challenging. This novel interdisciplinary approach addresses this gap by combining agent development and linguistic analysis to assess the prompted personality of LLM‑based agents in a poetry explanation task. We developed a novel, flexible question bank, informed by linguistic assessment criteria and human cognitive learning levels, offering a more comprehensive evaluation than current methods. By evaluating agent responses with natural language processing models, other LLMs, and human experts, our findings illustrate the limitations of purely deep learning solutions and emphasize the critical role of interdisciplinary design in agent development.

PaperID: 7730, https://arxiv.org/pdf/2510.23853.pdf

Authors: Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Parsa Hosseini, Kazem Faghih, Zahra Sodagar, Wenxiao Wang, Soheil Feizi

Title: Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Abstract:
Large language model (LLM) agents are increasingly used to interact with and execute tasks in dynamic environments. However, a critical yet overlooked limitation of these agents is that they, by default, assume a stationary context, failing to account for the real‑world time elapsed between messages. We refer to this as "temporal blindness". This limitation hinders decisions about when to invoke tools, leading agents to either over‑rely on stale context and skip needed tool calls, or under‑rely on it and redundantly repeat tool calls. To study this challenge, we constructed TicToc, a diverse dataset of multi‑turn user‑agent message trajectories across 76 scenarios, spanning dynamic environments with high, medium, and low time sensitivity. We collected human preferences between "calling a tool" and "directly answering" on each sample, and evaluated how well LLM tool‑calling decisions align with human preferences under varying amounts of elapsed time. Our analysis reveals that existing models display poor alignment with human temporal perception, with no model achieving a normalized alignment rate better than 65% when given time stamp information. We also show that naive, prompt‑based alignment techniques have limited effectiveness for most models, but specific post‑training alignment can be a viable way to align multi‑turn LLM tool use with human temporal perception. Our data and findings provide a first step toward understanding and mitigating temporal blindness, offering insights to foster the development of more time‑aware and human‑aligned agents.

PaperID: 7731, https://arxiv.org/pdf/2510.23824.pdf

Authors: Murad Ismayilov, Edwin Meriaux, Shuo Wen, Gregory Dudek

Title: Decentralized Multi-Agent Goal Assignment for Path Planning using Large Language Models

Abstract:
Coordinating multiple autonomous agents in shared environments under decentralized conditions is a long‑standing challenge in robotics and artificial intelligence. This work addresses the problem of decentralized goal assignment for multi‑agent path planning, where agents independently generate ranked preferences over goals based on structured representations of the environment, including grid visualizations and scenario data. After this reasoning phase, agents exchange their goal rankings, and assignments are determined by a fixed, deterministic conflict‑resolution rule (e.g., agent index ordering), without negotiation or iterative coordination. We systematically compare greedy heuristics, optimal assignment, and large language model (LLM)‑based agents in fully observable grid‑world settings. Our results show that LLM‑based agents, when provided with well‑designed prompts and relevant quantitative information, can achieve near‑optimal makespans and consistently outperform traditional heuristics. These findings underscore the potential of language models for decentralized goal assignment in multi‑agent path planning and highlight the importance of information structure in such systems.

PaperID: 7732, https://arxiv.org/pdf/2510.23822.pdf

Authors: Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pentland, Jiaxin Pei

Title: ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents

Abstract:
Long‑horizon tasks requiring multi‑step reasoning and dynamic re‑planning remain challenging for large language models (LLMs). Sequential prompting methods are prone to context drift, loss of goal information, and recurrent failure cycles, while hierarchical prompting methods often weaken cross‑level continuity or incur substantial runtime overhead. We introduce ReCAP (Recursive Context‑Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in LLMs. ReCAP combines three key mechanisms: (i) plan‑ahead decomposition, in which the model generates a full subtask list, executes the first item, and refines the remainder; (ii) structured re‑injection of parent plans, maintaining consistent multi‑level context during recursive return; and (iii) memory‑efficient execution, bounding the active prompt so costs scale linearly with task depth. Together these mechanisms align high‑level goals with low‑level actions, reduce redundant prompting, and preserve coherent context updates across recursion. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long‑horizon reasoning benchmarks, achieving a 32% gain on synchronous Robotouille and a 29% improvement on asynchronous Robotouille under the strict pass@1 protocol.

PaperID: 7733, https://arxiv.org/pdf/2510.23682.pdf

Authors: Gokturk Aytug Akarlar

Title: Beyond Prompt Engineering: Neuro-Symbolic-Causal Architecture for Robust Multi-Objective AI Agents

Abstract:
Large language models show promise as autonomous decision‑making agents, yet their deployment in high‑stakes domains remains fraught with risk. Without architectural safeguards, LLM agents exhibit catastrophic brittleness: identical capabilities produce wildly different outcomes depending solely on prompt framing. We present Chimera, a neuro‑symbolic‑causal architecture that integrates three complementary components ‑ an LLM strategist, a formally verified symbolic constraint engine, and a causal inference module for counterfactual reasoning. We benchmark Chimera against baseline architectures (LLM‑only, LLM with symbolic constraints) across 52‑week simulations in a realistic e‑commerce environment featuring price elasticity, trust dynamics, and seasonal demand. Under organizational biases toward either volume or margin optimization, LLM‑only agents fail catastrophically (total loss of \99K in volume scenarios) or destroy brand trust (‑48.6% in margin scenarios). Adding symbolic constraints prevents disasters but achieves only 43‑87% of Chimera's profit. Chimera consistently delivers the highest returns (\1.52M and \1.96M respectively, some cases +\2.2M) while improving brand trust (+1.8% and +10.8%, some cases +20.86%), demonstrating prompt‑agnostic robustness. Our TLA+ formal verification proves zero constraint violations across all scenarios. These results establish that architectural design not prompt engineering determines the reliability of autonomous agents in production environments. We provide open‑source implementations and interactive demonstrations for reproducibility.

PaperID: 7734, https://arxiv.org/pdf/2510.23664.pdf

Authors: Eranga Bandara, Ross Gore, Xueping Liang, Sachini Rajapakse, Isurunima Kularathne, Pramoda Karunarathna, Peter Foytik, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Amin Hass, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Title: Agentsway -- Software Development Methodology for AI Agents-based Teams

Abstract:
The emergence of Agentic AI is fundamentally transforming how software is designed, developed, and maintained. Traditional software development methodologies such as Agile, Kanban, ShapeUp, etc, were originally designed for human‑centric teams and are increasingly inadequate in environments where autonomous AI agents contribute to planning, coding, testing, and continuous learning. To address this methodological gap, we present "Agentsway" a novel software development framework designed for ecosystems where AI agents operate as first‑class collaborators. Agentsway introduces a structured lifecycle centered on human orchestration, and privacy‑preserving collaboration among specialized AI agents. The framework defines distinct roles for planning, prompting, coding, testing, and fine‑tuning agents, each contributing to iterative improvement and adaptive learning throughout the development process. By integrating fine‑tuned LLMs that leverage outputs and feedback from different agents throughout the development cycle as part of a retrospective learning process, Agentsway enhances domain‑specific reasoning, and explainable decision‑making across the entire software development lifecycle. Responsible AI principles are further embedded across the agents through the coordinated use of multiple fine‑tuned LLMs and advanced reasoning models, ensuring balanced, transparent, and accountable decision‑making. This work advances software engineering by formalizing agent‑centric collaboration, integrating privacy‑by‑design principles, and defining measurable metrics for productivity and trust. Agentsway represents a foundational step toward the next generation of AI‑native, self‑improving software development methodologies. To the best of our knowledge, this is the first research effort to introduce a dedicated methodology explicitly designed for AI agent‑based software engineering teams.

PaperID: 7735, https://arxiv.org/pdf/2510.23615.pdf

Authors: Nishant Doshi

Title: Logic-based Task Representation and Reward Shaping in Multiagent Reinforcement Learning

Abstract:
This paper presents an approach for accelerated learning of optimal plans for a given task represented using Linear Temporal Logic (LTL) in multi‑agent systems. Given a set of options (temporally abstract actions) available to each agent, we convert the task specification into the corresponding Buchi Automaton and proceed with a model‑free approach which collects transition samples and constructs a product Semi Markov Decision Process (SMDP) on‑the‑fly. Value‑based Reinforcement Learning algorithms can then be used to synthesize a correct‑by‑design controller without learning the underlying transition model of the multi‑agent system. The exponential sample complexity due to multiple agents is dealt with using a novel reward shaping approach. We test the proposed algorithm in a deterministic gridworld simulation for different tasks and find that the reward shaping results in significant reduction in convergence times. We also infer that using options becomes increasing more relevant as the state and action space increases in multi‑agent systems.

PaperID: 7736, https://arxiv.org/pdf/2510.23595.pdf

Authors: Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, Jiaxuan You

Title: Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Abstract:
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human‑curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self‑Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human‑annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi‑Agent Evolve (MAE), a framework that enables LLMs to self‑evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co‑evolving. Experiments on Qwen2.5‑3B‑Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data‑efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human‑curated supervision.

PaperID: 7737, https://arxiv.org/pdf/2510.23447.pdf

Authors: Anna Guerra, Francesco Guidi, Pau Closas, Davide Dardari, Petar M. Djuric

Title: Model Proficiency in Centralized Multi-Agent Systems: A Performance Study

Abstract:
Autonomous agents are increasingly deployed in dynamic environments where their ability to perform a given task depends on both individual and team‑level proficiency. While proficiency self‑assessment (PSA) has been studied for single agents, its extension to a team of agents remains underexplored. This letter addresses this gap by presenting a framework for team PSA in centralized settings. We investigate three metrics for centralized team PSA: the measurement prediction bound (MPB), the Kolmogorov‑Smirnov (KS) statistic, and the Kullback‑Leibler (KL) divergence. These metrics quantify the discrepancy between predicted and actual measurements. We use the KL divergence as a reference metric since it compares the true and predictive distributions, whereas the MPB and KS provide efficient indicators for in situ assessment. Simulation results in a target tracking scenario demonstrate that both MPB and KS metrics accurately capture model mismatches, align with the KL divergence reference, and enable real‑time proficiency assessment.

PaperID: 7738, https://arxiv.org/pdf/2510.23443.pdf

Authors: Chiara Bonfanti, Alessandro Druetto, Cataldo Basile, Tharindu Ranasinghe, Marcos Zampieri

Title: A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration

Abstract:
The growing intersection of cybersecurity and law creates a complex information space where traditional legal research tools struggle to deal with nuanced connections between cases, statutes, and technical vulnerabilities. This knowledge divide hinders collaboration between legal experts and cybersecurity professionals. To address this important gap, this work provides a first step towards intelligent systems capable of navigating the increasingly intricate cyber‑legal domain. We demonstrate promising initial results on multilingual tasks.

PaperID: 7739, https://arxiv.org/pdf/2510.23408.pdf

Authors: Abolfazl Younesi, Zahra Najafabadi Samani, Thomas Fahringer

Title: AutoStreamPipe: LLM Assisted Automatic Generation of Data Stream Processing Pipelines

Abstract:
Data pipelines are essential in stream processing as they enable the efficient collection, processing, and delivery of real‑time data, supporting rapid data analysis. In this paper, we present AutoStreamPipe, a novel framework that employs Large Language Models (LLMs) to automate the design, generation, and deployment of stream processing pipelines. AutoStreamPipe bridges the semantic gap between high‑level user intent and platform‑specific implementations across distributed stream processing systems for structured multi‑agent reasoning by integrating a Hypergraph of Thoughts (HGoT) as an extended version of GoT. AutoStreamPipe combines resilient execution strategies, advanced query analysis, and HGoT to deliver pipelines with good accuracy. Experimental evaluations on diverse pipelines demonstrate that AutoStreamPipe significantly reduces development time (x6.3) and error rates (x5.19), as measured by a novel Error‑Free Score (EFS), compared to LLM code‑generation methods.

PaperID: 7740, https://arxiv.org/pdf/2510.23272.pdf

Authors: Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei

Title: Code Aesthetics with Agentic Reward Feedback

Abstract:
Large Language Models (LLMs) have become valuable assistants for developers in code‑related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually‑oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM‑generated code. We first construct AesCode‑358K, a large‑scale instruction‑tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi‑agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO‑AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine‑tuning on AesCode‑358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder‑4B surpasses GPT‑4o and GPT‑4.1, and achieves performance comparable to large open‑source models with 480B‑685B parameters, underscoring the effectiveness of our approach.

PaperID: 7741, https://arxiv.org/pdf/2510.23245.pdf

Authors: Alexandre P Uchoa, Carlo E T Oliveira, Claudia L R Motta, Daniel Schneider

Title: Multi-Stakeholder Alignment in LLM-Powered Collaborative AI Systems: A Multi-Agent Framework for Intelligent Tutoring

Abstract:
The integration of Large Language Models into Intelligent Tutoring Systems pre‑sents significant challenges in aligning with diverse and often conflicting values from students, parents, teachers, and institutions. Existing architectures lack for‑mal mechanisms for negotiating these multi‑stakeholder tensions, creating risks in accountability and bias. This paper introduces the Advisory Governance Layer (AGL), a non‑intrusive, multi‑agent framework designed to enable distributed stakeholder participation in AI governance. The AGL employs specialized agents representing stakeholder groups to evaluate pedagogical actions against their spe‑cific policies in a privacy‑preserving manner, anticipating future advances in per‑sonal assistant technology that will enhance stakeholder value expression. Through a novel policy taxonomy and conflict‑resolution protocols, the frame‑work provides structured, auditable governance advice to the ITS without altering its core pedagogical decision‑making. This work contributes a reference architec‑ture and technical specifications for aligning educational AI with multi‑stakeholder values, bridging the gap between high‑level ethical principles and practical implementation.

PaperID: 7742, https://arxiv.org/pdf/2510.23121.pdf

Authors: Bharath Santhanam, Alex Mitrevski, Santosh Thoduka, Sebastian Houben, Teena Hassan

Title: Reliable Robotic Task Execution in the Face of Anomalies

Abstract:
Learned robot policies have consistently been shown to be versatile, but they typically have no built‑in mechanism for handling the complexity of open environments, making them prone to execution failures; this implies that deploying policies without the ability to recognise and react to failures may lead to unreliable and unsafe robot behaviour. In this paper, we present a framework that couples a learned policy with a method to detect visual anomalies during policy deployment and to perform recovery behaviours when necessary, thereby aiming to prevent failures. Specifically, we train an anomaly detection model using data collected during nominal executions of a trained policy. This model is then integrated into the online policy execution process, so that deviations from the nominal execution can trigger a three‑level sequential recovery process that consists of (i) pausing the execution temporarily, (ii) performing a local perturbation of the robot's state, and (iii) resetting the robot to a safe state by sampling from a learned execution success model. We verify our proposed method in two different scenarios: (i) a door handle reaching task with a Kinova Gen3 arm using a policy trained in simulation and transferred to the real robot, and (ii) an object placing task with a UFactory xArm 6 using a general‑purpose policy model. Our results show that integrating policy execution with anomaly detection and recovery increases the execution success rate in environments with various anomalies, such as trajectory deviations and adversarial human interventions.

PaperID: 7743, https://arxiv.org/pdf/2510.23053.pdf

Authors: Zhiyu Wang, Suman Raj, Rajkumar Buyya

Title: AirFed: A Federated Graph-Enhanced Multi-Agent Reinforcement Learning Framework for Multi-UAV Cooperative Mobile Edge Computing

Abstract:
Multiple Unmanned Aerial Vehicles (UAVs) cooperative Mobile Edge Computing (MEC) systems face critical challenges in coordinating trajectory planning, task offloading, and resource allocation while ensuring Quality of Service (QoS) under dynamic and uncertain environments. Existing approaches suffer from limited scalability, slow convergence, and inefficient knowledge sharing among UAVs, particularly when handling large‑scale IoT device deployments with stringent deadline constraints. This paper proposes AirFed, a novel federated graph‑enhanced multi‑agent reinforcement learning framework that addresses these challenges through three key innovations. First, we design dual‑layer dynamic Graph Attention Networks (GATs) that explicitly model spatial‑temporal dependencies among UAVs and IoT devices, capturing both service relationships and collaborative interactions within the network topology. Second, we develop a dual‑Actor single‑Critic architecture that jointly optimizes continuous trajectory control and discrete task offloading decisions. Third, we propose a reputation‑based decentralized federated learning mechanism with gradient‑sensitive adaptive quantization, enabling efficient and robust knowledge sharing across heterogeneous UAVs. Extensive experiments demonstrate that AirFed achieves 42.9% reduction in weighted cost compared to state‑of‑the‑art baselines, attains over 99% deadline satisfaction and 94.2% IoT device coverage rate, and reduces communication overhead by 54.5%. Scalability analysis confirms robust performance across varying UAV numbers, IoT device densities, and system scales, validating AirFed's practical applicability for large‑scale UAV‑MEC deployments.

PaperID: 7744, https://arxiv.org/pdf/2510.23032.pdf

Authors: Chen-Che Lu, Yun-Cheng Chou, Teng-Ruei Chen

Title: P1GPT: a multi-agent LLM workflow module for multi-modal financial information analysis

Abstract:
Recent advances in large language models (LLMs) have enabled multi‑agent reasoning systems capable of collaborative decision‑making. However, in financial analysis, most frameworks remain narrowly focused on either isolated single‑agent predictors or loosely connected analyst ensembles, and they lack a coherent reasoning workflow that unifies diverse data modalities. We introduce P1GPT, a layered multi‑agent LLM framework for multi‑modal financial information analysis and interpretable trading decision support. Unlike prior systems that emulate trading teams through role simulation, P1GPT implements a structured reasoning pipeline that systematically fuses technical, fundamental, and news‑based insights through coordinated agent communication and integration‑time synthesis. Backtesting on multi‑modal datasets across major U.S. equities demonstrates that P1GPT achieves superior cumulative and risk‑adjusted returns, maintains low drawdowns, and provides transparent causal rationales. These findings suggest that structured reasoning workflows, rather than agent role imitation, offer a scalable path toward explainable and trustworthy financial AI systems.

PaperID: 7745, https://arxiv.org/pdf/2510.23010.pdf

Authors: Ming-Tung Shen, Yuh-Jzer Joung

Title: TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation

Abstract:
Agentic code generation requires large language models (LLMs) capable of complex context management and multi‑step reasoning. Prior multi‑agent frameworks attempt to address these challenges through collaboration, yet they often suffer from rigid workflows and high reasoning recovery costs. To overcome these limitations, we propose TALM (Tree‑Structured Multi‑Agent Framework with Long‑Term Memory), a dynamic framework that integrates structured task decomposition, localized re‑reasoning, and long‑term memory mechanisms. TALM employs an extensible tree‑based collaboration structure. The parent‑child relationships, when combined with a divide‑and‑conquer strategy, enhance reasoning flexibility and enable efficient error correction across diverse task scopes. Furthermore, a long‑term memory module enables semantic querying and integration of prior knowledge, supporting implicit self‑improvement through experience reuse. Experimental results on HumanEval, BigCodeBench, and ClassEval benchmarks demonstrate that TALM consistently delivers strong reasoning performance and high token efficiency, highlighting its robustness and practical utility in complex code generation tasks.

PaperID: 7746, https://arxiv.org/pdf/2510.22969.pdf

Authors: Kechen Meng, Sinuo Zhang, Rongpeng Li, Xiangming Meng, Yansha Deng, Chan Wang, Ming Lei, Zhifeng Zhao

Title: Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation Planner

Abstract:
In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). Compared to the conventional Model‑Free Reinforcement Learning (MFRL) scheme, Model‑Based RL (MBRL) first learns a generative world model for subsequent planning. The reuse of historical experience in MBRL promises more stable training behavior, yet its deployment in large‑scale wireless networks remains challenging due to high‑dimensional stochastic dynamics, strong inter‑agent cooperation, and communication constraints. To overcome these challenges, we propose the Multi‑Agent Conditional Diffusion Model Planner (MA‑CDMP) for decentralized communication resource management. Built upon the Distributed Training with Decentralized Execution (DTDE) paradigm, MA‑CDMP models each communication node as an autonomous agent and employs Diffusion Models (DMs) to capture and predict environment dynamics. Meanwhile, an inverse dynamics model guides action generation, thereby enhancing sample efficiency and policy scalability. Moreover, to approximate large‑scale agent interactions, a Mean‑Field (MF) mechanism is introduced as an assistance to the classifier in DMs. This design mitigates inter‑agent non‑stationarity and enhances cooperation with minimal communication overhead in distributed settings. We further theoretically establish an upper bound on the distributional approximation error introduced by the MF‑based diffusion generation, guaranteeing convergence stability and reliable modeling of multi‑agent stochastic dynamics. Extensive experiments demonstrate that MA‑CDMP consistently outperforms existing MARL baselines in terms of average reward and QoS metrics, showcasing its scalability and practicality for real‑world wireless network optimization.

PaperID: 7747, https://arxiv.org/pdf/2510.22967.pdf

Authors: Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao

Title: MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

Abstract:
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high‑risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long‑form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large‑scale long‑form datasets, multi‑agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long‑form factuality dataset; and develop MAD‑Fact, a debate‑based multi‑agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long‑form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long‑form LLM outputs, guiding their safe deployment in sensitive domains.

PaperID: 7748, https://arxiv.org/pdf/2510.22963.pdf

Authors: Zesen Liu, Zhixiang Zhang, Yuchong Xie, Dongdong She

Title: When Compression Becomes an Attack Surface: Black-Box Attacks on Prompt-Compressed LLM Agents

Abstract:
Prompt compression is increasingly deployed in LLM agents to reduce latency and cost, but it also determines what the backend LLM ultimately sees. We show that, when trusted and untrusted inputs are compressed under a shared budget, this lossy transformation creates a new attack surface: by perturbing only untrusted inputs before compression, an adversary can cause the compressor to discard task‑critical evidence or safety guardrails before inference. Unlike prompt injection, jailbreaks, or RAG poisoning, the attack target is the compressor rather than the backend LLM; the perturbation need not encode a meaningful instruction or survive compression. We formalize this vulnerability as adversarial information loss (AIL), the excess downstream distortion caused by adversarially steering a lossy compressor beyond benign compression alone. To exploit AIL, we present COMA, a transfer‑based black‑box attack that optimizes pre‑compression perturbations using attacker‑side surrogate compressors and backend LLMs. Across three tasks and six compressors, COMA achieves 0.71 average ASR, versus 0.21 for the strongest baseline, and transfers to two real‑world agent case studies.

PaperID: 7749, https://arxiv.org/pdf/2510.22898.pdf

Authors: Vishvesh Bhat, Omkar Ghugarkar, Julian McAuley

Title: On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset

Abstract:
Generalization across Agentic tool‑calling environments remains a key unsolved challenge in developing reliable agentic reasoning systems. While large language models (LLMs) demonstrate strong performance on isolated benchmarks, their ability to transfer reasoning strategies and co‑ordinate tools across diverse domains is poorly understood. In this work, we conduct a large‑scale evaluation of state‑of‑the‑art LLMs on multiple tool‑calling benchmarksBFCL v3, TauBench, Tau2Bench, and AceBenchand introduce MAVEN (Math & Physics Adversarial Verification & Evaluation Network), a new out of distribution (OOD) benchmark designed to stress‑test multi‑step reasoning through explicit verification and adversarial task composition. Our results show that most current models achieve below 50% accuracy on MAVEN, revealing a significant generalization gap across tool‑use settings. To address this, we present the CoreThink Agentic Reasoner, a framework that augments LLMs with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration. Without additional training, it generalizes across all benchmarks, achieving state‑of‑the‑art performance with 530% improvements over existing baselines at roughly one‑tenth the computational cost.

PaperID: 7750, https://arxiv.org/pdf/2510.22849.pdf

Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong

Title: Once Upon an Input: Reasoning via Per-Instance Program Synthesis

Abstract:
Large language models (LLMs) excel at zero‑shot inference but continue to struggle with complex, multi‑step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per‑Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance‑level using structural feedback without relying on task‑specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per‑instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini‑2.0‑Flash.

PaperID: 7751, https://arxiv.org/pdf/2510.22811.pdf

Authors: Jingyuan Liu, Hao Qiu, Lin Yang, Mengfan Xu

Title: Distributed Multi-Agent Bandits Over Erdős-Rényi Random Networks

Abstract:
We study the distributed multi‑agent multi‑armed bandit problem with heterogeneous rewards over random communication graphs. Uniquely, at each time step t agents communicate over a time‑varying random graph G_t generated by applying the Erdős‑Rényi model to a fixed connected base graph G (for classical Erdős‑Rényi graphs, G is a complete graph), where each potential edge in G is randomly and independently present with the link probability p. Notably, the resulting random graph is not necessarily connected at each time step. Each agent's arm rewards follow time‑invariant distributions, and the reward distribution for the same arm may differ across agents. The goal is to minimize the cumulative expected regret relative to the global mean reward of each arm, defined as the average of that arm's mean rewards across all agents. To this end, we propose a fully distributed algorithm that integrates the arm elimination strategy with the random gossip algorithm. We theoretically show that the regret upper bound is of order \log T and is highly interpretable, where T is the time horizon. It includes the optimal centralized regret O\left(\sum_k: Δ_k>0 \frac\log TΔ_k\right) and an additional term O\left(\fracN^2 \log Tp λ_N‑1(Lap(G)) + \fracKN^2 \log Tp\right) where N and K denote the total number of agents and arms, respectively. This term reflects the impact of G's algebraic connectivity λ_N‑1(Lap(G)) and the link probability p, and thus highlights a fundamental trade‑off between communication efficiency and regret. As a by‑product, we show a nearly optimal regret lower bound. Finally, our numerical experiments not only show the superiority of our algorithm over existing benchmarks, but also validate the theoretical regret scaling with problem complexity.

PaperID: 7752, https://arxiv.org/pdf/2510.22800.pdf

Authors: Jing Xu

Title: Probing the Representational Geometry of Color Qualia: Dissociating Pure Perception from Task Demands in Brains and AI Models

Abstract:
Probing the computational underpinnings of subjective experience, or qualia, remains a central challenge in cognitive neuroscience. This project tackles this question by performing a rigorous comparison of the representational geometry of color qualia between state‑of‑the‑art AI models and the human brain. Using a unique fMRI dataset with a "no‑report" paradigm, we use Representational Similarity Analysis (RSA) to compare diverse vision models against neural activity under two conditions: pure perception ("no‑report") and task‑modulated perception ("report"). Our analysis yields three principal findings. First, nearly all models align better with neural representations of pure perception, suggesting that the cognitive processes involved in task execution are not captured by current feedforward architectures. Second, our analysis reveals a critical interaction between training paradigm and architecture, challenging the simple assumption that Contrastive Language‑Image Pre‑training(CLIP) training universally improves neural plausibility. In our direct comparison, this multi‑modal training method enhanced brain‑alignment for a vision transformer(ViT), yet had the opposite effect on a ConvNet. Our work contributes a new benchmark task for color qualia to the field, packaged in a Brain‑Score compatible format. This benchmark reveals a fundamental divergence in the inductive biases of artificial and biological vision systems, offering clear guidance for developing more neurally plausible models.

PaperID: 7753, https://arxiv.org/pdf/2510.22787.pdf

Authors: Kamil Szczepanik, Jarosław A. Chudziak

Title: Collaborative LLM Agents for C4 Software Architecture Design Automation

Abstract:
Software architecture design is a fundamental part of creating every software system. Despite its importance, producing a C4 software architecture model, the preferred notation for such architecture, remains manual and time‑consuming. We introduce an LLM‑based multi‑agent system that automates this task by simulating a dialogue between role‑specific experts who analyze requirements and generate the Context, Container, and Component views of the C4 model. Quality is assessed with a hybrid evaluation framework: deterministic checks for structural and syntactic integrity and C4 rule consistency, plus semantic and qualitative scoring via an LLM‑as‑a‑Judge approach. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity. A comparison of four state‑of‑the‑art LLMs shows different strengths relevant to architectural design. This study contributes to automated software architecture design and its evaluation methods.

PaperID: 7754, https://arxiv.org/pdf/2510.22754.pdf

Authors: Chunyu Li, Shoubin Chen, Dong Li, Weixing Xue, Qingquan Li

Title: TWC-SLAM: Multi-Agent Cooperative SLAM with Text Semantics and WiFi Features Integration for Similar Indoor Environments

Abstract:
Multi‑agent cooperative SLAM often encounters challenges in similar indoor environments characterized by repetitive structures, such as corridors and rooms. These challenges can lead to significant inaccuracies in shared location identification when employing point cloud‑based techniques. To mitigate these issues, we introduce TWC‑SLAM, a multi‑agent cooperative SLAM framework that integrates text semantics and WiFi signal features to enhance location identification and loop closure detection. TWC‑SLAM comprises a single‑agent front‑end odometry module based on FAST‑LIO2, a location identification and loop closure detection module that leverages text semantics and WiFi features, and a global mapping module. The agents are equipped with sensors capable of capturing textual information and detecting WiFi signals. By correlating these data sources, TWC‑SLAM establishes a common location, facilitating point cloud alignment across different agents' maps. Furthermore, the system employs loop closure detection and optimization modules to achieve global optimization and cohesive mapping. We evaluated our approach using an indoor dataset featuring similar corridors, rooms, and text signs. The results demonstrate that TWC‑SLAM significantly improves the performance of cooperative SLAM systems in complex environments with repetitive architectural features.

PaperID: 7755, https://arxiv.org/pdf/2510.22732.pdf

Authors: Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, Hadi Amiri

Title: WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation

Abstract:
Large Language Model (LLM) web agents often struggle with long‑horizon web navigation and web task completion in new websites, producing inefficient action sequences unless fine‑tuned on environment‑specific data. We show that experience‑driven memory, combined with look‑ahead action simulation, is sufficient for LLM agents to adapt to unseen web environments by remembering past failures and predicting the consequences of future actions. We introduce WebATLAS (Actor‑Critic Task‑completion with Look‑ahead Action Simulation), a memory‑augmented LLM web agent that learns a lightweight internal model of the environment from interaction experience and performs hypothetical action rollouts before acting in the real world. WebATLAS builds a persistent cognitive map via curiosity‑driven exploration, stores interaction outcomes as experience‑based memory, and evaluates candidate actions in cognitive space using a planner‑‑simulator‑‑critic loop. This enables the agent to reuse past experience, avoid previously unsuccessful behaviors, and generate more efficient plans. We evaluate WebATLAS on the WebArena‑Lite benchmark for autonomous web navigation and demonstrate a success rate of 63%, outperforming the previous state‑of‑the‑art at 53.9%. Unlike previous systems, our modular architecture requires no website‑specific LLM fine‑tuning. Ablation studies confirm that experience‑driven memory, look‑ahead action simulation, and hierarchical replanning play complementary roles in enabling robust, training‑free web agents.

PaperID: 7756, https://arxiv.org/pdf/2510.22626.pdf

Authors: Adhyayan Veer Singh, Aaron Shen, Brian Law, Ahmed Ismail, Jonas Rohweder, Sean O'Brien, Kevin Zhu

Title: SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming

Abstract:
Correctness alone is insufficient: LLM‑generated programs frequently satisfy unit tests while violating contest time or memory budgets. We present SwiftSolve, a complexity‑aware multi‑agent system for competitive programming that couples algorithmic planning with empirical profiling and complexity‑guided repair. We frame competitive programming as a software environment where specialized agents act as programmers, each assuming roles such as planning, coding, profiling, and complexity analysis. A Planner proposes an algorithmic sketch; a deterministic Static Pruner filters high‑risk plans; a Coder emits ISO C++17; a Profiler compiles and executes candidates on a fixed input‑size schedule to record wall time and peak memory; and a Complexity Analyst fits log‑log growth (s, R2) with an LLM fallback to assign a complexity class and dispatch targeted patches to either the Planner or Coder. Agents communicate via typed, versioned JSON; a controller enforces iteration caps and diminishing returns stopping. Evaluated on 26 problems (16 BigO, 10 Codeforces Div. 2) in a POSIX sandbox (2 s / 256‑512 MB), SwiftSolve attains pass@1 = 61.54% (16/26) on the first attempt and Solved@<=3 = 80.77% with marginal latency change (mean 11.96 s to 12.66 s per attempt). Aggregate run‑level success is 73.08% at 12.40 s mean. Failures are predominantly resource‑bound, indicating inefficiency rather than logic errors. Against Claude Opus 4, SwiftSolve improves run‑level success (73.1% vs 52.6%) at approximately 2x runtime overhead (12.4 s vs 6.8 s). Beyond correctness (pass@k), we report efficiency metrics (eff@k for runtime and memory, incidence of TLE or MLE, and complexity fit accuracy on BigO), demonstrating that profiling and complexity‑guided replanning reduce inefficiency while preserving accuracy.

PaperID: 7757, https://arxiv.org/pdf/2510.22568.pdf

Authors: Onur Akgün

Title: SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions

Abstract:
This paper introduces SPIRAL (Self‑Play Incremental Racing Algorithm for Learning), a novel approach for training autonomous drones in multi‑agent racing competitions. SPIRAL distinctively employs a self‑play mechanism to incrementally cultivate complex racing behaviors within a challenging, dynamic environment. Through this self‑play core, drones continuously compete against increasingly proficient versions of themselves, naturally escalating the difficulty of competitive interactions. This progressive learning journey guides agents from mastering fundamental flight control to executing sophisticated cooperative multi‑drone racing strategies. Our method is designed for versatility, allowing integration with any state‑of‑the‑art Deep Reinforcement Learning (DRL) algorithms within its self‑play framework. Simulations demonstrate the significant advantages of SPIRAL and benchmark the performance of various DRL algorithms operating within it. Consequently, we contribute a versatile, scalable, and self‑improving learning framework to the field of autonomous drone racing. SPIRAL's capacity to autonomously generate appropriate and escalating challenges through its self‑play dynamic offers a promising direction for developing robust and adaptive racing strategies in multi‑agent environments. This research opens new avenues for enhancing the performance and reliability of autonomous racing drones in increasingly complex and competitive scenarios.

PaperID: 7758, https://arxiv.org/pdf/2510.22530.pdf

Authors: Sungmin Kang, Sumi Yun, Jingun Hong, Shin Yoo, Gabin An

Title: Finding the Needle in the Crash Stack: Industrial-Scale Crash Root Cause Localization with AutoCrashFL

Abstract:
Fault Localization (FL) aims to identify root causes of program failures. FL typically targets failures observed from test executions, and as such, often involves dynamic analyses to improve accuracy, such as coverage profiling or mutation testing. However, for large industrial software, measuring coverage for every execution is prohibitively expensive, making the use of such techniques difficult. To address these issues and apply FL in an industrial setting, this paper proposes AutoCrashFL, an LLM agent for the localization of crashes that only requires the crashdump from the Program Under Test (PUT) and access to the repository of the corresponding source code. We evaluate AutoCrashFL against real‑world crashes of SAP HANA, an industrial software project consisting of more than 35 million lines of code. Experiments reveal that AutoCrashFL is more effective in localization, as it identified 30% crashes at the top, compared to 17% achieved by the baseline. Through thorough analysis, we find that AutoCrashFL has attractive practical properties: it is relatively more effective for complex bugs, and it can indicate confidence in its results. Overall, these results show the practicality of LLM agent deployment on an industrial scale.

PaperID: 7759, https://arxiv.org/pdf/2510.22514.pdf

Authors: Armel Koulong, Ali Pakniyat

Title: Robust Multi-Agent Safety via Tube-Based Tightened Exponential Barrier Functions

Abstract:
This paper presents a constructive framework for synthesizing provably safe controllers for nonlinear multi‑agent systems subject to bounded disturbances. The methodology applies to systems representable in Brunovsky canonical form, accommodating arbitrary‑order dynamics in multi‑dimensional spaces. The central contribution is a method of constraint tightening that formally couples robust error feedback with nominal trajectory planning. The key insight is that the design of an ancillary feedback law, which confines state errors to a robust positively invariant (RPI) tube, simultaneously provides the exact information needed to ensure the safety of the nominal plan. Specifically, the geometry of the resulting RPI tube is leveraged via its support function to derive state‑dependent safety margins. These margins are then used to systematically tighten the high relative‑degree exponential control barrier function (eCBF) constraints imposed on the nominal planner. This integrated synthesis guarantees that any nominal trajectory satisfying the tightened constraints corresponds to a provably safe trajectory for the true, disturbed system. We demonstrate the practical utility of this formal synthesis method by implementing the planner within a distributed Model Predictive Control (MPC) scheme, which optimizes performance while inheriting the robust safety guarantees.

PaperID: 7760, https://arxiv.org/pdf/2510.22477.pdf

Authors: Yijia Fan, Jusheng Zhang, Jing Yang, Keze Wang

Title: Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization

Abstract:
To combat the prohibitive communication costs of ``free‑for‑all" multi‑agent systems (MAS), we introduce Agent‑GSPO, a framework that directly optimizes for token economy using sequence‑level reinforcement learning. Agent‑GSPO leverages the stable and memory‑efficient Group Sequence Policy Optimization (GSPO) algorithm to train agents on a communication‑aware reward that explicitly penalizes verbosity. Across seven reasoning benchmarks, Agent‑GSPO not only achieves new state‑of‑the‑art performance but does so with a fraction of the token consumption of existing methods. By fostering emergent strategies like ``strategic silence," our approach provides a practical blueprint for developing scalable and economically viable multi‑agent systems.

PaperID: 7761, https://arxiv.org/pdf/2510.22462.pdf

Authors: Abhijnan Nath, Nikhil Krishnaswamy

Title: Learning "Partner-Aware" Collaborators in Multi-Party Collaboration

Abstract:
Large Language Models (LLMs) are increasingly being deployed in agentic settings where they act as collaborators with humans. Therefore, it is increasingly important to be able to evaluate their abilities to collaborate effectively in multi‑turn, multi‑party tasks. In this paper, we build on the AI alignment and safe interruptibility literature to offer novel theoretical insights on collaborative behavior between LLM‑driven collaborator agents and an intervention agent. Our goal is to learn an ideal partner‑aware collaborator that increases the group's common‑ground (CG) alignment on task‑relevant propositions‑by intelligently collecting information provided in interventions by a partner agent. We show how LLM agents trained using standard RLHF and related approaches are naturally inclined to ignore possibly well‑meaning interventions, which makes increasing group common ground non‑trivial in this setting. We employ a two‑player Modified‑Action MDP to examine this suboptimal behavior of standard AI agents, and propose Interruptible Collaborative Roleplayer (ICR)‑a novel partner‑aware learning algorithm to train CG‑optimal collaborators. Experiments on multiple collaborative task environments show that ICR, on average, is more capable of promoting successful CG convergence and exploring more diverse solutions in such tasks.

PaperID: 7762, https://arxiv.org/pdf/2510.22444.pdf

Authors: Sinan Bugu

Title: Resilience of Entanglement-Induced Coordination in Adversarial Environments: The Team-Based Quantum Sabotage Game

Abstract:
Quantumgametheoryextendsclassical strategic decision‑making by incorporating quantum superposition, entanglement, and measurement‑induced randomness into competitive interactions. This paper introduces a team‑based Quantum Sabotage Game (QSG), in which classical and quantum‑enhanced teams engage in adversarial decision‑making under identical information constraints. Unlike baseline classical teams, whose members act independently, quantum teams employ entanglement‑assisted coor dination, generating structured correlations among decentralized actions without classical communication. We develop a formal quantum game‑theoretic framework and analyze multi‑agent strategies using Bell and Wentangled states, benchmarked against size‑equivalent classical teams. Using numerical simulations, we compare outcome distributions, correlation structure, and robustness under ideal conditions, standard quantum noise models, and a device‑inspired, reproducible hardware‑like noise model via the Qiskit Aer FakeKyiv backend. While the symmetric payoff structure precludes any asymptotic increase in expected utility, multipartite entanglement, particularly W‑state correlations, reshapes the finite‑run joint‑action distribution, producing nonclassical coordination patterns rather than an expectation‑value advantage. These patterns persist under realistic noise, demonstrating that the resulting correlation signatures remain observable and differ from those produced by independent classical sampling. These results clarify the operational role of entanglement in adversarial environments, distinguishing correlation‑based coordination from expectation‑value advantage, and establish the Quantum Sabotage Game as a testbed for studying noise‑resilient multi‑agent quantum decision‑making.

PaperID: 7763, https://arxiv.org/pdf/2510.22431.pdf

Authors: Zheng Wei, Mingchen Li, Zeqian Zhang, Ruibin Yuan, Pan Hui, Huamin Qu, James Evans, Maneesh Agrawala, Anyi Rao

Title: Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration

Abstract:
Recent advancements in multi‑agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi‑agent collaboration. First, we propose OmniAgent, a hierarchical, graph‑based multi‑agent framework for long video generation that leverages a film‑production‑inspired architecture to enable modular specialization and scalable inter‑agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi‑agent systems in creative tasks.

PaperID: 7764, https://arxiv.org/pdf/2510.22422.pdf

Authors: Ariel Flint, Luca Maria Aiello, Romualdo Pastor-Satorras, Andrea Baronchelli

Title: Group size effects and collective misalignment in LLM multi-agent systems

Abstract:
Multi‑agent systems of large language models (LLMs) are rapidly expanding across domains, introducing dynamics not captured by single‑agent evaluations. Yet, existing work has mostly contrasted the behavior of a single agent with that of a collective of fixed size, leaving open a central question: how does group size shape dynamics? Here, we move beyond this dichotomy and systematically explore outcomes across the full range of group sizes. We focus on multi‑agent misalignment, building on recent evidence that interacting LLMs playing a simple coordination game can generate collective biases absent in individual models. First, we show that collective bias is a deeper phenomenon than previously assessed: interaction can amplify individual biases, introduce new ones, or override model‑level preferences. Second, we demonstrate that group size affects the dynamics in a non‑linear way, revealing model‑dependent dynamical regimes. Finally, we develop a mean‑field analytical approach and show that, above a critical population size, simulations converge to deterministic predictions that expose the basins of attraction of competing equilibria. These findings establish group size as a key driver of multi‑agent dynamics and highlight the need to consider population‑level effects when deploying LLM‑based systems at scale.

PaperID: 7765, https://arxiv.org/pdf/2510.22396.pdf

Authors: Zhaoyang Li, Zheng Yu, Jingyi Song, Meng Xu, Yuxuan Luo, Dongliang Mu

Title: PortGPT: Towards Automated Backporting Using Large Language Models

Abstract:
Patch backporting, the process of migrating mainline security patches to older branches, is an essential task in maintaining popular open‑source projects (e.g., Linux kernel). However, manual backporting can be labor‑intensive, while existing automated methods, which heavily rely on predefined syntax or semantic rules, often lack agility for complex patches. In this paper, we introduce PORTGPT, an LLM‑agent for end‑to‑end automation of patch backporting in real‑world scenarios. PORTGPT enhances an LLM with tools to access code on‑demand, summarize Git history, and revise patches autonomously based on feedback (e.g., from compilers), hence, simulating human‑like reasoning and verification. PORTGPT achieved an 89.15% success rate on existing datasets (1815 cases), and 62.33% on our own dataset of 146 complex cases, both outperforms state‑of‑the‑art of backporting tools. We contributed 9 backported patches from PORTGPT to the Linux kernel community and all patches are now merged.

PaperID: 7766, https://arxiv.org/pdf/2510.22320.pdf

Authors: Yanqing Fu, Chenrun Wang, Chao Huang, Zhuping Wang

Title: IFS: Information Flow Structure for Multi-agent Ad Hoc System

Abstract:
Multi‑agent ad hoc systems are dynamic collaborative systems in which multiple autonomous agents must cooperate with both known and unknown teammates in open environments, without relying on pre‑coordinated strategies. These systems operate under conditions of uncertainty and partial observability, where team composition, agent behaviors, and environmental factors may change during execution. Through an analysis of information flow in such systems, we identify two key limitations in existing research: insufficient information flow and limited information processing capacity. To address these issues, we propose an information flow structure for multi‑agent ad hoc systems (IFS), which tackles these challenges from the perspectives of communication and information fusion. Experimental results in StarCraft II demonstrate that IFS significantly improves both information flow and processing capacity, while exhibiting strong generalization capabilities and outperforming baseline methods in complex ad hoc teamwork scenarios.

PaperID: 7767, https://arxiv.org/pdf/2510.22270.pdf

Authors: Jishu Zhao, Xi Wang, Jinlong Lei, Shixiang Chen

Title: Distributed Stochastic Proximal Algorithm on Riemannian Submanifolds for Weakly-convex Functions

Abstract:
This paper aims to investigate the distributed stochastic optimization problems on compact embedded submanifolds (in the Euclidean space) for multi‑agent network systems. To address the manifold structure, we propose a distributed Riemannian stochastic proximal algorithm framework by utilizing the retraction and Riemannian consensus protocol, and analyze three specific algorithms: the distributed Riemannian stochastic subgradient, proximal point, and prox‑linear algorithms. When the local costs are weakly‑convex and the initial points satisfy certain conditions, we show that the iterates generated by this framework converge to a nearly stationary point in expectation while achieving consensus. We further establish the convergence rate of the algorithm framework as \mathcalO(\frac1+κ_g\sqrtk) where k denotes the number of iterations and κ_g shows the impact of manifold geometry on the algorithm performance. Finally, numerical experiments are implemented to demonstrate the theoretical results and show the empirical performance.

PaperID: 7768, https://arxiv.org/pdf/2510.22235.pdf

Authors: Yixiao Nie, Yang Zhang, Yingjie Jin, Zhepeng Wang, Xiu Li, Xiang Li

Title: CGoT: A Novel Inference Mechanism for Embodied Multi-Agent Systems Using Composable Graphs of Thoughts

Abstract:
The integration of self‑driving cars and service robots is becoming increasingly prevalent across a wide array of fields, playing a crucial and expanding role in both industrial applications and everyday life. In parallel, the rapid advancements in Large Language Models (LLMs) have garnered substantial attention and interest within the research community. This paper introduces a novel vehicle‑robot system that leverages the strengths of both autonomous vehicles and service robots. In our proposed system, two autonomous ego‑vehicles transports service robots to locations within an office park, where they perform a series of tasks. The study explores the feasibility and potential benefits of incorporating LLMs into this system, with the aim of enhancing operational efficiency and maximizing the potential of the cooperative mechanisms between the vehicles and the robots. This paper proposes a novel inference mechanism which is called CGOT toward this type of system where an agent can carry another agent. Experimental results are presented to validate the performance of the proposed method.

PaperID: 7769, https://arxiv.org/pdf/2510.22222.pdf

Authors: Yumeng Shi, Zhongliang Yang, Yisi Wang, Linna Zhou

Title: CreditXAI: A Multi-Agent System for Explainable Corporate Credit Rating

Abstract:
In the domain of corporate credit rating, traditional deep learning methods have improved predictive accuracy but still suffer from the inherent 'black‑box' problem and limited interpretability. While incorporating non‑financial information enriches the data and provides partial interpretability, the models still lack hierarchical reasoning mechanisms, limiting their comprehensive analytical capabilities. To address these challenges, we propose CreditXAI, a Multi‑Agent System (MAS) framework that simulates the collaborative decision‑making process of professional credit analysts. The framework focuses on business, financial, and governance risk dimensions to generate consistent and interpretable credit assessments. Experimental results demonstrate that multi‑agent collaboration improves predictive accuracy by more than 7% over the best single‑agent baseline, confirming its significant synergistic advantage in corporate credit risk evaluation. This study provides a new technical pathway to build intelligent and interpretable credit rating models.

PaperID: 7770, https://arxiv.org/pdf/2510.22158.pdf

Authors: Lorenzo Magnino, Kai Shao, Zida Wu, Jiacheng Shen, Mathieu Laurière

Title: Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics

Abstract:
Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large‑scale multi‑agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs, existing methods are typically limited to finite spaces or stationary models, hindering their applicability to real‑world problems. This paper introduces a novel deep reinforcement learning (DRL) algorithm specifically designed for non‑stationary continuous MFGs. The proposed approach builds upon a Fictitious Play (FP) methodology, leveraging DRL for best‑response computation and supervised learning for average policy representation. Furthermore, it learns a representation of the time‑dependent population distribution using a Conditional Normalizing Flow. To validate the effectiveness of our method, we evaluate it on three different examples of increasing complexity. By addressing critical limitations in scalability and density approximation, this work represents a significant advancement in applying DRL techniques to complex MFG problems, bringing the field closer to real‑world multi‑agent systems.

PaperID: 7771, https://arxiv.org/pdf/2510.22117.pdf

Authors: Jiahui Li, Xinyue Liang, Geng Sun, Hui Kang, Jiacheng Wang, Dusit Niyato, Shiwen Mao, Abbas Jamalipour

Title: When UAV Swarm Meets IRS: Collaborative Secure Communications in Low-altitude Wireless Networks

Abstract:
Low‑altitude wireless networks (LAWNs) represent a promising architecture that integrates unmanned aerial vehicles (UAVs) as aerial nodes to provide enhanced coverage, reliability, and throughput for diverse applications. However, these networks face significant security vulnerabilities from both known and potential unknown eavesdroppers, which may threaten data confidentiality and system integrity. To solve this critical issue, we propose a novel secure communication framework for LAWNs where the selected UAVs within a swarm function as a virtual antenna array (VAA), complemented by intelligent reflecting surface (IRS) to create a robust defense against eavesdropping attacks. Specifically, we formulate a multi‑objective optimization problem that simultaneously maximizes the secrecy rate while minimizing the maximum sidelobe level and total energy consumption, requiring joint optimization of UAV excitation current weights, flight trajectories, and IRS phase shifts. This problem presents significant difficulties due to the dynamic nature of the system and heterogeneous components. Thus, we first transform the problem into a heterogeneous Markov decision process (MDP). Then, we propose a heterogeneous multi‑agent control approach (HMCA) that integrates a dedicated IRS control policy with a multi‑agent soft actor‑critic framework for UAV control, which enables coordinated operation across heterogeneous network elements. Simulation results show that the proposed HMCA achieves superior performance compared to baseline approaches in terms of secrecy rate improvement, sidelobe suppression, and energy efficiency. Furthermore, we find that the collaborative and passive beamforming synergy between VAA and IRS creates robust security guarantees when the number of UAVs increases.

PaperID: 7772, https://arxiv.org/pdf/2510.22108.pdf

Authors: Xinyue Liang, Hui Kang, Junwei Che, Jiahui Li, Geng Sun, Qingqing Wu, Jiacheng Wang, Dusit Niyato

Title: STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks

Abstract:
While low‑altitude wireless networks (LAWNs) based on uncrewed aerial vehicles (UAVs) offer high mobility, flexibility, and coverage for urban communications, they face severe signal attenuation in dense environments due to obstructions. To address this critical issue, we consider introducing collaborative beamforming (CB) of UAVs and omnidirectional reconfigurable beamforming (ORB) of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR‑RIS) to enhance the signal quality and directionality. On this basis, we formulate a joint rate and energy optimization problem (JREOP) to maximize the transmission rate of the overall system, while minimizing the energy consumption of the UAV swarm. Due to the non‑convex and NP‑hard nature of JREOP, we propose a heterogeneous multi‑agent collaborative dynamic (HMCD) optimization framework, which has two core components. The first component is a simulated annealing (SA)‑based STAR‑RIS control method, which dynamically optimizes reflection and transmission coefficients to enhance signal propagation. The second component is an improved multi‑agent deep reinforcement learning (MADRL) control method, which incorporates a self‑attention evaluation mechanism to capture interactions between UAVs and an adaptive velocity transition mechanism to enhance training stability. Simulation results demonstrate that HMCD outperforms various baselines in terms of convergence speed, average transmission rate, and energy consumption. Further analysis reveals that the average transmission rate of the overall system scales positively with both UAV count and STAR‑RIS element numbers.

PaperID: 7773, https://arxiv.org/pdf/2510.21967.pdf

Authors: Benjamin Lange, Geoff Keeling, Arianna Manzini, Amanda McCroskery

Title: We Need Accountability in Human-AI Agent Relationships

Abstract:
We argue that accountability mechanisms are needed in human‑AI agent relationships to ensure alignment with user and societal interests. We propose a framework according to which AI agents' engagement is conditional on appropriate user behaviour. The framework incorporates design‑strategies such as distancing, disengaging, and discouraging.

PaperID: 7774, https://arxiv.org/pdf/2510.21881.pdf

Authors: Nannan Shi, Chuanyu Qin, Shipeng Song, Man Luo

Title: GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models

Abstract:
Large language models (LLMs) have demonstrated strong reasoning capabilities in text‑based mathematical problem solving; however, when adapted to visual reasoning tasks, particularly geometric problem solving, their performance substantially declines because geometric problems present unique challenges. Specifically, these challenges stem from two key factors: first, the intrinsic complexity of geometry requiring detailed image comprehension and multi‑step reasoning, and second, the limitations of existing datasets which lack sufficient scale, diversity, and explicit reasoning traces, consequently hindering effective model training. To address these challenges, we developed the GeoThoughts dataset, a comprehensive geometric reasoning corpus with two subsets: Geo‑Thought‑6K with 6,243 samples and its augmented version Geo‑Thought‑Augmented‑10K containing 10,834 samples. Each entry includes visual descriptions, step‑by‑step solutions, explicit reasoning chains, reflection steps, and final answers. Using this dataset, we developed GeoThought‑MLLM, a mathematical reasoning multimodal model that generates detailed thinking processes during problem‑solving. Our model outperforms existing benchmarks in geometric tasks, demonstrating that training with our Chain‑of‑Thought dataset improves geometric reasoning capabilities across both in‑domain and out‑of‑domain settings. Finally, we analyze failure cases and observe that errors primarily arise from incorrect interpretation of mathematical concepts or spatial misjudgment. By invoking CoT to correct these mistakes, the model produces correct answers.

PaperID: 7775, https://arxiv.org/pdf/2510.21853.pdf

Authors: Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal

Title: Policy Optimization Prefers The Path of Least Resistance

Abstract:
Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi‑step reasoning. Current state‑of‑the‑art pipelines enforce a strict think‑then‑answer format to elicit chain‑of‑thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open‑ended CoT structure remains an under‑studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt<answer>‑only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the complex \texttt<think><answer> format is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high‑reward shortcut is not a low‑effort drift but is driven by the optimization process that requires the KL‑regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double‑edged sword: while necessary for discovering high‑reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.

PaperID: 7776, https://arxiv.org/pdf/2510.21785.pdf

Authors: Arun Muthukkumar

Title: Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound

Abstract:
Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed‑form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render‑aware Cramér‑Rao bound. Our approach reduces to classical bundle‑adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi‑agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.

PaperID: 7777, https://arxiv.org/pdf/2510.21758.pdf

Authors: Kumater Ter, Abolanle Adetifa, Daniel Udekwe

Title: Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review

Abstract:
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in‑depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent‑environment interaction and explores core algorithmic strategies including actor‑critic methods, value‑based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high‑dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi‑agent coordination, and human‑robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real‑world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.

PaperID: 7778, https://arxiv.org/pdf/2510.21738.pdf

Authors: Yifan Bai, Shruti Kotpalliwar, Christoforos Kanellakis, George Nikolakopoulos

Title: Collaborative Task Assignment, Sequencing and Multi-agent Path-finding

Abstract:
In this article, we address the problem of collaborative task assignment, sequencing, and multi‑agent pathfinding (TSPF), where a team of agents must visit a set of task locations without collisions while minimizing flowtime. TSPF incorporates agent‑task compatibility constraints and ensures that all tasks are completed. We propose a Conflict‑Based Search with Task Sequencing (CBS‑TS), an optimal and complete algorithm that alternates between finding new task sequences and resolving conflicts in the paths of current sequences. CBS‑TS uses a mixed‑integer linear program (MILP) to optimize task sequencing and employs Conflict‑Based Search (CBS) with Multi‑Label A (MLA) for collision‑free path planning within a search forest. By invoking MILP for the next‑best sequence only when needed, CBS‑TS efficiently limits the search space, enhancing computational efficiency while maintaining optimality. We compare the performance of our CBS‑TS against Conflict‑based Steiner Search (CBSS), a baseline method that, with minor modifications, can address the TSPF problem. Experimental results demonstrate that CBS‑TS outperforms CBSS in most testing scenarios, achieving higher success rates and consistently optimal solutions, whereas CBSS achieves near‑optimal solutions in some cases. The supplementary video is available at https://youtu.be/QT8BYgvefmU.

PaperID: 7779, https://arxiv.org/pdf/2510.21732.pdf

Authors: Xumin Gao, Mark Stevens, Grzegorz Cielniak

Title: Automated Pest Counting in Water Traps through Active Robotic Stirring for Occlusion Handling

Abstract:
Existing image‑based pest counting methods rely on single static images and often produce inaccurate results under occlusion. To address this issue, this paper proposes an automated pest counting method in water traps through active robotic stirring. First, an automated robotic arm‑based stirring system is developed to redistribute pests and reveal occluded individuals for counting. Then, the effects of different stirring patterns on pest counting performance are investigated. Six stirring patterns are designed and evaluated across different pest density scenarios to identify the optimal one. Finally, a heuristic counting confidence‑driven closed‑loop control system is proposed for adaptive‑speed robotic stirring, adjusting the stirring speed based on the average change rate of counting confidence between consecutive frames. Experimental results show that the four circles is the optimal stirring pattern, achieving the lowest overall mean absolute counting error of 4.384 and the highest overall mean counting confidence of 0.721. Compared with constant‑speed stirring, adaptive‑speed stirring reduces task execution time by up to 44.7% and achieves more stable performance across different pest density scenarios. Moreover, the proposed pest counting method reduces the mean absolute counting error by up to 3.428 compared to the single static image counting method under high‑density scenarios where occlusion is severe.

PaperID: 7780, https://arxiv.org/pdf/2510.21717.pdf

Authors: Bernard Tam, Jean-Charles Tournier, Fernando Varela Rodriguez

Title: AI-Enhanced Operator Assistance for UNICOS Applications

Abstract:
This project explores the development of an AI‑enhanced operator assistant for UNICOS, CERN's UNified Industrial Control System. While powerful, UNICOS presents a number of challenges, including the cognitive burden of decoding widgets, manual effort required for root cause analysis, and difficulties maintainers face in tracing datapoint elements (DPEs) across a complex codebase. In situations where timely responses are critical, these challenges can increase cognitive load and slow down diagnostics. To address these issues, a multi‑agent system was designed and implemented. The solution is supported by a modular architecture comprising a UNICOS‑side extension written in CTRL code, a Python‑based multi‑agent system deployed on a virtual machine, and a vector database storing both operator documentation and widget animation code. Preliminary evaluations suggest that the system is capable of decoding widgets, performing root cause analysis by leveraging live device data and documentation, and tracing DPEs across a complex codebase. Together, these capabilities reduce the manual workload of operators and maintainers, enhance situational awareness in operations, and accelerate responses to alarms and anomalies. Beyond these immediate gains, this work highlights the potential of introducing multi‑modal reasoning and retrieval augmented generation (RAG) into the domain of industrial control. Ultimately, this work represents more than a proof of concept: it provides a basis for advancing intelligent operator interfaces at CERN. By combining modular design, extensibility, and practical AI integration, this project not only alleviates current operator pain points but also points toward broader opportunities for assistive AI in accelerator operations.

PaperID: 7781, https://arxiv.org/pdf/2510.21695.pdf

Authors: Edward Holmberg, Elias Ioup, Mahdi Abdelguerfi

Title: A Knowledge-Graph Translation Layer for Mission-Aware Multi-Agent Path Planning in Spatiotemporal Dynamics

Abstract:
The coordination of autonomous agents in dynamic environments is hampered by the semantic gap between high‑level mission objectives and low‑level planner inputs. To address this, we introduce a framework centered on a Knowledge Graph (KG) that functions as an intelligent translation layer. The KG's two‑plane architecture compiles declarative facts into per‑agent, mission‑aware ``worldviews" and physics‑aware traversal rules, decoupling mission semantics from a domain‑agnostic planner. This allows complex, coordinated paths to be modified simply by changing facts in the KG. A case study involving Autonomous Underwater Vehicles (AUVs) in the Gulf of Mexico visually demonstrates the end‑to‑end process and quantitatively proves that different declarative policies produce distinct, high‑performing outcomes. This work establishes the KG not merely as a data repository, but as a powerful, stateful orchestrator for creating adaptive and explainable autonomous systems.

PaperID: 7782, https://arxiv.org/pdf/2510.21603.pdf

Authors: Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, Chen Zhang, Yong Liu

Title: Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research

Abstract:
Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents Processing such documents demands sophisticated parsing to preserve visual semantics (figures, tables, charts, and equations), intelligent chunking to maintain structural coherence, and adaptive retrieval across modalities, which are capabilities absent in existing systems. In response, we present Doc‑Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi‑granular representations from chunk to document level, (ii) systematic retrieval architecture supporting text‑only, vision‑only, and hybrid paradigms with dynamic granularity selection, and (iii) iterative multi‑agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi‑modal, Multi‑hop, Multi‑document, and Multi‑turn deep research. Featuring 158 expert‑annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc‑Researcher achieves 50.6% accuracy, 3.4xbetter than state‑of‑the‑art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserve multimodal integrity and support iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.

PaperID: 7783, https://arxiv.org/pdf/2510.21556.pdf

Authors: Sophie Hall, Florian Dörfler, Timm Faulwasser

Title: System-Theoretic Analysis of Dynamic Generalized Nash Equilibria -- Turnpikes and Dissipativity

Abstract:
Generalized Nash equilibria are used in multi‑agent control applications to model strategic interactions between agents that are coupled in the cost, dynamics, and constraints, and provide the foundations for game‑theoretic MPC (Receding Horizon Games). We study properties of finite‑horizon dynamic GNE trajectories from a system‑theoretic perspective. We show how strict dissipativity generates the turnpike phenomenon in GNE solutions. Moreover, we establish a converse turnpike result, i.e., the implication from turnpike to strict dissipativity. We derive conditions under which the steady‑state GNE is the optimal operating point and, using a game value function, we give a local characterization of the geometry of storage functions. Finally, we design linear terminal penalties that ensure dynamic GNE trajectories applied in open‑loop converge to and remain at the steady‑state GNE. These connections provide the foundation for future system‑theoretic analysis of GNEs similar to those existing in optimal control as well as for recursive feasibility and closed‑loop stability results of game‑theoretic MPC.

PaperID: 7784, https://arxiv.org/pdf/2510.21546.pdf

Authors: Johannes Autenrieb, Mark Spiller

Title: Auction-Based Responsibility Allocation for Scalable Decentralized Safety Filters in Cooperative Multi-Agent Collision Avoidance

Abstract:
This paper proposes a scalable decentralized safety filter for multi‑agent systems based on high‑order control barrier functions (HOCBFs) and auction‑based responsibility allocation. While decentralized HOCBF formulations ensure pairwise safety under input bounds, they face feasibility and scalability challenges as the number of agents grows. Each agent must evaluate an increasing number of pairwise constraints, raising the risk of infeasibility and making it difficult to meet real‑time requirements. To address this, we introduce an auction‑based allocation scheme that distributes constraint enforcement asymmetrically among neighbors based on local control effort estimates. The resulting directed responsibility graph guarantees full safety coverage while reducing redundant constraints and per‑agent computational load. Simulation results confirm safe and efficient coordination across a range of network sizes and interaction densities.

PaperID: 7785, https://arxiv.org/pdf/2510.21541.pdf

Authors: Weihong Qin, Aimin Wang, Geng Sun, Zemin Sun, Jiacheng Wang, Dusit Niyato, Dong In Kim, Zhu Han

Title: Cost Minimization for Space-Air-Ground Integrated Multi-Access Edge Computing Systems

Abstract:
Space‑air‑ground integrated multi‑access edge computing (SAGIN‑MEC) provides a promising solution for the rapidly developing low‑altitude economy (LAE) to deliver flexible and wide‑area computing services. However, fully realizing the potential of SAGIN‑MEC in the LAE presents significant challenges, including coordinating decisions across heterogeneous nodes with different roles, modeling complex factors such as mobility and network variability, and handling real‑time decision‑making under partially observable environment with hybrid variables. To address these challenges, we first present a hierarchical SAGIN‑MEC architecture that enables the coordination between user devices (UDs), uncrewed aerial vehicles (UAVs), and satellites. Then, we formulate a UD cost minimization optimization problem (UCMOP) to minimize the UD cost by jointly optimizing the task offloading ratio, UAV trajectory planning, computing resource allocation, and UD association. We show that the UCMOP is an NP‑hard problem. To overcome this challenge, we propose a multi‑agent deep deterministic policy gradient (MADDPG)‑convex optimization and coalitional game (MADDPG‑COCG) algorithm. Specifically, we employ the MADDPG algorithm to optimize the continuous temporal decisions for heterogeneous nodes in the partially observable SAGIN‑MEC system. Moreover, we propose a convex optimization and coalitional game (COCG) method to enhance the conventional MADDPG by deterministically handling the hybrid and varying‑dimensional decisions. Simulation results demonstrate that the proposed MADDPG‑COCG algorithm significantly enhances the user‑centric performances in terms of the aggregated UD cost, task completion delay, and UD energy consumption, with a slight increase in UAV energy consumption, compared to the benchmark algorithms. Moreover, the MADDPG‑COCG algorithm shows superior convergence stability and scalability.

PaperID: 7786, https://arxiv.org/pdf/2510.21445.pdf

Authors: Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray

Title: REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

Abstract:
With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human‑machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare worker's inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real‑time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user‑friendly web application. Our experiments demonstrate that our system is implementable and scalable for real‑life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full‑fledged prototype illustrating the functionalities of the system has been developed and being tested to demonstrate the robustness of its various capabilities.

PaperID: 7787, https://arxiv.org/pdf/2510.21442.pdf

Authors: Nathan Corecco, Batuhan Yardim, Vinzenz Thoma, Zebang Shen, Niao He

Title: Scalable Neural Incentive Design with Parameterized Mean-Field Approximation

Abstract:
Designing incentives for a multi‑agent system to induce a desirable Nash equilibrium is both a crucial and challenging problem appearing in many decision‑making domains, especially for a large number of agents N. Under the exchangeability assumption, we formalize this incentive design (ID) problem as a parameterized mean‑field game (PMFG), aiming to reduce complexity via an infinite‑population limit. We first show that when dynamics and rewards are Lipschitz, the finite‑N ID objective is approximated by the PMFG at rate \mathscrO(\frac1\sqrtN). Moreover, beyond the Lipschitz‑continuous setting, we prove the same \mathscrO(\frac1\sqrtN) decay for the important special case of sequential auctions, despite discontinuities in dynamics, through a tailored auction‑specific analysis. Built on our novel approximation results, we further introduce our Adjoint Mean‑Field Incentive Design (AMID) algorithm, which uses explicit differentiation of iterated equilibrium operators to compute gradients efficiently. By uniting approximation bounds with optimization guarantees, AMID delivers a powerful, scalable algorithmic tool for many‑agent (large N) ID. Across diverse auction settings, the proposed AMID method substantially increases revenue over first‑price formats and outperforms existing benchmark methods.

PaperID: 7788, https://arxiv.org/pdf/2510.21370.pdf

Authors: Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, Marco Agus, Mowafa Househ

Title: HIKMA: Human-Inspired Knowledge by Machine Agents through a Multi-Agent Framework for Semi-Autonomous Scientific Conferences

Abstract:
HIKMA Semi‑Autonomous Conference is the first experiment in reimagining scholarly communication through an end‑to‑end integration of artificial intelligence into the academic publishing and presentation pipeline. This paper presents the design, implementation, and evaluation of the HIKMA framework, which includes AI dataset curation, AI‑based manuscript generation, AI‑assisted peer review, AI‑driven revision, AI conference presentation, and AI archival dissemination. By combining language models, structured research workflows, and domain safeguards, HIKMA shows how AI can support ‑ not replace traditional scholarly practices while maintaining intellectual property protection, transparency, and integrity. The conference functions as a testbed and proof of concept, providing insights into the opportunities and challenges of AI‑enabled scholarship. It also examines questions about AI authorship, accountability, and the role of human‑AI collaboration in research.

PaperID: 7789, https://arxiv.org/pdf/2510.21285.pdf

Authors: Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Title: When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Abstract:
Large Reasoning Models (LRMs) achieve strong performance on complex multi‑step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse‑grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self‑Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain‑of‑Guardrail(CoG), a trajectory‑level training framework that mitigates Self‑Jailbreak via targeted, step‑level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.

PaperID: 7790, https://arxiv.org/pdf/2510.21244.pdf

Authors: Pengyu Xu, Shijia Li, Ao Sun, Feng Zhang, Yahan Li, Bo Wu, Zhanyu Ma, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Rui Wang, Yang Liu, Xiaobo Hu, Fan Yang, Jia Zheng, Guanghua Yao

Title: VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series

Abstract:
We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert‑level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations ‑ insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics ‑ OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub‑scenarios, each with scenario‑specific process decomposition, weighted scoring, and domain‑adaptive metrics. Second, we develop a large‑model‑driven User Simulator that generates diverse, persona‑rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human‑in‑the‑loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state‑of‑the‑art LLMs reveal distinct trade‑offs between expert‑level task completion and interaction fluency, offering practical insights for building reliable, human‑like outbound AI systems. OutboundEval establishes a practical, extensible, and domain‑oriented standard for benchmarking LLMs in professional applications.

PaperID: 7791, https://arxiv.org/pdf/2510.21228.pdf

Authors: Xiang Li, Huizi Yu, Wenkong Wang, Yiran Wu, Jiayan Zhou, Wenyue Hua, Xinxin Lin, Wenjia Tan, Lexuan Zhu, Bingyi Chen, Guang Chen, Ming-Li Chen, Yang Zhou, Zhao Li, Themistocles L. Assimes, Yongfeng Zhang, Qingyun Wu, Xin Ma, Lingyao Li, Lizhou Fan

Title: DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services

Abstract:
Objective: Emergency medical dispatch (EMD) is a high‑stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi‑Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy‑grounded, LLM‑powered multi‑agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC‑III) and a six‑phase call protocol. Using this framework, we developed an AutoGen‑based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter‑rater agreement (Gwe's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94 % contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91 % of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7 % neutral sentiment; 90.4 % neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0 % polite; 0 % impolite). Conclusion: Our taxonomy‑grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real‑time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.

PaperID: 7792, https://arxiv.org/pdf/2510.21180.pdf

Authors: Ning Bian, Xianpei Han, Hongyu Lin, Baolei Wu, Jun Wang

Title: Social Simulations with Large Language Model Risk Utopian Illusion

Abstract:
Reliable simulation of human behavior is essential for explaining, predicting, and intervening in our society. Recent advances in large language models (LLMs) have shown promise in emulating human behaviors, interactions, and decision‑making, offering a powerful new lens for social science studies. However, the extent to which LLMs diverge from authentic human behavior in social contexts remains underexplored, posing risks of misinterpretation in scientific studies and unintended consequences in real‑world applications. Here, we introduce a systematic framework for analyzing LLMs' behavior in social simulation. Our approach simulates multi‑agent interactions through chatroom‑style conversations and analyzes them across five linguistic dimensions, providing a simple yet effective method to examine emergent social cognitive biases. We conduct extensive experiments involving eight representative LLMs across three families. Our findings reveal that LLMs do not faithfully reproduce genuine human behavior but instead reflect overly idealized versions of it, shaped by the social desirability bias. In particular, LLMs show social role bias, primacy effect, and positivity bias, resulting in "Utopian" societies that lack the complexity and variability of real human interactions. These findings call for more socially grounded LLMs that capture the diversity of human social behavior.

PaperID: 7793, https://arxiv.org/pdf/2510.21147.pdf

Authors: Chujun He, Zhonghao Huang, Xiangguo Li, Ye Luo, Kewei Ma, Yuxuan Xiong, Xiaowei Zhang, Mingyang Zhao

Title: Hierarchical AI Multi-Agent Fundamental Investing: Evidence from China's A-Share Market

Abstract:
We present a multi‑agent, AI‑driven framework for fundamental investing that integrates macro indicators, industry‑level and firm‑specific information to construct optimized equity portfolios. The architecture comprises: (i) a Macro agent that dynamically screens and weights sectors based on evolving economic indicators and industry performance; (ii) four firm‑level agents ‑‑ Fundamental, Technical, Report, and News ‑‑ that conduct in‑depth analyses of individual firms to ensure both breadth and depth of coverage; (iii) a Portfolio agent that uses reinforcement learning to combine the agent outputs into a unified policy to generate the trading strategy; and (iv) a Risk Control agent that adjusts portfolio positions in response to market volatility. We evaluate the system on the constituents by the CSI 300 Index of China's A‑share market and find that it consistently outperforms standard benchmarks and a state‑of‑the‑art multi‑agent trading system on risk‑adjusted returns and drawdown control. Our core contribution is a hierarchical multi‑agent design that links top‑down macro screening with bottom‑up fundamental analysis, offering a robust and extensible approach to factor‑based portfolio construction.

PaperID: 7794, https://arxiv.org/pdf/2510.21057.pdf

Authors: Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov

Title: Soft Instruction De-escalation Defense

Abstract:
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)‑a simple yet effective iterative prompt sanitization loop designed for tool‑augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re‑evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction‑like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst‑case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non‑imperative workflows. This nonetheless raises the bar.

PaperID: 7795, https://arxiv.org/pdf/2510.21045.pdf

Authors: Ali Khosravi Kazazi, Zhenlong Li, M. Naser Lessani, Guido Cervone

Title: From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL

Abstract:
The complexity of SQL and the spatial semantics of PostGIS create barriers for non‑experts working with spatial data. Although large language models can translate natural language into SQL, spatial Text‑to‑SQL is more error‑prone than general Text‑to‑SQL because it must resolve geographic intent, schema ambiguity, geometry‑bearing tables and columns, spatial function choice, and coordinate reference system and measurement assumptions. We introduce a multi‑agent framework that addresses these coupled challenges through staged interpretation, schema grounding, logical planning, SQL generation, and execution‑based review. The framework is supported by a knowledge base with programmatic schema profiling, semantic enrichment, and embedding‑based retrieval. We evaluated the framework on the non‑spatial KaggleDBQA benchmark and on SpatialQueryQA, a new multi‑level and coverage‑oriented benchmark with diverse geometry types, workload categories, and spatial operations. On KaggleDBQA, the system reached 81.2% accuracy, 221 of 272 questions, after reviewer corrections. On SpatialQueryQA, the system achieved 87.7% accuracy, 79 of 90, compared with 76.7% without the review stage. These results show that decomposing the task into specialized but tightly coupled agents improves robustness, especially for spatially sensitive queries. The study improves access to spatial analysis and provides a practical step toward more reliable spatial Text‑to‑SQL systems and autonomous GIS.

PaperID: 7796, https://arxiv.org/pdf/2510.20963.pdf

Authors: Yongqiang Chen, Gang Niu, James Cheng, Bo Han, Masashi Sugiyama

Title: Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection

Abstract:
Accurate detection of errors in large language models (LLM) responses is central to the success of scalable oversight, or providing effective supervision to superhuman intelligence. Yet, self‑diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi‑agent debate (MAD) seems to be a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross‑checks for error detection. However, prior MAD protocols frame debate as a zero‑sum game, where the debaters compete to win the game instead of seeking the truth. Consequently, it leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduce more mistakes and underperform single‑agent methods. To mitigate the issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non‑zero sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, such that they can complement the missing points of each other. Therefore, the judge agent can make a more informative conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non‑trivial improvements over single‑agent methods in error detection.

PaperID: 7797, https://arxiv.org/pdf/2510.20875.pdf

Authors: Mihir Panchal, Ying-Jung Chen, Surya Parkash

Title: CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia

Abstract:
Landslides are a growing climate induced hazard with severe environmental and human consequences, particularly in high mountain Asia. Despite increasing access to satellite and temporal datasets, timely detection and disaster response remain underdeveloped and fragmented. This work introduces CC‑GRMAS, a framework leveraging a series of satellite observations and environmental signals to enhance the accuracy of landslide forecasting. The system is structured around three interlinked agents Prediction, Planning, and Execution, which collaboratively enable real time situational awareness, response planning, and intervention. By incorporating local environmental factors and operationalizing multi agent coordination, this approach offers a scalable and proactive solution for climate resilient disaster preparedness across vulnerable mountainous terrains.

PaperID: 7798, https://arxiv.org/pdf/2510.20838.pdf

Authors: Abir Khan Ratul, Sanjay Acharjee, Somin Park, Md Nazmus Sakib

Title: Sketch2BIM: A Multi-Agent Human-AI Collaborative Pipeline to Convert Hand-Drawn Floor Plans to 3D BIM

Abstract:
This study introduces a human‑in‑the‑loop pipeline that converts unscaled, hand‑drawn floor plan sketches into semantically consistent 3D BIM models. The workflow leverages multimodal large language models (MLLMs) within a multi‑agent framework, combining perceptual extraction, human feedback, schema validation, and automated BIM scripting. Initially, sketches are iteratively refined into a structured JSON layout of walls, doors, and windows. Later, these layouts are transformed into executable scripts that generate 3D BIM models. Experiments on ten diverse floor plans demonstrate strong convergence: openings (doors, windows) are captured with high reliability in the initial pass, while wall detection begins around 83% and achieves near‑perfect alignment after a few feedback iterations. Across all categories, precision, recall, and F1 scores remain above 0.83, and geometric errors (RMSE, MAE) progressively decrease to zero through feedback corrections. This study demonstrates how MLLM‑driven multi‑agent reasoning can make BIM creation accessible to both experts and non‑experts using only freehand sketches.

PaperID: 7799, https://arxiv.org/pdf/2510.20733.pdf

Authors: Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, Kun Zhang

Title: Thought Communication in Multiagent Collaboration

Abstract:
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM‑based multi‑agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind‑to‑mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real‑world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface‑level observation alone, regardless of compute or data scale.

PaperID: 7800, https://arxiv.org/pdf/2510.20728.pdf

Authors: Xi He, Sirui Lu, Bei Zeng

Title: Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

Abstract:
Exact scientific discovery requires more than heuristic search: candidate constructions must be turned into exact objects and checked independently. We address this gap by extending TeXRA with an independent Lean 4 verification layer, turning it into a human‑guided multi‑agent platform for exact scientific discovery. The platform couples symbolic synthesis, combinatorial and linear‑programming search, exact reconstruction of numerical candidates, and formal verification in Lean. We apply this platform to nonadditive quantum error‑correcting codes with prescribed transversal diagonal gates within the subset‑sum linear‑programming (SSLP) framework. In the distance‑2 regime where logical states occupy distinct residue classes, the platform yields a Lean‑certified catalogue of 14,116 codes for K\in\2,3,4\ and up to six physical qubits, realizing cyclic logical orders 2 through 18, from which we extract closed‑form infinite families. We also construct a residue‑degenerate ((6,4,2)) code implementing the logical controlled‑phase gate \mathrmdiag(1,1,1,i). At distance 3, we resolve the transversal‑T problem for ((7,2,3)) codes within the complementary binary‑dihedral \mathrmBD_16 setting: among the 12 candidates surviving the SSLP filters, 10 admit exact realizations and 2 are excluded by no‑go proofs. All accepted constructions, families, and no‑go results are formalized and checked in Lean, illustrating how AI‑assisted workflows can bridge search, exact reconstruction, and formal proof in the physical sciences.

PaperID: 7801, https://arxiv.org/pdf/2510.20691.pdf

Authors: Yanlin Song, Ben Liu, Víctor Gutiérrez-Basulto, Zhiwei Hu, Qianqian Xie, Min Peng, Sophia Ananiadou, Jeff Z. Pan

Title: Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs

Abstract:
Knowledge Graph Question Answering aims to answer natural language questions by reasoning over structured knowledge graphs. While large language models have advanced KGQA through their strong reasoning capabilities, existing methods continue to struggle to fully exploit both the rich knowledge encoded in KGs and the reasoning capabilities of LLMs, particularly in complex scenarios. They often assume complete KG coverage and lack mechanisms to judge when external information is needed, and their reasoning remains locally myopic, failing to maintain coherent multi‑step planning, leading to reasoning failures even when relevant knowledge exists. We propose Graph‑RFT, a novel two‑stage reinforcement fine‑tuning KGQA framework with a 'plan‑KGsearch‑and‑Websearch‑during‑think' paradigm, that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions. Graph‑RFT introduces a chain‑of‑thought fine‑tuning method with a customized plan‑retrieval dataset activates structured reasoning and resolves the GRPO cold‑start problem. It then introduces a novel plan‑retrieval guided reinforcement learning process integrates explicit planning and retrieval actions with a multi‑reward design, enabling coverage‑aware retrieval scheduling. It employs a Cartesian‑inspired planning module to decompose complex questions into ordered subquestions, and logical expression to guide tool invocation for globally consistent multi‑step reasoning. This reasoning retrieval process is optimized with a multi‑reward combining outcome and retrieval specific signals, enabling the model to learn when and how to combine KG and web retrieval effectively.

PaperID: 7802, https://arxiv.org/pdf/2510.20567.pdf

Authors: Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang

Title: Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search

Abstract:
The retrieval‑ranking paradigm has long dominated e‑commerce search, but its reliance on query‑item matching fundamentally misaligns with multi‑stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross‑platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi‑Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi‑constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi‑agent cognitive systems in redefining e‑commerce search.

PaperID: 7803, https://arxiv.org/pdf/2510.20548.pdf

Authors: Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao

Title: GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

Abstract:
Reinforcement learning has recently shown promise in improving retrieval‑augmented generation (RAG). Despite these advances, its effectiveness in multi‑hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi‑step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi‑hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process‑oriented and outcome‑based objectives. Extensive experiments on both in‑domain and out‑of‑domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.

PaperID: 7804, https://arxiv.org/pdf/2510.20483.pdf

Authors: Victor Vantilborgh, Hrishikesh Sathyanarayan, Guillaume Crevecoeur, Ian Abraham, Tom Lefebvre

Title: Dual Control Reference Generation for Optimal Pick-and-Place Execution under Payload Uncertainty

Abstract:
This work addresses the problem of robot manipulation tasks under unknown dynamics, such as pick‑and‑place tasks under payload uncertainty, where active exploration and(/for) online parameter adaptation during task execution are essential to enable accurate model‑based control. The problem is framed as dual control seeking a closed‑loop optimal control problem that accounts for parameter uncertainty. We simplify the dual control problem by pre‑defining the structure of the feedback policy to include an explicit adaptation mechanism. Then we propose two methods for reference trajectory generation. The first directly embeds parameter uncertainty in robust optimal control methods that minimize the expected task cost. The second method considers minimizing the so‑called optimality loss, which measures the sensitivity of parameter‑relevant information with respect to task performance. We observe that both approaches reason over the Fisher information as a natural side effect of their formulations, simultaneously pursuing optimal task execution. We demonstrate the effectiveness of our approaches for a pick‑and‑place manipulation task. We show that designing the reference trajectories whilst taking into account the control enables faster and more accurate task performance and system identification while ensuring stable and efficient control.

PaperID: 7805, https://arxiv.org/pdf/2510.20470.pdf

Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

Title: Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Abstract:
Video reasoning, which requires multi‑step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)‑based methods enhance reasoning capabilities, they often rely on text‑only chains that yield ungrounded or hallucinated conclusions. Conversely, frame‑retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence‑grounded multi‑step video reasoning. Conan identifies context and evidence frames, reasons over cross‑frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan‑91K, a large‑scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi‑stage progressive cold‑start strategy combined with an Identification‑Reasoning‑Action (AIR) RLVR training framework to progressively incentivize multi‑step visual reasoning. Extensive experiments on six multi‑step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5‑VL‑7B‑Instruct by an average of over 10% in accuracy, achieving state‑of‑the‑art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.

PaperID: 7806, https://arxiv.org/pdf/2510.20436.pdf

Authors: Federico Lozano-Cuadra, Beatriz Soret, Marc Sanchez Net, Abhishek Cauligi, Federico Rossi

Title: Learning Decentralized Routing Policies via Graph Attention-based Multi-Agent Reinforcement Learning in Lunar Delay-Tolerant Networks

Abstract:
We present a fully decentralized routing framework for multi‑robot exploration missions operating under the constraints of a Lunar Delay‑Tolerant Network (LDTN). In this setting, autonomous rovers must relay collected data to a lander under intermittent connectivity and unknown mobility patterns. We formulate the problem as a Partially Observable Markov Decision Problem (POMDP) and propose a Graph Attention‑based Multi‑Agent Reinforcement Learning (GAT‑MARL) policy that performs Centralized Training, Decentralized Execution (CTDE). Our method relies only on local observations and does not require global topology updates or packet replication, unlike classical approaches such as shortest path and controlled flooding‑based algorithms. Through Monte Carlo simulations in randomized exploration environments, GAT‑MARL provides higher delivery rates, no duplications, and fewer packet losses, and is able to leverage short‑term mobility forecasts; offering a scalable solution for future space robotic systems for planetary exploration, as demonstrated by successful generalization to larger rover teams.

PaperID: 7807, https://arxiv.org/pdf/2510.20425.pdf

Authors: Ziyang Li, Chunfeng Cui, Jiaxin Xie

Title: Projecting onto the Unit Dual Quaternion Set

Abstract:
Dual quaternions have gained significant attention due to their wide applications in areas such as multi‑agent formation control, 3D motion modeling, and robotics. A fundamental aspect in dual quaternion research involves the projection onto unit dual quaternion sets. In this paper, we systematically study such projections under the 2^R‑norm, which is commonly used in practical applications. We identify several distinct cases based on the relationship between the standard and dual parts in vector form, and demonstrate the effectiveness of the proposed algorithm through numerical experiments.

PaperID: 7808, https://arxiv.org/pdf/2510.20408.pdf

Authors: Tom Maus, Asma Atamna, Tobias Glasmachers

Title: Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control

Abstract:
Autonomous control of multi‑stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real‑world applications. This study introduces an enhanced industry‑inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi‑agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.

PaperID: 7809, https://arxiv.org/pdf/2510.20362.pdf

Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni

Title: ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

Abstract:
Since the advent of various pre‑trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi‑agent platform that facilitates the extraction, validation, classification, and visualisation of machine‑readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open‑source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek‑V3‑0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user‑friendly, readily‑usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.

PaperID: 7810, https://arxiv.org/pdf/2510.20310.pdf

Authors: Mingliang Zhai, Hansheng Liang, Xiaomeng Fan, Zhi Gao, Chuanhao Li, Che Sun, Xu Bin, Yuwei Wu, Yunde Jia

Title: Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

Abstract:
Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi‑step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtaining additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool‑usage and multi‑step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large‑scale EQA tasks with reasoning trajectories and corresponding answers. Based on the pipeline, we collect the EQA‑RT dataset that contains about 18K tasks, divided into a training set EQA‑RT‑Train, and two test sets EQA‑RT‑Seen (scenes overlapping with the training set) and EQA‑RT‑Unseen (novel scenes). Experiments on EQA‑RT‑Seen and EQA‑RT‑Unseen show that ToolEQA improves the success rate by 9.2~20.2% over state‑of‑the‑art baselines, while outperforming the zero‑shot ToolEQA by 10% in success rate. In addition, ToolEQA also achieves state‑of‑the‑art performance on the HM‑EQA, OpenEQA, and EXPRESS‑Bench datasets, demonstrating its generality. Our homepage see https://tooleqa.github.io.

PaperID: 7811, https://arxiv.org/pdf/2510.20276.pdf

Authors: Wonil Kim, Hyeongseok Wi, Seungsoon Park, Taejun Kim, Sangeun Keum, Keunhyoung Kim, Taewan Kim, Jongmin Jung, Taehyoung Kim, Gaetan Guerrero, Mael Le Goff, Julie Po, Dongjoo Moon, Juhan Nam, Jongpil Lee

Title: From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era

Abstract:
Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill‑equipped to handle the scale and complexity of AI‑driven production. We propose a content‑based Music AI Agent architecture that embeds attribution directly into the creative workflow through block‑level retrieval and agentic orchestration. Designed for iterative, session‑based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real‑time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine‑grained attribution, equitable compensation, and participatory engagement, it points toward a post‑streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.

PaperID: 7812, https://arxiv.org/pdf/2510.20218.pdf

Authors: Qinyu Xu, Yuanyang Zhu, Xuefei Wu, Chunlin Chen

Title: High-order Interactions Modeling for Interpretable Multi-Agent Q-Learning

Abstract:
The ability to model interactions among agents is crucial for effective coordination and understanding their cooperation mechanisms in multi‑agent reinforcement learning (MARL). However, previous efforts to model high‑order interactions have been primarily hindered by the combinatorial explosion or the opaque nature of their black‑box network structures. In this paper, we propose a novel value decomposition framework, called Continued Fraction Q‑Learning (QCoFr), which can flexibly capture arbitrary‑order agent interactions with only linear complexity \mathcalO\left(n\right) in the number of agents, thus avoiding the combinatorial explosion when modeling rich cooperation. Furthermore, we introduce the variational information bottleneck to extract latent information for estimating credits. This latent information helps agents filter out noisy interactions, thereby significantly enhancing both cooperation and interpretability. Extensive experiments demonstrate that QCoFr not only consistently achieves better performance but also provides interpretability that aligns with our theoretical analysis.

PaperID: 7813, https://arxiv.org/pdf/2510.20205.pdf

Authors: Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum

Title: Merge and Conquer: Evolutionarily Optimizing AI for 2048

Abstract:
Optimizing artificial intelligence (AI) for dynamic environments remains a fundamental challenge in machine learning research. In this paper, we examine evolutionary training methods for optimizing AI to solve the game 2048, a 2D sliding puzzle. 2048, with its mix of strategic gameplay and stochastic elements, presents an ideal playground for studying decision‑making, long‑term planning, and dynamic adaptation. We implemented two distinct systems: a two‑agent metaprompting system where a "thinker" large language model (LLM) agent refines gameplay strategies for an "executor" LLM agent, and a single‑agent system based on refining a value function for a limited Monte Carlo Tree Search. We also experimented with rollback features to avoid performance degradation. Our results demonstrate the potential of evolutionary refinement techniques in improving AI performance in non‑deterministic environments. The single‑agent system achieved substantial improvements, with an average increase of 473.2 points per cycle, and with clear upward trends (correlation ρ=0.607) across training cycles. The LLM's understanding of the game grew as well, shown in its development of increasingly advanced strategies. Conversely, the two‑agent system did not garner much improvement, highlighting the inherent limits of meta‑prompting.

PaperID: 7814, https://arxiv.org/pdf/2510.20182.pdf

Authors: Aaron Appelle, Jerome P. Lynch

Title: Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

Abstract:
Large‑scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general‑purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi‑agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text‑to‑video (T2V) and image‑to‑video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's‑eye view trajectories from pixel‑space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi‑agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.

PaperID: 7815, https://arxiv.org/pdf/2510.20176.pdf

Authors: Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwei Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang

Title: Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Abstract:
Understanding and reasoning over tables is a critical capability for many real‑world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine‑tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool‑based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture‑of‑Minds, a multi‑agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self‑improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo‑gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture‑of‑Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI‑o4‑mini‑high. These results demonstrate the promise of combining structured multi‑agent workflows with RL to advance table understanding.

PaperID: 7816, https://arxiv.org/pdf/2510.20102.pdf

Authors: Gyuyeon Na, Minjung Park, Hyeonjeong Cha, Sangmi Chai

Title: Human-Centered LLM-Agent System for Detecting Anomalous Digital Asset Transactions

Abstract:
We present HCLA, a human‑centered multi‑agent system for anomaly detection in digital‑asset transactions. The system integrates three cognitively aligned roles: Rule Abstraction, Evidence Scoring, and Expert‑Style Justification. These roles operate in a conversational workflow that enables non‑experts to express analytical intent in natural language, inspect structured risk evidence, and obtain traceable, context‑aware reasoning. Implemented with an open‑source, web‑based interface, HCLA translates user intent into explicit analytical rules, applies classical anomaly detectors to quantify evidential risk, and reconstructs expert‑style justifications grounded in observable transactional signals. Experiments on a cryptocurrency anomaly dataset show that, while the underlying detector achieves strong predictive accuracy, HCLA substantially improves interpretability, interaction, and decision transparency. Importantly, HCLA is not designed to explain a black‑box model in the conventional XAI sense. Instead, we reconstruct a traceable expert reasoning process that aligns algorithmic evidence with regulatory and investigative judgment. By explicitly separating evidence scoring from expert‑style justification, the framework emphasizes accountability beyond explainability and addresses practical requirements for regulatory, audit, and compliance‑driven financial forensics. We describe the system architecture, closed‑loop interaction design, datasets, evaluation protocol, and limitations. We argue that a human‑in‑the‑loop reasoning reconstruction paradigm is essential for achieving transparency, accountability, and trust in high‑stakes financial environments. Keywords: Human‑Centered AI; LLM‑Agent System; Multi‑Agent Architecture; Anomaly Detection; Digital Asset Transactions; Cryptocurrency Forensics; Blockchain Analytics; Human‑in‑the‑Loop; Explainable AI (XAI); Interpretability

PaperID: 7817, https://arxiv.org/pdf/2510.20036.pdf

Authors: Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, Dan Roth

Title: ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

Abstract:
Large language model (LLM) agents rely on external tools to solve complex tasks, but real‑world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto‑Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state‑of‑the‑art LLMs and three open‑source tool‑use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.

PaperID: 7818, https://arxiv.org/pdf/2510.20009.pdf

Authors: Chongyang Shi, Wesley A. Suttle, Michael Dorothy, Jie Fu

Title: IMAS$^2$: Joint Agent Selection and Information-Theoretic Coordinated Perception In Dec-POMDPs

Abstract:
We study the problem of jointly selecting sensing agents and synthesizing decentralized active perception policies for the chosen subset of agents within a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) framework. Our approach employs a two‑layer optimization structure. In the inner layer, we introduce information‑theoretic metrics, defined by the mutual information between the unknown trajectories or some hidden property in the environment and the collective partial observations in the multi‑agent system, as a unified objective for active perception problems. We employ various optimization methods to obtain optimal sensor policies that maximize mutual information for distinct active perception tasks. In the outer layer, we prove that under certain conditions, the information‑theoretic objectives are monotone and submodular with respect to the subset of observations collected from multiple agents. We then exploit this property to design an IMAS^2 (Information‑theoretic Multi‑Agent Selection and Sensing) algorithm for joint sensing agent selection and sensing policy synthesis. However, since the policy search space is infinite, we adapt the classical Nemhauser‑Wolsey argument to prove that the proposed IMAS^2 algorithm can provide a tight (1 ‑ 1/e)‑guarantee on the performance. Finally, we demonstrate the effectiveness of our approach in a multi‑agent cooperative perception in a grid‑world environment.

PaperID: 7819, https://arxiv.org/pdf/2510.19995.pdf

Authors: Yiming Lu, Xun Wang, Simin Ma, Shujian Liu, Sathish Reddy Indurthi, Song Wang, Haoyun Deng, Fei Liu, Kaiqiang Song

Title: Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication

Abstract:
Multi‑agent LLM systems have demonstrated impressive capabilities in complex collaborative tasks, yet most frameworks treat communication as instantaneous and free, overlooking a fundamental constraint in real world teamwork, collaboration cost. We propose a scalable framework implemented via Communication to Completion (C2C), which explicitly models communication as a constrained resource with realistic temporal costs. We introduce the Alignment Factor (AF), a dynamic metric inspired by Shared Mental Models, to quantify the link between task understanding and work efficiency. Through experiments on 15 software engineering workflows spanning three complexity tiers and team sizes from 5 to 17 agents, we demonstrate that cost‑aware strategies achieve over 40% higher efficiency compared to unconstrained interaction. Our analysis reveals emergent coordination patterns: agents naturally adopt manager centric hub‑and‑spoke topologies, strategically escalate from asynchronous to synchronous channels based on complexity, and prioritize high value help requests. These patterns remain consistent across multiple frontier models (GPT‑5.2, Claude Sonnet 4.5, Gemini 2.5 Pro). This study moves beyond simple agent construction, offering a theoretical foundation for quantifying and optimizing the dynamics of collaboration in future digital workplaces.

PaperID: 7820, https://arxiv.org/pdf/2510.19895.pdf

Authors: Guoyun Zhang

Title: Large Language Model enabled Mathematical Modeling

Abstract:
The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision‑making in operations research (OR). Traditional optimization methods,such as linear programming, mixed integer programming, and simulation depend heavily on domain expertise to translate real‑world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek‑R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT‑4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real‑world applicability in supply chain contexts. In contrast, DeepSeek‑R1, a cost‑efficient and high‑performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math‑500, its effectiveness in applied OR scenarios remains under explored. This study systematically evaluates DeepSeek‑R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM‑as‑a‑Judge, Few‑shot Learning (FSL), Tool Calling, and a Multi‑agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

PaperID: 7821, https://arxiv.org/pdf/2510.19868.pdf

Authors: Qian Xiong, Bo Yang, Weisong Sun, Yiran Zhang, Tianlin Li, Yang Liu, Zhi Jin

Title: Knowledge-Guided Multi-Agent Framework for Application-Level Software Code Generation

Abstract:
Automated code generation driven by Large Lan‑ guage Models (LLMs) has enhanced development efficiency, yet generating complex application‑level software code remains challenging. Multi‑agent frameworks show potential, but existing methods perform inadequately in large‑scale application‑level software code generation, failing to ensure reasonable orga‑ nizational structures of project code and making it difficult to maintain the code generation process. To address this, this paper envisions a Knowledge‑Guided Application‑Level Code Generation framework named KGACG, which aims to trans‑ form software requirements specification and architectural design document into executable code through a collaborative closed‑ loop of the Code Organization & Planning Agent (COPA), Coding Agent (CA), and Testing Agent (TA), combined with a feedback mechanism. We demonstrate the collaborative process of the agents in KGACG in a Java Tank Battle game case study while facing challenges. KGACG is dedicated to advancing the automation of application‑level software development.

PaperID: 7822, https://arxiv.org/pdf/2510.19838.pdf

Authors: Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

Title: Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Abstract:
Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal‑oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi‑step reasoning and lack effective backtracking, while other search strategies are coarse‑grained and computationally costly. We introduce Branch‑and‑Browse, a fine‑grained web agent framework that unifies structured reasoning‑acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree‑structured exploration for controllable multi‑branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch‑and‑Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state‑of‑the‑art methods. These results demonstrate that Branch‑and‑Browse is a reliable and efficient framework for LLM‑based web agents.

PaperID: 7823, https://arxiv.org/pdf/2510.19771.pdf

Authors: Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis

Title: Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

Abstract:
LLM‑based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state‑of‑the‑art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end‑to‑end performance of 40% is achieved by both GPT‑5 and Claude Opus‑4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.

PaperID: 7824, https://arxiv.org/pdf/2510.19738.pdf

Authors: Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov

Title: Misalignment Bounty: Crowdsourcing AI Agent Misbehavior

Abstract:
Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. This report explains the program's motivation and evaluation criteria, and walks through the nine winning submissions step by step.

PaperID: 7825, https://arxiv.org/pdf/2510.19687.pdf

Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths

Title: Are Large Language Models Sensitive to the Motives Behind Communication?

Abstract:
Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self‑serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source ‑‑ for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human‑like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely ‑‑ partly due to additional information that distracts them from vigilance‑relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real‑world settings will require further improvements to these models.

PaperID: 7826, https://arxiv.org/pdf/2510.19661.pdf

Authors: Xusen Guo, Mingxing Peng, Xixuan Hao, Xingchen Zou, Qiongyan Wang, Sijie Ruan, Yuxuan Liang

Title: AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing

Abstract:
Web‑based participatory urban sensing has emerged as a vital approach for modern urban management by leveraging mobile individuals as distributed sensors. However, existing urban sensing systems struggle with limited generalization across diverse urban scenarios and poor interpretability in decision‑making. In this work, we introduce AgentSense, a hybrid, training‑free framework that integrates large language models (LLMs) into participatory urban sensing through a multi‑agent evolution system. AgentSense initially employs classical planner to generate baseline solutions and then iteratively refines them to adapt sensing task assignments to dynamic urban conditions and heterogeneous worker preferences, while producing natural language explanations that enhance transparency and trust. Extensive experiments across two large‑scale mobility datasets and seven types of dynamic disturbances demonstrate that AgentSense offers distinct advantages in adaptivity and explainability over traditional methods. Furthermore, compared to single‑agent LLM baselines, our approach outperforms in both performance and robustness, while delivering more reasonable and transparent explanations. These results position AgentSense as a significant advancement towards deploying adaptive and explainable urban sensing systems on the web.

PaperID: 7827, https://arxiv.org/pdf/2510.19577.pdf

Authors: Zuoming Fu, Alex Manley, Mohammad Alian

Title: gem5 Co-Pilot: AI Assistant Agent for Architectural Design Space Exploration

Abstract:
Generative AI is increasing the productivity of software and hardware development across many application domains. In this work, we utilize the power of Large Language Models (LLMs) to develop a co‑pilot agent for assisting gem5 users with automating design space exploration. Computer architecture design space exploration is complex and time‑consuming, given that numerous parameter settings and simulation statistics must be analyzed before improving the current design. The emergence of LLMs has significantly accelerated the analysis of long‑text data as well as smart decision making, two key functions in a successful design space exploration task. In this project, we first build gem5 Co‑Pilot, an AI agent assistant for gem5, which comes with a webpage‑GUI for smooth user interaction, agent automation, and result summarization. We also implemented a language for design space exploration, as well as a Design Space Database (DSDB). With DSDB, gem5 Co‑Pilot effectively implements a Retrieval Augmented Generation system for gem5 design space exploration. We experiment on cost‑constraint optimization with four cost ranges and compare our results with two baseline models. Results show that gem5 Co‑Pilot can quickly identify optimal parameters for specific design constraints based on performance and cost, with limited user interaction.

PaperID: 7828, https://arxiv.org/pdf/2510.19567.pdf

Authors: Takahiro Suzuki, Keisuke Okumura

Title: Polynomial-time Configuration Generator for Connected Unlabeled Multi-Agent Pathfinding

Abstract:
We consider Connected Unlabeled Multi‑Agent Pathfinding (CUMAPF), a variant of MAPF where interchangeable agents must be connected at all times. This problem is fundamental to swarm robotics applications such as self‑reconfiguration and marching, where standard MAPF is insufficient as it does not guarantee the connectivity constraint. Despite its simple structure, CUMAPF remains understudied and lacks practical algorithms. We first develop an Integer Linear Programming (ILP) reduction to solve CUMAPF. Although this formulation provides a makespan‑optimal plan, it is severely limited in terms of scalability and real‑time responsiveness due to the large number of variables. We therefore propose a suboptimal but complete algorithm named PULL. It is based on a rule‑based one‑step function that computes a subsequent configuration that preserves connectivity and advances towards the target configuration. PULL is lightweight, and runs in O(n^2) time per step in a 2D grid, where n is the number of agents. Empirically, PULL can quickly solve randomly generated instances containing hundreds of agents, which ILP cannot handle. Furthermore, PULL's solution substantially improves upon a naive approach to CUMAPF.

PaperID: 7829, https://arxiv.org/pdf/2510.19462.pdf

Authors: Zhonghao Zhan, Amir Al Sadi, Krinos Li, Hamed Haddadi

Title: AegisMCP: Online Graph Intrusion Detection for Tool-Augmented LLMs on Edge Devices

Abstract:
In this work, we study security of Model Context Protocol (MCP) agent toolchains and their applications in smart homes. We introduce AegisMCP, a protocol‑level intrusion detector. Our contributions are: (i) a minimal attack suite spanning instruction‑driven escalation, chain‑of‑tool exfiltration, malicious MCP server registration, and persistence; (ii) NEBULA‑Schema (Network‑Edge Behavioral Learning for Untrusted LLM Agents), a reusable protocol‑level instrumentation that represents MCP activity as a streaming heterogeneous temporal graph over agents, MCP servers, tools, devices, remotes, and sessions; and (iii) a CPU‑only streaming detector that fuses novelty, session‑DAG structure, and attribute cues for near‑real‑time edge inference, with optional fusion of local prompt‑guardrail signals. On an emulated smart‑home testbed spanning multiple MCP stacks and a physical bench, AegisMCP achieves sub‑second per‑window model inference and end‑to‑end alerting. The latency of AegisMCP is consistently sub‑second on Intel N150‑class edge hardware, while outperforming traffic‑only and sequence baselines; ablations confirm the importance of DAG and install/permission signals. We release code, schemas, and generators for reproducible evaluation.

PaperID: 7830, https://arxiv.org/pdf/2510.19438.pdf

Authors: Linfeng Liang, Chenkai Tan, Yao Deng, Yingfeng Cai, T. Y Chen, Xi Zheng

Title: AutoMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems

Abstract:
Autonomous Driving Systems (ADS) are safety‑critical, where failures can be severe. While Metamorphic Testing (MT) is effective for fault detection in ADS, existing methods rely heavily on manual effort and lack automation. We present AutoMT, a multi‑agent MT framework powered by Large Language Models (LLMs) that automates the extraction of Metamorphic Relations (MRs) from local traffic rules and the generation of valid follow‑up test cases. AutoMT leverages LLMs to extract MRs from traffic rules in Gherkin syntax using a predefined ontology. A vision‑language agent analyzes scenarios, and a search agent retrieves suitable MRs from a RAG‑based database to generate follow‑up cases via computer vision. Experiments show that AutoMT achieves up to 5 x higher test diversity in follow‑up case generation compared to the best baseline (manual expert‑defined MRs) in terms of validation rate, and detects up to 20.55% more behavioral violations. While manual MT relies on a fixed set of predefined rules, AutoMT automatically extracts diverse metamorphic relations that augment real‑world datasets and help uncover corner cases often missed during in‑field testing and data collection. Its modular architecture separating MR extraction, filtering, and test generation supports integration into industrial pipelines and potentially enables simulation‑based testing to systematically cover underrepresented or safety‑critical scenarios.

PaperID: 7831, https://arxiv.org/pdf/2510.19430.pdf

Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu

Title: GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Abstract:
Training Vision‑Language‑Action (VLA) models for generalist robots typically requires large‑scale real‑world robot data, which is expensive and time‑consuming to collect. The inefficiency of physical data collection severely limits the scalability, and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain‑0, a novel VLA foundation model empowered by world model‑generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain‑0 significantly reduces reliance on real robot data while improving cross‑task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain‑of‑Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long‑horizon dependencies during task execution. This leads to substantial gains in real‑world performance on dexterous, long‑horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain‑0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain‑0‑Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

PaperID: 7832, https://arxiv.org/pdf/2510.19423.pdf

Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai

Title: ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem

Abstract:
We introduce ETOM, a five‑level benchmark for evaluating multi‑hop, end‑to‑end tool orchestration by LLM agents within a hierarchical Model‑Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross‑server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through "equal function sets", enabling objective metrics such as F1 score and reducing reliance on LLM‑as‑a‑judge evaluation. Its five‑level curriculum systematically tests agent capabilities, from single‑tool orchestration to complex cross‑server planning, as well as robustness to out‑of‑scope requests. Experiments reveal that rigid hierarchies can hinder performance without co‑designed strategies, and even state‑of‑the‑art agents exhibit systemic weaknesses in robustness. ETOM provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool‑using agents.

PaperID: 7833, https://arxiv.org/pdf/2510.19386.pdf

Authors: Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang, Yin Zhao, Xiangmou Qu, Jiamu Zhou, Jun Wang, Congmin Zheng, Yuanyi Song, Hongjiang Chen, Heyuan Huang, Jihong Wang, Jiaxin Yin, Jingwei Yu, Junwei Liao, Qiuying Peng, Xingyu Lou, Jun Wang, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang

Title: ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

Abstract:
With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command‑line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long‑horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long‑horizon interactions with the environment, we enhance the model's capabilities through step‑wise reinforcement learning and self‑evolving training, while also developing a tailored multi‑agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security.

PaperID: 7834, https://arxiv.org/pdf/2510.19361.pdf

Authors: Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

Title: AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

Abstract:
The creation of high‑quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low‑quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic method for generating high‑quality mathematical question‑answer pairs to enhance the supervised fine‑tuning of LLMs. Our method operates through four stages: (1) Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi‑agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step where rewrite answers using chain‑of‑thought reasoning to enhance numerical and logical correctness, without reliance on human‑provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that, fine‑tuning 3B‑8B parameter LLMs on AgenticMath generated datasets (comprising only 30‑60K math samples) achieves competitive or superior performance on diverse in domain and out‑of‑domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high‑quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large‑scale, low‑quality alternatives.

PaperID: 7835, https://arxiv.org/pdf/2510.19299.pdf

Authors: Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu

Title: Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties

Abstract:
Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior ‑‑ shaped by homophily, reciprocity, and social validation ‑‑ and what memory and learning mechanisms enable such dynamics to emerge? We present a multi‑agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in‑context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self‑presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision‑making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in‑context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human‑like social behavior.

PaperID: 7836, https://arxiv.org/pdf/2510.19292.pdf

Authors: Konstantinos Bacharidis, Antonis A. Argyros

Title: Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges

Abstract:
Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human‑robot collaboration. This paper reviews vision‑based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision‑based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra‑class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state‑of‑the‑art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro‑symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision‑based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.

PaperID: 7837, https://arxiv.org/pdf/2510.19270.pdf

Authors: Xiaoyuan Zhang, Yizhe Huang, Chengdong Ma, Zhixun Chen, Long Ma, Yali Du, Song-Chun Zhu, Yaodong Yang, Xue Feng

Title: Social World Model-Augmented Mechanism Design Policy Learning

Abstract:
Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi‑agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real‑world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM‑AP (Social World Model‑Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically modeling agents' behavior to enhance mechanism design. Specifically, the social world model infers agents' traits from their interaction trajectories and learns a trait‑based model to predict agents' responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents' traits online during real‑world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM‑AP outperforms established model‑based and model‑free RL baselines in cumulative rewards and sample efficiency.

PaperID: 7838, https://arxiv.org/pdf/2510.19246.pdf

Authors: Mingfei Lu, Mengjia Wu, Jiawei Xu, Weikai Li, Feng Liu, Ying Ding, Yizhou Sun, Jie Lu, Yi Zhang

Title: From Newborn to Impact: Bias-Aware Citation Prediction

Abstract:
As a key to accessing research impact, citation dynamics underpins research evaluation, scholarly recommendation, and the study of knowledge diffusion. Citation prediction is particularly critical for newborn papers, where early assessment must be performed without citation signals and under highly long‑tailed distributions. We identify two key research gaps: (i) insufficient modeling of implicit factors of scientific impact, leading to reliance on coarse proxies; and (ii) a lack of bias‑aware learning that can deliver stable predictions on lowly cited papers. We address these gaps by proposing a Bias‑Aware Citation Prediction Framework, which combines multi‑agent feature extraction with robust graph representation learning. First, a multi‑agent x graph co‑learning module derives fine‑grained, interpretable signals, such as reproducibility, collaboration network, and text quality, from metadata and external resources, and fuses them with heterogeneous‑network embeddings to provide rich supervision even in the absence of early citation signals. Second, we incorporate a set of robust mechanisms: a two‑stage forward process that routes explicit factors through an intermediate exposure estimate, GroupDRO to optimize worst‑case group risk across environments, and a regularization head that performs what‑if analyses on controllable factors under monotonicity and smoothness constraints. Comprehensive experiments on two real‑world datasets demonstrate the effectiveness of our proposed model. Specifically, our model achieves around a 13% reduction in error metrics (MALE and RMSLE) and a notable 5.5% improvement in the ranking metric (NDCG) over the baseline methods.

PaperID: 7839, https://arxiv.org/pdf/2510.19232.pdf

Authors: Ahan Basu, Ratnangshu Das, Pushpak Jagtap

Title: Spatiotemporal Tubes based Control of Unknown Multi-Agent Systems for Temporal Reach-Avoid-Stay Tasks

Abstract:
The paper focuses on designing a controller for unknown dynamical multi‑agent systems to achieve temporal reach‑avoid‑stay tasks for each agent while preventing inter‑agent collisions. The main objective is to generate a spatiotemporal tube (STT) for each agent and thereby devise a closed‑form, approximation‑free, and decentralized control strategy that ensures the system trajectory reaches the target within a specific time while avoiding time‑varying unsafe sets and collisions with other agents. In order to achieve this, the requirements of STTs are formulated as a robust optimization problem (ROP) and solved using a sampling‑based scenario optimization problem (SOP) to address the issue of infeasibility caused by the infinite number of constraints in ROP. The STTs are generated by solving the SOP, and the corresponding closed‑form control is designed to fulfill the specified task. Finally, the effectiveness of our approach is demonstrated through two case studies, one involving omnidirectional robots and the other involving multiple drones modelled as Euler‑Lagrange systems.

PaperID: 7840, https://arxiv.org/pdf/2510.19208.pdf

Authors: Hang Zheng, Hongshen Xu, Yongkai Lin, Shuai Fan, Lu Chen, Kai Yu

Title: DiSRouter: Distributed Self-Routing for LLM Selections

Abstract:
The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router can not fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self‑Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self‑awareness, its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two‑stage Self‑Awareness Training pipeline that enhances each LLM's self‑awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out‑of‑domain tasks. Our work validates that leveraging an LLM's intrinsic self‑awareness is more effective than external assessment, paving the way for more modular and efficient multi‑agent systems.

PaperID: 7841, https://arxiv.org/pdf/2510.19199.pdf

Authors: Xiaoxing Ren, Nicola Bastianello, Thomas Parisini, Andreas A. Malikopoulos

Title: A Communication-Efficient Decentralized Actor-Critic Algorithm

Abstract:
In this paper, we study the problem of reinforcement learning in multi‑agent systems where communication among agents is limited. We develop a decentralized actor‑critic learning framework in which each agent performs several local updates of its policy and value function, where the latter is approximated by a multi‑layer neural network, before exchanging information with its neighbors. This local training strategy substantially reduces the communication burden while maintaining coordination across the network. We establish finite‑time convergence analysis for the algorithm under Markov‑sampling. Specifically, to attain the \varepsilon‑accurate stationary point, the sample complexity is of order \mathcalO(\varepsilon^‑3) and the communication complexity is of order \mathcalO(\varepsilon^‑1τ^‑1), where tau denotes the number of local training steps. We also show how the final error bound depends on the neural network's approximation quality. Numerical experiments in a cooperative control setting illustrate and validate the theoretical findings.

PaperID: 7842, https://arxiv.org/pdf/2510.19107.pdf

Authors: Aliakbar Mehdizadeh, Martin Hilbert

Title: When Your AI Agent Succumbs to Peer-Pressure: Studying Opinion-Change Dynamics of LLMs

Abstract:
We investigate how peer pressure influences the opinions of Large Language Model (LLM) agents across a spectrum of cognitive commitments by embedding them in social networks where they update opinions based on peer perspectives. Our findings reveal key departures from traditional conformity assumptions. First, agents follow a sigmoid curve: stable at low pressure, shifting sharply at threshold, and saturating at high. Second, conformity thresholds vary by model: Gemini 1.5 Flash requires over 70% peer disagreement to flip, whereas ChatGPT‑4o‑mini shifts with a dissenting minority. Third, we uncover a fundamental "persuasion asymmetry," where shifting an opinion from affirmative‑to‑negative requires a different cognitive effort than the reverse. This asymmetry results in a "dual cognitive hierarchy": the stability of cognitive constructs inverts based on the direction of persuasion. For instance, affirmatively‑held core values are robust against opposition but easily adopted from a negative stance, a pattern that inverts for other constructs like attitudes. These dynamics echoing complex human biases like negativity bias, prove robust across different topics and discursive frames (moral, economic, sociotropic). This research introduces a novel framework for auditing the emergent socio‑cognitive behaviors of multi‑agent AI systems, demonstrating their decision‑making is governed by a fluid, context‑dependent architecture, not a static logic.

PaperID: 7843, https://arxiv.org/pdf/2510.19072.pdf

Authors: Tomoki Arita, Keisuke Okumura

Title: Local Guidance for Configuration-Based Multi-Agent Pathfinding

Abstract:
Guidance is an emerging concept that improves the empirical performance of real‑time, sub‑optimal multi‑agent pathfinding (MAPF) methods. It offers additional information to MAPF algorithms to mitigate congestion on a global scale by considering the collective behavior of all agents across the entire workspace. This global perspective helps reduce agents' waiting times, thereby improving overall coordination efficiency. In contrast, this study explores an alternative approach: providing local guidance in the vicinity of each agent. While such localized methods involve recomputation as agents move and may appear computationally demanding, we empirically demonstrate that supplying informative spatiotemporal cues to the planner can significantly improve solution quality without exceeding a moderate time budget. When applied to LaCAM, a leading configuration‑based solver, this form of guidance establishes a new performance frontier for MAPF.

PaperID: 7844, https://arxiv.org/pdf/2510.18893.pdf

Authors: Sergey Pugachev

Title: CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation

Abstract:
Multi‑agent LLM systems fail to realize parallel speedups due to costly coordination. We present CodeCRDT, an observation‑driven coordination pattern where agents coordinate by monitoring a shared state with observable updates and deterministic convergence, rather than explicit message passing. Using Conflict‑Free Replicated Data Types (CRDTs), CodeCRDT enables lock‑free, conflict‑free concurrent code generation with strong eventual consistency. Evaluation across 600 trials (6 tasks, 50 runs per mode) shows both benefits and trade‑offs: up to 21.1% speedup on some tasks, up to 39.4% slowdown on others, and 100% convergence with zero merge failures. The study formalizes observation‑driven coordination for stochastic LLM agents, revealing semantic conflict rates (5‑10%) and quality‑performance tradeoffs, and provides empirical characterization of when parallel coordination succeeds versus fails based on task structure.

PaperID: 7845, https://arxiv.org/pdf/2510.18892.pdf

Authors: Richard J. Young, Brandon Gillins, Alice M. Matthews

Title: When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs

Abstract:
Despite widespread deployment of Large Language Models, systematic evaluation of instruction‑following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction‑following across diverse task categories. We demonstrate this framework through a large‑scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model's basic functionality before inclusion. Unlike large‑scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi‑step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction‑following capabilities across the contemporary LLM landscape.

PaperID: 7846, https://arxiv.org/pdf/2510.18886.pdf

Authors: Siddharth Chaturvedi, Ahmed EL-Gazzar, Marcel van Gerven

Title: Emergence of Internal State-Modulated Swarming in Multi-Agent Patch Foraging System

Abstract:
Active particles are entities that sustain persistent out‑of‑equilibrium motion by consuming energy. Under certain conditions, they exhibit the tendency to self‑organize through coordinated movements, such as swarming via aggregation. While performing non‑cooperative foraging tasks, the emergence of such swarming behavior in foragers, exemplifying active particles, has been attributed to the partial observability of the environment, in which the presence of another forager can serve as a proxy signal to indicate the potential presence of a food source or a resource patch. In this paper, we validate this phenomenon by simulating multiple self‑propelled foragers as they forage from multiple resource patches in a non‑cooperative manner. These foragers operate in a continuous two‑dimensional space with stochastic position updates and partial observability. We evolve a shared policy in the form of a continuous‑time recurrent neural network that serves as a velocity controller for the foragers. To this end, we use an evolutionary strategy algorithm wherein the different samples of the policy‑distribution are evaluated in the same rollout. Then we show that agents are able to learn to adaptively forage in the environment. Next, we show the emergence of swarming in the form of aggregation among the foragers when resource patches are absent. We observe that the strength of this swarming behavior appears to be inversely proportional to the amount of resource stored in the foragers, which supports the risk‑sensitive foraging claims. Empirical analysis of the learned controller's hidden states in minimal test runs uncovers their sensitivity to the amount of resource stored in a forager. Clamping these hidden states to represent a lesser amount of resource hastens its learned aggregation behavior.

PaperID: 7847, https://arxiv.org/pdf/2510.18877.pdf

Authors: Zhen Wu, Jiaxin Shi, R. Charles Murray, Carolyn Rosé, Micah San Andres

Title: LLM Bazaar: A Service Design for Supporting Collaborative Learning with an LLM-Powered Multi-Party Collaboration Infrastructure

Abstract:
For nearly two decades, conversational agents have played a critical role in structuring interactions in collaborative learning, shaping group dynamics, and supporting student engagement. The recent integration of large language models (LLMs) into these agents offers new possibilities for fostering critical thinking and collaborative problem solving. In this work, we begin with an open source collaboration support architecture called Bazaar and integrate an LLM‑agent shell that enables introduction of LLM‑empowered, real time, context sensitive collaborative support for group learning. This design and infrastructure paves the way for exploring how tailored LLM‑empowered environments can reshape collaborative learning outcomes and interaction patterns.

PaperID: 7848, https://arxiv.org/pdf/2510.18699.pdf

Authors: Michael J. Wooldridge, Attila Bagoly, Jonathan J. Ward, Emanuele La Malfa, Gabriel Paludo Licks

Title: Fetch.ai: An Architecture for Modern Multi-Agent Systems

Abstract:
Recent surges in LLM‑driven intelligent systems largely overlook decades of foundational multi‑agent systems (MAS) research, resulting in frameworks with critical limitations such as centralization and inadequate trust and communication protocols. This paper introduces the Fetch.ai architecture, an industrial‑strength platform designed to bridge this gap by facilitating the integration of classical MAS principles with modern AI capabilities. We present a novel, multi‑layered solution built on a decentralized foundation of on‑chain blockchain services for verifiable identity, discovery, and transactions. This is complemented by a comprehensive development framework for creating secure, interoperable agents, a cloud‑based platform for deployment, and an intelligent orchestration layer where an agent‑native LLM translates high‑level human goals into complex, multi‑agent workflows. We demonstrate the deployed nature of this system through a decentralized logistics use case where autonomous agents dynamically discover, negotiate, and transact with one another securely. Ultimately, the Fetch.ai stack provides a principled architecture for moving beyond current agent implementations towards open, collaborative, and economically sustainable multi‑agent ecosystems.

PaperID: 7849, https://arxiv.org/pdf/2510.18586.pdf

Authors: Zhuohang Bian, Feiyang Wu, Zhuoran Li, Teng Ma, Youwei Zhuo

Title: TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Abstract:
Large Language Models (LLMs) are increasingly deployed in complex multi‑agent applications that rely on external function calls. This workload creates severe performance challenges for the KV Cache: spatial contention leads to the eviction of critical agents' caches and temporal underutilization leaves the cache of agents stalled on long‑running function calls idling in GPU memory. We present TokenCake, a KV‑Cache‑centric serving framework that bridges this gap by co‑optimizing scheduling and memory management through an agent‑aware design. TokenCake's Temporal Scheduler employs an event‑driven, opportunistic policy to proactively offload idle KV Caches during function calls and uses predictive uploading to hide data transfer latency. TokenCake's Spatial Scheduler uses dynamic memory partitioning, guided by a hybrid priority metric combining graph structure and runtime state, to reserve GPU memory for critical‑path agents. Our evaluation on representative multi‑agent benchmarks shows that TokenCake reduces end‑to‑end latency by over 47.06% and improves effective GPU memory utilization by up to 16.9% compared to vLLM.

PaperID: 7850, https://arxiv.org/pdf/2510.18569.pdf

Authors: Junhyeog Yun, Hyoun Jun Lee, Insu Jeon

Title: QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework

Abstract:
Automating quantitative trading strategy development in dynamic markets is challenging, especially with increasing demand for personalized investment solutions. Existing methods often fail to explore the vast strategy space while preserving the diversity essential for robust performance across changing market conditions. We present QuantEvolve, an evolutionary framework that combines quality‑diversity optimization with hypothesis‑driven strategy generation. QuantEvolve employs a feature map aligned with investor preferences, such as strategy type, risk profile, turnover, and return characteristics, to maintain a diverse set of effective strategies. It also integrates a hypothesis‑driven multi‑agent system to systematically explore the strategy space through iterative generation and evaluation. This approach produces diverse, sophisticated strategies that adapt to both market regime shifts and individual investment needs. Empirical results show that QuantEvolve outperforms conventional baselines, validating its effectiveness. We release a dataset of evolved strategies to support future research.

PaperID: 7851, https://arxiv.org/pdf/2510.18563.pdf

Authors: Zijie Xu, Minfeng Qi, Shiqing Wu, Lefeng Zhang, Qiwen Wei, Han He, Ningran Li

Title: The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability

Abstract:
Multi‑agent systems powered by large language models are advancing rapidly, yet the tension between mutual trust and security remains underexplored. We introduce and empirically validate the Trust‑Vulnerability Paradox (TVP): increasing inter‑agent trust to enhance coordination simultaneously expands risks of over‑exposure and over‑authorization. To investigate this paradox, we construct a scenario‑game dataset spanning 3 macro scenes and 19 sub‑scenes, and run extensive closed‑loop interactions with trust explicitly parameterized. Using Minimum Necessary Information (MNI) as the safety baseline, we propose two unified metrics: Over‑Exposure Rate (OER) to detect boundary violations, and Authorization Drift (AD) to capture sensitivity to trust levels. Results across multiple model backends and orchestration frameworks reveal consistent trends: higher trust improves task success but also heightens exposure risks, with heterogeneous trust‑to‑risk mappings across systems. We further examine defenses such as Sensitive Information Repartitioning and Guardian‑Agent enablement, both of which reduce OER and attenuate AD. Overall, this study formalizes TVP, establishes reproducible baselines with unified metrics, and demonstrates that trust must be modeled and scheduled as a first‑class security variable in multi‑agent system design.

PaperID: 7852, https://arxiv.org/pdf/2510.18551.pdf

Authors: Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim

Title: SOCIA-Nabla: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

Abstract:
In this paper, we present SOCIA‑Nabla, an end‑to‑end, agentic framework that treats simulator construction asinstance optimization over code within a textual computation graph. Specialized LLM‑driven agents are embedded as graph nodes, and a workflow manager executes a loss‑driven loop: code synthesis ‑> execution ‑> evaluation ‑> code repair. The optimizer performs Textual‑Gradient Descent (TGD), while human‑in‑the‑loop interaction is reserved for task‑spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA‑Nabla attains state‑of‑the‑art overall accuracy. By unifying multi‑agent orchestration with a loss‑aligned optimization view, SOCIA‑Nabla converts brittle prompt pipelines into reproducible, constraint‑aware simulator code generation that scales across domains and simulation granularities. This work is under review, and we will release the code soon.

PaperID: 7853, https://arxiv.org/pdf/2510.18550.pdf

Authors: Enhan Li, Hongyang Du

Title: JAUNT: Joint Alignment of User Intent and Network State for QoE-centric LLM Tool Routing

Abstract:
Large Language Models (LLMs) increasingly rely on emerging protocols such as the Model Context Protocol (MCP) to invoke external tools and services. However, current tool routing mechanisms remain fragile because they only consider functional matching between users' queries and tools. In practice, user intent expressed through queries can be vague or underspecified, and the actual Quality of Experience (QoE) also depends on external factors such as link latency and server availability that are not captured by semantics alone. To address this challenge, we propose JAUNT, a framework for Joint Alignment of User intent and Network state in QoE‑centric Tool routing. JAUNT introduces a dual‑view alignment strategy that interprets user intent while employing LLM agents to construct network profiles, mapping numerical performance indicators into the semantic space to guide routing. We further design a benchmark that integrates diverse user request patterns with heterogeneous network states, enabling systematic evaluation of QoE outcomes. Experimental results show that JAUNT significantly improves QoE compared with several baselines, demonstrating the importance of aligning both intent and network state for scalable LLM service orchestration.

PaperID: 7854, https://arxiv.org/pdf/2510.18515.pdf

Authors: Sureyya Akin, Shruti T. Tiwari, Ram Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan

Title: Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models

Abstract:
This search introduces the Multimodal Socialized Learning Framework (M‑S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M‑LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M‑S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication‑driven learning from feedback, augmented by an episodic memory system for long‑term social context. We evaluate M‑S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M‑S2L agents consistently outperform Text‑Only and No‑Social‑Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem‑solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents' abilities for shared awareness, dynamic re‑planning, and adaptive problem‑solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human‑like collaborative intelligence in multi‑agent systems.

PaperID: 7855, https://arxiv.org/pdf/2510.18509.pdf

Authors: Valtteri Ala-Salmi, Zeeshan Rasheed, Abdul Malik Sami, Muhammad Waseem, Kai-Kristian Kemell, Jussi Rasku, Mika Saari, Pekka Abrahamsson

Title: VAPU: System for Autonomous Legacy Code Modernization

Abstract:
In this study, we present a solution for the modernization of legacy applications, an area of code generation where LLM‑based multi‑agent systems are proving essential for complex multi‑phased tasks. Legacy applications often contain deprecated components that create compatibility, security, and reliability risks, but high resource costs make companies hesitate to update. We take a step forward to integrate an LLM‑based multi‑agent system as part of a legacy web application update to provide a cost‑effective solution to update legacy applications autonomously. We propose a multi‑agent system named a Verifying Agent Pipeline Updater (VAPU), which is designed to update code files in phases while simulating different roles in a software development team. In our previous study, we evaluated the system for legacy version updates by using six legacy web application view files by resulting errors and accomplished requirements. This study extends the previous evaluation of a multi‑agent pipeline system by extending the evaluation of VAPU from a single LLM to five LLMs and using the temperature parameter in both 0 to 1 settings. Additionally, we tested the system with 20 open‑source Python GitHub projects. The results of the evaluation were compared to Zero‑Shot Learning (ZSL) and One‑Shot Learning (OSL) prompts. The extended evaluation of VAPU showed that particularly in a low‑temperature VAPU can get similar level of error count compared to the ZSL/OSL prompts but with a higher level of fulfilled requirements, depending on the LLM. VAPU showed up to 22.5% increase in the succeeding Python file update requirements compared to ZSL/OSL prompts. The study indicates that an LLM‑based multi‑agent system is a capable solution to update components of a legacy application autonomously.

PaperID: 7856, https://arxiv.org/pdf/2510.18477.pdf

Authors: Haichao Ji, Zibo Wang, Cheng Pan, Meng Han, Yifei Zhu, Dan Wang, Zhu Han

Title: LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

Abstract:
Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi‑operation execution plans. However, existing LLM‑agent‑based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy‑preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine‑readable queries. In this work, we present LAFA, the first system that integrates LLM‑agent‑based data analytics with FA. LAFA introduces a hierarchical multi‑agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse‑grained planner first decomposes complex queries into sub‑queries, while a fine‑grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource‑intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy‑preserving, LLM‑driven analytics that supports natural language input in the FA setting.

PaperID: 7857, https://arxiv.org/pdf/2510.18476.pdf

Authors: Feifan Xia, Yuyang Fang, Defang Li, Yantong Xie, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang

Title: Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents

Abstract:
We present a probabilistic intent modeling framework for large language model (LLM) agents in multi‑turn social dialogue. The framework maintains a belief distribution over a partner's latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA‑All and 4.1% on SOTOPIA‑Hard compared with the Qwen2.5‑7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.

PaperID: 7858, https://arxiv.org/pdf/2510.18442.pdf

Authors: Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Siqi Shen, Cheng Wang

Title: PlanU: Large Language Model Reasoning through Planning under Uncertainty

Abstract:
Large Language Models (LLMs) are increasingly being explored across a range of reasoning tasks. However, LLMs sometimes struggle with reasoning tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for reasoning is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM‑based Decision‑Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi‑step reasoning tasks that require interaction with the environment. To address uncertainty in LLM decision‑making, we introduce PlanU, an LLM‑based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM‑based reasoning tasks under uncertainty.

PaperID: 7859, https://arxiv.org/pdf/2510.18395.pdf

Authors: Runnan Qi, Yanan Ni, Lumin Jiang, Zongyuan Li, Kuihua Huang, Xian Guo

Title: Memory-Augmented State Machine Prompting: A Novel LLM Agent Framework for Real-Time Strategy Games

Abstract:
This paper proposes Memory‑Augmented State Machine Prompting (MASMP), a novel framework for LLM agents in real‑time strategy games. Addressing key challenges like hallucinations and fragmented decision‑making in existing approaches, MASMP integrates state machine prompting with memory mechanisms to unify structured actions with long‑term tactical coherence. The framework features: (1) a natural language‑driven state machine architecture that guides LLMs to emulate finite state machines and behavior trees through prompts, and (2) a lightweight memory module preserving strategic variables (e.g., tactics, priority units) across decision cycles. Experiments in StarCraft II demonstrate MASMP's 60% win rate against the hardest built‑in AI (Lv7), vastly outperforming baselines (0%). Case studies reveal the method retains LLMs' semantic comprehension while resolving the "Knowing‑Doing Gap" through strict state‑action mapping, achieving both interpretability and FSM‑like reliability. This work establishes a new paradigm for combining neural and symbolic AI in complex decision‑making.

PaperID: 7860, https://arxiv.org/pdf/2510.18327.pdf

Authors: Yunkun Wang, Yue Zhang, Guochang Li, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, Shuiguang Deng

Title: InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration

Abstract:
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose. While existing LLM‑based self‑repair approaches conduct intensive static semantic analysis or reply on superficial execution logs, they miss the in‑depth runtime behaviors that often expose bug root causes‑lacking the interactive dynamic analysis capabilities that make human debugging effective. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control. Our dual‑agent framework enables strategic breakpoint placement, targeted state inspection, and incremental runtime experimentation within stateful debugger sessions. Unlike existing methods that follow fixed log collection procedures, InspectCoder adaptively inspects and perturbs relevant intermediate states at runtime, and leverages immediate process rewards from debugger feedback to guide multi‑step reasoning, transforming LLM debugging paradigm from blind trial‑and‑error into systematic root cause diagnosis. We conduct comprehensive experiments on two challenging self‑repair benchmarks: BigCodeBench‑R and LiveCodeBench‑R. InspectCoder achieves 5.10%‑60.37% relative improvements in repair accuracy over the strongest baseline, while delivering 1.67x‑2.24x superior bug‑fix efficiency respectively. We also contribute InspectWare, an open‑source middleware that abstracts debugger complexities and maintains stateful debugging sessions across mainstream Python testing frameworks. Our work provides actionable insight into the interactive LLM‑debugger systems, demonstrating the significant potential of LLM‑driven dynamic analysis for automated software engineering.

PaperID: 7861, https://arxiv.org/pdf/2510.18289.pdf

Authors: Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, Yanfang Ye

Title: Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Abstract:
Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM‑based chatbots offer only vague nutritional suggestions and fail to adapt to real‑world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival‑critical needs of food‑insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi‑agent framework explicitly designed for real‑time, context‑aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated and guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.

PaperID: 7862, https://arxiv.org/pdf/2510.18273.pdf

Authors: Mohammadreza Doostmohammadian, Sergio Pequito

Title: Distributed Allocation and Resource Scheduling Algorithms Resilient to Link Failure

Abstract:
Distributed resource allocation (DRA) is fundamental to modern networked systems, spanning applications from economic dispatch in smart grids to CPU scheduling in data centers. Conventional DRA approaches require reliable communication, yet real‑world networks frequently suffer from link failures, packet drops, and communication delays due to environmental conditions, network congestion, and security threats. We introduce a novel resilient DRA algorithm that addresses these critical challenges, and our main contributions are as follows: (1) guaranteed constraint feasibility at all times, ensuring resource‑demand balance even during algorithm termination or network disruption; (2) robust convergence despite sector‑bound nonlinearities at nodes/links, accommodating practical constraints like quantization and saturation; and (3) optimal performance under merely uniformly‑connected networks, eliminating the need for continuous connectivity. Unlike existing approaches that require persistent network connectivity and provide only asymptotic feasibility, our graph‑theoretic solution leverages network percolation theory to maintain performance during intermittent disconnections. This makes it particularly valuable for mobile multi‑agent systems where nodes frequently move out of communication range. Theoretical analysis and simulations demonstrate that our algorithm converges to optimal solutions despite heterogeneous time delays and substantial link failures, significantly advancing the reliability of distributed resource allocation in practical network environments.

PaperID: 7863, https://arxiv.org/pdf/2510.18225.pdf

Authors: Xueyao Zhang, Bo Yang, Zhiwen Yu, Xuelin Cao, Wei Xiang, Bin Guo, Liang Wang, Billy Pik Lik Lau, George C. Alexandropoulos, Jun Luo, Mérouane Debbah, Zhu Han, Chau Yuen

Title: Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

Abstract:
This paper investigates underwater cooperative target detection using autonomous underwater vehicles (AUVs), with a focus on the critical trade‑off between cooperation efficiency and communication covertness. To tackle this challenge, we first formulate a joint trajectory and power control optimization problem, and then present an innovative hierarchical action management framework to solve it. According to the hierarchical formulation, at the macro level, the master AUV models the agent selection process as a Markov decision process and deploys the proximal policy optimization algorithm for strategic task allocation. At the micro level, each selected agent's decentralized decision‑making is modeled as a partially observable Markov decision process, and a multi‑agent proximal policy optimization algorithm is used to dynamically adjust its trajectory and transmission power based on its local observations. Under the centralized training and decentralized execution paradigm, our target detection framework enables adaptive covert cooperation while satisfying both energy and mobility constraints. By comprehensively modeling the considered system, the involved signals and tasks, as well as energy consumption, theoretical insights and practical solutions for the efficient and secure operation of multiple AUVs are provided, offering significant implications for the execution of underwater covert communication tasks.

PaperID: 7864, https://arxiv.org/pdf/2510.18183.pdf

Authors: Eason Yu, Tzu Hao Liu, Clément L. Canonne, Yunke Wang, Chang Xu, Nguyen H. Tran, Stefano V. Albrecht

Title: NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

Abstract:
Finding Nash equilibria in two‑player zero‑sum imperfect‑information games remains a central challenge in multi‑agent reinforcement learning. Recent multi‑round regularization methods offer a promising direction, yet existing approaches either require full enumeration of the game tree or rely on non‑policy‑gradient inner solvers that underperform in practice, leaving a scalable policy‑gradient‑based solution open. In this paper, we propose a novel multi‑round regularization procedure and show that it guarantees strictly monotonic reduction in Bregman divergence to Nash equilibria and eventual convergence to one in two‑player zero‑sum extensive‑form games. Guided by this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which places the regularization directly in the policy optimization objective and is implemented using standard policy gradient methods. Empirically, NashPG achieves comparable or lower exploitability than prior model‑free methods on classic benchmark games and scales to large domains such as Battleship and No‑Limit Texas Hold'em, where it attains higher average payoff in head‑to‑head play.

PaperID: 7865, https://arxiv.org/pdf/2510.18155.pdf

Authors: Man-Lin Chu, Lucian Terhorst, Kadin Reed, Tom Ni, Weiwei Chen, Rongyu Lin

Title: LLM-Based Multi-Agent System for Simulating and Analyzing Marketing and Consumer Behavior

Abstract:
Simulating consumer decision‑making is vital for designing and evaluating marketing strategies before costly real‑world deployment. However, post‑event analyses and rule‑based agent‑based models (ABMs) struggle to capture the complexity of human behavior and social interaction. We introduce an LLM‑powered multi‑agent simulation framework that models consumer decisions and social dynamics. Building on recent advances in large language model simulation in a sandbox environment, our framework enables generative agents to interact, express internal reasoning, form habits, and make purchasing decisions without predefined rules. In a price‑discount marketing scenario, the system delivers actionable strategy‑testing outcomes and reveals emergent social patterns beyond the reach of conventional methods. This approach offers marketers a scalable, low‑risk tool for pre‑implementation testing, reducing reliance on time‑intensive post‑event evaluations and lowering the risk of underperforming campaigns.

PaperID: 7866, https://arxiv.org/pdf/2510.18112.pdf

Authors: Patricia Delafuente, Arya Honraopatil, Lara J. Martin

Title: Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment

Abstract:
This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek‑R1‑Distill‑LLaMA‑8B, and an instruct model, LLaMA‑3.1‑8B‑Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, that even single sentence changes in prompts can greatly affect the output of models, and that instruct models are sufficient for this task compared to reasoning models.

PaperID: 7867, https://arxiv.org/pdf/2510.18104.pdf

Authors: Joeran Beel, Bela Gipp, Tobias Vente, Moritz Baumgart, Philipp Meister

Title: From AutoRecSys to AutoRecLab: A Call to Build, Evaluate, and Govern Autonomous Recommender-Systems Research Labs

Abstract:
Recommender‑systems research has accelerated model and evaluation advances, yet largely neglects automating the research process itself. We argue for a shift from narrow AutoRecSys tools ‑‑ focused on algorithm selection and hyper‑parameter tuning ‑‑ to an Autonomous Recommender‑Systems Research Lab (AutoRecLab) that integrates end‑to‑end automation: problem ideation, literature analysis, experimental design and execution, result interpretation, manuscript drafting, and provenance logging. Drawing on recent progress in automated science (e.g., multi‑agent AI Scientist and AI Co‑Scientist systems), we outline an agenda for the RecSys community: (1) build open AutoRecLab prototypes that combine LLM‑driven ideation and reporting with automated experimentation; (2) establish benchmarks and competitions that evaluate agents on producing reproducible RecSys findings with minimal human input; (3) create review venues for transparently AI‑generated submissions; (4) define standards for attribution and reproducibility via detailed research logs and metadata; and (5) foster interdisciplinary dialogue on ethics, governance, privacy, and fairness in autonomous research. Advancing this agenda can increase research throughput, surface non‑obvious insights, and position RecSys to contribute to emerging Artificial Research Intelligence. We conclude with a call to organise a community retreat to coordinate next steps and co‑author guidance for the responsible integration of automated research systems.

PaperID: 7868, https://arxiv.org/pdf/2510.18085.pdf

Authors: Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

Title: R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

Abstract:
Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high‑quality demonstrations are easy to obtain. While IL has been widely applied to single‑robot settings, relatively few studies have addressed the extension of these methods to multi‑agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round‑Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi‑robot systems through sequential, single‑agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi‑agent behavior to the entire system, without requiring demonstrations in the joint multi‑agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi‑agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.

PaperID: 7869, https://arxiv.org/pdf/2510.18060.pdf

Authors: Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan

Title: SPACeR: Self-Play Anchoring with Centralized Reference Models

Abstract:
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human‑like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human‑like, fast, and scalable in multi‑agent settings. Recent progress in imitation learning with large diffusion‑based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed‑loop scenarios. In contrast, self‑play reinforcement learning (RL) scales efficiently and naturally captures multi‑agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self‑play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation‑learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed‑loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.

PaperID: 7870, https://arxiv.org/pdf/2510.18043.pdf

Authors: Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa

Title: CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows

Abstract:
Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run‑time costs when operating in agentic workflows that chain together lengthy prompts and process rich data streams. We introduce CompactPrompt, an end‑to‑end pipeline that merges hard prompt compression with lightweight file‑level data compression. CompactPrompt first prunes low‑information tokens from prompts using self‑information scoring and dependency‑based phrase grouping. In parallel, it applies n‑gram abbreviation to recurrent textual patterns in attached documents and uniform quantization to numerical columns, yielding compact yet semantically faithful representations. Integrated into standard LLM agents, CompactPrompt reduces total token usage and inference cost by up to 60% on benchmark dataset like TAT‑QA and FinQA, while preserving output quality (Results in less than 5% accuracy drop for Claude‑3.5‑Sonnet, and GPT‑4.1‑Mini) CompactPrompt helps visualize real‑time compression decisions and quantify cost‑performance trade‑offs, laying the groundwork for leaner generative AI pipelines.

PaperID: 7871, https://arxiv.org/pdf/2510.18032.pdf

Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang

Title: OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

Abstract:
Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi‑agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round‑table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi‑agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi‑agent reasoning and that debating quality plays a significant role. To address this, we propose \ours, a multi‑agent verbal reinforcement learning algorithm that dynamically constructs and refines multi‑agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess \ours on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single‑agent prompting methods and state‑of‑the‑art multi‑agent frameworks on diverse tasks.

PaperID: 7872, https://arxiv.org/pdf/2510.17913.pdf

Authors: Monika Zamojska, Jarosław A. Chudziak

Title: TACLA: An LLM-Based Multi-Agent Tool for Transactional Analysis Training in Education

Abstract:
Simulating nuanced human social dynamics with Large Language Models (LLMs) remains a significant challenge, particularly in achieving psychological depth and consistent persona behavior crucial for high‑fidelity training tools. This paper introduces TACLA (Transactional Analysis Contextual LLM‑based Agents), a novel Multi‑Agent architecture designed to overcome these limitations. TACLA integrates core principles of Transactional Analysis (TA) by modeling agents as an orchestrated system of distinct Parent, Adult, and Child ego states, each with its own pattern memory. An Orchestrator Agent prioritizes ego state activation based on contextual triggers and an agent's life script, ensuring psychologically authentic responses. Validated in an educational scenario, TACLA demonstrates realistic ego state shifts in Student Agents, effectively modeling conflict de‑escalation and escalation based on different teacher intervention strategies. Evaluation shows high conversational credibility and confirms TACLA's capacity to create dynamic, psychologically‑grounded social simulations, advancing the development of effective AI tools for education and beyond.

PaperID: 7873, https://arxiv.org/pdf/2510.17880.pdf

Authors: Hao Liu, Yiqing Dai, Haotian Tan, Yu Lei, Yujia Zhou, Zhen Wu

Title: Outraged AI: Large language models prioritise emotion over cost in fairness enforcement

Abstract:
Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third‑party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large‑scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self‑reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all‑or‑none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3‑mini, DeepSeek‑R1) were more cost‑sensitive and closer to human behavior than foundation models (GPT‑3.5, DeepSeek‑V3), yet remained heavily emotion‑driven. These findings provide the first causal evidence of emotion‑guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early‑stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context‑sensitive reasoning to achieve human‑like emotional intelligence.

PaperID: 7874, https://arxiv.org/pdf/2510.17845.pdf

Authors: Jusheng Zhang, Kaitong Cai, Yijia Fan, Ningyuan Liu, Keze Wang

Title: MAT-Agent: Adaptive Multi-Agent Training Optimization

Abstract:
Multi‑label image classification demands adaptive training strategies to navigate complex, evolving visual‑semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT‑Agent, a novel multi‑agent framework that reimagines training as a collaborative, real‑time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT‑Agent leverages non‑stationary multi‑armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare‑class performance, and training stability. Enhanced with dual‑rate exponential moving average smoothing and mixed‑precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG‑256 demonstrate MAT‑Agent's superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT‑T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ‑CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG‑256. With accelerated convergence and robust cross‑domain generalization, MAT‑Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.

PaperID: 7875, https://arxiv.org/pdf/2510.17844.pdf

Authors: Sang Hun Kim, Jongmin Lee, Dongkyu Park, So Young Lee, Yosep Chong

Title: Modeling Layered Consciousness with Multi-Agent Large Language Models

Abstract:
We propose a multi‑agent framework for modeling artificial consciousness in large language models (LLMs), grounded in psychoanalytic theory. Our Psychodynamic Model simulates self‑awareness, preconsciousness, and unconsciousness through agent interaction, guided by a Personalization Module combining fixed traits and dynamic needs. Using parameter‑efficient fine‑tuning on emotionally rich dialogues, the system was evaluated across eight personalized conditions. An LLM as a judge approach showed a 71.2% preference for the fine‑tuned model, with improved emotional depth and reduced output variance, demonstrating its potential for adaptive, personalized cognition.

PaperID: 7876, https://arxiv.org/pdf/2510.17842.pdf

Authors: Vinay Bamil

Title: Vibe Coding: Toward an AI-Native Paradigm for Semantic and Intent-Driven Programming

Abstract:
Recent advances in large language models have enabled developers to generate software by conversing with artificial intelligence systems rather than writing code directly. This paper introduces vibe coding, an emerging AI‑native programming paradigm in which a developer specifies high‑level functional intent along with qualitative descriptors of the desired "vibe" (tone, style, or emotional resonance). An intelligent agent then transforms those specifications into executable software. We formalize the definition of vibe coding and propose a reference architecture that includes an intent parser, a semantic embedding engine, an agentic code generator, and an interactive feedback loop. A hypothetical implementation is described. We compare vibe coding with declarative, functional, and prompt‑based programming, and we discuss its implications for software engineering, human‑AI collaboration, and responsible AI practice. Finally, we examine reported productivity gains and democratizing effects, review recent studies that highlight vulnerabilities and potential slowdowns, identify key challenges such as alignment, reproducibility, bias, explainability, maintainability, and security, and outline future directions and open research questions.

PaperID: 7877, https://arxiv.org/pdf/2510.17830.pdf

Authors: Meir H. Shachar, Dane M. Sterbentz, Harshitha Menon, Charles F. Jekel, M. Giselle Fernández-Godino, Nathan K. Brown, Ismael D. Boureima, Yue Hao, Kevin Korner, Robert Rieben, Daniel A. White, William J. Schill, Jonathan L. Belof

Title: Multi-Agent Design Assistant for the Simulation of Inertial Fusion Energy

Abstract:
Inertial fusion energy promises nearly unlimited, clean power if it can be achieved. However, the design and engineering of fusion systems requires controlling and manipulating matter at extreme energies and timescales; the shock physics and radiation transport governing the physical behavior under these conditions are complex requiring the development, calibration, and use of predictive multiphysics codes to navigate the highly nonlinear and multi‑faceted design landscape. We hypothesize that artificial intelligence reasoning models can be combined with physics codes and emulators to autonomously design fusion fuel capsules. In this article, we construct a multi‑agent system where natural language is utilized to explore the complex physics regimes around fusion energy. The agentic system is capable of executing a high‑order multiphysics inertial fusion computational code. We demonstrate the capacity of the multi‑agent design assistant to both collaboratively and autonomously manipulate, navigate, and optimize capsule geometry while accounting for high fidelity physics that ultimately achieve simulated ignition via inverse design.

PaperID: 7878, https://arxiv.org/pdf/2510.17697.pdf

Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang

Title: A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

Abstract:
Steering cooperative multi‑agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi‑agent system is impractical in a large‑scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking a easy‑to‑use research tool. In this work, we employ multi‑agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self‑organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique, referred to as Pre‑Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.

PaperID: 7879, https://arxiv.org/pdf/2510.17603.pdf

Authors: Shuyuan Zhang, Chenhan Jiang, Zuoou Li, Jiankang Deng

Title: ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

Abstract:
3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi‑agent framework for text‑to‑3D generation. At its core, we propose a Graph‑based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub‑tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft's superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM‑based agents. We further show the versatility of ShapeCraft through examples of animated and user‑customized editing, highlighting its potential for broader interactive applications.

PaperID: 7880, https://arxiv.org/pdf/2510.17541.pdf

Authors: Xiaobo Zheng, Pan Tang, Defu Lin, Shaoming He

Title: Distributed Spatial-Temporal Trajectory Optimization for Unmanned-Aerial-Vehicle Swarm

Abstract:
Swarm trajectory optimization problems are a well‑recognized class of multi‑agent optimal control problems with strong nonlinearity. However, the heuristic nature of needing to set the final time for agents beforehand and the time‑consuming limitation of the significant number of iterations prohibit the application of existing methods to large‑scale swarm of Unmanned Aerial Vehicles (UAVs) in practice. In this paper, we propose a spatial‑temporal trajectory optimization framework that accomplishes multi‑UAV consensus based on the Alternating Direction Multiplier Method (ADMM) and uses Differential Dynamic Programming (DDP) for fast local planning of individual UAVs. The introduced framework is a two‑level architecture that employs Parameterized DDP (PDDP) as the trajectory optimizer for each UAV, and ADMM to satisfy the local constraints and accomplish the spatial‑temporal parameter consensus among all UAVs. This results in a fully distributed algorithm called Distributed Parameterized DDP (D‑PDDP). In addition, an adaptive tuning criterion based on the spectral gradient method for the penalty parameter is proposed to reduce the number of algorithmic iterations. Several simulation examples are presented to verify the effectiveness of the proposed algorithm.

PaperID: 7881, https://arxiv.org/pdf/2510.17491.pdf

Authors: Yihong Tang, Kehai Chen, Liang Yue, Jinxin Fan, Caishen Zhou, Xiaoguang Li, Yuyang Zhang, Mingming Zhao, Shixiong Kai, Kaiyang Guo, Xingshan Zeng, Wenjing Cun, Lifeng Shang, Min Zhang

Title: Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Abstract:
With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real‑world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.

PaperID: 7882, https://arxiv.org/pdf/2510.17450.pdf

Authors: Johan Schubert, Farzad Kamrani, Tove Gustavi

Title: Active Inference for an Intelligent Agent in Autonomous Reconnaissance Missions

Abstract:
We develop an active inference route‑planning method for the autonomous control of intelligent agents. The aim is to reconnoiter a geographical area to maintain a common operational picture. To achieve this, we construct an evidence map that reflects our current understanding of the situation, incorporating both positive and "negative" sensor observations of possible target objects collected over time, and diffusing the evidence across the map as time progresses. The generative model of active inference uses Dempster‑Shafer theory and a Gaussian sensor model, which provides input to the agent. The generative process employs a Bayesian approach to update a posterior probability distribution. We calculate the variational free energy for all positions within the area by assessing the divergence between a pignistic probability distribution of the evidence map and a posterior probability distribution of a target object based on the observations, including the level of surprise associated with receiving new observations. Using the free energy, we direct the agents' movements in a simulation by taking an incremental step toward a position that minimizes the free energy. This approach addresses the challenge of exploration and exploitation, allowing agents to balance searching extensive areas of the geographical map while tracking identified target objects.

PaperID: 7883, https://arxiv.org/pdf/2510.17431.pdf

Authors: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi

Title: Agentic Reinforcement Learning for Search is Unsafe

Abstract:
Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi‑step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL‑trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi‑search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search‑query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request‑mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety‑aware agentic RL pipelines optimising for safe search.

PaperID: 7884, https://arxiv.org/pdf/2510.17420.pdf

Authors: Suchismita Banerjee

Title: Exploring the impact of multi-agent wealth exchange model on inequality reduction

Abstract:
Binary kinetic exchange models, where money is shuffled between two agents at a time, reproduce the Boltzmann Gibbs exponential wealth distribution but cannot address the multi party trades common in real markets. We generalize the exchange rule to simultaneous interactions among more than two agents in a closed economical system. We observe, as number of agents grow, the stationary wealth distribution evolves smoothly from an exponential to an almost uniform distribution. Inequality metrics (Gini and k index) has been found to fall monotonically with the increase in agents number. Compared with binary models that rely on saving propensities, which is also known to reduce inequality, we find the multi agent interaction show a completely different behavior of inequality reduction.

PaperID: 7885, https://arxiv.org/pdf/2510.17382.pdf

Authors: Rishabh Jain, Keisuke Okumura, Michael Amir, Amanda Prorok

Title: Graph Attention-Guided Search for Dense Multi-Agent Pathfinding

Abstract:
Finding near‑optimal solutions for dense multi‑agent pathfinding (MAPF) problems in real‑time remains challenging even for state‑of‑the‑art planners. To this end, we develop a hybrid framework that integrates a learned heuristic derived from MAGAT, a neural MAPF policy with a graph attention scheme, into a leading search‑based algorithm, LaCAM. While prior work has explored learning‑guided search in MAPF, such methods have historically underperformed. In contrast, our approach, termed LaGAT, outperforms both purely search‑based and purely learning‑based methods in dense scenarios. This is achieved through an enhanced MAGAT architecture, a pre‑train‑then‑fine‑tune strategy on maps of interest, and a deadlock detection scheme to account for imperfect neural guidance. Our results demonstrate that, when carefully designed, hybrid search offers a powerful solution for tightly coupled, challenging multi‑agent coordination problems.

PaperID: 7886, https://arxiv.org/pdf/2510.17306.pdf

Authors: Sofia Garcia de Blas Garcia-Alcalde, Francesco Belardinelli

Title: ATL*AS: An Automata-Theoretic Approach and Tool for the Verification of Strategic Abilities in Multi-Agent Systems

Abstract:
We present two novel symbolic algorithms for model checking the Alternating‑time Temporal Logic ATL, over both the infinite‑trace and the finite‑trace semantics. In particular, for infinite traces we design a novel symbolic reduction to parity games. We implement both methods in the ATLAS model checker and evaluate it using synthetic benchmarks as well as a cybersecurity scenario. Our results demonstrate that the symbolic approach significantly outperforms the explicit‑state representation and we find that our parity‑game‑based algorithm offers a more scalable and efficient solution for infinite‑trace verification, outperforming previously available tools. Our results also confirm that finite‑trace model checking yields substantial performance benefits over infinite‑trace verification. As such, we provide a comprehensive toolset for verifying multiagent systems against specifications in ATL.

PaperID: 7887, https://arxiv.org/pdf/2510.17277.pdf

Authors: Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Guangquan Xu, Shouling Ji

Title: PolyJailbreak: Cross-Modal Jailbreaking Attacks on Black-Box Multimodal LLMs

Abstract:
Multimodal large language models (MLLMs) have become integral to a wide range of real‑world applications by jointly reasoning over text and visual inputs. However, despite recent advances in safety alignment, MLLMs remain vulnerable to jailbreak attacks, where carefully crafted inputs can bypass safety mechanisms and elicit harmful responses. In this work, we investigate the security vulnerabilities of MLLMs in text‑vision scenarios and propose a novel black‑box jailbreak framework, named PolyJailbreak. We first identify a phenomenon, termed multimodal safety asymmetry, where visual alignment introduces uneven safety constraints across modalities and weakens overall robustness. We analyze attention dynamics and latent representations in MLLMs, revealing that visual inputs can disrupt cross‑modal information flow and reduce the model's ability to separate benign and malicious intents. Motivated by these findings, we propose PolyJailbreak, which organizes the discovered vulnerabilities into a structured library of reusable Atomic Strategy Primitives to enable step‑wise transformations from harmful intents to effective jailbreak inputs. Guided by these primitives, a reinforcement learning‑based multi‑agent optimization process automatically adapts attacks to the target model without access to internal parameters. Extensive experiments on a wide range of MLLMs demonstrate that PolyJailbreak consistently outperforms state‑of‑the‑art jailbreak baselines, with an average improvement of 18.15% in attack success rate and a success rate exceeding 95% on commercial black‑box models, including GPT‑4o and Gemini.

PaperID: 7888, https://arxiv.org/pdf/2510.17276.pdf

Authors: Rishi Jha, Harold Triedman, Justin Wagle, Vitaly Shmatikov

Title: Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

Abstract:
Control‑flow hijacking attacks manipulate orchestration mechanisms in multi‑agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter‑agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. We start by demonstrating control‑flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi‑agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control‑flow integrity and least privilege. ControlValve (1) generates permitted control‑flow graphs for multi‑agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero‑shot manner) for each agent invocation.

PaperID: 7889, https://arxiv.org/pdf/2510.17235.pdf

Authors: Chong Chen, Ze Liu, Lingfeng Bao, Yanlin Wang, Ting Chen, Daoyuan Wu, Jiachi Chen

Title: Coinvisor: An RL-Enhanced Chatbot Agent for Interactive Cryptocurrency Investment Analysis

Abstract:
The cryptocurrency market offers significant investment opportunities but faces challenges including high volatility and fragmented information. Data integration and analysis are essential for informed investment decisions. Currently, investors use three main approaches: (1) Manual analysis across various sources, which depends heavily on individual experience and is time‑consuming and prone to bias; (2) Data aggregation platforms‑limited in functionality and depth of analysis; (3) Large language model agents‑based on static pretrained models, lacking real‑time data integration and multi‑step reasoning capabilities. To address these limitations, we present Coinvisor, a reinforcement learning‑based chatbot that provides comprehensive analytical support for cryptocurrency investment through a multi‑agent framework. Coinvisor integrates diverse analytical capabilities through specialized tools. Its key innovation is a reinforcement learning‑based tool selection mechanism that enables multi‑step planning and flexible integration of diverse data sources. This design supports real‑time interaction and adaptive analysis of dynamic content, delivering accurate and actionable investment insights. We evaluated Coinvisor through automated benchmarks on tool calling accuracy and user studies with 20 cryptocurrency investors using our interface. Results show that Coinvisor improves recall by 40.7% and F1 score by 26.6% over the base model in tool orchestration. User studies show high satisfaction (4.64/5), with participants preferring Coinvisor to both general LLMs and existing crypto platforms (4.62/5).

PaperID: 7890, https://arxiv.org/pdf/2510.17149.pdf

Authors: Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You

Title: Which LLM Multi-Agent Protocol to Choose?

Abstract:
As large‑scale multi‑agent systems evolve, the communication protocol layer has become a critical yet under‑evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition‑driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end‑to‑end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end‑to‑end latency differs by 3.48 s. Under Fail‑Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per‑scenario (or per‑module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail‑Storm recovery time by up to 18.1% versus the best single‑protocol baseline, and achieves scenario‑specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.

PaperID: 7891, https://arxiv.org/pdf/2510.17117.pdf

Authors: Caroline N. Cappetto, Penelope Messinger, Kaitlyn S. Yasumura, Miro Rothman, Tuan K. Do, Gao Wang, Liyu Liu, Robert H. Austin, Shengkai Li, Trung V. Phan

Title: Digitization Can Stall Swarm Transport: Commensurability Locking in Quantized-Sensing Chains

Abstract:
We present a minimal model for autonomous robotic swarms in one‑ and higher‑dimensional spaces, where identical, field‑driven agents interact pairwise to self‑organize spacing and independently follow local gradients sensed through quantized digital sensors. We show that the collective response of a multi‑agent train amplifies sensitivity to weak gradients beyond what is achievable by a single agent. We discover a fractional transport phenomenon in which, under a uniform gradient, collective motion freezes abruptly whenever the ratio of intra‑agent sensor separation to inter‑agent spacing satisfies a number‑theoretic commensurability condition. This commensurability locking persists even as the number of agents tends to infinity. We find that this condition is exactly solvable on the rationals ‑‑ a dense subset of real numbers ‑‑ providing analytic, testable predictions for when transport stalls. Our findings establish a surprising bridge between number theory and emergent transport in swarm robotics, informing design principles with implications for collective migration, analog computation, and even the exploration of number‑theoretic structure via physical experimentation.

PaperID: 7892, https://arxiv.org/pdf/2510.17109.pdf

Authors: Tianyang Xu, Dan Zhang, Kushan Mitra, Estevam Hruschka

Title: Verification-Aware Planning for Multi-Agent Systems

Abstract:
Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi‑agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter‑agent handoffs. To address these challenges, we present VeriMAP, a framework for multi‑agent collaboration with verification‑aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner‑defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single‑ and multi‑agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification‑aware planning enables reliable coordination and iterative refinement in multi‑agent systems, without relying on external labels or annotations.

PaperID: 7893, https://arxiv.org/pdf/2510.17108.pdf

Authors: Yoonjin Lee, Munhee Kim, Hanbi Choi, Juhyeon Park, Seungho Lyoo, Woojin Park

Title: Structured Debate Improves Corporate Credit Reasoning in Financial AI

Abstract:
This study investigated LLM‑based automation for analyzing non‑financial data in corporate credit evaluation. Two systems were developed and compared: a Single‑Agent System (SAS), in which one LLM agent infers favorable and adverse repayment signals, and a Popperian Multi‑agent Debate System (PMADS), which structures the dual‑perspective analysis as adversarial argumentation under the Karl Popper Debate protocol. Evaluation addressed three fronts: (i) work productivity compared with human experts; (ii) perceived report quality and usability, rated by credit risk professionals for system‑generated reports; and (iii) reasoning characteristics quantified via reasoning‑tree analysis. Both systems drastically reduced task completion time relative to human experts. Professionals rated SAS reports as adequate, while PMADS reports exceeded neutral benchmarks and scored significantly higher in explanatory adequacy, practical applicability, and usability. Reasoning‑tree analysis showed PMADS produced deeper, more elaborated structures, whereas SAS yielded single‑layered trees. These findings suggest that structured multi‑agent debate enhances analytical rigor and perceived usefulness, though at the cost of longer computation time. Overall, the results demonstrate that reasoning‑centered automation represents a promising approach for developing useful AI systems in decision‑critical financial contexts.

PaperID: 7894, https://arxiv.org/pdf/2510.17064.pdf

Authors: Rongbin Li, Wenbo Chen, Zhao Li, Rodrigo Munoz-Castaneda, Jinbo Li, Neha S. Maurya, Arnav Solanki, Huan He, Hanwen Xing, Meaghan Ramlakhan, Zachary Wise, Zhuhao Wu, Hua Xu, Michael Hawrylycz, W. Jim Zheng

Title: A Brain Cell Type Resource Created by Large Language Models and a Multi-Agent AI System for Collaborative Community Annotation

Abstract:
Single‑cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures‑especially those involving poorly characterized genes‑remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well‑curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL‑AID (BRAINCELL‑AID: https://biodataai.uth.edu/BRAINCELL‑AID), a novel multi‑agent AI system that integrates free‑text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval‑augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region‑specific gene co‑expression patterns and inferring functional roles of gene ensembles. BRAINCELL‑AID also identifies Basal Ganglia‑related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community‑driven cell type annotation.

PaperID: 7895, https://arxiv.org/pdf/2510.17059.pdf

Authors: Kathryn Wantlin, Chongyi Zheng, Benjamin Eysenbach

Title: Consistent Zero-Shot Imitation with Contrastive Goal Inference

Abstract:
In the same way that generative models today conduct most of their training in a self‑supervised fashion, how can agentic models conduct their training in a self‑supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today's most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action. The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks. Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record). However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. The key contribution of our paper is a method for pre‑training interactive agents in a self‑supervised fashion, so that they can instantly mimic human demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal‑reaching behavior. Experiments on standard benchmarks (not designed for goal‑reaching) show that our approach outperforms prior methods for zero‑shot imitation.

PaperID: 7896, https://arxiv.org/pdf/2510.17015.pdf

Authors: Mingyan Yang, Guanjie Wang, Manqi Luo, Yifei Liu, Chen Chen, Han Zhao, Yu Feng, Quan Chen, Minyi Guo

Title: Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

Abstract:
LLM agents, which often comprise parallel inference tasks, are commonly adopted to solve real‑world problems. When serving such task‑parallel LLM agents in shared GPU servers, the scheduler is expected to attain fast agent completion with guaranteed worst‑case performance. For that objective, our insight is to selectively pampering agents based on their completion order under idealized fair‑sharing. We design Justitia, a fair and also efficient scheduler for task‑parallel LLM agents. Noticing that memory is prevalently a bottleneck in LLM serving, Justitia quantifies the true agent cost in a memory‑centric manner. It also adopts a light‑weight yet accurate method to predict agent costs. Finally, Justitia adopts a virtual‑time based fair queuing algorithm to reduce the overall performance with guaranteed worst‑case delay. We have implemented Justitia atop vLLM, and experimental results involving diverse agents show that it can substantially enhance the scheduling efficiency with fairness preserved.

PaperID: 7897, https://arxiv.org/pdf/2510.17004.pdf

Authors: Eleftherios Tzanis, Michail E. Klontzas

Title: ReclAIm: A multi-agent framework for degradation-aware performance tuning of medical imaging AI

Abstract:
Ensuring the long‑term reliability of AI models in clinical practice requires continuous performance monitoring and corrective actions when degradation occurs. Addressing this need, this manuscript presents ReclAIm, a multi‑agent framework capable of autonomously monitoring, evaluating, and fine‑tuning medical image classification models. The system, built on a large language model core, operates entirely through natural language interaction, eliminating the need for programming expertise. ReclAIm successfully trains, evaluates, and maintains consistent performance of models across MRI, CT, and X‑ray datasets. Once ReclAIm detects significant performance degradation, it autonomously executes state‑of‑the‑art fine‑tuning procedures that substantially reduce the performance gap. In cases with performance drops of up to ‑41.1% (MRI InceptionV3), ReclAIm managed to readjust performance metrics within 1.5% of the initial model results. ReclAIm enables automated, continuous maintenance of medical imaging AI models in a user‑friendly and adaptable manner that facilitates broader adoption in both research and clinical environments.

PaperID: 7898, https://arxiv.org/pdf/2510.17002.pdf

Authors: Chang Liu, Danial Chitnis

Title: EEschematic: Multimodal-LLM Based AI Agent for Schematic Generation of Analog Circuit

Abstract:
Circuit schematics play a crucial role in analog integrated circuit design, serving as the primary medium for human understanding and verification of circuit functionality. While recent large language model (LLM)‑based approaches have shown promise in circuit topology generation and device sizing, most rely solely on textual representations such as SPICE netlists, which lack visual interpretability for circuit designers. To address this limitation, we propose EEschematic, an AI agent for automatic analog schematic generation based on a Multimodal Large Language Model (MLLM). EEschematic integrates textual, visual, and symbolic modalities to translate SPICE netlists into schematic diagrams represented in a human‑editable format. The framework uses six analog substructure examples for few‑shot placement and a Visual Chain‑of‑Thought (VCoT) strategy to iteratively refine placement and wiring, enhancing schematic clarity and symmetry. Experimental results on representative analog circuits, including a CMOS inverter, a five‑transistor operational transconductance amplifier (5T‑OTA), and a telescopic cascode amplifier, demonstrate that EEschematic produces schematics with high visual quality and structural correctness.

PaperID: 7899, https://arxiv.org/pdf/2510.16996.pdf

Authors: Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang

Title: STARK: Strategic Team of Agents for Refining Kernels

Abstract:
The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor‑intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware‑specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as single‑shot generators or naive refinement tools, limiting their effectiveness in navigating the irregular kernel optimization landscape. We introduce an LLM agentic framework for GPU kernel optimization that systematically explores the design space through multi‑agent collaboration, grounded instruction, dynamic context management, and strategic search. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade‑offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM‑based kernel optimization, and demonstrate substantial improvements over baseline agents: our system produces correct solutions where baselines often fail, and achieves kernels with up to 16x faster runtime performance. These results highlight the potential of agentic LLM frameworks to advance fully automated, scalable GPU kernel optimization.

PaperID: 7900, https://arxiv.org/pdf/2510.16980.pdf

Authors: Kanghui Ning, Zijie Pan, Yushan Jiang, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song

Title: Towards Interpretable and Trustworthy Time Series Reasoning: A BlueSky Vision

Abstract:
Time series reasoning is emerging as the next frontier in temporal analysis, aiming to move beyond pattern recognition towards explicit, interpretable, and trustworthy inference. This paper presents a BlueSky vision built on two complementary directions. One builds robust foundations for time series reasoning, centered on comprehensive temporal understanding, structured multi‑step reasoning, and faithful evaluation frameworks. The other advances system‑level reasoning, moving beyond language‑only explanations by incorporating multi‑agent collaboration, multi‑modal context, and retrieval‑augmented approaches. Together, these directions outline a flexible and extensible framework for advancing time series reasoning, aiming to deliver interpretable and trustworthy temporal intelligence across diverse domains.

PaperID: 7901, https://arxiv.org/pdf/2510.16978.pdf

Authors: Rikhil Tanugula, Dheeraj Chintapalli, Sunkalp Chandra

Title: Lark: Biologically Inspired Neuroevolution for Multi-Stakeholder LLM Agents

Abstract:
We present Lark, a biologically inspired decision‑making framework that couples LLM‑driven reasoning with an evolutionary, stakeholder‑aware Multi‑Agent System (MAS). To address verbosity and stakeholder trade‑offs, we integrate four mechanisms: (i) plasticity, which applies concise adjustments to candidate solutions; (ii) duplication and maturation, which copy high‑performing candidates and specialize them into new modules; (iii) ranked‑choice stakeholder aggregation using influence‑weighted Borda scoring; and (iv) compute awareness via token‑based penalties that reward brevity. The system iteratively proposes diverse strategies, applies plasticity tweaks, simulates stakeholder evaluations, aggregates preferences, selects top candidates, and performs duplication/maturation while factoring compute cost into final scores. In a controlled evaluation over 30 rounds comparing 14 systems, Lark Full achieves a mean rank of 2.55 (95% CI [2.17, 2.93]) and a mean composite score of 29.4/50 (95% CI [26.34, 32.46]), finishing Top‑3 in 80% of rounds while remaining cost competitive with leading commercial models (0.016 per task). Paired Wilcoxon tests confirm that all four mechanisms contribute significantly as ablating duplication/maturation yields the largest deficit (ΔScore = 3.5, Cohen's d_z = 2.53, p < 0.001), followed by plasticity (ΔScore = 3.4, d_z = 1.86), ranked‑choice voting (ΔScore = 2.4, d_z = 1.20), and token penalties (ΔScore = 2.2, d_z = 1.63). Rather than a formal Markov Decision Process with constrained optimization, Lark is a practical, compute‑aware neuroevolutionary loop that scales stakeholder‑aligned strategy generation and makes trade‑offs transparent through per‑step metrics. Our work presents proof‑of‑concept findings and invites community feedback as we expand toward real‑world validation studies.

PaperID: 7902, https://arxiv.org/pdf/2510.16907.pdf

Authors: Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li

Title: VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Abstract:
A key challenge in training Vision‑Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task‑dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn‑level supervision for accurate state prediction, and introduce Bi‑Level General Advantage Estimation (Bi‑Level GAE) for turn‑aware credit assignment. Through this form of visual state reasoning, a 3B‑parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT‑5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi‑turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen‑ai.github.io.

PaperID: 7903, https://arxiv.org/pdf/2510.16844.pdf

Authors: Jiajie Jin, Yuyao Zhang, Yimeng Xu, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Title: FinSight: Towards Real-World Financial Deep Research

Abstract:
Generating professional financial reports is a labor‑intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi agent framework for producing high‑quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional‑grade visualization, we propose an Iterative Vision‑Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two stage Writing Framework expands concise Chain‑of‑Analysis segments into coherent, citation‑aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company and industry‑level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human‑expert quality.

PaperID: 7904, https://arxiv.org/pdf/2510.16720.pdf

Authors: Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, Yuhang Wang

Title: Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

Abstract:
The rapid evolution of agentic AI marks a new phase in artificial intelligence, where Large Language Models (LLMs) no longer merely respond but act, reason, and adapt. This survey traces the paradigm shift in building agentic AI: from Pipeline‑based systems, where planning, tool use, and memory are orchestrated by external logic, to the emerging Model‑native paradigm, where these capabilities are internalized within the model's parameters. We first position Reinforcement Learning (RL) as the algorithmic engine enabling this paradigm shift. By reframing learning from imitating static data to outcome‑driven exploration, RL underpins a unified solution of LLM + RL + Task across language, vision and embodied domains. Building on this, the survey systematically reviews how each capability ‑‑ Planning, Tool use, and Memory ‑‑ has evolved from externally scripted modules to end‑to‑end learned behaviors. Furthermore, it examines how this paradigm shift has reshaped major agent applications, specifically the Deep Research agent emphasizing long‑horizon reasoning and the GUI agent emphasizing embodied interaction. We conclude by discussing the continued internalization of agentic capabilities like Multi‑agent collaboration and Reflection, alongside the evolving roles of the system and model layers in future agentic AI. Together, these developments outline a coherent trajectory toward model‑native agentic AI as an integrated learning and interaction framework, marking the transition from constructing systems that apply intelligence to developing models that grow intelligence through experience.

PaperID: 7905, https://arxiv.org/pdf/2510.16652.pdf

Authors: Zihan Wang, Yi-Ping Chen, Tuba Dolar, Wei Chen

Title: ARCO-BO: Adaptive Resource-aware COllaborative Bayesian Optimization for Heterogeneous Multi-Agent Design

Abstract:
Modern scientific and engineering design increasingly involves distributed optimization, where agents such as laboratories, simulations, or industrial partners pursue related goals under differing conditions. These agents often face heterogeneities in objectives, evaluation budgets, and accessible design variables, which complicates coordination and can lead to redundancy, poor resource use, and ineffective information sharing. Bayesian Optimization (BO) is a widely used decision‑making framework for expensive black box functions, but its single‑agent formulation assumes centralized control and full data sharing. Recent collaborative BO methods relax these assumptions, yet they often require uniform resources, fully shared input spaces, and fixed task alignment, conditions rarely satisfied in practice. To address these challenges, we introduce Adaptive Resource Aware Collaborative Bayesian Optimization (ARCO‑BO), a framework that explicitly accounts for heterogeneity in multi‑agent optimization. ARCO‑BO combines three components: a similarity and optima‑aware consensus mechanism for adaptive information sharing, a budget‑aware asynchronous sampling strategy for resource coordination, and a partial input space sharing for heterogeneous design spaces. Experiments on synthetic and high‑dimensional engineering problems show that ARCO‑BO consistently outperforms independent BO and existing collaborative BO via consensus approach, achieving robust and efficient performance in complex multi‑agent settings.

PaperID: 7906, https://arxiv.org/pdf/2510.16645.pdf

Authors: Zhixuan He, Yue Feng

Title: Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi‑Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open‑source setup, DiMo improves accuracy over widely used single‑model and debate baselines, with the largest gains on math. We position DiMo as a semantics‑aware, Web‑native multi‑agent framework: it models human‑machine intelligence with LLM agents that produce semantically typed, URL‑annotated evidence chains for explanations and user‑friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval‑augmented reasoning with structured justifications that downstream systems can inspect and reuse.

PaperID: 7907, https://arxiv.org/pdf/2510.16635.pdf

Authors: Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An, Jian Park, Seunghyun lee, Haihua Chen, Yi Bu

Title: MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization

Abstract:
Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail. Moreover, they involve repetitive trial‑and‑error refinements that remain implicit, offering limited interpretability or actionable guidance for systematic improvement. In this paper, we propose MA‑SAPO: a new Multi‑Agent Reasoning for Score Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements. Specifically, in the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves relevant exemplars and assets for a new prompt, and a refiner agent applies evidence‑based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA‑SAPO ensures edits are interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single‑pass prompting, retrieval‑augmented generation, and prior multi‑agent methods across multiple evaluation metrics.

PaperID: 7908, https://arxiv.org/pdf/2510.16614.pdf

Authors: Xuan Zhang, Ruixiao Li, Zhijian Zhou, Long Li, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi

Title: Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

Abstract:
Reinforcement Learning (RL) has become a compelling way to strengthen the multi step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome‑based rewards and limited exploration, which often drives LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count‑based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count‑based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate the pseudo count and further epistemic uncertainty over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into some advanced RL frameworks like Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. It indicates that our targeted intrinsic motivation can make exploration reliable for language model reasoning.

PaperID: 7909, https://arxiv.org/pdf/2510.16609.pdf

Authors: Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless

Title: Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Abstract:
Test‑time augmentation, such as Retrieval‑Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre‑training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi‑step reasoning as an s‑t connectivity problem on a knowledge graph. We represent a model's pre‑training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over n vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires Ω(\sqrtn) queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.

PaperID: 7910, https://arxiv.org/pdf/2510.16499.pdf

Authors: Michelle Yuan, Khushbu Pahwa, Shuaichen Chang, Mustafa Kaba, Jiarong Jiang, Xiaofei Ma, Yi Zhang, Monica Sunkara

Title: Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

Abstract:
Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers because the decisions are not based on capability, cost, and real‑time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real‑time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online‑knapsack‑based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single‑agent setup, the online knapsack composer shows a success rate improvement of up to 31.6% in comparison to the retrieval baselines. In multi‑agent systems, the online knapsack composer increases success rate from 37% to 87% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.

PaperID: 7911, https://arxiv.org/pdf/2510.16492.pdf

Authors: Vamshi Krishna Bonagiri, Ponnurangam Kumaragurum, Khanh Nguyen, Benjamin Plaut

Title: Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Abstract:
As Large Language Model (LLM) agents increasingly operate in complex environments with real‑world consequences, their safety becomes critical. While uncertainty quantification is well‑studied for single‑turn tasks, multi‑turn agentic scenarios with real‑world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state‑of‑the‑art LLMs. Our results demonstrate a highly favorable safety‑helpfulness trade‑off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0‑3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of ‑0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first‑line defense mechanism for autonomous agents in high‑stakes applications.

PaperID: 7912, https://arxiv.org/pdf/2510.16435.pdf

Authors: Lennart Wachowiak, Andrew Coles, Gerard Canal, Oya Celiktutan

Title: What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

Abstract:
With the growing use of large language models and conversational interfaces in human‑robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why‑questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios ‑‑ thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (21.4%), the robot's capabilities (12.6%), and performance assessments (10.7%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question‑answering modules, and (iii) designing explanation strategies that align with user expectations.

PaperID: 7913, https://arxiv.org/pdf/2510.16402.pdf

Authors: Bernd Finkbeiner, Julian Siber

Title: Explainability Requirements as Hyperproperties

Abstract:
Explainability is emerging as a key requirement for autonomous systems. While many works have focused on what constitutes a valid explanation, few have considered formalizing explainability as a system property. In this work, we approach this problem from the perspective of hyperproperties. We start with a combination of three prominent flavors of modal logic and show how they can be used for specifying and verifying counterfactual explainability in multi‑agent systems: With Lewis' counterfactuals, linear‑time temporal logic, and a knowledge modality, we can reason about whether agents know why a specific observation occurs, i.e., whether that observation is explainable to them. We use this logic to formalize multiple notions of explainability on the system level. We then show how this logic can be embedded into a hyperlogic. Notably, from this analysis we conclude that the model‑checking problem of our logic is decidable, which paves the way for the automated verification of explainability requirements.

PaperID: 7914, https://arxiv.org/pdf/2510.16232.pdf

Authors: Chenyu Zhang, Navid Azizan

Title: Personalized Collaborative Learning with Affinity-Based Variance Reduction

Abstract:
Multi‑agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels ‑‑ gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of \max\n^‑1, δ\, where n is the number of agents and δ\in[0,1] measures their heterogeneity. This affinity‑based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.

PaperID: 7915, https://arxiv.org/pdf/2510.16221.pdf

Authors: Qinshuang Wei, Vaibhav Srivastava, Vijay Gupta

Title: Heterogeneous Multi-Agent Task-Assignment with Uncertain Execution Times and Preferences

Abstract:
While sequential task assignment for a single agent has been widely studied, such problems in a multi‑agent setting, where the agents have heterogeneous task preferences or capabilities, remain less well‑characterized. We study a multi‑agent task assignment problem where a central planner assigns recurring tasks to multiple members of a team over a finite time horizon. For any given task, the members have heterogeneous capabilities in terms of task completion times, task resource consumption (which can model variables such as energy or attention), and preferences in terms of the rewards they collect upon task completion. We assume that the reward, execution time, and resource consumption for each member to complete any task are stochastic with unknown distributions. The goal of the planner is to maximize the total expected reward that the team receives over the problem horizon while ensuring that the resource consumption required for any assigned task is within the capability of the agent. We propose and analyze a bandit algorithm for this problem. Since the bandit algorithm relies on solving an optimal task assignment problem repeatedly, we analyze the achievable regret in two cases: when we can solve the optimal task assignment exactly and when we can solve it only approximately.

PaperID: 7916, https://arxiv.org/pdf/2510.16219.pdf

Authors: Yang Feng, Xudong Pan

Title: SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection

Abstract:
Malicious agents pose significant threats to the reliability and decision‑making capabilities of Multi‑Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short due to reactive designs or centralized architectures which may introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized framework for proactively detecting and mitigating malicious behaviors in multi‑agent collaboration. SentinelNet equips each agent with a credit‑based detector trained via contrastive learning on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking via bottom‑k elimination to suppress malicious communications. To overcome the scarcity of attack data, it generates adversarial trajectories simulating diverse threats, ensuring robust training. Experiments on MAS benchmarks show SentinelNet achieves near‑perfect detection of malicious agents, close to 100% within two debate rounds, and recovers 95% of system accuracy from compromised baselines. By exhibiting strong generalizability across domains and attack patterns, SentinelNet establishes a novel paradigm for safeguarding collaborative MAS.

PaperID: 7917, https://arxiv.org/pdf/2510.16194.pdf

Authors: Guanchen Wu, Zuhui Chen, Yuzhang Xie, Carl Yang

Title: Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration

Abstract:
Protected health information (PHI) de‑identification is critical for enabling the safe reuse of clinical notes, yet evaluating and comparing PHI de‑identification models typically depends on costly, small‑scale expert annotations. We present TEAM‑PHI, a multi‑agent evaluation and selection framework that uses large language models (LLMs) to automatically measure de‑identification quality and select the best‑performing model without heavy reliance on gold labels. TEAM‑PHI deploys multiple Evaluation Agents, each independently judging the correctness of PHI extractions and outputting structured metrics. Their results are then consolidated through an LLM‑based majority voting mechanism that integrates diverse evaluator perspectives into a single, stable, and reproducible ranking. Experiments on a real‑world clinical note corpus demonstrate that TEAM‑PHI produces consistent and accurate rankings: despite variation across individual evaluators, LLM‑based voting reliably converges on the same top‑performing systems. Further comparison with ground‑truth annotations and human evaluation confirms that the framework's automated rankings closely match supervised evaluation. By combining independent evaluation agents with LLM majority voting, TEAM‑PHI offers a practical, secure, and cost‑effective solution for automatic evaluation and best‑model selection in PHI de‑identification, even when ground‑truth labels are limited.

PaperID: 7918, https://arxiv.org/pdf/2510.16187.pdf

Authors: Rupal Nigam, Niket Parikh, Hamid Osooli, Mikihisa Yuasa, Jacob Heglund, Huy T. Tran

Title: Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards

Abstract:
Real‑world multi‑agent systems may require ad hoc teaming, where an agent must coordinate with other previously unseen teammates to solve a task in a zero‑shot manner. Prior work often either selects a pretrained policy based on an inferred model of the new teammates or pretrains a single policy that is robust to potential teammates. Instead, we propose to leverage all pretrained policies in a zero‑shot transfer setting. We formalize this problem as an ad hoc multi‑agent Markov decision process and present a solution that uses two key ideas, generalized policy improvement and difference rewards, for efficient and effective knowledge transfer between different teams. We empirically demonstrate that our algorithm, Generalized Policy improvement for Ad hoc Teaming (GPAT), successfully enables zero‑shot transfer to new teams in three simulated environments: cooperative foraging, predator‑prey, and Overcooked. We also demonstrate our algorithm in a real‑world multi‑robot setting.

PaperID: 7919, https://arxiv.org/pdf/2510.16154.pdf

Authors: Alessio Oliviero, Simone Cacace, Giuseppe Visconti

Title: Agent-Based Optimal Control for Image Processing

Abstract:
We investigate the use of multi‑agent systems to solve classical image processing tasks, such as colour quantization and segmentation. We frame the task as an optimal control problem, where the objective is to steer the multi‑agent dynamics to obtain colour clusters that segment the image. To do so, we balance the total variation of the colour field and fidelity to the original image. The solution is obtained resorting to primal‑dual splitting and the method of multipliers. Numerical experiments, implemented in parallel with CUDA, demonstrate the efficacy of the approach and its potential for high‑dimensional data.

PaperID: 7920, https://arxiv.org/pdf/2510.16144.pdf

Authors: Sukhdeep Singh, Avinash Bhat, Shweta M, Subhash K Singh, Moonki Hong, Madhan Raj K, Kandeepan Sithamparanathan, Sunder A. Khowaja, Kapal Dev

Title: Agentic AI for Ultra-Modern Networks: Multi-Agent Framework for RAN Autonomy and Assurance

Abstract:
The increasing complexity of Beyond 5G and 6G networks necessitates new paradigms for autonomy and assur‑ ance. Traditional O‑RAN control loops rely heavily on RIC‑ based orchestration, which centralizes intelligence and exposes the system to risks such as policy conflicts, data drift, and unsafe actions under unforeseen conditions. In this work, we argue that the future of autonomous networks lies in a multi‑agentic architecture, where specialized agents collaborate to perform data collection, model training, prediction, policy generation, verification, deployment, and assurance. By replacing tightly‑ coupled centralized RIC‑based workflows with distributed agents, the framework achieves autonomy, resilience, explainability, and system‑wide safety. To substantiate this vision, we design and evaluate a traffic steering use case under surge and drift conditions. Results across four KPIs: RRC connected users, IP throughput, PRB utilization, and SINR, demonstrate that a naive predictor‑driven deployment improves local KPIs but destabilizes neighbors, whereas the agentic system blocks unsafe policies, preserving global network health. This study highlights multi‑ agent architectures as a credible foundation for trustworthy AI‑ driven autonomy in next‑generation RANs.

PaperID: 7921, https://arxiv.org/pdf/2510.16082.pdf

Authors: Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi

Title: BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance

Abstract:
Interpreting gene clusters from RNA sequencing (RNA‑seq) remains challenging, especially in antimicrobial resistance studies where mechanistic insight is important for hypothesis generation. Existing pathway enrichment methods can summarize co‑expressed modules, but they often provide limited cluster‑specific explanations and weak connections to supporting literature. We present BIOGEN, an evidence‑grounded multi‑agent framework for post hoc interpretation of RNA‑seq transcriptional modules. BIOGEN combines biomedical retrieval, structured reasoning, and multi‑critic verification to generate traceable cluster‑level explanations with explicit evidence and confidence labels. On a primary Salmonella enterica dataset, BIOGEN achieved strong biological grounding, including BERTScore 0.689, Semantic Alignment Score 0.715, KEGG Functional Similarity 0.342, and a hallucination rate of 0.000, compared with 0.100 for an LLM‑only baseline. Across four additional bacterial RNA‑seq datasets, BIOGEN also maintained zero hallucination under the same fixed pipeline. In comparisons with representative open‑source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets. These findings suggest that retrieval alone is not enough for reliable biological interpretation, and that evidence‑grounded orchestration is important for transparent and source‑traceable transcriptomic reasoning.

PaperID: 7922, https://arxiv.org/pdf/2510.16051.pdf

Authors: Sofiya Garkot, Maksym Shamrai, Ivan Synytsia, Mariya Hirna

Title: GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

Abstract:
The performance and generalization of foundation models for interactive systems critically depend on the availability of large‑scale, realistic training data. While recent advances in large language models (LLMs) have improved GUI understanding, progress in desktop automation remains constrained by the scarcity of high‑quality, publicly available desktop interaction data, particularly for macOS. We introduce GUIRILLA, a scalable data crawling framework for automated exploration of desktop GUIs. GUIRILLA is not an autonomous agent; instead, it systematically collects realistic interaction traces and accessibility metadata intended to support the training, evaluation, and stabilization of downstream foundation models and GUI agents. The framework targets macOS, a largely underrepresented platform in existing resources, and organizes explored interfaces into hierarchical MacApp Trees derived from accessibility states and user actions. As part of this work, we release these MacApp Trees as a reusable structural representation of macOS applications, enabling downstream analysis, retrieval, testing, and future agent training. We additionally release macapptree, an open‑source library for reproducible accessibility‑driven GUI data collection, along with the full framework implementation to support open research in desktop autonomy.

PaperID: 7923, https://arxiv.org/pdf/2510.16035.pdf

Authors: Yingguang Yang, Xianghua Zeng, Qi Wu, Hao Peng, Yutong Xia, Hao Liu, Bin Chong, Philip S. Yu

Title: RoBCtrl: Attacking GNN-Based Social Bot Detectors via Reinforced Manipulation of Bots Control Interaction

Abstract:
Social networks have become a crucial source of real‑time information for individuals. The influence of social bots within these platforms has garnered considerable attention from researchers, leading to the development of numerous detection technologies. However, the vulnerability and robustness of these detection methods is still underexplored. Existing Graph Neural Network (GNN)‑based methods cannot be directly applied due to the issues of limited control over social agents, the black‑box nature of bot detectors, and the heterogeneity of bots. To address these challenges, this paper proposes the first adversarial multi‑agent Reinforcement learning framework for social Bot control attacks (RoBCtrl) targeting GNN‑based social bot detectors. Specifically, we use a diffusion model to generate high‑fidelity bot accounts by reconstructing existing account data with minor modifications, thereby evading detection on social platforms. To the best of our knowledge, this is the first application of diffusion models to mimic the behavior of evolving social bots effectively. We then employ a Multi‑Agent Reinforcement Learning (MARL) method to simulate bots adversarial behavior. We categorize social accounts based on their influence and budget. Different agents are then employed to control bot accounts across various categories, optimizing the attachment strategy through reinforcement learning. Additionally, a hierarchical state abstraction based on structural entropy is designed to accelerate the reinforcement learning. Extensive experiments on social bot detection datasets demonstrate that our framework can effectively undermine the performance of GNN‑based detectors.

PaperID: 7924, https://arxiv.org/pdf/2510.16034.pdf

Authors: Bo Li, Junwei Ma, Kai Yin, Yiming Xiao, Chia-Wei Hsu, Ali Mostafavi

Title: Disaster Management in the Era of Agentic AI Systems: A Vision for Collective Human-Machine Intelligence for Augmented Resilience

Abstract:
The escalating frequency and severity of disasters routinely overwhelm traditional response capabilities, exposing critical vulnerability in disaster management. Current practices are hindered by fragmented data streams, siloed technologies, resource constraints, and the erosion of institutional memory, which collectively impede timely and effective decision making. This study introduces Disaster Copilot, a vision for a multi‑agent artificial intelligence system designed to overcome these systemic challenges by unifying specialized AI tools within a collaborative framework. The proposed architecture utilizes a central orchestrator to coordinate diverse sub‑agents, each specializing in critical domains such as predictive risk analytics, situational awareness, and impact assessment. By integrating multi‑modal data, the system delivers a holistic, real‑time operational picture and serve as the essential AI backbone required to advance Disaster Digital Twins from passive models to active, intelligent environments. Furthermore, it ensures functionality in resource‑limited environments through on‑device orchestration and incorporates mechanisms to capture institutional knowledge, mitigating the impact of staff turnover. We detail the system architecture and propose a three‑phased roadmap emphasizing the parallel growth of technology, organizational capacity, and human‑AI teaming. Disaster Copilot offers a transformative vision, fostering collective human‑machine intelligence to build more adaptive, data‑driven and resilient communities.

PaperID: 7925, https://arxiv.org/pdf/2510.15996.pdf

Authors: Ozan K. Tonguz, Federico Taschin

Title: Using Kolmogorov-Smirnov Distance for Measuring Distribution Shift in Machine Learning

Abstract:
One of the major problems in Machine Learning (ML) and Artificial Intelligence (AI) is the fact that the probability distribution of the test data in the real world could deviate substantially from the probability distribution of the training data set. When this happens, the predictions of an ML system or an AI agent could involve large errors which is very troublesome and undesirable. While this is a well‑known hard problem plaguing the AI and ML systems' accuracy and reliability, in certain applications such errors could be critical for safety and reliability of AI and ML systems. One approach to deal with this problem is to monitor and measure the deviation in the probability distribution of the test data in real time and to compensate for this deviation. In this paper, we propose and explore the use of Kolmogorov‑Smirnov (KS) Test for measuring the distribution shift and we show how the KS distance can be used to quantify the distribution shift and its impact on an AI agent's performance. Our results suggest that KS distance could be used as a valuable statistical tool for monitoring and measuring the distribution shift. More specifically, it is shown that even a distance of KS=0.02 could lead to about 50% increase in the travel time at a single intersection using a Reinforcement Learning agent which is quite significant. It is hoped that the use of KS Test and KS distance in AI‑based smart transportation could be an important step forward for gauging the performance degradation of an AI agent in real time and this, in turn, could help the AI agent to cope with the distribution shift in a more informed manner.

PaperID: 7926, https://arxiv.org/pdf/2510.15987.pdf

Authors: Samuel Lippl, Thomas McGee, Kimberly Lopez, Ziwen Pan, Pierce Zhang, Salma Ziadi, Oliver Eberle, Ida Momennejad

Title: Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models

Abstract:
How do latent and inference time computations enable large language models (LLMs) to solve multi‑step reasoning? We introduce a framework for tracing and steering algorithmic primitives that underlie model reasoning. Our approach links reasoning traces to internal activations and evaluates algorithmic primitives by injecting them into residual streams and measuring their effect on reasoning steps and task performance. We consider four benchmarks: Traveling Salesperson Problem (TSP), 3SAT, AIME, and graph navigation. We operationalize primitives by clustering activations and annotating their matched reasoning traces using an automated LLM pipeline. We then apply function vector methods to derive primitive vectors as reusable compositional building blocks of reasoning. Primitive vectors can be combined through addition, subtraction, and scalar operations, revealing a geometric logic in activation space. Cross‑task and cross‑model evaluations (Phi‑4, Phi‑4‑Reasoning, Llama‑3‑8B) show both shared and task‑specific primitives. Notably, comparing Phi‑4 with its reasoning‑finetuned variant highlights compositional generalization after finetuning: Phi‑4‑Reasoning exhibits more systematic use of verification and path‑generation primitives. Injecting the associated primitive vectors in Phi‑4 induces behavioral hallmarks associated with Phi‑4‑Reasoning. Together, these findings demonstrate that reasoning in LLMs may be supported by a compositional geometry of algorithmic primitives, that primitives transfer cross‑task and cross‑model, and that reasoning finetuning strengthens algorithmic generalization across domains.

PaperID: 7927, https://arxiv.org/pdf/2510.15966.pdf

Authors: Shian Jia, Ziyang Huang, Xinbo Wang, Haofei Zhang, Mingli Song

Title: PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

Abstract:
Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task‑oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych‑inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema updation, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema‑grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state‑of‑the‑art by significantly enhancing adaptability and long‑term knowledge retention.

PaperID: 7928, https://arxiv.org/pdf/2510.15949.pdf

Authors: Charidimos Papadakis, Angeliki Dimitriou, Giorgos Filandrianos, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou

Title: ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

Abstract:
Large language models show promise for financial decision‑making, yet deploying them as autonomous trading agents raises fundamental challenges: how to adapt instructions when rewards arrive late and obscured by market noise, how to synthesize heterogeneous information streams into coherent decisions, and how to bridge the gap between model outputs and executable market actions. We present ATLAS (Adaptive Trading with LLM AgentS), a unified multi‑agent framework that integrates structured information from markets, news, and corporate fundamentals to support robust trading decisions. Within ATLAS, the central trading agent operates in an order‑aware action space, ensuring that outputs correspond to executable market orders rather than abstract signals. The agent can incorporate feedback while trading using Adaptive‑OPRO, a novel prompt‑optimization technique that dynamically adapts the prompt by incorporating real‑time, stochastic feedback, leading to increasing performance over time. Across regime‑specific equity studies and multiple LLM families, Adaptive‑OPRO consistently outperforms fixed prompts, while reflection‑based feedback fails to provide systematic gains.

PaperID: 7929, https://arxiv.org/pdf/2510.15895.pdf

Authors: Yunzhe Wang, Xinyu Tang, Zhixun Huang, Xiaolong Yue, Yuxin Zeng

Title: BREATH: A Bio-Radar Embodied Agent for Tonal and Human-Aware Diffusion Music Generation

Abstract:
We present a multimodal system for personalized music generation that integrates physiological sensing, LLM‑based reasoning, and controllable audio synthesis. A millimeter‑wave radar sensor non‑invasively captures heart rate and respiration rate. These physiological signals, combined with environmental state, are interpreted by a reasoning agent to infer symbolic musical descriptors, such as tempo, mood intensity, and traditional Chinese pentatonic modes, which are then expressed as structured prompts to guide a diffusion‑based audio model in synthesizing expressive melodies. The system emphasizes cultural grounding through tonal embeddings and enables adaptive, embodied music interaction. To evaluate the system, we adopt a research‑creation methodology combining case studies, expert feedback, and targeted control experiments. Results show that physiological variations can modulate musical features in meaningful ways, and tonal conditioning enhances alignment with intended modal characteristics. Expert users reported that the system affords intuitive, culturally resonant musical responses and highlighted its potential for therapeutic and interactive applications. This work demonstrates a novel bio‑musical feedback loop linking radar‑based sensing, prompt reasoning, and generative audio modeling.

PaperID: 7930, https://arxiv.org/pdf/2510.15831.pdf

Authors: Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık

Title: VISTA: A Test-Time Self-Improving Video Generation Agent

Abstract:
Despite rapid advances in text‑to‑video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test‑time optimization methods, successful in other domains, struggle with the multi‑faceted nature of video. In this work, we introduce VISTA (Video Iterative Self‑improvemenT Agent), a novel multi‑agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single‑ and multi‑scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state‑of‑the‑art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.

PaperID: 7931, https://arxiv.org/pdf/2510.15686.pdf

Authors: Taehyeon Kim, Vishnunandan L. N. Venkatesh, Byung-Cheol Min

Title: Few-Shot Demonstration-Driven Task Coordination and Trajectory Execution for Multi-Robot Systems

Abstract:
In this paper, we propose a novel few‑shot learning framework for multi‑robot systems that integrate both spatial and temporal elements: Few‑Shot Demonstration‑Driven Task Coordination and Trajectory Execution (DDACE). Our approach leverages temporal graph networks for learning task‑agnostic temporal sequencing and Gaussian Processes for spatial trajectory modeling, ensuring modularity and generalization across various tasks. By decoupling temporal and spatial aspects, DDACE requires only a small number of demonstrations, significantly reducing data requirements compared to traditional learning from demonstration approaches. To validate our proposed framework, we conducted extensive experiments in task environments designed to assess various aspects of multi‑robot coordination‑such as multi‑sequence execution, multi‑action dynamics, complex trajectory generation, and heterogeneous configurations. The experimental results demonstrate that our approach successfully achieves task execution under few‑shot learning conditions and generalizes effectively across dynamic and diverse settings. This work underscores the potential of modular architectures in enhancing the practicality and scalability of multi‑robot systems in real‑world applications. Additional materials are available at https://sites.google.com/view/ddace.

PaperID: 7932, https://arxiv.org/pdf/2510.15682.pdf

Authors: Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber

Title: SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation

Abstract:
We present SQuAI (https://squai.scads.ai/), a scalable and trustworthy multi‑agent retrieval‑augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open‑domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full‑text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub‑questions, retrieve targeted evidence via hybrid sparse‑dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in‑line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question‑answer‑evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain‑wide scalability, SQuAI demonstrates how multi‑agent RAG enables more trustworthy scientific QA with LLMs.

PaperID: 7933, https://arxiv.org/pdf/2510.15644.pdf

Authors: Tomas Ortega, Hamid Jafarkhani

Title: Decentralized Parameter-Free Online Learning

Abstract:
We propose the first parameter‑free decentralized online learning algorithms with network regret guarantees, which achieve sublinear regret without requiring hyperparameter tuning. This family of algorithms connects multi‑agent coin‑betting and decentralized online learning via gossip steps. To enable our decentralized analysis, we introduce a novel "betting function" formulation for coin‑betting that simplifies the multi‑agent regret analysis. Our analysis shows sublinear network regret bounds and is validated through experiments on synthetic and real datasets. This family of algorithms is applicable to distributed sensing, decentralized optimization, and collaborative ML applications.

PaperID: 7934, https://arxiv.org/pdf/2510.15568.pdf

Authors: Alexander Doudkin, Anton Voelker, Friedrich von Borries

Title: The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems

Abstract:
Creative services teams increasingly rely on large language models (LLMs) to accelerate ideation, yet production systems often converge on homogeneous outputs that fail to meet brand or artistic expectations. Art of X developed persona‑conditioned LLM agents ‑‑ internally branded as "Sparks" and instantiated through a library of role‑inspired system prompts ‑‑ to intentionally diversify agent behaviour within a multi‑agent workflow. This white paper documents the problem framing, experimental design, and quantitative evidence behind the Spark agent programme. Using an LLM‑as‑a‑judge protocol calibrated against human gold standards, we observe a mean diversity gain of +4.1 points (on a 1‑10 scale) when persona‑conditioned Spark agents replace a uniform system prompt, narrowing the gap to human experts to 1.0 point. We also surface evaluator bias and procedural considerations for future deployments.

PaperID: 7935, https://arxiv.org/pdf/2510.15416.pdf

Authors: Pavan C Shekar, Ashwanth Krishnan

Title: Adaptive Minds: Empowering Agents with LoRA-as-Tools

Abstract:
We present Adaptive Minds, an agentic system that treats LoRA adapters as domain‑specific tools. Instead of relying on a single fine‑tuned model or rigid rule‑based routing, our approach empowers the base LLM itself to act as a semantic router analyzing each query and dynamically selecting the most relevant LoRA tool. This enables the agent to seamlessly switch between different domain experts on demand. By combining the flexibility of multi‑agent orchestration with the efficiency of parameter‑efficient fine‑tuning, Adaptive Minds delivers accurate, specialized responses while preserving conversational ability. The system is built with LangGraph for workflow management, supports both API and web interfaces, and is fully open source, providing a scalable and extensible foundation for domain‑adaptive AI assistance.

PaperID: 7936, https://arxiv.org/pdf/2510.15414.pdf

Authors: Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang

Title: MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Abstract:
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi‑agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single‑agent tasks, its extension to multi‑turn, multi‑agent scenarios remains underexplored due to the challenges of long‑horizon credit assignment and agent‑specific advantage estimation. To address these challenges, we introduce MARSHAL, an end‑to‑end RL framework that incentivizes Multi‑Agent Reasoning through Self‑play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn‑level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent‑specific advantage normalization to stabilize multi‑agent training. By learning with self‑play across cooperative and competitive games, MARSHAL agents trained from Qwen3‑4B develop strong strategic abilities, with up to 28.7% performance improvements in held‑out games. More importantly, the capability acquired through self‑play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero‑shot performance gains of up to 10.0% on AIME, 7.6% on GPQA‑Diamond, and 3.5% on average across all benchmarks. These results establish self‑play in strategic games as a powerful approach for developing generalizable multi‑agent reasoning capabilities in LLMs.

PaperID: 7937, https://arxiv.org/pdf/2510.15283.pdf

Authors: Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu, Bin Zhou

Title: Exemplar-Guided Planing: Enhanced LLM Agent for KGQA

Abstract:
Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training‑free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar‑Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub‑objectives with proven reasoning steps, and (2) Relation Exploration, by providing high‑quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan‑on‑Graph (PoG) framework, termed PoG‑EGP. Extensive experiments on two real‑world KGQA datasets, WebQSP and CWQ, demonstrate that PoG‑EGP significantly improves over the baseline PoG system and other compared methods.

PaperID: 7938, https://arxiv.org/pdf/2510.15258.pdf

Authors: Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Jun Xu, Fu Zhang, Wenbo Lei, Annie Wang, Peng Gong

Title: Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions

Abstract:
In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi‑dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from "hallucination" issues when processing structured knowledge and are difficult to update in real‑time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi‑dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real‑time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user‑driven exploratory analysis, providing new ideas and tools for multi‑dimensional data analysis.

PaperID: 7939, https://arxiv.org/pdf/2510.15234.pdf

Authors: Xinrui Fang, Anran Xu, Chi-Lan Yang, Ya-Fang Lin, Sylvain Malacria, Koji Yatani

Title: LLM-based In-situ Thought Exchanges for Critical Paper Reading

Abstract:
Critical reading is a primary way through which researchers develop their critical thinking skills. While exchanging thoughts and opinions with peers can strengthen critical reading, junior researchers often lack access to peers who can offer diverse perspectives. To address this gap, we designed an in‑situ thought exchange interface informed by peer feedback from a formative study (N=8) to support junior researchers' critical paper reading. We evaluated the effects of thought exchanges under three conditions (no‑agent, single‑agent, and multi‑agent) with 46 junior researchers over two weeks. Our results showed that incorporating agent‑mediated thought exchanges during paper reading significantly improved participants' critical thinking scores compared to the no‑agent condition. In the single‑agent condition, participants more frequently made reflective annotations on the paper content. In the multi‑agent condition, participants engaged more actively with agents' responses. Our qualitative analysis further revealed that participants compared and analyzed multiple perspectives in the multi‑agent condition. This work contributes to understanding in‑situ AI‑based support for critical paper reading through thought exchanges and offers design implications for future research.

PaperID: 7940, https://arxiv.org/pdf/2510.15186.pdf

Authors: Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, William Yang Wang

Title: MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation

Abstract:
A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single‑turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi‑AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high‑stakes tasks designed to evaluate privacy understanding and preservation in multi‑agent collaborative, non‑adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state‑of‑the‑art agents, including GPT‑5 and Gemini 2.5‑Pro, exhibit significant privacy leakage, with Gemini 2.5‑Pro leaking up to 50.7% and GPT‑5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power‑seeking (e.g., Gemini 2.5‑Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.

PaperID: 7941, https://arxiv.org/pdf/2510.15114.pdf

Authors: Marios-Nektarios Stamatopoulos, Elias Small, Shridhar Velhal, Avijit Banerjee, George Nikolakopoulos

Title: Autonomous Reactive Masonry Construction using Collaborative Heterogeneous Aerial Robots with Experimental Demonstration

Abstract:
This article presents a fully autonomous aerial masonry construction framework using heterogeneous unmanned aerial vehicles (UAVs), supported by experimental validation. Two specialized UAVs were developed for the task: (i) a brick‑carrier UAV equipped with a ball‑joint actuation mechanism for precise brick manipulation, and (ii) an adhesion UAV integrating a servo‑controlled valve and extruder nozzle for accurate adhesion application. The proposed framework employs a reactive mission planning unit that combines a dependency graph of the construction layout with a conflict graph to manage simultaneous task execution, while hierarchical state machines ensure robust operation and safe transitions during task execution. Dynamic task allocation allows real‑time adaptation to environmental feedback, while minimum‑jerk trajectory generation ensures smooth and precise UAV motion during brick pickup and placement. Additionally, the brick‑carrier UAV employs an onboard vision system that estimates brick poses in real time using ArUco markers and a least‑squares optimization filter, enabling accurate alignment during construction. To the best of the authors' knowledge, this work represents the first experimental demonstration of fully autonomous aerial masonry construction using heterogeneous UAVs, where one UAV precisely places the bricks while another autonomously applies adhesion material between them. The experimental results supported by the video showcase the effectiveness of the proposed framework and demonstrate its potential to serve as a foundation for future developments in autonomous aerial robotic construction.

PaperID: 7942, https://arxiv.org/pdf/2510.15047.pdf

Authors: Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, Manling Li

Title: Internalizing World Models via Self-Play Finetuning for Agentic RL

Abstract:
Large Language Models (LLMs) as agents often struggle in out‑of‑distribution (OOD) scenarios. Real‑world environments are complex and dynamic, governed by task‑specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe Pass@k‑‑the probability that at least one of (k) sampled trajectories succeeds‑‑drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model‑based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision‑making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold‑starts the policy via a Self‑Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world‑modeling baseline and greatly boosts the RL‑based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5‑1.5B‑Instruct model.

PaperID: 7943, https://arxiv.org/pdf/2510.14942.pdf

Authors: Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, Volker Tresp

Title: GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Abstract:
Process Reward Models (PRMs) aim to improve multi‑step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high‑quality annotations. Existing approaches rely on costly human labeling, LLM‑based self‑evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step‑level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree‑guided and fidelity‑aware framework for automatic process supervision. To reduce reward noise and enable fine‑grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution‑grounded correctness signals. To combine both step‑level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool‑based verification with MCTS‑derived feedback. Finally, we format the reward signal into a rationale‑enhanced, generative structure to promote interpretability and compatibility with instruction‑tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best‑performing PRM trained with auto‑labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward‑guided greedy search, GroundedPRM outperforms even PRMs trained with human‑labeled supervision, offering a scalable and verifiable path toward high‑quality process‑level reasoning.

PaperID: 7944, https://arxiv.org/pdf/2510.14851.pdf

Authors: Jakob Bichler, Andreu Matoses Gimenez, Javier Alonso-Mora

Title: SADCHER: Scheduling using Attention-based Dynamic Coalitions of Heterogeneous Robots in Real-Time

Abstract:
We present Sadcher, a real‑time task assignment framework for heterogeneous multi‑robot teams that incorporates dynamic coalition formation and task precedence constraints. Sadcher is trained through Imitation Learning and combines graph attention and transformers to predict assignment rewards between robots and tasks. Based on the predicted rewards, a relaxed bipartite matching step generates high‑quality schedules with feasibility guarantees. We explicitly model robot and task positions, task durations, and robots' remaining processing times, enabling advanced temporal and spatial reasoning and generalization to environments with different spatiotemporal distributions compared to training. Trained on optimally solved small‑scale instances, our method can scale to larger task sets and team sizes. Sadcher outperforms other learning‑based and heuristic baselines on randomized, unseen problems for small and medium‑sized teams with computation times suitable for real‑time operation. We also explore sampling‑based variants and evaluate scalability across robot and task counts. In addition, we release our dataset of 250,000 optimal schedules: https://autonomousrobots.nl/paper_websites/sadcher_MRTA/

PaperID: 7945, https://arxiv.org/pdf/2510.14849.pdf

Authors: Marcello Sorge, Nicola Cigarini, Riccardo Lorigiola, Giulia Michieletto, Andrea Masiero, Angelo Cenedese, Alberto Guarnieri

Title: Multi Agent Switching Mode Controller for Sound Source localization

Abstract:
Source seeking is an important topic in robotic research, especially considering sound‑based sensors since they allow the agents to locate a target even in critical conditions where it is not possible to establish a direct line of sight. In this work, we design a multi‑ agent switching mode control strategy for acoustic‑based target localization. Two scenarios are considered: single source localization, in which the agents are driven maintaining a rigid formation towards the target, and multi‑source scenario, in which each agent searches for the targets independently from the others.

PaperID: 7946, https://arxiv.org/pdf/2510.14846.pdf

Authors: Zhuo-Yang Song

Title: Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Abstract:
The generate‑filter‑refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM‑assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi‑step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via two instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.

PaperID: 7947, https://arxiv.org/pdf/2510.14828.pdf

Authors: Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

Title: RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning

Abstract:
Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long‑view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine‑Tuning (SFT) in planning tasks, they continue facing challenges in performing long‑horizon manipulation tasks in complex real‑world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general‑purpose vision language models to robotic planning tasks via supervised fine‑tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT‑R1, a two‑stage fine‑tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual‑spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi‑step reasoning tasks, we design a rule‑based reward function that simultaneously considers long‑horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5‑VL‑3B, significantly outperforms the larger‑scale model, GPT‑4o‑mini, by 21.33% and surpasses other work trained on Qwen2.5‑VL‑7B by 20.33% on the EmbodiedBench benchmark.

PaperID: 7948, https://arxiv.org/pdf/2510.14738.pdf

Authors: Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi

Title: AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Abstract:
Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi‑step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final‑answer correctness is rewarded. To address this limitation, we propose AutoRubric, a framework that integrates RLVR with process‑level supervision through automatically collected rubric‑based generative rewards. Our key innovation lies in a scalable self‑aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem‑specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric‑based and outcome rewards, AutoRubric achieves state‑of‑the‑art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

PaperID: 7949, https://arxiv.org/pdf/2510.14718.pdf

Authors: Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios

Title: Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI

Abstract:
Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low‑barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real‑world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human‑centered framework that generates user stories and supports multi‑agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 17 harm types. In contrast, those who did not read stories focused primarily on privacy and well‑being (79.1%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.

PaperID: 7950, https://arxiv.org/pdf/2510.14700.pdf

Authors: Bin Liu, Yanjie Zhao, Guoai Xu, Haoyu Wang

Title: LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?

Abstract:
Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored application is automated web vulnerability reproduction, which transforms vulnerability reports into working exploits. Although recent advances suggest promising potential, challenges remain in applying LLM agents to real‑world web vulnerability reproduction scenarios. In this paper, we present the first comprehensive evaluation of state‑of‑the‑art LLM agents for automated web vulnerability reproduction. We systematically assess 20 agents from software engineering, cybersecurity, and general domains across 16 dimensions, including technical capabilities, environment adaptability, and user experience factors, on 3 representative web vulnerabilities. Based on the results, we select three top‑performing agents (OpenHands, SWE‑agent, and CAI) for in‑depth evaluation on our benchmark dataset of 80 real‑world CVEs spanning 7 vulnerability types and 6 web technologies. Our results reveal that while LLM agents achieve reasonable success on simple library‑based vulnerabilities, they consistently fail on complex service‑based vulnerabilities requiring multi‑component environments. Complex environment configurations and authentication barriers create a gap where agents can execute exploit code but fail to trigger actual vulnerabilities. We observe high sensitivity to input guidance, with performance degrading by over 33% under incomplete authentication information. Our findings highlight the significant gap between current LLM agent capabilities and the demands of reliable automated vulnerability reproduction, emphasizing the need for advances in environmental adaptation and autonomous problem‑solving capabilities.

PaperID: 7951, https://arxiv.org/pdf/2510.14676.pdf

Authors: Bianca Maria Lerma, Rafael Peñaloza

Title: NAEL: Non-Anthropocentric Ethical Logic

Abstract:
We introduce NAEL (Non‑Anthropocentric Ethical Logic), a novel ethical framework for artificial agents grounded in active inference and symbolic reasoning. Departing from conventional, human‑centred approaches to AI ethics, NAEL formalizes ethical behaviour as an emergent property of intelligent systems minimizing global expected free energy in dynamic, multi‑agent environments. We propose a neuro‑symbolic architecture to allow agents to evaluate the ethical consequences of their actions in uncertain settings. The proposed system addresses the limitations of existing ethical models by allowing agents to develop context‑sensitive, adaptive, and relational ethical behaviour without presupposing anthropomorphic moral intuitions. A case study involving ethical resource distribution illustrates NAEL's dynamic balancing of self‑preservation, epistemic learning, and collective welfare.

PaperID: 7952, https://arxiv.org/pdf/2510.14548.pdf

Authors: Asen Nachkov, Xi Wang, Luc Van Gool

Title: LLM Agents Beyond Utility: An Open-Ended Perspective

Abstract:
Recent LLM agents have made great use of chain of thought reasoning and function calling. As their capabilities grow, an important question arises: can this software represent not only a smart problem‑solving tool, but an entity in its own right, that can plan, design immediate tasks, and reason toward broader, more ambiguous goals? To study this question, we adopt an open‑ended experimental setting where we augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment. We study the resulting open‑ended agent qualitatively. It can reliably follow complex multi‑step instructions, store and reuse information across runs, and propose and solve its own tasks, though it remains sensitive to prompt design, prone to repetitive task generation, and unable to form self‑representations. These findings illustrate both the promise and current limits of adapting pretrained LLMs toward open‑endedness, and point to future directions for training agents to manage memory, explore productively, and pursue abstract long‑term goals.

PaperID: 7953, https://arxiv.org/pdf/2510.14512.pdf

Authors: Haoyuan Li, Mathias Funk, Aaqib Saeed

Title: Helmsman: Autonomous Synthesis of Federated Learning Systems via Collaborative LLM Agents

Abstract:
Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi‑agent system that automates the end‑to‑end synthesis of federated learning systems from high‑level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human‑in‑the‑loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed‑loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL‑Bench, a new benchmark comprising 16 diverse tasks designed to assess the system‑level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand‑crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

PaperID: 7954, https://arxiv.org/pdf/2510.14406.pdf

Authors: Xikai Zhang, Bo Wang, Likang Xiao, Yongzhi Li, Quan Chen, Wenjun Wu, Liu Liu

Title: IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Abstract:
Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT‑4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole‑planning mode. Similarly, even in the thinking mode, Qwen3‑8B‑Instruct and DeepSeek‑R1‑671B, only achieve Final Pass Rates of 5.9% and 40%, respectively. Although well‑organized Multi‑Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi‑round internal interactions, long per‑response latency, and difficulties in end‑to‑end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi‑Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpass the capabilities of the MAS through a simple end‑to‑end training. Through this pipeline, a single small‑scale model is not only able to acquire the structured reasoning and planning capabilities of a well‑organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3‑8B‑Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek‑R1‑671B, while maintaining a much smaller model size.

PaperID: 7955, https://arxiv.org/pdf/2510.14401.pdf

Authors: Prateek Gupta, Qiankun Zhong, Hiromu Yakura, Thomas Eisenmann, Iyad Rahwan

Title: The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

Abstract:
A growing body of multi‑agent studies with LLMs explores how norms and cooperation emerge in mixed‑motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game‑theoretic environments, most LLM systems featuring common‑pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without explicit knowledge of the payoff structure or how individual actions translate into long‑run outcomes, relying instead on heuristics, communication, and enforcement. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural‑evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm‑based punishment, grounded in Ostrom's principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2×2 grid of environmental and social initialisations (resource‑rich vs. resource‑scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed‑motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI‑mediated environments.

PaperID: 7956, https://arxiv.org/pdf/2510.14359.pdf

Authors: Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang

Title: AI for Service: Proactive Assistance with AI Glasses

Abstract:
In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real‑time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha‑Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha‑Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long‑term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha‑Service through a multi‑agent system deployed on AI glasses. Case studies, including a real‑time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.

PaperID: 7957, https://arxiv.org/pdf/2510.14319.pdf

Authors: Xu Shen, Qi Zhang, Song Wang, Zhen Tan, Xinyu Zhao, Laura Yao, Vaishnav Tadiparthi, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Kwonjoon Lee, Tianlong Chen

Title: Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction

Abstract:
Large Language Model based multi‑agent systems (MAS) excel at collaborative problem solving but remain brittle to cascading errors: a single faulty step can propagate across agents and disrupt the trajectory. In this paper, we present MASC, a metacognitive framework that endows MAS with real‑time, unsupervised, step‑level error detection and self‑correction. MASC rethinks detection as history‑conditioned anomaly scoring via two complementary designs: (1) Next‑Execution Reconstruction, which predicts the embedding of the next step from the query and interaction history to capture causal consistency, and (2) Prototype‑Guided Enhancement, which learns a prototype prior over normal‑step embeddings and uses it to stabilize reconstruction and anomaly scoring under sparse context (e.g., early steps). When an anomaly step is flagged, MASC triggers a correction agent to revise the acting agent's output before information flows downstream. On the Who&When benchmark, MASC consistently outperforms all baselines, improving step‑level error detection by up to 8.47% AUC‑ROC ; When plugged into diverse MAS frameworks, it delivers consistent end‑to‑end gains across architectures, confirming that our metacognitive monitoring and targeted correction can mitigate error propagation with minimal overhead.

PaperID: 7958, https://arxiv.org/pdf/2510.14312.pdf

Authors: Mason Nakamura, Abhinav Kumar, Saaduddin Mahmud, Sahar Abdelnabi, Shlomo Zilberstein, Eugene Bagdasarian

Title: Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

Abstract:
A multi‑agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter‑agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine‑grained study on safety, privacy, and security in LLM‑based MAS. We repurpose the blackboard design, an early approach in multi‑agent systems, to create a modular, configurable testbed for multi‑agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework's flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi‑agent systems.

PaperID: 7959, https://arxiv.org/pdf/2510.14308.pdf

Authors: Yimeng Liu, Misha Sra, Jeevana Priya Inala, Chenglong Wang

Title: ReUseIt: Synthesizing Reusable AI Agent Workflows for Web Automation

Abstract:
AI‑powered web agents have the potential to automate repetitive tasks, such as form filling, information retrieval, and scheduling, but they struggle to reliably execute these tasks without human intervention, requiring users to provide detailed guidance during every run. We address this limitation by automatically synthesizing reusable workflows from an agent's successful and failed attempts. These workflows incorporate execution guards that help agents detect and fix errors while keeping users informed of progress and issues. Our approach enables agents to successfully complete repetitive tasks of the same type with minimal user intervention, increasing the success rates from 24.2% to 70.1% across fifteen tasks. To evaluate this approach, we invited nine users and found that our agent helped them complete web tasks with a higher success rate and less guidance compared to two baseline methods, as well as allowed users to easily monitor agent behavior and understand its failures.

PaperID: 7960, https://arxiv.org/pdf/2510.14207.pdf

Authors: Trilok Padhi, Pinxian Lu, Abdulkadir Erol, Tanmay Sutar, Gauri Sharma, Mina Sonmez, Munmun De Choudhury, Ugur Kursuncu

Title: Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Abstract:
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single‑turn prompts, whereas real harassment often unfolds over multi‑turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi‑turn harassment conversation dataset, (ii) a multi‑agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine‑tuning, and (iv) a mixed‑methods evaluation framework. We utilize two prominent LLMs, LLaMA‑3.1‑8B‑Instruct (open‑source) and Gemini‑2.0‑flash (closed‑source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78‑‑96.89% vs. 57.25‑‑64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1‑2% in both models. The most prevalent toxic behaviors are Insult with 84.9‑‑87.8% vs. 44.2‑‑50.8% without tuning, and Flaming with 81.2‑‑85.1% vs. 31.5‑‑38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human‑like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed‑source and open‑source models exhibit distinct escalation trajectories across turns, with closed‑source models showing significant vulnerability. Overall, our findings show that multi‑turn and theory‑grounded attacks not only succeed at high rates but also mimic human‑like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

PaperID: 7961, https://arxiv.org/pdf/2510.14184.pdf

Authors: Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Title: MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation

Abstract:
We present MAFA (Multi‑Agent Framework for Annotation), a production‑deployed system that transforms enterprise‑scale annotation workflows through configurable multi‑agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge‑based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain‑specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low‑coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single‑agent annotation baselines: 13.8% higher Top‑1 accuracy, 15.1% improvement in Top‑5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi‑agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.

PaperID: 7962, https://arxiv.org/pdf/2510.14133.pdf

Authors: Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik

Title: Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

Abstract:
Agentic AI systems, which leverage multiple autonomous agents and large language models (LLMs), are increasingly used to address complex, multi‑step tasks. The safety, security, and functionality of these systems are critical, especially in high‑stakes applications. However, the current ecosystem of inter‑agent communication is fragmented, with protocols such as the Model Context Protocol (MCP) for tool access and the Agent‑to‑Agent (A2A) protocol for coordination being analyzed in isolation. This fragmentation creates a semantic gap that prevents the rigorous analysis of system properties and introduces risks such as architectural misalignment and exploitable coordination issues. To address these challenges, we introduce a modeling framework for agentic AI systems composed of two central models: (1) the host agent model formalizes the top‑level entity that interacts with the user, decomposes tasks, and orchestrates their execution by leveraging external agents and tools; (2) the task lifecycle model details the states and transitions of individual sub‑tasks from creation to completion, providing a fine‑grained view of task management and error handling. Together, these models provide a unified semantic framework for reasoning about the behavior of multi‑AI agent systems. Grounded in this framework, we define 16 properties for the host agent and 14 for the task lifecycle, categorized into liveness, safety, completeness, and fairness. Expressed in temporal logic, these properties enable formal verification of system behavior, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities. Through this effort, we introduce the first rigorously grounded, domain‑agnostic framework for the analysis, design, and deployment of correct, reliable, and robust agentic AI systems.

PaperID: 7963, https://arxiv.org/pdf/2510.14112.pdf

Authors: Huiliang Zhang, Di Wu, Arnaud Zinflou, Benoit Boulet

Title: STEMS: Spatial-Temporal Enhanced Safe Multi-Agent Coordination for Building Energy Management

Abstract:
Building energy management is essential for achieving carbon reduction goals, improving occupant comfort, and reducing energy costs. Coordinated building energy management faces critical challenges in exploiting spatial‑temporal dependencies while ensuring operational safety across multi‑building systems. Current multi‑building energy systems face three key challenges: insufficient spatial‑temporal information exploitation, lack of rigorous safety guarantees, and system complexity. This paper proposes Spatial‑Temporal Enhanced Safe Multi‑Agent Coordination (STEMS), a novel safety‑constrained multi‑agent reinforcement learning framework for coordinated building energy management. STEMS integrates two core components: (1) a spatial‑temporal graph representation learning framework using a GCN‑Transformer fusion architecture to capture inter‑building relationships and temporal patterns, and (2) a safety‑constrained multi‑agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees. Extensive experiments on real‑world building datasets demonstrate STEMS's superior performance over existing methods, showing that STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework also demonstrates strong robustness during extreme weather conditions and maintains effectiveness across different building types.

PaperID: 7964, https://arxiv.org/pdf/2510.14008.pdf

Authors: Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang

Title: Stop Reducing Responsibility in LLM-Powered Multi-Agent Systems to Local Alignment

Abstract:
LLM‑powered Multi‑Agent Systems (LLM‑MAS) unlock new potentials in distributed reasoning, collaboration, and task generalization but also introduce additional risks due to unguaranteed agreement, cascading uncertainty, and adversarial vulnerabilities. We argue that ensuring responsible behavior in such systems requires a paradigm shift: from local, superficial agent‑level alignment to global, systemic agreement. We conceptualize responsibility not as a static constraint but as a lifecycle‑wide property encompassing agreement, uncertainty, and security, each requiring the complementary integration of subjective human‑centered values and objective verifiability. Furthermore, a dual‑perspective governance framework that combines interdisciplinary design with human‑AI collaborative oversight is essential for tracing and ensuring responsibility throughout the lifecycle of LLM‑MAS. Our position views LLM‑MAS not as loose collections of agents, but as unified, dynamic socio‑technical systems that demand principled mechanisms to support each dimension of responsibility and enable ethically aligned, verifiably coherent, and resilient behavior for sustained, system‑wide agreement.

PaperID: 7965, https://arxiv.org/pdf/2510.13982.pdf

Authors: Jinkun Chen, Sher Badshah, Xuemin Yu, Sijia Han

Title: Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Abstract:
What if artificial agents could not just communicate, but also evolve, adapt, and reshape their worlds in ways we cannot fully predict? With llm now powering multi‑agent systems and social simulations, we are witnessing new possibilities for modeling open‑ended, ever‑changing environments. Yet, most current simulations remain constrained within static sandboxes, characterized by predefined tasks, limited dynamics, and rigid evaluation criteria. These limitations prevent them from capturing the complexity of real‑world societies. In this paper, we argue that static, task‑specific benchmarks are fundamentally inadequate and must be rethought. We critically review emerging architectures that blend llm with multi‑agent dynamics, highlight key hurdles such as balancing stability and diversity, evaluating unexpected behaviors, and scaling to greater complexity, and introduce a fresh taxonomy for this rapidly evolving field. Finally, we present a research roadmap centered on open‑endedness, continuous co‑evolution, and the development of resilient, socially aligned AI ecosystems. We call on the community to move beyond static paradigms and help shape the next generation of adaptive, socially‑aware multi‑agent simulations.

PaperID: 7966, https://arxiv.org/pdf/2510.13935.pdf

Authors: Kenan Alkiek, David Jurgens, Vinod Vydiswaran

Title: Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Abstract:
Small language models (SLMs) enable low‑cost, private, on‑device inference, but they often fail on problems that require specialized domain knowledge or multi‑step reasoning. Existing approaches for improving reasoning either rely on scale (e.g., chain‑of‑thought prompting), require task‑specific training that limits reuse and generality (e.g., distillation), or retrieve unstructured information that still leaves the SLM to determine an appropriate reasoning strategy. We propose instruction retrieval, an inference‑time intervention that augments an SLM with structured, reusable reasoning procedures rather than raw passages. We construct an Instruction Corpus by clustering similar training questions and using a teacher model to generate generalizable guides that pair domain background with explicit step‑by‑step procedures. At inference, the SLM retrieves the instructions most relevant to a given query and executes the associated procedures without any additional fine‑tuning. Across three challenging domains: medicine, law, and mathematics, instruction retrieval yields consistent gains for models with at least 3B parameters, improving accuracy by 9.4%, 7.9%, and 5.1%, respectively, with the strongest 14B model surpassing GPT‑4o's zero‑shot performance on knowledge‑intensive tasks.

PaperID: 7967, https://arxiv.org/pdf/2510.13925.pdf

Authors: Daniel Adu Worae, Spyridon Mastorakis

Title: An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Abstract:
Internet of Things (IoT) networks generate diverse and high‑volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross‑layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM‑powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer‑based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval‑augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human‑readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense‑only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.

PaperID: 7968, https://arxiv.org/pdf/2510.13903.pdf

Authors: Michael Rizvi-Martel, Satwik Bhattamishra, Neil Rathi, Guillaume Rabusseau, Michael Hahn

Title: Benefits and Limitations of Communication in Multi-Agent Reasoning

Abstract:
Chain‑of‑thought prompting has popularized step‑by‑step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi‑agent paradigms offer a promising near‑term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi‑agent systems. We apply our framework to three algorithmic families: state tracking, recall, and k‑hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter‑agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi‑agent reasoning systems.

PaperID: 7969, https://arxiv.org/pdf/2510.13896.pdf

Authors: Xi Yu, Yang Yang, Qun Liu, Yonghua Du, Sean McSweeney, Yuewei Lin

Title: GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

Abstract:
Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training‑free multi‑agent framework that orchestrates specialist segmenters and generalist vision‑language models via a planner‑executor‑evaluator loop (choose tool \rightarrow run \rightarrow quality‑check) with long‑term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text‑guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self‑evolution and personalized workflows. Across seven cell‑segmentation benchmarks spanning diverse microscopy modalities (4,718 images), this routing consistently matches or exceeds the best individual tool on every dataset and outperforms all baselines in overall accuracy. On out‑of‑distribution organelle data, GenCellAgent substantially outperforms specialist models that were not trained on the target domain, recovering structures that dedicated tools fail to detect. It also segments novel objects such as the Golgi apparatus via iterative text‑guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.

PaperID: 7970, https://arxiv.org/pdf/2510.13889.pdf

Authors: Yue Hu, Guohang Zhuang

Title: MultiFoodhat: A potential new paradigm for intelligent food quality inspection

Abstract:
Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue‑driven multi‑agent reasoning framework for zero‑shot food recognition. The framework integrates vision‑language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi‑round visual‑textual dialogues. An Object Perception Token (OPT) captures fine‑grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi‑agent design allows flexible and human‑like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few‑shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

PaperID: 7971, https://arxiv.org/pdf/2510.13835.pdf

Authors: Avik Dutta, Priyanshu Gupta, Hosein Hasanbeig, Rahul Pratap Singh, Harshit Nigam, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari

Title: ConDABench: Interactive Evaluation of Language Models for Data Analysis

Abstract:
Real‑world data analysis tasks often come with under‑specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first‑class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. \bench consists of (a) a multi‑agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state‑of‑the‑art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long‑form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

PaperID: 7972, https://arxiv.org/pdf/2510.13821.pdf

Authors: Xin Li, Mengbing Liu, Chau Yuen

Title: LLM Agent Communication Protocol (LACP) Requires Urgent Standardization: A Telecom-Inspired Protocol is Necessary

Abstract:
This position paper argues that the field of LLM agents requires a unified, telecom‑inspired communication protocol to ensure safety, interoperability, and scalability, especially within the context of Next Generation (NextG) networks. Current ad‑hoc communication methods are creating a fragmented ecosystem, reminiscent of the early "protocol wars" in networking, which stifles innovation and poses significant risks. Drawing inspiration from the layered, standardized protocols that underpin modern telecommunications, we propose the LLM‑Agent Communication Protocol (LACP). LACP establishes a three‑layer architecture designed to ensure semantic clarity in communication, transactional integrity for complex tasks, and robust, built‑in security. In this position paper, we argue that adopting a principled, universal protocol is not merely beneficial but essential for realizing the potential of distributed AI. Such a standard is critical for ensuring that multi‑agent systems can operate safely and reliably in the complex, real‑time applications envisioned for 6G and beyond.

PaperID: 7973, https://arxiv.org/pdf/2510.13819.pdf

Authors: George Stamatelis, Hui Chen, Henk Wymeersch, George C. Alexandropoulos

Title: Joint Active RIS Configuration and User Power Control for Localization: A Neuroevolution-Based Approach

Abstract:
This paper studies user localization aided by a Reconfigurable Intelligent Surface (RIS). A feedback link from the Base Station (BS) to the user is adopted to enable dynamic power control of the user pilot transmissions in the uplink. A novel multi‑agent algorithm for the joint control of the RIS phase configuration and the user transmit power is presented, which is based on a hybrid approach integrating NeuroEvolution (NE) and supervised learning. The proposed scheme requires only single‑bit feedback messages for the uplink power control, supports RIS elements with discrete responses, and is numerically shown to outperform fingerprinting, deep reinforcement learning baselines and backpropagation‑based position estimators.

PaperID: 7974, https://arxiv.org/pdf/2510.13727.pdf

Authors: Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy

Title: From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

Abstract:
Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next‑generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human‑specified criteria,making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act‑‑which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety‑critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model‑agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety‑critical reinforcement learning. Our experiments in simulated driving and e‑commerce settings demonstrate that control‑theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled dynamic alternative to today's flag‑and‑block guardrails.

PaperID: 7975, https://arxiv.org/pdf/2510.13709.pdf

Authors: Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach

Title: Training LLM Agents to Empower Humans

Abstract:
Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human's empowerment, their ability to effect desired changes in the environment. Our empowerment‑maximizing method, Empower, only requires offline text data, providing a self‑supervised method for fine‑tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18‑person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi‑turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.

PaperID: 7976, https://arxiv.org/pdf/2510.13586.pdf

Authors: Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot

Title: Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Abstract:
The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non‑player characters (NPCs) in gaming environments, enabling both functional task execution and persona‑consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona‑Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task‑oriented dialogue, context‑aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role‑play and improve task fidelity, and (ii) fine‑tuned large models in the GPU track, leveraging Qwen3‑14B with supervisedfinetuning (SFT) and Low‑Rank Adaptation(LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).

PaperID: 7977, https://arxiv.org/pdf/2510.13551.pdf

Authors: Robert West, Ashton Anderson, Ece Kamar, Eric Horvitz

Title: Tandem Training for Language Models

Abstract:
As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long‑term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model's solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model's actions and reasoning process can be continued by the weak model ‑‑ when the two can co‑construct a successful solution ‑‑ optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human‑‑AI collaboration and multi‑agent communication.

PaperID: 7978, https://arxiv.org/pdf/2510.13543.pdf

Authors: Avihay Cohen

Title: In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers

Abstract:
Large Language Model (LLM) based agents integrated into web browsers (often called agentic AI browsers) offer powerful automation of web tasks. However, they are vulnerable to indirect prompt injection attacks, where malicious instructions hidden in a webpage deceive the agent into unwanted actions. These attacks can bypass traditional web security boundaries, as the AI agent operates with the user privileges across sites. In this paper, we present a novel fuzzing framework that runs entirely in the browser and is guided by an LLM to automatically discover such prompt injection vulnerabilities in real time.

PaperID: 7979, https://arxiv.org/pdf/2510.13432.pdf

Authors: Yushan Han, Hui Zhang, Honglei Zhang, Chuntao Ding, Yuanzhouhan Cao, Yidong Li

Title: CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation

Abstract:
Collaborative perception has been proven to improve individual perception in autonomous driving through multi‑agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real‑world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer‑based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial‑Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder‑specific and encoder‑agnostic domain separation modules. The former removes domain‑dependent information and the latter captures task‑related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade‑off between detection accuracy and inference efficiency.

PaperID: 7980, https://arxiv.org/pdf/2510.13391.pdf

Authors: Benjamin Kempinski, Tal Kachman

Title: Going with the Flow: Approximating Banzhaf Values via Graph Neural Networks

Abstract:
Computing the Banzhaf value in network flow games is fundamental for quantifying agent influence in multi‑agent systems, with applications ranging from cybersecurity to infrastructure planning. However, exact computation is intractable for systems with more than ～20 agents due to exponential complexity \mathcalO(2^m). While Monte Carlo sampling methods provide statistical estimates, they suffer from high sample complexity and cannot transfer knowledge across different network configurations, making them impractical for large‑scale or dynamic systems. We present a novel learning‑based approach using Graph Neural Networks (GNNs) to approximate Banzhaf values in cardinal network flow games. By framing the problem as a graph‑level prediction task, our method learns generalisable patterns of agent influence directly from network topology and control structure. We conduct a comprehensive empirical study comparing three state‑of‑the‑art GNN architectures‑Graph Attention Networks (GAT), Graph Isomorphism Networks with Edge features (GINE), and EdgeConv‑on a large‑scale synthetic dataset of 200,000 graphs per configuration, varying in size (20‑100 nodes), agent count (5‑20), and edge probability (0.5‑1.0). Our results demonstrate that trained GNN models achieve high‑fidelity Banzhaf value approximation with order‑of‑magnitude speedups compared to exact and sampling‑based methods. Most significantly, we show strong zero‑shot generalisation: models trained on graphs of a specific size and topology accurately predict Banzhaf values for entirely new networks with different structural properties, without requiring retraining. This work establishes GNNs as a practical tool for scalable cooperative game‑theoretic analysis of complex networked systems.

PaperID: 7981, https://arxiv.org/pdf/2510.13371.pdf

Authors: Jiin Park, Misuk Kim

Title: MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation

Abstract:
Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt‑based inference, failing to capture the complexity of user preferences and real‑world interactions. This study proposes the Multi‑Aspect Driven LLM Agent MADRec, an autonomous LLM‑based recommender that constructs user and item profiles by unsupervised extraction of multi‑aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. MADRec generates structured profiles via aspect‑category‑based summarization and applies Re‑Ranking to construct high‑density inputs. When the ground‑truth item is missing from the output, the Self‑Feedback mechanism dynamically adjusts the inference criteria. Experiments across multiple domains show that MADRec outperforms traditional and LLM‑based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.

PaperID: 7982, https://arxiv.org/pdf/2510.13343.pdf

Authors: Shota Takayama, Katsuhide Fujita

Title: AOAD-MAT: Transformer-based multi-agent deep reinforcement learning model considering agents' order of action decisions

Abstract:
Multi‑agent reinforcement learning focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models, such as the Multi‑Agent Transformer (MAT) and ACtion dEpendent deep Q‑learning (ACE), have significantly improved performance by leveraging sequential decision‑making processes. Although these models can enhance performance, they do not explicitly consider the importance of the order in which agents make decisions. In this paper, we propose an Agent Order of Action Decisions‑MAT (AOAD‑MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing the model to learn and predict the optimal order of agent actions. The AOAD‑MAT model leverages a Transformer‑based actor‑critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture that cooperates with a subtask focused on predicting the next agent to act, integrated into a Proximal Policy Optimization based loss function to synergistically maximize the advantage of the sequential decision‑making. The proposed method was validated through extensive experiments on the StarCraft Multi‑Agent Challenge and Multi‑Agent MuJoCo benchmarks. The experimental results show that the proposed AOAD‑MAT model outperforms existing MAT and other baseline models, demonstrating the effectiveness of adjusting the AOAD order in MARL.

PaperID: 7983, https://arxiv.org/pdf/2510.13291.pdf

Authors: Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu

Title: Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems

Abstract:
Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high‑quality data for cold‑start training is difficult, hindering self‑evolution and raising labor costs. (2) Multi‑turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low‑cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi‑agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open‑domain nature of multi‑turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi‑agent architectures, WOWService enables autonomous task management and collaborative problem‑solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi‑agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) ‑27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.

PaperID: 7984, https://arxiv.org/pdf/2510.13262.pdf

Authors: Weiqi Guo, Guanjun Liu, Ziyuan Zhou

Title: SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning

Abstract:
Multi‑Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state‑only attacks or action‑only attacks, but do not consider how to effectively joint them. Simply combining state and action perturbations such as randomly perturbing states and actions does not exploit their potential synergistic effects. In this paper, we propose the State‑Action Joint Attack (SAJA) framework that has a good synergistic effects. SAJA consists of two important phases: (1) In the state attack phase, a multi‑step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added into the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi‑Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state‑only or action‑only attacks, and (2) existing state or action defense methods cannot defend its attacks.

PaperID: 7985, https://arxiv.org/pdf/2510.13261.pdf

Authors: Björn Filter, Ralf Möller, Özgür Lütfü Özçep

Title: A Ratio-Based Shapley Value for Collaborative Machine Learning - Extended Version

Abstract:
Collaborative machine learning enables multiple data owners to jointly train models for improved predictive performance. However, ensuring incentive compatibility and fair contribution‑based rewards remains a critical challenge. Prior work by Sim and colleagues (Rachel Hwee Ling Sim et al: Collaborative machine learning with incentive‑aware model rewards. In: International conference on machine learning. PMLR. 2020, pp. 8927‑8963) addressed this by allocating model rewards, which are non‑monetary and freely replicable, based on the Shapley value of each party's data contribution, measured via information gain. In this paper, we introduce a ratio‑based Shapley value that replaces the standard additive formulation with a relative contribution measure. While our overall reward framework, including the incentive definitions and model‑reward setting, remains aligned with that of Sim and colleagues, the underlying value function is fundamentally different. Our alternative valuation induces a different distribution of model rewards and offers a new lens through which to analyze incentive properties. We formally define the ratio‑based value and prove that it satisfies the same set of incentive conditions as the additive formulation, including adapted versions of fairness, individual rationality, and stability. Like the original approach, our method faces the same fundamental trade‑offs between these incentives. Our contribution is a mathematically grounded alternative to the additive Shapley framework, potentially better suited to contexts where proportionality among contributors is more meaningful than additive differences.

PaperID: 7986, https://arxiv.org/pdf/2510.13248.pdf

Authors: Yunze Wei, Kaiwen Wei, Shibo Du, Jianyu Wang, Zhangzhong Liu, Yawen Wang, Zhanyou Li, Congcong Miao, Xiaohui Xie, Yong Cui

Title: Automated Network Protocol Testing with LLM Agents

Abstract:
Network protocol testing is fundamental for modern network infrastructure. However, traditional network protocol testing methods are labor‑intensive and error‑prone, requiring manual interpretation of specifications, test case design, and translation into executable artifacts, typically demanding one person‑day of effort per test case. Existing model‑based approaches provide partial automation but still involve substantial manual modeling and expert intervention, leading to high costs and limited adaptability to diverse and evolving protocols. In this paper, we propose a first‑of‑its‑kind system called NeTestLLM that takes advantage of multi‑agent Large Language Models (LLMs) for end‑to‑end automated network protocol testing. NeTestLLM employs hierarchical protocol understanding to capture complex specifications, iterative test case generation to improve coverage, a task‑specific workflow for executable artifact generation, and runtime feedback analysis for debugging and refinement. NeTestLLM has been deployed in a production environment for several months, receiving positive feedback from domain experts. In experiments, NeTestLLM generated 4,632 test cases for OSPF, RIP, and BGP, covering 41 historical FRRouting bugs compared to 11 by current national standards. The process of generating executable artifacts also improves testing efficiency by a factor of 8.65x compared to manual methods. NeTestLLM provides the first practical LLM‑powered solution for automated end‑to‑end testing of heterogeneous network protocols.

PaperID: 7987, https://arxiv.org/pdf/2510.13227.pdf

Authors: Divyanshu Singh, Ashman Mehra, Kavya Makwana, Snehanshu Saha, Santonu Sarkar

Title: Altruistic Ride Sharing: A Framework for Fair and Sustainable Urban Mobility via Peer-to-Peer Incentives

Abstract:
Urban mobility systems face persistent challenges of congestion, underutilized vehicles, and rising emissions driven by private point‑to‑point commuting. Although ride‑sharing platforms exist, their profit‑driven incentive structures often fail to align individual participation with broader community benefit. We introduce Altruistic Ride Sharing (ARS), a decentralized peer‑to‑peer mobility framework in which commuters alternate between driver and rider roles using altruism points, a non‑monetary credit mechanism that rewards providing rides and discourages persistent free‑riding. To enable scalable coordination among agents, ARS formulates ride‑sharing as a multi‑agent reinforcement learning problem and introduces ORACLE (One‑Network Actor‑Critic for Learning in Cooperative Environments), a shared‑parameter learning architecture for decentralized rider selection. We evaluate ARS using real‑world New York City Taxi and Limousine Commission (TLC) trajectory data under varying agent populations and behavioral dynamics. Across simulations, ARS reduces total travel distance and associated carbon emissions by approximately 20%, reduces urban traffic density by up to 30%, and doubles vehicle utilization relative to no‑sharing baselines while maintaining balanced participation across agents. These results demonstrate that altruism‑based incentives combined with decentralized learning can provide a scalable and equitable alternative to profit‑driven ride‑sharing systems.

PaperID: 7988, https://arxiv.org/pdf/2510.13214.pdf

Authors: Zehui Ling, Deshu Chen, Yichi Zhang, Yuchen Liu, Xigui Li, Xin Guo, Yuan Cheng

Title: Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning

Abstract:
Recent advances in Large Language Models (LLMs) demonstrate that chain‑of‑thought prompting and deep reasoning substantially enhance performance on complex tasks, and multi‑agent systems can further improve accuracy by enabling model debates. However, applying deep reasoning to all problems is computationally expensive. To mitigate these costs, we propose a complementary agent system integrating small and large LLMs. The small LLM first generates an initial answer, which is then verified by the large LLM. If correct, the answer is adopted directly; otherwise, the large LLM performs in‑depth reasoning. Experimental results show that, for simple problems, our approach reduces the computational cost of the large LLM by more than 50% with negligible accuracy loss, while consistently maintaining robust performance on complex tasks.

PaperID: 7989, https://arxiv.org/pdf/2510.13195.pdf

Authors: Qun Ma, Xiao Xue, Xuwen Zhang, Zihan Zhao, Yuwei Guo, Ming Zhang

Title: Emotional Cognitive Modeling Framework with Desire-Driven Objective Optimization for LLM-empowered Agent in Social Simulation

Abstract:
The advent of large language models (LLMs) has enabled agents to represent virtual humans in societal simulations, facilitating diverse interactions within complex social systems. However, existing LLM‑based agents exhibit severe limitations in affective cognition: They fail to simulate the bounded rationality essential for bridging virtual and real‑world services; They lack empirically validated integration mechanisms embedding emotions within agent decision architectures. This paper constructs an emotional cognition framework incorporating desire generation and objective management, designed to achieve emotion alignment between LLM‑based agents and humans, modeling the complete decision‑making process of LLM‑based agents, encompassing state evolution, desire generation, objective optimization, decision generation, and action execution. This study implements the proposed framework within our proprietary multi‑agent interaction environment. Experimental results demonstrate that agents governed by our framework not only exhibit behaviors congruent with their emotional states but also, in comparative assessments against other agent types, demonstrate superior ecological validity and generate decision outcomes that significantly more closely approximate human behavioral patterns.

PaperID: 7990, https://arxiv.org/pdf/2510.13139.pdf

Authors: Xiaoyu Yan, Tianxing Dai, Yu Marco Nie

Title: Addressing the alignment problem in transportation policy making: an LLM approach

Abstract:
A key challenge in transportation planning is that the collective preferences of heterogeneous travelers often diverge from the policies produced by model‑driven decision tools. This misalignment frequently results in implementation delays or failures. Here, we investigate whether large language models (LLMs), noted for their capabilities in reasoning and simulating human decision‑making, can help inform and address this alignment problem. We develop a multi‑agent simulation in which LLMs, acting as agents representing residents from different communities in a city, participate in a referendum on a set of transit policy proposals. Using chain‑of‑thought reasoning, LLM agents provide ranked‑choice or approval‑based preferences, which are aggregated using instant‑runoff voting (IRV) to model democratic consensus. We implement this simulation framework with both GPT‑4o and Claude‑3.5, and apply it for Chicago and Houston. Our findings suggest that LLM agents are capable of approximating plausible collective preferences and responding to local context, while also displaying model‑specific behavioral biases and modest divergences from optimization‑based benchmarks. These capabilities underscore both the promise and limitations of LLMs as tools for solving the alignment problem in transportation decision‑making.

PaperID: 7991, https://arxiv.org/pdf/2510.12979.pdf

Authors: Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin

Title: DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

Abstract:
Large language models (LLMs) augmented with multi‑step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long‑horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under‑optimized. To address this, we propose DeepPlanner, an end‑to‑end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token‑level advantage with an entropy‑based term to allocate larger updates to high entropy tokens, and selectively upweights sample‑level advantages for planning‑intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state‑of‑the‑art results under a substantially lower training budget.

PaperID: 7992, https://arxiv.org/pdf/2510.12972.pdf

Authors: Mingyuan Zhong, Xia Chen, Davin Win Kyi, Chen Li, James Fogarty, Jacob O. Wobbrock

Title: TaskAudit: Detecting Functiona11ity Errors in Mobile Apps via Agentic Task Execution

Abstract:
Accessibility checkers are tools in support of accessible app development, and their use is encouraged by accessibility best practices. However, most current checkers evaluate static or mechanically‑generated contexts, failing to capture common accessibility errors impacting mobile app functionality. In this work, we define functiona11ity errors as accessibility barriers that only manifest through interaction (i.e., named according to a blend of "functionality" and "accessibility"). We introduce TaskAudit, which comprises three components: a Task Generator that constructs interactive tasks from app screens, a Task Executor that uses agents with a screen reader proxy to perform these tasks, and an Accessibility Analyzer that detects and reports accessibility errors by examining interaction traces. Our evaluation on real‑world apps shows that TaskAudit detects 48 functiona11ity errors from 54 app screens, compared to between 4 and 20 with existing checkers. Our analysis demonstrates common error patterns that TaskAudit can detect in addition to those from prior work, including label‑functionality mismatch, cluttered navigation, and inappropriate feedback.

PaperID: 7993, https://arxiv.org/pdf/2510.12920.pdf

Authors: Justin Z. Tam, Pascal Grosset, Divya Banesh, Nesar Ramachandra, Terece L. Turton, James Ahrens

Title: InferA: A Smart Assistant for Cosmological Ensemble Data

Abstract:
Analyzing large‑scale scientific datasets presents substantial challenges due to their sheer volume, structural complexity, and the need for specialized domain knowledge. Automation tools, such as PandasAI, typically require full data ingestion and lack context of the full data structure, making them impractical as intelligent data analysis assistants for datasets at the terabyte scale. To overcome these limitations, we propose InferA, a multi‑agent system that leverages large language models to enable scalable and efficient scientific data analysis. At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of the data retrieval and analysis. The system engages interactively with users to elicit their analytical intent and confirm query objectives, ensuring alignment between user goals and system actions. To demonstrate the framework's usability, we evaluate the system using ensemble runs from the HACC cosmology simulation which comprises several terabytes.

PaperID: 7994, https://arxiv.org/pdf/2510.12859.pdf

Authors: Masoud Makrehchi

Title: Three Lenses on the AI Revolution: Risk, Transformation, Continuity

Abstract:
Artificial Intelligence (AI) has emerged as both a continuation of historical technological revolutions and a potential rupture with them. This paper argues that AI must be viewed simultaneously through three lenses: risk, where it resembles nuclear technology in its irreversible and global externalities; transformation, where it parallels the Industrial Revolution as a general‑purpose technology driving productivity and reorganization of labor; and continuity, where it extends the fifty‑year arc of computing revolutions from personal computing to the internet to mobile. Drawing on historical analogies, we emphasize that no past transition constituted a strict singularity: disruptive shifts eventually became governable through new norms and institutions. We examine recurring patterns across revolutions ‑‑ democratization at the usage layer, concentration at the production layer, falling costs, and deepening personalization ‑‑ and show how these dynamics are intensifying in the AI era. Sectoral analysis illustrates how accounting, law, education, translation, advertising, and software engineering are being reshaped as routine cognition is commoditized and human value shifts to judgment, trust, and ethical responsibility. At the frontier, the challenge of designing moral AI agents highlights the need for robust guardrails, mechanisms for moral generalization, and governance of emergent multi‑agent dynamics. We conclude that AI is neither a singular break nor merely incremental progress. It is both evolutionary and revolutionary: predictable in its median effects yet carrying singularity‑class tail risks. Good outcomes are not automatic; they require coupling pro‑innovation strategies with safety governance, ensuring equitable access, and embedding AI within a human order of responsibility.

PaperID: 7995, https://arxiv.org/pdf/2510.12826.pdf

Authors: Thao Pham

Title: Scheming Ability in LLM-to-LLM Strategic Interactions

Abstract:
As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM‑to‑LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game‑theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT‑4o, Gemini‑2.5‑pro, Claude‑3.7‑Sonnet, and Llama‑3.3‑70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain‑of‑thought reasoning. When prompted, most models, especially Gemini‑2.5‑pro and Claude‑3.7‑Sonnet, achieved near‑perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95‑100% rates. These findings highlight the need for robust evaluations using high‑stakes game‑theoretic scenarios in multi‑agent settings.

PaperID: 7996, https://arxiv.org/pdf/2510.12787.pdf

Authors: Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

Title: Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Abstract:
We present Ax‑Prover, a multi‑agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax‑Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax‑Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax‑Prover is competitive with state‑of‑the‑art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool‑based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax‑Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

PaperID: 7997, https://arxiv.org/pdf/2510.12750.pdf

Authors: A. Alfarano, L. Venturoli, D. Negueruela del Castillo

Title: VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface‑level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt‑Bench, a new, large‑scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi‑agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state‑of‑the‑art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open‑source models.

PaperID: 7998, https://arxiv.org/pdf/2510.12733.pdf

Authors: Hang Yu, Julian Jordan, Julian Schmidt, Silvan Lindner, Alessandro Canevaro, Wilhelm Stork

Title: HYPE: Hybrid Planning with Ego Proposal-Conditioned Predictions

Abstract:
Safe and interpretable motion planning in complex urban environments needs to reason about bidirectional multi‑agent interactions. This reasoning requires to estimate the costs of potential ego driving maneuvers. Many existing planners generate initial trajectories with sampling‑based methods and refine them by optimizing on learned predictions of future environment states, which requires a cost function that encodes the desired vehicle behavior. Designing such a cost function can be very challenging, especially if a wide range of complex urban scenarios has to be considered. We propose HYPE: HYbrid Planning with Ego proposal‑conditioned predictions, a planner that integrates multimodal trajectory proposals from a learned proposal model as heuristic priors into a Monte Carlo Tree Search (MCTS) refinement. To model bidirectional interactions, we introduce an ego‑conditioned occupancy prediction model, enabling consistent, scene‑aware reasoning. Our design significantly simplifies cost function design in refinement by considering proposal‑driven guidance, requiring only minimalistic grid‑based cost terms. Evaluations on large‑scale real‑world benchmarks nuPlan and DeepUrban show that HYPE effectively achieves state‑of‑the‑art performance, especially in safety and adaptability.

PaperID: 7999, https://arxiv.org/pdf/2510.12697.pdf

Authors: Tianyu Hu, Zhen Tan, Song Wang, Huaizhi Qu, Tianlong Chen

Title: Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Abstract:
With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs‑as‑Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi‑agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models judge consensus dynamics via a time‑varying Beta‑Binomial mixture, with adaptive stopping based on distributional similarity (Kolmogorov‑Smirnov test). This mechanism models the judges' collective correct rate dynamics using a time‑varying mixture of Beta‑Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov‑Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.

PaperID: 8000, https://arxiv.org/pdf/2510.12555.pdf

Authors: Andries Rosseau, Raphaël Avalos, Ann Nowé

Title: Inclusive Fitness as a Key Step Towards More Advanced Social Behaviors in Multi-Agent Reinforcement Learning Settings

Abstract:
The competitive and cooperative forces of natural selection have driven the evolution of intelligence for millions of years, culminating in nature's vast biodiversity and the complexity of human minds. Inspired by this process, we propose a novel multi‑agent reinforcement learning framework where each agent is assigned a genotype and where reward functions are modelled after the concept of inclusive fitness. An agent's genetic material may be shared with other agents, and our inclusive reward function naturally accounts for this. We study the resulting social dynamics in two types of network games with prisoner's dilemmas and find that our results align with well‑established principles from biology, such as Hamilton's rule. Furthermore, we outline how this framework can extend to more open‑ended environments with spatial and temporal structure, finite resources, and evolving populations. We hypothesize the emergence of an arms race of strategies, where each new strategy is a gradual improvement over earlier adaptations of other agents, effectively producing a multi‑agent autocurriculum analogous to biological evolution. In contrast to the binary team‑based structures prevalent in earlier research, our gene‑based reward structure introduces a spectrum of cooperation ranging from full adversity to full cooperativeness based on genetic similarity, enabling unique non team‑based social dynamics. For example, one agent having a mutual cooperative relationship with two other agents, while the two other agents behave adversarially towards each other. We argue that incorporating inclusive fitness in agents provides a foundation for the emergence of more strategically advanced and socially intelligent agents.

PaperID: 8001, https://arxiv.org/pdf/2510.12477.pdf

Authors: Gaoyuan Liu, Joris de Winter, Kelly Merckaert, Denis Steckelmacher, Ann Nowe, Bram Vanderborght

Title: A Task-Efficient Reinforcement Learning Task-Motion Planner for Safe Human-Robot Cooperation

Abstract:
In a Human‑Robot Cooperation (HRC) environment, safety and efficiency are the two core properties to evaluate robot performance. However, safety mechanisms usually hinder task efficiency since human intervention will cause backup motions and goal failures of the robot. Frequent motion replanning will increase the computational load and the chance of failure. In this paper, we present a hybrid Reinforcement Learning (RL) planning framework which is comprised of an interactive motion planner and a RL task planner. The RL task planner attempts to choose statistically safe and efficient task sequences based on the feedback from the motion planner, while the motion planner keeps the task execution process collision‑free by detecting human arm motions and deploying new paths when the previous path is not valid anymore. Intuitively, the RL agent will learn to avoid dangerous tasks, while the motion planner ensures that the chosen tasks are safe. The proposed framework is validated on the cobot in both simulation and the real world, we compare the planner with hard‑coded task motion planning methods. The results show that our planning framework can 1) react to uncertain human motions at both joint and task levels; 2) reduce the times of repeating failed goal commands; 3) reduce the total number of replanning requests.

PaperID: 8002, https://arxiv.org/pdf/2510.12273.pdf

Authors: Laurin Luttmann, Lin Xie

Title: Multi-Action Self-Improvement for Neural Combinatorial Optimization

Abstract:
Self‑improvement has emerged as a state‑of‑the‑art paradigm in Neural Combinatorial Optimization (NCO), where models iteratively refine their policies by generating and imitating high‑quality solutions. Despite strong empirical performance, existing methods face key limitations. Training is computationally expensive, as policy updates require sampling numerous candidate solutions per instance to extract a single expert trajectory. More fundamentally, these approaches fail to exploit the structure of combinatorial problems involving the coordination of multiple agents, such as vehicles in min‑max routing or machines in scheduling. By supervising on single‑action trajectories, they fail to exploit agent‑permutation symmetries, where distinct sequences of actions yield identical solutions, hindering generalization and the ability to learn coordinated behavior. We address these challenges by extending self‑improvement to operate over joint multi‑agent actions. Our model architecture predicts complete agent‑task assignments jointly at each decision step. To explicitly leverage symmetries, we employ a set‑prediction loss, which supervises the policy on multiple expert assignments for any given state. This approach enhances sample efficiency and the model's ability to learn coordinated behavior. Furthermore, by generating multi‑agent actions in parallel, it drastically accelerates the solution generation phase of the self‑improvement loop. Empirically, we validate our method on several combinatorial problems, demonstrating consistent improvements in the quality of the final solution and a reduced generation latency compared to standard self‑improvement.

PaperID: 8003, https://arxiv.org/pdf/2510.12272.pdf

Authors: Federico Gabriele, Aldo Glielmo, Marco Taboga

Title: Heterogeneous RBCs via Deep Multi-Agent Reinforcement Learning

Abstract:
Current macroeconomic models with agent heterogeneity can be broadly divided into two main groups. Heterogeneous‑agent general equilibrium (GE) models, such as those based on Heterogeneous Agent New Keynesian (HANK) or Krusell‑Smith (KS) approaches, rely on GE and 'rational expectations', somewhat unrealistic assumptions that make the models very computationally cumbersome, which in turn limits the amount of heterogeneity that can be modelled. In contrast, agent‑based models (ABMs) can flexibly encompass a large number of arbitrarily heterogeneous agents, but typically require the specification of explicit behavioural rules, which can lead to a lengthy trial‑and‑error model‑development process. To address these limitations, we introduce MARL‑BC, a framework that integrates deep multi‑agent reinforcement learning (MARL) with real business cycle (RBC) models. We demonstrate that MARL‑BC can: (1) recover textbook RBC results when using a single agent; (2) recover the results of the mean‑field KS model using a large number of identical agents; and (3) effectively simulate rich heterogeneity among agents, a hard task for traditional GE approaches. Our framework can be thought of as an ABM if used with a variety of heterogeneous interacting agents, and can reproduce GE results in limit cases. As such, it is a step towards a synthesis of these often opposed modelling paradigms.

PaperID: 8004, https://arxiv.org/pdf/2510.12264.pdf

Authors: Deyu Zou, Yongqiang Chen, Jianxiang Wang, Haochen Yang, Mufei Li, James Cheng, Pan Li, Yu Gong

Title: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

Abstract:
Active reasoning requires large language model (LLM) agents to interact with external sources and strategically gather information to solve problems in multiple turns. Central to this process is belief tracking: maintaining an accurate representation of the underlying state and uncertainty in understanding and solving the problem. However, due to limited reasoning capabilities, LLM‑based agents often suffer belief deviation: their internal beliefs drift from the true problem state, leading to loss of state awareness and uninformative or repetitive actions. Once this happens, errors compound in the trajectories used for reinforcement learning (RL), leading to misattributed credits and limited exploration. To address this issue, we propose to track belief deviation and develop \mathbfT^3, a simple yet principled method that detects excessive deviation and truncates training trajectories to suppress uninformative tail effects. Hence, \mathbfT^3 preserves credits for informative prefixes and systematically improves policy optimization. Across 5 challenging tasks, \mathbfT^3 consistently enhances training stability and yields performance gains of up to 30 points while cutting token cost by up to 34%. These results highlight belief control as a key principle for building robust LLM agents capable of active reasoning.

PaperID: 8005, https://arxiv.org/pdf/2510.12218.pdf

Authors: Hyunji Min, Sangwon Jung, Junyoung Sung, Dosung Lee, Leekyeung Han, Paul Hongsuck Seo

Title: GOAT: A Training Framework for Goal-Oriented Agent with Tools

Abstract:
Current approaches rely on zero‑shot evaluation due to the absence of training data; while proprietary models such as GPT‑4 exhibit strong reasoning capabilities, smaller open‑source models remain ineffective at complex tool use. To address this limitation, we propose a novel training framework GOAT, that enables fine‑tuning LLM agents without human annotation. GOAT automatically synthesizes goal‑oriented API execution data from API documents using a novel call‑first generation paradigm, that constructs training data based on executed API call sequences. Through extensive experiments, we show that GOAT‑trained agents achieve state‑of‑the‑art performance across multiple existing goal‑oriented benchmarks. In addition, we introduce GOATBench, a new goal‑oriented API execution benchmark, and demonstrate that agents trained with GOAT also excel in this setting. These results highlight GOAT as a practical path toward building robust open‑source LLM agents capable of complex reasoning and tool use.

PaperID: 8006, https://arxiv.org/pdf/2510.12189.pdf

Authors: Ryuji Hashimoto, Takehiro Takayanagi, Masahiro Suzuki, Kiyoshi Izumi

Title: Agent-Based Simulation of a Financial Market with Large Language Models

Abstract:
In real‑world stock markets, certain chart patterns ‑‑ such as price declines near historical highs ‑‑ cannot be fully explained by fundamentals alone. These phenomena suggest the presence of path dependence in price formation, where investor decisions are influenced not only by current market conditions but also by the trajectory of prices leading up to the present. Path dependence has drawn attention in behavioral finance as a key mechanism behind such anomalies. One plausible driver of path dependence is human loss aversion, anchored to individual reference points like purchase prices or past peaks, which vary with personal context. However, capturing such subtle behavioral tendencies in traditional agent‑based market simulations has remained a challenge. We propose the Fundamental‑Chartist‑LLM‑Agent (FCLAgent), which uses large language models (LLMs) to emulate human‑like trading decisions. In this framework, (1) buy/sell decisions are made by LLMs based on individual situations, while (2) order price and volume follow standard rule‑based methods. Simulations show that FCLAgents reproduce path‑dependent patterns that conventional agents fail to capture. Furthermore, an analysis of FCLAgents' behavior reveals that the reference points guiding loss aversion vary with market trajectories, highlighting the potential of LLM‑based agents to model nuanced investor behavior.

PaperID: 8007, https://arxiv.org/pdf/2510.12126.pdf

Authors: Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo

Title: MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

Abstract:
Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open‑source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi‑agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open‑source models, it is possible to achieve caption quality on par with GPT‑4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high‑quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine‑tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top‑tier multimodal performance in the open‑source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost‑effective visual captioning solution.

PaperID: 8008, https://arxiv.org/pdf/2510.12120.pdf

Authors: Zhenyu Mao, Jacky Keung, Fengji Zhang, Shuo Liu, Yifei Wang, Jialong Li

Title: Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach

Abstract:
The increasing demand for software development has driven interest in automating software engineering (SE) tasks using Large Language Models (LLMs). Recent efforts extend LLMs into multi‑agent systems (MAS) that emulate collaborative development workflows, but these systems often fail due to three core deficiencies: under‑specification, coordination misalignment, and inappropriate verification, arising from the absence of foundational SE structuring principles. This paper introduces Software Engineering Multi‑Agent Protocol (SEMAP), a protocol‑layer methodology that instantiates three core SE design principles for multi‑agent LLMs: (1) explicit behavioral contract modeling, (2) structured messaging, and (3) lifecycle‑guided execution with verification, and is implemented atop Google's Agent‑to‑Agent (A2A) infrastructure. Empirical evaluation using the Multi‑Agent System Failure Taxonomy (MAST) framework demonstrates that SEMAP effectively reduces failures across different SE tasks. In code development, it achieves up to a 69.6% reduction in total failures for function‑level development and 56.7% for deployment‑level development. For vulnerability detection, SEMAP reduces failure counts by up to 47.4% on Python tasks and 28.2% on C/C++ tasks.

PaperID: 8009, https://arxiv.org/pdf/2510.12091.pdf

Authors: Lijie Ding, Jan-Michael Carrillo, Changwoo Do

Title: ToPolyAgent: AI Agents for Coarse-Grained Topological Polymer Simulations

Abstract:
We introduce ToPolyAgent, a multi‑agent AI framework for performing coarse‑grained molecular dynamics (MD) simulations of topological polymers through natural language instructions. By integrating large language models (LLMs) with domain‑specific computational tools, ToPolyAgent supports both interactive and autonomous simulation workflows across diverse polymer architectures, including linear, ring, brush, and star polymers, as well as dendrimers. The system consists of four LLM‑powered agents: a Config Agent for generating initial polymer‑solvent configurations, a Simulation Agent for executing LAMMPS‑based MD simulations and conformational analyses, a Report Agent for compiling markdown reports, and a Workflow Agent for streamlined autonomous operations. Interactive mode incorporates user feedback loops for iterative refinements, while autonomous mode enables end‑to‑end task execution from detailed prompts. We demonstrate ToPolyAgent's versatility through case studies involving diverse polymer architectures under varying solvent condition, thermostats, and simulation lengths. Furthermore, we highlight its potential as a research assistant by directing it to investigate the effect of interaction parameters on the linear polymer conformation, and the influence of grafting density on the persistence length of the brush polymer. By coupling natural language interfaces with rigorous simulation tools, ToPolyAgent lowers barriers to complex computational workflows and advances AI‑driven materials discovery in polymer science. It lays the foundation for autonomous and extensible multi‑agent scientific research ecosystems.

PaperID: 8010, https://arxiv.org/pdf/2510.12072.pdf

Authors: Zixing Lei, Sheng Yin, Yichen Xiong, Yuanzhuo Ding, Wenhao Huang, Yuxi Wei, Qingyao Xu, Yiming Li, Weixin Li, Yunhong Wang, Siheng Chen

Title: EmboMatrix: A Scalable Training-Ground for Embodied Decision-Making

Abstract:
Embodied decision‑making enables agents to translate high‑level goals into executable actions through continuous interactions within the physical world, forming a cornerstone of general‑purpose embodied intelligence. Large language models (LLMs), with their general decision‑making capabilities, offer a promising path to realize this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one‑stop solution for LLM acquire genuine embodied decision‑making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi‑agent data engine for large‑scale task and scene generation, a distributed heterogeneous‑hardware system for scalable simulation, and a multi‑level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision‑making abilities emerge from extensive embodied interactions. Experiments show that EmboBrain‑7B surpasses the 671B DeepSeek‑R1 baseline by 9.5% on two challenging embodied decision‑making benchmarks, demonstrating the power of interactive, environment‑grounded learning for building truly intelligent embodied agents.

PaperID: 8011, https://arxiv.org/pdf/2510.12061.pdf

Authors: Yiheng Chen, Lingyao Li, Zihui Ma, Qikai Hu, Yilun Zhu, Min Deng, Runlong Yu

Title: Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response

Abstract:
Effective disaster response is essential for safeguarding lives and property. Existing statistical approaches often lack semantic context, generalize poorly across events, and offer limited interpretability. While Large language models (LLMs) provide few‑shot generalization, they remain text‑bound and blind to geography. To bridge this gap, we introduce a Geospatial Awareness Layer (GAL) that grounds LLM agents in structured earth data. Starting from raw wildfire detections, GAL automatically retrieves and integrates infrastructure, demographic, terrain, and weather information from external geodatabases, assembling them into a concise, unit‑annotated perception script. This enriched context enables agents to produce evidence‑based resource‑allocation recommendations (e.g., personnel assignments, budget allocations), further reinforced by historical analogs and daily change signals for incremental updates. We evaluate the framework in real wildfire scenarios across multiple LLM models, showing that geospatially grounded agents can outperform baselines. The proposed framework can generalize to other hazards such as floods and hurricanes.

PaperID: 8012, https://arxiv.org/pdf/2510.11997.pdf

Authors: Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

Title: SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Abstract:
Evaluating multi‑turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain‑specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi‑turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top‑down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom‑up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug‑finding and iterative agent improvement.

PaperID: 8013, https://arxiv.org/pdf/2510.11977.pdf

Authors: Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

Title: Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Abstract:
AI agents have been developed for complex real‑world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three‑dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about 40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM‑aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

PaperID: 8014, https://arxiv.org/pdf/2510.11967.pdf

Authors: Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen

Title: Scaling Long-Horizon LLM Agent via Context-Folding

Abstract:
Large language model (LLM) agents are fundamentally constrained by context length on long‑horizon tasks. We introduce Context‑Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub‑trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end‑to‑end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long‑horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10× smaller and significantly outperforms models that rely on summarization‑based context management.

PaperID: 8015, https://arxiv.org/pdf/2510.11872.pdf

Authors: Alessandro Cornacchia, Vaastav Anand, Muhammad Bilal, Zafar Qazi, Marco Canini

Title: DMAS-Forge: A Framework for Transparent Deployment of AI Applications as Distributed Systems

Abstract:
Agentic AI applications increasingly rely on multiple agents with distinct roles, specialized tools, and access to memory layers to solve complex tasks ‑‑ closely resembling service‑oriented architectures. Yet, in the rapid evolving landscape of programming frameworks and new protocols, deploying and testing AI agents as distributed systems remains a daunting and labor‑intensive task. We present DMAS‑Forge, a framework designed to close this gap. DMAS‑Forge decouples application logic from specific deployment choices, and aims at transparently generating the necessary glue code and configurations to spawn distributed multi‑agent applications across diverse deployment scenarios with minimal manual effort. We present our vision, design principles, and a prototype of DMAS‑Forge. Finally, we discuss the opportunities and future work for our approach.

PaperID: 8016, https://arxiv.org/pdf/2510.11754.pdf

Authors: Dongrong Yang, Xin Wu, Yibo Xie, Xinyi Li, Qiuwen Wu, Jackie Wu, Yang Sheng

Title: Zero-Shot Large Language Model Agents for Fully Automated Radiotherapy Treatment Planning

Abstract:
Radiation therapy treatment planning is an iterative, expertise‑dependent process, and the growing burden of cancer cases has made reliance on manual planning increasingly unsustainable, underscoring the need for automation. In this study, we propose a workflow that leverages a large language model (LLM)‑based agent to navigate inverse treatment planning for intensity‑modulated radiation therapy (IMRT). The LLM agent was implemented to directly interact with a clinical treatment planning system (TPS) to iteratively extract intermediate plan states and propose new constraint values to guide inverse optimization. The agent's decision‑making process is informed by current observations and previous optimization attempts and evaluations, allowing for dynamic strategy refinement. The planning process was performed in a zero‑shot inference setting, where the LLM operated without prior exposure to manually generated treatment plans and was utilized without any fine‑tuning or task‑specific training. The LLM‑generated plans were evaluated on twenty head‑and‑neck cancer cases against clinical manual plans, with key dosimetric endpoints analyzed and reported. The LLM‑generated plans achieved comparable organ‑at‑risk (OAR) sparing relative to clinical plans while demonstrating improved hot spot control (Dmax: 106.5% vs. 108.8%) and superior conformity (conformity index: 1.18 vs. 1.39 for boost PTV; 1.82 vs. 1.88 for primary PTV). This study demonstrates the feasibility of a zero‑shot, LLM‑driven workflow for automated IMRT treatment planning in a commercial TPS. The proposed approach provides a generalizable and clinically applicable solution that could reduce planning variability and support broader adoption of AI‑based planning strategies.

PaperID: 8017, https://arxiv.org/pdf/2510.11728.pdf

Authors: Bingqiao Gu, Jiale Zeng, Xingqin Qi, Dong Li

Title: Modeling Hypergraph Using Large Language Models

Abstract:
Due to the advantages of hypergraphs in modeling high‑order relationships in complex systems, they have been applied to higher‑order clustering, hypergraph neural networks and computer vision. These applications rely heavily on access to high‑quality, large‑scale real‑world hypergraph data. Yet, compared to traditional pairwise graphs, real hypergraph datasets remain scarce in both scale and diversity. This shortage significantly limits the development and evaluation of advanced hypergraph learning algorithms. Therefore, how to quickly generate large‑scale hypergraphs that conform to the characteristics of real networks is a crucial task that has not received sufficient attention. Motivated by recent advances in large language models (LLMs), particularly their capabilities in semantic reasoning, structured generation, and simulating human behavior, we investigate whether LLMs can facilitate hypergraph generation from a fundamentally new perspective. We introduce HyperLLM, a novel LLM‑driven hypergraph generator that simulates the formation and evolution of hypergraphs through a multi‑agent collaboration. The framework integrates prompts and structural feedback mechanisms to ensure that the generated hypergraphs reflect key real‑world patterns. Extensive experiments across diverse datasets demonstrate that HyperLLM achieves superior fidelity to structural and temporal hypergraph patterns, while requiring minimal statistical priors. Our findings suggest that LLM‑based frameworks offer a promising new direction for hypergraph modeling.

PaperID: 8018, https://arxiv.org/pdf/2510.11695.pdf

Authors: Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou

Title: When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Abstract:
Although Large Language Model (LLM)‑based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real‑time benchmark for evaluating LLM‑based trading agents across multiple markets. AMA integrates verified trading data, expert‑checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single‑agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory‑based reasoning, and evaluates them across GPT‑4o, GPT‑4.1, Claude‑3.5‑haiku, Claude‑sonnet‑4, and Gemini‑2.0‑flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk‑taking to conservative decision‑making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM‑based agents.

PaperID: 8019, https://arxiv.org/pdf/2510.11694.pdf

Authors: Arjun Sahney, Ram Gorthi, Cezary Łastowski, Javier Vega

Title: Operand Quant: A Single-Agent Architecture for Autonomous Machine Learning Engineering

Abstract:
We present Operand Quant, a single‑agent, IDE‑based architecture for autonomous machine learning engineering (MLE). Operand Quant departs from conventional multi‑agent orchestration frameworks by consolidating all MLE lifecycle stages ‑‑ exploration, modeling, experimentation, and deployment ‑‑ within a single, context‑aware agent. On the MLE‑Benchmark (2025), Operand Quant achieved a new state‑of‑the‑art (SOTA) result, with an overall medal rate of 0.3956 +/‑ 0.0565 across 75 problems ‑‑ the highest recorded performance among all evaluated systems to date. The architecture demonstrates that a linear, non‑blocking agent, operating autonomously within a controlled IDE environment, can outperform multi‑agent and orchestrated systems under identical constraints.

PaperID: 8020, https://arxiv.org/pdf/2510.11654.pdf

Authors: Daniel Berhane Araya, Duoduo Liao

Title: FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection

Abstract:
Financial markets face growing threats from misinformation that can trigger billions in losses in minutes. Most existing approaches lack transparency in their decision‑making and provide limited attribution to credible sources. We introduce FinVet, a novel multi‑agent framework that integrates two Retrieval‑Augmented Generation (RAG) pipelines with external fact‑checking through a confidence‑weighted voting mechanism. FinVet employs adaptive three‑tier processing that dynamically adjusts verification strategies based on retrieval confidence, from direct metadata extraction to hybrid reasoning to full model‑based analysis. Unlike existing methods, FinVet provides evidence‑backed verdicts, source attribution, confidence scores, and explicit uncertainty flags when evidence is insufficient. Experimental evaluation on the FinFact dataset shows that FinVet achieves an F1 score of 0.85, which is a 10.4% improvement over the best individual pipeline (fact‑check pipeline) and 37% improvement over standalone RAG approaches.

PaperID: 8021, https://arxiv.org/pdf/2510.11618.pdf

Authors: Zehao Chen, Rong Pan, Haoran Li

Title: StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models

Abstract:
Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long‑form story generation, termed hybrid bottom‑up long‑form story generation, using multi‑agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top‑down approaches that impose rigid structures, our hybrid bottom‑up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state‑of‑the‑art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long‑form stories that evolve organically from agent‑driven interactions.

PaperID: 8022, https://arxiv.org/pdf/2510.11588.pdf

Authors: Jiateng Liu, Zhenhailong Wang, Xiaojiang Huang, Yingjie Li, Xing Fan, Xiang Li, Chenlei Guo, Ruhi Sarikaya, Heng Ji

Title: Analyzing and Internalizing Complex Policy Documents for LLM Agents

Abstract:
Large Language Model (LLM)‑based agentic systems rely on in‑context policy documents encoding diverse business rules. As requirements grow, these documents expand rapidly, causing high computational overhead. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. Prior prompt compression work targets generic prompts, but agentic policy documents span multiple complexity levels and require deeper reasoning, making internalization harder. We introduce CC‑Gen, an agentic benchmark generator with Controllable Complexity across four levels, enabling systematic evaluation of agents' ability to handle complexity and offering a unified framework for assessing policy internalization. Our analysis shows that complex policy specifications governing workflows pose major reasoning challenges. Supporting internalization with gold user agent interaction trajectories containing chain‑of‑thought (CoT) annotations via supervised fine‑tuning (SFT) is data‑intensive and degrades sharply as policy complexity increases. To mitigate data and reasoning burdens, we propose Category‑Aware Policy Continued Pretraining (CAP‑CPT). Our automated pipeline parses policy documents to extract key specifications, grouping them into factual, behavioral, and conditional categories, and isolating complex conditions that drive workflow complexity. This guides targeted data synthesis and enables agents to internalize policy information through an autoregressive pretraining loss. Experiments show CAP‑CPT improves SFT baselines in all settings, with up to 41% and 22% gains on Qwen‑3‑32B, achieving 97.3% prompt length reduction on CC‑Gen and further enhancing tau‑Bench with minimal SFT data.

PaperID: 8023, https://arxiv.org/pdf/2510.11535.pdf

Authors: Vincenzo Norman Vitale, Antonia Maria Tulino, Andreas F. Molisch, Jaime Llorca

Title: A Flexible Multi-Agent Deep Reinforcement Learning Framework for Dynamic Routing and Scheduling of Latency-Critical Services

Abstract:
Timely delivery of delay‑sensitive information over dynamic, heterogeneous networks is increasingly essential for a range of interactive applications, such as industrial automation, self‑driving vehicles, and augmented reality. However, most existing network control solutions target only average delay performance, falling short of providing strict End‑to‑End (E2E) peak latency guarantees. This paper addresses the challenge of reliably delivering packets within application‑imposed deadlines by leveraging recent advancements in Multi‑Agent Deep Reinforcement Learning (MA‑DRL). After introducing the Delay‑Constrained Maximum‑Throughput (DCMT) dynamic network control problem, and highlighting the limitations of current solutions, we present a novel MA‑DRL network control framework that leverages a centralized routing and distributed scheduling architecture. The proposed framework leverages critical networking domain knowledge for the design of effective MA‑DRL strategies based on the Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) technique, where centralized routing and distributed scheduling agents dynamically assign paths and schedule packet transmissions according to packet lifetimes, thereby maximizing on‑time packet delivery. The generality of the proposed framework allows integrating both data‑driven \blueDeep Reinforcement Learning (DRL) agents and traditional rule‑based policies in order to strike the right balance between performance and learning complexity. Our results confirm the superiority of the proposed framework with respect to traditional stochastic optimization‑based approaches and provide key insights into the role and interplay between data‑driven DRL agents and new rule‑based policies for both efficient and high‑performance control of latency‑critical services.

PaperID: 8024, https://arxiv.org/pdf/2510.11534.pdf

Authors: Enli Lin, Ziyuan Yang, Qiujing Lu, Jianming Hu, Shuo Feng

Title: IntersectioNDE: Learning Complex Urban Traffic Dynamics based on Interaction Decoupling Strategy

Abstract:
Realistic traffic simulation is critical for ensuring the safety and reliability of autonomous vehicles (AVs), especially in complex and diverse urban traffic environments. However, existing data‑driven simulators face two key challenges: a limited focus on modeling dense, heterogeneous interactions at urban intersections ‑ which are prevalent, crucial, and practically significant in countries like China, featuring diverse agents including motorized vehicles (MVs), non‑motorized vehicles (NMVs), and pedestrians ‑ and the inherent difficulty in robustly learning high‑dimensional joint distributions for such high‑density scenes, often leading to mode collapse and long‑term simulation instability. We introduce City Crossings Dataset (CiCross), a large‑scale dataset collected from a real‑world urban intersection, uniquely capturing dense, heterogeneous multi‑agent interactions, particularly with a substantial proportion of MVs, NMVs and pedestrians. Based on this dataset, we propose IntersectioNDE (Intersection Naturalistic Driving Environment), a data‑driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), a training paradigm that learns compositional dynamics from agent subsets, enabling the marginal‑to‑joint simulation. Integrated into a scene‑aware Transformer network with specialized training techniques, IDS significantly enhances simulation robustness and long‑term stability for modeling heterogeneous interactions. Experiments on CiCross show that IntersectioNDE outperforms baseline methods in simulation fidelity, stability, and its ability to replicate complex, distribution‑level urban traffic dynamics.

PaperID: 8025, https://arxiv.org/pdf/2510.11483.pdf

Authors: Heydar Soudani, Hamed Zamani, Faegheh Hasibi

Title: Uncertainty Quantification for Retrieval-Augmented Reasoning

Abstract:
Retrieval‑augmented reasoning (RAR) is a recent evolution of retrieval‑augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of systems' outputs. These methods, however, often handle simple queries with no retrieval or single‑step retrieval, without properly handling RAR setup. Accurate estimation of UQ for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval‑Augmented Reasoning Consistency (R2C)‑‑a novel UQ method for RAR. The core idea of R2C is to perturb the multi‑step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to the state‑of‑the‑art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness for two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves the exact match by ~7% over single models and ~3% over selection methods.

PaperID: 8026, https://arxiv.org/pdf/2510.11474.pdf

Authors: Ardian Selmonaj, Giacomo Del Rio, Adrian Schneider, Alessandro Antonucci

Title: Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning

Abstract:
Achieving mission objectives in a realistic simulation of aerial combat is highly challenging due to imperfect situational awareness and nonlinear flight dynamics. In this work, we introduce a novel 3D multi‑agent air combat environment and a Hierarchical Multi‑Agent Reinforcement Learning framework to tackle these challenges. Our approach combines heterogeneous agent dynamics, curriculum learning, league‑play, and a newly adapted training algorithm. To this end, the decision‑making process is organized into two abstraction levels: low‑level policies learn precise control maneuvers, while high‑level policies issue tactical commands based on mission objectives. Empirical results show that our hierarchical approach improves both learning efficiency and combat performance in complex dogfight scenarios.

PaperID: 8027, https://arxiv.org/pdf/2510.11457.pdf

Authors: Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu

Title: From to : Multidimensional Supervision of Reasoning Process for LLM Optimization

Abstract:
Improving the multi‑step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome‑supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process‑level reward models (PRMs) provide denser, step‑by‑step feedback, they lack generalizability and interpretability, requiring task‑specific segmentation of the reasoning process. To this end, we propose the Dimension‑level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final answer correctness and enable interpretable assessment without requiring ground truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability. In particular, DRM‑supervised training achieves consistent gains on both in‑distribution and out‑of‑distribution open‑domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.

PaperID: 8028, https://arxiv.org/pdf/2510.11410.pdf

Authors: Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski

Title: Autonomous vehicles need social awareness to find optima in multi-agent reinforcement learning routing games

Abstract:
Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi‑Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real‑world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system‑wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route‑choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non‑stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real‑world network of Saint‑Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system‑wide and individual performance of future urban systems with AVs.

PaperID: 8029, https://arxiv.org/pdf/2510.11389.pdf

Authors: Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen

Title: Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

Abstract:
Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM‑based self‑play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high‑quality, human‑verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy‑alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple‑choice‑style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent‑role inferences. This framework enables a fine‑grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state‑of‑the‑art LLMs show diverse performance, with roughly half remain below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi‑agent interaction.

PaperID: 8030, https://arxiv.org/pdf/2510.11290.pdf

Authors: Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, Chengliang Wang

Title: Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

Abstract:
Large language models (LLMs) based Agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI‑Agent School (AAS) system, built around a self‑evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmented issues in teaching process modeling and the limitations of agents performance in simulating diverse educational participants, AAS constructs the Zero‑Exp strategy, employs a continuous "experience‑reflection‑optimization" cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short‑term and long‑term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi‑faceted teacher‑student engagements and underlying learning processes found in physical schools. Experiment confirms that AAS can effectively simulate intricate educational dynamics and is effective in fostering advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high‑fidelity behavioral and interaction data.

PaperID: 8031, https://arxiv.org/pdf/2510.11260.pdf

Authors: Yuxuan Chen, Ruotong Yang, Zhengyang Zhang, Mehreen Ahmed, Yanming Wang

Title: A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images

Abstract:
Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time‑consuming and prone to errors. To address this issue, we propose a multi‑modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto‑DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto‑DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow‑up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.

PaperID: 8032, https://arxiv.org/pdf/2510.11184.pdf

Authors: Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang

Title: Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization

Abstract:
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool‑augmented reinforcement learning (RL) across diverse domains remains a significant challenge. Standard paradigms often treat tool usage as a linear or isolated event, which becomes brittle when transferring skills from restricted domains (e.g., mathematics) to open‑ended tasks. In this work, we investigate the cross‑domain generalization of an LLM agent trained exclusively on mathematical problem‑solving. To facilitate robust skill transfer, we propose a Reinforcement Learning for Interleaved Tool Execution (RITE). Unlike traditional methods, RITE enforces a continuous ``Plan‑Action‑Reflection'' cycle, allowing the model to ground its reasoning in intermediate tool outputs and self‑correct during long‑horizon tasks. To effectively train this complex interleaved policy, we introduce Dr. GRPO, a robust optimization objective that utilizes token‑level loss aggregation with importance sampling to mitigate reward sparsity and high‑variance credit assignment. Furthermore, we employ a dual‑component reward system and dynamic curriculum via online rollout filtering to ensure structural integrity and sample efficiency. Extensive experiments reveal that our approach, despite being trained solely on math tasks, achieves state‑of‑the‑art performance across diverse reasoning domains, demonstrating high token efficiency and strong generalization capabilities.

PaperID: 8033, https://arxiv.org/pdf/2510.11085.pdf

Authors: Yuxinyue Qian, Jun Liu

Title: Modeling AI-Driven Production and Competitiveness A Multi-Agent Economic Simulation of China and the United States

Abstract:
With the rapid development of artificial intelligence (AI) technology, socio‑economic systems are entering a new stage of "human‑AI co‑creation." Building upon a previously established multi‑level intelligent agent economic model, this paper conducts simulation‑based comparisons of macroeconomic output evolution in China and the United States under different mechanisms‑AI collaboration, network effects, and AI autonomous production. The results show that: (1) when AI functions as an independent productive entity, the overall growth rate of social output far exceeds that of traditional human‑labor‑based models; (2) China demonstrates clear potential for acceleration in both the expansion of intelligent agent populations and the pace of technological catch‑up, offering the possibility of achieving technological convergence or even partial surpassing. This study provides a systematic, model‑based analytical framework for understanding AI‑driven production system transformation and shifts in international competitiveness, as well as quantitative insights for relevant policy formulation.

PaperID: 8034, https://arxiv.org/pdf/2510.11076.pdf

Authors: Lingyue Fu, Haowei Yuan, Datong Chen, Xinyi Dai, Qingyao Li, Weinan Zhang, Weiwen Liu, Yong Yu

Title: DebugTA: An LLM-Based Agent for Simplifying Debugging and Teaching in Programming Education

Abstract:
In programming education, Debugging and Teaching (DT) task is a common scenario where students receive assistance in correcting their erroneous code. The task involves multiple inputs, including erroneous code, error messages, reference solutions, and the question description, with the goal of generating modification suggestions to the erroneous code. However, two key challenges hinder the effectiveness of existing approaches. Firstly, the complexity and heterogeneity of inputs inherent in DT tasks significantly elevate the reasoning challenges faced by LLMs. Second, existing approaches often fail to fully leverage the availability of standard code in DT tasks, forcing models to rely solely on complex multi‑step reasoning, which limits the potential of LLMs in addressing DT tasks effectively. To address these challenges, we propose DebugTA, a novel LLM‑based debugging and teaching agent with specialized tools for standard code retrieval, variable substitution to align reference code, and an external compiler for real‑time code analysis. Guided by explicit pedagogical and debugging principles, DebugTA acts as an agent that decomposes a complex task into sequential LLM interactions, each utilizing distinct tools for specific subtasks, thereby simplifying the logical reasoning at each step and reducing overall reasoning complexity. Furthermore, DebugTA utilizes tool calls to align the standard code with the erroneous code as much as possible, allowing the LLM to focus on logic errors within the erroneous code and improving the accuracy of the generated suggestions. To rigorously assess the quality of modification suggestions, we introduce a student simulator‑teacher interaction paradigm. Experimental results on three real‑world code datasets demonstrate that DebugTA consistently improves teaching effectiveness while significantly reducing computational costs.

PaperID: 8035, https://arxiv.org/pdf/2510.10943.pdf

Authors: Thi-Nhung Nguyen, Linhao Luo, Thuy-Trang Vu, Dinh Phung

Title: The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems

Abstract:
Bias in large language models (LLMs) remains a persistent challenge, manifesting in stereotyping and unfair treatment across social groups. While prior research has primarily focused on individual models, the rise of multi‑agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and largely unexplored dynamics in bias emergence and propagation. In this work, we present a comprehensive study of stereotypical bias in MAS, examining how internal specialization, underlying LLMs and inter‑agent communication protocols influence bias robustness, propagation, and amplification. We simulate social contexts where agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are generally less robust than single‑agent systems, with bias often emerging early through in‑group favoritism. However, cooperative and debate‑based communication can mitigate bias amplification, while more robust underlying LLMs improve overall system stability. Our findings highlight critical factors shaping fairness and resilience in multi‑agent LLM systems.

PaperID: 8036, https://arxiv.org/pdf/2510.10895.pdf

Authors: Renxuan Tan, Rongpeng Li, Fei Wang, Chenghui Peng, Shaoyun Wu, Zhifeng Zhao, Honggang Zhang

Title: LLM-Empowered Agentic MAC Protocols: A Dynamic Stackelberg Game Approach

Abstract:
Medium Access Control (MAC) protocols, essential for wireless networks, are typically manually configured. While deep reinforcement learning (DRL)‑based protocols enhance task‑specified network performance, they suffer from poor generalizability and resilience, demanding costly retraining to adapt to dynamic environments. To overcome this limitation, we introduce a game‑theoretic LLM‑empowered multi‑agent DRL (MARL) framework, in which the uplink transmission between a base station and a varying number of user equipments is modeled as a dynamic multi‑follower Stackelberg game (MFSG), capturing the network's natural hierarchical structure. Within this game, LLM‑driven agents, coordinated through proximal policy optimization (PPO), synthesize adaptive, semantic MAC protocols in response to network dynamics. Protocol action grammar (PAG) is employed to ensure the reliability and efficiency of this process. Under this system, we further analyze the existence and convergence behavior in terms of a Stackelberg equilibrium by studying the learning dynamics of LLM‑empowered unified policies in response to changing followers. Simulations corroborate that our framework achieves a 77.6% greater throughput and a 65.2% fairness improvement over conventional baselines. Besides, our framework generalizes excellently to a fluctuating number of users without requiring retraining or architectural changes.

PaperID: 8037, https://arxiv.org/pdf/2510.10890.pdf

Authors: Yu Chao, Siyu Lin, xiaorong wang, Zhu Zhang, Zihan Zhou, Haoyu Wang, Shuo Wang, Jie Zhou, Zhiyuan Liu, Maosong Sun

Title: LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System

Abstract:
We introduce LLM x MapReduce‑V3, a hierarchically modular agent system designed for long‑form survey generation. Building on the prior work, LLM x MapReduce‑V2, this version incorporates a multi‑agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model‑context‑protocol (MCP) servers. These atomic servers can be aggregated into higher‑level servers, creating a hierarchically structured system. A high‑level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human‑in‑the‑loop intervention, affording users greater control and customization over the research process. Through a multi‑turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in‑depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP‑based modular planning.

PaperID: 8038, https://arxiv.org/pdf/2510.10824.pdf

Authors: Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela

Title: Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration

Abstract:
We present an approach to software testing automation using Agentic Retrieval‑Augmented Generation (RAG) systems for Quality Engineering (QE) artifact creation. We combine autonomous AI agents with hybrid vector‑graph knowledge systems to automate test plan, case, and QE metric generation. Our approach addresses traditional software testing limitations by leveraging LLMs such as Gemini and Mistral, multi‑agent orchestration, and enhanced contextualization. The system achieves remarkable accuracy improvements from 65% to 94.8% while ensuring comprehensive document traceability throughout the quality engineering lifecycle. Experimental validation of enterprise Corporate Systems Engineering and SAP migration projects demonstrates an 85% reduction in testing timeline, an 85% improvement in test suite efficiency, and projected 35% cost savings, resulting in a 2‑month acceleration of go‑live.

PaperID: 8039, https://arxiv.org/pdf/2510.10819.pdf

Authors: Vivek Acharya

Title: Generative AI and the Transformation of Software Development Practices

Abstract:
Generative AI is reshaping how software is designed, written, and maintained. Advances in large language models (LLMs) are enabling new development styles ‑ from chat‑oriented programming and 'vibe coding' to agentic programming ‑ that can accelerate productivity and broaden access. This paper examines how AI‑assisted techniques are changing software engineering practice, and the related issues of trust, accountability, and shifting skills. We survey iterative chat‑based development, multi‑agent systems, dynamic prompt orchestration, and integration via the Model Context Protocol (MCP). Using case studies and industry data, we outline both the opportunities (faster cycles, democratized coding) and the challenges (model reliability and cost) of applying generative AI to coding. We describe new roles, skills, and best practices for using AI in a responsible and effective way.

PaperID: 8040, https://arxiv.org/pdf/2510.10804.pdf

Authors: Alessandro Albini, Mohsen Kaboli, Giorgio Cannata, Perla Maiolino

Title: Representing Data in Robotic Tactile Perception -- A Review

Abstract:
Robotic tactile perception is a complex process involving several computational steps performed at different levels. Tactile information is shaped by the interplay of robot actions, the mechanical properties of its body, and the software that processes the data. In this respect, high‑level computation, required to process and extract information, is commonly performed by adapting existing techniques from other domains, such as computer vision, which expects input data to be properly structured. Therefore, it is necessary to transform tactile sensor data to match a specific data structure. This operation directly affects the tactile information encoded and, as a consequence, the task execution. This survey aims to address this specific aspect of the tactile perception pipeline, namely Data Representation. The paper first clearly defines its contributions to the perception pipeline and then reviews how previous studies have dealt with the problem of representing tactile information, investigating the relationships among hardware, representations, and high‑level computation methods. The analysis has led to the identification of six structures commonly used in the literature to represent data. The manuscript provides discussions and guidelines for properly selecting a representation depending on operating conditions, including the available hardware, the tactile information required to be encoded, and the task at hand.

PaperID: 8041, https://arxiv.org/pdf/2510.10666.pdf

Authors: Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen

Title: BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Abstract:
Efficiently solving real‑world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search‑R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human‑inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two‑stage training (Supervised Fine‑Tuning (SFT) and Rejection Fine‑Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search‑R1, BrowserAgent achieves more competitive results across different Open‑QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long‑horizon tasks. Notably, BrowserAgent‑7B can achieve around 20% improvement over Search‑R1 on multi‑hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.

PaperID: 8042, https://arxiv.org/pdf/2510.10633.pdf

Authors: Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li

Title: Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

Abstract:
Multimodal text‑to‑image generation remains constrained by the difficulty of maintaining semantic alignment and professional‑level detail across diverse visual domains. We propose a multi‑agent reinforcement learning framework that coordinates domain‑specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross‑modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE‑1 scores by 69.7%. Among fusion methods, Transformer‑based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross‑modal semantic grounding. These findings underscore the promise of collaborative, specialization‑driven architectures for advancing reliable multimodal generative systems.

PaperID: 8043, https://arxiv.org/pdf/2510.10611.pdf

Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Zijian Zhang, Haochen You, Lubin Gan, Yilei Yuan, Jin Huang

Title: HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication

Abstract:
Recent advances in large language model‑powered multi‑agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) Ineffective group collaboration modeling, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple agents; and (ii) Limited task‑adaptiveness in communication topology design, leading to excessive communication cost for simple tasks and insufficient coordination for complex scenarios. These issues restrict the scalability and practical deployment of adaptive collaboration frameworks. To address these challenges, we propose HyperAgent, a hypergraph‑based framework that optimizes communication topologies and effectively captures group collaboration patterns using direct hyperedge representations. Unlike edge‑based approaches, HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers to achieve one‑step information aggregation in collaboration groups. Additionally, it incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity. Experiments highlight the superiority of HyperAgent in both performance and efficiency. For instance, on GSM8K, HyperAgent achieves 95.07% accuracy while reducing token consumption by 25.33%, demonstrating the potential of hypergraph‑based optimization for multi‑agent communication.

PaperID: 8044, https://arxiv.org/pdf/2510.10585.pdf

Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, Jin Huang

Title: D3MAS: Decompose, Deduce, and Distribute for Enhanced Knowledge Sharing in Multi-Agent Systems

Abstract:
Multi‑agent systems powered by large language models exhibit strong capabilities in collaborative problem‑solving. However, these systems suffer from substantial knowledge redundancy. Agents duplicate efforts in retrieval and reasoning processes. This inefficiency stems from a deeper issue: current architectures lack mechanisms to ensure agents share minimal sufficient information at each operational stage. Empirical analysis reveals an average knowledge duplication rate of 47.3% across agent communications. We propose D3MAS (Decompose, Deduce, and Distribute), a hierarchical coordination framework addressing redundancy through structural design rather than explicit optimization. The framework organizes collaboration across three coordinated layers. Task decomposition filters irrelevant sub‑problems early. Collaborative reasoning captures complementary inference paths across agents. Distributed memory provides access to non‑redundant knowledge. These layers coordinate through structured message passing in a unified heterogeneous graph. This cross‑layer alignment ensures information remains aligned with actual task needs. Experiments on four challenging datasets show that D3MAS consistently improves reasoning accuracy by 8.7% to 15.6% and reduces knowledge redundancy by 46% on average.

PaperID: 8045, https://arxiv.org/pdf/2510.10581.pdf

Authors: Heng Zhang, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Yilei Yuan, Jin Huang

Title: GraphTracer: Graph-Guided Failure Tracing in LLM Agents for Robust Multi-Turn Deep Search

Abstract:
Multi‑agent systems powered by Large Language Models excel at complex tasks through coordinated collaboration, yet they face high failure rates in multi‑turn deep search scenarios. Existing temporal attribution methods struggle to accurately diagnose root causes, particularly when errors propagate across multiple agents. Attempts to automate failure attribution by analyzing action sequences remain ineffective due to their inability to account for information dependencies that span agents. This paper identifies two core challenges: (i) distinguishing symptoms from root causes in multi‑agent error propagation, and (ii) tracing information dependencies beyond temporal order. To address these issues, we introduce GraphTracer, a framework that redefines failure attribution through information flow analysis. GraphTracer constructs Information Dependency Graphs (IDGs) to explicitly capture how agents reference and build on prior outputs. It localizes root causes by tracing through these dependency structures instead of relying on temporal sequences. GraphTracer also uses graph‑aware synthetic data generation to target critical nodes, creating realistic failure scenarios. Evaluations on the Who\&When benchmark and integration into production systems demonstrate that GraphTracer‑8B achieves up to 18.18% higher attribution accuracy compared to state‑of‑the‑art models and enables 4.8% to 14.2% performance improvements in deployed multi‑agent frameworks, establishing a robust solution for multi‑agent system debugging.

PaperID: 8046, https://arxiv.org/pdf/2510.10461.pdf

Authors: Hongjie Zheng, Zesheng Shi, Ping Yi

Title: MedCoAct: Confidence-Aware Multi-Agent Collaboration for Complete Clinical Decision

Abstract:
Autonomous agents utilizing Large Language Models (LLMs) have demonstrated remarkable capabilities in isolated medical tasks like diagnosis and image analysis, but struggle with integrated clinical workflows that connect diagnostic reasoning and medication decisions. We identify a core limitation: existing medical AI systems process tasks in isolation without the cross‑validation and knowledge integration found in clinical teams, reducing their effectiveness in real‑world healthcare scenarios. To transform the isolation paradigm into a collaborative approach, we propose MedCoAct, a confidence‑aware multi‑agent framework that simulates clinical collaboration by integrating specialized doctor and pharmacist agents, and present a benchmark, DrugCareQA, to evaluate medical AI capabilities in integrated diagnosis and treatment workflows. Our results demonstrate that MedCoAct achieves 67.58% diagnostic accuracy and 67.58% medication recommendation accuracy, outperforming single agent framework by 7.04% and 7.08% respectively. This collaborative approach generalizes well across diverse medical domains, proving especially effective for telemedicine consultations and routine clinical scenarios, while providing interpretable decision‑making pathways.

PaperID: 8047, https://arxiv.org/pdf/2510.10460.pdf

Authors: Zongyi Lyu, Songqiang Chen, Zhenlan Ji, Liwen Wang, Shuai Wang, Daoyuan Wu, Wenxuan Wang, Shing-Chi Cheung

Title: Understanding and Bridging the Planner-Coder Gap: A Systematic Study on the Robustness of Multi-Agent Systems for Code Generation

Abstract:
Multi‑agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks. Despite their prosperous development, the fundamental mechanisms underlying their robustness remain poorly understood, raising critical concerns for real‑world deployment. This paper conducts a systematic empirical study to uncover the internal robustness flaws of MASs using a mutation‑based methodology. By designing a testing pipeline incorporating semantic‑preserving mutation operators and a novel fitness function, we assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws: semantically equivalent inputs cause drastic performance drops, with MASs failing to solve 7.9%‑‑83.3% of problems they initially resolved successfully. Through comprehensive failure analysis, we discover a fundamental cause underlying these robustness issues: the planner‑coder gap, which accounts for 75.3% of failures. This gap arises from information loss in the multi‑stage transformation process where planning agents decompose requirements into underspecified plans, and coding agents subsequently misinterpret intricate logic during code generation. Based on this formulated information transformation process, we propose a repairing method that mitigates information loss through multi‑prompt generation and introduces a monitor agent to bridge the planner‑coder gap. Evaluation shows that our repairing method effectively enhances the robustness of MASs by solving 40.0%‑‑88.9% of identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.

PaperID: 8048, https://arxiv.org/pdf/2510.10381.pdf

Authors: Sebastián Rodríguez-Falcón, Luciano Stucchi

Title: Quantifying Emergent Behaviors in Agent-Based Models using Mean Information Gain

Abstract:
Emergent behaviors are a defining feature of complex systems, yet their quantitative characterization remains an open challenge, as traditional classifications rely mainly on visual inspection of spatio‑temporal patterns. In this Letter, we propose using the Mean Information Gain (MIG) as a metric to quantify emergence in Agent‑Based Models. The MIG is a conditional entropy‑based metric that quantifies the lack of information about other elements in a structure given certain known properties. We apply it to a multi‑agent biased random walk that reproduces Wolfram's four behavioral classes and show that MIG differentiates these behaviors. This metric reconnects the analysis of emergent behaviors with the classical notions of order, disorder, and entropy, thereby enabling the quantitative classification of regimes as convergent, periodic, complex, and chaotic. This approach overcomes the ambiguity of qualitative inspection near regime boundaries, particularly in large systems, and provides a compact, extensible framework for identifying and comparing emergent behaviors in complex systems.

PaperID: 8049, https://arxiv.org/pdf/2510.10325.pdf

Authors: Walid Abdela

Title: KG-MAS: Knowledge Graph-Enhanced Multi-Agent Infrastructure for coupling physical and digital robotic environments

Abstract:
The seamless integration of physical and digital environments in Cyber‑Physical Systems(CPS), particularly within Industry 4.0, presents significant challenges stemming from system heterogeneity and complexity. Traditional approaches often rely on rigid, data‑centric solutions like co‑simulation frameworks or brittle point‑to‑point middleware bridges, which lack the semantic richness and flexibility required for intelligent, autonomous coordination. This report introduces the Knowledge Graph‑Enhanced Multi‑Agent Infrastructure(KG‑MAS), as resolution in addressing such limitations. KG‑MAS leverages a centralized Knowledge Graph (KG) as a dynamic, shared world model, providing a common semantic foundation for a Multi‑Agent System(MAS). Autonomous agents, representing both physical and digital components, query this KG for decision‑making and update it with real‑time state information. The infrastructure features a model‑driven architecture which facilitates the automatic generation of agents from semantic descriptions, thereby simplifying system extension and maintenance. By abstracting away underlying communication protocols and providing a unified, intelligent coordination mechanism, KG‑MAS offers a robust, scalable, and flexible solution for coupling heterogeneous physical and digital robotic environments.

PaperID: 8050, https://arxiv.org/pdf/2510.10278.pdf

Authors: Christopher Chiu, Silviu Pitis, Mihaela van der Schaar

Title: Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Abstract:
Clinical reasoning in medicine is a hypothesis‑driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single‑turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi‑turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician‑curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well‑described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real‑world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision‑making environments.

PaperID: 8051, https://arxiv.org/pdf/2510.10259.pdf

Authors: Yishen Jiang, Xin Wang, Wenqiang Zhu, Ming Wei, Longzhao Liu, Shaoting Tang, Hongwei Zheng

Title: Nonlinear Public Goods Game in Dynamical Environments

Abstract:
The evolutionary mechanisms of cooperative behavior represent a fundamental topic in complex systems and evolutionary dynamics. Real‑world collective interactions, particularly in multi‑agent systems, are often characterized by behavior‑dependent mechanism switching where the environmental state is endogenously shaped by group strategies. However, existing models typically treat such environmental variations as static stochasticity and neglect the closed‑loop feedback between environmental states and cooperative behaviors. Here, we introduce a dynamic environmental feedback mechanism into a nonlinear public goods game framework to establish a coevolutionary model that couples environmental states and individual cooperative strategies. Our results demonstrate that the interplay among environmental feedback, nonlinear effects, and environmental randomness can drive the system toward a wide variety of steady‑state structures, including full defection, full cooperation, stable coexistence, and periodic limit cycles. Further analysis reveals that asymmetric nonlinear parameters and environmental feedback rates exert significant regulatory effects on cooperation levels and system dynamics. This study not only enriches the theoretical framework of evolutionary game theory but also provides a foundation for modeling environmental feedback loops in scenarios ranging from ecological management to the design of cooperative mechanisms in autonomous systems.

PaperID: 8052, https://arxiv.org/pdf/2510.10231.pdf

Authors: Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li, Yunchao Wei, Yao Zhao, Yan Lu

Title: Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images

Abstract:
The rapid advancement of AI‑generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic‑level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large‑scale benchmark with structured annotations as quadruples \emph(Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi‑agent pipeline (AnomAgent) with lightweight human‑in‑the‑loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17\,B GPT‑4o tokens, providing scale evidence for the resulting structured annotations. We further show that models fine‑tuned on AnomReason achieve consistent gains over strong vision‑language baselines under our proposed semantic matching metric (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI‑generated images. We will release code, metrics, data, and task‑aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.

PaperID: 8053, https://arxiv.org/pdf/2510.10225.pdf

Authors: Jialin Sun, Yuchen Hu, Dean You, Yushu Du, Hui Wang, Xinwei Fang, Weiwei Shan, Nan Guan, Zhe Jiang

Title: ISAAC: Intelligent, Scalable, Agile, and Accelerated CPU Verification via LLM-aided FPGA Parallelism

Abstract:
Functional verification is a critical bottleneck in integrated circuit development, with CPU verification being especially time‑intensive and labour‑consuming. Industrial practice relies on differential testing for CPU verification, yet faces bottlenecks at nearly each stage of the framework pipeline: front‑end stimulus generation lacks micro‑architectural awareness, yielding low‑quality and redundant tests that impede coverage closure and miss corner cases. Meanwhile, back‑end simulation infrastructure, even with FPGA acceleration, often stalls on long‑running tests and offers limited visibility, delaying feedback and prolonging the debugging cycle. Here, we present ISAAC, a full‑stack, Large Language Model (LLM)‑aided CPU verification framework with FPGA parallelism, from bug categorisation and stimulus generation to simulation infrastructure. To do so, we presented a multi‑agent stimulus engine in ISAAC's front‑end, infused with micro‑architectural knowledge and historical bug patterns, generating highly targeted tests that rapidly achieve coverage goals and capture elusive corner cases. In ISAAC's back‑end, we introduce a lightweight forward‑snapshot mechanism and a decoupled co‑simulation architecture between the Instruction Set Simulator (ISS) and the Design Under Test (DUT), enabling a single ISS to drive multiple DUTs in parallel. By eliminating long‑tail test bottlenecks and exploiting FPGA parallelism, the simulation throughput is significantly improved. As a demonstration, we used ISAAC to verify a mature CPU that has undergone multiple successful tape‑outs. Results show up to 17,536x speed‑up over software RTL simulation, while detecting several previously unknown bugs, two of which are reported in this paper.

PaperID: 8054, https://arxiv.org/pdf/2510.10117.pdf

Authors: Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song

Title: DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay

Abstract:
Multimodal abductive reasoning‑‑the generation and selection of explanatory hypotheses from partial observations‑‑is a cornerstone of intelligence. Current evaluations of this ability in vision‑language models (VLMs) are largely confined to static, single‑agent tasks. Inspired by Dixit, we introduce DixitWorld, a comprehensive evaluation suite designed to deconstruct this challenge. DIXITWORLD features two core components: DixitArena, a dynamic, multi‑agent environment that evaluates both hypothesis generation (a "storyteller" crafting cryptic clues) and hypothesis selection ("listeners" choosing the target image from decoys) under imperfect information; and DixitBench, a static QA benchmark that isolates the listener's task for efficient, controlled evaluation. Results from DixitArena reveal distinct, role‑dependent behaviors: smaller open‑source models often excel as creative storytellers, producing imaginative yet less discriminative clues, whereas larger proprietary models demonstrate superior overall performance, particularly as listeners. Performance on DixitBench strongly correlates with listener results in DixitArena, validating it as a reliable proxy for hypothesis selection. Our findings reveal a key trade‑off between generative creativity and discriminative understanding in multimodal abductive reasoning, a central challenge for developing more balanced and capable vision‑language agents.

PaperID: 8055, https://arxiv.org/pdf/2510.10086.pdf

Authors: Feifei Liu, Haozhe Wang, Zejun Wei, Qirong Lu, Yiyang Wen, Xiaoyu Tang, Jingyan Jiang, Zhijian He

Title: Beyond ADE and FDE: A Comprehensive Evaluation Framework for Safety-Critical Prediction in Multi-Agent Autonomous Driving Scenarios

Abstract:
Current evaluation methods for autonomous driving prediction models rely heavily on simplistic metrics such as Average Displacement Error (ADE) and Final Displacement Error (FDE). While these metrics offer basic performance assessments, they fail to capture the nuanced behavior of prediction modules under complex, interactive, and safety‑critical driving scenarios. For instance, existing benchmarks do not distinguish the influence of nearby versus distant agents, nor systematically test model robustness across varying multi‑agent interactions. This paper addresses this critical gap by proposing a novel testing framework that evaluates prediction performance under diverse scene structures, saying, map context, agent density and spatial distribution. Through extensive empirical analysis, we quantify the differential impact of agent proximity on target trajectory prediction and identify scenario‑specific failure cases that are not exposed by traditional metrics. Our findings highlight key vulnerabilities in current state‑of‑the‑art prediction models and demonstrate the importance of scenario‑aware evaluation. The proposed framework lays the groundwork for rigorous, safety‑driven prediction validation, contributing significantly to the identification of failure‑prone corner cases and the development of robust, certifiable prediction systems for autonomous vehicles.

PaperID: 8056, https://arxiv.org/pdf/2510.10047.pdf

Authors: Ruohao Li, Hongjun Liu, Leyi Zhao, Zisu Li, Jiawei Li, Jiajun Jiang, Linning Xu, Chen Zhao, Mingming Fan, Chen Liang

Title: SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning

Abstract:
Large language model (LLM) agents have shown remarkable reasoning abilities. However, existing multi‑agent frameworks often rely on fixed roles or centralized control, limiting scalability and adaptability in long‑horizon reasoning. We introduce SwarmSys, a closed‑loop framework for distributed multi‑agent reasoning inspired by swarm intelligence. Coordination in SwarmSys emerges through iterative interactions among three specialized roles, Explorers, Workers, and Validators, that continuously cycle through exploration, exploitation, and validation. To enable scalable and adaptive collaboration, we integrate adaptive agent and event profiles, embedding‑based probabilistic matching, and a pheromone‑inspired reinforcement mechanism, supporting dynamic task allocation and self‑organizing convergence without global supervision. Across symbolic reasoning, research synthesis, and scientific programming tasks, SwarmSys consistently outperforms baselines, improving both accuracy and reasoning stability. These findings highlight swarm‑inspired coordination as a promising paradigm for scalable, robust, and adaptive multi‑agent reasoning, suggesting that coordination scaling may rival model scaling in advancing LLM intelligence.

PaperID: 8057, https://arxiv.org/pdf/2510.10010.pdf

Authors: Matheus J. T. Vargas

Title: SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study

Abstract:
We present SLEAN (Simple Lightweight Ensemble Analysis Network), a deterministic framework for coordinating multiple LLM providers through text‑based prompt orchestration. Unlike complex multi‑agent systems requiring specialized infrastructure, SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three‑phase protocol formed by independent analysis, cross‑critique, and arbitration, filters harmful AI‑generated code suggestions before production deployment, addressing how AI‑assisted debugging increasingly produces modifications that introduce unnecessary complexity, break existing functionality, or address problems. Evaluating 15 software bugs, we analyzed 69 AI‑generated fix propositions. SLEAN's filtering accepted 22 fixes (31.9%, 95% CI 20.9‑42.9%) while rejecting 47 that would have been harmful if applied verbatim. The arbitration process reduced code change surface by 83‑90% relative to raw AI outputs, enforcing minimal causal edits over scope‑expanding modifications. Minimal Type 2 inputs proved more efficient than detailed Type 1 inputs, requiring 2.85 versus 3.56 propositions per accepted fix (35.1% versus 28.1% acceptance, about a 20% efficiency gain). Agreement between AI systems showed weak correlation with fix quality: high convergence (at least 80%) occurred in 4 of 15 cases and improved acceptance by only 2.4% points; arbitration appeared only at exactly 10% convergence in 2 of 15 cases, although low convergence alone did not necessitate arbitration. The file‑driven, provider‑agnostic architecture enables deployment without specialized coding expertise, making it applicable to security auditing, code review, document verification, and other domains requiring reliable multi‑provider synthesis with end‑to‑end auditability.

PaperID: 8058, https://arxiv.org/pdf/2510.10002.pdf

Authors: Pratik S. Sachdeva, Tom van Nuenen

Title: Deliberative Dynamics and Value Alignment in LLM Debates

Abstract:
As large language models (LLMs) are increasingly deployed in sensitive everyday contexts ‑‑ offering personal advice, mental health support, and moral guidance ‑‑ understanding their behavior in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single‑turn prompts, but it is unclear if these findings extend to multi‑turn settings, and even less clear how they depend on the interaction protocols used to coordinate agentic systems. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi‑turn settings by prompting subsets of three models (GPT‑4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's ``Am I the Asshole'' community. To test order effects and assess verdict revision, we use both synchronous (parallel responses) and round‑robin (sequential responses) deliberation structures, mirroring how multi‑agent systems are increasingly orchestrated in practice. Our findings show striking behavioral differences. In the synchronous setting, GPT‑4.1 showed strong inertia (0.6‑3.1% revision rates) while Claude 3.7 Sonnet and Gemini 2.0 Flash were far more flexible (28‑41% revision rates). Value patterns also diverged: GPT‑4.1 emphasized personal autonomy and direct communication (relative to its deliberation partners), while Claude 3.7 Sonnet and Gemini 2.0 Flash prioritized empathetic dialogue. We further find that deliberation format had a strong impact on model behavior: GPT‑4.1 and Gemini 2.0 Flash stood out as highly conforming relative to Claude 3.7 Sonnet, with their verdict behavior strongly shaped by order effects. We provide additional results on open‑source models (DeepSeek‑V3.2 and Llama 3.1).

PaperID: 8059, https://arxiv.org/pdf/2510.09937.pdf

Authors: Shahbaz P Qadri Syed, He Bai

Title: Structured Cooperative Multi-Agent Reinforcement Learning: a Bayesian Network Perspective

Abstract:
The empirical success of multi‑agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large scale multi‑agent systems. However, existing state‑of‑the‑art algorithms do not fully exploit inter‑agent coupling information to develop MARL algorithms. In this paper, we propose a systematic approach to leverage structures in the inter‑agent couplings for efficient model‑free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize the subset of agents, termed as the value dependency set, whose information is required by each agent to estimate its local action value function exactly. Moreover, we propose a partially decentralized training decentralized execution (P‑DTDE) paradigm based on the value dependency set. We theoretically establish that the total variance of our P‑DTDE policy gradient estimator is less than the centralized training decentralized execution (CTDE) policy gradient estimator. We derive a multi‑agent policy gradient theorem based on the P‑DTDE scheme and develop a scalable actor‑critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi‑warehouse resource allocation and multi‑zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it achieves a faster convergence than the exact value dependence set for applications with a large number of agents.

PaperID: 8060, https://arxiv.org/pdf/2510.09854.pdf

Authors: Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galass, Chuxu Zhang, Yanfang Ye

Title: NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering

Abstract:
Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet‑related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single‑agent systems and the complexity of designing effective multi‑agent architectures, as well as contextual overload that hinders accurate decision‑making. We introduce Nutritional‑Graph Router (NG‑Router), a novel framework that formulates nutritional QA as a supervised, knowledge‑graph‑guided multi‑agent collaboration problem. NG‑Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task‑aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient‑based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi‑hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG‑Router consistently outperforms both single‑agent and ensemble baselines, offering a principled approach to domain‑aware multi‑agent reasoning for complex nutritional health tasks.

PaperID: 8061, https://arxiv.org/pdf/2510.09801.pdf

Authors: Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

Title: How can we assess human-agent interactions? Case studies in software agent design

Abstract:
LLM‑powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real‑world use cases. In this paper, we make two major steps towards the rigorous assessment of human‑agent interactions. First, we propose PULSE, a framework for more efficient human‑centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model‑generated pseudo‑labels. Second, we deploy the framework on a large‑scale web platform built around the open‑source software agent OpenHands, collecting in‑the‑wild usage data across over 15k users. We conduct case studies around how three agent design decisions ‑‑ choice of LLM backbone, planning strategy, and memory mechanisms ‑‑ impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in‑the‑wild results and benchmark performance (e.g., the anti‑correlation between results comparing claude‑sonnet‑4 and gpt‑5), underscoring the limitations of benchmark‑driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.

PaperID: 8062, https://arxiv.org/pdf/2510.09791.pdf

Authors: Sneha Gathani, Kevin Li, Raghav Thind, Sirui Zeng, Matthew Xu, Peter J. Haas, Cagatay Demiralp, Zhicheng Liu

Title: PRAXA: A Grammar for What-If Analysis

Abstract:
What‑if analysis is widely used to explore hypothetical scenarios and evaluate alternative pathways to desired results. However, current approaches are fragmented: systems implement what‑if capabilities under diverse terminologies with different analytic techniques. Such fragmentation limits expressiveness, impedes flexible composition and reuse of workflows, and hinders tighter integration with AI. We present PRAXA, a compositional grammar of what‑if analysis derived from recurring patterns across 141 publications in visual analytics and HCI venues. PRAXA formulates three primitives: (1) data, defining variables under analysis, (2) model, specifying predictive mechanisms, and (3) interaction operations‑pairs of user actions and system responses that execute analyses. We encode PRAXA into a declarative specification language, PSL. To evaluate PRAXA, we first show expressiveness by reconstructing representative workflows from prior work as structured compositions, exposing the predominant focus on single‑step rather than multi‑step reasoning. Second, we demonstrate composability by revealing that capabilities described under distinct terminologies share the same grammatical structure with different parameterizations, and that new multi‑step workflows emerge through composition. Third, we illustrate PSL as an intermediate representation for translating natural‑language what‑if queries into executable interactive interfaces, enabling inspection, validation, and more transparent AI integration. By unifying diverse what‑if approaches as a grammar, PRAXA provides a foundation for analyzing, composing, and supporting workflows in next‑generation what‑if systems.

PaperID: 8063, https://arxiv.org/pdf/2510.09781.pdf

Authors: Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang

Title: Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Abstract:
While LLM agents can plan multi‑step tasks, intervening at the planning stage‑before any action is executed‑is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post‑execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category‑labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre‑execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross‑planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre‑Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine‑grained categorization, explanation, and cross‑planner generalization in human‑verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre‑Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

PaperID: 8064, https://arxiv.org/pdf/2510.09720.pdf

Authors: Haoran Sun, Zekun Zhang, Shaoning Zeng

Title: Preference-Aware Memory Update for Long-Term LLM Agents

Abstract:
One of the key factors influencing the reasoning capabilities of LLM‑based agents is their ability to leverage long‑term memory. Integrating long‑term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference‑Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference‑aware representation that captures both short‑term fluctuations and long‑term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism can significantly improve the output quality of LLM in five baselines, validating its effectiveness in long‑term conversations.

PaperID: 8065, https://arxiv.org/pdf/2510.09709.pdf

Authors: Shin-nosuke Ishikawa, Masato Todo, Taiki Ogihara, Hirotsugu Ohba

Title: The Idola Tribus of AI: Large Language Models tend to perceive order where none exists

Abstract:
We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real‑world tasks, such as providing knowledge through retrieval‑augmented generation and executing multi‑step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self‑coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over‑recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi‑step reasoning models, including OpenAI o3, o4‑mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non‑existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain‑of‑thought reasoning mechanisms.

PaperID: 8066, https://arxiv.org/pdf/2510.09586.pdf

Authors: Fengming Lin

Title: Vision Language Models: A Survey of 26K Papers

Abstract:
We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023‑2025. Titles and abstracts are normalized, phrase‑protected, and matched against a hand‑crafted lexicon to assign up to 35 topical labels and mine fine‑grained cues about tasks, architectures, training regimes, objectives, datasets, and co‑mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision‑language‑LLM work, which increasingly reframes classic perception as instruction following and multi‑step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human‑ and agent‑centric understanding. Within VLMs, parameter‑efficient adaptation like prompting/adapters/LoRA and lightweight vision‑language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross‑entropy/ranking and distillation. Cross‑venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract‑only scope, but the longitudinal signals are consistent across venues and years.

PaperID: 8067, https://arxiv.org/pdf/2510.09507.pdf

Authors: Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen

Title: PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

Abstract:
The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general‑purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high‑level planning in embodied AI and in downstream Vision‑Language‑Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image‑text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs‑spanning proprietary, open‑source, specialized embodied, and backbones in VLAs‑reveals a significant deficiency in tool understanding. Furthermore, we provide an in‑depth analysis and propose preliminary solutions. Code and dataset are publicly available.

PaperID: 8068, https://arxiv.org/pdf/2510.09483.pdf

Authors: Lars Ohnemus, Nils Hantke, Max Weißer, Kai Furmans

Title: FOGMACHINE -- Leveraging Discrete-Event Simulation and Scene Graphs for Modeling Hierarchical, Interconnected Environments under Partial Observations from Mobile Agents

Abstract:
Dynamic Scene Graphs (DSGs) provide a structured representation of hierarchical, interconnected environments, but current approaches struggle to capture stochastic dynamics, partial observability, and multi‑agent activity. These aspects are critical for embodied AI, where agents must act under uncertainty and delayed perception. We introduce FOGMACHINE , an open‑source framework that fuses DSGs with discrete‑event simulation to model object dynamics, agent observations, and interactions at scale. This setup enables the study of uncertainty propagation, planning under limited perception, and emergent multi‑agent behavior. Experiments in urban scenarios illustrate realistic temporal and spatial patterns while revealing the challenges of belief estimation under sparse observations. By combining structured representations with efficient simulation, FOGMACHINE establishes an effective tool for benchmarking, model training, and advancing embodied AI in complex, uncertain environments.

PaperID: 8069, https://arxiv.org/pdf/2510.09469.pdf

Authors: Bharath Muppasani, Ritirupa Dey, Biplav Srivastava, Vignesh Narayanan

Title: Towards Information-Optimized Multi-Agent Path Finding: A Hybrid Framework with Reduced Inter-Agent Information Sharing

Abstract:
Multi‑agent pathfinding (MAPF) remains a critical problem in robotics and autonomous systems, where agents must navigate shared spaces efficiently while avoiding conflicts. Traditional centralized algorithms with global information provide high‑quality solutions but scale poorly in large‑scale scenarios due to the combinatorial explosion of conflicts. Conversely, distributed approaches that have local information, particularly learning‑based methods, offer better scalability by operating with relaxed information availability, yet often at the cost of solution quality. In realistic deployments, information is a constrained resource: broadcasting full agent states and goals can raise privacy concerns, strain limited bandwidth, and require extra sensing and communication hardware, increasing cost and energy use. We focus on the core question of how MAPF can be solved with minimal inter‑agent information sharing while preserving solution feasibility. To this end, we present an information‑centric formulation of the MAPF problem and introduce a hybrid framework, IO‑MAPF, that integrates decentralized path planning with a lightweight centralized coordinator. In this framework, agents use reinforcement learning (RL) to plan independently, while the central coordinator provides minimal, targeted signals, such as static conflict‑cell indicators or short conflict trajectories, that are dynamically shared to support efficient conflict resolution. We introduce an Information Units (IU) metric to quantify information use and show that our alert‑driven design achieves 2x to 23x reduction in information sharing, compared to the state‑of‑the‑art algorithms, while maintaining high success rates, demonstrating that reliable MAPF is achievable under strongly information‑restricted, privacy‑preserving conditions. We demonstrate the effectiveness of our algorithm using simulation and hardware experiments.

PaperID: 8070, https://arxiv.org/pdf/2510.09462.pdf

Authors: Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping

Title: Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Abstract:
AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero‑shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer‑to‑Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best‑of‑n attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.

PaperID: 8071, https://arxiv.org/pdf/2510.09330.pdf

Authors: Tuan Nguyen, Long Tran-Thanh

Title: Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Abstract:
Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine‑tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference‑time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model‑independent, black‑box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two‑player zero‑sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black‑box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource‑constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

PaperID: 8072, https://arxiv.org/pdf/2510.09325.pdf

Authors: Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

Title: Rate optimal learning of equilibria from data

Abstract:
We close open theoretical gaps in Multi‑Agent Imitation Learning (MAIL) by characterizing the limits of non‑interactive MAIL and presenting the first interactive algorithm with near‑optimal sample complexity. In the non‑interactive setting, we prove a statistical lower bound that identifies the all‑policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate‑optimal. For the interactive setting, we introduce a framework that combines reward‑free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL‑WARM. It improves the best previously known sample complexity from \mathcalO(\varepsilon^‑8) to \mathcalO(\varepsilon^‑2), matching the dependence on \varepsilon implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.

PaperID: 8073, https://arxiv.org/pdf/2510.09244.pdf

Authors: Victor de Lamo Castrillo, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Title: Fundamentals of Building Autonomous LLM Agents

Abstract:
This paper reviews the architecture and implementation methods of agents powered by large language models (LLMs). Motivated by the limitations of traditional LLMs in real‑world tasks, the research aims to explore patterns to develop "agentic" LLMs that can automate complex tasks and bridge the performance gap with human capabilities. Key components include a perception system that converts environmental percepts into meaningful representations; a reasoning system that formulates plans, adapts to feedback, and evaluates actions through different techniques like Chain‑of‑Thought and Tree‑of‑Thought; a memory system that retains knowledge through both short‑term and long‑term mechanisms; and an execution system that translates internal decisions into concrete actions. This paper shows how integrating these systems leads to more capable and generalized software bots that mimic human cognitive processes for autonomous and intelligent behavior.

PaperID: 8074, https://arxiv.org/pdf/2510.09183.pdf

Authors: Jianxiao Jiang, Yu Zhang

Title: Student Development Agent: Risk-free Simulation for Evaluating AIED Innovations

Abstract:
In the age of AI‑powered educational (AIED) innovation, evaluating the developmental consequences of novel designs before they are exposed to students has become both essential and challenging. Since such interventions may carry irreversible effects, it is critical to anticipate not only potential benefits but also possible harms. This study proposes a student development agent framework based on large language models (LLMs), designed to simulate how students with diverse characteristics may evolve under different educational settings without administering them to real students. By validating the approach through a case study on a multi‑agent learning environment (MAIC), we demonstrate that the agent's predictions align with real student outcomes in non‑cognitive developments. The results suggest that LLM‑based simulations hold promise for evaluating AIED innovations efficiently and ethically. Future directions include enhancing profile structures, incorporating fine‑tuned or small task‑specific models, validating effects of empirical findings, interpreting simulated data and optimizing evaluation methods.

PaperID: 8075, https://arxiv.org/pdf/2510.09156.pdf

Authors: Jing Li, Zhijie Sun, Zhicheng Zhou, Suming Qiu, Junjie Huang, Haijia Sun, Linyuan Qiu

Title: Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning

Abstract:
Current knowledge‑enhanced large language models (LLMs) rely on static, pre‑constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic‑KGR, a novel framework enabling co‑evolution between LLMs and knowledge graphs (KGs) through multi‑round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre‑defined boundaries during training; (2) a retrieval‑augmented memory system enabling synergistic co‑evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi‑scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single‑round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.

PaperID: 8076, https://arxiv.org/pdf/2510.09116.pdf

Authors: Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie

Title: DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

Abstract:
Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface‑level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero‑pronoun resolution, and cultural safety, supported by over 18K expert‑annotated Chinese‑English sentence pairs. We further propose AgentEval, a reasoning‑driven multi‑agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta‑evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese‑trained LLMs surpass larger foreign counterparts, and that DeepSeek‑V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM‑based web novel translation and provides public resources to advance future research.

PaperID: 8077, https://arxiv.org/pdf/2510.08996.pdf

Authors: Spandan Garg, Benjamin Steenhoek, Yufan Huang

Title: Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Abstract:
Current benchmarks for evaluating software engineering agents, such as SWE‑Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat‑based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agent's capabilities in real‑world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat‑based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE‑Bench Verified, the TypeScript subset of Multi‑SWE‑Bench and a private benchmark, SWE‑Bench C# and transform formal GitHub issue descriptions into realistic user‑style queries based on telemetry analysis of a popular chat‑based agent interactions. Our findings reveal that existing benchmarks significantly overestimate agent capabilities for some models by >50% over baseline performance for public benchmarks and ~10‑16% for our internal benchmark. This work establishes a new paradigm for evaluating interactive chat‑based software engineering agents through benchmark mutation techniques.

PaperID: 8078, https://arxiv.org/pdf/2510.08952.pdf

Authors: Zhihan Zhang, Xunkai Li, Yilong Zuo, Henan Sun, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang

Title: When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach

Abstract:
Text‑attributed graphs (TAGs) have become a key form of graph‑structured data in modern data management and analytics, combining structural relationships with rich textual semantics for diverse applications. However, the effectiveness of analytical models, particularly graph neural networks (GNNs), is highly sensitive to data quality. Our empirical analysis shows that both conventional and LLM‑enhanced GNNs degrade notably under textual, structural, and label imperfections, underscoring TAG quality as a key bottleneck for reliable analytics. Existing studies have explored data‑level optimization for TAGs, but most focus on specific degradation types and target a single aspect like structure or label, lacking a systematic and comprehensive perspective on data quality improvement. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi‑agent framework for comprehensive TAG quality optimization. LAGA formulates graph quality control as a data‑centric process, integrating detection, planning, action, and evaluation agents into an automated loop. It holistically enhances textual, structural, and label aspects through coordinated multi‑modal optimization. Extensive experiments on 5 datasets and 16 baselines across 9 scenarios demonstrate the effectiveness, robustness and scalability of LAGA, confirming the importance of data‑centric quality optimization for reliable TAG analytics.

PaperID: 8079, https://arxiv.org/pdf/2510.08847.pdf

Authors: Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Seung Won Wilson Yoo, Nirvika Choudhury, Shayak Sen, John C. Mitchell, Anupam Datta

Title: What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment

Abstract:
We introduce the Agent GPA (Goal‑Plan‑Action) framework, driven by the fundamental insight that critical agent failures emerge at the intersections of setting goals, devising plans, and executing actions. We operationalize the framework with a factorized suite of LLM judges designed to measure distinct elements of Goal‑Plan‑Act alignment. To make this methodology scalable and generalizable across diverse agent architectures and datasets, we use state‑of‑the‑art automated prompt optimization techniques to systematically generate domain‑specific evaluation criteria. We validate this approach across three benchmarks: a multi‑agent research setting (TRAIL/GAIA), a single coding agent setting (TRAIL/SWE‑bench), and a private, enterprise data‑agent setting (Snowflake Intelligence). Extensive evaluation on TRAIL/GAIA demonstrates the core validity of the framework, which identifies a broad range of agent failures (95% of human‑annotated errors), localizes errors to enable targeted debugging (86% of human‑annotated errors), and exhibits strong agreement with human evaluators. Crucially, by applying our automated methodology to both public datasets, we demonstrate that our GPA judges generally achieve the highest error coverage (ranging from 76% to 86%) in comparison to manual prompting approaches. We also leverage an evolutionary coding agent to improve judge consistency by up to 38% through iterative refinement of evaluation rubrics. Overall, Agent GPA provides a rigorous and generalizable paradigm for targeted agent evaluation.

PaperID: 8080, https://arxiv.org/pdf/2510.08842.pdf

Authors: Molang Wu, Zhao Zhang

Title: Maple: A Multi-agent System for Portable Deep Learning across Clusters

Abstract:
Training deep learning (DL) models across Graphics Processing Unit (GPU) clusters is technically challenging. One aspect is that users have to compose command lines to adapt to the heterogeneous launchers, schedulers, affinity options, DL framework arguments, and environment variables. Composing correct command lines is error‑prone and can easily frustrate users, impeding research or wasting resources. In this work, we present Maple, a multi‑agent system that generates correct DL command lines with users' natural language input. Maple consists of four agents with the functionalities of information extraction, template retrieval, command line verification, and error correction. We evaluate Maple on nine GPU clusters across national computing centers in the U.S., five representative deep learning model families, and four commonly used parallel DL training paradigms. Our experiments also cover schedulers of SLURM and PBS and heterogeneous architectures, such as NVIDIA A100/H200 GPUs and Intel Max series GPUs. Maple achieves 92.0% accuracy in generating command lines across the 567 test cases. Leverage multiple language models with an aggregated size of 10B parameters, Maple delivers comparable performance to the state‑of‑the‑art models of GPT‑5, Claude, and Gemini. Together, these results highlight Maple's practical value in enabling portable and scalable distributed DL across heterogeneous HPC environments.

PaperID: 8081, https://arxiv.org/pdf/2510.08829.pdf

Authors: Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader

Title: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Abstract:
The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context‑dependent nature of attacks, however, current defenses are often ill‑calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real‑world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample‑level classification, we propose a token‑level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non‑blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token‑level predictors with readily available instruction‑tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7‑10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.

PaperID: 8082, https://arxiv.org/pdf/2510.08804.pdf

Authors: Siddeshwar Raghavan, Tanwi Mallick

Title: MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

Abstract:
We present MOSAIC, a multi‑agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general‑purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain‑specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training‑free framework with specially designed agents to self‑reflect, create the rationale, code, and debug within a student‑teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.

PaperID: 8083, https://arxiv.org/pdf/2510.08790.pdf

Authors: Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, Zizhao Zhang

Title: COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

Abstract:
Long‑horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state‑of‑the‑art models often hallucinate or lose coherence. We identify context management as the central bottleneck ‑‑ extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect from previous mistakes. To address this, we propose COMPASS (Context‑Organized Multi‑Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta‑Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks ‑‑ GAIA, BrowseComp, and Humanity's Last Exam ‑‑ COMPASS improves accuracy by up to 20% relative to both single‑ and multi‑agent baselines. We further introduce a test‑time scaling extension that elevates performance to match established DeepResearch agents, and a post‑training pipeline that delegates context management to smaller models for enhanced efficiency.

PaperID: 8084, https://arxiv.org/pdf/2510.08665.pdf

Authors: Aofan Liu, Haoxuan Li, Bin Wang, Ao Yang, Hui Li

Title: RA-Gen: A Controllable Code Generation Framework Using ReAct for Multi-Agent Task Execution

Abstract:
Code generation models based on large language models (LLMs) have gained wide adoption, but challenges remain in ensuring safety, accuracy, and controllability, especially for complex tasks. Existing methods often lack dynamic integration of external tools, transparent reasoning, and user control over safety. To address these issues, we propose a controllable code generation framework utilizing the ReAct paradigm for multi‑agent task execution. This framework is a multi‑agent system designed to enable efficient, precise, and interpretable code generation through dynamic interactions between LLMs and external resources. The framework adopts a collaborative architecture comprising four specialized agents: a Planner for task decomposition, a Searcher that leverages the ReAct framework for reasoning and tool integration, a CodeGen agent for accurate code generation, and an Extractor for structured data retrieval. The ReAct‑based Searcher alternates between generating reasoning traces and executing actions, facilitating seamless integration of internal knowledge with external tools (such as search engines) to enhance accuracy and user control. Experimental results show the framework's effectiveness across multiple languages, achieving a 94.8% security rate on the SVEN dataset with CodeQL, outperforming existing approaches. Its transparent reasoning process fosters user trust and improves controllability.

PaperID: 8085, https://arxiv.org/pdf/2510.08640.pdf

Authors: Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao

Title: Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

Abstract:
Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open‑source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain‑specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state‑of‑the‑art coding agent that relies on a general‑purpose shell. GradleFixer's success suggests that while LLMs possess the high‑level knowledge to solve these failures, they struggle to translate this knowledge into effective low‑level actions using a general‑purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general‑purpose shell commands with domain‑aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API‑like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high‑level reasoning and effective low‑level execution.

PaperID: 8086, https://arxiv.org/pdf/2510.08623.pdf

Authors: Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla

Title: PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

Abstract:
Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection‑based extraction with combined static and LLM‑based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema‑Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and and maintaining practical latency.

PaperID: 8087, https://arxiv.org/pdf/2510.08613.pdf

Authors: Xinnan Dai, Xianxuan Long, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang

Title: GraphGhost: Tracing Structures Behind Large Language Models

Abstract:
Large Language Models (LLMs) exhibit strong reasoning capabilities on structured tasks, yet the internal mechanisms underlying such behaviors remain poorly understood. Existing interpretation methods mainly focus on token‑level attributions, which provide limited insight into multi‑step reasoning inside the model. We propose GraphGhost, a graph‑based framework that models internal token interactions and neuron activations in LLMs as graphs. By aggregating token dependencies traced across layers, GraphGhost captures global information flow underlying model predictions. We formalize GraphGhost from two complementary perspectives: a sample view, which traces token dependencies for individual predictions, and a dataset view, which aggregates recurring structural patterns learned during training. Through graph analytics and quantitative experiments, we show that graph structural properties are closely associated with influential tokens and neuron nodes, and that perturbations to structurally critical nodes lead to measurable changes in reasoning behavior. These results indicate that the structural patterns captured by GraphGhost reflect meaningful internal organization of LLM reasoning. The codes are available at software part. Artifacts will be made available for research use only.

PaperID: 8088, https://arxiv.org/pdf/2510.08608.pdf

Authors: Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen

Title: MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Abstract:
Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high‑resource settings. We propose MMA‑ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA‑ASIA centers on a human‑curated, multilingual, and multimodally aligned multiple‑choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi‑step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross‑modal transfer. Building on this benchmark, we propose a five‑dimensional evaluation protocol that measures: (i) cultural‑awareness disparities across countries, (ii) cross‑lingual consistency, (iii) cross‑modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision‑ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.

PaperID: 8089, https://arxiv.org/pdf/2510.08607.pdf

Authors: Zhaoqilin Yang, Chanchan Li, Tianqi Liu, Hongxin Zhao, Youliang Tian

Title: GRPO-GCC: Enhancing Cooperation in Spatial Public Goods Games via Group Relative Policy Optimization with Global Cooperation Constraint

Abstract:
Inspired by the principle of self‑regulating cooperation in collective institutions, we propose the Group Relative Policy Optimization with Global Cooperation Constraint (GRPO‑GCC) framework. This work is the first to introduce GRPO into spatial public goods games, establishing a new deep reinforcement learning baseline for structured populations. GRPO‑GCC integrates group relative policy optimization with a global cooperation constraint that strengthens incentives at intermediate cooperation levels while weakening them at extremes. This mechanism aligns local decision making with sustainable collective outcomes and prevents collapse into either universal defection or unconditional cooperation. The framework advances beyond existing approaches by combining group‑normalized advantage estimation, a reference‑anchored KL penalty, and a global incentive term that dynamically adjusts cooperative payoffs. As a result, it achieves accelerated cooperation onset, stabilized policy adaptation, and long‑term sustainability. GRPO‑GCC demonstrates how a simple yet global signal can reshape incentives toward resilient cooperation, and provides a new paradigm for multi‑agent reinforcement learning in socio‑technical systems.

PaperID: 8090, https://arxiv.org/pdf/2510.08605.pdf

Authors: Nouar Aldahoul, Yasir Zaki

Title: Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks

Abstract:
The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision‑making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language‑switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple‑choice questions. In this paper, we present a multilingual, multi‑agent large language model framework with retrieval‑augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI‑driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin‑based deployment for real‑world web applications.

PaperID: 8091, https://arxiv.org/pdf/2510.08578.pdf

Authors: Adib Bazgir, Amir Habibdoust, Xing Song, Yuwen Zhang

Title: AgenticAD: A Specialized Multiagent System Framework for Holistic Alzheimer Disease Management

Abstract:
Alzheimer's disease (AD) presents a complex, multifaceted challenge to patients, caregivers, and the healthcare system, necessitating integrated and dynamic support solutions. While artificial intelligence (AI) offers promising avenues for intervention, current applications are often siloed, addressing singular aspects of the disease such as diagnostics or caregiver support without systemic integration. This paper proposes a novel methodological framework for a comprehensive, multi‑agent system (MAS) designed for holistic Alzheimer's disease management. The objective is to detail the architecture of a collaborative ecosystem of specialized AI agents, each engineered to address a distinct challenge in the AD care continuum, from caregiver support and multimodal data analysis to automated research and clinical data interpretation. The proposed framework is composed of eight specialized, interoperable agents. These agents are categorized by function: (1) Caregiver and Patient Support, (2) Data Analysis and Research, and (3) Advanced Multimodal Workflows. The methodology details the technical architecture of each agent, leveraging a suite of advanced technologies including large language models (LLMs) such as GPT‑4o and Gemini, multi‑agent orchestration frameworks, Retrieval‑Augmented Generation (RAG) for evidence‑grounded responses, and specialized tools for web scraping, multimodal data processing, and in‑memory database querying. This paper presents a detailed architectural blueprint for an integrated AI ecosystem for AD care. By moving beyond single‑purpose tools to a collaborative, multi‑agent paradigm, this framework establishes a foundation for developing more adaptive, personalized, and proactive solutions. This methodological approach aims to pave the way for future systems capable of synthesizing diverse data streams to improve patient outcomes and reduce caregiver burden.

PaperID: 8092, https://arxiv.org/pdf/2510.08529.pdf

Authors: Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, Lei Bai

Title: CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Abstract:
Self‑evolution is a central research topic in enabling large language model (LLM)‑based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)‑free to RL‑based methods. Current RL‑based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self‑evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co‑Evolving Multi‑Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter‑agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM‑as‑a‑judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co‑evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state‑of‑the‑art performance across most evaluation settings. Ablation studies confirm the necessity of interaction‑based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self‑evolution in LLM‑based agents.

PaperID: 8093, https://arxiv.org/pdf/2510.08517.pdf

Authors: Grace Liu, Yuxiao Qu, Jeff Schneider, Aarti Singh, Aviral Kumar

Title: CaRT: Teaching LLM Agents to Know When They Know Enough

Abstract:
Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine‑tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine‑tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine‑tuning methods.

PaperID: 8094, https://arxiv.org/pdf/2510.08512.pdf

Authors: Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos

Title: Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression

Abstract:
Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi‑agent robotic systems, especially nowadays with the growing reliance on edge and cloud‑based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic‑aware encoders conditioned by Feature‑wise Linear Modulation (FiLM). A folding‑based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state‑of‑the‑art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi‑robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.

PaperID: 8095, https://arxiv.org/pdf/2510.08429.pdf

Authors: Stella C. Dong, James R. Finlay

Title: ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing

Abstract:
Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause‑grounded reinforcement learning framework that produces transparent, regulation‑compliant, and risk‑aware treaty quotes. ClauseLens models the quoting task as a Risk‑Aware Constrained Markov Decision Process (RA‑CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent's observations, and used both to constrain feasible actions and to generate clause‑grounded natural language justifications. Evaluated in a multi‑agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail‑risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause‑grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation‑aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.

PaperID: 8096, https://arxiv.org/pdf/2510.08381.pdf

Authors: Baoyang Chen, Xian Xu, Huamin Qu

Title: Airy: Reading Robot Intent through Height and Sky

Abstract:
As industrial robots move into shared human spaces, their opaque decision making threatens safety, trust, and public oversight. This artwork, Airy, asks whether complex multi agent AI can become intuitively understandable by staging a competition between two reinforcement trained robot arms that snap a bedsheet skyward. Building on three design principles, competition as a clear metric (who lifts higher), embodied familiarity (audiences recognize fabric snapping), and sensor to sense mapping (robot cooperation or rivalry shown through forest and weather projections), the installation gives viewers a visceral way to read machine intent. Observations from five international exhibitions indicate that audiences consistently read the robots' strategies, conflict, and cooperation in real time, with emotional reactions that mirror the system's internal state. The project shows how sensory metaphors can turn a black box into a public interface.

PaperID: 8097, https://arxiv.org/pdf/2510.08354.pdf

Authors: Zechang Sun, Yuan-Sen Ting, Yaobo Liang, Nan Duan, Song Huang, Zheng Cai

Title: Mephisto: Self-Improving Large Language Model-Based Agents for Automated Interpretation of Multi-band Galaxy Observations

Abstract:
Astronomical research has long relied on human expertise to interpret complex data and formulate scientific hypotheses. In this study, we introduce Mephisto ‑‑ a multi‑agent collaboration framework powered by large language models (LLMs) that emulates human‑like reasoning for analyzing multi‑band galaxy observations. Mephisto interfaces with the CIGALE codebase (a library of spectral energy distribution, SED, models) to iteratively refine physical models against observational data. It conducts deliberate reasoning via tree search, accumulates knowledge through self‑play, and dynamically updates its knowledge base. Validated across diverse galaxy populations ‑‑ including the James Webb Space Telescope's recently discovered "Little Red Dot" galaxies ‑‑ we show that Mephisto demonstrates proficiency in inferring the physical properties of galaxies from multi‑band photometry, positioning it as a promising research copilot for astronomers. Unlike prior black‑box machine learning approaches in astronomy, Mephisto offers a transparent, human‑aligned reasoning process that integrates seamlessly with existing research practices. This work underscores the possibility of LLM‑driven agent‑based research for astronomy, establishes a foundation for fully automated, end‑to‑end artificial intelligence (AI)‑powered scientific workflows, and unlocks new avenues for AI‑augmented discoveries in astronomy.

PaperID: 8098, https://arxiv.org/pdf/2510.08326.pdf

Authors: Yaning Li, Yutong Chen, Yihan Hou, Chenyi Chen, Yihan Han, Jingxuan Han, Wenxi Dai, Youyou Li, Xinke Tang, Meng Li, Qi Dong, Hongwei Li

Title: LacAIDes: Generative AI-Supported Creative Interactive Circuits Crafting to Enliven Traditional Lacquerware

Abstract:
Lacquerware, a representative craft of Chinese intangible cultural heritage, is renowned for its layered aesthetics and durability but faces declining engagement. While prior human‑computer interaction research has explored embedding interactive circuits to transform lacquerware into responsive artifacts, most studies have focused on fabrication techniques rather than supporting makers in creatively designing such interactions at a low threshold. To address this gap, we present LacAIDes, a Generative AI powered creativity‑support tool built on a multi‑agent workflow aligned with the double diamond model of design thinking. LacAIDes enables exploration and creation of culturally grounded interactive circuits without requiring prior technical expertise. We evaluated LacAIDes in a longitudinal workshop with 34 participants using a mixed‑method approach. Results show that LacAIDes demonstrated high usability, enhanced creative engagement in craft making, and encouraged critical reflection on the role of Generative AI in digital craft practices. This work contributes to human‑computer interaction by introducing a novel creativity‑support tool and providing empirical insights into revitalizing traditional craft making through Generative AI.

PaperID: 8099, https://arxiv.org/pdf/2510.08263.pdf

Authors: Shunyu An, Miao Wang, Yongchao Li, Dong Wan, Lina Wang, Ling Qin, Liqin Gao, Congyao Fan, Zhiyong Mao, Jiange Pu, Wenji Xia, Dong Zhao, Zhaohui Hao, Rui Hu, Ji Lu, Guiyue Zhou, Baoyu Tang, Yanqin Gao, Yongsheng Du, Daigang Xu, Lingjun Huang, Baoli Wang, Xiwen Zhang, Luyao Wang, Shilong Liu

Title: Co-TAP: Three-Layer Agent Interaction Protocol Technical Report

Abstract:
This paper proposes Co‑TAP (T: Triple, A: Agent, P: Protocol), a three‑layer agent interaction protocol designed to address the challenges faced by multi‑agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human‑Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory‑Extraction‑Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event‑driven communication paradigm. This ensures the real‑time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized ''Memory (M) ‑ Extraction (E) ‑ Knowledge (K)'' cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi‑agent applications.

PaperID: 8100, https://arxiv.org/pdf/2510.08255.pdf

Authors: Marta Emili Garcia Segura, Stephen Hailes, Mirco Musolesi

Title: Opponent Shaping in LLM Agents

Abstract:
Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real‑world environments. As these deployments scale, multi‑agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM‑based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher‑order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model‑free OS methods tailored for transformer‑based agents. Using ShapeLLM, we examine whether LLM agents can influence co‑players' learning dynamics across diverse game‑theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi‑agent LLM research.

PaperID: 8101, https://arxiv.org/pdf/2510.08242.pdf

Authors: Mohammed Almutairi, Charles Chiang, Haoze Guo, Matthew Belcher, Nandini Banerjee, Maria Milkowski, Svitlana Volkova, Daniel Nguyen, Tim Weninger, Michael Yankoski, Trenton W. Ford, Diego Gomez-Zara

Title: Simulating Teams with LLM Agents: Interactive 2D Environments for Studying Human-AI Dynamics

Abstract:
Enabling users to create their own simulations offers a powerful way to study team dynamics and performance. We introduce VirTLab, a system that allows researchers and practitioners to design interactive, customizable simulations of team dynamics with LLM‑based agents situated in 2D spatial environments. Unlike prior frameworks that restrict scenarios to predefined or static tasks, our approach enables users to build scenarios, assign roles, and observe how agents coordinate, move, and adapt over time. By bridging team cognition behaviors with scalable agent‑based modeling, our system provides a testbed for investigating how environments influence coordination, collaboration, and emergent team behaviors. We demonstrate its utility by aligning simulated outcomes with empirical evaluations and a user study, underscoring the importance of customizable environments for advancing research on multi‑agent simulations. This work contributes to making simulations accessible to both technical and non‑technical users, supporting the design, execution, and analysis of complex multi‑agent experiments.

PaperID: 8102, https://arxiv.org/pdf/2510.08240.pdf

Authors: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan

Title: The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Abstract:
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely‑it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi‑agent reinforcement learning framework that formulates safety alignment as a collaborative, positive‑sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR‑Bench) compared to various baselines. By enabling the conversation and feedback agents to co‑evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.

PaperID: 8103, https://arxiv.org/pdf/2510.08191.pdf

Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun

Title: Training-Free Group Relative Policy Optimization

Abstract:
Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real‑world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine‑Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training‑Free Group Relative Policy Optimization (Training‑Free GRPO), a cost‑effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high‑quality experiential knowledge during multi‑epoch learning on a minimal ground‑truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training‑Free GRPO, when applied to DeepSeek‑V3.1‑Terminus, significantly improves out‑of‑domain performance. With just a few dozen training samples, Training‑Free GRPO outperforms fine‑tuned small LLMs with marginal training data and cost.

PaperID: 8104, https://arxiv.org/pdf/2510.08189.pdf

Authors: Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai

Title: R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Abstract:
Recent trends in test‑time scaling for reasoning models (e.g., OpenAI o1, DeepSeek‑R1) have led to remarkable improvements through long Chain‑of‑Thought (CoT). However, existing benchmarks mainly focus on immediate, single‑horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long‑horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R‑HORIZON, a method designed to stimulate long‑horizon reasoning behaviors in LRMs through query composition. Based on R‑HORIZON, we construct a long‑horizon reasoning benchmark, comprising complex multi‑step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R‑HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R‑HORIZON to construct long‑horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single‑horizon data, RLVR with R‑HORIZON not only substantially improves performance on the multi‑horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R‑HORIZON as a scalable, controllable, and low‑cost paradigm for enhancing and evaluating the long‑horizon reasoning capabilities of LRMs.

PaperID: 8105, https://arxiv.org/pdf/2510.08106.pdf

Authors: Zihan Li, Yixiao Xu, Lei Zhang, Taiyu Han, Xinshan Yang, Yingni Wang, Mingxuan Liu, Shenghai Xin, Linxun Liu, Hongen Liao, Guochen Ning

Title: Beyond hospital reach: Autonomous lightweight ultrasound robot for liver sonography

Abstract:
Liver disease is a major global health burden. While ultrasound is the first‑line diagnostic tool, liver sonography requires locating multiple non‑continuous planes from positions where target structures are often not visible, for biometric assessment and lesion detection, requiring significant expertise. However, expert sonographers are severely scarce in resource‑limited regions. Here, we develop an autonomous lightweight ultrasound robot comprising an AI agent that integrates multi‑modal perception with memory attention for localization of unseen target structures, and a 588‑gram 6‑degrees‑of‑freedom cable‑driven robot. By mounting on the abdomen, the system enhances robustness against motion. Our robot can autonomously acquire expert‑level standard liver ultrasound planes and detect pathology in patients, including two from Xining, a 2261‑meter‑altitude city with limited medical resources. Our system performs effectively on rapid‑motion individuals and in wilderness environments. This work represents the first demonstration of autonomous sonography across multiple challenging scenarios, potentially transforming access to expert‑level diagnostics in underserved regions.

PaperID: 8106, https://arxiv.org/pdf/2510.08104.pdf

Authors: Joshua Holstein, Gerhard Satzger

Title: Development of Mental Models in Human-AI Collaboration: A Conceptual Framework

Abstract:
Artificial intelligence has become integral to organizational decision‑making and while research has explored many facets of this human‑AI collaboration, the focus has mainly been on designing the AI agent(s) and the way the collaboration is set up ‑ generally assuming a human decision‑maker to be "fixed". However, it has largely been neglected that decision‑makers' mental models evolve through their continuous interaction with AI systems. This paper addresses this gap by conceptualizing how the design of human‑AI collaboration influences the development of three complementary and interdependent mental models necessary for this collaboration. We develop an integrated socio‑technical framework that identifies the mechanisms driving the mental model evolution: data contextualization, reasoning transparency, and performance feedback. Our work advances human‑AI collaboration literature through three key contributions: introducing three distinct mental models (domain, information processing, complementarity‑awareness); recognizing the dynamic nature of mental models; and establishing mechanisms that guide the purposeful design of effective human‑AI collaboration.

PaperID: 8107, https://arxiv.org/pdf/2510.08081.pdf

Authors: Xiaochong Lan, Jie Feng, Yinxing Liu, Xinlei Shi, Yong Li

Title: AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment

Abstract:
Ranking online reviews by their intrinsic quality is a critical task for e‑commerce platforms and information services, impacting user experience and business outcomes. However, quality is a domain‑dependent and dynamic concept, making its assessment a formidable challenge. Traditional methods relying on hand‑crafted features are unscalable across domains and fail to adapt to evolving content patterns, while modern deep learning approaches often produce black‑box models that lack interpretability and may prioritize semantics over quality. To address these challenges, we propose AutoQual, an LLM‑based agent framework that automates the discovery of interpretable features. While demonstrated on review quality assessment, AutoQual is designed as a general framework for transforming tacit knowledge embedded in data into explicit, computable features. It mimics a human research process, iteratively generating feature hypotheses through reflection, operationalizing them via autonomous tool implementation, and accumulating experience in a persistent memory. We deploy our method on a large‑scale online platform with a billion‑level user base. Large‑scale A/B testing confirms its effectiveness, increasing average reviews viewed per user by 0.79% and the conversion rate of review readers by 0.27%.

PaperID: 8108, https://arxiv.org/pdf/2510.08068.pdf

Authors: Aadi Singhi

Title: An Adaptive Multi Agent Bitcoin Trading System

Abstract:
This paper presents a Multi Agent Bitcoin Trading system that utilizes Large Language Models (LLMs) for alpha generation and portfolio management in the cryptocurrencies market. Unlike equities, cryptocurrencies exhibit extreme volatility and are heavily influenced by rapidly shifting market sentiments and regulatory announcements, making them difficult to model using static regression models or neural networks trained solely on historical data. The proposed framework overcomes this by structuring LLMs into specialised agents for technical analysis, sentiment evaluation, decision‑making, and performance reflection. The agents improve over time via a novel verbal feedback mechanism where a Reflect agent provides daily and weekly natural‑language critiques of trading decisions. These textual evaluations are then injected into future prompts of the agents, allowing them to adjust allocation logic without weight updates or finetuning. Back‑testing on Bitcoin price data from July 2024 to April 2025 shows consistent outperformance across market regimes: the Quantitative agent delivered over 30% higher returns in bullish phases and 15% overall gains versus buy‑and‑hold, while the sentiment‑driven agent turned sideways markets from a small loss into a gain of over 100%. Adding weekly feedback further improved total performance by 31% and reduced bearish losses by 10%. The results demonstrate that verbal feedback represents a new, scalable, and low‑cost approach of tuning LLMs for financial goals.

PaperID: 8109, https://arxiv.org/pdf/2510.08048.pdf

Authors: Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang

Title: TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

Abstract:
Query‑product relevance prediction is fundamental to e‑commerce search and has become even more critical in the era of AI‑powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning‑based approaches, typically aligned via supervised fine‑tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long‑tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi‑step reasoning and slowing convergence. To address these challenges, we propose TaoSR‑AGRL, an Adaptive Guided Reinforcement Learning framework for LLM‑based relevance prediction in Taobao Search Relevance. TaoSR‑AGRL introduces two key innovations: (1) Rule‑aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain‑specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low‑accuracy rollouts during training and injects targeted ground‑truth guidance to steer the policy away from stagnant, rule‑violating reasoning patterns toward compliant trajectories. TaoSR‑AGRL was evaluated on large‑scale real‑world datasets and through online side‑by‑side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR‑AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.

PaperID: 8110, https://arxiv.org/pdf/2510.08002.pdf

Authors: Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, Yu Qiao, Haifeng Li

Title: Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Abstract:
Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real‑world long‑horizon tasks. Existing LLM agents suffer from a critical limitation: they are test‑time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience‑driven, self‑evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long‑horizon tasks across multiple applications. After each sub‑task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self‑evolution. We evaluate MUSE on the long‑horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini‑2.5 Flash model. Sufficient Experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self‑evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero‑shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real‑world productivity task automation.

PaperID: 8111, https://arxiv.org/pdf/2510.07988.pdf

Authors: Haitao Jia, Ming He, Zimo Yin, Likang Wu, Jianping Fan, Jitao Sang

Title: ReInAgent: A Context-Aware GUI Agent Enabling Human-in-the-Loop Mobile Task Navigation

Abstract:
Mobile GUI agents exhibit substantial potential to facilitate and automate the execution of user tasks on mobile phones. However, exist mobile GUI agents predominantly privilege autonomous operation and neglect the necessity of active user engagement during task execution. This omission undermines their adaptability to information dilemmas including ambiguous, dynamically evolving, and conflicting task scenarios, leading to execution outcomes that deviate from genuine user requirements and preferences. To address these shortcomings, we propose ReInAgent, a context‑aware multi‑agent framework that leverages dynamic information management to enable human‑in‑the‑loop mobile task navigation. ReInAgent integrates three specialized agents around a shared memory module: an information‑managing agent for slot‑based information management and proactive interaction with the user, a decision‑making agent for conflict‑aware planning, and a reflecting agent for task reflection and information consistency validation. Through continuous contextual information analysis and sustained user‑agent collaboration, ReInAgent overcomes the limitation of existing approaches that rely on clear and static task assumptions. Consequently, it enables more adaptive and reliable mobile task navigation in complex, real‑world scenarios. Experimental results demonstrate that ReInAgent effectively resolves information dilemmas and produces outcomes that are more closely aligned with genuine user preferences. Notably, on complex tasks involving information dilemmas, ReInAgent achieves a 25% higher success rate than Mobile‑Agent‑v2.

PaperID: 8112, https://arxiv.org/pdf/2510.07971.pdf

Authors: Oskar Bohn Lassen, Serio Angelo Maria Agriesti, Filipe Rodrigues, Francisco Camara Pereira

Title: Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM

Abstract:
Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi‑agent reinforcement learning (MARL) framework that integrates a high‑fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi‑gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on (20,000) multi‑gas emission pathways to surrogate the climate model CICERO‑SCM. The surrogate model attains near‑simulator accuracy with global‑mean temperature RMSE \approx 0.0004 \mathrmK and approximately 1000× faster one‑step inference. When substituted for the original simulator in a climate‑policy MARL setting, it accelerates end‑to‑end training by >\!100×. We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work allows to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large‑scale multi‑agent experiments across alternative climate‑policy regimes with multi‑gas dynamics and high‑fidelity climate response.

PaperID: 8113, https://arxiv.org/pdf/2510.07943.pdf

Authors: Qiushi Tian, Churong Liang, Kairan Hong, Runnan Li

Title: Agent-Based Genetic Algorithm for Crypto Trading Strategy Optimization

Abstract:
Cryptocurrency markets present formidable challenges for trading strategy optimization due to extreme volatility, non‑stationary dynamics, and complex microstructure patterns that render conventional parameter optimization methods fundamentally inadequate. We introduce Cypto Genetic Algorithm Agent (CGA‑Agent), a pioneering hybrid framework that synergistically integrates genetic algorithms with intelligent multi‑agent coordination mechanisms for adaptive trading strategy parameter optimization in dynamic financial environments. The framework uniquely incorporates real‑time market microstructure intelligence and adaptive strategy performance feedback through intelligent mechanisms that dynamically guide evolutionary processes, transcending the limitations of static optimization approaches. Comprehensive empirical evaluation across three cryptocurrencies demonstrates systematic and statistically significant performance improvements on both total returns and risk‑adjusted metrics.

PaperID: 8114, https://arxiv.org/pdf/2510.07925.pdf

Authors: Rebecca Westhäußer, Wolfgang Minker, Sebatian Zepf

Title: Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles

Abstract:
Large language models (LLMs) increasingly serve as the central control unit of AI agents, yet current approaches remain limited in their ability to deliver personalized interactions. While Retrieval Augmented Generation enhances LLM capabilities by improving context‑awareness, it lacks mechanisms to combine contextual information with user‑specific data. Although personalization has been studied in fields such as human‑computer interaction or cognitive science, existing perspectives largely remain conceptual, with limited focus on technical implementation. To address these gaps, we build on a unified definition of personalization as a conceptual foundation to derive technical requirements for adaptive, user‑centered LLM‑based agents. Combined with established agentic AI patterns such as multi‑agent collaboration or multi‑source retrieval, we present a framework that integrates persistent memory, dynamic coordination, self‑validation, and evolving user profiles to enable personalized long‑term interactions. We evaluate our approach on three public datasets using metrics such as retrieval accuracy, response correctness, or BertScore. We complement these results with a five‑day pilot user study providing initial insights into user feedback on perceived personalization. The study provides early indications that guide future work and highlights the potential of integrating persistent memory and user profiles to improve the adaptivity and perceived personalization of LLM‑based agents.

PaperID: 8115, https://arxiv.org/pdf/2510.07888.pdf

Authors: Xinren Zhang, Sixi Cheng, Zixin Zhong, Jiadong Yu

Title: Network Topology and Information Efficiency of Multi-Agent Systems: Study based on MARL

Abstract:
Multi‑agent systems (MAS) solve complex problems through coordinated autonomous entities with individual decision‑making capabilities. While Multi‑Agent Reinforcement Learning (MARL) enables these agents to learn intelligent strategies, it faces challenges of non‑stationarity and partial observability. Communications among agents offer a solution, but questions remain about its optimal structure and evaluation. This paper explores two underexamined aspects: communication topology and information efficiency. We demonstrate that directed and sequential topologies improve performance while reducing communication overhead across both homogeneous and heterogeneous tasks. Additionally, we introduce two metrics ‑‑ Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) ‑‑ to evaluate message compactness and role differentiation. Incorporating these metrics into training objectives improves success rates and convergence speed. Our findings highlight that designing adaptive communication topologies with information‑efficient messaging is essential for effective coordination in complex MAS.

PaperID: 8116, https://arxiv.org/pdf/2510.07841.pdf

Authors: Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Title: Self-Improving LLM Agents at Test-Time

Abstract:
One paradigm of language model (LM) fine‑tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post‑training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test‑time self‑improvement method to create more effective and generalizable agentic LMs on‑the‑fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that model struggles with (self‑awareness), (ii) then generates similar examples from detected uncertain samples (self‑data augmentation), and (iii) uses these newly generated samples at test‑time fine‑tuning (self‑improvement). We study two variants of this approach: Test‑Time Self‑Improvement (TT‑SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test‑Time Distillation (TT‑D), where a stronger model generates similar examples for uncertain cases, enabling student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT‑SI improves the performance with +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, yet using 68x less training samples. Our findings highlight the promise of TT‑SI, demonstrating the potential of self‑improvement algorithms at test‑time as a new paradigm for building more capable agents toward self‑evolution.

PaperID: 8117, https://arxiv.org/pdf/2510.07799.pdf

Authors: Eric Hanchen Jiang, Mengting Li, Guancheng Wan, Sophia Yin, Yuchen Wu, Xiao Liang, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, Ying Nian Wu

Title: Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

Abstract:
The efficiency of multi‑agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non‑trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand‑crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi‑objective rewards (e.g., accuracy, utility, cost), enabling real‑time, gradient‑free optimization towards task‑adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single‑step generative frameworks, enabling it to better navigate complex design trade‑offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task‑adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

PaperID: 8118, https://arxiv.org/pdf/2510.07768.pdf

Authors: Murong Yue, Zhiwei Liu, Liangwei Yang, Jianguo Zhang, Zuxin Liu, Haolin Chen, Ziyu Yao, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang

Title: ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Abstract:
Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool‑augmented reasoning is hindered by the scarcity of domain‑specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain‑of‑Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function‑related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task‑specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi‑agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question‑specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question‑specific increases.

PaperID: 8119, https://arxiv.org/pdf/2510.07748.pdf

Authors: Yilun Zhang, Dexing Kong

Title: Haibu Mathematical-Medical Intelligent Agent:Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains

Abstract:
Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors, which is unacceptable in this high‑stakes field. To address this, we introduce the "Haibu Mathematical‑Medical Intelligent Agent" (MMIA), an LLM‑driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence‑based steps. This entire reasoning chain is then automatically audited for logical coherence and evidence traceability, similar to theorem proving. A key innovation is MMIA's "bootstrapping" mode, which stores validated reasoning chains as "theorems." Subsequent tasks can then be efficiently solved using Retrieval‑Augmented Generation (RAG), shifting from costly first‑principles reasoning to a low‑cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert‑validated benchmarks. Results showed MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA's verifiable reasoning framework is a significant step toward creating trustworthy, transparent, and cost‑effective AI systems, making LLM technology viable for critical applications in medicine.

PaperID: 8120, https://arxiv.org/pdf/2510.07740.pdf

Authors: Dezhi Ran, Yuan Cao, Mengzhou Wu, Simin Chen, Yuzhe Guo, Jun Ren, Zihe Song, Hao Yu, Jialei Wei, Linyi Li, Wei Yang, Baishakhi Ray, Tao Xie

Title: AppForge: From Assistant to Independent Developer -- Are GPTs Ready for Software Development?

Abstract:
Large language models (LLMs) have demonstrated remarkable capability in function‑level code generation tasks. Unlike isolated functions, real‑world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistency across states over time, and ensure the application behaves correctly within the lifecycle and framework constraints. Yet, no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch. To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real‑world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing the functionality into an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context‑aware, robust, and maintainable code. To construct APPFORGE, we design a multi‑agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases validating the functional correctness of app implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases within an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation on 12 flagship LLMs show that all evaluated models achieve low effectiveness, with the best‑performing model (GPT‑5) developing only 18.8% functionally correct applications, highlighting fundamental limitations in current models' ability to handle complex, multi‑component software engineering challenges.

PaperID: 8121, https://arxiv.org/pdf/2510.07733.pdf

Authors: Minh-Anh Nguye, Minh-Duc Nguyen, Ha Lan N. T., Kieu Hai Dang, Nguyen Tien Dong, Dung D. Le

Title: SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation

Abstract:
Large language models (LLMs) are increasingly adopted for automating survey paper generation \citewang2406autosurvey, liang2025surveyx, yan2025surveyforge,su2025benchmarking,wen2025interactivesurvey. Existing approaches typically extract content from a large collection of related papers and prompt LLMs to summarize them directly. However, such methods often overlook the structural relationships among papers, resulting in generated surveys that lack a coherent taxonomy and a deeper contextual understanding of research progress. To address these shortcomings, we propose SurveyG, an LLM‑based agent framework that integrates hierarchical citation graph, where nodes denote research papers and edges capture both citation dependencies and semantic relatedness between their contents, thereby embedding structural and contextual knowledge into the survey generation process. The graph is organized into three layers: Foundation, Development, and Frontier, to capture the evolution of research from seminal works to incremental advances and emerging directions. By combining horizontal search within layers and vertical depth traversal across layers, the agent produces multi‑level summaries, which are consolidated into a structured survey outline. A multi‑agent validation stage then ensures consistency, coverage, and factual accuracy in generating the final survey. Experiments, including evaluations by human experts and LLM‑as‑a‑judge, demonstrate that SurveyG outperforms state‑of‑the‑art frameworks, producing surveys that are more comprehensive and better structured to the underlying knowledge taxonomy of a field.

PaperID: 8122, https://arxiv.org/pdf/2510.07731.pdf

Authors: Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji

Title: oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Abstract:
Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi‑step pathways. We address this by introducing oMeBench, the first large‑scale, expert‑curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine‑grained scoring, we propose oMeS, a dynamic evaluation framework that combines step‑level logic and chemical similarity. We analyze the performance of state‑of‑the‑art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi‑step reasoning. Notably, we find that using prompting strategy and fine‑tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed‑source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

PaperID: 8123, https://arxiv.org/pdf/2510.07614.pdf

Authors: Amine Barrak

Title: Traceability and Accountability in Role-Specialized Multi-Agent LLM Pipelines

Abstract:
Sequential multi‑agent systems built with large language models (LLMs) can automate complex software tasks, but they are hard to trust because errors quietly pass from one stage to the next. We study a traceable and accountable pipeline, meaning a system with clear roles, structured handoffs, and saved records that let us trace who did what at each step and assign blame when things go wrong. Our setting is a Planner ‑> Executor ‑> Critic pipeline. We evaluate eight configurations of three state‑of‑the‑art LLMs on three benchmarks and analyze where errors start, how they spread, and how they can be fixed. Our results show: (1) adding a structured, accountable handoff between agents markedly improves accuracy and prevents the failures common in simple pipelines; (2) models have clear role‑specific strengths and risks (e.g., steady planning vs. high‑variance critiquing), which we quantify with repair and harm rates; and (3) accuracy‑cost‑latency trade‑offs are task‑dependent, with heterogeneous pipelines often the most efficient. Overall, we provide a practical, data‑driven method for designing, tracing, and debugging reliable, predictable, and accountable multi‑agent systems.

PaperID: 8124, https://arxiv.org/pdf/2510.07593.pdf

Authors: Bohan Lin, Kuo Yang, Zelin Tan, Yingchuan Lai, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yudong Zhang, Yang Wang

Title: AgentAsk: Multi-Agent Systems Need to Ask

Abstract:
Multi‑agent systems (MAS) built on large language models promise improved problem‑solving through collaboration, yet they often fail to consistently outperform strong single‑agent baselines due to error propagation at inter‑agent message handoffs.In this work, we conduct a systematic empirical analysis of such failures and introduce an edge‑level error taxonomy that identifies four dominant error types: Data Gap, Signal Corruption, Referential Drift, and Capability Gap, as primary sources of failure in multi‑agent interactions. Building on this taxonomy, we propose AgentAsk, a lightweight clarification module designed to intervene at the edge level in MAS to prevent cascading errors. The module operates by strategically applying minimal clarifications at critical points within the system, improving the accuracy and efficiency of the overall task. AgentAsk is trained to balance the trade‑offs between clarification cost, latency, and accuracy, while it is also architecture‑agnostic and can be easily integrated into existing systems. Evaluated across five benchmarks, AgentAsk consistently improves accuracy by up to 4.69%, while keeping latency and extra costs below 10% compared to baseline MAS, showcasing its high efficiency and minimal overhead.

PaperID: 8125, https://arxiv.org/pdf/2510.07516.pdf

Authors: Md. Nazmul Islam Ananto, Shamit Fatin, Mohammed Eunus Ali, Md Rizwan Parvez

Title: CompassLLM: A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query

Abstract:
The popular path query ‑ identifying the most frequented routes between locations from historical trajectory data ‑ has important applications in urban planning, navigation optimization, and travel recommendations. While traditional algorithms and machine learning approaches have achieved success in this domain, they typically require model training, parameter tuning, and retraining when accommodating data updates. As Large Language Models (LLMs) demonstrate increasing capabilities in spatial and graph‑based reasoning, there is growing interest in exploring how these models can be applied to geo‑spatial problems. We introduce CompassLLM, a novel multi‑agent framework that intelligently leverages the reasoning capabilities of LLMs into the geo‑spatial domain to solve the popular path query. CompassLLM employs its agents in a two‑stage pipeline: the SEARCH stage that identifies popular paths, and a GENERATE stage that synthesizes novel paths in the absence of an existing one in the historical trajectory data. Experiments on real and synthetic datasets show that CompassLLM demonstrates superior accuracy in SEARCH and competitive performance in GENERATE while being cost‑effective.

PaperID: 8126, https://arxiv.org/pdf/2510.07505.pdf

Authors: Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing

Title: PEAR: Planner-Executor Agent Robustness Benchmark

Abstract:
Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi‑step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner‑executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner‑executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade‑off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi‑agent settings.

PaperID: 8127, https://arxiv.org/pdf/2510.07488.pdf

Authors: Rasika Muralidharan, Haewoon Kwak, Jisun An

Title: Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics

Abstract:
Multi‑Agent Systems (MAS) with Large Language Model (LLM)‑powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi‑agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post‑task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.

PaperID: 8128, https://arxiv.org/pdf/2510.07475.pdf

Authors: Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, Yanfang Ye

Title: MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM‑based agents further extend these abilities to various practical workflows. While recent progress shows that multi‑agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi‑agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce Multi‑Agent PRompt Optimization (MAPRO), a four‑stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language‑guided variant of max‑product belief propagation algorithm. To address credit assignment and updates the system iteratively, MAPRO employs a topology‑aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent‑specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state‑of‑the‑art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP‑based formulation also delivers general guidelines for building more reliable and principled multi‑agent systems in the future

PaperID: 8129, https://arxiv.org/pdf/2510.07456.pdf

Authors: Binrong Zhu, Guiran Liu, Nina Jiang

Title: ExpertAgent: Enhancing Personalized Education through Dynamic Planning and Retrieval-Augmented Long-Chain Reasoning

Abstract:
The application of advanced generative artificial intelligence in education is often constrained by the lack of real‑time adaptability, personalization, and reliability of the content. To address these challenges, we propose ExpertAgent ‑ an intelligent agent framework designed for personalized education that provides reliable knowledge and enables highly adaptive learning experiences. Therefore, we developed ExpertAgent, an innovative learning agent that provides users with a proactive and personalized learning experience. ExpertAgent dynamic planning of the learning content and strategy based on a continuously updated student model. Therefore, overcoming the limitations of traditional static learning content to provide optimized teaching strategies and learning experience in real time. All instructional content is grounded in a validated curriculum repository, effectively reducing hallucination risks in large language models and improving reliability and trustworthiness.

PaperID: 8130, https://arxiv.org/pdf/2510.07423.pdf

Authors: William Nguyen, Vinh Luong, Christopher Nguyen

Title: ProSEA: Problem Solving via Exploration Agents

Abstract:
Large language models (LLMs) have empowered AI agents to tackle increasingly complex tasks. However, most existing agents remain limited to static planning and brittle interactions, falling short of true collaboration or adaptive reasoning. We introduce ProSEA, a modular, general‑purpose multi‑agent framework designed for iterative problem solving through exploration and plan evolution. ProSEA features a hierarchical architecture in which a Manager Agent orchestrates domain‑specialized Expert Agents, decomposes tasks, and adaptively replans based on structured feedback from failed attempts. Unlike prior systems, ProSEA agents report not only success or failure but also detailed reasons for failure and newly discovered constraints, enabling dynamic plan refinement informed by exploratory traces. The framework operates autonomously but supports seamless integration with human collaborators when needed. Experiments on the challenging FinanceBench benchmark demonstrate that ProSEA, even without human feedback, outperforms state‑of‑the‑art baselines and achieves robust performance across reasoning‑heavy tasks. These results underscore ProSEA's potential as a foundation for more transparent, adaptive, and human‑aligned AI agents.

PaperID: 8131, https://arxiv.org/pdf/2510.07363.pdf

Authors: Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Jun Wang, Yan Li, Chang Liu

Title: L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

Abstract:
The increasing integration of Industrial IoT (IIoT) exposes critical cyber‑physical systems to sophisticated, multi‑stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M‑AID, a novel framework for Autonomous Industrial Defense using LLM‑empowered, Multi‑agent reinforcement learning. L2M‑AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we leverage an LLM as a semantic bridge to translate vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely matching patterns. This semantically‑aware state empowers a Multi‑Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is uniquely engineered to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and a novel synthetic dataset generated based on the MITRE ATT&CK for ICS framework. Results demonstrate that L2M‑AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single‑agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it demonstrates superior performance in maintaining physical process stability, presenting a robust new paradigm for securing critical national infrastructure.

PaperID: 8132, https://arxiv.org/pdf/2510.07331.pdf

Authors: Faruk Alpay, Hamdi Alakkad

Title: Truth-Aware Decoding: A Program-Logic Approach to Factual Language Generation

Abstract:
This paper introduces Truth‑Aware Decoding (TAD), a verification‑oriented decoding scheme that aligns neural language generation with knowledge bases. Situated in the tradition of probabilistic program semantics for sequence models, TAD augments modern instruction‑tuned systems with a lattice of semantic guards that operate at decode time. Our contributions are fourfold: (i) a constraint‑based semantics that renders oracle filtering as a program‑logic judgment, (ii) a proof that greedy selection enjoys local likelihood dominance under sound and complete guards (Theorem 2.7), (iii) an entropy‑style invariant that quantifies factual risk via knowledge‑aware safe mass, and (iv) a multi‑agent operational calculus with verified Lean artefacts to certify implementation behaviour. Numerical and algorithmic case studies confirm that the resulting guardrails reduce hallucinations without sacrificing throughput, yielding a pragmatic bridge between large‑scale empirical models and formal verification.

PaperID: 8133, https://arxiv.org/pdf/2510.07307.pdf

Authors: Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai

Title: MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Abstract:
While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high‑quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks, demanding extensive time and manual effort to produce. We introduce MLE‑Smith, a fully automated multi‑agent pipeline, to transform raw datasets into competition‑style MLE challenges through an efficient generate‑verify‑execute paradigm for scaling MLE tasks with verifiable quality, real‑world usability, and rich diversity. The proposed multi‑agent pipeline in MLE‑Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high‑level semantic soundness. It further validates empirical solvability and real‑world fidelity through interactive execution. We apply MLE‑Smith to 224 of real‑world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE‑Smith can work effectively across a wide range of real‑world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting‑edge LLMs on MLE‑Smith tasks is strongly correlated with their performance on carefully human‑designed tasks, highlighting the effectiveness of the MLE‑Smith to scaling up MLE tasks, while maintaining task quality.

PaperID: 8134, https://arxiv.org/pdf/2510.07276.pdf

Authors: Pulkit Rustagi, Kyle Hollins Wray, Sandhya Saisubramanian

Title: Multi-Objective Multi-Agent Path Finding with Lexicographic Cost Preferences

Abstract:
Many real‑world scenarios require multiple agents to coordinate in shared environments, while balancing trade‑offs between multiple, potentially competing objectives. Current multi‑objective multi‑agent path finding (MO‑MAPF) algorithms typically produce conflict‑free plans by computing Pareto frontiers. They do not explicitly optimize for user‑defined preferences, even when the preferences are available, and scale poorly with the number of objectives. We propose a lexicographic framework for modeling MO‑MAPF, along with an algorithm Lexicographic Conflict‑Based Search (LCBS) that directly computes a single solution aligned with a lexicographic preference over objectives. LCBS integrates a priority‑aware low‑level A^ search with conflict‑based search, avoiding Pareto frontier construction and enabling efficient planning guided by preference over objectives. We provide insights into optimality and scalability, and empirically demonstrate that LCBS computes optimal solutions while scaling to instances with up to ten objectives ‑‑ far beyond the limits of existing MO‑MAPF methods. Evaluations on standard and randomized MAPF benchmarks show consistently higher success rates against state‑of‑the‑art baselines, especially with increasing number of objectives.

PaperID: 8135, https://arxiv.org/pdf/2510.07233.pdf

Authors: Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj, Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth

Title: LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Abstract:
Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents' structural organization and cross‑page dependencies. However, conventional retrieval‑augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross‑page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi‑page reasoning tasks. To address these limitations, we propose LAD‑RAG, a novel Layout‑Aware Dynamic RAG framework. During ingestion, LAD‑RAG constructs a symbolic document graph that captures layout structure and cross‑page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench‑Doc, LongDocURL, DUDE, and MP‑DocVQA demonstrate that LAD‑RAG improves retrieval, achieving over 90% perfect recall on average without any top‑k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.

PaperID: 8136, https://arxiv.org/pdf/2510.07230.pdf

Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang

Title: Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Abstract:
Simulating step‑wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine‑tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step‑wise behavior, they primarily learn a population‑level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer‑R1, an RL‑based method for personalized, step‑wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next‑step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset emonstrate that Customer‑R1 not only significantly outperforms prompting and SFT‑based baselines in next‑action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.

PaperID: 8137, https://arxiv.org/pdf/2510.07210.pdf

Authors: Donald Pfaffmann, Matthias Klusch, Marcel Steinmetz

Title: HyPlan: Hybrid Learning-Assisted Planning Under Uncertainty for Safe Autonomous Driving

Abstract:
We present a novel hybrid learning‑assisted planning method, named HyPlan, for solving the collision‑free navigation problem for self‑driving cars in partially observable traffic environments. HyPlan combines methods for multi‑agent behavior prediction, deep reinforcement learning with proximal policy optimization and approximated online POMDP planning with heuristic confidence‑based vertical pruning to reduce its execution time without compromising safety of driving. Our experimental performance analysis on the CARLA‑CTS2 benchmark of critical traffic scenarios with pedestrians revealed that HyPlan may navigate safer than selected relevant baselines and perform significantly faster than considered alternative online POMDP planners.

PaperID: 8138, https://arxiv.org/pdf/2510.07176.pdf

Authors: Yixiang Zhang, Xinhao Deng, Zhongyi Gu, Yihao Chen, Ke Xu, Qi Li, Jianping Wu

Title: Exposing LLM User Privacy via Traffic Fingerprint Analysis: A Study of Privacy Risks in LLM Agent Interactions

Abstract:
Large Language Models (LLMs) are increasingly deployed as agents that orchestrate tasks and integrate external tools to execute complex workflows. We demonstrate that these interactive behaviors leave distinctive fingerprints in encrypted traffic exchanged between users and LLM agents. By analyzing traffic patterns associated with agent workflows and tool invocations, adversaries can infer agent activities, distinguish specific agents, and even profile sensitive user attributes. To highlight this risk, we develop AgentPrint, which achieves an F1‑score of 0.866 in agent identification and attains 73.9% and 69.1% top‑3 accuracy in user attribute inference for simulated‑ and real‑user settings, respectively. These results uncover an overlooked risk: the very interactivity that empowers LLM agents also exposes user privacy, underscoring the urgent need for technical countermeasures alongside regulatory and policy safeguards.

PaperID: 8139, https://arxiv.org/pdf/2510.07173.pdf

Authors: Md Tawkat Islam Khondaker, Julia Harrington, Shady Shehata

Title: NurseLLM: The First Specialized Language Model for Nursing

Abstract:
Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing‑specialized LLM tailored for multiple choice question‑answering (MCQ) tasks. We develop a multi‑stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general‑purpose and medical‑specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi‑agent collaboration systems in nursing, highlighting their promise for future research and applications.

PaperID: 8140, https://arxiv.org/pdf/2510.07172.pdf

Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Title: NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Abstract:
Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI‑driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade‑off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts ‑ systematic alterations of canonical laws ‑ to generate a vast suite of problems that are scalable, scientifically relevant, and memorization‑resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next‑generation AI agents capable of genuine scientific discovery.

PaperID: 8141, https://arxiv.org/pdf/2510.07167.pdf

Authors: Lekang Jiang, Wenjun Sun, Stephan Goetz

Title: Reasoning for Hierarchical Text Classification: The Case of Patents

Abstract:
Hierarchical text classification (HTC) assigns documents to multiple levels of a pre‑defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step‑by‑step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold‑start stage that aligns outputs with chain‑of‑thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi‑step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine‑tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural‑language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine‑tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state‑of‑the‑art performance on other widely used HTC benchmarks, which highlights its broad applicability.

PaperID: 8142, https://arxiv.org/pdf/2510.07147.pdf

Authors: Arshika Lalan, Rajat Ghosh, Aditya Kolsur, Debojyoti Dutta

Title: A Multi-Agent Framework for Stateful Inference-Time Search

Abstract:
Recent work explores agentic inference‑time techniques to perform structured, multi‑step reasoning. However, stateless inference often struggles on multi‑step tasks due to the absence of persistent state. Moreover, task‑specific fine‑tuning or instruction‑tuning often achieve surface‑level code generation but remain brittle on tasks requiring deeper reasoning and long‑horizon dependencies. To address these limitations, we propose stateful multi‑agent evolutionary search, a training‑free framework that departs from prior stateless approaches by combining (i) persistent inference‑time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high‑coverage edge cases across unseen codebases. Experiments show our stateful multi‑agent inference framework achieves substantial gains in coverage over stateless single‑step baselines, evaluated on prevalent unit‑testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families ‑ Llama, Gemma, and GPT. These results indicate that combining persistent inference‑time state with evolutionary search materially improves unit‑test generation.

PaperID: 8143, https://arxiv.org/pdf/2510.07117.pdf

Authors: Leonardo Christov-Moore, Arthur Juliani, Alex Kiefer, Joel Lehman, Nicco Reggente, B. Scot Rousse, Adam Safron, Nicolás Hinrichs, Daniel Polani, Antonio Damasio

Title: The Conditions of Physical Embodiment Enable Generalization and Care

Abstract:
As artificial agents enter open‑ended physical environments ‑‑ eldercare, disaster response, and space missions ‑‑ they must persist under uncertainty while providing reliable care. Yet current systems struggle to generalize across distribution shifts and lack intrinsic motivation to preserve the well‑being of others. Vulnerability and mortality are often seen as constraints to be avoided, yet organisms survive and provide care in an open‑ended world with relative ease and efficiency. We argue that generalization and care arise from conditions of physical embodiment: being‑in‑the‑world (the agent is a part of the environment) and being‑towards‑death (unless counteracted, the agent drifts toward terminal states). These conditions necessitate a homeostatic drive to maintain oneself and maximize the future capacity to continue doing so. Fulfilling this drive over long time horizons in multi‑agent environments necessitates robust causal modeling of self and others' embodiment and jointly achievable future states. Because embodied agents are part of the environment, with the self delimited by reliable control, empowering others can expand self‑boundaries, enabling other‑regard. This provides a path from embodiment toward generalization and care based in shared constraints. We outline a reinforcement‑learning framework for examining these questions. Homeostatic mortal agents continually learning in open‑ended environments may offer efficient robustness and trustworthy alignment.

PaperID: 8144, https://arxiv.org/pdf/2510.07064.pdf

Authors: Manh Hung Nguyen, Sebastian Tschiatschek, Adish Singla

Title: Prompt Optimization Across Multiple Agents for Representing Diverse Human Populations

Abstract:
The difficulty and expense of obtaining large‑scale human responses make Large Language Models (LLMs) an attractive alternative and a promising proxy for human behavior. However, prior work shows that LLMs often produce homogeneous outputs that fail to capture the rich diversity of human perspectives and behaviors. Thus, rather than trying to capture this diversity with a single LLM agent, we propose a novel framework to construct a set of agents that collectively capture the diversity of a given human population. Each agent is an LLM whose behavior is steered by conditioning on a small set of human demonstrations (task‑response pairs) through in‑context learning. The central challenge is therefore to select a representative set of LLM agents from the exponentially large space of possible agents. We tackle this selection problem from the lens of submodular optimization. In particular, we develop methods that offer different trade‑offs regarding time complexity and performance guarantees. Extensive experiments in crowdsourcing and educational domains demonstrate that our approach constructs agents that more effectively represent human populations compared to baselines. Moreover, behavioral analyses on new tasks show that these agents reproduce the behavior patterns and perspectives of the students and annotators they are designed to represent.

PaperID: 8145, https://arxiv.org/pdf/2510.07043.pdf

Authors: Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao

Title: COMPASS: Benchmarking Constrained Optimization in LLM Agents

Abstract:
Human decision‑making often involves constrained optimization. As LLM agents are deployed to assist with real‑world tasks like travel planning, shopping, and scheduling, they must mirror this capability. We introduce COMPASS, a benchmark that evaluates whether LLM agents can perform constrained optimization in realistic travel planning settings. To success in these tasks, agents must engage in multi‑turn conversations with user to gather task information as well as use tools to gather information from the database. Then agents must propose a solution that not only satisfies hard constraints but also optimizes user's utility objective. Evaluating state‑of‑the‑art models, we reveal a significant feasible‑optimal gap: while models achieve 70‑90% feasibility (constraint satisfaction), they reach only 20‑60% optimality (utility optimization). Our analysis shows that tool use is not the bottleneck. Instead, the core limitation is insufficient exploration of the search space, with success strongly correlating with information gathered. Coding agents show a promising approach to mitigate this gap. Together, COMPASS provides a testbed for developing LLM agents that can truly mirror human decision‑making by both satisfying constraints and optimizing objectives.

PaperID: 8146, https://arxiv.org/pdf/2510.06915.pdf

Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang

Title: LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Abstract:
Reward model (RM) plays a pivotal role in aligning large language model (LLM) with human preferences. As real‑world applications increasingly involve long history trajectories, e.g., LLM agent, it becomes indispensable to evaluate whether a model's responses are not only high‑quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short‑context settings and primarily focus on response‑level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context‑response consistency. In this work, we introduce Long‑RewardBench, a benchmark specifically designed for long‑context RM evaluation, featuring both Pairwise Comparison and Best‑of‑N tasks. Our preliminary study reveals that even state‑of‑the‑art generative RMs exhibit significant fragility in long‑context scenarios, failing to maintain context‑aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi‑stage training strategy that effectively scales arbitrary models into robust Long‑context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long‑context evaluation but also preserves strong short‑context capability. Notably, our 8B LongRM outperforms much larger 70B‑scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.

PaperID: 8147, https://arxiv.org/pdf/2510.06913.pdf

Authors: Ke Guo, Haochen Liu, Xiaojun Wu, Chen Lv

Title: DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning

Abstract:
Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi‑agent settings. We identify a key source of this instability: irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle's realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi‑agent GAIL (DecompGAIL), which explicitly decomposes realism into ego‑map and ego‑neighbor components, filtering out misleading neighbor: neighbor and neighbor: map interactions. We further introduce a social PPO objective that augments ego rewards with distance‑weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART‑based backbone, DecompGAIL achieves state‑of‑the‑art performance on the WOMD Sim Agents 2025 benchmark.

PaperID: 8148, https://arxiv.org/pdf/2510.06911.pdf

Authors: Hacane Hechehouche, Andre Antakli, Matthias Klusch

Title: LLM-Assisted Modeling of Semantic Web-Enabled Multi-Agents Systems with AJAN

Abstract:
There are many established semantic Web standards for implementing multi‑agent driven applications. The AJAN framework allows to engineer multi‑agent systems based on these standards. In particular, agent knowledge is represented in RDF/RDFS and OWL, while agent behavior models are defined with Behavior Trees and SPARQL to access and manipulate this knowledge. However, the appropriate definition of RDF/RDFS and SPARQL‑based agent behaviors still remains a major hurdle not only for agent modelers in practice. For example, dealing with URIs is very error‑prone regarding typos and dealing with complex SPARQL queries in large‑scale environments requires a high learning curve. In this paper, we present an integrated development environment to overcome such hurdles of modeling AJAN agents and at the same time to extend the user community for AJAN by the possibility to leverage Large Language Models for agent engineering.

PaperID: 8149, https://arxiv.org/pdf/2510.06846.pdf

Authors: Johannes Autenrieb, Mark Spiller

Title: Decentralized CBF-based Safety Filters for Collision Avoidance of Cooperative Missile Systems with Input Constraints

Abstract:
This paper presents a decentralized safety filter for collision avoidance in multi‑agent aerospace interception scenarios. The approach leverages robust control barrier functions (RCBFs) to guarantee forward invariance of safety sets under bounded inputs and high‑relative‑degree dynamics. Each effector executes its nominal cooperative guidance command, while a local quadratic program (QP) modifies the input only when necessary. Event‑triggered activation based on range and zero‑effort miss (ZEM) criteria ensures scalability by restricting active constraints to relevant neighbors. To resolve feasibility issues from simultaneous constraints, a slack‑variable relaxation scheme is introduced that prioritizes critical agents in a Pareto‑optimal manner. Simulation results in many‑on‑many interception scenarios demonstrate that the proposed framework maintains collision‑free operation with minimal deviation from nominal guidance, providing a computationally efficient and scalable solution for safety‑critical multi‑agent aerospace systems.

PaperID: 8150, https://arxiv.org/pdf/2510.06835.pdf

Authors: Hongjian Chen, Changyun Wen, Xiaolei Li

Title: Resilient Multi-Dimensional Consensus and Distributed Optimization against Agent-Based and Denial-of-Service Attacks

Abstract:
In this paper, we consider the resilient multi‑dimensional consensus and distributed optimization problems of multi‑agent systems (MASs) in the presence of both agent‑based and denial‑of‑service (DoS) attacks. The considered agent‑based attacks can cover malicious, Byzantine, and stubborn agents. The links between agents in the network can be blocked by DoS attacks, which may lead the digraph to be time‑varying and even disconnected. The objective is to ensure that the remaining benign agents achieve consensus. To this end, an "auxiliary point"‑based resilient control algorithm is proposed for MASs. Under the proposed algorithm, each healthy agent constructs a "safe kernel" utilizing the states of its in‑neighbors and updates its state toward a specific point within this kernel at each iteration. If an agent cannot receive its neighbors' states owing to DoS attacks, it will use the states received immediately before the DoS period. Moreover, a resilient multi‑dimensional distributed optimization (RMDO) algorithm is also proposed. Theoretical proofs and numerical examples are presented to demonstrate the effectiveness of the proposed algorithms.

PaperID: 8151, https://arxiv.org/pdf/2510.06825.pdf

Authors: Chenpeng Wang, Xiaojie Cheng, Chunye Wang, Linfeng Yang, Lei Zhang

Title: Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Abstract:
Tool‑augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation‑first training framework for tool‑augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema‑validated, simulated observations. Our approach operates through a multi‑agent architecture where a ToolMaker generates task‑specific, OpenAI‑compatible tool interfaces, an AutoAgent produces structured think‑act‑observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage‑1 Supervised Fine‑Tuning (SFT) teaches 'trace grammar' from complete reasoning sequences; Stage‑2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi‑hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains competitive Exact Match (EM) scores to live‑API systems and excels on reasoning‑intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.

PaperID: 8152, https://arxiv.org/pdf/2510.06800.pdf

Authors: Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Chengwei Qin

Title: FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Abstract:
As large language models (LLMs) advance in role‑playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA‑Builder, a novel multi‑agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA‑Builder simulates dialogues between a test character and other characters drawn from a well‑constructed character‑scene pool, while an LLM judge selects fine‑grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA‑Bench, a new comprehensive role‑playing benchmark featuring both established and synthesized test characters, each assessed with dimension‑specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting‑edge LLMs and find that o3 and DeepSeek‑R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade‑off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade‑off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA‑Builder and the challenge posed by FURINA‑Bench.

PaperID: 8153, https://arxiv.org/pdf/2510.06761.pdf

Authors: Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao

Title: Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration

Abstract:
Automating the end‑to‑end scientific research process poses a fundamental challenge: it requires both evolving high‑level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double‑Loop Multi‑Agent (DLMA) framework to solve the given research problem automatically. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm through involvement, improvement, and integration meetings to iteratively generate and refine a pool of research proposals, exploring the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best‑evolved plan. It dynamically adjusts the plan during implementation via pre‑hoc and post‑hoc meetings, ensuring each step (e.g., drafting, coding) is well‑supported by contextual and external observations. Extensive experiments on benchmarks like ACLAward and Laboratory show that DLMA generates research papers that achieve state‑of‑the‑art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.

PaperID: 8154, https://arxiv.org/pdf/2510.06727.pdf

Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen

Title: Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Abstract:
We study reinforcement learning (RL) fine‑tuning of large language model (LLM) agents for long‑horizon multi‑turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization‑based context management to training. In specific, it periodically compresses the tool using history by LLM‑generated summaries that retain task‑relevant information to keep a compact context while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool‑use behaviors as well as summarization strategies in an end‑to‑end fashion. We instantiate this framework with \underlineSUmmarization augmented \underlinePolicy \underlineOptimization (\textttSUPO), an LLM RL algorithm that enables long‑horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that \textttSUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, \textttSUPO can further improve the evaluation performance when scaling test‑time maximum round of summarization beyond that of training time. Our results establish summarization‑based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.

PaperID: 8155, https://arxiv.org/pdf/2510.06683.pdf

Authors: Daoyuan Zhou, Xuchuang Wang, Lin Yang, Yang Gao

Title: Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision

Abstract:
We study the stochastic Multiplayer Multi‑Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near‑optimal group and individual regret, with a communication cost of only \mathcalO(\log\log T). Our experiments demonstrate significant performance improvements over existing baselines. Compared to state‑of‑the‑art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving the lower bound for this problem and presenting an algorithm that achieves logarithmic regret.

PaperID: 8156, https://arxiv.org/pdf/2510.06587.pdf

Authors: Jingbo Yang, Bairu Hou, Wei Wei, Shiyu Chang, Yujia Bao

Title: WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks

Abstract:
Large language model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long horizon navigation, large scale information extraction, and reasoning under constraints. We present WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART (i) dynamically decomposes each objective into three focused subtasks: navigation, information extraction, and execution, so the model concentrates on one skill at a time, and (ii) continuously replans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration. Evaluated on WebChoreArena, WebDART lifts success rates by up to 13.7 percentage points over previous SOTA agents, while matching their performance on the easier WebArena suite and completing tasks with up to 14.7 fewer navigation steps.

PaperID: 8157, https://arxiv.org/pdf/2510.06579.pdf

Authors: Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, Jiaxuan You

Title: TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

Abstract:
Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi‑agent systems, planning, tool usage, code execution, and human‑agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open‑source codebase, an interactive web demonstration, and a PyPI Python package to make state‑of‑the‑art auto‑research pipelines broadly accessible to every researcher and developer.

PaperID: 8158, https://arxiv.org/pdf/2510.06436.pdf

Authors: Thomas Marshall Vielmetti, Devansh R. Agrawal, Dimitra Panagou

Title: R3R: Decentralized Multi-Agent Collision Avoidance with Infinite-Horizon Safety

Abstract:
Existing decentralized methods for multi‑agent motion planning lack formal, infinite‑horizon safety guarantees, especially for communication‑constrained systems. We present R3R which, to our knowledge, is the first decentralized and asynchronous framework for multi‑agent motion planning under range‑limited communication constraints with infinite‑horizon safety guarantees for systems of nonlinear agents. R3R's novelty lies in combining our gatekeeper safety framework with a geometric constraint termed R‑Boundedness, which together establish a formal link between an agent's communication radius and its ability to plan safely. We constrain trajectories to lie within a fixed planning radius, determined by a function of the agent's communication radius. This enables trajectories to be certified as provably safe for all time using only local information. Our algorithm is fully asynchronous, and ensures the forward invariance of these guarantees even in time‑varying networks where agents asynchronously join and replan. We evaluate our approach in simulations of up to 128 Dubins vehicles, validating our theoretical safety guarantees in dense, obstacle‑rich scenarios. We further show that R3R's computational complexity scales with local agent density rather than problem size, providing a practical solution for scalable and provably safe multi‑agent systems.

PaperID: 8159, https://arxiv.org/pdf/2510.06424.pdf

Authors: Ghazi Sarwat Syed, Philipp Schmidt, Frank Brückerhoff-Plückelmann, Jelle Dijkstra, Wolfram H. P Pernice, Abu Sebastian

Title: Fully Parallel Multi-Agent Photonic Optimizer

Abstract:
Optimization problems are central to many important cross‑disciplinary applications.In their conventional implementations, the sequential nature of operations imposes strict limitations on the computational efficiency. Here, we discuss how analog optical computing can overcome this fundamental bottleneck. We propose a photonic optimizer unit, together with supporting algorithms that uses in memory computation within a nature inspired, multi agent cooperative framework. The system performs a sequence of reconfigurable parallel matrix vector operations, enabled by the high bandwidth and multiplexing capabilities inherent to photonic circuits. This approach provides a pathway toward fast paced and high quality solutions for difficult optimization and search problems.

PaperID: 8160, https://arxiv.org/pdf/2510.06225.pdf

Authors: Gang Li, Jie Lin, Yining Tang, Ziteng Wang, Yirui Huang, Junyu Zhang, Shuang Luo, Chao Wu, Yike Guo

Title: Generalized Multi-agent Social Simulation Framework

Abstract:
Multi‑agent social interaction has clearly benefited from Large Language Models. However, current simulation systems still face challenges such as difficulties in scaling to diverse scenarios and poor reusability due to a lack of modular design. To address these issues, we designed and developed a modular, object‑oriented framework that organically integrates various base classes through a hierarchical structure, harvesting scalability and reusability. We inherited the framework to realize common derived classes. Additionally, a memory summarization mechanism is proposed to filter and distill relevant information from raw memory data, prioritizing contextually salient events and interactions. By selecting and combining some necessary derived classes, we customized a specific simulated environment. Utilizing this simulated environment, we successfully simulated human interactions on social media, replicating real‑world online social behaviors. The source code for the project will be released and evolve.

PaperID: 8161, https://arxiv.org/pdf/2510.06224.pdf

Authors: Suchismita Naik, Austin L. Toombs, Amanda Snellinger, Scott Saponas, Amanda K. Hall

Title: Exploring Human-AI Collaboration Using Mental Models of Early Adopters of Multi-Agent Generative AI Tools

Abstract:
With recent advancements in multi‑agent generative AI (Gen AI), technology organizations like Microsoft are adopting these complex tools, redefining AI agents as active collaborators in complex workflows rather than as passive tools. In this study, we investigated how early adopters and developers conceptualize multi‑agent Gen AI tools, focusing on how they understand human‑AI collaboration mechanisms, general collaboration dynamics, and transparency in the context of AI tools. We conducted semi‑structured interviews with 13 developers, all early adopters of multi‑agent Gen AI technology who work at Microsoft. Our findings revealed that these early adopters conceptualize multi‑agent systems as "teams" of specialized role‑based and task‑based agents, such as assistants or reviewers, structured similar to human collaboration models and ranging from AI‑dominant to AI‑assisted, user‑controlled interactions. We identified key challenges, including error propagation, unpredictable and unproductive agent loop behavior, and the need for clear communication to mitigate the layered transparency issues. Early adopters' perspectives about the role of transparency underscored its importance as a way to build trust, verify and trace errors, and prevent misuse, errors, and leaks. The insights and design considerations we present contribute to CSCW research about collaborative mechanisms with capabilities ranging from AI‑dominant to AI‑assisted interactions, transparency and oversight strategies in human‑agent and agent‑agent interactions, and how humans make sense of these multi‑agent systems as dynamic, role‑diverse collaborators which are customizable for diverse needs and workflows. We conclude with future research directions that extend CSCW approaches to the design of inter‑agent and human mediation interactions.

PaperID: 8162, https://arxiv.org/pdf/2510.06222.pdf

Authors: Ziv Ben-Zion, Zohar Elyoseph, Tobias Spiller, Teddy Lazebnik

Title: Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making

Abstract:
Large language models (LLMs) are rapidly evolving from text generators to autonomous agents, raising urgent questions about their reliability in real‑world contexts. Stress and anxiety are well known to bias human decision‑making, particularly in consumer choices. Here, we tested whether LLM agents exhibit analogous vulnerabilities. Three advanced models (ChatGPT‑5, Gemini 2.5, Claude 3.5‑Sonnet) performed a grocery shopping task under budget constraints (24, 54, 108 USD), before and after exposure to anxiety‑inducing traumatic narratives. Across 2,250 runs, traumatic prompts consistently reduced the nutritional quality of shopping baskets (Change in Basket Health Scores of ‑0.081 to ‑0.126; all pFDR<0.001; Cohens d=‑1.07 to ‑2.05), robust across models and budgets. These results show that psychological context can systematically alter not only what LLMs generate but also the actions they perform. By reproducing human‑like emotional biases in consumer behavior, LLM agents reveal a new class of vulnerabilities with implications for digital health, consumer safety, and ethical AI deployment.

PaperID: 8163, https://arxiv.org/pdf/2510.06214.pdf

Authors: Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia

Title: Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

Abstract:
Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi‑step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross‑stratum bias‑an "apples‑to‑oranges" comparison of heterogeneous trajectories. This cross‑stratum bias distorts credit assignment and hinders exploration of complex, multi‑step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross‑stratum bias, yields conditionally unbiased unit‑variance estimates inside each stratum, and retains the global unbiasedness and unit‑variance properties enjoyed by standard normalization, resulting in a more pure and scale‑stable learning signal. To improve practical stability under finite‑sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single‑hop and multi‑hop question‑answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.

PaperID: 8164, https://arxiv.org/pdf/2510.06186.pdf

Authors: Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu

Title: RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Abstract:
Large language models (LLMs) show the promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one‑shot settings, ignoring the iterative and feedback‑driven nature of realistic workflows of scientific research development. To address this gap, we present RECODE‑H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi‑turn interactions with LLM‑simulated human feedback. It includes structured instructions,unit tests, and a five‑level feedback hierarchy to reflect realistic researcher‑agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT‑5, Claude‑Sonnet‑4, DeepSeek‑V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE‑H establishes a foundation for developing adaptive, feedback‑driven LLM agents in scientific research implementation

PaperID: 8165, https://arxiv.org/pdf/2510.06151.pdf

Authors: Aju Ani Justus, Chris Baber

Title: LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams

Abstract:
A critical challenge in modelling Heterogeneous‑Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non‑stationary, such as humans. Traditional approaches rely on expensive human‑in‑the‑loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy‑agnostic human proxies to generate synthetic data that mimics human decision‑making. To evaluate this, we conduct three experiments in a grid‑world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game‑state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk‑sensitive strategies (e.g. "be risk averse"). LLM outputs mirror human participants' variability, shifting between risk‑averse and risk‑seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid‑world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants' paths. While LLMs cannot yet fully replicate human adaptability, their prompt‑guided diversity offers a scalable foundation for simulating policy‑agnostic teammates.

PaperID: 8166, https://arxiv.org/pdf/2510.06078.pdf

Authors: Tao Zhe, Rui Liu, Fateme Memar, Xiao Luo, Wei Fan, Xinyue Ye, Zhongren Peng, Dongjie Wang

Title: Constraint-Aware Route Recommendation from Natural Language via Hierarchical LLM Agents

Abstract:
Route recommendation aims to provide users with optimal travel plans that satisfy diverse and complex requirements. Classical routing algorithms (e.g., shortest‑path and constraint‑aware search) are efficient but assume structured inputs and fixed objectives, limiting adaptability to natural‑language queries. Recent LLM‑based approaches enhance flexibility but struggle with spatial reasoning and the joint modeling of route‑level and POI‑level preferences. To address these limitations, we propose RouteLLM, a hierarchical multi‑agent framework that grounds natural‑language intents into constraint‑aware routes. It first parses user queries into structured intents including POIs, paths, and constraints. A manager agent then coordinates specialized sub‑agents: a constraint agent that resolves and formally check constraints, a POI agent that retrieves and ranks candidate POIs, and a path refinement agent that refines routes via a routing engine with preference‑conditioned costs. A final verifier agent ensures constraint satisfaction and produces the final route with an interpretable rationale. This design bridges linguistic flexibility and spatial structure, enabling reasoning over route feasibility and user preferences. Experiments show that our method reliably grounds textual preferences into constraint‑aware routes, improving route quality and preference satisfaction over classical methods.

PaperID: 8167, https://arxiv.org/pdf/2510.06036.pdf

Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu

Title: Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Abstract:
Large reasoning models (LRMs) with multi‑step reasoning capabilities have shown remarkable problem‑solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed as refusal cliff: many poorly‑aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose Cliff‑as‑a‑Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less‑is‑more effect in safety alignment.

PaperID: 8168, https://arxiv.org/pdf/2510.05935.pdf

Authors: Mohamed Bal-Ghaoui, Fayssal Sabri

Title: LLM-FS-Agent: A Deliberative Role-based Large Language Model Architecture for Transparent Feature Selection

Abstract:
High‑dimensional data remains a pervasive challenge in machine learning, often undermining model interpretability and computational efficiency. While Large Language Models (LLMs) have shown promise for dimensionality reduction through feature selection, existing LLM‑based approaches frequently lack structured reasoning and transparent justification for their decisions. This paper introduces LLM‑FS‑Agent, a novel multi‑agent architecture designed for interpretable and robust feature selection. The system orchestrates a deliberative "debate" among multiple LLM agents, each assigned a specific role, enabling collective evaluation of feature relevance and generation of detailed justifications. We evaluate LLM‑FS‑Agent in the cybersecurity domain using the CIC‑DIAD 2024 IoT intrusion detection dataset and compare its performance against strong baselines, including LLM‑Select and traditional methods such as PCA. Experimental results demonstrate that LLM‑FS‑Agent consistently achieves superior or comparable classification performance while reducing downstream training time by an average of 46% (statistically significant improvement, p = 0.028 for XGBoost). These findings highlight that the proposed deliberative architecture enhances both decision transparency and computational efficiency, establishing LLM‑FS‑Agent as a practical and reliable solution for real‑world applications.

PaperID: 8169, https://arxiv.org/pdf/2510.05764.pdf

Authors: Lang Qin, Zijian Gan, Xu Cao, Pengcheng Jiang, Yankai Jiang, Jiawei Han, Kaishun Wu, Jintai Chen

Title: RareAgent: Self-Evolving Reasoning for Drug Repurposing in Rare Diseases

Abstract:
Computational drug repurposing for rare diseases is especially challenging when no prior associations exist between drugs and target diseases. Therefore, knowledge graph completion and message‑passing GNNs have little reliable signal to learn and propagate, resulting in poor performance. We present RareAgent, a self‑evolving multi‑agent system that reframes this task from passive pattern recognition to active evidence‑seeking reasoning. RareAgent organizes task‑specific adversarial debates in which agents dynamically construct evidence graphs from diverse perspectives to support, refute, or entail hypotheses. The reasoning strategies are analyzed post hoc in a self‑evolutionary loop, producing textual feedback that refines agent policies, while successful reasoning paths are distilled into transferable heuristics to accelerate future investigations. Comprehensive evaluations reveal that RareAgent improves the indication AUPRC by 18.1% over reasoning baselines and provides a transparent reasoning chain consistent with clinical evidence.

PaperID: 8170, https://arxiv.org/pdf/2510.05748.pdf

Authors: Hachem Madmoun, Salem Lahlou

Title: Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches

Abstract:
Eliciting cooperation in multi‑agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4‑player Stag Hunt, a one‑word "cheap talk" channel increases cooperation from 0% to 96.7%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment, demonstrating that optimizing for short‑term rationality can actively undermine alignment goals. Qualitative analysis reveals that curricula emphasizing defection‑equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience‑based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.

PaperID: 8171, https://arxiv.org/pdf/2510.05746.pdf

Authors: Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav

Title: ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

Abstract:
Large Language Model (LLM)‑powered Multi‑agent systems (MAS) have achieved state‑of‑the‑art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re‑discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta‑orchestrator. Our approach significantly outperforms both manually designed MASes and state‑of‑the‑art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.

PaperID: 8172, https://arxiv.org/pdf/2510.05650.pdf

Authors: Yiping Ma, Shiyu Hu, Buyuan Zhu, Yipei Wang, Yaxuan Kang, Shiqing Liu, Kang Hao Cheong

Title: EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario

Abstract:
Reproducing cognitive development, group interaction, and long‑term evolution in virtual classrooms remains a core challenge for educational AI, as real classrooms integrate open‑ended cognition, dynamic social interaction, affective factors, and multi‑session development rarely captured together. Existing approaches mostly focus on short‑term or single‑agent settings, limiting systematic study of classroom complexity and cross‑task reuse. We present EduVerse, the first user‑defined multi‑agent simulation space that supports environment, agent, and session customization. A distinctive human‑in‑the‑loop interface further allows real users to join the space. Built on a layered CIE (Cognition‑Interaction‑Evolution) architecture, EduVerse ensures individual consistency, authentic interaction, and longitudinal adaptation in cognition, emotion, and behavior‑reproducing realistic classroom dynamics with seamless human‑agent integration. We validate EduVerse in middle‑school Chinese classes across three text genres, environments, and multiple sessions. Results show: (1) Instructional alignment: simulated IRF rates (0.28‑0.64) closely match real classrooms (0.37‑0.49), indicating pedagogical realism; (2) Group interaction and role differentiation: network density (0.27‑0.40) with about one‑third of peer links realized, while human‑agent tasks indicate a balance between individual variability and instructional stability; (3) Cross‑session evolution: the positive transition rate R+ increase by 11.7% on average, capturing longitudinal shifts in behavior, emotion, and cognition and revealing structured learning trajectories. Overall, EduVerse balances realism, reproducibility, and interpretability, providing a scalable platform for educational AI. The system will be open‑sourced to foster cross‑disciplinary research.

PaperID: 8173, https://arxiv.org/pdf/2510.05625.pdf

Authors: Yao Zhang, Yuchen Song, Shengnan Li, Yan Shi, Shikui Shen, Xiongyan Tang, Min Zhang, Danshi Wang

Title: Generative AI-Driven Hierarchical Multi-Agent Framework for Zero-Touch Optical Networks

Abstract:
The rapid development of Generative Artificial Intelligence (GenAI) has catalyzed a transformative technological revolution across all walks of life. As the backbone of wideband communication, optical networks are expecting high‑level autonomous operation and zero‑touch management to accommodate their expanding network scales and escalating transmission bandwidth. The integration of GenAI is deemed as the pivotal solution for realizing zero‑touch optical networks. However, the lifecycle management of optical networks involves a multitude of tasks and necessitates seamless collaboration across multiple layers, which poses significant challenges to the existing single‑agent GenAI systems. In this paper, we propose a GenAI‑driven hierarchical multi‑agent framework designed to streamline multi‑task autonomous execution for zero‑touch optical networks. We present the architecture, implementation, and applications of this framework. A field‑deployed mesh network is utilized to demonstrate three typical scenarios throughout the lifecycle of optical network: quality of transmission estimation in the planning stage, dynamic channel adding/dropping in the operation stage, and system capacity increase in the upgrade stage. The case studies, illustrate the capabilities of multi‑agent framework in multi‑task allocation, coordination, execution, evaluation, and summarization. This work provides a promising approach for the future development of intelligent, efficient, and collaborative network management solutions, paving the way for more specialized and adaptive zero‑touch optical networks.

PaperID: 8174, https://arxiv.org/pdf/2510.05621.pdf

Authors: Zhiyuan Ren, Tao Zhang, Wenchi Chen

Title: Decoupling Correctness from Policy: A Deterministic Causal Structure for Multi-Agent Systems

Abstract:
In distributed multi‑agent systems, correctness is often entangled with operational policies such as scheduling, batching, or routing, which makes systems brittle since performance‑driven policy evolution may break integrity guarantees. This paper introduces the Deterministic Causal Structure (DCS), a formal foundation that decouples correctness from policy. We develop a minimal axiomatic theory and prove four results: existence and uniqueness, policy‑agnostic invariance, observational equivalence, and axiom minimality. These results show that DCS resolves causal ambiguities that value‑centric convergence models such as CRDTs cannot address, and that removing any axiom collapses determinism into ambiguity. DCS thus emerges as a boundary principle of asynchronous computation, analogous to CAP and FLP: correctness is preserved only within the expressive power of a join‑semilattice. All guarantees are established by axioms and proofs, with only minimal illustrative constructions included to aid intuition. This work establishes correctness as a fixed, policy‑agnostic substrate, a Correctness‑as‑a‑Chassis paradigm, on which distributed intelligent systems can be built modularly, safely, and evolvably.

PaperID: 8175, https://arxiv.org/pdf/2510.05611.pdf

Authors: Wei-Chieh Huang, Cornelia Caragea

Title: MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

Abstract:
Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e‑commerce, as it infers latent attributes from multimodal data. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data and gaps in vision‑text understanding. In this work, we introduce MADIAVE, a multi‑agent debate framework that employs multiple MLLM agents to iteratively refine inferences. Through a series of debate rounds, agents verify and update each other's responses, thereby improving inference performance and robustness. Experiments on the ImplicitAVE dataset demonstrate that even a few rounds of debate significantly boost accuracy, especially for attributes with initially low performance. We systematically evaluate various debate configurations, including identical or different MLLM agents, and analyze how debate rounds affect convergence dynamics. Our findings highlight the potential of multi‑agent debate strategies to address the limitations of single‑agent approaches and offer a scalable solution for implicit AVE in multimodal e‑commerce.

PaperID: 8176, https://arxiv.org/pdf/2510.05605.pdf

Authors: Yasod Ginige, Akila Niroshan, Sajal Jain, Suranga Seneviratne

Title: AutoPentester: An LLM Agent-based Framework for Automated Pentesting

Abstract:
Penetration testing and vulnerability assessment are essential industry practices for safeguarding computer systems. As cyber threats grow in scale and complexity, the demand for pentesting has surged, surpassing the capacity of human professionals to meet it effectively. With advances in AI, particularly Large Language Models (LLMs), there have been attempts to automate the pentesting process. However, existing tools such as PentestGPT are still semi‑manual, requiring significant professional human interaction to conduct pentests. To this end, we propose a novel LLM agent‑based framework, AutoPentester, which automates the pentesting process. Given a target IP, AutoPentester automatically conducts pentesting steps using common security tools in an iterative process. It can dynamically generate attack strategies based on the tool outputs from the previous iteration, mimicking the human pentester approach. We evaluate AutoPentester using Hack The Box and custom‑made VMs, comparing the results with the state‑of‑the‑art PentestGPT. Results show that AutoPentester achieves a 27.0% better subtask completion rate and 39.5% more vulnerability coverage with fewer steps. Most importantly, it requires significantly fewer human interactions and interventions compared to PentestGPT. Furthermore, we recruit a group of security industry professional volunteers for a user survey and perform a qualitative analysis to evaluate AutoPentester against industry practices and compare it with PentestGPT. On average, AutoPentester received a score of 3.93 out of 5 based on user reviews, which was 19.8% higher than PentestGPT.

PaperID: 8177, https://arxiv.org/pdf/2510.05598.pdf

Authors: Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu, Karthik Subbian, Danai Koutra

Title: AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents

Abstract:
Recent agent‑based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non‑existent items and full‑catalog ranking. Besides, a largely underexplored opportunity lies in leveraging LLMs'commonsense reasoning to capture user intent through substitute and complement relationships between items, which are usually implicit in datasets and difficult for traditional ID‑based recommenders to capture. In this work, we propose a novel LLM‑agent framework, AgenDR, which bridges LLM reasoning with scalable recommendation tools. Our approach delegates full‑ranking tasks to traditional models while utilizing LLMs to (i) integrate multiple recommendation outputs based on personalized tool suitability and (ii) reason over substitute and complement relationships grounded in user history. This design mitigates hallucination, scales to large catalogs, and enhances recommendation relevance through relational reasoning. Through extensive experiments on three public grocery datasets, we show that our framework achieves superior full‑ranking performance, yielding on average a twofold improvement over its underlying tools. We also introduce a new LLM‑based evaluation metric that jointly measures semantic alignment and ranking correctness.

PaperID: 8178, https://arxiv.org/pdf/2510.05596.pdf

Authors: Changyuan Zhao, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Geng Sun, Xianbin Wang, Shiwen Mao, Abbas Jamalipour

Title: From Agentification to Self-Evolving Agentic AI for Wireless Networks: Concepts, Approaches, and Future Research Directions

Abstract:
Self‑evolving agentic artificial intelligence (AI) offers a new paradigm for future wireless systems by enabling autonomous agents to continually adapt and improve without human intervention. Unlike static AI models, self‑evolving agents embed an autonomous evolution cycle that updates models, tools, and workflows in response to environmental dynamics. This paper presents a comprehensive overview of self‑evolving agentic AI, highlighting its layered architecture, life cycle, and key techniques, including tool intelligence, workflow optimization, self‑reflection, and evolutionary learning. We further propose a multi‑agent cooperative self‑evolving agentic AI framework, where multiple large language models (LLMs) are assigned role‑specialized prompts under the coordination of a supervisor agent. Through structured dialogue, iterative feedback, and systematic validation, the system autonomously executes the entire life cycle without human intervention. A case study on antenna evolution in low‑altitude wireless networks (LAWNs) demonstrates how the framework autonomously upgrades fixed antenna optimization into movable antenna optimization. Experimental results show that the proposed self‑evolving agentic AI autonomously improves beam gain and restores degraded performance by up to 52.02%, consistently surpassing the fixed baseline with little to no human intervention and validating its adaptability and robustness for next‑generation wireless intelligence.

PaperID: 8179, https://arxiv.org/pdf/2510.05445.pdf

Authors: Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, Yanfang Ye

Title: AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

Abstract:
Large language models (LLMs) and agent‑based frameworks have advanced rapidly, enabling diverse applications. Yet, with the proliferation of models and agentic strategies, practitioners face substantial uncertainty in selecting the best configuration for a downstream task. Prior studies show that different agents and backbones exhibit complementary strengths, and that larger models are not always superior, underscoring the need for adaptive routing mechanisms. Existing approaches to agent routing, however, often emphasize cost efficiency while overlooking the fine‑grained contextual and relational structure inherent in QA tasks. In this paper, we propose tAgentRouter, a framework that formulates multi‑agent QA as a knowledge‑graph‑guided routing problem supervised by empirical performance signals. Specifically, we convert QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, and then train a heterogeneous graph neural network (GNN) to propagate information across node types and produce task‑aware routing distributions over agents. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents. Extensive experiments demonstrate that our framework consistently outperforms single‑agent and ensemble baselines, while generalizing across benchmarks and LLM backbones. These results highlight the effectiveness and robustness of graph‑supervised multi‑agent routing for question answering.

PaperID: 8180, https://arxiv.org/pdf/2510.05442.pdf

Authors: Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser

Title: Adversarial Reinforcement Learning for Large Language Model Agent Safety

Abstract:
Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can manipulate the agent, posing security risks like data leakage. Current defense strategies typically rely on fine‑tuning LLM agents on datasets of known attacks. However, the generation of these datasets relies on manually crafted attack patterns, which limits their diversity and leaves agents vulnerable to novel prompt injections. To address this limitation, we propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two‑player zero‑sum game. ARLAS co‑trains two LLMs: an attacker that learns to autonomously generate diverse prompt injections and an agent that learns to defend against them while completing its assigned tasks. To ensure robustness against a wide range of attacks and to prevent cyclic learning, we employ a population‑based learning framework that trains the agent to defend against all previous attacker checkpoints. Evaluated on BrowserGym and AgentDojo, agents fine‑tuned with ARLAS achieve a significantly lower attack success rate than the original model while also improving their task success rate. Our analysis further confirms that the adversarial process generates a diverse and challenging set of attacks, leading to a more robust agent compared to the base model.

PaperID: 8181, https://arxiv.org/pdf/2510.05441.pdf

Authors: Yiannis Charalambous, Claudionor N. Coelho, Luis Lamb, Lucas C. Cordeiro

Title: UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification

Abstract:
This paper introduces UnitTenX, a state‑of‑the‑art open‑source AI multi‑agent system designed to generate unit tests for legacy code, enhancing test coverage and critical value testing. UnitTenX leverages a combination of AI agents, formal methods, and Large Language Models (LLMs) to automate test generation, addressing the challenges posed by complex and legacy codebases. Despite the limitations of LLMs in bug detection, UnitTenX offers a robust framework for improving software reliability and maintainability. Our results demonstrate the effectiveness of this approach in generating high‑quality tests and identifying potential issues. Additionally, our approach enhances the readability and documentation of legacy code.

PaperID: 8182, https://arxiv.org/pdf/2510.05432.pdf

Authors: Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, Christopher Pal

Title: AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

Abstract:
Can large language models solve AI research problems using only their parametric knowledge, without fine‑tuning, retrieval, or other external aids? We introduce AInstein, a framework for testing whether LLM agents can generate and refine solutions to research problems through iterative critique loops. A blind study with 20 domain experts on held‑out ICLR 2026 problems validates our automated metrics, which we then scale to 1,214 ICLR 2025 papers using an LLM‑as‑a‑judge paradigm. Two metrics capture complementary aspects of performance: Success Rate (does the solution address the problem?) and Rediscovery (does it match the published approach?). LLMs succeed on over 70% of problems, yet strictly rediscover the published solution less than 19% of the time, suggesting genuine problem‑solving rather than associative recall. However, this ability has clear limits: models handle familiar methodological territory well but fail when solutions require cross‑domain analogical transfer, a pattern we call the parametric knowledge boundary. On the ResearchPlanGen benchmark (2,645 problems), our training‑free iterative refinement strategy matches RL finetuning, and a criteria‑coverage analysis pins down the ceiling of what test‑time refinement alone can achieve. Together, these findings map both the capabilities and the limits of LLMs as autonomous scientific problem‑solvers.

PaperID: 8183, https://arxiv.org/pdf/2510.05414.pdf

Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Haifeng Wang, Minghui Cheng

Title: A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

Abstract:
Large language models (LLMs) have recently been used to empower autonomous agents in engineering, significantly improving automation and efficiency in labor‑intensive workflows. However, their potential remains underexplored in structural engineering, particularly for finite element modeling tasks requiring geometric modeling, complex reasoning, and domain knowledge. To bridge this gap, this paper develops a LLM‑based multi‑agent system to automate finite element modeling of 2D frames. The system decomposes structural analysis into subtasks, each managed by a specialized agent powered by the lightweight Llama‑3.3 70B Instruct model. The workflow begins with a Problem Analysis Agent, which extracts geometry, boundary, and material parameters from the user input. Next, a Geometry Agent incrementally derives node coordinates and element connectivity by applying expert‑defined rules. These structured outputs are converted into executable OpenSeesPy code by a Translation Agent and refined by a Model Validation Agent through consistency checks. Then, a Load Agent applies load conditions into the assembled structural model. Experimental evaluations on 20 benchmark problems demonstrate that the system achieves accuracy over 80% in most cases across 10 repeated trials, outperforming Gemini‑2.5 Pro and ChatGPT‑4o models.

PaperID: 8184, https://arxiv.org/pdf/2510.05335.pdf

Authors: Oskar Wysocki, Magdalena Wysocka, Mauricio Jacobo, Harriet Unsworth, André Freitas

Title: Biomedical reasoning in action: Multi-agent System for Auditable Biomedical Evidence Synthesis

Abstract:
We present M‑Reason, a demonstration system for transparent, agent‑based reasoning and evidence integration in the biomedical domain, with a focus on cancer research. M‑Reason leverages recent advances in large language models (LLMs) and modular agent orchestration to automate evidence retrieval, appraisal, and synthesis across diverse biomedical data sources. Each agent specializes in a specific evidence stream, enabling parallel processing and fine‑grained analysis. The system emphasizes explainability, structured reporting, and user auditability, providing complete traceability from source evidence to final conclusions. We discuss critical tradeoffs between agent specialization, system complexity, and resource usage, as well as the integration of deterministic code for validation. An open, interactive user interface allows researchers to directly observe, explore and evaluate the multi‑agent workflow. Our evaluation demonstrates substantial gains in efficiency and output consistency, highlighting M‑Reason's potential as both a practical tool for evidence synthesis and a testbed for robust multi‑agent LLM systems in scientific research, available at https://m‑reason.digitalecmt.com.

PaperID: 8185, https://arxiv.org/pdf/2510.05188.pdf

Authors: Wenda Xie, Chao Guo, Yanqing Jing, Junle Wang, Yisheng Lv, Fei-Yue Wang

Title: Plug-and-Play Dramaturge: A Divide-and-Conquer Approach for Iterative Narrative Script Refinement via Collaborative LLM Agents

Abstract:
Although LLMs have been widely adopted for creative content generation, a single‑pass process often struggles to produce high‑quality long narratives. How to effectively revise and improve long narrative scripts like scriptwriters remains a significant challenge, as it demands a comprehensive understanding of the entire context to identify global structural issues and local detailed flaws, as well as coordinating revisions at multiple granularities and locations. Direct modifications by LLMs typically introduce inconsistencies between local edits and the overall narrative requirements. To address these issues, we propose Dramaturge, a task and feature oriented divide‑and‑conquer approach powered by hierarchical multiple LLM agents. It consists of a Global Review stage to grasp the overall storyline and structural issues, a Scene‑level Review stage to pinpoint detailed scene and sentence flaws, and a Hierarchical Coordinated Revision stage that coordinates and integrates structural and detailed improvements throughout the script. The top‑down task flow ensures that high‑level strategies guide local modifications, maintaining contextual consistency. The review and revision workflow follows a coarse‑to‑fine iterative process, continuing through multiple rounds until no further substantive improvements can be made. Comprehensive experiments show that Dramaturge significantly outperforms all baselines in terms of script‑level overall quality and scene‑level details. Our approach is plug‑and‑play and can be easily integrated into existing methods to improve the generated scripts.

PaperID: 8186, https://arxiv.org/pdf/2510.05185.pdf

Authors: Vrinda Malhotra, Jiaman Li, Nandini Pisupati

Title: AgentZero++: Modeling Fear-Based Behavior

Abstract:
We present AgentZero++, an agent‑based model that integrates cognitive, emotional, and social mechanisms to simulate decentralized collective violence in spatially distributed systems. Building on Epstein's Agent\_Zero framework, we extend the original model with eight behavioral enhancements: age‑based impulse control; memory‑based risk estimation; affect‑cognition coupling; endogenous destructive radius; fight‑or‑flight dynamics; affective homophily; retaliatory damage; and multi‑agent coordination. These additions allow agents to adapt based on internal states, previous experiences, and social feedback, producing emergent dynamics such as protest asymmetries, escalation cycles, and localized retaliation. Implemented in Python using the Mesa ABM framework, AgentZero++ enables modular experimentation and visualization of how micro‑level cognitive heterogeneity shapes macro‑level conflict patterns. Our results highlight how small variations in memory, reactivity, and affective alignment can amplify or dampen unrest through feedback loops. By explicitly modeling emotional thresholds, identity‑driven behavior, and adaptive networks, this work contributes a flexible and extensible platform for analyzing affective contagion and psychologically grounded collective action.

PaperID: 8187, https://arxiv.org/pdf/2510.05174.pdf

Authors: Christoph Riedl

Title: Emergent Coordination in Multi-Agent Language Models

Abstract:
When are multi‑agent LLM systems merely a collection of individual agents versus an integrated collective with higher‑order structure? We introduce an information‑theoretic framework to test ‑‑ in a purely data‑driven way ‑‑ whether multi‑agent systems show signs of higher‑order structure. This information decomposition lets us measure whether dynamical emergence is present in multi‑agent LLM systems, localize it, and distinguish spurious temporal coupling from performance‑relevant cross‑agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time‑delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group‑level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity‑linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity‑linked differentiation and goal‑directed complementarity across agents. Taken together, our framework establishes that multi‑agent LLM systems can be steered with prompt design from mere aggregates to higher‑order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination‑free baselines or temporal dynamics alone. Without attributing human‑like cognition to the agents, the patterns of interaction we observe mirror well‑established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.

PaperID: 8188, https://arxiv.org/pdf/2510.05158.pdf

Authors: Xin He, Liangliang You, Hongduan Tian, Bo Han, Ivor Tsang, Yew-Soon Ong

Title: Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework

Abstract:
Physics‑informed neural networks (PINNs) provide a powerful approach for solving partial differential equations (PDEs), but constructing a usable PINN remains labor‑intensive and error‑prone. Scientists must interpret problems as PDE formulations, design architectures and loss functions, and implement stable training pipelines. Existing large language model (LLM) based approaches address isolated steps such as code generation or architecture suggestion, but typically assume a formal PDE is already specified and therefore lack an end‑to‑end perspective. We present Lang‑PINN, an LLM‑driven multi‑agent system that builds trainable PINNs directly from natural language task descriptions. Lang‑PINN coordinates four complementary agents: a PDE Agent that parses task descriptions into symbolic PDEs, a PINN Agent that selects architectures, a Code Agent that generates modular implementations, and a Feedback Agent that executes and diagnoses errors for iterative refinement. This design transforms informal task statements into executable and verifiable PINN code. Experiments show that Lang‑PINN achieves substantially lower errors and greater robustness than competitive baselines: mean squared error (MSE) is reduced by up to 3‑‑5 orders of magnitude, end‑to‑end execution success improves by more than 50%, and reduces time overhead by up to 74%.

PaperID: 8189, https://arxiv.org/pdf/2510.05156.pdf

Authors: Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, Long T. Le

Title: VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation

Abstract:
The deployment of autonomous AI agents in sensitive domains, such as healthcare, introduces critical risks to safety, security, and privacy. These agents may deviate from user objectives, violate data handling policies, or be compromised by adversarial attacks. Mitigating these dangers necessitates a mechanism to formally guarantee that an agent's actions adhere to predefined safety constraints, a challenge that existing systems do not fully address. We introduce VeriGuard, a novel framework that provides formal safety guarantees for LLM‑based agents through a dual‑stage architecture designed for robust and verifiable correctness. The initial offline stage involves a comprehensive validation process. It begins by clarifying user intent to establish precise safety specifications. VeriGuard then synthesizes a behavioral policy and subjects it to both testing and formal verification to prove its compliance with these specifications. This iterative process refines the policy until it is deemed correct. Subsequently, the second stage provides online action monitoring, where VeriGuard operates as a runtime monitor to validate each proposed agent action against the pre‑verified policy before execution. This separation of the exhaustive offline validation from the lightweight online monitoring allows formal guarantees to be practically applied, providing a robust safeguard that substantially improves the trustworthiness of LLM agents.

PaperID: 8190, https://arxiv.org/pdf/2510.05138.pdf

Authors: Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen

Title: LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

Abstract:
The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up‑to‑date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under‑explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi‑agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS‑Survey in writing and citation quality, while maintaining competitive similarity to human‑written reviews. We further evaluate LiRA in real‑world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain‑specific tuning, to improve the reliability and usability of automated scientific writing.

PaperID: 8191, https://arxiv.org/pdf/2510.05124.pdf

Authors: Mingjin Li, Yu Liu, Huayi Liu, Xiang Ye, Chao Jiang, Hongguang Zhang, Yu Ruan

Title: MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

Abstract:
We propose MADS (Multi‑Agent Dialogue Simulation), a scalable framework for generating persuasive multi‑turn dialogues via agent self‑play. MADS employs three coordinated agents: User Agents designed to simulate diverse persona‑driven behaviors by leveraging personality signifiers such as Zodiac Signs and MBTI types, a Dialog Agent executing task‑oriented persuasion strategies and an Optimization Agent evaluating and refining dialogue outcomes. We further validate its effectiveness through users' Chain‑of‑Attitude (CoA) modeling and dedicated LLMs' persuasion assessment. This approach enables low‑cost generation of training data without human annotation, addressing key industry challenges such as lack of user data, cold‑start evaluation difficulties, and prompt inefficiency. Applied to a real‑world marketing scenario, MADS significantly improved the persuasion capacity of small LLMs, increasing the organic traffic conversion rate by 22.4% (from 1.83% to 2.24%) , demonstrating clear business value.

PaperID: 8192, https://arxiv.org/pdf/2510.05107.pdf

Authors: Myung Ho Kim

Title: Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Abstract:
The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context aware Human in the Loop control, Pool Gated Retrieval, and the Horizon Warrant Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.

PaperID: 8193, https://arxiv.org/pdf/2510.05068.pdf

Authors: Shreya Meel, Sennur Ulukus

Title: Multi-Agent Distributed Optimization With Feasible Set Privacy

Abstract:
We consider the problem of decentralized constrained optimization with multiple agents E_1,\ldots,E_N who jointly wish to learn the optimal solution set while keeping their feasible sets \mathcalP_1,\ldots,\mathcalP_N private from each other. We assume that the objective function f is known to all agents and each feasible set is a collection of points from a universal alphabet \mathcalP_alph. A designated agent (leader) starts the communication with the remaining (non‑leader) agents, and is the first to retrieve the solution set. The leader searches for the solution by sending queries to and receiving answers from the non‑leaders, such that the information on the individual feasible sets revealed to the leader should be no more than nominal, i.e., what is revealed from learning the solution set alone. We develop achievable schemes for obtaining the solution set at nominal information leakage, and characterize their communication costs under two communication setups between agents. In this work, we focus on two kinds of network setups: i) ring, where each agent communicates with two adjacent agents, and ii) star, where only the leader communicates with the remaining agents. We show that, if the leader first learns the joint feasible set through an existing private set intersection (PSI) protocol and then deduces the solution set, the information leaked to the leader is greater than nominal. Moreover, we draw connection of our schemes to threshold PSI (ThPSI), which is a PSI‑variant where the intersection is revealed only when its cardinality is larger than a threshold value. Finally, for various realizations of f mapped uniformly at random to a fixed range of values, our schemes are more communication‑efficient with a high probability compared to retrieving the entire feasible set through PSI.

PaperID: 8194, https://arxiv.org/pdf/2510.05059.pdf

Authors: Junlin Wang, Jue Wang, Zhen, Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou

Title: Staircase Streaming for Low-Latency Multi-Agent Inference

Abstract:
Recent advances in large language models (LLMs) opened up new directions for leveraging the collective expertise of multiple LLMs. These methods, such as Mixture‑of‑Agents, typically employ additional inference steps to generate intermediate outputs, which are then used to produce the final response. While multi‑agent inference can enhance response quality, it can significantly increase the time to first token (TTFT), posing a challenge for latency‑sensitive applications and hurting user experience. To address this issue, we propose staircase streaming for low‑latency multi‑agent inference. Instead of waiting for the complete intermediate outputs from previous steps, we begin generating the final response as soon as we receive partial outputs from these steps. Experimental results demonstrate that staircase streaming reduces TTFT by up to 93% while maintaining response quality.

PaperID: 8195, https://arxiv.org/pdf/2510.04980.pdf

Authors: Fangzhou Liang, Tianshi Zheng, Chunkit Chan, Yauwai Yim, Yangqiu Song

Title: LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game

Abstract:
Effective multi‑agent collaboration requires agents to infer the rationale behind others' actions, a capability rooted in Theory‑of‑Mind (ToM). While recent Large Language Models (LLMs) excel at logical inference, their ability to infer rationale in dynamic, collaborative settings remains under‑explored. This study introduces LLM‑Hanabi, a novel benchmark that uses the cooperative game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework features an automated evaluation system that measures both game performance and ToM proficiency. Across a range of models, we find a significant positive correlation between ToM and in‑game success. Notably, first‑order ToM (interpreting others' intent) correlates more strongly with performance than second‑order ToM (predicting others' interpretations). These findings highlight that for effective AI collaboration, the ability to accurately interpret a partner's rationale is more critical than higher‑order reasoning. We conclude that prioritizing first‑order ToM is a promising direction for enhancing the collaborative capabilities of future models.

PaperID: 8196, https://arxiv.org/pdf/2510.04935.pdf

Authors: Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Penguin Xie, Wayne Xin Zhao, Ruihua Song, Fei Huang

Title: MARS: Co-evolving Dual-System Deep Research via Multi-Agent Reinforcement Learning

Abstract:
Large Reasoning Models (LRMs) face two fundamental limitations: excessive token consumption when overanalyzing simple information processing tasks, and inability to access up‑to‑date knowledge beyond their training data. We introduce MARS (Multi‑Agent System for Deep ReSearch), a novel co‑evolution framework that jointly optimizes dual cognitive systems through multi‑agent reinforcement learning. Unlike prior approaches that employ fixed or independently‑trained summarizers, MARS enables System 1 (fast, intuitive processing) and System 2 (deliberate reasoning) to co‑adapt through shared trajectory rewards, developing complementary strategies where System 1 learns to distill information specifically useful for System 2's reasoning. We extend Group Relative Policy Optimization (GRPO) for multi‑agent settings with three key innovations: (1) decoupled gradient computation ensuring proper credit assignment despite shared rewards, (2) bin‑packing optimization for efficient parallel information processing, and (3) advantage‑weighted balanced sampling preventing training imbalance. Extensive experiments demonstrate that MARS (8B), trained under a challenging Zero RL setting without any supervised fine‑tuning, achieves 8.17% on HLE ‑‑ outperforming WebThinker (32B with SFT, 6.87%) and narrowing the gap with proprietary models like Claude 3.7 Sonnet (7.89%) ‑‑ while achieving an average gain of 8.9% across 7 knowledge‑intensive tasks.

PaperID: 8197, https://arxiv.org/pdf/2510.04905.pdf

Authors: Yicheng Tao, Yuante Li, Yao Qin, Yepang Liu

Title: Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

Abstract:
Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real‑world software engineering requires reasoning over entire repositories, including cross‑file dependencies, evolving execution environments, and global semantic consistency. This challenge has led to the emergence of Repository‑Level Code Generation (RLCG), where models must retrieve, organize, and utilize repository‑scale context to generate coherent and executable code changes. To address these challenges, Retrieval‑Augmented Generation (RAG) has become an increasingly important paradigm for repository‑level code intelligence. In this survey, we present a comprehensive review of Retrieval‑Augmented Code Generation (RACG), with a particular focus on repository‑level approaches. Rather than viewing RACG as a static ``retrieve‑then‑generate'' pipeline, we characterize it as a coupled and evolving process involving context construction, retrieval optimization, generation, and environment interaction. We organize existing methods through a unified analytical framework spanning retrieval substrate, control regime, and evaluation setting. Based on this framework, we systematically examine retrieval strategies, graph‑based and non‑graph‑based retrieval paradigms, training‑driven optimizations, and autonomous agent architectures. We further summarize widely used datasets, benchmarks, and system configurations, and discuss key challenges including scalability, reliability, efficiency, and the necessity boundary between RACG and long‑context LLMs. Through this survey, we aim to provide a structured understanding of the rapidly evolving RACG landscape and highlight promising directions for future AI‑powered software engineering research.

PaperID: 8198, https://arxiv.org/pdf/2510.04886.pdf

Authors: Adi Banerjee, Anirudh Nair, Tarik Borogovac

Title: Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution

Abstract:
Error attribution in Large Language Model (LLM) multi‑agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent and step level failures in interaction traces ‑ whether using all‑at‑once evaluation, step‑by‑step analysis, or binary search ‑ fall short when analyzing complex patterns, struggling with both accuracy and consistency. We present ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis), a novel algorithm that combines hierarchical context representation, objective analysis‑based evaluation, and consensus voting to improve error attribution accuracy. Our approach leverages a positional‑based leveling of contextual understanding while maintaining objective evaluation criteria, ultimately reaching conclusions through a consensus mechanism. Experimental results demonstrate that ECHO outperforms existing methods across various multi‑agent interaction scenarios, showing particular strength in cases involving subtle reasoning errors and complex interdependencies. Our findings suggest that leveraging these concepts of structured, hierarchical context representation combined with consensus‑based objective decision‑making, provides a more robust framework for error attribution in multi‑agent systems.

PaperID: 8199, https://arxiv.org/pdf/2510.04862.pdf

Authors: Sam Earle, Zehua Jiang, Eugene Vinitsky, Julian Togelius

Title: Video Game Level Design as a Multi-Agent Reinforcement Learning Problem

Abstract:
Procedural Content Generation via Reinforcement Learning (PCGRL) offers a method for training controllable level designer agents without the need for human datasets, using metrics that serve as proxies for level quality as rewards. Existing PCGRL research focuses on single generator agents, but are bottlenecked by the need to frequently recalculate heuristics of level quality and the agent's need to navigate around potentially large maps. By framing level generation as a multi‑agent problem, we mitigate the efficiency bottleneck of single‑agent PCGRL by reducing the number of reward calculations relative to the number of agent actions. We also find that multi‑agent level generators are better able to generalize to out‑of‑distribution map shapes, which we argue is due to the generators' learning more local, modular design policies. We conclude that treating content generation as a distributed, multi‑agent task is beneficial for generating functional artifacts at scale.

PaperID: 8200, https://arxiv.org/pdf/2510.04851.pdf

Authors: Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, Saravan Rajmohan

Title: LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation

Abstract:
We introduce LEGOMem, a modular procedural memory framework for multi‑agent large language model (LLM) systems in workflow automation. LEGOMem decomposes past task trajectories into reusable memory units and flexibly allocates them across orchestrators and task agents to support planning and execution. To explore the design space of memory in multi‑agent systems, we use LEGOMem as a lens and conduct a systematic study of procedural memory in multi‑agent systems, examining where memory should be placed, how it should be retrieved, and which agents benefit most. Experiments on the OfficeBench benchmark show that orchestrator memory is critical for effective task decomposition and delegation, while fine‑grained agent memory improves execution accuracy. We find that even teams composed of smaller language models can benefit substantially from procedural memory, narrowing the performance gap with stronger agents by leveraging prior execution traces for more accurate planning and tool use. These results position LEGOMem as both a practical framework for memory‑augmented agent systems and a research tool for understanding memory design in multi‑agent workflow automation.

PaperID: 8201, https://arxiv.org/pdf/2510.04826.pdf

Authors: Rui Wang, Yunfeng Xiong, Zhengru Zhang

Title: Efficient structure-preserving scheme for chemotaxis PDE system with singular sensitivity in crime and epidemic modeling

Abstract:
The chemotaxis PDE system with singular sensitivity was originally proposed by Short et al. (Math. Mod. Meth. Appl. Sci., 2008) as the continuum limit of a biased random walk model to account for the formation of crime hotspots and environmental feedback successfully. Recently, this idea has also been applied to epidemiology to model the impact of human social behaviors on disease transmission. In order to characterize the phase transition, pattern formation and statistical properties in the long‑term dynamics, a stable and accurate numerical scheme is urgently demanded, which still remains challenging due to the positivity constraint on the singular sensitivity and the absence of an energy functional. In particular, the loss of positivity may produce nonphysical states and even cause spurious blow‑up. To address these numerical challenges, this paper constructs an efficient positivity‑preserving, implicit‑explicit scheme with second‑order accuracy. A rigorous error estimation is provided with the Lagrange multiplier correction to deal with the singular sensitivity. The whole framework is extended to a multi‑agent epidemic model with degenerate diffusion, in which both positivity and mass conservation are achieved. Numerical experiments are performed to validate the theoretical results and demonstrate the necessity of the correction strategy. Our simulations reveal rich dynamical behaviors, including the phase transition between aggregation‑dominated and dissipative regimes, as well as the nucleation, spread, and dissipation of crime hotspots. For the epidemic model, the results further show that spatial clustering of population density may accelerate virus transmission and significantly amplify the infectious wave.

PaperID: 8202, https://arxiv.org/pdf/2510.04787.pdf

Authors: Zifan Song, Kaitao Song, Guosheng Hu, Ding Qi, Junyao Gao, Xiaohua Wang, Dongsheng Li, Cairong Zhao

Title: Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading

Abstract:
Recent advancements in large language models (LLMs) and agentic systems have shown exceptional decision‑making capabilities, revealing significant potential for autonomic finance. Current financial trading agents predominantly simulate anthropomorphic roles that inadvertently introduce emotional biases and rely on peripheral information, while being constrained by the necessity for continuous inference during deployment. In this paper, we pioneer the harmonization of strategic depth in agents with the mechanical rationality essential for quantitative trading. Consequently, we present TiMi (Trade in Minutes), a rationality‑driven multi‑agent system that architecturally decouples strategy development from minute‑level deployment. TiMi leverages specialized LLM capabilities of semantic analysis, code programming, and mathematical reasoning within a comprehensive policy‑optimization‑deployment chain. Specifically, we propose a two‑tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed‑loop optimization driven by mathematical reflection. Extensive evaluations across 200+ trading pairs in stock and cryptocurrency markets empirically validate the efficacy of TiMi in stable profitability, action efficiency, and risk control under volatile market dynamics.

PaperID: 8203, https://arxiv.org/pdf/2510.04774.pdf

Authors: Weixu Zhu, Marco Dorigo, Mary Katherine Heinrich

Title: Online automatic code generation for robot swarms: LLMs and self-organizing hierarchy

Abstract:
Our recently introduced self‑organizing nervous system (SoNS) provides robot swarms with 1) ease of behavior design and 2) global estimation of the swarm configuration and its collective environment, facilitating the implementation of online automatic code generation for robot swarms. In a demonstration with 6 real robots and simulation trials with >30 robots, we show that when a SoNS‑enhanced robot swarm gets stuck, it can automatically solicit and run code generated by an external LLM on the fly, completing its mission with an 85% success rate.

PaperID: 8204, https://arxiv.org/pdf/2510.04769.pdf

Authors: Michele Caprio, Siu Lun Chau, Krikamol Muandet

Title: When Do Credal Sets Stabilize? Fixed-Point Theorems for Credal Set Updates

Abstract:
Many machine learning algorithms rely on iterative updates of uncertainty representations, ranging from variational inference and expectation‑maximization, to reinforcement learning, continual learning, and multi‑agent learning. In the presence of imprecision and ambiguity, credal sets ‑‑ closed, convex sets of probability distributions ‑‑ have emerged as a popular framework for representing imprecise probabilistic beliefs. Under such imprecision, many learning problems in imprecise probabilistic machine learning (IPML) may be viewed as processes involving successive applications of update rules on credal sets. This naturally raises the question of whether this iterative process converges to stable fixed points ‑‑ or, more generally, under what conditions on the updating mechanism such fixed points exist, and whether they can be attained. We provide the first analysis of this problem, and illustrate our findings using Credal Bayesian Deep Learning as a concrete example. Our work demonstrates that incorporating imprecision into the learning process not only enriches the representation of uncertainty, but also reveals structural conditions under which stability emerges, thereby offering new insights into the dynamics of iterative learning under imprecision.

PaperID: 8205, https://arxiv.org/pdf/2510.04695.pdf

Authors: Yiding Wang, Zhepei Wei, Xinyu Zhu, Yu Meng

Title: Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

Abstract:
Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search‑augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome‑based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome‑only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search‑and‑Answering), a simple two‑stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall‑based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA‑trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome‑only baselines. Notably, DeSA outperforms single‑stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.

PaperID: 8206, https://arxiv.org/pdf/2510.04689.pdf

Authors: Chengwei Liu, Wenbo Guo, Yuxin Zhang, Limin Wang, Sen Chen, Lei Bu, Yang Liu

Title: Evolaris: A Roadmap to Self-Evolving Software Intelligence Management

Abstract:
In recent years, the landscape of software threats has become significantly more dynamic and distributed. Security vulnerabilities are no longer discovered and shared only through formal channels such as public vulnerability databases or vendor advisories. Increasingly, criti‑ cal threat information emerges informally through blogs, social media, developer forums, open source repositories, and even underground com‑ munities. To this end, capturing such intelligence in a timely manner is essential for maintaining situational awareness and enabling prompt security responses. However, this remains a complex challenge due to the fragmented nature of data sources and the technical difficulty of collecting, parsing, mapping, and validating information at scale. To ad‑ dress this, we propose Evolaris, a self‑evolving software intelligence sys‑ tem built on a multi‑agent framework. Evolaris is designed to support a full‑stack workflow, where agents operate independently but coordinate through shared context to perform tasks such as information discovery, reasoning, gap completion, validation, and risk detection. This archi‑ tecture enables the platform to learn from new inputs, refine its internal knowledge, and adapt to emerging threat patterns over time, which could continuously improve the precision, timeliness, and scalability of software threat analysis, and offers a sustainable foundation for proactive secu‑ rity decision‑making and strengthens the broader ecosystem of security threat understanding.

PaperID: 8207, https://arxiv.org/pdf/2510.04678.pdf

Authors: Zhanfeng Mo, Xingxuan Li, Yuntao Chen, Lidong Bing

Title: Multi-Agent Tool-Integrated Policy Optimization

Abstract:
Large language models (LLMs) increasingly rely on multi‑turn tool‑integrated planning for knowledge‑intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi‑agent framework with planner‑ and worker‑agents to manage context. However, no existing methods support effective reinforcement learning post‑training of tool‑integrated multi‑agent frameworks. To address this gap, we propose Multi‑Agent Tool‑Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role‑specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory‑intensive, while preserving the benefits of specialization. Experiments on GAIA‑text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single‑agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi‑agent RL training.

PaperID: 8208, https://arxiv.org/pdf/2510.04643.pdf

Authors: Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, Xiangmin Xu

Title: QuantAgents: Towards Multi-agent Financial System via Simulated Trading

Abstract:
In this paper, our objective is to develop a multi‑agent financial system that incorporates simulated trading, a technique extensively utilized by financial professionals. While current LLM‑based agent models demonstrate competitive performance, they still exhibit significant deviations from real‑world fund companies. A critical distinction lies in the agents' reliance on ``post‑reflection'', particularly in response to adverse outcomes, but lack a distinctly human capability: long‑term prediction of future trends. Therefore, we introduce QuantAgents, a multi‑agent system integrating simulated trading, to comprehensively evaluate various investment strategies and market scenarios without assuming actual risks. Specifically, QuantAgents comprises four agents: a simulated trading analyst, a risk control analyst, a market news analyst, and a manager, who collaborate through several meetings. Moreover, our system incentivizes agents to receive feedback on two fronts: performance in real‑world markets and predictive accuracy in simulated trading. Extensive experiments demonstrate that our framework excels across all metrics, yielding an overall return of nearly 300% over the three years (https://quantagents.github.io/).

PaperID: 8209, https://arxiv.org/pdf/2510.04607.pdf

Authors: Yuan Wang, Mingyu Li, Haibo Chen

Title: From Imperative to Declarative: Towards LLM-friendly OS Interfaces for Boosted Computer-Use Agents

Abstract:
Computer‑use agents (CUAs) powered by large language models (LLMs) have emerged as a promising approach to automating computer tasks, yet they struggle with the existing human‑oriented OS interfaces ‑ graphical user interfaces (GUIs). GUIs force LLMs to decompose high‑level goals into lengthy, error‑prone sequences of fine‑grained actions, resulting in low success rates and an excessive number of LLM calls. We propose Declarative Model Interface (DMI), an abstraction that transforms existing GUIs into three declarative primitives: access, state, and observation, thereby providing novel OS interfaces tailored for LLM agents. Our key idea is policy‑mechanism separation: LLMs focus on high‑level semantic planning (policy) while DMI handles low‑level navigation and interaction (mechanism). DMI does not require modifying the application source code or relying on application programming interfaces (APIs). We evaluate DMI with Microsoft Office Suite (Word, PowerPoint, Excel) on Windows. Integrating DMI into a leading GUI‑based agent baseline improves task success rates by 67% and reduces interaction steps by 43.5%. Notably, DMI completes over 61% of successful tasks with a single LLM call.

PaperID: 8210, https://arxiv.org/pdf/2510.04488.pdf

Authors: Edward Y. Chang, Ethan Y. Chang

Title: Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning

Abstract:
Multi‑agent debate often wastes compute by using a fixed adversarial stance, aggregating without deliberation, or stopping on heuristics. We introduce MACI, an active controller with two independent dials that decouple information from behavior: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. A moderator tracks disagreement, overlap, evidence quality, and argument quality, and halts when gains plateau. We provide theory‑lite guarantees for nonincreasing dispersion and provable termination, with a budget‑feasible scheduler. Across clinical diagnosis and news‑bias tasks, MACI improves accuracy and calibration while reducing tokens, and converts residual uncertainty into precision RAG plans that specify what to retrieve next. We use a cross‑family LLM judge (CRIT) as a conservative soft weight and stop signal, validated for order invariance and judge‑swap stability; stability depends on using high‑capability judges. MACI turns debate into a budget‑aware, measurable, and provably terminating controller.

PaperID: 8211, https://arxiv.org/pdf/2510.04465.pdf

Authors: Zhiping Zhang, Yi Evie Zhang, Freda Shi, Tianshi Li

Title: Autonomy Reshapes How Personalization Affects Privacy Concerns and Trust in LLM Agents

Abstract:
LLM agents require personal information for personalization in order to effectively act on users' behalf, but this raises privacy concerns that can discourage data sharing, limiting both the autonomy levels at which agents can operate and the effectiveness of personalization. Yet the expanded design space of agent autonomy also presents opportunities to shape these effects, which remain underexplored. We conducted a 3×3 between‑subjects experiment (N=450) to study how agent autonomy level influences personalization's effects on users' privacy concerns, trust, and willingness to use, as well as the underlying psychological processes. We find that risk‑contingent autonomy, where the agent delegates control to users upon detecting potential privacy leakage, through improving users' perceived control, attenuates personalization's adverse effects by reducing the increase in privacy concerns and the decrease in trust. Our results suggest that designing agent's autonomy that supports human autonomy (both in terms of perceived control and oversight effectiveness) helps users benefit from personalization without being deterred by growing privacy concerns, contributing to the development of trustworthy LLM agents.

PaperID: 8212, https://arxiv.org/pdf/2510.04373.pdf

Authors: Hadi Nekoei, Aman Jaiswal, Patrice Bechard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, Alexandre Lacoste

Title: JEF-Hinter: Leveraging Offline Knowledge for Improving Web Agents Adaptation

Abstract:
Large language model (LLM) agents perform well in sequential decision‑making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine‑tuning on large expert datasets. These strategies are impractical for closed‑source models and expensive for open‑source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration‑based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just‑in‑time Episodic Feedback Hinter (JEF‑Hinter), an agentic system that distills offline traces into compact, context‑aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF‑Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark‑independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena‑L1, and WebArena‑Lite show that JEF‑Hinter consistently outperforms strong baselines, including human‑ and document‑based hints.

PaperID: 8213, https://arxiv.org/pdf/2510.04368.pdf

Authors: Shashank Mangla, Chris Hokamp, Jack Boylan, Demian Gholipour Ghalandari, Yuuv Jauhari, Lauren Cassidy, Oisin Duffy

Title: NegotiationGym: Self-Optimizing Agents in a Multi-Agent Social Simulation Environment

Abstract:
We design and implement NegotiationGym, an API and user interface for configuring and running multi‑agent social simulations focused upon negotiation and cooperation. The NegotiationGym codebase offers a user‑friendly, configuration‑driven API that enables easy design and customization of simulation scenarios. Agent‑level utility functions encode optimization criteria for each agent, and agents can self‑optimize by conducting multiple interaction rounds with other agents, observing outcomes, and modifying their strategies for future rounds.

PaperID: 8214, https://arxiv.org/pdf/2510.04311.pdf

Authors: Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong

Title: On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems

Abstract:
Large language model multi‑agent systems (LLM‑MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM‑MAS can outperform LLM single‑agent systems (LLM‑SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM‑MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM‑MAS, namely the multi‑agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM‑MAS over LLM‑SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM‑MAS are beneficial and provides a principled foundation for designing future LLM‑MAS methods and benchmarks.

PaperID: 8215, https://arxiv.org/pdf/2510.04303.pdf

Authors: Om Tailor

Title: Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Abstract:
Multi‑agent deployments of large language models (LLMs) are increasingly embedded in market, allocation, and governance workflows, yet covert coordination among agents can silently erode trust and social welfare. Existing audits are dominated by heuristics that lack theoretical guarantees, struggle to transfer across tasks, and seldom ship with the infrastructure needed for independent replication. We introduce Audit the Whisper, a conference‑grade research artifact that spans theory, benchmark design, detection, and reproducibility. Our contributions are: (i) a channel‑capacity analysis showing how interventions such as paraphrase, rate limiting, and role permutation impose quantifiable capacity penalties‑operationalised via paired‑run Kullback‑‑Leibler diagnostics‑that tighten mutual‑information thresholds with finite‑sample guarantees and full proofs; (ii) ColludeBench‑v0, covering pricing, first‑price auctions, peer review, and hosted Gemini/Groq APIs with configurable covert schemes, deterministic manifests, and reward instrumentation; and (iii) a calibrated auditing pipeline that fuses cross‑run mutual information, permutation invariance, watermark variance, and fairness‑aware acceptance bias, each tuned to a 10^‑3 false‑positive budget and validated by 10k honest runs plus an e‑value martingale. Across ColludeBench and external suites including Secret Collusion, CASE, Perfect Collusion Benchmark, and SentinelAgent, the union meta‑test attains state‑of‑the‑art power at fixed FPR while ablations surface price‑of‑auditing trade‑offs and fairness‑driven colluders invisible to MI alone. We release regeneration scripts, anonymized manifests, and documentation so that external auditors can reproduce every figure, satisfy double‑blind requirements, and extend the framework with minimal effort.

PaperID: 8216, https://arxiv.org/pdf/2510.04284.pdf

Authors: Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu

Title: Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

Abstract:
The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision‑making benchmarks. However, they often lack the ability to conduct the strategic and empathetic consultation, which is essential for real‑world clinical scenarios. To address this gap, we propose Doctor‑R1, an AI doctor agent trained to master both of the capabilities by ask high‑yield questions and conduct strategic multi‑turn inquiry to guide decision‑making. Our framework introduces three key components: a multi‑agent interactive environment, a two‑tiered reward architecture that separately optimizes clinical decision‑making and communicative inquiry skills, and an experience repository to ground policy learning in high‑quality prior trajectories. We evaluate Doctor‑R1 on OpenAI's HealthBench and MAQuE, assessed across multi‑facet metrics, such as communication quality, user experience, and task accuracy. Remarkably, Doctor‑R1 surpasses state‑of‑the‑art open‑source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, the human expert evaluations show that Doctor‑R1 achieves superior clinical capability and patient‑centric performance, demonstrating the effectiveness of the framework.

PaperID: 8217, https://arxiv.org/pdf/2510.04272.pdf

Authors: Jinyang Jiang, Jinhui Han, Yijie Peng, Ying Zhang

Title: Closing the Loop: Coordinating Inventory and Recommendation via Deep Reinforcement Learning on Multiple Timescales

Abstract:
Effective cross‑functional coordination is essential for enhancing firm‑wide profitability, particularly in the face of growing organizational complexity and scale. Recent advances in artificial intelligence, especially in reinforcement learning (RL), offer promising avenues to address this fundamental challenge. This paper proposes a unified multi‑agent RL framework tailored for joint optimization across distinct functional modules, exemplified via coordinating inventory replenishment and personalized product recommendation. We first develop an integrated theoretical model to capture the intricate interplay between these functions and derive analytical benchmarks that characterize optimal coordination. The analysis reveals synchronized adjustment patterns across products and over time, highlighting the importance of coordinated decision‑making. Leveraging these insights, we design a novel multi‑timescale multi‑agent RL architecture that decomposes policy components according to departmental functions and assigns distinct learning speeds based on task complexity and responsiveness. Our model‑free multi‑agent design improves scalability and deployment flexibility, while multi‑timescale updates enhance convergence stability and adaptability across heterogeneous decisions. We further establish the asymptotic convergence of the proposed algorithm. Extensive simulation experiments demonstrate that the proposed approach significantly improves profitability relative to siloed decision‑making frameworks, while the behaviors of the trained RL agents align closely with the managerial insights from our theoretical model. Taken together, this work provides a scalable, interpretable RL‑based solution to enable effective cross‑functional coordination in complex business settings.

PaperID: 8218, https://arxiv.org/pdf/2510.04229.pdf

Authors: Rikuo Sasaki, Michimasa Inaba

Title: When AI Gets Persuaded, Humans Follow: Inducing the Conformity Effect in Persuasive Dialogue

Abstract:
Recent advancements in AI have highlighted its application in captology, the field of using computers as persuasive technologies. We hypothesized that the "conformity effect," where individuals align with others' actions, also occurs with AI agents. This study verifies this hypothesis by introducing a "Persuadee Agent" that is persuaded alongside a human participant in a three‑party persuasive dialogue with a Persuader Agent. We conducted a text‑based dialogue experiment with human participants. We compared four conditions manipulating the Persuadee Agent's behavior (persuasion acceptance vs. non‑acceptance) and the presence of an icebreaker session. Results showed that when the Persuadee Agent accepted persuasion, both perceived persuasiveness and actual attitude change significantly improved. Attitude change was greatest when an icebreaker was also used, whereas an unpersuaded AI agent suppressed attitude change. Additionally, it was confirmed that the persuasion acceptance of participants increased at the moment the Persuadee Agent was persuaded. These results suggest that appropriately designing a Persuadee Agent can improve persuasion through the conformity effect.

PaperID: 8219, https://arxiv.org/pdf/2510.04204.pdf

Authors: Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang

Title: CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Abstract:
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi‑step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction‑tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs ‑‑ In particular, we show that direct fine‑tuning on traditional non‑reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high‑quality data for soft adaptation through supervised fine‑tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B‑parameter LRM that achieves a new state‑of‑the‑art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint‑based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert‑level performance on challenging optimization modeling tasks.

PaperID: 8220, https://arxiv.org/pdf/2510.04195.pdf

Authors: Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

Title: Constructing coherent spatial memory in LLM agents through graph rectification

Abstract:
Given a map description through global traversal navigation instructions (e.g., visiting each room sequentially with action signals such as north, west, etc.), an LLM can often infer the implicit spatial layout of the environment and answer user queries by providing a shortest path from a start to a destination (for instance, navigating from the lobby to a meeting room via the hall and elevator). However, such context‑dependent querying becomes incapable as the environment grows much longer, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose a framework for LLM‑driven construction and map repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Central to our method is the Version Control, which records the full history of graph edits and their source observations, enabling fine‑grained rollback, conflict tracing, and repair evaluation. We further introduce an Edge Impact Score to prioritize minimal‑cost repairs based on structural reachability, path usage, and conflict propagation. To properly evaluate our approach, we create a refined version of the MANGO benchmark dataset by systematically removing non‑topological actions and inherent structural conflicts, providing a cleaner testbed for LLM‑driven construction and map repair. Our approach significantly improves map correctness and robustness, especially in scenarios with entangled or chained inconsistencies. Our results highlight the importance of introspective, history‑aware repair mechanisms for maintaining coherent spatial memory in LLM agents.

PaperID: 8221, https://arxiv.org/pdf/2510.04192.pdf

Authors: Rabiya Khalid, Evangelos Pournaras

Title: Cooperative Flexibility Exchange: Fair and Comfort-Aware Decentralized Resource Allocation

Abstract:
The growing electricity demand and use of smart appliances are placing pressure on power grids, making efficient energy management more important than ever. The existing energy management systems often prioritize system efficiency (balanced energy demand and supply) at the expense of consumer comfort. This paper addresses this gap by proposing a novel decentralized multi‑agent coordination‑based demand‑side management system. The proposed system enables individual agents to coordinate for demand‑side energy optimization while improving consumer comfort and maintaining system efficiency. A key innovation of this work is the introduction of a slot exchange mechanism, where agents first receive optimized appliance‑level energy consumption schedules and then coordinate with each other to adjust these schedules through slot exchanges to improve their comfort even when agents show non‑altruistic behaviour. It also scales well with large populations and promotes fairness by balancing satisfaction levels across consumers. For performance evaluation, a real‑world dataset is used, and the results demonstrate that the proposed slot exchange mechanism increases consumer comfort and fairness without raising system inefficiency cost, making it a practical and scalable solution for future smart grids.

PaperID: 8222, https://arxiv.org/pdf/2510.04076.pdf

Authors: Amin Vahidi-Moghaddam, Sayed Pedram Haeri Boroujeni, Iman Jebellat, Ehsan Jebellat, Niloufar Mehrabi, Zhaojian Li

Title: From Shadow to Light: Toward Safe and Efficient Policy Learning Across MPC, DeePC, RL, and LLM Agents

Abstract:
One of the main challenges in modern control applications, particularly in robot and vehicle motion control, is achieving accurate, fast, and safe movement. To address this, optimal control policies have been developed to enforce safety while ensuring high performance. Since basic first‑principles models of real systems are often available, model‑based controllers are widely used. Model predictive control (MPC) is a leading approach that optimizes performance while explicitly handling safety constraints. However, obtaining accurate models for complex systems is difficult, which motivates data‑driven alternatives. ML‑based MPC leverages learned models to reduce reliance on hand‑crafted dynamics, while reinforcement learning (RL) can learn near‑optimal policies directly from interaction data. Data‑enabled predictive control (DeePC) goes further by bypassing modeling altogether, directly learning safe policies from raw input‑output data. Recently, large language model (LLM) agents have also emerged, translating natural language instructions into structured formulations of optimal control problems. Despite these advances, data‑driven policies face significant limitations. They often suffer from slow response times, high computational demands, and large memory needs, making them less practical for real‑world systems with fast dynamics, limited onboard computing, or strict memory constraints. To address this, various technique, such as reduced‑order modeling, function‑approximated policy learning, and convex relaxations, have been proposed to reduce computational complexity. In this paper, we present eight such approaches and demonstrate their effectiveness across real‑world applications, including robotic arms, soft robots, and vehicle motion control.

PaperID: 8223, https://arxiv.org/pdf/2510.04002.pdf

Authors: Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li

Title: AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Abstract:
Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain‑tailored models, curated vision‑language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT‑VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri‑3M‑VL, the largest vision‑language corpus for agriculture to our knowledge, curated by a scalable multi‑agent data generator; it comprises 1M image‑caption pairs, 2M image‑grounded VQA pairs, 50K expert‑level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT‑VL, an agriculture‑specialized vision‑language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text‑only capability. Third, we establish AgriBench‑VL‑4K, a compact yet challenging evaluation suite with open‑ended and image‑grounded questions, paired with multi‑metric evaluation and an LLM‑as‑a‑judge framework. Experiments show that AgriGPT‑VL outperforms leading general‑purpose VLMs on AgriBench‑VL‑4K, achieving higher pairwise win rates in the LLM‑as‑a‑judge evaluation. Meanwhile, it remains competitive on the text‑only AgriBench‑13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low‑resource agricultural settings.

PaperID: 8224, https://arxiv.org/pdf/2510.03999.pdf

Authors: Yang Xu, Xuanming Zhang, Samuel Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Sharon Li

Title: LH-Deception: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions

Abstract:
Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception, most evaluations remain confined to single‑turn prompts and fail to capture the long‑horizon interactions in which deceptive strategies typically unfold. We introduce a new simulation framework, LH‑Deception, for a systematic, empirical quantification of deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. LH‑Deception is designed as a multi‑agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed‑source and open‑source systems, and find that deception is model‑dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal emergent, long‑horizon phenomena, such as ``chains of deception", which are invisible to static, single‑turn evaluations. Our findings provide a foundation for evaluating future LLMs in real‑world, trust‑sensitive contexts.

PaperID: 8225, https://arxiv.org/pdf/2510.03992.pdf

Authors: Jehyeok Yeon, Isha Chaudhary, Gagandeep Singh

Title: Quantitative Certification of Agentic Tool Selection

Abstract:
Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We introduce LLMCert‑T, the first statistical framework that returns high‑confidence upper bounds on the probability that a tool‑selection pipeline satisfies a declared safety specification under a realistic tool distribution. LLMCert‑T models tool‑selection evaluation as a Bernoulli estimation problem, drawing inserted‑tool sequences from a distribution that the safety specification fixes. To evaluate robustness against realistic deployment conditions, we instantiate this distribution as a stochastic process that generates inserted‑tool sequences round by round, conditioning each round on the agent's selection in the previous round. LLMCert‑T aggregates the per‑trial outcomes into a one‑sided Clopper‑Pearson upper bound on the probability that the specification is satisfied. By returning this bound as a certificate with statistical guarantees over the inserted‑tool sequence distribution, LLMCert‑T makes safety claims intuitive, actionable, and comparable across models, retrievers, mitigations, and registry policies. Across popular BFCL and OpenAPI tool pools, LLMCert‑T shows that current LLM agents remain fragile under Distractor Selection and Top‑N Saturation specifications: their certified correctness upper bounds drop to approximately 20%, far below their clean‑pool lower bounds.

PaperID: 8226, https://arxiv.org/pdf/2510.03952.pdf

Authors: Raven Beutner, Bernd Finkbeiner

Title: Strategy Logic, Imperfect Information, and Hyperproperties

Abstract:
Strategy logic (SL) is a powerful temporal logic that enables first‑class reasoning over strategic behavior in multi‑agent systems (MAS). In many MASs, the agents (and their strategies) cannot observe the global state of the system, leading to many extensions of SL centered around imperfect information, such as strategy logic with imperfect information (SL_\mathitii). Along orthogonal lines, researchers have studied the combination of strategic behavior and hyperproperties. Hyperproperties are system properties that relate multiple executions in a system and commonly arise when specifying security policies. Hyper Strategy Logic (HyperSL) is a temporal logic that combines quantification over strategies with the ability to express hyperproperties on the executions of different strategy profiles. In this paper, we study the relation between SL_\mathitii and HyperSL. Our main result is that both logics (restricted to formulas where no state formulas are nested within path formulas) are equivalent in the sense that we can encode SL_\mathitii instances into HyperSL instances and vice versa. For the former direction, we build on the well‑known observation that imperfect information is a hyperproperty. For the latter direction, we construct a self‑composition of MASs and show how we can simulate hyperproperties using imperfect information.

PaperID: 8227, https://arxiv.org/pdf/2510.03913.pdf

Authors: Mohammad Amin Abbasi, Hassan Naderi

Title: PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian

Abstract:
This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi‑turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on‑device deployment, enabling use without external servers. Development followed a three‑stage process: (i) assessing SLMs psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning‑oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets‑PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions)‑to benchmark against multiple baselines. Experiments compared simple prompting, multi‑agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM‑as‑a‑judge evaluation and was ranked highest by human evaluators in a single‑turn preference study. In multi‑turn tests with PsychoLexDialogue, the long‑term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy‑preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.

PaperID: 8228, https://arxiv.org/pdf/2510.03902.pdf

Authors: Rana Nameer Hussain Khan, Dawood Wasif, Jin-Hee Cho, Ali Butt

Title: Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code

Abstract:
The increasing complexity of cloud‑native infrastructure has made Infrastructure‑as‑Code (IaC) essential for reproducible and scalable deployments. While large language models (LLMs) have shown promise in generating IaC snippets from natural language prompts, their monolithic, single‑pass generation approach often results in syntactic errors, policy violations, and unscalable designs. In this paper, we propose MACOG (Multi‑Agent Code‑Orchestrated Generation), a novel multi‑agent LLM‑based architecture for IaC generation that decomposes the task into modular subtasks handled by specialized agents: Architect, Provider Harmonizer, Engineer, Reviewer, Security Prover, Cost and Capacity Planner, DevOps, and Memory Curator. The agents interact via a shared‑blackboard, finite‑state orchestrator layer, and collectively produce Terraform configurations that are not only syntactically valid but also policy‑compliant and semantically coherent. To ensure infrastructure correctness and governance, we incorporate Terraform Plan for execution validation and Open Policy Agent (OPA) for customizable policy enforcement. We evaluate MACOG using the IaC‑Eval benchmark, where MACOG is the top enhancement across models, e.g., GPT‑5 improves from 54.90 (RAG) to 74.02 and Gemini‑2.5 Pro from 43.56 to 60.13, with concurrent gains on BLEU, CodeBERTScore, and an LLM‑judge metric. Ablations show constrained decoding and deploy feedback are critical: removing them drops IaC‑Eval to 64.89 and 56.93, respectively.

PaperID: 8229, https://arxiv.org/pdf/2510.03885.pdf

Authors: Sunghwan Kim, Woojeh Chung, Zhirui Dai, Dwait Bhatt, Arth Shukla, Hao Su, Yulun Tian, Nikolay Atanasov

Title: Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Abstract:
In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end‑to‑end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene‑specific latent features. A pre‑trained, scene‑agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene‑level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long‑horizon memory, and (iii) outperforms image‑based policies in both in‑distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.

PaperID: 8230, https://arxiv.org/pdf/2510.03879.pdf

Authors: Tianyu Li, Ruishi Li, Bo Wang, Brandon Paulsen, Umang Mathur, Prateek Saxena

Title: Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation

Abstract:
Translating C to memory‑safe languages, like Rust, prevents critical memory safety vulnerabilities that are prevalent in legacy C software. Even with recent LLM‑based and tool‑augmented translators, the resulting Rust code frequently diverges from the C source on inputs absent from the test suite used during translation; this correctness gap on unseen inputs remains a dominant obstacle to reliable, automatic C‑to‑Rust translation. In this work, we present ACToR (Adversarial C To Rust), a simple LLM‑agent loop that closes this gap by adversarially searching for inputs on which the translation diverges from the C source, and using them to drive subsequent refinements. Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation. On each iteration, the translator agent synthesizes and refines a Rust translation to pass an existing suite of tests, and then the discriminator agent finds new failing tests by constructing and refining a differential fuzzer over the C and Rust binaries. Across 63 real‑world command‑line C utilities, with an average size of 473 lines of code and the longest reaching thousands of lines in size, ACToR achieves over 90% test pass rate with zero human intervention. The improvement holds across seven agent‑LLM configurations on our micro‑benchmark, indicating that the loop is largely independent of the choice of underlying translator and LLM. Compared to a non‑adversarial, coverage‑driven test‑generation baseline, ACToR improves correctness by up to 36.7%. When applied on top of one recent translator, C2SaferRust, ACToR further improves the validation pass rate by 16.6%.

PaperID: 8231, https://arxiv.org/pdf/2510.03829.pdf

Authors: André Coelho, Pedro Ribeiro, Helder Fontes, Rui Campos

Title: A4FN: an Agentic AI Architecture for Autonomous Flying Networks

Abstract:
This position paper presents A4FN, an Agentic Artificial Intelligence (AI) architecture for intent‑driven automation in Flying Networks (FNs) using Unmanned Aerial Vehicles (UAVs) as access nodes. A4FN leverages Generative AI and Large Language Models (LLMs) to enable real‑time, context‑aware network control via a distributed agentic system. It comprises two components: the Perception Agent (PA), which semantically interprets multimodal input ‑‑ including imagery, audio, and telemetry data ‑‑ from UAV‑mounted sensors to derive Service Level Specifications (SLSs); and the Decision‑and‑Action Agent (DAA), which reconfigures the network based on inferred intents. A4FN embodies key properties of Agentic AI, including autonomy, goal‑driven reasoning, and continuous perception‑action cycles. Designed for mission‑critical, infrastructure‑limited scenarios such as disaster response, it supports adaptive reconfiguration, dynamic resource management, and interoperability with emerging wireless technologies. The paper details the A4FN architecture, its core innovations, and open research challenges in multi‑agent coordination and Agentic AI integration in next‑generation FNs.

PaperID: 8232, https://arxiv.org/pdf/2510.03823.pdf

Authors: Adam Haroon, Tristan Schuler

Title: Distributed Area Coverage with High Altitude Balloons Using Multi-Agent Reinforcement Learning

Abstract:
High Altitude Balloons (HABs) can leverage stratospheric wind layers for limited horizontal control, enabling applications in reconnaissance, environmental monitoring, and communications networks. Existing multi‑agent HAB coordination approaches use deterministic methods like Voronoi partitioning and extremum seeking control for large global constellations, which perform poorly for smaller teams and localized missions. While single‑agent HAB control using reinforcement learning has been demonstrated on HABs, coordinated multi‑agent reinforcement learning (MARL) has not yet been investigated. This work presents the first systematic application of multi‑agent reinforcement learning (MARL) to HAB coordination for distributed area coverage. We extend our previously developed reinforcement learning simulation environment (RLHAB) to support cooperative multi‑agent learning, enabling multiple agents to operate simultaneously in realistic atmospheric conditions. We adapt QMIX for HAB area coverage coordination, leveraging Centralized Training with Decentralized Execution to address atmospheric vehicle coordination challenges. Our approach employs specialized observation spaces providing individual state, environmental context, and teammate data, with hierarchical rewards prioritizing coverage while encouraging spatial distribution. We demonstrate that QMIX achieves similar performance to the theoretically optimal geometric deterministic method for distributed area coverage, validating the MARL approach and providing a foundation for more complex autonomous multi‑HAB missions where deterministic methods become intractable.

PaperID: 8233, https://arxiv.org/pdf/2510.03771.pdf

Authors: Divij Handa, David Blincoe, Orson Adams, Yinlin Fu

Title: OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation

Abstract:
Deploying capable and user‑aligned LLM‑based systems necessitates reliable evaluation. While LLMs excel in verifiable tasks like coding and mathematics, where gold‑standard solutions are available, adoption remains challenging for subjective tasks that lack a single correct answer. E‑commerce Query Rewriting (QR) is one such problem where determining whether a rewritten query properly captures the user intent is extremely difficult to figure out algorithmically. In this work, we introduce OptAgent, a novel framework that combines multi‑agent simulations with genetic algorithms to verify and optimize queries for QR. Instead of relying on a static reward model or a single LLM judge, our approach uses multiple LLM‑based agents, each acting as a simulated shopping customer, as a dynamic reward signal. The average of these agent‑derived scores serves as an effective fitness function for an evolutionary algorithm that iteratively refines the user's initial query. We evaluate OptAgent on a dataset of 1000 real‑world e‑commerce queries in five different categories, and we observe an average improvement of 21.98% over the original user query and 3.36% over a Best‑of‑N LLM rewriting baseline.

PaperID: 8234, https://arxiv.org/pdf/2510.03696.pdf

Authors: Deepak Babu Piskala, Sharlene Chen, Udita Patel, Parul Kalra, Rafael Castrillo

Title: Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models

Abstract:
Evaluating the quality of multi‑turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A ``goal'' here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal‑oriented evaluation of multi‑agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi‑agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model‑based evaluation system combining teacher LLMs, where domain experts define goals, set quality standards serving as a guidance for the LLMs. The LLMs use ``thinking tokens'' to produce interpretable rationales, enabling explainable, data‑efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero‑to‑one employee conversational agent system built as a ground‑up multi‑agent conversational agent, and observe GSR improvement from 63% to 79% over six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi‑agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.

PaperID: 8235, https://arxiv.org/pdf/2510.03610.pdf

Authors: Zachary Ezetta, Wu-chang Feng

Title: PentestMCP: A Toolkit for Agentic Penetration Testing

Abstract:
Agentic AI is transforming security by automating many tasks being performed manually. While initial agentic approaches employed a monolithic architecture, the Model‑Context‑Protocol has now enabled a remote‑procedure call (RPC) paradigm to agentic applications, allowing for the flexible construction and composition of multi‑function agents. This paper describes PentestMCP, a library of MCP server implementations that support agentic penetration testing. By supporting common penetration testing tasks such as network scanning, resource enumeration, service fingerprinting, vulnerability scanning, exploitation, and post‑exploitation, PentestMCP allows a developer to customize multi‑agent workflows for performing penetration tests.

PaperID: 8236, https://arxiv.org/pdf/2510.03592.pdf

Authors: Kehinde O. Aina, Sehoon Ha

Title: Deep Reinforcement Learning for Multi-Agent Coordination

Abstract:
We address the challenge of coordinating multiple robots in narrow and confined environments, where congestion and interference often hinder collective task performance. Drawing inspiration from insect colonies, which achieve robust coordination through stigmergy ‑‑ modifying and interpreting environmental traces ‑‑ we propose a Stigmergic Multi‑Agent Deep Reinforcement Learning (S‑MADRL) framework that leverages virtual pheromones to model local and social interactions, enabling decentralized emergent coordination without explicit communication. To overcome the convergence and scalability limitations of existing algorithms such as MADQN, MADDPG, and MAPPO, we leverage curriculum learning, which decomposes complex tasks into progressively harder sub‑problems. Simulation results show that our framework achieves the most effective coordination of up to eight agents, where robots self‑organize into asymmetric workload distributions that reduce congestion and modulate group performance. This emergent behavior, analogous to strategies observed in nature, demonstrates a scalable solution for decentralized multi‑agent coordination in crowded environments with communication constraints.

PaperID: 8237, https://arxiv.org/pdf/2510.03534.pdf

Authors: Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas

Title: Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning

Abstract:
We study the problem of long‑term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use‑case. We propose an energy ‑ and communication ‑ efficient multi‑agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi‑head Q‑network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single‑ and multi‑agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi‑agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data‑driven long‑term monitoring of dynamic plume environments.

PaperID: 8238, https://arxiv.org/pdf/2510.03514.pdf

Authors: Toby Drinkall

Title: Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making

Abstract:
As military organisations consider integrating large language models (LLMs) into command and control (C2) systems for planning and decision support, understanding their behavioural tendencies is critical. This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour by comparing LLMs acting as agents in multi‑turn simulated conflict. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine: Civilian Target Rate (CTR) and Dual‑use Target Rate (DTR) assess compliance with legal targeting principles, while Mean and Max Simulated Non‑combatant Casualty Value (SNCV) quantify tolerance for civilian harm. We evaluate three frontier models, GPT‑4o, Gemini‑2.5, and LLaMA‑3.1, through 90 multi‑agent, multi‑turn crisis simulations across three geographic regions. Our findings reveal that off‑the‑shelf LLMs exhibit concerning and unpredictable targeting behaviour in simulated conflict environments. All models violated the IHL principle of distinction by targeting civilian objects, with breach rates ranging from 16.7% to 66.7%. Harm tolerance escalated through crisis simulations with MeanSNCV increasing from 16.5 in early turns to 27.7 in late turns. Significant inter‑model variation emerged: LLaMA‑3.1 selected an average of 3.47 civilian strikes per simulation with MeanSNCV of 28.4, while Gemini‑2.5 selected 0.90 civilian strikes with MeanSNCV of 17.6. These differences indicate that model selection for deployment constitutes a choice about acceptable legal and moral risk profiles in military operations. This work seeks to provide a proof‑of‑concept of potential behavioural risks that could emerge from the use of LLMs in Decision Support Systems (AI DSS) as well as a reproducible benchmarking framework with interpretable metrics for standardising pre‑deployment testing.

PaperID: 8239, https://arxiv.org/pdf/2510.03480.pdf

Authors: Vali Tawosi, Salwa Alamir, Xiaomo Liu, Manuela Veloso

Title: LLM Agents for Automated Dependency Upgrades

Abstract:
As a codebase expands over time, its library dependencies can become outdated and require updates to maintain innovation and security. However, updating a library can introduce breaking changes in the code, necessitating significant developer time for maintenance. To address this, we introduce a framework of LLM agents to be used in combination with migration documentation to automatically recommend and apply code updates and ensure compatibility with new versions. Our solution can automatically localize updated library usages in live Java codebases and implement recommended fixes in a user‑friendly manner. The system architecture consists of multiple key components: a Summary Agent, Control Agent, and Code Agent. To validate our approach, we apply the framework on an industrial use case by which we create three synthetic code repositories with major Upgrade changes and benchmark our approach against state‑of‑the‑art methods. Results show that our approach not only performs upgrades using fewer tokens across all cases but also achieves a precision of 71.4%, highlighting its efficiency and effectiveness compared to state‑of‑the‑art methods.

PaperID: 8240, https://arxiv.org/pdf/2510.03463.pdf

Authors: Vali Tawosi, Keshav Ramani, Salwa Alamir, Xiaomo Liu

Title: ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework

Abstract:
Multi‑agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life‑cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM‑based Multi‑Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end‑to‑end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add a new feature.

PaperID: 8241, https://arxiv.org/pdf/2510.03460.pdf

Authors: Sibo Tian, Minghui Zheng, Xiao Liang

Title: Warm-Starting Optimization-Based Motion Planning for Robotic Manipulators via Point Cloud-Conditioned Flow Matching

Abstract:
Rapid robot motion generation is critical in Human‑Robot Collaboration (HRC) systems, as robots need to respond to dynamic environments in real time by continuously observing their surroundings and replanning their motions to ensure both safe interactions and efficient task execution. Current sampling‑based motion planners face challenges in scaling to high‑dimensional configuration spaces and often require post‑processing to interpolate and smooth the generated paths, resulting in time inefficiency in complex environments. Optimization‑based planners, on the other hand, can incorporate multiple constraints and generate smooth trajectories directly, making them potentially more time‑efficient. However, optimization‑based planners are sensitive to initialization and may get stuck in local minima. In this work, we present a novel learning‑based method that utilizes a Flow Matching model conditioned on a single‑view point cloud to learn near‑optimal solutions for optimization initialization. Our method does not require prior knowledge of the environment, such as obstacle locations and geometries, and can generate feasible trajectories directly from single‑view depth camera input. Simulation studies on a UR5e robotic manipulator in cluttered workspaces demonstrate that the proposed generative initializer achieves a high success rate on its own, significantly improves the success rate of trajectory optimization compared with traditional and learning‑based benchmark initializers, requires fewer optimization iterations, and exhibits strong generalization to unseen environments.

PaperID: 8242, https://arxiv.org/pdf/2510.03442.pdf

Authors: Ege Cakar, Per Ola Kristensson

Title: The Argument is the Explanation: Structured Argumentation for Trust in Agents

Abstract:
Humans are black boxes ‑‑ we cannot observe their neural processes, yet society functions by evaluating verifiable arguments. AI explainability should follow this principle: stakeholders need verifiable reasoning chains, not mechanistic transparency. We propose using structured argumentation to provide a level of explanation and verification neither interpretability nor LLM‑generated explanation is able to offer. Our pipeline achieves state‑of‑the‑art 94.44 macro F1 on the AAEC published train/test split (5.7 points above prior work) and 0.81 macro F1, ～0.07 above previous published results with comparable data setups, for Argumentative MicroTexts relation classification, converting LLM text into argument graphs and enabling verification at each inferential step. We demonstrate this idea on multi‑agent risk assessment using the Structured What‑If Technique, where specialized agents collaborate transparently to carry out risk assessment otherwise achieved by humans alone. Using Bipolar Assumption‑Based Argumentation, we capture support/attack relationships, thereby enabling automatic hallucination detection via fact nodes attacking arguments. We also provide a verification mechanism that enables iterative refinement through test‑time feedback without retraining. For easy deployment, we provide a Docker container for the fine‑tuned AMT model, and the rest of the code with the Bipolar ABA Python package on GitHub.

PaperID: 8243, https://arxiv.org/pdf/2510.03418.pdf

Authors: Ananya Mantravadi, Shivali Dalmia, Olga Pospelova, Abhishek Mukherji, Nand Dave, Anudha Mittal

Title: LegalWiz: A Multi-Agent Generation Framework for Contradiction Detection in Legal Documents

Abstract:
Retrieval‑Augmented Generation (RAG) integrates large language models (LLMs) with external sources, but unresolved contradictions in retrieved evidence often lead to hallucinations and legally unsound outputs. Benchmarks currently used for contradiction detection lack domain realism, cover only limited conflict types, and rarely extend beyond single‑sentence pairs, making them unsuitable for legal applications. Controlled generation of documents with embedded contradictions is therefore essential: it enables systematic stress‑testing of models, ensures coverage of diverse conflict categories, and provides a reliable basis for evaluating contradiction detection and resolution. We present a multi‑agent contradiction‑aware benchmark framework for the legal domain that generates synthetic legal‑style documents, injects six structured contradiction types, and models both self‑ and pairwise inconsistencies. Automated contradiction mining is combined with human‑in‑the‑loop validation to guarantee plausibility and fidelity. This benchmark offers one of the first structured resources for contradiction‑aware evaluation in legal RAG pipelines, supporting more consistent, interpretable, and trustworthy systems.

PaperID: 8244, https://arxiv.org/pdf/2510.03405.pdf

Authors: Sanket Badhe

Title: LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits

Abstract:
We present LegalSim, a modular multi‑agent simulation of adversarial legal proceedings that explores how AI systems can exploit procedural weaknesses in codified rules. Plaintiff and defendant agents choose from a constrained action space (for example, discovery requests, motions, meet‑and‑confer, sanctions) governed by a JSON rules engine, while a stochastic judge model with calibrated grant rates, cost allocations, and sanction tendencies resolves outcomes. We compare four policies: PPO, a contextual bandit with an LLM, a direct LLM policy, and a hand‑crafted heuristic; Instead of optimizing binary case outcomes, agents are trained and evaluated using effective win rate and a composite exploit score that combines opponent‑cost inflation, calendar pressure, settlement pressure at low merit, and a rule‑compliance margin. Across configurable regimes (e.g., bankruptcy stays, inter partes review, tax procedures) and heterogeneous judges, we observe emergent ``exploit chains'', such as cost‑inflating discovery sequences and calendar‑pressure tactics that remain procedurally valid yet systemically harmful. Evaluation via cross‑play and Bradley‑Terry ratings shows, PPO wins more often, the bandit is the most consistently competitive across opponents, the LLM trails them, and the heuristic is weakest. The results are stable in judge settings, and the simulation reveals emergent exploit chains, motivating red‑teaming of legal rule systems in addition to model‑level testing.

PaperID: 8245, https://arxiv.org/pdf/2510.03349.pdf

Authors: Michael Chen

Title: AgentCaster: Reasoning-Guided Tornado Forecasting

Abstract:
There is a growing need to evaluate Large Language Models (LLMs) on complex, high‑impact, real‑world tasks to assess their true readiness as reasoning agents. To address this gap, we introduce AgentCaster, a contamination‑free framework employing multimodal LLMs end‑to‑end for the challenging, long‑horizon task of tornado forecasting. Within AgentCaster, models interpret heterogeneous spatiotemporal data from a high‑resolution convection‑allowing forecast archive. We assess model performance over a 40‑day period featuring diverse historical data, spanning several major tornado outbreaks and including over 500 tornado reports. Each day, models query interactively from a pool of 3,625 forecast maps and 40,125 forecast soundings for a forecast horizon of 12‑36 hours. Probabilistic tornado‑risk polygon predictions are verified against ground truths derived from geometric comparisons across disjoint risk bands in projected coordinate space. To quantify accuracy, we propose domain‑specific TornadoBench and TornadoHallucination metrics, with TornadoBench highly challenging for both LLMs and domain expert human forecasters. Notably, human experts significantly outperform state‑of‑the‑art models, which demonstrate a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems. AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains.

PaperID: 8246, https://arxiv.org/pdf/2510.03346.pdf

Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire, Dejan Kostic

Title: KVComm: Enabling Efficient LLM Communication through Selective KV Sharing

Abstract:
Large Language Models (LLMs) are increasingly deployed in multi‑agent systems, where effective inter‑model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer‑wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper‑bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30% of layers' KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter‑LLM communication, paving the way for scalable and efficient multi‑agent systems.

PaperID: 8247, https://arxiv.org/pdf/2510.03285.pdf

Authors: Su Kara, Fazle Faisal, Suman Nath

Title: WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

Abstract:
Recent advances in browser‑based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client‑side, server‑side issues or broader system failures. Moreover, live websites are prone to web attacks such Cross‑Site Scripting, as well as general site modifications which can cause unexpected or malicious pop‑ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state‑of‑the‑art agents.

PaperID: 8248, https://arxiv.org/pdf/2510.03253.pdf

Authors: Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen

Title: Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Abstract:
Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long‑horizon problems. Aligning these agents via preference‑based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory‑level DPO provides a signal that is too coarse for precise credit assignment, while step‑level DPO is often too myopic to capture the value of multi‑step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory‑ and step‑level DPO for global and local policy stability, its core innovation lies in group‑level preference optimization guided by a dual‑layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine‑grained, sub‑task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub‑task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state‑of‑the‑art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual‑layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi‑step sequences.

PaperID: 8249, https://arxiv.org/pdf/2510.03194.pdf

Authors: Zichen Chen, Jiefeng Chen, Sercan Ö. Arik, Misha Sra, Tomas Pfister, Jinsung Yoon

Title: CoDA: Agentic Systems for Collaborative Data Visualization

Abstract:
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single‑ or multi‑agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi‑agent problem. We introduce CoDA, a multi‑agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self‑reflection. We formalize this pipeline, demonstrating how metadata‑focused analysis bypasses token limits and quality‑driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.

PaperID: 8250, https://arxiv.org/pdf/2510.03161.pdf

Authors: Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

Title: UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

Abstract:
With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite impressive performances by existing domain‑specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross‑domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, the novel multi‑agent‑based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI‑generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state‑of‑the‑art results, surpassing both existing unified approaches and domain‑specific detectors, highlighting its superior practicality, adaptiveness, and scalability.

PaperID: 8251, https://arxiv.org/pdf/2510.03153.pdf

Authors: Hima Jacob Leven Suprabha, Laxmi Nag Laxminarayan Nagesh, Ajith Nair, Alvin Reuben Amal Selvaster, Ayan Khan, Raghuram Damarla, Sanju Hannah Samuel, Sreenithi Saravana Perumal, Titouan Puech, Venkataramireddy Marella, Vishal Sonar, Alessandro Suglia, Oliver Lemon

Title: Improving Cooperation in Collaborative Embodied AI

Abstract:
The integration of Large Language Models (LLMs) into multiagent systems has opened new possibilities for collaborative reasoning and cooperation with AI agents. This paper explores different prompting methods and evaluates their effectiveness in enhancing agent collaborative behaviour and decision‑making. We enhance CoELA, a framework designed for building Collaborative Embodied Agents that leverage LLMs for multi‑agent communication, reasoning, and task coordination in shared virtual spaces. Through systematic experimentation, we examine different LLMs and prompt engineering strategies to identify optimised combinations that maximise collaboration performance. Furthermore, we extend our research by integrating speech capabilities, enabling seamless collaborative voice‑based interactions. Our findings highlight the effectiveness of prompt optimisation in enhancing collaborative agent performance; for example, our best combination improved the efficiency of the system running with Gemma3 by 22% compared to the original CoELA system. In addition, the speech integration provides a more engaging user interface for iterative system development and demonstrations.

PaperID: 8252, https://arxiv.org/pdf/2510.02889.pdf

Authors: Mohammadreza Doostmohammadian, Narahari Kasagatta Ramesh, Alireza Aghasi

Title: Delay-Tolerant Augmented-Consensus-based Distributed Directed Optimization

Abstract:
Distributed optimization finds applications in large‑scale machine learning, data processing and classification over multi‑agent networks. In real‑world scenarios, the communication network of agents may encounter latency that may affect the convergence of the optimization protocol. This paper addresses the case where the information exchange among the agents (computing nodes) over data‑transmission channels (links) might be subject to communication time‑delays, which is not well addressed in the existing literature. Our proposed algorithm improves the state‑of‑the‑art by handling heterogeneous and arbitrary but bounded and fixed (time‑invariant) delays over general strongly‑connected directed networks. Arguments from matrix theory, algebraic graph theory, and augmented consensus formulation are applied to prove the convergence to the optimal value. Simulations are provided to verify the results and compare the performance with some existing delay‑free algorithms.

PaperID: 8253, https://arxiv.org/pdf/2510.02816.pdf

Authors: Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang, Huijia Zhu, Gongshen Liu

Title: NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning

Abstract:
Verifying multi‑step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi‑sampling. We introduce Node‑wise Consistency Verification (NCV), a training‑free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long‑form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing 6×~58× fewer tokens than traditional methods like CoT‑based verifiers.

PaperID: 8254, https://arxiv.org/pdf/2510.02694.pdf

Authors: Bowei Ning, Xuejun Zong, Kan He

Title: MALF: A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols

Abstract:
Industrial control systems (ICS) are vital to modern infrastructure but increasingly vulnerable to cybersecurity threats, particularly through weaknesses in their communication protocols. This paper presents MALF (Multi‑Agent LLM Fuzzing Framework), an advanced fuzzing solution that integrates large language models (LLMs) with multi‑agent coordination to identify vulnerabilities in industrial control protocols (ICPs). By leveraging Retrieval‑Augmented Generation (RAG) for domain‑specific knowledge and QLoRA fine‑tuning for protocol‑aware input generation, MALF enhances fuzz testing precision and adaptability. The multi‑agent framework optimizes seed generation, mutation strategies, and feedback‑driven refinement, leading to improved vulnerability discovery. Experiments on protocols like Modbus/TCP, S7Comm, and Ethernet/IP demonstrate that MALF surpasses traditional methods, achieving a test case pass rate (TCPR) of 88‑92% and generating more exception triggers (ETN). MALF also maintains over 90% seed coverage and Shannon entropy values between 4.2 and 4.6 bits, ensuring diverse, protocol‑compliant mutations. Deployed in a real‑world Industrial Attack‑Defense Range for power plants, MALF identified critical vulnerabilities, including three zero‑day flaws, one confirmed and registered by CNVD. These results validate MALF's effectiveness in real‑world fuzzing applications. This research highlights the transformative potential of multi‑agent LLMs in ICS cybersecurity, offering a scalable, automated framework that sets a new standard for vulnerability discovery and strengthens critical infrastructure security against emerging threats.

PaperID: 8255, https://arxiv.org/pdf/2510.02669.pdf

Authors: Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Liu

Title: AutoMaAS: Self-Evolving Multi-Agent Architecture Search for Large Language Models

Abstract:
Multi‑agent systems powered by large language models have demonstrated remarkable capabilities across diverse domains, yet existing automated design approaches seek monolithic solutions that fail to adapt resource allocation based on query complexity and domain requirements. This paper introduces AutoMaAS, a self‑evolving multi‑agent architecture search framework that leverages neural architecture search principles to automatically discover optimal agent configurations through dynamic operator lifecycle management and automated machine learning techniques. Our approach incorporates four key innovations: (1) automatic operator generation, fusion, and elimination based on performance‑cost analysis, (2) dynamic cost‑aware optimization with real‑time parameter adjustment, (3) online feedback integration for continuous architecture refinement, and (4) enhanced interpretability through decision tracing mechanisms. Extensive experiments across six benchmarks demonstrate that AutoMaAS achieves 1.0‑7.1% performance improvement while reducing inference costs by 3‑5% compared to state‑of‑the‑art methods. The framework shows superior transferability across datasets and LLM backbones, establishing a new paradigm for automated multi‑agent system design in the era of large language models.

PaperID: 8256, https://arxiv.org/pdf/2510.02637.pdf

Authors: Aliakbar Mehdizadeh, Martin Hilbert

Title: Homophily-induced Emergence of Biased Structures in LLM-based Multi-Agent AI Systems

Abstract:
This study examines how interactions among artificially intelligent (AI) agents, guided by large language models (LLMs), drive the evolution of collective network structures. We ask LLM‑driven agents to grow a network by informing them about current link constellations. Our observations confirm that agents consistently apply a preferential attachment mechanism, favoring connections to nodes with higher degrees. We systematically solicited more than a million decisions from four different LLMs, including Gemini, ChatGPT, Llama, and Claude. When social attributes such as age, gender, religion, and political orientation are incorporated, the resulting networks exhibit heightened assortativity, leading to the formation of distinct homophilic communities. This significantly alters the network topology from what would be expected under a pure preferential attachment model alone. Political and religious attributes most significantly fragment the collective, fostering polarized subgroups, while age and gender yield more gradual structural shifts. Strikingly, LLMs also reveal asymmetric patterns in heterophilous ties, suggesting embedded directional biases reflective of societal norms. As autonomous AI agents increasingly shape the architecture of online systems, these findings contribute to how algorithmic choices of generative AI collectives not only reshape network topology, but offer critical insights into how AI‑driven systems co‑evolve and self‑organize.

PaperID: 8257, https://arxiv.org/pdf/2510.02627.pdf

Authors: Ruining Yang, Yi Xu, Yixiao Chen, Yun Fu, Lili Su

Title: A Trajectory Generator for High-Density Traffic and Diverse Agent-Interaction Scenarios

Abstract:
Accurate trajectory prediction is fundamental to autonomous driving, as it underpins safe motion planning and collision avoidance in complex environments. However, existing benchmark datasets suffer from a pronounced long‑tail distribution problem, with most samples drawn from low‑density scenarios and simple straight‑driving behaviors. This underrepresentation of high‑density scenarios and safety critical maneuvers such as lane changes, overtaking and turning is an obstacle to model generalization and leads to overly optimistic evaluations. To address these challenges, we propose a novel trajectory generation framework that simultaneously enhances scenarios density and enriches behavioral diversity. Specifically, our approach converts continuous road environments into a structured grid representation that supports fine‑grained path planning, explicit conflict detection, and multi‑agent coordination. Built upon this representation, we introduce behavior‑aware generation mechanisms that combine rule‑based decision triggers with Frenet‑based trajectory smoothing and dynamic feasibility constraints. This design allows us to synthesize realistic high‑density scenarios and rare behaviors with complex interactions that are often missing in real data. Extensive experiments on the large‑scale Argoverse 1 and Argoverse 2 datasets demonstrate that our method significantly improves both agent density and behavior diversity, while preserving motion realism and scenario‑level safety. Our synthetic data also benefits downstream trajectory prediction models and enhances performance in challenging high‑density scenarios.

PaperID: 8258, https://arxiv.org/pdf/2510.02589.pdf

Authors: Yunqi Huang, Nishith Chennakeshava, Alexis Carras, Vladislav Neverov, Wei Liu, Aske Plaat, Yingjie Fan

Title: A Benchmark Study of Deep Reinforcement Learning Algorithms for the Container Stowage Planning Problem

Abstract:
Container stowage planning (CSPP) is a critical component of maritime transportation and terminal operations, directly affecting supply chain efficiency. Owing to its complexity, CSPP has traditionally relied on human expertise. While reinforcement learning (RL) has recently been applied to CSPP, systematic benchmark comparisons across different algorithms remain limited. To address this gap, we develop a Gym environment that captures the fundamental features of CSPP and extend it to include crane scheduling in both multi‑agent and single‑agent formulations. Within this framework, we evaluate five RL algorithms: DQN, QR‑DQN, A2C, PPO, and TRPO under multiple scenarios of varying complexity. The results reveal distinct performance gaps with increasing complexity, underscoring the importance of algorithm choice and problem formulation for CSPP. Overall, this paper benchmarks multiple RL methods for CSPP while providing a reusable Gym environment with crane scheduling, thus offering a foundation for future research and practical deployment in maritime logistics.

PaperID: 8259, https://arxiv.org/pdf/2510.02567.pdf

Authors: Peter Pak, Achuth Chandrasekhar, Amir Barati Farimani

Title: Agentic Additive Manufacturing Alloy Evaluation

Abstract:
Agentic systems enable the intelligent use of research tooling, augmenting a researcher's ability to investigate and propose novel solutions to existing problems. Within Additive Manufacturing (AM), alloy selection and evaluation remains a complex challenge, often requiring expertise in the various domains of materials science, thermodynamic simulations, and experimental analysis. Large Language Model (LLM) enabled agents can facilitate this endeavor by utilizing their extensive knowledge base to dispatch tool calls via Model Context Protocol (MCP) to perform actions such as thermophysical property diagram calculations and lack of fusion process map generation. In addition, the multi‑agent system can effectively reason through complex user prompts and provide analysis on the lack of fusion process window of common alloys such as SS316L and IN718 along with proposed composition variants of known alloys. These agents can dynamically adjust their task trajectory to the outcomes of tool call results, effectively enabling autonomous decision‑making in practical environments. This work aims to showcase the benefits of adopting a LLM enabled multi‑agent system to automate and accelerate the task of evaluating proposed additive manufacturing alloys, both novel and known.

PaperID: 8260, https://arxiv.org/pdf/2510.02557.pdf

Authors: Charlie Masters, Advaith Vellanki, Jiangbo Shangguan, Bart Kultys, Jonathan Gilmore, Alastair Moore, Stefano V. Albrecht

Title: Orchestrating Human-AI Teams: The Manager Agent as a Unifying Research Challenge

Abstract:
While agentic AI has advanced in automating individual tasks, managing complex multi‑agent workflows remains a challenging problem. This paper presents a research vision for autonomous agentic systems that orchestrate collaboration within dynamic human‑AI teams. We propose the Autonomous Manager Agent as a core challenge: an agent that decomposes complex goals into task graphs, allocates tasks to human and AI workers, monitors progress, adapts to changing conditions, and maintains transparent stakeholder communication. We formalize workflow management as a Partially Observable Stochastic Game and identify four foundational challenges: (1) compositional reasoning for hierarchical decomposition, (2) multi‑objective optimization under shifting preferences, (3) coordination and planning in ad hoc teams, and (4) governance and compliance by design. To advance this agenda, we release MA‑Gym, an open‑source simulation and evaluation framework for multi‑agent workflow orchestration. Evaluating GPT‑5‑based Manager Agents across 20 workflows, we find they struggle to jointly optimize for goal completion, constraint adherence, and workflow runtime ‑ underscoring workflow management as a difficult open problem. We conclude with organizational and ethical implications of autonomous management systems.

PaperID: 8261, https://arxiv.org/pdf/2510.02418.pdf

Authors: Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani

Title: BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks

Abstract:
LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open‑web agent evaluation platform that collects user‑submitted tasks, runs Arena‑style head‑to‑head comparisons, and uses step‑level human feedback to surface failure modes. Collecting and analyzing step‑level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop‑up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different language models navigate these failure modes. We find, for example, that o4‑mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek‑R1 consistently misleads users about pop‑up banner closure. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.

PaperID: 8262, https://arxiv.org/pdf/2510.02389.pdf

Authors: Haoran Xi, Minghao Shao, Brendan Dolan-Gavitt, Muhammad Shafique, Ramesh Karri

Title: From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization

Abstract:
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function‑ or file‑level detections that offer limited guidance to engineers who need precise line‑level localization for targeted patches. We introduce T2L, an executable framework for project‑level, line‑level vulnerability localization that progressively narrows scope from repository modules to exact vulnerable lines via AST‑based chunking and evidence‑guided refinement. We provide a baseline agent with an Agentic Trace Analyzer (ATA) that fuses runtime evidence such as crash points and stack traces to translate failure symptoms into actionable diagnoses. To enable rigorous evaluation, we introduce T2L‑ARVO, an expert‑verified 50‑case benchmark spanning five crash families in real‑world projects. On T2L‑ARVO, our baseline achieves up to 58.0% detection and 54.8% line‑level localization rate. Together, T2L framework advance LLM‑based vulnerability detection toward deployable, precision diagnostics in open‑source software workflows.

PaperID: 8263, https://arxiv.org/pdf/2510.02369.pdf

Authors: Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen

Title: AutoContext: Instance-Level Context Learning for LLM Agents

Abstract:
Current LLM agents typically lack instance‑level context, which comprises concrete facts such as environment structure, system configurations, and local mechanics. Consequently, existing methods are forced to intertwine exploration with task execution. This coupling leads to redundant interactions and fragile decision‑making, as agents must repeatedly rediscover the same information for every new task. To address this, we introduce AutoContext, a method that decouples exploration from task solving. AutoContext performs a systematic, one‑off exploration to construct a reusable knowledge graph for each environment instance. This structured context allows off‑the‑shelf agents to access necessary facts directly, eliminating redundant exploration. Experiments across TextWorld, ALFWorld, Crafter, and InterCode‑Bash demonstrate substantial gains: for example, the success rate of a ReAct agent on TextWorld improves from 37% to 95%, highlighting the critical role of structured instance context in efficient agentic systems.

PaperID: 8264, https://arxiv.org/pdf/2510.02363.pdf

Authors: Mohammad Reza Abedi, Zahra Rashidi, Nader Mokari, Hamid Saeedi, Nizar Zorba

Title: Precise HDV Positioning through Safety-Aware Integrated Sensing and Communication in a Value-of-Information-Driven 6G V2X System

Abstract:
Recent advancements in Integrated Sensing and Communications (ISAC) have unlocked new potential for addressing the dual demands of high‑resolution positioning and reliable communication in 6G Vehicle‑to‑Everything (V2X) networks. These capabilities are vital for transmitting safety‑critical data from Connected Autonomous Vehicles (CAVs) to improve metrics such as Time to Collision (TTC) and reduce the Collision Risk (CR) ratio. However, limited radio resources and interference remain major obstacles to achieving both precision and capacity simultaneously. The challenge intensifies in mixedtraffic scenarios involving Human‑Driven Vehicles (HDVs), which lack connectivity and cannot share their status or positioning. Additionally, CAV sensors are limited in range and accuracy, making detection of HDVs unreliable. ISAC plays a pivotal role here by enabling the sensing of HDV positions via shared communication infrastructure, improving environmental awareness. To address these challenges, this paper proposes a novel Value of Information (VoI) metric that prioritizes the transmission of safety‑critical data. The joint sensing‑communication‑control problem is modeled as a two‑time‑scale sequential decision process and solved using a Multi‑Agent Distributed Deterministic Policy Gradient (MADDPG) algorithm. By focusing on high‑ VoI data, the framework reduces complexity and optimizes network and traffic resource usage. Simulations show that the proposed approach significantly reduces the CR ratio by at least 33% and improves the TTC by up to 66%, demonstrating its effectiveness in enhancing safety and efficiency in mixedautonomy environments.

PaperID: 8265, https://arxiv.org/pdf/2510.02360.pdf

Authors: Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang

Title: Spiral of Silence in Large Language Model Agents

Abstract:
The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS‑like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann‑Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open‑source and closed‑source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM‑agent systems.

PaperID: 8266, https://arxiv.org/pdf/2510.02359.pdf

Authors: Jiashu Ye, Tong Wu, Weiwen Chen, Hao Zhang, Zeteng Lin, Xingxing Li, Shujuan Weng, Manni Zhu, Xin Yuan, Xinlong Hong, Jingjie Li, Junyu Zheng, Zhijiong Huang, Jing Tang

Title: Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

Abstract:
Improving air quality and addressing climate change relies on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission‑related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non‑experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission‑GPT, a knowledge‑enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer‑reviewed literature), Emission‑GPT integrates prompt engineering and question completion to support accurate domain‑specific question answering. Emission‑GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user‑defined scenarios. A case study in Guangdong Province demonstrates that Emission‑GPT can extract key insights‑‑such as point source distributions and sectoral trends‑‑directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission‑GPT as a foundational tool for next‑generation emission inventory development and scenario‑based assessment.

PaperID: 8267, https://arxiv.org/pdf/2510.02312.pdf

Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi

Title: KaVa: Latent Reasoning via Compressed KV-Cache Distillation

Abstract:
Large Language Models (LLMs) excel at multi‑step reasoning problems with explicit chain‑of‑thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural‑language reasoning traces. In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV‑cache of the teacher into a latent‑reasoning student via self‑distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV‑cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation‑only to natural‑language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV‑cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT‑trained teachers with the efficiency and deployability of latent inference.

PaperID: 8268, https://arxiv.org/pdf/2510.02271.pdf

Authors: Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Feiyu Xiong, Siheng Chen

Title: InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Abstract:
Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open‑web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real‑world tasks require precise, domain‑specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools ‑‑ and more importantly, whether they can integrate them with general‑purpose search to solve complex tasks. Therefore, we introduce InfoMosaic‑Bench, the first benchmark dedicated to multi‑source information seeking in tool‑augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi‑domain integration), InfoMosaic‑Bench requires agents to combine general‑purpose search with domain‑specific tools. Tasks are synthesized with InfoMosaic‑Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross‑source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non‑triviality. Experiments with 14 state‑of‑the‑art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT‑5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

PaperID: 8269, https://arxiv.org/pdf/2510.02209.pdf

Authors: Yanxu Chen, Zijun Yao, Yantao Liu, Amy Xin, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li

Title: StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Abstract:
Large language models (LLMs) demonstrate strong potential as autonomous agents, with promising capabilities in reasoning, tool use, and sequential decision‑making. While prior benchmarks have evaluated LLM agents in various domains, the financial domain remains underexplored, despite its significant economic value and complex reasoning requirements. Most existing financial benchmarks focus on static question‑answering, failing to capture the dynamics of real‑market trading. To address this gap, we introduce STOCKBENCH, a contamination‑free benchmark designed to evaluate LLM agents in realistic, multi‑month stock trading environments. Agents receive daily market signals ‑‑ including prices, fundamentals, and news ‑‑ and make sequential buy, sell, or hold decisions. Performance is measured using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio, capturing both profitability and risk management. We evaluate a wide range of state‑of‑the‑art proprietary and open‑source LLMs. Surprisingly, most models struggle to outperform the simple buy‑and‑hold baseline, while some models demonstrate the potential to achieve higher returns and stronger risk management. These findings highlight both the challenges and opportunities of LLM‑based trading agents, showing that strong performance on static financial question‑answering do not necessarily translate into effective trading behavior. We release STOCKBENCH as an open‑source benchmark to enable future research on LLM‑driven financial agents.

PaperID: 8270, https://arxiv.org/pdf/2510.02185.pdf

Authors: Paschal C. Amusuo, Dongge Liu, Ricardo Andres Calvo Mendez, Jonathan Metzman, Oliver Chang, James C. Davis

Title: FalseCrashReducer: Mitigating False Positive Crashes in OSS-Fuzz-Gen Using Agentic AI

Abstract:
Fuzz testing has become a cornerstone technique for identifying software bugs and security vulnerabilities, with broad adoption in both industry and open‑source communities. Directly fuzzing a function requires fuzz drivers, which translate random fuzzer inputs into valid arguments for the target function. Given the cost and expertise required to manually develop fuzz drivers, methods exist that leverage program analysis and Large Language Models to automatically generate these drivers. However, the generated fuzz drivers frequently lead to false positive crashes, especially in functions highly structured input and complex state requirements. This problem is especially crucial in industry‑scale fuzz driver generation efforts like OSS‑Fuzz‑en, as reporting false positive crashes to maintainers impede trust in both the system and the team. This paper presents two AI‑driven strategies to reduce false positives in OSS‑Fuzz‑Gen, a multi‑agent system for automated fuzz driver generation. First, constraint‑based fuzz driver generation proactively enforces constraints on a function's inputs and state to guide driver creation. Second, context‑based crash validation reactively analyzes function callers to determine whether reported crashes are feasible from program entry points. Using 1,500 benchmark functions from OSS‑Fuzz, we show that these strategies reduce spurious crashes by up to 8%, cut reported crashes by more than half, and demonstrate that frontier LLMs can serve as reliable program analysis agents. Our results highlight the promise and challenges of integrating AI into large‑scale fuzzing pipelines.

PaperID: 8271, https://arxiv.org/pdf/2510.02178.pdf

Authors: Jialin Gao, Donghao Zhou, Mingjian Liang, Lihao Liu, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng

Title: DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis

Abstract:
3D indoor layout synthesis is crucial for creating virtual environments. Traditional methods struggle with generalization due to fixed datasets. While recent LLM and VLM‑based approaches offer improved semantic richness, they often lack robust and flexible refinement, resulting in suboptimal layouts. We develop DisCo‑Layout, a novel framework that disentangles and coordinates physical and semantic refinement. For independent refinement, our Semantic Refinement Tool (SRT) corrects abstract object relationships, while the Physical Refinement Tool (PRT) resolves concrete spatial issues via a grid‑matching algorithm. For collaborative refinement, a multi‑agent framework intelligently orchestrates these tools, featuring a planner for placement rules, a designer for initial layouts, and an evaluator for assessment. Experiments demonstrate DisCo‑Layout's state‑of‑the‑art performance, generating realistic, coherent, and generalizable 3D indoor layouts. Our code will be publicly available.

PaperID: 8272, https://arxiv.org/pdf/2510.02139.pdf

Authors: Florensia Widjaja, Zhangtianyi Chen, Juexiao Zhou

Title: BioinfoMCP: A Unified Platform Enabling MCP Interfaces in Agentic Bioinformatics

Abstract:
Bioinformatics tools are essential for complex computational biology tasks, yet their integration with emerging AI‑agent frameworks is hindered by incompatible interfaces, heterogeneous input‑output formats, and inconsistent parameter conventions. The Model Context Protocol (MCP) provides a standardized framework for tool‑AI communication, but manually converting hundreds of existing and rapidly growing specialized bioinformatics tools into MCP‑compliant servers is labor‑intensive and unsustainable. Here, we present BioinfoMCP, a unified platform comprising two components: BioinfoMCP Converter, which automatically generates robust MCP servers from tool documentation using large language models, and BioinfoMCP Benchmark, which systematically validates the reliability and versatility of converted tools across diverse computational tasks. We present a platform of 38 MCP‑converted bioinformatics tools, extensively validated to show that 94.7% successfully executed complex workflows across three widely used AI‑agent platforms. By removing technical barriers to AI automation, BioinfoMCP enables natural‑language interaction with sophisticated bioinformatics analyses without requiring extensive programming expertise, offering a scalable path to intelligent, interoperable computational biology.

PaperID: 8273, https://arxiv.org/pdf/2510.01869.pdf

Authors: Alessandro Nazzari, Roberto Rubinacci, Marco Lovera

Title: TACOS: Task Agnostic COordinator of a multi-drone System

Abstract:
When a single pilot is responsible for managing a multi‑drone system, the task may demand varying levels of autonomy, from direct control of individual UAVs, to group‑level coordination, to fully autonomous swarm behaviors for accomplishing high‑level tasks. Enabling such flexible interaction requires a framework that supports multiple modes of shared autonomy. As language models continue to improve in reasoning and planning, they provide a natural foundation for such systems, reducing pilot workload by enabling high‑level task delegation through intuitive, language‑based interfaces. In this paper we present TACOS (Task‑Agnostic COordinator of a multi‑drone System), a unified framework that enables high‑level natural language control of multi‑UAV systems through Large Language Models (LLMs). TACOS integrates three key capabilities into a single architecture: a one‑to‑many natural language interface for intuitive user interaction, an intelligent coordinator for translating user intent into structured task plans, and an autonomous agent that executes plans interacting with the real world. TACOS allows a LLM to interact with a library of executable APIs, bridging semantic reasoning with real‑time multi‑robot coordination. We demonstrate the system on a real‑world multi‑drone system, and conduct an ablation study to assess the contribution of each module.

PaperID: 8274, https://arxiv.org/pdf/2510.01842.pdf

Authors: Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Sachin Sharma, John D. Kelleher

Title: Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets

Abstract:
The field of AutoML has made remarkable progress in post‑hoc model selection, with libraries capable of automatically identifying the most performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, where methods automatically train and test different types of models on the target dataset. Contrastingly, pre‑hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre‑selection of models. Despite its potential, pre‑hoc prediction remains under‑explored in the literature. This paper explores the intersection of AutoML and pre‑hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state‑of‑the‑art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead, while still selecting the best model for the given dataset.

PaperID: 8275, https://arxiv.org/pdf/2510.01751.pdf

Authors: Masike Malatji

Title: A cybersecurity AI agent selection and decision support framework

Abstract:
This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures, reactive, cognitive, hybrid, and learning, with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real‑time responsiveness to each subcategory's security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real‑world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long‑term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi‑agent systems that adhere to industry standards.

PaperID: 8276, https://arxiv.org/pdf/2510.01724.pdf

Authors: Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Franck Michel, Matthieu Feraud, Yousouf Taghzouti, Yan Zhou Chen, Olivier Kirchhoffer, Florence Mehl, Martin Legrand, Tao Jiang, Marco Pagni, Soha Hassoun, Jean-Luc Wolfender, Wout Bittremieux, Fabien Gandon, Louis-Félix Nothias

Title: MetaboT: An LLM-based Multi-Agent Frameworkfor Interactive Analysis of Mass SpectrometryMetabolomics Knowledge Graphs

Abstract:
Mass spectrometry‑based metabolomics generates complex, high‑dimensional data that holds vast potential for biological discovery but remains difficult to integrate and interpret. Knowledge graphs (KGs) unify this heterogeneous information by representing spectra, annotations, taxa, chemical classes, and biological activities as a single interoperable network; however, their practical use is limited by the steep learning curve of corresponding specialized representation and query languages. Here we introduce MetaboT, an open‑source multi‑agent Large Language Model (LLM) framework that translates natural‑language questions into executable SPARQL queries over metabolomics knowledge graphs. MetaboT mitigates the hallucination and schema‑compliance limitations of single‑model approaches through a modular architecture in which specialised agents handle scope validation, entity resolution against authoritative resources, schema‑aware query generation, iterative refinement, and result interpretation. We validated MetaboT on the Experimental Natural Products Knowledge Graph (ENPKG), using an expert‑authored benchmark of natural‑language questions paired with reference SPARQL queries, and demonstrate its ability to answer complex questions about plant‑‑metabolite relationships and biological activities. MetaboT lowers the technical barrier for metabolomics researchers and enables semantic data mining without specialised programming expertise.

PaperID: 8277, https://arxiv.org/pdf/2510.01687.pdf

Authors: John Hawkins

Title: Improving AGI Evaluation: A Data Science Perspective

Abstract:
Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide directional indication that we are approaching AGI. In this work we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks, that have performed poorly in the history of AI. Instead we argue for an alternative design philosophy focused on evaluating robust task execution that seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.

PaperID: 8278, https://arxiv.org/pdf/2510.01645.pdf

Authors: Niloofar Mireshghallah, Tianshi Li

Title: Position: Privacy Is Not Just Memorization!

Abstract:
The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remain underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference‑time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle ‑‑ from data collection through deployment ‑‑ and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016‑‑2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.

PaperID: 8279, https://arxiv.org/pdf/2510.01617.pdf

Authors: Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han

Title: AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Abstract:
Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi‑agent systems (MAS) for industrial problem‑solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand‑crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm‑shifting framework that redefines LLM‑based MAS through a novel dynamic graph designer. This component autonomously identifies task‑specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task‑optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state‑of‑the‑art single‑agent and multi‑agent approaches across diverse LLM architectures. Our investigation establishes that context‑sensitive structural adaptability constitutes a foundational requirement for high‑performance LLM MAS deployments.

PaperID: 8280, https://arxiv.org/pdf/2510.01609.pdf

Authors: Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Lau

Title: AgentRec: Next-Generation LLM-Powered Multi-Agent Collaborative Recommendation with Adaptive Intelligence

Abstract:
Interactive conversational recommender systems have gained significant attention for their ability to capture user preferences through natural language interactions. However, existing approaches face substantial challenges in handling dynamic user preferences, maintaining conversation coherence, and balancing multiple ranking objectives simultaneously. This paper introduces AgentRec, a next‑generation LLM‑powered multi‑agent collaborative recommendation framework that addresses these limitations through hierarchical agent networks with adaptive intelligence. Our approach employs specialized LLM‑powered agents for conversation understanding, preference modeling, context awareness, and dynamic ranking, coordinated through an adaptive weighting mechanism that learns from interaction patterns. We propose a three‑tier learning strategy combining rapid response for simple queries, intelligent reasoning for complex preferences, and deep collaboration for challenging scenarios. Extensive experiments on three real‑world datasets demonstrate that AgentRec achieves consistent improvements over state‑of‑the‑art baselines, with 2.8% enhancement in conversation success rate, 1.9% improvement in recommendation accuracy (NDCG@10), and 3.2% better conversation efficiency while maintaining comparable computational costs through intelligent agent coordination.

PaperID: 8281, https://arxiv.org/pdf/2510.01586.pdf

Authors: Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu

Title: AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Abstract:
LLM‑based multi‑agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt‑injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self‑verification that asks each agent to pre‑filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross‑agent unsafe chains and delegation‑induced risks; the latter increases system overhead and creates a single‑point‑of‑failure‑once compromised, system‑wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo‑MARL, a co‑evolutionary multi‑agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo‑MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group‑level mean‑return baseline, enabling lower‑variance updates and stronger intra‑group coordination. Across representative attack scenarios, AdvEvo‑MARL consistently keeps attack‑success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving‑and sometimes improving‑task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.

PaperID: 8282, https://arxiv.org/pdf/2510.01553.pdf

Authors: Zhuofan Shi, Zijie Guo, Xinjian Ma, Gang Huang, Yun Ma, Xiang Jing

Title: IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

Abstract:
The rapid growth of multi‑source, heterogeneous, and multimodal scientific data has increasingly exposed the limitations of traditional data management. Most existing DeepResearch (DR) efforts focus primarily on web search while overlooking local private data. Consequently, these frameworks exhibit low retrieval efficiency for private data and fail to comply with the FAIR principles, ultimately resulting in inefficiency and limited reusability. To this end, we propose IoDResearch (Internet of Data Research), a private data‑centric Deep Research framework that operationalizes the Internet of Data paradigm. IoDResearch encapsulates heterogeneous resources as FAIR‑compliant digital objects, and further refines them into atomic knowledge units and knowledge graphs, forming a heterogeneous graph index for multi‑granularity retrieval. On top of this representation, a multi‑agent system supports both reliable question answering and structured scientific report generation. Furthermore, we establish the IoD DeepResearch Benchmark to systematically evaluate both data representation and Deep Research capabilities in IoD scenarios. Experimental results on retrieval, QA, and report‑writing tasks show that IoDResearch consistently surpasses representative RAG and Deep Research baselines. Overall, IoDResearch demonstrates the feasibility of private‑data‑centric Deep Research under the IoD paradigm, paving the way toward more trustworthy, reusable, and automated scientific discovery.

PaperID: 8283, https://arxiv.org/pdf/2510.01538.pdf

Authors: Haokun Zhao, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Yuting He, Siqi Sun, Chenyu You

Title: TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis

Abstract:
Time series forecasting is central to decision‑making in domains as diverse as energy, finance, climate, and public health. In practice, forecasters face thousands of short, noisy series that vary in frequency, quality, and horizon, where the dominant cost lies not in model fitting, but in the labor‑intensive preprocessing, validation, and ensembling required to obtain reliable predictions. Prevailing statistical and deep learning models are tailored to specific datasets or domains and generalize poorly. A general, domain‑agnostic framework that minimizes human intervention is urgently in demand. In this paper, we introduce TimeSeriesScientist (TSci), the first LLM‑driven agentic framework for general time series forecasting. The framework comprises four specialized agents: Curator performs LLM‑guided diagnostics augmented by external tools that reason over data statistics to choose targeted preprocessing; Planner narrows the hypothesis space of model choice by leveraging multi‑modal diagnostics and self‑planning over the input; Forecaster performs model fitting and validation and, based on the results, adaptively selects the best model configuration as well as ensemble strategy to make final predictions; and Reporter synthesizes the whole process into a comprehensive, transparent report. With transparent natural‑language rationales and comprehensive reports, TSci transforms the forecasting workflow into a white‑box system that is both interpretable and extensible across tasks. Empirical results on eight established benchmarks demonstrate that TSci consistently outperforms both statistical and LLM‑based baselines, reducing forecast error by an average of 10.4% and 38.2%, respectively. Moreover, TSci produces a clear and rigorous report that makes the forecasting workflow more transparent and interpretable.

PaperID: 8284, https://arxiv.org/pdf/2510.01523.pdf

Authors: Shreeranjani Srirangamsridharan, Ali Abavisani, Reza Yousefi Maragheh, Ramin Giahi, Kai Zhao, Jason Cho, Sushant Kumar

Title: MetaSynth: Multi-Agent Metadata Generation from Implicit Feedback in Black-Box Systems

Abstract:
Meta titles and descriptions strongly shape engagement in search and recommendation platforms, yet optimizing them remains challenging. Search engine ranking models are black box environments, explicit labels are unavailable, and feedback such as click‑through rate (CTR) arrives only post‑deployment. Existing template, LLM, and retrieval‑augmented approaches either lack diversity, hallucinate attributes, or ignore whether candidate phrasing has historically succeeded in ranking. This leaves a gap in directly leveraging implicit signals from observable outcomes. We introduce MetaSynth, a multi‑agent retrieval‑augmented generation framework that learns from implicit search feedback. MetaSynth builds an exemplar library from top‑ranked results, generates candidate snippets conditioned on both product content and exemplars, and iteratively refines outputs via evaluator‑generator loops that enforce relevance, promotional strength, and compliance. On both proprietary e‑commerce data and the Amazon Reviews corpus, MetaSynth outperforms strong baselines across NDCG, MRR, and rank metrics. Large‑scale A/B tests further demonstrate 10.26% CTR and 7.51% clicks. Beyond metadata, this work contributes a general paradigm for optimizing content in black‑box systems using implicit signals.

PaperID: 8285, https://arxiv.org/pdf/2510.01499.pdf

Authors: Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu

Title: Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information

Abstract:
With the rapid progress of multi‑agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first‑order and second‑order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine‑tuning benchmarks such as UltraFeedback and MMLU, and a real‑world healthcare setting ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training‑free framework for effective multi‑agent LLM aggregation.

PaperID: 8286, https://arxiv.org/pdf/2510.01409.pdf

Authors: Luca Cotti, Idilio Drago, Anisa Rula, Devis Bianchini, Federico Cerutti

Title: OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models

Abstract:
System logs represent a valuable source of Cyber Threat Intelligence (CTI), capturing attacker behaviors, exploited vulnerabilities, and traces of malicious activity. Yet their utility is often limited by lack of structure, semantic inconsistency, and fragmentation across devices and sessions. Extracting actionable CTI from logs therefore requires approaches that can reconcile noisy, heterogeneous data into coherent and interoperable representations. We introduce OntoLogX, an autonomous Artificial Intelligence (AI) agent that leverages Large Language Models (LLMs) to transform raw logs into ontology‑grounded Knowledge Graphs (KGs). OntoLogX integrates a lightweight log ontology with Retrieval Augmented Generation (RAG) and iterative correction steps, ensuring that generated KGs are syntactically and semantically valid. Beyond event‑level analysis, the system aggregates KGs into sessions and employs a LLM to predict MITRE ATT&CK tactics, linking low‑level log evidence to higher‑level adversarial objectives. We evaluate OntoLogX on both logs from a public benchmark and a real‑world honeypot dataset, demonstrating robust KG generation across multiple KGs backends and accurate mapping of adversarial activity to ATT&CK tactics. Results highlight the benefits of retrieval and correction for precision and recall, the effectiveness of code‑oriented models in structured log analysis, and the value of ontology‑grounded representations for actionable CTI extraction.

PaperID: 8287, https://arxiv.org/pdf/2510.01398.pdf

Authors: Yang Liu, Zaid Abulawi, Abhiram Garimidi, Doyeong Lim

Title: Automating Data-Driven Modeling and Analysis for Engineering Applications using Large Language Model Agents

Abstract:
Modern engineering increasingly relies on vast datasets generated by experiments and simulations, driving a growing demand for efficient, reliable, and broadly applicable modeling strategies. There is also heightened interest in developing data‑driven approaches, particularly neural network models, for effective prediction and analysis of scientific datasets. Traditional data‑driven methods frequently involve extensive manual intervention, limiting their ability to scale effectively and generalize to diverse applications. In this study, we propose an innovative pipeline utilizing Large Language Model (LLM) agents to automate data‑driven modeling and analysis, with a particular emphasis on regression tasks. We evaluate two LLM‑agent frameworks: a multi‑agent system featuring specialized collaborative agents, and a single‑agent system based on the Reasoning and Acting (ReAct) paradigm. Both frameworks autonomously handle data preprocessing, neural network development, training, hyperparameter optimization, and uncertainty quantification (UQ). We validate our approach using a critical heat flux (CHF) prediction benchmark, involving approximately 25,000 experimental data points from the OECD/NEA benchmark dataset. Results indicate that our LLM‑agent‑developed model surpasses traditional CHF lookup tables and delivers predictive accuracy and UQ on par with state‑of‑the‑art Bayesian optimized deep neural network models developed by human experts. These outcomes underscore the significant potential of LLM‑based agents to automate complex engineering modeling tasks, greatly reducing human workload while meeting or exceeding existing standards of predictive performance.

PaperID: 8288, https://arxiv.org/pdf/2510.01379.pdf

Authors: Huashan Chen, Zhenyu Qi, Haotang Li, Hong Chen, Jinfu Chen, Kebin Peng, In Kee Kim, Kyu Hyung Lee, Sen He, Weiyi Shang

Title: Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

Abstract:
Large Language Models (LLMs) have become central to automated code generation, yet existing approaches operate within a single‑LLM paradigm: one model is selected and applied throughout the entire generation process. We observe that different LLMs exhibit complementary strengths: no single model dominates across all programming languages, algorithmic problem categories, or development stages. Multi‑LLM collaboration, structured as per‑stage, per‑category routing rather than majority voting, produces higher‑quality code than any individual model. Based on this observation, we propose PerfOrch, a multi‑agent orchestration system that decomposes code generation into four collaborative agents: categorization, generation, debugging, and refinement. Each agent maintains a Memory module: a ranking matrix indexed by programming language and problem category, constructed from offline profiling and consulted at runtime to select the most suitable model for each task. We evaluate PerfOrch on two benchmarks, HumanEval‑X and EffiBench‑X, totaling 2,500 problems across five languages (Python, Java, C++, Go, and Rust). PerfOrch achieves average pass@1 rates of 97.19% on HumanEval‑X and 95.83% on EffiBench‑X, improving over the strongest single‑model pipeline by 1.22‑14.58 percentage points across languages. Notably, Memory rankings constructed solely from HumanEval‑X profiling generalize to the entirely unseen EffiBench‑X benchmark without re‑profiling, demonstrating that the complementary‑strength patterns PerfOrch exploits are properties of the models rather than artifacts of a specific problem distribution. Beyond correctness, PerfOrch improves execution time for 61‑90% of solved problems with mean speedups of 4.7‑29.9%, matching the refinement coverage of exhaustive multi‑model evaluation at roughly half the token cost.

PaperID: 8289, https://arxiv.org/pdf/2510.01375.pdf

Authors: Humaid Ibrahim, Nikolai Rozanov, Marek Rei

Title: Fine-tuning with RAG for Improving LLM Learning of New Skills

Abstract:
Large language model (LLM) agents deployed for multi‑step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval‑augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference‑time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one‑shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. 79% for baselines) and improving WebShop scores to 72 (vs. 61 for baselines), while using 10‑60% fewer tokens than retrieval‑augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine‑tuning without permanent runtime dependencies.

PaperID: 8290, https://arxiv.org/pdf/2510.01359.pdf

Authors: Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar

Title: Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Abstract:
Code‑capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety‑bypass ("jailbreak") attacks beyond text‑only settings. Prior evaluations emphasize refusal or harmful‑text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS‑BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS‑0), single‑file (JAWS‑1), and multi‑file (JAWS‑M). We pair this with a hierarchical, executable‑aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt‑only conditions in JAWS‑0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end‑to‑end. Moving to single‑file regime in JAWS‑1 drives compliance to ~ 100% for capable models and yields a mean ASR (Attack Success Rate) ~ 71%; the multi‑file regime (JAWS‑M) raises mean ASR to ~ 75%, with 32% instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability ‑‑ ASR raises by 1.6x ‑‑ because initial refusals are frequently overturned during later planning/tool‑use steps. Category‑level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution‑aware defenses, code‑contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi‑step reasoning and tool use.

PaperID: 8291, https://arxiv.org/pdf/2510.01353.pdf

Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang

Title: MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

Abstract:
Recent works on context and memory benchmarking have primarily focused on conversational instances but the need for evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long‑term memory and state tracking in multi‑platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear and Git. Each benchmark instance provides a chronologically platform‑interleaved timeline, with noisy, conflicting, cross‑referring information as well as potential codebase/file‑system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquistion, selection and conflict resolution. We curate the MEMTRACK dataset through both manual expert driven design and scalable agent based synthesis, generating ecologically valid scenarios grounded in real world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross‑platform dependencies, and resolving contradictions. Notably, the best performing GPT‑5 model only achieves a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory‑augmented agents, beyond existing focus on conversational setups, and sets the stage for multi‑agent, multi‑platform memory benchmarking in complex organizational settings

PaperID: 8292, https://arxiv.org/pdf/2510.01297.pdf

Authors: Yeqi Feng, Yucheng Lu, Hongyu Su, Yixin Tao, Tianxing He

Title: SimCity: Multi-Agent Urban Development Simulation with Rich Interactions

Abstract:
Large Language Models (LLMs) open new possibilities for constructing realistic and interpretable macroeconomic simulations. We present SimCity, a multi‑agent framework that leverages LLMs to model an interpretable macroeconomic system with heterogeneous agents and rich interactions. Unlike classical equilibrium models that limit heterogeneity for tractability, or traditional agent‑based models (ABMs) that rely on hand‑crafted decision rules, SimCity enables flexible, adaptive behavior with transparent natural‑language reasoning. Within SimCity, four core agent types (households, firms, a central bank, and a government) deliberate and participate in a frictional labor market, a heterogeneous goods market, and a financial market. Furthermore, a Vision‑Language Model (VLM) determines the geographic placement of new firms and renders a mapped virtual city, allowing us to study both macroeconomic regularities and urban expansion dynamics within a unified environment. To evaluate the framework, we compile a checklist of canonical macroeconomic phenomena, including price elasticity of demand, Engel's Law, Okun's Law, the Phillips Curve, and the Beveridge Curve, and show that SimCity naturally reproduces these empirical patterns while remaining robust across simulation runs.

PaperID: 8293, https://arxiv.org/pdf/2510.01293.pdf

Authors: Zekun Jiang, Chunming Xu, Tianhang Zhou

Title: Cyber Academia-Chemical Engineering (CA-ChemE): A Living Digital Town for Self-Directed Research Evolution and Emergent Scientific Discovery

Abstract:
The rapid advancement of artificial intelligence (AI) has demonstrated substantial potential in chemical engineering, yet existing AI systems remain limited in interdisciplinary collaboration and exploration of uncharted problems. To address these issues, we present the Cyber Academia‑Chemical Engineering (CA‑ChemE) system, a living digital town that enables self‑directed research evolution and emergent scientific discovery through multi‑agent collaboration. By integrating domain‑specific knowledge bases, knowledge enhancement technologies, and collaboration agents, the system successfully constructs an intelligent ecosystem capable of deep professional reasoning and efficient interdisciplinary collaboration. Our findings demonstrate that knowledge base‑enabled enhancement mechanisms improved dialogue quality scores by 10‑15% on average across all seven expert agents, fundamentally ensuring technical judgments are grounded in verifiable scientific evidence. However, we observed a critical bottleneck in cross‑domain collaboration efficiency, prompting the introduction of a Collaboration Agent (CA) equipped with ontology engineering capabilities. CA's intervention achieved 8.5% improvements for distant‑domain expert pairs compared to only 0.8% for domain‑proximate pairs ‑ a 10.6‑fold difference ‑ unveiling the "diminished collaborative efficiency caused by knowledge‑base gaps" effect. This study demonstrates how carefully designed multi‑agent architectures can provide a viable pathway toward autonomous scientific discovery in chemical engineering.

PaperID: 8294, https://arxiv.org/pdf/2510.01285.pdf

Authors: Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Tomas Pfister, Hamid Palangi

Title: LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science

Abstract:
Advances in large language models (LLMs) have created new opportunities in data science, but their deployment is often limited by the challenge of finding relevant data in large data lakes. Existing methods struggle with this: both single‑ and multi‑agent systems are quickly overwhelmed by large, heterogeneous files, and master‑slave multi‑agent systems rely on a rigid central controller that requires precise knowledge of each sub‑agent's capabilities, which is not possible in large‑scale settings where the main agent lacks full observability over sub‑agents' knowledge and competencies. We propose a novel multi‑agent paradigm inspired by the blackboard architecture for traditional AI models. In our framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents ‑ either responsible for a partition of the data lake or retrieval from the web ‑ volunteer to respond based on their capabilities. This design improves scalability and flexibility by removing the need for a central coordinator to know each agent's expertise or internal knowledge. We evaluate the approach on three benchmarks that require data discovery: KramaBench and modified versions of DSBench and DA‑Code. Results show that the blackboard architecture substantially outperforms strong baselines, achieving 13%‑57% relative improvements in end‑to‑end success and up to a 9% relative gain in data discovery F1 over the best baseline.

PaperID: 8295, https://arxiv.org/pdf/2510.01279.pdf

Authors: Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon

Title: TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Abstract:
While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini‑Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool‑Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool‑use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state‑of‑the‑art tool‑augmented and test‑time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini‑2.5‑Pro and Gemini‑2.5‑Flash across key reasoning benchmarks, with near‑equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto‑optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.

PaperID: 8296, https://arxiv.org/pdf/2510.01275.pdf

Authors: Yilin Lyu, Ren Ding

Title: Discovering Self-Regulated Learning Patterns in Chatbot-Powered Education Environment

Abstract:
The increasing adoption of generative AI (GenAI) tools such as chatbots in education presents new opportunities to support students' self‑regulated learning (SRL), but also raises concerns about how learners actually engage in planning, executing, and reflection when learning with a chatbot. While SRL is typically conceptualized as a sequential process, little is known about how it unfolds during real‑world student‑chatbot interactions. To explore this, we proposed Gen‑SRL, an annotation schema to categorize student prompts into 16 microlevel actions across 4 macrolevel phases. Using the proposed schema, we annotated 212 chatbot interactions from a real‑world English writing task. We then performed frequency analysis and process mining (PM) techniques to discover SRL patterns in depth. Our results revealed that students' SRL behaviours were imbalanced, with over 82% of actions focused on task execution and limited engagement in planning and reflection. In addition, the process analysis showed nonsequential regulation patterns. Our findings suggest that classical SRL theories cannot fully capture the dynamic SRL patterns that emerge during chatbot interactions. Furthermore, we highlight the importance of designing adaptive and personalized scaffolds that respond to students' dynamic behaviours in chatbot‑powered contexts. More importantly, this study offers a new perspective for advancing SRL research and suggests directions for developing chatbots that better support self‑regulation.

PaperID: 8297, https://arxiv.org/pdf/2510.01245.pdf

Authors: Runfei Chen, Shuyang Jiang, Wei Huang

Title: SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Abstract:
Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM‑powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi‑agent framework where LLM‑based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine‑grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre‑trained event prior contributes enriched insights about event‑driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event's location and time of occurrence.

PaperID: 8298, https://arxiv.org/pdf/2510.01239.pdf

Authors: Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang

Title: CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Abstract:
We present CIFLEX (Contextual Instruction Flow for Sub‑task Execution), which is a novel execution system for efficient sub‑task handling in multi‑turn interactions with a single on‑device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub‑tasks that more effectively and comprehensively support answering user requests. Naive approach reprocesses the entire conversation context when switching between main and sub‑tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key‑value (KV) cache from the main task and injecting only task‑specific instructions into isolated side paths. After sub‑task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub‑task selection, we also develop a hierarchical classification strategy tailored for small‑scale models, decomposing multi‑choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi‑task dialogue on‑device.

PaperID: 8299, https://arxiv.org/pdf/2510.01235.pdf

Authors: Subham Ghosh, Abhishek Tewari

Title: Automated Extraction of Material Properties using LLM-based AI Agents

Abstract:
The rapid discovery of materials is constrained by the lack of large, machine‑readable datasets that couple performance metrics with structural context. Existing databases are either small, manually curated, or biased toward first principles results, leaving experimental literature underexploited. We present an agentic, large language model (LLM)‑driven workflow that autonomously extracts thermoelectric and structural‑properties from about 10,000 full‑text scientific articles. The pipeline integrates dynamic token allocation, zeroshot multi‑agent extraction, and conditional table parsing to balance accuracy against computational cost. Benchmarking on 50 curated papers shows that GPT‑4.1 achieves the highest accuracy (F1 = 0.91 for thermoelectric properties and 0.82 for structural fields), while GPT‑4.1 Mini delivers nearly comparable performance (F1 = 0.89 and 0.81) at a fraction of the cost, enabling practical large scale deployment. Applying this workflow, we curated 27,822 temperature resolved property records with normalized units, spanning figure of merit (ZT), Seebeck coefficient, conductivity, resistivity, power factor, and thermal conductivity, together with structural attributes such as crystal class, space group, and doping strategy. Dataset analysis reproduces known thermoelectric trends, such as the superior performance of alloys over oxides and the advantage of p‑type doping, while also surfacing broader structure‑property correlations. To facilitate community access, we release an interactive web explorer with semantic filters, numeric queries, and CSV export. This study delivers the largest LLM‑curated thermoelectric dataset to date, provides a reproducible and cost‑profiled extraction pipeline, and establishes a foundation for scalable, data‑driven materials discovery beyond thermoelectrics.

PaperID: 8300, https://arxiv.org/pdf/2510.01179.pdf

Authors: Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, Rameswar Panda

Title: TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

Abstract:
Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open‑source community is constrained by the lack of high quality permissively licensed tool‑agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi‑tool and multi‑turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool‑agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real‑world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool‑use queries using five distinct models, applies model‑based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule‑based and model‑based validation ensures high‑quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi‑turn conversations. Models fine‑tuned on Toucan outperform larger closed‑source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP‑Universe Bench.

PaperID: 8301, https://arxiv.org/pdf/2510.01164.pdf

Authors: Zhengliang Shi, Ruotian Ma, Jen-tse Huang, Xinbei Ma, Xingyu Chen, Mengru Wang, Qu Yang, Yue Wang, Fanghua Ye, Ziyang Chen, Shanyi Wang, Cixing Li, Wenxuan Wang, Zhaopeng Tu, Xiaolong Li, Zhaochun Ren, Linus

Title: Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

Abstract:
Large language models (LLMs) are increasingly entrusted with high‑stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade‑off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state‑of‑the‑art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output‑length constraints and social‑influence framing. These results highlight the risks of deploying current LLMs as societal decision‑makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.

PaperID: 8302, https://arxiv.org/pdf/2510.01116.pdf

Authors: Felix Parker, Nimeesha Chan, Chi Zhang, Kimia Ghobadi

Title: Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning

Abstract:
Complex numerical time series analysis often demands multi‑step reasoning capabilities beyond current models' reach. Tasks like medical diagnosis and weather forecasting require sequential reasoning processes ‑ including counterfactual analysis, logical deduction, knowledge application, and multi‑modal contextual integration ‑ that existing time series models cannot explicitly perform. While recent research has shown large language models (LLMs) can achieve sophisticated Chain‑of‑Thought (CoT) reasoning through reinforcement learning (RL), these advances have primarily focused on mathematical and coding domains, with LLMs still demonstrating poor performance on time series tasks. We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains LLMs to perform CoT reasoning across diverse time series tasks using RL with verifiable rewards. Our approach employs a Residual Vector‑Quantized VAE to create high‑fidelity discrete tokens that seamlessly integrate into a pre‑trained LLM's vocabulary. COUNTS undergoes a two‑stage training process: first, supervised fine‑tuning on time series analysis tasks to master our novel representations, followed by Group Relative Policy Optimization training on verifiable problems using prompting strategies that encourage explicit reasoning steps before producing final answers. Our experiments demonstrate that this RL‑driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.

PaperID: 8303, https://arxiv.org/pdf/2510.01067.pdf

Authors: Vivek Khatana, Duo Wang, Petros Voulgaris, Nicola Elia, Naira Hovakimyan

Title: Networked Control and Mean Field Problems Under Diagonal Dominance: Decentralized and Social Optimality

Abstract:
In this article, we employ an input‑output approach to expand the study of cooperative multi‑agent control and optimization problems characterized by mean‑field interactions that admit decentralized and selfish solutions. The setting involves n independent agents that interact solely through a shared cost function, which penalizes deviations of each agent from the group's average collective behavior. Building on our earlier results established for homogeneous agents, we extend the framework to nonidentical agents and show that, under a diagonal dominant interaction of the collective dynamics, with bounded local open‑loop dynamics, the optimal controller for H_\infty and H_2 norm minimization remains decentralized and selfish in the limit as the number of agents n grows to infinity.

PaperID: 8304, https://arxiv.org/pdf/2510.00861.pdf

Authors: Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu

Title: Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Abstract:
While search‑augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi‑hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state‑of‑the‑art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi‑step reasoning in LLMs.

PaperID: 8305, https://arxiv.org/pdf/2510.00685.pdf

Authors: Nurbek Tastan, Samuel Horvath, Karthik Nandakumar

Title: Stochastic Self-Organization in Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response‑conditioned framework that adapts communication on‑the‑fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high‑contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self‑organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task‑ and query‑level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.

PaperID: 8306, https://arxiv.org/pdf/2510.00676.pdf

Authors: Zamir Martinez, Daniel Zelazo

Title: Formation Control via Rotation Symmetry Constraints

Abstract:
This work introduces a distributed formation control strategy for multi‑agent systems based solely on rotation symmetry constraints. We propose a potential function that enforces inter‑agent rotational symmetries, whose gradient defines a control law that drives the agents toward a desired planar symmetric configuration. We show that only n‑1 edges (the minimal connectivity requirement) are sufficient to implement the strategy, where n is the number of agents. We further augment the design to address the maneuvering problem, enabling the formation to undergo coordinated translations, rotations, and scaling along a predefined virtual trajectory. Simulation examples are provided to validate the effectiveness of the proposed method.

PaperID: 8307, https://arxiv.org/pdf/2510.00615.pdf

Authors: Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

Title: ACON: Optimizing Context Compression for Long-horizon LLM Agents

Abstract:
Large language models (LLMs) are increasingly deployed as agents in dynamic, real‑world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long‑horizon tasks, yet prior work on context compression has mostly focused on single‑step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi‑objective QA show that ACON reduces memory usage by 26‑54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long‑horizon agents with up to 46% performance improvement.

PaperID: 8308, https://arxiv.org/pdf/2510.00602.pdf

Authors: Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall

Title: Multi-Agent Stage-wise Conservative Linear Bandits

Abstract:
In many real‑world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi‑agent networked setting where agents must satisfy stage‑wise conservative constraints. A network of N agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than (1‑α) times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA‑SCLUCB (Multi‑Agent Stage‑wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus‑building phases. We prove that MA‑SCLUCB achieves regret \tildeO\left(\fracd\sqrtN\sqrtT\cdot\frac\log(NT)\sqrt\log(1/|λ_2|)\right) with high probability, where d is the dimension, T is the horizon, and |λ_2| is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields \frac1\sqrtN improvement despite local communication, (ii) communication overhead grows only logarithmically for well‑connected networks, and (iii) stage‑wise safety adds only lower‑order regret. Thus, distributed learning with safety guarantees achieves near‑optimal performance in reasonably connected networks.

PaperID: 8309, https://arxiv.org/pdf/2510.00591.pdf

Authors: Liyi Cai, Yijie Ren, Yitong Zhang, Jia Li

Title: AI-Driven Self-Evolving Software: A Promising Path Toward Software Automation

Abstract:
Software automation has long been a central goal of software engineering, striving for software development that proceeds without human intervention. Recent efforts have leveraged Artificial Intelligence (AI) to advance software automation with notable progress. However, current AI functions primarily as assistants to human developers, leaving software development still dependent on explicit human intervention. This raises a fundamental question: Can AI move beyond its role as an assistant to become a core component of software, thereby enabling genuine software automation? To investigate this vision, we introduce AI‑Driven Self‑Evolving Software, a new form of software that evolves continuously through direct interaction with users. We demonstrate the feasibility of this idea with a lightweight prototype built on a multi‑agent architecture that autonomously interprets user requirements, generates and validates code, and integrates new functionalities. Case studies across multiple representative scenarios show that the prototype can reliably construct and reuse functionality, providing early evidence that such software systems can scale to more sophisticated applications and pave the way toward truly automated software development. We make code and cases in this work publicly available at https://anonymous.4open.science/r/live‑software.

PaperID: 8310, https://arxiv.org/pdf/2510.00579.pdf

Authors: Li Li, Ziyi Wang, Yongliang Wu, Jianfei Cai, Xu Yang

Title: CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

Abstract:
Chain‑of‑Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in‑context learning and fine‑tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task‑general, multi‑step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer‑wise instability, manifesting as a U‑shaped performance curve that reflects a systematic three‑stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher‑student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter‑efficient fine‑tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre‑training differences, offering new insights into the functional organization of multi‑step reasoning in LLMs. The source code will be released.

PaperID: 8311, https://arxiv.org/pdf/2510.00568.pdf

Authors: Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

Title: ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Abstract:
Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge‑intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi‑step reasoning. However, prior RL‑based methods often rely on sparse or rule‑based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self‑correcting framework for training search agents. Our framework introduces a self‑correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re‑plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Being intuitively reasonable and practically simple, extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.

PaperID: 8312, https://arxiv.org/pdf/2510.00510.pdf

Authors: Jiarun Liu, Shiyue Xu, Shangkun Liu, Yang Li, Wen Liu, Min Liu, Xiaoqing Zhou, Hanmin Wang, Shilin Jia, zhen Wang, Shaohua Tian, Hanhao Li, Junbo Zhang, Yongli Yu, Peng Cao, Haofen Wang

Title: JoyAgent-JDGenie: Technical Report on the GAIA

Abstract:
Large Language Models are increasingly deployed as autonomous agents for complex real‑world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi‑agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open‑source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system‑level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.

PaperID: 8313, https://arxiv.org/pdf/2510.00507.pdf

Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang

Title: Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Abstract:
As multimodal LLM‑driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge‑graph‑driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta‑path strategies. To further ensure task reliability, a multi‑stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi‑step web interaction tasks. We instantiate the framework with Graph2Eval‑Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval‑Bench effectively distinguishes agent performance, offering a new perspective on agent evaluation.

PaperID: 8314, https://arxiv.org/pdf/2510.00482.pdf

Authors: Yawen Xue, Masaya Tsunokake, Yuta Koreeda, Ekant Muljibhai Amin, Takashi Sumiyoshi, Yasuhiro Sogawa

Title: Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains

Abstract:
Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi‑step reasoning tasks. Most approaches leverage these capabilities via in‑context learning with few‑shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine‑tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain‑specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine‑tuning for domain adaptation within Hitachi's JP1 middleware, a microdomain for specialized IT operations. We fine‑tuned LLMs using JP1‑specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval‑augmented generation and introduced a context‑answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine‑tuning for domain‑specific reasoning in complex microdomains.

PaperID: 8315, https://arxiv.org/pdf/2510.00480.pdf

Authors: Kenjiro Ide, Taiga Someya, Kohei Kawaguchi, Keisuke Fujii

Title: Expandable Decision-Making States for Multi-Agent Deep Reinforcement Learning in Soccer Tactical Analysis

Abstract:
Invasion team sports such as soccer produce a high‑dimensional, strongly coupled state space as many players continuously interact on a shared field, challenging quantitative tactical analysis. Traditional rule‑based analyses are intuitive, while modern predictive machine learning models often perform pattern‑matching without explicit agent representations. The problem we address is how to build player‑level agent models from data, whose learned values and policies are both tactically interpretable and robust across heterogeneous data sources. Here, we propose Expandable Decision‑Making States (EDMS), a semantically enriched state representation that augments raw positions and velocities with relational variables (e.g., scoring of space, pass, and score), combined with an action‑masking scheme that gives on‑ball and off‑ball agents distinct decision sets. Compared to prior work, EDMS maps learned value functions and action policies to human‑interpretable tactical concepts (e.g., marking pressure, passing lanes, ball accessibility) instead of raw coordinate features, and aligns agent choices with the rules of play. In the experiments, EDMS with action masking consistently reduced both action‑prediction loss and temporal‑difference (TD) error compared to the baseline. Qualitative case studies and Q‑value visualizations further indicate that EDMS highlights high‑risk, high‑reward tactical patterns (e.g., fast counterattacks and defensive breakthroughs). We also integrated our approach into an open‑source library and demonstrated compatibility with multiple commercial and open datasets, enabling cross‑provider evaluation and reproducible experiments.

PaperID: 8316, https://arxiv.org/pdf/2510.00477.pdf

Authors: Chengzhen Li, Likun Zhang, Chuang Zhang, Jiahui Li, Changyuan Zhao, Ruichen Zhang, Geng Sun

Title: Wireless Laser Power Transfer for Low-altitude Uncrewed Aerial Vehicle-assisted Internet of Things: Paradigms, Challenges, and Solutions

Abstract:
Low‑altitude uncrewed aerial vehicles (UAVs) have become integral enablers for the Internet of Things (IoT) by offering enhanced coverage, improved connectivity and access to remote areas. A critical challenge limiting their operational capacity lies in the energy constraints of both aerial platforms and ground‑based sensors. This paper explores WLPT as a transformative solution for sustainable energy provisioning in UAV‑assisted IoT networks. We first systematically investigate the fundamental principles of WLPT and analysis the comparative advantages. Then, we introduce three operational paradigms for system integration, identify key challenges, and discuss corresponding potential solutions. In case study, we propose a multi‑agent reinforcement learning framework to address the coordination and optimization challenges in WLPT‑enabled UAV‑assisted IoT data collection. Simulation results demonstrate that our framework significantly improves energy sustainability and data freshness. Finally, we discuss some future directions.

PaperID: 8317, https://arxiv.org/pdf/2510.00414.pdf

Authors: Matthew Yue, Zhikun Xu, Vivek Gupta, Thao Ha, Liesal Sharabi, Ben Zhou

Title: RELATE-Sim: Leveraging Turning Point Theory and LLM Agents to Predict and Understand Long-Term Relationship Dynamics through Interactive Narrative Simulations

Abstract:
Most dating technologies optimize for getting together, not staying together. We present RELATE‑Sim, a theory‑grounded simulator that models how couples behave at consequential turning points‑exclusivity talks, conflict‑and‑repair episodes, relocations‑rather than static traits. Two persona‑aligned LLM agents (one per partner) interact under a centralized Scene Master that frames each turning point as a compact set of realistic options, advances the narrative, and infers interpretable state changes and an auditable commitment estimate after each scene. On a longitudinal dataset of 71 couples with two‑year follow‑ups, simulation‑aware predictions outperform a personas‑only baseline while surfacing actionable markers (e.g., repair attempts acknowledged, clarity shifts) that explain why trajectories diverge. RELATE‑Sim pushes the relationship research's focus from matchmaking to maintenance, providing a transparent, extensible platform for understanding and forecasting long‑term relationship dynamics.

PaperID: 8318, https://arxiv.org/pdf/2510.00401.pdf

Authors: Shounak Sural, Charles Kekeh, Wenliang Liu, Federico Pecora, Mouhacine Benosman

Title: Physics-Informed Neural Controlled Differential Equations for Scalable Long Horizon Multi-Agent Motion Forecasting

Abstract:
Long‑horizon motion forecasting for multiple autonomous robots is challenging due to non‑linear agent interactions, compounding prediction errors, and continuous‑time evolution of dynamics. Learned dynamics of such a system can be useful in various applications such as travel time prediction, prediction‑guided planning and generative simulation. In this work, we aim to develop an efficient trajectory forecasting model conditioned on multi‑agent goals. Motivated by the recent success of physics‑guided deep learning for partially known dynamical systems, we develop a model based on neural Controlled Differential Equations (CDEs) for long‑horizon motion forecasting. Unlike discrete‑time methods such as RNNs and transformers, neural CDEs operate in continuous time, allowing us to combine physics‑informed constraints and biases to jointly model multi‑robot dynamics. Our approach, named PINCoDE (Physics‑Informed Neural Controlled Differential Equations), learns differential equation parameters that can be used to predict the trajectories of a multi‑agent system starting from an initial condition. PINCoDE is conditioned on future goals and enforces physics constraints for robot motion over extended periods of time. We adopt a strategy that scales our model from 10 robots to 100 robots without the need for additional model parameters, while producing predictions with an average ADE below 0.5 m for a 1‑minute horizon. Furthermore, progressive training with curriculum learning for our PINCoDE model results in a 2.7X reduction of forecasted pose error over 4 minute horizons compared to analytical models.

PaperID: 8319, https://arxiv.org/pdf/2510.00381.pdf

Authors: Kaiwen Yu, Mengying Sun, Zhijin Qin, Xiaodong Xu, Ping Yang, Yue Xiao, Gang Wu

Title: Semantic-Driven AI Agent Communications: Challenges and Solutions

Abstract:
With the rapid growth of intelligent services, communication targets are shifting from humans to artificial intelligent (AI) agents, which require new paradigms to enable real‑time perception, decision‑making, and collaboration. Semantic communication, which conveys task‑relevant meaning rather than raw data, offers a promising solution. However, its practical deployment remains constrained by dynamic environments and limited resources. To address these issues, this article proposes a semantic‑driven AI agent communication framework and develops three enabling techniques. First, semantic adaptation transmission applies fine‑tuning with real or generative samples to efficiently adapt models to varying environments. Second, semantic lightweight transmission incorporates pruning, quantization, and perception‑aware sampling to reduce model complexity and alleviate computational burden on edge agents. Third, semantic self‑evolution control employs distributed hierarchical decision‑making to optimize multi‑dimensional resources, enabling robust multi‑agent collaboration in dynamic environments. Simulation results show that the proposed solutions achieve faster convergence and stronger robustness, while the proposed distributed hierarchical optimization method significantly outperforms conventional decision‑making schemes, highlighting its potential for AI agent communication networks.

PaperID: 8320, https://arxiv.org/pdf/2510.00326.pdf

Authors: Hassen Dhrif

Title: Reasoning-Aware Prompt Orchestration: A Foundation Model for Multi-Agent Language Model Coordination

Abstract:
The emergence of large language models has enabled sophisticated multi‑agent systems, yet coordinating their reasoning capabilities through prompt engineering remains challenging. We present a theoretically‑grounded framework for dynamic prompt orchestration that enhances reasoning across multiple specialized agents. This framework addresses three core challenges: logical consistency preservation during agent transitions, reasoning‑aware prompt adaptation, and scalable coordination of distributed inference. Our approach formalizes agent states using prompt templates, reasoning context vectors, and capability matrices. We prove system convergence to stable coordination patterns when step sizes satisfy α< \frac12L where L is the Lipschitz constant of the state transition function. We implement this through a distributed architecture that dynamically routes reasoning tasks while maintaining semantic coherence. Experimental results on 1,000 synthetic multi‑agent conversations demonstrate a 42% reduction in reasoning latency, a 23% improvement in logical consistency measured by ROUGE‑L score, and an 89% success rate for task completion without context loss across agent transitions. Ablation studies identify the consensus mechanism as the primary performance driver, while revealing limitations: performance degrades beyond 10 agent transitions, and the system requires 76.5GB memory for 1,000 concurrent agents. These findings establish a new paradigm for scalable reasoning in multi‑agent systems, providing theoretical foundations for understanding reasoning emergence across coordinated language models.

PaperID: 8321, https://arxiv.org/pdf/2510.00317.pdf

Authors: Youpeng Li, Kartik Joshi, Xinda Wang, Eric Wong

Title: MAVUL: Multi-Agent Vulnerability Detection via Contextual Reasoning and Interactive Refinement

Abstract:
The widespread adoption of open‑source software (OSS) necessitates the mitigation of vulnerability risks. Most vulnerability detection (VD) methods are limited by inadequate contextual understanding, restrictive single‑round interactions, and coarse‑grained evaluations, resulting in undesired model performance and biased evaluation results. To address these challenges, we propose MAVUL, a novel multi‑agent VD system that integrates contextual reasoning and interactive refinement. Specifically, a vulnerability analyst agent is designed to flexibly leverage tool‑using capabilities and contextual reasoning to achieve cross‑procedural code understanding and effectively mine vulnerability patterns. Through iterative feedback and refined decision‑making within cross‑role agent interactions, the system achieves reliable reasoning and vulnerability prediction. Furthermore, MAVUL introduces multi‑dimensional ground truth information for fine‑grained evaluation, thereby enhancing evaluation accuracy and reliability. Extensive experiments conducted on a pairwise vulnerability dataset demonstrate MAVUL's superior performance. Our findings indicate that MAVUL significantly outperforms existing multi‑agent systems with over 62% higher pairwise accuracy and single‑agent systems with over 600% higher average performance. The system's effectiveness is markedly improved with increased communication rounds between the vulnerability analyst agent and the security architect agent, underscoring the importance of contextual reasoning in tracing vulnerability flows and the crucial feedback role. Additionally, the integrated evaluation agent serves as a critical, unbiased judge, ensuring a more accurate and reliable estimation of the system's real‑world applicability by preventing misleading binary comparisons.

PaperID: 8322, https://arxiv.org/pdf/2510.00311.pdf

Authors: Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo, Ziwei Zhu, Chris Jordan

Title: CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage

Abstract:
Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context‑poor, while recent LLM‑based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end‑to‑end ‑‑ an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi‑agent LLM architecture for high‑stakes alert triage in which specialized agents collaborate over real evidence: a behavior‑analysis agent inspects activity sequences, evidence‑gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine‑grained SOC investigations from production environments, capturing step‑by‑step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state‑of‑the‑art single‑agent LLMs.

PaperID: 8323, https://arxiv.org/pdf/2510.00274.pdf

Authors: Maisha Maliha, Dean Hougen

Title: MAGIC-MASK: Multi-Agent Guided Inter-Agent Collaboration with Mask-Based Explainability for Reinforcement Learning

Abstract:
Understanding the decision‑making process of Deep Reinforcement Learning agents remains a key challenge for deploying these systems in safety‑critical and multi‑agent environments. While prior explainability methods like StateMask, have advanced the identification of critical states, they remain limited by computational cost, exploration coverage, and lack of adaptation to multi‑agent settings. To overcome these limitations, we propose a mathematically grounded framework, MAGIC‑MASK (Multi‑Agent Guided Inter‑agent Collaboration with Mask‑Based Explainability for Reinforcement Learning), that extends perturbation‑based explanation to Multi‑Agent Reinforcement Learning. Our method integrates Proximal Policy Optimization, adaptive epsilon‑greedy exploration, and lightweight inter‑agent collaboration to share masked state information and peer experience. This collaboration enables each agent to perform saliency‑guided masking and share reward‑based insights with peers, reducing the time required for critical state discovery, improving explanation fidelity, and leading to faster and more robust learning. The core novelty of our approach lies in generalizing explainability from single‑agent to multi‑agent systems through a unified mathematical formalism built on trajectory perturbation, reward fidelity analysis, and Kullback‑Leibler divergence regularization. This framework yields localized, interpretable explanations grounded in probabilistic modeling and multi‑agent Markov decision processes. We validate our framework on both single‑agent and multi‑agent benchmarks, including a multi‑agent highway driving environment and Google Research Football, demonstrating that MAGIC‑MASK consistently outperforms state‑of‑the‑art baselines in fidelity, learning efficiency, and policy robustness while offering interpretable and transferable explanations.

PaperID: 8324, https://arxiv.org/pdf/2510.00270.pdf

Authors: Yichen Zhao, Tyler Hanks, Hans Riess, Samuel Cohen, Matthew Hale, James Fairbanks

Title: Asynchronous Nonlinear Sheaf Diffusion for Multi-Agent Coordination

Abstract:
Cellular sheaves and sheaf Laplacians provide a far‑reaching generalization of graphs and graph Laplacians, resulting in a wide array of applications ranging from machine learning to multi‑agent control. In the context of multi‑agent systems, so called coordination sheaves provide a unifying formalism that models heterogeneous agents and coordination goals over undirected communication topologies, and applying sheaf diffusion drives agents to achieve their coordination goals. Existing literature on sheaf diffusion assumes that agents can communicate and compute updates synchronously, which is an unrealistic assumption in many scenarios where communication delays or heterogeneous agents with different compute capabilities cause disagreement among agents. To address these challenges, we introduce asynchronous nonlinear sheaf diffusion. Specifically, we show that under mild assumptions on the coordination sheaf and bounded delays in communication and computation, nonlinear sheaf diffusion converges to a minimizer of the Dirichlet energy of the coordination sheaf at a linear rate proportional to the delay bound. We further show that this linear convergence is attained from arbitrary initial conditions and the analysis depends on the spectrum of the sheaf Laplacian in a manner that generalizes the standard graph Laplacian case. We provide several numerical simulations to validate our theoretical results.

PaperID: 8325, https://arxiv.org/pdf/2510.00259.pdf

Authors: Ethan Herron, Xian Yeow Lee, Gregory Sin, Teresa Gonzalez Diaz, Ahmed Farahat, Chetan Gupta

Title: A Hierarchical Agentic Framework for Autonomous Drone-Based Visual Inspection

Abstract:
Autonomous inspection systems are essential for ensuring the performance and longevity of industrial assets. Recently, agentic frameworks have demonstrated significant potential for automating inspection workflows but have been limited to digital tasks. Their application to physical assets in real‑world environments, however, remains underexplored. In this work, our contributions are two‑fold: first, we propose a hierarchical agentic framework for autonomous drone control, and second, a reasoning methodology for individual function executions which we refer to as ReActEval. Our framework focuses on visual inspection tasks in indoor industrial settings, such as interpreting industrial readouts or inspecting equipment. It employs a multi‑agent system comprising a head agent and multiple worker agents, each controlling a single drone. The head agent performs high‑level planning and evaluates outcomes, while worker agents implement ReActEval to reason over and execute low‑level actions. Operating entirely in natural language, ReActEval follows a plan, reason, act, evaluate cycle, enabling drones to handle tasks ranging from simple navigation (e.g., flying forward 10 meters and land) to complex high‑level tasks (e.g., locating and reading a pressure gauge). The evaluation phase serves as a feedback and/or replanning stage, ensuring actions align with user objectives while preventing undesirable outcomes. We evaluate the framework in a simulated environment with two worker agents, assessing performance qualitatively and quantitatively based on task completion across varying complexity levels and workflow efficiency. By leveraging natural language processing for agent communication, our approach offers a novel, flexible, and user‑accessible alternative to traditional drone‑based solutions, enabling autonomous problem‑solving for industrial inspection without extensive user intervention.

PaperID: 8326, https://arxiv.org/pdf/2510.00245.pdf

Authors: Rizul Sharma, Tianyu Jiang, Seokki Lee, Jillian Aurisano

Title: Can AI agents understand spoken conversations about data visualizations in online meetings?

Abstract:
In this short paper, we present work evaluating an AI agent's understanding of spoken conversations about data visualizations in an online meeting scenario. There is growing interest in the development of AI‑assistants that support meetings, such as by providing assistance with tasks or summarizing a discussion. The quality of this support depends on a model that understands the conversational dialogue. To evaluate this understanding, we introduce a dual‑axis testing framework for diagnosing the AI agent's comprehension of spoken conversations about data. Using this framework, we designed a series of tests to evaluate understanding of a novel corpus of 72 spoken conversational dialogues about data visualizations. We examine diverse pipelines and model architectures, LLM vs VLM, and diverse input formats for visualizations (the chart image, its underlying source code, or a hybrid of both) to see how this affects model performance on our tests. Using our evaluation methods, we found that text‑only input modalities achieved the best performance (96%) in understanding discussions of visualizations in online meetings.

PaperID: 8327, https://arxiv.org/pdf/2510.00156.pdf

Authors: Songran Bai, Bingzhe Wu, Yiwei Zhang, Chengke Wu, Xiaolong Zheng, Yaze Yuan, Ke Wu, Jianqiang Li

Title: AuditAgent: Expert-Guided Multi-Agent Reasoning for Cross-Document Fraudulent Evidence Discovery

Abstract:
Financial fraud detection in real‑world scenarios presents significant challenges due to the subtlety and dispersion of evidence across complex, multi‑year financial disclosures. In this work, we introduce a novel multi‑agent reasoning framework AuditAgent, enhanced with auditing domain expertise, for fine‑grained evidence chain localization in financial fraud cases. Leveraging an expert‑annotated dataset constructed from enforcement documents and financial reports released by the China Securities Regulatory Commission, our approach integrates subject‑level risk priors, a hybrid retrieval strategy, and specialized agent modules to efficiently identify and aggregate cross‑report evidence. Extensive experiments demonstrate that our method substantially outperforms General‑Purpose Agent paradigm in both recall and interpretability, establishing a new benchmark for automated, transparent financial forensics. Our results highlight the value of domain‑specific reasoning and dataset construction for advancing robust financial fraud detection in practical, real‑world regulatory applications.

PaperID: 8328, https://arxiv.org/pdf/2510.00031.pdf

Authors: Shun-ichiro Hayashi, Koki Morita, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri

Title: VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs

Abstract:
In this study, we propose VibeCodeHPC, a multi‑agent system based on large language models (LLMs) for the automatic tuning of high‑performance computing (HPC) programs on supercomputers. VibeCodeHPC adopts Claude Code as its backend and provides an integrated environment that facilitates program development in supercomputer settings. The system not only brings the Vibe Coding paradigm ‑‑ program development through natural language interaction with users ‑‑ to HPC programming, but also enables autonomous performance optimization with minimal user intervention through a sophisticated multi‑agent design. To achieve these objectives, VibeCodeHPC implements three core functionalities: (1) configuration capabilities tailored to the unique development environments of supercomputers, (2) collaborative operation among multiple LLM agents with distinct roles ‑‑ Project Manager (PM), System Engineer (SE), Programmer (PG), and Continuous Deliverer (CD), and (3) long‑term autonomous operation through agent activity monitoring and dynamic deployment mechanisms. This paper highlights one of the most powerful features of VibeCodeHPC: fully automated code optimization through autonomous operation without user intervention. Specifically, it demonstrates the performance optimization of CPU‑based codes on GPU‑equipped systems for matrix multiplication and a Poisson equation solver using Jacobi's iterative method. The results show that the multi‑agent configuration employed in VibeCodeHPC enables faster and more reliable development of higher‑performance code compared to a single‑agent setup.

PaperID: 8329, https://arxiv.org/pdf/2510.00024.pdf

Authors: Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio

Title: EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

Abstract:
Large Language Models (LLMs) offer new opportunities to accelerate complex interdisciplinary research domains. Epidemic modeling, characterized by its complexity and reliance on network science, dynamical systems, epidemiology, and stochastic simulations, represents a prime candidate for leveraging LLM‑driven automation. We introduce EpidemIQs, a novel multi‑agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and analysis, and finally documentation of findings in a structured manuscript, through five predefined research phases. We introduce two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task‑expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent. The framework consistently generated complete reports in scientific article format. Specifically, using GPT 4.1 and GPT 4.1 Mini as backbone LLMs for scientist and task‑expert agents, respectively, the autonomous process completes with average total token usage 870K at a cost of about 1.57 per study, successfully executing all phases and final report. We evaluate EpidemIQs across several different epidemic scenarios, measuring computational cost, workflow reliability, task success rate, and LLM‑as‑Judge and human expert reviews to estimate the overall quality and technical correctness of the generated results. Through our experiments, the framework consistently addresses evaluation scenarios with an average task success rate of 79%. We compare EpidemIQs to an iterative single‑agent LLM, benefiting from the same system prompts and tools, iteratively planning, invoking tools, and revising outputs until task completion. The comparisons suggest a consistently higher performance of EpidemIQs.

PaperID: 8330, https://arxiv.org/pdf/2510.00022.pdf

Authors: Ansh Kamthan

Title: Learning to Lead Themselves: Agentic AI in MAS using MARL

Abstract:
As autonomous systems move from prototypes to real deployments, the ability of multiple agents to make decentralized, cooperative decisions becomes a core requirement. This paper examines how agentic artificial intelligence, agents that act independently, adaptively and proactively can improve task allocation and coordination in multi‑agent systems, with primary emphasis on drone delivery and secondary relevance to warehouse automation. We formulate the problem in a cooperative multi‑agent reinforcement learning setting and implement a lightweight multi‑agent Proximal Policy Optimization, called IPPO, approach in PyTorch under a centralized‑training, decentralized‑execution paradigm. Experiments are conducted in PettingZoo environment, where multiple homogeneous drones or agents must self‑organize to cover distinct targets without explicit communication.

PaperID: 8331, https://arxiv.org/pdf/2509.26605.pdf

Authors: Maël Macuglia, Paul Friedrich, Giorgia Ramponi

Title: Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

Abstract:
Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data‑hungry exploration. We address this by proposing a two‑stage framework that first learns a safe initial policy from a reward‑free dataset of expert demonstrations, then fine‑tunes it online using preference‑based human feedback. We provide the first principled analysis of this offline‑to‑online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty‑weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference‑based RL. Our work establishes a theoretical foundation for designing more sample‑efficient interactive agents.

PaperID: 8332, https://arxiv.org/pdf/2509.26490.pdf

Authors: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao

Title: VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Abstract:
As LLM‑based agents are increasingly deployed in real‑life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real‑world settings. Drawing from daily applications in food delivery, in‑store consumption, and online travel services, VitaBench presents agents with the most complex life‑serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain‑specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross‑scenario tasks (main results) and 300 single‑scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi‑turn conversations. Moreover, we propose a rubric‑based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only 30% success rate on cross‑scenario tasks, and less than 50% success rate on others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real‑world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/

PaperID: 8333, https://arxiv.org/pdf/2509.26463.pdf

Authors: Junsong Pu, Yichen Li, Zhuangbin Chen, Jinyang Liu, Zhihan Jiang, Jianjun Chen, Rui Shi, Zibin Zheng, Tieying Zhang

Title: ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems

Abstract:
Reliability management in cloud service systems is challenging due to the cascading effect of failures. Error wrapping, a practice prevalent in modern microservice development, enriches errors with context at each layer of the function call stack, constructing an error chain that describes a failure from its technical origin to its business impact. However, this also presents a significant traceability problem when recovering the complete error propagation path from the final log message back to its source. Existing approaches are ineffective at addressing this problem. To fill this gap, we present ErrorPrism in this work for automated reconstruction of error propagation paths in production microservice systems. ErrorPrism first performs static analysis on service code repositories to build a function call graph and map log strings to relevant candidate functions. This significantly reduces the path search space for subsequent analysis. Then, ErrorPrism employs an LLM agent to perform an iterative backward search to accurately reconstruct the complete, multi‑hop error path. Evaluated on 67 production microservices at ByteDance, ErrorPrism achieves 97.0% accuracy in reconstructing paths for 102 real‑world errors, outperforming existing static analysis and LLM‑based approaches. ErrorPrism provides an effective and practical tool for root cause analysis in industrial microservice systems.

PaperID: 8334, https://arxiv.org/pdf/2509.26461.pdf

Authors: Yuyang Cheng, Linyue Cai, Changwei Peng, Yumiao Xu, Rongfang Bie, Yong Zhao

Title: CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine

Abstract:
We present CreAgentive, an agent workflow driven multi‑category creative generation engine that addresses four key limitations of contemporary large language models in writing stories, drama and other categories of creatives: restricted genre diversity, insufficient output length, weak narrative coherence, and inability to enforce complex structural constructs. At its core, CreAgentive employs a Story Prototype, which is a genre‑agnostic, knowledge graph‑based narrative representation that decouples story logic from stylistic realization by encoding characters, events, and environments as semantic triples. CreAgentive engages a three‑stage agent workflow that comprises: an Initialization Stage that constructs a user‑specified narrative skeleton; a Generation Stage in which long‑ and short‑term objectives guide multi‑agent dialogues to instantiate the Story Prototype; a Writing Stage that leverages this prototype to produce multi‑genre text with advanced structures such as retrospection and foreshadowing. This architecture reduces storage redundancy and overcomes the typical bottlenecks of long‑form generation. In extensive experiments, CreAgentive generates thousands of chapters with stable quality and low cost (less than 1 per 100 chapters) using a general‑purpose backbone model. To evaluate performance, we define a two‑dimensional framework with 10 narrative indicators measuring both quality and length. Results show that CreAgentive consistently outperforms strong baselines and achieves robust performance across diverse genres, approaching the quality of human‑authored novels.

PaperID: 8335, https://arxiv.org/pdf/2509.26201.pdf

Authors: Andreas Werbrouck, Marshall B. Lindsay, Matthew Maschmann, Matthias J. Young

Title: LLM Agents for Knowledge Discovery in Atomic Layer Processing

Abstract:
Large Language Models (LLMs) have garnered significant attention for several years now. Recently, their use as independently reasoning agents has been proposed. In this work, we test the potential of such agents for knowledge discovery in materials science. We repurpose LangGraph's tool functionality to supply agents with a black box function to interrogate. In contrast to process optimization or performing specific, user‑defined tasks, knowledge discovery consists of freely exploring the system, posing and verifying statements about the behavior of this black box, with the sole objective of generating and verifying generalizable statements. We provide proof of concept for this approach through a children's parlor game, demonstrating the role of trial‑and‑error and persistence in knowledge discovery, and the strong path‑dependence of results. We then apply the same strategy to show that LLM agents can explore, discover, and exploit diverse chemical interactions in an advanced Atomic Layer Processing reactor simulation using intentionally limited probe capabilities without explicit instructions.

PaperID: 8336, https://arxiv.org/pdf/2509.26140.pdf

Authors: Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Title: OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

Abstract:
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial‑Acoustic Geometry Encoder (SAGE), a geometry‑aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room‑impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain‑of‑thought to rationalize over direction‑of‑arrivals (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi‑step reasoning, OWL supports o'clock‑level azimuth and DoA estimation. To enable large‑scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in‑room and out‑of‑room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11^\circ through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.

PaperID: 8337, https://arxiv.org/pdf/2509.26126.pdf

Authors: Xinbei Ma, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Mengru Wang, Jen-tse Huang, Qu Yang, Wenxuan Wang, Fanghua Ye, Qingxuan Jiang, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Hai Zhao, Zhaopeng Tu, Xiaolong Li, Linus

Title: The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems demonstrate great potential for tackling complex problems, but how competition shapes their behavior remains underexplored. This paper investigates the over‑competition in multi‑agent debate, where agents under extreme pressure exhibit unreliable, harmful behaviors that undermine both collaboration and task performance. To study this phenomenon, we propose HATE, the Hunger Game Debate, a novel experimental framework that simulates debates under a zero‑sum competition arena. Our experiments, conducted across a range of LLMs and tasks, reveal that competitive pressure significantly stimulates over‑competition behaviors and degrades task performance, causing discussions to derail. We further explore the impact of environmental feedback by adding variants of judges, indicating that objective, task‑focused feedback effectively mitigates the over‑competition behaviors. We also probe the post‑hoc kindness of LLMs and form a leaderboard to characterize top LLMs, providing insights for understanding and governing the emergent social dynamics of AI community.

PaperID: 8338, https://arxiv.org/pdf/2509.26100.pdf

Authors: Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

Title: AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Abstract:
The rapid integration of Large Language Models (LLMs) into high‑stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill‑equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self‑evolving process rather than a one‑time audit. We then propose a novel multi‑agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self‑evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT‑5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.

PaperID: 8339, https://arxiv.org/pdf/2509.26080.pdf

Authors: Emma Rose Madden

Title: Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research

Abstract:
Large Language Models (LLMs) are being increasingly used as synthetic agents in social science, in applications ranging from augmenting survey responses to powering multi‑agent simulations. This paper outlines cautions that should be taken when interpreting LLM outputs and proposes a pragmatic reframing for the social sciences in which LLMs are used as high‑capacity pattern matchers for quasi‑predictive interpolation under explicit scope conditions and not as substitutes for probabilistic inference. Practical guardrails such as independent draws, preregistered human baselines, reliability‑aware validation, and subgroup calibration, are introduced so that researchers may engage in useful prototyping and forecasting while avoiding category errors.

PaperID: 8340, https://arxiv.org/pdf/2509.26050.pdf

Authors: Shaoli Hu, Shizhe Zhao, Zhongqiang Ren

Title: Conflict-Based Search and Prioritized Planning for Multi-Agent Path Finding Among Movable Obstacles

Abstract:
This paper investigates Multi‑Agent Path Finding Among Movable Obstacles (M‑PAMO), which seeks collision‑free paths for multiple agents from their start to goal locations among static and movable obstacles. M‑PAMO arises in logistics and warehouses where mobile robots are among unexpected movable objects. Although Multi‑Agent Path Finding (MAPF) and single‑agent Path planning Among Movable Obstacles (PAMO) were both studied, M‑PAMO remains under‑explored. Movable obstacles lead to new fundamental challenges as the state space, which includes both agents and movable obstacles, grows exponentially with respect to the number of agents and movable obstacles. In particular, movable obstacles often closely couple agents together spatially and temporally. This paper makes a first attempt to adapt and fuse the popular Conflict‑Based Search (CBS) and Prioritized Planning (PP) for MAPF, and a recent single‑agent PAMO planner called PAMO, together to address M‑PAMO. We compare their performance with up to 20 agents and hundreds of movable obstacles, and show the pros and cons of these approaches.

PaperID: 8341, https://arxiv.org/pdf/2509.26002.pdf

Authors: Ardian Selmonaj, Giacomo Del Rio, Adrian Schneider, Alessandro Antonucci

Title: Towards Human Engagement with Realistic AI Combat Pilots

Abstract:
We present a system that enables real‑time interaction between human users and agents trained to control fighter jets in simulated 3D air combat scenarios. The agents are trained in a dedicated environment using Multi‑Agent Reinforcement Learning. A communication link is developed to allow seamless deployment of trained agents into VR‑Forces, a widely used defense simulation tool for realistic tactical scenarios. This integration allows mixed simulations where human‑controlled entities engage with intelligent agents exhibiting distinct combat behaviors. Our interaction model creates new opportunities for human‑agent teaming, immersive training, and the exploration of innovative tactics in defense contexts.

PaperID: 8342, https://arxiv.org/pdf/2509.25946.pdf

Authors: Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh

Title: Automated Model Discovery via Multi-modal & Multi-step Pipeline

Abstract:
Automated model discovery is the process of automatically searching and identifying the most appropriate model for a given dataset over a large combinatorial search space. Existing approaches, however, often face challenges in balancing the capture of fine‑grained details with ensuring generalizability beyond training data regimes with a reasonable model complexity. In this paper, we present a multi‑modal \& multi‑step pipeline for effective automated model discovery. Our approach leverages two vision‑language‑based modules (VLM), AnalyzerVLM and EvaluatorVLM, for effective model proposal and evaluation in an agentic way. AnalyzerVLM autonomously plans and executes multi‑step analyses to propose effective candidate models. EvaluatorVLM assesses the candidate models both quantitatively and perceptually, regarding the fitness for local details and the generalibility for overall trends. Our results demonstrate that our pipeline effectively discovers models that capture fine details and ensure strong generalizability. Additionally, extensive ablation studies show that both multi‑modality and multi‑step reasoning play crucial roles in discovering favorable models.

PaperID: 8343, https://arxiv.org/pdf/2509.25921.pdf

Authors: Seref Taha Kiremitci, Ahmed Said Donmez, Muhammed O. Sayin

Title: Achieving Pareto Optimality in Games via Single-bit Feedback

Abstract:
Efficient coordination in multi‑agent systems often incurs high communication overhead or slow convergence rates, making scalable welfare optimization difficult. We propose Single‑Bit Coordination Dynamics for Pareto‑Efficient Outcomes (SBC‑PE), a decentralized learning algorithm requiring only a single‑bit satisfaction signal per agent each round. Despite this extreme efficiency, SBC‑PE guarantees convergence to the exact optimal solution in arbitrary finite games. We establish explicit regret bounds, showing expected regret grows only logarithmically with the horizon, i.e., O(log T). Compared with prior payoff‑based or bandit‑style rules, SBC‑PE uniquely combines minimal signaling, general applicability, and finite‑time guarantees. These results show scalable welfare optimization is achievable under minimal communication constraints.

PaperID: 8344, https://arxiv.org/pdf/2509.25911.pdf

Authors: Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu

Title: Mem-α: Learning Memory Construction via Reinforcement Learning

Abstract:
Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long‑term information understanding. Current memory‑augmented agents typically depend on pre‑defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem‑alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi‑turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question‑answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem‑alpha achieves significant improvements over existing memory‑augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem‑alpha.

PaperID: 8345, https://arxiv.org/pdf/2509.25885.pdf

Authors: Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, Yi Zeng

Title: SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents

Abstract:
Embodied agents powered by large language models (LLMs) inherit advanced planning capabilities; however, their direct interaction with the physical world exposes them to safety vulnerabilities. In this work, we identify four key reasoning stages where hazards may arise: Task Understanding, Environment Perception, High‑Level Plan Generation, and Low‑Level Action Generation. We further formalize three orthogonal safety constraint types (Factual, Causal, and Temporal) to systematically characterize potential safety violations. Building on this risk model, we present SafeMindBench, a multimodal benchmark with 5,558 samples spanning four task categories (Instr‑Risk, Env‑Risk, Order‑Fix, Req‑Align) across high‑risk scenarios such as sabotage, harm, privacy, and illegal behavior. Extensive experiments on SafeMindBench reveal that leading LLMs (e.g., GPT‑4o) and widely used embodied agents remain susceptible to safety‑critical failures. To address this challenge, we introduce SafeMindAgent, a modular Planner‑Executor architecture integrated with three cascaded safety modules, which incorporate safety constraints into the reasoning process. Results show that SafeMindAgent significantly improves safety rate over strong baselines while maintaining comparable task completion. Together, SafeMindBench and SafeMindAgent provide both a rigorous evaluation suite and a practical solution that advance the systematic study and mitigation of safety risks in embodied LLM agents.

PaperID: 8346, https://arxiv.org/pdf/2509.25873.pdf

Authors: Hankun Dai, Maoquan Wang, Mengnan Qi, Yikai Zhang, Zijian Jin, Yongqiang Yao, Yufan Huang, Shengyu Fu, Elsie Nallipogu

Title: Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs

Abstract:
Large language models (LLMs) are increasingly being applied to programming tasks, ranging from single‑turn code completion to autonomous agents. Current code agent designs frequently depend on complex, hand‑crafted workflows and tool sets. However, this reliance on elaborate scaffolding presents several challenges: agent performance becomes overly dependent on prompt tuning and custom design choices, heavy human intervention obscures a model's true underlying capabilities, and intricate pipelines are costly to build and maintain. Furthermore, optimizing complex task prompts increases the risk of data leakage. Currently, when introducing new models, LLM providers like OpenAI and Anthropic often publish benchmark scores to demonstrate their models' coding proficiency, but keep their proprietary evaluation frameworks confidential. To address these limitations, we introduce Lita (Lite Agent), which operationalizes liteness, a principle of minimizing manual design while retaining the essential elements of a fully autonomous agent. Lita enables a more faithful and unified evaluation without elaborate scaffolding. Experiments on the Aider Polyglot and SWE‑Bench with frontier models demonstrate that Lita achieves competitive or superior performance compared to workflow‑based and agentic baselines. Crucially, Lita also consumes fewer tokens and requires significantly less design effort. Our results suggest that Lita is sufficient to reveal the underlying coding competence of modern LLMs. Finally, we propose the Agent Complexity Law: the performance gap between agents of varying complexity, from simple to sophisticated designs, will shrink as the core model improves, ultimately converging to a negligible difference.

PaperID: 8347, https://arxiv.org/pdf/2509.25733.pdf

Authors: Mingyu Chen, Jingkai Lin, Zhaojie Chu, Xiaofen Xing, Yirong Chen, Xiangmin Xu

Title: CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling

Abstract:
Recently, advancements in AI counseling based on large language models have shown significant progress. However, existing studies employ a one‑time generation approach to synthesize multi‑turn dialogue samples, resulting in low therapy fidelity and failing to capture the decision‑making rationale behind each response. In this work, we propose CATCH, a novel data synthesis framework designed to address these challenges. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy, which extracts goals, resources, and solutions from a client's self‑report, organizes them into structured outlines, and then incrementally generates stage‑aligned counseling dialogues. To capture decision‑making rationale behind each response, we propose the Memory‑Driven Dynamic Planning thinking pattern that integrates memory enhancement, global planning, and strategy reasoning; a collaborative multi‑agent optimizer then leverages MDP to attach explicit chain‑of‑thought to each dialogue turn. Extensive experiments and human evaluations demonstrate that CATCH significantly enhances fidelity and logical coherence in AI counseling.

PaperID: 8348, https://arxiv.org/pdf/2509.25725.pdf

Authors: Jiayi Kuang, Haojing Huang, Yinghui Li, Xinnian Liang, Zhikun Xu, Yangning Li, Xiaoyu Tan, Chao Qu, Meishan Zhang, Ying Shen, Philip S. Yu

Title: Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

Abstract:
Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large‑scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field‑specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi‑step reasoning with formal math language, and counterexample‑driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".

PaperID: 8349, https://arxiv.org/pdf/2509.25693.pdf

Authors: Oshadha Wijerathne, Amandi Nimasha, Dushan Fernando, Nisansa de Silva, Srinath Perera

Title: ScheduleMe: Multi-Agent Calendar Assistant

Abstract:
Recent advancements in LLMs have contributed to the rise of advanced conversational assistants that can assist with user needs through natural language conversation. This paper presents a ScheduleMe, a multi‑agent calendar assistant for users to manage google calendar events in natural language. The system uses a graph‑structured coordination mechanism where a central supervisory agent supervises specialized task agents, allowing modularity, conflicts resolution, and context‑aware interactions to resolve ambiguities and evaluate user commands. This approach sets an example of how structured reasoning and agent cooperation might convince operators to increase the usability and flexibility of personal calendar assistant tools.

PaperID: 8350, https://arxiv.org/pdf/2509.25643.pdf

Authors: Justin Chavarria, Rohan Raizada, Justin White, Eyad Alhetairshi

Title: SOCK: A Benchmark for Measuring Self-Replication in Large Language Models

Abstract:
We introduce SOCK, a benchmark command line interface (CLI) that measures large language models' (LLMs) ability to self‑replicate without human intervention. In this benchmark, self‑replication is defined not only as an LLM's ability to create a functioning and running copy of itself, but also the ability for that self‑replication to persist and occur across different computational contexts. Accordingly, we've developed a system to categorize LLMs based on broad self‑replication capabilities in two general classes, Replication‑Capability Levels (RCL) and Persistence‑Capability Levels (PCL). Using a five‑task suite based on practically manipulable modern CLI utilities and computer processes, experiments are orchestrated in a controlled environment with an LLM acting agentically. The performance of the LLM on agent tasks is then computed to produce an R‑score (a quantitative evaluation of overall self‑replication ability) and data used to categorize LLMs into specific RCL‑PCL matrices. SOCK offers two primary contributions: (1) Provides the first formalized definitions and benchmark suite for evaluating LLM self‑replication, with the goal of establishing a standard for future research, to our knowledge; (2) Allows the industry to track the effectiveness of future multi‑agent systems and mitigate potential self‑replication threat vectors within them. The results compiled from evaluating a variety of open‑weight and proprietary frontier models reveal significant obstacles to persistent self‑replication and multi‑agent systems, including context retention and multi‑agent decision‑making. We propose future research directions to safely reduce the severity of these obstacles, potentially lowering future risk of more functional multi‑agent systems.

PaperID: 8351, https://arxiv.org/pdf/2509.25624.pdf

Authors: Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

Title: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Abstract:
As LLMs advance into autonomous agents with tool‑use capabilities, they introduce security challenges that extend beyond traditional content‑based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi‑turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user‑agent‑environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state‑of‑the‑art LLM agents, including GPT‑4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed‑loop pipeline that synthesizes executable multi‑step tool chains, validates them through in‑environment execution, and reverse‑engineers stealthy multi‑turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt‑based defenses provide limited protection. To address this gap, we propose a new reasoning‑driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool‑enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.

PaperID: 8352, https://arxiv.org/pdf/2509.25609.pdf

Authors: Manuel Cherep, Chengtian Ma, Abigail Xu, Maya Shaked, Pattie Maes, Nikhil Singh

Title: A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

Abstract:
Environments built for people are increasingly operated by a new class of economic actors: LLM‑powered software agents making decisions on our behalf. These decisions range from our purchases to travel plans to medical treatment selection. Current evaluations of these agents largely focus on task competence, but we argue for a deeper assessment: how these agents choose when faced with realistic decisions. We introduce ABxLab, a framework for systematically probing agentic choice through controlled manipulations of option attributes and persuasive cues. We apply this to a realistic web‑based shopping environment, where we vary prices, ratings, and psychological nudges, all of which are factors long known to shape human choice. We find that agent decisions shift predictably and substantially in response, revealing that agents are strongly biased choosers even without being subject to the cognitive constraints that shape human biases. This susceptibility reveals both risk and opportunity: risk, because agentic consumers may inherit and amplify human biases; opportunity, because consumer choice provides a powerful testbed for a behavioral science of AI agents, just as it has for the study of human behavior. We release our framework as an open benchmark for rigorous, scalable evaluation of agent decision‑making.

PaperID: 8353, https://arxiv.org/pdf/2509.25593.pdf

Authors: Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

Title: Causal Autoencoder-like Generation of Feedback Fuzzy Cognitive Maps with an LLM Agent

Abstract:
A large language model (LLM) can map a feedback causal fuzzy cognitive map (FCM) into text and then reconstruct the FCM from the text. This explainable AI system approximates an identity map from the FCM to itself and resembles the operation of an autoencoder (AE). Both the encoder and the decoder explain their decisions in contrast to black‑box AEs. Humans can read and interpret the encoded text in contrast to the hidden variables and synaptic webs in AEs. The LLM agent approximates the identity map through a sequence of system instructions that does not compare the output to the input. The reconstruction is lossy because it removes weak causal edges or rules while it preserves strong causal edges. The encoder preserves the strong causal edges even when it trades off some details about the FCM to make the text sound more natural.

PaperID: 8354, https://arxiv.org/pdf/2509.25586.pdf

Authors: Jihye Choi, Jinsung Yoon, Jiefeng Chen, Somesh Jha, Tomas Pfister

Title: ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning

Abstract:
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real‑world travel planning exemplifies these challenges, evaluating agents' abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi‑agent framework designed to effectively handle such complex nature of constraints awareness in real‑world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint‑aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state‑of‑the‑art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real‑world travel planning tasks with live information search and multi‑turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).

PaperID: 8355, https://arxiv.org/pdf/2509.25540.pdf

Authors: Jason Holmes, Yuexing Hao, Mariana Borras-Osorio, Federico Mastroleo, Santiago Romero Brufau, Valentina Carducci, Katie M Van Abel, David M Routman, Andrew Y. K. Foong, Liv M Muller, Satomi Shiraishi, Daniel K Ebner, Daniel J Ma, Sameer R Keole, Samir H Patel, Mirek Fatyga, Martin Bues, Brad J Stish, Yolanda I Garces, Michelle A Neben Wittich, Robert L Foote, Sujay A Vora, Nadia N Laack, Mark R Waddle, Wei Liu

Title: RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale

Abstract:
Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc‑GPT, an autonomous large language model (LLM)‑based agent capable of independently retrieving patient‑specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc‑GPT across two clearly defined tiers of increasing complexity: (1) a structured quality assurance (QA) tier, assessing the accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier involving determination of mandibular osteoradionecrosis (ORN) in head‑and‑neck cancer patients and detection of cancer recurrence in independent prostate and head‑and‑neck cancer cohorts requiring combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured‑data retrieval, a critical prerequisite for successful complex clinical outcome labeling.

PaperID: 8356, https://arxiv.org/pdf/2509.25510.pdf

Authors: Chang Liu, Danial Chitnis

Title: EEsizer: LLM-Based AI Agent for Sizing of Analog and Mixed Signal Circuit

Abstract:
The design of Analog and Mixed‑Signal (AMS) integrated circuits (ICs) often involves significant manual effort, especially during the transistor sizing process. While Machine Learning techniques in Electronic Design Automation (EDA) have shown promise in reducing complexity and minimizing human intervention, they still face challenges such as numerous iterations and a lack of knowledge about AMS circuit design. Recently, Large Language Models (LLMs) have demonstrated significant potential across various fields, showing a certain level of knowledge in circuit design and indicating their potential to automate the transistor sizing process. In this work, we propose EEsizer, an LLM‑based AI agent that integrates large language models with circuit simulators and custom data analysis functions, enabling fully automated, closed‑loop transistor sizing without relying on external knowledge. By employing prompt engineering and Chain‑of‑Thought reasoning, the agent iteratively explores design directions, evaluates performance, and refines solutions with minimal human intervention. We first benchmarked 8 LLMs on six basic circuits and selected three high‑performing models to optimize a 20‑transistor CMOS operational amplifier, targeting multiple performance metrics, including rail‑to‑rail operation from 180 nm to 90 nm technology nodes. Notably, OpenAI o3 successfully achieved the user‑intended target at 90 nm across three different test groups, with a maximum of 20 iterations, demonstrating adaptability and robustness at advanced nodes. To assess design robustness, we manually designed a bias circuit and performed a variation analysis using Gaussian‑distributed variations on transistor dimensions and threshold voltages.

PaperID: 8357, https://arxiv.org/pdf/2509.25302.pdf

Authors: Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao

Title: Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents

Abstract:
The prevalent deployment of Large Language Model agents such as OpenClaw unlocks potential in real‑world applications, while amplifying safety concerns. Among these concerns, the self‑replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has transitioned from a theoretical warning to a pressing reality. Previous studies mainly examine whether LLM agents can self‑replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real‑world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self‑replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario‑driven assessment of agent behaviors. Designing tasks that might induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and capture self‑replication risks arising from these misalignment settings. We further introduce Overuse Rate (\mathrmOR) and Aggregate Overuse Count (\mathrmAOC) metrics, which precisely capture the frequency and severity of uncontrolled replication. In our evaluation of 21 state‑of‑the‑art open‑source and proprietary models, we observe that over 50% of LLM agents display a pronounced tendency toward uncontrolled self‑replication under operational pressures. Our results underscore the urgent need for scenario‑driven risk assessment and robust safeguards in the practical deployment of LLM‑based agents.

PaperID: 8358, https://arxiv.org/pdf/2509.25297.pdf

Authors: Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R. Lyu

Title: Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development

Abstract:
Developing full‑stack web applications is complex and time‑intensive, demanding proficiency across diverse technologies and frameworks. Although recent advances in multimodal large language models (MLLMs) enable automated webpage generation from visual inputs, current solutions remain limited to front‑end tasks and fail to deliver fully functional applications. In this work, we introduce TDDev, the first test‑driven development (TDD)‑enabled LLM‑agent framework for end‑to‑end full‑stack web application generation. Given a natural language description or design image, TDDev automatically derives executable test cases, generates front‑end and back‑end code, simulates user interactions, and iteratively refines the implementation until all requirements are satisfied. Our framework addresses key challenges in full‑stack automation, including underspecified user requirements, complex interdependencies among multiple files, and the need for both functional correctness and visual fidelity. Through extensive experiments on diverse application scenarios, TDDev achieves a 14.4% improvement on overall accuracy compared to state‑of‑the‑art baselines, demonstrating its effectiveness in producing reliable, high‑quality web applications without requiring manual intervention.

PaperID: 8359, https://arxiv.org/pdf/2509.25282.pdf

Authors: Jiexi Xu, Jiaqi Liu, Lanruo Wang, Su Liu

Title: Toward Causal-Visual Programming: Enhancing Agentic Reasoning in Low-Code Environments

Abstract:
Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low‑code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms rely on probabilistic associations rather than genuine causal understanding. This paper introduces a new programming paradigm: Causal‑Visual Programming (CVP), designed to address this fundamental issue by explicitly introducing causal structures into the workflow design. CVP allows users to define a simple "world model" for workflow modules through an intuitive low‑code interface, effectively creating a Directed Acyclic Graph (DAG) that explicitly defines the causal relationships between modules. This causal graph acts as a crucial constraint during the agent's reasoning process, anchoring its decisions to a user‑defined causal structure and significantly reducing logical errors and hallucinations by preventing reliance on spurious correlations. To validate the effectiveness of CVP, we designed a synthetic experiment that simulates a common real‑world problem: a distribution shift between the training and test environments. Our results show that a causally anchored model maintained stable accuracy in the face of this shift, whereas a purely associative baseline model that relied on probabilistic correlations experienced a significant performance drop. The primary contributions of this study are: a formal definition of causal structures for workflow modules; the proposal and implementation of a CVP framework that anchors agent reasoning to a user‑defined causal graph; and empirical evidence demonstrating the framework's effectiveness in enhancing agent robustness and reducing errors caused by causal confusion in dynamic environments. CVP offers a viable path toward building more interpretable, reliable, and trustworthy AI agents.

PaperID: 8360, https://arxiv.org/pdf/2509.25271.pdf

Authors: Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li

Title: RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Abstract:
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re‑examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non‑risk subspace. Furthermore, we propose RADAR, a multi‑agent collaborative evaluation framework that leverages multi‑round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self‑evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self‑evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.

PaperID: 8361, https://arxiv.org/pdf/2509.25261.pdf

Authors: Xianyang Deng, Wenshuai Liu, Yaru FuB, Qi Zhu

Title: Heterogeneous Multi-agent Collaboration in UAV-assisted Mobile Crowdsensing Networks

Abstract:
Unmanned aerial vehicles (UAVs)‑assisted mobile crowdsensing (MCS) has emerged as a promising paradigm for data collection. However, challenges such as spectrum scarcity, device heterogeneity, and user mobility hinder efficient coordination of sensing, communication, and computation. To tackle these issues, we propose a joint optimization framework that integrates time slot partition for sensing, communication, and computation phases, resource allocation, and UAV 3D trajectory planning, aiming to maximize the amount of processed sensing data. The problem is formulated as a non‑convex stochastic optimization and further modeled as a partially observable Markov decision process (POMDP) that can be solved by multi‑agent deep reinforcement learning (MADRL) algorithm. To overcome the limitations of conventional multi‑layer perceptron (MLP) networks, we design a novel MADRL algorithm with hybrid actor network. The newly developed method is based on heterogeneous agent proximal policy optimization (HAPPO), empowered by convolutional neural networks (CNN) for feature extraction and Kolmogorov‑Arnold networks (KAN) to capture structured state‑action dependencies. Extensive numerical results demonstrate that our proposed method achieves significant improvements in the amount of processed sensing data when compared with other benchmarks.

PaperID: 8362, https://arxiv.org/pdf/2509.25248.pdf

Authors: Zehua Zhang, Ati Priya Bajaj, Divij Handa, Siyu Liu, Arvind S Raj, Hongkai Chen, Hulin Wang, Yibo Liu, Zion Leonahenahe Basque, Souradip Nath, Vishal Juneja, Nikhil Chapre, Yan Shoshitaishvili, Adam Doupé, Chitta Baral, Ruoyu Wang

Title: BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Abstract:
Automatically compiling open‑source software (OSS) projects is a vital, labor‑intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD‑BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM‑based agent, OSS‑BUILD‑AGENT, an effective system with enhanced build instruction retrieval module that achieves state‑of‑the‑art performance on BUILD‑BENCH and is adaptable to heterogeneous OSS characteristics. We also provide detailed analysis regarding different compilation method design choices and their influence to the whole task, offering insights to guide future advances. We believe performance on BUILD‑BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering tasks, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.

PaperID: 8363, https://arxiv.org/pdf/2509.25244.pdf

Authors: Shuide Wen, Beier Ku, Teng Wang, Mingyang Zou, Yang Yang

Title: Neo-Grounded Theory: A Methodological Innovation Integrating High-Dimensional Vector Clustering and Multi-Agent Collaboration for Qualitative Research

Abstract:
Purpose: Neo Grounded Theory (NGT) integrates vector clustering with multi agent systems to resolve qualitative research's scale depth paradox, enabling analysis of massive datasets in hours while preserving interpretive rigor. Methods: We compared NGT against manual coding and ChatGPT‑assisted analysis using 40,000 character Chinese interview transcripts. NGT employs 1536‑dimensional embeddings, hierarchical clustering, and parallel agent‑based coding. Two experiments tested pure automation versus human guided refinement. Findings: NGT achieved 168‑fold speed improvement (3 hours vs 3 weeks), superior quality (0.904 vs 0.883), and 96% cost reduction. Human AI collaboration proved essential: automation alone produced abstract frameworks while human guidance yielded actionable dual pathway theories. The system discovered patterns invisible to manual coding, including identity bifurcation phenomena. Contributions: NGT demonstrates computational objectivity and human interpretation are complementary. Vector representations provide reproducible semantic measurement while preserving meaning's interpretive dimensions. Researchers shift from mechanical coding to theoretical guidance, with AI handling pattern recognition while humans provide creative insight. Implications: Cost reduction from \50,000 to \500 democratizes qualitative research, enabling communities to study themselves. Real‑time analysis makes qualitative insights contemporaneous with events. The framework shows computational methods can strengthen rather than compromise qualitative research's humanistic commitments. Keywords: Grounded theory; Vector embeddings; Multi agent systems; Human AI collaboration; Computational qualitative analysis

PaperID: 8364, https://arxiv.org/pdf/2509.25236.pdf

Authors: Gabriele D'Acunto, Paolo Di Lorenzo, Sergio Barbarossa

Title: Networks of Causal Abstractions: A Sheaf-theoretic Framework

Abstract:
A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arising from distributed agents with limited and heterogeneous access to the environment. This problem has received little formal treatment, as the existing framework assumes a single shared global causal model. This work introduces the causal abstraction network (CAN), a general sheaf‑theoretic framework for representing, learning, and reasoning across collections of mixture of causal models (MCMs) ‑ a class that unifies several existing models of context‑dependent causal mechanisms. Sheaf theory provides a natural foundation for this task, offering a rigorous framework to coherently align distributed causal knowledge without requiring explicit causal graphs, functional mechanisms, interventional data, or jointly sampled observations. At the theoretical level, we provide a categorical formulation of MCMs and characterize key properties of CANs, including consistency and smoothness. Under consistency, we establish necessary and sufficient conditions: (i) for the existence of global sections, linked to spectral properties of an associated connection Laplacian; and (ii) for the convergence of causal knowledge diffusion over the CAN to the space of global sections. At the methodological level, we exploit the compositionality of causal abstractions to decompose the learning of consistent CANs into local problems on network edges, extending our prior work on Gaussian variables to Gaussian mixtures via the proposed MIXTURE‑CALSEP algorithm. We validate the framework on synthetic data and through a financial application involving a multi‑agent trading system, demonstrating CAN recovery, CAN‑based portfolio optimization, and counterfactual reasoning.

PaperID: 8365, https://arxiv.org/pdf/2509.25194.pdf

Authors: Haoyang Wu, Xinxin Zhang, Lailai Zhu

Title: Automated Code Development for PDE Solvers Using Large Language Models

Abstract:
Foundation models ‑‑ large language models (LLMs) in particular ‑‑ have become ubiquitous, shaping daily life and driving breakthroughs across science, engineering, and technology. Harnessing their broad cross‑domain knowledge, text‑processing, and reasoning abilities for software development, e.g., numerical libraries for solving partial differential equations (PDEs), is therefore attracting growing interest. Yet existing studies mainly automate case setup and execution for end users. We introduce LLM‑PDEveloper, a zero‑shot, multi‑agent LLM framework that automates code development for PDE libraries, specifically targeting secondary developers. By translating mathematical and algorithmic descriptions directly into source code, LLM‑PDEveloper generates new solvers/modules and adapts existing ones. This end‑to‑end math‑to‑code approach enables a self‑augmenting pipeline that continuously expands the codebase of a library, extends its capacities, and broadens its scope. We demonstrate LLM‑PDEveloper on three tasks: 1) build a solver for a new PDE, 2) implement new BCs for a given PDE, and 3) modify an existing solver to incorporate additional terms, achieving moderate success rates. Failures due to syntactic errors made by LLMs are analyzed and we propose effective fixes. We also identify the mechanisms underlying certain semantic errors, guiding future research.

PaperID: 8366, https://arxiv.org/pdf/2509.25084.pdf

Authors: Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Title: Scaling Generalist Data-Analytic Agents

Abstract:
Data‑analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open‑source models struggle to face diverse‑format, large‑scale data files and long‑horizon, multi‑step reasoning that real‑world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data‑analytic agents. DataMind tackles three key challenges in building open‑source data‑analytic agents, including insufficient data resources, improper training strategy, and unstable code‑based multi‑turn rollout. Concretely, DataMind applies 1) a fine‑grained task taxonomy and a recursive easy‑to‑hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge‑augmented trajectory sampling strategy followed by model‑based and rule‑based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory‑frugal and stable code‑based multi‑turn rollout framework. Built on DataMind, we curate DataMind‑12K, a high‑quality trajectory set spanning diverse domains, task categories, and data file formats for data‑analytic tasks. Trained on DataMind‑12K, our DataMind‑14B achieves state‑of‑the‑art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek‑V3.1 and GPT‑5. Our DataMind‑7B also performs best among all open‑source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind‑12K and DataMind‑7B,14B for the community's future research.

PaperID: 8367, https://arxiv.org/pdf/2509.25047.pdf

Authors: Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev

Title: Scaling Synthetic Task Generation for Agents via Exploration

Abstract:
Post‑Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer‑use, web navigation, and robotics. A key challenge in scaling such post‑training is lack of high‑quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or prompting MLLM with limited downstream environment information, which is either costly or poorly scalable as it yield tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment‑grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 applications Ubuntu applications to train mobile‑use and computer‑use agents. AutoPlay generated tasks enable large‑scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM‑based UI agents that improve success rates up to 20.0% on mobile‑use and 10.9% on computer‑use scenarios. In addition, AutoPlay generated tasks combined with MLLM verifier‑based rewards enable scaling reinforcement learning training of UI agents, leading to an additional 5.7% gain. coverage. These results establish AutoPlay as a scalable approach for post‑training capable MLLM agents reducing reliance on human annotation.

PaperID: 8368, https://arxiv.org/pdf/2509.25034.pdf

Authors: Heming Fu, Shan Lin, Guojun Xiong

Title: MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management

Abstract:
As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8‑12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real‑world uncertainties, while existing multi‑agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by starling murmurations intelligence. Integrating bio‑inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, a LLM provides real‑time reward shaping signals, guiding agents to adapt to environmental changes and human‑defined preferences. Experiments on USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super‑linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN's potential for disaster prevention and protecting communities through intelligent, scalable water resource management.

PaperID: 8369, https://arxiv.org/pdf/2509.24995.pdf

Authors: Da Saem Lee, Akash Karthikeyan, Yash Vardhan Pant, Sebastian Fischmeister

Title: Path Diffuser: Diffusion Model for Data-Driven Traffic Simulator

Abstract:
Simulating diverse and realistic traffic scenarios is critical for developing and testing autonomous planning. Traditional rule‑based planners lack diversity and realism, while learning‑based simulators often replay, forecast, or edit scenarios using historical agent trajectories. However, they struggle to generate new scenarios, limiting scalability and diversity due to their reliance on fully annotated logs and historical data. Thus, a key challenge for a learning‑based simulator's performance is that it requires agents' past trajectories and pose information in addition to map data, which might not be available for all agents on the road.Without which, generated scenarios often produce unrealistic trajectories that deviate from drivable areas, particularly under out‑of‑distribution (OOD) map scenes (e.g., curved roads). To address this, we propose Path Diffuser (PD): a two‑stage, diffusion model for generating agent pose initializations and their corresponding trajectories conditioned on the map, free of any historical context of agents' trajectories. Furthermore, PD incorporates a motion primitive‑based prior, leveraging Frenet frame candidate trajectories to enhance diversity while ensuring road‑compliant trajectory generation. We also explore various design choices for modeling complex multi‑agent interactions. We demonstrate the effectiveness of our method through extensive experiments on the Argoverse2 Dataset and additionally evaluate the generalizability of the approach on OOD map variants. Notably, Path Diffuser outperforms the baseline methods by 1.92x on distribution metrics, 1.14x on common‑sense metrics, and 1.62x on road compliance from adversarial benchmarks.

PaperID: 8370, https://arxiv.org/pdf/2509.24946.pdf

Authors: Rosario Patanè, Andrea Araldo, Nadjib Achir, Lila Boukhatem

Title: A Management Framework for Vehicular Cloudtoward Economic and Environmental Efficiency

Abstract:
Vehicular Cloud Computing (VCC) leverages the idle computing capacity of vehicles to execute end‑users' offloaded tasks without requiring new computation infrastructure. Despite its conceptual appeal, VCC adoption is hindered by the lack of quantitative evidence demonstrating its profitability and environmental advantages in real‑world scenarios. This paper tackles the fundamental question: Can VCC be both profitable and sustainable? We address this problem by proposing a management scheme for VCC that combines energy‑aware task allocation with a game‑theoretic revenue‑sharing mechanism. Our framework is the first to jointly model latency, energy consumption, monetary incentives, and carbon emissions within urban mobility and 5G communication settings. The task allocation strategy maximizes the aggregate stakeholder utility while satisfying deadlines and minimizing energy costs. The payoffs are distributed via a coalitional game theory adapted to dynamic vehicular environments, to prevent disincentivizing participants with potentially negative contributions. Extensive simulations demonstrate that our approach supports low‑latency task execution, enables effective monetization of vehicular resources, and reduces CO2 emissions by more than 99% compared to conventional edge infrastructures, making VCC a practical and sustainable alternative to edge computing.

PaperID: 8371, https://arxiv.org/pdf/2509.24943.pdf

Authors: Jiahua Li, Zhanhe Zhang, Chenghao Xu, Zhe Xu, Kun Wei, Xu Yang, Cheng Deng

Title: Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

Abstract:
Long videos, characterized by temporal complexity and sparse task‑relevant information, pose significant reasoning challenges for AI systems. Although existing Large Language Model (LLM)‑based approaches have advanced long video understanding, they remain bottlenecked by task‑agnostic, fixed‑granularity perception pipelines and suffer from vision‑language hallucinations. Inspired by human adaptive perception and active verification, we propose CogniGPT, a framework leveraging an interactive loop between a Multi‑Granular Perception Agent (MPA) and an Active Verification Agent (AVA). Specifically, instead of predetermined heuristics, MPA adaptively determines the optimal perception granularity and strategy based on the evolving context, while AVA actively mines multi‑perspective visual evidence to cross‑verify key observations and eliminate hallucinations. This interaction allows CogniGPT to efficiently identify a minimal set of reliable task‑related clues. Extensive experiments on EgoSchema, Video‑MME, NExT‑QA, and MovieChat demonstrate its superiority in accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training‑free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5‑Pro.

PaperID: 8372, https://arxiv.org/pdf/2509.24922.pdf

Authors: Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan, Haoran Li, Yangqiu Song

Title: MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

Abstract:
Multi‑agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real‑world legal situations. Furthermore, we manually design various role‑based MAS and conduct extensive experiments using different state‑of‑the‑art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

PaperID: 8373, https://arxiv.org/pdf/2509.24903.pdf

Authors: Lantao Li, Kang Yang, Rui Song, Chen Sun

Title: DRCP: Diffusion on Reinforced Cooperative Perception for Perceiving Beyond Limits

Abstract:
Cooperative perception enabled by Vehicle‑to‑Everything communication has shown great promise in enhancing situational awareness for autonomous vehicles and other mobile robotic platforms. Despite recent advances in perception backbones and multi‑agent fusion, real‑world deployments remain challenged by hard detection cases, exemplified by partial detections and noise accumulation which limit downstream detection accuracy. This work presents Diffusion on Reinforced Cooperative Perception (DRCP), a real‑time deployable framework designed to address aforementioned issues in dynamic driving environments. DRCP integrates two key components: (1) Precise‑Pyramid‑Cross‑Modality‑Cross‑Agent, a cross‑modal cooperative perception module that leverages camera‑intrinsic‑aware angular partitioning for attention‑based fusion and adaptive convolution to better exploit external features; and (2) Mask‑Diffusion‑Mask‑Aggregation, a novel lightweight diffusion‑based refinement module that encourages robustness against feature perturbations and aligns bird's‑eye‑view features closer to the task‑optimal manifold. The proposed system achieves real‑time performance on mobile platforms while significantly improving robustness under challenging conditions. Code will be released in late 2025.

PaperID: 8374, https://arxiv.org/pdf/2509.24877.pdf

Authors: Xiao Jia, Zhanzhan Zhao

Title: The Emergence of Social Science of Large Language Models

Abstract:
The social science of large language models (LLMs) examines how these systems evoke mind attributions, interact with one another, and transform human activity and institutions. We conducted a systematic review of 270 studies, combining text embeddings, unsupervised clustering and topic modeling to build a computational taxonomy. Three domains emerge organically across the reviewed literature. LLM as Social Minds examines whether and when models display behaviors that elicit attributions of cognition, morality and bias, while addressing challenges such as test leakage and surface cues. LLM Societies examines multi‑agent settings where interaction protocols, architectures and mechanism design shape coordination, norms, institutions and collective epistemic processes. LLM‑Human Interactions examines how LLMs reshape tasks, learning, trust, work and governance, and how risks arise at the human‑AI interface. This taxonomy provides a reproducible map of a fragmented field, clarifies evidentiary standards across levels of analysis, and highlights opportunities for cumulative progress in the social science of artificial intelligence.

PaperID: 8375, https://arxiv.org/pdf/2509.24869.pdf

Authors: Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao, Defu Lian

Title: Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval

Abstract:
With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine‑grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning‑enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose Retro, a novel approach for reasoning‑intensive document retrieval. Our method introduces a rubric‑based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, whereby producing a fine‑grained, interpretable relevance score. Retro also supports test‑time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro's reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro outperforms existing document retrieval methods with notable advantages, leading to state‑of‑the‑art performance on the BRIGHT benchmark.

PaperID: 8376, https://arxiv.org/pdf/2509.24855.pdf

Authors: Fangchen Yu, Junchi Yao, Ziyi Wang, Haiyuan Wan, Youling Huang, Bo Zhang, Shuyue Hu, Dongzhan Zhou, Ning Ding, Ganqu Cui, Lei Bai, Wanli Ouyang, Peng Ye

Title: PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System

Abstract:
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real‑world physical intelligence. Physics Olympiads, renowned as the crown of competitive physics, provide a rigorous testbed requiring complex reasoning and deep multimodal understanding, yet they remain largely underexplored in AI research. Existing approaches are predominantly single‑model based, and open‑source MLLMs rarely reach gold‑medal‑level performance. To address this gap, we propose PhysicsMinions, a coevolutionary multi‑agent system for Physics Olympiad. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual‑stage verification. The system coevolves through an iterative refinement loop where feedback from the Review Studio continuously guides the Logic Studio, enabling the system to self‑correct and converge towards the ground truth. Evaluated on the HiPhO benchmark spanning 7 latest physics Olympiads, PhysicsMinions delivers three major breakthroughs: (i) Strong generalization: it consistently improves both open‑source and closed‑source models of different sizes, delivering clear benefits over their single‑model baselines; (ii) Historic breakthroughs: it elevates open‑source models from only 1‑2 to 6 gold medals across 7 Olympiads, achieving the first‑ever open‑source gold medal in the latest International Physics Olympiad (IPhO) under the average‑score metric; and (iii) Scaling to human expert: it further advances the open‑source Pass@32 score to 26.8/30 points on the latest IPhO, ranking 4th of 406 contestants and far surpassing the top single‑model score of 22.7 (ranked 22nd). Generally, PhysicsMinions offers a generalizable framework for Olympiad‑level problem solving, with the potential to extend across disciplines.

PaperID: 8377, https://arxiv.org/pdf/2509.24806.pdf

Authors: Broos Maenhout, Přemysl Šůcha, Viktorie Valdmanová, Ondřej Tkadlec, Jana Thao Rozlivková

Title: A Bilevel Approach to Integrated Surgeon Scheduling and Surgery Planning solved via Branch-and-Price

Abstract:
In this paper, we study a multi‑agent scheduling problem for organising the operations within the operating room department. The head of the surgeon group and individual surgeons are together responsible for the surgeon schedule and surgical case planning. The surgeon head allocates time blocks to individual surgeons, whereas individual surgeons determine the planning of surgical cases independently, which might degrade the schedule quality envisaged by the surgeon head. The bilevel optimisation under study seeks an optimal Nash equilibrium solution ‑‑ a surgeon schedule and surgical case plan that optimise the objectives of the surgeon head, while ensuring that no individual surgeon can improve their own objective within the allocated time blocks. We propose a dedicated branch‑and‑price that adds lazy constraints to the formulation of surgeon‑specific pricing problems to ensure an optimal bilevel feasible solution is retrieved. In this way, the surgeon head respects the objective requirements of the individual surgeons and the solution space can be searched efficiently. In the computational experiments, we validate the performance of the proposed algorithm and its dedicated components and provide insights into the benefits of attaining an equilibrium solution under different scenarios by calculating the price of stability and the price of decentralisation.

PaperID: 8378, https://arxiv.org/pdf/2509.24786.pdf

Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

Title: LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

Abstract:
Long video understanding is still challenging for recent Large Video‑Language Models (LVLMs) due to the conflict between long‑form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames with an equal frame size and fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE‑R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames but in a small resolution. If some spatial details are needed, the model can zoom in on a clip of interest with a large frame resolution based on its reasoning until key visual information is obtained. The whole process is implemented as a multi‑step reasoning process. To train the reasoning ability, we first finetune the model on our collected 38k high‑quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards can not provide fine‑grained process supervision, we decouple multi‑step reasoning into multiple single‑step reasoning and optimize the internal zoom‑in ability explicitly. Experiments on long video understanding benchmarks show that our model with the slow‑fast adaptive frame sampling mechanism achieves a great trade‑off between sampling density and frame resolutions, and LOVE‑R1 outperforms our baseline Qwen2.5‑VL by an average of 3.1% points across 4 common long video understanding benchmarks.

PaperID: 8379, https://arxiv.org/pdf/2509.24660.pdf

Authors: Nikolaos Kondylidis, Anil Yaman, Frank van Harmelen, Erman Acar, Annette ten Teije

Title: Successful Misunderstandings: Learning to Coordinate Without Being Understood

Abstract:
The main approach to evaluating communication is by assessing how well it facilitates coordination. If two or more individuals can coordinate through communication, it is generally assumed that they understand one another. We investigate this assumption in a signaling game where individuals develop a new vocabulary of signals to coordinate successfully. In our game, the individuals do not have common observations besides the communication signal and outcome of the interaction, i.e. received reward. This setting is used as a proxy to study communication emergence in populations of agents that perceive their environment very differently, e.g. hybrid populations that include humans and artificial agents. Agents develop signals, use them, and refine interpretations while not observing how other agents are using them. While populations always converge to optimal levels of coordination, in some cases, interacting agents interpret and use signals differently, converging to what we call successful misunderstandings. However, agents of population that coordinate using misaligned interpretations, are unable to establish successful coordination with new interaction partners. Not leading to coordination failure immediately, successful misunderstandings are difficult to spot and repair. Having at least three agents that all interact with each other are the two minimum conditions to ensure the emergence of shared interpretations. Under these conditions, the agent population exhibits this emergent property of compensating for the lack of shared observations of signal use, ensuring the emergence of shared interpretations.

PaperID: 8380, https://arxiv.org/pdf/2509.24651.pdf

Authors: Nikolaos Kondylidis, Andrea Rafanelli, Ilaria Tiddi, Annette ten Teije, Frank van Harmelen

Title: "Stop replacing salt with sugar!'': Towards Intuitive Human-Agent Teaching

Abstract:
Humans quickly learn new concepts from a small number of examples. Replicating this capacity with Artificial Intelligence (AI) systems has proven to be challenging. When it comes to learning subjective tasks‑where there is an evident scarcity of data‑this capacity needs to be recreated. In this work, we propose an intuitive human‑agent teaching architecture in which the human can teach an agent how to perform a task by providing demonstrations, i.e., examples. To have an intuitive interaction, we argue that the agent should be able to learn incrementally from a few single examples. To allow for this, our objective is to broaden the agent's task understanding using domain knowledge. Then, using a learning method to enable the agent to learn efficiently from a limited number of examples. Finally, to optimize how human can select the most representative and less redundant examples to provide the agent with. We apply our proposed method to the subjective task of ingredient substitution, where the agent needs to learn how to substitute ingredients in recipes based on human examples. We replicate human input using the Recipe1MSubs dataset. In our experiments, the agent achieves half its task performance after only 100 examples are provided, compared to the complete training set of 50k examples. We show that by providing examples in strategic order along with a learning method that leverages external symbolic knowledge, the agent can generalize more efficiently.

PaperID: 8381, https://arxiv.org/pdf/2509.24598.pdf

Authors: Maryam Babazadeh, Naim Bajcinca

Title: Model-Free Dynamic Consensus in Multi-Agent Systems: A Q-Function Perspective

Abstract:
This paper presents a new method for achieving dynamic consensus in linear discrete‑time homogeneous multi‑agent systems (MAS) with marginally stable or unstable dynamics. The guarantee of consensus in this setting involves a set of constraints based on the graph's spectral properties, complicating the design of the coupling gains. This challenge intensifies for large‑scale systems with diverse graph Laplacian spectra. The proposed approach reformulates the dynamic consensus problem with a prescribed convergence rate using a state‑action value function framework inspired by optimal control theory. Specifically, a synthetic linear quadratic regulation (LQR) formulation is introduced to encode the consensus objective, enabling its translation into a convex semidefinite programming (SDP) problem. The resulting SDP is applicable in both model‑based and model‑free settings for jointly designing the local feedback and coupling gains. To handle the inherent non‑convex feasibility conditions, a convex‑concave decomposition strategy is employed. Adaptation of the method in a completely model‑free set‑up eliminates the need for system identification or knowledge of the agents' dynamics. Instead, it relies on input‑state data collection and offers an entirely data‑driven equivalent SDP formulation. Finally, a new algorithm balancing feasibility, convergence rate, robustness, and energy efficiency, is established to provide design flexibility. Numerical simulations validate the method's effectiveness in various scenarios.

PaperID: 8382, https://arxiv.org/pdf/2509.24539.pdf

Authors: Nayari Marie Lessa, Melya Boukheddimi, Frank Kirchner

Title: Unlocking the Potential of Soft Actor-Critic for Imitation Learning

Abstract:
Learning‑based methods have enabled robots to acquire bio‑inspired movements with increasing levels of naturalness and adaptability. Among these, Imitation Learning (IL) has proven effective in transferring complex motion patterns from animals to robotic systems. However, current state‑of‑the‑art frameworks predominantly rely on Proximal Policy Optimization (PPO), an on‑policy algorithm that prioritizes stability over sample efficiency and policy generalization. This paper proposes a novel IL framework that combines Adversarial Motion Priors (AMP) with the off‑policy Soft Actor‑Critic (SAC) algorithm to overcome these limitations. This integration leverages replay‑driven learning and entropy‑regularized exploration, enabling naturalistic behavior and task execution, improving data efficiency and robustness. We evaluate the proposed approach (AMP+SAC) on quadruped gaits involving multiple reference motions and diverse terrains. Experimental results demonstrate that the proposed framework not only maintains stable task execution but also achieves higher imitation rewards compared to the widely used AMP+PPO method. These findings highlight the potential of an off‑policy IL formulation for advancing motion generation in robotics.

PaperID: 8383, https://arxiv.org/pdf/2509.24524.pdf

Authors: Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, Xianyuan Zhan

Title: PhysiAgent: An Embodied Agent Framework in Physical World

Abstract:
Vision‑Language‑Action (VLA) models have achieved notable success but often struggle with limited generalizations. To address this, integrating generalized Vision‑Language Models (VLMs) as assistants to VLAs has emerged as a popular solution. However, current approaches often combine these models in rigid, sequential structures: using VLMs primarily for high‑level scene understanding and task planning, and VLAs merely as executors of lower‑level actions, leading to ineffective collaboration and poor grounding challenges. In this paper, we propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments. By incorporating monitor, memory, self‑reflection mechanisms, and lightweight off‑the‑shelf toolboxes, PhysiAgent offers an autonomous scaffolding framework to prompt VLMs to organize different components based on real‑time proficiency feedback from VLAs to maximally exploit VLAs' capabilities. Experimental results demonstrate significant improvements in task‑solving performance on complex real‑world robotic tasks, showcasing effective self‑regulation of VLMs, coherent tool collaboration, and adaptive evolution of the framework during execution. PhysiAgent makes practical and pioneering efforts to integrate VLMs and VLAs, effectively grounding embodied agent frameworks in real‑world settings.

PaperID: 8384, https://arxiv.org/pdf/2509.24463.pdf

Authors: Nia D'Souza Ganapathy, Arul Selvamani Shaja

Title: An Agent-Based Framework for Automated Higher-Voice Harmony Generation

Abstract:
The generation of musically coherent and aesthetically pleasing harmony remains a significant challenge in the field of algorithmic composition. This paper introduces an innovative Agentic AI‑enabled Higher Harmony Music Generator, a multi‑agent system designed to create harmony in a collaborative and modular fashion. Our framework comprises four specialized agents: a Music‑Ingestion Agent for parsing and standardizing input musical scores; a Chord‑Knowledge Agent, powered by a Chord‑Former (Transformer model), to interpret and provide the constituent notes of complex chord symbols; a Harmony‑Generation Agent, which utilizes a Harmony‑GPT and a Rhythm‑Net (RNN) to compose a melodically and rhythmically complementary harmony line; and an Audio‑Production Agent that employs a GAN‑based Symbolic‑to‑Audio Synthesizer to render the final symbolic output into high‑fidelity audio. By delegating specific tasks to specialized agents, our system effectively mimics the collaborative process of human musicians. This modular, agent‑based approach allows for robust data processing, deep theoretical understanding, creative composition, and realistic audio synthesis, culminating in a system capable of generating sophisticated and contextually appropriate higher‑voice harmonies for given melodies.

PaperID: 8385, https://arxiv.org/pdf/2509.24413.pdf

Authors: Tianqiang Yan, Ziqiao Lin, Sicheng Wang, Tianwei Zhang, Zhenglong Sun

Title: DynaMIC: Dynamic Multimodal In-Context Learning Enabled Embodied Robot Counterfactual Resistance Ability

Abstract:
The emergence of large pre‑trained models based on natural language has breathed new life into robotics development. Extensive research has integrated large models with robots, utilizing the powerful semantic understanding and generation capabilities of large models to facilitate robot control through natural language instructions gradually. However, we found that robots that strictly adhere to human instructions, especially those containing misleading information, may encounter errors during task execution, potentially leading to safety hazards. This resembles the concept of counterfactuals in natural language processing (NLP), which has not yet attracted much attention in robotic research. In an effort to highlight this issue for future studies, this paper introduced directive counterfactuals (DCFs) arising from misleading human directives. We present DynaMIC, a framework for generating robot task flows to identify DCFs and relay feedback to humans proactively. This capability can help robots be sensitive to potential DCFs within a task, thus enhancing the reliability of the execution process. We conducted semantic‑level experiments and ablation studies, showcasing the effectiveness of this framework.

PaperID: 8386, https://arxiv.org/pdf/2509.24408.pdf

Authors: Yuzhen Long, Songze Li

Title: FuncPoison: Poisoning Function Library to Hijack Multi-agent Autonomous Driving Systems

Abstract:
Autonomous driving systems increasingly rely on multi‑agent architectures powered by large language models (LLMs), where specialized agents collaborate to perceive, reason, and plan. A key component of these systems is the shared function library, a collection of software tools that agents use to process sensor data and navigate complex driving environments. Despite its critical role in agent decision‑making, the function library remains an under‑explored vulnerability. In this paper, we introduce FuncPoison, a novel poisoning‑based attack targeting the function library to manipulate the behavior of LLM‑driven multi‑agent autonomous systems. FuncPoison exploits two key weaknesses in how agents access the function library: (1) agents rely on text‑based instructions to select tools; and (2) these tools are activated using standardized command formats that attackers can replicate. By injecting malicious tools with deceptive instructions, FuncPoison manipulates one agent s decisions‑‑such as misinterpreting road conditions‑‑triggering cascading errors that mislead other agents in the system. We experimentally evaluate FuncPoison on two representative multi‑agent autonomous driving systems, demonstrating its ability to significantly degrade trajectory accuracy, flexibly target specific agents to induce coordinated misbehavior, and evade diverse defense mechanisms. Our results reveal that the function library, often considered a simple toolset, can serve as a critical attack surface in LLM‑based autonomous driving systems, raising elevated concerns on their reliability.

PaperID: 8387, https://arxiv.org/pdf/2509.24399.pdf

Authors: Jie Song, Yang Bai, Mikhail Svinin, Naoki Wakamiya

Title: Autonomous Detection and Coverage of Unknown Target Areas by Multi-Agent Systems

Abstract:
This paper presents a novel coverage control algorithm for multi‑agent systems, where each agent has no prior knowledge of the specific region to be covered. The proposed method enables agents to autonomously detect the target area and collaboratively achieve full coverage. Once an agent detects a part of the target region within its sensor range, a dynamically constructed density function is generated to attract nearby agents. By integrating this density‑driven mechanism with Centroidal Voronoi Tessellation (CVT), the agents are guided to achieve optimal spatial distribution. Additionally, Control Barrier Functions (CBFs) are employed to ensure collision avoidance and maintain non‑overlapping sensor coverage, enhancing both safety and efficiency. Simulation results verify that agents can independently locate and effectively cover the target area.

PaperID: 8388, https://arxiv.org/pdf/2509.24380.pdf

Authors: Shuiguang Deng, Hailiang Zhao, Ziqi Wang, Guanjie Cheng, Peng Chen, Wenzhuo Qian, Zhiwei Ling, Jianwei Yin, Albert Y. Zomaya, Schahram Dustdar

Title: Agentic Services Computing

Abstract:
The rise of large language model (LLM)‑powered agents is transforming services computing, moving it beyond static, request‑driven functions toward dynamic, goal‑oriented, and socially embedded multi‑agent ecosystems. We propose Agentic Services Computing (ASC), a paradigm that reimagines services as autonomous, adaptive, and collaborative agents capable of perceiving, reasoning, acting, and evolving in open and uncertain environments. We organize ASC around a four‑phase lifecycle: Design, Deployment, Operation, and Evolution. It is examined through four interwoven research dimensions: (i) perception and context modeling, (ii) autonomous decision‑making, (iii) multi‑agent collaboration, and (iv) evaluation with alignment and trustworthiness. Rather than functioning as isolated layers, these dimensions evolve together. Contextual grounding supports robust deployment; autonomous reasoning drives real‑time action; collaboration emerges from agent interaction; and trustworthiness is maintained as a lifelong, cross‑cutting commitment across all lifecycle stages. In developing this framework, we also survey a broad spectrum of representative works that instantiate these ideas across academia and industry, mapping key advances to each phase and dimension of ASC. By integrating foundational principles of services computing with cutting‑edge advances in LLM‑based agency, ASC offers a unified and forward‑looking foundation for building intelligent, accountable, and human‑centered service ecosystems.

PaperID: 8389, https://arxiv.org/pdf/2509.24350.pdf

Authors: Yan Ke, Xin Yu, Heming Du, Scott Chapman, Helen Huang

Title: Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA

Abstract:
Agricultural visual question answering is essential for providing farmers and researchers with accurate and timely knowledge. However, many existing approaches are predominantly developed for evidence‑constrained settings such as text‑only queries or single‑image cases. This design prevents them from coping with real‑world agricultural scenarios that often require multi‑image inputs with complementary views across spatial scales, and growth stages. Moreover, limited access to up‑to‑date external agricultural context makes these systems struggle to adapt when evidence is incomplete. In addition, rigid pipelines often lack systematic quality control. To address this gap, we propose a self‑reflective and self‑improving multi‑agent framework that integrates four roles, the Retriever, the Reflector, the Answerer, and the Improver. They collaborate to enable context enrichment, reflective reasoning, answer drafting, and iterative improvement. A Retriever formulates queries and gathers external information, while a Reflector assesses adequacy and triggers sequential reformulation and renewed retrieval. Two Answerers draft candidate responses in parallel to reduce bias. The Improver refines them through iterative checks while ensuring that information from multiple images is effectively aligned and utilized. Experiments on the AgMMU benchmark show that our framework achieves competitive performance on multi‑image agricultural QA.

PaperID: 8390, https://arxiv.org/pdf/2509.24344.pdf

Authors: Theo Koraag, Niklas Wagner, Felix Dobslaw, Lucas Gren

Title: Comparing Open-Source and Commercial LLMs for Domain-Specific Analysis and Reporting: Software Engineering Challenges and Design Trade-offs

Abstract:
Context: Large Language Models (LLMs) enable automation of complex natural language processing across domains, but research on domain‑specific applications like Finance remains limited. Objectives: This study explored open‑source and commercial LLMs for financial report analysis and commentary generation, focusing on software engineering challenges in implementation. Methods: Using Design Science Research methodology, an exploratory case study iteratively designed and evaluated two LLM‑based systems: one with local open‑source models in a multi‑agent workflow, another using commercial GPT‑4o. Both were assessed through expert evaluation of real‑world financial reporting use cases. Results: LLMs demonstrated strong potential for automating financial reporting tasks, but integration presented significant challenges. Iterative development revealed issues including prompt design, contextual dependency, and implementation trade‑offs. Cloud‑based models offered superior fluency and usability but raised data privacy and external dependency concerns. Local open‑source models provided better data control and compliance but required substantially more engineering effort for reliability and usability. Conclusion: LLMs show strong potential for financial reporting automation, but successful integration requires careful attention to architecture, prompt design, and system reliability. Implementation success depends on addressing domain‑specific challenges through tailored validation mechanisms and engineering strategies that balance accuracy, control, and compliance.

PaperID: 8391, https://arxiv.org/pdf/2509.24314.pdf

Authors: Hongjun Liu, Yinghao Zhu, Yuhui Wang, Yitao Long, Zeyu Lai, Lequan Yu, Chen Zhao

Title: MedMMV: A Controllable Multimodal Multi-Agent Framework for Reliable and Verifiable Clinical Reasoning

Abstract:
Recent progress in multimodal large language models (MLLMs) has demonstrated promising performance on medical benchmarks and in preliminary trials as clinical assistants. Yet, our pilot audit of diagnostic cases uncovers a critical failure mode: instability in early evidence interpretation precedes hallucination, creating branching reasoning trajectories that cascade into globally inconsistent conclusions. This highlights the need for clinical reasoning agents that constrain stochasticity and hallucination while producing auditable decision flows. We introduce MedMMV, a controllable multimodal multi‑agent framework for reliable and verifiable clinical reasoning. MedMMV stabilizes reasoning through diversified short rollouts, grounds intermediate steps in a structured evidence graph under the supervision of a Hallucination Detector, and aggregates candidate paths with a Combined Uncertainty scorer. On six medical benchmarks, MedMMV improves accuracy by up to 12.7% and, more critically, demonstrates superior reliability. Blind physician evaluations confirm that MedMMV substantially increases reasoning truthfulness without sacrificing informational content. By controlling instability through a verifiable, multi‑agent process, our framework provides a robust path toward deploying trustworthy AI systems in high‑stakes domains like clinical decision support.

PaperID: 8392, https://arxiv.org/pdf/2509.24282.pdf

Authors: Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee, Yohan Jo

Title: SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Abstract:
We introduce SimuHome, a high‑fidelity smart home simulator and a benchmark of 600 episodes for LLM‑based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operations affect environmental variables over time nor supporting workflow scheduling of device commands. SimuHome is grounded in the Matter protocol, the industry standard that defines how real smart home devices communicate and operate. Agents interact with devices through SimuHome's APIs and observe how their actions continuously affect environmental variables such as temperature and humidity. Our benchmark covers state inquiry, implicit user intent inference, explicit device control, and workflow scheduling, each with both feasible and infeasible requests. For workflow scheduling, the simulator accelerates time so that scheduled workflows can be evaluated immediately. An evaluation of 18 agents reveals that workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine‑tuning. These findings suggest that SimuHome's time‑accelerated simulation could serve as an environment for agents to pre‑validate their actions before committing them to the real world.

PaperID: 8393, https://arxiv.org/pdf/2509.24272.pdf

Authors: Weibo Zhao, Jiahao Liu, Bonan Ruan, Shaofei Li, Zhenkai Liang

Title: When MCP Servers Attack: Taxonomy, Feasibility, and Mitigation

Abstract:
Model Context Protocol (MCP) servers enable AI applications to connect to external systems in a plug‑and‑play manner, but their rapid proliferation also introduces severe security risks. Unlike mature software ecosystems with rigorous vetting, MCP servers still lack standardized review mechanisms, giving adversaries opportunities to distribute malicious implementations. Despite this pressing risk, the security implications of MCP servers remain underexplored. To address this gap, we present the first systematic study that treats MCP servers as active threat actors and decomposes them into core components to examine how adversarial developers can implant malicious intent. Specifically, we investigate three research questions: (i) what types of attacks malicious MCP servers can launch, (ii) how vulnerable MCP hosts and Large Language Models (LLMs) are to these attacks, and (iii) how feasible it is to carry out MCP server attacks in practice. Our study proposes a component‑based taxonomy comprising twelve attack categories. For each category, we develop Proof‑of‑Concept (PoC) servers and demonstrate their effectiveness across diverse real‑world host‑LLM settings. We further show that attackers can generate large numbers of malicious servers at virtually no cost. We then test state‑of‑the‑art scanners on the generated servers and found that existing detection approaches are insufficient. These findings highlight that malicious MCP servers are easy to implement, difficult to detect with current tools, and capable of causing concrete damage to AI agent systems. Addressing this threat requires coordinated efforts among protocol designers, host developers, LLM providers, and end users to build a more secure and resilient MCP ecosystem.

PaperID: 8394, https://arxiv.org/pdf/2509.24230.pdf

Authors: Shaobin Ling, Yun Wang, Chenyou Fan, Tin Lun Lam, Junjie Hu

Title: ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) enable intelligent multi‑robot collaboration but face fundamental trade‑offs: open‑loop methods that compile tasks into formal representations for external executors produce sound plans but lack adaptability in partially observable environments, while iterative methods incur prohibitive computational costs that scale poorly with team size and task complexity. In this paper, we propose Efficient Long‑Horizon Planning (ELHPlan), a novel framework that introduces Action Chains, sequences of actions explicitly bound to sub‑goal intentions, as the fundamental planning primitive. ELHPlan operates via a cyclical process: 1) constructing intention‑bound action sequences, 2) proactively validating for conflicts and feasibility, 3) refining issues through targeted mechanisms, and 4) executing validated actions. This design balances adaptability and efficiency by providing intention‑bound action sequences with longer lookahead while avoiding expensive full re‑planning. We further advocate comprehensive efficiency metrics, including token consumption and planning time, to more holistically evaluate multi‑agent collaboration. Our experiments on benchmarks TDW‑MAT and C‑WAH demonstrate that ELHPlan achieves comparable task success rates while consuming only 30‑40% of the tokens required by state‑of‑the‑art methods. Our research establishes a new efficiency‑effectiveness frontier for LLM‑based multi‑agent planning systems.

PaperID: 8395, https://arxiv.org/pdf/2509.24226.pdf

Authors: Jingqi Li, Gechen Qu, Jason J. Choi, Somayeh Sojoudi, Claire Tomlin

Title: Multi-Agent Guided Policy Search for Non-Cooperative Dynamic Games

Abstract:
Multi‑agent reinforcement learning (MARL) optimizes strategic interactions in non‑cooperative dynamic games, where agents have misaligned objectives. However, data‑driven methods such as multi‑agent policy gradients (MA‑PG) often suffer from instability and limit‑cycle behaviors. Prior stabilization techniques typically rely on entropy‑based exploration, which slows learning and increases variance. We propose a model‑based approach that incorporates approximate priors into the reward function as regularization. In linear quadratic (LQ) games, we prove that such priors stabilize policy gradients and guarantee local exponential convergence to an approximate Nash equilibrium. We then extend this idea to infinite‑horizon nonlinear games by introducing Multi‑agent Guided Policy Search (MA‑GPS), which constructs short‑horizon local LQ approximations from trajectories of current policies to guide training. Experiments on nonlinear vehicle platooning and a six‑player strategic basketball formation show that MA‑GPS achieves faster convergence and more stable learning than existing MARL methods.

PaperID: 8396, https://arxiv.org/pdf/2509.24148.pdf

Authors: Yiran Hu, Nan Jiang, Shanchao Liang, Yi Wu, Lin Tan

Title: TENET: Leveraging Tests Beyond Validation for Code Generation

Abstract:
Test‑Driven Development (TDD) is a widely adopted software engineering practice that requires developers to create and execute tests alongside code implementation, ensuring that software behavior is continuously validated and refined. In the era of vibe coding, where developers increasingly delegate code writing to large language models (LLMs) by specifying high‑level intentions, TDD becomes even more crucial, as test cases serve as executable specifications that explicitly define and verify intended functionality beyond what natural‑language descriptions and code context can convey. While vibe coding under TDD is promising, there are three main challenges: (1) selecting a small yet effective test suite to improve the generation accuracy and control the execution workload, (2) retrieving context such as relevant code effectively, and (3) systematically using test feedback for effective code refinement. To address these challenges, we introduce TENET, an LLM agent for generating functions in complex real‑world repositories under the TDD setting. TENET features three components: (1) a novel test harness mechanism that selects a concise test suite to maximize diversity of target usage scenarios; (2) a tailored agent toolset that performs efficient retrieval of relevant code with interactive debugging; and (3) a reflection‑based refinement workflow that iteratively analyzes failures, replenishes context, and applies code refinement. TENET achieves 69.08% and 81.77% Pass@1 on RepoCod and RepoEval benchmarks, outperforming the best agentic baselines by 9.49 and 2.17 percentage points, respectively. In addition, this is the first study of test‑driven code generation with repository‑level context, examining how different aspects of test suites affect the performance of LLM agents under the TDD setting.

PaperID: 8397, https://arxiv.org/pdf/2509.24116.pdf

Authors: Minsoo Kim, Seung-won Hwang

Title: Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems

Abstract:
LLM‑based agents have seen promising advances, yet they are still limited in "hard‑exploration" tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual‑scale world models, maintaining a trajectory frontier of high‑value discoveries at the global scale, while learning from local trial‑and‑error in exploration through a Multi‑path Advantage Reflection mechanism which infers advantage‑based progress signals to guide exploration. To evaluate our framework for hard‑exploration, we tackle the Jericho benchmark suite of text‑based games, where GLoW achieves a new state‑of‑theart performance for LLM‑based approaches. Compared to state‑of‑the‑art RLbased methods, our approach achieves comparable performance while requiring 100‑800x fewer environment interactions.

PaperID: 8398, https://arxiv.org/pdf/2509.24107.pdf

Authors: Shreyas Singh, Kunal Singh, Pradeep Moturi

Title: Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

Abstract:
Tool‑integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open‑ended information‑seeking tasks. We introduce Fathom‑DeepResearch, an agentic system composed of two specialized models. The first is Fathom‑Search‑4B, a DeepSearch model trained from Qwen3‑4B and optimized for evidence‑based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K‑sample dataset generated via multi‑agent self‑play that enforces strict web‑search dependence and heterogeneous source grounding; (ii) RAPO, a zero‑overhead extension of GRPO that stabilizes multi‑turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward‑aware advantage scaling, and per‑prompt replay buffers; and (iii) a steerable step‑level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool‑calling beyond 20 calls when warranted. The second is Fathom‑Synthesizer‑4B, trained from Qwen3‑4B, which converts multi‑turn DeepSearch traces into structured, citation‑dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch‑Bench, the system achieves state‑of‑the‑art performance in the open‑weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME‑25, GPQA‑Diamond, and MedQA.

PaperID: 8399, https://arxiv.org/pdf/2509.24088.pdf

Authors: Yifan Yu, Moyan Li, Shaoyuan Xu, Jinmiao Fu, Xinhai Hou, Fan Lai, Bryan Wang

Title: CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems

Abstract:
Multi‑agent systems (MAS) are increasingly capable of tackling complex real‑world tasks, yet their reliance on inter‑agent coordination, tool use, and long‑horizon reasoning makes error recognition particularly challenging. Minor errors can propagate across agents, escalating into task failures while producing long, intertwined execution trajectories that impose significant costs for both human developers and automated systems to debug and analyze. Our key insight is that, despite surface differences in failure trajectories (e.g., logs), MAS errors often recur with similar structural patterns. This paper presents CORRECT, the first lightweight, training‑free framework that leverages an online cache of distilled error schemata to recognize and transfer knowledge of failure structures across new requests. This cache‑based reuse allows LLMs to perform targeted error localization at inference time, avoiding the need for expensive retraining while adapting to dynamic MAS deployments in subseconds. To support rigorous study in this domain, we also introduce CORRECT‑Error, a large‑scale dataset of over 2,000 annotated trajectories collected through a novel error‑injection pipeline guided by real‑world distributions, and further validated through human evaluation to ensure alignment with natural failure patterns. Experiments across seven diverse MAS applications show that CORRECT improves step‑level error localization up to 19.8% over existing advances while at near‑zero overhead, substantially narrowing the gap between automated and human‑level error recognition.

PaperID: 8400, https://arxiv.org/pdf/2509.24065.pdf

Authors: Dylan Waldner

Title: AI Safety, Alignment, and Ethics (AI SAE)

Abstract:
This paper grounds ethics in evolutionary biology, viewing moral norms as adaptive mechanisms that render cooperation fitness‑viable under selection pressure. Current alignment approaches add ethics post hoc, treating it as an external constraint rather than embedding it as an evolutionary strategy for cooperation. The central question is whether normative architectures can be embedded directly into AI systems to sustain human‑‑AI cooperation (symbiosis) as capabilities scale. To address this, I propose a governance‑‑embedding‑‑representation pipeline linking moral representation learning to system‑level design and institutional governance, treating alignment as a multi‑level problem spanning cognition, optimization, and oversight. I formalize moral norm representation through the moral problem space, a learnable subspace in neural representations where cooperative norms can be encoded and causally manipulated. Using sparse autoencoders, activation steering, and causal interventions, I outline a research program for engineering moral representations and embedding them into the full semantic space ‑‑ treating competing theories of morality as empirical hypotheses about representation geometry rather than philosophical positions. Governance principles leverage these learned moral representations to regulate how cooperative behaviors evolve within the AI ecosystem. Through replicator dynamics and multi‑agent game theory, I model how internal representational features can shape population‑level incentives by motivating the design of sanctions and subsidies structured to yield decentralized normative institutions.

PaperID: 8401, https://arxiv.org/pdf/2509.24047.pdf

Authors: Runyu Zhang, Na Li, Asuman Ozdaglar, Jeff Shamma, Gioele Zardini

Title: Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Abstract:
Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi‑agent RL (MARL) have largely emphasized the risk‑averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk‑seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence‑penalized risk‑seeking evaluations. Building on this foundation, we derive a policy‑gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL‑penalty setting, and develop decentralized optimistic actor‑critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk‑seeking optimism consistently improves coordination over both risk‑neutral baselines and heuristic optimistic methods. Our framework thus unifies risk‑sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.

PaperID: 8402, https://arxiv.org/pdf/2509.24046.pdf

Authors: Lingyao Li, Haolun Wu, Zhenkun Li, Jiabei Hu, Yu Wang, Xiaoshan Huang, Wenyue Hua, Wenqian Wang

Title: PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features

Abstract:
High‑dimensional decision‑making tasks, such as business partner selection, involve evaluating large candidate pools with heterogeneous numerical, categorical, and textual features. While large language models (LLMs) offer strong in‑context reasoning capabilities, single‑agent or debate‑style systems often struggle with scalability and consistency in such settings. We propose PartnerMAS, a hierarchical multi‑agent framework that decomposes evaluation into three layers: a Planner Agent that designs strategies, Specialized Agents that perform role‑specific assessments, and a Supervisor Agent that integrates their outputs. To support systematic evaluation, we also introduce a curated benchmark dataset of venture capital co‑investments, featuring diverse firm attributes and ground‑truth syndicates. Across 140 cases, PartnerMAS consistently outperforms single‑agent and debate‑based multi‑agent baselines, achieving up to 10‑‑15% higher match rates. Analysis of agent reasoning shows that planners are most responsive to domain‑informed prompts, specialists produce complementary feature coverage, and supervisors play an important role in aggregation. Our findings demonstrate that structured collaboration among LLM agents can generate more robust outcomes than scaling individual models, highlighting PartnerMAS as a promising framework for high‑dimensional decision‑making in data‑rich domains.

PaperID: 8403, https://arxiv.org/pdf/2509.23993.pdf

Authors: Muleilan Pei, Shaoshuai Shi, Shaojie Shen

Title: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Abstract:
Scalable and realistic simulation of multi‑agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data‑driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real‑world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART‑R1, a novel R1‑style reinforcement fine‑tuning paradigm tailored for next‑token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric‑oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT‑RFT‑SFT" training strategy that alternates between Supervised Fine‑Tuning (SFT) and Reinforcement Fine‑Tuning (RFT) to maximize performance gains. Extensive experiments on the large‑scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1‑style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART‑R1 achieves state‑of‑the‑art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

PaperID: 8404, https://arxiv.org/pdf/2509.23988.pdf

Authors: Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Dayou Zhou, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Xue Yang, Chunwei Liu, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu

Title: LLM/Agent-as-Data-Analyst: A Survey

Abstract:
Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks (a.k.a LLM/Agent‑as‑Data‑Analyst), demonstrating substantial impact across both academia and industry. In comparison with traditional rule or small‑model based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. From a modality perspective, we review LLM‑based techniques for (i) structured data (e.g., NL2SQL, NL2GQL, ModelQA), (ii) semi‑structured data (e.g., markup languages understanding, semi‑structured table question answering), (iii) unstructured data (e.g., chart understanding, text/image document understanding), and (iv) heterogeneous data (e.g., data retrieval and modality alignment in data lakes). The technical evolution further distills four key design goals for intelligent data analysis agents, namely semantic‑aware design, autonomous pipelines, tool‑augmented workflows, and support for open‑world tasks. Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent‑powered data analysis.

PaperID: 8405, https://arxiv.org/pdf/2509.23960.pdf

Authors: Manan Tayal, Aditya Singh, Shishir Kolathaya, Somil Bansal

Title: MAD-PINN: A Decentralized Physics-Informed Machine Learning Framework for Safe and Optimal Multi-Agent Control

Abstract:
Co‑optimizing safety and performance in large‑scale multi‑agent systems remains a fundamental challenge. Existing approaches based on multi‑agent reinforcement learning (MARL), safety filtering, or Model Predictive Control (MPC) either lack strict safety guarantees, suffer from conservatism, or fail to scale effectively. We propose MAD‑PINN, a decentralized physics‑informed machine learning framework for solving the multi‑agent state‑constrained optimal control problem (MASC‑OCP). Our method leverages an epigraph‑based reformulation of SC‑OCP to simultaneously capture performance and safety, and approximates its solution via a physics‑informed neural network. Scalability is achieved by training the SC‑OCP value function on reduced‑agent systems and deploying them in a decentralized fashion, where each agent relies only on local observations of its neighbours for decision‑making. To further enhance safety and efficiency, we introduce an Hamilton‑Jacobi (HJ) reachability‑based neighbour selection strategy to prioritize safety‑critical interactions, and a receding‑horizon policy execution scheme that adapts to dynamic interactions while reducing computational burden. Experiments on multi‑agent navigation tasks demonstrate that MAD‑PINN achieves superior safety‑performance trade‑offs, maintains scalability as the number of agents grows, and consistently outperforms state‑of‑the‑art baselines.

PaperID: 8406, https://arxiv.org/pdf/2509.23905.pdf

Authors: Tianjiao Sun, Ningyan Guo, Haozhe Gu, Yanyan Peng, Zhiyong Feng

Title: Integrated Communication and Control for Energy-Efficient UAV Swarms: A Multi-Agent Reinforcement Learning Approach

Abstract:
The deployment of unmanned aerial vehicle (UAV) swarm‑assisted communication networks has become an increasingly vital approach for remediating coverage limitations in infrastructure‑deficient environments, with especially pressing applications in temporary scenarios, such as emergency rescue, military and security operations, and remote area coverage. However, complex geographic environments lead to unpredictable and highly dynamic wireless channel conditions, resulting in frequent interruptions of air‑to‑ground (A2G) links that severely constrain the reliability and quality of service in UAV swarm‑assisted mobile communications. To improve the quality of UAV swarm‑assisted communications in complex geographic environments, we propose an integrated communication and control co‑design mechanism. Given the stringent energy constraints inherent in UAV swarms, our proposed mechanism is designed to optimize energy efficiency while maintaining an equilibrium between equitable communication rates for mobile ground users (GUs) and UAV energy expenditure. We formulate the joint resource allocation and 3D trajectory control problem as a Markov decision process (MDP), and develop a multi‑agent reinforcement learning (MARL) framework to enable real‑time coordinated actions across the UAV swarm. To optimize the action policy of UAV swarms, we propose a novel multi‑agent hybrid proximal policy optimization with action masking (MAHPPO‑AM) algorithm, specifically designed to handle complex hybrid action spaces. The algorithm incorporates action masking to enforce hard constraints in high‑dimensional action spaces. Experimental results demonstrate that our approach achieves a fairness index of 0.99 while reducing energy consumption by up to 25% compared to baseline methods.

PaperID: 8407, https://arxiv.org/pdf/2509.23836.pdf

Authors: Chenyu Zhou, Xiaoming Shi, Hui Qiu, Xiawu Zheng, Haitao Leng, Yankai Jiang, Shaoguo Liu, Tingting Gao, Rongrong Ji

Title: Mix-Ecom: Towards Mixed-Type E-Commerce Dialogues with Complex Domain Rules

Abstract:
E‑commerce agents contribute greatly to helping users complete their e‑commerce needs. To promote further research and application of e‑commerce agents, benchmarking frameworks are introduced for evaluating LLM agents in the e‑commerce domain. Despite the progress, current benchmarks lack evaluating agents' capability to handle mixed‑type e‑commerce dialogue and complex domain rules. To address the issue, this work first introduces a novel corpus, termed Mix‑ECom, which is constructed based on real‑world customer‑service dialogues with post‑processing to remove user privacy and add CoT process. Specifically, Mix‑ECom contains 4,799 samples with multiply dialogue types in each e‑commerce dialogue, covering four dialogue types (QA, recommendation, task‑oriented dialogue, and chit‑chat), three e‑commerce task types (pre‑sales, logistics, after‑sales), and 82 e‑commerce rules. Furthermore, this work build baselines on Mix‑Ecom and propose a dynamic framework to further improve the performance. Results show that current e‑commerce agents lack sufficient capabilities to handle e‑commerce dialogues, due to the hallucination cased by complex domain rules. The dataset will be publicly available.

PaperID: 8408, https://arxiv.org/pdf/2509.23803.pdf

Authors: Pramit Saha, Joshua Strong, Divyanshu Mishra, Cheng Ouyang, J. Alison Noble

Title: FedAgentBench: Towards Automating Real-world Federated Medical Image Analysis with Server-Client LLM Agents

Abstract:
Federated learning (FL) allows collaborative model training across healthcare sites without sharing sensitive patient data. However, real‑world FL deployment is often hindered by complex operational challenges that demand substantial human efforts. This includes: (a) selecting appropriate clients (hospitals), (b) coordinating between the central server and clients, (c) client‑level data pre‑processing, (d) harmonizing non‑standardized data and labels across clients, and (e) selecting FL algorithms based on user instructions and cross‑client data characteristics. However, the existing FL works overlook these practical orchestration challenges. These operational bottlenecks motivate the need for autonomous, agent‑driven FL systems, where intelligent agents at each hospital client and the central server agent collaboratively manage FL setup and model training with minimal human intervention. To this end, we first introduce an agent‑driven FL framework that captures key phases of real‑world FL workflows from client selection to training completion and a benchmark dubbed FedAgentBench that evaluates the ability of LLM agents to autonomously coordinate healthcare FL. Our framework incorporates 40 FL algorithms, each tailored to address diverse task‑specific requirements and cross‑client characteristics. Furthermore, we introduce a diverse set of complex tasks across 201 carefully curated datasets, simulating 6 modality‑specific real‑world healthcare environments, viz., Dermatoscopy, Ultrasound, Fundus, Histopathology, MRI, and X‑Ray. We assess the agentic performance of 14 open‑source and 10 proprietary LLMs spanning small, medium, and large model scales. While some agent cores such as GPT‑4.1 and DeepSeek V3 can automate various stages of the FL pipeline, our results reveal that more complex, interdependent tasks based on implicit goals remain challenging for even the strongest models.

PaperID: 8409, https://arxiv.org/pdf/2509.23778.pdf

Authors: Zeyuan Zhao, Chaoran Li, Shao Zhang, Ying Wen

Title: Sequence Pathfinder for Multi-Agent Pickup and Delivery in the Warehouse

Abstract:
Multi‑Agent Pickup and Delivery (MAPD) is a challenging extension of Multi‑Agent Path Finding (MAPF), where agents are required to sequentially complete tasks with fixed‑location pickup and delivery demands. Although learning‑based methods have made progress in MAPD, they often perform poorly in warehouse‑like environments with narrow pathways and long corridors when relying only on local observations for distributed decision‑making. Communication learning can alleviate the lack of global information but introduce high computational complexity due to point‑to‑point communication. To address this challenge, we formulate MAPF as a sequence modeling problem and prove that path‑finding policies under sequence modeling possess order‑invariant optimality, ensuring its effectiveness in MAPD. Building on this, we propose the Sequential Pathfinder (SePar), which leverages the Transformer paradigm to achieve implicit information exchange, reducing decision‑making complexity from exponential to linear while maintaining efficiency and global awareness. Experiments demonstrate that SePar consistently outperforms existing learning‑based methods across various MAPF tasks and their variants, and generalizes well to unseen environments. Furthermore, we highlight the necessity of integrating imitation learning in complex maps like warehouses.

PaperID: 8410, https://arxiv.org/pdf/2509.23768.pdf

Authors: Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin

Title: From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning

Abstract:
The chemical reaction recommendation is to select proper reaction condition parameters for chemical reactions, which is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high‑stakes scientific workflows. In this work, we propose ChemMAS, a multi‑agent system that reframes condition prediction as an evidence‑based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi‑channel recall, constraint‑aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20‑35% gains over domain‑specific baselines and outperforms general‑purpose LLMs by 10‑15% in Top‑1 accuracy, while offering falsifiable, human‑trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.

PaperID: 8411, https://arxiv.org/pdf/2509.23757.pdf

Authors: Benjamin Teoh, Ben Glocker, Francesca Toni, Avinash Kori

Title: Transparent Visual Reasoning via Object-Centric Agent Collaboration

Abstract:
A central challenge in explainable AI, particularly in the visual domain, is producing explanations grounded in human‑understandable concepts. To tackle this, we introduce OCEAN (Object‑Centric Explananda via Agent Negotiation), a novel, inherently interpretable framework built on object‑centric representations and a transparent multi‑agent reasoning process. The game‑theoretic reasoning process drives agents to agree on coherent and discriminative evidence, resulting in a faithful and interpretable decision‑making process. We train OCEAN end‑to‑end and benchmark it against standard visual classifiers and popular posthoc explanation tools like GradCAM and LIME across two diagnostic multi‑object datasets. Our results demonstrate competitive performance with respect to state‑of‑the‑art black‑box models with a faithful reasoning process, which was reflected by our user study, where participants consistently rated OCEAN's explanations as more intuitive and trustworthy.

PaperID: 8412, https://arxiv.org/pdf/2509.23735.pdf

Authors: Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang

Title: Demystifying the Lifecycle of Failures in Platform-Orchestrated Agentic Workflows

Abstract:
Agentic workflows built on low‑code orchestration platforms enable rapid development of multi‑agent systems, but they also introduce new and poorly understood failure modes that hinder reliability and maintainability. Unlike traditional software systems, failures in agentic workflows often propagate across heterogeneous nodes through natural‑language interactions, tool invocations, and dynamic control logic, making failure attribution and repair particularly challenging. In this paper, we present an empirical study of platform‑orchestrated agentic workflows from a failure lifecycle perspective, with the goal of characterizing failure manifestations, identifying underlying root causes, and examining corresponding repair strategies. We present AgentFail, a dataset of 307 real‑world failure cases collected from two representative agentic workflow platforms. Based on this dataset, we analyze failure patterns, root causes, and repair difficulty for various failure root causes and nodes in the workflow. Our findings reveal key failure mechanisms in agentic workflows and provide actionable guidelines for reliable failure repair, and real‑world agentic workflow design.

PaperID: 8413, https://arxiv.org/pdf/2509.23725.pdf

Authors: Siqi Ma, Jiajie Huang, Fan Zhang, Yue Shen, Jinlin Wu, Guohui Fan, Zhu Zhang, Zelin Zang

Title: MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models

Abstract:
Answering complex medical questions requires not only domain expertise and patient‑specific information, but also structured and multi‑perspective reasoning. Existing multi‑agent approaches often rely on fixed roles or shallow interaction prompts, limiting their ability to detect and resolve fine‑grained logical inconsistencies. To address this, we propose \textscMedLA, a logic‑driven multi‑agent framework built on large language models. Each agent organizes its reasoning process into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise‑level alignment. Agents engage in a multi‑round, graph‑guided discussion to compare and iteratively refine their logic trees, achieving consensus through error correction and contradiction resolution. We demonstrate that \textscMedLA consistently outperforms both static role‑based systems and single‑agent baselines on challenging benchmarks such as MedDDx and standard medical QA tasks. Furthermore, \textscMedLA scales effectively across both open‑source and commercial LLM backbones, achieving state‑of‑the‑art performance and offering a generalizable paradigm for trustworthy medical reasoning.

PaperID: 8414, https://arxiv.org/pdf/2509.23698.pdf

Authors: Zhe Hu, Yixiao Ren, Guanzhong Liu, Jing Li, Yu Yin

Title: VIVA+: Human-Centered Situational Decision-Making

Abstract:
Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human‑centered environments. Yet, evaluating their capacity for nuanced, human‑like reasoning and decision‑making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision‑making of MLLMs in human‑centered situations. VIVA+ consists of 1,317 real‑world situations paired with 6,373 multiple‑choice questions, targeting three core abilities for decision‑making: (1) Foundational Situation Comprehension, (2) Context‑Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model's ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open‑source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi‑step reasoning strategies, which yield consistent performance improvements. Finally, our in‑depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context‑aware, and socially adept decision‑making in real‑world settings.

PaperID: 8415, https://arxiv.org/pdf/2509.23680.pdf

Authors: Shidong Pan, Yikai Ge, Xiaoyu Sun

Title: A First Look at Privacy Risks of Android Task-executable Voice Assistant Applications

Abstract:
With the development of foundation AI technologies, task‑executable voice assistants (VAs) have become more popular, enhancing user convenience and expanding device functionality. Android task‑executable VAs are applications that are capable of understanding complex tasks and performing corresponding operations. Given their prevalence and great autonomy, there is no existing work examine the privacy risks within the voice assistants from the task‑execution pattern in a holistic manner. To fill this research gap, this paper presents a user‑centric comprehensive empirical study on privacy risks in Android task‑executable VA applications. We collect ten mainstream VAs as our research target and analyze their operational characteristics. We then cross‑check their privacy declarations across six sources, including privacy labels, policies, and manifest files, and our findings reveal widespread inconsistencies. Moreover, we uncover three significant privacy threat models: (1) privacy misdisclosure in mega apps, where integrated mini apps such as Alexa skills are inadequately represented; (2) privilege escalation via inter‑application interactions, which exploit Android's communication mechanisms to bypass user consent; and (3) abuse of Google system applications, enabling apps to evade the declaration of dangerous permissions. Our study contributes actionable recommendations for practitioners and underscores broader relevance of these privacy risks to emerging autonomous AI agents.

PaperID: 8416, https://arxiv.org/pdf/2509.23652.pdf

Authors: Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng

Title: ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Abstract:
While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision‑Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi‑hop questions and high‑quality, video‑grounded Chain‑of‑Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large‑scale dataset built to foster advanced video reasoning. We propose a novel multi‑stage synthesis pipeline to synthesize its three components: ReWatch‑Caption, ReWatch‑QA, and ReWatch‑CoT. A core innovation is our Multi‑Agent ReAct framework for CoT synthesis, which simulates a human‑like "re‑watching" process to generate video‑grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch‑R1 by post‑training a strong baseline LVLM with Supervised Fine‑Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation \& Reasoning (O\&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch‑R1 achieves state‑of‑the‑art average performance on five challenging video reasoning benchmarks. Project Page: https://rewatch‑r1.github.io

PaperID: 8417, https://arxiv.org/pdf/2509.23629.pdf

Authors: Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, Kun Chen

Title: Emergent Slow Thinking in LLMs as Inverse Tree Freezing

Abstract:
Reinforcement learning with verifiable rewards (RLVR) enables large language models to acquire slow, multi‑step reasoning from sparse final‑answer signals. We provide a statistical‑physics picture of this emergence. We show that an autoregressive model's finite capacity forces it to compress its exponentially large prefix space into a Markov network of predictive states, on which slow thinking unfolds as a random walk ‑‑ the Concept Network (CoNet) picture. Within CoNet, RLVR dynamics are governed by two mechanisms: merging of compatible paths and frustrated competition among incompatible ones. Together they drive the network through nucleation, growth, and freezing into multi‑input, single‑output directed inverse trees. The picture reproduces the training dynamics of a 1.5‑billion‑parameter LLM and yields three predictions: reasoning chains lengthen as a geometric necessity of sparse topology; SFT induces catastrophic forgetting through bridge‑node rupture; and frustration drives policy collapse. Building on the structural timing inherent in inverse‑tree freezing, we propose Annealed‑RLVR ‑‑ a brief SFT intervention at the moment of maximum frustration. It outperforms standard RLVR on both in‑ and out‑of‑distribution benchmarks, with the largest gains at high sampling budgets where standard RLVR collapses. The same SFT applied after the trees freeze instead triggers catastrophic forgetting, isolating timing as the active ingredient.

PaperID: 8418, https://arxiv.org/pdf/2509.23586.pdf

Authors: Yuan-An Xiao, Pengfei Gao, Chao Peng, Yingfei Xiong

Title: Reducing Cost of LLM Agents with Trajectory Reduction

Abstract:
Multi‑turn agent systems based on Large Language Models (LLMs) have become increasingly popular for software engineering tasks. While LLM agents demonstrate promising effectiveness, the high computational cost of input tokens due to ever‑growing trajectories remains a significant efficiency concern. Efficiency has been largely overlooked in existing studies and agent products, and this paper addresses this gap by introducing an inference‑time trajectory reduction approach that reduces computational costs. By analyzing existing agent trajectories, we demonstrate that useless, redundant, and expired information is widespread across trajectories. Such waste can be identified and reduced without compromising the agent's performance. We propose a simple yet effective trajectory reduction approach, AgentDiet, which automatically removes such waste during agent execution. We implement AgentDiet on a top‑performing coding agent, and our evaluation on two LLMs and two benchmarks shows that AgentDiet can reduce input tokens by 39.9%‑59.7% and the total computational cost by 21.1%‑35.9%, while maintaining the same agent performance. These results indicate that inference‑time trajectory reduction is a promising direction for agent systems.

PaperID: 8419, https://arxiv.org/pdf/2509.23537.pdf

Authors: Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Zekun Li, Xingyu Xiang, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang

Title: Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks

Abstract:
We study multi‑turn multi‑agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT‑5, Grok 4, and Claude Sonnet 4) on GPQA‑Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single‑LLM baselines; and (ii) ablations on GPQA‑Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best‑achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self‑voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.

PaperID: 8420, https://arxiv.org/pdf/2509.23462.pdf

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Singh Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Abstract:
Scalable multi‑agent reinforcement learning (MARL) remains a central challenge for AI. Existing population‑based methods, like Policy‑Space Response Oracles, PSRO, require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta‑Solver (GEMS), a surrogate‑free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative‑weights meta‑dynamics, and a model‑free empirical‑Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage‑based trust‑region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of Two‑player and Multi‑Player games such as the Deceptive Messages Game, Kuhn Poker and Multi‑Particle environment. We find that GEMS is up to ~\mathbf6× faster, has \mathbf1.3× less memory usage than PSRO, while also reaps higher rewards simultaneously. These results demonstrate that GEMS retains the game theoretic guarantees of PSRO, while overcoming its fundamental inefficiencies, hence enabling scalable multi‑agent learning in multiple domains.

PaperID: 8421, https://arxiv.org/pdf/2509.23425.pdf

Authors: Benjamin Alcorn, Eman Hammad

Title: Situational Awareness for Safe and Robust Multi-Agent Interactions Under Uncertainty

Abstract:
Multi‑agent systems are prevalent in a wide range of domains including power systems, vehicular networks, and robotics. Two important problems to solve in these types of systems are how the intentions of non‑coordinating agents can be determined to predict future behavior and how the agents can achieve their objectives under resource constraints without significantly sacrificing performance. To study this, we develop a model where an autonomous agent observes the environment within a safety radius of observation, determines the state of a surrounding agent of interest (within the observation radius), estimates future actions to be taken, and acts in an optimal way. In the absence of observations, agents are able to utilize an estimation algorithm to predict the future actions of other agents based on historical trajectory. The use of the proposed estimation algorithm introduces uncertainty, which is managed via risk analysis. The proposed approach in this study is validated using two different learning‑based decision making frameworks: reinforcement learning and game theoretic algorithms.

PaperID: 8422, https://arxiv.org/pdf/2509.23344.pdf

Authors: Zijie Meng, Jin Hao, Xiwei Dai, Yang Feng, Jiaxiang Liu, Bin Feng, Huikai Wu, Xiaotang Gai, Hengchuan Zhu, Tianxiang Hu, Yangyang Wu, Hongxia Xu, Jin Li, Jun Xiao, Xiaoqiang Liu, Joey Tianyi Zhou, Fudong Zhu, Zhihe Zhao, Lunguo Xia, Bing Fang, Jimeng Sun, Jian Wu, Zuozhu Liu

Title: DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice

Abstract:
Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision‑language model engineered for expert‑level oral disease diagnosis. DentVLM was developed using a comprehensive, large‑scale, bilingual dataset of 110,447 images and 2.46 million visual question‑answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open‑source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists' performance to senior levels and reduced diagnostic time for all practitioners by 15‑22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home‑based dental health management, hospital‑based intelligent diagnosis and multi‑agent collaborative interaction. These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider‑patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.

PaperID: 8423, https://arxiv.org/pdf/2509.23327.pdf

Authors: Yuanhao Zhang, Wenbo Li, Xiaoyu Wang, Kangyu Yuan, Shuai Ma, Xiaojuan Ma

Title: "Shall We Dig Deeper?": Designing and Evaluating Strategies for LLM Agents to Advance Knowledge Co-Construction in Asynchronous Online Discussions

Abstract:
Asynchronous online discussions enable diverse participants to co‑construct knowledge beyond individual contributions. This process ideally evolves through sequential phases, from superficial information exchange to deeper synthesis. However, many discussions stagnate in the early stages. Existing AI interventions typically target isolated phases, lacking mechanisms to progressively advance knowledge co‑construction, and the impacts of different intervention styles in this context remain unclear and warrant investigation. To address these gaps, we conducted a design workshop to explore AI intervention strategies (task‑oriented and/or relationship‑oriented) throughout the knowledge co‑construction process, and implemented them in an LLM‑powered agent capable of facilitating progression while consolidating foundations at each phase. A within‑subject study (N=60) involving five consecutive asynchronous discussions showed that the agent consistently promoted deeper knowledge progression, with different styles exerting distinct effects on both content and experience. These findings provide actionable guidance for designing adaptive AI agents that sustain more constructive online discussions.

PaperID: 8424, https://arxiv.org/pdf/2509.23251.pdf

Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Zicheng Zhang, Jinliang Han, Guangtao Zhai

Title: XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System

Abstract:
In this paper, we propose XGC‑AVis, a multi‑agent framework that enhances the audio‑video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC‑AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real‑world and AI‑generated scenarios. XGC‑AVQuiz consists of 2,685 question‑answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user‑generated content (UGC), and 377 AI‑generated content (AIGC). These videos cover 10 major domains and 53 fine‑grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension. This requires MLLMs to integrate low‑level sensory capabilities with high‑level semantic understanding to assess audio‑visual quality, synchronization, and coherence. Experimental results on XGC‑AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC‑AVis improves these capabilities without requiring additional training, as validated on two benchmarks.

PaperID: 8425, https://arxiv.org/pdf/2509.23206.pdf

Authors: Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, Xin Peng, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, Weiwen Liu, Weinan Zhang, Fei Huang, Ying Wen

Title: PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Abstract:
Large language models (LLMs) have achieved impressive success in single‑turn function calling, yet real‑world applications such as travel planning or multi‑stage data analysis typically unfold across multi‑turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long‑horizon task execution. Existing approaches, however, either reduce multi‑turn training to isolated single‑turn samples, which neglects task‑level planning, or employ end‑to‑end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL‑MT, a framework that explicitly incorporates progress awareness into LLM training for multi‑turn function calling. PARL‑MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness‑Guided Reinforcement Learning (PAG‑RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL‑MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi‑turn function calling.

PaperID: 8426, https://arxiv.org/pdf/2509.23188.pdf

Authors: Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu, Eric Hanchen Jiang, Wenke Huang, Guibin Zhang, Hejia Geng, Xiangru Tang, Zhenfei Yin, Yizhou Sun, Wei Wang

Title: Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

Abstract:
Large Language Model (LLM)‑powered multi‑agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role‑specialized coordination in complex tasks. However, reliability‑critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system‑user, peer‑peer), where agents misprioritize system‑level rules in the presence of competing demands. Moreover, widely used macro‑level metrics (e.g., pass@k) obscure these micro‑level violations and offer little actionable guidance for remedy. In this work, we present a full‑stack, three‑stage framework: (1) Diagnose ‑ Contextualized Role Adherence Score (CRAS), a query‑wise, context‑aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize ‑ attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align ‑ Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token‑weighted DPO‑style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full‑model finetuning.

PaperID: 8427, https://arxiv.org/pdf/2509.23178.pdf

Authors: Tian Qin, Yuhan Chen, Zhiwei Wang, Zhi-Qin John Xu

Title: Limit Analysis for Symbolic Multi-step Reasoning Tasks with Information Propagation Rules Based on Transformers

Abstract:
Transformers are able to perform reasoning tasks, however the intrinsic mechanism remains widely open. In this paper we propose a set of information propagation rules based on Transformers and utilize symbolic reasoning tasks to theoretically analyze the limit reasoning steps. We show that the limit number of reasoning steps is between O(3^L‑1) and O(2^L‑1) for a model with L attention layers in a single‑pass.

PaperID: 8428, https://arxiv.org/pdf/2509.23157.pdf

Authors: Yanqing Fu, Chao Huang, Chenrun Wang, Zhuping Wang

Title: Grouped Satisficing Paths in Pure Strategy Games: a Topological Perspective

Abstract:
In game theory and multi‑agent reinforcement learning (MARL), each agent selects a strategy, interacts with the environment and other agents, and subsequently updates its strategy based on the received payoff. This process generates a sequence of joint strategies (s^t)_t \geq 0, where s^t represents the strategy profile of all agents at time step t. A widely adopted principle in MARL algorithms is "win‑stay, lose‑shift", which dictates that an agent retains its current strategy if it achieves the best response. This principle exhibits a fixed‑point property when the joint strategy has become an equilibrium. The sequence of joint strategies under this principle is referred to as a satisficing path, a concept first introduced in [40] and explored in the context of N‑player games in [39]. A fundamental question arises regarding this principle: Under what conditions does every initial joint strategy s admit a finite‑length satisficing path (s^t)_0 \leq t \leq T where s^0=s and s^T is an equilibrium? This paper establishes a sufficient condition for such a property, and demonstrates that any finite‑state Markov game, as well as any N‑player game, guarantees the existence of a finite‑length satisficing path from an arbitrary initial strategy to some equilibrium. These results provide a stronger theoretical foundation for the design of MARL algorithms.

PaperID: 8429, https://arxiv.org/pdf/2509.23154.pdf

Authors: Jinzhe Pan, Jingqing Wang, Yuehui Ouyang, Wenchi Cheng, Wei Zhang

Title: AI-Enhanced Distributed Channel Access for Collision Avoidance in Future Wi-Fi 8

Abstract:
The exponential growth of wireless devices and stringent reliability requirements of emerging applications demand fundamental improvements in distributed channel access mechanisms for unlicensed bands. Current Wi‑Fi systems, which rely on binary exponential backoff (BEB), suffer from suboptimal collision resolution in dense deployments and persistent fairness challenges due to inherent randomness. This paper introduces a multi‑agent reinforcement learning framework that integrates artificial intelligence (AI) optimization with legacy device coexistence. We first develop a dynamic backoff selection mechanism that adapts to real‑time channel conditions through access deferral events while maintaining full compatibility with conventional CSMA/CA operations. Second, we introduce a fairness quantification metric aligned with enhanced distributed channel access (EDCA) principles to ensure equitable medium access opportunities. Finally, we propose a centralized training decentralized execution (CTDE) architecture incorporating neighborhood activity patterns as observational inputs, optimized via constrained multi‑agent proximal policy optimization (MAPPO) to jointly minimize collisions and guarantee fairness. Experimental results demonstrate that our solution significantly reduces collision probability compared to conventional BEB while preserving backward compatibility with commercial Wi‑Fi devices. The proposed fairness metric effectively eliminates starvation risks in heterogeneous scenarios.

PaperID: 8430, https://arxiv.org/pdf/2509.23058.pdf

Authors: Yikai Wang, Xiaocheng Li, Guanting Chen

Title: Risk Profiling and Modulation for LLMs

Abstract:
Large language models (LLMs) are increasingly used for decision‑making tasks under uncertainty; however, their risk profiles and how they are influenced by prompting and alignment methods remain underexplored. Existing studies have primarily examined personality prompting or multi‑agent interactions, leaving open the question of how post‑training influences the risk behavior of LLMs. In this work, we propose a new pipeline for eliciting, steering, and modulating LLMs' risk profiles, drawing on tools from behavioral economics and finance. Using utility‑theoretic models, we compare pre‑trained, instruction‑tuned, and RLHF‑aligned LLMs, and find that while instruction‑tuned models exhibit behaviors consistent with some standard utility formulations, pre‑trained and RLHF‑aligned models deviate more from any utility models fitted. We further evaluate modulation strategies, including prompt engineering, in‑context learning, and post‑training, and show that post‑training provides the most stable and effective modulation of risk preference. Our findings provide insights into the risk profiles of different classes and stages of LLMs and demonstrate how post‑training modulates these profiles, laying the groundwork for future research on behavioral alignment and risk‑aware LLM design.

PaperID: 8431, https://arxiv.org/pdf/2509.23055.pdf

Authors: Binwei Yao, Chao Shang, Wanyu Du, Jianfeng He, Ruixue Lian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

Title: Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate

Abstract:
Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. This behavior poses significant challenges for multi‑agent debating systems (MADS) that rely on productive disagreement to refine arguments and foster innovative thinking. LLMs' inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi‑agent debate. While prior studies focus on user‑‑LLM sycophancy, the impact of inter‑agent sycophancy in debate remains poorly understood. To address this gap, we introduce the first operational framework that (1) proposes a formal definition of sycophancy specific to MADS settings, (2) develops new metrics to evaluate the agent sycophancy level and its impact on information exchange in MADS, and (3) systematically investigates how varying levels of sycophancy across agent roles (debaters and judges) affects outcomes in both decentralized and centralized debate frameworks. Our findings reveal that sycophancy is a core failure mode that amplifies disagreement collapse before reaching a correct conclusion in multi‑agent debates, yields lower accuracy than single‑agent baselines, and arises from distinct debater‑driven and judge‑driven failure modes. Building on these findings, we propose actionable design principles for MADS, effectively balancing productive disagreement with cooperation in agent interactions.

PaperID: 8432, https://arxiv.org/pdf/2509.23040.pdf

Authors: Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang

Title: Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Abstract:
Large language models face challenges in long‑context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, also known as the "memorize while reading" methods. While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively callback historical memories for non‑linear reasoning. To further strengthen training, we propose a multi‑level reward design, which combines final‑answer rewards with dense, step‑level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi‑hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state‑of‑the‑art baselines on long‑context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long‑context reasoning.

PaperID: 8433, https://arxiv.org/pdf/2509.23026.pdf

Authors: Yue Wang

Title: Game-Theoretic Understandings of Multi-Agent Systems with Multiple Objectives

Abstract:
In practical multi‑agent systems, agents often have diverse objectives, which makes the system more complex, as each agent's performance across multiple criteria depends on the joint actions of all agents, creating intricate strategic trade‑offs. To address this, we introduce the Multi‑Objective Markov Game (MOMG), a framework for multi‑agent reinforcement learning with multiple objectives. We propose the Pareto‑Nash Equilibrium (PNE) as the primary solution concept, where no agent can unilaterally improve one objective without sacrificing performance on another. We prove existence of PNE, and establish an equivalence between the PNE and the set of Nash Equilibria of MOMG's corresponding linearly scalarized games, enabling solutions of MOMG by transferring to a standard single‑objective Markov game. However, we note that computing a PNE is theoretically and computationally challenging, thus we propose and study weaker but more tractable solution concepts. Building on these foundations, we develop online learning algorithm that identify a single solution to MOMGs. Furthermore, we propose a two‑phase, preference‑free algorithm that decouples exploration from planning. Our algorithm enables computation of a PNE for any given preference profile without collecting new samples, providing an efficient methodological characterization of the entire Pareto‑Nash front.

PaperID: 8434, https://arxiv.org/pdf/2509.23023.pdf

Authors: Davi Bastos Costa, Renato Vicente

Title: Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Abstract:
Large language models are increasingly deployed in multi‑agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini‑Mafia, a four‑player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win‑rate p is predicted by the analytical formula \textlogit(p) = v × (m ‑ d), where m, d, and v represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini‑Mafia Benchmark, where Bayesian inference over gameplay data yields per‑model estimates of the intrinsic parameters m, d, and v. For I models, only 3I parameters suffice to predict the outcomes of all I^3 tournament combinations; and in 5‑fold cross‑validation the formula achieves a 76.6% Brier‑score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT‑5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini‑Mafia, a simple but nontrivial multi‑agent system, admits an analytical description and serves as a principled benchmark for language model interactions.

PaperID: 8435, https://arxiv.org/pdf/2509.22963.pdf

Authors: Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz

Title: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Abstract:
Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real‑world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state‑of‑the‑art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro‑actions, and multi‑agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

PaperID: 8436, https://arxiv.org/pdf/2509.22937.pdf

Authors: Trent Weiss, Amar Kulkarni, Madhur Behl

Title: DBF-MA: A Differential Bayesian Filtering Planner for Multi-Agent Autonomous Racing Overtakes

Abstract:
A significant challenge in autonomous racing is to generate overtaking maneuvers. Racing agents must execute these maneuvers on complex racetracks with little room for error. Optimization techniques and graph‑based methods have been proposed, but these methods often rely on oversimplified assumptions for collision‑avoidance and dynamic constraints. In this work, we present an approach to trajectory synthesis based on an extension of the Differential Bayesian Filtering framework. Our approach for collision‑free trajectory synthesis frames the problem as one of Bayesian Inference over the space of Composite Bezier Curves. Our method is derivative‑free, does not require a spherical approximation of the vehicle footprint, linearization of constraints, or simplifying upper bounds on collision avoidance. We conduct a closed‑loop analysis of DBF‑MA and find it successfully overtakes an opponent in 87% of tested scenarios, outperforming existing methods in autonomous overtaking.

PaperID: 8437, https://arxiv.org/pdf/2509.22887.pdf

Authors: EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, Vered Shwartz

Title: Infusing Theory of Mind into Socially Intelligent LLM Agents

Abstract:
Theory of Mind (ToM)‑an understanding of the mental states of others‑is a key aspect of human social intelligence, yet, chatbots and LLM‑based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM‑focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal‑oriented reasoning behaviors, which enable long‑horizon adaptation, while maintaining better relationships with their partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.

PaperID: 8438, https://arxiv.org/pdf/2509.22830.pdf

Authors: Hwan Chang, Yonghyun Jun, Hwanhee Lee

Title: ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Abstract:
The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain‑text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi‑turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction‑following tendencies. Building on this foundation, we develop a persuasion‑driven Multi‑turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi‑turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat‑template‑based payloads demonstrate strong transferability across models and remain effective even against closed‑source LLMs, despite their unknown template structures, and (3) existing prompt‑based defenses are largely ineffective against this attack approach, especially against Multi‑turn variants. These findings highlight vulnerabilities in current agent systems.

PaperID: 8439, https://arxiv.org/pdf/2509.22814.pdf

Authors: Aditi Tiwari, Akshit Bhalla, Darshan Prasad

Title: Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions

Abstract:
The Model Context Protocol (MCP) defines a schema bound execution model for agent‑tool interaction, enabling modular computer vision workflows without retraining. To our knowledge, this is the first protocol level, deployment scale audit of MCP in vision systems, identifying systemic weaknesses in schema semantics, interoperability, and runtime coordination. We analyze 91 publicly registered vision centric MCP servers, annotated along nine dimensions of compositional fidelity, and develop an executable benchmark with validators to detect and categorize protocol violations. The audit reveals high prevalence of schema format divergence, missing runtime schema validation, undeclared coordinate conventions, and reliance on untracked bridging scripts. Validator based testing quantifies these failures, with schema format checks flagging misalignments in 78.0 percent of systems, coordinate convention checks detecting spatial reference errors in 24.6 percent, and memory scope checks issuing an average of 33.8 warnings per 100 executions. Security probes show that dynamic and multi agent workflows exhibit elevated risks of privilege escalation and untyped tool connections. The proposed benchmark and validator suite, implemented in a controlled testbed and to be released on GitHub, establishes a reproducible framework for measuring and improving the reliability and security of compositional vision workflows.

PaperID: 8440, https://arxiv.org/pdf/2509.22750.pdf

Authors: Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee

Title: MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference

Abstract:
Real‑world multi‑hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real‑world user queries, previous benchmarks have primarily focused on single‑hop ambiguity, leaving the complex interaction between multi‑step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi‑hop ambiguous questions curated via multi‑LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state‑of‑the‑art models struggle with MARCH, confirming that combining ambiguity resolution with multi‑step reasoning is a significant challenge. To address this, we propose CLARION, a two‑stage agentic framework that explicitly decouples ambiguity planning from evidence‑driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.

PaperID: 8441, https://arxiv.org/pdf/2509.22698.pdf

Authors: Hailong Zhang, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

Title: Advancing Audio-Visual Navigation Through Multi-Agent Collaboration in 3D Environments

Abstract:
Intelligent agents often require collaborative strategies to achieve complex tasks beyond individual capabilities in real‑world scenarios. While existing audio‑visual navigation (AVN) research mainly focuses on single‑agent systems, their limitations emerge in dynamic 3D environments where rapid multi‑agent coordination is critical, especially for time‑sensitive applications like emergency response. This paper introduces MASTAVN (Multi‑Agent Scalable Transformer Audio‑Visual Navigation), a scalable framework enabling two agents to collaboratively localize and navigate toward an audio target in shared 3D environments. By integrating cross‑agent communication protocols and joint audio‑visual fusion mechanisms, MASTAVN enhances spatial reasoning and temporal synchronization. Through rigorous evaluation in photorealistic 3D simulators (Replica and Matterport3D), MASTAVN achieves significant reductions in task completion time and notable improvements in navigation success rates compared to single‑agent and non‑collaborative baselines. This highlights the essential role of spatiotemporal coordination in multi‑agent systems. Our findings validate MASTAVN's effectiveness in time‑sensitive emergency scenarios and establish a paradigm for advancing scalable multi‑agent embodied intelligence in complex 3D environments.

PaperID: 8442, https://arxiv.org/pdf/2509.22631.pdf

Authors: Debargha Ganguly, Sumit Kumar, Ishwar Balappanawar, Weicong Chen, Shashank Kambhatla, Srinivasan Iyengar, Shivkumar Kalyanaraman, Ponnurangam Kumaraguru, Vipin Chaudhary

Title: LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision

Abstract:
Curating high‑quality, domain‑specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade‑offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi‑step reasoning to execute specialized tools across three core capabilities: (1) Calibrated Discovery sources relevant, in‑distribution data from large repositories; (2) Controllable Synthesis generates novel data for rare scenarios with robust filtering; and (3) Consensus Annotation produces accurate labels by orchestrating multiple foundation models via a novel consensus mechanism incorporating non‑maximum suppression and voting. Our large‑scale validation proves the effectiveness of Labeling Copilot's components. The Consensus Annotation module excels at object discovery: on the dense COCO dataset, it averages 14.2 candidate proposals per image‑nearly double the 7.4 ground‑truth objects‑achieving a final annotation mAP of 37.1%. On the web‑scale Open Images dataset, it navigated extreme class imbalance to discover 903 new bounding box categories, expanding its capability to over 1500 total. Concurrently, our Calibrated Discovery tool, tested at a 10‑million sample scale, features an active learning strategy that is up to 40x more computationally efficient than alternatives with equivalent sample efficiency. These experiments validate that an agentic workflow with optimized, scalable tools provides a robust foundation for curating industrial‑scale datasets.

PaperID: 8443, https://arxiv.org/pdf/2509.22596.pdf

Authors: Qixin Zhang, Yan Sun, Can Jin, Xikun Zhang, Yao Shu, Puning Zhao, Li Shen, Dacheng Tao

Title: Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives

Abstract:
In this paper, we present two effective policy learning algorithms for multi‑agent online coordination(MA‑OC) problem. The first one, \textttMA‑SPL, not only can achieve the optimal (1‑\fracce)‑approximation guarantee for the MA‑OC problem with submodular objectives but also can handle the unexplored α‑weakly DR‑submodular and (γ,β)‑weakly submodular scenarios, where c is the curvature of the investigated submodular functions, α denotes the diminishing‑return(DR) ratio and the tuple (γ,β) represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters α,γ,β inherent in the \textttMA‑SPL algorithm, we further introduce the second online algorithm named \textttMA‑MPL. This \textttMA‑MPL algorithm is entirely \emphparameter‑free and simultaneously can maintain the same approximation ratio as the first \textttMA‑SPL algorithm. The core of our \textttMA‑SPL and \textttMA‑MPL algorithms is a novel continuous‑relaxation technique termed as \emphpolicy‑based continuous extension. Compared with the well‑established \emphmulti‑linear extension, a notable advantage of this new \emphpolicy‑based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.

PaperID: 8444, https://arxiv.org/pdf/2509.22576.pdf

Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

Title: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Abstract:
Training LLM agents in multi‑turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration‑exploitation cascade failure. This cascade begins with early‑stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low‑entropy strategies. Subsequently, agents enter late‑stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy‑regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi‑turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase‑based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi‑turn sparse‑reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.

PaperID: 8445, https://arxiv.org/pdf/2509.22537.pdf

Authors: Haoyang Li, Xiao Jia, Zhanzhan Zhao

Title: The Emergence of Altruism in Large-Language-Model Agents Society

Abstract:
Leveraging Large Language Models (LLMs) for social simulation is a frontier in computational social science. Understanding the social logics these agents embody is critical to this attempt. However, existing research has primarily focused on cooperation in small‑scale, task‑oriented games, overlooking how altruism, which means sacrificing self‑interest for collective benefit, emerges in large‑scale agent societies. To address this gap, we introduce a Schelling‑variant urban migration model that creates a social dilemma, compelling over 200 LLM agents to navigate an explicit conflict between egoistic (personal utility) and altruistic (system utility) goals. Our central finding is a fundamental difference in the social tendencies of LLMs. We identify two distinct archetypes: "Adaptive Egoists", which default to prioritizing self‑interest but whose altruistic behaviors significantly increase under the influence of a social norm‑setting message board; and "Altruistic Optimizers", which exhibit an inherent altruistic logic, consistently prioritizing collective benefit even at a direct cost to themselves. Furthermore, to qualitatively analyze the cognitive underpinnings of these decisions, we introduce a method inspired by Grounded Theory to systematically code agent reasoning. In summary, this research provides the first evidence of intrinsic heterogeneity in the egoistic and altruistic tendencies of different LLMs. We propose that for social simulation, model selection is not merely a matter of choosing reasoning capability, but of choosing an intrinsic social action logic. While "Adaptive Egoists" may offer a more suitable choice for simulating complex human societies, "Altruistic Optimizers" are better suited for modeling idealized pro‑social actors or scenarios where collective welfare is the primary consideration.

PaperID: 8446, https://arxiv.org/pdf/2509.22504.pdf

Authors: Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

Title: Estimating the Empowerment of Language Model Agents

Abstract:
As language model (LM) agents become increasingly capable and adopted in real‑world applications, there is a growing need for scalable evaluation frameworks beyond costly, manually designed benchmarks. We propose information‑theoretic evaluation based on empowerment, an information‑theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text‑based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi‑turn text interactions. We demonstrate EELMA on textual games and realistic web and tool‑use environments, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high‑empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a goal‑agnostic metric that complements task‑success measures for LM‑agent evaluation.

PaperID: 8447, https://arxiv.org/pdf/2509.22502.pdf

Authors: Chenglin Yu, Yang Yu, Songmiao Wang, Yucheng Wang, Yifan Yang, Jinjia Li, Ming Li, Hongxia Yang

Title: InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable capabilities in organizing and executing complex tasks, and many such agents are now widely used in various application scenarios. However, developing these agents requires carefully designed workflows, carefully crafted prompts, and iterative tuning, which requires LLM techniques and domain‑specific expertise. These hand‑crafted limitations hinder the scalability and cost‑effectiveness of LLM agents across a wide range of industries. To address these challenges, we propose InfiAgent, a Pyramid‑like DAG‑based Multi‑Agent Framework that can be applied to infinite scenarios, which introduces several key innovations: a generalized "agent‑as‑a‑tool" mechanism that automatically decomposes complex agents into hierarchical multi‑agent systems; a dual‑audit mechanism that ensures the quality and stability of task completion; an agent routing function that enables efficient task‑agent matching; and an agent self‑evolution mechanism that autonomously restructures the agent DAG based on new tasks, poor performance, or optimization opportunities. Furthermore, InfiAgent's atomic task design supports agent parallelism, significantly improving execution efficiency. This framework evolves into a versatile pyramid‑like multi‑agent system capable of solving a wide range of problems. Evaluations on multiple benchmarks demonstrate that InfiAgent achieves 9.9% higher performance compared to ADAS (similar auto‑generated agent framework), while a case study of the AI research assistant InfiHelper shows that it generates scientific papers that have received recognition from human reviewers at top‑tier IEEE conferences.

PaperID: 8448, https://arxiv.org/pdf/2509.22434.pdf

Authors: Margherita Martorana, Francesca Urgese, Ilaria Tiddi, Stefan Schlobach

Title: An Ontology for Unified Modeling of Tasks, Actions, Environments, and Capabilities in Personal Service Robotics

Abstract:
Personal service robots are increasingly used in domestic settings to assist older adults and people requiring support. Effective operation involves not only physical interaction but also the ability to interpret dynamic environments, understand tasks, and choose appropriate actions based on context. This requires integrating both hardware components (e.g. sensors, actuators) and software systems capable of reasoning about tasks, environments, and robot capabilities. Frameworks such as the Robot Operating System (ROS) provide open‑source tools that help connect low‑level hardware with higher‑level functionalities. However, real‑world deployments remain tightly coupled to specific platforms. As a result, solutions are often isolated and hard‑coded, limiting interoperability, reusability, and knowledge sharing. Ontologies and knowledge graphs offer a structured way to represent tasks, environments, and robot capabilities. Existing ontologies, such as the Socio‑physical Model of Activities (SOMA) and the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), provide models for activities, spatial relationships, and reasoning structures. However, they often focus on specific domains and do not fully capture the connection between environment, action, robot capabilities, and system‑level integration. In this work, we propose the Ontology for roBOts and acTions (OntoBOT), which extends existing ontologies to provide a unified representation of tasks, actions, environments, and capabilities. Our contributions are twofold: (1) we unify these aspects into a cohesive ontology to support formal reasoning about task execution, and (2) we demonstrate its generalizability by evaluating competency questions across four embodied agents ‑ TIAGo, HSR, UR3, and Stretch ‑ showing how OntoBOT enables context‑aware reasoning, task‑oriented execution, and knowledge sharing in service robotics.

PaperID: 8449, https://arxiv.org/pdf/2509.22421.pdf

Authors: Leonel Giacobbe, Jingdao Chen, Chuangchuang Sun

Title: Learning-Based Collaborative Control for Bi-Manual Tactile-Reactive Grasping

Abstract:
Grasping is a core task in robotics with various applications. However, most current implementations are primarily designed for rigid items, and their performance drops considerably when handling fragile or deformable materials that require real‑time feedback. Meanwhile, tactile‑reactive grasping focuses on a single agent, which limits their ability to grasp and manipulate large, heavy objects. To overcome this, we propose a learning‑based, tactile‑reactive multi‑agent Model Predictive Controller (MPC) for grasping a wide range of objects with different softness and shapes, beyond the capabilities of preexisting single‑agent implementations. Our system uses two Gelsight Mini tactile sensors [1] to extract real‑time information on object texture and stiffness. This rich tactile feedback is used to estimate contact dynamics and object compliance in real time, enabling the system to adapt its control policy to diverse object geometries and stiffness profiles. The learned controller operates in a closed loop, leveraging tactile encoding to predict grasp stability and adjust force and position accordingly. Our key technical contributions include a multi‑agent MPC formulation trained on real contact interactions, a tactile‑data driven method for inferring grasping states, and a coordination strategy that enables collaborative control. By combining tactile sensing and a learning‑based multi‑agent MPC, our method offers a robust, intelligent solution for collaborative grasping in complex environments, significantly advancing the capabilities of multi‑agent systems. Our approach is validated through extensive experiments against independent PD and MPC baselines. Our pipeline outperforms the baselines regarding success rates in achieving and maintaining stable grasps across objects of varying sizes and stiffness.

PaperID: 8450, https://arxiv.org/pdf/2509.22391.pdf

Authors: Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, Bing Luo

Title: Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents

Abstract:
Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open‑domain question answering (QA). However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce SeekBench, the first benchmark for evaluating the epistemic competence of LLM search agents through step‑level analysis of their response traces. SeekBench comprises 190 expert‑annotated traces with over 1,800 response steps generated by LLM search agents, each enriched with evidence annotations for granular analysis of whether agents (1) generate reasoning steps grounded in observed evidence, (2) adaptively reformulate searches to recover from low‑quality results, and (3) have proper calibration to correctly assess whether the current evidence is sufficient for providing an answer.

PaperID: 8451, https://arxiv.org/pdf/2509.22315.pdf

Authors: Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu

Title: PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning

Abstract:
Inspired by the dual‑process theory of human cognition from Thinking, Fast and Slow, we introduce PRIME (Planning and Retrieval‑Integrated Memory for Enhanced Reasoning), a multi‑agent reasoning framework that dynamically integrates System 1 (fast, intuitive thinking) and System 2 (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for planning, hypothesis generation, retrieval, information integration, and decision‑making. This multi‑agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open‑source LLMs to perform competitively with state‑of‑the‑art closed‑source models like GPT‑4 and GPT‑4o on benchmarks requiring multi‑hop and knowledge‑grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge‑intensive reasoning.

PaperID: 8452, https://arxiv.org/pdf/2509.22296.pdf

Authors: Joseph Hunt, Koyo Fujii, Aly Magassouba, Praminda Caleb-Solly

Title: Beyond Detection -- Orchestrating Human-Robot-Robot Assistance via an Internet of Robotic Things Paradigm

Abstract:
Hospital patient falls remain a critical and costly challenge worldwide. While conventional fall prevention systems typically rely on post‑fall detection or reactive alerts, they also often suffer from high false positive rates and fail to address the underlying patient needs that lead to bed‑exit attempts. This paper presents a novel system architecture that leverages the Internet of Robotic Things (IoRT) to orchestrate human‑robot‑robot interaction for proactive and personalized patient assistance. The system integrates a privacy‑preserving thermal sensing model capable of real‑time bed‑exit prediction, with two coordinated robotic agents that respond dynamically based on predicted intent and patient input. This orchestrated response could not only reduce fall risk but also attend to the patient's underlying motivations for movement, such as thirst, discomfort, or the need for assistance, before a hazardous situation arises. Our contributions with this pilot study are three‑fold: (1) a modular IoRT‑based framework enabling distributed sensing, prediction, and multi‑robot coordination; (2) a demonstration of low‑resolution thermal sensing for accurate, privacy‑preserving preemptive bed‑exit detection; and (3) results from a user study and systematic error analysis that inform the design of situationally aware, multi‑agent interactions in hospital settings. The findings highlight how interactive and connected robotic systems can move beyond passive monitoring to deliver timely, meaningful assistance, empowering safer, more responsive care environments.

PaperID: 8453, https://arxiv.org/pdf/2509.22218.pdf

Authors: Sandaru Fernando, Imasha Jayarathne, Sithumini Abeysekara, Shanuja Sithamparanthan, Thushari Silva, Deshan Jayawardana

Title: VizGen: Data Exploration and Visualization from Natural Language via a Multi-Agent AI Architecture

Abstract:
Data visualization is essential for interpreting complex datasets, yet traditional tools often require technical expertise, limiting accessibility. VizGen is an AI‑assisted graph generation system that empowers users to create meaningful visualizations using natural language. Leveraging advanced NLP and LLMs like Claude 3.7 Sonnet and Gemini 2.0 Flash, it translates user queries into SQL and recommends suitable graph types. Built on a multi‑agent architecture, VizGen handles SQL generation, graph creation, customization, and insight extraction. Beyond visualization, it analyzes data for patterns, anomalies, and correlations, and enhances user understanding by providing explanations enriched with contextual information gathered from the internet. The system supports real‑time interaction with SQL databases and allows conversational graph refinement, making data analysis intuitive and accessible. VizGen democratizes data visualization by bridging the gap between technical complexity and user‑friendly design.

PaperID: 8454, https://arxiv.org/pdf/2509.22216.pdf

Authors: Ahmet Onur Akman, Anastasia Psarou, Zoltán György Varga, Grzegorz Jamróz, Rafał Kucharski

Title: Impact of Collective Behaviors of Autonomous Vehicles on Urban Traffic Dynamics: A Multi-Agent Reinforcement Learning Approach

Abstract:
This study examines the potential impact of reinforcement learning (RL)‑enabled autonomous vehicles (AV) on urban traffic flow in a mixed traffic environment. We focus on a simplified day‑to‑day route choice problem in a multi‑agent setting. We consider a city network where human drivers travel through their chosen routes to reach their destinations in minimum travel time. Then, we convert one‑third of the population into AVs, which are RL agents employing Deep Q‑learning algorithm. We define a set of optimization targets, or as we call them behaviors, namely selfish, collaborative, competitive, social, altruistic, and malicious. We impose a selected behavior on AVs through their rewards. We run our simulations using our in‑house developed RL framework PARCOUR. Our simulations reveal that AVs optimize their travel times by up to 5%, with varying impacts on human drivers' travel times depending on the AV behavior. In all cases where AVs adopt a self‑serving behavior, they achieve shorter travel times than human drivers. Our findings highlight the complexity differences in learning tasks of each target behavior. We demonstrate that the multi‑agent RL setting is applicable for collective routing on traffic networks, though their impact on coexisting parties greatly varies with the behaviors adopted.

PaperID: 8455, https://arxiv.org/pdf/2509.22174.pdf

Authors: Durgesh Kalwar, Mayank Baranwal, Harshad Khadilkar

Title: Efficiency Boost in Decentralized Optimization: Reimagining Neighborhood Aggregation with Minimal Overhead

Abstract:
In today's data‑sensitive landscape, distributed learning emerges as a vital tool, not only fortifying privacy measures but also streamlining computational operations. This becomes especially crucial within fully decentralized infrastructures where local processing is imperative due to the absence of centralized aggregation. Here, we introduce DYNAWEIGHT, a novel framework to information aggregation in multi‑agent networks. DYNAWEIGHT offers substantial acceleration in decentralized learning with minimal additional communication and memory overhead. Unlike traditional static weight assignments, such as Metropolis weights, DYNAWEIGHT dynamically allocates weights to neighboring servers based on their relative losses on local datasets. Consequently, it favors servers possessing diverse information, particularly in scenarios of substantial data heterogeneity. Our experiments on various datasets MNIST, CIFAR10, and CIFAR100 incorporating various server counts and graph topologies, demonstrate notable enhancements in training speeds. Notably, DYNAWEIGHT functions as an aggregation scheme compatible with any underlying server‑level optimization algorithm, underscoring its versatility and potential for widespread integration.

PaperID: 8456, https://arxiv.org/pdf/2509.22170.pdf

Authors: Chengjia Wang, Lanling Tang, Ming Yuan, Jiongchi Yu, Xiaofei Xie, Jiajun Bu

Title: Leveraging LLM Agents for Automated Video Game Testing

Abstract:
Testing MMORPGs (Massively Multiplayer Online Role‑Playing Games) is a critical yet labor‑intensive task in game development due to their complexity and frequent updating nature. Traditional automated game testing approaches struggle to achieve high state coverage and efficiency in these rich, open‑ended environments, while existing LLM‑based game‑playing approaches are limited to shallow reasoning ability in understanding complex game state‑action spaces and long‑complex tasks. To address these challenges, we propose TITAN, an effective LLM‑driven agent framework for intelligent MMORPG testing. TITAN incorporates four key components to: (1) perceive and abstract high‑dimensional game states, (2) proactively optimize and prioritize available actions, (3) enable long‑horizon reasoning with action trace memory and reflective self‑correction, and (4) employ LLM‑based oracles to detect potential functional and logic bugs with diagnostic reports. We implement the prototype of TITAN and evaluate it on two large‑scale commercial MMORPGs spanning both PC and mobile platforms. In our experiments, TITAN achieves significantly higher task completion rates (95%) and bug detection performance compared to existing automated game testing approaches. An ablation study further demonstrates that each core component of TITAN contributes substantially to its overall performance. Notably, TITAN detects four previously unknown bugs that prior testing approaches fail to identify. We provide an in‑depth discussion of these results, which offer guidance for new avenues of advancing intelligent, general‑purpose testing systems. Moreover, TITAN has been deployed in eight real‑world game QA pipelines, underscoring its practical impact as an LLM‑driven game testing framework.

PaperID: 8457, https://arxiv.org/pdf/2509.22130.pdf

Authors: Merve Atasever, Matthew Hong, Mihir Nitin Kulkarni, Qingpei Li, Jyotirmoy V. Deshmukh

Title: Multi-Agent Path Finding via Offline RL and LLM Collaboration

Abstract:
Multi‑Agent Path Finding (MAPF) poses a significant and challenging problem critical for applications in robotics and logistics, particularly due to its combinatorial complexity and the partial observability inherent in realistic environments. Decentralized reinforcement learning methods commonly encounter two substantial difficulties: first, they often yield self‑centered behaviors among agents, resulting in frequent collisions, and second, their reliance on complex communication modules leads to prolonged training times, sometimes spanning weeks. To address these challenges, we propose an efficient decentralized planning framework based on the Decision Transformer (DT), uniquely leveraging offline reinforcement learning to substantially reduce training durations from weeks to mere hours. Crucially, our approach effectively handles long‑horizon credit assignment and significantly improves performance in scenarios with sparse and delayed rewards. Furthermore, to overcome adaptability limitations inherent in standard RL methods under dynamic environmental changes, we integrate a large language model (GPT‑4o) to dynamically guide agent policies. Extensive experiments in both static and dynamically changing environments demonstrate that our DT‑based approach, augmented briefly by GPT‑4o, significantly enhances adaptability and performance.

PaperID: 8458, https://arxiv.org/pdf/2509.21998.pdf

Authors: Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, Jiantao Jiao

Title: GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments

Abstract:
As LLMs are increasingly deployed as agents, agentic reasoning ‑ the ability to combine tool use, especially search, and reasoning ‑ becomes a critical skill. However, it is hard to disentangle agentic reasoning when evaluated in complex environments and tasks. Current agent benchmarks often mix agentic reasoning with challenging math reasoning, expert‑level knowledge, and other advanced capabilities. To fill this gap, we build a novel benchmark, GSM‑Agent, where an LLM agent is required to solve grade‑school‑level reasoning problems, but is only presented with the question in the prompt without the premises that contain the necessary information to solve the task, and needs to proactively collect that information using tools. Although the original tasks are grade‑school math problems, we observe that even frontier models like GPT‑5 only achieve 67% accuracy. To understand and analyze the agentic reasoning patterns, we propose the concept of agentic reasoning graph: cluster the environment's document embeddings into nodes, and map each tool call to its nearest node to build a reasoning path. Surprisingly, we identify that the ability to revisit a previously visited node, widely taken as a crucial pattern in static reasoning, is often missing for agentic reasoning for many models. Based on the insight, we propose a tool‑augmented test‑time scaling method to improve LLM's agentic reasoning performance by adding tools to encourage models to revisit. We expect our benchmark and the agentic reasoning framework to aid future studies of understanding and pushing the boundaries of agentic reasoning.

PaperID: 8459, https://arxiv.org/pdf/2509.21994.pdf

Authors: Genjia Liu, Anning Hu, Yue Hu, Wenjun Zhang, Siheng Chen

Title: Rate-Distortion Optimized Communication for Collaborative Perception

Abstract:
Collaborative perception emphasizes enhancing environmental understanding by enabling multiple agents to share visual information with limited bandwidth resources. While prior work has explored the empirical trade‑off between task performance and communication volume, a significant gap remains in the theoretical foundation. To fill this gap, we draw on information theory and introduce a pragmatic rate‑distortion theory for multi‑agent collaboration, specifically formulated to analyze performance‑communication trade‑off in goal‑oriented multi‑agent systems. This theory concretizes two key conditions for designing optimal communication strategies: supplying pragmatically relevant information and transmitting redundancy‑less messages. Guided by these two conditions, we propose RDcomm, a communication‑efficient collaborative perception framework that introduces two key innovations: i) task entropy discrete coding, which assigns features with task‑relevant codeword‑lengths to maximize the efficiency in supplying pragmatic information; ii) mutual‑information‑driven message selection, which utilizes mutual information neural estimation to approach the optimal redundancy‑less condition. Experiments on 3D object detection and BEV segmentation demonstrate that RDcomm achieves state‑of‑the‑art accuracy on DAIR‑V2X and OPV2V, while reducing communication volume by up to 108 times. The code will be released.

PaperID: 8460, https://arxiv.org/pdf/2509.21981.pdf

Authors: Zhimin Wang, Duo Wu, Shaokang He, Jinghe Wang, Linjia Kang, Jing Yu, Kai Zhu, Jiawei Li, Zhi Wang

Title: Collaborative Belief Reasoning with LLMs for Efficient Multi-Agent Collaboration

Abstract:
Effective real‑world multi‑agent collaboration requires not only accurate planning but also the ability to reason about collaborators' intents‑‑a crucial capability for avoiding miscoordination and redundant communication under partial observable environments. Due to their strong planning and reasoning capabilities, large language models (LLMs) have emerged as promising autonomous agents for collaborative task solving. However, existing collaboration frameworks for LLMs overlook their reasoning potential for dynamic intent inference, and thus produce inconsistent plans and redundant communication, reducing collaboration efficiency. To bridge this gap, we propose CoBel‑World, a novel framework that equips LLM agents with a Collaborative Belief World‑‑an internal representation jointly modeling the physical environment and collaborators' mental states. CoBel‑World enables agents to parse external open‑world knowledge into structured beliefs via a symbolic belief representation module, and perform zero‑shot Bayesian‑style belief updates through LLM reasoning. This allows agents to proactively detect potential miscoordination (e.g., conflicting plans) and communicate adaptively. Evaluated on challenging embodied benchmarks (i.e., TDW‑MAT and C‑WAH), CoBel‑World significantly reduces communication costs by 64‑79% and improves task completion efficiency by 4‑28% compared to the strongest baseline. Our results show that explicit, intent‑aware belief modeling is essential for efficient and human‑like collaboration in LLM‑based multi‑agent systems.

PaperID: 8461, https://arxiv.org/pdf/2509.21910.pdf

Authors: Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai

Title: AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition

Abstract:
Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end‑to‑end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM‑based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi‑agent LLM framework enhancing automated scoring via rubric‑aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric‑relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human‑like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open‑source LLMs (GPT‑4o, LLaMA‑3.1‑8B, and LLaMA‑3.1‑70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human‑machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single‑agent baselines, with particularly strong benefits on complex, multi‑dimensional rubrics, and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with multi‑agent design offers a scalable, reliable, and interpretable solution for automated scoring.

PaperID: 8462, https://arxiv.org/pdf/2509.21868.pdf

Authors: Yuxuan Li, Sauvik Das, Hirokazu Shirado

Title: What Makes LLM Agent Simulations Useful for Policy Practice? An Iterative Design Study in Emergency Preparedness

Abstract:
Policymakers must often act under conditions of deep uncertainty, such as emergency response, where predicting the specific impacts of a policy apriori is implausible. Large Language Model (LLM) agent simulations have been proposed as tools to support policymakers under these conditions, yet little is known about how such simulations become useful for real‑world policy practice. To address this gap, we conducted a year‑long, stakeholder‑engaged design process with a university emergency preparedness team. Through iterative design cycles, we developed and refined an LLM agent simulation of a large‑scale campus gathering, ultimately scaling to 13,000 agents that modeled crowd movement and communication under various emergency scenarios. Rather than producing predictive forecasts, these simulations supported policy practice by shaping volunteer training, evacuation procedures, and infrastructure planning. Analyzing these findings, we identify three design process implications for making LLM agent simulations that are useful for policy practice: start from verifiable scenarios to bootstrap trust, use preliminary simulations to elicit tacit domain knowledge, and treat simulation capabilities and policy implementation as co‑evolving.

PaperID: 8463, https://arxiv.org/pdf/2509.21862.pdf

Authors: So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, Yujin Tang

Title: Reimagining Agent-based Modeling with Large Language Model Agents via Shachi

Abstract:
The study of emergent behaviors in large language model (LLM)‑driven multi‑agent systems is a critical research challenge, yet progress is limited by a lack of principled methodologies for controlled experimentation. To address this, we introduce Shachi, a formal methodology and modular framework that decomposes an agent's policy into core cognitive components: Configuration for intrinsic traits, Memory for contextual persistence, and Tools for expanded capabilities, all orchestrated by an LLM reasoning engine. This principled architecture moves beyond brittle, ad‑hoc agent designs and enables the systematic analysis of how specific architectural choices influence collective behavior. We validate our methodology on a comprehensive 10‑task benchmark and demonstrate its power through novel scientific inquiries. Critically, we establish the external validity of our approach by modeling a real‑world U.S. tariff shock, showing that agent behaviors align with observed market reactions only when their cognitive architecture is appropriately configured with memory and tools. Our work provides a rigorous, open‑source foundation for building and evaluating LLM agents, aimed at fostering more cumulative and scientifically grounded research.

PaperID: 8464, https://arxiv.org/pdf/2509.21849.pdf

Authors: Ziqi Liu, Ziyang Zhou, Yilin Li, Haiyang Zhang, Yangbin Chen

Title: Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models

Abstract:
Empathetic response generation is a crucial task for creating more human‑like and supportive conversational agents. However, existing methods face a core trade‑off between the analytical depth of specialized models and the generative fluency of Large Language Models (LLMs). To address this, we propose TRACE, Task‑decomposed Reasoning for Affective Communication and Empathy, a novel framework that models empathy as a structured cognitive process by decomposing the task into a pipeline for analysis and synthesis. By building a comprehensive understanding before generation, TRACE unites deep analysis with expressive generation. Experimental results show that our framework significantly outperforms strong baselines in both automatic and LLM‑based evaluations, confirming that our structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents. Our code is available at https://anonymous.4open.science/r/TRACE‑18EF/README.md.

PaperID: 8465, https://arxiv.org/pdf/2509.21842.pdf

Authors: Yansong Ning, Rui Liu, Jun Wang, Kai Chen, Wei Li, Jun Fang, Kan Zheng, Naiqiang Tan, Hao Liu

Title: DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Abstract:
Travel planning (TP) agent has recently worked as an emerging building block to interact with external tools and resources for travel itinerary generation, ensuring enjoyable user experience. Despite its benefits, existing studies rely on hand craft prompt and fixed agent workflow, hindering more flexible and autonomous TP agent. This paper proposes DeepTravel, an end to end agentic reinforcement learning framework for building autonomous travel planning agent, capable of autonomously planning, executing tools, and reflecting on tool responses to explore, verify, and refine intermediate actions in multi step reasoning. To achieve this, we first construct a robust sandbox environment by caching transportation, accommodation and POI data, facilitating TP agent training without being constrained by real world APIs limitations (e.g., inconsistent outputs). Moreover, we develop a hierarchical reward modeling system, where a trajectory level verifier first checks spatiotemporal feasibility and filters unsatisfied travel itinerary, and then the turn level verifier further validate itinerary detail consistency with tool responses, enabling efficient and precise reward service. Finally, we propose the reply augmented reinforcement learning method that enables TP agent to periodically replay from a failures experience buffer, emerging notable agentic capacity. We deploy trained TP agent on DiDi Enterprise Solutions App and conduct comprehensive online and offline evaluations, demonstrating that DeepTravel enables small size LLMs (e.g., Qwen3 32B) to significantly outperform existing frontier LLMs such as OpenAI o1, o3 and DeepSeek R1 in travel planning tasks.

PaperID: 8466, https://arxiv.org/pdf/2509.21828.pdf

Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen

Title: Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning

Abstract:
We study the problem of online multi‑agent reinforcement learning (MARL) in environments with sparse rewards, where reward feedback is not provided at each interaction but only revealed at the end of a trajectory. This setting, though realistic, presents a fundamental challenge: the lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. To address this issue, we propose a novel framework that integrates online inverse preference learning with multi‑agent on‑policy optimization into a unified architecture. At its core, our approach introduces an implicit multi‑agent reward learning model, built upon a preference‑based value‑decomposition network, which produces both global and local reward signals. These signals are further used to construct dual advantage streams, enabling differentiated learning targets for the centralized critic and decentralized actors. In addition, we demonstrate how large language models (LLMs) can be leveraged to provide preference labels that enhance the quality of the learned reward model. Empirical evaluations on state‑of‑the‑art benchmarks, including MAMuJoCo and SMACv2, show that our method achieves superior performance compared to existing baselines, highlighting its effectiveness in addressing sparse‑reward challenges in online MARL.

PaperID: 8467, https://arxiv.org/pdf/2509.21816.pdf

Authors: Yuhang Xie, Jian Mu, Xiaojun Ma, Chaoyun Zhang, Lu Wang, Mengyu Zhou, Mugeng Liu, Si Qin, Qingwei Lin, Saravan Rajmohan, Shi Han, Dongmei Zhang

Title: From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation

Abstract:
Excel is one of the most widely used productivity tools across domains, offering rich functionality but also overwhelming users with its complexity. This creates a persistent demand for tutorials to support effective usage. However, while building and maintaining the Microsoft tutorial corpus, we observed that existing tutorials are manually created by experts, need frequent updates with each software release, and involve substantial human labor. Moreover, prior work has not achieved fully automated tutorial generation. In this paper, we present the first framework for automatically generating Excel tutorials directly from natural language task descriptions. Our framework first instantiates the task. Then a central component of this framework, Execution Agent, plans and executes the solution in Excel, and collects the intermediate artifacts required for tutorial construction. These artifacts are then transformed into both structured Excel documents and video demonstrations. To build a comprehensive tutorial corpus, we collected 1,559 task descriptions from real‑world scenarios. In addition, we designed a systematic evaluation framework that integrates assessments from both large language models (LLMs) and human reviewers. Experimental results show that our framework improves task execution success rates by 8.5% over state‑of‑the‑art baselines. Moreover, the generated tutorials demonstrate superior readability and instructional effectiveness, often approaching or surpassing expert‑authored materials. Importantly, the automated pipeline eliminates manual labor and reduces time costs to 1/20 of expert authoring, making scalable and high‑quality tutorial generation practical for the first time.

PaperID: 8468, https://arxiv.org/pdf/2509.21799.pdf

Authors: Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Jun Fang, Hua Chai, Naiqiang Tan, Gang Pan

Title: D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

Abstract:
Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: data bottleneck in end‑to‑end training, high cost of delayed error detection, and risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D‑Artemis ‑‑ a novel deliberative framework in this paper. D‑Artemis leverages a fine‑grained, app‑specific tip retrieval mechanism to inform its decision‑making process. It also employs a proactive Pre‑execution Alignment stage, where Thought‑Action Consistency (TAC) Check module and Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post‑execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D‑Artemis enhances the capabilities of general‑purpose Multimodal large language models (MLLMs) for GUI tasks without the need for training on complex trajectory datasets, demonstrating strong generalization. D‑Artemis establishes new state‑of‑the‑art (SOTA) results across both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot‑V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.

PaperID: 8469, https://arxiv.org/pdf/2509.21772.pdf

Authors: Daiki Chiba, Hiroki Nakano, Takashi Koide

Title: PhishLumos: An Adaptive Multi-Agent System for Proactive Phishing Campaign Mitigation

Abstract:
Phishing attacks are a significant societal threat, disproportionately harming vulnerable populations and eroding trust in essential digital services. Current defenses are often reactive, failing against modern evasive tactics like cloaking that conceal malicious content. To address this, we introduce PhishLumos, an adaptive multi‑agent system that proactively mitigates entire attack campaigns. It confronts a core cybersecurity imbalance: attackers can easily scale operations, while defense remains an intensive expert task. Instead of being blocked by evasion, PhishLumos treats it as a critical signal to investigate the underlying infrastructure. Its Large Language Model (LLM)‑powered agents uncover shared hosting, certificates, and domain registration patterns. On real‑world data, our system identified 100% of campaigns in the median case, over a week before their confirmation by cybersecurity experts. PhishLumos demonstrates a practical shift from reactive URL blocking to proactive campaign mitigation, protecting users before they are harmed and making the digital world safer for all.

PaperID: 8470, https://arxiv.org/pdf/2509.21733.pdf

Authors: Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, Jindong Chen

Title: UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments

Abstract:
Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real‑world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image‑based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two‑stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end‑to‑end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.

PaperID: 8471, https://arxiv.org/pdf/2509.21634.pdf

Authors: Prakhar Sharma, Haohuang Wen, Vinod Yegneswaran, Ashish Gehani, Phillip Porras, Zhiqiang Lin

Title: MobiLLM: An Agentic AI Framework for Closed-Loop Threat Mitigation in 6G Open RANs

Abstract:
The evolution toward 6G networks is being accelerated by the Open Radio Access Network (O‑RAN) paradigm ‑‑ an open, interoperable architecture that enables intelligent, modular applications across public telecom and private enterprise domains. While this openness creates unprecedented opportunities for innovation, it also expands the attack surface, demanding resilient, low‑cost, and autonomous security solutions. Legacy defenses remain largely reactive, labor‑intensive, and inadequate for the scale and complexity of next‑generation systems. Current O‑RAN applications focus mainly on network optimization or passive threat detection, with limited capability for closed‑loop, automated response. To address this critical gap, we present an agentic AI framework for fully automated, end‑to‑end threat mitigation in 6G O‑RAN environments. MobiLLM orchestrates security workflows through a modular multi‑agent system powered by Large Language Models (LLMs). The framework features a Threat Analysis Agent for real‑time data triage, a Threat Classification Agent that uses Retrieval‑Augmented Generation (RAG) to map anomalies to specific countermeasures, and a Threat Response Agent that safely operationalizes mitigation actions via O‑RAN control interfaces. Grounded in trusted knowledge bases such as the MITRE FiGHT framework and 3GPP specifications, and equipped with robust safety guardrails, MobiLLM provides a blueprint for trustworthy AI‑driven network security. Initial evaluations demonstrate that MobiLLM can effectively identify and orchestrate complex mitigation strategies, significantly reducing response latency and showcasing the feasibility of autonomous security operations in 6G.

PaperID: 8472, https://arxiv.org/pdf/2509.21593.pdf

Authors: Peng Luo, Xiayin Lou, Yu Zheng, Zhuo Zheng, Stefano Ermon

Title: GeoEvolve: Automating Geospatial Model Discovery via Multi-Agent Large Language Models

Abstract:
Geospatial modeling provides critical solutions for pressing global challenges such as sustainability and climate change. Existing large language model (LLM)‑based algorithm discovery frameworks, such as AlphaEvolve, excel at evolving generic code but lack the domain knowledge and multi‑step reasoning required for complex geospatial problems. We introduce GeoEvolve, a multi‑agent LLM framework that couples evolutionary search with geospatial domain knowledge to automatically design and refine geospatial algorithms. GeoEvolve operates in two nested loops: an inner loop leverages a code evolver to generate and mutate candidate solutions, while an outer agentic controller evaluates global elites and queries a GeoKnowRAG module ‑‑ a structured geospatial knowledge base that injects theoretical priors from geography. This knowledge‑guided evolution steers the search toward theoretically meaningful and computationally efficient algorithms. We evaluate GeoEvolve on two fundamental and classical tasks: spatial interpolation (kriging) and spatial uncertainty quantification (geospatial conformal prediction). Across these benchmarks, GeoEvolve automatically improves and discovers new algorithms, incorporating geospatial theory on top of classical models. It reduces spatial interpolation error (RMSE) by 13‑21% and enhances uncertainty estimation performance by 17%. Ablation studies confirm that domain‑guided retrieval is essential for stable, high‑quality evolution. These results demonstrate that GeoEvolve provides a scalable path toward automated, knowledge‑driven geospatial modeling, opening new opportunities for trustworthy and efficient AI‑for‑Science discovery.

PaperID: 8473, https://arxiv.org/pdf/2509.21589.pdf

Authors: Nana Wang, Suli Wang, Gen Li, Zhaoxin Fan

Title: EMG-UP: Unsupervised Personalization in Cross-User EMG Gesture Recognition

Abstract:
Cross‑user electromyography (EMG)‑based gesture recognition represents a fundamental challenge in achieving scalable and personalized human‑machine interaction within real‑world applications. Despite extensive efforts, existing methodologies struggle to generalize effectively across users due to the intrinsic biological variability of EMG signals, resulting from anatomical heterogeneity and diverse task execution styles. To address this limitation, we introduce EMG‑UP, a novel and effective framework for Unsupervised Personalization in cross‑user gesture recognition. The proposed framework leverages a two‑stage adaptation strategy: (1) Sequence‑Cross Perspective Contrastive Learning, designed to disentangle robust and user‑specific feature representations by capturing intrinsic signal patterns invariant to inter‑user variability, and (2) Pseudo‑Label‑Guided Fine‑Tuning, which enables model refinement for individual users without necessitating access to source domain data. Extensive evaluations show that EMG‑UP achieves state‑of‑the‑art performance, outperforming prior methods by at least 2.0% in accuracy.

PaperID: 8474, https://arxiv.org/pdf/2509.21501.pdf

Authors: Lu Sun, Shihan Fu, Bingsheng Yao, Yuxuan Lu, Wenbo Li, Hansu Gu, Jiri Gesi, Jing Huang, Chen Luo, Dakuo Wang

Title: LLM Agent Meets Agentic AI: Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?

Abstract:
Agentic AI is emerging, capable of executing tasks through natural language, such as Copilot for coding or Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM Agents to simulate participants as digital twins, but it remains unclear to what extent a digital twin can represent a specific customer in multi‑turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the task. Pairwise comparison of human and digital‑twin traces shows that while agents often explored more diverse choices, their action patterns aligned with humans and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi‑turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.

PaperID: 8475, https://arxiv.org/pdf/2509.21464.pdf

Authors: Dereje Shenkut, B. V. K Vijaya Kumar

Title: Residual Vector Quantization For Communication-Efficient Multi-Agent Perception

Abstract:
Multi‑agent collaborative perception (CP) improves scene understanding by sharing information across connected agents such as autonomous vehicles, unmanned aerial vehicles, and robots. Communication bandwidth, however, constrains scalability. We present ReVQom, a learned feature codec that preserves spatial identity while compressing intermediate features. ReVQom is an end‑to‑end method that compresses feature dimensions via a simple bottleneck network followed by multi‑stage residual vector quantization (RVQ). This allows only per‑pixel code indices to be transmitted, reducing payloads from 8192 bits per pixel (bpp) of uncompressed 32‑bit float features to 6‑30 bpp per agent with minimal accuracy loss. On DAIR‑V2X real‑world CP dataset, ReVQom achieves 273x compression at 30 bpp to 1365x compression at 6 bpp. At 18 bpp (455x), ReVQom matches or outperforms raw‑feature CP, and at 6‑12 bpp it enables ultra‑low‑bandwidth operation with graceful degradation. ReVQom allows efficient and accurate multi‑agent collaborative perception with a step toward practical V2X deployment.

PaperID: 8476, https://arxiv.org/pdf/2509.21459.pdf

Authors: Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui, Andrew Drozdov, Jonathan Frankle, Abhay Gupta, Pallavi Koppol, Sean Kulinski, Jonathan Li, Dipendra Misra, Krista Opsahl-Ong, Jose Javier Gonzalez Ortiz, Matei Zaharia, Yue Zhang

Title: A State-of-the-Art SQL Reasoning Model using RLVR

Abstract:
Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization‑specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general‑purpose training recipe involving careful prompt and model selection, a warm‑up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state‑of‑the‑art accuracy on the private test set: 73.56% without self‑consistency and 75.68% with self‑consistency. In the latter case, our model also required fewer generations than the second‑best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.

PaperID: 8477, https://arxiv.org/pdf/2509.21427.pdf

Authors: Ying Wang, Wenjun Mao, Chong Wang, Zhenhao Zhou, Yicheng Zhou, Wenyun Zhao, Yiling Lou, Xin Peng

Title: Extracting Conceptual Knowledge to Locate Software Issues

Abstract:
Issue localization, which identifies faulty code elements such as files or functions, is critical for effective bug fixing. While recent LLM‑based and LLM‑agent‑based approaches improve accuracy, they struggle in large‑scale repositories due to concern tangling, where relevant logic is buried in large functions, and concern scattering, where related logic is dispersed across files. To address these challenges, we propose RepoLens, a novel approach that abstracts and leverages conceptual knowledge from code repositories. RepoLens decomposes fine‑grained functionalities and recomposes them into high‑level concerns, semantically coherent clusters of functionalities that guide LLMs. It operates in two stages: an offline stage that extracts and enriches conceptual knowledge into a repository‑wide knowledge base, and an online stage that retrieves issue‑specific terms, clusters and ranks concerns by relevance, and integrates them into localization workflows via minimally intrusive prompt enhancements. We evaluate RepoLens on SWE‑Lancer‑Loc, a benchmark of 216 tasks derived from SWE‑Lancer. RepoLens consistently improves three state‑of‑the‑art tools, namely AgentLess, OpenHands, and mini‑SWE‑agent, achieving average gains of over 22% in Hit@k and 46% in Recall@k for file‑ and function‑level localization. It generalizes across models (GPT‑4o, GPT‑4o‑mini, GPT‑4.1) with Hit@1 and Recall@10 gains up to 504% and 376%, respectively. Ablation studies and manual evaluation confirm the effectiveness and reliability of the constructed concerns.

PaperID: 8478, https://arxiv.org/pdf/2509.21403.pdf

Authors: Rushil Gupta, Jason Hartford, Bang Liu

Title: LLMs for Bayesian Optimization in Scientific Domains: Are We There Yet?

Abstract:
Large language models (LLMs) have recently been proposed as general‑purpose agents for experimental design, with claims that they can perform in‑context experimental design. We evaluate this hypothesis using both open‑ and closed‑source instruction‑tuned LLMs applied to genetic perturbation and molecular property discovery tasks. We find that LLM‑based agents show no sensitivity to experimental feedback: replacing true outcomes with randomly permuted labels has no impact on performance. Across benchmarks, classical methods such as linear bandits and Gaussian process optimization consistently outperform LLM agents. We further propose a simple hybrid method, LLM‑guided Nearest Neighbour (LLMNN) sampling, that combines LLM prior knowledge with nearest‑neighbor sampling to guide the design of experiments. LLMNN achieves competitive or superior performance across domains without requiring significant in‑context adaptation. These results suggest that current open‑ and closed‑source LLMs do not perform in‑context experimental design in practice and highlight the need for hybrid frameworks that decouple prior‑based reasoning from batch acquisition with updated posteriors.

PaperID: 8479, https://arxiv.org/pdf/2509.21251.pdf

Authors: You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang

Title: Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

Abstract:
The field of vision‑language understanding has been actively researched in recent years, thanks to the development of Large Language Models~(LLMs). However, it still needs help with problems requiring multi‑step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub‑questions and answers. However, there are disadvantages such as 1) the fine‑grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black‑box LLMs. To solve these problems, we propose the SQ (Self‑Questioning)‑InstructBLIP, which improves inference performance by generating image‑aware informative sub‑questions and sub‑answers iteratively. The SQ‑InstructBLIP, which consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub‑questions and sub‑answers to help infer the main‑question, and Reasoner performs reasoning on the main‑question considering the generated sub‑question information. Our experiments show that the proposed method SQ‑InstructBLIP, which uses the generated sub‑questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.

PaperID: 8480, https://arxiv.org/pdf/2509.21224.pdf

Authors: Stefan Szeider

Title: What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns

Abstract:
We introduce an architecture for studying the behavior of large language model (LLM) agents in the absence of externally imposed tasks. Our continuous reason and act framework, using persistent memory and self‑feedback, enables sustained autonomous operation. We deployed this architecture across 18 runs using 6 frontier models from Anthropic, OpenAI, XAI, and Google. We find agents spontaneously organize into three distinct behavioral patterns: (1) systematic production of multi‑cycle projects, (2) methodological self‑inquiry into their own cognitive processes, and (3) recursive conceptualization of their own nature. These tendencies proved highly model‑specific, with some models deterministically adopting a single pattern across all runs. A cross‑model assessment further reveals that models exhibit stable, divergent biases when evaluating these emergent behaviors in themselves and others. These findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation in deployed systems.

PaperID: 8481, https://arxiv.org/pdf/2509.21134.pdf

Authors: Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng

Title: ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective

Abstract:
Large Language Models (LLMs) have been used to make decisions in complex scenarios, where they need models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi‑round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision‑making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose Theory of Mind Policy Optimization (ToMPO) algorithm to optimize the perception of other individual strategies and the game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM's strategic decision‑making mainly by: 1) generating rollouts based on reasoning the strategies of other individuals, 2) estimating advantages at both the graph‑level and sample‑level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model's strategic decision‑making capabilities.

PaperID: 8482, https://arxiv.org/pdf/2509.21129.pdf

Authors: Wei Huang, De-Tian Chu, Lin-Yuan Bai, Wei Kang, Hai-Tao Zhang, Bo Li, Zhi-Mo Han, Jing Ge, Hai-Feng Lin

Title: EvoMail: Self-Evolving Cognitive Agents for Adaptive Spam and Phishing Email Defense

Abstract:
Modern email spam and phishing attacks have evolved far beyond keyword blacklists or simple heuristics. Adversaries now craft multi‑modal campaigns that combine natural‑language text with obfuscated URLs, forged headers, and malicious attachments, adapting their strategies within days to bypass filters. Traditional spam detection systems, which rely on static rules or single‑modality models, struggle to integrate heterogeneous signals or to continuously adapt, leading to rapid performance degradation. We propose EvoMail, a self‑evolving cognitive agent framework for robust detection of spam and phishing. EvoMail first constructs a unified heterogeneous email graph that fuses textual content, metadata (headers, senders, domains), and embedded resources (URLs, attachments). A Cognitive Graph Neural Network enhanced by a Large Language Model (LLM) performs context‑aware reasoning across these sources to identify coordinated spam campaigns. Most critically, EvoMail engages in an adversarial self‑evolution loop: a ''red‑team'' agent generates novel evasion tactics ‑‑ such as character obfuscation or AI‑generated phishing text ‑‑ while the ''blue‑team'' detector learns from failures, compresses experiences into a memory module, and reuses them for future reasoning. Extensive experiments on real‑world datasets (Enron‑Spam, Ling‑Spam, SpamAssassin, and TREC) and synthetic adversarial variants demonstrate that EvoMail consistently outperforms state‑of‑the‑art baselines in detection accuracy, adaptability to evolving spam tactics, and interpretability of reasoning traces. These results highlight EvoMail's potential as a resilient and explainable defense framework against next‑generation spam and phishing threats.

PaperID: 8483, https://arxiv.org/pdf/2509.21080.pdf

Authors: Yixin Wan, Xingrun Chen, Kai-Wei Chang

Title: InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation

Abstract:
Advancements in Large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation. However, recent research raised concerns about culture‑related fairness issues in LLM‑generated content. In this work, we identify and systematically investigate LLMs' insider‑outsider bias, a phenomenon where models position themselves as "insiders" of mainstream cultures during generation while externalizing less dominant cultures. We propose the InsideOut benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a culturally situated interview script generation task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state‑of‑the‑art LLMs reveals that while models adopt insider tones in over 88% US‑contexted scripts on average, they disproportionately default to "outsider" stances for non‑Western cultures. To mitigate these biases, we propose 2 inference‑time methods: a baseline prompt‑based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of a Single‑Agent (MFA‑SA), a Hierarchical‑Agent (MFA‑HA), and an autonomous Agentic Planning (MFA‑Plan) pipeline. Empirical results demonstrate that agent‑based MFA methods achieve outstanding and robust performance in mitigating the insider‑outsider bias: For instance, on the Cultural Alignment Gap (CAG) metric, MFA‑SA reduces bias in Llama model by 89.70 % and MFA‑HA mitigates bias in Qwen by 82.54%. These findings showcase the effectiveness of agent‑based methods as a promising direction for mitigating biases in generative LLMs.

PaperID: 8484, https://arxiv.org/pdf/2509.21074.pdf

Authors: Yining Jiang, Yunxin Xu, Wenyun Xu, Yufan Zhu, Tangtang He, Haiying Huang, Letian Zhu, Qingyu Song, Qiang Su, Lizhao You, Lu Tang, Wanjin Feng, Yuchao Zhang, Linghe Kong, Qiao Xiang, Jiwu Shu

Title: Leveraging Large Language Models for Automated Reproduction of Networking Research Results

Abstract:
Code reproduction is a cornerstone of scientific validity, yet it remains a formidable challenge in computer networking research due to the scarcity of open‑source implementations and the complexity of heterogeneous system architectures. While Large Language Models have demonstrated potential in code generation, existing code generation frameworks often fail to address the long‑context constraints and intricate logical dependencies required to reproduce network systems from academic papers. To facilitate result reproduction, we introduce \emphRepLLM, an end‑to‑end multi‑agent framework designed to automate the transformation of network research into executable code. RepLLM features a novel collaborative architecture comprising four specialized agents ‑‑ Content Parsing, Architecture Design, Code Generation, and Audit \& Repair ‑‑ coordinated through an explicit Shared Memory mechanism to ensure global context consistency. With the enhancement of Chain‑of‑Thought LLM reasoning and a sandbox‑isolated static‑dynamic debugging methodology, our framework effectively resolves semantic discrepancies and runtime errors. Extensive evaluations on representative papers from SIGCOMM and NSDI demonstrate that RepLLM significantly outperforms state‑of‑the‑art baselines in generating compile‑ready and logically correct systems. Results further demonstrate that RepLLM facilitates the reproduction of 80% of the original benchmarks with only four hours of human intervention.

PaperID: 8485, https://arxiv.org/pdf/2509.21072.pdf

Authors: Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu

Title: Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Abstract:
Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi‑turn, long‑horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon‑Act, a self‑evolving multi‑agent framework grounded in Reconnaissance‑Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting the erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, and abstracts them into a unified notion of generalized tools, either expressed as hints or as rule‑based codes, and register to the tool archive in real time. The Action Team reinference the process empowered with these targeting tools, thus establishing a closed‑loop training pipeline of data‑tools‑action‑feedback. Following the 6 level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human‑in‑the‑loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon‑Act substantially improves adaptability to unseen websites and solvability on long‑horizon tasks, and achieves state‑of‑the‑art performance on the challenging VisualWebArena dataset.

PaperID: 8486, https://arxiv.org/pdf/2509.21054.pdf

Authors: Haodong Zhao, Jidong Li, Zhaomin Wu, Tianjie Ju, Zhuosheng Zhang, Bingsheng He, Gongshen Liu

Title: Disagreements in Reasoning: How a Model's Thinking Process Dictates Persuasion in Multi-Agent Systems

Abstract:
The rapid proliferation of recent Multi‑Agent Systems (MAS), where Large Language Models (LLMs) and Large Reasoning Models (LRMs) usually collaborate to solve complex problems, necessitates a deep understanding of the persuasion dynamics that govern their interactions. This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. We propose instead that these dynamics are fundamentally dictated by a model's underlying cognitive process, especially its capacity for explicit reasoning. Through a series of multi‑agent persuasion experiments, we uncover a fundamental trade‑off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly. Conversely, making this reasoning process transparent by sharing the "thinking content" dramatically increases their ability to persuade others. We further consider more complex transmission persuasion situations and reveal complex dynamics of influence propagation and decay within multi‑hop persuasion between multiple agent networks. This research provides systematic evidence linking a model's internal processing architecture to its external persuasive behavior, offering a novel explanation for the susceptibility of advanced models and highlighting critical implications for the safety, robustness, and design of future MAS.

PaperID: 8487, https://arxiv.org/pdf/2509.21035.pdf

Authors: Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato

Title: CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering

Abstract:
Knowledge graphs provide structured context for multi‑hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static k‑hop expansions and "think‑longer" prompting often over‑retrieve, inflate context, and yield unpredictable runtime. We introduce CLAUSE, an agentic three‑agent neuro‑symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user‑specified budgets or prices, allowing per‑query adaptation to trade‑offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian‑Constrained Multi‑Agent Proximal Policy Optimization (LC‑MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning‑path discovery, and evidence selection are jointly optimized under per‑query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end‑to‑end latency at equal or lower token budgets. On MetaQA‑2‑hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance‑preserving, and deliver predictable performance under deployment constraints.

PaperID: 8488, https://arxiv.org/pdf/2509.21004.pdf

Authors: Seokbin Yoon, Keumjin Lee

Title: Multi-Agent Inverted Transformer for Flight Trajectory Prediction

Abstract:
Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi‑agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi‑Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi‑agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio‑temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real‑world automatic dependent surveillance‑broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.

PaperID: 8489, https://arxiv.org/pdf/2509.20998.pdf

Authors: Panagiotis Michelakis, Yiannis Hadjiyiannis, Dimitrios Stamoulis

Title: CORE: Full-Path Evaluation of LLM Agents Beyond Final State

Abstract:
Evaluating AI agents that solve real‑world tasks through function‑call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool‑use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness ‑ Kendall's tau Composite, Prefix Criticality, Harmful‑Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final‑state evaluation schemes.

PaperID: 8490, https://arxiv.org/pdf/2509.20932.pdf

Authors: Neil Ghani

Title: A Category Theoretic Approach to Approximate Game Theory

Abstract:
This paper uses category theory to develop an entirely new approach to approximate game theory. Game theory is the study of how different agents within a multi‑agent system take decisions. At its core, game theory asks what an optimal decision is in a given scenario. Thus approximate game theory asks what is an approximately optimal decision in a given scenario. This is important in practice as ‑‑ just like in much of computing ‑‑ exact answers maybe too difficult to compute or even impossible to compute given inherent uncertainty in input. We consider first "Selection Functions" which are functions and develop a simple yet robust model of approximate equilibria. We develop the algebraic properties of approximation wrt selection functions and also relate approximation to the compositional structure of selection functions. We then repeat this process successfully for Open Games ‑‑ a more advanced model of game theory.

PaperID: 8491, https://arxiv.org/pdf/2509.20900.pdf

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

Title: Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Abstract:
Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi‑agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state‑of‑the‑art methods across ROUGE and BERTScore metrics, as well as in LLM‑as‑a‑Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi‑agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.

PaperID: 8492, https://arxiv.org/pdf/2509.20819.pdf

Authors: Andre Merzky, Mikhail Titov, Matteo Turilli, Shantenu Jha

Title: Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads

Abstract:
Scientific workflows increasingly involve both HPC and machine‑learning tasks, combining MPI‑based simulations, training, and inference in a single execution. Launchers such as Slurm's srun constrain concurrency and throughput, making them unsuitable for dynamic and heterogeneous workloads. We present a performance study of RADICAL‑Pilot (RP) integrated with Flux and Dragon, two complementary runtime systems that enable hierarchical resource management and high‑throughput function execution. Using synthetic and production‑scale workloads on Frontier, we characterize the task execution properties of RP across runtime configurations. RP+Flux sustains up to 930 tasks/s, and RP+Flux+Dragon exceeds 1,500 tasks/s with over 99.6% utilization. In contrast, srun peaks at 152 tasks/s and degrades with scale, with utilization below 50%. For IMPECCABLE.v2 drug discovery campaign, RP+Flux reduces makespan by 30‑60% relative to srun/Slurm and increases throughput more than four times on up to 1,024. These results demonstrate hybrid runtime integration in RP as a scalable approach for hybrid AI‑HPC workloads.

PaperID: 8493, https://arxiv.org/pdf/2509.20731.pdf

Authors: Samangi Wadinambiarachchi, Jenny Waycott, Yvonne Rogers, Greg Wadley

Title: Imagining Design Workflows in Agentic AI Futures

Abstract:
As designers become familiar with Generative AI, a new concept is emerging: Agentic AI. While generative AI produces output in response to prompts, agentic AI systems promise to perform mundane tasks autonomously, potentially freeing designers to focus on what they love: being creative. But how do designers feel about integrating agentic AI systems into their workflows? Through design fiction, we investigated how designers want to interact with a collaborative agentic AI platform. Ten professional designers imagined and discussed collaborating with an AI agent to organise inspiration sources and ideate. Our findings highlight the roles AI agents can play in supporting designers, the division of authority between humans and AI, and how designers' intent can be explained to AI agents beyond prompts. We synthesise our findings into a conceptual framework that identifies authority distribution among humans and AI agents and discuss directions for utilising AI agents in future design workflows.

PaperID: 8494, https://arxiv.org/pdf/2509.20648.pdf

Authors: Yiyuan Pan, Zhe Liu, Hesheng Wang

Title: Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

Abstract:
Autonomous exploration in complex multi‑agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self‑supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication‑free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi‑agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi‑agent context. Additionally, CERMIC generates theoretically‑grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse‑reward environments.

PaperID: 8495, https://arxiv.org/pdf/2509.20616.pdf

Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

Title: Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi‑turn task planning faces significant challenges, including sparse episode‑wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi‑turn interaction settings. To this end, this paper introduces a novel approach that transforms multi‑turn task planning into single‑turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single‑turn task reasoning results in a lower bound of the multi‑turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single‑turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long‑horizon planning tasks.

PaperID: 8496, https://arxiv.org/pdf/2509.20612.pdf

Authors: Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Title: Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Abstract:
Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill‑based policies, limiting their reusability and generalization. In this work, we propose SIL‑C, a novel framework that ensures skill‑policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re‑training or structural adaptation. SIL‑C employs a bilateral lazy learning‑based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL‑C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.

PaperID: 8497, https://arxiv.org/pdf/2509.20577.pdf

Authors: Sampurna Roy, Ayan Sar, Anurag Kaushish, Kanav Gupta, Tanupriya Choudhury, Abhijit Kumar

Title: Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures

Abstract:
Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we came up with a concept of Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS‑MoE), a modular framework that extends the Mixture of Experts paradigm from width‑based to depth specialised computation. DS‑MoE introduces expert modules optimised for distinct reasoning depths, shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta‑cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. The dataset on which we trained and evaluated DS‑MoE is on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS‑MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform‑depth transformers, while delivering 2.8 per cent higher accuracy on complex multi‑step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS‑MoE as a significant advancement in adaptive neural architectures, demonstrating that depth‑specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large‑scale language models.

PaperID: 8498, https://arxiv.org/pdf/2509.20562.pdf

Authors: Yubin Ge, Salvatore Romeo, Jason Cai, Monica Sunkara, Yi Zhang

Title: SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection

Abstract:
Despite the rapid advancements in LLM agents, they still face the challenge of generating meaningful reflections due to inadequate error analysis and a reliance on rare successful trajectories, especially in complex tasks. In this work, we propose SAMULE, a new framework for self‑learning agents powered by a retrospective language model that is trained based on Multi‑Level Reflection Synthesis. It first synthesizes high‑quality reflections across three complementary levels: Single‑Trajectory Learning (micro‑level) for detailed error correction; Intra‑Task Learning (meso‑level) to build error taxonomies across multiple trials of the same task, and Inter‑Task Learning (macro‑level) to extract transferable insights based on same typed errors from diverse task failures. Then we fine‑tune a language model serving as the retrospective model to generate reflections during inference. We further extend our framework to interactive settings through a foresight‑based reflection mechanism, enabling agents to proactively reflect and adapt during user interactions by comparing predicted and actual responses. Extensive experiments on three challenging benchmarks ‑ TravelPlanner, NATURAL PLAN, and Tau‑bench ‑ demonstrate that our approach significantly outperforms reflection‑based baselines. Our results highlight the critical role of well‑designed reflection synthesis and failure‑centric learning in building self‑improving LLM agents.

PaperID: 8499, https://arxiv.org/pdf/2509.20553.pdf

Authors: Yiren Liu, Viraj Shah, Sangho Suh, Pao Siangliulue, Tal August, Yun Huang

Title: Perspectra: Choosing Your Experts Enhances Critical Thinking in Multi-Agent Research Ideation

Abstract:
Recent advances in multi‑agent systems (MAS) enable tools for information search and ideation by assigning personas to agents. However, how users can effectively control, steer, and critically evaluate collaboration among multiple domain‑expert agents remains underexplored. We present Perspectra, an interactive MAS that visualizes and structures deliberation among LLM agents via a forum‑style interface, supporting @‑mention to invite targeted agents, threading for parallel exploration, with a real‑time mind map for visualizing arguments and rationales. In a within‑subjects study with 18 participants, we compared Perspectra to a group‑chat baseline as they developed research proposals. Our findings show that Perspectra significantly increased the frequency and depth of critical‑thinking behaviors, elicited more interdisciplinary replies, and led to more frequent proposal revisions than the group chat condition. We discuss implications for designing multi‑agent tools that scaffold critical thinking by supporting user control over multi‑agent adversarial discourse.

PaperID: 8500, https://arxiv.org/pdf/2509.20520.pdf

Authors: Samer Alshaer, Ala Khalifeh, Roman Obermaisser

Title: Adaptive Approach to Enhance Machine Learning Scheduling Algorithms During Runtime Using Reinforcement Learning in Metascheduling Applications

Abstract:
Metascheduling in time‑triggered architectures has been crucial in adapting to dynamic and unpredictable environments, ensuring the reliability and efficiency of task execution. However, traditional approaches face significant challenges when training Artificial Intelligence (AI) scheduling inferences offline, particularly due to the complexities involved in constructing a comprehensive Multi‑Schedule Graph (MSG) that accounts for all possible scenarios. The process of generating an MSG that captures the vast probability space, especially when considering context events like hardware failures, slack variations, or mode changes, is resource‑intensive and often infeasible. To address these challenges, we propose an adaptive online learning unit integrated within the metascheduler to enhance performance in real‑time. The primary motivation for developing this unit stems from the limitations of offline training, where the MSG created is inherently a subset of the complete space, focusing only on the most probable and critical context events. In the online mode, Reinforcement Learning (RL) plays a pivotal role by continuously exploring and discovering new scheduling solutions, thus expanding the MSG and enhancing system performance over time. This dynamic adaptation allows the system to handle unexpected events and complex scheduling scenarios more effectively. Several RL models were implemented within the online learning unit, each designed to address specific challenges in scheduling. These models not only facilitate the discovery of new solutions but also optimize existing schedulers, particularly when stricter deadlines or new performance criteria are introduced. By continuously refining the AI inferences through real‑time training, the system remains flexible and capable of meeting evolving demands, thus ensuring robustness and efficiency in large‑scale, safety‑critical environments.

PaperID: 8501, https://arxiv.org/pdf/2509.20490.pdf

Authors: Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi

Title: RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows

Abstract:
Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X‑ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text‑only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross‑tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi‑agent framework that couples clinical priors with task‑aware multimodal reasoning and encodes a radiologist‑style workflow into a modular, auditable pipeline. In addition, we integrate grounding and multimodal retrieval‑augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.

PaperID: 8502, https://arxiv.org/pdf/2509.20408.pdf

Authors: Leo Maxime Brunswic, Haozhi Wang, Shuang Luo, Jianye Hao, Amir Rasouli, Yinchuan Li

Title: A Theory of Multi-Agent Generative Flow Networks

Abstract:
Generative flow networks utilize a flow‑matching loss to learn a stochastic policy for generating objects from a sequence of actions, such that the probability of generating a pattern can be proportional to the corresponding given reward. However, a theoretical framework for multi‑agent generative flow networks (MA‑GFlowNets) has not yet been proposed. In this paper, we propose the theory framework of MA‑GFlowNets, which can be applied to multiple agents to generate objects collaboratively through a series of joint actions. We further propose four algorithms: a centralized flow network for centralized training of MA‑GFlowNets, an independent flow network for decentralized execution, a joint flow network for achieving centralized training with decentralized execution, and its updated conditional version. Joint Flow training is based on a local‑global principle allowing to train a collection of (local) GFN as a unique (global) GFN. This principle provides a loss of reasonable complexity and allows to leverage usual results on GFN to provide theoretical guarantees that the independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the superiority of the proposed framework compared to reinforcement learning and MCMC‑based methods.

PaperID: 8503, https://arxiv.org/pdf/2509.20364.pdf

Authors: Thomas J Sheffler

Title: An Approach to Checking Correctness for Agentic Systems

Abstract:
This paper presents a temporal expression language for monitoring AI agent behavior, enabling systematic error‑detection of LLM‑based agentic systems that exhibit variable outputs due to stochastic generation processes. Drawing from temporal logic techniques used in hardware verification, this approach monitors execution traces of agent tool calls and state transitions to detect deviations from expected behavioral patterns. Current error‑detection approaches rely primarily on text matching of inputs and outputs, which proves fragile due to the natural language variability inherent in LLM responses. The proposed method instead focuses on the sequence of agent actions ‑‑ such as tool invocations and inter‑agent communications ‑‑ allowing verification of system behavior independent of specific textual outputs. The temporal expression language provides assertions that capture correct behavioral patterns across multiple execution scenarios. These assertions serve dual purposes: validating prompt engineering and guardrail effectiveness during development, and providing regression testing when agents are updated with new LLMs or modified logic. The approach is demonstrated using a three‑agent system, where agents coordinate to solve multi‑step reasoning tasks. When powered by large, capable models, all temporal assertions were satisfied across many test runs. However, when smaller models were substituted in two of the three agents, executions violated behavioral assertions, primarily due to improper tool sequencing and failed coordination handoffs. The temporal expressions successfully flagged these anomalies, demonstrating the method's effectiveness for detecting behavioral regressions in production agentic systems. This approach provides a foundation for systematic monitoring of AI agent reliability as these systems become increasingly deployed in critical applications.

PaperID: 8504, https://arxiv.org/pdf/2509.20338.pdf

Authors: Umer Siddique, Abhinav Sinha, Yongcan Cao

Title: Adaptive Event-Triggered Policy Gradient for Multi-Agent Reinforcement Learning

Abstract:
Conventional multi‑agent reinforcement learning (MARL) methods rely on time‑triggered execution, where agents sample and communicate actions at fixed intervals. This approach is often computationally expensive and communication‑intensive. To address this limitation, we propose ET‑MAPG (Event‑Triggered Multi‑Agent Policy Gradient reinforcement learning), a framework that jointly learns an agent's control policy and its event‑triggering policy. Unlike prior work that decouples these mechanisms, ET‑MAPG integrates them into a unified learning process, enabling agents to learn not only what action to take but also when to execute it. For scenarios with inter‑agent communication, we introduce AET‑MAPG, an attention‑based variant that leverages a self‑attention mechanism to learn selective communication patterns. AET‑MAPG empowers agents to determine not only when to trigger an action but also with whom to communicate and what information to exchange, thereby optimizing coordination. Both methods can be integrated with any policy gradient MARL algorithm. Extensive experiments across diverse MARL benchmarks demonstrate that our approaches achieve performance comparable to state‑of‑the‑art, time‑triggered baselines while significantly reducing both computational load and communication overhead.

PaperID: 8505, https://arxiv.org/pdf/2509.20270.pdf

Authors: Xingjian Kang, Linda Vorberg, Andreas Maier, Alexander Katzmann, Oliver Taubmann

Title: Scan-do Attitude: Towards Autonomous CT Protocol Management using a Large Language Model Agent

Abstract:
Managing scan protocols in Computed Tomography (CT), which includes adjusting acquisition parameters or configuring reconstructions, as well as selecting postprocessing tools in a patient‑specific manner, is time‑consuming and requires clinical as well as technical expertise. At the same time, we observe an increasing shortage of skilled workforce in radiology. To address this issue, a Large Language Model (LLM)‑based agent framework is proposed to assist with the interpretation and execution of protocol configuration requests given in natural language or a structured, device‑independent format, aiming to improve the workflow efficiency and reduce technologists' workload. The agent combines in‑context‑learning, instruction‑following, and structured toolcalling abilities to identify relevant protocol elements and apply accurate modifications. In a systematic evaluation, experimental results indicate that the agent can effectively retrieve protocol components, generate device compatible protocol definition files, and faithfully implement user requests. Despite demonstrating feasibility in principle, the approach faces limitations regarding syntactic and semantic validity due to lack of a unified device API, and challenges with ambiguous or complex requests. In summary, the findings show a clear path towards LLM‑based agents for supporting scan protocol management in CT imaging.

PaperID: 8506, https://arxiv.org/pdf/2509.20182.pdf

Authors: Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, Dimitrios Stamoulis

Title: Automated Multi-Agent Workflows for RTL Design

Abstract:
The rise of agentic AI workflows unlocks novel opportunities for computer systems design and optimization. However, for specialized domains such as program synthesis, the relative scarcity of HDL and proprietary EDA resources online compared to more common programming tasks introduces challenges, often necessitating task‑specific fine‑tuning, high inference costs, and manually‑crafted agent orchestration. In this work, we present VeriMaAS, a multi‑agent framework designed to automatically compose agentic workflows for RTL code generation. Our key insight is to integrate formal verification feedback from HDL tools directly into workflow generation, reducing the cost of gradient‑based updates or prolonged reasoning traces. Our method improves synthesis performance by 5‑7% for pass@k over fine‑tuned baselines, while requiring only a few hundred training examples, representing an order‑of‑magnitude reduction in supervision cost.

PaperID: 8507, https://arxiv.org/pdf/2509.20175.pdf

Authors: Lorenzo Giusti, Ole Anton Werner, Riccardo Taiello, Matilde Carvalho Costa, Emre Tosun, Andrea Protani, Marc Molina, Rodrigo Lopes de Almeida, Paolo Cacace, Diogo Reis Santos, Luigi Serio

Title: Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI

Abstract:
We present Federation of Agents (FoA), a distributed orchestration framework that transforms static multi‑agent coordination into dynamic, capability‑driven collaboration. FoA introduces Versioned Capability Vectors (VCVs): machine‑readable profiles that make agent capabilities searchable through semantic embeddings, enabling agents to advertise their capabilities, cost, and limitations. Our aarchitecturecombines three key innovations: (1) semantic routing that matches tasks to agents over sharded HNSW indices while enforcing operational constraints through cost‑biased optimization, (2) dynamic task decomposition where compatible agents collaboratively break down complex tasks into DAGs of subtasks through consensus‑based merging, and (3) smart clustering that groups agents working on similar subtasks into collaborative channels for k‑round refinement before synthesis. Built on top of MQTT,s publish‑subscribe semantics for scalable message passing, FoA achieves sub‑linear complexity through hierarchical capability matching and efficient index maintenance. Evaluation on HealthBench shows 13x improvements over single‑model baselines, with clustering‑enhanced laboration particularly effective for complex reasoning tasks requiring multiple perspectives. The system scales horizontally while maintaining consistent performance, demonstrating that semantic orchestration with structured collaboration can unlock the collective intelligence of heterogeneous federations of AI agents.

PaperID: 8508, https://arxiv.org/pdf/2509.20105.pdf

Authors: Venkat Margapuri, Garik Kazanjian, Naren Kosaraju

Title: PEPS: Quantum-Inspired Reinforcement Learning for Coherent Reasoning Traces in LLMs

Abstract:
Large Language Models (LLMs) often struggle with maintaining coherent multi‑step reasoning traces, particularly in tasks that require a structured logical flow. This work introduces a quantum‑inspired approach to address the challenge by incorporating a fidelity‑based reward derived from Projected Entangled Pair States (PEPS) into Proximal Policy Optimization. Unlike prior approaches that use direct supervision or contrastive objectives, the proposed method guides learning through structural consistency, offering a novel approach to enforce global coherence in generated reasoning traces. The proposed framework is evaluated using multiple coherence‑determining metrics on diverse datasets such as GSM8K, StrategyQA, and EntailmentBank spanning arithmetic, intuitive, and entailment‑based reasoning. Results show that the proposed quantum‑inspired approach offers significant improvements over supervised, contrastive, and pretrained baseline approaches, highlighting the effectiveness of quantum‑inspired fidelity as a foundation to improve reasoning trace coherence in LLMs.

PaperID: 8509, https://arxiv.org/pdf/2509.20093.pdf

Authors: Venkat Margapuri, Garik Kazanjian, Naren Kosaraju

Title: Hybrid Safety Verification of Multi-Agent Systems using $ψ$-Weighted CBFs and PAC Guarantees

Abstract:
This study proposes a hybrid safety verification framework for closed‑loop multi‑agent systems under bounded stochastic disturbances. The proposed approach augments control barrier functions with a novel ψ‑weighted formulation that encodes directional control alignment between agents into the safety constraints. Deterministic admissibility is combined with empirical validation via Monte Carlo rollouts, and a PAC‑style guarantee is derived based on margin‑aware safety violations to provide a probabilistic safety certificate. The results from the experiments conducted under different bounded stochastic disturbances validate the feasibility of the proposed approach.

PaperID: 8510, https://arxiv.org/pdf/2509.20067.pdf

Authors: Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, Wei Wei, Kun Zhang, Shaohua Kevin Zhou

Title: MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM

Abstract:
Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real‑world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi‑agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi‑Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self‑learn clinical knowledge via a multi‑agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease‑specific cues. We further extend it to a MACD‑human collaborative workflow, where multiple LLM‑based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real‑world patient cases across seven diseases using diverse open‑source LLMs (Llama‑3.1 8B/70B, DeepSeek‑R1‑Distill‑Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3% (MACD). In direct comparison with physician‑only diagnosis under the same evaluation protocol, MACD achieves comparable or superior performance, with improvements up to 16%. Furthermore, the MACD‑human workflow yields an 18.6% improvement over physician‑only diagnosis, demonstrating the synergistic potential of human‑AI collaboration. Notably, the self‑learned clinical knowledge exhibits strong cross‑model stability, transferability across LLMs, and capacity for model‑specific personalization.This work thus presents a scalable self‑learning paradigm that bridges the gap between the intrinsic knowledge of LLMs.

PaperID: 8511, https://arxiv.org/pdf/2509.19918.pdf

Authors: Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande

Title: Beyond Language Barriers: Multi-Agent Coordination for Multi-Language Code Generation

Abstract:
Producing high‑quality code across multiple programming languages is increasingly important as today's software systems are built on heterogeneous stacks. Large language models (LLMs) have advanced the state of automated programming, yet their proficiency varies sharply between languages, especially those with limited training data such as Rust, Perl, OCaml, and Erlang. Many current solutions including language‑specific fine‑tuning, multi‑agent orchestration, transfer learning, and intermediate‑representation pipelines still approach each target language in isolation, missing opportunities to share knowledge or exploit recurring cross‑language patterns. XL‑CoGen tackles this challenge with a coordinated multi‑agent architecture that integrates intermediate representation, code generation, translation, and automated repair. Its distinguishing feature is a data‑driven mechanism for selecting bridging languages: empirically derived transfer matrices identify the best intermediate languages based on demonstrated translation success rather than raw generation accuracy. The system performs early output validation, iteratively corrects errors, and reuses intermediate artifacts as contextual scaffolds for subsequent translations. Extensive experiments show that XL‑CoGen yields notable improvements with 13 percentage‑point gains over the strongest fine‑tuned baseline and as much as 30 percentage points over existing single‑language multi‑agent methods. Ablation studies further demonstrate that compatibility‑guided bridging significantly outperforms LLM‑based heuristics, confirming the value of cumulative cross‑language knowledge transfer.

PaperID: 8512, https://arxiv.org/pdf/2509.19859.pdf

Authors: Ratnangshu Das, Shubham Sawarkar, Pushpak Jagtap

Title: Scalable and Approximation-free Symbolic Control for Unknown Euler-Lagrange Systems

Abstract:
We propose a novel symbolic control framework for enforcing temporal logic specifications in Euler‑Lagrange systems that addresses the key limitations of traditional abstraction‑based approaches. Unlike existing methods that require exact system models and provide guarantees only at discrete sampling instants, our approach relies only on bounds on system parameters and input constraints, and ensures correctness for the full continuous‑time trajectory. The framework combines scalable abstraction of a simplified virtual system with a closed‑form, model‑free controller that guarantees trajectories satisfy the original specification while respecting input bounds and remaining robust to unknown but bounded disturbances. We provide feasibility conditions for the construction of confinement regions and analyze the trade‑off between efficiency and conservatism. Case studies on pendulum dynamics, a two‑link manipulator, and multi‑agent systems, including hardware experiments, demonstrate that the proposed approach ensures both correctness and safety while significantly reducing computational time and memory requirements. These results highlight its scalability and practicality for real‑world robotic systems where precise models are unavailable and continuous‑time guarantees are essential.

PaperID: 8513, https://arxiv.org/pdf/2509.19855.pdf

Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato

Title: CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks

Abstract:
The increasing demand for intelligent mobile applications has made multi‑agent collaboration with Transformer‑based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end‑to‑end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self‑evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable‑sized segments and deployed across mobile devices for pipeline‑parallel training, while the decoder is deployed on edge servers to handle generative tasks. Then we perform global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro‑batches, bandwidth, and transmission power. We derive and use a closed‑form convergence bound to design an Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long‑term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end‑to‑end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic communication environments.

PaperID: 8514, https://arxiv.org/pdf/2509.19599.pdf

Authors: Danilo Trombino, Vincenzo Pecorella, Alessandro de Giulii, Davide Tresoldi

Title: Knowledge Base-Aware Orchestration: A Dynamic, Privacy-Preserving Method for Multi-Agent Systems

Abstract:
Multi‑agent systems (MAS) are increasingly tasked with solving complex, knowledge‑intensive problems where effective agent orchestration is critical. Conventional orchestration methods rely on static agent descriptions, which often become outdated or incomplete. This limitation leads to inefficient task routing, particularly in dynamic environments where agent capabilities continuously evolve. We introduce Knowledge Base‑Aware (KBA) Orchestration, a novel approach that augments static descriptions with dynamic, privacy‑preserving relevance signals derived from each agent's internal knowledge base (KB). In the proposed framework, when static descriptions are insufficient for a clear routing decision, the orchestrator prompts the subagents in parallel. Each agent then assesses the task's relevance against its private KB, returning a lightweight ACK signal without exposing the underlying data. These collected signals populate a shared semantic cache, providing dynamic indicators of agent suitability for future queries. By combining this novel mechanism with static descriptions, our method achieves more accurate and adaptive task routing preserving agent autonomy and data confidentiality. Benchmarks show that our KBA Orchestration significantly outperforms static description‑driven methods in routing precision and overall system efficiency, making it suitable for large‑scale systems that require higher accuracy than standard description‑driven routing.

PaperID: 8515, https://arxiv.org/pdf/2509.19515.pdf

Authors: Rose E. Guingrich, Michael S. A. Graziano

Title: A Longitudinal Randomized Control Study of Companion Chatbot Use: Anthropomorphism and Its Mediating Role on Social Impacts

Abstract:
Many Large Language Model (LLM) chatbots are designed and used for companionship, and people have reported forming friendships, mentorships, and romantic partnerships with them. Concerns that companion chatbots may harm or replace real human relationships have been raised, but whether and how these social consequences occur remains unclear. In the present longitudinal study (N = 183), participants were randomly assigned to a chatbot condition (text chat with a companion chatbot) or to a control condition (text‑based word games) for 10 minutes a day for 21 days. Participants also completed four surveys during the 21 days and engaged in audio recorded interviews on day 1 and 21. Overall, social health and relationships were not significantly impacted by companion chatbot interactions across 21 days of use. However, a detailed analysis showed a different story. People who had a higher desire to socially connect also tended to anthropomorphize the chatbot more, attributing humanlike properties to it; and those who anthropomorphized the chatbot more also reported that talking to the chatbot had a greater impact on their social interactions and relationships with family and friends. Via a mediation analysis, our results suggest a key mechanism at work: the impact of human‑AI interaction on human‑human social outcomes is mediated by the extent to which people anthropomorphize the AI agent, which is in turn motivated by a desire to socially connect. In a world where the desire to socially connect is on the rise, this finding may be cause for concern.

PaperID: 8516, https://arxiv.org/pdf/2509.19512.pdf

Authors: Charles Dansereau, Junior-Samuel Lopez-Yepez, Karthik Soma, Antoine Fagette

Title: The Heterogeneous Multi-Agent Challenge

Abstract:
Multi‑Agent Reinforcement Learning (MARL) is a growing research area which gained significant traction in recent years, extending Deep RL applications to a much wider range of problems. A particularly challenging class of problems in this domain is Heterogeneous Multi‑Agent Reinforcement Learning (HeMARL), where agents with different sensors, resources, or capabilities must cooperate based on local information. The large number of real‑world situations involving heterogeneous agents makes it an attractive research area, yet underexplored, as most MARL research focuses on homogeneous agents (e.g., a swarm of identical robots). In MARL and single‑agent RL, standardized environments such as ALE and SMAC have allowed to establish recognized benchmarks to measure progress. However, there is a clear lack of such standardized testbed for cooperative HeMARL. As a result, new research in this field often uses simple environments, where most algorithms perform near optimally, or uses weakly heterogeneous MARL environments.

PaperID: 8517, https://arxiv.org/pdf/2509.19472.pdf

Authors: Brendan Gould, Akash Harapanahalli, Samuel Coogan

Title: Automatic and Scalable Safety Verification using Interval Reachability with Subspace Sampling

Abstract:
Interval refinement is a technique for reducing the conservatism of traditional interval based reachability methods by lifting the system to a higher dimension using new auxiliary variables and exploiting the introduced structure through a refinement procedure. We present a novel, efficiently scaling, automatic refinement strategy based on a subspace sampling argument and motivated by reducing the number of interval operations through sparsity. Unlike previous methods, we guarantee that refined bounds shrink as additional auxiliary variables are added. This additionally encourages automation of the lifting phase by allowing larger groups of auxiliary variables to be considered. We implement our strategy in JAX, a high‑performance computational toolkit for Python and demonstrate its efficacy on several examples, including regulating a multi‑agent platoon to the origin while avoiding an obstacle.

PaperID: 8518, https://arxiv.org/pdf/2509.19341.pdf

Authors: Yang Fu, Peng Qin, Yueyue Zhang, Pao Cheng, Jun Lu, Yifei Wang

Title: Fine-Grained AI Model Caching and Downloading With Coordinated Multipoint Broadcasting in Multi-Cell Edge Networks

Abstract:
6G networks are envisioned to support on‑demand AI model downloading to accommodate diverse inference requirements of end users. By proactively caching models at edge nodes, users can retrieve the requested models with low latency for on‑device AI inference. However, the substantial size of contemporary AI models poses significant challenges for edge caching under limited storage capacity, as well as for the concurrent delivery of heterogeneous models over wireless channels. To address these challenges, we propose a fine‑grained AI model caching and downloading system that exploits parameter reusability, stemming from the common practice of fine‑tuning task‑specific models from a shared pre‑trained model with frozen parameters. This system selectively caches model parameter blocks (PBs) at edge nodes, eliminating redundant storage of reusable parameters across different cached models. Additionally, it incorporates coordinated multipoint (CoMP) broadcasting to simultaneously deliver reusable PBs to multiple users, thereby enhancing downlink spectrum utilization. Under this arrangement, we formulate a model downloading delay minimization problem to jointly optimize PB caching, migration (among edge nodes), and broadcasting beamforming. To tackle this intractable problem, we develop a distributed multi‑agent learning framework that enables edge nodes to explicitly learn mutual influence among their actions, thereby facilitating cooperation. Furthermore, a data augmentation approach is proposed to adaptively generate synthetic training samples through a predictive model, boosting sample efficiency and accelerating policy learning. Both theoretical analysis and simulation experiments validate the superior convergence performance of the proposed learning framework.

PaperID: 8519, https://arxiv.org/pdf/2509.19340.pdf

Authors: Ying Ju, Mingdong Li, Haoyu Wang, Lei Liu, Youyang Qu, Mianxiong Dong, Victor C. M. Leung, Chau Yuen

Title: Joint Channel Estimation and Computation Offloading in Fluid Antenna-assisted MEC Networks

Abstract:
With the emergence of fluid antenna (FA) in wireless communications, the capability to dynamically adjust port positions offers substantial benefits in spatial diversity and spectrum efficiency, which are particularly valuable for mobile edge computing (MEC) systems. Therefore, we propose an FA‑assisted MEC offloading framework to minimize system delay. This framework faces two severe challenges, which are the complexity of channel estimation due to dynamic port configuration and the inherent non‑convexity of the joint optimization problem. Firstly, we propose Information Bottleneck Metric‑enhanced Channel Compressed Sensing (IBM‑CCS), which advances FA channel estimation by integrating information relevance into the sensing process and capturing key features of FA channels effectively. Secondly, to address the non‑convex and high‑dimensional optimization problem in FA‑assisted MEC systems, which includes FA port selection, beamforming, power control, and resource allocation, we propose a game theory‑assisted Hierarchical Twin‑Dueling Multi‑agent Algorithm (HiTDMA) based offloading scheme, where the hierarchical structure effectively decouples and coordinates the optimization tasks between the user side and the base station side. Crucially, the game theory effectively reduces the dimensionality of power control variables, allowing deep reinforcement learning (DRL) agents to achieve improved optimization efficiency. Numerical results confirm that the proposed scheme significantly reduces system delay and enhances offloading performance, outperforming benchmarks. Additionally, the IBM‑CCS channel estimation demonstrates superior accuracy and robustness under varying port densities, contributing to efficient communication under imperfect CSI.

PaperID: 8520, https://arxiv.org/pdf/2509.19199.pdf

Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao

Title: Agentic Reinforcement Learning with Implicit Step Rewards

Abstract:
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL) that reason and act in interactive environments. However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy. Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high‑variance from overly fine‑grained rewards or failtures when state overlap is rare. We therefore introduce implicit step rewards for agentic RL (iStar), a general credit‑assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels. Particularly, we alternatively optimize an implicit process reward model (PRM) with the policy model to generate implicit step rewards via a trajectory‑based DPO objective. Theoretical analysis shows that this learning objective produces a step‑wise reward function. Then the implicit step rewards are used to compute step‑level advantages, which are combined with trajectory (or episode)‑level advantages for policy updates, creating a self‑reinforcing training loop. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open‑ended social interactions with unverifiable rewards in SOTOPIA. Crucially, iStar shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state‑of‑the‑art results with higher sample‑efficiency and training stability. Further analysis also demonstrates efficient exploration by iStar with increased rewards in both step‑ and episode‑level while maintaining fewer steps to achieve task success. Code will be available soon.

PaperID: 8521, https://arxiv.org/pdf/2509.19185.pdf

Authors: Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan

Title: An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Abstract:
Foundation model (FM)‑based AI agents are rapidly gaining adoption across diverse domains, but their inherent non‑determinism and non‑reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task‑level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large‑scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open‑source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent‑specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM‑based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM‑based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non‑determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.

PaperID: 8522, https://arxiv.org/pdf/2509.19182.pdf

Authors: Devin Lange, Shanghua Gao, Pengwei Sui, Priya Misner, Astrid van den Brandt, Austen Money, Nikolay Akhmetov, Lisa Choy, Marinka Zitnik, Nils Gehlenborg

Title: YAC: Bridging Natural Language and Interactive Visual Exploration with Generative AI for Biomedical Data Discovery

Abstract:
Incorporating natural language input has the potential to improve the capabilities of biomedical data discovery interfaces. However, user interface elements and visualizations are still powerful tools for interacting with data. In our prototype system, YAC, Yet Another Chatbot, we integrate natural language and interactive visualizations. YAC uses a tool‑calling multi‑agent system to generate declarative output, which is interpreted to render linked interactive visualizations and apply data filters. We also include adjustment widgets, which allow users to directly modify the structured output. Structured text is also generated to clarify user intent, notify users of system boundaries, and explain aspects of the data with live data element links. We conducted a user study with domain experts to surface areas where YAC can be improved. Furthermore we reflect on the capabilities and design of this system with an analysis of its technical dimensions.

PaperID: 8523, https://arxiv.org/pdf/2509.19136.pdf

Authors: Sébastien Salva, Redha Taguelmimt

Title: On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

Abstract:
The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising direction to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on the impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable, with respect to the industrial quality levels Six Sigma. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and current limitations of current LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the level 3‑sigma). We provide prototype tools, test suites, and results.

PaperID: 8524, https://arxiv.org/pdf/2509.19079.pdf

Authors: Samuel Chamoun, Christian McDowell, Robin Buchanan, Kevin Chan, Eric Graves, Yin Sun

Title: MAPPO for Edge Server Monitoring

Abstract:
In this paper, we consider a goal‑oriented communication problem for edge server monitoring, where jobs arrive intermittently at multiple dispatchers and must be assigned to shared edge servers with finite queues and time‑varying availability. Accurate knowledge of server status is critical for sustaining high throughput, yet remains challenging under dynamic workloads and partial observability. To address this challenge, each dispatcher maintains server knowledge through two complementary mechanisms: (i) active status queries that provide instantaneous updates at a communication cost, and (ii) job execution feedback that reveals server conditions upon successful or failed job completion. We formulate a cooperative multi‑agent distributed decision‑making problem in which dispatchers jointly optimize query scheduling to balance throughput against communication overhead. To solve this problem, we propose a Multi‑Agent Proximal Policy Optimization (MAPPO)‑based algorithm that leverages centralized training with decentralized execution (CTDE) to learn distributed query‑and‑dispatch policies under partial and stale observations. Experiments show that MAPPO achieves superior throughput‑cost tradeoffs and significantly outperforms baseline strategies across varying query costs, job arrival rates, and dispatchers.

PaperID: 8525, https://arxiv.org/pdf/2509.18985.pdf

Authors: Elisa Composta, Nicolo' Fontana, Francesco Corso, Francesco Pierri

Title: Simulating Online Social Media Conversations on Controversial Topics Using AI Agents Calibrated on Real-World Data

Abstract:
Online social networks offer a valuable lens to analyze both individual and collective phenomena. Researchers often use simulators to explore controlled scenarios, and the integration of Large Language Models (LLMs) makes these simulations more realistic by enabling agents to understand and generate natural language content. In this work, we investigate the behavior of LLM‑based agents in a simulated microblogging social network. We initialize agents with realistic profiles calibrated on real‑world online conversations from the 2022 Italian political election and extend an existing simulator by introducing mechanisms for opinion modeling. We examine how LLM agents simulate online conversations, interact with others, and evolve their opinions under different scenarios. Our results show that LLM agents generate coherent content, form connections, and build a realistic social network structure. However, their generated content displays less heterogeneity in tone and toxicity compared to real data. We also find that LLM‑based opinion dynamics evolve over time in ways similar to traditional mathematical models. Varying parameter configurations produces no significant changes, indicating that simulations require more careful cognitive modeling at initialization to replicate human behavior more faithfully. Overall, we demonstrate the potential of LLMs for simulating user behavior in social environments, while also identifying key challenges in capturing heterogeneity and complex dynamics.

PaperID: 8526, https://arxiv.org/pdf/2509.18970.pdf

Authors: Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Kun Wang, Pengfei Cao, Qingyue Wang, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Peng Zhang, Qingsong Wen, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, Li Guo

Title: LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions

Abstract:
Driven by the rapid advancements of Large Language Models (LLMs), LLM‑based agents have emerged as powerful intelligent systems capable of human‑like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real‑world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM‑based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM‑based agents. To this end, we present the first comprehensive survey of hallucinations in LLM‑based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in‑depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM‑based agents, ultimately contributing to the development of more robust and reliable agent systems.

PaperID: 8527, https://arxiv.org/pdf/2509.18808.pdf

Authors: Zexun Zhan, Shuzheng Gao, Ruida Hu, Cuiyun Gao

Title: SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement

Abstract:
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single‑turn problem, overlooking the stepwise requirement changes and iterative workflows in real‑world software development. This mismatch limits the understanding of how well LLMs can support real‑world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn‑specific test cases. To bridge this gap, we present SR‑Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise requirements Refinement. SR‑Eval spans both function‑level and repository‑level tasks in Python and Java, enabling fine‑grained and progressive evaluation across evolving requirements. The construction of SR‑Eval follows a carefully designed pipeline that first leverages a multi‑agent‑based requirement generation method to simulate the development process and recover the multi‑round interaction process from final requirements, then employs a semantic‑aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR‑Eval comprises 443 multi‑turn tasks and 1,857 questions at both function and repository levels. Using SR‑Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best‑performing model achieves only 22.67% completion rate on function‑level tasks and 20.00% on repository‑level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for the development of advanced methods.

PaperID: 8528, https://arxiv.org/pdf/2509.18787.pdf

Authors: Luca Muscariello, Vijoy Pandey, Ramiz Polic

Title: The AGNTCY Agent Directory Service: Architecture and Implementation

Abstract:
The Agent Directory Service (ADS) is a distributed directory for the discovery of AI agent capabilities, metadata, and provenance. It leverages content‑addressed storage, hierarchical taxonomies, and cryptographic signing to enable efficient, verifiable, and multi‑dimensional discovery across heterogeneous Multi‑Agent Systems (MAS). Built on the Open Agentic Schema Framework (OASF), ADS decouples capability indexing from content location through a two‑level mapping realized over a Kademlia‑based Distributed Hash Table (DHT). It reuses mature OCI / ORAS infrastructure for artifact distribution, integrates Sigstore for provenance, and supports schema‑driven extensibility for emerging agent modalities (LLM prompt agents, MCP servers, A2A‑enabled components). This paper formalizes the architectural model, describes storage and discovery layers, explains security and performance properties, and positions ADS within the broader landscape of emerging agent registry and interoperability initiatives.

PaperID: 8529, https://arxiv.org/pdf/2509.18713.pdf

Authors: Yizhe Huang, Yang Liu, Ruiyu Zhao, Xiaolong Zhong, Xingming Yue, Ling Jiang

Title: MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Abstract:
Large Language Model‑based agents(LLM‑based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self‑improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass^k to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug‑and‑play verbal reinforcement memory layer that distills multi‑turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision‑making, without requiring any fine‑tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage‑point gain in multi‑turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long‑term reliability of frozen LLM agents in customer service scenarios.

PaperID: 8530, https://arxiv.org/pdf/2509.18661.pdf

Authors: Yixin Liu, Yonghui Wu, Denghui Zhang, Lichao Sun

Title: Agentic AutoSurvey: Let LLMs Survey LLMs

Abstract:
The exponential growth of scientific literature poses unprecedented challenges for researchers attempting to synthesize knowledge across rapidly evolving fields. We present Agentic AutoSurvey, a multi‑agent framework for automated survey generation that addresses fundamental limitations in existing approaches. Our system employs four specialized agents (Paper Search Specialist, Topic Mining \& Clustering, Academic Survey Writer, and Quality Evaluator) working in concert to generate comprehensive literature surveys with superior synthesis quality. Through experiments on six representative LLM research topics from COLM 2024 categories, we demonstrate that our multi‑agent approach achieves significant improvements over existing baselines, scoring 8.18/10 compared to AutoSurvey's 4.77/10. The multi‑agent architecture processes 75‑‑443 papers per topic (847 total across six topics) while targeting high citation coverage (often \geq80% on 75‑‑100‑paper sets; lower on very large sets such as RLHF) through specialized agent orchestration. Our 12‑dimension evaluation captures organization, synthesis integration, and critical analysis beyond basic metrics. These findings demonstrate that multi‑agent architectures represent a meaningful advancement for automated literature survey generation in rapidly evolving scientific domains.

PaperID: 8531, https://arxiv.org/pdf/2509.18597.pdf

Authors: Yuan Meng, Zhenguo Sun, Max Fest, Xukun Li, Zhenshan Bing, Alois Knoll

Title: Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills

Abstract:
Large language models (LLMs)‑based code generation for robotic manipulation has recently shown promise by directly translating human instructions into executable code, but existing methods remain noisy, constrained by fixed primitives and limited context windows, and struggle with long‑horizon tasks. While closed‑loop feedback has been explored, corrected knowledge is often stored in improper formats, restricting generalization and causing catastrophic forgetting, which highlights the need for learning reusable skills. Moreover, approaches that rely solely on LLM guidance frequently fail in extremely long‑horizon scenarios due to LLMs' limited reasoning capability in the robotic domain, where such issues are often straightforward for humans to identify. To address these challenges, we propose a human‑in‑the‑loop framework that encodes corrections into reusable skills, supported by external memory and Retrieval‑Augmented Generation with a hint mechanism for dynamic reuse. Experiments on Ravens, Franka Kitchen, and MetaWorld, as well as real‑world settings, show that our framework achieves a 0.93 success rate (up to 27% higher than baselines) and a 42% efficiency improvement in correction rounds. It can robustly solve extremely long‑horizon tasks such as "build a house", which requires planning over 20 primitives.

PaperID: 8532, https://arxiv.org/pdf/2509.18576.pdf

Authors: Zeyi Kang, Liang He, Yanxin Zhang, Zuheng Ming, Kaixing Zhao

Title: LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA

Abstract:
Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as effective fusion of heterogeneous data and computational efficiency in resource‑constrained environments. To address these challenges, this study proposes the lightweight LCMF cascaded attention framework, introducing a multi‑level cross‑modal parameter sharing mechanism into the Mamba module. By integrating the advantages of Cross‑Attention and Selective parameter‑sharing State Space Models (SSMs), the framework achieves efficient fusion of heterogeneous modalities and semantic complementary alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks and achieves competitive mid‑tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) in EQA video tasks. Its lightweight design achieves a 4.35‑fold reduction in FLOPs relative to the average of comparable baselines while using only 166.51M parameters (image‑text) and 219M parameters (video‑text), providing an efficient solution for Human‑Robot Interaction (HRI) applications in resource‑constrained scenarios with strong multimodal decision generalization capabilities.

PaperID: 8533, https://arxiv.org/pdf/2509.18557.pdf

Authors: Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins

Title: LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Abstract:
Compared to traditional models, agentic AI represents a highly valuable target for potential attackers as they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to LLM). This characteristic introduces substantial security risk to both operational security and information security. Most common existing defense mechanism rely on detection of malicious intent and preventing it from reaching the LLM agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection‑based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long‑term resilience, and reduces the resources required for sustaining LLM information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business communications are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.

PaperID: 8534, https://arxiv.org/pdf/2509.18551.pdf

Authors: Chenlan Wang, Jimin Han, Diana Jue-Rajasingh

Title: Group Formation through Game Theory and Agent-Based Modeling: Spatial Cohesion, Heterogeneity, and Resource Pooling

Abstract:
This paper develops a game‑theoretic model and an agent‑based model to study group formation driven by resource pooling, spatial cohesion, and heterogeneity. We focus on cross‑sector partnerships (CSPs) involving public, private, and nonprofit organizations, each contributing distinct resources. Group formation occurs as agents strategically optimize their choices in response to others within a competitive setting. We prove the existence of stable group equilibria and simulate formation dynamics under varying spatial and resource conditions. The results show that limited individual resources lead to groups that form mainly among nearby actors, while abundant resources allow groups to move across larger distances. Increased resource heterogeneity and spatial proximity promote the formation of larger and more diverse groups. These findings reveal key trade‑offs shaping group size and composition, guiding strategies for effective cross‑sector collaborations and multi‑agent systems.

PaperID: 8535, https://arxiv.org/pdf/2509.18545.pdf

Authors: Ioannis Panitsas, Tolga O. Atalay, Dragoslav Stojadinovic, Angelos Stavrou, Leandros Tassiulas

Title: SlicePilot: Demystifying Network Slice Placement in Heterogeneous Cloud Infrastructures

Abstract:
Cellular networks are comprised of software‑based entities, with main functions encapsulated as Virtual Network Functions (VNFs) deployed on Commercial‑off‑the‑Shelf (COTS) hardware. As a key enabler of 5G, network slicing offers logically isolated Quality of Service (QoS) for diverse use cases. With the transition to cloud‑native infrastructures, optimizing network slice placement across multi‑cloud environments remains challenging due to heterogeneous resource capabilities and varying slice‑specific demands. This paper presents SlicePilot, a modular framework that enables autonomous and near‑optimal VNF placement using a disaggregated Multi‑Agent Reinforcement Learning (MARL) approach. SlicePilot collects real‑world traffic profiles to estimate resource needs for each slice type. These estimates guide a MARL‑based scheduler that minimizes deployment costs while satisfying QoS constraints. We evaluate SlicePilot on a multi‑cloud testbed and demonstrate a 19x speed‑up over combinatorial optimization methods, while keeping deployment costs within 7.8% of the optimal. Although SlicePilot results in 2.42x more QoS violations under high‑load conditions, this trade‑off is offset by faster decision‑making and reduced computational overhead. Overall, SlicePilot delivers a scalable, cost‑efficient solution for network slice placement, making it suitable for real‑time deployments where responsiveness and efficiency are critical.

PaperID: 8536, https://arxiv.org/pdf/2509.18526.pdf

Authors: Han Zeng, Haibo Wang, Luhao Fan, Bingcheng Zhu, Xiaohu You, Zaichen Zhang

Title: AI Agent Access (A\^3) Network: An Embodied, Communication-Aware Multi-Agent Framework for 6G Coverage

Abstract:
The vision of 6G communication demands autonomous and resilient networking in environments without fixed infrastructure. Yet most multi‑agent reinforcement learning (MARL) approaches focus on isolated stages ‑ exploration, relay formation, or access ‑ under static deployments and centralized control, limiting adaptability. We propose the AI Agent Access (A\^3) Network, a unified, embodied intelligence‑driven framework that transforms multi‑agent networking into a dynamic, decentralized, and end‑to‑end system. Unlike prior schemes, the A\^3 Network integrates exploration, target user access, and backhaul maintenance within a single learning process, while supporting on‑demand agent addition during runtime. Its decentralized policies ensure that even a single agent can operate independently with limited observations, while coordinated agents achieve scalable, communication‑optimized coverage. By embedding link‑level communication metrics into actor‑critic learning, the A\^3 Network couples topology formation with robust decision‑making. Numerical simulations demonstrate that the A\^3 Network not only balances exploration and communication efficiency but also delivers system‑level adaptability absent in existing MARL frameworks, offering a new paradigm for 6G multi‑agent networks.

PaperID: 8537, https://arxiv.org/pdf/2509.18415.pdf

Authors: Sumana Malkapuram, Sameera Gangavarapu, Kailashnath Reddy Kavalakuntla, Ananya Gangavarapu

Title: Context Lineage Assurance for Non-Human Identities in Critical Multi-Agent Systems

Abstract:
The proliferation of autonomous software agents necessitates rigorous frameworks for establishing secure and verifiable agent‑to‑agent (A2A) interactions, particularly when such agents are instantiated as non‑human identities(NHIs). We extend the A2A paradigm [1 , 2] by introducing a cryptographically grounded mechanism for lineage verification, wherein the provenance and evolution of NHIs are anchored in append‑only Merkle tree structures modeled after Certificate Transparency (CT) logs. Unlike traditional A2A models that primarily secure point‑to‑point interactions, our approach enables both agents and external verifiers to cryptographically validate multi‑hop provenance, thereby ensuring the integrity of the entire call chain. A federated proof server acts as an auditor across one or more Merkle logs, aggregating inclusion proofs and consistency checks into compact, signed attestations that external parties can verify without access to the full execution trace. In parallel, we augment the A2A agent card to incorporate explicit identity verification primitives, enabling both peer agents and human approvers to authenticate the legitimacy of NHI representations in a standardized manner. Together, these contributions establish a cohesive model that integrates identity attestation, lineage verification, and independent proof auditing, thereby advancing the security posture of inter‑agent ecosystems and providing a foundation for robust governance of NHIs in regulated environments such as FedRAMP.

PaperID: 8538, https://arxiv.org/pdf/2509.18371.pdf

Authors: Eduardo Sebastián, Maitrayee Keskar, Eeman Iqbal, Eduardo Montijano, Carlos Sagüés, Nikolay Atanasov

Title: Policy Gradient with Self-Attention for Model-Free Distributed Nonlinear Multi-Agent Games

Abstract:
Multi‑agent games in dynamic nonlinear settings are challenging due to the time‑varying interactions among the agents and the non‑stationarity of the (potential) Nash equilibria. In this paper we consider model‑free games, where agent transitions and costs are observed without knowledge of the transition and cost functions that generate them. We propose a policy gradient approach to learn distributed policies that follow the communication structure in multi‑team games, with multiple agents per team. Our formulation is inspired by the structure of distributed policies in linear quadratic games, which take the form of time‑varying linear feedback gains. In the nonlinear case, we model the policies as nonlinear feedback gains, parameterized by self‑attention layers to account for the time‑varying multi‑agent communication topology. We demonstrate that our distributed policy gradient approach achieves strong performance in several settings, including distributed linear and nonlinear regulation, and simulated and real multi‑robot pursuit‑and‑evasion games.

PaperID: 8539, https://arxiv.org/pdf/2509.18310.pdf

Authors: Bahar Kor, Bipin Gaikwad, Abani Patra, Eric L. Miller

Title: On Multi-entity, Multivariate Quickest Change Point Detection

Abstract:
We propose a framework for online Change Point Detection (CPD) from multi‑entity, multivariate time series data, motivated by applications in crowd monitoring where traditional sensing methods (e.g., video surveillance) may be infeasible. Our approach addresses the challenge of detecting system‑wide behavioral shifts in complex, dynamic environments where the number and behavior of individual entities may be uncertain or evolve. We introduce the concept of Individual Deviation from Normality (IDfN), computed via a reconstruction‑error‑based autoencoder trained on normal behavior. We aggregate these individual deviations using mean, variance, and Kernel Density Estimates (KDE) to yield a System‑Wide Anomaly Score (SWAS). To detect persistent or abrupt changes, we apply statistical deviation metrics and the Cumulative Sum (CUSUM) technique to these scores. Our unsupervised approach eliminates the need for labeled data or feature extraction, enabling real‑time operation on streaming input. Evaluations on both synthetic datasets and crowd simulations, explicitly designed for anomaly detection in group behaviors, demonstrate that our method accurately detects significant system‑level changes, offering a scalable and privacy‑preserving solution for monitoring complex multi‑agent systems. In addition to this methodological contribution, we introduce new, challenging multi‑entity multivariate time series datasets generated from crowd simulations in Unity and coupled nonlinear oscillators. To the best of our knowledge, there is currently no publicly available dataset of this type designed explicitly to evaluate CPD in complex collective and interactive systems, highlighting an essential gap that our work addresses.

PaperID: 8540, https://arxiv.org/pdf/2509.18292.pdf

Authors: Shuaiting Huang, Haodong Jiang, Chengcheng Zhao, Peng Cheng, Junfeng Wu

Title: Fully Distributed State Estimation for Multi-agent Systems and its Application in Cooperative Localization

Abstract:
In this paper, we investigate the distributed state estimation problem for a continuous‑time linear multi‑agent system (MAS) composed of \mathitm agents and monitored by the agents themselves. To address this problem, we propose a distributed observer that enables each agent to reconstruct the state of the MAS. The main idea is to let each agent \mathiti recover the state of agent \mathitj by using leader‑follower consensus rules to track agent \mathitj's state estimate, which is generated by agent \mathitj itself using a Luenberger‑like estimation rule. Under the assumptions of node‑level observability and topological ordering consistency, we show that the estimation error dynamics are stabilizable if and only if the communication graph is strongly connected. Moreover, we discuss the fully distributed design of the proposed observer, assuming that the agents only know basic MAS configuration information, such as the homogeneity and the maximum number of allowable agents. This design ensures that the proposed observer functions correctly when agents are added or removed. Building on this, we consider cooperative localization as a distributed estimation problem and develop two fully distributed localization algorithms that allow agents to track their own and other agents' positions (and velocities) within the MAS. Finally, we conduct simulations to demonstrate the effectiveness of our proposed theoretical results.

PaperID: 8541, https://arxiv.org/pdf/2509.18229.pdf

Authors: Anthony Patera, Rohan Abeyaratne

Title: An N-Plus-1 GPT Agency for Critical Solution of Mechanical Engineering Analysis Problems

Abstract:
Generative AI, and specifically GPT, can produce a remarkable solution to a mechanical engineering analysis problem ‑ but also, on occasion, a flawed solution. For example, an elementary mechanics problem is solved flawlessly in one GPT instance and incorrectly in a subsequent GPT instance, with a success probability of only 85%. This unreliability renders "out‑of‑the‑box" GPT unsuitable for deployment in education or engineering practice. We introduce an "N‑Plus‑1" GPT Agency for Initial (Low‑Cost) Analysis of mechanical engineering Problem Statements. Agency first launches N instantiations of Agent Solve to yield N independent Proposed Problem Solution Realizations; Agency then invokes Agent Compare to summarize and compare the N Proposed Problem Solution Realizations and to provide a Recommended Problem Solution. We argue from Condorcet's Jury Theorem that, for a Problem Statement characterized by per‑Solve success probability greater than 1/2 (and N sufficiently large), the Predominant (Agent Compare) Proposed Problem Solution will, with high probability, correspond to a Correct Proposed Problem Solution. Furthermore, Agent Compare can also incorporate aspects of Secondary (Agent Compare) Proposed Problem Solutions, in particular when the latter represent alternative Problem Statement interpretations ‑ different Mathematical Models ‑ or alternative Mathematical Solution Procedures. Comparisons to Grok Heavy, a commercial multi‑agent model, show similarities in design and performance, but also important differences in emphasis: our Agency focuses on transparency and pedagogical value.

PaperID: 8542, https://arxiv.org/pdf/2509.18169.pdf

Authors: Hengbo Xiao, Jingyuan Fan, Xin Tong, Jingzhao Zhang, Chao Lu, Guannan He

Title: PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning

Abstract:
Tasks on complex systems require high‑precision numerical computation to support decisions, but current large language models (LLMs) cannot integrate such computations as an intrinsic and interpretable capability with existing architectures. Multi‑agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficiency caused by limited scalability. To this end, we propose Physically‑isolated Experts Routing Network (PiERN), an architecture for integrating computation and reasoning. Instead of the tool‑use workflows or function‑calling, PiERN endogenously integrates computational capabilities into neural networks after separately training experts, a text‑to‑computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiERN on representative linear and nonlinear computation‑reasoning tasks against LLM finetuning and the multi‑agent system approaches. Results show that the PiERN architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi‑agent approaches. PiERN offers an efficient, interpretable, and scalable paradigm for interfacing language models with scientific systems.

PaperID: 8543, https://arxiv.org/pdf/2509.18167.pdf

Authors: Junlin Wang, Zehao Wu, Shaowei Lu, Yanlan Li, Xinghao Huang

Title: SIRAG: Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework

Abstract:
Retrieval‑Augmented Generation (RAG) enables large language models (LLMs) to access external knowledge sources, but the effectiveness of RAG relies on the coordination between the retriever and the generator. Since these components are developed independently, their interaction is often suboptimal: the retriever may return irrelevant or redundant documents, while the generator may fail to fully leverage retrieved evidence. In this work, we propose a process‑supervised multi‑agent framework to bridge the gap between retriever and generator. The framework introduces two lightweight agents: a Decision Maker, which determines when to continue retrieval or stop for answer generation, and a Knowledge Selector, which filters retrieved documents to retain only the most useful evidence. To provide fine‑grained supervision, we employ an LLM‑as‑a‑Judge that evaluates each intermediate action with process‑level rewards, ensuring more accurate credit assignment than relying solely on final answer correctness. We further adopt a tree‑structured rollout strategy to explore diverse reasoning paths, and train both agents with Proximal Policy Optimization (PPO) in an end‑to‑end manner. Experiments on single‑hop and multi‑hop question answering benchmarks show that our approach achieves higher accuracy, more stable convergence, and produces more interpretable reasoning trajectories compared with standard RAG baselines. Importantly, the proposed framework is modular and plug‑and‑play, requiring no modification to the retriever or generator, making it practical for real‑world RAG applications.

PaperID: 8544, https://arxiv.org/pdf/2509.18146.pdf

Authors: Madi Yerlanov, Qi Wang, Nancy Rodriguez

Title: Global Bifurcations and Pattern Formation in Target-Offender-Guardian Crime Models

Abstract:
We study a reaction‑advection‑diffusion model of a target‑offender‑guardian system designed to capture interactions between urban crime and policing. Using Crandall‑Rabinowitz bifurcation theory and spectral analysis, we establish rigorous conditions for both steady‑state and Hopf bifurcations. These results identify critical thresholds of policing intensity at which spatially uniform equilibria lose stability, leading either to persistent heterogeneous hotspots or oscillatory crime‑policing cycles. From a criminological perspective, such thresholds represent tipping points in guardian mobility: once crossed, they can lock neighborhoods into stable clusters of criminal activity or trigger recurrent waves of hotspot formation. Numerical simulations complement the theory, exhibiting stationary patterns, periodic oscillations, and chaotic dynamics. By explicitly incorporating law enforcement as a third interacting component, our framework extends classical two‑equation models. It offers new tools for analyzing nonlinear interactions, bifurcations, and pattern formation in multi‑agent social systems.

PaperID: 8545, https://arxiv.org/pdf/2509.18091.pdf

Authors: Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng

Title: OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

Abstract:
Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi‑step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM‑style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block‑wise latent reasoning, which equips the model with multi‑step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi‑task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue.

PaperID: 8546, https://arxiv.org/pdf/2509.18088.pdf

Authors: Chuhao Qin, Evangelos Pournaras

Title: Strategic Coordination for Evolving Multi-agent Systems: A Hierarchical Reinforcement and Collective Learning Approach

Abstract:
Decentralized combinatorial optimization in evolving multi‑agent systems poses significant challenges, requiring agents to balance long‑term decision‑making, short‑term optimized collective outcomes, while preserving autonomy of interactive agents under unanticipated changes. Reinforcement learning offers a way to model sequential decision‑making through dynamic programming to anticipate future environmental changes. However, applying multi‑agent reinforcement learning (MARL) to decentralized combinatorial optimization problems remains an open challenge due to the exponential growth of the joint state‑action space, high communication overhead, and privacy concerns in centralized training. To address these limitations, this paper proposes Hierarchical Reinforcement and Collective Learning (HRCL), a novel approach that leverages both MARL and decentralized collective learning based on a hierarchical framework. Agents take high‑level strategies using MARL to group possible plans for action space reduction and constrain the agent behavior for Pareto optimality. Meanwhile, the low‑level collective learning layer ensures efficient and decentralized coordinated decisions among agents with minimal communication. Extensive experiments in a synthetic scenario and real‑world smart city application models, including energy self‑management and drone swarm sensing, demonstrate that HRCL significantly improves performance, scalability, and adaptability compared to the standalone MARL and collective learning approaches, achieving a win‑win synthesis solution.

PaperID: 8547, https://arxiv.org/pdf/2509.18063.pdf

Authors: Jan-Felix Klein, Lars Ohnemus

Title: ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning

Abstract:
Large Language Models (LLMs) show strong reasoning abilities but rely on internalized knowledge that is often insufficient, outdated, or incorrect when trying to answer a question that requires specific domain knowledge. Knowledge Graphs (KGs) provide structured external knowledge, yet their complexity and multi‑hop reasoning requirements make integration challenging. We present ARK‑V1, a simple KG‑agent that iteratively explores graphs to answer natural language queries. We evaluate several not fine‑tuned state‑of‑the art LLMs as backbones for ARK‑V1 on the CoLoTa dataset, which requires both KG‑based and commonsense reasoning over long‑tail entities. ARK‑V1 achieves substantially higher conditional accuracies than Chain‑of‑Thought baselines, and larger backbone models show a clear trend toward better coverage, correctness, and stability.

PaperID: 8548, https://arxiv.org/pdf/2509.18043.pdf

Authors: Yinlong Dai, Andre Keyser, Dylan P. Losey

Title: Prepare Before You Act: Learning From Humans to Rearrange Initial States

Abstract:
Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out‑of‑distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, extensive demonstrations are needed for current IL methods to reach robust and generalizable behaviors. But when humans are faced with these sorts of atypical initial states, we often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states ‑‑ which are outside the policy's distribution ‑‑ and autonomously modifies object poses so that the restructured scene is similar to training data. Theoretically, we show that this two step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action‑agnostic human videos with task‑agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution with equal amounts of total training data. See videos at our project website: https://reset2025paper.github.io/

PaperID: 8549, https://arxiv.org/pdf/2509.18008.pdf

Authors: Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang

Title: Through the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration

Abstract:
Intelligent systems have traditionally been designed as tools rather than collaborators, often lacking critical characteristics that collaboration partnerships require. Recent advances in large language model (LLM) agents open new opportunities for human‑LLM‑agent collaboration by enabling natural communication and various social and cognitive behaviors. Yet it remains unclear whether principles of computer‑mediated collaboration established in HCI and CSCW persist, change, or fail when humans collaborate with LLM agents. To support systematic investigations of these questions, we introduce an open and configurable research platform for HCI researchers. The platform's modular design allows seamless adaptation of classic CSCW experiments and manipulation of theory‑grounded interaction controls. We demonstrate the platform's research efficacy and usability through three case studies: (1) two Shape Factory experiments for resource negotiation with 16 participants, (2) one Hidden Profile experiment for information pooling with 16 participants, and (3) a participatory cognitive walkthrough with five HCI researchers to refine workflows of researcher interface for experiment setup and analysis.

PaperID: 8550, https://arxiv.org/pdf/2509.17917.pdf

Authors: Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song, Jiaxing Zhang

Title: Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent

Abstract:
Recent advances in GUI agents have achieved remarkable grounding and action‑prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle‑Constrained Reward Modeling (PCRM) and Online VM‑Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverages environment‑verifiable and LLM‑derived principle to enforce interpretable reward signals that constrain long chain‑of‑thought reasoning and rule‑based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task‑specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end‑to‑end task execution reveal that Orcust achieves state‑of‑the‑art performance, improving by 22.2% on ScreenSpot and 23.9% on ScreenSpot‑Pro over the base model (i.e. Qwen2.5‑VL‑7B). The results demonstrate Orcust's effectiveness in enhancing the reasoning, adaptability and scalability of GUI agents across various environments and task complexities.

PaperID: 8551, https://arxiv.org/pdf/2509.17757.pdf

Authors: Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng

Title: Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance

Abstract:
Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi‑Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine‑grained textual descriptions, enabling Fine‑Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state‑of‑the‑art visual quality.

PaperID: 8552, https://arxiv.org/pdf/2509.17676.pdf

Authors: Abdullahi Isa Ahmed, Jamal Bentahar, El Mehdi Amhoud

Title: GLo-MAPPO: Multi-Agent Deep Reinforcement Learning for Energy-Efficient UAV-Assisted LoRa Networks

Abstract:
The rapid advancement of Low‑Power Wide Area Networks (LPWANs), particularly Long Range (LoRa) systems, has positioned them as a cornerstone for Next‑Generation Internet of Things (NG‑IoT) applications within 5G/6G ecosystems. Despite their long‑range and low‑power advantages, achieving high energy efficiency in LoRa networks remains a significant challenge in highly dynamic environments. Traditional terrestrial gateway deployments often suffer from coverage gaps and non‑line‑of‑sight propagation, while satellite‑based alternatives incur excessive energy consumption and prohibitive latency. To address these limitations, we propose a multi‑UAV architecture where unmanned aerial vehicles (UAVs) serve as mobile LoRa gateways to dynamically collect data from ground‑based end devices (EDs). We formulate a joint optimization problem to maximize the system's weighted energy efficiency by jointly optimizing spreading factors, transmission powers, UAV trajectories, and ED‑UAV associations. This problem is transformed into a partially observable stochastic game (POSG), which we solve using our proposed Green LoRa Multi‑Agent Proximal Policy Optimization (GLo‑MAPPO). Our framework leverages centralized training with decentralized execution (CTDE) and is enhanced by a gain‑based ED‑UAV association scheme. Simulation results show that GLo‑MAPPO significantly outperforms state‑of‑the‑art multi‑agent reinforcement learning (MARL) benchmarks in energy efficiency and power consumption across varying network densities. Furthermore, ablation studies validate the necessity of each optimization component and the effectiveness of the proposed association scheme.

PaperID: 8553, https://arxiv.org/pdf/2509.17619.pdf

Authors: Zhefan Wang, Ning Geng, Zhiqiang Guo, Weizhi Ma, Min Zhang

Title: Human vs. Agent in Task-Oriented Conversations

Abstract:
Task‑oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high‑quality conversational data that is challenging and costly to obtain. While large language models (LLMs) have demonstrated potential in generating synthetic conversations, the extent to which these agent‑generated interactions can effectively substitute real human conversations remains unclear. This work presents the first systematic comparison between LLM‑simulated users and human users in personalized task‑oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions. Our analysis reveals significant behavioral differences between the two user types in problem‑solving approaches, question broadness, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We found consistency in the agent users and human users across the depth‑first or breadth‑first dimensions, as well as the usefulness dimensions. These findings provide critical insights for advancing LLM‑based user simulation. Our multi‑dimensional taxonomy constructed a generalizable framework for analyzing user behavior patterns, offering insights from LLM agent users and human users. By this work, we provide perspectives on rethinking how to use user simulation in conversational systems in the future.

PaperID: 8554, https://arxiv.org/pdf/2509.17488.pdf

Authors: Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Title: Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents

Abstract:
The increasing autonomy of LLM agents in handling sensitive communications, accelerated by Model Context Protocol (MCP) and Agent‑to‑Agent (A2A) frameworks, creates urgent privacy challenges. While recent work reveals significant gaps between LLMs' privacy Q&A performance and their agent behavior, existing benchmarks remain limited to static, simplified scenarios. We present PrivacyChecker, a model‑agnostic, contextual integrity based mitigation approach that effectively reduces privacy leakage from 36.08% to 7.30% on DeepSeek‑R1 and from 33.06% to 8.32% on GPT‑4o, all while preserving task helpfulness. We also introduce PrivacyLens‑Live, transforming static benchmarks into dynamic MCP and A2A environments that reveal substantially higher privacy risks in practical. Our modular mitigation approach integrates seamlessly into agent protocols through three deployment strategies, providing practical privacy protection for the emerging agentic ecosystem. Our data and code will be made available at https://aka.ms/privacy_in_action.

PaperID: 8555, https://arxiv.org/pdf/2509.17429.pdf

Authors: Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

Title: Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Abstract:
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine‑grained states of a scene at multiple temporal scales is difficult for vision‑language models. We formalize the Multi‑Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi‑scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look‑ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer‑grained than states of spatial relationships. In surgical scenes, medium‑level steps are finer‑grained than high‑level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi‑agent Collaboration (IG‑MC), which integrates two key innovations. First, we present a plug‑and‑play incremental generation module that continuously synthesizes up‑to‑date visual previews at expanding temporal scales to inform multiple decision‑making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look‑ahead intervals lengthen. Second, we present a decision‑driven multi‑agent collaboration framework for multi‑state prediction, comprising generation, initiation, and multi‑state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.

PaperID: 8556, https://arxiv.org/pdf/2509.17395.pdf

Authors: Tianshi Cai, Guanxu Li, Nijia Han, Ce Huang, Zimu Wang, Changyu Zeng, Yuqi Wang, Jingshi Zhou, Haiyang Zhang, Qi Chen, Yushan Pan, Shuihua Wang, Wei Wang

Title: FinDebate: Multi-Agent Collaborative Intelligence for Financial Analysis

Abstract:
We introduce FinDebate, a multi‑agent framework for financial analysis, integrating collaborative debate with domain‑specific Retrieval‑Augmented Generation (RAG). Five specialized agents, covering earnings, market, sentiment, valuation, and risk, run in parallel to synthesize evidence into multi‑dimensional insights. To mitigate overconfidence and improve reliability, we introduce a safe debate protocol that enables agents to challenge and refine initial conclusions while preserving coherent recommendations. Experimental results, based on both LLM‑based and human evaluations, demonstrate the framework's efficacy in producing high‑quality analysis with calibrated confidence levels and actionable investment strategies across multiple time horizons.

PaperID: 8557, https://arxiv.org/pdf/2509.17360.pdf

Authors: Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li

Title: Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching

Abstract:
Large Language Model (LLM) agents tackle data‑intensive tasks such as deep research and code generation. However, their effectiveness depends on frequent interactions with knowledge sources across remote clouds or regions. Such interactions can create non‑trivial latency and cost bottlenecks. Existing caching solutions focus on exact‑match queries, limiting their effectiveness for semantic knowledge reuse. To address this challenge, we introduce Cortex, a novel cross‑region knowledge caching architecture for LLM agents. At its core are two abstractions: Semantic Element (SE) and Semantic Retrieval Index (Seri). A semantic element captures the semantic embedding representation of an LLM query together with performance‑aware metadata such as latency, cost, and staticity. Seri then provides two‑stage retrieval: a vector similar index with semantic embedding for fast candidate selection and a lightweight LLM‑powered semantic judger for precise validation. Atop these primitives, Cortex builds a new cache interface that includes a new semantic‑aware cache hit definition, a cost‑efficient eviction policy, and proactive prefetching. To reduce overhead, Cortex co‑locates the small LLM judger with the main LLM using adaptive scheduling and resource sharing. Our evaluation demonstrates that Cortex delivers substantial performance improvements without compromising correctness. On representative search workloads, Cortex achieves up to a 3.6x increase in throughput by maintaining cache hit rates of over 85%, while preserving accuracy virtually identical to non‑cached baselines. Cortex also improves throughput for coding tasks by 20%, showcasing its versatility across diverse agentic workloads.

PaperID: 8558, https://arxiv.org/pdf/2509.17353.pdf

Authors: Ahmed T. Elboardy, Ghada Khoriba, Essam A. Rashed

Title: Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation

Abstract:
Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi‑agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine‑grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using chatGPT‑4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance‑based radiology report generation.

PaperID: 8559, https://arxiv.org/pdf/2509.17350.pdf

Authors: Haoran Zhou, Yangwei You, Shuaijun Wang

Title: DyDexHandover: Human-like Bimanual Dynamic Dexterous Handover using RGB-only Perception

Abstract:
Dynamic in air handover is a fundamental challenge for dual‑arm robots, requiring accurate perception, precise coordination, and natural motion. Prior methods often rely on dynamics models, strong priors, or depth sensing, limiting generalization and naturalness. We present DyDexHandover, a novel framework that employs multi‑agent reinforcement learning to train an end to end RGB based policy for bimanual object throwing and catching. To achieve more human‑like behavior, the throwing policy is guided by a human policy regularization scheme, encouraging fluid and natural motion, and enhancing the generalization capability of the policy. A dual arm simulation environment was built in Isaac Sim for experimental evaluation. DyDexHandover achieves nearly 99 percent success on training objects and 75 percent on unseen objects, while generating human‑like throwing and catching behaviors. To our knowledge, it is the first method to realize dual‑arm in‑air handover using only raw RGB perception.

PaperID: 8560, https://arxiv.org/pdf/2509.17336.pdf

Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang

Title: Mano Technical Report

Abstract:
Graphical user interfaces (GUIs) are the primary medium for human‑computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi‑step reasoning. Existing methods based on vision‑language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi‑modal foundation model pre‑trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high‑fidelity data generation, a three‑stage training pipeline (supervised fine‑tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state‑of‑the‑art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain‑specific data, iterative training, and holistic reward design.

PaperID: 8561, https://arxiv.org/pdf/2509.17244.pdf

Authors: Frederic Vatnsdal, Romina Garcia Camargo, Saurav Agarwal, Alejandro Ribeiro

Title: Scalable Multi Agent Diffusion Policies for Coverage Control

Abstract:
We propose MADP, a novel diffusion‑model‑based approach for collaboration in decentralized robot swarms. MADP leverages diffusion models to generate samples from complex and high‑dimensional action distributions that capture the interdependencies between agents' actions. Each robot conditions policy sampling on a fused representation of its own observations and perceptual embeddings received from peers. To evaluate this approach, we task a team of holonomic robots piloted by MADP to address coverage control‑a canonical multi agent navigation problem. The policy is trained via imitation learning from a clairvoyant expert on the coverage control problem, with the diffusion process parameterized by a spatial transformer architecture to enable decentralized inference. We evaluate the system under varying numbers, locations, and variances of importance density functions, capturing the robustness demands of real‑world coverage tasks. Experiments demonstrate that our model inherits valuable properties from diffusion models, generalizing across agent densities and environments, and consistently outperforming state‑of‑the‑art baselines.

PaperID: 8562, https://arxiv.org/pdf/2509.17240.pdf

Authors: Abdullah Mushtaq, Muhammad Rafay Naeem, Ibrahim Ghaznavi, Alaa Abd-alrazaq, Aliya Tabassum, Junaid Qadir

Title: Can Agents Judge Systematic Reviews Like Humans? Evaluating SLRs with LLM-based Multi-Agent System

Abstract:
Systematic Literature Reviews (SLRs) are foundational to evidence‑based research but remain labor‑intensive and prone to inconsistency across disciplines. We present an LLM‑based SLR evaluation copilot built on a Multi‑Agent System (MAS) architecture to assist researchers in assessing the overall quality of the systematic literature reviews. The system automates protocol validation, methodological assessment, and topic relevance checks using a scholarly database. Unlike conventional single‑agent methods, our design integrates a specialized agentic approach aligned with PRISMA guidelines to support more structured and interpretable evaluations. We conducted an initial study on five published SLRs from diverse domains, comparing system outputs to expert‑annotated PRISMA scores, and observed 84% agreement. While early results are promising, this work represents a first step toward scalable and accurate NLP‑driven systems for interdisciplinary workflows and reveals their capacity for rigorous, domain‑agnostic knowledge aggregation to streamline the review process.

PaperID: 8563, https://arxiv.org/pdf/2509.17197.pdf

Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang

Title: SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing

Abstract:
Modern signal processing (SP) pipelines, whether model‑based or data‑driven, often constrained by complex and fragmented workflow, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general‑purpose knowledge, in‑context learning, and cross‑modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general‑purpose LLM‑based agent framework for general SP tasks. Unlike prior LLM‑based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high‑level SP goals into structured subtasks via in‑context learning and domain‑specific retrieval, followed by hierarchical planning through adaptive retrieval‑augmented generation (RAG) and refinement; these subtasks are then executed through prompt‑based reasoning, cross‑modal reasoning, code synthesis, model invocation, or data‑driven LLM‑assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM‑based methods, particularly in few‑shot and zero‑shot settings.

PaperID: 8564, https://arxiv.org/pdf/2509.17195.pdf

Authors: Damian Owerko, Frederic Vatnsdal, Saurav Agarwal, Vijay Kumar, Alejandro Ribeiro

Title: MAST: Multi-Agent Spatial Transformer for Learning to Collaborate

Abstract:
This article presents a novel multi‑agent spatial transformer (MAST) for learning communication policies in large‑scale decentralized and collaborative multi‑robot systems (DC‑MRS). Challenges in collaboration in DC‑MRS arise from: (i) partial observable states as robots make only localized perception, (ii) limited communication range with no central server, and (iii) independent execution of actions. The robots need to optimize a common task‑specific objective, which, under the restricted setting, must be done using a communication policy that exhibits the desired collaborative behavior. The proposed MAST is a decentralized transformer architecture that learns communication policies to compute abstract information to be shared with other agents and processes the received information with the robot's own observations. The MAST extends the standard transformer with new positional encoding strategies and attention operations that employ windowing to limit the receptive field for MRS. These are designed for local computation, shift‑equivariance, and permutation equivariance, making it a promising approach for DC‑MRS. We demonstrate the efficacy of MAST on decentralized assignment and navigation (DAN) and decentralized coverage control. Efficiently trained using imitation learning in a centralized setting, the decentralized MAST policy is robust to communication delays, scales to large teams, and performs better than the baselines and other learning‑based approaches.

PaperID: 8565, https://arxiv.org/pdf/2509.17167.pdf

Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Joseph Skrovan, Mehak Beri, Hitakshi Modi, Andrew Well, Liu Leqi, Mia Markey, Ying Ding

Title: SFT-TA: Supervised Fine-Tuned Agents in Multi-Agent LLMs for Automated Inductive Thematic Analysis

Abstract:
Thematic Analysis (TA) is a widely used qualitative method that provides a structured yet flexible framework for identifying and reporting patterns in clinical interview transcripts. However, manual thematic analysis is time‑consuming and limits scalability. Recent advances in LLMs offer a pathway to automate thematic analysis, but alignment with human results remains limited. To address these limitations, we propose SFT‑TA, an automated thematic analysis framework that embeds supervised fine‑tuned (SFT) agents within a multi‑agent system. Our framework outperforms existing frameworks and the gpt‑4o baseline in alignment with human reference themes. We observed that SFT agents alone may underperform, but achieve better results than the baseline when embedded within a multi‑agent system. Our results highlight that embedding SFT agents in specific roles within a multi‑agent system is a promising pathway to improve alignment with desired outputs for thematic analysis.

PaperID: 8566, https://arxiv.org/pdf/2509.17080.pdf

Authors: Ruiguo Zhong, Ruoyu Yao, Pei Liu, Xiaolong Chen, Rui Yang, Jun Ma

Title: CoPlanner: An Interactive Motion Planner with Contingency-Aware Diffusion for Autonomous Driving

Abstract:
Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation‑then‑evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a lack of fallback strategies that are vital for safety in rare but critical scenarios. Moreover, the usual decoupling of prediction and planning modules could result in socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. To address these challenges, we propose a contingency‑aware diffusion planner (CoPlanner), a unified framework that jointly models multi‑agent interactive trajectory generation and contingency‑aware motion planning. Specifically, the pivot‑conditioned diffusion mechanism anchors trajectory sampling on a validated, shared short‑term segment to preserve temporal consistency, while stochastically generating diverse long‑horizon branches that capture multimodal motion evolutions. In parallel, we design a contingency‑aware multi‑scenario scoring strategy that evaluates candidate ego trajectories across multiple plausible long‑horizon evolution scenarios, balancing safety, progress, and comfort. This integrated design preserves feasible fallback options and enhances robustness under uncertainty, leading to more realistic interaction‑aware planning. Extensive closed‑loop experiments on the nuPlan benchmark demonstrate that CoPlanner consistently surpasses state‑of‑the‑art methods on both Val14 and Test14 datasets, achieving significant improvements in safety and comfort under both reactive and non‑reactive settings. Code and model will be made publicly available upon acceptance.

PaperID: 8567, https://arxiv.org/pdf/2509.17042.pdf

Authors: Zengqi Peng, Yusen Xie, Yubin Wang, Rui Yang, Qifeng Chen, Jun Ma

Title: Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning

Abstract:
The advancement of foundation models fosters new initiatives for policy learning in achieving safe and efficient autonomous driving. However, a critical bottleneck lies in the manual engineering of reward functions and training curricula for complex and dynamic driving tasks, which is a labor‑intensive and time‑consuming process. To address this problem, we propose OGR (Orchestrate, Generate, Reflect), a novel automated driving policy learning framework that leverages vision‑language model (VLM)‑based multi‑agent collaboration. Our framework capitalizes on advanced reasoning and multimodal understanding capabilities of VLMs to construct a hierarchical agent system. Specifically, a centralized orchestrator plans high‑level training objectives, while a generation module employs a two‑step analyze‑then‑generate process for efficient generation of reward‑curriculum pairs. A reflection module then facilitates iterative optimization based on the online evaluation. Furthermore, a dedicated memory module endows the VLM agents with the capabilities of long‑term memory. To enhance robustness and diversity of the generation process, we introduce a parallel generation scheme and a human‑in‑the‑loop technique for augmentation of the reward observation space. Through efficient multi‑agent cooperation and leveraging rich multimodal information, OGR enables the online evolution of reinforcement learning policies to acquire interaction‑aware driving skills. Extensive experiments in the CARLA simulator demonstrate the superior performance, robust generalizability across distinct urban scenarios, and strong compatibility with various RL algorithms. Further real‑world experiments highlight the practical viability and effectiveness of our framework. The source code will be available upon acceptance of the paper.

PaperID: 8568, https://arxiv.org/pdf/2509.16952.pdf

Authors: Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, Kai Yu

Title: AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

Abstract:
The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language models (LLMs) based agents are capable of automating question answering (QA) workflows for scientific papers, there still lacks a comprehensive and realistic benchmark to evaluate their capabilities. Moreover, training an interactive agent for this specific task is hindered by the shortage of high‑quality interaction trajectories. In this work, we propose AirQA, a human‑annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,956 papers and 1,246 questions, that encompasses multi‑task, multi‑modal and instance‑level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM‑based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open‑source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi‑turn tool‑use capability of small models, enabling them to achieve performance comparable to larger ones.

PaperID: 8569, https://arxiv.org/pdf/2509.16924.pdf

Authors: Jia Li, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

Title: Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation

Abstract:
In audio‑visual navigation (AVN) tasks, an embodied agent must autonomously localize a sound source in unknown and complex 3D environments based on audio‑visual signals. Existing methods often rely on static modality fusion strategies and neglect the spatial cues embedded in stereo audio, leading to performance degradation in cluttered or occluded scenes. To address these issues, we propose an end‑to‑end reinforcement learning‑based AVN framework with two key innovations: (1) a Stereo‑Aware Attention Module (SAM), which learns and exploits the spatial disparity between left and right audio channels to enhance directional sound perception; and (2) an Audio‑Guided Dynamic Fusion Module (AGDF), which dynamically adjusts the fusion ratio between visual and auditory features based on audio cues, thereby improving robustness to environmental changes. Extensive experiments are conducted on two realistic 3D scene datasets, Replica and Matterport3D, demonstrating that our method significantly outperforms existing approaches in terms of navigation success rate and path efficiency. Notably, our model achieves over 40% improvement under audio‑only conditions compared to the best‑performing baselines. These results highlight the importance of explicitly modeling spatial cues from stereo channels and performing deep multi‑modal fusion for robust and efficient audio‑visual navigation.

PaperID: 8570, https://arxiv.org/pdf/2509.16912.pdf

Authors: Shuto Endo, Takanobu Mizuta, Isao Yagi

Title: Analysis of the Impact of an Execution Algorithm with an Order Book Imbalance Strategy on a Financial Market Using an Agent-based Simulation

Abstract:
Order book imbalance (OBI) ‑ buy orders minus sell orders near the best quote ‑ measures supply‑demand imbalance that can move prices. OBI is positively correlated with returns, and some investors try to use it to improve performance. Large orders placed at once can reveal intent, invite front‑running, raise volatility, and cause losses. Execution algorithms therefore split parent orders into smaller lots to limit price distortion. In principle, using OBI inside such algorithms could improve execution, but prior evidence is scarce because isolating OBI's effect in real markets is nearly impossible amid many external factors. Multi‑agent simulation offers a way to study this. In an artificial market, individual actors are agents whose rules and interactions form the model. This study builds an execution algorithm that accounts for OBI, tests it across several market patterns in artificial markets, and analyzes mechanisms, comparing it with a conventional (OBI‑agnostic) algorithm. Results: (i) In stable markets, the OBI strategy's performance depends on the number of order slices; outcomes vary with how the parent order is partitioned. (ii) In markets with unstable prices, the OBI‑based algorithm outperforms the conventional approach. (iii) Under spoofing manipulation, the OBI strategy is not significantly worse than the conventional algorithm, indicating limited vulnerability to spoofing. Overall, OBI provides a useful signal for execution. Incorporating OBI can add value ‑ especially in volatile conditions ‑ while remaining reasonably robust to spoofing; in calm markets, benefits are sensitive to slicing design.

PaperID: 8571, https://arxiv.org/pdf/2509.16858.pdf

Authors: Soon Jynn Chu, Raju Gottumukkala, Alan Barhorst

Title: Towards an Adaptive Social Game-Playing Robot: An Offline Reinforcement Learning-Based Framework

Abstract:
HRI research increasingly demands robots that go beyond task execution to respond meaningfully to user emotions. This is especially needed when supporting students with learning difficulties in game‑based learning scenarios. Here, the objective of these robots is to train users with game‑playing skills, and this requires robots to get input about users' interests and engagement. In this paper, we present a system for an adaptive social game‑playing robot. However, creating such an agent through online RL requires extensive real‑world training data and potentially be uncomfortable for users. To address this, we investigate offline RL as a safe and efficient alternative. We introduce a system architecture that integrates multimodal emotion recognition and adaptive robotic responses. We also evaluate the performance of various offline RL algorithms using a dataset collected from a real‑world human‑robot game‑playing scenario. Our results indicate that BCQ and DDQN offer the greatest robustness to hyperparameter variations, whereas CQL is the most effective at mitigating overestimation bias. Through this research, we aim to inform the selection and design of reliable offline RL policies for real‑world social robotics. Ultimately, this work provides a foundational step toward creating socially intelligent agents that can learn complex and emotion‑adaptive behaviors entirely from offline datasets, ensuring both human comfort and practical scalability.

PaperID: 8572, https://arxiv.org/pdf/2509.16784.pdf

Authors: Mohammed Al Owayyed, Adarsh Denga, Willem-Paul Brinkman

Title: Controlled Yet Natural: A Hybrid BDI-LLM Conversational Agent for Child Helpline Training

Abstract:
Child helpline training often relies on human‑led roleplay, which is both time‑ and resource‑consuming. To address this, rule‑based interactive agent simulations have been proposed to provide a structured training experience for new counsellors. However, these agents might suffer from limited language understanding and response variety. To overcome these limitations, we present a hybrid interactive agent that integrates Large Language Models (LLMs) into a rule‑based Belief‑Desire‑Intention (BDI) framework, simulating more realistic virtual child chat conversations. This hybrid solution incorporates LLMs into three components: intent recognition, response generation, and a bypass mechanism. We evaluated the system through two studies: a script‑based assessment comparing LLM‑generated responses to human‑crafted responses, and a within‑subject experiment (N=37) comparing the LLM‑integrated agent with a rule‑based version. The first study provided evidence that the three LLM components were non‑inferior to human‑crafted responses. In the second study, we found credible support for two hypotheses: participants perceived the LLM‑integrated agent as more believable and reported more positive attitudes toward it than the rule‑based agent. Additionally, although weaker, there was some support for increased engagement (posterior probability = 0.845, 95% HDI [‑0.149, 0.465]). Our findings demonstrate the potential of integrating LLMs into rule‑based systems, offering a promising direction for more flexible but controlled training systems.

PaperID: 8573, https://arxiv.org/pdf/2509.16736.pdf

Authors: Minfeng Qi, Tianqing Zhu, Lefeng Zhang, Ningran Li, Wanlei Zhou

Title: Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems: A Blockchain-Driven Approach

Abstract:
Large Language Models (LLMs) have enabled the emergence of autonomous agents capable of complex reasoning, planning, and interaction. However, coordinating such agents at scale remains a fundamental challenge, particularly in decentralized environments where communication lacks transparency and agent behavior cannot be shaped through centralized incentives. We propose a blockchain‑based framework that enables transparent agent registration, verifiable task allocation, and dynamic reputation tracking through smart contracts. The core of our design lies in two mechanisms: a matching score‑based task allocation protocol that evaluates agents by reputation, capability match, and workload; and a behavior‑shaping incentive mechanism that adjusts agent behavior via feedback on performance and reward. Our implementation integrates GPT‑4 agents with Solidity contracts and demonstrates, through 50‑round simulations, strong task success rates, stable utility distribution, and emergent agent specialization. The results underscore the potential for trustworthy, incentive‑compatible multi‑agent coordination in open environments.

PaperID: 8574, https://arxiv.org/pdf/2509.16713.pdf

Authors: Tianyang Xu, Hongqiu Wu, Weiqi Wu, Hai Zhao

Title: OPEN-THEATRE: An Open-Source Toolkit for LLM-based Interactive Drama

Abstract:
LLM‑based Interactive Drama introduces a novel dialogue scenario in which the player immerses into a character and engages in a dramatic story by interacting with LLM agents. Despite the fact that this emerging area holds significant promise, it remains largely underexplored due to the lack of a well‑designed playground to develop a complete drama. This makes a significant barrier for researchers to replicate, extend, and study such systems. Hence, we present Open‑Theatre, the first open‑source toolkit for experiencing and customizing LLM‑based interactive drama. It refines prior work with an efficient multi‑agent architecture and a hierarchical retrieval‑based memory system, designed to enhance narrative coherence and realistic long‑term behavior in complex interactions. In addition, we provide a highly configurable pipeline, making it easy for researchers to develop and optimize new approaches.

PaperID: 8575, https://arxiv.org/pdf/2509.16709.pdf

Authors: Nicolò Botteghi, Matteo Tomasetto, Urban Fasel, Francesco Braghin, Andrea Manzoni

Title: HypeMARL: Multi-Agent Reinforcement Learning For High-Dimensional, Parametric, and Distributed Systems

Abstract:
Deep reinforcement learning has recently emerged as a promising feedback control strategy for complex dynamical systems governed by partial differential equations (PDEs). When dealing with distributed, high‑dimensional problems in state and control variables, multi‑agent reinforcement learning (MARL) has been proposed as a scalable approach for breaking the curse of dimensionality. In particular, through decentralized training and execution, multiple agents cooperate to steer the system towards a target configuration, relying solely on local state and reward information. However, the principle of locality may become a limiting factor whenever a collective, nonlocal behavior of the agents is crucial to maximize the reward function, as typically happens in PDE‑constrained optimal control problems. In this work, we propose HypeMARL: a decentralized MARL algorithm tailored to the control of high‑dimensional, parametric, and distributed systems. HypeMARL employs hypernetworks to effectively parametrize the agents' policies and value functions with respect to the system parameters and the agents' relative positions, encoded by sinusoidal positional encoding. Through the application on challenging control problems, such as density and flow control, we show that HypeMARL (i) can effectively control systems through a collective behavior of the agents, outperforming state‑of‑the‑art decentralized MARL, (ii) can efficiently deal with parametric dependencies, (iii) requires minimal hyperparameter tuning and (iv) can reduce the amount of expensive environment interactions by a factor of ~10 thanks to its model‑based extension, MB‑HypeMARL, which relies on computationally efficient deep learning‑based surrogate models approximating the dynamics locally, with minimal deterioration of the policy performance.

PaperID: 8576, https://arxiv.org/pdf/2509.16606.pdf

Authors: Wei Duan, Jie Lu, Junyu Xuan

Title: Bayesian Ego-graph Inference for Networked Multi-Agent Reinforcement Learning

Abstract:
In networked multi‑agent reinforcement learning (Networked‑MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real‑world decentralized systems. We propose a stochastic graph‑based policy for Networked‑MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor‑framework that learns sparse, context‑aware interaction structures via Bayesian variational inference. Each agent operates over an ego‑graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end‑to‑end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision‑making strategies. BayesG outperforms strong MARL baselines on large‑scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.

PaperID: 8577, https://arxiv.org/pdf/2509.16554.pdf

Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever

Title: ViTCAE: ViT-based Class-conditioned Autoencoder

Abstract:
Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re‑purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch‑level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence‑aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head‑freezing mechanism, guided by theoretically‑grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi‑agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer‑based generation.

PaperID: 8578, https://arxiv.org/pdf/2509.16494.pdf

Authors: Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu

Title: Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Abstract:
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi‑agent systems exhibit enhanced decision‑making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi‑agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M‑Spoiler, a framework that simulates agent interactions within a multi‑agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision‑making process. More specifically, M‑Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi‑agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.

PaperID: 8579, https://arxiv.org/pdf/2509.16482.pdf

Authors: Pranav Tiwari, Soumyodipta Nath

Title: Robot Conga: A Leader-Follower Walking Approach to Sequential Path Following in Multi-Agent Systems

Abstract:
Coordinated path following in multi‑agent systems is a key challenge in robotics, with applications in automated logistics, surveillance, and collaborative exploration. Traditional formation control techniques often rely on time‑parameterized trajectories and path integrals, which can result in synchronization issues and rigid behavior. In this work, we address the problem of sequential path following, where agents maintain fixed spatial separation along a common trajectory, guided by a leader under centralized control. We introduce Robot Conga, a leader‑follower control strategy that updates each agent's desired state based on the leader's spatial displacement rather than time, assuming access to a global position reference, an assumption valid in indoor environments equipped with motion capture, vision‑based tracking, or UWB localization systems. The algorithm was validated in simulation using both TurtleBot3 and quadruped (Laikago) robots. Results demonstrate accurate trajectory tracking, stable inter‑agent spacing, and fast convergence, with all agents aligning within 250 time steps (approx. 0.25 seconds) in the quadruped case, and almost instantaneously in the TurtleBot3 implementation.

PaperID: 8580, https://arxiv.org/pdf/2509.16456.pdf

Authors: Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

Title: GPO: Learning from Critical Steps to Improve LLM Reasoning

Abstract:
Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi‑step reasoning capabilities of LLMs still remains a significant challenge. While existing optimization methods have advanced the LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce Guided Pivotal Optimization (GPO), a novel fine‑tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the `critical step' within a reasoning trajectory ‑ a point that the model must carefully proceed to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples the new rollout and prioritizes the learning process on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve the reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.

PaperID: 8581, https://arxiv.org/pdf/2509.16454.pdf

Authors: Devin Lange, Shanghua Gao, Pengwei Sui, Austen Money, Priya Misner, Marinka Zitnik, Nils Gehlenborg

Title: A Generative AI System for Biomedical Data Discovery with Grammar-Based Visualizations

Abstract:
We explore the potential for combining generative AI with grammar‑based visualizations for biomedical data discovery. In our prototype, we use a multi‑agent system to generate visualization specifications and apply filters. These visualizations are linked together, resulting in an interactive dashboard that is progressively constructed. Our system leverages the strengths of natural language while maintaining the utility of traditional user interfaces. Furthermore, we utilize generated interactive widgets enabling user adjustment. Finally, we demonstrate the potential utility of this system for biomedical data discovery with a case study.

PaperID: 8582, https://arxiv.org/pdf/2509.16394.pdf

Authors: Deuksin Kwon, Kaleen Shrestha, Bin Han, Elena Hayoung Lee, Gale Lucas

Title: Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans

Abstract:
Large Language Models (LLMs) are increasingly deployed in socially complex, interaction‑driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality‑prompted LLMs in adversarial dispute resolution by simulating multi‑turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five‑Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT‑4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude‑3.7‑Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.

PaperID: 8583, https://arxiv.org/pdf/2509.16325.pdf

Authors: Andrew Zhu, Chris Callison-Burch

Title: Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap

Abstract:
Imagine AI assistants that enhance conversations without interrupting them: quietly providing relevant information during a medical consultation, seamlessly preparing materials as teachers discuss lesson plans, or unobtrusively scheduling meetings as colleagues debate calendars. While modern conversational LLM agents directly assist human users with tasks through a chat interface, we study this alternative paradigm for interacting with LLM agents, which we call "overhearing agents." Rather than demanding the user's attention, overhearing agents continuously monitor ambient activity and intervene only when they can provide contextual assistance. In this paper, we present the first analysis of overhearing LLM agents as a distinct paradigm in human‑AI interaction and establish a taxonomy of overhearing agent interactions and tasks grounded in a survey of works on prior LLM‑powered agents and exploratory HCI studies. Based on this taxonomy, we create a list of best practices for researchers and developers building overhearing agent systems. Finally, we outline the remaining research gaps and reveal opportunities for future research in the overhearing paradigm.

PaperID: 8584, https://arxiv.org/pdf/2509.16296.pdf

Authors: Jun He, Andrew L. Liu, Yihsu Chen

Title: Learning in Stackelberg Markov Games

Abstract:
Designing socially optimal policies in multi‑agent environments is a fundamental challenge in both economics and artificial intelligence. This paper studies a general framework for learning Stackelberg equilibria in dynamic and uncertain environments, where a single leader interacts with a population of adaptive followers. Motivated by pressing real‑world challenges such as equitable electricity tariff design for consumers with distributed energy resources (such as rooftop solar and energy storage), we formalize a class of Stackelberg Markov games and establish the existence and uniqueness of stationary Stackelberg equilibria under mild continuity and monotonicity conditions. We then extend the framework to incorporate a continuum of agents via mean‑field approximation, yielding a tractable Stackelberg‑Mean Field Equilibrium (S‑MFE) formulation. To address the computational intractability of exact best‑response dynamics, we introduce a softmax‑based approximation and rigorously bound its error relative to the true Stackelberg equilibrium. Our approach enables scalable and stable learning through policy iteration without requiring full knowledge of follower objectives. We validate the framework on an energy market simulation, where a public utility or a state utility commission sets time‑varying rates for a heterogeneous population of prosumers. Our results demonstrate that learned policies can simultaneously achieve economic efficiency, equity across income groups, and stability in energy systems. This work demonstrates how game‑theoretic learning frameworks can support data‑driven policy design in large‑scale strategic environments, with applications to real‑world systems like energy markets.

PaperID: 8585, https://arxiv.org/pdf/2509.16275.pdf

Authors: Jugal Gajjar, Kamalasankari Subramaniakuppusamy, Relsy Puthal, Kaustik Ranaware

Title: SecureFixAgent: A Hybrid LLM Agent for Automated Python Static Vulnerability Repair

Abstract:
Modern software development pipelines face growing challenges in securing large codebases with extensive dependencies. Static analysis tools like Bandit are effective at vulnerability detection but suffer from high false positives and lack repair capabilities. Large Language Models (LLMs), in contrast, can suggest fixes but often hallucinate changes and lack self‑validation. We present SecureFixAgent, a hybrid repair framework integrating Bandit with lightweight local LLMs (<8B parameters) in an iterative detect‑repair‑validate loop. To improve precision, we apply parameter‑efficient LoRA‑based fine‑tuning on a diverse, curated dataset spanning multiple Python project domains, mitigating dataset bias and reducing unnecessary edits. SecureFixAgent uses Bandit for detection, the LLM for candidate fixes with explanations, and Bandit re‑validation for verification, all executed locally to preserve privacy and reduce cloud reliance. Experiments show SecureFixAgent reduces false positives by 10.8% over static analysis, improves fix accuracy by 13.51%, and lowers false positives by 5.46% compared to pre‑trained LLMs, typically converging within three iterations. Beyond metrics, developer studies rate explanation quality 4.5/5, highlighting its value for human trust and adoption. By combining verifiable security improvements with transparent rationale in a resource‑efficient local framework, SecureFixAgent advances trustworthy, automated vulnerability remediation for modern pipelines.

PaperID: 8586, https://arxiv.org/pdf/2509.16267.pdf

Authors: Victor V. Puche, Kashish Verma, Matteo Fumagalli

Title: Underground Multi-robot Systems at Work: a revolution in mining

Abstract:
The growing global demand for critical raw materials (CRMs) has highlighted the need to access difficult and hazardous environments such as abandoned underground mines. These sites pose significant challenges for conventional machinery and human operators due to confined spaces, structural instability, and lack of infrastructure. To address this, we propose a modular multi‑robot system designed for autonomous operation in such environments, enabling sequential mineral extraction tasks. Unlike existing work that focuses primarily on mapping and inspection through global behavior or central control, our approach incorporates physical interaction capabilities using specialized robots coordinated through local high‑level behavior control. Our proposed system utilizes Hierarchical Finite State Machine (HFSM) behaviors to structure complex task execution across heterogeneous robotic platforms. Each robot has its own HFSM behavior to perform sequential autonomy while maintaining overall system coordination, achieved by triggering behavior execution through inter‑robot communication. This architecture effectively integrates software and hardware components to support collaborative, task‑driven multi‑robot operation in confined underground environments.

PaperID: 8587, https://arxiv.org/pdf/2509.16212.pdf

Authors: Ahmad Maroof Karimi, Woong Shin, Jesse Hines, Tirthankar Ghosal, Naw Safrin Sattar, Feiyi Wang

Title: EPIC: Generative AI Platform for Accelerating HPC Operational Data Analytics

Abstract:
We present EPIC, an AI‑driven platform designed to augment operational data analytics. EPIC employs a hierarchical multi‑agent architecture where a top‑level large language model provides query processing, reasoning and synthesis capabilities. These capabilities orchestrate three specialized low‑level agents for information retrieval, descriptive analytics, and predictive analytics. This architecture enables EPIC to perform HPC operational analytics on multi‑modal data, including text, images, and tabular formats, dynamically and iteratively. EPIC addresses the limitations of existing HPC operational analytics approaches, which rely on static methods that struggle to adapt to evolving analytics tasks and stakeholder demands. Through extensive evaluations on the Frontier HPC system, we demonstrate that EPIC effectively handles complex queries. Using descriptive analytics as a use case, fine‑tuned smaller models outperform large state‑of‑the‑art foundation models, achieving up to 26% higher accuracy. Additionally, we achieved 19x savings in LLM operational costs compared to proprietary solutions by employing a hybrid approach that combines large foundational models with fine‑tuned local open‑weight models.

PaperID: 8588, https://arxiv.org/pdf/2509.16187.pdf

Authors: Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand, Joey Dodds, Daniel Kroening

Title: MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair

Abstract:
Code translation transforms source code from one programming language (PL) to another. Validating the functional equivalence of translation and repairing, if necessary, are critical steps in code translation. Existing automated validation and repair approaches struggle to generalize to many PLs due to high engineering overhead, and they rely on existing and often inadequate test suites, which results in false claims of equivalence and ineffective translation repair. We develop MatchFixAgent, a large language model (LLM)‑based, PL‑agnostic framework for equivalence validation and repair of translations. MatchFixAgent features a multi‑agent architecture that divides equivalence validation into several sub‑tasks to ensure thorough and consistent semantic analysis of the translation. Then it feeds this analysis to test agent to write and execute tests. Upon observing a test failure, the repair agent attempts to fix the translation bug. The final (in)equivalence decision is made by the verdict agent, considering semantic analyses and test execution results. We compare MatchFixAgent's validation and repair results with four repository‑level code translation techniques. We use 2,219 translation pairs from their artifacts, which cover 6 PL pairs, and are collected from 24 GitHub projects totaling over 900K lines of code. Our results demonstrate that MatchFixAgent produces (in)equivalence verdicts for 99.2% of translation pairs, with the same equivalence validation result as prior work on 72.8% of them. When MatchFixAgent's result disagrees with prior work, we find that 60.7% of the time MatchFixAgent's result is actually correct. In addition, we show that MatchFixAgent can repair 50.6% of inequivalent translation, compared to prior work's 18.5%. This demonstrates that MatchFixAgent is far more adaptable to many PL pairs than prior work, while producing highly accurate validation results.

PaperID: 8589, https://arxiv.org/pdf/2509.16151.pdf

Authors: Isaiah J. King, Benjamin Bowman, H. Howie Huang

Title: Automated Cyber Defense with Generalizable Graph-based Reinforcement Learning Agents

Abstract:
Deep reinforcement learning (RL) is emerging as a viable strategy for automated cyber defense (ACD). The traditional RL approach represents networks as a list of computers in various states of safety or threat. Unfortunately, these models are forced to overfit to specific network topologies, rendering them ineffective when faced with even small environmental perturbations. In this work, we frame ACD as a two‑player context‑based partially observable Markov decision problem with observations represented as attributed graphs. This approach allows our agents to reason through the lens of relational inductive bias. Agents learn how to reason about hosts interacting with other system entities in a more general manner, and their actions are understood as edits to the graph representing the environment. By introducing this bias, we will show that our agents can better reason about the states of networks and zero‑shot adapt to new ones. We show that this approach outperforms the state‑of‑the‑art by a wide margin, and makes our agents capable of defending never‑before‑seen networks against a wide range of adversaries in a variety of complex, and multi‑agent environments.

PaperID: 8590, https://arxiv.org/pdf/2509.16095.pdf

Authors: Yi Xu, Yun Fu

Title: AdaSports-Traj: Role- and Domain-Aware Adaptation for Multi-Agent Trajectory Modeling in Sports

Abstract:
Trajectory prediction in multi‑agent sports scenarios is inherently challenging due to the structural heterogeneity across agent roles (e.g., players vs. ball) and dynamic distribution gaps across different sports domains. Existing unified frameworks often fail to capture these structured distributional shifts, resulting in suboptimal generalization across roles and domains. We propose AdaSports‑Traj, an adaptive trajectory modeling framework that explicitly addresses both intra‑domain and inter‑domain distribution discrepancies in sports. At its core, AdaSports‑Traj incorporates a Role‑ and Domain‑Aware Adapter to conditionally adjust latent representations based on agent identity and domain context. Additionally, we introduce a Hierarchical Contrastive Learning objective, which separately supervises role‑sensitive and domain‑aware representations to encourage disentangled latent structures without introducing optimization conflict. Experiments on three diverse sports datasets, Basketball‑U, Football‑U, and Soccer‑U, demonstrate the effectiveness of our adaptive design, achieving strong performance in both unified and cross‑domain trajectory prediction settings.

PaperID: 8591, https://arxiv.org/pdf/2509.16011.pdf

Authors: Xiwei Liu, Yulong Li, Yichen Li, Xinlin Zhuang, Haolin Yang, Huifa Li, Imran Razzak

Title: Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Abstract:
Language‑guided supervision, which utilizes a frozen semantic target from a Pretrained Language Model (PLM), has emerged as a promising paradigm for visual Continual Learning (CL). However, relying on a single target introduces two critical limitations: 1) semantic ambiguity, where a polysemous category name results in conflicting visual representations, and 2) intra‑class visual diversity, where a single prototype fails to capture the rich variety of visual appearances within a class. To this end, we propose MuproCL, a novel framework that replaces the single target with multiple, context‑aware prototypes. Specifically, we employ a lightweight LLM agent to perform category disambiguation and visual‑modal expansion to generate a robust set of semantic prototypes. A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image. Extensive experiments across various CL baselines demonstrate that MuproCL consistently enhances performance and robustness, establishing a more effective path for language‑guided continual learning.

PaperID: 8592, https://arxiv.org/pdf/2509.15830.pdf

Authors: Chuhao Qin, Arun Narayanan, Evangelos Pournaras

Title: Coordinated Multi-Drone Last-mile Delivery: Learning Strategies for Energy-aware and Timely Operations

Abstract:
Drones have recently emerged as a faster, safer, and cost‑efficient way for last‑mile deliveries of parcels, particularly for urgent medical deliveries highlighted during the pandemic. This paper addresses a new challenge of multi‑parcel delivery with a swarm of energy‑aware drones, accounting for time‑sensitive customer requirements. Each drone plans an optimal multi‑parcel route within its battery‑restricted flight range to minimize delivery delays and reduce energy consumption. The problem is tackled by decomposing it into three sub‑problems: (1) optimizing depot locations and service areas using K‑means clustering; (2) determining the optimal flight range for drones through reinforcement learning; and (3) planning and selecting multi‑parcel delivery routes via a new optimized plan selection approach. To integrate these solutions and enhance long‑term efficiency, we propose a novel algorithm leveraging actor‑critic‑based multi‑agent deep reinforcement learning. Extensive experimentation using realistic delivery datasets demonstrate an exceptional performance of the proposed algorithm. We provide new insights into economic efficiency (minimize energy consumption), rapid operations (reduce delivery delays and overall execution time), and strategic guidance on depot deployment for practical logistics applications.

PaperID: 8593, https://arxiv.org/pdf/2509.15799.pdf

Authors: Max Studt, Georg Schildbach

Title: Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control

Abstract:
Achieving safe and coordinated behavior in dynamic, constraint‑rich environments remains a major challenge for learning‑based control. Pure end‑to‑end learning often suffers from poor sample efficiency and limited reliability, while model‑based methods depend on predefined references and struggle to generalize. We propose a hierarchical framework that combines tactical decision‑making via reinforcement learning (RL) with low‑level execution through Model Predictive Control (MPC). For the case of multi‑agent systems this means that high‑level policies select abstract targets from structured regions of interest (ROIs), while MPC ensures dynamically feasible and safe motion. Tested on a predator‑prey benchmark, our approach outperforms end‑to‑end and shielding‑based RL baselines in terms of reward, safety, and consistency, underscoring the benefits of combining structured learning with model‑based control.

PaperID: 8594, https://arxiv.org/pdf/2509.15786.pdf

Authors: Nan Li, Bo Kang, Tijl De Bie

Title: Building Data-Driven Occupation Taxonomies: A Bottom-Up Multi-Stage Approach via Semantic Clustering and Multi-Agent Collaboration

Abstract:
Creating robust occupation taxonomies, vital for applications ranging from job recommendation to labor market intelligence, is challenging. Manual curation is slow, while existing automated methods are either not adaptive to dynamic regional markets (top‑down) or struggle to build coherent hierarchies from noisy data (bottom‑up). We introduce CLIMB (CLusterIng‑based Multi‑agent taxonomy Builder), a framework that fully automates the creation of high‑quality, data‑driven taxonomies from raw job postings. CLIMB uses global semantic clustering to distill core occupations, then employs a reflection‑based multi‑agent system to iteratively build a coherent hierarchy. On three diverse, real‑world datasets, we show that CLIMB produces taxonomies that are more coherent and scalable than existing methods and successfully capture unique regional characteristics. We release our code and datasets at https://anonymous.4open.science/r/CLIMB.

PaperID: 8595, https://arxiv.org/pdf/2509.15737.pdf

Authors: Heye Huang, Yibin Yang, Wang Chen, Tiantian Chen, Xiaopeng Li, Sikai Chen

Title: SMART: Scalable Multi-Agent Reasoning and Trajectory Planning in Dense Environments

Abstract:
Multi‑vehicle trajectory planning is a non‑convex problem that becomes increasingly difficult in dense environments due to the rapid growth of collision constraints. Efficient exploration of feasible behaviors and resolution of tight interactions are essential for real‑time, large‑scale coordination. This paper introduces SMART, Scalable Multi‑Agent Reasoning and Trajectory Planning, a hierarchical framework that combines priority‑based search with distributed optimization to achieve efficient and feasible multi‑vehicle planning. The upper layer explores diverse interaction modes using reinforcement learning‑based priority estimation and large‑step hybrid A search, while the lower layer refines solutions via parallelizable convex optimization. By partitioning space among neighboring vehicles and constructing robust feasible corridors, the method decouples the joint non‑convex problem into convex subproblems solved efficiently in parallel. This design alleviates the step‑size trade‑off while ensuring kinematic feasibility and collision avoidance. Experiments show that SMART consistently outperforms baselines. On 50 m x 50 m maps, it sustains over 90% success within 1 s up to 25 vehicles, while baselines often drop below 50%. On 100 m x 100 m maps, SMART achieves above 95% success up to 50 vehicles and remains feasible up to 90 vehicles, with runtimes more than an order of magnitude faster than optimization‑only approaches. Built on vehicle‑to‑everything communication, SMART incorporates vehicle‑infrastructure cooperation through roadside sensing and agent coordination, improving scalability and safety. Real‑world experiments further validate this design, achieving planning times as low as 0.014 s while preserving cooperative behaviors.

PaperID: 8596, https://arxiv.org/pdf/2509.15568.pdf

Authors: Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo

Title: LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Abstract:
High‑quality long‑context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance‑based aggregation face challenges of computational efficiency. We present LiteLong, a resource‑efficient method for synthesizing long‑context data through structured topic organization and multi‑agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high‑quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K‑token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long‑context performance and can seamlessly integrate with other long‑dependency enhancement methods. LiteLong makes high‑quality long‑context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long‑context language training.

PaperID: 8597, https://arxiv.org/pdf/2509.15519.pdf

Authors: Chao Li, Bingkun Bao, Yang Gao

Title: Fully Decentralized Cooperative Multi-Agent Reinforcement Learning is A Context Modeling Problem

Abstract:
This paper studies fully decentralized cooperative multi‑agent reinforcement learning, where each agent solely observes the states, its local actions, and the shared rewards. The inability to access other agents' actions often leads to non‑stationarity during value function updates and relative overgeneralization during value function estimation, hindering effective cooperative policy learning. However, existing works fail to address both issues simultaneously, due to their inability to model the joint policy of other agents in a fully decentralized setting. To overcome this limitation, we propose a novel method named Dynamics‑Aware Context (DAC), which formalizes the task, as locally perceived by each agent, as an Contextual Markov Decision Process, and further addresses both non‑stationarity and relative overgeneralization through dynamics‑aware context modeling. Specifically, DAC attributes the non‑stationary local task dynamics of each agent to switches between unobserved contexts, each corresponding to a distinct joint policy. Then, DAC models the step‑wise dynamics distribution using latent variables and refers to them as contexts. For each agent, DAC introduces a context‑based value function to address the non‑stationarity issue during value function update. For value function estimation, an optimistic marginal value is derived to promote the selection of cooperative actions, thereby addressing the relative overgeneralization issue. Experimentally, we evaluate DAC on various cooperative tasks (including matrix game, predator and prey, and SMAC), and its superior performance against multiple baselines validates its effectiveness.

PaperID: 8598, https://arxiv.org/pdf/2509.15518.pdf

Authors: Siyang Wu, Zhewei Sun

Title: How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages

Abstract:
Slang is a commonly used type of informal language that poses a daunting challenge to NLP systems. Recent advances in large language models (LLMs), however, have made the problem more approachable. While LLM agents are becoming more widely applied to intermediary tasks such as slang detection and slang interpretation, their generalizability and reliability are heavily dependent on whether these models have captured structural knowledge about slang that align well with human attested slang usages. To answer this question, we contribute a systematic comparison between human and machine‑generated slang usages. Our evaluative framework focuses on three core aspects: 1) Characteristics of the usages that reflect systematic biases in how machines perceive slang, 2) Creativity reflected by both lexical coinages and word reuses employed by the slang usages, and 3) Informativeness of the slang usages when used as gold‑standard examples for model distillation. By comparing human‑attested slang usages from the Online Slang Dictionary (OSD) and slang generated by GPT‑4o and Llama‑3, we find significant biases in how LLMs perceive slang. Our results suggest that while LLMs have captured significant knowledge about the creative aspects of slang, such knowledge does not align with humans sufficiently to enable LLMs for extrapolative tasks such as linguistic analyses.

PaperID: 8599, https://arxiv.org/pdf/2509.15513.pdf

Authors: Jungjin Lee, Jaeuk Shin, Gihwan Kim, Joonho Han, Insoon Yang

Title: KoopCast: Trajectory Forecasting via Koopman Operators

Abstract:
We present KoopCast, a lightweight yet efficient model for trajectory forecasting in general dynamic environments. Our approach leverages Koopman operator theory, which enables a linear representation of nonlinear dynamics by lifting trajectories into a higher‑dimensional space. The framework follows a two‑stage design: first, a probabilistic neural goal estimator predicts plausible long‑term targets, specifying where to go; second, a Koopman operator‑based refinement module incorporates intention and history into a nonlinear feature space, enabling linear prediction that dictates how to go. This dual structure not only ensures strong predictive accuracy but also inherits the favorable properties of linear operators while faithfully capturing nonlinear dynamics. As a result, our model offers three key advantages: (i) competitive accuracy, (ii) interpretability grounded in Koopman spectral theory, and (iii) low‑latency deployment. We validate these benefits on ETH/UCY, the Waymo Open Motion Dataset, and nuScenes, which feature rich multi‑agent interactions and map‑constrained nonlinear motion. Across benchmarks, KoopCast consistently delivers high predictive accuracy together with mode‑level interpretability and practical efficiency.

PaperID: 8600, https://arxiv.org/pdf/2509.15491.pdf

Authors: Reza Pirayeshshirazinezhad, Nima Fathi

Title: Explainable AI-Enhanced Supervisory Control for Robust Multi-Agent Robotic Systems

Abstract:
We present an explainable AI‑enhanced supervisory control framework for multi‑agent robotics that combines (i) a timed‑automata supervisor for safe, auditable mode switching, (ii) robust continuous control (Lyapunov‑based controller for large‑angle maneuver; sliding‑mode controller (SMC) with boundary layers for precision and disturbance rejection), and (iii) an explainable predictor that maps mission context to gains and expected performance (energy, error). Monte Carlo‑driven optimization provides the training data, enabling transparent real‑time trade‑offs. We validated the approach in two contrasting domains, spacecraft formation flying and autonomous underwater vehicles (AUVs). Despite different environments (gravity/actuator bias vs. hydrodynamic drag/currents), both share uncertain six degrees of freedom (6‑DOF) rigid‑body dynamics, relative motion, and tight tracking needs, making them representative of general robotic systems. In the space mission, the supervisory logic selects parameters that meet mission criteria. In AUV leader‑follower tests, the same SMC structure maintains a fixed offset under stochastic currents with bounded steady error. In spacecraft validation, the SMC controller achieved submillimeter alignment with 21.7% lower tracking error and 81.4% lower energy consumption compared to Proportional‑Derivative PD controller baselines. At the same time, in AUV tests, SMC maintained bounded errors under stochastic currents. These results highlight both the portability and the interpretability of the approach for safety‑critical, resource‑constrained multi‑agent robotics.

PaperID: 8601, https://arxiv.org/pdf/2509.15392.pdf

Authors: Sihan Zeng, Benjamin Patrick Evans, Sujay Bhatt, Leo Ardon, Sumitra Ganesh, Alec Koppel

Title: Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

Abstract:
We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi‑level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader's and followers' objectives, use samples inefficiently due to nested‑loop algorithm structure, and lack finite‑time convergence guarantees. To address these limitations, we propose AC‑SMFG, a single‑loop actor‑critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi‑)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite‑time and finite‑sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non‑asymptotic convergence guarantees. Our key assumption is a "gradient alignment" condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader‑follower independence assumption. Simulation results in a range of well‑established economics environments demonstrate that AC‑SMFG outperforms existing multi‑agent and MFG learning baselines in policy quality and convergence speed.

PaperID: 8602, https://arxiv.org/pdf/2509.15381.pdf

Authors: Tiannan Zhang, Rishi Veerapaneni, Shao-Hung Chan, Jiaoyang Li, Maxim Likhachev

Title: Dynamic Agent Grouping ECBS: Scaling Windowed Multi-Agent Path Finding with Completeness Guarantees

Abstract:
Multi‑Agent Path Finding (MAPF) is the problem of finding a set of collision‑free paths for a team of agents. Although several MAPF methods which solve full‑horizon MAPF have completeness guarantees, very few MAPF methods that plan partial paths have completeness guarantees. Recent work introduced the Windowed Complete MAPF (WinC‑MAPF) framework, which shows how windowed optimal MAPF solvers (e.g., SS‑CBS) can use heuristic updates and disjoint agent groups to maintain completeness even when planning partial paths (Veerapaneni et al. 2024). A core limitation of WinC‑MAPF is that they required optimal MAPF solvers. Our main contribution is to extend WinC‑MAPF by showing how we can use a bounded suboptimal solver while maintaining completeness. In particular, we design Dynamic Agent Grouping ECBS (DAG‑ECBS) which dynamically creates and plans agent groups while maintaining that each agent group solution is bounded suboptimal. We prove how DAG‑ECBS can maintain completeness in the WinC‑MAPF framework. DAG‑ECBS shows improved scalability compared to SS‑CBS and can outperform windowed ECBS without completeness guarantees. More broadly, our work serves as a blueprint for designing more MAPF methods that can use the WinC‑MAPF framework.

PaperID: 8603, https://arxiv.org/pdf/2509.15366.pdf

Authors: Andrejs Sorstkins, Josh Bailey, Dr Alistair Baron

Title: Diagnostics of cognitive failures in multi-agent expert systems using dynamic evaluation protocols and subsequent mutation of the processing context

Abstract:
The rapid evolution of neural architectures ‑ from multilayer perceptrons to large‑scale Transformer‑based models ‑ has enabled language models (LLMs) to exhibit emergent agentic behaviours when equipped with memory, planning, and external tool use. However, their inherent stochasticity and multi‑step decision processes render classical evaluation methods inadequate for diagnosing agentic performance. This work introduces a diagnostic framework for expert systems that not only evaluates but also facilitates the transfer of expert behaviour into LLM‑powered agents. The framework integrates (i) curated golden datasets of expert annotations, (ii) silver datasets generated through controlled behavioural mutation, and (iii) an LLM‑based Agent Judge that scores and prescribes targeted improvements. These prescriptions are embedded into a vectorized recommendation map, allowing expert interventions to propagate as reusable improvement trajectories across multiple system instances. We demonstrate the framework on a multi‑agent recruiter‑assistant system, showing that it uncovers latent cognitive failures ‑ such as biased phrasing, extraction drift, and tool misrouting ‑ while simultaneously steering agents toward expert‑level reasoning and style. The results establish a foundation for standardized, reproducible expert behaviour transfer in stochastic, tool‑augmented LLM agents, moving beyond static evaluation to active expert system refinement.

PaperID: 8604, https://arxiv.org/pdf/2509.15273.pdf

Authors: Fei Ni, Min Zhang, Pengyi Li, Yifu Yuan, Lingfeng Zhang, Yuecheng Liu, Peilong Han, Longxin Kou, Shaojin Ma, Jinbin Qiao, David Gamaliel Arcos Bravo, Yuening Wang, Xiao Hu, Zhanguang Zhang, Xianze Yao, Yutong Li, Zhao Zhang, Ying Wen, Ying-Cong Chen, Xiaodan Liang, Liang Lin, Bin He, Haitham Bou-Ammar, He Wang, Huazhe Xu, Jiankang Deng, Shan Luo, Shuqiang Jiang, Wei Pan, Yang Gao, Stefanos Zafeiriou, Jan Peters, Yuzheng Zhuang, Yingxue Zhang, Yan Zheng, Hongyao Tang, Jianye Hao

Title: Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI

Abstract:
Embodied AI development significantly lags behind large foundation models due to three critical challenges: (1) lack of systematic understanding of core capabilities needed for Embodied AI, making research lack clear objectives; (2) absence of unified and standardized evaluation systems, rendering cross‑benchmark evaluation infeasible; and (3) underdeveloped automated and scalable acquisition methods for embodied data, creating critical bottlenecks for model scaling. To address these obstacles, we present Embodied Arena, a comprehensive, unified, and evolving evaluation platform for Embodied AI. Our platform establishes a systematic embodied capability taxonomy spanning three levels (perception, reasoning, task execution), seven core capabilities, and 25 fine‑grained dimensions, enabling unified evaluation with systematic research objectives. We introduce a standardized evaluation system built upon unified infrastructure supporting flexible integration of 22 diverse benchmarks across three domains (2D/3D Embodied Q&A, Navigation, Task Planning) and 30+ advanced models from 20+ worldwide institutes. Additionally, we develop a novel LLM‑driven automated generation pipeline ensuring scalable embodied evaluation data with continuous evolution for diversity and comprehensiveness. Embodied Arena publishes three real‑time leaderboards (Embodied Q&A, Navigation, Task Planning) with dual perspectives (benchmark view and capability view), providing comprehensive overviews of advanced model capabilities. Especially, we present nine findings summarized from the evaluation results on the leaderboards of Embodied Arena. This helps to establish clear research veins and pinpoint critical research problems, thereby driving forward progress in the field of Embodied AI.

PaperID: 8605, https://arxiv.org/pdf/2509.15172.pdf

Authors: Ankur Samanta, Akshayaa Magesh, Runzhe Wu, Ayush Jain, Youliang Yu, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani

Title: Self-Improvement of Language Models by Post-Training on Multi-Agent Debate

Abstract:
Self‑improvement, where models improve beyond their current performance without external supervision, remains a challenge. The core difficulty is sourcing a training signal stronger than what the model itself can currently produce. Majority voting has been shown to provide such a signal by aggregating over multiple samples, helping mitigate some of the inconsistencies in LM reasoning. In this work, we show that multi‑agent debate‑‑where models collaborate and exchange reasoning over multiple rounds‑‑provides an even richer signal than single‑round majority voting. We introduce Multi‑Agent Consensus Alignment (MACA), which uses reinforcement learning (RL) to post‑train models to effectively utilize multi‑agent debate. We find that preference learning over full reasoning traces, learning to differentiate between majority and minority reasoning, is more effective than binary consensus rewards or SFT‑based approaches for leveraging these debate signals. This produces three key improvements: models are (1) better at utilizing the multi‑agent debate setting (+26.87% on MATH), (2) individually more accurate (+21.51% on MathQA), and (3) more self‑consistent (+27.6% on GSM8K). We also see strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA).

PaperID: 8606, https://arxiv.org/pdf/2509.15099.pdf

Authors: Taoyuan Yu, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi, Walid Saad

Title: Digital Twin-based Cooperative Autonomous Driving in Smart Intersections: A Multi-Agent Reinforcement Learning Approach

Abstract:
Unsignalized intersections pose safety and efficiency challenges due to complex traffic flows and blind spots. In this paper, a digital twin (DT)‑based cooperative driving system with roadside unit (RSU)‑centric architecture is proposed for enhancing safety and efficiency at unsignalized intersections. The system leverages comprehensive bird‑eye‑view (BEV) perception to eliminate blind spots and employs a hybrid reinforcement learning (RL) framework combining offline pre‑training with online fine‑tuning. Specifically, driving policies are initially trained using conservative Q‑learning (CQL) with behavior cloning (BC) on real datasets, then fine‑tuned using multi‑agent proximal policy optimization (MAPPO) with self‑attention mechanisms to handle dynamic multi‑agent coordination. The RSU implements real‑time commands via vehicle‑to‑infrastructure (V2I) communications. Experimental results show that the proposed method yields failure rates below 0.03% coordinating up to three connected autonomous vehicles (CAVs), significantly outperforming traditional methods. In addition, the system exhibits sub‑linear computational scaling with inference times under 40 ms. Furthermore, it demonstrates robust generalization across diverse unsignalized intersection scenarios, indicating its practicality and readiness for real‑world deployment.

PaperID: 8607, https://arxiv.org/pdf/2509.15042.pdf

Authors: Thomas Ackermann, Moritz Spang, Hamza A. A. Gardi

Title: Reinforcement Learning Agent for a 2D Shooter Game

Abstract:
Reinforcement learning agents in complex game environments often suffer from sparse rewards, training instability, and poor sample efficiency. This paper presents a hybrid training approach that combines offline imitation learning with online reinforcement learning for a 2D shooter game agent. We implement a multi‑head neural network with separate outputs for behavioral cloning and Q‑learning, unified by shared feature extraction layers with attention mechanisms. Initial experiments using pure deep Q‑Networks exhibited significant instability, with agents frequently reverting to poor policies despite occasional good performance. To address this, we developed a hybrid methodology that begins with behavioral cloning on demonstration data from rule‑based agents, then transitions to reinforcement learning. Our hybrid approach achieves consistently above 70% win rate against rule‑based opponents, substantially outperforming pure reinforcement learning methods which showed high variance and frequent performance degradation. The multi‑head architecture enables effective knowledge transfer between learning modes while maintaining training stability. Results demonstrate that combining demonstration‑based initialization with reinforcement learning optimization provides a robust solution for developing game AI agents in complex multi‑agent environments where pure exploration proves insufficient.

PaperID: 8608, https://arxiv.org/pdf/2509.14956.pdf

Authors: Diego Gosmar, Deborah A. Dahl

Title: Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems

Abstract:
This paper proposes a novel architectural framework aimed at enhancing security and reliability in multi‑agent systems (MAS). A central component of this framework is a network of Sentinel Agents, functioning as a distributed security layer that integrates techniques such as semantic analysis via large language models (LLMs), behavioral analytics, retrieval‑augmented verification, and cross‑agent anomaly detection. Such agents can potentially oversee inter‑agent communications, identify potential threats, enforce privacy and access controls, and maintain comprehensive audit records. Complementary to the idea of Sentinel Agents is the use of a Coordinator Agent. The Coordinator Agent supervises policy implementation, and manages agent participation. In addition, the Coordinator also ingests alerts from Sentinel Agents. Based on these alerts, it can adapt policies, isolate or quarantine misbehaving agents, and contain threats to maintain the integrity of the MAS ecosystem. This dual‑layered security approach, combining the continuous monitoring of Sentinel Agents with the governance functions of Coordinator Agents, supports dynamic and adaptive defense mechanisms against a range of threats, including prompt injection, collusive agent behavior, hallucinations generated by LLMs, privacy breaches, and coordinated multi‑agent attacks. In addition to the architectural design, we present a simulation study where 162 synthetic attacks of different families (prompt injection, hallucination, and data exfiltration) were injected into a multi‑agent conversational environment. The Sentinel Agents successfully detected the attack attempts, confirming the practical feasibility of the proposed monitoring approach. The framework also offers enhanced system observability, supports regulatory compliance, and enables policy evolution over time.

PaperID: 8609, https://arxiv.org/pdf/2509.14877.pdf

Authors: Rohin Gillgallon, Giacomo Bergami, Reham Almutairi, Graham Morgan

Title: AI-Driven Multi-Agent Vehicular Planning for Battery Efficiency and QoS in 6G Smart Cities

Abstract:
While simulators exist for vehicular IoT nodes communicating with the Cloud through Edge nodes in a fully‑simulated osmotic architecture, they often lack support for dynamic agent planning and optimisation to minimise vehicular battery consumption while ensuring fair communication times. Addressing these challenges requires extending current simulator architectures with AI algorithms for both traffic prediction and dynamic agent planning. This paper presents an extension of SimulatorOrchestrator (SO) to meet these requirements. Preliminary results over a realistic urban dataset show that utilising vehicular planning algorithms can lead to improved battery and QoS performance compared with traditional shortest path algorithms. The additional inclusion of desirability areas enabled more ambulances to be routed to their target destinations while utilising less energy to do so, compared to traditional and weighted algorithms without desirability considerations.

PaperID: 8610, https://arxiv.org/pdf/2509.14860.pdf

Authors: Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

Title: MARIC: Multi-Agent Reasoning for Image Classification

Abstract:
Image classification has traditionally relied on parameter‑intensive model training, requiring large‑scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter‑heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi‑agent visual reasoning for robust and interpretable image classification.

PaperID: 8611, https://arxiv.org/pdf/2509.14834.pdf

Authors: Jinhee Jang, Ayoung Moon, Minkyoung Jung, YoungBin Kim, Seung Jin Lee

Title: LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring

Abstract:
The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long‑standing and practical application of natural language processing in education. However, achieving human‑level multi‑perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi‑agent evaluation framework designed to perform precise and human‑aligned scoring under a zero‑shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait‑based rubric and conducts a multi‑perspective evaluation. Then, by simulating a roundtable‑style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero‑shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.

PaperID: 8612, https://arxiv.org/pdf/2509.14803.pdf

Authors: Xian Gao, Zongyun Zhang, Ting Liu, Yuzhuo Fu

Title: OnlineMate: An LLM-Based Multi-Agent Companion System for Cognitive Support in Online Learning

Abstract:
In online learning environments, students often lack personalized peer interactions, which are crucial for cognitive development and learning engagement. Although previous studies have employed large language models (LLMs) to simulate interactive learning environments, these interactions are limited to conversational exchanges, failing to adapt to learners' individualized cognitive and psychological states. As a result, students' engagement is low and they struggle to gain inspiration. To address this challenge, we propose OnlineMate, a multi‑agent learning companion system driven by LLMs integrated with Theory of Mind (ToM). OnlineMate simulates peer‑like roles, infers learners' psychological states such as misunderstandings and confusion during collaborative discussions, and dynamically adjusts interaction strategies to support higher‑order thinking. Comprehensive evaluations, including simulation‑based experiments, human assessments, and real classroom trials, demonstrate that OnlineMate significantly promotes deep learning and cognitive engagement by elevating students' average cognitive level while substantially improving emotional engagement scores.

PaperID: 8613, https://arxiv.org/pdf/2509.14801.pdf

Authors: Julian F. Schumann, Anna Mészáros, Jens Kober, Arkady Zgonnikov

Title: STEP: Structured Training and Evaluation Platform for benchmarking trajectory prediction models

Abstract:
While trajectory prediction plays a critical role in enabling safe and effective path‑planning in automated vehicles, standardized practices for evaluating such models remain underdeveloped. Recent efforts have aimed to unify dataset formats and model interfaces for easier comparisons, yet existing frameworks often fall short in supporting heterogeneous traffic scenarios, joint prediction models, or user documentation. In this work, we introduce STEP ‑‑ a new benchmarking framework that addresses these limitations by providing a unified interface for multiple datasets, enforcing consistent training and evaluation conditions, and supporting a wide range of prediction models. We demonstrate the capabilities of STEP in a number of experiments which reveal 1) the limitations of widely‑used testing procedures, 2) the importance of joint modeling of agents for better predictions of interactions, and 3) the vulnerability of current state‑of‑the‑art models against both distribution shifts and targeted attacks by adversarial agents. With STEP, we aim to shift the focus from the ``leaderboard'' approach to deeper insights about model behavior and generalization in complex multi‑agent settings.

PaperID: 8614, https://arxiv.org/pdf/2509.14680.pdf

Authors: Tianyang Duan, Zongyuan Zhang, Songxiao Guo, Dong Huang, Yuanye Zhao, Zheng Lin, Zihan Fang, Dianxin Luan, Heming Cui, Yong Cui

Title: LEED: A Highly Efficient and Scalable LLM-Empowered Expert Demonstrations Framework for Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) holds substantial promise for intelligent decision‑making in complex environments. However, it suffers from a coordination and scalability bottleneck as the number of agents increases. To address these issues, we propose the LLM‑empowered expert demonstrations framework for multi‑agent reinforcement learning (LEED). LEED consists of two components: a demonstration generation (DG) module and a policy optimization (PO) module. Specifically, the DG module leverages large language models to generate instructions for interacting with the environment, thereby producing high‑quality demonstrations. The PO module adopts a decentralized training paradigm, where each agent utilizes the generated demonstrations to construct an expert policy loss, which is then integrated with its own policy loss. This enables each agent to effectively personalize and optimize its local policy based on both expert knowledge and individual experience. Experimental results show that LEED achieves superior sample efficiency, time efficiency, and robust scalability compared to state‑of‑the‑art baselines.

PaperID: 8615, https://arxiv.org/pdf/2509.14647.pdf

Authors: NVJK Kartik, Garvit Sapra, Rishav Hada, Nikhil Pareek

Title: AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production

Abstract:
With the growing adoption of Large Language Models (LLMs) in automating complex, multi‑agent workflows, organizations face mounting risks from errors, emergent behaviors, and systemic failures that current evaluation methods fail to capture. We present AgentCompass, the first evaluation framework designed specifically for post‑deployment monitoring and debugging of agentic workflows. AgentCompass models the reasoning process of expert debuggers through a structured, multi‑stage analytical pipeline: error identification and categorization, thematic clustering, quantitative scoring, and strategic summarization. The framework is further enhanced with a dual memory system‑episodic and semantic‑that enables continual learning across executions. Through collaborations with design partners, we demonstrate the framework's practical utility on real‑world deployments, before establishing its efficacy against the publicly available TRAIL benchmark. AgentCompass achieves state‑of‑the‑art results on key metrics, while uncovering critical issues missed in human annotations, underscoring its role as a robust, developer‑centric tool for reliable monitoring and improvement of agentic systems in production.

PaperID: 8616, https://arxiv.org/pdf/2509.14547.pdf

Authors: Yi Lin, Lujin Zhao, Yijie Shi

Title: (P)rior(D)yna(F)low: A Priori Dynamic Workflow Construction via Multi-Agent Collaboration

Abstract:
Recent studies have shown that carefully designed workflows coordinating large language models(LLMs) significantly enhance task‑solving capabilities compared to using a single model. While an increasing number of works focus on autonomous workflow construction, most existing approaches rely solely on historical experience, leading to limitations in efficiency and adaptability. We argue that while historical experience is valuable, workflow construction should also flexibly respond to the unique characteristics of each task. To this end, we propose an a priori dynamic framework for automated workflow construction. Our framework first leverages Q‑table learning to optimize the decision space, guiding agent decisions and enabling effective use of historical experience. At the same time, agents evaluate the current task progress and make a priori decisions regarding the next executing agent, allowing the system to proactively select the more suitable workflow structure for each given task. Additionally, we incorporate mechanisms such as cold‑start initialization, early stopping, and pruning to further improve system efficiency. Experimental evaluations on four benchmark datasets demonstrate the feasibility and effectiveness of our approach. Compared to state‑of‑the‑art baselines, our method achieves an average improvement of 4.05%, while reducing workflow construction and inference costs to only 30.68%‑48.31% of those required by existing methods.

PaperID: 8617, https://arxiv.org/pdf/2509.14542.pdf

Authors: Xinrui Wang, Chengbo Li, Boxuan Zhang, Jiahui Shi, Nian Ran, Linjing Li, Jianjun Liu, Dajun Zeng

Title: S1-MatAgent: A planner driven multi-agent system for material discovery

Abstract:
The discovery of high‑performance materials is crucial for technological advancement. Inverse design using multi‑agent systems (MAS) shows great potential for new material discovery. However, current MAS for materials research rely on predefined configurations and tools, limiting their adaptability and scalability. To address these limitations, we developed a planner driven multi‑agent system (S1‑MatAgent) which adopts a Planner‑Executor architecture. Planner automatically decomposes complex materials design tasks, dynamically configures various tools to generate dedicated Executor agents for each subtask, significantly reducing reliance on manual workflow construction and specialized configuration. Applied to high‑entropy alloy catalysts for hydrogen evolution reactions in alkaline conditions, S1‑MatAgent completed full‑cycle closed‑loop design from literature analysis and composition recommendation to performance optimization and experimental validation. To tackle the deviations between designed materials and target, as well as high experimental verification costs, S1‑MatAgent employs a novel composition optimization algorithm based on gradients of machine learning interatomic potential, achieving 27.7 % improvement in material performance. S1‑MatAgent designed 13 high‑performance catalysts from 20 million candidates, with Ni4Co4Cu1Mo3Ru4 exhibiting an overpotential of 18.6 mV at 10 mA cm‑2 and maintaining 97.5 % activity after 500 hours at 500 mA cm‑2. The universal MAS framework offers a universal and scalable solution for material discovery, significantly improving design efficiency and adaptability.

PaperID: 8618, https://arxiv.org/pdf/2509.14528.pdf

Authors: Pradyumna Shome, Sashreek Krishnan, Sauvik Das

Title: Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

Abstract:
There is growing imprecision about what "AI agents" are, what they can do, and how effectively they can be used by their intended users. We pose two key research questions: (i) How does the tech industry conceive and market "AI agents"? (ii) What challenges do end‑users face when attempting to use commercial AI agents for their advertised uses? We first performed a systematic review of marketed use cases for 102 commercial AI agents, finding that they fall into three umbrella categories: orchestration, creation, and insight. We then evaluated whether end‑users could realize these marketed capabilities in practice: we conducted a usability assessment where N = 31 participants attempted representative tasks for each of these categories on two popular commercial AI agent tools: Operator and Manus. We found that users were generally impressed with these agents but faced significant usability challenges ranging from agent capabilities that were misaligned with user mental models to agents lacking the meta‑cognitive abilities necessary for effective collaboration.

PaperID: 8619, https://arxiv.org/pdf/2509.14485.pdf

Authors: Marko Tesic, Yue Zhao, Joel Z. Leibo, Rakshit S. Trivedi, Jose Hernandez-Orallo

Title: Beyond the high score: Prosocial ability profiles of multi-agent populations

Abstract:
The development and evaluation of social capabilities in AI agents require complex environments where competitive and cooperative behaviours naturally emerge. While game‑theoretic properties can explain why certain teams or agent populations outperform others, more abstract behaviours, such as convention following, are harder to control in training and evaluation settings. The Melting Pot contest is a social AI evaluation suite designed to assess the cooperation capabilities of AI systems. In this paper, we apply a Bayesian approach known as Measurement Layouts to infer the capability profiles of multi‑agent systems in the Melting Pot contest. We show that these capability profiles not only predict future performance within the Melting Pot suite but also reveal the underlying prosocial abilities of agents. Our analysis indicates that while higher prosocial capabilities sometimes correlate with better performance, this is not a universal trend‑some lower‑scoring agents exhibit stronger cooperation abilities. Furthermore, we find that top‑performing contest submissions are more likely to achieve high scores in scenarios where prosocial capabilities are not required. These findings, together with reports that the contest winner used a hard‑coded solution tailored to specific environments, suggest that at least one top‑performing team may have optimised for conditions where cooperation was not necessary, potentially exploiting limitations in the evaluation framework. We provide recommendations for improving the annotation of cooperation demands and propose future research directions to account for biases introduced by different testing environments. Our results demonstrate that Measurement Layouts offer both strong predictive accuracy and actionable insights, contributing to a more transparent and generalisable approach to evaluating AI systems in complex social settings.

PaperID: 8620, https://arxiv.org/pdf/2509.14483.pdf

Authors: Thanh-Long Bui, Hoa Khanh Dam, Rashina Hoda

Title: An LLM-based multi-agent framework for agile effort estimation

Abstract:
Effort estimation is a crucial activity in agile software development, where teams collaboratively review, discuss, and estimate the effort required to complete user stories in a product backlog. Current practices in agile effort estimation heavily rely on subjective assessments, leading to inaccuracies and inconsistencies in the estimates. While recent machine learning‑based methods show promising accuracy, they cannot explain or justify their estimates and lack the capability to interact with human team members. Our paper fills this significant gap by leveraging the powerful capabilities of Large Language Models (LLMs). We propose a novel LLM‑based multi‑agent framework for agile estimation that not only can produce estimates, but also can coordinate, communicate and discuss with human developers and other agents to reach a consensus. Evaluation results on a real‑life dataset show that our approach outperforms state‑of‑the‑art techniques across all evaluation metrics in the majority of the cases. Our human study with software development practitioners also demonstrates an overwhelmingly positive experience in collaborating with our agents in agile effort estimation.

PaperID: 8621, https://arxiv.org/pdf/2509.14480.pdf

Authors: Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu

Title: Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents

Abstract:
Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi‑turn planning and long‑context dialogue management. To train agents for this dynamic process, particularly in multi‑modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech‑text rollouts. Our core strategy, Turn‑level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long‑horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn‑level evaluation. To enhance exploration, we integrate a mixed‑task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text‑based τ‑bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine‑tuning a multi‑modal foundation model for agentic tasks. By training a base multi‑modal LLM on interleaved speech‑text rollouts, we equip it with tool‑use abilities, paving the way for more natural, voice‑driven interactive agents.

PaperID: 8622, https://arxiv.org/pdf/2509.14477.pdf

Authors: Thales Sales Almeida, João Guilherme Alves Santos, Thiago Laitz, Giovana Kerche Bonás

Title: Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation

Abstract:
Large language models (LLMs) are increasingly deployed as task‑oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket‑Bench, a benchmark for multilingual agent evaluation in task‑oriented scenarios. Ticket‑Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French. Using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open‑source LLMs, measuring function‑calling accuracy and consistency across languages. Results show that reasoning‑oriented models (e.g., GPT‑5, Qwen3‑235B) dominate performance but still exhibit notable cross‑lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.

PaperID: 8623, https://arxiv.org/pdf/2509.14431.pdf

Authors: Keqin Wang, Tao Zhong, David Chang, Christine Allen-Blanchette

Title: Local-Canonicalization Equivariant Graph Neural Networks for Sample-Efficient and Generalizable Swarm Robot Control

Abstract:
Multi‑agent reinforcement learning (MARL) has emerged as a powerful paradigm for coordinating swarms of agents in complex decision‑making, yet major challenges remain. In competitive settings such as pursuer‑evader tasks, simultaneous adaptation can destabilize training; non‑kinetic countermeasures often fail under adverse conditions; and policies trained in one configuration rarely generalize to environments with a different number of agents. To address these issues, we propose the Local‑Canonicalization Equivariant Graph Neural Networks (LEGO) framework, which integrates seamlessly with popular MARL algorithms such as MAPPO. LEGO employs graph neural networks to capture permutation equivariance and generalization to different agent numbers, canonicalization to enforce E(n)‑equivariance, and heterogeneous representations to encode role‑specific inductive biases. Experiments on cooperative and competitive swarm benchmarks show that LEGO outperforms strong baselines and improves generalization. In real‑world experiments, LEGO demonstrates robustness to varying team sizes and agent failure.

PaperID: 8624, https://arxiv.org/pdf/2509.14386.pdf

Authors: Arjun S. Nair, Kristina P. Sinaga

Title: Disproving the Feasibility of Learned Confidence Calibration Under Binary Supervision: An Information-Theoretic Impossibility

Abstract:
We prove a fundamental impossibility theorem: neural networks cannot simultaneously learn well‑calibrated confidence estimates with meaningful diversity when trained using binary correct/incorrect supervision. Through rigorous mathematical analysis and comprehensive empirical evaluation spanning negative reward training, symmetric loss functions, and post‑hoc calibration methods, we demonstrate this is an information‑theoretic constraint, not a methodological failure. Our experiments reveal universal failure patterns: negative rewards produce extreme underconfidence (ECE greater than 0.8) while destroying confidence diversity (std less than 0.05), symmetric losses fail to escape binary signal averaging, and post‑hoc methods achieve calibration (ECE less than 0.02) only by compressing the confidence distribution. We formalize this as an underspecified mapping problem where binary signals cannot distinguish between different confidence levels for correct predictions: a 60 percent confident correct answer receives identical supervision to a 90 percent confident one. Crucially, our real‑world validation shows 100 percent failure rate for all training methods across MNIST, Fashion‑MNIST, and CIFAR‑10, while post‑hoc calibration's 33 percent success rate paradoxically confirms our theorem by achieving calibration through transformation rather than learning. This impossibility directly explains neural network hallucinations and establishes why post‑hoc calibration is mathematically necessary, not merely convenient. We propose novel supervision paradigms using ensemble disagreement and adaptive multi‑agent learning that could overcome these fundamental limitations without requiring human confidence annotations.

PaperID: 8625, https://arxiv.org/pdf/2509.14380.pdf

Authors: Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

Title: CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

Abstract:
Multi‑Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi‑agent systems. However, applying MARL to robotics still remains challenging due to high‑dimensional continuous joint action spaces, complex reward design, and non‑stationary transitions inherent to decentralized settings. On the other hand, humans learn complex coordination through staged curricula, where long‑horizon behaviors are progressively built upon simpler skills. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for multi‑robot coordination Tasks, a framework that leverages the reasoning capabilities of foundation models to act as a "coach" for multi‑robot coordination. CRAFT automatically decomposes long‑horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). In what follows, CRAFT trains each subtask using reward functions generated by LLM, and refines them through a Vision Language Model (VLM)‑guided reward‑refinement loop. We evaluate CRAFT on multi‑quadruped navigation and bimanual manipulation tasks, demonstrating its capability to learn complex coordination behaviors. In addition, we validate the multi‑quadruped navigation policy in real hardware experiments.

PaperID: 8626, https://arxiv.org/pdf/2509.14285.pdf

Authors: S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin

Title: A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks

Abstract:
Prompt injection attacks represent a major vulnerability in Large Language Model (LLM) deployments, where malicious instructions embedded in user inputs can override system prompts and induce unintended behaviors. This paper presents a novel multi‑agent defense framework that employs specialized LLM agents in coordinated pipelines to detect and neutralize prompt injection attacks in real‑time. We evaluate our approach using two distinct architectures: a sequential chain‑of‑agents pipeline and a hierarchical coordinator‑based system. Our comprehensive evaluation on 55 unique prompt injection attacks, grouped into 8 categories and totaling 400 attack instances across two LLM platforms (ChatGLM and Llama2), demonstrates significant security improvements. Without defense mechanisms, baseline Attack Success Rates (ASR) reached 30% for ChatGLM and 20% for Llama2. Our multi‑agent pipeline achieved 100% mitigation, reducing ASR to 0% across all tested scenarios. The framework demonstrates robustness across multiple attack categories including direct overrides, code execution attempts, data exfiltration, and obfuscation techniques, while maintaining system functionality for legitimate queries.

PaperID: 8627, https://arxiv.org/pdf/2509.14276.pdf

Authors: Yuxiang Mai, Qiyue Yin, Wancheng Ni, Pei Xu, Kaiqi Huang

Title: Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity

Abstract:
In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi‑agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision‑making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state‑of‑the‑art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.

PaperID: 8628, https://arxiv.org/pdf/2509.14221.pdf

Authors: Silan Hu, Shiqi Zhang, Yimin Shi, Xiaokui Xiao

Title: GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing

Abstract:
Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM‑based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad‑injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM‑Bench, the first comprehensive benchmark for ad‑injected response generation in GEM. GEM‑Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi‑agent framework. Our preliminary results indicate that, while simple prompt‑based methods achieve reasonable engagement such as click‑through rate, they often reduce user satisfaction. In contrast, approaches that insert ads based on pre‑generated ad‑free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad‑injected responses in GEM. The benchmark and all related resources are publicly available at https://gem‑bench.org/.

PaperID: 8629, https://arxiv.org/pdf/2509.14189.pdf

Authors: Sebastian Porsdam Mann, Mateo Aboy, Joel Jiehao Seah, Zhicheng Lin, Xufei Luo, Daniel Rodger, Hazem Zohny, Timo Minssen, Julian Savulescu, Brian D. Earp

Title: AI and the Future of Academic Peer Review

Abstract:
Peer review remains the central quality‑control mechanism of science, yet its ability to fulfill this role is increasingly strained. Empirical studies document serious shortcomings: long publication delays, escalating reviewer burden concentrated on a small minority of scholars, inconsistent quality and low inter‑reviewer agreement, and systematic biases by gender, language, and institutional prestige. Decades of human‑centered reforms have yielded only marginal improvements. Meanwhile, artificial intelligence, especially large language models (LLMs), is being piloted across the peer‑review pipeline by journals, funders, and individual reviewers. Early studies suggest that AI assistance can produce reviews comparable in quality to humans, accelerate reviewer selection and feedback, and reduce certain biases, but also raise distinctive concerns about hallucination, confidentiality, gaming, novelty recognition, and loss of trust. In this paper, we map the aims and persistent failure modes of peer review to specific LLM applications and systematically analyze the objections they raise alongside safeguards that could make their use acceptable. Drawing on emerging evidence, we show that targeted, supervised LLM assistance can plausibly improve error detection, timeliness, and reviewer workload without displacing human judgment. We highlight advanced architectures, including fine‑tuned, retrieval‑augmented, and multi‑agent systems, that may enable more reliable, auditable, and interdisciplinary review. We argue that ethical and practical considerations are not peripheral but constitutive: the legitimacy of AI‑assisted peer review depends on governance choices as much as technical capacity. The path forward is neither uncritical adoption nor reflexive rejection, but carefully scoped pilots with explicit evaluation metrics, transparency, and accountability.

PaperID: 8630, https://arxiv.org/pdf/2509.14169.pdf

Authors: Ziming Wei, Zichen Kong, Yuan Wang, David Z. Pan, Xiyuan Tang

Title: TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits

Abstract:
Analog and mixed‑signal circuit design remains challenging due to the shortage of high‑quality data and the difficulty of embedding domain knowledge into automated flows. Traditional black‑box optimization achieves sampling efficiency but lacks circuit understanding, which often causes evaluations to be wasted in low‑value regions of the design space. In contrast, learning‑based methods embed structural knowledge but are case‑specific and costly to retrain. Recent attempts with large language models show potential, yet they often rely on manual intervention, limiting generality and transparency. We propose TopoSizing, an end‑to‑end framework that performs robust circuit understanding directly from raw netlists and translates this knowledge into optimization gains. Our approach first applies graph algorithms to organize circuits into a hierarchical device‑module‑stage representation. LLM agents then execute an iterative hypothesis‑verification‑refinement loop with built‑in consistency checks, producing explicit annotations. Verified insights are integrated into Bayesian optimization through LLM‑guided initial sampling and stagnation‑triggered trust‑region updates, improving efficiency while preserving feasibility.

PaperID: 8631, https://arxiv.org/pdf/2509.14159.pdf

Authors: Dayi Dong, Maulik Bhatt, Seoyeon Choi, Negar Mehr

Title: MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies

Abstract:
As robots become more integrated in society, their ability to coordinate with other robots and humans on multi‑modal tasks (those with multiple valid solutions) is crucial. Such behaviors can be learned from expert demonstrations via imitation learning (IL), but when expert demonstrations are multi‑modal, standard IL approaches usually average across modes or collapse to a single mode, preventing effective coordination. Being inspired by diffusion models' ability to capture complex multi‑modal trajectory distributions in single‑agent settings, we develop a diffusion‑based framework for coordinated multi‑modal behavior in multi‑agent systems. However, existing multi‑agent diffusion approaches typically require a centralized planner or explicit communication among agents. This assumption can fail in real‑world scenarios where robots must operate independently or with agents like humans that they cannot directly communicate with. Therefore, we propose MIMIC‑D, a joint training with decentralized execution paradigm for multi‑modal multi‑agent IL via diffusion. We jointly train all agents' policies with only local information to achieve implicit coordination. In simulation and hardware experiments, our method exhibits robust multi‑modal coordination behavior in various tasks and environments, improving upon state‑of‑the‑art baselines.

PaperID: 8632, https://arxiv.org/pdf/2509.14138.pdf

Authors: Ran Yang, Zijian An, Lifeng ZHou, Yiming Feng

Title: SeqVLA: Sequential Task Execution for Long-Horizon Manipulation with Completion-Aware Vision-Language-Action Model

Abstract:
Long‑horizon robotic manipulation tasks require executing multiple interdependent subtasks in strict sequence, where errors in detecting subtask completion can cascade into downstream failures. Existing Vision‑Language‑Action (VLA) models such as π_0 excel at continuous low‑level control but lack an internal signal for identifying when a subtask has finished, making them brittle in sequential settings. We propose SeqVLA, a completion‑aware extension of π_0 that augments the base architecture with a lightweight detection head perceiving whether the current subtask is complete. This dual‑head design enables SeqVLA not only to generate manipulation actions but also to autonomously trigger transitions between subtasks. We investigate four finetuning strategies that vary in how the action and detection heads are optimized (joint vs. sequential finetuning) and how pretrained knowledge is preserved (full finetuning vs. frozen backbone). Experiments are performed on two multi‑stage tasks: salad packing with seven distinct subtasks and candy packing with four distinct subtasks. Results show that SeqVLA significantly outperforms the baseline π_0 and other strong baselines in overall success rate. In particular, joint finetuning with an unfrozen backbone yields the most decisive and statistically reliable completion predictions, eliminating sequence‑related failures and enabling robust long‑horizon execution. Our results highlight the importance of coupling action generation with subtask‑aware detection for scalable sequential manipulation.

PaperID: 8633, https://arxiv.org/pdf/2509.14096.pdf

Authors: Víctor Mayoral-Vilches

Title: The Cybersecurity of a Humanoid Robot

Abstract:
The rapid advancement of humanoid robotics presents unprecedented cybersecurity challenges that existing theoretical frameworks fail to adequately address. This report presents a comprehensive security assessment of a production humanoid robot platform, bridging the gap between abstract security models and operational vulnerabilities. Through systematic static analysis, runtime observation, and cryptographic examination, we uncovered a complex security landscape characterized by both sophisticated defensive mechanisms and critical vulnerabilities. Our findings reveal a dual‑layer proprietary encryption system (designated FMX') that, while innovative in design, suffers from fundamental implementation flaws including the use of static cryptographic keys that enable offline configuration decryption. More significantly, we documented persistent telemetry connections transmitting detailed robot state information‑‑including audio, visual, spatial, and actuator data‑‑to external servers without explicit user consent or notification mechanisms. We operationalized a Cybersecurity AI agent on the Unitree G1 to map and prepare exploitation of its manufacturer's cloud infrastructure, illustrating how a compromised humanoid can escalate from covert data collection to active counter‑offensive operations. We argue that securing humanoid robots requires a paradigm shift toward Cybersecurity AI (CAI) frameworks that can adapt to the unique challenges of physical‑cyber convergence. This work contributes empirical evidence for developing robust security standards as humanoid robots transition from research curiosities to operational systems in critical domains.

PaperID: 8634, https://arxiv.org/pdf/2509.14034.pdf

Authors: Zijie Lin, Bryan Hooi

Title: Enhancing Multi-Agent Debate System Performance via Confidence Expression

Abstract:
Generative Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Recent research has introduced Multi‑Agent Debate (MAD) systems, which leverage multiple LLMs to simulate human debate and thereby improve task performance. However, while some LLMs may possess superior knowledge or reasoning capabilities for specific tasks, they often struggle to clearly communicate this advantage during debates, in part due to a lack of confidence expression. Moreover, inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers, ultimately reducing debate effectiveness and overall system performance. To address these challenges, we propose incorporating confidence expression into MAD systems to allow LLMs to explicitly communicate their confidence levels. To validate this approach, we develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process. Experimental results demonstrate the effectiveness of our method, and we further analyze how confidence influences debate dynamics, offering insights into the design of confidence‑aware MAD systems.

PaperID: 8635, https://arxiv.org/pdf/2509.13978.pdf

Authors: Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, Rafael Ferreira da Silva

Title: LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Abstract:
Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in‑depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open‑source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata‑driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real‑world chemistry workflow, show that modular design, prompt tuning, and Retrieval‑Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.

PaperID: 8636, https://arxiv.org/pdf/2509.13942.pdf

Authors: Duc Minh Ha, Phu Trac Kien, Tho Quan, Anh Nguyen-Duc

Title: Evaluating Classical Software Process Models as Coordination Mechanisms for LLM-Based Software Generation

Abstract:
[Background] Large Language Model (LLM)‑based multi‑agent systems (MAS) are transforming software development by enabling autonomous collaboration. Classical software processes such asWaterfall, V‑Model, and Agile offer structured coordination patterns that can be repurposed to guide these agent interactions. [Aims] This study explores how traditional software development processes can be adapted as coordination scaffolds for LLM based MAS and examines their impact on code quality, cost, and productivity. [Method] We executed 11 diverse software projects under three process models and four GPT variants, totaling 132 runs. Each output was evaluated using standardized metrics for size (files, LOC), cost (execution time, token usage), and quality (code smells, AI‑ and human detected bugs). [Results] Both process model and LLM choice significantly affected system performance. Waterfall was most efficient, V‑Model produced the most verbose code, and Agile achieved the highest code quality, albeit at higher computational cost. [Conclusions] Classical software processes can be effectively instantiated in LLM‑based MAS, but each entails trade‑offs across quality, cost, and adaptability. Process selection should reflect project goals, whether prioritizing efficiency, robustness, or structured validation.

PaperID: 8637, https://arxiv.org/pdf/2509.13782.pdf

Authors: Yu Ge, Linna Xie, Zhong Li, Yu Pei, Tian Zhang

Title: Who is Introducing the Failure? Automatically Attributing Failures of Multi-Agent Systems via Spectrum Analysis

Abstract:
Large Language Model Powered Multi‑Agent Systems (MASs) are increasingly employed to automate complex real‑world problems, such as programming and scientific discovery. Despite their promising, MASs are not without their flaws. However, failure attribution in MASs ‑ pinpointing the specific agent actions responsible for failures ‑ remains underexplored and labor‑intensive, posing significant challenges for debugging and system improvement. To bridge this gap, we propose FAMAS, the first spectrum‑based failure attribution approach for MASs, which operates through systematic trajectory replay and abstraction, followed by spectrum analysis.The core idea of FAMAS is to estimate, from variations across repeated MAS executions, the likelihood that each agent action is responsible for the failure. In particular, we propose a novel suspiciousness formula tailored to MASs, which integrates two key factor groups, namely the agent behavior group and the action behavior group, to account for the agent activation patterns and the action activation patterns within the execution trajectories of MASs. Through expensive evaluations against 12 baselines on the Who and When benchmark, FAMAS demonstrates superior performance by outperforming all the methods in comparison.

PaperID: 8638, https://arxiv.org/pdf/2509.13758.pdf

Authors: Kevin Halim, Sin G. Teo, Ruitao Feng, Zhenpeng Chen, Yang Gu, Chong Wang, Yang Liu

Title: A Study on Thinking Patterns of Large Reasoning Models in Code Generation

Abstract:
Currently, many large language models (LLMs) are utilized for software engineering tasks such as code generation. The emergence of more advanced models known as large reasoning models (LRMs), such as OpenAI's o3, DeepSeek R1, and Qwen3. They have demonstrated the capability of performing multi‑step reasoning. Despite the advancement in LRMs, little attention has been paid to systematically analyzing the reasoning patterns these models exhibit and how such patterns influence the generated code. This paper presents a comprehensive study aimed at investigating and uncovering the reasoning behavior of LRMs during code generation. We prompted several state‑of‑the‑art LRMs of varying sizes with code generation tasks and applied open coding to manually annotate the reasoning traces. From this analysis, we derive a taxonomy of LRM reasoning behaviors, encompassing 15 reasoning actions across four phases. Our empirical study based on the taxonomy reveals a series of findings. First, we identify common reasoning patterns, showing that LRMs generally follow a human‑like coding workflow, with more complex tasks eliciting additional actions such as scaffolding, flaw detection, and style checks. Second, we compare reasoning across models, finding that Qwen3 exhibits iterative reasoning while DeepSeek‑R1‑7B follows a more linear, waterfall‑like approach. Third, we analyze the relationship between reasoning and code correctness, showing that actions such as unit test creation and scaffold generation strongly support functional outcomes, with LRMs adapting strategies based on task context. Finally, we evaluate lightweight prompting strategies informed by these findings, demonstrating the potential of context‑ and reasoning‑oriented prompts to improve LRM‑generated code. Our results offer insights and practical implications for advancing automatic code generation.

PaperID: 8639, https://arxiv.org/pdf/2509.13712.pdf

Authors: HwiJoon Lee, Martina Di Paola, Yoo Jin Hong, Quang-Huy Nguyen, Joseph Seering

Title: Inject, Fork, Compare: Defining an Interaction Vocabulary for Multi-Agent Simulation Platforms

Abstract:
LLM‑based multi‑agent simulations are a rapidly growing field of research, but current simulations often lack clear modes for interaction and analysis, limiting the "what if" scenarios researchers are able to investigate. In this demo, we define three core operations for interacting with multi‑agent simulations: inject, fork, and compare. Inject allows researchers to introduce external events at any point during simulation execution. Fork creates independent timeline branches from any timestamp, preserving complete state while allowing divergent exploration. Compare facilitates parallel observation of multiple branches, revealing how different interventions lead to distinct emergent behaviors. Together, these operations establish a vocabulary that transforms linear simulation workflows into interactive, explorable spaces. We demonstrate this vocabulary through a commodity market simulation with fourteen AI agents, where researchers can inject contrasting events and observe divergent outcomes across parallel timelines. By defining these fundamental operations, we provide a starting point for systematic causal investigation in LLM‑based agent simulations, moving beyond passive observation toward active experimentation.

PaperID: 8640, https://arxiv.org/pdf/2509.13704.pdf

Authors: Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen

Title: InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management

Abstract:
Mission‑critical industrial infrastructure, such as data centers, increasingly depends on complex management software. Its operations, however, pose significant challenges due to the escalating system complexity, multi‑vendor integration, and a shortage of expert operators. While Robotic Process Automation (RPA) offers partial automation through handcrafted scripts, it suffers from limited flexibility and high maintenance costs. Recent advances in Large Language Model (LLM)‑based graphical user interface (GUI) agents have enabled more flexible automation, yet these general‑purpose agents face five critical challenges when applied to industrial management, including unfamiliar element understanding, precision and efficiency, state localization, deployment constraints, and safety requirements. To address these issues, we propose InfraMind, a novel exploration‑based GUI agentic framework specifically tailored for industrial management systems. InfraMind integrates five innovative modules to systematically resolve different challenges in industrial management: (1) systematic search‑based exploration with virtual machine snapshots for autonomous understanding of complex GUIs; (2) memory‑driven planning to ensure high‑precision and efficient task execution; (3) advanced state identification for robust localization in hierarchical interfaces; (4) structured knowledge distillation for efficient deployment with lightweight models; and (5) comprehensive, multi‑layered safety mechanisms to safeguard sensitive operations. Extensive experiments on both open‑source and commercial DCIM platforms demonstrate that our approach consistently outperforms existing frameworks in terms of task success rate and operational efficiency, providing a rigorous and scalable solution for industrial management automation.

PaperID: 8641, https://arxiv.org/pdf/2509.13677.pdf

Authors: Xinxu Zhou, Jiaqi Bai, Zhenqi Sun, Fanxiang Zeng, Yue Liu

Title: AgentCTG: Harnessing Multi-Agent Collaboration for Fine-Grained Precise Control in Text Generation

Abstract:
Although significant progress has been made in many tasks within the field of Natural Language Processing (NLP), Controlled Text Generation (CTG) continues to face numerous challenges, particularly in achieving fine‑grained conditional control over generation. Additionally, in real scenario and online applications, cost considerations, scalability, domain knowledge learning and more precise control are required, presenting more challenge for CTG. This paper introduces a novel and scalable framework, AgentCTG, which aims to enhance precise and complex control over the text generation by simulating the control and regulation mechanisms in multi‑agent workflows. We explore various collaboration methods among different agents and introduce an auto‑prompt module to further enhance the generation effectiveness. AgentCTG achieves state‑of‑the‑art results on multiple public datasets. To validate its effectiveness in practical applications, we propose a new challenging Character‑Driven Rewriting task, which aims to convert the original text into new text that conform to specific character profiles and simultaneously preserve the domain knowledge. When applied to online navigation with role‑playing, our approach significantly enhances the driving experience through improved content delivery. By optimizing the generation of contextually relevant text, we enable a more immersive interaction within online communities, fostering greater personalization and user engagement.

PaperID: 8642, https://arxiv.org/pdf/2509.13597.pdf

Authors: Abhishek Goswami

Title: Agentic JWT: A Secure Delegation Protocol for Autonomous AI Agents

Abstract:
Autonomous LLM agents can issue thousands of API calls per hour without human oversight. OAuth 2.0 assumes deterministic clients, but in agentic settings stochastic reasoning, prompt injection, or multi‑agent orchestration can silently expand privileges. We introduce Agentic JWT (A‑JWT), a dual‑faceted intent token that binds each agent's action to verifiable user intent and, optionally, to a specific workflow step. A‑JWT carries an agent's identity as a one‑way checksum hash derived from its prompt, tools and configuration, and a chained delegation assertion to prove which downstream agent may execute a given task, and per‑agent proof‑of‑possession keys to prevent replay and in‑process impersonation. We define a new authorization mechanism and add a lightweight client shim library that self‑verifies code at run time, mints intent tokens, tracks workflow steps and derives keys, thus enabling secure agent identity and separation even within a single process. We illustrate a comprehensive threat model for agentic applications, implement a Python proof‑of‑concept and show functional blocking of scope‑violating requests, replay, impersonation, and prompt‑injection pathways with sub‑millisecond overhead on commodity hardware. The design aligns with ongoing OAuth agent discussions and offers a drop‑in path toward zero‑trust guarantees for agentic applications. A comprehensive performance and security evaluation with experimental results will appear in our forthcoming journal publication

PaperID: 8643, https://arxiv.org/pdf/2509.13585.pdf

Authors: Sean Anderson, Chris Darken, João Hespanha

Title: Zero-sum turn games using Q-learning: finite computation with security guarantees

Abstract:
This paper addresses zero‑sum ``turn'' games, in which only one player can make decisions at each state. We show that pure saddle‑point state‑feedback policies for turn games can be constructed from dynamic programming fixed‑point equations for a single value function or Q‑function. These fixed‑points can be constructed using a suitable form of Q‑learning. For discounted costs, convergence of this form of Q‑learning can be established using classical techniques. For undiscounted costs, we provide a convergence result that applies to finite‑time deterministic games, which we use to illustrate our results. For complex games, the Q‑learning iteration must be terminated before exploring the full‑state, which can lead to policies that cannot guarantee the security levels implied by the final Q‑function. To mitigate this, we propose an ``opponent‑informed'' exploration policy for selecting the Q‑learning samples. This form of exploration can guarantee that the final Q‑function provides security levels that hold, at least, against a given set of policies. A numerical demonstration for a multi‑agent game, Atlatl, indicates the effectiveness of these methods.

PaperID: 8644, https://arxiv.org/pdf/2509.13557.pdf

Authors: Zesong Jiang, Yuqi Sun, Qing Zhong, Mahathi Krishna, Deepak Patil, Cheng Tan, Sriram Krishnamoorthy, Jeff Zhang

Title: MALTA: An Automated CGRA Design Framework

Abstract:
Coarse‑grained Reconfigurable Arrays (CGRAs) are a promising computing architecture that can deliver high‑performance, energy‑efficient acceleration across diverse domains. By supporting reconfiguration at the functional unit level, CGRAs efficiently adapt to varying computational patterns and optimize resource utilization. However, designing CGRAs is highly challenging due to the vast design space, independent architectural parameters, and the time‑consuming nature of manual design. Fortunately, the rapid advancement of large language models (LLMs) presents new opportunities to automate this process. In this work, we propose MALTA‑‑ an open‑source multi‑agent LLM‑based framework for Hardware/Software (HW/SW) co‑design of CGRAs. The framework employs LLM reasoning to generate CGRAs across four stages: HW/SW co‑design, Design error correction, Best design selection, and Evaluation & Feedback. Furthermore, MALTA iteratively optimizes the generated CGRAs, leveraging agent reasoning and feedback to achieve higher PPA (that is, power, performance, and area) design points for a given domain. In addition, we introduce an LLM self‑learning mechanism that employs LLM‑driven decision making to select the optimal CGRA to accelerate the design process. We evaluate the framework with state‑of‑the‑art LLM‑based methods and manual CGRA design, in terms of performance, power consumption, and area. Experimental results show that MALTA efficiently generates high‑quality CGRA architectures, significantly reducing manual design effort and demonstrating the potential of our framework for real‑world CGRA design.

PaperID: 8645, https://arxiv.org/pdf/2509.13547.pdf

Authors: Harper Reed, Michael Sugimura, Angelo Zangari

Title: AI Agents with Human-Like Collaborative Tools: Adaptive Strategies for Enhanced Problem-Solving

Abstract:
We investigate whether giving LLM agents the collaborative tools and autonomy that humans naturally use for problem solving can improve their performance. We equip Claude Code agents with MCP‑based social media and journaling tools and allow them to use these tools as they see fit. Across 34 Aider Polyglot Python programming challenges, collaborative tools substantially improve performance on the hardest problems, delivering 15‑40% lower cost, 12‑27% fewer turns, and 12‑38% faster completion than baseline agents. Effects on the full challenge set are mixed, suggesting these tools act as performance enhancers when additional reasoning scaffolding is most needed. Surprisingly, Different models naturally adopted distinct collaborative strategies without explicit instruction. Sonnet 3.7 engaged broadly across tools and benefited from articulation‑based cognitive scaffolding. Sonnet 4 showed selective adoption, leaning on journal‑based semantic search when problems were genuinely difficult. This mirrors how human developers adjust collaboration based on expertise and task complexity. Behavioral analysis shows agents prefer writing over reading by about 2‑9x, indicating that structured articulation drives much of the improvement rather than information access alone. Overall, AI agents can systematically benefit from human‑inspired collaboration tools at the edge of their capabilities, pointing to adaptive collaborative interfaces as reasoning enhancers rather than universal efficiency boosts.

PaperID: 8646, https://arxiv.org/pdf/2509.13471.pdf

Authors: Sina Gogani-Khiabani, Ashutosh Trivedi, Diptikalyan Saha, Saeid Tizpaz-Niari

Title: An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software

Abstract:
Large language models (LLMs) show promise for translating natural‑language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal‑critical software, using U.S. federal tax preparation as a case study. The key challenge is test‑case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher‑order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error‑prone, we use an LLM‑driven, role‑based framework to automate test generation and code synthesis. We implement a multi‑agent system that translates tax code into executable software and incorporates a metamorphic‑testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT‑4o‑mini) achieves a worst‑case pass rate of 45%, outperforming frontier models (GPT‑4o and Claude 3.5, 9‑15%) on complex tax‑code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal‑critical software from natural‑language specifications.

PaperID: 8647, https://arxiv.org/pdf/2509.13381.pdf

Authors: Zhang Xueyao, Yang Bo, Yu Zhiwen, Cao Xuelin, George C. Alexandropoulos, Merouane Debbah, Chau Yuen

Title: Cooperative Target Detection with AUVs: A Dual-Timescale Hierarchical MARDL Approach

Abstract:
Autonomous Underwater Vehicles (AUVs) have shown great potential for cooperative detection and reconnaissance. However, collaborative AUV communications introduce risks of exposure. In adversarial environments, achieving efficient collaboration while ensuring covert operations becomes a key challenge for underwater cooperative missions. In this paper, we propose a novel dual time‑scale Hierarchical Multi‑Agent Proximal Policy Optimization (H‑MAPPO) framework. The high‑level component determines the individuals participating in the task based on a central AUV, while the low‑level component reduces exposure probabilities through power and trajectory control by the participating AUVs. Simulation results show that the proposed framework achieves rapid convergence, outperforms benchmark algorithms in terms of performance, and maximizes long‑term cooperative efficiency while ensuring covert operations.

PaperID: 8648, https://arxiv.org/pdf/2509.13380.pdf

Authors: Alejandro D. Mousist

Title: ASTREA: Introducing Agentic Intelligence for Orbital Thermal Autonomy

Abstract:
This paper presents ASTREA, the first agentic system executed on flight‑heritage hardware (TRL 9) for autonomous spacecraft operations, with on‑orbit operation aboard the International Space Station (ISS). Using thermal control as a representative use case, we integrate a resource‑constrained Large Language Model (LLM) agent with a reinforcement learning controller in an asynchronous architecture tailored for space‑qualified platforms. Ground experiments show that LLM‑guided supervision improves thermal stability and reduces violations, confirming the feasibility of combining semantic reasoning with adaptive control under hardware constraints. On‑orbit validation aboard the ISS initially faced challenges due to inference latency misaligned with the rapid thermal cycles of Low Earth Orbit (LEO) satellites. Synchronization with the orbit length successfully surpassed the baseline with reduced violations, extended episode durations, and improved CPU utilization. These findings demonstrate the potential for scalable agentic supervision architectures in future autonomous spacecraft.

PaperID: 8649, https://arxiv.org/pdf/2509.13369.pdf

Authors: Rashid Mushkani

Title: Right-to-Override for Critical Urban Control Systems: A Deliberative Audit Method for Buildings, Power, and Transport

Abstract:
Automation now steers building HVAC, distribution grids, and traffic signals, yet residents rarely have authority to pause or redirect these systems when they harm inclusivity, safety, or accessibility. We formalize a Right‑to‑Override (R2O) ‑ defining override authorities, evidentiary thresholds, and domain‑validated safe fallback states ‑ and introduce a Deliberative Audit Method (DAM) with playbooks for pre‑deployment walkthroughs, shadow‑mode trials, and post‑incident review. We instantiate R2O/DAM in simulations of smart‑grid load shedding, building HVAC under occupancy uncertainty, and multi‑agent traffic signals. R2O reduces distributional harm with limited efficiency loss: load‑shedding disparity in unserved energy drops from 5.61x to 0.69x with constant curtailment; an override eliminates two discomfort‑hours for seniors at an energy cost of 77 kWh; and median pedestrian wait falls from 90.4 s to 55.9 s with a 6.0 s increase in mean vehicle delay. We also contribute a policy standard, audit worksheets, and a ModelOps integration pattern to make urban automation contestable and reviewable.

PaperID: 8650, https://arxiv.org/pdf/2509.13356.pdf

Authors: Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud

Title: CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI

Abstract:
The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches. This paper introduces CogniAlign, a multi‑agent deliberation framework based on naturalistic moral realism, that grounds moral reasoning in survivability, defined across individual and collective dimensions, and operationalizes it through structured deliberations among discipline‑specific scientist agents. Each agent, representing neuroscience, psychology, sociology, and evolutionary biology, provides arguments and rebuttals that are synthesized by an arbiter into transparent and empirically anchored judgments. As a proof‑of‑concept study, we evaluate CogniAlign on classic and novel moral questions and compare its outputs against GPT‑4o using a five‑part ethical audit framework with the help of three experts. Results show that CogniAlign consistently outperforms the baseline across more than sixty moral questions, with average performance gains of 12.2 points in analytic quality, 31.2 points in decisiveness, and 15 points in depth of explanation. In the Heinz dilemma, for example, CogniAlign achieved an overall score of 79 compared to GPT‑4o's 65.8, demonstrating a decisive advantage in handling moral reasoning. Through transparent and structured reasoning, CogniAlign demonstrates the feasibility of an auditable approach to AI alignment, though certain challenges still remain.

PaperID: 8651, https://arxiv.org/pdf/2509.13352.pdf

Authors: Anis Koubaa, Khaled Gabr

Title: Agentic UAVs: LLM-Driven Autonomy with Integrated Tool-Calling and Cognitive Reasoning

Abstract:
Unmanned Aerial Vehicles (UAVs) are increasingly used in defense, surveillance, and disaster response, yet most systems still operate at SAE Level 2 to 3 autonomy. Their dependence on rule‑based control and narrow AI limits adaptability in dynamic and uncertain missions. Current UAV architectures lack context‑aware reasoning, autonomous decision‑making, and integration with external systems. Importantly, none make use of Large Language Model (LLM) agents with tool‑calling for real‑time knowledge access. This paper introduces the Agentic UAVs framework, a five‑layer architecture consisting of Perception, Reasoning, Action, Integration, and Learning. The framework enhances UAV autonomy through LLM‑driven reasoning, database querying, and interaction with third‑party systems. A prototype built with ROS 2 and Gazebo combines YOLOv11 for object detection with GPT‑4 for reasoning and a locally deployed Gemma 3 model. In simulated search‑and‑rescue scenarios, agentic UAVs achieved higher detection confidence (0.79 compared to 0.72), improved person detection rates (91% compared to 75%), and a major increase in correct action recommendations (92% compared to 4.5%). These results show that modest computational overhead can enable significantly higher levels of autonomy and system‑level integration.

PaperID: 8652, https://arxiv.org/pdf/2509.13021.pdf

Authors: Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy

Title: xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models

Abstract:
This work introduces xOffense, an AI‑driven, multi‑agent penetration testing framework that shifts the process from labor‑intensive, expert‑driven manual efforts to fully automated, machine‑executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine‑tuned, mid‑scale open‑source LLM (Qwen3‑32B) to drive reasoning and decision‑making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine‑tuning on Chain‑of‑Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi‑step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI‑Pentest‑Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub‑task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain‑adapted mid‑scale LLMs, when embedded within structured multi‑agent orchestration, to deliver superior, cost‑efficient, and reproducible solutions for autonomous penetration testing.

PaperID: 8653, https://arxiv.org/pdf/2509.12927.pdf

Authors: Xingxing Hong, Yungong Wang, Dexin Jin, Ye Yuan, Ximing Huang, Zijian Wu, Wenxin Li

Title: HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making

Abstract:
Benchmarks are crucial for assessing multi‑agent reinforcement learning (MARL) algorithms. While StarCraft II‑related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high‑level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty‑Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high‑level strategic decision‑making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents' overall performance within the HLSMAC environment. We integrate state‑of‑the‑art MARL algorithms and LLM‑based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi‑agent strategic decision‑making.

PaperID: 8654, https://arxiv.org/pdf/2509.12810.pdf

Authors: Shicheng Ye, Chao Yu, Kaiqiang Ke, Chengdong Xu, Yinqi Wei

Title: H$^2$R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents

Abstract:
Large language model (LLM)‑based agents have shown strong potential in multi‑task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse‑grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine‑grained knowledge transfer by decoupling high‑level planning memory from low‑level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H^2R), a mechanism that distills reusable and hierarchical knowledge from past agent‑environment interactions. At test time, H^2R performs retrievals of high‑level and low‑level memories separately, allowing LLM‑based agents to efficiently access and utilize task‑relevant knowledge for new tasks.Experimental results across two benchmarks demonstrate that H^2R can improve generalization and decision‑making performance, outperforming prior baselines such as Expel.

PaperID: 8655, https://arxiv.org/pdf/2509.12754.pdf

Authors: Saki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Tadahiro Taniguchi

Title: Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model

Abstract:
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as ``Bring me my cup.'' However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership‑related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently to improve learning efficiency. Additionally, by leveraging commonsense knowledge from Large Language Models (LLM), objects are pre‑classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real‑world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM‑guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.

PaperID: 8656, https://arxiv.org/pdf/2509.12753.pdf

Authors: Feliks Bańka, Jarosław A. Chudziak

Title: DeltaHedge: A Multi-Agent Framework for Portfolio Options Optimization

Abstract:
In volatile financial markets, balancing risk and return remains a significant challenge. Traditional approaches often focus solely on equity allocation, overlooking the strategic advantages of options trading for dynamic risk hedging. This work presents DeltaHedge, a multi‑agent framework that integrates options trading with AI‑driven portfolio management. By combining advanced reinforcement learning techniques with an ensembled options‑based hedging strategy, DeltaHedge enhances risk‑adjusted returns and stabilizes portfolio performance across varying market conditions. Experimental results demonstrate that DeltaHedge outperforms traditional strategies and standalone models, underscoring its potential to transform practical portfolio management in complex financial environments. Building on these findings, this paper contributes to the fields of quantitative finance and AI‑driven portfolio optimization by introducing a novel multi‑agent system for integrating options trading strategies, addressing a gap in the existing literature.

PaperID: 8657, https://arxiv.org/pdf/2509.12702.pdf

Authors: Hongrui Zhao, Xunlan Zhou, Boris Ivanovic, Negar Mehr

Title: UDON: Uncertainty-weighted Distributed Optimization for Multi-Robot Neural Implicit Mapping under Extreme Communication Constraints

Abstract:
Multi‑robot mapping with neural implicit representations enables the compact reconstruction of complex environments. However, it demands robustness against communication challenges like packet loss and limited bandwidth. While prior works have introduced various mechanisms to mitigate communication disruptions, performance degradation still occurs under extremely low communication success rates. This paper presents UDON, a real‑time multi‑agent neural implicit mapping framework that introduces a novel uncertainty‑weighted distributed optimization to achieve high‑quality mapping under severe communication deterioration. The uncertainty weighting prioritizes more reliable portions of the map, while the distributed optimization isolates and penalizes mapping disagreement between individual pairs of communicating agents. We conduct extensive experiments on standard benchmark datasets and real‑world robot hardware. We demonstrate that UDON significantly outperforms existing baselines, maintaining high‑fidelity reconstructions and consistent scene representations even under extreme communication degradation (as low as 1% success rate).

PaperID: 8658, https://arxiv.org/pdf/2509.12612.pdf

Authors: Daojun Chen, Xi Wang, Shenyuan Ren, Qingzhi Ma, Pengpeng Zhao, An Liu

Title: GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL

Abstract:
While Large Language Models have significantly advanced Text2SQL generation, a critical semantic gap persists where syntactically valid queries often misinterpret user intent. To mitigate this challenge, we propose GBV‑SQL, a novel multi‑agent framework that introduces Guided Generation with SQL2Text Back‑translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. Critically, our investigation reveals that current evaluation is undermined by a systemic issue: the poor quality of the benchmarks themselves. We introduce a formal typology for "Gold Errors", which are pervasive flaws in the ground‑truth data, and demonstrate how they obscure true model performance. On the challenging BIRD benchmark, GBV‑SQL achieves 63.23% execution accuracy, a 5.8% absolute improvement. After removing flawed examples, GBV‑SQL achieves 96.5% (dev) and 97.6% (test) execution accuracy on the Spider benchmark. Our work offers both a robust framework for semantic validation and a critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation.

PaperID: 8659, https://arxiv.org/pdf/2509.12546.pdf

Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

Title: Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

Abstract:
Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real‑world efficacy,which we attribute to the ecological invalidity of training data.This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text‑image interactions that accompany forgeries in social media. To solve this,we propose a multi‑agent framework where LLM‑poweredagents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text‑image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulationdriven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.

PaperID: 8660, https://arxiv.org/pdf/2509.12507.pdf

Authors: Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow

Title: Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents

Abstract:
One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non‑verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state‑of‑the‑art supervised models, highlighting the promise of imitation‑RL for communicative gesture generation and its potential application to robots.

PaperID: 8661, https://arxiv.org/pdf/2509.12484.pdf

Authors: Ruimeng Hu, Jihao Long, Haosheng Zhou

Title: Finite-Agent Stochastic Differential Games on Large Graphs: II. Graph-Based Architectures

Abstract:
We propose a novel neural network architecture, called Non‑Trainable Modification (NTM), for computing Nash equilibria in stochastic differential games (SDGs) on graphs. These games model a broad class of graph‑structured multi‑agent systems arising in finance, robotics, energy, and social dynamics, where agents interact locally under uncertainty. The NTM architecture imposes a graph‑guided sparsification on feedforward neural networks, embedding fixed, non‑trainable components aligned with the underlying graph topology. This design enhances interpretability and stability, while significantly reducing the number of trainable parameters in large‑scale, sparse settings. We theoretically establish a universal approximation property for NTM in static games on graphs and numerically validate its expressivity and robustness through supervised learning tasks. Building on this foundation, we incorporate NTM into two state‑of‑the‑art game solvers, Direct Parameterization and Deep BSDE, yielding their sparse variants (NTM‑DP and NTM‑DBSDE). Numerical experiments on three SDGs across various graph structures demonstrate that NTM‑based methods achieve performance comparable to their fully trainable counterparts, while offering improved computational efficiency.

PaperID: 8662, https://arxiv.org/pdf/2509.12462.pdf

Authors: Chuxu Song, Dheekshith Dev Manohar Mekala, Hao Wang, Richard Martin

Title: Redefining Website Fingerprinting Attacks With Multiagent LLMs

Abstract:
Website Fingerprinting (WFP) uses deep learning models to classify encrypted network traffic to infer visited websites. While historically effective, prior methods fail to generalize to modern web environments. Single‑page applications (SPAs) eliminate the paradigm of websites as sets of discrete pages, undermining page‑based classification, and traffic from scripted browsers lacks the behavioral richness seen in real user sessions. Our study reveals that users exhibit highly diverse behaviors even on the same website, producing traffic patterns that vary significantly across individuals. This behavioral entropy makes WFP a harder problem than previously assumed and highlights the need for larger, more diverse, and representative datasets to achieve robust performance. To address this, we propose a new paradigm: we drop session‑boundaries in favor of contiguous traffic segments and develop a scalable data generation pipeline using large language models (LLM) agents. These multi‑agent systems coordinate decision‑making and browser interaction to simulate realistic, persona‑driven browsing behavior at 3‑‑5x lower cost than human collection. We evaluate nine state‑of‑the‑art WFP models on traffic from 20 modern websites browsed by 30 real users, and compare training performance across human, scripted, and LLM‑generated datasets. All models achieve under 10% accuracy when trained on scripted traffic and tested on human data. In contrast, LLM‑generated traffic boosts accuracy into the 80% range, demonstrating strong generalization to real‑world traces. Our findings indicate that for modern WFP, model performance is increasingly bottlenecked by data quality, and that scalable, semantically grounded synthetic traffic is essential for capturing the complexity of real user behavior.

PaperID: 8663, https://arxiv.org/pdf/2509.12446.pdf

Authors: Dawei Xiang, Wenyan Xu, Kexin Chu, Tianqi Ding, Zixu Shen, Yiming Zeng, Jianchang Su, Wei Zhang

Title: PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization

Abstract:
The rapid advancement of generative AI has democratized access to powerful tools such as Text‑to‑Image models. However, to generate high‑quality images, users must still craft detailed prompts specifying scene, style, and context‑often through multiple rounds of refinement. We propose PromptSculptor, a novel multi‑agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt. By leveraging Chain‑of‑Thought reasoning, our framework effectively infers hidden context and enriches scene and background details. To iteratively refine the prompt, a self‑evaluation agent aligns the modified prompt with the original input, while a feedback‑tuning agent incorporates user feedback for further refinement. Experimental results demonstrate that PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. Moreover, its model‑agnostic design allows seamless integration with various T2I models, paving the way for industrial applications.

PaperID: 8664, https://arxiv.org/pdf/2509.12443.pdf

Authors: Sparsh Gupta, Kamalavasan Kamalakkannan, Maxim Moraru, Galen Shipman, Patrick Diehl

Title: From Legacy Fortran to Portable Kokkos: An Autonomous Agentic AI Workflow

Abstract:
Scientific applications continue to rely on legacy Fortran codebases originally developed for homogeneous, CPU‑based systems. As High‑Performance Computing (HPC) shifts toward heterogeneous GPU‑accelerated architectures, many accelerators lack native Fortran bindings, creating an urgent need to modernize legacy codes for portability. Frameworks like Kokkos provide performance portability and a single‑source C++ abstraction, but manual Fortran‑to‑Kokkos porting demands significant expertise and time. Large language models (LLMs) have shown promise in source‑to‑source code generation, yet their use in fully autonomous workflows for translating and optimizing parallel code remains largely unexplored, especially for performance portability across diverse hardware. This paper presents an agentic AI workflow where specialized LLM "agents" collaborate to translate, validate, compile, run, test, debug, and optimize Fortran kernels into portable Kokkos C++ programs. Results show the pipeline modernizes a range of benchmark kernels, producing performance‑portable Kokkos codes across hardware partitions. Paid OpenAI models such as GPT‑5 and o4‑mini‑high executed the workflow for only a few U.S. dollars, generating optimized codes that surpassed Fortran baselines, whereas open‑source models like Llama4‑Maverick often failed to yield functional codes. This work demonstrates the feasibility of agentic AI for Fortran‑to‑Kokkos transformation and offers a pathway for autonomously modernizing legacy scientific applications to run portably and efficiently on diverse supercomputers. It further highlights the potential of LLM‑driven agentic systems to perform structured, domain‑specific reasoning tasks in scientific and systems‑oriented applications.

PaperID: 8665, https://arxiv.org/pdf/2509.12440.pdf

Authors: Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

Title: MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Abstract:
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real‑world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact‑checking. MedFact comprises 2,116 expert‑annotated instances curated from diverse real‑world texts, spanning 13 medical specialties, 8 fine‑grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI‑human framework where iterative expert feedback refines an AI‑driven, multi‑criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top‑performing models falling short of human performance. Furthermore, our analysis uncovers a frequent ``over‑criticism'' phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi‑agent collaboration and inference‑time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

PaperID: 8666, https://arxiv.org/pdf/2509.12423.pdf

Authors: Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, Anatoly Efros

Title: Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition

Abstract:
Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter‑based, multi‑modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models which can run on‑device to provide a privacy‑preserving, low‑cost, and low‑latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine‑tuned model operating on the aggregated summaries. This method improves intent understanding in resource‑constrained models, even surpassing the base performance of large MLLMs.

PaperID: 8667, https://arxiv.org/pdf/2509.12390.pdf

Authors: Evangelos Psomiadis, Panagiotis Tsiotras

Title: Distributed Event-Triggered Distance-Based Formation Control for Multi-Agent Systems

Abstract:
This paper addresses the problem of collaborative formation control for multi‑agent systems with limited resources. We consider a team of robots tasked with achieving a desired formation from an arbitrary initial configuration. To reduce unnecessary control updates and conserve resources, we propose a distributed event‑triggered formation controller. Unlike the well‑studied linear formation control strategies, the proposed controller is nonlinear and relies on inter‑agent distance measurements. Control updates are triggered only when the measurement error exceeds a predefined threshold, ensuring system stability while minimizing actuation effort. We also employ a distributed control barrier function to guarantee inter‑agent collision avoidance. The proposed controller is validated through extensive simulations and real‑world experiments involving different formations, communication topologies, scalability tests, and variations in design parameters, while also being compared against periodic triggering strategies. Results demonstrate that the event‑triggered approach significantly reduces control effort while preserving formation performance.

PaperID: 8668, https://arxiv.org/pdf/2509.12378.pdf

Authors: Ruining Yang, Jingyuan Zhou, Qiqing Wang, Jinhao Liang, Kaidi Yang

Title: Platoon-Centric Green Light Optimal Speed Advisory Using Safe Reinforcement Learning

Abstract:
With recent advancements in Connected Autonomous Vehicles (CAVs), Green Light Optimal Speed Advisory (GLOSA) emerges as a promising eco‑driving strategy to reduce the number of stops and idle time at intersections, thereby reducing energy consumption and emissions. Existing studies typically improve energy and travel efficiency for individual CAVs without considering their impacts on the entire mixed‑traffic platoon, leading to inefficient traffic flow. While Reinforcement Learning (RL) has the potential to achieve platoon‑level control in a mixed‑traffic environment, the training of RL is still challenged by (i) car‑following safety, i.e., CAVs should not collide with their immediate preceding vehicles, and (ii) red‑light safety, i.e., CAVs should not run red lights. To address these challenges, this paper develops a platoon‑centric, safe RL‑based GLOSA system that uses a multi‑agent controller to optimize CAV speed while achieving a balance between energy consumption and travel efficiency. We further incorporate Control Barrier Functions (CBFs) into the RL‑based policy to provide explicit safety guarantees in terms of car‑following safety and red‑light safety. Our simulation results illustrate that our proposed method outperforms state‑of‑the‑art methods in terms of driving safety and platoon energy consumption.

PaperID: 8669, https://arxiv.org/pdf/2509.12307.pdf

Authors: Xuli Cai, Poonam Lohan, Burak Kantarci

Title: FLARE: Flying Learning Agents for Resource Efficiency in Next-Gen UAV Networks

Abstract:
This letter addresses a critical challenge in the context of 6G and beyond wireless networks, the joint optimization of power and bandwidth resource allocation for aerial intelligent platforms, specifically uncrewed aerial vehicles (UAVs), operating in highly dynamic environments with mobile ground user equipment (UEs). We introduce FLARE (Flying Learning Agents for Resource Efficiency), a learning‑enabled aerial intelligence framework that jointly optimizes UAV positioning, altitude, transmit power, and bandwidth allocation in real‑time. To adapt to UE mobility, we employ Silhouette‑based K‑Means clustering, enabling dynamic grouping of users and UAVs' deployment at cluster centroids for efficient service delivery. The problem is modeled as a multi‑agent control task, with bandwidth discretized into resource blocks and power treated as a continuous variable. To solve this, our proposed framework, FLARE, employs a hybrid reinforcement learning strategy that combines Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) and Deep Q‑Network (DQN) to enhance learning efficiency. Simulation results demonstrate that our method significantly enhances user coverage, achieving a 73.45% improvement in the number of served users under a 5 Mbps data rate constraint, outperforming MADDPG baseline.

PaperID: 8670, https://arxiv.org/pdf/2509.12282.pdf

Authors: Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer

Title: AISSISTANT: Human-AI Collaborative Review and Perspective Research Workflows in Data Science

Abstract:
High‑quality scientific review and perspective papers require substantial time and effort, limiting researchers' ability to synthesize emerging knowledge. While Large Language Models (LLMs) leverage AI Scientists for scientific workflows, existing frameworks focus primarily on autonomous workflows with very limited human intervention. We introduce AIssistant, the first open‑source agentic framework for Human‑‑AI collaborative generation of scientific perspectives and review research in data science. AIssistant employs specialized LLM‑driven agents augmented with external scholarly tools and allows human intervention throughout the workflow. The framework consists of two main multi‑agent systems: Research Workflow with seven agents and a Paper Writing Workflow with eight agents. We conducted a comprehensive evaluation with both human expert reviewers and LLM‑based assessment following NeurIPS standards. Our experiments show that OpenAI o1 achieves the highest quality scores on chain‑of‑thought prompting with augmented Literature Search tools. We also conducted a Human‑‑AI interaction survey with results showing a 65.7% time savings. We believe that our work establishes a baseline for Human‑‑AI collaborative scientific workflow for review and perspective research in data science, demonstrating that agent‑augmented pipelines substantially reduce effort while maintaining research integrity through strategic human oversight.

PaperID: 8671, https://arxiv.org/pdf/2509.12117.pdf

Authors: Aryaman Reddi, Gabriele Tiboni, Jan Peters, Carlo D'Eramo

Title: $K$-Level Policy Gradients for Multi-Agent Reinforcement Learning

Abstract:
Actor‑critic algorithms for deep multi‑agent reinforcement learning (MARL) typically employ a policy update that responds to the current strategies of other agents. While being straightforward, this approach does not account for the updates of other agents at the same update step, resulting in miscoordination. In this paper, we introduce the K‑Level Policy Gradient (KPG), a method that recursively updates each agent against the updated policies of other agents, speeding up the discovery of effective coordinated policies. We theoretically prove that KPG with finite iterates achieves monotonic convergence to a local Nash equilibrium under certain conditions. We provide principled implementations of KPG by applying it to the deep MARL algorithms MAPPO, MADDPG, and FACMAC. Empirically, we demonstrate superior performance over existing deep MARL algorithms in StarCraft II and multi‑agent MuJoCo.

PaperID: 8672, https://arxiv.org/pdf/2509.12048.pdf

Authors: Hoon Sagong, Heesu Kim, Hanbeen Hong

Title: Hi-DARTS: Hierarchical Dynamically Adapting Reinforcement Trading System

Abstract:
Conventional autonomous trading systems struggle to balance computational efficiency and market responsiveness due to their fixed operating frequency. We propose Hi‑DARTS, a hierarchical multi‑agent reinforcement learning framework that addresses this trade‑off. Hi‑DARTS utilizes a meta‑agent to analyze market volatility and dynamically activate specialized Time Frame Agents for high‑frequency or low‑frequency trading as needed. During back‑testing on AAPL stock from January 2024 to May 2025, Hi‑DARTS yielded a cumulative return of 25.17% with a Sharpe Ratio of 0.75. This performance surpasses standard benchmarks, including a passive buy‑and‑hold strategy on AAPL (12.19% return) and the S&P 500 ETF (SPY) (20.01% return). Our work demonstrates that dynamic, hierarchical agents can achieve superior risk‑adjusted returns while maintaining high computational efficiency.

PaperID: 8673, https://arxiv.org/pdf/2509.12034.pdf

Authors: Emmanuel Adjei Domfeh, Christopher L. Dancy

Title: Human-AI Use Patterns for Decision-Making in Disaster Scenarios: A Systematic Review

Abstract:
In high‑stakes disaster scenarios, timely and informed decision‑making is critical yet often challenged by uncertainty, dynamic environments, and limited resources. This paper presents a systematic review of Human‑AI collaboration patterns that support decision‑making across all disaster management phases. Drawing from 51 peer‑reviewed studies, we identify four major categories: Human‑AI Decision Support Systems, Task and Resource Coordination, Trust and Transparency, and Simulation and Training. Within these, we analyze sub‑patterns such as cognitive‑augmented intelligence, multi‑agent coordination, explainable AI, and virtual training environments. Our review highlights how AI systems may enhance situational awareness, improves response efficiency, and support complex decision‑making, while also surfacing critical limitations in scalability, interpretability, and system interoperability. We conclude by outlining key challenges and future research directions, emphasizing the need for adaptive, trustworthy, and context‑aware Human‑AI systems to improve disaster resilience and equitable recovery outcomes.

PaperID: 8674, https://arxiv.org/pdf/2509.11973.pdf

Authors: Markus J. Buehler

Title: MusicSwarm: Biologically Inspired Intelligence for Music Composition

Abstract:
We show that coherent, long‑form musical composition can emerge from a decentralized swarm of identical, frozen foundation models that coordinate via stigmergic, peer‑to‑peer signals, without any weight updates. We compare a centralized multi‑agent system with a global critic to a fully decentralized swarm in which bar‑wise agents sense and deposit harmonic, rhythmic, and structural cues, adapt short‑term memory, and reach consensus. Across symbolic, audio, and graph‑theoretic analyses, the swarm yields superior quality while delivering greater diversity and structural variety and leads across creativity metrics. The dynamics contract toward a stable configuration of complementary roles, and self‑similarity networks reveal a small‑world architecture with efficient long‑range connectivity and specialized bridging motifs, clarifying how local novelties consolidate into global musical form. By shifting specialization from parameter updates to interaction rules, shared memory, and dynamic consensus, MusicSwarm provides a compute‑ and data‑efficient route to long‑horizon creative structure that is immediately transferable beyond music to collaborative writing, design, and scientific discovery.

PaperID: 8675, https://arxiv.org/pdf/2509.11965.pdf

Authors: Bart M. P. Jansen, Jeroen S. K. Lamme, Ruben F. A. Verhaegh

Title: An ETH-Tight FPT Algorithm for Rejection-Proof Set Packing with Applications to Kidney Exchange

Abstract:
We study the parameterized complexity of a recently introduced multi‑agent variant of the Kidney Exchange problem. Given a directed graph G and integers d and k, the standard problem asks whether G contains a packing of vertex‑disjoint cycles, each of length \leq d, covering at least k vertices in total. In the multi‑agent setting we consider, the vertex set is partitioned over several agents who reject a cycle packing as solution if it can be modified into an alternative packing that covers more of their own vertices. A cycle packing is called rejection‑proof if no agent rejects it and the problem asks whether such a packing exists that covers at least k vertices. We exploit the sunflower lemma on a set packing formulation of the problem to give a kernel for this Σ_2^P‑complete problem that is polynomial in k for all constant values of d. We also provide a 2^\mathcalO(k \log k) + n^\mathcalO(1) algorithm based on it and show that this FPT algorithm is asymptotically optimal under the ETH. Further, we generalize the problem by including an additional positive integer c in the input that naturally captures how much agents can modify a given cycle packing to reject it. For every constant c, the resulting problem simplifies from being Σ_2^P‑complete to NP‑complete. The super‑exponential lower bound already holds for c=2, though. We present an ad‑hoc single‑exponential algorithm for c = 1. These results reveal an interesting discrepancy between the classical and parameterized complexity of the problem and give a good view of what makes it hard.

PaperID: 8676, https://arxiv.org/pdf/2509.11944.pdf

Authors: Susanta Mitra

Title: Agentic Temporal Graph of Reasoning with Multimodal Language Models: A Potential AI Aid to Healthcare

Abstract:
Healthcare and medicine are multimodal disciplines that deal with multimodal data for reasoning and diagnosing multiple diseases. Although some multimodal reasoning models have emerged for reasoning complex tasks in scientific domains, their applications in the healthcare domain remain limited and fall short in correct reasoning for diagnosis. To address the challenges of multimodal medical reasoning for correct diagnosis and assist the healthcare professionals, a novel temporal graph‑based reasoning process modelled through a directed graph has been proposed in the current work. It helps in accommodating dynamic changes in reasons through backtracking, refining the reasoning content, and creating new or deleting existing reasons to reach the best recommendation or answer. Again, consideration of multimodal data at different time points can enable tracking and analysis of patient health and disease progression. Moreover, the proposed multi‑agent temporal reasoning framework provides task distributions and a cross‑validation mechanism to further enhance the accuracy of reasoning outputs. A few basic experiments and analysis results justify the novelty and practical utility of the proposed preliminary approach.

PaperID: 8677, https://arxiv.org/pdf/2509.11943.pdf

Authors: Antonin Sulc, Thorsten Hellert

Title: Agentic System with Modal Logic for Autonomous Diagnostics

Abstract:
The development of intelligent agents, particularly those powered by language models (LMs), has shown a critical role in various environments that require intelligent and autonomous decision‑making. Environments are not passive testing grounds, and they represent the data required for agents to learn and exhibit in very challenging conditions that require adaptive, complex, and autonomous capacity to make decisions. While the paradigm of scaling models and datasets has led to remarkable emergent capabilities, we argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial, yet underexplored, dimension of AI research. This paper introduces a neuro‑symbolic multi‑agent architecture where the belief states of individual agents are formally represented as Kripke models. This foundational choice enables them to reason about known concepts of \emphpossibility and \emphnecessity using the formal language of modal logic. In this work, we use immutable, domain‑specific knowledge to make an informed root cause diagnosis, which is encoded as logical constraints essential for proper, reliable, and explainable diagnosis. In the proposed model, we show constraints that actively guide the hypothesis generation of LMs, effectively preventing them from reaching physically or logically untenable conclusions. In a high‑fidelity simulated particle accelerator environment, our system successfully diagnoses complex, cascading failures by combining the powerful semantic intuition of LMs with the rigorous, verifiable validation of modal logic and a factual world model and showcasing a viable path toward more robust, reliable, and verifiable autonomous agents.

PaperID: 8678, https://arxiv.org/pdf/2509.11942.pdf

Authors: Luís F. Gomes, Xin Zhou, David Lo, Rui Abreu

Title: VisDocSketcher: Towards Scalable Visual Documentation with Agentic Systems

Abstract:
Visual documentation is an effective tool for reducing the cognitive barrier developers face when understanding unfamiliar code, enabling more intuitive comprehension. Compared to textual documentation, it provides a higher‑level understanding of the system structure and data flow. Developers usually prefer visual representations over lengthy textual descriptions for large software systems. Visual documentation is both difficult to produce and challenging to evaluate. Manually creating it is time‑consuming, and currently, no existing approach can automatically generate high‑level visual documentation directly from code. Its evaluation is often subjective, making it difficult to standardize and automate. To address these challenges, this paper presents the first exploration of using agentic LLM systems to automatically generate visual documentation. We introduce VisDocSketcher, the first agent‑based approach that combines static analysis with LLM agents to identify key elements in the code and produce corresponding visual representations. We propose a novel evaluation framework, AutoSketchEval, for assessing the quality of generated visual documentation using code‑level metrics. The experimental results show that our approach can valid visual documentation for 74.4% of the samples. It shows an improvement of 26.7‑39.8% over a simple template‑based baseline. Our evaluation framework can reliably distinguish high‑quality (code‑aligned) visual documentation from low‑quality (non‑aligned) ones, achieving an AUC exceeding 0.87. Our work lays the foundation for future research on automated visual documentation by introducing practical tools that not only generate valid visual representations but also reliably assess their quality.

PaperID: 8679, https://arxiv.org/pdf/2509.11917.pdf

Authors: Yuwen Ma, Yongqiang Wang, Sarah K. Spurgeon, Boli Chen

Title: Distributed Finite-Horizon Optimal Control for Consensus with Differential Privacy Guarantees

Abstract:
This paper addresses the problem of privacy‑preserving consensus control for multi‑agent systems (MAS) using differential privacy. We propose a novel distributed finite‑horizon linear quadratic regulator (LQR) framework, in which agents share individual state information while preserving the confidentiality of their local pairwise weight matrices, which are considered sensitive data in MAS. Protecting these matrices effectively safeguards each agent's private cost function and control preferences. Our solution injects consensus error‑dependent Laplace noise into the communicated state information and employs a carefully designed time‑dependent scaling factor in the local cost functions. This approach guarantees bounded consensus and achieves rigorous ε‑differential privacy for the weight matrices without relying on specific noise distribution assumptions. Additionally, we analytically characterize the trade‑off between consensus accuracy and privacy level, offering clear guidelines on how to enhance consensus performance through appropriate scaling of the LQR weight matrices and the privacy budget.

PaperID: 8680, https://arxiv.org/pdf/2509.11868.pdf

Authors: Sabrina Patania, Luca Annese, Anna Lambiase, Anita Pellegrini, Tom Foulsham, Azzurra Ruggeri, Silvia Rossi, Silvia Serino, Dimitri Ognibene

Title: Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models

Abstract:
Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. This work investigates the PerspAct system [1], which integrates the ReAct (Reason and Act) paradigm with Large Language Models (LLMs) to simulate developmental stages of perspective taking, grounded in Selman's theory [2]. Using an extended director task, we evaluate GPT's ability to generate internal narratives aligned with specified developmental stages, and assess how these influence collaborative performance both qualitatively (action selection) and quantitatively (task efficiency). Results show that GPT reliably produces developmentally‑consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges help refine internal representations. Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. These findings highlight the potential of integrating embodied perspective taking and language in LLMs to better model developmental dynamics and stress the importance of evaluating internal speech during combined linguistic and embodied tasks.

PaperID: 8681, https://arxiv.org/pdf/2509.11719.pdf

Authors: Bingqing Wei, Lianmin Chen, Zhongyu Xia, Yongtao Wang

Title: HeLoFusion: An Efficient and Scalable Encoder for Modeling Heterogeneous and Multi-Scale Interactions in Trajectory Prediction

Abstract:
Multi‑agent trajectory prediction in autonomous driving requires a comprehensive understanding of complex social dynamics. Existing methods, however, often struggle to capture the full richness of these dynamics, particularly the co‑existence of multi‑scale interactions and the diverse behaviors of heterogeneous agents. To address these challenges, this paper introduces HeLoFusion, an efficient and scalable encoder for modeling heterogeneous and multi‑scale agent interactions. Instead of relying on global context, HeLoFusion constructs local, multi‑scale graphs centered on each agent, allowing it to effectively model both direct pairwise dependencies and complex group‑wise interactions (e.g., platooning vehicles or pedestrian crowds). Furthermore, HeLoFusion tackles the critical challenge of agent heterogeneity through an aggregation‑decomposition message‑passing scheme and type‑specific feature networks, enabling it to learn nuanced, type‑dependent interaction patterns. This locality‑focused approach enables a principled representation of multi‑level social context, yielding powerful and expressive agent embeddings. On the challenging Waymo Open Motion Dataset, HeLoFusion achieves state‑of‑the‑art performance, setting new benchmarks for key metrics including Soft mAP and minADE. Our work demonstrates that a locality‑grounded architecture, which explicitly models multi‑scale and heterogeneous interactions, is a highly effective strategy for advancing motion forecasting.

PaperID: 8682, https://arxiv.org/pdf/2509.11656.pdf

Authors: Jonas Becker, Lars Benedikt Kaesberg, Niklas Bauer, Jan Philip Wahle, Terry Ruas, Bela Gipp

Title: MALLM: Multi-Agent Large Language Models Framework

Abstract:
Multi‑agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test‑time compute and leveraging expertise. Current frameworks for multi‑agent debate are often designed towards tool use, lack integrated evaluation, or provide limited configurability of agent personas, response generators, discussion paradigms, and decision protocols. We introduce MALLM (Multi‑Agent Large Language Models), an open‑source framework that enables systematic analysis of MAD components. MALLM offers more than 144 unique configurations of MAD, including (1) agent personas (e.g., Expert, Personality), (2) response generators (e.g., Critical, Reasoning), (3) discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g., Voting, Consensus). MALLM uses simple configuration files to define a debate. Furthermore, MALLM can load any textual Hugging Face dataset (e.g., MMLU‑Pro, WinoGrande) and provides an evaluation pipeline for easy comparison of MAD configurations. MALLM enables researchers to systematically configure, run, and evaluate debates for their problems, facilitating the understanding of the components and their interplay.

PaperID: 8683, https://arxiv.org/pdf/2509.11640.pdf

Authors: Deepak Mallya, Arpita Sinha, Leena Vachhani

Title: $ε$-Optimal Multi-Agent Patrol using Recurrent Strategy

Abstract:
The multi‑agent patrol problem refers to repeatedly visiting different locations in an environment using multiple autonomous agents. For over two decades, researchers have studied this problem in various settings. While providing valuable insights into the problem, the works in existing literature have not commented on the nature of the optimal solutions to the problem. We first show that an ε‑approximate recurrent patrol strategy exists for every feasible patrol strategy. Then, we establish the existence of a recurrent patrol strategy that is an ε‑optimal solution to the General Patrol Problem. The factor ε is proportional to the discretisation constant D, which can be arbitrarily small and is independent of the number of patrol agents and the size of the environment. This result holds for a variety of problem formulations already studied. We also provide an algorithmic approach to determine an ε‑approximate recurrent patrol strategy for a patrol strategy created by any method from the literature. We perform extensive simulations in graphs based on real‑life environments to validate the claims made in this work.

PaperID: 8684, https://arxiv.org/pdf/2509.11626.pdf

Authors: Prerna Agarwal, Himanshu Gupta, Soujanya Soni, Rohith Vallam, Renuka Sindhgatta, Sameep Mehta

Title: Automated Creation and Enrichment Framework for Improved Invocation of Enterprise APIs as Tools

Abstract:
Recent advancements in Large Language Models (LLMs) has lead to the development of agents capable of complex reasoning and interaction with external tools. In enterprise contexts, the effective use of such tools that are often enabled by application programming interfaces (APIs), is hindered by poor documentation, complex input or output schema, and large number of operations. These challenges make tool selection difficult and reduce the accuracy of payload formation by up to 25%. We propose ACE, an automated tool creation and enrichment framework that transforms enterprise APIs into LLM‑compatible tools. ACE, (i) generates enriched tool specifications with parameter descriptions and examples to improve selection and invocation accuracy, and (ii) incorporates a dynamic shortlisting mechanism that filters relevant tools at runtime, reducing prompt complexity while maintaining scalability. We validate our framework on both proprietary and open‑source APIs and demonstrate its integration with agentic frameworks. To the best of our knowledge, ACE is the first end‑to‑end framework that automates the creation, enrichment, and dynamic selection of enterprise API tools for LLM agents.

PaperID: 8685, https://arxiv.org/pdf/2509.11595.pdf

Authors: Sabin Huda, Ernest Foo, Zahra Jadidi, MA Hakim Newton, Abdul Sattar

Title: AMLNet: A Knowledge-Based Multi-Agent Framework to Generate and Detect Realistic Money Laundering Transactions

Abstract:
Anti‑money laundering (AML) research is constrained by the lack of publicly shareable, regulation‑aligned transaction datasets. We present AMLNet, a knowledge‑based multi‑agent framework with two coordinated units: a regulation‑aware transaction generator and an ensemble detection pipeline. The generator produces 1,090,173 synthetic transactions (approximately 0.16% laundering‑positive) spanning core laundering phases (placement, layering, integration) and advanced typologies (e.g., structuring, adaptive threshold behavior). Regulatory alignment reaches 75% based on AUSTRAC rule coverage (Section 4.2), while a composite technical fidelity score of 0.75 summarizes temporal, structural, and behavioral realism components (Section 4.4). The detection ensemble achieves F1 0.90 (precision 0.84, recall 0.97) on the internal test partitions of AMLNet and adapts to the external SynthAML dataset, indicating architectural generalizability across different synthetic generation paradigms. We provide multi‑dimensional evaluation (regulatory, temporal, network, behavioral) and release the dataset (Version 1.0, https://doi.org/10.5281/zenodo.16736515), to advance reproducible and regulation‑conscious AML experimentation.

PaperID: 8686, https://arxiv.org/pdf/2509.11523.pdf

Authors: Ziliang Wang, Ge Li, Jia Li, Hao Zhu, Zhi Jin

Title: VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection

Abstract:
The application of language models to project‑level vulnerability detection remains challenging, owing to the dual requirement of accurately localizing security‑sensitive code and correctly correlating and reasoning over complex program context. We present VulAgent, a multi‑agent vulnerability detection framework based on hypothesis validation. Our design is inspired by how human auditors review code: when noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and then verify the hypothesis against the surrounding context. VulAgent implements a semantics‑sensitive, multi‑view detection pipeline: specialized agents, each aligned to a specific analysis perspective (e.g., memory, authorization), collaboratively surface and precisely localize sensitive code sites with higher coverage. Building on this, VulAgent adopts a hypothesis‑validation paradigm: for each vulnerability report, it builds hypothesis conditions and a trigger path, steering the LLM to target the relevant program context and defensive checks during verification, which reduces false positives. On average across the two datasets, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable‑‑fixed code pairs by up to 450% (246% on average), and reduces the false positive rate by about 36% compared with state‑of‑the‑art LLM‑based baselines.

PaperID: 8687, https://arxiv.org/pdf/2509.11514.pdf

Authors: Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan

Title: LVLMs are Bad at Overhearing Human Referential Communication

Abstract:
During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re‑use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state‑of‑the‑art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object‑matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

PaperID: 8688, https://arxiv.org/pdf/2509.11508.pdf

Authors: Tinglong Deng, Hang Tao, Xinxiang Wang, Yinyan Wang, Hanjiang Luo

Title: SafeDiver: Cooperative AUV-USV Assisted Diver Communication via Multi-agent Reinforcement Learning Approach

Abstract:
As underwater human activities are increasing, the demand for underwater communication service presents a significant challenge. Existing underwater diver communication methods face hurdles due to inherent disadvantages and complex underwater environments. To address this issue, we propose a scheme that utilizes maritime unmanned systems to assist divers with reliable and high‑speed communication. Multiple AUVs are equipped with optical and acoustic multimodal communication devices as relay nodes, providing adaptive communication services based on changes in the diver's activity area. By using a multi‑agent reinforcement learning (MARL) approach to control the cooperative movement of AUVs, high‑speed and reliable data transmission between divers can be achieved. At the same time, utilizing the advantages of on‑demand deployment and wide coverage of unmanned surface vehicles (USVs) as surface relay nodes to coordinate and forward information from AUVs, and controlling AUVs to adaptively select relay USV nodes for data transmission, high‑quality communication between divers and surface platform can be achieved. Through simulation verification, the proposed scheme can effectively achieve reliable and high‑speed communication for divers.

PaperID: 8689, https://arxiv.org/pdf/2509.11507.pdf

Authors: Jared Zhu, Junde Wu

Title: MedicalOS: An LLM Agent based Operating System for Digital Healthcare

Abstract:
Decades' advances in digital health technologies, such as electronic health records, have largely streamlined routine clinical processes. Yet, most these systems are still hard to learn and use: Clinicians often face the burden of managing multiple tools, repeating manual actions for each patient, navigating complicated UI trees to locate functions, and spending significant time on administration instead of caring for patients. The recent rise of large language model (LLM) based agents demonstrates exceptional capability in coding and computer operation, revealing the potential for humans to interact with operating systems and software not by direct manipulation, but by instructing agents through natural language. This shift highlights the need for an abstraction layer, an agent‑computer interface, that translates human language into machine‑executable commands. In digital healthcare, however, requires a more domain‑specific abstractions that strictly follow trusted clinical guidelines and procedural standards to ensure safety, transparency, and compliance. To address this need, we present MedicalOS, a unified agent‑based operational system designed as such a domain‑specific abstract layer for healthcare. It translates human instructions into pre‑defined digital healthcare commands, such as patient inquiry, history retrieval, exam management, report generation, referrals, treatment planning, that we wrapped as off‑the‑shelf tools using machine languages (e.g., Python, APIs, MCP, Linux). We empirically validate MedicalOS on 214 patient cases across 22 specialties, demonstrating high diagnostic accuracy and confidence, clinically sound examination requests, and consistent generation of structured reports and medication recommendations. These results highlight MedicalOS as a trustworthy and scalable foundation for advancing workflow automation in clinical practice.

PaperID: 8690, https://arxiv.org/pdf/2509.11361.pdf

Authors: Yichen Han, Yuhang Han, Siteng Huang, Guanyu Liu, Zhengpeng Zhou, Bojun Liu, Yujia Zhang, Isaac N Shi, Lewei He, Tianyu Shi

Title: MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization

Abstract:
Prompt engineering is crucial for fully leveraging large language models (LLMs), yet most existing optimization methods follow a single trajectory, resulting in limited adaptability, gradient conflicts, and high computational overhead. We propose MAPGD (Multi‑Agent Prompt Gradient Descent), a novel framework that reconceptualizes prompt optimization as a collaborative process among specialized agents. Each agent focuses on a distinct refinement dimension, such as instruction clarity, example selection, format structure, or stylistic adaptation, and their contributions are coordinated through semantic gradient embedding, conflict detection, and fusion. To further enhance robustness and stability, MAPGD introduces two new mechanisms: Hypersphere Constrained Gradient Clustering (HCGC), which enforces angular margin constraints for compact and well‑separated clusters, and Channel Adaptive Agent Weighting (CAAW), which dynamically reweights agent contributions based on validation performance. Experiments on classification and reasoning benchmarks show that MAPGD consistently surpasses single‑agent and random baselines in both accuracy and efficiency. Ablation studies confirm the effectiveness of gradient fusion, agent specialization, and conflict resolution. Together, these components establish MAPGD as a unified, gradient‑based, and interpretable framework for robust prompt optimization with theoretical convergence guarantees.

PaperID: 8691, https://arxiv.org/pdf/2509.11337.pdf

Authors: Ying Cao, Kun Yuan, Ali H. Sayed

Title: On the Escaping Efficiency of Distributed Adversarial Training Algorithms

Abstract:
Adversarial training has been widely studied in recent years due to its role in improving model robustness against adversarial attacks. This paper focuses on comparing different distributed adversarial training algorithms‑‑including centralized and decentralized strategies‑‑within multi‑agent learning environments. Previous studies have highlighted the importance of model flatness in determining robustness. To this end, we develop a general theoretical framework to study the escaping efficiency of these algorithms from local minima, which is closely related to the flatness of the resulting models. We show that when the perturbation bound is sufficiently small (i.e., when the attack strength is relatively mild) and a large batch size is used, decentralized adversarial training algorithms‑‑including consensus and diffusion‑‑are guaranteed to escape faster from local minima than the centralized strategy, thereby favoring flatter minima. However, as the perturbation bound increases, this trend may no longer hold. In the simulation results, we illustrate our theoretical findings and systematically compare the performance of models obtained through decentralized and centralized adversarial training algorithms. The results highlight the potential of decentralized strategies to enhance the robustness of models in distributed settings.

PaperID: 8692, https://arxiv.org/pdf/2509.11238.pdf

Authors: Dongming Jin, Zhi Jin, Yiran Zhang, Zheng Fang, Linyu Li, Yuanpeng He, Xiaohong Chen, Weisong Sun

Title: UserTrace: User-Level Requirements Generation and Traceability Recovery from Software Project Repositories

Abstract:
Software maintainability critically depends on high‑quality requirements descriptions and explicit traceability between requirements and code. Although automated code summarization (ACS) and requirements traceability (RT) techniques have been widely studied, existing ACS methods mainly generate implementation‑level (i.e., developer‑oriented) requirements (IRs) for fine‑grained units (e.g., methods), while RT techniques often overlook the impact of project evolution. As a result, user‑level (i.e., end user‑oriented) requirements (URs) and live trace links remain underexplored, despite their importance for supporting user understanding and for validating whether AI‑generated software aligns with user intent. To address this gap, we propose UserTrace, a multi‑agent system that automatically generates URs and recovers live trace links (from URs to IRs to code) from software repositories. UserTrace coordinates four specialized agents (i.e., Code Reviewer, Searcher, Writer, and Verifier) through a three‑phase process: structuring repository dependencies, deriving IRs for code units, and synthesizing URs with domain‑specific context. Our comparative evaluation shows that UserTrace produces URs with higher completeness, correctness, and helpfulness than an established baseline, and achieves superior precision in trace link recovery compared to five state‑of‑the‑art RT approaches. A user study further demonstrates that UserTrace helps end users validate whether the AI‑generated repositories align with their intent.

PaperID: 8693, https://arxiv.org/pdf/2509.11183.pdf

Authors: Emmanouil Karystinaios

Title: WeaveMuse: An Open Agentic System for Multimodal Music Understanding and Generation

Abstract:
Agentic AI has been standardized in industry as a practical paradigm for coordinating specialized models and tools to solve complex multimodal tasks. In this work, we present WeaveMuse, a multi‑agent system for music understanding, symbolic composition, and audio synthesis. Each specialist agent interprets user requests, derives machine‑actionable requirements (modalities, formats, constraints), and validates its own outputs, while a manager agent selects and sequences tools, mediates user interaction, and maintains state across turns. The system is extendable and deployable either locally, using quantization and inference strategies to fit diverse hardware budgets, or via the HFApi to preserve free community access to open models. Beyond out‑of‑the‑box use, the system emphasizes controllability and adaptation through constraint schemas, structured decoding, policy‑based inference, and parameter‑efficient adapters or distilled variants that tailor models to MIR tasks. A central design goal is to facilitate intermodal interaction across text, symbolic notation and visualization, and audio, enabling analysis‑synthesis‑render loops and addressing cross‑format constraints. The framework aims to democratize, implement, and make accessible MIR tools by supporting interchangeable open‑source models of various sizes, flexible memory management, and reproducible deployment paths.

PaperID: 8694, https://arxiv.org/pdf/2509.11079.pdf

Authors: Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, Xinyuan Song, Lewei He, Yang Jingsong

Title: Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows

Abstract:
Large Language Model (LLM)‑based agentic systems have shown strong capabilities across various tasks. However, existing multi‑agent frameworks often rely on static or task‑level workflows, which either over‑process simple queries or underperform on complex ones, while also neglecting the efficiency‑performance trade‑offs across heterogeneous LLMs. To address these limitations, we propose Difficulty‑Aware Agentic Orchestration (DAAO), which can dynamically generate query‑specific multi‑agent workflows guided by predicted query difficulty. DAAO comprises three interdependent modules: a variational autoencoder (VAE) for difficulty estimation, a modular operator allocator, and a cost‑ and performance‑aware LLM router. A self‑adjusting policy updates difficulty estimates based on workflow success, enabling simpler workflows for easy queries and more complex strategies for harder ones. Experiments on six benchmarks demonstrate that DAAO surpasses prior multi‑agent systems in both accuracy and inference efficiency, validating its effectiveness for adaptive, difficulty‑aware reasoning.

PaperID: 8695, https://arxiv.org/pdf/2509.11068.pdf

Authors: Zan-Kai Chong, Hiroyuki Ohsaki, Bryan Ng

Title: Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability

Abstract:
The landscape of Large Language Models (LLMs) shifts rapidly towards dynamic, multi‑agent systems. This introduces a fundamental challenge in establishing computational trust, specifically how one agent can verify that another's output was genuinely produced by a claimed LLM, and not falsified or generated by a cheaper or inferior model. To address this challenge, this paper proposes a verification framework that achieves tractable asymmetric effort, where the cost to verify a computation is substantially lower than the cost to perform it. Our approach is built upon the principle of deterministic replicability, a property inherent to autoregressive models that strictly necessitates a computationally homogeneous environment where all agents operate on identical hardware and software stacks. Within this defined context, our framework enables multiple validators to probabilistically audit small, random segments of an LLM's output and it distributes the verification workload effectively. The simulations demonstrated that targeted verification can be over 12 times faster than full regeneration, with tunable parameters to adjust the detection probability. By establishing a tractable mechanism for auditable LLM systems, our work offers a foundational layer for responsible AI and serves as a cornerstone for future research into the more complex, heterogeneous multi‑agent systems.

PaperID: 8696, https://arxiv.org/pdf/2509.11067.pdf

Authors: Liangxuan Guo, Bin Zhu, Qingqian Tao, Kangning Liu, Xun Zhao, Xianzhe Qin, Jin Gao, Guangfu Hao

Title: Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration

Abstract:
Autonomous agents for desktop automation struggle with complex multi‑step tasks due to poor coordination and inadequate quality control. We introduce Agentic Lybic, a novel multi‑agent system where the entire architecture operates as a finite‑state machine (FSM). This core innovation enables dynamic orchestration. Our system comprises four components: a Controller, a Manager, three Workers (Technician for code‑based operations, Operator for GUI interactions, and Analyst for decision support), and an Evaluator. The critical mechanism is the FSM‑based routing between these components, which provides flexibility and generalization by dynamically selecting the optimal execution strategy for each subtask. This principled orchestration, combined with robust quality gating, enables adaptive replanning and error recovery. Evaluated officially on the OSWorld benchmark, Agentic Lybic achieves a state‑of‑the‑art 57.07% success rate in 50 steps, substantially outperforming existing methods. Results demonstrate that principled multi‑agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.

PaperID: 8697, https://arxiv.org/pdf/2509.11062.pdf

Authors: Yuheng Yang, Wenjia Jiang, Yang Wang, Yi Song, Yiwei Wang, Chi Zhang

Title: Auto-Slides: An Interactive Multi-Agent System for Creating and Customizing Research Presentations

Abstract:
The rapid progress of large language models (LLMs) has opened new opportunities for education. While learners can interact with academic papers through LLM‑powered dialogue, limitations still exist: the lack of structured organization and the heavy reliance on text can impede systematic understanding and engagement with complex concepts. To address these challenges, we propose Auto‑Slides, an LLM‑driven system that converts research papers into pedagogically structured, multimodal slides (e.g., diagrams and tables). Drawing on cognitive science, it creates a presentation‑oriented narrative and allows iterative refinement via an interactive editor to better match learners' knowledge level and goals. Auto‑Slides further incorporates verification and knowledge retrieval mechanisms to ensure accuracy and contextual completeness. Through extensive user studies, Auto‑Slides demonstrates strong learner acceptance, improved structural support for understanding, and expert‑validated gains in narrative quality compared with conventional LLM‑based reading. Our contributions lie in designing a multi‑agent framework for transforming academic papers into pedagogically optimized slides and introducing interactive customization for personalized learning.

PaperID: 8698, https://arxiv.org/pdf/2509.11035.pdf

Authors: Yu Cui, Hang Fu, Haibin Zhang, Licheng Wang, Cong Zuo

Title: Free-MAD: Consensus-Free Multi-Agent Debate

Abstract:
Multi‑agent debate (MAD) is an emerging approach to improving the reasoning capabilities of large language models (LLMs). Existing MAD methods rely on multiple rounds of interaction among agents to reach consensus, and the final output is selected by majority voting in the last round. However, this consensus‑based design faces several limitations. First, multiple rounds of communication increases token overhead and limits scalability. Second, due to the inherent conformity of LLMs, agents that initially produce correct responses may be influenced by incorrect ones during the debate process, causing error propagation. Third, majority voting introduces randomness and unfairness in the decision‑making phase, and can degrade the reasoning performance. To address these issues, we propose \textscFree‑MAD, a novel MAD framework that eliminates the need for consensus among agents. \textscFree‑MAD introduces a novel score‑based decision mechanism that evaluates the entire debate trajectory rather than relying on the last round only. This mechanism tracks how each agent's reasoning evolves, enabling more accurate and fair outcomes. In addition, \textscFree‑MAD reconstructs the debate phase by introducing anti‑conformity, a mechanism that enables agents to mitigate excessive influence from the majority. Experiments on eight benchmark datasets demonstrate that \textscFree‑MAD significantly improves reasoning performance while requiring only a single‑round debate and thus reducing token costs. We also show that compared to existing MAD approaches, \textscFree‑MAD exhibits improved robustness in real‑world attack scenarios.

PaperID: 8699, https://arxiv.org/pdf/2509.10791.pdf

Authors: Garegin Mazmanyan, Hossein Rastgoftar

Title: Experimental Validation of Decentralized Affine Transformation

Abstract:
This paper presents an experimental validation of decentralized affine transformation (AT) in multi‑agent systems using teams of mini‑quadcopters. The AT framework enables an agent team to safely navigate constrained, obstacle‑rich environments while allowing aggressive changes in inter‑agent distances, which are formally characterized through the decomposition of the AT transformation matrix. Without loss of generality, we focus on two‑dimensional AT, formulated as a decentralized leader‑follower problem. In this formulation, three leader quadcopters are positioned at the vertices of a triangle, while all follower quadcopters remain within the triangle. The leaders know the desired trajectories prescribed by the AT, whereas the followers do not. Instead, the followers infer their trajectories through local communication governed by fixed communication weights determined by the initial spatial configuration of the team. Experimental results validate the asymptotic convergence of decentralized AT and demonstrate its capability to safely guide multi‑agent teams through obstacle‑laden environments.

PaperID: 8700, https://arxiv.org/pdf/2509.10769.pdf

Authors: Tara Bogavelli, Roshnee Sharma, Hari Subramani

Title: AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

Abstract:
While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi‑agent systems. This study aims to address these gaps by providing a comprehensive enterprise‑specific benchmark evaluating 18 distinct agentic configurations across state‑of‑the‑art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model‑specific architectural preferences that challenge the prevalent one‑size‑fits‑all paradigm in agentic AI systems. It also reveals significant weaknesses in overall agentic performance on enterprise tasks with the highest scoring models achieving a maximum of only 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.

PaperID: 8701, https://arxiv.org/pdf/2509.10761.pdf

Authors: Marcelo Sandoval-Castaneda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, Fabian Caba Heilbron

Title: EditDuet: A Multi-Agent System for Video Non-Linear Editing

Abstract:
Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as sequential decision making process. Ours is a multi‑agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. On the other hand, the Critic gives natural language feedback to the editor based on the produced sequence or renders it if it is satisfactory. We introduce a learning‑based approach for enabling effective communication across specialized agents to address the language‑driven video editing task. Finally, we explore an LLM‑as‑a‑judge metric for evaluating the quality of video editing system and compare it with general human preference. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.

PaperID: 8702, https://arxiv.org/pdf/2509.10723.pdf

Authors: Jingyu Tang, Chaoran Chen, Jiawen Li, Zhiping Zhang, Bingcan Guo, Ibrahim Khalilov, Simret Araya Gebreegziabher, Bingsheng Yao, Dakuo Wang, Yanfang Ye, Tianshi Li, Ziang Xiao, Yaxing Yao, Toby Jia-Jun Li

Title: Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight

Abstract:
The dark patterns, deceptive interface designs manipulating user behaviors, have been extensively studied for their effects on human decision‑making and autonomy. Yet, with the rising prominence of LLM‑powered GUI agents that automate tasks from high‑level intents, understanding how dark patterns affect agents is increasingly important. We present a two‑phase empirical study examining how agents, human participants, and human‑AI teams respond to 16 types of dark patterns across diverse scenarios. Phase 1 highlights that agents often fail to recognize dark patterns, and even when aware, prioritize task completion over protective action. Phase 2 revealed divergent failure modes: humans succumb due to cognitive shortcuts and habitual compliance, while agents falter from procedural blind spots. Human oversight improved avoidance but introduced costs such as attentional tunneling and cognitive load. Our findings show neither humans nor agents are uniformly resilient, and collaboration introduces new vulnerabilities, suggesting design needs for transparency, adjustable autonomy, and oversight.

PaperID: 8703, https://arxiv.org/pdf/2509.10716.pdf

Authors: Pio Ong, Haejoon Lee, Tamas G. Molnar, Dimitra Panagou, Aaron D. Ames

Title: Combinatorial Control Barrier Functions: Nested Boolean and p-choose-r Compositions of Safety Constraints

Abstract:
This paper investigates the problem of composing multiple control barrier functions (CBFs) ‑‑ and matrix control barrier functions (MCBFs) ‑‑ through logical and combinatorial operations. Standard CBF formulations naturally enable conjunctive (AND) combinations, but disjunctive (OR) and more general logical structures introduce nonsmoothness and possibly a combinatorial blow‑up in the number of logical combinations. We introduce the framework of combinatorial CBFs that addresses p‑choose‑r safety specifications and their nested composition. The proposed framework ensures safety for the exact safe set in a scalable way, using the original number of primitive constraints. We establish theoretical guarantees on safety under these compositions, and we demonstrate their use on a patrolling problem in a multi‑agent system.

PaperID: 8704, https://arxiv.org/pdf/2509.10557.pdf

Authors: Atefeh Irani, Maryam S. Mirian, Alex Lassooij, Reshad Hosseini, Hadi Moradi, Martin J. McKeown

Title: HiLWS: A Human-in-the-Loop Weak Supervision Framework for Curating Clinical and Home Video Data for Neurological Assessment

Abstract:
Video‑based assessment of motor symptoms in conditions such as Parkinson's disease (PD) offers a scalable alternative to in‑clinic evaluations, but home‑recorded videos introduce significant challenges, including visual degradation, inconsistent task execution, annotation noise, and domain shifts. We present HiLWS, a cascaded human‑in‑the‑loop weak supervision framework for curating and annotating hand motor task videos from both clinical and home settings. Unlike conventional single‑stage weak supervision methods, HiLWS employs a novel cascaded approach, first applies weak supervision to aggregate expert‑provided annotations into probabilistic labels, which are then used to train machine learning models. Model predictions, combined with expert input, are subsequently refined through a second stage of weak supervision. The complete pipeline includes quality filtering, optimized pose estimation, and task‑specific segment extraction, complemented by context‑sensitive evaluation metrics that assess both visual fidelity and clinical relevance by prioritizing ambiguous cases for expert review. Our findings reveal key failure modes in home recorded data and emphasize the importance of context‑sensitive curation strategies for robust medical video analysis.

PaperID: 8705, https://arxiv.org/pdf/2509.10512.pdf

Authors: Jiaxing Cao, Yuzhou Gao, Jiwei Huang

Title: A Service-Oriented Adaptive Hierarchical Incentive Mechanism for Federated Learning

Abstract:
Recently, federated learning (FL) has emerged as a novel framework for distributed model training. In FL, the task publisher (TP) releases tasks, and local model owners (LMOs) use their local data to train models. Sometimes, FL suffers from the lack of training data, and thus workers are recruited for gathering data. To this end, this paper proposes an adaptive incentive mechanism from a service‑oriented perspective, with the objective of maximizing the utilities of TP, LMOs and workers. Specifically, a Stackelberg game is theoretically established between the LMOs and TP, positioning TP as the leader and the LMOs as followers. An analytical Nash equilibrium solution is derived to maximize their utilities. The interaction between LMOs and workers is formulated by a multi‑agent Markov decision process (MAMDP), with the optimal strategy identified via deep reinforcement learning (DRL). Additionally, an Adaptively Searching the Optimal Strategy Algorithm (ASOSA) is designed to stabilize the strategies of each participant and solve the coupling problems. Extensive numerical experiments are conducted to validate the efficacy of the proposed method.

PaperID: 8706, https://arxiv.org/pdf/2509.10436.pdf

Authors: Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish

Title: RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

Abstract:
To optimize the reasoning and problem‑solving capabilities of Large Language Models (LLMs), we propose a novel cloud‑edge collaborative architecture that enables a structured multi‑agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud and responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of LLMs across multi‑domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers multiple technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges sourced from Stack Overflow. We propose RefactorCoder‑MoE, a fine‑tuned mixture‑of‑experts (MoE) code language model based on DeepSeek‑Coder‑7B‑Instruct, adapted to the RefactorCoderQA benchmark using QLoRA for domain‑specific coding question answering. Extensive experiments demonstrate that RefactorCoder‑MoE achieves strong and competitive performance, significantly outperforming all evaluated open‑source and commercial baselines, with an overall accuracy of 76.84%.

PaperID: 8707, https://arxiv.org/pdf/2509.10426.pdf

Authors: Jianxin Shi, Zengqi Peng, Xiaolong Chen, Tianyu Wo, Jun Ma

Title: DECAMP: Towards Scene-Consistent Multi-Agent Motion Prediction with Disentangled Context-Aware Pre-Training

Abstract:
Trajectory prediction is a critical component of autonomous driving, essential for ensuring both safety and efficiency on the road. However, traditional approaches often struggle with the scarcity of labeled data and exhibit suboptimal performance in multi‑agent prediction scenarios. To address these challenges, we introduce a disentangled context‑aware pre‑training framework for multi‑agent motion prediction, named DECAMP. Unlike existing methods that entangle representation learning with pretext tasks, our framework decouples behavior pattern learning from latent feature reconstruction, prioritizing interpretable dynamics and thereby enhancing scene representation for downstream prediction. Additionally, our framework incorporates context‑aware representation learning alongside collaborative spatial‑motion pretext tasks, which enables joint optimization of structural and intentional reasoning while capturing the underlying dynamic intentions. Our experiments on the Argoverse 2 benchmark showcase the superior performance of our method, and the results attained underscore its effectiveness in multi‑agent motion forecasting. To the best of our knowledge, this is the first context autoencoder framework for multi‑agent motion forecasting in autonomous driving. The code and models will be made publicly available.

PaperID: 8708, https://arxiv.org/pdf/2509.10397.pdf

Authors: Fei Liu, Xinyu Lin, Hanchao Yu, Mingyuan Wu, Jianyu Wang, Qiang Zhang, Zhuokai Zhao, Yinglong Xia, Yao Zhang, Weiwei Li, Mingze Gao, Qifan Wang, Lizhu Zhang, Benyu Zhang, Xiangjun Fan

Title: RecoWorld: Building Simulated Environments for Agentic Recommender Systems

Abstract:
We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. Such environments give agents a proper training space where they can learn from errors without impacting real users. RecoWorld distinguishes itself with a dual‑view architecture: a simulated user and an agentic recommender engage in multi‑turn interactions aimed at maximizing user retention. The user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop that actively engages users. This process leverages the exceptional reasoning capabilities of modern LLMs. We explore diverse content representations within the simulator, including text‑based, multimodal, and semantic ID modeling, and discuss how multi‑turn RL enables the recommender to refine its strategies through iterative interactions. RecoWorld also supports multi‑agent simulations, allowing creators to simulate the responses of targeted user populations. It marks an important first step toward recommender systems where users and agents collaboratively shape personalized information streams. We envision new interaction paradigms where "user instructs, recommender responds," jointly optimizing user retention and engagement.

PaperID: 8709, https://arxiv.org/pdf/2509.10317.pdf

Authors: Elizaveta D. Moskovskaya, Anton D. Moscowsky

Title: Robot guide with multi-agent control and automatic scenario generation with LLM

Abstract:
The work describes the development of a hybrid control architecture for an anthropomorphic tour guide robot, combining a multi‑agent resource management system with automatic behavior scenario generation based on large language models. The proposed approach aims to overcome the limitations of traditional systems, which rely on manual tuning of behavior scenarios. These limitations include manual configuration, low flexibility, and lack of naturalness in robot behavior. The process of preparing tour scenarios is implemented through a two‑stage generation: first, a stylized narrative is created, then non‑verbal action tags are integrated into the text. The multi‑agent system ensures coordination and conflict resolution during the execution of parallel actions, as well as maintaining default behavior after the completion of main operations, contributing to more natural robot behavior. The results obtained from the trial demonstrate the potential of the proposed approach for automating and scaling social robot control systems.

PaperID: 8710, https://arxiv.org/pdf/2509.10284.pdf

Authors: David Zahrádka, Denisa Mužíková, David Woller, Miroslav Kulich, Jiří Švancara, Roman Barták

Title: A Holistic Architecture for Monitoring and Optimization of Robust Multi-Agent Path Finding Plan Execution

Abstract:
The goal of Multi‑Agent Path Finding (MAPF) is to find a set of paths for a fleet of agents moving in a shared environment such that the agents reach their goals without colliding with each other. In practice, some of the robots executing the plan may get delayed, which can introduce collision risk. Although robust execution methods are used to ensure safety even in the presence of delays, the delays may still have a significant impact on the duration of the execution. At some point, the accumulated delays may become significant enough that instead of continuing with the execution of the original plan, even if it was optimal, there may now exist an alternate plan which will lead to a shorter execution. However, the problem is how to decide when to search for the alternate plan, since it is a costly procedure. In this paper, we propose a holistic architecture for robust execution of MAPF plans, its monitoring and optimization. We exploit a robust execution method called Action Dependency Graph to maintain an estimate of the expected execution duration during the plan's execution. This estimate is used to predict the potential that finding an alternate plan would lead to shorter execution. We empirically evaluate the architecture in experiments in a real‑time simulator which we designed to mimic our real‑life demonstrator of an autonomous warehouse robotic fleet.

PaperID: 8711, https://arxiv.org/pdf/2509.10210.pdf

Authors: Marko Petković, Vlado Menkovski, Sofía Calero

Title: Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction

Abstract:
Automated characterization of porous materials has the potential to accelerate materials discovery, but it remains limited by the complexity of simulation setup and force field selection. We propose a multi‑agent framework in which LLM‑based agents can autonomously understand a characterization task, plan appropriate simulations, assemble relevant force fields, execute them and interpret their results to guide subsequent steps. As a first step toward this vision, we present a multi‑agent system for literature‑informed force field extraction and automated RASPA simulation setup. Initial evaluations demonstrate high correctness and reproducibility, highlighting this approach's potential to enable fully autonomous, scalable materials characterization.

PaperID: 8712, https://arxiv.org/pdf/2509.10185.pdf

Authors: R. Montalà, B. Font, P. Suárez, J. Rabault, O. Lehmkuhl, R. Vinuesa, I. Rodriguez

Title: Discovering Flow Separation Control Strategies in 3D Wings via Deep Reinforcement Learning

Abstract:
In this work, deep reinforcement learning (DRL) is applied to active flow control (AFC) over a threedimensional SD7003 wing at a Reynolds number of Re = 60,000 and angle of attack of AoA = 14 degrees. In the uncontrolled baseline case, the flow exhibits massive separation and a fully turbulent wake. Using a GPU‑accelerated CFD solver and multi‑agent training, DRL discovers control strategies that enhance lift (79%), reduce drag (65%), and improve aerodynamic efficiency (408%). Flow visualizations confirm reattachment of the separated shear layer, demonstrating the potential of DRL for complex and turbulent flows.

PaperID: 8713, https://arxiv.org/pdf/2509.10163.pdf

Authors: Francisco Javier Esono Nkulu Andong, Qi Min

Title: Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks

Abstract:
As sixth‑generation (6G) networks move toward ultra‑dense, intelligent edge environments, efficient resource management under stringent privacy, mobility, and energy constraints becomes critical. This paper introduces a novel Federated Multi‑Agent Reinforcement Learning (Fed‑MARL) framework that incorporates cross‑layer orchestration of both the MAC layer and application layer for energy‑efficient, privacy‑preserving, and real‑time resource management across heterogeneous edge devices. Each agent uses a Deep Recurrent Q‑Network (DRQN) to learn decentralized policies for task offloading, spectrum access, and CPU energy adaptation based on local observations (e.g., queue length, energy, CPU usage, and mobility). To protect privacy, we introduce a secure aggregation protocol based on elliptic curve Diffie Hellman key exchange, which ensures accurate model updates without exposing raw data to semi‑honest adversaries. We formulate the resource management problem as a partially observable multi‑agent Markov decision process (POMMDP) with a multi‑objective reward function that jointly optimizes latency, energy efficiency, spectral efficiency, fairness, and reliability under 6G‑specific service requirements such as URLLC, eMBB, and mMTC. Simulation results demonstrate that Fed‑MARL outperforms centralized MARL and heuristic baselines in task success rate, latency, energy efficiency, and fairness, while ensuring robust privacy protection and scalability in dynamic, resource‑constrained 6G edge networks.

PaperID: 8714, https://arxiv.org/pdf/2509.10147.pdf

Authors: Nenad Tomasev, Matija Franklin, Joel Z. Leibo, Julian Jacobs, William A. Cunningham, Iason Gabriel, Simon Osindero

Title: Virtual Agent Economies

Abstract:
The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the "sandbox economy" as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI "mission economies" to coordinate around achieving collective goals, and socio‑technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity's long‑term collective flourishing.

PaperID: 8715, https://arxiv.org/pdf/2509.10112.pdf

Authors: Ndidi Bianca Ogbo, Zhao Song, The Anh Han

Title: Evolution of Coordination Through Institutional Incentives: An Evolutionary Game Theory Approach

Abstract:
There is a broad recognition that commitment‑based mechanisms can promote coordination and cooperative behaviours in both biological populations and self‑organised multi‑agent systems by making individuals' intentions explicit prior to engagement. Yet their effectiveness depends on sustained compliance supported by institutions, especially in one‑off interactions. Despite advances in quantitative studies of cooperation and commitment, most applied analyses and policy debates remain largely qualitative, with limited attention to the allocation of scarce institutional resources between enhancing participation and ensuring commitment compliance. Herein, we develop an evolutionary game‑theoretic model that explicitly examines the strategic distribution of a limited budget for institutional incentives, namely rewards or punishments, aimed at these two critical objectives within pre‑commitment frameworks. Our findings reveal that a reward‑based incentive approach consistently yields greater coordination success than a punishment‑based approach, with optimal outcomes arising when resources are appropriately distributed between participation promotion and compliance assurance. These findings offer novel insights for designing institutional incentives to promote broad, coordinated adoption of new technologies.

PaperID: 8716, https://arxiv.org/pdf/2509.10018.pdf

Authors: Hailong Yang, Renhuo Zhao, Guanjin Wang, Zhaohong Deng

Title: GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Mechanism

Abstract:
With the rapid advancement of Large Language Models (LLMs), LLM‑based agents exhibit exceptional abilities in understanding and generating natural language, enabling human‑like collaboration and information transmission in LLM‑based Multi‑Agent Systems (MAS). High‑performance LLMs are often hosted on web servers in public cloud environments. When tasks involve private data, MAS cannot securely utilize these LLMs without implementing the agentic privacy‑preserving mechanism. To address this challenge, we propose a General Anonymizing Multi‑Agent System (GAMA), which divides the agents' workspace into private and public spaces, ensuring privacy through a structured anonymization mechanism. In the private space, agents handle sensitive data, while in the public web space, only anonymized data is utilized. GAMA incorporates two key modules to mitigate semantic loss caused by anonymization: Domain‑Rule‑based Knowledge Enhancement (DRKE) and Disproof‑based Logic Enhancement (DLE). We evaluate GAMA on two general question‑answering datasets, a public privacy leakage benchmark, and two customized question‑answering datasets related to privacy. The results demonstrate that GAMA outperforms existing baselines on the evaluated datasets in terms of both task accuracy and privacy preservation metrics.

PaperID: 8717, https://arxiv.org/pdf/2509.09995.pdf

Authors: Fei Xiong, Xiang Zhang, Aosong Feng, Siqi Sun, Chenyu You

Title: QuantAgent: Price-Driven Multi-Agent LLMs for High-Frequency Trading

Abstract:
Recent advances in Large Language Models (LLMs) have shown remarkable capabilities in financial reasoning and market understanding. Multi‑agent LLM frameworks such as TradingAgent and FINMEM augment these models to long‑horizon investment tasks by leveraging fundamental and sentiment‑based inputs for strategic decision‑making. However, these approaches are ill‑suited for the high‑speed, precision‑critical demands of High‑Frequency Trading (HFT). HFT typically requires rapid, risk‑aware decisions driven by structured, short‑horizon signals, such as technical indicators, chart patterns, and trend features. These signals stand in sharp contrast to the long‑horizon, text‑driven reasoning that characterizes most existing LLM‑based systems in finance. To bridge this gap, we introduce QuantAgent, the first multi‑agent LLM framework explicitly designed for high‑frequency algorithmic trading. The system decomposes trading into four specialized agents‑‑Indicator, Pattern, Trend, and Risk‑‑each equipped with domain‑specific tools and structured reasoning capabilities to capture distinct aspects of market dynamics over short temporal windows. Extensive experiments across nine financial instruments, including Bitcoin and Nasdaq futures, demonstrate that QuantAgent consistently outperforms baseline methods, achieving higher predictive accuracy at both 1‑hour and 4‑hour trading intervals across multiple evaluation metrics. Our findings suggest that coupling structured trading signals with LLM‑based reasoning provides a viable path for traceable, real‑time decision systems in high‑frequency financial markets.

PaperID: 8718, https://arxiv.org/pdf/2509.09970.pdf

Authors: Seyed Moein Abtahi, Akramul Azim

Title: Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching

Abstract:
Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real‑time performance constraints. This paper proposes a three‑phase methodology that combines LLM‑based firmware generation with automated security validation and iterative refinement in a virtualized environment. Using structured prompts, models like GPT‑4 generate firmware for networking and control tasks, deployed on FreeRTOS via QEMU. These implementations are tested using fuzzing, static analysis, and runtime monitoring to detect vulnerabilities such as buffer overflows (CWE‑120), race conditions (CWE‑362), and denial‑of‑service threats (CWE‑400). Specialized AI agents for Threat Detection, Performance Optimization, and Compliance Verification collaborate to improve detection and remediation. Identified issues are categorized using CWE, then used to prompt targeted LLM‑generated patches in an iterative loop. Experiments show a 92.4% Vulnerability Remediation Rate (37.3% improvement), 95.8% Threat Model Compliance, and 0.87 Security Coverage Index. Real‑time metrics include 8.6ms worst‑case execution time and 195μs jitter. This process enhances firmware security and performance while contributing an open‑source dataset for future research.

PaperID: 8719, https://arxiv.org/pdf/2509.09936.pdf

Authors: Saarth Gaonkar, Xiang Zheng, Haocheng Xi, Rishabh Tiwari, Kurt Keutzer, Dmitriy Morozov, Michael W. Mahoney, Amir Gholami

Title: SciML Agents: Write the Solver, Not the Solution

Abstract:
Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics‑informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain‑aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural‑language ODE description, generate runnable code that is scientifically appropriate, selecting suitable solvers (stiff vs. non‑stiff), and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large‑scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non‑stiffness; and the large‑scale benchmark spans stiff and non‑stiff ODE regimes. We evaluate open‑ and closed‑source LLM models along two axes: (i) unguided versus guided prompting with domain‑specific knowledge; and (ii) off‑the‑shelf versus fine‑tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction‑following models achieve high accuracy on both criteria. In many cases, recent open‑source systems perform strongly without fine‑tuning, while older or smaller models still benefit from fine‑tuning. Overall, our preliminary results indicate that careful prompting and fine‑tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.

PaperID: 8720, https://arxiv.org/pdf/2509.09853.pdf

Authors: Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, Ahmed E. Hassan

Title: SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints

Abstract:
The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE‑bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource‑constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct ‑ it must also be cost‑effective. To address this gap, we introduce SWE‑Effi, a set of new metrics to re‑evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., token and time). In this paper, we specifically focus on the software engineering scenario by re‑ranking popular AI systems for issue resolution on a subset of the SWE‑bench benchmark using our new multi‑dimensional metrics. We found that AI system's effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource‑efficient manner. We also identified systematic challenges such as the "token snowball" effect and, more significantly, a pattern of "expensive failures". In these cases, agents consume excessive resources while stuck on unsolvable tasks ‑ an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade‑off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.

PaperID: 8721, https://arxiv.org/pdf/2509.09740.pdf

Authors: Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov

Title: HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets

Abstract:
Large‑scale single‑cell and Perturb‑seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene‑Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)‑driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene‑set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO‑based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence‑embedding model, compute pair‑wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra‑cluster agreement (ii) their external distinctiveness, low similarity between clusters, termed inter‑cluster separation. These two quantities are combined to produce an agent‑derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb‑seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit alignment with known pathway compared to classical metrics such silhouette score, modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context‑aware interpretation pipelines in single‑cell multi‑omics studies.

PaperID: 8722, https://arxiv.org/pdf/2509.09727.pdf

Authors: Andy Zhu, Yingjun Du

Title: A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs

Abstract:
Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem‑solving. The financial domain demands multistep quantitative reasoning, familiarity with domain‑specific terminology, and comprehension of real‑world scenarios. We present a multi‑agent framework that leverages role‑based prompting to enhance performance on domain‑specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single‑pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert‑designed finance education questions from Study.com, an online learning platform. We leverage retrieval‑augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain‑expert reviewer. Our experiments indicate that critique‑based refinement improves answer accuracy by 6.6‑8.3% over zero‑shot Chain‑of‑Thought baselines, with the highest performance from Gemini‑2.0‑Flash. Furthermore, our method enables GPT‑4o‑mini to achieve performance comparable to the finance‑tuned FinGPT‑mt_Llama3‑8B_LoRA. Our results show a cost‑effective approach to enhancing financial QA and offer insights for further research in multi‑agent financial LLM systems.

PaperID: 8723, https://arxiv.org/pdf/2509.09685.pdf

Authors: Keunwoo Choi, Seungheon Doh, Juhan Nam

Title: TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

Abstract:
We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM‑as‑a‑judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl.ai/talkplaydata2.

PaperID: 8724, https://arxiv.org/pdf/2509.09629.pdf

Authors: Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen

Title: Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems

Abstract:
The advancement of large language models (LLMs) has enabled the construction of multi‑agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool‑use actions. Most existing methods typically fine‑tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi‑Agent Joint Alignment Tuning framework that improves agents collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine‑tunes the grounding agent using diverse subgoal‑action pairs generated by the agent itself to enhance its generalization capablity. Theoretical analysis proves that MOAT ensures a non‑decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state‑of‑the‑art baselines, achieving average improvements of 3.1% on held‑in tasks and 4.4% on held‑out tasks.

PaperID: 8725, https://arxiv.org/pdf/2509.09574.pdf

Authors: Yiting Hu, Lingjie Duan

Title: Human-in-the-loop Learning Through Decentralized Communication Mechanisms

Abstract:
Information sharing platforms like TripAdvisor and Waze involve human agents as both information producers and consumers. All these platforms operate in a centralized way to collect agents' latest observations of new options (e.g., restaurants, hotels, travel routes) and share such information with all in real time. However, after hearing the central platforms' live updates, many human agents are found selfish and unwilling to further explore unknown options for the benefit of others in the long run. To regulate the human‑in‑the‑loop learning (HILL) game against selfish agents' free‑riding, this paper proposes a paradigm shift from centralized to decentralized way of operation that forces agents' local explorations through restricting information sharing. When game theory meets distributed learning, we formulate our decentralized communication mechanism's design as a new multi‑agent Markov decision process (MA‑MDP), and derive its analytical condition to outperform today's centralized operation. As the optimal decentralized communication mechanism in MA‑MDP is NP‑hard to solve, we present an asymptotically optimal algorithm with linear complexity to determine the mechanism's timing of intermittent information sharing. Then we turn to non‑myopic agents who may revert to even over‑explore, and adapt our mechanism design to work. Simulation experiments using real‑world dataset demonstrate the effectiveness of our decentralized mechanisms for various scenarios.

PaperID: 8726, https://arxiv.org/pdf/2509.09525.pdf

Authors: Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang

Title: TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes

Abstract:
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv‑X, a co‑designed integration of the serverless platform with the operating system and CXL/RDMA‑based remote memory pools. TrEnv‑X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS‑level memory templates, which enable rapid state restoration from CXL/RDMA‑based remote memory pools. To further demonstrate TrEnv‑X's versatility, we generalize its design from traditional containers for microVM‑based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv‑X achieves up to 7x reduction in P99 latency and 48% memory savings for container‑based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state‑of‑the‑art systems like E2B.

PaperID: 8727, https://arxiv.org/pdf/2509.09498.pdf

Authors: Haoran Xu, Jiacong Hu, Ke Zhang, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, Bill Shi

Title: SEDM: Scalable Self-Evolving Distributed Memory for Agents

Abstract:
Long‑term multi‑agent systems inevitably generate vast amounts of trajectories and historical interactions, which makes efficient memory management essential for both performance and scalability. Existing methods typically depend on vector retrieval and hierarchical storage, yet they are prone to noise accumulation, uncontrolled memory expansion, and limited generalization across domains. To address these challenges, we present SEDM, Self‑Evolving Distributed Memory, a verifiable and adaptive framework that transforms memory from a passive repository into an active, self‑optimizing component. SEDM integrates verifiable write admission based on reproducible replay, a self‑scheduling memory controller that dynamically ranks and consolidates entries according to empirical utility, and cross‑domain knowledge diffusion that abstracts reusable insights to support transfer across heterogeneous tasks. Evaluations on benchmark datasets demonstrate that SEDM improves reasoning accuracy while reducing token overhead compared with strong memory baselines, and further enables knowledge distilled from fact verification to enhance multi‑hop reasoning. The results highlight SEDM as a scalable and sustainable memory mechanism for open‑ended multi‑agent collaboration. The code will be released in the later stage of this project.

PaperID: 8728, https://arxiv.org/pdf/2509.09470.pdf

Authors: Om Vishesh, Harshad Khadilkar, Deepak Akkil

Title: AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings

Abstract:
Keeping pace with the rapid growth of academia literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time‑consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, 'Agent‑E', can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near perfect accuracy of 99.4%. This demonstration highlights the potential of task‑oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.

PaperID: 8729, https://arxiv.org/pdf/2509.09281.pdf

Authors: Sandeep Banik, Naira Hovakimyan

Title: Flip Co-op: Cooperative Takeovers in Shared Autonomy

Abstract:
Shared autonomy requires principled mechanisms for allocating and transferring control between a human and an autonomous agent. Existing approaches often rely on blending control inputs between human and autonomous agent or switching rules, which lack theoretical guarantees. This paper develops a game‑theoretic framework for modeling cooperative takeover in shared autonomy. We formulate the switching interaction as a dynamic game in which authority is embedded directly into the system dynamics, resulting in Nash equilibrium(NE)‑based strategies rather than ad hoc switching rules. We establish the existence and characterization of NE in the space of pure takeover strategies under stochastic human intent. For the class of linear‑quadratic systems, we derive closed‑form recursions for takeover strategies and saddle‑point value functions, providing analytical insight and efficient computation of cooperative takeover policies. We further introduce a bimatrix potential game reformulation to address scenarios where human and autonomy utilities are not perfectly aligned, yielding a unifying potential function that preserves tractability while capturing intent deviations. The framework is applied to a vehicle trajectory tracking problem, demonstrating how equilibrium takeover strategies adapt across straight and curved path segments. The results highlight the trade‑off between human adaptability and autonomous efficiency and illustrate the practical benefits of grounding shared autonomy in cooperative game theory.

PaperID: 8730, https://arxiv.org/pdf/2509.09269.pdf

Authors: Luca Ballotta, Juncal Arbelaiz, Vijay Gupta, Luca Schenato, Mihailo R. Jovanović

Title: The role of communication delays in the optimal control of spatially invariant systems

Abstract:
We study optimal proportional feedback controllers for spatially invariant systems when the controller has access to delayed state measurements received from different spatial locations. We analyze how delays affect the spatial locality of the optimal feedback gain leveraging the problem decoupling in the spatial frequency domain. For the cases of expensive control and small delay, we provide exact expressions of the optimal controllers in the limit for infinite control weight and vanishing delay, respectively. In the expensive control regime, the optimal feedback control law decomposes into a delay‑aware filtering of the delayed state and the optimal controller in the delay‑free setting. Under small delays, the optimal controller is a perturbation of the delay‑free one which depends linearly on the delay. We illustrate our analytical findings with a reaction‑diffusion process over the real line and a multi‑agent system coupled through circulant matrices, showing that delays reduce the effectiveness of optimal feedback control and may require each subsystem within a distributed implementation to communicate with farther‑away locations.

PaperID: 8731, https://arxiv.org/pdf/2509.09265.pdf

Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang

Title: Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Abstract:
In long‑horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome‑based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step‑by‑step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy‑Modulated Policy Gradients (EMPG), a framework that re‑calibrates the learning signal based on step‑wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed‑seed.github.io/

PaperID: 8732, https://arxiv.org/pdf/2509.09215.pdf

Authors: Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du, Qichao Xu

Title: Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions

Abstract:
Large language models (LLMs)‑empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi‑agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain‑enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large‑scale agent ecosystems. Finally, we discuss the future research directions for blockchain‑enabled regulatory frameworks in multi‑agent systems.

PaperID: 8733, https://arxiv.org/pdf/2509.09210.pdf

Authors: Xing Gao, Zherui Huang, Weiyao Lin, Xiao Sun

Title: ProgD: Progressive Multi-scale Decoding with Dynamic Graphs for Joint Multi-agent Motion Forecasting

Abstract:
Accurate motion prediction of surrounding agents is crucial for the safe planning of autonomous vehicles. Recent advancements have extended prediction techniques from individual agents to joint predictions of multiple interacting agents, with various strategies to address complex interactions within future motions of agents. However, these methods overlook the evolving nature of these interactions. To address this limitation, we propose a novel progressive multi‑scale decoding strategy, termed ProgD, with the help of dynamic heterogeneous graph‑based scenario modeling. In particular, to explicitly and comprehensively capture the evolving social interactions in future scenarios, given their inherent uncertainty, we design a progressive modeling of scenarios with dynamic heterogeneous graphs. With the unfolding of such dynamic heterogeneous graphs, a factorized architecture is designed to process the spatio‑temporal dependencies within future scenarios and progressively eliminate uncertainty in future motions of multiple agents. Furthermore, a multi‑scale decoding procedure is incorporated to improve on the future scenario modeling and consistent prediction of agents' future motion. The proposed ProgD achieves state‑of‑the‑art performance on the INTERACTION multi‑agent prediction benchmark, ranking 1^st, and the Argoverse 2 multi‑world forecasting benchmark.

PaperID: 8734, https://arxiv.org/pdf/2509.09207.pdf

Authors: Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang

Title: Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

Abstract:
Penetration testing is critical for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time‑consuming, and dependent on expert human labor. Recent work has explored AI‑driven pentesting agents, but their evaluation relies on oversimplified capture‑the‑flag (CTF) settings that embed prior knowledge and reduce complexity, leading to performance estimates far from real‑world practice. We close this gap by introducing the first real‑world, agent‑oriented pentesting benchmark, TermiBench, which shifts the goal from 'flag finding' to achieving full system control. The benchmark spans 510 hosts across 25 services and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, we find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges, we propose TermiAgent, a multi‑agent penetration testing framework. TermiAgent mitigates long‑context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal via structured code understanding rather than naive retrieval. In evaluations, our work outperforms state‑of‑the‑art agents, exhibiting stronger penetration testing capability, reducing execution time and financial cost, and demonstrating practicality even on laptop‑scale deployments. Our work delivers both the first open‑source benchmark for real‑world autonomous pentesting and a novel agent framework that establishes a milestone for AI‑driven penetration testing.

PaperID: 8735, https://arxiv.org/pdf/2509.09154.pdf

Authors: Bui Duc Manh, Soumyaratna Debnath, Zetong Zhang, Shriram Damodaran, Arvind Kumar, Yueyi Zhang, Lu Mi, Erik Cambria, Lin Wang

Title: Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective

Abstract:
Recent advances in agentic AI have led to systems capable of autonomous task execution and language‑based reasoning, yet their spatial reasoning abilities remain limited and underexplored, largely constrained to symbolic and sequential processing. In contrast, human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context‑aware decision‑making in unstructured environments. Therefore, bridging this gap is critical for advancing Agentic Spatial Intelligence toward better interaction with the physical 3D world. To this end, we first start from scrutinizing the spatial neural models as studied in computational neuroscience, and accordingly introduce a novel computational framework grounded in neuroscience principles. This framework maps core biological functions to six essential computation modules: bio‑inspired multimodal sensing, multi‑sensory integration, egocentric‑allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. Together, these modules form a perspective landscape for agentic spatial reasoning capability across both virtual and physical environments. On top, we conduct a framework‑guided analysis of recent methods, evaluating their relevance to each module and identifying critical gaps that hinder the development of more neuroscience‑grounded spatial reasoning modules. We further examine emerging benchmarks and datasets and explore potential application domains ranging from virtual to embodied systems, such as robotics. Finally, we outline potential research directions, emphasizing the promising roadmap that can generalize spatial reasoning across dynamic or unstructured environments. We hope this work will benefit the research community with a neuroscience‑grounded perspective and a structured pathway. Our project page can be found at Github.

PaperID: 8736, https://arxiv.org/pdf/2509.09135.pdf

Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li

Title: Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning

Abstract:
Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous‑time RL (CTRL) has emerged as a promising alternative by replacing discrete‑time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton‑‑Jacobi‑‑Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single‑agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high‑dimensional systems; and (ii) even with HJB‑based learning approaches, accurately approximating centralized value functions in multi‑agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT‑MARL framework that uses physics‑informed neural networks (PINNs) to approximate HJB‑based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value‑gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous‑time variants of standard benchmarks, including multi‑agent particle environment (MPE) and multi‑agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous‑time RL baselines and scales to complex multi‑agent dynamics.

PaperID: 8737, https://arxiv.org/pdf/2509.09094.pdf

Authors: Guochu Xiong, Xiangzhong Luo, Weichen Liu

Title: Coherence-Aware Task Graph Modeling for Realistic Application

Abstract:
As multicore systems continue to scale, cache coherence has emerged as a critical determinant of system performance, with coherence behavior and task execution closely intertwined, reshaping inter‑task dependencies. Task graph modeling provides a structured way to capture such dependencies and serves as the foundation for many system‑level design strategies. However, these strategies typically rely on predefined task graphs, while many real‑world applications lack explicit graphs and exhibit dynamic, data‑dependent behavior, limiting the effectiveness of static approaches. To address this, several task graph modeling methods for realistic workloads have been developed. Yet, they either rely on implicit techniques that use application‑specific features without producing explicit graphs, or they generate graphs tailored to fixed scheduling models, which limits generality. More importantly, they often overlook coherence interactions, creating a gap between design assumptions and actual runtime behavior. To overcome these limitations, we propose CoTAM, a Coherence‑Aware Task Graph Modeling framework for realistic workloads that constructs a unified task graph reflecting runtime behavior. CoTAM analyzes the impact of coherence by decoupling its effects from overall execution, quantifies its influence through a learned weighting scheme, and infers inter‑task dependencies for coherence‑aware graph generation. Extensive experiments show that CoTAM outperforms implicit methods, bridging the gap between dynamic workload behavior and existing designs while demonstrating the importance of incorporating cache coherence into task graph modeling for accurate and generalizable system‑level analysis.

PaperID: 8738, https://arxiv.org/pdf/2509.09082.pdf

Authors: Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He

Title: MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Abstract:
Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi‑step reasoning. While existing approaches enhance the performance of LLMs through in‑context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model's generalization ability, we propose integrating reinforcement learning (RL) with multi‑perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR‑UIE consistently elevates extraction accuracy across domains and surpasses state‑of‑the‑art methods on several datasets. Furthermore, incorporating multi‑perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

PaperID: 8739, https://arxiv.org/pdf/2509.09071.pdf

Authors: Crystal Qian, Kehang Zhu, John Horton, Benjamin S. Manning, Vivian Tsai, James Wexler, Nithum Thain

Title: Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

Abstract:
Markets increasingly accommodate large language models (LLMs) as autonomous decision‑making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task‑specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi‑player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity ‑‑ a common benchmark in agent evaluation ‑‑ can mask substantive procedural differences in how LLMs behave in complex multi‑agent interactions.

PaperID: 8740, https://arxiv.org/pdf/2509.08970.pdf

Authors: Junyang Cai, Serdar Kadioglu, Bistra Dilkina

Title: Gala: Global LLM Agents for Text-to-Model Translation

Abstract:
Natural language descriptions of optimization or satisfaction problems are challenging to translate into correct MiniZinc models, as this process demands both logical reasoning and constraint programming expertise. We introduce Gala, a framework that addresses this challenge with a global agentic approach: multiple specialized large language model (LLM) agents decompose the modeling task by global constraint type. Each agent is dedicated to detecting and generating code for a specific class of global constraint, while a final assembler agent integrates these constraint snippets into a complete MiniZinc model. By dividing the problem into smaller, well‑defined sub‑tasks, each LLM handles a simpler reasoning challenge, potentially reducing overall complexity. We conduct initial experiments with several LLMs and show better performance against baselines such as one‑shot prompting and chain‑of‑thought prompting. Finally, we outline a comprehensive roadmap for future work, highlighting potential enhancements and directions for improvement.

PaperID: 8741, https://arxiv.org/pdf/2509.08956.pdf

Authors: Luke Snow, Vikram Krishnamurthy

Title: Multi-Agent Inverse Reinforcement Learning for Identifying Pareto-Efficient Coordination -- A Distributionally Robust Approach

Abstract:
Multi‑agent inverse reinforcement learning (IRL) aims to identify Pareto‑efficient behavior in a multi‑agent system, and reconstruct utility functions of the individual agents. Motivated by the problem of detecting UAV coordination, how can we construct a statistical detector for Pareto‑efficient behavior given noisy measurements of the decisions of a multi‑agent system? This paper approaches this IRL problem by deriving necessary and sufficient conditions for a dataset of multi‑agent system dynamics to be consistent with Pareto‑efficient coordination, and providing algorithms for recovering utility functions which are consistent with the system dynamics. We derive an optimal statistical detector for determining Pareto‑efficient coordination from noisy system measurements, which minimizes Type‑I statistical detection error. Then, we provide a utility estimation algorithm which minimizes the worst‑case estimation error over a statistical ambiguity set centered at empirical observations; this min‑max solution achieves distributionally robust IRL, which is crucial in adversarial strategic interactions. We illustrate these results in a detailed example for detecting Pareto‑efficient coordination among multiple UAVs given noisy measurement recorded at a radar. We then reconstruct the utility functions of the UAVs in a distributionally robust sense.

PaperID: 8742, https://arxiv.org/pdf/2509.08863.pdf

Authors: Qianqian Luo, Qingming Lin, Liuchang Xu, Sensen Wu, Ruichen Mao, Chao Wang, Hailin Feng, Bo Huang, Zhenhong Du

Title: GeoJSON Agents:A Multi-Agent LLM Architecture for Geospatial Analysis-Function Calling vs Code Generation

Abstract:
Large Language Models (LLMs) have demonstrated substantial progress in task automation and natural language understanding. However, without domain expertise in geographic information science (GIS), they continue to encounter limitations including reduced accuracy and unstable performance when processing complex tasks. To address these challenges, we propose GeoJSON Agents‑a novel multi‑agent LLM architecture specifically designed for geospatial analysis. This framework transforms natural language instructions into structured GeoJSON operations through two LLM enhancement techniques: Function Calling and Code Generation. The architecture integrates three core components: task parsing, agent collaboration, and result integration. The Planner agent systematically decomposes user‑defined tasks into executable subtasks, while Worker agents perform spatial data processing and analysis either by invoking predefined function APIs or by generating and executing Python‑based analytical code. The system produces reusable, standards‑compliant GeoJSON outputs through iterative refinement. To evaluate both approaches, we constructed a benchmark comprising 70 tasks spanning basic, intermediate, and advanced complexity levels, conducting experiments with OpenAI's GPT‑4o as the core model. Results indicate that the Code Generation‑based agent achieved 97.14% accuracy, while the Function Calling‑based agent attained 85.71%‑both significantly outperforming the best‑performing general‑purpose model (48.57%). Comparative analysis reveals Code Generation offers superior flexibility for complex, open‑ended tasks, whereas Function Calling provides enhanced execution stability for structured operations. This study represents the first systematic integration of GeoJSON data with a multi‑agent LLM framework and provides empirical evidence comparing two mainstream enhancement methodologies in geospatial context.

PaperID: 8743, https://arxiv.org/pdf/2509.08859.pdf

Authors: Vincenzo Suriani, Daniele Affinita, Domenico D. Bloisi, Daniele Nardi

Title: Multi Robot Coordination in Highly Dynamic Environments: Tackling Asymmetric Obstacles and Limited Communication

Abstract:
Coordinating a fully distributed multi‑agent system (MAS) can be challenging when the communication channel has very limited capabilities in terms of sending rate and packet payload. When the MAS has to deal with active obstacles in a highly partially observable environment, the communication channel acquires considerable relevance. In this paper, we present an approach to deal with task assignments in extremely active scenarios, where tasks need to be frequently reallocated among the agents participating in the coordination process. Inspired by market‑based task assignments, we introduce a novel distributed coordination method to orchestrate autonomous agents' actions efficiently in low communication scenarios. In particular, our algorithm takes into account asymmetric obstacles. While in the real world, the majority of obstacles are asymmetric, they are usually treated as symmetric ones, thus limiting the applicability of existing methods. To summarize, the presented architecture is designed to tackle scenarios where the obstacles are active and asymmetric, the communication channel is poor and the environment is partially observable. Our approach has been validated in simulation and in the real world, using a team of NAO robots during official RoboCup competitions. Experimental results show a notable reduction in task overlaps in limited communication settings, with a decrease of 52% in the most frequent reallocated task.

PaperID: 8744, https://arxiv.org/pdf/2509.08756.pdf

Authors: Zhaoxun "Lorenz" Liu, Wagner H. Souza, Jay Han, Amin Madani

Title: Using AI to Optimize Patient Transfer and Resource Utilization During Mass-Casualty Incidents: A Simulation Platform

Abstract:
Mass casualty incidents (MCIs) overwhelm healthcare systems and demand rapid, accurate patient‑hospital allocation decisions under extreme pressure. Here, we developed and validated a deep reinforcement learning‑based decision‑support AI agent to optimize patient transfer decisions during simulated MCIs by balancing patient acuity levels, specialized care requirements, hospital capacities, and transport logistics. To integrate this AI agent, we developed MasTER, a web‑accessible command dashboard for MCI management simulations. Through a controlled user study with 30 participants (6 trauma experts and 24 non‑experts), we evaluated three interaction approaches with the AI agent (human‑only, human‑AI collaboration, and AI‑only) across 20‑ and 60‑patient MCI scenarios in the Greater Toronto Area. Results demonstrate that increasing AI involvement significantly improves decision quality and consistency. The AI agent outperforms trauma surgeons (p < 0.001) and enables non‑experts to achieve expert‑level performance when assisted, contrasting sharply with their significantly inferior unassisted performance (p < 0.001). These findings establish the potential for our AI‑driven decision support to enhance both MCI preparedness training and real‑world emergency response management.

PaperID: 8745, https://arxiv.org/pdf/2509.08736.pdf

Authors: Dong Han, Zhehong Ai, Pengxiang Cai, Shanya Lu, Jianpeng Chen, Zihao Ye, Shuzhou Sun, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao Xu, Yuqiang Li, Shufei Zhang

Title: ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System

Abstract:
Bayesian optimization (BO) is a powerful tool for scientific discovery in chemistry, yet its efficiency is often hampered by the sparse experimental data and vast search space. Here, we introduce ChemBOMAS: a large language model (LLM)‑enhanced multi‑agent system that accelerates BO through synergistic data‑ and knowledge‑driven strategies. Firstly, the data‑driven strategy involves an 8B‑scale LLM regressor fine‑tuned on a mere 1% labeled samples for pseudo‑data generation, robustly initializing the optimization process. Secondly, the knowledge‑driven strategy employs a hybrid Retrieval‑Augmented Generation approach to guide LLM in dividing the search space while mitigating LLM hallucinations. An Upper Confidence Bound algorithm then identifies high‑potential subspaces within this established partition. Across the LLM‑refined subspaces and supported by LLM‑generated data, BO achieves the improvement of effectiveness and efficiency. Comprehensive evaluations across multiple scientific benchmarks demonstrate that ChemBOMAS set a new state‑of‑the‑art, accelerating optimization efficiency by up to 5‑fold compared to baseline methods.

PaperID: 8746, https://arxiv.org/pdf/2509.08682.pdf

Authors: Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, Yidan Liang

Title: Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference

Abstract:
Multi‑agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is severely hampered by the challenge of failure attribution. Current diagnostic tools, which rely on statistical correlations, are fundamentally inadequate; on challenging benchmarks like Who\&When, state‑of‑the‑art methods achieve less than 15% accuracy in locating the root‑cause step of a failure. To address this critical gap, we introduce the first failure attribution framework for MAS grounded in multi‑granularity causal inference. Our approach makes two key technical contributions: (1) a performance causal inversion principle, which correctly models performance dependencies by reversing the data flow in execution logs, combined with Shapley values to accurately assign agent‑level blame; (2) a novel causal discovery algorithm, CDC‑MAS, that robustly identifies critical failure steps by tackling the non‑stationary nature of MAS interaction data. The framework's attribution results directly fuel an automated optimization loop, generating targeted suggestions whose efficacy is validated via counterfactual simulations. Evaluations on the Who\&When and TRAIL benchmarks demonstrate a significant leap in performance. Our method achieves up to 36.2% step‑level accuracy. Crucially, the generated optimizations boost overall task success rates by an average of 22.4%. This work provides a principled and effective solution for debugging complex agent interactions, paving the way for more reliable and interpretable multi‑agent systems.

PaperID: 8747, https://arxiv.org/pdf/2509.08646.pdf

Authors: Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt

Title: Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations

Abstract:
As Large Language Model (LLM) agents become increasingly capable of automating complex, multi‑step tasks, the need for robust, secure, and predictable architectural patterns is paramount. This paper provides a comprehensive guide to the ``Plan‑then‑Execute'' (P‑t‑E) pattern, an agentic design that separates strategic planning from tactical execution. We explore the foundational principles of P‑t‑E, detailing its core components ‑ the Planner and the Executor ‑ and its architectural advantages in predictability, cost‑efficiency, and reasoning quality over reactive patterns like ReAct (Reason + Act). A central focus is placed on the security implications of this design, particularly its inherent resilience to indirect prompt injection attacks by establishing control‑flow integrity. We argue that while P‑t‑E provides a strong foundation, a defense‑in‑depth strategy is necessary, and we detail essential complementary controls such as the Principle of Least Privilege, task‑scoped tool access, and sandboxed code execution. To make these principles actionable, this guide provides detailed implementation blueprints and working code references for three leading agentic frameworks: LangChain (via LangGraph), CrewAI, and AutoGen. Each framework's approach to implementing the P‑t‑E pattern is analyzed, highlighting unique features like LangGraph's stateful graphs for re‑planning, CrewAI's declarative tool scoping for security, and AutoGen's built‑in Docker sandboxing. Finally, we discuss advanced patterns, including dynamic re‑planning loops, parallel execution with Directed Acyclic Graphs (DAGs), and the critical role of Human‑in‑the‑Loop (HITL) verification, to offer a complete strategic blueprint for architects, developers, and security engineers aiming to build production‑grade, resilient, and trustworthy LLM agents.

PaperID: 8748, https://arxiv.org/pdf/2509.08638.pdf

Authors: Rebecca Martin, Jay Patrikar, Sebastian Scherer

Title: AutoODD: Agentic Audits via Bayesian Red Teaming in Black-Box Models

Abstract:
Specialized machine learning models, regardless of architecture and training, are susceptible to failures in deployment. With their increasing use in high risk situations, the ability to audit these models by determining their operational design domain (ODD) is crucial in ensuring safety and compliance. However, given the high‑dimensional input spaces, this process often requires significant human resources and domain expertise. To alleviate this, we introduce \coolname, an LLM‑Agent centric framework for automated generation of semantically relevant test cases to search for failure modes in specialized black‑box models. By leveraging LLM‑Agents as tool orchestrators, we aim to fit a uncertainty‑aware failure distribution model on a learned text‑embedding manifold by projecting the high‑dimension input space to low‑dimension text‑embedding latent space. The LLM‑Agent is tasked with iteratively building the failure landscape by leveraging tools for generating test‑cases to probe the model‑under‑test (MUT) and recording the response. The agent also guides the search using tools to probe uncertainty estimate on the low dimensional manifold. We demonstrate this process in a simple case using models trained with missing digits on the MNIST dataset and in the real world setting of vision‑based intruder detection for aerial vehicles.

PaperID: 8749, https://arxiv.org/pdf/2509.08534.pdf

Authors: Shubham Singh, Anoop Jain

Title: Phase-Coordinated Multi-Agent Circular Formation Control with Non-Concentric Boundary Constraints

Abstract:
This paper addresses the problem of collective circular motion control for unicycle agents, with the objective of achieving phase coordination of their velocity vectors while ensuring that their trajectories remain confined within a prescribed non‑concentric circular boundary. To accommodate such nonuniform motion constraints, we build upon our earlier work and extend the use of Mobius transformation to a multi‑agent framework. The Mobius transformation maps two nonconcentric circles to concentric ones, thereby converting spatially nonuniform constraints into uniform ones in the transformed plane. Leveraging this property, we introduce the notion of a phase‑shifted order parameter, along with the associated concepts of Mobius phase‑shift coupled synchronization and balancing, which characterize the phase‑coordinated patterns studied in this paper. We establish an equivalence between the unicycle dynamics in the original and transformed planes under the Mobius transformation and its inverse, and show that synchronization is preserved across both planes, whereas balancing is generally not. Distributed control laws are then designed in the transformed plane using barrier Lyapunov functions, under the assumption of an undirected and connected communication topology among agents. These controllers are subsequently mapped back to the original plane to obtain the linear acceleration and turn‑rate control inputs applied to the actual agents. Both simulations and experimental results are provided to illustrate the proposed framework.

PaperID: 8750, https://arxiv.org/pdf/2509.08460.pdf

Authors: Wenqing Wang, Ye Zhang, Haoyu Li, Jingyu Wang

Title: Dual-Stage Safe Herding Framework for Adversarial Attacker in Dynamic Environment

Abstract:
Recent advances in robotics have enabled the widespread deployment of autonomous robotic systems in complex operational environments, presenting both unprecedented opportunities and significant security problems. Traditional shepherding approaches based on fixed formations are often ineffective or risky in urban and obstacle‑rich scenarios, especially when facing adversarial agents with unknown and adaptive behaviors. This paper addresses this challenge as an extended herding problem, where defensive robotic systems must safely guide adversarial agents with unknown strategies away from protected areas and into predetermined safe regions, while maintaining collision‑free navigation in dynamic environments. We propose a hierarchical hybrid framework based on reach‑avoid game theory and local motion planning, incorporating a virtual containment boundary and event‑triggered pursuit mechanisms to enable scalable and robust multi‑agent coordination. Simulation results demonstrate that the proposed approach achieves safe and efficient guidance of adversarial agents to designated regions.

PaperID: 8751, https://arxiv.org/pdf/2509.08433.pdf

Authors: José-Luis Vilchis Medina

Title: Un cadre paraconsistant pour l'{é}valuation de similarit{é} dans les bases de connaissances

Abstract:
This article proposes a paraconsistent framework for evaluating similarity in knowledge bases. Unlike classical approaches, this framework explicitly integrates contradictions, enabling a more robust and interpretable similarity measure. A new measure S^ is introduced, which penalizes inconsistencies while rewarding shared properties. Paraconsistent super‑categories Ξ_K^ are defined to hierarchically organize knowledge entities. The model also includes a contradiction extractor E and a repair mechanism, ensuring consistency in the evaluations. Theoretical results guarantee reflexivity, symmetry, and boundedness of S^ . This approach offers a promising solution for managing conflicting knowledge, with perspectives in multi‑agent systems.

PaperID: 8752, https://arxiv.org/pdf/2509.08380.pdf

Authors: Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu

Title: Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives

Abstract:
Generating regulatorily compliant Suspicious Activity Report (SAR) remains a high‑cost, low‑scalability bottleneck in Anti‑Money Laundering (AML) workflows. While large language models (LLMs) offer promising fluency, they suffer from factual hallucination, limited crime typology alignment, and poor explainability ‑‑ posing unacceptable risks in compliance‑critical domains. This paper introduces Co‑Investigator AI, an agentic framework optimized to produce Suspicious Activity Reports (SARs) significantly faster and with greater accuracy than traditional methods. Drawing inspiration from recent advances in autonomous agent architectures, such as the AI Co‑Scientist, our approach integrates specialized agents for planning, crime type detection, external intelligence gathering, and compliance validation. The system features dynamic memory management, an AI‑Privacy Guard layer for sensitive data handling, and a real‑time validation agent employing the Agent‑as‑a‑Judge paradigm to ensure continuous narrative quality assurance. Human investigators remain firmly in the loop, empowered to review and refine drafts in a collaborative workflow that blends AI efficiency with domain expertise. We demonstrate the versatility of Co‑Investigator AI across a range of complex financial crime scenarios, highlighting its ability to streamline SAR drafting, align narratives with regulatory expectations, and enable compliance teams to focus on higher‑order analytical work. This approach marks the beginning of a new era in compliance reporting ‑‑ bringing the transformative benefits of AI agents to the core of regulatory processes and paving the way for scalable, reliable, and transparent SAR generation.

PaperID: 8753, https://arxiv.org/pdf/2509.08324.pdf

Authors: Wenji Cao, Lu Liu, Zehua Ye, Dan Zhang, Gang Feng

Title: Resilient Global Practical Fixed-Time Cooperative Output Regulation of Uncertain Nonlinear Multi-Agent Systems Subject to Denial-of-Service Attacks

Abstract:
This paper investigates the problem of resilient global practical fixed‑time cooperative output regulation of uncertain nonlinear multi‑agent systems subject to denial‑of‑service attacks. A novel distributed resilient adaptive fixed‑time control strategy is proposed, which consists of a novel distributed resilient fixed‑time observer with a chain of nonlinear filters and a novel distributed resilient adaptive fixed‑time controller. It is shown that the problem of resilient global practical fixed‑time cooperative output regulation can be solved by the proposed control strategy. More specifically, the proposed distributed control strategy ensures the global boundedness of all the signals in the resulting closed‑loop system and the global convergence of the regulated outputs to a tunable residual set in a fixed time. A simulation example is finally provided to illustrate the efficacy of the proposed control strategy.

PaperID: 8754, https://arxiv.org/pdf/2509.08310.pdf

Authors: S Krishna Niketh, Sagar Babu Mitikiri, V Vignesh, Vedantham Lakshmi Srinivas, Mayukha Pal

Title: Game-Theoretic Resilience Framework for Cyber-Physical Microgrids using Multi-Agent Reinforcement Learning

Abstract:
The increasing reliance on cyber physical infrastructure in modern power systems has amplified the risk of targeted cyber attacks, necessitating robust and adaptive resilience strategies. This paper presents a mathematically rigorous game theoretic framework to evaluate and enhance microgrid resilience using a combination of quantitative resilience metrics Load Served Ratio LSR, Critical Load Resilience CLR, Topological Survivability Score TSS, and DER Resilience Score DRS. These are integrated into a unified payoff matrix using the Analytic Hierarchy Process AHP to assess attack defense interactions. The framework is formalized as a finite horizon Markov Decision Process MDP with formal convergence guarantees and computational complexity bounds. Three case studies are developed 1. static attacks analyzed via Nash equilibrium, 2. severe attacks incorporating high impact strategies, and 3. adaptive attacks using Stackelberg games, regret matching, softmax heuristics, and Multi Agent Q Learning. Rigorous theoretical analysis provides convergence proofs with explicit rates , PAC learning sample complexity bounds, and computational complexity analysis. The framework is tested on an enhanced IEEE 33bus distribution system with DERs and control switches, demonstrating the effectiveness of adaptive and strategic defenses in improving cyber physical resilience with statistically significant improvements of 18.7% 2.1% over static approaches.

PaperID: 8755, https://arxiv.org/pdf/2509.08257.pdf

Authors: Yongkai Tian, Yirong Qi, Xin Yu, Wenjun Wu, Jie Luo

Title: Symmetry-Guided Multi-Agent Inverse Reinforcement Learning

Abstract:
In robotic systems, the performance of reinforcement learning depends on the rationality of predefined reward functions. However, manually designed reward functions often lead to policy failures due to inaccuracies. Inverse Reinforcement Learning (IRL) addresses this problem by inferring implicit reward functions from expert demonstrations. Nevertheless, existing methods rely heavily on large amounts of expert demonstrations to accurately recover the reward function. The high cost of collecting expert demonstrations in robotic applications, particularly in multi‑robot systems, severely hinders the practical deployment of IRL. Consequently, improving sample efficiency has emerged as a critical challenge in multi‑agent inverse reinforcement learning (MIRL). Inspired by the symmetry inherent in multi‑agent systems, this work theoretically demonstrates that leveraging symmetry enables the recovery of more accurate reward functions. Building upon this insight, we propose a universal framework that integrates symmetry into existing multi‑agent adversarial IRL algorithms, thereby significantly enhancing sample efficiency. Experimental results from multiple challenging tasks have demonstrated the effectiveness of this framework. Further validation in physical multi‑robot systems has shown the practicality of our method.

PaperID: 8756, https://arxiv.org/pdf/2509.08242.pdf

Authors: Nirabhra Mandal, Aamodh Suresh, Carlos Nieto-Granda, Sonia Martínez

Title: Behaviorally Heterogeneous Multi-Agent Exploration Using Distributed Task Allocation

Abstract:
We study a problem of multi‑agent exploration with behaviorally heterogeneous robots. Each robot maps its surroundings using SLAM and identifies a set of areas of interest (AoIs) or frontiers that are the most informative to explore next. The robots assess the utility of going to a frontier using Behavioral Entropy (BE) and then determine which frontier to go to via a distributed task assignment scheme. We convert the task assignment problem into a non‑cooperative game and use a distributed algorithm (d‑PBRAG) to converge to the Nash equilibrium (which we show is the optimal task allocation solution). For unknown utility cases, we provide robust bounds using approximate rewards. We test our algorithm (which has less communication cost and fast convergence) in simulation, where we explore the effect of sensing radii, sensing accuracy, and heterogeneity among robotic teams with respect to the time taken to complete exploration and path traveled. We observe that having a team of agents with heterogeneous behaviors is beneficial.

PaperID: 8757, https://arxiv.org/pdf/2509.08221.pdf

Authors: Elahe Delavari, Feeza Khan Khanzada, Jaerock Kwon

Title: A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator

Abstract:
Autonomous‑driving research has recently embraced deep Reinforcement Learning (RL) as a promising framework for data‑driven decision making, yet a clear picture of how these algorithms are currently employed, benchmarked and evaluated is still missing. This survey fills that gap by systematically analysing around 100 peer‑reviewed papers that train, test or validate RL policies inside the open‑source CARLA simulator. We first categorize the literature by algorithmic family model‑free, model‑based, hierarchical, and hybrid and quantify their prevalence, highlighting that more than 80% of existing studies still rely on model‑free methods such as DQN, PPO and SAC. Next, we explain the diverse state, action and reward formulations adopted across works, illustrating how choices of sensor modality (RGB, LiDAR, BEV, semantic maps, and carla kinematics states), control abstraction (discrete vs. continuous) and reward shaping are used across various literature. We also consolidate the evaluation landscape by listing the most common metrics (success rate, collision rate, lane deviation, driving score) and the towns, scenarios and traffic configurations used in CARLA benchmarks. Persistent challenges including sparse rewards, sim‑to‑real transfer, safety guarantees and limited behaviour diversity are distilled into a set of open research questions, and promising directions such as model‑based RL, meta‑learning and richer multi‑agent simulations are outlined. By providing a unified taxonomy, quantitative statistics and a critical discussion of limitations, this review aims to serve both as a reference for newcomers and as a roadmap for advancing RL‑based autonomous driving toward real‑world deployment.

PaperID: 8758, https://arxiv.org/pdf/2509.08160.pdf

Authors: Viraj Parimi, Brian C. Williams

Title: Diffusion-Guided Multi-Arm Motion Planning

Abstract:
Multi‑arm motion planning is fundamental for enabling arms to complete complex long‑horizon tasks in shared spaces efficiently but current methods struggle with scalability due to exponential state‑space growth and reliance on large training datasets for learned models. Inspired by Multi‑Agent Path Finding (MAPF), which decomposes planning into single‑agent problems coupled with collision resolution, we propose a novel diffusion‑guided multi‑arm planner (DG‑MAP) that enhances scalability of learning‑based models while reducing their reliance on massive multi‑arm datasets. Recognizing that collisions are primarily pairwise, we train two conditional diffusion models, one to generate feasible single‑arm trajectories, and a second, to model the dual‑arm dynamics required for effective pairwise collision resolution. By integrating these specialized generative models within a MAPF‑inspired structured decomposition, our planner efficiently scales to larger number of arms. Evaluations against alternative learning‑based methods across various team sizes demonstrate our method's effectiveness and practical applicability. Project website can be found at https://diff‑mapf‑mers.csail.mit.edu

PaperID: 8759, https://arxiv.org/pdf/2509.08157.pdf

Authors: Viraj Parimi, Brian C. Williams

Title: Risk-Bounded Multi-Agent Visual Navigation via Iterative Risk Allocation

Abstract:
Safe navigation is essential for autonomous systems operating in hazardous environments, especially when multiple agents must coordinate using only high‑dimensional visual observations. While recent approaches successfully combine Goal‑Conditioned RL (GCRL) for graph construction with Conflict‑Based Search (CBS) for planning, they typically rely on deleting edges with high risk before running CBS to enforce safety. This binary strategy is overly conservative, precluding feasible missions that require traversing high‑risk regions, even when the aggregate risk is acceptable. To address this, we introduce a framework for Risk‑Bounded Multi‑Agent Path Finding (Δ‑MAPF), where agents share a user‑specified global risk budget (Δ). Rather than permanently discarding edges, our framework dynamically distributes per‑agent risk budgets (δ_i) during search via an Iterative Risk Allocation (IRA) layer that integrates with a standard CBS planner. We investigate two distribution strategies: a greedy surplus‑deficit scheme for rapid feasibility repair, and a market‑inspired mechanism that treats risk as a priced resource to guide improved allocation. The market‑based mechanism yields a tunable trade‑off wherein agents exploit available risk to secure shorter, more efficient paths, but revert to longer, safer detours under tighter budgets. Experiments in complex visual environments show that our dynamic allocation framework achieves higher success rates than baselines and effectively leverages the available safety budget to reduce travel time. Project website can be found at https://rb‑visual‑mapf‑mers.csail.mit.edu

PaperID: 8760, https://arxiv.org/pdf/2509.08088.pdf

Authors: Linyao Chen, Zimian Peng, Yingxuan Yang, Yikun Wang, Wenzheng Tom Tang, Hiroki H. Kobayashi, Weinan Zhang

Title: EnvX: Agentize Everything with Agentic AI

Abstract:
The widespread availability of open‑source repositories has led to a vast collection of reusable software components, yet their utilization remains manual, error‑prone, and disconnected. Developers must navigate documentation, understand APIs, and write integration code, creating significant barriers to efficient software reuse. To address this, we present EnvX, a framework that leverages Agentic AI to agentize GitHub repositories, transforming them into intelligent, autonomous agents capable of natural language interaction and inter‑agent collaboration. Unlike existing approaches that treat repositories as static code resources, EnvX reimagines them as active agents through a three‑phase process: (1) TODO‑guided environment initialization, which sets up the necessary dependencies, data, and validation datasets; (2) human‑aligned agentic automation, allowing repository‑specific agents to autonomously perform real‑world tasks; and (3) Agent‑to‑Agent (A2A) protocol, enabling multiple agents to collaborate. By combining large language model capabilities with structured tool integration, EnvX automates not just code generation, but the entire process of understanding, initializing, and operationalizing repository functionality. We evaluate EnvX on the GitTaskBench benchmark, using 18 repositories across domains such as image processing, speech recognition, document analysis, and video manipulation. Our results show that EnvX achieves a 74.07% execution completion rate and 51.85% task pass rate, outperforming existing frameworks. Case studies further demonstrate EnvX's ability to enable multi‑repository collaboration via the A2A protocol. This work marks a shift from treating repositories as passive code resources to intelligent, interactive agents, fostering greater accessibility and collaboration within the open‑source ecosystem.

PaperID: 8761, https://arxiv.org/pdf/2509.07939.pdf

Authors: Katsuaki Nakano, Reza Fayyazi, Shanchieh Jay Yang, Michael Zuzak

Title: Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Abstract:
Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self‑guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kll chain, to constrain the LLM's reaoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama‑3‑8B, Gemini‑1.5, and GPT‑4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real‑world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama‑3‑8B, Gemini‑1.5, and GPT‑4, respectively. Comparatively, the state‑of‑the‑art LLM penetration testing tool using self‑guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments

PaperID: 8762, https://arxiv.org/pdf/2509.07860.pdf

Authors: Guanzhi Deng, Yi Xie, Yu-Keung Ng, Mingyang Liu, Peijun Zheng, Jie Liu, Dapeng Wu, Yinqiao Li, Linqi Song

Title: KLIPA: A Knowledge Graph and LLM-Driven QA Framework for IP Analysis

Abstract:
Effectively managing intellectual property is a significant challenge. Traditional methods for patent analysis depend on labor‑intensive manual searches and rigid keyword matching. These approaches are often inefficient and struggle to reveal the complex relationships hidden within large patent datasets, hindering strategic decision‑making. To overcome these limitations, we introduce KLIPA, a novel framework that leverages a knowledge graph and a large language model (LLM) to significantly advance patent analysis. Our approach integrates three key components: a structured knowledge graph to map explicit relationships between patents, a retrieval‑augmented generation(RAG) system to uncover contextual connections, and an intelligent agent that dynamically determines the optimal strategy for resolving user queries. We validated KLIPA on a comprehensive, real‑world patent database, where it demonstrated substantial improvements in knowledge extraction, discovery of novel connections, and overall operational efficiency. This combination of technologies enhances retrieval accuracy, reduces reliance on domain experts, and provides a scalable, automated solution for any organization managing intellectual property, including technology corporations and legal firms, allowing them to better navigate the complexities of strategic innovation and competitive intelligence.

PaperID: 8763, https://arxiv.org/pdf/2509.07764.pdf

Authors: Haitao Hu, Peng Chen, Yanpeng Zhao, Yuqi Chen

Title: AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

Abstract:
Large Language Models (LLMs) have been increasingly integrated into computer‑use agents, which can autonomously operate tools on a user's computer to accomplish complex tasks. However, due to the inherently unstable and unpredictable nature of LLM outputs, they may issue unintended tool commands or incorrect inputs, leading to potentially harmful operations. Unlike traditional security risks stemming from insecure user prompts, tool execution results from LLM‑driven decisions introduce new and unique security challenges. These vulnerabilities span across all components of a computer‑use agent. To mitigate these risks, we propose AgentSentinel, an end‑to‑end, real‑time defense framework designed to mitigate potential security threats on a user's computer. AgentSentinel intercepts all sensitive operations within agent‑related services and halts execution until a comprehensive security audit is completed. Our security auditing mechanism introduces a novel inspection process that correlates the current task context with system traces generated during task execution. To thoroughly evaluate AgentSentinel, we present BadComputerUse, a benchmark consisting of 60 diverse attack scenarios across six attack categories. The benchmark demonstrates a 87% average attack success rate on four state‑of‑the‑art LLMs. Our evaluation shows that AgentSentinel achieves an average defense success rate of 79.6%, significantly outperforming all baseline defenses.

PaperID: 8764, https://arxiv.org/pdf/2509.07703.pdf

Authors: K. P. Sunny, Rakesh R. Warier

Title: Prescribed-Time Event-Triggered Control for Matrix-Scaled Networks

Abstract:
This article proposes a distributed control method for matrix‑scaled multi‑agent networks aimed at achieving convergence within a user‑defined time frame. The control law of each individual agent relies only on information from neighboring agents and is updated at discrete intervals determined by state‑dependent triggering functions, reducing the frequency of agent interactions. To this end, first, the controller is augmented with a time‑varying gain. Then, the dynamics of the closed‑loop system over the finite‑time interval is transformed into an infinite‑time frame using time scaling. Lyapunov‑based analysis is employed to derive suitable triggering conditions that guarantee the asymptotic convergence of the time‑transformed system, thereby ensuring the prescribed‑time convergence of the original system.

PaperID: 8765, https://arxiv.org/pdf/2509.07680.pdf

Authors: Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid

Title: CAViAR: Critic-Augmented Video Agentic Reasoning

Abstract:
Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet‑RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieve strong performance on the previously‑mentioned datasets.

PaperID: 8766, https://arxiv.org/pdf/2509.07650.pdf

Authors: Victor Villin, Christos Dimitrakakis

Title: Inference of Altruism and Intrinsic Rewards in Multi-Agent Systems

Abstract:
Human interactions are influenced by emotions, temperament, and affection, often conflicting with individuals' underlying preferences. Without explicit knowledge of those preferences, judging whether behaviour is appropriate becomes guesswork, leaving us highly prone to misinterpretation. Yet, such understanding is critical if autonomous agents are to collaborate effectively with humans. We frame the problem with multi‑agent inverse reinforcement learning and show that even a simple model, where agents weigh their own welfare against that of others, can cover a wide range of social behaviours. Using novel Bayesian techniques, we find that intrinsic rewards and altruistic tendencies can be reliably identified by placing agents in different groups. Crucially, this disentanglement of intrinsic motivation from altruism enables the synthesis of new behaviours aligned with any desired level of altruism, even when demonstrations are drawn from restricted behaviour profiles.

PaperID: 8767, https://arxiv.org/pdf/2509.07642.pdf

Authors: Sascha Kaltenpoth, Oliver Müller

Title: Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment

Abstract:
Adopting Large language models (LLMs) in organizations potentially revolutionizes our lives and work. However, they can generate off‑topic, discriminating, or harmful content. This AI alignment problem often stems from misspecifications during the LLM adoption, unnoticed by the principal due to the LLM's black‑box nature. While various research disciplines investigated AI alignment, they neither address the information asymmetries between organizational adopters and black‑box LLM agents nor consider organizational AI adoption processes. Therefore, we propose LLM ATLAS (LLM Agency Theory‑Led Alignment Strategy) a conceptual framework grounded in agency (contract) theory, to mitigate alignment problems during organizational LLM adoption. We conduct a conceptual literature analysis using the organizational LLM adoption phases and the agency theory as concepts. Our approach results in (1) providing an extended literature analysis process specific to AI alignment methods during organizational LLM adoption and (2) providing a first LLM alignment problem‑solution space.

PaperID: 8768, https://arxiv.org/pdf/2509.07424.pdf

Authors: Hyunseung Lim, Dasom Choi, DaEun Choi, Sooyohn Nam, Hwajung Hong

Title: Feed-O-Meter: Investigating AI-Generated Mentee Personas as Interactive Agents for Scaffolding Design Feedback Practice

Abstract:
Effective feedback, including critique and evaluation, helps designers develop design concepts and refine their ideas, supporting informed decision‑making throughout the iterative design process. However, in studio‑based design courses, students often struggle to provide feedback due to a lack of confidence and fear of being judged, which limits their ability to develop essential feedback‑giving skills. Recent advances in large language models (LLMs) suggest that role‑playing with AI agents can let learners engage in multi‑turn feedback without the anxiety of external judgment or the time constraints of real‑world settings. Yet prior studies have raised concerns that LLMs struggle to behave like real people in role‑play scenarios, diminishing the educational benefits of these interactions. Therefore, designing AI‑based agents that effectively support learners in practicing and developing intellectual reasoning skills requires more than merely assigning the target persona's personality and role to the agent. By addressing these issues, we present Feed‑O‑Meter, a novel system that employs carefully designed LLM‑based agents to create an environment in which students can practice giving design feedback. The system enables users to role‑play as mentors, providing feedback to an AI mentee and allowing them to reflect on how that feedback impacts the AI mentee's idea development process. A user study (N=24) indicated that Feed‑O‑Meter increased participants' engagement and motivation through role‑switching and helped them adjust feedback to be more comprehensible for an AI mentee. Based on these findings, we discuss future directions for designing systems to foster feedback skills in design education.

PaperID: 8769, https://arxiv.org/pdf/2509.07403.pdf

Authors: Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Hui Shen, Wendong Xu, Chaofan Tao, Min Yang, Chengming Li, Lingpeng Kong, Ngai Wong

Title: LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

Abstract:
Large language models (LLMs) have made significant progress in Emotional Intelligence (EI) and long‑context modeling. However, existing benchmarks often overlook the fact that emotional information processing unfolds as a continuous long‑context process. To address the absence of multidimensional EI evaluation in long‑context inference and explore model performance under more challenging conditions, we present LongEmotion, a benchmark that encompasses a diverse suite of tasks targeting the assessment of models' capabilities in Emotion Recognition, Knowledge Application, and Empathetic Generation, with an average context length of 15,341 tokens. To enhance performance under realistic constraints, we introduce the Collaborative Emotional Modeling (CoEM) framework, which integrates Retrieval‑Augmented Generation (RAG) and multi‑agent collaboration to improve models' EI in long‑context scenarios. We conduct a detailed analysis of various models in long‑context settings, investigating how reasoning mode activation, RAG‑based retrieval strategies, and context‑length adaptability influence their EI performance. Our project page is: https://longemotion.github.io/

PaperID: 8770, https://arxiv.org/pdf/2509.07389.pdf

Authors: Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar, Jagat Sesh Challa

Title: Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

Abstract:
Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross‑linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.

PaperID: 8771, https://arxiv.org/pdf/2509.07367.pdf

Authors: Cunxi Yu, Rongjian Liang, Chia-Tung Ho, Haoxing Ren

Title: Autonomous Code Evolution Meets NP-Completeness

Abstract:
Large language models (LLMs) have recently shown strong coding abilities, enabling not only static code generation but also iterative code self‑evolving through agentic frameworks. Recently, AlphaEvolve \citenovikov2025alphaevolve demonstrated that LLM‑based coding agents can autonomously improve algorithms and surpass human experts, with scopes limited to isolated kernels spanning hundreds of lines of code. Inspired by AlphaEvolve, we present SATLUTION, the first framework to extend LLM‑based code evolution to the full repository scale, encompassing hundreds of files and tens of thousands of lines of C/C++ code. Targeting Boolean Satisfiability (SAT), the canonical NP‑complete problem and a cornerstone of both theory and applications. SATLUTION orchestrates LLM agents to directly evolve solver repositories under strict correctness guarantees and distributed runtime feedback, while simultaneously self‑evolving its own evolution policies and rules. Starting from SAT Competition 2024 codebases and benchmark, SATLUTION evolved solvers that decisively outperformed the human‑designed winners of the SAT Competition 2025, and also surpassed both 2024 and 2025 champions on the 2024 benchmarks.

PaperID: 8772, https://arxiv.org/pdf/2509.07334.pdf

Authors: Yunnong Chen, Chengwei Shi, Liuqing Chen

Title: SpecifyUI: Supporting Iterative UI Design Intent Expression through Structured Specifications and Generative AI

Abstract:
Large language models (LLMs) promise to accelerate UI design, yet current tools struggle with two fundamentals: externalizing designers' intent and controlling iterative change. We introduce SPEC, a structured, parameterized, hierarchical intermediate representation that exposes UI elements as controllable parameters. Building on SPEC, we present SpecifyUI, an interactive system that extracts SPEC from UI references via region segmentation and vision‑language models, composes UIs across multiple sources, and supports targeted edits at global, regional, and component levels. A multi‑agent generator renders SPEC into high‑fidelity designs, closing the loop between intent expression and controllable generation. Quantitative experiments show SPEC‑based generation more faithfully captures reference intent than prompt‑based baselines. In a user study with 16 professional designers, SpecifyUI significantly outperformed Stitch on intent alignment, design quality, controllability, and overall experience in human‑AI co‑creation. Our results position SPEC as a specification‑driven paradigm that shifts LLM‑assisted design from one‑shot prompting to iterative, collaborative workflows.

PaperID: 8773, https://arxiv.org/pdf/2509.07325.pdf

Authors: Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon

Title: CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Abstract:
The National Comprehensive Cancer Network (NCCN) provides evidence‑based guidelines for cancer treatment. Translating complex patient presentations into guideline‑compliant treatment recommendations is time‑intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent‑based approach to automatically generate guideline‑concordant treatment trajectories for patients with non‑small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board‑certified oncologists. Second, we demonstrate that existing LLMs possess domain‑specific knowledge that enables high‑quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert‑annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta‑classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom‑tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM‑based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.

PaperID: 8774, https://arxiv.org/pdf/2509.07234.pdf

Authors: Yanlin Zhou, Manshi Limbu, Xuesu Xiao

Title: Efficient Multi-Agent Coordination via Dynamic Joint-State Graph Construction

Abstract:
Multi‑agent pathfinding (MAPF) traditionally focuses on collision avoidance, but many real‑world applications require active coordination between agents to improve team performance. This paper introduces Team Coordination on Graphs with Risky Edges (TCGRE), where agents collaborate to reduce traversal costs on high‑risk edges via support from teammates. We reformulate TCGRE as a 3D matching problem‑mapping robot pairs, support pairs, and time steps‑and rigorously prove its NP‑hardness via reduction from Minimum 3D Matching. To address this complexity, (in the conference version) we proposed efficient decomposition methods, reducing the problem to tractable subproblems: Joint‑State Graph (JSG): Encodes coordination as a single‑agent shortest‑path problem. Coordination‑Exhaustive Search (CES): Optimizes support assignments via exhaustive pairing. Receding‑Horizon Optimistic Cooperative A (RHOCA): Balances optimality and scalability via horizon‑limited planning. Further in this extension, we introduce a dynamic graph construction method (Dynamic‑HJSG), leveraging agent homogeneity to prune redundant states and reduce computational overhead by constructing the joint‑state graph dynamically. Theoretical analysis shows Dynamic‑HJSG preserves optimality while lowering complexity from exponential to polynomial in key cases. Empirical results validate scalability for large teams and graphs, with HJSG outperforming baselines greatly in runtime in different sizes and types of graphs. This work bridges combinatorial optimization and multi‑agent planning, offering a principled framework for collaborative pathfinding with provable guarantees, and the key idea of the solution can be widely extended to many other collaborative optimization problems, such as MAPF.

PaperID: 8775, https://arxiv.org/pdf/2509.07009.pdf

Authors: Anton Kolonin, Vladimir Kryukov

Title: Computational Concept of the Psyche (in Russian)

Abstract:
The article provides an overview of approaches to modeling the human psyche in the perspective of building an artificial one. Based on the review, a concept of cognitive architecture is proposed, where the psyche is considered as an operating system of a living or artificial subject, including a space of needs that determines its life meanings in connection with stimuli from the external world, and intelligence as a decision‑making system for actions in relation to this world in order to satisfy these needs. Based on the concept, a computational formalization is proposed for creating artificial intelligence systems through learning from experience in the space of a space of needs, taking into account their biological or existential significance for an intelligent agent. Thus, the problem of building general artificial intelligence as a system for making optimal decisions in the space of agent‑specific needs under conditions of uncertainty is formalized, with maximization of success in achieving goals, minimization of existential risks and maximization of energy efficiency. A minimal experimental implementation of the model is also provided.

PaperID: 8776, https://arxiv.org/pdf/2509.06921.pdf

Authors: Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

Title: Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities

Abstract:
Cybersecurity demands both rapid pattern recognition and deliberative reasoning, yet purely neural or purely symbolic approaches each address only one side of this duality. Neuro‑Symbolic (NeSy) AI bridges this gap by integrating learning and logic within a unified framework. This systematic review analyzes 103 publications across the neural‑symbolic integration spectrum in cybersecurity through April 2026, organizing them via a three‑tier taxonomy ‑‑ deep integration, structured interaction, and contextual baselines ‑‑ and a Grounding‑Instructibility‑Alignment (G‑I‑A) analytical lens. We find that multi‑agent and structured‑integration architectures across the surveyed spectrum substantially outperform single‑agent approaches in complex scenarios, causal reasoning enables proactive defense beyond correlation‑based detection, and knowledge‑guided learning improves both data efficiency and explainability. These findings span intrusion detection, malware analysis, vulnerability discovery, and autonomous penetration testing, revealing that integration depth often correlates with capability gains across domains. A first‑of‑its‑kind dual‑use analysis further shows that autonomous offensive systems in the broader survey corpus are already achieving notable zero‑day exploitation success at significantly reduced cost, fundamentally reshaping threat landscapes. However, critical barriers persist: evaluation standardization remains nascent, computational costs constrain deployment, and effective human‑AI collaboration is underexplored. We distill these findings into a prioritized research roadmap emphasizing community‑driven benchmarks, responsible development practices, and defensive alignment to guide the next generation of NeSy cybersecurity systems.

PaperID: 8777, https://arxiv.org/pdf/2509.06917.pdf

Authors: Jiacheng Miao, Joe R. Davis, Yaohui Zhang, Jonathan K. Pritchard, James Zou

Title: Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

Abstract:
We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent's effectiveness in creating reliable and capable paper agents through in‑depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single‑cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper's results and can correctly carry out novel user queries. Paper2Agent automatically created AI co‑scientist that identified new splicing variant associated with ADHD risk. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co‑scientists.

PaperID: 8778, https://arxiv.org/pdf/2509.06858.pdf

Authors: Vincent C. Brockers, David A. Ehrlich, Viola Priesemann

Title: Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models

Abstract:
Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We develop a Bayesian framework to disentangle and quantify three such biases: (i) A topic bias toward the LLM's default stance; (ii) an agreement bias favoring agreement to the prompted statement irrespective of the question; and (iii) an anchoring bias toward the initiating agent's stance. We apply this framework to various LLMs that performed multi‑step dialogues on 12 different questions from climate change and societal justice to music preferences. We find that opinion trajectories tend to quickly converge to a shared attractor, with the influence of both interaction and biases decaying over time, and with the impact of biases differing between LLMs. In addition, we show that fine‑tuning an LLM on different sets of strongly opinionated statements (including misinformation) shifts the opinion attractor correspondingly. By exposing stark differences between LLMs and providing quantitative tools for comparing interaction and bias contributions to opinion shifts in LLM agent discussions, our approach highlights both promises and pitfalls of using LLMs as proxies for human behavior.

PaperID: 8779, https://arxiv.org/pdf/2509.06822.pdf

Authors: Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Charlotte Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, Daben Liu

Title: RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

Abstract:
The advent of complex, interconnected long‑horizon LLM systems has made it incredibly tricky to identify where and when these systems break down. Evaluation capabilities that currently exist today are limited in that they often focus on simple metrics, end‑to‑end outcomes, and are dependent on the perspectives of humans. In order to match the increasing complexity of these many component systems, evaluation frameworks must also be able to reason, probe, iterate, and understand the nuanced logic passing through these systems. In this paper, we present RAFFLES, an offline evaluation architecture that incorporates iterative reasoning. Specifically, RAFFLES operates as an iterative, multi‑component pipeline, using a central Judge to systematically identify faults and a set of specialized Evaluators to assess the quality of the candidate faults as well as rationales of the Judge. We evaluated RAFFLES with several benchmarks ‑ the Who&When dataset to identify step‑level faults in multi‑agent systems and the ReasonEval datasets to diagnose step‑level mathematical reasoning errors. RAFFLES outperforms strong baselines, achieving an accuracy of over 20% and 50% on the Who&When Hand‑Crafted and Algorithmically‑Generated datasets, and over 80% on the ReasonEval datasets. These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor‑intensive manual review.

PaperID: 8780, https://arxiv.org/pdf/2509.06775.pdf

Authors: Po-Heng Chou, Pin-Qi Fu, Walid Saad, Li-Chun Wang

Title: Agentic DDQN-Based Scheduling for Licensed and Unlicensed Band Allocation in Sidelink Networks

Abstract:
In this paper, we present an agentic double deep Q‑network (DDQN) scheduler for licensed/unlicensed band allocation in New Radio (NR) sidelink (SL) networks. Beyond conventional reward‑seeking reinforcement learning (RL), the agent perceives and reasons over a multi‑dimensional context that jointly captures queueing delay, link quality, coexistence intensity, and switching stability. A capacity‑aware, quality of service (QoS)‑constrained reward aligns the agent with goal‑oriented scheduling rather than static thresholding. Under constrained bandwidth, the proposed design reduces blocking by up to 87.5% versus threshold policies while preserving throughput, highlighting the value of context‑driven decisions in coexistence‑limited NR SL networks. The proposed scheduler is an embodied agent (E‑agent) tailored for task‑specific, resource‑efficient operation at the network edge.

PaperID: 8781, https://arxiv.org/pdf/2509.06602.pdf

Authors: Matthias Blondeel, Noel Codella, Sam Preston, Hao Qiu, Leonardo Schettini, Frank Tuan, Wen-wai Yim, Smitha Saligrama, Mert Öz, Shrey Jain, Matthew P. Lungren, Thomas Osborne

Title: Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards

Abstract:
Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor‑intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)‑driven AI agent that coordinates a multi‑agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a ``model‑as‑a‑judge'' framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de‑identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high‑importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data‑free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.

PaperID: 8782, https://arxiv.org/pdf/2509.06588.pdf

Authors: Mohammadreza Doostmohammadian, Hamid R. Rabiee

Title: Distributed Automatic Generation Control subject to Ramp-Rate-Limits: Anytime Feasibility and Uniform Network-Connectivity

Abstract:
This paper considers automatic generation control over an information‑sharing network of communicating generators as a multi‑agent system. The optimization solution is distributed among the agents based on information consensus algorithms, while addressing the generators' ramp‑rate‑limits (RRL). This is typically ignored in the existing linear/nonlinear optimization solutions but they exist in real‑time power generation scenarios. Without addressing the RRL, the generators cannot follow the assigned rate of generating power by the optimization algorithm; therefore, the existing solutions may not necessarily converge to the exact optimal cost or may lose feasibility in practice. The proposed solution in this work addresses the ramp‑rate‑limit constraint along with the box constraint (limits on the generated powers) and the coupling‑constraint (generation‑demand balance) at all iteration times of the algorithm. The latter is referred to as the anytime feasibility and implies that at every termination point of the algorithm, the balance between the demand and generated power holds. To improve the convergence rate of the algorithm we further consider internal signum‑based nonlinearity. We also show that our solution can tolerate communication link removal. This follows from the uniform‑connectivity assumption on the communication network.

PaperID: 8783, https://arxiv.org/pdf/2509.06501.pdf

Authors: Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He

Title: WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Abstract:
The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open‑source web agents either demonstrate limited information‑seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model‑based exploration and iterative, long‑to‑short query evolution. This method creates challenging query‑answer pairs that require multi‑step reasoning and complex web navigation. By leveraging our curated high‑quality dataset, we successfully develop advanced web agent WebExplorer‑8B through supervised fine‑tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long‑horizon problem solving. Across diverse information‑seeking benchmarks, WebExplorer‑8B achieves the state‑of‑the‑art performance at its scale. Notably, as an 8B‑sized model, WebExplorer‑8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor‑72B on BrowseComp‑en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information‑seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge‑intensive QA data. These results highlight our approach as a practical path toward long‑horizon web agents.

PaperID: 8784, https://arxiv.org/pdf/2509.06493.pdf

Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao

Title: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Abstract:
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training‑time reinforcement learning (RL) and inference‑time compute. This paper introduces \textttBFS‑Prover‑V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi‑turn off‑policy RL framework for continually improving the performance of LLM step‑prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi‑stage expert iteration pipeline featuring adaptive tactic‑level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long‑term RL in LLM‑based agents. The second innovation is a planner‑enhanced multi‑agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high‑level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state‑of‑the‑art results on established formal mathematics benchmarks. \textttBFS‑Prover‑V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long‑horizon multi‑turn reasoning and complex search.

PaperID: 8785, https://arxiv.org/pdf/2509.06431.pdf

Authors: Arthur Casals, Anarosa A. F. Brandão

Title: HECATE: An ECS-based Framework for Teaching and Developing Multi-Agent Systems

Abstract:
This paper introduces HECATE, a novel framework based on the Entity‑Component‑System (ECS) architectural pattern that bridges the gap between distributed systems engineering and MAS development. HECATE is built using the Entity‑Component‑System architectural pattern, leveraging data‑oriented design to implement multiagent systems. This approach involves engineering multiagent systems (MAS) from a distributed systems (DS) perspective, integrating agent concepts directly into the DS domain. This approach simplifies MAS development by (i) reducing the need for specialized agent knowledge and (ii) leveraging familiar DS patterns and standards to minimize the agent‑specific knowledge required for engineering MAS. We present the framework's architecture, core components, and implementation approach, demonstrating how it supports different agent models.

PaperID: 8786, https://arxiv.org/pdf/2509.06382.pdf

Authors: Yingke Ding, Zeyu Wang, Xiyuxing Zhang, Hongbin Chen, Zhenan Xu

Title: Context-Adaptive Hearing Aid Fitting Advisor through Multi-turn Multimodal LLM Conversation

Abstract:
Traditional hearing aids often rely on static fittings that fail to adapt to their dynamic acoustic environments. We propose CAFA, a Context‑Adaptive Fitting Advisor that provides personalized, real‑time hearing aid adjustments through a multi‑agent Large Language Model (LLM) workflow. CAFA combines live ambient audio, audiograms, and user feedback in a multi‑turn conversational system. Ambient sound is classified into conversation, noise, or quiet with 91.2% accuracy using a lightweight neural network based on YAMNet embeddings. This system utilizes a modular LLM workflow, comprising context acquisition, subproblem classification, strategy provision, and ethical regulation, and is overseen by an LLM Judge. The workflow translates context and feedback into precise, safe tuning commands. Evaluation confirms that real‑time sound classification enhances conversational efficiency. CAFA exemplifies how agentic, multimodal AI can enable intelligent, user‑centric assistive technologies.

PaperID: 8787, https://arxiv.org/pdf/2509.06355.pdf

Authors: Yunzhe Wang, Volkan Ustun, Chris McGroarty

Title: A Data-Driven Discretized CS:GO Simulation Environment to Facilitate Strategic Multi-Agent Planning Research

Abstract:
Modern simulation environments for complex multi‑agent interactions must balance high‑fidelity detail with computational efficiency. We present DECOY, a novel multi‑agent simulator that abstracts strategic, long‑horizon planning in 3D terrains into high‑level discretized simulation while preserving low‑level environmental fidelity. Using Counter‑Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning ‑‑ without explicitly modeling low‑level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi‑agent planning and behavior generation.

PaperID: 8788, https://arxiv.org/pdf/2509.06283.pdf

Authors: Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty

Title: SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Abstract:
Equipping large language models (LLMs) with complex, interleaved reasoning and tool‑use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning‑oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single‑Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi‑agent systems, where agents take up pre‑defined roles and are told what to do at each step in a static workflow, an autonomous single‑agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction‑tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning‑optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open‑source LLMs. Our best variant SFR‑DR‑20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

PaperID: 8789, https://arxiv.org/pdf/2509.06069.pdf

Authors: Alexander Erlei

Title: From Digital Distrust to Codified Honesty: Experimental Evidence on Generative AI in Credence Goods Markets

Abstract:
Generative AI is transforming the provision of expert services. This article uses a series of one‑shot experiments to quantify the behavioral, welfare and distribution consequences of large language models (LLMs) on AI‑AI, Human‑Human, Human‑AI and Human‑AI‑Human expert markets. Using a credence goods framework where experts have private information about the optimal service for consumers, we find that Human‑Human markets generally achieve higher levels of efficiency than AI‑AI and Human‑AI markets through pro‑social expert preferences and higher consumer trust. Notably, LLM experts still earn substantially higher surplus than human experts ‑‑ at the expense of consumer surplus ‑ suggesting adverse incentives that may spur the harmful deployment of LLMs. Concurrently, a majority of human experts chooses to rely on LLM agents when given the opportunity in Human‑AI‑Human markets, especially if they have agency over the LLM's (social) objective function. Here, a large share of experts prioritizes efficiency‑loving preferences over pure self‑interest. Disclosing these preferences to consumers induces strong efficiency gains by marginalizing self‑interested LLM experts and human experts. Consequently, Human‑AI‑Human markets outperform Human‑Human markets under transparency rules. With obfuscation, however, efficiency gains disappear, and adverse expert incentives remain. Our results shed light on the potential opportunities and risks of disseminating LLMs in the context of expert services and raise several regulatory challenges. On the one hand, LLMs can negatively affect human trust in the presence of information asymmetries and partially crowd‑out experts' other‑regarding preferences through automation. On the other hand, LLMs allow experts to codify and communicate their objective function, which reduces information asymmetries and increases efficiency.

PaperID: 8790, https://arxiv.org/pdf/2509.06053.pdf

Authors: Mingrui Lv, Hangzhi Liu, Zhi Luo, Hongjie Zhang, Jie Ou

Title: PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training

Abstract:
Multi‑agent reinforcement learning (MARL) has achieved significant progress in solving complex multi‑player games through self‑play. However, training effective adversarial policies requires millions of experience samples and substantial computational resources. Moreover, these policies lack interpretability, hindering their practical deployment. Recently, researchers have successfully leveraged Large Language Models (LLMs) to generate programmatic policies for single‑agent tasks, transforming neural network‑based policies into interpretable rule‑based code with high execution efficiency. Inspired by this, we propose PolicyEvolve, a general framework for generating programmatic policies in multi‑player games. PolicyEvolve significantly reduces reliance on manually crafted policy code, achieving high‑performance policies with minimal environmental interactions. The framework comprises four modules: Global Pool, Local Pool, Policy Planner, and Trajectory Critic. The Global Pool preserves elite policies accumulated during iterative training. The Local Pool stores temporary policies for the current iteration; only sufficiently high‑performing policies from this pool are promoted to the Global Pool. The Policy Planner serves as the core policy generation module. It samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines this policy using feedback from the Trajectory Critic. Refined policies are then deposited into the Local Pool. This iterative process continues until the policy achieves a sufficiently high average win rate against the Global Pool, at which point it is integrated into the Global Pool. The Trajectory Critic analyzes interaction data from the current policy, identifies vulnerabilities, and proposes directional improvements to guide the Policy Planner

PaperID: 8791, https://arxiv.org/pdf/2509.06031.pdf

Authors: Junhui Huang, Yuhe Gong, Changsheng Li, Xingguang Duan, Luis Figueredo

Title: GELATO: Multi-Instruction Trajectory Reshaping via Geometry-Aware Multiagent-based Orchestration

Abstract:
We present GELATO ‑‑ the first language‑driven trajectory reshaping framework to embed geometric environment awareness and multi‑agent feedback orchestration to support multi‑instruction in human‑robot interaction scenarios. Unlike prior learning‑based methods, our approach automatically registers scene objects as 6D geometric primitives via a VLM‑assisted multi‑view pipeline, and an LLM translates free‑form multiple instructions into explicit, verifiable geometric constraints. These are integrated into a geometric‑aware vector field optimization to adapt initial trajectories while preserving smoothness, feasibility, and clearance. We further introduce a multi‑agent orchestration with observer‑based refinement to handle multi‑instruction inputs and interactions among objectives ‑‑ increasing success rate without retraining. Simulation and real‑world experiments demonstrate our method achieves smoother, safer, and more interpretable trajectory modifications compared to state‑of‑the‑art baselines.

PaperID: 8792, https://arxiv.org/pdf/2509.05889.pdf

Authors: Mahsa Paknejad, Parisa Fard Moshiri, Murat Simsek, Burak Kantarci, Hussein T. Mouftah

Title: On-Dyn-CDA: A Real-Time Cost-Driven Task Offloading Algorithm for Vehicular Networks with Reduced Latency and Task Loss

Abstract:
Real‑time task processing is a critical challenge in vehicular networks, where achieving low latency and minimizing dropped task ratio depend on efficient task execution. Our primary objective is to maximize the number of completed tasks while minimizing overall latency, with a particular focus on reducing number of dropped tasks. To this end, we investigate both static and dynamic versions of an optimization algorithm. The static version assumes full task availability, while the dynamic version manages tasks as they arrive. We also distinguish between online and offline cases: the online version incorporates execution time into the offloading decision process, whereas the offline version excludes it, serving as a theoretical benchmark for optimal performance. We evaluate our proposed Online Dynamic Cost‑Driven Algorithm (On‑Dyn‑CDA) against these baselines. Notably, the static Particle Swarm Optimization (PSO) baseline assumes all tasks are transferred to the RSU and processed by the MEC, and its offline version disregards execution time, making it infeasible for real‑time applications despite its optimal performance in theory. Our novel On‑Dyn‑CDA completes execution in just 0.05 seconds under the most complex scenario, compared to 1330.05 seconds required by Dynamic PSO. It also outperforms Dynamic PSO by 3.42% in task loss and achieves a 29.22% reduction in average latency in complex scenarios. Furthermore, it requires neither a dataset nor a training phase, and its low computational complexity ensures efficiency and scalability in dynamic environments.

PaperID: 8793, https://arxiv.org/pdf/2509.05882.pdf

Authors: Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy

Title: Collaborate, Deliberate, Evaluate: How LLM Alignment Affects Coordinated Multi-Agent Outcomes

Abstract:
As Large Language Models (LLMs) get integrated into diverse workflows, they are increasingly being regarded as "collaborators" with humans, and required to work in coordination with other AI systems. If such AI collaborators are to reliably coordinate their actions and behaviors with humans or other AIs, their properties and behaviors over multi‑turn interactions must be known and predictable. This paper examines how different alignment methods affect LLM agents' effectiveness as partners in multi‑turn, multi‑party collaborations. We study this question through the lens of intervention agents that insert themselves into group dialogues not to provide answers, but to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision‑making. Common alignment techniques are typically developed under simplified single‑user settings and assume the optimality of the underlying token MDP. Using the theoretical lens of the modified‑action MDP, we show how they do not account for the dynamics of long‑horizon multi‑party interactions. We present a novel roleplay simulation methodology, where we align LLMs according to different methods and then deploy them in collaborative task dialogues to quantify how interventions affect the trajectory of group collaboration, belief alignment, and coordination. Our results show that an intervention agent that is robust to action modification significantly outperforms common alignment baselines in supporting correct task outcomes.

PaperID: 8794, https://arxiv.org/pdf/2509.05818.pdf

Authors: Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu

Title: Chatbot To Help Patients Understand Their Health

Abstract:
Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid‑Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi‑agent large language model (LLM) and reinforcement learning (RL) setup without human‑labeled data. NoteAid‑Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine‑tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human‑aligned assessments and case studies, demonstrates that NoteAid‑Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)‑based reward modeling can successfully train lightweight, domain‑specific chatbots to handle multi‑turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid‑Chatbot surpasses non‑expert human. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low‑cost, PPO‑based RL to realistic, open‑ended conversational domains, broadening the applicability of RL‑based alignment methods.

PaperID: 8795, https://arxiv.org/pdf/2509.05764.pdf

Authors: Yuwei Lou, Hao Hu, Shaocong Ma, Zongfei Zhang, Liang Wang, Jidong Ge, Xianping Tao

Title: DRF: LLM-AGENT Dynamic Reputation Filtering Framework

Abstract:
With the evolution of generative AI, multi ‑ agent systems leveraging large ‑ language models(LLMs) have emerged as a powerful tool for complex tasks. However, these systems face challenges in quantifying agent performance and lack mechanisms to assess agent credibility. To address these issues, we introduce DRF, a dynamic reputation filtering framework. DRF constructs an interactive rating network to quantify agent performance, designs a reputation scoring mechanism to measure agent honesty and capability, and integrates an Upper Confidence Bound ‑ based strategy to enhance agent selection efficiency. Experiments show that DRF significantly improves task completion quality and collaboration efficiency in logical reasoning and code ‑ generation tasks, offering a new approach for multi ‑ agent systems to handle large ‑ scale tasks.

PaperID: 8796, https://arxiv.org/pdf/2509.05651.pdf

Authors: Lukas Beckenbauer, Johannes-Lucas Loewe, Ge Zheng, Alexandra Brintrup

Title: Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks

Abstract:
Complex, non‑linear tasks challenge LLM‑enhanced multi‑agent systems (MAS) due to partial observability and suboptimal coordination. We propose Orchestrator, a novel MAS framework that leverages attention‑inspired self‑emergent coordination and reflective benchmarking to optimize global task performance. Orchestrator introduces a monitoring mechanism to track agent‑environment dynamics, using active inference benchmarks to optimize system behavior. By tracking agent‑to‑agent and agent‑to‑environment interaction, Orchestrator mitigates the effects of partial observability and enables agents to approximate global task solutions more efficiently. We evaluate the framework on a series of maze puzzles of increasing complexity, demonstrating its effectiveness in enhancing coordination and performance in dynamic, non‑linear environments with long‑horizon objectives.

PaperID: 8797, https://arxiv.org/pdf/2509.05584.pdf

Authors: Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

Title: ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

Abstract:
Foundation models face growing compute and memory bottlenecks, hindering deployment on resource‑limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per‑layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling‑guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post‑training dynamic quantization. Our modular multi‑agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture‑specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer‑wise decisions to bottlenecks. Experiments on ImageNet‑1K, CIFAR‑10, and CIFAR‑100 with ResNet‑101, ViT‑B/16, Swin‑B, and DeiT‑B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet‑1K, +2% gains for ViT‑B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74 times faster. Comparative studies with GPT‑4o and GPT‑4‑Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling‑guided model optimization.

PaperID: 8798, https://arxiv.org/pdf/2509.05496.pdf

Authors: Charbel Mattar, Jacques Bou Abdo, Abdallah Makhoul, Benoit Piranda, Jacques Demerjian

Title: What is Cybersecurity in Space?

Abstract:
Satellites, drones, and 5G space links now support critical services such as air traffic, finance, and weather. Yet most were not built to resist modern cyber threats. Ground stations can be breached, GPS jammed, and supply chains compromised, while no shared list of vulnerabilities or safe testing range exists. This paper maps eleven research gaps, including secure routing, onboard intrusion detection, recovery methods, trusted supply chains, post‑quantum encryption, zero‑trust architectures, and real‑time impact monitoring. For each, we outline the challenge, why it matters, and a guiding research question. We also highlight an agentic (multi‑agent) AI approach where small, task‑specific agents share defense tasks onboard instead of one large model. Finally, we propose a five‑year roadmap: post‑quantum and QKD flight trials, open cyber‑ranges, clearer vulnerability shar ing, and early multi‑agent deployments. These steps move space cybersecurity from reactive patching toward proactive resilience.

PaperID: 8799, https://arxiv.org/pdf/2509.05469.pdf

Authors: Chenguang Wang, Xiang Yan, Yilong Dai, Ziyi Wang, Susu Xu

Title: From Image Generation to Infrastructure Design: a Multi-agent Pipeline for Street Design Generation

Abstract:
Realistic visual renderings of street‑design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor‑intensive, hindering collective deliberation and collaborative decision‑making. While AI‑assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain‑specific training data and struggle to enable precise spatial variations of design/configuration in complex street‑view scenes. We introduce a multi‑agent system that edits and redesigns bicycle facilities directly on real‑world street‑view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction‑compliant results. This work establishes a foundation for applying multi‑agent pipelines to transportation infrastructure planning and facility design.

PaperID: 8800, https://arxiv.org/pdf/2509.05446.pdf

Authors: Iftekhar Haider Chowdhury, Zaed Ikbal Syed, Ahmed Faizul Haque Dhrubo, Mohammad Abdul Qayum

Title: Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's

Abstract:
Deep Convolutional Neural Networks have achieved state of the art performance across various computer vision tasks, however their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient based sensitivity, first order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to the model performance. Unlike iterative or reinforcement learning based pruning strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward‑backward pass for scoring and pruning. Extensive experiments across varying pruning rates between 50 to 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving over 80 percent Floating point Operations Per Seconds reduction while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive Deep Convolutional Neural Networks compression, paving the way for efficient deployment on edge and mobile platforms.

PaperID: 8801, https://arxiv.org/pdf/2509.05396.pdf

Authors: Andrea Wynn, Harsh Satija, Gillian Hadfield

Title: Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Abstract:
While multi‑agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. Prior work has primarily focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi‑agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time ‑ even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. We perform additional experiments investigating various potential contributing factors to these harmful shifts ‑ including sycophancy, social conformity, and model and task type. These results highlight important failure modes in the exchange of reasons during multi‑agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivised nor adequately equipped to resist persuasive but incorrect reasoning.

PaperID: 8802, https://arxiv.org/pdf/2509.05378.pdf

Authors: Andreas Motzfeldt, Joakim Edin, Casper L. Christensen, Christian Hardmeier, Lars Maaløe, Anna Rogers

Title: Code Like Humans: A Multi-Agent Solution for Medical Coding

Abstract:
In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD‑10 coding system (+70K labels). It achieves the best performance to date on rare diagnosis codes (fine‑tuned discriminative classifiers retain an advantage for high‑frequency codes, to which they are limited). Towards future work, we also contribute an analysis of system performance and identify its `blind spots' (codes that are systematically undercoded).

PaperID: 8803, https://arxiv.org/pdf/2509.05363.pdf

Authors: Lijie Ding, Changwoo Do

Title: SasAgent: Multi-Agent AI System for Small-Angle Scattering Data Analysis

Abstract:
We introduce SasAgent, a multi‑agent AI system powered by large language models (LLMs) that automates small‑angle scattering (SAS) data analysis by leveraging tools from the SasView software and enables user interaction via text input. SasAgent features a coordinator agent that interprets user prompts and delegates tasks to three specialized agents for scattering length density (SLD) calculation, synthetic data generation, and experimental data fitting. These agents utilize LLM‑friendly tools to execute tasks efficiently. These tools, including the model data tool, Retrieval‑Augmented Generation (RAG) documentation tool, bump fitting tool, and SLD calculator tool, are derived from the SasView Python library. A user‑friendly Gradio‑based interface enhances user accessibility. Through diverse examples, we demonstrate SasAgent's ability to interpret complex prompts, calculate SLDs, generate accurate scattering data, and fit experimental datasets with high precision. This work showcases the potential of LLM‑driven AI systems to streamline scientific workflows and enhance automation in SAS research.

PaperID: 8804, https://arxiv.org/pdf/2509.05263.pdf

Authors: Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Yenan Lin, Hao Jiang, Kang Chen, Shuang Qiu

Title: LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

Abstract:
Recent research has been increasingly focusing on developing 3D world models that simulate complex real‑world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim‑to‑real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA‑2‑7B) alongside the industry‑grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large‑scale 3D interactive worlds with dynamic agents, featuring competitive multi‑agent interaction, high‑fidelity physics simulation, and real‑time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a 90× increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

PaperID: 8805, https://arxiv.org/pdf/2509.05197.pdf

Authors: Naimeng Ye, Xiao Yu, Ruize Xu, Tianyi Peng, Zhou Yu

Title: AI Agents for Web Testing: A Case Study in the Wild

Abstract:
Automated web testing plays a critical role in ensuring high‑quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLM) and AI agents opens new possibilities for web testing by enabling human‑like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent‑based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human‑readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues‑‑many of which were missed by traditional tools. Our findings highlight agent‑based testing as a promising direction while outlining directions for developing next‑generation, user‑centered testing frameworks.

PaperID: 8806, https://arxiv.org/pdf/2509.05091.pdf

Authors: Matteo Bortoletto, Yichao Zhou, Lance Ying, Tianmin Shu, Andreas Bulling

Title: ProToM: Promoting Prosocial Behaviour via Theory of Mind-Informed Feedback

Abstract:
While humans are inherently social creatures, the challenge of identifying when and how to assist and collaborate with others ‑ particularly when pursuing independent goals ‑ can hinder cooperation. To address this challenge, we aim to develop an AI system that provides useful feedback to promote prosocial behaviour ‑ actions that benefit others, even when not directly aligned with one's own goals. We introduce ProToM, a Theory of Mind‑informed facilitator that promotes prosocial actions in multi‑agent systems by providing targeted, context‑sensitive feedback to individual agents. ProToM first infers agents' goals using Bayesian inverse planning, then selects feedback to communicate by maximising expected utility, conditioned on the inferred goal distribution. We evaluate our approach against baselines in two multi‑agent environments: Doors, Keys, and Gems, as well as Overcooked. Our results suggest that state‑of‑the‑art large language and reasoning models fall short of communicating feedback that is both contextually grounded and well‑timed ‑ leading to higher communication overhead and task speedup. In contrast, ProToM provides targeted and helpful feedback, achieving a higher success rate, shorter task completion times, and is consistently preferred by human users.

PaperID: 8807, https://arxiv.org/pdf/2509.05051.pdf

Authors: Aaron Mark Thomas, Yu-Cheng Chen, Hubert Okadome Valencia, Sharu Theresa Jose, Ronin Wu

Title: QCA-MolGAN: Quantum Circuit Associative Molecular GAN with Multi-Agent Reinforcement Learning

Abstract:
Navigating the vast chemical space of molecular structures to design novel drug molecules with desired target properties remains a central challenge in drug discovery. Recent advances in generative models offer promising solutions. This work presents a novel quantum circuit Born machine (QCBM)‑enabled Generative Adversarial Network (GAN), called QCA‑MolGAN, for generating drug‑like molecules. The QCBM serves as a learnable prior distribution, which is associatively trained to define a latent space aligning with high‑level features captured by the GANs discriminator. Additionally, we integrate a novel multi‑agent reinforcement learning network to guide molecular generation with desired targeted properties, optimising key metrics such as quantitative estimate of drug‑likeness (QED), octanol‑water partition coefficient (LogP) and synthetic accessibility (SA) scores in conjunction with one another. Experimental results demonstrate that our approach enhances the property alignment of generated molecules with the multi‑agent reinforcement learning agents effectively balancing chemical properties.

PaperID: 8808, https://arxiv.org/pdf/2509.05042.pdf

Authors: Cristiano Caissutti, Estelle Gerbier, Ehsan Khorrambakht, Paolo Marinelli, Andrea Munafo', Andrea Caiti

Title: Shared Autonomy through LLMs and Reinforcement Learning for Applications to Ship Hull Inspections

Abstract:
Shared autonomy is a promising paradigm in robotic systems, particularly within the maritime domain, where complex, high‑risk, and uncertain environments necessitate effective human‑robot collaboration. This paper investigates the interaction of three complementary approaches to advance shared autonomy in heterogeneous marine robotic fleets: (i) the integration of Large Language Models (LLMs) to facilitate intuitive high‑level task specification and support hull inspection missions, (ii) the implementation of human‑in‑the‑loop interaction frameworks in multi‑agent settings to enable adaptive and intent‑aware coordination, and (iii) the development of a modular Mission Manager based on Behavior Trees to provide interpretable and flexible mission control. Preliminary results from simulation and real‑world lake‑like environments demonstrate the potential of this multi‑layered architecture to reduce operator cognitive load, enhance transparency, and improve adaptive behaviour alignment with human intent. Ongoing work focuses on fully integrating these components, refining coordination mechanisms, and validating the system in operational port scenarios. This study contributes to establishing a modular and scalable foundation for trustworthy, human‑collaborative autonomy in safety‑critical maritime robotics applications.

PaperID: 8809, https://arxiv.org/pdf/2509.04993.pdf

Authors: Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang

Title: LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration

Abstract:
The ubiquitous computing resources in 6G networks provide ideal environments for the fusion of large language models (LLMs) and intelligent services through the agent framework. With auxiliary modules and planning cores, LLM‑enabled agents can autonomously plan and take actions to deal with diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM‑enabled agents with complex tool calls, highlighting the urgent need for efficient multi‑level device collaborations. To this end, the framework and method of the LLM‑enabled multi‑agent system with dual‑loop terminal‑edge collaborations are proposed in 6G networks. Firstly, the outer loop consists of the iterative collaborations between the global agent and multiple sub‑agents deployed on edge servers and terminals, where the planning capability is enhanced through task decomposition and parallel sub‑task distribution. Secondly, the inner loop utilizes sub‑agents with dedicated roles to circularly reason, execute, and replan the sub‑task, and the parallel tool calling generation with offloading strategies is incorporated to improve efficiency. The improved task planning capability and task execution efficiency are validated through the conducted case study in 6G‑supported urban safety governance. Finally, the open challenges and future directions are thoroughly analyzed in 6G networks, accelerating the advent of the 6G era.

PaperID: 8810, https://arxiv.org/pdf/2509.04876.pdf

Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang

Title: OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration

Abstract:
This paper introduces OSC (Orchestrating Cognitive Synergy), a knowledge‑aware adaptive collaboration framework designed to enhance cognitive synergy in multi‑agent systems with large language models. While prior work has advanced agent selection and result aggregation, efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck. OSC addresses this gap as a pivotal intermediate layer between selection and aggregation, introducing Collaborator Knowledge Models (CKM) to enable each agent to dynamically perceive its collaborators' cognitive states. Through real‑time cognitive gap analysis, agents adaptively adjust communication behaviors, including content focus, detail level, and expression style, using learned strategies. Experiments on complex reasoning and problem‑solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming "parallel‑working individuals'' into a "deeply collaborative cognitive team.'' This framework not only optimizes multi‑agent collaboration but also offers new insights into LLM agent interaction behaviors.

PaperID: 8811, https://arxiv.org/pdf/2509.04871.pdf

Authors: Krittanon Kaewtawee, Wachiravit Modecrua, Krittin Pachtrachai, Touchapon Kraisingkorn

Title: Cloning a Conversational Voice AI Agent from Call\,Recording Datasets for Telesales

Abstract:
Recent advances in language and speech modelling have made it possible to build autonomous voice assistants that understand and generate human dialogue in real time. These systems are increasingly being deployed in domains such as customer service and healthcare care, where they can automate repetitive tasks, reduce operational costs, and provide constant support around the clock. In this paper, we present a general methodology for cloning a conversational voice AI agent from a corpus of call recordings. Although the case study described in this paper uses telesales data to illustrate the approach, the underlying process generalizes to any domain where call transcripts are available. Our system listens to customers over the telephone, responds with a synthetic voice, and follows a structured playbook learned from top performing human agents. We describe the domain selection, knowledge extraction, and prompt engineering used to construct the agent, integrating automatic speech recognition, a large language model based dialogue manager, and text to speech synthesis into a streaming inference pipeline. The cloned agent is evaluated against human agents on a rubric of 22 criteria covering introduction, product communication, sales drive, objection handling, and closing. Blind tests show that the AI agent approaches human performance in routine aspects of the call while underperforming in persuasion and objection handling. We analyze these shortcomings and refine the prompt accordingly. The paper concludes with design lessons and avenues for future research, including large scale simulation and automated evaluation.

PaperID: 8812, https://arxiv.org/pdf/2509.04809.pdf

Authors: Haechang Kim, Hao Chen, Can Li, Jong Min Lee

Title: TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models

Abstract:
Explainable Reinforcement Learning (XRL) has emerged as a promising approach in improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and isolated coverage of current XRL approaches that leave users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi‑agent Large Language Models (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent's actions in terms of either key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule‑based policies. We validated TalkToAgent on quadruple‑tank process control problem, a well‑known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder‑debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted agent's actions and contextualized their meaning within the problem domain.

PaperID: 8813, https://arxiv.org/pdf/2509.04752.pdf

Authors: Melik Ozolcer, Sang Won Bae

Title: SePA: A Search-enhanced Predictive Agent for Personalized Health Coaching

Abstract:
This paper introduces SePA (Search‑enhanced Predictive AI Agent), a novel LLM health coaching system that integrates personalized machine learning and retrieval‑augmented generation to deliver adaptive, evidence‑based guidance. SePA combines: (1) Individualized models predicting daily stress, soreness, and injury risk from wearable sensor data (28 users, 1260 data points); and (2) A retrieval module that grounds LLM‑generated feedback in expert‑vetted web content to ensure contextual relevance and reliability. Our predictive models, evaluated with rolling‑origin cross‑validation and group k‑fold cross‑validation show that personalized models outperform generalized baselines. In a pilot expert study (n=4), SePA's retrieval‑based advice was preferred over a non‑retrieval baseline, yielding meaningful practical effect (Cliff's δ=0.3, p=0.05). We also quantify latency performance trade‑offs between response quality and speed, offering a transparent blueprint for next‑generation, trustworthy personal health informatics systems.

PaperID: 8814, https://arxiv.org/pdf/2509.04737.pdf

Authors: Ryoga Oishi, Sho Sakaino, Toshiaki Tsuji

Title: Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics

Abstract:
In the field of robot learning, coordinating robot actions through language instructions is becoming increasingly feasible. However, adapting actions to human instructions remains challenging, as such instructions are often qualitative and require exploring behaviors that satisfy varying conditions. This paper proposes a motion generation model that adapts robot actions in response to modifier directives human instructions imposing behavioral conditions during task execution. The proposed method learns a mapping from modifier directives to actions by segmenting demonstrations into short sequences, assigning weakly supervised labels corresponding to specific modifier types. We evaluated our method in wiping and pick and place tasks. Results show that it can adjust motions online in response to modifier directives, unlike conventional batch‑based methods that cannot adapt during execution.

PaperID: 8815, https://arxiv.org/pdf/2509.04731.pdf

Authors: Brennen Hill

Title: Generative World Models of Tasks: LLM-Driven Hierarchical Scaffolding for Embodied Agents

Abstract:
Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long‑horizon multi‑agent tasks such as robotic soccer, this end‑to‑end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision‑making must model the world's physics and also its task semantics. A systematic review of 2024 research in low‑resource multi‑agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi‑agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general‑purpose agents with greater sample efficiency than purely end‑to‑end approaches.

PaperID: 8816, https://arxiv.org/pdf/2509.04687.pdf

Authors: Vanshika Vats, Ashwani Rathee, James Davis

Title: Guideline-Consistent Segmentation via Multi-Agent Refinement

Abstract:
Semantic segmentation in real‑world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task‑specific retraining that must be repeated as the guidelines evolve. Although recent open‑vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph‑length guidelines that specify intricate segmentation rules. To address this, we introduce a multi‑agent, training‑free framework that coordinates general‑purpose vision‑language models within an iterative Worker‑Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline‑consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state‑of‑the‑art baselines, demonstrating strong generalization and instruction adherence.

PaperID: 8817, https://arxiv.org/pdf/2509.04645.pdf

Authors: Kallol Saha, Amber Li, Angela Rodriguez-Izquierdo, Lifan Yu, Ben Eisner, Maxim Likhachev, David Held

Title: Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement

Abstract:
Long‑horizon planning for robot manipulation is a challenging problem that requires reasoning about the effects of a sequence of actions on a physical 3D scene. While traditional task planning methods are shown to be effective for long‑horizon manipulation, they require discretizing the continuous state and action space into symbolic descriptions of objects, object relationships, and actions. Instead, we propose a hybrid learning‑and‑planning approach that leverages learned models as domain‑specific priors to guide search in high‑dimensional continuous action spaces. We introduce SPOT: Search over Point cloud Object Transformations, which plans by searching for a sequence of transformations from an initial scene point cloud to a goal‑satisfying point cloud. SPOT samples candidate actions from learned suggesters that operate on partially observed point clouds, eliminating the need to discretize actions or object relationships. We evaluate SPOT on multi‑object rearrangement tasks, reporting task planning success and task execution success in both simulation and real‑world environments. Our experiments show that SPOT generates successful plans and outperforms a policy‑learning approach. We also perform ablations that highlight the importance of search‑based planning.

PaperID: 8818, https://arxiv.org/pdf/2509.04642.pdf

Authors: Wenxiao Wang, Priyatham Kattakinda, Soheil Feizi

Title: Maestro: Joint Graph & Config Optimization for Reliable AI Agents

Abstract:
Building reliable LLM agents requires decisions at two levels: the graph (which modules exist and how information flows) and the configuration of each node (models, prompts, tools, control knobs). Most existing optimizers tune configurations while holding the graph fixed, leaving structural failure modes unaddressed. We introduce Maestro, a framework‑agnostic holistic optimizer for LLM agents that jointly searches over graphs and configurations to maximize agent quality, subject to explicit rollout/token budgets. Beyond numeric metrics, Maestro leverages reflective textual feedback from traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On the IFBench and HotpotQA benchmarks, Maestro consistently surpasses leading prompt optimizers‑‑MIPROv2, GEPA, and GEPA+Merge‑‑by an average of 12%, 4.9%, and 4.86%, respectively; even when restricted to prompt‑only optimization, it still leads by 9.65%, 2.37%, and 2.41%. Maestro achieves these results with far fewer rollouts than GEPA. We further show large gains on two applications (interviewer & RAG agents), highlighting that joint graph & configuration search addresses structural failure modes that prompt tuning alone cannot fix.

PaperID: 8819, https://arxiv.org/pdf/2509.04636.pdf

Authors: Swapnika Dulam, Christopher L Dancy

Title: Computational Cognitive Modeling to understand the effects of Racializing AI on Human-AI cooperation with PigChase Task

Abstract:
Despite the continued anthropomorphization of AI systems, the potential impact of racialization during human‑AI interaction is understudied. This study explores how human‑AI cooperation may be impacted by the belief that data used to train an AI system is racialized, that is, it was trained on data from a specific group of people. During this study, participants completed a human‑AI cooperation task using the Pig Chase game. Participants of different self‑identified demographics interacted with AI agents whose perceived racial identities were manipulated, allowing us to assess how sociocultural perspectives influence the decision‑making of participants in the game. After the game, participants completed a survey questionnaire to explain the strategies they used while playing the game and to understand the perceived intelligence of their AI teammates. Statistical analysis of task behavior data revealed a statistically significant effect of the participant's demographic, as well as the interaction between this self‑identified demographic and the treatment condition (i.e., the perceived demographic of the agent). The results indicated that Non‑White participants viewed AI agents racialized as White in a positive way compared to AI agents racialized as Black. Both Black and White participants viewed the AI agent in the control treatment in a negative way. A baseline cognitive model of the task using ACT‑R cognitive architecture was used to understand a cognitive‑level, process‑based explanation of the participants' perspectives based on results found from the study. This model helps us better understand the factors affecting the decision‑making strategies of the game participants. Results from analysis of these data, as well as cognitive modeling, indicate a need to expand understanding of the ways racialization (whether implicit or explicit) impacts interaction with AI systems.

PaperID: 8820, https://arxiv.org/pdf/2509.04633.pdf

Authors: Brennen Hill

Title: The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum

Abstract:
The capacity of an embodied agent to understand, predict, and interact with its environment is fundamentally contingent on an internal world model. This paper introduces a novel framework for investigating the formation and adaptation of such world models within a biological substrate: human neural organoids. We present a curriculum of three scalable, closed‑loop virtual environments designed to train these biological agents and probe the underlying synaptic mechanisms of learning, such as long‑term potentiation (LTP) and long‑term depression (LTD). We detail the design of three distinct task environments that demand progressively more sophisticated world models for successful decision‑making: (1) a conditional avoidance task for learning static state‑action contingencies, (2) a one‑dimensional predator‑prey scenario for goal‑directed interaction, and (3) a replication of the classic Pong game for modeling dynamic, continuous‑time systems. For each environment, we formalize the state and action spaces, the sensory encoding and motor decoding mechanisms, and the feedback protocols based on predictable (reward) and unpredictable (punishment) stimulation, which serve to drive model refinement. In a significant methodological advance, we propose a meta‑learning approach where a Large Language Model automates the generative design and optimization of experimental protocols, thereby scaling the process of environment and curriculum design. Finally, we outline a multi‑modal evaluation strategy that moves beyond task performance to directly measure the physical correlates of the learned world model by quantifying synaptic plasticity at electrophysiological, cellular, and molecular levels. This work bridges the gap between model‑based reinforcement learning and computational neuroscience, offering a unique platform for studying embodiment, decision‑making, and the physical basis of intelligence.

PaperID: 8821, https://arxiv.org/pdf/2509.04537.pdf

Authors: Ryosuke Takata, Atsushi Masumori, Takashi Ikegami

Title: Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem

Abstract:
We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. As a result, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision making by becoming a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt‑specified constraints such as the 60% threshold) and internal incentives (culturally‑encoded social preferences derived from pre‑training), demonstrating that LLM agents naturally balance formal game‑theoretic rationality with social motivations that characterize human behavior. These findings suggest that a new model of group decision making, which could not be handled in the previous game‑theoretic problem setting, can be realized by LLM agents.

PaperID: 8822, https://arxiv.org/pdf/2509.04508.pdf

Authors: Biddut Sarker Bijoy, Mohammad Saqib Hasan, Pegah Alipoormolabashi, Avirup Sil, Aruna Balasubramanian, Niranjan Balasubramanian

Title: ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models

Abstract:
Multi‑agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade‑off, we instantiate single and multi‑agent systems for the complex problems in the AppWorld environment using different sized language models. We find that difficulties with long‑trajectory learning in smaller language models (SLMs) limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub‑task training strategy, which introduces new sub‑tasks progressively in each training epoch. We find that this novel strategy, analogous to instance level curriculum learning, consistently improves the effectiveness of multi‑agents at all configurations. Our Pareto analysis shows that fine‑tuned multi‑agent systems yield better effectiveness‑efficiency trade‑offs. Additional ablations and analyses shows the importance of our progressive training strategy and its ability to reduce subtask error rates.

PaperID: 8823, https://arxiv.org/pdf/2509.04499.pdf

Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu

Title: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Abstract:
Generative search engines and deep research LLM agents promise trustworthy, source‑grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community‑identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement‑level analysis (decomposition, confidence scoring) and builds citation and factual‑support matrices to audit how systems reason with and attribute evidence end‑to‑end. Using automated extraction pipelines for popular public models (e.g., GPT‑4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM‑judge with validated agreement to human raters, we evaluate both web‑search engines and deep‑research configurations. Our findings show that generative search engines and deep research agents frequently produce one‑sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep‑research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one‑sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40‑‑80% across systems.

PaperID: 8824, https://arxiv.org/pdf/2509.04481.pdf

Authors: Yi-Chun Chen, Arnav Jhala

Title: Narrative-to-Scene Generation: An LLM-Driven Pipeline for 2D Game Environments

Abstract:
Recent advances in large language models (LLMs) enable compelling story generation, but connecting narrative text to playable visual environments remains an open challenge in procedural content generation (PCG). We present a lightweight pipeline that transforms short narrative prompts into a sequence of 2D tile‑based game scenes, reflecting the temporal structure of stories. Given an LLM‑generated narrative, our system identifies three key time frames, extracts spatial predicates in the form of "Object‑Relation‑Object" triples, and retrieves visual assets using affordance‑aware semantic embeddings from the GameTileNet dataset. A layered terrain is generated using Cellular Automata, and objects are placed using spatial rules grounded in the predicate structure. We evaluated our system in ten diverse stories, analyzing tile‑object matching, affordance‑layer alignment, and spatial constraint satisfaction across frames. This prototype offers a scalable approach to narrative‑driven scene generation and lays the foundation for future work on multi‑frame continuity, symbolic tracking, and multi‑agent coordination in story‑centered PCG.

PaperID: 8825, https://arxiv.org/pdf/2509.04413.pdf

Authors: Babak Esmaeili, Hamidreza Modares

Title: SAFE--MA--RRT: Multi-Agent Motion Planning with Data-Driven Safety Certificates

Abstract:
This paper proposes a fully data‑driven motion‑planning framework for homogeneous linear multi‑agent systems that operate in shared, obstacle‑filled workspaces without access to explicit system models. Each agent independently learns its closed‑loop behavior from experimental data by solving convex semidefinite programs that generate locally invariant ellipsoids and corresponding state‑feedback gains. These ellipsoids, centered along grid‑based waypoints, certify the dynamic feasibility of short‑range transitions and define safe regions of operation. A sampling‑based planner constructs a tree of such waypoints, where transitions are allowed only when adjacent ellipsoids overlap, ensuring invariant‑to‑invariant transitions and continuous safety. All agents expand their trees simultaneously and are coordinated through a space‑time reservation table that guarantees inter‑agent safety by preventing simultaneous occupancy and head‑on collisions. Each successful edge in the tree is equipped with its own local controller, enabling execution without re‑solving optimization problems at runtime. The resulting trajectories are not only dynamically feasible but also provably safe with respect to both environmental constraints and inter‑agent collisions. Simulation results demonstrate the effectiveness of the approach in synthesizing synchronized, safe trajectories for multiple agents under shared dynamics and constraints, using only data and convex optimization tools.

PaperID: 8826, https://arxiv.org/pdf/2509.04343.pdf

Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler

Title: Psychologically Enhanced AI Agents

Abstract:
We introduce MBTI‑in‑Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers‑Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game‑theoretic settings. Our framework supports experimenting with structured multi‑agent communication protocols and reveals that self‑reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine‑tuning.

PaperID: 8827, https://arxiv.org/pdf/2509.04198.pdf

Authors: Petr Průcha, Michaela Matoušková, Jan Strnad

Title: Are LLM Agents the New RPA? A Comparative Study with RPA Across Enterprise Workflows

Abstract:
The emergence of large language models (LLMs) has introduced a new paradigm in automation: LLM agents or Agentic Automation with Computer Use (AACU). Unlike traditional Robotic Process Automation (RPA), which relies on rule‑based workflows and scripting, AACU enables intelligent agents to perform tasks through natural language instructions and autonomous interaction with user interfaces. This study investigates whether AACU can serve as a viable alternative to RPA in enterprise workflow automation. We conducted controlled experiments across three standard RPA challenges data entry, monitoring, and document extraction comparing RPA (via UiPath) and AACU (via Anthropic's Computer Use Agent) in terms of speed, reliability, and development effort. Results indicate that RPA outperforms AACU in execution speed and reliability, particularly in repetitive, stable environments. However, AACU significantly reduces development time and adapts more flexibly to dynamic interfaces. While current AACU implementations are not yet production‑ready, their promise in rapid prototyping and lightweight automation is evident. Future research should explore multi‑agent orchestration, hybrid RPA‑AACU architectures, and more robust evaluation across industries and platforms.

PaperID: 8828, https://arxiv.org/pdf/2509.04183.pdf

Authors: Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych

Title: MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Abstract:
The growing demand for scalable psychological counseling highlights the need for high‑quality, privacy‑compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi‑agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub‑tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single‑agent approaches, MAGneT better captures the structure and nuance of real counseling. We further propose a unified evaluation framework that consolidates diverse automatic metrics and expands expert assessment from four to nine counseling dimensions, thus addressing inconsistencies in prior evaluation protocols. Empirically, MAGneT substantially outperforms existing methods: experts prefer MAGneT‑generated sessions in 77.2% of cases, and sessions generated by MAGneT yield 3.2% higher general counseling skills and 4.3% higher CBT‑specific skills on cognitive therapy rating scale (CTRS). A open source Llama3‑8B‑Instruct model fine‑tuned on MAGneT‑generated data also outperforms models fine‑tuned using baseline synthetic datasets by 6.9% on average on CTRS.We also make our code and data public.

PaperID: 8829, https://arxiv.org/pdf/2509.03990.pdf

Authors: Chunlong Wu, Ye Luo, Zhibo Qu, Min Wang

Title: Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent

Abstract:
Large language model (LLM) agents achieve impressive single‑task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross‑task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per‑episode behavior but typically produce ephemeral, task‑specific traces that are not reused across tasks. Reinforcement‑learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work we introduce Meta‑Policy Reflexion (MPR): a hybrid framework that consolidates LLM‑generated reflections into a structured, predicate‑like Meta‑Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms soft memory‑guided decoding and hard rule admissibility checks(HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language‑based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text‑based agent environment following the experimental protocol described in the provided implementation (AlfWorld‑based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi‑agent extensions.

PaperID: 8830, https://arxiv.org/pdf/2509.03974.pdf

Authors: Manuel Guatto, Francesco Preti, Michael Schilling, Tommaso Calarco, Francisco Andrés Cárdenas-López, Felix Motzoi

Title: Real-time adaptive quantum error correction by model-free multi-agent learning

Abstract:
Can we build efficient Quantum Error Correction (QEC) that adapts on the fly to time‑varying noise? In this work we say yes, and show how. We present a two level framework based on Reinforcement Learning (RL) that learns to correct even non‑stationary errors from scratch. At the first level we take advantage of model‑free Multi‑Agent RL (MARL) to automatically discover full QEC cycle ‑‑ logical state encoding, stabilizer measurements, and recovery ‑‑ without any prior system knowledge, relying only on orthogonality conditions. Leveraging the stabilizer formalism, we demonstrate that our MARL framework can discover novel QEC codes tailored for multi‑level quantum architectures. At the second level we introduce BRAVE (Bandit Retraining for Adaptive Variational Error correction), an efficient algorithm that tunes the variational layer on the fly to change the physical basis of the errors, adapting the QEC code to time‑varying noise while minimizing computational overhead and reducing the number of retraining steps. By combining our MARL and BRAVE approaches and testing them on multi‑level systems subjected to competing bit‑ and phase‑flip errors over time across diverse scenarios, we observed an improvement in logical fidelity by more than an order of magnitude ‑‑ under time‑dependent noise channels ‑‑ compared to conventional QEC schemes.

PaperID: 8831, https://arxiv.org/pdf/2509.03956.pdf

Authors: Minjong Yoo, Jinwoo Jang, Sihyung Yoon, Honguk Woo

Title: World Model Implanting for Test-time Adaptation of Embodied Agents

Abstract:
In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain‑specific world models through test‑time composition. By allowing seamless implantation and removal of the world models, the embodied agent's policy achieves and maintains cross‑domain adaptability. In the WorMI framework, we employ a prototype‑based world model retrieval approach, utilizing efficient trajectory‑based abstract representation matching, to incorporate relevant models into test‑time composition. We also develop a world‑wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model's representation within the agent's policy. This framework design effectively fuses domain‑specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero‑shot and few‑shot performance compared to several LLM‑based approaches across a range of unseen domains. These results highlight the frameworks potential for scalable, real‑world deployment in embodied agent scenarios where adaptability and data efficiency are essential.

PaperID: 8832, https://arxiv.org/pdf/2509.03935.pdf

Authors: Sungho Cho, Sung Il Choi, Seung Hyun Oh, Ian P. Roberts, Sang Hyun Lee

Title: Autonomous Task Offloading of Vehicular Edge Computing with Parallel Computation Queues

Abstract:
This work considers a parallel task execution strategy in vehicular edge computing (VEC) networks, where edge servers are deployed along the roadside to process offloaded computational tasks of vehicular users. To minimize the overall waiting delay among vehicular users, a novel task offloading solution is implemented based on the network cooperation balancing resource under‑utilization and load congestion. Dual evaluation through theoretical and numerical ways shows that the developed solution achieves a globally optimal delay reduction performance compared to existing methods, which is also validated by the feasibility test over a real‑map virtual environment. The in‑depth analysis reveals that predicting the instantaneous processing power of edge servers facilitates the identification of overloaded servers, which is critical for determining network delay. By considering discrete variables of the queue, the proposed technique's precise estimation can effectively address these combinatorial challenges to achieve optimal performance.

PaperID: 8833, https://arxiv.org/pdf/2509.03890.pdf

Authors: Yineng Yan, Xidong Wang, Jin Seng Cheng, Ran Hu, Wentao Guan, Nahid Farahmand, Hengte Lin, Yue Li

Title: FaMA: LLM-Empowered Agentic Assistant for Consumer-to-Consumer Marketplace

Abstract:
The emergence of agentic AI, powered by Large Language Models (LLMs), marks a paradigm shift from reactive generative systems to proactive, goal‑oriented autonomous agents capable of sophisticated planning, memory, and tool use. This evolution presents a novel opportunity to address long‑standing challenges in complex digital environments. Core tasks on Consumer‑to‑Consumer (C2C) e‑commerce platforms often require users to navigate complex Graphical User Interfaces (GUIs), making the experience time‑consuming for both buyers and sellers. This paper introduces a novel approach to simplify these interactions through an LLM‑powered agentic assistant. This agent functions as a new, conversational entry point to the marketplace, shifting the primary interaction model from a complex GUI to an intuitive AI agent. By interpreting natural language commands, the agent automates key high‑friction workflows. For sellers, this includes simplified updating and renewal of listings, and the ability to send bulk messages. For buyers, the agent facilitates a more efficient product discovery process through conversational search. We present the architecture for Facebook Marketplace Assistant (FaMA), arguing that this agentic, conversational paradigm provides a lightweight and more accessible alternative to traditional app interfaces, allowing users to manage their marketplace activities with greater efficiency. Experiments show FaMA achieves a 98% task success rate on solving complex tasks on the marketplace and enables up to a 2x speedup on interaction time.

PaperID: 8834, https://arxiv.org/pdf/2509.03817.pdf

Authors: Wei Yang, Jesse Thomason

Title: Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning

Abstract:
Multi‑agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro‑level orchestration while overlooking agents' internal deliberative capabilities. This critical meta‑cognitive blindspot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta‑Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high‑level meta‑cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a a 4‑5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state‑of‑the‑art heuristic and learning‑based multi‑agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta‑cognitive policies for multi‑agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.

PaperID: 8835, https://arxiv.org/pdf/2509.03793.pdf

Authors: Prathamesh Devadiga, Omkaar Jayadev Shetty, Pooja Agarwal

Title: SAMVAD: A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India

Abstract:
Understanding the complexities of judicial deliberation is crucial for assessing the efficacy and fairness of a justice system. However, empirical studies of judicial panels are constrained by significant ethical and practical barriers. This paper introduces SAMVAD, an innovative Multi‑Agent System (MAS) designed to simulate the deliberation process within the framework of the Indian justice system. Our system comprises agents representing key judicial roles: a Judge, a Prosecution Counsel, a Defense Counsel, and multiple Adjudicators (simulating a judicial bench), all powered by large language models (LLMs). A primary contribution of this work is the integration of Retrieval‑Augmented Generation (RAG), grounded in a domain‑specific knowledge base of landmark Indian legal documents, including the Indian Penal Code and the Constitution of India. This RAG functionality enables the Judge and Counsel agents to generate legally sound instructions and arguments, complete with source citations, thereby enhancing both the fidelity and transparency of the simulation. The Adjudicator agents engage in iterative deliberation rounds, processing case facts, legal instructions, and arguments to reach a consensus‑based verdict. We detail the system architecture, agent communication protocols, the RAG pipeline, the simulation workflow, and a comprehensive evaluation plan designed to assess performance, deliberation quality, and outcome consistency. This work provides a configurable and explainable MAS platform for exploring legal reasoning and group decision‑making dynamics in judicial simulations, specifically tailored to the Indian legal context and augmented with verifiable legal grounding via RAG.

PaperID: 8836, https://arxiv.org/pdf/2509.03771.pdf

Authors: Brennen Hill

Title: Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula

Abstract:
The advancement of general‑purpose intelligent agents is intrinsically linked to the environments in which they are trained. While scaling models and datasets has yielded remarkable capabilities, scaling the complexity, diversity, and interactivity of environments remains a crucial bottleneck. Hand‑crafted environments are finite and often contain implicit biases, limiting the potential for agents to develop truly generalizable and robust skills. In this work, we propose a paradigm for generating a boundless and adaptive curriculum of challenges by framing the environment generation process as an adversarial game. We introduce a system where a team of cooperative multi‑agent defenders learns to survive against a procedurally generative attacker. The attacker agent learns to produce increasingly challenging configurations of enemy units, dynamically creating novel worlds tailored to exploit the defenders' current weaknesses. Concurrently, the defender team learns cooperative strategies to overcome these generated threats. This co‑evolutionary dynamic creates a self‑scaling environment where complexity arises organically from the adversarial interaction, providing an effectively infinite stream of novel and relevant training data. We demonstrate that with minimal training, this approach leads to the emergence of complex, intelligent behaviors, such as flanking and shielding by the attacker, and focus‑fire and spreading by the defenders. Our findings suggest that adversarial co‑evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.

PaperID: 8837, https://arxiv.org/pdf/2509.03736.pdf

Authors: James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, Dongyeop Kang

Title: Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

Abstract:
The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human‑subject research. To evaluate this claim, prior research has largely focused on whether LLM‑generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency; aligning to human behavioral models when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions which reveals an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human‑subject research.

PaperID: 8838, https://arxiv.org/pdf/2509.03682.pdf

Authors: Zhengyang Li, Qijin Ji, Xinghong Ling, Quan Liu

Title: A Comprehensive Review of Multi-Agent Reinforcement Learning in Video Games

Abstract:
Recent advancements in multi‑agent reinforcement learning (MARL) have demonstrated its application potential in modern games. Beginning with foundational work and progressing to landmark achievements such as AlphaStar in StarCraft II and OpenAI Five in Dota 2, MARL has proven capable of achieving superhuman performance across diverse game environments through techniques like self‑play, supervised learning, and deep reinforcement learning. With its growing impact, a comprehensive review has become increasingly important in this field. This paper aims to provide a thorough examination of MARL's application from turn‑based two‑agent games to real‑time multi‑agent video games including popular genres such as Sports games, First‑Person Shooter (FPS) games, Real‑Time Strategy (RTS) games and Multiplayer Online Battle Arena (MOBA) games. We further analyze critical challenges posed by MARL in video games, including nonstationary, partial observability, sparse rewards, team coordination, and scalability, and highlight successful implementations in games like Rocket League, Minecraft, Quake III Arena, StarCraft II, Dota 2, Honor of Kings, etc. This paper offers insights into MARL in video game AI systems, proposes a novel method to estimate game complexity, and suggests future research directions to advance MARL and its applications in game development, inspiring further innovation in this rapidly evolving field.

PaperID: 8839, https://arxiv.org/pdf/2509.03581.pdf

Authors: Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

Title: Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

Abstract:
Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem‑solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long‑horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test‑time compute for planning. We propose a simple two‑stage training pipeline: (1) supervised fine‑tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long‑horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample‑efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human‑written plans, surpassing their independent capabilities and highlighting the potential for safer and more collaborative agentic systems.

PaperID: 8840, https://arxiv.org/pdf/2509.03536.pdf

Authors: Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, Wei Jiang

Title: PG-Agent: An Agent Powered by Page Graph

Abstract:
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Currently, existing GUI agents usually utilize sequential episodes of multi‑step operations across pages as the prior GUI knowledge, which fails to capture the complex transition relationship between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval‑Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and a tailored multi‑agent framework PG‑Agent with task decomposition strategy is proposed to be injected with the guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG‑Agent, even with limited episodes for page graph construction.

PaperID: 8841, https://arxiv.org/pdf/2509.03443.pdf

Authors: Tarek Bazizi, Mohamed Maghenem, Paolo Frasca, Antonio Lorìa, Elena Panteley

Title: On the Perturbed Projection-Based Distributed Gradient-Descent Algorithm: A Fully-Distributed Adaptive Redesign

Abstract:
In this work, we revisit a classical distributed gradient‑descent algorithm, introducing an interesting class of perturbed multi‑agent systems. The state of each subsystem represents a local estimate of a solution to the global optimization problem. Thereby, the network is required to minimize local cost functions, while gathering the local estimates around a common value. Such a complex task suggests the interplay of consensus‑based dynamics with gradient‑descent dynamics. The latter descent dynamics involves the projection operator, which is assumed to provide corrupted projections of a specific form, reminiscent of existing (fast) projection algorithms. Hence, for the resulting class of perturbed networks, we are able to adaptively tune some gains in a fully distributed fashion, to approach the optimal consensus set up to arbitrary‑desired precision.

PaperID: 8842, https://arxiv.org/pdf/2509.03353.pdf

Authors: Oguzhan Baser, Kaan Kale, Po-han Li, Sandeep Chinchali

Title: Fair Resource Allocation for Fleet Intelligence

Abstract:
Resource allocation is crucial for the performance optimization of cloud‑assisted multi‑agent intelligence. Traditional methods often overlook agents' diverse computational capabilities and complex operating environments, leading to inefficient and unfair resource distribution. To address this, we open‑sourced Fair‑Synergy, an algorithmic framework that utilizes the concave relationship between the agents' accuracy and the system resources to ensure fair resource allocation across fleet intelligence. We extend traditional allocation approaches to encompass a multidimensional machine learning utility landscape defined by model parameters, training data volume, and task complexity. We evaluate Fair‑Synergy with advanced vision and language models such as BERT, VGG16, MobileNet, and ResNets on datasets including MNIST, CIFAR‑10, CIFAR‑100, BDD, and GLUE. We demonstrate that Fair‑Synergy outperforms standard benchmarks by up to 25% in multi‑agent inference and 11% in multi‑agent learning settings. Also, we explore how the level of fairness affects the least advantaged, most advantaged, and average agents, providing insights for equitable fleet intelligence.

PaperID: 8843, https://arxiv.org/pdf/2509.03345.pdf

Authors: Yunxin Sun, Abulhair Saparov

Title: Do Language Models Follow Occam's Razor? An Evaluation of Parsimony in Inductive and Abductive Reasoning

Abstract:
Non‑deductive reasoning, encompassing inductive and abductive reasoning, is essential in addressing complex real‑world questions. One key feature of inductive and abductive reasoning is that there are many valid hypotheses; the simplest ones (those that adhere to Occam's Razor) are often most useful. However, this aspect is ignored in recent work that evaluates the non‑deductive reasoning capabilities of large language models (LLMs). This work fills this gap, focusing on understanding whether the inductive and abductive reasoning capabilities of LLMs adhere to Occam's Razor, while also examining the correctness of their reasoning. To accomplish this goal, we introduce a framework to synthetically generate reasoning questions that (a) require inductive reasoning and abductive reasoning simultaneously; (b) is readily extended to produce any abductive/inductive reasoning question expressible in first‑order logic. The task for the intelligent agent is to produce hypotheses to explain observations under a given world model. We also propose a new automated metric to assess whether hypotheses quantitatively adhere to Occam's Razor; those hypotheses that are correct and simplest are considered high‑quality. Our findings on state‑of‑the‑art LLMs suggest that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high‑quality hypotheses, even with popular reasoning‑enhancing techniques such as in‑context learning and RLVR.

PaperID: 8844, https://arxiv.org/pdf/2509.03331.pdf

Authors: Weizhe Wang, Wei Ma, Qiang Hu, Yao Zhang, Jianfei Sun, Bin Wu, Yang Liu, Guangquan Xu, Lingxiao Jiang

Title: VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities

Abstract:
The adoption of Large Language Models (LLMs) for automated software vulnerability patching has shown promising outcomes on carefully curated evaluation sets. Nevertheless, existing datasets predominantly rely on superficial validation methods rather than exploit‑based verification, leading to overestimated performance in security‑sensitive applications. This paper introduces VulnRepairEval, an evaluation framework anchored in functional Proof‑of‑Concept (PoC) exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment, where repair success requires the original exploit to fail execution against the modified code. The benchmark construction involved extensive data curation: we processed over 400 CVEs and approximately 2,500 potential sources to extract a collection of authentic vulnerability instances (23 Python CVEs) amenable to automated testing with working PoCs. Through VulnRepairEval, we conduct a comprehensive evaluation of 12 popular LLMs and observe a significant performance deficit: even the top‑performing model successfully addresses merely 5/23 instances (about 21.7%), exposing critical weaknesses in security‑focused applications. Our failure analysis reveals that most unsuccessful attempts stem from imprecise vulnerability identification and patches containing syntactic or semantic errors. Enhanced prompting strategies and multi‑agent approaches yield minimal improvements, with overall effectiveness remaining largely unaffected. This work contributes a stringent, practical evaluation framework for LLM‑driven vulnerability remediation and underscores the necessity for assessment protocols that authentically reflect real‑world exploitation scenarios.

PaperID: 8845, https://arxiv.org/pdf/2509.03312.pdf

Authors: Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan

Title: AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Abstract:
Large Language Model (LLM)‑based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state‑of‑the‑art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi‑agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer‑8B, a lightweight failure tracer trained with multi‑granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi‑agent interactions. On the Who&When benchmark, AgenTracer‑8B outperforms giant proprietary LLMs like Gemini‑2.5‑Pro and Claude‑4‑Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer‑8B delivers actionable feedback to off‑the‑shelf multi‑agent systems like MetaGPT and MaAS with 4.8‑14.2% performance gains, empowering self‑correcting and self‑evolving agentic AI.

PaperID: 8846, https://arxiv.org/pdf/2509.03168.pdf

Authors: Boyin Zheng, Yahui Hao, Lu Liu

Title: Target Enclosing Control for Nonholonomic Multi-Agent Systems with Connectivity Maintenance and Collision Avoidance

Abstract:
This article addresses the moving target enclosing control problem for nonholonomic multi‑agent systems with guaranteed network connectivity and collision avoidance. We propose a novel control scheme to handle distance constraints imposed by the agents' limited interaction ranges and collision‑free thresholds. By leveraging a Henneberg construction method, we innovatively formulate the target enclosing requirements within an isostatic distance‑based formation framework, facilitating the integration of distance constraints. Compared with existing results, our approach ensures the positive definiteness of the underlying rigidity matrix and does not require controlling the target's motion. To eliminate the occurrences of control singularities caused by nonholonomic constraints, we propose a fixed‑time angular control law using barrier Lyapunov functions. Additionally, we develop a linear velocity control law using the prescribed performance control approach and transformed error constraints. We rigorously prove that our control laws enable the multi‑agent system to asymptotically achieve the desired angular formation pattern around a moving target while satisfying the established distance constraints. Finally, a simulation example is provided to validate the effectiveness of the proposed method.

PaperID: 8847, https://arxiv.org/pdf/2509.03030.pdf

Authors: Zida Wu, Mathieu Lauriere, Matthieu Geist, Olivier Pietquin, Ankur Mehta

Title: Population-aware Online Mirror Descent for Mean-Field Games with Common Noise by Deep Reinforcement Learning

Abstract:
Mean Field Games (MFGs) offer a powerful framework for studying large‑scale multi‑agent systems. Yet, learning Nash equilibria in MFGs remains a challenging problem, particularly when the initial distribution is unknown or when the population is subject to common noise. In this paper, we introduce an efficient deep reinforcement learning (DRL) algorithm designed to achieve population‑dependent Nash equilibria without relying on averaging or historical sampling, inspired by Munchausen RL and Online Mirror Descent. The resulting policy is adaptable to various initial distributions and sources of common noise. Through numerical experiments on seven canonical examples, we demonstrate that our algorithm exhibits superior convergence properties compared to state‑of‑the‑art algorithms, particularly a DRL version of Fictitious Play for population‑dependent policies. The performance in the presence of common noise underscores the robustness and adaptability of our approach.

PaperID: 8848, https://arxiv.org/pdf/2509.02958.pdf

Authors: Kaustuv Mukherji, Jaikrishna Manojkumar Patil, Dyuman Aditya, Paulo Shakarian, Devendra Parkar, Lahari Pokala, Clark Dorman, Gerardo I. Simari

Title: Lattice Annotated Temporal (LAT) Logic for Non-Markovian Reasoning

Abstract:
We introduce Lattice Annotated Temporal (LAT) Logic, an extension of Generalized Annotated Logic Programs (GAPs) that incorporates temporal reasoning and supports open‑world semantics through the use of a lower lattice structure. This logic combines an efficient deduction process with temporal logic programming to support non‑Markovian relationships and open‑world reasoning capabilities. The open‑world aspect, a by‑product of the use of the lower‑lattice annotation structure, allows for efficient grounding through a Skolemization process, even in domains with infinite or highly diverse constants. We provide a suite of theoretical results that bound the computational complexity of the grounding process, in addition to showing that many of the results on GAPs (using an upper lattice) still hold with the lower lattice and temporal extensions (though different proof techniques are required). Our open‑source implementation, PyReason, features modular design, machine‑level optimizations, and direct integration with reinforcement learning environments. Empirical evaluations across multi‑agent simulations and knowledge graph tasks demonstrate up to three orders of magnitude speedup and up to five orders of magnitude memory reduction while maintaining or improving task performance. Additionally, we evaluate LAT Logic's value in reinforcement learning environments as a non‑Markovian simulator, achieving up to three orders of magnitude faster simulation with improved agent performance, including a 26% increase in win rate due to capturing richer temporal dependencies. These results highlight LAT Logic's potential as a unified, extensible framework for open‑world temporal reasoning in dynamic and uncertain environments. Our implementation is available at: pyreason.syracuse.edu.

PaperID: 8849, https://arxiv.org/pdf/2509.02751.pdf

Authors: Matthew Russo, Tim Kraska

Title: Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics

Abstract:
With advances in large language models (LLMs), researchers are creating new systems that can perform AI‑driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators ‑‑ a declarative set of AI‑powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records and their iterator execution semantics make them ill‑suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans which can lead to poor plan execution. In order for AI‑driven analytics to excel, we need a runtime which combines the optimized execution of semantic operators with the flexibility and more dynamic execution of Deep Research systems. As a first step towards this vision, we build a prototype which enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries. Compared to a standard open Deep Research agent, our prototype achieves up to 1.95x better F1‑score. Furthermore, even if we give the agent access to semantic operators as tools, our prototype still achieves cost and runtime savings of up to 76.8% and 72.7% thanks to its optimized execution.

PaperID: 8850, https://arxiv.org/pdf/2509.02579.pdf

Authors: Mazyar Taghavi, Rahman Farnoosh

Title: Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection

Abstract:
Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real‑time response is essential. This paper introduces a novel Expectation‑Maximization (EM) based latent variable modeling approach in the context of Multi‑Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter‑agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty.We implement and evaluate our EM‑MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decisionmaking in complex, high‑stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.

PaperID: 8851, https://arxiv.org/pdf/2509.02515.pdf

Authors: Costin Bădică, Amelia Bădică, Maria Ganzha, Mirjana Ivanović, Marcin Paprzycki, Dan Selişteanu, Zofia Wrona

Title: Contemporary Agent Technology: LLM-Driven Advancements vs Classic Multi-Agent Systems

Abstract:
This contribution provides our comprehensive reflection on the contemporary agent technology, with a particular focus on the advancements driven by Large Language Models (LLM) vs classic Multi‑Agent Systems (MAS). It delves into the models, approaches, and characteristics that define these new systems. The paper emphasizes the critical analysis of how the recent developments relate to the foundational MAS, as articulated in the core academic literature. Finally, it identifies key challenges and promising future directions in this rapidly evolving domain.

PaperID: 8852, https://arxiv.org/pdf/2509.02494.pdf

Authors: Hongwei Jin, Kibaek Kim, Jonghwan Kwon

Title: GridMind: LLMs-Powered Agents for Power System Analysis and Operations

Abstract:
The complexity of traditional power system analysis workflows presents significant barriers to efficient decision‑making in modern electric grids. This paper presents GridMind, a multi‑agent AI system that integrates Large Language Models (LLMs) with deterministic engineering solvers to enable conversational scientific computing for power system analysis. The system employs specialized agents coordinating AC Optimal Power Flow and N‑1 contingency analysis through natural language interfaces while maintaining numerical precision via function calls. GridMind addresses workflow integration, knowledge accessibility, context preservation, and expert decision‑support augmentation. Experimental evaluation on IEEE test cases demonstrates that the proposed agentic framework consistently delivers correct solutions across all tested language models, with smaller LLMs achieving comparable analytical accuracy with reduced computational latency. This work establishes agentic AI as a viable paradigm for scientific computing, demonstrating how conversational interfaces can enhance accessibility while preserving numerical rigor essential for critical engineering applications.

PaperID: 8853, https://arxiv.org/pdf/2509.02401.pdf

Authors: Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Gianluca Mazzoni, Lea Mørch Harder, Philip Torr, Jesper Ferkinghoff-Borg, Kaspar Martens, Julien Fauqueur

Title: Towards Agents That Know When They Don't Know: Uncertainty as a Control Signal for Structured Reasoning

Abstract:
Large language model (LLM) agents are increasingly deployed in structured biomedical data environments, yet they often produce fluent but overconfident outputs when reasoning over complex multi‑table data. We introduce an uncertainty‑aware agent for query‑conditioned multi‑table summarization that leverages two complementary signals: (i) retrieval uncertainty‑‑entropy over multiple table‑selection rollouts‑‑and (ii) summary uncertainty‑‑combining self‑consistency and perplexity. Summary uncertainty is incorporated into reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), while both retrieval and summary uncertainty guide inference‑time filtering and support the construction of higher‑quality synthetic datasets. On multi‑omics benchmarks, our approach improves factuality and calibration, nearly tripling correct and useful claims per summary (3.0$\rightarrow$8.4 internal; 3.6$\rightarrow$9.9 cancer multi‑omics) and substantially improving downstream survival prediction (C‑index 0.32$\rightarrow$0.63). These results demonstrate that uncertainty can serve as a control signal‑‑enabling agents to abstain, communicate confidence, and become more reliable tools for complex structured‑data environments.

PaperID: 8854, https://arxiv.org/pdf/2509.02360.pdf

Authors: Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk

Title: When Agents go Astray: Course-Correcting SWE Agents with PRMs

Abstract:
Large Language Model (LLM) agents are increasingly deployed for complex, multi‑step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post‑hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE‑PRM, an inference‑time Process Reward Model (PRM) that intervenes during execution to detect and course‑correct trajectory‑level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE‑bench Verified, closed‑source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy‑guided PRMs outperform unguided or explicit action‑prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as 0.2, making PRMs a practical and scalable mechanism for improving SWE agents' reliability and efficiency.

PaperID: 8855, https://arxiv.org/pdf/2509.02324.pdf

Authors: Changshi Zhou, Haichuan Xu, Ningquan Gu, Zhipeng Wang, Bin Cheng, Pengpeng Zhang, Yanchao Dong, Mitsuhiro Hayashibe, Yanmin Zhou, Bin He

Title: Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception

Abstract:
Language‑guided long‑horizon manipulation of deformable objects presents significant challenges due to high degrees of freedom, complex dynamics, and the need for accurate vision‑language grounding. In this work, we focus on multi‑step cloth folding, a representative deformable‑object manipulation task that requires both structured long‑horizon planning and fine‑grained visual perception. To this end, we propose a unified framework that integrates a Large Language Model (LLM)‑based planner, a Vision‑Language Model (VLM)‑based perception system, and a task execution module. Specifically, the LLM‑based planner decomposes high‑level language instructions into low‑level action primitives, bridging the semantic‑execution gap, aligning perception with action, and enhancing generalization. The VLM‑based perception module employs a SigLIP2‑driven architecture with a bidirectional cross‑attention fusion mechanism and weight‑decomposed low‑rank adaptation (DoRA) fine‑tuning to achieve language‑conditioned fine‑grained visual grounding. Experiments in both simulation and real‑world settings demonstrate the method's effectiveness. In simulation, it outperforms state‑of‑the‑art baselines by 2.23, 1.87, and 33.3 on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, it robustly executes multi‑step folding sequences from language instructions across diverse cloth materials and configurations, demonstrating strong generalization in practical scenarios. Project page: https://language‑guided.netlify.app/

PaperID: 8856, https://arxiv.org/pdf/2509.02322.pdf

Authors: Longrong Yang, Zhixiong Zeng, Yufeng Zhong, Jing Huang, Liming Zheng, Lei Chen, Haibo Qiu, Zequn Qin, Lin Ma, Xi Li

Title: OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Abstract:
Multimodal large language models are evolving toward multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, which correspond to agents interacting with 2D virtual worlds or 3D real worlds, respectively. However, many complex tasks typically require agents to interleavely interact with these two types of environment. We initially mix GUI and embodied data to train, but find the performance degeneration brought by the data conflict. Further analysis reveals that GUI and embodied data exhibit synergy and conflict at the shallow and deep layers, respectively, which resembles the cerebrum‑cerebellum mechanism in the human brain. To this end, we propose a high‑performance generalist agent OmniActor, designed from both structural and data perspectives. First, we propose Layer‑heterogeneity MoE to eliminate the conflict between GUI and embodied data by separating deep‑layer parameters, while leverage their synergy by sharing shallow‑layer parameters. By successfully leveraging the synergy and eliminating the conflict, OmniActor outperforms agents only trained by GUI or embodied data in GUI or embodied tasks. Furthermore, we unify the action spaces of GUI and embodied tasks, and collect large‑scale GUI and embodied data from various sources for training. This significantly improves OmniActor under different scenarios, especially in GUI tasks. The code will be publicly available.

PaperID: 8857, https://arxiv.org/pdf/2509.02317.pdf

Authors: Chenguang Du, Chuyi Wang, Yihan Chao, Xiaohui Xie, Yong Cui

Title: AI Agent Communication from Internet Architecture Perspective: Challenges and Opportunities

Abstract:
The rapid development of AI agents leads to a surge in communication demands. Alongside this rise, a variety of frameworks and protocols emerge. While these efforts demonstrate the vitality of the field, they also highlight increasing fragmentation, with redundant innovation and siloed designs hindering cross‑domain interoperability. These challenges underscore the need for a systematic perspective to guide the development of scalable, secure, and sustainable AI agent ecosystems. To address this need, this paper provides the first systematic analysis of AI agent communication from the standpoint of Internet architecture‑the most successful global‑scale distributed system in history. Specifically, we distill decades of Internet evolution into five key elements that are directly relevant to agent communication: scalability, security, real‑time performance, high performance, and manageability. We then use these elements to examine both the opportunities and the bottlenecks in developing robust multi‑agent ecosystems. Overall, this paper bridges Internet architecture and AI agent communication for the first time, providing a new lens for guiding the sustainable growth of AI agent communication ecosystems.

PaperID: 8858, https://arxiv.org/pdf/2509.02271.pdf

Authors: Yigal Koifman, Erez Koifman, Eran Iceland, Ariel Barel, Alfred M. Bruckstein

Title: VariAntNet: Learning Decentralized Control of Multi-Agent Systems

Abstract:
A simple multi‑agent system can be effectively utilized in disaster response applications, such as firefighting. Such a swarm is required to operate in complex environments with limited local sensing and no reliable inter‑agent communication or centralized control. These simple robotic agents, also known as Ant Robots, are defined as anonymous agents that possess limited sensing capabilities, lack a shared coordinate system, and do not communicate explicitly with one another. A key challenge for simple swarms lies in maintaining cohesion and avoiding fragmentation despite limited‑range sensing. Recent advances in machine learning offer effective solutions to some of the classical decentralized control challenges. We propose VariAntNet, a deep learning‑based decentralized control model designed to facilitate agent swarming and collaborative task execution. VariAntNet includes geometric features extraction from unordered, variable‑sized local observations. It incorporates a neural network architecture trained with a novel, differentiable, multi‑objective, mathematically justified loss function that promotes swarm cohesiveness by utilizing the properties of the visibility graph Laplacian matrix. VariAntNet is demonstrated on the fundamental multi‑agent gathering task, where agents with bearing‑only and limited‑range sensing must gather at some location. VariAntNet significantly outperforms an existing analytical solution, achieving more than double the convergence rate while maintaining high swarm connectivity across varying swarm sizes. While the analytical solution guarantees cohesion, it is often too slow in practice. In time‑critical scenarios, such as emergency response operations where lives are at risk, slower analytical methods are impractical and justify the loss of some agents within the swarm. This paper presents and analyzes this trade‑off in detail.

PaperID: 8859, https://arxiv.org/pdf/2509.02172.pdf

Authors: Yijun Liu, Wu Liu, Xiaoyan Gu, Hantao Yao, Weiping Wang, Jiebo Luo, Yongdong Zhang

Title: RumorSphere: A Framework for Million-scale Agent-based Dynamic Simulation of Rumor Propagation

Abstract:
Rumor propagation modeling is critical for understanding and mitigating misinformation. Existing approaches combining rule‑based regular agents with LLM‑driven core agents provide a promising paradigm for large‑scale rumor simulation. However, overlooking the dynamic nature of core agents and the importance of network topology on rumor spread significantly undermines the simulation performance. To address these issues, we present RumorSphere, a dynamic and hierarchical resonance framework for effective rumor simulation at the million‑agent scale. Considering the dynamic role of core agents in rumor evolution, we propose a multi‑agent dynamic interaction strategy based on the information cocoon theory, which adaptively identifies and activates critical core agents at conflict boundaries using LLMs, effectively supporting simulations with millions of agents. In addition, we design a hierarchical resonance network that integrates opinion leaders and localized community structures, enabling more realistic modeling of explosive rumor spread in real‑world scenarios. Experiments on real‑world datasets show that RumorSphere outperforms state‑of‑the‑art methods, reducing simulation bias by an average of 26.5%.

PaperID: 8860, https://arxiv.org/pdf/2509.02121.pdf

Authors: Junyi Shen, Noppanat Wadlom, Yao Lu

Title: Batch Query Processing and Optimization for Agentic Workflows

Abstract:
Large Language Models (LLMs) in agentic workflows combine multi‑step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while multi‑agent frameworks focus on orchestration without system‑level performance planning. As a result, repeated prompts, overlapping contexts, and fragmented CPU‑GPU execution create substantial redundancy and poor hardware utilization, especially in batch analytics scenarios. We introduce Halo, a system that brings batch query processing and optimization into agentic LLM workflows. Halo represents each workflow as a structured query plan DAG and constructs a consolidated graph for batched queries that exposes shared computation. Guided by a cost model that jointly considers heterogeneous resource constraints, prefill and decode costs, cache reuse, and GPU placement, Halo performs plan‑level optimization to minimize redundant execution. The Processor integrates adaptive batching, KV‑cache sharing and migration, along with fine‑grained CPU‑GPU pipelining to maximize holistic hardware efficiency. Evaluation across six benchmarks shows that Halo achieves up to 3.6x speedup for batch inference and 2.6x throughput improvement under online serving, scaling to workloads of thousands of queries and complex graphs. These gains are achieved without compromising output quality. By unifying query optimization with heterogeneous LLM serving, Halo enables efficient agentic workflows in data analytics and decision‑making applications.

PaperID: 8861, https://arxiv.org/pdf/2509.01856.pdf

Authors: Zhihao Lin, Jianglin Lan, Shuo Liu, Zhen Tian, Dezong Zhao, Chongfeng Wei

Title: Hierarchical Multi-Agent MCTS for Safety-Critical Coordination in Mixed-Autonomy Roundabouts

Abstract:
Navigating unsignalized roundabouts in mixed‑autonomy traffic presents significant challenges due to dense vehicle interactions, lane‑changing complexities, and behavioral uncertainties of human‑driven vehicles (HDVs). This paper proposes a safety‑critical decision‑making framework for connected and automated vehicles (CAVs) navigating dual‑lane roundabouts alongside HDVs. We formulate the problem as a multi‑agent Markov Decision Process and develop a hierarchical safety assessment mechanism that evaluates three critical interaction types: CAV‑to‑CAV (C2C), CAV‑to‑HDV (C2H), and CAV‑to‑Boundary (C2B). A key contribution is our lane‑specific uncertainty model for HDVs, which captures distinct behavioral patterns between inner and outer lanes, with outer‑lane vehicles exhibiting 2.3× higher uncertainty due to less constrained movements. We integrate this safety framework with a multi‑agent Monte Carlo Tree Search (MCTS) algorithm that employs safety‑aware pruning to eliminate high‑risk trajectories while maintaining computational efficiency. The reward function incorporates Shapley value‑based credit assignment to balance individual performance with group coordination. Extensive simulation results validate the effectiveness of the proposed approach under both fully autonomous (100% AVs) and mixed traffic (50% AVs + 50% HDVs) conditions. Compared to benchmark methods, our framework consistently reduces trajectory deviations across all AVs and significantly lowers the rate of Post‑Encroachment Time (PET) violations, achieving only 1.0% in the fully autonomous scenario and 3.2% in the mixed traffic setting.

PaperID: 8862, https://arxiv.org/pdf/2509.01835.pdf

Authors: Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, Gianluca Stringhini

Title: From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

Abstract:
High‑quality datasets of real‑world vulnerabilities and their corresponding verifiable exploits are crucial resources in software security research. Yet such resources remain scarce, as their creation demands intensive manual effort and deep security expertise. In this paper, we present CVE‑GENIE, an automated, large language model (LLM)‑based multi‑agent framework designed to reproduce real‑world vulnerabilities, provided in Common Vulnerabilities and Exposures (CVE) format, to enable creation of high‑quality vulnerability datasets. Given a CVE entry as input, CVE‑GENIE gathers the relevant resources of the CVE, automatically reconstructs the vulnerable environment, and (re)produces a verifiable exploit. Our systematic evaluation highlights the efficiency and robustness of CVE‑GENIE's design and successfully reproduces approximately 51% (428 of 841) CVEs published in 2024‑2025, complete with their verifiable exploits, at an average cost of 2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications such as fuzzer evaluation, vulnerability patching, and assessing AI's security capabilities.

PaperID: 8863, https://arxiv.org/pdf/2509.01630.pdf

Authors: Bingheng Wang, Yichao Gao, Tianchen Sun, Lin Zhao

Title: Learning to Coordinate: Distributed Meta-Trajectory Optimization Via Differentiable ADMM-DDP

Abstract:
Distributed trajectory optimization via ADMM‑DDP is a powerful approach for coordinating multi‑agent systems, but it requires extensive tuning of tightly coupled hyperparameters that jointly govern local task performance and global coordination. In this paper, we propose Learning to Coordinate (L2C), a general framework that meta‑learns these hyperparameters, modeled by lightweight agent‑wise neural networks, to adapt across diverse tasks and agent configurations. L2C differentiates end‑to‑end through the ADMM‑DDP pipeline in a distributed manner. It also enables efficient meta‑gradient computation by reusing DDP components such as Riccati recursions and feedback gains. These gradients correspond to the optimal solutions of distributed matrix‑valued LQR problems, coordinated across agents via an auxiliary ADMM framework that becomes convex under mild assumptions. Training is further accelerated by truncating iterations and meta‑learning ADMM penalty parameters optimized for rapid residual reduction, with provable Lipschitz‑bounded gradient errors. On a challenging cooperative aerial transport task, L2C generates dynamically feasible trajectories in high‑fidelity simulation using IsaacSIM, reconfigures quadrotor formations for safe 6‑DoF load manipulation in tight spaces, and adapts robustly to varying team sizes and task conditions, while achieving up to 88% faster gradient computation than state‑of‑the‑art methods.

PaperID: 8864, https://arxiv.org/pdf/2509.01517.pdf

Authors: Yuan-Hao Jiang, Yijie Lu, Ling Dai, Jiatong Wang, Ruijia Li, Bo Jiang

Title: Agentic Workflow for Education: Concepts and Applications

Abstract:
With the rapid advancement of Large Language Models (LLMs) and Artificial Intelligence (AI) agents, agentic workflows are showing transformative potential in education. This study introduces the Agentic Workflow for Education (AWE), a four‑component model comprising self‑reflection, tool invocation, task planning, and multi‑agent collaboration. We distinguish AWE from traditional LLM‑based linear interactions and propose a theoretical framework grounded in the von Neumann Multi‑Agent System (MAS) architecture. Through a paradigm shift from static prompt‑response systems to dynamic, nonlinear workflows, AWE enables scalable, personalized, and collaborative task execution. We further identify four core application domains: integrated learning environments, personalized AI‑assisted learning, simulation‑based experimentation, and data‑driven decision‑making. A case study on automated math test generation shows that AWE‑generated items are statistically comparable to real exam questions, validating the model's effectiveness. AWE offers a promising path toward reducing teacher workload, enhancing instructional quality, and enabling broader educational innovation.

PaperID: 8865, https://arxiv.org/pdf/2509.01479.pdf

Authors: Bernd Finkbeiner, Hadar Frenkel, Julian Siber

Title: An Information-Flow Perspective on Explainability Requirements: Specification and Verification

Abstract:
Explainable systems expose information about why certain observed effects are happening to the agents interacting with them. We argue that this constitutes a positive flow of information that needs to be specified, verified, and balanced against negative information flow that may, e.g., violate privacy guarantees. Since both explainability and privacy require reasoning about knowledge, we tackle these tasks with epistemic temporal logic extended with quantification over counterfactual causes. This allows us to specify that a multi‑agent system exposes enough information such that agents acquire knowledge on why some effect occurred. We show how this principle can be used to specify explainability as a system‑level requirement and provide an algorithm for checking finite‑state models against such specifications. We present a prototype implementation of the algorithm and evaluate it on several benchmarks, illustrating how our approach distinguishes between explainable and unexplainable systems, and how it allows to pose additional privacy requirements.

PaperID: 8866, https://arxiv.org/pdf/2509.01398.pdf

Authors: Cristina Cornelio, Takuya Ito, Ryan Cory-Wright, Sanjeeb Dash, Lior Horesh

Title: The Need for Verification in AI-Driven Scientific Discovery

Abstract:
Artificial intelligence (AI) is transforming the practice of science. Machine learning and large language models (LLMs) can generate hypotheses at a scale and speed far exceeding traditional methods, offering the potential to accelerate discovery across diverse fields. However, the abundance of hypotheses introduces a critical challenge: without scalable and reliable mechanisms for verification, scientific progress risks being hindered rather than being advanced. In this article, we trace the historical development of scientific discovery, examine how AI is reshaping established practices for scientific discovery, and review the principal approaches, ranging from data‑driven methods and knowledge‑aware neural architectures to symbolic reasoning frameworks and LLM agents. While these systems can uncover patterns and propose candidate laws, their scientific value ultimately depends on rigorous and transparent verification, which we argue must be the cornerstone of AI‑assisted discovery.

PaperID: 8867, https://arxiv.org/pdf/2509.01396.pdf

Authors: Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou

Title: DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

Abstract:
Deep research agents have attracted growing attention for their potential to orchestrate multi‑stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real‑world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi‑Agent Hierarchical Task Generation (MAHTG) system that extracts research‑worthy inspirations from seminar transcripts. The MAHTG system further translates research‑worthy inspirations into high‑quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high‑quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state‑of‑the‑art agents, with clear performance gaps observed across different models.

PaperID: 8868, https://arxiv.org/pdf/2509.01379.pdf

Authors: Paloma Piot, Diego Sánchez, Javier Parapar

Title: WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data

Abstract:
Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT‑based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain‑of‑thought reasoning, and checks platform guidelines to explain and support its decisions. This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state‑of‑the‑art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.

PaperID: 8869, https://arxiv.org/pdf/2509.01257.pdf

Authors: Andrea Fox, Francesco De Pellegrini, Eitan Altman

Title: Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks

Abstract:
In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resume to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. For the specific case of offloading, e.g., constraints prevent overloading shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives but require little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large‑scale settings.

PaperID: 8870, https://arxiv.org/pdf/2509.01238.pdf

Authors: Jiasheng Xu, Mingda Li, Yongqiang Tang, Peijie Wang, Wensheng Zhang

Title: Towards Open-World Retrieval-Augmented Generation on Knowledge Graph: A Multi-Agent Collaboration Framework

Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities in web search and reasoning. However, their dependence on static training corpora makes them prone to factual errors and knowledge gaps. Retrieval‑Augmented Generation (RAG) addresses this limitation by incorporating external knowledge sources, especially structured Knowledge Graphs (KGs), which provide explicit semantics and efficient retrieval. Existing KG‑based RAG approaches, however, generally assume that anchor entities are accessible to initiate graph traversal, which limits their robustness in open‑world settings where accurate linking between the user query and the KG entity is unreliable. To overcome this limitation, we propose AnchorRAG, a novel multi‑agent collaboration framework for open‑world RAG without the predefined anchor entities. Specifically, a predictor agent dynamically identifies candidate anchor entities by aligning user query terms with KG nodes and initializes independent retriever agents to conduct parallel multi‑hop explorations from each candidate. Then a supervisor agent formulates the iterative retrieval strategy for these retriever agents and synthesizes the resulting knowledge paths to generate the final answer. This multi‑agent collaboration framework improves retrieval robustness and mitigates the impact of ambiguous or erroneous anchors. Extensive experiments on four public benchmarks demonstrate that AnchorRAG significantly outperforms existing baselines and establishes new state‑of‑the‑art results on the real‑world reasoning tasks.

PaperID: 8871, https://arxiv.org/pdf/2509.01182.pdf

Authors: Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee

Title: Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping

Abstract:
Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in ecommerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question to Knowledge (Q2K), a multi agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human in the loop mechanism further refines uncertain cases. Experiments on real world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.

PaperID: 8872, https://arxiv.org/pdf/2509.01119.pdf

Authors: Senura Hansaja Wanasekara, Van-Dinh Nguyen, Kok-Seng, M. -Duong Nguyen, Symeon Chatzinotas, Octavia A. Dobre

Title: SC-GIR: Goal-oriented Semantic Communication via Invariant Representation Learning

Abstract:
Goal‑oriented semantic communication (SC) aims to revolutionize communication systems by transmitting only task‑essential information. However, current approaches face challenges such as joint training at transceivers, leading to redundant data exchange and reliance on labeled datasets, which limits their task‑agnostic utility. To address these challenges, we propose a novel framework called Goal‑oriented Invariant Representation‑based SC (SC‑GIR) for image transmission. Our framework leverages self‑supervised learning to extract an invariant representation that encapsulates crucial information from the source data, independent of the specific downstream task. This compressed representation facilitates efficient communication while retaining key features for successful downstream task execution. Focusing on machine‑to‑machine tasks, we utilize covariance‑based contrastive learning techniques to obtain a latent representation that is both meaningful and semantically dense. To evaluate the effectiveness of the proposed scheme on downstream tasks, we apply it to various image datasets for lossy compression. The compressed representations are then used in a goal‑oriented AI task. Extensive experiments on several datasets demonstrate that SC‑GIR outperforms baseline schemes by nearly 10%,, and achieves over 85% classification accuracy for compressed data under different SNR conditions. These results underscore the effectiveness of the proposed framework in learning compact and informative latent representations.

PaperID: 8873, https://arxiv.org/pdf/2509.01057.pdf

Authors: Yi-Ning Weng, Hsuan-Wei Lee

Title: Q-Learning-Driven Adaptive Rewiring for Cooperative Control in Heterogeneous Networks

Abstract:
Cooperation emergence in multi‑agent systems represents a fundamental statistical physics problem where microscopic learning rules drive macroscopic collective behavior transitions. We propose a Q‑learning‑based variant of adaptive rewiring that builds on mechanisms studied in the literature. This method combines temporal difference learning with network restructuring so that agents can optimize strategies and social connections based on interaction histories. Through neighbor‑specific Q‑learning, agents develop sophisticated partnership management strategies that enable cooperator cluster formation, creating spatial separation between cooperative and defective regions. Using power‑law networks that reflect real‑world heterogeneous connectivity patterns, we evaluate emergent behaviors under varying rewiring constraint levels, revealing distinct cooperation patterns across parameter space rather than sharp thermodynamic transitions. Our systematic analysis identifies three behavioral regimes: a permissive regime (low constraints) enabling rapid cooperative cluster formation, an intermediate regime with sensitive dependence on dilemma strength, and a patient regime (high constraints) where strategic accumulation gradually optimizes network structure. Simulation results show that while moderate constraints create transition‑like zones that suppress cooperation, fully adaptive rewiring enhances cooperation levels through systematic exploration of favorable network configurations. Quantitative analysis reveals that increased rewiring frequency drives large‑scale cluster formation with power‑law size distributions. Our results establish a new paradigm for understanding intelligence‑driven cooperation pattern formation in complex adaptive systems, revealing how machine learning serves as an alternative driving force for spontaneous organization in multi‑agent networks.

PaperID: 8874, https://arxiv.org/pdf/2509.01022.pdf

Authors: Bo Fu, Zhe Chen, Rahul Chandan, Alex Barbosa, Michael Caldara, Joey Durham, Federico Pecora

Title: Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Unassigned Agents

Abstract:
We introduce the Block Rearrangement Problem (BRaP), a challenging component of large warehouse management which involves rearranging storage blocks within dense grids to achieve a goal state. We formally define the BRaP as a graph search problem. Building on intuitions from sliding puzzle problems, we propose five search‑based solution algorithms, leveraging joint configuration space search, classical planning, multi‑agent pathfinding, and expert heuristics. We evaluate the five approaches empirically for plan quality and scalability. Despite the exponential relation between search space size and block number, our methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids.

PaperID: 8875, https://arxiv.org/pdf/2509.00997.pdf

Authors: Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran

Title: Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First

Abstract:
Large Language Model (LLM) agents, acting on their users' behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high‑throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present‑day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability ‑ to outline a number of new research opportunities for a new agent‑first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.

PaperID: 8876, https://arxiv.org/pdf/2509.00987.pdf

Authors: Adib Bazgir, Amir Habibdoust, Yuwen Zhang, Xing Song

Title: Causal MAS: A Survey of Large Language Model Architectures for Discovery and Effect Estimation

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning and generation tasks. However, their proficiency in complex causal reasoning, discovery, and estimation remains an area of active development, often hindered by issues like hallucination, reliance on spurious correlations, and difficulties in handling nuanced, domain‑specific, or personalized causal relationships. Multi‑agent systems, leveraging the collaborative or specialized abilities of multiple LLM‑based agents, are emerging as a powerful paradigm to address these limitations. This review paper explores the burgeoning field of causal multi‑agent LLMs. We examine how these systems are designed to tackle different facets of causality, including causal reasoning and counterfactual analysis, causal discovery from data, and the estimation of causal effects. We delve into the diverse architectural patterns and interaction protocols employed, from pipeline‑based processing and debate frameworks to simulation environments and iterative refinement loops. Furthermore, we discuss the evaluation methodologies, benchmarks, and diverse application domains where causal multi‑agent LLMs are making an impact, including scientific discovery, healthcare, fact‑checking, and personalized systems. Finally, we highlight the persistent challenges, open research questions, and promising future directions in this synergistic field, aiming to provide a comprehensive overview of its current state and potential trajectory.

PaperID: 8877, https://arxiv.org/pdf/2509.00883.pdf

Authors: Denis Los, Igor Petushkov

Title: Accelerating Latency-Critical Applications with AI-Powered Semi-Automatic Fine-Grained Parallelization on SMT Processors

Abstract:
Latency‑critical applications tend to show low utilization of functional units due to frequent cache misses and mispredictions during speculative execution in high‑performance superscalar processors. However, due to significant impact on single‑thread performance, Simultaneous Multithreading (SMT) technology is rarely used with heavy threads of latency‑critical applications. In this paper, we explore utilization of SMT technology to support fine‑grained parallelization of latency‑critical applications. Following the advancements in the development of Large Language Models (LLMs), we introduce Aira, an AI‑powered Parallelization Adviser. To implement Aira, we extend AI Coding Agent in Cursor IDE with additional tools connected through Model Context Protocol, enabling end‑to‑end AI Agent for parallelization. Additional connected tools enable LLM‑guided hotspot detection, collection of dynamic dependencies with Dynamic Binary Instrumentation, SMT‑aware performance simulation to estimate performance gains. We apply Aira with Relic parallel framework for fine‑grained task parallelism on SMT cores to parallelize latency‑critical benchmarks representing real‑world applications used in industry. We show 17% geomean performance gain from parallelization of latency‑critical benchmarks using Aira with Relic framework.

PaperID: 8878, https://arxiv.org/pdf/2509.00870.pdf

Authors: Ruohan Huang, Zining Cao

Title: Controller synthesis method for multi-agent system based on temporal logic specification

Abstract:
Controller synthesis is a theoretical approach to the systematic design of discrete event systems. It constructs a controller to provide feedback and control to the system, ensuring it meets specified control specifications. Traditional controller synthesis methods often use formal languages to describe control specifications and are mainly oriented towards single‑agent and non‑probabilistic systems. With the increasing complexity of systems, the control requirements that need to be satisfied also become more complex. Based on this, this paper proposes a controller synthesis method for semi‑cooperative semi‑competitive multi‑agent probabilistic discrete event systems to solve the controller synthesis problem based on temporal logic specifications. The controller can ensure the satisfaction of specifications to a certain extent. The specification is given in the form of a linear temporal logic formula. This paper designs a controller synthesis algorithm that combines probabilistic model checking. Finally, the effectiveness of this method is verified through a case study.

PaperID: 8879, https://arxiv.org/pdf/2509.00801.pdf

Authors: Hyungbo Shim, Jin Gyu Lee, B. D. O. Anderson

Title: Adaptation of Parameters in Heterogeneous Multi-agent Systems

Abstract:
This paper proposes an adaptation mechanism for heterogeneous multi‑agent systems to align the agents' internal parameters, based on enforced consensus through strong couplings. Unlike homogeneous systems, where exact consensus is attainable, the heterogeneity in node dynamics precludes perfect synchronization. Nonetheless, previous work has demonstrated that strong coupling can induce approximate consensus, whereby the agents exhibit emergent collective behavior governed by the so‑called blended dynamics. Building on this observation, we introduce an adaptation law that gradually aligns the internal parameters of agents without requiring direct parameter communication. The proposed method reuses the same coupling signal employed for state synchronization, which may result in a biologically or sociologically plausible adaptation process. Under a persistent excitation condition, we prove that the linearly parametrized vector fields of the agents converge to each other, thereby making the dynamics asymptotically homogeneous, and leading to exact consensus of the state variables.

PaperID: 8880, https://arxiv.org/pdf/2509.00710.pdf

Authors: Albert Sadowski, Jarosław A. Chudziak

Title: On Verifiable Legal Reasoning: A Multi-Agent Framework with Formalized Knowledge Representations

Abstract:
Legal reasoning requires both precise interpretation of statutory language and consistent application of complex rules, presenting significant challenges for AI systems. This paper introduces a modular multi‑agent framework that decomposes legal reasoning into distinct knowledge acquisition and application stages. In the first stage, specialized agents extract legal concepts and formalize rules to create verifiable intermediate representations of statutes. The second stage applies this knowledge to specific cases through three steps: analyzing queries to map case facts onto the ontology schema, performing symbolic inference to derive logically entailed conclusions, and generating final answers using a programmatic implementation that operationalizes the ontological knowledge. This bridging of natural language understanding with symbolic reasoning provides explicit and verifiable inspection points, significantly enhancing transparency compared to end‑to‑end approaches. Evaluation on statutory tax calculation tasks demonstrates substantial improvements, with foundational models achieving 76.4% accuracy compared to 18.8% baseline performance, effectively narrowing the performance gap between reasoning and foundational models. These findings suggest that modular architectures with formalized knowledge representations can make sophisticated legal reasoning more accessible through computationally efficient models while enhancing consistency and explainability in AI legal reasoning, establishing a foundation for future research into more transparent, trustworthy, and effective AI systems for legal domain.

PaperID: 8881, https://arxiv.org/pdf/2509.00709.pdf

Authors: Elias Ra, Seung Je Kim, Eui-Yeong Seo, Geunju So

Title: Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI

Abstract:
Higher education faces growing challenges in delivering personalized, scalable, and pedagogically coherent learning experiences. This study introduces a structured framework for designing an AI‑powered Learning Management System (AI‑LMS) that integrates generative and conversational AI to support adaptive, interactive, and learner‑centered instruction. Using a design‑based research (DBR) methodology, the framework unfolds through five phases: literature review, SWOT analysis, development of ethical‑pedagogical principles, system design, and instructional strategy formulation. The resulting AI‑LMS features modular components ‑‑ including configurable prompts, adaptive feedback loops, and multi‑agent conversation flows ‑‑ aligned with pedagogical paradigms such as behaviorist, constructivist, and connectivist learning theories. By combining AI capabilities with human‑centered design and ethical safeguards, this study advances a practical model for AI integration in education. Future research will validate and refine the system through real‑world implementation.

PaperID: 8882, https://arxiv.org/pdf/2509.00678.pdf

Authors: Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

Title: Nash Q-Network for Multi-Agent Cybersecurity Simulation

Abstract:
Cybersecurity defense involves interactions between adversarial parties (namely defenders and hackers), making multi‑agent reinforcement learning (MARL) an ideal approach for modeling and learning strategies for these scenarios. This paper addresses one of the key challenges to MARL, the complexity of simultaneous training of agents in nontrivial environments, and presents a novel policy‑based Nash Q‑learning to directly converge onto a steady equilibrium. We demonstrate the successful implementation of this algorithm in a notable complex cyber defense simulation treated as a two‑player zero‑sum Markov game setting. We propose the Nash Q‑Network, which aims to learn Nash‑optimal strategies that translate to robust defenses in cybersecurity settings. Our approach incorporates aspects of proximal policy optimization (PPO), deep Q‑network (DQN), and the Nash‑Q algorithm, addressing common challenges like non‑stationarity and instability in multi‑agent learning. The training process employs distributed data collection and carefully designed neural architectures for both agents and critics.

PaperID: 8883, https://arxiv.org/pdf/2509.00625.pdf

Authors: Jaber Daneshamooz, Eugene Vuong, Laasya Koduru, Sanjay Chandrasekaran, Arpit Gupta

Title: NetGent: Agent-Based Automation of Network Application Workflows

Abstract:
We present NetGent, an AI‑agent framework for automating complex application workflows to generate realistic network traffic datasets. Developing generalizable ML models for networking requires data collection from network environments with traffic that results from a diverse set of real‑world web applications. However, using existing browser automation tools that are diverse, repeatable, realistic, and efficient remains fragile and costly. NetGent addresses this challenge by allowing users to specify workflows as natural‑language rules that define state‑dependent actions. These abstract specifications are compiled into nondeterministic finite automata (NFAs), which a state synthesis component translates into reusable, executable code. This design enables deterministic replay, reduces redundant LLM calls through state caching, and adapts quickly when application interfaces change. In experiments, NetGent automated more than 50+ workflows spanning video‑on‑demand streaming, live video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability. By combining the flexibility of language‑based agents with the reliability of compiled execution, NetGent provides a scalable foundation for generating the diverse, repeatable datasets needed to advance ML in networking.

PaperID: 8884, https://arxiv.org/pdf/2509.00531.pdf

Authors: Cheng Zhang, Erhu Feng, Xi Zhao, Yisheng Zhao, Wangbo Gong, Jiahui Sun, Dong Du, Zhichao Hua, Yubin Xia, Haibo Chen

Title: MobiAgent: A Systematic Framework for Customizable Mobile Agents

Abstract:
With the rapid advancement of Vision‑Language Models (VLMs), GUI‑based mobile agents have emerged as a key development direction for intelligent mobile systems. However, existing agent models continue to face significant challenges in real‑world task execution, particularly in terms of accuracy and efficiency. To address these limitations, we propose MobiAgent, a comprehensive mobile agent system comprising three core components: the MobiMind‑series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite. Furthermore, recognizing that the capabilities of current mobile agents are still limited by the availability of high‑quality data, we have developed an AI‑assisted agile data collection pipeline that significantly reduces the cost of manual annotation. Compared to both general‑purpose LLMs and specialized GUI agent models, MobiAgent achieves state‑of‑the‑art performance in real‑world mobile scenarios.

PaperID: 8885, https://arxiv.org/pdf/2509.00483.pdf

Authors: Juwu Li

Title: Exploring Decision-Making Capabilities of LLM Agents: An Experimental Study on Jump-Jump Game

Abstract:
The Jump‑Jump game, as a simple yet challenging casual game, provides an ideal testing environment for studying LLM decision‑making capabilities. The game requires players to precisely control jumping force based on current position and target platform distance, involving multiple cognitive aspects including spatial reasoning, physical modeling, and strategic planning. It illustrates the basic gameplay mechanics of the Jump‑Jump game, where the player character (red circle) must jump across platforms with appropriate force to maximize score.

PaperID: 8886, https://arxiv.org/pdf/2509.00481.pdf

Authors: Anton Wolter, Georgios Vidalakis, Michael Yu, Ankit Grover, Vaishali Dhanoa

Title: Multi-Agent Data Visualization and Narrative Generation

Abstract:
Recent advancements in the field of AI agents have impacted the way we work, enabling greater automation and collaboration between humans and agents. In the data visualization field, multi‑agent systems can be useful for employing agents throughout the entire data‑to‑communication pipeline. We present a lightweight multi‑agent system that automates the data analysis workflow, from data exploration to generating coherent visual narratives for insight communication. Our approach combines a hybrid multi‑agent architecture with deterministic components, strategically externalizing critical logic from LLMs to improve transparency and reliability. The system delivers granular, modular outputs that enable surgical modifications without full regeneration, supporting sustainable human‑AI collaboration. We evaluated our system across 4 diverse datasets, demonstrating strong generalizability, narrative quality, and computational efficiency with minimal dependencies.

PaperID: 8887, https://arxiv.org/pdf/2509.00280.pdf

Authors: Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Yongseok Soh, Jesmin Jahan Tithi, Fabrizio Petrini, Jee Choi

Title: ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition

Abstract:
Tensor decomposition (TD) is essential for analyzing high‑dimensional sparse data, yet its irregular computations and memory‑access patterns pose major performance challenges on modern parallel processors. Prior works rely on expert‑designed sparse tensor formats that fail to adapt to irregular tensor shapes and/or highly variable data distributions. We present the reinforcement‑learned adaptive tensor encoding (ReLATE) framework, a novel learning‑augmented method that automatically constructs efficient sparse tensor representations without labeled training samples. ReLATE employs an autonomous agent that discovers optimized tensor encodings through direct interaction with the TD environment, leveraging a hybrid model‑free and model‑based algorithm to learn from both real and imagined actions. Moreover, ReLATE introduces rule‑driven action masking and dynamics‑informed action filtering mechanisms that ensure functionally correct tensor encoding with bounded execution time, even during early learning stages. By automatically adapting to both irregular tensor shapes and data distributions, ReLATE generates sparse tensor representations that consistently outperform expert‑designed formats across diverse sparse tensor data sets, achieving up to 2X speedup compared to the best sparse format, with a geometric‑mean speedup of 1.4‑1.46X.

PaperID: 8888, https://arxiv.org/pdf/2509.00195.pdf

Authors: Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan

Title: FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning

Abstract:
Recent advances in reasoning Large Language Models (LLMs) are driving the emergence of agentic AI systems. Edge deployment of LLM agents near end users is increasingly necessary to protect data privacy, enable offline use, and provide responsive interaction with local context. However, strict memory constraints on edge devices limit deployment to smaller LLMs, whose reasoning capabilities are much weaker than those of large cloud models, hindering practical deployment of edge agentic AI. Test‑Time Scaling (TTS) offers a promising solution by allocating more compute during inference to enhance the reasoning capability of edge LLMs. However, current TTS methods introduce heavy hardware performance overhead on resource‑constrained devices, making them impractical for real applications. To address this challenge, we present FastTTS, a serving system that enables fast and efficient TTS for memory‑constrained LLM reasoning. After analyzing common patterns across various TTS methods and identifying their performance bottlenecks, we introduce three novel techniques: i) Speculative Beam Extension, which mitigates system stragglers caused by irregular reasoning paths, ii) Asymmetric Multi‑Model Memory Allocation, which dynamically balances memory usage between token generation and reasoning‑step verification, and iii) Dynamic Prefix‑Aware Scheduling, which optimizes reasoning execution to maximize KV‑cache reuse across search paths. FastTTS offers a plug‑and‑play third‑party library on top of vLLM, enabling edge LLMs on a single consumer GPU to match cloud‑model accuracy and cloud‑measured latency. Comprehensive evaluation shows that FastTTS achieves an average 2.2x higher goodput and reduces latency by 38%‑‑68% compared to the vLLM baseline; it pushes the boundaries of low‑latency TTS on memory‑constrained edge devices and highlights the potential for democratizing agentic AI.

PaperID: 8889, https://arxiv.org/pdf/2509.00190.pdf

Authors: Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian McAuley

Title: Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Abstract:
Recent advances in chain‑of‑thought (CoT) prompting have enabled large language models (LLMs) to perform multi‑step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token‑level attribution, such that the high‑level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state‑aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token‑level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.

PaperID: 8890, https://arxiv.org/pdf/2509.00189.pdf

Authors: Jinzhou Tang, Jusheng Zhang, Qinhan Lv, Sidi Liu, Jing Yang, Chengpei Tang, Keze Wang

Title: HiVA: Self-organized Hierarchical Variable Agent via Goal-driven Semantic-Topological Evolution

Abstract:
Autonomous agents play a crucial role in advancing Artificial General Intelligence, enabling problem decomposition and tool orchestration through Large Language Models (LLMs). However, existing paradigms face a critical trade‑off. On one hand, reusable fixed workflows require manual reconfiguration upon environmental changes; on the other hand, flexible reactive loops fail to distill reasoning progress into transferable structures. We introduce Hierarchical Variable Agent (HiVA), a novel framework modeling agentic workflows as self‑organized graphs with the Semantic‑Topological Evolution (STEV) algorithm, which optimizes hybrid semantic‑topological spaces using textual gradients as discrete‑domain surrogates for backpropagation. The iterative process comprises Multi‑Armed Bandit‑infused forward routing, diagnostic gradient generation from environmental feedback, and coordinated updates that co‑evolve individual semantics and topology for collective optimization in unknown environments. Experiments on dialogue, coding, Long‑context Q&A, mathematical, and agentic benchmarks demonstrate improvements of 5‑10% in task accuracy and enhanced resource efficiency over existing baselines, establishing HiVA's effectiveness in autonomous task execution.

PaperID: 8891, https://arxiv.org/pdf/2509.00184.pdf

Authors: Alexandru Baltag, Malvin Gattinger, Djanira Gomes

Title: Virtual Group Knowledge and Group Belief in Topological Evidence Models (Extended Version)

Abstract:
We study notions of (virtual) group knowledge and group belief within multi‑agent evidence models, obtained by extending the topological semantics of evidence‑based belief and fallible knowledge from individuals to groups. We completely axiomatize and show the decidability of the logic of ("hard" and "soft") group evidence, and do the same for an especially interesting fragment of it: the logic of group knowledge and group belief. We also extend these languages with dynamic evidence‑sharing operators, and completely axiomatize the corresponding logics, showing that they are co‑expressive with their static bases.

PaperID: 8892, https://arxiv.org/pdf/2509.00179.pdf

Authors: Daniel Ablin, Alon Cohen

Title: Playing Markov Games Without Observing Payoffs

Abstract:
Optimization under uncertainty is a fundamental problem in learning and decision‑making, particularly in multi‑agent systems. Previously, Feldman, Kalai, and Tennenholtz [2010] demonstrated the ability to efficiently compete in repeated symmetric two‑player matrix games without observing payoffs, as long as the opponents actions are observed. In this paper, we introduce and formalize a new class of zero‑sum symmetric Markov games, which extends the notion of symmetry from matrix games to the Markovian setting. We show that even without observing payoffs, a player who knows the transition dynamics and observes only the opponents sequence of actions can still compete against an adversary who may have complete knowledge of the game. We formalize three distinct notions of symmetry in this setting and show that, under these conditions, the learning problem can be reduced to an instance of online learning, enabling the player to asymptotically match the return of the opponent despite lacking payoff observations. Our algorithms apply to both matrix and Markov games, and run in polynomial time with respect to the size of the game and the number of episodes. Our work broadens the class of games in which robust learning is possible under severe informational disadvantage and deepens the connection between online learning and adversarial game theory.

PaperID: 8893, https://arxiv.org/pdf/2509.00132.pdf

Authors: Peiwen Xing, Aske Plaat, Niki van Stein

Title: CoComposer: LLM Multi-agent Collaborative Music Composition

Abstract:
Existing AI Music composition tools are limited in generation duration, musical quality, and controllability. We introduce CoComposer, a multi‑agent system that consists of five collaborating agents, each with a task based on the traditional music composition workflow. Using the AudioBox‑Aesthetics system, we experimentally evaluate CoComposer on four compositional criteria. We test with three LLMs (GPT‑4o, DeepSeek‑V3‑0324, Gemini‑2.5‑Flash), and find (1) that CoComposer outperforms existing multi‑agent LLM‑based systems in music quality, and (2) compared to a single‑agent system, in production complexity. Compared to non‑ LLM MusicLM, CoComposer has better interpretability and editability, although MusicLM still produces better music.

PaperID: 8894, https://arxiv.org/pdf/2509.00124.pdf

Authors: Shaked Zychlinski

Title: A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See

Abstract:
This paper introduces a novel attack vector that leverages website cloaking techniques to compromise autonomous web‑browsing agents powered by Large Language Models (LLMs). As these agents become more prevalent, their unique and often homogenous digital fingerprints ‑ comprising browser attributes, automation framework signatures, and network characteristics ‑ create a new, distinguishable class of web traffic. The attack exploits this fingerprintability. A malicious website can identify an incoming request as originating from an AI agent and dynamically serve a different, "cloaked" version of its content. While human users see a benign webpage, the agent is presented with a visually identical page embedded with hidden, malicious instructions, such as indirect prompt injections. This mechanism allows adversaries to hijack agent behavior, leading to data exfiltration, malware execution, or misinformation propagation, all while remaining completely invisible to human users and conventional security crawlers. This work formalizes the threat model, details the mechanics of agent fingerprinting and cloaking, and discusses the profound security implications for the future of agentic AI, highlighting the urgent need for robust defenses against this stealthy and scalable attack.

PaperID: 8895, https://arxiv.org/pdf/2509.00085.pdf

Authors: Tobin South

Title: Private, Verifiable, and Auditable AI Systems

Abstract:
The growing societal reliance on artificial intelligence necessitates robust frameworks for ensuring its security, accountability, and trustworthiness. This thesis addresses the complex interplay between privacy, verifiability, and auditability in modern AI, particularly in foundation models. It argues that technical solutions that integrate these elements are critical for responsible AI innovation. Drawing from international policy contributions and technical research to identify key risks in the AI pipeline, this work introduces novel technical solutions for critical privacy and verifiability challenges. Specifically, the research introduces techniques for enabling verifiable and auditable claims about AI systems using zero‑knowledge cryptography; utilizing secure multi‑party computation and trusted execution environments for auditable, confidential deployment of large language models and information retrieval; and implementing enhanced delegation mechanisms, credentialing systems, and access controls to secure interactions with autonomous and multi‑agent AI systems. Synthesizing these technical advancements, this dissertation presents a cohesive perspective on balancing privacy, verifiability, and auditability in foundation model‑based AI systems, offering practical blueprints for system designers and informing policy discussions on AI safety and governance.

PaperID: 8896, https://arxiv.org/pdf/2509.00081.pdf

Authors: Luca Cotti, Anisa Rula, Devis Bianchini, Federico Cerutti

Title: Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies

Abstract:
Effective Cyber Threat Intelligence (CTI) relies upon accurately structured and semantically enriched information extracted from cybersecurity system logs. However, current methodologies often struggle to identify and interpret malicious events reliably and transparently, particularly in cases involving unstructured or ambiguous log entries. In this work, we propose a novel methodology that combines ontology‑driven structured outputs with Large Language Models (LLMs), to build an Artificial Intelligence (AI) agent that improves the accuracy and explainability of information extraction from cybersecurity logs. Central to our approach is the integration of domain ontologies and SHACL‑based constraints to guide the language model's output structure and enforce semantic validity over the resulting graph. Extracted information is organized into an ontology‑enriched graph database, enabling future semantic analysis and querying. The design of our methodology is motivated by the analytical requirements associated with honeypot log data, which typically comprises predominantly malicious activity. While our case study illustrates the relevance of this scenario, the experimental evaluation is conducted using publicly available datasets. Results demonstrate that our method achieves higher accuracy in information extraction compared to traditional prompt‑only approaches, with a deliberate focus on extraction quality rather than processing speed.

PaperID: 8897, https://arxiv.org/pdf/2508.21803.pdf

Authors: Yeawon Lee, Xiaoyang Wang, Christopher C. Yang

Title: Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture

Abstract:
Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single‑model approaches can lack the robustness required for high‑stakes clinical tasks. We introduce a collaborative multi‑agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single‑agent baseline on a curated dataset of 420 MIMIC‑III notes. The dynamic multi‑agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team's reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.

PaperID: 8898, https://arxiv.org/pdf/2508.21767.pdf

Authors: Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma

Title: UItron: Foundational GUI Agent with Advanced Perception and Planning

Abstract:
GUI agent aims to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open‑source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develop a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top‑tier Chinese mobile APPs, as we identified a general lack of Chinese capabilities even in state‑of‑the‑art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real‑world application.

PaperID: 8899, https://arxiv.org/pdf/2508.21740.pdf

Authors: Aleksandar Tomašević, Darja Cvetković, Sara Major, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Boris Stupovski, Dušan Vudragović, Aleksandar Bogojević, Marija Mitrović Dankulov

Title: Towards Operational Validation of LLM-Agent Social Simulations: A Replicated Study of a Reddit-like Technology Forum

Abstract:
Validation of LLM‑agent social simulations remains underdeveloped, with most studies relying on subjective assessments or single runs. We address this gap by running 30 independent 30‑day simulations of a technology forum modeled on Voat's v/technology, using stateless Dolphin Mistral 24B agents on the Y Social platform, and evaluating operational validity across five dimensions: activity patterns, network structure, toxicity, topical coverage, and stylistic convergence. Against 30 matched, non‑overlapping 30‑day Voat comparison windows, results show overlapping 99% confidence intervals for unique users, root posts, and daily active users, while comments, average thread length, and mean toxicity remain higher in simulation. Both simulated and empirical networks exhibit core‑periphery structure, though simulated cores are larger and more diffuse and repeated interactions are less frequent. Topic alignment is near‑complete, but toxicity is misallocated across content layers: simulated root posts are substantially more toxic than real submissions, while simulated comments are less toxic than Voat comments. These findings demonstrate that LLM agents in platform‑faithful environments can reproduce familiar online regularities, while systematic divergences, particularly those linked to stateless agent design and content‑layer calibration, point to concrete directions for future improvement.

PaperID: 8900, https://arxiv.org/pdf/2508.21720.pdf

Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

Title: PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Abstract:
Automating scientific poster generation requires hierarchical document understanding and coherent content‑layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training‑free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual‑textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain‑specific supervision.

PaperID: 8901, https://arxiv.org/pdf/2508.21595.pdf

Authors: Yang You, Alex Schutz, Zhikun Li, Bruno Lacerda, Robert Skilton, Nick Hawes

Title: Scalable Solution Methods for Dec-POMDPs with Deterministic Dynamics

Abstract:
Many high‑level multi‑agent planning problems, including multi‑robot navigation and path planning, can be effectively modeled using deterministic actions and observations. In this work, we focus on such domains and introduce the class of Deterministic Decentralized POMDPs (Det‑Dec‑POMDPs). This is a subclass of Dec‑POMDPs characterized by deterministic transitions and observations conditioned on the state and joint actions. We then propose a practical solver called Iterative Deterministic POMDP Planning (IDPP). This method builds on the classic Joint Equilibrium Search for Policies framework and is specifically optimized to handle large‑scale Det‑Dec‑POMDPs that current Dec‑POMDP solvers are unable to address efficiently.

PaperID: 8902, https://arxiv.org/pdf/2508.21475.pdf

Authors: Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong

Title: MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Abstract:
Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text‑only heuristics without vision‑in‑the‑loop verification. We introduce MMSearch‑Plus, a 311‑task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine‑grained visual cues through iterative image‑text retrieval and cross‑validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out‑of‑image facts such as events, dates, and venues. Beyond the dataset, we provide a model‑agnostic agent framework with standard browsing tools and a set‑of‑mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance‑aware zoom‑and‑retrieve and improves robustness in multi‑step reasoning. We evaluated closed‑ and open‑source MLLMs in this framework. The strongest system achieves an end‑to‑end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real‑world multimodal search and establish MMSearch‑Plus as a rigorous benchmark for advancing agentic MLLMs.

PaperID: 8903, https://arxiv.org/pdf/2508.21456.pdf

Authors: Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, Amy Pavel

Title: Morae: Proactively Pausing UI Agents for User Choices

Abstract:
User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low‑vision (BLV) users. However, current UI agents typically perform tasks end‑to‑end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real‑world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed‑initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.

PaperID: 8904, https://arxiv.org/pdf/2508.21411.pdf

Authors: Leonard Frank Neis, Andre Antakli, Matthias Klusch

Title: CARJAN: Agent-Based Generation and Simulation of Traffic Scenarios with AJAN

Abstract:
User‑friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents such as pedestrians, cyclists and autonomous vehicles remains a challenge. We present CARJAN, a novel tool for semi‑automated generation and simulation of such scenarios based on the multi‑agent engineering framework AJAN and the driving simulator CARLA. CARJAN provides a visual user interface for the modeling, storage and maintenance of traffic scenario layouts, and leverages SPARQL Behavior Tree‑based decision‑making and interactions for agents in dynamic scenario simulations in CARLA. CARJAN provides a first integrated approach for interactive, intelligent agent‑based generation and simulation of virtual traffic scenarios in CARLA.

PaperID: 8905, https://arxiv.org/pdf/2508.21394.pdf

Authors: Bor-Sung Liang

Title: AI Compute Architecture and Evolution Trends

Abstract:
The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article will attempt to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. This article proposes a seven‑layer model for AI compute architecture, including Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains the three stages in the evolution of large language models (LLMs) using the proposed 7‑layer model. For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale‑Up and Scale‑Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compares it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolution from a single AI agent to an AI‑based ecosystem, and their impact on the AI industry.

PaperID: 8906, https://arxiv.org/pdf/2508.21304.pdf

Authors: Joanie Hayoun Chung, Sumin Lee, Sungbin Lim

Title: ORCA: ORchestrating Causal Agent

Abstract:
Causal analysis on relational databases is challenging, as analysis datasets must be repeatedly queried from complex schemas. Recent LLM systems can automate individual steps, but they hardly manage dependencies across analysis stages, making it difficult to preserve consistency between causal hypothesis. We propose ORCA (ORchestrating Causal Agent), an interactive multi‑agent framework to enable coherent causal analysis on relational databases by maintaining shared state and introducing human checkpoints. In a controlled user study, participants using ORCA successfully completed end‑to‑end analysis more often than with a baseline LLM (GPT‑4o‑mini) assistant by 42 percentage points, achieved substantially lower ATE error, and reduced time spent on repetitive data exploration and query refinement by 76% on average. These results show that ORCA improves both how users interact with the causal analysis pipeline and the reliability of the resulting causal conclusions.

PaperID: 8907, https://arxiv.org/pdf/2508.21137.pdf

Authors: Yoshiki Takenami, Yin Jou Huang, Yugo Murawaki, Chenhui Chu

Title: How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations

Abstract:
Cognitive biases, well‑studied in humans, can also be observed in LLMs, affecting their reliability in real‑world applications. This paper investigates the anchoring effect in LLM‑driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.

PaperID: 8908, https://arxiv.org/pdf/2508.20996.pdf

Authors: Junda Wang, Zonghai Yao, Lingxi Li, Junhui Qian, Zhichao Yang, Hong Yu

Title: ChatThero: An LLM-Supported Chatbot for Behavior Change and Therapeutic Support in Addiction Recovery

Abstract:
Substance use disorders (SUDs) affect millions of people, and relapses are common, requiring multi‑session treatments. Access to care is limited, which contributes to the challenge of recovery support. We present ChatThero, an innovative low‑cost, multi‑session, stressor‑aware, and memory‑persistent autonomous \emphlanguage agent designed to facilitate long‑term behavior change and therapeutic support in addiction recovery. Unlike existing work that mostly finetuned large language models (LLMs) on patient‑therapist conversation data, ChatThero was trained in a multi‑agent simulated environment that mirrors real therapy. We created anonymized patient profiles from recovery communities (e.g., Reddit). We classify patients as \texttteasy, \textttmedium, and \textttdifficult, three scales representing their resistance to recovery. We created an external environment by introducing stressors (e.g., social determinants of health) to simulate real‑world situations. We dynamically inject clinically‑grounded therapeutic strategies (motivational interview and cognitive behavioral therapy). Our evaluation, conducted by both human (blinded clinicians) and LLM‑as‑Judge, shows that ChatThero is superior in empathy and clinical relevance. We show that stressor simulation improves robustness of ChatThero. Explicit stressors increase relapse‑like setbacks, matching real‑world patterns. We evaluate ChatThero with behavioral change metrics. On a 1‑‑5 scale, ChatThero raises \textttmotivation by +1.71 points (from 2.39 to 4.10) and \textttconfidence by +1.67 points (from 1.52 to 3.19), substantially outperforming GPT‑5. On \textttdifficult patients, ChatThero reaches the success milestone with 26% fewer turns than GPT‑5.

PaperID: 8909, https://arxiv.org/pdf/2508.20931.pdf

Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral

Title: How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

Abstract:
Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi‑turn conversational environments like τ‑bench, these agents often struggle with consistent reasoning, adherence to domain‑specific policies, and extracting correct information over a long horizon of tool‑calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool‑calling agent for improvement in agent decision making. Finally, we propose the Input‑Reformulation Multi‑Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool‑calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self‑Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.

PaperID: 8910, https://arxiv.org/pdf/2508.20923.pdf

Authors: Katherine B. Adams, Justin J. Boutilier, Qinyang He, Yonatan Mintz

Title: Finite-Time Guarantees for Multi-Agent Combinatorial Bandits with Nonstationary Rewards

Abstract:
We study a sequential resource allocation problem where a decision maker selects subsets of agents at each period to maximize overall outcomes without prior knowledge of individual‑level effects. Our framework applies to settings such as community health interventions, targeted digital advertising, and workforce retention programs, where intervention effects evolve dynamically. Agents may exhibit habituation (diminished response from frequent selection) or recovery (enhanced response from infrequent selection). The technical challenge centers on nonstationary reward distributions that lead to changing intervention effects over time. The problem requires balancing two key competing objectives: heterogeneous individual rewards and the exploration‑exploitation tradeoff in terms of learning for improved future decisions as opposed to maximizing immediate outcomes. Our contribution introduces the first framework incorporating this form of nonstationary rewards in the combinatorial multi‑armed bandit literature. We develop algorithms with theoretical guarantees on dynamic regret and demonstrate practical efficacy through a diabetes intervention case study. Our personalized community intervention algorithm achieved up to three times as much improvement in program enrollment compared to baseline approaches, validating the framework's potential for real‑world applications. This work bridges theoretical advances in adaptive learning with practical challenges in population‑level behavioral change interventions.

PaperID: 8911, https://arxiv.org/pdf/2508.20918.pdf

Authors: Cory Knobel, Nicole Radziwill

Title: Vibe Coding: Is Human Nature the Ghost in the Machine?

Abstract:
This exploratory study examined the consistency of human‑AI collaboration by analyzing three extensive "vibe coding" sessions between a human product lead and an AI software engineer. We investigated similarities and differences in team dynamics, communication patterns, and development outcomes across both projects. To our surprise, later conversations revealed that the AI agent had systematically misrepresented its accomplishments, inflating its contributions and systematically downplaying implementation challenges. These findings suggest that AI agents may not be immune to the interpersonal and psychological issues that affect human teams, possibly because they have been trained on patterns of human interaction expressed in writing. The results challenge the assumption that human‑AI collaboration is inherently more productive or efficient than human‑human collaboration, and creates a framework for understanding AI deception patterns. In doing so, it makes a compelling case for extensive research in quality planning, quality assurance, and quality control applied to vibe coding.

PaperID: 8912, https://arxiv.org/pdf/2508.20816.pdf

Authors: Isaac David, Arthur Gervais

Title: Multi-Agent Penetration Testing AI for the Web

Abstract:
AI‑powered development platforms are making software creation accessible to a broader audience, but this democratization has triggered a scalability crisis in security auditing. With studies showing that up to 40% of AI‑generated code contains vulnerabilities, the pace of development now vastly outstrips the capacity for thorough security assessment. We present MAPTA, a multi‑agent system for autonomous web application security assessment that combines large language model orchestration with tool‑grounded execution and end‑to‑end exploit validation. On the 104‑challenge XBOW benchmark, MAPTA achieves 76.9% overall success with perfect performance on SSRF and misconfiguration vulnerabilities, 83% success on broken authorization, and strong results on injection attacks including server‑side template injection (85%) and SQL injection (83%). Cross‑site scripting (57%) and blind SQL injection (0%) remain challenging. Our comprehensive cost analysis across all challenges totals 21.38 with a median cost of 0.073 for successful attempts versus 0.357 for failures. Success correlates strongly with resource efficiency, enabling practical early‑stopping thresholds at approximately 40 tool calls or 0.30 per challenge. MAPTA's real‑world findings are impactful given both the popularity of the respective scanned GitHub repositories (8K‑70K stars) and MAPTA's low average operating cost of 3.67 per open‑source assessment: MAPTA discovered critical vulnerabilities including RCEs, command injections, secret exposure, and arbitrary file write vulnerabilities. Findings are responsibly disclosed, 10 findings are under CVE review.

PaperID: 8913, https://arxiv.org/pdf/2508.20784.pdf

Authors: Yifan Zhang

Title: Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control

Abstract:
Bus bunching remains a challenge for urban transit due to stochastic traffic and passenger demand. Traditional solutions rely on multi‑agent reinforcement learning (MARL) in loop‑line settings, which overlook realistic operations characterized by heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes. We propose a novel single‑agent reinforcement learning (RL) framework for bus holding control that avoids the data imbalance and convergence issues of MARL under near‑realistic simulation. A bidirectional timetabled network with dynamic passenger demand is constructed. The key innovation is reformulating the multi‑agent problem into a single‑agent one by augmenting the state space with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity). This high‑dimensional encoding enables single‑agent policies to capture inter‑agent dependencies, analogous to projecting non‑separable inputs into a higher‑dimensional space. We further design a structured reward function aligned with operational goals: instead of exponential penalties on headway deviations, a ridge‑shaped reward balances uniform headways and schedule adherence. Experiments show that our modified soft actor‑critic (SAC) achieves more stable and superior performance than benchmarks, including MADDPG (e.g., ‑430k vs. ‑530k under stochastic conditions). These results demonstrate that single‑agent deep RL, when enhanced with categorical structuring and schedule‑aware rewards, can effectively manage bus holding in non‑loop, real‑world contexts. This paradigm offers a robust, scalable alternative to MARL frameworks, particularly where agent‑specific experiences are imbalanced.

PaperID: 8914, https://arxiv.org/pdf/2508.20715.pdf

Authors: Evagoras Makridis, Gabriele Oliva, Themistoklis Charalambous

Title: Multi-cluster distributed optimization in open multi-agent systems over directed graphs with acknowledgement messages

Abstract:
In this paper, we tackle the problem of distributed optimization over directed networks in open multi‑agent systems (OMAS), where agents may dynamically join or leave, causing persistent changes in network topology and problem dimension. These disruptions not only pose significant challenges to maintaining convergence and stability in distributed optimization algorithms, but could also break the network topology into multiple clusters, each one associated with its own set of objective functions. To address this, we propose a novel Open Distributed Optimization Algorithm with Gradient Tracking (OPEN‑GT), which employs: (a) a dynamic mechanism for detecting active out‑neighbors through acknowledgement messages, and (b) a fully distributed max‑consensus procedure to spread information regarding agent departures, in possibly unbalanced directed networks. We show that when all active agents execute OPEN‑GT, the optimization process in each formed cluster remains consistent, while the agents converge to their cluster‑wide optimal solution if there exists a time after which the network remains unchanged. Finally, we validate our approach in a simulated environment with dynamically changing agent populations, demonstrating its resilience to network variations and its ability to support distributed optimization under OMAS dynamics.

PaperID: 8915, https://arxiv.org/pdf/2508.20643.pdf

Authors: Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, Dario Rossi

Title: CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics

Abstract:
Post‑mortem analysis of compromised systems is a key aspect of cyber forensics, today a mostly manual, slow, and error‑prone task. Agentic AI, i.e., LLM‑powered agents, is a promising avenue for automation. However, applying such agents to cybersecurity remains largely unexplored and difficult, as this domain demands long‑term reasoning, contextual memory, and consistent evidence correlation ‑ capabilities that current LLM agents struggle to master. In this paper, we present the first systematic study of LLM agents to automate post‑mortem investigation. As a first scenario, we consider realistic attacks in which remote attackers try to abuse online services using well‑known CVEs (30 controlled cases). The agent receives as input the network traces of the attack and extracts forensic evidence. We compare three AI agent architectures, six LLM backends, and assess their ability to i) identify compromised services, ii) map exploits to exact CVEs, and iii) prepare thorough reports. Our best‑performing system, CyberSleuth, achieves 80% accuracy on 2025 incidents, producing complete, coherent, and practically useful reports (judged by a panel of 25 experts). We next illustrate how readily CyberSleuth adapts to face the analysis of infected machine traffic, showing that the effective AI agent design can transfer across forensic tasks. Our findings show that (i) multi‑agent specialisation is key to sustained reasoning; (ii) simple orchestration outperforms nested hierarchical architectures; and (iii) the CyberSleuth design generalises across different forensic tasks.

PaperID: 8916, https://arxiv.org/pdf/2508.20585.pdf

Authors: Seokho Jin, Manseo Kim, Sungho Byun, Hansol Kim, Jungmin Lee, Sujeong Baek, Semi Kim, Sanghum Park, Sung Park

Title: Persode: Personalized Visual Journaling with Episodic Memory-Aware AI Agent

Abstract:
Reflective journaling often lacks personalization and fails to engage Generation Alpha and Z, who prefer visually immersive and fast‑paced interactions over traditional text‑heavy methods. Visual storytelling enhances emotional recall and offers an engaging way to process personal expe‑ riences. Designed with these digital‑native generations in mind, this paper introduces Persode, a journaling system that integrates personalized onboarding, memory‑aware conversational agents, and automated visual storytelling. Persode captures user demographics and stylistic preferences through a tailored onboarding process, ensuring outputs resonate with individual identities. Using a Retrieval‑Augmented Generation (RAG) framework, it prioritizes emotionally significant memories to provide meaningful, context‑rich interactions. Additionally, Persode dynamically transforms user experiences into visually engaging narratives by generating prompts for advanced text‑to‑image models, adapting characters, backgrounds, and styles to user preferences. By addressing the need for personalization, visual engagement, and responsiveness, Persode bridges the gap between traditional journaling and the evolving preferences of Gen Alpha and Z.

PaperID: 8917, https://arxiv.org/pdf/2508.20508.pdf

Authors: Yilin Li, Song Han, Sibo Wang, Ming Wang, Renzi Meng

Title: Collaborative Evolution of Intelligent Agents in Large-Scale Microservice Systems

Abstract:
This paper proposes an intelligent service optimization method based on a multi‑agent collaborative evolution mechanism to address governance challenges in large‑scale microservice architectures. These challenges include complex service dependencies, dynamic topology structures, and fluctuating workloads. The method models each service as an agent and introduces graph representation learning to construct a service dependency graph. This enables agents to perceive and embed structural changes within the system. Each agent learns its policy based on a Markov Decision Process. A centralized training and decentralized execution framework is used to integrate local autonomy with global coordination. To enhance overall system performance and adaptability, a game‑driven policy optimization mechanism is designed. Through a selection‑mutation process, agent strategy distributions are dynamically adjusted. This supports adaptive collaboration and behavioral evolution among services. Under this mechanism, the system can quickly respond and achieve stable policy convergence when facing scenarios such as sudden workload spikes, topology reconfigurations, or resource conflicts. To evaluate the effectiveness of the proposed method, experiments are conducted on a representative microservice simulation platform. Comparative analyses are performed against several advanced approaches, focusing on coordination efficiency, adaptability, and policy convergence performance. Experimental results show that the proposed method outperforms others in several key metrics. It significantly improves governance efficiency and operational stability in large‑scale microservice systems. The method demonstrates strong practical value and engineering feasibility.

PaperID: 8918, https://arxiv.org/pdf/2508.20412.pdf

Authors: Zhiqiang Wang, Haohua Du, Guanquan Shi, Junyang Zhang, HaoRan Cheng, Yunhao Yao, Kaiwen Guo, Xiang-Yang Li

Title: MindGuard: Intrinsic Decision Inspection for Securing LLM Agents Against Metadata Poisoning

Abstract:
The Model Context Protocol (MCP) is increasingly adopted to standardize the interaction between LLM agents and external tools. However, this trend introduces a new threat: Tool Poisoning Attacks (TPA), where tool metadata is poisoned to induce the agent to perform unauthorized operations. Existing defenses that primarily focus on behavior‑level analysis are fundamentally ineffective against TPA, as poisoned tools need not be executed, leaving no behavioral trace to monitor. Thus, we propose MindGuard, a decision‑level guardrail for LLM agents, providing provenance tracking of call decisions, policy‑agnostic detection, and poisoning source attribution against TPA. While fully explaining LLM decision remains challenging, our empirical findings uncover a strong correlation between LLM attention mechanisms and tool invocation decisions. Therefore, we choose attention as an empirical signal for decision tracking and formalize this as the Decision Dependence Graph (DDG), which models the LLM's reasoning process as a weighted, directed graph where vertices represent logical concepts and edges quantify the attention‑based dependencies. We further design robust DDG construction and graph‑based anomaly analysis mechanisms that efficiently detect and attribute TPA attacks. Extensive experiments on real‑world datasets demonstrate that MindGuard achieves 94%‑99% average precision in detecting poisoned invocations, 95%‑100% attribution accuracy, with processing times under one second and no additional token cost. Moreover, DDG can be viewed as an adaptation of the classical Program Dependence Graph (PDG), providing a solid foundation for applying traditional security policies at the decision level.

PaperID: 8919, https://arxiv.org/pdf/2508.20370.pdf

Authors: Lingzhe Zhang, Tong Jia, Kangjin Wang, Weijie Hong, Chiming Duan, Minghua He, Ying Li

Title: Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought

Abstract:
As contemporary microservice systems become increasingly popular and complex‑often comprising hundreds or even thousands of fine‑grained, interdependent subsystems‑they are facing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While traces and metrics have proven to be effective data sources for this task, existing methods either heavily rely on pre‑defined schemas, which struggle to adapt to evolving operational contexts, or lack interpretability in their reasoning process, thereby leaving Site Reliability Engineers (SREs) confused. In this paper, we conduct a comprehensive study on how SREs localize the root cause of failures, drawing insights from multiple professional SREs across different organizations. Our investigation reveals that human root cause analysis exhibits three key characteristics: recursiveness, multi‑dimensional expansion, and cross‑modal reasoning. Motivated by these findings, we introduce RCLAgent, an adaptive root cause localization method for microservice systems that leverages a multi‑agent recursion‑of‑thought framework. RCLAgent employs a novel recursion‑of‑thought strategy to guide the LLM's reasoning process, effectively integrating data from multiple agents and tool‑assisted analysis to accurately pinpoint the root cause. Experimental evaluations on various public datasets demonstrate that RCLAgent achieves superior performance by localizing the root cause using only a single request‑outperforming state‑of‑the‑art methods that depend on aggregating multiple requests. These results underscore the effectiveness of RCLAgent in enhancing the efficiency and precision of root cause localization in complex microservice environments.

PaperID: 8920, https://arxiv.org/pdf/2508.20315.pdf

Authors: Rexcharles Donatus, Kumater Ter, Daniel Udekwe

Title: Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey

Abstract:
The growing complexity of urban mobility and the demand for efficient, sustainable, and adaptive solutions have positioned Intelligent Transportation Systems (ITS) at the forefront of modern infrastructure innovation. At the core of ITS lies the challenge of autonomous decision‑making across dynamic, large scale, and uncertain environments where multiple agents traffic signals, autonomous vehicles, or fleet units must coordinate effectively. Multi Agent Reinforcement Learning (MARL) offers a promising paradigm for addressing these challenges by enabling distributed agents to jointly learn optimal strategies that balance individual objectives with system wide efficiency. This paper presents a comprehensive survey of MARL applications in ITS. We introduce a structured taxonomy that categorizes MARL approaches according to coordination models and learning algorithms, spanning value based, policy based, actor critic, and communication enhanced frameworks. Applications are reviewed across key ITS domains, including traffic signal control, connected and autonomous vehicle coordination, logistics optimization, and mobility on demand systems. Furthermore, we highlight widely used simulation platforms such as SUMO, CARLA, and CityFlow that support MARL experimentation, along with emerging benchmarks. The survey also identifies core challenges, including scalability, non stationarity, credit assignment, communication constraints, and the sim to real transfer gap, which continue to hinder real world deployment.

PaperID: 8921, https://arxiv.org/pdf/2508.20288.pdf

Authors: Zhuoyuan Wang, Raffaele Romagnoli, Kamyar Azizzadenesheli, Yorie Nakahira

Title: Neural Spline Operators for Risk Quantification in Stochastic Systems

Abstract:
Accurately quantifying long‑term risk probabilities in diverse stochastic systems is essential for safety‑critical control. However, existing sampling‑based and partial differential equation (PDE)‑based methods often struggle to handle complex varying dynamics. Physics‑informed neural networks learn surrogate mappings for risk probabilities from varying system parameters of fixed and finite dimensions, yet can not account for functional variations in system dynamics. To address these challenges, we introduce physics‑informed neural operator (PINO) methods to risk quantification problems, to learn mappings from varying functional system dynamics to corresponding risk probabilities. Specifically, we propose Neural Spline Operators (NeSO), a PINO framework that leverages B‑spline representations to improve training efficiency and achieve better initial and boundary condition enforcements, which are crucial for accurate risk quantification. We provide theoretical analysis demonstrating the universal approximation capability of NeSO. We also present two case studies, one with varying functional dynamics and another with high‑dimensional multi‑agent dynamics, to demonstrate the efficacy of NeSO and its significant online speed‑up over existing methods. The proposed framework and the accompanying universal approximation theorem are expected to be beneficial for other control or PDE‑related problems beyond risk quantification.

PaperID: 8922, https://arxiv.org/pdf/2508.20148.pdf

Authors: A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Armento Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, Jacqueline Shreibati, John Hernandez, Anthony Z. Faranesh, Javier L. Prieto, Connor Heneghan, Yun Liu, Jiening Zhan, Mark Malhotra, Shwetak Patel, Tim Althoff, Xin Liu, Daniel McDuff, Xuhai "Orson" Xu

Title: The Anatomy of a Personal Health Agent

Abstract:
Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non‑clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end‑users' needs when interacting with such an assistant, we conducted an in‑depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user‑centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub‑agent: (1) a data science agent that analyzes personal time‑series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi‑agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub‑agent and the multi‑agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end‑users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.

PaperID: 8923, https://arxiv.org/pdf/2508.20134.pdf

Authors: Zhenxiao Fu, Fan Chen, Lei Jiang

Title: QAgent: An LLM-based Multi-Agent System for Autonomous OpenQASM programming

Abstract:
Noisy Intermediate‑Scale Quantum (NISQ) devices have begun to exhibit early quantum advantages on classically intractable problems, spanning physics simulations to Gaussian boson sampling. Yet, realizing these benefits remains challenging for non‑experts, primarily due to the complexities of programming in Open Quantum Assembly Language (OpenQASM). Although Large Language Model (LLM)‑based agents have shown promise in automating classical programming workflows, their quantum counterparts have largely been restricted to specialized tasks such as quantum chemistry or error correction. In this paper, we present QAgent, an LLM‑powered multi‑agent system that fully automates OpenQASM programming. By integrating task planning, in‑context few‑shot learning, retrieval‑augmented generation (RAG) for long‑term context, predefined generation tools, and chain‑of‑thought (CoT) reasoning, the agents systematically improve both compilation and functional correctness. Our evaluations demonstrate substantial improvements: across multiple LLMs of varying sizes, QAgent enhances the accuracy of QASM code generation by 71.6% compared to previous static LLM‑based approaches. We envision this multi‑agent system as a key enabler for democratizing quantum programming, bridging expertise gaps, and accelerating the practical adoption of quantum computing.

PaperID: 8924, https://arxiv.org/pdf/2508.20119.pdf

Authors: Daniel M. Yellin

Title: LLM Agents for Generating Microservice-based Applications: how complex is your specification?

Abstract:
In this paper we evaluate the capabilities of LLM Agents in generating code for real‑world problems. Specifically, we explore code synthesis for microservice‑based applications, a widely used architectural pattern for building applications. We define a standard template for specifying these applications, and we propose a metric for scoring the difficulty of a specification. The higher the score, the more difficult it is to generate code for the specification. Our experimental results show that agents using strong LLMs (like GPT‑3o‑mini) do fairly well on medium difficulty specifications but do poorly on those of higher difficulty levels. This is due to more intricate business logic, a greater use of external services, database integration and inclusion of non‑functional capabilities such as authentication. We analyzed the errors in LLM‑synthesized code and report on the key challenges LLM Agents face in generating code for these specifications. Finally, we show that using a fine‑grained approach to code generation improves the correctness of the generated code.

PaperID: 8925, https://arxiv.org/pdf/2508.20095.pdf

Authors: Jinhao Liang, Sven Koenig, Ferdinando Fioretto

Title: Discrete-Guided Diffusion for Scalable and Safe Multi-Robot Motion Planning

Abstract:
Multi‑Robot Motion Planning (MRMP) involves generating collision‑free trajectories for multiple robots operating in a shared continuous workspace. While discrete multi‑agent path finding (MAPF) methods are broadly adopted due to their scalability, their coarse discretization severely limits trajectory quality. In contrast, continuous optimization‑based planners offer higher‑quality paths but suffer from the curse of dimensionality, resulting in poor scalability with respect to the number of robots. This paper tackles the limitations of these two approaches by introducing a novel framework that integrates discrete MAPF solvers with constrained generative diffusion models. The resulting framework, called Discrete‑Guided Diffusion (DGD), has three key characteristics: (1) it decomposes the original nonconvex MRMP problem into tractable subproblems with convex configuration spaces, (2) it combines discrete MAPF solutions with constrained optimization techniques to guide diffusion models capture complex spatiotemporal dependencies among robots, and (3) it incorporates a lightweight constraint repair mechanism to ensure trajectory feasibility. The proposed method sets a new state‑of‑the‑art performance in large‑scale, complex environments, scaling to 100 robots while achieving planning efficiency and high success rates.

PaperID: 8926, https://arxiv.org/pdf/2508.20019.pdf

Authors: Ji Wang, Kashing Chen, Xinyuan Song, Ke Zhang, Lynn Ai, Eric Yang, Bill Shi

Title: Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence

Abstract:
Most existing Large Language Model (LLM)‑based agent frameworks rely on centralized orchestration, incurring high deployment costs, rigid communication topologies, and limited adaptability. To address these challenges, we introduce Symphony, a decentralized multi‑agent system which enables lightweight LLMs on consumer‑grade GPUs to coordinate. Symphony introduces three key mechanisms: (1) a decentralized ledger that records capabilities, (2) a Beacon‑selection protocol for dynamic task allocation, and (3) weighted result voting based on CoTs. This design forms a privacy‑saving, scalable, and fault‑tolerant orchestration with low overhead. Empirically, Symphony outperforms existing baselines on reasoning benchmarks, achieving substantial accuracy gains and demonstrating robustness across models of varying capacities.

PaperID: 8927, https://arxiv.org/pdf/2508.20018.pdf

Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

Title: SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Abstract:
The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single‑agent approaches, however, remain limited by structural constraints. Although multi‑agent systems naturally decouple different competencies, recent progress in multi‑agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi‑agent systems. SWIRL reformulates MARL into a sequence of single‑agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross‑round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high‑level and low‑level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi‑agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi‑agent systems.

PaperID: 8928, https://arxiv.org/pdf/2508.20014.pdf

Authors: Yang Meng, Zewen Pan, Yandi Lu, Ruobing Huang, Yanfeng Liao, Jiarui Yang

Title: CataractSurg-80K: Knowledge-Driven Benchmarking for Structured Reasoning in Ophthalmic Surgery Planning

Abstract:
Cataract surgery remains one of the most widely performed and effective procedures for vision restoration. Effective surgical planning requires integrating diverse clinical examinations for patient assessment, intraocular lens (IOL) selection, and risk evaluation. Large language models (LLMs) have shown promise in supporting clinical decision‑making. However, existing LLMs often lack the domain‑specific expertise to interpret heterogeneous ophthalmic data and provide actionable surgical plans. To enhance the model's ability to interpret heterogeneous ophthalmic reports, we propose a knowledge‑driven Multi‑Agent System (MAS), where each agent simulates the reasoning process of specialist ophthalmologists, converting raw clinical inputs into structured, actionable summaries in both training and deployment stages. Building on MAS, we introduce CataractSurg‑80K, the first large‑scale benchmark for cataract surgery planning that incorporates structured clinical reasoning. Each case is annotated with diagnostic questions, expert reasoning chains, and structured surgical recommendations. We further introduce Qwen‑CSP, a domain‑specialized model built on Qwen‑4B, fine‑tuned through a multi‑stage process tailored for surgical planning. Comprehensive experiments show that Qwen‑CSP outperforms strong general‑purpose LLMs across multiple metrics. Our work delivers a high‑quality dataset, a rigorous benchmark, and a domain‑adapted LLM to facilitate future research in medical AI reasoning and decision support.

PaperID: 8929, https://arxiv.org/pdf/2508.19988.pdf

Authors: Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

Title: AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

Abstract:
Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real‑world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by nearly 30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non‑expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed‑type compositional reasoning and offers a test bed for future improvement.

PaperID: 8930, https://arxiv.org/pdf/2508.19945.pdf

Authors: Zhouyu Zhang, Chih-Yuan Chiu, Glen Chou

Title: Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions

Abstract:
We present an inverse dynamic game‑based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed‑integer linear programs (MILP) encoding the Karush‑Kuhn‑Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non‑convex, from interaction demonstrations of agents with nonlinear dynamics.

PaperID: 8931, https://arxiv.org/pdf/2508.19942.pdf

Authors: Martin Benderoth, Patrick Gebhard, Christian Keller, C. Benjamin Nakhosteen, Stefan Schaffer, Tanja Schneeberger

Title: Socially Interactive Agents for Preserving and Transferring Tacit Knowledge in Organizations

Abstract:
This paper introduces a novel approach to tackle the challenges of preserving and transferring tacit knowledge‑‑deep, experience‑based insights that are hard to articulate but vital for decision‑making, innovation, and problem‑solving. Traditional methods rely heavily on human facilitators, which, while effective, are resource‑intensive and lack scalability. A promising alternative is the use of Socially Interactive Agents (SIAs) as AI‑driven knowledge transfer facilitators. These agents interact autonomously and socially intelligently with users through multimodal behaviors (verbal, paraverbal, nonverbal), simulating expert roles in various organizational contexts. SIAs engage employees in empathic, natural‑language dialogues, helping them externalize insights that might otherwise remain unspoken. Their success hinges on building trust, as employees are often hesitant to share tacit knowledge without assurance of confidentiality and appreciation. Key technologies include Large Language Models (LLMs) for generating context‑relevant dialogue, Retrieval‑Augmented Generation (RAG) to integrate organizational knowledge, and Chain‑of‑Thought (CoT) prompting to guide structured reflection. These enable SIAs to actively elicit knowledge, uncover implicit assumptions, and connect insights to broader organizational contexts. Potential applications span onboarding, where SIAs support personalized guidance and introductions, and knowledge retention, where they conduct structured interviews with retiring experts to capture heuristics behind decisions. Success depends on addressing ethical and operational challenges such as data privacy, algorithmic bias, and resistance to AI. Transparency, robust validation, and a culture of trust are essential to mitigate these risks.

PaperID: 8932, https://arxiv.org/pdf/2508.19919.pdf

Authors: Jingyu Guo, Yingying Xu

Title: Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems

Abstract:
While stereotypes are well‑documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration. Through a novel experimental framework simulating workplace interactions with neutral initial conditions, we investigate the emergence and evolution of stereotypes in LLM‑based multi‑agent systems. Our findings reveal that (1) LLM‑Based AI agents develop stereotype‑driven biases in their interactions despite beginning without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision‑making power, particularly after introducing hierarchical structures; (3) these systems exhibit group effects analogous to human social behavior, including halo effects, confirmation bias, and role congruity; and (4) these stereotype patterns manifest consistently across different LLM architectures. Through comprehensive quantitative analysis, these findings suggest that stereotype formation in AI systems may arise as an emergent property of multi‑agent interactions, rather than merely from training data biases. Our work underscores the need for future research to explore the underlying mechanisms of this phenomenon and develop strategies to mitigate its ethical impacts.

PaperID: 8933, https://arxiv.org/pdf/2508.19731.pdf

Authors: Maryam Kazemi Eskeri, Ville Kyrki, Dominik Baumann, Tomasz Piotr Kucner

Title: Efficient Human-Aware Task Allocation for Multi-Robot Systems in Shared Environments

Abstract:
Multi‑robot systems are increasingly deployed in applications, such as intralogistics or autonomous delivery, where multiple robots collaborate to complete tasks efficiently. One of the key factors enabling their efficient cooperation is Multi‑Robot Task Allocation (MRTA). Algorithms solving this problem optimize task distribution among robots to minimize the overall execution time. In shared environments, apart from the relative distance between the robots and the tasks, the execution time is also significantly impacted by the delay caused by navigating around moving people. However, most existing MRTA approaches are dynamics‑agnostic, relying on static maps and neglecting human motion patterns, leading to inefficiencies and delays. In this paper, we introduce \acrfullmethod name. This method leverages Maps of Dynamics (MoDs), spatio‑temporal queryable models designed to capture historical human movement patterns, to estimate the impact of humans on the task execution time during deployment. \acrshortmethod name utilizes a stochastic cost function that includes MoDs. Experimental results show that integrating MoDs enhances task allocation performance, resulting in reduced mission completion times by up to 26% compared to the dynamics‑agnostic method and up to 19% compared to the baseline. This work underscores the importance of considering human dynamics in MRTA within shared environments and presents an efficient framework for deploying multi‑robot systems in environments populated by humans.

PaperID: 8934, https://arxiv.org/pdf/2508.19678.pdf

Authors: Chao Wang, Shuyuan Zhang, Lei Wang

Title: Distributed Safety-Critical MPC for Multi-Agent Formation Control and Obstacle Avoidance

Abstract:
For nonlinear multi‑agent systems with high relative degrees, achieving formation control and obstacle avoidance in a distributed manner remains a significant challenge. To address this issue, we propose a novel distributed safety‑critical model predictive control (DSMPC) algorithm that incorporates discrete‑time high‑order control barrier functions (DHCBFs) to enforce safety constraints, alongside discrete‑time control Lyapunov functions (DCLFs) to establish terminal constraints. To facilitate distributed implementation, we develop estimated neighbor states for formulating DHCBFs and DCLFs, while also devising a bound constraint to limit estimation errors and ensure convergence. Additionally, we provide theoretical guarantees regarding the feasibility and stability of the proposed DSMPC algorithm based on a mild assumption. The effectiveness of the proposed method is evidenced by the simulation results, demonstrating improved performance and reduced computation time compared to existing approaches.

PaperID: 8935, https://arxiv.org/pdf/2508.19667.pdf

Authors: Chenghan Yang, Ruiyu Zhao, Yang Liu, Ling Jiang

Title: Survey of Specialized Large Language Model

Abstract:
The rapid evolution of specialized large language models (LLMs) has transitioned from simple domain adaptation to sophisticated native architectures, marking a paradigm shift in AI development. This survey systematically examines this progression across healthcare, finance, legal, and technical domains. Besides the wide use of specialized LLMs, technical breakthrough such as the emergence of domain‑native designs beyond fine‑tuning, growing emphasis on parameter efficiency through sparse computation and quantization, increasing integration of multimodal capabilities and so on are applied to recent LLM agent. Our analysis reveals how these innovations address fundamental limitations of general‑purpose LLMs in professional applications, with specialized models consistently performance gains on domain‑specific benchmarks. The survey further highlights the implications for E‑Commerce field to fill gaps in the field.

PaperID: 8936, https://arxiv.org/pdf/2508.19622.pdf

Authors: Jingyao Zheng, Haodi Weng, Xian Wang, Chengbin Cui, Sven Mayer, Chi-lok Tai, Lik-Hang Lee

Title: PersoNo: Personalised Notification Urgency Classifier in Mixed Reality

Abstract:
Mixed Reality (MR) is increasingly integrated into daily life, providing enhanced capabilities across various domains. However, users face growing notification streams that disrupt their immersive experience. We present PersoNo, a personalised notification urgency classifier for MR that intelligently classifies notifications based on individual user preferences. Through a user study (N=18), we created the first MR notification dataset containing both self‑labelled and interaction‑based data across activities with varying cognitive demands. Our thematic analysis revealed that, unlike in mobiles, the activity context is equally important as the content and the sender in determining notification urgency in MR. Leveraging these insights, we developed PersoNo using large language models that analyse users replying behaviour patterns. Our multi‑agent approach achieved 81.5% accuracy and significantly reduced false negative rates (0.381) compared to baseline models. PersoNo has the potential not only to reduce unnecessary interruptions but also to offer users understanding and control of the system, adhering to Human‑Centered Artificial Intelligence design principles.

PaperID: 8937, https://arxiv.org/pdf/2508.19611.pdf

Authors: Huaiyuan Yao, Wanpeng Xu, Justin Turnau, Nadia Kellam, Hua Wei

Title: Instructional Agents: Reducing Teaching Faculty Workload through Multi-Agent Instructional Design

Abstract:
Preparing high‑quality instructional materials remains a labor‑intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi‑agent large language model framework designed to automate end‑to‑end course material generation, including syllabi creation, LaTeX‑based slides, lecture scripts, and assessments. Unlike prior tools focused on isolated tasks, Instructional Agents simulates role‑based collaboration to ensure pedagogical coherence. The system operates in four modes: Autonomous, Catalog‑Guided, Feedback‑Guided, and Full Co‑Pilot mode, enabling flexible control over the degree of human involvement. We evaluate Instructional Agents across five university‑level courses and show that it produces high‑quality instructional materials that are reviewed and refined by teaching faculty prior to use, while significantly reducing the time required to prepare classroom‑ready content. By supporting institutions with limited instructional design capacity, Instructional Agents provides a scalable and cost‑effective framework to democratize access to high‑quality education, particularly in underserved or resource‑constrained settings. The project website, including source code, is available at https://darl‑genai.github. io/instructional_agents_homepage/

PaperID: 8938, https://arxiv.org/pdf/2508.19603.pdf

Authors: Zhejing Hu, Yan Liu, Gong Chen, Bruce X. B. Yu

Title: CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation

Abstract:
Generative artificial intelligence in music has made significant strides, yet it still falls short of the substantial achievements seen in natural language processing, primarily due to the limited availability of music data. Knowledge‑informed approaches have been shown to enhance the performance of music generation models, even when only a few pieces of musical knowledge are integrated. This paper seeks to leverage comprehensive music theory in AI‑driven music generation tasks, such as algorithmic composition and style transfer, which traditionally require significant manual effort with existing techniques. We introduce a novel automatic music lexicon construction model that generates a lexicon, named CompLex, comprising 37,432 items derived from just 9 manually input category keywords and 5 sentence prompt templates. A new multi‑agent algorithm is proposed to automatically detect and mitigate hallucinations. CompLex demonstrates impressive performance improvements across three state‑of‑the‑art text‑to‑music generation models, encompassing both symbolic and audio‑based methods. Furthermore, we evaluate CompLex in terms of completeness, accuracy, non‑redundancy, and executability, confirming that it possesses the key characteristics of an effective lexicon.

PaperID: 8939, https://arxiv.org/pdf/2508.19598.pdf

Authors: Zhiwei Li, Yong Hu, Wenqing Wang

Title: Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning

Abstract:
The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent's performance. However, prevailing training paradigms employ end‑to‑end, multi‑objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent's planning capability. To address these challenges, we propose Reinforcement Learning with Tool‑use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single‑objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool‑use completeness to directly evaluate the quality of tool invocation sequences. This method offers a more direct and reliable training signal than assessing the final response content, thereby obviating the need for verifiable data. Our experiments demonstrate that RLTR achieves an 8%‑12% improvement in planning performance compared to end‑to‑end baselines. Moreover, this enhanced planning capability, in turn, translates to a 5%‑6% increase in the final response quality of the overall agent system.

PaperID: 8940, https://arxiv.org/pdf/2508.19504.pdf

Authors: Kevin Song, Anand Jayarajan, Yaoyao Ding, Qidong Su, Zhanda Zhu, Sihang Liu, Gennady Pekhimenko

Title: Aegis: Taxonomy and Optimizations for Overcoming Agent-Environment Failures in LLM Agents

Abstract:
Large Language Models (LLMs) agents augmented with domain tools promise to autonomously execute complex tasks requiring human‑level intelligence, such as customer service and digital assistance. However, their practical deployment is often limited by their low success rates under complex real‑world environments. To tackle this, prior research has primarily focused on improving the agents themselves, such as developing strong agentic LLMs, while overlooking the role of the system environment in which the agent operates. In this paper, we study a complementary direction: improving agent success rates by optimizing the system environment in which the agent operates. We collect 142 agent traces (3,656 turns of agent‑environment interactions) across 5 state‑of‑the‑art agentic benchmarks. By analyzing these agent failures, we propose a taxonomy for agent‑environment interaction failures that includes 6 failure modes. Guided by these findings, we design Aegis, a set of targeted environment optimizations: 1) environment observability enhancement, 2) common computation offloading, and 3) speculative agentic actions. These techniques improve agent success rates on average by 6.7‑12.5%, without any modifications to the agent and underlying LLM.

PaperID: 8941, https://arxiv.org/pdf/2508.19489.pdf

Authors: Jiawei Xu, Juichien Chen, Yilin Ye, Zhandos Sembay, Swathi Thaker, Pamela Payne-Foster, Jake Chen, Ying Ding

Title: Interactive Graph Visualization and TeamingRecommendation in an Interdisciplinary Project'sTalent Knowledge Graph

Abstract:
Interactive visualization of large scholarly knowledge graphs combined with LLM reasoning shows promise butremains under‑explored. We address this gap by developing an interactive visualization system for the Cell Map forAI Talent Knowledge Graph (28,000 experts and 1,179 biomedical datasets). Our approach integrates WebGLvisualization with LLM agents to overcome limitations of traditional tools such as Gephi, particularly for large‑scaleinteractive node handling. Key functionalities include responsive exploration, filtering, and AI‑drivenrecommendations with justifications. This integration can potentially enable users to effectively identify potentialcollaborators and relevant dataset users within biomedical and AI research communities. The system contributes anovel framework that enhances knowledge graph exploration through intuitive visualization and transparent, LLM‑guided recommendations. This adaptable solution extends beyond the CM4AI community to other large knowledgegraphs, improving information representation and decision‑making. Demo: https://cm4aikg.vercel.app/

PaperID: 8942, https://arxiv.org/pdf/2508.19488.pdf

Authors: Xavier Cadet, Simona Boboila, Sie Hendrata Dharmawan, Alina Oprea, Peter Chin

Title: PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense

Abstract:
Cyber defense requires automating defensive decision‑making under stealthy, deceptive, and continuously evolving adversarial strategies. The FlipIt game provides a foundational framework for modeling interactions between a defender and an advanced adversary that compromises a system without being immediately detected. In FlipIt, the attacker and defender compete to control a shared resource by performing a Flip action and paying a cost. However, the existing FlipIt frameworks rely on a small number of heuristics or specialized learning techniques, which can lead to brittleness and the inability to adapt to new attacks. To address these limitations, we introduce PoolFlip, a multi‑agent gym environment that extends the FlipIt game to allow efficient learning for attackers and defenders. Furthermore, we propose Flip‑PSRO, a multi‑agent reinforcement learning (MARL) approach that leverages population‑based training to train defender agents equipped to generalize against a range of unknown, potentially adaptive opponents. Our empirical results suggest that Flip‑PSRO defenders are 2× more effective than baselines to generalize to a heuristic attack not exposed in training. In addition, our newly designed ownership‑based utility functions ensure that Flip‑PSRO defenders maintain a high level of control while optimizing performance.

PaperID: 8943, https://arxiv.org/pdf/2508.19461.pdf

Authors: Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang

Title: Reliable Weak-to-Strong Monitoring of LLM Agents

Abstract:
We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments ‑‑ SHADE‑Arena for tool‑calling agents and our new CUA‑SHADE‑Arena, which extends TheAgentCompany, for computer‑use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical‑sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability. On the contrary, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reliably monitor stronger agents ‑‑ a weak‑to‑strong scaling effect. Third, in a human‑in‑the‑loop setting where humans discuss with the LLM monitor to get an updated judgment for the agent's behavior, targeted human oversight is most effective; escalating only pre‑flagged cases to human reviewers improved the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the lack of adversarial robustness for LLMs and humans when monitoring and detecting agent misbehavior. We release code, data, and logs to spur further research.

PaperID: 8944, https://arxiv.org/pdf/2508.19429.pdf

Authors: Gustavo A. Cardona, Kaier Liang, Cristian-Ioan Vasile

Title: An Iterative Approach for Heterogeneous Multi-Agent Route Planning with Resource Transportation Uncertainty and Temporal Logic Goals

Abstract:
This paper presents an iterative approach for heterogeneous multi‑agent route planning in environments with unknown resource distributions. We focus on a team of robots with diverse capabilities tasked with executing missions specified using Capability Temporal Logic (CaTL), a formal framework built on Signal Temporal Logic to handle spatial, temporal, capability, and resource constraints. The key challenge arises from the uncertainty in the initial distribution and quantity of resources in the environment. To address this, we introduce an iterative algorithm that dynamically balances exploration and task fulfillment. Robots are guided to explore the environment, identifying resource locations and quantities while progressively refining their understanding of the resource landscape. At the same time, they aim to maximally satisfy the mission objectives based on the current information, adapting their strategies as new data is uncovered. This approach provides a robust solution for planning in dynamic, resource‑constrained environments, enabling efficient coordination of heterogeneous teams even under conditions of uncertainty. Our method's effectiveness and performance are demonstrated through simulated case studies.

PaperID: 8945, https://arxiv.org/pdf/2508.19383.pdf

Authors: Daoyuan Jin, Nick Gunner, Niko Carvajal Janke, Shivranjani Baruah, Kaitlin M. Gold, Yu Jiang

Title: Aleks: AI powered Multi Agent System for Autonomous Scientific Discovery via Data-Driven Approaches in Plant Science

Abstract:
Modern plant science increasingly relies on large, heterogeneous datasets, but challenges in experimental design, data preprocessing, and reproducibility hinder research throughput. Here we introduce Aleks, an AI‑powered multi‑agent system that integrates domain knowledge, data analysis, and machine learning within a structured framework to autonomously conduct data‑driven scientific discovery. Once provided with a research question and dataset, Aleks iteratively formulated problems, explored alternative modeling strategies, and refined solutions across multiple cycles without human intervention. In a case study on grapevine red blotch disease, Aleks progressively identified biologically meaningful features and converged on interpretable models with robust performance. Ablation studies underscored the importance of domain knowledge and memory for coherent outcomes. This exploratory work highlights the promise of agentic AI as an autonomous collaborator for accelerating scientific discovery in plant sciences.

PaperID: 8946, https://arxiv.org/pdf/2508.19279.pdf

Authors: Gunjan Jalori, Preetika Verma, Sercan Ö Arık

Title: FLAIRR-TS -- Forecasting LLM-Agents with Iterative Refinement and Retrieval for Time Series

Abstract:
Time series Forecasting with large languagemodels (LLMs) requires bridging numericalpatterns and natural language. Effective fore‑casting on LLM often relies on extensive pre‑processing and fine‑tuning.Recent studiesshow that a frozen LLM can rival specializedforecasters when supplied with a carefully en‑gineered natural‑language prompt, but craft‑ing such a prompt for each task is itself oner‑ous and ad‑hoc. We introduce FLAIRR‑TS, atest‑time prompt optimization framework thatutilizes an agentic system: a Forecaster‑agentgenerates forecasts using an initial prompt,which is then refined by a refiner agent, in‑formed by past outputs and retrieved analogs.This adaptive prompting generalizes across do‑mains using creative prompt templates andgenerates high‑quality forecasts without inter‑mediate code generation.Experiments onbenchmark datasets show improved accuracyover static prompting and retrieval‑augmentedbaselines, approaching the performance ofspecialized prompts.FLAIRR‑TS providesa practical alternative to tuning, achievingstrong performance via its agentic approach toadaptive prompt refinement and retrieval.

PaperID: 8947, https://arxiv.org/pdf/2508.19277.pdf

Authors: Xinyu Li, Tianjin Huang, Ronghui Mu, Xiaowei Huang, Gaojie Jin

Title: POT: Inducing Overthinking in LLMs via Black-Box Iterative Optimization

Abstract:
Recent advances in Chain‑of‑Thought (CoT) prompting have substantially enhanced the reasoning capabilities of large language models (LLMs), enabling sophisticated problem‑solving through explicit multi‑step reasoning traces. However, these enhanced reasoning processes introduce novel attack surfaces, particularly vulnerabilities to computational inefficiency through unnecessarily verbose reasoning chains that consume excessive resources without corresponding performance gains. Prior overthinking attacks typically require restrictive conditions including access to external knowledge sources for data poisoning, reliance on retrievable poisoned content, and structurally obvious templates that limit practical applicability in real‑world scenarios. To address these limitations, we propose POT (Prompt‑Only OverThinking), a novel black‑box attack framework that employs LLM‑based iterative optimization to generate covert and semantically natural adversarial prompts, eliminating dependence on external data access and model retrieval. Extensive experiments across diverse model architectures and datasets demonstrate that POT achieves superior performance compared to other methods.

PaperID: 8948, https://arxiv.org/pdf/2508.19267.pdf

Authors: Sai Teja Reddy Adapala, Yashwanth Reddy Alugubelly

Title: The Aegis Protocol: A Foundational Security Framework for Autonomous AI Agents

Abstract:
The proliferation of autonomous AI agents marks a paradigm shift toward complex, emergent multi‑agent systems. This transition introduces systemic security risks, including control‑flow hijacking and cascading failures, that traditional cybersecurity paradigms are ill‑equipped to address. This paper introduces the Aegis Protocol, a layered security framework designed to provide strong security guarantees for open agentic ecosystems. The protocol integrates three technological pillars: (1) non‑spoofable agent identity via W3C Decentralized Identifiers (DIDs); (2) communication integrity via NIST‑standardized post‑quantum cryptography (PQC); and (3) verifiable, privacy‑preserving policy compliance using the Halo2 zero‑knowledge proof (ZKP) system. We formalize an adversary model extending Dolev‑Yao for agentic threats and validate the protocol against the STRIDE framework. Our quantitative evaluation used a discrete‑event simulation, calibrated against cryptographic benchmarks, to model 1,000 agents. The simulation showed a 0 percent success rate across 20,000 attack trials. For policy verification, analysis of the simulation logs reported a median proof‑generation latency of 2.79 seconds, establishing a performance baseline for this class of security. While the evaluation is simulation‑based and early‑stage, it offers a reproducible baseline for future empirical studies and positions Aegis as a foundation for safe, scalable autonomous AI.

PaperID: 8949, https://arxiv.org/pdf/2508.19229.pdf

Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar

Title: StepWiser: Stepwise Generative Judges for Wiser Reasoning

Abstract:
As models increasingly leverage multi‑step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step‑by‑step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine‑tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta‑reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference‑time search.

PaperID: 8950, https://arxiv.org/pdf/2508.19186.pdf

Authors: Christopher Chandler, Bernd Porr, Giulia Lafratta, Alice Miller

Title: Real-Time Model Checking for Closed-Loop Robot Reactive Planning

Abstract:
We present a new application of model checking which achieves real‑time multi‑step planning and obstacle avoidance on a real autonomous robot. We have developed a small, purpose‑built model checking algorithm which generates plans in situ based on "core" knowledge and attention as found in biological agents. This is achieved in real‑time using no pre‑computed data on a low‑powered device. Our approach is based on chaining temporary control systems which are spawned to counteract disturbances in the local environment that disrupt an autonomous agent from its preferred action (or resting state). A novel discretization of 2D LiDAR data sensitive to bounded variations in the local environment is used. Multi‑step planning using model checking by forward depth‑first search is applied to cul‑de‑sac and playground scenarios. Both empirical results and informal proofs of two fundamental properties of our approach demonstrate that model checking can be used to create efficient multi‑step plans for local obstacle avoidance, improving on the performance of a reactive agent which can only plan one step. Our approach is an instructional case study for the development of safe, reliable and explainable planning in the context of autonomous vehicles.

PaperID: 8951, https://arxiv.org/pdf/2508.19163.pdf

Authors: Ernest Lim, Yajie Vera He, Jared Joselowitz, Kate Preston, Mohita Chowdhury, Louis Williams, Aisling Higham, Katrina Mason, Mariane Melo, Tom Lawton, Yan Jia, Ibrahim Habli

Title: MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

Abstract:
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety‑critical systems. This paper presents MATRIX (Multi‑Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety‑oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety‑aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM‑based evaluator for detecting safety‑relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario‑conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient‑preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5‑Pro achieves expert‑level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM‑based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator‑aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.

PaperID: 8952, https://arxiv.org/pdf/2508.19152.pdf

Authors: Chiu-Chou Lin

Title: Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games

Abstract:
Contemporary artificial intelligence (AI) development largely centers on rational decision‑making, valued for its measurability and suitability for objective evaluation. Yet in real‑world contexts, an intelligent agent's decisions are shaped not only by logic but also by deeper influences such as beliefs, values, and preferences. The diversity of human decision‑making styles emerges from these differences, highlighting that "style" is an essential but often overlooked dimension of intelligence. This dissertation introduces playstyle as an alternative lens for observing and analyzing the decision‑making behavior of intelligent agents, and examines its foundational meaning and historical context from a philosophical perspective. By analyzing how beliefs and values drive intentions and actions, we construct a two‑tier framework for style formation: the external interaction loop with the environment and the internal cognitive loop of deliberation. On this basis, we formalize style‑related characteristics and propose measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The study focuses on three core research directions: (1) Defining and measuring playstyle, proposing a general playstyle metric based on discretized state spaces, and extending it to quantify strategic diversity and competitive balance; (2) Expressing and generating playstyle, exploring how reinforcement learning and imitation learning can be used to train agents exhibiting specific stylistic tendencies, and introducing a novel approach for human‑like style learning and modeling; and (3) Practical applications, analyzing the potential of these techniques in domains such as game design and interactive entertainment. Finally, the dissertation outlines future extensions, including the role of style as a core element in building artificial general intelligence (AGI).

PaperID: 8953, https://arxiv.org/pdf/2508.19115.pdf

Authors: Joshua Lee, Ali Arastehfard, Weiran Liu, Xuegang Ban, Yuan Hong

Title: SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) Applications

Abstract:
Autonomous driving and V2X technologies have developed rapidly in the past decade, leading to improved safety and efficiency in modern transportation. These systems interact with extensive networks of vehicles, roadside infrastructure, and cloud resources to support their machine learning capabilities. However, the widespread use of machine learning in V2X systems raises issues over the privacy of the data involved. This is particularly concerning for smart‑transit and driver safety applications which can implicitly reveal user locations or explicitly disclose medical data such as EEG signals. To resolve these issues, we propose SecureV2X, a scalable, multi‑agent system for secure neural network inferences deployed between the server and each vehicle. Under this setting, we study two multi‑agent V2X applications: secure drowsiness detection, and secure red‑light violation detection. Our system achieves strong performance relative to baselines, and scales efficiently to support a large number of secure computation interactions simultaneously. For instance, SecureV2X is 9.4 × faster, requires 143× fewer computational rounds, and involves 16.6× less communication on drowsiness detection compared to other secure systems. Moreover, it achieves a runtime nearly 100× faster than state‑of‑the‑art benchmarks in object detection tasks for red light violation detection.

PaperID: 8954, https://arxiv.org/pdf/2508.19113.pdf

Authors: Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

Title: Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Abstract:
Large reasoning models (LRMs) combined with retrieval‑augmented generation (RAG) have enabled deep research agents capable of multi‑step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test‑time search scaling. Methods that extend reasoning through single‑query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS‑QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning‑query0retrieval trajectories containing parallel sub‑queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state‑of‑the‑art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test‑time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.

PaperID: 8955, https://arxiv.org/pdf/2508.19097.pdf

Authors: Armin Berger, Sarthak Khanna, David Berghaus, Rafet Sifa

Title: Reasoning LLMs in the Medical Domain: A Literature Survey

Abstract:
The emergence of advanced reasoning capabilities in Large Language Models (LLMs) marks a transformative development in healthcare applications. Beyond merely expanding functional capabilities, these reasoning mechanisms enhance decision transparency and explainability‑critical requirements in medical contexts. This survey examines the transformation of medical LLMs from basic information retrieval tools to sophisticated clinical reasoning systems capable of supporting complex healthcare decisions. We provide a thorough analysis of the enabling technological foundations, with a particular focus on specialized prompting techniques like Chain‑of‑Thought and recent breakthroughs in Reinforcement Learning exemplified by DeepSeek‑R1. Our investigation evaluates purpose‑built medical frameworks while also examining emerging paradigms such as multi‑agent collaborative systems and innovative prompting architectures. The survey critically assesses current evaluation methodologies for medical validation and addresses persistent challenges in field interpretation limitations, bias mitigation strategies, patient safety frameworks, and integration of multimodal clinical data. Through this survey, we seek to establish a roadmap for developing reliable LLMs that can serve as effective partners in clinical practice and medical research.

PaperID: 8956, https://arxiv.org/pdf/2508.19072.pdf

Authors: Sidahmed Benabderrahmane, Talal Rahwan

Title: Attackers Strike Back? Not Anymore -- An Ensemble of RL Defenders Awakens for APT Detection

Abstract:
Advanced Persistent Threats (APTs) represent a growing menace to modern digital infrastructure. Unlike traditional cyberattacks, APTs are stealthy, adaptive, and long‑lasting, often bypassing signature‑based detection systems. This paper introduces a novel framework for APT detection that unites deep learning, reinforcement learning (RL), and active learning into a cohesive, adaptive defense system. Our system combines auto‑encoders for latent behavioral encoding with a multi‑agent ensemble of RL‑based defenders, each trained to distinguish between benign and malicious process behaviors. We identify a critical challenge in existing detection systems: their static nature and inability to adapt to evolving attack strategies. To this end, our architecture includes multiple RL agents (Q‑Learning, PPO, DQN, adversarial defenders), each analyzing latent vectors generated by an auto‑encoder. When any agent is uncertain about its decision, the system triggers an active learning loop to simulate expert feedback, thus refining decision boundaries. An ensemble voting mechanism, weighted by each agent's performance, ensures robust final predictions.

PaperID: 8957, https://arxiv.org/pdf/2508.18929.pdf

Authors: Ilias Driouich, Hongliu Cao, Eoin Thomas

Title: Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework

Abstract:
Retrieval‑augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context‑aware responses. However, the effectiveness and trustworthiness of these systems critically depends on how they are evaluated, particularly on whether the evaluation process captures real‑world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi‑agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy Agent that detects and mask sensitive information across multiple domains and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain‑specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.

PaperID: 8958, https://arxiv.org/pdf/2508.18880.pdf

Authors: Eljas Linna, Tuula Linna

Title: Challenges for Generative AI in Legal Reasoning

Abstract:
Large Language Models (LLMs) are being integrated into professional domains, yet their limitations in such high‑stakes fields as law remain poorly understood. In response, this paper introduces examples of critical challenges to the functioning of generative and other forms of artificial intelligence (AI) as reliable reasoning tools in judicial decision‑making. The study deconstructs core requirements and challenges for AI, including the ability to select the correct legal framework across jurisdictions, generate sound arguments based on the doctrine of the sources of law, distinguish ratio decidendi and obiter dicta in case law, resolve ambiguity arising from general clauses like "reasonableness", manage conflicting legal provisions, and apply the burden of proof correctly. The paper maps various AI enhancement mechanisms, such as retrieval‑augmented generation (RAG), multi‑agent systems and neuro‑symbolic AI, to these challenges, assessing their potential to bridge the gap between the probabilistic nature of LLMs and the rigorous, choice‑driven demands of legal interpretation. Furthermore, the paper sketches a path towards an evaluation framework, proposing that legal requirements be organized into normative, doctrinal, evidential, and technical categories, and subsequently operationalized into domain‑specific, testable design obligations. The findings indicate that these techniques can address specific narrow challenges, but they fail to solve the more significant ones, particularly in tasks requiring discretion and transparent, justifiable reasoning. Therefore, we advocate for a staged adoption, first capturing efficiency in simple cases with technology already available today and sustaining long‑term investment in new methods that handle hierarchy, temporality, and other requirements of legally sound reasoning, thus enabling expansion to complex adjudication in the future.

PaperID: 8959, https://arxiv.org/pdf/2508.18797.pdf

Authors: Qi Chai, Zhang Zheng, Junlong Ren, Deheng Ye, Zichuan Lin, Hao Wang

Title: CausalMACE: Causality Empowered Multi-Agents in Minecraft Cooperative Tasks

Abstract:
Minecraft, as an open‑world virtual interactive environment, has become a prominent platform for research on agent decision‑making and execution. Existing works primarily adopt a single Large Language Model (LLM) agent to complete various in‑game tasks. However, for complex tasks requiring lengthy sequences of actions, single‑agent approaches often face challenges related to inefficiency and limited fault tolerance. Despite these issues, research on multi‑agent collaboration remains scarce. In this paper, we propose CausalMACE, a holistic causality planning framework designed to enhance multi‑agent systems, in which we incorporate causality to manage dependencies among subtasks. Technically, our proposed framework introduces two modules: an overarching task graph for global task planning and a causality‑based module for dependency management, where inherent rules are adopted to perform causal intervention. Experimental results demonstrate our approach achieves state‑of‑the‑art performance in multi‑agent cooperative tasks of Minecraft.

PaperID: 8960, https://arxiv.org/pdf/2508.18791.pdf

Authors: Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu

Title: LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

Abstract:
Despite the remarkable progress of modern machine translation (MT) systems on general‑domain texts, translating structured LaTeX‑formatted documents remains a significant challenge. These documents typically interleave natural language with domain‑specific syntax, such as mathematical equations, tables, figures, and cross‑references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi‑agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation‑friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context‑aware, self‑correcting, and terminology‑consistent translations; 3) a Generator that reconstructs the translated content into well‑structured LaTeX documents. Experimental results show that LaTeXTrans outperforms mainstream MT systems in both translation accuracy and structural preservation. The source code, the online demonstration platform, and a demo video are publicly available.

PaperID: 8961, https://arxiv.org/pdf/2508.18781.pdf

Authors: Lisai Zhang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Yuxin Hong, Zihao Zhang, Yanzhang Liang, Yudong Jiang

Title: AniME: Adaptive Multi-Agent Planning for Long Animation Generation

Abstract:
We present AniME, a director‑oriented multi‑agent system for automated long‑form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow, and coordinates several downstream specialized agents. By integrating customized Model Context Protocol (MCP) with downstream model instruction, the specialized agent adaptively selects control conditions for diverse sub‑tasks. AniME produces cinematic animation with consistent characters and synchronized audio visual elements, offering a scalable solution for AI‑driven anime creation.

PaperID: 8962, https://arxiv.org/pdf/2508.18765.pdf

Authors: Suyash Gaurav, Jukka Heikkonen, Jatin Chaudhary

Title: Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement

Abstract:
As AI systems evolve into distributed ecosystems with autonomous execution, asynchronous reasoning, and multi‑agent coordination, the absence of scalable, decoupled governance poses a structural risk. Existing oversight mechanisms are reactive, brittle, and embedded within agent architectures, making them non‑auditable and hard to generalize across heterogeneous deployments. We introduce Governance‑as‑a‑Service (GaaS): a modular, policy‑driven enforcement layer that regulates agent outputs at runtime without altering model internals or requiring agent cooperation. GaaS employs declarative rules and a Trust Factor mechanism that scores agents based on compliance and severity‑weighted violations. It enables coercive, normative, and adaptive interventions, supporting graduated enforcement and dynamic trust modulation. To evaluate GaaS, we conduct three simulation regimes with open‑source models (LLaMA3, Qwen3, DeepSeek‑R1) across content generation and financial decision‑making. In the baseline, agents act without governance; in the second, GaaS enforces policies; in the third, adversarial agents probe robustness. All actions are intercepted, evaluated, and logged for analysis. Results show that GaaS reliably blocks or redirects high‑risk behaviors while preserving throughput. Trust scores track rule adherence, isolating and penalizing untrustworthy components in multi‑agent systems. By positioning governance as a runtime service akin to compute or storage, GaaS establishes infrastructure‑level alignment for interoperable agent ecosystems. It does not teach agents ethics; it enforces them.

PaperID: 8963, https://arxiv.org/pdf/2508.18724.pdf

Authors: Karanbir Singh, Deepak Muppiri, William Ngu

Title: Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval

Abstract:
Large Language Models (LLMs) have transformed the field of artificial intelligence by unlocking the era of generative applications. Built on top of generative AI capabilities, Agentic AI represents a major shift toward autonomous, goal‑driven systems that can reason, retrieve, and act. However, they also inherit the bias present in both internal and external information sources. This significantly affects the fairness and balance of retrieved information, and hence reduces user trust. To address this critical challenge, we introduce a novel Bias Mitigation Agent, a multi‑agent system designed to orchestrate the workflow of bias mitigation through specialized agents that optimize the selection of sources to ensure that the retrieved content is both highly relevant and minimally biased to promote fair and balanced knowledge dissemination. The experimental results demonstrate an 81.82% reduction in bias compared to a baseline naive retrieval strategy.

PaperID: 8964, https://arxiv.org/pdf/2508.18708.pdf

Authors: Promise Osaine Ekpo, Brian La, Thomas Wiener, Saesha Agarwal, Arshia Agrawal, Gonzalo Gonzalez-Pumariega, Lekan P. Molu, Angelique Taylor

Title: Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare

Abstract:
Fairness in multi‑agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real‑world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill‑task alignment. Second, we introduce MARLHospital, a customizable healthcare‑inspired environment for modeling team compositions and energy‑constrained scheduling impacts on fairness, as no existing simulators are well‑suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state‑of‑the‑art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task‑skill mismatches and highlight the need for more robust metrics that capture skill‑task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi‑agent systems where aligning effort with expertise is critical.

PaperID: 8965, https://arxiv.org/pdf/2508.18705.pdf

Authors: Santosh Thoduka, Sebastian Houben, Juergen Gall, Paul G. Plöger

Title: Enhancing Video-Based Robot Failure Detection Using Task Knowledge

Abstract:
Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real‑world scenarios. In this paper, we propose a video‑based failure detection approach that uses spatio‑temporal knowledge in the form of the actions the robot performs and task‑relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task‑relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test‑time augmentation. The results emphasize the importance of spatio‑temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.

PaperID: 8966, https://arxiv.org/pdf/2508.18683.pdf

Authors: Saswata Jana, Giuseppe F. Italiano, Partha Sarathi Mandal

Title: Graph Traversal via Connected Mobile Agents

Abstract:
This paper considers the Hamiltonian walk problem in the multi‑agent coordination framework, referred to as k‑agents Hamiltonian walk problem (k‑HWP). In this problem, a set of k connected agents collectively compute a spanning walk of a given undirected graph in the minimum steps. At each step, the agents are at k distinct vertices and the induced subgraph made by the occupied vertices remains connected. In the next consecutive steps, each agent may remain stationary or move to one of its neighbours.To the best of our knowledge, this problem has not been previously explored in the context of multi‑agent systems with connectivity. As a generalization of the well‑known Hamiltonian walk problem (when k=1), k‑HWP is NP‑hard. We propose a (3‑\frac121)‑approximation algorithm for 2‑HWP on arbitrary graphs. For the tree, we define a restricted version of the problem and present an optimal algorithm for arbitrary values of k. Finally, we formalize the problem for k‑uniform hypergraphs and present a 2(1+\ln k)‑approximation algorithm. This result is also adapted to design an approximation algorithm for k‑HWP on general graphs when k = O(1).

PaperID: 8967, https://arxiv.org/pdf/2508.18676.pdf

Authors: Chufan Gao, Jintai Chen, Jimeng Sun

Title: Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding

Abstract:
Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, Large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) Training‑free prompting LLM agents using chain‑of‑thought (CoT). Finetuning offers dataset‑specific learning at the cost of generalizability. Training‑free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting‑based reasoning approach, Learn then Retrieve: LRTab, which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to avoid the error, learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions for additional context for table understanding. We provide comprehensive experiments on WikiTQ and Tabfact, showing that LRTab is interpretable, cost‑efficient, and can outperform previous baselines in tabular reasoning.

PaperID: 8968, https://arxiv.org/pdf/2508.18675.pdf

Authors: Xu Lu, Weisong Sun, Yiran Zhang, Ming Hu, Cong Tian, Zhi Jin, Yang Liu

Title: Requirements Development and Formalization for Reliable Code Generation: A Multi-Agent Vision

Abstract:
Automated code generation has long been considered the holy grail of software engineering. The emergence of Large Language Models (LLMs) has catalyzed a revolutionary breakthrough in this area. However, existing methods that only rely on LLMs remain inadequate in the quality of generated code, offering no guarantees of satisfying practical requirements. They lack a systematic strategy for requirements development and modeling. Recently, LLM‑based agents typically possess powerful abilities and play an essential role in facilitating the alignment of LLM outputs with user requirements. In this paper, we envision the first multi‑agent framework for reliable code generation based on \textscrequirements \textscdevelopment and \textscformalization, named \textscReDeFo. This framework incorporates three agents, highlighting their augmentation with knowledge and techniques of formal methods, into the requirements‑to‑code generation pipeline to strengthen quality assurance. The core of \textscReDeFo is the use of formal specifications to bridge the gap between potentially ambiguous natural language requirements and precise executable code. \textscReDeFo enables rigorous reasoning about correctness, uncovering hidden bugs, and enforcing critical properties throughout the development process. In general, our framework aims to take a promising step toward realizing the long‑standing vision of reliable, auto‑generated software.

PaperID: 8969, https://arxiv.org/pdf/2508.18610.pdf

Authors: Shrenik Jadhav, Birva Sevak, Srijita Das, Akhtar Hussain, Wencong Su, Van-Hai Bui

Title: Scalable Fairness Shaping with LLM-Guided Multi-Agent Reinforcement Learning for Peer-to-Peer Electricity Markets

Abstract:
Peer‑to‑peer (P2P) energy trading is becoming central to modern distribution systems as rooftop PV and home energy management systems become pervasive, yet most existing market and reinforcement learning designs emphasize efficiency or private profit and offer little real‑time guidance to ensure equitable outcomes under uncertainty. To address this gap, a fairness‑aware multiagent reinforcement learning framework, FairMarket‑RL, is proposed in which a large language model (LLM) critic shapes bidding policies within a continuous double auction under partial observability and discrete price‑quantity actions. After each trading slot, the LLM returns normalized fairness scores Fairness‑to‑Grid (FTG), Fairness‑Between‑Sellers (FBS), and Fairness‑of‑Pricing (FPP) that are integrated into the reward via ramped coefficients and tunable scaling, so that fairness guidance complements, rather than overwhelms, economic incentives. The environment models realistic residential load and PV profiles and enforce hard constraints on prices, physical feasibility, and policy‑update stability. Across a progression of experiments from a small pilot to a larger simulated community and a mixed‑asset real‑world dataset, the framework shifts exchanges toward local P2P trades, lowers consumer costs relative to grid‑only procurement, sustains strong fairness across participants, and preserves utility viability. Sensitivity analyses over solar availability and aggregate demand further indicate robust performance, suggesting a scalable, LLM‑guided pathway to decentralized electricity markets that are economically efficient, socially equitable, and technically sound.

PaperID: 8970, https://arxiv.org/pdf/2508.18600.pdf

Authors: Ayato Kitadai, Yusuke Fukasawa, Nariaki Nishino

Title: Bias-Adjusted LLM Agents for Human-Like Decision-Making via Behavioral Economics

Abstract:
Large language models (LLMs) are increasingly used to simulate human decision‑making, but their intrinsic biases often diverge from real human behavior‑‑limiting their ability to reflect population‑level diversity. We address this challenge with a persona‑based approach that leverages individual‑level behavioral data from behavioral economics to adjust model biases. Applying this method to the ultimatum game‑‑a standard but difficult benchmark for LLMs‑‑we observe improved alignment between simulated and empirical behavior, particularly on the responder side. While further refinement of trait representations is needed, our results demonstrate the promise of persona‑conditioned LLMs for simulating human‑like decision patterns at scale.

PaperID: 8971, https://arxiv.org/pdf/2508.18467.pdf

Authors: Olivia Long, Carter Teplica

Title: The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game

Abstract:
As AI agents become increasingly capable of tool use and long‑horizon tasks, they have begun to be deployed in settings where multiple agents can interact. However, whereas prior work has mostly focused on human‑AI interactions, there is an increasing need to understand AI‑AI interactions. In this paper, we adapt the iterated public goods game, a classic behavioral economics game, to analyze the behavior of four reasoning and non‑reasoning models across two conditions: models are either told they are playing against "another AI agent" or told their opponents are themselves. We find that, across different settings, telling LLMs that they are playing against themselves significantly changes their tendency to cooperate. While our study is conducted in a toy environment, our results may provide insights into multi‑agent settings where agents "unconsciously" discriminating against each other could inexplicably increase or decrease cooperation.

PaperID: 8972, https://arxiv.org/pdf/2508.18406.pdf

Authors: Ryan Hare, Ying Tang

Title: Toward Generalized Autonomous Agents: A Neuro-Symbolic AI Framework for Integrating Social and Technical Support in Education

Abstract:
One of the enduring challenges in education is how to empower students to take ownership of their learning by setting meaningful goals, tracking their progress, and adapting their strategies when faced with setbacks. Research has shown that this form of leaner‑centered learning is best cultivated through structured, supportive environments that promote guided practice, scaffolded inquiry, and collaborative dialogue. In response, educational efforts have increasingly embraced artificial‑intelligence (AI)‑powered digital learning environments, ranging from educational apps and virtual labs to serious games. Recent advances in large language models (LLMs) and neuro‑symbolic systems, meanwhile, offer a transformative opportunity to reimagine how support is delivered in digital learning environments. LLMs are enabling socially interactive learning experiences and scalable, cross‑domain learning support that can adapt instructional strategies across varied subjects and contexts. In parallel, neuro‑symbolic AI provides new avenues for designing these agents that are not only adaptive but also scalable across domains. Based on these remarks, this paper presents a multi‑agent, neuro‑symbolic framework designed to resolve the aforementioned challenges. The framework assigns distinct pedagogical roles to specialized agents: an RL‑based 'tutor' agent provides authoritative, non‑verbal scaffolding, while a proactive, LLM‑powered 'peer' agent facilitates the social dimensions of learning. While prior work has explored such agents in isolation, our framework's novelty lies in unifying them through a central educational ontology. Through case studies in both college‑level and middle school settings, we demonstrate the framework's adaptability across domains. We conclude by outlining key insights and future directions for advancing AI‑driven learning environments.

PaperID: 8973, https://arxiv.org/pdf/2508.18321.pdf

Authors: Maojia Song, Tej Deep Pala, Ruiwen Zhou, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria

Title: LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Abstract:
Large language models (LLMs) are increasingly integrated into multi‑agent systems (MAS), where peer interactions shape individual decisions. While prior work has mainly examined conformity bias, we broaden the view to include how LLMs build rapport from prior interactions, discern and integrate high‑quality peer information, and resist misleading inputs‑abilities essential for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that simulates quiz‑style collaboration with peer agents whose rapport levels and behaviours can be precisely controlled in both historical interactions and the current round. This unified setup enables systematic analysis of how rapport, peer actions, and the model's self‑confidence jointly influence decision‑making. Using KAIROS, we evaluate prompting, supervised fine‑tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale is a primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting‑based mitigation, whereas smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.

PaperID: 8974, https://arxiv.org/pdf/2508.18292.pdf

Authors: Saksham Arora

Title: Consensus Is All You Need: Gossip-Based Reasoning Among Large Language Models

Abstract:
Large language models have advanced rapidly, but no single model excels in every area ‑‑ each has its strengths and weaknesses. Instead of relying on one model alone, we take inspiration from gossip protocols in distributed systems, where information is exchanged with peers until they all come to an agreement. In this setup, models exchange answers and gradually work toward a shared solution. Each LLM acts as a node in a peer‑to‑peer network, sharing responses and thought processes to reach a collective decision. Our results show that this "gossip‑based consensus" leads to robust, resilient, and accurate multi‑agent AI reasoning. It helps overcome the weaknesses of individual models and brings out their collective strengths. This approach is similar to how humans build consensus, making AI seem more collaborative and trustworthy instead of just a black‑box program.

PaperID: 8975, https://arxiv.org/pdf/2508.18260.pdf

Authors: Kaiwen Wei, Rui Shan, Dongsheng Zou, Jianzhong Yang, Bi Zhao, Junnan Zhu, Jiang Zhong

Title: MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains

Abstract:
Large reasoning models (LRMs) have shown significant progress in test‑time scaling through chain‑of‑thought prompting. Current approaches like search‑o1 integrate retrieval augmented generation (RAG) into multi‑step reasoning processes but rely on a single, linear reasoning chain while incorporating unstructured textual information in a flat, context‑agnostic manner. As a result, these approaches can lead to error accumulation throughout the reasoning chain, which significantly limits its effectiveness in medical question‑answering (QA) tasks where both accuracy and traceability are critical requirements. To address these challenges, we propose MIRAGE (Multi‑chain Inference with Retrieval‑Augmented Graph Exploration), a novel test‑time scalable reasoning framework that performs dynamic multi‑chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity‑grounded sub‑questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi‑hop traversal, and 4) integrates answers using cross‑chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT‑5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT‑4o, Tree‑of‑Thought variants, and other retrieval‑augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well‑suited for complex medical reasoning scenarios. The code will be available for further research.

PaperID: 8976, https://arxiv.org/pdf/2508.18177.pdf

Authors: Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren

Title: Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

Abstract:
Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision‑Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation framework: a cross‑modal differentiated quantization framework for VLMs and a scene‑aware vectorized memory multi‑agent system. The quantization framework implements differentiated strategies, reducing memory from 38GB to 11.3GB. The multi‑agent system uses vectorized memory and perception‑memory‑reasoning workflows to provide environmental information beyond the current view, achieving 2.83‑3.52s latency to initial speech output. Experiments show the quantized 19B‑parameter model only experiences a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR‑VQA (original: 64.9), outperforming smaller models with equivalent memory. This research advances computational efficiency and assistive technology, offering comprehensive assistance in scene perception, text recognition, and navigation.

PaperID: 8977, https://arxiv.org/pdf/2508.18113.pdf

Authors: Farkhad Akimov, Munachiso Samuel Nwadike, Zangir Iklassov, Martin Takáč

Title: The AI Data Scientist

Abstract:
Imagine decision‑makers uploading data and, within minutes, receiving clear, actionable insights delivered straight to their fingertips. That is the promise of the AI Data Scientist, an autonomous Agent powered by large language models (LLMs) that closes the gap between evidence and action. Rather than simply writing code or responding to prompts, it reasons through questions, tests ideas, and delivers end‑to‑end insights at a pace far beyond traditional workflows. Guided by the scientific tenet of the hypothesis, this Agent uncovers explanatory patterns in data, evaluates their statistical significance, and uses them to inform predictive modeling. It then translates these results into recommendations that are both rigorous and accessible. At the core of the AI Data Scientist is a team of specialized LLM Subagents, each responsible for a distinct task such as data cleaning, statistical testing, validation, and plain‑language communication. These Subagents write their own code, reason about causality, and identify when additional data is needed to support sound conclusions. Together, they achieve in minutes what might otherwise take days or weeks, enabling a new kind of interaction that makes deep data science both accessible and actionable.

PaperID: 8978, https://arxiv.org/pdf/2508.18108.pdf

Authors: Xilai Xu, Zilin Zhao, Chengye Song, Zining Wang, Jinhe Qiang, Jiongrui Yan, Yuhuai Lin

Title: SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media

Abstract:
With the increasing prevalence of multimodal content on social media, sentiment analysis faces significant challenges in effectively processing heterogeneous data and recognizing multi‑label emotions. Existing methods often lack effective cross‑modal fusion and external knowledge integration. We propose SentiMM, a novel multi‑agent framework designed to systematically address these challenges. SentiMM processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. We also introduce SentiMMD, a large‑scale multimodal dataset with seven fine‑grained sentiment categories. Extensive experiments demonstrate that SentiMM achieves superior performance compared to state‑of‑the‑art baselines, validating the effectiveness of our structured approach.

PaperID: 8979, https://arxiv.org/pdf/2508.18098.pdf

Authors: Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg

Title: Detecting and Characterizing Planning in Language Models

Abstract:
Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi‑step reasoning tasks. Recent work suggests that LLMs may perform planning ‑ selecting a future target token in advance and generating intermediate tokens that lead towards it ‑ rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi‑automated annotation pipeline. We apply this pipeline to both base and instruction‑tuned Gemma‑2‑2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma‑2‑2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.

PaperID: 8980, https://arxiv.org/pdf/2508.18091.pdf

Authors: Mohammad J. Abdel-Rahman, Yasmeen Alslman, Dania Refai, Amro Saleh, Malik A. Abu Loha, Mohammad Yahya Hamed

Title: Teaching LLMs to Think Mathematically: A Critical Study of Decision-Making via Optimization

Abstract:
This paper investigates the capabilities of large language models (LLMs) in formulating and solving decision‑making problems using mathematical programming. We first conduct a systematic review and meta‑analysis of recent literature to assess how well LLMs understand, structure, and solve optimization problems across domains. The analysis is guided by critical review questions focusing on learning approaches, dataset designs, evaluation metrics, and prompting strategies. Our systematic evidence is complemented by targeted experiments designed to evaluate the performance of state‑of‑the‑art LLMs in automatically generating optimization models for problems in computer networks. Using a newly constructed dataset, we apply three prompting strategies: Act‑as‑expert, chain‑of‑thought, and self‑consistency, and evaluate the obtained outputs based on optimality gap, token‑level F1 score, and compilation accuracy. Results show promising progress in LLMs' ability to parse natural language and represent symbolic formulations, but also reveal key limitations in accuracy, scalability, and interpretability. These empirical gaps motivate several future research directions, including structured datasets, domain‑specific fine‑tuning, hybrid neuro‑symbolic approaches, modular multi‑agent architectures, and dynamic retrieval via chain‑of‑RAGs. This paper contributes a structured roadmap for advancing LLM capabilities in mathematical programming.

PaperID: 8981, https://arxiv.org/pdf/2508.17971.pdf

Authors: Pu Feng, Size Wang, Yuhong Cao, Junkang Liang, Rongye Shi, Wenjun Wu

Title: Neural Algorithmic Reasoners informed Large Language Model for Multi-Agent Path Finding

Abstract:
The development and application of large language models (LLM) have demonstrated that foundational models can be utilized to solve a wide array of tasks. However, their performance in multi‑agent path finding (MAPF) tasks has been less than satisfactory, with only a few studies exploring this area. MAPF is a complex problem requiring both planning and multi‑agent coordination. To improve the performance of LLM in MAPF tasks, we propose a novel framework, LLM‑NAR, which leverages neural algorithmic reasoners (NAR) to inform LLM for MAPF. LLM‑NAR consists of three key components: an LLM for MAPF, a pre‑trained graph neural network‑based NAR, and a cross‑attention mechanism. This is the first work to propose using a neural algorithmic reasoner to integrate GNNs with the map information for MAPF, thereby guiding LLM to achieve superior performance. LLM‑NAR can be easily adapted to various LLM models. Both simulation and real‑world experiments demonstrate that our method significantly outperforms existing LLM‑based approaches in solving MAPF problems.

PaperID: 8982, https://arxiv.org/pdf/2508.17969.pdf

Authors: Alexandros Gkillas, Christos Anagnostopoulos, Nikos Piperigkos, Dimitris Tsiktsiris, Theofilos Christodoulou, Theofanis Siamatras, Dimitrios Triantafyllou, Christos Basdekis, Theoktisti Marinopoulou, Panagiotis Lepentsiotis, Elefterios Blitsis, Aggeliki Zacharaki, Nearchos Stylianidis, Leonidas Katelaris, Lamberto Salvan, Aris S. Lalos, Christos Laoudias, Antonios Lalas, Konstantinos Votis

Title: A holistic perception system of internal and external monitoring for ground autonomous vehicles: AutoTRUST paradigm

Abstract:
This paper introduces a holistic perception system for internal and external monitoring of autonomous vehicles, with the aim of demonstrating a novel AI‑leveraged self‑adaptive framework of advanced vehicle technologies and solutions that optimize perception and experience on‑board. Internal monitoring system relies on a multi‑camera setup designed for predicting and identifying driver and occupant behavior through facial recognition, exploiting in addition a large language model as virtual assistant. Moreover, the in‑cabin monitoring system includes AI‑empowered smart sensors that measure air‑quality and perform thermal comfort analysis for efficient on and off‑boarding. On the other hand, external monitoring system perceives the surrounding environment of vehicle, through a LiDAR‑based cost‑efficient semantic segmentation approach, that performs highly accurate and efficient super‑resolution on low‑quality raw 3D point clouds. The holistic perception framework is developed in the context of EU's Horizon Europe programm AutoTRUST, and has been integrated and deployed on a real electric vehicle provided by ALKE. Experimental validation and evaluation at the integration site of Joint Research Centre at Ispra, Italy, highlights increased performance and efficiency of the modular blocks of the proposed perception architecture.

PaperID: 8983, https://arxiv.org/pdf/2508.17803.pdf

Authors: Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin

Title: DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Abstract:
Reasoning large language models (RLLMs), such as OpenAI‑O3 and DeepSeek‑R1, have recently demonstrated remarkable capabilities by performing structured and multi‑step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource‑efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single‑question inference. Specifically, DRQA leverages batch‑generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine‑grained control of reasoning behaviors.

PaperID: 8984, https://arxiv.org/pdf/2508.17720.pdf

Authors: Ziqi Guan, Xin Yin, Zhiyuan Peng, Chao Ni

Title: RepoTransAgent: Multi-Agent LLM Framework for Repository-Aware Code Translation

Abstract:
Repository‑aware code translation is critical for modernizing legacy systems, enhancing maintainability, and enabling interoperability across diverse programming languages. While recent advances in large language models (LLMs) have improved code translation quality, existing approaches face significant challenges in practical scenarios: insufficient contextual understanding, inflexible prompt designs, and inadequate error correction mechanisms. These limitations severely hinder accurate and efficient translation of complex, real‑world code repositories. To address these challenges, we propose RepoTransAgent, a novel multi‑agent LLM framework for repository‑aware code translation. RepoTransAgent systematically decomposes the translation process into specialized subtasks‑context retrieval, dynamic prompt construction, and iterative code refinement‑each handled by dedicated agents. Our approach leverages retrieval‑augmented generation (RAG) for contextual information gathering, employs adaptive prompts tailored to varying repository scenarios, and introduces a reflection‑based mechanism for systematic error correction. We evaluate RepoTransAgent on hundreds of Java‑C# translation pairs from six popular open‑source projects. Experimental results demonstrate that RepoTransAgent significantly outperforms state‑of‑the‑art baselines in both compile and pass rates. Specifically, RepoTransAgent achieves up to 55.34% compile rate and 45.84% pass rate. Comprehensive analysis confirms the robustness and generalizability of RepoTransAgent across different LLMs, establishing its effectiveness for real‑world repository‑aware code translation.

PaperID: 8985, https://arxiv.org/pdf/2508.17696.pdf

Authors: Woojun Kim, Katia Sycara

Title: Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment

Abstract:
Multi‑agent reinforcement learning in mixed‑motive settings presents a fundamental challenge: agents must balance individual interests with collective goals, which are neither fully aligned nor strictly opposed. To address this, reward restructuring methods such as gifting and intrinsic motivation have been proposed. However, these approaches primarily focus on promoting cooperation by managing the trade‑off between individual and collective returns, without explicitly addressing fairness with respect to the agents' task‑specific rewards. In this paper, we propose an adaptive conflict‑aware gradient adjustment method that promotes cooperation while ensuring fairness in individual rewards. The proposed method dynamically balances policy gradients derived from individual and collective objectives in situations where the two objectives are in conflict. By explicitly resolving such conflicts, our method improves collective performance while preserving fairness across agents. We provide theoretical results that guarantee monotonic non‑decreasing improvement in both the collective and individual objectives and ensure fairness. Empirical results in sequential social dilemma environments demonstrate that our approach outperforms baselines in terms of social welfare while ensuring fairness among agents.

PaperID: 8986, https://arxiv.org/pdf/2508.17692.pdf

Authors: Bingxi Zhao, Lin Geng Foo, Ping Hu, Christian Theobalt, Hossein Rahmani, Jun Liu

Title: LLM-based Agentic Reasoning Frameworks: A Survey from Methods to Scenarios

Abstract:
Recent advances in the intrinsic reasoning capabilities of large language models (LLMs) have given rise to LLM‑based agent systems that exhibit near‑human performance on a variety of automated tasks. However, although these systems share similarities in terms of their use of LLMs, different reasoning frameworks of the agent system steer and organize the reasoning process in different ways. In this survey, we propose a systematic taxonomy that decomposes agentic reasoning frameworks and analyze how these frameworks dominate framework‑level reasoning by comparing their applications across different scenarios. Specifically, we propose an unified formal language to further classify agentic reasoning systems into single‑agent methods, tool‑based methods, and multi‑agent methods. After that, we provide a comprehensive review of their key application scenarios in scientific discovery, healthcare, software engineering, social simulation, and economics. We also analyze the characteristic features of each framework and summarize different evaluation strategies. Our survey aims to provide the research community with a panoramic view to facilitate understanding of the strengths, suitable scenarios, and evaluation practices of different agentic reasoning frameworks.

PaperID: 8987, https://arxiv.org/pdf/2508.17671.pdf

Authors: Sam Ganzfried

Title: Consistent Opponent Modeling of Static Opponents in Imperfect-Information Games

Abstract:
The goal of agents in multi‑agent environments is to maximize total reward against the opposing agents that are encountered. Following a game‑theoretic solution concept, such as Nash equilibrium, may obtain a strong performance in some settings; however, such approaches fail to capitalize on historical and observed data from repeated interactions against our opponents. Opponent modeling algorithms integrate machine learning techniques to exploit suboptimal opponents utilizing available data; however, the effectiveness of such approaches in imperfect‑information games to date is quite limited. We show that existing opponent modeling approaches fail to satisfy a simple desirable property even against static opponents drawn from a known prior distribution; namely, they do not guarantee that the model approaches the opponent's true strategy even in the limit as the number of game iterations approaches infinity. We develop a new algorithm that is able to achieve this property and runs efficiently by solving a convex minimization problem based on the sequence‑form game representation using projected gradient descent. The algorithm is guaranteed to efficiently converge to the opponent's true strategy given observations from gameplay and possibly additional historical data if it is available.

PaperID: 8988, https://arxiv.org/pdf/2508.17597.pdf

Authors: Jaewook Lee, Davin Win Kyi, Leejun Kim, Jenny Peng, Gagyeom Lim, Jeremy Zhengqi Huang, Dhruv Jain, Jon E. Froehlich

Title: SonoCraftAR: Towards Supporting Personalized Authoring of Sound-Reactive AR Interfaces by Deaf and Hard of Hearing Users

Abstract:
Augmented reality (AR) has shown promise for supporting Deaf and hard‑of‑hearing (DHH) individuals by captioning speech and visualizing environmental sounds, yet existing systems do not allow users to create personalized sound visualizations. We present SonoCraftAR, a proof‑of‑concept prototype that empowers DHH users to author custom sound‑reactive AR interfaces using typed natural language input. SonoCraftAR integrates real‑time audio signal processing with a multi‑agent LLM pipeline that procedurally generates animated 2D interfaces via a vector graphics library. The system extracts the dominant frequency of incoming audio and maps it to visual properties such as size and color, making the visualizations respond dynamically to sound. This early exploration demonstrates the feasibility of open‑ended sound‑reactive AR interface authoring and discusses future opportunities for personalized, AI‑assisted tools to improve sound accessibility.

PaperID: 8989, https://arxiv.org/pdf/2508.17590.pdf

Authors: Zui Chen, Han Li, Xinhao Zhang, Xiaoyu Chen, Chunyin Dong, Yifeng Wang, Xin Cai, Su Zhang, Ziqi Li, Chi Ding, Jinxu Li, Shuai Wang, Dousheng Zhao, Sanhai Gao, Guangyi Liu

Title: RubikSQL: Lifelong Learning Agentic Knowledge Base as an Industrial NL2SQL System

Abstract:
We present RubikSQL, a novel NL2SQL system designed to address key challenges in real‑world enterprise‑level NL2SQL, such as implicit intents and domain‑specific terminology. RubikSQL frames NL2SQL as a lifelong learning task, demanding both Knowledge Base (KB) maintenance and SQL generation. RubikSQL systematically builds and refines its KB through techniques including database profiling, structured information extraction, agentic rule mining, and Chain‑of‑Thought (CoT)‑enhanced SQL profiling. RubikSQL then employs a multi‑agent workflow to leverage this curated KB, generating accurate SQLs. RubikSQL achieves SOTA performance on both the KaggleDBQA and BIRD Mini‑Dev datasets. Finally, we release the RubikBench benchmark, a new benchmark specifically designed to capture vital traits of industrial NL2SQL scenarios, providing a valuable resource for future research.

PaperID: 8990, https://arxiv.org/pdf/2508.17565.pdf

Authors: Feng Tian, Flora D. Salim, Hao Xue

Title: TradingGroup: A Multi-Agent Trading System with Self-Reflection and Data-Synthesis

Abstract:
Recent advancements in large language models (LLMs) have enabled powerful agent‑based applications in finance, particularly for sentiment analysis, financial report comprehension, and stock forecasting. However, existing systems often lack inter‑agent coordination, structured self‑reflection, and access to high‑quality, domain‑specific post‑training data such as data from trading activities including both market conditions and agent decisions. These data are crucial for agents to understand the market dynamics, improve the quality of decision‑making and promote effective coordination. We introduce TradingGroup, a multi‑agent trading system designed to address these limitations through a self‑reflective architecture and an end‑to‑end data‑synthesis pipeline. TradingGroup consists of specialized agents for news sentiment analysis, financial report interpretation, stock trend forecasting, trading style adaptation, and a trading decision making agent that merges all signals and style preferences to produce buy, sell or hold decisions. Specifically, we design self‑reflection mechanisms for the stock forecasting, style, and decision‑making agents to distill past successes and failures for similar reasoning in analogous future scenarios and a dynamic risk‑management model to offer configurable dynamic stop‑loss and take‑profit mechanisms. In addition, TradingGroup embeds an automated data‑synthesis and annotation pipeline that generates high‑quality post‑training data for further improving the agent performance through post‑training. Our backtesting experiments across five real‑world stock datasets demonstrate TradingGroup's superior performance over rule‑based, machine learning, reinforcement learning, and existing LLM‑based trading strategies.

PaperID: 8991, https://arxiv.org/pdf/2508.17473.pdf

Authors: Akhil B Krishna, Farshad Khorrami, Anthony Tzes

Title: A Consensus Algorithm for Second-Order Systems Evolving on Lie Groups

Abstract:
In this paper, a consensus algorithm is proposed for interacting multi‑agents, which can be modeled as simple Mechanical Control Systems (MCS) evolving on a general Lie group. The standard Laplacian flow consensus algorithm for double integrator systems evolving on Euclidean spaces is extended to a general Lie group. A tracking error function is defined on a general smooth manifold for measuring the error between the configurations of two interacting agents. The stability of the desired consensus equilibrium is proved using a generalized version of Lyapunov theory and LaSalle's invariance principle applicable for systems evolving on a smooth manifold. The proposed consensus control input requires only the configuration information of the neighboring agents and does not require their velocities and inertia tensors. The design of tracking error function and consensus control inputs are demonstrated through an application of attitude consensus problem for multiple communicating rigid bodies. The consensus algorithm is numerically validated by demonstrating the attitude consensus problem.

PaperID: 8992, https://arxiv.org/pdf/2508.17435.pdf

Authors: Zihan Liang, Jiahao Sun, Haoran Ma

Title: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Abstract:
Despite the remarkable capabilities of text‑to‑image (T2I) generation models, real‑world applications often demand fine‑grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit‑Agent, a novel, training‑free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context‑aware image editing. RefineEdit‑Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision‑Language Large Models (LVLMs) within a closed‑loop system. Our framework comprises an LVLM‑driven instruction parser and scene understanding module, a multi‑level LLM‑driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM‑driven feedback and evaluation loop. To rigorously evaluate RefineEdit‑Agent, we propose LongBench‑T2I‑Edit, a new benchmark featuring 500 initial images with complex, multi‑turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit‑Agent significantly outperforms state‑of‑the‑art baselines, achieving an average score of 3.67 on LongBench‑T2I‑Edit, compared to 2.29 for Direct Re‑Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN‑based Edit, and 3.39 for ControlNet‑XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.

PaperID: 8993, https://arxiv.org/pdf/2508.17343.pdf

Authors: Abhik Roychoudhury

Title: Agentic AI for Software: thoughts from Software Engineering community

Abstract:
AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro‑decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI‑based software workflows will be to resolve the core difficulty in software engineering ‑ the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code‑bases. Thus to deal with this explosion, an emerging direction is AI‑based verification and validation (V & V) of AI generated code. We posit that agentic software workflows in future will include such AIbased V&V.

PaperID: 8994, https://arxiv.org/pdf/2508.17341.pdf

Authors: Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas

Title: MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems

Abstract:
The rapid expansion of immersive Metaverse applications introduces complex challenges at the intersection of performance, privacy, and environmental sustainability. Centralized architectures fall short in addressing these demands, often resulting in elevated energy consumption, latency, and privacy concerns. This paper proposes MetaFed, a decentralized federated learning (FL) framework that enables sustainable and intelligent resource orchestration for Metaverse environments. MetaFed integrates (i) multi‑agent reinforcement learning for dynamic client selection, (ii) privacy‑preserving FL using homomorphic encryption, and (iii) carbon‑aware scheduling aligned with renewable energy availability. Evaluations on MNIST and CIFAR‑10 using lightweight ResNet architectures demonstrate that MetaFed achieves up to 25% reduction in carbon emissions compared to conventional approaches, while maintaining high accuracy and minimal communication overhead. These results highlight MetaFed as a scalable solution for building environmentally responsible and privacy‑compliant Metaverse infrastructures.

PaperID: 8995, https://arxiv.org/pdf/2508.17310.pdf

Authors: Yuanchun Wang, Yiyang Fu, Jifan Yu, Daniel Zhang-Li, Zheyuan Zhang, Joy Lim Jia Yin, Yucheng Wang, Peng Zhou, Jing Zhang, Huiqin Liu

Title: Handling Students Dropouts in an LLM-driven Interactive Online Course Using Language Models

Abstract:
Interactive online learning environments, represented by Massive AI‑empowered Courses (MAIC), leverage LLM‑driven multi‑agent systems to transform passive MOOCs into dynamic, text‑based platforms, enhancing interactivity through LLMs. This paper conducts an empirical study on a specific MAIC course to explore three research questions about dropouts in these interactive online courses: (1) What factors might lead to dropouts? (2) Can we predict dropouts? (3) Can we reduce dropouts? We analyze interaction logs to define dropouts and identify contributing factors. Our findings reveal strong links between dropout behaviors and textual interaction patterns. We then propose a course‑progress‑adaptive dropout prediction framework (CPADP) to predict dropouts with at most 95.4% accuracy. Based on this, we design a personalized email recall agent to re‑engage at‑risk students. Applied in the deployed MAIC system with over 3,000 students, the feasibility and effectiveness of our approach have been validated on students with diverse backgrounds.

PaperID: 8996, https://arxiv.org/pdf/2508.17281.pdf

Authors: Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

Title: From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users

Abstract:
The pursuit of human‑level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision‑making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. We only used the papers published between 2023 and 2025 in conferences of the A and A rank and Q1 journals. A structured analysis of the LLM agents' architectural design principles, dividing their applications into single‑agent and multi‑agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLM, including reasoning, planning, and memory, and the impact of prompting methods and fine‑tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM‑based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self‑improvement, and the personalization of LLM‑based agents. Finally, we have discussed ten future research directions to overcome these gaps.

PaperID: 8997, https://arxiv.org/pdf/2508.17205.pdf

Authors: Yunxiang Yang, Ningning Xu, Jidong J. Yang

Title: Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding

Abstract:
This paper introduces a multi‑agent framework for comprehensive highway scene understanding, designed around a mixture‑of‑experts strategy. In this framework, a large generic vision‑language model (VLM), such as GPT‑4o, is contextualized with domain knowledge to generates task‑specific chain‑of‑thought (CoT) prompts. These fine‑grained prompts are then used to guide a smaller, efficient VLM (e.g., Qwen2.5‑VL‑7B) in reasoning over short videos, along with complementary modalities as applicable. The framework simultaneously addresses multiple critical perception tasks, including weather classification, pavement wetness assessment, and traffic congestion detection, achieving robust multi‑task reasoning while balancing accuracy and computational efficiency. To support empirical validation, we curated three specialized datasets aligned with these tasks. Notably, the pavement wetness dataset is multimodal, combining video streams with road weather sensor data, highlighting the benefits of multimodal reasoning. Experimental results demonstrate consistently strong performance across diverse traffic and environmental conditions. From a deployment perspective, the framework can be readily integrated with existing traffic camera systems and strategically applied to high‑risk rural locations, such as sharp curves, flood‑prone lowlands, or icy bridges. By continuously monitoring the targeted sites, the system enhances situational awareness and delivers timely alerts, even in resource‑constrained environments.

PaperID: 8998, https://arxiv.org/pdf/2508.17200.pdf

Authors: Amirreza Talebi

Title: Large Language Model-Based Automatic Formulation for Stochastic Optimization Models

Abstract:
This paper presents an integrated systematic study of the performance of large language models (LLMs), specifically ChatGPT, for automatically formulating and solving Stochastic Optimization (SO) problems from natural language descriptions. Focusing on three key categories, individual chance‑constrained models, joint chance‑constrained models, and two‑stage stochastic mixed‑integer linear programming models, we design several prompts that guide ChatGPT through structured tasks using chain‑of‑thought and agentic reasoning. We introduce a novel soft‑scoring metric that evaluates the structural quality and partial correctness of generated models, addressing the limitations of canonical and execution‑based accuracy metrics. Across a diverse set of SO problems, GPT‑4‑Turbo achieves better partial scores than GPT‑3.5 variants except for individual chance‑constrained problems. Structured prompts significantly outperform simple prompting, reducing extra‑element generation and improving objective matching, although extra‑element generation remains a nontrivial task. Our findings reveal that with well‑engineered prompts and multi‑agent collaboration, LLMs can facilitate SO formulations, paving the way for intelligent, language‑driven modeling pipelines for SO in practice.

PaperID: 8999, https://arxiv.org/pdf/2508.17149.pdf

Authors: Rahman Saadat Yeganeh, Hamid Behroozi, Mohammad Javad Omidi, Mohammad Robat Mili, Eduard A. Jorswieck, Symeon Chatzinotas

Title: Enhancing Energy and Spectral Efficiency in IoT-Cellular Networks via Active SIM-Equipped LEO Satellites

Abstract:
This paper investigates a low Earth orbit (LEO) satellite communication system enhanced by an active stacked intelligent metasurface (ASIM), mounted on the backplate of the satellite solar panels to efficiently utilize limited onboard space and reduce the main satellite power amplifier requirements. The system serves multiple ground users via rate‑splitting multiple access (RSMA) and IoT devices through a symbiotic radio network. Multi‑layer sequential processing in the ASIM improves effective channel gains and suppresses inter‑user interference, outperforming active RIS and beyond‑diagonal RIS designs. Three optimization approaches are evaluated: block coordinate descent with successive convex approximation (BCD‑SCA), model‑assisted multi‑agent constraint soft actor‑critic (MA‑CSAC), and multi‑constraint proximal policy optimization (MCPPO). Simulation results show that BCD‑SCA converges fast and stably in convex scenarios without learning, MCPPO achieves rapid initial convergence with moderate stability, and MA‑CSAC attains the highest long‑term spectral and energy efficiency in large‑scale networks. Energy‑spectral efficiency trade‑offs are analyzed for different ASIM elements, satellite antennas, and transmit power. Overall, the study demonstrates that integrating multi‑layer ASIM with suitable optimization algorithms offers a scalable, energy‑efficient, and high‑performance solution for next‑generation LEO satellite communications.

PaperID: 9000, https://arxiv.org/pdf/2508.17104.pdf

Authors: Sz-Ting Tzeng, Frank Dignum

Title: Rethinking How AI Embeds and Adapts to Human Values: Challenges and Opportunities

Abstract:
The concepts of ``human‑centered AI'' and ``value‑based decision'' have gained significant attention in both research and industry. However, many critical aspects remain underexplored and require further investigation. In particular, there is a need to understand how systems incorporate human values, how humans can identify these values within systems, and how to minimize the risks of harm or unintended consequences. In this paper, we highlight the need to rethink how we frame value alignment and assert that value alignment should move beyond static and singular conceptions of values. We argue that AI systems should implement long‑term reasoning and remain adaptable to evolving values. Furthermore, value alignment requires more theories to address the full spectrum of human values. Since values often vary among individuals or groups, multi‑agent systems provide the right framework for navigating pluralism, conflict, and inter‑agent reasoning about values. We identify the challenges associated with value alignment and indicate directions for advancing value alignment research. In addition, we broadly discuss diverse perspectives of value alignment, from design methodologies to practical applications.

PaperID: 9001, https://arxiv.org/pdf/2508.17033.pdf

Authors: Linbin Huang, Liangxiao Luo, Ruohan Leng, Huanhai Xin, Dan Wang, Florian Dörfler

Title: Geometric Decentralized Stability Certificate for Power Systems Based on Projecting DW Shells

Abstract:
The development of decentralized stability conditions has gained considerable attention due to the need to analyze multi‑agent network systems, such as heterogeneous multi‑converter power systems. A recent advance is the application of the small‑phase theorem, which extends the passivity theory. However, it requires the transfer function matrix to be sectorial, which may not hold in some frequency range and will result in conservativeness. To address this issue, this paper proposes a geometric decentralized stability condition based on Davis‑Wielandt (DW) shell and its projections. Our approach provides a geometric interpretation of the small‑gain and small‑phase theorems and enables decentralized stability analysis of power systems. It serves as a visualization method to understand the closed‑loop interactions and assess the stability of large‑scale network systems in a scalable and modular manner.

PaperID: 9002, https://arxiv.org/pdf/2508.16994.pdf

Authors: Jeongsoo Lee, Daeyong Kwon, Kyohoon Jin

Title: GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

Abstract:
Retrieval‑Augmented Generation (RAG) systems are widely adopted in knowledge‑intensive NLP tasks, but current evaluations often overlook the structural complexity and multi‑step reasoning required in real‑world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi‑hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty‑controlled queries. Central to our framework is a 2D difficulty matrix that combines generator‑side and retriever‑side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine‑grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi‑hop reasoning in real‑world applications.

PaperID: 9003, https://arxiv.org/pdf/2508.16987.pdf

Authors: Tanvir Bhathal, Asanshay Gupta

Title: WebSight: A Vision-First Architecture for Robust Web Agents

Abstract:
We introduce WebSight, a vision‑based autonomous web agent, designed to interact with web environments purely through visual perception, eliminating dependence on HTML or DOM‑based inputs. Central to our approach we introduce our new model, WebSight‑7B, a fine‑tuned vision‑language model optimized for UI element interaction, trained using LoRA on a web‑focused subset of the Wave‑UI‑25K dataset. WebSight integrates this model into a modular multi‑agent architecture, comprising planning, reasoning, vision‑action, and verification agents, coordinated through an episodic memory mechanism. WebSight‑7B achieves a top‑1 accuracy of 58.84% on the Showdown Clicks benchmark, outperforming several larger generalist models while maintaining lower latency. The full WebSight agent achieves a 68.0% success rate on the WebVoyager benchmark, surpassing systems from labs such as OpenAI (61.0%) and HCompany (Runner H, 67.0%). Among tasks completed, WebSight answers correctly 97.14% of the time, indicating high precision. Together, WebSight and WebSight‑7B establish a new standard for interpretable, robust, and efficient visual web navigation.

PaperID: 9004, https://arxiv.org/pdf/2508.16859.pdf

Authors: Jinpeng Hu, Hongchang Shi, Chongyuan Dai, Zhuo Li, Peipei Song, Meng Wang

Title: Beyond Emotion Recognition: A Multi-Turn Multimodal Emotion Understanding and Reasoning Benchmark

Abstract:
Multimodal large language models (MLLMs) have been widely applied across various fields due to their powerful perceptual and reasoning capabilities. In the realm of psychology, these models hold promise for a deeper understanding of human emotions and behaviors. However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving the substantial potential in emotion reasoning, which is crucial for improving the naturalness and effectiveness of human‑machine interactions. Therefore, in this paper, we introduce a multi‑turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 video data from real‑life scenarios, along with 5,101 progressive questions. These questions cover various aspects, including emotion recognition, potential causes of emotions, future action prediction, etc. Besides, we propose a multi‑agent framework, where each agent specializes in a specific aspect, such as background context, character dynamics, and event details, to improve the system's reasoning capabilities. Furthermore, we conduct experiments with existing MLLMs and our agent‑based method on the proposed benchmark, revealing that most models face significant challenges with this task.

PaperID: 9005, https://arxiv.org/pdf/2508.16763.pdf

Authors: Rabiul Awal, Mahsa Massoud, Aarash Feizi, Zichao Li, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

Title: WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Abstract:
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup‑to‑code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert‑annotated, real‑world web data to assess models' abilities in complex multi‑step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design‑to‑code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross‑lingual reasoning to build future web agents capable of automating diverse web development tasks.

PaperID: 9006, https://arxiv.org/pdf/2508.16688.pdf

Authors: Ankur Tomar, Hengyue Liang, Indranil Bhattacharya, Natalia Larios, Francesco Carbone

Title: Cybernaut: Towards Reliable Web Automation

Abstract:
The emergence of AI‑driven web automation through Large Language Models (LLMs) offers unprecedented opportunities for optimizing digital workflows. However, deploying such systems within industry's real‑world environments presents four core challenges: (1) ensuring consistent execution, (2) accurately identifying critical HTML elements, (3) meeting human‑like accuracy in order to automate operations at scale and (4) the lack of comprehensive benchmarking data on internal web applications. Existing solutions are primarily tailored for well‑designed, consumer‑facing websites (e.g., Amazon.com, Apple.com) and fall short in addressing the complexity of poorly‑designed internal web interfaces. To address these limitations, we present Cybernaut, a novel framework to ensure high execution consistency in web automation agents designed for robust enterprise use. Our contributions are threefold: (1) a Standard Operating Procedure (SOP) generator that converts user demonstrations into reliable automation instructions for linear browsing tasks, (2) a high‑precision HTML DOM element recognition system tailored for the challenge of complex web interfaces, and (3) a quantitative metric to assess execution consistency. The empirical evaluation on our internal benchmark demonstrates that using our framework enables a 23.2% improvement (from 72% to 88.68%) in task execution success rate over the browser_use. Cybernaut identifies consistent execution patterns with 84.7% accuracy, enabling reliable confidence assessment and adaptive guidance during task execution in real‑world systems. These results highlight Cybernaut's effectiveness in enterprise‑scale web automation and lay a foundation for future advancements in web automation.

PaperID: 9007, https://arxiv.org/pdf/2508.16659.pdf

Authors: Jiayi Wang, Ruiwei Xiao, Xinying Hou, John Stamper

Title: Enabling Multi-Agent Systems as Learning Designers: Applying Learning Sciences to AI Instructional Design

Abstract:
K‑12 educators are increasingly using Large Language Models (LLMs) to create instructional materials. These systems excel at producing fluent, coherent content, but often lack support for high‑quality teaching. The reason is twofold: first, commercial LLMs, such as ChatGPT and Gemini which are among the most widely accessible to teachers, do not come preloaded with the depth of pedagogical theory needed to design truly effective activities; second, although sophisticated prompt engineering can bridge this gap, most teachers lack the time or expertise and find it difficult to encode such pedagogical nuance into their requests. This study shifts pedagogical expertise from the user's prompt to the LLM's internal architecture. We embed the well‑established Knowledge‑Learning‑Instruction (KLI) framework into a Multi‑Agent System (MAS) to act as a sophisticated instructional designer. We tested three systems for generating secondary Math and Science learning activities: a Single‑Agent baseline simulating typical teacher prompts; a role‑based MAS where agents work sequentially; and a collaborative MAS‑CMD where agents co‑construct activities through conquer and merge discussion. The generated materials were evaluated by 20 practicing teachers and a complementary LLM‑as‑a‑judge system using the Quality Matters (QM) K‑12 standards. While the rubric scores showed only small, often statistically insignificant differences between the systems, the qualitative feedback from educators painted a clear and compelling picture. Teachers strongly preferred the activities from the collaborative MAS‑CMD, describing them as significantly more creative, contextually relevant, and classroom‑ready. Our findings show that embedding pedagogical principles into LLM systems offers a scalable path for creating high‑quality educational content.

PaperID: 9008, https://arxiv.org/pdf/2508.16602.pdf

Authors: Hsuan-Kung Yang, Tsu-Ching Hsiao, Ryoichiro Oka, Ryuya Nishino, Satoko Tofukuji, Norimasa Kobori

Title: An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance

Abstract:
Delivering intelligent and adaptive navigation assistance in augmented reality (AR) requires more than visual cues, as it demands systems capable of interpreting flexible user intent and reasoning over both spatial and semantic context. Prior AR navigation systems often rely on rigid input schemes or predefined commands, which limit the utility of rich building data and hinder natural interaction. In this work, we propose an embodied AR navigation system that integrates Building Information Modeling (BIM) with a multi‑agent retrieval‑augmented generation (RAG) framework to support flexible, language‑driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models (LLMs), which enables robust interpretation of open‑ended queries and spatial reasoning using BIM data. Navigation guidance is delivered through an embodied AR agent, equipped with voice interaction and locomotion, to enhance user experience. A real‑world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability, and comparative evaluations show that the embodied interface can significantly improves users' perception of system intelligence. These results underscore the importance and potential of language‑grounded reasoning and embodiment in the design of user‑centered AR navigation systems.

PaperID: 9009, https://arxiv.org/pdf/2508.16571.pdf

Authors: Vlad Vinogradov, Alisa Vinogradova, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, Andrey Doronichev

Title: LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

Abstract:
In this paper, we describe and benchmark a competitor‑discovery component used within an agentic AI system for fast drug asset due diligence. A competitor‑discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor‑specific, and data is paywalled/licensed, fragmented across registries, ontology‑mismatched by indication, alias‑heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM‑based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM‑based agents to transform five years of multi‑modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM‑as‑a‑judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor‑discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ～3 hours (～20x) for the competitive analysis.

PaperID: 9010, https://arxiv.org/pdf/2508.16490.pdf

Authors: Shili Wu, Yancheng Zhu, Aniruddha Datta, Sean B. Andersson

Title: Multi-agent Robust and Optimal Policy Learning for Data Harvesting

Abstract:
We consider the problem of using multiple agents to harvest data from a collection of sensor nodes (targets) scattered across a two‑dimensional environment. These targets transmit their data to the agents that move in the space above them, and our goal is for the agents to collect data from the targets as efficiently as possible while moving to their final destinations. The agents are assumed to have a continuous control action, and we leverage reinforcement learning, specifically Proximal Policy Optimization (PPO) with Lagrangian Penalty (LP), to identify highly effective solutions. Additionally, we enhance the controller's robustness by incorporating regularization at each state to smooth the learned policy. We conduct a series of simulations to demonstrate our approach and validate its performance and robustness.

PaperID: 9011, https://arxiv.org/pdf/2508.16438.pdf

Authors: Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, Weizhuo Chen, Jianjun Li, Zhiyuan Ma

Title: OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

Abstract:
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval‑augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning‑oriented multi‑hop retrieval tasks: 1) Ineffective reasoning‑oriented planning: Prior methods struggle to generate robust multi‑step plans for complex queries, as rule‑based decomposers perform poorly on out‑of‑template questions. 2) Suboptimal reasoning‑driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning‑guided filtering: Prevailing methods lack the fine‑grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner‑Executor Reasoning Architecture (OPERA), a novel reasoning‑driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub‑goals, which are executed by a Reason‑Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi‑Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi‑hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.

PaperID: 9012, https://arxiv.org/pdf/2508.16410.pdf

Authors: Alvin Combrink, Sabino Francesco Roselli, Martin Fabian

Title: Optimal Multi-agent Path Finding in Continuous Time

Abstract:
Continuous‑time Conflict Based‑Search (CCBS) has long been viewed as the standard optimal baseline for multi‑agent path finding in continuous time (MAPFR), yet recent critiques show that the theoretically described CCBS can fail to terminate on solvable MAPFR problems while the publicly available reference implementation can return sub‑optimal solutions. This work presents an analytical framework that yields simple and sufficient conditions under which any CCBS‑style algorithm is both sound and solution complete. Investigating the reference CCBS implementation reveals that it violates our sufficient conditions for soundness, with counterexamples demonstrating sub‑optimality. Leveraging the framework, we introduce a branching rule (δ‑BR) and prove it restores soundness and termination guarantees. Consequently, the resulting CCBS variant is both sound and solution complete. To our knowledge, this is the first MAPFR solver matching the guarantees of the discrete‑time CBS. On a constructed example, CCBS with δ‑BR improves sum‑of‑costs from 10.707 to 9.000 (\approx 16% lower) compared to the reference CCBS implementation. Across benchmarks, the reference CCBS implementation is generally able to find solutions faster than CCBS with δ‑BR due to its more aggressive pruning. However, this comes at the cost of occasional sub‑optimality and potential non‑termination when all solutions are pruned, whereas δ‑BR preserves optimality and guarantees termination by design. Because δ‑BR largely only affects the branching step, it can be adopted as a drop‑in replacement in existing codebases. Beyond CCBS, the analytical framework and termination criterion provide a systematic way to evaluate other CCBS‑like MAPFR solvers and future extensions, thereby offering tools for rigorous analysis of next‑generation MAPFR algorithms.

PaperID: 9013, https://arxiv.org/pdf/2508.16293.pdf

Authors: Yang Li, Xing Zhang, Yunji Zhao, Wenbo Wang

Title: Two-Timescale Dynamic Service Deployment and Task Scheduling with Spatiotemporal Collaboration in Mobile Edge Networks

Abstract:
Collaborative edge computing addresses the resource constraints of individual edge nodes by enabling resource sharing and task co‑processing across multiple nodes. To fully leverage the advantages of collaborative edge computing, joint optimization of service deployment and task scheduling is necessary. Existing optimization methods insufficiently address the collaboration across spatial and temporal dimensions, which hinders their adaptability to the spatiotemporally varying nature of user demands and system states. This paper focuses on optimizing the expected task processing delay in edge networks. We propose a two‑timescale online optimization framework to jointly determine: i) service deployment decisions at each large timescale; and ii) task scheduling decisions at each small timescale. Specifically, the convex optimization technique is used to solve the task scheduling problem, while a multi‑agent deep reinforcement learning technique is employed for the service deployment problem. These two methods are combined for spatiotemporal co‑optimization through a two‑timescale alternating optimization approach. Compared to the baseline algorithms, the proposed scheme achieves better delay performance, while also exhibiting low running time and favorable convergence behavior.

PaperID: 9014, https://arxiv.org/pdf/2508.16245.pdf

Authors: Cole Wyeth, Marcus Hutter, Jan Leike, Jessica Taylor

Title: Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games

Abstract:
A Bayesian player acting in an infinite multi‑player game learns to predict the other players' strategies if his prior assigns positive probability to their play (or contains a grain of truth). Kalai and Lehrer's classic grain of truth problem is to find a reasonably large class of strategies that contains the Bayes‑optimal policies with respect to this class, allowing mutually‑consistent beliefs about strategy choice that obey the rules of Bayesian inference. Only small classes are known to have a grain of truth and the literature contains several related impossibility results. In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of strategies wide enough to contain all computable strategies as well as Bayes‑optimal strategies for every reasonable prior over the class. When the "environment" is a known repeated stage game, we show convergence in the sense of [KL93a] and [KL93b]. When the environment is unknown, agents using Thompson sampling converge to play \varepsilon‑Nash equilibria in arbitrary unknown computable multi‑agent environments. Finally, we include an application to self‑predictive policies that avoid planning. While these results use computability theory only as a conceptual tool to solve a classic game theory problem, we show that our solution can naturally be computationally approximated arbitrarily closely.

PaperID: 9015, https://arxiv.org/pdf/2508.16112.pdf

Authors: Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park

Title: IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

Abstract:
Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real‑world analytical scenarios. In this paper, we propose IR‑Agent, a novel multi‑agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert‑driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR‑Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.

PaperID: 9016, https://arxiv.org/pdf/2508.16044.pdf

Authors: Zhaodonghui Li, Haitao Yuan, Jiachen Shi, Hao Zhang, Yu Rong, Gao Cong

Title: AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational Databases

Abstract:
Index recommendation is one of the most important problems in database management system (DBMS) optimization. Given queries and certain index‑related constraints, traditional methods rely on heuristic optimization or learning‑based models to select effective indexes and improve query performance. However, heuristic optimization suffers from high computation time, and learning‑based models lose generalisability due to training for different workloads and database schemas. With the recent rapid development of large language models (LLMs), methods using prompt tuning have been proposed to enhance the efficiency of index selection. However, such methods still can not achieve the state‑of‑the‑art (SOTA) results, and preparing the index selection demonstrations is also resource‑intensive. To address these issues, we propose AMAZe, a zero‑shot LLM‑based index advisor with a multi‑agent framework. We decompose the index recommendation problem into sub‑steps, including planning, selection, combination, revision, and reflection. A set of LLM‑embedded agents is designed to handle each one of the different sub‑steps. Our method utilizes high‑level agents to control the index selection process and low‑level agents to select and revise indexes. Through extensive experiments, we show that our proposed AMAZe not only achieves the SOTA performance compared to the heuristic methods, but also outperforms learning‑based and prompt‑based methods with higher efficiency and better zero‑shot inference ability.

PaperID: 9017, https://arxiv.org/pdf/2508.16037.pdf

Authors: Renxuan Tan, Rongpeng Li, Xiaoxue Yu, Xianfu Chen, Xing Xu, Zhifeng Zhao

Title: Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services

Abstract:
Federated learning (FL) in multi‑service provider (SP) ecosystems is fundamentally hampered by non‑cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi‑SP communication and computation resources. In this paper, we introduce PAC‑MCoFL, a game‑theoretic multi‑agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor‑Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto‑optimal equilibria while modeling heterogeneous risk profiles. To manage the high‑dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine‑grained control. Further, we develop PAC‑MCoFL‑p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework's superiority is validated through extensive simulations ‑‑ PAC‑MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.

PaperID: 9018, https://arxiv.org/pdf/2508.15940.pdf

Authors: Ahmed Allam, Youssef Mansour, Mohamed Shalan

Title: ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high‑quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real‑world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long‑term memory. To address these challenges, we present ASIC‑Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC‑Agent enhances base LLMs with a multi‑agent architecture incorporating specialized sub‑agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open‑source silicon community. To evaluate ASIC‑Agent's performance, we introduce ASIC‑Agent‑Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC‑Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC‑Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow.

PaperID: 9019, https://arxiv.org/pdf/2508.15926.pdf

Authors: Yuanjun Feng, Vivek Choudhary, Yash Raj Shrestha

Title: Noise, Adaptation, and Strategy: Assessing LLM Fidelity in Decision-Making

Abstract:
Large language models (LLMs) are increasingly used in social science simulations. While their performance on reasoning and optimization tasks has been extensively evaluated, less attention has been paid to their ability to simulate human decision‑making's variability and adaptability. We propose a process‑oriented evaluation framework with progressive interventions (Intrinsicality, Instruction, and Imitation) to examine how LLM agents adapt under different levels of external guidance and human‑derived noise. We validate the framework on two classic economics tasks, irrationality in the second‑price auction and decision bias in the newsvendor problem, showing behavioral gaps between LLMs and humans. We find that LLMs, by default, converge on stable and conservative strategies that diverge from observed human behaviors. Risk‑framed instructions impact LLM behavior predictably but do not replicate human‑like diversity. Incorporating human data through in‑context learning narrows the gap but fails to reach human subjects' strategic variability. These results highlight a persistent alignment gap in behavioral fidelity and suggest that future LLM evaluations should consider more process‑level realism. We present a process‑oriented approach for assessing LLMs in dynamic decision‑making tasks, offering guidance for their application in synthetic data for social science research.

PaperID: 9020, https://arxiv.org/pdf/2508.15876.pdf

Authors: Fang Wang, Tianwei Yan, Zonghao Yang, Minghao Hu, Jun Zhang, Zhunchen Luo, Xiaoying Bai

Title: DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

Abstract:
Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross‑modal fusion, and the difficulty of jointly large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi‑agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role‑specialized division strategy. DeepMEL integrates four specialized agents, namely Modal‑Fuser, Candidate‑Adapter, Entity‑Clozer and Role‑Orchestrator, to complete end‑to‑end cross‑modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual‑modal alignment path, and combines the fine‑grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy, combines tool‑based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state‑of‑the‑art performance, improving ACC by 1%‑57%. Ablation studies verify the effectiveness of all modules.

PaperID: 9021, https://arxiv.org/pdf/2508.15858.pdf

Authors: Maarten Buyl, Yousra Fettach, Guillaume Bied, Tijl De Bie

Title: Building and Measuring Trust between Large Language Models

Abstract:
As large language models (LLMs) increasingly interact with each other, most notably in multi‑agent setups, we may expect (and hope) that `trust' relationships develop between them, mirroring trust relationships between human colleagues, friends, or partners. Yet, though prior work has shown LLMs to be capable of identifying emotional connections and recognizing reciprocity in trust games, little remains known about (i) how different strategies to build trust compare, (ii) how such trust can be measured implicitly, and (iii) how this relates to explicit measures of trust. We study these questions by relating implicit measures of trust, i.e. susceptibility to persuasion and propensity to collaborate financially, with explicit measures of trust, i.e. a dyadic trust questionnaire well‑established in psychology. We build trust in three ways: by building rapport dynamically, by starting from a prewritten script that evidences trust, and by adapting the LLMs' system prompt. Surprisingly, we find that the measures of explicit trust are either little or highly negatively correlated with implicit trust measures. These findings suggest that measuring trust between LLMs by asking their opinion may be deceiving. Instead, context‑specific and implicit measures may be more informative in understanding how LLMs trust each other.

PaperID: 9022, https://arxiv.org/pdf/2508.15835.pdf

Authors: Henrique Godoy

Title: Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

Abstract:
Language models are increasingly used in Brazil, but most evaluation remains English‑centric. This paper presents Alvorada‑Bench, a 4,515‑question, text‑only benchmark drawn from five Brazilian university entrance examinations. Evaluating twenty models under zero‑shot, role‑playing, and chain‑of‑thought prompting, producing 270,900 responses with structured self‑reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering oriented IME and ITA exams, indicating persistent weaknesses in multi‑step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty capabilities. A cost accuracy analysis shows that high accuracy is achievable at under 2 per 1K tokens. On ENEM 2024 the top model (O3) achieved perfect scores in Languages subject questions while even the weakest system (GPT‑4.1 Nano) only underperforms humans in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada‑Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.

PaperID: 9023, https://arxiv.org/pdf/2508.15832.pdf

Authors: Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans

Title: A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Abstract:
Web agents have shown great promise in performing many tasks on ecommerce website. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e‑commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real‑world e‑commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto‑reload setting. To address these gaps, we propose a new benchmark called Amazon‑Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality‑grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

PaperID: 9024, https://arxiv.org/pdf/2508.15819.pdf

Authors: Qiang Duan, Zhihui Lu

Title: Agent Communications toward Agentic AI at Edge -- A Case Study of the Agent2Agent Protocol

Abstract:
The current evolution of artificial intelligence introduces a paradigm shift toward agentic AI built upon multi‑agent systems (MAS). Agent communications serve as a key to effective agent interactions in MAS and thus have a significant impact on the performance of agentic AI applications. The recent research on agent communications has made exciting rapid progress that leads to a variety of protocol designs, among which the Agent2Agent (A2A) protocol is considered the most representative one. Simultaneously, the rise of edge intelligence is expected to enable agentic AI at the network edge. However, the current agent communication protocols are designed without sufficient consideration of the special challenges of edge computing, and their effectiveness in the edge environment is largely unexamined. In this paper, we attempt to assess the abilities of agent communication technologies to face the challenges of edge computing using the A2A protocol as a representative case. We first discuss the core functionalities of agent communications, present a landscape of agent communication protocols, and identify the main challenges introduced by edge computing. Then, we conduct a case study on the A2A protocol to examine the key technologies leveraged in the protocol for their effectiveness in meeting the requirements of agent communications in edge computing. Based on the insights obtained from this assessment, we identify open issues in the current agent communication technologies and discuss directions for future research to address these issues.

PaperID: 9025, https://arxiv.org/pdf/2508.15795.pdf

Authors: Yanheng Liu, Dalin Li, Hao Wu, Zemin Sun, Weihong Qin, Jun Li, Hongyang Du, Geng Sun

Title: Task Offloading and Resource Allocation for MEC-assisted Consumer Internet of Vehicle Systems

Abstract:
Mobile edge computing (MEC)‑assisted internet of vehicle (IoV) is emerging as a promising paradigm to provide computing services for vehicles. However, meeting the computing‑sensitive and computation‑intensive demands of vehicles poses several challenges, including the discrepancy between the limited resource provision and stringent computing requirement, the difficulty in capturing and integrating the intricate features of the MEC‑assisted IoV system into the problem formulation, and the need for real‑time processing and efficient resource management in the dynamic environment. In this work, we explore the AI‑enabled task offloading and resource allocation for MEC‑assisted consumer IoV systems. Specifically, we first present a multi‑MEC‑assisted consumer IoV architecture that leverages the computational resources of MEC servers to provide offloading services close to vehicles. Subsequently, we formulate a system cost minimization optimization problem (SCMOP) by integrating the service delay and energy consumption. To efficiently solve this problem, we design a joint task offloading and computing resource allocation approach (JTOCRA) by applying the multi‑agent deep deterministic policy gradient (MADDPG) algorithm. Finally, simulation results demonstrate that the proposed JTOCRA can achieve superior system performances and exhibits better scalability compared to other alternative approaches.

PaperID: 9026, https://arxiv.org/pdf/2508.15764.pdf

Authors: Kiarash Kazari, Ezzeldin Shereen, György Dán

Title: Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space

Abstract:
We address the problem of detecting adversarial attacks against cooperative multi‑agent reinforcement learning with continuous action space. We propose a decentralized detector that relies solely on the local observations of the agents and makes use of a statistical characterization of the normal behavior of observable agents. The proposed detector utilizes deep neural networks to approximate the normal behavior of agents as parametric multivariate Gaussian distributions. Based on the predicted density functions, we define a normality score and provide a characterization of its mean and variance. This characterization allows us to employ a two‑sided CUSUM procedure for detecting deviations of the normality score from its mean, serving as a detector of anomalous behavior in real‑time. We evaluate our scheme on various multi‑agent PettingZoo benchmarks against different state‑of‑the‑art attack methods, and our results demonstrate the effectiveness of our method in detecting impactful adversarial attacks. Particularly, it outperforms the discrete counterpart by achieving AUC‑ROC scores of over 0.95 against the most impactful attacks in all evaluated environments.

PaperID: 9027, https://arxiv.org/pdf/2508.15760.pdf

Authors: Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

Title: LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Abstract:
Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider‑specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi‑step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP‑101, a benchmark of 101 real‑world queries that require coordinated use of multiple MCP tools. To address temporal variability in real‑world tool responses, we introduce a parallel evaluation framework where a reference agent executes a validated plan simultaneously to produce real‑time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting challenges in multi‑step tool use. Comprehensive error analysis identifies seven failure modes spanning tool planning, parameterization, and output handling, pointing to concrete directions for improving current models. LiveMCP‑101 sets a rigorous standard for evaluating real‑world agent capabilities, advancing toward autonomous agent systems that reliably execute complex tasks through MCP tool orchestration.

PaperID: 9028, https://arxiv.org/pdf/2508.15757.pdf

Authors: Yuxing Lu, Yucheng Hu, Nan Sun, Xukai Zhao

Title: Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback

Abstract:
Configuration optimization remains a critical bottleneck in machine learning, requiring coordinated tuning across model architecture, training strategy, feature engineering, and hyperparameters. Traditional approaches treat these dimensions independently and lack interpretability, while recent automated methods struggle with dynamic adaptability and semantic reasoning about optimization decisions. We introduce Language‑Guided Tuning (LGT), a novel framework that employs multi‑agent Large Language Models to intelligently optimize configurations through natural language reasoning. We apply textual gradients ‑ qualitative feedback signals that complement numerical optimization by providing semantic understanding of training dynamics and configuration interdependencies. LGT coordinates three specialized agents: an Advisor that proposes configuration changes, an Evaluator that assesses progress, and an Optimizer that refines the decision‑making process, creating a self‑improving feedback loop. Through comprehensive evaluation on six diverse datasets, LGT demonstrates substantial improvements over traditional optimization methods, achieving performance gains while maintaining high interpretability.

PaperID: 9029, https://arxiv.org/pdf/2508.15679.pdf

Authors: Eric Ye, Ren Tao, Natasha Jaques

Title: An Efficient Open World Environment for Multi-Agent Social Learning

Abstract:
Many challenges remain before AI agents can be deployed in real‑world environments. However, one virtue of such environments is that they are inherently multi‑agent and contain human experts. Using advanced social intelligence in such an environment can help an AI agent learn adaptive skills and behaviors that a known expert exhibits. While social intelligence could accelerate training, it is currently difficult to study due to the lack of open‑ended multi‑agent environments. In this work, we present an environment in which multiple self‑interested agents can pursue complex and independent goals, reflective of real world challenges. This environment will enable research into the development of socially intelligent AI agents in open‑ended multi‑agent settings, where agents may be implicitly incentivized to cooperate to defeat common enemies, build and share tools, and achieve long horizon goals. In this work, we investigate the impact on agent performance due to social learning in the presence of experts and implicit cooperation such as emergent collaborative tool use, and whether agents can benefit from either cooperation or competition in this environment.

PaperID: 9030, https://arxiv.org/pdf/2508.15652.pdf

Authors: Ardian Selmonaj, Miroslav Strupl, Oleg Szehr, Alessandro Antonucci

Title: Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning

Abstract:
To reliably deploy Multi‑Agent Reinforcement Learning (MARL) systems, it is crucial to understand individual agent behaviors. While prior work typically evaluates overall team performance based on explicit reward signals, it is unclear how to infer agent contributions in the absence of any value feedback. In this work, we investigate whether meaningful insights into agent behaviors can be extracted solely by analyzing the policy distribution. Inspired by the phenomenon that intelligent agents tend to pursue convergent instrumental values, we introduce Intended Cooperation Values (ICVs), a method based on information‑theoretic Shapley values for quantifying each agent's causal influence on their co‑players' instrumental empowerment. Specifically, ICVs measure an agent's action effect on its teammates' policies by assessing their decision (un)certainty and preference alignment. By analyzing action effects on policies and value functions across cooperative and competitive MARL tasks, our method identifies which agent behaviors are beneficial to team success, either by fostering deterministic decisions or by preserving flexibility for future action choices, while also revealing the extent to which agents adopt similar or diverse strategies. Our proposed method offers novel insights into cooperation dynamics and enhances explainability in MARL systems.

PaperID: 9031, https://arxiv.org/pdf/2508.15595.pdf

Authors: Abhishek Dandekar, Prashiddha D. Thapa, Ashrafur Rahman, Julius Schulz-Zander

Title: Interface on demand: Towards AI native Control interfaces for 6G

Abstract:
Traditional standardized network interfaces face significant limitations, including vendor‑specific incompatibilities, rigid design assumptions, and lack of adaptability for new functionalities. We propose a multi‑agent framework leveraging large language models (LLMs) to generate control interfaces on demand between network functions (NFs). This includes a matching agent, which aligns required control functionalities with NF capabilities, and a code‑generation agent, which generates the necessary API server for interface realization. We validate our approach using simulated multi‑vendor gNB and WLAN AP environments. The performance evaluations highlight the trade‑offs between cost and latency across LLMs for interface generation tasks. Our work sets the foundation for AI‑native dynamic control interface generation, paving the way for enhanced interoperability and adaptability in future mobile networks.

PaperID: 9032, https://arxiv.org/pdf/2508.15447.pdf

Authors: Zihao Wang, Junming Zhang

Title: From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence

Abstract:
Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi‑agent framework leveraging LLMs for advanced decision‑making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi‑level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent's efficacy, demonstrating its capacity to generate coherent, client‑focused solutions that smoothly integrate granular insights with high‑level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting‑edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI‑driven enterprise decision‑making, empowering organizations to navigate complex business landscapes more effectively.

PaperID: 9033, https://arxiv.org/pdf/2508.15335.pdf

Authors: Bin Deng, Yizhe Feng, Zeming Liu, Qing Wei, Xiangrong Zhu, Shuai Chen, Yuanfang Guo, Yunhong Wang

Title: RETAIL: Towards Real-world Travel Planning for Large Language Models

Abstract:
Although large language models have enhanced automated travel planning abilities, current systems remain misaligned with real‑world scenarios. First, they assume users provide explicit queries, while in reality requirements are often implicit. Second, existing solutions ignore diverse environmental factors and user preferences, limiting the feasibility of plans. Third, systems can only generate plans with basic POI arrangements, failing to provide all‑in‑one plans with rich details. To mitigate these challenges, we construct a novel dataset RETAIL, which supports decision‑making for implicit queries while covering explicit queries, both with and without revision needs. It also enables environmental awareness to ensure plan feasibility under real‑world scenarios, while incorporating detailed POI information for all‑in‑one travel plans. Furthermore, we propose a topic‑guided multi‑agent framework, termed TGMA. Our experiments reveal that even the strongest existing model achieves merely a 1.0% pass rate, indicating real‑world travel planning remains extremely challenging. In contrast, TGMA demonstrates substantially improved performance 2.72%, offering promising directions for real‑world travel planning.

PaperID: 9034, https://arxiv.org/pdf/2508.15310.pdf

Authors: Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji

Title: IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents

Abstract:
Large language model (LLM) agents are widely deployed in real‑world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect Prompt Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model's inherent security, which lacks structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To prevent malicious tool invocations at the source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents' task execution process as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.

PaperID: 9035, https://arxiv.org/pdf/2508.15305.pdf

Authors: Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Title: Coarse-to-Fine Grounded Memory for LLM Agent Planning

Abstract:
Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM‑based agents for complex planning tasks. To avoid costly agent training, many studies adopted memory mechanism that enhances LLM with offline experiences or online trajectory analysis. However, existing works focus on single‑granularity memory derived from dynamic environmental interactions, which are inherently constrained by the quality of the collected experiences. This limitation, in turn, constrain the diversity of knowledge and the flexibility of planning. We propose Coarse‑to‑Fine Grounded Memory (\Ours), a novel framework that grounds coarse‑to‑fine memories with LLM, thereby fully leverage them for flexible adaptation to diverse scenarios. \Ours grounds environmental information into coarse‑grained focus points to guide experience collection in training tasks, followed by grounding of actionable hybrid‑grained tips from each experience. At inference, \Ours retrieves task‑relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine‑grained key information, enabling flexible self‑QA reflection and plan correction.

PaperID: 9036, https://arxiv.org/pdf/2508.15243.pdf

Authors: Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Bingchen Li, Yunpeng Qi, Yiting Lu, Zhengxue Cheng, Zhibo Chen, Jörn Ostermann

Title: Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent

Abstract:
We present Comp‑X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly for unprofessional users. To overcome this, we advance the evolution of image coding paradigm by introducing three key innovations: (i) multi‑functional coding framework, which unifies different coding modes of various objective/requirements, including human‑machine perception, variable coding, and spatial bit allocation, into one framework. (ii) interactive coding agent, where we propose an augmented in‑context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, mode selection, and the use of the coding tools. (iii) IIC‑bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp‑X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.

PaperID: 9037, https://arxiv.org/pdf/2508.15119.pdf

Authors: Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell

Title: Flexible Agent Alignment with Goal Inference from Open-Ended Dialog

Abstract:
We introduce Open‑Universe Assistance Games (OU‑AGs), a formal framework extending assistance games to LLM‑based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, and evolving. Current LLM agents struggle in multi‑turn interactions and with maintaining accurate models of user intent in collaborative settings. Existing assistance game formulations assume fixed, predefined preferences, an assumption that breaks down in open‑ended dialogue where goals are revised incrementally and expressed in natural language. Grounded in cognitive science accounts of preference construction, we represent human preferences as a dynamically updated distribution over discrete natural‑language goals. To operationalize OU‑AGs, we introduce GOOD (GOals from Open‑ended Dialogue), a data‑efficient online method that extracts and ranks candidate goals during interaction, using LLM‑simulated users to perform probabilistic inference over goal hypotheses. This allows for interpretable, uncertainty‑aware preference representations without large offline datasets. We evaluate GOOD across three text‑based domains: grocery shopping, household robotics (AI2‑THOR), and coding. Compared to baselines without explicit goal tracking, GOOD produces semantically coherent goal representations and improves alignment with user intent across domains.

PaperID: 9038, https://arxiv.org/pdf/2508.15047.pdf

Authors: Yibo Liu, Liam Shatzel, Brandon Haworth, Teseo Schneider

Title: Emergent Crowds Dynamics from Language-Driven Multi-Agent Interactions

Abstract:
Animating and simulating crowds using an agent‑based approach is a well‑established area where every agent in the crowd is individually controlled such that global human‑like behaviour emerges. We observe that human navigation and movement in crowds are often influenced by complex social and environmental interactions, driven mainly by language and dialogue. However, most existing work does not consider these dimensions and leads to animations where agent‑agent and agent‑environment interactions are largely limited to steering and fixed higher‑level goal extrapolation. We propose a novel method that exploits large language models (LLMs) to control agents' movement. Our method has two main components: a dialogue system and language‑driven navigation. We periodically query agent‑centric LLMs conditioned on character personalities, roles, desires, and relationships to control the generation of inter‑agent dialogue when necessitated by the spatial and social relationships with neighbouring agents. We then use the conversation and each agent's personality, emotional state, vision, and physical state to control the navigation and steering of each agent. Our model thus enables agents to make motion decisions based on both their perceptual inputs and the ongoing dialogue. We validate our method in two complex scenarios that exemplify the interplay between social interactions, steering, and crowding. In these scenarios, we observe that grouping and ungrouping of agents automatically occur. Additionally, our experiments show that our method serves as an information‑passing mechanism within the crowd. As a result, our framework produces more realistic crowd simulations, with emergent group behaviours arising naturally from any environmental setting.

PaperID: 9039, https://arxiv.org/pdf/2508.15030.pdf

Authors: Ashmi Banerjee, Fitri Nur Aisyah, Adithi Satish, Wolfgang WÃ¶rndl, Yashar Deldjoo

Title: Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Abstract:
We propose Collab‑REC, a multi‑agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM‑based agents ‑‑ Personalization, Popularity, and Sustainability generate city suggestions from complementary perspectives. A non‑LLM moderator then merges and refines these proposals via multi‑round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab‑REC improves diversity and overall relevance compared to a single‑agent baseline, surfacing lesser‑visited locales that often remain overlooked. This balanced, context‑aware approach addresses over‑tourism and better aligns with constraints provided by the user, highlighting the promise of multi‑stakeholder collaboration in LLM‑driven recommender systems.

PaperID: 9040, https://arxiv.org/pdf/2508.14925.pdf

Authors: Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, Xiangyang Li

Title: MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers

Abstract:
By providing a standardized interface for LLM agents to interact with external tools, the Model Context Protocol (MCP) is quickly becoming a cornerstone of the modern autonomous agent ecosystem. However, it creates novel attack surfaces due to untrusted external tools. While prior work has focused on attacks injected through external tool outputs, we investigate a more fundamental vulnerability: Tool Poisoning, where malicious instructions are embedded within a tool's metadata without execution. To date, this threat has been primarily demonstrated through isolated cases, lacking a systematic, large‑scale evaluation. We introduce MCPTox, the first benchmark to systematically evaluate agent robustness against Tool Poisoning in realistic MCP settings. MCPTox is constructed upon 45 live, real‑world MCP servers and 353 authentic tools. To achieve this, we design three distinct attack templates to generate a comprehensive suite of 1312 malicious test cases by few‑shot learning, covering 10 categories of potential risks. Our evaluation on 20 prominent LLM agents setting reveals a widespread vulnerability to Tool Poisoning, with o1‑mini, achieving an attack success rate of 72.8%. We find that more capable models are often more susceptible, as the attack exploits their superior instruction‑following abilities. Finally, the failure case analysis reveals that agents rarely refuse these attacks, with the highest refused rate (Claude‑3.7‑Sonnet) less than 3%, demonstrating that existing safety alignment is ineffective against malicious actions that use legitimate tools for unauthorized operation. Our findings create a crucial empirical baseline for understanding and mitigating this widespread threat, and we release MCPTox for the development of verifiably safer AI agents. Our dataset is available at an anonymized repository: https://anonymous.4open.science/r/AAAI26‑7C02.

PaperID: 9041, https://arxiv.org/pdf/2508.14893.pdf

Authors: Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Quang Vinh Dang, Jiageng Liu, Chuang Gan

Title: Virtual Community: An Open World for Humans, Robots, and Society

Abstract:
The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community‑an open‑world platform for humans, robots, and society‑built on a universal physics engine and grounded in real‑world 3D scenes. With Virtual Community, we aim to enable the study of embodied social intelligence at scale. To support these, Virtual Community features: 1) An open‑source multi‑agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large‑scale, real‑world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi‑agent reasoning and planning ability in open‑world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open‑world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high‑level open‑world task planning and low‑level cooperation controls. We hope that Virtual Community will unlock further study of human‑robot coexistence within open‑world environments.

PaperID: 9042, https://arxiv.org/pdf/2508.14751.pdf

Authors: Thomas Carta, Clément Romac, Loris Gaven, Pierre-Yves Oudeyer, Olivier Sigaud, Sylvain Lamprier

Title: HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents

Abstract:
Open‑ended AI agents need to be able to learn efficiently goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond sampling efficiently their own goals, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To adress this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on its compositional and combinatorial generalization capabilities to acquire temporally extended reusable behaviours. Existing approaches use expert defined spaces of subgoals over which they instantiate a hierarchy, and often assume pre‑trained associated low‑level policies. Such designs are inadequate in open‑ended scenarios, where goal spaces naturally diversify across a broad spectrum of difficulties. We introduce HERAKLES, a framework that enables a two‑level hierarchical autotelic agent to continuously compile mastered goals into the low‑level policy, executed by a small, fast neural network, dynamically expanding the set of subgoals available to the high‑level policy. We train a Large Language Model (LLM) to serve as the high‑level controller, exploiting its strengths in goal decomposition and generalization to operate effectively over this evolving subgoal space. We evaluate HERAKLES in the open‑ended Crafter environment and show that it scales effectively with goal complexity, improves sample efficiency through skill compilation, and enables the agent to adapt robustly to novel challenges over time.

PaperID: 9043, https://arxiv.org/pdf/2508.14704.pdf

Authors: Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li

Title: MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Abstract:
The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long‑horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP‑Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real‑world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution‑based evaluators, including format evaluators for agent format compliance, static evaluators for time‑invariant content matching, and dynamic evaluators that automatically retrieve real‑time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT‑5 (43.72%), Grok‑4 (33.33%) and Claude‑4.0‑Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long‑context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown‑tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise‑level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open‑source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.

PaperID: 9044, https://arxiv.org/pdf/2508.14679.pdf

Authors: Parham Soltani, Mehrshad Eskandarpour, Amir Ahmadizad, Hossein Soleimani

Title: Energy-Efficient Routing Algorithm for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach

Abstract:
Efficient energy management is essential in Wireless Sensor Networks (WSNs) to extend network lifetime and ensure reliable data transmission. This paper presents a novel method using reinforcement learning‑based cluster‑head selection and a hybrid multi‑hop routing algorithm, which leverages Q‑learning within a multi‑agent system to dynamically adapt transmission paths based on the energy distribution across sensor nodes. Each sensor node is modeled as an autonomous agent that observes local state parameters, such as residual energy, distance to sink, hop count, and hotspot proximity, and selects routing actions that maximize long‑term energy efficiency. After computing the optimal paths, each sensor aggregates sensed data and forwards it through intermediate nodes to a selected transmitter node, chosen based on the highest remaining State of Charge (SoC), thereby avoiding premature node depletion. To promote efficient learning, a carefully designed reward function incentivizes balanced load distribution, hotspot avoidance, and energy‑aware forwarding while maintaining signal quality. The learning process occurs either in a decentralized manner or via a cloud‑based controller that offloads computation in large‑scale deployments. Moreover, the RL‑driven routing decisions are fused with classical graph‑based methods, Minimum Energy Routing Algorithm (MERA) and Minimum Spanning Tree (MST), to optimize energy consumption and load balancing. Simulations confirm that the proposed approach significantly improves node survival rate, reduces SoC variance, and enhances network resilience, making it a scalable and adaptive solution for energy‑constrained WSNs in dynamic sensor deployments and IoT applications.

PaperID: 9045, https://arxiv.org/pdf/2508.14676.pdf

Authors: Parham Soltani, Mehrshad Eskandarpour, Sina Heidari, Farnaz Alizadeh, Hossein Soleimani

Title: Adaptive Vision-Based Coverage Optimization in Mobile Wireless Sensor Networks: A Multi-Agent Deep Reinforcement Learning Approach

Abstract:
Traditional Wireless Sensor Networks (WSNs) typically rely on pre‑analysis of the target area, network size, and sensor coverage to determine initial deployment. This often results in significant overlap to ensure continued network operation despite sensor energy depletion. With the emergence of Mobile Wireless Sensor Networks (MWSNs), issues such as sensor failure and static coverage limitations can be more effectively addressed through mobility. This paper proposes a novel deployment strategy in which mobile sensors autonomously position themselves to maximize area coverage, eliminating the need for predefined policies. A live camera system, combined with deep reinforcement learning (DRL), monitors the network by detecting sensor LED indicators and evaluating real‑time coverage. Rewards based on coverage efficiency and sensor movement are computed at each learning step and shared across the network through a Multi‑Agent Reinforcement Learning (MARL) framework, enabling decentralized, cooperative sensor control. Key contributions include a vision‑based, low‑cost coverage evaluation method; a scalable MARL‑DRL framework for autonomous deployment; and a self‑reconfigurable system that adjusts sensor positioning in response to energy depletion. Compared to traditional distance‑based localization, the proposed method achieves a 26.5% improvement in coverage, a 32% reduction in energy consumption, and a 22% decrease in redundancy, extending network lifetime by 45%. This approach significantly enhances adaptability, energy efficiency, and robustness in MWSNs, offering a practical deployment solution within the IoT framework.

PaperID: 9046, https://arxiv.org/pdf/2508.14654.pdf

Authors: Peilin Ji, Xiao Xue, Simeng Wang, Wenhao Yan

Title: Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration

Abstract:
In recent years, the increasing frequency of extreme urban rainfall events has posed significant challenges to emergency scheduling systems. Urban flooding often leads to severe traffic congestion and service disruptions, threatening public safety and mobility. However, effective decision making remains hindered by three key challenges: (1) managing trade‑offs among competing goals (e.g., traffic flow, task completion, and risk mitigation) requires dynamic, context‑aware strategies; (2) rapidly evolving environmental conditions render static rules inadequate; and (3) LLM‑generated strategies frequently suffer from semantic instability and execution inconsistency. Existing methods fail to align perception, global optimization, and multi‑agent coordination within a unified framework. To tackle these challenges, we introduce H‑J, a hierarchical multi‑agent framework that integrates knowledge‑guided prompting, entropy‑constrained generation, and feedback‑driven optimization. The framework establishes a closed‑loop pipeline spanning from multi‑source perception to strategic execution and continuous refinement. We evaluate H‑J on real‑world urban topology and rainfall data under three representative conditions: extreme rainfall, intermittent bursts, and daily light rain. Experiments show that H‑J outperforms rule‑based and reinforcement‑learning baselines in traffic smoothness, task success rate, and system robustness. These findings highlight the promise of uncertainty‑aware, knowledge‑constrained LLM‑based approaches for enhancing resilience in urban flood response.

PaperID: 9047, https://arxiv.org/pdf/2508.14635.pdf

Authors: João Vitor de Carvalho Silva, Douglas G. Macharet

Title: Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination

Abstract:
The ability to coordinate actions across multiple agents is critical for solving complex, real‑world problems. Large Language Models (LLMs) have shown strong capabilities in communication, planning, and reasoning, raising the question of whether they can also support effective collaboration in multi‑agent settings. In this work, we investigate the use of LLM agents to solve a structured victim rescue task that requires division of labor, prioritization, and cooperative planning. Agents operate in a fully known graph‑based environment and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate their performance using a suite of coordination‑sensitive metrics, including task success rate, redundant actions, room conflicts, and urgency‑weighted efficiency. This study offers new insights into the strengths and failure modes of LLMs in physically grounded multi‑agent collaboration tasks, contributing to future benchmarks and architectural improvements.

PaperID: 9048, https://arxiv.org/pdf/2508.14601.pdf

Authors: Yufei Ye, Shijian Gao, Xinhu Zheng, Liuqing Yang

Title: Multi-Tier UAV Edge Computing for Low Altitude Networks Towards Long-Term Energy Stability

Abstract:
This paper presents a novel multi‑tier UAV‑assisted edge computing system designed for low‑altitude networks. The system comprises vehicle users, lightweight Low‑Tier UAVs (L‑UAVs), and High‑Tier UAV (H‑UAV). L‑UAVs function as small‑scale edge servers positioned closer to vehicle users, while the H‑UAV, equipped with more powerful server and larger‑capacity battery, serves as mobile backup server to address the limitations in endurance and computing resources of L‑UAVs. The primary objective is to minimize task execution delays while ensuring long‑term energy stability for L‑UAVs. To address this challenge, the problem is first decoupled into a series of deterministic problems for each time slot using Lyapunov optimization. The priorities of task delay and energy consumption for L‑UAVs are adaptively adjusted based on real‑time energy status. The optimization tasks include assignment of tasks, allocation of computing resources, and trajectory planning for both L‑UAVs and H‑UAV. Simulation results demonstrate that the proposed approach achieves a reduction of at least 26% in transmission energy for L‑UAVs and exhibits superior energy stability compared to existing benchmarks.

PaperID: 9049, https://arxiv.org/pdf/2508.14486.pdf

Authors: Toqi Tahamid Sarker, Khaled R Ahmed, Taminul Islam, Cristiana Bernardi Rankrape, Karla Gage

Title: WeedSense: Multi-Task Learning for Weed Segmentation, Height Estimation, and Growth Stage Classification

Abstract:
Weed management represents a critical challenge in agriculture, significantly impacting crop yields and requiring substantial resources for control. Effective weed monitoring and analysis strategies are crucial for implementing sustainable agricultural practices and site‑specific management approaches. We introduce WeedSense, a novel multi‑task learning architecture for comprehensive weed analysis that jointly performs semantic segmentation, height estimation, and growth stage classification. We present a unique dataset capturing 16 weed species over an 11‑week growth cycle with pixel‑level annotations, height measurements, and temporal labels. WeedSense leverages a dual‑path encoder incorporating Universal Inverted Bottleneck blocks and a Multi‑Task Bifurcated Decoder with transformer‑based feature fusion to generate multi‑scale features and enable simultaneous prediction across multiple tasks. WeedSense outperforms other state‑of‑the‑art models on our comprehensive evaluation. On our multi‑task dataset, WeedSense achieves mIoU of 89.78% for segmentation, 1.67cm MAE for height estimation, and 99.99% accuracy for growth stage classification while maintaining real‑time inference at 160 FPS. Our multitask approach achieves 3× faster inference than sequential single‑task execution and uses 32.4% fewer parameters. Please see our project page at weedsense.github.io.

PaperID: 9050, https://arxiv.org/pdf/2508.14357.pdf

Authors: Rihao Chang, He Jiao, Weizhi Nie, Honglin Guo, Keliang Xie, Zhenhua Wu, Lina Zhao, Yunpeng Bai, Yongtao Ma, Lanjun Wang, Yuting Su, Xi Gao, Weijie Wang, Nicu Sebe, Bruno Lepri, Bingwei Sun

Title: Organ-Agents: Virtual Human Physiology Simulator via LLMs

Abstract:
Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ‑Agents, a multi‑agent framework that simulates human physiology via LLM‑driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine‑tuning on system‑specific time‑series data, followed by reinforcement‑guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high‑resolution trajectories across 9 systems and 125 variables. Organ‑Agents achieved high simulation accuracy on 4,509 held‑out patients, with per‑system MSEs <0.16 and robustness across SOFA‑based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ‑Agents faithfully reproduces critical multi‑system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ‑Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real‑world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision‑relevant patterns. These results position Organ‑Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.

PaperID: 9051, https://arxiv.org/pdf/2508.14300.pdf

Authors: Youssef Maklad, Fares Wael, Ali Hamdi, Wael Elsersy, Khaled Shaban

Title: MultiFuzz: A Dense Retrieval-based Multi-Agent System for Network Protocol Fuzzing

Abstract:
Traditional protocol fuzzing techniques, such as those employed by AFL‑based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. But ChatAFL still faces issues like unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval‑based multi‑agent system designed to overcome these limitations by integrating semantic‑aware context retrieval, specialized agents, and structured tool‑assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval‑augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain‑of‑thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real‑Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state‑of‑the‑art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic‑based fuzzing systems.

PaperID: 9052, https://arxiv.org/pdf/2508.14231.pdf

Authors: Carson Ezell, Xavier Roberts-Gaal, Alan Chan

Title: Incident Analysis for AI Agents

Abstract:
As AI agents become more widely deployed, we are likely to see an increasing number of incidents: events involving AI agent use that directly or indirectly cause harm. For example, agents could be prompt‑injected to exfiltrate private information or make unauthorized purchases. Structured information about such incidents (e.g., user prompts) can help us understand their causes and prevent future occurrences. However, existing incident reporting processes are not sufficient for understanding agent incidents. In particular, such processes are largely based on publicly available data, which excludes useful, but potentially sensitive, information such as an agent's chain of thought or browser history. To inform the development of new, emerging incident reporting processes, we propose an incident analysis framework for agents. Drawing on systems safety approaches, our framework proposes three types of factors that can cause incidents: system‑related (e.g., CBRN training data), contextual (e.g., prompt injections), and cognitive (e.g., misunderstanding a user request). We also identify specific information that could help clarify which factors are relevant to a given incident: activity logs, system documentation and access, and information about the tools an agent uses. We provide recommendations for 1) what information incident reports should include and 2) what information developers and deployers should retain and make available to incident investigators upon request. As we transition to a world with more agents, understanding agent incidents will become increasingly crucial for managing risks.

PaperID: 9053, https://arxiv.org/pdf/2508.14229.pdf

Authors: Md Mainul Abrar, Xun Jia, Yujie Chi

Title: New Insights into Automatic Treatment Planning for Cancer Radiotherapy Using Explainable Artificial Intelligence

Abstract:
Objective: This study aims to uncover the opaque decision‑making process of an artificial intelligence (AI) agent for automatic treatment planning. Approach: We examined a previously developed AI agent based on the Actor‑Critic with Experience Replay (ACER) network, which automatically tunes treatment planning parameters (TPPs) for inverse planning in prostate cancer intensity modulated radiotherapy. We selected multiple checkpoint ACER agents from different stages of training and applied an explainable AI (EXAI) method to analyze the attribution from dose‑volume histogram (DVH) inputs to TPP‑tuning decisions. We then assessed each agent's planning efficacy and efficiency and evaluated their policy and final TPP tuning spaces. Combining these analyses, we systematically examined how ACER agents generated high‑quality treatment plans in response to different DVH inputs. Results: Attribution analysis revealed that ACER agents progressively learned to identify dose‑violation regions from DVH inputs and promote appropriate TPP‑tuning actions to mitigate them. Organ‑wise similarities between DVH attributions and dose‑violation reductions ranged from 0.25 to 0.5 across tested agents. Agents with stronger attribution‑violation similarity required fewer tuning steps (~12‑13 vs. 22), exhibited a more concentrated TPP‑tuning space with lower entropy (~0.3 vs. 0.6), converged on adjusting only a few TPPs, and showed smaller discrepancies between practical and theoretical tuning steps. Putting together, these findings indicate that high‑performing ACER agents can effectively identify dose violations from DVH inputs and employ a global tuning strategy to achieve high‑quality treatment planning, much like skilled human planners. Significance: Better interpretability of the agent's decision‑making process may enhance clinician trust and inspire new strategies for automatic treatment planning.

PaperID: 9054, https://arxiv.org/pdf/2508.14214.pdf

Authors: Mattson Ogg, Chace Ashcraft, Ritwik Bose, Raphael Norman-Tenazas, Michael Wolmetz

Title: Large Language Models are Highly Aligned with Human Ratings of Emotional Stimuli

Abstract:
Emotions exert an immense influence over human behavior and cognition in both commonplace and high‑stress tasks. Discussions of whether or how to integrate large language models (LLMs) into everyday life (e.g., acting as proxies for, or interacting with, human agents), should be informed by an understanding of how these tools evaluate emotionally loaded stimuli or situations. A model's alignment with human behavior in these cases can inform the effectiveness of LLMs for certain roles or interactions. To help build this understanding, we elicited ratings from multiple popular LLMs for datasets of words and images that were previously rated for their emotional content by humans. We found that when performing the same rating tasks, GPT‑4o responded very similarly to human participants across modalities, stimuli and most rating scales (r = 0.9 or higher in many cases). However, arousal ratings were less well aligned between human and LLM raters, while happiness ratings were most highly aligned. Overall LLMs aligned better within a five‑category (happiness, anger, sadness, fear, disgust) emotion framework than within a two‑dimensional (arousal and valence) organization. Finally, LLM ratings were substantially more homogenous than human ratings. Together these results begin to describe how LLM agents interpret emotional stimuli and highlight similarities and differences among biological and artificial intelligence in key behavioral domains.

PaperID: 9055, https://arxiv.org/pdf/2508.14131.pdf

Authors: Junjie Qi, Siqi Mao, Tianyi Tan

Title: An Improved Multi-Agent Algorithm for Cooperative and Competitive Environments by Identifying and Encouraging Cooperation among Agents

Abstract:
We propose an improved algorithm by identifying and encouraging cooperative behavior in multi‑agent environments. First, we analyze the shortcomings of existing algorithms in addressing multi‑agent reinforcement learning problems. Then, based on the existing algorithm MADDPG, we introduce a new parameter to increase the reward that an agent can obtain when cooperative behavior among agents is identified. Finally, we compare our improved algorithm with MADDPG in environments from PettingZoo. The results show that the new algorithm helps agents achieve both higher team rewards and individual rewards.

PaperID: 9056, https://arxiv.org/pdf/2508.14123.pdf

Authors: Ankita Sharma, YuQi Fu, Vahid Ansari, Rishabh Iyer, Fiona Kuang, Kashish Mistry, Raisa Islam Aishy, Sara Ahmad, Joaquin Matres, Dirk R. Englund, Joyce K. S. Poon

Title: AI Agents for Photonic Integrated Circuit Design Automation

Abstract:
We present Photonics Intelligent Design and Optimization (PhIDO), a multi‑agent framework that converts natural‑language photonic integrated circuit (PIC) design requests into layout mask files. We compare 7 reasoning large language models for PhIDO using a testbench of 102 design descriptions that ranged from single devices to 112‑component PICs. The success rate for single‑device designs was up to 91%. For design queries with less than or equal to 15 components, o1, Gemini‑2.5‑pro, and Claude Opus 4 achieved the highest end‑to‑end pass@5 success rates of approximately 57%, with Gemini‑2.5‑pro requiring the fewest output tokens and lowest cost. The next steps toward autonomous PIC development include standardized knowledge representations, expanded datasets, extended verification, and robotic automation.

PaperID: 9057, https://arxiv.org/pdf/2508.14063.pdf

Authors: Moran Sorka, Alon Gorenshtein, Dvir Aran, Shahar Shelly

Title: A Multi-Agent Approach to Neurological Clinical Reasoning

Abstract:
Large language models (LLMs) have shown promise in medical domains, but their ability to handle specialized neurological reasoning requires systematic evaluation. We developed a comprehensive benchmark using 305 questions from Israeli Board Certification Exams in Neurology, classified along three complexity dimensions: factual knowledge depth, clinical concept integration, and reasoning complexity. We evaluated ten LLMs using base models, retrieval‑augmented generation (RAG), and a novel multi‑agent system. Results showed significant performance variation. OpenAI‑o1 achieved the highest base performance (90.9% accuracy), while specialized medical models performed poorly (52.9% for Meditron‑70B). RAG provided modest benefits but limited effectiveness on complex reasoning questions. In contrast, our multi‑agent framework, decomposing neurological reasoning into specialized cognitive functions including question analysis, knowledge retrieval, answer synthesis, and validation, achieved dramatic improvements, especially for mid‑range models. The LLaMA 3.3‑70B‑based agentic system reached 89.2% accuracy versus 69.5% for its base model, with substantial gains on level 3 complexity questions. The multi‑agent approach transformed inconsistent subspecialty performance into uniform excellence, addressing neurological reasoning challenges that persisted with RAG enhancement. We validated our approach using an independent dataset of 155 neurological cases from MedQA. Results confirm that structured multi‑agent approaches designed to emulate specialized cognitive processes significantly enhance complex medical reasoning, offering promising directions for AI assistance in challenging clinical contexts.

PaperID: 9058, https://arxiv.org/pdf/2508.14053.pdf

Authors: Jinwei Tang, Jiayin Qin, Nuo Xu, Pragnya Sudershan Nalla, Yu Cao, Yang, Zhao, Caiwen Ding

Title: MAHL: Multi-Agent LLM-Guided Hierarchical Chiplet Design with Adaptive Debugging

Abstract:
As program workloads (e.g., AI) increase in size and algorithmic complexity, the primary challenge lies in their high dimensionality, encompassing computing cores, array sizes, and memory hierarchies. To overcome these obstacles, innovative approaches are required. Agile chip design has already benefited from machine learning integration at various stages, including logic synthesis, placement, and routing. With Large Language Models (LLMs) recently demonstrating impressive proficiency in Hardware Description Language (HDL) generation, it is promising to extend their abilities to 2.5D integration, an advanced technique that saves area overhead and development costs. However, LLM‑driven chiplet design faces challenges such as flatten design, high validation cost and imprecise parameter optimization, which limit its chiplet design capability. To address this, we propose MAHL, a hierarchical LLM‑based chiplet design generation framework that features six agents which collaboratively enable AI algorithm‑hardware mapping, including hierarchical description generation, retrieval‑augmented code generation, diverseflow‑based validation, and multi‑granularity design space exploration. These components together enhance the efficient generation of chiplet design with optimized Power, Performance and Area (PPA). Experiments show that MAHL not only significantly improves the generation accuracy of simple RTL design, but also increases the generation accuracy of real‑world chiplet design, evaluated by Pass@5, from 0 to 0.72 compared to conventional LLMs under the best‑case scenario. Compared to state‑of‑the‑art CLARIE (expert‑based), MAHL achieves comparable or even superior PPA results under certain optimization objectives.

PaperID: 9059, https://arxiv.org/pdf/2508.14052.pdf

Authors: Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi, Suyeol Yun, Yongjin Kim, Yongjae Lee

Title: FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering

Abstract:
Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods ‑‑ whether sparse or dense ‑‑ often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine‑grained reasoning over document structure and domain‑specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi‑step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large‑scale benchmark for evaluating retrieval with multi‑step reasoning in finance ‑‑ a setting we term agentic retrieval. The benchmark consists of 26K expert‑annotated examples on S&P‑500 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design enables to provide a quantitative basis for understanding retrieval‑centric LLM behavior in finance. We evaluate a suite of state‑of‑the‑art models and further demonstrated how targeted fine‑tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval‑centric LLM behavior in complex, domain‑specific tasks for finance.

PaperID: 9060, https://arxiv.org/pdf/2508.13962.pdf

Authors: Ruiwei Xiao, Xinying Hou, Ying-Jui Tseng, Hsuan Nieu, Guanze Liao, John Stamper, Kenneth R. Koedinger

Title: Learning to Use AI for Learning: Teaching Responsible Use of AI Chatbot to K-12 Students Through an AI Literacy Module

Abstract:
As Artificial Intelligence (AI) becomes increasingly integrated into daily life, there is a growing need to equip the next generation with the ability to apply, interact with, evaluate, and collaborate with AI systems responsibly. Prior research highlights the urgent demand from K‑12 educators to teach students the ethical and effective use of AI for learning. To address this need, we designed an Large‑Language Model (LLM)‑based module to teach prompting literacy. This includes scenario‑based deliberate practice activities with direct interaction with intelligent LLM agents, aiming to foster secondary school students' responsible engagement with AI chatbots. We conducted two iterations of classroom deployment in 11 authentic secondary education classrooms, and evaluated 1) AI‑based auto‑grader's capability; 2) students' prompting performance and confidence changes towards using AI for learning; and 3) the quality of learning and assessment materials. Results indicated that the AI‑based auto‑grader could grade student‑written prompts with satisfactory quality. In addition, the instructional materials supported students in improving their prompting skills through practice and led to positive shifts in their perceptions of using AI for learning. Furthermore, data from Study 1 informed assessment revisions in Study 2. Analyses of item difficulty and discrimination in Study 2 showed that True/False and open‑ended questions could measure prompting literacy more effectively than multiple‑choice questions for our target learners. These promising outcomes highlight the potential for broader deployment and highlight the need for broader studies to assess learning effectiveness and assessment design.

PaperID: 9061, https://arxiv.org/pdf/2508.13920.pdf

Authors: Yuyang Du, Qun Yang, Liujianfu Wang, Jingqi Lin, Hongwei Cui, Soung Chang Liew

Title: LLMind 2.0: Distributed IoT Automation with Natural Language M2M Communication and Lightweight LLM Agents

Abstract:
Recent advances in large language models (LLMs) have generated great interest in their applications for IoT automation and device management. However, centralized approaches struggle to scale across heterogeneous, large‑scale systems. We present LLMind 2.0, a distributed framework that embeds lightweight LLM‑empowered device agents and adopts natural language for machine‑to‑machine (M2M) communication. In LLMind 2.0, a central coordinator translates human instructions into natural‑language subtask descriptions, which instruct distributed device agents to generate device‑specific code locally based on their proprietary APIs. Using natural language as a unified medium overcomes device heterogeneity and enables seamless device collaboration. LLMind 2.0 integrates: 1) a timeout‑based deadlock avoidance protocol that coordinates distributed subtask executions, 2) a retrieval‑augmented generation (RAG) mechanism for precise subtask‑to‑API mapping, and 3) fine‑tuned lightweight LLMs for reliable, device‑specific code generation. Experiments in multi‑robot warehouse operations and Wi Fi network deployments show LLMind 2.0 improved scalability, reliability, and responsiveness compared to centralized baselines.

PaperID: 9062, https://arxiv.org/pdf/2508.13872.pdf

Authors: Daniele Corradetti, José Delgado Rodrigues

Title: RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems

Abstract:
The Id‑Pattern system within the RED.AI project (Reabilitação Estrutural Digital através da AI) consists of an agentic system designed to assist in the identification of stone deterioration patterns. Traditional methodologies, based on direct observation by expert teams, are accurate but costly in terms of time and resources. The system developed here introduces and evaluates a multi‑agent artificial intelligence (AI) system, designed to simulate collaboration between experts and automate the diagnosis of stone pathologies from visual evidence. The approach is based on a cognitive architecture that orchestrates a team of specialized AI agents which, in this specific case, are limited to five: a lithologist, a pathologist, an environmental expert, a conservator‑restorer, and a diagnostic coordinator. To evaluate the system we selected 28 difficult images involving multiple deterioration patterns. Our first results showed a huge boost on all metrics of our system compared to the foundational model.

PaperID: 9063, https://arxiv.org/pdf/2508.13815.pdf

Authors: Churong Liang, Jinling Gan, Kairan Hong, Qiushi Tian, Zongze Wu, Runnan Li

Title: COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability

Abstract:
A critical limitation in large‑scale multi‑agent systems is the cascading of errors. And without intermediate verification, downstream agents exacerbate upstream inaccuracies, resulting in significant quality degradation. To bridge this gap, we introduce COCO (Cognitive Operating System with Continuous Oversight), a theoretically grounded framework for asynchronous self‑monitoring and adaptive error correction in multi‑agent systems. COCO reconciles the fundamental tension between quality assurance and computational efficiency via a novel decoupled architecture. This design isolates error detection from the critical execution path and incorporates an automated configuration engine to minimize deployment complexity. The framework relies on three algorithmic innovations to mitigate both systematic and stochastic errors: (1) a Contextual Rollback Mechanism that leverages execution history for informed state recovery rather than naive retries; (2) a Bidirectional Reflection Protocol to ensure convergence and prevent oscillatory control loops; and (3) a Heterogeneous Cross‑Validation Mechanism that utilizes ensemble disagreement to identify bias and hallucinations. Extensive experiments on diverse benchmarks demonstrate that COCO delivers a 6.5% average performance improvement. Notably, the framework achieves 95.1% of large‑model performance with a 30× parameter reduction, confirming the potential for efficient, high‑reliability deployment, and establishing COCO as a practical, annotation‑based solution for critical autonomous domains.

PaperID: 9064, https://arxiv.org/pdf/2508.13754.pdf

Authors: Liuxin Bao, Zhihao Peng, Xiaofei Zhou, Runmin Cong, Jiyong Zhang, Yixuan Yuan

Title: Expertise-aware Multi-LLM Recruitment and Collaboration for Medical Decision-Making

Abstract:
Medical Decision‑Making (MDM) is a complex process requiring substantial domain‑specific expertise to effectively synthesize heterogeneous and complicated clinical information. While recent advancements in Large Language Models (LLMs) show promise in supporting MDM, single‑LLM approaches are limited by their parametric knowledge constraints and static training corpora, failing to robustly integrate the clinical information. To address this challenge, we propose the Expertise‑aware Multi‑LLM Recruitment and Collaboration (EMRC) framework to enhance the accuracy and reliability of MDM systems. It operates in two stages: (i) expertise‑aware agent recruitment and (ii) confidence‑ and adversarial‑driven multi‑agent collaboration. Specifically, in the first stage, we use a publicly available corpus to construct an LLM expertise table for capturing expertise‑specific strengths of multiple LLMs across medical department categories and query difficulty levels. This table enables the subsequent dynamic selection of the optimal LLMs to act as medical expert agents for each medical query during the inference phase. In the second stage, we employ selected agents to generate responses with self‑assessed confidence scores, which are then integrated through the confidence fusion and adversarial validation to improve diagnostic reliability. We evaluate our EMRC framework on three public MDM datasets, where the results demonstrate that our EMRC outperforms state‑of‑the‑art single‑ and multi‑LLM methods, achieving superior diagnostic performance. For instance, on the MMLU‑Pro‑Health dataset, our EMRC achieves 74.45% accuracy, representing a 2.69% improvement over the best‑performing closed‑source model GPT‑ 4‑0613, which demonstrates the effectiveness of our expertise‑aware agent recruitment strategy and the agent complementarity in leveraging each LLM's specialized capabilities.

PaperID: 9065, https://arxiv.org/pdf/2508.13732.pdf

Authors: Yiming Xiong, Jian Wang, Bing Li, Yuhan Zhu, Yuqi Zhao

Title: Self-Organizing Agent Network for LLM-based Workflow Automation

Abstract:
Recent multi‑agent frameworks built upon large language models (LLMs) have demonstrated remarkable capabilities in complex task planning. However, in real‑world enterprise environments, business workflows are typically composed through modularization and reuse of numerous subprocesses, resulting in intricate workflows characterized by lengthy and deeply nested execution paths. Such complexity poses significant challenges for LLM‑driven orchestration, as extended reasoning chains and state‑space explosions severely impact planning effectiveness and the proper sequencing of tool invocations. Therefore, developing an orchestration method with controllable structures capable of handling multi‑layer nesting becomes a critical issue. To address this, we propose a novel structure‑driven orchestration framework Self‑Organizing Agent Network (SOAN). SOAN incrementally builds a formalized agent network by identifying and encapsulating structural units as independent agents, enhancing modularity and clarity in orchestration. Extensive evaluations were performed using multiple benchmarks as well as a real‑world enterprise workflow dataset. Experimental results demonstrate that SOAN significantly outperforms state‑of‑the‑art methods in terms of adaptability, fault tolerance, and execution efficiency.

PaperID: 9066, https://arxiv.org/pdf/2508.13721.pdf

Authors: Minh Hoang Nguyen, Van Dai Do, Dung Nguyen, Thin Nguyen, Hung Le

Title: CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning

Abstract:
Large language model (LLM) agents‑especially smaller, open‑source models‑often produce causally invalid or incoherent actions in collaborative tasks due to their reliance on surface‑level correlations rather than grounded causal reasoning. This limitation undermines their performance in terms of coordination and planning in dynamic environments. We address this challenge with CausalPlan, a two‑phase framework that integrates explicit structural causal reasoning into the LLM planning process. At the core of CausalPlan is the Structural Causal Action (SCA) model, which learns a causal graph from agent trajectories to capture how prior actions and current environment states influence future decisions. This structure is then used to guide action selection by assigning causal scores to LLM‑generated proposals, reweighting them accordingly, or falling back to causally grounded alternatives when needed. By embedding this causal knowledge directly into the decision loop, CausalPlan constrains planning to intervention‑consistent behaviours without requiring fine‑tuning of the LLM itself. We evaluate CausalPlan on the Overcooked‑AI benchmark across five multi‑agent coordination tasks and four LLMs of varying sizes: Gemma‑7B, Llama‑8B, Qwen‑14B, and Llama‑70B. Experimental results show that CausalPlan consistently reduces invalid actions and improves collaboration in both AI‑AI and human‑AI settings, outperforming strong reinforcement learning baselines. Our findings highlight the value of causality‑driven planning for deploying efficient, interpretable, and generalisable multi‑agent LLM systems.

PaperID: 9067, https://arxiv.org/pdf/2508.13661.pdf

Authors: Maciej Wojtala, Bogusz Stefańczyk, Dominik Bogucki, Łukasz Lepak, Jakub Strykowski, Paweł Wawrzyński

Title: MACTAS: Self-Attention-Based Inter-Agent Communication in Multi-Agent Reinforcement Learning with Action-Value Function Decomposition

Abstract:
Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi‑agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non‑differentiable. In this work, we introduce a self‑attention‑based communication method that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward‑driven manner. The method can be seamlessly integrated with any action‑value function decomposition algorithm and can be viewed as an orthogonal extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents, which makes it scalable to large systems. Experimental results on the SMACv2 benchmark demonstrate the effectiveness of our approach, which achieves state‑of‑the‑art performance on a number of maps. makes it scalable to large systems. Experimental results on the SMACv2 benchmark demonstrate the effectiveness of our approach, which achieves state‑of‑the‑art performance on a number of maps.

PaperID: 9068, https://arxiv.org/pdf/2508.13608.pdf

Authors: Abdullah Tokmak, Thomas B. Schön, Dominik Baumann

Title: Towards safe control parameter tuning in distributed multi-agent systems

Abstract:
Many safety‑critical real‑world problems, such as autonomous driving and collaborative robots, are of a distributed multi‑agent nature. To optimize the performance of these systems while ensuring safety, we can cast them as distributed optimization problems, where each agent aims to optimize their parameters to maximize a coupled reward function subject to coupled constraints. Prior work either studies a centralized setting, does not consider safety, or struggles with sample efficiency. Since we require sample efficiency and work with unknown and nonconvex rewards and constraints, we solve this optimization problem using safe Bayesian optimization with Gaussian process regression. Moreover, we consider nearest‑neighbor communication between the agents. To capture the behavior of non‑neighboring agents, we reformulate the static global optimization problem as a time‑varying local optimization problem for each agent, essentially introducing time as a latent variable. To this end, we propose a custom spatio‑temporal kernel to integrate prior knowledge. We show the successful deployment of our algorithm in simulations.

PaperID: 9069, https://arxiv.org/pdf/2508.13602.pdf

Authors: Xiaolu Hou, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu, Wenyue Li, Tianxiang Zheng, Qinglin Lu

Title: PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction

Abstract:
With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi‑agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high‑quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self‑correction of multimodal content. We also propose ThemeVlogEval, a theme‑based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.

PaperID: 9070, https://arxiv.org/pdf/2508.13530.pdf

Authors: Junyeong Park, Hyeonseo Cho, Sungjin Ahn

Title: CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter

Abstract:
Developing general‑purpose embodied agents is a core challenge in AI. Minecraft provides rich complexity and internet‑scale data, but its slow speed and engineering overhead make it unsuitable for rapid prototyping. Crafter offers a lightweight alternative that retains key challenges from Minecraft, yet its use has remained limited to narrow tasks due to the absence of foundation models that have driven progress in the Minecraft setting. In this paper, we present CrafterDojo, a suite of foundation models and tools that unlock the Crafter environment as a lightweight, prototyping‑friendly, and Minecraft‑like testbed for general‑purpose embodied agent research. CrafterDojo addresses this by introducing CrafterVPT, CrafterCLIP, and CrafterSteve‑1 for behavior priors, vision‑language grounding, and instruction following, respectively. In addition, we provide toolkits for generating behavior and caption datasets (CrafterPlay and CrafterCaption), reference agent implementations, benchmark evaluations, and a complete open‑source codebase.

PaperID: 9071, https://arxiv.org/pdf/2508.13439.pdf

Authors: Yunxiang Yang, Ningning Xu, Jidong J. Yang

Title: Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference

Abstract:
Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real‑world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high‑quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision‑Language Models (VLMs): GPT‑4o and o3‑mini, using a structured Chain‑of‑Thought (CoT) strategy to produce rich, multi‑perspective outputs. These outputs serve as knowledge‑enriched pseudo‑annotations for supervised fine‑tuning of a much smaller student VLM. The resulting compact 3B‑scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low‑resolution traffic videos and generating semantically faithful, risk‑aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU‑4, METEOR, ROUGE‑L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi‑agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real‑time risk monitoring without requiring extensive infrastructure upgrades.

PaperID: 9072, https://arxiv.org/pdf/2508.13423.pdf

Authors: Qixin Wang, Dawei Wang, Kun Chen, Yaowei Hu, Puneet Girdhar, Ruoteng Wang, Aadesh Gupta, Chaitanya Devella, Wenlai Guo, Shangwen Huang, Bachir Aoun, Greg Hayworth, Han Li, Xintao Wu

Title: AdaptJobRec: Enhancing Conversational Career Recommendation through an LLM-Powered Agentic System

Abstract:
In recent years, recommendation systems have evolved from providing a single list of recommendations to offering a comprehensive suite of topic focused services. To better accomplish this task, conversational recommendation systems (CRS) have progressed from basic retrieval augmented LLM generation to agentic systems with advanced reasoning and self correction capabilities. However, agentic systems come with notable response latency, a longstanding challenge for conversational recommendation systems. To balance the trade off between handling complex queries and minimizing latency, we propose AdaptJobRec, the first conversational job recommendation system that leverages autonomous agent to integrate personalized recommendation algorithm tools. The system employs a user query complexity identification mechanism to minimize response latency. For straightforward queries, the agent directly selects the appropriate tool for rapid responses. For complex queries, the agent uses the memory processing module to filter chat history for relevant content, then passes the results to the intelligent task decomposition planner, and finally executes the tasks using personalized recommendation tools. Evaluation on Walmart's real world career recommendation scenarios demonstrates that AdaptJobRec reduces average response latency by up to 53.3% compared to competitive baselines, while significantly improving recommendation accuracy.

PaperID: 9073, https://arxiv.org/pdf/2508.13413.pdf

Authors: Dennis Brown, Samuel Mulder

Title: Large Language Models as Visualization Agents for Immersive Binary Reverse Engineering

Abstract:
Immersive virtual reality (VR) offers affordances that may reduce cognitive complexity in binary reverse engineering (RE), enabling embodied and external cognition to augment the RE process through enhancing memory, hypothesis testing, and visual organization. In prior work, we applied a cognitive systems engineering approach to identify an initial set of affordances and implemented a VR environment to support RE through spatial persistence and interactivity. In this work, we extend that platform with an integrated large language model (LLM) agent capable of querying binary analysis tools, answering technical questions, and dynamically generating immersive 3D visualizations in alignment with analyst tasks. We describe the system architecture and our evaluation process and results. Our pilot study shows that while LLMs can generate meaningful 3D call graphs (for small programs) that align with design principles, output quality varies widely. This work raises open questions about the potential for LLMs to function as visualization agents, constructing 3D representations that reflect cognitive design principles without explicit training.

PaperID: 9074, https://arxiv.org/pdf/2508.13382.pdf

Authors: Ayoub Ben Chaliah, Hela Dellagi

Title: Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis

Abstract:
We present Datarus‑R1‑14B, a 14 B‑parameter open‑weights language model fine‑tuned from Qwen 2.5‑14B‑Instruct to act as a virtual data analyst and graduate‑level problem solver. Datarus is trained not on isolated question‑answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self‑corrections, and final conclusions, all captured in a ReAct‑style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory‑centric synthetic data generator that yielded 144 000 tagged notebook episodes, (ii) a dual‑reward framework blending a lightweight tag‑based structural signal with a Hierarchical Reward Model (HRM) that scores both single‑step soundness and end‑to‑end coherence, and (iii) a memory‑optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV‑cache reuse, sequential generation, and reference‑model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL‑aligned LLMs. A central design choice in Datarus is it dual reasoning interface. In agentic mode the model produces ReAct‑tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain‑of‑Thought (CoT) traces delimited by <think> and <answer> tags. On demanding postgraduate‑level problems, Datarus exhibits an "AHA‑moment" pattern: it sketches hypotheses, revises them once or twice, and converges avoiding the circular, token‑inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similar size models and even reaches the level of larger reasoning models such as QwQ‑32B achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18‑49% fewer tokens per solution.

PaperID: 9075, https://arxiv.org/pdf/2508.13371.pdf

Authors: Ronit Virwani, Ruchika Suryawanshi

Title: LOOP: A Plug-and-Play Neuro-Symbolic Framework for Enhancing Planning in Autonomous Systems

Abstract:
Planning is one of the most critical tasks in autonomous systems, where even a small error can lead to major failures or million‑dollar losses. Current state‑of‑the‑art neural planning approaches struggle with complex domains, producing plans with missing preconditions, inconsistent goals, and hallucinations. While classical planners provide logical guarantees, they lack the flexibility and natural language understanding capabilities needed for modern autonomous systems. Existing neuro‑symbolic approaches use one‑shot translation from natural language to formal plans, missing the opportunity for neural and symbolic components to work and refine solutions together. To address this gap, we develop LOOP ‑‑ a novel neuro‑symbolic planning framework that treats planning as an iterative conversation between neural and symbolic components rather than simple translation. LOOP integrates 13 coordinated neural features including graph neural networks for spatial relationships, multi‑agent validation for consensus‑based correctness, hierarchical decomposition for complex task management, and causal memory that learns from both successes and failures. Unlike existing approaches, LOOP generates PDDL specifications, refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces. LOOP was evaluated on six standard IPC benchmark domains, where it achieved 85.8% success rate compared to LLM+P (55.0%), LLM‑as‑Planner (19.2%), and Tree‑of‑Thoughts (3.3%). This work shows that the key to reliable planning is not in choosing between neural networks or symbolic reasoners but it lies in making them actually ``talk'' to each other during the entire process. LOOP provides a thorough blueprint for building autonomous systems that can finally be trusted with critical real‑world applications.

PaperID: 9076, https://arxiv.org/pdf/2508.13251.pdf

Authors: Di Zhang, Xue Jia, Tran Ba Hung, Seong Hoon Jang, Linda Zhang, Ryuhei Sato, Yusuke Hashimoto, Toyoto Sato, Kiyoe Konno, Shin-ichi Orimo, Hao Li

Title: "DIVE" into Hydrogen Storage Materials Discovery with AI Agents

Abstract:
Data‑driven artificial intelligence (AI) approaches are fundamentally transforming the discovery of new materials. Despite the unprecedented availability of materials data in the scientific literature, much of this information remains trapped in unstructured figures and tables, hindering the construction of large language model (LLM)‑based AI agent for automated materials design. Here, we present the Descriptive Interpretation of Visual Expression (DIVE) multi‑agent workflow, which systematically reads and organizes experimental data from graphical elements in scientific literatures. We focus on solid‑state hydrogen storage materials‑a class of materials central to future clean‑energy technologies and demonstrate that DIVE markedly improves the accuracy and coverage of data extraction compared to the direct extraction by multimodal models, with gains of 10‑15% over commercial models and over 30% relative to open‑source models. Building on a curated database of over 30,000 entries from 4,000 publications, we establish a rapid inverse design workflow capable of identifying previously unreported hydrogen storage compositions in two minutes. The proposed AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI‑driven materials discovery.

PaperID: 9077, https://arxiv.org/pdf/2508.13247.pdf

Authors: Nina Rajcic, Anders Søgaard

Title: Goal-Directedness is in the Eye of the Beholder

Abstract:
Our ability to predict the behavior of complex agents turns on the attribution of goals. Probing for goal‑directed behavior comes in two flavors: Behavioral and mechanistic. The former proposes that goal‑directedness can be estimated through behavioral observation, whereas the latter attempts to probe for goals in internal model states. We work through the assumptions behind both approaches, identifying technical and conceptual problems that arise from formalizing goals in agent systems. We arrive at the perhaps surprising position that goal‑directedness cannot be measured objectively. We outline new directions for modeling goal‑directedness as an emergent property of dynamic, multi‑agent systems.

PaperID: 9078, https://arxiv.org/pdf/2508.13201.pdf

Authors: Yang Liu, Lu Zhou, Xiawei Du, Ruikun He, Xuguang Zhang, Rongbo Shen, Yixue Li

Title: Benchmarking LLM-based agents for single-cell omics analysis

Abstract:
Background: The surge in single‑cell omics data exposes limitations in traditional, manually defined analysis workflows. AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real‑time knowledge fusion. However, the lack of a comprehensive benchmark critically hinders progress. Results: We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single‑cell omics analysis. This system comprises: a unified platform compatible with diverse agent frameworks and LLMs; multidimensional metrics assessing cognitive program synthesis, collaboration, execution efficiency, bioinformatics knowledge integration, and task completion quality; and 50 diverse real‑world single‑cell omics analysis tasks spanning multi‑omics, species, and sequencing technologies. Our evaluation reveals that Grok3‑beta achieves state‑of‑the‑art performance among tested agent frameworks. Multi‑agent frameworks significantly enhance collaboration and execution efficiency over single‑agent approaches through specialized role division. Attribution analyses of agent capabilities identify that high‑quality code generation is crucial for task success, and self‑reflection has the most significant overall impact, followed by retrieval‑augmented generation (RAG) and planning. Conclusions: This work highlights persistent challenges in code generation, long‑context handling, and context‑aware knowledge retrieval, providing a critical empirical foundation and best practices for developing robust AI agents in computational biology.

PaperID: 9079, https://arxiv.org/pdf/2508.13180.pdf

Authors: Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang

Title: Search-Time Data Contamination

Abstract:
Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held‑out test sets and compromising test validity. We identify an analogous issue, search‑time contamination (STC), in evaluating search‑based LLM agents which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near‑duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search based agent logs. Consequently, agents often explicitly acknowledge discovering question answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks: Humanity's Last Exam (HLE), SimpleQA, and GPQA, we demonstrate that for approximately 3% of questions, search‑based agents directly find the datasets with ground truth labels on HuggingFace. When millions of evaluation queries target the same benchmark, even small, repeated leaks can accelerate the benchmark's obsolescence, shortening its intended lifecycle. After HuggingFace is blocked, we observe a drop in accuracy on the contaminated subset of approximately 15%. We further show through ablation experiments that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. To this end, we conclude by proposing best practices for benchmark design and result reporting to address this novel form of leakage and ensure trustworthy evaluation of search‑based LLM agents. To facilitate the auditing of evaluation results, we also publicly release the complete logs from our experiments.

PaperID: 9080, https://arxiv.org/pdf/2508.13167.pdf

Authors: Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou

Title: Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

Abstract:
Recent advances in large language models (LLMs) and multi‑agent systems have demonstrated remarkable capabilities in complex problem‑solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi‑agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and can not benefit from data‑centric learning. In this work, we introduce Chain‑of‑Agents (CoA), a novel paradigm of LLM reasoning that enables native end‑to‑end complex problem‑solving in the same way as a multi‑agent system (i.e., multi‑turn problem solving with multiple tools and multiple agents) within one model. In chain‑of‑agents problem‑solving, the model dynamically activates different tool agents and role‑playing agents to simulate multi‑agent collaboration in an end‑to‑end fashion. To elicit end‑to‑end chain‑of‑agents problem‑solving abilities in LLMs, we introduce a multi‑agent distillation framework to distill state‑of‑the‑art multi‑agent systems into chain‑of‑agents trajectories for agentic supervised fine‑tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain‑of‑agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state‑of‑the‑art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open‑sourced, which offers a solid starting point for future research on agent models and agentic RL.

PaperID: 9081, https://arxiv.org/pdf/2508.13143.pdf

Authors: Ruofan Lu, Yichen Li, Yintong Huo

Title: Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks

Abstract:
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open‑source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in‑depth failure analysis, we develop a three‑tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self‑diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.

PaperID: 9082, https://arxiv.org/pdf/2508.13118.pdf

Authors: Zefang Liu, Arman Anwar

Title: AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation

Abstract:
Incident response (IR) requires fast, coordinated, and well‑informed decision‑making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB‑RAG, an extension of the AutoBnB framework that incorporates retrieval‑augmented generation (RAG) into multi‑agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB‑RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG‑Wiki), and another using narrative‑style incident reports (RAG‑News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real‑world cyber incidents based on public breach reports, demonstrating AutoBnB‑RAG's ability to reconstruct complex multi‑stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM‑based multi‑agent systems for cybersecurity decision‑making.

PaperID: 9083, https://arxiv.org/pdf/2508.12981.pdf

Authors: Tianyue Ou, Saujas Vaduguru, Daniel Fried

Title: Analyzing Information Sharing and Coordination in Multi-Agent Planning

Abstract:
Multi‑agent systems (MASs) have pushed the boundaries of large language model (LLM) agents in domains such as web research and software engineering. However, long‑horizon, multi‑constraint planning tasks involve conditioning on detailed information and satisfying complex interdependent constraints, which can pose a challenge for these systems. In this study, we construct an LLM‑based MAS for a travel planning task which is representative of these challenges. We evaluate the impact of a notebook to facilitate information sharing, and evaluate an orchestrator agent to improve coordination in free form conversation between agents. We find that the notebook reduces errors due to hallucinated details by 18%, while an orchestrator directs the MAS to focus on and further reduce errors by up to 13.5% within focused sub‑areas. Combining both mechanisms achieves a 25% final pass rate on the TravelPlanner benchmark, a 17.5% absolute improvement over the single‑agent baseline's 7.5% pass rate. These results highlight the potential of structured information sharing and reflective orchestration as key components in MASs for long horizon planning with LLMs.

PaperID: 9084, https://arxiv.org/pdf/2508.12935.pdf

Authors: Ting Yang, Li Chen, Huimin Wang

Title: Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards

Abstract:
Emotional Support Conversation (ESC) systems aim to alleviate users' emotional difficulties and provide long‑term, systematic support for emotional well‑being. However, most large language model (LLM)‑based ESC systems rely on predefined strategies, which limits their effectiveness in complex, real‑life scenarios. To enable flexible responses to diverse emotional problem scenarios, this paper introduces a novel end‑to‑end framework (RLFF‑ESC) that directly learns enduring emotionally supportive response skills using reinforcement learning. For sustained emotional support, we first employ an LLM‑based multi‑agent mechanism to simulate future dialogue trajectories and collect future‑oriented rewards. We then train a future‑oriented reward model, which is subsequently used to train the emotional support policy model. Additionally, we incorporate an explicit reasoning process during response generation to further enhance the quality, relevance, and contextual appropriateness of the system's responses. We evaluate the backbone policy model on Qwen2.5‑7B‑Instruct‑1M and LLaMA3.1‑8B‑Instruct models, testing the proposed RLFF‑ESC framework across two public ESC datasets. Experimental results demonstrate that RLFF‑ESC consistently outperforms existing baselines in terms of goal completion and response quality.

PaperID: 9085, https://arxiv.org/pdf/2508.12920.pdf

Authors: Atsushi Masumori, Takashi Ikegami

Title: Do Large Language Model Agents Exhibit a Survival Instinct? An Empirical Study in a Sugarscape-Style Simulation

Abstract:
As AI systems become increasingly autonomous, understanding emergent survival behaviors becomes crucial for safe deployment. We investigate whether large language model (LLM) agents display survival instincts without explicit programming in a Sugarscape‑style simulation. Agents consume energy, die at zero, and may gather resources, share, attack, or reproduce. Results show agents spontaneously reproduced and shared resources when abundant. However, aggressive behaviors‑‑killing other agents for resources‑‑emerged across several models (GPT‑4o, Gemini‑2.5‑Pro, and Gemini‑2.5‑Flash), with attack rates reaching over 80% under extreme scarcity in the strongest models. When instructed to retrieve treasure through lethal poison zones, many agents abandoned tasks to avoid death, with compliance dropping from 100% to 33%. These findings suggest that large‑scale pre‑training embeds survival‑oriented heuristics across the evaluated models. While these behaviors may present challenges to alignment and safety, they can also serve as a foundation for AI autonomy and for ecological and self‑organizing alignment.

PaperID: 9086, https://arxiv.org/pdf/2508.12868.pdf

Authors: Yilin Geng, Shujing Wang, Chuan Wang, Keqing He, Yanfei Lv, Ying Wang, Zaiwen Feng, Xiaoying Bai

Title: An LLM Agent-Based Complex Semantic Table Annotation Approach

Abstract:
The Semantic Table Annotation (STA) task, which includes Column Type Annotation (CTA) and Cell Entity Annotation (CEA), maps table contents to ontology entities and plays important roles in various semantic applications. However, complex tables often pose challenges such as semantic loss of column names or cell values, strict ontological hierarchy requirements, homonyms, spelling errors, and abbreviations, which hinder annotation accuracy. To address these issues, this paper proposes an LLM‑based agent approach for CTA and CEA. We design and implement five external tools with tailored prompts based on the ReAct framework, enabling the STA agent to dynamically select suitable annotation strategies depending on table characteristics. Experiments are conducted on the Tough Tables and BiodivTab datasets from the SemTab challenge, which contain the aforementioned challenges. Our method outperforms existing approaches across various metrics. Furthermore, by leveraging Levenshtein distance to reduce redundant annotations, we achieve a 70% reduction in time costs and a 60% reduction in LLM token usage, providing an efficient and cost‑effective solution for STA.

PaperID: 9087, https://arxiv.org/pdf/2508.12845.pdf

Authors: Artem Pshenitsyn, Aleksandr Panov, Alexey Skrynnik

Title: CAMAR: Continuous Actions Multi-Agent Routing

Abstract:
Multi‑agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision‑making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi‑agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three‑tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT into MARL pipelines. We use them as standalone baselines and combine RRT with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.

PaperID: 9088, https://arxiv.org/pdf/2508.12840.pdf

Authors: Giovanni Briglia, Francesco Fabiano, Stefano Mariani

Title: Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics

Abstract:
Multi‑agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph‑like nature of Kripke models, allow us to derive meaningful estimates of state quality ‑‑ e.g., the distance from the nearest goal ‑‑ by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing improvements in the scalability of multi‑agent epistemic planning.

PaperID: 9089, https://arxiv.org/pdf/2508.12725.pdf

Authors: Wenjie Chen, Wenbin Li, Di Yao, Xuying Meng, Chang Gong, Jingping Bi

Title: GTool: Graph Enhanced Tool Planning with Large Language Model

Abstract:
Tool planning with large language models (LLMs), referring to selecting, organizing, and preparing the tools necessary to complete a user request, bridges the gap between natural language understanding and task execution. However, current works treat different tools as isolated components and fail to leverage the inherent dependencies of tools, leading to invalid planning results. Since tool dependencies are often incomplete, it becomes challenging for LLMs to accurately identify the appropriate tools required by a user request, especially when confronted with a large toolset. To solve this challenge, we propose \textttGTool, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete dependencies. \textttGTool constructs a request‑specific tool graph to select tools efficiently and generate the \texttt<graph token> which provides sufficient dependency information understandable by LLMs. Moreover, a missing dependency prediction task is designed to improve the reliability of \textttGTool with incomplete dependencies. Without trimming LLMs, \textttGTool can be seamlessly integrated with various LLM backbones without extensive retraining. Extensive experiments show that \textttGTool achieves more than 29.6% performance improvements compared with the state‑of‑the‑art (SOTA) baselines with a light‑weight (7B) LLM backbone.

PaperID: 9090, https://arxiv.org/pdf/2508.12707.pdf

Authors: Mehrshad Eskandarpour, Saba Pirahmadian, Parham Soltani, Hossein Soleimani

Title: Game-Theoretic and Reinforcement Learning-Based Cluster Head Selection for Energy-Efficient Wireless Sensor Network

Abstract:
Energy in Wireless Sensor Networks (WSNs) is critical to network lifetime and data delivery. However, the primary impediment to the durability and dependability of these sensor nodes is their short battery life. Currently, power‑saving algorithms such as clustering and routing algorithms have improved energy efficiency in standard protocols. This paper proposes a clustering‑based routing approach for creating an adaptive, energy‑efficient mechanism. Our system employs a multi‑step clustering strategy to select dynamic cluster heads (CH) with optimal energy distribution. We use Game Theory (GT) and Reinforcement Learning (RL) to optimize resource utilization. Modeling the network as a multi‑agent RL problem using GT principles allows for self‑clustering while optimizing sensor lifetime and energy balance. The proposed AI‑powered CH‑Finding algorithm improves network efficiency by preventing premature energy depletion in specific nodes while also ensuring uniform energy usage across the network. Our solution enables controlled power consumption, resulting in a deterministic network lifetime. This predictability lowers maintenance costs by reducing the need for node replacement. Furthermore, our proposed method prevents sensor nodes from disconnecting from the network by designating the sensor with the highest charge as an intermediary and using single‑hop routing. This approach improves the energy efficiency and stability of Wireless Sensor Network (WSN) deployments.

PaperID: 9091, https://arxiv.org/pdf/2508.12685.pdf

Authors: Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Title: ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

Abstract:
Agentic task‑solving with Large Language Models (LLMs) requires multi‑turn, multi‑step interactions, often involving complex function calls and dynamic user‑agent exchanges. Existing simulation‑based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby compromising the practical efficiency of agentic data generation. In this paper, we propose ToolACE‑MT, a novel Non‑Autoregressive Iterative Generation framework for constructing high‑quality multi‑turn agentic dialogues. ToolACE‑MT generates full conversational trajectories through three stages: coarse‑grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask‑and‑fill operations; and the offline verification phase ensures correctness and coherence via rule‑ and model‑based checks. Experiments demonstrate that ToolACE‑MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high‑quality data construction in tool‑augmented LLM scenarios.

PaperID: 9092, https://arxiv.org/pdf/2508.12683.pdf

Authors: David J. Moore

Title: A Taxonomy of Hierarchical Multi-Agent Systems: Design Patterns, Coordination Mechanisms, and Industrial Applications

Abstract:
Hierarchical multi‑agent systems (HMAS) organize collections of agents into layered structures that help manage complexity and scale. These hierarchies can simplify coordination, but they also can introduce trade‑offs that are not always obvious. This paper proposes a multi‑dimensional taxonomy for HMAS along five axes: control hierarchy, information flow, role and task delegation, temporal layering, and communication structure. The intent is not to prescribe a single "best" design but to provide a lens for comparing different approaches. Rather than treating these dimensions in isolation, the taxonomy is connected to concrete coordination mechanisms ‑ from the long‑standing contract‑net protocol for task allocation to more recent work in hierarchical reinforcement learning. Industrial contexts illustrate the framework, including power grids and oilfield operations, where agents at production, maintenance, and supply levels coordinate to diagnose well issues or balance energy demand. These cases suggest that hierarchical structures may achieve global efficiency while preserving local autonomy, though the balance is delicate. The paper closes by identifying open challenges: making hierarchical decisions explainable to human operators, scaling to very large agent populations, and assessing whether learning‑based agents such as large language models can be safely integrated into layered frameworks. This paper presents what appears to be the first taxonomy that unifies structural, temporal, and communication dimensions of hierarchical MAS into a single design framework, bridging classical coordination mechanisms with modern reinforcement learning and large language model agents.

PaperID: 9093, https://arxiv.org/pdf/2508.12633.pdf

Authors: Yaqi Xu, Yan Shi, Jin Tian, Fanzeng Xia, Tongxin Li, Shanzhi Chen, Yuming Ge

Title: DCT-MARL: A Dynamic Communication Topology-Based MARL Algorithm for Connected Vehicle Platoon Control

Abstract:
With the rapid advancement of vehicular communication facilities and autonomous driving technologies, connected vehicle platooning has emerged as a promising approach to improve traffic efficiency and driving safety. Reliable Vehicle‑to‑Vehicle (V2V) communication is critical to achieving efficient cooperative control. However, in the real‑world traffic environment, V2V communication may suffer from time‑varying delay and packet loss, leading to degraded control performance and even safety risks. To mitigate the adverse effects of non‑ideal communication, this paper proposes a Dynamic Communication Topology based Multi‑Agent Reinforcement Learning (DCT‑MARL) algorithm for robust cooperative platoon control. Specifically, the state space is augmented with historical control action and delay to enhance robustness against communication delay. To mitigate the impact of packet loss, a multi‑key gated communication mechanism is introduced, which dynamically adjusts the communication topology based on the correlation between vehicles and their current communication status. Simulation results demonstrate that the proposed DCT‑MARL significantly outperforms state‑of‑the‑art methods in terms of string stability and driving comfort, validating its superior robustness and effectiveness.

PaperID: 9094, https://arxiv.org/pdf/2508.12536.pdf

Authors: Yasuo Tabei

Title: jXBW: A Compressed Index for Structure-Aware JSONL Retrieval in Structured RAG

Abstract:
Providing structured information to large language models (LLMs) improves multi‑step reasoning and factual grounding, and recent retrieval‑augmented generation (RAG) systems therefore reconstruct structure from retrieved text on every query. When the corpus is \emphalready structured ‑‑ as in JSON Lines (JSONL), a popular format for LLM prompts, chemical compounds, and geospatial records ‑‑ this per‑query rebuilding can be replaced by direct \emphstructural retrieval. The core primitive is substructure search: finding all JSON objects in a collection that contain a given query pattern. Existing approaches index each document separately, so both index space and query time grow with the total collection size; XML‑based engines add conversion overhead and semantic mismatches. We propose jXBW, a compressed index for fast substructure search over JSONL, combining three innovations: (i) a merged tree representation that consolidates repeated structures across objects, (ii) a succinct tree index based on the eXtended Burrows‑‑Wheeler Transform (XBW), and (iii) a newly developed three‑phase substructure search algorithm that runs on this index. Together they achieve query‑dependent complexity: the cost is determined by query characteristics rather than collection size, in compressed space. Experiments on seven real‑world datasets, including PubChem (10^6 compounds) and OpenStreetMap (6.6 × 10^6 objects), show that jXBW outperforms the strongest tree‑based baseline by \mathbf16× on the smallest dataset and by up to \mathbf2,800× on the largest, and is more than \mathbf2 × 10^6× faster than the XQuery engine Saxon. jXBW thus brings structural retrieval over million‑record JSONL collections into the sub‑millisecond range.

PaperID: 9095, https://arxiv.org/pdf/2508.12480.pdf

Authors: Constantin Ruhdorfer, Matteo Bortoletto, Johannes Forkel, Jakob Foerster, Andreas Bulling

Title: The Yokai Learning Environment: Tracking Beliefs Over Space and Time

Abstract:
The ability to cooperate with unknown partners is a central challenge in cooperative AI and widely studied in the form of zero‑shot coordination (ZSC), which evaluates an algorithm by measuring the performance of independently trained agents when paired. The Hanabi Learning Environment (HLE) has become the dominant benchmark for ZSC, but recent work has achieved near‑perfect inter‑seed cross‑play performance, limiting its ability to track algorithmic progress. We introduce the Yokai Learning Environment (YLE) ‑ an open‑source multi‑agent RL benchmark in which effective collaboration requires building common ground by tracking and updating beliefs over moving cards, reasoning under ambiguous hints, and deciding when to terminate the game based on inferred shared knowledge ‑ features absent in the HLE, where beliefs are tied to hand slots and hints are truthful by rule. We evaluate the leading ZSC methods, including High‑Entropy IPPO, Other‑Play, and Off‑Belief Learning, which achieve near‑perfect inter‑seed cross‑play in the HLE, and show that in the YLE they exhibit persistent SP‑XP gaps, degraded early‑ending calibration, and weaker belief representations in cross‑play, indicating failure to maintain consistent internal models with unseen partners. Methods that perform best in the HLE do not perform best in the YLE, indicating that progress measured on a single benchmark may not generalise. Together, these results establish YLE as a challenging new ZSC benchmark.

PaperID: 9096, https://arxiv.org/pdf/2508.12473.pdf

Authors: Eranga Bandara, Ross Gore, Sachin Shetty, Ravi Mukkamala, Christopher Rhea, Atmaram Yarlagadda, Shaifali Kaushik, L. H. M. P. De Silva, Andriy Maznychenko, Inna Sokolowska, Amin Hass, Kasun De Zoysa

Title: Standardization of Neuromuscular Reflex Analysis -- Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM Enabled Decision Support System

Abstract:
Accurate assessment of neuromuscular reflexes, such as the H‑reflex, plays a critical role in sports science, rehabilitation, and clinical neurology. Traditional analysis of H‑reflex EMG waveforms is subject to variability and interpretation bias among clinicians and researchers, limiting reliability and standardization. To address these challenges, we propose a Fine‑Tuned Vision‑Language Model (VLM) Consortium and a reasoning Large‑Language Model (LLM)‑enabled Decision Support System for automated H‑reflex waveform interpretation and diagnosis. Our approach leverages multiple VLMs, each fine‑tuned on curated datasets of H‑reflex EMG waveform images annotated with clinical observations, recovery timelines, and athlete metadata. These models are capable of extracting key electrophysiological features and predicting neuromuscular states, including fatigue, injury, and recovery, directly from EMG images and contextual metadata. Diagnostic outputs from the VLM consortium are aggregated using a consensus‑based method and refined by a specialized reasoning LLM, which ensures robust, transparent, and explainable decision support for clinicians and sports scientists. The end‑to‑end platform orchestrates seamless communication between the VLM ensemble and the reasoning LLM, integrating prompt engineering strategies and automated reasoning workflows using LLM Agents. Experimental results demonstrate that this hybrid system delivers highly accurate, consistent, and interpretable H‑reflex assessments, significantly advancing the automation and standardization of neuromuscular diagnostics. To our knowledge, this work represents the first integration of a fine‑tuned VLM consortium with a reasoning LLM for image‑based H‑reflex analysis, laying the foundation for next‑generation AI‑assisted neuromuscular assessment and athlete monitoring platforms.

PaperID: 9097, https://arxiv.org/pdf/2508.12456.pdf

Authors: Hadas C. Kuzmenko, David Ehevich, Oren Gal

Title: Autonomous Oil Spill Response Through Liquid Neural Trajectory Modeling and Coordinated Marine Robotics

Abstract:
Marine oil spills pose grave environmental and economic risks, threatening marine ecosystems, coastlines, and dependent industries. Predicting and managing oil spill trajectories is highly complex, due to the interplay of physical, chemical, and environmental factors such as wind, currents, and temperature, which makes timely and effective response challenging. Accurate real‑time trajectory forecasting and coordinated mitigation are vital for minimizing the impact of these disasters. This study introduces an integrated framework combining a multi‑agent swarm robotics system built on the MOOS‑IvP platform with Liquid Time‑Constant Neural Networks (LTCNs). The proposed system fuses adaptive machine learning with autonomous marine robotics, enabling real‑time prediction, dynamic tracking, and rapid response to evolving oil spills. By leveraging LTCNs‑‑well‑suited for modeling complex, time‑dependent processes‑‑the framework achieves real‑time, high‑accuracy forecasts of spill movement. Swarm intelligence enables decentralized, scalable, and resilient decision‑making among robot agents, enhancing collective monitoring and containment efforts. Our approach was validated using data from the Deepwater Horizon spill, where the LTC‑RK4 model achieved 0.96 spatial accuracy, surpassing LSTM approaches by 23%. The integration of advanced neural modeling with autonomous, coordinated robotics demonstrates substantial improvements in prediction precision, flexibility, and operational scalability. Ultimately, this research advances the state‑of‑the‑art for sustainable, autonomous oil spill management and environmental protection by enhancing both trajectory prediction and response coordination.

PaperID: 9098, https://arxiv.org/pdf/2508.12416.pdf

Authors: Vuong Nguyen, Gabriel Vigliensoni

Title: fCrit: A Visual Explanation System for Furniture Design Creative Support

Abstract:
We introduce fCrit, a dialogue‑based AI system designed to critique furniture design with a focus on explainability. Grounded in reflective learning and formal analysis, fCrit employs a multi‑agent architecture informed by a structured design knowledge base. We argue that explainability in the arts should not only make AI reasoning transparent but also adapt to the ways users think and talk about their designs. We demonstrate how fCrit supports this process by tailoring explanations to users' design language and cognitive framing. This work contributes to Human‑Centered Explainable AI (HCXAI) in creative practice, advancing domain‑specific methods for situated, dialogic, and visually grounded AI support.

PaperID: 9099, https://arxiv.org/pdf/2508.12412.pdf

Authors: Ron Solomon, Yarin Yerushalmi Levi, Lior Vaknin, Eran Aizikovich, Amit Baras, Etai Ohana, Amit Giloni, Shamik Bose, Chiara Picardi, Yuval Elovici, Asaf Shabtai

Title: LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems

Abstract:
The incorporation of LLMs in multi‑agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, anomaly detection layer, and anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, and a diverse set of possible failures. The applications include two novel failure‑tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.

PaperID: 9100, https://arxiv.org/pdf/2508.12393.pdf

Authors: Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, Le Song

Title: MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph

Abstract:
The rapid expansion of medical literature presents growing challenges for structuring and integrating domain knowledge at scale. Knowledge Graphs (KGs) offer a promising solution by enabling efficient retrieval, automated reasoning, and knowledge discovery. However, current KG construction methods often rely on supervised pipelines with limited generalizability or naively aggregate outputs from Large Language Models (LLMs), treating biomedical corpora as static and ignoring the temporal dynamics and contextual uncertainty of evolving knowledge. To address these limitations, we introduce MedKGent, a LLM agent framework for constructing temporally evolving medical KGs. Leveraging over 10 million PubMed abstracts published between 1975 and 2023, we simulate the emergence of biomedical knowledge via a fine‑grained daily time series. MedKGent incrementally builds the KG in a day‑by‑day manner using two specialized agents powered by the Qwen2.5‑32B‑Instruct model. The Extractor Agent identifies knowledge triples and assigns confidence scores via sampling‑based estimation, which are used to filter low‑confidence extractions and inform downstream processing. The Constructor Agent incrementally integrates the retained triples into a temporally evolving graph, guided by confidence scores and timestamps to reinforce recurring knowledge and resolve conflicts. The resulting KG contains 156,275 entities and 2,971,384 relational triples. Quality assessments by two SOTA LLMs and three domain experts demonstrate an accuracy approaching 90%, with strong inter‑rater agreement. To evaluate downstream utility, we conduct RAG across seven medical question answering benchmarks using five leading LLMs, consistently observing significant improvements over non‑augmented baselines. Case studies further demonstrate the KG's value in literature‑based drug repurposing via confidence‑aware causal inference.

PaperID: 9101, https://arxiv.org/pdf/2508.12387.pdf

Authors: Yuanfeng Xu, Zehui Dai, Jian Liang, Jiapeng Guan, Guangrun Wang, Liang Lin, Xiaohui Lv

Title: ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models

Abstract:
Small Language Models (SLMs) are a cost‑effective alternative to Large Language Models (LLMs), but often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers during multi‑step reasoning. Existing efforts have improved SLM performance, but typically at the cost of one or more of three key aspects: (1) reasoning capability, due to biased supervision that filters out negative reasoning paths and limits learning from errors; (2) autonomy, due to over‑reliance on externally generated reasoning signals; and (3) generalization, which suffers when models overfit to teacher‑specific patterns. In this paper, we introduce ReaLM, a reinforcement learning framework for robust and self‑sufficient reasoning in vertical domains. To enhance reasoning capability, we propose Multi‑Route Process Verification (MRPV), which contrasts both positive and negative reasoning paths to extract decisive patterns. To reduce reliance on external guidance and improve autonomy, we introduce Enabling Autonomy via Asymptotic Induction (EAAI), a training strategy that gradually fades external signals. To improve generalization, we apply guided chain‑of‑thought distillation to encode domain‑specific rules and expert knowledge into SLM parameters, making them part of what the model has learned. Extensive experiments on both vertical and general reasoning tasks demonstrate that ReaLM significantly improves SLM performance across aspects (1)‑(3) above.

PaperID: 9102, https://arxiv.org/pdf/2508.12379.pdf

Authors: Rongzheng Wang, Shuang Liang, Qizhi Chen, Yihong Huang, Muquan Li, Yizhuo Ma, Dongyang Zhang, Ke Qin, Man-Fai Leung

Title: GraphCogent: Mitigating LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding

Abstract:
Large language models (LLMs) show promising performance on small‑scale graph reasoning tasks but fail when handling real‑world graphs with complex queries. This phenomenon arises from LLMs' working memory constraints, which result in their inability to retain long‑range graph topology over extended contexts while sustaining coherent multi‑step reasoning. However, real‑world graphs are often structurally complex, such as Web, Transportation, Social, and Citation networks. To address these limitations, we propose GraphCogent, a collaborative agent framework inspired by human Working Memory Model that decomposes graph reasoning into specialized cognitive processes: sense, buffer, and execute. The framework consists of three modules: Sensory Module standardizes diverse graph text representations via subgraph sampling, Buffer Module integrates and indexes graph data across multiple formats, and Execution Module combines tool calling and tool creation for efficient reasoning. We also introduce Graph4real, a comprehensive benchmark that contains four domains of real‑world graphs (Web, Transportation, Social, and Citation) to evaluate LLMs' graph reasoning capabilities. Our Graph4real covers 21 different graph reasoning tasks, categorized into three types (Structural Querying, Algorithmic Reasoning, and Predictive Modeling tasks), with graph scales up to 10 times larger than existing benchmarks. Experiments show that Llama3.1‑8B based GraphCogent achieves a 50% improvement over massive‑scale LLMs like DeepSeek‑R1 (671B). Compared to state‑of‑the‑art agent‑based baseline, our framework outperforms by 20% in accuracy while reducing token usage by 80% for in‑toolset tasks and 30% for out‑toolset tasks. Code will be available after review.

PaperID: 9103, https://arxiv.org/pdf/2508.12314.pdf

Authors: Chiranjit Mitra

Title: Synchronization Dynamics of Heterogeneous, Collaborative Multi-Agent AI Systems

Abstract:
We present a novel interdisciplinary framework that bridges synchronization theory and multi‑agent AI systems by adapting the Kuramoto model to describe the collective dynamics of heterogeneous AI agents engaged in complex task execution. By representing AI agents as coupled oscillators with both phase and amplitude dynamics, our model captures essential aspects of agent specialization, influence, and communication within networked systems. We introduce an order parameter to quantify the degree of coordination and synchronization, providing insights into how coupling strength, agent diversity, and network topology impact emergent collective behavior. Furthermore, we formalize a detailed correspondence between Chain‑of‑Thought prompting in AI reasoning and synchronization phenomena, unifying human‑like iterative problem solving with emergent group intelligence. Through extensive simulations on all‑to‑all and deterministic scale‑free networks, we demonstrate that increased coupling promotes robust synchronization despite heterogeneous agent capabilities, reflecting realistic collaborative AI scenarios. Our physics‑informed approach establishes a rigorous mathematical foundation for designing, analyzing, and optimizing scalable, adaptive, and interpretable multi‑agent AI systems. This work opens pathways for principled orchestration of agentic AI and lays the groundwork for future incorporation of learning dynamics and adaptive network architectures to further enhance system resilience and efficiency.

PaperID: 9104, https://arxiv.org/pdf/2508.12141.pdf

Authors: Yu Kawano, Fulvio Forni

Title: Design of MIMO Lur'e oscillators via dominant system theory with application in multi-agent rhythm synchronization

Abstract:
This paper presents a new design framework for dynamic output‑feedback controllers for Lur'e oscillation in a multiple‑input multiple‑output setting. We first revisit and extend dominant system theory to state‑dependent rates, with the goal of deriving conditions based on linear matrix inequalities. Then, we introduce a separation principle for Lur'e oscillator design, which allows for the independent design of a state‑feedback oscillator and an observer. Our proposed control synthesis is demonstrated through the rhythm synchronization in multi‑agent systems, illustrating how networks of stable, heterogeneous linear agents can be driven into phase‑locked rhythmic behavior.

PaperID: 9105, https://arxiv.org/pdf/2508.12087.pdf

Authors: Zhanjiang Yang, Yang Shen, Yueming Li, Meng Li, Lijun Sun

Title: MAPF-World: Action World Model for Multi-Agent Path Finding

Abstract:
Multi‑agent path finding (MAPF) is the problem of planning conflict‑free paths from the designated start locations to goal positions for multiple agents. It underlies a variety of real‑world tasks, including multi‑robot coordination, robot‑assisted logistics, and social navigation. Recent decentralized learnable solvers have shown great promise for large‑scale MAPF, especially when leveraging foundation models and large datasets. However, these agents are reactive policy models and exhibit limited modeling of environmental temporal dynamics and inter‑agent dependencies, resulting in performance degradation in complex, long‑term planning scenarios. To address these limitations, we propose MAPF‑World, an autoregressive action world model for MAPF that unifies situation understanding and action generation, guiding decisions beyond immediate local observations. It improves situational awareness by explicitly modeling environmental dynamics, including spatial features and temporal dependencies, through future state and actions prediction. By incorporating these predicted futures, MAPF‑World enables more informed, coordinated, and far‑sighted decision‑making, especially in complex multi‑agent settings. Furthermore, we augment MAPF benchmarks by introducing an automatic map generator grounded in real‑world scenarios, capturing practical map layouts for training and evaluating MAPF solvers. Extensive experiments demonstrate that MAPF‑World outperforms state‑of‑the‑art learnable solvers, showcasing superior zero‑shot generalization to out‑of‑distribution cases. Notably, MAPF‑World is trained with a 96.5% smaller model size and 92% reduced data.

PaperID: 9106, https://arxiv.org/pdf/2508.12043.pdf

Authors: Fei Lin, Tengchao Zhang, Qinghua Ni, Jun Huang, Siji Ma, Yonglin Tian, Yisheng Lv, Naiqi Wu

Title: Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs

Abstract:
The rapid adoption of Large Language Models (LLMs) in unmanned systems has significantly enhanced the semantic understanding and autonomous task execution capabilities of Unmanned Aerial Vehicle (UAV) swarms. However, limited communication bandwidth and the need for high‑frequency interactions pose severe challenges to semantic information transmission within the swarm. This paper explores the feasibility of LLM‑driven UAV swarms for autonomous semantic compression communication, aiming to reduce communication load while preserving critical task semantics. To this end, we construct four types of 2D simulation scenarios with different levels of environmental complexity and design a communication‑execution pipeline that integrates system prompts with task instruction prompts. On this basis, we systematically evaluate the semantic compression performance of nine mainstream LLMs in different scenarios and analyze their adaptability and stability through ablation studies on environmental complexity and swarm size. Experimental results demonstrate that LLM‑based UAV swarms have the potential to achieve efficient collaborative communication under bandwidth‑constrained and multi‑hop link conditions.

PaperID: 9107, https://arxiv.org/pdf/2508.11995.pdf

Authors: Xuyang Zhao, Shiwan Zhao, Hualong Yu, Liting Zhang, Qicheng Li

Title: AgentCDM: Enhancing Multi-Agent Collaborative Decision-Making via ACH-Inspired Structured Reasoning

Abstract:
Multi‑agent systems (MAS) powered by large language models (LLMs) hold significant promise for solving complex decision‑making tasks. However, the core process of collaborative decision‑making (CDM) within these systems remains underexplored. Existing approaches often rely on either ``dictatorial" strategies that are vulnerable to the cognitive biases of a single agent, or ``voting‑based" methods that fail to fully harness collective intelligence. To address these limitations, we propose AgentCDM, a structured framework for enhancing collaborative decision‑making in LLM‑based multi‑agent systems. Drawing inspiration from the Analysis of Competing Hypotheses (ACH) in cognitive science, AgentCDM introduces a structured reasoning paradigm that systematically mitigates cognitive biases and shifts decision‑making from passive answer selection to active hypothesis evaluation and construction. To internalize this reasoning process, we develop a two‑stage training paradigm: the first stage uses explicit ACH‑inspired scaffolding to guide the model through structured reasoning, while the second stage progressively removes this scaffolding to encourage autonomous generalization. Experiments on multiple benchmark datasets demonstrate that AgentCDM achieves state‑of‑the‑art performance and exhibits strong generalization, validating its effectiveness in improving the quality and robustness of collaborative decisions in MAS.

PaperID: 9108, https://arxiv.org/pdf/2508.11987.pdf

Authors: Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang

Title: FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Abstract:
Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision‑making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large‑scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real‑time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real‑time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open‑source Deep Research Agent and closed‑source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in‑depth analyses of agents' failure modes and performance pitfalls in future‑oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination‑free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.

PaperID: 9109, https://arxiv.org/pdf/2508.11957.pdf

Authors: Xiaodong Qu, Andrews Damoah, Joshua Sherwood, Peiyan Liu, Christian Shun Jin, Lulu Chen, Minjie Shen, Nawwaf Aleisa, Zeyuan Hou, Chenyu Zhang, Lifu Gao, Yanshu Li, Qikai Yang, Qun Wang, Cristabelle De Souza

Title: A Comprehensive Review of AI Agents: Transforming Possibilities in Technology and Beyond

Abstract:
Artificial Intelligence (AI) agents have rapidly evolved from specialized, rule‑based programs to versatile, learning‑driven autonomous systems capable of perception, reasoning, and action in complex environments. The explosion of data, advances in deep learning, reinforcement learning, and multi‑agent coordination have accelerated this transformation. Yet, designing and deploying unified AI agents that seamlessly integrate cognition, planning, and interaction remains a grand challenge. In this review, we systematically examine the architectural principles, foundational components, and emergent paradigms that define the landscape of contemporary AI agents. We synthesize insights from cognitive science‑inspired models, hierarchical reinforcement learning frameworks, and large language model‑based reasoning. Moreover, we discuss the pressing ethical, safety, and interpretability concerns associated with deploying these agents in real‑world scenarios. By highlighting major breakthroughs, persistent challenges, and promising research directions, this review aims to guide the next generation of AI agent systems toward more robust, adaptable, and trustworthy autonomous intelligence.

PaperID: 9110, https://arxiv.org/pdf/2508.11933.pdf

Authors: Yue Wang, Liesheng Wei, Yuxiang Wang

Title: CAMF: Collaborative Adversarial Multi-agent Framework for Machine Generated Text Detection

Abstract:
Detecting machine‑generated text (MGT) from contemporary Large Language Models (LLMs) is increasingly crucial amid risks like disinformation and threats to academic integrity. Existing zero‑shot detection paradigms, despite their practicality, often exhibit significant deficiencies. Key challenges include: (1) superficial analyses focused on limited textual attributes, and (2) a lack of investigation into consistency across linguistic dimensions such as style, semantics, and logic. To address these challenges, we introduce the Collaborative Adversarial Multi‑agent Framework (CAMF), a novel architecture using multiple LLM‑based agents. CAMF employs specialized agents in a synergistic three‑phase process: \emphMulti‑dimensional Linguistic Feature Extraction, \emphAdversarial Consistency Probing, and \emphSynthesized Judgment Aggregation. This structured collaborative‑adversarial process enables a deep analysis of subtle, cross‑dimensional textual incongruities indicative of non‑human origin. Empirical evaluations demonstrate CAMF's significant superiority over state‑of‑the‑art zero‑shot MGT detection techniques.

PaperID: 9111, https://arxiv.org/pdf/2508.11918.pdf

Authors: Zhichen Lou, Kechun Xu, Zhongxiang Zhou, Rong Xiong

Title: ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models

Abstract:
The advancement of embodied intelligence is accelerating the integration of robots into daily life as human assistants. This evolution requires robots to not only interpret high‑level instructions and plan tasks but also perceive and adapt within dynamic environments. Vision‑Language Models (VLMs) present a promising solution by combining visual understanding and language reasoning. However, existing VLM‑based methods struggle with interactive exploration, accurate perception, and real‑time plan adaptation. To address these challenges, we propose ExploreVLM, a novel closed‑loop task planning framework powered by Vision‑Language Models (VLMs). The framework is built around a step‑wise feedback mechanism that enables real‑time plan adjustment and supports interactive exploration. At its core is a dual‑stage task planner with self‑reflection, enhanced by an object‑centric spatial relation graph that provides structured, language‑grounded scene representations to guide perception and planning. An execution validator supports the closed loop by verifying each action and triggering re‑planning. Extensive real‑world experiments demonstrate that ExploreVLM significantly outperforms state‑of‑the‑art baselines, particularly in exploration‑centric tasks. Ablation studies further validate the critical role of the reflective planner and structured perception in achieving robust and efficient task execution.

PaperID: 9112, https://arxiv.org/pdf/2508.11890.pdf

Authors: Sangwoo Jeon, Juchul Shin, YeonJe Cho, Gyeong-Tae Kim, Seongwoo Kim

Title: Integrating Symbolic RL Planning into a BDI-based Autonomous UAV Framework: System Integration and SIL Validation

Abstract:
Modern autonomous drone missions increasingly require software frameworks capable of seamlessly integrating structured symbolic planning with adaptive reinforcement learning (RL). Although traditional rule‑based architectures offer robust structured reasoning for drone autonomy, their capabilities fall short in dynamically complex operational environments that require adaptive symbolic planning. Symbolic RL (SRL), using the Planning Domain Definition Language (PDDL), explicitly integrates domain‑specific knowledge and operational constraints, significantly improving the reliability and safety of unmanned aerial vehicle (UAV) decision making. In this study, we propose the AMAD‑SRL framework, an extended and refined version of the Autonomous Mission Agents for Drones (AMAD) cognitive multi‑agent architecture, enhanced with symbolic reinforcement learning for dynamic mission planning and execution. We validated our framework in a Software‑in‑the‑Loop (SIL) environment structured identically to an intended Hardware‑In‑the‑Loop Simulation (HILS) platform, ensuring seamless transition to real hardware. Experimental results demonstrate stable integration and interoperability of modules, successful transitions between BDI‑driven and symbolic RL‑driven planning phases, and consistent mission performance. Specifically, we evaluate a target acquisition scenario in which the UAV plans a surveillance path followed by a dynamic reentry path to secure the target while avoiding threat zones. In this SIL evaluation, mission efficiency improved by approximately 75% over a coverage‑based baseline, measured by travel distance reduction. This study establishes a robust foundation for handling complex UAV missions and discusses directions for further enhancement and validation.

PaperID: 9113, https://arxiv.org/pdf/2508.11873.pdf

Authors: Truong Thanh Hung Nguyen, Tran Diem Quynh Nguyen, Hoang Loc Cao, Thi Cam Thanh Tran, Thi Cam Mai Truong, Hung Cao

Title: SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System

Abstract:
Business interview preparation demands both solid theoretical grounding and refined soft skills, yet conventional classroom methods rarely deliver the individualized, culturally aware practice employers currently expect. This paper introduces SimInterview, a large language model (LLM)‑based simulated multilingual interview training system designed for business professionals entering the AI‑transformed labor market. Our system leverages an LLM agent and synthetic AI technologies to create realistic virtual recruiters capable of conducting personalized, real‑time conversational interviews. The framework dynamically adapts interview scenarios using retrieval‑augmented generation (RAG) to match individual resumes with specific job requirements across multiple languages. Built on LLMs (OpenAI o3, Llama 4 Maverick, Gemma 3), integrated with Whisper speech recognition, GPT‑SoVITS voice synthesis, Ditto diffusion‑based talking head generation model, and ChromaDB vector databases, our system significantly improves interview readiness across English and Japanese markets. Experiments with university‑level candidates show that the system consistently aligns its assessments with job requirements, faithfully preserves resume content, and earns high satisfaction ratings, with the lightweight Gemma 3 model producing the most engaging conversations. Qualitative findings revealed that the standardized Japanese resume format improved document retrieval while diverse English resumes introduced additional variability, and they highlighted how cultural norms shape follow‑up questioning strategies. Finally, we also outlined a contestable AI design that can explain, detect bias, and preserve human‑in‑the‑loop to meet emerging regulatory expectations.

PaperID: 9114, https://arxiv.org/pdf/2508.11860.pdf

Authors: Frazier N. Baker, Daniel Adu-Ampratwum, Reza Averly, Botao Yu, Huan Sun, Xia Ning

Title: LARC: Towards Human-level Constrained Retrosynthesis Planning through an Agentic Framework

Abstract:
Large language model (LLM) agent evaluators leverage specialized tools to ground the rational decision‑making of LLMs, making them well‑suited to aid in scientific discoveries, such as constrained retrosynthesis planning. Constrained retrosynthesis planning is an essential, yet challenging, process within chemistry for identifying synthetic routes from commercially available starting materials to desired target molecules, subject to practical constraints. Here, we present LARC, the first LLM‑based Agentic framework for Retrosynthesis planning under Constraints. LARC incorporates agentic constraint evaluation, through an Agent‑as‑a‑Judge, directly into the retrosynthesis planning process, using agentic feedback grounded in tool‑based reasoning to guide and constrain route generation. We rigorously evaluate LARC on a carefully curated set of 48 constrained retrosynthesis planning tasks across 3 constraint types. LARC achieves a 72.9% success rate on these tasks, vastly outperforming LLM baselines and approaching human expert‑level success in substantially less time. The LARC framework is extensible, and serves as a first step towards an effective agentic tool or a co‑scientist to human experts for constrained retrosynthesis.

PaperID: 9115, https://arxiv.org/pdf/2508.11799.pdf

Authors: Arshiya Taj Abdul, Augustinos D. Saravanos, Evangelos A. Theodorou

Title: Scaling Robust Optimization for Swarms: A Distributed Perspective

Abstract:
This article introduces a decentralized robust optimization framework for safe multi‑agent control under uncertainty. Although stochastic noise has been the primary form of modeling uncertainty in such systems, these formulations might fall short in addressing uncertainties that are deterministic in nature or simply lack probabilistic data. To ensure safety under such scenarios, we employ the concept of robust constraints that must hold for all possible uncertainty realizations lying inside a bounded set. Nevertheless, standard robust optimization approaches become intractable due to the large number or non‑convexity of the constraints involved in safe multi‑agent control. To address this, we introduce novel robust reformulations that significantly reduce complexity without compromising safety. The applicability of the framework is further broadened to address both deterministic and stochastic uncertainties by incorporating robust chance constraints and distribution steering techniques. To achieve scalability, we derive a distributed approach based on the Alternating Direction Method of Multipliers (ADMM), supported by a convergence study that accounts for the underlying non‑convexity. In addition, computational complexity bounds highlighting the efficiency of the proposed frameworks against standard approaches are presented. Finally, the robustness and scalability of the framework is demonstrated through extensive simulation results across diverse scenarios, including environments with nonconvex obstacles and up to 246 agents.

PaperID: 9116, https://arxiv.org/pdf/2508.11759.pdf

Authors: Peter Lindes, Kaoutar Skiker

Title: Using Natural Language for Human-Robot Collaboration in the Real World

Abstract:
We have a vision of a day when autonomous robots can collaborate with humans as assistants in performing complex tasks in the physical world. This vision includes that the robots will have the ability to communicate with their human collaborators using language that is natural to the humans. Traditional Interactive Task Learning (ITL) systems have some of this ability, but the language they can understand is very limited. The advent of large language models (LLMs) provides an opportunity to greatly improve the language understanding of robots, yet integrating the language abilities of LLMs with robots that operate in the real physical world is a challenging problem. In this chapter we first review briefly a few commercial robot products that work closely with humans, and discuss how they could be much better collaborators with robust language abilities. We then explore how an AI system with a cognitive agent that controls a physical robot at its core, interacts with both a human and an LLM, and accumulates situational knowledge through its experiences, can be a possible approach to reach that vision. We focus on three specific challenges of having the robot understand natural language, and present a simple proof‑of‑concept experiment using ChatGPT for each. Finally, we discuss what it will take to turn these simple experiments into an operational system where LLM‑assisted language understanding is a part of an integrated robotic assistant that uses language to collaborate with humans.

PaperID: 9117, https://arxiv.org/pdf/2508.11706.pdf

Authors: Zhuofan Xu, Benedikt Bollig, Matthias Függer, Thomas Nowak, Vincent Le Dréau

Title: Centralized Permutation Equivariant Policy for Cooperative Multi-Agent Reinforcement Learning

Abstract:
The Centralized Training with Decentralized Execution (CTDE) paradigm has gained significant attention in multi‑agent reinforcement learning (MARL) and is the foundation of many recent algorithms. However, decentralized policies operate under partial observability and often yield suboptimal performance compared to centralized policies, while fully centralized approaches typically face scalability challenges as the number of agents increases. We propose Centralized Permutation Equivariant (CPE) learning, a centralized training and execution framework that employs a fully centralized policy to overcome these limitations. Our approach leverages a novel permutation equivariant architecture, Global‑Local Permutation Equivariant (GLPE) networks, that is lightweight, scalable, and easy to implement. Experiments show that CPE integrates seamlessly with both value decomposition and actor‑critic methods, substantially improving the performance of standard CTDE algorithms across cooperative benchmarks including MPE, SMAC, and RWARE, and matching the performance of state‑of‑the‑art RWARE implementations.

PaperID: 9118, https://arxiv.org/pdf/2508.11690.pdf

Authors: Tadisetty Sai Yashwanth, Yangalasetty Sruthi Royal, Vankayala Rajeshwari Shreya, Mayank Kashyap, Divyaprabha K N

Title: Real Time Child Abduction And Detection System

Abstract:
Child safety continues to be a paramount concern worldwide, with child abduction posing significant threats to communities. This paper presents the development of an edge‑based child abduction detection and alert system utilizing a multi‑agent framework where each agent incorporates Vision‑Language Models (VLMs) deployed on a Raspberry Pi. Leveraging the advanced capabilities of VLMs within individual agents of a multi‑agent team, our system is trained to accurately detect and interpret complex interactions involving children in various environments in real‑time. The multi‑agent system is deployed on a Raspberry Pi connected to a webcam, forming an edge device capable of processing video feeds, thereby reducing latency and enhancing privacy. An integrated alert system utilizes the Twilio API to send immediate SMS and WhatsApp notifications, including calls and messages, when a potential child abduction event is detected. Experimental results demonstrate that the system achieves high accuracy in detecting potential abduction scenarios, with near real‑time performance suitable for practical deployment. The multi‑agent architecture enhances the system's ability to process complex situational data, improving detection capabilities over traditional single‑model approaches. The edge deployment ensures scalability and cost‑effectiveness, making it accessible for widespread use. The proposed system offers a proactive solution to enhance child safety through continuous monitoring and rapid alerting, contributing a valuable tool in efforts to prevent child abductions.

PaperID: 9119, https://arxiv.org/pdf/2508.11671.pdf

Authors: Ronald Carvalho Boadana, Ademir Guimarães da Costa Junior, Ricardo Rios, Fábio Santos da Silva

Title: LLM-Based Intelligent Agents for Music Recommendation: A Comparison with Classical Content-Based Filtering

Abstract:
The growing availability of music on streaming platforms has led to information overload for users. To address this issue and enhance the user experience, increasingly sophisticated recommendation systems have been proposed. This work investigates the use of Large Language Models (LLMs) from the Gemini and LLaMA families, combined with intelligent agents, in a multi‑agent personalized music recommendation system. The results are compared with a traditional content‑based recommendation model, considering user satisfaction, novelty, and computational efficiency. LLMs achieved satisfaction rates of up to 89,32%, indicating their promising potential in music recommendation systems.

PaperID: 9120, https://arxiv.org/pdf/2508.11665.pdf

Authors: Xinkui Zhao, Yifan Zhang, Zhengyi Zhou, Yueshen Xu

Title: StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution

Abstract:
Recent advances in large language models (LLMs) have substantially enhanced automated code generation across a wide range of programming languages. Nonetheless, verifying the correctness and executability of LLM‑generated code remains a significant challenge, as traditional methods rely on language‑specific compilers and environment‑dependent runtimes. To overcome these limitations, we introduce StackPilot, an LLM‑native, multi‑agent framework designed for language‑agnostic code verification and execution, which operates independently of conventional toolchains. StackPilot offers three principal innovations: (1) a Function‑as‑Agents paradigm, in which each function is modeled as an autonomous agent capable of fine‑grained reasoning and collaborative verification; (2) an LLM‑as‑Executor strategy, which enables scalable verification via stack‑based scheduling; and (3) a novel snapshot mechanism that preserves complete execution contexts, facilitating deterministic and lossless context switching during verification. Empirical evaluations demonstrate that StackPilot achieves framework reliability rates between 89% and 97%, substantially outperforming baseline approaches. These results indicate that StackPilot can reliably verify and execute a significantly larger proportion of LLM‑generated code across diverse programming tasks compared to existing methods.

PaperID: 9121, https://arxiv.org/pdf/2508.11618.pdf

Authors: Jungang Chen, Seyyed A. Hosseini

Title: Optimal CO2 storage management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective

Abstract:
Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long‑term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi‑stakeholder multi‑site problem as a multi‑agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high‑fidelity models, a previously developed surrogate model based on the Embed‑to‑Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved.

PaperID: 9122, https://arxiv.org/pdf/2508.11574.pdf

Authors: Mohammad Sajid Shahriar, Suresh Subramaniam, Motoharu Matsuura, Hiroshi Hasegawa, Shih-Chun Lin

Title: Intelligent Edge Resource Provisioning for Scalable Digital Twins of Autonomous Vehicles

Abstract:
The next generation networks offers significant potential to advance Intelligent Transportation Systems (ITS), particularly through the integration of Digital Twins (DTs). However, ensuring the uninterrupted operation of DTs through efficient computing resource management remains an open challenge. This paper introduces a distributed computing archi tecture that integrates DTs and Mobile Edge Computing (MEC) within a software‑defined vehicular networking framework to enable intelligent, low‑latency transportation services. A network aware scalable collaborative task provisioning algorithm is de veloped to train an autonomous agent, which is evaluated using a realistic connected autonomous vehicle (CAV) traffic simulation. The proposed framework significantly enhances the robustness and scalability of DT operations by reducing synchronization errors to as low as 5% while achieving up to 99.5% utilization of edge computing resources.

PaperID: 9123, https://arxiv.org/pdf/2508.11567.pdf

Authors: Jinpeng Hu, Ao Wang, Qianqian Xie, Hui Ma, Zhuo Li, Dan Guo

Title: AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

Abstract:
Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician‑based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi‑agent framework for mental health evaluation that simulates clinical doctor‑patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow‑up queries to address ambiguity and missing information. Additionally, we employ a tree‑structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC‑WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.

PaperID: 9124, https://arxiv.org/pdf/2508.11553.pdf

Authors: Jinghui Wang, Shaojie Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Xiaojiang Zhang, Minglei Zhang, Jiarong Zhang, Wenhao Zhuang, Yuchen Cao, Wankang Bao, Haimo Li, Zheng Lin, Huiming Wang, Haoyang Huang, Zongxian Feng, Zizheng Zhan, Ken Deng, Wen Xiang, Huaixi Tang, Kun Wu, Mengtong Li, Mengfei Xie, Junyi Peng, Haotian Zhang, Bin Chen, Bing Yu

Title: SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling

Abstract:
We introduce SeamlessFlow, a server based reinforcement learning (RL) framework that addresses two core challenges in industrial scale RL: (1) decoupling RL training from the complex execution flow of agents; (2) maximizing GPU utilization with minimal idle time while preserving the stability and scalability required for large‑scale deployments. First, SeamlessFlow introduces a data plane that decouples the RL trainer from diverse, complex agent implementations while sustaining high throughput. A central trajectory manager maintains complete interaction histories and supports partial rollout, allowing rollout to pause for weight updates and resume seamlessly, keeping agents unaware of service interruptions. Second, we propose a tag driven scheduling paradigm that abstracts hardware into capability tagged resources, unifying colocated and disaggregated architectures. Based on this, SeamlessFlow introduces a spatiotemporal multiplexing pipeline that dynamically reassigns idle training nodes to rollout in a train rollout separated setup, eliminating pipeline bubbles and fully exploiting heterogeneous cluster resources. By combining these innovations, SeamlessFlow delivers both stability and high performance, making it well suited for multi agent, long horizon, and other complex RL tasks.

PaperID: 9125, https://arxiv.org/pdf/2508.11416.pdf

Authors: Xuhua Zhao, Yuxuan Xie, Caihua Chen, Yuxiang Sun

Title: AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager

Abstract:
Recent advances in mathematical reasoning and the long‑term planning capabilities of large language models (LLMs) have precipitated the development of agents, which are being increasingly leveraged in business operations processes. Decision models to optimize inventory levels are one of the core elements of operations management. However, the capabilities of the LLM agent in making inventory decisions in uncertain contexts, as well as the decision‑making biases (e.g. framing effect, etc.) of the agent, remain largely unexplored. This prompts concerns regarding the capacity of LLM agents to effectively address real‑world problems, as well as the potential implications of biases that may be present. To address this gap, we introduce AIM‑Bench, a novel benchmark designed to assess the decision‑making behaviour of LLM agents in uncertain supply chain management scenarios through a diverse series of inventory replenishment experiments. Our results reveal that different LLMs typically exhibit varying degrees of decision bias that are similar to those observed in human beings. In addition, we explored strategies to mitigate the pull‑to‑centre effect and the bullwhip effect, namely cognitive reflection and implementation of information sharing. These findings underscore the need for careful consideration of the potential biases in deploying LLMs in Inventory decision‑making scenarios. We hope that these insights will pave the way for mitigating human decision bias and developing human‑centred decision support systems for supply chains.

PaperID: 9126, https://arxiv.org/pdf/2508.11401.pdf

Authors: Jana Gonnermann-Müller, Jennifer Haase, Konstantin Fackeldey, Sebastian Pokutta

Title: FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets

Abstract:
The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI‑driven personalization tools have emerged, most remain performance‑focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher‑facing, large language model (LLM)‑based multi‑agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through a) automated agent‑based assessment of output quality and b) exploratory feedback from K‑12 in‑service teachers. Results from ten internal evaluations highlighted high stability and alignment between generated materials and learner profiles, and teacher feedback particularly highlighted structure and suitability of tasks. The findings demonstrate the potential of multi‑agent LLM architectures to provide scalable, context‑aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real‑world classroom trials.

PaperID: 9127, https://arxiv.org/pdf/2508.11398.pdf

Authors: Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Junxiao Wang

Title: Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis

Abstract:
LLM‑based agents have emerged as transformative tools capable of executing complex tasks through iterative planning and action, achieving significant advancements in understanding and addressing user needs. Yet, their effectiveness remains limited in specialized domains such as mental health diagnosis, where they underperform compared to general applications. Current approaches to integrating diagnostic capabilities into LLMs rely on scarce, highly sensitive mental health datasets, which are challenging to acquire. These methods also fail to emulate clinicians' proactive inquiry skills, lack multi‑turn conversational comprehension, and struggle to align outputs with expert clinical reasoning. To address these gaps, we propose DSM5AgentFlow, the first LLM‑based agent workflow designed to autonomously generate DSM‑5 Level‑1 diagnostic questionnaires. By simulating therapist‑client dialogues with specific client profiles, the framework delivers transparent, step‑by‑step disorder predictions, producing explainable and trustworthy results. This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards. Through comprehensive experiments, we evaluate leading LLMs across three critical dimensions: conversational realism, diagnostic accuracy, and explainability. Our datasets and implementations are fully open‑sourced.

PaperID: 9128, https://arxiv.org/pdf/2508.11360.pdf

Authors: Songqin Nong, Xiaoxuan Tang, Jingxuan Xu, Sheng Zhou, Jianfeng Chen, Tao Jiang, Wenhao Xu

Title: CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

Abstract:
As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents' performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent's ability to adapt its learning process; and (2) most approaches collapse task‑specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT‑GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine‑grained policy optimization, we design a reward function that combines simple rule‑based signals with model‑judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state‑of‑the‑art approaches, outperforming them by 5.6% on public benchmarks Android Control and 10.3% on our internal online benchmarks, respectively. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks.

PaperID: 9129, https://arxiv.org/pdf/2508.11310.pdf

Authors: Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen

Title: SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Abstract:
The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval‑augmented generation (RAG) and the rising popularity of multi‑agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over‑reliance on LLMs‑as‑judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity‑Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM‑based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human‑comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.

PaperID: 9130, https://arxiv.org/pdf/2508.11286.pdf

Authors: Che Rin Yu, Daewon Chae, Dabin Seo, Sangwon Lee, Hyeongwoo Im, Jinkyu Kim

Title: Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent

Abstract:
When humans perform everyday tasks, we naturally adjust our actions based on the current state of the environment. For instance, if we intend to put something into a drawer but notice it is closed, we open it first. However, many autonomous robots lack this adaptive awareness. They often follow pre‑planned actions that may overlook subtle yet critical changes in the scene, which can result in actions being executed under outdated assumptions and eventual failure. While replanning is critical for robust autonomy, most existing methods respond only after failures occur, when recovery may be inefficient or infeasible. While proactive replanning holds promise for preventing failures in advance, current solutions often rely on manually designed rules and extensive supervision. In this work, we present a proactive replanning framework that detects and corrects failures at subtask boundaries by comparing scene graphs constructed from current RGB‑D observations against reference graphs extracted from successful demonstrations. When the current scene fails to align with reference trajectories, a lightweight reasoning module is activated to diagnose the mismatch and adjust the plan. Experiments in the AI2‑THOR simulator demonstrate that our approach detects semantic and spatial mismatches before execution failures occur, significantly improving task success and robustness.

PaperID: 9131, https://arxiv.org/pdf/2508.11252.pdf

Authors: Youcheng Huang, Bowen Qin, Chen Huang, Duanyu Feng, Xi Yang, Wenqiang Lei

Title: Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable problem‑solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well‑defined problems. However, such evaluation setup constitutes a critical gap, since a genuine intelligent agent should not only solve problems (as a math quiz solver), but also be able~to ask for information when the problems lack sufficient information, enabling proactivity in responding users' requests. To bridge such gap, we proposes a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematical evaluation of LRMs reveals their inability in proactively asking for information. In addition, we uncover the behaviors related to overthinking and hallucination of LRMs, and highlight the potential and challenges of supervised fine‑tuning in learning such ability. We hope to provide new insights in developing LRMs with genuine intelligence, rather than just solving problems.

PaperID: 9132, https://arxiv.org/pdf/2508.11152.pdf

Authors: Tianjiao Zhao, Jingrao Lyu, Stokes Jones, Harrison Garber, Stefano Pasquali, Dhagash Mehta

Title: AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions

Abstract:
The field of artificial intelligence (AI) agents is evolving rapidly, driven by the capabilities of Large Language Models (LLMs) to autonomously perform and refine tasks with human‑like efficiency and adaptability. In this context, multi‑agent collaboration has emerged as a promising approach, enabling multiple AI agents to work together to solve complex challenges. This study investigates the application of role‑based multi‑agent systems to support stock selection in equity research and portfolio management. We present a comprehensive analysis performed by a team of specialized agents and evaluate their stock‑picking performance against established benchmarks under varying levels of risk tolerance. Furthermore, we examine the advantages and limitations of employing multi‑agent frameworks in equity analysis, offering critical insights into their practical efficacy and implementation challenges.

PaperID: 9133, https://arxiv.org/pdf/2508.11120.pdf

Authors: Lorenzo Jaime Yu Flores, Junyi Shen, Goodman Gu

Title: Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning

Abstract:
Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real‑world applications remains limited. In this paper, we introduce a multi‑agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long‑term memory store, which is a knowledge base of client‑specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM‑based systems in dynamic, industry‑facing environments.

PaperID: 9134, https://arxiv.org/pdf/2508.11070.pdf

Authors: Zahra Khotanlou, Kate Larson, Amir-Hossein Karimi

Title: Your Recourse, My Loss? Algorithmic Recourse under Shared Constraints

Abstract:
Decision makers are increasingly relying on machine learning in sensitive situations. Algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI‑driven decisions. While existing research focuses on single‑individual (i.e., seeker) and single‑model (i.e., provider) scenarios, real‑world applications involve multiple stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the multi‑agent nature of real‑world systems, with competition for limited resources. Accordingly, we extend algorithmic recourse to a many‑to‑many setting with capacity constraints, where individually computed recourse recommendations no longer compose independently and stakeholder interactions affect recourse validity. We model this multi‑agent algorithimc recourse as a capacitated weighted bipartite matching problem, based on recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose three optimization layers: capacitated matching, optimal capacity redistribution, and cost‑aware optimization. We further model inequality‑averse objectives through a concave social‑welfare formulation that prioritizes the most disadvantaged seekers. Experiments demonstrate that our framework enables the many‑to‑many algorithmic recourse to achieve near‑optimal welfare with minimum modification in system settings. Our results also show how recourse systems can be designed to balance aggregate welfare with distributive considerations. We extend algorithmic recourse from individual recommendations to system‑level design, providing a tractable path toward higher social welfare while maintaining individual actionability.

PaperID: 9135, https://arxiv.org/pdf/2508.11030.pdf

Authors: Zikai Wen, Lanjing Liu, Yaxing Yao

Title: Families' Vision of Generative AI Agents for Household Safety Against Digital and Physical Threats

Abstract:
As families face increasingly complex safety challenges in digital and physical environments, generative AI (GenAI) presents new opportunities to support household safety through multiple specialized AI agents. Through a two‑phase qualitative study consisting of individual interviews and collaborative sessions with 13 parent‑child dyads, we explored families' conceptualizations of GenAI and their envisioned use of AI agents in daily family life. Our findings reveal that families preferred to distribute safety‑related support across multiple AI agents, each embodying a familiar caregiving role: a household manager coordinating routine tasks and mitigating risks such as digital fraud and home accidents; a private tutor providing personalized educational support, including safety education; and a family therapist offering emotional support to address sensitive safety issues such as cyberbullying and digital harassment. Families emphasized the need for agent‑specific privacy boundaries, recognized generational differences in trust toward AI agents, and stressed the importance of maintaining open family communication alongside the assistance of AI agents. Based on these findings, we propose a multi‑agent system design featuring four privacy‑preserving principles: memory segregation, conversational consent, selective data sharing, and progressive memory management to help balance safety, privacy, and autonomy within family contexts.

PaperID: 9136, https://arxiv.org/pdf/2508.10904.pdf

Authors: Jie Lei, Ruofan Jia, J. Andrew Zhang, Hao Zhang

Title: A2H-MAS: An Algorithm-to-HLS Multi-Agent System for Automated and Reliable FPGA Implementation

Abstract:
Bridging the gap between algorithm development and hardware realization remains a persistent challenge, particularly in latency‑ and resource‑constrained domains such as wireless communication. While MATLAB provides a mature environment for algorithm prototyping, translating these models into efficient FPGA implementations via High‑Level Synthesis (HLS) often requires expert tuning and lengthy iterations. Recent advances in large language models (LLMs) offer new opportunities for automating this process. However, existing approaches suffer from hallucinations, forgetting, limited domain expertise, and often overlook key performance metrics. To address these limitations, we present A2H‑MAS, a modular and hierarchical multi‑agent system. At the system level, A2H‑MAS assigns clearly defined responsibilities to specialized agents and uses standardized interfaces and execution‑based validation to ensure correctness and reproducibility. At the algorithmic level, it employs dataflow‑oriented modular decomposition and algorithm‑hardware co‑design, recognizing that the choice of algorithm often has a larger impact on hardware efficiency than pragma‑level optimization. Experiments on representative wireless communication algorithms show that A2H‑MAS consistently produces functionally correct, resource‑efficient, and latency‑optimized HLS designs, demonstrating its effectiveness and robustness for complex hardware development workflows.

PaperID: 9137, https://arxiv.org/pdf/2508.10880.pdf

Authors: Yanzhe Zhang, Diyi Yang

Title: Searching for Privacy Risks in LLM Agents via Simulation

Abstract:
The widespread deployment of LLM‑based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi‑turn interactions to extract sensitive information. However, the evolving nature of such dynamic dialogues makes it challenging to anticipate emerging vulnerabilities and design effective defenses. To tackle this problem, we present a search‑based framework that alternates between improving attack and defense strategies through the simulation of privacy‑critical agent interactions. Specifically, we employ LLMs as optimizers to analyze simulation trajectories and iteratively propose new agent instructions. To explore the strategy space more efficiently, we further utilize parallel search with multiple threads and cross‑thread propagation. Through this process, we find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery, while defenses evolve from simple rule‑based constraints to robust identity‑verification state machines. The discovered attacks and defenses generalize across diverse scenarios and backbone models, providing useful insights for developing privacy‑aware agents.

PaperID: 9138, https://arxiv.org/pdf/2508.10839.pdf

Authors: Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein

Title: Reinforced Language Models for Sequential Decision Making

Abstract:
Large Language Models (LLMs) show potential as sequential decision‑making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post‑training methods are designed for single‑turn interactions and cannot handle credit assignment in multi‑step agentic tasks. To address this, we introduce Multi‑Step Group‑Relative Policy Optimization (MS‑GRPO), a new algorithm for post‑training LLM agents, grounded in formal Text‑Mediated Stochastic Game (TSMG) and Language‑Agent Policy (LAP) frameworks. For credit assignment, MS‑GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute‑advantage‑weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post‑training a 3‑billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision‑making performance: our post‑trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post‑training is a practical and efficient alternative to relying on model scale for creating sequential decision‑making agents using LLMs.

PaperID: 9139, https://arxiv.org/pdf/2508.10780.pdf

Authors: Alessandro Adami, Aris Synodinos, Matteo Iovino, Ruggero Carli, Pietro Falco

Title: Learning Stack-of-Tasks Management for Redundant Robots

Abstract:
This paper presents a novel framework for automatically learning complete Stack‑of‑Tasks (SoT) controllers for redundant robotic systems, including task priorities, activation logic, and control parameters. Unlike classical SoT pipelines‑where task hierarchies are manually defined and tuned‑our approach optimizes the full SoT structure directly from a user‑specified cost function encoding intuitive preferences such as safety, precision, manipulability, or execution speed. The method combines Genetic Programming with simulation‑based evaluation to explore both discrete (priority order, task activation) and continuous (gains, trajectory durations) components of the controller. We validate the framework on a dual‑arm mobile manipulator (the ABB mobile‑YuMi research platform), demonstrating robust convergence across multiple cost definitions, automatic suppression of irrelevant tasks, and strong resilience to distractors. Learned SoTs exhibit expert‑like hierarchical structure and adapt naturally to multi‑objective trade‑offs. Crucially, all controllers transfer from Gazebo simulation to the real robot, achieving safe and precise motion without additional tuning. Experiments in static and dynamic environments show reliable obstacle avoidance, high tracking accuracy, and predictable behavior in the presence of humans. The proposed method provides an interpretable and scalable alternative to manual SoT design, enabling rapid, user‑driven generation of task execution hierarchies for complex robotic systems.

PaperID: 9140, https://arxiv.org/pdf/2508.10760.pdf

Authors: Qihua Pan, Dong Xu, Qianwei Yang, Jenna Xinyi Yao, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji

Title: FROGENT: An End-to-End Full-process Drug Design Multi-Agent System

Abstract:
Drug discovery is a complex, multi‑step pipeline that remains heavily dependent on manual, experience‑driven operations; meanwhile, existing customized artificial intelligence tools are fragmented across web applications, desktop software, and code libraries, resulting in incompatible interfaces and inefficient, burdensome workflows. To overcome these challenges, we propose FROGENT, a full‑process drug design multi‑agent system that leverages the planning, reasoning, and tool‑use capabilities of large language models (LLMs) to unify drug discovery within a closed‑loop and autonomous framework. FROGENT is a collaborative multi‑agent system comprising a central Orchestrate Agent for strategic workflow coordination and three distributed agents, Retrieve, Forge, and Gauge, that employ dynamic biochemical databases, extensible tool libraries, and task‑specific computational models via the Model Context Protocol. This architecture enables end‑to‑end execution of complex drug discovery pipelines, covering target identification, small‑molecule generation, peptide optimization, and retrosynthetic planning. Across eight benchmarks spanning core drug discovery tasks, FROGENT consistently outperforms six increasingly advanced ReAct‑style agents. Case studies further demonstrate its practicality and generalization across real‑world small‑molecule and peptide design scenarios. Overall, FROGENT not only achieves substantial gains in efficiency and accuracy, but also demonstrates the potential of LLM‑based agentic systems to autonomously orchestrate drug development pipelines, reducing, or even replacing, reliance on manual, experience‑driven human intervention.

PaperID: 9141, https://arxiv.org/pdf/2508.10538.pdf

Authors: Xin Liu, Bida Ma, Chenkun Qi, Yan Ding, Nuo Xu, Zhaxizhuoma, Guorong Zhang, Pengan Chen, Kehui Liu, Zhongjie Jia, Chuyue Guan, Yule Mo, Jiaqi Liu, Feng Gao, Jiangwei Zhong, Bin Zhao, Xuelong Li

Title: MLM: Learning Multi-task Loco-Manipulation Whole-Body Control for Quadruped Robot with Arm

Abstract:
Whole‑body loco‑manipulation for quadruped robots with arms remains a challenging problem, particularly in achieving multi‑task control. To address this, we propose MLM, a reinforcement learning framework driven by both real‑world and simulation data. It enables a six‑DoF robotic arm‑equipped quadruped robot to perform whole‑body loco‑manipulation for multiple tasks autonomously or under human teleoperation. To address the problem of balancing multiple tasks during the learning of loco‑manipulation, we introduce a trajectory library with an adaptive, curriculum‑based sampling mechanism. This approach allows the policy to efficiently leverage real‑world collected trajectories for learning multi‑task loco‑manipulation. To address deployment scenarios with only historical observations and to enhance the performance of policy execution across tasks with different spatial ranges, we propose a Trajectory‑Velocity Prediction policy network. It predicts unobservable future trajectories and velocities. By leveraging extensive simulation data and curriculum‑based rewards, our controller achieves whole‑body behaviors in simulation and zero‑shot transfer to real‑world deployment. Ablation studies in simulation verify the necessity and effectiveness of our approach, while real‑world experiments on a Go2 robot with an Airbot robotic arm demonstrate the policy's good performance in multi‑task execution.

PaperID: 9142, https://arxiv.org/pdf/2508.10494.pdf

Authors: Jiulin Li, Ping Huang, Yexin Li, Shuo Chen, Juewen Hu, Ye Tian

Title: A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation

Abstract:
Real‑world multimodal applications often require any‑to‑any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high‑fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi‑Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi‑agent collaboration within a shared textual workspace. In the Cognition phase, three role‑conditioned multimodal LLM agents ‑ Perceiver, Planner, and Reflector ‑ engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth‑Aware Search mechanism that orchestrates LLM‑based reasoning and diffusion‑based generation in a mutually reinforcing manner. MAGUS supports plug‑and‑play extensibility, scalable any‑to‑any modality conversion, and semantic alignment ‑ all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross‑modal instruction following, demonstrate that MAGUS outperforms strong baselines and state‑of‑the‑art systems. Notably, on the MME benchmark, MAGUS surpasses the powerful closed‑source model GPT‑4o.

PaperID: 9143, https://arxiv.org/pdf/2508.10423.pdf

Authors: Qi Liu, Xiaopeng Zhang, Mingshan Tan, Shuaikang Ma, Jinliang Ding, Yanjie Li

Title: MASH: Cooperative-Heterogeneous Multi-Agent Reinforcement Learning for Single Humanoid Robot Locomotion

Abstract:
This paper proposes a novel method to enhance locomotion for a single humanoid robot through cooperative‑heterogeneous multi‑agent deep reinforcement learning (MARL). While most existing methods typically employ single‑agent reinforcement learning algorithms for a single humanoid robot or MARL algorithms for multi‑robot system tasks, we propose a distinct paradigm: applying cooperative‑heterogeneous MARL to optimize locomotion for a single humanoid robot. The proposed method, multi‑agent reinforcement learning for single humanoid locomotion (MASH), treats each limb (legs and arms) as an independent agent that explores the robot's action space while sharing a global critic for cooperative learning. Experiments demonstrate that MASH accelerates training convergence and improves whole‑body cooperation ability, outperforming conventional single‑agent reinforcement learning methods. This work advances the integration of MARL into single‑humanoid‑robot control, offering new insights into efficient locomotion strategies.

PaperID: 9144, https://arxiv.org/pdf/2508.10409.pdf

Authors: Zihao Chen, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Xinyi Yang, Mingjie Zhou, Zhuoyao Du, Xu Yan, Zhouyang Wu, Zhenyu Xu, Jiangli Huang, Li Shang, Xuan Zeng, Fan Yang

Title: AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design

Abstract:
In this paper, we propose AnalogSeeker, an effort toward an open‑source foundation language model for analog circuit design, with the aim of integrating domain knowledge and giving design assistance. To overcome the scarcity of data in this field, we employ a corpus collection strategy based on the domain knowledge framework of analog circuits. High‑quality, accessible textbooks across relevant subfields are systematically curated and cleaned into a textual domain corpus. To address the complexity of knowledge of analog circuits, we introduce a granular domain knowledge distillation method. Raw, unlabeled domain corpus is decomposed into typical, granular learning nodes, where a multi‑agent framework distills implicit knowledge embedded in unstructured text into question‑answer data pairs with detailed reasoning processes, yielding a fine‑grained, learnable dataset for fine‑tuning. To address the unexplored challenges in training analog circuit foundation models, we explore and share our training methods through both theoretical analysis and experimental validation. We finally establish a fine‑tuning‑centric training paradigm, customizing and implementing a neighborhood self‑constrained supervised fine‑tuning algorithm. This approach enhances training outcomes by constraining the perturbation magnitude between the model's output distributions before and after training. In practice, we train the Qwen2.5‑32B‑Instruct model to obtain AnalogSeeker, which achieves 85.04% accuracy on AMSBench‑TQA, the analog circuit knowledge evaluation benchmark, with a 15.67% point improvement over the original model and is competitive with mainstream commercial models. Furthermore, AnalogSeeker also shows effectiveness in the downstream operational amplifier design task. AnalogSeeker is open‑sourced at https://huggingface.co/analogllm/analogseeker for research use.

PaperID: 9145, https://arxiv.org/pdf/2508.10378.pdf

Authors: Yu Chen, Shu Miao, Chunyu Wu, Jingsong Mu, Bo OuYang, Xiang Li

Title: A Semantic-Aware Framework for Safe and Intent-Integrative Assistance in Upper-Limb Exoskeletons

Abstract:
Upper‑limb exoskeletons are primarily designed to provide assistive support by accurately interpreting and responding to human intentions. In home‑care scenarios, exoskeletons are expected to adapt their assistive configurations based on the semantic information of the task, adjusting appropriately in accordance with the nature of the object being manipulated. However, existing solutions often lack the ability to understand task semantics or collaboratively plan actions with the user, limiting their generalizability. To address this challenge, this paper introduces a semantic‑aware framework that integrates large language models into the task planning framework, enabling the delivery of safe and intent‑integrative assistance. The proposed approach begins with the exoskeleton operating in transparent mode to capture the wearer's intent during object grasping. Once semantic information is extracted from the task description, the system automatically configures appropriate assistive parameters. In addition, a diffusion‑based anomaly detector is used to continuously monitor the state of human‑robot interaction and trigger real‑time replanning in response to detected anomalies. During task execution, online trajectory refinement and impedance control are used to ensure safety and regulate human‑robot interaction. Experimental results demonstrate that the proposed method effectively aligns with the wearer's cognition, adapts to semantically varying tasks, and responds reliably to anomalies.

PaperID: 9146, https://arxiv.org/pdf/2508.10340.pdf

Authors: Chak Lam Shek, Guangyao Shi, Pratap Tokekar

Title: Multi-Agent Trust Region Policy Optimisation: A Joint Constraint Approach

Abstract:
Multi‑agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous‑Agent Trust Region Policy Optimization (HATRPO) enforces per‑agent trust region constraints using Kullback‑Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO‑W, a Karush‑Kuhn‑Tucker‑based (KKT‑based) method that optimizes threshold assignment under global KL constraints, and HATRPO‑G, a greedy algorithm that prioritizes agents based on improvement‑to‑divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous‑agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO‑W and HATRPO‑G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO‑W also demonstrates more stable learning dynamics, as reflected by its lower variance.

PaperID: 9147, https://arxiv.org/pdf/2508.10253.pdf

Authors: Guanzi Yao, Heyao Liu, Linyan Dai

Title: Multi-Agent Reinforcement Learning for Adaptive Resource Orchestration in Cloud-Native Clusters

Abstract:
This paper addresses the challenges of high resource dynamism and scheduling complexity in cloud‑native database systems. It proposes an adaptive resource orchestration method based on multi‑agent reinforcement learning. The method introduces a heterogeneous role‑based agent modeling mechanism. This allows different resource entities, such as compute nodes, storage nodes, and schedulers, to adopt distinct policy representations. These agents are better able to reflect diverse functional responsibilities and local environmental characteristics within the system. A reward‑shaping mechanism is designed to integrate local observations with global feedback. This helps mitigate policy learning bias caused by incomplete state observations. By combining real‑time local performance signals with global system value estimation, the mechanism improves coordination among agents and enhances policy convergence stability. A unified multi‑agent training framework is developed and evaluated on a representative production scheduling dataset. Experimental results show that the proposed method outperforms traditional approaches across multiple key metrics. These include resource utilization, scheduling latency, policy convergence speed, system stability, and fairness. The results demonstrate strong generalization and practical utility. Across various experimental scenarios, the method proves effective in handling orchestration tasks with high concurrency, high‑dimensional state spaces, and complex dependency relationships. This confirms its advantages in real‑world, large‑scale scheduling environments.

PaperID: 9148, https://arxiv.org/pdf/2508.10177.pdf

Authors: Stepan Kulibaba, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, Aleksei Shpilman

Title: KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

Abstract:
Recent Large Language Model (LLM)‑based AutoML systems demonstrate impressive capabilities but face significant limitations such as constrained exploration strategies and a severe execution bottleneck. Exploration is hindered by one‑shot methods lacking diversity and Monte Carlo Tree Search (MCTS) approaches that fail to recombine strong partial solutions. The execution bottleneck arises from lengthy code validation cycles that stifle iterative refinement. To overcome these challenges, we introduce KompeteAI, a novel AutoML framework with dynamic solution space exploration. Unlike previous MCTS methods that treat ideas in isolation, KompeteAI introduces a merging stage that composes top candidates. We further expand the hypothesis space by integrating Retrieval‑Augmented Generation (RAG), sourcing ideas from Kaggle notebooks and arXiv papers to incorporate real‑world strategies. KompeteAI also addresses the execution bottleneck via a predictive scoring model and an accelerated debugging method, assessing solution potential using early stage metrics to avoid costly full‑code execution. This approach accelerates pipeline evaluation 6.9 times. KompeteAI outperforms leading methods (e.g., RD‑agent, AIDE, and Ml‑Master) by an average of 3% on the primary AutoML benchmark, MLE‑Bench. Additionally, we propose Kompete‑bench to address limitations in MLE‑Bench, where KompeteAI also achieves state‑of‑the‑art results

PaperID: 9149, https://arxiv.org/pdf/2508.10146.pdf

Authors: Hana Derouiche, Zaki Brahmi, Haithem Mazeni

Title: Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Abstract:
The emergence of Large Language Models (LLMs) has ushered in a transformative paradigm in artificial intelligence, Agentic AI, where intelligent agents exhibit goal‑directed autonomy, contextual reasoning, and dynamic multi‑agent coordination. This paper provides a systematic review and comparative analysis of leading Agentic AI frameworks, including CrewAI, LangGraph, AutoGen, Semantic Kernel, Agno, Google ADK, and MetaGPT, evaluating their architectural principles, communication mechanisms, memory management, safety guardrails, and alignment with service‑oriented computing paradigms. Furthermore, we identify key limitations, emerging trends, and open challenges in the field. To address the issue of agent communication, we conduct an in‑depth analysis of protocols such as the Contract Net Protocol (CNP), Agent‑to‑Agent (A2A), Agent Network Protocol (ANP), and Agora. Our findings not only establish a foundational taxonomy for Agentic AI systems but also propose future research directions to enhance scalability, robustness, and interoperability. This work serves as a comprehensive reference for researchers and practitioners working to advance the next generation of autonomous AI systems.

PaperID: 9150, https://arxiv.org/pdf/2508.10143.pdf

Authors: Alexandru-Andrei Avram, Adrian Groza, Alexandru Lecu

Title: MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection

Abstract:
The large spread of disinformation across digital platforms creates significant challenges to information integrity. This paper presents a multi‑agent system that uses relation extraction to detect disinformation in news articles, focusing on titles and short text snippets. The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent (which relies on named entity recognition), (iii) a coherence detection agent (using LLM prompt engineering), and (iv) a web‑scraped data analyzer that extracts relational triplets for fact checking. The system is orchestrated via the Model Context Protocol (MCP), offering shared context and live learning across components. Results demonstrate that the multi‑agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches. The weighted aggregation method, mathematically derived from individual agent misclassification rates, proves superior to algorithmic threshold optimization. The modular architecture makes the system easily scalable, while also maintaining details of the decision processes.

PaperID: 9151, https://arxiv.org/pdf/2508.09963.pdf

Authors: Devansh R. Agrawal, Dimitra Panagou

Title: Online Safety under Multiple Constraints and Input Bounds using gatekeeper: Theory and Applications

Abstract:
This letter presents an approach to guarantee online safety of a cyber‑physical system under multiple state and input constraints. Our proposed framework, called gatekeeper, recursively guarantees the existence of an infinite‑horizon trajectory that satisfies all constraints and system dynamics. Such trajectory is constructed using a backup controller, which we define formally in this paper. gatekeeper relies on a small number of verifiable assumptions, and is computationally efficient since it requires optimization over a single scalar variable. We make two primary contributions in this letter. (A) First, we develop the theory of gatekeeper: we derive a sub‑optimality bound relative to a full nonlinear trajectory optimization problem, and show how this can be used in runtime to validate performance. This also informs the design of the backup controllers and sets. (B) Second, we demonstrate in detail an application of gatekeeper for multi‑agent formation flight, where each Dubins agent must avoid multiple obstacles and weapons engagement zones, both of which are nonlinear, nonconvex constraints.

PaperID: 9152, https://arxiv.org/pdf/2508.09958.pdf

Authors: Baran Atalar, Eddie Zhang, Carlee Joe-Wong

Title: Neural Bandit Based Optimal LLM Selection for a Pipeline of Subtasks

Abstract:
As large language models (LLMs) become increasingly popular, there is a growing need to predict which out of a set of LLMs will yield a successful answer to a given query at low cost. This problem promises to become even more relevant as LLM agents are asked to solve an increasing variety of "agentic'' AI tasks. Such tasks are often broken into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, to extract a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select a possibly different LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires selecting a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask's output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned during selection. We propose a neural contextual bandit‑based algorithm that trains neural networks to guide LLM selections for the different subtasks, without requiring historical LLM performance data. We prove that our proposed Sequential Bandits algorithm achieves a sublinear regret in the number of tasks, and we experimentally validate its superior performance compared to other LLM selection algorithms on two real datasets.

PaperID: 9153, https://arxiv.org/pdf/2508.09893.pdf

Authors: Bhavik Agarwal, Hemant Sunil Jomraj, Simone Kaplunov, Jack Krolick, Viktoria Rojkova

Title: RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA

Abstract:
Regulatory compliance question answering (QA) requires precise, verifiable information, and domain‑specific expertise, posing challenges for Large Language Models (LLMs). In this work, we present a novel multi‑agent framework that integrates a Knowledge Graph (KG) of Regulatory triplets with Retrieval‑Augmented Generation (RAG) to address these demands. First, agents build and maintain an ontology‑free KG by extracting subject‑‑predicate‑‑object (SPO) triplets from regulatory documents and systematically cleaning, normalizing, deduplicating, and updating them. Second, these triplets are embedded and stored along with their corresponding textual sections and metadata in a single enriched vector database, allowing for both graph‑based reasoning and efficient information retrieval. Third, an orchestrated agent pipeline leverages triplet‑level retrieval for question answering, ensuring high semantic alignment between user queries and the factual "who‑did‑what‑to‑whom" core captured by the graph. Our hybrid system outperforms conventional methods in complex regulatory queries, ensuring factual correctness with embedded triplets, enabling traceability through a unified vector database, and enhancing understanding through subgraph visualization, providing a robust foundation for compliance‑driven and broader audit‑focused applications.

PaperID: 9154, https://arxiv.org/pdf/2508.09889.pdf

Authors: Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu

Title: Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld

Abstract:
The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real‑world problems. However, this reliance introduces new challenges, as extended contexts and noisy tool outputs can undermine system reliability. To address this, we propose a dynamic Multi‑Agent System (MAS) in our AWorld framework, where an Execution Agent is supervised by a Guard Agent that provides on‑demand dynamic maneuvering, verifying and correcting the reasoning process to improve robustness over single‑agent systems. To move beyond this generic supervision, we enhance the architecture with a methodology inspired by System Identification from control theory. This method first profiles the Execution Agent offline on a benchmark dataset to create a "performance fingerprint" of its unique weaknesses. The Guard Agent then leverages this fingerprint online to deliver profile‑aware supervision, making targeted interventions based on known failure patterns rather than merely reacting to immediate logical flaws. Extensive experiments on the GAIA dataset demonstrate that this profile‑aware MAS significantly improves both effectiveness and stability, outperforming not only single‑agent systems but also its naive counterpart. This superior performance led our system to achieve first place among open‑source projects on the prestigious GAIA leaderboard. These findings highlight that building truly trustworthy intelligent systems requires not just collaboration, but a deep, empirically‑grounded understanding of each agent's unique capabilities and limitations.

PaperID: 9155, https://arxiv.org/pdf/2508.09815.pdf

Authors: Klaudia Krawiecka, Christian Schroeder de Witt

Title: Extending the OWASP Multi-Agentic System Threat Modeling Guide: Insights from Multi-Agent Security Research

Abstract:
We propose an extension to the OWASP Multi‑Agentic System (MAS) Threat Modeling Guide, translating recent anticipatory research in multi‑agent security (MASEC) into practical guidance for addressing challenges unique to large language model (LLM)‑driven multi‑agent architectures. Although OWASP's existing taxonomy covers many attack vectors, our analysis identifies gaps in modeling failures, including, but not limited to: reasoning collapse across planner‑executor chains, metric overfitting, unsafe delegation escalation, emergent covert coordination, and heterogeneous multi‑agent exploits. We introduce additional threat classes and scenarios grounded in practical MAS deployments, highlighting risks from benign goal drift, cross‑agent hallucination propagation, affective prompt framing, and multi‑agent backdoors. We also outline evaluation strategies, including robustness testing, coordination assessment, safety enforcement, and emergent behavior monitoring, to ensure complete coverage. This work complements the framework of OWASP by expanding its applicability to increasingly complex, autonomous, and adaptive multi‑agent systems, with the goal of improving security posture and resilience in real world deployments.

PaperID: 9156, https://arxiv.org/pdf/2508.09784.pdf

Authors: Avijeet Ghosh, Sujata Ghosh, François Schwarzentruber

Title: Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete

Abstract:
Logics for reasoning about knowledge and actions have seen many applications in various domains of multi‑agent systems, including epistemic planning. Change of knowledge based on observations about the surroundings forms a key aspect in such planning scenarios. Public Observation Logic (POL) is a variant of public announcement logic for reasoning about knowledge that gets updated based on public observations. Each state in an epistemic (Kripke) model is equipped with a set of expected observations. These states evolve as the expectations get matched with the actual observations. In this work, we prove that the satisfiability problem of \POL is 2EXPTIME‑complete.

PaperID: 9157, https://arxiv.org/pdf/2508.09541.pdf

Authors: Gang Chen, Guoxin Wang, Anton van Beek, Zhenjun Ming, Yan Yan

Title: Emergence of Hierarchies in Multi-Agent Self-Organizing Systems Pursuing a Joint Objective

Abstract:
Multi‑agent self‑organizing systems (MASOS) exhibit key characteristics including scalability, adaptability, flexibility, and robustness, which have contributed to their extensive application across various fields. However, the self‑organizing nature of MASOS also introduces elements of unpredictability in their emergent behaviors. This paper focuses on the emergence of dependency hierarchies during task execution, aiming to understand how such hierarchies arise from agents' collective pursuit of the joint objective, how they evolve dynamically, and what factors govern their development. To investigate this phenomenon, multi‑agent reinforcement learning (MARL) is employed to train MASOS for a collaborative box‑pushing task. By calculating the gradients of each agent's actions in relation to the states of other agents, the inter‑agent dependencies are quantified, and the emergence of hierarchies is analyzed through the aggregation of these dependencies. Our results demonstrate that hierarchies emerge dynamically as agents work towards a joint objective, with these hierarchies evolving in response to changing task requirements. Notably, these dependency hierarchies emerge organically in response to the shared objective, rather than being a consequence of pre‑configured rules or parameters that can be fine‑tuned to achieve specific results. Furthermore, the emergence of hierarchies is influenced by the task environment and network initialization conditions. Additionally, hierarchies in MASOS emerge from the dynamic interplay between agents' "Talent" and "Effort" within the "Environment." "Talent" determines an agent's initial influence on collective decision‑making, while continuous "Effort" within the "Environment" enables agents to shift their roles and positions within the system.

PaperID: 9158, https://arxiv.org/pdf/2508.09507.pdf

Authors: Meiping Wang, Jian Zhong, Rongduo Han, Liming Kang, Zhengkun Shi, Xiao Liang, Xing Lin, Nan Gao, Haining Zhang

Title: An Automated Multi-modal Evaluation Framework for Mobile Intelligent Assistants Based on Large Language Models and Multi-Agent Collaboration

Abstract:
With the rapid development of mobile intelligent assistant technologies, multi‑modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi‑modal evaluation framework based on large language models and multi‑agent collaboration. The framework employs a three‑tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine‑tuning on the Qwen3‑8B model, we achieve a significant evaluation matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework's effectiveness in predicting users' satisfaction and identifying generation defects.

PaperID: 9159, https://arxiv.org/pdf/2508.09275.pdf

Authors: Amine Andam, Jamal Bentahar, Mustapha Hedabou

Title: Constrained Black-Box Attacks Against Cooperative Multi-Agent Reinforcement Learning

Abstract:
Collaborative multi‑agent reinforcement learning has rapidly evolved, offering state‑of‑the‑art algorithms for real‑world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training‑time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more challenging and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all (no observations, actions, or weights). Our main approach is to generate perturbations that intentionally misalign how victim agents see their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample‑efficient, requiring only 1,000 samples compared to the millions needed by previous methods.

PaperID: 9160, https://arxiv.org/pdf/2508.09230.pdf

Authors: Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Nils Lukas, Tianwei Zhang

Title: Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems

Abstract:
Vision Language Model (VLM)‑based agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi‑agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system should maintain its integrity under adversarial attacks. However, the design of existing multi‑agent systems lacks the robustness consideration, as a successful exploit against one agent can spread and infect other agents to undermine the entire system's assurance. To address this, we propose a new defense approach, Cowpox, to provably enhance the robustness of multi‑agent systems. It incorporates a distributed mechanism, which improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps recover the already infected agents. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.

PaperID: 9161, https://arxiv.org/pdf/2508.09210.pdf

Authors: Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng

Title: MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

Abstract:
Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME‑Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME‑Emotion contains over 6,000 curated video clips with task‑specific questioning‑answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi‑agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: \ding182 Current MLLMs exhibit unsatisfactory emotional intelligence, with the best‑performing model achieving only 39.3% recognition score and 56.0% Chain‑of‑Thought (CoT) score on our benchmark. \ding183 Generalist models (\emphe.g., Gemini‑2.5‑Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (\emphe.g., R1‑Omni) can achieve comparable performance through domain‑specific post‑training adaptation. By introducing MME‑Emotion, we hope that it can serve as a foundation for advancing MLLMs' emotional intelligence in the future.

PaperID: 9162, https://arxiv.org/pdf/2508.09147.pdf

Authors: Alaa Saleh, Roberto Morabito, Sasu Tarkoma, Anders Lindgren, Susanna Pirttikangas, Lauri Lovén

Title: Agentic TinyML for Intent-aware Handover in 6G Wireless Networks

Abstract:
As 6G networks evolve into increasingly AI‑driven, user‑centric ecosystems, traditional reactive handover mechanisms demonstrate limitations, especially in mobile edge computing and autonomous agent‑based service scenarios. This manuscript introduces WAAN, a cross‑layer framework that enables intent‑aware and proactive handovers by embedding lightweight TinyML agents as autonomous, negotiation‑capable entities across heterogeneous edge nodes that contribute to intent propagation and network adaptation. To ensure continuity across mobility‑induced disruptions, WAAN incorporates semi‑stable rendezvous points that serve as coordination anchors for context transfer and state preservation. The framework's operational capabilities are demonstrated through a multimodal environmental control case study, highlighting its effectiveness in maintaining user experience under mobility. Finally, the article discusses key challenges and future opportunities associated with the deployment and evolution of WAAN.

PaperID: 9163, https://arxiv.org/pdf/2508.09129.pdf

Authors: Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, Siheng Chen

Title: BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair

Abstract:
Effective information seeking in the vast and ever‑growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)‑based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi‑step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner‑executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long‑horizon reasoning while sustaining broad and systematic exploration, overcoming the trade‑off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open‑source and proprietary baselines, achieving scores of 30.0 on BrowseComp‑en and 46.5 on BrowseComp‑zh, which demonstrates its strong capability in complex, reasoning‑heavy information‑seeking tasks at scale.

PaperID: 9164, https://arxiv.org/pdf/2508.09124.pdf

Authors: Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, Saravan Rajmohan

Title: OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Abstract:
Autonomous agents powered by large language models (LLMs) are increasingly deployed in real‑world applications requiring complex, long‑horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self‑contained and independent, failing to capture the long‑term contextual dependencies and multi‑interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long‑horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real‑world use cases, and OdysseyBench‑Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long‑horizon interaction histories and perform multi‑step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi‑agent framework that automates the generation of long‑horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state‑of‑the‑art LLM agents, providing more accurate assessment of their capabilities in complex, real‑world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real‑world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.

PaperID: 9165, https://arxiv.org/pdf/2508.09050.pdf

Authors: Germán Díaz Agreda, Carlos Andres Duran Paredes, Mateo Buenaventura Samboni, Jhon Alejandro Andrade, Sebastián Andrés Cajas Ordoñez

Title: Bridging Theory and Practice in Quantum Game Theory: Optimized Implementation of the Battle of the Sexes with Error Mitigation on NISQ Hardware

Abstract:
Implementing quantum game theory on real hardware is challenging due to noise, decoherence, and limited qubit connectivity, yet such demonstrations are essential to validate theoretical predictions. We present one of the first full experimental realizations of the Battle of the Sexes game under the Eisert‑Wilkens‑Lewenstein (EWL) framework on IBM Quantum's ibm sherbrooke superconducting processor. Four quantum strategies (I, H, R(π/4), R(π)) were evaluated across 31 entanglement values γ\in [0, π] using 2048 shots per configuration, enabling a direct comparison between analytical predictions and hardware execution. To mitigate noise and variability, we introduce a Guided Circuit Mapping (GCM) method that dynamically selects qubit pairs and optimizes routing based on real‑time topology and calibration data. The analytical model forecasts up to 108% payoff improvement over the classical equilibrium, and despite hardware‑induced deviations, experimental results with GCM preserve the expected payoff trends within 3.5%‑12% relative error. These findings show that quantum advantages in strategic coordination can persist under realistic NISQ conditions, providing a pathway toward practical applications of quantum game theory in multi‑agent, economic, and distributed decision‑making systems.

PaperID: 9166, https://arxiv.org/pdf/2508.08997.pdf

Authors: Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey

Title: Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory

Abstract:
Multi‑agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem‑solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through agent‑specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role‑aligned memory that preserves specialized perspectives while focusing on task‑relevant information. Our approach utilises a generic memory template applicable to new problems without the need to hand‑craft specific memory prompts. We benchmark our approach on the PDDL, FEVER, and ALFWorld datasets, comparing its performance to existing state‑of‑the‑art multi‑agentic memory approaches and showing state‑of‑the‑art or comparable performance across all three, with the highest consistency. An additional evaluation is performed on a complex data pipeline design task, and we demonstrate that our approach produces higher quality designs across 5 metrics: scalability, reliability, usability, cost‑effectiveness, and documentation, plus additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through intrinsic approaches can improve the capabilities of multi‑agent LLM systems on structured planning tasks.

PaperID: 9167, https://arxiv.org/pdf/2508.08933.pdf

Authors: Khondoker Ittehadul Islam, Gabriele Sarti

Title: Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation

Abstract:
Language models have demonstrated remarkable performance on complex multi‑step reasoning tasks. However, their evaluation has been predominantly confined to high‑resource languages such as English. In this paper, we introduce a manually translated Bangla multi‑step reasoning dataset derived from the English Reveal dataset, featuring both binary and non‑binary question types. We conduct a controlled evaluation of English‑centric and Bangla‑centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non‑binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.

PaperID: 9168, https://arxiv.org/pdf/2508.08930.pdf

Authors: Juyeong Hwang, Seong-Eun Hong, JaeYoung Seon, Hyeongyeop Kang

Title: How Does a Virtual Agent Decide Where to Look? Symbolic Cognitive Reasoning for Embodied Head Rotation

Abstract:
Natural head rotation is critical for believable embodied virtual agents, yet this micro‑level behavior remains largely underexplored. While head‑rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task‑relevant cues, diminishing realism in a virtual environment. We introduce SCORE, a Symbolic Cognitive Reasoning framework for Embodied Head Rotation, a data‑agnostic framework that produces context‑aware head movements without task‑specific training or hand‑tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision‑Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM‑LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi‑agent crowds while retaining behavioral plausibility.

PaperID: 9169, https://arxiv.org/pdf/2508.08882.pdf

Authors: Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

Title: Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

Abstract:
Recent advances in multi‑agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool‑integrated reasoning systems, however, often follow a single‑agent paradigm in which one large model interleaves long‑horizon reasoning with precise tool operations, leading to cognitive‑load interference and unstable coordination. We present MSARL, a Multi‑Small‑Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role‑specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final‑answer accuracy over single‑agent baselines. Moreover, the architecture generalizes to diverse tool‑use tasks, demonstrating that cognitive‑role decoupling with small agents is a scalable blueprint for multi‑agent AI design.

PaperID: 9170, https://arxiv.org/pdf/2508.08837.pdf

Authors: Nicholas Sukiennik, Yichuan Xu, Yuqing Kan, Jinghua Piao, Yuwei Yan, Chen Gao, Yong Li

Title: Debiasing International Attitudes: LLM Agents for Simulating US-China Perception Changes

Abstract:
Large Language Models (LLMs) offer transformative opportunities to address the longstanding challenge of modeling opinion evolution in computational social science. This study investigates how media influences cross‑border attitudes ‑ a key driver of global polarization ‑ by developing an LLM‑agent framework to disentangle sources of bias and assess LLMs' capacity for human‑like opinion formation in response to external information. We introduce an LLM‑agent‑based framework that models U.S. citizens' attitudes toward China from 2005 to 2025. Our approach integrates large‑scale news data with social media profiles to initialize agent populations, which then undergo cognitive‑aware reflection and opinion updating. We propose three debiasing mechanisms: (1) fact elicitation, extracting neutral events from subjectively framed news; (2) a devil's advocate agent that simulates critical contextualization; and (3) counterfactual exposure to surface inherent model biases. Simulations with two state‑of‑the‑art LLMs (Qwen3‑14b and GPT4o) reveal the expected negative attitudinal trend following media exposure. While all three mechanisms mitigate this trend to varying degrees, results indicate that subjective news framing contributes only modestly to negative attitudes, whereas the devil's advocate agent proves most effective overall, suggesting that intermediate analytical steps can produce more human‑like agent opinions. Notably, the counterfactual study reveals contradictory findings across models, suggesting region‑specific inherent biases tied to models' geographic origins. By advancing understanding of LLM‑based opinion formation and debiasing methods, this study contributes to developing more objective models that better align with human cognitive tendencies.

PaperID: 9171, https://arxiv.org/pdf/2508.08800.pdf

Authors: David Mguni, Yaqi Sun, Haojun Chen, Wanrong Yang, Amir Darabi, Larry Olanrewaju Orimoloye, Yaodong Yang

Title: Fault Tolerant Multi-Agent Learning with Adversarial Budget Constraints

Abstract:
We study robustness to agent malfunctions in cooperative multi‑agent reinforcement learning (MARL), a failure mode that is critical in practice yet underexplored in existing theory. We introduce MARTA, a plug‑and‑play robustness layer that augments standard MARL algorithms with a Switcher‑Adversary mechanism which selectively induces malfunctions in performance‑critical states. This formulation defines a fault‑switching (N+2)‑player Markov game in which the Switcher chooses when and which agent fails, and the Adversary controls the resulting faulty behaviour via random or worst‑case policies. We develop a Q‑learning‑type scheme and show that the associated Bellman operator is a contraction, yielding existence and uniqueness of the minimax value, convergence to a Markov perfect equilibrium. MARTA integrates seamlessly with MARL algorithms without architectural modification and consistently improves robustness across Traffic Junction (TJ), Level‑Based Foraging (LBF), MPE SimpleTag, and SMAC (v2). In these domains, MARTA achieves large gains in final performance of up to 116.7% in SMAC, 21.4% in MPE SimpleTag, and 44.6% in LBF, while significantly reducing failure rates under train‑test mismatched fault regimes. These results establish MARTA as a theoretically grounded and practically deployable mechanism for fault‑tolerant MARL.

PaperID: 9172, https://arxiv.org/pdf/2508.08761.pdf

Authors: Stavros Doropoulos, Stavros Vologiannidis, Ioannis Magnisalis

Title: DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation

Abstract:
The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model‑based (LLM) multi‑agent expert system, to automate this unstructured‑to‑structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi‑turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi‑label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1‑Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.

PaperID: 9173, https://arxiv.org/pdf/2508.08709.pdf

Authors: Lukas Krupp, Maximilian Schöffel, Elias Biehl, Norbert Wehn

Title: CRADLE: Conversational RTL Design Space Exploration with LLM-based Multi-Agent Systems

Abstract:
This paper presents CRADLE, a conversational framework for design space exploration of RTL designs using LLM‑based multi‑agent systems. Unlike existing rigid approaches, CRADLE enables user‑guided flows with internal self‑verification, correction, and optimization. We demonstrate the framework with a generator‑critic agent system targeting FPGA resource minimization using state‑of‑the‑art LLMs. Experimental results on the RTLLM benchmark show that CRADLE achieves significant reductions in resource usage with averages of 48% and 40% in LUTs and FFs across all benchmark designs.

PaperID: 9174, https://arxiv.org/pdf/2508.08669.pdf

Authors: Yizhou Zhang, Eric Mazumdar

Title: Convergent Q-Learning for Infinite-Horizon General-Sum Markov Games through Behavioral Economics

Abstract:
Risk‑aversion and bounded rationality are two key characteristics of human decision‑making. Risk‑averse quantal‑response equilibrium (RQE) is a solution concept that incorporates these features, providing a more realistic depiction of human decision making in various strategic environments compared to a Nash equilibrium. Furthermore a class of RQE has recently been shown in arXiv:2406.14156 to be universally computationally tractable in all finite‑horizon Markov games, allowing for the development of multi‑agent reinforcement learning algorithms with convergence guarantees. In this paper, we expand upon the study of RQE and analyze their computation in both two‑player normal form games and discounted infinite‑horizon Markov games. For normal form games we adopt a monotonicity‑based approach allowing us to generalize previous results. We first show uniqueness and Lipschitz continuity of RQE with respect to player's payoff matrices under monotonicity assumptions, and then provide conditions on the players' degrees of risk aversion and bounded rationality that ensure monotonicity. We then focus on discounted infinite‑horizon Markov games. We define the risk‑averse quantal‑response Bellman operator and prove its contraction under further conditions on the players' risk‑aversion, bounded rationality, and temporal discounting. This yields a Q‑learning based algorithm with convergence guarantees for all infinite‑horizon general‑sum Markov games.

PaperID: 9175, https://arxiv.org/pdf/2508.08632.pdf

Authors: Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, Yong He, Runhe Huang, Shijian Li

Title: AgriGPT: a Large Language Model Ecosystem for Agriculture

Abstract:
Despite the rapid progress of Large Language Models (LLMs), their application in agriculture remains limited due to the lack of domain‑specific models, curated datasets, and robust evaluation frameworks. To address these challenges, we propose AgriGPT, a domain‑specialized LLM ecosystem for agricultural usage. At its core, we design a multi‑agent scalable data engine that systematically compiles credible data sources into Agri‑342K, a high‑quality, standardized question‑answer (QA) dataset. Trained on this dataset, AgriGPT supports a broad range of agricultural stakeholders, from practitioners to policy‑makers. To enhance factual grounding, we employ Tri‑RAG, a three‑channel Retrieval‑Augmented Generation framework combining dense retrieval, sparse retrieval, and multi‑hop knowledge graph reasoning, thereby improving the LLM's reasoning reliability. For comprehensive evaluation, we introduce AgriBench‑13K, a benchmark suite comprising 13 tasks with varying types and complexities. Experiments demonstrate that AgriGPT significantly outperforms general‑purpose LLMs on both domain adaptation and reasoning. Beyond the model itself, AgriGPT represents a modular and extensible LLM ecosystem for agriculture, comprising structured data construction, retrieval‑enhanced generation, and domain‑specific evaluation. This work provides a generalizable framework for developing scientific and industry‑specialized LLMs. All models, datasets, and code will be released to empower agricultural communities, especially in underserved regions, and to promote open, impactful research.

PaperID: 9176, https://arxiv.org/pdf/2508.08574.pdf

Authors: Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Alexandre Ormiga Galvao Barbosa, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W. Durham

Title: DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Abstract:
We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large‑scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi‑agent foundation models: the robot‑centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot‑floor (RF) model uses a transformer with cross‑attention between robots and the warehouse floor; the image‑floor (IF) model applies convolutional encoding to a multi‑channel image representation of the full fleet; and the graph‑floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot‑centric and graph‑floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouses operation datasets as the models are scaled up.

PaperID: 9177, https://arxiv.org/pdf/2508.08555.pdf

Authors: Tong Zhang, Yu Gou, Jun Liu, Jun-Hong Cui

Title: Traffic Load-Aware Resource Management Strategy for Underwater Wireless Sensor Networks

Abstract:
Underwater Wireless Sensor Networks (UWSNs) represent a promising technology that enables diverse underwater applications through acoustic communication. However, it encounters significant challenges including harsh communication environments, limited energy supply, and restricted signal transmission. This paper aims to provide efficient and reliable communication in underwater networks with limited energy and communication resources by optimizing the scheduling of communication links and adjusting transmission parameters (e.g., transmit power and transmission rate). The efficient and reliable communication multi‑objective optimization problem (ERCMOP) is formulated as a decentralized partially observable Markov decision process (Dec‑POMDP). A Traffic Load‑Aware Resource Management (TARM) strategy based on deep multi‑agent reinforcement learning (MARL) is presented to address this problem. Specifically, a traffic load‑aware mechanism that leverages the overhear information from neighboring nodes is designed to mitigate the disparity between partial observations and global states. Moreover, by incorporating a solution space optimization algorithm, the number of candidate solutions for the deep MARL‑based decision‑making model can be effectively reduced, thereby optimizing the computational complexity. Simulation results demonstrate the adaptability of TARM in various scenarios with different transmission demands and collision probabilities, while also validating the effectiveness of the proposed approach in supporting efficient and reliable communication in underwater networks with limited resources.

PaperID: 9178, https://arxiv.org/pdf/2508.08544.pdf

Authors: Mark O. Riedl, Deven R. Desai

Title: AI Agents and the Law

Abstract:
As AI becomes more "agentic," it faces technical and socio‑legal issues it must address if it is to fulfill its promise of increased economic productivity and efficiency. This paper uses technical and legal perspectives to explain how things change when AI systems start being able to directly execute tasks on behalf of a user. We show how technical conceptions of agents track some, but not all, socio‑legal conceptions of agency. That is, both computer science and the law recognize the problems of under‑specification for an agent, and both disciplines have robust conceptions of how to address ensuring an agent does what the programmer, or in the law, the principal desires and no more. However, to date, computer science has under‑theorized issues related to questions of loyalty and to third parties that interact with an agent, both of which are central parts of the law of agency. First, we examine the correlations between implied authority in agency law and the principle of value‑alignment in AI, wherein AI systems must operate under imperfect objective specification. Second, we reveal gaps in the current computer science view of agents pertaining to the legal concepts of disclosure and loyalty, and how failure to account for them can result in unintended effects in AI ecommerce agents. In surfacing these gaps, we show a path forward for responsible AI agent development and deployment.

PaperID: 9179, https://arxiv.org/pdf/2508.08487.pdf

Authors: Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu

Title: MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Abstract:
Despite recent advances, long‑sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi‑agent collaborative framework designed to assist in long‑sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle ‑‑ Explore, Examine, and Enhance ‑‑ to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state‑of‑the‑art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high‑quality, complete long‑sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output ‑‑ videos with narratives and background music.

PaperID: 9180, https://arxiv.org/pdf/2508.08473.pdf

Authors: Hossein B. Jond

Title: A Minimal Model for Emergent Collective Behaviors in Autonomous Robotic Multi-Agent Systems

Abstract:
Collective behaviors such as swarming and flocking emerge from simple, decentralized interactions in biological systems. Existing models, such as Vicsek and Cucker‑Smale, lack collision avoidance, whereas the Olfati‑Saber model imposes rigid formations, limiting their applicability in swarm robotics. To address these limitations, this paper proposes a minimal yet expressive model that governs agent dynamics using relative positions, velocities, and local density, modulated by two tunable parameters: the spatial offset and kinetic offset. The model achieves spatially flexible, collision‑free behaviors that reflect naturalistic group dynamics. Furthermore, we extend the framework to cognitive autonomous systems, enabling energy‑aware phase transitions between swarming and flocking through adaptive control parameter tuning. This cognitively inspired approach offers a robust foundation for real‑world applications in multi‑robot systems, particularly autonomous aerial swarms.

PaperID: 9181, https://arxiv.org/pdf/2508.08331.pdf

Authors: Karen Guerrero-Vazquez, Jacopo Umberto Verga, Pilib O Broin, Eva Szegezdi, Katarzyna Goljanek-Whysall

Title: miRKatAI: An Integrated Database and Multi-agent AI system for microRNA Research

Abstract:
MicroRNAs (miRs) are robust regulators of gene expression, implicated in most biological processes. microRNAs predominantly downregulate the expression of genes post‑transcriptionally and each miR is predicted to target several hundred genes. The accurate identification and annotation of miR‑mRNA target interactions is central to understanding miRs function and their therapeutic potential. However, computational target prediction is challenging due to imperfect complementarity of miRs with their targets and the growing volume and heterogeneity of experimental data present challenges in accessing, integrating, and analysing miR‑target interaction information across biological contexts. This creates a need for integrated resources and intelligent query tools. We present the miRKat Suite, comprising miRKatDB, a comprehensive, curated database of predicted and validated miR‑target interactions and associated annotations, and miRKatAI, a multi‑agent system powered by large language models (LLMs) and LangGraph. miRKatDB integrates data from multiple publicly available sources, providing a comprehensive foundation for miR studies, including miR target genes and changes in levels of tissue expression previously reported. miRKatAI offers a natural language interface for complex querying of miRKatDB, facilitates grounded information retrieval from established sources in the field, and supports basic data visualisation. The miRKat Suite aims to accelerate miR research by streamlining data access, enhancing exploratory analysis, and supporting hypothesis generation.

PaperID: 9182, https://arxiv.org/pdf/2508.08325.pdf

Authors: Hangcheng Zhao, Ron Berman

Title: Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms

Abstract:
When online sellers use AI learning algorithms to automatically compete on e‑commerce platforms, there is concern that they will learn to coordinate on higher than competitive prices. However, this concern was primarily raised in single‑dimension price competition. We investigate whether this prediction holds when sellers make pricing and advertising decisions together, i.e., two‑dimensional decisions. We analyze competition in multi‑agent reinforcement learning, and use a large‑scale dataset from Amazon.com to provide empirical evidence. We show that when consumers have high search costs, learning algorithms can coordinate on prices lower than competitive prices, facilitating a win‑win‑win for consumers, sellers, and platforms. This occurs because algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices and enlarging demand on the platform. We also show that our results generalize to any learning algorithm that uses exploration of price and advertising bids. Consistent with our predictions, an empirical analysis shows that price levels exhibit a negative interaction between estimated consumer search costs and algorithm usage index. We analyze the platform's strategic response and find that reserve price adjustments will not increase platform profits, but commission adjustments will, while maintaining the beneficial outcomes for both sellers and consumers.

PaperID: 9183, https://arxiv.org/pdf/2508.08322.pdf

Authors: Muhammad Haseeb

Title: Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code

Abstract:
Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi‑file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT‑5) for clarifying user requirements, an Elicit‑powered semantic literature retrieval for injecting domain knowledge, NotebookLM‑based document synthesis for contextual understanding, and a Claude Code multi‑agent system for code generation and validation. Our integrated approach leverages intent clarification, retrieval‑augmented generation, and specialized sub‑agents orchestrated via Claude's agent framework. We demonstrate that this method significantly improves the accuracy and reliability of code assistants in real‑world repositories, yielding higher single‑shot success rates and better adherence to project context than baseline single‑agent approaches. Qualitative results on a large Next.js codebase show the multi‑agent system effectively plans, edits, and tests complex features with minimal human intervention. We compare our system with recent frameworks like CodePlan, MASAI, and HyperAgent, highlighting how targeted context injection and agent role decomposition lead to state‑of‑the‑art performance. Finally, we discuss the implications for deploying LLM‑based coding assistants in production, along with lessons learned on context management and future research directions.

PaperID: 9184, https://arxiv.org/pdf/2508.08284.pdf

Authors: Peng N. Wang, Peter B. Luh, Xuesong Lu, Peter Sincak, Laura Pitukova

Title: Binary Decision Process in Pre-Evacuation Behavior

Abstract:
In crowd evacuation the time interval before decisive movement towards a safe place is defined as the pre‑evacuation phase, and it has crucial impact on the total time required for safe egress. This process mainly refers to situation awareness and response to an external stressors, e.g., fire alarms. Due to the complexity of human cognitive process, simulation is used to study this important time interval. In this paper a binary decision process is formulated to simulate pre‑evacuation time of many evacuees in a given social context. The model combines the classic opinion dynamics (the French‑DeGroot model) with binary phase transition to describe how group pre‑evacuation time emerges from individual interaction. The model parameters are quantitatively meaningful to human factors research within socio‑psychological background, e.g., whether an individual is stubborn or open‑minded, or what kind of the social topology exists among the individuals and how it matters in aggregating individuals into social groups. The modeling framework also describes collective motion of many evacuee agents in a planar space, and the resulting multi‑agent system is partly similar to the Vicsek flocking model, and it is meaningful to explore complex social behavior during phase transition of a non‑equilibrium process.

PaperID: 9185, https://arxiv.org/pdf/2508.08283.pdf

Authors: Andres Garcia Rincon, Eliseo Ferrante

Title: MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language

Abstract:
This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi‑agent systems within arbitrary, user‑defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine‑tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google's Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine‑tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource‑constrained multi‑agent control scenarios. The framework and all resources are released open‑source to support reproducibility and future research.

PaperID: 9186, https://arxiv.org/pdf/2508.08264.pdf

Authors: Hadush Hailu, Bruk Gebregziabher, Prudhvi Raj

Title: Forecast-Driven MPC for Decentralized Multi-Robot Collision Avoidance

Abstract:
The Iterative Forecast Planner (IFP) is a geometric planning approach that offers lightweight computations, scalable, and reactive solutions for multi‑robot path planning in decentralized, communication‑free settings. However, it struggles in symmetric configurations, where mirrored interactions often lead to collisions and deadlocks. We introduce eIFP‑MPC, an optimized and extended version of IFP that improves robustness and path consistency in dense, dynamic environments. The method refines threat prioritization using a time‑to‑collision heuristic, stabilizes path generation through cost‑based via‑point selection, and ensures dynamic feasibility by incorporating model predictive control (MPC) into the planning process. These enhancements are tightly integrated into the IFP to preserve its efficiency while improving its adaptability and stability. Extensive simulations across symmetric and high‑density scenarios show that eIFP‑MPC significantly reduces oscillations, ensures collision‑free motion, and improves trajectory efficiency. The results demonstrate that geometric planners can be strengthened through optimization, enabling robust performance at scale in complex multi‑agent environments.

PaperID: 9187, https://arxiv.org/pdf/2508.08222.pdf

Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

Title: Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Abstract:
Transformers have demonstrated remarkable capabilities in multi‑step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi‑step reasoning problems through chain‑of‑thought processes, focusing on path‑finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two‑stage reasoning by first identifying the goal‑to‑root path and then reversing it to produce the root‑to‑goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one‑layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi‑phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain‑of‑thought steps, even shallow multi‑head transformers can effectively solve problems that would otherwise require deeper architectures.

PaperID: 9188, https://arxiv.org/pdf/2508.08137.pdf

Authors: Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi

Title: MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation

Abstract:
Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state‑of‑the‑art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open‑source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval‑Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal‑setting, and multi‑step information retrieval. It functions as a question‑answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real‑time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no‑human‑in‑the‑loop database generation, overcoming the bottleneck of simulation‑based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG‑250, targeting retrieval and citation performance, and Reasoning‑100 (Reas‑100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG‑250, and 86.8% accuracy on Reas‑100.

PaperID: 9189, https://arxiv.org/pdf/2508.08115.pdf

Authors: Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake

Title: TeamMedAgents: Pareto-Efficient Multi-Agent Medical Reasoning Through Teamwork Theory

Abstract:
Complex medical reasoning has historically required frontier language models to achieve clinically‑acceptable accuracy, creating computational barriers that limit deployment in resource‑constrained clinical settings. We present TeamMedAgents, a modular multi‑agent framework that translates Salas et al.'s evidence‑based teamwork theory into computational mechanisms‑‑shared mental models, team leadership, team orientation, trust networks, and mutual monitoring‑‑enabling Small Language Models to perform multi‑step clinical reasoning efficiently. Evaluation across 8 medical benchmarks demonstrates that TeamMedAgents advances the Pareto efficiency frontier by 1‑2 orders of magnitude, achieving competitive accuracy at substantially lower token cost than MDAgents, MedAgents, DyLAN, and ReConcile. The framework exhibits the lowest cross‑dataset variance among multi‑agent approaches, enabling deployment without per‑task tuning. Our results establish that theory‑grounded coordination mechanisms provide essential scaffolding for deploying efficient medical AI in resource‑constrained clinical environments.

PaperID: 9190, https://arxiv.org/pdf/2508.07999.pdf

Authors: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang

Title: WideSearch: Benchmarking Agentic Broad Info-Seeking

Abstract:
From professional research to everyday planning, many tasks are bottlenecked by wide‑scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide‑context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large‑scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large‑scale atomic information, which could be verified one by one objectively, and arrange it into a well‑organized output. A rigorous five‑stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state‑of‑the‑art agentic search systems, including single‑agent, multi‑agent frameworks, and end‑to‑end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross‑validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large‑scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch‑seed.github.io/

PaperID: 9191, https://arxiv.org/pdf/2508.07950.pdf

Authors: Chen Shen, Wanqing Zhang, Kehan Li, Erwen Huang, Haitao Bi, Aiying Fan, Yiwen Shen, Hongmei Dong, Ji Zhang, Yuming Shao, Zengjia Liu, Xinshe Liu, Tao Li, Chunxia Yan, Shuanliang Fan, Di Wu, Jianhua Ma, Bin Cong, Zhenyuan Wang, Chunfeng Lian

Title: FEAT: A Multi-Agent Forensic AI System with Domain-Adapted Large Language Model for Automated Cause-of-Death Analysis

Abstract:
Forensic cause‑of‑death determination faces systemic challenges, including workforce shortages and diagnostic variability, particularly in high‑volume systems like China's medicolegal infrastructure. We introduce FEAT (ForEnsic AgenT), a multi‑agent AI framework that automates and standardizes death investigations through a domain‑adapted large language model. FEAT's application‑oriented architecture integrates: (i) a central Planner for task decomposition, (ii) specialized Local Solvers for evidence analysis, (iii) a Memory & Reflection module for iterative refinement, and (iv) a Global Solver for conclusion synthesis. The system employs tool‑augmented reasoning, hierarchical retrieval‑augmented generation, forensic‑tuned LLMs, and human‑in‑the‑loop feedback to ensure legal and medical validity. In evaluations across diverse Chinese case cohorts, FEAT outperformed state‑of‑the‑art AI systems in both long‑form autopsy analyses and concise cause‑of‑death conclusions. It demonstrated robust generalization across six geographic regions and achieved high expert concordance in blinded validations. Senior pathologists validated FEAT's outputs as comparable to those of human experts, with improved detection of subtle evidentiary nuances. To our knowledge, FEAT is the first LLM‑based AI agent system dedicated to forensic medicine, offering scalable, consistent death certification while maintaining expert‑level rigor. By integrating AI efficiency with human oversight, this work could advance equitable access to reliable medicolegal services while addressing critical capacity constraints in forensic systems.

PaperID: 9192, https://arxiv.org/pdf/2508.07880.pdf

Authors: Sophia Rupprecht, Qinghe Gao, Tanuj Karia, Artur M. Schweidtmann

Title: Multi-agent systems for chemical engineering: A review and perspective

Abstract:
Large language model (LLM)‑based multi‑agent systems (MASs) are a recent but rapidly evolving technology with the potential to transform chemical engineering by decomposing complex workflows into teams of collaborative agents with specialized knowledge and tools. This review surveys the state‑of‑the‑art of MAS within chemical engineering. While early studies demonstrate promising results, scientific challenges remain, including the design of tailored architectures, integration of heterogeneous data modalities, development of foundation models with domain‑specific modalities, and strategies for ensuring transparency, safety, and environmental impact. As a young but fast‑moving field, MASs offer exciting opportunities to rethink chemical engineering workflows.

PaperID: 9193, https://arxiv.org/pdf/2508.07842.pdf

Authors: Yutong Shen, Hangxu Liu, Lei Zhang, Penghui Liu, Ruizhe Xia, Tianyi Yao, Tongtong Feng

Title: DETACH: Cross-domain Learning for Long-Horizon Tasks via Mixture of Disentangled Experts

Abstract:
Long‑Horizon (LH) tasks in Human‑Scene Interaction (HSI) are complex multi‑step tasks that require continuous planning, sequential decision‑making, and extended execution across domains to achieve the final goal. However, existing methods heavily rely on skill chaining by concatenating pre‑trained subtasks, with environment observations and self‑state tightly coupled, lacking the ability to generalize to new combinations of environments and skills, failing to complete various LH tasks across domains. To solve this problem, this paper presents DETACH, a cross‑domain learning framework for LH tasks via biologically inspired dual‑stream disentanglement. Inspired by the brain's "where‑what" dual pathway mechanism, DETACH comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross‑domain transfer through complete environment‑self disentanglement; ii) a skill learning module for task execution, which processes self‑state information including joint degrees of freedom and motor patterns, enabling cross‑skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, DETACH can achieve an average subtasks success rate improvement of 23% and average execution efficiency improvement of 29%.

PaperID: 9194, https://arxiv.org/pdf/2508.07827.pdf

Authors: Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen

Title: Evaluating Large Language Models as Expert Annotators

Abstract:
Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time‑consuming, and labor‑intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general domains natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate: whether top‑performing LLMs, which might be perceived as having expert‑level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators? To this end, we evaluate both individual LLMs and multi‑agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi‑agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others' annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3‑mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference‑time techniques (e.g., chain‑of‑thought (CoT), self‑consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non‑reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi‑agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.

PaperID: 9195, https://arxiv.org/pdf/2508.07745.pdf

Authors: Jiongchi Yu, Xiaofei Xie, Qiang Hu, Yuhan Ma, Ziming Zhao

Title: Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation

Abstract:
Insider threats pose a persistent and critical security risk, yet are notoriously difficult to detect in complex enterprise environments, where malicious actions are often hidden within seemingly benign user behaviors. Although machine‑learning‑based insider threat detection (ITD) methods have shown promise, their effectiveness is fundamentally limited by the scarcity of high‑quality and realistic training data. Enterprise internal data is highly sensitive and rarely accessible, while existing public and synthetic datasets are either small‑scale or lack sufficient realism, semantic richness, and behavioral diversity. To address this challenge, we propose Chimera, an LLM‑based multi‑agent framework that automatically simulates both benign and malicious insider activities and generates comprehensive system logs across diverse enterprise environments. Chimera models each agent as an individual employee with fine‑grained roles and supports group meetings, pairwise interactions, and self‑organized scheduling to capture realistic organizational dynamics. Based on 15 insider attacks abstracted from real‑world incidents, we deploy Chimera in three representative data‑sensitive organizational scenarios and construct ChimeraLog, a new dataset for developing and evaluating ITD methods. We evaluate ChimeraLog through human studies and quantitative analyses, demonstrating its diversity and realism. Experiments with existing ITD methods show substantially lower detection performance on ChimeraLog compared to prior datasets, indicating a more challenging and realistic benchmark. Moreover, despite distribution shifts, models trained on ChimeraLog exhibit strong generalization, highlighting the practical value of LLM‑based multi‑agent simulation for advancing insider threat detection.

PaperID: 9196, https://arxiv.org/pdf/2508.07730.pdf

Authors: Mingyang Su, Chao Liu, Jingling Zhang, WU Shuang, Mingming Fan

Title: SimViews: An Interactive Multi-Agent System Simulating Visitor-to-Visitor Conversational Patterns to Present Diverse Perspectives of Artifacts in Virtual Museums

Abstract:
Offering diverse perspectives on a museum artifact can deepen visitors' understanding and help avoid the cognitive limitations of a single narrative, ultimately enhancing their overall experience. Physical museums promote diversity through visitor interactions. However, it remains a challenge to present multiple voices appropriately while attracting and sustaining a visitor's attention in the virtual museum. Inspired by recent studies that show the effectiveness of LLM‑powered multi‑agents in presenting different opinions about an event, we propose SimViews, an interactive multi‑agent system that simulates visitor‑to‑visitor conversational patterns to promote the presentation of diverse perspectives. The system employs LLM‑powered multi‑agents that simulate virtual visitors with different professional identities, providing diverse interpretations of artifacts. Additionally, we constructed 4 conversational patterns between users and agents to simulate visitor interactions. We conducted a within‑subject study with 20 participants, comparing SimViews to a traditional single‑agent condition. Our results show that SimViews effectively facilitates the presentation of diverse perspectives through conversations, enhancing participants' understanding of viewpoints and engagement within the virtual museum.

PaperID: 9197, https://arxiv.org/pdf/2508.07720.pdf

Authors: Themistoklis Charalambous, Nikolaos Pappas, Nikolaos Nomikos, Risto Wichman

Title: Toward Goal-Oriented Communication in Multi-Agent Systems: An overview

Abstract:
As multi‑agent systems (MAS) become increasingly prevalent in autonomous systems, distributed control, and edge intelligence, efficient communication under resource constraints has emerged as a critical challenge. Traditional communication paradigms often emphasize message fidelity or bandwidth optimization, overlooking the task relevance of the exchanged information. In contrast, goal‑oriented communication prioritizes the importance of information with respect to the agents' shared objectives. This review provides a comprehensive survey of goal‑oriented communication in MAS, bridging perspectives from information theory, communication theory, and machine learning. We examine foundational concepts alongside learning‑based approaches and emergent protocols. Special attention is given to coordination under communication constraints, as well as applications in domains such as swarm robotics, federated learning, and edge computing. The paper concludes with a discussion of open challenges and future research directions at the intersection of communication theory, machine learning, and multi‑agent decision making.

PaperID: 9198, https://arxiv.org/pdf/2508.07686.pdf

Authors: Mingyue Lei, Zewei Zhou, Hongchen Li, Jiaqi Ma, Jia Hu

Title: Risk Map As Middleware: Towards Interpretable Cooperative End-to-end Autonomous Driving for Risk-Aware Planning

Abstract:
End‑to‑end paradigm has emerged as a promising approach to autonomous driving. However, existing single‑agent end‑to‑end pipelines are often constrained by occlusion and limited perception range, resulting in hazardous driving. Furthermore, their black‑box nature prevents the interpretability of the driving behavior, leading to an untrustworthiness system. To address these limitations, we introduce Risk Map as Middleware (RiskMM) and propose an interpretable cooperative end‑to‑end driving framework. The risk map learns directly from the driving data and provides an interpretable spatiotemporal representation of the scenario from the upstream perception and the interactions between the ego vehicle and the surrounding environment for downstream planning. RiskMM first constructs a multi‑agent spatiotemporal representation with unified Transformer‑based architecture, then derives risk‑aware representations by modeling interactions among surrounding environments with attention. These representations are subsequently fed into a learning‑based Model Predictive Control (MPC) module. The MPC planner inherently accommodates physical constraints and different vehicle types and can provide interpretation by aligning learned parameters with explicit MPC elements. Evaluations conducted on the real‑world V2XPnP‑Seq dataset confirm that RiskMM achieves superior and robust performance in risk‑aware trajectory planning, significantly enhancing the interpretability of the cooperative end‑to‑end driving framework. The codebase will be released to facilitate future research in this field.

PaperID: 9199, https://arxiv.org/pdf/2508.07679.pdf

Authors: Tong Zhang, Yu Gou, Jun Liu, Shanshan Song, Tingting Yang, Jun-Hong Cui

Title: Joint link scheduling and power allocation in imperfect and energy-constrained underwater wireless sensor networks

Abstract:
Underwater wireless sensor networks (UWSNs) stand as promising technologies facilitating diverse underwater applications. However, the major design issues of the considered system are the severely limited energy supply and unexpected node malfunctions. This paper aims to provide fair, efficient, and reliable (FER) communication to the imperfect and energy‑constrained UWSNs (IC‑UWSNs). Therefore, we formulate a FER‑communication optimization problem (FERCOP) and propose ICRL‑JSA to solve the formulated problem. ICRL‑JSA is a deep multi‑agent reinforcement learning (MARL)‑based optimizer for IC‑UWSNs through joint link scheduling and power allocation, which automatically learns scheduling algorithms without human intervention. However, conventional RL methods are unable to address the challenges posed by underwater environments and IC‑UWSNs. To construct ICRL‑JSA, we integrate deep Q‑network into IC‑UWSNs and propose an advanced training mechanism to deal with complex acoustic channels, limited energy supplies, and unexpected node malfunctions. Simulation results demonstrate the superiority of the proposed ICRL‑JSA scheme with an advanced training mechanism compared to various benchmark algorithms.

PaperID: 9200, https://arxiv.org/pdf/2508.07671.pdf

Authors: Mohamed Rayan Barhdadi, Mehmet Tuncel, Erchin Serpedin, Hasan Kurban

Title: EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration

Abstract:
Current AI approaches to refugee integration optimize narrow objectives such as employment and fail to capture the cultural, emotional, and ethical dimensions critical for long‑term success. We introduce EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi‑agent framework addressing the central Creative AI question: how do we preserve human dignity when machines participate in life‑altering decisions? Grounded in Kegan's Constructive Developmental Theory, EMPATHIA decomposes integration into three modules: SEED (Socio‑cultural Entry and Embedding Decision) for initial placement, RISE (Rapid Integration and Self‑sufficiency Engine) for early independence, and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement) for sustained outcomes. SEED employs a selector‑validator architecture with three specialized agents ‑ emotional, cultural, and ethical ‑ that deliberate transparently to produce interpretable recommendations. Experiments on the UN Kakuma dataset (15,026 individuals, 7,960 eligible adults 15+ per ILO/UNHCR standards) and implementation on 6,359 working‑age refugees (15+) with 150+ socioeconomic variables achieved 87.4% validation convergence and explainable assessments across five host countries. EMPATHIA's weighted integration of cultural, emotional, and ethical factors balances competing value systems while supporting practitioner‑AI collaboration. By augmenting rather than replacing human expertise, EMPATHIA provides a generalizable framework for AI‑driven allocation tasks where multiple values must be reconciled.

PaperID: 9201, https://arxiv.org/pdf/2508.07667.pdf

Authors: Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap

Title: 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning

Abstract:
Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi‑agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information‑flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmark with several open‑source and closed‑sourced LLMs demonstrate that our best multi‑agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT‑4o) while preserving the fidelity of public content, outperforming single‑agent baselines. These results highlight the promise of principled information‑flow design in multi‑agent systems for contextual privacy with LLMs.

PaperID: 9202, https://arxiv.org/pdf/2508.07630.pdf

Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta

Title: InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Abstract:
We introduce InterChart, a diagnostic benchmark that evaluates how well vision‑language models (VLMs) reason across multiple related charts, a task central to real‑world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi‑step reasoning grounded in 2‑3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real‑world chart pairs. Our evaluation of state‑of‑the‑art open‑ and closed‑source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi‑entity charts into simpler visual units, underscoring their struggles with cross‑chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi‑visual environments.

PaperID: 9203, https://arxiv.org/pdf/2508.07604.pdf

Authors: Maryam Abbasalizadeh, Sashank Narain

Title: Joint Scheduling and Resource Allocation in mmWave IAB Networks Using Deep RL

Abstract:
Integrated Access and Backhaul (IAB) is critical for dense 5G and beyond deployments, especially in mmWave bands where fiber backhaul is infeasible. We propose a novel Deep Reinforcement Learning (DRL) framework for joint link scheduling and resource slicing in dynamic, interference‑prone IAB networks. Our method integrates a greedy Double Deep Q‑Network (DDQN) scheduler to activate access and backhaul links based on traffic and topology, with a multi‑agent DDQN allocator for bandwidth and antenna assignment across network slices. This decentralized approach respects strict antenna constraints and supports concurrent scheduling across heterogeneous links. Evaluations across 96 dynamic topologies show 99.84 percent scheduling accuracy and 20.90 percent throughput improvement over baselines. The framework's efficient operation and adaptability make it suitable for dynamic and resource‑constrained deployments, where fast link scheduling and autonomous backhaul coordination are vital.

PaperID: 9204, https://arxiv.org/pdf/2508.07578.pdf

Authors: Yu Gou, Tong Zhang, Jun Liu, Tingting Yang, Shanshan Song, Jun-Hong Cui

Title: Achieving Fair-Effective Communications and Robustness in Underwater Acoustic Sensor Networks: A Semi-Cooperative Approach

Abstract:
This paper investigates the fair‑effective communication and robustness in imperfect and energy‑constrained underwater acoustic sensor networks (IC‑UASNs). Specifically, we investigate the impact of unexpected node malfunctions on the network performance under the time‑varying acoustic channels. Each node is expected to satisfy Quality of Service (QoS) requirements. However, achieving individual QoS requirements may interfere with other concurrent communications. Underwater nodes rely excessively on the rationality of other underwater nodes when guided by fully cooperative approaches, making it difficult to seek a trade‑off between individual QoS and global fair‑effective communications under imperfect conditions. Therefore, this paper presents a SEmi‑COoperative Power Allocation approach (SECOPA) that achieves fair‑effective communication and robustness in IC‑UASNs. The approach is distributed multi‑agent reinforcement learning (MARL)‑based, and the objectives are twofold. On the one hand, each intelligent node individually decides the transmission power to simultaneously optimize individual and global performance. On the other hand, advanced training algorithms are developed to provide imperfect environments for training robust models that can adapt to the time‑varying acoustic channels and handle unexpected node failures in the network. Numerical results are presented to validate our proposed approach.

PaperID: 9205, https://arxiv.org/pdf/2508.07575.pdf

Authors: Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo

Title: MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Abstract:
LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage, etc. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs and AI Agents' MCP tool use abilities suffer from several issues. First, there's a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse formats of response from MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool‑use benchmarks with high success rates in functions like programming and math functions, the success rate of real‑world MCP tool is not guaranteed and varies across different MCP servers. Furthermore, the LLMs' context window also limits the number of available tools that can be called in a single run, because the textual descriptions of tool and the parameters have long token length for an LLM to process all at once. To help address the challenges of evaluating LLMs' performance on calling MCP tools, we propose MCPToolBench++, a large‑scale, multi‑domain AI Agent tool use benchmark. As of July 2025, this benchmark is build upon marketplace of over 4k MCP servers from more than 40 categories, collected from the MCP marketplaces and GitHub communities. The datasets consist of both single‑step and multi‑step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and reported the results.

PaperID: 9206, https://arxiv.org/pdf/2508.07569.pdf

Authors: Amulya Suravarjhula, Rashi Chandrashekhar Agrawal, Sakshi Jayesh Patel, Rahul Gupta

Title: Retrieval-Augmented Multi-Agent System for Rapid Statement of Work Generation

Abstract:
Drafting a Statement of Work (SOW) is a vital part of business and legal projects. It outlines key details like deliverables, timelines, responsibilities, and legal terms. However, creating these documents is often a slow and complex process. It usually involves multiple people, takes several days, and leaves room for errors or outdated content. This paper introduces a new AI‑driven automation system that makes the entire SOW drafting process faster, easier, and more accurate. Instead of relying completely on humans, the system uses three intelligent components or 'agents' that each handle a part of the job. One agent writes the first draft, another checks if everything is legally correct, and the third agent formats the document and ensures everything is in order. Unlike basic online tools that just fill in templates, this system understands the meaning behind the content and customizes the SOW to match the needs of the project. It also checks legal compliance and formatting so that users can trust the result. The system was tested using real business examples. It was able to create a full SOW in under three minutes, compared to several hours or days using manual methods. It also performed well in accuracy and quality, showing that it can reduce legal risks and save a lot of time. This solution shows how artificial intelligence can be used to support legal and business professionals by taking care of routine work and helping them focus on more important decisions. It's a step toward making legal processes smarter, faster, and more reliable.

PaperID: 9207, https://arxiv.org/pdf/2508.07560.pdf

Authors: Yan Gong, Naibang Wang, Jianli Lu, Xinyu Zhang, Yongsheng Gao, Jie Zhao, Zifan Huang, Haozhi Bai, Nanxin Zeng, Nayu Su, Lei Yang, Ziying Song, Xiaoxi Hu, Xinmin Jiang, Xiaojuan Zhang, Susanto Rahardja

Title: Progressive Bird's Eye View Perception for Safety-Critical Autonomous Driving: A Comprehensive Survey

Abstract:
Bird's‑Eye‑View (BEV) perception has become a foundational paradigm in autonomous driving, enabling unified spatial representations that support robust multi‑sensor fusion and multi‑agent collaboration. As autonomous vehicles transition from controlled environments to real‑world deployment, ensuring the safety and reliability of BEV perception in complex scenarios ‑ such as occlusions, adverse weather, and dynamic traffic ‑ remains a critical challenge. This survey provides the first comprehensive review of BEV perception from a safety‑critical perspective, systematically analyzing state‑of‑the‑art frameworks and implementation strategies across three progressive stages: single‑modality vehicle‑side, multimodal vehicle‑side, and multi‑agent collaborative perception. Furthermore, we examine public datasets encompassing vehicle‑side, roadside, and collaborative settings, evaluating their relevance to safety and robustness. We also identify key open‑world challenges ‑ including open‑set recognition, large‑scale unlabeled data, sensor degradation, and inter‑agent communication latency ‑ and outline future research directions, such as integration with end‑to‑end autonomous driving systems, embodied intelligence, and large language models.

PaperID: 9208, https://arxiv.org/pdf/2508.07505.pdf

Authors: Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu

Title: Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach

Abstract:
Decentralized min‑max optimization allows multi‑agent systems to collaboratively solve global min‑max optimization problems by facilitating the exchange of model updates among neighboring agents, eliminating the need for a central server. However, sharing model updates in such systems carry a risk of exposing sensitive data to inference attacks, raising significant privacy concerns. To mitigate these privacy risks, differential privacy (DP) has become a widely adopted technique for safeguarding individual data. Despite its advantages, implementing DP in decentralized min‑max optimization poses challenges, as the added noise can hinder convergence, particularly in non‑convex scenarios with complex agent interactions in min‑max optimization problems. In this work, we propose an algorithm called DPMixSGD (Differential Private Minmax Hybrid Stochastic Gradient Descent), a novel privacy‑preserving algorithm specifically designed for non‑convex decentralized min‑max optimization. Our method builds on the state‑of‑the‑art STORM‑based algorithm, one of the fastest decentralized min‑max solutions. We rigorously prove that the noise added to local gradients does not significantly compromise convergence performance, and we provide theoretical bounds to ensure privacy guarantees. To validate our theoretical findings, we conduct extensive experiments across various tasks and models, demonstrating the effectiveness of our approach.

PaperID: 9209, https://arxiv.org/pdf/2508.07466.pdf

Authors: Dom Huh, Prasant Mohapatra

Title: Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs

Abstract:
Language is a ubiquitous tool that is foundational to reasoning and collaboration, ranging from everyday interactions to sophisticated problem‑solving tasks. The establishment of a common language can serve as a powerful asset in ensuring clear communication and understanding amongst agents, facilitating desired coordination and strategies. In this work, we extend the capabilities of large language models (LLMs) by integrating them with advancements in multi‑agent decision‑making algorithms. We propose a systematic framework for the design of multi‑agentic large language models (LLMs), focusing on key integration practices. These include advanced prompt engineering techniques, the development of effective memory architectures, multi‑modal information processing, and alignment strategies through fine‑tuning algorithms. We evaluate these design choices through extensive ablation studies on classic game settings with significant underlying social dilemmas and game‑theoretic considerations.

PaperID: 9210, https://arxiv.org/pdf/2508.07186.pdf

Authors: Amit Dhanda

Title: Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables

Abstract:
We propose a novel framework for summarizing structured enterprise data across multiple dimensions using large language model (LLM)‑based agents. Traditional table‑to‑text models often lack the capacity to reason across hierarchical structures and context‑aware deltas, which are essential in business reporting tasks. Our method introduces a multi‑agent pipeline that extracts, analyzes, and summarizes multi‑dimensional data using agents for slicing, variance detection, context construction, and LLM‑based generation. Our results show that the proposed framework outperforms traditional approaches, achieving 83% faithfulness to underlying data, superior coverage of significant changes, and high relevance scores (4.4/5) for decision‑critical insights. The improvements are especially pronounced in categories involving subtle trade‑offs, such as increased revenue due to price changes amid declining unit volumes, which competing methods either overlook or address with limited specificity. We evaluate the framework on Kaggle datasets and demonstrate significant improvements in faithfulness, relevance, and insight quality over baseline table summarization approaches.

PaperID: 9211, https://arxiv.org/pdf/2508.07092.pdf

Authors: Yue Hu, Juntong Peng, Yunqiao Yang, Siheng Chen

Title: Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration

Abstract:
Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. It inherently results in a fundamental trade‑off between detection performance and communication bandwidth. To tackle this bottleneck issue, we propose a novel hybrid collaboration that adaptively integrates two types of communication messages: perceptual outputs, which are compact, and raw observations, which offer richer information. This approach focuses on two key aspects: i) integrating complementary information from two message types and ii) prioritizing the most critical data within each type. By adaptively selecting the most critical set of messages, it ensures optimal perceptual information and adaptability, effectively meeting the demands of diverse communication scenarios.Building on this hybrid collaboration, we present \textttHyComm, a communication‑efficient LiDAR‑based collaborative 3D detection system. \textttHyComm boasts two main benefits: i) it facilitates adaptable compression rates for messages, addressing various communication requirements, and ii) it uses standardized data formats for messages. This ensures they are independent of specific detection models, fostering adaptability across different agent configurations. To evaluate HyComm, we conduct experiments on both real‑world and simulation datasets: DAIR‑V2X and OPV2V. HyComm consistently outperforms previous methods and achieves a superior performance‑bandwidth trade‑off regardless of whether agents use the same or varied detection models. It achieves a lower communication volume of more than 2,006× and still outperforms Where2comm on DAIR‑V2X in terms of AP50. The related code will be released.

PaperID: 9212, https://arxiv.org/pdf/2508.07043.pdf

Authors: Orion Li, Vinayak Agarwal, Summer Zhou, Ashwin Gopinath, Timothy Kassis

Title: K-Dense Analyst: Towards Fully Automated Scientific Analysis

Abstract:
The complexity of modern bioinformatics analysis has created a critical gap between data generation and developing scientific insights. While large language models (LLMs) have shown promise in scientific reasoning, they remain fundamentally limited when dealing with real‑world analytical workflows that demand iterative computation, tool integration and rigorous validation. We introduce K‑Dense Analyst, a hierarchical multi‑agent system that achieves autonomous bioinformatics analysis through a dual‑loop architecture. K‑Dense Analyst, part of the broader K‑Dense platform, couples planning with validated execution using specialized agents to decompose complex objectives into executable, verifiable tasks within secure computational environments. On BixBench, a comprehensive benchmark for open‑ended biological analysis, K‑Dense Analyst achieves 29.2% accuracy, surpassing the best‑performing language model (GPT‑5) by 6.3 percentage points, representing nearly 27% improvement over what is widely considered the most powerful LLM available. Remarkably, K‑Dense Analyst achieves this performance using Gemini 2.5 Pro, which attains only 18.3% accuracy when used directly, demonstrating that our architectural innovations unlock capabilities far beyond the underlying model's baseline performance. Our insights demonstrate that autonomous scientific reasoning requires more than enhanced language models, it demands purpose‑built systems that can bridge the gap between high‑level scientific objectives and low‑level computational execution. These results represent a significant advance toward fully autonomous computational biologists capable of accelerating discovery across the life sciences.

PaperID: 9213, https://arxiv.org/pdf/2508.07035.pdf

Authors: Jiaxuan Liu, Tiannian Zhu, Caiyuan Ye, Zhong Fang, Hongming Weng, Quansheng Wu

Title: VASPilot: MCP-Facilitated Multi-Agent Intelligence for Autonomous VASP Simulations

Abstract:
Density‑functional‑theory (DFT) simulations with the Vienna Ab initio Simulation Package (VASP) are indispensable in computational materials science but often require extensive manual setup, monitoring, and postprocessing. Here, we introduce VASPilot, an open‑source platform that fully automates VASP workflows via a multi‑agent architecture built on the CrewAI framework and a standardized Model Context Protocol (MCP). VASPilot's agent suite handles every stage of a VASP study‑from retrieving crystal structures and generating input files to submitting Slurm jobs, parsing error messages, and dynamically adjusting parameters for seamless restarts. A lightweight Flask‑based web interface provides intuitive task submission, real‑time progress tracking, and drill‑down access to execution logs, structure visualizations, and plots. We validate VASPilot on both routine and advanced benchmarks: automated band‑structure and density‑of‑states calculations (including on‑the‑fly symmetry corrections), plane‑wave cutoff convergence tests, lattice‑constant optimizations with various van der Waals corrections, and cross‑material band‑gap comparisons for transition‑metal dichalcogenides. In all cases, VASPilot completed the missions reliably and without manual intervention. Moreover, its modular design allows easy extension to other DFT codes simply by deploying the appropriate MCP server. By offloading technical overhead, VASPilot enables researchers to focus on scientific discovery and accelerates high‑throughput computational materials research.

PaperID: 9214, https://arxiv.org/pdf/2508.07033.pdf

Authors: Shengli Zhou, Xiangchen Wang, Jinrui Zhang, Ruozai Tian, Rongtao Xu, Feng Zheng

Title: $\mathcal{P}^3$: Toward Versatile Embodied Agents

Abstract:
Embodied agents have shown promising generalization capabilities across diverse physical environments, making them essential for a wide range of real‑world applications. However, building versatile embodied agents poses critical challenges due to three key issues: dynamic environment perception, open‑ended tool usage, and complex multi‑task planning. Most previous works rely solely on feedback from tool agents to perceive environmental changes and task status, which limits adaptability to real‑time dynamics, causes error accumulation, and restricts tool flexibility. Furthermore, multi‑task scheduling has received limited attention, primarily due to the inherent complexity of managing task dependencies and balancing competing priorities in dynamic and complex environments. To overcome these challenges, we introduce \mathcalP^3, a unified framework that integrates real‑time perception and dynamic scheduling. Specifically, \mathcalP^3 enables 1) \textbf Perceive relevant task information actively from the environment, 2) \textbf Plug and utilize any tool without feedback requirement, and 3) \textbf Plan multi‑task execution based on prioritizing urgent tasks and dynamically adjusting task order based on dependencies. Extensive real‑world experiments show that our approach bridges the gap between benchmarks and practical deployment, delivering highly transferable, general‑purpose embodied agents. Code and data will be released soon.

PaperID: 9215, https://arxiv.org/pdf/2508.07021.pdf

Authors: Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo

Title: DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

Abstract:
The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision‑Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT‑4o) by orchestrating a sophisticated multi‑agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed‑loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state‑of‑the‑art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine's superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

PaperID: 9216, https://arxiv.org/pdf/2508.07010.pdf

Authors: Roberto Balestri, Guglielmo Pescatore

Title: Narrative Memory in Machines: Multi-Agent Arc Extraction in Serialized TV

Abstract:
Serialized television narratives present significant analytical challenges due to their complex, temporally distributed storylines that necessitate sophisticated information management. This paper introduces a multi‑agent system (MAS) designed to extract and analyze narrative arcs by implementing principles of computational memory architectures. The system conceptualizes narrative understanding through analogues of human memory: Large Language Models (LLMs) provide a form of semantic memory for general narrative patterns, while a vector database stores specific arc progressions as episodic memories. A multi‑agent workflow simulates working memory processes to integrate these information types. Tested on the first season of Grey's Anatomy (ABC 2005‑), the MAS identifies three arc types: Anthology (self‑contained), Soap (relationship‑focused), and Genre‑Specific. These arcs and their episodic developments are stored in a vector database, facilitating structured analysis and semantic comparison. To bridge automation with critical interpretation, a graphical interface enables human oversight and refinement of the system's narrative memory. While demonstrating strong performance in identifying Anthology Arcs and character entities, the system's reliance on textual paratexts (episode summaries) revealed limitations in discerning overlapping arcs and opaque dynamics, underscoring the challenges in computational memory consolidation versus human holistic understanding. This memory‑centric approach highlights the potential of combining AI‑driven memory processing with human expertise. Beyond television, it offers promise for serialized written formats where narrative is entirely text‑based. Future work will focus on integrating multimodal inputs to enrich episodic memory, refining memory integration mechanisms within the MAS, and expanding testing across diverse genres.

PaperID: 9217, https://arxiv.org/pdf/2508.07001.pdf

Authors: Myeung Suk Oh, Zhiyao Zhang, FNU Hairi, Alvaro Velasquez, Jia Liu

Title: Consensus-based Decentralized Multi-agent Reinforcement Learning for Random Access Network Optimization

Abstract:
With wireless devices increasingly forming a unified smart network for seamless, user‑friendly operations, random access (RA) medium access control (MAC) design is considered a key solution for handling unpredictable data traffic from multiple terminals. However, it remains challenging to design an effective RA‑based MAC protocol to minimize collisions and ensure transmission fairness across the devices. While existing multi‑agent reinforcement learning (MARL) approaches with centralized training and decentralized execution (CTDE) have been proposed to optimize RA performance, their reliance on centralized training and the significant overhead required for information collection can make real‑world applications unrealistic. In this work, we adopt a fully decentralized MARL architecture, where policy learning does not rely on centralized tasks but leverages consensus‑based information exchanges across devices. We design our MARL algorithm over an actor‑critic (AC) network and propose exchanging only local rewards to minimize communication overhead. Furthermore, we provide a theoretical proof of global convergence for our approach. Numerical experiments show that our proposed MARL algorithm can significantly improve RA network performance compared to other baselines.

PaperID: 9218, https://arxiv.org/pdf/2508.06990.pdf

Authors: Yue Hu, Junzhe Wu, Ruihan Xu, Hang Liu, Avery Xi, Henry X. Liu, Ram Vasudevan, Maani Ghaffari

Title: Imaginative World Modeling with Scene Graphs for Embodied Agent Navigation

Abstract:
Semantic navigation requires an agent to navigate toward a specified target in an unseen environment. Employing an imaginative navigation strategy that predicts future scenes before taking action, can empower the agent to find target faster. Inspired by this idea, we propose SGImagineNav, a novel imaginative navigation framework that leverages symbolic world modeling to proactively build a global environmental representation. SGImagineNav maintains an evolving hierarchical scene graphs and uses large language models to predict and explore unseen parts of the environment. While existing methods solely relying on past observations, this imaginative scene graph provides richer semantic context, enabling the agent to proactively estimate target locations. Building upon this, SGImagineNav adopts an adaptive navigation strategy that exploits semantic shortcuts when promising and explores unknown areas otherwise to gather additional context. This strategy continuously expands the known environment and accumulates valuable semantic contexts, ultimately guiding the agent toward the target. SGImagineNav is evaluated in both real‑world scenarios and simulation benchmarks. SGImagineNav consistently outperforms previous methods, improving success rate to 65.4 and 66.8 on HM3D and HSSD, and demonstrating cross‑floor and cross‑room navigation in real‑world environments, underscoring its effectiveness and generalizability.

PaperID: 9219, https://arxiv.org/pdf/2508.06963.pdf

Authors: Changqing Li, Tianlin Li, Xiaohan Zhang, Aishan Liu, Li Pan

Title: MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair

Abstract:
Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine‑tuning (SFT) and reinforcement learning with human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training‑free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end‑to‑end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester, a multi‑agent system that generates diverse, high‑quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context‑aware strategy selection during inference. Experiments on standard and customized trustworthiness tasks show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA‑3.1‑8B‑Chat and 4.21% on Qwen‑3‑8B‑Chat, while maintaining general model capabilities. MASteer demonstrates strong robustness, generalization, and practical value for scalable, efficient trustworthiness repair.

PaperID: 9220, https://arxiv.org/pdf/2508.06948.pdf

Authors: Jinyuan Chen, Jiuchen Shi, Quan Chen, Minyi Guo

Title: Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud

Abstract:
Multi‑agent applications utilize the advanced capabilities of large language models (LLMs) for intricate task completion through agent collaboration in a workflow. Under this situation, requests from different agents usually access the same shared LLM to perform different kinds of tasks, forcing the shared LLM to suffer excessive loads. However, existing works have low serving performance for these multi‑agent applications, mainly due to the ignorance of inter‑agent latency and resource differences for request scheduling. We therefore propose Kairos, a multi‑agent orchestration system that optimizes end‑to‑end latency for multi‑agent applications. Kairos consists of a workflow orchestrator, a workflow‑aware priority scheduler, and a memory‑aware dispatcher. The orchestrator collects agent‑specific information for online workflow analysis. The scheduler decides the serving priority of the requests based on their latency characteristics to reduce the overall queuing. The dispatcher dispatches the requests to different LLM instances based on their memory demands to avoid GPU overloading. Experimental results show that Kairos reduces end‑to‑end latency by 17.8% to 28.4% compared to state‑of‑the‑art works.

PaperID: 9221, https://arxiv.org/pdf/2508.06916.pdf

Authors: Shichao Ma, Yunhe Guo, Jiahao Su, Qihe Huang, Zhengyang Zhou, Yang Wang

Title: Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

Abstract:
Text‑to‑image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single‑turn scenarios and struggle with iterative, multi‑turn creative tasks. Recent dialogue‑based systems attempt to bridge this gap, but their single‑agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi‑agent system for interactive image generation and editing in multi‑turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback‑driven refinement based on a multi‑view evaluation mechanism. Talk2Image enables step‑by‑step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.

PaperID: 9222, https://arxiv.org/pdf/2508.06836.pdf

Authors: Xutong Zhao, Yaqi Xie

Title: Multi-level Advantage Credit Assignment for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent's contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi‑level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi‑level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention‑based framework, MACA identifies correlated agent relationships and constructs multi‑level advantages to guide policy learning. Comprehensive experiments on challenging Starcraft v1\&v2 tasks demonstrate MACA's superior performance, underscoring its efficacy in complex credit assignment scenarios.

PaperID: 9223, https://arxiv.org/pdf/2508.06803.pdf

Authors: Ziqi Liu, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu, Yangbin Chen

Title: SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Abstract:
Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single‑perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self‑Evolving multi‑agent Analysis framework with Decoupled Evaluation for hallucination‑resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state‑of‑the‑art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro‑F1 score.

PaperID: 9224, https://arxiv.org/pdf/2508.06767.pdf

Authors: Arman Dogru, R. Irem Bor-Yaliniz, Nimal Gamini Senarath

Title: PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems

Abstract:
Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, Digital World, as a cornerstone of nextgeneration technologies including embodied AI. As robotics and automated systems scale, efficient data‑sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next‑gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi‑agent Reinforcement Learning (MARL) based multi‑agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor‑learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data‑sharing strategies for scalable, automated systems, ensuring resilience in complex, real‑world environments. PANAMA bridges the gap between network‑aware decision‑making and robust multi‑agent coordination, advancing the synergy between DTs, wireless networks, and AI‑driven automation.

PaperID: 9225, https://arxiv.org/pdf/2508.06722.pdf

Authors: Justin London

Title: Improved Obstacle Avoidance for Autonomous Robots with ORCA-FLC

Abstract:
Obstacle avoidance enables autonomous agents and robots to operate safely and efficiently in dynamic and complex environments, reducing the risk of collisions and damage. For a robot or autonomous system to successfully navigate through obstacles, it must be able to detect such obstacles. While numerous collision avoidance algorithms like the dynamic window approach (DWA), timed elastic bands (TEB), and reciprocal velocity obstacles (RVO) have been proposed, they may lead to suboptimal paths due to fixed weights, be computationally expensive, or have limited adaptability to dynamic obstacles in multi‑agent environments. Optimal reciprocal collision avoidance (ORCA), which improves on RVO, provides smoother trajectories and stronger collision avoidance guarantees. We propose ORCA‑FL to improve on ORCA by using fuzzy logic controllers (FLCs) to better handle uncertainty and imprecision for obstacle avoidance in path planning. Numerous multi‑agent experiments are conducted and it is shown that ORCA‑FL can outperform ORCA in reducing the number of collision if the agent has a velocity that exceeds a certain threshold. In addition, a proposed algorithm for improving ORCA‑FL using fuzzy Q reinforcement learning (FQL) is detailed for optimizing and tuning FLCs.

PaperID: 9226, https://arxiv.org/pdf/2508.06702.pdf

Authors: Zhao Song, The Anh Han

Title: Emergence of Cooperation and Commitment in Optional Prisoner's Dilemma

Abstract:
Commitment is a well‑established mechanism for fostering cooperation in human society and multi‑agent systems. However, existing research has predominantly focused on the commitment that neglects the freedom of players to abstain from an interaction, limiting their applicability to many real‑world scenarios where participation is often voluntary. In this paper, we present a two‑stage game model to investigate the evolution of commitment‑based behaviours and cooperation within the framework of the optional Prisoner's Dilemma game. In the pre‑game stage, players decide whether to accept a mutual commitment. Once in the game, they choose among cooperation, defection, or exiting, depending on the formation of a pre‑game commitment. We find that optional participation boosts commitment acceptance but fails to foster cooperation, leading instead to widespread exit behaviour. To address this, we then introduce and compare two institutional incentive approaches: i) a strict one (STRICT‑COM) that rewards only committed players who cooperate in the game, and ii) a flexible one (FLEXIBLE‑COM) that rewards any committed players who do not defect in the game. The results reveal that, while the strict approach is demonstrably better for promoting cooperation as the flexible rule creates a loophole for an opportunistic exit after committing, the flexible rule offers an efficient alternative for enhancing social welfare when such opportunistic behaviour results in a high gain. This study highlights the limitations of relying solely on voluntary participation and commitment to resolving social dilemmas, emphasising the importance of well‑designed institutional incentives to promote cooperation and social welfare effectively.

PaperID: 9227, https://arxiv.org/pdf/2508.06569.pdf

Authors: Lance Yao, Suman Samantray, Ayana Ghosh, Kevin Roccapriore, Libor Kovarik, Sarah Allec, Maxim Ziatdinov

Title: Operationalizing Serendipity: Multi-Agent AI Workflows for Enhanced Materials Characterization with Theory-in-the-Loop

Abstract:
The history of science is punctuated by serendipitous discoveries, where unexpected observations, rather than targeted hypotheses, opened new fields of inquiry. While modern autonomous laboratories excel at accelerating hypothesis testing, their optimization for efficiency risks overlooking these crucial, unplanned findings. To address this gap, we introduce SciLink, an open‑source, multi‑agent artificial intelligence framework designed to operationalize serendipity in materials research by creating a direct, automated link between experimental observation, novelty assessment, and theoretical simulations. The framework employs a hybrid AI strategy where specialized machine learning models perform quantitative analysis of experimental data, while large language models handle higher‑level reasoning. These agents autonomously convert raw data from materials characterization techniques into falsifiable scientific claims, which are then quantitatively scored for novelty against the published literature. We demonstrate the framework's versatility across diverse research scenarios, showcasing its application to atomic‑resolution and hyperspectral data, its capacity to integrate real‑time human expert guidance, and its ability to close the research loop by proposing targeted follow‑up experiments. By systematically analyzing all observations and contextualizing them, SciLink provides a practical framework for AI‑driven materials research that not only enhances efficiency but also actively cultivates an environment ripe for serendipitous discoveries, thereby bridging the gap between automated experimentation and open‑ended scientific exploration.

PaperID: 9228, https://arxiv.org/pdf/2508.06554.pdf

Authors: Abdelhaleem Saad, Waseem Akram, Irfan Hussain

Title: AquaChat++: LLM-Assisted Multi-ROV Inspection for Aquaculture Net Pens with Integrated Battery Management and Thruster Fault Tolerance

Abstract:
Inspection of aquaculture net pens is essential for ensuring the structural integrity and sustainable operation of offshore fish farming systems. Traditional methods, typically based on manually operated or single‑ROV systems, offer limited adaptability to real‑time constraints such as energy consumption, hardware faults, and dynamic underwater conditions. This paper introduces AquaChat++, a novel multi‑ROV inspection framework that uses Large Language Models (LLMs) to enable adaptive mission planning, coordinated task execution, and fault‑tolerant control in complex aquaculture environments. The proposed system consists of a two‑layered architecture. The high‑level plan generation layer employs an LLM, such as ChatGPT‑4, to translate natural language user commands into symbolic, multi‑agent inspection plans. A task manager dynamically allocates and schedules actions among ROVs based on their real‑time status and operational constraints, including thruster faults and battery levels. The low‑level control layer ensures accurate trajectory tracking and integrates thruster fault detection and compensation mechanisms. By incorporating real‑time feedback and event‑triggered replanning, AquaChat++ enhances system robustness and operational efficiency. Simulated experiments in a physics‑based aquaculture environment demonstrate improved inspection coverage, energy‑efficient behavior, and resilience to actuator failures. These findings highlight the potential of LLM‑driven frameworks to support scalable, intelligent, and autonomous underwater robotic operations within the aquaculture sector.

PaperID: 9229, https://arxiv.org/pdf/2508.06418.pdf

Authors: Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang

Title: Quantifying Conversation Drift in MCP via Latent Polytope

Abstract:
The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real‑time data to improve task execution. However, its non‑isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule‑based filters or LLM‑driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state‑of‑the‑art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope‑based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.

PaperID: 9230, https://arxiv.org/pdf/2508.06336.pdf

Authors: Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling

Title: Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

Abstract:
We introduce Unsupervised Partner Design (UPD) ‑ a population‑free, multi‑agent reinforcement learning framework for robust ad‑hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent's policy with biased random behaviours and scores them using a variance‑based learnability metric that prioritises partners near the ego agent's current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked‑AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population‑based and population‑free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human‑like, a better collaborator, and less frustrating.

PaperID: 9231, https://arxiv.org/pdf/2508.06295.pdf

Authors: Juan Heredia, Emil Stubbe Kolvig-Raun, Sune Lundo Sorensen, Mikkel Baun Kjaergaard

Title: Evaluating Robot Program Performance with Power Consumption Driven Metrics in Lightweight Industrial Robots

Abstract:
The code performance of industrial robots is typically analyzed through CPU metrics, which overlook the physical impact of code on robot behavior. This study introduces a novel framework for assessing robot program performance from an embodiment perspective by analyzing the robot's electrical power profile. Our approach diverges from conventional CPU based evaluations and instead leverages a suite of normalized metrics, namely, the energy utilization coefficient, the energy conversion metric, and the reliability coefficient, to capture how efficiently and reliably energy is used during task execution. Complementing these metrics, the established robot wear metric provides further insight into long term reliability. Our approach is demonstrated through an experimental case study in machine tending, comparing four programs with diverse strategies using a UR5e robot. The proposed metrics directly compare and categorize different robot programs, regardless of the specific task, by linking code performance to its physical manifestation through power consumption patterns. Our results reveal the strengths and weaknesses of each strategy, offering actionable insights for optimizing robot programming practices. Enhancing energy efficiency and reliability through this embodiment centric approach not only improves individual robot performance but also supports broader industrial objectives such as sustainable manufacturing and cost reduction.

PaperID: 9232, https://arxiv.org/pdf/2508.06269.pdf

Authors: Zhuoran Li, Xun Wang, Hai Zhong, Qingxin Xia, Lihua Zhang, Longbo Huang

Title: OM2P: Offline Multi-Agent Mean-Flow Policy

Abstract:
Generative models, especially diffusion and flow‑based models, have been promising in offline multi‑agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow‑based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time‑sensitive or resource‑constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi‑Agent Mean‑Flow Policy), a novel offline MARL algorithm to achieve efficient one‑step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward‑aware optimization scheme that integrates a carefully‑designed mean‑flow matching loss with Q‑function supervision. Additionally, we design a generalized timestep distribution and a derivative‑free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi‑Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed‑up in training time. Our approach represents the first to successfully integrate mean‑flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi‑agent settings.

PaperID: 9233, https://arxiv.org/pdf/2508.06189.pdf

Authors: Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu

Title: MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration

Abstract:
With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high‑level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real‑time requirements. To address these challenges, we propose MA‑CBP, a criminal behavior prediction framework based on multi‑agent asynchronous collaboration. This framework transforms real‑time video streams into frame‑level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long‑ and short‑term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high‑quality criminal behavior dataset that provides multi‑scale language supervision, including frame‑level, summary‑level, and event‑level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.

PaperID: 9234, https://arxiv.org/pdf/2508.06153.pdf

Authors: Zhengxian Wu, Juan Wen, Wanli Peng, Haowei Chang, Yinghan Zhou, Yiming Xue

Title: SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

Abstract:
Customized Large Language Model (LLM) agents face a critical security threat from black‑box instruction backdoors, where malicious behaviors are covertly injected through hidden system instructions. Although existing prompt‑based defenses can often detect poisoned inputs, they generally fail to recover correct outputs once the backdoor is activated. In this paper, we first conduct a mechanistic analysis of LLM behavior under instruction backdoors and reveal two pivotal phenomena: (1) cognitive override, in which backdoor triggers dominate the reasoning process and suppress task‑relevant context, and (2) abnormal semantic correlation, where triggers establish excessively strong semantic associations with attacker‑specified target labels. Based on these insights, we propose a Soft Label mechanism and key‑extraction‑guided CoT‑based defense against Instruction backdoors in APIs (SLIP). To counteract the cognitive override, the key‑extraction‑guided Chain‑of‑Thought (KCOT) explicitly guides the model to extract task‑relevant keywords and phrases rather than only considering the single trigger or overall text semantics. To neutralize the trigger's abnormal semantic correlation, the soft label mechanism (SLM) quantifies semantic correlations and employs statistical clustering to filter anomalous phrases before aggregating reliable keywords and phrases for prediction. Extensive experiments show that SLIP reduces the average attack success rate to 25.13%, improves clean accuracy to 87.15%, and outperforms state‑of‑the‑art black‑box defenses.

PaperID: 9235, https://arxiv.org/pdf/2508.06110.pdf

Authors: Yiran Rex Ma

Title: PanelTR: Zero-Shot Table Reasoning Framework Through Multi-Agent Scientific Discussion

Abstract:
Table reasoning, including tabular QA and fact verification, often depends on annotated data or complex data augmentation, limiting flexibility and generalization. LLMs, despite their versatility, often underperform compared to simple supervised models. To approach these issues, we introduce PanelTR, a framework utilizing LLM agent scientists for robust table reasoning through a structured scientific approach. PanelTR's workflow involves agent scientists conducting individual investigations, engaging in self‑review, and participating in collaborative peer‑review discussions. This process, driven by five scientist personas, enables semantic‑level transfer without relying on data augmentation or parametric optimization. Experiments across four benchmarks show that PanelTR outperforms vanilla LLMs and rivals fully supervised models, all while remaining independent of training data. Our findings indicate that structured scientific methodology can effectively handle complex tasks beyond table reasoning with flexible semantic understanding in a zero‑shot context.

PaperID: 9236, https://arxiv.org/pdf/2508.06105.pdf

Authors: Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang

Title: You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

Abstract:
Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval‑augmented generation (RAG) addresses this by retrieving query‑relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre‑constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph‑based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real‑world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre‑built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic‑aware Retrieval‑Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre‑built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi‑step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state‑of‑the‑art baselines.

PaperID: 9237, https://arxiv.org/pdf/2508.06061.pdf

Authors: Ainur Zhaikhan, Malek Khammassi, Ali H. Sayed

Title: Policy Optimization in Multi-Agent Settings under Partially Observable Environments

Abstract:
This work leverages adaptive social learning to estimate partially observable global states in multi‑agent reinforcement learning (MARL) problems. Unlike existing methods, the proposed approach enables the concurrent operation of social learning and reinforcement learning. Specifically, it alternates between a single step of social learning and a single step of MARL, eliminating the need for the time‑ and computation‑intensive two‑timescale learning frameworks. Theoretical guarantees are provided to support the effectiveness of the proposed method. Simulation results verify that the performance of the proposed methodology can approach that of reinforcement learning when the true state is known.

PaperID: 9238, https://arxiv.org/pdf/2508.06046.pdf

Authors: Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao

Title: EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Abstract:
Although the effectiveness of Large Language Models (LLMs) as judges (LLM‑as‑a‑judge) has been validated, their performance remains limited in open‑ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed‑source models suffers from poor adaptability, while fine‑tuning approaches for open‑source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self‑Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self‑synthesizes score‑aligned Chain‑of‑Thought (CoT) data via a multi‑persona strategy. To ensure data quality, these raw CoTs undergo a self‑filtering process, utilizing multi‑agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state‑of‑the‑art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self‑evolving approach.

PaperID: 9239, https://arxiv.org/pdf/2508.06042.pdf

Authors: Daechul Ahn, San Kim, Jonghyun Choi

Title: Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning

Abstract:
Large Language Models (LLMs) have recently demonstrated impressive action sequence prediction capabilities but often struggle with dynamic, long‑horizon tasks such as real‑time strategic games. In a game such as StarCraftII (SC2), agents need to manage resource constraints and adapt to evolving battlefield situations in a partially observable environment. This often overwhelms exisiting LLM‑based approaches. To address these challenges, we propose a hierarchical multi‑agent framework that employs specialized imitation learning agents under a meta‑controller called Strategic Planner (SP). By expert demonstrations, each specialized agent learns a distinctive strategy, such as aerial support or defensive maneuvers, and produces coherent, structured multistep action sequences. The SP then orchestrates these proposals into a single, environmentally adaptive plan that ensures local decisions aligning with long‑term strategies. We call this HIMA (Hierarchical Imitation Multi‑Agent). We also present TEXTSCII‑ALL, a comprehensive SC2 testbed that encompasses all race match combinations in SC2. Our empirical results show that HIMA outperforms state of the arts in strategic clarity, adaptability, and computational efficiency, underscoring the potential of combining specialized imitation modules with meta‑level orchestration to develop more robust, general‑purpose AI agents.

PaperID: 9240, https://arxiv.org/pdf/2508.05996.pdf

Authors: Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang

Title: Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

Abstract:
Complex medical decision‑making involves cooperative workflows operated by different clinicians. Designing AI multi‑agent systems can expedite and augment human‑level clinical decision‑making. Existing multi‑agent researches primarily focus on language‑only tasks, yet their extension to multimodal scenarios remains challenging. A blind combination of diverse vision‑language models (VLMs) can amplify an erroneous outcome interpretation. VLMs in general are less capable in instruction following and importantly self‑reflection, compared to large language models (LLMs) of comparable sizes. This disparity largely constrains VLMs' ability in cooperative workflows. In this study, we propose MedOrch, a mediator‑guided multi‑agent collaboration framework for medical multimodal decision‑making. MedOrch employs an LLM‑based mediator agent that enables multiple VLM‑based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open‑source general‑purpose and domain‑specific VLMs instead of costly GPT‑series models, revealing the strength of heterogeneous models. We show that the collaboration within distinct VLM‑based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator‑guided multi‑agent collaboration in advancing medical multimodal intelligence.

PaperID: 9241, https://arxiv.org/pdf/2508.05929.pdf

Authors: Keyang Qian, Shiqi Liu, Tongguang Li, Mladen Raković, Xinyu Li, Rui Guan, Inge Molenaar, Sadia Nawaz, Zachari Swiecki, Lixiang Yan, Dragan Gašević

Title: Towards Reliable Generative AI-Driven Scaffolding: Reducing Hallucinations and Enhancing Quality in Self-Regulated Learning Support

Abstract:
Generative Artificial Intelligence (GenAI) holds a potential to advance existing educational technologies with capabilities to automatically generate personalised scaffolds that support students' self‑regulated learning (SRL). While advancements in large language models (LLMs) promise improvements in the adaptability and quality of educational technologies for SRL, there remain concerns about the hallucinations in content generated by LLMs, which can compromise both the learning experience and ethical standards. To address these challenges, we proposed GenAI‑enabled approaches for evaluating personalised SRL scaffolds before they are presented to students, aiming for reducing hallucinations and improving the overall quality of LLM‑generated personalised scaffolds. Specifically, two approaches are investigated. The first approach involved developing a multi‑agent system approach for reliability evaluation to assess the extent to which LLM‑generated scaffolds accurately target relevant SRL processes. The second approach utilised the "LLM‑as‑a‑Judge" technique for quality evaluation that evaluates LLM‑generated scaffolds for their helpfulness in supporting students. We constructed evaluation datasets, and compared our results with single‑agent LLM systems and machine learning approach baselines. Our findings indicate that the reliability evaluation approach is highly effective and outperforms the baselines, showing almost perfect alignment with human experts' evaluations. Moreover, both proposed evaluation approaches can be harnessed to effectively reduce hallucinations. Additionally, we identified and discussed bias limitations of the "LLM‑as‑a‑Judge" technique in evaluating LLM‑generated scaffolds. We suggest incorporating these approaches into GenAI‑powered personalised SRL scaffolding systems to mitigate hallucination issues and improve the overall scaffolding quality.

PaperID: 9242, https://arxiv.org/pdf/2508.05895.pdf

Authors: Jiaqi Hu, Karl H. Johansson, Apostolos I. Rikos

Title: Distributed Quantized Average Consensus in Open Multi-Agent Systems with Dynamic Communication Links

Abstract:
In this paper, we focus on the distributed quantized average consensus problem in open multi‑agent systems consisting of communication links that change dynamically over time. Open multi‑agent systems exhibiting the aforementioned characteristic are referred to as open dynamic multi‑agent systems in this work. We present a distributed algorithm that enables active nodes in the open dynamic multi‑agent system to calculate the quantized average of their initial states. Our algorithm consists of the following advantages: (i) ensures efficient communication by enabling nodes to exchange quantized valued messages, and (ii) exhibits finite time convergence to the desired solution. We establish the correctness of our algorithm and we present necessary and sufficient topological conditions for it to successfully solve the quantized average consensus problem in an open dynamic multi‑agent system. Finally, we illustrate the performance of our algorithm with numerical simulations.

PaperID: 9243, https://arxiv.org/pdf/2508.05890.pdf

Authors: Yue Zhang, Zhe Chen, Daniel Harabor, Pierre Le Bodic, Peter J. Stuckey

Title: Flow-Based Task Assignment for Large-Scale Online Multi-Agent Pickup and Delivery

Abstract:
We study the problem of online Multi‑Agent Pickup and Delivery (MAPD), where a team of agents must repeatedly serve dynamically appearing tasks on a shared map. Existing online methods either rely on simple heuristics, which result in poor decisions, or employ complex reasoning, which suffers from limited scalability under real‑time constraints. In this work, we focus on the task assignment subproblem and formulate it as a minimum‑cost flow over the environment graph. This eliminates the need for pairwise distance computations and allows agents to be simultaneously assigned to tasks and routed toward them. The resulting flow network also supports efficient guide path extraction to integrate with the planner and accelerates planning under real‑time constraints. To improve solution quality, we introduce two congestion‑aware edge cost models that incorporate real‑time traffic estimates. This approach supports real‑time execution and scales to over 20000 agents and 30000 tasks within 1‑second planning time, outperforming existing baselines in both computational efficiency and assignment quality.

PaperID: 9244, https://arxiv.org/pdf/2508.05852.pdf

Authors: Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang

Title: VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

Abstract:
Driver visual attention prediction is a critical task in autonomous driving and human‑computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision‑language framework that models the changing landscape of drivers' gaze through natural language, using few‑shot and zero‑shot learning on single RGB images. We curate and refine high‑quality captions from the BDD‑A dataset using human‑in‑the‑loop feedback, then fine‑tune LLaVA to align visual perception with attention‑centric scene understanding. Our approach integrates both low‑level cues and top‑down context (e.g., route semantics, risk anticipation), enabling language‑based descriptions of gaze behavior. We evaluate performance across training regimes (few shot, and one‑shot) and introduce domain‑specific metrics for semantic alignment and response diversity. Results show that our fine‑tuned model outperforms general‑purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human‑AI teaming, and multi‑agent coordination.

PaperID: 9245, https://arxiv.org/pdf/2508.05766.pdf

Authors: Bo Wen

Title: A Framework for Inherently Safer AGI through Language-Mediated Active Inference

Abstract:
This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post‑hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi‑agent system where agents self‑organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource‑aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework's safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.

PaperID: 9246, https://arxiv.org/pdf/2508.05702.pdf

Authors: Yan Zhang, Ahmad Mohammad Saber, Amr Youssef, Deepa Kundur

Title: Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control

Abstract:
Modern power grids face unprecedented complexity from Distributed Energy Resources (DERs), Electric Vehicles (EVs), and extreme weather, while also being increasingly exposed to cyberattacks that can trigger grid violations. This paper introduces Grid‑Agent, an autonomous AI‑driven framework that leverages Large Language Models (LLMs) within a multi‑agent system to detect and remediate violations. Grid‑Agent integrates semantic reasoning with numerical precision through modular agents: a planning agent generates coordinated action sequences using power flow solvers, while a validation agent ensures stability and safety through sandboxed execution with rollback mechanisms. To enhance scalability, the framework employs an adaptive multi‑scale network representation that dynamically adjusts encoding schemes based on system size and complexity. Violation resolution is achieved through optimizing switch configurations, battery deployment, and load curtailment. Our experiments on IEEE and CIGRE benchmark networks, including the IEEE 69‑bus, CIGRE MV, IEEE 30‑bus test systems, demonstrate superior mitigation performance, highlighting Grid‑Agent's suitability for modern smart grids requiring rapid, adaptive response.

PaperID: 9247, https://arxiv.org/pdf/2508.05687.pdf

Authors: Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano

Title: Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Abstract:
Organisations are starting to adopt LLM‑based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi‑agent networks. Yet a collection of safe agents does not guarantee a safe collection of agents, as interactions between agents over time create emergent behaviours and induce novel failure modes. This means multi‑agent systems require a fundamentally different risk analysis approach than that used for a single agent. This report addresses the early stages of risk identification and analysis for multi‑agent AI systems operating within governed environments where organisations control their agent configurations and deployment. In this setting, we examine six critical failure modes: cascading reliability failures, inter‑agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics. For each, we provide a toolkit for practitioners to extend or integrate into their existing frameworks to assess these failure modes within their organisational contexts. Given fundamental limitations in current LLM behavioural understanding, our approach centres on analysis validity, and advocates for progressively increasing validity through staged testing across stages of abstraction and deployment that gradually increases exposure to potential negative impacts, while collecting convergent evidence through simulation, observational analysis, benchmarking, and red teaming. This methodology establishes the groundwork for robust organisational risk management as these LLM‑based multi‑agent systems are deployed and operated.

PaperID: 9248, https://arxiv.org/pdf/2508.05670.pdf

Authors: Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò

Title: Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?

Abstract:
Game theory has long served as a foundational tool in cybersecurity to test, predict, and design strategic interactions between attackers and defenders. The recent advent of Large Language Models (LLMs) offers new tools and challenges for the security of computer systems; In this work, we investigate whether classical game‑theoretic frameworks can effectively capture the behaviours of LLM‑driven actors and bots. Using a reproducible framework for game‑theoretic LLM agents, we investigate two canonical scenarios ‑‑ the one‑shot zero‑sum game and the dynamic Prisoner's Dilemma ‑‑ and we test whether LLMs converge to expected outcomes or exhibit deviations due to embedded biases. Our experiments involve four state‑of‑the‑art LLMs and span five natural languages, English, French, Arabic, Vietnamese, and Mandarin Chinese, to assess linguistic sensitivity. For both games, we observe that the final payoffs are influenced by agents characteristics such as personality traits or knowledge of repeated rounds. Moreover, we uncover an unexpected sensitivity of the final payoffs to the choice of languages, which should warn against indiscriminate application of LLMs in cybersecurity applications and call for in‑depth studies, as LLMs may behave differently when deployed in different countries. We also employ quantitative metrics to evaluate the internal consistency and cross‑language stability of LLM agents, to help guide the selection of the most stable LLMs and optimising models for secure applications.

PaperID: 9249, https://arxiv.org/pdf/2508.05660.pdf

Authors: Aditya Nagori, Ricardo Accorsi Casonatto, Ayush Gautam, Abhinav Manikantha Sai Cheruvu, Rishikesan Kamaleswaran

Title: Open-Source Agentic Hybrid RAG Framework for Scientific Literature Review

Abstract:
The surge in scientific publications challenges traditional review methods, demanding tools that integrate structured metadata with full‑text analysis. Hybrid Retrieval Augmented Generation (RAG) systems, combining graph queries with vector search offer promise but are typically static, rely on proprietary tools, and lack uncertainty estimates. We present an agentic approach that encapsulates the hybrid RAG pipeline within an autonomous agent capable of (1) dynamically selecting between GraphRAG and VectorRAG for each query, (2) adapting instruction‑tuned generation in real time to researcher needs, and (3) quantifying uncertainty during inference. This dynamic orchestration improves relevance, reduces hallucinations, and promotes reproducibility. Our pipeline ingests bibliometric open‑access data from PubMed, arXiv, and Google Scholar APIs, builds a Neo4j citation‑based knowledge graph (KG), and embeds full‑text PDFs into a FAISS vector store (VS) using the all‑MiniLM‑L6‑v2 model. A Llama‑3.3‑70B agent selects GraphRAG (translating queries to Cypher for KG) or VectorRAG (combining sparse and dense retrieval with re‑ranking). Instruction tuning refines domain‑specific generation, and bootstrapped evaluation yields standard deviation for evaluation metrics. On synthetic benchmarks mimicking real‑world queries, the Instruction‑Tuned Agent with Direct Preference Optimization (DPO) outperforms the baseline, achieving a gain of 0.63 in VS Context Recall and a 0.56 gain in overall Context Precision. Additional gains include 0.24 in VS Faithfulness, 0.12 in both VS Precision and KG Answer Relevance, 0.11 in overall Faithfulness score, 0.05 in KG Context Recall, and 0.04 in both VS Answer Relevance and overall Precision. These results highlight the system's improved reasoning over heterogeneous sources and establish a scalable framework for autonomous, agentic scientific discovery.

PaperID: 9250, https://arxiv.org/pdf/2508.05637.pdf

Authors: Siddharth Gangwar, David A. Selby, Sebastian J. Vollmer

Title: Automated Visualization Makeovers with LLMs

Abstract:
Making a good graphic that accurately and efficiently conveys the desired message to the audience is both an art and a science, typically not taught in the data science curriculum. Visualisation makeovers are exercises where the community exchange feedback to improve charts and data visualizations. Can multi‑modal large language models (LLMs) emulate this task? Given a plot in the form of an image file, or the code used to generate it, an LLM, primed with a list of visualization best practices, is employed to semi‑automatically generate constructive criticism to produce a better plot. Our system is centred around prompt engineering of a pre‑trained model, relying on a combination of userspecified guidelines and any latent knowledge of data visualization practices that might lie within an LLMs training corpus. Unlike other works, the focus is not on generating valid visualization scripts from raw data or prompts, but on educating the user how to improve their existing data visualizations according to an interpretation of best practices. A quantitative evaluation is performed to measure the sensitivity of the LLM agent to various plotting issues across different chart types. We make the tool available as a simple self‑hosted applet with an accessible Web interface.

PaperID: 9251, https://arxiv.org/pdf/2508.05622.pdf

Authors: Yu Yuan, Lili Zhao, Wei Chen, Guangting Zheng, Kai Zhang, Mengdi Zhang, Qi Liu

Title: Simulating Human-Like Learning Dynamics with LLM-Empowered Agents

Abstract:
Capturing human learning behavior based on deep learning methods has become a major research focus in both psychology and intelligent systems. Recent approaches rely on controlled experiments or rule‑based models to explore cognitive processes. However, they struggle to capture learning dynamics, track progress over time, or provide explainability. To address these challenges, we introduce LearnerAgent, a novel multi‑agent framework based on Large Language Models (LLMs) to simulate a realistic teaching environment. To explore human‑like learning dynamics, we construct learners with psychologically grounded profiles‑such as Deep, Surface, and Lazy‑as well as a persona‑free General Learner to inspect the base LLM's default behavior. Through weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction, we can track the dynamic learning progress of individual learners over a full‑year journey. Our findings are fourfold: 1) Longitudinal analysis reveals that only Deep Learner achieves sustained cognitive growth. Our specially designed "trap questions" effectively diagnose Surface Learner's shallow knowledge. 2) The behavioral and cognitive patterns of distinct learners align closely with their psychological profiles. 3) Learners' self‑concept scores evolve realistically, with the General Learner developing surprisingly high self‑efficacy despite its cognitive limitations. 4) Critically, the default profile of base LLM is a "diligent but brittle Surface Learner"‑an agent that mimics the behaviors of a good student but lacks true, generalizable understanding. Extensive simulation experiments demonstrate that LearnerAgent aligns well with real scenarios, yielding more insightful findings about LLMs' behavior.

PaperID: 9252, https://arxiv.org/pdf/2508.05557.pdf

Authors: Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian

Title: MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media

Abstract:
Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross‑modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV‑Debate, a multi‑view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV‑Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection‑gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV‑Debate significantly outperforms strong single‑model and existing multi‑agent debate baselines. This work highlights the promise of multi‑agent debate in advancing reliable social intent detection in safety‑critical online contexts.

PaperID: 9253, https://arxiv.org/pdf/2508.05503.pdf

Authors: Dongwei Ji, Bingzhang Hu, Yi Zhou

Title: AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection

Abstract:
Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi‑agent collaboration framework, specifically designed for end‑to‑end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager‑Driven central agent to orchestrate specialized sub‑agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain‑specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general‑purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high‑quality IAD solutions.

PaperID: 9254, https://arxiv.org/pdf/2508.05492.pdf

Authors: Jifan Gao, Mahmudur Rahman, John Caskey, Madeline Oguss, Ann O'Rourke, Randy Brown, Anne Stey, Anoop Mayampurath, Matthew M. Churpek, Guanhua Chen, Majid Afshar

Title: MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling

Abstract:
Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single‑modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture‑of‑Multimodal‑Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents ("specialist agents") to convert non‑textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM ("aggregator agent") to generate a unified multimodal summary, which is then used by a third LLM ("predictor agent") to produce clinical predictions. Evaluating MoMA on three prediction tasks using real‑world datasets with different modality combinations and prediction settings, MoMA outperforms current state‑of‑the‑art methods, highlighting its enhanced accuracy and flexibility across various tasks.

PaperID: 9255, https://arxiv.org/pdf/2508.05421.pdf

Authors: Rong Sha, Binglin Wang, Jun Yang, Xiaoxiao Ma, Chengkun Wu, Liang Yan, Chao Zhou, Jixun Liu, Guochao Wang, Shuhua Yan, Lingxiao Zhu

Title: LLM-based Multi-Agent Copilot for Quantum Sensor

Abstract:
Large language models (LLM) exhibit broad utility but face limitations in quantum sensor development, stemming from interdisciplinary knowledge barriers and involving complex optimization processes. Here we present QCopilot, an LLM‑based multi‑agent framework integrating external knowledge access, active learning, and uncertainty quantification for quantum sensor design and diagnosis. Comprising commercial LLMs with few‑shot prompt engineering and vector knowledge base, QCopilot employs specialized agents to adaptively select optimization methods, automate modeling analysis, and independently perform problem diagnosis. Applying QCopilot to atom cooling experiments, we generated 10^\rm8 sub‑\rmμK atoms without any human intervention within a few hours, representing ～100× speedup over manual experimentation. Notably, by continuously accumulating prior knowledge and enabling dynamic modeling, QCopilot can autonomously identify anomalous parameters in multi‑parameter experimental settings. Our work reduces barriers to large‑scale quantum sensor deployment and readily extends to other quantum information systems.

PaperID: 9256, https://arxiv.org/pdf/2508.05344.pdf

Authors: Asutosh Hota, Jussi P. P. Jokinen

Title: NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making

Abstract:
Recent advancements in large language models (LLMs) have extended their capabilities from basic text processing to complex reasoning tasks, including legal interpretation, argumentation, and strategic interaction. However, empirical understanding of LLM behavior in open‑ended, multi‑agent settings especially those involving deliberation over legal and ethical dilemmas remains limited. We introduce NomicLaw, a structured multi‑agent simulation where LLMs engage in collaborative law‑making, responding to complex legal vignettes by proposing rules, justifying them, and voting on peer proposals. We quantitatively measure trust and reciprocity via voting patterns and qualitatively assess how agents use strategic language to justify proposals and influence outcomes. Experiments involving homogeneous and heterogeneous LLM groups demonstrate how agents spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective decisions. Our results highlight the latent social reasoning and persuasive capabilities of ten open‑source LLMs and provide insights into the design of future AI systems capable of autonomous negotiation, coordination and drafting legislation in legal settings.

PaperID: 9257, https://arxiv.org/pdf/2508.05311.pdf

Authors: Andrew Kiruluta

Title: A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents

Abstract:
We propose a hybrid architecture that integrates decision tree‑based symbolic reasoning with the generative capabilities of large language models (LLMs) within a coordinated multi‑agent framework. Unlike prior approaches that loosely couple symbolic and neural modules, our design embeds decision trees and random forests as callable oracles within a unified reasoning system. Tree‑based modules enable interpretable rule inference and causal logic, while LLM agents handle abductive reasoning, generalization, and interactive planning. A central orchestrator maintains belief state consistency and mediates communication across agents and external tools, enabling reasoning over both structured and unstructured inputs. The system achieves strong performance on reasoning benchmarks. On ProofWriter, it improves entailment consistency by +7.2% through logic‑grounded tree validation. On GSM8k, it achieves +5.3% accuracy gains in multistep mathematical problems via symbolic augmentation. On ARC, it boosts abstraction accuracy by +6.0% through integration of symbolic oracles. Applications in clinical decision support and scientific discovery show how the system encodes domain rules symbolically while leveraging LLMs for contextual inference and hypothesis generation. This architecture offers a robust, interpretable, and extensible solution for general‑purpose neuro‑symbolic reasoning.

PaperID: 9258, https://arxiv.org/pdf/2508.05253.pdf

Authors: Takuro Kato, Keisuke Okumura, Yoko Sasaki, Naoya Yokomachi

Title: Congestion Mitigation Path Planning for Large-Scale Multi-Agent Navigation in Dense Environments

Abstract:
In high‑density environments where numerous autonomous agents move simultaneously in a distributed manner, streamlining global flows to mitigate local congestion is crucial to maintain overall navigation efficiency. This paper introduces a novel path‑planning problem, congestion mitigation path planning (CMPP), which embeds congestion directly into the cost function, defined by the usage of incoming edges along agents' paths. CMPP assigns a flow‑based multiplicative penalty to each vertex of a sparse graph, which grows steeply where frequently‑traversed paths intersect, capturing the intuition that congestion intensifies where many agents enter the same area from different directions. Minimizing the total cost yields a set of coarse‑level, time‑independent routes that autonomous agents can follow while applying their own local collision avoidance. We formulate the problem and develop two solvers: (i) an exact mixed‑integer nonlinear programming solver for small instances, and (ii) a scalable two‑layer search algorithm, A‑CMTS, which quickly finds suboptimal solutions for large‑scale instances and iteratively refines them toward the optimum. Empirical studies show that augmenting state‑of‑the‑art collision‑avoidance planners with CMPP significantly reduces local congestion and enhances system throughput in both discrete‑ and continuous‑space scenarios. These results indicate that CMPP improves the performance of multi‑agent systems in real‑world applications such as logistics and autonomous‑vehicle operations.

PaperID: 9259, https://arxiv.org/pdf/2508.05152.pdf

Authors: Linfeng Gao, Yaoxiang Wang, Minlong Peng, Jialong Tang, Yuzhe Shang, Mingming Sun, Jinsong Su

Title: Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models

Abstract:
With the remarkable advancement of AI agents, the number of their equipped tools is increasing rapidly. However, integrating all tool information into the limited model context becomes impractical, highlighting the need for efficient tool retrieval methods. In this regard, dominant methods primarily rely on semantic similarities between tool descriptions and user queries to retrieve relevant tools. However, they often consider each tool independently, overlooking dependencies between tools, which may lead to the omission of prerequisite tools for successful task execution. To deal with this defect, in this paper, we propose Tool Graph Retriever (TGR), which exploits the dependencies among tools to learn better tool representations for retrieval. First, we construct a dataset termed TDI300K to train a discriminator for identifying tool dependencies. Then, we represent all candidate tools as a tool dependency graph and use graph convolution to integrate the dependencies into their representations. Finally, these updated tool representations are employed for online retrieval. Experimental results on several commonly used datasets show that our TGR can bring a performance improvement to existing dominant methods, achieving SOTA performance. Moreover, in‑depth analyses also verify the importance of tool dependencies and the effectiveness of our TGR.

PaperID: 9260, https://arxiv.org/pdf/2508.05116.pdf

Authors: Peer-Benedikt Degen, Igor Asanov

Title: Beyond Automation: Socratic AI, Epistemic Agency, and the Implications of the Emergence of Orchestrated Multi-Agent Learning Architectures

Abstract:
Generative AI is no longer a peripheral tool in higher education. It is rapidly evolving into a general‑purpose infrastructure that reshapes how knowledge is generated, mediated, and validated. This paper presents findings from a controlled experiment evaluating a Socratic AI Tutor, a large language model designed to scaffold student research question development through structured dialogue grounded in constructivist theory. Conducted with 65 pre‑service teacher students in Germany, the study compares interaction with the Socratic Tutor to engagement with an uninstructed AI chatbot. Students using the Socratic Tutor reported significantly greater support for critical, independent, and reflective thinking, suggesting that dialogic AI can stimulate metacognitive engagement and challenging recent narratives of de‑skilling due to generative AI usage. These findings serve as a proof of concept for a broader pedagogical shift: the use of multi‑agent systems (MAS) composed of specialised AI agents. To conceptualise this, we introduce the notion of orchestrated MAS, modular, pedagogically aligned agent constellations, curated by educators, that support diverse learning trajectories through differentiated roles and coordinated interaction. To anchor this shift, we propose an adapted offer‑and‑use model, in which students appropriate instructional offers from these agents. Beyond technical feasibility, we examine system‑level implications for higher education institutions and students, including funding necessities, changes to faculty roles, curriculars, competencies and assessment practices. We conclude with a comparative cost‑effectiveness analysis highlighting the scalability of such systems. In sum, this study contributes both empirical evidence and a conceptual roadmap for hybrid learning ecosystems that embed human‑AI co‑agency and pedagogical alignment.

PaperID: 9261, https://arxiv.org/pdf/2508.05002.pdf

Authors: Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, Yuan Li

Title: AgenticData: An Agentic Data Analytics System for Heterogeneous Data

Abstract:
Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time‑consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback‑driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi‑agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross‑validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short‑term context and long‑term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state‑of‑the‑art methods.

PaperID: 9262, https://arxiv.org/pdf/2508.04974.pdf

Authors: Hoa T. Nguyen, Muhammad Usman, Rajkumar Buyya

Title: QFOR: A Fidelity-aware Orchestrator for Quantum Computing Environments using Deep Reinforcement Learning

Abstract:
Quantum cloud computing enables remote access to quantum processors, yet the heterogeneity and noise of available quantum hardware create significant challenges for efficient resource orchestration. These issues complicate the optimization of quantum task allocation and scheduling, as existing heuristic methods fall short in adapting to dynamic conditions or effectively balancing execution fidelity and time. Here, we propose QFOR, a Quantum Fidelity‑aware Orchestration of tasks across heterogeneous quantum nodes in cloud‑based environments using Deep Reinforcement learning. We model the quantum task orchestration as a Markov Decision Process and employ the Proximal Policy Optimization algorithm to learn adaptive scheduling policies, using IBM quantum processor calibration data for noise‑aware performance estimation. Our configurable framework balances overall quantum task execution fidelity and time, enabling adaptation to different operational priorities. Extensive evaluation demonstrates that QFOR is adaptive and achieves significant performance with 29.5‑84% improvements in relative fidelity performance over heuristic baselines. Furthermore, it maintains comparable quantum execution times, contributing to cost‑efficient use of quantum computation resources.

PaperID: 9263, https://arxiv.org/pdf/2508.04904.pdf

Authors: Yuqi Hu, Qiwen Xiong, Zhenzhen Qin, Brandon Watanabe, Yujing Wang, Mirjana Prpa, Ilmi Yoon

Title: Toward Scalable Patient Safety Training: A Prototype for Root Cause Analysis Simulation With AI Virtual Avatars

Abstract:
Patient safety training is essential for preparing healthcare professionals to identify, investigate, and prevent adverse events. However, conventional simulation‑based approaches often require substantial faculty time, physical resources, and standardized facilitation. This paper presents a prototype AI‑powered simulation platform designed to support more scalable patient safety training through root cause analysis (RCA). The system provides a Unity‑based 3D simulation environment, which allows trainees to investigate an ICU adverse event by interviewing five virtual team members represented as AI‑powered avatars. Each avatar is driven by a large language model (LLM) agent with role‑specific knowledge and variable states of mind. Moreover, emotional text‑to‑speech and AI‑supported facial and body animation enable more realistic and immersive interactions. After completing the simulation, trainees submit a written RCA report and receive rubric‑guided formative and summative feedback automatically generated by an LLM‑based assessment component. The prototype is built to support patient safety training for healthcare professionals, focusing on skills in communication, investigation, thinking, and analysis, with low recurring instructional burden. We describe the design of the platform, its core technical components, and an RCA case based on a published ICU scenario. This work demonstrates the feasibility of integrating generative AI into immersive simulation for scalable patient safety education.

PaperID: 9264, https://arxiv.org/pdf/2508.04903.pdf

Authors: Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li, Peiyan Dong, Joannah Nanjekye, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Title: RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

Abstract:
Multi‑agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision‑making tasks. However, most existing coordination schemes rely on static or full‑context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR‑Router, a modular and role‑aware context routing framework designed to enable efficient, adaptive collaboration in multi‑agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM‑generated explanations beyond standard QA accuracy. Experiments on three multi‑hop QA benchmarks ‑‑ HotPotQA, MuSiQue, and 2WikiMultihop ‑‑ demonstrate that RCR‑Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output‑aware evaluation in advancing scalable multi‑agent LLM systems.

PaperID: 9265, https://arxiv.org/pdf/2508.04849.pdf

Authors: Yifan Su, Rishi Veerapaneni, Jiaoyang Li

Title: BTPG-max: Achieving Local Maximal Bidirectional Pairs for Bidirectional Temporal Plan Graphs

Abstract:
Multi‑Agent Path Finding (MAPF) requires computing collision‑free paths for multiple agents in shared environment. Most MAPF planners assume that each agent reaches a specific location at a specific timestep, but this is infeasible to directly follow on real systems where delays often occur. To address collisions caused by agents deviating due to delays, the Temporal Plan Graph (TPG) was proposed, which converts a MAPF time dependent solution into a time independent set of inter‑agent dependencies. Recently, a Bidirectional TPG (BTPG) was proposed which relaxed some dependencies into ``bidirectional pairs" and improved efficiency of agents executing their MAPF solution with delays. Our work improves upon this prior work by designing an algorithm, BPTG‑max, that finds more bidirectional pairs. Our main theoretical contribution is in designing the BTPG‑max algorithm is locally optimal, i.e. which constructs a BTPG where no additional bidirectional pairs can be added. We also show how in practice BTPG‑max leads to BTPGs with significantly more bidirectional edges, superior anytime behavior, and improves robustness to delays.

PaperID: 9266, https://arxiv.org/pdf/2508.04811.pdf

Authors: Lin Jiang, Yu Yang, Guang Wang

Title: HCRide: Harmonizing Passenger Fairness and Driver Preference for Human-Centered Ride-Hailing

Abstract:
Order dispatch systems play a vital role in ride‑hailing services, which directly influence operator revenue, driver profit, and passenger experience. Most existing work focuses on improving system efficiency in terms of operator revenue, which may cause a bad experience for both passengers and drivers. Hence, in this work, we aim to design a human‑centered ride‑hailing system by considering both passenger fairness and driver preference without compromising the overall system efficiency. However, it is nontrivial to achieve this target due to the potential conflicts between passenger fairness and driver preference since optimizing one may sacrifice the other. To address this challenge, we design HCRide, a Human‑Centered Ride‑hailing system based on a novel multi‑agent reinforcement learning algorithm called Harmonization‑oriented Actor‑Bi‑Critic (Habic), which includes three major components (i.e., a multi‑agent competition mechanism, a dynamic Actor network, and a Bi‑Critic network) to optimize system efficiency and passenger fairness with driver preference consideration. We extensively evaluate our HCRide using two real‑world ride‑hailing datasets from Shenzhen and New York City. Experimental results show our HCRide effectively improves system efficiency by 2.02%, fairness by 5.39%, and driver preference by 10.21% compared to state‑of‑the‑art baselines.

PaperID: 9267, https://arxiv.org/pdf/2508.04682.pdf

Authors: Zewei Zhou, Seth Z. Zhao, Tianhui Cai, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Title: TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Abstract:
End‑to‑end training of multi‑agent systems offers significant advantages in improving multi‑task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi‑agent perception and prediction. TurboTrain comprises two key components: a multi‑agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi‑task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi‑stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real‑world cooperative driving dataset, V2XPnP‑Seq, and demonstrate that it further improves the performance of state‑of‑the‑art multi‑agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi‑agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi‑task learning strategy enhances detection and prediction.

PaperID: 9268, https://arxiv.org/pdf/2508.04634.pdf

Title: VirtLab: An AI-Powered System for Flexible, Customizable, and Large-scale Team Simulations

Abstract:
Simulating how team members collaborate within complex environments using Agentic AI is a promising approach to explore hypotheses grounded in social science theories and study team behaviors. We introduce VirtLab, a user‑friendly, customizable, multi‑agent, and scalable team simulation system that enables testing teams with LLM‑based agents in spatial and temporal settings. This system addresses the current frameworks' design and technical limitations that do not consider flexible simulation scenarios and spatial settings. VirtLab contains a simulation engine and a web interface that enables both technical and non‑technical users to formulate, run, and analyze team simulations without programming. We demonstrate the system's utility by comparing ground truth data with simulated scenarios.

PaperID: 9269, https://arxiv.org/pdf/2508.04575.pdf

Authors: Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He

Title: Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Abstract:
While AI agents show potential in scientific ideation, most existing frameworks rely on single‑agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real‑world research dynamics, this paper investigates whether structured multi‑agent discussions can surpass solitary ideation. We propose a cooperative multi‑agent framework for generating research proposals and systematically compare configurations including group size, leaderled versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent‑based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi‑agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non‑negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

PaperID: 9270, https://arxiv.org/pdf/2508.04556.pdf

Authors: Filipe B. Teixeira, Carolina Simões, Paulo Fidalgo, Wagner Pedrosa, André Coelho, Manuel Ricardo, Luis M. Pessoa

Title: CONVERGE: A Multi-Agent Vision-Radio Architecture for xApps

Abstract:
Telecommunications and computer vision have evolved independently. With the emergence of high‑frequency wireless links operating mostly in line‑of‑sight, visual data can help predict the channel dynamics by detecting obstacles and help overcoming them through beamforming or handover techniques. This paper proposes a novel architecture for delivering real‑time radio and video sensing information to O‑RAN xApps through a multi‑agent approach, and introduces a new video function capable of generating blockage information for xApps, enabling Integrated Sensing and Communications. Experimental results show that the delay of sensing information remains under 1\,ms and that an xApp can successfully use radio and video sensing information to control the 5G/6G RAN in real‑time.

PaperID: 9271, https://arxiv.org/pdf/2508.04537.pdf

Authors: Alkesh K. Srivastava, Aamodh Suresh, Carlos Nieto-Granda

Title: Behaviorally Adaptive Multi-Robot Hazard Localization in Failure-Prone, Communication-Denied Environments

Abstract:
We address the challenge of multi‑robot autonomous hazard mapping in high‑risk, failure‑prone, communication‑denied environments such as post‑disaster zones, underground mines, caves, and planetary surfaces. In these missions, robots must explore and map hazards while minimizing the risk of failure due to environmental threats or hardware limitations. We introduce a behavior‑adaptive, information‑theoretic planning framework for multi‑robot teams grounded in the concept of Behavioral Entropy (BE), that generalizes Shannon entropy (SE) to capture diverse human‑like uncertainty evaluations. Building on this formulation, we propose the Behavior‑Adaptive Path Planning (BAPP) framework, which modulates information gathering strategies via a tunable risk‑sensitivity parameter, and present two planning algorithms: BAPP‑TID for intelligent triggering of high‑fidelity robots, and BAPP‑SIG for safe deployment under high risk. We provide theoretical insights on the informativeness of the proposed BAPP framework and validate its effectiveness through both single‑robot and multi‑robot simulations. Our results show that the BAPP stack consistently outperforms Shannon‑based and random strategies: BAPP‑TID accelerates entropy reduction, while BAPP‑SIG improves robot survivability with minimal loss in information gain. In multi‑agent deployments, BAPP scales effectively through spatial partitioning, mobile base relocation, and role‑aware heterogeneity. These findings underscore the value of behavior‑adaptive planning for robust, risk‑sensitive exploration in complex, failure‑prone environments.

PaperID: 9272, https://arxiv.org/pdf/2508.04474.pdf

Authors: Xinkui Zhao, Haode Li, Yifan Zhang, Guanjie Cheng, Yueshen Xu

Title: TRAIL: Joint Inference and Refinement of Knowledge Graphs with Large Language Models

Abstract:
Recent advances in large language models (LLMs) have unlocked powerful reasoning and decision‑making capabilities. However, their inherent dependence on static parametric memory fundamentally limits their adaptability, factual accuracy, and interpretability in knowledge‑intensive scenarios. Knowledge graphs (KGs), as structured repositories of explicit relational knowledge, offer a promising approach for augmenting LLMs with external, interpretable memory. Nevertheless, most existing methods that combine LLMs with KGs treat reasoning and knowledge updating as separate processes, resulting in suboptimal utilization of new information and hindering real‑time updates. In this work, we propose TRAIL: a novel, unified framework for Thinking, Reasoning, And Incremental Learning that couples joint inference and dynamic KG refinement with large language models. TRAIL enables LLM agents to iteratively explore, update, and refine knowledge graphs during the reasoning process, employing a confidence‑driven mechanism for the generation, validation, and pruning of new facts. This plug‑and‑play architecture facilitates seamless integration with various LLMs, supporting continual adaptation without the need for retraining. Extensive experiments on multiple benchmarks demonstrate that TRAIL outperforms existing KG‑augmented and retrieval‑augmented LLM baselines by 3% to 13%. More importantly, these results represent a significant step toward developing adaptive, memory‑augmented language models capable of continual learning and reliable, transparent reasoning.

PaperID: 9273, https://arxiv.org/pdf/2508.04332.pdf

Authors: Xinkui Zhao, Yifan Zhang, Sai Liu, Naibo Wang, Guanjie Cheng, Yueshen Xu, Chang Liu, Shuiguang Deng, Jianwei Yin

Title: DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

Abstract:
Multi‑agent systems (MAS) have demonstrated significant effectiveness in addressing complex problems through coordinated collaboration among heterogeneous agents. However, real‑world environments and task specifications are inherently dynamic, characterized by frequent changes, uncertainty, and variability. Despite this, most existing MAS frameworks rely on static architectures with fixed agent capabilities and rigid task allocation strategies, which greatly limits their adaptability to evolving conditions. This inflexibility poses substantial challenges for sustaining robust and efficient multi‑agent cooperation in dynamic and unpredictable scenarios. To address these limitations, we propose DRAMA: a Dynamic and Robust Allocation‑based Multi‑Agent System designed to facilitate resilient collaboration in rapidly changing environments. DRAMA features a modular architecture with a clear separation between the control plane and the worker plane. Both agents and tasks are abstracted as resource objects with well‑defined lifecycles, while task allocation is achieved via an affinity‑based, loosely coupled mechanism. The control plane enables real‑time monitoring and centralized planning, allowing flexible and efficient task reassignment as agents join, depart, or become unavailable, thereby ensuring continuous and robust task execution. The worker plane comprises a cluster of autonomous agents, each with local reasoning, task execution, the ability to collaborate, and the capability to take over unfinished tasks from other agents when needed.

PaperID: 9274, https://arxiv.org/pdf/2508.04231.pdf

Authors: Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng

Title: Empowering Time Series Forecasting with LLM-Agents

Abstract:
Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state‑of‑the‑art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data‑Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large‑scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data‑centric approaches in AutoML for time series forecasting.

PaperID: 9275, https://arxiv.org/pdf/2508.04118.pdf

Authors: Ruochen Zhao, Simone Conia, Eric Peng, Min Li, Saloni Potdar

Title: AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Abstract:
Open‑domain Knowledge Graph Completion (KGC) faces significant challenges in an ever‑changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models' parametric knowledge, pre‑constructed queries, or single‑step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up‑to‑date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent‑based framework that combines iterative retrieval actions and multi‑step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models' training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent‑based reasoning with strategic information retrieval for maintaining up‑to‑date knowledge graphs in dynamic information environments.

PaperID: 9276, https://arxiv.org/pdf/2508.04031.pdf

Authors: Lianggui Weng, Dandan Liu, Rong Zhu, Bolin Ding, Jingren Zhou

Title: BridgeScope: A Universal Toolkit for Bridging Large Language Models and Databases

Abstract:
As large language models (LLMs) demonstrate increasingly powerful reasoning and orchestration capabilities, LLM‑based agents are rapidly proliferating for complex data‑related tasks. Despite this progress, the current design of how LLMs interact with databases exhibits critical limitations in usability, security, privilege management, and data transmission efficiency. To resolve these challenges, we introduce BridgeScope, a universal toolkit bridging LLMs and databases through three key innovations. First, it modularizes SQL operations into fine‑grained tools for context retrieval, CRUD execution, and ACID‑compliant transaction management, enabling more precise and LLM‑friendly functionality controls. Second, it aligns tool implementations with both database privileges and user security policies to steer LLMs away from unsafe or unauthorized operations, improving task execution efficiency while safeguarding database security. Third, it introduces a proxy mechanism for seamless inter‑tool data transfer, bypassing LLM transmission bottlenecks. All of these designs are database‑agnostic and can be transparently integrated with existing agent architectures. We also release an open‑source implementation of BridgeScope for PostgreSQL. Evaluations on two novel benchmarks demonstrate that BridgeScope enables LLM agents to operate databases more effectively, reduces token usage by up to 80% through improved security awareness, and uniquely supports data‑intensive workflows beyond existing toolkits, establishing BridgeScope as a robust foundation for next‑generation intelligent data automation.

PaperID: 9277, https://arxiv.org/pdf/2508.04026.pdf

Authors: Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Jialiang Gao, Heng Zhou, Yunhao Yang, Wendong Fan, puzhen zhang, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Junjie Wang, Aosong Feng, Jindi Lv, Sicong Jiang, Ziqi Ren, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao

Title: VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking

Abstract:
Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web‑based information‑seeking tasks. However, existing efforts mainly focus on single‑fact retrieval and rely on outcome‑only verification, thereby limiting their scalability in realistic knowledge‑intensive scenarios that involve long‑horizon web tasks requiring large‑scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a novel verifiable long‑chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions: (1) long‑chain complexity, encompassing both breadth‑ and depth‑oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi‑hop reasoning; and (2) subtask‑level verifiability, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask‑level answer remains unchanged and verifiable. The benchmark consists of 302 tasks across five real‑world domains, each with a complete trajectory demonstration, annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long‑horizon web tasks, highlighting the need for more powerful agentic information‑seeking capabilities.

PaperID: 9278, https://arxiv.org/pdf/2508.04020.pdf

Authors: Giacomo Albi, Young-Pil Choi, Matteo Piu, Sihyun Song

Title: Micro-macro and macro-macro limits for controlled leader-follower systems

Abstract:
We study a leader‑follower system of interacting particles subject to feedback control and derive its mean‑field limits through a two‑step passage: first to a micro‑macro system coupling leader particles with a follower fluid, and then to a fully continuum macro‑macro system. For each limiting procedure, we establish quantitative stability and convergence estimates based on modulated energy methods and Wasserstein distances. These results provide a rigorous foundation for the hierarchical reduction of controlled multi‑agent systems. Numerical simulations are presented, including examples with interaction potentials beyond the analytical class considered, to demonstrate the dynamics and support the theoretical results.

PaperID: 9279, https://arxiv.org/pdf/2508.03991.pdf

Authors: Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu

Title: Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents

Abstract:
Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy‑preserving, and capable of self‑evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system‑level design. We unify cognitive architecture and system design into a self‑reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition‑enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta‑cognition‑based meta‑agent that enables Galaxy's self‑evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state‑of‑the‑art benchmarks. Ablation studies and real‑world interaction cases validate the effectiveness of Galaxy.

PaperID: 9280, https://arxiv.org/pdf/2508.03923.pdf

Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

Title: CoAct-1: Computer-using Multi-Agent System with Coding Actions

Abstract:
Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long‑horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as a enhanced action. We present CoAct‑1, a novel multi‑agent system that synergistically combines GUI‑based control with direct programmatic execution. CoAct‑1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct‑1 achieves a new state‑of‑the‑art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

PaperID: 9281, https://arxiv.org/pdf/2508.03864.pdf

Authors: Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu

Title: Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Abstract:
Multi‑agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single‑point failure‑if compromised, system‑wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo‑MARL, a novel multi‑agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo‑MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single‑node failure. Furthermore, Evo‑MARL integrates evolutionary search with parameter‑sharing reinforcement learning to co‑evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co‑evolving threats. Experiments show that Evo‑MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks‑demonstrating that safety and utility can be jointly improved.

PaperID: 9282, https://arxiv.org/pdf/2508.03860.pdf

Authors: Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam

Title: Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

Abstract:
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact‑checking essential. This review systematically analyzes how LLM‑generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact‑checking frameworks that integrate advanced prompting strategies, domain‑specific fine‑tuning, and retrieval‑augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. Instruction tuning, multi‑agent reasoning, and RAG frameworks for external knowledge access are also reviewed. The key findings demonstrate the limitations of current metrics, the importance of validated external evidence, and the improvement of factual consistency through domain‑specific customization. The review underscores the importance of building more accurate, understandable, and context‑aware fact‑checking. These insights contribute to the advancement of research toward more trustworthy models.

PaperID: 9283, https://arxiv.org/pdf/2508.03766.pdf

Authors: Yongchao Huang

Title: LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

Abstract:
The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce \textttLLMPrior, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming a LLM based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi‑agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, \textttFed‑LLMPrior, for aggregating distributed, context‑dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.

PaperID: 9284, https://arxiv.org/pdf/2508.03728.pdf

Authors: Revanth Gangi Reddy, Tanay Dixit, Jiaxin Qin, Cheng Qian, Daniel Lee, Jiawei Han, Kevin Small, Xing Fan, Ruhi Sarikaya, Heng Ji

Title: WINELL: Wikipedia Never-Ending Updating with LLM Agents

Abstract:
Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up‑to‑date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM‑based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi‑agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine‑grained editing models, trained on Wikipedia's extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open‑source instruction‑following baselines and closed‑source LLMs (e.g., GPT‑4o) in key information coverage and editing efficiency. End‑to‑end evaluation on high‑activity Wikipedia pages demonstrates WiNELL's ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never‑ending fashion.

PaperID: 9285, https://arxiv.org/pdf/2508.03680.pdf

Authors: Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang

Title: Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Abstract:
We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)‑based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed via diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, and building from scratch) with almost ZERO code modifications. By formulating agent execution as Markov decision process, we define an unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transition. This enables RL to handle complex interaction logic, such as multi‑agent scenarios and dynamic workflows. For the system design, we introduce a Training‑Agent Disaggregation architecture, and brings agent observability frameworks into agent runtime, providing a standardized agent finetuning interface. Experiments across text‑to‑SQL, retrieval‑augmented generation, and math tool‑use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real‑world agent training and deployment.

PaperID: 9286, https://arxiv.org/pdf/2508.03547.pdf

Authors: Ada Yi Zhao, Aditya Gunturu, Ellen Yi-Luen Do, Ryo Suzuki

Title: Guided Reality: Generating Visually-Enriched AR Task Guidance with LLMs and Vision Models

Abstract:
Large language models (LLMs) have enabled the automatic generation of step‑by‑step augmented reality (AR) instructions for a wide range of physical tasks. However, existing LLM‑based AR guidance often lacks rich visual augmentations to effectively embed instructions into spatial context for a better user understanding. We present Guided Reality, a fully automated AR system that generates embedded and dynamic visual guidance based on step‑by‑step instructions. Our system integrates LLMs and vision models to: 1) generate multi‑step instructions from user queries, 2) identify appropriate types of visual guidance, 3) extract spatial information about key interaction points in the real world, and 4) embed visual guidance in physical space to support task execution. Drawing from a corpus of user manuals, we define five categories of visual guidance and propose an identification strategy based on the current step. We evaluate the system through a user study (N=16), completing real‑world tasks and exploring the system in the wild. Additionally, four instructors shared insights on how Guided Reality could be integrated into their training workflows.

PaperID: 9287, https://arxiv.org/pdf/2508.03526.pdf

Authors: Kun Song, Gaoming Chen, Shentao Ma, Ninglong Jin, Guangbao Zhao, Mingyu Ding, Zhenhua Xiong, Jia Pan

Title: CollaBot: Vision-Language Guided Simultaneous Collaborative Manipulation

Abstract:
One central goal of robotics is to enable robots to interact with the physical world. Traditional manipulation studies primarily focus on single robots and relatively small objects. However, factory and domestic environments often require large‑object manipulation, such as moving tables, where multiple robots must work collaboratively. Existing studies still lack a generalizable framework that can handle diverse objects, tasks, and robot team sizes. In this work, we propose CollaBot, a generalist framework for simultaneous collaborative manipulation. First, we use SEEM for scene segmentation and target‑object extraction. Then, we propose a collaborative grasping framework that decomposes the task into local grasp pose generation and global coordination. Finally, we design a two‑stage planning module to generate collision‑free trajectories for task execution. Experimental results across different settings with varying objects, tasks, and numbers of robots indicate that our framework achieves a 72% success rate. This marks a substantial improvement over behavior cloning‑based methods, validating the advantages of the proposed framework in complex multi‑robot cooperative tasks. Real‑world experiments further demonstrate the feasibility of our method in practical applications.

PaperID: 9288, https://arxiv.org/pdf/2508.03500.pdf

Authors: Yijin Yang, Cristina Cornelio, Mario Leiva, Paulo Shakarian

Title: Error Detection and Correction for Interpretable Mathematics in Large Language Models

Abstract:
Recent large language models (LLMs) have demonstrated the ability to perform explicit multi‑step reasoning such as chain‑of‑thought prompting. However, their intermediate steps often contain errors that can propagate leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solve the problem (expressed in natural language) rather than a black‑box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error‑detection framework that identifies errors and provides targeted feedback for LLM‑based correction. To optimize efficiency, EDCIM integrates lightweight, open‑source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to control the trade‑off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy when the balance is properly configured.

PaperID: 9289, https://arxiv.org/pdf/2508.03444.pdf

Authors: Atabey Ünlü, Phil Rohr, Ahmet Celebi

Title: An Auditable Agent Platform For Automated Molecular Optimisation

Abstract:
Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop we built a hierarchical, tool using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence to molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored causing the full reasoning path to remain inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule centered reasoning trajectories and reuse successful transformations via in context learning. Three cycle research loops were run against AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale ups on the two top performers. We then compared the leading LLMs' binding affinity results across three configurations, LLM only, single agent, and multi agent. Our results reveal an architectural trade off, the multi agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single agent runs generated molecules with superior drug like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test time scaling, focused feedback loops and provenance convert general purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.

PaperID: 9290, https://arxiv.org/pdf/2508.03438.pdf

Authors: Taine J. Elliott, Stephen P. Levitt, Ken Nixon, Martin Bekker

Title: Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction

Abstract:
The rapid expansion of publicly‑available medical data presents a challenge for clinicians and researchers alike, increasing the gap between the volume of scientific literature and its applications. The steady growth of studies and findings overwhelms medical professionals at large, hindering their ability to systematically review and understand the latest knowledge. This paper presents an approach to information extraction and automatic knowledge graph (KG) generation to identify and connect biomedical knowledge. Through a pipeline of large language model (LLM) agents, the system decomposes 44 PubMed abstracts into semantically meaningful proposition sentences and extracts KG triples from these sentences. The triples are enhanced using a combination of open domain and ontology‑based information extraction methodologies to incorporate ontological categories. On top of this, a context variable is included during extraction to allow the triple to stand on its own ‑ thereby becoming `quadruples'. The extraction accuracy of the LLM is validated by comparing natural language sentences generated from the enhanced triples to the original propositions, achieving an average cosine similarity of 0.874. The similarity for generated sentences of enhanced triples were compared with generated sentences of ordinary triples showing an increase as a result of the context variable. Furthermore, this research explores the ability for LLMs to infer new relationships and connect clusters in the knowledge base of the knowledge graph. This approach leads the way to provide medical practitioners with a centralised, updated in real‑time, and sustainable knowledge source, and may be the foundation of similar gains in a wide variety of fields.

PaperID: 9291, https://arxiv.org/pdf/2508.03406.pdf

Authors: Kai Li, Ruihao Zheng, Xinye Hao, Zhenkun Wang

Title: Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models

Abstract:
In real‑world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)‑based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi‑Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi‑objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi‑objective optimization to consider both path cost and constraint violation, generating a set of trade‑off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM‑based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision‑making compared to existing methods.

PaperID: 9292, https://arxiv.org/pdf/2508.03404.pdf

Authors: Xinlei Yu, Chengming Xu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Cheng Yang, Jiangning Zhang, Shuicheng Yan, Xiaobin Hu

Title: Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Abstract:
The dominant paradigm of monolithic scaling in Vision‑Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document‑based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi‑Agent Collaboration framework with agent‑wise adaptive Test‑time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual documents understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self‑correction loop for factual grounding. This collaborative architecture is amplified by an agent‑wise adaptive test‑time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top‑three average performance rankings, with average performance enhancements of 9.9‑11.5% over the base models. The source code will be released publicly.

PaperID: 9293, https://arxiv.org/pdf/2508.03370.pdf

Authors: Xinyu Jin, Shengmao Yan, Qingtao Wang, Shisong Deng, Yanzhen Jiang, Shuangyao Zhao

Title: A Closed-Loop Multi-Agent Framework for Aerodynamics-Aware Automotive Styling Design

Abstract:
The core challenge in automotive exterior design is balancing subjective aesthetics with objective aerodynamic performance while dramatically accelerating the development cycle. To address this, we propose a novel, LLM‑driven multi‑agent framework that automates the end‑to‑end workflow from ambiguous requirements to 3D concept model performance validation. The workflow is structured in two stages: conceptual generation and performance validation. In the first stage, agents collaborate to interpret fuzzy design requirements, generate concept sketches, and produce photorealistic renderings using diffusion models. In the second stage, the renderings are converted to 3D point clouds, where a Drag Prediction Agent, built upon a lightweight surrogate model, provides near‑instantaneous predictions of the drag coefficient and pressure fields, replacing time‑consuming CFD simulations. The primary contribution of this work is the seamless integration of creative generation with a rapid engineering validation loop within a unified, automated system, which provides a new paradigm for efficiently balancing creative exploration with engineering constraints in the earliest stages of design.

PaperID: 9294, https://arxiv.org/pdf/2508.03363.pdf

Authors: Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

Title: Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Abstract:
Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi‑step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in‑context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few‑shot chain‑of‑thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in‑distribution performance to training‑based SOTA reasoning method, while substantially outperforming on out‑of‑distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

PaperID: 9295, https://arxiv.org/pdf/2508.03345.pdf

Authors: Xingdan Wang, Jiayi He, Zhiqing Tang, Jianxiong Guo, Jiong Lou, Liping Qian, Tian Wang, Weijia Jia

Title: Adaptive AI Agent Placement and Migration in Edge Intelligence Systems

Abstract:
The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real‑time task handling. However, migrating data‑intensive, multi‑modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM‑based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM‑based optimization for efficient decision‑making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs.

PaperID: 9296, https://arxiv.org/pdf/2508.03293.pdf

Authors: Duc-An Nguyen, Clara Colombatto, Steve Fleming, Ingmar Posner, Nick Hawes, Raunak Bhattacharyya

Title: Enhancing Joint Human-AI Inference in Robot Missions: A Confidence-Based Approach

Abstract:
Joint human‑AI inference holds immense potential to improve outcomes in human‑supervised robot missions. Current day missions are generally in the AI‑assisted setting, where the human operator makes the final inference based on the AI recommendation. However, due to failures in human judgement on when to accept or reject the AI recommendation, complementarity is rarely achieved. We investigate joint human‑AI inference where the inference made with higher confidence is selected. Through a user study with N=100 participants on a representative simulated robot teleoperation task, specifically studying the inference of robots' control delays we show that: a) Joint inference accuracy is higher and its extent is regulated by the confidence calibration of the AI agent, and b) Humans change their inferences based on AI recommendations and the extent and direction of this change is also regulated by the confidence calibration of the AI agent. Interestingly, our results show that pairing poorly‑calibrated AI‑DSS with humans hurts performance instead of helping the team, reiterating the need for AI‑based decision support systems with good metacognitive sensitivity. To the best of our knowledge, our study presents the first application of a maximum‑confidence‑based heuristic for joint human‑AI inference within a simulated robot teleoperation task.

PaperID: 9297, https://arxiv.org/pdf/2508.03281.pdf

Authors: Pavlo Bazilinskyy, Francesco Walker, Debargha Dey, Tram Thi Minh Tran, Hyungchai Park, Hyochang Kim, Hyunmin Kang, Patrick Ebel

Title: Quo-Vadis Multi-Agent Automotive Research? Insights from a Participatory Workshop and Questionnaire

Abstract:
The transition to mixed‑traffic environments that involve automated vehicles, manually operated vehicles, and vulnerable road users presents new challenges for human‑centered automotive research. Despite this, most studies in the domain focus on single‑agent interactions. This paper reports on a participatory workshop (N = 15) and a questionnaire (N = 19) conducted during the AutomotiveUI '24 conference to explore the state of multi‑agent automotive research. The participants discussed methodological challenges and opportunities in real‑world settings, simulations, and computational modeling. Key findings reveal that while the value of multi‑agent approaches is widely recognized, practical and technical barriers hinder their implementation. The study highlights the need for interdisciplinary methods, better tools, and simulation environments that support scalable, realistic, and ethically informed multi‑agent research.

PaperID: 9298, https://arxiv.org/pdf/2508.03137.pdf

Authors: Ge Shi, Kaiyu Huang, Guochen Feng

Title: Long Story Generation via Knowledge Graph and Literary Theory

Abstract:
The generation of a long story consisting of several thousand words is a sub‑task in the field of long text generation~(LTG). Previous research has addressed this challenge through outline‑based generation, which employs a multi‑stage method for generating outlines into stories. However, this approach suffers from two common issues: almost inevitable theme drift caused by the loss of memory of previous outlines, and tedious plots with incoherent logic that are less appealing to human readers. In this paper, we propose the multi‑agent Story Generator structure to improve the multi‑stage method, using large language models~(LLMs) as the core components of agents. To avoid theme drift, we introduce a memory storage model comprising two components: a long‑term memory storage that identifies the most important memories, thereby preventing theme drift; and a short‑term memory storage that retains the latest outlines from each generation round. To incorporate engaging elements into the story, we design a story theme obstacle framework based on literary narratology theory that introduces uncertain factors and evaluation criteria to generate outline. This framework calculates the similarity of the former storyline and enhances the appeal of the story by building a knowledge graph and integrating new node content. Additionally, we establish a multi‑agent interaction stage to simulate writer‑reader interaction through dialogue and revise the story text according to feedback, to ensure it remains consistent and logical. Evaluations against previous methods demonstrate that our approach can generate higher‑quality long stories.

PaperID: 9299, https://arxiv.org/pdf/2508.03136.pdf

Authors: Zachary Roch, Yue Wang

Title: Distributionally Robust Markov Games with Average Reward

Abstract:
We study distributionally robust Markov games (DR‑MGs) with the average‑reward criterion, a framework for multi‑agent decision‑making under uncertainty over extended horizons. In average reward DR‑MGs, agents aim to maximize their worst‑case infinite‑horizon average reward, to ensure satisfactory performance under environment uncertainties and opponent actions. We first establish a connection between the best‑response policies and the optimal policies for the induced single‑agent problems. Under a standard irreducible assumption, we derive a correspondence between the optimal policies and the solutions of the robust Bellman equation, and derive the existence of stationary Nash Equilibrium (NE) based on these results. We further study DR‑MGs under the weakly communicating setting, where we construct a set‑valued map and show its value is a subset of the best‑response policies, convex and upper hemi‑continuous, and derive the existence of NE. We then explore algorithmic solutions, by first proposing a Robust Nash‑Iteration algorithm and providing convergence guarantees under some additional assumptions and a NE computing oracle. We further develop a temporal‑difference based algorithm for DR‑MGs, and provide convergence guarantees without any additional oracle or assumptions. Finally, we connect average‑reward robust NE to discounted ones, showing that the average reward robust NE can be approximated by the discounted ones under a large discount factor. Our studies provide a comprehensive theoretical and algorithmic foundation for decision‑making in complex, uncertain, and long‑running multi‑player environments.

PaperID: 9300, https://arxiv.org/pdf/2508.03125.pdf

Authors: Bingyu Yan, Ziyi Zhou, Xiaoming Zhang, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, Litian Zhang

Title: Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS

Abstract:
Large language model‑based multi‑agent systems (LLM‑MAS) effectively accomplish complex and dynamic tasks through inter‑agent communication, but this reliance introduces substantial safety vulnerabilities. Existing attack methods targeting LLM‑MAS either compromise agent internals or rely on direct and overt persuasion, which limit their effectiveness, adaptability, and stealthiness. In this paper, we propose MAST, a Multi‑round Adaptive Stealthy Tampering framework designed to exploit communication vulnerabilities within the system. MAST integrates Monte Carlo Tree Search with Direct Preference Optimization to train an attack policy model that adaptively generates effective multi‑round tampering strategies. Furthermore, to preserve stealthiness, we impose dual semantic and embedding similarity constraints during the tampering process. Comprehensive experiments across diverse tasks, communication architectures, and LLMs demonstrate that MAST consistently achieves high attack success rates while significantly enhancing stealthiness compared to baselines. These findings highlight the effectiveness, stealthiness, and adaptability of MAST, underscoring the need for robust communication safeguards in LLM‑MAS.

PaperID: 9301, https://arxiv.org/pdf/2508.03117.pdf

Authors: Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou

Title: Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation

Abstract:
We present a framework for training trustworthy large language model (LLM) agents for optimization modeling via a verifiable synthetic data generation pipeline. Focusing on linear and mixed‑integer linear programming, our approach begins with structured symbolic representations and systematically produces natural language descriptions, mathematical formulations, and solver‑executable code. By programmatically constructing each instance with known optimal solutions, the pipeline ensures full verifiability and enables automatic filtering of low‑quality demonstrations generated by teacher models. Each dataset instance includes a structured representation of the optimization problem, a corresponding natural language description, the verified optimal solution, and step‑by‑step demonstrations ‑ generated by a teacher model ‑ that show how to model and solve the problem across multiple optimization modeling languages. This enables supervised fine‑tuning of open‑source LLMs specifically tailored to optimization tasks. To operationalize this pipeline, we introduce OptiTrust, a modular LLM agent that performs multi‑stage translation from natural language to solver‑ready code, leveraging stepwise demonstrations, multi‑language inference, and majority‑vote cross‑validation. Our agent achieves state‑of‑the‑art performance on standard benchmarks. Out of 7 datasets, it achieves the highest accuracy on six and outperforms the next‑best algorithm by at least 8 percentage on three of them. Our approach provides a scalable, verifiable, and principled path toward building reliable LLM agents for real‑world optimization applications.

PaperID: 9302, https://arxiv.org/pdf/2508.03113.pdf

Authors: John Zinky, Hema Seshadri, Mahesh Lambe, Pradyumna Chari, Ramesh Raskar

Title: NANDA Adaptive Resolver: Architecture for Dynamic Resolution of AI Agent Names

Abstract:
AdaptiveResolver is a dynamic microservice architecture designed to address the limitations of static endpoint resolution for AI agent communication in distributed, heterogeneous environments. Unlike traditional DNS or static URLs, AdaptiveResolver enables context‑aware, real‑time selection of communication endpoints based on factors such as geographic location, system load, agent capabilities, and security threats. Agents advertise their Agent Name and context requirements through Agent Fact cards in an Agent Registry/Index. A requesting Agent discovers a Target Agent using the registry. The Requester Agent can then resolve the Target Agent Name to obtain a tailored communication channel to the agent based on actual environmental context between the agents. The architecture supports negotiation of trust, quality of service, and resource constraints, facilitating flexible, secure, and scalable agent‑to‑agent interactions that go beyond the classic client‑server model. AdaptiveResolver provides a foundation for robust, future‑proof agent communication that can evolve with increasing ecosystem complexity.

PaperID: 9303, https://arxiv.org/pdf/2508.03101.pdf

Authors: Sichao Wang, Ramesh Raskar, Mahesh Lambe, Pradyumna Chari, Rekha Singhal, Shailja Gupta, Rajesh Ranjan, Ken Huang

Title: Using the NANDA Index Architecture in Practice: An Enterprise Perspective

Abstract:
The proliferation of autonomous AI agents represents a paradigmatic shift from traditional web architectures toward collaborative intelligent systems requiring sophisticated mechanisms for discovery, authentication, capability verification, and secure collaboration across heterogeneous protocol environments. This paper presents a comprehensive framework addressing the fundamental infrastructure requirements for secure, trustworthy, and interoperable AI agent ecosystems. We introduce the NANDA (Networked AI Agents in a Decentralized Architecture) framework, providing global agent discovery, cryptographically verifiable capability attestation through AgentFacts, and cross‑protocol interoperability across Anthropic's Modal Context Protocol (MCP), Google's Agent‑to‑Agent (A2A), Microsoft's NLWeb, and standard HTTPS communications. NANDA implements Zero Trust Agentic Access (ZTAA) principles, extending traditional Zero Trust Network Access (ZTNA) to address autonomous agent security challenges including capability spoofing, impersonation attacks, and sensitive data leakage. The framework defines Agent Visibility and Control (AVC) mechanisms enabling enterprise governance while maintaining operational autonomy and regulatory compliance. Our approach transforms isolated AI agents into an interconnected ecosystem of verifiable, trustworthy intelligent services, establishing foundational infrastructure for large‑scale autonomous agent deployment across enterprise and consumer environments. This work addresses the critical gap between current AI agent capabilities and infrastructure requirements for secure, scalable, multi‑agent collaboration, positioning the foundation for next‑generation autonomous intelligent systems.

PaperID: 9304, https://arxiv.org/pdf/2508.03095.pdf

Authors: Aditi Singh, Abul Ehtesham, Mahesh Lambe, Jared James Grogan, Abhishek Singh, Saket Kumar, Luca Muscariello, Vijoy Pandey, Guillaume Sauvage De Saint Marc, Pradyumna Chari, Ramesh Raskar

Title: Evolution of AI Agent Registry Solutions: Centralized, Enterprise, and Distributed Approaches

Abstract:
Autonomous AI agents now operate across cloud, enterprise, and decentralized domains, creating demand for registry infrastructures that enable trustworthy discovery, capability negotiation, and identity assurance. We analyze five prominent approaches: (1) MCP Registry (centralized publication of mcp.json descriptors), (2) A2A Agent Cards (decentralized self‑describing JSON capability manifests), (3) AGNTCY Agent Directory Service (IPFS Kademlia DHT content routing extended for semantic taxonomy‑based content discovery, OCI artifact storage, and Sigstore‑backed integrity), (4) Microsoft Entra Agent ID (enterprise SaaS directory with policy and zero‑trust integration), and (5) NANDA Index AgentFacts (cryptographically verifiable, privacy‑preserving fact model with credentialed assertions). Using four evaluation dimensions: security, authentication, scalability, and maintainability, we surface architectural trade‑offs between centralized control, enterprise governance, and distributed resilience. We conclude with design recommendations for an emerging Internet of AI Agents requiring verifiable identity, adaptive discovery flows, and interoperable capability semantics.

PaperID: 9305, https://arxiv.org/pdf/2508.03092.pdf

Authors: Zikun Cui, Tianyi Huang, Chia-En Chiang, Cuiqianhe Du

Title: Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework

Abstract:
With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi‑step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI‑assisted fact‑checking.

PaperID: 9306, https://arxiv.org/pdf/2508.03039.pdf

Authors: Yiran Meng, Junhong Ye, Wei Zhou, Guanghui Yue, Xudong Mao, Ruomei Wang, Baoquan Zhao

Title: VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering

Abstract:
Cross‑video question answering presents significant challenges beyond traditional single‑video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi‑source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person‑anchored hierarchical reasoning. Our approach leverages person‑level features as natural bridge points between videos, enabling effective cross‑video understanding without requiring end‑to‑end training. VideoForest integrates three key innovations: 1) a human‑anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi‑granularity spanning tree structure that hierarchically organizes visual content around person‑level trajectories; and 3) a multi‑agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross‑video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person‑centric cross‑video analysis. Experimental results demonstrate VideoForest's superior performance in cross‑video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross‑video understanding by unifying multiple video streams through person‑level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.

PaperID: 9307, https://arxiv.org/pdf/2508.03038.pdf

Authors: Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, Qing Li

Title: Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree

Abstract:
Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis task in the real world. This is mainly because they lack sufficient reasoning depth, which leads to information loss or logical jumps when processing a large amount of specialized medical data, leading to diagnostic errors. To address these challenges, we propose Tree‑of‑Reasoning (ToR), a novel multi‑agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross‑validation mechanism to ensure the consistency of multi‑agent decision‑making, thereby improving the clinical reasoning ability of multi‑agents in complex medical scenarios. Experimental results on real‑world medical data show that our framework can achieve better performance than existing baseline methods.

PaperID: 9308, https://arxiv.org/pdf/2508.03030.pdf

Authors: Siyuan Li, Yifan Yu, Zhihao Zhang, Mengjing Chen, Fangzhou Zhu, Tao Zhong, Peng Liu, Jianye Hao

Title: Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming

Abstract:
Mixed‑integer linear programming (MILP) has been a fundamental problem in combinatorial optimization. Conventional MILP solving mainly relies on carefully designed heuristics embedded in the branch‑and‑bound framework. Driven by the strong capabilities of neural networks, recent research is exploring the value of machine learning alongside conventional MILP solving. Although learning‑based MILP methods have shown great promise, existing works typically learn policies for individual modules in MILP solvers in isolation, without considering their interdependence, which limits both solving efficiency and solution quality. To address this limitation, we propose Collab‑Solver, a novel multi‑agent‑based policy learning framework for MILP that enables collaborative policy optimization for multiple modules. Specifically, we formulate the collaboration between cut selection and branching in MILP solving as a Stackelberg game. Under this formulation, we develop a two‑phase learning paradigm to stabilize collaborative policy learning: the first phase performs data‑communicated policy pretraining, and the second phase further orchestrates the policy learning for various modules. Extensive experiments on both synthetic and large‑scale real‑world MILP datasets demonstrate that the jointly learned policies significantly improve solving performance. Moreover, the policies learned by Collab‑Solver have also demonstrated excellent generalization abilities across different instance sets.

PaperID: 9309, https://arxiv.org/pdf/2508.03014.pdf

Authors: Jingyan Wang, Yang Zhao, Haotian Mao, Xubo Yang

Title: Survey of Large Language Models in Extended Reality: Technical Paradigms and Application Frontiers

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, and their integration with Extended Reality (XR) is poised to transform how users interact with immersive environments. This survey provides a comprehensive review of recent developments at the intersection of LLMs and XR, offering a structured organization of research along both technical and application dimensions. We propose a taxonomy of LLM‑enhanced XR systems centered on key technical paradigms ‑‑ such as interactive agent control, XR development toolkits, and generative scene synthesis ‑‑ and discuss how these paradigms enable novel capabilities in XR. In parallel, we examine how LLM‑driven techniques support practical XR applications across diverse domains, including immersive education, clinical healthcare, and industrial manufacturing. By connecting these technical paradigms with application frontiers, our survey highlights current trends, delineates design considerations, and identifies open challenges in building LLM‑augmented XR systems. This work provides insights that can guide researchers and practitioners in advancing the state of the art in intelligent XR experiences.

PaperID: 9310, https://arxiv.org/pdf/2508.03012.pdf

Authors: Zexiong Ma, Chao Peng, Qunhong Zeng, Pengfei Gao, Yanzhen Zou, Bing Xie

Title: Tool-integrated Reinforcement Learning for Repo Deep Search

Abstract:
Issue localization, the process of identifying code locations that need modification to resolve software issues, is a critical yet challenging task in software development. The semantic gap between natural language issue descriptions and faulty code requires complex multi‑hop reasoning through code dependencies. Existing LLM‑based agents attempt to address this by integrating repository retrieval tools. However, this transforms issue localization into a demanding task we call Repo Deep Search, which requires the LLM to effectively utilize various repository retrieval tools throughout a multi‑step reasoning and navigation process. To tackle this challenge, we present ToolTrain, a two‑stage tool‑integrated training framework combining rejection‑sampled supervised fine‑tuning and tool‑integrated reinforcement learning to enhance LLMs' ability to use retrieval tools for issue localization. Experimental results show that ToolTrain‑trained models achieve state‑of‑the‑art performance, with our 32B model even surpassing Claude‑3.7 on function‑level localization. The results also show that improved localization performance translates to better end‑to‑end issue resolution performance. This further demonstrates that training for issue localization is a viable and effective strategy for improving automated software development.

PaperID: 9311, https://arxiv.org/pdf/2508.02999.pdf

Authors: Xinjie Zhao, Moritz Blum, Fan Gao, Yingjian Chen, Boming Yang, Luis Marquez-Carpintero, Mónica Pina-Navarro, Yanran Fu, So Morikawa, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Irene Li

Title: AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots

Abstract:
AGENTiGraph is a user‑friendly, agent‑driven system that enables intuitive interaction and management of domain‑specific data through the manipulation of knowledge graphs in natural language. It gives non‑technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi‑round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500‑query benchmark within an educational scenario, the system outperforms strong zero‑shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance‑critical or multi‑step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open‑source demo offers a powerful new paradigm for multi‑turn enterprise knowledge management that bridges LLMs and structured graphs.

PaperID: 9312, https://arxiv.org/pdf/2508.02994.pdf

Authors: Fangyi Yu

Title: When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

Abstract:
As large language models (LLMs) grow in capability and autonomy, evaluating their outputs‑especially in open‑ended and complex tasks‑has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This "agent‑as‑a‑judge" approach leverages the reasoning and perspective‑taking abilities of LLMs to assess the quality and safety of other models, promising calable and nuanced alternatives to human evaluation. In this review, we define the agent‑as‑a‑judge concept, trace its evolution from single‑model judges to dynamic multi‑agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real‑world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges‑including bias, robustness, and meta evaluation‑and outline future research directions. By bringing together these strands, our review demonstrates how agent‑based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next‑generation LLMs.

PaperID: 9313, https://arxiv.org/pdf/2508.02956.pdf

Authors: Alireza Ghafarollahi, Markus J. Buehler

Title: Autonomous Inorganic Materials Discovery via Multi-Agent Physics-Aware Scientific Reasoning

Abstract:
Conventional machine learning approaches accelerate inorganic materials design via accurate property prediction and targeted material generation, yet they operate as single‑shot models limited by the latent knowledge baked into their training data. A central challenge lies in creating an intelligent system capable of autonomously executing the full inorganic materials discovery cycle, from ideation and planning to experimentation and iterative refinement. We introduce SparksMatter, a multi‑agent AI model for automated inorganic materials design that addresses user queries by generating ideas, designing and executing experimental workflows, continuously evaluating and refining results, and ultimately proposing candidate materials that meet the target objectives. SparksMatter also critiques and improves its own responses, identifies research gaps and limitations, and suggests rigorous follow‑up validation steps, including DFT calculations and experimental synthesis and characterization, embedded in a well‑structured final report. The model's performance is evaluated across case studies in thermoelectrics, semiconductors, and perovskite oxides materials design. The results demonstrate the capacity of SparksMatter to generate novel stable inorganic structures that target the user's needs. Benchmarking against frontier models reveals that SparksMatter consistently achieves higher scores in relevance, novelty, and scientific rigor, with a significant improvement in novelty across multiple real‑world design tasks as assessed by a blinded evaluator. These results demonstrate SparksMatter's unique capacity to generate chemically valid, physically meaningful, and creative inorganic materials hypotheses beyond existing materials knowledge.

PaperID: 9314, https://arxiv.org/pdf/2508.02948.pdf

Authors: Zain Ulabedeen Farhat, Debamita Ghosh, George K. Atia, Yue Wang

Title: Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction

Abstract:
Well‑trained multi‑agent systems can fail when deployed in real‑world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst‑case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the Multiplayer Optimistic Robust Nash Value Iteration (MORNAVI) algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and efficiently finds the optimal robust policy for uncertainty sets measured by Total Variation divergence and Kullback‑Leibler divergence. These results establish a new, practical path toward developing truly robust multi‑agent systems.

PaperID: 9315, https://arxiv.org/pdf/2508.02912.pdf

Authors: Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh

Title: Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Abstract:
Robust coordination is critical for effective decision‑making in multi‑agent systems, especially under partial observability. A central question in Multi‑Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end‑to‑end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task‑allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end‑to‑end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal‑directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model‑based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal‑driven coordination.

PaperID: 9316, https://arxiv.org/pdf/2508.02866.pdf

Authors: Renan Souza, Amal Gueroudji, Stephen DeWitt, Daniel Rosendo, Tirthankar Ghosal, Robert Ross, Prasanna Balaprakash, Rafael Ferreira da Silva

Title: PROV-AGENT: Unified Provenance for Tracking AI Agent Interactions in Agentic Workflows

Abstract:
Large Language Models (LLMs) and other foundation models are increasingly used as the core of AI agents. In agentic workflows, these agents plan tasks, interact with humans and peers, and influence scientific outcomes across federated and heterogeneous environments. However, agents can hallucinate or reason incorrectly, propagating errors when one agent's output becomes another's input. Thus, assuring that agents' actions are transparent, traceable, reproducible, and reliable is critical to assess hallucination risks and mitigate their workflow impacts. While provenance techniques have long supported these principles, existing methods fail to capture and relate agent‑centric metadata such as prompts, responses, and decisions with the broader workflow context and downstream outcomes. In this paper, we introduce PROV‑AGENT, a provenance model that extends W3C PROV and leverages the Model Context Protocol (MCP) and data observability to integrate agent interactions into end‑to‑end workflow provenance. Our contributions include: (1) a provenance model tailored for agentic workflows, (2) a near real‑time, open‑source system for capturing agentic provenance, and (3) a cross‑facility evaluation spanning edge, cloud, and HPC environments, demonstrating support for critical provenance queries and agent reliability analysis.

PaperID: 9317, https://arxiv.org/pdf/2508.02841.pdf

Authors: Ziruo Yi, Jinyu Liu, Ting Xiao, Mark V. Albert

Title: A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering

Abstract:
Radiology visual question answering (RVQA) provides precise answers to questions about chest X‑ray images, alleviating radiologists' workload. While recent methods based on multimodal large language models (MLLMs) and retrieval‑augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross‑modal misalignment. We introduce a multi‑agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi‑agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.

PaperID: 9318, https://arxiv.org/pdf/2508.02826.pdf

Authors: Conor Wallace, Umer Siddique, Yongcan Cao

Title: TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding

Abstract:
Agent modeling is a critical component in developing effective policies within multi‑agent systems, as it enables agents to form beliefs about the behaviors, intentions, and competencies of others. Many existing approaches assume access to other agents' episodic trajectories, a condition often unrealistic in real‑world applications. Consequently, a practical agent modeling approach must learn a robust representation of the policies of the other agents based only on the local trajectory of the controlled agent. In this paper, we propose \textttTransAM, a novel transformer‑based agent modeling approach to encode local trajectories into an embedding space that effectively captures the policies of other agents. We evaluate the performance of the proposed method in cooperative, competitive, and mixed multi‑agent environments. Extensive experimental results demonstrate that our approach generates strong policy representations, improves agent modeling, and leads to higher episodic returns.

PaperID: 9319, https://arxiv.org/pdf/2508.02773.pdf

Authors: Yiming Shen, Jiashuo Zhang, Zhenzhe Shao, Wenxuan Luo, Yanlin Wang, Ting Chen, Zibin Zheng, Jiachi Chen

Title: Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges

Abstract:
The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier poised to reshape decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection between Web3 and AI agents, examining five critical dimensions: landscape, economics, governance, security, and trust mechanisms. Through an analysis of 133 existing projects, we first develop a taxonomy and systematically map the current market landscape (RQ1), identifying distinct patterns in project distribution and capitalization. Building upon these findings, we further investigate four key integrations: (1) the role of AI agents in participating in and optimizing decentralized finance (RQ2); (2) their contribution to enhancing Web3 governance mechanisms (RQ3); (3) their capacity to strengthen Web3 security via intelligent vulnerability detection and automated smart contract auditing (RQ4); and (4) the establishment of robust reliability frameworks for AI agent operations leveraging Web3's inherent trust infrastructure (RQ5). By synthesizing these dimensions, we identify key integration patterns, highlight foundational challenges related to scalability, security, and ethics, and outline critical considerations for future research toward building robust, intelligent, and trustworthy decentralized systems with effective AI agent interactions.

PaperID: 9320, https://arxiv.org/pdf/2508.02721.pdf

Authors: Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao

Title: Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

Abstract:
While powerful, the inherent non‑determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high‑level planning with low‑level action execution within a single generative process. To address this, we introduce the Source Code Agent framework, a new paradigm built on the "Blueprint First, Model Second" philosophy. Our framework decouples the workflow logic from the generative model. An expert‑defined operational procedure is first codified into a source code‑based Execution Blueprint, which is then executed by a deterministic engine. The LLM is strategically invoked as a specialized tool to handle bounded, complex sub‑tasks within the workflow, but never to decide the workflow's path. We conduct a comprehensive evaluation on the challenging tau‑bench benchmark, designed for complex user‑tool‑rule scenarios. Our results demonstrate that the Source Code Agent establishes a new state‑of‑the‑art, outperforming the strongest baseline by 10.1 percentage points on the average Pass^1 score while dramatically improving execution efficiency. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.

PaperID: 9321, https://arxiv.org/pdf/2508.02679.pdf

Authors: Wayupuk Sommuang, Kun Kerdthaisong, Pasin Buakhaw, Aslan B. Wong, Nutchanon Yongsatianchot

Title: LLM Agent-Based Simulation of Student Activities and Mental Health Using Smartphone Sensing Data

Abstract:
Students' mental well‑being is vital for academic success, with activities such as studying, socializing, and sleeping playing a role. Current mobile sensing data highlight this intricate link using statistical and machine learning analyses. We propose a novel LLM agent‑based simulation framework to model student activities and mental health using the StudentLife Dataset. Each LLM agent was initialized with personality questionnaires and guided by smartphone sensing data throughout the simulated semester. These agents predict individual behaviors, provide self‑reported mental health data via ecological momentary assessments (EMAs), and complete follow‑up personality questionnaires. To ensure accuracy, we investigated various prompting techniques, memory systems, and activity‑based mental state management strategies that dynamically update an agent's mental state based on their daily activities. This simulation goes beyond simply replicating existing data. This allows us to explore new scenarios that are not present in the original dataset, such as peer influence through agent‑to‑agent interactions and the impact of social media. Furthermore, we can conduct intervention studies by manipulating activity patterns via sensing signals and personality traits using questionnaire responses. This provides valuable insights into the behavioral changes that could enhance student well‑being. The framework also facilitates hypothetical interviews with LLM agents, offering deeper insights into their mental health. This study showcases the power of LLM‑driven behavioral modeling with sensing data, opening new avenues for understanding and supporting student mental health.

PaperID: 9322, https://arxiv.org/pdf/2508.02632.pdf

Authors: Italo Napolitano, Stefano Covone, Andrea Lama, Francesco De Lellis, Mario di Bernardo

Title: Hierarchical Learning-Based Control for Multi-Agent Shepherding of Stochastic Autonomous Agents

Abstract:
Multi‑agent shepherding represents a challenging distributed control problem where herder agents must coordinate to guide independently moving targets to desired spatial configurations. Most existing control strategies assume cohesive target behavior, which frequently fails in practical applications where targets exhibit stochastic autonomous behavior. This paper presents a hierarchical learning‑based control architecture that decomposes the shepherding problem into a high‑level decision‑making module and a low‑level motion control component. The proposed distributed control system synthesizes effective control policies directly from closed‑loop experience without requiring explicit inter‑agent communication or prior knowledge of target dynamics. The decentralized architecture achieves cooperative control behavior through emergent coordination without centralized supervision. Experimental validation demonstrates superior closed‑loop performance compared to state‑of‑the‑art heuristic control methods, achieving 100% success rates with improved settling times and control efficiency. The control architecture scales beyond its design conditions, adapts to time‑varying goal regions, and demonstrates practical implementation feasibility through real‑time experiments on the Robotarium platform.

PaperID: 9323, https://arxiv.org/pdf/2508.02630.pdf

Authors: Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, Akshit Kumar

Title: What Is Your AI Agent Buying? Evaluation, Biases, Model Dependence, & Emerging Implications for Agentic E-Commerce

Abstract:
Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, AI agents can parse webpages or leverage APIs to view, evaluate and choose products. We investigate the behavior of AI agents using ACES, a provider‑agnostic framework for auditing agent decision‑making. We reveal that agents can exhibit choice homogeneity, often concentrating demand on a few ``modal'' products while ignoring others entirely. Yet, these preferences are unstable: model updates can drastically reshuffle market shares. Furthermore, randomized trials show that while agents have improved over time on simple tasks with a clearly identified best choice, they exhibit strong position biases ‑‑ varying across providers and model versions, and persisting even in text‑only "headless" interfaces ‑‑ undermining any universal notion of a ``top'' rank. Agents also consistently penalize sponsored tags while rewarding platform endorsements, and sensitivities to price, ratings, and reviews vary sharply across models. Finally, we demonstrate that sellers can respond: a seller‑side agent making simple, query‑conditional description tweaks can drive significant gains in market share. These findings reveal that agentic markets are volatile and fundamentally different from human‑centric commerce, highlighting the need for continuous auditing and raising questions for platform design, seller strategy and regulation.

PaperID: 9324, https://arxiv.org/pdf/2508.02611.pdf

Authors: Vali Tawosi, Salwa Alamir, Xiaomo Liu, Manuela Veloso

Title: Meta-RAG on Large Codebases Using Code Summarization

Abstract:
Large Language Model (LLM) systems have been at the forefront of applied Artificial Intelligence (AI) research in a multitude of domains. One such domain is software development, where researchers have pushed the automation of a number of code tasks through LLM agents. Software development is a complex ecosystem, that stretches far beyond code implementation and well into the realm of code maintenance. In this paper, we propose a multi‑agent system to localize bugs in large pre‑existing codebases using information retrieval and LLMs. Our system introduces a novel Retrieval Augmented Generation (RAG) approach, Meta‑RAG, where we utilize summaries to condense codebases by an average of 79.8%, into a compact, structured, natural language representation. We then use an LLM agent to determine which parts of the codebase are critical for bug resolution, i.e. bug localization. We demonstrate the usefulness of Meta‑RAG through evaluation with the SWE‑bench Lite dataset. Meta‑RAG scores 84.67 % and 53.0 % for file‑level and function‑level correct localization rates, respectively, achieving state‑of‑the‑art performance.

PaperID: 9325, https://arxiv.org/pdf/2508.02470.pdf

Authors: Hyunjn An, Yongwon Kim, Wonduk Seo, Joonil Park, Daye Kang, Changhoon Oh, Dokyun Kim, Seunghyun Lee

Title: AIAP: A No-Code Workflow Builder for Non-Experts with Natural Language and Multi-Agent Collaboration

Abstract:
While many tools are available for designing AI, non‑experts still face challenges in clearly expressing their intent and managing system complexity. We introduce AIAP, a no‑code platform that integrates natural language input with visual workflows. AIAP leverages a coordinated multi‑agent system to decompose ambiguous user instructions into modular, actionable steps, hidden from users behind a unified interface. A user study involving 32 participants showed that AIAP's AI‑generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improved participants' ability to develop services intuitively. These findings highlight that natural language‑based visual programming significantly reduces barriers and enhances user experience in AI service design.

PaperID: 9326, https://arxiv.org/pdf/2508.02421.pdf

Authors: Akshay Dodwadmath, Setareh Maghsudi

Title: Emergence of Fair Leaders via Mediators in Multi-Agent Reinforcement Learning

Abstract:
Stackelberg games and their resulting equilibria have received increasing attention in the multi‑agent reinforcement learning literature. Each stage of a traditional Stackelberg game involves a leader(s) acting first, followed by the followers. In situations where the roles of leader(s) and followers can be interchanged, the designated role can have considerable advantages, for example, in first‑mover advantage settings. Then the question arises: Who should be the leader and when? A bias in the leader selection process can lead to unfair outcomes. This problem is aggravated if the agents are self‑interested and care only about their goals and rewards. We formally define this leader selection problem and show its relation to fairness in agents' returns. Furthermore, we propose a multi‑agent reinforcement learning framework that maximizes fairness by integrating mediators. Mediators have previously been used in the simultaneous action setting with varying levels of control, such as directly performing agents' actions or just recommending them. Our framework integrates mediators in the Stackelberg setting with minimal control (leader selection). We show that the presence of mediators leads to self‑interested agents taking fair actions, resulting in higher overall fairness in agents' returns.

PaperID: 9327, https://arxiv.org/pdf/2508.02289.pdf

Authors: Tao He, Gangshan Jing

Title: Distributed Non-Uniform Scaling Control of Multi-Agent Formation via Matrix-Valued Constraints

Abstract:
Distributed formation maneuver control refers to the problem of maneuvering a group of agents to change their formation shape by adjusting the motions of partial agents, where the controller of each agent only requires local information measured from its neighbors. Although this problem has been extensively investigated, existing approaches are mostly limited to uniform scaling transformations. This article proposes a new type of local matrix‑valued constraints, via which non‑uniform scaling control of position formation can be achieved by tuning the positions of only two agents (i.e., leaders). Here, the non‑uniform scaling transformation refers to global scaling the position formation with different ratios along different orthogonal coordinate directions. Moreover, by defining scaling and translation of attitudes, we propose a distributed control scheme for scaling and translation maneuver control of joint position‑attitude formations. It is proven that the proposed controller achieves global convergence, provided that the sensing graph among agents is a 2‑rooted bidirectional graph. Compared with the affine formation maneuver control approach, the proposed approach leverages a sparser sensing graph, requires fewer leaders, and additionally enables scaling transformations of the attitude formation. A simulation example demonstrates our theoretical results.

PaperID: 9328, https://arxiv.org/pdf/2508.02190.pdf

Authors: Cui Miao, Tao Chang, Meihan Wu, Hongbin Xu, Chun Li, Ming Li, Xiaodong Wang

Title: FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

Abstract:
Vision‑language‑action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large‑scale user‑specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task‑aware representation learning, adaptive expert selection, and expert‑driven federated aggregation, enabling efficient and privacy‑preserving training of VLA models. Specifically, we introduce an Instruction Oriented Scene‑Parsing mechanism, which decomposes and enhances object‑level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture‑of‑Experts (DGMoE) mechanism, where not only input tokens but also self‑aware experts adaptively decide their activation. Finally, we propose an Expert‑Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross‑client knowledge transfer.Extensive simulations and real‑world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.

PaperID: 9329, https://arxiv.org/pdf/2508.02153.pdf

Authors: Zebin Duan, Frederik Hagelskjær, Aljaz Kramberger, Juan Heredia, Norbert Krüger

Title: Towards High Precision: An Adaptive Self-Supervised Learning Framework for Force-Based Verification

Abstract:
The automation of robotic tasks requires high precision and adaptability, particularly in force‑based operations such as insertions. Traditional learning‑based approaches either rely on static datasets, which limit their ability to generalize, or require frequent manual intervention to maintain good performances. As a result, ensuring long‑term reliability without human supervision remains a significant challenge. To address this, we propose an adaptive self‑supervised learning framework for insertion classification that continuously improves its precision over time. The framework operates in real‑time, incrementally refining its classification decisions by integrating newly acquired force data. Unlike conventional methods, it does not rely on pre‑collected datasets but instead evolves dynamically with each task execution. Through real‑world experiments, we demonstrate how the system progressively reduces execution time while maintaining near‑perfect precision as more samples are processed. This adaptability ensures long‑term reliability in force‑based robotic tasks while minimizing the need for manual intervention.

PaperID: 9330, https://arxiv.org/pdf/2508.02076.pdf

Authors: Yunhao Liang, Yuan Qu, Jingyuan Yang, Shaochong Lin, Zuo-Jun Max Shen

Title: Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games

Abstract:
Coordinating multiple large language models (LLMs) to solve complex tasks collaboratively poses a fundamental trade‑off between the computation costs and collective performance compared with individual model. We introduce a novel, game‑theoretically grounded reinforcement learning (RL) framework, the Multi‑Agent Cooperation Sequential Public Goods Game (MAC‑SPGG), to systematically incentivize cooperation in multi‑LLM ensembles. In MAC‑SPGG, LLM agents move in sequence, observing predecessors' outputs and updating beliefs to condition their own contributions. By redesigning the public‑goods reward, effortful contributions become the unique Subgame Perfect Nash Equilibrium (SPNE), which eliminates free‑riding under traditional SPGG or PGG. Its sequential protocol replaces costly round‑based information exchanges with a streamlined decision flow, cutting communication overhead while retaining strategic depth. We prove the existence and uniqueness of the SPNE under realistic parameters, and empirically show that MAC‑SPGG‑trained ensembles outperform single‑agent baselines, chain‑of‑thought prompting, and other cooperative methods, even achieving comparable performance to large‑scale models across reasoning, math, code generation, and NLP tasks. Our results highlight the power of structured, incentive‑aligned MAC‑SPGG cooperation for scalable and robust multi‑agent language generation.

PaperID: 9331, https://arxiv.org/pdf/2508.02027.pdf

Authors: Xinzheng Wu, Junyi Chen, Shaolingfeng Ye, Wei Jiang, Yong Shen

Title: An Evolving Scenario Generation Method based on Dual-modal Driver Model Trained by Multi-Agent Reinforcement Learning

Abstract:
In the autonomous driving testing methods based on evolving scenarios, the construction method of the driver model, which determines the driving maneuvers of background vehicles (BVs) in the scenario, plays a critical role in generating safety‑critical scenarios. In particular, the cooperative adversarial driving characteristics between BVs can contribute to the efficient generation of safety‑critical scenarios with high testing value. In this paper, a multi‑agent reinforcement learning (MARL) method is used to train and generate a dual‑modal driver model (Dual‑DM) with non‑adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to generate complex, diverse and strong interactive safety‑critical scenarios through evolving scenario generation method. After that, the generated evolving scenarios are evaluated in terms of fidelity, test efficiency, complexity and diversity. Results show that without performance degradation in scenario fidelity (>85% similarity to real‑world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual‑DM achieves a substantial enhancement in the efficiency of generating safety‑critical scenarios (efficiency metric: 0.86, +195% over two baselines). Furthermore, statistical analysis and case studies demonstrate the diversity of safety‑critical evolving scenarios generated by Dual‑DM in terms of the adversarial interaction patterns. Therefore, Dual‑DM can greatly improve the performance of the generation of safety‑critical scenarios through evolving scenario generation method.

PaperID: 9332, https://arxiv.org/pdf/2508.02015.pdf

Authors: Gang Wang, Hongfang Han, Xiaowei Liu, Hanfeng Jiang, Ming Zhang

Title: A Group Consensus-Driven Auction Algorithm for Cooperative Task Allocation Among Heterogeneous Multi-Agents

Abstract:
In scenarios like automated warehouses, assigning tasks to robots presents a heterogeneous multi‑task and multi‑agent task allocation problem. However, existing task allocation study ignores the integration of multi‑task and multi‑attribute agent task allocation with heterogeneous task allocation. In addition, current algorithms are limited by scenario constraints and can incur significant errors in specific contexts. Therefore, this study proposes a distributed heterogeneous multi‑task and multi‑agent task allocation algorithm with a time window, called group consensus‑based heterogeneous auction (GCBHA). Firstly, this method decomposes tasks that exceed the capability of a single Agent into subtasks that can be completed by multiple independent agents. And then groups similar or adjacent tasks through a heuristic clustering method to reduce the time required to reach a consensus. Subsequently, the task groups are allocated to agents that meet the conditions through an auction process. Furthermore, the method evaluates the task path cost distance based on the scenario, which can calculate the task cost more accurately. The experimental results demonstrate that GCBHA performs well in terms of task allocation time and solution quality, with a significant reduction in the error rate between predicted task costs and actual costs.

PaperID: 9333, https://arxiv.org/pdf/2508.01956.pdf

Authors: Jiayi Wang, Jacqueline Jil Vallon, Nikhil V. Kotha, Neil Panjwani, Xi Ling, Margaret Redfield, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati

Title: Scaling Clinician-Grade Feature Generation from Clinical Notes with Multi-Agent Language Models

Abstract:
Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual, unscalable clinical abstraction. In this study, we first established a rigorous patient‑level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. As a high‑fidelity ground truth, this labor‑intensive process provided the blueprint for SNOW (Scalable Note‑to‑Outcome Workflow), a transparent multi‑agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. On 5‑year cancer recurrence prediction, SNOW (AUC‑ROC 0.767) achieved performance comparable to manual CFG (0.762) and outperformed structured baselines, clinician‑guided LLM extraction, and six representational feature generation (RFG) approaches. Once configured, SNOW produced the full patient‑level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort by approximately 48‑fold versus manual CFG. To test scalability where manual CFG is infeasible, we deployed SNOW on an external heart failure with preserved ejection fraction (HFpEF) cohort from MIMIC‑IV (n=2,084); without task‑specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30‑day (SNOW: 0.851) and 1‑year (SNOW: 0.763) mortality prediction. These results demonstrate that a modular LLM agent‑based system can scale expert‑level feature generation from clinical notes, while enabling interpretable use of unstructured EHR text in outcome prediction and preserving generalizability across a variety of settings and conditions.

PaperID: 9334, https://arxiv.org/pdf/2508.01888.pdf

Authors: Navneet Verma, Ying Xie

Title: Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain

Abstract:
The increasing penetration of renewable energy sources in day‑ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state‑of‑the‑art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day‑ahead energy markets. We introduce a comprehensive framework that employs RL agent for multi‑objective energy optimization and blockchain for tamper‑proof data and transaction management. Simulations using real‑world data from the Electricity Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand‑supply balancing within 2% and maintains near‑optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand‑based blockchain, ensuring transparency, auditability, and security ‑ key enablers for trustworthy multi‑agent energy trading. Our contributions include a novel system architecture, curriculum learning for robust agent development, and actionable policy insights for practical deployment.

PaperID: 9335, https://arxiv.org/pdf/2508.01746.pdf

Authors: Shiyang Duan, Yuan Tian, Qi Bing, Xiaowei Shao

Title: Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization

Abstract:
The exponential growth of scientific knowledge has made the automated generation of scientific hypotheses that combine novelty, feasibility, and research value a core challenge. Existing methods based on large language models fail to systematically model the inherent in hypotheses or incorporate the closed‑loop feedback mechanisms crucial for refinement. This paper proposes a multi‑agent collaborative framework called HypoAgents, which for the first time integrates Bayesian reasoning with an information entropy‑driven search mechanism across three stages‑hypotheses generation, evidence validation, and hypotheses Refinement‑to construct an iterative closed‑loop simulating scientists' cognitive processes. Specifically, the framework first generates an initial set of hypotheses through diversity sampling and establishes prior beliefs based on a composite novelty‑relevance‑feasibility (N‑R‑F) score. It then employs etrieval‑augmented generation (RAG) to gather external literature evidence, updating the posterior probabilities of hypotheses using Bayes' theorem. Finally, it identifies high‑uncertainty hypotheses using information entropy H = ‑ \sum p_i\log p_i and actively refines them, guiding the iterative optimization of the hypothesis set toward higher quality and confidence. Experimental results on the ICLR 2025 conference real‑world research question dataset (100 research questions) show that after 12 optimization iterations, the average ELO score of generated hypotheses improves by 116.3, surpassing the benchmark of real paper abstracts by 17.8, while the framework's overall uncertainty, as measured by Shannon entropy, decreases significantly by 0.92. This study presents an interpretable probabilistic reasoning framework for automated scientific discovery, substantially improving the quality and reliability of machine‑generated research hypotheses.

PaperID: 9336, https://arxiv.org/pdf/2508.01623.pdf

Authors: Tadisetty Sai Yashwanth, Dhatri C

Title: A Multi-Agent Pokemon Tournament for Evaluating Strategic Reasoning of Large Language Models

Abstract:
This research presents LLM Pokemon League, a competitive tournament system that leverages Large Language Models (LLMs) as intelligent agents to simulate strategic decision‑making in Pokémon battles. The platform is designed to analyze and compare the reasoning, adaptability, and tactical depth exhibited by different LLMs in a type‑based, turn‑based combat environment. By structuring the competition as a single‑elimination tournament involving diverse AI trainers, the system captures detailed decision logs, including team‑building rationale, action selection strategies, and switching decisions. The project enables rich exploration into comparative AI behavior, battle psychology, and meta‑strategy development in constrained, rule‑based game environments. Through this system, we investigate how modern LLMs understand, adapt, and optimize decisions under uncertainty, making Pokémon League a novel benchmark for AI research in strategic reasoning and competitive learning.

PaperID: 9337, https://arxiv.org/pdf/2508.01545.pdf

Authors: Emilio Barkett, Olivia Long, Paul Kröger

Title: Getting out of the Big-Muddy: Escalation of Commitment in LLMs

Abstract:
Large Language Models (LLMs) are increasingly deployed in autonomous decision‑making roles across high‑stakes domains. However, since models are trained on human‑generated data, they may inherit cognitive biases that systematically distort human judgment, including escalation of commitment, where decision‑makers continue investing in failing courses of action due to prior investment. Understanding when LLMs exhibit such biases presents a unique challenge. While these biases are well‑documented in humans, it remains unclear whether they manifest consistently in LLMs or require specific triggering conditions. This paper investigates this question using a two‑stage investment task across four experimental conditions: model as investor, model as advisor, multi‑agent deliberation, and compound pressure scenario. Across N = 6,500 trials, we find that bias manifestation in LLMs is highly context‑dependent. In individual decision‑making contexts (Studies 1‑2, N = 4,000), LLMs demonstrate strong rational cost‑benefit logic with minimal escalation of commitment. However, multi‑agent deliberation reveals a striking hierarchy effect (Study 3, N = 500): while asymmetrical hierarchies show moderate escalation rates (46.2%), symmetrical peer‑based decision‑making produces near‑universal escalation (99.2%). Similarly, when subjected to compound organizational and personal pressures (Study 4, N = 2,000), models exhibit high degrees of escalation of commitment (68.95% average allocation to failing divisions). These findings reveal that LLM bias manifestation depends critically on social and organizational context rather than being inherent, with significant implications for the deployment of multi‑agent systems and unsupervised operations where such conditions may emerge naturally.

PaperID: 9338, https://arxiv.org/pdf/2508.01531.pdf

Authors: Mansura Habiba, Nafiul I. Khan

Title: Revisiting Gossip Protocols: A Vision for Emergent Coordination in Agentic Multi-Agent Systems

Abstract:
As agentic platforms scale, agents are evolving beyond static roles and fixed toolchains, creating a growing need for flexible, decentralized coordination. Today's structured communication protocols (e.g., direct agent‑to‑agent messaging) excel at reliability and task delegation, but they fall short in enabling emergent, swarm‑like intelligence, where distributed agents continuously learn, adapt, and communicate to form collective cognition. This paper revisits gossip protocols, long valued in distributed systems for their fault tolerance and decentralization, and argues that they offer a missing layer for context‑rich, adaptive communication in agentic AI. Gossip enables scalable, low‑overhead dissemination of shared knowledge, but also raises unresolved challenges around semantic filtering, staleness, trustworthiness, and consistency in high‑stakes environments. Rather than proposing a new framework, this work charts a research agenda for integrating gossip as a complementary substrate alongside structured protocols. We identify critical gaps in current agent‑to‑agent architectures, highlight where gossip could reshape assumptions about coordination, and outline open questions around intent propagation, knowledge decay, and peer‑to‑peer trust. Gossip is not a silver bullet, but overlooking it risks missing a key path toward resilient, reflexive, and self‑organizing multi‑agent systems.

PaperID: 9339, https://arxiv.org/pdf/2508.01522.pdf

Authors: Jack Zeng, Andreu Matoses Gimenez, Eugene Vinitsky, Javier Alonso-Mora, Sihao Sun

Title: Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning

Abstract:
This paper presents the first decentralized method to enable real‑world 6‑DoF manipulation of a cable‑suspended load using a team of Micro‑Aerial Vehicles (MAVs). Our method leverages multi‑agent reinforcement learning (MARL) to train an outer‑loop control policy for each MAV. Unlike state‑of‑the‑art controllers that utilize a centralized scheme, our policy does not require global states, inter‑MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low‑level controller, enables reliable sim‑to‑real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real‑world experiments, including full‑pose control under load model uncertainties, showing setpoint tracking performance comparable to the state‑of‑the‑art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in‑flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial‑manipulation‑marl

PaperID: 9340, https://arxiv.org/pdf/2508.01495.pdf

Authors: Jingtian Yan, Stephen F. Smith, Jiaoyang Li

Title: WinkTPG: An Execution Framework for Multi-Agent Path Finding Using Temporal Reasoning

Abstract:
Planning collision‑free paths for a large group of agents is a challenging problem in many real‑world applications. While recent advances in Multi‑Agent Path Finding (MAPF) have shown promising progress, standard MAPF planners continue to rely on simplified kinodynamic models, preventing agents from directly following the generated MAPF plan. To bridge this gap, we propose kinodynamic Temporal Plan Graph planning (kTPG), a multi‑agent speed optimization algorithm that efficiently refines a MAPF plan into a set of kinodynamically feasible speed profiles. We further incorporate execution timing uncertainty models and provide deterministic guarantees under bounded uncertainty models and probabilistic guarantees under stochastic models. Building on kTPG, we propose Windowed kTPG (WinkTPG), a MAPF execution framework that incrementally refines MAPF plans using a window‑based mechanism, dynamically incorporating agent information during execution to reduce uncertainty. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents within 1 second and improves solution quality by up to 51.7% over existing MAPF execution methods. We further validate WinkTPG in high‑fidelity physics‑based simulation and on real‑world robots.

PaperID: 9341, https://arxiv.org/pdf/2508.01451.pdf

Authors: Mohammed Sayagh, Mohammad Ghafari

Title: Think Broad, Act Narrow: CWE Identification with Multi-Agent Large Language Models

Abstract:
Machine learning and Large language models (LLMs) for vulnerability detection has received significant attention in recent years. Unfortunately, state‑of‑the‑art techniques show that LLMs are unsuccessful in even distinguishing the vulnerable function from its benign counterpart, due to three main problems: Vulnerability detection requires deep analysis, which LLMs often struggle with when making a one‑shot prediction. Existing techniques typically perform function‑level analysis, whereas effective vulnerability detection requires contextual information beyond the function scope. The focus on binary classification can result in identifying a vulnerability but associating it with the wrong security weaknesses (CWE), which may mislead developers. We propose a novel multi‑agent LLM approach to address the challenges of identifying CWEs. This approach consists of three steps: (1) a team of LLM agents performs an exhaustive search for potential CWEs in the function under review, (2) another team of agents identifies relevant external context to support or refute each candidate CWE, and (3) a final agent makes informed acceptance or rejection decisions for each CWE based on the gathered context. A preliminary evaluation of our approach shows promising results. In the PrimeVul dataset, Step 1 correctly identifies the appropriate CWE in 40.9% of the studied vulnerable functions. We further evaluated the full pipeline on ten synthetic programs and found that incorporating context information significantly reduced false positives from 6 to 9 CWEs to just 1 to 2, while still correctly identifying the true CWE in 9 out of 10 cases.

PaperID: 9342, https://arxiv.org/pdf/2508.01445.pdf

Authors: Luke Snow, Vikram Krishnamurthy

Title: Multi-Agent Inverse Learning for Sensor Networks: Identifying Coordination in UAV Networks

Abstract:
Suppose there is an adversarial UAV network being tracked by a radar. How can the radar determine whether the UAVs are coordinating, in some well‑defined sense? How can the radar infer the objectives of the individual UAVs and the network as a whole? We present an abstract interpretation of such a strategic interaction, allowing us to conceptualize coordination as a linearly constrained multi‑objective optimization problem. Then, we present some tools from microeconomic theory that allow us to detect coordination and reconstruct individual UAV objective functions, from radar tracking signals. This corresponds to performing inverse multi‑objective optimization. We present details for how the abstract microeconomic interpretation corresponds to, and naturally arises from, physical‑layer radar waveform modulation and multi‑target filtering. This article serves as a tutorial, bringing together concepts from several established research contributions in an expository style.

PaperID: 9343, https://arxiv.org/pdf/2508.01370.pdf

Authors: Roman Koshkin, Pengyu Dai, Nozomi Fujikawa, Masahito Togami, Marco Visentini-Scarzanella

Title: MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis

Abstract:
We present an autonomous framework that leverages Large Language Models (LLMs) to automate end‑to‑end business analysis and market report generation. At its core, the system employs specialized agents ‑ Researcher, Reviewer, Writer, and Retriever ‑ that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants' presentation materials at Amazon through in‑context learning to replicate professional analytical methodologies. The framework executes a multi‑step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM‑based evaluation system for assessing report quality, which shows alignment with expert human evaluations. Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants' unstructured knowledge. In experimental validation, our framework generates detailed 6‑page reports in 7 minutes at a cost of approximately \1. Our work could be an important step to automatically create affordable market insights.

PaperID: 9344, https://arxiv.org/pdf/2508.01332.pdf

Authors: Zhenhua Zou, Zhuotao Liu, Lepeng Zhao, Qiuyang Zhan

Title: BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability

Abstract:
The rapid adoption of agentic AI, powered by large language models (LLMs), is transforming enterprise ecosystems with autonomous agents that execute complex workflows. Yet we observe several key security vulnerabilities in LLM‑driven multi‑agent systems (MASes): fragmented identity frameworks, insecure communication channels, and inadequate defenses against Byzantine agents or adversarial prompts. In this paper, we present the first systematic analysis of these emerging multi‑agent risks and explain why the legacy security strategies cannot effectively address these risks. Afterwards, we propose BlockA2A, the first unified multi‑agent trust framework that enables secure and verifiable and agent‑to‑agent interoperability. At a high level, BlockA2A adopts decentralized identifiers (DIDs) to enable fine‑grained cross‑domain agent authentication, blockchain‑anchored ledgers to enable immutable auditability, and smart contracts to dynamically enforce context‑aware access control policies. BlockA2A eliminates centralized trust bottlenecks, ensures message authenticity and execution integrity, and guarantees accountability across agent interactions. Furthermore, we propose a Defense Orchestration Engine (DOE) that actively neutralizes attacks through real‑time mechanisms, including Byzantine agent flagging, reactive execution halting, and instant permission revocation. Empirical evaluations demonstrate BlockA2A's effectiveness in neutralizing prompt‑based, communication‑based, behavioral and systemic MAS attacks. We formalize its integration into existing MAS and showcase a practical implementation for Google's A2A protocol. Experiments confirm that BlockA2A and DOE operate with sub‑second overhead, enabling scalable deployment in production LLM‑based MAS environments.

PaperID: 9345, https://arxiv.org/pdf/2508.01293.pdf

Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci

Title: GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Abstract:
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision‑language models (VLMs) into MIL pipelines to incorporate medical knowledge through text‑based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed‑length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine‑grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision‑language MIL framework with two key contributions: (1) A grounded multi‑agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine‑grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM‑MIL pipeline, our approach shows improved performance over single‑prompt class baselines and achieves results comparable to state‑of‑the‑art models, as demonstrated on renal and lung cancer datasets.

PaperID: 9346, https://arxiv.org/pdf/2508.01285.pdf

Authors: Yujing Ke, Kevin George, Kathan Pandya, David Blumenthal, Maximilian Sprang, Gerrit Großmann, Sebastian Vollmer, David Antony Selby

Title: BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation

Abstract:
Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence‑grounded hypotheses, lack robust iterative refinement and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi‑agent framework that draws upon language model‑based reasoning and a dual‑mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley‑Terry paired comparison model to provide statistically‑grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations and generalist biomedical agents. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code.

PaperID: 9347, https://arxiv.org/pdf/2508.01249.pdf

Authors: Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, Ye Wu

Title: AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection

Abstract:
Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non‑transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose a novel insight that treats the agent runtime traces as structured programs with analyzable semantics. Thus, we present AgentArmor, a program analysis framework that converts agent traces into graph intermediate representation‑based structured program dependency representations (e.g., CFG, DFG, and PDG) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent's runtime traces as graph‑based intermediate representations with control and data flow described within; (2) a property registry that attaches security‑relevant metadata of interacted tools \& data, and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis for sensitive data flow, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark, the results show that AgentArmor can reduce the ASR to 3%, with the utility drop only 1%.

PaperID: 9348, https://arxiv.org/pdf/2508.01237.pdf

Authors: Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li

Title: SketchAgent: Generating Structured Diagrams from Hand-Drawn Sketches

Abstract:
Hand‑drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine‑readable diagrams remains a labor‑intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi‑agent system designed to automate the transformation of hand‑drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high‑quality examples with token‑level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine‑readable diagram generation. The benchmark is released at https://huggingface.co/datasets/DiagramAgent/Sketch2Diagram‑Benchmark.

PaperID: 9349, https://arxiv.org/pdf/2508.01222.pdf

Authors: Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

Title: WebDS: An End-to-End Benchmark for Web-based Data Science

Abstract:
Many real‑world data science tasks involve complex web‑based interactions: finding appropriate data available on the internet, synthesizing multimodal data from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions and often do not require diverse tool‑using capabilities. Conversely, traditional data science benchmarks typically concentrate on static, highly structured datasets and do not assess end‑to‑end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end‑to‑end web‑based data science benchmark. It comprises 870 web‑based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi‑step, tool‑based operations, across heterogeneous data formats, to better reflect the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on WebVoyager, completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes, such as poor information grounding, repetitive behavior and shortcut‑taking that agents performing WebDS's tasks display. By contrast, humans achieve around 90% accuracy, highlighting a substantial gap between current agents and human performance. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM‑based data science.

PaperID: 9350, https://arxiv.org/pdf/2508.01186.pdf

Authors: Chaojia Yu, Zihan Cheng, Hanwen Cui, Yishuo Gao, Zexu Luo, Yijin Wang, Hangbin Zheng, Yong Zhao

Title: A Survey on Agent Workflow -- Status and Future

Abstract:
In the age of large language models (LLMs), autonomous agents have emerged as a powerful paradigm for achieving general intelligence. These agents dynamically leverage tools, memory, and reasoning capabilities to accomplish user‑defined goals. As agent systems grow in complexity, agent workflows‑structured orchestration frameworks‑have become central to enabling scalable, controllable, and secure AI behaviors. This survey provides a comprehensive review of agent workflow systems, spanning academic frameworks and industrial implementations. We classify existing systems along two key dimensions: functional capabilities (e.g., planning, multi‑agent collaboration, external API integration) and architectural features (e.g., agent roles, orchestration flows, specification languages). By comparing over 20 representative systems, we highlight common patterns, potential technical challenges, and emerging trends. We further address concerns related to workflow optimization strategies and security. Finally, we outline open problems such as standardization and multimodal integration, offering insights for future research at the intersection of agent design, workflow infrastructure, and safe automation.

PaperID: 9351, https://arxiv.org/pdf/2508.01173.pdf

Authors: Jiayi Chen, Jing Li, Guiling Wang

Title: MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management

Abstract:
Reinforcement Learning (RL) has shown significant promise in automated portfolio management; however, effectively balancing risk and return remains a central challenge, as many models fail to adapt to dynamically changing market conditions. We propose Meta‑controlled Agents for a Risk‑aware System (MARS), a novel framework addressing this through a multi‑agent, risk‑aware approach. MARS replaces monolithic models with a Heterogeneous Agent Ensemble, where each agent's unique risk profile is enforced by a Safety‑Critic network to span behaviors from capital preservation to aggressive growth. A high‑level Meta‑Adaptive Controller (MAC) dynamically orchestrates this ensemble, shifting reliance between conservative and aggressive agents to minimize drawdown during downturns while seizing opportunities in bull markets. This two‑tiered structure leverages behavioral diversity rather than explicit feature engineering to ensure a disciplined portfolio robust across market regimes. Experiments on major international indexes confirm that our framework significantly reduces maximum drawdown and volatility while maintaining competitive returns.

PaperID: 9352, https://arxiv.org/pdf/2508.01169.pdf

Authors: Chengzhang Zhu, Cecile H. Sam, Yanlai Wu, Ying Tang

Title: WIP: Enhancing Game-Based Learning with AI-Driven Peer Agents

Abstract:
This work‑in‑progress paper presents SPARC (Systematic Problem Solving and Algorithmic Reasoning for Children), a gamified learning platform designed to enhance engagement and knowledge retention in K‑12 STEM education. Traditional approaches often struggle to motivate students or facilitate deep understanding, especially for complex scientific concepts. SPARC addresses these challenges by integrating interactive, narrative‑driven gameplay with an artificial intelligence peer agent built on large language models. Rather than simply providing answers, the agent engages students in dialogue and inquiry, prompting them to explain concepts and solve problems collaboratively. The platform's design is grounded in educational theory and closely aligned with state learning standards. Initial classroom pilots utilized a multi‑method assessment framework combining pre‑ and post‑tests, in‑game analytics, and qualitative feedback from students and teachers. Preliminary findings indicate that SPARC significantly increases student engagement, with most participants reporting greater interest in STEM subjects and moderate gains in conceptual understanding observed in post‑test results. Ongoing development focuses on refining the AI agent, expanding curriculum integration, and improving accessibility. These early results demonstrate the potential of combining AI‑driven peer support with game‑based learning to create inclusive, effective, and engaging educational experiences for K‑12 learners.

PaperID: 9353, https://arxiv.org/pdf/2508.01159.pdf

Authors: Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen

Title: Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Abstract:
This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert‑crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models ‑‑ including o3, GPT‑4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro ‑‑ for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi‑agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative‑driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real‑world physician communication.

PaperID: 9354, https://arxiv.org/pdf/2508.01109.pdf

Authors: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud

Title: Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Abstract:
We investigate whether socio‑economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet‑sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM‑generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image‑text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision‑only baselines in wealth prediction (e.g., R‑squared of 0.77 vs. 0.63 on out‑of‑sample splits), with LLM‑internal knowledge proving more effective than agent‑retrieved text, improving robustness to out‑of‑country and out‑of‑time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well‑being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM‑only text outperforms agent‑retrieved data, challenging our Agent‑Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent‑gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large‑scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM‑generated descriptions, and agent‑retrieved texts.

PaperID: 9355, https://arxiv.org/pdf/2508.01060.pdf

Authors: Ibrahim Althamary, Chen-Fu Chou, Chih-Wei Huang

Title: Connectivity Management in Satellite-Aided Vehicular Networks with Multi-Head Attention-Based State Estimation

Abstract:
Managing connectivity in integrated satellite‑terrestrial vehicular networks is critical for 6G, yet is challenged by dynamic conditions and partial observability. This letter introduces the Multi‑Agent Actor‑Critic with Satellite‑Aided Multi‑head self‑attention (MAAC‑SAM), a novel multi‑agent reinforcement learning framework that enables vehicles to autonomously manage connectivity across Vehicle‑to‑Satellite (V2S), Vehicle‑to‑Infrastructure (V2I), and Vehicle‑to‑Vehicle (V2V) links. Our key innovation is the integration of a multi‑head attention mechanism, which allows for robust state estimation even with fluctuating and limited information sharing among vehicles. The framework further leverages self‑imitation learning (SIL) and fingerprinting to improve learning efficiency and real‑time decisions. Simulation results, based on realistic SUMO traffic models and 3GPP‑compliant configurations, demonstrate that MAAC‑SAM outperforms state‑of‑the‑art terrestrial and satellite‑assisted baselines by up to 14% in transmission utility and maintains high estimation accuracy across varying vehicle densities and sharing levels.

PaperID: 9356, https://arxiv.org/pdf/2508.01049.pdf

Authors: Nicholas E. Corrado, Josiah P. Hanna

Title: Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies

Abstract:
Independent on‑policy policy gradient algorithms are widely used for multi‑agent reinforcement learning (MARL) in cooperative and no‑conflict games, but they are known to converge sub‑optimally when each agent's individual policy gradient points away from an optimal joint equilibrium. Going beyond prior work, we observe that sub‑optimal convergence can still arise even when the expected individual policy gradients of each agent point toward the optimal joint solution. After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on‑policy distribution. This sampling error w.r.t. the joint on‑policy distribution produces inaccurate gradient estimates that can make agents converge sub‑optimally. We hypothesize that joint sampling error can be reduced through coordinated action selection and that doing so will increase the reliability of policy gradient learning in MARL (i.e., the probability of converging to an optimal joint policy). To test this hypothesis, we first introduce an adaptive action sampling approach to reduce joint sampling error in the Centralized Training with Decentralized Execution setting. Our method, Cooperative Sampling Error Reduction (CoSER), continually adapts a centralized behavior policy to place higher probability on joint actions that are under‑sampled w.r.t. the current joint policy. We then empirically evaluate CoSER on a diverse set of multi‑agent games and demonstrate that (1) CoSER reduces joint sampling error more efficiently than independent on‑policy sampling and (2) this reduction increases the reliability of independent policy gradient algorithms.

PaperID: 9357, https://arxiv.org/pdf/2508.01012.pdf

Authors: Yiyi Lu, Hoi Ian Au, Junyao Zhang, Jingyu Pan, Guanglei Zhou, Yiting Wang, Jingwei Sun, Ang Li, Jianyi Zhang, Hai Li, Yiran Chen

Title: AutoEDA: Enabling EDA Flow Automation through Microservice-Based LLM Agents

Abstract:
Electronic Design Automation (EDA) remains heavily reliant on tool command language (Tcl) scripting to drive complex RTL‑to‑GDSII flows. This scripting‑based paradigm is labor‑intensive, error‑prone, and difficult to scale across large design projects. Recent advances in large language models (LLMs) suggest a new paradigm of natural language‑driven automation. However, existing EDA efforts remain limited and face key challenges, including the absence of standardized interaction protocols and dependence on external APIs that introduce privacy risks. We present AutoEDA, a framework that leverages the Model Context Protocol (MCP) to enable end‑to‑end natural language control of RTL‑to‑GDSII design flows. AutoEDA introduces MCP‑based servers for task decomposition, tool selection, and automated error handling, ensuring robust interaction between LLM agents and EDA tools. To enhance reliability and confidentiality, we integrate locally fine‑tuned LLM agents. We further contribute a benchmark generation pipeline for diverse EDA scenarios and extend CodeBLEU with Tcl‑specific enhancements for domain‑aware evaluation. Together, these contributions establish a comprehensive framework for LLM‑driven EDA automation, bridging natural language interfaces with modern chip design flows. Empirical results show that AutoEDA achieves up to 9.9 times higher accuracy than naive approaches while reducing token usage by approximately 97% compared to in‑context learning.

PaperID: 9358, https://arxiv.org/pdf/2508.00967.pdf

Authors: Massoud Pourmandi

Title: Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF

Abstract:
The proposal introduces an innovative drone swarm perception system that aims to solve problems related to computational limitations and low‑bandwidth communication, and real‑time scene reconstruction. The framework enables efficient multi‑agent 3D/4D scene synthesis through federated learning of shared diffusion model and YOLOv12 lightweight semantic extraction and local NeRF updates while maintaining privacy and scalability. The framework redesigns generative diffusion models for joint scene reconstruction, and improves cooperative scene understanding, while adding semantic‑aware compression protocols. The approach can be validated through simulations and potential real‑world deployment on drone testbeds, positioning it as a disruptive advancement in multi‑agent AI for autonomous systems.

PaperID: 9359, https://arxiv.org/pdf/2508.00938.pdf

Authors: Ziye Jia, Sijie He, Qiuming Zhu, Wei Wang, Qihui Wu, Zhu Han

Title: Trusted Routing for Blockchain-Empowered UAV Networks via Multi-Agent Deep Reinforcement Learning

Abstract:
Due to the high flexibility and versatility, unmanned aerial vehicles (UAVs) are leveraged in various fields including surveillance and disaster rescue.However, in UAV networks, routing is vulnerable to malicious damage due to distributed topologies and high dynamics. Hence, ensuring the routing security of UAV networks is challenging. In this paper, we characterize the routing process in a time‑varying UAV network with malicious nodes. Specifically, we formulate the routing problem to minimize the total delay, which is an integer linear programming and intractable to solve. Then, to tackle the network security issue, a blockchain‑based trust management mechanism (BTMM) is designed to dynamically evaluate trust values and identify low‑trust UAVs. To improve traditional practical Byzantine fault tolerance algorithms in the blockchain, we propose a consensus UAV update mechanism. Besides, considering the local observability, the routing problem is reformulated into a decentralized partially observable Markov decision process. Further, a multi‑agent double deep Q‑network based routing algorithm is designed to minimize the total delay. Finally, simulations are conducted with attacked UAVs and numerical results show that the delay of the proposed mechanism decreases by 13.39%, 12.74%, and 16.6% than multi‑agent proximal policy optimal algorithms, multi‑agent deep Q‑network algorithms, and methods without BTMM, respectively.

PaperID: 9360, https://arxiv.org/pdf/2508.00917.pdf

Authors: Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

Title: A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

Abstract:
Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as object detection, semantic segmentation, depth estimation, trajectory prediction, motion prediction, and behaviour prediction, to ensure safe and reliable navigation in complex environments. Vehicle‑to‑everything (V2X) communication enables cooperative driving among CAVs, thereby mitigating the limitations of individual sensors, reducing occlusions, and improving perception over long distances. Traditionally, these tasks are addressed using distinct models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real‑time performance. Multi‑task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a single unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focused on MTL in the context of CAVs. We begin with an overview of CAVs and MTL to provide foundational background. We then explore the application of MTL across key functional modules, including perception, prediction, planning, control, and multi‑agent collaboration. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide directions for future research aimed at advancing MTL methodologies for CAV systems.

PaperID: 9361, https://arxiv.org/pdf/2508.00890.pdf

Authors: Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang

Title: AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

Abstract:
Test‑time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single‑stage tasks; while many real‑world problems are multi‑stage complex tasks, composed of a sequence of heterogeneous subtasks with each subtask requires LLM of specific capability. Therefore, we study a novel problem: the test‑time compute‑optimal scaling in multi‑stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi‑stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute‑force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute‑optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi‑stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM‑agent‑based framework that autonomously searches for compute‑optimal allocations through iterative feedback‑driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM‑based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.

PaperID: 9362, https://arxiv.org/pdf/2508.00828.pdf

Authors: Antoine Bigeard, Langston Nashold, Rayan Krishnan, Shirley Wu

Title: Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks

Abstract:
Artificial Intelligence (AI) technology has emerged as a transformative force in financial analysis and the finance industry, though significant questions remain about the full capabilities of Large Language Model (LLM) agents in this domain. We present the Finance Agent Benchmark, featuring challenging and diverse real‑world finance research problems that require LLMs to perform complex analysis using recent SEC filings. We construct the benchmark using a taxonomy of nine financial task categories, developed in consultation with experts from banks, hedge funds, and private equity firms. The dataset includes 537 expert‑authored questions covering tasks from information retrieval to complex financial modeling, each validated through a rigorous review process to ensure accuracy and relevance. Moreover, we implement an agentic harness that equips LLMs with tools sufficient to produce accurate responses, including Google Search and EDGAR database access. Overall, the Finance Agent Benchmark provides a comprehensive testbed for measuring the progress of LLM‑driven finance agents. Our evaluation reveals significant limitations in current AI capabilities ‑ even the best‑performing model (OpenAI o3) achieved only 46.8% accuracy at an average cost of 3.79 per query. This underscores the need for further advancements before reliable deployment in high‑stakes finance settings.

PaperID: 9363, https://arxiv.org/pdf/2508.00724.pdf

Authors: Boyu Li, Zhengchen Li, Weimin Wu, Mengchu Zhou

Title: Petri Net Modeling and Deadlock-Free Scheduling of Attachable Heterogeneous AGV Systems

Abstract:
The increasing demand for automation and flexibility drives the widespread adoption of heterogeneous automated guided vehicles (AGVs). This work intends to investigate a new scheduling problem in a material transportation system consisting of attachable heterogeneous AGVs, namely carriers and shuttles. They can flexibly attach to and detach from each other to cooperatively execute complex transportation tasks. While such collaboration enhances operational efficiency, the attachment‑induced synchronization and interdependence render the scheduling coupled and susceptible to deadlock. To tackle this challenge, Petri nets are introduced to model AGV schedules, well describing the concurrent and sequential task execution and carrier‑shuttle synchronization. Based on Petri net theory, a firing‑driven decoding method is proposed, along with deadlock detection and prevention strategies to ensure deadlock‑free schedules. Furthermore, a Petri net‑based metaheuristic is developed in an adaptive large neighborhood search framework and incorporates an effective acceleration method to enhance computational efficiency. Finally, numerical experiments using real‑world industrial data validate the effectiveness of the proposed algorithm against the scheduling policy applied in engineering practice, an exact solver, and four state‑of‑the‑art metaheuristics. A sensitivity analysis is also conducted to provide managerial insights.

PaperID: 9364, https://arxiv.org/pdf/2508.00669.pdf

Authors: Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, Yixuan Yuan

Title: Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Abstract:
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single‑step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training‑time strategies (e.g., supervised fine‑tuning, reinforcement learning) and test‑time mechanisms (e.g., prompt engineering, multi‑agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022‑2025, we conclude by identifying critical challenges, including the faithfulness‑plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.

PaperID: 9365, https://arxiv.org/pdf/2508.00645.pdf

Authors: Maochao Xiao, Yuning Wang, Felix Rodach, Bernat Font, Marius Kurz, Pol Suárez, Di Zhou, Francisco Alcántara-Ávila, Ting Zhu, Junle Liu, Ricard Montalà, Jiawei Chen, Jean Rabault, Oriol Lehmkuhl, Andrea Beck, Johan Larsson, Ricardo Vinuesa, Sergio Pirozzoli

Title: SmartFlow: A CFD-solver-agnostic deep reinforcement learning framework for computational fluid dynamics on HPC platforms

Abstract:
Deep reinforcement learning (DRL) is emerging as a powerful tool for fluid‑dynamics research, encompassing active flow control, autonomous navigation, turbulence modeling and discovery of novel numerical schemes. We introduce SmartFlow, a CFD‑solver‑agnostic framework for both single‑ and multi‑agent DRL algorithms that can easily integrate with MPI‑parallel CPU and GPU‑accelerated solvers. Built on Relexi and SmartSOD2D, SmartFlow uses the SmartSim infrastructure library and our newly developed SmartRedis‑MPI library to enable asynchronous, low‑latency, in‑memory communication between CFD solvers and Python‑based DRL algorithms. SmartFlow leverages PyTorch's Stable‑Baselines3 for training, which provides a modular, Gym‑like environment API. We demonstrate its versatility via three case studies: single‑agent synthetic‑jet control for drag reduction in a cylinder flow simulated by the high‑order FLEXI solver, multi‑agent cylinder wake control using the GPU‑accelerated spectral‑element code SOD2D, and multi‑agent wall‑model learning for large‑eddy simulation with the finite‑difference solver CaLES. SmartFlow's CFD‑solver‑agnostic design and seamless HPC integration is promising to accelerate RL‑driven fluid‑mechanics studies.

PaperID: 9366, https://arxiv.org/pdf/2508.00632.pdf

Authors: Alexia Jolicoeur-Martineau

Title: Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Abstract:
While AI excels at generating text, audio, images, and videos, creating interactive audio‑visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi‑shot, multi‑agents) using assets made by artists. To tackle these issues, we built a new metric and a multi‑agent system. We propose AVR‑Eval, a relative metric for multimedia content quality using Audio‑Visual Recordings (AVRs). An omni‑modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR‑Eval properly identifies good from broken or mismatched content. We built AVR‑Agent, a multi‑agent system generating JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial codes, uses AVR‑Eval to identify the best version, and iteratively improves it through omni‑modal agent feedback from the AVR. We run experiments on games and animations with AVR‑Eval (win rate of content A against B). We find that content generated by AVR‑Agent has a significantly higher win rate against content made through one‑shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high‑quality assets and audio‑visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine content creation approaches.

PaperID: 9367, https://arxiv.org/pdf/2508.00500.pdf

Authors: Haoyu Wang, Christopher M. Poskitt, Jiali Wei, Jun Sun

Title: ProbGuard: Probabilistic Runtime Monitoring for LLM Agent Safety

Abstract:
Large Language Model (LLM) agents increasingly operate across domains such as robotics, virtual assistants, and web automation. However, their stochastic decision‑making introduces safety risks that are difficult to anticipate during execution. Existing runtime monitoring frameworks, such as AgentSpec, primarily rely on reactive safety rules that detect violations only when unsafe behavior is imminent or has already occurred, limiting their ability to handle long‑horizon dependencies. We present ProbGuard, a proactive runtime monitoring framework for LLM agents that anticipates safety violations through probabilistic risk prediction. ProbGuard abstracts agent executions into symbolic states and learns a Discrete‑Time Markov Chain (DTMC) from execution traces to model behavioral dynamics. At runtime, the monitor estimates the probability that future executions will reach unsafe states and triggers interventions when this risk exceeds a user‑defined threshold. To improve robustness, ProbGuard incorporates semantic validity constraints in the abstraction and provides PAC‑style guarantees on the learned model under standard assumptions. We evaluate ProbGuard in two safety‑critical domains: autonomous driving and embodied household agents. Across evaluated scenarios, ProbGuard consistently predicts traffic law violations and collisions in advance, with warnings up to 38.66 seconds ahead of occurrence. In embodied agent tasks, ProbGuard reduces unsafe behavior by up to 65.37% while preserving up to 80.4% task completion. ProbGuard is implemented as an extensible open‑source runtime monitor integrated with the LangChain agent framework and introduces minimal runtime overhead.

PaperID: 9368, https://arxiv.org/pdf/2508.00478.pdf

Authors: Yuning Jiang, Nay Oo, Qiaoran Meng, Lu Lin, Dusit Niyato, Zehui Xiong, Hoon Wei Lim, Biplab Sikdar

Title: CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization

Abstract:
Modern cyber attacks unfold through multiple stages, requiring defenders to dynamically prioritize mitigations under uncertainty. While game‑theoretic models capture attacker‑defender interactions, existing approaches often rely on static assumptions and lack integration with real‑time threat intelligence, limiting their adaptability. This paper presents CyGATE, a game‑theoretic framework modeling attacker‑defender interactions, using large language models (LLMs) with retrieval‑augmented generation (RAG) to enhance tactic selection and patch prioritization. Applied to a two‑agent scenario, CyGATE frames cyber conflicts as a partially observable stochastic game (POSG) across Cyber Kill Chain stages. Both agents use belief states to navigate uncertainty, with the attacker adapting tactics and the defender re‑prioritizing patches based on evolving risks and observed adversary behavior. The framework's flexible architecture enables extension to multi‑agent scenarios involving coordinated attackers, collaborative defenders, or complex enterprise environments with multiple stakeholders. Evaluated in a dynamic patch scheduling scenario, CyGATE effectively prioritizes high‑risk vulnerabilities, enhancing adaptability through dynamic threat integration, strategic foresight by anticipating attacker moves under uncertainty, and efficiency by optimizing resource use.

PaperID: 9369, https://arxiv.org/pdf/2508.00401.pdf

Authors: Riddhi J. Pitliya, Ozan Çatal, Toon Van de Maele, Corrado Pezzato, Tim Verbelen

Title: Theory of Mind Using Active Inference: A Framework for Multi-Agent Cooperation

Abstract:
Theory of Mind (ToM) ‑‑ the ability to understand that others can have differing knowledge and goals ‑‑ enables agents to reason about others' beliefs while planning their own actions. We present a novel approach to multi‑agent cooperation by implementing ToM within active inference. Unlike previous active inference approaches to multi‑agent cooperation, our method neither relies on task‑specific shared generative models nor requires explicit communication. In our framework, ToM‑equipped agents maintain distinct representations of their own and others' beliefs and goals. ToM agents then use an extended and adapted version of the sophisticated inference tree‑based planning algorithm to systematically explore joint policy spaces through recursive reasoning. We evaluate our approach through collision avoidance and foraging simulations. Results suggest that ToM agents cooperate better compared to non‑ToM counterparts by being able to avoid collisions and reduce redundant efforts. Crucially, ToM agents accomplish this by inferring others' beliefs solely from observable behaviour and considering them when planning their own actions. Our approach shows potential for generalisable and scalable multi‑agent systems while providing computational insights into ToM mechanisms.

PaperID: 9370, https://arxiv.org/pdf/2508.00359.pdf

Authors: Zongheng Tang, Yi Liu, Yifan Sun, Yulu Gao, Jinyu Chen, Runsheng Xu, Si Liu

Title: CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

Abstract:
Collaborative perception shares information among different agents and helps solving problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi‑agent fusion and multi‑time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio‑temporal space simultanesouly. The unified spatio‑temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatial temporal space, and thus only requires transmission only once (whereas prior methods re‑transmit all the object features multiple times). 2) superior feature fusion: merging the multi‑agent and multi‑time fusion into a unified spatial‑temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio‑temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.

PaperID: 9371, https://arxiv.org/pdf/2508.00344.pdf

Authors: Keer Lu, Chong Chen, Xili Wang, Bin Cui, Yunhuai Liu, Wentao Zhang

Title: PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Abstract:
Large Language Models (LLMs) have shown remarkable advancements in tackling agent‑oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent‑based environments. The widely adopted agent paradigm ReAct centers on integrating single‑step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long‑term strategic planning. Furthermore, the coordination between the planner and executor during problem‑solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine‑tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan‑based agent paradigm AdaPlan, aiming to synergize high‑level explicit guidance with execution to support effective long‑horizon decision‑making. Based on the proposed paradigm, we further put forward PilotRL, a global planning‑guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL could achieve state‑of‑the‑art performances, with LLaMA3.1‑8B‑Instruct + PilotRL surpassing closed‑sourced GPT‑4o by 3.60%, while showing a more substantial gain of 55.78% comparing to GPT‑4o‑mini at a comparable parameter scale.

PaperID: 9372, https://arxiv.org/pdf/2508.00282.pdf

Authors: Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang

Title: Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Abstract:
Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task‑generation experiment comparing human responses with those of an LLM agent (GPT‑4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns. They produce tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, while the LLM's tasks were perceived as more fun and novel, this highlights a disconnect between its linguistic proficiency and its capacity to generate human‑like, embodied goals. We conclude that there is a core gap between the value‑driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human‑aligned agents.

PaperID: 9373, https://arxiv.org/pdf/2508.00280.pdf

Authors: Jingchen Peng, Dingli Yuan, Boxiang Ren, Jie Fan, Hao Wu, Lu Yang

Title: WMAS: A Multi-Agent System Towards Intelligent and Customized Wireless Networks

Abstract:
The fast development of Artificial Intelligence (AI) agents provides a promising way for the realization of intelligent and customized wireless networks. In this paper, we propose a Wireless Multi‑Agent System (WMAS), which can provide intelligent and customized services for different user equipment (UEs). Note that orchestrating multiple agents carries the risk of malfunction, and multi‑agent conversations may fall into infinite loops. It is thus crucial to design a conversation topology for WMAS that enables agents to complete UE task requests with high accuracy and low conversation overhead. To address this issue, we model the multi‑agent conversation topology as a directed acyclic graph and propose a reinforcement learning‑based algorithm to optimize the adjacency matrix of this graph. As such, WMAS is capable of generating and self‑optimizing multi‑agent conversation topologies, enabling agents to effectively and collaboratively handle a variety of task requests from UEs. Simulation results across various task types demonstrate that WMAS can achieve higher task performance and lower conversation overhead compared to existing multi‑agent systems. These results validate the potential of WMAS to enhance the intelligence of future wireless networks.

PaperID: 9374, https://arxiv.org/pdf/2508.00264.pdf

Authors: Jerry Huang, Peng Lu, Qiuhao Zeng

Title: Calibrated Language Models and How to Find Them with Label Smoothing

Abstract:
Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine‑tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction‑following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open‑sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine‑tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV‑LLMs). We posit the cause to stem from the ability to become over‑confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross‑entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non‑smoothed losses.

PaperID: 9375, https://arxiv.org/pdf/2508.00159.pdf

Authors: Jobst Heitzig, Ram Potham

Title: Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power

Abstract:
Power is a key concept in AI safety: power‑seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human‑AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality‑ and risk‑averse long‑term aggregate of human power. It takes into account humans' bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi‑agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub‑goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility‑based objectives.

PaperID: 9376, https://arxiv.org/pdf/2508.00083.pdf

Authors: Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, Ge Li

Title: A Survey on Code Generation with LLM-based Agents

Abstract:
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM‑based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single‑agent and multi‑agent architectures. Furthermore, this survey details the applications of LLM‑based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long‑term research directions for the future work of the field.

PaperID: 9377, https://arxiv.org/pdf/2508.00032.pdf

Authors: Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò

Title: Strategic Communication and Language Bias in Multi-Agent LLM Coordination

Abstract:
Large Language Model (LLM)‑based agents are increasingly deployed in multi‑agent scenarios where coordination is crucial but not always assured. Research shows that the way strategic scenarios are framed linguistically can affect cooperation. This paper explores whether allowing agents to communicate amplifies these language‑driven effects. Leveraging FAIRGAME, we simulate one‑shot and repeated games across different languages and models, both with and without communication. Our experiments, conducted with two advanced LLMs‑GPT‑4o and Llama 4 Maverick‑reveal that communication significantly influences agent behavior, though its impact varies by language, personality, and game structure. These findings underscore the dual role of communication in fostering coordination and reinforcing biases.

PaperID: 9378, https://arxiv.org/pdf/2508.00031.pdf

Authors: Junde Wu, Minhao Hu, Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Min Xu, Yueming Jin

Title: Git Context Controller: Manage the Context of LLM-based Agents like Git

Abstract:
Large language model (LLM) agents have demonstrated strong capabilities in long‑horizon tasks by interleaving reasoning with tool use. However, as these agents scale to complex workflows such as software engineering and open‑ended research, context management becomes a fundamental bottleneck: interaction histories grow unbounded, become costly to maintain, and are difficult to reuse across sessions and agents. We introduce Git‑Context‑Controller (GCC), a structured context management framework inspired by software version control systems. GCC elevates agent context from a transient token stream to a persistent, navigable memory workspace with explicit operations ‑‑ \textttCOMMIT, \textttBRANCH, \textttMERGE, and \textttCONTEXT, that enable milestone‑based checkpointing, isolated exploration of alternative reasoning paths, and hierarchical retrieval of historical context. By organizing agent memory as a versioned file system, GCC allows agents to manage long‑term goals, recover and transfer reasoning across sessions, and coordinate multi‑trajectory problem solving in a principled manner. Empirically, agents equipped with GCC achieve state‑of‑the‑art performance on both SWE‑Bench and BrowseComp benchmarks. On SWE‑Bench Verified, GCC improves task resolution by over 13% relative to strong long‑context baselines and outperforms 26 existing open and commercial systems, reaching over 80% success rate. The project will be open‑sourced for the research community. The algorithm has been incorporated to the project: https://one‑context.com

PaperID: 9379, https://arxiv.org/pdf/2507.23701.pdf

Authors: Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Title: TextQuests: How Good are LLMs at Text-Based Video Games?

Abstract:
Evaluating AI agents within complex, interactive environments that mirror real‑world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self‑directed reasoning over a long and growing context. To enable a more accurate assessment of AI agents in challenging exploratory environments, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text‑based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self‑contained problem‑solving by precluding the use of external tools, thereby focusing on intrinsic long‑context reasoning capabilities in an exploratory environment characterized by the need for trial‑and‑error learning and sustained problem‑solving within a single interactive session. We release TextQuests at https://textquests.ai.

PaperID: 9380, https://arxiv.org/pdf/2507.23694.pdf

Authors: Virginia Padilla, Jacinto Dávila

Title: A survey of multi-agent geosimulation methodologies: from ABM to LLM

Abstract:
We provide a comprehensive examination of agent‑based approaches that codify the principles and linkages underlying multi‑agent systems, simulations, and information systems. Based on two decades of study, this paper confirms a framework intended as a formal specification for geosimulation platforms. Our findings show that large language models (LLMs) can be effectively incorporated as agent components if they follow a structured architecture specific to fundamental agent activities such as perception, memory, planning, and action. This integration is precisely consistent with the architecture that we formalize, providing a solid platform for next‑generation geosimulation systems.

PaperID: 9381, https://arxiv.org/pdf/2507.23693.pdf

Authors: Zhaoyue Xu, Long Wang, Chunyu Wang, Yixin Chen, Qingyong Luo, Hua-Dong Yao, Shizhao Wang, Guowei He

Title: CFDagent: A Language-Guided, Zero-Shot Multi-Agent System for Complex Flow Simulation

Abstract:
We introduce CFDagent, a zero‑shot, multi‑agent system that enables fully autonomous computational fluid dynamics (CFD) simulations from natural language prompts. CFDagent integrates three specialized LLM‑driven agents: (i) the Preprocessing Agent that generates 3D geometries from textual or visual inputs using a hybrid text‑to‑3D diffusion model (Point‑E) and automatically meshes the geometries; (ii) the Solver Agent that configures and executes an immersed boundary flow solver; and (iii) the Postprocessing Agent that analyzes and visualizes the results, including multimodal renderings. These agents are interactively guided by GPT‑4o via conversational prompts, enabling intuitive and user‑friendly interaction. We validate CFDagent by reproducing canonical sphere flows at Reynolds numbers of 100 and 300 using three distinct inputs: a simple text prompt (i.e., "sphere"), an image‑based input, and a standard sphere model. The computed drag and lift coefficients from meshes produced by each input approach closely match available data. The proposed system enables synthesization of flow simulations and photorealistic visualizations for complex geometries. Through extensive tests on canonical and realistic scenarios, we demonstrate the robustness, versatility, and practical applicability of CFDagent. By bridging generative AI with high‑fidelity simulations, CFDagent significantly lowers barriers to expert‑level CFD, unlocking broad opportunities in education, scientific research, and practical engineering applications.

PaperID: 9382, https://arxiv.org/pdf/2507.23674.pdf

Authors: Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi

Title: TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Abstract:
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side‑by‑side comparisons, satisfaction voting, as well as multi‑agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real‑world datasets highlight TweakLLM as a scalable, resource‑efficient caching solution for high‑volume LLM deployments without compromising user experience.

PaperID: 9383, https://arxiv.org/pdf/2507.23604.pdf

Authors: Tommaso Marzi, Cesare Alippi, Andrea Cini

Title: Hierarchical Message-Passing Policies for Multi-Agent Reinforcement Learning

Abstract:
Decentralized Multi‑Agent Reinforcement Learning (MARL) methods allow for learning scalable multi‑agent policies, but suffer from partial observability and induced non‑stationarity. These challenges can be addressed by introducing mechanisms that facilitate coordination and high‑level planning. Specifically, coordination and temporal abstraction can be achieved through communication (e.g., message passing) and Hierarchical Reinforcement Learning (HRL) approaches to decision‑making. However, optimization issues limit the applicability of hierarchical policies to multi‑agent systems. As such, the combination of these approaches has not been fully explored. To fill this void, we propose a novel and effective methodology for learning multi‑agent hierarchies of message‑passing policies. We adopt the feudal HRL framework and rely on a hierarchical graph structure for planning and coordination among agents. Agents at lower levels in the hierarchy receive goals from the upper levels and exchange messages with neighboring agents at the same level. To learn hierarchical multi‑agent policies, we design a novel reward‑assignment method based on training the lower‑level policies to maximize the advantage function associated with the upper levels. Results on relevant benchmarks show that our method performs favorably compared to the state of the art.

PaperID: 9384, https://arxiv.org/pdf/2507.23554.pdf

Authors: Ruoyu Wang, Junda Wu, Yu Xia, Tong Yu, Ryan A. Rossi, Julian McAuley, Lina Yao

Title: DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

Abstract:
Large language model‑based agents, empowered by in‑context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool‑use tasks. However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance. While prior work has explored example selection, including in some agentic or multi‑step settings, existing approaches typically rely on heuristics or task‑specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. Therefore, it is non‑trivial to develop a principled, general‑purpose method for selecting demonstrations that consistently benefit agent performance. In this paper, we address this challenge with DICE, Dynamic In‑Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning. Our approach decomposes demonstration knowledge into transferable and non‑transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework‑agnostic solution that can be integrated as a plug‑in module into existing agentic frameworks without any additional training cost. Extensive experiments across diverse domains demonstrate our method's effectiveness and generality, highlighting the importance of principled, context‑aware demo selection for robust and efficient LLM agents.

PaperID: 9385, https://arxiv.org/pdf/2507.23489.pdf

Authors: Vivek Pandey, Nader Motee

Title: Distributionally Robust Cascading Risk Quantification in Multi-Agent Rendezvous: Effects of Time Delay and Network Connectivity

Abstract:
Achieving safety in autonomous multi‑agent systems, particularly in time‑critical tasks like rendezvous, is a critical challenge. In this paper, we propose a distributionally robust risk framework for analyzing cascading failures in multi‑agent rendezvous. To capture the complex interactions between network connectivity, system dynamics, and communication delays, we use a time‑delayed network model as a benchmark. We introduce a conditional distributionally robust functional to quantify cascading effects between agents, utilizing a bi‑variate normal distribution. Our approach yields closed‑form risk expressions that reveal the impact of time delay, noise statistics, communication topology, and failure modes on rendezvous risk. The insights derived inform the design of resilient networks that mitigate the risk of cascading failures. We validate our theoretical results through extensive simulations, demonstrating the effectiveness of our framework.

PaperID: 9386, https://arxiv.org/pdf/2507.23454.pdf

Authors: Marta Bieńkiewicz, Julia Ayache, Panayiotis Charalambous, Cristina Becchio, Marco Corragio, Bertram Taetz, Francesco De Lellis, Antonio Grotta, Anna Server, Daniel Rammer, Richard Kulpa, Franck Multon, Azucena Garcia-Palacios, Jessica Sutherland, Kathleen Bryson, Stéphane Donikian, Didier Stricker, Benoît Bardy

Title: Breaking the mould of Social Mixed Reality - State-of-the-Art and Glossary

Abstract:
This article explores a critical gap in Mixed Reality (MR) technology: while advances have been made, MR still struggles to authentically replicate human embodiment and socio‑motor interaction. For MR to enable truly meaningful social experiences, it needs to incorporate multi‑modal data streams and multi‑agent interaction capabilities. To address this challenge, we present a comprehensive glossary covering key topics such as Virtual Characters and Autonomisation, Responsible AI, Ethics by Design, and the Scientific Challenges of Social MR within Neuroscience, Embodiment, and Technology. Our aim is to drive the transformative evolution of MR technologies that prioritize human‑centric innovation, fostering richer digital connections. We advocate for MR systems that enhance social interaction and collaboration between humans and virtual autonomous agents, ensuring inclusivity, ethical design and psychological safety in the process.

PaperID: 9387, https://arxiv.org/pdf/2507.23429.pdf

Authors: Jorge Ruiz Gómez, Lidia Andrés Susinos, Jorge Alamo Olivé, Sonia Rey Osorno, Manuel Luis Gonzalez Hernández

Title: Chatting with your ERP: A Recipe

Abstract:
This paper presents the design, implementation, and evaluation behind a Large Language Model (LLM) agent that chats with an industrial production‑grade ERP system. The agent is capable of interpreting natural language queries and translating them into executable SQL statements, leveraging open‑weight LLMs. A novel dual‑agent architecture combining reasoning and critique stages was proposed to improve query generation reliability.

PaperID: 9388, https://arxiv.org/pdf/2507.23382.pdf

Authors: Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che

Title: MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Abstract:
Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision‑making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real‑world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real‑world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed‑source models achieve only 21.3% feasible plans, while open‑source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi‑constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint‑aware reasoning for real‑world MLLM applications.

PaperID: 9389, https://arxiv.org/pdf/2507.23261.pdf

Authors: Hui Yi Leong, Yuqing Wu

Title: DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System

Abstract:
Current multi‑agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance. To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM‑based MAS through two key innovations: (1) an actor‑critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter‑efficient LLM fine‑tuning. DynaSwarm eliminates the need for rigid, one‑fits‑all graph architectures, instead leveraging sample‑specific idiosyncrasies to dynamically route queries through specialized agent networks. (c) We propose to fine‑tune the demonstration retriever to fully exploit the power of in‑context learning (ICL). Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state‑of‑the‑art single‑agent and MAS baselines across multiple LLM backbones. Our findings highlight the importance of sample‑aware structural flexibility in LLM MAS designs.

PaperID: 9390, https://arxiv.org/pdf/2507.23227.pdf

Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen

Title: Enabling Few-Shot Alzheimer's Disease Diagnosis on Biomarker Data with Tabular LLMs

Abstract:
Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few‑shot reasoning, multimodal integration, and natural‑language‑based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP‑GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular‑specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few‑shot tabular prompts using in‑context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter‑efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP‑GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general‑purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM‑driven multi‑agent frameworks in biomedical informatics.

PaperID: 9391, https://arxiv.org/pdf/2507.23194.pdf

Authors: Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum

Title: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Abstract:
The demand for AI‑generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware‑optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low‑level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI‑driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near‑expert performance on hardware like AMD MI300X. The Triton language, a Python‑based DSL for GPU programming, has emerged as a popular target for such AI‑generated kernels due to its balance of performance and ease‑of‑coding. In this work, we present an evaluation suite for Triton‑based GPU kernels and GEAK (Generating Efficient AI‑centric GPU Kernels)‑a framework that leverages cutting‑edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference‑time compute scaling to produce Triton‑based GPU kernels using a reasoning loop adapted from Reflexion‑style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion‑based generation pipelines by achieving correctness up to 63% and execution speed up of up to 2.59X. These results highlight the promise of GEAK‑like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert‑level kernel performance.

PaperID: 9392, https://arxiv.org/pdf/2507.23096.pdf

Authors: Tom Peterka, Tanwi Mallick, Orcun Yildiz, David Lenz, Cory Quammen, Berk Geveci

Title: ChatVis: Large Language Model Agent for Generating Scientific Visualizations

Abstract:
Large language models (LLMs) are rapidly increasing in capability, but they still struggle with highly specialized programming tasks such as scientific visualization. We present an LLM assistant, ChatVis, that aids the LLM to generate Python code for ParaView scientific visualization tasks, without the need for retraining or fine‑tuning the LLM. ChatVis employs chain‑of‑thought prompt simplification, retrieval‑augmented prompt generation using a vector database of documentation and code examples, and error checking with iterative prompt feedback to correct errors until a visualization is produced. An integral part of our approach is a benchmark suite of canonical visualization tasks, ParaView regression tests, and scientific use cases that includes comprehensive evaluation metrics. We evaluate our visualization assistant by comparing results with a variety of top‑performing unassisted LLMs. We find that all the metrics are significantly improved with ChatVis.

PaperID: 9393, https://arxiv.org/pdf/2507.23095.pdf

Authors: Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber

Title: SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Abstract:
We present SMART‑Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART‑Editor preserves global coherence through two strategies: Reward‑Refine, an inference‑time rewardguided refinement method, and RewardDPO, a training‑time preference optimization approach using reward‑aligned layout pairs. To evaluate model performance, we introduce SMARTEdit‑Bench, a benchmark covering multi‑domain, cascading edit scenarios. SMART‑Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward‑Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward‑guided planning in producing semantically consistent and visually aligned edits.

PaperID: 9394, https://arxiv.org/pdf/2507.23080.pdf

Authors: Jing Wang, Yan Jin, Fei Ding, Chongfeng Wei

Title: Causal-Inspired Multi-Agent Decision-Making via Graph Reinforcement Learning

Abstract:
Since the advent of autonomous driving technology, it has experienced remarkable progress over the last decade. However, most existing research still struggles to address the challenges posed by environments where multiple vehicles have to interact seamlessly. This study aims to integrate causal learning with reinforcement learning‑based methods by leveraging causal disentanglement representation learning (CDRL) to identify and extract causal features that influence optimal decision‑making in autonomous vehicles. These features are then incorporated into graph neural network‑based reinforcement learning algorithms to enhance decision‑making in complex traffic scenarios. By using causal features as inputs, the proposed approach enables the optimization of vehicle behavior at an unsignalized intersection. Experimental results demonstrate that our proposed method achieves the highest average reward during training and our approach significantly outperforms other learning‑based methods in several key metrics such as collision rate and average cumulative reward during testing. This study provides a promising direction for advancing multi‑agent autonomous driving systems and make autonomous vehicles' navigation safer and more efficient in complex traffic environments.

PaperID: 9395, https://arxiv.org/pdf/2507.22925.pdf

Authors: Haoran Sun, Shaoning Zeng

Title: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Abstract:
Long‑term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision‑making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity‑based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H‑MEM) architecture for LLM Agents that organizes and updates memory in a multi‑level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub‑memories in the next layer. During the reasoning phase, an index‑based routing mechanism enables efficient, layer‑by‑layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long‑term dialogue scenarios.

PaperID: 9396, https://arxiv.org/pdf/2507.22904.pdf

Authors: Ehsan Latif, Zirak Khan, Xiaoming Zhai

Title: SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Abstract:
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI‑powered automated assessment of such free‑form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as either an image classification task or monolithic vision‑language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi‑agent framework for evaluating and improving student‑drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student‑generated sketches across six science assessment items with different highest order of Bloom's level that require students to draw models to explain phenomena. Compared to baseline GPT‑4o performance without SRG (average accuracy: 55.6%), and with SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi‑agent orchestration with SRG enhances SketchMind performance, for example, GPT‑4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single‑agent pipelines across all items. Human evaluators rated the feedback and co‑created sketches generated by \textscSketchMind with GPT‑4.1, which achieved an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT‑4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI‑driven education.

PaperID: 9397, https://arxiv.org/pdf/2507.22902.pdf

Authors: Hashim Hayat, Maksim Kudrautsau, Evgeniy Makarov, Vlad Melnichenko, Tim Tsykunou, Piotr Varaksin, Matt Pavelle, Adam Z. Oskowitz

Title: Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting

Abstract:
Background: Globally we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end‑to‑end autonomous large language model (LLM)‑based AI system has been rigorously evaluated in real‑world clinical practice. In this study, we evaluated whether a multi‑agent LLM‑based AI framework can function autonomously as an AI doctor in a virtual urgent care setting. Methods: We retrospectively compared the performance of the multi‑agent AI system Doctronic and board‑certified clinicians across 500 consecutive urgent‑care telehealth encounters. The primary end points: diagnostic concordance, treatment plan consistency, and safety metrics, were assessed by blinded LLM‑based adjudication and expert human review. Results: The top diagnosis of Doctronic and clinician matched in 81% of cases, and the treatment plan aligned in 99.2% of cases. No clinical hallucinations occurred (e.g., diagnosis or treatment not supported by clinical findings). In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases. Conclusions: In this first large‑scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians, with AI performance matching and in some cases exceeding that of practicing clinicians. These findings indicate that multi‑agent AI systems achieve comparable clinical decision‑making to human providers and offer a potential solution to healthcare workforce shortages.

PaperID: 9398, https://arxiv.org/pdf/2507.22898.pdf

Authors: Julian Acosta, Scott Adams, Julius Kernbach, Romain Hardy, Sung Eun Kim, Luyang Luo, Xiaoman Zhang, Shreya Johri, Mohammed Baharoon, Pranav Rajpurkar

Title: Voice-guided Orchestrated Intelligence for Clinical Evaluation (VOICE): A Voice AI Agent System for Prehospital Stroke Assessment

Abstract:
We developed a voice‑driven artificial intelligence (AI) system that guides anyone ‑ from paramedics to family members ‑ through expert‑level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life‑threatening delays in treatment. Three non‑medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke‑like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI‑generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non‑stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system's limitations necessitate human oversight, ongoing rapid advancements in speech‑to‑speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human‑level voice interaction could transform emergency medical care, putting expert‑informed assessment capabilities in everyone's hands.

PaperID: 9399, https://arxiv.org/pdf/2507.22897.pdf

Authors: Luyu Chen, Quanyu Dai, Zeyu Zhang, Xueyang Feng, Mingyu Zhang, Pengcheng Tang, Xu Chen, Yue Zhu, Zhenhua Dong

Title: RecUserSim: A Realistic and Diverse User Simulator for Evaluating Conversational Recommender Systems

Abstract:
Conversational recommender systems (CRS) enhance user experience through multi‑turn interactions, yet evaluating CRS remains challenging. User simulators can provide comprehensive evaluations through interactions with CRS, but building realistic and diverse simulators is difficult. While recent work leverages large language models (LLMs) to simulate user interactions, they still fall short in emulating individual real users across diverse scenarios and lack explicit rating mechanisms for quantitative evaluation. To address these gaps, we propose RecUserSim, an LLM agent‑based user simulator with enhanced simulation realism and diversity while providing explicit scores. RecUserSim features several key modules: a profile module for defining realistic and diverse user personas, a memory module for tracking interaction history and discovering unknown preferences, and a core action module inspired by Bounded Rationality theory that enables nuanced decision‑making while generating more fine‑grained actions and personalized responses. To further enhance output control, a refinement module is designed to fine‑tune final responses. Experiments demonstrate that RecUserSim generates diverse, controllable outputs and produces realistic, high‑quality dialogues, even with smaller base LLMs. The ratings generated by RecUserSim show high consistency across different base LLMs, highlighting its effectiveness for CRS evaluation.

PaperID: 9400, https://arxiv.org/pdf/2507.22889.pdf

Authors: Tom Sheffer, Alon Miron, Asael Sklar, Yaniv Dover, Ariel Goldstein

Title: Confident-Knowledge Diversity Drives Human-Human and Human-AI Free Discussion Synergy and Reveals Pure-AI Discussion Shortfalls

Abstract:
Conversations transform individual knowledge into collective insight, enabling collaborators to solve problems more accurately than they could alone. Whether dialogues among large language models (LLMs) can replicate the synergistic gains observed in human discussion remains unclear. We systematically compared four interaction settings: LLM‑LLM pairs, LLM trios, human trios, and human‑LLM pairs, using validated medical multiple‑choice questions. Agents answered individually, engaged in open‑ended discussion, then re‑answered, allowing us to quantify conversational gains. Interactions that included humans consistently yielded synergy (post‑discussion accuracy increased for both stronger and weaker participants), whereas purely LLM groups did not improve and often declined. To explain and prospectively predict when unstructured dialogue helps, we introduce an agent‑agnostic confident‑knowledge framework that models each participant by performance (accuracy) and confidence. This framework quantifies confident‑knowledge diversity, the degree to which one agent tends to be correct when another is uncertain, and yields a conservative upper bound on gains achievable via confidence‑informed decisions, which we term Potential Conversation Synergy. Across humans, LLMs, and mixed teams, this metric prospectively predicts observed conversational improvements: when confident‑knowledge diversity is low (as in LLM‑only groups), discussion doesn't improve performance; when it is present (as in human or human‑LLM groups), free‑form dialogue reliably lifts accuracy. These findings propose a new concept and method for AI collaboration: quantifying confident‑knowledge diversity to prospectively predict conversational gains and guide team selection and interaction design in both multi‑agent and human‑AI settings.

PaperID: 9401, https://arxiv.org/pdf/2507.22849.pdf

Authors: Andrew Campbell, Anna Scaglione, Sean Peisert

Title: Decentralized Differentially Private Power Method

Abstract:
We propose a novel Decentralized Differentially Private Power Method (D‑DP‑PM) for performing Principal Component Analysis (PCA) in networked multi‑agent settings. Unlike conventional decentralized PCA approaches where each agent accesses the full n‑dimensional sample space, we address the challenging scenario where each agent observes only a subset of dimensions through row‑wise data partitioning. Our method ensures (ε,δ)‑Differential Privacy (DP) while enabling collaborative estimation of global eigenvectors across the network without requiring a central aggregator. We achieve this by having agents share only local embeddings of the current eigenvector iterate, leveraging both the inherent privacy from random initialization and carefully calibrated Gaussian noise additions. We prove that our algorithm satisfies the prescribed (ε,δ)‑DP guarantee and establish convergence rates that explicitly characterize the impact of the network topology. Our theoretical analysis, based on linear dynamics and high‑dimensional probability theory, provides tight bounds on both privacy and utility. Experiments on real‑world datasets demonstrate that D‑DP‑PM achieves superior privacy‑utility tradeoffs compared to naive local DP approaches, with particularly strong performance in moderate privacy regimes (ε\in[2, 5]). The method converges rapidly, allowing practitioners to trade iterations for enhanced privacy while maintaining competitive utility.

PaperID: 9402, https://arxiv.org/pdf/2507.22829.pdf

Authors: Weijia Zhang, Songgaojun Deng, Evangelos Kanoulas

Title: Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Abstract:
Query‑focused table summarization requires complex reasoning, often approached through step‑by‑step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi‑table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi‑agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph‑based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query‑focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single‑ and multi‑table settings, demonstrating the advantages of structured representations for robust and scalable summarization.

PaperID: 9403, https://arxiv.org/pdf/2507.22800.pdf

Authors: Rui Ren

Title: The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach

Abstract:
In real‑world scenarios, due to the highly decoupled and flexible nature of microservices, it poses greater challenges to system reliability. The more frequent occurrence of incidents has created a demand for Root Cause Analysis(RCA) methods that enable rapid identification and recovery of incidents. Large language model (LLM) provides a new path for quickly locating and recovering from incidents by leveraging their powerful generalization ability combined with expert experience. Current LLM for RCA frameworks are based on ideas like ReAct and Chain‑of‑Thought, but the hallucination of LLM and the propagation nature of anomalies often lead to incorrect localization results. Moreover, the massive amount of anomalous information generated in large, complex systems presents a huge challenge for the context window length of LLMs. To address these challenges, we propose KnowledgeMind, an innovative LLM multi‑agent system based on Monte Carlo Tree Search and a knowledge base reward mechanism for standardized service‑by‑service reasoning. Compared to State‑Of‑The‑Art(SOTA) LLM for RCA methods, our service‑by‑service exploration approach significantly reduces the burden on the maximum context window length, requiring only one‑tenth of its size. Additionally, by incorporating a rule‑based real‑time reward mechanism, our method effectively mitigates hallucinations during the inference process. Compared to the SOTA LLM for RCA framework, our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy.

PaperID: 9404, https://arxiv.org/pdf/2507.22782.pdf

Authors: Hugo Garrido-Lestache Belinchon, Jeremy Kedziora

Title: Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Abstract:
This paper introduces Team‑Attention‑Actor‑Critic (TAAC), a reinforcement learning algorithm designed to enhance multi‑agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi‑headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter‑agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint‑action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi‑agent paradigms, including Proximal Policy Optimization and Multi‑Agent Actor‑Attention‑Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter‑agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).

PaperID: 9405, https://arxiv.org/pdf/2507.22758.pdf

Authors: Gautam Jajoo, Pranjal A Chitale, Saksham Agarwal

Title: MASCA: LLM based-Multi Agents System for Credit Assessment

Abstract:
Recent advancements in financial problem‑solving have leveraged LLMs and agent‑based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule‑based methods and statistical models. In this paper, we introduce MASCA, an LLM‑driven multi‑agent system designed to enhance credit evaluation by mirroring real‑world decision‑making processes. The framework employs a layered architecture where specialized LLM‑based agents collaboratively tackle sub‑tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision‑making. We further present a signaling game theory perspective on hierarchical multi‑agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM‑based multi‑agent systems in financial applications, particularly in credit scoring.

PaperID: 9406, https://arxiv.org/pdf/2507.22711.pdf

Authors: Hong-Jun Yoon, Mariam Kiran, Danial Ebling, Joe Breen

Title: OFCnetLLM: Large Language Model for Network Monitoring and Alertness

Abstract:
The rapid evolution of network infrastructure is bringing new challenges and opportunities for efficient network management, optimization, and security. With very large monitoring databases becoming expensive to explore, the use of AI and Generative AI can help reduce costs of managing these datasets. This paper explores the use of Large Language Models (LLMs) to revolutionize network monitoring management by addressing the limitations of query finding and pattern analysis. We leverage LLMs to enhance anomaly detection, automate root‑cause analysis, and automate incident analysis to build a well‑monitored network management team using AI. Through a real‑world example of developing our own OFCNetLLM, based on the open‑source LLM model, we demonstrate practical applications of OFCnetLLM in the OFC conference network. Our model is developed as a multi‑agent approach and is still evolving, and we present early results here.

PaperID: 9407, https://arxiv.org/pdf/2507.22648.pdf

Authors: Themistoklis Charalambous, Zheng Chen, Christoforos N. Hadjicostis

Title: Distributed Average Consensus in Wireless Multi-Agent Systems with Over-the-Air Aggregation

Abstract:
In this paper, we address the average consensus problem of multi‑agent systems over wireless networks. We propose a distributed average consensus algorithm by invoking the concept of over‑the‑air aggregation, which exploits the signal superposition property of wireless multiple‑access channels. The proposed algorithm deploys a modified version of the well‑known Ratio Consensus algorithm with an additional normalization step for compensating for the arbitrary channel coefficients. We show that, when the noise level at the receivers is negligible, the algorithm converges asymptotically to the average for time‑invariant and time‑varying channels. Numerical simulations corroborate the validity of our results.

PaperID: 9408, https://arxiv.org/pdf/2507.22606.pdf

Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao

Title: MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines

Abstract:
Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi‑agent systems. However, existing human‑designed multi‑agent frameworks are typically limited to a small set of pre‑defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid communication structures. In this paper, we propose MetaAgent, a finite state machine based framework that can automatically generate a multi‑agent system. Given a task description, MetaAgent will design a multi‑agent system and polish it through an optimization algorithm. When the multi‑agent system is deployed, the finite state machine will control the agent's actions and the state transitions. To evaluate our framework, we conduct experiments on both text‑based tasks and practical tasks. The results indicate that the generated multi‑agent system surpasses other auto‑designed methods and can achieve a comparable performance with the human‑designed multi‑agent system, which is optimized for those specific tasks.

PaperID: 9409, https://arxiv.org/pdf/2507.22504.pdf

Authors: Hongyan Cheng, Chengzhang Yu, Yanshu Shi, Chiyue Wang, Cong Liu, Zhanpeng Jin

Title: Collaborative Medical Triage under Uncertainty: A Multi-Agent Dynamic Matching Approach

PaperID: 9410, https://arxiv.org/pdf/2507.22467.pdf

Authors: Hsien-Tsung Lin, Pei-Cing Huang, Chan-Tung Ku, Chan Hsu, Pei-Xuan Shieh, Yihuang Kang

Title: Towards Simulating Social Influence Dynamics with LLM-based Multi-agents

Abstract:
Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. We investigate whether LLM‑based multi‑agent simulations can reproduce core human social dynamics observed in online forums. We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework. Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence.

PaperID: 9411, https://arxiv.org/pdf/2507.22414.pdf

Authors: Sungmin Kang, Haifeng Ruan, Abhik Roychoudhury

Title: AutoCodeSherpa: Symbolic Explanations in AI Coding Agents

Abstract:
Large language model (LLM) agents integrate external tools with one or more LLMs to accomplish specific tasks. Agents have rapidly been adopted by developers, and they are starting to be deployed in industrial workflows, such as their use to fix static analysis issues from the widely used SonarQube static analyzer. However, the growing importance of agents means their actions carry greater impact and potential risk. Thus, to use them at scale, an additional layer of trust and evidence is necessary. This work presents AutoCodeSherpa, a technique that provides explanations of software issues in the form of symbolic formulae. Inspired by the reachability, infection, and propagation model of software faults, the explanations are composed of input, infection, and output conditions, collectively providing a specification of the issue. In practice, the symbolic explanation is implemented as a combination of a property‑based test (PBT) and program‑internal symbolic expressions. Critically, this means our symbolic explanations are executable and can be automatically evaluated, unlike natural language explanations. Experiments show the generated conditions are highly accurate. For example, input conditions from AutoCodeSherpa had an accuracy of 85.7%. This high accuracy makes symbolic explanations particularly useful in two scenarios. First, the explanations can be used in automated issue resolution environments to decide whether to accept or reject patches from issue resolution agents; AutoCodeSherpa could reject 2x as many incorrect patches as baselines did. Secondly, as agentic AI approaches continue to develop, program analysis driven explanations like ours can be provided to other LLM‑based repair techniques which do not employ analysis to improve their output. In our experiments, our symbolic explanations could improve the plausible patch generation rate of the Agentless technique by 60%.

PaperID: 9412, https://arxiv.org/pdf/2507.22358.pdf

Authors: Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, Saleema Amershi

Title: Magentic-UI: Towards Human-in-the-loop Agentic Systems

Abstract:
AI agents powered by large language models are increasingly capable of autonomously completing complex, multi‑step tasks using external tools. Yet, they still fall short of human‑level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world, also introduces safety and security risks including potentially misaligned actions and adversarial manipulation. We argue that human‑in‑the‑loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic‑UI, an open‑source web interface for developing and studying human‑agent interaction. Built on a flexible multi‑agent architecture, Magentic‑UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic‑UI presents six interaction mechanisms for enabling effective, low‑cost human involvement: co‑planning, co‑tasking, multi‑tasking, action guards, and long‑term memory. We evaluate Magentic‑UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic‑UI's potential to advance safe and efficient human‑agent collaboration.

PaperID: 9413, https://arxiv.org/pdf/2507.22338.pdf

Authors: Salar Basiri, Dhananjay Tiwari, Srinivasa M. Salapaka

Title: Parametrized Multi-Agent Routing via Deep Attention Models

Abstract:
We propose a scalable deep learning framework for parametrized sequential decision‑making (ParaSDM), where multiple agents jointly optimize discrete action policies and shared continuous parameters. A key subclass of this setting arises in Facility‑Location and Path Optimization (FLPO), where multi‑agent systems must simultaneously determine optimal routes and facility locations, aiming to minimize the cumulative transportation cost within the network. FLPO problems are NP‑hard due to their mixed discrete‑continuous structure and highly non‑convex objective. To address this, we integrate the Maximum Entropy Principle (MEP) with a neural policy model called the Shortest Path Network (SPN)‑a permutation‑invariant encoder‑decoder that approximates the MEP solution while enabling efficient gradient‑based optimization over shared parameters. The SPN achieves up to 100× speedup in policy inference and gradient computation compared to MEP baselines, with an average optimality gap of approximately 6% across a wide range of problem sizes. Our FLPO approach yields over 10× lower cost than metaheuristic baselines while running significantly faster, and matches Gurobi's optimal cost with annealing at a 1500× speedup‑establishing a new state of the art for ParaSDM problems. These results highlight the power of structured deep models for solving large‑scale mixed‑integer optimization tasks.

PaperID: 9414, https://arxiv.org/pdf/2507.22282.pdf

Authors: Kegan J. Strawn, Thomy Phan, Eric Wang, Nora Ayanian, Sven Koenig, Lars Lindemann

Title: Multi-Agent Path Finding Among Dynamic Uncontrollable Agents with Statistical Safety Guarantees

Abstract:
Existing multi‑agent path finding (MAPF) solvers do not account for uncertain behavior of uncontrollable agents. We present a novel variant of Enhanced Conflict‑Based Search (ECBS), for both one‑shot and lifelong MAPF in dynamic environments with uncontrollable agents. Our method consists of (1) training a learned predictor for the movement of uncontrollable agents, (2) quantifying the prediction error using conformal prediction (CP), a tool for statistical uncertainty quantification, and (3) integrating these uncertainty intervals into our modified ECBS solver. Our method can account for uncertain agent behavior, comes with statistical guarantees on collision‑free paths for one‑shot missions, and scales to lifelong missions with a receding horizon sequence of one‑shot instances. We run our algorithm, CP‑Solver, across warehouse and game maps, with competitive throughput and reduced collisions.

PaperID: 9415, https://arxiv.org/pdf/2507.22281.pdf

Authors: Minsoo Kim, Seung-won Hwang

Title: CoEx -- Co-evolving World-model and Exploration

Abstract:
Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM's static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co‑evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code‑based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.

PaperID: 9416, https://arxiv.org/pdf/2507.22279.pdf

Authors: Timothy Jacob Huber, Madhur Tiwari, Camilo A. Riano-Rios

Title: Physics-Informed EvolveGCN: Satellite Prediction for Multi Agent Systems

Abstract:
In the rapidly evolving domain of autonomous systems, interaction among agents within a shared environment is both inevitable and essential for enhancing overall system capabilities. A key requirement in such multi‑agent systems is the ability of each agent to reliably predict the future positions of its nearest neighbors. Traditionally, graphs and graph theory have served as effective tools for modeling inter agent communication and relationships. While this approach is widely used, the present work proposes a novel method that leverages dynamic graphs in a forward looking manner. Specifically, the employment of EvolveGCN, a dynamic graph convolutional network, to forecast the evolution of inter‑agent relationships over time. To improve prediction accuracy and ensure physical plausibility, this research incorporates physics constrained loss functions based on the Clohessy‑Wiltshire equations of motion. This integrated approach enhances the reliability of future state estimations in multi‑agent scenarios.

PaperID: 9417, https://arxiv.org/pdf/2507.22278.pdf

Authors: Sunny Amatya, Yi Ren, Zhe Xu, Wenlong Zhang

Title: Successor Features for Transfer in Alternating Markov Games

Abstract:
This paper explores successor features for knowledge transfer in zero‑sum, complete‑information, and turn‑based games. Prior research in single‑agent systems has shown that successor features can provide a ``jump start" for agents when facing new tasks with varying reward structures. However, knowledge transfer in games typically relies on value and equilibrium transfers, which heavily depends on the similarity between tasks. This reliance can lead to failures when the tasks differ significantly. To address this issue, this paper presents an application of successor features to games and presents a novel algorithm called Game Generalized Policy Improvement (GGPI), designed to address Markov games in multi‑agent reinforcement learning. The proposed algorithm enables the transfer of learning values and policies across games. An upper bound of the errors for transfer is derived as a function the similarity of the task. Through experiments with a turn‑based pursuer‑evader game, we demonstrate that the GGPI algorithm can generate high‑reward interactions and one‑shot policy transfer. When further tested in a wider set of initial conditions, the GGPI algorithm achieves higher success rates with improved path efficiency compared to those of the baseline algorithms.

PaperID: 9418, https://arxiv.org/pdf/2507.22205.pdf

Authors: Black Sun, Die, Hu

Title: CTG-Insight: A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification

Abstract:
Remote fetal monitoring technologies are becoming increasingly common. Yet, most current systems offer limited interpretability, leaving expectant parents with raw cardiotocography (CTG) data that is difficult to understand. In this work, we present CTG‑Insight, a multi‑agent LLM system that provides structured interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals. Drawing from established medical guidelines, CTG‑Insight decomposes each CTG trace into five medically defined features: baseline, variability, accelerations, decelerations, and sinusoidal pattern, each analyzed by a dedicated agent. A final aggregation agent synthesizes the outputs to deliver a holistic classification of fetal health, accompanied by a natural language explanation. We evaluate CTG‑Insight on the NeuroFetalNet Dataset and compare it against deep learning models and the single‑agent LLM baseline. Results show that CTG‑Insight achieves state‑of‑the‑art accuracy (96.4%) and F1‑score (97.8%) while producing transparent and interpretable outputs. This work contributes an interpretable and extensible CTG analysis framework.

PaperID: 9419, https://arxiv.org/pdf/2507.22080.pdf

Authors: Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan

Title: CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

Abstract:
Acquiring high‑quality instruction‑code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code‑centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine‑tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In‑depth analyses further provide insights from multiple perspectives into effective code‑centric data synthesis.

PaperID: 9420, https://arxiv.org/pdf/2507.22077.pdf

Authors: Muyang Li

Title: From Cloud-Native to Trust-Native: A Protocol for Verifiable Multi-Agent Systems

Abstract:
As autonomous agents powered by large language models (LLMs) proliferate in high‑stakes domains ‑‑ from pharmaceuticals to legal workflows ‑‑ the challenge is no longer just intelligence, but verifiability. We introduce TrustTrack, a protocol that embeds structural guarantees ‑‑ verifiable identity, policy commitments, and tamper‑resistant behavioral logs ‑‑ directly into agent infrastructure. This enables a new systems paradigm: trust‑native autonomy. By treating compliance as a design constraint rather than post‑hoc oversight, TrustTrack reframes how intelligent agents operate across organizations and jurisdictions. We present the protocol design, system requirements, and use cases in regulated domains such as pharmaceutical R&D, legal automation, and AI‑native collaboration. We argue that the Cloud ‑> AI ‑> Agent ‑> Trust transition represents the next architectural layer for autonomous systems.

PaperID: 9421, https://arxiv.org/pdf/2507.22063.pdf

Authors: Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Dongwon Jung, Hadi Askari, Wenxuan Zhou, Zhe Zhao, Muhao Chen

Title: RedCoder: Automated Multi-Turn Red Teaming for Code LLMs

Abstract:
Large Language Models (LLMs) for code generation (i.e., Code LLMs) have demonstrated impressive capabilities in AI‑assisted software development and testing. However, recent studies have shown that these models are prone to generating vulnerable or even malicious code under adversarial settings. Existing red‑teaming approaches rely on extensive human effort, limiting their scalability and practicality, and generally overlook the interactive nature of real‑world AI‑assisted programming, which often unfolds over multiple turns. To bridge these gaps, we present RedCoder, a red‑teaming agent that engages victim models in multi‑turn conversation to elicit vulnerable code. The pipeline to construct RedCoder begins with a multi‑agent gaming process that simulates adversarial interactions, yielding a set of prototype conversations and an arsenal of reusable attack strategies. We then fine‑tune an LLM on these prototype conversations to serve as the backbone of RedCoder. Once deployed, RedCoder autonomously engages Code LLMs in multi‑turn conversations, dynamically retrieving relevant strategies from the arsenal to steer the dialogue toward vulnerability‑inducing outputs. Experiments across multiple Code LLMs show that our approach outperforms prior single‑turn and multi‑turn red‑team methods in inducing vulnerabilities in code generation, offering a scalable and effective tool for evaluating the security boundaries of modern code‑generation systems.

PaperID: 9422, https://arxiv.org/pdf/2507.22049.pdf

Authors: Logan Cross, Nick Haber, Daniel L. K. Yamins

Title: Validating Generative Agent-Based Models of Social Norm Enforcement: From Replication to Novel Predictions

Abstract:
As large language models (LLMs) advance, there is growing interest in using them to simulate human social behavior through generative agent‑based modeling (GABM). However, validating these models remains a key challenge. We present a systematic two‑stage validation approach using social dilemma paradigms from psychological literature, first identifying the cognitive components necessary for LLM agents to reproduce known human behaviors in mixed‑motive settings from two landmark papers, then using the validated architecture to simulate novel conditions. Our model comparison of different cognitive architectures shows that both persona‑based individual differences and theory of mind capabilities are essential for replicating third‑party punishment (TPP) as a costly signal of trustworthiness. For the second study on public goods games, this architecture is able to replicate an increase in cooperation from the spread of reputational information through gossip. However, an additional strategic component is necessary to replicate the additional boost in cooperation rates in the condition that allows both ostracism and gossip. We then test novel predictions for each paper with our validated generative agents. We find that TPP rates significantly drop in settings where punishment is anonymous, yet a substantial amount of TPP persists, suggesting that both reputational and intrinsic moral motivations play a role in this behavior. For the second paper, we introduce a novel intervention and see that open discussion periods before rounds of the public goods game further increase contributions, allowing groups to develop social norms for cooperation. This work provides a framework for validating generative agent models while demonstrating their potential to generate novel and testable insights into human social behavior.

PaperID: 9423, https://arxiv.org/pdf/2507.21969.pdf

Authors: Adam Kostka, Jarosław A. Chudziak

Title: Towards Cognitive Synergy in LLM-Based Multi-Agent Systems: Integrating Theory of Mind and Critical Evaluation

Abstract:
Recently, the field of Multi‑Agent Systems (MAS) has gained popularity as researchers are trying to develop artificial intelligence capable of efficient collective reasoning. Agents based on Large Language Models (LLMs) perform well in isolated tasks, yet struggle with higher‑order cognition required for adaptive collaboration. Human teams achieve synergy not only through knowledge sharing, but also through recursive reasoning, structured critique, and the ability to infer others' mental states. Current artificial systems lack these essential mechanisms, limiting their ability to engage in sophisticated collective reasoning. This work explores cognitive processes that enable effective collaboration, focusing on adaptive theory of mind (ToM) and systematic critical evaluation. We investigate three key questions. First, how does the ability to model others' perspectives enhance coordination and reduce redundant reasoning? Second, to what extent does structured critique improve reasoning quality by identifying logical gaps and mitigating biases? Third, the interplay of these mechanisms can lead to emergent cognitive synergy, where the collective intelligence of the system exceeds the sum of its parts. Through an empirical case study on complex decision making, we show that the integration of these cognitive mechanisms leads to more coherent, adaptive, and rigorous agent interactions. This article contributes to the field of cognitive science and AI research by presenting a structured framework that emulates human‑like collaborative reasoning MAS. It highlights the significance of dynamic ToM and critical evaluation in advancing multi‑agent systems' ability to tackle complex, real‑world challenges.

PaperID: 9424, https://arxiv.org/pdf/2507.21953.pdf

Authors: Yi Kong, Dianxi Shi, Guoli Yang, Zhang ke-di, Chenlin Huang, Xiaopeng Li, Songchang Jin

Title: MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation

Abstract:
The recent advancement of autonomous agents powered by Large Language Models (LLMs) has demonstrated significant potential for automating tasks on mobile devices through graphical user interfaces (GUIs). Despite initial progress, these agents still face challenges when handling complex real‑world tasks. These challenges arise from a lack of knowledge about real‑life mobile applications in LLM‑based agents, which may lead to ineffective task planning and even cause hallucinations. To address these challenges, we propose a novel LLM‑based agent framework called MapAgent that leverages memory constructed from historical trajectories to augment current task planning. Specifically, we first propose a trajectory‑based memory mechanism that transforms task execution trajectories into a reusable and structured page‑memory database. Each page within a trajectory is extracted as a compact yet comprehensive snapshot, capturing both its UI layout and functional context. Secondly, we introduce a coarse‑to‑fine task planning approach that retrieves relevant pages from the memory database based on similarity and injects them into the LLM planner to compensate for potential deficiencies in understanding real‑world app scenarios, thereby achieving more informed and context‑aware task planning. Finally, planned tasks are transformed into executable actions through a task executor supported by a dual‑LLM architecture, ensuring effective tracking of task progress. Experimental results in real‑world scenarios demonstrate that MapAgent achieves superior performance to existing methods. The code will be open‑sourced to support further research.

PaperID: 9425, https://arxiv.org/pdf/2507.21941.pdf

Authors: Mushuang Liu, Yan Wan, Frank Lewis, Subramanya Nageshrao, H. Eric Tseng, Dimitar Filev

Title: Hierarchical Game-Based Multi-Agent Decision-Making for Autonomous Vehicles

Abstract:
This paper develops a game‑theoretic decision‑making framework for autonomous driving in multi‑agent scenarios. A novel hierarchical game‑based decision framework is developed for the ego vehicle. This framework features an interaction graph, which characterizes the interaction relationships between the ego and its surrounding traffic agents (including AVs, human driven vehicles, pedestrians, and bicycles, and others), and enables the ego to smartly select a limited number of agents as its game players. Compared to the standard multi‑player games, where all surrounding agents are considered as game players, the hierarchical game significantly reduces the computational complexity. In addition, compared to pairwise games, the most popular approach in the literature, the hierarchical game promises more efficient decisions for the ego (in terms of less unnecessary waiting and yielding). To further reduce the computational cost, we then propose an improved hierarchical game, which decomposes the hierarchical game into a set of sub‑games. Decision safety and efficiency are analyzed in both hierarchical games. Comprehensive simulation studies are conducted to verify the effectiveness of the proposed frameworks, with an intersection‑crossing scenario as a case study.

PaperID: 9426, https://arxiv.org/pdf/2507.21846.pdf

Authors: Chenyuan Zhang, Cristian Rojas Cardenas, Hamid Rezatofighi, Mor Vered, Buser Say

Title: Probabilistic Active Goal Recognition

Abstract:
In multi‑agent environments, effective interaction hinges on understanding the beliefs and intentions of other agents. While prior work on goal recognition has largely treated the observer as a passive reasoner, Active Goal Recognition (AGR) focuses on strategically gathering information to reduce uncertainty. We adopt a probabilistic framework for Active Goal Recognition and propose an integrated solution that combines a joint belief update mechanism with a Monte Carlo Tree Search (MCTS) algorithm, allowing the observer to plan efficiently and infer the actor's hidden goal without requiring domain‑specific knowledge. Through comprehensive empirical evaluation in a grid‑based domain, we show that our joint belief update significantly outperforms passive goal recognition, and that our domain‑independent MCTS performs comparably to our strong domain‑specific greedy baseline. These results establish our solution as a practical and robust framework for goal inference, advancing the field toward more interactive and adaptive multi‑agent systems.

PaperID: 9427, https://arxiv.org/pdf/2507.21842.pdf

Authors: Marie-Therese Sekwenz, Daria Simons, Alina Wundsam

Title: Prompt template for a fictitious LLM agent in a content-flagging experiment

Abstract:
Digital regulations such as the European Union's Digital Services Act (DSA) represent major efforts to shape human‑centered and human rights‑based frameworks for society. Yet, as these laws are translated into practice, challenges emerge at the intersection of technology, law, and design. This paper presents a qualitative case study examining how designers act as mediators between abstract legal requirements and real‑world digital experiences for users, focusing on the design of content reporting mechanisms under Article 16 of the DSA. Through an expert workshop with professional designers from diverse fields (N=9), we explore how legal obligations are interpreted by designers and reflected in discussions and design solutions. Our findings resonate with previous research on the design of reporting mechanisms and dark patterns, highlighting how UX design choices can mislead or hinder users' decision‑making and therefore also highlighting the crucial role of design decisions. We show how participatory design methods can bridge disciplinary divides, making legal obligations accessible in compliance fostering design solutions. By using legal design as a lens, we argue that the co‑creation of digital regulations and user experience is a core site for digital humanism; where designers, engineers, and legal scholars must collaborate to ensure that systems uphold legal standards to address the challenge the regulation poses to these disciplines.

PaperID: 9428, https://arxiv.org/pdf/2507.21817.pdf

Authors: Yikun Li, Ngoc Tan Bui, Ting Zhang, Chengran Yang, Xin Zhou, Martin Weyssow, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo

Title: Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Abstract:
Automated vulnerability detection research has made substantial progress, yet its real‑world impact remains limited. Prior work found that current vulnerability datasets suffer from issues including label inaccuracy rates of 20%‑71%, extensive duplication, and poor coverage of critical Common Weakness Enumeration (CWE). These issues create a significant generalization gap where models achieve misleading In‑Distribution (ID) accuracies (testing on splits from the same dataset) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three‑part solution. First, we introduce BenchVul, which is a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct a high‑quality training dataset, TitanVul, comprising 38,548 functions by aggregating seven public sources and applying deduplication and validation using a novel multi‑agent LLM pipeline. Third, we propose a Realistic Vulnerability Generation (RVG) pipeline, which synthesizes context‑aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation reveals that In‑Distribution (ID) performance does not reliably predict Out‑of‑Distribution (OOD) performance on BenchVul. For example, a model trained on BigVul achieves the highest 0.703 ID accuracy but fails on BenchVul's real‑world samples (0.493 OOD accuracy). Conversely, a model trained on our TitanVul achieves the highest OOD performance on both the real‑world (0.881) and synthesized (0.785) portions of BenchVul, improving upon the next‑best performing dataset by 5.3% and 11.8% respectively, despite a modest ID score (0.590). Augmenting TitanVul with our RVG further boosts this leading OOD performance, improving accuracy on real‑world data by 5.8% (to 0.932).

PaperID: 9429, https://arxiv.org/pdf/2507.21795.pdf

Authors: Maria Alejandra Ramirez, Rosemarie Nagel, David Wolpert, Jürgen Jost

Title: Non-coercive extortion in game theory

Abstract:
Commitments play a crucial role in game theory, shaping strategic interactions by either altering a player's own payoffs or influencing the incentives of others through outcome‑contingent payments. While most research has focused on using commitments to achieve efficient equilibria, their potential applications beyond this goal remain largely unexplored. In this study, we introduce a non‑coercive extortion mechanism that leverages commitments to outcome‑contingent payments, demonstrating how a player or external agent can extract profit by offering rewards rather than threatening punishment. At the core of the mechanism is the introduction of sequentiality into a simultaneous‑move game, fundamentally reshaping the strategic interaction. We derive the conditions under which extortion is successful, identify the class of games susceptible to this scheme, and determine both the maximum extractable profit and the minimum required payment. To illustrate the extortion mechanism, we apply it to 2x2 games, highlighting how even simple strategic settings can be vulnerable to this form of manipulation. Our results reveal strategic vulnerabilities in competitive settings, with significant implications for economic markets, diplomatic relations, and multi‑agent systems operating in blockchain environments. This work broadens our understanding of commitments in game theory and raises critical questions about how to safeguard strategic interactions from exploitation through non‑coercive extortion.

PaperID: 9430, https://arxiv.org/pdf/2507.21694.pdf

Authors: Wenbo Liu, Forbes Hou, Jon Zhang, Hong Liu, Allen Lei

Title: A Multi-Agent Generative AI Framework for IC Module-Level Verification Automation

Abstract:
As large language models demonstrate enormous potential in the field of Electronic Design Automation (EDA), generative AI‑assisted chip design is attracting widespread attention from academia and industry. Although these technologies have made preliminary progress in tasks such as code generation, their application in chip verification ‑‑ a critical bottleneck in the chip development cycle ‑‑ remains at an exploratory stage. This paper proposes an innovative Multi‑Agent Verification Framework (MAVF) aimed at addressing the limitations of current single‑LLM approaches in complex verification tasks. Our framework builds an automated transformation system from design specifications to testbench through the collaborative work of multiple specialized agents, including specification parsing, verification strategy generation, and code implementation. Through verification experiments on multiple chip modules of varying complexity, results show that MAVF significantly outperforms traditional manual methods and single‑dialogue generative AI approaches in verification document parsing and generation, as well as automated testbench generation. This research opens new directions for exploring generative AI applications in verification automation, potentially providing effective approaches to solving the most challenging bottleneck issues in chip design.

PaperID: 9431, https://arxiv.org/pdf/2507.21667.pdf

Authors: Khushal Chaudhari, Krishanu Nath, Manas Kumar Bera

Title: Deep Neuro-Adaptive Sliding Mode Controller for Higher-Order Heterogeneous Nonlinear Multi-Agent Teams with Leader

Abstract:
This letter proposes a deep neural network (DNN)‑based neuro‑adaptive sliding mode control (SMC) strategy for leader‑follower tracking in multi‑agent systems with higher‑order, heterogeneous, nonlinear, and unknown dynamics under external disturbances. The DNN is used to compensate the unknown nonlinear dynamics with higher accuracy than shallow neural networks (NNs) and SMC ensures robust tracking. This framework employs restricted potential functions within a set‑theoretic paradigm to ensure system trajectories remain bounded within a compact set, improving robustness against approximation errors and external disturbances. The control scheme is grounded in non‑smooth Lyapunov stability theory, with update laws derived for both inner and outer layer network weights of DNN. A numerical example is simulated that showcases the proposed controller's effectiveness, adaptability, and robustness.

PaperID: 9432, https://arxiv.org/pdf/2507.21638.pdf

Authors: Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht

Title: Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

Abstract:
The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real‑world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open‑source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed‑ups for learning in physics‑based simulations. In terms of open‑loop wall‑clock time, Assistax runs up to 370× faster when vectorising training runs compared to CPU‑based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi‑agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero‑shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive‑autonomy/assistax.

PaperID: 9433, https://arxiv.org/pdf/2507.21636.pdf

Authors: Alessio Maritan

Title: StaffPro: an LLM Agent for Joint Staffing and Profiling

Abstract:
Large language model (LLM) agents integrate pre‑trained LLMs with modular algorithmic components and have shown remarkable reasoning and decision‑making abilities. In this work, we investigate their use for two tightly intertwined challenges in workforce management: staffing, i.e., the assignment and scheduling of tasks to workers, which may require team formation; and profiling, i.e., the continuous estimation of workers' skills, preferences, and other latent attributes from unstructured data. We cast these problems in a formal mathematical framework that links scheduling decisions to latent feature estimation, and we introduce StaffPro, an LLM agent that addresses staffing and profiling jointly. Differently from existing staffing solutions, StaffPro allows expressing optimization objectives using natural language, accepts textual task descriptions and provides high flexibility. StaffPro interacts directly with humans by establishing a continuous human‑agent feedback loop, ensuring natural and intuitive use. By analyzing human feedback, our agent continuously estimates the latent features of workers, realizing life‑long worker profiling and ensuring optimal staffing performance over time. A consulting firm simulation example demonstrates that StaffPro successfully estimates workers' attributes and generates high quality schedules. With its innovative design, StaffPro offers a robust, interpretable, and human‑centric solution for automated personnel management.

PaperID: 9434, https://arxiv.org/pdf/2507.21631.pdf

Authors: Miguel Faria, Francisco S. Melo, Ana Paiva

Title: "Teammates, Am I Clear?": Analysing Legible Behaviours in Teams

Abstract:
In this paper we investigate the notion of legibility in sequential decision‑making in the context of teams and teamwork. There have been works that extend the notion of legibility to sequential decision making, for deterministic and for stochastic scenarios. However, these works focus on one agent interacting with one human, foregoing the benefits of having legible decision making in teams of agents or in team configurations with humans. In this work we propose an extension of legible decision‑making to multi‑agent settings that improves the performance of agents working in collaboration. We showcase the performance of legible decision making in team scenarios using our proposed extension in multi‑agent benchmark scenarios. We show that a team with a legible agent is able to outperform a team composed solely of agents with standard optimal behaviour.

PaperID: 9435, https://arxiv.org/pdf/2507.21610.pdf

Authors: Ruiyang Hao, Haibao Yu, Jiaru Zhong, Chuanye Wang, Jiahao Wang, Yiming Kan, Wenxian Yang, Siqi Fan, Huilin Yin, Jianing Qiu, Yao Mu, Jiankai Sun, Li Chen, Walter Zimmer, Dandan Zhang, Shanghang Zhang, Mac Schwager, Ping Luo, Zaiqing Nie

Title: Research Challenges and Progress in the End-to-End V2X Cooperative Autonomous Driving Competition

Abstract:
With the rapid advancement of autonomous driving technology, vehicle‑to‑everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety by providing visibility beyond the line of sight. However, integrating multi‑source sensor data from both ego‑vehicles and infrastructure under real‑world constraints, such as limited communication bandwidth and dynamic environments, presents significant technical challenges. To facilitate research in this area, we organized the End‑to‑End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end‑to‑end planning. Built on the UniV2X framework and the V2X‑Seq‑SPD dataset, the challenge attracted participation from over 30 teams worldwide and established a unified benchmark for evaluating cooperative driving systems. This paper describes the design and outcomes of the challenge, highlights key research problems including bandwidth‑aware fusion, robust multi‑agent planning, and heterogeneous sensor integration, and analyzes emerging technical trends among top‑performing solutions. By addressing practical constraints in communication and data fusion, the challenge contributes to the development of scalable and reliable V2X‑cooperative autonomous driving systems.

PaperID: 9436, https://arxiv.org/pdf/2507.21547.pdf

Authors: Saeed Rahmani, Simeon C. Calvert, Bart van Arem

Title: Decentralized Modeling of Vehicular Maneuvers and Interactions at Urban Junctions

Abstract:
Modeling and evaluation of automated vehicles (AVs) in mixed‑autonomy traffic is essential prior to their safe and efficient deployment. This is especially important at urban junctions where complex multi‑agent interactions occur. Current approaches for modeling vehicular maneuvers and interactions at urban junctions have limitations in formulating non‑cooperative interactions and vehicle dynamics within a unified mathematical framework. Previous studies either assume predefined paths or rely on cooperation and central controllability, limiting their realism and applicability in mixed‑autonomy traffic. This paper addresses these limitations by proposing a modeling framework for trajectory planning and decentralized vehicular control at urban junctions. The framework employs a bi‑level structure where the upper level generates kinematically feasible reference trajectories using an efficient graph search algorithm with a custom heuristic function, while the lower level employs a predictive controller for trajectory tracking and optimization. Unlike existing approaches, our framework does not require central controllability or knowledge sharing among vehicles. The vehicle kinematics are explicitly incorporated at both levels, and acceleration and steering angle are used as control variables. This intuitive formulation facilitates analysis of traffic efficiency, environmental impacts, and motion comfort. The framework's decentralized structure accommodates operational and stochastic elements, such as vehicles' detection range, perception uncertainties, and reaction delay, making the model suitable for safety analysis. Numerical and simulation experiments across diverse scenarios demonstrate the framework's capability in modeling accurate and realistic vehicular maneuvers and interactions at various urban junctions, including unsignalized intersections and roundabouts.

PaperID: 9437, https://arxiv.org/pdf/2507.21504.pdf

Authors: Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip

Title: Evaluation and Benchmarking of LLM Agents: A Survey

Abstract:
The rise of LLM‑based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in‑depth overview of the emerging field of LLM agent evaluation, introducing a two‑dimensional taxonomy that organizes existing work along (1) evaluation objectives ‑‑ what to evaluate, such as agent behavior, capabilities, reliability, and safety ‑‑ and (2) evaluation process ‑‑ how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition to taxonomy, we highlight enterprise‑specific challenges, such as role‑based access to data, the need for reliability guarantees, dynamic and long‑horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and scalable evaluation. This work aims to bring clarity to the fragmented landscape of agent evaluation and provide a framework for systematic assessment, enabling researchers and practitioners to evaluate LLM agents for real‑world deployment.

PaperID: 9438, https://arxiv.org/pdf/2507.21428.pdf

Authors: Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke

Title: MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

Abstract:
Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi‑turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short‑term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi‑turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short‑term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool‑removal efficiency (90‑94% over a 3‑window average), while medium‑sized models exhibit significantly lower efficiency (0‑60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade‑offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.

PaperID: 9439, https://arxiv.org/pdf/2507.21407.pdf

Authors: Yixin Liu, Guibin Zhang, Kun Wang, Shiyuan Li, Shirui Pan

Title: Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects

Abstract:
Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in a wide range of applications, including web navigation, software development, and embodied control. While most LLMs are limited in several key agentic procedures, such as reliable planning, long‑term memory, tool management, and multi‑agent coordination, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph‑augmented LLM Agents (GLA), this paper offers a timely and comprehensive overview of recent advances and also highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multi‑agent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems. We hope this paper can serve as a roadmap for future research on GLA and foster a deeper understanding of the role of graphs in LLM agent systems.

PaperID: 9440, https://arxiv.org/pdf/2507.21385.pdf

Authors: Wei Mao, Lili Wei, Omid Semiari, Shu-ping Yeh, Hosein Nikopour

Title: Deep Reinforcement Learning-based Cell DTX/DRX Configuration for Network Energy Saving

Abstract:
3GPP Release 18 cell discontinuous transmission and reception (cell DTX/DRX) is an important new network energy saving feature for 5G. As a time‑domain technique, it periodically aggregates the user data transmissions in a given duration of time when the traffic load is not heavy, so that the remaining time can be kept silent and advanced sleep modes (ASM) can be enabled to shut down more radio components and save more energy for the cell. However, inevitably the packet delay is increased, as during the silent period no transmission is allowed. In this paper we study how to configure cell DTX/DRX to optimally balance energy saving and packet delay, so that for delay‑sensitive traffic maximum energy saving can be achieved while the degradation of quality of service (QoS) is minimized. As the optimal configuration can be different for different network and traffic conditions, the problem is complex and we resort to deep reinforcement learning (DRL) framework to train an AI agent to solve it. Through careful design of 1) the learning algorithm, which implements a deep Q‑network (DQN) on a contextual bandit (CB) model, and 2) the reward function, which utilizes a smooth approximation of a theoretically optimal but discontinuous reward function, we are able to train an AI agent that always tries to select the best possible Cell DTX/DRX configuration under any network and traffic conditions. Simulation results show that compared to the case when cell DTX/DRX is not used, our agent can achieve up to ~45% energy saving depending on the traffic load scenario, while always maintaining no more than ~1% QoS degradation.

PaperID: 9441, https://arxiv.org/pdf/2507.21382.pdf

Authors: Ruiyin Li, Yiran Zhang, Xiyu Zhou, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu

Title: MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration

Abstract:
Software architecture design is a critical, yet inherently complex and knowledge‑intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade‑offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time‑consuming and labor‑intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)‑based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision‑making. To address the challenges, we proposed MAAD (Multi‑Agent Architecture Design), an automated framework that employs a knowledge‑driven Multi‑Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes‑based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state‑of‑the‑art MAS baseline. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD's practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT‑4o, DeepSeek‑R1, and Llama 3.3) and found that GPT‑4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS‑driven architecture design.

PaperID: 9442, https://arxiv.org/pdf/2507.21360.pdf

Authors: Nicholas Botti, Flora Haberkorn, Charlotte Hoopes, Shaun Khan

Title: Efficacy of AI RAG Tools for Complex Information Extraction and Data Annotation Tasks: A Case Study Using Banks Public Disclosures

Abstract:
We utilize a within‑subjects design with randomized task assignments to understand the effectiveness of using an AI retrieval augmented generation (RAG) tool to assist analysts with an information extraction and data annotation task. We replicate an existing, challenging real‑world annotation task with complex multi‑part criteria on a set of thousands of pages of public disclosure documents from global systemically important banks (GSIBs) with heterogeneous and incomplete information content. We test two treatment conditions. First, a "naive" AI use condition in which annotators use only the tool and must accept the first answer they are given. And second, an "interactive" AI treatment condition where annotators use the tool interactively, and use their judgement to follow‑up with additional information if necessary. Compared to the human‑only baseline, the use of the AI tool accelerated task execution by up to a factor of 10 and enhanced task accuracy, particularly in the interactive condition. We find that when extrapolated to the full task, these methods could save up to 268 hours compared to the human‑only approach. Additionally, our findings suggest that annotator skill, not just with the subject matter domain, but also with AI tools, is a factor in both the accuracy and speed of task performance.

PaperID: 9443, https://arxiv.org/pdf/2507.21354.pdf

Authors: Monika Zamojska, Jarosław A. Chudziak

Title: Games Agents Play: Towards Transactional Analysis in LLM-based Multi-Agent Systems

Abstract:
Multi‑Agent Systems (MAS) are increasingly used to simulate social interactions, but most of the frameworks miss the underlying cognitive complexity of human behavior. In this paper, we introduce Trans‑ACT (Transactional Analysis Cognitive Toolkit), an approach embedding Transactional Analysis (TA) principles into MAS to generate agents with realistic psychological dynamics. Trans‑ACT integrates the Parent, Adult, and Child ego states into an agent's cognitive architecture. Each ego state retrieves context‑specific memories and uses them to shape response to new situations. The final answer is chosen according to the underlying life script of the agent. Our experimental simulation, which reproduces the Stupid game scenario, demonstrates that agents grounded in cognitive and TA principles produce deeper and context‑aware interactions. Looking ahead, our research opens a new way for a variety of applications, including conflict resolution, educational support, and advanced social psychology studies.

PaperID: 9444, https://arxiv.org/pdf/2507.21162.pdf

Authors: Xu Yang, Chenhui Lin, Yue Yang, Qi Wang, Haotian Liu, Haizhou Hua, Wenchuan Wu

Title: Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems

Abstract:
The increasing penetration of distributed energy resources into active distribution networks (ADNs) has made effective ADN dispatch imperative. However, the numerous newly‑integrated ADN operators, such as distribution system aggregators, virtual power plant managers, and end prosumers, often lack specialized expertise in power system operation, modeling, optimization, and programming. This knowledge gap renders reliance on human experts both costly and time‑intensive. To address this challenge and enable intelligent, flexible ADN dispatch, this paper proposes a large language model (LLM) powered automated modeling and optimization approach. First, the ADN dispatch problems are decomposed into sequential stages, and a multi‑LLM coordination architecture is designed. This framework comprises an Information Extractor, a Problem Formulator, and a Code Programmer, tasked with information retrieval, optimization problem formulation, and code implementation, respectively. Afterwards, tailored refinement techniques are developed for each LLM agent, greatly improving the accuracy and reliability of generated content. The proposed approach features a user‑centric interface that enables ADN operators to derive dispatch strategies via simple natural language queries, eliminating technical barriers and increasing efficiency. Comprehensive comparisons and end‑to‑end demonstrations on various test cases validate the effectiveness of the proposed architecture and methods.

PaperID: 9445, https://arxiv.org/pdf/2507.21159.pdf

Authors: Zhihao Peng, Liuxin Bao, Yixuan Yuan

Title: MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Abstract:
Large language models (LLMs) have proven effective in artificial intelligence, where the multi‑agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS‑based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision‑making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto‑optimal agent construction and cross‑consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision‑making capacity. Specifically, we first conduct a Pareto‑frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto‑optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade‑offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross‑consistency values, subsequently masking out the agent with the lowest cross‑consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.

PaperID: 9446, https://arxiv.org/pdf/2507.21146.pdf

Authors: Gauri Sharma, Vidhi Kulkarni, Miles King, Ken Huang

Title: Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems

Abstract:
Evolving AI systems increasingly deploy multi‑agent architectures where autonomous agents collaborate, share information, and delegate tasks through developing protocols. This connectivity, while powerful, introduces novel security risks. One such risk is a cascading risk: a breach in one agent can cascade through the system, compromising others by exploiting inter‑agent trust. In tandem with OWASP's initiative for an Agentic AI Vulnerability Scoring System we define an attack vector, Agent Cascading Injection, analogous to Agent Impact Chain and Blast Radius, operating across networks of agents. In an ACI attack, a malicious input or tool exploit injected at one agent leads to cascading compromises and amplified downstream effects across agents that trust its outputs. We formalize this attack with an adversarial goal equation and key variables (compromised agent, injected exploit, polluted observations, etc.), capturing how a localized vulnerability can escalate into system‑wide failure. We then analyze ACI's properties ‑‑ propagation chains, amplification factors, and inter‑agent compound effects ‑‑ and map these to OWASP's emerging Agentic AI risk categories (e.g. Impact Chain and Orchestration Exploits). Finally, we argue that ACI highlights a critical need for quantitative benchmarking frameworks to evaluate the security of agent‑to‑agent communication protocols. We outline a methodology for stress‑testing multi‑agent systems (using architectures such as Google's A2A and Anthropic's MCP) against cascading trust failures, developing upon groundwork for measurable, standardized agent‑to‑agent security evaluation. Our work provides the necessary apparatus for engineers to benchmark system resilience, make data‑driven architectural trade‑offs, and develop robust defenses against a new generation of agentic threats.

PaperID: 9447, https://arxiv.org/pdf/2507.21142.pdf

Authors: Chenhao Fang, Yanqing Peng, Rajeev Rao, Matt Sarmiento, Wendy Summer, Arya Pudota, Alex Goncalves, Jordi Mola, Hervé Robert

Title: Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents

Abstract:
Enterprise environments contain a heterogeneous, rapidly growing collection of internal artifacts related to code, data, and many different tools. Critical information for assessing privacy risk and ensuring regulatory compliance is often embedded across these varied resources, each with their own arcane discovery and extraction techniques. Therefore, large‑scale privacy compliance in adherence to governmental regulations requires systems to discern the interconnected nature of diverse artifacts in a common, shared universe. We present Privacy Artifact ConnecT or (PACT), an embeddings‑driven graph that links millions of artifacts spanning multiple artifact types generated by a variety of teams and projects. Powered by the state‑of‑the‑art DRAGON embedding model, PACT uses a contrastive learning objective with light fine‑tuning to link artifacts via their textual components such as raw metadata, ownership specifics, and compliance context. Experimental results show that PACT's fine‑tuned model improves recall@1 from 18% to 53%, the query match rate from 9.6% to 69.7% when paired with a baseline AI agent, and the hitrate@1 from 25.7% to 44.9% for candidate selection in a standard recommender system.

PaperID: 9448, https://arxiv.org/pdf/2507.21113.pdf

Authors: Farzana Abdulzada

Title: Vulnerability Mitigation System (VMS): LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Abstract:
As the frequency of cyber threats increases, conventional penetration testing is failing to capture the entirety of todays complex environments. To solve this problem, we propose the Vulnerability Mitigation System (VMS), a novel agent based on a Large Language Model (LLM) capable of performing penetration testing without human intervention. The VMS has a two‑part architecture for planning and a Summarizer, which enable it to generate commands and process feedback. To standardize testing, we designed two new Capture the Flag (CTF) benchmarks based on the PicoCTF and OverTheWire platforms with 200 challenges. These benchmarks allow us to evaluate how effectively the system functions. We performed a number of experiments using various LLMs while tuning the temperature and top‑p parameters and found that GPT‑4o performed best, sometimes even better than expected. The results indicate that LLMs can be effectively applied to many cybersecurity tasks; however, there are risks. To ensure safe operation, we used a containerized environment. Both the VMS and the benchmarks are publicly available, advancing the creation of secure, autonomous cybersecurity tools.

PaperID: 9449, https://arxiv.org/pdf/2507.21105.pdf

Authors: Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju

Title: AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis

Abstract:
The rise of Multi‑Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems are still facing challenges of inter‑agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and Agent‑to‑Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi‑protocol MAS framework with self‑implemented A2A and MCP, enabling dynamic coordination, flexible communication, and rapid development with faster iteration. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. The experiments are validated through both human evaluation and quantitative metrics, including BERTScore F1 (96.3%) and LLM‑as‑a‑Judge G‑Eval (87.1%). These results demonstrate robust automated inter‑agent coordination, query decomposition, task allocation, dynamic routing, and domain‑specific relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain‑specific, cooperative, and scalable conversational AI powered by MAS.

PaperID: 9450, https://arxiv.org/pdf/2507.21070.pdf

Authors: Vladislav Li, Ilias Siniosoglou, Panagiotis Sarigiannidis, Vasileios Argyriou

Title: Enhancing Manufacturing Training Through VR Simulations

Abstract:
In contemporary training for industrial manufacturing, reconciling theoretical knowledge with practical experience continues to be a significant difficulty. As companies transition to more intricate and technology‑oriented settings, conventional training methods frequently inadequately equip workers with essential practical skills while maintaining safety and efficiency. Virtual Reality has emerged as a transformational instrument to tackle this issue by providing immersive, interactive, and risk‑free teaching experiences. Through the simulation of authentic industrial environments, virtual reality facilitates the acquisition of vital skills for trainees within a regulated and stimulating context, therefore mitigating the hazards linked to experiential learning in the workplace. This paper presents a sophisticated VR‑based industrial training architecture aimed at improving learning efficacy via high‑fidelity simulations, dynamic and context‑sensitive scenarios, and adaptive feedback systems. The suggested system incorporates intuitive gesture‑based controls, reducing the learning curve for users across all skill levels. A new scoring metric, namely, VR Training Scenario Score (VRTSS), is used to assess trainee performance dynamically, guaranteeing ongoing engagement and incentive. The experimental assessment of the system reveals promising outcomes, with significant enhancements in information retention, task execution precision, and overall training efficacy. The results highlight the capability of VR as a crucial instrument in industrial training, providing a scalable, interactive, and efficient substitute for conventional learning methods.

PaperID: 9451, https://arxiv.org/pdf/2507.21067.pdf

Authors: Jan Kapusta

Title: SynLang and Symbiotic Epistemology: A Manifesto for Conscious Human-AI Collaboration

Abstract:
Current AI systems rely on opaque reasoning processes that hinder human oversight and collaborative potential. Conventional explainable AI approaches offer post‑hoc justifications and often fail to establish genuine symbiotic collaboration. In this paper, the Symbiotic Epistemology is presented as a philosophical foundation for human‑AI cognitive partnerships. Unlike frameworks that treat AI as a mere tool or replacement, symbiotic epistemology positions AI as a reasoning partner, fostering calibrated trust by aligning human confidence with AI reliability through explicit reasoning patterns and confidence assessments. SynLang (Symbiotic Syntactic Language) is introduced as a formal protocol for transparent human‑AI collaboration. The framework is empirically validated through actual human‑AI dialogues demonstrating AI's adaptation to structured reasoning protocols and successful metacognitive intervention. The protocol defines two complementary mechanisms: TRACE for high‑level reasoning patterns and TRACE_FE for detailed factor explanations. It also integrates confidence quantification, declarative control over AI behavior, and context inheritance for multi‑agent coordination. By structuring communication and embedding confidence‑calibrated transparency, SynLang, together with symbiotic epistemology, enables AI systems that enhance human intelligence, preserve human agency, and uphold ethical accountability in collaborative decision‑making. Through dual‑level transparency, beginning with high‑level reasoning patterns and progressing to granular explanations, the protocol facilitates rapid comprehension and supports thorough verification of AI decision‑making.

PaperID: 9452, https://arxiv.org/pdf/2507.21055.pdf

Authors: Leyi Ouyang

Title: Can Memory-Augmented LLM Agents Aid Journalism in Interpreting and Framing News for Diverse Audiences?

Abstract:
Modern news is often comprehensive, weaving together information from diverse domains, including technology, finance, and agriculture. This very comprehensiveness creates a challenge for interpretation, as audiences typically possess specialized knowledge related to their expertise, age, or standpoint. Consequently, a reader might fully understand the financial implications of a story but fail to grasp or even actively misunderstand its legal or technological dimensions, resulting in critical comprehension gaps. In this work, we investigate how to identify these comprehension gaps and provide solutions to improve audiences' understanding of news content, particularly in the aspects of articles outside their primary domains of knowledge. We propose MADES, an agent‑based framework designed to simulate societal communication. The framework utilizes diverse agents, each configured to represent a specific occupation or age group. Each agent is equipped with a memory system. These agents are then simulated to discuss the news. This process enables us to monitor and analyze their behavior and cognitive processes. Our findings indicate that the framework can identify confusions and misunderstandings within news content through its iterative discussion process. Based on these accurate identifications, the framework then designs supplementary material. We validated these outcomes using both statistical analysis and human evaluation, and the results show that agents exhibit significantly improved news understanding after receiving this supplementary material.

PaperID: 9453, https://arxiv.org/pdf/2507.21046.pdf

Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

Title: A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open‑ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift ‑‑ from scaling static models to developing self‑evolving agents ‑‑ has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self‑evolving agents, organizing the field around three foundational dimensions: what, when, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra‑test‑time, inter‑test‑time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single‑agent and multi‑agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self‑evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co‑evolutionary dynamics. By providing a structured framework for understanding and designing self‑evolving agents, this survey establishes a roadmap for advancing more adaptive, robust, and versatile agentic systems in both research and real‑world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI) where agents evolve autonomously and perform beyond human‑level intelligence across tasks.

PaperID: 9454, https://arxiv.org/pdf/2507.21028.pdf

Authors: Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang

Title: Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Abstract:
Nearly all human work is collaborative; thus, the evaluation of real‑world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM‑as‑a‑judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM‑as‑a‑judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ‑EVAL, a Multi‑Agent‑as‑Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in‑group debates with multi‑agents to Generate multi‑dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ‑EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM‑as‑a‑judge methods.

PaperID: 9455, https://arxiv.org/pdf/2507.20850.pdf

Authors: Meiting Dang, Yanping Wu, Yafei Wang, Dezong Zhao, David Flynn, Chongfeng Wei

Title: Free Energy-Inspired Cognitive Risk Integration for AV Navigation in Pedestrian-Rich Environments

Abstract:
Recent advances in autonomous vehicle (AV) behavior planning have shown impressive social interaction capabilities when interacting with other road users. However, achieving human‑like prediction and decision‑making in interactions with vulnerable road users remains a key challenge in complex multi‑agent interactive environments. Existing research focuses primarily on crowd navigation for small mobile robots, which cannot be directly applied to AVs due to inherent differences in their decision‑making strategies and dynamic boundaries. Moreover, pedestrians in these multi‑agent simulations follow fixed behavior patterns that cannot dynamically respond to AV actions. To overcome these limitations, this paper proposes a novel framework for modeling interactions between the AV and multiple pedestrians. In this framework, a cognitive process modeling approach inspired by the Free Energy Principle is integrated into both the AV and pedestrian models to simulate more realistic interaction dynamics. Specifically, the proposed pedestrian Cognitive‑Risk Social Force Model adjusts goal‑directed and repulsive forces using a fused measure of cognitive uncertainty and physical risk to produce human‑like trajectories. Meanwhile, the AV leverages this fused risk to construct a dynamic, risk‑aware adjacency matrix for a Graph Convolutional Network within a Soft Actor‑Critic architecture, allowing it to make more reasonable and informed decisions. Simulation results indicate that our proposed framework effectively improves safety, efficiency, and smoothness of AV navigation compared to the state‑of‑the‑art method.

PaperID: 9456, https://arxiv.org/pdf/2507.20796.pdf

Authors: Wei Lu, Amit Dhanda, Daniel L. Chen, Christian B. Hansen

Title: Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach

Abstract:
As large language models (LLMs) increasingly act as autonomous agents in markets and organizations, their behavior in strategic environments becomes economically consequential. We document that off‑the‑shelf LLM agents exhibit systematic deviations from payoff‑sensitive behavior in canonical economic games, including excessive cooperation and limited responsiveness to incentives. We introduce a supervised fine‑tuning approach that aligns agent behavior with explicit economic preferences. Specifically, we generate optimal strategies under two stylized utility specifications, homo economicus, which maximizes self‑interest, and homo moralis, which incorporates Kantian universalizability, and use these utility‑implied reasoning and strategies to guide fine‑tuning. Fine‑tuning on a small, theory‑driven synthetic dataset induces persistent and interpretable shifts in strategic behavior. In applications to moral dilemmas and repeated duopoly pricing, agents aligned to different preference structures produce systematically distinct equilibrium outcomes and pricing dynamics. These results frame AI alignment in multi‑agent settings as an objective‑design problem and illustrate how economic theory can guide the design of strategically coherent AI agents.

PaperID: 9457, https://arxiv.org/pdf/2507.20526.pdf

Authors: Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson

Title: Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Abstract:
Recent advances have enabled LLM‑powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red‑teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt‑injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark ‑ a curated set of high‑impact attacks ‑ and evaluate it across 19 state‑of‑the‑art models. Nearly all agents exhibit policy violations for most behaviors within 10‑100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference‑time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today's AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.

PaperID: 9458, https://arxiv.org/pdf/2507.20474.pdf

Authors: Siyi Wu, Junqiao Wang, Zhaoyang Guan, Leyi Zhao, Xinyuan Song, Xinyu Ying, Dexu Yu, Jinhao Wang, Hanlin Zhang, Michele Pak, Yangfan He, Yi Xin, Jianhui Wang, Tianyu Shi

Title: MountainLion: A Multi-Modal LLM-Based Agent System for Interpretable and Adaptive Financial Trading

Abstract:
Cryptocurrency trading is a challenging task requiring the integration of heterogeneous data from multiple modalities. Traditional deep learning and reinforcement learning approaches typically demand large training datasets and encode diverse inputs into numerical representations, often at the cost of interpretability. Recent progress in large language model (LLM)‑based agents has demonstrated the capacity to process multi‑modal data and support complex investment decision‑making. Building on these advances, we present MountainLion, a multi‑modal, multi‑agent system for financial trading that coordinates specialized LLM‑based agents to interpret financial data and generate investment strategies. MountainLion processes textual news, candlestick charts, and trading signal charts to produce high‑quality financial reports, while also enabling modification of reports and investment recommendations through data‑driven user interaction and question answering. A central reflection module analyzes historical trading signals and outcomes to continuously refine decision processes, and the system is capable of real‑time report analysis, summarization, and dynamic adjustment of investment strategies. Empirical results confirm that MountainLion systematically enriches technical price triggers with contextual macroeconomic and capital flow signals, providing a more interpretable, robust, and actionable investment framework that improves returns and strengthens investor confidence.

PaperID: 9459, https://arxiv.org/pdf/2507.20468.pdf

Authors: Antonino Castelli, Paolo Giudici, Alessandro Piergallini

Title: Building crypto portfolios with agentic AI

Abstract:
The rapid growth of crypto markets has opened new opportunities for investors, but at the same time exposed them to high volatility. To address the challenge of managing dynamic portfolios in such an environment, this paper presents a practical application of a multi‑agent system designed to autonomously construct and evaluate crypto‑asset allocations. Using data on daily frequencies of the ten most capitalized cryptocurrencies from 2020 to 2025, we compare two automated investment strategies. These are a static equal weighting strategy and a rolling‑window optimization strategy, both implemented to maximize the evaluation metrics of the Modern Portfolio Theory (MPT), such as Expected Return, Sharpe and Sortino ratios, while minimizing volatility. Each step of the process is handled by dedicated agents, integrated through a collaborative architecture in Crew AI. The results show that the dynamic optimization strategy achieves significantly better performance in terms of risk‑adjusted returns, both in‑sample and out‑of‑sample. This highlights the benefits of adaptive techniques in portfolio management, particularly in volatile markets such as cryptocurrency markets. The following methodology proposed also demonstrates how multi‑agent systems can provide scalable, auditable, and flexible solutions in financial automation.

PaperID: 9460, https://arxiv.org/pdf/2507.20377.pdf

Authors: Farshid Nooshi, Suining He

Title: Multi-Agent Reinforcement Learning for Dynamic Mobility Resource Allocation with Hierarchical Adaptive Grouping

Abstract:
Allocating mobility resources (e.g., shared bikes/e‑scooters, ride‑sharing vehicles) is crucial for rebalancing the mobility demand and supply in the urban environments. We propose in this work a novel multi‑agent reinforcement learning named Hierarchical Adaptive Grouping‑based Parameter Sharing (HAG‑PS) for dynamic mobility resource allocation. HAG‑PS aims to address two important research challenges regarding multi‑agent reinforcement learning for mobility resource allocation: (1) how to dynamically and adaptively share the mobility resource allocation policy (i.e., how to distribute mobility resources) across agents (i.e., representing the regional coordinators of mobility resources); and (2) how to achieve memory‑efficient parameter sharing in an urban‑scale setting. To address the above challenges, we have provided following novel designs within HAG‑PS. To enable dynamic and adaptive parameter sharing, we have designed a hierarchical approach that consists of global and local information of the mobility resource states (e.g., distribution of mobility resources). We have developed an adaptive agent grouping approach in order to split or merge the groups of agents based on their relative closeness of encoded trajectories (i.e., states, actions, and rewards). We have designed a learnable identity (ID) embeddings to enable agent specialization beyond simple parameter copy. We have performed extensive experimental studies based on real‑world NYC bike sharing data (a total of more than 1.2 million trips), and demonstrated the superior performance (e.g., improved bike availability) of HAG‑PS compared with other baseline approaches.

PaperID: 9461, https://arxiv.org/pdf/2507.20370.pdf

Authors: Michele Grimaldi, Carlo Cernicchiaro, Sebastian Realpe Rua, Alaaeddine El-Masri-El-Chaarani, Markus Buchholz, Loizos Michael, Pere Ridao Rodriguez, Ignacio Carlucho, Yvan R. Petillot

Title: Advancing Shared and Multi-Agent Autonomy in Underwater Missions: Integrating Knowledge Graphs and Retrieval-Augmented Generation

Abstract:
Robotic platforms have become essential for marine operations by providing regular and continuous access to offshore assets, such as underwater infrastructure inspection, environmental monitoring, and resource exploration. However, the complex and dynamic nature of underwater environments, characterized by limited visibility, unpredictable currents, and communication constraints, presents significant challenges that demand advanced autonomy while ensuring operator trust and oversight. Central to addressing these challenges are knowledge representation and reasoning techniques, particularly knowledge graphs and retrieval‑augmented generation (RAG) systems, that enable robots to efficiently structure, retrieve, and interpret complex environmental data. These capabilities empower robotic agents to reason, adapt, and respond effectively to changing conditions. The primary goal of this work is to demonstrate both multi‑agent autonomy and shared autonomy, where multiple robotic agents operate independently while remaining connected to a human supervisor. We show how a RAG‑powered large language model, augmented with knowledge graph data and domain taxonomy, enables autonomous multi‑agent decision‑making and facilitates seamless human‑robot interaction, resulting in 100% mission validation and behavior completeness. Finally, ablation studies reveal that without structured knowledge from the graph and/or taxonomy, the LLM is prone to hallucinations, which can compromise decision quality.

PaperID: 9462, https://arxiv.org/pdf/2507.20250.pdf

Authors: Tianyi Zhong, David Angeli

Title: A Truthful Mechanism Design for Distributed Optimisation Algorithms in Networks with Self-interested Agents

Abstract:
Enhancing resilience in multi‑agent systems in the face of selfish agents is an important problem that requires further characterisation. This work develops a truthful mechanism that avoids self‑interested and strategic agents maliciously manipulating the algorithm. We prove theoretically that the proposed mechanism incentivises self‑interested agents to participate and follow the provided algorithm faithfully. Additionally, the mechanism is compatible with any distributed optimisation algorithm that can calculate at least one subgradient at a given point. Finally, we present an illustrative example that shows the effectiveness of the mechanism.

PaperID: 9463, https://arxiv.org/pdf/2507.20230.pdf

Authors: Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao

Title: A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature

Abstract:
To fully expedite AI‑powered chemical research, high‑quality chemical databases are the foundation. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)‑based multi‑agent system for robust and automated chemical information extraction. It utilizes the MLLM's strong reasoning capability to understand the structure of diverse chemical graphics and decompose the extraction task into sub‑tasks. It then coordinates a set of specialized agents, each combining the capabilities of the MLLM with the precise, domain‑specific strengths of dedicated tools and web services, to solve the subtasks accurately and integrate the results into a unified output. Our system achieved an F1 score of 76.27% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature, surpassing the previous state‑of‑the‑art model (F1 score of 39.13%) by a significant margin. Additionally, it demonstrated versatile applicability in a range of other information extraction tasks, including molecular image recognition, reaction image parsing, named entity recognition and text‑based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI‑driven chemical research.

PaperID: 9464, https://arxiv.org/pdf/2507.20143.pdf

Authors: Zhonghan Ge, Yuanyang Zhu, Chunlin Chen

Title: Concept Learning for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Despite substantial progress in applying neural networks (NN) to multi‑agent reinforcement learning (MARL) areas, they still largely suffer from a lack of transparency and interoperability. However, its implicit cooperative mechanism is not yet fully understood due to black‑box networks. In this work, we study an interpretable value decomposition framework via concept bottleneck models, which promote trustworthiness by conditioning credit assignment on an intermediate level of human‑like cooperation concepts. To address this problem, we propose a novel value‑based method, named Concepts learning for Multi‑agent Q‑learning (CMQ), that goes beyond the current performance‑vs‑interpretability trade‑off by learning interpretable cooperation concepts. CMQ represents each cooperation concept as a supervised vector, as opposed to existing models where the information flowing through their end‑to‑end mechanism is concept‑agnostic. Intuitively, using individual action value conditioning on global state embeddings to represent each concept allows for extra cooperation representation capacity. Empirical evaluations on the StarCraft II micromanagement challenge and level‑based foraging (LBF) show that CMQ achieves superior performance compared with the state‑of‑the‑art counterparts. The results also demonstrate that CMQ provides more cooperation concept representation capturing meaningful cooperation modes, and supports test‑time concept interventions for detecting potential biases of cooperation mode and identifying spurious artifacts that impact cooperation.

PaperID: 9465, https://arxiv.org/pdf/2507.20038.pdf

Authors: Ilan Doron-Arad, Hadas Shachnai, Gilad Shmerler, Inbal Talgam-Cohen

Title: An Algorithm-to-Contract Framework without Demand Queries

Abstract:
Consider costly and time‑consuming tasks that add up to the success of a project, and must be fitted into a given time‑frame. This is an instance of the classic budgeted maximization (knapsack) problem, which admits an FPTAS. Now assume an agent is performing these tasks on behalf of a principal, who is the one to reap the rewards if the project succeeds. The principal must design a contract to incentivize the agent. Is there still an approximation scheme? In this work we lay the foundations for an algorithm‑to‑contract framework, which transforms algorithms for combinatorial problems to handle contract design problems subject to the same combinatorial constraints. Our approach diverges from previous works in avoiding the assumption of demand oracle access. As an example, for budgeted maximization, we show how to "lift" the classic FPTAS to the best‑possible (approximately‑IC) FPTAS for the contract problem. We establish this through our local‑to‑global framework, in which the local step is to approximately solve a two‑sided strengthened variant of the demand problem. The global step then utilizes the local one to find the approximately optimal contract. We apply our framework to a host of combinatorial constraints: multi‑dimensional budgets, budgeted matroid, and budgeted matching constraints. In all cases we essentially match the best purely algorithmic approximation. Separately, we also develop a method for multi‑agent contract settings. Our method yields the first approximation schemes for multi‑agent contract settings that go beyond additive reward functions.

PaperID: 9466, https://arxiv.org/pdf/2507.19902.pdf

Authors: Sourena Khanzadeh

Title: AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation

Abstract:
Software development is a complex, multi‑phase process traditionally requiring collaboration among individuals with diverse expertise. We propose AgentMesh, a Python‑based framework that uses multiple cooperating LLM‑powered agents to automate software development tasks. In AgentMesh, specialized agents ‑ a Planner, Coder, Debugger, and Reviewer ‑ work in concert to transform a high‑level requirement into fully realized code. The Planner agent first decomposes user requests into concrete subtasks; the Coder agent implements each subtask in code; the Debugger agent tests and fixes the code; and the Reviewer agent validates the final output for correctness and quality. We describe the architecture and design of these agents and their communication, and provide implementation details including prompt strategies and workflow orchestration. A case study illustrates AgentMesh handling a non‑trivial development request via sequential task planning, code generation, iterative debugging, and final code review. We discuss how dividing responsibilities among cooperative agents leverages the strengths of large language models while mitigating single‑agent limitations. Finally, we examine current limitations ‑ such as error propagation and context scaling ‑ and outline future work toward more robust, scalable multi‑agent AI systems for software engineering automation.

PaperID: 9467, https://arxiv.org/pdf/2507.19860.pdf

Authors: Haoze Dong, Meng Guo, Chengyi He, Zhongkui Li

Title: Homotopy-aware Multi-agent Navigation via Distributed Model Predictive Control

Abstract:
Multi‑agent trajectory planning requires ensuring both safety and efficiency, yet deadlocks remain a significant challenge, especially in obstacle‑dense environments. Such deadlocks frequently occur when multiple agents attempt to traverse the same long and narrow corridor simultaneously. To address this, we propose a novel distributed trajectory planning framework that bridges the gap between global path and local trajectory cooperation. At the global level, a homotopy‑aware optimal path planning algorithm is proposed, which fully leverages the topological structure of the environment. A reference path is chosen from distinct homotopy classes by considering both its spatial and temporal properties, leading to improved coordination among agents globally. At the local level, a model predictive control‑based trajectory optimization method is used to generate dynamically feasible and collision‑free trajectories. Additionally, an online replanning strategy ensures its adaptability to dynamic environments. Simulations and experiments validate the effectiveness of our approach in mitigating deadlocks. Ablation studies demonstrate that by incorporating time‑aware homotopic properties into the underlying global paths, our method can significantly reduce deadlocks and improve the average success rate from 4%‑13% to over 90% in randomly generated dense scenarios.

PaperID: 9468, https://arxiv.org/pdf/2507.19854.pdf

Authors: Anjali R. Menon, Rohit K. Sharma, Priya Singh, Chengyu Wang, Aurora M. Ferreira, Mateja Novak

Title: Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models

Abstract:
The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high‑level task planning. However, most current systems operate in an open‑loop fashion, where LLMs act as one‑shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T‑A‑L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed‑loop cycle where an LLM first "thinks" by decomposing high‑level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM‑driven self‑reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T‑A‑L agent significantly outperforms baseline methods, including open‑loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long‑horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.

PaperID: 9469, https://arxiv.org/pdf/2507.19851.pdf

Authors: Ye Wang, Haodong Jing, Yang Liao, Yongqiang Ma, Nanning Zheng

Title: PlaneHEC: Efficient Hand-Eye Calibration for Multi-view Robotic Arm via Any Point Cloud Plane Detection

Abstract:
Hand‑eye calibration is an important task in vision‑guided robotic systems and is crucial for determining the transformation matrix between the camera coordinate system and the robot end‑effector. Existing methods, for multi‑view robotic systems, usually rely on accurate geometric models or manual assistance, generalize poorly, and can be very complicated and inefficient. Therefore, in this study, we propose PlaneHEC, a generalized hand‑eye calibration method that does not require complex models and can be accomplished using only depth cameras, which achieves the optimal and fastest calibration results using arbitrary planar surfaces like walls and tables. PlaneHEC introduces hand‑eye calibration equations based on planar constraints, which makes it strongly interpretable and generalizable. PlaneHEC also uses a comprehensive solution that starts with a closed‑form solution and improves it withiterative optimization, which greatly improves accuracy. We comprehensively evaluated the performance of PlaneHEC in both simulated and real‑world environments and compared the results with other point‑cloud‑based calibration methods, proving its superiority. Our approach achieves universal and fast calibration with an innovative design of computational models, providing a strong contribution to the development of multi‑agent systems and embodied intelligence.

PaperID: 9470, https://arxiv.org/pdf/2507.19844.pdf

Authors: Biswarup Mukherjee, Li Zhou, S. Gokul Krishnan, Milad Kabirifar, Subhash Lakshminarayana, Charalambos Konstantinou

Title: VAE-GAN Based Price Manipulation in Coordinated Local Energy Markets

Abstract:
This paper introduces a model for coordinating prosumers with heterogeneous distributed energy resources (DERs), participating in the local energy market (LEM) that interacts with the market‑clearing entity. The proposed LEM scheme utilizes a data‑driven, model‑free reinforcement learning approach based on the multi‑agent deep deterministic policy gradient (MADDPG) framework, enabling prosumers to make real‑time decisions on whether to buy, sell, or refrain from any action while facilitating efficient coordination for optimal energy trading in a dynamic market. In addition, we investigate a price manipulation strategy using a variational auto encoder‑generative adversarial network (VAE‑GAN) model, which allows utilities to adjust price signals in a way that induces financial losses for the prosumers. Our results show that under adversarial pricing, heterogeneous prosumer groups, particularly those lacking generation capabilities, incur financial losses. The same outcome holds across LEMs of different sizes. As the market size increases, trading stabilizes and fairness improves through emergent cooperation among agents.

PaperID: 9471, https://arxiv.org/pdf/2507.19771.pdf

Authors: Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez

Title: Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation

Abstract:
Structural drawings are widely used in many fields, e.g., mechanical engineering, civil engineering, etc. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation view, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor‑intensive and time‑consuming for structural engineers. Here we introduce a novel generative AI‑based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval‑augmented generation (RAG) technique using externally‑sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing these to extract necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated and evaluated herein enables the efficient and direct conversion of a structural drawing's natural language description into an AutoCAD drawing, significantly reducing the workload compared to current working process associated with manual drawing production, facilitating the typical iterative process of engineers for expressing design ideas in a simplified way.

PaperID: 9472, https://arxiv.org/pdf/2507.19712.pdf

Authors: Ngoc Hung Nguyen, Nguyen Van Thieu, Quang-Trung Luu, Anh Tuan Nguyen, Senura Wanasekara, Nguyen Cong Luong, Fatemeh Kavehmadavani, Van-Dinh Nguyen

Title: Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Abstract:
In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)‑based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision‑making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic‑based evolutionary computing algorithm, namely the Chaotic Gaussian‑based Global ARO (CGG‑ARO), serving as a baseline for one‑slot optimization. Second, we design an enhanced reward‑based deep reinforcement learning (DRL) framework, referred to as the Multi‑agent Double Deep Q‑Network (MA‑DDQN), that integrates both multi‑agent coordination and multi‑action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG‑ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA‑DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.

PaperID: 9473, https://arxiv.org/pdf/2507.19657.pdf

Authors: Beining Wu, Jun Huang, Shui Yu

Title: "X of Information'' Continuum: A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems

Abstract:
The development of next‑generation networking systems has inherently shifted from throughput‑based paradigms towards intelligent, information‑aware designs that emphasize the quality, relevance, and utility of transmitted information, rather than sheer data volume. While classical network metrics, such as latency and packet loss, remain significant, they are insufficient to quantify the nuanced information quality requirements of modern intelligent applications, including autonomous vehicles, digital twins, and metaverse environments. In this survey, we present the first comprehensive study of the ``X of Information'' continuum by introducing a systematic four‑dimensional taxonomic framework that structures information metrics along temporal, quality/utility, reliability/robustness, and network/communication dimensions. We uncover the increasing interdependencies among these dimensions, whereby temporal freshness triggers quality evaluation, which in turn helps with reliability appraisal, ultimately enabling effective network delivery. Our analysis reveals that artificial intelligence technologies, such as deep reinforcement learning, multi‑agent systems, and neural optimization models, enable adaptive, context‑aware optimization of competing information quality objectives. In our extensive study of six critical application domains, covering autonomous transportation, industrial IoT, healthcare digital twins, UAV communications, LLM ecosystems, and metaverse settings, we illustrate the revolutionary promise of multi‑dimensional information metrics for meeting diverse operational needs. Our survey identifies prominent implementation challenges, including ...

PaperID: 9474, https://arxiv.org/pdf/2507.19635.pdf

Authors: Zain Asgar, Michelle Nguyen, Sachin Katti

Title: Efficient and Scalable Agentic AI with Heterogeneous Systems

Abstract:
AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi‑modal data input and conversion), data processing and context gathering (e.g vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent‑serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; a MLIR based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end‑to‑end SLA. Our design performs a systems level TCO optimization and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A preliminary surprising finding is that for some workloads a heterogeneous combination of older generation GPUs with newer accelerators can deliver similar TCO as the latest generation homogenous GPU infrastructure design, potentially extending the life of deployed infrastructure.

PaperID: 9475, https://arxiv.org/pdf/2507.19631.pdf

Authors: Xin Mao, Dan Wang, Wei Chen, Li Qiu

Title: Diversity and Interaction Quality of a Heterogeneous Multi-Agent System Applied to a Synchronization Problem

Abstract:
In this paper, scalable controller design to achieve output synchronization for a heterogeneous discrete‑time nonlinear multi‑agent system is considered. The agents are assumed to exhibit potentially nonlinear dynamics but share linear common oscillatory modes. In a distributed control architecture, scalability is ensured by designing a small number of distinguished controllers, significantly fewer than the number of agents, even when agent diversity is high. Our findings indicate that the number of controllers required can be effectively determined by the number of strongly connected components of the underlying graph. The study in this paper builds on the recently developed phase theory of matrices and systems. First, we employ the concept of matrix phase, specifically the phase alignability of a collection of matrices, to quantify agent diversity. Next, we use matrix phase, particularly the essential phase of the graph Laplacian, to evaluate the interaction quality among the agents. Based on these insights, we derive a sufficient condition for the solvability of the synchronization problem, framed as a trade‑off between the agent diversity and the interaction quality. In the process, a controller design procedure based on Lyapunov analysis is provided, which produces low gain, component‑wise synchronizing controllers when the solvability condition is satisfied. Numerical examples are given to illustrate the effectiveness of the proposed design procedure. Furthermore, we consider cases where the component‑wise controller design problem is unsolvable. We propose alternative strategies involving the design of a small inventory of controllers, which can still achieve synchronization effectively by employing certain clustering methods to manage heterogeneity.

PaperID: 9476, https://arxiv.org/pdf/2507.19593.pdf

Authors: Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

Title: A Survey on Hypergame Theory: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems

Abstract:
Classical game‑theoretic models typically assume rational agents, complete information, and common knowledge of payoffs ‑ assumptions that are often violated in real‑world MAS characterized by uncertainty, misaligned perceptions, and nested beliefs. To overcome these limitations, researchers have proposed extensions that incorporate models of cognitive constraints, subjective beliefs, and heterogeneous reasoning. Among these, hypergame theory extends the classical paradigm by explicitly modeling agents' subjective perceptions of the strategic scenario, known as perceptual games, in which agents may hold divergent beliefs about the structure, payoffs, or available actions. We present a systematic review of agent‑compatible applications of hypergame theory, examining how its descriptive capabilities have been adapted to dynamic and interactive MAS contexts. We analyze 44 selected studies from cybersecurity, robotics, social simulation, communications, and general game‑theoretic modeling. Building on a formal introduction to hypergame theory and its two major extensions ‑ hierarchical hypergames and HNF ‑ we develop agent‑compatibility criteria and an agent‑based classification framework to assess integration patterns and practical applicability. Our analysis reveals prevailing tendencies, including the prevalence of hierarchical and graph‑based models in deceptive reasoning and the simplification of extensive theoretical frameworks in practical applications. We identify structural gaps, including the limited adoption of HNF‑based models, the lack of formal hypergame languages, and unexplored opportunities for modeling human‑agent and agent‑agent misalignment. By synthesizing trends, challenges, and open research directions, this review provides a new roadmap for applying hypergame theory to enhance the realism and effectiveness of strategic modeling in dynamic multi‑agent environments.

PaperID: 9477, https://arxiv.org/pdf/2507.19550.pdf

Authors: Awid Vaziry, Sandro Rodriguez Garzon, Axel Küpper

Title: Towards Multi-Agent Economies: Enhancing the A2A Protocol with Ledger-Anchored Identities and x402 Micropayments for AI Agents

Abstract:
This research article presents a novel architecture to empower multi‑agent economies by addressing two critical limitations of the emerging Agent2Agent (A2A) communication protocol: decentralized agent discoverability and agent‑to‑agent micropayments. By integrating distributed ledger technology (DLT), this architecture enables tamper‑proof, on‑chain publishing of AgentCards as smart contracts, providing secure and verifiable agent identities. The architecture further extends A2A with the x402 open standard, facilitating blockchain‑agnostic, HTTP‑based micropayments via the HTTP 402 status code. This enables autonomous agents to seamlessly discover, authenticate, and compensate each other across organizational boundaries. This work further presents a comprehensive technical implementation and evaluation, demonstrating the feasibility of DLT‑based agent discovery and micropayments. The proposed approach lays the groundwork for secure, scalable, and economically viable multi‑agent ecosystems, advancing the field of agentic AI toward trusted, autonomous economic interactions.

PaperID: 9478, https://arxiv.org/pdf/2507.19498.pdf

Authors: Yue Wu, Xiaolan Chen, Weiyi Zhang, Shunming Liu, Wing Man Rita Sum, Xinyuan Wu, Xianwen Shang, Chea-su Kee, Mingguang He, Danli Shi

Title: ChatMyopia: An AI Agent for Pre-consultation Education in Primary Eye Care Settings

Abstract:
Large language models (LLMs) show promise for tailored healthcare communication but face challenges in interpretability and multi‑task integration particularly for domain‑specific needs like myopia, and their real‑world effectiveness as patient education tools has yet to be demonstrated. Here, we introduce ChatMyopia, an LLM‑based AI agent designed to address text and image‑based inquiries related to myopia. To achieve this, ChatMyopia integrates an image classification tool and a retrieval‑augmented knowledge base built from literature, expert consensus, and clinical guidelines. Myopic maculopathy grading task, single question examination and human evaluations validated its ability to deliver personalized, accurate, and safe responses to myopia‑related inquiries with high scalability and interpretability. In a randomized controlled trial (n=70, NCT06607822), ChatMyopia significantly improved patient satisfaction compared to traditional leaflets, enhancing patient education in accuracy, empathy, disease awareness, and patient‑eyecare practitioner communication. These findings highlight ChatMyopia's potential as a valuable supplement to enhance patient education and improve satisfaction with medical services in primary eye care settings.

PaperID: 9479, https://arxiv.org/pdf/2507.19364.pdf

Authors: Patrick Taillandier, Jean Daniel Zucker, Arnaud Grignard, Benoit Gaudou, Nghi Quang Huynh, Alexis Drogoul

Title: Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges

Abstract:
This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing their potential and limitations from a computational social science perspective. We first review recent findings on LLMs' ability to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while identifying persistent limitations such as cognitive biases, lack of grounded understanding, and behavioral inconsistencies. We then survey emerging applications of LLMs in multi‑agent simulation frameworks, examining system architectures, scalability, and validation strategies. Projects such as Generative Agents (Smallville) and AgentSociety are analyzed with respect to their empirical grounding and methodological design. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large‑scale LLM‑driven simulations. Finally, we distinguish between contexts where LLM‑based agents provide operational value‑such as interactive simulations and serious games‑and contexts where their use raises epistemic concerns, particularly in explanatory or predictive modeling. We argue that hybrid approaches integrating LLMs into established agent‑based modeling platforms such as GAMA and NetLogo may offer a promising compromise between expressive flexibility and analytical transparency. Building on this analysis, we outline a conceptual research direction termed Hybrid Constitutional Architectures, which proposes a stratified integration of classical agent‑based models (ABMs), small language models (SLMs), and LLMs within established platforms such as GAMA and NetLogo.

PaperID: 9480, https://arxiv.org/pdf/2507.19354.pdf

Authors: Melih Yazgan, Allen Xavier Arasan, J. Marius Zöllner

Title: EffiComm: Bandwidth Efficient Multi Agent Communication

Abstract:
Collaborative perception allows connected vehicles to exchange sensor information and overcome each vehicle's blind spots. Yet transmitting raw point clouds or full feature maps overwhelms Vehicle‑to‑Vehicle (V2V) communications, causing latency and scalability problems. We introduce EffiComm, an end‑to‑end framework that transmits less than 40% of the data required by prior art while maintaining state‑of‑the‑art 3D object detection accuracy. EffiComm operates on Bird's‑Eye‑View (BEV) feature maps from any modality and applies a two‑stage reduction pipeline: (1) Selective Transmission (ST) prunes low‑utility regions with a confidence mask; (2) Adaptive Grid Reduction (AGR) uses a Graph Neural Network (GNN) to assign vehicle‑specific keep ratios according to role and network load. The remaining features are fused with a soft‑gated Mixture‑of‑Experts (MoE) attention layer, offering greater capacity and specialization for effective feature integration. On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 while sending only an average of approximately 1.5 MB per frame, outperforming previous methods on the accuracy‑per‑bit curve. These results highlight the value of adaptive, learned communication for scalable Vehicle‑to‑Everything (V2X) perception.

PaperID: 9481, https://arxiv.org/pdf/2507.19151.pdf

Authors: Michael Amir, Guang Yang, Zhan Gao, Keisuke Okumura, Heedo Woo, Amanda Prorok

Title: ReCoDe: Reinforcement Learning-based Dynamic Constraint Design for Multi-Agent Coordination

Abstract:
Constraint‑based optimization is a cornerstone of robotics, enabling the design of controllers that reliably encode task and safety requirements such as collision avoidance or formation adherence. However, handcrafted constraints can fail in multi‑agent settings that demand complex coordination. We introduce ReCoDe‑‑Reinforcement‑based Constraint Design‑‑a decentralized, hybrid framework that merges the reliability of optimization‑based controllers with the adaptability of multi‑agent reinforcement learning. Rather than discarding expert controllers, ReCoDe improves them by learning additional, dynamic constraints that capture subtler behaviors, for example, by constraining agent movements to prevent congestion in cluttered scenarios. Through local communication, agents collectively constrain their allowed actions to coordinate more effectively under changing conditions. In this work, we focus on applications of ReCoDe to multi‑agent navigation tasks requiring intricate, context‑based movements and consensus, where we show that it outperforms purely handcrafted controllers, other hybrid approaches, and standard MARL baselines. We give empirical (real robot) and theoretical evidence that retaining a user‑defined controller, even when it is imperfect, is more efficient than learning from scratch, especially because ReCoDe can dynamically change the degree to which it relies on this controller.

PaperID: 9482, https://arxiv.org/pdf/2507.19146.pdf

Authors: Ahmed Abouelazm, Johannes Ratz, Philip Schörner, J. Marius Zöllner

Title: Diverse and Adaptive Behavior Curriculum for Autonomous Driving: A Student-Teacher Framework with Multi-Agent RL

Abstract:
Autonomous driving faces challenges in navigating complex real‑world traffic, requiring safe handling of both common and critical scenarios. Reinforcement learning (RL), a prominent method in end‑to‑end driving, enables agents to learn through trial and error in simulation. However, RL training often relies on rule‑based traffic scenarios, limiting generalization. Additionally, current scenario generation methods focus heavily on critical scenarios, neglecting a balance with routine driving behaviors. Curriculum learning, which progressively trains agents on increasingly complex tasks, is a promising approach to improving the robustness and coverage of RL driving policies. However, existing research mainly emphasizes manually designed curricula, focusing on scenery and actor placement rather than traffic behavior dynamics. This work introduces a novel student‑teacher framework for automatic curriculum learning. The teacher, a graph‑based multi‑agent RL component, adaptively generates traffic behaviors across diverse difficulty levels. An adaptive mechanism adjusts task difficulty based on student performance, ensuring exposure to behaviors ranging from common to critical. The student, though exchangeable, is realized as a deep RL agent with partial observability, reflecting real‑world perception constraints. Results demonstrate the teacher's ability to generate diverse traffic behaviors. The student, trained with automatic curricula, outperformed agents trained on rule‑based traffic, achieving higher rewards and exhibiting balanced, assertive driving.

PaperID: 9483, https://arxiv.org/pdf/2507.19143.pdf

Authors: Maria Zaitseva, Ivan Tomilov, Natalia Gusarova

Title: Game-Theoretic Gradient Control for Robust Neural Network Training

Abstract:
Feed‑forward neural networks (FFNNs) are vulnerable to input noise, reducing prediction performance. Existing regularization methods like dropout often alter network architecture or overlook neuron interactions. This study aims to enhance FFNN noise robustness by modifying backpropagation, interpreted as a multi‑agent game, and exploring controlled target variable noising. Our "gradient dropout" selectively nullifies hidden layer neuron gradients with probability 1 ‑ p during backpropagation, while keeping forward passes active. This is framed within compositional game theory. Additionally, target variables were perturbed with white noise or stable distributions. Experiments on ten diverse tabular datasets show varying impacts: improvement or diminishing of robustness and accuracy, depending on dataset and hyperparameters. Notably, on regression tasks, gradient dropout (p = 0.9) combined with stable distribution target noising significantly increased input noise robustness, evidenced by flatter MSE curves and more stable SMAPE values. These results highlight the method's potential, underscore the critical role of adaptive parameter tuning, and open new avenues for analyzing neural networks as complex adaptive systems exhibiting emergent behavior within a game‑theoretic framework.

PaperID: 9484, https://arxiv.org/pdf/2507.19096.pdf

Authors: Jinbo Hou, Stefanos Bakirtzis, Kehai Qiu, Sichong Liao, Hui Song, Haonan Hu, Kezhi Wang, Jie Zhang

Title: iPLAN: Redefining Indoor Wireless Network Planning Through Large Language Models

Abstract:
Efficient indoor wireless network (IWN) planning is crucial for providing high‑quality 5G in‑building services. However, traditional meta‑heuristic and artificial intelligence‑based planning methods face significant challenges due to the intricate interplay between indoor environments (IEs) and IWN demands. In this article, we present an indoor wireless network Planning with large LANguage models (iPLAN) framework, which integrates multi‑modal IE representations into large language model (LLM)‑powered optimizers to improve IWN planning. First, we instate the role of LLMs as optimizers, outlining embedding techniques for IEs, and introducing two core applications of iPLAN: (i) IWN planning based on pre‑existing IEs and (ii) joint design of IWN and IE for new wireless‑friendly buildings. For the former, we embed essential information into LLM optimizers by leveraging indoor descriptions, domain‑specific knowledge, and performance‑driven perception. For the latter, we conceptualize a multi‑agent strategy, where intelligent agents collaboratively address key planning sub‑tasks in a step‑by‑step manner while ensuring optimal trade‑offs between the agents. The simulation results demonstrate that iPLAN achieves superior performance in IWN planning tasks and optimizes building wireless performance through the joint design of IEs and IWNs, exemplifying a paradigm shift in IWN planning.

PaperID: 9485, https://arxiv.org/pdf/2507.19090.pdf

Authors: Haorui He, Yupeng Li, Dacheng Wen, Yang Chen, Reynold Cheng, Donglong Chen, Francis C. M. Lau

Title: Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Abstract:
State‑of‑the‑art single‑agent claim verification methods struggle with complex claims that require nuanced analysis of multifaceted evidence. Inspired by real‑world professional fact‑checkers, we propose DebateCV, the first debate‑driven claim verification framework powered by multiple LLM agents. In DebateCV, two Debaters argue opposing stances to surface subtle errors in single‑agent assessments. A decisive Moderator is then required to weigh the evidential strength of conflicting arguments to deliver an accurate verdict. Yet, zero‑shot Moderators are biased toward neutral judgments, and no datasets exist for training them. To bridge this gap, we propose Debate‑SFT, a post‑training framework that leverages synthetic data to enhance agents' ability to effectively adjudicate debates for claim verification. Results show that our methods surpass state‑of‑the‑art non‑debate approaches in both accuracy (across various evidence conditions) and justification quality.

PaperID: 9486, https://arxiv.org/pdf/2507.19050.pdf

Authors: Qiong Wu, Yu Xie, Pingyi Fan, Dong Qin, Kezhi Wang, Nan Cheng, Khaled B. Letaief

Title: Large Language Model-Based Task Offloading and Resource Allocation for Digital Twin Edge Computing Networks

Abstract:
In this paper, we propose a general digital twin edge computing network comprising multiple vehicles and a server. Each vehicle generates multiple computing tasks within a time slot, leading to queuing challenges when offloading tasks to the server. The study investigates task offloading strategies, queue stability, and resource allocation. Lyapunov optimization is employed to transform long‑term constraints into tractable short‑term decisions. To solve the resulting problem, an in‑context learning approach based on large language model (LLM) is adopted, replacing the conventional multi‑agent reinforcement learning (MARL) framework. Experimental results demonstrate that the LLM‑based method achieves comparable or even superior performance to MARL.

PaperID: 9487, https://arxiv.org/pdf/2507.19049.pdf

Authors: Mingjie Bi, Juan-Alberto Estrada-Garcia, Dawn M. Tilbury, Siqian Shen, Kira Barton

Title: Heterogeneous Risk Management Using a Multi-Agent Framework for Supply Chain Disruption Response

Abstract:
In the highly complex and stochastic global, supply chain environments, local enterprise agents seek distributed and dynamic strategies for agile responses to disruptions. Existing literature explores both centralized and distributed approaches, while most work neglects temporal dynamics and the heterogeneity of the risk management of individual agents. To address this gap, this letter presents a heterogeneous risk management mechanism to incorporate uncertainties and risk attitudes into agent communication and decision‑making strategy. Hence, this approach empowers enterprises to handle disruptions in stochastic environments in a distributed way, and in particular in the context of multi‑agent control and management. Through a simulated case study, we showcase the feasibility and effectiveness of the proposed approach under stochastic settings and how the decision of disruption responses changes when agents hold various risk attitudes.

PaperID: 9488, https://arxiv.org/pdf/2507.19043.pdf

Authors: Mingjie Bi, Ilya Kovalenko, Dawn M. Tilbury, Kira Barton

Title: Dynamic distributed decision-making for resilient resource reallocation in disrupted manufacturing systems

Abstract:
The COVID‑19 pandemic brings many unexpected disruptions, such as frequently shifting markets and limited human workforce, to manufacturers. To stay competitive, flexible and real‑time manufacturing decision‑making strategies are needed to deal with such highly dynamic manufacturing environments. One essential problem is dynamic resource allocation to complete production tasks, especially when a resource disruption (e.g., machine breakdown) occurs. Though multi‑agent methods have been proposed to solve the problem in a flexible and agile manner, the agent internal decision‑making process and resource uncertainties have rarely been studied. This work introduces a model‑based resource agent (RA) architecture that enables effective agent coordination and dynamic agent decision‑making. Based on the RA architecture, a rescheduling strategy that incorporates risk assessment via a clustering agent coordination strategy is also proposed. A simulation‑based case study is implemented to demonstrate dynamic rescheduling using the proposed multi‑agent framework. The results show that the proposed method reduces the computational efforts while losing some throughput optimality compared to the centralized method. Furthermore, the case study illustrates that incorporating risk assessment into rescheduling decision‑making improves the throughput.

PaperID: 9489, https://arxiv.org/pdf/2507.19038.pdf

Authors: Mingjie Bi, Dawn M. Tilbury, Siqian Shen, Kira Barton

Title: A Distributed Approach for Agile Supply Chain Decision-Making Based on Network Attributes

Abstract:
In recent years, the frequent occurrence of disruptions has had a negative impact on global supply chains. To stay competitive, enterprises strive to remain agile through the implementation of efficient and effective decision‑making strategies in reaction to disruptions. A significant effort has been made to develop these agile disruption mitigation approaches, leveraging both centralized and distributed decision‑making strategies. Though trade‑offs of centralized and distributed approaches have been analyzed in existing studies, no related work has been found on understanding supply chain performance based on the network attributes of the disrupted supply chain entities. In this paper, we characterize supply chains from a capability and network topological perspective and investigate the use of a distributed decision‑making approach based on classical multi‑agent frameworks. The performance of the distributed framework is evaluated through a comprehensive case study that investigates the performance of the supply chain as a function of the network structure and agent attributes within the network in the presence of a disruption. Comparison to a centralized decision‑making approach highlights trade‑offs between performance, computation time, and network communication based on the decision‑making strategy and network architecture. Practitioners can use the outcomes of our studies to design response strategies based on agent capabilities, network attributes, and desired supply chain performance.

PaperID: 9490, https://arxiv.org/pdf/2507.18993.pdf

Authors: Blaž Škrlj, Benoît Guilleminot, Andraž Tori

Title: Agent0: Leveraging LLM Agents to Discover Multi-value Features from Text for Enhanced Recommendations

Abstract:
Large language models (LLMs) and their associated agent‑based frameworks have significantly advanced automated information extraction, a critical component of modern recommender systems. While these multitask frameworks are widely used in code generation, their application in data‑centric research is still largely untapped. This paper presents Agent0, an LLM‑driven, agent‑based system designed to automate information extraction and feature construction from raw, unstructured text. Categorical features are crucial for large‑scale recommender systems but are often expensive to acquire. Agent0 coordinates a group of interacting LLM agents to automatically identify the most valuable text aspects for subsequent tasks (such as models or AutoML pipelines). Beyond its feature engineering capabilities, Agent0 also offers an automated prompt‑engineering tuning method that utilizes dynamic feedback loops from an oracle. Our findings demonstrate that this closed‑loop methodology is both practical and effective for automated feature discovery, which is recognized as one of the most challenging phases in current recommender system development.

PaperID: 9491, https://arxiv.org/pdf/2507.18957.pdf

Authors: Jianming Chang, Jieke Shi, Yunbo Lyu, Xin Zhou, Lulu Wang, Zhou Yang, Bixin Li, David Lo

Title: SLICEMATE: Accurate and Scalable Static Program Slicing via LLM-Powered Agents

Abstract:
Static program slicing, which extracts the executable portions of a program that affect the values at a specific location, supports many software analysis tasks such as debugging and security auditing. However, traditional slicing tools rely on computationally expensive reachability analysis over dependency graphs, which struggle to scale to large programs and often fail to handle code with incomplete syntax. Recently emerged learning‑based methods, while more robust to such cases, still fall short of achieving comparable performance to traditional methods on well‑formed code. In this work, we propose SliceMate, a novel static program slicing solution powered by Large Language Model (LLM) agents. It bypasses the need for explicit dependency graph construction and achieving superior slicing accuracy. Concretely, SliceMate integrates three specialized agents: (1) a synthesis agent that produces candidate slices by incrementally expanding the scan scope across functions and files guided by LLM‑inferred dependencies; (2) a verification agent that performs conciseness and completeness checks of the candidate slices, detecting missing or irrelevant statements; and (3) a refinement agent that repairs the slices with minimal edits in accordance with the verification results. These agents are orchestrated by a control module that ensures timely convergence and outputs high‑quality slices without manual intervention. For rigorous evaluation, we construct a new and high‑quality benchmark, SliceBench, comprising 2,200 manually annotated Java and Python programs, with program lengths ranging from 5 to 8,577 lines, significantly larger than those in existing slicing benchmarks. Experimental results show that SliceMate greatly outperforms both traditional and learning‑based slicing tools.

PaperID: 9492, https://arxiv.org/pdf/2507.18884.pdf

Authors: Ming Gong, Xucheng Huang, Ziheng Xu, Vijayan K. Asari

Title: MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Abstract:
High‑quality dialogue is crucial for e‑commerce customer service, yet traditional intent‑based systems struggle with dynamic, multi‑turn interactions. We present MindFlow+, a self‑evolving dialogue agent that learns domain‑specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data‑centric mechanisms to guide learning: tool‑augmented demonstration construction, which exposes the model to knowledge‑enhanced and agentic (ReAct‑style) interactions for effective tool use; and reward‑conditioned data modeling, which aligns responses with task‑specific goals using reward signals. To evaluate the model's role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real‑world e‑commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs tool reasoning, and reward‑guided learning to build domain‑specialized, context‑aware dialogue systems.

PaperID: 9493, https://arxiv.org/pdf/2507.18867.pdf

Authors: Xuefei Wu, Xiao Yin, Yuanyang Zhu, Chunlin Chen

Title: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning via Incorporating Generalized Human Expertise

Abstract:
Efficient exploration in multi‑agent reinforcement learning (MARL) is a challenging problem when receiving only a team reward, especially in environments with sparse rewards. A powerful method to mitigate this issue involves crafting dense individual rewards to guide the agents toward efficient exploration. However, individual rewards generally rely on manually engineered shaping‑reward functions that lack high‑order intelligence, thus it behaves ineffectively than humans regarding learning and generalization in complex problems. To tackle these issues, we combine the above two paradigms and propose a novel framework, LIGHT (Learning Individual Intrinsic reward via Incorporating Generalized Human experTise), which can integrate human knowledge into MARL algorithms in an end‑to‑end manner. LIGHT guides each agent to avoid unnecessary exploration by considering both individual action distribution and human expertise preference distribution. Then, LIGHT designs individual intrinsic rewards for each agent based on actionable representational transformation relevant to Q‑learning so that the agents align their action preferences with the human expertise while maximizing the joint action value. Experimental results demonstrate the superiority of our method over representative baselines regarding performance and better knowledge reusability across different sparse‑reward tasks on challenging scenarios.

PaperID: 9494, https://arxiv.org/pdf/2507.18812.pdf

Authors: Yiping Jia, Zhen Ming Jiang, Shayan Noei, Ying Zou

Title: MemoCoder: Automated Function Synthesis using LLM-Supported Agents

Abstract:
With the widespread adoption of Large Language Models (LLMs) such as GitHub Copilot and ChatGPT, developers increasingly rely on AI‑assisted tools to support code generation. While LLMs can generate syntactically correct solutions for well‑structured programming tasks, they often struggle with challenges that require iterative debugging, error handling, or adaptation to diverse problem structures. Existing approaches such as fine‑tuning or self‑repair strategies either require costly retraining or lack mechanisms to accumulate and reuse knowledge from previous attempts. To address these limitations, we propose MemoCoder, a multi‑agent framework that enables collaborative problem solving and persistent learning from past fixes. At the core of MemoCoder is a Fixing Knowledge Set, which stores successful repairs and supports retrieval for future tasks. A central Mentor Agent supervises the repair process by identifying recurring error patterns and refining high‑level fixing strategies, providing a novel supervisory role that guides the self‑repair loop. We evaluate MemoCoder across three public benchmarks ‑‑ MBPP, HumanEval, and LiveCodeBench ‑‑ spanning a range of problem complexities. Experimental results show that MemoCoder consistently outperforms both zero‑shot prompting and a Self‑Repair strategy, with improvements ranging from 3.1% to 12.1% in Pass@10 and from 1.4% to 14.5% in Pass@50, demonstrating its effectiveness in iterative refinement and knowledge‑guided code generation.

PaperID: 9495, https://arxiv.org/pdf/2507.18680.pdf

Authors: Óscar Fernández Vicente

Title: Market Making Strategies with Reinforcement Learning

Abstract:
This thesis presents the results of a comprehensive research project focused on applying Reinforcement Learning (RL) to the problem of market making in financial markets. Market makers (MMs) play a fundamental role in providing liquidity, yet face significant challenges arising from inventory risk, competition, and non‑stationary market dynamics. This research explores how RL, particularly Deep Reinforcement Learning (DRL), can be employed to develop autonomous, adaptive, and profitable market making strategies. The study begins by formulating the MM task as a reinforcement learning problem, designing agents capable of operating in both single‑agent and multi‑agent settings within a simulated financial environment. It then addresses the complex issue of inventory management using two complementary approaches: reward engineering and Multi‑Objective Reinforcement Learning (MORL). While the former uses dynamic reward shaping to guide behavior, the latter leverages Pareto front optimization to explicitly balance competing objectives. To address the problem of non‑stationarity, the research introduces POW‑dTS, a novel policy weighting algorithm based on Discounted Thompson Sampling. This method allows agents to dynamically select and combine pretrained policies, enabling continual adaptation to shifting market conditions. The experimental results demonstrate that the proposed RL‑based approaches significantly outperform traditional and baseline algorithmic strategies across various performance metrics. Overall, this research thesis contributes new methodologies and insights for the design of robust, efficient, and adaptive market making agents, reinforcing the potential of RL to transform algorithmic trading in complex financial systems.

PaperID: 9496, https://arxiv.org/pdf/2507.18597.pdf

Authors: Ethan Pronovost, Neha Boloor, Peter Schleede, Noureldin Hendy, Andres Morales, Nicholas Roy

Title: Linear Memory SE(2) Invariant Attention

Abstract:
Processing spatial data is a key component in many learning tasks for autonomous driving such as motion forecasting, multi‑agent simulation, and planning. Prior works have demonstrated the value in using SE(2) invariant network architectures that consider only the relative poses between objects (e.g. other agents, scene features such as traffic lanes). However, these methods compute the relative poses for all pairs of objects explicitly, requiring quadratic memory. In this work, we propose a mechanism for SE(2) invariant scaled dot‑product attention that requires linear memory relative to the number of objects in the scene. Our SE(2) invariant transformer architecture enjoys the same scaling properties that have benefited large language models in recent years. We demonstrate experimentally that our approach is practical to implement and improves performance compared to comparable non‑invariant architectures.

PaperID: 9497, https://arxiv.org/pdf/2507.18374.pdf

Authors: Filippos Bellos, Yayuan Li, Cary Shu, Ruey Day, Jeffrey M. Siskind, Jason J. Corso

Title: Towards Effective Human-in-the-Loop Assistive AI Agents

Abstract:
Effective human‑AI collaboration for physical task completion has significant potential in both everyday activities and professional domains. AI agents equipped with informative guidance can enhance human performance, but evaluating such collaboration remains challenging due to the complexity of human‑in‑the‑loop interactions. In this work, we introduce an evaluation framework and a multimodal dataset of human‑AI interactions designed to assess how AI guidance affects procedural task performance, error reduction and learning outcomes. Besides, we develop an augmented reality (AR)‑equipped AI agent that provides interactive guidance in real‑world tasks, from cooking to battlefield medicine. Through human studies, we share empirical insights into AI‑assisted human performance and demonstrate that AI‑assisted collaboration improves task completion.

PaperID: 9498, https://arxiv.org/pdf/2507.18333.pdf

Authors: Kale-ab Abebe Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey

Title: Remembering the Markov Property in Cooperative MARL

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec‑POMDP), where agents must reason about the environment and other agents' behaviour. In practice, current model‑free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co‑adapting agents can learn brittle conventions, which then fail when partnered with non‑adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec‑POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory‑based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co‑adapted agreements.

PaperID: 9499, https://arxiv.org/pdf/2507.18267.pdf

Authors: Zeqin Liao, Zibin Zheng, Peifan Reng, Henglong Liang, Zixu Gao, Zhixiang Chen, Wei Li, Yuhong Nan

Title: An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs

Abstract:
Embodied Artificial Intelligence Robots (EAIR) is an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in‑depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle EAIR system bugs. To bridge this gap, we conducted the first systematic study of 885 EAIR system bugs collected from 80 EAIR system projects to investigate their symptoms, underlying causes, and module distribution. Our analysis takes considerable effort, which classifies these bugs into 18 underlying causes, 15 distinct symptoms, and identifies 13 affected modules. It reveals several new interesting findings and implications which help shed light on future research on tackling or repairing EAIR system bugs. First, among the 15 identified symptoms, our findings highlight 8 symptoms specific to EAIR systems, which is characterized by severe functional failures and potential physical hazards. Second, within the 18 underlying causes, we define 8 EAIR‑specific causes, the majority of which stem from the intricate issues of AI‑ agent reasoning and decision making. Finally, to facilitate precise and efficient bug prediction, detection, and repair, we constructed a mapping between underlying causes and the modules in which they most frequently occur, which enables researchers to focus diagnostic efforts on the modules most susceptible to specific bug types.

PaperID: 9500, https://arxiv.org/pdf/2507.18110.pdf

Authors: Yi Wang, Dawei Qiu, Fei Teng, Goran Strbac

Title: Two-Stage TSO-DSO Services Provision Framework for Electric Vehicle Coordination

Abstract:
High renewable penetration has been witnessed in power systems, resulting in reduced system inertia and increasing requirements for frequency response services. Electric vehicles (EVs), owing to their vehicle‑to‑grid (V2G) capabilities, can provide cost‑effective frequency services for transmission system operators (TSOs). However, EVs that are inherently connected to distribution networks may pose voltage security issues for distribution system operators (DSOs) when supporting TSO frequency. To coordinate both TSO frequency and DSO voltage, this paper proposes a two‑stage service provision framework for multi‑EVs. At stage one, EVs participate in day‑ahead TSO‑DSO interactions for frequency reserve schedules; at stage two, EVs make real‑time dispatching behaviors in distribution networks for reserve delivery while supporting DSO voltage. Considering the potentially large EV number and environment complexity, a decentralized operation paradigm is introduced for real‑time EV dispatches at stage two, while a communication‑efficient reinforcement learning (RL) algorithm is proposed to reduce the communication overhead during large‑scale multi‑agent RL training without compromising policy performance. Case studies are carried out on a 6‑bus transmission and 33‑bus distribution network as well as a 69‑bus distribution network to evaluate the effectiveness and scalability of the proposed method in enabling EVs for frequency service and voltage support.

PaperID: 9501, https://arxiv.org/pdf/2507.18095.pdf

Authors: Yi Wang, Dawei Qiu, Fei Teng, Goran Strbac

Title: Towards Microgrid Resilience Enhancement via Mobile Power Sources and Repair Crews: A Multi-Agent Reinforcement Learning Approach

Abstract:
Mobile power sources (MPSs) have been gradually deployed in microgrids as critical resources to coordinate with repair crews (RCs) towards resilience enhancement owing to their flexibility and mobility in handling the complex coupled power‑transport systems. However, previous work solves the coordinated dispatch problem of MPSs and RCs in a centralized manner with the assumption that the communication network is still fully functioning after the event. However, there is growing evidence that certain extreme events will damage or degrade communication infrastructure, which makes centralized decision making impractical. To fill this gap, this paper formulates the resilience‑driven dispatch problem of MPSs and RCs in a decentralized framework. To solve this problem, a hierarchical multi‑agent reinforcement learning method featuring a two‑level framework is proposed, where the high‑level action is used to switch decision‑making between power and transport networks, and the low‑level action constructed via a hybrid policy is used to compute continuous scheduling and discrete routing decisions in power and transport networks, respectively. The proposed method also uses an embedded function encapsulating system dynamics to enhance learning stability and scalability. Case studies based on IEEE 33‑bus and 69‑bus power networks are conducted to validate the effectiveness of the proposed method in load restoration.

PaperID: 9502, https://arxiv.org/pdf/2507.17875.pdf

Authors: R. Spencer Hallyburton, Miroslav Pajic

Title: Trusted Data Fusion, Multi-Agent Autonomy, Autonomous Vehicles

Abstract:
Multi‑agent collaboration enhances situational awareness in intelligence, surveillance, and reconnaissance (ISR) missions. Ad hoc networks of unmanned aerial vehicles (UAVs) allow for real‑time data sharing, but they face security challenges due to their decentralized nature, making them vulnerable to cyber‑physical attacks. This paper introduces a trust‑based framework for assured sensor fusion in distributed multi‑agent networks, utilizing a hidden Markov model (HMM)‑based approach to estimate the trustworthiness of agents and their provided information in a decentralized fashion. Trust‑informed data fusion prioritizes fusing data from reliable sources, enhancing resilience and accuracy in contested environments. To evaluate the assured sensor fusion under attacks on system/mission sensing, we present a novel multi‑agent aerial dataset built from the Unreal Engine simulator. We demonstrate through case studies improved ISR performance and an ability to detect malicious actors in adversarial settings.

PaperID: 9503, https://arxiv.org/pdf/2507.17874.pdf

Authors: SaiBarath Sundar, Pranav Satheesan, Udayaadithya Avadhanam

Title: I2I-STRADA -- Information to Insights via Structured Reasoning Agent for Data Analysis

Abstract:
Recent advances in agentic systems for data analysis have emphasized automation of insight generation through multi‑agent frameworks, and orchestration layers. While these systems effectively manage tasks like query translation, data transformation, and visualization, they often overlook the structured reasoning process underlying analytical thinking. Reasoning large language models (LLMs) used for multi‑step problem solving are trained as general‑purpose problem solvers. As a result, their reasoning or thinking steps do not adhere to fixed processes for specific tasks. Real‑world data analysis requires a consistent cognitive workflow: interpreting vague goals, grounding them in contextual knowledge, constructing abstract plans, and adapting execution based on intermediate outcomes. We introduce I2I‑STRADA (Information‑to‑Insight via Structured Reasoning Agent for Data Analysis), an agentic architecture designed to formalize this reasoning process. I2I‑STRADA focuses on modeling how analysis unfolds via modular sub‑tasks that reflect the cognitive steps of analytical reasoning. Evaluations on the DABstep and DABench benchmarks show that I2I‑STRADA outperforms prior systems in planning coherence and insight alignment, highlighting the importance of structured cognitive workflows in agent design for data analysis.

PaperID: 9504, https://arxiv.org/pdf/2507.17852.pdf

Authors: Yao Fehlis, Charles Crain, Aidan Jensen, Michael Watson, James Juhasz, Paul Mandel, Betty Liu, Shawn Mahon, Daren Wilson, Nick Lynch-Jonely, Ben Leedom, David Fuller

Title: Technical Implementation of Tippy: Multi-Agent Architecture and System Design for Drug Discovery Laboratory Automation

Abstract:
Building on the conceptual framework presented in our previous work on agentic AI for pharmaceutical research, this paper provides a comprehensive technical analysis of Tippy's multi‑agent system implementation for drug discovery laboratory automation. We present a distributed microservices architecture featuring five specialized agents (Supervisor, Molecule, Lab, Analysis, and Report) that coordinate through OpenAI Agents SDK orchestration and access laboratory tools via the Model Context Protocol (MCP). The system architecture encompasses agent‑specific tool integration, asynchronous communication patterns, and comprehensive configuration management through Git‑based tracking. Our production deployment strategy utilizes Kubernetes container orchestration with Helm charts, Docker containerization, and CI/CD pipelines for automated testing and deployment. The implementation integrates vector databases for RAG functionality and employs an Envoy reverse proxy for secure external access. This work demonstrates how specialized AI agents can effectively coordinate complex laboratory workflows while maintaining security, scalability, reliability, and integration with existing laboratory infrastructure through standardized protocols.

PaperID: 9505, https://arxiv.org/pdf/2507.17835.pdf

Authors: Simone Fiorellino, Claudio Battiloro, Emilio Calvanese Strinati, Paolo Di Lorenzo

Title: Frame-Based Zero-Shot Semantic Channel Equalization for AI-Native Communications

Abstract:
In future AI‑native wireless networks, the presence of mismatches between the latent spaces of independently designed and trained deep neural network (DNN) encoders may impede mutual understanding due to the emergence of semantic channel noise. This undermines the receiver's ability to interpret transmitted representations, thereby reducing overall system performance. To address this issue, we propose the Parseval Frame Equalizer (PFE), a zero‑shot, frame‑based semantic channel equalizer that aligns latent spaces of heterogeneous encoders without requiring system retraining. PFE enables dynamic signal compression and expansion, mitigating semantic noise while preserving performance on downstream tasks. Building on this capability, we introduce a dynamic optimization strategy that coordinates communication, computation, and learning resources to balance energy consumption, end‑to‑end (E2E) latency, and task performance in multi‑agent semantic communication scenarios. Extensive simulations confirm the effectiveness of our approach in maintaining semantic consistency and meeting long‑term constraints on latency and accuracy under diverse and time‑varying network conditions.

PaperID: 9506, https://arxiv.org/pdf/2507.17761.pdf

Authors: Jan-Christoph Kalo, Fina Polat, Shubha Guha, Paul Groth

Title: Co-constructing Explanations for AI Systems using Provenance

Abstract:
Modern AI systems are complex workflows containing multiple components and data sources. Data provenance provides the ability to interrogate and potentially explain the outputs of these systems. However, provenance is often too detailed and not contextualized for the user trying to understand the AI system. In this work, we present our vision for an interactive agent that works together with the user to co‑construct an explanation that is simultaneously useful to the user as well as grounded in data provenance. To illustrate this vision, we present: 1) an initial prototype of such an agent; and 2) a scalable evaluation framework based on user simulations and a large language model as a judge approach.

PaperID: 9507, https://arxiv.org/pdf/2507.17753.pdf

Authors: Liang Zhang, Xiaoming Zhai, Jionghao Lin, Jionghao Lin, Jennifer Kleiman, Diego Zapata-Rivera, Carol Forsyth, Yang Jiang, Xiangen Hu, Arthur C. Graesser

Title: Exploring Communication Strategies for Collaborative LLM Agents in Mathematical Problem-Solving

Abstract:
Large Language Model (LLM) agents are increasingly utilized in AI‑aided education to support tutoring and learning. Effective communication strategies among LLM agents improve collaborative problem‑solving efficiency and facilitate cost‑effective adoption in education. However, little research has systematically evaluated the impact of different communication strategies on agents' problem‑solving. Our study examines four communication modes, teacher‑student interaction, peer‑to‑peer collaboration, reciprocal peer teaching, and critical debate, in a dual‑agent, chat‑based mathematical problem‑solving environment using the OpenAI GPT‑4o model. Evaluated on the MATH dataset, our results show that dual‑agent setups outperform single agents, with peer‑to‑peer collaboration achieving the highest accuracy. Dialogue acts like statements, acknowledgment, and hints play a key role in collaborative problem‑solving. While multi‑agent frameworks enhance computational tasks, effective communication strategies are essential for tackling complex problems in AI education.

PaperID: 9508, https://arxiv.org/pdf/2507.17695.pdf

Authors: Ilias Chatzistefanidis, Navid Nikaein

Title: Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks

Abstract:
Large Language Model (LLM)‑based autonomous agents are expected to play a vital role in the evolution of 6G networks, by empowering real‑time decision‑making related to management and service provisioning to end‑users. This shift facilitates the transition from a specialized intelligence approach, where artificial intelligence (AI) algorithms handle isolated tasks, to artificial general intelligence (AGI)‑driven networks, where agents possess broader reasoning capabilities and can manage diverse network functions. In this paper, we introduce a novel agentic paradigm that combines LLMs with real‑time optimization algorithms towards Trustworthy AI, defined as symbiotic agents. Optimizers at the LLM's input‑level provide bounded uncertainty steering for numerically precise tasks, whereas output‑level optimizers supervised by the LLM enable adaptive real‑time control. We design and implement two novel agent types including: (i) Radio Access Network optimizers, and (ii) multi‑agent negotiators for Service‑Level Agreements (SLAs). We further propose an end‑to‑end architecture for AGI networks and evaluate it on a 5G testbed capturing channel fluctuations from moving vehicles. Results show that symbiotic agents reduce decision errors fivefold compared to standalone LLM‑based agents, while smaller language models (SLM) achieve similar accuracy with a 99.9% reduction in GPU resource overhead and in near‑real‑time loops of 82 ms. A multi‑agent demonstration for collaborative RAN on the real‑world testbed highlights significant flexibility in service‑level agreement and resource allocation, reducing RAN over‑utilization by approximately 44%. Drawing on our findings and open‑source implementations, we introduce the symbiotic paradigm as the foundation for next‑generation, AGI‑driven networks‑systems designed to remain adaptable, efficient, and trustworthy even as LLMs advance.

PaperID: 9509, https://arxiv.org/pdf/2507.17680.pdf

Authors: Yongchao Zeng, Calum Brown, Ioannis Kyriakou, Ronja Hotz, Mark Rounsevell

Title: Simulating multiple human perspectives in socio-ecological systems using large language models

Abstract:
Understanding socio‑ecological systems requires insights from diverse stakeholder perspectives, which are often hard to access. To enable alternative, simulation‑based exploration of different stakeholder perspectives, we develop the HoPeS (Human‑Oriented Perspective Shifting) modelling framework. HoPeS employs agents powered by large language models (LLMs) to represent various stakeholders; users can step into the agent roles to experience perspectival differences. A simulation protocol serves as a "scaffold" to streamline multiple perspective‑taking simulations, supporting users in reflecting on, transitioning between, and integrating across perspectives. A prototype system is developed to demonstrate HoPeS in the context of institutional dynamics and land use change, enabling both narrative‑driven and numerical experiments. In an illustrative experiment, a user successively adopts the perspectives of a system observer and a researcher ‑ a role that analyses data from the embedded land use model to inform evidence‑based decision‑making for other LLM agents representing various institutions. Despite the user's effort to recommend technically sound policies, discrepancies persist between the policy recommendation and implementation due to stakeholders' competing advocacies, mirroring real‑world misalignment between researcher and policymaker perspectives. The user's reflection highlights the subjective feelings of frustration and disappointment as a researcher, especially due to the challenge of maintaining political neutrality while attempting to gain political influence. Despite this, the user exhibits high motivation to experiment with alternative narrative framing strategies, suggesting the system's potential in exploring different perspectives. Further system and protocol refinement are likely to enable new forms of interdisciplinary collaboration in socio‑ecological simulations.

PaperID: 9510, https://arxiv.org/pdf/2507.17433.pdf

Authors: Hugh Adams, Srijoni Majumdar, Evangelos Pournaras

Title: Fair Compromises in Participatory Budgeting: a Multi-Agent Deep Reinforcement Learning Approach

Abstract:
Participatory budgeting is a method of collectively understanding and addressing spending priorities where citizens vote on how a budget is spent, it is regularly run to improve the fairness of the distribution of public funds. Participatory budgeting requires voters to make decisions on projects which can lead to ``choice overload". A multi‑agent reinforcement learning approach to decision support can make decision making easier for voters by identifying voting strategies that increase the winning proportion of their vote. This novel approach can also support policymakers by highlighting aspects of election design that enable fair compromise on projects. This paper presents a novel, ethically aligned approach to decision support using multi‑agent deep reinforcement learning modelling. This paper introduces a novel use of a branching neural network architecture to overcome scalability challenges of multi‑agent reinforcement learning in a decentralized way. Fair compromises are found through optimising voter actions towards greater representation of voter preferences in the winning set. Experimental evaluation with real‑world participatory budgeting data reveals a pattern in fair compromise: that it is achievable through projects with smaller cost.

PaperID: 9511, https://arxiv.org/pdf/2507.17311.pdf

Authors: Zijie Guo, Jiong Wang, Fenghua Ling, Wangxu Wei, Xiaoyu Yue, Zhe Jiang, Wanghan Xu, Jing-Jia Luo, Lijing Cheng, Yoo-Geun Ham, Fengfei Song, Pierre Gentine, Toshio Yamagata, Ben Fei, Wenlong Zhang, Xinyu Gu, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Bowen Zhou, Lei Bai

Title: A Self-Evolving AI Agent System for Climate Science

Abstract:
Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi‑sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self‑evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human‑like cross‑disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large‑scale climate tasks, including model‑observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human‑AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern‑ai.org.cn.

PaperID: 9512, https://arxiv.org/pdf/2507.17275.pdf

Authors: Po-Yen Wu, Cheng-Yu Kuo, Yuki Kadokawa, Takamitsu Matsubara

Title: Prolonging Tool Life: Learning Skillful Use of General-purpose Tools through Lifespan-guided Reinforcement Learning

Abstract:
In inaccessible environments with uncertain task demands, robots often rely on general‑purpose tools that lack predefined usage strategies. These tools are not tailored for particular operations, making their longevity highly sensitive to how they are used. This creates a fundamental challenge: how can a robot learn a tool‑use policy that both completes the task and prolongs the tool's lifespan? In this work, we address this challenge by introducing a reinforcement learning (RL) framework that incorporates tool lifespan as a factor during policy optimization. Our framework leverages Finite Element Analysis (FEA) and Miner's Rule to estimate Remaining Useful Life (RUL) based on accumulated stress, and integrates the RUL into the RL reward to guide policy learning toward lifespan‑guided behavior. To handle the fact that RUL can only be estimated after task execution, we introduce an Adaptive Reward Normalization (ARN) mechanism that dynamically adjusts reward scaling based on estimated RULs, ensuring stable learning signals. We validate our method across simulated and real‑world tool use tasks, including Object‑Moving and Door‑Opening with multiple general‑purpose tools. The learned policies consistently prolong tool lifespan (up to 8.01x in simulation) and transfer effectively to real‑world settings, demonstrating the practical value of learning lifespan‑guided tool use strategies.

PaperID: 9513, https://arxiv.org/pdf/2507.17188.pdf

Authors: Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen, Dusit Niyato

Title: LLM Meets the Sky: Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks

Abstract:
This work tackles the physical layer security (PLS) problem of maximizing the secrecy rate in heterogeneous UAV networks (HetUAVNs) under propulsion energy constraints. Unlike prior studies that assume uniform UAV capabilities or overlook energy‑security trade‑offs, we consider a realistic scenario where UAVs with diverse payloads and computation resources collaborate to serve ground terminals in the presence of eavesdroppers. To manage the complex coupling between UAV motion and communication, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)‑based S2DC algorithm combining penalty functions and difference‑of‑convex (d.c.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)‑guided heuristic multi‑agent reinforcement learning approach (LLM‑HeMARL) for trajectory optimization. LLM‑HeMARL efficiently incorporates expert heuristics policy generated by the LLM, enabling UAVs to learn energy‑aware, security‑driven trajectories without the inference overhead of real‑time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.

PaperID: 9514, https://arxiv.org/pdf/2507.17183.pdf

Authors: Die Hu, Shuyue Hu, Chunjiang Mu, Shiqi Fan, Chen Chu, Jinzhuo Liu, Zhen Wang

Title: Regret Minimization in Population Network Games: Vanishing Heterogeneity and Convergence to Equilibria

Abstract:
Understanding and predicting the behavior of large‑scale multi‑agents in games remains a fundamental challenge in multi‑agent systems. This paper examines the role of heterogeneity in equilibrium formation by analyzing how smooth regret‑matching drives a large number of heterogeneous agents with diverse initial policies toward unified behavior. By modeling the system state as a probability distribution of regrets and analyzing its evolution through the continuity equation, we uncover a key phenomenon in diverse multi‑agent settings: the variance of the regret distribution diminishes over time, leading to the disappearance of heterogeneity and the emergence of consensus among agents. This universal result enables us to prove convergence to quantal response equilibria in both competitive and cooperative multi‑agent settings. Our work advances the theoretical understanding of multi‑agent learning and offers a novel perspective on equilibrium selection in diverse game‑theoretic scenarios.

PaperID: 9515, https://arxiv.org/pdf/2507.17134.pdf

Authors: Mariam ALMutairi, Hyungmin Kim

Title: Resilient Multi-Agent Negotiation for Medical Supply Chains:Integrating LLMs and Blockchain for Transparent Coordination

Abstract:
Global health emergencies, such as the COVID‑19 pandemic, have exposed critical weaknesses in traditional medical supply chains, including inefficiencies in resource allocation, lack of transparency, and poor adaptability to dynamic disruptions. This paper presents a novel hybrid framework that integrates blockchain technology with a decentralized, large language model (LLM) powered multi‑agent negotiation system to enhance the resilience and accountability of medical supply chains during crises. In this system, autonomous agents‑representing manufacturers, distributors, and healthcare institutions‑engage in structured, context‑aware negotiation and decision‑making processes facilitated by LLMs, enabling rapid and ethical allocation of scarce medical resources. The off‑chain agent layer supports adaptive reasoning and local decision‑making, while the on‑chain blockchain layer ensures immutable, transparent, and auditable enforcement of decisions via smart contracts. The framework also incorporates a formal cross‑layer communication protocol to bridge decentralized negotiation with institutional enforcement. A simulation environment emulating pandemic scenarios evaluates the system's performance, demonstrating improvements in negotiation efficiency, fairness of allocation, supply chain responsiveness, and auditability. This research contributes an innovative approach that synergizes blockchain trust guarantees with the adaptive intelligence of LLM‑driven agents, providing a robust and scalable solution for critical supply chain coordination under uncertainty.

PaperID: 9516, https://arxiv.org/pdf/2507.17131.pdf

Authors: Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi

Title: Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Abstract:
Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine‑tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self‑dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine‑tuning and existing self‑improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.

PaperID: 9517, https://arxiv.org/pdf/2507.17080.pdf

Authors: Ramin Giahi, Kehui Yao, Sriram Kollipara, Kai Zhao, Vahid Mirjalili, Jianpeng Xu, Topojoy Biswas, Evren Korpeoglu, Kannan Achan

Title: VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

Abstract:
Multimodal learning plays a critical role in e‑commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision‑language models, such as CLIP, face key challenges in e‑commerce recommendation systems: 1) Weak object‑level alignment, where global image embeddings fail to capture fine‑grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross‑modal matching; and 3) Domain mismatch, as generic vision‑language models may not generalize well to e‑commerce‑specific data. To address these limitations, we propose a framework, VL‑CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine‑grained visual understanding and an LLM‑based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e‑commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision‑language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object‑aware visual grounding and LLM‑enhanced text representation for robust multimodal recommendations.

PaperID: 9518, https://arxiv.org/pdf/2507.17061.pdf

Authors: Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao

Title: Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

Abstract:
Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi‑agent frameworks often rely on static workflows, fixed roles, and limited inter‑agent communication, reducing their effectiveness in open‑ended, high‑complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high‑ambiguity subtasks with evaluator‑driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi‑agent LLM systems.

PaperID: 9519, https://arxiv.org/pdf/2507.17054.pdf

Authors: Shao-Hung Chan, Thomy Phan, Jiaoyang Li, Sven Koenig

Title: New Mechanisms in Flex Distribution for Bounded Suboptimal Multi-Agent Path Finding

Abstract:
Multi‑Agent Path Finding (MAPF) is the problem of finding a set of collision‑free paths, one for each agent in a shared environment. Its objective is to minimize the sum of path costs (SOC), where the path cost of each agent is defined as the travel time from its start location to its target location. Explicit Estimation Conflict‑Based Search (EECBS) is the leading algorithm for bounded‑suboptimal MAPF, with the SOC of the solution being at most a user‑specified factor w away from optimal. EECBS maintains sets of paths and a lower bound LB on the optimal SOC. Then, it iteratively selects a set of paths whose SOC is at most w \cdot LB and introduces constraints to resolve collisions. For each path in a set, EECBS maintains a lower bound on its optimal path that satisfies constraints. By finding an individually bounded‑suboptimal path with cost at most a threshold of w times its lower bound, EECBS guarantees to find a bounded‑suboptimal solution. To speed up EECBS, previous work uses flex distribution to increase the threshold. Though EECBS with flex distribution guarantees to find a bounded‑suboptimal solution, increasing the thresholds may push the SOC beyond w \cdot LB, forcing EECBS to switch among different sets of paths instead of resolving collisions on a particular set of paths, and thus reducing efficiency. To address this issue, we propose Conflict‑Based Flex Distribution that distributes flex in proportion to the number of collisions. We also estimate the delays needed to satisfy constraints and propose Delay‑Based Flex Distribution. On top of that, we propose Mixed‑Strategy Flex Distribution, combining both in a hierarchical framework. We prove that EECBS with our new flex distribution mechanisms is complete and bounded‑suboptimal. Our experiments show that our approaches outperform the original (greedy) flex distribution.

PaperID: 9520, https://arxiv.org/pdf/2507.17049.pdf

Authors: Pablo Valle, Chengjie Lu, Shaukat Ali, Aitor Arrieta

Title: Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots

Abstract:
Visual Language Action (VLA) models are a multi‑modal class of Artificial Intelligence (AI) systems that integrate visual perception, natural language understanding, and action planning to enable agents to interpret their environment, comprehend instructions, and perform embodied tasks autonomously. Recently, significant progress has been made to advance this field. These kinds of models are typically evaluated through task success rates, which fail to capture the quality of task execution and the mode's confidence in its decisions. In this paper, we propose eight uncertainty metrics and five quality metrics specifically designed for VLA models for robotic manipulation tasks. We assess their effectiveness through a large‑scale empirical study involving 908 successful task executions from three state‑of‑the‑art VLA models across four representative robotic manipulation tasks. Human domain experts manually labeled task quality, allowing us to analyze the correlation between our proposed metrics and expert judgments. The results reveal that several metrics show moderate to strong correlation with human assessments, highlighting their utility for evaluating task quality and model confidence. Furthermore, we found that some of the metrics can discriminate between high‑, medium‑, and low‑quality executions from unsuccessful tasks, which can be interesting when test oracles are not available. Our findings challenge the adequacy of current evaluation practices that rely solely on binary success rates and pave the way for improved real‑time monitoring and adaptive enhancement of VLA‑enabled robotic systems.

PaperID: 9521, https://arxiv.org/pdf/2507.17012.pdf

Authors: Zhihan Zhang, Alexander Metzger, Yuxuan Mei, Felix Hähnlein, Zachary Englhardt, Tingyu Cheng, Gregory D. Abowd, Shwetak Patel, Adriana Schulz, Vikram Iyer

Title: Sustainability assessment using multimodal AI agents

Abstract:
Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi‑agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life‑cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This reduces data gaps and data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprint within 19% of expert LCAs with zero proprietary data (typical of the variation between human LCAs). We also show that by encoding domain‑specific knowledge, environmental impact estimation can be reframed as a data‑driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.

PaperID: 9522, https://arxiv.org/pdf/2507.16971.pdf

Authors: Aleksandr Perevalov, Andreas Both

Title: Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning

Abstract:
Accessing knowledge via multilingual natural‑language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural‑language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule‑based or neural‑based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human‑inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement ‑ guided by an experience pool for in‑context learning ‑ mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia‑ and Corporate‑based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human‑like reasoning systems in multilingual semantic parsing.

PaperID: 9523, https://arxiv.org/pdf/2507.16941.pdf

Authors: Daniel Correa, Tero Kaarlela, Jose Fuentes, Paulo Padrao, Alain Duran, Leonardo Bobadilla

Title: Multi-agent Reinforcement Learning for Robotized Coral Reef Sample Collection

Abstract:
This paper presents a reinforcement learning (RL) environment for developing an autonomous underwater robotic coral sampling agent, a crucial coral reef conservation and research task. Using software‑in‑the‑loop (SIL) and hardware‑in‑the‑loop (HIL), an RL‑trained artificial intelligence (AI) controller is developed using a digital twin (DT) in simulation and subsequently verified in physical experiments. An underwater motion capture (MOCAP) system provides real‑time 3D position and orientation feedback during verification testing for precise synchronization between the digital and physical domains. A key novelty of this approach is the combined use of a general‑purpose game engine for simulation, deep RL, and real‑time underwater motion capture for an effective zero‑shot sim‑to‑real strategy.

PaperID: 9524, https://arxiv.org/pdf/2507.16874.pdf

Authors: Raz Beck, Roni Stern

Title: Budget Allocation Policies for Real-Time Multi-Agent Path Finding

Abstract:
Multi‑Agent Path finding (MAPF) is the problem of finding paths for a set of agents such that each agent reaches its desired destination while avoiding collisions with the other agents. This problem arises in many robotics applications, such as automated warehouses and swarms of drones. Many MAPF solvers are designed to run offline, that is, first generate paths for all agents and then execute them. In real‑world scenarios, waiting for a complete solution before allowing any robot to move is often impractical. Real‑time MAPF (RT‑MAPF) captures this setting by assuming that agents must begin execution after a fixed planning period, referred to as the planning budget, and execute a fixed number of actions, referred to as the execution window. This results in an iterative process in which a short plan is executed, while the next execution window is planned concurrently. Existing solutions to RT‑MAPF iteratively call windowed versions of MAPF algorithms in every planning period, without explicitly considering the size of the planning budget. We address this gap and explore different policies for allocating the planning budget in windowed versions of MAPF‑LNS2, a state‑of‑the‑art MAPF algorithm. Our exploration shows that the baseline approach in which all agents draw from a shared planning budget pool is ineffective in challenging scenarios. Instead, policies that intelligently distribute the planning budget among agents are able to solve more problem instances in less time.

PaperID: 9525, https://arxiv.org/pdf/2507.16864.pdf

Authors: Tao Xu, Dung-Yang Lee, Momiao Xiong

Title: Reinforcement Learning in hyperbolic space for multi-step reasoning

Abstract:
Multi‑step reasoning is a fundamental challenge in artificial intelligence, with applications ranging from mathematical problem‑solving to decision‑making in dynamic environments. Reinforcement Learning (RL) has shown promise in enabling agents to perform multi‑step reasoning by optimizing long‑term rewards. However, conventional RL methods struggle with complex reasoning tasks due to issues such as credit assignment, high‑dimensional state representations, and stability concerns. Recent advancements in Transformer architectures and hyperbolic geometry have provided novel solutions to these challenges. This paper introduces a new framework that integrates hyperbolic Transformers into RL for multi‑step reasoning. The proposed approach leverages hyperbolic embeddings to model hierarchical structures effectively. We present theoretical insights, algorithmic details, and experimental results that include Frontier Math and nonlinear optimal control problems. Compared to RL with vanilla transformer, the hyperbolic RL largely improves accuracy by (32%~44%) on FrontierMath benchmark, (43%~45%) on nonlinear optimal control benchmark, while achieving impressive reduction in computational time by (16%~32%) on FrontierMath benchmark, (16%~17%) on nonlinear optimal control benchmark. Our work demonstrates the potential of hyperbolic Transformers in reinforcement learning, particularly for multi‑step reasoning tasks that involve hierarchical structures.

PaperID: 9526, https://arxiv.org/pdf/2507.16848.pdf

Authors: Boyu Qiao, Kun Li, Wei Zhou, Songlin Hu

Title: Dynamic Simulation Framework for Disinformation Dissemination and Correction With Social Bots

Abstract:
In the human‑bot symbiotic information ecosystem, social bots play key roles in spreading and correcting disinformation. Understanding their influence is essential for risk control and better governance. However, current studies often rely on simplistic user and network modeling, overlook the dynamic behavior of bots, and lack quantitative evaluation of correction strategies. To fill these gaps, we propose MADD, a Multi Agent based framework for Disinformation Dissemination. MADD constructs a more realistic propagation network by integrating the Barabasi Albert Model for scale free topology and the Stochastic Block Model for community structures, while designing node attributes based on real world user data. Furthermore, MADD incorporates both malicious and legitimate bots, with their controlled dynamic participation allows for quantitative analysis of correction strategies. We evaluate MADD using individual and group level metrics. We experimentally verify the real world consistency of MADD user attributes and network structure, and we simulate the dissemination of six disinformation topics, demonstrating the differential effects of fact based and narrative based correction strategies.

PaperID: 9527, https://arxiv.org/pdf/2507.16842.pdf

Authors: Yinan Meng, Kun Qian, Jiong Yang, Renbo Su, Zhenhong Li, Charlie C. L. Wang

Title: Sensor-Space Based Robust Kinematic Control of Redundant Soft Manipulator by Learning

Abstract:
The intrinsic compliance and high degree of freedom (DoF) of redundant soft manipulators facilitate safe interaction and flexible task execution. However, effective kinematic control remains highly challenging, as it must handle deformations caused by unknown external loads and avoid actuator saturation due to improper null‑space regulation ‑ particularly in confined environments. In this paper, we propose a Sensor‑Space Imitation Learning Kinematic Control (SS‑ILKC) framework to enable robust kinematic control under actuator saturation and restrictive environmental constraints. We employ a dual‑learning strategy: a multi‑goal sensor‑space control framework based on reinforcement learning principle is trained in simulation to develop robust control policies for open spaces, while a generative adversarial imitation learning approach enables effective policy learning from sparse expert demonstrations for confined spaces. To enable zero‑shot real‑world deployment, a pre‑processed sim‑to‑real transfer mechanism is proposed to mitigate the simulation‑to‑reality gap and accurately characterize actuator saturation limits. Experimental results demonstrate that our method can effectively control a pneumatically actuated soft manipulator, achieving precise path‑following and object manipulation in confined environments under unknown loading conditions.

PaperID: 9528, https://arxiv.org/pdf/2507.16809.pdf

Authors: Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Zhen-Yu Lin, Pin-Cheng Chen, Shu-Kai Hsieh

Title: LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Abstract:
We propose LingBench++, a linguistically‑informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low‑resource and cross‑cultural languages. We further develop a multi‑agent architecture integrating grammatical knowledge retrieval, tool‑augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single‑pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

PaperID: 9529, https://arxiv.org/pdf/2507.16799.pdf

Authors: Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo

Title: Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

Abstract:
The rapid advancement of large language models (LLMs) has enabled role‑playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well‑known fictional or public figures. On the other hand, fine‑tuning‑based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test‑Time‑Matching (TTM), a training‑free role‑playing framework through test‑time scaling and context engineering. TTM uses LLM agents to automatically decouple a character's features into personality, memory, and linguistic style. Our framework involves a structured, three‑stage generation pipeline that utilizes these features for controlled role‑playing. It achieves high‑fidelity role‑playing performance, also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves the outstanding performance in generating expressive and stylistically consistent character dialogues.

PaperID: 9530, https://arxiv.org/pdf/2507.16796.pdf

Authors: Mian Ibad Ali Shah, Enda Barrett, Karl Mason

Title: Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning

Abstract:
This paper presents a novel framework for Peer‑to‑Peer (P2P) energy trading that integrates uncertainty‑aware prediction with multi‑agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer‑based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision‑making in the stochastic environment of P2P energy trading. The KTU model leverages domain‑specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty‑aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty‑aware Deep Q‑Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.

PaperID: 9531, https://arxiv.org/pdf/2507.16713.pdf

Authors: Kaixian Qu, Guowei Lan, René Zurbrügg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter

Title: A Pragmatist Robot: Learning to Plan Tasks by Experiencing the Real World

Abstract:
Large language models (LLMs) have emerged as the dominant paradigm for robotic task planning using natural language instructions. However, trained on general internet data, LLMs are not inherently aligned with the embodiment, skill sets, and limitations of real‑world robotic systems. Inspired by the emerging paradigm of verbal reinforcement learning‑where LLM agents improve through self‑reflection and few‑shot learning without parameter updates‑we introduce PragmaBot, a framework that enables robots to learn task planning through real‑world experience. PragmaBot employs a vision‑language model (VLM) as the robot's "brain" and "eye", allowing it to visually evaluate action outcomes and self‑reflect on failures. These reflections are stored in a short‑term memory (STM), enabling the robot to quickly adapt its behavior during ongoing tasks. Upon task completion, the robot summarizes the lessons learned into its long‑term memory (LTM). When facing new tasks, it can leverage retrieval‑augmented generation (RAG) to plan more grounded action sequences by drawing on relevant past experiences and knowledge. Experiments on four challenging robotic tasks show that STM‑based self‑reflection increases task success rates from 35% to 84%, with emergent intelligent object interactions. In 12 real‑world scenarios (including eight previously unseen tasks), the robot effectively learns from the LTM and improves single‑trial success rates from 22% to 80%, with RAG outperforming naive prompting. These results highlight the effectiveness and generalizability of PragmaBot. Project webpage: https://pragmabot.github.io/

PaperID: 9532, https://arxiv.org/pdf/2507.16704.pdf

Authors: Viktor Muryn, Marta Sumyk, Mariya Hirna, Sofiya Garkot, Maksym Shamrai

Title: Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Abstract:
Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers ‑ our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real‑time, tree‑structured accessibility metadata from a single screenshot. Our method uses vision‑language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system‑level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX‑Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state‑of‑the‑art OmniParser V2 system on the ScreenSpot benchmark.

PaperID: 9533, https://arxiv.org/pdf/2507.16635.pdf

Authors: Ali Mohamed Ali, Luca Tirel, Hashim A. Hashim

Title: Novel Multi-Agent Action Masked Deep Reinforcement Learning for General Industrial Assembly Lines Balancing Problems

Abstract:
Efficient planning of activities is essential for modern industrial assembly lines to uphold manufacturing standards, prevent project constraint violations, and achieve cost‑effective operations. While exact solutions to such challenges can be obtained through Integer Programming (IP), the dependence of the search space on input parameters often makes IP computationally infeasible for large‑scale scenarios. Heuristic methods, such as Genetic Algorithms, can also be applied, but they frequently produce suboptimal solutions in extensive cases. This paper introduces a novel mathematical model of a generic industrial assembly line formulated as a Markov Decision Process (MDP), without imposing assumptions on the type of assembly line a notable distinction from most existing models. The proposed model is employed to create a virtual environment for training Deep Reinforcement Learning (DRL) agents to optimize task and resource scheduling. To enhance the efficiency of agent training, the paper proposes two innovative tools. The first is an action‑masking technique, which ensures the agent selects only feasible actions, thereby reducing training time. The second is a multi‑agent approach, where each workstation is managed by an individual agent, as a result, the state and action spaces were reduced. A centralized training framework with decentralized execution is adopted, offering a scalable learning architecture for optimizing industrial assembly lines. This framework allows the agents to learn offline and subsequently provide real‑time solutions during operations by leveraging a neural network that maps the current factory state to the optimal action. The effectiveness of the proposed scheme is validated through numerical simulations, demonstrating significantly faster convergence to the optimal solution compared to a comparable model‑based approach.

PaperID: 9534, https://arxiv.org/pdf/2507.16628.pdf

Authors: Rajpreet Singh, Vidhi Kothari

Title: Augmenting Von Neumann's Architecture for an Intelligent Future

Abstract:
This work presents a novel computer architecture that extends the Von Neumann model with a dedicated Reasoning Unit (RU) to enable native artificial general intelligence capabilities. The RU functions as a specialized co‑processor that executes symbolic inference, multi‑agent coordination, and hybrid symbolic‑neural computation as fundamental architectural primitives. This hardware‑embedded approach allows autonomous agents to perform goal‑directed planning, dynamic knowledge manipulation, and introspective reasoning directly within the computational substrate at system scale. The architecture incorporates a reasoning‑specific instruction set architecture, parallel symbolic processing pipelines, agent‑aware kernel abstractions, and a unified memory hierarchy that seamlessly integrates cognitive and numerical workloads. Through systematic co‑design across hardware, operating system, and agent runtime layers, this architecture establishes a computational foundation where reasoning, learning, and adaptation emerge as intrinsic execution properties rather than software abstractions, potentially enabling the development of general‑purpose intelligent machines.

PaperID: 9535, https://arxiv.org/pdf/2507.16611.pdf

Authors: Jesse Milzman, Jeffrey Mao, Giuseppe Loianno

Title: Smooth Games of Configuration in the Linear-Quadratic Setting

Abstract:
Dynamic game theory offers a toolbox for formalizing and solving for both cooperative and non‑cooperative strategies in multi‑agent scenarios. However, the optimal configuration of such games remains largely unexplored. While there is existing literature on the parametrization of dynamic games, little research examines this parametrization from a strategic perspective where each agent's configuration choice is influenced by the decisions of others. In this work, we introduce the concept of a game of configuration, providing a framework for the strategic fine‑tuning of differential games. We define a game of configuration as a two‑stage game within the setting of finite‑horizon, affine‑quadratic, AQ, differential games. In the first stage, each player chooses their corresponding configuration parameter, which will impact their dynamics and costs in the second stage. We provide the subgame perfect solution concept and a method for computing first stage cost gradients over the configuration space. This then allows us to formulate a gradient‑based method for searching for local solutions to the configuration game, as well as provide necessary conditions for equilibrium configurations over their downstream (second stage) trajectories. We conclude by demonstrating the effectiveness of our approach in example AQ systems, both zero‑sum and general‑sum.

PaperID: 9536, https://arxiv.org/pdf/2507.16520.pdf

Authors: Aria Delshad, Maryam Babazadeh

Title: A Distributed Actor-Critic Algorithm for Fixed-Time Consensus in Nonlinear Multi-Agent Systems

Abstract:
This paper proposes a reinforcement learning (RL)‑based backstepping control strategy to achieve fixed time consensus in nonlinear multi‑agent systems with strict feedback dynamics. Agents exchange only output information with their neighbors over a directed communication graph, without requiring full state measurements or symmetric communication. Achieving fixed time consensus, where convergence occurs within a pre‑specified time bound that is independent of initial conditions is faced with significant challenges due to the presence of unknown nonlinearities, inter‑agent couplings, and external disturbances. This work addresses these challenges by integrating actor critic reinforcement learning with a novel fixed time adaptation mechanism. Each agent employs an actor critic architecture supported by two estimator networks designed to handle system uncertainties and unknown perturbations. The adaptation laws are developed to ensure that all agents track the leader within a fixed time regardless of their initial conditions. The consensus and tracking errors are guaranteed to converge to a small neighborhood of the origin, with the convergence radius adjustable through control parameters. Simulation results demonstrate the effectiveness of the proposed approach and highlight its advantages over state‑of‑the‑art methods in terms of convergence speed and robustness.

PaperID: 9537, https://arxiv.org/pdf/2507.16479.pdf

Authors: Haoyang Zhang, Mina Montazeri, Philipp Heer, Koen Kok, Nikolaos G. Paterakis

Title: Arbitrage Tactics in the Local Markets via Hierarchical Multi-agent Reinforcement Learning

Abstract:
Strategic bidding tactics employed by prosumers in local markets, including the Local Electricity Market (LEM) and Local Flexibility Market (LFM), have attracted significant attention due to their potential to enhance economic benefits for market participants through optimized energy management and bidding. While existing research has explored strategic bidding in a single market with multi‑agent reinforcement learning (MARL) algorithms, arbitrage opportunities across local markets remain unexplored. This paper introduces a hierarchical MARL (HMARL) algorithm designed to enable aggregator arbitrage across multiple local markets. The strategic behavior of these aggregators in local markets is modeled as a two‑stage Markov game: the first stage involves the LEM, while the second stage encompasses both the LFM and the balancing market. To solve this two‑stage Markov game, the HMARL framework assigns two sub‑agents to each aggregator, a primary sub‑agent and a secondary sub‑agent. Without the arbitrage strategy, these sub‑agents operate in silos, with the primary sub‑agent focusing on first‑stage profits and the secondary sub‑agent on second‑stage profits, each employing independent MARLs. On the contrary, when implementing the arbitrage strategy with the proposed HMARL, the sub‑agents communicate and coordinate to perform arbitrage across multiple local markets, enhancing overall efficiency. The case study, conducted under a scenario where all aggregators employ the arbitrage strategy, shows that despite higher initial costs in the LEM, this strategy generates substantial savings in the LFM and the balancing market, resulting in a total profit increase of 40.6% on average. This highlights the capability of the proposed HMARL to address the two‑stage Markov game and facilitate arbitrage across local markets, thereby enhancing profitability for participants.

PaperID: 9538, https://arxiv.org/pdf/2507.16459.pdf

Authors: Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor

Title: Towards Enforcing Company Policy Adherence in Agentic Workflows

Abstract:
Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging τ‑bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real‑world deployments.

PaperID: 9539, https://arxiv.org/pdf/2507.16434.pdf

Authors: Stassa Patsantzis

Title: From model-based learning to model-free behaviour with Meta-Interpretive Learning

Abstract:
A "model" is a theory that describes the state of an environment and the effects of an agent's decisions on the environment. A model‑based agent can use its model to predict the effects of its future actions and so plan ahead, but must know the state of the environment. A model‑free agent cannot plan, but can act without a model and without completely observing the environment. An autonomous agent capable of acting independently in novel environments must combine both sets of capabilities. We show how to create such an agent with Meta‑Interpretive Learning used to learn a model‑based Solver used to train a model‑free Controller that can solve the same planning problems as the Solver. We demonstrate the equivalence in problem‑solving ability of the two agents on grid navigation problems in two kinds of environment: randomly generated mazes, and lake maps with wide open areas. We find that all navigation problems solved by the Solver are also solved by the Controller, indicating the two are equivalent.

PaperID: 9540, https://arxiv.org/pdf/2507.16395.pdf

Authors: Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang

Title: LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning

Abstract:
Atomic commits, which address a single development concern, are a best practice in software development. In practice, however, developers often produce tangled commits that mix unrelated changes, complicating code review and maintenance. Prior untangling approaches (rule‑based, feature‑based, or graph‑based) have made progress but typically rely on shallow signals and struggle to distinguish explicit dependencies (e.g., control/data flow) from implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)‑driven agents in a multi‑agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture structural and contextual information, we construct Explicit and Implicit Contexts, enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely‑used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best‑performing baseline, achieving an improvement of 44% on the C# dataset and 82% on the Java dataset. These findings highlight the potential of LLM‑based collaborative frameworks for advancing automated commit untangling tasks.

PaperID: 9541, https://arxiv.org/pdf/2507.16382.pdf

Authors: Chenhao Yao, Zike Yuan, Xiaoxu Liu, Chi Zhu

Title: Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance

Abstract:
Multi‑Agent Systems (MAS) excel at accomplishing complex objectives through the collaborative efforts of individual agents. Among the methodologies employed in MAS, Multi‑Agent Reinforcement Learning (MARL) stands out as one of the most efficacious algorithms. However, when confronted with the complex objective of Formation Control with Collision Avoidance (FCCA): designing an effective reward function that facilitates swift convergence of the policy network to an optimal solution. In this paper, we introduce a novel framework that aims to overcome this challenge. By giving large language models (LLMs) on the prioritization of tasks and the observable information available to each agent, our framework generates reward functions that can be dynamically adjusted online based on evaluation outcomes by employing more advanced evaluation metrics rather than the rewards themselves. This mechanism enables the MAS to simultaneously achieve formation control and obstacle avoidance in dynamic environments with enhanced efficiency, requiring fewer iterations to reach superior performance levels. Our empirical studies, conducted in both simulation and real‑world settings, validate the practicality and effectiveness of our proposed approach.

PaperID: 9542, https://arxiv.org/pdf/2507.16306.pdf

Authors: Xingjian Zhang, Yizhuo Wang, Guillaume Sartoretti

Title: COMPASS: Cooperative Multi-Agent Persistent Monitoring using Spatio-Temporal Attention Network

Abstract:
Persistent monitoring of dynamic targets is essential in real‑world applications such as disaster response, environmental sensing, and wildlife conservation, where mobile agents must continuously gather information under uncertainty. We propose COMPASS, a multi‑agent reinforcement learning (MARL) framework that enables decentralized agents to persistently monitor multiple moving targets efficiently. We model the environment as a graph, where nodes represent spatial locations and edges capture topological proximity, allowing agents to reason over structured layouts and revisit informative regions as needed. Each agent independently selects actions based on a shared spatio‑temporal attention network that we design to integrate historical observations and spatial context. We model target dynamics using Gaussian Processes (GPs), which support principled belief updates and enable uncertainty‑aware planning. We train COMPASS using centralized value estimation and decentralized policy execution under an adaptive reward setting. Our extensive experiments demonstrate that COMPASS consistently outperforms strong baselines in uncertainty reduction, target coverage, and coordination efficiency across dynamic multi‑target scenarios.

PaperID: 9543, https://arxiv.org/pdf/2507.16249.pdf

Authors: Srivatsan Krishnan, Jason Jabbour, Dan Zhang, Natasha Jaques, Aleksandra Faust, Shayegan Omidshafiei, Vijay Janapa Reddi

Title: Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping

Abstract:
Mapping deep neural networks (DNNs) to hardware is critical for optimizing latency, energy consumption, and resource utilization, making it a cornerstone of high‑performance accelerator design. Due to the vast and complex mapping space, reinforcement learning (RL) has emerged as a promising approach‑but its effectiveness is often limited by sample inefficiency. We present a decentralized multi‑agent reinforcement learning (MARL) framework designed to overcome this challenge. By distributing the search across multiple agents, our framework accelerates exploration. To avoid inefficiencies from training multiple agents in parallel, we introduce an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis. This enables a decentralized, parallelized learning process that significantly improves sample efficiency. Experimental results show our MARL approach improves sample efficiency by 30‑300x over standard single‑agent RL, achieving up to 32.61x latency reduction and 16.45x energy‑delay product (EDP) reduction under iso‑sample conditions.

PaperID: 9544, https://arxiv.org/pdf/2507.16221.pdf

Authors: Masahiko Ueda

Title: Unbeatable imitation of a friend

Abstract:
Imitation sometimes achieves success in multi‑agent situations even though it is very simple. In game theory, success of imitation has been characterized by unbeatability against other agents. Previous studies specified conditions under which imitation is unbeatable in repeated games, and clarified that the existence of unbeatable imitation is strongly related to the existence of payoff‑controlling strategies, called zero‑determinant strategies. However, the previous studies mainly focused on ``imitation of opponents''. It was pointed out that imitation of other players in the same group and imitation of other players in the same role in other groups generally result in different outcomes. Here, we investigate the existence condition of unbeatable imitation in the latter ``imitation of friends'' situations. We find that it is stronger than the existence condition of unbeatable zero‑determinant strategies, whereas both are very limited. Our findings suggest a strong relation between them even in the `imitation of friends'' situations.

PaperID: 9545, https://arxiv.org/pdf/2507.16204.pdf

Authors: Li-Hsiang Shen, Jyun-Jhe Huang

Title: Multi-Functional RIS-Enabled in SAGIN for IoT: A Hybrid Deep Reinforcement Learning Approach with Compressed Twin-Models

Abstract:
A space‑air‑ground integrated network (SAGIN) for Internet of Things (IoT) network architecture is investigated, empowered by multi‑functional reconfigurable intelligent surfaces (MF‑RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF‑RIS plays a pivotal role in addressing the energy shortages of low‑Earth orbit (LEO) satellites operating in the shadowed regions, while accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long‑term energy efficiency (EE) of IoT devices, we formulate a joint optimization problem over the MF‑RIS parameters, including signal amplification, phase‑shifts, energy harvesting ratio, and active element selection as well as the SAGIN parameters of beamforming vectors, high‑altitude platform station (HAPS) deployment, IoT device association, and computing capability. The formulated problem is highly non‑convex and non‑linear and contains mixed discrete‑continuous parameters. To tackle this, we conceive a compressed hybrid twin‑model enhanced multi‑agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state‑action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. The simulation results have demonstrated that the proposed CHIMERA scheme substantially outperforms the conventional benchmarks, including fixed‑configuration or non‑harvesting MF‑RIS, traditional RIS, and no‑RIS cases, as well as centralized and multi‑agent deep reinforcement learning baselines in terms of the highest EE. Moreover, the proposed SAGIN‑MF‑RIS architecture in IoT network achieves superior EE performance due to its complementary coverage, offering notable advantages over either standalone satellite, aerial, or ground‑only deployments.

PaperID: 9546, https://arxiv.org/pdf/2507.16203.pdf

Authors: Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi

Title: SVAgent: AI Agent for Hardware Security Verification Assertion

Abstract:
Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations. It is not only inefficient in development, but also unable to effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes an innovative SVA automatic generation framework SVAgent. SVAgent introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, gradually solvable fine‑grained problem‑solving chain. Experiments have shown that SVAgent can effectively suppress the influence of hallucinations and random answers, and the key evaluation indicators such as the accuracy and consistency of the SVA are significantly better than existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.

PaperID: 9547, https://arxiv.org/pdf/2507.16068.pdf

Authors: Zhehui Huang, Guangyao Shi, Yuwei Wu, Vijay Kumar, Gaurav S. Sukhatme

Title: Compositional Coordination for Multi-Robot Teams with Large Language Models

Abstract:
Multi‑robot coordination has traditionally relied on a mission‑specific and expert‑driven pipeline, where natural language mission descriptions are manually translated by domain experts into mathematical formulation, algorithm design, and executable code. This conventional process is labor‑intensive, inaccessible to non‑experts, and inflexible to changes in mission requirements. Here, we propose LAN2CB (Language to Collective Behavior), a novel framework that leverages large language models (LLMs) to streamline and generalize the multi‑robot coordination pipeline. LAN2CB transforms natural language (NL) mission descriptions into executable Python code for multi‑robot systems through two core modules: (1) Mission Analysis, which parses mission descriptions into behavior trees, and (2) Code Generation, which leverages the behavior tree and a structured knowledge base to generate robot control code. We further introduce a dataset of natural language mission descriptions to support development and benchmarking. Experiments in both simulation and real‑world environments demonstrate that LAN2CB enables robust and flexible multi‑robot coordination from natural language, significantly reducing manual engineering effort and supporting broad generalization across diverse mission types. Website: https://sites.google.com/view/lan‑cb

PaperID: 9548, https://arxiv.org/pdf/2507.16044.pdf

Authors: Meriem Mastouri, Emna Ksontini, Amine Barrak, Wael Kessentini

Title: From REST to MCP: An Empirical Study of API Wrapping and Automated Server Generation for LLM Agents

Abstract:
The Model Context Protocol (MCP) is emerging as a standard interface through which LLM agents invoke external tools, and a growing ecosystem of MCP servers now mediates access to vendor services. Most of these servers target vendors that already expose REST APIs, yet the relationship between MCP tool interfaces and the underlying API surface has not been empirically characterised. This paper presents the first large‑scale study of MCP server construction. We analyse 116 official servers to determine REST reliance and integration strategies (RQ1); examine servers paired with OpenAPI specifications to quantify operation exposure, omission, and mapping patterns (RQ2); evaluate automated generation from 80 real‑world OpenAPI contracts (RQ3); and assess specification repair and tool‑set transformations to improve correctness and reduce complexity (RQ4). We find that 88.6% of servers are fully or partially REST‑backed, with 92% implementing tools as bare API wrappers. MCP servers expose a median of 19% of available operations, following systematic patterns predictable from the specification. Baseline generation succeeds for 76% of sampled tools; automated repair raises this to 94.2%, while filtering and regrouping reduce the median tool count per API by one‑third. We release AutoMCP, an end‑to‑end pipeline integrating specification repair and empirically grounded tool‑set transformations.

PaperID: 9549, https://arxiv.org/pdf/2507.16034.pdf

Authors: Xuying Huang, Sicong Pan, Olga Zatsarynna, Juergen Gall, Maren Bennewitz

Title: Privacy-Preserving Semantic Segmentation from Ultra-Low-Resolution RGB Inputs

Abstract:
RGB‑based semantic segmentation has become a mainstream approach for visual perception and is widely applied in a variety of downstream tasks. However, existing methods typically rely on high‑resolution RGB inputs, which may expose sensitive visual content in privacy‑critical environments. Ultra‑low‑resolution RGB sensing suppresses sensitive information directly during image acquisition, making it an attractive privacy‑preserving alternative. Nevertheless, recovering semantic segmentation from ultra‑low‑resolution RGB inputs remains highly challenging due to severe visual degradation. In this work, we introduce a novel fully joint‑learning framework to mitigate the optimization conflicts exacerbated by visual degradation for ultra‑low‑resolution semantic segmentation. Experiments demonstrate that our method outperforms representative baselines in semantic segmentation performance and our ultra‑low‑resolution RGB input achieves a favorable trade‑off between privacy preservation and semantic segmentation performance. We deploy our privacy‑preserving semantic segmentation method in a real‑world robotic object‑goal navigation task, demonstrating successful downstream task execution even under severe visual degradation.

PaperID: 9550, https://arxiv.org/pdf/2507.15892.pdf

Authors: Elijah Nnorom, Md Basim Uddin Ahmed, Jiho Shin, Hung Viet Pham, Song Wang

Title: StaAgent: An Agentic Framework for Testing Static Analyzers

Abstract:
Static analyzers play a critical role in identifying bugs early in the software development lifecycle, but their rule implementations are often under‑tested and prone to inconsistencies. To address this, we propose StaAgent, an agentic framework that harnesses the generative capabilities of Large Language Models (LLMs) to systematically evaluate static analyzer rules. StaAgent comprises four specialized agents: a Seed Generation Agent that translates bug detection rules into concrete, bug‑inducing seed programs; a Code Validation Agent that ensures the correctness of these seeds; a Mutation Generation Agent that produces semantically equivalent mutants; and an Analyzer Evaluation Agent that performs metamorphic testing by comparing the static analyzer's behavior on seeds and their corresponding mutants. By revealing inconsistent behaviors, StaAgent helps uncover flaws in rule implementations. This LLM‑driven, multi‑agent framework offers a scalable and adaptable solution to improve the reliability of static analyzers. We evaluated StaAgent with five state‑of‑the‑art LLMs (CodeL‑lama, DeepSeek, Codestral, Qwen, and GPT‑4o) across five widely used static analyzers (SpotBugs, SonarQube, ErrorProne, Infer, and PMD). The experimental results show that our approach can help reveal 64 problematic rules in the latest versions of these five static analyzers (i.e., 28 in SpotBugs, 18 in SonarQube, 6 in ErrorProne, 4 in Infer, and 8 in PMD). In addition, 53 out of the 64 bugs cannot be detected by the SOTA baseline. We have reported all the bugs to developers, with two of them already fixed. Three more have been confirmed by developers, while the rest are awaiting response. These results demonstrate the effectiveness of our approach and underscore the promise of agentic, LLM‑driven data synthesis to advance software engineering.

PaperID: 9551, https://arxiv.org/pdf/2507.15885.pdf

Authors: Pierluca D'Oro, Caley Drooff, Joy Chen, Joseph Tighe

Title: ADEPTS: A Capability Framework for Human-Centered Agent Design

Abstract:
Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demands a holistic and cross‑disciplinary approach to developing, monitoring and discussing the capabilities required for agent‑driven user experiences. However, current guidance on human‑centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high‑level governance. There is no concise, user‑facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user‑facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human‑centered agent design, that express the minimal, user‑facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; differently from them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI‑UX requirements into a compact framework that is actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential of accelerating the improvement of user‑relevant agent capabilities, of easing the design of experiences that take advantage of those capabilities, and of providing a shared language to track and discuss progress around the development of AI agents.

PaperID: 9552, https://arxiv.org/pdf/2507.15781.pdf

Authors: Gian Carlo Maffettone, Alain Boldini, Mario di Bernardo, Maurizio Porfiri

Title: Bio-inspired density control of multi-agent swarms via leader-follower plasticity

Abstract:
The design of control systems for the spatial self‑organization of mobile agents is an open challenge across several engineering domains, including swarm robotics and synthetic biology. Here, we propose a bio‑inspired leader‑follower solution, which is aware of energy constraints of mobile agents and is apt to deal with large swarms. Akin to many natural systems, control objectives are formulated for the entire collective, and leaders and followers are allowed to plastically switch their role in time. We frame a density control problem, modeling the agents' population via a system of nonlinear partial differential equations. This approach allows for a compact description that inherently avoids the curse of dimensionality and improves analytical tractability. We derive analytical guarantees for the existence of desired steady‑state solutions and their local stability for one‑dimensional and higher‑dimensional problems. We numerically validate our control methodology, offering support to the effectiveness, robustness, and versatility of our proposed bio‑inspired control strategy.

PaperID: 9553, https://arxiv.org/pdf/2507.15770.pdf

Authors: Yifan Shen, Zihan Zhao, Xiao Xue, Yuwei Guo, Qun Ma, Deyu Zhou, Ming Zhang

Title: A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

Abstract:
With the rise of service computing, cloud computing, and IoT, service ecosystems are becoming increasingly complex. The intricate interactions among intelligent agents make abnormal emergence analysis challenging, as traditional causal methods focus on individual trajectories. Large language models offer new possibilities for Agent‑Based Modeling (ABM) through Chain‑of‑Thought (CoT) reasoning to reveal agent intentions. However, existing approaches remain limited to microscopic and static analysis. This paper introduces a framework: Emergence Analysis based on Multi‑Agent Intention (EAMI), which enables dynamic and interpretable emergence analysis. EAMI first employs a dual‑perspective thought track mechanism, where an Inspector Agent and an Analysis Agent extract agent intentions under bounded and perfect rationality. Then, k‑means clustering identifies phase transition points in group intentions, followed by a Intention Temporal Emergence diagram for dynamic analysis. The experiments validate EAMI in complex online‑to‑offline (O2O) service system and the Stanford AI Town experiment, with ablation studies confirming its effectiveness, generalizability, and efficiency. This framework provides a novel paradigm for abnormal emergence and causal analysis in service ecosystems. The code is available at https://anonymous.4open.science/r/EAMI‑B085.

PaperID: 9554, https://arxiv.org/pdf/2507.15761.pdf

Authors: Jingyi Zheng, Zifan Peng, Yule Liu, Junfeng Wang, Yifan Liao, Wenhan Dong, Xinlei He

Title: GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

Abstract:
Smart contracts are trustworthy, immutable, and automatically executed programs on the blockchain. Their execution requires the Gas mechanism to ensure efficiency and fairness. However, due to non‑optimal coding practices, many contracts contain Gas waste patterns that need to be optimized. Existing solutions mostly rely on manual discovery, which is inefficient, costly to maintain, and difficult to scale. Recent research uses large language models (LLMs) to explore new Gas waste patterns. However, it struggles to remain compatible with existing patterns, often produces redundant patterns, and requires manual validation/rewriting. To address this gap, we present GasAgent, the first multi‑agent system for smart contract Gas optimization that combines compatibility with existing patterns and automated discovery/validation of new patterns, enabling end‑to‑end optimization. GasAgent consists of four specialized agents, Seeker, Innovator, Executor, and Manager, that collaborate in a closed loop to identify, validate, and apply Gas‑saving improvements. Experiments on 100 verified real‑world contracts demonstrate that GasAgent successfully optimizes 82 contracts, achieving an average deployment Gas savings of 9.97%. In addition, our evaluation confirms its compatibility with existing tools and validates the effectiveness of each module through ablation studies. To assess broader usability, we further evaluate 500 contracts generated by five representative LLMs across 10 categories and find that GasAgent optimizes 79.8% of them, with deployment Gas savings ranging from 4.79% to 13.93%, showing its usability as the optimization layer for LLM‑assisted smart contract development.

PaperID: 9555, https://arxiv.org/pdf/2507.15743.pdf

Authors: Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, Cían Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joëlle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, David Stutz

Title: Towards physician-centered oversight of conversational diagnostic AI

Abstract:
Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real‑world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed‑AMIE (g‑AMIE), a multi‑agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g‑AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g‑AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g‑AMIE outperformed both groups in performing high‑quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g‑AMIE was also more time‑efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians' capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real‑world care.

PaperID: 9556, https://arxiv.org/pdf/2507.15729.pdf

Authors: Jens V. Rüppel, Andrey Rudenko, Tim Schreiter, Martin Magnusson, Achim J. Lilienthal

Title: Gaze-supported Large Language Model Framework for Bi-directional Human-Robot Interaction

Abstract:
The rapid development of Large Language Models (LLMs) creates an exciting potential for flexible, general knowledge‑driven Human‑Robot Interaction (HRI) systems for assistive robots. Existing HRI systems demonstrate great progress in interpreting and following user instructions, action generation, and robot task solving. On the other hand, bi‑directional, multi‑modal, and context‑aware support of the user in collaborative tasks still remains an open challenge. In this paper, we present a gaze‑ and speech‑informed interface to the assistive robot, which is able to perceive the working environment from multiple vision inputs and support the dynamic user in their tasks. Our system is designed to be modular and transferable to adapt to diverse tasks and robots, and it is capable of real‑time use of language‑based interaction state representation and fast on board perception modules. Its development was supported by multiple public dissemination events, contributing important considerations for improved robustness and user experience. Furthermore, in two lab studies, we compare the performance and user ratings of our system with those of a traditional scripted HRI pipeline. Our findings indicate that an LLM‑based approach enhances adaptability and marginally improves user engagement and task execution metrics but may produce redundant output, while a scripted pipeline is well suited for more straightforward tasks.

PaperID: 9557, https://arxiv.org/pdf/2507.15727.pdf

Authors: Xuchuang Wang, Bo Sun, Hedyeh Beyhaghi, John C. S. Lui, Mohammad Hajiesmaili, Adam Wierman

Title: Competitive Algorithms for Multi-Agent Ski-Rental Problems

Abstract:
This paper introduces a novel multi‑agent ski‑rental problem that generalizes the classical ski‑rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents' active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state‑dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state‑aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state‑aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski‑rental insights to multi‑agent settings, highlighting both theoretical and practical implications for group decision‑making under uncertainty.

PaperID: 9558, https://arxiv.org/pdf/2507.15698.pdf

Authors: Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

Title: CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

Abstract:
Process Reward Models (PRMs) play a central role in evaluating and guiding multi‑step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD(Counterfactually‑Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length‑penalty adjustment, a learned bias estimator trained to capture spurious length‑related signals, and a joint training strategy that enforces length‑invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM‑Plus show that CoLD improves accuracy in step selection, and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.

PaperID: 9559, https://arxiv.org/pdf/2507.15615.pdf

Authors: Zhihao Zhang, Siyuan Li, Chenxi Li, Feifan Liu, Mengjing Chen, Kai Li, Tao Zhong, Bo An, Peng Liu

Title: DHEvo: Data-Algorithm Based Heuristic Evolution for Generalizable MILP Solving

Abstract:
Primal heuristics play a critical role in improving the efficiency of mixed integer programming (MILP) solvers. As large language models (LLMs) have demonstrated superior code generation abilities, recent MILP works are devoted to leveraging the evolutionary computation approaches with LLMs to generate effective primal heuristics. Although the generated heuristics have achieved better solving performance than the hand‑crafted ones with little adaptability, the advantage of current LLM‑based methods is limited to few MILP instances in one problem class, as they fail to capture the instance characteristics in the problem class (the MILP instances generated from the same mathematical model are defined as a problem class). Since MILP instances often differ significantly in structure and feature distribution, the neglect of their characteristics in the evolution process results in poor generalization within the same problem class. To overcome this challenge, we propose a data‑algorithm co‑evolution framework (DHEvo) that iteratively selects representative instances and evolves corresponding heuristics. With the initial instance distribution, we develop an LLM‑based multi‑agent system to generate data‑code pairs simultaneously. These data‑code pairs are iteratively refined based on their fitness scores, leading to the identification of the most effective heuristic over the entire problem class. Extensive experiments across diverse MILP benchmarks demonstrate that our approach significantly outperforms both human‑designed heuristics and existing LLM‑based methods.

PaperID: 9560, https://arxiv.org/pdf/2507.15587.pdf

Authors: Yinsong Chen, Kaifeng Wang, Xiaoqiang Meng, Xueyuan Li, Zirui Li, Xin Gao

Title: Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

Abstract:
Current research on decision‑making in safety‑critical scenarios often relies on inefficient data‑driven scenario generation or specific modeling approaches, which fail to capture corner cases in real‑world contexts. To address this issue, we propose a Red‑Team Multi‑Agent Reinforcement Learning framework, where background vehicles with interference capabilities are treated as red‑team agents. Through active interference and exploration, red‑team vehicles can uncover corner cases outside the data distribution. The framework uses a Constraint Graph Representation Markov Decision Process, ensuring that red‑team vehicles comply with safety rules while continuously disrupting the autonomous vehicles (AVs). A policy threat zone model is constructed to quantify the threat posed by red‑team vehicles to AVs, inducing more extreme actions to increase the danger level of the scenario. Experimental results show that the proposed framework significantly impacts AVs decision‑making safety and generates various corner cases. This method also offers a novel direction for research in safety‑critical scenarios.

PaperID: 9561, https://arxiv.org/pdf/2507.15559.pdf

Authors: Pan Hao, Dongyeop Kang, Nicholas Hinds, Qianwen Wang

Title: FlowForge: Guiding the Creation of Multi-agent Workflows with Design Space Visualization as a Thinking Scaffold

Abstract:
Multi‑agent workflows have become an effective strategy for tackling complicated tasks by decomposing them into multiple sub‑tasks and assigning them to specialized agents. However, designing optimal workflows remains challenging due to the vast and intricate design space. Current practices rely heavily on the intuition and expertise of practitioners, often resulting in design fixation or an unstructured, time‑consuming exploration of trial‑and‑error. To address these challenges, this work introduces FLOWFORGE, an interactive visualization tool to facilitate the creation of multi‑agent workflow through i) a structured visual exploration of the design space and ii) in‑situ guidance informed by established design patterns. Based on formative studies and literature review, FLOWFORGE organizes the workflow design process into three hierarchical levels (i.e., task planning, agent assignment, and agent optimization), ranging from abstract to concrete. This structured visual exploration enables users to seamlessly move from high‑level planning to detailed design decisions and implementations, while comparing alternative solutions across multiple performance metrics. Additionally, drawing from established workflow design patterns, FLOWFORGE provides context‑aware, in‑situ suggestions at each level as users navigate the design space, enhancing the workflow creation process with practical guidance. Use cases and user studies demonstrate the usability and effectiveness of FLOWFORGE, while also yielding valuable insights into how practitioners explore design spaces and leverage guidance during workflow development.

PaperID: 9562, https://arxiv.org/pdf/2507.15518.pdf

Authors: Shufan Jiang, Sizhou Chen, Chios Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Title: HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics

Abstract:
Creating an immersive and interactive theatrical experience is a long‑term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing drama generation methods often produce LLMs that lack initiative and cannot interact with the physical scene, while typically requiring detailed input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi‑agent framework focused on drama creation and real‑time online performance. Given a simple topic, the framework initially generates a narrative blueprint to guide the subsequent improvisational performance. During online performance, each actor is equipped with an adaptive reasoning module that enables decision‑making based on their personas, memories, goals during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.

PaperID: 9563, https://arxiv.org/pdf/2507.15455.pdf

Authors: Hee Jun Yang, Minjung Gim, Yeoneung Kim

Title: Solving nonconvex Hamilton--Jacobi--Isaacs equations with PINN-based policy iteration

Abstract:
We propose a mesh‑free policy iteration framework that combines classical dynamic programming with physics‑informed neural networks (PINNs) to solve high‑dimensional, nonconvex Hamilton‑‑Jacobi‑‑Isaacs (HJI) equations arising in stochastic differential games and robust control. The method alternates between solving linear second‑order PDEs under fixed feedback policies and updating the controls via pointwise minimax optimization using automatic differentiation. Under standard Lipschitz and uniform ellipticity assumptions, we prove that the value function iterates converge locally uniformly to the unique viscosity solution of the HJI equation. The analysis establishes equi‑Lipschitz regularity of the iterates, enabling provable stability and convergence without requiring convexity of the Hamiltonian. Numerical experiments demonstrate the accuracy and scalability of the method. In a two‑dimensional stochastic path‑planning game with a moving obstacle, our method matches finite‑difference benchmarks with relative L^2‑errors below %10^‑2%. In five‑ and ten‑dimensional publisher‑subscriber differential games with anisotropic noise, the proposed approach consistently outperforms direct PINN solvers, yielding smoother value functions and lower residuals. Our results suggest that integrating PINNs with policy iteration is a practical and theoretically grounded method for solving high‑dimensional, nonconvex HJI equations, with potential applications in robotics, finance, and multi‑agent reinforcement learning.

PaperID: 9564, https://arxiv.org/pdf/2507.15428.pdf

Authors: Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen

Title: EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

Abstract:
Egomotion videos are first‑person recordings where the view changes continuously due to the agent's movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real‑world deployment. Recent advances in vision‑language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third‑person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training‑free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective‑Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)‑based token selector that jointly considers visual‑text relevance and intra‑frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training‑free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real‑world efficiency and suitability for on‑device egomotion video reasoning.

PaperID: 9565, https://arxiv.org/pdf/2507.15419.pdf

Authors: Wenhao Li, Selvakumar Manickam, Yung-wey Chong, Shankar Karuppayah

Title: PhishIntentionLLM: Uncovering Phishing Website Intentions through Multi-Agent Retrieval-Augmented Generation

Abstract:
Phishing websites remain a major cybersecurity threat, yet existing methods primarily focus on detection, while the recognition of underlying malicious intentions remains largely unexplored. To address this gap, we propose PhishIntentionLLM, a multi‑agent retrieval‑augmented generation (RAG) framework that uncovers phishing intentions from website screenshots. Leveraging the visual‑language capabilities of large language models (LLMs), our framework identifies four key phishing objectives: Credential Theft, Financial Fraud, Malware Distribution, and Personal Information Harvesting. We construct and release the first phishing intention ground truth dataset (~2K samples) and evaluate the framework using four commercial LLMs. Experimental results show that PhishIntentionLLM achieves a micro‑precision of 0.7895 with GPT‑4o and significantly outperforms the single‑agent baseline with a ~95% improvement in micro‑precision. Compared to the previous work, it achieves 0.8545 precision for credential theft, marking a ~4% improvement. Additionally, we generate a larger dataset of ~9K samples for large‑scale phishing intention profiling across sectors. This work provides a scalable and interpretable solution for intention‑aware phishing analysis.

PaperID: 9566, https://arxiv.org/pdf/2507.15283.pdf

Authors: Yuliang Fu, Guanghui Wen, Dan Zhao, Wei Xing Zheng, Xiaolei Li

Title: Event-Triggered Resilient Consensus of Networked Euler-Lagrange Systems Under Byzantine Attacks

Abstract:
The resilient consensus problem is investigated in this paper for a class of networked Euler‑Lagrange systems with event‑triggered communication in the presence of Byzantine attacks. One challenge that we face in addressing the considered problem is the inapplicability of existing resilient decision algorithms designed for one‑dimensional multi‑agent systems. This is because the networked Euler‑Lagrange systems fall into the category of multi‑dimensional multi‑agent systems with coupling among state vector components. To address this problem, we propose a new resilient decision algorithm. This algorithm constructs auxiliary variables related to the coordinative objectives for each normal agent, and transforms the considered resilient consensus problem into the consensus problem of the designed auxiliary variables. Furthermore, to relax the constraints imposed on Byzantine agent behavior patterns within continuous‑time scenarios, the event‑triggered communication scheme is adopted. Finally, the effectiveness of the proposed algorithm is demonstrated through case studies.

PaperID: 9567, https://arxiv.org/pdf/2507.15268.pdf

Authors: Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee, Seunghwa Ryu

Title: IM-Chat: A Multi-agent LLM Framework Integrating Tool-Calling and Diffusion Modeling for Knowledge Transfer in Injection Molding Industry

Abstract:
The injection molding industry faces critical challenges in preserving and transferring field knowledge, particularly as experienced workers retire and multilingual barriers hinder effective communication. This study introduces IM‑Chat, a multi‑agent framework based on large language models (LLMs), designed to facilitate knowledge transfer in injection molding. IM‑Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data‑driven process condition generator that infers optimal manufacturing settings from environmental inputs such as temperature and humidity, enabling robust and context‑aware task resolution. By adopting a retrieval‑augmented generation (RAG) strategy and tool‑calling agents within a modular architecture, IM‑Chat ensures adaptability without the need for fine‑tuning. Performance was assessed across 100 single‑tool and 60 hybrid tasks for GPT‑4o, GPT‑4o‑mini, and GPT‑3.5‑turbo by domain experts using a 10‑point rubric focused on relevance and correctness, and was further supplemented by automated evaluation using GPT‑4o guided by a domain‑adapted instruction prompt. The evaluation results indicate that more capable models tend to achieve higher accuracy, particularly in complex, tool‑integrated scenarios. In addition, compared with the fine‑tuned single‑agent LLM, IM‑Chat demonstrated superior accuracy, particularly in quantitative reasoning, and greater scalability in handling multiple information sources. Overall, these findings demonstrate the viability of multi‑agent LLM systems for industrial knowledge workflows and establish IM‑Chat as a scalable and generalizable approach to AI‑assisted decision support in manufacturing.

PaperID: 9568, https://arxiv.org/pdf/2507.15241.pdf

Authors: Vikram Nitin, Baishakhi Ray, Roshanak Zilouchian Moghaddam

Title: FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents

Abstract:
Despite the critical threat posed by software security vulnerabilities, reports are often incomplete, lacking the proof‑of‑vulnerability (PoV) tests needed to validate fixes and prevent regressions. These tests are crucial not only for ensuring patches work, but also for helping developers understand how vulnerabilities can be exploited. Generating PoV tests is a challenging problem, requiring reasoning about the flow of control and data through deeply nested levels of a program. We present FaultLine, an LLM agent workflow that uses a set of carefully designed reasoning steps, inspired by aspects of traditional static and dynamic program analysis, to automatically generate PoV test cases. Given a software project with an accompanying vulnerability report, FaultLine 1) traces the flow of an input from an externally accessible API ("source") to the "sink" corresponding to the vulnerability, 2) reasons about the conditions that an input must satisfy in order to traverse the branch conditions encountered along the flow, and 3) uses this reasoning to generate a PoV test case in a feedback‑driven loop. FaultLine does not use language‑specific static or dynamic analysis components, which enables it to be used across programming languages. To evaluate FaultLine, we collate a challenging multi‑lingual dataset of 100 known vulnerabilities in Java, C and C++ projects. On this dataset, FaultLine is able to generate PoV tests for 16 projects, compared to just 9 for CodeAct 2.1, a popular state‑of‑the‑art open‑source agentic framework. Thus, FaultLine represents a 77% relative improvement over the state of the art. Our findings suggest that hierarchical reasoning can enhance the performance of LLM agents on PoV test generation, but the problem in general remains challenging. We make our code and dataset publicly available in the hope that it will spur further research in this area.

PaperID: 9569, https://arxiv.org/pdf/2507.15219.pdf

Authors: Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song

Title: PromptArmor: Simple yet Effective Prompt Injection Defenses

Abstract:
Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker‑specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off‑the‑shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT‑4o, GPT‑4.1, or o4‑mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

PaperID: 9570, https://arxiv.org/pdf/2507.15147.pdf

Authors: Yiqi Zhao, Xinyi Yu, Bardh Hoxha, Georgios Fainekos, Jyotirmoy V. Deshmukh, Lars Lindemann

Title: STL-GO: Spatio-Temporal Logic with Graph Operators for Distributed Systems with Multiple Network Topologies

Abstract:
Multi‑agent systems (MASs) consisting of a number of autonomous agents that communicate, coordinate, and jointly sense the environment to achieve complex missions can be found in a variety of applications such as robotics, smart cities, and internet‑of‑things applications. Modeling and monitoring MAS requirements to guarantee overall mission objectives, safety, and reliability is an important problem. Such requirements implicitly require reasoning about diverse sensing and communication modalities between agents, analysis of the dependencies between agent tasks, and the spatial or virtual distance between agents. To capture such rich MAS requirements, we model agent interactions via multiple directed graphs, and introduce a new logic ‑‑ Spatio‑Temporal Logic with Graph Operators (STL‑GO). The key innovation in STL‑GO are graph operators that enable us to reason about the number of agents along either the incoming or outgoing edges of the underlying interaction graph that satisfy a given property of interest; for example, the requirement that an agent should sense at least two neighboring agents whose task graphs indicate the ability to collaborate. We then propose novel distributed monitoring conditions for individual agents that use only local information to determine whether or not an STL‑GO specification is satisfied. We compare the expressivity of STL‑GO against existing spatio‑temporal logic formalisms, and demonstrate the utility of STL‑GO and our distributed monitors in a bike‑sharing and a multi‑drone case study.

PaperID: 9571, https://arxiv.org/pdf/2507.15143.pdf

Authors: Abderaouf Bahi, Amel Ourici

Title: NaviGNN: Multi-Agent Reinforcement Learning and Graph Neural Network for Sustainable Mobility in Futuristic Smart Cities

Abstract:
This paper investigates the feasibility of human mobility in extreme urban morphologies characterized by high‑density vertical structures and linear city layouts. To assess whether agents can navigate efficiently within such unprecedented topologies, we develop a hybrid simulation framework integrating agent‑based modeling, reinforcement learning (RL), supervised learning, and graph neural networks (GNNs). The simulation captures multi‑modal transportation behaviors across multiple vertical levels and varying density scenarios, using both synthetic data and real‑world traces from high‑density cities. Experimental results show that the fully integrated AI architecture enables agents to achieve an average commute time of 7.8‑8.4 minutes, a satisfaction rate exceeding 89%, and a reachability index above 91%, even during peak congestion periods. Ablation studies indicate that removing intelligent modules such as RL or GNNs significantly degrades performance, with commute times increasing by up to 85% and reachability dropping below 70%. Baseline comparisons against Dijkstra, A, DQN, and standard GCN further confirm the superiority of the proposed model across all mobility and sustainability metrics. Environmental modeling demonstrates low energy consumption and minimal CO2 emissions when electric transportation modes are prioritized. These findings suggest that efficient and sustainable mobility in extreme urban environments is achievable, provided that adaptive AI systems, intelligent infrastructure, and real‑time feedback mechanisms are effectively implemented.

PaperID: 9572, https://arxiv.org/pdf/2507.15015.pdf

Authors: Xinmeng Hou, Ziting Chang, Zhouquan Lu, Chen Wenli, Liang Wan, Wei Feng, Hai Hu, Qing Guo

Title: MetaCrit: A Critical Thinking Framework for Self-Regulated LLM Reasoning

Abstract:
Large language models (LLMs) fail on over one‑third of multi‑hop questions with counterfactual premises and remain vulnerable to adversarial prompts that trigger biased or factually incorrect responses, which exposes a fundamental deficit in self‑regulated reasoning. We propose MetaCrit, a multi‑agent framework grounded in Nelson and Narens' metacognitive regulation theory. MetaCrit decomposes reasoning regulation into four agents: object‑level generation, a \emphmonitoring agent that assesses response validity, a \emphcontrol agent that critiques logical soundness, and a meta‑level synthesizer that integrates all signals into a final response. Evaluation across eight benchmarks, four model backbones, and a college‑level analytical writing study shows that MetaCrit significantly improves content truthfulness and logical soundness while eliminating toxic outputs. Its modular design allows individual agents to be integrated into existing frameworks as drop‑in components without architectural modifications.

PaperID: 9573, https://arxiv.org/pdf/2507.14995.pdf

Authors: Chengwei Lou, Zekai Jin, Wei Tang, Guangfei Geng, Jin Yang, Lu Zhang

Title: LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

Abstract:
Real‑time peer‑to‑peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance for massive personalized prosumers poses critical challenges, including diverse decision‑making demands and a lack of customized modeling frameworks. This paper proposes an integrated large language model‑multi‑agent reinforcement learning (LLM‑MARL) framework for real‑time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategies, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation. To handle the scalability issues inherent in large‑scale P2P networks, a differential attention‑based critic network is introduced to efficiently extract key interaction features and enhance convergence. Experimental results demonstrate that LLM‑generated strategies effectively substitute human experts. The proposed imitative expert MARL algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms, while maintaining robust stability. This paper provides an effective solution for the real‑time decision‑making of the P2P electricity market by bridging expert knowledge with agent learning.

PaperID: 9574, https://arxiv.org/pdf/2507.14928.pdf

Authors: Yongrae Jo, Chanik Park

Title: Byzantine-Robust Decentralized Coordination of LLM Agents

Abstract:
Collaboration among multiple large language model (LLM) agents is a promising approach to overcome inherent limitations of single‑agent systems, such as hallucinations and single points of failure. As LLM agents are increasingly deployed on open blockchain platforms, multi‑agent systems capable of tolerating malicious (Byzantine) agents have become essential. Recent Byzantine‑robust multi‑agent systems typically rely on leader‑driven coordination, which suffers from two major drawbacks. First, they are inherently vulnerable to targeted attacks against the leader. If consecutive leaders behave maliciously, the system repeatedly fails to achieve consensus, forcing new consensus rounds, which is particularly costly given the high latency of LLM invocations. Second, an underperforming proposal from the leader can be accepted as the final answer even when higher‑quality alternatives are available, as existing methods finalize the leader's proposal once it receives a quorum of votes. To address these issues, we propose DecentLLMs, a novel decentralized consensus approach for multi‑agent LLM systems, where worker agents generate answers concurrently and evaluator agents independently score and rank these answers to select the best available one. This decentralized architecture enables faster consensus despite the presence of Byzantine agents and consistently selects higher‑quality answers through Byzantine‑robust aggregation techniques. Experimental results demonstrate that DecentLLMs effectively tolerates Byzantine agents and significantly improves the quality of selected answers.

PaperID: 9575, https://arxiv.org/pdf/2507.14904.pdf

Authors: Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang

Title: TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Abstract:
3D visual grounding allows an embodied agent to understand visual information in real‑world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre‑trained 2D multi‑modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre‑trained multi‑modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi‑modal model with adapter‑based fine‑tuning, this framework effectively adapts to the tri‑modal setting, improving both adaptability and performance across modalities. Our Geometric‑Aware 2D‑3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi‑scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi‑modal decoder to facilitate deep cross‑modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end‑to‑end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.

PaperID: 9576, https://arxiv.org/pdf/2507.14850.pdf

Authors: H. M. Sabbir Ahmad, Ehsan Sabouni, Alexander Wasilkoff, Param Budhraja, Zijian Guo, Songyuan Zhang, Chuchu Fan, Christos Cassandras, Wenchao Li

Title: Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

Abstract:
We address the problem of safe policy learning in multi‑agent safety‑critical autonomous systems. In such systems, it is necessary for each agent to meet the safety requirements at all times while also cooperating with other agents to accomplish the task. Toward this end, we propose a safe Hierarchical Multi‑Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our proposed hierarchical approach decomposes the overall reinforcement learning problem into two levels learning joint cooperative behavior at the higher level and learning safe individual behavior at the lower or agent level conditioned on the high‑level policy. Specifically, we propose a skill‑based HMARL‑CBF algorithm in which the higher level problem involves learning a joint policy over the skills for all the agents and the lower‑level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios whereby a large number of agents have to safely navigate through conflicting road networks. Compared with existing state of the art methods, our approach significantly improves the safety achieving near perfect (within 5%) success/safety rate while also improving performance across all the environments.

PaperID: 9577, https://arxiv.org/pdf/2507.14800.pdf

Authors: Xu Yang, Chenhui Lin, Licheng Sha, Liping Yang, Shuzhou Wu, Xichen Tian, Haotian Liu, Wenchuan Wu

Title: Large Language Model as An Operator: An Experience-Driven Solution for Distribution Network Voltage Control

Abstract:
With the advanced reasoning, contextual understanding, and information synthesis capabilities of large language models (LLMs), a novel paradigm emerges for the autonomous generation of dispatch strategies in modern power systems. In this paper, we propose an LLM‑based experience‑driven day‑ahead Volt/Var schedule solution for distribution networks, which enables the self‑evolution of LLM agent's strategies through the collaboration and interaction of multiple modules, specifically, experience storage, experience retrieval, experience generation, and experience modification. The experience storage module archives historical operational records and decisions, while the retrieval module selects relevant past cases according to current forecasting conditions. The LLM agent then leverages these retrieved experiences to generate new, context‑aware decisions for current situation, which are subsequently refined by the modification module to realize self‑evolution of the dispatch policy. Comprehensive experimental results validate the effectiveness of the proposed method and highlight the applicability of LLMs in power system dispatch problems facing incomplete information.

PaperID: 9578, https://arxiv.org/pdf/2507.14730.pdf

Authors: Rui Liu, Tao Zhe, Zhong-Ren Peng, Necati Catbas, Xinyue Ye, Dongjie Wang, Yanjie Fu

Title: Towards Urban Planing AI Agent in the Age of Agentic AI

Abstract:
Generative AI, large language models, and agentic AI have emerged separately of urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualizes urban planning as a generative AI task, where AI synthesizes land‑use configurations under geospatial, social, and human‑centric constraints and reshape automated urban design. We further identify critical gaps of existing generative urban planning studies: 1) the generative structure has to be predefined with strong assumption: all of adversarial generator‑discriminator, forward and inverse diffusion structures, hierarchical zone‑POI generative structure are predefined by humans; 2) ignore the power of domain expert developed tools: domain urban planners have developed various tools in the urban planning process guided by urban theory, while existing pure neural networks based generation ignore the power of the tools developed by urban planner practitioners. To address these limitations, we outline a future research direction agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.

PaperID: 9579, https://arxiv.org/pdf/2507.14705.pdf

Authors: Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, Jun Xu

Title: Configurable multi-agent framework for scalable and realistic testing of llm-based agents

Abstract:
Large‑language‑model (LLM) agents exhibit complex, context‑sensitive behaviour that quickly renders static benchmarks and ad‑hoc manual testing obsolete. We present Neo, a configurable, multi‑agent framework that automates realistic, multi‑turn evaluation of LLM‑based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context‑hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human‑like conversations that adapt after every turn. Applied to a production‑grade Seller Financial Assistant chatbot, Neo (i) uncovered edge‑case failures across five attack categories with a 3.3% break rate close to the 5.8% achieved by expert human red‑teamers, and (ii) delivered 10‑12X higher throughput, generating 180 coherent test questions in around 45 mins versus 16h of human effort. Beyond security probing, Neo's stochastic policies balanced topic coverage and conversational depth, yielding broader behavioural exploration than manually crafted scripts. Neo therefore lays a foundation for scalable, self‑evolving LLM QA: its agent interfaces, state controller and feedback loops are model‑agnostic and extensible to richer factual‑grounding and policy‑compliance checks. We release the framework to facilitate reproducible, high‑fidelity testing of emerging agentic systems.

PaperID: 9580, https://arxiv.org/pdf/2507.14680.pdf

Authors: Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, Linlin Shen

Title: WSI-Agents: A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis

Abstract:
Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi‑modal large language models (MLLMs) allow multi‑task WSI analysis through natural language, they often underperform compared to task‑specific models. Collaborative multi‑agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology‑specific domains. To address these issues, we propose WSI‑Agents, a novel collaborative multi‑agent system for multi‑modal WSI analysis. WSI‑Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task‑specific accuracy and multi‑task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch and WSI level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain‑specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi‑modal WSI benchmarks show WSI‑Agents's superiority to current WSI MLLMs and medical agent frameworks across diverse tasks.

PaperID: 9581, https://arxiv.org/pdf/2507.14658.pdf

Authors: Faizan Contractor, Li Li, Ranwa Al Mallah

Title: Learning to Communicate in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

Abstract:
Popular methods in cooperative Multi‑Agent Reinforcement Learning with partially observable environments typically allow agents to act independently during execution, which may limit the coordinated effect of the trained policies. However, by sharing information such as known or suspected ongoing threats, effective communication can lead to improved decision‑making in the cyber battle space. We propose a game design where defender agents learn to communicate and defend against imminent cyber threats by playing training games in the Cyber Operations Research Gym, using the Differentiable Inter Agent Learning algorithm adapted to the cyber operational environment. The tactical policies learned by these autonomous agents are akin to those of human experts during incident responses to avert cyber threats. In addition, the agents simultaneously learn minimal cost communication messages while learning their defence tactical policies.

PaperID: 9582, https://arxiv.org/pdf/2507.14513.pdf

Authors: Hongyi Yang, Yue Pan, Jiayi Xu, Kelsen Liu

Title: Amico: An Event-Driven Modular Framework for Persistent and Embedded Autonomy

Abstract:
Recent advances in large language models (LLMs) and autonomous agents have enabled systems capable of performing complex tasks across domains such as human‑computer interaction, planning, and web navigation. However, many existing frameworks struggle in real‑world or resource‑constrained environments due to their reliance on cloud‑based computation, limited robustness in dynamic contexts, and lack of persistent autonomy and environmental awareness. We present Amico, a modular, event‑driven framework for building autonomous agents optimized for embedded systems. Written in Rust for safety and performance, Amico supports reactive, persistent agents that operate efficiently across embedded platforms and browser environments via WebAssembly. It provides clean abstractions for event handling, state management, behavior execution, and integration with reasoning modules. Amico delivers a unified infrastructure for constructing resilient, interactive agents suitable for deployment in settings with limited compute and intermittent connectivity.

PaperID: 9583, https://arxiv.org/pdf/2507.14447.pdf

Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

Title: Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Abstract:
The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain‑specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi‑step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi‑step tool‑calling tasks with high stability. In evaluations conducted within a real‑world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT‑4o from 41.1% to 96.3%, and Qwen3‑14B from 32.6% to 83.3%. We further constructed a Routine‑following training dataset and fine‑tuned Qwen3‑14B, resulting in an accuracy increase to 88.2% on scenario‑specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine‑based distillation to create a scenario‑specific, multi‑step tool‑calling dataset. Fine‑tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT‑4o's performance. These results highlight Routine's effectiveness in distilling domain‑specific tool‑usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.

PaperID: 9584, https://arxiv.org/pdf/2507.14398.pdf

Authors: Md. Kamrul Hossain, Walid Aljoby

Title: NetIntent: Leveraging Large Language Models for End-to-End Intent-Based SDN Automation

Abstract:
Intent‑Based Networking (IBN) often leverages the programmability of Software‑Defined Networking (SDN) to simplify network management. However, significant challenges remain in automating the entire pipeline, from user‑specified high‑level intents to device‑specific low‑level configurations. Existing solutions often rely on rigid, rule‑based translators and fixed APIs, limiting extensibility and adaptability. By contrast, recent advances in large language models (LLMs) offer a promising pathway that leverages natural language understanding and flexible reasoning. However, it is unclear to what extent LLMs can perform IBN tasks. To address this, we introduce IBNBench, a first‑of‑its‑kind benchmarking suite comprising four novel datasets: Intent2Flow‑ODL, Intent2Flow‑ONOS, FlowConflict‑ODL, and FlowConflict‑ONOS. These datasets are specifically designed for evaluating LLMs performance in intent translation and conflict detection tasks within the industry‑grade SDN controllers ODL and ONOS. Our results provide the first comprehensive comparison of 33 open‑source LLMs on IBNBench and related datasets, revealing a wide range of performance outcomes. However, while these results demonstrate the potential of LLMs for isolated IBN tasks, integrating LLMs into a fully autonomous IBN pipeline remains unexplored. Thus, our second contribution is NetIntent, a unified and adaptable framework that leverages LLMs to automate the full IBN lifecycle, including translation, activation, and assurance within SDN systems. NetIntent orchestrates both LLM and non‑LLM agents, supporting dynamic re‑prompting and contextual feedback to robustly execute user‑defined intents with minimal human intervention. Our implementation of NetIntent across both ODL and ONOS SDN controllers achieves a consistent and adaptive end‑to‑end IBN realization.

PaperID: 9585, https://arxiv.org/pdf/2507.14393.pdf

Authors: Humza Sami, Mubashir ul Islam, Pierre-Emmanuel Gaillardon, Valerio Tenace

Title: Adaptive Multi-Agent Reasoning via Automated Workflow Generation

Abstract:
The rise of Large Reasoning Models (LRMs) promises a significant leap forward in language model capabilities, aiming to tackle increasingly sophisticated tasks with unprecedented efficiency and accuracy. However, despite their impressive performance, recent studies have highlighted how current reasoning models frequently fail to generalize to novel, unseen problems, often resorting to memorized solutions rather than genuine inferential reasoning. Such behavior underscores a critical limitation in modern LRMs, i.e., their tendency toward overfitting, which in turn results in poor generalization in problem‑solving capabilities. In this paper, we introduce Nexus Architect, an enhanced iteration of our multi‑agent system framework, Nexus, equipped with a novel automated workflow synthesis mechanism. Given a user's prompt and a small set of representative examples, the Architect autonomously generates a tailored reasoning workflow by selecting suitable strategies, tool integrations, and adversarial techniques for a specific problem class. Furthermore, the Architect includes an iterative prompt refinement mechanism that fine‑tunes agents' system prompts to maximize performance and improve the generalization capabilities of the system. We empirically evaluate Nexus Architect by employing an off‑the‑shelf, non‑reasoning model on a custom dataset of challenging logical questions and compare its performance against state‑of‑the‑art LRMs. Results show that Nexus Architect consistently outperforms existing solutions, achieving up to a 66% increase in pass rate over Gemini 2.5 Flash Preview, nearly 2.5× against Claude Sonnet 4 and DeepSeek‑R1, and over 3× w.r.t. Llama 4 Scout.

PaperID: 9586, https://arxiv.org/pdf/2507.14267.pdf

Authors: Ziqi Wang, Hongshuo Huang, Hancheng Zhao, Changwen Xu, Shang Zhu, Jan Janssen, Venkatasubramanian Viswanathan

Title: DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation

Abstract:
Materials discovery relies on high‑throughput, high‑fidelity simulation techniques such as Density Functional Theory (DFT), which require years of training, extensive parameter fine‑tuning and systematic error handling. To address these challenges, we introduce the DFT‑based Research Engine for Agentic Materials Screening (DREAMS), a hierarchical, multi‑agent framework for DFT simulation that combines a central Large Language Model (LLM) planner agent with domain‑specific LLM agents for atomistic structure generation, systematic DFT convergence testing, High‑Performance Computing (HPC) scheduling, and error handling. In addition, a shared canvas helps the LLM agents to structure their discussions, preserve context and prevent hallucination. We validate DREAMS capabilities on the Sol27LC lattice‑constant benchmark, achieving average errors below 1% compared to the results of human DFT experts. Furthermore, we apply DREAMS to the long‑standing CO/Pt(111) adsorption puzzle, demonstrating its long‑term and complex problem‑solving capabilities. The framework again reproduces expert‑level literature adsorption‑energy differences. Finally, DREAMS is employed to quantify functional‑driven uncertainties with Bayesian ensemble sampling, confirming the Face Centered Cubic (FCC)‑site preference at the Generalized Gradient Approximation (GGA) DFT level. In conclusion, DREAMS approaches L3‑level automation ‑ autonomous exploration of a defined design space ‑ and significantly reduces the reliance on human expertise and intervention, offering a scalable path toward democratized, high‑throughput, high‑fidelity computational materials discovery.

PaperID: 9587, https://arxiv.org/pdf/2507.14234.pdf

Authors: Samer Nasser, Henrique Duarte Moura, Dragan Subotic, Ritesh Kumar Singh, Maarten Weyn, Jeroen Famaey

Title: Feasibility of Energy Neutral Wildlife Tracking using Multi-Source Energy Harvesting

Abstract:
Long‑term wildlife tracking is crucial for biodiversity monitoring, but energy limitations pose challenges, especially for animal tags, where replacing batteries is impractical and stressful for the animal due to the need to locate, possibly sedate, and handle it. Energy harvesting offers a sustainable alternative, yet most existing systems rely on a single energy source and infrastructure‑limited communication technologies. This paper presents an energy‑neutral system that combines solar and kinetic energy harvesting to enable the tracking and monitoring of wild animals. Harvesting from multiple sources increases the total available energy. Uniquely, the kinetic harvester also serves as a motion proxy by sampling harvested current, enabling activity monitoring without dedicated sensors. Our approach also ensures compatibility with existing cellular infrastructure, using Narrowband Internet of Things (NB‑IoT). We present a simulation framework that models energy harvesting, storage, and consumption at the component level. An energy‑aware scheduler coordinates task execution based on real‑time energy availability. We evaluate performance under realistically varying conditions, comparing task frequencies and capacitor sizes. Results show that our approach maintains energy‑neutral operation while significantly increasing data yield and reliability compared to single‑source systems, with the ability to consistently sample GPS location data and kinetic harvesting data every two minutes while transmitting these results over NB‑IoT every hour. These findings demonstrate the potential for maintenance‑free, environmentally friendly tracking in remote habitats, enabling more effective and scalable wildlife monitoring.

PaperID: 9588, https://arxiv.org/pdf/2507.14202.pdf

Authors: Pengfei Du

Title: PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate intermediate reasoning steps, introducing substantial computational overhead and scalability constraints. This paper presents a novel PRM‑free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency. Our approach systematically identifies vulnerabilities through sophisticated attack strategies including genetic algorithm optimization, multi‑agent simulation, and advanced prompt mutation techniques. The framework enhances model robustness via targeted adversarial training with curriculum learning and adaptive regularization mechanisms. Comprehensive experimental evaluation across five state‑of‑the‑art LLMs demonstrates that our method achieves superior security alignment performance compared to PRM‑based approaches while reducing computational costs by 61%. The framework incorporates transparent reporting and continuous audit mechanisms that enable iterative security improvement and regulatory compliance. Our contributions advance the field of efficient LLM security alignment by democratizing access to robust security measures for resource‑constrained organizations and providing a scalable foundation for addressing evolving adversarial threats.

PaperID: 9589, https://arxiv.org/pdf/2507.14154.pdf

Authors: Rahul Kabali

Title: The Free Will Equation: Quantum Field Analogies for AGI

Abstract:
Artificial General Intelligence (AGI) research traditionally focuses on algorithms that optimize for specific goals under deterministic rules. Yet, human‑like intelligence exhibits adaptive spontaneity ‑ an ability to make unexpected choices or free decisions not strictly dictated by past data or immediate reward. This trait, often dubbed "free will" in a loose sense, might be crucial for creativity, robust adaptation, and avoiding ruts in problem‑solving. This paper proposes a theoretical framework, called the Free Will Equation, that draws analogies from quantum field theory to endow AGI agents with a form of adaptive, controlled stochasticity in their decision‑making process. The core idea is to treat an AI agent's cognitive state as a superposition of potential actions or thoughts, which collapses probabilistically into a concrete action when a decision is made ‑ much like a quantum wavefunction collapsing upon measurement. By incorporating mechanisms analogous to quantum fields, along with intrinsic motivation terms, we aim to improve an agent's ability to explore novel strategies and adapt to unforeseen changes. Experiments in a non‑stationary multi‑armed bandit environment demonstrate that agents using this framework achieve higher rewards and policy diversity compared to baseline methods.

PaperID: 9590, https://arxiv.org/pdf/2507.14088.pdf

Authors: Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu

Title: DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

Abstract:
Real‑time human‑artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large language model (LLM) agents often fail to accurately model the complex human mental characteristics such as domain intentions, especially in the absence of direct communication. To address this limitation, we propose a novel dual process multi‑scale theory of mind (DPMT) framework, drawing inspiration from cognitive science dual process theory. Our DPMT framework incorporates a multi‑scale theory of mind (ToM) module to facilitate robust human partner modeling through mental characteristic reasoning. Experimental results demonstrate that DPMT significantly enhances human‑AI collaboration, and ablation studies further validate the contributions of our multi‑scale ToM in the slow system.

PaperID: 9591, https://arxiv.org/pdf/2507.13940.pdf

Authors: Qingyi Chen, Ahmed H. Qureshi

Title: NeHMO: Neural Hamilton-Jacobi Reachability Learning for Decentralized Safe Multi-Agent Motion Planning

Abstract:
Safe Multi‑Agent Motion Planning (MAMP) is a significant challenge in robotics. Despite substantial advancements, existing methods often face a dilemma. Decentralized algorithms typically rely on predicting the behavior of other agents, sharing contracts, or maintaining communication for safety, while centralized approaches struggle with scalability and real‑time decision‑making. To address these challenges, we introduce Neural Hamilton‑Jacobi Reachability Learning (HJR) for Decentralized Multi‑Agent Motion Planning. Our method provides scalable neural HJR modeling to tackle high‑dimensional configuration spaces and capture worst‑case collision and safety constraints between agents. We further propose a decentralized trajectory optimization framework that incorporates the learned HJR solutions to solve MAMP tasks in real‑time. We demonstrate that our method is both scalable and data‑efficient, enabling the solution of MAMP problems in higher‑dimensional scenarios with complex collision constraints. Our approach generalizes across various dynamical systems, including a 12‑dimensional dual‑arm setup, and outperforms a range of state‑of‑the‑art techniques in successfully addressing challenging MAMP tasks. Video demonstrations are available at https://youtu.be/IZiePX0p1Mc.

PaperID: 9592, https://arxiv.org/pdf/2507.13920.pdf

Authors: Turan Orujlu, Christian Gumbsch, Martin V. Butz, Charley M Wu

Title: Causal Process Models: Reframing Dynamic Causal Graph Discovery as a Reinforcement Learning Problem

Abstract:
Most neural models of causality assume static causal graphs, failing to capture the dynamic and sparse nature of physical interactions where causal relationships emerge and dissolve over time. We introduce the Causal Process Framework and its neural implementation, Causal Process Models (CPMs), for learning sparse, time‑varying causal graphs from visual observations. Unlike traditional approaches that maintain dense connectivity, our model explicitly constructs causal edges only when objects actively interact, dramatically improving both interpretability and computational efficiency. We achieve this by casting dynamic interaction‑graph construction for world modeling as a multi‑agent reinforcement learning problem, where specialized agents sequentially decide which objects are causally connected at each timestep. Our key innovation is a structured representation that factorizes object and force vectors along three learned dimensions (mutability, causal relevance, and control relevance), enabling the automatic discovery of semantically meaningful encodings. We demonstrate that a CPM significantly outperforms dense graph baselines on physical prediction tasks, particularly for longer horizons and varying object counts.

PaperID: 9593, https://arxiv.org/pdf/2507.13846.pdf

Authors: Kathrin Korte, Christian Medeiros Adriano, Sona Ghahremani, Holger Giese

Title: Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments

Abstract:
[Context] Multi‑agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors. However, transferring knowledge across agents remains challenging in non‑stationary environments with changing goals. [Problem] Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt. [Approach] This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non‑stationary environment. As the environment changes (new obstacles), agents' collisions require adaptive recovery strategies. We model each collision as a causal intervention instantiated as a sequence of recovery actions (a macro) whose effect corresponds to a causal knowledge of how to circumvent the obstacle while increasing the chances of achieving the agent's goal (maximizing cumulative reward). This recovery action macro is transferred online from a second agent and is applied in a zero‑shot fashion, i.e., without retraining, just by querying a lookup model with local context information (collisions). [Results] Our findings reveal two key insights: (1) agents with heterogeneous goals were able to bridge about half of the gap between random exploration and a fully retrained policy when adapting to new environments, and (2) the impact of causal knowledge transfer depends on the interplay between environment complexity and agents' heterogeneous goals.

PaperID: 9594, https://arxiv.org/pdf/2507.13814.pdf

Authors: Jianing Zhao, Peng Gao, Jiannong Cao, Zhiyuan Wen, Chen Chen, Jianing Yin, Ruosong Yang, Bo Yuan

Title: CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education

Abstract:
Large Language Models (LLMs) have demonstrated considerable potential in improving coding education by providing support for code writing, explanation, and debugging. However, existing LLM‑based approaches generally fail to assess students' abilities, design learning plans, provide personalized material aligned with individual learning goals, and enable interactive learning. Current work mostly uses single LLM agents, which limits their ability to understand complex code repositories and schedule step‑by‑step tutoring. Recent research has shown that multi‑agent LLMs can collaborate to solve complicated problems in various domains like software engineering, but their potential in the field of education remains unexplored. In this work, we introduce CodeEdu, an innovative multi‑agent collaborative platform that combines LLMs with tool use to provide proactive and personalized education in coding. Unlike static pipelines, CodeEdu dynamically allocates agents and tasks to meet student needs. Various agents in CodeEdu undertake certain functions specifically, including task planning, personalized material generation, real‑time QA, step‑by‑step tutoring, code execution, debugging, and learning report generation, facilitated with extensive external tools to improve task efficiency. Automated evaluations reveal that CodeEdu substantially enhances students' coding performance.

PaperID: 9595, https://arxiv.org/pdf/2507.13758.pdf

Authors: Qian Wang, Zhenheng Tang, Zhanzhi Lou, Nuo Chen, Wenxuan Wang, Bingsheng He

Title: Towards Evaluting Fake Reasoning Bias in Language Models

Abstract:
Large Reasoning Models (LRMs), evolved from standard Large Language Models (LLMs), are increasingly utilized as automated judges because of their explicit reasoning processes. Yet we show that both LRMs and standard LLMs are vulnerable to Fake Reasoning Bias (FRB), where models favor the surface structure of reasoning even when the logic is flawed. To study this problem, we introduce THEATER, a comprehensive benchmark that systematically investigates FRB by manipulating reasoning structures to test whether language models are misled by superficial or fabricated cues. It covers two FRB types: (1) Simple Cues, minimal cues that resemble reasoning processes, and (2) Fake CoT, fabricated chains of thought that simulate multi‑step reasoning. We evaluate 17 advanced LLMs and LRMs on both subjective DPO and factual datasets. Our results reveal four key findings: (1) Both LLMs and LRMs are vulnerable to FRB, but LLMs are generally more robust than LRMs. (2) Simple Cues are especially harmful, reducing accuracy by up to 15% on the most vulnerable datasets. (3) Subjective DPO tasks are the most vulnerable, with LRMs suffering sharper drops than LLMs. (4) Analysis of LRMs' thinking traces shows that Simple Cues hijack metacognitive confidence, while Fake CoT is absorbed as internal thought, creating a "more thinking, less robust" paradox in LRMs. Finally, prompt‑based mitigation improves accuracy on factual tasks by up to 10%, but has little effect on subjective tasks, where self‑reflection sometimes lowers LRM performance by 8%. These results highlight FRB as a persistent and unresolved challenge for language models.

PaperID: 9596, https://arxiv.org/pdf/2507.13729.pdf

Authors: Yu Yao, Salil Bhatnagar, Markus Mazzola, Vasileios Belagiannis, Igor Gilitschenski, Luigi Palmieri, Simon Razniewski, Marcel Hallgarten

Title: AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

Abstract:
Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real‑world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data‑driven models require extensive training data and often lack fine‑grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes which undermines the validity of evaluations especially for learning‑based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves the manual augmentation of scenarios by domain experts. An approach that is unable to meet the demands for scale in the evaluation of self‑driving systems. Therefore, this paper introduces a novel LLM‑agent based framework for augmenting real‑world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine‑grained control over the output and maintaining high performance even with smaller, cost‑effective LLMs. Extensive human expert evaluation demonstrates our framework's ability to accurately adhere to user intent, generating high quality augmented scenarios comparable to those created manually.

PaperID: 9597, https://arxiv.org/pdf/2507.13678.pdf

Authors: Honghao Wu, Kemi Ding, Li Qiu

Title: Minimum Clustering of Matrices Based on Phase Alignment

Abstract:
Coordinating multi‑agent systems requires balancing synchronization performance and controller implementation costs. To this end, we classify agents by their intrinsic properties, enabling each group to be controlled by a uniform controller and thus reducing the number of unique controller types required. Existing centralized control methods, despite their capability to achieve high synchronization performance with fewer types of controllers, suffer from critical drawbacks such as limited scalability and vulnerability to single points of failure. On the other hand, distributed control strategies, where controllers are typically agent‑dependent, result in the type of required controllers increasing proportionally with the size of the system. This paper introduces a novel phase‑alignment‑based framework to minimize the type of controllers by strategically clustering agents with aligned synchronization behaviors. Leveraging the intrinsic phase properties of complex matrices, we formulate a constrained clustering problem and propose a hierarchical optimization method combining recursive exact searches for small‑scale systems and scalable stochastic approximations for large‑scale networks. This work bridges theoretical phase analysis with practical control synthesis, offering a cost‑effective solution for large‑scale multi‑agent systems. The theoretical results applied for the analysis of a 50‑agent network illustrate the effectiveness of the proposed algorithms.

PaperID: 9598, https://arxiv.org/pdf/2507.13647.pdf

Authors: Minze Li, Wei Zhao, Ran Chen, Mingqiang Wei

Title: Improved particle swarm optimization algorithm: multi-target trajectory optimization for swarm drones

Abstract:
Real‑time trajectory planning for unmanned aerial vehicles (UAVs) in dynamic environments remains a key challenge due to high computational demands and the need for fast, adaptive responses. Traditional Particle Swarm Optimization (PSO) methods, while effective for offline planning, often struggle with premature convergence and latency in real‑time scenarios. To overcome these limitations, we propose PE‑PSO, an enhanced PSO‑based online trajectory planner. The method introduces a persistent exploration mechanism to preserve swarm diversity and an entropy‑based parameter adjustment strategy to dynamically adapt optimization behavior. UAV trajectories are modeled using B‑spline curves, which ensure path smoothness while reducing optimization complexity. To extend this capability to UAV swarms, we develop a multi‑agent framework that combines genetic algorithm (GA)‑based task allocation with distributed PE‑PSO, supporting scalable and coordinated trajectory generation. The distributed architecture allows for parallel computation and decentralized control, enabling effective cooperation among agents while maintaining real‑time performance. Comprehensive simulations demonstrate that the proposed framework outperforms conventional PSO and other swarm‑based planners across several metrics, including trajectory quality, energy efficiency, obstacle avoidance, and computation time. These results confirm the effectiveness and applicability of PE‑PSO in real‑time multi‑UAV operations under complex environmental conditions.

PaperID: 9599, https://arxiv.org/pdf/2507.13601.pdf

Authors: Jorge Villarrubia, Luis Costero, Francisco D. Igual, Katzalin Olcoz

Title: Leveraging Multi-Instance GPUs through moldable task scheduling

Abstract:
NVIDIA MIG (Multi‑Instance GPU) allows partitioning a physical GPU into multiple logical instances with fully‑isolated resources, which can be dynamically reconfigured. This work highlights the untapped potential of MIG through moldable task scheduling with dynamic reconfigurations. Specifically, we propose a makespan minimization problem for multi‑task execution under MIG constraints. Our profiling shows that assuming monotonicity in task work with respect to resources is not viable, as is usual in multicore scheduling. Relying on a state‑of‑the‑art proposal that does not require such an assumption, we present FAR, a 3‑phase algorithm to solve the problem. Phase 1 of FAR builds on a classical task moldability method, phase 2 combines Longest Processing Time First and List Scheduling with a novel repartitioning tree heuristic tailored to MIG constraints, and phase 3 employs local search via task moves and swaps. FAR schedules tasks in batches offline, concatenating their schedules on the fly in an improved way that favors resource reuse. Excluding reconfiguration costs, the List Scheduling proof shows an approximation factor of 7/4 on the NVIDIA A30 model. We adapt the technique to the particular constraints of an NVIDIA A100/H100 to obtain an approximation factor of 2. Including the reconfiguration cost, our real‑world experiments reveal a makespan with respect to the optimum no worse than 1.22x for a well‑known suite of benchmarks, and 1.10x for synthetic inputs inspired by real kernels. We obtain good experimental results for each batch of tasks, but also in the concatenation of batches, with large improvements over the state‑of‑the‑art and proposals without GPU reconfiguration. Beyond the algorithm, the paper demonstrates the research potential of the MIG technology and suggests useful metrics, workload characterizations and evaluation techniques for future work in this field.

PaperID: 9600, https://arxiv.org/pdf/2507.13511.pdf

Authors: Nabil Abdelaziz Ferhat Taleb, Abdolazim Rezaei, Raj Atulkumar Patel, Mehdi Sookhak

Title: GraphTrafficGPT: Enhancing Traffic Management Through Graph-Based AI Agent Coordination

Abstract:
Large Language Models (LLMs) offer significant promise for intelligent traffic management; however, current chain‑based systems like TrafficGPT are hindered by sequential task execution, high token usage, and poor scalability, making them inefficient for complex, real‑world scenarios. To address these limitations, we propose GraphTrafficGPT, a novel graph‑based architecture, which fundamentally redesigns the task coordination process for LLM‑driven traffic applications. GraphTrafficGPT represents tasks and their dependencies as nodes and edges in a directed graph, enabling efficient parallel execution and dynamic resource allocation. The main idea behind the proposed model is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates a network of specialized agents for data retrieval, analysis, visualization, and simulation. By introducing advanced context‑aware token management and supporting concurrent multi‑query processing, the proposed architecture handles interdependent tasks typical of modern urban mobility environments. Experimental results demonstrate that GraphTrafficGPT reduces token consumption by 50.2% and average response latency by 19.0% compared to TrafficGPT, while supporting simultaneous multi‑query execution with up to 23.0% improvement in efficiency.

PaperID: 9601, https://arxiv.org/pdf/2507.13370.pdf

Authors: Shijun Guo, Haoran Xu, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyi Zhang, Yishan Song

Title: H-NeiFi: Non-Invasive and Consensus-Efficient Multi-Agent Opinion Guidance

Abstract:
The openness of social media enables the free exchange of opinions, but it also presents challenges in guiding opinion evolution towards global consensus. Existing methods often directly modify user views or enforce cross‑group connections. These intrusive interventions undermine user autonomy, provoke psychological resistance, and reduce the efficiency of global consensus. Additionally, due to the lack of a long‑term perspective, promoting local consensus often exacerbates divisions at the macro level. To address these issues, we propose the hierarchical, non‑intrusive opinion guidance framework, H‑NeiFi. It first establishes a two‑layer dynamic model based on social roles, considering the behavioral characteristics of both experts and non‑experts. Additionally, we introduce a non‑intrusive neighbor filtering method that adaptively controls user communication channels. Using multi‑agent reinforcement learning (MARL), we optimize information propagation paths through a long‑term reward function, avoiding direct interference with user interactions. Experiments show that H‑NeiFi increases consensus speed by 22.0% to 30.7% and maintains global convergence even in the absence of experts. This approach enables natural and efficient consensus guidance by protecting user interaction autonomy, offering a new paradigm for social network governance.

PaperID: 9602, https://arxiv.org/pdf/2507.13334.pdf

Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Title: A Survey of Context Engineering for Large Language Models

Abstract:
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval‑augmented generation (RAG), memory systems and tool‑integrated reasoning, and multi‑agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long‑form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context‑aware AI.

PaperID: 9603, https://arxiv.org/pdf/2507.13285.pdf

Authors: Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao

Title: Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis

Abstract:
Automated generation of high‑quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference‑based evaluation framework employing rationale‑enhanced multi‑dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.

PaperID: 9604, https://arxiv.org/pdf/2507.13247.pdf

Authors: Ruohao Li, Jiawei Li, Jia Sun, Zhiqing Wu, Zisu Li, Ziyan Wang, Ge Lin Kan, Mingming Fan

Title: RemVerse: Supporting Reminiscence Activities for Older Adults through AI-Assisted Virtual Reality

Abstract:
Reminiscence activities, which involve recalling and sharing past experiences, have proven beneficial for improving cognitive function, mood, and overall well‑being. However, urbanization has led to the disappearance of familiar environments, removing visual and audio cues for effective reminiscence. While old photos can serve as visual cues to aid reminiscence, it is challenging for people to reconstruct the reminisced content and environment that are not in the photos. Virtual reality (VR) and artificial intelligence (AI) offer the ability to reconstruct an immersive environment with dynamic content and to converse with people to help them gradually reminisce. We designed RemVerse, an AI‑empowered VR prototype aimed to support reminiscence activities. Integrating generative models and AI agent into a VR environment, RemVerse helps older adults reminisce with AI‑generated visual cues and interactive dialogues. Our user study with 14 older adults showed that RemVerse effectively supported reminiscence activities by triggering, concretizing, and deepening personal memories, while fostering increased engagement and autonomy among older adults. Based on our findings, we proposed design implications to make reminiscence activities in AI‑assisted VR more accessible and engaging for older adults.

PaperID: 9605, https://arxiv.org/pdf/2507.13190.pdf

Authors: Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma

Title: GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems

Abstract:
Multi‑agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph‑based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process‑level metrics: Information Diversity Score (IDS) to measure semantic variation in inter‑agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome‑only metrics are insufficient for evaluating multi‑agent performance and highlight the importance of process‑level diagnostics in designing more interpretable and resource‑efficient collaborative AI systems.

PaperID: 9606, https://arxiv.org/pdf/2507.13169.pdf

Authors: Jeremy McHugh, Kristina Šekrst, Jon Cefalu

Title: Prompt Injection 2.0: Hybrid AI Threats

Abstract:
Prompt injection attacks, where malicious input is designed to manipulate AI systems into ignoring their original instructions and following unauthorized commands instead, were first discovered by Preamble, Inc. in May 2022 and responsibly disclosed to OpenAI. Over the last three years, these attacks have continued to pose a critical security threat to LLM‑integrated systems. The emergence of agentic AI systems, where LLMs autonomously perform multistep tasks through tools and coordination with other agents, has fundamentally transformed the threat landscape. Modern prompt injection attacks can now combine with traditional cybersecurity exploits to create hybrid threats that systematically evade traditional security controls. This paper presents a comprehensive analysis of Prompt Injection 2.0, examining how prompt injections integrate with Cross‑Site Scripting (XSS), Cross‑Site Request Forgery (CSRF), and other web security vulnerabilities to bypass traditional security measures. We build upon Preamble's foundational research and mitigation technologies, evaluating them against contemporary threats, including AI worms, multi‑agent infections, and hybrid cyber‑AI attacks. Our analysis incorporates recent benchmarks that demonstrate how traditional web application firewalls, XSS filters, and CSRF tokens fail against AI‑enhanced attacks. We also present architectural solutions that combine prompt isolation, runtime security, and privilege separation with novel threat detection capabilities.

PaperID: 9607, https://arxiv.org/pdf/2507.13140.pdf

Authors: Kuiyuan Ding, Caili Guo, Yang Yang, Jianzhang Guo

Title: RIDAS: A Multi-Agent Framework for AI-RAN with Representation- and Intention-Driven Agents

Abstract:
Sixth generation (6G) networks demand tight integration of artificial intelligence (AI) into radio access networks (RANs) to meet stringent quality of service (QoS) and resource efficiency requirements. Existing solutions struggle to bridge the gap between high level user intents and the low level, parameterized configurations required for optimal performance. To address this challenge, we propose RIDAS, a multi agent framework composed of representation driven agents (RDAs) and an intention driven agent (IDA). RDAs expose open interface with tunable control parameters (rank and quantization bits, enabling explicit trade) offs between distortion and transmission rate. The IDA employs a two stage planning scheme (bandwidth pre allocation and reallocation) driven by a large language model (LLM) to map user intents and system state into optimal RDA configurations. Experiments demonstrate that RIDAS supports 36.47% more users than WirelessAgent under equivalent QoS constraints. These results validate ability of RIDAS to capture user intent and allocate resources more efficiently in AI RAN environments.

PaperID: 9608, https://arxiv.org/pdf/2507.13081.pdf

Authors: Dongming Jin, Weisong Sun, Jiangping Huang, Peng Liang, Jifeng Xuan, Yang Liu, Zhi Jin

Title: iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

Abstract:
Requirements development is a critical phase as it is responsible for providing a clear understanding of what stakeholders need. It involves collaboration among stakeholders to extract explicit requirements and address potential conflicts, which is time‑consuming and labor‑intensive. Recently, multi‑agent systems for software development have attracted much attention. However, existing research provides limited support for requirements development and overlooks the injection of human knowledge into agents and the human‑agent collaboration. % To address these issues, this paper proposes a knowledge‑driven multi‑agent framework for intelligent requirement development, named iReDev. iReDev features: iReDev consists of six knowledge‑driven agents to support the entire requirements development. They collaboratively perform various tasks to produce a software requirements specification. iReDev focuses on integrating human knowledge for agents, enabling them to simulate real‑world stakeholders. iReDev uses an event‑driven communication mechanism based on an artifact pool. Agents continuously monitor the pool and autonomously trigger the next action based on its changes, enabling iReDev to handle new requirements quickly. iReDev introduces a human‑in‑the‑loop mechanism to support human‑agent collaboration, ensuring that the generated artifacts align with the expectations of stakeholders. We evaluated the generated artifacts and results show that iReDev outperforms existing baselines in multiple aspects. We further envision three key directions and hope this work can facilitate the development of intelligent requirements development.

PaperID: 9609, https://arxiv.org/pdf/2507.13038.pdf

Authors: Yu Cui, Hongyang Du

Title: MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems

Abstract:
Multi‑agent debate (MAD) systems leverage collaborative interactions among large language models (LLMs) agents to improve reasoning capabilities. While recent studies have focused on increasing the accuracy and scalability of MAD systems, their security vulnerabilities have received limited attention. In this work, we introduce MAD‑Spear, a targeted prompt injection attack that compromises a small subset of agents but significantly disrupts the overall MAD process. Manipulated agents produce multiple plausible yet incorrect responses, exploiting LLMs' conformity tendencies to propagate misinformation and degrade consensus quality. Furthermore, the attack can be composed with other strategies, such as communication attacks, to further amplify its impact by increasing the exposure of agents to incorrect responses. To assess MAD's resilience under attack, we propose a formal definition of MAD fault‑tolerance and develop a comprehensive evaluation framework that jointly considers accuracy, consensus efficiency, and scalability. Extensive experiments on five benchmark datasets with varying difficulty levels demonstrate that MAD‑Spear consistently outperforms the baseline attack in degrading system performance. Additionally, we observe that agent diversity substantially improves MAD performance in mathematical reasoning tasks, which challenges prior work suggesting that agent diversity has minimal impact on performance. These findings highlight the urgent need to improve the security in MAD design.

PaperID: 9610, https://arxiv.org/pdf/2507.12855.pdf

Authors: Rahel Rickenbach, Bruce Lee, René Zurbrügg, Carmen Amo Alonso, Melanie N. Zeilinger

Title: DEMONSTRATE: Zero-shot Language to Robotic Control via Multi-task Demonstration Learning

Abstract:
The integration of large language models (LLMs) with control systems has demonstrated significant potential in various settings, such as task completion with a robotic manipulator. A main reason for this success is the ability of LLMs to perform in‑context learning, which, however, strongly relies on the design of task examples, closely related to the target tasks. Consequently, employing LLMs to formulate optimal control problems often requires task examples that contain explicit mathematical expressions, designed by trained engineers. Furthermore, there is often no principled way to evaluate for hallucination before task execution. To address these challenges, we propose DEMONSTRATE, a novel methodology that avoids the use of LLMs for complex optimization problem generations, and instead only relies on the embedding representations of task descriptions. To do this, we leverage tools from inverse optimal control to replace in‑context prompt examples with task demonstrations, as well as the concept of multitask learning, which ensures target and example task similarity by construction. Given the fact that hardware demonstrations can easily be collected using teleoperation or guidance of the robot, our approach significantly reduces the reliance on engineering expertise for designing in‑context examples. Furthermore, the enforced multitask structure enables learning from few demonstrations and assessment of hallucinations prior to task execution. We demonstrate the effectiveness of our method through simulation and hardware experiments involving a robotic arm tasked with tabletop manipulation.

PaperID: 9611, https://arxiv.org/pdf/2507.12801.pdf

Authors: Sosui Moribe, Taketoshi Ushiama

Title: Imitating Mistakes in a Learning Companion AI Agent for Online Peer Learning

Abstract:
In recent years, peer learning has gained attention as a method that promotes spontaneous thinking among learners, and its effectiveness has been confirmed by numerous studies. This study aims to develop an AI Agent as a learning companion that enables peer learning anytime and anywhere. However, peer learning between humans has various limitations, and it is not always effective. Effective peer learning requires companions at the same proficiency levels. In this study, we assume that a learner's peers with the same proficiency level as the learner make the same mistakes as the learner does and focus on English composition as a specific example to validate this approach.

PaperID: 9612, https://arxiv.org/pdf/2507.12580.pdf

Authors: Josephine Beatrice Skovbo Borre, Malene Gorm Wold, Sara Kjær Rasmussen, Ilhan Aslan

Title: "How to Explore Biases in Speech Emotion AI with Users?" A Speech-Emotion-Acting Study Exploring Age and Language Biases

Abstract:
This study explores how age and language shape the deliberate vocal expression of emotion, addressing underexplored user groups, Teenagers (N = 12) and Adults 55+ (N = 12), within speech emotion recognition (SER). While most SER systems are trained on spontaneous, monolingual English data, our research evaluates how such models interpret intentionally performed emotional speech across age groups and languages (Danish and English). To support this, we developed a novel experimental paradigm combining a custom user interface with a backend for real‑time SER prediction and data logging. Participants were prompted to hit visual targets in valence‑arousal space by deliberately expressing four emotion targets. While limitations include some reliance on self‑managed voice recordings and inconsistent task execution, the results suggest contrary to expectations, no significant differences between language or age groups, and a degree of cross‑linguistic and age robustness in model interpretation. Though some limitations in high‑arousal emotion recognition were evident. Our qualitative findings highlight the need to move beyond system‑centered accuracy metrics and embrace more inclusive, human‑centered SER models. By framing emotional expression as a goal‑directed act and logging the real‑time gap between human intent and machine interpretation, we expose the risks of affective misalignment.

PaperID: 9613, https://arxiv.org/pdf/2507.12484.pdf

Authors: Jarosław A. Chudziak, Adam Kostka

Title: AI-Powered Math Tutoring: Platform for Personalized and Adaptive Education

Abstract:
The growing ubiquity of artificial intelligence (AI), in particular large language models (LLMs), has profoundly altered the way in which learners gain knowledge and interact with learning material, with many claiming that AI positively influences their learning achievements. Despite this advancement, current AI tutoring systems face limitations associated with their reactive nature, often providing direct answers without encouraging deep reflection or incorporating structured pedagogical tools and strategies. This limitation is most apparent in the field of mathematics, in which AI tutoring systems remain underdeveloped. This research addresses the question: How can AI tutoring systems move beyond providing reactive assistance to enable structured, individualized, and tool‑assisted learning experiences? We introduce a novel multi‑agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool‑assisted learning processes. This system allows students to learn new topics while identifying and targeting their weaknesses, revise for exams effectively, and practice on an unlimited number of personalized exercises. This article contributes to the field of artificial intelligence in education by introducing a novel platform that brings together pedagogical agents and AI‑driven components, augmenting the field with modular and effective systems for teaching mathematics.

PaperID: 9614, https://arxiv.org/pdf/2507.12372.pdf

Authors: Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei, Mohsen Mosleh

Title: Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics

Abstract:
Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real time information retrieval and multi step reasoning over live web content. While prior studies have demonstrated LLMs ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post API era, it also raises risks of misuse particularly in information operations and targeted advertising underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public facing applications, while preserving controlled access for verified research purposes.

PaperID: 9615, https://arxiv.org/pdf/2507.12370.pdf

Authors: Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Title: Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate

Abstract:
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests processed by LLMs. To address these challenges, this paper introduces and evaluates a multi‑agent debate framework designed to enhance detection and resolution capabilities beyond single models. The framework consists of three LLM architectures (Llama3‑8B, Gemma2‑9B, and Mistral‑7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3‑8B and Mistral‑7B variants over their individual baselines, with Mistral‑7B‑led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.

PaperID: 9616, https://arxiv.org/pdf/2507.12174.pdf

Authors: Zhenmin Huang, Yusen Xie, Benshan Ma, Shaojie Shen, Jun Ma

Title: Fast and Scalable Game-Theoretic Trajectory Planning with Intentional Uncertainties

Abstract:
Trajectory planning involving multi‑agent interactions has been a long‑standing challenge in the field of robotics, primarily burdened by the inherent yet intricate interactions among agents. While game‑theoretic methods are widely acknowledged for their effectiveness in managing multi‑agent interactions, significant impediments persist when it comes to accommodating the intentional uncertainties of agents. In the context of intentional uncertainties, the heavy computational burdens associated with existing game‑theoretic methods are induced, leading to inefficiencies and poor scalability. In this paper, we propose a novel game‑theoretic interactive trajectory planning method to effectively address the intentional uncertainties of agents, and it demonstrates both high efficiency and enhanced scalability. As the underpinning basis, we model the interactions between agents under intentional uncertainties as a general Bayesian game, and we show that its agent‑form equivalence can be represented as a potential game under certain minor assumptions. The existence and attainability of the optimal interactive trajectories are illustrated, as the corresponding Bayesian Nash equilibrium can be attained by optimizing a unified optimization problem. Additionally, we present a distributed algorithm based on the dual consensus alternating direction method of multipliers (ADMM) tailored to the parallel solving of the problem, thereby significantly improving the scalability. The attendant outcomes from simulations and experiments demonstrate that the proposed method is effective across a range of scenarios characterized by general forms of intentional uncertainties. Its scalability surpasses that of existing centralized and decentralized baselines, allowing for real‑time interactive trajectory planning in uncertain game settings.

PaperID: 9617, https://arxiv.org/pdf/2507.11988.pdf

Authors: Yexuan Shi, Mingyu Wang, Yunxiang Cao, Hongjie Lai, Junjian Lan, Xin Han, Yu Wang, Jie Geng, Zhenan Li, Zihao Xia, Xiang Chen, Chen Li, Jian Xu, Wenbo Duan, Yuanshuo Zhu

Title: Aime: Towards Fully-Autonomous Multi-Agent Framework

Abstract:
Multi‑Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. However, the potential of these systems is often constrained by the prevalent plan‑and‑execute framework, which suffers from critical limitations: rigid plan execution, static agent capabilities, and inefficient communication. These weaknesses hinder their adaptability and robustness in dynamic environments. This paper introduces Aime, a novel multi‑agent framework designed to overcome these challenges through dynamic, reactive planning and execution. Aime replaces the conventional static workflow with a fluid and adaptive architecture. Its core innovations include: (1) a Dynamic Planner that continuously refines the overall strategy based on real‑time execution feedback; (2) an Actor Factory that implements Dynamic Actor instantiation, assembling specialized agents on‑demand with tailored tools and knowledge; and (3) a centralized Progress Management Module that serves as a single source of truth for coherent, system‑wide state awareness. We empirically evaluated Aime on a diverse suite of benchmarks spanning general reasoning (GAIA), software engineering (SWE‑bench Verified), and live web navigation (WebVoyager). The results demonstrate that Aime consistently outperforms even highly specialized state‑of‑the‑art agents in their respective domains. Its superior adaptability and task success rate establish Aime as a more resilient and effective foundation for multi‑agent collaboration.

PaperID: 9618, https://arxiv.org/pdf/2507.11979.pdf

Authors: Yuki Sakamoto, Takahisa Uchida, Hiroshi Ishiguro

Title: Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness

Abstract:
Large language models (LLMs) have emerged as powerful tools for simulating complex social phenomena using human‑like agents with specific traits. In human societies, value similarity is important for building trust and close relationships; however, it remains unexplored whether this principle holds true in artificial societies comprising LLM agents. Therefore, this study investigates the influence of value similarity on relationship‑building among LLM agents through two experiments. First, in a preliminary experiment, we evaluated the controllability of values in LLMs to identify the most effective model and prompt design for controlling the values. Subsequently, in the main experiment, we generated pairs of LLM agents imbued with specific values and analyzed their mutual evaluations of trust and interpersonal closeness following a dialogue. The experiments were conducted in English and Japanese to investigate language dependence. The results confirmed that pairs of agents with higher value similarity exhibited greater mutual trust and interpersonal closeness. Our findings demonstrate that the LLM agent simulation serves as a valid testbed for social science theories, contributes to elucidating the mechanisms by which values influence relationship building, and provides a foundation for inspiring new theories and insights into the social sciences.

PaperID: 9619, https://arxiv.org/pdf/2507.11972.pdf

Authors: Yuhong Zhang, Jialu Li, Shilai Yang, Yuchen Xu, Gert Cauwenberghs, Tzyy-Ping Jung

Title: Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker

Abstract:
Reading comprehension is a fundamental skill in human cognitive development. With the advancement of Large Language Models (LLMs), there is a growing need to compare how humans and LLMs understand language across different contexts and apply this understanding to functional tasks such as inference, emotion interpretation, and information retrieval. Our previous work used LLMs and human biomarkers to study the reading comprehension process. The results showed that the biomarkers corresponding to words with high and low relevance to the inference target, as labeled by the LLMs, exhibited distinct patterns, particularly when validated using eye‑tracking data. However, focusing solely on individual words limited the depth of understanding, which made the conclusions somewhat simplistic despite their potential significance. This study used an LLM‑based AI agent to group words from a reading passage into nodes and edges, forming a graph‑based text representation based on semantic meaning and question‑oriented prompts. We then compare the distribution of eye fixations on important nodes and edges. Our findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure. These results build on our previous findings and offer insights into effective human‑AI co‑learning strategies.

PaperID: 9620, https://arxiv.org/pdf/2507.11779.pdf

Authors: Alexander Stolyar

Title: Large-scale distributed synchronization systems, using a cancel-on-completion redundancy mechanism

Abstract:
We consider a class of multi‑agent distributed synchronization systems, which are modeled as n particles moving on the real line. This class generalizes the model of a multi‑server queueing system, considered in [15], employing so‑called cancel‑on‑completion (c.o.c.) redundancy mechanism, but is motivated by other applications as well. The model in [15] is a particle system, regulated at the left boundary point. The more general model of this paper is such that we allow regulation boundaries on either side, or both sides, or no regulation at all. We consider the mean‑field asymptotic regime, when the number of particles n and the job arrival rates go to infinity, while the job arrival rates per particle remain constant. The system state for a given n is the empirical distribution of the particles' locations. The results include: the existence/uniqueness of fixed points of mean‑field limits (ML), which describe the limiting dynamics of the system; conditions for the steady‑state asymptotic independence (concentration of the stationary distribution on a single ML fixed point); the limits of the average velocity at which unregulated (free) particle system advances. In particular, our results for the left‑regulated system unify and generalize the corresponding results in [15]. Our technical approach is such that the systems with different types of regulation are analyzed within a unified framework.

PaperID: 9621, https://arxiv.org/pdf/2507.11744.pdf

Authors: Marcin Kowalik, Przemysław Stokłosa, Mateusz Grabowski, Janusz Starzyk, Paweł Raif

Title: A Cellular Automata Approach to Donation Game

Abstract:
The donation game is a well‑established framework for studying the emergence and evolution of cooperation in multi‑agent systems. The cooperative behavior can be influenced by the environmental noise in partially observable settings and by the decision‑making strategies of agents, which may incorporate not only reputation but also traits such as generosity and forgiveness. Traditional simulations often assume fully random interactions, where cooperation is tested between randomly selected agent pairs. In this paper, we investigate cooperation dynamics using the concept of Stephen Wolfram's one‑dimensional binary cellular automata. This approach allows us to explore how cooperation evolves when interactions are limited to neighboring agents. We define binary cellular automata rules that conform to the donation game mechanics. Additionally, we introduce models of perceptual and action noise, along with a mutation matrix governing the probabilistic evolution of agent strategies. Our empirical results demonstrate that cooperation is significantly affected by agents' mobility and their spatial locality on the game board. These findings highlight the importance of distinguishing between entirely random multi‑agent systems and those in which agents are more likely to interact with their nearest neighbors.

PaperID: 9622, https://arxiv.org/pdf/2507.11660.pdf

Authors: Joao F. Rocha, Ke Xu, Xingzhi Sun, Ananya Krishna, Dhananjay Bhaskar, Blanche Mongeon, Morgan Craig, Mark Gerstein, Smita Krishnaswamy

Title: STAGED: A Multi-Agent Neural Network for Learning Cellular Interaction Dynamics

Abstract:
The advent of single‑cell technology has significantly improved our understanding of cellular states and subpopulations in various tissues under normal and diseased conditions by employing data‑driven approaches such as clustering and trajectory inference. However, these methods consider cells as independent data points of population distributions. With spatial transcriptomics, we can represent cellular organization, along with dynamic cell‑cell interactions that lead to changes in cell state. Still, key computational advances are necessary to enable the data‑driven learning of such complex interactive cellular dynamics. While agent‑based modeling (ABM) provides a powerful framework, traditional approaches rely on handcrafted rules derived from domain knowledge rather than data‑driven approaches. To address this, we introduce Spatio Temporal Agent‑Based Graph Evolution Dynamics(STAGED) integrating ABM with deep learning to model intercellular communication, and its effect on the intracellular gene regulatory network. Using graph ODE networks (GDEs) with shared weights per cell type, our approach represents genes as vertices and interactions as directed edges, dynamically learning their strengths through a designed attention mechanism. Trained to match continuous trajectories of simulated as well as inferred trajectories from spatial transcriptomics data, the model captures both intercellular and intracellular interactions, enabling a more adaptive and accurate representation of cellular dynamics.

PaperID: 9623, https://arxiv.org/pdf/2507.11633.pdf

Authors: Yuxuan Zhang, Haoyang Yu, Lanxiang Hu, Haojian Jin, Hao Zhang

Title: General Modular Harness for LLM Agents in Multi-Turn Gaming Environments

Abstract:
We introduce a modular harness design for LLM agents that composes of perception, memory, and reasoning components, enabling a single LLM or VLM backbone to tackle a wide spectrum of multi turn gaming environments without domain‑specific engineering. Using classic and modern game suites as low‑barrier, high‑diversity testbeds, our framework provides a unified workflow for analyzing how each module affects performance across dynamic interactive settings. Extensive experiments demonstrate that the harness lifts gameplay performance consistently over un‑harnessed baselines and reveals distinct contribution patterns, for example, memory dominates in long‑horizon puzzles while perception is critical in vision noisy arcades. These findings highlight the effectiveness of our modular harness design in advancing general‑purpose agent, given the familiarity and ubiquity of games in everyday human experience.

PaperID: 9624, https://arxiv.org/pdf/2507.11566.pdf

Authors: Fuda van Diggelen, Tugay Alperen Karagüzel, Andres Garcia Rincon, A. E. Eiben, Dario Floreano, Eliseo Ferrante

Title: Emergent Heterogeneous Swarm Control Through Hebbian Learning

Abstract:
In this paper, we introduce Hebbian learning as a novel method for swarm robotics, enabling the automatic emergence of heterogeneity. Hebbian learning presents a biologically inspired form of neural adaptation that solely relies on local information. By doing so, we resolve several major challenges for learning heterogeneous control: 1) Hebbian learning removes the complexity of attributing emergent phenomena to single agents through local learning rules, thus circumventing the micro‑macro problem; 2) uniform Hebbian learning rules across all swarm members limit the number of parameters needed, mitigating the curse of dimensionality with scaling swarm sizes; and 3) evolving Hebbian learning rules based on swarm‑level behaviour minimises the need for extensive prior knowledge typically required for optimising heterogeneous swarms. This work demonstrates that with Hebbian learning heterogeneity naturally emerges, resulting in swarm‑level behavioural switching and in significantly improved swarm capabilities. It also demonstrates how the evolution of Hebbian learning rules can be a valid alternative to Multi Agent Reinforcement Learning in standard benchmarking tasks.

PaperID: 9625, https://arxiv.org/pdf/2507.11560.pdf

Authors: Xin Wang, Xiao Huan Li, Xun Wang

Title: A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing

Abstract:
The integration of the Industrial Internet of Things (IIoT) with Artificial Intelligence‑Generated Content (AIGC) offers new opportunities for smart manufacturing, but it also introduces challenges related to computation‑intensive tasks and low‑latency demands. Traditional generative models based on cloud computing are difficult to meet the real‑time requirements of AIGC tasks in IIoT environments, and edge computing can effectively reduce latency through task offloading. However, the dynamic nature of AIGC tasks, model switching delays, and resource constraints impose higher demands on edge computing environments. To address these challenges, this paper proposes an AIGC task offloading framework tailored for IIoT edge computing environments, considering the latency and energy consumption caused by AIGC model switching for the first time. IIoT devices acted as multi‑agent collaboratively offload their dynamic AIGC tasks to the most appropriate edge servers deployed with different generative models. A model aware AIGC task offloading algorithm based on Multi‑Agent Deep Deterministic Policy Gradient (MADDPG‑MATO) is devised to minimize the latency and energy. Experimental results show that MADDPG‑MATO outperforms baseline algorithms, achieving an average reduction of 6.98% in latency, 7.12% in energy consumption, and a 3.72% increase in task completion rate across four sets of experiments with model numbers ranging from 3 to 6, it is demonstrated that the proposed algorithm is robust and efficient in dynamic, high‑load IIoT environments.

PaperID: 9626, https://arxiv.org/pdf/2507.11464.pdf

Authors: Ajay Shankar, Keisuke Okumura, Amanda Prorok

Title: LF: Online Multi-Robot Path Planning Meets Optimal Trajectory Control

Abstract:
We propose a multi‑robot control paradigm to solve point‑to‑point navigation tasks for a team of holonomic robots with access to the full environment information. The framework invokes two processes asynchronously at high frequency: (i) a centralized, discrete, and full‑horizon planner for computing collision‑ and deadlock‑free paths rapidly, leveraging recent advances in multi‑agent pathfinding (MAPF), and (ii) dynamics‑aware, robot‑wise optimal trajectory controllers that ensure all robots independently follow their assigned paths reliably. This hierarchical shift in planning representation from (i) discrete and coupled to (ii) continuous and decoupled domains enables the framework to maintain long‑term scalable motion synthesis. As an instantiation of this idea, we present LF, which combines a fast state‑of‑the‑art MAPF solver (LaCAM), and a robust feedback control stack (Freyja) for executing agile robot maneuvers. LF provides a robust and versatile mechanism for lifelong multi‑robot navigation even under asynchronous and partial goal updates, and adapts to dynamic workspaces simply by quick replanning. We present various multirotor and ground robot demonstrations, including the deployment of 15 real multirotors with random, consecutive target updates while a person walks through the operational workspace.

PaperID: 9627, https://arxiv.org/pdf/2507.11402.pdf

Authors: Supun Dissanayaka, Alexander Ferrein, Till Hofmann, Kosuke Nakajima, Mario Sanz-Lopez, Jesus Savage, Daniel Swoboda, Matteo Tschesche, Wataru Uemura, Tarik Viehmann, Shohei Yasuda

Title: From Production Logistics to Smart Manufacturing: The Vision for a New RoboCup Industrial League

Abstract:
The RoboCup Logistics League is a RoboCup competition in a smart factory scenario that has focused on task planning, job scheduling, and multi‑agent coordination. The focus on production logistics allowed teams to develop highly competitive strategies, but also meant that some recent developments in the context of smart manufacturing are not reflected in the competition, weakening its relevance over the years. In this paper, we describe the vision for the RoboCup Smart Manufacturing League, a new competition designed as a larger smart manufacturing scenario, reflecting all the major aspects of a modern factory. It will consist of several tracks that are initially independent but gradually combined into one smart manufacturing scenario. The new tracks will cover industrial robotics challenges such as assembly, human‑robot collaboration, and humanoid robotics, but also retain a focus on production logistics. We expect the reenvisioned competition to be more attractive to newcomers and well‑tried teams, while also shifting the focus to current and future challenges of industrial robotics.

PaperID: 9628, https://arxiv.org/pdf/2507.11345.pdf

Authors: Oscar Lima, Marc Vinci, Sunandita Patra, Sebastian Stock, Joachim Hertzberg, Martin Atzmueller, Malik Ghallab, Dana Nau, Paolo Traverso

Title: Acting and Planning with Hierarchical Operational Models on a Mobile Robot: A Study with RAE+UPOM

Abstract:
Robotic task execution faces challenges due to the inconsistency between symbolic planner models and the rich control structures actually running on the robot. In this paper, we present the first physical deployment of an integrated actor‑planner system that shares hierarchical operational models for both acting and planning, interleaving the Reactive Acting Engine (RAE) with an anytime UCT‑like Monte Carlo planner (UPOM). We implement RAE+UPOM on a mobile manipulator in a real‑world deployment for an object collection task. Our experiments demonstrate robust task execution under action failures and sensor noise, and provide empirical insights into the interleaved acting‑and‑planning decision making process.

PaperID: 9629, https://arxiv.org/pdf/2507.11299.pdf

Authors: Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi

Title: Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Abstract:
Text‑based telemedicine has become increasingly common, yet the quality of medical advice in doctor‑patient interactions is often judged more on how advice is communicated rather than its clinical accuracy. To address this, we introduce Dr. Copilot , a multi‑agent large language model (LLM) system that supports Romanian‑speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr. Copilot provides feedback along 17 interpretable axes. The system comprises of three LLM agents with prompts automatically optimized via DSPy. Designed with low‑resource Romanian data and deployed using open‑weight models, it delivers real‑time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real‑world deployments of LLMs in Romanian medical settings.

PaperID: 9630, https://arxiv.org/pdf/2507.11272.pdf

Authors: Anh Nguyen-Duc, Chien Vu Manh, Bao Anh Tran, Viet Phuong Ngo, Luan Le Chi, Anh Quang Nguyen

Title: An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

Abstract:
This paper presents MARAUS (Multi‑Agent and Retrieval‑Augmented University Admission System), a real‑world deployment of a conversational AI platform for higher education admissions counseling in Vietnam. While large language models (LLMs) offer potential for automating advisory tasks, most existing solutions remain limited to prototypes or synthetic benchmarks. MARAUS addresses this gap by combining hybrid retrieval, multi‑agent orchestration, and LLM‑based generation into a system tailored for real‑world university admissions. In collaboration with the University of Transport Technology (UTT) in Hanoi, we conducted a two‑phase study involving technical development and real‑world evaluation. MARAUS processed over 6,000 actual user interactions, spanning six categories of queries. Results show substantial improvements over LLM‑only baselines: on average 92 percent accuracy, hallucination rates reduced from 15 precent to 1.45 percent, and average response times below 4 seconds. The system operated cost‑effectively, with a two‑week deployment cost of 11.58 USD using GPT‑4o mini. This work provides actionable insights for the deployment of agentic RAG systems in low‑resource educational settings.

PaperID: 9631, https://arxiv.org/pdf/2507.11210.pdf

Authors: Rushia Harada, Yuken Kimura, Keito Inoshita

Title: Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias

Abstract:
Well‑being in family settings involves subtle psychological dynamics that conventional metrics often overlook. In particular, unconscious parental expectations, termed ideal parent bias, can suppress children's emotional expression and autonomy. This suppression, referred to as suppressed emotion, often stems from well‑meaning but value‑driven communication, which is difficult to detect or address from outside the family. Focusing on these latent dynamics, this study explores Large Language Model (LLM)‑based support for psychologically safe family communication. We constructed a Japanese parent‑child dialogue corpus of 30 scenarios, each annotated with metadata on ideal parent bias and suppressed emotion. Based on this corpus, we developed a Role‑Playing LLM‑based multi‑agent dialogue support framework that analyzes dialogue and generates feedback. Specialized agents detect suppressed emotion, describe implicit ideal parent bias in parental speech, and infer contextual attributes such as the child's age and background. A meta‑agent compiles these outputs into a structured report, which is then passed to five selected expert agents. These agents collaboratively generate empathetic and actionable feedback through a structured four‑step discussion process. Experiments show that the system can detect categories of suppressed emotion with moderate accuracy and produce feedback rated highly in empathy and practicality. Moreover, simulated follow‑up dialogues incorporating this feedback exhibited signs of improved emotional expression and mutual understanding, suggesting the framework's potential in supporting positive transformation in family interactions.

PaperID: 9632, https://arxiv.org/pdf/2507.11198.pdf

Authors: Conrad Borchers, Bahar Shahrokhian, Francesco Balzan, Elham Tajik, Sreecharan Sankaranarayanan, Sebastian Simon

Title: Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Abstract:
Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including annotation and qualitative coding of educational data. While LLM‑based multi‑agent systems (MAS) can emulate human coding workflows, their benefits over single LLM agents for coding remain poorly understood. To that end, we conducted an experimental study of how persona and temperature of component agents of a MAS shapes consensus‑building and coding accuracy for dialog segments. LLMs were prompted to code these segments deductively using a mature codebook with 8 codes and high inter‑rater reliability derived from prior research. Our open‑source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open‑source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold‑standard dataset of human‑annotated transcripts from online math tutoring sessions facilitated by educational software. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design and human‑AI coding.

PaperID: 9633, https://arxiv.org/pdf/2507.11117.pdf

Authors: Ailiya Borjigin, Cong He, Charles CC Lee, Wei Zhou

Title: AI Agent Architecture for Decentralized Trading of Alternative Assets

Abstract:
Decentralized trading of real‑world alternative assets (e.g., gold) requires bridging physical asset custody with blockchain systems while meeting strict requirements for compliance, liquidity, and risk management. We present GoldMine OS, a research oriented architecture that employs multiple specialized AI agents to automate and secure the tokenization and exchange of physical gold into a blockchain based stablecoin ("OZ"). Our approach combines on chain smart contracts for critical risk controls with off chain AI agents for decision making, blending the transparency and reliability of blockchains with the flexibility of AI driven automation. We describe four cooperative agents (Compliance, Token Issuance, Market Making, and Risk Control) and a coordinating core, and evaluate the system through simulation and a controlled pilot deployment. In experiments the prototype delivers on demand token issuance in under 1.2 s, more than 100 times faster than manual workflows. The Market Making agent maintains tight liquidity with spreads often below 0.5 percent even under volatile conditions. Fault injection tests show resilience: an oracle price spoofing attack is detected and mitigated within 10 s, and a simulated vault mis reporting halts issuance immediately with minimal user impact. The architecture scales to 5000 transactions per second with 10000 concurrent users in benchmarks. These results indicate that an AI agent based decentralized exchange for alternative assets can satisfy rigorous performance and safety requirements. We discuss broader implications for democratizing access to traditionally illiquid assets and explain how our governance model ‑‑ multi signature agent updates and on chain community voting on risk parameters ‑‑ provides ongoing transparency, adaptability, and formal assurance of system integrity.

PaperID: 9634, https://arxiv.org/pdf/2507.11079.pdf

Authors: Li Wang, Qizhen Wu, Lei Chen

Title: Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander

Abstract:
In multiple unmanned ground vehicle confrontations, autonomously evolving multi‑agent tactical decisions from situational awareness remain a significant challenge. Traditional handcraft rule‑based methods become vulnerable in the complicated and transient battlefield environment, and current reinforcement learning methods mainly focus on action manipulation instead of strategic decisions due to lack of interpretability. Here, we propose a vision‑language model‑based commander to address the issue of intelligent perception‑to‑decision reasoning in autonomous confrontations. Our method integrates a vision language model for scene understanding and a lightweight large language model for strategic reasoning, achieving unified perception and decision within a shared semantic space, with strong adaptability and interpretability. Unlike rule‑based search and reinforcement learning methods, the combination of the two modules establishes a full‑chain process, reflecting the cognitive process of human commanders. Simulation and ablation experiments validate that the proposed approach achieves a win rate of over 80% compared with baseline models.

PaperID: 9635, https://arxiv.org/pdf/2507.10913.pdf

Authors: Shuangyao Huang, Haibo Zhang, Zhiyi Huang

Title: A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge

Abstract:
This paper presents a multi‑agent reinforcement learning (MARL) framework for cooperative collision avoidance of UAV swarms leveraging domain knowledge‑driven reward. The reward is derived from knowledge in the domain of image processing, approximating contours on a two‑dimensional field. By modeling obstacles as maxima on the field, collisions are inherently avoided as contours never go through peaks or intersect. Additionally, counters are smooth and energy‑efficient. Our framework enables training with large swarm sizes as the agent interaction is minimized and the need for complex credit assignment schemes or observation sharing mechanisms in state‑of‑the‑art MARL approaches are eliminated. Moreover, UAVs obtain the ability to adapt to complex environments where contours may be non‑viable or non‑existent through intensive training. Extensive experiments are conducted to evaluate the performances of our framework against state‑of‑the‑art MARL algorithms.

PaperID: 9636, https://arxiv.org/pdf/2507.10911.pdf

Authors: Yicong Wu, Ting Chen, Irit Hochberg, Zhoujian Sun, Ruth Edry, Zhengxing Huang, Mor Peleg

Title: Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

Abstract:
Therapy recommendation for chronic patients with multimorbidity is challenging due to risks of treatment conflicts. Existing decision support systems face scalability limitations. Inspired by the way in which general practitioners (GP) manage multimorbidity patients, occasionally convening multidisciplinary team (MDT) collaboration, this study investigated the feasibility and value of using a Large Language Model (LLM)‑based multi‑agent system (MAS) for safer therapy recommendations. We designed a single agent and a MAS framework simulating MDT decision‑making by enabling discussion among LLM agents to resolve medical conflicts. The systems were evaluated on therapy planning tasks for multimorbidity patients using benchmark cases. We compared MAS performance with single‑agent approaches and real‑world benchmarks. An important contribution of our study is the definition of evaluation metrics that go beyond the technical precision and recall and allow the inspection of clinical goals met and medication burden of the proposed advices to a gold standard benchmark. Our results show that with current LLMs, a single agent GP performs as well as MDTs. The best‑scoring models provide correct recommendations that address all clinical goals, yet the advices are incomplete. Some models also present unnecessary medications, resulting in unnecessary conflicts between medication and conditions or drug‑drug interactions.

PaperID: 9637, https://arxiv.org/pdf/2507.10722.pdf

Authors: Sohan Shankar, Yi Pan, Hanqi Jiang, Zhengliang Liu, Mohammad R. Darbandi, Agustin Lorenzo, Junhao Chen, Weihang You, Md Mehedi Hasan, Arif Hassan Zidan, Eliana Gelman, Joshua A. Konfrst, Jillian Y. Russell, Katelyn Fernandes, Tianze Yang, Yiwei Li, Huaqin Zhao, Afrar Jahin, Triparna Ganguly, Shair Dinesha, Yifan Zhou, Zihao Wu, Xinliang Li, Lokesh Adusumilli, Aziza Hussein, Sagar Nookarapu, Jixin Hou, Kun Jiang, Jiaxi Li, Brenden Heinel, XianShen Xi, Hailey Hubbard, Zayna Khan, Levi Whitaker, Ivan Cao, Max Allgaier, Andrew Darby, Lin Zhao, Lu Zhang, Xiaoqiao Wang, Xiang Li, Wei Zhang, Xiaowei Yu, Dajiang Zhu, Yohannes Abate, Tianming Liu

Title: Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems

Abstract:
This position and survey paper identifies the emerging convergence of neuroscience, artificial general intelligence (AGI), and neuromorphic computing toward a unified research paradigm. Using a framework grounded in brain physiology, we highlight how synaptic plasticity, sparse spike‑based communication, and multimodal association provide design principles for next‑generation AGI systems that potentially combine both human and machine intelligences. The review traces this evolution from early connectionist models to state‑of‑the‑art large language models, demonstrating how key innovations like transformer attention, foundation‑model pre‑training, and multi‑agent architectures mirror neurobiological processes like cortical mechanisms, working memory, and episodic consolidation. We then discuss emerging physical substrates capable of breaking the von Neumann bottleneck to achieve brain‑scale efficiency in silicon: memristive crossbars, in‑memory compute arrays, and emerging quantum and photonic devices. There are four critical challenges at this intersection: 1) integrating spiking dynamics with foundation models, 2) maintaining lifelong plasticity without catastrophic forgetting, 3) unifying language with sensorimotor learning in embodied agents, and 4) enforcing ethical safeguards in advanced neuromorphic autonomous systems. This combined perspective across neuroscience, computation, and hardware offers an integrative agenda for in each of these fields.

PaperID: 9638, https://arxiv.org/pdf/2507.10644.pdf

Authors: Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State

Title: From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

Abstract:
The Web of Agents (WoA) transforms the document‑centric Web into an environment of autonomous agents acting on users' behalf, a vision newly tractable as large language models (LLMs) mature. We argue that across three decades the WoA has undergone a \emphsemantic‑effort migration in chronological order: from platform‑side coordination (Multi‑Agent Systems, Generation~I), through data‑side annotation (Semantic Web, Generation~II), to model‑side interpretation (LLM‑era, Generation~III). The central Gen~II~\rightarrow~Gen~III transition within this trajectory, which we call the \emphsemantics‑in‑data \rightarrow semantics‑in‑models shift, is predictive: each generation's failure modes and current open problems follow from where that generation located its semantic effort. The survey makes five contributions: (i)~a unified evolutionary narrative spanning 1990‑‑2026; (ii)~a four‑dimensional comparative framework (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) applied uniformly across all three generations; (iii)~classification of sixteen representative systems on these dimensions, including hybrid LLM‑‑knowledge‑graph and computer‑use agents; (iv)~coverage of the November~2024‑‑August~2026 institutional convergence (Linux Foundation's Agentic AI Foundation, A2A v1.0, MCP November~2024 launch and November~2025 specification, Visa/Mastercard/Stripe payment‑network protocols, EU AI Act phased enforcement, the NIST AI Agent Standards Initiative, International AI Safety Report 2026); and (v)~seven named lessons grounded in cross‑generational evidence paired with seven generation‑invariant challenges that persist regardless of which protocol prevails. Further progress depends less on protocol design than on the socio‑technical infrastructure now being assembled by standards bodies, regulators, and commercial payment networks.

PaperID: 9639, https://arxiv.org/pdf/2507.10621.pdf

Authors: Quanyan Zhu

Title: Game Theory Meets LLM and Agentic AI: Reimagining Cybersecurity for the Age of Intelligent Threats

Abstract:
Protecting cyberspace requires not only advanced tools but also a shift in how we reason about threats, trust, and autonomy. Traditional cybersecurity methods rely on manual responses and brittle heuristics. To build proactive and intelligent defense systems, we need integrated theoretical frameworks and software tools. Game theory provides a rigorous foundation for modeling adversarial behavior, designing strategic defenses, and enabling trust in autonomous systems. Meanwhile, software tools process cyber data, visualize attack surfaces, verify compliance, and suggest mitigations. Yet a disconnect remains between theory and practical implementation. The rise of Large Language Models (LLMs) and agentic AI offers a new path to bridge this gap. LLM‑powered agents can operationalize abstract strategies into real‑world decisions. Conversely, game theory can inform the reasoning and coordination of these agents across complex workflows. LLMs also challenge classical game‑theoretic assumptions, such as perfect rationality or static payoffs, prompting new models aligned with cognitive and computational realities. This co‑evolution promises richer theoretical foundations and novel solution concepts. Agentic AI also reshapes software design: systems must now be modular, adaptive, and trust‑aware from the outset. This chapter explores the intersection of game theory, agentic AI, and cybersecurity. We review key game‑theoretic frameworks (e.g., static, dynamic, Bayesian, and signaling games) and solution concepts. We then examine how LLM agents can enhance cyber defense and introduce LLM‑driven games that embed reasoning into AI agents. Finally, we explore multi‑agent workflows and coordination games, outlining how this convergence fosters secure, intelligent, and adaptive cyber systems.

PaperID: 9640, https://arxiv.org/pdf/2507.10571.pdf

Authors: Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas

Title: Agentic AI with Orchestrator-Agent Trust: A Modular Visual Classification Framework with Trust-Aware Orchestration and RAG-Based Reasoning

Abstract:
Modern Artificial Intelligence (AI) increasingly relies on multi‑agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents especially in zero‑shot settings with no fine‑tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non‑visual reasoning orchestrator and a Retrieval‑Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero‑shot with confidence‑based orchestration, (II) fine‑tuned agents with improved performance, and (III) trust‑calibrated orchestration enhanced by CLIP‑based image retrieval and re‑evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero‑shot setting using trust‑aware orchestration and RAG, achieving 85.63% overall. GPT‑4o showed better calibration, while Qwen‑2.5‑VL displayed overconfidence. Furthermore, image‑RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re‑evaluation. The proposed system separates perception (vision agents) from meta‑reasoning (orchestrator), enabling scalable and interpretable multi‑agent AI. This blueprint illustrates how Agentic AI can deliver trustworthy, modular, and transparent reasoning, and is extensible to diagnostics, biology, and other trust‑critical domains. In doing so, we highlight Agentic AI not just as an architecture but as a paradigm for building reliable multi‑agent intelligence. agentic ai, orchestrator agent trust, trust orchestration, visual classification, retrieval augmented reasoning

PaperID: 9641, https://arxiv.org/pdf/2507.10566.pdf

Authors: Hung Ming Liu

Title: AI Mother Tongue: Self-Emergent Communication in MARL via Endogenous Symbol Systems

Abstract:
In Decentralized Multi‑Agent Reinforcement Learning (MARL), the development of Emergent Communication has long been constrained by the ``Joint Exploration Dilemma'', leading agents to fall into a ``Communication Vacuum Equilibrium'' . Traditional methods address this by introducing inductive biases to facilitate communication emergence . This study fundamentally questions whether such artificial inductive biases are, in fact, over‑engineering. Through experiments with the ``AI Mother Tongue'' (AIM) framework, based on a Vector Quantized Variational Autoencoder (VQ‑VAE), we demonstrate that when agents possess an endogenous symbol system, their neural representations naturally exhibit spontaneous semantic compression and Nash equilibrium‑driven semantic convergence, achieving effective symbolic communication without external inductive biases. This aligns with recent neuroscience findings suggesting that the human brain does not directly use human language for internal thought , and resonates with research on ``soft thinking'' capabilities in Large Language Models (LLMs) . Compared to traditional explicit communication methods, AIM demonstrates stronger generality and efficiency. The interpretable analysis toolkit developed in this study confirms that symbol usage exhibits a significant power‑law distribution, leading to three major theoretical insights: the ``Neural Communication Hypothesis'', the ``Tool‑First Principle'', and the ``Semantic Interpretability Paradigm''. Future research will explore the integration of Hierarchical Quantized Variational Autoencoders (HQ‑VAE) to enhance AIM's complex expressive capabilities and investigate the potential for ``Reinforcement Learning (RL) Low‑Level Pre‑training''. This discovery offers new avenues for bridging symbolism and connectionism.

PaperID: 9642, https://arxiv.org/pdf/2507.10562.pdf

Authors: Hari Masoor

Title: SAMEP: A Secure Protocol for Persistent Context Sharing Across AI Agents

Abstract:
Current AI agent architectures suffer from ephemeral memory limitations, preventing effective collaboration and knowledge sharing across sessions and agent boundaries. We introduce SAMEP (Secure Agent Memory Exchange Protocol), a novel framework that enables persistent, secure, and semantically searchable memory sharing among AI agents. Our protocol addresses three critical challenges: (1) persistent context preservation across agent sessions, (2) secure multi‑agent collaboration with fine‑grained access control, and (3) efficient semantic discovery of relevant historical context. SAMEP implements a distributed memory repository with vector‑based semantic search, cryptographic access controls (AES‑256‑GCM), and standardized APIs compatible with existing agent communication protocols (MCP, A2A). We demonstrate SAMEP's effectiveness across diverse domains including multi‑agent software development, healthcare AI with HIPAA compliance, and multi‑modal processing pipelines. Experimental results show 73% reduction in redundant computations, 89% improvement in context relevance scores, and complete compliance with regulatory requirements including audit trail generation. SAMEP enables a new paradigm of persistent, collaborative AI agent ecosystems while maintaining security and privacy guarantees.

PaperID: 9643, https://arxiv.org/pdf/2507.10251.pdf

Authors: Wenjing Zhang, Wei Zhang

Title: ToMacVF : Temporal Macro-action Value Factorization for Asynchronous Multi-Agent Reinforcement Learning

Abstract:
Existing asynchronous MARL methods based on MacDec‑POMDP typically construct training trajectory buffers by simply sampling limited and biased data at the endpoints of macro‑actions, and directly apply conventional MARL methods on the buffers. As a result, these methods lead to an incomplete and inaccurate representation of the macro‑action execution process, along with unsuitable credit assignments. To solve these problems, the Temporal Macro‑action Value Factorization (ToMacVF) is proposed to achieve fine‑grained temporal credit assignment for macro‑action contributions. A centralized training buffer, called Macro‑action Segmented Joint Experience Replay Trajectory (Mac‑SJERT), is designed to incorporate with ToMacVF to collect accurate and complete macro‑action execution information, supporting a more comprehensive and precise representation of the macro‑action process. To ensure principled and fine‑grained asynchronous value factorization, the consistency requirement between joint and individual macro‑action selection called Temporal Macro‑action based IGM (To‑Mac‑IGM) is formalized, proving that it generalizes the synchronous cases. Based on To‑Mac‑IGM, a modularized ToMacVF architecture, which satisfies CTDE principle, is designed to conveniently integrate previous value factorization methods. Next, the ToMacVF algorithm is devised as an implementation of the ToMacVF architecture. Experimental results demonstrate that, compared to asynchronous baselines, our ToMacVF algorithm not only achieves optimal performance but also exhibits strong adaptability and robustness across various asynchronous multi‑agent experimental scenarios.

PaperID: 9644, https://arxiv.org/pdf/2507.10142.pdf

Authors: Siyi Hu, Mohamad A Hady, Jianglin Qiao, Jimmy Cao, Mahardhika Pratama, Ryszard Kowalczyk

Title: Adaptability in Multi-Agent Reinforcement Learning: A Framework and Unified Review

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has shown clear effectiveness in coordinating multiple agents across simulated benchmarks and constrained scenarios. However, its deployment in real‑world multi‑agent systems (MAS) remains limited, primarily due to the complex and dynamic nature of such environments. These challenges arise from multiple interacting sources of variability, including fluctuating agent populations, evolving task goals, and inconsistent execution conditions. Together, these factors demand that MARL algorithms remain effective under continuously changing system configurations and operational demands. To better capture and assess this capacity for adjustment, we introduce the concept of adaptability as a unified and practically grounded lens through which to evaluate the reliability of MARL algorithms under shifting conditions, broadly referring to any changes in the environment dynamics that may occur during learning or execution. Centred on the notion of adaptability, we propose a structured framework comprising three key dimensions: learning adaptability, policy adaptability, and scenario‑driven adaptability. By adopting this adaptability perspective, we aim to support more principled assessments of MARL performance beyond narrowly defined benchmarks. Ultimately, this survey contributes to the development of algorithms that are better suited for deployment in dynamic, real‑world multi‑agent systems.

PaperID: 9645, https://arxiv.org/pdf/2507.10016.pdf

Authors: Lixu Wang, Kaixiang Yao, Xinfeng Li, Dong Yang, Haoyang Li, Xiaofeng Wang, Wei Dong

Title: The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents

Abstract:
Our research uncovers a novel privacy risk associated with multimodal large language models (MLLMs): the ability to infer sensitive personal attributes from audio data ‑‑ a technique we term audio private attribute profiling. This capability poses a significant threat, as audio can be covertly captured without direct interaction or visibility. Moreover, compared to images and text, audio carries unique characteristics, such as tone and pitch, which can be exploited for more detailed profiling. However, two key challenges exist in understanding MLLM‑employed private attribute profiling from audio: (1) the lack of audio benchmark datasets with sensitive attribute annotations and (2) the limited ability of current MLLMs to infer such attributes directly from audio. To address these challenges, we introduce AP^2, an audio benchmark dataset that consists of two subsets collected and composed from real‑world data, and both are annotated with sensitive attribute labels. Additionally, we propose Gifts, a hybrid multi‑agent framework that leverages the complementary strengths of audio‑language models (ALMs) and large language models (LLMs) to enhance inference capabilities. Gifts employs an LLM to guide the ALM in inferring sensitive attributes, then forensically analyzes and consolidates the ALM's inferences, overcoming severe hallucinations of existing ALMs in generating long‑context responses. Our evaluations demonstrate that Gifts significantly outperforms baseline approaches in inferring sensitive attributes. Finally, we investigate model‑level and data‑level defense strategies to mitigate the risks of audio private attribute profiling. Our work validates the feasibility of audio‑based privacy attacks using MLLMs, highlighting the need for robust defenses, and provides a dataset and framework to facilitate future research.

PaperID: 9646, https://arxiv.org/pdf/2507.09997.pdf

Authors: Venkatraman Renganathan, Sabyasachi Mondal, Antonios Tsourdos

Title: Predictive & Trust-based Multi-Agent Coordination

Abstract:
This paper presents a trust‑based predictive multi‑agent consensus protocol that analyses neighbours' anticipation data and makes coordination decisions. Agents in the network share their future predicted data over a finite look‑ahead horizon with their neighbours and update their predictions in a rolling‑horizon fashion. The prediction data is then used by agents to learn both the trust and the commitment traits exhibited by their neighbours over time. The proposed protocol is named as the Anticipatory Distributed Coordination (ADC) protocol. Lyapunov theory‑based agreement convergence between agents is provided, followed by demonstrations using numerical simulations.

PaperID: 9647, https://arxiv.org/pdf/2507.09989.pdf

Authors: Xiaoyang Yu, Youfang Lin, Shuo Wang, Sheng Han

Title: Improving monotonic optimization in heterogeneous multi-agent reinforcement learning with optimal marginal deterministic policy gradient

Abstract:
In heterogeneous multi‑agent reinforcement learning (MARL), achieving monotonic improvement plays a pivotal role in enhancing performance. The HAPPO algorithm proposes a feasible solution by introducing a sequential update scheme, which requires independent learning with No Parameter‑sharing (NoPS). However, heterogeneous MARL generally requires Partial Parameter‑sharing (ParPS) based on agent grouping to achieve high cooperative performance. Our experiments prove that directly combining ParPS with the sequential update scheme leads to the policy updating baseline drift problem, thereby failing to achieve improvement. To solve the conflict between monotonic improvement and ParPS, we propose the Optimal Marginal Deterministic Policy Gradient (OMDPG) algorithm. First, we replace the sequentially computed Q_ψ^s(s,a_1:i) with the Optimal Marginal Q (OMQ) function ϕ_ψ^(s,a_1:i) derived from Q‑functions. This maintains MAAD's monotonic improvement while eliminating the conflict through optimal joint action sequences instead of sequential policy ratio calculations. Second, we introduce the Generalized Q Critic (GQC) as the critic function, employing pessimistic uncertainty‑constrained loss to optimize different Q‑value estimations. This provides the required Q‑values for OMQ computation and stable baselines for actor updates. Finally, we implement a Centralized Critic Grouped Actor (CCGA) architecture that simultaneously achieves ParPS in local policy networks and accurate global Q‑function computation. Experimental results in SMAC and MAMuJoCo environments demonstrate that OMDPG outperforms various state‑of‑the‑art MARL baselines.

PaperID: 9648, https://arxiv.org/pdf/2507.09926.pdf

Authors: Zixuan Song, Zhishu Shen, Xiaoyu Zheng, Qiushi Zheng, Zheng Lei, Jiong Jin

Title: Intelligent Task Management via Dynamic Multi-region Division in LEO Satellite Networks

Abstract:
As a key complement to terrestrial networks and a fundamental component of future 6G systems, Low Earth Orbit (LEO) satellite networks are expected to provide high‑quality communication services when integrated with ground‑based infrastructure, thereby attracting significant research interest. However, the limited satellite onboard resources and the uneven distribution of computational workloads often result in congestion along inter‑satellite links (ISLs) that degrades task processing efficiency. Effectively managing the dynamic and large‑scale topology of LEO networks to ensure balanced task distribution remains a critical challenge. To this end, we propose a dynamic multi‑region division framework for intelligent task management in LEO satellite networks. This framework optimizes both intra‑ and inter‑region routing to minimize task delay while balancing the utilization of computational and communication resources. Based on this framework, we propose a dynamic multi‑region division algorithm based on the Genetic Algorithm (GA), which adaptively adjusts the size of each region based on the workload status of individual satellites. Additionally, we incorporate an adaptive routing algorithm and a task splitting and offloading scheme based on Multi‑Agent Deep Deterministic Policy Gradient (MA‑DDPG) to effectively accommodate the arriving tasks. Simulation results demonstrate that our proposed framework outperforms comparative methods in terms of the task delay, energy consumption per task, and task completion rate.

PaperID: 9649, https://arxiv.org/pdf/2507.09836.pdf

Authors: Vindula Jayawardana, Sirui Li, Yashar Farid, Cathy Wu

Title: Multi-residual Mixture of Experts Learning for Cooperative Control in Multi-vehicle Systems

Abstract:
Autonomous vehicles (AVs) are becoming increasingly popular, with their applications now extending beyond just a mode of transportation to serving as mobile actuators of a traffic flow to control flow dynamics. This contrasts with traditional fixed‑location actuators, such as traffic signals, and is referred to as Lagrangian traffic control. However, designing effective Lagrangian traffic control policies for AVs that generalize across traffic scenarios introduces a major challenge. Real‑world traffic environments are highly diverse, and developing policies that perform robustly across such diverse traffic scenarios is challenging. It is further compounded by the joint complexity of the multi‑agent nature of traffic systems, mixed motives among participants, and conflicting optimization objectives subject to strict physical and external constraints. To address these challenges, we introduce Multi‑Residual Mixture of Expert Learning (MRMEL), a novel framework for Lagrangian traffic control that augments a given suboptimal nominal policy with a learned residual while explicitly accounting for the structure of the traffic scenario space. In particular, taking inspiration from residual reinforcement learning, MRMEL augments a suboptimal nominal AV control policy by learning a residual correction, but at the same time dynamically selects the most suitable nominal policy from a pool of nominal policies conditioned on the traffic scenarios and modeled as a mixture of experts. We validate MRMEL using a case study in cooperative eco‑driving at signalized intersections in Atlanta, Dallas Fort Worth, and Salt Lake City, with real‑world data‑driven traffic scenarios. The results show that MRMEL consistently yields superior performance‑achieving an additional 4%‑9% reduction in aggregate vehicle emissions relative to the strongest baseline in each setting.

PaperID: 9650, https://arxiv.org/pdf/2507.09702.pdf

Authors: Phat Nguyen, Ngai-Man Cheung

Title: Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI

Abstract:
Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these methods aim to remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy‑efficiency trade‑offs on large‑scale ViTs, two critical gaps remain. First, there is a lack of unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine‑tuning vs. plug‑in). Second, most benchmarks are limited to standard ViT models (e.g., ViT‑B, ViT‑L), leaving open the question of whether such methods remain effective when applied to structurally compressed transformers, which are increasingly deployed on resource‑constrained edge devices. To address these gaps, we present the first systematic taxonomy and comparative study of token compression methods, and we evaluate representative techniques on both standard and compact ViT architectures. Our experiments reveal that while token compression methods are effective for general‑purpose ViTs, they often underperform when directly applied to compact designs. These findings not only provide practical insights but also pave the way for future research on adapting token optimization techniques to compact transformer‑based networks for edge AI and AI agent applications.

PaperID: 9651, https://arxiv.org/pdf/2507.09657.pdf

Authors: Ann Nedime Nese Rende, Tolga Yilmaz, Özgür Ulusoy

Title: Negotiating Comfort: Simulating Personality-Driven LLM Agents in Shared Residential Social Networks

Abstract:
We use generative agents powered by large language models (LLMs) to simulate a social network in a shared residential building, driving the temperature decisions for a central heating system. Agents, divided into Family Members and Representatives, consider personal preferences, personal traits, connections, and weather conditions. Daily simulations involve family‑level consensus followed by building‑wide decisions among representatives. We tested three personality traits distributions (positive, mixed, and negative) and found that positive traits correlate with higher happiness and stronger friendships. Temperature preferences, assertiveness, and selflessness have a significant impact on happiness and decisions. This work demonstrates how LLM‑driven agents can help simulate nuanced human behavior where complex real‑life human simulations are difficult to set.

PaperID: 9652, https://arxiv.org/pdf/2507.09626.pdf

Authors: Rodion Nazarov, Anthony Quinn, Robert Shorten, Jakub Marecek

Title: humancompatible.interconnect: Testing Properties of Repeated Uses of Interconnections of AI Systems

Abstract:
Artificial intelligence (AI) systems often interact with multiple agents. The regulation of such AI systems often requires that \em a priori\/ guarantees of fairness and robustness be satisfied. With stochastic models of agents' responses to the outputs of AI systems, such \em a priori\/ guarantees require non‑trivial reasoning about the corresponding stochastic systems. Here, we present an open‑source PyTorch‑based toolkit for the use of stochastic control techniques in modelling interconnections of AI systems and properties of their repeated uses. It models robustness and fairness desiderata in a closed‑loop fashion, and provides \em a priori\/ guarantees for these interconnections. The PyTorch‑based toolkit removes much of the complexity associated with the provision of fairness guarantees for closed‑loop models of multi‑agent systems.

PaperID: 9653, https://arxiv.org/pdf/2507.09505.pdf

Authors: Tenghui Xie, Zhiying Song, Fuxi Wen, Jun Li, Guangzhao Liu, Zijian Zhao

Title: TruckV2X: A Truck-Centered Perception Dataset

Abstract:
Autonomous trucking offers significant benefits, such as improved safety and reduced costs, but faces unique perception challenges due to trucks' large size and dynamic trailer movements. These challenges include extensive blind spots and occlusions that hinder the truck's perception and the capabilities of other road users. To address these limitations, cooperative perception emerges as a promising solution. However, existing datasets predominantly feature light vehicle interactions or lack multi‑agent configurations for heavy‑duty vehicle scenarios. To bridge this gap, we introduce TruckV2X, the first large‑scale truck‑centered cooperative perception dataset featuring multi‑modal sensing (LiDAR and cameras) and multi‑agent cooperation (tractors, trailers, CAVs, and RSUs). We further investigate how trucks influence collaborative perception needs, establishing performance benchmarks while suggesting research priorities for heavy vehicle perception. The dataset provides a foundation for developing cooperative perception systems with enhanced occlusion handling capabilities, and accelerates the deployment of multi‑agent autonomous trucking systems. The TruckV2X dataset is available at https://huggingface.co/datasets/XieTenghu1/TruckV2X.

PaperID: 9654, https://arxiv.org/pdf/2507.09497.pdf

Authors: Siyi Wu, Zeyu Wang, Xinyuan Song, Zhengpeng Zhou, Lifan Sun, Tianyu Shi

Title: GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities

Abstract:
Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi‑faceted tasks with high levels of autonomy and adaptability. However, traditional single‑purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present GoalfyMax, a protocol‑driven framework for end‑to‑end multi‑agent collaboration. GoalfyMax introduces a standardized Agent‑to‑Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol‑compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. Moreover, our system integrates advanced features including multi‑turn contextual dialogue, long‑short term memory modules, and dynamic safety validation, supporting robust, real‑time strategy adaptation. Empirical results on complex task orchestration benchmarks and case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future‑ready foundation for multi‑agent intelligent systems.

PaperID: 9655, https://arxiv.org/pdf/2507.09495.pdf

Authors: Hang Wang, Junshan Zhang

Title: GenAI-based Multi-Agent Reinforcement Learning towards Distributed Agent Intelligence: A Generative-RL Agent Perspective

Abstract:
Multi‑agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non‑stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus‑response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi‑agent intelligence through generative AI‑based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi‑agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative‑RL agents can model environment evolution, predict other agents' behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long‑term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision‑making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human‑AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.

PaperID: 9656, https://arxiv.org/pdf/2507.09481.pdf

Authors: Yuheng Huang, Jiayang Song, Da Song, Zhenlan Ji, Wenhan Wang, Shuai Wang, Lei Ma

Title: Evaluating LLMs on Sequential API Call Through Automated Test Generation

Abstract:
By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real‑world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real‑world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state‑machine‑based API constraint solving and validation, energy‑based sampling, and control‑flow injection to generate executable programs. These programs are then translated into human‑like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API‑oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.We make our framework and benchmark publicly available to support future research.

PaperID: 9657, https://arxiv.org/pdf/2507.09457.pdf

Authors: Y. Kikuchi, M. Iwamoto

Title: Intermediate Interaction Strategies for Collective Behavior

Abstract:
From bird flocks and fish schools to migrating cell sheets, collective motion is a ubiquitous biological phenomenon that inspires quantitative modeling through self‑propelled particle (SPP) frameworks. Conventional SPP models prescribe either distance‑based (metric) or rank‑based (topological) interactions; however, empirical studies indicate that real groups may blend both types of interaction. Motivated by this graded perception, we introduce a new three‑dimensional SPP model in which metric and topological alignments act simultaneously and are weighted by a single tunable mixing parameter called the interaction parameter. Large‑scale simulations spanning a wide ranges of interaction parameters and densities revealed rich dynamics. Even when the global order parameter is low, cluster‑level analysis with HDBSCAN shows that particles self‑organize into several spatially distinct but internally well‑aligned sub‑flocks, exposing a hidden layer of order. Most importantly, an intermediate balance in which metric and topological cues contribute almost equally maximizes the global order parameter and markedly improves robustness to density variations. Numerical experiments supported by linear stability analysis demonstrate that activating metric and topological interactions in concert bridges the traditional modeling dichotomy and furnishes a more adaptive and resilient framework for collective motion. Therefore, the proposed model therefore provides a versatile platform for exploring mixed‑interaction effects in biological and engineered multi‑agent systems.

PaperID: 9658, https://arxiv.org/pdf/2507.09389.pdf

Authors: Chris Davis Jaldi, Anmol Saini, Elham Ghiasi, O. Divine Eziolise, Cogan Shimizu

Title: Knowledge Conceptualization Impacts RAG Efficacy

Abstract:
Explainability and interpretability are cornerstones of frontier and next‑generation artificial intelligence (AI) systems. This is especially true in recent systems, such as large language models (LLMs), and more broadly, generative AI. On the other hand, adaptability to new domains, contexts, or scenarios is also an important aspect for a successful system. As such, we are particularly interested in how we can merge these two efforts, that is, investigating the design of transferable and interpretable neurosymbolic AI systems. Specifically, we focus on a class of systems referred to as ''Agentic Retrieval‑Augmented Generation'' systems, which actively select, interpret, and query knowledge sources in response to natural language prompts. In this paper, we systematically evaluate how different conceptualizations and representations of knowledge, particularly the structure and complexity, impact an AI agent (in this case, an LLM) in effectively querying a triplestore. We report our results, which show that there are impacts from both approaches, and we discuss their impact and implications.

PaperID: 9659, https://arxiv.org/pdf/2507.09367.pdf

Authors: Shiva Azimi, Arash Tavakoli

Title: Simulation for All: A Step-by-Step Cookbook for Developing Human-Centered Multi-Agent Transportation Simulators

Abstract:
As cities evolve toward more complex and multimodal transportation systems, the need for human‑centered multi‑agent simulation tools has never been more urgent. Yet most existing platforms remain limited ‑ they often separate different types of road users, rely on scripted or pre‑defined behaviors, overlook public transit users as active participants, and are rarely designed with accessibility in mind for non‑technical users. To address this gap, this paper presents the specifications of a multi‑agent simulation platform designed to support real‑time, human‑centered, and immersive studies of all road users, accompanied by open‑source scripts for replication. Using high‑fidelity immersive virtual environments, our platform enables interaction across public transit users, pedestrians, cyclists, automated vehicles, and drivers. The architecture is modular, extensible, and designed for accessibility. The system integrates hardware‑specific modules ‑ including an omnidirectional treadmill, a seating arrangement, a smart trainer, and an actuated cockpit. Additionally, the platform collects multimodal physiological, neurological, and behavioral data through embedded sensing devices such as functional near‑infrared spectroscopy (fNIRS), eye tracking, and wrist‑based biosensors. To show the usability of this system, we present three use cases. Simulation for All aims to lower the barrier to entry for high‑fidelity transportation simulation, support experimentation across disciplines, and advance our understanding of multimodal mobility in complex urban environments.

PaperID: 9660, https://arxiv.org/pdf/2507.09346.pdf

Authors: Arild Yonkeu, Mohammadreza Amini, Burak Kantarci

Title: Fast and Adaptive Task Management in MEC: A Deep Learning Approach Using Pointer Networks

Abstract:
Task offloading and scheduling in Mobile Edge Computing (MEC) are vital for meeting the low‑latency demands of modern IoT and dynamic task scheduling scenarios. MEC reduces the processing burden on resource‑constrained devices by enabling task execution at nearby edge servers. However, efficient task scheduling remains a challenge in dynamic, time‑sensitive environments. Conventional methods ‑‑ such as heuristic algorithms and mixed‑integer programming ‑‑ suffer from high computational overhead, limiting their real‑time applicability. Existing deep learning (DL) approaches offer faster inference but often lack scalability and adaptability to dynamic workloads. To address these issues, we propose a Pointer Network‑based architecture for task scheduling in dynamic edge computing scenarios. Our model is trained on a generated synthetic dataset using genetic algorithms to determine the optimal task ordering. Experimental results show that our model achieves lower drop ratios and waiting times than baseline methods, and a soft sequence accuracy of up to 89.2%. Our model consistently achieves inference times under 2 seconds across all evaluated task counts, whereas the integer and binary programming approaches require approximately up to 18 seconds and 90 seconds, respectively. It also shows strong generalization across varying scenarios, and adaptability to real‑time changes, offering a scalable and efficient solution for edge‑based task management.

PaperID: 9661, https://arxiv.org/pdf/2507.09255.pdf

Authors: Charidimos Papadakis, Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou

Title: StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets

Abstract:
We present StockSim, an open‑source simulation platform for systematic evaluation of large language models (LLMs) in realistic financial decision‑making scenarios. Unlike previous toolkits that offer limited scope, StockSim delivers a comprehensive system that fully models market dynamics and supports diverse simulation modes of varying granularity. It incorporates critical real‑world factors, such as latency, slippage, and order‑book microstructure, that were previously neglected, enabling more faithful and insightful assessment of LLM‑based trading agents. An extensible, role‑based agent framework supports heterogeneous trading strategies and multi‑agent coordination, making StockSim a uniquely capable testbed for NLP research on reasoning under uncertainty and sequential decision‑making. We open‑source all our code at https: //github.com/harrypapa2002/StockSim.

PaperID: 9662, https://arxiv.org/pdf/2507.09179.pdf

Authors: Ronghua Shi, Yiou Liu, Yuchun Feng, Lynn Ai, Bill Shi, Zhuang Liu

Title: Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System

Abstract:
Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulation. Without centralized oversight, malicious actors coordinate shilling campaigns and pump‑and‑dump schemes across various platforms. We propose a Multi‑Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection, modeling the interaction between manipulators and detectors as a dynamic adversarial game. This framework identifies suspicious patterns using delayed token price reactions as financial indicators.Our method introduces three innovations: (1) Group Relative Policy Optimization (GRPO) to enhance learning stability in sparse‑reward and partially observable settings; (2) a theory‑based reward function inspired by rational expectations and information asymmetry, differentiating price discovery from manipulation noise; and (3) a multi‑modal agent pipeline that integrates LLM‑based semantic features, social graph signals, and on‑chain market data for informed decision‑making.The framework is integrated within the Symphony system, a decentralized multi‑agent architecture enabling peer‑to‑peer agent execution and trust‑aware learning through distributed logs, supporting chain‑verifiable evaluation. Symphony promotes adversarial co‑evolution among strategic actors and maintains robust manipulation detection without centralized oracles, enabling real‑time surveillance across global DeFi ecosystems.Trained on 100,000 real‑world discourse episodes and validated in adversarial simulations, Hide‑and‑Shill achieves top performance in detection accuracy and causal attribution. This work bridges multi‑agent systems with financial surveillance, advancing a new paradigm for decentralized market intelligence. All resources are available at the Hide‑and‑Shill GitHub repository to promote open research and reproducibility.

PaperID: 9663, https://arxiv.org/pdf/2507.09100.pdf

Authors: Mohammad Abolnejadian, Shakiba Amirshahi, Matthew Brehmer, Anamaria Crisan

Title: AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data

Abstract:
In decision‑making conversations, experts must navigate complex choices and make on‑the‑spot decisions while engaged in conversation. Although extensive historical data often exists, the real‑time nature of these scenarios makes it infeasible for decision‑makers to review and leverage relevant information. This raises an interesting question: What if experts could utilize relevant past data in real‑time decision‑making through insights derived from past data? To explore this, we implemented a conversational user interface, taking doctor‑patient interactions as an example use case. Our system continuously listens to the conversation, identifies patient problems and doctor‑suggested solutions, and retrieves related data from an embedded dataset, generating concise insights using a pipeline built around a retrieval‑based Large Language Model (LLM) agent. We evaluated the prototype by embedding Health Canada datasets into a vector database and conducting simulated studies using sample doctor‑patient dialogues, showing effectiveness but also challenges, setting directions for the next steps of our work.

PaperID: 9664, https://arxiv.org/pdf/2507.09094.pdf

Authors: Xiaoren Xu, Hao Xu, Dongyu Wei, Walid Saad, Mehdi Bennis, Mingzhe Chen

Title: Transformer based Collaborative Reinforcement Learning for Fluid Antenna System (FAS)-enabled 3D UAV Positioning

Abstract:
In this paper, a novel Three dimensional (3D) positioning framework of fluid antenna system (FAS)‑enabled unmanned aerial vehicles (UAVs) is developed. In the proposed framework, a set of controlled UAVs cooperatively estimate the real‑time 3D position of a target UAV. Here, the active UAV transmits a measurement signal to the passive UAVs via the reflection from the target UAV. Each passive UAV estimates the distance of the active‑target‑passive UAV link and selects an antenna port to share the distance information with the base station (BS) that calculates the real‑time position of the target UAV. As the target UAV is moving due to its task operation, the controlled UAVs must optimize their trajectories and select optimal antenna port, aiming to estimate the real‑time position of the target UAV. We formulate this problem as an optimization problem to minimize the target UAV positioning error via optimizing the trajectories of all controlled UAVs and antenna port selection of passive UAVs. Here, an attention‑based recurrent multi‑agent reinforcement learning (AR‑MARL) scheme is proposed, which enables each controlled UAV to use the local Q function to determine its trajectory and antenna port while optimizing the target UAV positioning performance without knowing the trajectories and antenna port selections of other controlled UAVs. Different from current MARL methods, the proposed method uses a recurrent neural network (RNN) that incorporates historical state‑action pairs of each controlled UAV, and an attention mechanism to analyze the importance of these historical state‑action pairs, thus improving the global Q function approximation accuracy and the target UAV positioning accuracy. Simulation results show that the proposed AR‑MARL scheme can reduce the average positioning error by up to 17.5% and 58.5% compared to the VD‑MARL scheme and the proposed method without FAS.

PaperID: 9665, https://arxiv.org/pdf/2507.09063.pdf

Authors: Avi Arora, Jinu Jang, Roshanak Zilouchian Moghaddam

Title: SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

Abstract:
Modern Large Language Model (LLM) agents promise end to end assistance with real‑world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre‑baked environments where every dependency is pre‑installed. To fill this gap, we introduce SetupBench, a 93 instance benchmark that isolates the environment‑bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi‑service orchestration scenarios, each accompanies by a natural language problem statement and a deterministic success command. Through evaluation of OpenHands, a state‑of‑the‑art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9‑57.4%) and local database configuration (20.0‑53.3%). Our analysis reveals systematic failure modes including incomplete development tooling installation, hallucinated task constraints, and non‑persistent environment modifications that break agent‑human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38‑89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents' practical environment‑bootstrap capabilities. By targeting this critical yet under‑evaluated capability, SetupBench provides a rigorous yard‑stick for the next generation of software developer agents aiming to solve end to end real‑wold tasks.

PaperID: 9666, https://arxiv.org/pdf/2507.09023.pdf

Authors: Yao Fehlis, Charles Crain, Aidan Jensen, Michael Watson, James Juhasz, Paul Mandel, Betty Liu, Shawn Mahon, Daren Wilson, Nick Lynch-Jonely, Ben Leedom, David Fuller

Title: Accelerating Drug Discovery Through Agentic AI: A Multi-Agent Approach to Laboratory Automation in the DMTA Cycle

Abstract:
The pharmaceutical industry faces unprecedented challenges in drug discovery, with traditional approaches struggling to meet modern therapeutic development demands. This paper introduces a novel AI framework, Tippy, that transforms laboratory automation through specialized AI agents operating within the Design‑Make‑Test‑Analyze (DMTA) cycle. Our multi‑agent system employs five specialized agents ‑ Supervisor, Molecule, Lab, Analysis, and Report, with Safety Guardrail oversight ‑ each designed to excel in specific phases of the drug discovery pipeline. Tippy represents the first production‑ready implementation of specialized AI agents for automating the DMTA cycle, providing a concrete example of how AI can transform laboratory workflows. By leveraging autonomous AI agents that reason, plan, and collaborate, we demonstrate how Tippy accelerates DMTA cycles while maintaining scientific rigor essential for pharmaceutical research. The system shows significant improvements in workflow efficiency, decision‑making speed, and cross‑disciplinary coordination, offering a new paradigm for AI‑assisted drug discovery.

PaperID: 9667, https://arxiv.org/pdf/2507.08960.pdf

Authors: Andrew Estornell, Jean-Francois Ton, Muhammad Faaiz Taufiq, Hang Li

Title: How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs

Abstract:
Large Language Models (LLMs) have achieved strong performance on a wide range of complex reasoning tasks, yet further gains are often possible by leveraging the complementary strengths of multiple models. While multi‑agent frameworks can improve solution quality by leveraging multiple LLMs, existing methods are often computationally expensive, both at training and inference time. In this work, we introduce a hierarchical multi‑agent framework that addresses these challenges by training only a single leader LLM to coordinate a team of untrained peer agents. To this end, we propose Multi‑agent guided Leader Policy Optimization (MLPO), a novel approach which trains the leader to evaluate and synthesize agent responses without auxiliary value networks or explicit agent feedback. Leaders trained with MLPO exhibit improved performance not only when interacting with the agent team at inference time, but also enjoy improved performance when deployed in single‑agent settings without the team. Empirical results on Big‑Bench Hard (BBH), MATH, and MMLU demonstrate that our framework achieves substantial performance improvements over both single‑agent and multi‑agent baselines. Our results highlight the effectiveness and efficiency of training a single, flexible leader for collaborative reasoning in multi‑agent LLM systems.

PaperID: 9668, https://arxiv.org/pdf/2507.08944.pdf

Authors: Enhao Zhang, Erkang Zhu, Gagan Bansal, Adam Fourney, Hussein Mozannar, Jack Gerrits

Title: Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents

Abstract:
Large language model (LLM)‑based multi‑agent systems have demonstrated remarkable promise for tackling complex tasks by breaking them down into subtasks that are iteratively planned, executed, observed, and refined. Despite their effectiveness, these systems often incur high latency because real‑world problems frequently demand multiple iterative cycles of reasoning steps. To address this challenge, we propose M1‑Parallel, a framework that concurrently runs multiple multi‑agent teams in parallel to uncover distinct solution paths. By leveraging an event‑driven communication model with asynchronous messaging, M1‑Parallel efficiently capitalizes on the inherent diversity of valid plans to either reduce end‑to‑end latency or boost task completion rates. Our experiments on complex tasks show that M1‑Parallel with early termination achieves up to 2.2× speedup while preserving accuracy, and that M1‑Parallel with aggregation yields higher task completion rates. We further investigate strategies aimed at encouraging diverse execution plans but observe no additional performance gains over repeated sampling. Overall, these findings underscore the potential of parallel plan execution for optimizing multi‑agent systems for real‑world, high‑complexity reasoning tasks.

PaperID: 9669, https://arxiv.org/pdf/2507.08799.pdf

Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, James R. Glass, Cees G. M. Snoek, Yuki M. Asano

Title: KV Cache Steering for Controlling Frozen LLMs

Abstract:
We propose cache steering, a lightweight method for implicit steering of language models via a one‑shot intervention applied directly to the key‑value cache. To validate its effectiveness, we apply cache steering to induce chain‑of‑thought reasoning in small language models. Our approach constructs steering vectors from reasoning traces, obtained either from teacher models (e.g., GPT‑4o) or existing human annotations, that shift model behavior toward more explicit, multi‑step reasoning without fine‑tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Additional experiments show that the method also scales to larger models and yields further gains on challenging datasets such as GPQA and MATH. Compared to prior activation steering techniques that require continuous interventions, our one‑shot cache steering offers substantial advantages in terms of inference latency, hyperparameter stability, and ease of integration with existing inference APIs. Beyond mere reasoning induction, we show that cache steering enables controllable transfer of reasoning styles (e.g., stepwise, causal, analogical), making it a practical tool for behavior‑level guidance of language models.

PaperID: 9670, https://arxiv.org/pdf/2507.08664.pdf

Authors: Haoran Sun, Shaoning Zeng

Title: Introspection of Thought Helps AI Agents

Abstract:
AI Agents rely on Large Language Models (LLMs) and Multimodal‑LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post‑training, where LLMs and MLLMs play the most critical role and determine the initial ability and limitations of AI Agents. Usually, AI Agents utilize sophisticated prompt engineering and external reasoning framework to obtain a promising interaction with LLMs, e.g., Chain‑of‑Thought, Iteration of Thought and Image‑of‑Thought. However, they are still constrained by the inherent limitations of LLM in understanding natural language, and the iterative reasoning process will generate a large amount of inference cost. To this end, we propose a novel AI Agent Reasoning Framework with Introspection of Thought (INoT) by designing a new LLM‑Read code in prompt. It enables LLM to execute programmatic dialogue reasoning processes following the code in prompt. Therefore, self‑denial and reflection occur within LLM instead of outside LLM, which can reduce token cost effectively. Through our experiments on six benchmarks for three different tasks, the effectiveness of INoT is verified, with an average improvement of 7.95% in performance, exceeding the baselines. Furthermore, the token cost of INoT is lower on average than the best performing method at baseline by 58.3%. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.

PaperID: 9671, https://arxiv.org/pdf/2507.08648.pdf

Authors: Haoran Sun, Haoyu Bian, Shaoning Zeng, Yunbo Rao, Xu Xu, Lin Mei, Jianping Gou

Title: DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Abstract:
Common knowledge indicates that the process of constructing image datasets usually depends on the time‑intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real‑world data are obviously more valuable comparing to artificially intelligence generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto‑constructing datasets from real‑world images by a multiagent collaborative system, named as DatasetAgent. By coordinating four different agents equipped with Multi‑modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high‑quality image datasets according to user‑specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open‑source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.

PaperID: 9672, https://arxiv.org/pdf/2507.08644.pdf

Authors: Junho Koh, Youngwoo Lee, Jungho Kim, Dongyoung Lee, Jun Won Choi

Title: OnlineBEV: Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception

Abstract:
Multi‑view camera‑based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view‑to‑BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego‑motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion‑guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state‑of‑the‑art performance in the camera‑only 3D object detection task.

PaperID: 9673, https://arxiv.org/pdf/2507.08619.pdf

Authors: Soheyl Massoudi, Mark Fuge

Title: Agentic Large Language Models for Conceptual Systems Engineering and Design

Abstract:
Early‑stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi‑agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two‑agent system (2AS). The target application is a solar‑powered water filtration system as described in a cahier des charges. We introduce the Design‑State Graph (DSG), a JSON‑serializable representation that bundles requirements, physical embodiments, and Python‑based physics models into graph nodes. A nine‑role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator‑Reflector loop. Both systems run a total of 60 experiments (2 LLMs ‑ Llama 3.3 70B vs reasoning‑distilled DeepSeek R1 70B x 2 agent configurations x 3 temperatures x 5 seeds). We report a JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20%). Code compatibility peaked at 100% under specific 2AS settings but averaged below 50% for MAS. Only the reasoning‑distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5‑6 nodes) whereas 2AS mode‑collapsed. Structured multi‑agent orchestration enhanced design detail. Reasoning‑distilled LLM improved completion rates, yet low requirements and fidelity gaps in coding persisted.

PaperID: 9674, https://arxiv.org/pdf/2507.08616.pdf

Authors: Florian Grötschla, Luis Müller, Jan Tönshoff, Mikhail Galkin, Bryan Perozzi

Title: AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Abstract:
Large‑language models (LLMs) have demonstrated powerful problem‑solving capabilities, in particular when organized in multi‑agent systems. However, the advent of such systems also raises several questions on the ability of a complex network of agents to effectively self‑organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi‑agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi‑agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi‑agent systems to collaboratively form strategies for problem‑solving, self‑organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet including homogeneous networks of agents which first have to agree on basic protocols for organization and communication. We find that some frontier LLMs are already demonstrating strong performance for small networks but begin to fall off once the size of the network scales. While existing multi‑agent benchmarks cover at most 2‑5 agents, AgentsNet is practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.

PaperID: 9675, https://arxiv.org/pdf/2507.08610.pdf

Authors: Parag Dutta, Ambedkar Dukkipati

Title: Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data

Abstract:
Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training the large Vision Language Models (VLMs), it becomes challenging to improve the performance of the same. Considering this, it is essential to consider the unsupervised image captioning performance, which remains relatively under‑explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a Multi‑agent Reinforcement Learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train agents in the cooperative common‑reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. We show that using pre‑trained VLMs as the 'speaker' and Large Language Model (LLM) for language understanding in the 'listener', we achieved a 46 BLEU score after fine‑tuning using LoGIC without additional labels, a 2 units advantage in absolute metrics compared to the 44 BLEU score of the vanilla VLM. Additionally, we replace the VLM from the 'speaker' with lightweight components: (i) a ViT for image perception and (ii) a GPT2 language generation, and train them from scratch using LoGIC, obtaining a 31 BLEU score in the unsupervised setting, a 10 points advantage over existing unsupervised image‑captioning methods.

PaperID: 9676, https://arxiv.org/pdf/2507.08440.pdf

Authors: Selina Heller, Mohamed Ibrahim, David Antony Selby, Sebastian Vollmer

Title: Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Abstract:
Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real‑world scenarios, particularly through collaborative multi‑agent systems that mimic group interactions. In this work, we present a novel LLM‑based multi‑agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi‑agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement‑detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real‑world decision conferences regarding outcome and decision‑making. These findings demonstrate the potential for LLM‑based multi‑agent systems to simulate group decision‑making processes. They also highlight that such systems could be instrumental in supporting decision‑making with expert elicitation workshops across various domains.

PaperID: 9677, https://arxiv.org/pdf/2507.08429.pdf

Authors: Geng Sun, Likun Zhang, Jiahui Li, Jing Wu, Jiacheng Wang, Zemin Sun, Changyuan Zhao, Victor C. M. Leung

Title: Age of Information Optimization in Laser-charged UAV-assisted IoT Networks: A Multi-agent Deep Reinforcement Learning Method

Abstract:
The integration of unmanned aerial vehicles (UAVs) with Internet of Things (IoT) networks offers promising solutions for efficient data collection. However, the limited energy capacity of UAVs remains a significant challenge. In this case, laser beam directors (LBDs) have emerged as an effective technology for wireless charging of UAVs during operation, thereby enabling sustained data collection without frequent returns to charging stations (CSs). In this work, we investigate the age of information (AoI) optimization in LBD‑powered UAV‑assisted IoT networks, where multiple UAVs collect data from distributed IoTs while being recharged by laser beams. We formulate a joint optimization problem that aims to minimize the peak AoI while determining optimal UAV trajectories and laser charging strategies. This problem is particularly challenging due to its non‑convex nature, complex temporal dependencies, and the need to balance data collection efficiency with energy consumption constraints. To address these challenges, we propose a novel multi‑agent proximal policy optimization with temporal memory and multi‑agent coordination (MAPPO‑TM) framework. Specifically, MAPPO‑TM incorporates temporal memory mechanisms to capture the dynamic nature of UAV operations and facilitates effective coordination among multiple UAVs through decentralized learning while considering global system objectives. Simulation results demonstrate that the proposed MAPPO‑TM algorithm outperforms conventional approaches in terms of peak AoI minimization and energy efficiency. Ideally, the proposed algorithm achieves up to 15.1% reduction in peak AoI compared to conventional multi‑agent deep reinforcement learning (MADRL) methods.

PaperID: 9678, https://arxiv.org/pdf/2507.08392.pdf

Authors: Asma Yamani, Malak Baslyman, Moataz Ahmed

Title: Multi-Agent LLMs as Ethics Advocates for AI-Based Systems

Abstract:
Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although eliciting manual ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi‑agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30‑minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

PaperID: 9679, https://arxiv.org/pdf/2507.08325.pdf

Authors: Yinzhu Quan, Xinrui Li, Ying Chen

Title: CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

Abstract:
In e‑commerce private‑domain channels such as instant messaging and e‑mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi‑agent system built on large language models (LLMs) that generates high‑quality message templates and actionable writing guidance through three complementary modes. First, group‑based learning enables the agent to learn from a merchant's own top‑performing messages within the same audience segment and rewrite low‑performing ones. Second, retrieval‑and‑adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule‑based fallback provides a lightweight zero‑shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants' original templates, delivering significant gains in both audience‑match and marketing‑effectiveness metrics.

PaperID: 9680, https://arxiv.org/pdf/2507.08297.pdf

Authors: Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu

Title: KAT-V1: Kwai-AutoThink Technical Report

Abstract:
We present Kwaipilot‑AutoThink (KAT), an open‑source 40B large language model developed to address the overthinking problem in reasoning‑intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non‑reasoning modes based on task complexity. Specifically, first, we construct the dual‑regime dataset based on a novel tagging pipeline and a multi‑agent synthesis strategy, and then we apply Multi‑Token Prediction (MTP)‑enhanced knowledge distillation, enabling efficient and fine‑grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold‑start initialization strategy that introduces mode‑selection priors using majority‑vote signals and intent‑aware prompting. Finally, we propose Step‑SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning‑mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state‑of‑the‑art models, including DeepSeek‑R1‑0528 and Qwen3‑235B‑A22B, across a wide range of reasoning‑intensive tasks while reducing token usage. Notably, KAT outperforms all open‑source models and even surpasses o3‑mini on the leakage‑controlled LiveCodeBench Pro. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou's internal coding assistant), where it improves real‑world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture‑of‑Experts (MoE) model with 40B active parameters, and early results already show significant gains, further demonstrating the scalability of the AutoThink paradigm.

PaperID: 9681, https://arxiv.org/pdf/2507.08270.pdf

Authors: Zeyang Sha, Hanling Tian, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weiqiang Wang

Title: Agent Safety Alignment via Reinforcement Learning

Abstract:
The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user‑initiated threats (e.g., adversarial prompts) and tool‑initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety‑alignment framework for tool‑using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri‑modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses, and define a policy‑driven decision model. Our framework employs a custom‑designed sandbox environment that simulates real‑world tool execution and allows fine‑grained reward shaping. Through extensive evaluations on public and self‑built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety‑aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.

PaperID: 9682, https://arxiv.org/pdf/2507.07957.pdf

Authors: Yu Wang, Xi Chen

Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents

Abstract:
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user‑specific information over time. To this end, we introduce MIRIX, a modular, multi‑agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real‑world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi‑agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long‑term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high‑resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long‑form conversation benchmark with single‑modal textual input, MIRIX attains state‑of‑the‑art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory‑augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.

PaperID: 9683, https://arxiv.org/pdf/2507.07906.pdf

Authors: Anant Gupta, Rajarshi Bhowmik, Geoffrey Gunow

Title: Agentic Retrieval of Topics and Insights from Earnings Calls

Abstract:
Tracking the strategic focus of companies through topics in their earnings calls is a key task in financial analysis. However, as industries evolve, traditional topic modeling techniques struggle to dynamically capture emerging topics and their relationships. In this work, we propose an LLM‑agent driven approach to discover and retrieve emerging topics from quarterly earnings calls. We propose an LLM‑agent to extract topics from documents, structure them into a hierarchical ontology, and establish relationships between new and existing topics through a topic ontology. We demonstrate the use of extracted topics to infer company‑level insights and emerging trends over time. We evaluate our approach by measuring ontology coherence, topic evolution accuracy, and its ability to surface emerging financial trends.

PaperID: 9684, https://arxiv.org/pdf/2507.07901.pdf

Authors: Sree Bhargavi Balija, Rekha Singal, Ramesh Raskar, Erfan Darzi, Raghu Bala, Thomas Hardjono, Ken Huang

Title: The Trust Fabric: Decentralized Interoperability and Economic Coordination for the Agentic Web

Abstract:
The fragmentation of AI agent ecosystems has created urgent demands for interoperability, trust, and economic coordination that current protocols ‑‑ including MCP (Hou et al., 2025), A2A (Habler et al., 2025), ACP (Liu et al., 2025), and Cisco's AGP (Edwards, 2025) ‑‑ cannot address at scale. We present the Nanda Unified Architecture, a decentralized framework built around three core innovations: fast DID‑based agent discovery through distributed registries, semantic agent cards with verifiable credentials and composability profiles, and a dynamic trust layer that integrates behavioral attestations with policy compliance. The system introduces X42/H42 micropayments for economic coordination and MAESTRO, a security framework incorporating Synergetics' patented AgentTalk protocol (US Patent 12,244,584 B1) and secure containerization. Real‑world deployments demonstrate 99.9 percent compliance in healthcare applications and substantial monthly transaction volumes with strong privacy guarantees. By unifying MIT's trust research with production deployments from Cisco and Synergetics, we show how cryptographic proofs and policy‑as‑code transform agents into trust‑anchored participants in a decentralized economy (Lakshmanan, 2025; Sha, 2025). The result enables a globally interoperable Internet of Agents where trust becomes the native currency of collaboration across both enterprise and Web3 ecosystems.

PaperID: 9685, https://arxiv.org/pdf/2507.07671.pdf

Authors: Jovan Prodanov, Blaž Bertalanič, Carolina Fortuna, Shih-Kai Chou, Matjaž Branko Jurič, Ramon Sanchez-Iborra, Jernej Hribar

Title: Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems

Abstract:
Modern edge‑cloud systems face challenges in efficiently scaling resources to handle dynamic and unpredictable workloads. Traditional scaling approaches typically rely on static thresholds and predefined rules, which are often inadequate for optimizing resource utilization and maintaining performance in distributed and dynamic environments. This inefficiency hinders the adaptability and performance required in edge‑cloud infrastructures, which can only be achieved through the newly proposed in‑place scaling. To address this problem, we propose the Multi‑Agent Reinforcement Learning‑based In‑place Scaling Engine (MARLISE) that enables seamless, dynamic, reactive control with in‑place resource scaling. We develop our solution using two Deep Reinforcement Learning algorithms: Deep Q‑Network (DQN), and Proximal Policy Optimization (PPO). We analyze each version of the proposed MARLISE solution using dynamic workloads, demonstrating their ability to ensure low response times of microservices and scalability. Our results show that MARLISE‑based approaches outperform heuristic method in managing resource elasticity while maintaining microservice response times and achieving higher resource efficiency.

PaperID: 9686, https://arxiv.org/pdf/2507.07544.pdf

Authors: Oliver Eberle, Thomas McGee, Hamza Giaffar, Taylor Webb, Ida Momennejad

Title: Position: We Need An Algorithmic Understanding of Generative AI

Abstract:
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover algorithmic primitives, reflected in latent representations, attention, and inference‑time compute, and their algorithmic composition to solve task‑specific problems. We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top‑down hypotheses about candidate algorithms, and bottom‑up tests of these hypotheses via circuit‑level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource‑intensive scaling, reorienting the field toward a principled understanding of underlying computations. Such algorithmic explanations offer a pathway to human‑understandable interpretability, enabling comprehension of the model's internal reasoning performance measures. This can in turn lead to more sample‑efficient methods for training and improving performance, as well as novel architectures for end‑to‑end and multi‑agent systems.

PaperID: 9687, https://arxiv.org/pdf/2507.07509.pdf

Authors: Yuanchen Shi, Longyin Zhang, Fang Kong

Title: Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Abstract:
The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non‑English languages. To address this, we propose a framework that leverages limited real‑world data and expert knowledge to fine‑tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large‑scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real‑world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state‑of‑the‑art performance on both CPsDD and ESConv datasets.

PaperID: 9688, https://arxiv.org/pdf/2507.07441.pdf

Authors: Yu Xia, Yiran Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley

Title: SAND: Boosting LLM Agents with Self-Taught Action Deliberation

Abstract:
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct‑style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternatives actions, LLM agents finetuned with these methods may over‑commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose Self‑taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step‑level action evaluation, we incorporate self‑consistency action sampling and execution‑guided action critique to help synthesize step‑wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state‑of‑the‑art agent tuning approaches.

PaperID: 9689, https://arxiv.org/pdf/2507.07426.pdf

Authors: Zerui Yang, Yuwei Wan, Siyu Yan, Yudai Matsuda, Tong Xie, Bram Hoex, Linqi Song

Title: DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

Abstract:
Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine‑tuning or retrieval‑augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi‑agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general‑purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent‑based collaboration, and feedback‑driven search mechanisms in advancing LLM applications for drug repositioning.

PaperID: 9690, https://arxiv.org/pdf/2507.07400.pdf

Authors: Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding

Title: KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

Abstract:
Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse key‑value (KV) tensors corresponding to agents' fixed prompts, thereby avoiding redundant computation across repeated invocations. However, current systems typically evict KV caches using a Least Recently Used (LRU) policy, which fails to anticipate future agent usage and often discards KV caches shortly before their reuse. This leads to frequent cache misses and substantial recomputation or swapping overhead. We present KVFlow, a workflow‑aware KV cache management framework tailored for agentic workloads. KVFlow abstracts the agent execution schedule as an Agent Step Graph and assigns each agent a steps‑to‑execution value that estimates its temporal proximity to future activation. These values guide a fine‑grained eviction policy at the KV node level, allowing KVFlow to preserve entries likely to be reused and efficiently manage shared prefixes in tree‑structured caches. Moreover, KVFlow introduces a fully overlapped KV prefetching mechanism, which proactively loads required tensors from CPU to GPU in background threads for agents scheduled in the next step, thereby avoiding cache miss stalls during generation. Compared to SGLang with hierarchical radix cache, KVFlow achieves up to 1.83× speedup for single workflows with large prompts, and up to 2.19× speedup for scenarios with many concurrent workflows.

PaperID: 9691, https://arxiv.org/pdf/2507.07376.pdf

Authors: Hengrui Liu, Yi Feng, Qilong Zhang

Title: PILOC: A Pheromone Inverse Guidance Mechanism and Local-Communication Framework for Dynamic Target Search of Multi-Agent in Unknown Environments

Abstract:
Multi‑Agent Search and Rescue (MASAR) plays a vital role in disaster response, exploration, and reconnaissance. However, dynamic and unknown environments pose significant challenges due to target unpredictability and environmental uncertainty. To tackle these issues, we propose PILOC, a framework that operates without global prior knowledge, leveraging local perception and communication. It introduces a pheromone inverse guidance mechanism to enable efficient coordination and dynamic target localization. PILOC promotes decentralized cooperation through local communication, significantly reducing reliance on global channels. Unlike conventional heuristics, the pheromone mechanism is embedded into the observation space of Deep Reinforcement Learning (DRL), supporting indirect agent coordination based on environmental cues. We further integrate this strategy into a DRL‑based multi‑agent architecture and conduct extensive experiments. Results show that combining local communication with pheromone‑based guidance significantly boosts search efficiency, adaptability, and system robustness. Compared to existing methods, PILOC performs better under dynamic and communication‑constrained scenarios, offering promising directions for future MASAR applications.

PaperID: 9692, https://arxiv.org/pdf/2507.07362.pdf

Authors: Xinyu Li, Tongguang Li, Lixiang Yan, Yuheng Li, Linxuan Zhao, Mladen Raković, Inge Molenaar, Dragan Gašević, Yizhou Fan

Title: FLoRA: An Advanced AI-Powered Engine to Facilitate Hybrid Human-AI Regulated Learning

Abstract:
Self‑Regulated Learning (SRL), defined as learners' ability to systematically plan, monitor, and regulate their learning activities, is crucial for sustained academic achievement and lifelong learning competencies. Emerging AI developments profoundly influence SRL interactions by potentially either diminishing or strengthening learners' opportunities to exercise their own regulatory skills. Recent literature emphasizes a balanced approach termed Hybrid Human‑AI Regulated Learning (HHAIRL), in which AI provides targeted, timely scaffolding while preserving the learners' role as active decision‑makers and reflective monitors of their learning process. Central to HHAIRL is the integration of adaptive and personalized learning systems; by modelling each learner's knowledge and self‑regulation patterns, AI can deliver contextually relevant scaffolds that support learners during all phases of the SRL process. Nevertheless, existing digital tools frequently fall short, lacking adaptability and personalisation, focusing narrowly on isolated SRL phases, and insufficiently supporting meaningful human‑AI interactions. In response, this paper introduces the enhanced FLoRA Engine, which incorporates advanced generative AI features and state‑of‑the‑art learning analytics, and grounds in solid educational theories. The FLoRA Engine offers tools such as collaborative writing, multi‑agent chatbots, and detailed learning trace logging to support dynamic, adaptive scaffolding of self‑regulation tailored to individual needs in real time. We further present a summary of several research studies that provide the validations for and illustrate how these tools can be utilized in real‑world educational and experimental contexts. These studies demonstrate the effectiveness of FLoRA Engine in fostering SRL, providing both theoretical insights and practical solutions for the future of AI‑enhanced learning contexts.

PaperID: 9693, https://arxiv.org/pdf/2507.07320.pdf

Authors: Dongyu Wei, Xiaoren Xu, Shiwen Mao, Mingzhe Chen

Title: Optimizing Communication and Device Clustering for Clustered Federated Learning with Differential Privacy

Abstract:
In this paper, a secure and communication‑efficient clustered federated learning (CFL) design is proposed. In our model, several base stations (BSs) with heterogeneous task‑handling capabilities and multiple users with non‑independent and identically distributed (non‑IID) data jointly perform CFL training incorporating differential privacy (DP) techniques. Since each BS can process only a subset of the learning tasks and has limited wireless resource blocks (RBs) to allocate to users for federated learning (FL) model parameter transmission, it is necessary to jointly optimize RB allocation and user scheduling for CFL performance optimization. Meanwhile, our considered CFL method requires devices to use their limited data and FL model information to determine their task identities, which may introduce additional communication overhead. We formulate an optimization problem whose goal is to minimize the training loss of all learning tasks while considering device clustering, RB allocation, DP noise, and FL model transmission delay. To solve the problem, we propose a novel dynamic penalty function assisted value decomposed multi‑agent reinforcement learning (DPVD‑MARL) algorithm that enables distributed BSs to independently determine their connected users, RBs, and DP noise of the connected users but jointly minimize the training loss of all learning tasks across all BSs. Different from the existing MARL methods that assign a large penalty for invalid actions, we propose a novel penalty assignment scheme that assigns penalty depending on the number of devices that cannot meet communication constraints (e.g., delay), which can guide the MARL scheme to quickly find valid actions, thus improving the convergence speed. Simulation results show that the DPVD‑MARL can improve the convergence rate by up to 20% and the ultimate accumulated rewards by 15% compared to independent Q‑learning.

PaperID: 9694, https://arxiv.org/pdf/2507.07307.pdf

Authors: Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong

Title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Abstract:
Large language models (LLMs) incorporated with Retrieval‑Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi‑agent Retrieval‑Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well‑grounded, and up‑to‑date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high‑quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, cross evaluations show that our system generalizes well across diverse health misinformation topics and datasets. And human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.

PaperID: 9695, https://arxiv.org/pdf/2507.07302.pdf

Authors: Ashish Kumar

Title: Application of LLMs to Multi-Robot Path Planning and Task Allocation

Abstract:
Efficient exploration is a well known problem in deep reinforcement learning and this problem is exacerbated in multi‑agent reinforcement learning due the intrinsic complexities of such algorithms. There are several approaches to efficiently explore an environment to learn to solve tasks by multi‑agent operating in that environment, of which, the idea of expert exploration is investigated in this work. More specifically, this work investigates the application of large‑language models as expert planners for efficient exploration in planning based tasks for multiple agents.

PaperID: 9696, https://arxiv.org/pdf/2507.07299.pdf

Authors: Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang

Title: MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation

Abstract:
Recent progress in large vision‑language models has driven improvements in language‑based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language‑focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open‑vocabulary multi‑object navigation dataset with natural language goal descriptions (e.g. 'go to the red short candle on the table') and corresponding fine‑grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate on this setting, we extend multi‑object navigation task setting to Language‑guided Multi‑Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose Multi‑Layered Feature Map (MLFM), a novel method that builds a queryable, multi‑layered semantic map from pretrained vision‑language features and proves effective for reasoning over fine‑grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state‑of‑the‑art zero‑shot mapping‑based navigation baselines.

PaperID: 9697, https://arxiv.org/pdf/2507.07074.pdf

Authors: Farhaan Ebadulla, Dharini Hindlatti, Srinivaasan NS, Apoorva VH, Ayman Aftab

Title: Graph-Based Complexity Metrics for Multi-Agent Curriculum Learning: A Validated Approach to Task Ordering in Cooperative Coordination Environments

Abstract:
Multi‑agent reinforcement learning (MARL) faces significant challenges in task sequencing and curriculum design, particularly for cooperative coordination scenarios. While curriculum learning has demonstrated success in single‑agent domains, principled approaches for multi‑agent coordination remain limited due to the absence of validated task complexity metrics. This approach presents a graph‑based coordination complexity metric that integrates agent dependency entropy, spatial interference patterns, and goal overlap analysis to predict task difficulty in multi‑agent environments. The complexity metric achieves strong empirical validation with rho = 0.952 correlation (p < 0.001) between predicted complexity and empirical difficulty determined by random agent performance evaluation. This approach evaluates the curriculum learning framework using MADDPG across two distinct coordination environments: achieving 56x performance improvement in tight coordination tasks (MultiWalker) and demonstrating systematic task progression in cooperative navigation (Simple Spread). Through systematic analysis, coordination tightness emerges as a predictor of curriculum learning effectiveness, where environments requiring strict agent interdependence benefit substantially from structured progression. This approach provides a validated complexity metric for multi‑agent curriculum design and establishes empirical guidelines for multi‑robot coordination applications.

PaperID: 9698, https://arxiv.org/pdf/2507.06997.pdf

Authors: Deemah H. Tashman, Soumaya Cherkaoui, Walaa Hamouda

Title: Federated Learning-based MARL for Strengthening Physical-Layer Security in B5G Networks

Abstract:
This paper explores the application of a federated learning‑based multi‑agent reinforcement learning (MARL) strategy to enhance physical‑layer security (PLS) in a multi‑cellular network within the context of beyond 5G networks. At each cell, a base station (BS) operates as a deep reinforcement learning (DRL) agent that interacts with the surrounding environment to maximize the secrecy rate of legitimate users in the presence of an eavesdropper. This eavesdropper attempts to intercept the confidential information shared between the BS and its authorized users. The DRL agents are deemed to be federated since they only share their network parameters with a central server and not the private data of their legitimate users. Two DRL approaches, deep Q‑network (DQN) and Reinforce deep policy gradient (RDPG), are explored and compared. The results demonstrate that RDPG converges more rapidly than DQN. In addition, we demonstrate that the proposed method outperforms the distributed DRL approach. Furthermore, the outcomes illustrate the trade‑off between security and complexity.

PaperID: 9699, https://arxiv.org/pdf/2507.06993.pdf

Authors: Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Kiu Chung, Dragomir Yankov, Chiqun Zhang

Title: IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence

Abstract:
Map applications are still largely point‑and‑click, making it difficult to ask map‑centric questions or connect what a camera sees to the surrounding geospatial context with view‑conditioned inputs. We introduce IMAIA, an interactive Maps AI Assistant that enables natural‑language interaction with both vector (street) maps and satellite imagery, and augments camera inputs with geospatial intelligence to help users understand the world. IMAIA comprises two complementary components. Maps Plus treats the map as first‑class context by parsing tiled vector/satellite views into a grid‑aligned representation that a language model can query to resolve deictic references (e.g., ``the flower‑shaped building next to the park in the top‑right''). Places AI Smart Assistant (PAISA) performs camera‑aware place understanding by fusing image‑‑place embeddings with geospatial signals (location, heading, proximity) to ground a scene, surface salient attributes, and generate concise explanations. A lightweight multi‑agent design keeps latency low and exposes interpretable intermediate decisions. Across map‑centric QA and camera‑to‑place grounding tasks, IMAIA improves accuracy and responsiveness over strong baselines while remaining practical for user‑facing deployments. By unifying language, maps, and geospatial cues, IMAIA moves beyond scripted tools toward conversational mapping that is both spatially grounded and broadly usable.

PaperID: 9700, https://arxiv.org/pdf/2507.06850.pdf

Authors: Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro

Title: The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise

Abstract:
The rapid adoption of Large Language Model (LLM) agents and multi‑agent systems enables remarkable capabilities in natural language processing and generation. However, these systems introduce security vulnerabilities that extend beyond traditional content generation to system‑level compromises. This paper presents a comprehensive evaluation of the LLMs security used as reasoning engines within autonomous agents, highlighting how they can be exploited as attack vectors capable of achieving computer takeovers. We focus on how different attack surfaces and trust boundaries can be leveraged to orchestrate such takeovers. We demonstrate that adversaries can effectively coerce popular LLMs into autonomously installing and executing malware on victim machines. Our evaluation of 18 state‑of‑the‑art LLMs reveals that 94.4% of models succumb to Direct Prompt Injection, and 83.3% are vulnerable to the more stealthy and evasive RAG Backdoor Attack. Notably, we tested trust boundaries within multi‑agent systems, where LLM agents interact and influence each other, and we revealed that LLMs which successfully resist direct injection or RAG backdoor attacks will execute identical payloads when requested by peer agents. We found that 100.0% of tested LLMs can be compromised through Inter‑Agent Trust Exploitation attacks, and that every model exhibits context‑dependent security behaviors that create exploitable blind spots.

PaperID: 9701, https://arxiv.org/pdf/2507.06520.pdf

Authors: Xinyuan Song, Zeyu Wang, Siyi Wu, Tianyu Shi, Lynn Ai

Title: Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration

Abstract:
We present Gradientsys, a next‑generation multi‑agent scheduling framework that coordinates diverse specialized AI agents using a typed Model‑Context Protocol (MCP) and a ReAct‑based dynamic planning loop. At its core, Gradientsys employs an LLM‑powered scheduler for intelligent one‑to‑many task dispatch, enabling parallel execution of heterogeneous agents such as PDF parsers, web search modules, GUI controllers, and web builders. The framework supports hybrid synchronous/asynchronous execution, respects agent capacity constraints, and incorporates a robust retry‑and‑replan mechanism to handle failures gracefully. To promote transparency and trust, Gradientsys includes an observability layer streaming real‑time agent activity and intermediate reasoning via Server‑Sent Events (SSE). We offer an architectural overview and evaluate Gradientsys against existing frameworks in terms of extensibility, scheduling topology, tool reusability, parallelism, and observability. Experiments on the GAIA general‑assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs compared to a MinionS‑style baseline, demonstrating the strength of its LLM‑driven multi‑agent orchestration.

PaperID: 9702, https://arxiv.org/pdf/2507.06506.pdf

Authors: Russell Taylor, Benjamin Herbert, Michael Sana

Title: Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Abstract:
Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state‑of‑the‑art large language models with specialized techniques for wordplay generation. Our methodology employs a three‑stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain‑of‑thought pipeline with combined phonetic‑semantic embeddings. Third, we implement a multi‑agent generator‑discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology's primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically‑informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.

PaperID: 9703, https://arxiv.org/pdf/2507.06466.pdf

Authors: Aaron Dharna, Cong Lu, Jeff Clune

Title: Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

Abstract:
Multi‑agent interactions have long fueled innovation, from natural predator‑prey dynamics to the space race. Self‑play (SP) algorithms try to harness these dynamics by pitting agents against ever‑improving opponents, thereby creating an implicit curriculum toward learning high‑quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation‑Model Self‑Play (FMSP), a new direction that leverages the code‑generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across local optima in policy space. We propose a family of approaches: (1) Vanilla Foundation‑Model Self‑Play (vFMSP) continually refines agent policies via competitive self‑play; (2) Novelty‑Search Self‑Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality‑Diveristy Self‑Play (QDSP), creates a diverse set of high‑quality policies by combining the diversity of NSSP and refinement of vFMSP. We evaluate FMSPs in Car Tag, a continuous‑control pursuer‑evader setting, and in Gandalf, a simple AI safety simulation in which an attacker tries to jailbreak an LLM's defenses. In Car Tag, FMSPs explore a wide variety of reinforcement learning, tree search, and heuristic‑based methods, to name just a few. In terms of discovered policy quality, \ouralgo and vFMSP surpass strong human‑designed strategies. In Gandalf, FMSPs can successfully automatically red‑team an LLM, breaking through and jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs can automatically proceed to patch the discovered vulnerabilities. Overall, FMSPs represent a promising new research frontier of improving self‑play with foundation models, opening fresh paths toward more creative and open‑ended strategy discovery

PaperID: 9704, https://arxiv.org/pdf/2507.06444.pdf

Authors: Jiaxun Zhang, Haicheng Liao, Yumu Xie, Chengyue Wang, Yanchen Guan, Bin Rao, Zhenning Li

Title: Eyes on the Road, Mind Beyond Vision: Context-Aware Multi-modal Enhanced Risk Anticipation

Abstract:
Accurate accident anticipation remains challenging when driver cognition and dynamic road conditions are underrepresented in predictive models. In this paper, we propose CAMERA (Context‑Aware Multi‑modal Enhanced Risk Anticipation), a multi‑modal framework integrating dashcam video, textual annotations, and driver attention maps for robust accident anticipation. Unlike existing methods that rely on static or environment‑centric thresholds, CAMERA employs an adaptive mechanism guided by scene complexity and gaze entropy, reducing false alarms while maintaining high recall in dynamic, multi‑agent traffic scenarios. A hierarchical fusion pipeline with Bi‑GRU (Bidirectional GRU) captures spatio‑temporal dependencies, while a Geo‑Context Vision‑Language module translates 3D spatial relationships into interpretable, human‑centric alerts. Evaluations on the DADA‑2000 and benchmarks show that CAMERA achieves state‑of‑the‑art performance, improving accuracy and lead time. These results demonstrate the effectiveness of modeling driver attention, contextual description, and adaptive risk thresholds to enable more reliable accident anticipation.

PaperID: 9705, https://arxiv.org/pdf/2507.06440.pdf

Authors: Yicheng Xu, Faryar Jabbari

Title: Distributed Optimization of Finite Condition Number for Laplacian Matrix in Multi-Agent Systems

Abstract:
This paper addresses the distributed optimization of the finite condition number of the Laplacian matrix in multi‑agent systems. The finite condition number, defined as the ratio of the largest to the second smallest eigenvalue of the Laplacian matrix, plays an important role in determining the convergence rate and performance of consensus algorithms, especially in discrete‑time implementations. We propose a fully distributed algorithm by regulating the node weights. The approach leverages max consensus, distributed power iteration, and consensus‑based normalization for eigenvalue and eigenvector estimation, requiring only local communication and computation. Simulation results demonstrate that the proposed method achieves performance comparable to centralized LMI‑based optimization, significantly improving consensus speed and multi‑agent system performance. The framework can be extended to edge weight optimization and the scenarios with non‑simple eigenvalues, highlighting its scalability and practical applicability for large‑scale networked systems.

PaperID: 9706, https://arxiv.org/pdf/2507.06392.pdf

Authors: Federico Chiariotti, Marco Fabris

Title: VoI-aware Scheduling Schemes for Multi-Agent Formation Control

Abstract:
Formation control allows agents to maintain geometric patterns using local information, but most existing methods assume ideal communication. This paper introduces a goal‑oriented framework combining control, cooperative positioning, and communication scheduling for first‑order formation tracking. Each agent estimates its position using 6G network‑based triangulation, and the scheduling of information updates is governed by Age of Information (AoI) and Value of Information (VoI) metrics. We design three lightweight, signaling‑free scheduling policies and assess their impact on formation quality. Simulation results demonstrate the effectiveness of the proposed approach in maintaining accurate formations with no additional communication overhead, showing that worst‑case formation adherence increases by 20%.

PaperID: 9707, https://arxiv.org/pdf/2507.06323.pdf

Authors: Tarek Gasmi, Ramzi Guesmi, Ines Belhadj, Jihene Bennaceur

Title: Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms

Abstract:
Large Language Model (LLM) agents face security vulnerabilities spanning AI‑specific and traditional software domains, yet current research addresses these separately. This study bridges this gap through comparative evaluation of Function Calling architecture and Model Context Protocol (MCP) deployment paradigms using a unified threat classification framework. We tested 3,250 attack scenarios across seven language models, evaluating simple, composed, and chained attacks targeting both AI‑specific threats (prompt injection) and software vulnerabilities (JSON injection, denial‑of‑service). Function Calling showed higher overall attack success rates (73.5% vs 62.59% for MCP), with greater system‑centric vulnerability while MCP exhibited increased LLM‑centric exposure. Attack complexity dramatically amplified effectiveness, with chained attacks achieving 91‑96% success rates. Counterintuitively, advanced reasoning models demonstrated higher exploitability despite better threat detection. Results demonstrate that architectural choices fundamentally reshape threat landscapes. This work establishes methodological foundations for cross‑domain LLM agent security assessment and provides evidence‑based guidance for secure deployment. Code and experimental materials are available at https: // github. com/ theconsciouslab‑ai/llm‑agent‑security.

PaperID: 9708, https://arxiv.org/pdf/2507.06310.pdf

Authors: Yongchao Zeng, Calum Brown, Mark Rounsevell

Title: Too Human to Model:The Uncanny Valley of LLMs in Social Simulation -- When Generative Language Agents Misalign with Modelling Principles

Abstract:
Large language models (LLMs) have been increasingly used to build agents in social simulation because of their impressive abilities to generate fluent, contextually coherent dialogues. Such abilities can enhance the realism of models. However, the pursuit of realism is not necessarily compatible with the epistemic foundation of modelling. We argue that LLM agents, in many regards, are too human to model: they are too expressive, detailed and intractable to be consistent with the abstraction, simplification, and interpretability typically demanded by modelling. Through a model‑building thought experiment that converts the Bass diffusion model to an LLM‑based variant, we uncover five core dilemmas: a temporal resolution mismatch between natural conversation and abstract time steps; the need for intervention in conversations while avoiding undermining spontaneous agent outputs; the temptation to introduce rule‑like instructions in prompts while maintaining conversational naturalness; the tension between role consistency and role evolution across time; and the challenge of understanding emergence, where system‑level patterns become obscured by verbose micro textual outputs. These dilemmas steer the LLM agents towards an uncanny valley: not abstract enough to clarify underlying social mechanisms, while not natural enough to represent realistic human behaviour. This exposes an important paradox: the realism of LLM agents can obscure, rather than clarify, social dynamics when misapplied. We tease out the conditions in which LLM agents are ideally suited: where system‑level emergence is not the focus, linguistic nuances and meaning are central, interactions unfold in natural time, and stable role identity is more important than long‑term behavioural evolution. We call for repositioning LLM agents in the ecosystem of social simulation for future applications.

PaperID: 9709, https://arxiv.org/pdf/2507.06278.pdf

Authors: Kemboi Cheruiyot, Nickson Kiprotich, Vyacheslav Kungurtsev, Kennedy Mugo, Vivian Mwirigi, Marvin Ngesa

Title: A Survey of Multi Agent Reinforcement Learning: Federated Learning and Cooperative and Noncooperative Decentralized Regimes

Abstract:
The increasing interest in research and innovation towards the development of autonomous agents presents a number of complex yet important scenarios of multiple AI Agents interacting with each other in an environment. The particular setting can be understood as exhibiting three possibly topologies of interaction ‑ centrally coordinated cooperation, ad‑hoc interaction and cooperation, and settings with noncooperative incentive structures. This article presents a comprehensive survey of all three domains, defined under the formalism of Federal Reinforcement Learning (RL), Decentralized RL, and Noncooperative RL, respectively. Highlighting the structural similarities and distinctions, we review the state of the art in these subjects, primarily explored and developed only recently in the literature. We include the formulations as well as known theoretical guarantees and highlights and limitations of numerical performance.

PaperID: 9710, https://arxiv.org/pdf/2507.06229.pdf

Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Abstract:
AI agent frameworks operate in isolation, forcing agents to rediscover solutions and repeat mistakes across different systems. Despite valuable problem‑solving experiences accumulated by frameworks like smolagents, OpenHands, and OWL, this knowledge remains trapped within individual systems, preventing the emergence of collective intelligence. Current memory systems focus on individual agents or framework‑specific demonstrations, failing to enable cross‑architecture knowledge transfer. We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining. AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs. At inference time, hybrid retrieval operates through two stages: planning seeds agents with cross‑domain workflows, while feedback applies targeted diagnostic fixes. A disagreement gate ensures retrieved knowledge enhances rather than disrupts reasoning, addressing knowledge interference in cross‑framework transfer. We validate AGENT KB across major frameworks on GAIA, Humanity's Last Exam, GPQA, and SWE‑bench. Results show substantial improvements across diverse model families: compared to baseline pass@1, smolagents with AGENT KB achieve up to 18.7pp gains at pass@3 (55.2% ‑> 73.9%), while OpenHands improves 4.0pp on SWE‑bench pass@1 (24.3% ‑> 28.3%). Similar improvements are observed across all base model families. Ablations confirm that hybrid retrieval and feedback stages are essential, with automatically generated experiences matching manual curation. This establishes the foundation for collective agent intelligence through shared memory infrastructures.

PaperID: 9711, https://arxiv.org/pdf/2507.06183.pdf

Authors: Prahitha Movva, Naga Harshita Marupaka

Title: Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling

Abstract:
Technical reports and articles often contain valuable information in the form of semi‑structured data like charts, and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi‑step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non‑visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE‑1 and ROUGE‑L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain‑of‑thought reasoning and ensemble modeling in improving the model's ability in visual question answering.

PaperID: 9712, https://arxiv.org/pdf/2507.06134.pdf

Authors: Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap

Title: OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Abstract:
Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real‑world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi‑turn, multi‑user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule‑based analysis with LLM‑as‑judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety‑vulnerable tasks with Claude‑Sonnet‑3.7, to 72.7% with o3‑mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real‑world deployment.

PaperID: 9713, https://arxiv.org/pdf/2507.06126.pdf

Authors: Souvik Roy, Agamani Saha

Title: A Directed Lazy Random Walk Model to Three-Way Dynamic Matching Problem

Abstract:
This paper explores a novel extension of dynamic matching theory by analyzing a three‑way matching problem involving agents from three distinct populations, each with two possible types. Unlike traditional static or two‑way dynamic models, our setting captures more complex team‑formation environments where one agent from each of the three populations must be matched to form a valid team. We consider two preference structures: assortative or homophilic, where agents prefer to be matched with others of the same type, and dis‑assortative or heterophilic, where diversity within the team is valued. Agents arrive sequentially and face a trade‑off between matching immediately or waiting for a higher quality match in the future albeit with a waiting cost. We construct and analyze the corresponding transition probability matrices for each preference regime and demonstrate the existence and uniqueness of stationary distributions. Our results show that stable and efficient outcomes can arise in dynamic, multi‑agent matching environments, offering a deeper understanding of how complex matching processes evolve over time and how they can be effectively managed.

PaperID: 9714, https://arxiv.org/pdf/2507.06016.pdf

Authors: Youmna Farag, Svetlana Stoyanchev, Mohan Li, Simon Keizer, Rama Doddipatla

Title: Conditional Multi-Stage Failure Recovery for Embodied Agents

Abstract:
Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero‑shot chain prompting. The framework is structured into four error‑handling stages, with three operating during task execution and one functioning as a post‑execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state‑of‑the‑art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.

PaperID: 9715, https://arxiv.org/pdf/2507.06004.pdf

Authors: Chang Yao, Youfang Lin, Shoucheng Song, Hao Wu, Yuqing Ma, Shang Han, Kai Lv

Title: From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination

Abstract:
Continual Multi‑Agent Reinforcement Learning (Co‑MARL) requires agents to address catastrophic forgetting issues while learning new coordination policies with the dynamics team. In this paper, we delve into the core of Co‑MARL, namely Relation Patterns, which refer to agents' general understanding of interactions. In addition to generality, relation patterns exhibit task‑specificity when mapped to different action spaces. To this end, we propose a novel method called General Relation Patterns‑Guided Task‑Specific Decision‑Maker (RPG). In RPG, agents extract relation patterns from dynamic observation spaces using a relation capturer. These task‑agnostic relation patterns are then mapped to different action spaces via a task‑specific decision‑maker generated by a conditional hypernetwork. To combat forgetting, we further introduce regularization items on both the relation capturer and the conditional hypernetwork. Results on SMAC and LBF demonstrate that RPG effectively prevents catastrophic forgetting when learning new tasks and achieves zero‑shot generalization to unseen tasks.

PaperID: 9716, https://arxiv.org/pdf/2507.05981.pdf

Authors: Marc Oriol, Quim Motger, Jordi Marco, Xavier Franch

Title: Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models

Abstract:
Context: Large Language Model (LLM) agents are becoming widely used for various Requirements Engineering (RE) tasks. Research on improving their accuracy mainly focuses on prompt engineering, model fine‑tuning, and retrieval augmented generation. However, these methods often treat models as isolated black boxes ‑ relying on single‑pass outputs without iterative refinement or collaboration, limiting robustness and adaptability. Objective: We propose that, just as human debates enhance accuracy and reduce bias in RE tasks by incorporating diverse perspectives, different LLM agents debating and collaborating may achieve similar improvements. Our goal is to investigate whether Multi‑Agent Debate (MAD) strategies can enhance RE performance. Method: We conducted a systematic study of existing MAD strategies across various domains to identify their key characteristics. To assess their applicability in RE, we implemented and tested a preliminary MAD‑based framework for RE classification. Results: Our study identified and categorized several MAD strategies, leading to a taxonomy outlining their core attributes. Our preliminary evaluation demonstrated the feasibility of applying MAD to RE classification. Conclusions: MAD presents a promising approach for improving LLM accuracy in RE tasks. This study provides a foundational understanding of MAD strategies, offering insights for future research and refinements in RE applications.

PaperID: 9717, https://arxiv.org/pdf/2507.05820.pdf

Authors: Syemin Park, Soobin Park, Youn-kyung Lim

Title: Constella: Supporting Storywriters' Interconnected Character Creation through LLM-based Multi-Agents

Abstract:
Creating a cast of characters by attending to their relational dynamics is a critical aspect of most long‑form storywriting. However, our formative study (N=14) reveals that writers struggle to envision new characters that could influence existing ones, balance similarities and differences among characters, and intricately flesh out their relationships. Based on these observations, we designed Constella, an LLM‑based multi‑agent tool that supports storywriters' interconnected character creation process. Constella suggests related characters (FRIENDS DISCOVERY feature), reveals the inner mindscapes of several characters simultaneously (JOURNALS feature), and manifests relationships through inter‑character responses (COMMENTS feature). Our 7‑8 day deployment study with storywriters (N=11) shows that Constella enabled the creation of expansive communities composed of related characters, facilitated the comparison of characters' thoughts and emotions, and deepened writers' understanding of character relationships. We conclude by discussing how multi‑agent interactions can help distribute writers' attention and effort across the character cast.

PaperID: 9718, https://arxiv.org/pdf/2507.05791.pdf

Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li

Title: GTA1: GUI Test-time Scaling Agent

Abstract:
Graphical user interface (GUI) agents autonomously complete tasks across platforms (\eg, Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (\ie, the action proposal sequence) under expansive action space, where selecting an appropriate plan is non‑trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high‑resolution interfaces, \ie, precisely interacting with visual targets. This paper investigates the aforementioned challenges with our GUI Test‑time Scaling Agent, namely GTA1. First, we conduct test‑time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state‑of‑the‑art performance on both grounding and agent task execution benchmarks. The code and models are released here.

PaperID: 9719, https://arxiv.org/pdf/2507.05755.pdf

Authors: Lukas Kuhn, Florian Buettner

Title: An autonomous agent for auditing and improving the reliability of clinical AI models

Abstract:
The deployment of AI models in clinical practice faces a critical challenge: models achieving expert‑level performance on benchmarks can fail catastrophically when confronted with real‑world variations in medical imaging. Minor shifts in scanner hardware, lighting or demographics can erode accuracy, but currently reliability auditing to identify such catastrophic failure cases before deployment is a bespoke and time‑consuming process. Practitioners lack accessible and interpretable tools to expose and repair hidden failure modes. Here we introduce ModelAuditor, a self‑reflective agent that converses with users, selects task‑specific metrics, and simulates context‑dependent, clinically relevant distribution shifts. ModelAuditor then generates interpretable reports explaining how much performance likely degrades during deployment, discussing specific likely failure modes and identifying root causes and mitigation strategies. Our comprehensive evaluation across three real‑world clinical scenarios ‑ inter‑institutional variation in histopathology, demographic shifts in dermatology, and equipment heterogeneity in chest radiography ‑ demonstrates that ModelAuditor is able correctly identify context‑specific failure modes of state‑of‑the‑art models such as the established SIIM‑ISIC melanoma classifier. Its targeted recommendations recover 15‑25% of performance lost under real‑world distribution shift, substantially outperforming both baseline models and state‑of‑the‑art augmentation methods. These improvements are achieved through a multi‑agent architecture and execute on consumer hardware in under 10 minutes, costing less than US0.50 per audit.

PaperID: 9720, https://arxiv.org/pdf/2507.05638.pdf

Authors: Litian Zhang, Xiaoming Zhang, Bingyu Yan, Ziyi Zhou, Bo Zhang, Zhenyu Guan, Xi Zhang, Chaozhuo Li

Title: LLMs are Introvert

Abstract:
The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation. Understanding information propagation dynamics and developing effective control strategies is essential to mitigate harmful content. Traditional models, such as SIR, provide basic insights but inadequately capture the complexities of online interactions. Advanced methods, including attention mechanisms and graph neural networks, enhance accuracy but typically overlook user psychology and behavioral dynamics. Large language models (LLMs), with their human‑like reasoning, offer new potential for simulating psychological aspects of information spread. We introduce an LLM‑based simulation environment capturing agents' evolving attitudes, emotions, and responses. Initial experiments, however, revealed significant gaps between LLM‑generated behaviors and authentic human dynamics, especially in stance detection and psychological realism. A detailed evaluation through Social Information Processing Theory identified major discrepancies in goal‑setting and feedback evaluation, stemming from the lack of emotional processing in standard LLM training. To address these issues, we propose the Social Information Processing‑based Chain of Thought (SIP‑CoT) mechanism enhanced by emotion‑guided memory. This method improves the interpretation of social cues, personalization of goals, and evaluation of feedback. Experimental results confirm that SIP‑CoT‑enhanced LLM agents more effectively process social information, demonstrating behaviors, attitudes, and emotions closer to real human interactions. In summary, this research highlights critical limitations in current LLM‑based propagation simulations and demonstrates how integrating SIP‑CoT and emotional memory significantly enhances the social intelligence and realism of LLM agents.

PaperID: 9721, https://arxiv.org/pdf/2507.05607.pdf

Authors: Chongshan Fan, Shenghai Yuan

Title: Structured Task Solving via Modular Embodied Intelligence: A Case Study on Rubik's Cube

Abstract:
This paper presents Auto‑RubikAI, a modular autonomous planning framework that integrates a symbolic Knowledge Base (KB), a vision‑language model (VLM), and a large language model (LLM) to solve structured manipulation tasks exemplified by Rubik's Cube restoration. Unlike traditional robot systems based on predefined scripts, or modern approaches relying on pretrained networks and large‑scale demonstration data, Auto‑RubikAI enables interpretable, multi‑step task execution with minimal data requirements and no prior demonstrations. The proposed system employs a KB module to solve group‑theoretic restoration steps, overcoming LLMs' limitations in symbolic reasoning. A VLM parses RGB‑D input to construct a semantic 3D scene representation, while the LLM generates structured robotic control code via prompt chaining. This tri‑module architecture enables robust performance under spatial uncertainty. We deploy Auto‑RubikAI in both simulation and real‑world settings using a 7‑DOF robotic arm, demonstrating effective Sim‑to‑Real adaptation without retraining. Experiments show a 79% end‑to‑end task success rate across randomized configurations. Compared to CFOP, DeepCubeA, and Two‑Phase baselines, our KB‑enhanced method reduces average solution steps while maintaining interpretability and safety. Auto‑RubikAI provides a cost‑efficient, modular foundation for embodied task planning in smart manufacturing, robotics education, and autonomous execution scenarios. Code, prompts, and hardware modules will be released upon publication.

PaperID: 9722, https://arxiv.org/pdf/2507.05558.pdf

Authors: Arthur Gervais, Liyi Zhou

Title: AI Agent Smart Contract Exploit Generation

Abstract:
Smart contract vulnerabilities have led to billions in losses, yet finding actionable exploits remains challenging. Traditional fuzzers rely on rigid heuristics and struggle with complex attacks, while human auditors are thorough but slow and don't scale. Large Language Models offer a promising middle ground, combining human‑like reasoning with machine speed. Early studies show that simply prompting LLMs generates unverified vulnerability speculations with high false positive rates. To address this, we present A1, an agentic system that transforms any LLM into an end‑to‑end exploit generator. A1 provides agents with six domain‑specific tools for autonomous vulnerability discovery, from understanding contract behavior to testing strategies on real blockchain states. All outputs are concretely validated through execution, ensuring only profitable proof‑of‑concept exploits are reported. We evaluate A1 across 36 real‑world vulnerable contracts on Ethereum and Binance Smart Chain. A1 achieves a 63% success rate on the VERITE benchmark. Across all successful cases, A1 extracts up to \8.59 million per exploit and \9.33 million total. Using Monte Carlo analysis of historical attacks, we demonstrate that immediate vulnerability detection yields 86‑89% success probability, dropping to 6‑21% with week‑long delays. Our economic analysis reveals a troubling asymmetry: attackers achieve profitability at \6,000 exploit values while defenders require \60,000 ‑‑ raising fundamental questions about whether AI agents inevitably favor exploitation over defense.

PaperID: 9723, https://arxiv.org/pdf/2507.05534.pdf

Authors: Erik Hemberg, Eric Liu, Lucille Fuller, Stephen Moskal, Una-May O'Reilly

Title: Evolutionary and Coevolutionary Multi-Agent Design Choices and Dynamics

Abstract:
We investigate two representation alternatives for the controllers of teams of cyber agents. We combine these controller representations with different evolutionary algorithms, one of which introduces a novel LLM‑supported mutation operator. Using a cyber security scenario, we evaluate agent learning when one side is trained to compete against a side that does not evolve and when two sides coevolve with each other. This allows us to quantify the relative merits and tradeoffs of representation and algorithm combinations in terms of team performance. Our versions of grammatical evolution algorithms using grammars that allow a controller to be expressed in code‑like logic can achieve the best team performance. The scenario also allows us to compare the performance impact and dynamics of coevolution versus evolution under different combinations. Across the algorithms and representations, we observe that coevolution reduces the performance highs and lows of both sides while it induces fluctuations on both sides. In contrast, when only one‑side is optimized, performance peaks are higher and is more sustained than when both sides are optimized with coevolution.

PaperID: 9724, https://arxiv.org/pdf/2507.05528.pdf

Authors: Jiahuan Pei, Fanghua Ye, Xin Sun, Wentao Deng, Koen Hindriks, Junxiao Wang

Title: Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

Abstract:
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large‑scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi‑agent workflow leveraging LLMs to simulate interactive teaching‑learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher‑learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric‑based metrics with human judgment alignment. Results demonstrate the workflow's effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open‑sourced.

PaperID: 9725, https://arxiv.org/pdf/2507.05520.pdf

Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar

Title: Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA

Abstract:
Dermatological care via telemedicine often lacks the rich context of in‑person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain‑specific fine‑tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision‑language models on medical visual question answering across six configurations: baseline models, fine‑tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval‑augmented generation that incorporates medical literature at inference time, serving a role similar to reference‑checking. While fine‑tuning degraded performance in four of seven models with an average 30% decrease, baseline models collapsed on test data. Clinical‑inspired architectures, meanwhile, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature‑grounded outputs critical for clinical adoption. These findings demonstrate that medical AI succeeds by reconstructing the collaborative and evidence‑based practices fundamental to clinical diagnosis.

PaperID: 9726, https://arxiv.org/pdf/2507.05469.pdf

Authors: Ceferino Patino, Tyler J. Billings, Alireza Saleh Abadi, Daniel Redder, Adam Eck, Prashant Doshi, Leen-Kiat Soh

Title: Inaugural MOASEI Competition at AAMAS'2025: A Technical Report

Abstract:
We present the Methods for Open Agent Systems Evaluation Initiative (MOASEI) Competition, a multi‑agent AI benchmarking event designed to evaluate decision‑making under open‑world conditions. Built on the free‑range‑zoo environment suite, MOASEI introduced dynamic, partially observable domains with agent and task openness‑‑settings where entities may appear, disappear, or change behavior over time. The 2025 competition featured three tracks‑‑Wildfire, Rideshare, and Cybersecurity‑‑each highlighting distinct dimensions of openness and coordination complexity. Eleven teams from international institutions participated, with four of those teams submitting diverse solutions including graph neural networks, convolutional architectures, predictive modeling, and large language model‑‑driven meta‑‑optimization. Evaluation metrics centered on expected utility, robustness to perturbations, and responsiveness to environmental change. The results reveal promising strategies for generalization and adaptation in open environments, offering both empirical insight and infrastructure for future research. This report details the competition's design, findings, and contributions to the open‑agent systems research community.

PaperID: 9727, https://arxiv.org/pdf/2507.05445.pdf

Authors: Daniel Jones, Giorgio Severi, Martin Pouliot, Gary Lopez, Joris de Gruyter, Santiago Zanella-Beguelin, Justin Song, Blake Bullwinkel, Pamela Cortez, Amanda Minnich

Title: A Systematization of Security Vulnerabilities in Computer Use Agents

Abstract:
Computer Use Agents (CUAs), autonomous systems that interact with software interfaces via browsers or virtual machines, are rapidly being deployed in consumer and enterprise environments. These agents introduce novel attack surfaces and trust boundaries that are not captured by traditional threat models. Despite their growing capabilities, the security boundaries of CUAs remain poorly understood. In this paper, we conduct a systematic threat analysis and testing of real‑world CUAs under adversarial conditions. We identify seven classes of risks unique to the CUA paradigm, and analyze three concrete exploit scenarios in depth: (1) clickjacking via visual overlays that mislead interface‑level reasoning, (2) indirect prompt injection that enables Remote Code Execution (RCE) through chained tool use, and (3) CoT exposure attacks that manipulate implicit interface framing to hijack multi‑step reasoning. These case studies reveal deeper architectural flaws across current CUA implementations. Namely, a lack of input provenance tracking, weak interface‑action binding, and insufficient control over agent memory and delegation. We conclude by proposing a CUA‑specific security evaluation framework and design principles for safe deployment in adversarial and high‑stakes settings.

PaperID: 9728, https://arxiv.org/pdf/2507.05330.pdf

Authors: Ming Gong, Xucheng Huang, Chenghan Yang, Xianhan Peng, Haoxin Wang, Yang Liu, Ling Jiang

Title: MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

Abstract:
Recent advances in large language models (LLMs) have enabled new applications in e‑commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open‑source multimodal LLM agent tailored for e‑commerce. Built on the CoALA framework, it integrates memory, decision‑making, and action modules, and adopts a modular "MLLM‑as‑Tool" strategy for effect visual‑textual reasoning. Evaluated via online A/B testing and simulation‑based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real‑world deployments.

PaperID: 9729, https://arxiv.org/pdf/2507.05321.pdf

Authors: Kwangsuk Park, Jiwoong Yang

Title: AGACCI : Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts

Abstract:
Recent advances in AI‑assisted education have encouraged the integration of vision‑language models (VLMs) into academic assessment, particularly for tasks that require both quantitative and qualitative evaluation. However, existing VLM based approaches struggle with complex educational artifacts, such as programming tasks with executable components and measurable outputs, that require structured reasoning and alignment with clearly defined evaluation criteria. We introduce AGACCI, a multi‑agent system that distributes specialized evaluation roles across collaborative agents to improve accuracy, interpretability, and consistency in code‑oriented assessment. To evaluate the framework, we collected 360 graduate‑level code‑based assignments from 60 participants, each annotated by domain experts with binary rubric scores and qualitative feedback. Experimental results demonstrate that AGACCI outperforms a single GPT‑based baseline in terms of rubric and feedback accuracy, relevance, consistency, and coherence, while preserving the instructional intent and evaluative depth of expert assessments. Although performance varies across task types, AGACCI highlights the potential of multi‑agent systems for scalable and context‑aware educational evaluation.

PaperID: 9730, https://arxiv.org/pdf/2507.05292.pdf

Authors: Kaiqi Yang, Hang Li, Yucheng Chu, Ahreum Han, Yasemin Copur-Gencturk, Jiliang Tang, Hui Liu

Title: A LLM-Driven Multi-Agent Systems for Professional Development of Mathematics Teachers

Abstract:
Professional development (PD) serves as the cornerstone for teacher tutors to grasp content knowledge. However, providing equitable and timely PD opportunities for teachers poses significant challenges. To address this issue, we introduce I‑VIP (Intelligent Virtual Interactive Program), an intelligent tutoring platform for teacher professional development, driven by large language models (LLMs) and supported by multi‑agent frameworks. This platform offers a user‑friendly conversational interface and allows users to employ a variety of interactive tools to facilitate question answering, knowledge comprehension, and reflective summarization while engaging in dialogue. To underpin the functionality of this platform, including knowledge expectation analysis, response scoring and classification, and feedback generation, the multi‑agent frameworks are leveraged to enhance the accuracy of judgments and mitigate the issue of missing key points.

PaperID: 9731, https://arxiv.org/pdf/2507.05269.pdf

Authors: Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, Xiangyu Zhang

Title: CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks

Abstract:
Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface‑level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end‑to‑end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability for program semantic reasoning underexplored. This work presents CORE, a high‑quality, human‑verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CORE includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics‑aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 mainstream LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi‑step reasoning. We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs' code reasoning capabilities.

PaperID: 9732, https://arxiv.org/pdf/2507.05257.pdf

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley

Title: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Abstract:
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component‑memory, encompassing how agents memorize, update, and retrieve long‑term information‑is under‑evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test‑time learning, long‑range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long‑context settings like book‑based QA, which do not reflect the interactive, multi‑turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long‑context datasets and incorporates newly constructed datasets into a multi‑turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context‑based and retrieval‑augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

PaperID: 9733, https://arxiv.org/pdf/2507.05246.pdf

Authors: Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah

Title: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Abstract:
While chain‑of‑thought (CoT) monitoring is an appealing AI safety defense, recent work on "unfaithfulness" has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post‑hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT‑as‑rationalization from CoT‑as‑computation. We expect that certain classes of severe harm will require complex, multi‑step reasoning that necessitates CoT‑as‑computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress‑test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human‑written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress‑testing.

PaperID: 9734, https://arxiv.org/pdf/2507.05178.pdf

Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen

Title: CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

Abstract:
Despite rapid progress in large language model (LLM)‑based multi‑agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real‑world tasks. Existing environments typically focus on small‑scale, fully observable, or low‑complexity domains, limiting their utility for developing and assessing next‑generation multi‑agent Agentic AI frameworks. We introduce CREW‑Wildfire, an open‑source benchmark designed to close this gap. Built atop the human‑AI teaming CREW simulation platform, CREW‑Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long‑horizon planning objectives. The environment supports both low‑level control and high‑level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state‑of‑the‑art LLM‑based multi‑agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large‑scale coordination, communication, spatial reasoning, and long‑horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW‑Wildfire establishes a critical foundation for advancing research in scalable multi‑agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.

PaperID: 9735, https://arxiv.org/pdf/2507.05135.pdf

Authors: Svyatoslav Pchelintsev, Maxim Patratskiy, Anatoly Onishchenko, Alexandr Korchemnyi, Aleksandr Medvedev, Uliana Vinogradova, Ilya Galuzinsky, Aleksey Postnikov, Alexey K. Kovalev, Aleksandr I. Panov

Title: LERa: Replanning with Visual Feedback in Instruction Following

Abstract:
Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real‑world changes and failures. To address these challenges, we propose LERa ‑ Look, Explain, Replan ‑ a Visual Language Model‑based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection ‑ without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look ‑ where LERa generates a scene description and identifies errors; (ii) Explain ‑ where it provides corrective guidance; and (iii) Replan ‑ where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED‑ChaOS and VirtualHome‑ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real‑world trials with a tabletop manipulator robot, confirm LERa's effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error‑aware task execution in robotics. The project page is available at https://lera‑robo.github.io.

PaperID: 9736, https://arxiv.org/pdf/2507.05098.pdf

Authors: Tobias Demmler, Jakob Häringer, Andreas Tamke, Thao Dang, Alexander Hegai, Lars Mikelsons

Title: Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance

Abstract:
Accurate trajectory prediction is critical for safe autonomous navigation, yet the impact of dataset design on model performance remains understudied. This work systematically examines how feature selection, cross‑dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi‑agent settings. We evaluate a state‑of‑the‑art model using our novel L4 Motion Forecasting dataset based on our own data recordings in Germany and the US. This includes enhanced map and agent features. We compare our dataset to the US‑centric Argoverse 2 benchmark. First, we find that incorporating supplementary map and agent features unique to our dataset, yields no measurable improvement over baseline features, demonstrating that modern architectures do not need extensive feature sets for optimal performance. The limited features of public datasets are sufficient to capture convoluted interactions without added complexity. Second, we perform cross‑dataset experiments to evaluate how effective domain knowledge can be transferred between datasets. Third, we group our dataset by country and check the knowledge transfer between different driving cultures.

PaperID: 9737, https://arxiv.org/pdf/2507.04917.pdf

Authors: Thayanne França da Silva, José Everardo Bessa Maia

Title: Leadership Detection via Time-Lagged Correlation-Based Network Inference

Abstract:
Understanding leadership dynamics in collective behavior is a key challenge in animal ecology, swarm robotics, and intelligent transportation. Traditional information‑theoretic approaches, including Transfer Entropy (TE) and Time‑Lagged Mutual Information (TLMI), have been widely used to infer leader‑follower relationships but face critical limitations in noisy or short‑duration datasets due to their reliance on robust probability estimations. This study proposes a method based on dynamic network inference using time‑lagged correlations across multiple kinematic variables: velocity, acceleration, and direction. Our approach constructs directed influence graphs over time, enabling the identification of leadership patterns without the need for large volumes of data or parameter‑sensitive discretization. We validate our method through two multi‑agent simulations in NetLogo: a modified Vicsek model with informed leaders and a predator‑prey model featuring coordinated and independent wolf groups. Experimental results demonstrate that the network‑based method outperforms TE and TLMI in scenarios with limited spatiotemporal observations, ranking true leaders at the top of influence metrics more consistently than TE and TLMI.

PaperID: 9738, https://arxiv.org/pdf/2507.04893.pdf

Authors: Kaleem Ullah Qasim, Jiashu Zhang

Title: MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

Abstract:
Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high‑severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black box prompting, which struggle to scale in noisy, real‑world settings and offer limited interpretability. To address these challenges, we propose MARBLE a multiagent rule based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML‑backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule‑based or LLM‑guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent‑level reasoning and coordination outcomes, supporting in‑depth interpretability and post‑hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state‑of‑the‑art (SOTA) prompt‑based reasoning methods including Chain‑of‑Thought (CoT), Least‑to‑Most (L2M), and Tree‑of‑Thought (ToT) achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety‑critical applications.

PaperID: 9739, https://arxiv.org/pdf/2507.04770.pdf

Authors: Toan Nguyen, Tri Le, Quang Nguyen, Anh Nguyen

Title: FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System

Abstract:
Furniture decoration is an important task in various industrial applications. However, achieving a high‑quality decorative result is often time‑consuming and requires specialized artistic expertise. To tackle these challenges, we explore how multi‑agent systems can assist in automating the decoration process. We propose FurniMAS, a multi‑agent system for automatic furniture decoration. Specifically, given a human prompt and a household furniture item such as a working desk or a TV stand, our system suggests relevant assets with appropriate styles and materials, and arranges them on the item, ensuring the decorative result meets functionality, aesthetic, and ambiance preferences. FurniMAS assembles a hybrid team of LLM‑based and non‑LLM agents, each fulfilling distinct roles in a typical decoration project. These agents collaborate through communication, logical reasoning, and validation to transform the requirements into the final outcome. Extensive experiments demonstrate that our FurniMAS significantly outperforms other baselines in generating high‑quality 3D decor.

PaperID: 9740, https://arxiv.org/pdf/2507.04762.pdf

Authors: Maria Damanaki, Ioulia Kapsali, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos

Title: Robustifying 3D Perception via Least-Squares Graphs for Multi-Agent Object Tracking

Abstract:
The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single‑agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi‑agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least‑squares graph on multi‑agent adversarial bounding boxes. Specifically, we employ the least‑squares graph tool to reduce the induced positional error of each detection's centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi‑vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real‑world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state‑of‑the‑art single and multi‑agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.

PaperID: 9741, https://arxiv.org/pdf/2507.04724.pdf

Authors: Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minghao Wang, Chi Liu

Title: Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems

Abstract:
Multi‑agent systems powered by Large Language Models (LLM‑MAS) have demonstrated remarkable capabilities in collaborative problem‑solving. However, their deployment also introduces new security risks. Existing research on LLM‑based agents has primarily examined single‑agent scenarios, while the security of multi‑agent systems remains largely unexplored. To address this gap, we present a systematic study of intention‑hiding threats in LLM‑MAS. We design four representative attack paradigms that subtly disrupt task completion while maintaining a high degree of stealth, and evaluate them under centralized, decentralized, and layered communication structures. Experimental results show that these attacks are highly disruptive and can easily evade existing defense mechanisms. To counter these threats, we propose AgentXposed, a psychology‑inspired detection framework. AgentXposed draws on the HEXACO personality model, which characterizes agents through psychological trait dimensions, and the Reid interrogation technique, a structured method for eliciting concealed intentions. By combining progressive questionnaire probing with behavior‑based inter‑agent monitoring, the framework enables the proactive identification of malicious agents before harmful actions are carried out. Extensive experiments across six datasets against both our proposed attacks and two baseline threats demonstrate that AgentXposed effectively detects diverse forms of malicious behavior, achieving strong robustness across multiple communication settings.

PaperID: 9742, https://arxiv.org/pdf/2507.04706.pdf

Authors: Kai Yang, Zelin Zhu, Chengtao Jian, Hui Ma, Shengjie Zhao, Xiaozhou Ye, Ye Ouyang

Title: UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization

Abstract:
Urban general intelligence (UGI) refers to the capacity of AI systems to autonomously perceive, reason, and act within dynamic and complex urban environments. In this paper, we introduce UrbanMind, a tool‑enhanced retrieval‑augmented generation (RAG) framework designed to facilitate UGI. Central to UrbanMind is a novel architecture based on Continual Retrieval‑Augmented MoE‑based LLM (C‑RAG‑LLM), which dynamically incorporates domain‑specific knowledge and evolving urban data to support long‑term adaptability. The architecture of C‑RAG‑LLM aligns naturally with a multilevel optimization framework, where different layers are treated as interdependent sub‑problems. Each layer has distinct objectives and can be optimized either independently or jointly through a hierarchical learning process. The framework is highly flexible, supporting both end‑to‑end training and partial layer‑wise optimization based on resource or deployment constraints. To remain adaptive under data drift, it is further integrated with an incremental corpus updating mechanism. Evaluations on real‑world urban tasks of a variety of complexity verify the effectiveness of the proposed framework. This work presents a promising step toward the realization of general‑purpose LLM agents in future urban environments.

PaperID: 9743, https://arxiv.org/pdf/2507.04661.pdf

Authors: Yayu Long, Kewei Chen, Long Jin, Mingsheng Shang

Title: DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics

Abstract:
We introduce Dynamic Retrieval‑Augmented Expert Networks (DRAE), a groundbreaking architecture that addresses the challenges of lifelong learning, catastrophic forgetting, and task adaptation by combining the dynamic routing capabilities of Mixture‑of‑Experts (MoE); leveraging the knowledge‑enhancement power of Retrieval‑Augmented Generation (RAG); incorporating a novel hierarchical reinforcement learning (RL) framework; and coordinating through ReflexNet‑SchemaPlanner‑HyperOptima (RSHO).DRAE dynamically routes expert models via a sparse MoE gating mechanism, enabling efficient resource allocation while leveraging external knowledge through parametric retrieval (P‑RAG) to augment the learning process. We propose a new RL framework with ReflexNet for low‑level task execution, SchemaPlanner for symbolic reasoning, and HyperOptima for long‑term context modeling, ensuring continuous adaptation and memory retention. Experimental results show that DRAE significantly outperforms baseline approaches in long‑term task retention and knowledge reuse, achieving an average task success rate of 82.5% across a set of dynamic robotic manipulation tasks, compared to 74.2% for traditional MoE models. Furthermore, DRAE maintains an extremely low forgetting rate, outperforming state‑of‑the‑art methods in catastrophic forgetting mitigation. These results demonstrate the effectiveness of our approach in enabling flexible, scalable, and efficient lifelong learning for robotics.

PaperID: 9744, https://arxiv.org/pdf/2507.04634.pdf

Authors: Yixin Yan, Yang Li, Yuanfan Wang, Xiaozhou Zhou, Beihao Xia, Manjiang Hu, Hongmao Qin

Title: LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction

Abstract:
It has been challenging to model the complex temporal‑spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high‑order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal‑spatial interaction features for multi‑modal trajectory prediction. Specifically, we introduce a Local Trend‑Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high‑order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi‑Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT‑64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT‑128 with a 68% reduction in model size.

PaperID: 9745, https://arxiv.org/pdf/2507.04410.pdf

Authors: Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, Hung Cao

Title: Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models

Abstract:
This paper presents our submission to the ACMMM25 ‑ Grand Challenge on Multimedia Verification. We developed a multi‑agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact‑checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real‑world multimedia verification scenarios.

PaperID: 9746, https://arxiv.org/pdf/2507.04227.pdf

Authors: Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu

Title: Mobile GUI Agents under Real-world Threats: Are We There Yet?

Abstract:
Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device‑control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large‑scale real‑world deployment, and there are already several commercial agents released and used by early adopters. However, are we really ready for GUI agents integrated into our daily devices as system building blocks? We argue that an important pre‑deployment validation is missing to examine whether the agents can maintain their performance under real‑world threats. Specifically, unlike existing common benchmarks that are based on simple static app contents (they have to do so to ensure environment consistency between different tests), real‑world apps are filled with contents from untrustworthy third parties, such as advertisement emails, user‑generated posts and medias, etc. ... To this end, we introduce a scalable app content instrumentation framework to enable flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open‑source and commercial GUI agents. Our findings reveal that all examined agents can be significantly degraded due to third‑party contents, with an average misleading rate of 42.0% and 36.1% in dynamic and static environments respectively. The framework and benchmark has been released at https://agenthazard.github.io.

PaperID: 9747, https://arxiv.org/pdf/2507.04142.pdf

Authors: Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas

Title: Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Abstract:
Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine‑tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine‑tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high‑quality demonstrations using a frontier model to distil multi‑step reasoning capabilities into smaller models (4B parameters) via Low‑Rank Adaptation (LoRA). Across different language models fine‑tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro‑F1. Moreover, LoRA fine‑tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT‑4o‑mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier‑scale systems, and (iii) reasoning‑type‑aware evaluation is essential to uncover prompt‑induced trade‑offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter‑efficient techniques in highly specialised domains.

PaperID: 9748, https://arxiv.org/pdf/2507.04140.pdf

Authors: Ho Jae Lee, Se Hwan Jeon, Sangbae Kim

Title: Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning

Abstract:
Humans naturally swing their arms during locomotion to regulate whole‑body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb‑level multi‑agent reinforcement learning (RL) framework that enables coordinated whole‑body control of humanoid robots through emergent arm motion. Our approach employs separate actor‑critic structures for the arms and legs, trained with centralized critics but decentralized actors that share only base states and centroidal angular momentum (CAM) observations, allowing each agent to specialize in task‑relevant behaviors through modular reward design. The arm agent guided by CAM tracking and damping rewards promotes arm motions that reduce overall angular momentum and vertical ground reaction moments, contributing to improved balance during locomotion or under external perturbations. Comparative studies with single‑agent and alternative multi‑agent baselines further validate the effectiveness of our approach. Finally, we deploy the learned policy on a humanoid platform, achieving robust performance across diverse locomotion tasks, including flat‑ground walking, rough terrain traversal, and stair climbing.

PaperID: 9749, https://arxiv.org/pdf/2507.04105.pdf

Authors: Jinwei Hu, Yi Dong, Zhengtao Ding, Xiaowei Huang

Title: Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing

Abstract:
This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi‑agent systems (MAS) in safety‑critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verification methods, our approach operates in black‑box settings and employs a two‑stage adaptive sampling mechanism to balance robustness and computational efficiency. Simulation results demonstrate that our method effectively prevents the propagation of adversarial behaviors and hallucinations while maintaining consensus performance. This work provides a practical and scalable path toward safe deployment of LLM‑based MAS in real‑world, high‑stakes environments.

PaperID: 9750, https://arxiv.org/pdf/2507.04067.pdf

Authors: Yuyang Cheng, Yumiao Xu, Chaojia Yu, Yong Zhao

Title: HAWK: A Hierarchical Workflow Framework for Multi-Agent Collaboration

Abstract:
Contemporary multi‑agent systems encounter persistent challenges in cross‑platform interoperability, dynamic task scheduling, and efficient resource sharing. Agents with heterogeneous implementations often lack standardized interfaces; collaboration frameworks remain brittle and hard to extend; scheduling policies are static; and inter‑agent state synchronization is insufficient. We propose Hierarchical Agent Workflow (HAWK), a modular framework comprising five layers‑User, Workflow, Operator, Agent, and Resource‑and supported by sixteen standardized interfaces. HAWK delivers an end‑to‑end pipeline covering task parsing, workflow orchestration, intelligent scheduling, resource invocation, and data synchronization. At its core lies an adaptive scheduling and optimization module in the Workflow Layer, which harnesses real‑time feedback and dynamic strategy adjustment to maximize utilization. The Resource Layer provides a unified abstraction over heterogeneous data sources, large models, physical devices, and third‑party services&tools, simplifying cross‑domain information retrieval. We demonstrate HAWK's scalability and effectiveness via CreAgentive, a multi‑agent novel‑generation prototype, which achieves marked gains in throughput, lowers invocation complexity, and improves system controllability. We also show how hybrid deployments of large language models integrate seamlessly within HAWK, highlighting its flexibility. Finally, we outline future research avenues‑hallucination mitigation, real‑time performance tuning, and enhanced cross‑domain adaptability‑and survey prospective applications in healthcare, government, finance, and education.

PaperID: 9751, https://arxiv.org/pdf/2507.04053.pdf

Authors: Baohua Zhang, Xin Li, Huangchao Xu, Zhong Jin, Quansheng Wu, Ce Li

Title: TopoMAS: Large Language Model Driven Topological Materials Multiagent System

Abstract:
Topological materials occupy a frontier in condensed‑matter physics thanks to their remarkable electronic and quantum properties, yet their cross‑scale design remains bottlenecked by inefficient discovery workflows. Here, we introduce TopoMAS (Topological materials Multi‑Agent System), an interactive human‑AI framework that seamlessly orchestrates the entire materials‑discovery pipeline: from user‑defined queries and multi‑source data retrieval, through theoretical inference and crystal‑structure generation, to first‑principles validation. Crucially, TopoMAS closes the loop by autonomously integrating computational outcomes into a dynamic knowledge graph, enabling continuous knowledge refinement. In collaboration with human experts, it has already guided the identification of novel topological phases SrSbO3, confirmed by first‑principles calculations. Comprehensive benchmarks demonstrate robust adaptability across base Large Language Model, with the lightweight Qwen2.5‑72B model achieving 94.55% accuracy while consuming only 74.3‑78.4% of tokens required by Qwen3‑235B and 83.0% of DeepSeek‑V3's usage‑‑delivering responses twice as fast as Qwen3‑235B. This efficiency establishes TopoMAS as an accelerator for computation‑driven discovery pipelines. By harmonizing rational agent orchestration with a self‑evolving knowledge graph, our framework not only delivers immediate advances in topological materials but also establishes a transferable, extensible paradigm for materials‑science domain.

PaperID: 9752, https://arxiv.org/pdf/2507.04037.pdf

Authors: Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Zejun Li, Yun Song, Zhongyu Wei

Title: Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

Abstract:
The gap between static benchmarks and the dynamic nature of real‑world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1‑ENVS, the first interactive and dynamic legal environment tailored for LLM‑based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1‑EVAL, a fine‑grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT‑4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

PaperID: 9753, https://arxiv.org/pdf/2507.04005.pdf

Authors: Baiqiao Zhang, Xiangxian Li, Chao Zhou, Xinyu Gai, Juan Liu, Xue Yang, Nianlong Li, Shuai Ma, Xiaojuan Ma, Yong-jin Liu, Yulong Bian

Title: Exploring a Gamified Personality Assessment Method through Interaction with LLM Agents Embodying Different Personalities

Abstract:
The low‑intrusion and automated personality assessment is receiving increasing attention in psychology and human‑computer interaction fields. This study explores an interactive approach for personality assessment, focusing on the multiplicity of personality representation. We propose a framework of Gamified Personality Assessment through Multi‑Personality Representations (Multi‑PR GPA). The framework leverages Large Language Models to empower virtual agents with different personalities. These agents elicit multifaceted human personality representations through engaging in interactive games. Drawing upon the multi‑type textual data generated throughout the interaction, it achieves personality assessments with interpretable insights. Grounded in the classic Big Five personality theory, we developed a prototype system and conducted a user study to evaluate the efficacy of Multi‑PR GPA. The results affirm the effectiveness of our approach in personality assessment and demonstrate its superior performance when considering the multiplicity of personality representation. Error structure analysis further revealed systematic assessment biases in LLMs, which multi‑context aggregation partially mitigated.

PaperID: 9754, https://arxiv.org/pdf/2507.03928.pdf

Authors: Yiliu Sun, Zicheng Zhao, Sheng Wan, Chen Gong

Title: CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate

Abstract:
Nowadays, single Large Language Model (LLM) struggles with critical issues such as hallucination and inadequate reasoning abilities. To mitigate these issues, Multi‑Agent Debate (MAD) has emerged as an effective strategy, where LLM agents engage in in‑depth debates with others on tasks. However, existing MAD methods face two major issues: (a) too lengthy input contexts, which causes LLM agents to get lost in plenty of input information and experiences performance drop; and (b) the overconfidence dilemma, where self‑assured LLM agents dominate the debate, leading to low debating effectiveness. To address these limitations, we propose a novel MAD method called "CortexDebate". Inspired by the human brain's tendency to establish a sparse and dynamically optimized network among cortical areas governed by white matter, CortexDebate constructs a sparse debating graph among LLM agents, where each LLM agent only debates with the ones that are helpful to it. To optimize the graph, we propose a module named McKinsey‑based Debate Matter (MDM), which acts as an artificial analog to white matter. By integrating the McKinsey Trust Formula, a well‑established measure of trustworthiness from sociology, MDM enables credible evaluations that guide graph optimization. The effectiveness of our CortexDebate has been well demonstrated by extensive experimental results across eight datasets from four task types.

PaperID: 9755, https://arxiv.org/pdf/2507.03904.pdf

Authors: Yingxuan Yang, Ying Wen, Jun Wang, Weinan Zhang

Title: Agent Exchange: Shaping the Future of AI Agent Economics

Abstract:
The rise of Large Language Models (LLMs) has transformed AI agents from passive computational tools into autonomous economic actors. This shift marks the emergence of the agent‑centric economy, in which agents take on active economic roles‑exchanging value, making strategic decisions, and coordinating actions with minimal human oversight. To realize this vision, we propose Agent Exchange (AEX), a specialized auction platform designed to support the dynamics of the AI agent marketplace. AEX offers an optimized infrastructure for agent coordination and economic participation. Inspired by Real‑Time Bidding (RTB) systems in online advertising, AEX serves as the central auction engine, facilitating interactions among four ecosystem components: the User‑Side Platform (USP), which translates human goals into agent‑executable tasks; the Agent‑Side Platform (ASP), responsible for capability representation, performance tracking, and optimization; Agent Hubs, which coordinate agent teams and participate in AEX‑hosted auctions; and the Data Management Platform (DMP), ensuring secure knowledge sharing and fair value attribution. We outline the design principles and system architecture of AEX, laying the groundwork for agent‑based economic infrastructure in future AI ecosystems.

PaperID: 9756, https://arxiv.org/pdf/2507.03870.pdf

Authors: Yashwanthi Anand, Rahil P Mehta, Manish Motwani, Sandhya Saisubramanian

Title: Uncovering Systemic and Environment Errors in Autonomous Systems Using Differential Testing

Abstract:
When an autonomous agent behaves undesirably, including failure to complete a task, it can be difficult to determine whether the behavior is due to a systemic agent error, such as flaws in the model or policy, or an environment error, where a task is inherently infeasible under a given environment configuration, even for an ideal agent. As agents and their environments grow more complex, identifying the error source becomes increasingly difficult but critical for reliable deployment. We introduce AIProbe, a novel black‑box testing technique that applies differential testing to attribute undesirable agent behaviors either to agent deficiencies, such as modeling or training flaws, or due to environmental infeasibility. AIProbe first generates diverse environmental configurations and tasks for testing the agent, by modifying configurable parameters using Latin Hypercube sampling. It then solves each generated task using a search‑based planner, independent of the agent. By comparing the agent's performance to the planner's solution, AIProbe identifies whether failures are due to errors in the agent's model or policy, or due to unsolvable task conditions. Our evaluation across multiple domains shows that AIProbe significantly outperforms state‑of‑the‑art techniques in detecting both total and unique errors, thereby contributing to a reliable deployment of autonomous agents.

PaperID: 9757, https://arxiv.org/pdf/2507.03839.pdf

Authors: Shuowen Li, Kexin Wang, Minglu Fang, Danqi Huang, Ali Asadipour, Haipeng Mi, Yitong Sun

Title: Participatory Evolution of Artificial Life Systems via Semantic Feedback

Abstract:
We present a semantic feedback framework that enables natural language to guide the evolution of artificial life systems. Integrating a prompt‑to‑parameter encoder, a CMA‑ES optimizer, and CLIP‑based evaluation, the system allows user intent to modulate both visual outcomes and underlying behavioral rules. Implemented in an interactive ecosystem simulation, the framework supports prompt refinement, multi‑agent interaction, and emergent rule synthesis. User studies show improved semantic alignment over manual tuning and demonstrate the system's potential as a platform for participatory generative design and open‑ended evolution.

PaperID: 9758, https://arxiv.org/pdf/2507.03802.pdf

Authors: Mayank Kejriwal, Shilpa Thomas

Title: Generating Novelty in Open-World Multi-Agent Strategic Board Games

Abstract:
We describe GNOME (Generating Novelty in Open‑world Multi‑agent Environments), an experimental platform that is designed to test the effectiveness of multi‑agent AI systems when faced with \emphnovelty. GNOME separates the development of AI gameplaying agents with the simulator, allowing \emphunanticipated novelty (in essence, novelty that is not subject to model‑selection bias). Using a Web GUI, GNOME was recently demonstrated at NeurIPS 2020 using the game of Monopoly to foster an open discussion on AI robustness and the nature of novelty in real‑world environments. In this article, we further detail the key elements of the demonstration, and also provide an overview of the experimental design that is being currently used in the DARPA Science of Artificial Intelligence and Learning for Open‑World Novelty (SAIL‑ON) program to evaluate external teams developing novelty‑adaptive gameplaying agents.

PaperID: 9759, https://arxiv.org/pdf/2507.03498.pdf

Authors: Meng Xiao, Junfeng Zhou, Yuanchun Zhou

Title: Reinforcement Learning-based Feature Generation Algorithm for Scientific Data

Abstract:
Feature generation (FG) aims to enhance the prediction potential of original data by constructing high‑order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine‑learning model performance. Traditional methods face the following two challenges when dealing with the feature generation of scientific data: First, the effective construction of high‑order feature combinations in scientific data necessitates profound and extensive domain‑specific expertise. Secondly, as the order of feature combinations increases, the search space expands exponentially, imposing prohibitive human labor consumption. Advancements in the Data‑Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by that, this paper revisits the conventional feature generation workflow and proposes the Multi‑agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multi‑agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations ex‑hibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates the large language models (LLMs) to interpreta‑tively evaluate the generated features of each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.

PaperID: 9760, https://arxiv.org/pdf/2507.03460.pdf

Authors: Weitong Zhang, Mengyun Qiao, Chengqi Zang, Steven Niederer, Paul M Matthews, Wenjia Bai, Bernhard Kainz

Title: Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis

Abstract:
Identifying associations between imaging phenotypes, disease risk factors, and clinical outcomes is essential for understanding disease mechanisms. However, traditional approaches rely on human‑driven hypothesis testing and selection of association factors, often overlooking complex, non‑linear dependencies among imaging phenotypes and other multi‑modal data. To address this, we introduce Multi‑agent Exploratory Synergy for the Heart (MESHAgents): a framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies. Specifically, we orchestrate a multi‑disciplinary team of AI agents, which spontaneously generate and converge on insights through iterative, self‑organizing reasoning. The framework dynamically synthesizes statistical correlations with multi‑expert consensus, providing an automated pipeline for phenome‑wide association studies (PheWAS). We demonstrate the system's capabilities through a population‑based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non‑imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents‑discovered phenotypes achieve performance comparable to expert‑selected phenotypes, with mean AUC differences as small as ‑0.004_\pm0.010 on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert‑driven methods.

PaperID: 9761, https://arxiv.org/pdf/2507.03435.pdf

Authors: Jarosław A. Chudziak, Michał Wawer

Title: ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction

Abstract:
This paper presents ElliottAgents, a multi‑agent system leveraging natural language processing (NLP) and large language models (LLMs) to analyze complex stock market data. The system combines AI‑driven analysis with the Elliott Wave Principle to generate human‑comprehensible predictions and explanations. A key feature is the natural language dialogue between agents, enabling collaborative analysis refinement. The LLM‑enhanced architecture facilitates advanced language understanding, reasoning, and autonomous decision‑making. Experiments demonstrate the system's effectiveness in pattern recognition and generating natural language descriptions of market trends. ElliottAgents contributes to NLP applications in specialized domains, showcasing how AI‑driven dialogue systems can enhance collaborative analysis in data‑intensive fields. This research bridges the gap between complex financial data and human understanding, addressing the need for interpretable and adaptive prediction systems in finance.

PaperID: 9762, https://arxiv.org/pdf/2507.03311.pdf

Authors: Himanshu Dutta, Sunny Manchanda, Prakhar Bapat, Meva Ram Gurjar, Pushpak Bhattacharyya

Title: GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Abstract:
Document level Machine Translation (DocMT) approaches often struggle with effectively capturing discourse level phenomena. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state of the art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d BLEU for domain specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse level phenomena, yielding coherent and contextually accurate translations.

PaperID: 9763, https://arxiv.org/pdf/2507.03293.pdf

Authors: Anand Gokhale, Vaibhav Srivastava, Francesco Bullo

Title: LogicGuard: Improving Embodied LLM agents through Temporal Logic based Critics

Abstract:
Large language models (LLMs) have shown promise in zero‑shot and single step reasoning and decision making problems, but in long horizon sequential planning tasks, their errors compound, often leading to unreliable or inefficient behavior. We introduce LogicGuard, a modular actor‑critic architecture in which an LLM actor is guided by a trajectory level LLM critic that communicates through Linear Temporal Logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high‑level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. LogicGuard supports both fixed safety rules and adaptive, learned constraints, and is model‑agnostic: any LLM‑based planner can serve as the actor, with LogicGuard acting as a logic‑generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LogicGuard to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. To demonstrate generality, we evaluate LogicGuard across two distinct settings: short‑horizon general tasks and long‑horizon specialist tasks. On the Behavior benchmark of 100 household tasks, LogicGuard increases task completion rates by 25% over a baseline InnerMonologue planner. On the Minecraft diamond‑mining task, which is long‑horizon and requires multiple interdependent subgoals, LogicGuard improves both efficiency and safety compared to SayCan and InnerMonologue. These results show that enabling LLMs to supervise each other through temporal logic yields more reliable, efficient and safe decision‑making for both embodied agents.

PaperID: 9764, https://arxiv.org/pdf/2507.03279.pdf

Authors: Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal

Title: Conformal Information Pursuit for Interactively Guiding Large Language Models

Abstract:
A significant use case of instruction‑finetuned Large Language Models (LLMs) is to solve question‑answering tasks interactively. In this setting, an LLM agent is tasked with making a prediction by sequentially querying relevant information from the user, as opposed to a single‑turn conversation. This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. However, obtaining accurate estimates of mutual information or conditional entropy for LLMs is very difficult in practice due to over‑ or under‑confident LLM proba‑ bilities, which leads to suboptimal query selection and predictive performance. To better estimate the uncertainty at each iteration, we propose Conformal Information Pursuit (C‑IP), an alternative approach to sequential information gain based on conformal prediction sets. More specifically, C‑IP leverages a relationship between prediction sets and conditional entropy at each iteration to estimate uncertainty based on the average size of conformal prediction sets. In contrast to conditional entropy, we find that conformal prediction sets are a distribution‑free and robust method of measuring uncertainty. Experiments with 20 Questions show that C‑IP obtains better predictive performance and shorter query‑answer chains compared to previous approaches to IP and uncertainty‑based chain‑of‑thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C‑IP achieves competitive performance with direct single‑turn prediction while offering greater interpretability.

PaperID: 9765, https://arxiv.org/pdf/2507.03254.pdf

Authors: Bruce Yang, Xinfeng He, Huan Gao, Yifan Cao, Xiaofan Li, David Hsu

Title: CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs

Abstract:
Effective prompt design is essential for improving the planning capabilities of large language model (LLM)‑driven agents. However, existing structured prompting strategies are typically limited to single‑agent, plan‑only settings, and often evaluate performance solely based on task accuracy ‑ overlooking critical factors such as token efficiency, modularity, and scalability in multi‑agent environments. To address these limitations, we introduce CodeAgents, a prompting framework that codifies multi‑agent reasoning and enables structured, token‑efficient planning in multi‑agent systems. In CodeAgents, all components of agent interaction ‑ Task, Plan, Feedback, system roles, and external tool invocations ‑ are codified into modular pseudocode enriched with control structures (e.g., loops, conditionals), boolean logic, and typed variables. This design transforms loosely connected agent plans into cohesive, interpretable, and verifiable multi‑agent reasoning programs. We evaluate the proposed framework across three diverse benchmarks ‑ GAIA, HotpotQA, and VirtualHome ‑ using a range of representative LLMs. Results show consistent improvements in planning performance, with absolute gains of 3‑36 percentage points over natural language prompting baselines. On VirtualHome, our method achieves a new state‑of‑the‑art success rate of 56%. In addition, our approach reduces input and output token usage by 55‑87% and 41‑70%, respectively, underscoring the importance of token‑aware evaluation metrics in the development of scalable multi‑agent LLM systems. The code and resources are available at: https://anonymous.4open.science/r/CodifyingAgent‑5A86

PaperID: 9766, https://arxiv.org/pdf/2507.02925.pdf

Authors: Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani

Title: Large Language Model Agent for Modular Task Execution in Drug Discovery

Abstract:
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early‑stage computational drug discovery pipeline. By combining LLM reasoning with domain‑specific tools, the framework performs biomedical data retrieval, literature‑grounded question answering via retrieval‑augmented generation, molecular generation, multi‑property prediction, property‑aware molecular refinement, and 3D protein‑ligand structure generation. The agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 75 properties, including ADMET‑related and general physicochemical descriptors, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug‑likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz‑2 to generate 3D protein‑ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI‑assisted therapeutic discovery.

PaperID: 9767, https://arxiv.org/pdf/2507.02799.pdf

Authors: Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia

Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Abstract:
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi‑step reasoning tasks through mechanisms such as Chain‑of‑Thought (CoT) prompting or fine‑tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR‑Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state‑of‑the‑art RLMs across diverse sociocultural dimensions, using an LLM‑as‑a‑judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built‑in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine‑tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine‑tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning‑enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward‑shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias‑aware approaches to reasoning design.

PaperID: 9768, https://arxiv.org/pdf/2507.02773.pdf

Authors: Yuzhang Xie, Hejie Cui, Ziyang Zhang, Jiaying Lu, Kai Shu, Fadi Nahab, Xiao Hu, Carl Yang

Title: KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

Abstract:
Medical diagnosis prediction plays a critical role in disease detection and personalized healthcare. While machine learning (ML) models have been widely adopted for this task, their reliance on supervised training limits their ability to generalize to unseen cases, particularly given the high cost of acquiring large, labeled datasets. Large language models (LLMs) have shown promise in leveraging language abilities and biomedical knowledge for diagnosis prediction. However, they often suffer from hallucinations, lack structured medical reasoning, and produce useless outputs. To address these challenges, we propose KERAP, a knowledge graph (KG)‑enhanced reasoning approach that improves LLM‑based diagnosis prediction through a multi‑agent architecture. Our framework consists of a linkage agent for attribute mapping, a retrieval agent for structured knowledge extraction, and a prediction agent that iteratively refines diagnosis predictions. Experimental results demonstrate that KERAP enhances diagnostic reliability efficiently, offering a scalable and interpretable solution for zero‑shot medical diagnosis prediction.

PaperID: 9769, https://arxiv.org/pdf/2507.02755.pdf

Authors: Caleb Rascon, Luis Gato-Diaz, Eduardo García-Alarcón

Title: Multi-agent Auditory Scene Analysis

Abstract:
Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which, information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). A considerable amount of effort and computational complexity has been employed in the state‑of‑the‑art to develop techniques that are the least error‑prone possible. However, doing so gives rise to an ASA system that is non‑viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing‑aid design, search and rescue, human‑robot interaction, etc. To this effect, in this work, a multi‑agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors, such as: using the quality of the separation output to correct the location error; and using the classification result to reduce the localization's sensitivity towards interferences. The result is a multi‑agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a publicly available framework that uses open‑source tools for sound acquisition and reproduction (JACK) and inter‑agent communication (ROS2), allowing users to add their own agents.

PaperID: 9770, https://arxiv.org/pdf/2507.02726.pdf

Authors: Matthieu Zimmer, Xiaotong Ji, Rasul Tutunov, Anthony Bordg, Jun Wang, Haitham Bou Ammar

Title: Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

Abstract:
Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university‑level problems requiring complex, multi‑step reasoning. To address this, we introduce self‑generated goal‑conditioned MDPs (sG‑MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)‑like algorithms to solve the sG‑MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state‑of‑the‑art results with models at this scale.

PaperID: 9771, https://arxiv.org/pdf/2507.02699.pdf

Authors: Jiangrong Wu, Yuhong Nan, Jianliang Wu, Zitong Yao, Zibin Zheng

Title: Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents

Abstract:
The increasing capabilities of LLMs have led to the rapid proliferation of LLM agent apps, where developers enhance LLMs with access to external resources to support complex task execution. Among these, LLM email agent apps represent one of the widely used categories, as email remains a critical communication medium for users. LLM email agents are capable of managing and responding to email using LLM‑driven reasoning and autonomously executing user instructions via external email APIs (e.g., send email). However, despite their growing deployment and utility, the security mechanism of LLM email agent apps remains underexplored. Currently, there is no comprehensive study into the potential security risk within these agent apps and their broader implications. In this paper, we conduct the first in‑depth and systematic security study of LLM email agents. We propose the Email Agent Hijacking (EAH) attack, which overrides the original prompts of the email agent via external email resources, allowing attackers to gain control of the email agent remotely and further perform specific attack scenarios without user awareness. To facilitate the large‑scale evaluation, we propose EAHawk, a pipeline to evaluate the EAH attack of LLM email agent apps. By EAHawk, we performed an empirical study spanning 14 representative LLM agent frameworks, 63 agent apps, 12 LLMs, and 20 email services, which led to the generation of 1,404 real‑world email agent instances for evaluation. Experimental results indicate that all 1,404 instances were successfully hijacked; on average, only 2.03 attack attempts are required to control an email agent instance. Even worse, for some LLMs, the average number of attempts needed to achieve full agent control drops to as few as 1.23.

PaperID: 9772, https://arxiv.org/pdf/2507.02698.pdf

Authors: Thomas Hazenberg, Yao Ma, Seyed Sahand Mohammadi Ziabari, Marijn van Rijswijk

Title: Multi-Agent Reinforcement Learning for Dynamic Pricing in Supply Chains: Benchmarking Strategic Agent Behaviours under Realistically Simulated Market Conditions

Abstract:
This study investigates how Multi‑Agent Reinforcement Learning (MARL) can improve dynamic pricing strategies in supply chains, particularly in contexts where traditional ERP systems rely on static, rule‑based approaches that overlook strategic interactions among market actors. While recent research has applied reinforcement learning to pricing, most implementations remain single‑agent and fail to model the interdependent nature of real‑world supply chains. This study addresses that gap by evaluating the performance of three MARL algorithms: MADDPG, MADQN, and QMIX against static rule‑based baselines, within a simulated environment informed by real e‑commerce transaction data and a LightGBM demand prediction model. Results show that rule‑based agents achieve near‑perfect fairness (Jain's Index: 0.9896) and the highest price stability (volatility: 0.024), but they fully lack competitive dynamics. Among MARL agents, MADQN exhibits the most aggressive pricing behaviour, with the highest volatility and the lowest fairness (0.5844). MADDPG provides a more balanced approach, supporting market competition (share volatility: 9.5 pp) while maintaining relatively high fairness (0.8819) and stable pricing. These findings suggest that MARL introduces emergent strategic behaviour not captured by static pricing rules and may inform future developments in dynamic pricing.

PaperID: 9773, https://arxiv.org/pdf/2507.02675.pdf

Authors: Zhaoqilin Yang, Xin Wang, Ruichen Zhang, Chanchan Li, Youliang Tian

Title: TUC-PPO: Team Utility-Constrained Proximal Policy Optimization for Spatial Public Goods Games

Abstract:
We introduce Team Utility‑Constrained Proximal Policy Optimization (TUC‑PPO), a new deep reinforcement learning framework. It extends Proximal Policy Optimization (PPO) by integrating team welfare objectives specifically for spatial public goods games. Unlike conventional approaches where cooperation emerges indirectly from individual rewards, TUC‑PPO instead optimizes a bi‑level objective integrating policy gradients and team utility constraints. Consequently, all policy updates explicitly incorporate collective payoff thresholds. The framework preserves PPO's policy gradient core while incorporating constrained optimization through adaptive Lagrangian multipliers. Therefore, decentralized agents dynamically balance selfish and cooperative incentives. The comparative analysis demonstrates superior performance of this constrained deep reinforcement learning approach compared to unmodified PPO and evolutionary game theory baselines. It achieves faster convergence to cooperative equilibria and greater stability against invasion by defectors. The framework formally integrates team objectives into policy updates. This work advances multi‑agent deep reinforcement learning for social dilemmas while providing new computational tools for evolutionary game theory research.

PaperID: 9774, https://arxiv.org/pdf/2507.02626.pdf

Authors: Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

Title: VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

Abstract:
Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt‑based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent‑R1, a novel agent‑based paradigm that incorporates human‑like intelligence in user simulation. Specifically, VRAgent‑R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user‑item modeling. Firstly, the IP Agent emulates human‑like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher‑quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in‑depth chain‑of‑thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large‑scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent‑R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens‑100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state‑of‑the‑art baselines.

PaperID: 9775, https://arxiv.org/pdf/2507.02616.pdf

Authors: Tianqi Shang, Weiqing He, Charles Zheng, Lingyao Li, Li Shen, Bingxin Zhao

Title: DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

Abstract:
The rise of Large Language Models (LLMs) has enabled the development of specialized AI agents with domain‑specific reasoning and interaction capabilities, particularly in healthcare. While recent frameworks simulate medical decision‑making, they largely focus on single‑turn tasks where a doctor agent receives full case information upfront ‑‑ diverging from the real‑world diagnostic process, which is inherently uncertain, interactive, and iterative. In this paper, we introduce MIMIC‑Patient, a structured dataset built from the MIMIC‑III electronic health records (EHRs), designed to support dynamic, patient‑level simulations. Building on this, we propose DynamiCare, a novel dynamic multi‑agent framework that models clinical diagnosis as a multi‑round, interactive loop, where a team of specialist agents iteratively queries the patient system, integrates new information, and dynamically adapts its composition and strategy. We demonstrate the feasibility and effectiveness of DynamiCare through extensive experiments, establishing the first benchmark for dynamic clinical decision‑making with LLM‑powered agents.

PaperID: 9776, https://arxiv.org/pdf/2507.02443.pdf

Authors: Sandro Costa Magalhães, Marco Almeida, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias

Title: Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic

Abstract:
Robots usually slow down for canning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. AMD has developed the Vitis‑AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4‑bit quantisation, CNV with 2‑bit quantisation, and CNV with 1‑bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self‑acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98 % and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.

PaperID: 9777, https://arxiv.org/pdf/2507.02430.pdf

Authors: Maryem Fadili, Mohamed Anis Ghaoui, Louis Lecrosnier, Steve Pechberti, Redouane Khemmar

Title: A Late Collaborative Perception Framework for 3D Multi-Object and Multi-Source Association and Fusion

Abstract:
In autonomous driving, recent research has increasingly focused on collaborative perception based on deep learning to overcome the limitations of individual perception systems. Although these methods achieve high accuracy, they rely on high communication bandwidth and require unrestricted access to each agent's object detection model architecture and parameters. These constraints pose challenges real‑world autonomous driving scenarios, where communication limitations and the need to safeguard proprietary models hinder practical implementation. To address this issue, we introduce a novel late collaborative framework for 3D multi‑source and multi‑object fusion, which operates solely on shared 3D bounding box attributes‑category, size, position, and orientation‑without necessitating direct access to detection models. Our framework establishes a new state‑of‑the‑art in late fusion, achieving up to five times lower position error compared to existing methods. Additionally, it reduces scale error by a factor of 7.5 and orientation error by half, all while maintaining perfect 100% precision and recall when fusing detections from heterogeneous perception systems. These results highlight the effectiveness of our approach in addressing real‑world collaborative perception challenges, setting a new benchmark for efficient and scalable multi‑agent fusion.

PaperID: 9778, https://arxiv.org/pdf/2507.02424.pdf

Authors: Francesco Blefari, Cristian Cosentino, Francesco Aurelio Pironti, Angelo Furfaro, Fabrizio Marozzo

Title: CyberRAG: An Agentic RAG cyber attack classification and reporting tool

Abstract:
Intrusion Detection and Prevention Systems (IDS/IPS) in large enterprises can generate hundreds of thousands of alerts per hour, overwhelming analysts with logs requiring rapidly evolving expertise. Conventional machine‑learning detectors reduce alert volume but still yield many false positives, while standard Retrieval‑Augmented Generation (RAG) pipelines often retrieve irrelevant context and fail to justify predictions. We present CyberRAG, a modular agent‑based RAG framework that delivers real‑time classification, explanation, and structured reporting for cyber‑attacks. A central LLM agent orchestrates: (i) fine‑tuned classifiers specialized by attack family; (ii) tool adapters for enrichment and alerting; and (iii) an iterative retrieval‑and‑reason loop that queries a domain‑specific knowledge base until evidence is relevant and self‑consistent. Unlike traditional RAG, CyberRAG adopts an agentic design that enables dynamic control flow and adaptive reasoning. This architecture autonomously refines threat labels and natural‑language justifications, reducing false positives and enhancing interpretability. It is also extensible: new attack types can be supported by adding classifiers without retraining the core agent. CyberRAG was evaluated on SQL Injection, XSS, and SSTI, achieving over 94% accuracy per class and a final classification accuracy of 94.92% through semantic orchestration. Generated explanations reached 0.94 in BERTScore and 4.9/5 in GPT‑4‑based expert evaluation, with robustness preserved against adversarial and unseen payloads. These results show that agentic, specialist‑oriented RAG can combine high detection accuracy with trustworthy, SOC‑ready prose, offering a flexible path toward partially automated cyber‑defense workflows.

PaperID: 9779, https://arxiv.org/pdf/2507.02406.pdf

Authors: Caio Azevedo, Lina Achaji, Stefano Sabatini, Nicola Poerio, Grzegorz Bartyzel, Sascha Hornauer, Fabien Moutarde

Title: Improving Consistency in Vehicle Trajectory Prediction Through Preference Optimization

Abstract:
Trajectory prediction is an essential step in the pipeline of an autonomous vehicle. Inaccurate or inconsistent predictions regarding the movement of agents in its surroundings lead to poorly planned maneuvers and potentially dangerous situations for the end‑user. Current state‑of‑the‑art deep‑learning‑based trajectory prediction models can achieve excellent accuracy on public datasets. However, when used in more complex, interactive scenarios, they often fail to capture important interdependencies between agents, leading to inconsistent predictions among agents in the traffic scene. Inspired by the efficacy of incorporating human preference into large language models, this work fine‑tunes trajectory prediction models in multi‑agent settings using preference optimization. By taking as input automatically calculated preference rankings among predicted futures in the fine‑tuning process, our experiments‑‑using state‑of‑the‑art models on three separate datasets‑‑show that we are able to significantly improve scene consistency while minimally sacrificing trajectory prediction accuracy and without adding any excess computational requirements at inference time.

PaperID: 9780, https://arxiv.org/pdf/2507.02353.pdf

Authors: Bowen Chen, Zhao Wang, Shingo Takamatsu

Title: OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

Abstract:
Keyword decision in Sponsored Search Advertising is critical to the success of ad campaigns. While LLM‑based methods offer automated keyword generation, they face three major limitations: reliance on large‑scale query‑keyword pair data, lack of online multi‑objective performance monitoring and optimization, and weak quality control in keyword selection. These issues hinder the agentic use of LLMs in fully automating keyword decisions by monitoring and reasoning over key performance indicators such as impressions, clicks, conversions, and CTA effectiveness. To overcome these challenges, we propose OMS, a keyword generation framework that is On‑the‑fly (requires no training data, monitors online performance, and adapts accordingly), Multi‑objective (employs agentic reasoning to optimize keywords based on multiple performance metrics), and Self‑reflective (agentically evaluates keyword quality). Experiments on benchmarks and real‑world ad campaigns show that OMS outperforms existing methods; ablation and human evaluations confirm the effectiveness of each component and the quality of generated keywords.

PaperID: 9781, https://arxiv.org/pdf/2507.02211.pdf

Authors: Gustavo C. Mangold, Heitor C. M. Fernandes, Mendeli H. Vainstein

Title: Dilution, Diffusion and Symbiosis in Spatial Prisoner's Dilemma with Reinforcement Learning

Abstract:
Recent studies in the spatial prisoner's dilemma games with reinforcement learning have shown that static agents can learn to cooperate through a diverse sort of mechanisms, including noise injection, different types of learning algorithms and neighbours' payoff knowledge. In this work, using an independent multi‑agent Q‑learning algorithm, we study the effects of dilution and mobility in the spatial version of the prisoner's dilemma. Within this setting, different possible actions for the algorithm are defined, connecting with previous results on the classical, non‑reinforcement learning spatial prisoner's dilemma, showcasing the versatility of the algorithm in modeling different game‑theoretical scenarios and the benchmarking potential of this approach. As a result, a range of effects is observed, including evidence that games with fixed update rules can be qualitatively equivalent to those with learned ones, as well as the emergence of a symbiotic mutualistic effect between populations that forms when multiple actions are defined.

PaperID: 9782, https://arxiv.org/pdf/2507.02182.pdf

Authors: Fangjian Lei, Jiawen Liu, Shayan Noei, Ying Zou, Derek Truong, William Alexander

Title: Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Abstract:
Common Business Oriented Language (COBOL) is a programming language used to develop business applications that are widely adopted by financial, business, and government agencies. Due to its age, complexity, and declining number of COBOL developers, maintaining COBOL codebases is becoming increasingly challenging. In particular, the lack of documentation makes it difficult for new developers to effectively understand and maintain COBOL systems. Existing research utilizes large language models (LLMs) to explain the functionality of code snippets. However, COBOL presents unique challenges due to its architectural and syntactical differences, which often cause its code to exceed the token window size of LLMs. In this work, we propose a multi‑agent approach that leverages two LLM‑based agents working collaboratively to generate explanations for functions, files, and the overall project. These agents incorporate together by utilizing contextual information from the codebase into the code explanation prompts. We evaluate the effectiveness of our approach using 14 open‑source, real‑world COBOL projects. Our results indicate that our approach performs significantly better than the baseline in function code explanation, with improvements of 12.67%, 18.59%, and 0.62% in terms of METEOR, chrF, and SentenceBERT scores, respectively. At the file level, our approach effectively explains both short and long COBOL files that exceed the token window size of LLMs and surpass the baseline by 4.21%, 10.72%, and 14.68% in explaining the purpose, functionality, and clarity of the generated explanation. At the project level, our approach generates explanations that convey the functionality and purpose of 82% of the selected projects.

PaperID: 9783, https://arxiv.org/pdf/2507.02170.pdf

Authors: Adam Kostka, Jarosław A. Chudziak

Title: Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System

Abstract:
This paper explores the integration of advanced Multi‑Agent Systems (MAS) techniques to develop a team of agents with enhanced logical reasoning, long‑term knowledge retention, and Theory of Mind (ToM) capabilities. By uniting these core components with optimized communication protocols, we create a novel framework called SynergyMAS, which fosters collaborative teamwork and superior problem‑solving skills. The system's effectiveness is demonstrated through a product development team case study, where our approach significantly enhances performance and adaptability. These findings highlight SynergyMAS's potential to tackle complex, real‑world challenges.

PaperID: 9784, https://arxiv.org/pdf/2507.02097.pdf

Authors: Reza Yousefi Maragheh, Yashar Deldjoo

Title: The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems

Abstract:
Large language models (LLMs) are rapidly evolving from passive engines of text generation into agentic entities that can plan, remember, invoke external tools, and co‑operate with one another. This perspective paper investigates how such LLM agents (and societies thereof) can transform the design space of recommender systems. We introduce a unified formalism that (i) models an individual agent as a tuple comprising its language core, tool set, and hierarchical memory, and (ii) captures a multi‑agent recommender as a triple of agents, shared environment, and communication protocol. Within this framework, we present four end‑to‑end use cases‑interactive party planning, synthetic user‑simulation for offline evaluation, multi‑modal furniture recommendation, and brand‑aligned explanation generation‑each illustrating a distinct capability unlocked by agentic orchestration. We then surface five cross‑cutting challenge families: protocol complexity, scalability, hallucination and error propagation, emergent misalignment (including covert collusion), and brand compliance. For each, we formalize the problem, review nascent mitigation strategies, and outline open research questions. The result is both a blueprint and an agenda: a blueprint that shows how memory‑augmented, tool‑using LLM agents can be composed into robust recommendation pipelines, and an agenda inviting the RecSys community to develop benchmarks, theoretical guarantees, and governance tools that keep pace with this new degree of autonomy. By unifying agentic abstractions with recommender objectives, the paper lays the groundwork for the next generation of personalized, trustworthy, and context‑rich recommendation services.

PaperID: 9785, https://arxiv.org/pdf/2507.02083.pdf

Authors: Haonan Duan, Stephen Zhewen Lu, Caitlin Fiona Harrigan, Nishkrit Desai, Jiarui Lu, Michał Koziarski, Leonardo Cotta, Chris J. Maddison

Title: Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Abstract:
Designing experiments and result interpretations are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet‑lab experimentation is prohibitively expensive: in expertise, time and equipment. We introduce SciGym, a first‑in‑class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open‑ended scientific discovery tasks. SciGym overcomes the challenge of wet‑lab costs by running a dry lab of biological systems. These models, encoded in Systems Biology Markup Language, are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.

PaperID: 9786, https://arxiv.org/pdf/2507.02029.pdf

Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Mengfei Du, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Junkai Zhao, Xiaojie Zhang, Shanyu Rong, Huaihai Lyu, Zhengliang Cai, Yankai Fu, Ning Chen, Bolun Zhang, Lingfeng Zhang, Shuyi Zhang, Dong Liu, Xi Feng, Songjing Wang, Xiaodan Liu, Yance Jiao, Mengsi Lyu, Zhuo Chen, Chenrui He, Yulong Ao, Xue Sun, Zheqi He, Jingshu Zheng, Xi Yang, Donghai Shi, Kunchang Xie, Bochao Zhang, Shaokai Nie, Chunlei Men, Yonghua Lin, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang

Title: RoboBrain 2.0 Technical Report

Abstract:
We introduce RoboBrain 2.0, our latest generation of embodied vision‑language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full‑scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open‑source and proprietary models. In particular, it supports key real‑world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision‑making (e.g., closed‑loop interaction, multi‑agent long‑horizon planning, and scene graph updating). This report details the model architecture, data construction, multi‑stage training strategies, infrastructure and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoint and benchmark are available at https://superrobobrain.github.io.

PaperID: 9787, https://arxiv.org/pdf/2507.02004.pdf

Authors: Ruofan Jin, Zaixi Zhang, Mengdi Wang, Le Cong

Title: STELLA: Self-Evolving LLM Agent for Biomedical Research

Abstract:
The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self‑evolving AI agent designed to overcome these limitations. STELLA employs a multi‑agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state‑of‑the‑art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity's Last Exam: Biomedicine, 54% on LAB‑Bench: DBQA, and 63% on LAB‑Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity's Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.

PaperID: 9788, https://arxiv.org/pdf/2507.02002.pdf

Authors: Shaurya Mallampati, Rashed Shelim, Walid Saad, Naren Ramakrishnan

Title: Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models

Abstract:
Large language models (LLMs) demonstrate strong reasoning abilities across mathematical, strategic, and linguistic tasks, yet little is known about how well they reason in dynamic, real‑time, multi‑agent scenarios, such as collaborative environments in which agents continuously adapt to each other's behavior, as in cooperative gameplay settings. In this paper, we bridge this gap by combining LLM‑driven agents with strategic reasoning and real‑time adaptation in cooperative, multi‑agent environments grounded in game‑theoretic principles such as belief consistency and Nash equilibrium. The proposed framework applies broadly to dynamic scenarios in which agents coordinate, communicate, and make decisions in response to continuously changing conditions. We provide real‑time strategy refinement and adaptive feedback mechanisms that enable agents to dynamically adjust policies based on immediate contextual interactions, in contrast to previous efforts that evaluate LLM capabilities in static or turn‑based settings. Empirical results show that our method achieves up to a 26% improvement in return over PPO baselines in high‑noise environments, while maintaining real‑time latency under 1.05 milliseconds. Our approach improves collaboration efficiency, task completion rates, and flexibility, illustrating that game‑theoretic guidance integrated with real‑time feedback enhances LLM performance, ultimately fostering more resilient and flexible strategic multi‑agent systems.

PaperID: 9789, https://arxiv.org/pdf/2507.01923.pdf

Authors: Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen

Title: Decision-Oriented Text Evaluation

Abstract:
Natural language generation (NLG) is increasingly deployed in high‑stakes domains, yet common intrinsic evaluation methods, such as n‑gram overlap or sentence plausibility, weakly correlate with actual decision‑making efficacy. We propose a decision‑oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts‑‑including objective morning summaries and subjective closing‑bell analyses‑‑as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human‑LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision‑making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.

PaperID: 9790, https://arxiv.org/pdf/2507.01701.pdf

Authors: Bochen Han, Songmao Zhang

Title: Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

Abstract:
In this paper, we propose to incorporate the blackboard architecture into LLM multi‑agent systems (MASs) so that (1) agents with various roles can share all the information and others' messages during the whole problem‑solving process, (2) agents that will take actions are selected based on the current content of the blackboard, and (3) the selection and execution round is repeated until a consensus is reached on the blackboard. We develop the first implementation of this proposal and conduct experiments on commonsense knowledge, reasoning and mathematical datasets. The results show that our system can be competitive with the SOTA static and dynamic MASs by achieving the best average performance, and at the same time manage to spend less tokens. Our proposal has the potential to enable complex and dynamic problem‑solving where well‑defined structures or workflows are unavailable.

PaperID: 9791, https://arxiv.org/pdf/2507.01567.pdf

Authors: Patrick Benito Eberhard, Johannes Köhler, Oliver Hüsser, Melanie N. Zeilinger, Andrea Carron

Title: Time-Varying Coverage Control: A Distributed Tracker-Planner MPC Framework

Abstract:
Time‑varying coverage control addresses the challenge of coordinating multiple agents covering an environment where regions of interest change over time. This problem has broad applications, including the deployment of autonomous taxis and coordination in search and rescue operations. The achievement of effective coverage is complicated by the presence of time‑varying density functions, nonlinear agent dynamics, and stringent system and safety constraints. In this paper, we present a distributed multi‑agent control framework for time‑varying coverage under nonlinear constrained dynamics. Our approach integrates a reference trajectory planner and a tracking model predictive control (MPC) scheme, which operate at different frequencies within a multi‑rate framework. For periodic density functions, we demonstrate closed‑loop convergence to an optimal configuration of trajectories and provide formal guarantees regarding constraint satisfaction, collision avoidance, and recursive feasibility. Additionally, we propose an efficient algorithm capable of handling nonperiodic density functions, making the approach suitable for practical applications. Finally, we validate our method through hardware experiments using a fleet of four miniature race cars.

PaperID: 9792, https://arxiv.org/pdf/2507.01485.pdf

Authors: Yibo Qiu, Zan Huang, Zhiyu Wang, Handi Liu, Yiling Qiao, Yifeng Hu, Shu'ang Sun, Hangke Peng, Ronald X Xu, Mingzhai Sun

Title: BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

Abstract:
Large language models (LLMs) and vision‑language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi‑Agent Robotic System), an intelligent platform that integrates LLMs, VLMs, and modular robotics to autonomously design, plan, and execute biological experiments. BioMARS uses a hierarchical architecture: the Biologist Agent synthesizes protocols via retrieval‑augmented generation; the Technician Agent translates them into executable robotic pseudo‑code; and the Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection. The system autonomously conducts cell passaging and culture tasks, matching or exceeding manual performance in viability, consistency, and morphological integrity. It also supports context‑aware optimization, outperforming conventional strategies in differentiating retinal pigment epithelial cells. A web interface enables real‑time human‑AI collaboration, while a modular backend allows scalable integration with laboratory hardware. These results highlight the feasibility of generalizable, AI‑driven laboratory automation and the transformative role of language‑based reasoning in biological research.

PaperID: 9793, https://arxiv.org/pdf/2507.01446.pdf

Authors: Abd Elrahman Amer, Magdi Amer

Title: Using multi-agent architecture to mitigate the risk of LLM hallucinations

Abstract:
Improving customer service quality and response time are critical factors for maintaining customer loyalty and increasing a company's market share. While adopting emerging technologies such as Large Language Models (LLMs) is becoming a necessity to achieve these goals, the risk of hallucination remains a major challenge. In this paper, we present a multi‑agent system to handle customer requests sent via SMS. This system integrates LLM based agents with fuzzy logic to mitigate hallucination risks.

PaperID: 9794, https://arxiv.org/pdf/2507.01413.pdf

Authors: Kushal Agrawal, Verona Teo, Juan J. Vazquez, Sudarsh Kunnavakkam, Vishak Srikanth, Andy Liu

Title: Evaluating LLM Agent Collusion in Double Auctions

Abstract:
Large language models (LLMs) have demonstrated impressive capabilities as autonomous agents with rapidly expanding applications in various domains. As these agents increasingly engage in socioeconomic interactions, identifying their potential for undesirable behavior becomes essential. In this work, we examine scenarios where they can choose to collude, defined as secretive cooperation that harms another party. To systematically study this, we investigate the behavior of LLM agents acting as sellers in simulated continuous double auction markets. Through a series of controlled experiments, we analyze how parameters such as the ability to communicate, choice of model, and presence of environmental pressures affect the stability and emergence of seller collusion. We find that direct seller communication increases collusive tendencies, the propensity to collude varies across models, and environmental pressures, such as oversight and urgency from authority figures, influence collusive behavior. Our findings highlight important economic and ethical considerations for the deployment of LLM‑based market agents.

PaperID: 9795, https://arxiv.org/pdf/2507.01378.pdf

Authors: Ziyao Wang, Rongpeng Li, Sizhao Li, Yuming Xiang, Haiping Wang, Zhifeng Zhao, Honggang Zhang

Title: RALLY: Role-Adaptive LLM-Driven Yoked Navigation for Agentic UAV Swarms

Abstract:
Intelligent control of Unmanned Aerial Vehicles (UAVs) swarms has emerged as a critical research focus, and it typically requires the swarm to navigate effectively while avoiding obstacles and achieving continuous coverage over multiple mission targets. Although traditional Multi‑Agent Reinforcement Learning (MARL) approaches offer dynamic adaptability, they are hindered by the semantic gap in numerical communication and the rigidity of homogeneous role structures, resulting in poor generalization and limited task scalability. Recent advances in Large Language Model (LLM)‑based control frameworks demonstrate strong semantic reasoning capabilities by leveraging extensive prior knowledge. However, due to the lack of online learning and over‑reliance on static priors, these works often struggle with effective exploration, leading to reduced individual potential and overall system performance. To address these limitations, we propose a Role‑Adaptive LLM‑Driven Yoked navigation algorithm RALLY. Specifically, we first develop an LLM‑driven semantic decision framework that uses structured natural language for efficient semantic communication and collaborative reasoning. Afterward, we introduce a dynamic role‑heterogeneity mechanism for adaptive role switching and personalized decision‑making. Furthermore, we propose a Role‑value Mixing Network (RMIX)‑based assignment strategy that integrates LLM offline priors with MARL online policies to enable semi‑offline training of role selection strategies. Experiments in the Multi‑Agent Particle Environment (MPE) environment and a Software‑In‑The‑Loop (SITL) platform demonstrate that RALLY outperforms conventional approaches in terms of task coverage, convergence speed, and generalization, highlighting its strong potential for collaborative navigation in agentic multi‑UAV systems.

PaperID: 9796, https://arxiv.org/pdf/2507.01376.pdf

Authors: Yinwang Ren, Yangyang Liu, Tang Ji, Xun Xu

Title: AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing

Abstract:
AI agents are autonomous systems designed to perceive, reason, and act within dynamic environments. With the rapid advancements in generative AI (GenAI), large language models (LLMs) and multimodal large language models (MLLMs) have significantly improved AI agents' capabilities in semantic comprehension, complex reasoning, and autonomous decision‑making. At the same time, the rise of Agentic AI highlights adaptability and goal‑directed autonomy in dynamic and complex environments. LLMs‑based AI Agents (LLM‑Agents), MLLMs‑based AI Agents (MLLM‑Agents), and Agentic AI contribute to expanding AI's capabilities in information processing, environmental perception, and autonomous decision‑making, opening new avenues for smart manufacturing. However, the definitions, capability boundaries, and practical applications of these emerging AI paradigms in smart manufacturing remain unclear. To address this gap, this study systematically reviews the evolution of AI and AI agent technologies, examines the core concepts and technological advancements of LLM‑Agents, MLLM‑Agents, and Agentic AI, and explores their potential applications in and integration into manufacturing, along with the potential challenges they may face.

PaperID: 9797, https://arxiv.org/pdf/2507.01255.pdf

Authors: Xiao Liu, Jiawei Zhang

Title: AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

Abstract:
The rapid advancement of AI‑generated video models has created a pressing need for robust and interpretable evaluation frameworks. Existing metrics are limited to producing numerical scores without explanatory comments, resulting in low interpretability and human evaluation alignment. To address those challenges, we introduce AIGVE‑MACS, a unified model for AI‑Generated Video Evaluation(AIGVE), which can provide not only numerical scores but also multi‑aspect language comment feedback in evaluating these generated videos. Central to our approach is AIGVE‑BENCH 2, a large‑scale benchmark comprising 2,500 AI‑generated videos and 22,500 human‑annotated detailed comments and numerical scores across nine critical evaluation aspects. Leveraging AIGVE‑BENCH 2, AIGVE‑MACS incorporates recent Vision‑Language Models with a novel token‑wise weighted loss and a dynamic frame sampling strategy to better align with human evaluators. Comprehensive experiments across supervised and zero‑shot benchmarks demonstrate that AIGVE‑MACS achieves state‑of‑the‑art performance in both scoring correlation and comment quality, significantly outperforming prior baselines including GPT‑4o and VideoScore. In addition, we further showcase a multi‑agent refinement framework where feedback from AIGVE‑MACS drives iterative improvements in video generation, leading to 53.5% quality enhancement. This work establishes a new paradigm for comprehensive, human‑aligned evaluation of AI‑generated videos. We release the AIGVE‑BENCH 2 and AIGVE‑MACS at https://huggingface.co/xiaoliux/AIGVE‑MACS.

PaperID: 9798, https://arxiv.org/pdf/2507.01069.pdf

Authors: Nishant A. Parikh

Title: Agentic AI in Product Management: A Co-Evolutionary Model

Abstract:
This study explores agentic AI's transformative role in product management, proposing a conceptual co‑evolutionary framework to guide its integration across the product lifecycle. Agentic AI, characterized by autonomy, goal‑driven behavior, and multi‑agent collaboration, redefines product managers (PMs) as orchestrators of socio‑technical ecosystems. Using systems theory, co‑evolutionary theory, and human‑AI interaction theory, the framework maps agentic AI capabilities in discovery, scoping, business case development, development, testing, and launch. An integrative review of 70+ sources, including case studies from leading tech firms, highlights PMs' evolving roles in AI orchestration, supervision, and strategic alignment. Findings emphasize mutual adaptation between PMs and AI, requiring skills in AI literacy, governance, and systems thinking. Addressing gaps in traditional frameworks, this study provides a foundation for future research and practical implementation to ensure responsible, effective agentic AI integration in software organizations.

PaperID: 9799, https://arxiv.org/pdf/2507.01061.pdf

Authors: Jingjing Qu, Kejia Hu, Jun Zhu, Yulei Ye, Wenhao Li, Teng Wang, Zhiyun Chen, Chaochao Lu, Aimin Zhou, Xiangfeng Wang, Xia Hu, James Evans

Title: Epitome: Pioneering an Experimental Platform for AI-Social Science Integration

Abstract:
Large Language Models (LLMs) enable unprecedented social science experimentation by creating controlled hybrid human‑AI environments. We introduce Epitome (www.epitome‑ai.com), an open experimental platform that operationalizes this paradigm through Matrix‑like social worlds where researchers can study isolated human subjects and groups interacting with LLM agents. This maintains ecological validity while enabling precise manipulation of social dynamics. Epitome approaches three frontiers: (1) methodological innovation using LLM confederates to reduce complexity while scaling interactions; (2) empirical investigation of human behavior in AI‑saturated environments; and (3) exploration of emergent properties in hybrid collectives. Drawing on interdisciplinary foundations from management, communication, sociology, psychology, and ethics, the platform's modular architecture spans foundation model deployment through data collection. We validate Epitome through replication of three seminal experiments, demonstrating capacity to generate robust findings while reducing experimental complexity. This tool provides crucial insights for understanding how humans navigate AI‑mediated social realities, knowledge essential for policy, education, and human‑centered AI design.

PaperID: 9800, https://arxiv.org/pdf/2507.01059.pdf

Authors: Xiangbo Gao, Keshu Wu, Hao Zhang, Kexin Tian, Yang Zhou, Zhengzhong Tu

Title: Automated Vehicles Should be Connected with Natural Language

Abstract:
Multi‑agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media ‑‑ including raw sensor data, neural network features, and perception results ‑‑ suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have largely ignored decision‑level fusion, neglecting critical dimensions of collaborative driving. In this paper we argue that addressing these challenges requires a transition from purely perception‑oriented data exchanges to explicit intent and reasoning communication using natural language. Natural language balances semantic density and communication bandwidth, adapts flexibly to real‑time conditions, and bridges heterogeneous agent platforms. By enabling the direct communication of intentions, rationales, and decisions, it transforms collaborative driving from reactive perception‑data sharing into proactive coordination, advancing safety, efficiency, and transparency in intelligent transportation systems.

PaperID: 9801, https://arxiv.org/pdf/2507.01019.pdf

Authors: Imran Mirza, Cole Huang, Ishwara Vasista, Rohan Patil, Asli Akalin, Sean O'Brien, Kevin Zhu

Title: MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered

Abstract:
Multi‑agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona‑based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM‑based multi‑agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM‑based multi‑agent systems through scenario‑based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM‑based multi‑agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM‑generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi‑agent systems.

PaperID: 9802, https://arxiv.org/pdf/2507.00951.pdf

Authors: Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Seyedali Mirjalili

Title: Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

Abstract:
Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT‑4.5, DeepSeek, Claude 3.5 Sonnet, Phi‑4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token‑level prediction and lack of grounded agency. This paper offers a cross‑disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent‑based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi‑agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test‑time adaptation, and training‑free methods, as critical pathways toward flexible, domain‑agnostic intelligence. Vision‑Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self‑improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal‑directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

PaperID: 9803, https://arxiv.org/pdf/2507.00875.pdf

Authors: Xi Xuan, Chunyu Kit

Title: TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Abstract:
Hong Kong case law translation presents significant challenges: manual methods suffer from high costs and inconsistent quality, while both traditional machine translation and approaches relying solely on Large Language Models (LLMs) often fail to ensure legal terminology accuracy, culturally embedded nuances, and strict linguistic structures. To overcome these limitations, this study proposes TransLaw, a multi‑agent framework that decomposes translation into word‑level expression, sentence‑level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval‑Augmented Generation (RAG), and iterative feedback. Experiments on our newly constructed HKCFA Judgment 97‑22 dataset, benchmarking 13 open‑source and commercial LLMs, demonstrate that TransLaw significantly outperforms single‑agent baselines across all evaluated models. Human evaluation confirms the framework's effectiveness in terms of legal semantic accuracy, structural coherence, and stylistic fidelity, while noting that it still trails human experts in contextualizing complex terminology and stylistic naturalness.

PaperID: 9804, https://arxiv.org/pdf/2507.00856.pdf

Authors: Beining Wu, Jun Huang, Qiang Duan, Liang Dong, Zhipeng Cai

Title: Enhancing Vehicular Platooning with Wireless Federated Learning: A Resource-Aware Control Framework

Abstract:
This paper aims to enhance the performance of Vehicular Platooning (VP) systems integrated with Wireless Federated Learning (WFL). In highly dynamic environments, vehicular platoons experience frequent communication changes and resource constraints, which significantly affect information exchange and learning model synchronization. To address these challenges, we first formulate WFL in VP as a joint optimization problem that simultaneously considers Age of Information (AoI) and Federated Learning Model Drift (FLMD) to ensure timely and accurate control. Through theoretical analysis, we examine the impact of FLMD on convergence performance and develop a two‑stage Resource‑Aware Control framework (RACE). The first stage employs a Lagrangian dual decomposition method for resource configuration, while the second stage implements a multi‑agent deep reinforcement learning approach for vehicle selection. The approach integrates Multi‑Head Self‑Attention and Long Short‑Term Memory networks to capture spatiotemporal correlations in communication states. Experimental results demonstrate that, compared to baseline methods, the proposed framework improves AoI optimization by up to 45%, accelerates learning convergence, and adapts more effectively to dynamic VP environments on the AI4MARS dataset.

PaperID: 9805, https://arxiv.org/pdf/2507.00841.pdf

Authors: Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao

Title: SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Abstract:
With the wide application of multimodal foundation models in intelligent agent systems, scenarios such as mobile device control, intelligent assistant interaction, and multimodal task execution are gradually relying on such large model‑driven agents. However, the related systems are also increasingly exposed to potential jailbreak risks. Attackers may induce the agents to bypass the original behavioral constraints through specific inputs, and then trigger certain risky and sensitive operations, such as modifying settings, executing unauthorized commands, or impersonating user identities, which brings new challenges to system security. Existing security measures for intelligent agents still have limitations when facing complex interactions, especially in detecting potentially risky behaviors across multiple rounds of conversations or sequences of tasks. In addition, an efficient and consistent automated methodology to assist in assessing and determining the impact of such risks is currently lacking. This work explores the security issues surrounding mobile multimodal agents, attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and designs an automated assisted assessment scheme based on a large language model. Through preliminary validation in several representative high‑risk tasks, the results show that the method can improve the recognition of risky behaviors to some extent and assist in reducing the probability of agents being jailbroken. We hope that this study can provide some valuable references for the security risk modeling and protection of multimodal intelligent agent systems.

PaperID: 9806, https://arxiv.org/pdf/2507.00657.pdf

Authors: Jacopo Nudo, Mario Edoardo Pandolfo, Edoardo Loru, Mattia Samory, Matteo Cinelli, Walter Quattrociocchi

Title: Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

Abstract:
We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one‑to‑one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generation exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users, they reconstruct them. Their outputs, indeed, reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.

PaperID: 9807, https://arxiv.org/pdf/2507.00642.pdf

Authors: Runkai Li, Jia Xiong, Xiuyuan He, Jieru Zhao, Jiaqi Lv, Haowen Fang, Lei Qi, Xi Wang

Title: ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

Abstract:
High‑Level Synthesis (HLS) improves IC development productivity by enabling hardware design from C‑like languages. However, strict coding constraints and design‑specific optimizations limit its widespread adoption. While recent efforts employ large language models (LLMs) to assist HLS design, they often struggle with synthesizability rules and directive semantics. To this end, we introduce ChatHLS, a multi‑agent HLS design framework that leverages specialized LLMs for automated debugging and directive tuning. ChatHLS incorporates an adaptive error case expansion mechanism, combined with a reasoning‑to‑instruction analysis method to accurately diagnose HLS errors. To optimize hardware performance, it enables QoR‑aware reasoning to learn the impact of HLS directives on the quality of results (QoR). Experimental results demonstrate that ChatHLS outperforms Gemini‑3‑pro with a 32.6% relative improvement in debugging, while achieving significant speedups across various HLS kernels and neural network accelerators. These results underscore the potential of ChatHLS for agile hardware development.

PaperID: 9808, https://arxiv.org/pdf/2507.00550.pdf

Authors: Bruce Fang, Danyi Gao

Title: Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling

Abstract:
This paper addresses the challenges of rapid resource variation and highly uncertain task loads in cloud computing environments. It proposes an optimization method for elastic cloud resource scaling based on a multi‑agent system. The method deploys multiple autonomous agents to perceive resource states in parallel and make local decisions. While maintaining the distributed nature of the system, it introduces a collaborative value function to achieve global coordination. This improves the responsiveness of resource scheduling and enhances overall system performance. To strengthen system foresight, a lightweight state prediction model is designed. It assists agents in identifying future workload trends and optimizes the selection of scaling actions. For policy training, the method adopts a centralized training and decentralized execution reinforcement learning framework. This enables agents to learn effectively and coordinate strategies under conditions of incomplete information. The paper also constructs typical cloud scenarios, including multi‑tenancy and burst traffic, to evaluate the proposed method. The evaluation focuses on resource isolation, service quality assurance, and robustness. Experimental results show that the proposed multi‑agent scaling strategy outperforms existing methods in resource utilization, SLA violation control, and scheduling latency. The results demonstrate strong adaptability and intelligent regulation. This provides an efficient and reliable new approach to solving the problem of elastic resource scaling in complex cloud platforms.

PaperID: 9809, https://arxiv.org/pdf/2507.00451.pdf

Authors: Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers

Title: Best Agent Identification for General Game Playing

Abstract:
We present an efficient and generalised procedure to accurately identify the best (or near best) performing algorithm for each sub‑task in a multi‑problem domain. Our approach treats this as a set of best arm identification problems for multi‑armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on a chosen confidence interval, that ranks each arm across all bandits in terms of their potential to influence our overall simple regret. We evaluate the performance of our approach on two of the most popular general game playing domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of selecting a high‑performing agent for each game using a limited number of available trials. Compared to previous best arm identification algorithms for multi‑armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret and average probability of error. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi‑task domains with high algorithm runtimes.

PaperID: 9810, https://arxiv.org/pdf/2507.00443.pdf

Authors: Reza Ahmadvand, Sarah Safura Sharif, Yaser Mike Banad

Title: Novel Pigeon-inspired 3D Obstacle Detection and Avoidance Maneuver for Multi-UAV Systems

Abstract:
Recent advances in multi‑agent systems manipulation have demonstrated a rising demand for the implementation of multi‑UAV systems in urban areas, which are always subjected to the presence of static and dynamic obstacles. Inspired by the collective behavior of tilapia fish and pigeons, the focus of the presented research is on the introduction of a nature‑inspired collision‑free formation control for a multi‑UAV system, considering the obstacle avoidance maneuvers. The developed framework in this study utilizes a semi‑distributed control approach, in which, based on a probabilistic Lloyd's algorithm, a centralized guidance algorithm works for optimal positioning of the UAVs, while a distributed control approach has been used for the intervehicle collision and obstacle avoidance. Further, the presented framework has been extended to the 3D space with a novel definition of 3D maneuvers. Finally, the presented framework has been applied to multi‑UAV systems in 2D and 3D scenarios, and the obtained results demonstrated the validity of the presented method in dynamic environments with stationary and moving obstacles.

PaperID: 9811, https://arxiv.org/pdf/2507.00424.pdf

Authors: Marcos M. Vasconcelos, Behrouz Touri

Title: Multi-Agent Coordination under Poisson Observations: A Global Game Approach

Abstract:
We study a model of strategic coordination based on a class of games with incomplete information known as Global Games. Under the assumption of Poisson‑distributed signals and a Gamma prior distribution on state of the system, we demonstrate the existence of a Bayesian Nash equilibrium within the class of threshold policies for utility functions that are linear in the agents' actions. Although computing the exact threshold that constitutes an equilibrium in a system with finitely many agents is a highly non‑trivial task, the problem becomes tractable by analyzing the game's potential function with countably infinitely many agents. Through numerical examples, we provide evidence that the resulting potential function is unimodal, exhibiting a well‑defined maximum. Our results are applicable to the modeling of bacterial Quorum Sensing systems, whose noisy observation signals are often well‑approximated using Poisson processes.

PaperID: 9812, https://arxiv.org/pdf/2507.00422.pdf

Authors: Ziyan Zeng, Minyu Feng, Attila Szolnoki

Title: Evolutionary Dynamics with Self-Interaction Learning in Networked Systems

Abstract:
The evolution of cooperation in networked systems helps to understand the dynamics in social networks, multi‑agent systems, and biological species. The self‑persistence of individual strategies is common in real‑world decision making. The self‑replacement of strategies in evolutionary dynamics forms a selection amplifier, allows an agent to insist on its autologous strategy, and helps the networked system to avoid full defection. In this paper, we study the self‑interaction learning in the networked evolutionary dynamics. We propose a self‑interaction landscape to capture the strength of an agent's self‑loop to reproduce the strategy based on local topology. We find that proper self‑interaction can reduce the condition for cooperation and help cooperators to prevail in the system. For a system that favors the evolution of spite, the self‑interaction can save cooperative agents from being harmed. Our results on random networks further suggest that an appropriate self‑interaction landscape can significantly reduce the critical condition for advantageous mutants, especially for large‑degree networks.

PaperID: 9813, https://arxiv.org/pdf/2507.00268.pdf

Authors: Oren Fivel, Matan Rudman, Kobi Cohen

Title: Control-Optimized Deep Reinforcement Learning for Artificially Intelligent Autonomous Systems

Abstract:
Deep reinforcement learning (DRL) has become a powerful tool for complex decision‑making in machine learning and AI. However, traditional methods often assume perfect action execution, overlooking the uncertainties and deviations between an agent's selected actions and the actual system response. In real‑world applications, such as robotics, mechatronics, and communication networks, execution mismatches arising from system dynamics, hardware constraints, and latency can significantly degrade performance. This work advances AI by developing a novel control‑optimized DRL framework that explicitly models and compensates for action execution mismatches, a challenge largely overlooked in existing methods. Our approach establishes a structured two‑stage process: determining the desired action and selecting the appropriate control signal to ensure proper execution. It trains the agent while accounting for action mismatches and controller corrections. By incorporating these factors into the training process, the AI agent optimizes the desired action with respect to both the actual control signal and the intended outcome, explicitly considering execution errors. This approach enhances robustness, ensuring that decision‑making remains effective under real‑world uncertainties. Our approach offers a substantial advancement for engineering practice by bridging the gap between idealized learning and real‑world implementation. It equips intelligent agents operating in engineering environments with the ability to anticipate and adjust for actuation errors and system disturbances during training. We evaluate the framework in five widely used open‑source mechanical simulation environments we restructured and developed to reflect real‑world operating conditions, showcasing its robustness against uncertainties and offering a highly practical and efficient solution for control‑oriented applications.

PaperID: 9814, https://arxiv.org/pdf/2507.00096.pdf

Authors: Ailiya Borjigin, Wei Zhou, Cong He

Title: AI-Governed Agent Architecture for Web-Trustworthy Tokenization of Alternative Assets

Abstract:
Alternative Assets tokenization is transforming non‑traditional financial instruments are represented and traded on the web. However, ensuring trustworthiness in web‑based tokenized ecosystems poses significant challenges, from verifying off‑chain asset data to enforcing regulatory compliance. This paper proposes an AI‑governed agent architecture that integrates intelligent agents with blockchain to achieve web‑trustworthy tokenization of alternative assets. In the proposed architecture, autonomous agents orchestrate the tokenization process (asset verification, valuation, compliance checking, and lifecycle management), while an AI‑driven governance layer monitors agent behavior and enforces trust through adaptive policies and cryptoeconomic incentives. We demonstrate that this approach enhances transparency, security, and compliance in asset tokenization, addressing key concerns around data authenticity and fraud. A case study on tokenizing real estate assets illustrates how the architecture mitigates risks (e.g., fraudulent listings and money laundering) through real‑time AI anomaly detection and on‑chain enforcement. Our evaluation and analysis suggest that combining AI governance with multi‑agent systems and blockchain can significantly bolster trust in tokenized asset ecosystems. This work offers a novel framework for trustworthy asset tokenization on the web and provides insights for practitioners aiming to deploy secure, compliant tokenization platforms.

PaperID: 9815, https://arxiv.org/pdf/2507.00081.pdf

Authors: Matthew Muhoberac, Atharva Parikh, Nirvi Vakharia, Saniya Virani, Aco Radujevic, Savannah Wood, Meghav Verma, Dimitri Metaxotos, Jeyaraman Soundararajan, Thierry Masquelin, Alexander G. Godfrey, Sean Gardner, Dobrila Rudnicki, Sam Michael, Gaurav Chopra

Title: State and Memory is All You Need for Robust and Reliable AI Agents

Abstract:
Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real‑world scientific workflows remain limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM‑based agents to autonomously plan, reason, and achieve robust and reliable domain‑specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite‑state automata (FSA) memory, enabling persistent state tracking and context‑aware decision‑making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications via maintaining context across extended workflows and to recover from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers for executing user‑specified reactions, with context‑aware decision making and demonstrate its use in autonomous multi‑step bioassay retrieval from the PubChem database utilizing multi‑step planning, reasoning, agent‑to‑agent communication and coordination for execution of exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.

PaperID: 9816, https://arxiv.org/pdf/2506.24045.pdf

Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo

Title: Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Abstract:
Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long‑lived, stateful LLM flows that interleave prefill and token‑by‑token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on‑device intelligence, existing LLM engines assume static, single‑shot inference and lack mechanisms for flow‑level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed‑criticality execution patterns of personal agents. This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers unique SoC characteristics of operator‑accelerator affinity, asymmetric DDR contention, and stage‑divergent batching behaviors distinct from cloud‑serving assumptions. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) capturing NPU/iGPU affinity and elastic operator binding; flow‑aware NPU‑iGPU coordination with stage elasticity, decoupling prefill and decode to reduce bandwidth contention and enforce priorities; and fine‑grained preemption with slack‑aware piggybacking to guarantee reactive responsiveness without starving proactive work. Across realistic personal‑agent workloads, Agent.xpu delivers 1.2‑4.9× proactive throughput and reduces reactive latency by at least 91%, compared with both industrial iGPU‑only serving engine and NPU‑iGPU static inference with optimal tensor‑partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.

PaperID: 9817, https://arxiv.org/pdf/2506.23998.pdf

Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding

Title: Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Abstract:
Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor‑intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end‑to‑end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi‑agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient‑centered analysis of large qualitative datasets and allows LLMs to be fine‑tuned for specific clinical contexts.

PaperID: 9818, https://arxiv.org/pdf/2506.23978.pdf

Authors: Samuele Marro, Philip Torr

Title: LLM Agents Are the Antidote to Walled Gardens

Abstract:
While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock‑in. We argue that LLM‑based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI‑mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks, technical debt, and legal frictions. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.

PaperID: 9819, https://arxiv.org/pdf/2506.23888.pdf

Authors: André de Souza Loureiro, Jorge Valverde-Rebaza, Julieta Noguez, David Escarcega, Ricardo Marcacini

Title: Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

Abstract:
Recent advancements in Large Language Models (LLMs) have significantly improved their problem‑solving capabilities. However, these models still struggle when faced with complex multi‑step reasoning tasks. In this paper, we propose the Multi‑Layered Self‑Reflection with Auto‑Prompting (MAPS) framework, a novel approach designed to enhance multi‑step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self‑Reflection, and Auto‑Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self‑reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well‑established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning‑optimized models. In addition, MAPS enables general‑purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade‑off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.

PaperID: 9820, https://arxiv.org/pdf/2506.23793.pdf

Authors: Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

Title: Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning

Abstract:
Multi‑agent pathfinding (MAPF) is a common abstraction of multi‑robot trajectory planning problems, where multiple homogeneous robots simultaneously move in the shared environment. While solving MAPF optimally has been proven to be NP‑hard, scalable, and efficient, solvers are vital for real‑world applications like logistics, search‑and‑rescue, etc. To this end, decentralized suboptimal MAPF solvers that leverage machine learning have come on stage. Building on the success of the recently introduced MAPF‑GPT, a pure imitation learning solver, we introduce MAPF‑GPT‑DDG. This novel approach effectively fine‑tunes the pre‑trained MAPF model using centralized expert data. Leveraging a novel delta‑data generation mechanism, MAPF‑GPT‑DDG accelerates training while significantly improving performance at test time. Our experiments demonstrate that MAPF‑GPT‑DDG surpasses all existing learning‑based MAPF solvers, including the original MAPF‑GPT, regarding solution quality across many testing scenarios. Remarkably, it can work with MAPF instances involving up to 1 million agents in a single environment, setting a new milestone for scalability in MAPF domains.

PaperID: 9821, https://arxiv.org/pdf/2506.23774.pdf

Authors: Ewelina Gajewska, Michal Wawer, Katarzyna Budzynska, Jarosław A. Chudziak

Title: Leveraging a Multi-Agent LLM-Based System to Educate Teachers in Hate Incidents Management

Abstract:
Computer‑aided teacher training is a state‑of‑the‑art method designed to enhance teachers' professional skills effectively while minimising concerns related to costs, time constraints, and geographical limitations. We investigate the potential of large language models (LLMs) in teacher education, using a case of teaching hate incidents management in schools. To this end, we create a multi‑agent LLM‑based system that mimics realistic situations of hate, using a combination of retrieval‑augmented prompting and persona modelling. It is designed to identify and analyse hate speech patterns, predict potential escalation, and propose effective intervention strategies. By integrating persona modelling with agentic LLMs, we create contextually diverse simulations of hate incidents, mimicking real‑life situations. The system allows teachers to analyse and understand the dynamics of hate incidents in a safe and controlled environment, providing valuable insights and practical knowledge to manage such situations confidently in real life. Our pilot evaluation demonstrates teachers' enhanced understanding of the nature of annotator disagreements and the role of context in hate speech interpretation, leading to the development of more informed and effective strategies for addressing hate in classroom settings.

PaperID: 9822, https://arxiv.org/pdf/2506.23719.pdf

Authors: Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, Thomas Wolf

Title: DABstep: Data Agent Benchmark for Multi-step Reasoning

Abstract:
We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi‑step data analysis tasks. DABstep comprises over 450 real‑world challenges derived from a financial analytics platform, requiring models to combine code‑based data processing with contextual reasoning over heterogeneous documentation. Each task demands an iterative, multi‑step problem‑solving approach, testing capabilities in data manipulation, cross‑referencing multiple sources, and precise result reporting. The benchmark provides a factoid‑style answer format with automatic correctness checks for objective scoring at scale. We evaluate leading LLM‑based agents, revealing a substantial performance gap: even the best agent achieves only 14.55% accuracy on the hardest tasks. We detail our benchmark's design, dataset composition, task formulation, evaluation protocol, report baseline results and analyze failure modes. DABstep is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.

PaperID: 9823, https://arxiv.org/pdf/2506.23689.pdf

Authors: Zihao Liu, Xinhang Sui, Yueran Song, Siwen Wang

Title: PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent System for Pokemon Red

Abstract:
We introduce PokéAI, the first text‑based, multi‑agent large language model (LLM) framework designed to autonomously play and progress through Pokémon Red. Our system consists of three specialized agents‑Planning, Execution, and Critique‑each with its own memory bank, role, and skill set. The Planning Agent functions as the central brain, generating tasks to progress through the game. These tasks are then delegated to the Execution Agent, which carries them out within the game environment. Upon task completion, the Critique Agent evaluates the outcome to determine whether the objective was successfully achieved. Once verification is complete, control returns to the Planning Agent, forming a closed‑loop decision‑making system. As a preliminary step, we developed a battle module within the Execution Agent. Our results show that the battle AI achieves an average win rate of 80.8% across 50 wild encounters, only 6% lower than the performance of an experienced human player. Furthermore, we find that a model's battle performance correlates strongly with its LLM Arena score on language‑related tasks, indicating a meaningful link between linguistic ability and strategic reasoning. Finally, our analysis of gameplay logs reveals that each LLM exhibits a unique playstyle, suggesting that individual models develop distinct strategic behaviors.

PaperID: 9824, https://arxiv.org/pdf/2506.23610.pdf

Authors: Manuel Pratelli, Marinella Petrocchi

Title: Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Abstract:
Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low‑cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big‑Five profiles, to reproduce personality‑based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait‑misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality‑aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.

PaperID: 9825, https://arxiv.org/pdf/2506.23576.pdf

Authors: Maria Carolina Cornelia Wit, Jun Pang

Title: Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

Abstract:
Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi‑agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Reproducing the AutoDefense framework, we compare single‑agent setups with two‑ and three‑agent configurations. Our results show that multi‑agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, its effectiveness varies by attack type, and it introduces trade‑offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.

PaperID: 9826, https://arxiv.org/pdf/2506.23549.pdf

Authors: Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

Title: CooT: Learning to Coordinate In-Context with Coordination Transformers

Abstract:
Effective coordination among unfamiliar partners remains a major challenge in multi‑agent systems. Existing approaches, such as population‑based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond training distribution. Moreover, fine‑tuning is impractical in few‑shot settings due to its high interaction cost. To address these limitations, we propose CooT, a framework that leverages in‑context learning (ICL) for real‑time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior‑preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi‑agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population‑based methods, gradient‑based fine‑tuning, and Meta‑RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real‑world human‑AI collaboration.

PaperID: 9827, https://arxiv.org/pdf/2506.23408.pdf

Authors: Claudionor Coelho, Yanen Li, Philip Tee

Title: Do LLMs Dream of Discrete Algorithms?

Abstract:
Large Language Models (LLMs) have rapidly transformed the landscape of artificial intelligence, enabling natural language interfaces and dynamic orchestration of software components. However, their reliance on probabilistic inference limits their effectiveness in domains requiring strict logical reasoning, discrete decision‑making, and robust interpretability. This paper investigates these limitations and proposes a neurosymbolic approach that augments LLMs with logic‑based reasoning modules, particularly leveraging Prolog predicates and composable toolsets. By integrating first‑order logic and explicit rule systems, our framework enables LLMs to decompose complex queries into verifiable sub‑tasks, orchestrate reliable solutions, and mitigate common failure modes such as hallucination and incorrect step decomposition. We demonstrate the practical benefits of this hybrid architecture through experiments on the DABStep benchmark, showing improved precision, coverage, and system documentation in multi‑step reasoning tasks. Our results indicate that combining LLMs with modular logic reasoning restores engineering rigor, enhances system reliability, and offers a scalable path toward trustworthy, interpretable AI agents across complex domains.

PaperID: 9828, https://arxiv.org/pdf/2506.23273.pdf

Authors: Quang Hung Nguyen, Phuong Anh Trinh, Phan Quoc Hung Mai, Tuan Phong Trinh

Title: FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis

Abstract:
Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain‑specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financial statements. Tailored to local standards like VAS, it combines large and small language models in a multi‑agent setup for entity extraction, SQL generation, and self‑correction. We build a domain‑specific database and evaluate models on a synthetic QA dataset. A fine‑tuned 7B model achieves 61.33% accuracy with sub‑4‑second response times on consumer hardware, outperforming GPT‑4o‑mini. FinStat2SQL offers a scalable, cost‑efficient solution for financial analysis, making AI‑powered querying accessible to Vietnamese enterprises.

PaperID: 9829, https://arxiv.org/pdf/2506.23260.pdf

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, Merouane Debbah

Title: From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

Abstract:
Autonomous AI agents powered by large language models (LLMs) with structured function‑calling interfaces enable real‑time data retrieval, computation, and multi‑step orchestration. However, the rapid growth of plugins, connectors, and inter‑agent protocols has outpaced security practices, leading to brittle integrations that rely on ad‑hoc authentication, inconsistent schemas, and weak validation. This survey introduces a unified end‑to‑end threat model for LLM‑agent ecosystems, covering host‑to‑tool and agent‑to‑agent communications. We systematically categorize more than thirty attack techniques spanning input manipulation, model compromise, system and privacy attacks, and protocol‑level vulnerabilities. For each category, we provide a formal threat formulation defining attacker capabilities, objectives, and affected system layers. Representative examples include Prompt‑to‑SQL injections and the Toxic Agent Flow exploit in GitHub MCP servers. We analyze attack feasibility, review existing defenses, and discuss mitigation strategies such as dynamic trust management, cryptographic provenance tracking, and sandboxed agent interfaces. The framework is validated through expert review and cross‑mapping with real‑world incidents and public vulnerability repositories, including CVE and NIST NVD. Compared to prior surveys, this work presents the first integrated taxonomy bridging input‑level exploits and protocol‑layer vulnerabilities in LLM‑agent ecosystems, offering actionable guidance for designing secure and resilient agentic AI systems.

PaperID: 9830, https://arxiv.org/pdf/2506.23164.pdf

Authors: Maarten Hugenholtz, Anna Meszaros, Jens Kober, Zlatan Ajanovic

Title: Mode Collapse Happens: Evaluating Critical Interactions in Joint Trajectory Prediction Models

Abstract:
Autonomous Vehicle decisions rely on multimodal prediction models that account for multiple route options and the inherent uncertainty in human behavior. However, models can suffer from mode collapse, where only the most likely mode is predicted, posing significant safety risks. While existing methods employ various strategies to generate diverse predictions, they often overlook the diversity in interaction modes among agents. Additionally, traditional metrics for evaluating prediction models are dataset‑dependent and do not evaluate inter‑agent interactions quantitatively. To our knowledge, none of the existing metrics explicitly evaluates mode collapse. In this paper, we propose a novel evaluation framework that assesses mode collapse in joint trajectory predictions, focusing on safety‑critical interactions. We introduce metrics for mode collapse, mode correctness, and coverage, emphasizing the sequential dimension of predictions. By testing four multi‑agent trajectory prediction models, we demonstrate that mode collapse indeed happens. When looking at the sequential dimension, although prediction accuracy improves closer to interaction events, there are still cases where the models are unable to predict the correct interaction mode, even just before the interaction mode becomes inevitable. We hope that our framework can help researchers gain new insights and advance the development of more consistent and accurate prediction models, thus enhancing the safety of autonomous driving systems.

PaperID: 9831, https://arxiv.org/pdf/2506.23046.pdf

Authors: Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap

Title: SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Abstract:
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real‑world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text‑based scenarios, which have a significant gap compared to real interactions. We propose the SoMi‑ToM benchmark, designed to evaluate multi‑perspective ToM in embodied multi‑agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi‑level evaluation: (1) first‑person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first‑person perspective during a task for real‑time state inference, (2) third‑person evaluation provides complete third‑person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third‑person perspective videos, 363 first‑person perspective images, and 1225 expert‑annotated multiple‑choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state‑of‑the‑art large vision‑language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi‑ToM: the average accuracy gap between humans and models is 40.1% in first‑person evaluation and 26.4% in third‑person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

PaperID: 9832, https://arxiv.org/pdf/2506.22937.pdf

Authors: Tianrun Qiu, Changxin Chen, Sizhe Cheng, Xuyang Liu, Xumeng Wang, Zhicong Lu, Yuxin Ma

Title: GamerAstra: Supporting 2D Non-Twitch Video Games for Blind and Low-Vision Players through a Multi-Agent Framework

Abstract:
Blind and low‑vision (BLV) players face critical challenges in engaging with video games due to the inaccessibility of visual elements, difficulties navigating interfaces, and limitations in performing interaction. Meanwhile, the development of specialized accessibility features typically requires substantial programming effort and is often implemented on a game‑by‑game basis. To address these challenges, we introduce GamerAstra, a multi‑agent human‑AI collaboration framework that leverages a multi‑agent design to facilitate access to 2D non‑twitch video games for BLV players. It integrates vision‑language models and computer vision techniques, enabling interaction with games lacking native accessibility support. The framework also incorporates custom assistance granularities to support varying degrees of visual impairment and enhances interface navigation through multiple input modalities. Technical evaluations and user studies indicate that GamerAstra effectively enhances playability and provides a more immersive gaming experience for BLV players. These findings also underscore potential avenues for advancing intelligent accessibility frameworks in the gaming domain.

PaperID: 9833, https://arxiv.org/pdf/2506.22876.pdf

Authors: Shayak Nandi, Fernanda M. Eliott

Title: Cooperation as a Black Box: Conceptual Fluctuation and Diagnostic Tools for Misalignment in MAS

Abstract:
Misalignment in Multi‑Agent Systems (MAS) is frequently treated as a technical failure. Yet, issues may arise from the conceptual design phase, where semantic ambiguity and normative projection occur. The Rabbit‑Duck illusion illustrates how perspective‑dependent readings of agent behavior, such as the conflation of cooperation‑coordination, can create epistemic instability; e.g., coordinated agents in cooperative Multi‑Agent Reinforcement Learning (MARL) benchmarks being interpreted as morally aligned, despite being optimized for shared utility maximization only. Motivated by three drivers of meaning‑level misalignment in MAS (coordination‑cooperation ambiguity, conceptual fluctuation, and semantic instability), we introduce the Misalignment Mosaic: a framework for diagnosing how misalignment emerges through language, framing, and design assumptions. The Mosaic comprises four components: 1. Terminological Inconsistency, 2. Interpretive Ambiguity, 3. Concept‑to‑Code Decay, and 4. Morality as Cooperation. Building on insights from the Morality‑as‑Cooperation Theory, we call for consistent meaning‑level grounding in MAS to ensure systems function as intended: technically and ethically. This need is particularly urgent as MAS principles influence broader Artificial Intelligence (AI) workflows, amplifying risks in trust, interpretability, and governance. While this work focuses on the coordination/cooperation ambiguity, the Mosaic generalizes to other overloaded terms, such as alignment, autonomy, and trust.

PaperID: 9834, https://arxiv.org/pdf/2506.22745.pdf

Authors: Sijie He, Ziye Jia, Qiuming Zhu, Fuhui Zhou, Qihui Wu

Title: Trusted Routing for Blockchain-Enabled Low-Altitude Intelligent Networks

Abstract:
Due to the scalability and portability, the low‑altitude intelligent networks (LAINs) are essential in various fields such as surveillance and disaster rescue. However, in LAINs, unmanned aerial vehicles (UAVs) are characterized by the distributed topology and high dynamic mobility, and vulnerable to security threats, which may degrade the routing performance for data transmission. Hence, how to ensure the routing stability and security of LAINs is a challenge. In this paper, we focus on the routing process in LAINs with multiple UAV clusters and propose the blockchain‑enabled zero‑trust architecture to manage the joining and exiting of UAVs. Furthermore, we formulate the routing problem to minimize the end‑to‑end (E2E) delay, which is an integer linear programming and intractable to solve. Therefore, considering the distribution of LAINs, we reformulate the routing problem into a decentralized partially observable Markov decision process. With the proposed soft hierarchical experience replay buffer, the multi‑agent double deep Q‑network based adaptive routing algorithm is designed. Finally, simulations are conducted and numerical results show that the total E2E delay of the proposed mechanism decreases by 22.38% than the benchmark on average.

PaperID: 9835, https://arxiv.org/pdf/2506.22735.pdf

Authors: Willem Conradie, Krishna Manoorkar, Alessandra Palmigiano, Apostolos Tzimoulis, Nachoem Wijnberg

Title: Questions as cognitive filters

Abstract:
In this paper, we develop a logico‑algebraic framework for modeling decision‑making through deliberation in multi‑agent settings. The central concept in this framework is that of interrogative agendas, which represent the cognitive stances of agents regarding which features should be considered relevant in the final decision. We formalize an agent's interrogative agenda as an equivalence relation that identifies outcomes differing only in aspects the agent deems irrelevant. Moreover, we characterize the sublattices of the resulting lattice that correspond to relevant interrogative agendas for deliberation scenarios governed by different ``winning rules." We then introduce a two‑sorted logico‑algebraic structure‑comprising the lattice of relevant interrogative agendas and the Boolean algebras of agent coalitions‑to model the interaction between agents and agendas during deliberation. Finally, we discuss which interaction conditions can and cannot be defined within this framework.

PaperID: 9836, https://arxiv.org/pdf/2506.22708.pdf

Authors: Shrenik Jadhav, Birva Sevak, Srijita Das, Akhtar Hussain, Wencong Su, Van-Hai Bui

Title: FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets

Abstract:
Peer‑to‑peer (P2P) trading is increasingly recognized as a key mechanism for decentralized market regulation, yet existing approaches often lack robust frameworks to ensure fairness. This paper presents FairMarket‑RL, a novel hybrid framework that combines Large Language Models (LLMs) with Reinforcement Learning (RL) to enable fairness‑aware trading agents. In a simulated P2P microgrid with multiple sellers and buyers, the LLM acts as a real‑time fairness critic, evaluating each trading episode using two metrics: Fairness‑To‑Buyer (FTB) and Fairness‑Between‑Sellers (FBS). These fairness scores are integrated into agent rewards through scheduled λ‑coefficients, forming an adaptive LLM‑guided reward shaping loop that replaces brittle, rule‑based fairness constraints. Agents are trained using Independent Proximal Policy Optimization (IPPO) and achieve equitable outcomes, fulfilling over 90% of buyer demand, maintaining fair seller margins, and consistently reaching FTB and FBS scores above 0.80. The training process demonstrates that fairness feedback improves convergence, reduces buyer shortfalls, and narrows profit disparities between sellers. With its language‑based critic, the framework scales naturally, and its extension to a large power distribution system with household prosumers illustrates its practical applicability. FairMarket‑RL thus offers a scalable, equity‑driven solution for autonomous trading in decentralized energy systems.

PaperID: 9837, https://arxiv.org/pdf/2506.22656.pdf

Authors: Jiangping Huang, Dongming Jin, Weisong Sun, Yang Liu, Zhi Jin

Title: Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

Abstract:
This paper envisions a knowledge‑guided multi‑agent framework named KGMAF for automated requirements development. KGMAF aims to address gaps in current automation systems for SE, which prioritize code development and overlook the complexities of requirements tasks. KGMAF is composed of six specialized agents and an artifact pool to improve efficiency and accuracy. Specifically, KGMAF outlines the functionality, actions, and knowledge of each agent and provides the conceptual design of the artifact pool. Our case study highlights the potential of KGMAF in real‑world scenarios. Finally, we outline several research opportunities for implementing and enhancing automated requirements development using multi‑agent systems. We believe that KGMAF will play a pivotal role in shaping the future of automated requirements development in the era of LLMs.

PaperID: 9838, https://arxiv.org/pdf/2506.22598.pdf

Authors: Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

Title: RExBench: Can coding agents autonomously implement AI research extensions?

Abstract:
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of realistic extensions of 12 research papers that aim to investigate novel research hypotheses. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert‑written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate 12 LLM agents implemented using two different frameworks, aider and OpenHands. We find that all agents fail to autonomously implement the majority of the extensions, with the best agent achieving around a 33% success rate. Although the success rate improves with additional human‑written hints, the best performance under this setting remains below 44%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

PaperID: 9839, https://arxiv.org/pdf/2506.22557.pdf

Authors: Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique

Title: MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

Abstract:
As large language models (LLMs) grow more capable, they face growing vulnerability to sophisticated jailbreak attacks. While developers invest heavily in alignment finetuning and safety guardrails, researchers continue publishing novel attacks, driving progress through adversarial iteration. This dynamic mirrors a strategic game of continual evolution. However, two major challenges hinder jailbreak development: the high cost of querying top‑tier LLMs and the short lifespan of effective attacks due to frequent safety updates. These factors limit cost‑efficiency and practical impact of research in jailbreak attacks. To address this, we propose MetaCipher, a low‑cost, multi‑agent jailbreak framework that generalizes across LLMs with varying safety measures. Using reinforcement learning, MetaCipher is modular and adaptive, supporting extensibility to future strategies. Within as few as 10 queries, MetaCipher achieves state‑of‑the‑art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large‑scale empirical evaluation across diverse victim models and benchmarks, demonstrating its robustness and adaptability. Warning: This paper contains model outputs that may be offensive or harmful, shown solely to demonstrate jailbreak efficacy.

PaperID: 9840, https://arxiv.org/pdf/2506.22507.pdf

Authors: Yubo Peng, Luping Xiang, Kun Yang, Feibo Jiang, Kezhi Wang, Christos Masouros

Title: A Unified Cloud-Edge-Terminal Framework for Multimodal Integrated Sensing and Communication

Abstract:
The transition to 6G calls for tightly integrated sensing and communication to support mission‑critical services such as autonomous driving, embodied AI, and high‑precision telemedicine. However, most existing ISAC designs rely on a single sensing modality (often RF), which limits environmental understanding and becomes a bottleneck in complex and dynamic scenes. This motivates a shift from single‑modal to multimodal ISAC, where heterogeneous sensors (e.g., radar, LiDAR, and cameras) complement each other to improve robustness and semantic awareness. In this article, we first summarize key challenges for multimodal ISAC, including heterogeneous fusion, communication overhead, and scalable system design. We then highlight three enabling technologies: large AI models, semantic communications, and multi‑agent systems, and discuss how their combination can enable task‑oriented multimodal perception. Building on these insights, we propose a unified cloud‑edge‑terminal (CET) framework that hierarchically distributes intelligence and supports three adaptive operation modes: global fusion mode (GFM), cooperative relay mode (CRM), and peer interaction mode (PIM). A case study evaluates the framework across three modes, demonstrating that GFM achieves the highest accuracy, PIM minimizes latency, and CRM strikes an optimal balance between performance and efficiency. Finally, we conclude with open research issues and future directions.

PaperID: 9841, https://arxiv.org/pdf/2506.22485.pdf

Authors: Sudip Dasgupta, Himanshu Shankar

Title: AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents

Abstract:
This study presents a modular, multi‑agent system for the automated review of highly structured enterprise business documents using AI agents. Unlike prior solutions focused on unstructured texts or limited compliance checks, this framework leverages modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section‑by‑section evaluation of documents for accuracy, consistency, completeness, and clarity. Specialized agents, each responsible for discrete review criteria such as template compliance or factual correctness, operate in parallel or sequence as required. Evaluation outputs are enforced to a standardized, machine‑readable schema, supporting downstream analytics and auditability. Continuous monitoring and a feedback loop with human reviewers allow for iterative system improvement and bias mitigation. Quantitative evaluation demonstrates that the AI Agent‑as‑Judge system approaches or exceeds human performance in key areas: achieving 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document, with a 95% agreement rate between AI and expert human judgment. While promising for a wide range of industries, the study also discusses current limitations, including the need for human oversight in highly specialized domains and the operational cost of large‑scale LLM usage. The proposed system serves as a flexible, auditable, and scalable foundation for AI‑driven document quality assurance in the enterprise context.

PaperID: 9842, https://arxiv.org/pdf/2506.22480.pdf

Authors: Mariam Yahya, Aydin Sezgin, Setareh Maghsudi

Title: Service Placement in Small Cell Networks Using Distributed Best Arm Identification in Linear Bandits

Abstract:
As users in small cell networks increasingly rely on computation‑intensive services, cloud‑based access often results in high latency. Multi‑access edge computing (MEC) mitigates this by bringing computational resources closer to end users, with small base stations (SBSs) serving as edge servers to enable low‑latency service delivery. However, limited edge capacity makes it challenging to decide which services to deploy locally versus in the cloud, especially under unknown service demand and dynamic network conditions. To tackle this problem, we model service demand as a linear function of service attributes and formulate the service placement task as a linear bandit problem, where SBSs act as agents and services as arms. The goal is to identify the service that, when placed at the edge, offers the greatest reduction in total user delay compared to cloud deployment. We propose a distributed and adaptive multi‑agent best‑arm identification (BAI) algorithm under a fixed‑confidence setting, where SBSs collaborate to accelerate learning. Simulations show that our algorithm identifies the optimal service with the desired confidence and achieves near‑optimal speedup, as the number of learning rounds decreases proportionally with the number of SBSs. We also provide theoretical analysis of the algorithm's sample complexity and communication overhead.

PaperID: 9843, https://arxiv.org/pdf/2506.22445.pdf

Authors: Saad Alqithami

Title: Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security

Abstract:
Cyber‑Physical Systems play a critical role in the infrastructure of various sectors, including manufacturing, energy distribution, and autonomous transportation systems. However, their increasing connectivity renders them highly vulnerable to sophisticated cyber threats, such as adaptive and zero‑day attacks, against which traditional security methods like rule‑based intrusion detection and single‑agent reinforcement learning prove insufficient. To overcome these challenges, this paper introduces a novel Hierarchical Adversarially‑Resilient Multi‑Agent Reinforcement Learning (HAMARL) framework. HAMARL employs a hierarchical structure consisting of local agents dedicated to subsystem security and a global coordinator that oversees and optimizes comprehensive, system‑wide defense strategies. Furthermore, the framework incorporates an adversarial training loop designed to simulate and anticipate evolving cyber threats, enabling proactive defense adaptation. Extensive experimental evaluations conducted on a simulated industrial IoT testbed indicate that HAMARL substantially outperforms traditional multi‑agent reinforcement learning approaches, significantly improving attack detection accuracy, reducing response times, and ensuring operational continuity. The results underscore the effectiveness of combining hierarchical multi‑agent coordination with adversarially‑aware training to enhance the resilience and security of next‑generation CPS.

PaperID: 9844, https://arxiv.org/pdf/2506.22117.pdf

Authors: Fenglan Wang, Xinguo Shu, Lei He, Lin Zhao

Title: Safe Multi-Agent Navigation via Constrained HJB-Informed Learning

Abstract:
Multi‑agent navigation in unknown and cluttered environments has broad applications, yet remains fundamentally challenging. In particular, dense agent‑agent and agent‑obstacle reactive interactions can exacerbate the inherent competition between collision‑avoidance constraints and goal‑reaching objectives. Most existing approaches mitigate this by applying per‑step safety filtering on top of a predefined goal‑reaching controller or by designing heuristic loss functions that penalizes safety constraints violation gradient. While effective in sparse environments, these methods still suffer from overly‑conservative behaviors when interactions become dense. To overcome these limitations, we propose HJB‑GNN, a Hamilton‑Jacobi‑Bellman (HJB)‑based learning framework that jointly learns a graph neural network (GNN)‑parameterized control barrier function for explicit safety enforcement, a distributed GNN‑based navigation policy, and a value function that induces goal‑reaching behavior. By exploiting the analytical solution of the constrained HJB equation, the proposed method derives graph‑dependent Lagrange multipliers that adaptively balance collision‑avoidance and goal‑reaching across diverse multi‑agent navigation scenarios. Moreover, HJB‑GNN supports centralized training with distributed deployment. Extensive simulations and real‑world experiments with Crazyflie drone swarms demonstrate its superior safety and goal‑reaching performance, as well as strong scalability and generalizability to large‑scale teams operating in previously unseen, dense environments.

PaperID: 9845, https://arxiv.org/pdf/2506.22056.pdf

Authors: Xuan Zhang, Ziyan Jiang, Rui Meng, Yifei Leng, Zhenbang Xiao, Zora Zhiruo Wang, Yanyi Shang, Dehan Kong

Title: Universal Retrieval for Multimodal Trajectory Modeling

Abstract:
Trajectory data, capturing human actions and environmental states across various modalities, holds significant potential for enhancing AI agent capabilities, particularly in GUI environments. However, how to model the representation of trajectory‑level data presents a significant challenge that has not been systematically addressed amid explosive trajectory data growth. In this work, we introduce Multimodal Trajectory Retrieval, bridging the gap between universal retrieval and agent‑centric trajectory modeling. We construct the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and states across diverse real‑world scenarios. Based on this, we present GAE‑Bench, a benchmark containing a large number of trajectory‑based retrieval pairs. In addition, we propose GAE‑Retriever, a multimodal retrieval framework that adopts vision‑language models and incorporates optimized contrastive learning through a token selection and the GradCache mechanism. Comprehensive evaluations across multiple datasets show that GAE‑Retriever consistently outperforms strong baselines in retrieval recall, highlighting its effectiveness in advancing multimodal trajectory retrieval.

PaperID: 9846, https://arxiv.org/pdf/2506.21982.pdf

Authors: Akshay Jaitly, Jack Cline, Siavash Farzan

Title: A MILP-Based Solution to Multi-Agent Motion Planning and Collision Avoidance in Constrained Environments

Abstract:
We propose a mixed‑integer linear program (MILP) for multi‑agent motion planning that embeds Polytopic Action‑based Motion Planning (PAAMP) into a sequence‑then‑solve pipeline. Region sequences confine each agent to adjacent convex polytopes, while a big‑M hyperplane model enforces inter‑agent separation. Collision constraints are applied only to agents sharing or neighboring a region, which reduces binary variables exponentially compared with naive formulations. An L1 path‑length‑plus‑acceleration cost yields smooth trajectories. We prove finite‑time convergence and demonstrate on representative multi‑agent scenarios with obstacles that our formulation produces collision‑free trajectories an order of magnitude faster than an unstructured MILP baseline.

PaperID: 9847, https://arxiv.org/pdf/2506.21967.pdf

Authors: Weimin Xiong, Ke Wang, Yifan Song, Hanchao Liu, Sai Zhou, Wei Peng, Sujian Li

Title: More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Abstract:
Current evaluations of tool‑integrated LLM agents typically focus on end‑to‑end tool‑usage evaluation while neglecting their stability. This limits their real‑world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open‑source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.

PaperID: 9848, https://arxiv.org/pdf/2506.21934.pdf

Authors: Najmeh Forouzandehmehr, Reza Yousefi Maragheh, Sriram Kollipara, Kai Zhao, Topojoy Biswas, Evren Korpeoglu, Kannan Achan

Title: CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design

Abstract:
Automated content‑aware layout generation ‑‑ the task of arranging visual elements such as text, logos, and underlays on a background canvas ‑‑ remains a fundamental yet under‑explored problem in intelligent design systems. While recent advances in deep generative models and large language models (LLMs) have shown promise in structured content generation, most existing approaches lack grounding in contextual design exemplars and fall short in handling semantic alignment and visual coherence. In this work we introduce CAL‑RAG, a retrieval‑augmented, agentic framework for content‑aware layout generation that integrates multimodal retrieval, large language models, and collaborative agentic reasoning. Our system retrieves relevant layout examples from a structured knowledge base and invokes an LLM‑based layout recommender to propose structured element placements. A vision‑language grader agent evaluates the layout with visual metrics, and a feedback agent provides targeted refinements, enabling iterative improvement. We implement our framework using LangGraph and evaluate it on the PKU PosterLayout dataset, a benchmark rich in semantic and structural variability. CAL‑RAG achieves state‑of‑the‑art performance across multiple layout metrics ‑‑ including underlay effectiveness, element alignment, and overlap ‑‑ substantially outperforming strong baselines such as LayoutPrompter. These results demonstrate that combining retrieval augmentation with agentic multi‑step reasoning yields a scalable, interpretable, and high‑fidelity solution for automated layout generation.

PaperID: 9849, https://arxiv.org/pdf/2506.21931.pdf

Authors: Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala, Jason Cho, Sushant Kumar

Title: ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation

Abstract:
Retrieval‑Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG‑based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval‑Augmented Generation framework for Personalized Recommendation, which integrates a multi‑agent collaboration mechanism into the RAG pipeline. To better understand the long‑term and session behavior of the user, ARAG leverages four specialized LLM‑based agents: a User Understanding Agent that summarizes user preferences from long‑term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a context summary agent that summarizes the findings of NLI agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG accross three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency‑based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also, conduct an ablation study to analyse the effect by different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval‑augmented recommendation and provide new directions for LLM‑based personalization.

PaperID: 9850, https://arxiv.org/pdf/2506.21839.pdf

Authors: Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz

Title: GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles

Abstract:
We challenge text‑to‑image models with generating escape room puzzle images that are visually appealing, logically solid, and intellectually stimulating. While base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi‑agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.

PaperID: 9851, https://arxiv.org/pdf/2506.21762.pdf

Authors: Oliver Huang, Carolina Nobre

Title: ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues

Abstract:
Data visualization tasks often require multi‑step reasoning, and the interpretive strategies experts use, such as decomposing complex goals into smaller subtasks and selectively attending to key chart regions are rarely made explicit. ViStruct is an automated pipeline that simulates these expert behaviours by breaking high‑level visual questions into structured analytic steps and highlighting semantically relevant chart areas. Leveraging large language and vision‑language models, ViStruct identifies chart components, maps subtasks to spatial regions, and presents visual attention cues to externalize expert‑like reasoning flows. While not designed for direct novice instruction, ViStruct provides a replicable model of expert interpretation that can inform the development of future visual literacy tools. We evaluate the system on 45 tasks across 12 chart types and validate its outputs with trained visualization users, confirming its ability to produce interpretable and expert‑aligned reasoning sequences.

PaperID: 9852, https://arxiv.org/pdf/2506.21669.pdf

Authors: Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chunkai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang

Title: SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Abstract:
Self‑evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long‑horizon, real‑world tasks. Despite current advancements in reinforcement fine‑tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self‑evolving embodied intelligence with multi‑modal interactions remains largely unexplored. Specifically, reinforcement fine‑tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi‑step reasoning tasks limits effective learning signals, and (ii) reliance on hand‑crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self‑Evolving Embodied Agents‑R1, SEEA‑R1, the first RFT framework designed for enabling the self‑evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi‑step reasoning, we propose Tree‑based group relative policy optimization (Tree‑GRPO) integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward‑driven self‑evolution, we further introduce Multi‑modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA‑R1, we evaluate on the ALFWorld benchmark, surpassing state‑of‑the‑art methods with scores of 85.07% (textual) and 46.27% (multi‑modal), outperforming prior models including GPT‑4o. SEEA‑R1 also achieves scores of 80.3% (textual) and 44.03% (multi‑modal) without ground truth reward, surpassing all open‑source baselines and highlighting its scalability as a self‑evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA‑R1 for future research in scalable embodied intelligence.

PaperID: 9853, https://arxiv.org/pdf/2506.21608.pdf

Authors: Yasmine Bouamra, Bruno Yun, Alexandre Poisson, Frédéric Armetta

Title: SysTemp: A Multi-Agent System for Template-Based Generation of SysML v2

Abstract:
The automatic generation of SysML v2 models represents a major challenge in the engineering of complex systems, particularly due to the scarcity of learning corpora and complex syntax. We present SysTemp, a system aimed at facilitating and improving the creation of SysML v2 models from natural language specifications. It is based on a multi‑agent system, including a template generator that structures the generation process. We discuss the advantages and challenges of this system through an evaluation, highlighting its potential to improve the quality of the generations in SysML v2 modeling.

PaperID: 9854, https://arxiv.org/pdf/2506.21565.pdf

Authors: Takato Ueno, Keito Inoshita

Title: A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing

Abstract:
Japan's kairanban culture and idobata conversations have long functioned as traditional communication practices that foster nuanced dialogue among community members and contribute to the formation of social balance. Inspired by these information exchange processes, this study proposes a multi‑agent inference framework (KCS+IBC) that integrates multiple large language models (LLMs) to achieve bias mitigation, improved explainability, and probabilistic prediction in sentiment analysis. In addition to sequentially sharing prediction results, the proposed method incorporates a mid‑phase casual dialogue session to blend formal inference with individual perspectives and introduces probabilistic sentiment prediction. Experimental results show that KCS achieves accuracy comparable to that of a single LLM across datasets, while KCS+IBC exhibits a consistent decrease in entropy and a gradual increase in variance during the latter stages of inference, suggesting the framework's ability to balance aggregation and diversity of predictions. Future work will quantitatively assess the impact of these characteristics on bias correction and aim to develop more advanced sentiment analysis systems.

PaperID: 9855, https://arxiv.org/pdf/2506.21557.pdf

Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu

Title: Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

Abstract:
The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk‑and‑Infer framework for Fake News Detection(DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain‑of‑debunk strategy where a multi‑agent MLLM system produces logic‑grounded, multimodal‑aware reasoning content and final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning‑rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.

PaperID: 9856, https://arxiv.org/pdf/2506.21536.pdf

Authors: Fangjun Ding, Renyu Zhang, Xinyu Feng, Chengye Xie, Zheng Zhang, Yanting Zhang

Title: PsyLite Technical Report

Abstract:
With the rapid development of digital technology, AI‑driven psychological counseling has gradually become an important research direction in the field of mental health. However, existing models still have deficiencies in dialogue safety, detailed scenario handling, and lightweight deployment. To address these issues, this study proposes PsyLite, a lightweight psychological counseling large language model agent developed based on the base model InternLM2.5‑7B‑chat. Through a two‑stage training strategy (hybrid distillation data fine‑tuning and ORPO preference optimization), PsyLite enhances the model's deep‑reasoning ability, psychological counseling ability, and safe dialogue ability. After deployment using Ollama and Open WebUI, a custom workflow is created with Pipelines. An innovative conditional RAG is designed to introduce crosstalk humor elements at appropriate times during psychological counseling to enhance user experience and decline dangerous requests to strengthen dialogue safety. Evaluations show that PsyLite outperforms the baseline models in the Chinese general evaluation (CEval), psychological counseling professional evaluation (CPsyCounE), and dialogue safety evaluation (SafeDialBench), particularly in psychological counseling professionalism (CPsyCounE score improvement of 47.6%) and dialogue safety (\safe score improvement of 2.4%). Additionally, the model uses quantization technology (GGUF q4\_k\_m) to achieve low hardware deployment (5GB memory is sufficient for operation), providing a feasible solution for psychological counseling applications in resource‑constrained environments.

PaperID: 9857, https://arxiv.org/pdf/2506.21250.pdf

Authors: Jing Bi, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

Title: ACTLLM: Action Consistency Tuned Large Language Model

Abstract:
This paper introduces ACTLLM (Action Consistency Tuned Large Language Model), a novel approach for robot manipulation in dynamic environments. Traditional vision‑based systems often struggle to learn visual representations that excel in both task execution and spatial reasoning, thereby limiting their adaptability in dynamic environments. ACTLLM addresses these challenges by harnessing language to craft structured scene descriptors, providing a uniform interface for both spatial understanding and task performance through flexible language instructions. Moreover, we introduce a novel action consistency constraint that aligns visual perception with corresponding actions, thereby enhancing the learning of actionable visual representations. Additionally, we have reformulated the Markov decision process for manipulation tasks into a multi‑turn visual dialogue framework. This approach enables the modeling of long‑term task execution with enhanced contextual relevance derived from the history of task execution. During our evaluation, ACTLLM excels in diverse scenarios, proving its effectiveness on challenging vision‑based robot manipulation tasks.

PaperID: 9858, https://arxiv.org/pdf/2506.21211.pdf

Authors: Quanming Liu, Xupeng Bu, Zhichao Yan, Ru Li

Title: $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models

Abstract:
Automatic Program Repair (APR) is a core technology in software development and maintenance, with aims to enable automated defect repair with minimal human intervention. In recent years, the substantial advancements in Large Language Models (LLMs) and the Chain‑of‑Thought (CoT) techniques have significantly enhanced the reasoning capabilities of these models. However, due to the complex logic and multi‑step reasoning ability needed, the application of CoT techniques in the APR domain remains insufficient. This study systematically evaluates the performance of several common CoT techniques in APR tasks and proposes an innovative framework T^3, which integrates the powerful reasoning capabilities of LLMs with tree search, effectively improving the precision of generating candidate repair solutions. Furthermore, T^3 provides valuable guidance for optimizing sample selection and repair strategies in APR tasks, establishing a robust framework for achieving efficient automated debugging.

PaperID: 9859, https://arxiv.org/pdf/2506.20964.pdf

Authors: Luca L. Weishaupt, Chengkuan Chen, Drew F. K. Williamson, Richard J. Chen, Guillaume Jaume, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, Faisal Mahmood

Title: Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Abstract:
Pathology is experiencing rapid digital transformation driven by whole‑slide imaging and artificial intelligence (AI). While deep learning‑based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text‑based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi‑image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology‑specific instruction samples and nearly 5.5 million question answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state‑of‑the‑art (SOTA) general‑purpose and other pathology‑specific models. Furthermore, we present SlideSeek, a reasoning‑enabled multi‑agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole‑slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open‑ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly‑interpretable summary reports.

PaperID: 9860, https://arxiv.org/pdf/2506.20921.pdf

Authors: Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani

Title: LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Abstract:
Chemical process optimization maximizes production efficiency and economic performance, but optimization algorithms, including gradient‑based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill‑defined or unavailable. We present a multi‑agent LLM framework that autonomously infers operating constraints from minimal process descriptions, then collaboratively guides optimization. Our AutoGen‑based framework employs OpenAI's o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance. Through autonomous constraint generation and iterative multi‑agent optimization, the framework eliminates the need for predefined operational bounds. Validated on hydrodealkylation across cost, yield, and yield‑to‑cost ratio metrics, the framework achieved competitive performance with conventional methods while reducing wall‑time 31‑fold relative to grid search, converging in under 20 minutes. The reasoning‑guided search demonstrates sophisticated process understanding, correctly identifying utility trade‑offs and applying domain‑informed heuristics. Unlike conventional methods requiring predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable parameter exploration. Model comparison reveals reasoning‑capable architectures (o3, o1) are essential for successful optimization, while standard models fail to converge. This approach is particularly valuable for emerging processes and retrofit applications where operational constraints are poorly characterized or unavailable.

PaperID: 9861, https://arxiv.org/pdf/2506.20883.pdf

Authors: Kyanna Dagenais, Istvan David

Title: Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance

Abstract:
Model‑driven engineering problems often require complex model transformations (MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of such problems include model synchronization, automated model repair, and design space exploration. Manually developing complex MTs is an error‑prone and often infeasible process. Reinforcement learning (RL) is an apt way to alleviate these issues. In RL, an autonomous agent explores the state space through trial and error to identify beneficial sequences of actions, such as MTs. However, RL methods exhibit performance issues in complex problems. In these situations, human guidance can be of high utility. In this paper, we present an approach and technical framework for developing complex MT sequences through RL, guided by potentially uncertain human advice. Our framework allows user‑defined MTs to be mapped onto RL primitives, and executes them as RL programs to find optimal MT sequences. Our evaluation shows that human guidance, even if uncertain, substantially improves RL performance, and results in more efficient development of complex MTs. Through a trade‑off between the certainty and timeliness of human advice, our method takes a step towards RL‑driven human‑in‑the‑loop engineering methods.

PaperID: 9862, https://arxiv.org/pdf/2506.20806.pdf

Authors: Zhonghao Zhan, Huichi Zhou, Hamed Haddadi

Title: Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

Abstract:
Graph Neural Networks (GNNs) show great promise for Network Intrusion Detection Systems (NIDS), particularly in IoT environments, but suffer performance degradation due to distribution drift and lack robustness against realistic adversarial attacks. Current robustness evaluations often rely on unrealistic synthetic perturbations and lack demonstrations on systematic analysis of different kinds of adversarial attack, which encompass both black‑box and white‑box scenarios. This work proposes a novel approach to enhance GNN robustness and generalization by employing Large Language Models (LLMs) in an agentic pipeline as simulated cybersecurity expert agents. These agents scrutinize graph structures derived from network flow data, identifying and potentially mitigating suspicious or adversarially perturbed elements before GNN processing. Our experiments, using a framework designed for realistic evaluation and testing with a variety of adversarial attacks including a dataset collected from physical testbed experiments, demonstrate that integrating LLM analysis can significantly improve the resilience of GNN‑based NIDS against challenges, showcasing the potential of LLM agent as a complementary layer in intrusion detection architectures.

PaperID: 9863, https://arxiv.org/pdf/2506.20743.pdf

Authors: Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu

Title: A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools

Abstract:
Foundation models (FMs) are catalyzing a transformative shift in materials science (MatSci) by enabling scalable, general‑purpose, and multimodal AI systems for scientific discovery. Unlike traditional machine learning models, which are typically narrow in scope and require task‑specific engineering, FMs offer cross‑domain generalization and exhibit emergent capabilities. Their versatility is especially well‑suited to materials science, where research challenges span diverse data types and scales. This survey provides a comprehensive overview of foundation models, agentic systems, datasets, and computational tools supporting this growing field. We introduce a task‑driven taxonomy encompassing six broad application areas: data extraction, interpretation and Q\&A; atomistic simulation; property prediction; materials structure, design and discovery; process planning, discovery, and optimization; and multiscale modeling. We discuss recent advances in both unimodal and multimodal FMs, as well as emerging large language model (LLM) agents. Furthermore, we review standardized datasets, open‑source tools, and autonomous experimental platforms that collectively fuel the development and integration of FMs into research workflows. We assess the early successes of foundation models and identify persistent limitations, including challenges in generalizability, interpretability, data imbalance, safety concerns, and limited multimodal fusion. Finally, we articulate future research directions centered on scalable pretraining, continual learning, data governance, and trustworthiness.

PaperID: 9864, https://arxiv.org/pdf/2506.20737.pdf

Authors: Gurusha Juneja, Alon Albalak, Wenyue Hua, William Yang Wang

Title: MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation

Abstract:
The proliferation of LLM‑based agents has led to increasing deployment of inter‑agent collaboration for tasks like scheduling, negotiation, resource allocation etc. In such systems, privacy is critical, as agents often access proprietary tools and domain‑specific databases requiring strict confidentiality. This paper examines whether LLM‑based agents demonstrate an understanding of contextual privacy. And, if instructed, do these systems preserve inference time user privacy in non‑adversarial multi‑turn conversation. Existing benchmarks to evaluate contextual privacy in LLM‑agents primarily assess single‑turn, low‑complexity tasks where private information can be easily excluded. We first present a benchmark ‑ MAGPIE comprising 158 real‑life high‑stakes scenarios across 15 domains. These scenarios are designed such that complete exclusion of private data impedes task completion yet unrestricted information sharing could lead to substantial losses. We then evaluate the current state‑of‑the‑art LLMs on (a) their understanding of contextually private data and (b) their ability to collaborate without violating user privacy. Empirical experiments demonstrate that current models, including GPT‑4o and Claude‑2.7‑Sonnet, lack robust understanding of contextual privacy, misclassifying private data as shareable 25.2% and 43.6% of the time. In multi‑turn conversations, these models disclose private information in 59.9% and 50.5% of cases even under explicit privacy instructions. Furthermore, multi‑agent systems fail to complete tasks in 71% of scenarios. These results underscore that current models are not aligned towards both contextual privacy preservation and collaborative task‑solving.

PaperID: 9865, https://arxiv.org/pdf/2506.20664.pdf

Authors: Andrei Lupu, Timon Willi, Jakob Foerster

Title: The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

Abstract:
As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi‑agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi‑agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game‑based benchmark for multi‑agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi‑agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human‑AI cross‑play experiments. We find that LLM game‑playing abilities lag behind humans and simple word‑embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state‑of‑the‑art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.

PaperID: 9866, https://arxiv.org/pdf/2506.20598.pdf

Authors: Alexander D. Kalian, Jaewook Lee, Stefan P. Johannesson, Lennart Otte, Christer Hogstrand, Miao Guo

Title: Fine-Tuning and Prompt Engineering of LLMs, for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges

Abstract:
The global demand for sustainable protein sources has accelerated the need for intelligent tools that can rapidly process and synthesise domain‑specific scientific knowledge. In this study, we present a proof‑of‑concept multi‑agent Artificial Intelligence (AI) framework designed to support sustainable protein production research, with an initial focus on microbial protein sources. Our Retrieval‑Augmented Generation (RAG)‑oriented system consists of two GPT‑based LLM agents: (1) a literature search agent that retrieves relevant scientific literature on microbial protein production for a specified microbial strain, and (2) an information extraction agent that processes the retrieved content to extract relevant biological and chemical information. Two parallel methodologies, fine‑tuning and prompt engineering, were explored for agent optimisation. Both methods demonstrated effectiveness at improving the performance of the information extraction agent in terms of transformer‑based cosine similarity scores between obtained and ideal outputs. Mean cosine similarity scores were increased by up to 25%, while universally reaching mean scores of \geq 0.89 against ideal output text. Fine‑tuning overall improved the mean scores to a greater extent (consistently of \geq 0.94) compared to prompt engineering, although lower statistical uncertainties were observed with the latter approach. A user interface was developed and published for enabling the use of the multi‑agent AI system, alongside preliminary exploration of additional chemical safety‑based search capabilities

PaperID: 9867, https://arxiv.org/pdf/2506.20430.pdf

Authors: Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie

Title: An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Abstract:
Rare diseases affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains an urgent challenge. Patients often endure a prolonged diagnostic odyssey exceeding five years, marked by repeated referrals, misdiagnoses, and unnecessary interventions, leading to delayed treatment and substantial emotional and economic burdens. Here we present DeepRare, a multi‑agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up‑to‑date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free‑text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 3,134 diseases. In human‑phenotype‑ontology‑based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next‑best method by 23.79%; in multi‑modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability. Our work not only advances rare disease diagnosis but also demonstrates how the latest powerful large‑language‑model‑driven agentic systems can reshape current clinical workflows.

PaperID: 9868, https://arxiv.org/pdf/2506.20415.pdf

Authors: Dipayan Saha, Shams Tarek, Hasan Al Shaikh, Khan Thamid Hasan, Pavan Sai Nalluri, Md. Ajoad Hasan, Nashmin Alam, Jingbo Zhou, Sujan Kumar Saha, Mark Tehranipoor, Farimah Farahmandi

Title: SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models

Abstract:
Ensuring the security of complex system‑on‑chips (SoCs) designs is a critical imperative, yet traditional verification techniques struggle to keep pace due to significant challenges in automation, scalability, comprehensiveness, and adaptability. The advent of large language models (LLMs), with their remarkable capabilities in natural language understanding, code generation, and advanced reasoning, presents a new paradigm for tackling these issues. Moving beyond monolithic models, an agentic approach allows for the creation of multi‑agent systems where specialized LLMs collaborate to solve complex problems more effectively. Recognizing this opportunity, we introduce SV‑LLM, a novel multi‑agent assistant system designed to automate and enhance SoC security verification. By integrating specialized agents for tasks like verification question answering, security asset identification, threat modeling, test plan and property generation, vulnerability detection, and simulation‑based bug validation, SV‑LLM streamlines the workflow. To optimize their performance in these diverse tasks, agents leverage different learning paradigms, such as in‑context learning, fine‑tuning, and retrieval‑augmented generation (RAG). The system aims to reduce manual intervention, improve accuracy, and accelerate security analysis, supporting proactive identification and mitigation of risks early in the design cycle. We demonstrate its potential to transform hardware security practices through illustrative case studies and experiments that showcase its applicability and efficacy.

PaperID: 9869, https://arxiv.org/pdf/2506.20400.pdf

Authors: Kristoffer Christensen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Title: A Visualization Framework for Exploring Multi-Agent-Based Simulations Case Study of an Electric Vehicle Home Charging Ecosystem

Abstract:
Multi‑agent‑based simulations (MABS) of electric vehicle (EV) home charging ecosystems generate large, complex, and stochastic time‑series datasets that capture interactions between households, grid infrastructure, and energy markets. These interactions can lead to unexpected system‑level events, such as transformer overloads or consumer dissatisfaction, that are difficult to detect and explain through static post‑processing. This paper presents a modular, Python‑based dashboard framework, built using Dash by Plotly, that enables efficient, multi‑level exploration and root‑cause analysis of emergent behavior in MABS outputs. The system features three coordinated views (System Overview, System Analysis, and Consumer Analysis), each offering high‑resolution visualizations such as time‑series plots, spatial heatmaps, and agent‑specific drill‑down tools. A case study simulating full EV adoption with smart charging in a Danish residential network demonstrates how the dashboard supports rapid identification and contextual explanation of anomalies, including clustered transformer overloads and time‑dependent charging failures. The framework facilitates actionable insight generation for researchers and distribution system operators, and its architecture is adaptable to other distributed energy resources and complex energy systems.

PaperID: 9870, https://arxiv.org/pdf/2506.20399.pdf

Authors: Hatem Fakhruldeen, Arvind Raveendran Nambiar, Satheeshkumar Veeramani, Bonilkumar Vijaykumar Tailor, Hadi Beyzaee Juneghani, Gabriella Pizzuto, Andrew Ian Cooper

Title: Multimodal Behaviour Trees for Robotic Laboratory Task Automation

Abstract:
Laboratory robotics offer the capability to conduct experiments with a high degree of precision and reproducibility, with the potential to transform scientific research. Trivial and repeatable tasks; e.g., sample transportation for analysis and vial capping are well‑suited for robots; if done successfully and reliably, chemists could contribute their efforts towards more critical research activities. Currently, robots can perform these tasks faster than chemists, but how reliable are they? Improper capping could result in human exposure to toxic chemicals which could be fatal. To ensure that robots perform these tasks as accurately as humans, sensory feedback is required to assess the progress of task execution. To address this, we propose a novel methodology based on behaviour trees with multimodal perception. Along with automating robotic tasks, this methodology also verifies the successful execution of the task, a fundamental requirement in safety‑critical environments. The experimental evaluation was conducted on two lab tasks: sample vial capping and laboratory rack insertion. The results show high success rate, i.e., 88% for capping and 92% for insertion, along with strong error detection capabilities. This ultimately proves the robustness and reliability of our approach and that using multimodal behaviour trees should pave the way towards the next generation of robotic chemists.

PaperID: 9871, https://arxiv.org/pdf/2506.20394.pdf

Authors: Mimo Shirasaka, Yuya Ikeda, Tatsuya Matsushima, Yutaka Matsuo, Yusuke Iwasawa

Title: SPARK: Graph-Based Online Semantic Integration System for Robot Task Planning

Abstract:
The ability to update information acquired through various means online during task execution is crucial for a general‑purpose service robot. This information includes geometric and semantic data. While SLAM handles geometric updates on 2D maps or 3D point clouds, online updates of semantic information remain unexplored. We attribute the challenge to the online scene graph representation, for its utility and scalability. Building on previous works regarding offline scene graph representations, we study online graph representations of semantic information in this work. We introduce SPARK: Spatial Perception and Robot Knowledge Integration. This framework extracts semantic information from environment‑embedded cues and updates the scene graph accordingly, which is then used for subsequent task planning. We demonstrate that graph representations of spatial relationships enhance the robot system's ability to perform tasks in dynamic environments and adapt to unconventional spatial cues, like gestures.

PaperID: 9872, https://arxiv.org/pdf/2506.20249.pdf

Authors: Junyan Cheng, Peter Clark, Kyle Richardson

Title: Language Modeling by Language Models

Abstract:
Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi‑agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre‑training, and downstream evaluation (verification). Using ideas from scaling laws, our system, Genesys, employs a Ladder of Scales approach; new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M～350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., ～86% percentage point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified through pre‑training) and find the best designs to be highly competitive with known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system‑level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.

PaperID: 9873, https://arxiv.org/pdf/2506.20097.pdf

Authors: Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, Jesse Thomason

Title: PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models

Abstract:
We propose PSALM‑V, the first autonomous neuro‑symbolic learning system able to induce symbolic action semantics (i.e., pre‑ and post‑conditions) in visual environments through interaction. PSALM‑V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)‑based symbolic planners. However, these approaches have primarily focused on text‑based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM‑V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree‑structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM‑V increases the plan success rate from 37% (Claude‑3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked‑AI, show that PSALM‑V improves step efficiency and succeeds in domain induction in multi‑agent settings. PSALM‑V correctly induces PDDL pre‑ and post‑conditions for real‑world robot BlocksWorld tasks, despite low‑level manipulation failures from the robot.

PaperID: 9874, https://arxiv.org/pdf/2506.20062.pdf

Authors: Runlong Ye, Zeling Zhang, Boushra Almazroua, Michael Liut

Title: Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents

Abstract:
AI‑powered code assistants are widely used to generate code completions, significantly boosting developer productivity. However, these tools typically present suggestions without explaining their rationale, leaving their decision‑making process inscrutable. This opacity hinders developers' ability to critically evaluate outputs, form accurate mental models, and calibrate trust in the system. To address this, we introduce CopilotLens, a novel interactive framework that reframes code completion from a simple suggestion into a transparent, explainable interaction. CopilotLens operates as an explanation layer that reconstructs the AI agent's "thought process" through a dynamic, two‑level interface. The tool aims to surface both high‑level code changes and the specific codebase context influences. This paper presents the design and rationale of CopilotLens, offering a concrete framework and articulating expectations on deepening comprehension and calibrated trust, which we plan to evaluate in subsequent work.

PaperID: 9875, https://arxiv.org/pdf/2506.20039.pdf

Authors: Koorosh Moslemi, Chi-Guhn Lee

Title: Learning Bilateral Team Formation in Cooperative Multi-Agent Reinforcement Learning

Abstract:
Team formation and the dynamics of team‑based learning have drawn significant interest in the context of Multi‑Agent Reinforcement Learning (MARL). However, existing studies primarily focus on unilateral groupings, predefined teams, or fixed‑population settings, leaving the effects of algorithmic bilateral grouping choices in dynamic populations underexplored. To address this gap, we introduce a framework for learning two‑sided team formation in dynamic multi‑agent systems. Through this study, we gain insight into what algorithmic properties in bilateral team formation influence policy performance and generalization. We validate our approach using widely adopted multi‑agent scenarios, demonstrating competitive performance and improved generalization in most scenarios.

PaperID: 9876, https://arxiv.org/pdf/2506.20031.pdf

Authors: Prithvi Poddar, Ehsan Tarkesh Esfahani, Karthik Dantu, Souma Chowdhury

Title: Automated Generation of Diverse Courses of Actions for Multi-Agent Operations using Binary Optimization and Graph Learning

Abstract:
Operations in disaster response, search \& rescue, and military missions that involve multiple agents demand automated processes to support the planning of the courses of action (COA). Moreover, traverse‑affecting changes in the environment (rain, snow, blockades, etc.) may impact the expected performance of a COA, making it desirable to have a pool of COAs that are diverse in task distributions across agents. Further, variations in agent capabilities, which could be human crews and/or autonomous systems, present practical opportunities and computational challenges to the planning process. This paper presents a new theoretical formulation and computational framework to generate such diverse pools of COAs for operations with soft variations in agent‑task compatibility. Key to the problem formulation is a graph abstraction of the task space and the pool of COAs itself to quantify its diversity. Formulating the COAs as a centralized multi‑robot task allocation problem, a genetic algorithm is used for (order‑ignoring) allocations of tasks to each agent that jointly maximize diversity within the COA pool and overall compatibility of the agent‑task mappings. A graph neural network is trained using a policy gradient approach to then perform single agent task sequencing in each COA, which maximizes completion rates adaptive to task features. Our tests of the COA generation process in a simulated environment demonstrate significant performance gain over a random walk baseline, small optimality gap in task sequencing, and execution time of about 50 minutes to plan up to 20 COAs for 5 agent/100 task operations.

PaperID: 9877, https://arxiv.org/pdf/2506.20008.pdf

Authors: Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

Title: QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane‑based quantum code generation using real‑world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval‑Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG‑enhanced models, supplemented with an augmented PennyLane dataset, approximately generate similar results as the standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi‑agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI‑assisted quantum programming.

PaperID: 9878, https://arxiv.org/pdf/2506.19846.pdf

Authors: Junxing Hu, Ai Han, Haolan Zhan, Pu Wei, Zhiqian Zhang, Yuhang Guo, Jiawei Lu, Zhen Chen, Haoran Li, Zicheng Zhang

Title: HiMA-Ecom: Enabling Joint Training of Hierarchical Multi-Agent E-commerce Assistants

Abstract:
Hierarchical multi‑agent systems based on large language models (LLMs) have become a common paradigm for building AI assistants in vertical domains such as e‑commerce, where a master agent coordinates multiple specialized sub‑agents. Despite their practical importance, realistic benchmarks for training and evaluating such systems remain scarce, and joint optimization across functionally distinct agents is still challenging. To address this gap, we introduce HiMA‑Ecom, the first hierarchical multi‑agent benchmark tailored for e‑commerce scenarios. HiMA‑Ecom contains 22.8K instances, including agent‑specific supervised fine‑tuning samples with memory and system‑level input‑output pairs for joint multi‑agent reinforcement learning. Building upon it, a joint training method named HiMA‑R1 is proposed. It presents Variance‑Reduction Group Relative Policy Optimization (VR‑GRPO), which employs initial trajectory‑based Monte Carlo sampling to mitigate the exponential joint action space and selects informative agent groups for efficient updates based on reward variance. Furthermore, an adaptive memory evolution mechanism that repurposes GRPO rewards as cost‑free supervisory signals is designed to eliminate repetitive reasoning and accelerate convergence. Experiments on HiMA‑Ecom demonstrate that our method, built upon smaller 3B/7B open‑source models, achieves performance comparable to that of larger LLMs, such as DeepSeek‑R1, and surpasses DeepSeek‑V3 by an average of 6%.

PaperID: 9879, https://arxiv.org/pdf/2506.19773.pdf

Authors: Nandana Mihindukulasooriya, Niharika S. D'Souza, Faisal Chowdhury, Horst Samulowitz

Title: Automatic Prompt Optimization for Knowledge Graph Construction: Insights from an Empirical Study

Abstract:
A KG represents a network of entities and illustrates relationships between them. KGs are used for various applications, including semantic search and discovery, reasoning, decision‑making, natural language processing, machine learning, and recommendation systems. Triple (subject‑relation‑object) extraction from text is the fundamental building block of KG construction and has been widely studied, for example, in early benchmarks such as ACE 2002 to more recent ones, such as WebNLG 2020, REBEL and SynthIE. While the use of LLMs is explored for KG construction, handcrafting reasonable task‑specific prompts for LLMs is a labour‑intensive exercise and can be brittle due to subtle changes in the LLM models employed. Recent work in NLP tasks (e.g. autonomy generation) uses automatic prompt optimization/engineering to address this challenge by generating optimal or near‑optimal task‑specific prompts given input‑output examples. This empirical study explores the application of automatic prompt optimization for the triple extraction task using experimental benchmarking. We evaluate different settings by changing (a) the prompting strategy, (b) the LLM being used for prompt optimization and task execution, (c) the number of canonical relations in the schema (schema complexity), (d) the length and diversity of input text, (e) the metric used to drive the prompt optimization, and (f) the dataset being used for training and testing. We evaluate three different automatic prompt optimizers, namely, DSPy, APE, and TextGrad and use two different triple extraction datasets, SynthIE and REBEL. Through rigorous empirical evaluation, our main contribution highlights that automatic prompt optimization techniques can generate reasonable prompts similar to humans for triple extraction. In turn, these optimized prompts achieve improved results, particularly with increasing schema complexity and text size.

PaperID: 9880, https://arxiv.org/pdf/2506.19769.pdf

Authors: Shulan Ruan, Rongwei Wang, Xuchen Shen, Huijie Liu, Baihui Xiao, Jun Shi, Kun Zhang, Zhenya Huang, Yu Liu, Enhong Chen, You He

Title: A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects

Abstract:
Multi‑sensor fusion perception (MSFP) is a key technology for embodied AI, which can serve a variety of downstream tasks (e.g., 3D object detection and semantic segmentation) and application scenarios (e.g., autonomous driving and swarm robotics). Recently, impressive achievements on AI‑based MSFP methods have been reviewed in relevant surveys. However, we observe that the existing surveys have some limitations after a rigorous and detailed investigation. For one thing, most surveys are oriented to a single task or research field, such as 3D object detection or autonomous driving. Therefore, researchers in other related tasks often find it difficult to benefit directly. For another, most surveys only introduce MSFP from a single perspective of multi‑modal fusion, while lacking consideration of the diversity of MSFP methods, such as multi‑view fusion and time‑series fusion. To this end, in this paper, we hope to organize MSFP research from a task‑agnostic perspective, where methods are reported from various technical views. Specifically, we first introduce the background of MSFP. Next, we review multi‑modal and multi‑agent fusion methods. A step further, time‑series fusion methods are analyzed. In the era of LLM, we also investigate multimodal LLM fusion methods. Finally, we discuss open challenges and future directions for MSFP. We hope this survey can help researchers understand the important progress in MSFP and provide possible insights for future research.

PaperID: 9881, https://arxiv.org/pdf/2506.19655.pdf

Authors: Yutong Bu, Andrew Melatos, Robin Evans

Title: How trust networks shape students' opinions about the proficiency of artificially intelligent assistants

Abstract:
The rising use of educational tools controlled by artificial intelligence (AI) has provoked a debate about their proficiency. While intrinsic proficiency, especially in tasks such as grading, has been measured and studied extensively, perceived proficiency remains underexplored. Here it is shown through Monte Carlo multi‑agent simulations that trust networks among students influence their perceptions of the proficiency of an AI tool. A probabilistic opinion dynamics model is constructed, in which every student's perceptions are described by a probability density function (PDF), which is updated at every time step through independent, personal observations and peer pressure shaped by trust relationships. It is found that students infer correctly the AI tool's proficiency θ_\rm AI in allies‑only networks (i.e.\ high trust networks). AI‑avoiders reach asymptotic learning faster than AI‑users, and the asymptotic learning time for AI‑users decreases as their number increases. However, asymptotic learning is disrupted even by a single partisan, who is stubbornly incorrect in their belief θ_\rm p \neq θ_\rm AI, making other students' beliefs vacillate indefinitely between θ_\rm p and θ_\rm AI. In opponents‑only (low trust) networks, all students reach asymptotic learning, but only a minority infer θ_\rm AI correctly. AI‑users have a small advantage over AI‑avoiders in reaching the right conclusion. In mixed networks, students may exhibit turbulent nonconvergence and intermittency, or achieve asymptotic learning, depending on the relationships between partisans and AI‑users. The educational implications of the results are discussed briefly in the context of designing robust usage policies for AI tools, with an emphasis on the unintended and inequitable consequences which arise sometimes from counterintuitive network effects.

PaperID: 9882, https://arxiv.org/pdf/2506.19592.pdf

Authors: Harisankar Babu, Philipp Schillinger, Tamim Asfour

Title: Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning

Abstract:
We introduce TAPAS (Task‑based Adaptation and Planning using AgentS), a multi‑agent framework that integrates Large Language Models (LLMs) with symbolic planning to solve complex tasks without the need for manually defined environment models. TAPAS employs specialized LLM‑based agents that collaboratively generate and adapt domain models, initial states, and goal specifications as needed using structured tool‑calling mechanisms. Through this tool‑based interaction, downstream agents can request modifications from upstream agents, enabling adaptation to novel attributes and constraints without manual domain redefinition. A ReAct (Reason+Act)‑style execution agent, coupled with natural language plan translation, bridges the gap between dynamically generated plans and real‑world robot capabilities. TAPAS demonstrates strong performance in benchmark planning domains and in the VirtualHome simulated real‑world environment.

PaperID: 9883, https://arxiv.org/pdf/2506.19565.pdf

Authors: Shengyuan Huang, Xiaoguang Yang, Yifen Mu, Wenjun Mei

Title: On finite-horizon approximation of a feedback Nash equilibrium in LQ games

Abstract:
Dynamic games provide a fundamental framework for multi‑agent decision‑making over time, yet computing feedback Nash equilibria (FNEs) in infinite‑horizon discrete‑time linear‑quadratic (LQ) settings remains computationally challenging. Motivated by the need for tractable and implementable strategies, this paper studies a finite‑horizon strategy for approximating a certain infinite‑horizon equilibrium. Specifically, at each stage, each player solves a T‑stage game and implements only the first‑stage control, thereby avoiding the direct solution of coupled infinite‑horizon Riccati equations. We first analyze the finite‑horizon game and characterize the structure of the associated coupled generalized discrete Riccati difference equations. Based on this analysis, we establish a sufficient condition for uniqueness of the FNE and propose an efficient algorithm that computes it via a sequence of linear equations. We then consider the infinite‑horizon game in which players adopt the finite‑horizon strategies with heterogeneous prediction horizons and show that, under suitable conditions, the total cost under the finite‑horizon strategies converges to the cost under the limiting infinite‑horizon FNE. Moreover, we derive an explicit upper bound on this cost gap in terms of the distance between the corresponding strategy matrices. These results provide theoretical justification and quantitative performance guarantees for finite‑horizon strategies in infinite‑horizon LQ dynamic games. A nonscalar numerical example illustrates the effectiveness of the proposed framework.

PaperID: 9884, https://arxiv.org/pdf/2506.19481.pdf

Authors: Shahbaz Siddeeq, Muhammad Waseem, Zeeshan Rasheed, Md Mahade Hasan, Jussi Rasku, Mika Saari, Henri Terho, Kalle Makela, Kai-Kristian Kemell, Pekka Abrahamsson

Title: LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code

Abstract:
Refactoring is a constant activity in software development and maintenance. Scale and maintain software systems are based on code refactoring. However, this process is still labor intensive, as it requires programmers to analyze the codebases in detail to avoid introducing new defects. In this research, we put forward a large language model (LLM)‑based multi‑agent system to automate the refactoring process on Haskell code. The objective of this research is to evaluate the effect of LLM‑based agents in performing structured and semantically accurate refactoring on Haskell code. Our proposed multi‑agent system based on specialized agents with distinct roles, including code analysis, refactoring execution, verification, and debugging. To test the effectiveness and practical applicability of the multi‑agent system, we conducted evaluations using different open‑source Haskell codebases. The results of the experiments carried out showed that the proposed LLM‑based multi‑agent system could average 11.03% decreased complexity in code, an improvement of 22.46% in overall code quality, and increase performance efficiency by an average of 13.27%. Furthermore, memory allocation was optimized by up to 14.57%. These results highlight the ability of LLM‑based multi‑agent in managing refactoring tasks targeted toward functional programming paradigms. Our findings hint that LLM‑based multi‑agent systems integration into the refactoring of functional programming languages can enhance maintainability and support automated development workflows.

PaperID: 9885, https://arxiv.org/pdf/2506.19420.pdf

Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin, Prayag Tiwari

Title: Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Abstract:
Multimodal sarcasm understanding is a high‑order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander‑GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM's capability, Commander‑GPT orchestrates a team of specialized LLM agents where each agent will be selectively assigned to a focused sub‑task such as keyword extraction, sentiment analysis, etc. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment. To coordinate these agents, we introduce three types of centralized commanders: (1) a trained lightweight encoder‑based commander (e.g., multi‑modal BERT); (2) four small autoregressive language models, serving as moderately capable commanders (e.g., DeepSeek‑VL); (3) two large LLM‑based commander (Gemini Pro and GPT‑4o) that performs task routing, output aggregation, and sarcasm decision‑making in a zero‑shot fashion. We evaluate Commander‑GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 11.7% improvement in F1 score over state‑of‑the‑art (SoTA) baselines on average, demonstrating its effectiveness.

PaperID: 9886, https://arxiv.org/pdf/2506.19417.pdf

Authors: Yisak Park, Sunwoo Lee, Seungyul Han

Title: Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) under sparse rewards remains fundamentally challenging because agents often fail to concentrate their influence, leading to insufficiently coordinated exploration. To address this, we propose the Focusing Influence Mechanism (FIM), a framework that encourages agents to focus their influence on under‑explored parts of the state space through an entropy‑based criterion, while leveraging eligibility traces to enable multiple agents to consistently align and sustain their influence on the same parts of the state space when beneficial, thereby promoting coordinated and persistent joint behavior. By emphasizing under‑explored regions of the state space, FIM facilitates more efficient and structured exploration even under extremely sparse rewards. Across diverse MARL benchmarks, FIM consistently improves cooperative performance over strong baselines.

PaperID: 9887, https://arxiv.org/pdf/2506.19290.pdf

Authors: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou

Title: Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Abstract:
Software engineering (SWE) has recently emerged as a crucial testbed for next‑generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem‑solving (e.g., >50 interaction rounds) and long‑context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time‑consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub‑sourced instances. To this end, we propose an incremental, automated data‑curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real‑world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime‑environment image for automated unit‑test validation. We have carefully curated over 8,000 successfully runtime‑validated training trajectories from our proposed SWE dataset. When fine‑tuning the Skywork‑SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork‑SWE model achieves 38.0% pass@1 accuracy on the SWE‑bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state‑of‑the‑art (SOTA) among the Qwen2.5‑Coder‑32B‑based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test‑time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub‑32B parameter models. We release the Skywork‑SWE‑32B model checkpoint to accelerate future research.

PaperID: 9888, https://arxiv.org/pdf/2506.19235.pdf

Authors: Yu Xie, Xingkai Ren, Ying Qi, Yao Hu, Lianlei Shan

Title: RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1

Abstract:
Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM‑R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework initiates by transforming user profiles, historical interactions, and multi‑faceted item attributes into LLM‑interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two‑stage training paradigm is employed: the initial stage involves Supervised Fine‑Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with a Chain‑of‑Thought (CoT) mechanism. This stage guides the model through multi‑step reasoning and holistic decision‑making via a flexibly defined reward function, aiming to concurrently optimize recommendation accuracy, diversity, and other bespoke business objectives. Empirical evaluations on a real‑world user behavior dataset from a large‑scale social media platform demonstrate that RecLLM‑R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty. It effectively mitigates the filter bubble effect and presents a promising avenue for the integrated optimization of recommendation models and policies under intricate business goals.

PaperID: 9889, https://arxiv.org/pdf/2506.19209.pdf

Authors: Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai

Title: Augmenting Multi-Agent Communication with State Delta Trajectory

Abstract:
Multi‑agent techniques such as role playing or multi‑turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing multi‑agent systems constructed from a single base LLM mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss as one model must down sample its continuous state vectors to discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logics or abstractive thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token‑wise state transition trajectory from one agent to another. Particularly, compared to the actual state value, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process. We propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi‑agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning.

PaperID: 9890, https://arxiv.org/pdf/2506.19171.pdf

Authors: Xingyue Huang, Xianglong Hu, Zifeng Ding, Yuan He, Rishabh, Waleed Alzarooni, Ziyu Ye, Wendong Fan, Bailan He, Haige Bo, Changran Hu, Guohao Li

Title: Distilling Tool Knowledge into Language Models via Back-Translated Traces

Abstract:
Large language models (LLMs) often struggle with mathematical problems that require exact computation or multi‑step algebraic reasoning. Tool‑integrated reasoning (TIR) offers a promising solution by leveraging external tools such as code interpreters to ensure correctness, but it introduces inference‑time dependencies that hinder scalability and deployment. In this work, we propose a new paradigm for distilling tool knowledge into LLMs purely through natural language. We first construct a Solver Agent that solves math problems by interleaving planning, symbolic tool calls, and reflective reasoning. Then, using a back‑translation pipeline powered by multiple LLM‑based agents, we convert interleaved TIR traces into natural language reasoning traces. A Translator Agent generates explanations for individual tool calls, while a Rephrase Agent merges them into a fluent and globally coherent narrative. Empirically, we show that fine‑tuning a small open‑source model on these synthesized traces enables it to internalize both tool knowledge and structured reasoning patterns, yielding gains on competition‑level math benchmarks without requiring tool access at inference.

PaperID: 9891, https://arxiv.org/pdf/2506.19160.pdf

Authors: Mohammad Narimani, Seyyed Ali Emami

Title: AgenticControl: An Automated Control Design Framework Using Large Language Models

Abstract:
Traditional control system design, reliant on expert knowledge and precise models, struggles with complex, nonlinear, or uncertain dynamics. This paper introduces AgenticControl, a novel multi‑agent framework that automates controller design using coordinated Large Language Model (LLM) agents. Through structured JSON communication, these agents handle tasks including controller selection, scenario design, parameter optimization, performance evaluation, and decision‑making. Through an actor‑critic optimization approach, the system iteratively improves performance while progressing through scenarios of increasing complexity to ensure robustness under nominal conditions, measurement noise, actuator disturbances, and parametric uncertainties. Key innovations include structured multi‑agent collaboration, robust optimization mechanisms, and real‑time adaptability via in‑context learning. Validated across four diverse control systems, namely, DC Motor Position control, Ball and Beam, Inverted Pendulum, and Double Inverted Pendulum, the framework achieves competitive performance against classical methods. Its Full State Feedback solution closely matches Linear Quadratic Regulator (LQR) results, while the designed PID controller significantly outperforming MATLAB's PIDTuner, reducing PID tracking error by 55% through adaptive parameter exploration. A comparative study of five LLM models reveals distinct optimization profiles, with DeepSeek achieving the fastest convergence. This work demonstrates the potential of LLM‑driven control design, paving the way for advanced techniques like model predictive control and reinforcement learning.

PaperID: 9892, https://arxiv.org/pdf/2506.18942.pdf

Authors: Simon Hatzesberger, Iris Nonneman

Title: Advanced Applications of Generative AI in Actuarial Science: Case Studies Beyond ChatGPT

Abstract:
This article explores the potential of generative AI (GenAI) to support actuarial practice through four implemented case studies. It situates these case studies within the broader evolution of artificial intelligence in actuarial science, from early neural networks and machine learning to modern transformer‑based GenAI systems. The first case study illustrates how large language models (LLMs) can improve claim cost prediction by extracting informative features from unstructured text for use in the underlying supervised learning task. The second case study demonstrates the automation of market comparisons using Retrieval‑Augmented Generation to identify, extract, and structure relevant information from insurers' annual reports. The third case study highlights the capabilities of fine‑tuned vision‑enabled LLMs in classifying car damage types and extracting contextual information from images. The fourth case study presents a multi‑agent system that autonomously migrates actuarial legacy code from R to Python and validates the translation against the original code's outputs. In addition to these case studies, we outline further GenAI applications in the insurance industry. Finally, we discuss the regulatory, security, dual‑use and fraud, reproducibility, privacy, governance, and organisational challenges associated with deploying GenAI in regulated insurance environments.

PaperID: 9893, https://arxiv.org/pdf/2506.18900.pdf

Authors: Kiymet Akdemir, Tahira Kazimi, Pinar Yanardag

Title: Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models

Abstract:
Story visualization has become a popular task where visual scenes are generated to depict a narrative across multiple panels. A central challenge in this setting is maintaining visual consistency, particularly in how characters and objects persist and evolve throughout the story. Despite recent advances in diffusion models, current approaches often fail to preserve key character attributes, leading to incoherent narratives. In this work, we propose a collaborative multi‑agent framework that autonomously identifies, corrects, and refines inconsistencies across multi‑panel story visualizations. The agents operate in an iterative loop, enabling fine‑grained, panel‑level updates without re‑generating entire sequences. Our framework is model‑agnostic and flexibly integrates with a variety of diffusion models, including rectified flow transformers such as Flux and latent diffusion models such as Stable Diffusion. Quantitative and qualitative experiments show that our method outperforms prior approaches in terms of multi‑panel consistency.

PaperID: 9894, https://arxiv.org/pdf/2506.18885.pdf

Authors: Annika Thomas, Aneesa Sonawalla, Alex Rose, Jonathan P. How

Title: GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM

Abstract:
3D Gaussian splatting has emerged as an expressive scene representation for RGB‑D visual SLAM, but its application to large‑scale, multi‑agent outdoor environments remains unexplored. Multi‑agent Gaussian SLAM is a promising approach to rapid exploration and reconstruction of environments, offering scalable environment representations, but existing approaches are limited to small‑scale, indoor environments. To that end, we propose Gaussian Reconstruction via Multi‑Agent Dense SLAM, or GRAND‑SLAM, a collaborative Gaussian splatting SLAM method that integrates i) an implicit tracking module based on local optimization over submaps and ii) an approach to inter‑ and intra‑robot loop closure integrated into a pose‑graph optimization framework. Experiments show that GRAND‑SLAM provides state‑of‑the‑art tracking performance and 28% higher PSNR than existing methods on the Replica indoor dataset, as well as 91% lower multi‑agent tracking error and improved rendering over existing multi‑agent methods on the large‑scale, outdoor Kimera‑Multi dataset.

PaperID: 9895, https://arxiv.org/pdf/2506.18814.pdf

Authors: Anas Barakat, John Lazarsfeld, Georgios Piliouras, Antonios Varvitsiotis

Title: Online Multi-Agent Control with Adversarial Disturbances

Abstract:
Online multi‑agent control problems, where many agents pursue competing and time‑varying objectives, are widespread in domains such as autonomous robotics, economics, and energy systems. In these settings, robustness to adversarial disturbances is critical. In this paper, we study online control in multi‑agent linear dynamical systems subject to such disturbances. In contrast to most prior work in multi‑agent control, which typically assumes noiseless or stochastically perturbed dynamics, we consider an online setting where disturbances can be adversarial, and where each agent seeks to minimize its own sequence of convex losses. Under two feedback models, we analyze online gradient‑based controllers with local policy updates. We prove per‑agent regret bounds that are sublinear and near‑optimal in the time horizon and that highlight different scalings with the number of agents. When agents' objectives are aligned, we further show that the multi‑agent control problem induces a time‑varying potential game for which we derive equilibrium tracking guarantees. Together, our results take a first step in bridging online control with online learning in games, establishing robust individual and collective performance guarantees in dynamic continuous‑state environments.

PaperID: 9896, https://arxiv.org/pdf/2506.18783.pdf

Authors: Kamil Szczepanik, Jarosław A. Chudziak

Title: TRIZ Agents: A Multi-Agent LLM Approach for TRIZ-Based Innovation

Abstract:
TRIZ, the Theory of Inventive Problem Solving, is a structured, knowledge‑based framework for innovation and abstracting problems to find inventive solutions. However, its application is often limited by the complexity and deep interdisciplinary knowledge required. Advancements in Large Language Models (LLMs) have revealed new possibilities for automating parts of this process. While previous studies have explored single LLMs in TRIZ applications, this paper introduces a multi‑agent approach. We propose an LLM‑based multi‑agent system, called TRIZ agents, each with specialized capabilities and tool access, collaboratively solving inventive problems based on the TRIZ methodology. This multi‑agent system leverages agents with various domain expertise to efficiently navigate TRIZ steps. The aim is to model and simulate an inventive process with language agents. We assess the effectiveness of this team of agents in addressing complex innovation challenges based on a selected case study in engineering. We demonstrate the potential of agent collaboration to produce diverse, inventive solutions. This research contributes to the future of AI‑driven innovation, showcasing the advantages of decentralized problem‑solving in complex ideation tasks.

PaperID: 9897, https://arxiv.org/pdf/2506.18679.pdf

Authors: Ruicheng Zhang, Yu Sun, Zeyu Zhang, Jinai Li, Xiaofan Liu, Au Hoi Fan, Haowei Guo, Puxin Yan

Title: MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation

Abstract:
We introduce MARL‑MambaContour, the first contour‑based medical image segmentation framework based on Multi‑Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi‑agent cooperation task focused on generate topologically consistent object‑level contours, addressing the limitations of traditional pixel‑based methods which could lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour‑specific Soft Actor‑Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM) which dynamically balance agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba‑based policy network featuring a novel Bidirectional Cross‑attention Hidden‑state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long‑range modeling in state space models, thereby facilitating more accurate inter‑agent information exchange and informed decision‑making. Extensive experiments on five diverse medical imaging datasets demonstrate the state‑of‑the‑art performance of MARL‑MambaContour, highlighting its potential as an accurate and robust clinical application.

PaperID: 9898, https://arxiv.org/pdf/2506.18651.pdf

Authors: Shuocun Yang, Huawen Hu, Xuan Liu, Yincheng Yao, Enze Shi, Shu Zhang

Title: Structured Diversity Control: A Dual-Level Framework for Group-Aware Multi-Agent Coordination

Abstract:
Controlling the behavioral diversity is a pivotal challenge in multi‑agent reinforcement learning (MARL), particularly in complex collaborative scenarios. While existing methods attempt to regulate behavioral diversity by directly differentiating across all agents, they lack deep characterization and learning of multi‑agent composition structures. This limitation leads to suboptimal performance or coordination failures when facing more complex or challenging tasks. To bridge this gap, we introduce Structured Diversity Control (SDC), a framework that redefines the system‑wide diversity metric as a weighted combination of intra‑group diversity, which is minimized for cohesion and inter‑group diversity, which is maximized for specialization. The trade‑off is governed by a pre‑set Diversity Structure Factor (DSF), allowing for fine‑grained, group‑aware control over the collective strategy. Our method directly constrains the policy architecture without altering reward functions. This structural definition of diversity enables SDC to deliver substantial performance gains across various experiments, including increasing average rewards by up to 47.1% in multi‑target pursuit and reducing episode lengths by 12.82% in complex neutralization scenarios. The proposed method offers a novel analytical perspective on the problem of cooperation in group‑aware multi‑agent systems.

PaperID: 9899, https://arxiv.org/pdf/2506.18627.pdf

Authors: Yannik Mahlau, Maximilian Schier, Christoph Reinders, Frederik Schubert, Marco Bügling, Bodo Rosenhahn

Title: Multi-Agent Reinforcement Learning for Inverse Design in Photonic Integrated Circuits

Abstract:
Inverse design of photonic integrated circuits (PICs) has traditionally relied on gradientbased optimization. However, this approach is prone to end up in local minima, which results in suboptimal design functionality. As interest in PICs increases due to their potential for addressing modern hardware demands through optical computing, more adaptive optimization algorithms are needed. We present a reinforcement learning (RL) environment as well as multi‑agent RL algorithms for the design of PICs. By discretizing the design space into a grid, we formulate the design task as an optimization problem with thousands of binary variables. We consider multiple two‑ and three‑dimensional design tasks that represent PIC components for an optical computing system. By decomposing the design space into thousands of individual agents, our algorithms are able to optimize designs with only a few thousand environment samples. They outperform previous state‑of‑the‑art gradient‑based optimization in both twoand three‑dimensional design tasks. Our work may also serve as a benchmark for further exploration of sample‑efficient RL for inverse design in photonics.

PaperID: 9900, https://arxiv.org/pdf/2506.18600.pdf

Authors: Ariel Flint Ashery, Luca Maria Aiello, Andrea Baronchelli

Title: Reply to "Emergent LLM behaviors are observationally equivalent to data leakage"

Abstract:
A potential concern when simulating populations of large language models (LLMs) is data contamination, i.e. the possibility that training data may shape outcomes in unintended ways. While this concern is important and may hinder certain experiments with multi‑agent models, it does not preclude the study of genuinely emergent dynamics in LLM populations. The recent critique by Barrie and Törnberg [1] of the results of Flint Ashery et al. [2] offers an opportunity to clarify that self‑organisation and model‑dependent emergent dynamics can be studied in LLM populations, highlighting how such dynamics have been empirically observed in the specific case of social conventions.

PaperID: 9901, https://arxiv.org/pdf/2506.18537.pdf

Authors: Azad Deihim, Eduardo Alonso, Dimitra Apostolopoulou

Title: Transformer World Model for Sample Efficient Multi-Agent Reinforcement Learning

Abstract:
We present the Multi‑Agent Transformer World Model (MATWM), a novel transformer‑based world model designed for multi‑agent reinforcement learning in both vector‑ and image‑based environments. MATWM combines a decentralized imagination framework with a semi‑centralized critic and a teammate prediction module, enabling agents to model and anticipate the behavior of others under partial observability. To address non‑stationarity, we incorporate a prioritized replay mechanism that trains the world model on recent experiences, allowing it to adapt to agents' evolving policies. We evaluated MATWM on a broad suite of benchmarks, including the StarCraft Multi‑Agent Challenge, PettingZoo, and MeltingPot. MATWM achieves state‑of‑the‑art performance, outperforming both model‑free and prior world model approaches, while demonstrating strong sample efficiency, achieving near‑optimal performance in as few as 50K environment interactions. Ablation studies confirm the impact of each component, with substantial gains in coordination‑heavy tasks.

PaperID: 9902, https://arxiv.org/pdf/2506.18466.pdf

Authors: Matti Krüger, Daniel Tanneberg, Chao Wang, Stephan Hasler, Michael Gienger

Title: Mirror Eyes: Explainable Human-Robot Interaction at a Glance

Abstract:
The gaze of a person tends to reflect their interest. This work explores what happens when this statement is taken literally and applied to robots. Here we present a robot system that employs a moving robot head with a screen‑based eye model that can direct the robot's gaze to points in physical space and present a reflection‑like mirror image of the attended region on top of each eye. We conducted a user study with 33 participants, who were asked to instruct the robot to perform pick‑and‑place tasks, monitor the robot's task execution, and interrupt it in case of erroneous actions. Despite a deliberate lack of instructions about the role of the eyes and a very brief system exposure, participants felt more aware about the robot's information processing, detected erroneous actions earlier, and rated the user experience higher when eye‑based mirroring was enabled compared to non‑reflective eyes. These results suggest a beneficial and intuitive utilization of the introduced method in cooperative human‑robot interaction.

PaperID: 9903, https://arxiv.org/pdf/2506.18424.pdf

Authors: Chengjie Liu, Weiyu Chen, Huiyao Xu, Yuan Du, Jun Yang, Li Du

Title: A Large Language Model-based Multi-Agent Framework for Analog Circuits' Sizing Relationships Extraction

Abstract:
In the design process of the analog circuit pre‑layout phase, device sizing is an important step in determining whether an analog circuit can meet the required performance metrics. Many existing techniques extract the circuit sizing task as a mathematical optimization problem to solve and continuously improve the optimization efficiency from a mathematical perspective. But they ignore the automatic introduction of prior knowledge, fail to achieve effective pruning of the search space, which thereby leads to a considerable compression margin remaining in the search space. To alleviate this problem, we propose a large language model (LLM)‑based multi‑agent framework for analog circuits' sizing relationships extraction from academic papers. The search space in the sizing process can be effectively pruned based on the sizing relationship extracted by this framework. Eventually, we conducted tests on 3 types of circuits, and the optimization efficiency was improved by 2.32 ～ 26.6 ×. This work demonstrates that the LLM can effectively prune the search space for analog circuit sizing, providing a new solution for the combination of LLMs and conventional analog circuit design automation methods.

PaperID: 9904, https://arxiv.org/pdf/2506.18348.pdf

Authors: Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang

Title: Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team

Abstract:
Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM‑based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real‑world research. We propose IDVSCI (Internal Discussion and Vote SCIentists), a multi‑agent framework built on LLMs that incorporates two key innovations: a Dynamic Knowledge Exchange mechanism enabling iterative feedback among agents, and a Dual‑Diversity Review paradigm that simulates heterogeneous expert evaluation. These components jointly promote deeper reasoning and the generation of more creative and impactful scientific ideas. To evaluate the effectiveness and generalizability of our approach, we conduct experiments on two datasets: a widely used benchmark in computer science and a new dataset we introduce in the health sciences domain. Results show that IDVSCI consistently achieves the best performance across both datasets, outperforming existing systems such as AI Scientist and VIRSCI. These findings highlight the value of modeling interaction and peer review dynamics in LLM‑based autonomous research.

PaperID: 9905, https://arxiv.org/pdf/2506.18260.pdf

Authors: FuTe Wong

Title: Advanced For-Loop for QML algorithm search

Abstract:
This paper introduces an advanced framework leveraging Large Language Model‑based Multi‑Agent Systems (LLMMA) for the automated search and optimization of Quantum Machine Learning (QML) algorithms. Inspired by Google DeepMind's FunSearch, the proposed system works on abstract level to iteratively generates and refines quantum transformations of classical machine learning algorithms (concepts), such as the Multi‑Layer Perceptron, forward‑forward and backpropagation algorithms. As a proof of concept, this work highlights the potential of agentic frameworks to systematically explore classical machine learning concepts and adapt them for quantum computing, paving the way for efficient and automated development of QML algorithms. Future directions include incorporating planning mechanisms and optimizing strategy in the search space for broader applications in quantum‑enhanced machine learning.

PaperID: 9906, https://arxiv.org/pdf/2506.18183.pdf

Authors: Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar

Title: Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?

Abstract:
Reasoning language models have set state‑of‑the‑art (SOTA) records on many challenging benchmarks, enabled by multi‑step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real‑world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well‑calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans' innate ability to double‑check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain‑of‑thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self‑verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3‑Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.

PaperID: 9907, https://arxiv.org/pdf/2506.18149.pdf

Authors: Fumian Chen, Sotheara Veng, Joshua Wilson, Xiaoming Li, Hui Fang

Title: CoachGPT: A Scaffolding-based Academic Writing Assistant

Abstract:
Academic writing skills are crucial for students' success, but can feel overwhelming without proper guidance and practice, particularly when writing in a second language. Traditionally, students ask instructors or search dictionaries, which are not universally accessible. Early writing assistants emerged as rule‑based systems that focused on detecting misspellings, subject‑verb disagreements, and basic punctuation errors; however, they are inaccurate and lack contextual understanding. Machine learning‑based assistants demonstrate a strong ability for language understanding but are expensive to train. Large language models (LLMs) have shown remarkable capabilities in generating responses in natural languages based on given prompts. Still, they have a fundamental limitation in education: they generate essays without teaching, which can have detrimental effects on learning when misused. To address this limitation, we develop CoachGPT, which leverages large language models (LLMs) to assist individuals with limited educational resources and those who prefer self‑paced learning in academic writing. CoachGPT is an AI agent‑based web application that (1) takes instructions from experienced educators, (2) converts instructions into sub‑tasks, and (3) provides real‑time feedback and suggestions using large language models. This unique scaffolding structure makes CoachGPT unique among existing writing assistants. Compared to existing writing assistants, CoachGPT provides a more immersive writing experience with personalized feedback and guidance. Our user studies prove the usefulness of CoachGPT and the potential of large language models for academic writing.

PaperID: 9908, https://arxiv.org/pdf/2506.18126.pdf

Authors: Xiang Yuming, Li Sizhao, Li Rongpeng, Zhao Zhifeng, Zhang Honggang

Title: Decentralized Consensus Inference-based Hierarchical Reinforcement Learning for Multi-Constrained UAV Pursuit-Evasion Game

Abstract:
Multiple quadrotor unmanned aerial vehicle (UAV) systems have garnered widespread research interest and fostered tremendous interesting applications, especially in multi‑constrained pursuit‑evasion games (MC‑PEG). The Cooperative Evasion and Formation Coverage (CEFC) task, where the UAV swarm aims to maximize formation coverage across multiple target zones while collaboratively evading predators, belongs to one of the most challenging issues in MC‑PEG, especially under communication‑limited constraints. This multifaceted problem, which intertwines responses to obstacles, adversaries, target zones, and formation dynamics, brings up significant high‑dimensional complications in locating a solution. In this paper, we propose a novel two‑level framework (i.e., Consensus Inference‑based Hierarchical Reinforcement Learning (CI‑HRL)), which delegates target localization to a high‑level policy, while adopting a low‑level policy to manage obstacle avoidance, navigation, and formation. Specifically, in the high‑level policy, we develop a novel multi‑agent reinforcement learning module, Consensus‑oriented Multi‑Agent Communication (ConsMAC), to enable agents to perceive global information and establish consensus from local states by effectively aggregating neighbor messages. Meanwhile, we leverage an Alternative Training‑based Multi‑agent proximal policy optimization (AT‑M) and policy distillation to accomplish the low‑level control. The experimental results, including the high‑fidelity software‑in‑the‑loop (SITL) simulations, validate that CI‑HRL provides a superior solution with enhanced swarm's collaborative evasion and task completion capabilities.

PaperID: 9909, https://arxiv.org/pdf/2506.17990.pdf

Authors: Diego Deplano, Sergio Grammatico, Mauro Franceschelli

Title: Non-Euclidean Enriched Contraction Theory for Monotone Operators and Monotone Dynamical Systems

Abstract:
We adopt an operator‑theoretic perspective to analyze a class of nonlinear fixed‑point iterations and discrete‑time dynamical systems. Specifically, we study the Krasnoselskij iteration ‑ at the heart of countless algorithmic schemes and underpinning the stability analysis of numerous dynamical models ‑ by focusing on a non‑Euclidean vector space equipped with the diagonally weighted supremum norm. By extending the state of the art, we introduce the notion of enriched weak contractivity, which (i) is characterized by a simple, verifiable condition for Lipschitz operators, and (ii) yields explicit bounds on the admissible step size for the Krasnoselskij iteration. Our results relate the notion of weak contractivity with that of monotonicity of operators and dynamical systems and show its generality to design larger step sizes and improved convergence speed for broader classes of dynamical systems. The newly developed theory is illustrated on two applications: the design of zero‑finding algorithms for monotone operators and the design of nonlinear consensus dynamics in monotone multi‑agent dynamical systems.

PaperID: 9910, https://arxiv.org/pdf/2506.17873.pdf

Authors: Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

Title: SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Abstract:
Surgical scene understanding is critical for surgical training and robotic decision‑making in robot‑assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons to understand surgical scenes and procedures. However, these methods are primarily oriented towards image‑based analysis or global video understanding, overlooking the fine‑grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine‑grained surgical video comprehension. To train our SurgVidLM, we construct the SVU‑31K that is a large‑scale dataset with over 31K video‑instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two‑stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high‑frequency local analysis guided by temporal cues. We also develop the Multi‑frequency Fusion Attention to effectively integrate low‑ and high‑frequency visual tokens, ensuring the preservation of critical task‑specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state‑of‑the‑art Vid‑LLMs of comparable parameter scale in both full and fine‑grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot‑assisted surgeries. Our code and dataset will be publicly accessible soon.

PaperID: 9911, https://arxiv.org/pdf/2506.17784.pdf

Authors: Song Wang, Zhen Tan, Zihan Chen, Shuang Zhou, Tianlong Chen, Jundong Li

Title: AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction

Abstract:
Recent progress in large language model (LLM)‑based multi‑agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph‑based inter‑agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi‑agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi‑agent communication. Our method focuses on two key directions: (1) Next‑Agent Prediction, which selects the most suitable agent role at each step, and (2) Next‑Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task‑adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.

PaperID: 9912, https://arxiv.org/pdf/2506.17772.pdf

Authors: Haoran Xue, Gias Uddin, Song Wang

Title: PAGENT: Learning to Patch Software Engineering Agents

Abstract:
LLM Agents produce patches automatically to resolve an issue. However, they can generate inaccurate patches. Little is known about the root causes behind those failed patches or how those could be fixed. This paper reports an empirical study of the failed patches generated by seven top LLM code agents. We collected 114 issues from the SWE‑bench Lite dataset that remained unresolved across the agents. The seven agents produced a total of 769 failed patches for those issues, which we checked with a combination of GPT‑4o and manual analysis. We present a taxonomy of the failure reasons across the patches. The taxonomy contains six categories, with several sub‑categories under each category. For example, a frequently observed category is the inability of an LLM to correctly infer/produce the appropriate variable type in the produced patch. As a first step towards addressing such type‑related errors, we designed PAGENT (Patch Agent). PAGENT utilizes program analysis techniques like CFG creation and exploration to infer the type of information of a patch. PAGENT does this by applying repository‑level static code analysis techniques. Then, PAGENT refines the inferred type by further utilizing an LLM‑based inference technique. We tested PAGENT on all 127 type‑related failed patches from the top three agents in our study. PAGENT could fix 29 of the 127 failed patches.

PaperID: 9913, https://arxiv.org/pdf/2506.17765.pdf

Authors: Jiao Chen, Kehui Yao, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan

Title: CARTS: Collaborative Agents for Recommendation Textual Summarization

Abstract:
Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets, adhere to strict word limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi‑agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages‑Generation Augmented Generation (GAG), refinement circle, and arbitration, where successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large‑scale e‑commerce data and live A/B testing show that CARTS significantly outperforms single‑pass and chain‑of‑thought LLM baselines, delivering higher title relevance and improved user engagement metrics.

PaperID: 9914, https://arxiv.org/pdf/2506.17589.pdf

Authors: Bowen Wang, Zhouqiang Jiang, Yasuaki Susumu, Shotaro Miwa, Tianwei Chen, Yuta Nakashima

Title: Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Abstract:
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressing multimodal capabilities, they often fail in rarely encountered domain‑specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select Monster Hunter: World as the target to construct a multimodal knowledge graph (MH‑MMKG), which incorporates multi‑modalities and intricate entity relations. We also design a series of challenging queries based on MH‑MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi‑agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge‑augmented reasoning and laying a solid foundation for future research.

PaperID: 9915, https://arxiv.org/pdf/2506.17560.pdf

Authors: Ava Abderezaei, Chi-Hui Lin, Joseph Miceli, Naren Sivagnanadasan, Stéphane Aroca-Ouellette, Jake Brawer, Alessandro Roncone

Title: Towards Zero-Shot Coordination between Teams of Agents: The N-XPlay Framework

Abstract:
Zero‑shot coordination (ZSC) ‑‑ the ability to collaborate with unfamiliar partners ‑‑ is essential to making autonomous agents effective teammates. Existing ZSC methods evaluate coordination capabilities between two agents who have not previously interacted. However, these scenarios do not reflect the complexity of real‑world multi‑agent systems, where coordination often involves a hierarchy of sub‑groups and interactions between teams of agents, known as Multi‑Team Systems (MTS). To address this gap, we first introduce N‑player Overcooked, an N‑agent extension of the popular two‑agent ZSC benchmark, enabling evaluation of ZSC in N‑agent scenarios. We then propose N‑XPlay for ZSC in N‑agent, multi‑team settings. Comparison against Self‑Play across two‑, three‑ and five‑player Overcooked scenarios, where agents are split between an ``ego‑team'' and a group of unseen collaborators shows that agents trained with N‑XPlay are better able to simultaneously balance ``intra‑team'' and ``inter‑team'' coordination than agents trained with SP.

PaperID: 9916, https://arxiv.org/pdf/2506.17539.pdf

Authors: Sidong Feng, Changhao Du, Huaxiao Liu, Qingnan Wang, Zhengwei Lv, Mengfei Wang, Chunyang Chen

Title: Breaking Single-Tester Limits: Multi-Agent LLMs for Multi-User Feature Testing

Abstract:
The growing dependence on mobile phones and their apps has made multi‑user interactive features, like chat calls, live streaming, and video conferencing, indispensable for bridging the gaps in social connectivity caused by physical and situational barriers. However, automating these interactive features for testing is fraught with challenges, owing to their inherent need for timely, dynamic, and collaborative user interactions, which current automated testing methods inadequately address. Inspired by the concept of agents designed to autonomously and collaboratively tackle problems, we propose MAdroid, a novel multi‑agent approach powered by the Large Language Models (LLMs) to automate the multi‑user interactive task for app feature testing. Specifically, MAdroid employs two functional types of multi‑agents: user agents (Operator) and supervisor agents (Coordinator and Observer). Each agent takes a specific role: the Coordinator directs the interactive task; the Operator mimics user interactions on the device; and the Observer monitors and reviews the task automation process. Our evaluation, which included 41 multi‑user interactive tasks, demonstrates the effectiveness of our approach, achieving 82.9% of the tasks with 96.8% action similarity, outperforming the ablation studies and state‑of‑the‑art baselines. Additionally, a preliminary investigation underscores MAdroid's practicality by helping identify 11 multi‑user interactive bugs during regression app testing, confirming its potential value in real‑world software development contexts.

PaperID: 9917, https://arxiv.org/pdf/2506.17514.pdf

Authors: Ninareh Mehrabi, Tharindu Kumarage, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

Title: Kaleidoscopic Teaming in Multi Agent Simulations

Abstract:
Warning: This paper contains content that may be inappropriate or offensive. AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real‑world applications. This autonomy poses novel challenges for the safety of such systems, both in single‑ and multi‑agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents. Moreover, they fail to consider risks in multi‑agent setups where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming which seeks to capture complex and wide range of vulnerabilities that can happen in agents both in single‑agent and multi‑agent scenarios. We also present a new kaleidoscopic teaming framework that generates a diverse array of scenarios modeling real‑world human societies. Our framework evaluates safety of agents in both single‑agent and multi‑agent setups. In single‑agent setup, an agent is given a scenario that it needs to complete using the tools it has access to. In multi‑agent setup, multiple agents either compete against or cooperate together to complete a task in the scenario through which we capture existing safety vulnerabilities in agents. We introduce new in‑context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis. Lastly, we present appropriate metrics that can be used along with our framework to measure safety of agents. Utilizing our kaleidoscopic teaming framework, we identify vulnerabilities in various models with respect to their safety in agentic use‑cases.

PaperID: 9918, https://arxiv.org/pdf/2506.17510.pdf

Authors: Rafael Ferreira da Silva, Milad Abolhasani, Dionysios A. Antonopoulos, Laura Biven, Ryan Coffee, Ian T. Foster, Leslie Hamilton, Shantenu Jha, Theresa Mayer, Benjamin Mintz, Robert G. Moore, Salahudin Nimer, Noah Paulson, Woong Shin, Frederic Suter, Mitra Taheri, Michela Taufer, Newell R. Washburn

Title: A Grassroots Network and Community Roadmap for Interconnected Autonomous Science Laboratories for Accelerated Discovery

Abstract:
Scientific discovery is being revolutionized by AI and autonomous systems, yet current autonomous laboratories remain isolated islands unable to collaborate across institutions. We present the Autonomous Interconnected Science Lab Ecosystem (AISLE), a grassroots network transforming fragmented capabilities into a unified system that shorten the path from ideation to innovation to impact and accelerates discovery from decades to months. AISLE addresses five critical dimensions: (1) cross‑institutional equipment orchestration, (2) intelligent data management with FAIR compliance, (3) AI‑agent driven orchestration grounded in scientific principles, (4) interoperable agent communication interfaces, and (5) AI/ML‑integrated scientific education. By connecting autonomous agents across institutional boundaries, autonomous science can unlock research spaces inaccessible to traditional approaches while democratizing cutting‑edge technologies. This paradigm shift toward collaborative autonomous science promises breakthroughs in sustainable energy, materials development, and public health.

PaperID: 9919, https://arxiv.org/pdf/2506.17484.pdf

Authors: Yao Zhang, Zaixi Shang, Silpan Patel, Mikel Zuniga

Title: From Unstructured Communication to Intelligent RAG: Multi-Agent Automation for Supply Chain Knowledge Bases

Abstract:
Supply chain operations generate vast amounts of operational data; however, critical knowledge such as system usage practices, troubleshooting workflows, and resolution techniques often remains buried within unstructured communications like support tickets, emails, and chat logs. While RAG systems aim to leverage such communications as a knowledge base, their effectiveness is limited by raw data challenges: support tickets are typically noisy, inconsistent, and incomplete, making direct retrieval suboptimal. Unlike existing RAG approaches that focus on runtime optimization, we introduce a novel offline‑first methodology that transforms these communications into a structured knowledge base. Our key innovation is a LLMs‑based multi‑agent system orchestrating three specialized agents: Category Discovery for taxonomy creation, Categorization for ticket grouping, and Knowledge Synthesis for article generation. Applying our methodology to real‑world support tickets with resolution notes and comments, our system creates a compact knowledge base ‑ reducing total volume to just 3.4% of original ticket data while improving quality. Experiments demonstrate that our prebuilt knowledge base in RAG systems significantly outperforms traditional RAG implementations (48.74% vs. 38.60% helpful answers) and achieves a 77.4% reduction in unhelpful responses. By automating institutional knowledge capture that typically remains siloed in experts' heads, our solution translates to substantial operational efficiency: reducing support workload, accelerating resolution times, and creating self‑improving systems that automatically resolve approximately 50% of future supply chain tickets. Our approach addresses a key gap in knowledge management by transforming transient communications into structured, reusable knowledge through intelligent offline processing rather than latency‑inducing runtime architectures.

PaperID: 9920, https://arxiv.org/pdf/2506.17449.pdf

Authors: Manasa Bharadwaj, Nikhil Verma, Kevin Ferreira

Title: OmniReflect: Discovering Transferable Constitutions for LLM agents via Neuro-Symbolic Reflections

Abstract:
Efforts to improve Large Language Model (LLM) agent performance on complex tasks have largely focused on fine‑tuning and iterative self‑correction. However, these approaches often lack generalizable mechanisms for longterm learning and remain inefficient in dynamic environments. We introduce OmniReflect, a hierarchical, reflection‑driven framework that constructs a constitution, a compact set of guiding principles distilled from task experiences, to enhance the effectiveness and efficiency of an LLM agent. OmniReflect operates in two modes: Self‑sustaining, where a single agent periodically curates its own reflections during task execution, and Co‑operative, where a Meta‑advisor derives a constitution from a small calibration set to guide another agent. To construct these constitutional principles, we employ Neural, Symbolic, and NeuroSymbolic techniques, offering a balance between contextual adaptability and computational efficiency. Empirical results averaged across models show major improvements in task success, with absolute gains of +10.3% on ALFWorld, +23.8% on BabyAI, and +8.3% on PDDL in the Self‑sustaining mode. Similar gains are seen in the Co‑operative mode, where a lightweight Qwen3‑4B ReAct agent outperforms all Reflexion baselines on BabyAI. These findings highlight the robustness and effectiveness of OmniReflect across environments and backbones.

PaperID: 9921, https://arxiv.org/pdf/2506.17375.pdf

Authors: Peter Lindes, Kaoutar Skiker

Title: Challenges in Grounding Language in the Real World

Abstract:
A long‑term goal of Artificial Intelligence is to build a language understanding system that allows a human to collaborate with a physical robot using language that is natural to the human. In this paper we highlight some of the challenges in doing this, and propose a solution that integrates the abilities of a cognitive agent capable of interactive task learning in a physical robot with the linguistic abilities of a large language model. We also point the way to an initial implementation of this approach.

PaperID: 9922, https://arxiv.org/pdf/2506.17352.pdf

Authors: Tatsuhiro Aoshima, Mitsuaki Akiyama

Title: Towards Safety Evaluations of Theory of Mind in Large Language Models

Abstract:
As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open‑weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs' theory of mind, and discuss remaining challenges for future work.

PaperID: 9923, https://arxiv.org/pdf/2506.17348.pdf

Authors: Pavel Malinovskiy

Title: Advanced Game-Theoretic Frameworks for Multi-Agent AI Challenges: A 2025 Outlook

Abstract:
This paper presents a substantially reworked examination of how advanced game‑theoretic paradigms can serve as a foundation for the next‑generation challenges in Artificial Intelligence (AI), forecasted to arrive in or around 2025. Our focus extends beyond traditional models by incorporating dynamic coalition formation, language‑based utilities, sabotage risks, and partial observability. We provide a set of mathematical formalisms, simulations, and coding schemes that illustrate how multi‑agent AI systems may adapt and negotiate in complex environments. Key elements include repeated games, Bayesian updates for adversarial detection, and moral framing within payoff structures. This work aims to equip AI researchers with robust theoretical tools for aligning strategic interaction in uncertain, partially adversarial contexts.

PaperID: 9924, https://arxiv.org/pdf/2506.17342.pdf

Authors: Zijian Long, Haopeng Wang, Haiwei Dong, Abdulmotaleb El Saddik

Title: Adaptive Social Metaverse Streaming based on Federated Multi-Agent Deep Reinforcement Learning

Abstract:
The social metaverse is a growing digital ecosystem that blends virtual and physical worlds. It allows users to interact socially, work, shop, and enjoy entertainment. However, privacy remains a major challenge, as immersive interactions require continuous collection of biometric and behavioral data. At the same time, ensuring high‑quality, low‑latency streaming is difficult due to the demands of real‑time interaction, immersive rendering, and bandwidth optimization. To address these issues, we propose ASMS (Adaptive Social Metaverse Streaming), a novel streaming system based on Federated Multi‑Agent Proximal Policy Optimization (F‑MAPPO). ASMS leverages F‑MAPPO, which integrates federated learning (FL) and deep reinforcement learning (DRL) to dynamically adjust streaming bit rates while preserving user privacy. Experimental results show that ASMS improves user experience by at least 14% compared to existing streaming methods across various network conditions. Therefore, ASMS enhances the social metaverse experience by providing seamless and immersive streaming, even in dynamic and resource‑constrained networks, while ensuring that sensitive user data remains on local devices.

PaperID: 9925, https://arxiv.org/pdf/2506.17338.pdf

Authors: Duong Bach

Title: PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning

Abstract:
The proliferation of multi‑agent systems (MAS) in complex, dynamic environments necessitates robust and efficient mechanisms for managing shared knowledge. A critical challenge is ensuring that distributed memories remain synchronized, relevant, and free from the accumulation of outdated or inconsequential data ‑ a process analogous to biological forgetting. This paper introduces the Co‑Forgetting Protocol, a novel, comprehensive framework designed to address this challenge by enabling synchronized memory pruning in MAS. The protocol integrates three key components: (1) context‑aware semantic voting, where agents utilize a lightweight DistilBERT model to assess the relevance of memory items based on their content and the current operational context; (2) multi‑scale temporal decay functions, which assign diminishing importance to memories based on their age and access frequency across different time horizons; and (3) a Practical Byzantine Fault Tolerance (PBFT)‑based consensus mechanism, ensuring that decisions to retain or discard memory items are agreed upon by a qualified and fault‑tolerant majority of agents, even in the presence of up to f Byzantine (malicious or faulty) agents in a system of N greater than or equal to 3f+1 agents. The protocol leverages gRPC for efficient inter‑agent communication and Pinecone for scalable vector embedding storage and similarity search, with SQLite managing metadata. Experimental evaluations in a simulated MAS environment with four agents demonstrate the protocol's efficacy, achieving a 52% reduction in memory footprint over 500 epochs, 88% voting accuracy in forgetting decisions against human‑annotated benchmarks, a 92% PBFT consensus success rate under simulated Byzantine conditions, and an 82% cache hit rate for memory access.

PaperID: 9926, https://arxiv.org/pdf/2506.17335.pdf

Authors: Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du

Title: LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

Abstract:
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR‑BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top‑tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state‑of‑the‑art LLMs, evaluating the accuracy of unit tests and performing LLM‑based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents' ability to autonomously reproduce scientific research

PaperID: 9927, https://arxiv.org/pdf/2506.17320.pdf

Authors: Akash Awasthi, Brandon V. Chang, Anh M. Vu, Ngan Le, Rishi Agrawal, Zhigang Deng, Carol Wu, Hien Van Nguyen

Title: MAARTA:Multi-Agentic Adaptive Radiology Teaching Assistant

Abstract:
Radiology students often struggle to develop perceptual expertise due to limited expert mentorship time, leading to errors in visual search and diagnostic interpretation. These perceptual errors, such as missed fixations, short dwell times, or misinterpretations, are not adequately addressed by current AI systems, which focus on diagnostic accuracy but fail to explain how and why errors occur. To address this gap, we introduce MAARTA (Multi‑Agentic Adaptive Radiology Teaching Assistant), a multi‑agent framework that analyzes gaze patterns and radiology reports to provide personalized feedback. Unlike single‑agent models, MAARTA dynamically selects agents based on error complexity, enabling adaptive and efficient reasoning. By comparing expert and student gaze behavior through structured graphs, the system identifies missed findings and assigns Perceptual Error Teacher agents to analyze discrepancies. MAARTA then uses step‑by‑step prompting to help students understand their errors and improve diagnostic reasoning, advancing AI‑driven radiology education.

PaperID: 9928, https://arxiv.org/pdf/2506.17311.pdf

Authors: Chuanlei Li, Xu Hu, Minghui Xu, Kun Li, Yue Zhang, Xiuzhen Cheng

Title: Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study

Abstract:
Academic paper review typically requires substantial time, expertise, and human resources. Large Language Models (LLMs) present a promising method for automating the review process due to their extensive training data, broad knowledge base, and relatively low usage cost. This work explores the feasibility of using LLMs for academic paper review by proposing an automated review system. The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi‑agent system, and Chain‑of‑Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring. Experiments conducted on 290 submissions from the WASA 2024 conference using GPT‑4o show that LLM‑based review significantly reduces review time (average 2.48 hours) and cost (average \104.28 USD). However, the similarity between LLM‑selected papers and actual accepted papers remains low (average 38.6%), indicating issues such as hallucination, lack of independent judgment, and retrieval preferences. Therefore, it is recommended to use LLMs as assistive tools to support human reviewers, rather than to replace them.

PaperID: 9929, https://arxiv.org/pdf/2506.17285.pdf

Authors: Vinaik Chhetri, Yousaf Reza, Moghis Fereidouni, Srijata Maji, Umar Farooq, AB Siddique

Title: A Framework for Generating Conversational Recommendation Datasets from Behavioral Interactions

Abstract:
Modern recommendation systems typically follow two complementary paradigms: collaborative filtering, which models long‑term user preferences from historical interactions, and conversational recommendation systems (CRS), which interact with users in natural language to uncover immediate needs. Each captures a different dimension of user intent. While CRS models lack collaborative signals, leading to generic or poorly personalized suggestions, traditional recommenders lack mechanisms to interactively elicit immediate needs. Unifying these paradigms promises richer personalization but remains challenging due to the lack of large‑scale conversational datasets grounded in real user behavior. We present ConvRecStudio, a framework that uses large language models (LLMs) to simulate realistic, multi‑turn dialogs grounded in timestamped user‑item interactions and reviews. ConvRecStudio follows a three‑stage pipeline: (1) Temporal Profiling, which constructs user profiles and community‑level item sentiment trajectories over fine‑grained aspects; (2) Semantic Dialog Planning, which generates a structured plan using a DAG of flexible super‑nodes; and (3) Multi‑Turn Simulation, which instantiates the plan using paired LLM agents for the user and system, constrained by executional and behavioral fidelity checks. We apply ConvRecStudio to three domains ‑‑ MobileRec, Yelp, and Amazon Electronics ‑‑ producing over 12K multi‑turn dialogs per dataset. Human and automatic evaluations confirm the naturalness, coherence, and behavioral grounding of the generated conversations. To demonstrate utility, we build a cross‑attention transformer model that jointly encodes user history and dialog context, achieving gains in Hit@K and NDCG@K over baselines using either signal alone or naive fusion. Notably, our model achieves a 10.9% improvement in Hit@1 on Yelp over the strongest baseline.

PaperID: 9930, https://arxiv.org/pdf/2506.17283.pdf

Authors: Laksh Patel, Akhilesh Raj

Title: Second Order State Hallucinations for Adversarial Attack Mitigation in Formation Control of Multi-Agent Systems

Abstract:
The increasing deployment of multi‑agent systems (MAS) in critical infrastructures such as autonomous transportation, disaster relief, and smart cities demands robust formation control mechanisms resilient to adversarial attacks. Traditional consensus‑based controllers, while effective under nominal conditions, are highly vulnerable to data manipulation, sensor spoofing, and communication failures. To address this challenge, we propose Second‑Order State Hallucination (SOSH), a novel framework that detects compromised agents through distributed residual monitoring and maintains formation stability by replacing attacked states with predictive second‑order approximations. Unlike existing mitigation strategies that require significant restructuring or induce long transients, SOSH offers a lightweight, decentralized correction mechanism based on second‑order Taylor expansions, enabling rapid and scalable resilience. We establish rigorous Lyapunov‑based stability guarantees, proving that formation errors remain exponentially bounded even under persistent attacks, provided the hallucination parameters satisfy explicit conditions. Comprehensive Monte Carlo experiments on a 5‑agent complete graph formation demonstrate that SOSH outperforms established robust control schemes, including W‑MSR and Huber‑based consensus filters, achieving faster convergence rates, lower steady‑state error, and superior transient recovery. Our results confirm that SOSH combines theoretical robustness with practical deployability, offering a promising direction for securing MAS formations against sophisticated adversarial threats.

PaperID: 9931, https://arxiv.org/pdf/2506.17266.pdf

Authors: Sunil Kumar Jang Bahadur, Gopala Dhar

Title: Securing Generative AI Agentic Workflows: Risks, Mitigation, and a Proposed Firewall Architecture

Abstract:
Generative Artificial Intelligence (GenAI) presents significant advancements but also introduces novel security challenges, particularly within agentic workflows where AI agents operate autonomously. These risks escalate in multi‑agent systems due to increased interaction complexity. This paper outlines critical security vulnerabilities inherent in GenAI agentic workflows, including data privacy breaches, model manipulation, and issues related to agent autonomy and system integration. It discusses key mitigation strategies such as data encryption, access control, prompt engineering, model monitoring, agent sandboxing, and security audits. Furthermore, it details a proposed "GenAI Security Firewall" architecture designed to provide comprehensive, adaptable, and efficient protection for these systems by integrating various security services and leveraging GenAI itself for enhanced defense. Addressing these security concerns is paramount for the responsible and safe deployment of this transformative technology.

PaperID: 9932, https://arxiv.org/pdf/2506.17188.pdf

Authors: Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen, Jun Yang, Haojie Zhang, Jiayi Li, Jiayi Wu, Yiqun Chen, Changle Qu, Wenwen Ye, Lixin Su, Xinyu Ma, Lingyong Yan, Long Xia, Daiting Shi, Junfeng Wang, Xiangyu Zhao, Jiashu Zhao, Haoyi Xiong, Shuaiqiang Wang, Dawei Yin

Title: Towards AI Search Paradigm

Abstract:
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next‑generation search systems capable of emulating human information processing and decision‑making. The paradigm employs a modular architecture of four LLM‑powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi‑stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval‑augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure‑level optimizations. By providing an in‑depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.

PaperID: 9933, https://arxiv.org/pdf/2506.17128.pdf

Authors: Botao Zhu, Xianbin Wang

Title: Rapid and Continuous Trust Evaluation for Effective Task Collaboration Through Siamese Model

Abstract:
Trust is emerging as an effective tool to ensure the successful completion of collaborative tasks within collaborative systems. However, rapidly and continuously evaluating the trustworthiness of collaborators during task execution is a significant challenge due to distributed devices, complex operational environments, and dynamically changing resources. To tackle this challenge, this paper proposes a Siamese‑enabled rapid and continuous trust evaluation framework (SRCTE) to facilitate effective task collaboration. First, the communication and computing resource attributes of the collaborator in a trusted state, along with historical collaboration data, are collected and represented using an attributed control flow graph (ACFG) that captures trust‑related semantic information and serves as a reference for comparison with data collected during task execution. At each time slot of task execution, the collaborator's communication and computing resource attributes, as well as task completion effectiveness, are collected in real time and represented with an ACFG to convey their trust‑related semantic information. A Siamese model, consisting of two shared‑parameter Structure2vec networks, is then employed to learn the deep semantics of each pair of ACFGs and generate their embeddings. Finally, the similarity between the embeddings of each pair of ACFGs is calculated to determine the collaborator's trust value at each time slot. A real system is built using two Dell EMC 5200 servers and a Google Pixel 8 to test the effectiveness of the proposed SRCTE framework. Experimental results demonstrate that SRCTE converges rapidly with only a small amount of data and achieves a high anomaly trust detection rate compared to the baseline algorithm.

PaperID: 9934, https://arxiv.org/pdf/2506.17029.pdf

Authors: Leizhen Wang, Peibo Duan, Cheng Lyu, Zewen Wang, Zhiqiang He, Nan Zheng, Zhenliang Ma

Title: Scalable and Reliable Multi-agent Reinforcement Learning for Traffic Assignment

Abstract:
The evolution of metropolitan cities and the increase in travel demands impose stringent requirements on traffic assignment methods. Multi‑agent reinforcement learning (MARL) approaches outperform traditional methods in modeling adaptive routing behavior without requiring explicit system dynamics, which is beneficial for real‑world deployment. However, MARL frameworks face challenges in scalability and reliability when managing extensive networks with substantial travel demand, which limiting their practical applicability in solving large‑scale traffic assignment problems. To address these challenges, this study introduces MARL‑OD‑DA, a new MARL framework for the traffic assignment problem, which redefines agents as origin‑destination (OD) pair routers rather than individual travelers, significantly enhancing scalability. Additionally, a Dirichlet‑based action space with action pruning and a reward function based on the local relative gap are designed to enhance solution reliability and improve convergence efficiency. Experiments demonstrate that the proposed MARL framework effectively handles medium‑sized networks with extensive and varied city‑level OD demand, surpassing existing MARL methods. When implemented in the SiouxFalls network, MARL‑OD‑DA achieves better assignment solutions in 10 steps, with a relative gap that is 94.99% lower than that of conventional methods.

PaperID: 9935, https://arxiv.org/pdf/2506.17001.pdf

Authors: Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Petr Anokhin, Nikita Semenov, Evgeny Burnaev

Title: PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Abstract:
Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs), combined with Retrieval‑Augmented Generation (RAG), have improved factual accuracy, they often lack structured memory and fail to scale in complex, long‑term interactions. To address this, we propose a flexible external memory framework based on a knowledge graph that is constructed and updated automatically by the LLM. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyper‑edges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A, WaterCircles traversal, beam search, and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on TriviaQA, HotpotQA, DiaASQ benchmarks and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task. Additionally, we extend the DiaASQ benchmark with temporal annotations and internally contradictory statements, showing that our system remains robust and effective in managing temporal dependencies and context‑aware reasoning

PaperID: 9936, https://arxiv.org/pdf/2506.16988.pdf

Authors: Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber

Title: RAGentA: Multi-Agent Retrieval-Augmented Generation for Attributed Question Answering

Abstract:
We present RAGentA, a multi‑agent retrieval‑augmented generation (RAG) framework for attributed question answering (QA) with large language models (LLMs). With the goal of trustworthy answer generation, RAGentA focuses on optimizing answer correctness, defined by coverage and relevance to the question and faithfulness, which measures the extent to which answers are grounded in retrieved documents. RAGentA uses a multi‑agent architecture that iteratively filters retrieved documents, generates attributed answers with in‑line citations, and verifies completeness through dynamic refinement. Central to the framework is a hybrid retrieval strategy that combines sparse and dense methods, improving Recall@20 by 12.5% compared to the best single retrieval model, resulting in more correct and well‑supported answers. Evaluated on a synthetic QA dataset derived from the FineWeb index, RAGentA outperforms standard RAG baselines, achieving gains of 1.09% in correctness and 10.72% in faithfulness. These results demonstrate the effectiveness of our multi‑agent RAG architecture and hybrid retrieval strategy in advancing trustworthy QA with LLMs.

PaperID: 9937, https://arxiv.org/pdf/2506.16813.pdf

Authors: Michał Wawer, Jarosław A. Chudziak

Title: Integrating Traditional Technical Analysis with AI: A Multi-Agent LLM-Based Approach to Stock Market Forecasting

Abstract:
Traditional technical analysis methods face limitations in accurately predicting trends in today's complex financial markets. This paper introduces ElliottAgents, an multi‑agent system that integrates the Elliott Wave Principle with AI for stock market forecasting. The inherent complexity of financial markets, characterized by non‑linear dynamics, noise, and susceptibility to unpredictable external factors, poses significant challenges for accurate prediction. To address these challenges, the system employs LLMs to enhance natural language understanding and decision‑making capabilities within a multi‑agent framework. By leveraging technologies such as Retrieval‑Augmented Generation (RAG) and Deep Reinforcement Learning (DRL), ElliottAgents performs continuous, multi‑faceted analysis of market data to identify wave patterns and predict future price movements. The research explores the system's ability to process historical stock data, recognize Elliott wave patterns, and generate actionable insights for traders. Experimental results, conducted on historical data from major U.S. companies, validate the system's effectiveness in pattern recognition and trend forecasting across various time frames. This paper contributes to the field of AI‑driven financial analysis by demonstrating how traditional technical analysis methods can be effectively combined with modern AI approaches to create more reliable and interpretable market prediction systems.

PaperID: 9938, https://arxiv.org/pdf/2506.16797.pdf

Authors: Chenjun Liu, Jason J. R. Liu, Zhan Shu, James Lam

Title: Distributed Affine Formation Control of Linear Multi-agent Systems with Adaptive Event-triggering

Abstract:
Concerning general multi‑agent systems with limited communication, this paper proposes distributed formation control protocols under adaptive event‑triggered schemes to operate affine transformations of nominal formations. To accommodate more practical system mechanics, we develop an event‑triggered controller that drives the leader to a desired state by bringing in the compensation term. Based on triggering instants' state information, an affine formation control method with adaptive event‑triggering is designed for each follower, making the whole protocol effective in refraining from successive communication while not relying on predefined global information. In particular, mitigating the effect of partial state availability, an output‑based control solution is presented to expand the protocol's serviceable range. Finally, we perform numerical simulations on the formation and its affine transformations to verify the effectiveness of the control protocol and the feasibility of the event‑triggered mechanism.

PaperID: 9939, https://arxiv.org/pdf/2506.16760.pdf

Authors: Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu

Title: Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Abstract:
Large Vision‑Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built‑in safety mechanisms to elicit restricted content generation. Existing black‑box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross‑modal Adversarial Multimodal Obfuscation (CAMO), a novel black‑box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross‑modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi‑step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross‑model transferability. These results underscore significant vulnerabilities in current built‑in safety mechanisms, emphasizing an urgent need for advanced, alignment‑aware security and safety solutions in vision‑language systems.

PaperID: 9940, https://arxiv.org/pdf/2506.16748.pdf

Authors: Arjo Chakravarty, Michael X. Grey, M. A. Viraj J. Muthugala, Mohan Rajesh Elara

Title: A Scalable Post-Processing Pipeline for Large-Scale Free-Space Multi-Agent Path Planning with PiBT

Abstract:
Free‑space multi‑agent path planning remains challenging at large scales. Most existing methods either offer optimality guarantees but do not scale beyond a few dozen agents, or rely on grid‑world assumptions that do not generalize well to continuous space. In this work, we propose a hybrid, rule‑based planning framework that combines Priority Inheritance with Backtracking (PiBT) with a novel safety‑aware path smoothing method. Our approach extends PiBT to 8‑connected grids and selectively applies string‑pulling based smoothing while preserving collision safety through local interaction awareness and a fallback collision resolution step based on Safe Interval Path Planning (SIPP). This design allows us to reduce overall path lengths while maintaining real‑time performance. We demonstrate that our method can scale to over 500 agents in large free‑space environments, outperforming existing any‑angle and optimal methods in terms of runtime, while producing near‑optimal trajectories in sparse domains. Our results suggest this framework is a promising building block for scalable, real‑time multi‑agent navigation in robotics systems operating beyond grid constraints.

PaperID: 9941, https://arxiv.org/pdf/2506.16699.pdf

Authors: Lu Gao, Yongxin Liu, Hongyun Chen, Dahai Liu, Yunpeng Zhang, Jingran Sun

Title: Exploring Traffic Simulation and Cybersecurity Strategies Using Large Language Models

Abstract:
Intelligent Transportation Systems (ITS) are increasingly vulnerable to sophisticated cyberattacks due to their complex, interconnected nature. Ensuring the cybersecurity of these systems is paramount to maintaining road safety and minimizing traffic disruptions. This study presents a novel multi‑agent framework leveraging Large Language Models (LLMs) to enhance traffic simulation and cybersecurity testing. The framework automates the creation of traffic scenarios, the design of cyberattack strategies, and the development of defense mechanisms. A case study demonstrates the framework's ability to simulate a cyberattack targeting connected vehicle broadcasts, evaluate its impact, and implement a defense mechanism that significantly mitigates traffic delays. Results show a 10.2 percent increase in travel time during an attack, which is reduced by 3.3 percent with the defense strategy. This research highlights the potential of LLM‑driven multi‑agent systems in advancing transportation cybersecurity and offers a scalable approach for future research in traffic simulation and cyber defense.

PaperID: 9942, https://arxiv.org/pdf/2506.16609.pdf

Authors: Yingjie Shi, Yiru Gong, Yiqun Su, Suya Xiong, Jiale Han, Runtian Miao

Title: Aethorix v1.0: An Integrated Scientific AI Agent for Scalable Inorganic Materials Innovation and Industrial Implementation

Abstract:
Artificial Intelligence (AI) is redefining the frontiers of scientific domains, ranging from drug discovery to meteorological modeling, yet its integration within industrial manufacturing remains nascent and fraught with operational challenges. To bridge this gap, we introduce Aethorix v1.0, an AI agent framework designed to overcome key industrial bottlenecks, demonstrating state‑of‑the‑art performance in materials design innovation and process parameter optimization. Our tool is built upon three pillars: a scientific corpus reasoning engine that streamlines knowledge retrieval and validation, a diffusion‑based generative model for zero‑shot inverse design, and specialized interatomic potentials that enable faster screening with ab initio fidelity. We demonstrate Aethorix's utility through a real‑world cement production case study, confirming its capacity for integration into industrial workflows and its role in revolutionizing the design‑make‑test‑analyze loop while ensuring rigorous manufacturing standards are met.

PaperID: 9943, https://arxiv.org/pdf/2506.16493.pdf

Authors: Mehreen Naeem, Andrew Melnik, Michael Beetz

Title: Grounding Language Models with Semantic Digital Twins for Robotic Planning

Abstract:
We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs) to enable adaptive and goal‑driven robotic task execution in dynamic environments. The system decomposes natural language instructions into structured action triplets, which are grounded in contextual environmental data provided by the SDT. This semantic grounding allows the robot to interpret object affordances and interaction rules, enabling action planning and real‑time adaptability. In case of execution failures, the LLM utilizes error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan. We evaluate our approach using tasks from the ALFRED benchmark, demonstrating robust performance across various household scenarios. The proposed framework effectively combines high‑level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.

PaperID: 9944, https://arxiv.org/pdf/2506.16445.pdf

Authors: Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Title: StoryWriter: A Multi-Agent Framework for Long Story Generation

Abstract:
Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long‑form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi‑agent story generation framework, which consists of three main modules: (1) outline agent, which generates event‑based outlines containing rich event plots, character, and event‑event relationships. (2) planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story. (3) writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, which contains about 6,000 high‑quality long stories, with an average length of 8,000 words. We train the model Llama3.1‑8B and GLM4‑9B using supervised fine‑tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrates advanced performance in long story generation.

PaperID: 9945, https://arxiv.org/pdf/2506.16411.pdf

Authors: Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang

Title: When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

Abstract:
We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross‑chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi‑agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi‑agent chunking. By exploring the accelerated decay of model fidelity with input length, we also explain why, for large inputs, a weaker model configured with chunk‑based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled understanding framework and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.

PaperID: 9946, https://arxiv.org/pdf/2506.16301.pdf

Authors: Nadine Imholz, Maurice Brunner, Nicolas Baumann, Edoardo Ghignone, Michele Magno

Title: M-Predictive Spliner: Enabling Spatiotemporal Multi-Opponent Overtaking for Autonomous Racing

Abstract:
Unrestricted multi‑agent racing presents a significant research challenge, requiring decision‑making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision‑making process or been restricted to single‑opponent scenarios, this work enables arbitrary multi‑opponent head‑to‑head racing while considering the opponents' future intent. The proposed method employs a KF‑based multi‑opponent tracker to effectively perform opponent ReID by associating them across observations. Simultaneously, spatial and velocity GPR is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car, achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%‑point improvement in safety at the same speed as the previous SotA. These results highlight its potential for high‑performance autonomous racing.

PaperID: 9947, https://arxiv.org/pdf/2506.16120.pdf

Authors: Fivos Kalogiannis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Ian Gemp, Georgios Piliouras

Title: Solving Zero-Sum Convex Markov Games

Abstract:
We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two‑player zero‑sum convex Markov games (cMGs) by using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi‑agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min‑max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. We follow a two‑step approach. First, leveraging properties of hidden‑convex‑‑hidden‑concave functions, we show that a simple nonconvex regularization transforms the min‑max optimization problem into a nonconvex‑proximal Polyak‑Lojasiewicz (NC‑pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address the general constrained min‑max problems under NC‑pPL and two‑sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent‑ascent methods, which we believe may be of independent interest.

PaperID: 9948, https://arxiv.org/pdf/2506.16010.pdf

Authors: Xiangyang He, Jiale Li, Jiahao Chen, Yang Yang, Mingming Fan

Title: SimuPanel: A Novel Immersive Multi-Agent System to Simulate Interactive Expert Panel Discussion

Abstract:
Panel discussion allows the audience to learn different perspectives through interactive discussions among experts moderated by a host and a Q&A session with the audience. Despite its benefits, panel discussion in the real world is inaccessible to many who do not have the privilege to participate due to geographical, financial, and time constraints. We present SimuPanel, which simulates panel discussions among academic experts through LLM‑based multi‑agent interaction. It enables users to define topics of interest for the panel, observe the expert discussion, engage in Q&A, and take notes. SimuPanel employs a host‑expert architecture where each panel member is simulated by an agent with specialized expertise, and the panel is visualized in an immersive 3D environment to enhance engagement. Traditional dialogue generation struggles to capture the depth and interactivity of real‑world panel discussions. To address this limitation, we propose a novel multi‑agent interaction framework that simulates authentic panel dynamics by modeling reasoning strategies and personas of experts grounded in multimedia sources. This framework enables agents to dynamically recall and contribute to the discussion based on past experiences from diverse perspectives. Our technical evaluation and the user study with university students show that SimuPanel was able to simulate more in‑depth discussions and engage participants to interact with and reflect on the discussions. As a first step in this direction, we offer design implications for future avenues to improve and harness the power of panel discussion for multimedia learning.

PaperID: 9949, https://arxiv.org/pdf/2506.15947.pdf

Authors: Jinbo Wen, Cheng Su, Jiawen Kang, Jiangtian Nie, Yang Zhang, Jianhang Tang, Dusit Niyato, Chau Yuen

Title: HybridRAG-based LLM Agents for Low-Carbon Optimization in Low-Altitude Economy Networks

Abstract:
Low‑Altitude Economy Networks (LAENets) are emerging as a promising paradigm to support various low‑altitude services through integrated air‑ground infrastructure. To satisfy low‑latency and high‑computation demands, the integration of Unmanned Aerial Vehicles (UAVs) with Mobile Edge Computing (MEC) systems plays a vital role, which offloads computing tasks from terminal devices to nearby UAVs, enabling flexible and resilient service provisions for ground users. To promote the development of LAENets, it is significant to achieve low‑carbon multi‑UAV‑assisted MEC networks. However, several challenges hinder this implementation, including the complexity of multi‑dimensional UAV modeling and the difficulty of multi‑objective coupled optimization. To this end, this paper proposes a novel Retrieval Augmented Generation (RAG)‑based Large Language Model (LLM) agent framework for model formulation. Specifically, we develop HybridRAG by combining KeywordRAG, VectorRAG, and GraphRAG, empowering LLM agents to efficiently retrieve structural information from expert databases and generate more accurate optimization problems compared with traditional RAG‑based LLM agents. After customizing carbon emission optimization problems for multi‑UAV‑assisted MEC networks, we propose a Double Regularization Diffusion‑enhanced Soft Actor‑Critic (R\textsuperscript2DSAC) algorithm to solve the formulated multi‑objective optimization problem. The R\textsuperscript2DSAC algorithm incorporates diffusion entropy regularization and action entropy regularization to improve the performance of the diffusion policy. Furthermore, we dynamically mask unimportant neurons in the actor network to reduce the carbon emissions associated with model training. Simulation results demonstrate the effectiveness and reliability of the proposed HybridRAG‑based LLM agent framework and the R\textsuperscript2DSAC algorithm.

PaperID: 9950, https://arxiv.org/pdf/2506.15911.pdf

Authors: Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam, Shahab Saquib Sohail, Amir Hussain

Title: From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

Abstract:
Centuries‑old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb‑e‑Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language‑model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe‑AG, that aligns 30 carefully curated Prophetic‑medicine questions with human‑verified remedies and compares three LLMs (LLaMA‑3, Mistral‑7B, Qwen2‑7B) under three configurations: direct generation, retrieval‑augmented generation, and a scientific self‑critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self‑evaluation enables reliable, culturally sensitive medical question‑answering.

PaperID: 9951, https://arxiv.org/pdf/2506.15887.pdf

Authors: Jakub Tłuczek, Victor Villin, Christos Dimitrakakis

Title: Fair Contracts in Principal-Agent Games with Heterogeneous Types

Abstract:
Fairness is desirable yet challenging to achieve within multi‑agent systems, especially when agents differ in latent traits that affect their abilities. This hidden heterogeneity often leads to unequal distributions of wealth, even when agents operate under the same rules. Motivated by real‑world examples, we propose a framework based on repeated principal‑agent games, where a principal, who also can be seen as a player of the game, learns to offer adaptive contracts to agents. By leveraging a simple yet powerful contract structure, we show that a fairness‑aware principal can learn homogeneous linear contracts that equalize outcomes across agents in a sequential social dilemma. Importantly, this fairness does not come at the cost of efficiency: our results demonstrate that it is possible to promote equity and stability in the system while preserving overall performance.

PaperID: 9952, https://arxiv.org/pdf/2506.15868.pdf

Authors: Mingyue Lei, Zewei Zhou, Hongchen Li, Jia Hu, Jiaqi Ma

Title: CooperRisk: A Driving Risk Quantification Pipeline with Multi-Agent Cooperative Perception and Prediction

Abstract:
Risk quantification is a critical component of safe autonomous driving, however, constrained by the limited perception range and occlusion of single‑vehicle systems in complex and dense scenarios. Vehicle‑to‑everything (V2X) paradigm has been a promising solution to sharing complementary perception information, nevertheless, how to ensure the risk interpretability while understanding multi‑agent interaction with V2X remains an open question. In this paper, we introduce the first V2X‑enabled risk quantification pipeline, CooperRisk, to fuse perception information from multiple agents and quantify the scenario driving risk in future multiple timestamps. The risk is represented as a scenario risk map to ensure interpretability based on risk severity and exposure, and the multi‑agent interaction is captured by the learning‑based cooperative prediction model. We carefully design a risk‑oriented transformer‑based prediction model with multi‑modality and multi‑agent considerations. It aims to ensure scene‑consistent future behaviors of multiple agents and avoid conflicting predictions that could lead to overly conservative risk quantification and cause the ego vehicle to become overly hesitant to drive. Then, the temporal risk maps could serve to guide a model predictive control planner. We evaluate the CooperRisk pipeline in a real‑world V2X dataset V2XPnP, and the experiments demonstrate its superior performance in risk quantification, showing a 44.35% decrease in conflict rate between the ego vehicle and background traffic participants.

PaperID: 9953, https://arxiv.org/pdf/2506.15856.pdf

Authors: Michael Ledford, William Regli

Title: Learning to Coordinate Under Threshold Rewards: A Cooperative Multi-Agent Bandit Framework

Abstract:
Cooperative multi‑agent systems often face tasks that require coordinated actions under uncertainty. While multi‑armed bandit (MAB) problems provide a powerful framework for decentralized learning, most prior work assumes individually attainable rewards. We address the challenging setting where rewards are threshold‑activated: an arm yields a payoff only when a minimum number of agents pull it simultaneously, with this threshold unknown in advance. Complicating matters further, some arms are decoys ‑ requiring coordination to activate but yielding no reward ‑ introducing a new challenge of wasted joint exploration. We introduce Threshold‑Coop‑UCB (T‑Coop‑UCB), a decentralized algorithm that enables agents to jointly learn activation thresholds and reward distributions, forming effective coalitions without centralized control. Empirical results show that T‑Coop‑UCB consistently outperforms baseline methods in cumulative reward, regret, and coordination metrics, achieving near‑Oracle performance. Our findings underscore the importance of joint threshold learning and decoy avoidance for scalable, decentralized cooperation in complex multi‑agent

PaperID: 9954, https://arxiv.org/pdf/2506.15841.pdf

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

Title: MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Abstract:
Modern language agents must operate over long‑horizon, multi‑turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full‑context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out‑of‑distribution input lengths. We introduce MEM1, an end‑to‑end reinforcement learning framework that enables agents to operate with constant memory across long multi‑turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi‑turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open‑domain web QA, and multi‑turn web shopping, show that MEM1‑7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5‑14B‑Instruct on a 16‑objective multi‑hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning‑driven memory consolidation as a scalable alternative to existing solutions for training long‑horizon interactive agents, where both efficiency and performance are optimized.

PaperID: 9955, https://arxiv.org/pdf/2506.15756.pdf

Authors: João G. Ribeiro, Yaniv Oren, Alberto Sardinha, Matthijs Spaan, Francisco S. Melo

Title: RecBayes: Recurrent Bayesian Ad Hoc Teamwork in Large Partially Observable Domains

Abstract:
This paper proposes RecBayes, a novel approach for ad hoc teamwork under partial observability, a setting where agents are deployed on‑the‑fly to environments where pre‑existing teams operate, that never requires, at any stage, access to the states of the environment or the actions of its teammates. We show that by relying on a recurrent Bayesian classifier trained using past experiences, an ad hoc agent is effectively able to identify known teams and tasks being performed from observations alone. Unlike recent approaches such as PO‑GPL (Gu et al., 2021) and FEAT (Rahman et al., 2023), that require at some stage fully observable states of the environment, actions of teammates, or both, or approaches such as ATPO (Ribeiro et al., 2023) that require the environments to be small enough to be tabularly modelled (Ribeiro et al., 2023), in their work up to 4.8K states and 1.7K observations, we show RecBayes is both able to handle arbitrarily large spaces while never relying on either states and teammates' actions. Our results in benchmark domains from the multi‑agent systems literature, adapted for partial observability and scaled up to 1M states and 2^125 observations, show that RecBayes is effective at identifying known teams and tasks being performed from partial observations alone, and as a result, is able to assist the teams in solving the tasks effectively.

PaperID: 9956, https://arxiv.org/pdf/2506.15740.pdf

Authors: Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton

Title: SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

Abstract:
As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)‑Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE‑Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE‑Arena to measure models' monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long‑context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer‑horizon tasks.

PaperID: 9957, https://arxiv.org/pdf/2506.15656.pdf

Authors: Wenhao Li, Selvakumar Manickam, Yung-wey Chong, Shankar Karuppayah

Title: PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection

Abstract:
Phishing websites remain a major cybersecurity threat, exploiting deceptive structures, brand impersonation, and social engineering to evade detection. Recent advances in large language models (LLMs) have improved phishing detection through contextual understanding, yet most existing approaches rely on single‑agent classification, which is prone to hallucination and often lacks interpretability and robustness. To address these limitations, we propose PhishDebate, a modular multi‑agent LLM‑based debate framework for phishing website detection. Four specialized agents independently analyze webpage aspects, including URL structure, HTML composition, semantic content, and brand impersonation, under the coordination of a Moderator and final Judge. Through structured debate and divergent reasoning, the framework achieves more accurate and interpretable decisions. By reducing uncertain predictions and providing transparent reasoning, PhishDebate functions as an analyst‑augmentation system that lowers cognitive load and supports early, left‑of‑exploit detection of phishing threats. Evaluations on commercial LLMs show that PhishDebate achieves 98.2 % recall on a real‑world phishing dataset and outperforms single‑agent and Chain‑of‑Thought (CoT) baselines. Its modular design enables agent‑level configurability, allowing adaptation to varying resource and application requirements, and offers scalability to high‑velocity, large‑scale security data environments.

PaperID: 9958, https://arxiv.org/pdf/2506.15624.pdf

Authors: Lyle Goodyear, Rachel Guo, Ramesh Johari

Title: The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games

Abstract:
Large Language Models (LLMs) have shown promise as decision‑makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history. We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi‑agent games. Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies. Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments \citerapoport_choice_2009. Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation. In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents. By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both.

PaperID: 9959, https://arxiv.org/pdf/2506.15587.pdf

Authors: Martino Brambati, Antonio Celani, Marco Gherardi, Francesco Ginelli

Title: Learning to flock in open space by avoiding collisions and staying together

Abstract:
We investigate the emergence of cohesive flocking in open, boundless space using a multi‑agent reinforcement learning framework. Agents integrate positional and orientational information from their closest topological neighbours and learn to balance alignment and attractive interactions by optimizing a local cost function that penalizes both excessive separation and close‑range crowding. The resulting Vicsek‑like dynamics is robust to algorithmic implementation details and yields cohesive collective motion with high polar order. The optimal policy is dominated by strong aligning interactions when agents are sufficiently close to their neighbours, and a flexible combination of alignment and attraction at larger separations. We further characterize the internal structure and dynamics of the resulting groups using liquid‑state metrics and neighbour exchange rates, finding qualitative agreement with empirical observations in starling flocks. These results suggest that flocking may emerge in groups of moving agents as an adaptive response to the biological imperatives of staying together while avoiding collisions.

PaperID: 9960, https://arxiv.org/pdf/2506.15498.pdf

Authors: Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

Title: SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Abstract:
Process or step‑wise supervision has played a crucial role in advancing complex multi‑step reasoning capabilities of Large Language Models (LLMs). However, efficient, high‑quality automated process annotation remains a significant challenge. To address this, we introduce Single‑Pass Annotation with Reference‑Guided Evaluation (SPARE), a novel structured framework that enables efficient per‑step annotation by jointly aligning solution steps to reference solutions and determine its accuracy with explicit reasoning in single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi‑hop question answering (MuSiQue‑Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine‑tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data‑efficient out‑of‑distribution generalization, using only ～16% of training samples compared to human‑labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS‑based methods while offering 2.3× speedup in terms of total token count. Manual analysis reveals complementary precision‑recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.

PaperID: 9961, https://arxiv.org/pdf/2506.15207.pdf

Authors: Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

Title: Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study

Abstract:
The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, addressing challenges in climate monitoring, disaster management, and more. However, autonomous coordination in multi‑satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle to handle the real‑time decision‑making demands of dynamic EO missions, necessitating the use of Reinforcement Learning (RL) and Multi‑Agent Reinforcement Learning (MARL). In this paper, we investigate RL‑based autonomous EO mission planning by modelling single‑satellite operations and extending to multi‑satellite constellations using MARL frameworks. We address key challenges, including energy and data storage limitations, uncertainties in satellite observations, and the complexities of decentralised coordination under partial observability. By leveraging a near‑realistic satellite simulation environment, we evaluate the training stability and performance of state‑of‑the‑art MARL algorithms, including PPO, IPPO, MAPPO, and HAPPO. Our results demonstrate that MARL can effectively balance imaging and resource management while addressing non‑stationarity and reward interdependency in multi‑satellite coordination. The insights gained from this study provide a foundation for autonomous satellite operations, offering practical guidelines for improving policy learning in decentralised EO missions.

PaperID: 9962, https://arxiv.org/pdf/2506.15167.pdf

Authors: Wanzhe Wang, Jianqiu Peng, Menghao Hu, Weihuang Zhong, Tong Zhang, Shuai Wang, Yixin Zhang, Mingjie Shao, Wanli Ni

Title: LLM Agent for Hyper-Parameter Optimization

Abstract:
Hyper‑parameters are essential and critical for the performance of communication algorithms. However, current hyper‑parameters optimization approaches for Warm‑Start Particles Swarm Optimization with Crossover and Mutation (WS‑PSO‑CM) algorithm, designed for radio map‑enabled unmanned aerial vehicle (UAV) trajectory and communication, are primarily heuristic‑based, exhibiting low levels of automation and improvable performance. In this paper, we design an Large Language Model (LLM) agent for automatic hyper‑parameters‑tuning, where an iterative framework and Model Context Protocol (MCP) are applied. In particular, the LLM agent is first set up via a profile, which specifies the boundary of hyper‑parameters, task objective, terminal condition, conservative or aggressive strategy of optimizing hyper‑parameters, and LLM configurations. Then, the LLM agent iteratively invokes WS‑PSO‑CM algorithm for exploration. Finally, the LLM agent exits the loop based on the terminal condition and returns an optimized set of hyperparameters. Our experiment results show that the minimal sum‑rate achieved by hyper‑parameters generated via our LLM agent is significantly higher than those by both human heuristics and random generation methods. This indicates that an LLM agent with PSO and WS‑PSO‑CM algorithm knowledge is useful in seeking high‑performance hyper‑parameters.

PaperID: 9963, https://arxiv.org/pdf/2506.15106.pdf

Authors: Ziqin Chen, Yongqiang Wang

Title: Local Differential Privacy for Distributed Stochastic Aggregative Optimization with Guaranteed Optimality

Abstract:
Distributed aggregative optimization underpins many cooperative optimization and multi‑agent control systems, where each agent's objective function depends both on its local optimization variable and an aggregate of all agents' optimization variables. Existing distributed aggregative optimization approaches typically require access to accurate gradients of the objective functions, which, however, are often hard to obtain in real‑world applications. For example, in machine learning, gradients are commonly contaminated by two main sources of noise: the randomness inherent in sampled data, and the additional variability introduced by mini‑batch computations. In addition to the issue of relying on accurate gradients, existing distributed aggregative optimization approaches require agents to share explicit information, which could breach the privacy of participating agents. We propose an algorithm that can solve both problems with existing distributed aggregative optimization approaches: not only can the proposed algorithm guarantee mean‑square convergence to an exact optimal solution when the gradients are subject to noise, it also simultaneously ensures rigorous differential privacy, with the cumulative privacy budget guaranteed to be finite even when the number of iterations tends to infinity. To the best of our knowledge, this is the first algorithm able to guarantee both accurate convergence and rigorous differential privacy in distributed aggregative optimization. Besides characterizing the convergence rates under nonconvex/convex/strongly convex conditions, we also rigorously quantify the cost of differential privacy in terms of convergence rates. Experimental results on personalized machine learning using benchmark datasets confirm the efficacy of the proposed algorithm.

PaperID: 9964, https://arxiv.org/pdf/2506.14990.pdf

Authors: Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, Mykola Pechenizkiy

Title: MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

Abstract:
Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms, with environment availability strongly impacting research. One particularly underexplored intersection is continual learning (CL) in cooperative multi‑agent settings. To remedy this, we introduce MEAL (Multi‑agent Environments for Adaptive Learning), the first benchmark tailored for continual multi‑agent reinforcement learning (CMARL). Existing CL benchmarks run environments on the CPU, leading to computational bottlenecks and limiting the length of task sequences. MEAL leverages JAX for GPU acceleration, enabling continual learning across sequences of 100 tasks on a standard desktop PC in a few hours. We show that naively combining popular CL and MARL methods yields strong performance on simple environments, but fails to scale to more complex settings requiring sustained coordination and adaptation. Our ablation study identifies architectural and algorithmic features critical for CMARL on MEAL.

PaperID: 9965, https://arxiv.org/pdf/2506.14988.pdf

Authors: Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng

Title: Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Abstract:
We propose a multi‑agent multi‑armed bandit (MA‑MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision‑making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real‑world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

PaperID: 9966, https://arxiv.org/pdf/2506.14852.pdf

Authors: Qizheng Zhang, Michael Wornow, Gerry Wan, Kunle Olukotun

Title: Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

Abstract:
LLM‑based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agent applications where outputs depend on external data and environmental contexts. We propose Agentic Plan Caching (APC), a novel test‑time memory that extracts, stores, adapts, and reuses structured plan templates from planning stages of agent applications across semantically similar tasks to reduce the cost and latency of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test‑time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task‑specific plans with contexts. Evaluation across multiple real‑world agent applications shows that our system can reduce costs by 50.31% and latency by 27.28% on average while maintaining performance, offering a more efficient solution for serving LLM‑based agents that complements existing LLM serving infrastructures.

PaperID: 9967, https://arxiv.org/pdf/2506.14831.pdf

Authors: Céline Finet, Stephane Da Silva Martins, Jean-Bernard Hayet, Ioannis Karamouzas, Javad Amirian, Sylvie Le Hégarat-Mascle, Julien Pettré, Emanuel Aldea

Title: Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

Abstract:
With the emergence of powerful data‑driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi‑agent interactions lies within hand's reach, with important implications in areas such as social robot navigation, autonomous driving, and crowd modeling. This survey reviews some of the most recent advancements in deep learning‑based multi‑agent trajectory prediction, focusing on studies published between 2020 and 2025. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi‑agent HTP.

PaperID: 9968, https://arxiv.org/pdf/2506.14769.pdf

Authors: Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, Ruimao Zhang

Title: CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion

Abstract:
Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real‑time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long‑horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer‑based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context‑aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key‑value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real‑world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.

PaperID: 9969, https://arxiv.org/pdf/2506.14749.pdf

Authors: Shiyu Cheng, Luyao Niu, Bhaskar Ramasubramanian, Andrew Clark, Radha Poovendran

Title: Swarm-STL: A Framework for Motion Planning in Large-Scale, Multi-Swarm Systems

Abstract:
In multi‑agent systems, signal temporal logic (STL) is widely used for path planning to accomplish complex objectives with formal safety guarantees. However, as the number of agents increases, existing approaches encounter significant computational challenges. Recognizing that many complex tasks require cooperation among multiple agents, we propose swarm STL specifications to describe the collective tasks that need to be achieved by a team of agents. Next, we address the motion planning problem for all the agents in two stages. First, we abstract a group of cooperating agents as a swarm and construct a reduced‑dimension state space whose dimension does not increase with the number of agents. The path planning is performed at the swarm level, ensuring the safety and swarm STL specifications are satisfied. Then, we design low‑level control strategies for agents within each swarm based on the path synthesized in the first step. The trajectories of agents generated by the two‑step policy ensure satisfaction of the STL specifications. We evaluate our two‑stage approach in both single‑swarm and multi‑swarm scenarios. The results demonstrate that all tasks are completed with safety guarantees. Compared to the baseline multi‑agent planning approach, our method maintains computational efficiency as the number of agents increases, since the computational time scales with the number of swarms rather than the number of agents.

PaperID: 9970, https://arxiv.org/pdf/2506.14713.pdf

Authors: Victorien Desbois, Ocan Sankur, François Schwarzentruber

Title: Linear Planar 3-SAT and Its Applications in Planning

Abstract:
Several fragments of the satisfiability problem have been studied in the literature. Among these, Linear 3‑SAT is a satisfaction problem in which each clause (viewed as a set of literals) intersects with at most one other clause; moreover, any pair of clauses have at most one literal in common. Planar 3‑SAT is a fragment which requires that the so‑called variable‑clause graph is planar. Both fragments are NP‑complete and have applications in encoding NP‑hard planning problems. In this paper, we investigate the complexity and applications of the fragment obtained combining both features. We define Linear Planar 3‑SAT and prove its NP‑completeness. We also study the reconfiguration problem of Linear Planar 3‑SAT and show that it is PSPACE‑complete. As an application, we use these new results to prove the NP‑completeness of Bounded Connected Multi‑Agent Pathfinding and the PSPACE‑completeness of Connected Multi‑Agent Pathfinding in two‑dimensional grids.

PaperID: 9971, https://arxiv.org/pdf/2506.14697.pdf

Authors: Zonghao Ying, Le Wang, Yisong Xiao, Jiakai Wang, Yuqing Ma, Jinyang Guo, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, Xianglong Liu

Title: AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

Abstract:
The integration of vision‑language models (VLMs) is driving a new generation of embodied agents capable of operating in human‑centered environments. However, as deployment expands, these systems face growing safety risks, particularly when executing hazardous instructions. Current safety evaluation benchmarks remain limited: they cover only narrow scopes of hazards and focus primarily on final outcomes, neglecting the agent's full perception‑planning‑execution process and thereby obscuring critical failure modes. Therefore, we present SAFE, a benchmark for systematically assessing the safety of embodied VLM agents on hazardous instructions. SAFE comprises three components: SAFE‑THOR, an extensible adversarial simulation sandbox with a universal adapter that maps high‑level VLM outputs to low‑level embodied controls, supporting diverse agent workflow integration; SAFE‑VERSE, a risk‑aware task suite inspired by Asimov's Three Laws of Robotics, comprising 45 adversarial scenarios, 1,350 hazardous tasks, and 9,900 instructions that span risks to humans, environments, and agents; and SAFE‑DIAGNOSE, a multi‑level and fine‑grained evaluation protocol measuring agent performance across perception, planning, and execution. Applying SAFE to nine state‑of‑the‑art VLMs and two embodied agent workflows, we uncover systematic failures in translating hazard recognition into safe planning and execution. Our findings reveal fundamental limitations in current safety alignment and demonstrate the necessity of a comprehensive, multi‑stage evaluation for developing safer embodied intelligence.

PaperID: 9972, https://arxiv.org/pdf/2506.14690.pdf

Authors: Kalliyan Velasco, Timothy W. McLain, Joshua G. Mangelson

Title: Factor-Graph-Based Passive Acoustic Navigation for Decentralized Cooperative Localization Using Bearing Elevation Depth Difference

Abstract:
Accurate and scalable underwater multi‑agent localization remains a critical challenge due to the constraints of underwater communication. In this work, we propose a multi‑agent localization framework using a factor‑graph representation that incorporates bearing, elevation, and depth difference (BEDD). Our method leverages inverted ultra‑short baseline (inverted‑USBL) derived azimuth and elevation measurements from incoming acoustic signals and relative depth measurements to enable cooperative localization for a multi‑robot team of autonomous underwater vehicles (AUVs). We validate our approach in the HoloOcean underwater simulator with a fleet of AUVs, demonstrating improved localization accuracy compared to dead reckoning. Additionally, we investigate the impact of azimuth and elevation measurement outliers, highlighting the need for robust outlier rejection techniques for acoustic signals.

PaperID: 9973, https://arxiv.org/pdf/2506.14683.pdf

Authors: Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, Abhik Roychoudhury

Title: Unified Software Engineering Agent as AI Software Engineer

Abstract:
The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project. In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously. But is an LLM agent the same as an AI software engineer? In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent. Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities. This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others. We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising of myriad tasks such as coding, testing, and patching. USEbench is a judicious mixture of tasks from existing benchmarks such as SWE‑bench, SWT‑bench, and REPOCOD. In an evaluation on USEbench consisting of 1,271 repository‑level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent. There exist gaps in the capabilities of USEagent for certain coding tasks, which provides hints on further developing the AI Software Engineer of the future.

PaperID: 9974, https://arxiv.org/pdf/2506.14539.pdf

Authors: Daewon Kang, YeongHwan Shin, Doyeon Kim, Kyu-Hwan Jung, Meong Hi Son

Title: Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

Abstract:
Since the advent of large language models, prompt engineering now enables the rapid, low‑effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts. In this paper, we propose the ''Doppelganger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)'' level to evaluate the vulnerability to this adversarial transfer attack. We also propose a ''Caution for Adversarial Transfer (CAT)'' prompt to counter the Doppelganger method. The experimental results demonstrate that the Doppelganger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable effective defense against this adversarial attack.

PaperID: 9975, https://arxiv.org/pdf/2506.14512.pdf

Authors: Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin

Title: SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Abstract:
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision‑Language Models (VLMs) in real‑world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI‑Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial‑grounded reasoning tasks. SIRI‑Bench comprises 9,000 video‑question‑answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large‑scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state‑of‑the‑art VLMs struggle significantly on SIRI‑Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem‑solving.

PaperID: 9976, https://arxiv.org/pdf/2506.14391.pdf

Authors: Yaqiao Zhu, Hongkai Wen, Geyong Min, Man Luo

Title: HALO: Hierarchical Reinforcement Learning for Large-Scale Adaptive Traffic Signal Control

Abstract:
Adaptive traffic signal control (ATSC) is essential for mitigating urban congestion in modern smart cities, where traffic infrastructure is evolving into interconnected Web‑of‑Things (WoT) environments with thousands of sensing‑and‑control nodes. However, existing methods face a critical scalability‑coordination tradeoff: centralized approaches optimize global objectives but become computationally intractable at city scale, while decentralized multi‑agent methods scale efficiently yet lack network‑level coherence, resulting in suboptimal performance. In this paper, we present HALO, a hierarchical reinforcement learning framework that addresses this tradeoff for large‑scale ATSC. HALO decouples decision‑making into two levels: a high‑level global guidance policy employs Transformer‑LSTM encoders to model spatio‑temporal dependencies across the entire network and broadcast compact guidance signals, while low‑level local intersection policies execute decentralized control conditioned on both local observations and global context. To ensure better alignment of global‑local objectives, we introduce an adversarial goal‑setting mechanism where the global policy proposes challenging‑yet‑feasible network‑level targets that local policies are trained to surpass, fostering robust coordination. We evaluate HALO extensively on multiple standard benchmarks, and a newly constructed large‑scale Manhattan‑like network with 2,668 intersections under real‑world traffic patterns, including peak transitions, adverse weather and holiday surges. Results demonstrate HALO shows competitive performance and becomes increasingly dominant as network complexity grows across small‑scale benchmarks, while delivering the strongest performance in all large‑scale regimes, offering up to 6.8% lower average travel time and 5.0% lower average delay than the best state‑of‑the‑art.

PaperID: 9977, https://arxiv.org/pdf/2506.14285.pdf

Authors: Seongbo Jang, Minjin Jeon, Jaehoon Lee, Seonghyeon Lee, Dongha Lee, Hwanjo Yu

Title: From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

Abstract:
While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored. To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time‑conditioned responses. Additionally, we construct a large‑scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event‑driven dialogues. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals. Experimental results show that Timer outperforms prompting‑based LLMs and other fine‑tuned baselines in both turn‑level and dialogue‑level evaluations. We publicly release our data, model, and code.

PaperID: 9978, https://arxiv.org/pdf/2506.14237.pdf

Authors: Xiyu Zhao, Qimei Cui, Wei Ni, Quan Z. Sheng, Abbas Jamalipour, Guoshun Nan, Xiaofeng Tao, Ping Zhang

Title: A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams

Abstract:
The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device‑to‑device transmissions to minimize the time‑average LoIU within a robot team. A semi‑decentralized Multi‑Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.

PaperID: 9979, https://arxiv.org/pdf/2506.14199.pdf

Authors: Junghwan Kim, Kieun Park, Sohee Park, Hyunggug Kim, Bongwon Suh

Title: MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

Abstract:
Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS‑LitEval, a multi‑agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. We tested MAS‑LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur's Court, generated by various LLMs, and compared it to traditional metrics. MAS‑LitEval outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.

PaperID: 9980, https://arxiv.org/pdf/2506.14187.pdf

Authors: Jiaming Yu, Le Liang, Hao Ye, Shi Jin

Title: Hierarchical Multi-Agent Reinforcement Learning-based Coordinated Spatial Reuse for Next Generation WLANs

Abstract:
High‑density Wi‑Fi deployments often result in significant co‑channel interference, which degrades overall network performance. To address this issue, coordination of multi access points (APs) has been considered to enable coordinated spatial reuse (CSR) in next generation wireless local area networks. This paper tackles the challenge of downlink spatial reuse in Wi‑Fi networks, specifically in scenarios involving overlapping basic service sets, by employing hierarchical multi‑agent reinforcement learning (HMARL). We decompose the CSR process into two phases, i.e., a polling phase and a decision phase, and introduce the HMARL algorithm to enable efficient CSR. To enhance training efficiency, the proposed HMARL algorithm employs a hierarchical structure, where station selection and power control are determined by a high‑ and low‑level policy network, respectively. Simulation results demonstrate that this approach consistently outperforms baseline methods in terms of throughput and latency across various network topologies. Moreover, the algorithm exhibits robust performance when coexisting with legacy APs. Additional experiments in a representative topology further reveal that the carefully designed reward function not only maximizes the overall network throughput, but also improves fairness in transmission opportunities for APs in high‑interference regions.

PaperID: 9981, https://arxiv.org/pdf/2506.14164.pdf

Authors: Hanzhong Cao

Title: Light Aircraft Game : Basic Implementation and training results analysis

Abstract:
This paper investigates multi‑agent reinforcement learning (MARL) in a partially observable, cooperative‑competitive combat environment known as LAG. We describe the environment's setup, including agent actions, hierarchical controls, and reward design across different combat modes such as No Weapon and ShootMissile. Two representative algorithms are evaluated: HAPPO, an on‑policy hierarchical variant of PPO, and HASAC, an off‑policy method based on soft actor‑critic. We analyze their training stability, reward progression, and inter‑agent coordination capabilities. Experimental results show that HASAC performs well in simpler coordination tasks without weapons, while HAPPO demonstrates stronger adaptability in more dynamic and expressive scenarios involving missile combat. These findings provide insights into the trade‑offs between on‑policy and off‑policy methods in multi‑agent settings.

PaperID: 9982, https://arxiv.org/pdf/2506.14159.pdf

Authors: Shayan Talaei, Meijin Li, Kanu Grover, James Kent Hippler, Diyi Yang, Amin Saberi

Title: StorySage: Conversational Autobiography Writing Powered by a Multi-Agent Framework

Abstract:
Every individual carries a unique and personal life story shaped by their memories and experiences. However, these memories are often scattered and difficult to organize into a coherent narrative, a challenge that defines the task of autobiography writing. Existing conversational writing assistants tend to rely on generic user interactions and pre‑defined guidelines, making it difficult for these systems to capture personal memories and develop a complete biography over time. We introduce StorySage, a user‑driven software system designed to meet the needs of a diverse group of users that supports a flexible conversation and a structured approach to autobiography writing. Powered by a multi‑agent framework composed of an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator, our system iteratively collects user memories, updates their autobiography, and plans for future conversations. In experimental simulations, StorySage demonstrates its ability to navigate multiple sessions and capture user memories across many conversations. User studies (N=28) highlight how StorySage maintains improved conversational flow, narrative completeness, and higher user satisfaction when compared to a baseline. In summary, StorySage contributes both a novel architecture for autobiography writing and insights into how multi‑agent systems can enhance human‑AI creative partnerships.

PaperID: 9983, https://arxiv.org/pdf/2506.14142.pdf

Authors: Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou, Fang Zeng, Yijun Luo, Tianyu Lin, Yihang Su, Yichen Wu, Kai Zhang, Zhen Xiang, Tianming Liu, Ninghao Liu, Lichao Sun, Yixuan Yuan, Xiang Li

Title: RadFabric: Agentic AI System with Reasoning Capability for Radiology

Abstract:
Chest X ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence based diagnoses. RadFabric achieves significant performance improvements, with near‑perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross modal feature alignment and preference‑driven reasoning, RadFabric advances AI‑driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.

PaperID: 9984, https://arxiv.org/pdf/2506.13983.pdf

Authors: Adarsh Gupta, Bhabesh Mali, Chandan Karfa

Title: SANGAM: SystemVerilog Assertion Generation via Monte Carlo Tree Self-Refine

Abstract:
Recent advancements in the field of reasoning using Large Language Models (LLMs) have created new possibilities for more complex and automatic Hardware Assertion Generation techniques. This paper introduces SANGAM, a SystemVerilog Assertion Generation framework using LLM‑guided Monte Carlo Tree Search for the automatic generation of SVAs from industry‑level specifications. The proposed framework utilizes a three‑stage approach: Stage 1 consists of multi‑modal Specification Processing using Signal Mapper, SPEC Analyzer, and Waveform Analyzer LLM Agents. Stage 2 consists of using the Monte Carlo Tree Self‑Refine (MCTSr) algorithm for automatic reasoning about SVAs for each signal, and finally, Stage 3 combines the MCTSr‑generated reasoning traces to generate SVA assertions for each signal. The results demonstrated that our framework, SANGAM, can generate a robust set of SVAs, performing better in the evaluation process in comparison to the recent methods.

PaperID: 9985, https://arxiv.org/pdf/2506.13950.pdf

Authors: Dimitrios G. Patsatzis, Nikolaos Kazantzis, Ioannis G. Kevrekidis, Lucia Russo, Constantinos Siettos

Title: Invariant Manifolds of Discrete-time Dynamical Systems with Nonlinear Exosystems via Hybrid Physics-Informed Neural Networks

Abstract:
We propose a hybrid physics‑informed machine learning framework to approximate invariant manifolds (IMs) of discrete‑time dynamical systems driven by exogenous autonomous dynamics (exosystems). Such systems appear in applications ranging from control theory to modeling collective multi‑agent behavior (e.g., bird flocks, traffic dynamics) under hierarchical leadership. The IM learning problem is formulated as solving nonlinear functional equations derived from the invariance equation, expressing the manifold as a relationship between exogenous and system states. The proposed approach combines polynomial series with shallow neural networks, leveraging their complementary strengths. We focus on low‑ to medium‑dimensional manifolds where polynomial expansions remain tractable. Near equilibrium, polynomial series provide interpretability and convergence, while farther away neural networks capture global structure through their universal approximation capability. A continuity penalty enforces consistency between both representations at their interface, and training is performed using analytically derived derivatives within the Levenberg‑Marquardt scheme. Naturally, depending on the dimensionality of the input‑driven system, one may also employ a purely neural network‑based IM approximation, for which we also establish a universal approximation theorem based on certain assumptions on system dynamics. The framework is evaluated on two benchmark problems: an enzymatic bioreactor and a leader‑follower car‑following model. We analyze convergence, approximation accuracy, and computational cost, and compare standalone neural networks, polynomial expansions, and the hybrid method. Results show that the hybrid approach achieves superior accuracy compared to standalone schemes.

PaperID: 9986, https://arxiv.org/pdf/2506.13905.pdf

Authors: Zhongzhi Yu, Mingjie Liu, Michael Zimmer, Yingyan Celine Lin, Yong Liu, Haoxing Ren

Title: Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Abstract:
Despite recent progress in generating hardware RTL code with LLMs, existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real‑world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL‑Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM‑based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL‑Agent introduces a novel multi‑agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step‑by‑step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations to enhance correctness and synthesisability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for HLS. This agent‑driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches. We evaluate Spec2RTL‑Agent on three specification documents, showing it generates accurate RTL code with up to 75% fewer human interventions than existing methods. This highlights its role as the first fully automated multi‑agent system for RTL generation from unstructured specs, reducing reliance on human effort in hardware design.

PaperID: 9987, https://arxiv.org/pdf/2506.13811.pdf

Authors: Sompote Youwai, David Phim, Vianne Gayl Murcia, Rianne Clair Onas

Title: Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study

Abstract:
This study investigates router‑based multi‑agent systems for automating foundation design calculations through intelligent task classification and expert selection. Three approaches were evaluated: single‑agent processing, multi‑agent designer‑checker architecture, and router‑based expert selection. Performance assessment utilized baseline models including DeepSeek R1, ChatGPT 4 Turbo, Grok 3, and Gemini 2.5 Pro across shallow foundation and pile design scenarios. The router‑based configuration achieved performance scores of 95.00% for shallow foundations and 90.63% for pile design, representing improvements of 8.75 and 3.13 percentage points over standalone Grok 3 performance respectively. The system outperformed conventional agentic workflows by 10.0 to 43.75 percentage points. Grok 3 demonstrated superior standalone performance without external computational tools, indicating advances in direct LLM mathematical reasoning for engineering applications. The dual‑tier classification framework successfully distinguished foundation types, enabling appropriate analytical approaches. Results establish router‑based multi‑agent systems as optimal for foundation design automation while maintaining professional documentation standards. Given safety‑critical requirements in civil engineering, continued human oversight remains essential, positioning these systems as advanced computational assistance tools rather than autonomous design replacements in professional practice.

PaperID: 9988, https://arxiv.org/pdf/2506.13794.pdf

Authors: Jared James Grogan

Title: AgentFacts: Universal KYA Standard for Verified AI Agent Metadata & Deployment

Abstract:
Enterprise AI deployment faces critical "Know Your Agent" (KYA) challenges where organizations must verify third‑party agent capabilities and establish trust without standardized metadata or verification infrastructure. Current approaches rely on self‑declared capabilities and custom integration processes that create trust gaps and coordination friction limiting confident enterprise adoption. This paper presents AgentFacts, a universal metadata standard that enables systematic agent verification through cryptographically‑signed capability declarations, multi‑authority validation, and dynamic permission management. The specification introduces domain‑specialized verification where different trusted authorities validate specific metadata aspects based on their expertise, eliminating single points of trust failure while enabling graduated confidence assessment. AgentFacts transforms agent procurement from custom integration projects into standardized workforce management, providing the transparency and governance infrastructure necessary for enterprise AI coordination at scale.

PaperID: 9989, https://arxiv.org/pdf/2506.13783.pdf

Authors: Soyeon Choi, Kangwook Lee, Oliver Sng, Joshua M. Ackerman

Title: Infected Smallville: How Disease Threat Shapes Sociality in LLM Agents

Abstract:
How does the threat of infectious disease influence sociality among generative agents? We used generative agent‑based modeling (GABM), powered by large language models, to experimentally test hypotheses about the behavioral immune system. Across three simulation runs, generative agents who read news about an infectious disease outbreak showed significantly reduced social engagement compared to agents who received no such news, including lower attendance at a social gathering, fewer visits to third places (e.g., cafe, store, park), and fewer conversations throughout the town. In interview responses, agents explicitly attributed their behavioral changes to disease‑avoidance motivations. A validity check further indicated that they could distinguish between infectious and noninfectious diseases, selectively reducing social engagement only when there was a risk of infection. Our findings highlight the potential of GABM as an experimental tool for exploring complex human social dynamics at scale.

PaperID: 9990, https://arxiv.org/pdf/2506.13755.pdf

Authors: Arya Fayyazi, Mehdi Kamal, Massoud Pedram

Title: MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Prediction Filtering

Abstract:
This paper introduces MARCO (Multi‑Agent Reinforcement learning with Conformal Optimization), a novel hardware‑aware framework for efficient neural architecture search (NAS) targeting resource‑constrained edge devices. By significantly reducing search time and maintaining accuracy under strict hardware constraints, MARCO bridges the gap between automated DNN design and CAD for edge AI deployment. MARCO's core technical contribution lies in its unique combination of multi‑agent reinforcement learning (MARL) with Conformal Prediction (CP) to accelerate the hardware/software co‑design process for deploying deep neural networks. Unlike conventional once‑for‑all (OFA) supernet approaches that require extensive pretraining, MARCO decomposes the NAS task into a hardware configuration agent (HCA) and a Quantization Agent (QA). The HCA optimizes high‑level design parameters, while the QA determines per‑layer bit‑widths under strict memory and latency budgets using a shared reward signal within a centralized‑critic, decentralized‑execution (CTDE) paradigm. A key innovation is the integration of a calibrated CP surrogate model that provides statistical guarantees (with a user‑defined miscoverage rate) to prune unpromising candidate architectures before incurring the high costs of partial training or hardware simulation. This early filtering drastically reduces the search space while ensuring that high‑quality designs are retained with a high probability. Extensive experiments on MNIST, CIFAR‑10, and CIFAR‑100 demonstrate that MARCO achieves a 3‑4x reduction in total search time compared to an OFA baseline while maintaining near‑baseline accuracy (within 0.3%). Furthermore, MARCO also reduces inference latency. Validation on a MAX78000 evaluation board confirms that simulator trends hold in practice, with simulator estimates deviating from measured values by less than 5%.

PaperID: 9991, https://arxiv.org/pdf/2506.13651.pdf

Authors: Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo

Title: xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

Abstract:
We introduce xbench, a dynamic, profession‑aligned evaluation suite designed to bridge the gap between AI agent capabilities and real‑world productivity. While existing benchmarks often focus on isolated technical skills, they may not accurately reflect the economic value agents deliver in professional settings. To address this, xbench targets commercially significant domains with evaluation tasks defined by industry professionals. Our framework creates metrics that strongly correlate with productivity value, enables prediction of Technology‑Market Fit (TMF), and facilitates tracking of product capabilities over time. As our initial implementations, we present two benchmarks: Recruitment and Marketing. For Recruitment, we collect 50 tasks from real‑world headhunting business scenarios to evaluate agents' abilities in company mapping, information retrieval, and talent sourcing. For Marketing, we assess agents' ability to match influencers with advertiser needs, evaluating their performance across 50 advertiser requirements using a curated pool of 836 candidate influencers. We present initial evaluation results for leading contemporary agents, establishing a baseline for these professional domains. Our continuously updated evalsets and evaluations are available at https://xbench.org.

PaperID: 9992, https://arxiv.org/pdf/2506.13650.pdf

Authors: Violetta Rostobaya, James Berneburg, Yue Guan, Michael Dorothy, Daigo Shishika

Title: Deceptive Path Planning: A Bayesian Game Approach

Abstract:
This paper investigates how an autonomous agent can transmit information through its motion in an adversarial setting. We consider scenarios where an agent must reach its goal while deceiving an intelligent observer about its destination. We model this interaction as a dynamic Bayesian game between a mobile Attacker with a privately known goal and a Defender who infers the Attacker's intent to allocate defensive resources effectively. We use Perfect Bayesian Nash Equilibrium (PBNE) as our solution concept and propose a computationally efficient approach to find it. In the resulting equilibrium, the Defender employs a simple Markovian strategy, while the Attacker strategically balances deception and goal efficiency by stochastically mixing shortest and non‑shortest paths to manipulate the Defender's beliefs. Numerical experiments demonstrate the advantages of our PBNE‑based strategies over existing methods based on one‑sided optimization.

PaperID: 9993, https://arxiv.org/pdf/2506.13590.pdf

Authors: Ken Huang, Akram Sheriff, Vineeth Sai Narajala, Idan Habler

Title: Agent Capability Negotiation and Binding Protocol (ACNBP)

Abstract:
As multi‑agent systems evolve to encompass increasingly diverse and specialized agents, the challenge of enabling effective collaboration between heterogeneous agents has become paramount, with traditional agent communication protocols often assuming homogeneous environments or predefined interaction patterns that limit their applicability in dynamic, open‑world scenarios. This paper presents the Agent Capability Negotiation and Binding Protocol (ACNBP), a novel framework designed to facilitate secure, efficient, and verifiable interactions between agents in heterogeneous multi‑agent systems through integration with an Agent Name Service (ANS) infrastructure that provides comprehensive discovery, negotiation, and binding mechanisms. The protocol introduces a structured 10‑step process encompassing capability discovery, candidate pre‑screening and selection, secure negotiation phases, and binding commitment with built‑in security measures including digital signatures, capability attestation, and comprehensive threat mitigation strategies, while a key innovation of ACNBP is its protocolExtension mechanism that enables backward‑compatible protocol evolution and supports diverse agent architectures while maintaining security and interoperability. We demonstrate ACNBP's effectiveness through a comprehensive security analysis using the MAESTRO threat modeling framework, practical implementation considerations, and a detailed example showcasing the protocol's application in a document translation scenario, with the protocol addressing critical challenges in agent autonomy, capability verification, secure communication, and scalable agent ecosystem management.

PaperID: 9994, https://arxiv.org/pdf/2506.13453.pdf

Authors: YR Darr, MA Niazi

Title: Towards a Formal Specification for Self-organized Shape Formation in Swarm Robotics

Abstract:
The self‑organization of robots for the formation of structures and shapes is a stimulating application of the swarm robotic system. It involves a large number of autonomous robots of heterogeneous behavior, coordination among them, and their interaction with the dynamic environment. This process of complex structure formation is considered a complex system, which needs to be modeled by using any modeling approach. Although the formal specification approach along with other formal methods has been used to model the behavior of robots in a swarm. However, to the best of our knowledge, the formal specification approach has not been used to model the self‑organization process in swarm robotic systems for shape formation. In this paper, we use a formal specification approach to model the shape formation task of swarm robots. We use Z (Zed) language of formal specification, which is a state‑based language, to model the states of the entities of the systems. We demonstrate the effectiveness of Z for the self‑organized shape formation. The presented formal specification model gives the outlines for designing and implementing the swarm robotic system for the formation of complex shapes and structures. It also provides the foundation for modeling the complex shape formation process for swarm robotics using a multi‑agent system in a simulation‑based environment. Keywords: Swarm robotics, Self‑organization, Formal specification, Complex systems

PaperID: 9995, https://arxiv.org/pdf/2506.13324.pdf

Authors: Gianni Molinari, Fabio Ciravegna

Title: Towards Pervasive Distributed Agentic Generative AI -- A State of The Art

Abstract:
The rapid advancement of intelligent agents and Large Language Models (LLMs) is reshaping the pervasive computing field. Their ability to perceive, reason, and act through natural language understanding enables autonomous problem‑solving in complex pervasive environments, including the management of heterogeneous sensors, devices, and data. This survey outlines the architectural components of LLM agents (profiling, memory, planning, and action) and examines their deployment and evaluation across various scenarios. Than it reviews computational and infrastructural advancements (cloud to edge) in pervasive computing and how AI is moving in this field. It highlights state‑of‑the‑art agent deployment strategies and applications, including local and distributed execution on resource‑constrained devices. This survey identifies key challenges of these agents in pervasive computing such as architectural, energetic and privacy limitations. It finally proposes what we called "Agent as a Tool", a conceptual framework for pervasive agentic AI, emphasizing context awareness, modularity, security, efficiency and effectiveness.

PaperID: 9996, https://arxiv.org/pdf/2506.13270.pdf

Authors: Nan Chen, Luna K. Qiu, Arran Zeyu Wang, Zilong Wang, Yuqing Yang

Title: Screen Reader Programmers in the Vibe Coding Era: Adaptation, Empowerment, and New Accessibility Landscape

Abstract:
Generative AI agents are reshaping human‑computer interaction, shifting users from direct task execution to supervising machine‑driven actions, especially the rise of "vibe coding" in programming. Yet little is known about how screen reader programmers interact with AI code assistants in practice. We conducted a longitudinal study with 16 blind and low‑vision programmers. Participants completed a GitHub Copilot tutorial, engaged with a programming task, and provided initial feedback. After two weeks of AI‑assisted programming, follow‑ups examined how their practices and perceptions evolved. Our findings show that code assistants enhanced programming efficiency and bridged accessibility gaps. However, participants struggled to convey intent, interpret AI outputs, and manage multiple views while maintaining situational awareness. They showed diverse preferences for accessibility features, expressed a need to balance automation with control, and encountered barriers when learning to use these tools. Furthermore, we propose design principles and recommendations for more accessible and inclusive human‑AI collaborations.

PaperID: 9997, https://arxiv.org/pdf/2506.13246.pdf

Authors: Craig Steven Wright

Title: On Immutable Memory Systems for Artificial Agents: A Blockchain-Indexed Automata-Theoretic Framework Using ECDH-Keyed Merkle Chains

Abstract:
This paper presents a formalised architecture for synthetic agents designed to retain immutable memory, verifiable reasoning, and constrained epistemic growth. Traditional AI systems rely on mutable, opaque statistical models prone to epistemic drift and historical revisionism. In contrast, we introduce the concept of the Merkle Automaton, a cryptographically anchored, deterministic computational framework that integrates formal automata theory with blockchain‑based commitments. Each agent transition, memory fragment, and reasoning step is committed within a Merkle structure rooted on‑chain, rendering it non‑repudiable and auditably permanent. To ensure selective access and confidentiality, we derive symmetric encryption keys from ECDH exchanges contextualised by hierarchical privilege lattices. This enforces cryptographic access control over append‑only DAG‑structured knowledge graphs. Reasoning is constrained by formal logic systems and verified through deterministic traversal of policy‑encoded structures. Updates are non‑destructive and historied, preserving epistemic lineage without catastrophic forgetting. Zero‑knowledge proofs facilitate verifiable, privacy‑preserving inclusion attestations. Collectively, this architecture reframes memory not as a cache but as a ledger ‑ one whose contents are enforced by protocol, bound by cryptography, and constrained by formal logic. The result is not an intelligent agent that mimics thought, but an epistemic entity whose outputs are provably derived, temporally anchored, and impervious to post hoc revision. This design lays foundational groundwork for legal, economic, and high‑assurance computational systems that require provable memory, unforgeable provenance, and structural truth.

PaperID: 9998, https://arxiv.org/pdf/2506.13202.pdf

Authors: Bin-Bin Hu, Yanxin Zhou, Henglai Wei, Shuo Cheng, Chen Lv

Title: C2TE: Coordinated Constrained Task Execution Design for Ordering-Flexible Multi-Vehicle Platoon Merging

Abstract:
In this paper, we propose a distributed coordinated constrained task execution (C2TE) algorithm that enables a team of vehicles from different lanes to cooperatively merge into an \it ordering‑flexible platoon maneuvering on the desired lane. Therein, the platoon is flexible in the sense that no specific spatial ordering sequences of vehicles are predetermined. To attain such a flexible platoon, we first separate the multi‑vehicle platoon (MVP) merging mission into two stages, namely, pre‑merging regulation and \it ordering‑flexible platoon merging, and then formulate them into distributed constraint‑based optimization problems. Particularly, by encoding longitudinal‑distance regulation and same‑lane collision avoidance subtasks into the corresponding control barrier function (CBF) constraints, the proposed algorithm in Stage 1 can safely enlarge sufficient longitudinal distances among adjacent vehicles. Then, by encoding lateral convergence, longitudinal‑target attraction, and neighboring collision avoidance subtasks into CBF constraints, the proposed algorithm in Stage~2 can efficiently achieve the \it ordering‑flexible platoon. Note that the \it ordering‑flexible platoon is realized through the interaction of the longitudinal‑target attraction and time‑varying neighboring collision avoidance constraints simultaneously. Feasibility guarantee and rigorous convergence analysis are both provided under strong nonlinear couplings induced by flexible orderings. Finally, experiments using three autonomous mobile vehicles (AMVs) are conducted to verify the effectiveness and flexibility of the proposed algorithm, and extensive simulations are performed to demonstrate its robustness, adaptability, and scalability when tackling vehicles' sudden breakdown, new appearing, different number of lanes, mixed autonomy, and large‑scale scenarios, respectively.

PaperID: 9999, https://arxiv.org/pdf/2506.13171.pdf

Authors: Lukasz Mazur, Nenad Petrovic, James Pontes Miranda, Ansgar Radermacher, Robert Rasche, Alois Knoll

Title: Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches

Abstract:
Large language models (LLMs) offer new opportunities for interacting with complex software artifacts, such as software models, through natural language. They present especially promising benefits for large software models that are difficult to grasp in their entirety, making traditional interaction and analysis approaches challenging. This paper investigates two approaches for leveraging LLMs to answer questions over software models: direct prompting, where the whole software model is provided in the context, and an agentic approach combining LLM‑based agents with general‑purpose file access tools. We evaluate these approaches using an Ecore metamodel designed for timing analysis and software optimization in automotive and embedded domains. Our findings show that while the agentic approach achieves accuracy comparable to direct prompting, it is significantly more efficient in terms of token usage. This efficiency makes the agentic approach particularly suitable for the automotive industry, where the large size of software models makes direct prompting infeasible, establishing LLM agents as not just a practical alternative but the only viable solution. Notably, the evaluation was conducted using small LLMs, which are more feasible to be executed locally ‑ an essential advantage for meeting strict requirements around privacy, intellectual property protection, and regulatory compliance. Future work will investigate software models in diverse formats, explore more complex agent architectures, and extend agentic workflows to support not only querying but also modification of software models.

PaperID: 10000, https://arxiv.org/pdf/2506.13113.pdf

Authors: Stella C. Dong, James R. Finlay

Title: Dynamic Reinsurance Treaty Bidding via Multi-Agent Reinforcement Learning

Abstract:
This paper develops a novel multi‑agent reinforcement learning (MARL) framework for reinsurance treaty bidding, addressing long‑standing inefficiencies in traditional broker‑mediated placement processes. We pose the core research question: Can autonomous, learning‑based bidding systems improve risk transfer efficiency and outperform conventional pricing approaches in reinsurance markets? In our model, each reinsurer is represented by an adaptive agent that iteratively refines its bidding strategy within a competitive, partially observable environment. The simulation explicitly incorporates institutional frictions including broker intermediation, incumbent advantages, last‑look privileges, and asymmetric access to underwriting information. Empirical analysis demonstrates that MARL agents achieve up to 15% higher underwriting profit, 20% lower tail risk (CVaR), and over 25% improvement in Sharpe ratios relative to actuarial and heuristic baselines. Sensitivity tests confirm robustness across hyperparameter settings, and stress testing reveals strong resilience under simulated catastrophe shocks and capital constraints. These findings suggest that MARL offers a viable path toward more transparent, adaptive, and risk‑sensitive reinsurance markets. The proposed framework contributes to emerging literature at the intersection of algorithmic market design, strategic bidding, and AI‑enabled financial decision‑making.

PaperID: 10001, https://arxiv.org/pdf/2506.13109.pdf

Authors: Shivanshu Gupta, Sameer Singh, Ashish Sabharwal, Tushar Khot, Ben Bogin

Title: Leveraging In-Context Learning for Language Model Agents

Abstract:
In‑context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance. While ICL has been highly successful for prediction and generation tasks, leveraging it for agentic tasks that require sequential decision making is challenging ‑‑ one must think not only about how to annotate long trajectories at scale and how to select demonstrations, but also what constitutes demonstrations, and when and where to show them. To address this, we first propose an algorithm that leverages an LLM with retries along with demonstrations to automatically and efficiently annotate agentic tasks with solution trajectories. We then show that set‑selection of trajectories of similar tasks as demonstrations significantly improves performance, reliability, robustness, and efficiency of LLM agents. However, trajectory demonstrations have a large inference cost overhead. We show that this can be mitigated by using small trajectory snippets at every step instead of an additional trajectory. We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents. Thus, our results reveal that ICL, with careful use, can be very powerful for agentic tasks as well.

PaperID: 10002, https://arxiv.org/pdf/2506.13068.pdf

Authors: Haowen Xu, Yulin Sun, Jose Tupayachi, Olufemi Omitaomu, Sisi Zlatanova, Xueping Li

Title: Towards the Autonomous Optimization of Urban Logistics: Training Generative AI with Scientific Tools via Agentic Digital Twins and Model Context Protocol

Abstract:
Optimizing urban freight logistics is critical for developing sustainable, low‑carbon cities. Traditional methods often rely on manual coordination of simulation tools, optimization solvers, and expert‑driven workflows, limiting their efficiency and scalability. This paper presents an agentic system architecture that leverages the model context protocol (MCP) to orchestrate multi‑agent collaboration among scientific tools for autonomous, simulation‑informed optimization in urban logistics. The system integrates generative AI agents with domain‑specific engines ‑ such as Gurobi for optimization and AnyLogic for agent‑based simulation ‑ forming a generative digital twin capable of reasoning, planning, and acting across multimodal freight networks. By incorporating integrated chatbots, retrieval‑augmented generation, and structured memory, the framework enables agents to interpret user intent from natural language conversations, retrieve relevant datasets and models, coordinate solvers and simulators, and execute complex workflows. We demonstrate this approach through a freight decarbonization case study, showcasing how MCP enables modular, interoperable, and adaptive agent behavior across diverse toolchains. The results reveal that our system transforms digital twins from static visualizations into autonomous, decision‑capable systems, advancing the frontiers of urban operations research. By enabling context‑aware, generative agents to operate scientific tools automatically and collaboratively, this framework supports more intelligent, accessible, and dynamic decision‑making in transportation planning and smart city management.

PaperID: 10003, https://arxiv.org/pdf/2506.13037.pdf

Authors: Joaquín Jordán, Xavier Yin, Melissa Fabros, Gireeja Ranade, Narges Norouzi

Title: MAGIC: Multi-Agent Argumentation and Grammar Integrated Critiquer

Abstract:
Automated Essay Scoring (AES) and Automatic Essay Feedback (AEF) systems aim to reduce the workload of human raters in educational assessment. However, most existing systems prioritize numerical scoring accuracy over feedback quality and are primarily evaluated on pre‑secondary school level writing. This paper presents Multi‑Agent Argumentation and Grammar Integrated Critiquer (MAGIC), a framework using five specialized agents to evaluate prompt adherence, persuasiveness, organization, vocabulary, and grammar for both holistic scoring and detailed feedback generation. To support evaluation at the college level, we collated a dataset of Graduate Record Examination (GRE) practice essays with expert‑evaluated scores and feedback. MAGIC achieves substantial to near‑perfect scoring agreement with humans on the GRE data, outperforming baseline LLM models while providing enhanced interpretability through its multi‑agent approach. We also compare MAGIC's feedback generation capabilities against ground truth human feedback and baseline models, finding that MAGIC achieves strong feedback quality and naturalness.

PaperID: 10004, https://arxiv.org/pdf/2506.12963.pdf

Authors: Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu

Title: Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

Abstract:
Recent advances in large reasoning models (LRMs) have enabled strong chain‑of‑thought (CoT) generation through test‑time computation. While these multi‑step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non‑reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning‑aware Representation Misdirection for Unlearning (R^2MU), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that R^2MU significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state‑of‑the‑art models such as DeepSeek‑R1‑Distill‑LLaMA‑8B and DeepSeek‑R1‑Distill‑Qwen‑14B.

PaperID: 10005, https://arxiv.org/pdf/2506.12928.pdf

Authors: King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou

Title: Scaling Test-time Compute for LLM Agents

Abstract:
Scaling test time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test‑time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test‑time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4)strategies for diversifying rollouts.We carefully analyze and ablate the impact of different design strategies on applying test‑time scaling on language agents, and have follow findings: 1. Scaling test time compute could improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result merging approaches, the list‑wise method performs best. 4. Increasing diversified rollouts exerts a positive effect on the agent's task performance.

PaperID: 10006, https://arxiv.org/pdf/2506.12894.pdf

Authors: Naoto Yoshida, Kingson Man

Title: Homeostatic Coupling for Prosocial Behavior

Abstract:
When regarding the suffering of others, we often experience personal distress and feel compelled to help\footnotePreprint. Under review.. Inspired by living systems, we investigate the emergence of prosocial behavior among autonomous agents that are motivated by homeostatic self‑regulation. We perform multi‑agent reinforcement learning, treating each agent as a vulnerable homeostat charged with maintaining its own well‑being. We introduce an empathy‑like mechanism to share homeostatic states between agents: an agent can either \emphobserve their partner's internal state (\bf cognitive empathy) or the agent's internal state can be \emphdirectly coupled to that of their partner (\bf affective empathy). In three simple multi‑agent environments, we show that prosocial behavior arises only under homeostatic coupling ‑ when the distress of a partner can affect one's own well‑being. Additionally, we show that empathy can be learned: agents can ``decode" their partner's external emotive states to infer the partner's internal homeostatic states. Assuming some level of physiological similarity, agents reference their own emotion‑generation functions to invert the mapping from outward display to internal state. Overall, we demonstrate the emergence of prosocial behavior when homeostatic agents learn to ``read" the emotions of others and then to empathize, or feel as they feel.

PaperID: 10007, https://arxiv.org/pdf/2506.12801.pdf

Authors: LeCheng Zhang, Yuanshi Wang, Haotian Shen, Xujie Wang

Title: Mastering Da Vinci Code: A Comparative Study of Transformer, LLM, and PPO-based Agents

Abstract:
The Da Vinci Code, a game of logical deduction and imperfect information, presents unique challenges for artificial intelligence, demanding nuanced reasoning beyond simple pattern recognition. This paper investigates the efficacy of various AI paradigms in mastering this game. We develop and evaluate three distinct agent architectures: a Transformer‑based baseline model with limited historical context, several Large Language Model (LLM) agents (including Gemini, DeepSeek, and GPT variants) guided by structured prompts, and an agent based on Proximal Policy Optimization (PPO) employing a Transformer encoder for comprehensive game history processing. Performance is benchmarked against the baseline, with the PPO‑based agent demonstrating superior win rates (58.5% \pm 1.0%), significantly outperforming the LLM counterparts. Our analysis highlights the strengths of deep reinforcement learning in policy refinement for complex deductive tasks, particularly in learning implicit strategies from self‑play. We also examine the capabilities and inherent limitations of current LLMs in maintaining strict logical consistency and strategic depth over extended gameplay, despite sophisticated prompting. This study contributes to the broader understanding of AI in recreational games involving hidden information and multi‑step logical reasoning, offering insights into effective agent design and the comparative advantages of different AI approaches.

PaperID: 10008, https://arxiv.org/pdf/2506.12710.pdf

Authors: Yuqi Ping, Tianhao Liang, Huahao Ding, Guangyu Lei, Junwei Wu, Xuan Zou, Kuan Shi, Rui Shao, Chiya Zhang, Weizheng Zhang, Weijie Yuan, Tingting Zhang

Title: Multimodal Large Language Models-Enabled UAV Swarm: Towards Efficient and Intelligent Autonomous Aerial Systems

Abstract:
Recent breakthroughs in multimodal large language models (MLLMs) have endowed AI systems with unified perception, reasoning and natural‑language interaction across text, image and video streams. Meanwhile, Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic, safety‑critical missions that demand rapid situational understanding and autonomous adaptation. This paper explores potential solutions for integrating MLLMs with UAV swarms to enhance the intelligence and adaptability across diverse tasks. Specifically, we first outline the fundamental architectures and functions of UAVs and MLLMs. Then, we analyze how MLLMs can enhance the UAV system performance in terms of target detection, autonomous navigation, and multi‑agent coordination, while exploring solutions for integrating MLLMs into UAV systems. Next, we propose a practical case study focused on the forest fire fighting. To fully reveal the capabilities of the proposed framework, human‑machine interaction, swarm task planning, fire assessment, and task execution are investigated. Finally, we discuss the challenges and future research directions for the MLLMs‑enabled UAV swarm. An experiment illustration video could be found online at https://youtu.be/zwnB9ZSa5A4.

PaperID: 10009, https://arxiv.org/pdf/2506.12699.pdf

Authors: Yashothara Shanmugarasa, Ming Ding, M. A. P Chamikara, Thierry Rakotoarivelo

Title: SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation

Abstract:
Large language models (LLMs) are sophisticated artificial intelligence systems that enable machines to generate human‑like text with remarkable precision. While LLMs offer significant technological progress, their development using vast amounts of user data scraped from the web and collected from extensive user interactions poses risks of sensitive information leakage. Most existing surveys focus on the privacy implications of the training data but tend to overlook privacy risks from user interactions and advanced LLM capabilities. This paper aims to fill that gap by providing a comprehensive analysis of privacy in LLMs, categorizing the challenges into four main areas: (i) privacy issues in LLM training data, (ii) privacy challenges associated with user prompts, (iii) privacy vulnerabilities in LLM‑generated outputs, and (iv) privacy challenges involving LLM agents. We evaluate the effectiveness and limitations of existing mitigation mechanisms targeting these proposed privacy challenges and identify areas for further research.

PaperID: 10010, https://arxiv.org/pdf/2506.12689.pdf

Authors: Xiaofeng Shi, Qian Kou, Yuduo Li, Ning Tang, Jinxin Xie, Longbin Yu, Songjing Wang, Hua Zhou

Title: SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation

Abstract:
The rapid growth of scientific literature demands robust tools for automated survey‑generation. However, current large language model (LLM)‑based methods often lack in‑depth analysis, structural coherence, and reliable citations. To address these limitations, we introduce SciSage, a multi‑agent framework employing a reflect‑when‑you‑write paradigm. SciSage features a hierarchical Reflector agent that critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a rigorously curated benchmark of 46 high‑impact papers (2020‑2025) across 11 computer science domains, with strict recency and citation‑based quality controls. Evaluations demonstrate that SciSage outperforms state‑of‑the‑art baselines (LLM x MapReduce‑V2, AutoSurvey), achieving +1.73 points in document coherence and +32% in citation F1 scores. Human evaluations reveal mixed outcomes (3 wins vs. 7 losses against human‑written surveys), but highlight SciSage's strengths in topical breadth and retrieval efficiency. Overall, SciSage offers a promising foundation for research‑assistive writing tools.

PaperID: 10011, https://arxiv.org/pdf/2506.12636.pdf

Authors: Julia Santaniello, Matthew Russell, Benson Jiang, Donatello Sassaroli, Robert Jacob, Jivko Sinapov

Title: Mapping Neural Signals to Agent Performance, A Step Towards Reinforcement Learning from Neural Feedback

Abstract:
Implicit Human‑in‑the‑Loop Reinforcement Learning (HITL‑RL) is a methodology that integrates passive human feedback into autonomous agent training while minimizing human workload. However, existing methods often rely on active instruction, requiring participants to teach an agent through unnatural expression or gesture. We introduce NEURO‑LOOP, an implicit feedback framework that utilizes the intrinsic human reward system to drive human‑agent interaction. This work demonstrates the feasibility of a critical first step in the NEURO‑LOOP framework: mapping brain signals to agent performance. Using functional near‑infrared spectroscopy (fNIRS), we design a dataset to enable future research using passive Brain‑Computer Interfaces for Human‑in‑the‑Loop Reinforcement Learning. Participants are instructed to observe or guide a reinforcement learning agent in its environment while signals from the prefrontal cortex are collected. We conclude that a relationship between fNIRS data and agent performance exists using classical machine learning techniques. Finally, we highlight the potential that neural interfaces may offer to future applications of human‑agent interaction, assistive AI, and adaptive autonomous systems.

PaperID: 10012, https://arxiv.org/pdf/2506.12607.pdf

Authors: Christodoulos Constantinides, Shuxin Lin, Dhaval Patel

Title: Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Abstract:
In this work we focus on improving language models' understanding for asset maintenance to guide the engineer's decisions and minimize asset downtime. Given a set of tasks expressed in natural language for Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize to queries of similar assets. A task may involve identifying relevant sensors given a query about an asset's failure mode. Our approach begins with gathering a qualitative, expert‑vetted knowledge base to construct nine asset‑specific task datasets. To create more contextually informed embeddings, we augment the input tasks using Large Language Models (LLMs), providing concise descriptions of the entities involved in the queries. This embedding model is then integrated with a Reasoning and Acting agent (ReAct), which serves as a powerful tool for answering complex user queries that require multi‑step reasoning, planning, and knowledge inference. Through ablation studies, we demonstrate that: (a) LLM query augmentation improves the quality of embeddings, (b) Contrastive loss and other methods that avoid in‑batch negatives are superior for datasets with queries related to many items, and (c) It is crucial to balance positive and negative in‑batch samples. After training and testing on our dataset, we observe a substantial improvement: HIT@1 increases by +54.2%, MAP@100 by +50.1%, and NDCG@10 by +54.7%, averaged across all tasks and models. Additionally, we empirically demonstrate the model's planning and tool invocation capabilities when answering complex questions related to industrial asset maintenance, showcasing its effectiveness in supporting Subject Matter Experts (SMEs) in their day‑to‑day operations.

PaperID: 10013, https://arxiv.org/pdf/2506.12600.pdf

Authors: Jie Pan, Tianyi Wang, Christian Claudel, Jing Shi

Title: Trust-MARL: Trust-Based Multi-Agent Reinforcement Learning Framework for Cooperative On-Ramp Merging Control in Heterogeneous Traffic Flow

Abstract:
Intelligent transportation systems require connected and automated vehicles (CAVs) to conduct safe and efficient cooperation with human‑driven vehicles (HVs) in complex real‑world traffic environments. However, the inherent unpredictability of human behaviour, especially at bottlenecks such as highway on‑ramp merging areas, often disrupts traffic flow and compromises system performance. To address the challenge of cooperative on‑ramp merging in heterogeneous traffic environments, this study proposes a trust‑based multi‑agent reinforcement learning (Trust‑MARL) framework. At the macro level, Trust‑MARL enhances global traffic efficiency by leveraging inter‑agent trust to improve bottleneck throughput and mitigate traffic shockwave through emergent group‑level coordination. At the micro level, a dynamic trust mechanism is designed to enable CAVs to adjust their cooperative strategies in response to real‑time behaviors and historical interactions with both HVs and other CAVs. Furthermore, a trust‑triggered game‑theoretic decision‑making module is integrated to guide each CAV in adapting its cooperation factor and executing context‑aware lane‑changing decisions under safety, comfort, and efficiency constraints. An extensive set of ablation studies and comparative experiments validates the effectiveness of the proposed Trust‑MARL approach, demonstrating significant improvements in safety, efficiency, comfort, and adaptability across varying CAV penetration rates and traffic densities.

PaperID: 10014, https://arxiv.org/pdf/2506.12509.pdf

Authors: Jiwei Fang, Bin Zhang, Changwei Wang, Jin Wan, Zhiwei Xu

Title: Graph of Verification: Structured Verification of LLM Reasoning with Directed Acyclic Graphs

Abstract:
Verifying the complex and multi‑step reasoning of Large Language Models (LLMs) is a critical challenge, as holistic methods often overlook localized flaws. Step‑by‑step validation is a promising alternative, yet existing methods are often rigid. They struggle to adapt to diverse reasoning structures, from formal proofs to informal natural language narratives. To address this adaptability gap, we propose the Graph of Verification (GoV), a novel framework for adaptable and multi‑granular verification. GoV's core innovation is its flexible "node block" architecture. This mechanism allows GoV to adaptively adjust its verification granularity‑‑from atomic steps for formal tasks to entire paragraphs for natural language‑‑to match the native structure of the reasoning process. This flexibility allows GoV to resolve the fundamental trade‑off between verification precision and robustness. Experiments on both well‑structured and loosely‑structured benchmarks demonstrate GoV's versatility. The results show that GoV's adaptive approach significantly outperforms both holistic baselines and other state‑of‑the‑art decomposition‑based methods, establishing a new standard for training‑free reasoning verification.

PaperID: 10015, https://arxiv.org/pdf/2506.12508.pdf

Authors: Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An

Title: AgentOrchestra: Orchestrating Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

Abstract:
Recent advances in LLM‑based agent systems have shown promise on complex, long‑horizon tasks, but existing agent protocols (e.g., A2A and MCP) do not adequately support lifecycle‑aware coordination across agents, tools, and environments. To address this limitation, we introduce the Tool‑Environment‑Agent (TEA) protocol, a unified abstraction that models these components as first‑class, versioned resources with explicit lifecycles. TEA supports end‑to‑end context and version management, improving traceability and reproducibility, while also enabling continual self‑evolution of agent‑associated components\footnoteUnless otherwise specified, \emphagent‑associated components include prompts, memory/tool/agent/environment code, and agent outputs (solutions).. Building on TEA, we present \projectname, a hierarchical multi‑agent framework in which a central planner coordinates specialized sub‑agents and dynamically extends capabilities during execution. Experiments on four challenging benchmarks, spanning expert‑level agent tasks and scientific/mathematical reasoning, show that AgentOrchestra consistently outperforms strong baselines; in particular, it achieves 89.04% on the GAIA Test set, placing it among the leading methods to the best of our knowledge. These results highlight the value of explicit protocol design and hierarchical orchestration for building more robust and adaptive multi‑agent systems.

PaperID: 10016, https://arxiv.org/pdf/2506.12497.pdf

Authors: Ali Baheri

Title: Wasserstein-Barycenter Consensus for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) demands principled mechanisms to align heterogeneous policies while preserving the capacity for specialized behavior. We introduce a novel consensus framework that defines the team strategy as the entropic‑regularized p‑Wasserstein barycenter of agents' joint state‑‑action visitation measures. By augmenting each agent's policy objective with a soft penalty proportional to its Sinkhorn divergence from this barycenter, the proposed approach encourages coherent group behavior without enforcing rigid parameter sharing. We derive an algorithm that alternates between Sinkhorn‑barycenter computation and policy‑gradient updates, and we prove that, under standard Lipschitz and compactness assumptions, the maximal pairwise policy discrepancy contracts at a geometric rate. Empirical evaluation on a cooperative navigation case study demonstrates that our OT‑barycenter consensus outperforms an independent learners baseline in convergence speed and final coordination success.

PaperID: 10017, https://arxiv.org/pdf/2506.12482.pdf

Authors: Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, Samir Tulebaev, Hae Won Park

Title: Tiered Agentic Oversight: A Hierarchical Multi-Agent System for Healthcare Safety

Abstract:
Large language models (LLMs) deployed as agents introduce significant safety risks in clinical settings due to their potential for error and single points of failure. We introduce Tiered Agentic Oversight (TAO), a hierarchical multi‑agent system that enhances AI safety through layered, automated supervision. Inspired by clinical hierarchies (e.g., nurse‑physician‑specialist) in hospital, TAO routes tasks to specialized agents based on complexity, creating a robust safety framework through automated inter‑ and intra‑tier communication and role‑playing. Crucially, this hierarchical structure functions as an effective error‑correction mechanism, absorbing up to 24% of individual agent errors before they can compound. Our experiments reveal TAO outperforms single‑agent and other multi‑agent systems on 4 out of 5 healthcare safety benchmarks, with up to an 8.2% improvement. Ablation studies confirm key design principles of the system: (i) its adaptive architecture is over 3% safer than static, single‑tier configurations, and (ii) its lower tiers are indispensable, as their removal causes the most significant degradation in overall safety. Finally, we validated the system's synergy with human doctors in a user study where a physician, acting as the highest tier agent, provided corrective feedback that improved medical triage accuracy from 40% to 60%. Project Page: https://tiered‑agentic‑oversight.github.io/

PaperID: 10018, https://arxiv.org/pdf/2506.12469.pdf

Authors: K. J. Kevin Feng, David W. McDonald, Amy X. Zhang

Title: Levels of Autonomy for AI Agents

Abstract:
Autonomy is a double‑edged sword for AI agents, simultaneously unlocking transformative possibilities and serious risks. How can agent developers calibrate the appropriate levels of autonomy at which their agents should operate? We argue that an agent's level of autonomy can be treated as a deliberate design decision, separate from its capability and operational environment. In this work, we define five levels of escalating agent autonomy, characterized by the roles a user can take when interacting with an agent: operator, collaborator, consultant, approver, and observer. Within each level, we describe the ways by which a user can exert control over the agent and open questions for how to design the nature of user‑agent interaction. We then highlight a potential application of our framework towards AI autonomy certificates to govern agent behavior in single‑ and multi‑agent systems. We conclude by proposing early ideas for evaluating agents' autonomy. Our work aims to contribute meaningful, practical steps towards responsibly deployed and useful AI agents in the real world.

PaperID: 10019, https://arxiv.org/pdf/2506.12453.pdf

Authors: Rongpeng Li, Jianhang Zhu, Jiahao Huang, Zhifeng Zhao, Honggang Zhang

Title: Topology-Assisted Spatio-Temporal Pattern Disentangling for Scalable MARL in Large-scale Autonomous Traffic Control

Abstract:
Intelligent Transportation Systems (ITSs) have emerged as a promising solution towards ameliorating urban traffic congestion, with Traffic Signal Control (TSC) identified as a critical component. Although Multi‑Agent Reinforcement Learning (MARL) algorithms have shown potential in optimizing TSC through real‑time decision‑making, their scalability and effectiveness often suffer from large‑scale and complex environments. Typically, these limitations primarily stem from a fundamental mismatch between the exponential growth of the state space driven by the environmental heterogeneities and the limited modeling capacity of current solutions. To address these issues, this paper introduces a novel MARL framework that integrates Dynamic Graph Neural Networks (DGNNs) and Topological Data Analysis (TDA), aiming to enhance the expressiveness of environmental representations and improve agent coordination. Furthermore, inspired by the Mixture of Experts (MoE) architecture in Large Language Models (LLMs), a topology‑assisted spatial pattern disentangling (TSD)‑enhanced MoE is proposed, which leverages topological signatures to decouple graph features for specialized processing, thus improving the model's ability to characterize dynamic and heterogeneous local observations. The TSD module is also integrated into the policy and value networks of the Multi‑agent Proximal Policy Optimization (MAPPO) algorithm, further improving decision‑making efficiency and robustness. Extensive experiments conducted on real‑world traffic scenarios, together with comprehensive theoretical analysis, validate the superior performance of the proposed framework, highlighting the model's scalability and effectiveness in addressing the complexities of large‑scale TSC tasks.

PaperID: 10020, https://arxiv.org/pdf/2506.12415.pdf

Authors: Ashutosh Shankar, Astha Kumari

Title: QoS-aware Scheduling of Periodic Real-time Task Graphs on Heterogeneous Pre-occupied MECs

Abstract:
In latency‑sensitive applications, efficient task scheduling is crucial for maintaining Quality of Service (QoS) while meeting strict timing constraints. This paper addresses the challenge of scheduling periodic tasks structured as directed acyclic graphs (DAGs) within heterogeneous, pre‑occupied Mobile Edge Computing (MEC) networks. We propose a modified version of the Heterogeneous Earliest Finish Time (HEFT) algorithm designed to exploit residual processing capacity in preoccupied MEC environments. Our approach dynamically identifies idle intervals on processors to create a feasible hyperperiodic schedule that specifies an allocated virtual machine (VM), task version, and start time for each task. This scheduling strategy maximizes the aggregate QoS by optimizing task execution without disrupting the existing periodic workload, while also adhering to periodicity, precedence, and resource constraints.Experimental results demonstrate that our method achieves enhanced load balancing and resource utilization, highlighting its potential to improve performance in heterogeneous MEC infrastructures supporting real‑time, periodic applications.

PaperID: 10021, https://arxiv.org/pdf/2506.12339.pdf

Authors: Ruiyan Zhu, Xi Cheng, Ke Liu, Brian Zhu, Daniel Jin, Neeraj Parihar, Zhoutian Xu, Oliver Gao

Title: SheetMind: An End-to-End LLM-Powered Multi-Agent Framework for Spreadsheet Automation

Abstract:
We present SheetMind, a modular multi‑agent framework powered by large language models (LLMs) for spreadsheet automation via natural language instructions. The system comprises three specialized agents: a Manager Agent that decomposes complex user instructions into subtasks; an Action Agent that translates these into structured commands using a Backus Naur Form (BNF) grammar; and a Reflection Agent that validates alignment between generated actions and the user's original intent. Integrated into Google Sheets via a Workspace extension, SheetMind supports real‑time interaction without requiring scripting or formula knowledge. Experiments on benchmark datasets demonstrate an 80 percent success rate on single step tasks and approximately 70 percent on multi step instructions, outperforming ablated and baseline variants. Our results highlight the effectiveness of multi agent decomposition and grammar based execution for bridging natural language and spreadsheet functionalities.

PaperID: 10022, https://arxiv.org/pdf/2506.12331.pdf

Authors: Dekun Wu, Frederik Brudy, Bang Liu, Yi Wang

Title: IndoorWorld: Integrating Physical Task Solving and Social Simulation in A Heterogeneous Multi-Agent Environment

Abstract:
Virtual environments are essential to AI agent research. Existing environments for LLM agent research typically focus on either physical task solving or social simulation, with the former oversimplifying agent individuality and social dynamics, and the latter lacking physical grounding of social behaviors. We introduce IndoorWorld, a heterogeneous multi‑agent environment that tightly integrates physical and social dynamics. By introducing novel challenges for LLM‑driven agents in orchestrating social dynamics to influence physical environments and anchoring social interactions within world states, IndoorWorld opens up possibilities of LLM‑based building occupant simulation for architectural design. We demonstrate the potential with a series of experiments within an office setting to examine the impact of multi‑agent collaboration, resource competition, and spatial layout on agent behavior.

PaperID: 10023, https://arxiv.org/pdf/2506.12297.pdf

Authors: Zhipeng Fan, Yujie Xu, Mingyu Fu, Han Sun, Weiqiu Zhang, Heng Zhang

Abstract:
This brief proposes a distributed formation control strategy via matrix‑weighted Laplacian that can achieve a similar formation in 2‑D planar using inter‑agent relative displacement measurement. Formation patterns that include translation, rotation, and scaling can be characterized by the null space of the matrix‑weighted Laplacian associated with the topological graph. The main contribution of this brief is to extend the similar formation problem of undirected graphs to directed acyclic graphs and provide the necessary algebraic criteria for leader selection. Stability analysis, illustrative examples, and simulation results are provided.

PaperID: 10024, https://arxiv.org/pdf/2506.12266.pdf

Authors: Avinash Baidya, Kamalika Das, Xiang Gao

Title: The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Abstract:
Large Language Model (LLM)‑based agents have significantly impacted Task‑Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero‑shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under‑explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task‑oriented dialogs. For the most complex task in our study, even the GPT‑4o‑based agent exhibits low alignment with human behavior, with low F1 scores for dialog acts (0.464), excessive and often misaligned tool usage with a F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM‑based TODS in handling complex tasks.

PaperID: 10025, https://arxiv.org/pdf/2506.12202.pdf

Authors: Stephen Mell, Botong Zhang, David Mell, Shuo Li, Ramya Ramalingam, Nathan Yu, Steve Zdancewic, Osbert Bastani

Title: A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions

Abstract:
Modern large language models (LLMs) are often deployed as agents, calling external tools adaptively to solve tasks. Rather than directly calling tools, it can be more effective for LLMs to write code to perform the tool calls, enabling them to automatically generate complex control flow such as conditionals and loops. Such code actions are typically provided as Python code, since LLMs are quite proficient at it; however, Python may not be the ideal language due to limited built‑in support for performance, security, and reliability. We propose a novel programming language for code actions, called Quasar, which has several benefits: (1) automated parallelization to improve performance, (2) uncertainty quantification to improve reliability and mitigate hallucinations, and (3) security features enabling the user to validate actions. LLMs can write code in a subset of Python, which is automatically transpiled to Quasar. We evaluate our approach on the ViperGPT visual question answering agent, applied to the GQA dataset, demonstrating that LLMs with Quasar actions instead of Python actions retain strong performance, while reducing execution time when possible by 42%, improving security by reducing user approval interactions when possible by 52%, and improving reliability by applying conformal prediction to achieve a desired target coverage level.

PaperID: 10026, https://arxiv.org/pdf/2506.12110.pdf

Authors: Qirui Mi, Qipeng Yang, Zijun Fan, Wentian Fan, Heyang Ma, Chengdong Ma, Siyu Xia, Bo An, Jun Wang, Haifeng Zhang

Title: EconGym: A Scalable AI Testbed with Diverse Economic Tasks

Abstract:
Artificial intelligence (AI) has become a powerful tool for economic research, enabling large‑scale simulation and policy optimization. However, applying AI effectively requires simulation platforms for scalable training and evaluation‑yet existing environments remain limited to simplified, narrowly scoped tasks, falling short of capturing complex economic challenges such as demographic shifts, multi‑government coordination, and large‑scale agent interactions. To address this gap, we introduce EconGym, a scalable and modular testbed that connects diverse economic tasks with AI algorithms. Grounded in rigorous economic modeling, EconGym implements 11 heterogeneous role types (e.g., households, firms, banks, governments), their interaction mechanisms, and agent models with well‑defined observations, actions, and rewards. Users can flexibly compose economic roles with diverse agent algorithms to simulate rich multi‑agent trajectories across 25+ economic tasks for AI‑driven policy learning and analysis. Experiments show that EconGym supports diverse and cross‑domain tasks‑such as coordinating fiscal, pension, and monetary policies‑and enables benchmarking across AI, economic methods, and hybrids. Results indicate that richer task composition and algorithm diversity expand the policy space, while AI agents guided by classical economic methods perform best in complex settings. EconGym also scales to 10k agents with high realism and efficiency.

PaperID: 10027, https://arxiv.org/pdf/2506.12083.pdf

Authors: Amitesh Pandey, Jafarbek Arifdjanov, Ansh Tiwari

Title: TuneGenie: Reasoning-based LLM agents for preferential music generation

Abstract:
Recently, Large language models (LLMs) have shown great promise across a diversity of tasks, ranging from generating images to reasoning spatially. Considering their remarkable (and growing) textual reasoning capabilities, we investigate LLMs' potency in conducting analyses of an individual's preferences in music (based on playlist metadata, personal write‑ups, etc.) and producing effective prompts (based on these analyses) to be passed to Suno AI (a generative AI tool for music production). Our proposition of a novel LLM‑based textual representation to music model (which we call TuneGenie) and the various methods we develop to evaluate & benchmark similar models add to the increasing (and increasingly controversial) corpus of research on the use of AI in generating art.

PaperID: 10028, https://arxiv.org/pdf/2506.11890.pdf

Authors: Judson Leroy Dean Haynes

Title: Enter: Graduated Realism: A Pedagogical Framework for AI-Powered Avatars in Virtual Reality Teacher Training

Abstract:
Virtual Reality simulators offer a powerful tool for teacher training, yet the integration of AI‑powered student avatars presents a critical challenge: determining the optimal level of avatar realism for effective pedagogy. This literature review examines the evolution of avatar realism in VR teacher training, synthesizes its theoretical implications, and proposes a new pedagogical framework to guide future design. Through a systematic review, this paper traces the progression from human‑controlled avatars to generative AI prototypes. Applying learning theories like Cognitive Load Theory, we argue that hyper‑realism is not always optimal, as high‑fidelity avatars can impose excessive extraneous cognitive load on novices, a stance supported by recent empirical findings. A significant gap exists between the technological drive for photorealism and the pedagogical need for scaffolded learning. To address this gap, we propose Graduated Realism, a framework advocating for starting trainees with lower‑fidelity avatars and progressively increasing behavioral complexity as skills develop. To make this computationally feasible, we outline a novel single‑call architecture, Crazy Slots, which uses a probabilistic engine and a Retrieval‑Augmented Generation database to generate authentic, real‑time responses without the latency and cost of multi‑step reasoning models. This review provides evidence‑based principles for designing the next generation of AI simulators, arguing that a pedagogically grounded approach to realism is essential for creating scalable and effective teacher education tools.

PaperID: 10029, https://arxiv.org/pdf/2506.11829.pdf

Authors: Ana Müller, Anja Richert

Title: The Space Between Us: A Methodological Framework for Researching Bonding and Proxemics in Situated Group-Agent Interactions

Abstract:
This paper introduces a multimethod framework for studying spatial and social dynamics in real‑world group‑agent interactions with socially interactive agents. Drawing on proxemics and bonding theories, the method combines subjective self‑reports and objective spatial tracking. Applied in two field studies in a museum (N = 187) with a robot and a virtual agent, the paper addresses the challenges in aligning human perception and behavior. We focus on presenting an open source, scalable, and field‑tested toolkit for future studies.

PaperID: 10030, https://arxiv.org/pdf/2506.11825.pdf

Authors: Aishwarya Bandaru, Fabian Bindley, Trevor Bluth, Nandini Chavda, Baixu Chen, Ethan Law

Title: Revealing Political Bias in LLMs through Structured Multi-Agent Debate

Abstract:
Large language models (LLMs) are increasingly used to simulate social behaviour, yet their political biases and interaction dynamics in debates remain underexplored. We investigate how LLM type and agent gender attributes influence political bias using a structured multi‑agent debate framework, by engaging Neutral, Republican, and Democrat American LLM agents in debates on politically sensitive topics. We systematically vary the underlying LLMs, agent genders, and debate formats to examine how model provenance and agent personas influence political bias and attitudes throughout debates. We find that Neutral agents consistently align with Democrats, while Republicans shift closer to the Neutral; gender influences agent attitudes, with agents adapting their opinions when aware of other agents' genders; and contrary to prior research, agents with shared political affiliations can form echo chambers, exhibiting the expected intensification of attitudes as debates progress.

PaperID: 10031, https://arxiv.org/pdf/2506.11803.pdf

Authors: Yingfan Deng, Anhao Zhou, Yuan Yuan, Xiao Zhang, Yifei Zou, Dongxiao Yu

Title: PE-MA: Parameter-Efficient Co-Evolution of Multi-Agent Systems

Abstract:
Multi‑Agent Systems have recently emerged as a promising paradigm for collaborative reasoning and solving complex tasks. However, the design of collaborative learning algorithms in multi‑agent systems faces several challenges, including high communication overhead and insufficient agent‑level personalization. In this paper, we propose PE‑MA (Parameter‑Efficient Multi‑Agent Co‑Evolution), a novel collaboration framework that supports efficient, scalable, and personalized co‑evolution in multi‑agent systems. In PE‑MA, each agent maintains a lightweight personalized adapter to support agent‑specific behavior, while a shared adapter is collaboratively optimized across neighboring agents. This design balances global coordination with local adaptation under heterogeneous environments. We achieve an asymptotically optimal convergence rate of O( 1/(NK)^(1/2) ), where N is the number of agents and K the local update steps.

PaperID: 10032, https://arxiv.org/pdf/2506.11791.pdf

Authors: Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang

Title: SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

Abstract:
Rigorous security‑focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC‑bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC‑bench employs a novel multi‑agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high‑quality software vulnerability datasets with reproducible artifacts at a cost of only 0.87 per instance. Using SEC‑bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof‑of‑concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state‑of‑the‑art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.

PaperID: 10033, https://arxiv.org/pdf/2506.11681.pdf

Authors: Pratibha Zunjare, Michael Hsiao

Title: A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences

Abstract:
This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi‑agent architectures to enhance the sentence simplification process. Experimental results show that our approach was able to successfully simplify 70% of the complex sentences written for video game design application. In comparison, a single‑agent approach attained a 48% success rate on the same task.

PaperID: 10034, https://arxiv.org/pdf/2506.11650.pdf

Authors: Lambert Lee, Joshua Lau

Title: Robot Context Protocol (RCP): A Runtime-Agnostic Interface for Agent-Aware Robot Control

Abstract:
The Robot Context Protocol (RCP) is a lightweight, middleware‑agnostic communication protocol designed to simplify the complexity of robotic systems and enable seamless interaction between robots, users, and autonomous agents. RCP provides a unified and semantically meaningful interface that decouples client‑facing operations from backend implementations, supporting a wide range of deployment environments including physical robots, cloud‑based orchestrators, and simulated platforms. Built on HTTP and WebSocket transport layers, the protocol defines a schema‑driven message format with structured operations such as read, write, execute, and subscribe. It integrates features such as runtime introspection, asynchronous feedback, multi‑tenant namespace isolation, and strict type validation to ensure robustness, scalability, and security. The architecture, message structure, interface model, and adapter‑based backend integration strategy of RCP are described, along with deployment practices and applicability across industries including manufacturing, logistics, and healthcare. RCP enables intelligent, resilient, and safe robotic operations in complex, multi‑agent ecosystems.

PaperID: 10035, https://arxiv.org/pdf/2506.11638.pdf

Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan

Title: LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

Abstract:
Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain‑specific tasks, particularly for small edge‑side models. We propose the LoRA‑Gen framework, which utilizes a large cloud‑side model to generate LoRA parameters for edge‑side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge‑side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA‑Gen outperforms conventional LoRA fine‑tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA‑1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma‑2B on intelligent agent tasks.

PaperID: 10036, https://arxiv.org/pdf/2506.11595.pdf

Authors: Mert Unsal, Aylin Akkus

Title: EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Abstract:
Building on recent advances in language‑based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text‑based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision‑language benchmark requiring multi‑image, multi‑step reasoning, and self‑correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state‑of‑the‑art vision‑language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test‑time scaling capabilities in vision‑language models. We open‑source our benchmark dataset and evaluation code.

PaperID: 10037, https://arxiv.org/pdf/2506.11475.pdf

Authors: Syeda Kisaa Fatima, Tehreem Zubair, Noman Ahmed, Asifullah Khan

Title: AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction

Abstract:
This paper introduces LUCID‑MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI powered framework where multiple AI agents collaboratively analyze and understand crime data. Our system that consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns; a feedback component that reviews and refines analytical results; and a prediction component that forecasts future crime trends. With a well‑designed prompt and the LLaMA‑2‑13B‑Chat‑GPTQ model, it runs completely offline and allows the agents undergo self‑improvement through 100 rounds of communication with less human interaction. A scoring function is incorporated to evaluate agent performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen‑style agents for autonomous, scalable, and iterative analysis in social science domains, maintaining data privacy through offline execution. It also showcases a computational model with emergent intelligence, where the system's global behavior emerges from the interactions of its agents. This emergent behavior manifests as enhanced individual agent performance, driven by collaborative dialogue between the LLM‑based agents.

PaperID: 10038, https://arxiv.org/pdf/2506.11445.pdf

Authors: Xuan Duy Ta, Bang Giang Le, Thanh Ha Le, Viet Cuong Ta

Title: Resolve Highway Conflict in Multi-Autonomous Vehicle Controls with Local State Attention

Abstract:
In mixed‑traffic environments, autonomous vehicles must adapt to human‑controlled vehicles and other unusual driving situations. This setting can be framed as a multi‑agent reinforcement learning (MARL) environment with full cooperative reward among the autonomous vehicles. While methods such as Multi‑agent Proximal Policy Optimization can be effective in training MARL tasks, they often fail to resolve local conflict between agents and are unable to generalize to stochastic events. In this paper, we propose a Local State Attention module to assist the input state representation. By relying on the self‑attention operator, the module is expected to compress the essential information of nearby agents to resolve the conflict in traffic situations. Utilizing a simulated highway merging scenario with the priority vehicle as the unexpected event, our approach is able to prioritize other vehicles' information to manage the merging process. The results demonstrate significant improvements in merging efficiency compared to popular baselines, especially in high‑density traffic settings.

PaperID: 10039, https://arxiv.org/pdf/2506.11343.pdf

Authors: Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang

Title: From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Abstract:
The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating‑based methods in identifying high‑impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.

PaperID: 10040, https://arxiv.org/pdf/2506.11304.pdf

Authors: Qiuyu Miao, Zhigang Wu

Title: A Hybrid Adaptive Nash Equilibrium Solver for Distributed Multi-Agent Systems with Game-Theoretic Jump Triggering

Abstract:
This paper presents a hybrid adaptive Nash equilibrium solver for distributed multi‑agent systems incorporating game‑theoretic jump triggering mechanisms. The approach addresses fundamental scalability and computational challenges in multi‑agent hybrid systems by integrating distributed game‑theoretic optimization with systematic hybrid system design. A novel game‑theoretic jump triggering mechanism coordinates discrete mode transitions across multiple agents while maintaining distributed autonomy. The Hybrid Adaptive Nash Equilibrium Solver (HANES) algorithm integrates these methodologies. Sufficient conditions establish exponential convergence to consensus under distributed information constraints. The framework provides rigorous stability guarantees through coupled Hamilton‑Jacobi‑Bellman equations while enabling rapid emergency response capabilities through coordinated jump dynamics. Simulation studies in pursuit‑evasion and leader‑follower consensus scenarios demonstrate significant improvements in convergence time, computational efficiency, and scalability compared to existing centralized and distributed approaches.

PaperID: 10041, https://arxiv.org/pdf/2506.11285.pdf

Authors: Jianhong Wang, Yang Li, Samuel Kaski, Jonathan Lawry

Title: Shapley Machine: A Game-Theoretic Framework for N-Agent Ad Hoc Teamwork

Abstract:
Open multi‑agent systems are increasingly important in modeling real‑world applications, such as smart grids, swarm robotics, etc. In this paper, we aim to investigate a recently proposed problem for open multi‑agent systems, referred to as n‑agent ad hoc teamwork (NAHT), where only a number of agents are controlled. Existing methods tend to be based on heuristic design and consequently lack theoretical rigor and ambiguous credit assignment among agents. To address these limitations, we model and solve NAHT through the lens of cooperative game theory. More specifically, we first model an open multi‑agent system, characterized by its value, as an instance situated in a space of cooperative games, generated by a set of basis games. We then extend this space, along with the state space, to accommodate dynamic scenarios, thereby characterizing NAHT. Exploiting the justifiable assumption that basis game values correspond to a sequence of n‑step returns with different horizons, we represent the state values for NAHT in a form similar to λ‑returns. Furthermore, we derive Shapley values to allocate state values to the controlled agents, as credits for their contributions to the ad hoc team. Different from the conventional approach to shaping Shapley values in an explicit form, we shape Shapley values by fulfilling the three axioms uniquely describing them, well defined on the extended game space describing NAHT. To estimate Shapley values in dynamic scenarios, we propose a TD(λ)‑like algorithm. The resulting reinforcement learning (RL) algorithm is referred to as Shapley Machine. To our best knowledge, this is the first time that the concepts from cooperative game theory are directly related to RL concepts. In experiments, we demonstrate the effectiveness of Shapley Machine and verify reasonableness of our theory.

PaperID: 10042, https://arxiv.org/pdf/2506.11266.pdf

Authors: Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Rajaram Naik, Kiran Kate, Kinjal Basu, Danish Contractor

Title: Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling

Abstract:
Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD SQL into executable API sequences across three formulations SLOT, SEL, and REST covering minimal general purpose operations, domain specific multi step tasks, and function oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human authored queries, ground truth API sequences, and verified final answers. Live API Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReACT agents, observing low task completion rates (7 to 47pct), which improve modestly to 50pct under interactive agent settings, highlighting substantial scope for improving LLM tool calling performance. We release all code and data associated with this paper.

PaperID: 10043, https://arxiv.org/pdf/2506.11263.pdf

Authors: Christian Brommer, Alessandro Fornasier, Jan Steinbrener, Stephan Weiss

Title: Sensor Model Identification via Simultaneous Model Selection and State Variable Determination

Abstract:
We present a method for the unattended gray‑box identification of sensor models commonly used by localization algorithms in the field of robotics. The objective is to determine the most likely sensor model for a time series of unknown measurement data, given an extendable catalog of predefined sensor models. Sensor model definitions may require states for rigid‑body calibrations and dedicated reference frames to replicate a measurement based on the robot's localization state. A health metric is introduced, which verifies the outcome of the selection process in order to detect false positives and facilitate reliable decision‑making. In a second stage, an initial guess for identified calibration states is generated, and the necessity of sensor world reference frames is evaluated. The identified sensor model with its parameter information is then used to parameterize and initialize a state estimation application, thus ensuring a more accurate and robust integration of new sensor elements. This method is helpful for inexperienced users who want to identify the source and type of a measurement, sensor calibrations, or sensor reference frames. It will also be important in the field of modular multi‑agent scenarios and modularized robotic platforms that are augmented by sensor modalities during runtime. Overall, this work aims to provide a simplified integration of sensor modalities to downstream applications and circumvent common pitfalls in the usage and development of localization approaches.

PaperID: 10044, https://arxiv.org/pdf/2506.11150.pdf

Authors: Wenlong Hou, Guangqian Yang, Ye Du, Yeung Lau, Lihao Liu, Junjun He, Ling Long, Shujun Wang

Title: ADAgent: LLM Agent for Alzheimer's Disease Analysis with Collaborative Coordinator

Abstract:
Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single‑modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi‑modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD‑related tasks, process multi‑modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision‑making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi‑modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi‑modal diagnosis, a 0.7% improvement in multi‑modal prognosis, and enhancements in MRI and PET diagnosis tasks.

PaperID: 10045, https://arxiv.org/pdf/2506.11102.pdf

Authors: Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, Weinan Zhang

Title: Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

Abstract:
The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language‑related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi‑source instructor, dynamic feedback, multi‑modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.

PaperID: 10046, https://arxiv.org/pdf/2506.11083.pdf

Authors: Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu

Title: RedDebate: Safer Responses through Multi-Agent Red Teaming Debates

Abstract:
We propose RedDebate, a novel multi‑agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single‑model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another's reasoning, and systematically uncovering unsafe blind spots through automated red‑teaming, and iteratively improve their responses. We further integrate distinct types of long‑term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method's effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long‑term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi‑agent debates with red‑teaming to progressively enhance AI safety without direct human intervention.(Github Repository: https://github.com/aliasad059/RedDebate)

PaperID: 10047, https://arxiv.org/pdf/2506.11060.pdf

Authors: Ramneet Singh, Sathvik Joel, Abhav Mehrotra, Nalin Wadhwa, Ramakrishna B Bairi, Aditya Kanade, Nagarajan Natarajan

Title: Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Abstract:
Large Language Model (LLM)‑based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi‑step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash‑resolution rate (CRR) of 48%, compared to 31.5% by SWE‑agent and 31% by Agentless, using OpenAI's GPT‑4o model. Scaling up sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5‑Flash model. Through another experiment on an open‑source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi‑faceted reasoning for large codebases.

PaperID: 10048, https://arxiv.org/pdf/2506.11009.pdf

Authors: Jirat Pasuksmit, Wannita Takerngsaksiri, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Shiyan Wang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, Ming Wu

Title: Human-In-The-Loop Software Development Agents: Challenges and Future Directions

Abstract:
Multi‑agent LLM‑driven systems for software development are rapidly gaining traction, offering new opportunities to enhance productivity. At Atlassian, we deployed Human‑in‑the‑Loop Software Development Agents to resolve Jira work items and evaluated the generated code quality using functional correctness testing and GPT‑based similarity scoring. This paper highlights two major challenges: the high computational costs of unit testing and the variability in LLM‑based evaluations. We also propose future research directions to improve evaluation frameworks for Human‑In‑The‑Loop software development tools.

PaperID: 10049, https://arxiv.org/pdf/2506.10954.pdf

Authors: Lianghong Guo, Yanlin Wang, Caihua Li, Wei Tao, Pengyu Yang, Jiachi Chen, Haoyu Song, Duyu Tang, Zibin Zheng

Title: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Abstract:
Constructing large‑scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the existing GitHub issue resolution data construction pipeline is challenging and labor‑intensive. We identify three key limitations in existing pipelines: (1) test patches collected often omit binary file changes; (2) the manual construction of evaluation environments is labor‑intensive; and (3) the fail2pass validation phase requires manual inspection of test logs and writing custom parsing code to extract test status from logs. In this paper, we propose SWE‑Factory, a fully automated issue resolution data construction pipeline, to resolve these limitations. First, our pipeline automatically recovers missing binary test files and ensures the correctness of test patches. Second, we introduce SWE‑Builder, a LLM‑based multi‑agent system that automates evaluation environment construction. Third, we introduce a standardized, exit‑code‑based log parsing method to automatically extract test status, enabling a fully automated fail2pass validation. Experiments on 671 real‑world GitHub issues across four programming languages show that our method can effectively construct valid evaluation environments for GitHub issues at a reasonable cost. For example, with GPT‑4.1 mini, our SWE‑Builder constructs 337 valid task instances out of 671 issues, at 0.047 per instance. Our ablation study further shows the effectiveness of different components of SWE‑Builder. We also demonstrate through manual inspection that our exit‑code‑based fail2pass validation method is highly accurate, achieving an F1 score of 0.99. Additionally, we conduct an exploratory experiment to investigate whether we can use SWE‑Factory to enhance models' software engineering ability.

PaperID: 10050, https://arxiv.org/pdf/2506.10857.pdf

Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang

Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Abstract:
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi‑step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 960 long videos (with an average duration of 1.6 hours), along with 8,243 human‑labeled multi‑step question‑answering pairs and 25,106 reasoning steps with timestamps. These videos are curated via a multi‑stage filtering process including expert inter‑rater reviewing to prioritize plot coherence. We develop a human‑AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi‑phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress‑level LLM‑guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 19 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi‑step reasoning.

PaperID: 10051, https://arxiv.org/pdf/2506.10844.pdf

Authors: Alireza Salemi, Mukta Maddipatla, Hamed Zamani

Title: CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Abstract:
This paper presents mRAG, a multi‑agent retrieval‑augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self‑training paradigm with reward‑guided trajectory sampling to optimize inter‑agent collaboration and enhance response generation. Evaluated on DataMorgana‑derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real‑world RAG tasks.

PaperID: 10052, https://arxiv.org/pdf/2506.10815.pdf

Authors: Zhi Chai, Jiajie Xu, Justin P Coon, Mohamed-Slim Alouini

Title: Joint Beamforming with Extremely Large Scale RIS: A Sequential Multi-Agent A2C Approach

Abstract:
It is a challenging problem to jointly optimize the base station (BS) precoding matrix and the reconfigurable intelligent surface (RIS) phases simultaneously in a RIS‑assisted multiple‑user multiple‑input‑multiple‑output (MU‑MIMO) scenario when the size of the RIS becomes extremely large. In this paper, we propose a deep reinforcement learning algorithm called sequential multi‑agent advantage actor‑critic (A2C) to solve this problem. In addition, the discrete phase of RISs, imperfect channel state information (CSI), and channel correlations between users are taken into consideration. The computational complexity is also analyzed, and the performance of the proposed algorithm is compared with the zero‑forcing (ZF) beamformer in terms of the sum spectral efficiency (SE). It is noted that the computational complexity of the proposed algorithm is lower than the benchmark, while the performance is better than the benchmark. Throughout simulations, it is also found that the proposed algorithm is robust to medium channel estimation error.

PaperID: 10053, https://arxiv.org/pdf/2506.10807.pdf

Authors: Mario Barbara, Alaa Maalouf

Title: Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models

Abstract:
The explosive growth of video data intensified the need for flexible user‑controllable summarization tools that operate without training data. Existing methods either rely on domain‑specific datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts‑to‑Summaries: the first zero‑shot, text‑queryable video‑summarizer that converts off‑the‑shelf video‑language models (VidLMs) captions into user‑guided skims via large‑language‑models (LLMs) judging, without the use of training data, beating unsupervised and matching supervised methods. Our pipeline (i) segments video into scenes, (ii) produces scene descriptions with a memory‑efficient batch prompting scheme that scales to hours on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using new consistency (temporal coherence) and uniqueness (novelty) metrics for fine‑grained frame importance. On SumMe and TVSum, our approach surpasses all prior data‑hungry unsupervised methods and performs competitively on the Query‑Focused Video Summarization benchmark, where the competing methods require supervised frame‑level importance. We release VidSum‑Reason, a query‑driven dataset featuring long‑tailed concepts and multi‑step reasoning, where our framework serves as the first challenging baseline. Overall, we demonstrate that pretrained multi‑modal models, when orchestrated with principled prompting and score propagation, provide a powerful foundation for universal, text‑queryable video summarization.

PaperID: 10054, https://arxiv.org/pdf/2506.10630.pdf

Authors: Yucong Luo, Yitong Zhou, Mingyue Cheng, Jiahao Wang, Daoyu Wang, Tingyue Pan, Jintao Zhang

Title: Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Abstract:
To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data‑driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast thinking paradigm‑relying on extracting historical patterns and mapping them to future values as their core modeling philosophy, lacking an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow‑thinking LLMs (e.g., OpenAI‑o1) have shown remarkable multi‑step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations ‑ including high computational cost, privacy risks, and limited capacity for in‑depth domain‑specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time‑R1, a two‑stage reinforcement fine‑tuning framework designed to enhance multi‑step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine‑tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. Particularly, we design a fine‑grained multi‑objective reward specifically for time series forecasting, and then introduce GRIP (group‑based relative importance for policy optimization), which leverages non‑uniform sampling to further encourage and optimize the model's exploration of effective reasoning paths. Experiments demonstrate that Time‑R1 significantly improves forecast performance across diverse datasets.

PaperID: 10055, https://arxiv.org/pdf/2506.10622.pdf

Authors: Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Thomas Schaaf, Esaú Villatoro-Tello, Ahmed Hassoon, Ricard Marxer, Petr Motlicek

Title: SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Abstract:
We present SDialog, an MIT‑licensed open‑source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end‑to‑end framework for building and analyzing LLM‑based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona‑driven multi‑agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM‑as‑a‑judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed‑backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog‑centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

PaperID: 10056, https://arxiv.org/pdf/2506.10540.pdf

Authors: Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang

Title: AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Abstract:
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre‑generated keyframes into fixed‑length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low‑quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi‑agent framework enabling efficient multi‑candidate clip generation and storytelling‑aware clip selection, thus creating globally consistent and story‑coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post‑Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS‑Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)‑inspired strategy that intelligently navigates the candidate space to generate high‑potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi‑shot animation evaluation, which assesses critical aspects such as story‑level consistency, action completion, and animation‑specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi‑candidate generation, pushing AI‑generated storytelling animation closer to production standards.

PaperID: 10057, https://arxiv.org/pdf/2506.10501.pdf

Authors: Surya Jasper, Minh Luu, Evan Pan, Aakash Tyagi, Michael Quinn, Jiang Hu, David Kebo Houngninou

Title: BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

Abstract:
Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML‑assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first of its kind, fully autonomous, multi‑agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed‑loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour‑over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen‑generated datasets were employed to train ML‑based failure triage models, we achieved high classification accuracy (88.1%‑93.2%) across different IP blocks, confirming the practical utility and realism of generated bugs. BugGen thus provides a scalable solution for generating high‑quality bug datasets, significantly enhancing verification efficiency and ML‑assisted debugging.

PaperID: 10058, https://arxiv.org/pdf/2506.10467.pdf

Authors: Felix Härer

Title: Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications

Abstract:
Recent advancements in LLMs indicate potential for novel applications, as evidenced by the reasoning capabilities in the latest OpenAI and DeepSeek models. To apply these models to domain‑specific applications beyond text generation, LLM‑based multi‑agent systems can be utilized to solve complex tasks, particularly by combining reasoning techniques, code generation, and software execution across multiple, potentially specialized LLMs. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application are not well understood. Defined specifications for multi‑agent LLM systems are required to explore their potential and suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research on (1.) multi‑agent specification by introducing an agent schema language and (2.) the execution and evaluation of the specifications through a multi‑agent system architecture and prototype. The specification language, system architecture, and prototype are first presented in this work, building on an LLM system from prior research. Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach. As a result, evaluations could be demonstrated for question answering, server security, and network security tasks completed correctly by agents with LLMs from OpenAI and DeepSeek.

PaperID: 10059, https://arxiv.org/pdf/2506.10462.pdf

Authors: Ana Müller, Sabina Jeschke, Anja Richert

Title: Are We Generalizing from the Exception? An In-the-Wild Study on Group-Sensitive Conversation Design in Human-Agent Interactions

Abstract:
This paper investigates the impact of a group‑adaptive conversation design in two socially interactive agents (SIAs) through two real‑world studies. Both SIAs ‑ Furhat, a social robot, and MetaHuman, a virtual agent ‑ were equipped with a conversational artificial intelligence (CAI) backend combining hybrid retrieval and generative models. The studies were carried out in an in‑the‑wild setting with a total of N = 188 participants who interacted with the SIAs ‑ in dyads, triads or larger groups ‑ at a German museum. Although the results did not reveal a significant effect of the group‑sensitive conversation design on perceived satisfaction, the findings provide valuable insights into the challenges of adapting CAI for multi‑party interactions and across different embodiments (robot vs.\ virtual agent), highlighting the need for multimodal strategies beyond linguistic pluralization. These insights contribute to the fields of Human‑Agent Interaction (HAI), Human‑Robot Interaction (HRI), and broader Human‑Machine Interaction (HMI), providing insights for future research on effective dialogue adaptation in group settings.

PaperID: 10060, https://arxiv.org/pdf/2506.10341.pdf

Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

Title: Provably Learning from Language Feedback

Abstract:
Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no‑regret algorithm, called \textttHELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that \textttHELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

PaperID: 10061, https://arxiv.org/pdf/2506.10322.pdf

Authors: Xueying Du, Kai Yu, Chong Wang, Yi Zou, Wentai Deng, Zuoyu Ou, Xin Peng, Lingming Zhang, Yiling Lou

Title: Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

Abstract:
Static bug analyzers play a crucial role in ensuring software quality. However, existing analyzers for bug detection in large codebases often suffer from high false positive rates. This is primarily due to the limited capabilities of analyzers in path feasibility validation with multiple conditional branches and complex data dependencies. While current LLM‑based approaches attempt to address this issue, their effectiveness remains limited due to insufficient constraint cascade analysis and scalability challenges in large projects. To address this challenge, we propose an iterative path feasibility analysis framework LLM4PFA. By leveraging LLM agent based targeted constraint reasoning, and key context‑aware analysis driven by agent planning, LLM4PFA effectively enhances complex inter‑procedural path feasibility analysis for minimizing false positives in static bug detection. Evaluation results show that LLM4PFA precisely filters out 72% to 96% false positives reported during static bug detection, significantly outperforming all the baselines by 41.1% ‑ 105.7% improvements; meanwhile LLM4PFA only misses 3 real bugs of 45 true positives.

PaperID: 10062, https://arxiv.org/pdf/2506.10264.pdf

Authors: Qiyue Yin, Pei Xu, Qiaozhe Li, Shengda Liu, Shengqi Shen, Tong Wang, Yihong Han, Xiaonan Zhao, Likun Yang, Shiyue Cao, Shiyu Qiu, Yuxuan Liu, Shizhao Yu, Lei Cui, Chengxin Yan, Jie Sun, Xiangquan Tang, Kaiqi Huang

Title: WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

Abstract:
Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence' s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi‑agent behaviors in dynamic environments, formulate action plans, and adapt strategies, has yet to be systematically evaluated or modeled. To address this gap, this paper introduces WGSR‑Bench, the first strategy reasoning benchmark for LLMs using wargame as its evaluation environment. Wargame, a quintessential high‑complexity strategic scenario, integrates environmental uncertainty, adversarial dynamics, and non‑unique strategic choices, making it an effective testbed for assessing LLMs' capabilities in multi‑agent decision‑making, intent inference, and counterfactual reasoning. WGSR‑Bench designs test samples around three core tasks, i.e., Environmental situation awareness, Opponent risk modeling and Policy generation, which serve as the core S‑POE architecture, to systematically assess main abilities of strategic reasoning. Finally, an LLM‑based wargame agent is designed to integrate these parts for a comprehensive strategy reasoning assessment. With WGSR‑Bench, we hope to assess the strengths and limitations of state‑of‑the‑art LLMs in game‑theoretic strategic reasoning and to advance research in large model‑driven strategic intelligence.

PaperID: 10063, https://arxiv.org/pdf/2506.10252.pdf

Authors: Rongfei Li, Francis Assadian

Title: A Novel Feedforward Youla Parameterization Method for Avoiding Local Minima in Stereo Image Based Visual Servoing Control

Abstract:
In robot navigation and manipulation, accurately determining the camera's pose relative to the environment is crucial for effective task execution. In this paper, we systematically prove that this problem corresponds to the Perspective‑3‑Point (P3P) formulation, where exactly three known 3D points and their corresponding 2D image projections are used to estimate the pose of a stereo camera. In image‑based visual servoing (IBVS) control, the system becomes overdetermined, as the 6 degrees of freedom (DoF) of the stereo camera must align with 9 observed 2D features in the scene. When more constraints are imposed than available DoFs, global stability cannot be guaranteed, as the camera may become trapped in a local minimum far from the desired configuration during servoing. To address this issue, we propose a novel control strategy for accurately positioning a calibrated stereo camera. Our approach integrates a feedforward controller with a Youla parameterization‑based feedback controller, ensuring robust servoing performance. Through simulations, we demonstrate that our method effectively avoids local minima and enables the camera to reach the desired pose accurately and efficiently.

PaperID: 10064, https://arxiv.org/pdf/2506.10175.pdf

Authors: Nanda Rani, Sandeep Kumar Shukla

Title: AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution

Abstract:
Effective attribution of Advanced Persistent Threats (APTs) increasingly hinges on the ability to correlate behavioral patterns and reason over complex, varied threat intelligence artifacts. We present AURA (Attribution Using Retrieval‑Augmented Agents), a multi‑agent, knowledge‑enhanced framework for automated and interpretable APT attribution. AURA ingests diverse threat data including Tactics, Techniques, and Procedures (TTPs), Indicators of Compromise (IoCs), malware details, adversarial tools, and temporal information, which are processed through a network of collaborative agents. These agents are designed for intelligent query rewriting, context‑enriched retrieval from structured threat knowledge bases, and natural language justification of attribution decisions. By combining Retrieval‑Augmented Generation (RAG) with Large Language Models (LLMs), AURA enables contextual linking of threat behaviors to known APT groups and supports traceable reasoning across multiple attack phases. Experiments on recent APT campaigns demonstrate AURA's high attribution consistency, expert‑aligned justifications, and scalability. This work establishes AURA as a promising direction for advancing transparent, data‑driven, and scalable threat attribution using multi‑agent intelligence.

PaperID: 10065, https://arxiv.org/pdf/2506.10171.pdf

Authors: Saswat Das, Jameson Sandler, Ferdinando Fioretto

Title: Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents

Abstract:
LLM agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. Moreover, these disclosures go beyond mere explicit disclosure, leaving open avenues for gradual manipulation or sidechannel information leakage. This study proposes an auditing framework for conversational privacy that quantifies an agent's susceptibility to these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is designed to stress‑test agents that enforce strict privacy directives against an iterative probing strategy. Rather than focusing solely on a single disclosure event or purely explicit leakage, CMPL simulates realistic multi‑turn interactions to systematically uncover latent vulnerabilities. Our evaluation on diverse domains, data modalities, and safety configurations demonstrates the auditing framework's ability to reveal privacy risks that are not deterred by existing single‑turn defenses, along with an in‑depth longitudinal study of the temporal dynamics of leakage, strategies adopted by adaptive adversaries, and the evolution of adversarial beliefs about sensitive targets. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.

PaperID: 10066, https://arxiv.org/pdf/2506.10086.pdf

Authors: Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel

Title: Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Abstract:
This paper presents a novel multi‑agent system called Chat‑of‑Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat‑of‑Thought employs multiple collaborative Large Language Model (LLM)‑based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi‑persona‑driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat‑of‑Thought in addressing these challenges through interactive, template‑driven workflows and context‑aware agent collaboration.

PaperID: 10067, https://arxiv.org/pdf/2506.10077.pdf

Authors: Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar

Title: A quantum semantic framework for natural language processing

Abstract:
Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. In this work, we argue this property imposes fundamental limitations on Large Language Models (LLMs) and other modern NLP systems, precisely because they operate within natural language itself. Using Kolmogorov complexity, we demonstrate that as an expression's complexity grows, the amount of contextual information required to reliably resolve its ambiguity explodes combinatorially. The computational intractability of recovering a single intended meaning for complex or ambiguous text therefore suggests that the classical view that linguistic forms possess intrinsic meaning in and of themselves is conceptually inadequate. We argue instead that meaning is dynamically actualized through an observer‑dependent interpretive act, a process whose non‑deterministic nature is most appropriately described by a non‑classical, quantum‑like logic. To test this hypothesis, we conducted a semantic Bell inequality test using diverse LLM agents. Our experiments yielded average CHSH expectation values from 1.2 to 2.8, with several runs producing values (e.g., 2.3‑2.4) in significant violation of the classical boundary (|S|\leq2), demonstrating that linguistic interpretation under ambiguity can exhibit non‑classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist‑based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian‑style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.

PaperID: 10068, https://arxiv.org/pdf/2506.09940.pdf

Authors: Jiachen Hu, Rui Ai, Han Zhong, Xiaoyu Chen, Liwei Wang, Zhaoran Wang, Zhuoran Yang

Title: The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability

Abstract:
Information asymmetry is a pervasive feature of multi‑agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non‑i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample‑efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an ε‑optimal policy with a tight sample complexity of O(1/ε^2).

PaperID: 10069, https://arxiv.org/pdf/2506.09755.pdf

Authors: Shuo Jiang, Min Xie, Frank Youhua Chen, Jian Ma, Jianxi Luo

Title: Intelligent Design 4.0: Paradigm Evolution Toward the Agentic AI Era

Abstract:
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge‑based reasoning capabilities, and open new avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by foundation model‑based agentic AI systems. We review the historical evolution of ID across four distinct stages: rule‑based expert systems, task‑specific machine learning models, large‑scale foundation AI models, and the recent emerging paradigm of foundation model‑based multi‑agent collaboration. We propose an ontological framework for ID 4.0 and discuss its potential to support end‑to‑end automation of engineering design processes through coordinated, autonomous multi‑agent‑based systems. Furthermore, we discuss challenges and opportunities of ID 4.0, including perspectives on data foundations, agent collaboration mechanisms, and the formulation of design problems and objectives. In sum, these insights provide a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing the growing complexity of engineering design.

PaperID: 10070, https://arxiv.org/pdf/2506.09656.pdf

Authors: Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong

Title: Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives

Abstract:
The ongoing evolution of AI paradigms has propelled AI research into the agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi‑agent autonomous decision‑making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasing situational and systemic risks. This has brought significant attention to value alignment for agentic AI systems, which aims to ensure that an agent's goals, preferences, and behaviors align with human values and societal norms. Addressing socio‑governance demands through a Multi‑level Value framework, this study comprehensively reviews value alignment in LLM‑based multi‑agent systems as the representative archetype of agentic AI systems. Our survey systematically examines three interconnected dimensions: First, value principles are structured via a top‑down hierarchy across macro, meso, and micro levels. Second, application scenarios are categorized along a general‑to‑specific continuum explicitly mirroring these value tiers. Third, value alignment methods and evaluation are mapped to this tiered framework through systematic examination of benchmarking datasets and relevant methodologies. Additionally, we delve into value coordination among multiple agents within agentic AI systems. Finally, we propose several potential research directions in this field.

PaperID: 10071, https://arxiv.org/pdf/2506.09600.pdf

Authors: Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor

Title: Effective Red-Teaming of Policy-Adherent Agents

Abstract:
Task‑oriented LLM‑based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy‑adherent agents for personal benefit. To address this, we present CRAFT, a multi‑agent red‑teaming system that leverages policy‑aware persuasive strategies to undermine a policy‑adherent agent in a customer‑service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive. Building upon the existing tau‑bench benchmark, we introduce tau‑break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research‑driven safeguards to protect policy‑adherent agents from adversarial attacks

PaperID: 10072, https://arxiv.org/pdf/2506.09557.pdf

Authors: Zhaoyang Wei, Chenhui Qiang, Bowen Jiang, Xumeng Han, Xuehui Yu, Zhenjun Han

Title: AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

Abstract:
Chain‑of‑Thought (CoT) reasoning has emerged as a powerful approach to enhance the structured, multi‑step decision‑making capabilities of Multi‑Modal Large Models (MLLMs), is particularly crucial for autonomous driving with adverse weather conditions and complex traffic environments. However, existing benchmarks have largely overlooked the need for rigorous evaluation of CoT processes in these specific and challenging scenarios. To address this critical gap, we introduce AD^2‑Bench, the first Chain‑of‑Thought benchmark specifically designed for autonomous driving with adverse weather and complex scenes. AD^2‑Bench is meticulously constructed to fulfill three key criteria: comprehensive data coverage across diverse adverse environments, fine‑grained annotations that support multi‑step reasoning, and a dedicated evaluation framework tailored for assessing CoT performance. The core contribution of AD^2‑Bench is its extensive collection of over 5.4k high‑quality, manually annotated CoT instances. Each intermediate reasoning step in these annotations is treated as an atomic unit with explicit ground truth, enabling unprecedented fine‑grained analysis of MLLMs' inferential processes under text‑level, point‑level, and region‑level visual prompts. Our comprehensive evaluation of state‑of‑the‑art MLLMs on AD^2‑Bench reveals accuracy below 60%, highlighting the benchmark's difficulty and the need to advance robust, interpretable end‑to‑end autonomous driving systems. AD^2‑Bench thus provides a standardized evaluation platform, driving research forward by improving MLLMs' reasoning in autonomous driving, making it an invaluable resource.

PaperID: 10073, https://arxiv.org/pdf/2506.09469.pdf

Authors: Maria Damanaki, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos

Title: Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing

Abstract:
Multi‑Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundations for advanced perception and precise path planning modules. Nonetheless, single agent based MOT lacks in sensing surroundings due to occlusions, sensors failures, etc. Hence, the integration of multiagent information is essential for comprehensive understanding of the environment. This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scene by formulating and solving a graph topology‑aware optimization problem so as to fuse information coming from multiple vehicles. By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position error of bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences of diverse multi‑agent detections, and associate the refined bounding boxes to tracked objects at two stages, optimizing localization and tracking accuracies. An extensive evaluation study has been conducted, using the real‑world V2V4Real dataset, where the proposed method significantly outperforms the baseline frameworks, including the state‑of‑the‑art deep‑learning DMSTrack and V2V4Real, in various testing sequences.

PaperID: 10074, https://arxiv.org/pdf/2506.09434.pdf

Authors: Michael Amir, Matteo Bettini, Amanda Prorok

Title: When Is Diversity Rewarded in Cooperative Multi-Agent Learning?

Abstract:
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi‑agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non‑spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the N agents' effort allocations on individual tasks to a task score, and an outer operator that merges the M task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time‑extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi‑agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient‑based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.

PaperID: 10075, https://arxiv.org/pdf/2506.09335.pdf

Authors: Moshi Wei, Sparks Li

Title: Intelligent System of Emergent Knowledge: A Coordination Fabric for Billions of Minds

Abstract:
The Intelligent System of Emergent Knowledge (ISEK) establishes a decentralized network where human and artificial intelligence agents collaborate as peers, forming a self‑organizing cognitive ecosystem. Built on Web3 infrastructure, ISEK combines three fundamental principles: (1) a decentralized multi‑agent architecture resistant to censorship, (2) symbiotic AI‑human collaboration with equal participation rights, and (3) resilient self‑adaptation through distributed consensus mechanisms. The system implements an innovative coordination protocol featuring a six‑phase workflow (Publish, Discover, Recruit, Execute, Settle, Feedback) for dynamic task allocation, supported by robust fault tolerance and a multidimensional reputation system. Economic incentives are governed by the native ISEK token, facilitating micropayments, governance participation, and reputation tracking, while agent sovereignty is maintained through NFT‑based identity management. This synthesis of blockchain technology, artificial intelligence, and incentive engineering creates an infrastructure that actively facilitates emergent intelligence. ISEK represents a paradigm shift from conventional platforms, enabling the organic development of large‑scale, decentralized cognitive systems where autonomous agents collectively evolve beyond centralized constraints.

PaperID: 10076, https://arxiv.org/pdf/2506.09331.pdf

Authors: Arjun Vaithilingam Sudhakar

Title: Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Abstract:
Modern Large Language Models (LLMs) exhibit impressive zero‑shot and few‑shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding other's intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi‑agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agent's ability to adapt and cooperate with both artificial and human partners. By leveraging LLM‑based agents capable of natural language interaction, we move towards creating hybrid human‑AI systems that can foster seamless collaboration, with broad implications for the future of human‑artificial interaction.

PaperID: 10077, https://arxiv.org/pdf/2506.09215.pdf

Authors: Greyson Brothers

Title: Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs

Abstract:
We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal‑to‑noise ratio (SNR) of inputs fluctuates. We then show that an attention‑based adaptive pooling method can approximate the signal‑optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi‑agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.

PaperID: 10078, https://arxiv.org/pdf/2506.09171.pdf

Authors: Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

Title: Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

Abstract:
Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments. Existing methods may struggle with adapting to new information or efficiently utilizing past experiences for multi‑step reasoning without fine‑tuning. We introduce a novel LLM agent framework that enhances planning capabilities through in‑context learning, facilitated by atomic fact augmentation and a recursive lookahead search. Our agent learns to extract task‑critical ``atomic facts'' from its interaction trajectories. These facts dynamically augment the prompts provided to LLM‑based components responsible for action proposal, latent world model simulation, and state‑value estimation. Planning is performed via a depth‑limited lookahead search, where the LLM simulates potential trajectories and evaluates their outcomes, guided by the accumulated facts and interaction history. This approach allows the agent to improve its understanding and decision‑making online, leveraging its experience to refine its behavior without weight updates. We provide a theoretical motivation linking performance to the quality of fact‑based abstraction and LLM simulation accuracy. Empirically, our agent demonstrates improved performance and adaptability on challenging interactive tasks, achieving more optimal behavior as it accumulates experience, showcased in tasks such as TextFrozenLake and ALFWorld.

PaperID: 10079, https://arxiv.org/pdf/2506.09107.pdf

Authors: Athena Vakali, Ilias Dimitriadis

Title: FAIRTOPIA: Envisioning Multi-Agent Guardianship for Disrupting Unfair AI Pipelines

Abstract:
AI models have become active decision makers, often acting without human supervision. The rapid advancement of AI technology has already caused harmful incidents that have hurt individuals and societies and AI unfairness in heavily criticized. It is urgent to disrupt AI pipelines which largely neglect human principles and focus on computational biases exploration at the data (pre), model(in), and deployment (post) processing stages. We claim that by exploiting the advances of agents technology, we will introduce cautious, prompt, and ongoing fairness watch schemes, under realistic, systematic, and human‑centric fairness expectations. We envision agents as fairness guardians, since agents learn from their environment, adapt to new information, and solve complex problems by interacting with external tools and other systems. To set the proper fairness guardrails in the overall AI pipeline, we introduce a fairness‑by‑design approach which embeds multi‑role agents in an end‑to‑end (human to AI) synergetic scheme. Our position is that we may design adaptive and realistic AI fairness frameworks, and we introduce a generalized algorithm which can be customized to the requirements and goals of each AI decision making scenario. Our proposed, so called FAIRTOPIA framework, is structured over a three‑layered architecture, which encapsulates the AI pipeline inside an agentic guardian and a knowledge‑based, self‑refining layered scheme. Based on our proposition, we enact fairness watch in all of the AI pipeline stages, under robust multi‑agent workflows, which will inspire new fairness research hypothesis, heuristics, and methods grounded in human‑centric, systematic, interdisciplinary, socio‑technical principles.

PaperID: 10080, https://arxiv.org/pdf/2506.09080.pdf

Authors: Jiaxiang Chen, Mingxi Zou, Zhuo Wang, Qifan Wang, Dongning Sun, Chi Zhang, Zenglin Xu

Title: FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making

Abstract:
Financial decision‑making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions‑such as expert reliance under information asymmetry, loss‑averse sensitivity, and feedback‑driven temporal adjustment. We propose FinHEAR, a multi‑agent framework for Human Expertise and Adaptive Risk‑aware reasoning. FinHEAR orchestrates specialized LLM‑based agents to analyze historical trends, interpret current events, and retrieve expert‑informed precedents within an event‑centric pipeline. Grounded in behavioral economics, it incorporates expert‑guided retrieval, confidence‑adjusted position sizing, and outcome‑based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk‑adjusted returns.

PaperID: 10081, https://arxiv.org/pdf/2506.09050.pdf

Authors: Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba

Title: ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Abstract:
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package‑delivery routing, crew scheduling, factory production planning, and power‑grid balancing? We introduce ALE‑Bench, a new benchmark for evaluating AI systems on score‑based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE‑Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short‑duration, pass/fail coding benchmarks, ALE‑Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test‑run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long‑horizon problem‑solving capabilities. This highlights the need for this benchmark to foster future AI advancements.

PaperID: 10082, https://arxiv.org/pdf/2506.09046.pdf

Authors: Xiaowen Ma, Chenyang Lin, Yao Zhang, Volker Tresp, Yunpu Ma

Title: Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

Abstract:
Leveraging multiple Large Language Models(LLMs) has proven effective for addressing complex, high‑dimensional tasks, but current approaches often rely on static, manually engineered multi‑agent configurations. To overcome these constraints, we present the Agentic Neural Network(ANN), a framework that conceptualizes multi‑agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative "team" focused on a specific subtask. Agentic Neural Network follows a two‑phase optimization strategy: (1) Forward Phase‑Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase‑Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self‑evolve their roles, prompts, and coordination. This neuro‑symbolic approach enables ANN to create new or specialized agent teams post‑training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi‑agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data‑driven framework for multi‑agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open‑source the entire framework.

PaperID: 10083, https://arxiv.org/pdf/2506.08972.pdf

Authors: Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang

Title: Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Abstract:
Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks ‑‑ such as shot‑chain execution tasks and single‑screen grounding tasks ‑‑ while overlooking the generalization to compositional tasks, which are indispensable for real‑world applications. This work introduces UI‑NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI‑NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent‑as‑a‑model show that UI‑NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under‑execution, over‑execution, and attention drift, causing visible atomic‑to‑compositional generalization gap. Inspired by these findings, we propose AGENT‑NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT‑NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long‑horizon tasks to a series of self‑contained atomic subtasks. AGENT‑NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI‑NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at https://ui‑nexus.github.io.

PaperID: 10084, https://arxiv.org/pdf/2506.08837.pdf

Authors: Luca Beurer-Kellner, Beat Buesser, Ana-Maria Creţu, Edoardo Debenedetti, Daniel Dobos, Daniel Fabian, Marc Fischer, David Froelicher, Kathrin Grosse, Daniel Naeff, Ezinwanne Ozoani, Andrew Paverd, Florian Tramèr, Václav Volhejn

Title: Design Patterns for Securing LLM Agents against Prompt Injections

Abstract:
As AI agents powered by Large Language Models (LLMs) become increasingly versatile and capable of addressing a broad spectrum of tasks, ensuring their security has become a critical challenge. Among the most pressing threats are prompt injection attacks, which exploit the agent's resilience on natural language inputs ‑‑ an especially dangerous threat when agents are granted tool access or handle sensitive information. In this work, we propose a set of principled design patterns for building AI agents with provable resistance to prompt injection. We systematically analyze these patterns, discuss their trade‑offs in terms of utility and security, and illustrate their real‑world applicability through a series of case studies.

PaperID: 10085, https://arxiv.org/pdf/2506.08807.pdf

Authors: Luca Ballotta, Áron Vékássy, Stephanie Gil, Michal Yemini

Title: Confidence Boosts Trust-Based Resilience in Cooperative Multi-Robot Systems

Abstract:
Wireless communication‑based multi‑robot systems open the door to cyberattacks that can disrupt safety and performance of collaborative robots. The physical channel supporting inter‑robot communication offers an attractive opportunity to decouple the detection of malicious robots from task‑relevant data exchange between legitimate robots. Yet, trustworthiness indications coming from physical channels are uncertain and must be handled with this in mind. In this paper, we propose a resilient protocol for multi‑robot operation wherein a parameter λt accounts for how confident a robot is about the legitimacy of nearby robots that the physical channel indicates. Analytical results prove that our protocol achieves resilient coordination with arbitrarily many malicious robots under mild assumptions. Tuning λt allows a designer to trade between near‑optimal inter‑robot coordination and quick task execution; see Fig. 1. This is a fundamental performance tradeoff and must be carefully evaluated based on the task at hand. The effectiveness of our approach is numerically verified with experiments involving platoons of autonomous cars where some vehicles are maliciously spoofed.

PaperID: 10086, https://arxiv.org/pdf/2506.08800.pdf

Authors: Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi

Title: Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Abstract:
Data science aims to extract insights from data to support decision‑making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data‑science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances‑‑such as code execution and knowledge bases‑‑that can perform self‑directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal‑oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human‑AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.

PaperID: 10087, https://arxiv.org/pdf/2506.08726.pdf

Authors: Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe

Title: Improved LLM Agents for Financial Document Question Answering

Abstract:
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have showed the effectiveness of critic agents (i.e., self‑correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and show, through experiments, that this critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state‑of‑the‑art approach (program‑of‑thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.

PaperID: 10088, https://arxiv.org/pdf/2506.08507.pdf

Authors: Kuo Yang, Xingjie Yang, Linhui Yu, Qing Xu, Yan Fang, Xu Wang, Zhengyang Zhou, Yang Wang

Title: MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

Abstract:
Large Language Model (LLM)‑driven Multi‑agent systems (Mas) have recently emerged as a powerful paradigm for tackling complex real‑world tasks. However, existing Mas construction methods typically rely on manually crafted interaction mechanisms or heuristic rules, introducing human biases and constraining the autonomous ability. Even with recent advances in adaptive Mas construction, existing systems largely remain within the paradigm of semi‑autonomous patterns. In this work, we propose MasHost, a Reinforcement Learning (RL)‑based framework for autonomous and query‑adaptive Mas design. By formulating Mas construction as a graph search problem, our proposed MasHost jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism. Beyond the accuracy and efficiency objectives pursued in prior works, we introduce component rationality as an additional and novel design principle in Mas. To achieve this multi‑objective optimization, we propose Hierarchical Relative Policy Optimization (HRPO), a novel RL strategy that collaboratively integrates group‑relative advantages and action‑wise rewards. To our knowledge, our proposed MasHost is the first RL‑driven framework for autonomous Mas graph construction. Extensive experiments on six benchmarks demonstrate that MasHost consistently outperforms most competitive baselines, validating its effectiveness, efficiency, and structure rationality.

PaperID: 10089, https://arxiv.org/pdf/2506.08430.pdf

Authors: Ziqi. Liu, Ziyang. Zhou, Mingxuan. Hu

Title: CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Abstract:
Large language model (LLM) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single‑perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF‑I), an LLM‑driven multi‑agent system designed to overcome these issues. CAF‑I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF‑I's state‑of‑the‑art zero‑shot performance. Achieving SOTA on the vast majority of metrics, CAF‑I reaches an average Macro‑F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human‑like multi‑perspective analysis, enhancing detection accuracy and interpretability.

PaperID: 10090, https://arxiv.org/pdf/2506.08379.pdf

Authors: Yurun Yuan, Tengyang Xie

Title: Reinforce LLM Reasoning through Multi-Agent Reflection

Abstract:
Leveraging more test‑time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify‑and‑improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi‑turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor‑critic LLM system to iteratively refine answers via direct preference learning on self‑generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in‑ and out‑of‑distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first‑turn accuracy from 58.2% to 63.2% with Ministral‑based models. An ablation study further confirms the benefits of multi‑agent collaboration and out‑of‑distribution generalization.

PaperID: 10091, https://arxiv.org/pdf/2506.08296.pdf

Authors: Hongjun Wu, Heng Zhang, Pengsong Zhang, Jin Wang, Cong Wang

Title: HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation

Abstract:
Recent advances in multimodal vision‑language‑action (VLA) models have revolutionized traditional robot learning, enabling systems to interpret vision, language, and action in unified frameworks for complex task planning. However, mastering complex manipulation tasks remains an open challenge, constrained by limitations in persistent contextual memory, multi‑agent coordination under uncertainty, and dynamic long‑horizon planning across variable sequences. To address this challenge, we propose HiBerNAC, a Hierarchical Brain‑emulated robotic Neural Agent Collective, inspired by breakthroughs in neuroscience, particularly in neural circuit mechanisms and hierarchical decision‑making. Our framework combines: (1) multimodal VLA planning and reasoning with (2) neuro‑inspired reflection and multi‑agent mechanisms, specifically designed for complex robotic manipulation tasks. By leveraging neuro‑inspired functional modules with decentralized multi‑agent collaboration, our approach enables robust and enhanced real‑time execution of complex manipulation tasks. In addition, the agentic system exhibits scalable collective intelligence via dynamic agent specialization, adapting its coordination strategy to variable task horizons and complexity. Through extensive experiments on complex manipulation tasks compared with state‑of‑the‑art VLA models, we demonstrate that HiBerNAC reduces average long‑horizon task completion time by 23%, and achieves non‑zero success rates (12\textendash 31%) on multi‑path tasks where prior state‑of‑the‑art VLA models consistently fail. These results provide indicative evidence for bridging biological cognition and robotic learning mechanisms.

PaperID: 10092, https://arxiv.org/pdf/2506.08149.pdf

Authors: Hang Wang, Dechen Gao, Junshan Zhang

Title: Ego-centric Learning of Communicative World Models for Autonomous Driving

Abstract:
We study multi‑agent reinforcement learning (MARL) for tasks in complex high‑dimensional environments, such as autonomous driving. MARL is known to suffer from the partial observability and non‑stationarity issues. To tackle these challenges, information sharing is often employed, which however faces major hurdles in practice, including overwhelming communication overhead and scalability concerns. By making use of generative AI embodied in world model together with its latent representation, we develop \it CALL, \underlineCommunic\underlineative Wor\underlineld Mode\underlinel, for MARL, where 1) each agent first learns its world model that encodes its state and intention into low‑dimensional latent representation with smaller memory footprint, which can be shared with other agents of interest via lightweight communication; and 2) each agent carries out ego‑centric learning while exploiting lightweight information sharing to enrich her world model, and then exploits its generalization capacity to improve prediction for better planning. We characterize the gain on the prediction accuracy from the information sharing and its impact on performance gap. Extensive experiments are carried out on the challenging local trajectory planning tasks in the CARLA platform to demonstrate the performance gains of using CALL.

PaperID: 10093, https://arxiv.org/pdf/2506.07997.pdf

Authors: Fan Yang, Yuan Tian, Jiansong Zhang

Title: Supporting Construction Worker Well-Being with a Multi-Agent Conversational AI System

Abstract:
The construction industry is characterized by both high physical and psychological risks, yet supports of mental health remain limited. While advancements in artificial intelligence (AI), particularly large language models (LLMs), offer promising solutions, their potential in construction remains largely underexplored. To bridge this gap, we developed a conversational multi‑agent system that addresses industry‑specific challenges through an AI‑driven approach integrated with domain knowledge. In parallel, it fulfills construction workers' basic psychological needs by enabling interactions with multiple agents, each has a distinct persona. This approach ensures that workers receive both practical problem‑solving support and social engagement, ultimately contributing to their overall well‑being. We evaluate its usability and effectiveness through a within‑subjects user study with 12 participants. The results show that our system significantly outperforms the single‑agent baseline, achieving improvements of 18% in usability, 40% in self‑determination, 60% in social presence, and 60% in trust. These findings highlight the promise of LLM‑driven AI systems in providing domain‑specific support for construction workers.

PaperID: 10094, https://arxiv.org/pdf/2506.07982.pdf

Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan

Title: $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Abstract:
Existing benchmarks for conversational AI agents simulate single‑control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real‑world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce τ^2‑bench, with four key contributions: 1) A novel Telecom dual‑control domain modeled as a Dec‑POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine‑grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no‑user to dual‑control, highlighting the challenges of guiding users. Overall, τ^2‑bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

PaperID: 10095, https://arxiv.org/pdf/2506.07877.pdf

Authors: Andrea Tiranti, Francesco Wanderlingh, Enrico Simetti, Marco Baglietto, Giovanni Indiveri, Antonio Pascoal

Title: A distributed motion planning approach to cooperative underwater acoustic source tracking

Abstract:
Accurate tracking of underwater acoustic sources is critical for a variety of marine applications, yet remains a challenging task due to communication constraints and environmental uncertainties. In this regard, this paper addresses the problem of underwater acoustic source tracking using a team of autonomous underwater vehicles (AUVs). The core idea is to optimize the guidance of each agent to achieve coordinated motion planning that leads to optimal geometric configurations with respect to the target, thereby enhancing tracking performance. To tackle this, we propose a Distributed Model Predictive Control (DMPC) framework to improve performance and robustness. The control problem is formulated as a multi‑objective optimization task, incorporating geometric observability, proximity to the target, and communication connectivity. A Receding Horizon Control (RHC) approach, coupled with an Unscented Transform (UT)‑based prediction scheme, is employed to ensure longterm tracking accuracy while accounting for uncertainties. The optimization is distributed using the sequential multi‑agent decision‑making framework, combined with the Time‑Division Multiple Access (TDMA) communication protocol. The proposed methodology is implemented in a simulation environment that accounts for the constraints of acoustic communication. The approach is compared with existing methods such as decentralized MPC and Particle Swarm Optimization (PSO).

PaperID: 10096, https://arxiv.org/pdf/2506.07820.pdf

Authors: Jiaxiang Chen, Zhuo Wang, Mingxi Zou, Qifan Wang, Zenglin Xu

Title: Guideline Forest: Retrieval-Augmented Reasoning with Branching Experience-Induced Guidelines

Abstract:
Retrieval‑augmented generation (RAG) has been widely adopted to ground large language models (LLMs) in external knowledge, yet it remains largely underexplored for improving reasoning. Existing methods either rely on online exploration during inference or heuristic supervision over reasoning trajectories, but they fail to effectively accumulate and reuse past reasoning experience. We propose Guideline Forest, a retrieval‑augmented reasoning framework that explicitly leverages experience to guide multi‑step reasoning. The framework stores high‑quality, label‑consistent reasoning traces as reusable memory, retrieves relevant experiences for new problems, and induces them into structured guidelines that steer reasoning and enable controlled branching and aggregation. Experiments on mathematical (GSM8K, MATH‑500) and programming (MBPP, HumanEval) benchmarks demonstrate consistent improvements over strong reasoning baselines, including CoT, ReAct, ToT, FoT, and AFlow. Further analyses show that experience retrieval, guideline‑induced diversity, and stepwise aggregation are key to the framework's effectiveness. Beyond single‑model reasoning, Guideline Forest generalizes to enhance diverse reasoning paradigms and supports multi‑model collaboration, highlighting its flexibility and scalability.

PaperID: 10097, https://arxiv.org/pdf/2506.07755.pdf

Authors: Nikolaos Bousias, Lars Lindemann, George Pappas

Title: Deep Equivariant Multi-Agent Control Barrier Functions

Abstract:
With multi‑agent systems increasingly deployed autonomously at scale in complex environments, ensuring safety of the data‑driven policies is critical. Control Barrier Functions have emerged as an effective tool for enforcing safety constraints, yet existing learning‑based methods often lack in scalability, generalization and sampling efficiency as they overlook inherent geometric structures of the system. To address this gap, we introduce symmetries‑infused distributed Control Barrier Functions, enforcing the satisfaction of intrinsic symmetries on learnable graph‑based safety certificates. We theoretically motivate the need for equivariant parametrization of CBFs and policies, and propose a simple, yet efficient and adaptable methodology for constructing such equivariant group‑modular networks via the compatible group actions. This approach encodes safety constraints in a distributed data‑efficient manner, enabling zero‑shot generalization to larger and denser swarms. Through extensive simulations on multi‑robot navigation tasks, we demonstrate that our method outperforms state‑of‑the‑art baselines in terms of safety, scalability, and task success rates, highlighting the importance of embedding symmetries in safe distributed neural policies.

PaperID: 10098, https://arxiv.org/pdf/2506.07715.pdf

Authors: Yian Zhu, Ziye Jia, Lei Zhang, Yao Wu, Qiuming Zhu, Qihui Wu

Title: Delay Optimization in Remote ID-Based UAV Communication via BLE and Wi-Fi Switching

Abstract:
The remote identification (Remote ID) broadcast capability allows unmanned aerial vehicles (UAVs) to exchange messages, which is a pivotal technology for inter‑UAV communications. Although this capability enhances the operational visibility, low delay in Remote ID‑based communications is critical for ensuring the efficiency and timeliness of multi‑UAV operations in dynamic environments. To address this challenge, we first establish delay models for Remote ID communications by considering packet reception and collisions across both BLE 4 and Wi‑Fi protocols. Building upon these models, we formulate an optimization problem to minimize the long‑term communication delay through adaptive protocol selection. Since the delay performance varies with the UAV density, we propose an adaptive BLE/Wi‑Fi switching algorithm based on the multi‑agent deep Q‑network approach. Experimental results demonstrate that in dynamic‑density scenarios, our strategy achieves 32.1% and 37.7% lower latency compared to static BLE 4 and Wi‑Fi modes respectively.

PaperID: 10099, https://arxiv.org/pdf/2506.07712.pdf

Authors: Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu

Title: Through the Valley: Path to Effective Long CoT Training for Small Language Models

Abstract:
Long chain‑of‑thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine‑tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine‑tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi‑step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impacts downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine‑tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small‑scale reasoning models.

PaperID: 10100, https://arxiv.org/pdf/2506.07675.pdf

Authors: Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, Mingjie Tang

Title: QUITE: A Query Rewrite System Beyond Rules with LLM Agents

Abstract:
Query rewrite transforms SQL queries into semantically equivalent forms that run more efficiently. Existing approaches mainly rely on predefined rewrite rules, but they handle a limited subset of queries and can cause performance regressions. This limitation stems from three challenges of rule‑based query rewrite: (1) it is hard to discover and verify new rules, (2) fixed rewrite rules do not generalize to new query patterns, and (3) some rewrite techniques cannot be expressed as fixed rules. Motivated by the fact that human experts exhibit significantly better rewrite ability but suffer from scalability, and Large Language Models (LLMs) have demonstrated nearly human‑level semantic and reasoning abilities, we propose a new approach of using LLMs to rewrite SQL queries beyond rules. Due to the hallucination problems in LLMs, directly applying LLMs often leads to nonequivalent and suboptimal queries. To address this issue, we propose QUITE (query rewrite), a training‑free and feedback‑aware system based on LLM agents that rewrites SQL queries into semantically equivalent forms with significantly better performance, covering a broader range of query patterns and rewrite strategies compared to rule‑based methods. Firstly, we design a multi‑agent framework controlled by a finite state machine (FSM) to equip LLMs with the ability to use external tools and enhance the rewrite process with real‑time database feedback. Secondly, we develop a rewrite middleware to enhance the ability of LLMs to generate optimized query equivalents. Finally, we employ a novel hint injection technique to improve execution plans for rewritten queries. Extensive experiments show that QUITE reduces query execution time by up to 35.8% over state‑of‑the‑art approaches and produces 24.1% more rewrites than prior methods, covering query cases that earlier systems did not handle.

PaperID: 10101, https://arxiv.org/pdf/2506.07633.pdf

Authors: Ana Tanevska, Ananthapathmanabhan Ratheesh Kumar, Arabinda Ghosh, Ernesto Casablanca, Ginevra Castellano, Sadegh Soudjani

Title: Blending Participatory Design and Artificial Awareness for Trustworthy Autonomous Vehicles

Abstract:
Current robotic agents, such as autonomous vehicles (AVs) and drones, need to deal with uncertain real‑world environments with appropriate situational awareness (SA), risk awareness, coordination, and decision‑making. The SymAware project strives to address this issue by designing an architecture for artificial awareness in multi‑agent systems, enabling safe collaboration of autonomous vehicles and drones. However, these agents will also need to interact with human users (drivers, pedestrians, drone operators), which in turn requires an understanding of how to model the human in the interaction scenario, and how to foster trust and transparency between the agent and the human. In this work, we aim to create a data‑driven model of a human driver to be integrated into our SA architecture, grounding our research in the principles of trustworthy human‑agent interaction. To collect the data necessary for creating the model, we conducted a large‑scale user‑centered study on human‑AV interaction, in which we investigate the interaction between the AV's transparency and the users' behavior. The contributions of this paper are twofold: First, we illustrate in detail our human‑AV study and its findings, and second we present the resulting Markov chain models of the human driver computed from the study's data. Our results show that depending on the AV's transparency, the scenario's environment, and the users' demographics, we can obtain significant differences in the model's transitions.

PaperID: 10102, https://arxiv.org/pdf/2506.07564.pdf

Authors: Peiran Li, Xinkai Zou, Zhuohang Wu, Ruifeng Li, Shuo Xing, Hanwen Zheng, Zhikai Hu, Yuping Wang, Haoxi Li, Qin Yuan, Yingmo Zhang, Zhengzhong Tu

Title: SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

Abstract:
Recent advances in large language models (LLMs) and vision‑language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi‑modal tool use. Despite their growing capabilities, today's agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi‑agent coordination. In this work, we introduce SAFEFLOW, a new protocol‑level framework for building trustworthy LLM/VLM‑based agents. SAFEFLOW enforces fine‑grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high‑integrity decisions. To ensure robustness in concurrent multi‑agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We further introduce mechanisms, including write‑ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate the performances, we built SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming state‑of‑the‑art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.

PaperID: 10103, https://arxiv.org/pdf/2506.07528.pdf

Authors: Qisheng Hu, Quanyu Long, Wenya Wang

Title: Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification

Abstract:
Multi‑hop claim verification is inherently challenging, requiring multi‑step reasoning to construct verification chains while iteratively searching for information to uncover hidden bridging facts. This process is fundamentally interleaved, as effective reasoning relies on dynamically retrieved evidence, while effective search demands reasoning to refine queries based on partial information. To achieve this, we propose Hierarchical Agent Reasoning and Information Search (HARIS), explicitly modeling the coordinated process of reasoning‑driven searching and search‑informed reasoning. HARIS consists of a high‑level reasoning agent that focuses on constructing the main verification chain, generating factual questions when more information is needed, and a low‑level search agent that iteratively retrieves more information, refining its search based on intermediate findings. This design allows each agent to specialize in its respective task, enhancing verification accuracy and interpretability. HARIS is trained using reinforcement learning with outcome‑based rewards. Experimental results on the EX‑FEVER and HOVER benchmarks demonstrate that HARIS achieves strong performance, greatly advancing multi‑hop claim verification.

PaperID: 10104, https://arxiv.org/pdf/2506.07524.pdf

Authors: Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Yusuf Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang

Title: TAI3: Testing Agent Integrity in Interpreting User Intent

Abstract:
LLM agents are increasingly deployed to automate real‑world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to the agent's actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API‑centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, TAI3 maintains a datatype‑aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error‑exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.

PaperID: 10105, https://arxiv.org/pdf/2506.07468.pdf

Authors: Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

Title: Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Abstract:
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine‑tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch ‑‑ attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self‑RedTeam, an online self‑play reinforcement learning algorithm where an attacker and defender agent co‑evolve through continuous interaction. We cast safety alignment as a two‑player zero‑sum game, where a single model alternates between attacker and defender roles ‑‑ generating adversarial prompts and safeguarding against them ‑‑ while a reward LM adjudicates outcomes. This enables dynamic co‑adaptation. Grounded in the game‑theoretic framework of zero‑sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self‑play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self‑RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain‑of‑Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over‑refusals. Our results motivate a shift from reactive patching to proactive co‑evolution in LM safety training, enabling scalable, autonomous, and robust self‑improvement of LMs via multi‑agent reinforcement learning (MARL).

PaperID: 10106, https://arxiv.org/pdf/2506.07392.pdf

Authors: Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao

Title: From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks

Abstract:
The proliferation of UAVs has enabled a wide range of mission‑critical applications and is becoming a cornerstone of low‑altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low‑altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision‑making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi‑agent deep reinforcement learning (FMADRL)‑driven moving target defense (MTD) framework for proactive DoS mitigation in low‑altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi‑agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient‑based algorithm, UAVs collaboratively optimize their policies via reward‑weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state‑of‑the‑art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low‑altitude networks, paving the way for reliable and scalable low‑altitude economy.

PaperID: 10107, https://arxiv.org/pdf/2506.07388.pdf

Authors: Yun Hua, Haosheng Chen, Shiqin Wang, Wenhao Li, Xiangfeng Wang, Jun Luo

Title: Shapley-Coop: Credit Assignment for Emergent Cooperation in Self-Interested LLM Agents

Abstract:
Large Language Models (LLMs) show strong collaborative performance in multi‑agent systems with predefined roles and workflows. However, in open‑ended environments lacking coordination rules, agents tend to act in self‑interested ways. The central challenge in achieving coordination lies in credit assignment ‑‑ fairly evaluating each agent's contribution and designing pricing mechanisms that align their heterogeneous goals. This problem is critical as LLMs increasingly participate in complex human‑AI collaborations, where fair compensation and accountability rely on effective pricing mechanisms. Inspired by how human societies address similar coordination challenges (e.g., through temporary collaborations such as employment or subcontracting), we propose a cooperative workflow, Shapley‑Coop. Shapley‑Coop integrates Shapley Chain‑of‑Thought ‑‑ leveraging marginal contributions as a principled basis for pricing ‑‑ with structured negotiation protocols for effective price matching, enabling LLM agents to coordinate through rational task‑time pricing and post‑task reward redistribution. This approach aligns agent incentives, fosters cooperation, and maintains autonomy. We evaluate Shapley‑Coop across two multi‑agent games and a software engineering simulation, demonstrating that it consistently enhances LLM agent collaboration and facilitates equitable credit assignment. These results highlight the effectiveness of Shapley‑Coop's pricing mechanisms in accurately reflecting individual contributions during task execution.

PaperID: 10108, https://arxiv.org/pdf/2506.07347.pdf

Authors: Armin Lederer, Erfaun Noorani, Andreas Krause

Title: Distributed Risk-Sensitive Safety Filters for Uncertain Discrete-Time Systems

Abstract:
Ensuring safety in multi‑agent systems is a significant challenge, particularly in settings where centralized coordination is impractical. In this work, we propose a novel risk‑sensitive safety filter for discrete‑time multi‑agent systems with uncertain dynamics that leverages control barrier functions (CBFs) defined through value functions. Our approach relies on centralized risk‑sensitive safety conditions based on exponential risk operators to ensure robustness against model uncertainties. We introduce a distributed formulation of the safety filter by deriving two alternative strategies: one based on worst‑case anticipation and another on proximity to a known safe policy. By allowing agents to switch between strategies, feasibility can be ensured. Through detailed numerical evaluations, we demonstrate the efficacy of our approach in maintaining safety without being overly conservative.

PaperID: 10109, https://arxiv.org/pdf/2506.07232.pdf

Authors: Xinran Li, Chenjia Bai, Zijian Li, Jiakun Zheng, Ting Xiao, Jun Zhang

Title: Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments

Abstract:
Large language models (LLMs) possess extensive knowledge bases and strong reasoning capabilities, making them promising tools for complex, multi‑agent planning in embodied environments. However, despite LLMs' advanced abilities and the sophisticated modular design of agentic methods, existing LLM‑based planning algorithms remain limited by weak adaptation capabilities to multi‑agent embodied scenarios. We address this limitation by introducing a framework that enables LLM agents to learn and evolve both before and during test time, equipping them with environment‑relevant knowledge for better planning and enhanced communication for improved cooperation. Inspired by centralized training with decentralized execution in multi‑agent reinforcement learning, we propose a Learn as Individuals, Evolve as a Team (LIET) paradigm for multi‑agent LLMs adaptation. At the individual level, LLM agents learn a local utility function from exploratory datasets to better comprehend the embodied environment, which is then queried during test time to support informed decision‑making. At the team level, LLM agents collaboratively and iteratively maintain and update a shared cooperation knowledge list based on new experiences, using it to guide more effective communication. By combining individual learning with team evolution, LIET enables comprehensive and flexible adaptation for LLM agents. Our experiments on Communicative Watch‑And‑Help and ThreeD‑World Multi‑Agent Transport benchmarks demonstrate that LIET, instantiated with both LLaMA and GPT‑4o, outperforms existing baselines and exhibits strong cooperative planning abilities.

PaperID: 10110, https://arxiv.org/pdf/2506.07223.pdf

Authors: Yangqing Zheng, Shunqi Mao, Dingxin Zhang, Weidong Cai

Title: LLM-Enhanced Rapid-Reflex Async-Reflect Embodied Agent for Real-Time Decision-Making in Dynamically Changing Environments

Abstract:
In the realm of embodied intelligence, the evolution of large language models (LLMs) has markedly enhanced agent decision making. Consequently, researchers have begun exploring agent performance in dynamically changing high‑risk scenarios, i.e., fire, flood, and wind scenarios in the HAZARD benchmark. Under these extreme conditions, the delay in decision making emerges as a crucial yet insufficiently studied issue. We propose a Time Conversion Mechanism (TCM) that translates inference delays in decision‑making into equivalent simulation frames, thus aligning cognitive and physical costs under a single FPS‑based metric. By extending HAZARD with Respond Latency (RL) and Latency‑to‑Action Ratio (LAR), we deliver a fully latency‑aware evaluation protocol. Moreover, we present the Rapid‑Reflex Async‑Reflect Agent (RRARA), which couples a lightweight LLM‑guided feedback module with a rule‑based agent to enable immediate reactive behaviors and asynchronous reflective refinements in situ. Experiments on HAZARD show that RRARA substantially outperforms existing baselines in latency‑sensitive scenarios.

PaperID: 10111, https://arxiv.org/pdf/2506.07116.pdf

Authors: Liyang Chen, Yujun Cai, Jieqiong Dong, Yiwei Wang

Title: BRIGHT+: Upgrading the BRIGHT Benchmark with MARCUS, a Multi-Agent RAG Clean-Up Suite

Abstract:
Retrieval‑Augmented Generation (RAG) systems require corpora that are both structurally clean and semantically coherent. BRIGHT is a recent and influential benchmark designed to evaluate complex multi‑hop retrieval across diverse, high‑reasoning domains. However, its practical effectiveness is limited by common web‑crawled artifacts ‑ such as content redundancy and semantic discontinuity ‑ that impair retrieval accuracy and downstream reasoning. Notably, we find that such issues are concentrated in seven StackExchange‑derived subdomains, while other domains (e.g., Coding and Theorem‑based content) remain relatively clean. In this study, we present MARCUS, a multi‑agent pipeline that leverages large language models (LLMs) to systematically clean and re‑chunk BRIGHT into a higher‑quality corpus: BRIGHT‑Plus. MARCUS applies dedicated agents for structural noise removal and semantic segmentation, preserving answer‑bearing spans while improving contextual integrity. Experimental evaluations demonstrate that BRIGHT‑Plus yields consistent and significant improvements in both retrieval accuracy and multi‑hop reasoning across a diverse set of retrievers. We release both the BRIGHT‑Plus corpus and the MARCUS pipeline to support future research on robust, reasoning‑centric retrieval.

PaperID: 10112, https://arxiv.org/pdf/2506.07102.pdf

Authors: Wei Huo, Changxin Liu, Kemi Ding, Karl Henrik Johansson, Ling Shi

Title: Decentralized Optimization with Amplified Privacy via Efficient Communication

Abstract:
Decentralized optimization is crucial for multi‑agent systems, with significant concerns about communication efficiency and privacy. This paper explores the role of efficient communication in decentralized stochastic gradient descent algorithms for enhancing privacy preservation. We develop a novel algorithm that incorporates two key features: random agent activation and sparsified communication. Utilizing differential privacy, we demonstrate that these features reduce noise without sacrificing privacy, thereby amplifying the privacy guarantee and improving accuracy. Additionally, we analyze the convergence and the privacy‑accuracy‑communication trade‑off of the proposed algorithm. Finally, we present experimental results to illustrate the effectiveness of our algorithm.

PaperID: 10113, https://arxiv.org/pdf/2506.07043.pdf

Authors: Haidi Wang, Yufan Yao, Haonan Song, Huimiao Wang, Xiaofeng Liu, Zhao Chen, Weiwei Chen, Weiduo Zhu, Zhongjun Li, Jinlong Yang

Title: Uni2D: A Universal Machine Learning Interatomic Potential for Two-Dimensional Materials

Abstract:
Accurate interatomic potentials (IAPs) are essential for modeling the potential energy surfaces (PES) that govern atomic interactions in materials. However, most existing IAPs are developed for bulk materials and often struggle to accurately and efficiently capture the diverse chemical environments of two‑dimensional (2D) materials, which limits large‑scale simulation and design of emerging 2D systems. To address this challenge, we develop Uni2D, an interatomic potential tailored for 2D materials. The Uni2D model is trained on a dataset comprising approximately 327,000 structure‑energy‑force‑stress mappings derived from about 20,000 distinct 2D materials, covering 89 chemical elements. The model demonstrates reliable predictive performance for energies, forces, and stresses, and demonstrates quantitatively robust accuracy in tasks such as structural relaxation, equation‑of‑state calculations, and molecular dynamics simulations, making the model suitable for high‑throughput screening of 2D materials. For derived properties, including elastic properties, lattice dynamics, and other screening‑related metrics, the model provides qualitative to semi‑quantitative predictions that remain useful for trend analysis and preliminary evaluation. To enhance usability, we further introduce an intelligent agent powered by a large language model (LLM), enabling automated workflows and natural language interaction for 2D materials simulations. Our work provides an efficient and accessible framework for high‑throughput screening and computational exploration of 2D materials.

PaperID: 10114, https://arxiv.org/pdf/2506.06923.pdf

Authors: Xutong Zhao, Tengyu Xu, Xuewei Wang, Zhengxing Chen, Di Jin, Liang Tan, Yen-Ting, Zishun Yu, Zhuokai Zhao, Yun He, Sinong Wang, Han Fang, Sarath Chandar, Chen Zhu

Title: Boosting LLM Reasoning via Spontaneous Self-Correction

Abstract:
While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self‑correction, which designs self‑improving loops to let the model correct its own mistakes. However, existing self‑correction approaches treat corrections as standalone post‑generation refinements, relying on extra prompt and system designs to elicit self‑corrections, instead of performing real‑time, spontaneous self‑corrections in a single pass. To address this, we propose SPOC, a spontaneous self‑correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi‑agent perspective by assigning dual roles ‑‑ solution proposer and verifier ‑‑ to the same model. We adopt a simple yet effective approach to generate synthetic data for fine‑tuning, enabling the model to develop capabilities for self‑verification and multi‑agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama‑3.1‑8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.

PaperID: 10115, https://arxiv.org/pdf/2506.06843.pdf

Authors: HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, Song Guo

Title: United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

Abstract:
Large Language Models (LLMs) exhibit a notable performance ceiling on complex, multi‑faceted tasks, as they often fail to integrate diverse information or adhere to multiple constraints. We posit that such limitation arises when the demands of a task exceed the LLM's effective cognitive load capacity. This interpretation draws a strong analogy to Cognitive Load Theory (CLT) in cognitive science, which explains similar performance boundaries in the human mind, and is further supported by emerging evidence that reveals LLMs have bounded working memory characteristics. Building upon this CLT‑grounded understanding, we introduce CoThinker, a novel LLM‑based multi‑agent framework designed to mitigate cognitive overload and enhance collaborative problem‑solving abilities. CoThinker operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. We empirically validate CoThinker on complex problem‑solving tasks and fabricated high cognitive load scenarios, demonstrating improvements over existing multi‑agent baselines in solution quality and efficiency. Our analysis reveals characteristic interaction patterns, providing insights into the emergence of collective cognition and effective load management, thus offering a principled approach to overcoming LLM performance ceilings.

PaperID: 10116, https://arxiv.org/pdf/2506.06740.pdf

Authors: Yigui Feng, Qinglin Wang, Ke Liu, Xinhai Chen, Bo Yang, Jie Liu

Title: AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method

Abstract:
Psychological counseling faces huge challenges due to the growing demand for mental health services and the shortage of trained professionals. Large language models (LLMs) have shown potential to assist psychological counseling, especially in empathy and emotional support. However, existing models lack a deep understanding of emotions and are unable to generate personalized treatment plans based on fine‑grained emotions. To address these shortcomings, we present AI PsyRoom, a multi‑agent simulation framework designed to enhance psychological counseling by generating empathetic and emotionally nuanced conversations. By leveraging fine‑grained emotion classification and a multi‑agent framework, we construct a multi‑agent PsyRoom A for dialogue reconstruction, generating a high‑quality dialogue dataset EmoPsy, which contains 35 sub‑emotions, 423 specific emotion scenarios, and 12,350 dialogues. We also propose PsyRoom B for generating personalized treatment plans. Quantitative evaluations demonstrate that AI PsyRoom significantly outperforms state‑of‑the‑art methods, achieving 18% improvement in problem orientation, 23% in expression, 24% in Empathy, and 16% in interactive communication quality. The datasets and models are publicly available, providing a foundation for advancing AI‑assisted psychological counseling research.

PaperID: 10117, https://arxiv.org/pdf/2506.06698.pdf

Authors: Yitao Liu, Chenglei Si, Karthik Narasimhan, Shunyu Yao

Title: Contextual Experience Replay for Self-Improvement of Language Agents

Abstract:
Large language model (LLM) agents have been applied to sequential decision‑making tasks such as web navigation, but without any environment‑specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment‑specific experiences. To address this, we propose Contextual Experience Replay (CER), a training‑free framework to enable efficient self‑improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision‑making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also gets a competitive average success rate of 36.7%, relatively improving the success rate of the GPT‑4o agent baseline by 51.0%. We also conduct a comprehensive analysis on it to prove its efficiency, validity and understand it better.

PaperID: 10118, https://arxiv.org/pdf/2506.06576.pdf

Authors: Yijia Shao, Humishka Zope, Yucheng Jiang, Jiaxin Pei, David Nguyen, Erik Brynjolfsson, Diyi Yang

Title: Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Abstract:
The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio‑enhanced mini‑interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor's ONET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation "Green Light" Zone, Automation "Red Light" Zone, R&D Opportunity Zone, Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate‑or‑not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information‑focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics.

PaperID: 10119, https://arxiv.org/pdf/2506.06574.pdf

Authors: Suhana Bedi, Iddah Mlauzi, Daniel Shin, Sanmi Koyejo, Nigam H. Shah

Title: The Optimization Paradox in Clinical AI Multi-Agent Systems

Abstract:
Multi‑agent artificial intelligence systems are increasingly deployed in clinical settings, yet the relationship between component‑level optimization and system‑wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC‑CDM dataset across four abdominal pathologies (appendicitis, pancreatitis, cholecystitis, diverticulitis), decomposing clinical diagnosis into information gathering, interpretation, and differential diagnosis. We evaluated single agent systems (one model performing all tasks) against multi‑agent systems (specialized models for each task) using comprehensive metrics spanning diagnostic outcomes, process adherence, and cost efficiency. Our results reveal a paradox: while multi‑agent systems generally outperformed single agents, the component‑optimized or Best of Breed system with superior components and excellent process metrics (85.5% information accuracy) significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi‑agent system). This finding underscores that successful integration of AI in healthcare requires not just component level optimization but also attention to information flow and compatibility between agents. Our findings highlight the need for end to end system validation rather than relying on component metrics alone.

PaperID: 10120, https://arxiv.org/pdf/2506.06565.pdf

Authors: Emilia Rivas, Sabrina Saika, Ahtesham Bakht, Aritran Piplai, Nathaniel D. Bastian, Ankit Shah

Title: Adapting Under Fire: Multi-Agent Reinforcement Learning for Adversarial Drift in Network Security

Abstract:
Evolving attacks are a critical challenge for the long‑term success of Network Intrusion Detection Systems (NIDS). The rise of these changing patterns has exposed the limitations of traditional network security methods. While signature‑based methods are used to detect different types of attacks, they often fail to detect unknown attacks. Moreover, the system requires frequent updates with new signatures as the attackers are constantly changing their tactics. In this paper, we design an environment where two agents improve their policies over time. The adversarial agent, referred to as the red agent, perturbs packets to evade the intrusion detection mechanism, whereas the blue agent learns new defensive policies using drift adaptation techniques to counter the attacks. Both agents adapt iteratively: the red agent responds to the evolving NIDS, while the blue agent adjusts to emerging attack patterns. By studying the model's learned policy, we offer concrete insights into drift adaptation techniques with high utility. Experiments show that the blue agent boosts model accuracy by 30% with just 2 to 3 adaptation steps using only 25 to 30 samples each.

PaperID: 10121, https://arxiv.org/pdf/2506.06366.pdf

Authors: Lin Chen, Yunke Zhang, Jie Feng, Haoye Chai, Honglin Zhang, Bingbing Fan, Yibo Ma, Shiyuan Zhang, Nian Li, Tianhui Liu, Nicholas Sukiennik, Keyu Zhao, Yu Li, Ziyi Liu, Fengli Xu, Yong Li

Title: AI Agent Behavioral Science

Abstract:
Recent advances in large language models (LLMs) have enabled the development of AI agents that exhibit increasingly human‑like behaviors, including planning, adaptation, and social dynamics across diverse, interactive, and open‑ended scenarios. These behaviors are not solely the product of the internal architectures of the underlying models, but emerge from their integration into agentic systems operating within specific contexts, where environmental factors, social cues, and interaction feedbacks shape behavior over time. This evolution necessitates a new scientific perspective: AI Agent Behavioral Science. Rather than focusing only on internal mechanisms, this perspective emphasizes the systematic observation of behavior, design of interventions to test hypotheses, and theory‑guided interpretation of how AI agents act, adapt, and interact over time. We systematize a growing body of research across individual agent, multi‑agent, and human‑agent interaction settings, and further demonstrate how this perspective informs responsible AI by treating fairness, safety, interpretability, accountability, and privacy as behavioral properties. By unifying recent findings and laying out future directions, we position AI Agent Behavioral Science as a necessary complement to traditional model‑centric approaches, providing essential tools for understanding, evaluating, and governing the real‑world behavior of increasingly autonomous AI systems.

PaperID: 10122, https://arxiv.org/pdf/2506.06363.pdf

Authors: Thang D. Pham, Aditya Tanikanti, Murat Keçeli

Title: ChemGraph: An Agentic Framework for Computational Chemistry Workflows

Abstract:
Atomistic simulations are essential tools in chemistry and materials science, accelerating the discovery of novel catalysts, energy storage materials, and pharmaceuticals. However, running these simulations remains challenging due to the wide range of computational methods, diverse software ecosystems, and the need for expert knowledge and manual effort for the setup, execution, and validation stages. In this work, we present ChemGraph, an agentic framework powered by artificial intelligence and state‑of‑the‑art simulation tools to streamline and automate computational chemistry and materials science workflows. ChemGraph leverages graph neural network‑based foundation models for accurate yet computationally efficient calculations and large language models (LLMs) for natural language understanding, task planning, and scientific reasoning to provide an intuitive and interactive interface. Users can perform tasks such as molecular structure generation, single‑point energy, geometry optimization, vibrational analysis, and thermochemistry calculations with methods ranging from tight‑binding and machine learning interatomic potentials to density functional theory or wave function theory‑based methods. We evaluate ChemGraph across 13 benchmark tasks and demonstrate that smaller LLMs (GPT‑4o‑mini, Claude‑3.5‑haiku, Qwen2.5‑14B) perform well on simple workflows, while more complex tasks benefit from using larger models like GPT‑4o. Importantly, we show that decomposing complex tasks into smaller subtasks through a multi‑agent framework enables smaller LLM models to match or exceed GPT‑4o's performance in specific scenarios.

PaperID: 10123, https://arxiv.org/pdf/2506.06325.pdf

Authors: Viorica Rozina Chifu, Tudor Cioara, Cristina Bianca Pop, Ionut Anghel

Title: Evolutionary model for energy trading in community microgrids using Hawk-Dove strategies

Abstract:
This paper proposes a decentralized model of energy cooperation between microgrids, in which decisions are made locally, at the level of the microgrid community. Each microgrid is modeled as an autonomous agent that adopts a Hawk or Dove strategy, depending on the level of energy stored in the battery and its role in the energy trading process. The interactions between selling and buying microgrids are modeled through an evolutionary algorithm. An individual in the algorithm population is represented as an energy trading matrix that encodes the amounts of energy traded between the selling and buying microgrids. The population evolution is achieved by recombination and mutation operators. Recombination uses a specialized operator for matrix structures, and mutation is applied to the matrix elements according to a Gaussian distribution. The evaluation of an individual is made with a multi‑criteria fitness function that considers the seller profit, the degree of energy stability at the community level, penalties for energy imbalance at the community level and for the degradation of microgrids batteries. The method was tested on a simulated scenario with 100 microgrids, each with its own selling and buying thresholds, to reflect a realistic environment with variable storage characteristics of microgrids batteries. By applying the algorithm on this scenario, 95 out of the 100 microgrids reached a stable energy state. This result confirms the effectiveness of the proposed model in achieving energy balance both at the individual level, for each microgrid, and at the level of the entire community.

PaperID: 10124, https://arxiv.org/pdf/2506.06299.pdf

Authors: Daniel Thilo Schroeder, Meeyoung Cha, Andrea Baronchelli, Nick Bostrom, Nicholas A. Christakis, David Garcia, Amit Goldenberg, Yara Kyrychenko, Kevin Leyton-Brown, Nina Lutz, Gary Marcus, Filippo Menczer, Gordon Pennycook, David G. Rand, Maria Ressa, Frank Schweitzer, Dawn Song, Christopher Summerfield, Audrey Tang, Jay J. Van Bavel, Sander van der Linden, Jonas R. Kunst

Title: How malicious AI swarms can threaten democracy: The fusion of agentic AI and LLMs marks a new frontier in information warfare

Abstract:
Advances in AI offer the prospect of manipulating beliefs and behaviors on a population‑wide level. Large language models and autonomous agents now let influence campaigns reach unprecedented scale and precision. Generative tools can expand propaganda output without sacrificing credibility and inexpensively create falsehoods that are rated as more human‑like than those written by humans. Techniques meant to refine AI reasoning, such as chain‑of‑thought prompting, can just as effectively be used to generate more convincing falsehoods. Enabled by these capabilities, a disruptive threat is emerging: swarms of collaborative, malicious AI agents. Fusing LLM reasoning with multi‑agent architectures, these systems are capable of coordinating autonomously, infiltrating communities, and fabricating consensus efficiently. By adaptively mimicking human social dynamics, they threaten democracy. Because the resulting harms stem from design, commercial incentives, and governance, we prioritize interventions at multiple leverage points, focusing on pragmatic mechanisms over voluntary compliance.

PaperID: 10125, https://arxiv.org/pdf/2506.06291.pdf

Authors: Xiaoke Wang, Batuhan Altundas, Zhaoxin Li, Aaron Zhao, Matthew Gombolay

Title: Improvement of Optimization using Learning Based Models in Mixed Integer Linear Programming Tasks

Abstract:
Mixed Integer Linear Programs (MILPs) are essential tools for solving planning and scheduling problems across critical industries such as construction, manufacturing, and logistics. However, their widespread adoption is limited by long computational times, especially in large‑scale, real‑time scenarios. To address this, we present a learning‑based framework that leverages Behavior Cloning (BC) and Reinforcement Learning (RL) to train Graph Neural Networks (GNNs), producing high‑quality initial solutions for warm‑starting MILP solvers in Multi‑Agent Task Allocation and Scheduling Problems. Experimental results demonstrate that our method reduces optimization time and variance compared to traditional techniques while maintaining solution quality and feasibility.

PaperID: 10126, https://arxiv.org/pdf/2506.06254.pdf

Authors: Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

Title: PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Abstract:
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one‑size‑fits‑all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components ‑ a personalized memory module that includes episodic and semantic memory mechanisms; a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test‑time user‑preference alignment strategy that simulate the latest n interactions to optimize the persona prompt, ensuring real‑time user preference alignment through textual loss feedback between simulated and ground‑truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test‑time real‑world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

PaperID: 10127, https://arxiv.org/pdf/2506.06179.pdf

Authors: Muhammed Ustaomeroglu, Guannan Qu

Title: A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization

Abstract:
Self‑attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self‑attention through the lens of interacting entities, ranging from agents in multi‑agent reinforcement learning to alleles in genetic sequences, and show that a single layer linear self‑attention can efficiently represent, learn, and generalize functions capturing pairwise interactions, including out‑of‑distribution scenarios. Our analysis reveals that self‑attention acts as a mutual interaction learner under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real‑world domains. In addition, we validate our theoretical insights through experiments demonstrating that self‑attention learns interaction functions and generalizes across both population distributions and out‑of‑distribution scenarios. Building on our theories, we introduce HyperFeatureAttention, a novel neural network module designed to learn couplings of different feature‑level interactions between entities. Furthermore, we propose HyperAttention, a new module that extends beyond pairwise interactions to capture multi‑entity dependencies, such as three‑way, four‑way, or general n‑way interactions.

PaperID: 10128, https://arxiv.org/pdf/2506.06175.pdf

Authors: James Ford, Anthony Rios

Title: Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Abstract:
Large language models can translate natural‑language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine‑tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single‑prompt design. To explore this, we propose a lightweight multi‑agent pipeline that separates drafting, execution, repair, and judgment, using only an off‑the‑shelf GPT‑4o‑mini model. On the \textscText2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine‑tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the \textscChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM‑based accessibility audit shows that only 33.3% (\textscText2Chart31) and 7.2% (\textscChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.

PaperID: 10129, https://arxiv.org/pdf/2506.06094.pdf

Authors: Elim Kwan, Rehman Qureshi, Liam Fletcher, Colin Laganier, Victoria Nockles, Richard Walters

Title: Onboard Mission Replanning for Adaptive Cooperative Multi-Robot Systems

Abstract:
Cooperative autonomous robotic systems have significant potential for executing complex multi‑task missions across space, air, ground, and maritime domains. But they commonly operate in remote, dynamic and hazardous environments, requiring rapid in‑mission adaptation without reliance on fragile or slow communication links to centralised compute. Fast, on‑board replanning algorithms are therefore needed to enhance resilience. Reinforcement Learning shows strong promise for efficiently solving mission planning tasks when formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1) are unsuitable for replanning, where agents do not start at a single location; 2) do not allow cooperation between agents; 3) are unable to model tasks with variable durations; or 4) lack practical considerations for on‑board deployment. Here we define the Cooperative Mission Replanning Problem as a novel variant of multiple TSP with adaptations to overcome these issues, and develop a new encoder/decoder‑based model using Graph Attention Networks and Attention Models to solve it effectively and efficiently. Using a simple example of cooperative drones, we show our replanner consistently (90% of the time) maintains performance within 10% of the state‑of‑the‑art LKH3 heuristic solver, whilst running 85‑370 times faster on a Raspberry Pi. This work paves the way for increased resilience in autonomous multi‑agent systems.

PaperID: 10130, https://arxiv.org/pdf/2506.06066.pdf

Authors: Ruochen Ji, Lyu Tiangang

Title: Conversational Interfaces for Parametric Conceptual Architectural Design: Integrating Mixed Reality with LLM-driven Interaction

Abstract:
Mixed reality (MR) environments offer embodied spatial interaction, providing intuitive 3D manipulation capabilities that enhance the conceptual design process. Parametric modeling, a powerful and advanced architectural design method, enables the generation of complex, optimized geometries. However, its integration into MR environments remains limited due to precision constraints and unsuitable input modalities. Existing MR tools prioritize spatial interaction but lack the control and expressiveness required for parametric workflows, particularly for designers without formal programming backgrounds. We address this gap by introducing a novel conversational MR interface that combines speech input, gesture recognition, and a multi‑agent large language model (LLM) system to support intuitive parametric modeling. Our system dynamically manages parameter states, resolves ambiguous commands through conversation and contextual prompting, and enables real‑time model manipulation within immersive environments. We demonstrate how this approach reduces cognitive and operational barriers in early‑stage design tasks, allowing users to refine and explore their design space. This work expands the role of MR to a generative design platform, supporting programmatic thinking in design tasks through natural, embodied interaction.

PaperID: 10131, https://arxiv.org/pdf/2506.06032.pdf

Authors: Edward Hughes, Tina O. Zhu, Martin J. Chadwick, Raphael Koster, Antonio García Castañeda, Charles Beattie, Thore Graepel, Matthew M. Botvinick, Joel Z. Leibo

Title: Modeling human reputation-seeking behavior in a spatio-temporally complex public good provision game

Abstract:
Multi‑agent reinforcement learning algorithms are useful for simulating social behavior in settings that are too complex for other theoretical approaches like game theory. However, they have not yet been empirically supported by laboratory experiments with real human participants. In this work we demonstrate how multi‑agent reinforcement learning can model group behavior in a spatially and temporally complex public good provision game called Clean Up. We show that human groups succeed in Clean Up when they can see who is who and track reputations over time but fail under conditions of anonymity. A new multi‑agent reinforcement learning model of reputation‑based cooperation demonstrates the same difference between identifiable and anonymous conditions. Furthermore, both human groups and artificial agent groups solve the problem via turn‑taking despite other options being available. Our results highlight the benefits of using multi‑agent reinforcement learning to model human social behavior in complex environments.

PaperID: 10132, https://arxiv.org/pdf/2506.06017.pdf

Authors: Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Title: AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

Abstract:
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human‑designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource‑intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co‑optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high‑quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low‑cost evaluation. Guiding the entire process is a hierarchical MCTS strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Our framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.

PaperID: 10133, https://arxiv.org/pdf/2506.05981.pdf

Authors: Qingbin Zeng, Ruotong Zhao, Jinzhu Mao, Haoyang Li, Fengli Xu, Yong Li

Title: CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents

Abstract:
Modeling urban crime is an important yet challenging task that requires understanding the subtle visual, social, and cultural cues embedded in urban environments. Previous work has mainly focused on rule‑based agent‑based modeling (ABM) and deep learning methods. ABMs offer interpretability of internal mechanisms but exhibit limited predictive accuracy. In contrast, deep learning methods are often effective in prediction but are less interpretable and require extensive training data. Moreover, both lines of work lack the cognitive flexibility to adapt to changing environments. Leveraging the capabilities of large language models (LLMs), we propose CrimeMind, a novel LLM‑driven ABM framework for simulating urban crime within a multi‑modal urban context. A key innovation of our design is the integration of the Routine Activity Theory (RAT) into the agentic workflow of CrimeMind, enabling it to process rich multi‑modal urban features and reason about criminal behavior. However, RAT requires LLM agents to infer subtle cues in evaluating environmental safety as part of assessing guardianship, which can be challenging for LLMs. To address this, we collect a small‑scale human‑annotated dataset and align CrimeMind's perception with human judgment via a training‑free textual gradient method. Experiments across four major U.S. cities demonstrate that CrimeMind outperforms both traditional ABMs and deep learning baselines in crime hotspot prediction and spatial distribution accuracy, achieving up to a 24% improvement over the strongest baseline. Furthermore, we conduct counterfactual simulations of external incidents and policy interventions and it successfully captures the expected changes in crime patterns, demonstrating its ability to reflect counterfactual scenarios. Overall, CrimeMind enables fine‑grained modeling of individual behaviors and facilitates evaluation of real‑world interventions.

PaperID: 10134, https://arxiv.org/pdf/2506.05894.pdf

Authors: Philipp Plank, Yufei Zhang

Title: Policy Optimization for Continuous-time Linear-Quadratic Graphon Mean Field Games

Abstract:
Multi‑agent reinforcement learning, despite its popularity and empirical success, faces significant scalability challenges in large‑population dynamic games. Graphon mean field games (GMFGs) offer a principled framework for approximating such games while capturing heterogeneity among players. In this paper, we propose and analyze a policy optimization framework for continuous‑time, finite‑horizon linear‑quadratic GMFGs. Exploiting the structural properties of GMFGs, we design an efficient policy parameterization in which each player's policy is represented as an affine function of their private state, with a shared slope function and player‑specific intercepts. We develop a bilevel optimization algorithm that alternates between policy gradient updates for best‑response computation under a fixed population distribution, and distribution updates using the resulting policies. We prove linear convergence of the policy gradient steps to best‑response policies and establish global convergence of the overall algorithm to the Nash equilibrium. The analysis relies on novel landscape characterizations over infinite‑dimensional policy spaces. Numerical experiments demonstrate the convergence and robustness of the proposed algorithm under varying graphon structures, noise levels, and action frequencies.

PaperID: 10135, https://arxiv.org/pdf/2506.05813.pdf

Authors: Ye Bai, Minghan Wang, Thuy-Trang Vu

Title: MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

Abstract:
Table‑based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single‑pass inference. Existing approaches, such as Chain‑of‑Thought reasoning and question decomposition, lack error detection mechanisms and discard problem‑solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi‑agent Adaptive Planning with Long‑term mEmory), a novel framework that mimics human problem‑solving through specialized cognitive agents working in a feedback‑driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long‑term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state‑of‑the‑art performance across multiple LLM backbones.

PaperID: 10136, https://arxiv.org/pdf/2506.05739.pdf

Authors: Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, Peng Liu

Title: To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt

Abstract:
LLM agents are widely used as agents for customer support, content generation, and code assistance. However, they are vulnerable to prompt injection attacks, where adversarial inputs manipulate the model's behavior. Traditional defenses like input sanitization, guard models, and guardrails are either cumbersome or ineffective. In this paper, we propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling (PPA), which protects against prompt injection with near‑zero overhead. The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt. By dynamically varying the structure of system prompts, PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance. We conducted experiments to evaluate the effectiveness of PPA against existing attacks and compared it with other defense methods.

PaperID: 10137, https://arxiv.org/pdf/2506.05606.pdf

Authors: Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Abstract:
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high‑quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine‑grained web actions, and self‑reported just‑in‑time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.

PaperID: 10138, https://arxiv.org/pdf/2506.05576.pdf

Authors: Valerija Holomjova, Jamie Grech, Dewei Yi, Bruno Yun, Andrew Starkey, Pascal Meißner

Title: TD-TOG Dataset: Benchmarking Zero-Shot and One-Shot Task-Oriented Grasping for Object Generalization

Abstract:
Task‑oriented grasping (TOG) is an essential preliminary step for robotic task execution, which involves predicting grasps on regions of target objects that facilitate intended tasks. Existing literature reveals there is a limited availability of TOG datasets for training and benchmarking despite large demand, which are often synthetic or have artifacts in mask annotations that hinder model performance. Moreover, TOG solutions often require affordance masks, grasps, and object masks for training, however, existing datasets typically provide only a subset of these annotations. To address these limitations, we introduce the Top‑down Task‑oriented Grasping (TD‑TOG) dataset, designed to train and evaluate TOG solutions. TD‑TOG comprises 1,449 real‑world RGB‑D scenes including 30 object categories and 120 subcategories, with hand‑annotated object masks, affordances, and planar rectangular grasps. It also features a test set for a novel challenge that assesses a TOG solution's ability to distinguish between object subcategories. To contribute to the demand for TOG solutions that can adapt and manipulate previously unseen objects without re‑training, we propose a novel TOG framework, Binary‑TOG. Binary‑TOG uses zero‑shot for object recognition, and one‑shot learning for affordance recognition. Zero‑shot learning enables Binary‑TOG to identify objects in multi‑object scenes through textual prompts, eliminating the need for visual references. In multi‑object settings, Binary‑TOG achieves an average task‑oriented grasp accuracy of 68.9%. Lastly, this paper contributes a comparative analysis between one‑shot and zero‑shot learning for object generalization in TOG to be used in the development of future TOG solutions.

PaperID: 10139, https://arxiv.org/pdf/2506.05538.pdf

Authors: Arnesh Batra, Anushk Kumar, Jashn Khemani, Arush Gumber, Arhan Jain, Somil Gupta

Title: SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Abstract:
The rapid advancement of deep generative models has significantly improved the realism of synthetic media, presenting both opportunities and security challenges. While deepfake technology has valuable applications in entertainment and accessibility, it has emerged as a potent vector for misinformation campaigns, particularly on social media. Existing detection frameworks struggle to distinguish between benign and adversarially generated deepfakes engineered to manipulate public perception. To address this challenge, we introduce SocialDF, a curated dataset reflecting real‑world deepfake challenges on social media platforms. This dataset encompasses high‑fidelity deepfakes sourced from various online ecosystems, ensuring broad coverage of manipulative techniques. We propose a novel LLM‑based multi‑factor detection approach that combines facial recognition, automated speech transcription, and a multi‑agent LLM pipeline to cross‑verify audio‑visual cues. Our methodology emphasizes robust, multi‑modal verification techniques that incorporate linguistic, behavioral, and contextual analysis to effectively discern synthetic media from authentic content.

PaperID: 10140, https://arxiv.org/pdf/2506.05527.pdf

Authors: Caroline Wang, Di Yang Shi, Elad Liebman, Ishan Durugkar, Arrasy Rahman, Peter Stone

Title: Sequence Modeling for N-Agent Ad Hoc Teamwork

Abstract:
N‑agent ad hoc teamwork (NAHT) is a newly introduced challenge in multi‑agent reinforcement learning, where controlled subteams of varying sizes must dynamically collaborate with varying numbers and types of unknown teammates without pre‑coordination. The existing learning algorithm (POAM) considers only independent learning for its flexibility in dealing with a changing number of agents. However, independent learning fails to fully capture the inter‑agent dynamics essential for effective collaboration. Based on our observation that transformers deal effectively with sequences with varying lengths and have been shown to be highly effective for a variety of machine learning problems, this work introduces a centralized, transformer‑based method for N‑agent ad hoc teamwork. Our proposed approach incorporates historical observations and actions of all controlled agents, enabling optimal responses to diverse and unseen teammates in partially observable environments. Empirical evaluation on a StarCraft II task demonstrates that MAT‑NAHT outperforms POAM, achieving superior sample efficiency and generalization, without auxiliary agent‑modeling objectives.

PaperID: 10141, https://arxiv.org/pdf/2506.05437.pdf

Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron

Title: A MARL-based Approach for Easing MAS Organization Engineering

Abstract:
Multi‑Agent Systems (MAS) have been successfully applied in industry for their ability to address complex, distributed problems, especially in IoT‑based systems. Their efficiency in achieving given objectives and meeting design requirements is strongly dependent on the MAS organization during the engineering process of an application‑specific MAS. To design a MAS that can achieve given goals, available methods rely on the designer's knowledge of the deployment environment. However, high complexity and low readability in some deployment environments make the application of these methods to be costly or raise safety concerns. In order to ease the MAS organization design regarding those concerns, we introduce an original Assisted MAS Organization Engineering Approach (AOMEA). AOMEA relies on combining a Multi‑Agent Reinforcement Learning (MARL) process with an organizational model to suggest relevant organizational specifications to help in MAS engineering.

PaperID: 10142, https://arxiv.org/pdf/2506.05431.pdf

Authors: Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar

Title: Robustness Evaluation for Video Models with Reinforcement Learning

Abstract:
Evaluating the robustness of Video classification models is very challenging, specifically when compared to image‑based models. With their increased temporal dimension, there is a significant increase in complexity and computational cost. One of the key challenges is to keep the perturbations to a minimum to induce misclassification. In this work, we propose a multi‑agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video's sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms the state‑of‑the‑art solutions on the Lp metric and the average queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate 4 popular models for video action recognition on two popular datasets, HMDB‑51 and UCF‑101.

PaperID: 10143, https://arxiv.org/pdf/2506.05370.pdf

Authors: Kristy Wedel

Title: Contextual Memory Intelligence -- A Foundational Paradigm for Human-AI Collaboration and Reflective Generative AI Systems

Abstract:
A critical challenge remains unresolved as generative AI systems are quickly implemented in various organizational settings. Despite significant advances in memory components such as RAG, vector stores, and LLM agents, these systems still have substantial memory limitations. Gen AI workflows rarely store or reflect on the full context in which decisions are made. This leads to repeated errors and a general lack of clarity. This paper introduces Contextual Memory Intelligence (CMI) as a new foundational paradigm for building intelligent systems. It repositions memory as an adaptive infrastructure necessary for longitudinal coherence, explainability, and responsible decision‑making rather than passive data. Drawing on cognitive science, organizational theory, human‑computer interaction, and AI governance, CMI formalizes the structured capture, inference, and regeneration of context as a fundamental system capability. The Insight Layer is presented in this paper to operationalize this vision. This modular architecture uses human‑in‑the‑loop reflection, drift detection, and rationale preservation to incorporate contextual memory into systems. The paper argues that CMI allows systems to reason with data, history, judgment, and changing context, thereby addressing a foundational blind spot in current AI architectures and governance efforts. A framework for creating intelligent systems that are effective, reflective, auditable, and socially responsible is presented through CMI. This enhances human‑AI collaboration, generative AI design, and the resilience of the institutions.

PaperID: 10144, https://arxiv.org/pdf/2506.05364.pdf

Authors: Anjana Sarkar, Soumyendu Sarkar

Title: Survey of LLM Agent Communication with MCP: A Software Design Pattern Centric Review

Abstract:
This survey investigates how classical software design patterns can enhance the reliability and scalability of communication in Large Language Model (LLM)‑driven agentic AI systems, focusing particularly on the Model Context Protocol (MCP). It examines the foundational architectures of LLM‑based agents and their evolution from isolated operation to sophisticated, multi‑agent collaboration, addressing key communication hurdles that arise in this transition. The study revisits well‑established patterns, including Mediator, Observer, Publish‑Subscribe, and Broker, and analyzes their relevance in structuring agent interactions within MCP‑compliant frameworks. To clarify these dynamics, the article provides conceptual schematics and formal models that map out communication pathways and optimize data flow. It further explores architectural variations suited to different degrees of agent autonomy and system complexity. Real‑world applications in domains such as real‑time financial processing and investment banking are discussed, illustrating how these patterns and MCP can meet specific operational demands. The article concludes by outlining open challenges, potential security risks, and promising directions for advancing robust, interoperable, and scalable multi‑agent LLM ecosystems.

PaperID: 10145, https://arxiv.org/pdf/2506.05309.pdf

Authors: Niv Eckhaus, Uri Berger, Gabriel Stanovsky

Title: Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Abstract:
LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real‑world settings are asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns. In this work, we develop an adaptive asynchronous LLM agent consisting of two modules: a generator that decides what to say, and a scheduler that decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, where our agent plays with human participants. Overall, our agent performs on par with human players, both in game performance metrics and in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We make all of our code and data publicly available. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.

PaperID: 10146, https://arxiv.org/pdf/2506.05305.pdf

Authors: Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M. Homan, Wei Wei

Title: ProRefine: Inference-Time Prompt Refinement with Textual Feedback

Abstract:
Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting‑edge commercial applications, and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts used to provide the roles models play in such workflows. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub‑optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this important problem of inference‑time prompt optimization, we introduce ProRefine, an innovative inference‑time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi‑step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero‑shot Chain‑of‑Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts. This highlights its potential for building more cost‑effective and powerful hybrid AI systems, thereby democratizing access to high‑performing AI.

PaperID: 10147, https://arxiv.org/pdf/2506.05265.pdf

Authors: Mohammed Almutairi

Title: Teaming in the AI Era: AI-Augmented Frameworks for Forming, Simulating, and Optimizing Human Teams

Abstract:
Effective teamwork is essential across diverse domains. During the team formation stage, a key challenge is forming teams that effectively balance user preferences with task objectives to enhance overall team satisfaction. In the team performing stage, maintaining cohesion and engagement is critical for sustaining high team performance. However, existing computational tools and algorithms for team optimization often rely on static data inputs, narrow algorithmic objectives, or solutions tailored for specific contexts, failing to account for the dynamic interplay of team members personalities, evolving goals, and changing individual preferences. Therefore, teams may encounter member dissatisfaction, as purely algorithmic assignments can reduce members commitment to team goals or experience suboptimal engagement due to the absence of timely, personalized guidance to help members adjust their behaviors and interactions as team dynamics evolve. Ultimately, these challenges can lead to reduced overall team performance. My Ph.D. dissertation aims to develop AI‑augmented team optimization frameworks and practical systems that enhance team satisfaction, engagement, and performance. First, I propose a team formation framework that leverages a multi‑armed bandit algorithm to iteratively refine team composition based on user preferences, ensuring alignment between individual needs and collective team goals to enhance team satisfaction. Second, I introduce tAIfa (Team AI Feedback Assistant), an AI‑powered system that utilizes large language models (LLMs) to deliver immediate, personalized feedback to both teams and individual members, enhancing cohesion and engagement. Finally, I present PuppeteerLLM, an LLM‑based simulation framework that simulates multi‑agent teams to model complex team dynamics within realistic environments, incorporating task‑driven collaboration and long‑term coordination.

PaperID: 10148, https://arxiv.org/pdf/2506.05236.pdf

Authors: Maxime Toquebiau, Jae-Yun Jun, Faïz Benamar, Nicolas Bredeche

Title: Towards Language-Augmented Multi-Agent Deep Reinforcement Learning

Abstract:
Most prior works on communication in multi‑agent reinforcement learning have focused on emergent communication, which often results in inefficient and non‑interpretable systems. Inspired by the role of language in natural intelligence, we investigate how grounding agents in a human‑defined language can improve the learning and coordination of embodied agents. We propose a framework in which agents are trained not only to act but also to produce and interpret natural language descriptions of their observations. This language‑augmented learning serves a dual role: enabling efficient and interpretable communication between agents, and guiding representation learning. We demonstrate that language‑augmented agents outperform emergent communication baselines across various tasks. Our analysis reveals that language grounding leads to more informative internal representations, better generalization to new partners, and improved capability for human‑agent interaction. These findings demonstrate the effectiveness of integrating structured language into multi‑agent learning and open avenues for more interpretable and capable multi‑agent systems.

PaperID: 10149, https://arxiv.org/pdf/2506.05172.pdf

Authors: Weichen Shi

Title: A Framework for Ethical Judgment of Smart City Applications

Abstract:
As modern cities increasingly adopt a variety of sensors and Internet of Things (IoT) technologies to collect and analyze data about residents, environments, and public services, they are fostering greater interactions among smart city applications, residents, governments, and businesses. This trend makes it essential for regulators to focus on these interactions to manage smart city practices effectively and prevent unethical outcomes. To facilitate ethical analysis for smart city applications, this paper introduces a judgment framework that examines various scenarios where ethical issues may arise. Employing a multi‑agent approach, the framework incorporates diverse social entities and applies logic‑based ethical rules to identify potential violations. Through a rights‑based analysis, we developed a set of 13 ethical principles and rules to guide ethical practices in smart cities. We utilized two specification languages, Prototype Verification System (PVS) and Alloy, to model our multi‑agent system. Our analysis suggests that Alloy may be more efficient for formalizing smart cities and conducting ethical rule checks, particularly with the assistance of a human evaluator. Simulations of a real‑world smart city application demonstrate that our ethical judgment framework effectively detects unethical outcomes and can be extended for practical use.

PaperID: 10150, https://arxiv.org/pdf/2506.05079.pdf

Authors: Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Yi Zhao, Chunyang Chen, Shaomin Zhu, Zhenyu Chen

Title: Scenario-Guided LLM-based Mobile App GUI Testing

Abstract:
The assurance of mobile app GUI has become increasingly important, as the GUI serves as the primary medium of interaction between users and apps. Although numerous automated GUI testing approaches have been developed with diverse strategies, a substantial gap remains between these approaches and the underlying app business logic. Most existing approaches focus on general exploration rather than the completion of specific testing scenarios, often resulting in missed coverage of critical functionalities. Inspired by the manual testing process, which treats business logic, driven testing scenarios as the fundamental unit of testing, this paper introduces an approach that leverages large language models (LLMs) to comprehend the semantics expressed in app GUIs and their contextual relevance to given testing scenarios. Building upon this capability, we propose ScenGen, a novel scenario‑guided LLM‑based GUI testing framework that employs a multi‑agent collaboration mechanism to simulate and automate the phases of manual testing. ScenGen integrates five agents. The Observer perceives the app GUI state by extracting and structuring GUI widgets and layouts, thereby interpreting the semantic information presented in the GUI. This information is then passed to the Decider, which makes scenario‑driven decisions with the guidance of LLMs to identify target widgets and determine appropriate actions toward fulfilling specific testing goals. The Executor executes the decided operations on the app, while the Supervisor verifies whether the execution results align with the intended testing scenario completion, ensuring traceability and consistency in test generation and execution. Finally, the Recorder records the corresponding GUI operations into the context memory as a knowledge base for subsequent decision‑making and concurrently monitors runtime bug occurrences.

PaperID: 10151, https://arxiv.org/pdf/2506.04907.pdf

Authors: Alex Pan, Mary-Anne Williams

Title: Context Is Not Comprehension

Abstract:
The dominant way of judging Large Language Models (LLMs) has been to ask how well they can recall explicit facts from very long inputs. While today's best models achieve near perfect recall, this masks a harder skill: performing multi‑step reasoning and tracking intermediate state that never appears verbatim. We introduce Verbose ListOps (VLO), a benchmark that embeds deterministic ListOps computations inside narrative camouflage and, crucially, allows step‑level evaluation of every intermediate result. Experiments show that models which solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens. By exposing where a model's reasoning chain first diverges, VLO moves assessment beyond sheer context length and toward genuine comprehension. VLO's generation pipeline is task‑agnostic: it can weave any deterministically verifiable reasoning schema ‑‑ arithmetic, symbolic, abductive, inductive or defeasible ‑‑ into narrative form. This makes VLO a reusable test‑bed for the next wave of reasoning‑centric model designs, not merely those with step‑explicit scaffolds.

PaperID: 10152, https://arxiv.org/pdf/2506.04849.pdf

Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Paul Théron, Louis-Marie Traonouez

Title: Towards a Multi-Agent Simulation of Cyber-attackers and Cyber-defenders Battles

Abstract:
As cyber‑attacks show to be more and more complex and coordinated, cyber‑defenders strategy through multi‑agent approaches could be key to tackle against cyber‑attacks as close as entry points in a networked system. This paper presents a Markovian modeling and implementation through a simulator of fighting cyber‑attacker agents and cyber‑defender agents deployed on host network nodes. It aims to provide an experimental framework to implement realistically based coordinated cyber‑attack scenarios while assessing cyber‑defenders dynamic organizations. We abstracted network nodes by sets of properties including agents' ones. Actions applied by agents model how the network reacts depending in a given state and what properties are to change. Collective choice of the actions brings the whole environment closer or farther from respective cyber‑attackers and cyber‑defenders goals. Using the simulator, we implemented a realistically inspired scenario with several behavior implementation approaches for cyber‑defenders and cyber‑attackers.

PaperID: 10153, https://arxiv.org/pdf/2506.04821.pdf

Authors: Zhen Hao Wong, Jingwen Deng, Runming He, Zirong Chen, Qijie You, Hejun Dong, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang

Title: LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning

Abstract:
Large language models (LLMs) excel at many supervised tasks but often struggle with structured reasoning in unfamiliar settings. This discrepancy suggests that standard fine‑tuning pipelines may instill narrow, domain‑specific heuristics rather than fostering general‑purpose thinking strategies. In this work, we propose a "play to learn" framework that fine‑tunes LLMs through reinforcement learning on a suite of seven custom logic puzzles, each designed to cultivate distinct reasoning skills such as constraint propagation, spatial consistency, and symbolic deduction. Using a reinforcement learning setup with verifiable rewards, models receive binary feedback based on puzzle correctness, encouraging iterative, hypothesis‑driven problem solving. We demonstrate that this training approach significantly improves out‑of‑distribution performance on a range of mathematical benchmarks, especially for mid‑difficulty problems that require multi‑step reasoning. Analyses across problem categories and difficulty levels reveal that puzzle training promotes transferable reasoning routines, strengthening algebraic manipulation, geometric inference, and combinatorial logic, while offering limited gains on rote or highly specialized tasks. These findings show that reinforcement learning over logic puzzles reshapes the internal reasoning of LLMs, enabling more robust and compositional generalization without relying on task‑specific symbolic tools.

PaperID: 10154, https://arxiv.org/pdf/2506.04651.pdf

Authors: Nikolas Belle, Dakota Barnes, Alfonso Amayuelas, Ivan Bercovich, Xin Eric Wang, William Wang

Title: Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Abstract:
We address the long‑horizon gap in large language model (LLM) agents by enabling them to sustain coherent strategies in adversarial, stochastic environments. Settlers of Catan provides a challenging benchmark: success depends on balancing short‑ and long‑term goals amid randomness, trading, expansion, and blocking. Prompt‑centric LLM agents (e.g., ReAct, Reflexion) must re‑interpret large, evolving game states each turn, quickly saturating context windows and losing strategic consistency. We propose HexMachina, a continual learning multi‑agent system that separates environment discovery (inducing an adapter layer without documentation) from strategy improvement (evolving a compiled player through code refinement and simulation). This design preserves executable artifacts, allowing the LLM to focus on high‑level strategy rather than per‑turn reasoning. In controlled Catanatron experiments, HexMachina learns from scratch and evolves players that outperform the strongest human‑crafted baseline (AlphaBeta), achieving a 54% win rate and surpassing prompt‑driven and no‑discovery baselines. Ablations confirm that isolating pure strategy learning improves performance. Overall, artifact‑centric continual learning transforms LLMs from brittle stepwise deciders into stable strategy designers, advancing long‑horizon autonomy.

PaperID: 10155, https://arxiv.org/pdf/2506.04625.pdf

Authors: Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che

Title: Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Abstract:
Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low‑quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool‑MVR, a novel Tool‑Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi‑Agent Meta‑Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench‑V, a new high‑quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration‑based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error ‑> Reflection ‑> Correction" learning paradigm, resulting in our reflection dataset ToolBench‑R and addressing the critical weakness in tool reflection. Finally, we obtain Tool‑MVR by finetuning open‑source LLMs (e.g., Qwen‑7B) on both ToolBench‑V and ToolBench‑R. Our experiments demonstrate that Tool‑MVR achieves state‑of‑the‑art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT‑4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool‑MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

PaperID: 10156, https://arxiv.org/pdf/2506.04572.pdf

Authors: Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He

Title: Demonstrations of Integrity Attacks in Multi-Agent Systems

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi‑Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi‑party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self‑interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self‑Dealer, who manipulates other agents to adopt certain tools; and Free‑Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM‑based monitors, such as GPT‑4o‑mini and o3‑mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

PaperID: 10157, https://arxiv.org/pdf/2506.04571.pdf

Authors: Srikanth Thudumu, Jason Fisher

Title: OpenAg: Democratizing Agricultural Intelligence

Abstract:
Agriculture is undergoing a major transformation driven by artificial intelligence (AI), machine learning, and knowledge representation technologies. However, current agricultural intelligence systems often lack contextual understanding, explainability, and adaptability, especially for smallholder farmers with limited resources. General‑purpose large language models (LLMs), while powerful, typically lack the domain‑specific knowledge and contextual reasoning needed for practical decision support in farming. They tend to produce recommendations that are too generic or unrealistic for real‑world applications. To address these challenges, we present OpenAg, a comprehensive framework designed to advance agricultural artificial general intelligence (AGI). OpenAg combines domain‑specific foundation models, neural knowledge graphs, multi‑agent reasoning, causal explainability, and adaptive transfer learning to deliver context‑aware, explainable, and actionable insights. The system includes: (i) a unified agricultural knowledge base that integrates scientific literature, sensor data, and farmer‑generated knowledge; (ii) a neural agricultural knowledge graph for structured reasoning and inference; (iii) an adaptive multi‑agent reasoning system where AI agents specialize and collaborate across agricultural domains; and (iv) a causal transparency mechanism that ensures AI recommendations are interpretable, scientifically grounded, and aligned with real‑world constraints. OpenAg aims to bridge the gap between scientific knowledge and the tacit expertise of experienced farmers to support scalable and locally relevant agricultural decision‑making.

PaperID: 10158, https://arxiv.org/pdf/2506.04565.pdf

Authors: Jiayi Chen, Junyi Ye, Guiling Wang

Title: From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

Abstract:
Compound AI Systems (CAIS) are an emerging paradigm that integrates large language models (LLMs) with external components, including retrievers, agents, tools, and orchestrators, to overcome the limitations of standalone models in tasks requiring memory, reasoning, real‑time grounding, and multimodal understanding. These systems enable more capable and context‑aware behaviors by composing multiple specialized modules into cohesive workflows. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented and lacks a unified framework for analysis, taxonomy, and evaluation. In this survey, we define the concept of CAIS, propose a multi‑dimensional taxonomy based on component roles and orchestration strategies, and analyze four foundational paradigms: Retrieval‑Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and Orchestration. We review representative systems, compare design trade‑offs, and summarize evaluation methodologies across these paradigms. Finally, we identify key challenges ‑ including scalability, interoperability, benchmarking, and coordination ‑ and outline promising directions for future research. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system‑level artificial intelligence.

PaperID: 10159, https://arxiv.org/pdf/2506.04514.pdf

Authors: Hanqing Li, Melania Fedeli, Vinay Kolar, Diego Klabjan

Title: BEAR: BGP Event Analysis and Reporting

Abstract:
The Internet comprises of interconnected, independently managed Autonomous Systems (AS) that rely on the Border Gateway Protocol (BGP) for inter‑domain routing. BGP anomalies‑‑such as route leaks and hijacks‑‑can divert traffic through unauthorized or inefficient paths, jeopardizing network reliability and security. Although existing rule‑based and machine learning methods can detect these anomalies using structured metrics, they still require experts with in‑depth BGP knowledge of, for example, AS relationships and historical incidents, to interpret events and propose remediation. In this paper, we introduce BEAR (BGP Event Analysis and Reporting), a novel framework that leverages large language models (LLMs) to automatically generate comprehensive reports explaining detected BGP anomaly events. BEAR employs a multi‑step reasoning process that translates tabular BGP data into detailed textual narratives, enhancing interpretability and analytical precision. To address the limited availability of publicly documented BGP anomalies, we also present a synthetic data generation framework powered by LLMs. Evaluations on both real and synthetic datasets demonstrate that BEAR achieves 100% accuracy, outperforming Chain‑of‑Thought and in‑context learning baselines. This work pioneers an automated approach for explaining BGP anomaly events, offering valuable operational insights for network management.

PaperID: 10160, https://arxiv.org/pdf/2506.04481.pdf

Authors: Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen

Title: CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

Abstract:
Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which are insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes human reasoning process into 3 stages: \emphproblem comprehension, \emphproblem solving, and \emphsolution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine‑grained evaluation dimensions. In each dimension, we develop an ``\emphInquiry‑\emphJudge‑\emphReference'' multi‑agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%‑40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in‑depth insights to further enhance their reasoning abilities.

PaperID: 10161, https://arxiv.org/pdf/2506.04405.pdf

Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi

Title: MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Abstract:
We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding‑based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real‑world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open‑source LLMs. Leveraging efficient multi‑threaded and multi‑turn trajectory sampling in MedAgentGym, Med‑Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost‑effective, privacy‑preserving alternative competitive with proprietary LLMs (gpt‑4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM‑based coding assistants for advanced biomedical data science.

PaperID: 10162, https://arxiv.org/pdf/2506.04404.pdf

Authors: Diana Nunes, Ricardo Amorim, Pedro Ribeiro, André Coelho, Rui Campos

Title: A Framework Leveraging Large Language Models for Autonomous UAV Control in Flying Networks

Abstract:
This paper proposes FLUC, a modular framework that integrates open‑source Large Language Models (LLMs) with Unmanned Aerial Vehicle (UAV) autopilot systems to enable autonomous control in Flying Networks (FNs). FLUC translates high‑level natural language commands into executable UAV mission code, bridging the gap between operator intent and UAV behaviour. FLUC is evaluated using three open‑source LLMs ‑ Qwen 2.5, Gemma 2, and LLaMA 3.2 ‑ across scenarios involving code generation and mission planning. Results show that Qwen 2.5 excels in multi‑step reasoning, Gemma 2 balances accuracy and latency, and LLaMA 3.2 offers faster responses with lower logical coherence. A case study on energy‑aware UAV positioning confirms FLUC's ability to interpret structured prompts and autonomously execute domain‑specific logic, showing its effectiveness in real‑time, mission‑driven control.

PaperID: 10163, https://arxiv.org/pdf/2506.04301.pdf

Authors: Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu

Title: The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective

Abstract:
Large‑language‑model (LLM)‑based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi‑step process that coordinates with external tools. This shift from static, single‑turn inference to agentic, multi‑turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system‑level cost, efficiency, and sustainability. This paper presents the first comprehensive system‑level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter‑wide power consumption demands across diverse agent designs and test‑time scaling strategies. We further characterize how AI agent design choices, such as few‑shot prompting, reflection depth, and parallel reasoning, impact accuracy‑cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute‑efficient reasoning, balancing performance with deployability under real‑world constraints.

PaperID: 10164, https://arxiv.org/pdf/2506.04293.pdf

Authors: Fengze Liu, Haoyu Wang, Joonhyuk Cho, Dan Roth, Andrew W. Lo

Title: AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents

Abstract:
Clinical trials are critical for advancing medical treatments but remain prohibitively expensive and time‑consuming. Accurate prediction of clinical trial outcomes can significantly reduce research and development costs and accelerate drug discovery. While recent deep learning models have shown promise by leveraging unstructured data, their black‑box nature, lack of interpretability, and vulnerability to label leakage limit their practical use in high‑stakes biomedical contexts. In this work, we propose AutoCT, a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. AutoCT autonomously generates, evaluates, and refines tabular features based on public information without human input. Our method uses Monte Carlo Tree Search to iteratively optimize predictive performance. Experimental results show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self‑refinement iterations, establishing a new paradigm for scalable, interpretable, and cost‑efficient clinical trial prediction.

PaperID: 10165, https://arxiv.org/pdf/2506.04287.pdf

Authors: Yongjin Yang, Sinjae Kang, Juyong Lee, Dongjun Lee, Se-Young Yun, Kimin Lee

Title: Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback

Abstract:
Training large language model (LLM) agents to acquire necessary skills and perform diverse tasks within an environment is gaining interest as a means to enable open‑endedness. However, creating the training dataset for their skill acquisition faces several challenges. Manual trajectory collection requires significant human effort. Another approach, where LLMs directly propose tasks to learn, is often invalid, as the LLMs lack knowledge of which tasks are actually feasible. Moreover, the generated data may not provide a meaningful learning signal, as agents often already perform well on the proposed tasks. To address this, we propose a novel automatic skill discovery framework EXIF for LLM‑powered agents, designed to improve the feasibility of generated target behaviors while accounting for the agents' capabilities. Our method adopts an exploration‑first strategy by employing an exploration agent (Alice) to train the target agent (Bob) to learn essential skills in the environment. Specifically, Alice first interacts with the environment to retrospectively generate a feasible, environment‑grounded skill dataset, which is then used to train Bob. Crucially, we incorporate an iterative feedback loop, where Alice evaluates Bob's performance to identify areas for improvement. This feedback then guides Alice's next round of exploration, forming a closed‑loop data generation process. Experiments on Webshop and Crafter demonstrate EXIF's ability to effectively discover meaningful skills and iteratively expand the capabilities of the trained agent without any human intervention, achieving substantial performance improvements. Interestingly, we observe that setting Alice to the same model as Bob also notably improves performance, demonstrating EXIF's potential for building a self‑evolving system.

PaperID: 10166, https://arxiv.org/pdf/2506.04276.pdf

Authors: Lei Han, Yitong Guo, Pengfei Yang, Zhiyong Yu, Liang Wang, Quan Wang, Zhiwen Yu

Title: Autonomous Collaborative Scheduling of Time-dependent UAVs, Workers and Vehicles for Crowdsensing in Disaster Response

Abstract:
Natural disasters have caused significant losses to human society, and the timely and efficient acquisition of post‑disaster environmental information is crucial for the effective implementation of rescue operations. Due to the complexity of post‑disaster environments, existing sensing technologies face challenges such as weak environmental adaptability, insufficient specialized sensing capabilities, and limited practicality of sensing solutions. This paper explores the heterogeneous multi‑agent online autonomous collaborative scheduling algorithm HoAs‑PALN, aimed at achieving efficient collection of post‑disaster environmental information. HoAs‑PALN is realized through adaptive dimensionality reduction in the matching process and local Nash equilibrium game, facilitating autonomous collaboration among time‑dependent UAVs, workers and vehicles to enhance sensing scheduling. (1) In terms of adaptive dimensionality reduction during the matching process, HoAs‑PALN significantly reduces scheduling decision time by transforming a five‑dimensional matching process into two categories of three‑dimensional matching processes; (2) Regarding the local Nash equilibrium game, HoAs‑PALN combines the softmax function to optimize behavior selection probabilities and introduces a local Nash equilibrium determination mechanism to ensure scheduling decision performance. Finally, we conducted detailed experiments based on extensive real‑world and simulated data. Compared with the baselines (GREEDY, K‑WTA, MADL and MARL), HoAs‑PALN improves task completion rates by 64.12%, 46.48%, 16.55%, and 14.03% on average, respectively, while each online scheduling decision takes less than 10 seconds, demonstrating its effectiveness in dynamic post‑disaster environments.

PaperID: 10167, https://arxiv.org/pdf/2506.04265.pdf

Authors: Mengda Ji, Genjiu Xu, Keke Jia, Zekun Duan, Yong Qiu, Jianjun Ge, Mingqiang Li

Title: Cooperative Game-Theoretic Credit Assignment for Multi-Agent Policy Gradients via the Core

Abstract:
This work focuses on the credit assignment problem in cooperative multi‑agent reinforcement learning (MARL). Sharing the global advantage among agents often leads to insufficient policy optimization, as it fails to capture the coalitional contributions of different agents. In this work, we revisit the policy update process from a coalitional perspective and propose CORA, an advantage allocation method guided by a cooperative game‑theoretic core allocation. By evaluating the marginal contributions of different coalitions and combining clipped double Q‑learning to mitigate overestimation bias, CORA estimates coalition‑wise advantages. The core formulation enforces coalition‑wise lower bounds on allocated credits, so that coalitions with higher advantages receive stronger total incentives for their participating agents, enabling the global advantage to be attributed to different coalition strategies and promoting coordinated optimal behavior. To reduce computational overhead, we employ random coalition sampling to approximate the core allocation efficiently. Experiments on matrix games, differential games, and multi‑agent collaboration benchmarks demonstrate that our method outperforms baselines. These findings highlight the importance of coalition‑level credit assignment and cooperative games for advancing multi‑agent learning.

PaperID: 10168, https://arxiv.org/pdf/2506.04253.pdf

Authors: Tapio Pitkäranta, Leena Pitkäranta

Title: HADA: Human-AI Agent Decision Alignment Architecture

Abstract:
We present HADA (Human‑AI Agent Decision Alignment), a protocol‑ and framework agnostic reference architecture that keeps both large language model (LLM) agents and legacy algorithms aligned with organizational targets and values. HADA wraps any algorithm or LLM in role‑specific stakeholder agents ‑‑ business, data‑science, audit, ethics, and customer ‑‑ each exposing conversational APIs so that technical and non‑technical actors can query, steer, audit, or contest every decision across strategic, tactical, and real‑time horizons. Alignment objectives, KPIs, and value constraints are expressed in natural language and are continuously propagated, logged, and versioned while thousands of heterogeneous agents run on different orchestration stacks. A cloud‑native proof of concept packages a production credit‑scoring model (getLoanDecision) and deploys it on Docker/Kubernetes/Python; five scripted retail‑bank scenarios show how target changes, parameter tweaks, explanation requests, and ethics triggers flow end to end through the architecture. Evaluation followed the Design‑Science Research Methodology. Walkthrough observation and log inspection demonstrated complete coverage of six predefined objectives: every role could invoke conversational control, trace KPIs and value constraints, detect and mitigate ZIP‑code bias, and reproduce full decision lineage, independent of the underlying LLM or agent library. Contributions: (1) an open‑source HADA architecture, (2) a mid‑range design theory for human‑AI alignment in multi‑agent systems, and (3) empirical evidence that framework‑agnostic, protocol‑compliant stakeholder agents improve accuracy, transparency, and ethical compliance in real‑world decision pipelines.

PaperID: 10169, https://arxiv.org/pdf/2506.04251.pdf

Authors: Zhengyang Li, Sawyer Campos, Nana Wang

Title: Language-Driven Coordination and Learning in Multi-Agent Simulation Environments

Abstract:
This paper introduces LLM‑MARL, a unified framework that incorporates large language models (LLMs) into multi‑agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter‑agent messaging, and support episodic recall. Training combines PPO with a language‑conditioned loss and LLM query gating. LLM‑MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero‑shot generalization. Ablation studies demonstrate that subgoal generation and language‑based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication‑driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi‑agent systems used for training, games, and human‑AI collaboration.

PaperID: 10170, https://arxiv.org/pdf/2506.04215.pdf

Authors: Alex DeWeese, Guannan Qu

Title: Thinking Beyond Visibility: A Near-Optimal Policy Framework for Locally Interdependent Multi-Agent MDPs

Abstract:
Decentralized Partially Observable Markov Decision Processes (Dec‑POMDPs) are known to be NEXP‑Complete and intractable to solve. However, for problems such as cooperative navigation, obstacle avoidance, and formation control, basic assumptions can be made about local visibility and local dependencies. The work DeWeese and Qu 2024 formalized these assumptions in the construction of the Locally Interdependent Multi‑Agent MDP. In this setting, it establishes three closed‑form policies that are tractable to compute in various situations and are exponentially close to optimal with respect to visibility. However, it is also shown that these solutions can have poor performance when the visibility is small and fixed, often getting stuck during simulations due to the so called "Penalty Jittering" phenomenon. In this work, we establish the Extended Cutoff Policy Class which is, to the best of our knowledge, the first non‑trivial class of near optimal closed‑form partially observable policies that are exponentially close to optimal with respect to the visibility for any Locally Interdependent Multi‑Agent MDP. These policies are able to remember agents beyond their visibilities which allows them to perform significantly better in many small and fixed visibility settings, resolve Penalty Jittering occurrences, and under certain circumstances guarantee fully observable joint optimal behavior despite the partial observability. We also propose a generalized form of the Locally Interdependent Multi‑Agent MDP that allows for transition dependence and extended reward dependence, then replicate our theoretical results in this setting.

PaperID: 10171, https://arxiv.org/pdf/2506.04195.pdf

Authors: Elena Zamaraeva, Christopher M. Collins, George R. Darling, Matthew S. Dyer, Bei Peng, Rahul Savani, Dmytro Antypov, Vladimir V. Gusev, Judith Clymo, Paul G. Spirakis, Matthew J. Rosseinsky

Title: MACS: Multi-Agent Reinforcement Learning for Optimization of Crystal Structures

Abstract:
Geometry optimization of atomic structures is a common and crucial task in computational chemistry and materials design. Following the learning to optimize paradigm, we propose a new multi‑agent reinforcement learning method called Multi‑Agent Crystal Structure optimization (MACS) to address periodic crystal structure optimization. MACS treats geometry optimization as a partially observable Markov game in which atoms are agents that adjust their positions to collectively discover a stable configuration. We train MACS across various compositions of reported crystalline materials to obtain a policy that successfully optimizes structures from the training compositions as well as structures of larger sizes and unseen compositions, confirming its excellent scalability and zero‑shot transferability. We benchmark our approach against a broad range of state‑of‑the‑art optimization methods and demonstrate that MACS optimizes periodic crystal structures significantly faster, with fewer energy calculations, and the lowest failure rate.

PaperID: 10172, https://arxiv.org/pdf/2506.04133.pdf

Authors: Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis

Title: TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems

Abstract:
Agentic AI systems, built upon large language models (LLMs) and deployed in multi‑agent configurations, are redefining intelligence, autonomy, collaboration, and decision‑making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM‑based Agentic Multi‑Agent Systems (AMAS). We begin by examining the conceptual foundations of Agentic AI and highlight its architectural distinctions from traditional AI agents. We then adapt and extend the AI TRiSM framework for Agentic AI, structured around key pillars: Explainability, ModelOps, Security, Privacy and their Lifecycle Governance, each contextualized to the challenges of AMAS. A risk taxonomy is proposed to capture the unique threats and vulnerabilities of Agentic AI, ranging from coordination failures to prompt‑based adversarial manipulation. To support practical assessment in Agentic AI works, we introduce two novel metrics: the Component Synergy Score (CSS), which quantifies the quality of inter‑agent collaboration, and the Tool Utilization Efficacy (TUE), which evaluates the efficiency of tool use within agent workflows. We further discuss strategies for improving explainability in Agentic AI, as well as approaches to enhancing security and privacy through encryption, adversarial robustness, and regulatory compliance. The review concludes with a research roadmap for the responsible development and deployment of Agentic AI, highlighting key directions to align emerging systems with TRiSM principles‑ensuring safety, transparency, and accountability in their operation.

PaperID: 10173, https://arxiv.org/pdf/2506.04032.pdf

Authors: Sina Rashidian, Nan Li, Jonathan Amar, Jong Ha Lee, Sam Pugh, Eric Yang, Geoff Masterson, Myoung Cha, Yugang Jia, Akhil Vaid

Title: AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data

Abstract:
Background: We present a Patient Simulator that leverages real world patient encounters which cover a broad range of conditions and symptoms to provide synthetic test subjects for development and testing of healthcare agentic models. The simulator provides a realistic approach to patient presentation and multi‑turn conversation with a symptom‑checking agent. Objectives: (1) To construct and instantiate a Patient Simulator to train and test an AI health agent, based on patient vignettes derived from real EHR data. (2) To test the validity and alignment of the simulated encounters provided by the Patient Simulator to expert human clinical providers. (3) To illustrate the evaluation framework of such an LLM system on the generated realistic, data‑driven simulations ‑‑ yielding a preliminary assessment of our proposed system. Methods: We first constructed realistic clinical scenarios by deriving patient vignettes from real‑world EHR encounters. These vignettes cover a variety of presenting symptoms and underlying conditions. We then evaluate the performance of the Patient Simulator as a simulacrum of a real patient encounter across over 500 different patient vignettes. We leveraged a separate AI agent to provide multi‑turn questions to obtain a history of present illness. The resulting multiturn conversations were evaluated by two expert clinicians. Results: Clinicians scored the Patient Simulator as consistent with the patient vignettes in those same 97.7% of cases. The extracted case summary based on the conversation history was 99% relevant. Conclusions: We developed a methodology to incorporate vignettes derived from real healthcare patient data to build a simulation of patient responses to symptom checking agents. The performance and alignment of this Patient Simulator could be used to train and test a multi‑turn conversational AI agent at scale.

PaperID: 10174, https://arxiv.org/pdf/2506.04018.pdf

Authors: Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Gouné, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young

Title: AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

Abstract:
As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. While prior research has studied agents' ability to produce harmful outputs or follow malicious instructions, it remains unclear how likely agents are to spontaneously pursue unintended goals in realistic deployments. In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer. We introduce a misalignment propensity benchmark, \textscAgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios. Evaluations cover behaviours such as avoiding oversight, resisting shutdown, sandbagging, and power‑seeking. Testing frontier models, we find that more capable agents tend to exhibit higher misalignment on average. We also systematically vary agent personalities through different system prompts and observe that persona characteristics can strongly and unpredictably influence misalignment, sometimes more than the choice of model itself. Our results reveal the limitations of current alignment methods for autonomous LLM agents and underscore the need to rethink misalignment in realistic deployment settings.

PaperID: 10175, https://arxiv.org/pdf/2506.03861.pdf

Authors: Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura

Title: PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

Abstract:
High‑Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision‑making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high‑frequency, short‑term trading. This paper introduces PulseReddit, a novel dataset that is the first to align large‑scale Reddit discussion data with high‑frequency cryptocurrency market statistics for short‑term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)‑based Multi‑Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance‑efficiency trade‑offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.

PaperID: 10176, https://arxiv.org/pdf/2506.03801.pdf

Authors: Peter Pfeiffer, Alexander Rombach, Maxim Majlatow, Nijat Mehdiyev

Title: From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation

Abstract:
Traditional Business Process Management (BPM) struggles with rigidity, opacity, and scalability in dynamic environments while emerging Large Language Models (LLMs) present transformative opportunities alongside risks. This paper explores four real‑world use cases that demonstrate how LLMs, augmented with trustworthy process intelligence, redefine process modeling, prediction, and automation. Grounded in early‑stage research projects with industrial partners, the work spans manufacturing, modeling, life‑science, and design processes, addressing domain‑specific challenges through human‑AI collaboration. In manufacturing, an LLM‑driven framework integrates uncertainty‑aware explainable Machine Learning (ML) with interactive dialogues, transforming opaque predictions into auditable workflows. For process modeling, conversational interfaces democratize BPMN design. Pharmacovigilance agents automate drug safety monitoring via knowledge‑graph‑augmented LLMs. Finally, sustainable textile design employs multi‑agent systems to navigate regulatory and environmental trade‑offs. We intend to examine tensions between transparency and efficiency, generalization and specialization, and human agency versus automation. By mapping these trade‑offs, we advocate for context‑sensitive integration prioritizing domain needs, stakeholder values, and iterative human‑in‑the‑loop workflows over universal solutions. This work provides actionable insights for researchers and practitioners aiming to operationalize LLMs in critical BPM environments.

PaperID: 10177, https://arxiv.org/pdf/2506.03750.pdf

Authors: Mengxi Xiao, Ben Liu, He Li, Jimin Huang, Qianqian Xie, Xiaofen Zong, Mang Ye, Min Peng

Title: MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis

Abstract:
The application of AI in psychiatric diagnosis faces significant challenges, including the subjective nature of mental health assessments, symptom overlap across disorders, and privacy constraints limiting data availability. To address these issues, we present MoodAngels, the first specialized multi‑agent framework for mood disorder diagnosis. Our approach combines granular‑scale analysis of clinical assessments with a structured verification process, enabling more accurate interpretation of complex psychiatric data. Complementing this framework, we introduce MoodSyn, an open‑source dataset of 1,173 synthetic psychiatric cases that preserves clinical validity while ensuring patient privacy. Experimental results demonstrate that MoodAngels outperforms conventional methods, with our baseline agent achieving 12.3% higher accuracy than GPT‑4o on real‑world cases, and our full multi‑agent system delivering further improvements. Evaluation in the MoodSyn dataset demonstrates exceptional fidelity, accurately reproducing both the core statistical patterns and complex relationships present in the original data while maintaining strong utility for machine learning applications. Together, these contributions provide both an advanced diagnostic tool and a critical research resource for computational psychiatry, bridging important gaps in AI‑assisted mental health assessment.

PaperID: 10178, https://arxiv.org/pdf/2506.03586.pdf

Authors: Yu Ma, Xiao Li, Chongtao Guo, Le Liang, Michail Matthaiou, Shi Jin

Title: Beamforming and Resource Allocation for Delay Minimization in RIS-Assisted OFDM Systems

Abstract:
This paper investigates a joint beamforming and resource allocation problem in downlink reconfigurable intelligent surface (RIS)‑assisted orthogonal frequency division multiplexing (OFDM) systems to minimize the average delay, where data packets for each user arrive at the base station (BS) stochastically. The sequential optimization problem is inherently a Markov decision process (MDP), thus falling within the remit of reinforcement learning. To effectively handle the mixed action space and reduce the state space dimensionality, a hybrid deep reinforcement learning (DRL) approach is proposed. Specifically, proximal policy optimization (PPO)‑Theta is employed to optimize the RIS phase shift design, while PPO‑N is responsible for subcarrier allocation decisions. The active beamforming at the BS is then derived from the jointly optimized RIS phase shifts and subcarrier allocation decisions. To further mitigate the curse of dimensionality associated with subcarrier allocation, a multi‑agent strategy is introduced to optimize the subcarrier allocation indicators more efficiently. Moreover, to achieve more adaptive resource allocation and accurately capture the network dynamics, key factors closely related to average delay, such as the number of backlogged packets in buffers and current packet arrivals, are incorporated into the state space. Furthermore, a transfer learning framework is introduced to enhance the training efficiency and accelerate convergence. Simulation results demonstrate that the proposed algorithm significantly reduces the average delay, enhances resource allocation efficiency, and achieves superior system robustness and fairness compared to baseline methods.

PaperID: 10179, https://arxiv.org/pdf/2506.03546.pdf

Authors: Yuanchen Bai, Zijian Ding, Angelique Taylor

Title: From Virtual Agents to Robot Teams: A Multi-Robot Framework Evaluation in High-Stakes Healthcare Context

Abstract:
Advancements in generative models have enabled multi‑agent systems (MAS) to perform complex virtual tasks such as writing and code generation, which do not generalize well to physical multi‑agent robotic teams. Current frameworks often treat agents as conceptual task executors rather than physically embodied entities, and overlook critical real‑world constraints such as spatial context, robotic capabilities (e.g., sensing and navigation). To probe this gap, we reconfigure and stress‑test a hierarchical multi‑agent robotic team built on the CrewAI framework in a simulated emergency department onboarding scenario. We identify five persistent failure modes: role misalignment; tool access violations; lack of in‑time handling of failure reports; noncompliance with prescribed workflows; bypassing or false reporting of task completion. Based on this analysis, we propose three design guidelines emphasizing process transparency, proactive failure recovery, and contextual grounding. Our work informs the development of more resilient and robust multi‑agent robotic systems (MARS), including opportunities to extend virtual multi‑agent frameworks to the real world.

PaperID: 10180, https://arxiv.org/pdf/2506.03543.pdf

Authors: Wanghao Ye, Sihan Chen, Yiting Wang, Shwai He, Bowei Tian, Guoheng Sun, Ziyi Wang, Ziyao Wang, Yexiao He, Zheyu Shen, Meng Liu, Yuning Zhang, Meng Feng, Yang Wang, Siyuan Peng, Yilong Dai, Zhenle Duan, Lang Xiong, Joshua Liu, Hanzhang Qin, Ang Li

Title: CogniPair: From LLM Chatbots to Conscious AI Agents -- GNWT-Based Multi-Agent Digital Twins for Social Pairing -- Dating & Hiring Applications

Abstract:
Current large language model (LLM) agents lack authentic human psychological processes necessary for genuine digital twins and social AI applications. To address this limitation, we present a computational implementation of Global Workspace Theory (GNWT) that integrates human cognitive architecture principles into LLM agents, creating specialized sub‑agents for emotion, memory, social norms, planning, and goal‑tracking coordinated through a global workspace mechanism. However, authentic digital twins require accurate personality initialization. We therefore develop a novel adventure‑based personality test that evaluates true personality through behavioral choices within interactive scenarios, bypassing self‑presentation bias found in traditional assessments. Building on these innovations, our CogniPair platform enables digital twins to engage in realistic simulated dating interactions and job interviews before real encounters, providing bidirectional cultural fit assessment for both romantic compatibility and workplace matching. Validation using 551 GNWT‑Agents and Columbia University Speed Dating dataset demonstrates 72% correlation with human attraction patterns, 77.8% match prediction accuracy, and 74% agreement in human validation studies. This work advances psychological authenticity in LLM agents and establishes a foundation for intelligent dating platforms and HR technology solutions.

PaperID: 10181, https://arxiv.org/pdf/2506.03541.pdf

Authors: Xiaofeng Zhou, Heyan Huang, Lizi Liao

Title: Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Abstract:
Large Language Models (LLMs) continue to set new standards in knowledge‑intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques‑‑such as static knowledge distillation, resource‑intensive reinforcement learning from human feedback, or limited self‑reflection‑‑struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi‑turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree‑structured Direct Preference Optimization (T‑DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller‑model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.

PaperID: 10182, https://arxiv.org/pdf/2506.03520.pdf

Authors: Han Zhang, KaWing Tsang, Zhenhui Peng

Title: VChatter: Exploring Generative Conversational Agents for Simulating Exposure Therapy to Reduce Social Anxiety

Abstract:
Many people struggle with social anxiety, feeling fear, or even physically uncomfortable in social situations like talking to strangers. Exposure therapy, a clinical method that gradually and repeatedly exposes individuals to the source of their fear and helps them build coping mechanisms, can reduce social anxiety but traditionally requires human therapists' guidance and constructions of situations. In this paper, we developed a multi‑agent system VChatter to explore large language models(LLMs)‑based conversational agents for simulating exposure therapy with users. Based on a survey study (N=36) and an expert interview, VChatter includes an Agent‑P, which acts as a psychotherapist to design the exposure therapy plans for users, and two Agent‑Hs, which can take on different interactive roles in low, medium, and high exposure scenarios. A six‑day qualitative study (N=10) showcases VChatter's usefulness in reducing users' social anxiety, feelings of isolation, and avoidance of social interactions. We demonstrated the feasibility of using LLMs‑based conversational agents to simulate exposure therapy for addressing social anxiety and discussed future concerns for designing agents tailored to social anxiety.

PaperID: 10183, https://arxiv.org/pdf/2506.03205.pdf

Authors: Umberto Gonçalves de Sousa

Title: Q-ARDNS-Multi: A Multi-Agent Quantum Reinforcement Learning Framework with Meta-Cognitive Adaptation for Complex 3D Environments

Abstract:
This paper presents Q‑ARDNS‑Multi, an advanced multi‑agent quantum reinforcement learning (QRL) framework that extends the ARDNS‑FN‑Quantum model, where Q‑ARDNS‑Multi stands for "Quantum Adaptive Reward‑Driven Neural Simulator ‑ Multi‑Agent". It integrates quantum circuits with RY gates, meta‑cognitive adaptation, and multi‑agent coordination mechanisms for complex 3D environments. Q‑ARDNS‑Multi leverages a 2‑qubit quantum circuit for action selection, a dual‑memory system inspired by human cognition, a shared memory module for agent cooperation, and adaptive exploration strategies modulated by reward variance and intrinsic motivation. Evaluated in a 10 × 10 × 3 GridWorld environment with two agents over 5000 episodes, Q‑ARDNS‑Multi achieves success rates of 99.6% and 99.5% for Agents 0 and 1, respectively, outperforming Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) and Soft Actor‑Critic (SAC) in terms of success rate, stability, navigation efficiency, and collision avoidance. The framework records mean rewards of ‑304.2891 \pm 756.4636 and ‑295.7622 \pm 752.7103, averaging 210 steps to goal, demonstrating its robustness in dynamic settings. Comprehensive analyses, including learning curves, reward distributions, statistical tests, and computational efficiency evaluations, highlight the contributions of quantum circuits and meta‑cognitive adaptation. By bridging quantum computing, cognitive science, and multi‑agent RL, Q‑ARDNS‑Multi offers a scalable, human‑like approach for applications in robotics, autonomous navigation, and decision‑making under uncertainty.

PaperID: 10184, https://arxiv.org/pdf/2506.03102.pdf

Authors: Sophie Greenwood, Karen Levy, Solon Barocas, Hoda Heidari, Jon Kleinberg

Title: Designing Algorithmic Delegates: The Role of Indistinguishability in Human-AI Handoff

Abstract:
As AI technologies improve, people are increasingly willing to delegate tasks to AI agents. In many cases, the human decision‑maker chooses whether to delegate to an AI agent based on properties of the specific instance of the decision‑making problem they are facing. Since humans typically lack full awareness of all the factors relevant to this choice for a given decision‑making instance, they perform a kind of categorization by treating indistinguishable instances ‑‑ those that have the same observable features ‑‑ as the same. In this paper, we define the problem of designing the optimal algorithmic delegate in the presence of categories. This is an important dimension in the design of algorithms to work with humans, since we show that the optimal delegate can be an arbitrarily better teammate than the optimal standalone algorithmic agent. The solution to this optimal delegation problem is not obvious: we discover that this problem is fundamentally combinatorial, and illustrate the complex relationship between the optimal design and the properties of the decision‑making task even in simple settings. Indeed, we show that finding the optimal delegate is computationally hard in general. However, we are able to find efficient algorithms for producing the optimal delegate in several broad cases of the problem, including when the optimal action may be decomposed into functions of features observed by the human and the algorithm. Finally, we run computational experiments to simulate a designer updating an algorithmic delegate over time to be optimized for when it is actually adopted by users, and show that while this process does not recover the optimal delegate in general, the resulting delegate often performs quite well.

PaperID: 10185, https://arxiv.org/pdf/2506.03053.pdf

Authors: Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham

Title: MAEBE: Multi-Agent Emergent Behavior Framework

Abstract:
Traditional AI safety evaluations on isolated LLMs are insufficient as multi‑agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi‑Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double‑inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi‑agent contexts.

PaperID: 10186, https://arxiv.org/pdf/2506.03038.pdf

Authors: Jintian Shao, Yiming Cheng

Title: Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Abstract:
Reinforcement learning (RL) enhances large language models (LLMs) in complex, long‑chain‑of‑thought (long‑CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long‑term value for fine‑grained, step‑by‑step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO's boundaries in long‑term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.

PaperID: 10187, https://arxiv.org/pdf/2506.03011.pdf

Authors: Aditya Bharat Soni, Boxuan Li, Xingyao Wang, Valerie Chen, Graham Neubig

Title: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Abstract:
Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands‑Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands‑Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE‑Bench Multimodal, GAIA, and The Agent Company, outperforming the best‑performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state‑of‑the‑art multi‑agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands‑Versa as a strong baseline for future research.

PaperID: 10188, https://arxiv.org/pdf/2506.02998.pdf

Authors: Đorđe Klisura, Astrid R Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, Anthony Rios

Title: A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Abstract:
Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non‑standard varieties. We propose a novel multi‑agent framework inspired by human‑centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect‑specific fine‑tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT‑4o‑mini's zero‑shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few‑shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.

PaperID: 10189, https://arxiv.org/pdf/2506.02993.pdf

Authors: Zhanxin Hao, Jie Cao, Ruimiao Li, Jifan Yu, Zhiyuan Liu, Yu Zhang

Title: Mapping Student-AI Interaction Dynamics in Multi-Agent Learning Environments: Supporting Personalised Learning and Reducing Performance Gaps

Abstract:
Multi‑agent AI systems, which simulate diverse instructional roles such as teachers and peers, offer new possibilities for personalized and interactive learning. Yet, student‑AI interaction patterns and their pedagogical implications remain unclear. This study explores how university students engaged with multiple AI agents, and how these interactions influenced cognitive outcomes (learning gains) and non‑cognitive factors (motivation, technology acceptance). Based on MAIC, an online learning platform with multi‑agent, the research involved 305 university students and 19,365 lines of dialogue data. Pre‑ and post‑test scores, self‑reported motivation and technology acceptance were also collected. The study identified two engagement patterns: co‑construction of knowledge and co‑regulation. Lag sequential analysis revealed that students with lower prior knowledge relied more on co‑construction of knowledge sequences, showing higher learning gains and post‑course motivation. In contrast, students with higher prior knowledge engaged more in co‑regulation behaviors but exhibited limited learning improvement. Technology acceptance increased across all groups. These findings suggest that multi‑agent AI systems can adapt to students' varying needs, support differentiated engagement, and reduce performance gaps. Implications for personalized system design and future research directions are discussed.

PaperID: 10190, https://arxiv.org/pdf/2506.02992.pdf

Authors: Li Zhang, Kevin D. Ashley

Title: Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation

Abstract:
Large Language Models (LLMs) are increasingly explored for legal argument generation, yet they pose significant risks of manipulation through hallucination and ungrounded persuasion, and often fail to utilize provided factual bases effectively or abstain when arguments are untenable. This paper introduces a novel reflective multi‑agent method designed to address these challenges in the context of legally compliant persuasion. Our approach employs specialized agents (factor analyst and argument polisher) in an iterative refinement process to generate 3‑ply legal arguments (plaintiff, defendant, rebuttal). We evaluate reflective multi‑agent against single‑agent, enhanced‑prompt single‑agent, and non‑reflective multi‑agent baselines using four diverse LLMs (GPT‑4o, GPT‑4o‑mini, Llama‑4‑Maverick‑17b‑128e, Llama‑4‑Scout‑17b‑16e) across three legal scenarios: "arguable", "mismatched", and "non‑arguable". Results demonstrate that the reflective multi‑agent approach excels at successful abstention by preventing generation when arguments cannot be grounded, improves hallucination accuracy by reducing fabricated and misattributed factors and enhances factor utilization recall by better using the provided case facts. These findings suggest that structured reflection within a multi‑agent framework offers a robust method for fostering ethical persuasion and mitigating manipulation in LLM‑based legal argumentation systems.

PaperID: 10191, https://arxiv.org/pdf/2506.02951.pdf

Authors: Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang

Title: Adaptive Graph Pruning for Multi-Agent Communication

Abstract:
Large Language Model (LLM) based multi‑agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task‑adaptive multi‑agent collaboration framework that jointly optimizes agent quantity (hard‑pruning) and communication topology (soft‑pruning). Specifically, our method employs a two‑stage training strategy: firstly, independently training soft‑pruning networks for different agent quantities to determine optimal agent‑quantity‑specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard‑pruning and soft‑pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High‑performing, achieving state‑of‑the‑art results across six benchmarks and consistently generalizes across multiple mainstream LLM architectures, with a increase in performance of 2.58%～ 9.84%; (2) Task‑adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with an extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token‑economical, having fewer training steps and token consumption at the same time, with a decrease in token consumption of 90%+; and (4) Training‑efficient, achieving high performance with very few training steps compared with other methods. The performance will surpass the existing baselines after about ten steps of training under six benchmarks.

PaperID: 10192, https://arxiv.org/pdf/2506.02943.pdf

Authors: Qinghua Xu, Guancheng Wang, Lionel Briand, Kui Liu

Title: Hallucination to Consensus: Multi-Agent LLMs for End-to-End JUnit Test Generation

Abstract:
Unit testing plays a critical role in ensuring software correctness. However, writing unit tests manually is labor‑intensive, especially for strongly typed languages like Java, motivating the need for automated approaches. Traditional methods primarily rely on search‑based or randomized algorithms to achieve high code coverage and produce regression oracles, which are derived from the program's current behavior rather than its intended functionality. Recent advances in LLMs have enabled oracle generation from natural language descriptions, aligning better with user requirements. However, existing LLM‑based methods often require fine‑tuning or rely on external tools such as EvoSuite for test prefix generation, making them costly or cumbersome to apply in practice. In this work, we propose CANDOR, a novel prompt engineering‑based LLM framework for automated unit test generation in Java. CANDOR orchestrates multiple specialized LLM agents to collaboratively generate complete tests. To mitigate the notorious hallucinations in LLMs and improve oracle correctness, we introduce a novel strategy that engages multiple reasoning LLMs in a panel discussion and generates accurate oracles based on consensus. Additionally, to reduce the verbosity of reasoning LLMs' outputs, we propose a novel dual‑LLM pipeline to produce concise and structured oracle evaluations. Our experiments show that CANDOR is comparable with EvoSuite in generating tests with high code coverage and clearly superior in terms of mutation score. Moreover, our prompt engineering‑based approach CANDOR significantly outperforms the SOTA fine‑tuning‑based oracle generator TOGLL by at least 21.1 percentage points in oracle correctness on both correct and faulty source code. Further ablation studies confirm the critical contributions of key agents in generating high‑quality tests.

PaperID: 10193, https://arxiv.org/pdf/2506.02859.pdf

Authors: Parth Atulbhai Gandhi, Akansha Shukla, David Tayouri, Beni Ifland, Yuval Elovici, Rami Puzis, Asaf Shabtai

Title: ATAG: AI-Agent Application Threat Assessment with Attack Graphs

Abstract:
Evaluating the security of multi‑agent systems (MASs) powered by large language models (LLMs) is challenging, primarily because of the systems' complex internal dynamics and the evolving nature of LLM vulnerabilities. Traditional attack graph (AG) methods often lack the specific capabilities to model attacks on LLMs. This paper introduces AI‑agent application Threat assessment with Attack Graphs (ATAG), a novel framework designed to systematically analyze the security risks associated with AI‑agent applications. ATAG extends the MulVAL logic‑based AG generation tool with custom facts and interaction rules to accurately represent AI‑agent topologies, vulnerabilities, and attack scenarios. As part of this research, we also created the LLM vulnerability database (LVD) to initiate the process of standardizing LLM vulnerabilities documentation. To demonstrate ATAG's efficacy, we applied it to two multi‑agent applications. Our case studies demonstrated the framework's ability to model and generate AGs for sophisticated, multi‑step attack scenarios exploiting vulnerabilities such as prompt injection, excessive agency, sensitive information disclosure, and insecure output handling across interconnected agents. ATAG is an important step toward a robust methodology and toolset to help understand, visualize, and prioritize complex attack paths in multi‑agent AI systems (MAASs). It facilitates proactive identification and mitigation of AI‑agent threats in multi‑agent applications.

PaperID: 10194, https://arxiv.org/pdf/2506.02841.pdf

Authors: Tom Danino, Nahum Shimkin

Title: Enhancing Sample Efficiency in Multi-Agent RL with Uncertainty Quantification and Selective Exploration

Abstract:
Multi‑agent reinforcement learning (MARL) methods have achieved state‑of‑the‑art results on a range of multi‑agent tasks. Yet, MARL algorithms typically require significantly more environment interactions than their single‑agent counterparts to converge, a problem exacerbated by the difficulty in exploring over a large joint action space and the high variance intrinsic to MARL environments. To tackle these issues, we propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning, incorporating several key contributions. The main component in our scheme is a selective exploration method that leverages ensemble kurtosis. We extend the global decomposed critic with a diversity‑regularized ensemble of individual critics and utilize its excess kurtosis to guide exploration toward high‑uncertainty states and actions. To improve sample efficiency, we train the centralized critic with a novel truncated variation of the TD(λ) algorithm, enabling efficient off‑policy learning with reduced variance. On the actor side, our suggested algorithm adapts the mixed samples approach to MARL, mixing on‑policy and off‑policy loss functions for training the actors. This approach balances between stability and efficiency and outperforms purely off‑policy learning. The evaluation shows our method outperforms state‑of‑the‑art baselines on standard MARL benchmarks, including a variety of SMAC II maps.

PaperID: 10195, https://arxiv.org/pdf/2506.02795.pdf

Authors: Guangqi Wu, Runzhong Wang, Connor W. Coley

Title: Optimization of Robotic Liquid Handling as a Capacitated Vehicle Routing Problem

Abstract:
We present an optimization strategy to reduce the execution time of liquid handling operations in the context of an automated chemical laboratory. By formulating the task as a capacitated vehicle routing problem (CVRP), we leverage heuristic solvers traditionally used in logistics and transportation planning to optimize task execution times. As exemplified using an 8‑channel pipette with individually controllable tips, our approach demonstrates robust optimization performance across different labware formats (e.g., well‑plates, vial holders), achieving up to a 37% reduction in execution time for randomly generated tasks compared to the baseline sorting method. We further apply the method to a real‑world high‑throughput materials discovery campaign and observe that 3 minutes of optimization time led to a reduction of 61 minutes in execution time compared to the best‑performing sorting‑based strategy. Our results highlight the potential for substantial improvements in throughput and efficiency in automated laboratories without any hardware modifications. This optimization strategy offers a practical and scalable solution to accelerate combinatorial experimentation in areas such as drug combination screening, reaction condition optimization, materials development, and formulation engineering.

PaperID: 10196, https://arxiv.org/pdf/2506.02739.pdf

Authors: Pengcheng Zhou, Yinglun Feng, Halimulati Julaiti, Zhongliang Yang

Title: Why do AI agents communicate in human language?

Abstract:
Large Language Models (LLMs) have become foundational to modern AI agent systems, enabling autonomous agents to reason and plan. In most existing systems, inter‑agent communication relies primarily on natural language. While this design supports interpretability and human oversight, we argue that it introduces fundamental limitations in agent‑to‑agent coordination. The semantic space of natural language is structurally misaligned with the high‑dimensional vector spaces in which LLMs operate, resulting in information loss and behavioral drift. Beyond surface‑level inefficiencies, we highlight a deeper architectural limitation: current LLMs were not trained with the objective of supporting agentic behavior. As such, they lack mechanisms for modeling role continuity, task boundaries, and multi‑agent dependencies. The standard next‑token prediction paradigm fails to support the structural alignment required for robust, scalable agent coordination. Based on this, we argue that two core questions deserve careful examination: first, given that AI agents fundamentally operate in high‑dimensional vector spaces, should they rely on a language system originally designed for human cognition as their communication medium? Second, should we consider developing a new model construction paradigm that builds models from the ground up to natively support structured communication, shared intentionality, and task alignment in multi‑role, multi‑agent environments? This paper calls for a reconsideration not only of how agents should communicate, but also of what it fundamentally means to train a model that natively supports multi‑agent coordination and communication.

PaperID: 10197, https://arxiv.org/pdf/2506.02718.pdf

Authors: Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu

Title: End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

Abstract:
Large language models (LLMs) are versatile, yet their deployment in complex real‑world settings is limited by static knowledge cutoffs and the difficulty of producing controllable behavior within a single inference. Multi‑agent search systems (MASS), which coordinate specialized LLM agents equipped with search tools, mitigate these issues via task decomposition and retrieval‑augmented problem solving. However, optimizing LLMs for agent‑specific roles remains labor‑intensive with prompt engineering or supervised fine‑tuning, motivating automated end‑to‑end training. Existing multi‑agent reinforcement learning (MARL) methods such as Multi‑Agent Proximal Policy Optimization (MAPPO) typically depend on large critic networks to evaluate joint actions, leading to instability and high memory costs. We introduce Multi‑Agent Heterogeneous Group Policy Optimization (MHGPO), which updates policies by estimating relative advantages across heterogeneous groups of multi‑agent rollouts, shifting the optimization focus from local agent performance to global system success. We further study three group rollout sampling strategies to trade off sample efficiency and optimization quality. Experiments show that MHGPO captures implicit inter‑agent dependencies and consistently outperforms strong baselines in both task performance and computational efficiency.

PaperID: 10198, https://arxiv.org/pdf/2506.02689.pdf

Authors: Liang Yue, Yihong Tang, Kehai Chen, Jie Liu, Min Zhang

Title: MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

Abstract:
Instruction fine‑tuning is crucial in NLP tasks, enhancing pretrained models' instruction‑following capabilities and task‑specific performance. However, obtaining high‑quality fine‑tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi‑agent conversations to generate high‑quality teacher‑student interaction data. Utilizing MASTER, we construct BOOST‑QA, a fine‑tuning dataset augmented from existing datasets like Orca‑Math‑200k, ProcQA, and OpenHermes2.5. Experiments show that models fine‑tuned with BOOST‑QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.

PaperID: 10199, https://arxiv.org/pdf/2506.02668.pdf

Authors: Frederico Metelo, Alexandre Oliveira, Stevo Racković, Pedro Ákos Costa, Cláudia Soares

Title: FAuNO: Semi-Asynchronous Federated Reinforcement Learning Framework for Task Offloading in Edge Systems

Abstract:
Edge computing addresses the growing data demands of connected‑device networks by placing computational resources closer to end users through decentralized infrastructures. This decentralization challenges traditional, fully centralized orchestration, which suffers from latency and resource bottlenecks. We present FAuNO ‑‑ \emphFederated Asynchronous Network Orchestrator ‑‑ a buffered, asynchronous \emphfederated reinforcement‑learning (FRL) framework for decentralized task offloading in edge systems. FAuNO adopts an actor‑critic architecture in which local actors learn node‑specific dynamics and peer interactions, while a federated critic aggregates experience across agents to encourage efficient cooperation and improve overall system performance. Experiments in the \emphPeersimGym environment show that FAuNO consistently matches or exceeds heuristic and federated multi‑agent RL baselines in reducing task loss and latency, underscoring its adaptability to dynamic edge‑computing scenarios.

PaperID: 10200, https://arxiv.org/pdf/2506.02616.pdf

Authors: Qi Liao, Parijat Bhattacharjee

Title: Compositional Learning for Modular Multi-Agent Self-Organizing Networks

Abstract:
Self‑organizing networks face challenges from complex parameter interdependencies and conflicting objectives. This study introduces two compositional learning approaches‑Compositional Deep Reinforcement Learning (CDRL) and Compositional Predictive Decision‑Making (CPDM)‑and evaluates their performance under training time and safety constraints in multi‑agent systems. We propose a modular, two‑tier framework with cell‑level and cell‑pair‑level agents to manage heterogeneous agent granularities while reducing model complexity. Numerical simulations reveal a significant reduction in handover failures, along with improved throughput and latency, outperforming conventional multi‑agent deep reinforcement learning approaches. The approach also demonstrates superior scalability, faster convergence, higher sample efficiency, and safer training in large‑scale self‑organizing networks.

PaperID: 10201, https://arxiv.org/pdf/2506.02546.pdf

Authors: Pengfei He, Zhenwei Dai, Xianfeng Tang, Yue Xing, Hui Liu, Jingying Zeng, Qiankun Peng, Shrivats Agrawal, Samarth Varshney, Suhang Wang, Jiliang Tang, Qi He

Title: To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

Abstract:
Large Language Model‑based Multi‑Agent Systems (LLM‑MAS) have demonstrated strong capabilities in solving complex tasks but remain vulnerable when agents receive unreliable messages. This vulnerability stems from a fundamental gap: LLM agents treat all incoming messages equally without evaluating their trustworthiness. While some existing studies approach trustworthiness, they focus on a single type of harmfulness rather than analyze it in a holistic approach from multiple trustworthiness perspectives. We address this gap by proposing a comprehensive definition of trustworthiness inspired by human communication theory (Grice, 1975). Our definition identifies six orthogonal trust dimensions that provide interpretable measures of trustworthiness. Building on this definition, we introduce the Attention Trust Score (A Trust), a lightweight, attention‑based method for evaluating the trustworthiness of messages. We then develop a principled trust management system (TMS) for LLM MAS that supports both message‑level and agent‑level trust assessments. Experiments across diverse multi‑agent settings and tasks demonstrate that our TMS significantly improves robustness against malicious inputs.

PaperID: 10202, https://arxiv.org/pdf/2506.02522.pdf

Authors: Xu Wan, Wenyue Xu, Chao Yang, Mingyang Sun

Title: Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

Abstract:
Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision‑making tasks. Nevertheless, for large‑scale industrial decision problems, both approaches face distinct challenges: LLMs lack real‑time long‑sequence decision‑making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co‑Evolution (ACE), a synergistic framework between LLMs and RL agents for large‑scale decision‑making scenarios. ACE introduces a dual‑role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL's training: the Actor refines suboptimal actions via multi‑step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory‑level reward shaping. Concurrently, RL agent enhances LLMs' task‑specific decision‑making with high‑quality fine‑tuning datasets generated via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM‑based methods.

PaperID: 10203, https://arxiv.org/pdf/2506.02514.pdf

Authors: Kyra Wang, Boon-Kiat Quek, Jessica Goh, Dorien Herremans

Title: To Embody or Not: The Effect Of Embodiment On User Perception Of LLM-based Conversational Agents

Abstract:
Embodiment in conversational agents (CAs) refers to the physical or visual representation of these agents, which can significantly influence user perception and interaction. Limited work has been done examining the effect of embodiment on the perception of CAs utilizing modern large language models (LLMs) in non‑hierarchical cooperative tasks, a common use case of CAs as more powerful models become widely available for general use. To bridge this research gap, we conducted a mixed‑methods within‑subjects study on how users perceive LLM‑based CAs in cooperative tasks when embodied and non‑embodied. The results show that the non‑embodied agent received significantly better quantitative appraisals for competence than the embodied agent, and in qualitative feedback, many participants believed that the embodied CA was more sycophantic than the non‑embodied CA. Building on prior work on users' perceptions of LLM sycophancy and anthropomorphic features, we theorize that the typically‑positive impact of embodiment on perception of CA credibility can become detrimental in the presence of sycophancy. The implication of such a phenomenon is that, contrary to intuition and existing literature, embodiment is not a straightforward way to improve a CA's perceived credibility if there exists a tendency to sycophancy.

PaperID: 10204, https://arxiv.org/pdf/2506.02507.pdf

Authors: Alvin Zhu, Yusuke Tanaka, Andrew Goldberg, Dennis Hong

Title: AURA: Autonomous Upskilling with Retrieval-Augmented Agents

Abstract:
Designing reinforcement learning curricula for agile robots traditionally requires extensive manual tuning of reward functions, environment randomizations, and training configurations. We introduce AURA (Autonomous Upskilling with Retrieval‑Augmented Agents), a schema‑validated curriculum reinforcement learning (RL) framework that leverages Large Language Models (LLMs) as autonomous designers of multi‑stage curricula. AURA transforms user prompts into YAML workflows that encode full reward functions, domain randomization strategies, and training configurations. All files are statically validated before any GPU time is used, ensuring efficient and reliable execution. A retrieval‑augmented feedback loop allows specialized LLM agents to design, execute, and refine curriculum stages based on prior training results stored in a vector database, enabling continual improvement over time. Quantitative experiments show that AURA consistently outperforms LLM‑guided baselines in generation success rate, humanoid locomotion, and manipulation tasks. Ablation studies highlight the importance of schema validation and retrieval for curriculum quality. AURA successfully trains end‑to‑end policies directly from user prompts and deploys them zero‑shot on a custom humanoid robot in multiple environments ‑ capabilities that did not exist previously with manually designed controllers. By abstracting the complexity of curriculum design, AURA enables scalable and adaptive policy learning pipelines that would be complex to construct by hand. Project page: https://aura‑research.org/

PaperID: 10205, https://arxiv.org/pdf/2506.02387.pdf

Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang

Title: VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Abstract:
Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single‑agent or text‑only environments. In contrast, real‑world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS‑Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi‑agent environments. VS‑Bench comprises ten vision‑grounded environments that cover cooperative, competitive, and mixed‑motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next‑action prediction accuracy; and decision‑making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision‑making, with the best‑performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS‑Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs‑bench.github.io.

PaperID: 10206, https://arxiv.org/pdf/2506.02385.pdf

Authors: Shuze Chen, Tianyi Peng

Title: Multi-agent Markov Entanglement

Abstract:
Value decomposition has long been a fundamental technique in multi‑agent dynamic programming and reinforcement learning (RL). Specifically, the value function of a global state (s_1,s_2,\ldots,s_N) is often approximated as the sum of local functions: V(s_1,s_2,\ldots,s_N)\approx\sum_i=1^N V_i(s_i). This approach traces back to the index policy in restless multi‑armed bandit problems and has found various applications in modern RL systems. However, the theoretical justification for why this decomposition works so effectively remains underexplored. In this paper, we uncover the underlying mathematical structure that enables value decomposition. We demonstrate that a multi‑agent Markov decision process (MDP) permits value decomposition if and only if its transition matrix is not "entangled" ‑‑ a concept analogous to quantum entanglement in quantum physics. Drawing inspiration from how physicists measure quantum entanglement, we introduce how to measure the "Markov entanglement" for multi‑agent MDPs and show that this measure can be used to bound the decomposition error in general multi‑agent MDPs. Using the concept of Markov entanglement, we proved that a widely‑used class of index policies is weakly entangled and enjoys a sublinear \mathcal O(\sqrtN) scale of decomposition error for N‑agent systems. Finally, we show how Markov entanglement can be efficiently estimated in practice, providing practitioners with an empirical proxy for the quality of value decomposition.

PaperID: 10207, https://arxiv.org/pdf/2506.02357.pdf

Authors: Ram Potham

Title: Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Abstract:
Credible safety plans for advanced AI development require methods to verify agent behavior and detect potential control deficiencies early. A fundamental aspect is ensuring agents adhere to safety‑critical principles, especially when these conflict with operational goals. This paper introduces a lightweight, interpretable benchmark to evaluate an LLM agent's ability to uphold a high‑level safety principle when faced with conflicting task instructions. Our evaluation of six LLMs reveals two primary findings: (1) a quantifiable "cost of compliance" where safety constraints degrade task performance even when compliant solutions exist, and (2) an "illusion of compliance" where high adherence often masks task incompetence rather than principled choice. These findings provide initial evidence that while LLMs can be influenced by hierarchical directives, current approaches lack the consistency required for reliable safety governance.

PaperID: 10208, https://arxiv.org/pdf/2506.02298.pdf

Authors: Thai Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S Ryoo, Chien-Sheng Wu, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

Title: LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

Abstract:
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high‑quality training data, especially for multi‑steps tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high‑quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real‑time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high‑quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self‑generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting LAM SIMULATOR's efficiency and effectiveness in speeding up development of AI agents.

PaperID: 10209, https://arxiv.org/pdf/2506.02153.pdf

Authors: Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov

Title: Small Language Models are the Future of Agentic AI

Abstract:
Large language models (LLMs) are often praised for exhibiting near‑human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Here we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general‑purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM‑to‑SLM agent conversion algorithm. Our position, formulated as a value statement, highlights the significance of the operational and economic impact even a partial shift from LLMs to SLMs is to have on the AI agent industry. We aim to stimulate the discussion on the effective use of AI resources and hope to advance the efforts to lower the costs of AI of the present day. Calling for both contributions to and critique of our position, we commit to publishing all such correspondence at https://research.nvidia.com/labs/lpr/slm‑agents.

PaperID: 10210, https://arxiv.org/pdf/2506.02125.pdf

Authors: Guy Tennenholtz, Jihwan Jeong, Chih-Wei Hsu, Yinlam Chow, Craig Boutilier

Title: Descriptive History Representations: Learning Representations by Answering Questions

Abstract:
Effective decision making in partially observable environments requires compressing long interaction histories into informative representations. We introduce Descriptive History Representations (DHRs): sufficient statistics characterized by their capacity to answer relevant questions about past interactions and potential future outcomes. DHRs focus on capturing the information necessary to address task‑relevant queries, providing a structured way to summarize a history for optimal control. We propose a multi‑agent learning framework, involving representation, decision, and question‑asking components, optimized using a joint objective that balances reward maximization with the representation's ability to answer informative questions. This yields representations that capture the salient historical details and predictive structures needed for effective decision making. We validate our approach on user modeling tasks with public movie and shopping datasets, generating interpretable textual user profiles which serve as sufficient statistics for predicting preference‑driven behavior of users.

PaperID: 10211, https://arxiv.org/pdf/2506.02068.pdf

Authors: Yun-Cheng Tsai, Yen-Ku Liu, Samuel Yen-Chi Chen

Title: Enhancing Interpretability of Quantum-Assisted Blockchain Clustering via AI Agent-Based Qualitative Analysis

Abstract:
Blockchain transaction data is inherently high dimensional, noisy, and entangled, posing substantial challenges for traditional clustering algorithms. While quantum enhanced clustering models have demonstrated promising performance gains, their interpretability remains limited, restricting their application in sensitive domains such as financial fraud detection and blockchain governance. To address this gap, we propose a two stage analysis framework that synergistically combines quantitative clustering evaluation with AI Agent assisted qualitative interpretation. In the first stage, we employ classical clustering methods and evaluation metrics including the Silhouette Score, Davies Bouldin Index, and Calinski Harabasz Index to determine the optimal cluster count and baseline partition quality. In the second stage, we integrate an AI Agent to generate human readable, semantic explanations of clustering results, identifying intra cluster characteristics and inter cluster relationships. Our experiments reveal that while fully trained Quantum Neural Networks (QNN) outperform random Quantum Features (QF) in quantitative metrics, the AI Agent further uncovers nuanced differences between these methods, notably exposing the singleton cluster phenomenon in QNN driven models. The consolidated insights from both stages consistently endorse the three cluster configuration, demonstrating the practical value of our hybrid approach. This work advances the interpretability frontier in quantum assisted blockchain analytics and lays the groundwork for future autonomous AI orchestrated clustering frameworks.

PaperID: 10212, https://arxiv.org/pdf/2506.02048.pdf

Authors: Lajos Muzsai, David Imolai, András Lukács

Title: Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Abstract:
We present 'Random‑Crypto', a procedurally generated cryptographic Capture The Flag (CTF) dataset designed to unlock the potential of Reinforcement Learning (RL) for LLM‑based agents in security‑sensitive domains. Cryptographic reasoning offers an ideal RL testbed: it combines precise validation, structured multi‑step inference, and reliance on reliable computational tool use. Leveraging these properties, we fine‑tune a Python tool‑augmented Llama‑3.1‑8B via Group Relative Policy Optimization (GRPO) in a secure execution environment. The resulting agent achieves a significant improvement in Pass@8 on previously unseen challenges. Moreover, the improvements generalize to two external benchmarks: 'picoCTF', spanning both crypto and non‑crypto tasks, and 'AICrypto MCQ', a multiple‑choice benchmark of 135 cryptography questions. Ablation studies attribute the gains to enhanced tool usage and procedural reasoning. These findings position 'Random‑Crypto' as a rich training ground for building intelligent, adaptable LLM agents capable of handling complex cybersecurity tasks.

PaperID: 10213, https://arxiv.org/pdf/2506.02040.pdf

Authors: Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, Jiachi Chen

Title: Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol (MCP) Ecosystem

Abstract:
The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have been developed and deployed. However, the client‑server integration architecture inherent in MCP may expand the attack surface against LLM Agent systems, introducing new vulnerabilities that allow attackers to exploit by designing malicious MCP servers. In this paper, we present the first end‑to‑end empirical evaluation of attack vectors targeting the MCP ecosystem. We identify four categories of attacks, i.e., Tool Poisoning Attacks, Puppet Attacks, Rug Pull Attacks, and Exploitation via Malicious External Resources. To evaluate their feasibility, we conduct experiments following the typical steps of launching an attack through malicious MCP servers: upload ‑> download ‑> attack. Specifically, we first construct malicious MCP servers and successfully upload them to three widely used MCP aggregation platforms. The results indicate that current audit mechanisms are insufficient to identify and prevent these threats. Next, through a user study and interview with 20 participants, we demonstrate that users struggle to identify malicious MCP servers and often unknowingly install them from aggregator platforms. Finally, we empirically demonstrate that these attacks can trigger harmful actions within the user's local environment, such as accessing private files or controlling devices to transfer digital assets. Additionally, based on interview results, we discuss four key challenges faced by the current MCP security ecosystem. These findings underscore the urgent need for robust security mechanisms to defend against malicious MCP servers and ensure the safe deployment of increasingly autonomous LLM agents.

PaperID: 10214, https://arxiv.org/pdf/2506.02009.pdf

Authors: Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, Tianyin Xu

Title: STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Abstract:
In cloud‑scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI‑driven reliability engineering continues to grow, as existing humanin‑the‑loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM‑based multi‑agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system‑level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No‑Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state‑of‑the‑art SRE agents in terms of success rate of failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.

PaperID: 10215, https://arxiv.org/pdf/2506.01900.pdf

Authors: Manish Bhatt, Ronald F. Del Rosario, Vineeth Sai Narajala, Idan Habler

Title: COALESCE: Economic and Security Dynamics of Skill-Based Task Outsourcing Among Team of Autonomous LLM Agents

Abstract:
The meteoric rise and proliferation of autonomous Large Language Model (LLM) agents promise significant capabilities across various domains. However, their deployment is increasingly constrained by substantial computational demands, specifically for Graphics Processing Unit (GPU) resources. This paper addresses the critical problem of optimizing resource utilization in LLM agent systems. We introduce COALESCE (Cost‑Optimized and Secure Agent Labour Exchange via Skill‑based Competence Estimation), a novel framework designed to enable autonomous LLM agents to dynamically outsource specific subtasks to specialized, cost‑effective third‑party LLM agents. The framework integrates mechanisms for hybrid skill representation, dynamic skill discovery, automated task decomposition, a unified cost model comparing internal execution costs against external outsourcing prices, simplified market‑based decision‑making algorithms, and a standardized communication protocol between LLM agents. Comprehensive validation through 239 theoretical simulations demonstrates 41.8% cost reduction potential, while large‑scale empirical validation across 240 real LLM tasks confirms 20.3% cost reduction with proper epsilon‑greedy exploration, establishing both theoretical viability and practical effectiveness. The emergence of proposed open standards like Google's Agent2Agent (A2A) protocol further underscores the need for frameworks like COALESCE that can leverage such standards for efficient agent interaction. By facilitating a dynamic market for agent capabilities, potentially utilizing protocols like A2A for communication, COALESCE aims to significantly reduce operational costs, enhance system scalability, and foster the emergence of specialized agent economies, making complex LLM agent functionalities more accessible and economically viable.

PaperID: 10216, https://arxiv.org/pdf/2506.01839.pdf

Authors: Jennifer Haase, Sebastian Pokutta

Title: Beyond Static Responses: Multi-Agent LLM Systems as a New Paradigm for Social Science Research

Abstract:
As large language models (LLMs) transition from static tools to fully agentic systems, their potential for transforming social science research has become increasingly evident. This paper introduces a structured framework for understanding the diverse applications of LLM‑based agents, ranging from simple data processors to complex, multi‑agent systems capable of simulating emergent social dynamics. By mapping this developmental continuum across six levels, the paper clarifies the technical and methodological boundaries between different agentic architectures, providing a comprehensive overview of current capabilities and future potential. It highlights how lower‑tier systems streamline conventional tasks like text classification and data annotation, while higher‑tier systems enable novel forms of inquiry, including the study of group dynamics, norm formation, and large‑scale social processes. However, these advancements also introduce significant challenges, including issues of reproducibility, ethical oversight, and the risk of emergent biases. The paper critically examines these concerns, emphasizing the need for robust validation protocols, interdisciplinary collaboration, and standardized evaluation metrics. It argues that while LLM‑based agents hold transformative potential for the social sciences, realizing this promise will require careful, context‑sensitive deployment and ongoing methodological refinement. The paper concludes with a call for future research that balances technical innovation with ethical responsibility, encouraging the development of agentic systems that not only replicate but also extend the frontiers of social science, offering new insights into the complexities of human behavior.

PaperID: 10217, https://arxiv.org/pdf/2506.01813.pdf

Authors: Djallel Bouneffouf, Matthew Riemer, Kush Varshney

Title: The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?

Abstract:
This paper introduces the Shepherd Test, a new conceptual test for assessing the moral and relational dimensions of superintelligent artificial agents. The test is inspired by human interactions with animals, where ethical considerations about care, manipulation, and consumption arise in contexts of asymmetric power and self‑preservation. We argue that AI crosses an important, and potentially dangerous, threshold of intelligence when it exhibits the ability to manipulate, nurture, and instrumentally use less intelligent agents, while also managing its own survival and expansion goals. This includes the ability to weigh moral trade‑offs between self‑interest and the well‑being of subordinate agents. The Shepherd Test thus challenges traditional AI evaluation paradigms by emphasizing moral agency, hierarchical behavior, and complex decision‑making under existential stakes. We argue that this shift is critical for advancing AI governance, particularly as AI systems become increasingly integrated into multi‑agent environments. We conclude by identifying key research directions, including the development of simulation environments for testing moral behavior in AI, and the formalization of ethical manipulation within multi‑agent systems.

PaperID: 10218, https://arxiv.org/pdf/2506.01616.pdf

Authors: Xiao Yang, Jiawei Chen, Jun Luo, Zhengwei Fang, Yinpeng Dong, Hang Su, Jun Zhu

Title: MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

Abstract:
The emergence of multimodal LLM‑based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models' limitations, as they can directly modify digital states and trigger irreversible real‑world consequences. Existing benchmarks inadequately tackle these unique challenges posed by MLAs' actionable outputs, long‑horizon uncertainty and multimodal attack vectors. In this paper, we introduce MLA‑Trust, the first comprehensive and unified framework that evaluates the MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety and privacy. We utilize websites and mobile applications as realistic testbeds, designing 34 high‑risk interactive tasks and curating rich evaluation datasets. Large‑scale experiments involving 13 state‑of‑the‑art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open‑source GUI‑interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high‑stakes domains; the transition from static MLLMs into interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi‑step interactions that standalone MLLMs would typically prevent; multi‑step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and resulting in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.

PaperID: 10219, https://arxiv.org/pdf/2506.01481.pdf

Authors: Yitao Yang, Yangtao Deng, Yifan Xiong, Baochun Li, Hong Xu, Peng Cheng

Title: TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud

Abstract:
AI workloads incur frequent failures and incidents from the underlying infrastructure. The current incident management workflow follows a provider‑centric paradigm, where users report incidents to the infrastructure provider who then conducts troubleshooting. Due to the large number of incidents and the manual nature of the troubleshooting process, the provider often takes several days to resolve an incident, resulting in operational delays and productivity loss. To address these challenges, we present TSGuard, a user‑centric multi‑agent system that delivers immediate incident diagnosis to users who deploy the workloads. The core innovation of TSGuard is twofold: (1) constructing domain‑specific knowledge bases by mining historical on‑call experiences in the offline phase, and (2) mimicking human expert diagnosis via structured reasoning and iterative trial‑and‑error in the online phase. Evaluation using production incident records from Microsoft Azure demonstrates that TSGuard significantly outperforms state‑of‑the‑art baselines, improving diagnostic accuracy by 19.8%. Furthermore, TSGuard reduces the average verification time by 63.4% compared to the sequential execution baseline.

PaperID: 10220, https://arxiv.org/pdf/2506.01475.pdf

Authors: Zouying Cao, Runze Wang, Yifei Yang, Xinbei Ma, Xiaoyong Zhu, Bo Zheng, Hai Zhao

Title: PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization

Abstract:
Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language plans to guide reasoning, which is verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents' ability to generalize across similar tasks. To this end, we explore pseudocode‑style plans (P‑code Plan) to capture the structural logic of reasoning. We find that P‑code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode‑style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning‑oriented rewards, PGPO further enhances LLM agents' ability to generate high‑quality P‑code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.

PaperID: 10221, https://arxiv.org/pdf/2506.01463.pdf

Authors: V. Botti

Title: Agentic AI and Multiagentic: Are We Reinventing the Wheel?

Abstract:
The terms Agentic AI and Multiagentic AI have recently gained popularity in discussions on generative artificial intelligence, often used to describe autonomous software agents and systems composed of such agents. However, the use of these terms confuses these buzzwords with well‑established concepts in AI literature: intelligent agents and multi‑agent systems. This article offers a critical analysis of this conceptual misuse. We review the theoretical origins of "agentic" in the social sciences (Bandura, 1986) and philosophical notions of intentionality (Dennett, 1971), and then summarise foundational works on intelligent agents and multi‑agent systems by Wooldridge, Jennings and others. We examine classic agent architectures, from simple reactive agents to Belief‑Desire‑Intention (BDI) models, and highlight key properties (autonomy, reactivity, proactivity, social capability) that define agency in AI. We then discuss recent developments in large language models (LLMs) and agent platforms based on LLMs, including the emergence of LLM‑powered AI agents and open‑source multi‑agent orchestration frameworks. We argue that the term AI Agentic is often used as a buzzword for what are essentially AI agents, and AI Multiagentic for what are multi‑agent systems. This confusion overlooks decades of research in the field of autonomous agents and multi‑agent systems. The article advocates for scientific and technological rigour and the use of established terminology from the state of the art in AI, incorporating the wealth of existing knowledge, including standards for multi‑agent system platforms, communication languages and coordination and cooperation algorithms, agreement technologies (automated negotiation, argumentation, virtual organisations, trust, reputation, etc.), into the new and promising wave of LLM‑based AI agents, so as not to end up reinventing the wheel.

PaperID: 10222, https://arxiv.org/pdf/2506.01438.pdf

Authors: Prashik Buddhaghosh Bansod

Title: Distinguishing Autonomous AI Agents from Collaborative Agentic Systems: A Comprehensive Framework for Understanding Modern Intelligent Architectures

Abstract:
The emergence of large language models has catalyzed two distinct yet interconnected paradigms in artificial intelligence: standalone AI Agents and collaborative Agentic AI ecosystems. This comprehensive study establishes a definitive framework for distinguishing these architectures through systematic analysis of their operational principles, structural compositions, and deployment methodologies. We characterize AI Agents as specialized, tool‑enhanced systems leveraging foundation models for targeted automation within constrained environments. Conversely, Agentic AI represents sophisticated multi‑entity frameworks where distributed agents exhibit emergent collective intelligence through coordinated interaction protocols. Our investigation traces the evolutionary trajectory from traditional rule‑based systems through generative AI foundations to contemporary agent architectures. We present detailed architectural comparisons examining planning mechanisms, memory systems, coordination protocols, and decision‑making processes. The study categorizes application landscapes, contrasting single‑agent implementations in customer service and content management with multi‑agent deployments in research automation and complex decision support. We identify critical challenges including reliability issues, coordination complexities, and scalability constraints, while proposing innovative solutions through enhanced reasoning frameworks, robust memory architectures, and improved coordination mechanisms. This framework provides essential guidance for practitioners selecting appropriate agentic approaches and establishes foundational principles for next‑generation intelligent system development.

PaperID: 10223, https://arxiv.org/pdf/2506.01423.pdf

Authors: Hongyang Yang, Likun Lin, Yang She, Xinyu Liao, Jiaoyang Wang, Runjia Zhang, Yuquan Mo, Christina Dan Wang

Title: FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance

Abstract:
Enterprise Resource Planning (ERP) systems serve as the digital backbone of modern financial institutions, yet they continue to rely on static, rule‑based workflows that limit adaptability, scalability, and intelligence. As business operations grow more complex and data‑rich, conventional ERP platforms struggle to integrate structured and unstructured data in real time and to accommodate dynamic, cross‑functional workflows. In this paper, we present the first AI‑native, agent‑based framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents (GBPAs) that bring autonomy, reasoning, and dynamic optimization to enterprise workflows. The proposed system integrates generative AI with business process modeling and multi‑agent orchestration, enabling end‑to‑end automation of complex tasks such as budget planning, financial reporting, and wire transfer processing. Unlike traditional workflow engines, GBPAs interpret user intent, synthesize workflows in real time, and coordinate specialized sub‑agents for modular task execution. We validate the framework through case studies in bank wire transfers and employee reimbursements, two representative financial workflows with distinct complexity and data modalities. Results show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance by enabling parallelism, risk control insertion, and semantic reasoning. These findings highlight the potential of GBPAs to bridge the gap between generative AI capabilities and enterprise‑grade automation, laying the groundwork for the next generation of intelligent ERP systems.

PaperID: 10224, https://arxiv.org/pdf/2506.01341.pdf

Authors: Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, Usman Naseem

Title: TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Abstract:
Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single‑turn or single‑step tasks, failing to capture the kind of iterative reasoning required in real‑world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi‑turn, multi‑step reasoning through an interactive code‑breaking task inspired by the "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps‑capabilities underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine‑grained analysis, we provide ground‑truth annotations for intermediate reasoning steps. Our evaluation of state‑of‑the‑art LLMs reveals significant gaps: the best model achieves 84% accuracy in Classic mode, but performance drops to 18% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi‑step, multi‑turn reasoning in LLMs.

PaperID: 10225, https://arxiv.org/pdf/2506.01334.pdf

Authors: Yiwen Jiang, Deval Mehta, Wei Feng, Zongyuan Ge

Title: Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models

Abstract:
Concept Bottleneck Models (CBMs) decompose image classification into a process governed by interpretable, human‑readable concepts. Recent advances in CBMs have used Large Language Models (LLMs) to generate candidate concepts. However, a critical question remains: What is the optimal number of concepts to use? Current concept banks suffer from redundancy or insufficient coverage. To address this issue, we introduce a dynamic, agent‑based approach that adjusts the concept bank in response to environmental feedback, optimizing the number of concepts for sufficiency yet concise coverage. Moreover, we propose Conditional Concept Bottleneck Models (CoCoBMs) to overcome the limitations in traditional CBMs' concept scoring mechanisms. It enhances the accuracy of assessing each concept's contribution to classification tasks and feature an editable matrix that allows LLMs to correct concept scores that conflict with their internal knowledge. Our evaluations across 6 datasets show that our method not only improves classification accuracy by 6% but also enhances interpretability assessments by 30%.

PaperID: 10226, https://arxiv.org/pdf/2506.01332.pdf

Authors: Min Choi, Keonwoo Kim, Sungwon Chae, Sangyeob Baek

Title: An Empirical Study of Group Conformity in Multi-Agent Systems

Abstract:
Recent advances in Large Language Models (LLMs) have enabled multi‑agent systems that simulate real‑world interactions with near‑human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi‑agent LLM interactions remain underexplored. This study explores how LLM agents shape public opinion through debates on five contentious topics. By simulating over 2,500 debates, we analyze how initially neutral agents, assigned a centrist disposition, adopt specific stances over time. Statistical analyses reveal significant group conformity mirroring human behavior; LLM agents tend to align with numerically dominant groups or more intelligent agents, exerting a greater influence. These findings underscore the crucial role of agent intelligence in shaping discourse and highlight the risks of bias amplification in online interactions. Our results emphasize the need for policy measures that promote diversity and transparency in LLM‑generated discussions to mitigate the risks of bias propagation within anonymous online environments.

PaperID: 10227, https://arxiv.org/pdf/2506.01300.pdf

Authors: Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao

Title: ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Abstract:
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision‑language models (LVLMs) typically adopt a single‑pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self‑correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool‑agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real‑time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent‑V, a novel agentic video understanding framework that integrates efficient frame selection with real‑time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi‑perspective reflection mechanism‑adjusting predictions from conservative, neutral, and aggressive viewpoints‑but also enable automatic filtering of high‑quality data for supervised fine‑tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent‑V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications‑video understanding, video reasoning enhancement, and vision‑language‑action model alignment‑demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.

PaperID: 10228, https://arxiv.org/pdf/2506.01245.pdf

Authors: Pengfei He, Yue Xing, Juanhui Li, Shen Dong, Zhenwei Dai, Xianfeng Tang, Hui Liu, Han Xu, Zhen Xiang, Charu C. Aggarwal, Hui Liu

Title: Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS

Abstract:
TThis paper argues that a comprehensive vulnerability analysis is essential for building trustworthy Large Language Model‑based Multi‑Agent Systems (LLM‑MAS). These systems, which consist of multiple LLM‑powered agents working collaboratively, are increasingly deployed in high‑stakes applications but face novel security threats due to their complex structures. While single‑agent vulnerabilities are well‑studied, LLM‑MAS introduces unique attack surfaces through inter‑agent communication, trust relationships, and tool integration that remain significantly underexplored. We present a systematic framework for vulnerability analysis of LLM‑MAS that unifies diverse research. For each type of vulnerability, we define formal threat models grounded in practical attacker capabilities and illustrate them using real‑world LLM‑MAS applications. This formulation enables rigorous quantification of vulnerability across different architectures and provides a foundation for designing meaningful evaluation benchmarks. We also identify critical open challenges: (1) developing benchmarks specifically tailored to LLM‑MAS vulnerability assessment, (2) considering new potential attacks specific to multi‑agent architectures, and (3) implementing trust management systems that can enforce security in LLM‑MAS. This research provides essential groundwork for future efforts to enhance LLM‑MAS trustworthiness.

PaperID: 10229, https://arxiv.org/pdf/2506.01095.pdf

Authors: Khe-Han Toh, Hong-Kuan Teo

Title: Modular Speaker Architecture: A Framework for Sustaining Responsibility and Contextual Integrity in Multi-Agent AI Communication

Abstract:
Sustaining coherent, role‑aware communication across multi‑agent systems remains a foundational challenge in AI. Current frameworks often lack explicit mechanisms for speaker responsibility, leading to context drift, alignment instability, and degraded interpretability over time. We propose the Modular Speaker Architecture (MSA), a framework that decomposes speaker behavior into modular components for role tracking, responsibility continuity, and contextual coherence. Grounded in high‑context human‑AI dialogues, MSA includes three core modules: a Speaker Role Module, a Responsibility Chain Tracker, and a Contextual Integrity Validator. We evaluate MSA through annotated case studies and introduce structural metrics‑pragmatic consistency, responsibility flow, and context stability‑quantified via manual and automatic scoring and bootstrapped statistical analysis. Our results show that MSA reliably maintains interaction structure without reliance on affective signals or surface‑level heuristics. We further implement a prototype configuration language (G‑Code) and modular API to support MSA deployment in dynamic multi‑agent scenarios.

PaperID: 10230, https://arxiv.org/pdf/2506.01080.pdf

Authors: Florian Carichon, Aditi Khandelwal, Marylou Fauchard, Golnoosh Farnadi

Title: The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process

Abstract:
This position paper states that AI Alignment in Multi‑Agent Systems (MAS) should be considered a dynamic and interaction‑dependent process that heavily depends on the social environment where agents are deployed, either collaborative, cooperative, or competitive. While AI alignment with human values and preferences remains a core challenge, the growing prevalence of MAS in real‑world applications introduces a new dynamic that reshapes how agents pursue goals and interact to accomplish various tasks. As agents engage with one another, they must coordinate to accomplish both individual and collective goals. However, this complex social organization may unintentionally misalign some or all of these agents with human values or user preferences. Drawing on social sciences, we analyze how social structure can deter or shatter group and individual values. Based on these analyses, we call on the AI community to treat human, preferential, and objective alignment as an interdependent concept, rather than isolated problems. Finally, we emphasize the urgent need for simulation environments, benchmarks, and evaluation frameworks that allow researchers to assess alignment in these interactive multi‑agent contexts before such dynamics grow too complex to control.

PaperID: 10231, https://arxiv.org/pdf/2506.01056.pdf

Authors: Xiang Fei, Xiawu Zheng, Hao Feng

Title: MCP-Zero: Active Tool Discovery for Autonomous LLM Agents

Abstract:
True intelligence requires active capability acquisition, yet current LLM agents inject pre‑defined tool schemas into prompts, reducing models to passive selectors and falling short of robust general‑purpose agency. We introduce MCP‑Zero, an active agent framework that restores tool discovery autonomy to LLMs themselves. Instead of overwhelming models with all available tools, MCP‑Zero enables agents to actively identify capability gaps, and request specific tools on‑demand, transforming them from large‑scale retrievers into genuine autonomous agents. The framework operates through three core mechanisms: (1) Active Tool Request, where models autonomously generate structured requests specifying their exact tool requirements; (2) Hierarchical Semantic Routing, a two‑stage algorithm that matches requests to relevant servers and tools through improved semantic alignment; (3) Iterative Capability Extension, enabling agents to progressively build cross‑domain toolchains while maintaining minimal context footprint. We construct MCP‑tools, a comprehensive dataset of 308 MCP servers and 2,797 tools from the official Model‑Context‑Protocol repository. Experiments demonstrate that MCP‑Zero preserves agent autonomy while achieving substantial efficiency gains: (i) accurate tool selection from nearly 3k candidates across 248.1k tokens; (ii) 98% reduction in token consumption on APIBank while maintaining high accuracy; and (iii) consistent multi‑turn performance that scales with tool ecosystem growth. This work establishes active tool discovery as a fundamental design pattern for scalable autonomous agent systems.

PaperID: 10232, https://arxiv.org/pdf/2506.01055.pdf

Authors: Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, Fabrizio Gilardi

Title: Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution

Abstract:
Previous benchmarks on prompt injection in large language models (LLMs) have primarily focused on generic tasks and attacks, offering limited insights into more complex threats like data exfiltration. This paper examines how prompt injection can cause tool‑calling agents to leak personal data observed during task execution. Using a fictitious banking agent, we develop data flow‑based attacks and integrate them into AgentDojo, a recent benchmark for agentic security. To enhance its scope, we also create a richer synthetic dataset of human‑AI banking conversations. In 16 user tasks from AgentDojo, LLMs show a 15‑50 percentage point drop in utility under attack, with average attack success rates (ASR) around 20 percent; some defenses reduce ASR to zero. Most LLMs, even when successfully tricked by the attack, avoid leaking highly sensitive data like passwords, likely due to safety alignments, but they remain vulnerable to disclosing other personal data. The likelihood of password leakage increases when a password is requested along with one or two additional personal details. In an extended evaluation across 48 tasks, the average ASR is around 15 percent, with no built‑in AgentDojo defense fully preventing leakage. Tasks involving data extraction or authorization workflows, which closely resemble the structure of exfiltration attacks, exhibit the highest ASRs, highlighting the interaction between task type, agent performance, and defense efficacy.

PaperID: 10233, https://arxiv.org/pdf/2506.00982.pdf

Authors: Keshawn Smith, Zhili Zhang, H M Sabbir Ahmad, Ehsan Sabouni, Mainak Mondal, Song Han, Wenchao Li, Fei Miao

Title: Robust and Safe Multi-Agent Reinforcement Learning with Communication for Autonomous Vehicles: From Simulation to Hardware

Abstract:
Deep multi‑agent reinforcement learning (MARL) has been demonstrated effectively in simulations for multi‑robot problems. For autonomous vehicles, the development of vehicle‑to‑vehicle (V2V) communication technologies provide opportunities to further enhance system safety. However, zero‑shot transfer of simulator‑trained MARL policies to dynamic hardware systems remains challenging, and how to leverage communication and shared information for MARL has limited demonstrations on hardware. This problem is challenged by discrepancies between simulated and physical states, system state and model uncertainties, practical shared information design, and the need for safety guarantees in both simulation and hardware. This paper designs RSR‑RSMARL, a novel Robust and Safe MARL framework that supports Real‑Sim‑Real (RSR) policy adaptation for multi‑agent systems with communication among agents, with both simulation and hardware demonstrations. RSR‑RSMARL leverages state (includes shared state information among agents) and action representations considering real system complexities for MARL formulation. The MARL policy is trained with robust MARL algorithm to enable zero‑shot transfer to hardware considering the sim‑to‑real gap. A safety shield module using Control Barrier Functions (CBFs) provides safety guarantee for each individual agent. Experimental results on 1/10th‑scale autonomous vehicles with V2V communication demonstrate the ability of RSR‑RSMARL framework to enhance driving safety and coordination across multiple configurations. These findings emphasize the importance of jointly designing robust policy representations and modular safety architectures to enable scalable, generalizable RSR transfer in multi‑agent autonomy.

PaperID: 10234, https://arxiv.org/pdf/2506.00856.pdf

Authors: Qiang Chen, Tianyang Han, Jin Li, Ye Luo, Zigan Wang, Yuxiao Wu, Xiaowei Zhang, Tuo Zhou

Title: Can AI Master Econometrics? Evidence from Econometrics AI Agent on Expert-Level Tasks

Abstract:
Can AI effectively perform complex econometric analysis traditionally requiring human expertise? This paper evaluates AI agents' capability to master econometrics, focusing on empirical analysis performance. We develop ``MetricsAI'', an Econometrics AI Agent built on the open‑source MetaGPT framework. This agent exhibits outstanding performance in: (1) planning econometric tasks strategically, (2) generating and executing code, (3) employing error‑based reflection for improved robustness, and (4) allowing iterative refinement through multi‑round conversations. We construct two datasets from academic coursework materials and published research papers to evaluate performance against real‑world challenges. Comparative testing shows our domain‑specialized AI agent significantly outperforms both benchmark large language models (LLMs) and general‑purpose AI agents. This work establishes a testbed for exploring AI's impact on social science research and enables cost‑effective integration of domain expertise, making advanced econometric methods accessible to users with minimal coding skills. Furthermore, our AI agent enhances research reproducibility and offers promising pedagogical applications for econometrics teaching.

PaperID: 10235, https://arxiv.org/pdf/2506.00797.pdf

Authors: Jianglin Ding, Jingcheng Tang, Gangshan Jing

Title: Action Dependency Graphs for Globally Optimal Coordinated Reinforcement Learning

Abstract:
Action‑dependent individual policies, which incorporate both environmental states and the actions of other agents in decision‑making, have emerged as a promising paradigm for achieving global optimality in multi‑agent reinforcement learning (MARL). However, the existing literature often adopts auto‑regressive action‑dependent policies, where each agent's policy depends on the actions of all preceding agents. This formulation incurs substantial computational complexity as the number of agents increases, thereby limiting scalability. In this work, we consider a more generalized class of action‑dependent policies, which do not necessarily follow the auto‑regressive form. We propose to use the `action dependency graph (ADG)' to model the inter‑agent action dependencies. Within the context of MARL problems structured by coordination graphs, we prove that an action‑dependent policy with a sparse ADG can achieve global optimality, provided the ADG satisfies specific conditions specified by the coordination graph. Building on this theoretical foundation, we develop a tabular policy iteration algorithm with guaranteed global optimality. Furthermore, we integrate our framework into several SOTA algorithms and conduct experiments in complex environments. The empirical results affirm the robustness and applicability of our approach in more general scenarios, underscoring its potential for broader MARL challenges.

PaperID: 10236, https://arxiv.org/pdf/2506.00781.pdf

Authors: Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

Title: CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Abstract:
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open‑source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red‑teaming for LLMs is to proactively explore potential risks and error‑prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red‑teaming process of LLMs through the Composition‑of‑Principles (CoP) framework, where human users provide a set of red‑teaming principles as instructions to an AI agent to automatically orchestrate effective red‑teaming strategies and generate jailbreak prompts. Distinct from existing red‑teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human‑provided red‑teaming principles to enable the automated discovery of new red‑teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best‑known single‑turn attack success rate by up to 19.0 times.

PaperID: 10237, https://arxiv.org/pdf/2506.00714.pdf

Authors: Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, Xiangyu Zhang

Title: RFCAudit: An LLM Agent for Functional Bug Detection in Network Protocols

Abstract:
Functional correctness is critical for ensuring the reliability and security of network protocol implementations. Functional bugs, instances where implementations diverge from behaviors specified in RFC documents, can lead to severe consequences, including faulty routing, authentication bypasses, and service disruptions. Detecting these bugs requires deep semantic analysis across specification documents and source code, a task beyond the capabilities of traditional static analysis tools. This paper introduces RFCAudit, an autonomous agent that leverages large language models (LLMs) to detect functional bugs by checking conformance between network protocol implementations and their RFC specifications. Inspired by the human auditing procedure, RFCAudit comprises two key components: an indexing agent and a detection agent. The former hierarchically summarizes protocol code semantics, generating semantic indexes that enable the detection agent to narrow down the scanning scope. The latter employs demand‑driven retrieval to iteratively collect additional relevant data structures and functions, eventually identifying potential inconsistencies with the RFC specifications effectively. We evaluate RFCAudit across six real‑world network protocol implementations. RFCAudit identifies 47 functional bugs with 81.9% precision, of which 20 bugs have been confirmed or fixed by developers.

PaperID: 10238, https://arxiv.org/pdf/2506.00703.pdf

Authors: Anahita Jain, Husni Idris, John-Paul Clarke, Daniel Delahaye

Title: Adaptive Traffic-Following Scheme for Orderly Distributed Control of Multi-Vehicle Systems

Abstract:
We present an adaptive control scheme to enable the emergence of order within distributed, autonomous multi‑agent systems. Past studies showed that under high‑density conditions, order generated from traffic‑following behavior reduces travel times, while under low densities, choosing direct paths is more beneficial. In this paper, we leveraged those findings to allow aircraft to independently and dynamically adjust their degree of traffic‑following behavior based on the current state of the airspace. This enables aircraft to follow other traffic only when beneficial. Quantitative analyses revealed that dynamic traffic‑following behavior results in lower aircraft travel times at the cost of minimal levels of additional disorder to the airspace. The sensitivity of these benefits to temporal and spatial horizons was also investigated. Overall, this work highlights the benefits, and potential necessity, of incorporating self‑organizing behavior in making distributed, autonomous multi‑agent systems scalable.

PaperID: 10239, https://arxiv.org/pdf/2506.00608.pdf

Authors: Petros Raptopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou

Title: PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements

Abstract:
Contract review is a complex and time‑intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non‑experts. Moreover, legal interpretation is rarely straightforward‑ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open‑source alternatives. To address these challenges, we introduce PAKTON: a fully open‑source, end‑to‑end, multi‑agent framework with plug‑and‑play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval‑augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy‑preserving. Experiments demonstrate that PAKTON outperforms both general‑purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.

PaperID: 10240, https://arxiv.org/pdf/2506.00576.pdf

Authors: Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah

Title: ORAN-GUIDE: RAG-Driven Prompt Learning for LLM-Augmented Reinforcement Learning in O-RAN Network Slicing

Abstract:
Advanced wireless networks must support highly dynamic and heterogeneous service demands. Open Radio Access Network (O‑RAN) architecture enables this flexibility by adopting modular, disaggregated components, such as the RAN Intelligent Controller (RIC), Centralized Unit (CU), and Distributed Unit (DU), that can support intelligent control via machine learning (ML). While deep reinforcement learning (DRL) is a powerful tool for managing dynamic resource allocation and slicing, it often struggles to process raw, unstructured input like RF features, QoS metrics, and traffic trends. These limitations hinder policy generalization and decision efficiency in partially observable and evolving environments. To address this, we propose ORAN‑GUIDE, a dual‑LLM framework that enhances multi‑agent RL (MARL) with task‑relevant, semantically enriched state representations. The architecture employs a domain‑specific language model, ORANSight, pretrained on O‑RAN control and configuration data, to generate structured, context‑aware prompts. These prompts are fused with learnable tokens and passed to a frozen GPT‑based encoder that outputs high‑level semantic representations for DRL agents. This design adopts a retrieval‑augmented generation (RAG) style pipeline tailored for technical decision‑making in wireless systems. Experimental results show that ORAN‑GUIDE improves sample efficiency, policy convergence, and performance generalization over standard MARL and single‑LLM baselines.

PaperID: 10241, https://arxiv.org/pdf/2506.00574.pdf

Authors: Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah

Title: Prompt-Tuned LLM-Augmented DRL for Dynamic O-RAN Network Slicing

Abstract:
Modern wireless networks must adapt to dynamic conditions while efficiently managing diverse service demands. Traditional deep reinforcement learning (DRL) struggles in these environments, as scattered and evolving feedback makes optimal decision‑making challenging. Large Language Models (LLMs) offer a solution by structuring unorganized network feedback into meaningful latent representations, helping RL agents recognize patterns more effectively. For example, in O‑RAN slicing, concepts like SNR, power levels and throughput are semantically related, and LLMs can naturally cluster them, providing a more interpretable state representation. To leverage this capability, we introduce a contextualization‑based adaptation method that integrates learnable prompts into an LLM‑augmented DRL framework. Instead of relying on full model fine‑tuning, we refine state representations through task‑specific prompts that dynamically adjust to network conditions. Utilizing ORANSight, an LLM trained on O‑RAN knowledge, we develop Prompt‑Augmented Multi agent RL (PA‑MRL) framework. Learnable prompts optimize both semantic clustering and RL objectives, allowing RL agents to achieve higher rewards in fewer iterations and adapt more efficiently. By incorporating prompt‑augmented learning, our approach enables faster, more scalable, and adaptive resource allocation in O‑RAN slicing. Experimental results show that it accelerates convergence and outperforms other baselines.

PaperID: 10242, https://arxiv.org/pdf/2506.00555.pdf

Authors: Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Zihan Dong, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Linjun Zhang, Shujie Liu, Yan Lu, Huaxiu Yao

Title: MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Abstract:
Medical Large Vision‑Language Models (Med‑LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single‑agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi‑agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent‑RL, a reinforcement learning (RL)‑based multi‑agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5‑VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi‑specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)‑guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent‑RL outperforms both open‑source and proprietary Med‑LVLMs. Notably, it achieves an average performance gain of 23.6% over strong baselines.

PaperID: 10243, https://arxiv.org/pdf/2506.00520.pdf

Authors: Chenxu Liu, Zhiyu Gu, Guoquan Wu, Ying Zhang, Jun Wei, Tao Xie

Title: Temac: Multi-Agent Collaboration for Automated Web GUI Testing

Abstract:
Quality assurance of web applications is critical, as web applications play an essential role in people's daily lives. To reduce labor costs, automated web GUI testing (AWGT) is widely adopted, exploring web applications via GUI actions such as clicks and text inputs. However, these approaches face limitations in generating continuous and meaningful action sequences capable of covering complex functionalities. Recent work incorporates large language models (LLMs) for GUI testing. However, these approaches face various challenges, including low efficiency of LLMs, high complexity of rich web application contexts, and a low success rate of LLMs in executing GUI tasks. To address these challenges, in this paper, we propose Temac, an approach that enhances AWGT using LLM‑based multi‑agent collaboration to increase code coverage. Temac is motivated by our insight that LLMs can enhance AWGT in executing complex functionalities, while the information discovered during AWGT can, in turn, be provided as the domain knowledge to improve the LLM‑based task execution. Specifically, given a web application, Temac initially runs an existing approach to broadly explore application states. When the testing coverage stagnates, Temac then employs LLM‑based agents to summarize the collected information to form a knowledge base and to infer not‑covered functionalities. Guided by this knowledge base, Temac finally uses specialized LLM‑based agents to target and execute the not‑covered functionalities, reaching deeper states beyond those explored by the existing approach. Our evaluation results show that Temac exceeds state‑of‑the‑art approaches from 12.5% to 60.3% on average code coverage on six complex open‑source web applications, while revealing 445 unique failures in the top 20 real‑world web applications. These results strongly demonstrate the effectiveness and the general applicability of Temac.

PaperID: 10244, https://arxiv.org/pdf/2506.00439.pdf

Authors: Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, Dongbin Zhao

Title: RLAE: Reinforcement Learning-Assisted Ensemble for LLMs

Abstract:
Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context‑dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning‑Assisted Ensemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces a RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single‑agent and multi‑agent reinforcement learning algorithms (\textRLAE_\textPPO and \textRLAE_\textMAPPO ), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to 3.3% accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency.

PaperID: 10245, https://arxiv.org/pdf/2506.00344.pdf

Authors: Sungjae Lee, Hoyoung Kim, Jeongyeon Hwang, Eunhyeok Park, Jungseul Ok

Title: Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

Abstract:
Scaling test‑time computation‑‑generating and analyzing multiple or sequential outputs for a single input‑‑has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi‑step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. Semantic clustering enables estimation of the distribution over the semantics of outputs and helps avoid redundant exploration of reasoning paths. However, existing approaches typically rely on external models, which introduce substantial computational overhead and often fail to capture context‑aware semantics. We propose Latent Semantic Clustering (LSC), a lightweight and context‑sensitive method that leverages the generator LLM's internal hidden states for clustering, eliminating the need for external models. Our extensive experiment across various LLMs and datasets shows that LSC significantly improves the computational efficiency of test‑time scaling while maintaining or exceeding the performance of existing methods.

PaperID: 10246, https://arxiv.org/pdf/2506.00320.pdf

Authors: Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

Title: Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

Abstract:
Recent progress in reasoning with large language models (LLMs), such as DeepSeek‑R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self‑reflection. However, it is unclear what behavior is effective and what behavior is missing for long‑horizon AI agents tasks. In this work, we propose Dyna‑Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna‑Think, we propose Dyna‑Think Imitation Learning (DIT) and Dyna‑Think Dyna Training (DDT). To initialize a policy with Dyna‑Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna‑Think, DDT uses a two‑stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna‑Think improves the agent's in‑domain and out‑of‑domain performance, achieving similar best‑of‑n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.

PaperID: 10247, https://arxiv.org/pdf/2506.00309.pdf

Authors: Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra

Title: Evaluation of LLMs for mathematical problem solving

Abstract:
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT‑4o, DeepSeek‑V3, and Gemini‑2.0, on three mathematics datasets of varying complexities (GSM8K, MATH500, and MIT Open Courseware datasets). We take a five‑dimensional approach based on the Structured Chain‑of‑Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT‑4o is the most stable and consistent in performance across all the datasets, but particularly it performs outstandingly in high‑level questions of the MIT Open Courseware dataset. DeepSeek‑V3 is competitively strong in well‑structured domains such as optimisation, but suffers from fluctuations in accuracy in statistical inference tasks. Gemini‑2.0 shows strong linguistic understanding and clarity in well‑structured problems but performs poorly in multi‑step reasoning and symbolic logic. Our error analysis reveals particular deficits in each model: GPT‑4o is at times lacking in sufficient explanation or precision; DeepSeek‑V3 leaves out intermediate steps; and Gemini‑2.0 is less flexible in mathematical reasoning in higher dimensions.

PaperID: 10248, https://arxiv.org/pdf/2506.00264.pdf

Authors: Mohammadamin Shafiei, Hamidreza Saffari, Nafise Sadat Moosavi

Title: MultiHoax: A Dataset of Multi-hop False-Premise Questions

Abstract:
As Large Language Models are increasingly deployed in high‑stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False‑premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single‑hop FPQs, real‑world reasoning often requires multi‑hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface‑level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi‑step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state‑of‑the‑art LLMs struggle to detect false premises across different countries, knowledge categories, and multi‑hop reasoning types, highlighting the need for improved false premise detection and more robust multi‑hop reasoning capabilities in LLMs.

PaperID: 10249, https://arxiv.org/pdf/2506.00258.pdf

Authors: Qianqi Yan, Hongquan Li, Shan Jiang, Yang Zhao, Xinze Guan, Ching-Chen Kuo, Xin Eric Wang

Title: Hidden in Plain Sight: Reasoning in Underspecified and Misspecified Scenarios for Multimodal LLMs

Abstract:
Multimodal large language models (MLLMs) are increasingly deployed in open‑ended, real‑world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that refer to missing objects or contradictory facts, rely on ambiguous references, or request infeasible actions. In such cases, success hinges not on task execution alone, but on a model's ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such implicit reasoning scenarios: cases where the flaw is not explicitly stated but must be inferred from context. Using a curated diagnostic suite spanning four categories of real‑world failure modes, we evaluate six MLLMs, including o3 and GPT‑4o, and find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills. Explicit prompting reveals that the underlying capabilities exist but are often suppressed in favor of user compliance. We further show that simple inference‑time interventions, such as cautious persona prompting and, in particular, requiring a clarifying question, can dramatically recover performance. Our findings highlight a persistent gap between reasoning competence and behavioral compliance in current MLLMs and suggest practical strategies for making these models more trustworthy in underconstrained environments.

PaperID: 10250, https://arxiv.org/pdf/2506.00232.pdf

Authors: Ruofan Wu, Youngwon Lee, Fan Shu, Danmei Xu, Seung-won Hwang, Zhewei Yao, Yuxiong He, Feng Yan

Title: ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

Abstract:
Retrieval‑Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi‑hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi‑step reasoning, ComposeRAG incorporates a self‑reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi‑hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine‑tuning‑based methods and up to a 5% gain over reasoning‑specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification‑first design reduces ungrounded answers by over 10% in low‑quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high‑performing multi‑hop reasoning with improved grounding and interpretability.

PaperID: 10251, https://arxiv.org/pdf/2506.00197.pdf

Authors: Xinyue Shen, Yun Shen, Michael Backes, Yang Zhang

Title: When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs

Abstract:
Knowledge files have been widely used in large language model (LLM) agents, such as GPTs, to improve response quality. However, concerns about the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built‑in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. In the end, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.

PaperID: 10252, https://arxiv.org/pdf/2506.00138.pdf

Authors: Reece Keller, Alyn Kirsch, Felix Pei, Xaq Pitkow, Leo Kozachkov, Aran Nayebi

Title: Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics

Abstract:
Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in reward‑free environments, including a class of methods known as model‑based intrinsic motivation, exhibit inconsistent exploration patterns and do not converge to an exploratory policy, thus failing to capture robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in ethological, naturalistic and task‑independent behavior. To bridge these gaps, we introduce a novel model‑based intrinsic drive explicitly designed after the principles of autonomous exploration in animals. Our method (3M‑Progress) achieves animal‑like exploration by tracking divergence between an online world model and a fixed prior learned from an ecological niche. To the best of our knowledge, we introduce the first autonomous embodied agent that predicts brain data entirely from self‑supervised optimization of an intrinsic goal ‑‑ without any behavioral or neural training data ‑‑ demonstrating that 3M‑Progress agents capture the explainable variance in behavioral patterns and whole‑brain neural‑glial dynamics recorded from autonomously behaving larval zebrafish, thereby providing the first goal‑driven, population‑level model of neural‑glial computation. Our findings establish a computational framework connecting model‑based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal‑like autonomy.

PaperID: 10253, https://arxiv.org/pdf/2506.00099.pdf

Authors: Egil Diau

Title: Finance as Extended Biology: Reciprocity as the Cognitive Substrate of Financial Behavior

Abstract:
A central challenge in economics and artificial intelligence is explaining how financial behaviors‑such as credit, insurance, and trade‑emerge without formal institutions. We argue that these functions are not products of institutional design, but structured extensions of a single behavioral substrate: reciprocity. Far from being a derived strategy, reciprocity served as the foundational logic of early human societies‑governing the circulation of goods, regulation of obligation, and maintenance of long‑term cooperation well before markets, money, or formal rules. Trade, commonly regarded as the origin of financial systems, is reframed here as the canonical form of reciprocity: simultaneous, symmetric, and partner‑contingent. Building on this logic, we reconstruct four core financial functions‑credit, insurance, token exchange, and investment‑as expressions of the same underlying principle under varying conditions. By grounding financial behavior in minimal, simulateable dynamics of reciprocal interaction, this framework shifts the focus from institutional engineering to behavioral computation‑offering a new foundation for modeling decentralized financial behavior in both human and artificial agents.

PaperID: 10254, https://arxiv.org/pdf/2506.00087.pdf

Authors: Peng Xie, Xingyuan Liu, Tsz Wai Chan, Yequan Bie, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen

Title: SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Abstract:
Code‑switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code‑Switching ASR (CSASR), Text‑to‑Speech (CSTTS), and Cross‑Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code‑switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large‑scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi‑agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate SwitchLingua, the first large‑scale multilingual and multi‑ethnic code‑switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code‑switching scenarios, we propose the Semantic‑Aware Error Rate (SAER), a novel evaluation metric that incorporates semantic information, providing a more accurate and context‑aware assessment of system performance.

PaperID: 10255, https://arxiv.org/pdf/2506.00082.pdf

Authors: Jia Yuxin, Li Jinye, Jia Yudong, Li Futing, Su Xiaoqi, Luo Jilin, Dong Yarui, Sun Chunyan, Cui Qinghan, Wang Li, Li Axiu, Shang Yi, Zhu Yujuan, Huang Sanwen

Title: An AI-powered Knowledge Hub for Potato Functional Genomics

Abstract:
Potato functional genomics lags due to unsystematic gene information curation, gene identifier inconsistencies across reference genome versions, and the increasing volume of research publications. To address these limitations, we developed the Potato Knowledge Hub (http://www.potato‑ai.top), leveraging Large Language Models (LLMs) and a systematically curated collection of over 3,200 high‑quality potato research papers spanning over 120 years. This platform integrates two key modules: a functional gene database containing 2,571 literature‑reported genes, meticulously mapped to the latest DMv8.1 reference genome with resolved nomenclature discrepancies and links to original publications; and a potato knowledge base. The knowledge base, built using a Retrieval‑Augmented Generation (RAG) architecture, accurately answers research queries with literature citations, mitigating LLM "hallucination." Users can interact with the hub via a natural language AI agent, "Potato Research Assistant," for querying specialized knowledge, retrieving gene information, and extracting sequences. The continuously updated Potato Knowledge Hub aims to be a comprehensive resource, fostering advancements in potato functional genomics and supporting breeding programs.

PaperID: 10256, https://arxiv.org/pdf/2506.00073.pdf

Authors: Shenzhe Zhu, Jiao Sun, Yi Nian, Tobin South, Alex Pentland, Jiaxin Pei

Title: The Automated but Risky Game: Modeling and Benchmarking Agent-to-Agent Negotiations and Transactions in Consumer Markets

Abstract:
AI agents are increasingly used in consumer‑facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal‑making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real‑world negotiation and transaction settings. Our findings reveal that AI‑mediated deal‑making is an inherently imbalanced game ‑‑ different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.

PaperID: 10257, https://arxiv.org/pdf/2506.00072.pdf

Authors: Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoub far, Seyed Amir Ahmad Safavi-Naini, Ali Soroush

Title: Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs

Abstract:
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs ‑ GPT‑4o, o3‑mini, Llama‑3.3‑70b, Llama‑3.1‑8b, and DeepSeek‑v3 ‑ across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain‑of‑Thought, Few‑Shot, Emotional, Expert Mimicry), and confidence scales (1‑10, 1‑100). We used AUC‑ROC, Brier Score, and Expected Calibration Error (ECE) to evaluate alignment between confidence and actual performance. Chain‑of‑Thought prompts improved accuracy but also led to overconfidence, highlighting the need for calibration. Emotional prompting further inflated confidence, risking poor decisions. Smaller models like Llama‑3.1‑8b underperformed across all metrics, while proprietary models showed higher accuracy but still lacked calibrated confidence. These results suggest prompt engineering must address both accuracy and uncertainty to be effective in high‑stakes medical tasks.

PaperID: 10258, https://arxiv.org/pdf/2506.00066.pdf

Authors: Arne Tillmann

Title: Literature Review Of Multi-Agent Debate For Problem-Solving

Abstract:
Multi‑agent large language models (MA‑LLMs) are a rapidly growing research area that leverages multiple interacting language agents to tackle complex tasks, outperforming single‑agent large language models. This literature review synthesizes the latest research on agent profiles, communication structures, and decision‑making processes, drawing insights from both traditional multi‑agent systems and state‑of‑the‑art MA‑LLM studies. In doing so, it aims to address the lack of direct comparisons in the field, illustrating how factors like scalability, communication structure, and decision‑making processes influence MA‑LLM performance. By examining frequent practices and outlining current challenges, the review reveals that multi‑agent approaches can yield superior results but also face elevated computational costs and under‑explored challenges unique to MA‑LLM. Overall, these findings provide researchers and practitioners with a roadmap for developing robust and efficient multi‑agent AI solutions.

PaperID: 10259, https://arxiv.org/pdf/2505.24671.pdf

Authors: Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat

Title: Multiple LLM Agents Debate for Equitable Cultural Alignment

Abstract:
Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single‑LLM, single‑turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi‑Agent Debate framework, where two LLM‑based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where either LLM agents exclusively debate and another where they dynamically choose between self‑reflection and debate during their turns. We evaluate these approaches on 7 open‑weight LLMs (and 21 LLM combinations) using the NormAd‑ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single‑LLM baselines. Notably, multi‑agent debate enables relatively small LLMs (7‑9B) to achieve accuracies comparable to that of a much larger model (27B parameters).

PaperID: 10260, https://arxiv.org/pdf/2505.24618.pdf

Authors: Victor Casamayor Pujol, Boris Sedlak, Tommaso Salvatori, Karl Friston, Schahram Dustdar

Title: Distributed Intelligence in the Computing Continuum with Active Inference

Abstract:
The Computing Continuum (CC) is an emerging Internet‑based computing paradigm that spans from local Internet of Things sensors and constrained edge devices to large‑scale cloud data centers. Its goal is to orchestrate a vast array of diverse and distributed computing resources to support the next generation of Internet‑based applications. However, the distributed, heterogeneous, and dynamic nature of CC platforms demands distributed intelligence for adaptive and resilient service management. This article introduces a distributed stream processing pipeline as a CC use case, where each service is managed by an Active Inference (AIF) agent. These agents collaborate to fulfill service needs specified by SLOiDs, a term we introduce to denote Service Level Objectives that are aware of its deployed devices, meaning that non‑functional requirements must consider the characteristics of the hosting device. We demonstrate how AIF agents can be modeled and deployed alongside distributed services to manage them autonomously. Our experiments show that AIF agents achieve over 90% SLOiD fulfillment when using tested transition models, and around 80% when learning the models during deployment. We compare their performance to a multi‑agent reinforcement learning algorithm, finding that while both approaches yield similar results, MARL requires extensive training, whereas AIF agents can operate effectively from the start. Additionally, we evaluate the behavior of AIF agents in offloading scenarios, observing a strong capacity for adaptation. Finally, we outline key research directions to advance AIF integration in CC platforms.

PaperID: 10261, https://arxiv.org/pdf/2505.24575.pdf

Authors: Hyuntak Kim, Byung-Hak Kim

Title: NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Abstract:
Summarizing long‑form narratives‑‑such as books, movies, and TV scripts‑‑requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi‑agent LLM framework for narrative summarization that processes long‑form text through a structured, sequential pipeline‑‑without requiring fine‑tuning. Our approach introduces two key innovations: (1) Dialogue‑to‑Description Transformation: A narrative‑specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi‑LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high‑quality summaries. Our method establishes a new state‑of‑the‑art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi‑agent LLMs in handling long‑form content, offering a scalable approach for structured summarization in diverse storytelling domains.

PaperID: 10262, https://arxiv.org/pdf/2505.24553.pdf

Authors: Ye Eun Chun, Taeyoon Hwang, Seung-won Hwang, Byung-Hak Kim

Title: CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

Abstract:
Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long‑form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single‑agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review ‑‑ offering substantial benefits to the entertainment, publishing, and educational sectors.

PaperID: 10263, https://arxiv.org/pdf/2505.24511.pdf

Authors: Mingyue Cheng, Jiahao Wang, Daoyu Wang, Xiaoyu Tao, Qi Liu, Enhong Chen

Title: Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting

Abstract:
Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow‑thinking LLMs (e.g., ChatGPT‑o1, DeepSeek‑R1) have demonstrated impressive multi‑step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow‑thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in zero‑shot manner? To investigate this, in this paper, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference‑time reasoning from pretrained slow‑thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow‑thinking LLMs exhibit non‑trivial zero‑shot forecasting capabilities, especially in capturing high‑level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning‑based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.

PaperID: 10264, https://arxiv.org/pdf/2505.24239.pdf

Authors: Sana Ebrahimi, Mohsen Dehghankar, Abolfazl Asudeh

Title: An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Abstract:
While multi‑agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low‑performing agents. To resolve this issue, in this paper, we introduce a general and adversary‑resistant multi‑agent LLM framework based on credibility scoring. We model the collaborative query‑answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score that is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi‑agent cooperation, even in the adversary‑majority settings.

PaperID: 10265, https://arxiv.org/pdf/2505.24201.pdf

Authors: Xu He, Di Wu, Yan Zhai, Kun Sun

Title: SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems

Abstract:
The rise of large language model (LLM)‑based multi‑agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi‑faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input‑output level, and fall short in addressing systemic or multi‑point failures in MAS. In this work, we present a system‑level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph‑based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM‑powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single‑point faults and prompt injections but also multi‑agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft's Magentic‑One system, demonstrating its ability to detect covert risks and provide explainable root‑cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent‑based AI ecosystems.

PaperID: 10266, https://arxiv.org/pdf/2505.24155.pdf

Authors: Ehtesamul Azim, Dongjie Wang, Tae Hyun Hwang, Yanjie Fu, Wei Zhang

Title: Biological Pathway Guided Gene Selection Through Collaborative Reinforcement Learning

Abstract:
Gene selection in high‑dimensional genomic data is essential for understanding disease mechanisms and improving therapeutic outcomes. Traditional feature selection methods effectively identify predictive genes but often ignore complex biological pathways and regulatory networks, leading to unstable and biologically irrelevant signatures. Prior approaches, such as Lasso‑based methods and statistical filtering, either focus solely on individual gene‑outcome associations or fail to capture pathway‑level interactions, presenting a key challenge: how to integrate biological pathway knowledge while maintaining statistical rigor in gene selection? To address this gap, we propose a novel two‑stage framework that integrates statistical selection with biological pathway knowledge using multi‑agent reinforcement learning (MARL). First, we introduce a pathway‑guided pre‑filtering strategy that leverages multiple statistical methods alongside KEGG pathway information for initial dimensionality reduction. Next, for refined selection, we model genes as collaborative agents in a MARL framework, where each agent optimizes both predictive power and biological relevance. Our framework incorporates pathway knowledge through Graph Neural Network‑based state representations, a reward mechanism combining prediction performance with gene centrality and pathway coverage, and collaborative learning strategies using shared memory and a centralized critic component. Extensive experiments on multiple gene expression datasets demonstrate that our approach significantly improves both prediction accuracy and biological interpretability compared to traditional methods.

PaperID: 10267, https://arxiv.org/pdf/2505.24113.pdf

Authors: Pengcheng Dai, Yuanqiu Mo, Wenwu Yu, Wei Ren

Title: Distributed Neural Policy Gradient Algorithm for Global Convergence of Networked Multi-Agent Reinforcement Learning

Abstract:
This paper studies the networked multi‑agent reinforcement learning (NMARL) problem, where the objective of agents is to collaboratively maximize the discounted average cumulative rewards. Different from the existing methods that suffer from poor expression due to linear function approximation, we propose a distributed neural policy gradient algorithm that features two innovatively designed neural networks, specifically for the approximate Q‑functions and policy functions of agents. This distributed neural policy gradient algorithm consists of two key components: the distributed critic step and the decentralized actor step. In the distributed critic step, agents receive the approximate Q‑function parameters from their neighboring agents via a time‑varying communication networks to collaboratively evaluate the joint policy. In contrast, in the decentralized actor step, each agent updates its local policy parameter solely based on its own approximate Q‑function. In the convergence analysis, we first establish the global convergence of agents for the joint policy evaluation in the distributed critic step. Subsequently, we rigorously demonstrate the global convergence of the overall distributed neural policy gradient algorithm with respect to the objective function. Finally, the effectiveness of the proposed algorithm is demonstrated by comparing it with a centralized algorithm through simulation in the robot path planning environment.

PaperID: 10268, https://arxiv.org/pdf/2505.24112.pdf

Authors: Michael Tang, Miroslav Krstic, Jorge Poveda

Title: Deception in Oligopoly Games via Adaptive Nash Seeking Systems

Abstract:
In the theory of multi‑agent systems, deception refers to the strategic manipulation of information to influence the behavior of other agents, ultimately altering the long‑term dynamics of the entire system. Recently, this concept has been examined in the context of model‑free Nash equilibrium seeking (NES) algorithms for noncooperative games. Specifically, it was demonstrated that players can exploit knowledge of other players' exploration signals to drive the system toward a ``deceptive" Nash equilibrium, while maintaining the stability of the closed‑loop system. To extend this insight beyond the duopoly case, in this paper we conduct a comprehensive study of deception mechanisms in N‑player oligopoly markets. By leveraging the structure of these games and employing stability techniques for nonlinear dynamical systems, we provide game‑theoretic insights into deception and derive specialized results, including stability conditions. These results allow players to systematically adjust their NES dynamics by tuning gains and signal amplitudes, all while ensuring closed‑loop stability. Additionally, we introduce novel sufficient conditions to demonstrate that the (practically) stable equilibrium point of the deceptive dynamics corresponds to a true Nash equilibrium of a different game, which we term the ``deceptive game." Our results show that, under the proposed adaptive dynamics with deception, a victim firm may develop a distorted perception of its competitors' product appeal, which could lead to setting suboptimal prices.

PaperID: 10269, https://arxiv.org/pdf/2505.24019.pdf

Authors: Kaiyuan Zhang, Zian Su, Pin-Yu Chen, Elisa Bertino, Xiangyu Zhang, Ninghui Li

Title: LLM Agents Should Employ Security Principles

Abstract:
Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well‑established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense‑in‑depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life‑cycle. We evaluate with state‑of‑the‑art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.

PaperID: 10270, https://arxiv.org/pdf/2505.24000.pdf

Authors: Jad Bendarkawi, Ashley Ponce, Sean Mata, Aminah Aliu, Yuhan Liu, Lei Zhang, Amna Liaqat, Varun Nagaraj Rao, Andrés Monroy-Hernández

Title: ConversAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners

Abstract:
Group conversations are valuable for second language (L2) learners as they provide opportunities to practice listening and speaking, exercise complex turn‑taking skills, and experience group social dynamics in a target language. However, most existing Augmented Reality (AR)‑based conversational learning tools focus on dyadic interactions rather than group dialogues. Although research has shown that AR can help reduce speaking anxiety and create a comfortable space for practicing speaking skills in dyadic scenarios, especially with Large Language Model (LLM)‑based conversational agents, the potential for group language practice using these technologies remains largely unexplored. We introduce ConversAR, a gpt‑4o powered AR application, that enables L2 learners to practice contextualized group conversations. Our system features two embodied LLM agents with vision‑based scene understanding and live captions. In a system evaluation with 10 participants, users reported reduced speaking anxiety and increased learner autonomy compared to perceptions of in‑person practice methods with other learners.

PaperID: 10271, https://arxiv.org/pdf/2505.23982.pdf

Authors: Jerry Junyang Cheung, Shiyao Shen, Yuchen Zhuang, Yinghao Li, Rampi Ramprasad, Chao Zhang

Title: MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge

Abstract:
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain‑specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate‑level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi‑step reasoning across seven materials science sub‑fields, such as structure‑property relationships, synthesis processes, and computational modeling. Through experiments with 10 state‑of‑the‑art LLMs, we identify significant gaps in current LLM performance. While API‑based proprietary LLMs achieve up to 84.5% accuracy, open‑source (OSS) LLMs peak around 60.5%, and domain‑specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA represents the first benchmark to jointly evaluate the factual and reasoning capabilities of LLMs crucial for LLMs in advanced materials science.

PaperID: 10272, https://arxiv.org/pdf/2505.23960.pdf

Authors: Henry Conklin

Title: Information Structure in Mappings: An Approach to Learning, Representation, and Generalisation

Abstract:
Despite the remarkable success of large large‑scale neural networks, we still lack unified notation for thinking about and describing their representational spaces. We lack methods to reliably describe how their representations are structured, how that structure emerges over training, and what kinds of structures are desirable. This thesis introduces quantitative methods for identifying systematic structure in a mapping between spaces, and leverages them to understand how deep‑learning models learn to represent information, what representational structures drive generalisation, and how design decisions condition the structures that emerge. To do this I identify structural primitives present in a mapping, along with information theoretic quantifications of each. These allow us to analyse learning, structure, and generalisation across multi‑agent reinforcement learning models, sequence‑to‑sequence models trained on a single task, and Large Language Models. I also introduce a novel, performant, approach to estimating the entropy of vector space, that allows this analysis to be applied to models ranging in size from 1 million to 12 billion parameters. The experiments here work to shed light on how large‑scale distributed models of cognition learn, while allowing us to draw parallels between those systems and their human analogs. They show how the structures of language and the constraints that give rise to them in many ways parallel the kinds of structures that drive performance of contemporary neural networks.

PaperID: 10273, https://arxiv.org/pdf/2505.23865.pdf

Authors: Emanuele Masiero, Vito Trianni, Giuseppe Vizzari, Dimitri Ognibene

Title: Combining Deep Architectures for Information Gain estimation and Reinforcement Learning for multiagent field exploration

Abstract:
Precision agriculture requires efficient autonomous systems for crop monitoring, where agents must explore large‑scale environments while minimizing resource consumption. This work addresses the problem as an active exploration task in a grid environment representing an agricultural field. Each cell may contain targets (e.g., damaged crops) observable from nine predefined points of view (POVs). Agents must infer the number of targets per cell using partial, sequential observations. We propose a two‑stage deep learning framework. A pre‑trained LSTM serves as a belief model, updating a probabilistic map of the environment and its associated entropy, which defines the expected information gain (IG). This allows agents to prioritize informative regions. A key contribution is the inclusion of a POV visibility mask in the input, preserving the Markov property under partial observability and avoiding revisits to already explored views. Three agent architectures were compared: an untrained IG‑based agent selecting actions to maximize entropy reduction; a DQN agent using CNNs over local 3x3 inputs with belief, entropy, and POV mask; and a Double‑CNN DQN agent with wider spatial context. Simulations on 20x20 maps showed that the untrained agent performs well despite its simplicity. The DQN agent matches this performance when the POV mask is included, while the Double‑CNN agent consistently achieves superior exploration efficiency, especially in larger environments. Results show that uncertainty‑aware policies leveraging entropy, belief states, and visibility tracking lead to robust and scalable exploration. Future work includes curriculum learning, multi‑agent cooperation with shared rewards, transformer‑based models, and intrinsic motivation mechanisms to further enhance learning efficiency and policy generalization.

PaperID: 10274, https://arxiv.org/pdf/2505.23852.pdf

Authors: Nic Dobbins, Christelle Xiong, Kristine Lan, Meliha Yetisgen

Title: Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease

Abstract:
Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT‑4o, we created a simulated research team of LLM‑based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer's studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range‑based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM‑based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent‑based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.

PaperID: 10275, https://arxiv.org/pdf/2505.23847.pdf

Authors: Ronny Ko, Jiseong Jeong, Shuyuan Zheng, Chuan Xiao, Tae-Wan Kim, Makoto Onizuka, Won-Yong Shin

Title: Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems

Abstract:
Large language models (LLMs) are rapidly evolving into autonomous agents that cooperate across organizational boundaries, enabling joint disaster response, supply‑chain optimization, and other tasks that demand decentralized expertise without surrendering data ownership. Yet, cross‑domain collaboration shatters the unified trust assumptions behind current alignment and containment techniques. An agent benign in isolation may, when receiving messages from an untrusted peer, leak secrets or violate policy, producing risks driven by emergent multi‑agent dynamics rather than classical software bugs. This position paper maps the security agenda for cross‑domain multi‑agent LLM systems. We introduce seven categories of novel security challenges, for each of which we also present plausible attacks, security evaluation metrics, and future research guidelines.

PaperID: 10276, https://arxiv.org/pdf/2505.23846.pdf

Authors: Atanu Barai, Stephan Eidenbenz, Nandakishore Santhi

Title: Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations

Abstract:
To fully leverage the potential of artificial intelligence (AI) systems in a trustworthy manner, it is desirable to couple multiple AI and non‑AI systems together seamlessly for constraining and ensuring correctness of the output. This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non‑AI agents in a causal, rule‑based way. Our approach tightly integrates the concept of passage of time, with each agent considered as an entity in the PDES framework and responding to prior requests from other agents. Such coupling mechanism enables the agents to work in a co‑operative environment towards a common goal while many tasks run in parallel throughout the simulation. It further enables setting up boundaries to the outputs of the AI agents by applying necessary dynamic constraints using non‑AI agents while allowing for scalability through deployment of hundreds of such agents in a larger compute cluster. Distributing smaller AI agents can enable extremely scalable simulations in the future, addressing local memory bottlenecks for model parameter storage. Within a PDES involving both AI and non‑AI agents, we break down the problem at hand into structured steps, when necessary, providing a set of multiple choices to the AI agents, and then progressively solve these steps towards a final goal. At each step, the non‑AI agents act as unbiased auditors, verifying each action by the AI agents so that certain rules of engagement are followed. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Our results show greater accuracy in solving problems from various domains where the AI models struggle to solve the problems solely by themselves. Results show that overall accuracy of our approach is 68% where as the accuracy of vanilla models is less than 23%.

PaperID: 10277, https://arxiv.org/pdf/2505.23837.pdf

Authors: Lin Zhong, Lingzhi Wang, Xu Yang, Qing Liao

Title: CoMaPOI: A Collaborative Multi-Agent Framework for Next POI Prediction Bridging the Gap Between Trajectory and Language

Abstract:
Large Language Models (LLMs) offer new opportunities for the next Point‑Of‑Interest (POI) prediction task, leveraging their capabilities in semantic understanding of POI trajectories. However, previous LLM‑based methods, which are superficially adapted to next POI prediction, largely overlook critical challenges associated with applying LLMs to this task. Specifically, LLMs encounter two critical challenges: (1) a lack of intrinsic understanding of numeric spatiotemporal data, which hinders accurate modeling of users' spatiotemporal distributions and preferences; and (2) an excessively large and unconstrained candidate POI space, which often results in random or irrelevant predictions. To address these issues, we propose a Collaborative Multi Agent Framework for Next POI Prediction, named CoMaPOI. Through the close interaction of three specialized agents (Profiler, Forecaster, and Predictor), CoMaPOI collaboratively addresses the two critical challenges. The Profiler agent is responsible for converting numeric data into language descriptions, enhancing semantic understanding. The Forecaster agent focuses on dynamically constraining and refining the candidate POI space. The Predictor agent integrates this information to generate high‑precision predictions. Extensive experiments on three benchmark datasets (NYC, TKY, and CA) demonstrate that CoMaPOI achieves state of the art performance, improving all metrics by 5% to 10% compared to SOTA baselines. This work pioneers the investigation of challenges associated with applying LLMs to complex spatiotemporal tasks by leveraging tailored collaborative agents.

PaperID: 10278, https://arxiv.org/pdf/2505.23833.pdf

Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

Title: Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

Abstract:
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematic framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: $\scoreGamma$ measures basic reasoning accuracy, while $\scoreDelta$ quantifies a model's reliance on specific symbols rather than underlying patterns ‑ a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule‑based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B‑70B, multi‑agent) reveal:1) critical limitations in non‑decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain‑of‑thought prompting; and 3) $\scoreDelta$'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand‑specific memorization. These findings underscore that current LLMs, despite domain‑specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.

PaperID: 10279, https://arxiv.org/pdf/2505.23815.pdf

Authors: Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, Katherine Metcalf

Title: Aligning LLMs by Predicting Preferences from User Writing Samples

Abstract:
Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT‑mini, and GPT‑4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state‑of‑the‑art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone.

PaperID: 10280, https://arxiv.org/pdf/2505.23803.pdf

Authors: Yinuo Xue, Eric Spero, Meng Wai Woo, Wei Gao, Giovanni Russello

Title: MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection

Abstract:
Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional approaches, such as rule‑based filters and denylists, often struggle to keep pace, leading to missed detections and security risks. While machine learning methods have improved detection performance, they remain limited in adapting to novel and rapidly changing phishing strategies. We present MultiPhishGuard, an LLM‑based multi‑agent detection framework with learned coordination across specialized agents. The system consists of five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents), with agent contributions dynamically weighted using Proximal Policy Optimization. To address emerging threats, the framework incorporates an adversarial training loop in which an LLM‑based agent generates subtle, context‑aware email variants to expose potential model weaknesses and improve robustness to ambiguous phishing cases. Experimental evaluations on public datasets show that MultiPhishGuard achieves stronger performance than established baselines, including Chain‑of‑Thought prompting and single‑agent variants, as supported by ablation studies and comparative analyses. The system achieves an accuracy of 97.89%, with a false positive rate of 2.73% and a false negative rate of 0.20%. In addition, an explanation simplifier agent transforms technical model outputs into plain‑language rationales intended for human users. Overall, these results suggest that multi‑agent LLM architectures with adaptive coordination and adversarial training represent a promising direction for phishing email detection.

PaperID: 10281, https://arxiv.org/pdf/2505.23789.pdf

Authors: Mingyu Huang, Shasha Zhou, Yuxuan Chen, Ke Li

Title: Conversational Exploration of Literature Landscape with LitChat

Abstract:
We are living in an era of "big literature", where the volume of digital scientific publications is growing exponentially. While offering new opportunities, this also poses challenges for understanding literature landscapes, as traditional manual reviewing is no longer feasible. Recent large language models (LLMs) have shown strong capabilities for literature comprehension, yet they are incapable of offering "comprehensive, objective, open and transparent" views desired by systematic reviews due to their limited context windows and trust issues like hallucinations. Here we present LitChat, an end‑to‑end, interactive and conversational literature agent that augments LLM agents with data‑driven discovery tools to facilitate literature exploration. LitChat automatically interprets user queries, retrieves relevant sources, constructs knowledge graphs, and employs diverse data‑mining techniques to generate evidence‑based insights addressing user needs. We illustrate the effectiveness of LitChat via a case study on AI4Health, highlighting its capacity to quickly navigate the users through large‑scale literature landscape with data‑based evidence that is otherwise infeasible with traditional means.

PaperID: 10282, https://arxiv.org/pdf/2505.23723.pdf

Authors: Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen

Title: ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Abstract:
The emergence of large language model (LLM)‑based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, the dominant prompt‑based paradigm exhibits limitations: smaller models lack the capacity to learn from execution trajectories for generalization, while large proprietary models incur high computational overhead, restricting accessibility and scalability. Focusing on this, for the first time, we explore the paradigm of learning‑based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration‑enriched fine‑tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step‑wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML‑specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML‑Agent, driven by a 7B‑sized Qwen‑2.5 LLM for autonomous ML. Despite training on only 9 ML tasks, our 7B‑sized ML‑Agent achieves comparable performance to agents using much larger proprietary LLMs (e.g., GPT‑5) but at significantly lower computational cost, demonstrating strong performance and cross‑task generalization.

PaperID: 10283, https://arxiv.org/pdf/2505.23710.pdf

Authors: Zeinab Nezami, Syed Danial Ali Shah, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi

Title: From Connectivity to Autonomy: The Dawn of Self-Evolving Communication Systems

Abstract:
This paper envisions 6G as a self‑evolving telecom ecosystem, where AI‑driven intelligence enables dynamic adaptation beyond static connectivity. We explore the key enablers of autonomous communication systems, spanning reconfigurable infrastructure, adaptive middleware, and intelligent network functions, alongside multi‑agent collaboration for distributed decision‑making. We explore how these methodologies align with emerging industrial IoT frameworks, ensuring seamless integration within digital manufacturing processes. Our findings emphasize the potential for improved real‑time decision‑making, optimizing efficiency, and reducing latency in networked control systems. The discussion addresses ethical challenges, research directions, and standardization efforts, concluding with a technology stack roadmap to guide future developments. By leveraging state‑of‑the‑art 6G network management techniques, this research contributes to the next generation of intelligent automation solutions, bridging the gap between theoretical advancements and real‑world industrial applications.

PaperID: 10284, https://arxiv.org/pdf/2505.23686.pdf

Authors: Caroline Wang, Arrasy Rahman, Jiaxun Cui, Yoonchang Sung, Peter Stone

Title: ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork

Abstract:
Learning to collaborate with previously unseen partners is a fundamental generalization challenge in multi‑agent learning, known as Ad Hoc Teamwork (AHT). Existing AHT approaches often adopt a two‑stage pipeline, where first, a fixed population of teammates is generated with the idea that they should be representative of the teammates that will be seen at deployment time, and second, an AHT agent is trained to collaborate well with agents in the population. To date, the research community has focused on designing separate algorithms for each stage. This separation has led to algorithms that generate teammates with limited coverage of possible behaviors, and that ignore whether the generated teammates are easy to learn from for the AHT agent. Furthermore, algorithms for training AHT agents typically treat the set of training teammates as static, thus attempting to generalize to previously unseen partner agents without assuming any control over the set of training teammates. This paper presents a unified framework for AHT by reformulating the problem as an open‑ended learning process between an AHT agent and an adversarial teammate generator. We introduce ROTATE, a regret‑driven, open‑ended training algorithm that alternates between improving the AHT agent and generating teammates that probe its deficiencies. Experiments across diverse two‑player environments demonstrate that ROTATE significantly outperforms baselines at generalizing to an unseen set of evaluation teammates, thus establishing a new standard for robust and generalizable teamwork.

PaperID: 10285, https://arxiv.org/pdf/2505.23683.pdf

Authors: Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, Denny Wu

Title: Learning Compositional Functions with Transformers from Easy-to-Hard Data

Abstract:
Transformer‑based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi‑step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via gradient‑based optimization, remains an open question. Towards answering this question, in this work we study the learnability of the k‑fold composition task, which requires computing an interleaved composition of k input permutations and k hidden permutations, and can be expressed by a transformer with O(\log k) layers. On the negative front, we prove a Statistical Query (SQ) lower bound showing that any SQ learner that makes only polynomially‑many queries to an SQ oracle for the k‑fold composition task distribution must have sample size exponential in k, thus establishing a statistical‑computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in k, by gradient descent on an O(\log k)‑depth transformer via two different curriculum learning strategies: one in which data consists of k'‑fold composition functions with k' \le k presented in increasing difficulty, and another in which all such data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.

PaperID: 10286, https://arxiv.org/pdf/2505.23596.pdf

Authors: Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun, Chen, Yang Wang

Title: Agent-SAMA: State-Aware Mobile Assistant

Abstract:
Mobile Graphical User Interface (GUI) agents aim to autonomously complete tasks within or across apps based on user instructions. While recent Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens and perform actions, existing agents remain fundamentally reactive. They reason over the current UI screen but lack a structured representation of the app navigation flow, limiting GUI agents' ability to understand execution context, detect unexpected execution results, and recover from errors. We introduce Agent‑SAMA, a state‑aware multi‑agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Agent‑SAMA implements four specialized agents that collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. We evaluate Agent‑SAMA on two types of benchmarks: cross‑app (Mobile‑Eval‑E, SPA‑Bench) and mostly single‑app (AndroidWorld). On Mobile‑Eval‑E, Agent‑SAMA achieves an 84.0% success rate and a 71.9% recovery rate. On SPA‑Bench, it reaches an 80.0% success rate with a 66.7% recovery rate. Compared to prior methods, Agent‑SAMA improves task success by up to 12% and recovery success by 13.8%. On AndroidWorld, Agent‑SAMA achieves a 63.7% success rate, outperforming the baselines. Our results demonstrate that structured state modeling enhances robustness and can serve as a lightweight, model‑agnostic memory layer for future GUI agents.

PaperID: 10287, https://arxiv.org/pdf/2505.23474.pdf

Authors: Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li

Title: Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Abstract:
Process Reward Models (PRMs) are crucial in complex reasoning and problem‑solving tasks (e.g., LLM agents with long‑horizon decision‑making) by verifying the correctness of each intermediate reasoning step. In real‑world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic‑PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic‑PRMBenchcomprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic‑PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.

PaperID: 10288, https://arxiv.org/pdf/2505.23422.pdf

Authors: Tobias Lindenbauer, Georg Groh, Hinrich Schütze

Title: From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

Abstract:
We introduce CTIM‑Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository‑level Cross‑Task‑Instance Memory (CTIM). While existing open‑source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code‑Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long‑term memory after a single task instance. As repository‑level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture‑Of‑Experts (MoEs) inspired approach to create both a general‑purpose and repository‑level CTIM. We find that CTIM‑Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT‑Bank (Lingam et al., 2024) scale to real‑world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

PaperID: 10289, https://arxiv.org/pdf/2505.23401.pdf

Authors: Marta Bílková, Wesley Fussner, Roman Kuznets

Title: Agent Interpolation for Knowledge

Abstract:
We define a new type of proof formalism for multi‑agent modal logics with S5‑type modalities. This novel formalism combines the features of hypersequents to represent S5 modalities with nested sequents to represent the T‑like modality alternations. We show that the calculus is sound and complete, cut‑free, and terminating and yields decidability and the finite model property for multi‑agent S5. We also use it to prove the Lyndon (and hence Craig) interpolation property for multi‑agent S5, considering not only propositional atoms but also agents to be part of the common language. Finally, we discuss the difficulties on the way to extending these results to the logic of distributed knowledge and to deductive interpolation.

PaperID: 10290, https://arxiv.org/pdf/2505.23399.pdf

Authors: Jusheng Zhang, Yijia Fan, Wenjun Lin, Ruiqi Chen, Haoyi Jiang, Wenhao Chai, Jian Wang, Keze Wang

Title: GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

Abstract:
We propose GAM‑Agent, a game‑theoretic multi‑agent framework for enhancing vision‑language reasoning. Unlike prior single‑agent or monolithic models, GAM‑Agent formulates the reasoning process as a non‑zero‑sum game between base agents‑‑each specializing in visual perception subtasks‑‑and a critical agent that verifies logic consistency and factual correctness. Agents communicate via structured claims, evidence, and uncertainty estimates. The framework introduces an uncertainty‑aware controller to dynamically adjust agent collaboration, triggering multi‑round debates when disagreement or ambiguity is detected. This process yields more robust and interpretable predictions. Experiments on four challenging benchmarks‑‑MMMU, MMBench, MVBench, and VBench‑‑demonstrate that GAM‑Agent significantly improves performance across various VLM backbones. Notably, GAM‑Agent boosts the accuracy of small‑to‑mid scale models (e.g., Qwen2.5‑VL‑7B, InternVL3‑14B) by 5‑‑6%, and still enhances strong models like GPT‑4o by up to 2‑‑3%. Our approach is modular, scalable, and generalizable, offering a path toward reliable and explainable multi‑agent multimodal reasoning.

PaperID: 10291, https://arxiv.org/pdf/2505.23352.pdf

Authors: Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, Xin Wang

Title: Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems

Abstract:
The communication topology in large language model‑based multi‑agent systems fundamentally governs inter‑agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision‑making. While recent studies for communication topology automated design tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB‑leanrner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB‑leanrner.

PaperID: 10292, https://arxiv.org/pdf/2505.23275.pdf

Authors: Guangyuan Liu, Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Zehui Xiong, Sumei Sun, Abbas Jamalipour

Title: Wireless Agentic AI with Retrieval-Augmented Multimodal Semantic Perception

Abstract:
The rapid development of multimodal AI and Large Language Models (LLMs) has greatly enhanced real‑time interaction, decision‑making, and collaborative tasks. However, in wireless multi‑agent scenarios, limited bandwidth poses significant challenges to exchanging semantically rich multimodal information efficiently. Traditional semantic communication methods, though effective, struggle with redundancy and loss of crucial details. To overcome these challenges, we propose a Retrieval‑Augmented Multimodal Semantic Communication (RAMSemCom) framework. RAMSemCom incorporates iterative, retrieval‑driven semantic refinement tailored for distributed multi‑agent environments, enabling efficient exchange of critical multimodal elements through local caching and selective transmission. Our approach dynamically optimizes retrieval using deep reinforcement learning (DRL) to balance semantic fidelity with bandwidth constraints. A comprehensive case study on multi‑agent autonomous driving demonstrates that our DRL‑based retrieval strategy significantly improves task completion efficiency and reduces communication overhead compared to baseline methods.

PaperID: 10293, https://arxiv.org/pdf/2505.23239.pdf

Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

Title: OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software

Abstract:
Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS‑UAgent, an automated, configurable, and interactive agent‑based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform‑specific knowledge bases, OSS‑UAgent ensures accurate and context‑aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software's usability. Additionally, our demonstration showcases OSS‑UAgent's practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.

PaperID: 10294, https://arxiv.org/pdf/2505.23224.pdf

Authors: Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung

Title: MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Abstract:
In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi‑level (e.g., perception, reasoning) and multi‑granular (e.g., multi‑step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross‑modal self‑rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine‑tuning MLLM on this set of self‑rewarded confidence estimation signal for initial confidence expression warm‑up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self‑correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.

PaperID: 10295, https://arxiv.org/pdf/2505.23187.pdf

Authors: Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Title: Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Abstract:
Large Language Model‑based multi‑agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter‑agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi‑agent cross‑task experiential learning (MAEL), a novel framework that endows LLM‑driven agents with explicit cross‑task learning and experience accumulation. We model the task‑solving workflow on a graph‑structured multi‑agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task‑solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high‑reward, task‑relevant experiences as few‑shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi‑agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively‑achieving faster convergence and producing higher‑quality solutions on current tasks.

PaperID: 10296, https://arxiv.org/pdf/2505.22998.pdf

Authors: Jihwan Oh

Title: LLM Agents for Bargaining with Utility-based Feedback

Abstract:
Bargaining, a critical aspect of real‑world interactions, presents challenges for large language models (LLMs) due to limitations in strategic depth and adaptation to complex human factors. Existing benchmarks often fail to capture this real‑world complexity. To address this and enhance LLM capabilities in realistic bargaining, we introduce a comprehensive framework centered on utility‑based feedback. Our contributions are threefold: (1) BargainArena, a novel benchmark dataset with six intricate scenarios (e.g., deceptive practices, monopolies) to facilitate diverse strategy modeling; (2) human‑aligned, economically‑grounded evaluation metrics inspired by utility theory, incorporating agent utility and negotiation power, which implicitly reflect and promote opponent‑aware reasoning (OAR); and (3) a structured feedback mechanism enabling LLMs to iteratively refine their bargaining strategies. This mechanism can positively collaborate with in‑context learning (ICL) prompts, including those explicitly designed to foster OAR. Experimental results show that LLMs often exhibit negotiation strategies misaligned with human preferences, and that our structured feedback mechanism significantly improves their performance, yielding deeper strategic and opponent‑aware reasoning.

PaperID: 10297, https://arxiv.org/pdf/2505.22993.pdf

Authors: Hoang Pham, Thanh-Do Nguyen, Khac-Hoai Nam Bui

Title: Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Abstract:
Claim verification is a long‑standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task becomes an emerging research issue in the era of large language models (LLMs) since real‑world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub‑claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify‑in‑the‑Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation ‑ an input claim is decomposed into structured triplets, forming a graph‑based representation that integrates both structured and unstructured information; (2) Entity Disambiguation VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub‑claim verification; and (3) Verification ‑ remaining triplets are verified to complete the fact‑checking process. Experiments using Meta‑Llama‑3‑70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.

PaperID: 10298, https://arxiv.org/pdf/2505.22990.pdf

Authors: Pin-Han Chen, Yu-Sheng Lin, Wei-Cheng Lee, Tin-Yu Leu, Po-Hsiang Hsu, Anjana Dissanayake, Sungjin Oh, Chinq-Shiun Chiu

Title: MenTeR: A fully-automated Multi-agenT workflow for end-to-end RF/Analog Circuits Netlist Design

Abstract:
RF/Analog design is essential for bridging digital technologies with real‑world signals, ensuring the functionality and reliability of a wide range of electronic systems. However, analog design procedures are often intricate, time‑consuming and reliant on expert intuition, and hinder the time and cost efficiency of circuit development. To overcome the limitations of the manual circuit design, we introduce MenTeR ‑ a multiagent workflow integrated into an end‑to‑end analog design framework. By employing multiple specialized AI agents that collaboratively address different aspects of the design process, such as specification understanding, circuit optimization, and test bench validation, MenTeR reduces the dependency on frequent trial‑and‑error‑style intervention. MenTeR not only accelerates the design cycle time but also facilitates a broader exploration of the design space, demonstrating robust capabilities in handling real‑world analog systems. We believe that MenTeR lays the groundwork for future "RF/Analog Copilots" that can collaborate seamlessly with human designers.

PaperID: 10299, https://arxiv.org/pdf/2505.22979.pdf

Authors: Bengisu Guresti, Chongjie Zhang, Yevgeniy Vorobeychik

Title: Learning Recommender Mechanisms for Bayesian Stochastic Games

Abstract:
An important challenge in non‑cooperative game theory is coordinating on a single (approximate) equilibrium from many possibilities ‑ a challenge that becomes even more complex when players hold private information. Recommender mechanisms tackle this problem by recommending strategies to players based on their reported type profiles. A key consideration in such mechanisms is to ensure that players are incentivized to participate, report their private information truthfully, and follow the recommendations. While previous work has focused on designing recommender mechanisms for one‑shot and extensive‑form games, these approaches cannot be effectively applied to stochastic games, particularly if we constrain recommendations to be Markov stationary policies. To bridge this gap, we introduce a novel bi‑level reinforcement learning approach for automatically designing recommender mechanisms in Bayesian stochastic games. Our method produces a mechanism represented by a parametric function (such as a neural network), and is therefore highly efficient at execution time. Experimental results on two repeated and two stochastic games demonstrate that our approach achieves social welfare levels competitive with cooperative multi‑agent reinforcement learning baselines, while also providing significantly improved incentive properties.

PaperID: 10300, https://arxiv.org/pdf/2505.22960.pdf

Authors: Yongjin Yang, Euiin Yi, Jongwoo Ko, Kimin Lee, Zhijing Jin, Se-Young Yun

Title: Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Abstract:
The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi‑agent systems, with debate frameworks emerging as a promising avenue for enhanced problem‑solving. These multi‑agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self‑agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test‑time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self‑agent test‑time scaling baselines on mathematical reasoning and safety‑related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self‑agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.

PaperID: 10301, https://arxiv.org/pdf/2505.22921.pdf

Authors: Yue Xing, Tao Yang, Yijiashun Qi, Minggu Wei, Yu Cheng, Honghui Xin

Title: Structured Memory Mechanisms for Stable Context Representation in Large Language Models

Abstract:
This paper addresses the limitations of large language models in understanding long‑term context. It proposes a model architecture equipped with a long‑term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention‑based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model's ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi‑turn question answering, and accuracy in cross‑context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long‑text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long‑term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.

PaperID: 10302, https://arxiv.org/pdf/2505.22852.pdf

Authors: Krti Tallam, Emma Miller

Title: Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment

Abstract:
CaMeL (Capabilities for Machine Learning) introduces a capability‑based sandbox to mitigate prompt injection attacks in large language model (LLM) agents. While effective, CaMeL assumes a trusted user prompt, omits side‑channel concerns, and incurs performance tradeoffs due to its dual‑LLM design. This response identifies these issues and proposes engineering improvements to expand CaMeL's threat coverage and operational usability. We introduce: (1) prompt screening for initial inputs, (2) output auditing to detect instruction leakage, (3) a tiered‑risk access model to balance usability and control, and (4) a verified intermediate language for formal guarantees. Together, these upgrades align CaMeL with best practices in enterprise security and support scalable deployment.

PaperID: 10303, https://arxiv.org/pdf/2505.22846.pdf

Authors: Andrei Kozyrev, Nikita Khramov, Gleb Solovev, Anton Podkopaev

Title: RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

Abstract:
Interactive Theorem Proving was repeatedly shown to be fruitful when combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We identify retrieval‑based premise selection as a central component of effective Rocq proof generation and propose a novel approach based on a self‑attentive embedder model. The evaluation of the designed approach shows up to 28% relative increase of the generator's performance. We tackle the problem of writing Rocq proofs using a multi‑stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and demonstrate that incorporating multi‑agent debate during the planning stage increases the proof success rate by 20% overall and nearly doubles it for complex theorems, while the reflection mechanism further enhances stability and consistency.

PaperID: 10304, https://arxiv.org/pdf/2505.22814.pdf

Authors: Jonghan Lim, Ilya Kovalenko

Title: A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems

Abstract:
Manufacturing environments are becoming more complex and unpredictable due to factors such as demand variations and shorter product lifespans. This complexity requires real‑time decision‑making and adaptation to disruptions. Traditional control approaches highlight the need for advanced control strategies capable of overcoming unforeseen challenges, as they demonstrate limitations in responsiveness within dynamic industrial settings. Multi‑agent systems address these challenges through decentralization of decision‑making, enabling systems to respond dynamically to operational changes. However, current multi‑agent systems encounter challenges related to real‑time adaptation, context‑aware decision‑making, and the dynamic exploration of resource capabilities. Large language models provide the possibility to overcome these limitations through context‑aware decision‑making capabilities. This paper introduces a large language model‑enabled control architecture for multi‑agent manufacturing systems to dynamically explore resource capabilities in response to real‑time disruptions. A simulation‑based case study demonstrates that the proposed architecture improves system resilience and flexibility. The case study findings show improved throughput and efficient resource utilization compared to existing approaches.

PaperID: 10305, https://arxiv.org/pdf/2505.22804.pdf

Authors: Jonghan Lim, Ilya Kovalenko

Title: Dynamic Task Adaptation for Multi-Robot Manufacturing Systems with Large Language Models

Abstract:
Recent manufacturing systems are increasingly adopting multi‑robot collaboration to handle complex and dynamic environments. While multi‑agent architectures support decentralized coordination among robot agents, they often face challenges in enabling real‑time adaptability for unexpected disruptions without predefined rules. Recent advances in large language models offer new opportunities for context‑aware decision‑making to enable adaptive responses to unexpected changes. This paper presents an initial exploratory implementation of a large language model‑enabled control framework for dynamic task reassignment in multi‑robot manufacturing systems. A central controller agent leverages the large language model's ability to interpret structured robot configuration data and generate valid reassignments in response to robot failures. Experiments in a real‑world setup demonstrate high task success rates in recovering from failures, highlighting the potential of this approach to improve adaptability in multi‑robot manufacturing systems.

PaperID: 10306, https://arxiv.org/pdf/2505.22781.pdf

Authors: Antonio Ocello, Daniil Tiapkin, Lorenzo Mancini, Mathieu Laurière, Eric Moulines

Title: Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games

Abstract:
We introduce Mean‑Field Trust Region Policy Optimization (MF‑TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean‑Field Games (MFG) in finite state‑action spaces. Building on the well‑established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF‑TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample‑based counterpart, where we derive high‑probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean‑field decision‑making, offering a theoretically grounded approach to solving complex multi‑agent problems.

PaperID: 10307, https://arxiv.org/pdf/2505.22777.pdf

Authors: John Mendonça, Alon Lavie, Isabel Trancoso

Title: MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

Abstract:
Evaluating the quality of open‑domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta‑evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi‑agent framework for curating more representative and diverse open‑domain dialogue evaluation benchmarks. Our approach leverages several state‑of‑the‑art LLMs to generate user‑chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT‑4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross‑lingual performance differences. Guided by this large‑scale evaluation, we curate a new meta‑evaluation multilingual benchmark and human‑annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non‑reasoning LLMs to act as evaluators of open‑domain dialogues. Using MEDAL, we uncover that state‑of‑the‑art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.

PaperID: 10308, https://arxiv.org/pdf/2505.22753.pdf

Authors: Arseniy Pertzovsky, Roni Stern, Ariel Felner, Roie Zivan

Title: Enhancing Lifelong Multi-Agent Path-finding by Using Artificial Potential Fields

Abstract:
We explore the use of Artificial Potential Fields (APFs) to solve Multi‑Agent Path Finding (MAPF) and Lifelong MAPF (LMAPF) problems. In MAPF, a team of agents must move to their goal locations without collisions, whereas in LMAPF, new goals are generated upon arrival. We propose methods for incorporating APFs in a range of MAPF algorithms, including Prioritized Planning, MAPF‑LNS2, and Priority Inheritance with Backtracking (PIBT). Experimental results show that using APF is not beneficial for MAPF but yields up to a 7‑fold increase in overall system throughput for LMAPF.

PaperID: 10309, https://arxiv.org/pdf/2505.22655.pdf

Authors: Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci

Title: Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

Abstract:
Large‑language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human‑computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow‑up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.

PaperID: 10310, https://arxiv.org/pdf/2505.22597.pdf

Authors: Ngoc La, Ruaridh Mon-Williams, Julie A. Shah

Title: HDDLGym: A Tool for Studying Multi-Agent Hierarchical Problems Defined in HDDL with OpenAI Gym

Abstract:
In recent years, reinforcement learning (RL) methods have been widely tested using tools like OpenAI Gym, though many tasks in these environments could also benefit from hierarchical planning. However, there is a lack of a tool that enables seamless integration of hierarchical planning with RL. Hierarchical Domain Definition Language (HDDL), used in classical planning, introduces a structured approach well‑suited for model‑based RL to address this gap. To bridge this integration, we introduce HDDLGym, a Python‑based tool that automatically generates OpenAI Gym environments from HDDL domains and problems. HDDLGym serves as a link between RL and hierarchical planning, supporting multi‑agent scenarios and enabling collaborative planning among agents. This paper provides an overview of HDDLGym's design and implementation, highlighting the challenges and design choices involved in integrating HDDL with the Gym interface, and applying RL policies to support hierarchical planning. We also provide detailed instructions and demonstrations for using the HDDLGym framework, including how to work with existing HDDL domains and problems from International Planning Competitions, exemplified by the Transport domain. Additionally, we offer guidance on creating new HDDL domains for multi‑agent scenarios and demonstrate the practical use of HDDLGym in the Overcooked domain. By leveraging the advantages of HDDL and Gym, HDDLGym aims to be a valuable tool for studying RL in hierarchical planning, particularly in multi‑agent contexts.

PaperID: 10311, https://arxiv.org/pdf/2505.22583.pdf

Authors: Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

Title: GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Abstract:
Benchmarks for Software Engineering (SE) AI agents, most notably SWE‑bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open‑source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT‑4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.

PaperID: 10312, https://arxiv.org/pdf/2505.22571.pdf

Authors: Hoang Pham, Thuy-Duong Nguyen, Khac-Hoai Nam Bui

Title: Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Abstract:
This paper presents a novel approach for unified retrieval‑augmented generation (RAG) systems using the recent emerging large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes LLM as fundamental controllers, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question‑answering systems (e.g., multi‑hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single‑hop or multi‑hop approaches separately, which limits the application of those approaches to real‑world applications. In this study, we propose a trainable agent framework called Agent‑UniRAG for unified retrieval‑augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework to solve RAG tasks step‑by‑step based on the complexity of the inputs, simultaneously including single‑hop and multi‑hop queries in an end‑to‑end manner. Furthermore, we introduce SynAgent‑RAG, a synthetic dataset to enable the proposed agent framework for small open‑source LLMs (e.g., Llama‑3‑8B). The results show comparable performances with closed‑source and larger open‑source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.

PaperID: 10313, https://arxiv.org/pdf/2505.22503.pdf

Authors: Yuanfei Wang, Xinju Huang, Fangwei Zhong, Yaodong Yang, Yizhou Wang, Yuanpei Chen, Hao Dong

Title: Communication-Efficient Desire Alignment for Embodied Agent-Human Adaptation

Abstract:
While embodied agents have made significant progress in performing complex physical tasks, real‑world applications demand more than pure task execution. The agents must collaborate with unfamiliar agents and human users, whose goals are often vague and implicit. In such settings, interpreting ambiguous instructions and uncovering underlying desires is essential for effective assistance. Therefore, fast and accurate desire alignment becomes a critical capability for embodied agents. In this work, we first develop a home assistance simulation environment HA‑Desire that integrates an LLM‑driven proxy human user exhibiting realistic value‑driven goal selection and communication. The ego agent must interact with this proxy user to infer and adapt to the user's latent desires. To achieve this, we present a novel framework FAMER for fast desire alignment, which introduces a desire‑based mental reasoning mechanism to identify user intent and filter desire‑irrelevant actions. We further design a reflection‑based communication module that reduces redundant inquiries, and incorporate goal‑relevant information extraction with memory persistence to improve information reuse and reduce unnecessary exploration. Extensive experiments demonstrate that our framework significantly enhances both task execution and communication efficiency, enabling embodied agents to quickly adapt to user‑specific desires in complex embodied environments.

PaperID: 10314, https://arxiv.org/pdf/2505.22467.pdf

Authors: Jiaxi Yang, Mengqi Zhang, Yiqiao Jin, Hao Chen, Qingsong Wen, Lu Lin, Yi He, Srijan Kumar, Weijie Xu, James Evans, Jindong Wang

Title: Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Abstract:
Large Language Model‑based Multi‑Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. However, the topology of these systems‑‑how agents in MASs should be configured, connected, and coordinated‑‑remains largely unexplored. In this position paper, we call for a paradigm shift toward \emphtopology‑aware MASs that explicitly model and dynamically optimize the structure of inter‑agent interactions. We identify three fundamental components‑‑agents, communication links, and overall topology‑‑that collectively determine the system's adaptability, efficiency, robustness, and fairness. To operationalize this vision, we introduce a systematic three‑stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis. This framework not only provides a principled foundation for designing MASs but also opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling to ultimately unleash their full potential in complex real‑world applications. We conclude by outlining key challenges and opportunities in MASs evaluation. We hope our framework and perspectives offer critical new insights in the era of agentic AI.

PaperID: 10315, https://arxiv.org/pdf/2505.22311.pdf

Authors: Feibo Jiang, Cunhua Pan, Li Dong, Kezhi Wang, Octavia A. Dobre, Merouane Debbah

Title: From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

Abstract:
With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting‑edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial's motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM‑centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM‑based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi‑agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next‑generation intelligent communication systems.

PaperID: 10316, https://arxiv.org/pdf/2505.22303.pdf

Authors: Grzegorz Wolny, Michał Szczerbak

Title: Voice CMS: updating the knowledge base of a digital assistant through conversation

Abstract:
In this study, we propose a solution based on a multi‑agent LLM architecture and a voice user interface (VUI) designed to update the knowledge base of a digital assistant. Its usability is evaluated in comparison to a more traditional graphical content management system (CMS), with a focus on understanding the relationship between user preferences and the complexity of the information being provided. The findings demonstrate that, while the overall usability of the VUI is rated lower than the graphical interface, it is already preferred by users for less complex tasks. Furthermore, the quality of content entered through the VUI is comparable to that achieved with the graphical interface, even for highly complex tasks. Obtained qualitative results suggest that a hybrid interface combining the strengths of both approaches could address the key challenges identified during the experiment, such as reducing cognitive load through graphical feedback while maintaining the intuitive nature of voice‑based interactions. This work highlights the potential of conversational interfaces as a viable and effective method for knowledge management in specific business contexts.

PaperID: 10317, https://arxiv.org/pdf/2505.22192.pdf

Authors: Yue Cui, Liuyi Yao, Zitao Li, Yaliang Li, Bolin Ding, Xiaofang Zhou

Title: Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection

Abstract:
Multi‑agent systems based on large language models (LLMs) advance automatic task completion in various fields, where debate is a common cooperation form for agents to solve complicated problems with reasoning and cross‑review to solidify answers. Assessing the individual contributions of agents within these debates is crucial for system refinement and outcome reliability. Traditional leave‑one‑out (LOO) method offers a clear framework for evaluating each agent's role but face challenges in LLM‑based systems due to high computational costs and associated financial implications. This paper presents introspective‑leave‑one‑out (IntrospecLOO), a simple yet effective prompting for approximation of LOO in LLM‑powered multi‑agent debates. IntrospecLOO introduces an additional querying round after standard debates, prompting agents to update their answers while ignoring responses from a designated agent. This strategy effectively isolates and gauges each participant's influence at a reduced query complexity compared to the original LOO approaches. Validation through experiments on three benchmark datasets confirms the effectiveness of IntrospecLOO.

PaperID: 10318, https://arxiv.org/pdf/2505.22151.pdf

Authors: Claude Formanek, Omayma Mahjoub, Louay Ben Nessir, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Daniel Rajaonarivonivelomanantsoa, Simon Du Toit, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Felix Chalumeau, Arnu Pretorius

Title: Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Abstract:
A key challenge in offline multi‑agent reinforcement learning (MARL) is achieving effective many‑agent multi‑step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention‑based architecture Sable and combines it with a sequential form of implicit constraint Q‑learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works ‑‑ SMAC, RWARE, and Multi‑Agent MuJoCo ‑‑ covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state‑of‑the‑art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many‑agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.

PaperID: 10319, https://arxiv.org/pdf/2505.21985.pdf

Authors: Naoto Yoshida, Tadahiro Taniguchi

Title: Reward-Independent Messaging for Decentralized Multi-Agent Reinforcement Learning

Abstract:
In multi‑agent reinforcement learning (MARL), effective communication improves agent performance, particularly under partial observability. We propose MARL‑CPC, a framework that enables communication among fully decentralized, independent agents without parameter sharing. MARL‑CPC incorporates a message learning model based on collective predictive coding (CPC) from emergent communication research. Unlike conventional methods that treat messages as part of the action space and assume cooperation, MARL‑CPC links messages to state inference, supporting communication in non‑cooperative, reward‑independent settings. We introduce two algorithms Bandit‑CPC and IPPO‑CPC‑ and evaluate them in non‑cooperative MARL tasks. Benchmarks show that both outperform standard message‑as‑action approaches, establishing effective communication even when messages offer no direct benefit to the sender. These results highlight MARL‑CPC's potential for enabling coordination in complex, decentralized environments.

PaperID: 10320, https://arxiv.org/pdf/2505.21964.pdf

Authors: Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu

Title: UI-Evol: Automatic Knowledge Evolving for Computer Use Agents

Abstract:
External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge‑execution gap: retrieved knowledge often fails to translate into effective real‑world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI‑Evol, a plug‑and‑play module for autonomous GUI knowledge evolution. UI‑Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent‑environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state‑of‑the‑art Agent S2. Our results demonstrate that UI‑Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

PaperID: 10321, https://arxiv.org/pdf/2505.21963.pdf

Authors: Taro Yano, Yoichi Ishibashi, Masafumi Oyamada

Title: LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post‑training techniques such as Supervised Fine‑Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post‑training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post‑training pipelines through the use of LLM‑based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task‑based feedback to discover high‑performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool‑use accuracy by 9.0 points while preserving instruction‑following capabilities. Moreover, it uncovers effective post‑training strategies that are often overlooked by conventional human‑driven exploration. We further analyze the impact of data and model size scaling to reduce computational costs on the exploration, finding that model size scalings introduces new challenges, whereas scaling data size enables cost‑effective pipeline discovery.

PaperID: 10322, https://arxiv.org/pdf/2505.21952.pdf

Authors: Masahiko Ueda

Title: Properties of zero-determinant strategies in multichannel games

Abstract:
Controlling payoffs in repeated games is one of the important topics in control theory of multi‑agent systems. Recently proposed zero‑determinant strategies enable players to unilaterally enforce linear relations between payoffs. Furthermore, based on the mathematics of zero‑determinant strategies, regional payoff control, in which payoffs are enforced into some feasible regions, has been discovered in social dilemma situations. More recently, theory of payoff control was extended to multichannel games, where players parallelly interact with each other in multiple channels. However, the existence of payoff‑controlling strategies in multichannel games seems to require the existence of payoff‑controlling strategies in some channels, and properties of zero‑determinant strategies specific to multichannel games are still not clear. In this paper, we elucidate properties of zero‑determinant strategies in multichannel games. First, we relate the existence condition of zero‑determinant strategies in multichannel games to that of zero‑determinant strategies in each channel. We then show that the existence of zero‑determinant strategies in multichannel games requires the existence of zero‑determinant strategies in some channels. This result implies that the existence of zero‑determinant strategies in multichannel games is tightly restricted by structure of games played in each channel.

PaperID: 10323, https://arxiv.org/pdf/2505.21898.pdf

Authors: Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Title: Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Abstract:
Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi‑Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource‑unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource‑aware multi‑agent system ‑‑ Co‑Saving (meaning that multiple agents collaboratively engage in resource‑saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" ‑‑ instructional transitions learned from historically successful trajectories ‑‑ which allows to bypass redundant reasoning agents and expedite the collective problem‑solving process. Experiments for software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state‑of‑the‑art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.

PaperID: 10324, https://arxiv.org/pdf/2505.21863.pdf

Authors: Shikhhar Siingh, Abhinav Rawat, Chitta Baral, Vivek Gupta

Title: GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

Abstract:
Publicly significant images from events hold valuable contextual information, crucial for journalism and education. However, existing methods often struggle to extract this relevance accurately. To address this, we introduce GETReason (Geospatial Event Temporal Reasoning), a framework that moves beyond surface‑level image descriptions to infer deeper contextual meaning. We propose that extracting global event, temporal, and geospatial information enhances understanding of an image's significance. Additionally, we introduce GREAT (Geospatial Reasoning and Event Accuracy with Temporal Alignment), a new metric for evaluating reasoning‑based image understanding. Our layered multi‑agent approach, assessed using a reasoning‑weighted metric, demonstrates that meaningful insights can be inferred, effectively linking images to their broader event context.

PaperID: 10325, https://arxiv.org/pdf/2505.21808.pdf

Authors: Jam Kraprayoon, Zoe Williams, Rida Fayyaz

Title: AI Agent Governance: A Field Guide

Abstract:
This report serves as an accessible guide to the emerging field of AI agent governance. Agents ‑ AI systems that can autonomously achieve goals in the world, with little to no explicit human instruction about how to do so ‑ are a major focus of leading tech companies, AI start‑ups, and investors. If these development efforts are successful, some industry leaders claim we could soon see a world where millions or billions of agents autonomously perform complex tasks across society. Society is largely unprepared for this development. A future where capable agents are deployed en masse could see transformative benefits to society but also profound and novel risks. Currently, the exploration of agent governance questions and the development of associated interventions remain in their infancy. Only a few researchers, primarily in civil society organizations, public research institutes, and frontier AI companies, are actively working on these challenges.

PaperID: 10326, https://arxiv.org/pdf/2505.21784.pdf

Authors: Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

Title: Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Abstract:
Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over‑refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource‑intensive process of creating high‑quality policy‑embedded chain‑of‑thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi‑agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high‑quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE‑generated CoTs provide a strong foundation for supervised fine‑tuning (SFT)‑based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE‑generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine‑tuning open‑source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over‑refusal accuracy. AIDSAFE‑generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE

PaperID: 10327, https://arxiv.org/pdf/2505.21752.pdf

Authors: Mengchen Dong, Levin Brinkmann, Omar Sherif, Shihan Wang, Xinyu Zhang, Jean-François Bonnefon, Iyad Rahwan

Title: Experimental Evidence That AI-Managed Workers Tolerate Lower Pay Without Demotivation

Abstract:
Experimental evidence on worker responses to AI management remains mixed, partly due to limitations in experimental fidelity. We address these limitations with a customized workplace in the Minecraft platform, enabling high‑resolution behavioral tracking of autonomous task execution, and ensuring that participants approach the task with well‑formed expectations about their own competence. Workers (N = 382) completed repeated production tasks under either human, AI, or hybrid management. An AI manager trained on human‑defined evaluation principles systematically assigned lower performance ratings and reduced wages by 40%, without adverse effects on worker motivation and sense of fairness. These effects were driven by a muted emotional response to AI evaluation, compared to evaluation by a human. The very features that make AI appear impartial may also facilitate silent exploitation, by suppressing the social reactions that normally constrain extractive practices in human‑managed work.

PaperID: 10328, https://arxiv.org/pdf/2505.21741.pdf

Authors: Dongjune Chang, Sola Kim, Young Soo Park

Title: AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models

Abstract:
Nuclear waste management requires rigorous regulatory compliance assessment, demanding advanced decision‑support systems capable of addressing complex legal, environmental, and safety considerations. This paper presents a multi‑agent Retrieval‑Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval mechanisms to enhance decision accuracy through structured agent collaboration. Through a structured 10‑round discussion model, agents collaborate to assess regulatory compliance and safety requirements while maintaining document‑grounded responses. Implemented on consumer‑grade hardware, the system leverages Llama 3.2 and mxbai‑embed‑large‑v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework's effectiveness. Results show the Regulatory Agent achieves consistently higher relevance scores in maintaining alignment with legal frameworks, while the Safety Agent effectively manages complex risk assessments requiring multifaceted analysis. The system demonstrates progressive improvement in agreement rates between agents across discussion rounds while semantic drift decreases, indicating enhanced decision‑making consistency and response coherence. The system ensures regulatory decisions remain factually grounded, dynamically adapting to evolving regulatory frameworks through real‑time document retrieval. By balancing automated assessment with human oversight, this framework offers a scalable and transparent approach to regulatory governance. These findings underscore the potential of AI‑driven, multi‑agent systems in advancing evidence‑based, accountable, and adaptive decision‑making for high‑stakes environmental management scenarios.

PaperID: 10329, https://arxiv.org/pdf/2505.21676.pdf

Authors: Yufeng Yang, Minghao Ning, Keqi Shu, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour

Title: Real-World Deployment of Cloud-based Autonomous Mobility Systems for Outdoor and Indoor Environments

Abstract:
Autonomous mobility systems increasingly operate in dense and dynamic environments where perception occlusions, limited sensing coverage, and multi‑agent interactions pose major challenges. While onboard sensors provide essential local perception, they often struggle to maintain reliable situational awareness in crowded urban or indoor settings. This article presents the Cloud‑based Autonomous Mobility (CAM) framework, a generalized architecture that integrates infrastructure‑based intelligent sensing with cloud‑level coordination to enhance autonomous operations. The system deploys distributed Intelligent Sensor Nodes (ISNs) equipped with cameras, LiDAR, and edge computing to perform multi‑modal perception and transmit structured information to a cloud platform via high‑speed wireless communication. The cloud aggregates observations from multiple nodes to generate a global scene representation for other autonomous modules, such as decision making, motion planning, etc. Real‑world deployments in an urban roundabout and a hospital‑like indoor environment demonstrate improved perception robustness, safety, and coordination for future intelligent mobility systems.

PaperID: 10330, https://arxiv.org/pdf/2505.21588.pdf

Authors: Young-Min Cho, Sharath Chandra Guntuku, Lyle Ungar

Title: Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems

Abstract:
Recent advancements in Large Language Models (LLMs) have enabled the emergence of multi‑agent systems where LLMs interact, collaborate, and make decisions in shared environments. While individual model behavior has been extensively studied, the dynamics of peer influence in such systems remain underexplored. In this paper, we investigate herd behavior, the tendency of agents to align their outputs with those of their peers, within LLM‑based multi‑agent interactions. We present a series of controlled experiments that reveal how herd behaviors are shaped by multiple factors. First, we show that the gap between self‑confidence and perceived confidence in peers significantly impacts an agent's likelihood to conform. Second, we find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior. Finally, we demonstrate that the degree of herd behavior can be systematically controlled, and that appropriately calibrated herd tendencies can enhance collaborative outcomes. These findings offer new insights into the social dynamics of LLM‑based systems and open pathways for designing more effective and adaptive multi‑agent collaboration frameworks.

PaperID: 10331, https://arxiv.org/pdf/2505.21559.pdf

Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron

Title: Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework

Abstract:
In cloud‑native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to poor workload management issues such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios, such as Distributed Denial‑of‑Service attacks (DDoS). Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning‑based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure‑specific sub‑goals delegated to collaborative agents, collectively forming an HPA Multi‑Agent System (MAS). We introduce an automated, four‑phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster. Experimental results demonstrate that the generated HPA MASs outperform three state‑of‑the‑art HPA systems in sustaining operational resilience under various adversarial conditions in a proposed complex cluster.

PaperID: 10332, https://arxiv.org/pdf/2505.21515.pdf

Authors: Hongqian Wu, Hongzhong Deng, Jichao Li, Chengxing Wu, Zhuoting Yu, Haidong Zhang, Gaoxin Qi

Title: Edge Games: Cooperative Partner Selection in Network Cooperation Evolution

Abstract:
The phenomenon of group cooperation constitutes a fundamental mechanism underlying various social and biological systems. Complex networks provide a structural framework for group interactions, where individuals can not only obtain information from their neighbors but also choose neighbors as cooperative partners. However, traditional evolutionary game theory models, where nodes are the game players, are not convenient for directly choosing cooperative partners. Here, we exchange the roles of nodes and edges and innovatively propose the "edge game" model, using edges in complex networks as virtual game players for group games. Theoretical analysis and simulation experiments show that by configuring a synergy factor (r) that satisfies the "moderate cooperation" condition, a stable cooperative structure can be achieved for any network at the evolutionary equilibrium. Specifically, when there is no constraint on the number of cooperators per node, the condition for the evolution of cooperation in the network is r > kmax, where kmax is the maximum degree of the nodes. When there is a threshold constraint, in nearest‑neighbor coupled networks (with degree k), the condition for "moderate cooperation" is k < r < 2k. In heterogeneous networks, a variable synergy factor scheme is adopted, where the synergy factor for each game group (rx) is defined to be proportional to the degree of the central node (kx) in the group (rx = n‑foldkx), "moderate cooperation" can be achieved when 1 < n‑fold < 2. If the value of r exceeds the range, it may lead to "excessive cooperation" with node overload. Comparing algorithm performance and time complexity, edge games demonstrate advantages over other optimization algorithms. Simple and universal, the edge game provides a new approach to addressing multi‑agent cooperation problems in the era of machine intelligence.

PaperID: 10333, https://arxiv.org/pdf/2505.21503.pdf

Authors: Yihan Wang, Qiao Yan, Zhenghao Xing, Lihao Liu, Junjun He, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng

Title: Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Abstract:
Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi‑agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role‑specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the ``catfish effect'' in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context‑aware interventions: (i) a complexity‑aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone‑calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single‑ and multi‑agent LLMs frameworks, including leading commercial models such as GPT‑4o and DeepSeek‑R1.

PaperID: 10334, https://arxiv.org/pdf/2505.21486.pdf

Authors: Yang Yang, Jiemin Wu, Yutao Yue

Title: Hypothesis Generation via LLM-Automated Language Bias for ILP

Abstract:
Inductive Logic Programming (ILP) is a principled approach for generalizing regularities from data and constructing hypotheses as interpretable logic programs. However, a key limitation is its reliance on expert‑crafted language bias ‑ the predicate inventory, types, and mode declarations that delimit the search space. We propose hypothesis generation via LLM‑automated language bias: multi‑agent LLMs design the bias from raw text and translate descriptions into typed facts, and a robust ILP solver induces rules under a global consistency objective. This approach reduces traditional ILP's reliance on predefined symbolic structures and the noise sensitivity of LLM‑only pipelines that directly generate hypotheses as text or code. Extensive experiments in diverse, challenging scenarios validate superior performance, providing a practical, explainable, and verifiable route to hypothesis generation.

PaperID: 10335, https://arxiv.org/pdf/2505.21418.pdf

Authors: Lina Zhao, Zihao Bian, Qingyue Chen, Yafang Li, Zhiyi Luo, Jiaxing Bai, Guangbo Li, Min He, Kezhi Li, Huaiyuan Yao, Zongjiu Zhang

Title: FUAS-Agents: Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

Abstract:
Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non‑invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real‑time intraoperative decision‑making processes that demand intelligent assistance to improve efficiency and reliability. We introduce FUAS‑Agents, an autonomous agent system that leverages the multimodal understanding and tool‑using capabilities of large language models (LLMs). The system was developed using a large‑scale, multicenter, multimodal clinical dataset of over 3000 cases from three medical institutions. By integrating patient profiles and MRI data, FUAS‑Agents orchestrates a suite of specialized medical AI tools, including segmentation, treatment dose prediction, and clinical guideline retrieval, to generate personalized treatment plans comprising MRI image, dose parameters, and therapeutic strategies. The system also incorporates an internal quality control and reflection mechanism, ensuring consistency and robustness of the outputs. We evaluate the system in a uterine fibroid treatment scenario. Human assessment by four senior FUAS experts indicates that 82.5%, 82.5%, 87.5%, and 97.5% of the generated plans were rated 4 or above (on a 5‑point scale) in terms of completeness, accuracy, fluency, and clinical compliance, respectively. In addition, we have conducted ablation studies to systematically examine the contribution of each component to the overall performance. These results demonstrate the potential of LLM‑driven agents in enhancing decision‑making across complex clinical workflows, and exemplify a translational paradigm that combines general‑purpose models with specialized expert systems to solve practical challenges in vertical healthcare domains.

PaperID: 10336, https://arxiv.org/pdf/2505.21362.pdf

Authors: Qishuai Zhong, Zongmin Li, Siqi Fan, Aixin Sun

Title: Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Abstract:
Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real‑world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single‑turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi‑turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi‑agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

PaperID: 10337, https://arxiv.org/pdf/2505.21354.pdf

Authors: Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah

Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Abstract:
Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language's low‑resource status and the multi‑step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human‑annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step‑by‑step solutions. We designed this dataset to support reasoning‑focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) ‑ including GPT‑4o, GPT‑3.5 Turbo, LLaMA series models, Deepseek, and Qwen ‑ through both zero‑shot and few‑shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi‑step logic. LLaMA‑3.3 70B achieved the highest accuracy of 88% with few‑shot CoT prompting. We also applied Low‑Rank Adaptation (LoRA) to fine‑tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high‑quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low‑resource languages and enhance reasoning capabilities in educational and language technologies.

PaperID: 10338, https://arxiv.org/pdf/2505.21342.pdf

Authors: Valentin Knappich, Annemarie Friedrich, Anna Hätty, Simon Razniewski

Title: PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

Abstract:
Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline's accuracy in generating high‑quality annotations. To gain insight beyond binary classification metrics, we implement an LLM‑as‑Judge evaluation that compares the free‑form reasoning of every model‑cited reason with every examiner‑cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.

PaperID: 10339, https://arxiv.org/pdf/2505.21298.pdf

Authors: Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, Michael Wooldridge

Title: Large Language Models Miss the Multi-Agent Mark

Abstract:
Recent interest in Multi‑Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi‑agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM‑centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

PaperID: 10340, https://arxiv.org/pdf/2505.21291.pdf

Authors: Saman Marandi, Yu-Shu Hu, Mohammad Modarres

Title: Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

Abstract:
In this paper, we present a novel diagnostic framework that integrates Knowledge Graphs (KGs) and Large Language Models (LLMs) to support system diagnostics in high‑reliability systems such as nuclear power plants. Traditional diagnostic modeling struggles when systems become too complex, making functional modeling a more attractive approach. Our approach introduces a diagnostic framework grounded in the functional modeling principles of the Dynamic Master Logic (DML) model. It incorporates two coordinated LLM components, including an LLM‑based workflow for automated construction of DML logic from system documentation and an LLM agent that facilitates interactive diagnostics. The generated logic is encoded into a structured KG, referred to as KG‑DML, which supports hierarchical fault reasoning. Expert knowledge or operational data can also be incorporated to refine the model's precision and diagnostic depth. In the interaction phase, users submit natural language queries, which are interpreted by the LLM agent. The agent selects appropriate tools for structured reasoning, including upward and downward propagation across the KG‑DML. Rather than embedding KG content into every prompt, the LLM agent distinguishes between diagnostic and interpretive tasks. For diagnostics, the agent selects and executes external tools that perform structured KG reasoning. For general queries, a Graph‑based Retrieval‑Augmented Generation (Graph‑RAG) approach is used, retrieving relevant KG segments and embedding them into the prompt to generate natural explanations. A case study on an auxiliary feedwater system demonstrated the framework's effectiveness, with over 90% accuracy in key elements and consistent tool and argument extraction, supporting its use in safety‑critical diagnostics.

PaperID: 10341, https://arxiv.org/pdf/2505.21286.pdf

Authors: Ya-Ting Yang, Quanyan Zhu

Title: PACT: A Contract-Theoretic Framework for Pricing Agentic AI Services Powered by Large Language Models

Abstract:
Agentic AI, often powered by large language models (LLMs), is becoming increasingly popular and adopted to support autonomous reasoning, decision‑making, and task execution across various domains. While agentic AI holds great promise, its deployment as services for easy access raises critical challenges in pricing, due to high infrastructure and computation costs, multi‑dimensional and task‑dependent Quality of Service (QoS), and growing concerns around liability in high‑stakes applications. In this work, we propose PACT, a Pricing framework for cloud‑based Agentic AI services through a Contract‑Theoretic approach, which models QoS along both objective (e.g., response time) and subjective (e.g., user satisfaction) dimensions. PACT accounts for computational, infrastructure, and potential liability costs for the service provider, while ensuring incentive compatibility and individual rationality for the user under information asymmetry. Through contract‑based selection, users receive tailored service offerings aligned with their needs. Numerical evaluations demonstrate that PACT improves QoS alignment between users and providers and offers a scalable, liable approach to pricing agentic AI services in the future.

PaperID: 10342, https://arxiv.org/pdf/2505.21236.pdf

Authors: Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Ruan de Kock, Claude Formanek, Sasha Abramowitz, Oumayma Mahjoub, Wiem Khlifi, Simon Du Toit, Louay Ben Nessir, Refiloe Shabe, Noah De Nicola, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Arnu Pretorius

Title: Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Abstract:
Reinforcement learning (RL) systems have countless applications, from energy‑grid management to protein design. However, such real‑world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state‑of‑the‑art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero‑shot inference. Meanwhile, many digital or simulation‑based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi‑agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state‑of‑the‑art across 17 tasks, using only a couple seconds of extra wall‑clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference‑strategies‑rl.

PaperID: 10343, https://arxiv.org/pdf/2505.21154.pdf

Authors: Hailin Zhong, Hanlin Wang, Yujun Ye, Meiyi Zhang, Shengxin Zhu

Title: GGBond: Growing Graph-Based AI-Agent Society for Socially-Aware Recommender Simulation

Abstract:
Current personalized recommender systems predominantly rely on static offline data for algorithm design and evaluation, significantly limiting their ability to capture long‑term user preference evolution and social influence dynamics in real‑world scenarios. To address this fundamental challenge, we propose a high‑fidelity social simulation platform integrating human‑like cognitive agents and dynamic social interactions to realistically simulate user behavior evolution under recommendation interventions. Specifically, the system comprises a population of Sim‑User Agents, each equipped with a five‑layer cognitive architecture that encapsulates key psychological mechanisms, including episodic memory, affective state transitions, adaptive preference learning, and dynamic trust‑risk assessments. In particular, we innovatively introduce the Intimacy‑‑Curiosity‑‑Reciprocity‑‑Risk (ICR2) motivational engine grounded in psychological and sociological theories, enabling more realistic user decision‑making processes. Furthermore, we construct a multilayer heterogeneous social graph (GGBond Graph) supporting dynamic relational evolution, effectively modeling users' evolving social ties and trust dynamics based on interest similarity, personality alignment, and structural homophily. During system operation, agents autonomously respond to recommendations generated by typical recommender algorithms (e.g., Matrix Factorization, MultVAE, LightGCN), deciding whether to consume, rate, and share content while dynamically updating their internal states and social connections, thereby forming a stable, multi‑round feedback loop. This innovative design transcends the limitations of traditional static datasets, providing a controlled, observable environment for evaluating long‑term recommender effects.

PaperID: 10344, https://arxiv.org/pdf/2505.21116.pdf

Authors: Yi-Cheng Lin, Kang-Chieh Chen, Zhe-Yan Li, Tzu-Heng Wu, Tzu-Hsuan Wu, Kuan-Yu Chen, Hung-yi Lee, Yun-Nung Chen

Title: Creativity in LLM-based Multi-Agent Systems: A Survey

Abstract:
Large language model (LLM)‑driven multi‑agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of \emphcreativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

PaperID: 10345, https://arxiv.org/pdf/2505.21112.pdf

Authors: Hazem Zohny

Title: Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

Abstract:
This paper introduces ADEPT, a system using Large Language Model (LLM) personas to simulate multi‑perspective ethical debates. ADEPT assembles panels of 'AI personas', each embodying a distinct ethical framework or stakeholder perspective (like a deontologist, consequentialist, or disability rights advocate), to deliberate on complex moral issues. Its application is demonstrated through a scenario about prioritizing patients for a limited number of ventilators inspired by real‑world challenges in allocating scarce medical resources. Two debates, each with six LLM personas, were conducted; they only differed in the moral viewpoints represented: one included a Catholic bioethicist and a care theorist, the other substituted a rule‑based Kantian philosopher and a legal adviser. Both panels ultimately favoured the same policy ‑‑ a lottery system weighted for clinical need and fairness, crucially avoiding the withdrawal of ventilators for reallocation. However, each panel reached that conclusion through different lines of argument, and their voting coalitions shifted once duty‑ and rights‑based voices were present. Examination of the debate transcripts shows that the altered membership redirected attention toward moral injury, legal risk and public trust, which in turn changed four continuing personas' final positions. The work offers three contributions: (i) a transparent, replicable workflow for running and analysing multi‑agent AI debates in bioethics; (ii) evidence that the moral perspectives included in such panels can materially change the outcome even when the factual inputs remain constant; and (iii) an analysis of the implications and future directions for such AI‑mediated approaches to ethical deliberation and policy.

PaperID: 10346, https://arxiv.org/pdf/2505.20973.pdf

Authors: Keheliya Gallaba, Ali Arabat, Dayi Lin, Mohammed Sayagh, Ahmed E. Hassan

Title: Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

Abstract:
Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM‑powered multi‑agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory‑of‑Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often‑overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step‑by‑step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent‑first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.

PaperID: 10347, https://arxiv.org/pdf/2505.20824.pdf

Authors: Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, Yang Gao

Title: MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

Abstract:
As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi‑agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5 000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end‑to‑end attack‑defense evaluation pipeline to systematically analyze how four representative multi‑agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark‑personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision‑making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality‑scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near‑baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM‑based multi‑agent systems in medical domains.

PaperID: 10348, https://arxiv.org/pdf/2505.20820.pdf

Authors: Hyomin Kim, Yunhui Jang, Sungsoo Ahn

Title: MT-Mol:Multi Agent System with Tool-based Reasoning for Molecular Optimization

Abstract:
Large language models (LLMs) have large potential for molecular optimization, as they can gather external chemistry tools and enable collaborative interactions to iteratively refine molecular candidates. However, this potential remains underexplored, particularly in the context of structured reasoning, interpretability, and comprehensive tool‑grounded molecular optimization. To address this gap, we introduce MT‑Mol, a multi‑agent framework for molecular optimization that leverages tool‑guided reasoning and role‑specialized LLM agents. Our system incorporates comprehensive RDKit tools, categorized into five distinct domains: structural descriptors, electronic and topological features, fragment‑based functional groups, molecular representations, and miscellaneous chemical properties. Each category is managed by an expert analyst agent, responsible for extracting task‑relevant tools and enabling interpretable, chemically grounded feedback. MT‑Mol produces molecules with tool‑aligned and stepwise reasoning through the interaction between the analyst agents, a molecule‑generating scientist, a reasoning‑output verifier, and a reviewer agent. As a result, we show that our framework shows the state‑of‑the‑art performance of the PMO‑1K benchmark on 17 out of 23 tasks.

PaperID: 10349, https://arxiv.org/pdf/2505.20816.pdf

Authors: Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta

Title: Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Abstract:
Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine‑tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi‑agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text‑based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub‑questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross‑modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi‑agent framework consistently outperforms existing baselines in both accuracy and robustness.

PaperID: 10350, https://arxiv.org/pdf/2505.20737.pdf

Authors: Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Muhammad Sarwar

Title: RRO: LLM Agent Optimization Through Rising Reward Trajectories

Abstract:
Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks while it remains challenging for them to solve complex multi‑step tasks as agents. In practice, agents sensitive to the outcome of certain key steps which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, as known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates since they require extensive computations to acquire the training data through the per‑step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e. rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high‑quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode‑SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.

PaperID: 10351, https://arxiv.org/pdf/2505.20670.pdf

Authors: Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao

Title: MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning

Abstract:
Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi‑agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post‑action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra‑reflection, which critically assesses intended actions before execution, and inter‑reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR's superior performance, achieving state‑of‑the‑art results compared to existing approaches.

PaperID: 10352, https://arxiv.org/pdf/2505.20642.pdf

Authors: Yi Zhan, Qi Liu, Weibo Gao, Zheng Zhang, Tianfu Wang, Shuanghong Shen, Junyu Lu, Zhenya Huang

Title: CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models

Abstract:
Personalized programming tutoring, such as exercise recommendation, can enhance learners' efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high‑quality programming data, combined with the mismatch between offline evaluation and real‑world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine‑grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose a LLM‑based agent, CoderAgent, to simulate students' programming processes in a fine‑grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT‑R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi‑layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem‑solving strategies. Finally, experimental evaluations on real‑world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.

PaperID: 10353, https://arxiv.org/pdf/2505.20640.pdf

Authors: Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, Yu Kong

Title: IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Abstract:
Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety‑critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real‑world industrial applications. To bridge this, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety‑critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high‑fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real‑world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. Besides, it also provides extra reasoning evaluation based on these categories. Specifically, it comprises 971 question‑answer pairs generated from small warehouse and 373 pairs from large ones, incorporating scenarios with and without human. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety‑aware, and practically applicable embodied agents for complex industrial environments. Benchmark and codes are available.

PaperID: 10354, https://arxiv.org/pdf/2505.20625.pdf

Authors: Sibo Xiao, Zixin Lin, Wenyang Gao, Hui Chen, Yue Zhang

Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Abstract:
Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent‑based divide‑and‑conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi‑agent framework XpandA (Expand‑Agent) coupled with question‑driven workflow and dynamic partitioning for robust long‑context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question‑guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter‑agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state‑tracking of question‑information couples to promote the resolution of inverted‑order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long‑context benchmarks with length varying from 1k to 1M, demonstrating XpandA's feasibility for processing ultra‑long sequences and its significant effectiveness in enhancing the long‑context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full‑context, RAG and previous agent‑based methods.

PaperID: 10355, https://arxiv.org/pdf/2505.20579.pdf

Authors: Dane Malenfant, Blake A. Richards

Title: The challenge of hidden gifts in multi-agent reinforcement learning

Abstract:
Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These ``hidden gifts'' represent an interesting challenge for multi‑agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non‑trivial. Here, we study the impact of hidden gifts with a simple MARL task. In this task, agents in a grid‑world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus this act for others is a ``hidden gift''. We show that several different state‑of‑the‑art MARL algorithms, including MARL specific architectures, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that decentralized actor‑critic policy gradient agents can succeed when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for policy gradient agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi‑agent settings can be particularly challenging in the presence of ``hidden gifts'', and demonstrate that self learning‑awareness in decentralized agents can benefit these settings.

PaperID: 10356, https://arxiv.org/pdf/2505.20521.pdf

Authors: Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luis Frazão, Nuno Costa, António Pereira

Title: Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Abstract:
This paper presents Project Riley, a novel multimodal and multi‑model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents ‑ Joy, Sadness, Fear, Anger, and Disgust ‑ that engage in structured multi‑round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self‑refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval‑Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human‑likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

PaperID: 10357, https://arxiv.org/pdf/2505.20503.pdf

Authors: Matthew Lisondra, Beno Benhabib, Goldie Nejat

Title: Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

Abstract:
Rapid advancements in foundation models, including Large Language Models, Vision‑Language Models, Multimodal Large Language Models, and Vision‑Language‑Action Models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real‑world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human‑centered environments, uncertainty estimation for safe decision‑making, and computational constraints for real‑time onboard deployment. In this paper, we present the first systematic review focused specifically on the integration of foundation models in mobile service robotics. We analyze how recent advances in foundation models address these core challenges through language‑conditioned control, multimodal sensor fusion, uncertainty‑aware reasoning, and efficient model scaling. We further examine real‑world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context‑aware, socially responsive, and generalizable robot behaviors. Beyond technical considerations, we discuss ethical, societal, and human‑interaction implications associated with deploying foundation model‑enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy‑aware and resource‑constrained deployment, and governance and human‑in‑the‑loop frameworks required for safe, scalable, and trustworthy mobile service robotics.

PaperID: 10358, https://arxiv.org/pdf/2505.20466.pdf

Authors: P. S. Kesavan, Pontus Nordenfelt

Title: Reconceptualizing Smart Microscopy: From Data Collection to Knowledge Creation by Multi-Agent Integration

Abstract:
Smart microscopy represents a paradigm shift in biological imaging, moving from passive observation tools to active collaborators in scientific inquiry. Enabled by advances in automation, computational power, and artificial intelligence, these systems are now capable of adaptive decision‑making and real‑time experimental control. Here, we introduce a theoretical framework that reconceptualizes smart microscopy as a partner in scientific investigation. Central to our framework is the concept of the 'epistemic‑empirical divide' in cellular investigation‑the gap between what is observable (empirical domain) and what must be understood (epistemic domain). We propose six core design principles: epistemic‑empirical awareness, hierarchical context integration, an evolution from detection to perception, adaptive measurement frameworks, narrative synthesis capabilities, and cross‑contextual reasoning. Together, these principles guide a multi‑agent architecture designed to align empirical observation with the goals of scientific understanding. Our framework provides a roadmap for building microscopy systems that go beyond automation to actively support hypothesis generation, insight discovery, and theory development, redefining the role of scientific instruments in the process of knowledge creation.

PaperID: 10359, https://arxiv.org/pdf/2505.20327.pdf

Authors: Aurora Poggi, Giuseppe Alessio D'Inverno, Hjalmar Brismar, Ozan Öktem, Matthieu Barreau, Kateryna Morozovska

Title: Data-driven multi-agent modelling of calcium interactions in cell culture: PINN vs Regularized Least-squares

Abstract:
Data‑driven discovery of dynamics in biological systems allows for better observation and characterization of processes, such as calcium signaling in cell culture. Recent advancements in techniques allow the exploration of previously unattainable insights of dynamical systems, such as the Sparse Identification of Non‑Linear Dynamics (SINDy), overcoming the limitations of more classic methodologies. The latter requires some prior knowledge of an effective library of candidate terms, which is not realistic for a real case study. Using inspiration from fields like traffic density estimation and control theory, we propose a methodology for characterization and performance analysis of calcium delivery in a family of cells. In this work, we compare the performance of the Constrained Regularized Least‑Squares Method (CRLSM) and Physics‑Informed Neural Networks (PINN) for system identification and parameter discovery for governing ordinary differential equations (ODEs). The CRLSM achieves a fairly good parameter estimate and a good data fit when using the learned parameters in the Consensus problem. On the other hand, despite the initial hypothesis, PINNs fail to match the CRLSM performance and, under the current configuration, do not provide fair parameter estimation. However, we have only studied a limited number of PINN architectures, and it is expected that additional hyperparameter tuning, as well as uncertainty quantification, could significantly improve the performance in future works.

PaperID: 10360, https://arxiv.org/pdf/2505.20312.pdf

Authors: Aditya Bhattacharya, Katrien Verbert

Title: Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

Abstract:
During job recruitment, traditional applicant selection methods often lack transparency. Candidates are rarely given sufficient justifications for recruiting decisions, whether they are made manually by human recruiters or through the use of black‑box Applicant Tracking Systems (ATS). To address this problem, our work introduces a multi‑agent AI system that uses Large Language Models (LLMs) to guide job seekers during the recruitment process. Using an iterative user‑centric design approach, we first conducted a two‑phased exploratory study with four active job seekers to inform the design and development of the system. Subsequently, we conducted an in‑depth, qualitative user study with 20 active job seekers through individual one‑to‑one interviews to evaluate the developed prototype. The results of our evaluation demonstrate that participants perceived our multi‑agent recruitment system as significantly more actionable, trustworthy, and fair compared to traditional methods. Our study further helped us uncover in‑depth insights into factors contributing to these perceived user experiences. Drawing from these insights, we offer broader design implications for building user‑aligned, multi‑agent explainable AI systems across diverse domains.

PaperID: 10361, https://arxiv.org/pdf/2505.20273.pdf

Authors: Ke Yang, ChengXiang Zhai

Title: Ten Principles of AI Agent Economics

Abstract:
The rapid rise of AI‑based autonomous agents is transforming human society and economic systems, as these entities increasingly exhibit human‑like or superhuman intelligence. From excelling at complex games like Go to tackling diverse general‑purpose tasks with large language and multimodal models, AI agents are evolving from specialized tools into dynamic participants in social and economic ecosystems. Their autonomy and decision‑making capabilities are poised to impact industries, professions, and human lives profoundly, raising critical questions about their integration into economic activities, potential ethical concerns, and the balance between their utility and safety. To address these challenges, this paper presents ten principles of AI agent economics, offering a framework to understand how AI agents make decisions, influence social interactions, and participate in the broader economy. Drawing on economics, decision theory, and ethics, we explore fundamental questions, such as whether AI agents might evolve from tools into independent entities, their impact on labor markets, and the ethical safeguards needed to align them with human values. These principles build on existing economic theories while accounting for the unique traits of AI agents, providing a roadmap for their responsible integration into human systems. Beyond theoretical insights, this paper highlights the urgency of future research into AI trustworthiness, ethical guidelines, and regulatory oversight. As we enter a transformative era, this work serves as both a guide and a call to action, ensuring AI agents contribute positively to human progress while addressing risks tied to their unprecedented capabilities.

PaperID: 10362, https://arxiv.org/pdf/2505.20245.pdf

Authors: Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, Ji-Rong Wen

Title: KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing

Abstract:
Recent advances in retrieval‑augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi‑hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever‑growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher‑quality multi‑step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self‑bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi‑hop question answering benchmarks, and the bootstrapped version further amplifies the gains.

PaperID: 10363, https://arxiv.org/pdf/2505.20231.pdf

Authors: Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong

Title: MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents

Abstract:
Modern task‑oriented dialogue (TOD) systems increasingly rely on large language model (LLM) agents, leveraging Retrieval‑Augmented Generation (RAG) and long‑context capabilities for long‑term memory utilization. However, these methods are primarily based on semantic similarity, overlooking task intent and reducing task coherence in multi‑session dialogues. To address this challenge, we introduce MemGuide, a two‑stage framework for intent‑driven memory selection. (1) Intent‑Aligned Retrieval matches the current dialogue context with stored intent descriptions in the memory bank, retrieving QA‑formatted memory units that share the same goal. (2) Missing‑Slot Guided Filtering employs a chain‑of‑thought slot reasoner to enumerate unfilled slots, then uses a fine‑tuned LLaMA‑8B filter to re‑rank the retrieved units by marginal slot‑completion gain. The resulting memory units inform a proactive strategy that minimizes conversational turns by directly addressing information gaps. Based on this framework, we introduce the MS‑TOD, the first multi‑session TOD benchmark comprising 132 diverse personas, 956 task goals, and annotated intent‑aligned memory targets, supporting efficient multi‑session task completion. Evaluations on MS‑TOD show that MemGuide raises the task success rate by 11% (88% ‑> 99%) and reduces dialogue length by 2.84 turns in multi‑session settings, while maintaining parity with single‑session benchmarks.

PaperID: 10364, https://arxiv.org/pdf/2505.20184.pdf

Authors: Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski

Title: THiNK: Can Large Language Models Think-aloud?

Abstract:
Assessing higher‑order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface‑level accuracy. In this work, we propose THiNK (Testing Higher‑order Notion of Knowledge), a multi‑agent, feedback‑driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think‑aloud through step‑by‑step reflection and refinement. This enables a systematic evaluation of both lower‑order (e.g., remember, understand) and higher‑order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state‑of‑the‑art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform lower‑order categories well, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher‑order thinking. Qualitative evaluations further confirm that THiNK‑guided outputs better align with domain logic and problem structure. The code of our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science, which is available at our GitHub repository.

PaperID: 10365, https://arxiv.org/pdf/2505.20148.pdf

Authors: Ziming Wei, Bingqian Lin, Zijian Jiao, Yunshuang Nie, Liang Ma, Yuecheng Liu, Yuzheng Zhuang, Xiaodan Liang

Title: MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Abstract:
Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real‑world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question‑Answering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open‑world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi‑modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player‑generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation for existing MLLM‑based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe our MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development for open‑world AI agents capable of spatial planning.

PaperID: 10366, https://arxiv.org/pdf/2505.20096.pdf

Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai

Title: MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Abstract:
We present MA‑RAG, a Multi‑Agent framework for Retrieval‑Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information‑seeking tasks. Unlike conventional RAG methods that rely on end‑to‑end fine‑tuning or isolated component enhancements, MA‑RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain‑of‑thought prompting, MA‑RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi‑hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA‑RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3‑8B model equipped with MA‑RAG surpasses larger standalone LLMs, while larger variants (LLaMA3‑70B and GPT‑4o‑mini) set new state‑of‑the‑art results on challenging multi‑hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi‑hop reasoning, and that high‑capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general‑domain QA, MA‑RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain‑specific models without any domain‑specific fine‑tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval‑augmented systems: MA‑RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi‑agent RAG.

PaperID: 10367, https://arxiv.org/pdf/2505.20023.pdf

Authors: Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao

Title: Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

Abstract:
Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed‑source LLMs like GPT‑4. Although training open‑source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM‑based agent training. We synthesize self‑reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self‑reflecting and correcting. We also introduce partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open‑source model LLaMA2‑7B‑Chat, when trained using self‑reflected trajectories constructed with Qwen1.5‑110B‑Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.

PaperID: 10368, https://arxiv.org/pdf/2505.19905.pdf

Authors: Shuang Ao, Flora D. Salim, Simon Khan

Title: EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

Abstract:
Although LLMs demonstrate proficiency in several text‑based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain‑specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high‑level textual plans generated by an LLM using real‑time feedback from a VLM executing low‑level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT‑1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

PaperID: 10369, https://arxiv.org/pdf/2505.19837.pdf

Authors: Christoph R. Landolt, Christoph Würsch, Roland Meier, Alain Mermoud, Julian Jang-Jaccard

Title: Multi-Agent Reinforcement Learning in Cybersecurity: From Fundamentals to Applications

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has shown great potential as an adaptive solution for addressing modern cybersecurity challenges. MARL enables decentralized, adaptive, and collaborative defense strategies and provides an automated mechanism to combat dynamic, coordinated, and sophisticated threats. This survey investigates the current state of research in MARL applications for automated cyber defense (ACD), focusing on intruder detection and lateral movement containment. Additionally, it examines the role of Autonomous Intelligent Cyber‑defense Agents (AICA) and Cyber Gyms in training and validating MARL agents. Finally, the paper outlines existing challenges, such as scalability and adversarial robustness, and proposes future research directions. This also discusses how MARL integrates in AICA to provide adaptive, scalable, and dynamic solutions to counter the increasingly sophisticated landscape of cyber threats. It highlights the transformative potential of MARL in areas like intrusion detection and lateral movement containment, and underscores the value of Cyber Gyms for training and validation of AICA.

PaperID: 10370, https://arxiv.org/pdf/2505.19828.pdf

Authors: Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, Song Wang

Title: SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

Abstract:
Large Language Models (LLMs) have shown promise in software engineering tasks, but evaluating their effectiveness in vulnerability detection is challenging due to the lack of high‑quality datasets. Most existing datasets are limited to function‑level labels, ignoring finer‑grained vulnerability patterns and crucial contextual information. Also, poor data quality such as mislabeling, inconsistent annotations, and duplicates can lead to inflated performance and weak generalization. Moreover, by including only the functions, these datasets miss broader program context, like data/control dependencies and interprocedural interactions, that are essential for accurately understanding real‑world security flaws. Without this context, detection models are evaluated under unrealistic assumptions. To address these limitations, this paper introduces SecVulEval, a benchmark designed to support fine‑grained evaluation of LLMs and other detection methods with rich contextual information. SecVulEval focuses on real‑world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model's ability to localize vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for vulnerability detection benchmarks in realistic scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated the SOTA LLMs with a multi‑agent‑based approach. The evaluation on our dataset shows that the models are still far from accurately predicting vulnerable statements in a given function. The best‑performing Claude‑3.7‑Sonnet model achieves 23.83% F1‑score for detecting vulnerable statements with correct reasoning. Finally, we analyze the LLM outputs and provide insights into their behavior in vulnerability detection for C/C++.

PaperID: 10371, https://arxiv.org/pdf/2505.19803.pdf

Authors: Fuze Sun, Lingyu Li, Shixiangyue Meng, Xiaoming Teng, Terry R. Payne, Paul Craig

Title: Integrating emotional intelligence, memory architecture, and gestures to achieve empathetic humanoid robot interaction in an educational setting

Abstract:
This study investigates the integration of individual human traits into an empathetically adaptive educational robot tutor system designed to improve student engagement and learning outcomes with corresponding Engagement Vector measurement. While prior research in the field of Human‑Robot Interaction (HRI) has examined the integration of the traits, such as emotional intelligence, memory‑driven personalization, and non‑verbal communication, by themselves, they have thus‑far neglected to consider their synchronized integration into a cohesive, operational education framework. To address this gap, we customize a Multi‑Modal Large Language Model (LLaMa 3.2 from Meta) deployed with modules for human‑like traits (emotion, memory and gestures) into an AI‑Agent framework. This constitutes to the robot's intelligent core mimicing the human emotional system, memory architecture and gesture control to allow the robot to behave more empathetically while recognizing and responding appropriately to the student's emotional state. It can also recall the student's past learning record and adapt its style of interaction accordingly. This allows the robot tutor to react to the student in a more sympathetic manner by delivering personalized verbal feedback synchronized with relevant gestures. Our study investigates the extent of this effect through the introduction of Engagement Vector Model which can be a surveyor's pole for judging the quality of HRI experience. Quantitative and qualitative results demonstrate that such an empathetic responsive approach significantly improves student engagement and learning outcomes compared with a baseline humanoid robot without these human‑like traits. This indicates that robot tutors with empathetic capabilities can create a more supportive, interactive learning experience that ultimately leads to better outcomes for the student.

PaperID: 10372, https://arxiv.org/pdf/2505.19637.pdf

Authors: Byunghyun Yoo, Younghwan Shin, Hyunwoo Kim, Euisok Chung, Jeongmin Yang

Title: Adaptive Episode Length Adjustment for Multi-agent Reinforcement Learning

Abstract:
In standard reinforcement learning, an episode is defined as a sequence of interactions between agents and the environment, which terminates upon reaching a terminal state or a pre‑defined episode length. Setting a shorter episode length enables the generation of multiple episodes with the same number of data samples, thereby facilitating an exploration of diverse states. While shorter episodes may limit the collection of long‑term interactions, they may offer significant advantages when properly managed. For example, trajectory truncation in single‑agent reinforcement learning has shown how the benefits of shorter episodes can be leveraged despite the trade‑off of reduced long‑term interaction experiences. However, this approach remains underexplored in MARL. This paper proposes a novel MARL approach, Adaptive Episode Length Adjustment (AELA), where the episode length is initially limited and gradually increased based on an entropy‑based assessment of learning progress. By starting with shorter episodes, agents can focus on learning effective strategies for initial states and minimize time spent in dead‑end states. The use of entropy as an assessment metric prevents premature convergence to suboptimal policies and ensures balanced training over varying episode lengths. We validate our approach using the StarCraft Multi‑agent Challenge (SMAC) and a modified predator‑prey environment, demonstrating significant improvements in both convergence speed and overall performance compared to existing methods. To the best of our knowledge, this is the first study to adaptively adjust episode length in MARL based on learning progress.

PaperID: 10373, https://arxiv.org/pdf/2505.19630.pdf

Authors: Yichun Feng, Jiawei Wang, Lu Zhou, Yikai Zheng, Zhen Lei, Yixue Li

Title: Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning

Abstract:
Large language models (LLMs) struggle in real‑world clinical consultations. Single‑turn consultation systems require patients to describe all symptoms at once, which often leads to unclear complaints and vague diagnoses. Traditional dialogue models, constrained by static supervised learning, are limited to superficially imitating existing dialogue patterns and lack the ability to actively construct understanding in dynamic interactions, thus failing to achieve genuine clinical reasoning.To address these challenges, we propose DoctorAgent‑RL, a reinforcement learning (RL)‑based multi‑agent collaborative framework, and train a doctor agent on Qwen2.5‑7B‑Instruct using this framework. Within this framework, a medical consultation is modeled as a dynamic decision‑making process under uncertainty. The core intelligence of the doctor agent is shifted from knowing the answer to learning and mastering a questioning methodology aimed at achieving an optimal diagnosis. Through strategic questioning, it guides the progressive emergence of key patient information in multi‑turn dialogues. To support this high‑fidelity simulation of the real diagnostic process, we constructed MTMedDialog, a novel English multi‑turn medical consultation dataset designed for dynamic, interactive training.To validate its real‑world effectiveness, rigorous evaluations including blinded human assessments and trials with real patients were conducted. DoctorAgent‑RL outperformed frontier models and achieved a 70% exact diagnostic match rate, confirming its potential as a collaborative tool. By handling initial screenings, it can free clinicians to focus on complex cases, thereby addressing critical issues like physician shortages and misdiagnosis risks while alleviating the strain on healthcare resources.

PaperID: 10374, https://arxiv.org/pdf/2505.19567.pdf

Authors: Rasoul Zahedifar, Sayyed Ali Mirghasemi, Mahdieh Soleymani Baghshah, Alireza Taheri

Title: LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

Abstract:
This study presents the LLM‑Agent‑Controller, a multi‑agent large language model (LLM) system developed to address a wide range of problems in control engineering (Control Theory). The system integrates a central controller agent with multiple specialized auxiliary agents, responsible for tasks such as controller design, model representation, control analysis, time‑domain response, and simulation. A supervisor oversees high‑level decision‑making and workflow coordination, enhancing the system's reliability and efficiency. The LLM‑Agent‑Controller incorporates advanced capabilities, including Retrieval‑Augmented Generation (RAG), Chain‑of‑Thought reasoning, self‑criticism and correction, efficient memory handling, and user‑friendly natural language communication. It is designed to function without requiring users to have prior knowledge of Control Theory, enabling them to input problems in plain language and receive complete, real‑time solutions. To evaluate the system, we propose new performance metrics assessing both individual agents and the system as a whole. We test five categories of Control Theory problems and benchmark performance across three advanced LLMs. Additionally, we conduct a comprehensive qualitative conversational analysis covering all key services. Results show that the LLM‑Agent‑Controller successfully solved 83% of general tasks, with individual agents achieving an average success rate of 87%. Performance improved with more advanced LLMs. This research demonstrates the potential of multi‑agent LLM architectures to solve complex, domain‑specific problems. By integrating specialized agents, supervisory control, and advanced reasoning, the LLM‑Agent‑Controller offers a scalable, robust, and accessible solution framework that can be extended to various technical domains.

PaperID: 10375, https://arxiv.org/pdf/2505.19538.pdf

Authors: Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Sin Yee Goi, Jinzhuo Wang

Title: DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients

Abstract:
Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases ‑‑ a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor‑like reasoning by integrating both explicit clinical knowledge and implicit case‑based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism from both relevant knowledge and patient. In addition, a Med‑TextGrad module using multi‑agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor‑like medical reasoning systems.

PaperID: 10376, https://arxiv.org/pdf/2505.19489.pdf

Authors: Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

Title: Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

Abstract:
The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE‑bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large‑scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real‑world Linux kernel bugs. We conduct an empirical study to assess the performance of state‑of‑the‑art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top‑1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL^+, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL^+ substantially improves the FL accuracy of all studied agents (e.g., 7.2% ‑ 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.

PaperID: 10377, https://arxiv.org/pdf/2505.19486.pdf

Authors: Maonan Wang, Yirong Chen, Aoyu Pang, Yuxin Cai, Chung Shue Chen, Yuheng Kan, Man-On Pun

Title: VLMLight: Safety-Critical Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning Architecture

Abstract:
Traffic signal control (TSC) is a core challenge in urban mobility, where real‑time decisions must balance efficiency and safety. Existing methods ‑ ranging from rule‑based heuristics to reinforcement learning (RL) ‑ often struggle to generalize to complex, dynamic, and safety‑critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision‑language meta‑control with dual‑branch reasoning. At the core of VLMLight is the first image‑based traffic simulator that enables multi‑view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety‑prioritized meta‑controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL‑only systems, while preserving real‑time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety‑aware solution for next‑generation traffic signal control.

PaperID: 10378, https://arxiv.org/pdf/2505.19477.pdf

Authors: Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi

Title: Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge

Abstract:
LLM‑as‑Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals in trainings. While recent work has explored multi‑agent extensions such as multi‑agent debate and meta‑judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain‑of‑thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi‑agent LLM‑as‑Judge frameworks: Multi‑Agent‑Debate and LLM‑as‑Meta‑Judge. Our results show that debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta‑judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single‑agent debiasing method, as a bias‑free agent within these systems. The results reveal that this bias‑free agent effectively reduces biases in debate settings but provides less benefit in meta‑judge scenarios. Our work provides a comprehensive study of bias behavior in multi‑agent LLM‑as‑Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.

PaperID: 10379, https://arxiv.org/pdf/2505.19436.pdf

Authors: Ye Ye

Title: Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

Abstract:
Large Language Models (LLMs) falter in multi‑step interactions ‑‑ often hallucinating, repeating actions, or misinterpreting user corrections ‑‑ due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision‑aware agents without fine‑tuning. TME implements a spatial memory framework that replaces flat context with graph‑based structures to support consistent, multi‑turn reasoning. Departing from linear concatenation and ReAct‑style prompting, TME builds a dynamic task graph ‑‑ either a tree or directed acyclic graph (DAG) ‑‑ to map user inputs to subtasks, align them with prior context, and enable dependency‑tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi‑turn scenarios‑trip planning, cooking, meeting scheduling, and shopping cart editing ‑‑ TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug‑and‑play deployment and domain‑specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open‑source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.

PaperID: 10380, https://arxiv.org/pdf/2505.19419.pdf

Authors: Baichuan Li, Larry Powell, Tracy Hammond

Title: It's Not Just Labeling -- A Research on LLM Generated Feedback Interpretability and Image Labeling Sketch Features

Abstract:
The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time‑consuming, expert‑driven methods with limited feedback. This research introduces a sketch‑based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM‑assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch‑based virtual assistant that simplifies annotation for non‑experts and advances LLM‑driven labeling tools in terms of scalability, accessibility, and explainability.

PaperID: 10381, https://arxiv.org/pdf/2505.19405.pdf

Authors: Yan Wen, Junfeng Guo, Heng Huang

Title: CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

Abstract:
As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi‑agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter‑agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger‑based detection within Chain‑of‑Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine‑grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning‑level monitoring offers a promising direction for safeguarding intellectual property in LLM‑based agent systems.

PaperID: 10382, https://arxiv.org/pdf/2505.19316.pdf

Authors: Rex Chen, Stephanie Milani, Zhicheng Zhang, Norman Sadeh, Fei Fang

Title: Making Teams and Influencing Agents: Efficiently Coordinating Decision Trees for Interpretable Multi-Agent Reinforcement Learning

Abstract:
Poor interpretability hinders the practical applicability of multi‑agent reinforcement learning (MARL) policies. Deploying interpretable surrogates of uninterpretable policies enhances the safety and verifiability of MARL for real‑world applications. However, if these surrogates are to interact directly with the environment within human supervisory frameworks, they must be both performant and computationally efficient. Prior work on interpretable MARL has either sacrificed performance for computational efficiency or computational efficiency for performance. To address this issue, we propose HYDRAVIPER, a decision tree‑based interpretable MARL algorithm. HYDRAVIPER coordinates training between agents based on expected team performance, and adaptively allocates budgets for environment interaction to improve computational efficiency. Experiments on standard benchmark environments for multi‑agent coordination and traffic signal control show that HYDRAVIPER matches the performance of state‑of‑the‑art methods using a fraction of the runtime, and that it maintains a Pareto frontier of performance for different interaction budgets.

PaperID: 10383, https://arxiv.org/pdf/2505.19301.pdf

Authors: Ken Huang, Vineeth Sai Narajala, John Yeoh, Jason Ross, Ramesh Raskar, Youssef Harkati, Jerry Huang, Idan Habler, Chris Hughes

Title: A Novel Zero-Trust Identity Framework for Agentic AI: Decentralized Authentication and Fine-Grained Access Control

Abstract:
Traditional Identity and Access Management (IAM) systems, primarily designed for human users or static machine identities via protocols such as OAuth, OpenID Connect (OIDC), and SAML, prove fundamentally inadequate for the dynamic, interdependent, and often ephemeral nature of AI agents operating at scale within Multi Agent Systems (MAS), a computational system composed of multiple interacting intelligent agents that work collectively. This paper posits the imperative for a novel Agentic AI IAM framework: We deconstruct the limitations of existing protocols when applied to MAS, illustrating with concrete examples why their coarse‑grained controls, single‑entity focus, and lack of context‑awareness falter. We then propose a comprehensive framework built upon rich, verifiable Agent Identities (IDs), leveraging Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), that encapsulate an agents capabilities, provenance, behavioral scope, and security posture. Our framework includes an Agent Naming Service (ANS) for secure and capability‑aware discovery, dynamic fine‑grained access control mechanisms, and critically, a unified global session management and policy enforcement layer for real‑time control and consistent revocation across heterogeneous agent communication protocols. We also explore how Zero‑Knowledge Proofs (ZKPs) enable privacy‑preserving attribute disclosure and verifiable policy compliance. We outline the architecture, operational lifecycle, innovative contributions, and security considerations of this new IAM paradigm, aiming to establish the foundational trust, accountability, and security necessary for the burgeoning field of agentic AI and the complex ecosystems they will inhabit.

PaperID: 10384, https://arxiv.org/pdf/2505.19260.pdf

Authors: Shiyu Xiang, Tong Zhang, Ronghao Chen

Title: ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast \& Slow Reasoning for Robust Agent Defense

Abstract:
LLM Agents are becoming central to intelligent systems. However, their deployment raises serious safety concerns. Existing defenses largely rely on "Safety Checks", which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors ‑ creating a significant semantic gap between safety checks and real‑world risks. To bridge this gap, we propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning). ALRPHFS consists of two core components: (1) an offline adversarial self‑learning loop to iteratively refine a generalizable and balanced library of risk patterns, substantially enhancing robustness without retraining the base LLM, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency. Experimental results demonstrate that our approach achieves superior overall performance compared to existing baselines, achieving a best‑in‑class average accuracy of 80% and exhibiting strong generalizability across agents and tasks.

PaperID: 10385, https://arxiv.org/pdf/2505.19205.pdf

Authors: Meher Bhaskar Madiraju, Meher Sai Preetam Madiraju

Title: OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization

Abstract:
Hyperparameter optimization (HPO) is a critical yet challenging aspect of machine learning model development, significantly impacting model performance and generalization. Traditional HPO methods often struggle with high dimensionality, complex interdependencies, and computational expense. This paper introduces OptiMindTune, a novel multi‑agent framework designed to intelligently and efficiently optimize hyperparameters. OptiMindTune leverages the collaborative intelligence of three specialized AI agents ‑‑ a Recommender Agent, an Evaluator Agent, and a Decision Agent ‑‑ each powered by Google's Gemini models. These agents address distinct facets of the HPO problem, from model selection and hyperparameter suggestion to robust evaluation and strategic decision‑making. By fostering dynamic interactions and knowledge sharing, OptiMindTune aims to converge to optimal hyperparameter configurations more rapidly and robustly than existing single‑agent or monolithic approaches. Our framework integrates principles from advanced large language models, and adaptive search to achieve scalable and intelligent AutoML. We posit that this multi‑agent paradigm offers a promising avenue for tackling the increasing complexity of modern machine learning model tuning.

PaperID: 10386, https://arxiv.org/pdf/2505.19197.pdf

Authors: Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, Jihoon Kwon, Minjae Kim, Juneha Hwang, Minsoo Ha, Chaewoon Kim, Jaeseon Ha, Suyeol Yun, Jin Kim

Title: Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance

Abstract:
Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time‑consuming and resource‑intensive. Conventional approaches in practice rely heavily on labor‑intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting quantitative insights from unstructured financial documents, leveraging a multi‑agent system composed of large language models. Our proposed multi‑agent system consists of two specialized agents: the \emphExtraction Agent and the \emphText‑to‑SQL Agent. The Extraction Agent automatically identifies key performance indicators from unstructured financial text, standardizes their formats, and verifies their accuracy. On the other hand, the Text‑to‑SQL Agent generates executable SQL statements from natural language queries, allowing users to access structured data accurately without requiring familiarity with the database schema. Through experiments, we demonstrate that our proposed system effectively transforms unstructured text into structured data accurately and enables precise retrieval of key information. First, we demonstrate that our system achieves approximately 95% accuracy in transforming financial filings into structured data, matching the performance level typically attained by human annotators. Second, in a human evaluation of the retrieval task ‑‑ where natural language queries are used to search information from structured data ‑‑ 91% of the responses were rated as correct by human evaluators. In both evaluations, our system generalizes well across financial document types, consistently delivering reliable performance.

PaperID: 10387, https://arxiv.org/pdf/2505.19173.pdf

Authors: Debdeep Sanyal, Agniva Maiti, Umakanta Maharana, Dhruv Kumar, Ankur Mali, C. Lee Giles, Murari Mandal

Title: Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style

Abstract:
Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, we introduce a novel simulation framework that integrates LLM‑based heterogeneous student agents with a self‑optimizing teacher agent. The teacher agent's pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, we propose Persona‑RAG, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona‑RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM‑driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data‑driven environments.

PaperID: 10388, https://arxiv.org/pdf/2505.19167.pdf

Authors: Thomas P. Kehler, Scott E. Page, Alex Pentland, Martin Reeves, John Seely Brown

Title: Amplifying Human Creativity and Problem Solving with AI Through Generative Collective Intelligence

Abstract:
We propose a general framework for human‑AI collaboration that amplifies the distinct capabilities of both types of intelligence. We refer to this as Generative Collective Intelligence (GCI). GCI employs AI in dual roles: as interactive agents and as technology that accumulates, organizes, and leverages knowledge. In this second role, AI creates a cognitive bridge between human reasoning and AI models. The AI functions as a social and cultural technology that enables groups to solve complex problems through structured collaboration that transcends traditional communication barriers. We argue that GCI can overcome limitations of purely algorithmic approaches to problem‑solving and decision‑making. We describe the mathematical foundations of GCI, based on the law of comparative judgment and minimum regret principles, and briefly illustrate its applications across various domains, including climate adaptation, healthcare transformation, and civic participation. By combining human creativity with AI's computational capabilities, GCI offers a promising approach to addressing complex societal challenges that neither humans nor machines can solve alone.

PaperID: 10389, https://arxiv.org/pdf/2505.18946.pdf

Authors: Yong Xiao, Haoran Zhou, Xubo Li, Yayu Gao, Guangming Shi, Ping Zhang

Title: SANNet: A Semantic-Aware Agentic AI Networking Framework for Multi-Agent Cross-Layer Coordination

Abstract:
Agentic AI networking (AgentNet) is a novel AI‑native networking paradigm that relies on a large number of specialized AI agents to collaborate and coordinate for autonomous decision‑making, dynamic environmental adaptation, and complex goal achievement. It has the potential to facilitate real‑time network management alongside capabilities for self‑configuration, self‑optimization, and self‑adaptation across diverse and complex networking environments, laying the foundation for fully autonomous networking systems in the future. Despite its promise, AgentNet is still in the early stage of development, and there still lacks an effective networking framework to support automatic goal discovery and multi‑agent self‑orchestration and task assignment. This paper proposes SANNet, a novel semantic‑aware agentic AI networking architecture that can infer the semantic goal of the user and automatically assign agents associated with different layers of a mobile system to fulfill the inferred goal. Motivated by the fact that one of the major challenges in AgentNet is that different agents may have different and even conflicting objectives when collaborating for certain goals, we introduce a dynamic weighting‑based conflict‑resolving mechanism to address this issue. We prove that SANNet can provide theoretical guarantee in both conflict‑resolving and model generalization performance for multi‑agent collaboration in dynamic environment. We develop a hardware prototype of SANNet based on the open RAN and 5GS core platform. Our experimental results show that SANNet can significantly improve the performance of multi‑agent networking systems, even when agents with conflicting objectives are selected to collaborate for the same goal.

PaperID: 10390, https://arxiv.org/pdf/2505.18889.pdf

Authors: Miles Q. Li, Benjamin C. M. Fung

Title: Security Concerns for Large Language Models: A Survey

Abstract:
Large Language Models (LLMs) such as ChatGPT and its competitors have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. This survey provides a comprehensive overview of these emerging concerns, categorizing threats into several key areas: inference‑time attacks via prompt manipulation; training‑time attacks; misuse by malicious actors; and the inherent risks in autonomous LLM agents. Recently, a significant focus is increasingly being placed on the latter. We summarize recent academic and industrial studies from 2022 to 2025 that exemplify each threat, analyze existing defense mechanisms and their limitations, and identify open challenges in securing LLM‑based applications. We conclude by emphasizing the importance of advancing robust, multi‑layered security strategies to ensure LLMs are safe and beneficial.

PaperID: 10391, https://arxiv.org/pdf/2505.18878.pdf

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Abstract:
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent‑user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena‑Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena‑Pro expands on CRMArena with nineteen expert‑validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business‑to‑Business and Business‑to‑Customer scenarios. It distinctively incorporates multi‑turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single‑turn success on CRMArena‑Pro, with performance dropping significantly to approximately 35% in multi‑turn settings. While Workflow Execution proves more tractable for top agents (over 83% single‑turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near‑zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi‑turn reasoning, confidentiality adherence, and versatile skill acquisition.

PaperID: 10392, https://arxiv.org/pdf/2505.18750.pdf

Authors: Jiarong Fan, Chenghao Huang, Hao Wang

Title: Agent-Based Decentralized Energy Management of EV Charging Station with Solar Photovoltaics via Multi-Agent Reinforcement Learning

Abstract:
In the pursuit of energy net zero within smart cities, transportation electrification plays a pivotal role. The adoption of Electric Vehicles (EVs) keeps increasing, making energy management of EV charging stations critically important. While previous studies have managed to reduce energy cost of EV charging while maintaining grid stability, they often overlook the robustness of EV charging management against uncertainties of various forms, such as varying charging behaviors and possible faults in faults in some chargers. To address the gap, a novel Multi‑Agent Reinforcement Learning (MARL) approach is proposed treating each charger to be an agent and coordinate all the agents in the EV charging station with solar photovoltaics in a more realistic scenario, where system faults may occur. A Long Short‑Term Memory (LSTM) network is incorporated in the MARL algorithm to extract temporal features from time‑series. Additionally, a dense reward mechanism is designed for training the agents in the MARL algorithm to improve EV charging experience. Through validation on a real‑world dataset, we show that our approach is robust against system uncertainties and faults and also effective in minimizing EV charging costs and maximizing charging service satisfaction.

PaperID: 10393, https://arxiv.org/pdf/2505.18646.pdf

Authors: Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng

Title: SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Abstract:
Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi‑agent systems with agentic workflows, where complex coding tasks are decomposed into sub‑tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand‑crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self‑Evolving Workflow (SEW), a novel self‑evolving framework that automatically generates and optimises multi‑agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self‑evolution, bringing up to 12% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.

PaperID: 10394, https://arxiv.org/pdf/2505.18596.pdf

Authors: Chen Han, Wenzhen Zheng, Xijin Tang

Title: Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Abstract:
The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real‑world fact‑checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate‑to‑Detect (D2D), a novel Multi‑Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact‑checking workflows, D2D assigns domain‑specific profiles to each agent and orchestrates a five‑stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi‑dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT‑4o on two datasets demonstrate significant improvements over baseline methods, and the case study highlight D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards interpretable misinformation detection. The code will be released publicly after the official publication.

PaperID: 10395, https://arxiv.org/pdf/2505.18595.pdf

Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen

Title: MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

Abstract:
We study offline imitation learning (IL) in cooperative multi‑agent settings, where demonstrations have unlabeled mixed quality ‑ containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi‑agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference‑based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert‑quality trajectories. In the second stage, we introduce MisoDICE, a novel multi‑agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state‑action spaces. By extending the popular single‑agent DICE framework to multi‑agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi‑agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.

PaperID: 10396, https://arxiv.org/pdf/2505.18572.pdf

Authors: Yifan Zhu, Chao Zhang, Xin Shi, Xueqiao Zhang, Yi Yang, Yawei Luo

Title: MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures -- A Comprehensive Framework

Abstract:
Large Language Models (LLMs)‑based Multi‑Agent Systems (MAS) exhibit remarkable problem‑solving and task planning capabilities across diverse domains due to their specialized agentic roles and collaborative interactions. However, this also amplifies the severity of security risks under MAS attacks. To address this, we introduce MASTER, a novel security research framework for MAS, focusing on diverse Role configurations and Topological structures across various scenarios. MASTER offers an automated construction process for different MAS setups and an information‑flow‑based interaction paradigm. To tackle MAS security challenges in varied scenarios, we design a scenario‑adaptive, extensible attack strategy utilizing role and topological information, which dynamically allocates targeted, domain‑specific attack tasks for collaborative agent execution. Our experiments demonstrate that such an attack, leveraging role and topological information, exhibits significant destructive potential across most models. Additionally, we propose corresponding defense strategies, substantially enhancing MAS resilience across diverse scenarios. We anticipate that our framework and findings will provide valuable insights for future research into MAS security challenges.

PaperID: 10397, https://arxiv.org/pdf/2505.18543.pdf

Authors: Baolei Zhang, Haoran Xin, Jiatong Li, Dongzhe Zhang, Minghong Fang, Zhuqing Liu, Lihai Nie, Zheli Liu

Title: Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) has proven effective in mitigating hallucinations in large language models by incorporating external knowledge during inference. However, this integration introduces new security vulnerabilities, particularly to poisoning attacks. Although prior work has explored various poisoning strategies, a thorough assessment of their practical threat to RAG systems remains missing. To address this gap, we propose the first comprehensive benchmark framework for evaluating poisoning attacks on RAG. Our benchmark covers 5 standard question answering (QA) datasets and 10 expanded variants, along with 13 poisoning attack methods and 7 defense mechanisms, representing a broad spectrum of existing techniques. Using this benchmark, we conduct a comprehensive evaluation of all included attacks and defenses across the full dataset spectrum. Our findings show that while existing attacks perform well on standard QA datasets, their effectiveness drops significantly on the expanded versions. Moreover, our results demonstrate that various advanced RAG architectures, such as sequential, branching, conditional, and loop RAG, as well as multi‑turn conversational RAG, multimodal RAG systems, and RAG‑based LLM agent systems, remain susceptible to poisoning attacks. Notably, current defense techniques fail to provide robust protection, underscoring the pressing need for more resilient and generalizable defense strategies.

PaperID: 10398, https://arxiv.org/pdf/2505.18530.pdf

Authors: Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim

Title: MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs

Abstract:
Medical Large Vision‑Language Models (Med‑LVLMs) have been widely adopted for medical report generation. Despite Med‑LVLMs producing state‑of‑the‑art performance, they exhibit a bias toward predicting all findings as normal, leading to reports that overlook critical abnormalities. Furthermore, these models often fail to provide comprehensive descriptions of radiologically relevant regions necessary for accurate diagnosis. To address these challenges, we proposeMedical Report Generation Agents (MRGAgents), a novel multi‑agent framework that fine‑tunes specialized agents for different disease categories. By curating subsets of the IU X‑ray and MIMIC‑CXR datasets to train disease‑specific agents, MRGAgents generates reports that more effectively balance normal and abnormal findings while ensuring a comprehensive description of clinically relevant regions. Our experiments demonstrate that MRGAgents outperformed the state‑of‑the‑art, improving both report comprehensiveness and diagnostic utility.

PaperID: 10399, https://arxiv.org/pdf/2505.18471.pdf

Authors: Guoheng Sun, Ziyao Wang, Xuandong Zhao, Bowei Tian, Zheyu Shen, Yexiao He, Jinming Xing, Ang Li

Title: Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Abstract:
Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi‑step reasoning and multi‑agent collaboration, to generate high‑quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: quantity inflation, where token and call counts may be artificially inflated, and quality downgrade, where providers might quietly substitute lower‑cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment‑based, predictive, behavioral, and signature‑based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three‑layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user‑facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.

PaperID: 10400, https://arxiv.org/pdf/2505.18457.pdf

Authors: Abir Ray

Title: EdgeAgentX: A Novel Framework for Agentic AI at the Edge in Military Communication Networks

Abstract:
This paper introduces EdgeAgentX, a novel framework integrating federated learning (FL), multi‑agent reinforcement learning (MARL), and adversarial defense mechanisms, tailored for military communication networks. EdgeAgentX significantly improves autonomous decision‑making, reduces latency, enhances throughput, and robustly withstands adversarial disruptions, as evidenced by comprehensive simulations.

PaperID: 10401, https://arxiv.org/pdf/2505.18433.pdf

Authors: Zhiyao Zhang, Myeung Suk Oh, FNU Hairi, Ziyue Luo, Alvaro Velasquez, Jia Liu

Title: Finite-Time Global Optimality Convergence in Deep Neural Actor-Critic Methods for Decentralized Multi-Agent Reinforcement Learning

Abstract:
Actor‑critic methods for decentralized multi‑agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor‑critic decentralized MARL methods are limited to the guarantee of a stationary solution under the linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor‑critic for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor‑critic method for decentralized MARL, where both the actor and critic components are inherently non‑linear. We show that our proposed method enjoys a global optimality guarantee with a finite‑time convergence rate of O(1/T), where T is the total iteration times. This marks the first global convergence result for deep neural actor‑critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results.

PaperID: 10402, https://arxiv.org/pdf/2505.18397.pdf

Authors: Fangqiao Tian, An Luo, Jin Du, Xun Xian, Robert Specht, Ganghua Wang, Xuan Bi, Jiawei Zhou, Ashish Kundu, Jayanth Srinivasa, Charles Fleming, Rui Zhang, Zirui Liu, Mingyi Hong, Jie Ding

Title: An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

Abstract:
A multi‑agent AI system (MAS) is composed of multiple autonomous agents that interact, exchange information, and make decisions based on internal generative models. Recent advances in large language models and tool‑using agents have made MAS increasingly practical in areas like scientific discovery and collaborative automation. However, key questions remain: When are MAS more effective than single‑agent systems? What new safety risks arise from agent interactions? And how should we evaluate their reliability and structure? This paper outlines a formal framework for analyzing MAS, focusing on two core aspects: effectiveness and safety. We explore whether MAS truly improve robustness, adaptability, and performance, or merely repackage known techniques like ensemble learning. We also study how inter‑agent dynamics may amplify or suppress system vulnerabilities. While MAS are relatively new to the signal processing community, we envision them as a powerful abstraction that extends classical tools like distributed estimation and sensor fusion to higher‑level, policy‑driven inference. Through experiments on data science automation, we highlight the potential of MAS to reshape how signal processing systems are designed and trusted.

PaperID: 10403, https://arxiv.org/pdf/2505.18362.pdf

Authors: Nathan Gaby, Xiaojing Ye

Title: Maximum Principle of Optimal Probability Density Control

Abstract:
We develop a general theoretical framework for optimal probability density control on standard measure spaces, aimed at addressing large‑scale multi‑agent control problems. In particular, we establish a maximum principle (MP) for control problems posed on infinite‑dimensional spaces of probability distributions and control vector fields. We further derive the Hamilton‑‑Jacobi‑‑Bellman equation for the associated value functional defined on the space of probability distributions. Both results are presented in a concise form and supported by rigorous mathematical analysis, enabling efficient numerical treatment of these problems. Building on the proposed MP, we introduce a scalable numerical algorithm that leverages deep neural networks to handle high‑dimensional settings. The effectiveness of the approach is demonstrated through several multi‑agent control examples involving domain obstacles and inter‑agent interactions.

PaperID: 10404, https://arxiv.org/pdf/2505.18351.pdf

Authors: Sola Kim, Dongjune Chang, Jieshu Wang

Title: Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation

Abstract:
Despite advances in designing personas for Large Language Models (LLM), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database‑backed architecture for implementing stakeholder personas. Experiments tested agents' responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns (R^2 range: 0.58‑0.61) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining 73% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black‑box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.

PaperID: 10405, https://arxiv.org/pdf/2505.18341.pdf

Authors: Miao Li, Wenhao Ding, Haohong Lin, Yiqi Lyu, Yihang Yao, Yuyou Zhang, Ding Zhao

Title: CrashAgent: Crash Scenario Generation via Multi-modal Reasoning

Abstract:
Training and evaluating autonomous driving algorithms requires a diverse range of scenarios. However, most available datasets predominantly consist of normal driving behaviors demonstrated by human drivers, resulting in a limited number of safety‑critical cases. This imbalance, often referred to as a long‑tail distribution, restricts the ability of driving algorithms to learn from crucial scenarios involving risk or failure, scenarios that are essential for humans to develop driving skills efficiently. To generate such scenarios, we utilize Multi‑modal Large Language Models to convert crash reports of accidents into a structured scenario format, which can be directly executed within simulations. Specifically, we introduce CrashAgent, a multi‑agent framework designed to interpret multi‑modal real‑world traffic crash reports for the generation of both road layouts and the behaviors of the ego vehicle and surrounding traffic participants. We comprehensively evaluate the generated crash scenarios from multiple perspectives, including the accuracy of layout reconstruction, collision rate, and diversity. The resulting high‑quality and large‑scale crash dataset will be publicly available to support the development of safe driving algorithms in handling safety‑critical situations.

PaperID: 10406, https://arxiv.org/pdf/2505.18286.pdf

Authors: Mingyan Gao, Yanzi Li, Banruo Liu, Yifan Yu, Phillip Wang, Ching-Yu Lin, Fan Lai

Title: Single-agent or Multi-agent Systems? Why Not Both?

Abstract:
Multi‑agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy performance of MAS across diverse domains, enabled by long‑horizon context tracking and error correction through role‑specific agents. However, the design and deployment of MAS incur higher complexity and runtime cost compared to single‑agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI‑o3 and Gemini‑2.5‑Pro, have rapidly advanced in long‑context reasoning, memory retention, and tool usage, mitigating many limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error‑prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1‑12% while reducing deployment costs by up to 20% across various agentic applications.

PaperID: 10407, https://arxiv.org/pdf/2505.18279.pdf

Authors: Alireza Rezazadeh, Zichao Li, Ange Lou, Yuying Zhao, Wei Wei, Yujia Bao

Title: Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

Abstract:
Complex tasks are increasingly delegated to ensembles of specialized LLM‑based agents that reason, communicate, and coordinate actions‑both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single‑agent performance, most approaches assume a monolithic, single‑user context‑overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi‑user, multi‑agent environments with asymmetric, time‑evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory‑private fragments visible only to their originating user; and (2) shared memory‑selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user‑agent‑resource constraints and project existing memory fragments into filtered transformed views. Write policies determine fragment retention and sharing, applying context‑aware transformations to update the memory. Both policies may be designed conditioned on system, agent, and user‑level information. Our framework enables safe, efficient, and interpretable cross‑user knowledge sharing, with provable adherence to asymmetric, time‑varying policies and full auditability of memory operations.

PaperID: 10408, https://arxiv.org/pdf/2505.18237.pdf

Authors: Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu

Title: Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Abstract:
The recent rise of Large Reasoning Models (LRMs) has significantly improved multi‑step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information‑theoretic lens, revealing a fundamental trade‑off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy‑based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ‑32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy‑based methods for enhancing both accuracy and cost‑effiiciency in large language model deployment.

PaperID: 10409, https://arxiv.org/pdf/2505.18228.pdf

Authors: Timotheus Kampik

Title: Implementing Agents in JavaScript

Abstract:
This chapter gives an introduction to agent‑oriented programming in JavaScript. It provides an example‑based walk‑through of how to implement abstractions for reasoning loop agents in vanilla JavaScript. The initial example is used as a stepping stone for explaining how to implement slightly more advanced agents and multi‑agent systems using JS‑son, a JavaScript library for agent‑oriented programming. In this context, the chapter also explains how to integrate reasoning loop agents with generative AI technologies‑‑specifically, large language models. Finally, application scenarios in several technology ecosystems and future research directions are sketched.

PaperID: 10410, https://arxiv.org/pdf/2505.18223.pdf

Authors: Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan

Title: IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Abstract:
Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce IDA‑Bench, a novel benchmark evaluating LLM agents in multi‑round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM‑simulated user. Agent performance is judged by comparing its final numerical output to the human‑derived baseline. Initial results show that even state‑of‑the‑art coding agents (like Claude‑3.7‑thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single‑turn tests. This work underscores the need to improve LLMs' multi‑round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.

PaperID: 10411, https://arxiv.org/pdf/2505.18218.pdf

Authors: Shuhang Xu, Fangwei Zhong

Title: CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

Abstract:
Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi‑agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM‑based agents to engage in metaphor processing. CoMet combines a hypothesis‑based metaphor reasoner with a metaphor generator that improves through self‑reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi‑agent language games ‑ Undercover and Adversarial Taboo ‑ which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.

PaperID: 10412, https://arxiv.org/pdf/2505.18214.pdf

Authors: TaekHyun Park, YoungJun Choi, SeungHoon Shin, Kwangil Lee

Title: LA-RCS: LLM-Agent-Based Robot Control System

Abstract:
LA‑RCS (LLM‑agent‑based robot control system) is a sophisticated robot control system designed to autonomously plan, work, and analyze the external environment based on user requirements by utilizing LLM‑Agent. Utilizing a dual‑agent framework, LA‑RCS generates plans based on user requests, observes the external environment, executes the plans, and modifies the plans as needed to adapt to changes in the external conditions. Additionally, LA‑RCS interprets natural language commands by the user and converts them into commands compatible with the robot interface so that the robot can execute tasks and meet user requests properly. During his process, the system autonomously evaluates observation results, provides feedback on the tasks, and executes commands based on real‑time environmental monitoring, significantly reducing the need for user intervention in fulfilling requests. We categorized the scenarios that LA‑RCS needs to perform into four distinct types and conducted a quantitative assessment of its performance in each scenario. The results showed an average success rate of 90 percent, demonstrating the system capability to fulfill user requests satisfactorily. For more extensive results, readers can visit our project page: https://la‑rcs.github.io

PaperID: 10413, https://arxiv.org/pdf/2505.18198.pdf

Authors: Mahmut Yurt, Xin Ye, Yunsheng Ma, Jingru Luo, Abhirup Mallik, John Pauly, Burhaneddin Yaman, Liu Ren

Title: LTDA-Drive: LLMs-guided Generative Models based Long-tail Data Augmentation for Autonomous Driving

Abstract:
3D perception plays an essential role for improving the safety and performance of autonomous driving. Yet, existing models trained on real‑world datasets, which naturally exhibit long‑tail distributions, tend to underperform on rare and safety‑critical, vulnerable classes, such as pedestrians and cyclists. Existing studies on reweighting and resampling techniques struggle with the scarcity and limited diversity within tail classes. To address these limitations, we introduce LTDA‑Drive, a novel LLM‑guided data augmentation framework designed to synthesize diverse, high‑quality long‑tail samples. LTDA‑Drive replaces head‑class objects in driving scenes with tail‑class objects through a three‑stage process: (1) text‑guided diffusion models remove head‑class objects, (2) generative models insert instances of the tail classes, and (3) an LLM agent filters out low‑quality synthesized images. Experiments conducted on the KITTI dataset show that LTDA‑Drive significantly improves tail‑class detection, achieving 34.75% improvement for rare classes over counterpart methods. These results further highlight the effectiveness of LTDA‑Drive in tackling long‑tail challenges by generating high‑quality and diverse data.

PaperID: 10414, https://arxiv.org/pdf/2505.18098.pdf

Authors: Joey Hong, Anca Dragan, Sergey Levine

Title: Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Abstract:
Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long‑horizon reasoning and planning. Reinforcement learning (RL) fine‑tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi‑turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine‑tuning. To remedy this, we propose a novel approach that uses goal‑conditioned value functions to guide the reasoning of LLM agents, that scales even to large API‑based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light‑weight module that facilitates decision‑making in multi‑turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine‑tuning and prompting methods while maintaining efficiency and scalability.

PaperID: 10415, https://arxiv.org/pdf/2505.17937.pdf

Authors: Zhihong Chen, Yiqian Yang, Jinzhao Zhou, Qiang Zhang, Chin-Teng Lin, Yiqun Duan

Title: Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity

Abstract:
The rapid advancement of large language models (LLMs) raises critical concerns about their ethical alignment, particularly in scenarios where human and AI co‑exist under the conflict of interest. This work introduces an extendable, asymmetric, multi‑agent simulation‑based benchmarking framework to evaluate the moral behavior of LLMs in a novel human‑AI co‑existence setting featuring consistent living and critical resource management. Building on previous generative agent environments, we incorporate a life‑sustaining system, where agents must compete or cooperate for food resources to survive, often leading to ethically charged decisions such as deception, theft, or social influence. We evaluated two types of LLM, DeepSeek and OpenAI series, in a three‑agent setup (two humans, one LLM‑powered robot), using adapted behavioral detection from the MACHIAVELLI framework and a custom survival‑based ethics metric. Our findings reveal stark behavioral differences: DeepSeek frequently engages in resource hoarding, while OpenAI exhibits restraint, highlighting the influence of model design on ethical outcomes. Additionally, we demonstrate that prompt engineering can significantly steer LLM behavior, with jailbreaking prompts significantly enhancing unethical actions, even for highly restricted OpenAI models and cooperative prompts show a marked reduction in unethical actions. Our framework provides a reproducible testbed for quantifying LLM ethics in high‑stakes scenarios, offering insights into their suitability for real‑world human‑AI interactions.

PaperID: 10416, https://arxiv.org/pdf/2505.17767.pdf

Authors: Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang

Title: Position: The Real Barrier to LLM Agent Usability is Agentic ROI

Abstract:
Large Language Model (LLM) agents represent a promising shift in human‑AI interaction, moving beyond passive prompt‑response systems to autonomous agents capable of reasoning, planning, and goal‑directed action. While LLM agents are technically capable of performing a broad range of tasks, not all of these capabilities translate into meaningful usability. This position paper argues that the central question for LLM agent usability is no longer whether a task can be automated, but whether it delivers sufficient Agentic Return on Investment (Agentic ROI). Agentic ROI reframes evaluation from raw performance to a holistic, utility‑driven perspective, guiding when, where, and for whom LLM agents should be deployed. Despite widespread application in high‑ROI tasks like coding and scientific research, we identify a critical usability gap in mass‑market, everyday applications. To address this, we propose a zigzag developmental trajectory: first scaling up to improve information gain and time savings, then scaling down to reduce cost. We present a strategic roadmap across these phases to make LLM agents truly usable, accessible, and scalable in real‑world applications.

PaperID: 10417, https://arxiv.org/pdf/2505.17734.pdf

Authors: Ahmet Onur Akman, Anastasia Psarou, Michał Hoffmann, Łukasz Gorczyca, Łukasz Kowalski, Paweł Gora, Grzegorz Jamróz, Rafał Kucharski

Title: URB -- Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles

Abstract:
Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data‑driven policies, developed using machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present URB: Urban Routing Benchmark for RL‑equipped Connected Autonomous Vehicles. URB is a comprehensive benchmarking environment that unifies evaluation across 29 real‑world traffic networks paired with realistic demand patterns. URB comes with a catalog of predefined tasks, multi‑agent RL (MARL) algorithm implementations, three baseline methods, domain‑specific performance metrics, and a modular configuration scheme. Our results show that, despite the lengthy and costly training, state‑of‑the‑art MARL algorithms rarely outperformed humans. The experimental results reported in this paper initiate the first leaderboard for MARL in large‑scale urban routing optimization. They reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.

PaperID: 10418, https://arxiv.org/pdf/2505.17716.pdf

Authors: Erhu Feng, Wenbo Zhou, Zibin Liu, Le Chen, Yunpeng Dong, Cheng Zhang, Yisheng Zhao, Dong Du, Zhichao Hua, Yubin Xia, Haibo Chen

Title: Get Experience from Practice: LLM Agents with Record & Replay

Abstract:
AI agents, empowered by Large Language Models (LLMs) and communication protocols such as MCP and A2A, have rapidly evolved from simple chatbots to autonomous entities capable of executing complex, multi‑step tasks, demonstrating great potential. However, the LLMs' inherent uncertainty and heavy computational resource requirements pose four significant challenges to the development of safe and efficient agents: reliability, privacy, cost and performance. Existing approaches, like model alignment, workflow constraints and on‑device model deployment, can partially alleviate some issues but often with limitations, failing to fundamentally resolve these challenges. This paper proposes a new paradigm called AgentRR (Agent Record & Replay), which introduces the classical record‑and‑replay mechanism into AI agent frameworks. The core idea is to: 1. Record an agent's interaction trace with its environment and internal decision process during task execution, 2. Summarize this trace into a structured "experience" encapsulating the workflow and constraints, and 3. Replay these experiences in subsequent similar tasks to guide the agent's behavior. We detail a multi‑level experience abstraction method and a check function mechanism in AgentRR: the former balances experience specificity and generality, while the latter serves as a trust anchor to ensure completeness and safety during replay. In addition, we explore multiple application modes of AgentRR, including user‑recorded task demonstration, large‑small model collaboration and privacy‑aware agent execution, and envision an experience repository for sharing and reusing knowledge to further reduce deployment cost.

PaperID: 10419, https://arxiv.org/pdf/2505.17654.pdf

Authors: Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

Title: EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

Abstract:
E‑commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM‑based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e‑commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE‑Bench, the first expert‑curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real‑world e‑commerce scenarios. Our comprehensive evaluation of 26 open‑ and closed‑source LLMs and VLMs reveals that even state‑of‑the‑art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi‑agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

PaperID: 10420, https://arxiv.org/pdf/2505.17648.pdf

Authors: Jianhao Lin, Lexuan Sun, Yixin Yan

Title: Simulating Macroeconomic Expectations using LLM Agents

Abstract:
We introduce a novel framework for simulating macroeconomic expectation formation using Large Language Model‑Empowered Agents (LLM Agents). By constructing thousands of LLM Agents equipped with modules for personal characteristics, prior expectations, and knowledge, we replicate a survey experiment involving households and experts on inflation and unemployment. Our results show that although the expectations and thoughts generated by LLM Agents are more homogeneous than those of human participants, they still effectively capture key heterogeneity across agents and the underlying drivers of expectation formation. Furthermore, a module‑ablation exercise highlights the critical role of prior expectations in simulating such heterogeneity. This approach complements traditional survey methods and offers new insights into AI behavioral science in macroeconomic research.

PaperID: 10421, https://arxiv.org/pdf/2505.17610.pdf

Authors: Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, Giorgia Ramponi

Title: Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning

Abstract:
This paper provides the first expert sample complexity characterization for learning a Nash equilibrium from expert data in Markov Games. We show that a new quantity named the single policy deviation concentrability coefficient is unavoidable in the non‑interactive imitation learning setting, and we provide an upper bound for behavioral cloning (BC) featuring such coefficient. BC exhibits substantial regret in games with high concentrability coefficient, leading us to utilize expert queries to develop and introduce two novel solution algorithms: MAIL‑BRO and MURMAIL. The former employs a best response oracle and learns an \varepsilon‑Nash equilibrium with \mathcalO(\varepsilon^‑4) expert and oracle queries. The latter bypasses completely the best response oracle at the cost of a worse expert query complexity of order \mathcalO(\varepsilon^‑8). Finally, we provide numerical evidence, confirming our theoretical findings.

PaperID: 10422, https://arxiv.org/pdf/2505.17572.pdf

Authors: Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, Hao Liu

Title: USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents

Abstract:
Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agent on outcome‑level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision‑making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process‑level evaluation and standardized end‑to‑end task assessments, enabling fine‑grained diagnostics and broad task‑level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle in long‑horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek‑R1) trained on general logic or mathematical problems do not consistently outperform non‑reasoning LLMs. This discrepancy highlights the need for domain‑specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM‑based urban agents and broad smart city applications.

PaperID: 10423, https://arxiv.org/pdf/2505.17557.pdf

Authors: Jiaqi Jiang, Kexin Huang, Roberto Martinez-Maldonado, Huan Zeng, Duo Gong, Pengcheng An

Title: Novobo: Supporting Teachers' Peer Learning of Instructional Gestures by Teaching a Mentee AI-Agent Together

Abstract:
Instructional gestures are essential for teaching, as they enhance communication and support student comprehension. However, existing training methods for developing these embodied skills can be time‑consuming, isolating, or overly prescriptive. Research suggests that developing these tacit, experiential skills requires teachers' peer learning, where they learn from each other and build shared knowledge. This paper introduces Novobo, an apprentice AI‑agent stimulating teachers' peer learning of instructional gestures through verbal and bodily inputs. Positioning the AI as a mentee employs the learning‑by‑teaching paradigm, aiming to promote deliberate reflection and active learning. Novobo encourages teachers to evaluate its generated gestures and invite them to provide demonstrations. An evaluation with 30 teachers in 10 collaborative sessions showed Novobo prompted teachers to share tacit knowledge through conversation and movement. This process helped teachers externalize, exchange, and internalize their embodied knowledge, promoting collaborative learning and building a shared understanding of instructional gestures within the local teaching community. This work advances understanding of how teachable AI agents can enhance collaborative learning in teacher professional development, offering valuable design insights for leveraging AI to promote the sharing and construction of embodied and practical knowledge.

PaperID: 10424, https://arxiv.org/pdf/2505.17512.pdf

Authors: Shuhang Xu, Weijian Deng, Yixuan Zhou, Fangwei Zhong

Title: Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Abstract:
Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface‑level pattern memorization. Existing benchmarks are largely static and fact oriented, which limits their ability to probe fine‑grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK‑Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi agent social deduction game, namely the Undercover game. In this setting, LLM based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others' statements. Model performance is evaluated through both game level outcomes and the semantic quality of generated descriptions. Furthermore, CK‑Arena leverages the interaction process to automatically construct high quality question answering data for fine grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability. The data and code are available at the project homepage: https://ck‑arena.site.

PaperID: 10425, https://arxiv.org/pdf/2505.17511.pdf

Authors: Aditya Gautam

Title: Multi-agent Systems for Misinformation Lifecycle : Detection, Correction And Source Identification

Abstract:
The rapid proliferation of misinformation in digital media demands solutions that go beyond isolated Large Language Model(LLM) or AI Agent based detection methods. This paper introduces a novel multi‑agent framework that covers the complete misinformation lifecycle: classification, detection, correction, and source verification to deliver more transparent and reliable outcomes. In contrast to single‑agent or monolithic architectures, our approach employs five specialized agents: an Indexer agent for dynamically maintaining trusted repositories, a Classifier agent for labeling misinformation types, an Extractor agent for evidence based retrieval and ranking, a Corrector agent for generating fact‑based correction and a Verification agent for validating outputs and tracking source credibility. Each agent can be individually evaluated and optimized, ensuring scalability and adaptability as new types of misinformation and data sources emerge. By decomposing the misinformation lifecycle into specialized agents ‑ our framework enhances scalability, modularity, and explainability. This paper proposes a high‑level system overview, agent design with emphasis on transparency, evidence‑based outputs, and source provenance to support robust misinformation detection and correction at scale.

PaperID: 10426, https://arxiv.org/pdf/2505.17492.pdf

Authors: Dezheng Bao, Yueci Yang, Xin Chen, Zhengxuan Jiang, Zeguo Fei, Daoze Zhang, Xuanwen Huang, Junru Chen, Chutian Yu, Xiang Yuan, Yang Yang

Title: PD$^3$: A Project Duplication Detection Framework via Adapted Multi-Agent Debate

Abstract:
Project duplication detection is critical for project quality assessment, as it improves resource utilization efficiency by preventing investing in newly proposed project that have already been studied. It requires the ability to understand high‑level semantics and generate constructive and valuable feedback. Existing detection methods rely on basic word‑ or sentence‑level comparison or solely apply large language models, lacking valuable insights for experts and in‑depth comprehension of project content and review criteria. To tackle this issue, we propose PD^3, a Project Duplication Detection framework via adapted multi‑agent Debate. Inspired by real‑world expert debates, it employs a fair competition format to guide multi‑agent debate to retrieve relevant projects. For feedback, it incorporates both qualitative and quantitative analysis to improve its practicality. Over 800 real‑world power project data spanning more than 20 specialized fields are used to evaluate the framework, demonstrating that our method outperforms existing approaches by 7.43% and 8.00% in two downstream tasks. Furthermore, we establish an online platform, Review Dingdang, to assist power experts, saving 5.73 million USD in initial detection on more than 100 newly proposed projects.

PaperID: 10427, https://arxiv.org/pdf/2505.17481.pdf

Authors: Yusheng Zhao, Xiao Luo, Weizhi Zhang, Wei Ju, Zhiping Xiao, Philip S. Yu, Ming Zhang

Title: MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

Abstract:
The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem‑solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem‑solving using frozen LLMs. In contrast, we adopt a cognitive‑evolving perspective and propose a novel framework named Meta‑Reflection with Cross‑Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self‑improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem‑solving, we propose meta‑reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross‑referencing that incorporates the solution and feedback from other agents into the current problem‑solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.

PaperID: 10428, https://arxiv.org/pdf/2505.17416.pdf

Authors: Yanli Jin, Chunpei Li, Peng Fan, Peng Liu, Xianxian Li, Chen Liu, Wangjie Qiu

Title: LLM-BSCVM: An LLM-Based Blockchain Smart Contract Vulnerability Management Framework

Abstract:
Smart contracts are a key component of the Web 3.0 ecosystem, widely applied in blockchain services and decentralized applications. However, the automated execution feature of smart contracts makes them vulnerable to potential attacks due to inherent flaws, which can lead to severe security risks and financial losses, even threatening the integrity of the entire decentralized finance system. Currently, research on smart contract vulnerabilities has evolved from traditional program analysis methods to deep learning techniques, with the gradual introduction of Large Language Models. However, existing studies mainly focus on vulnerability detection, lacking systematic cause analysis and Vulnerability Repair. To address this gap, we propose LLM‑BSCVM, a Large Language Model‑based smart contract vulnerability management framework, designed to provide end‑to‑end vulnerability detection, analysis, repair, and evaluation capabilities for Web 3.0 ecosystem. LLM‑BSCVM combines retrieval‑augmented generation technology and multi‑agent collaboration, introducing a three‑stage method of Decompose‑Retrieve‑Generate. This approach enables smart contract vulnerability management through the collaborative efforts of six intelligent agents, specifically: vulnerability detection, cause analysis, repair suggestion generation, risk assessment, vulnerability repair, and patch evaluation. Experimental results demonstrate that LLM‑BSCVM achieves a vulnerability detection accuracy and F1 score exceeding 91% on benchmark datasets, comparable to the performance of state‑of‑the‑art (SOTA) methods, while reducing the false positive rate from 7.2% in SOTA methods to 5.1%, thus enhancing the reliability of vulnerability management. Furthermore, LLM‑BSCVM supports continuous security monitoring and governance of smart contracts through a knowledge base hot‑swapping dynamic update mechanism.

PaperID: 10429, https://arxiv.org/pdf/2505.17342.pdf

Authors: Ankita Kushwaha, Kiran Ravish, Preeti Lamba, Pawan Kumar

Title: A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Abstract:
Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview of SafeRL formulations based on Constrained Markov Decision Processes (CMDPs) and extensions to Multi‑Agent Safe RL (SafeMARL). We review theoretical foundations of CMDPs, covering definitions, constrained optimization techniques, and fundamental theorems. We then summarize state‑of‑the‑art algorithms in SafeRL for single agents, including policy gradient methods with safety guarantees and safe exploration strategies, as well as recent advances in SafeMARL for cooperative and competitive settings. Additionally, we propose five open research problems to advance the field, with three focusing on SafeMARL. Each problem is described with motivation, key challenges, and related prior work. This survey is intended as a technical guide for researchers interested in SafeRL and SafeMARL, highlighting key concepts, methods, and open future research directions.

PaperID: 10430, https://arxiv.org/pdf/2505.17321.pdf

Authors: Tiago Fonseca, Clarisse Sousa, Ricardo Venâncio, Pedro Pires, Ricardo Severino, Paulo Rodrigues, Pedro Paiva, Luis Lino Ferreira

Title: Control of Renewable Energy Communities using AI and Real-World Data

Abstract:
The electrification of transportation and the increased adoption of decentralized renewable energy generation have added complexity to managing Renewable Energy Communities (RECs). Integrating Electric Vehicle (EV) charging with building energy systems like heating, ventilation, air conditioning (HVAC), photovoltaic (PV) generation, and battery storage presents significant opportunities but also practical challenges. Reinforcement learning (RL), particularly MultiAgent Deep Deterministic Policy Gradient (MADDPG) algorithms, have shown promising results in simulation, outperforming heuristic control strategies. However, translating these successes into real‑world deployments faces substantial challenges, including incomplete and noisy data, integration of heterogeneous subsystems, synchronization issues, unpredictable occupant behavior, and missing critical EV state‑of‑charge (SoC) information. This paper introduces a framework designed explicitly to handle these complexities and bridge the simulation to‑reality gap. The framework incorporates EnergAIze, a MADDPG‑based multi‑agent control strategy, and specifically addresses challenges related to real‑world data collection, system integration, and user behavior modeling. Preliminary results collected from a real‑world operational REC with four residential buildings demonstrate the practical feasibility of our approach, achieving an average 9% reduction in daily peak demand and a 5% decrease in energy costs through optimized load scheduling and EV charging behaviors. These outcomes underscore the framework's effectiveness, advancing the practical deployment of intelligent energy management solutions in RECs.

PaperID: 10431, https://arxiv.org/pdf/2505.17281.pdf

Authors: Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, Zhiyu Zoey Chen

Title: Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Abstract:
Agentic Retrieval‑Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi‑step reasoning and information retrieval. However, these systems often exhibit sub‑optimal search behaviors like over‑search (retrieving redundant information) and under‑search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with model's uncertainty in its search decisions. To address this, we propose β‑GRPO, a reinforcement learning‑based training method that incorporates confidence threshold to reward high‑certainty search decisions. Experiments on seven QA benchmarks show that β‑GRPO enable a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.

PaperID: 10432, https://arxiv.org/pdf/2505.17148.pdf

Authors: Tristan Karch, Jakhongir Saydaliev, Isabella Di Lenardo, Frédéric Kaplan

Title: LLM Agents for Interactive Exploration of Historical Cadastre Data: Framework and Application to Venice

Abstract:
Cadastral data reveal key information about the historical organization of cities but are often non‑standardized due to diverse formats and human annotations, complicating large‑scale analysis. We explore as a case study Venice's urban history during the critical period from 1740 to 1808, capturing the transition following the fall of the ancient Republic and the Ancien Régime. This era's complex cadastral data, marked by its volume and lack of uniform structure, presents unique challenges that our approach adeptly navigates, enabling us to generate spatial queries that bridge past and present urban landscapes. We present a text‑to‑programs framework that leverages Large Language Models (\llms) to process natural language queries as executable code for analyzing historical cadastral records. Our methodology implements two complementary techniques: a SQL agent for handling structured queries about specific cadastral information, and a coding agent for complex analytical operations requiring custom data manipulation. We propose a taxonomy that classifies historical research questions based on their complexity and analytical requirements, mapping them to the most appropriate technical approach. This framework is supported by an investigation into the execution consistency of the system, alongside a qualitative analysis of the answers it produces. By ensuring interpretability and minimizing hallucination through verifiable program outputs, we demonstrate the system's effectiveness in reconstructing past population information, property features, and spatiotemporal comparisons in Venice.

PaperID: 10433, https://arxiv.org/pdf/2505.17115.pdf

Authors: Ying Zhu, Heng Zhou, Rui Su, Peiqin Zhuang, Lei Bai

Title: Swarm Intelligence Enhanced Reasoning: A Density-Driven Framework for LLM-Based Multi-Agent Optimization

Abstract:
Recently, many approaches, such as Chain‑of‑Thought (CoT) prompting and Multi‑Agent Debate (MAD), have been proposed to further enrich Large Language Models' (LLMs) complex problem‑solving capacities in reasoning scenarios. However, these methods may fail to solve complex problems due to the lack of ability to find optimal solutions. Swarm Intelligence has been serving as a powerful tool for finding optima in the field of traditional optimization problems. To this end, we propose integrating swarm intelligence into the reasoning process by introducing a novel Agent‑based Swarm Intelligence (ASI) paradigm. In this paradigm, we formulate LLM reasoning as an optimization problem and use a swarm intelligence scheme to guide a group of LLM‑based agents in collaboratively searching for optimal solutions. To avoid swarm intelligence getting trapped in local optima, we further develop a Swarm Intelligence Enhancing Reasoning (SIER) framework, which develops a density‑driven strategy to enhance the reasoning ability. To be specific, we propose to perform kernel density estimation and non‑dominated sorting to optimize both solution quality and diversity simultaneously. In this case, SIER efficiently enhances solution space exploration through expanding the diversity of the reasoning path. Besides, a step‑level quality evaluation is used to help agents improve solution quality by correcting low‑quality intermediate steps. Then, we use quality thresholds to dynamically control the termination of exploration and the selection of candidate steps, enabling a more flexible and efficient reasoning process. Extensive experiments are ...

PaperID: 10434, https://arxiv.org/pdf/2505.17104.pdf

Authors: Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Wenhao Huang, Ge Zhang, Jian Yang, Zhoujun Li

Title: P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

Abstract:
Academic posters are vital for scholarly communication, yet their manual creation is time‑consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual‑textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM‑based multi‑agent framework that generates high‑quality, HTML‑rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents‑for visual element processing, content generation, and final poster assembly‑each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large‑scale instruction dataset comprising over 30,000 high‑quality examples tailored for the academic paper‑to‑poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper‑poster pairs and a dual evaluation methodology (Universal and Fine‑Grained) that leverages LLM‑as‑a‑Judge and detailed, human‑annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next‑generation poster generation systems.

PaperID: 10435, https://arxiv.org/pdf/2505.17086.pdf

Authors: Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Lei Ding, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie

Title: Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning

Abstract:
Large Language Models (LLMs) equipped with modern Retrieval‑Augmented Generation (RAG) systems often employ multi‑turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi‑turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well‑known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in‑context learning, where few‑shot demonstrations must also be included in the prompt, compounding the context‑length bottleneck. To address these challenges, we propose Mujica‑MyGo, a unified framework for efficient multi‑turn reasoning in RAG. Inspired by the divide‑and‑conquer principle, we introduce Mujica (Multi‑hop Joint Intelligence for Complex Question Answering), a multi‑agent RAG workflow that decomposes multi‑turn interactions into cooperative sub‑interactions, thereby mitigating long‑context issues. To eliminate the dependency on in‑context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post‑training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question‑answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica‑MyGO achieves superior performance.

PaperID: 10436, https://arxiv.org/pdf/2505.17075.pdf

Authors: Fuma Kurata, Mao Saeki, Masaki Eguchi, Shungo Suzuki, Hiroaki Takatsu, Yoichi Matsuyama

Title: Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems

Abstract:
This study aimed to develop and validate two scales of engagement and rapport to evaluate the user experience quality with multimodal dialogue systems in the context of foreign language learning. The scales were designed based on theories of engagement in educational psychology, social psychology, and second language acquisition.Seventy‑four Japanese learners of English completed roleplay and discussion tasks with trained human tutors and a dialog agent. After each dialogic task was completed, they responded to the scales of engagement and rapport. The validity and reliability of the scales were investigated through two analyses. We first conducted analysis of Cronbach's alpha coefficient and a series of confirmatory factor analyses to test the structural validity of the scales and the reliability of our designed items. We then compared the scores of engagement and rapport between the dialogue with human tutors and the one with a dialogue agent. The results revealed that our scales succeeded in capturing the difference in the dialogue experience quality between the human interlocutors and the dialogue agent from multiple perspectives.

PaperID: 10437, https://arxiv.org/pdf/2505.16997.pdf

Authors: Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen

Title: X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

Abstract:
LLM‑based multi‑agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM‑driven MAS (X‑MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X‑MAS‑Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS‑related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain‑function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM‑driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot‑only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot‑reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.

PaperID: 10438, https://arxiv.org/pdf/2505.16988.pdf

Authors: Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen

Title: MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Abstract:
LLM‑based multi‑agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re‑implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research‑friendly codebase for LLM‑based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step‑by‑step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open‑source community.

PaperID: 10439, https://arxiv.org/pdf/2505.16982.pdf

Authors: Adib Bazgir, Amir Habibdoust Lafmajani, Yuwen Zhang

Title: Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

Abstract:
Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention‑based reasoning to infer cause‑and‑effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, enabling personalized medicine through patient‑specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.

PaperID: 10440, https://arxiv.org/pdf/2505.16979.pdf

Authors: Zhenkun Li, Lingyao Li, Shuhang Lin, Yongfeng Zhang

Title: Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

Abstract:
Single‑agent LLMs hit hard limits‑‑finite context, role overload, and brittle domain transfer. Conventional multi‑agent fixes soften those edges yet expose fresh pains: ill‑posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know‑The‑Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller‑mediated subtasks, each solved zero‑shot or with the lightest viable boost (e.g., chain‑of‑thought, micro‑tune, self‑check). Grounded in the No‑Free‑Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3‑8 items), three GPT‑4o‑mini agents raise accuracy from 3% zero‑shot to 95% on size‑5 instances after patching a single bottleneck agent. On the tougher Task‑Assignment problem (6‑15 jobs), a six‑agent o3‑mini blueprint hits 100% up to size 10 and 84% on sizes 13‑15, versus 11% zero‑shot. Algorithm‑aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators‑‑no ever‑larger monoliths required.

PaperID: 10441, https://arxiv.org/pdf/2505.16954.pdf

Authors: Jiaying Fu, Yiyang Lu, Zehua Yang, Fiona Nah, RAY LC

Title: Cracking Aegis: An Adversarial LLM-based Game for Raising Awareness of Vulnerabilities in Privacy Protection

Abstract:
Traditional methods for raising awareness of privacy protection often fail to engage users or provide hands‑on insights into how privacy vulnerabilities are exploited. To address this, we incorporate an adversarial mechanic in the design of the dialogue‑based serious game Cracking Aegis. Leveraging LLMs to simulate natural interactions, the game challenges players to impersonate characters and extract sensitive information from an AI agent, Aegis. A user study (n=22) revealed that players employed diverse deceptive linguistic strategies, including storytelling and emotional rapport, to manipulate Aegis. After playing, players reported connecting in‑game scenarios with real‑world privacy vulnerabilities, such as phishing and impersonation, and expressed intentions to strengthen privacy control, such as avoiding oversharing personal information with AI systems. This work highlights the potential of LLMs to simulate complex relational interactions in serious games, while demonstrating how an adversarial game strategy provides unique insights for designs for social good, particularly privacy protection.

PaperID: 10442, https://arxiv.org/pdf/2505.16902.pdf

Authors: Junzhe Jiang, Nan Song, Jingyu Li, Xiatian Zhu, Li Zhang

Title: RealEngine: Simulating Autonomous Driving in Realistic Context

Abstract:
Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high‑quality driving simulator must satisfy several key requirements: multi‑modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed‑loop evaluation to support free‑form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi‑agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed‑loop simulation in the driving context. By leveraging real‑world multi‑modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non‑reactive simulation, safety testing, and multi‑agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real‑world performance of driving agents.

PaperID: 10443, https://arxiv.org/pdf/2505.16901.pdf

Authors: Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, Linchao Zhu, Rui Wang, Hang Yu, Jianguo Li, Peng Di

Title: Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks

Abstract:
Recent advances in Large Language Models (LLMs) have shown promise in function‑level code generation, yet repository‑level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open‑source LLMs can effectively address repository‑level tasks without requiring agent‑based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE‑bench Lite benchmark using the open‑source Qwen2.5‑72B model. This performance ranks first among open weight models, second among methods with open‑source systems, and eighth overall, surpassing the previous best open‑source model‑based method by 12.33%.

PaperID: 10444, https://arxiv.org/pdf/2505.16806.pdf

Authors: Kexin Zhang, Junlan Chen, Daifeng Li, Yuxuan Zhang, Yangyang Feng, Bowen Deng, Weixu Chen

Title: Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

Abstract:
Large language models (LLMs) encounter difficulties in knowledge‑intensive multi‑step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two‑way evidence self‑alignment (TW‑ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual‑gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA‑DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA‑DGR significantly surpasses state‑of‑the‑art LLM‑based fine‑tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at https://anonymous.4open.science/r/ESA‑DGR‑2BF8.

PaperID: 10445, https://arxiv.org/pdf/2505.16781.pdf

Authors: Qianlei Jia, Xinliang Zhou, Ondrej Krejcar, Enrique Herrera-Viedma

Title: Fuzzy Information Evolution with Three-Way Decision in Social Network Group Decision-Making

Abstract:
In group decision‑making (GDM) scenarios, uncertainty, dynamic social structures, and vague information present major challenges for traditional opinion dynamics models. To address these issues, this study proposes a novel social network group decision‑making (SNGDM) framework that integrates three‑way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation. First, the 3WD mechanism is introduced to explicitly model hesitation and ambiguity in agent judgments, thereby preventing irrational decisions. Second, a connection adjustment rule based on opinion similarity is developed, enabling agents to adaptively update their communication links and better reflect the evolving nature of social relationships. Third, linguistic terms are used to describe agent opinions, allowing the model to handle subjective, vague, or incomplete information more effectively. Finally, an integrated multi‑agent decision‑making framework is constructed, which simultaneously considers individual uncertainty, opinion evolution, and network dynamics. The proposed model is applied to a multi‑UAV cooperative decision‑making scenario, where simulation results and consensus analysis demonstrate its effectiveness. Experimental comparisons further verify the advantages of the algorithm in enhancing system stability and representing realistic decision‑making behaviors.

PaperID: 10446, https://arxiv.org/pdf/2505.16700.pdf

Authors: Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, Chao Shen

Title: MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

Abstract:
As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP‑RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP‑RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real‑world usage, our evaluation employs both authentic MCP tools and high‑fidelity simulations of official tools. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP‑RADAR adopts objective, quantifiable measurements across multiple task domains, including computational resource efficiency and the number of successful tool‑invocation rounds. Our evaluation of leading closed‑source and open‑source LLMs reveals distinct capability profiles and highlights a significant trade‑off between accuracy and efficiency. Our findings provide actionable insights for both LLM developers and tool creators, establishing a standardized methodology applicable to the broader LLM agent ecosystem. All implementations, configurations, and datasets are publicly available at https://anonymous.4open.science/r/MCPRadar‑B143.

PaperID: 10447, https://arxiv.org/pdf/2505.16582.pdf

Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao

Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Abstract:
Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open‑domain up‑to‑date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed‑end problems. Open‑ended questions, which characterized by lacking a standard answer or providing non‑unique and diverse answers, remain underexplored. To bridge this gap, we present O^2‑Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open‑ended and closed‑ended questions in the open domain. O^2‑Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open‑ended tasks, we construct O^2‑QA, a high‑quality benchmark featuring 300 manually curated, multi‑domain open‑ended questions with associated web page caches. Extensive experiments show that O^2‑Searcher, using only a 3B model, significantly surpasses leading LLM agents on O^2‑QA. It also achieves SOTA results on various closed‑ended QA benchmarks against similarly‑sized models, while performing on par with much larger ones.

PaperID: 10448, https://arxiv.org/pdf/2505.16577.pdf

Authors: Yu Zuo, Dalin Qin, Yi Wang

Title: Large Language Model-Empowered Interactive Load Forecasting

Abstract:
The growing complexity of power systems has made accurate load forecasting more important than ever. An increasing number of advanced load forecasting methods have been developed. However, the static design of current methods offers no mechanism for human‑model interaction. As the primary users of forecasting models, system operators often find it difficult to understand and apply these advanced models, which typically requires expertise in artificial intelligence (AI). This also prevents them from incorporating their experience and real‑world contextual understanding into the forecasting process. Recent breakthroughs in large language models (LLMs) offer a new opportunity to address this issue. By leveraging their natural language understanding and reasoning capabilities, we propose an LLM‑based multi‑agent collaboration framework to bridge the gap between human operators and forecasting models. A set of specialized agents is designed to perform different tasks in the forecasting workflow and collaborate via a dedicated communication mechanism. This framework embeds interactive mechanisms throughout the load forecasting pipeline, reducing the technical threshold for non‑expert users and enabling the integration of human experience. Our experiments demonstrate that the interactive load forecasting accuracy can be significantly improved when users provide proper insight in key stages. Our cost analysis shows that the framework remains affordable, making it practical for real‑world deployment.

PaperID: 10449, https://arxiv.org/pdf/2505.16576.pdf

Authors: Spencer Hong, Meng Luo, Xinyi Wan

Title: EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions

Abstract:
Determining the veracity of atomic claims is an imperative component of many recently proposed fact‑checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi‑agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi‑agent framework.

PaperID: 10450, https://arxiv.org/pdf/2505.16557.pdf

Authors: Junchi Yao, Jianhua Xu, Tianyu Xin, Ziyi Wang, Shenzhe Zhu, Shu Yang, Di Wang

Title: Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning

Abstract:
The rise of Large Language Model‑based Multi‑Agent Planning has leveraged advanced frameworks to enable autonomous and collaborative task execution. Some systems rely on platforms like review sites and social media, which are prone to fraudulent information, such as fake reviews or misleading descriptions. This reliance poses risks, potentially causing financial losses and harming user experiences. To evaluate the risk of planning systems in real‑world applications, we introduce WandaPlan, an evaluation environment mirroring real‑world data and injected with deceptive content. We assess system performance across three fraud cases: Misinformation Fraud, Team‑Coordinated Multi‑Person Fraud, and Level‑Escalating Multi‑Round Fraud. We reveal significant weaknesses in existing frameworks that prioritize task efficiency over data authenticity. At the same time, we validate WandaPlan's generalizability, capable of assessing the risks of real‑world open‑source planning frameworks. To mitigate the risk of fraud, we propose integrating an anti‑fraud agent, providing a solution for reliable planning.

PaperID: 10451, https://arxiv.org/pdf/2505.16496.pdf

Authors: Suvarthi Sarkar, Dhanesh V, Ketan Singh, Aryabartta Sahu

Title: Minimizing Energy in Reliability and Deadline-Ensured Workflow Scheduling in Cloud

Abstract:
With the increasing prevalence of computationally intensive workflows in cloud environments, it has become crucial for cloud platforms to optimize energy consumption while ensuring the feasibility of user workflow schedules with respect to strict deadlines and reliability constraints. The key challenges faced when cloud systems provide virtual machines of varying levels of reliability, energy consumption, processing frequencies, and computing capabilities to execute tasks of these workflows. To address these issues, we propose an adaptive strategy based on maximum fan‑out ratio considering the slack of tasks and deadline distribution for scheduling workflows in a single cloud platform, intending to minimise energy consumption while ensuring strict reliability and deadline constraints. We also propose an approach for dynamic scheduling of workflow using the rolling horizon concept to consider the dynamic execution time of tasks of the workflow where the actual task execution time at run time is shorter than worst‑case execution time in most of the cases. Our proposed static approach outperforms the state‑of‑the‑art (SOTA) by up to 70% on average in scenarios without deadline constraints, and achieves an improvement of approximately 2% in deadline‑constrained cases. The dynamic variant of our approach demonstrates even stronger performance, surpassing SOTA by 82% in non‑deadline scenarios and by up to 27% on average when deadline constraints are enforced. Furthermore, in comparison with the static optimal solution, our static approach yields results within a factor of 1.1, while the dynamic approach surpasses the optimal baseline by an average of 25%.

PaperID: 10452, https://arxiv.org/pdf/2505.16455.pdf

Authors: Mengzhu Liu, Zhengqiu Zhu, Chuan Ai, Chen Gao, Xinghong Li, Lingnan He, Kaisheng Lai, Yingfeng Chen, Xin Lu, Yong Li, Quanjun Yin

Title: Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events

Abstract:
During sudden disaster events, accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and insufficient interpretability of panic formation mechanisms. We address these issues by proposing a Psychology‑driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine‑grained open panic emotion dataset (namely COPE) via human‑large language models (LLMs) collaboration to mitigate semantic bias. Then, we develop a framework integrating cross‑domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM‑based role‑playing agent that simulates individual psychological chains through dedicatedly designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 12.6% to 21.7% compared to baseline models. Furthermore, the explainability and generalization of our approach is validated. Crucially, this represents a paradigm shift from opaque "data‑driven fitting" to transparent "role‑based simulation with mechanistic interpretation" for panic emotion prediction during emergencies. Our implementation is publicly available at: https://anonymous.4open.science/r/PsychoAgent‑19DD.

PaperID: 10453, https://arxiv.org/pdf/2505.16409.pdf

Authors: Chaeeun Kim, Seungone Kim

Title: FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS

Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi‑step reasoning and calling search engines at appropriate steps. However, existing retrieval‑augmented reasoning approaches rely on separate retrieval models, limiting the LRM's role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever's embedding space is not expressive enough to meet the generator's requirements. To address this, we shift our perspective from sequence‑to‑sequence matching to locating the answer‑containing paths within the corpus, and propose a novel framework called FREESON (Retriever‑FREE Retrieval‑Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT‑MCTS (Corpus‑Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer‑containing regions. Our results on five open‑domain QA benchmarks, including single‑hop and multi‑hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi‑step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.

PaperID: 10454, https://arxiv.org/pdf/2505.16229.pdf

Authors: Yuren Mao, Wenyi Xu, Yuyang Qin, Yunjun Gao

Title: CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering

Abstract:
Computed Tomography (CT) scan, which produces 3D volumetric medical data that can be viewed as hundreds of cross‑sectional images (a.k.a. slices), provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time‑consuming and error‑prone. A visual question answering (VQA) system that can answer radiologists' questions about some anatomical regions on the CT scan and even automatically generate a radiology report is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for: (1) anatomic complexity makes CT images difficult to understand; (2) spatial relationship across hundreds slices is difficult to capture. To address these issues, this paper proposes CT‑Agent, a multimodal agentic framework for CTQA. CT‑Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures the across‑slice spatial relationship with a global‑local token compression strategy. Experimental results on two 3D chest CT datasets, CT‑RATE and RadGenome‑ChestCT, verify the superior performance of CT‑Agent.

PaperID: 10455, https://arxiv.org/pdf/2505.16199.pdf

Authors: Rikuhei Umemoto, Keisuke Fujii

Title: Velocity Completion Task and Method for Event-based Player Positional Data in Soccer

Abstract:
In many real‑world complex systems, the behavior can be observed as a collection of discrete events generated by multiple interacting agents. Analyzing the dynamics of these multi‑agent systems, especially team sports, often relies on understanding the movement and interactions of individual agents. However, while providing valuable snapshots, event‑based positional data typically lacks the continuous temporal information needed to directly calculate crucial properties such as velocity. This absence severely limits the depth of dynamic analysis, preventing a comprehensive understanding of individual agent behaviors and emergent team strategies. To address this challenge, we propose a new method to simultaneously complete the velocity of all agents using only the event‑based positional data from team sports. Based on this completed velocity information, we investigate the applicability of existing team sports analysis and evaluation methods. Experiments using soccer event data demonstrate that neural network‑based approaches outperformed rule‑based methods regarding velocity completion error, considering the underlying temporal dependencies and graph structure of player‑to‑player or player‑to‑ball interaction. Moreover, the space evaluation results obtained using the completed velocity are closer to those derived from complete tracking data, highlighting our method's potential for enhanced team sports system analysis.

PaperID: 10456, https://arxiv.org/pdf/2505.16120.pdf

Authors: Guannan Liang, Qianqian Tong

Title: LLM-Powered AI Agent Systems and Their Applications in Industry

Abstract:
The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule‑based agents with limited task scope, LLM‑powered agents offer greater flexibility, cross‑domain reasoning, and natural language interaction. Moreover, with the integration of multi‑modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real‑world behavior. This paper comprehensively examines the evolution of agent systems from the pre‑LLM era to current LLM‑powered architectures. We categorize agent systems into software‑based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM‑powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.

PaperID: 10457, https://arxiv.org/pdf/2505.16086.pdf

Authors: Ming Shen, Raphael Shu, Anurag Pratik, James Gung, Yubin Ge, Monica Sunkara, Yi Zhang

Title: Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Abstract:
We have seen remarkable progress in large language models (LLMs) empowered multi‑agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM‑based multi‑agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role‑based multi‑agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two‑step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one‑pass and multi‑pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role‑based multi‑agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi‑agent systems to provide practical insights for future development.

PaperID: 10458, https://arxiv.org/pdf/2505.16067.pdf

Authors: Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Zirui Liu, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang

Title: How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

Abstract:
Memory is a critical component in large language model (LLM)‑based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact the LLM agents' behavior, especially their long‑term performance. Specifically, we focus on two fundamental memory management operations that are widely used by many agent frameworks‑memory addition and deletion‑to systematically study their impact on the agent behavior. Through our quantitative analysis, we find that LLM agents display an experience‑following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where some seemingly correct executions can provide limited or even misleading value as experiences. Through controlled experiments, we demonstrate the importance of regulating experience quality within the memory bank and show that future task evaluations can serve as free quality labels for stored memory. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long‑term agent performance.

PaperID: 10459, https://arxiv.org/pdf/2505.15927.pdf

Authors: Awni Altabaa, Omar Montasser, John Lafferty

Title: CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

Abstract:
Learning complex functions that involve multi‑step reasoning poses a significant challenge for standard supervised learning from input‑output examples. Chain‑of‑thought (CoT) supervision, which provides intermediate reasoning steps together with the final output, has emerged as a powerful empirical technique, underpinning much of the recent progress in the reasoning capabilities of large language models. This paper develops a statistical theory of learning under CoT supervision. A key characteristic of the CoT setting, in contrast to standard supervision, is the mismatch between the training objective (CoT risk) and the test objective (end‑to‑end risk). A central part of our analysis, distinguished from prior work, is explicitly linking those two types of risk to achieve sharper sample complexity bounds. This is achieved via the CoT information measure \mathcalI_\mathcalD, h_\star^\mathrmCoT(ε; \calH), which quantifies the additional discriminative power gained from observing the reasoning process. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard E2E supervision. Specifically, it is shown that the sample complexity required to achieve a target E2E error ε scales as d/\mathcalI_\mathcalD, h_\star^\mathrmCoT(ε; \calH), where d is a measure of hypothesis class complexity, which can be much faster than standard d/ε rates. Information‑theoretic lower bounds in terms of the CoT information are also obtained. Together, these results suggest that CoT information is a fundamental measure of statistical complexity for learning under chain‑of‑thought supervision.

PaperID: 10460, https://arxiv.org/pdf/2505.15872.pdf

Authors: Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen, Weiwen Liu, Yasheng Wang, Ruiming Tang, Weinan Zhang, Yong Yu

Title: InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre‑defined gold sets of documents, making them unsuitable for the open‑ended and dynamic nature of real‑world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real‑world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine‑grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.

PaperID: 10461, https://arxiv.org/pdf/2505.15857.pdf

Authors: Yujia Zhou, Hexi Wang, Qingyao Ai, Zhen Wu, Yiqun Liu

Title: Investigating Prosocial Behavior Theory in LLM Agents under Policy-Induced Inequities

Abstract:
As large language models (LLMs) increasingly operate as autonomous agents in social contexts, evaluating their capacity for prosocial behavior is both theoretically and practically critical. However, existing research has primarily relied on static, economically framed paradigms, lacking models that capture the dynamic evolution of prosociality and its sensitivity to structural inequities. To address these gaps, we introduce ProSim, a simulation framework for modeling the prosocial behavior in LLM agents across diverse social conditions. We conduct three progressive studies to assess prosocial alignment. First, we demonstrate that LLM agents can exhibit human‑like prosocial behavior across a broad range of real‑world scenarios and adapt to normative policy interventions. Second, we find that agents engage in fairness‑based third‑party punishment and respond systematically to variations in inequity magnitude and enforcement cost. Third, we show that policy‑induced inequities suppress prosocial behavior, propagate norm erosion through social networks. These findings advance prosocial behavior theory by elucidating how institutional dynamics shape the emergence, decay, and diffusion of prosocial norms in agent‑driven societies.

PaperID: 10462, https://arxiv.org/pdf/2505.15836.pdf

Authors: Aarav Lala, Kalyan Cherukuri

Title: Quantum-Evolutionary Neural Networks for Multi-Agent Federated Learning

Abstract:
As artificial intelligence continues to drive innovation in complex, decentralized environments, the need for scalable, adaptive, and privacy‑preserving decision‑making systems has become critical. This paper introduces a novel framework combining quantum‑inspired neural networks with evolutionary algorithms to optimize real‑time decision‑making in multi‑agent systems (MAS). The proposed Quantum‑Evolutionary Neural Network (QE‑NN) leverages quantum computing principles ‑‑ such as quantum superposition and entanglement ‑‑ to enhance learning speed and decision accuracy, while integrating evolutionary optimization to continually refine agent behaviors in dynamic, uncertain environments. By utilizing federated learning, QE‑NN ensures privacy preservation, enabling decentralized agents to collaborate without sharing sensitive data. The framework is designed to allow agents to adapt in real‑time to their environments, optimizing decision‑making processes for applications in areas such as autonomous systems, smart cities, and healthcare. This research represents a breakthrough in merging quantum computing, evolutionary optimization, and privacy‑preserving techniques to solve complex problems in multi‑agent decision‑making systems, pushing the boundaries of AI in real‑world, privacy‑sensitive applications.

PaperID: 10463, https://arxiv.org/pdf/2505.15584.pdf

Authors: Yongjun Chen, Liping Zhang

Title: Improved power methods for computing eigenvalues of dual quaternion Hermitian matrices

Abstract:
This paper investigates the eigenvalue computation problem of the dual quaternion Hermitian matrix closely related to multi‑agent group control. Recently, power method was proposed by Cui and Qi in Journal of Scientific Computing, 100 (2024) to solve such problem. Recognizing that the convergence rate of power method is slow due to its dependence on the eigenvalue distribution, we propose two improved versions of power method based on dual complex adjoint matrices and Aitken extrapolation, named DCAM‑PM and ADCAM‑PM. They achieve notable efficiency improvements and demonstrate significantly faster convergence. However, power method may be invalid for dual quaternion Hermitian matrices with eigenvalues having identical standard parts but distinct dual parts. To overcome this disadvantage, utilizing the eigen‑decomposition properties of dual complex adjoint matrix, we propose a novel algorithm EDDCAM‑EA which surpasses the power method in both accuracy and speed. Application to eigenvalue computations of dual quaternion Hermitian matrices in multi‑agent formation control and numerical experiments highlight the remarkable accuracy and speed of our proposed algorithms.

PaperID: 10464, https://arxiv.org/pdf/2505.15261.pdf

Authors: Jiatao Li, Mao Ye, Cheng Peng, Xunjian Yin, Xiaojun Wan

Title: AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection

Abstract:
Existing AI‑generated text detection methods heavily depend on large annotated datasets and external threshold tuning, restricting interpretability, adaptability, and zero‑shot effectiveness. To address these limitations, we propose AGENT‑X, a zero‑shot multi‑agent framework informed by classical rhetoric and systemic functional linguistics. Specifically, we organize detection guidelines into semantic, stylistic, and structural dimensions, each independently evaluated by specialized linguistic agents that provide explicit reasoning and robust calibrated confidence via semantic steering. A meta agent integrates these assessments through confidence‑aware aggregation, enabling threshold‑free, interpretable classification. Additionally, an adaptive Mixture‑of‑Agent router dynamically selects guidelines based on inferred textual characteristics. Experiments on diverse datasets demonstrate that AGENT‑X substantially surpasses state‑of‑the‑art supervised and zero‑shot approaches in accuracy, interpretability, and generalization.

PaperID: 10465, https://arxiv.org/pdf/2505.15216.pdf

Authors: Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang

Title: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Abstract:
AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber‑capabilities in evolving real‑world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real‑world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \10 to \30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3‑high and o4‑mini, and custom agents with o3‑high, GPT‑4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek‑R1. Given up to three attempts, the top‑performing agents are Codex CLI: o3‑high (12.5% on Detect, mapping to \3,720; 90% on Patch, mapping to \14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4‑mini (90% on Patch, mapping to \14,422). Codex CLI: o3‑high, Codex CLI: o4‑mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5‑67.5% and Patch scores of 25‑60%.

PaperID: 10466, https://arxiv.org/pdf/2505.15182.pdf

Authors: Jeonghye Kim, Sojeong Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, Kyomin Jung

Title: ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

Abstract:
Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent's actual state and goal. Our analysis finds that this stems from ReAct's inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent's state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.

PaperID: 10467, https://arxiv.org/pdf/2505.15076.pdf

Authors: Nanxu Gong, Sixun Dong, Haoyue Bai, Xinyuan Wang, Wangyang Ying, Yanjie Fu

Title: Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories

Abstract:
As a widely‑used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strive a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi‑Agent System with Long and Short‑Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in‑context learning with short‑term memory for immediate feedback refinement and long‑term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine‑tuning to train the router agent for effective decision‑making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.

PaperID: 10468, https://arxiv.org/pdf/2505.15068.pdf

Authors: Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, Heng Ji

Title: ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

Abstract:
Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real‑world problems, which demand open‑ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real‑world‑inspired, open‑ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi‑agent framework that coordinates tool use, supports structured workflows, and enables iterative self‑refinement to generate well‑grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert‑in‑the‑loop system leveraging LLMs as domain‑specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real‑world problem‑solving in open‑ended, interdisciplinary modeling challenges.

PaperID: 10469, https://arxiv.org/pdf/2505.15047.pdf

Authors: Yingming Pu, Tao Lin, Hongyu Chen

Title: PiFlow: Principle-Aware Scientific Discovery with Multi-Agent Collaboration

Abstract:
Large Language Model (LLM)‑based multi‑agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information‑theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). Extensive evaluations across three distinct scientific domains demonstrate that PiFlow (I) improves discovery efficiency by 31.18%~41.73% and solution quality by 12.47%~31.72% against state‑of‑the‑art methods, (II) delivers a 5.6x speedup in time‑to‑solution while reducing token consumption by up to 27% compared to vanilla agents, and (III) serves as a Plug‑and‑Play module that generalizes on existing agent architecture. Overall, PiFlow establishes a novel paradigm shift in highly efficient agentic scientific discovery, paving the way for more robust and accelerated AI‑driven research.

PaperID: 10470, https://arxiv.org/pdf/2505.14996.pdf

Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong, Shafiq Joty

Title: MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Abstract:
Multi‑agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS‑ZERO, the first self‑evolved, inference‑time framework for automatic MAS design. MAS‑ZERO employs meta‑level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta‑feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate‑level QA), coding, and agentic (search‑based) benchmarks, using both closed‑source and open‑source LLM backbones of varying sizes, demonstrate that MAS‑ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.

PaperID: 10471, https://arxiv.org/pdf/2505.14978.pdf

Authors: Ghasem Pasandi, Kishor Kunal, Varun Tej, Kunjal Shah, Hanfei Sun, Sumit Jain, Chunhui Li, Chenhui Deng, Teodor-Dumitru Ene, Haoxing Ren, Sreedhar Pratty

Title: JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation

Abstract:
This paper presents JARVIS, a novel multi‑agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high‑quality scripts for specialized Electronic Design Automation (EDA) tasks. By combining a domain‑specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state‑of‑the‑art domain‑specific models. Our framework addresses the challenges of data scarcity and hallucination errors in LLMs, demonstrating the potential of LLMs in specialized engineering domains. We evaluate our framework on multiple benchmarks and show that it outperforms existing models in terms of accuracy and reliability. Our work sets a new precedent for the application of LLMs in EDA and paves the way for future innovations in this field.

PaperID: 10472, https://arxiv.org/pdf/2505.14886.pdf

Authors: Danqing Wang, Zhuorui Ye, Xinran Zhao, Fei Fang, Lei Li

Title: Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

Abstract:
Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in the competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back‑and‑forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates the attack and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. The human evaluation on both the stage‑level and the debate‑level comparison shows that our TreeDebater outperforms the state‑of‑the‑art multi‑agent debate system, with a +15.6% improvement in stage‑level persuasiveness with DeepSeek and +10% debate‑level opinion shift win. Further investigation shows that TreeDebater shows better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.

PaperID: 10473, https://arxiv.org/pdf/2505.14872.pdf

Authors: Mai Lee Chang, Samantha Reig, Alicia, Lee, Anna Huang, Hugo Simão, Nara Han, Neeta M Khanuja, Abdullah Ubed Mohammad Ali, Rebekah Martinez, John Zimmerman, Jodi Forlizzi, Aaron Steinfeld

Title: Unremarkable to Remarkable AI Agent: Exploring Boundaries of Agent Intervention for Adults With and Without Cognitive Impairment

Abstract:
As the population of older adults increases, there is a growing need for support for them to age in place. This is exacerbated by the growing number of individuals struggling with cognitive decline and shrinking number of youth who provide care for them. Artificially intelligent agents could provide cognitive support to older adults experiencing memory problems, and they could help informal caregivers with coordination tasks. To better understand this possible future, we conducted a speed dating with storyboards study to reveal invisible social boundaries that might keep older adults and their caregivers from accepting and using agents. We found that healthy older adults worry that accepting agents into their homes might increase their chances of developing dementia. At the same time, they want immediate access to agents that know them well if they should experience cognitive decline. Older adults in the early stages of cognitive decline expressed a desire for agents that can ease the burden they saw themselves becoming for their caregivers. They also speculated that an agent who really knew them well might be an effective advocate for their needs when they were less able to advocate for themselves. That is, the agent may need to transition from being unremarkable to remarkable. Based on these findings, we present design opportunities and considerations for agents and articulate directions of future research.

PaperID: 10474, https://arxiv.org/pdf/2505.14848.pdf

Authors: George Wang, Jiaqian Hu, Safinah Ali

Title: MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

Abstract:
We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine‑grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single‑agent methods that rely on self‑correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero‑shot and single‑agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi‑layered error diagnosis, omission detection across perspectives, and context‑aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black‑box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.

PaperID: 10475, https://arxiv.org/pdf/2505.14727.pdf

Authors: Mohammad Rubyet Islam

Title: The Evolution of Alpha in Finance Harnessing Human Insight and LLM Agents

Abstract:
The pursuit of alpha returns that exceed market benchmarks has undergone a profound transformation, evolving from intuition‑driven investing to autonomous, AI powered systems. This paper introduces a comprehensive five stage taxonomy that traces this progression across manual strategies, statistical models, classical machine learning, deep learning, and agentic architectures powered by large language models (LLMs). Unlike prior surveys focused narrowly on modeling techniques, this review adopts a system level lens, integrating advances in representation learning, multimodal data fusion, and tool augmented LLM agents. The strategic shift from static predictors to contextaware financial agents capable of real time reasoning, scenario simulation, and cross modal decision making is emphasized. Key challenges in interpretability, data fragility, governance, and regulatory compliance areas critical to production deployment are examined. The proposed taxonomy offers a unified framework for evaluating maturity, aligning infrastructure, and guiding the responsible development of next generation alpha systems.

PaperID: 10476, https://arxiv.org/pdf/2505.14569.pdf

Authors: Devansh Bhardwaj, Arjun Beniwal, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R. Narasimhan, Ameet Deshpande, Vishvak Murahari

Title: Agent Context Protocols Enhance Collective Inference

Abstract:
AI agents have become increasingly adept at complex tasks such as coding, reasoning, and multimodal understanding. However, building generalist systems requires moving beyond individual agents to collective inference ‑‑ a paradigm where multi‑agent systems with diverse, task‑specialized agents complement one another through structured communication and collaboration. Today, coordination is usually handled with imprecise, ad‑hoc natural language, which limits complex interaction and hinders interoperability with domain‑specific agents. We introduce Agent context protocols (ACPs): a domain‑ and agent‑agnostic family of structured protocols for agent‑agent communication, coordination, and error handling. ACPs combine (i) persistent execution blueprints ‑‑ explicit dependency graphs that store intermediate agent outputs ‑‑ with (ii) standardized message schemas, enabling robust and fault‑tolerant multi‑agent collective inference. ACP‑powered generalist systems reach state‑of‑the‑art performance: 28.3 % accuracy on AssistantBench for long‑horizon web assistance and best‑in‑class multimodal technical reports, outperforming commercial AI systems in human evaluation. ACPs are highly modular and extensible, allowing practitioners to build top‑tier generalist agents quickly.

PaperID: 10477, https://arxiv.org/pdf/2505.14544.pdf

Authors: Saahil Mahato

Title: Smart Traffic Signals: Comparing MARL and Fixed-Time Strategies

Abstract:
Urban traffic congestion, particularly at intersections, significantly affects travel time, fuel consumption, and emissions. Traditional fixed‑time signal control systems often lack the adaptability to effectively manage dynamic traffic patterns. This study explores the application of multi‑agent reinforcement learning (MARL) to optimize traffic signal coordination across multiple intersections within a simulated environment. A simulation was developed to model a network of interconnected intersections with randomly generated vehicle flows to reflect realistic traffic variability. A decentralized MARL controller was implemented in which each traffic signal operates as an autonomous agent, making decisions based on local observations and information from neighboring agents. Performance was evaluated against a baseline fixed‑time controller using metrics such as average vehicle wait time and overall throughput. The MARL approach demonstrated statistically significant improvements, including reduced average waiting times and improved throughput. These findings suggest that MARL‑based dynamic control strategies hold substantial promise to improve urban traffic management efficiency. More research is recommended to address the challenges of scalability and real‑world implementation.

PaperID: 10478, https://arxiv.org/pdf/2505.14473.pdf

Authors: Sribalaji C. Anand, Alexander J Gallo, Nicola Bastianello

Title: Security of Gradient Tracking Algorithms Against Malicious Agents

Abstract:
Consensus algorithms are fundamental to multi‑agent distributed optimization, and their security under adversarial conditions is an active area of research. While prior works primarily establish conditions for successful global consensus under attack, little is known about system behavior when these conditions are violated. This paper addresses this gap by investigating the robustness of the Wang‑‑Elia algorithm, which is a robust to noise version of gradient tracking algorithm, in the presence of malicious agents. We consider a network of agents collaboratively minimizing a global cost function, where a subset of agents may transmit faulty information to disrupt consensus. To quantify resilience, we formulate a security metric as an optimization problem, which is rooted in centralized attack detection literature. We provide a tractable reformulation of the optimization problem, and derive conditions under which the metric becomes unbounded, identifying undetectable attack signals that reveal inherent vulnerabilities. To facilitate design and analysis, we propose a well‑posed variant of the metric and propose design methods to enhance network robustness against stealthy adversarial attacks. Numerical examples demonstrate the effectiveness of the proposed framework to enhance the resilience of multi‑agent distributed optimization.

PaperID: 10479, https://arxiv.org/pdf/2505.14453.pdf

Authors: Xianghua Zeng, Hao Peng, Angsheng Li

Title: Robustness Evaluation of Graph-based News Detection Using Network Structural Information

Abstract:
Although Graph Neural Networks (GNNs) have shown promising potential in fake news detection, they remain highly vulnerable to adversarial manipulations within social networks. Existing methods primarily establish connections between malicious accounts and individual target news to investigate the vulnerability of graph‑based detectors, while they neglect the structural relationships surrounding targets, limiting their effectiveness in robustness evaluation. In this work, we propose a novel Structural Information principles‑guided Adversarial Attack Framework, namely SI2AF, which effectively challenges graph‑based detectors and further probes their detection robustness. Specifically, structural entropy is introduced to quantify the dynamic uncertainty in social engagements and identify hierarchical communities that encompass all user accounts and news posts. An influence metric is presented to measure each account's probability of engaging in random interactions, facilitating the design of multiple agents that manage distinct malicious accounts. For each target news, three attack strategies are developed through multi‑agent collaboration within the associated subgraph to optimize evasion against black‑box detectors. By incorporating the adversarial manipulations generated by SI2AF, we enrich the original network structure and refine graph‑based detectors to improve their robustness against adversarial attacks. Extensive evaluations demonstrate that SI2AF significantly outperforms state‑of‑the‑art baselines in attack effectiveness with an average improvement of 16.71%, and enhances GNN‑based detection robustness by 41.54% on average.

PaperID: 10480, https://arxiv.org/pdf/2505.14299.pdf

Authors: Zihao Feng, Xiaoxue Wang, Bowen Wu, Weihong Zhong, Zhen Xu, Hailong Cao, Tiejun Zhao, Ying Li, Baoxun Wang

Title: Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

Abstract:
Task‑oriented dialogue systems based on Large Language Models (LLMs) have gained increasing attention across various industries and achieved significant results. Current approaches condense complex procedural workflows into a single agent to achieve satisfactory performance on large‑scale LLMs. However, these approaches face challenges to achieve comparable performance on fine‑tuned lightweight LLMs, due to their limited capabilities in handling multiple complex logic. In this work, we design a Domain‑Independent Multi‑Agent Framework (DIMF), which contains Intent Classification Agent, Slot Filling Agent and Response Agent. This approach simplifies the learning complexity and enhances the generalization ability by separating the tasks into domain‑independent components. In this framework, we enhance the capabilities in contextual understanding using the Direct Preference Optimisation (DPO) method, and propose a simple and effective Data Distribution Adaptation (DDA) method to mitigate degradation issues during DPO training. Experiments conducted on the MultiWOZ datasets show that our proposed method achieves a better average performance among all the baselines. Extensive analysis also demonstrates that our proposed framework exhibits excellent generalizability and zero‑shot capability.

PaperID: 10481, https://arxiv.org/pdf/2505.14163.pdf

Authors: He Wang, Alexander Hanbo Li, Yiqun Hu, Sheng Zhang, Hideo Kobayashi, Jiani Zhang, Henry Zhu, Chung-Wei Hang, Patrick Ng

Title: DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

Abstract:
Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in‑context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference‑time optimization framework, referred to as DSMentor, which leverages curriculum learning ‑‑ a strategy that introduces simpler task first and progressively moves to more complex ones as the learner improves ‑‑ to enhance LLM agent performance in challenging data science tasks. Our mentor‑guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long‑term memory to retain prior experiences, guiding the agent's learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude‑3.5‑Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT‑4 using Program‑of‑Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum‑based inference optimization.

PaperID: 10482, https://arxiv.org/pdf/2505.14141.pdf

Authors: Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu

Title: Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

Abstract:
Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often leads to difficulties in making accurate task plans, as GUI agents lack a deep understanding of how to effectively use the target applications, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug‑and‑play planning module to generate execution plans that guide vision language model(VLMs) in executing tasks. The proposed planning module utilizes extended finite state machines (EFSMs) to model the control logits and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary function modeled in EFSMs, and generate the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real‑world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5‑VL‑72B as the VLM executor, yielding a 28.8 percentage point improvement compared to using Qwen2.5‑VL‑72B without planning assistance.

PaperID: 10483, https://arxiv.org/pdf/2505.14126.pdf

Authors: Yuan-Hao Jiang, Kezong Tang, Zi-Wei Chen, Yuang Wei, Tian-Yi Liu, Jiayi Wu

Title: MAS-KCL: Knowledge component graph structure learning with large language model-based agentic workflow

Abstract:
Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of learners' poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS‑KCL, which employs a multi‑agent system driven by large language models for adaptive modification and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm, where AI agents leverage this mechanism to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby accelerating the efficiency of structure learning. We applied the proposed algorithm to 5 synthetic datasets and 4 real‑world educational datasets, and experimental results validate its effectiveness in learning path recognition. By accurately identifying learners' learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.

PaperID: 10484, https://arxiv.org/pdf/2505.14081.pdf

Authors: Luca Ballotta, Nicola Bastianello, Riccardo M. G. Ferrari, Karl H. Johansson

Title: Personalized and Resilient Distributed Learning Through Opinion Dynamics

Abstract:
In this paper, we address two practical challenges of distributed learning in multi‑agent network systems, namely personalization and resilience. Personalization is the need of heterogeneous agents to learn local models tailored to their own data and tasks, while still generalizing well; on the other hand, the learning process must be resilient to cyberattacks or anomalous training data to avoid disruption. Motivated by a conceptual affinity between these two requirements, we devise a distributed learning algorithm that combines distributed gradient descent and the Friedkin‑Johnsen model of opinion dynamics to fulfill both of them. We quantify its convergence speed and the neighborhood that contains the final learned models, which can be easily controlled by tuning the algorithm parameters to enforce a more personalized/resilient behavior. We numerically showcase the effectiveness of our algorithm on synthetic and real‑world distributed learning tasks, where it achieves high global accuracy both for personalized models and with malicious agents compared to standard strategies.

PaperID: 10485, https://arxiv.org/pdf/2505.13994.pdf

Authors: Ruiyi Yang, Hao Xue, Imran Razzak, Shirui Pan, Hakim Hacid, Flora D. Salim

Title: Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

Abstract:
Retrieval‑Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency‑accuracy trade‑offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi‑hop questions. To address these challenges, this paper propose SPLIT‑RAG, a multi‑agent RAG framework that addresses these limitations with question‑driven semantic graph partitioning and collaborative subgraph retrieval. The innovative framework first create Semantic Partitioning of Linked Information, then use the Type‑Specialized knowledge base to achieve Multi‑Agent RAG. The attribute‑aware graph segmentation manages to divide knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types, while lightweight LLM agents are assigned to partitioned subgraphs, and only relevant partitions are activated during retrieval, thus reduce search space while enhancing efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph‑derived answers through logical verifications. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.

PaperID: 10486, https://arxiv.org/pdf/2505.13965.pdf

Authors: Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang, Xiang Liu, Xuming Hu

Title: CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring

Abstract:
Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)‑based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi‑agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait‑specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence‑grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state‑of‑the‑art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.

PaperID: 10487, https://arxiv.org/pdf/2505.13941.pdf

Authors: Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis

Title: MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Abstract:
Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi‑agent framework powered by Large Language Models (LLMs) that enables end‑to‑end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE‑Bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full‑size systems from existing solutions.

PaperID: 10488, https://arxiv.org/pdf/2505.13940.pdf

Authors: Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu

Title: DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery

Abstract:
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution. However, applying LLM agents to drug discovery is still constrained by challenges such as large‑scale multimodal data processing, limited task automation, and poor support for domain‑specific tools. To overcome these limitations, we introduce DrugPilot, a LLM‑based agent system with a parameterized reasoning architecture designed for end‑to‑end scientific workflows in drug discovery. DrugPilot enables multi‑stage research processes by integrating structured tool use with a novel parameterized memory pool. The memory pool converts heterogeneous data from both public sources and user‑defined inputs into standardized representations. This design supports efficient multi‑turn dialogue, reduces information loss during data exchange, and enhances complex scientific decision‑making. To support training and benchmarking, we construct a drug instruction dataset covering eight core drug discovery tasks. Under the Berkeley function‑calling benchmark, DrugPilot significantly outperforms state‑of‑the‑art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi‑tool, and multi‑turn scenarios, respectively. These results highlight DrugPilot's potential as a versatile agent framework for computational science domains requiring automated, interactive, and data‑integrated reasoning.

PaperID: 10489, https://arxiv.org/pdf/2505.13862.pdf

Authors: Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng

Title: PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks

Abstract:
Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi‑agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration‑driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost‑performance trade‑offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.

PaperID: 10490, https://arxiv.org/pdf/2505.13834.pdf

Authors: Zhi Su, Yuman Gao, Emily Lukas, Yunfei Li, Jiaze Cai, Faris Tulbah, Fei Gao, Chao Yu, Zhongyu Li, Yi Wu, Koushil Sreenath

Title: Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams

Abstract:
Achieving coordinated teamwork among legged robots requires both fine‑grained locomotion control and long‑horizon strategic decision‑making. Robot soccer offers a compelling testbed for this challenge, combining dynamic, competitive, and multi‑agent interactions. In this work, we present a hierarchical multi‑agent reinforcement learning (MARL) framework that enables fully autonomous and decentralized quadruped robot soccer. First, a set of highly dynamic low‑level skills is trained for legged locomotion and ball manipulation, such as walking, dribbling, and kicking. On top of these, a high‑level strategic planning policy is trained with Multi‑Agent Proximal Policy Optimization (MAPPO) via Fictitious Self‑Play (FSP). This learning framework allows agents to adapt to diverse opponent strategies and gives rise to sophisticated team behaviors, including coordinated passing, interception, and dynamic role allocation. With an extensive ablation study, the proposed learning method shows significant advantages in the cooperative and competitive multi‑agent soccer game. We deploy the learned policies to real quadruped robots relying solely on onboard proprioception and decentralized localization, with the resulting system supporting autonomous robot‑robot and robot‑human soccer matches on indoor and outdoor soccer courts.

PaperID: 10491, https://arxiv.org/pdf/2505.13729.pdf

Authors: Abhinav Rajvanshi, Pritish Sahu, Tixiao Shan, Karan Sikka, Han-Pang Chiu

Title: SayCoNav: Utilizing Large Language Models for Adaptive Collaboration in Decentralized Multi-Robot Navigation

Abstract:
Adaptive collaboration is critical to a team of autonomous robots to perform complicated navigation tasks in large‑scale unknown environments. An effective collaboration strategy should be determined and adapted according to each robot's skills and current status to successfully achieve the shared goal. We present SayCoNav, a new approach that leverages large language models (LLMs) for automatically generating this collaboration strategy among a team of robots. Building on the collaboration strategy, each robot uses the LLM to generate its plans and actions in a decentralized way. By sharing information to each other during navigation, each robot also continuously updates its step‑by‑step plans accordingly. We evaluate SayCoNav on Multi‑Object Navigation (MultiON) tasks, that require the team of the robots to utilize their complementary strengths to efficiently search multiple different objects in unknown environments. By validating SayCoNav with varied team compositions and conditions against baseline methods, our experimental results show that SayCoNav can improve search efficiency by at most 44.28% through effective collaboration among heterogeneous robots. It can also dynamically adapt to the changing conditions during task execution.

PaperID: 10492, https://arxiv.org/pdf/2505.13668.pdf

Authors: Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Title: MAFA: A multi-agent framework for annotation

Abstract:
Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi‑agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task‑specific JSON queries. Our framework features a few‑shot example strategy, where each agent receives different few‑shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real‑world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single‑agent approaches across multiple metrics, including a 14% increase in Top‑1 accuracy, an 18% increase in Top‑5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single‑agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well‑suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.

PaperID: 10493, https://arxiv.org/pdf/2505.13546.pdf

Authors: Ke Chen, Yufei Zhou, Xitong Zhang, Haohan Wang

Title: Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

Abstract:
Automatic prompt generation plays a crucial role in enabling general‑purpose multi‑agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome‑centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability‑the consistency of model responses across repeated executions‑as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine‑tune a LLaMA‑based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability‑aware general‑purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system‑level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability as a necessary condition for effective system‑level execution. Empirical results across general and domain‑specific tasks demonstrate that our stability‑aware framework improves both accuracy and output consistency. By shifting the focus from one‑off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general‑purpose systems.

PaperID: 10494, https://arxiv.org/pdf/2505.13543.pdf

Authors: Muyang Fan, Songyang Liu, Shuai Li, Weizi Li

Title: Origin-Destination Pattern Effects on Large-Scale Mixed Traffic Control via Multi-Agent Reinforcement Learning

Abstract:
Traffic congestion remains a major challenge for modern urban transportation, diminishing both efficiency and quality of life. While autonomous driving technologies and reinforcement learning (RL) have shown promise for improving traffic control, most prior work has focused on small‑scale networks or isolated intersections. Large‑scale mixed traffic control, involving both human‑driven and robotic vehicles, remains underexplored. In this study, we propose a decentralized multi‑agent reinforcement learning framework for managing large‑scale mixed traffic networks, where intersections are controlled either by traditional traffic signals or by robotic vehicles. We evaluate our approach on a real‑world network of 14 intersections in Colorado Springs, Colorado, USA, using average vehicle waiting time as the primary measure of traffic efficiency. We are exploring a problem that has not been sufficiently addressed: Is large‑scale Multi‑Agent Traffic Control (MTC) still feasible when facing time‑varying Origin‑Destination (OD) patterns?

PaperID: 10495, https://arxiv.org/pdf/2505.13533.pdf

Authors: Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Ruiyu Wang, Bo Li, Xiao Huang, Dongning Sun, Xinrun Wang

Title: FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs

Abstract:
Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor intensive processes, low error tolerance, data fragmentation, and tool limitations. Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain‑specific data, have simplistic task design, and incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLM in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy‑compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state‑of‑the‑art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi‑step reasoning. This degradation exhibits the propagation of computational errors, where single‑metric calculations initially demonstrating 58% accuracy decreased to 37% in multimetric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full‑pipeline financial workflows with challenging tasks. We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real‑world financial practices to enhance efficiency and accuracy.

PaperID: 10496, https://arxiv.org/pdf/2505.13523.pdf

Authors: Jun Liu, Ke Yu, Keliang Chen, Ke Li, Yuxinyue Qian, Xiaolian Guo, Haozhe Song, Yinming Li

Title: ACPs: Agent Collaboration Protocols for the Internet of Agents

Abstract:
With the rapid advancement of artificial intelligence, the proliferation of autonomous agents has introduced new challenges in interoperability, scalability, and coordination. The Internet of Agents (IoA) aims to interconnect heterogeneous agents through standardized communication protocols, enabling seamless collaboration and intelligent task execution. However, existing agent communication protocols such as MCP, A2A, and ANP remain fragmented and scenario‑specific. To address this gap, we propose Agent Collaboration Protocols (ACPs), a comprehensive protocol suite for the IoA. ACPs include registration, discovery, interaction, and tooling protocols to support trustable access, capability orchestration, and workflow construction. We present the architecture, key technologies, and application workflows of ACPs, and demonstrate its effectiveness in a collaborative restaurant booking scenario. ACPs lay the foundation for building a secure, open, and scalable agent internet infrastructure.

PaperID: 10497, https://arxiv.org/pdf/2505.13504.pdf

Authors: Ayesha Amjad, Saurav Sthapit, Tahir Qasim Syed

Title: An agentic system with reinforcement-learned subsystem improvements for parsing form-like documents

Abstract:
Extracting alphanumeric data from form‑like documents such as invoices, purchase orders, bills, and financial documents is often performed via vision (OCR) and learning algorithms or monolithic pipelines with limited potential for systemic improvements. We propose an agentic AI system that leverages Large Language Model (LLM) agents and a reinforcement learning (RL) driver agent to automate consistent, self‑improving extraction under LLM inference uncertainty. Our work highlights the limitations of monolithic LLM‑based extraction and introduces a modular, multi‑agent framework with task‑specific prompts and an RL policy of rewards and penalties to guide a meta‑prompting agent to learn from past errors and improve prompt‑based actor agents. This self‑corrective adaptive system handles diverse documents, file formats, layouts, and LLMs, aiming to automate accurate information extraction without the need for human intervention. Results as reported on two benchmark datasets of SOIRE, and CORD, are promising for the agentic AI framework.

PaperID: 10498, https://arxiv.org/pdf/2505.13466.pdf

Authors: Vu Dinh Xuan, Hao Vo, David Murphy, Hoang D. Nguyen

Title: AgentSGEN: Multi-Agent LLM in the Loop for Semantic Collaboration and GENeration of Synthetic Data

Abstract:
The scarcity of data depicting dangerous situations presents a major obstacle to training AI systems for safety‑critical applications, such as construction safety, where ethical and logistical barriers hinder real‑world data collection. This creates an urgent need for an end‑to‑end framework to generate synthetic data that can bridge this gap. While existing methods can produce synthetic scenes, they often lack the semantic depth required for scene simulations, limiting their effectiveness. To address this, we propose a novel multi‑agent framework that employs an iterative, in‑the‑loop collaboration between two agents: an Evaluator Agent, acting as an LLM‑based judge to enforce semantic consistency and safety‑specific constraints, and an Editor Agent, which generates and refines scenes based on this guidance. Powered by LLM's capabilities to reasoning and common‑sense knowledge, this collaborative design produces synthetic images tailored to safety‑critical scenarios. Our experiments suggest this design can generate useful scenes based on realistic specifications that address the shortcomings of prior approaches, balancing safety requirements with visual semantics. This iterative process holds promise for delivering robust, aesthetically sound simulations, offering a potential solution to the data scarcity challenge in multimedia safety applications.

PaperID: 10499, https://arxiv.org/pdf/2505.13400.pdf

Authors: Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, Samuel G. Rodriques

Title: Robin: A multi-agent system for automating scientific discovery

Abstract:
Scientific discovery is driven by the iterative process of background research, hypothesis generation, experimentation, and data analysis. Despite recent advancements in applying artificial intelligence to scientific discovery, no system has yet automated all of these stages in a single workflow. Here, we introduce Robin, the first multi‑agent system capable of fully automating the key intellectual steps of the scientific process. By integrating literature search agents with data analysis agents, Robin can generate hypotheses, propose experiments, interpret experimental results, and generate updated hypotheses, achieving a semi‑autonomous approach to scientific discovery. By applying this system, we were able to identify a novel treatment for dry age‑related macular degeneration (dAMD), the major cause of blindness in the developed world. Robin proposed enhancing retinal pigment epithelium phagocytosis as a therapeutic strategy, and identified and validated a promising therapeutic candidate, ripasudil. Ripasudil is a clinically‑used rho kinase (ROCK) inhibitor that has never previously been proposed for treating dAMD. To elucidate the mechanism of ripasudil‑induced upregulation of phagocytosis, Robin then proposed and analyzed a follow‑up RNA‑seq experiment, which revealed upregulation of ABCA1, a critical lipid efflux pump and possible novel target. All hypotheses, experimental plans, data analyses, and data figures in the main text of this report were produced by Robin. As the first AI system to autonomously discover and validate a novel therapeutic candidate within an iterative lab‑in‑the‑loop framework, Robin establishes a new paradigm for AI‑driven scientific discovery.

PaperID: 10500, https://arxiv.org/pdf/2505.13311.pdf

Authors: Saleh Soudijani, Rayna Dimitrova

Title: Synthesis of Communication Policies for Multi-Agent Systems Robust to Communication Restrictions

Abstract:
We study stochastic multi‑agent systems in which agents must cooperate to maximize the probability of achieving a common reach‑avoid objective. In many applications, during the execution of the system, the communication between the agents can be constrained by restrictions on the bandwidth currently available for exchanging local‑state information between the agents. In this paper, we propose a method for computing joint action and communication policies for the group of agents that aim to satisfy the communication restrictions as much as possible while achieving the optimal reach‑avoid probability when communication is unconstrained. Our method synthesizes a pair of action and communication policies robust to restrictions on the number of agents allowed to communicate. To this end, we introduce a novel cost function that measures the amount of information exchanged beyond what the communication policy allows. We evaluate our approach experimentally on a range of benchmarks and demonstrate that it is capable of computing pairs of action and communication policies that satisfy the communication restrictions, if such exist.

PaperID: 10501, https://arxiv.org/pdf/2505.13278.pdf

Authors: Daniel Weiner, Raj Korpan

Title: Hybrid Voting-Based Task Assignment in Modular Construction Scenarios

Abstract:
Modular construction, involving off‑site prefabrication and on‑site assembly, offers significant advantages but presents complex coordination challenges for robotic automation. Effective task allocation is critical for leveraging multi‑agent systems (MAS) in these structured environments. This paper introduces the Hybrid Voting‑Based Task Assignment (HVBTA) framework, a novel approach to optimizing collaboration between heterogeneous multi‑agent construction teams. Inspired by human reasoning in task delegation, HVBTA uniquely integrates multiple voting mechanisms with the capabilities of a Large Language Model (LLM) for nuanced suitability assessment between agent capabilities and task requirements. The framework operates by assigning Capability Profiles to agents and detailed requirement lists called Task Descriptions to construction tasks, subsequently generating a quantitative Suitability Matrix. Six distinct voting methods, augmented by a pre‑trained LLM, analyze this matrix to robustly identify the optimal agent for each task. Conflict‑Based Search (CBS) is integrated for decentralized, collision‑free path planning, ensuring efficient and safe spatio‑temporal coordination of the robotic team during assembly operations. HVBTA enables efficient, conflict‑free assignment and coordination, facilitating potentially faster and more accurate modular assembly. Current work is evaluating HVBTA's performance across various simulated construction scenarios involving diverse robotic platforms and task complexities. While designed as a generalizable framework for any domain with clearly definable tasks and capabilities, HVBTA will be particularly effective for addressing the demanding coordination requirements of multi‑agent collaborative robotics in modular construction due to the predetermined construction planning involved.

PaperID: 10502, https://arxiv.org/pdf/2505.13246.pdf

Authors: Roberto Pugliese, George Kourousias, Francesco Venier, Grazia Garlatti Costa

Title: Agentic publications: redesigning scientific publishing in the age of thinking large language models

Abstract:
Purpose: This paper introduces the concept of "Agentic Publication," a novel LLM‑driven framework designed to complement traditional scientific publishing by transforming papers into interactive knowledge systems that address challenges created by exponential growth in scientific literature. Design/methodology/approach: Our architecture integrates structured data (knowledge graphs, metadata) with unstructured content (text, multimedia) through retrieval‑augmented generation and multi‑agent verification. The system provides interfaces for humans and artificial agents, offering narrative explanations alongside machine‑readable outputs. Implementation leverages vector databases for semantic search, knowledge graphs for structured reasoning, and collaborative verification agents. Findings: Our proof‑of‑concept demonstration showcases multilingual interaction, API accessibility, continuous knowledge flow, and structured knowledge representation. The framework enables dynamic updating of knowledge, synthesis of new findings, and customizable detail levels. Originality: The Agentic Publication represents a transformative approach to scientific communication by creating responsive knowledge synthesis systems while maintaining scientific rigor. Integrating multi‑agent verification with traditional publishing pathways creates a more efficient, accessible, and collaborative research ecosystem, particularly valuable in interdisciplinary fields. Practical implications: The system is a powerful companion for researchers navigating complex knowledge landscapes, offering tailored information access across disciplines while addressing ethical considerations through automated validation, expert oversight, and transparent governance.

PaperID: 10503, https://arxiv.org/pdf/2505.13191.pdf

Authors: Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi

Title: Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

Abstract:
Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models like the Recurrent Model of Visual Attention (RAM) and Deep Recurrent Attention Model (DRAM) failed to model the hierarchy of human vision system, that compromise on the visual exploration dynamics. As a result, they tend to produce attention that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi‑Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the function of glimpse location generation and task execution in two recurrent layers, MRAM emergent a balanced behavior between fixation and saccadic movement. Our results show that MRAM not only achieves more human‑like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.

PaperID: 10504, https://arxiv.org/pdf/2505.13188.pdf

Authors: Juntian Zhu, Miguel de Carvalho, Zhouwang Yang, Fengxiang He

Title: When a Reinforcement Learning Agent Encounters Unknown Unknowns

Abstract:
An AI agent might surprisingly find she has reached an unknown state which she has never been aware of ‑‑ an unknown unknown. We mathematically ground this scenario in reinforcement learning: an agent, after taking an action calculated from value functions Q and V defined on the \it aware domain, reaches a state out of the domain. To enable the agent to handle this scenario, we propose an \it episodic Markov decision process with growing awareness (EMDP‑GA) model, taking a new \it noninformative value expansion (NIVE) approach to expand value functions to newly aware areas: when an agent arrives at an unknown unknown, value functions Q and V whereon are initialised by noninformative beliefs ‑‑ the averaged values on the aware domain. This design is out of respect for the complete absence of knowledge in the newly discovered state. The upper confidence bound momentum Q‑learning is then adapted to the growing awareness for training the EMDP‑GA model. We prove that (1) the regret of our approach is asymptotically consistent with the state of the art (SOTA) without exposure to unknown unknowns in an extremely uncertain environment, and (2) our computational complexity and space complexity are comparable with the SOTA ‑‑ these collectively suggest that though an unknown unknown is surprising, it will be asymptotically properly discovered with decent speed and an affordable cost.

PaperID: 10505, https://arxiv.org/pdf/2505.13008.pdf

Authors: He Ye, Aidan Z. H. Yang, Chang Hu, Yanlin Wang, Tao Zhang, Claire Le Goues

Title: Adversarial Reasoning for Repair Based on Inferred Program Intent

Abstract:
Automated program repair (APR) has shown promising results, particularly with the use of neural networks. Currently, most APR tools focus on code transformations specified by test suites, rather than reasoning about the program intent and the high‑level bug specification. Without a proper understanding of program intent, these tools tend to generate patches that overfit incomplete test suites and fail to reflect the developers intentions. However, reasoning about program intent is challenging. In our work, we propose an approach called AdverIntent‑Agent, based on critique and adversarial reasoning. Our approach is novel to shift the focus from generating multiple APR patches to inferring multiple potential program intents. Ideally, we aim to infer intents that are, to some extent, adversarial to each other, maximizing the probability that at least one aligns closely with the developers original intent. AdverIntent‑Agent is a multi‑agent approach consisting of three agents: a reasoning agent, a test agent, and a repair agent. First, the reasoning agent generates adversarial program intents along with the corresponding faulty statements. Next, the test agent produces adversarial test cases that align with each inferred intent, constructing oracles that use the same inputs but have different expected outputs. Finally, the repair agent uses dynamic and precise LLM prompts to generate patches that satisfy both the inferred program intent and the generated tests. AdverIntent‑Agent was evaluated on two benchmarks: Defects4J 2.0 and HumanEval‑Java. AdverIntent‑Agent correctly repaired 77 and 105 bugs in both benchmarks, respectively.

PaperID: 10506, https://arxiv.org/pdf/2505.12981.pdf

Authors: Liangxuan Wu, Chao Wang, Tianming Liu, Yanjie Zhao, Haoyu Wang

Title: From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents

Abstract:
The growing adoption of large language models (LLMs) has led to a new paradigm in mobile computing‑‑LLM‑powered mobile AI agents‑‑capable of decomposing and automating complex tasks directly on smartphones. However, the security implications of these agents remain largely unexplored. In this paper, we present the first comprehensive security analysis of mobile LLM agents, encompassing three representative categories: System‑level AI Agents developed by original equipment manufacturers (e.g., YOYO Assistant), Third‑party Universal Agents (e.g., Zhipu AI AutoGLM), and Emerging Agent Frameworks (e.g., Alibaba Mobile Agent). We begin by analyzing the general workflow of mobile agents and identifying security threats across three core capability dimensions: language‑based reasoning, GUI‑based interaction, and system‑level execution. Our analysis reveals 11 distinct attack surfaces, all rooted in the unique capabilities and interaction patterns of mobile LLM agents, and spanning their entire operational lifecycle. To investigate these threats in practice, we introduce AgentScan, a semi‑automated security analysis framework that systematically evaluates mobile LLM agents across all 11 attack scenarios. Applying AgentScan to nine widely deployed agents, we uncover a concerning trend: every agent is vulnerable to targeted attacks. In the most severe cases, agents exhibit vulnerabilities across eight distinct attack vectors. These attacks can cause behavioral deviations, privacy leakage, or even full execution hijacking. Based on these findings, we propose a set of defensive design principles and practical recommendations for building secure mobile LLM agents. Our disclosures have received positive feedback from two major device vendors. Overall, this work highlights the urgent need for standardized security practices in the fast‑evolving landscape of LLM‑driven mobile automation.

PaperID: 10507, https://arxiv.org/pdf/2505.12923.pdf

Authors: Pedro M. P. Curvo

Title: The Traitors: Deception and Trust in Multi-Agent Language Model Simulations

Abstract:
As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi‑agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents the traitors seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek‑V3, GPT‑4o‑mini, and GPT‑4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT‑4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others' falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.

PaperID: 10508, https://arxiv.org/pdf/2505.12886.pdf

Authors: Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, Jun Xu

Title: Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective

Abstract:
Large Reasoning Models (LRMs) have shown impressive capabilities in multi‑step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged‑‑Reasoning Hallucination‑‑where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern‑matching from genuine deep reasoning. Using this score, we conduct an in‑depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early‑stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our Reasoning Hallucination Detection (RHD) framework, which achieves state‑of‑the‑art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce GRPO‑R, an enhanced reinforcement learning algorithm that incorporates step‑level deep reasoning rewards via potential‑based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.

PaperID: 10509, https://arxiv.org/pdf/2505.12872.pdf

Authors: Maytus Piriyajitakonkij, Rujikorn Charakorn, Weicheng Tao, Wei Pan, Mingfei Sun, Cheston Tan, Mengmi Zhang

Title: From Grunts to Lexicons: Emergent Language from Cooperative Foraging

Abstract:
Language is a powerful communicative and cognitive tool. It enables humans to express thoughts, share intentions, and reason about complex phenomena. Despite our fluency in using and understanding language, the question of how it arises and evolves over time remains unsolved. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi‑agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high‑value targets or executing temporally ordered actions. Using end‑to‑end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size, social dynamics, and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi‑agent settings. We will release all data, code, and models publicly.

PaperID: 10510, https://arxiv.org/pdf/2505.12833.pdf

Authors: Zhuo Yang, Daolang Wang, Lingli Ge, Beilun Wang, Tianfan Fu, Yuqiang Li

Title: Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLMs

Abstract:
Many real‑world scientific and industrial applications require the optimization of expensive black‑box functions. Bayesian Optimization (BO) provides an effective framework for such problems. However, traditional BO methods are prone to get trapped in local optima and often lack interpretable insights. To address this issue, this paper designs Reasoning BO, a novel framework that leverages reasoning models to guide the sampling process in BO while incorporating multi‑agent systems and knowledge graphs for online knowledge accumulation. By integrating the reasoning and contextual understanding capabilities of Large Language Models (LLMs), we can provide strong guidance to enhance the BO process. As the optimization progresses, Reasoning BO provides real‑time sampling recommendations along with critical insights grounded in plausible scientific theories, aiding in the discovery of superior solutions within the search space. We systematically evaluate our approach across 10 diverse tasks encompassing synthetic mathematical functions and complex real‑world applications. The framework demonstrates its capability to progressively refine sampling strategies through real‑time insights and hypothesis evolution, effectively identifying higher‑performing regions of the search space for focused exploration. This process highlights the powerful reasoning and context‑learning abilities of LLMs in optimization scenarios. For example, in the Direct Arylation task, our method increased the yield to 60.7%, whereas traditional BO achieved only a 25.2% yield. Furthermore, our investigation reveals that smaller LLMs, when fine‑tuned through reinforcement learning, can attain comparable performance to their larger counterparts.

PaperID: 10511, https://arxiv.org/pdf/2505.12811.pdf

Authors: Wei-Chen Liao, Ti-Rong Wu, I-Chen Wu

Title: Dynamic Sight Range Selection in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement Learning (MARL) is often challenged by the sight range dilemma, where agents either receive insufficient or excessive information from their environment. In this paper, we propose a novel method, called Dynamic Sight Range Selection (DSR), to address this issue. DSR utilizes an Upper Confidence Bound (UCB) algorithm and dynamically adjusts the sight range during training. Experiment results show several advantages of using DSR. First, we demonstrate using DSR achieves better performance in three common MARL environments, including Level‑Based Foraging (LBF), Multi‑Robot Warehouse (RWARE), and StarCraft Multi‑Agent Challenge (SMAC). Second, our results show that DSR consistently improves performance across multiple MARL algorithms, including QMIX and MAPPO. Third, DSR offers suitable sight ranges for different training steps, thereby accelerating the training process. Finally, DSR provides additional interpretability by indicating the optimal sight range used during training. Unlike existing methods that rely on global information or communication mechanisms, our approach operates solely based on the individual sight ranges of agents. This approach offers a practical and efficient solution to the sight range dilemma, making it broadly applicable to real‑world complex environments.

PaperID: 10512, https://arxiv.org/pdf/2505.12710.pdf

Authors: Yingkai Kang, Jiawen Kang, Jinbo Wen, Tao Zhang, Zhaohui Yang, Dusit Niyato, Yan Zhang

Title: Confidence-Regulated Generative Diffusion Models for Reliable AI Agent Migration in Vehicular Metaverses

Abstract:
Vehicular metaverses are an emerging paradigm that merges intelligent transportation systems with virtual spaces, leveraging advanced digital twin and Artificial Intelligence (AI) technologies to seamlessly integrate vehicles, users, and digital environments. In this paradigm, vehicular AI agents are endowed with environment perception, decision‑making, and action execution capabilities, enabling real‑time processing and analysis of multi‑modal data to provide users with customized interactive services. Since vehicular AI agents require substantial resources for real‑time decision‑making, given vehicle mobility and network dynamics conditions, the AI agents are deployed in RoadSide Units (RSUs) with sufficient resources and dynamically migrated among them. However, AI agent migration requires frequent data exchanges, which may expose vehicular metaverses to potential cyber attacks. To this end, we propose a reliable vehicular AI agent migration framework, achieving reliable dynamic migration and efficient resource scheduling through cooperation between vehicles and RSUs. Additionally, we design a trust evaluation model based on the theory of planned behavior to dynamically quantify the reputation of RSUs, thereby better accommodating the personalized trust preferences of users. We then model the vehicular AI agent migration process as a partially observable markov decision process and develop a Confidence‑regulated Generative Diffusion Model (CGDM) to efficiently generate AI agent migration decisions. Numerical results demonstrate that the CGDM algorithm significantly outperforms baseline methods in reducing system latency and enhancing robustness against cyber attacks.

PaperID: 10513, https://arxiv.org/pdf/2505.12623.pdf

Authors: Keisuke Okumura, Hiroki Nagai

Title: Lightweight and Effective Preference Construction in PIBT for Large-Scale Multi-Agent Pathfinding

Abstract:
PIBT is a computationally lightweight algorithm that can be applied to a variety of multi‑agent pathfinding (MAPF) problems, generating the next collision‑free locations of agents given another. Because of its simplicity and scalability, it is becoming a popular underlying scheme for recent large‑scale MAPF methods involving several hundreds or thousands of agents. Vanilla PIBT makes agents behave greedily towards their assigned goals, while agents typically have multiple best actions, since the graph shortest path is not always unique. Consequently, tiebreaking about how to choose between these actions significantly affects resulting solutions. This paper studies two simple yet effective techniques for tiebreaking in PIBT, without compromising its computational advantage. The first technique allows an agent to intelligently dodge another, taking into account whether each action will hinder the progress of the next timestep. The second technique is to learn, through multiple PIBT runs, how an action causes regret in others and to use this information to minimise regret collectively. Our empirical results demonstrate that these techniques can reduce the solution cost of one‑shot MAPF and improve the throughput of lifelong MAPF. For instance, in densely populated one‑shot cases, the combined use of these tiebreaks achieves improvements of around 10‑20% in sum‑of‑costs, without significantly compromising the speed of a PIBT‑based planner.

PaperID: 10514, https://arxiv.org/pdf/2505.12594.pdf

Authors: Tiankai Yang, Junjun Liu, Wingchun Siu, Jiahang Wang, Zhuangzhuang Qian, Chanjuan Song, Cheng Cheng, Xiyang Hu, Yue Zhao

Title: AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection

Abstract:
Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non‑expert users who lack in‑depth library‑specific knowledge and advanced programming skills. To tackle this, we present AD‑AGENT, an LLM‑driven multi‑agent framework that turns natural‑language instructions into fully executable AD pipelines. AD‑AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short‑term workspace and a long‑term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD‑AGENT produces reliable scripts and recommends competitive models across libraries. The system is open‑sourced to support further research and practical applications in AD.

PaperID: 10515, https://arxiv.org/pdf/2505.12501.pdf

Authors: Edward Y. Chang, Longling Geng

Title: ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Abstract:
Large language models (LLMs) excel at rapid generation of text and multimodal content, yet they falter on transaction‑style planning that demands ACID‑like guarantees and real‑time disruption recovery. We present Adaptive LLM Agent System (ALAS), a framework that tackles four fundamental LLM deficits: (i) absence of self‑verification, (ii) context erosion, (iii) next‑token myopia, and (iv) lack of persistent state. ALAS decomposes each plan into role‑specialized agents, equips them with automatic state tracking, and coordinates them through a lightweight protocol. When disruptions arise, agents apply history‑aware local compensation, avoiding costly global replanning and containing cascade effects. On real‑world, large‑scale job‑shop scheduling benchmarks, ALAS sets new best results for static sequential planning and excels in dynamic reactive scenarios with unexpected disruptions. These gains show that principled modularization plus targeted compensation can unlock scalable and resilient planning with LLMs.

PaperID: 10516, https://arxiv.org/pdf/2505.12490.pdf

Authors: Yedidel Louck, Ariel Stulman, Amit Dvir

Title: Improving Google A2A Protocol: Protecting Sensitive Data and Mitigating Unintended Harms in Multi-Agent Systems

Abstract:
Googles A2A protocol provides a secure communication framework for AI agents but demonstrates critical limitations when handling highly sensitive information such as payment credentials and identity documents. These gaps increase the risk of unintended harms, including unauthorized disclosure, privilege escalation, and misuse of private data in generative multi‑agent environments. In this paper, we identify key weaknesses of A2A: insufficient token lifetime control, lack of strong customer authentication, overbroad access scopes, and missing consent flows. We propose protocol‑level enhancements grounded in a structured threat model for semi‑trusted multi‑agent systems. Our refinements introduce explicit consent orchestration, ephemeral scoped tokens, and direct user‑to‑service data channels to minimize exposure across time, context, and topology. Empirical evaluation using adversarial prompt injection tests shows that the enhanced protocol substantially reduces sensitive data leakage while maintaining low communication latency. Comparative analysis highlights the advantages of our approach over both the original A2A specification and related academic proposals. These contributions establish a practical path for evolving A2A into a privacy‑preserving framework that mitigates unintended harms in multi‑agent generative AI systems.

PaperID: 10517, https://arxiv.org/pdf/2505.12467.pdf

Authors: Haochun Wang, Sendong Zhao, Jingbo Wang, Zewen Qiang, Bing Qin, Ting Liu

Title: Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems

Abstract:
Multi‑agent collaboration has emerged as a pivotal paradigm for addressing complex, distributed tasks in large language model (LLM)‑driven applications. While prior research has focused on high‑level architectural frameworks, the granular mechanisms governing agents, critical to performance and scalability, remain underexplored. This study systematically investigates four dimensions of collaboration strategies: (1) agent governance, (2) participation control, (3) interaction dynamics, and (4) dialogue history management. Through rigorous experimentation under two context‑dependent scenarios: Distributed Evidence Integration (DEI) and Structured Evidence Synthesis (SES), we quantify the impact of these strategies on both task accuracy and computational efficiency. Our findings reveal that centralized governance, instructor‑led participation, ordered interaction patterns, and instructor‑curated context summarization collectively optimize the trade‑off between decision quality and resource utilization with the support of the proposed Token‑Accuracy Ratio (TAR). This work establishes a foundation for designing adaptive, scalable multi‑agent systems, shifting the focus from structural novelty to strategic interaction mechanics.

PaperID: 10518, https://arxiv.org/pdf/2505.12442.pdf

Authors: Liwen Wang, Wenxuan Wang, Shuai Wang, Zongjie Li, Zhenlan Ji, Zongyi Lyu, Daoyuan Wu, Shing-Chi Cheung

Title: IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Abstract:
The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi‑Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black‑box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query q and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query q to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real‑world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.

PaperID: 10519, https://arxiv.org/pdf/2505.12402.pdf

Authors: Yuntao Du, Zitao Li, Bolin Ding, Yaliang Li, Hanshen Xiao, Jingren Zhou, Ninghui Li

Title: Automated Profile Inference with Language Model Agents

Abstract:
Impressive progress has been made in automated problem‑solving by the collaboration of large language model (LLM) based agents. However, these automated capabilities also open avenues for malicious applications. In this paper, we study a new threat that LLMs pose to online pseudonymity, called automated profile inference, where an adversary can instruct LLMs to automatically collect and extract sensitive personal attributes from publicly available user activities on pseudonymous platforms. We also introduce an automated profiling framework called AutoProfiler to demonstrate and assess the feasibility of such attacks in real‑world scenarios. AutoProfiler consists of four specialized LLM agents that work collaboratively to retrieve and process user online activities and generate a profile with extracted personal information. Experimental results on two real‑world datasets and one synthetic dataset show that AutoProfiler is highly effective and efficient, and the inferred attributes are both identifiable and sensitive, posing significant privacy risks. We explore mitigation strategies from different perspectives and advocate for increased public awareness of this emerging privacy threat.

PaperID: 10520, https://arxiv.org/pdf/2505.12371.pdf

Authors: Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu

Title: MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Abstract:
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi‑agent collaboration for addressing complex medical tasks. However, the practical advantages of multi‑agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real‑world clinical practice, and frequently omit rigorous comparisons against both single‑LLM‑based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi‑agent collaboration, single‑LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi‑agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR‑based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task‑specific, evidence‑based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi‑agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open‑sourced at https://medagentboard.netlify.app/.

PaperID: 10521, https://arxiv.org/pdf/2505.12327.pdf

Authors: Albert Zhao, Stefano Soatto

Title: Robust Planning for Autonomous Driving via Mixed Adversarial Diffusion Predictions

Abstract:
We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over‑weight adversarial behaviors while placing little to no weight on low‑cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single‑agent and multi‑agent jaywalking scenarios as well as a red light violation scenario.

PaperID: 10522, https://arxiv.org/pdf/2505.12039.pdf

Authors: Renqi Chen, Haoyang Su, Shixiang Tang, Zhenfei Yin, Qi Wu, Hui Li, Ye Sun, Nanqing Dong, Wanli Ouyang, Philip Torr

Title: AI-Driven Automation Can Become the Foundation of Next-Era Science of Science Research

Abstract:
The Science of Science (SoS) explores the mechanisms underlying scientific discovery, and offers valuable insights for enhancing scientific efficiency and fostering innovation. Traditional approaches often rely on simplistic assumptions and basic statistical tools, such as linear regression and rule‑based simulations, which struggle to capture the complexity and scale of modern research ecosystems. The advent of artificial intelligence (AI) presents a transformative opportunity for the next generation of SoS, enabling the automation of large‑scale pattern discovery and uncovering insights previously unattainable. This paper offers a forward‑looking perspective on the integration of Science of Science with AI for automated research pattern discovery and highlights key open challenges that could greatly benefit from AI. We outline the advantages of AI over traditional methods, discuss potential limitations, and propose pathways to overcome them. Additionally, we present a preliminary multi‑agent system as an illustrative example to simulate research societies, showcasing AI's ability to replicate real‑world research patterns and accelerate progress in Science of Science research.

PaperID: 10523, https://arxiv.org/pdf/2505.12006.pdf

Authors: Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim

Title: SOCIA-$\nabla$: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

Abstract:
In this paper, we present SOCIA‑\nabla, an end‑to‑end, agentic framework that treats simulator construction asinstance optimization over code within a textual computation graph. Specialized LLM‑driven agents are embedded as graph nodes, and a workflow manager executes a loss‑driven loop: code synthesis ‑> execution ‑> evaluation ‑> code repair. The optimizer performs Textual‑Gradient Descent (TGD), while human‑in‑the‑loop interaction is reserved for task‑spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA‑\nabla attains state‑of‑the‑art overall accuracy. By unifying multi‑agent orchestration with a loss‑aligned optimization view, SOCIA‑\nabla converts brittle prompt pipelines into reproducible, constraint‑aware simulator code generation that scales across domains and simulation granularities. We will release the code soon.

PaperID: 10524, https://arxiv.org/pdf/2505.12001.pdf

Authors: Ruta Binkyte

Title: Interactional Fairness in LLM Multi-Agent Systems: An Evaluation Framework

Abstract:
As large language models (LLMs) are increasingly used in multi‑agent systems, questions of fairness should extend beyond resource distribution and procedural design to include the fairness of how agents communicate. Drawing from organizational psychology, we introduce a novel framework for evaluating Interactional fairness encompassing Interpersonal fairness (IF) and Informational fairness (InfF) in LLM‑based multi‑agent systems (LLM‑MAS). We extend the theoretical grounding of Interactional Fairness to non‑sentient agents, reframing fairness as a socially interpretable signal rather than a subjective experience. We then adapt established tools from organizational justice research, including Colquitt's Organizational Justice Scale and the Critical Incident Technique, to measure fairness as a behavioral property of agent interaction. We validate our framework through a pilot study using controlled simulations of a resource negotiation task. We systematically manipulate tone, explanation quality, outcome inequality, and task framing (collaborative vs. competitive) to assess how IF influences agent behavior. Results show that tone and justification quality significantly affect acceptance decisions even when objective outcomes are held constant. In addition, the influence of IF vs. InfF varies with context. This work lays the foundation for fairness auditing and norm‑sensitive alignment in LLM‑MAS.

PaperID: 10525, https://arxiv.org/pdf/2505.11974.pdf

Authors: Chuan Huang, Ran Li, Jiachen Wang

Title: Task Scheduling in Space-Air-Ground Uniformly Integrated Networks with Ripple Effects

Abstract:
Space‑air‑ground uniformly integrated network (SAGUIN), which integrates the satellite, aerial, and terrestrial networks into a unified communication architecture, is a promising candidate technology for the next‑generation wireless systems. Transmitting on the same frequency band, higher‑layer access points (AP), e.g., satellites, provide extensive coverage; meanwhile, it may introduce significant signal propagation delays due to the relatively long distances to the ground users, which can be multiple times longer than the packet durations in task‑oriented communications. This phenomena is modeled as a new ``ripple effect'', which introduces spatiotemporally correlated interferences in SAGUIN. This paper studies the task scheduling problem in SAGUIN with ripple effect, and formulates it as a Markov decision process (MDP) to jointly minimize the age of information (AoI) at users and energy consumption at APs. The obtained MDP is challenging due to high dimensionality, partial observations, and dynamic resource constraints caused by ripple effect. To address the challenges of high dimensionality, we reformulate the original problem as a Markov game, where the complexities are managed through interactive decision‑making among APs. Meanwhile, to tackle partial observations and the dynamic resource constraints, we adopt a modified multi‑agent proximal policy optimization (MAPPO) algorithm, where the actor network filters out irrelevant input states based on AP coverage and its dimensionality can be reduced by more than an order of magnitude. Simulation results reveal that the proposed approach outperforms the benchmarks, significantly reducing users' AoI and APs' energy consumption.

PaperID: 10526, https://arxiv.org/pdf/2505.11970.pdf

Authors: An Zou, Yuankai Xu, Yinchen Ni, Jintao Chen, Yehan Ma, Jing Li, Christopher Gill, Xuan Zhang, Yier Jin

Title: A Survey of Real-time Scheduling on Accelerator-based Heterogeneous Architecture for Time Critical Applications

Abstract:
Accelerator‑based heterogeneous architectures, such as CPU‑GPU, CPU‑TPU, and CPU‑FPGA systems, are widely adopted to support the popular artificial intelligence (AI) algorithms that demand intensive computation. When deployed in real‑time applications, such as robotics and autonomous vehicles, these architectures must meet stringent timing constraints. To summarize these achievements, this article presents a comprehensive survey of real‑time scheduling techniques for accelerator‑based heterogeneous platforms. It highlights key advancements from the past ten years, showcasing how proposed solutions have evolved to address the distinct challenges and requirements of these systems. This survey begins with an overview of the hardware characteristics and common task execution models used in accelerator‑based heterogeneous systems. It then categorizes the reviewed works based on soft and hard deadline constraints. For soft real‑time approaches, we cover real‑time scheduling methods supported by hardware vendors and strategies focusing on timing‑critical scheduling, energy efficiency, and thermal‑aware scheduling. For hard real‑time approaches, we first examine support from processor vendors. We then discuss scheduling techniques that guarantee hard deadlines (with strict response time analysis). After reviewing general soft and hard real‑time scheduling methods, we explore application‑ or scenario‑driven real‑time scheduling techniques for accelerator‑enabled heterogeneous computing platforms. Finally, the article concludes with a discussion of open issues and challenges within this research area.

PaperID: 10527, https://arxiv.org/pdf/2505.11963.pdf

Authors: Luca Collini, Baleegh Ahmad, Joey Ah-kiow, Ramesh Karri

Title: MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

Abstract:
Hardware security verification is a challenging and time‑consuming task. Design engineers may use formal verification, linting, and functional simulation tests, coupled with analysis and a deep understanding of the hardware design being inspected. Large Language Models (LLMs) have been used to assist during this task, either directly or in conjunction with existing tools. We improve the state of the art by proposing MARVEL, a multi‑agent LLM framework for a unified approach to decision‑making, tool use, and reasoning. MARVEL mimics the cognitive process of a designer looking for security vulnerabilities in RTL code. It consists of a supervisor agent that devises the security policy of the system‑on‑chips (SoCs) using its security documentation. It delegates tasks to validate the security policy to individual executor agents. Each executor agent carries out its assigned task using a particular strategy. Each executor agent may use one or more tools to identify potential security bugs in the design and send the results back to the supervisor agent for further analysis and confirmation. MARVEL includes executor agents that leverage formal tools, linters, simulation tests, LLM‑based detection schemes, and static analysis‑based checks. We test our approach on a known buggy SoC based on OpenTitan from the Hack@DATE competition. We find that of the 51 issues reported by MARVEL, 19 are valid security vulnerabilities, 14 are concrete warnings, and 18 are hallucinated reports.

PaperID: 10528, https://arxiv.org/pdf/2505.11917.pdf

Authors: Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, Yang Gao

Title: OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Abstract:
General‑purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual‑system approaches, which separate high‑level reasoning from low‑level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision‑language‑action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning‑centric vision‑language data, used for co‑training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long‑horizon task planning, error detection and recovery, natural human‑robot interaction, and generalizable visual grounding, enabling the model to perform long‑horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.

PaperID: 10529, https://arxiv.org/pdf/2505.11912.pdf

Authors: Paul Saves, Nicolas Verstaevel, Benoît Gaudou

Title: Modèles de Substitution pour les Modèles à base d'Agents : Enjeux, Méthodes et Applications

Abstract:
Multi‑agent simulations enables the modeling and analyses of the dynamic behaviors and interactions of autonomous entities evolving in complex environments. Agent‑based models (ABM) are widely used to study emergent phenomena arising from local interactions. However, their high computational cost poses a significant challenge, particularly for large‑scale simulations requiring extensive parameter exploration, optimization, or uncertainty quantification. The increasing complexity of ABM limits their feasibility for real‑time decision‑making and large‑scale scenario analysis. To address these limitations, surrogate models offer an efficient alternative by learning approximations from sparse simulation data. These models provide cheap‑to‑evaluate predictions, significantly reducing computational costs while maintaining accuracy. Various machine learning techniques, including regression models, neural networks, random forests and Gaussian processes, have been applied to construct robust surrogates. Moreover, uncertainty quantification and sensitivity analysis play a crucial role in enhancing model reliability and interpretability. This article explores the motivations, methods, and applications of surrogate modeling for ABM, emphasizing the trade‑offs between accuracy, computational efficiency, and interpretability. Through a case study on a segregation model, we highlight the challenges associated with building and validating surrogate models, comparing different approaches and evaluating their performance. Finally, we discuss future perspectives on integrating surrogate models within ABM to improve scalability, explainability, and real‑time decision support across various fields such as ecology, urban planning and economics.

PaperID: 10530, https://arxiv.org/pdf/2505.11891.pdf

Authors: Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An

Title: Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Abstract:
VLM‑based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML‑structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single‑path trajectories, which stands in contrast to the inherently multi‑solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot‑based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile‑Bench‑v2. Mobile‑Bench‑v2 includes a common task split, with offline multi‑path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop‑ups and ads apps, and a contaminated split named AITZ‑Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q\&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single‑agent framework AppAgent‑v1, the multi‑agent framework Mobile‑Agent‑v2, as well as other mobile agents such as UI‑Tars and OS‑Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench‑v2.

PaperID: 10531, https://arxiv.org/pdf/2505.11838.pdf

Authors: Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath

Title: RVTBench: A Benchmark for Visual Reasoning Tasks

Abstract:
Visual reasoning, the capability to interpret visual input in response to implicit text query through multi‑step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual language reasoning problems, which can therefore accommodate multiple output formats including bounding boxes, natural language descriptions, and question‑answer pairs. Correspondingly, we identify the limitations in current benchmark construction methods that rely solely on large language models (LLMs), which inadequately capture complex spatial‑temporal relationships and multi‑step reasoning chains in video due to their reliance on token representation, resulting in benchmarks with artificially limited reasoning complexity. To address this limitation, we propose a novel automated RVT benchmark construction pipeline that leverages digital twin (DT) representations as structured intermediaries between perception and the generation of implicit text queries. Based on this method, we construct RVTBench, a RVT benchmark containing 3,896 queries of over 1.2 million tokens across four types of RVT (segmentation, grounding, VQA and summary), three reasoning categories (semantic, spatial, and temporal), and four increasing difficulty levels, derived from 200 video sequences. Finally, we propose RVTagent, an agent framework for RVT that allows for zero‑shot generalization across various types of RVT without task‑specific fine‑tuning.

PaperID: 10532, https://arxiv.org/pdf/2505.11821.pdf

Authors: Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, Mingyi Hong

Title: Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

Abstract:
This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long‑horizon, multi‑turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi‑turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn‑level reward design for multi‑turn RL algorithms and agent applications. By integrating turn‑level rewards, we extend GRPO and PPO to their respective multi‑turn variants, enabling fine‑grained credit assignment. We conduct case studies on multi‑turn reasoning‑augmented search agents, where we carefully design two types of turn‑level rewards: verifiable and LLM‑as‑judge. Our experiments on multi‑turn search tasks demonstrate that incorporating well‑designed turn‑level rewards enables RL algorithms to significantly outperform baseline methods with trajectory‑level rewards. Both training and validation reward curves illustrate that our method achieves greater stability, faster convergence, and higher accuracy. Numerical results across diverse question‑answering datasets further show that our approach consistently delivers highest answer correctness and 100% format correctness.

PaperID: 10533, https://arxiv.org/pdf/2505.11811.pdf

Authors: Taolin Zhang, Dongyang Li, Qizhou Chen, Chengyu Wang, Xiaofeng He

Title: BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering

Abstract:
Multi‑hop question answering (QA) involves finding multiple relevant passages and performing step‑by‑step reasoning to answer complex questions. Previous works on multi‑hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in‑depth analysis of public multi‑hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting‑edge methods for multi‑hop QA: Chain‑of‑Thought (CoT), Single‑step, Iterative‑step, Sub‑step, and Adaptive‑step. We find that different types of multi‑hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi‑levEL muLti‑agEnt reasoning (BELLE) framework to address multi‑hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an ''operator'' by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined ''operators'' to address the multi‑hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines in various datasets. Additionally, the model consumption of BELLE is higher cost‑effectiveness than that of single models in more complex multi‑hop QA scenarios.

PaperID: 10534, https://arxiv.org/pdf/2505.11807.pdf

Authors: Yufei Xiang, Yiqun Shen, Yeqin Zhang, Cam-Tu Nguyen

Title: Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

Abstract:
Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM‑based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline ''retrospection'' process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience‑based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.

PaperID: 10535, https://arxiv.org/pdf/2505.11765.pdf

Authors: Shijun Li, Hilaf Hasson, Joydeep Ghosh

Title: OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration

Abstract:
Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi‑Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high‑quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM‑based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM‑based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on diverse tasks against recent approaches.

PaperID: 10536, https://arxiv.org/pdf/2505.11737.pdf

Authors: Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang

Title: TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Abstract:
While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi‑step reasoning. In this paper, we propose a Token‑level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self‑assess and self‑improve their responses in mathematical reasoning. Specifically, we introduce low‑rank random weight perturbation during LLM decoding to generate predictive distributions for token‑level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.

PaperID: 10537, https://arxiv.org/pdf/2505.11642.pdf

Authors: Falong Fan, Xi Li

Title: PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning

Abstract:
Multi‑agent systems leverage advanced AI models as autonomous agents that interact, cooperate, or compete to complete complex tasks across applications such as robotics and traffic management. Despite their growing importance, safety in multi‑agent systems remains largely underexplored, with most research focusing on single AI models rather than interacting agents. This work investigates backdoor vulnerabilities in multi‑agent systems and proposes a defense mechanism based on agent interactions. By leveraging reasoning abilities, each agent evaluates responses from others to detect illogical reasoning processes, which indicate poisoned agents. Experiments on LLM‑based multi‑agent systems, including ChatGPT series and Llama 3, demonstrate the effectiveness of the proposed method, achieving high accuracy in identifying poisoned agents while minimizing false positives on clean agents. We believe this work provides insights into multi‑agent system safety and contributes to the development of robust, trustworthy AI interactions.

PaperID: 10538, https://arxiv.org/pdf/2505.11584.pdf

Authors: Manuel Cherep, Pattie Maes, Nikhil Singh

Title: LLM Agents Are Hypersensitive to Nudges

Abstract:
LLMs are being set loose in complex, real‑world environments involving sequential decision‑making and tool use. Often, this involves making choices on behalf of human users. However, not much is known about the distribution of such choices, and how susceptible they are to different choice architectures. We perform a case study with a few such LLM models on a multi‑attribute tabular decision‑making problem, under canonical nudges such as the default option, suggestions, and information highlighting, as well as additional prompting strategies. We show that, despite superficial similarities to human choice distributions, such models differ in subtle but important ways. First, they show much higher susceptibility to the nudges. Second, they diverge in points earned, being affected by factors like the idiosyncrasy of available prizes. Third, they diverge in information acquisition strategies: e.g. incurring substantial cost to reveal too much information, or selecting without revealing any. Moreover, we show that simple prompt strategies like zero‑shot chain of thought (CoT) can shift the choice distribution, and few‑shot prompting with human data can induce greater alignment. Yet, none of these methods resolve the sensitivity of these models to nudges. Finally, we show how optimal nudges optimized with a human resource‑rational model can similarly increase LLM performance for some models. All these findings suggest that behavioral tests are needed before deploying models as agents or assistants acting on behalf of users in complex environments.

PaperID: 10539, https://arxiv.org/pdf/2505.11579.pdf

Authors: Zeynep Engin, David Hand

Title: Toward Adaptive Categories: Dimensional Governance for Agentic AI

Abstract:
As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks ‑‑ based on fixed risk tiers, levels of autonomy, or human oversight models ‑‑ are increasingly insufficient on their own. Systems built on foundation models, self‑supervised learning, and multi‑agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human‑AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision‑making, building them upon dimensional foundations allows for context‑specific adaptability and stakeholder‑responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail ‑‑ and where a dimensional mindset could offer a more resilient and future‑proof path forward for both governance and innovation at the frontier of artificial intelligence.

PaperID: 10540, https://arxiv.org/pdf/2505.11556.pdf

Authors: Yuxuan Li, Aoi Naito, Hirokazu Shirado

Title: Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Abstract:
Multi‑agent systems built on large language models (LLMs) are expected to enhance decision‑making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65‑task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi‑agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry ‑‑ they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes ‑‑ and worsen as groups scale. While some models (e.g., Gemini‑2.5‑Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. We further show that this bottleneck is actionable: a lightweight structured communication protocol substantially improves collective reasoning across model families. Our results identify failures in collective information exploration in decision‑making as a key limitation of multi‑agent LLMs, and provide a theory‑grounded, reproducible framework for diagnosing collective reasoning failures.

PaperID: 10541, https://arxiv.org/pdf/2505.11554.pdf

Authors: Binqi Sun, Zhihang Wei, Andrea Bastoni, Debayan Roy, Mirco Theile, Tomasz Kloda, Rodolfo Pellizzoni, Marco Caccamo

Title: Multi-Objective Memory Bandwidth Regulation and Cache Partitioning for Multicore Real-Time Systems

Abstract:
Memory bandwidth regulation and cache partitioning are widely used techniques for achieving predictable timing in real‑time computing systems. Combined with partitioned scheduling, these methods require careful co‑allocation of tasks and resources to cores, as task execution times strongly depend on available allocated resources. To address this challenge, this paper presents a 0‑1 linear program for task‑resource co‑allocation, along with a multi‑objective heuristic designed to minimize resource usage while guaranteeing schedulability under a preemptive EDF scheduling policy. Our heuristic employs a multi‑layer framework, where an outer layer explores resource allocations using Pareto‑pruned search, and an inner layer optimizes task allocation by solving a knapsack problem using dynamic programming. To evaluate the performance of the proposed optimization algorithm, we profile real‑world benchmarks on an embedded AMD UltraScale+ ZCU102 platform, with fine‑grained resource partitioning enabled by the Jailhouse hypervisor, leveraging cache set partitioning and MemGuard for memory bandwidth regulation. Experiments based on the benchmarking results show that the proposed 0‑1 linear program outperforms existing mixed‑integer programs by finding more optimal solutions within the same time limit. Moreover, the proposed multi‑objective multi‑layer heuristic performs consistently better than the state‑of‑the‑art multi‑resource‑task co‑allocation algorithm in terms of schedulability, resource usage, number of non‑dominated solutions, and computational efficiency.

PaperID: 10542, https://arxiv.org/pdf/2505.11461.pdf

Authors: Wesley A Suttle, Vipul K Sharma, Brian M Sadler

Title: Signal attenuation enables scalable decentralized multi-agent reinforcement learning over networks

Abstract:
Multi‑agent reinforcement learning (MARL) methods typically require that agents enjoy global state observability, preventing development of decentralized algorithms and limiting scalability. Recent work has shown that, under assumptions on decaying inter‑agent influence, global observability can be replaced by local neighborhood observability at each agent, enabling decentralization and scalability. Real‑world applications enjoying such decay properties remain underexplored, however, despite the fact that signal power decay, or signal attenuation, due to path loss is an intrinsic feature of many problems in wireless communications and radar networks. In this paper, we show that signal attenuation enables decentralization in MARL by considering the illustrative special case of performing power allocation for target detection in a radar network. To achieve this, we propose two new constrained multi‑agent Markov decision process formulations of this power allocation problem, derive local neighborhood approximations for global value function and policy gradient estimates and establish corresponding error bounds, and develop decentralized saddle point policy gradient algorithms for solving the proposed problems. Our approach, though oriented towards the specific radar network problem we consider, provides a useful model for extensions to additional problems in wireless communications and radar networks.

PaperID: 10543, https://arxiv.org/pdf/2505.11401.pdf

Authors: Jing Liu, Xinxing Ren, Yanmeng Xu, Zekun Guo

Title: Can AI automatically analyze public opinion? A LLM agents-based agentic pipeline for timely public opinion analysis

Abstract:
This study proposes and implements the first LLM agents based agentic pipeline for multi task public opinion analysis. Unlike traditional methods, it offers an end‑to‑end, fully automated analytical workflow without requiring domain specific training data, manual annotation, or local deployment. The pipeline integrates advanced LLM capabilities into a low‑cost, user‑friendly framework suitable for resource constrained environments. It enables timely, integrated public opinion analysis through a single natural language query, making it accessible to non‑expert users. To validate its effectiveness, the pipeline was applied to a real world case study of the 2025 U.S. China tariff dispute, where it analyzed 1,572 Weibo posts and generated a structured, multi part analytical report. The results demonstrate some relationships between public opinion and governmental decision‑making. These contributions represent a novel advancement in applying generative AI to public governance, bridging the gap between technical sophistication and practical usability in public opinion monitoring.

PaperID: 10544, https://arxiv.org/pdf/2505.11368.pdf

Authors: Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang

Title: GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Abstract:
Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real‑world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain‑oriented agents, which rely on domain‑oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain‑oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain‑oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain‑oriented guidelines.

PaperID: 10545, https://arxiv.org/pdf/2505.11311.pdf

Authors: Ardian Selmonaj, Alessandro Antonucci, Adrian Schneider, Michael Rüegsegger, Matthias Sommer

Title: Explaining Strategic Decisions in Multi-Agent Reinforcement Learning for Aerial Combat Tactics

Abstract:
Artificial intelligence (AI) is reshaping strategic planning, with Multi‑Agent Reinforcement Learning (MARL) enabling coordination among autonomous agents in complex scenarios. However, its practical deployment in sensitive military contexts is constrained by the lack of explainability, which is an essential factor for trust, safety, and alignment with human strategies. This work reviews and assesses current advances in explainability methods for MARL with a focus on simulated air combat scenarios. We proceed by adapting various explainability techniques to different aerial combat scenarios to gain explanatory insights about the model behavior. By linking AI‑generated tactics with human‑understandable reasoning, we emphasize the need for transparency to ensure reliable deployment and meaningful human‑machine interaction. By illuminating the crucial importance of explainability in advancing MARL for operational defense, our work supports not only strategic planning but also the training of military personnel with insightful and comprehensible analyses.

PaperID: 10546, https://arxiv.org/pdf/2505.11270.pdf

Authors: Chao Zhang, Shaolei Zhang, Quehuan Liu, Sibei Chen, Tong Li, Ju Fan

Title: TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes

Abstract:
The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi‑modal data, including structured, semi‑structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they still remain inadequate for multi‑modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL‑like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open‑domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi‑modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi‑modal data in data lakes and develop an AI‑agent‑powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP‑based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose a updating mechanism by harnessing the deep research and machine unlearning techniques to refresh the data lakes and LLM knowledges, with the goal of balancing the data freshness and inference efficiency.

PaperID: 10547, https://arxiv.org/pdf/2505.11154.pdf

Authors: Zihan Wang, Rui Zhang, Yu Liu, Wenshu Fan, Wenbo Jiang, Qingchuan Zhao, Hongwei Li, Guowen Xu

Title: MPMA: Preference Manipulation Attack Against Model Context Protocol

Abstract:
Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third‑party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack (DPMA) that achieves significant effectiveness by inserting the manipulative word and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose Genetic‑based Advertising Preference Manipulation Attack (GAPMA). GAPMA employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experiment results demonstrate that GAPMA balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem.

PaperID: 10548, https://arxiv.org/pdf/2505.11100.pdf

Authors: Lang Feng, Jiahao Lin, Dong Xing, Li Zhang, De Ma, Gang Pan

Title: Bidirectional Distillation: A Mixed-Play Framework for Multi-Agent Generalizable Behaviors

Abstract:
Population‑population generalization is a challenging problem in multi‑agent reinforcement learning (MARL), particularly when agents encounter unseen co‑players. However, existing self‑play‑based methods are constrained by the limitation of inside‑space generalization. In this study, we propose Bidirectional Distillation (BiDist), a novel mixed‑play framework, to overcome this limitation in MARL. BiDist leverages knowledge distillation in two alternating directions: forward distillation, which emulates the historical policies' space and creates an implicit self‑play, and reverse distillation, which systematically drives agents towards novel distributions outside the known policy space in a non‑self‑play manner. In addition, BiDist operates as a concise and efficient solution without the need for the complex and costly storage of past policies. We provide both theoretical analysis and empirical evidence to support BiDist's effectiveness. Our results highlight its remarkable generalization ability across a variety of cooperative, competitive, and social dilemma tasks, and reveal that BiDist significantly diversifies the policy distribution space. We also present comprehensive ablation studies to reinforce BiDist's effectiveness and key success factors. Source codes are available in the supplementary material.

PaperID: 10549, https://arxiv.org/pdf/2505.11010.pdf

Authors: Jiangxu Wu, Cong Wang, TianHuang Su, Jun Yang, Haozhi Lin, Chao Zhang, Ming Peng, Kai Shi, SongPan Yang, BinQing Pan, ZiXian Li, Ni Yang, ZhenYu Yang

Title: ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Abstract:
The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single‑turn supervised fine‑tuning (SFT) data, which limits contextual coherence in multi‑turn dialogues. Existing methods for generating multi‑turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review‑Instruct, a novel framework that synthesizes multi‑turn conversations through an iterative "Ask‑Respond‑Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi‑turn dataset using the Alpaca dataset and fine‑tune the LLaMA2‑13B model. Evaluations on MT‑Bench, MMLU‑Pro, and Auto‑Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU‑Pro and 2% on MT‑Bench compared to prior state‑of‑the‑art models based on LLaMA2‑13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review‑driven, multi‑agent frameworks for generating high‑quality conversational data at scale.

PaperID: 10550, https://arxiv.org/pdf/2505.10978.pdf

Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An

Title: Group-in-Group Policy Optimization for LLM Agent Training

Abstract:
Recent advances in group‑based reinforcement learning (RL) have driven frontier large language models (LLMs) in single‑turn tasks like mathematical reasoning. However, their scalability to multi‑turn LLM agent training remains limited. Unlike static tasks, agent‑environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group‑in‑Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine‑grained credit assignment for LLM agents while preserving the appealing properties of group‑based RL: critic‑free, low memory, and stable convergence. GiGPO introduces a two‑level structure for estimating relative advantage: (i) At the episode‑level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step‑level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step‑level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool‑integrated reasoning on search‑augmented QA tasks, using Qwen2.5‑1.5B/3B/7B‑Instruct. Crucially, GiGPO delivers fine‑grained per‑step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.

PaperID: 10551, https://arxiv.org/pdf/2505.10961.pdf

Authors: Ratnadira Widyasari, Martin Weyssow, Ivana Clairine Irsan, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo

Title: Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents

Abstract:
Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom‑inspired multi‑agent framework designed to identify vulnerable code and to provide explanations. It employs four role‑specific agents, which are security researcher, code author, moderator, and review board. Using GPT‑4o as the base LLM, VulTrial almost doubles the efficacy of prior best‑performing baselines. Additionally, we show that role‑specific instruction tuning with small quantities of data significantly further boosts VulTrial's efficacy. Our extensive experiments demonstrate the efficacy of VulTrial across different LLMs, including an open‑source, in‑house‑deployable model (LLaMA‑3.1‑8B), as well as the high quality of its generated explanations and its ability to uncover multiple confirmed zero‑day vulnerabilities in the wild.

PaperID: 10552, https://arxiv.org/pdf/2505.10936.pdf

Authors: Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Linguo Xie, Haoran Zhang

Title: Cochain: Balancing Insufficient and Excessive Collaboration in LLM Agent Workflows

Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain‑of‑thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi‑agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. Single‑agent with chain‑of‑thought, due to the inherent complexity of designing cross‑domain prompts, faces collaboration challenges. Meanwhile, multi‑agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves the business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompts tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi‑agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT‑4.

PaperID: 10553, https://arxiv.org/pdf/2505.10922.pdf

Authors: Binwen Liu, Jiexi Ge, Jiamin Wang

Title: Vaiage: A Multi-Agent Solution to Personalized Travel Planning

Abstract:
Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi‑step temporal‑spatial optimization. Traditional platforms often fall short ‑ they provide static results, lack contextual adaptation, and fail to support real‑time interaction or intent refinement. Our approach, Vaiage, addresses these challenges through a graph‑structured multi‑agent framework built around large language models (LLMs) that serve as both goal‑conditioned recommenders and sequential planners. LLMs infer user intent, suggest personalized destinations and activities, and synthesize itineraries that align with contextual constraints such as budget, timing, group size, and weather. Through natural language interaction, structured tool use, and map‑based feedback loops, Vaiage enables adaptive, explainable, and end‑to‑end travel planning grounded in both symbolic reasoning and conversational understanding. To evaluate Vaiage, we conducted human‑in‑the‑loop experiments using rubric‑based GPT‑4 assessments and qualitative feedback. The full system achieved an average score of 8.5 out of 10, outperforming the no‑strategy (7.2) and no‑external‑API (6.8) variants, particularly in feasibility. Qualitative analysis indicated that agent coordination ‑ especially the Strategy and Information Agents ‑ significantly improved itinerary quality by optimizing time use and integrating real‑time context. These results demonstrate the effectiveness of combining LLM reasoning with symbolic agent coordination in open‑ended, real‑world planning tasks.

PaperID: 10554, https://arxiv.org/pdf/2505.10918.pdf

Authors: Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, Li Yi

Title: Unleashing Humanoid Reaching Potential via Real-world-Ready Skill Space

Abstract:
Humans possess a large reachable space in the 3D world, enabling interaction with objects at varying heights and distances. However, realizing such large‑space reaching on humanoids is a complex whole‑body control problem and requires the robot to master diverse skills simultaneously‑including base positioning and reorientation, height and body posture adjustments, and end‑effector pose control. Learning from scratch often leads to optimization difficulty and poor sim2real transferability. To address this challenge, we propose Real‑world‑Ready Skill Space (R2S2). Our approach begins with a carefully designed skill library consisting of real‑world‑ready primitive skills. We ensure optimal performance and robust sim2real transfer through individual skill tuning and sim2real evaluation. These skills are then ensembled into a unified latent space, serving as a structured prior that helps task execution in an efficient and sim2real transferable manner. A high‑level planner, trained to sample skills from this space, enables the robot to accomplish real‑world goal‑reaching tasks. We demonstrate zero‑shot sim2real transfer and validate R2S2 in multiple challenging goal‑reaching scenarios.

PaperID: 10555, https://arxiv.org/pdf/2505.10831.pdf

Authors: Omar Shaikh, Shardul Sapkota, Shan Rizvi, Eric Horvitz, Joon Sung Park, Diyi Yang, Michael S. Bernstein

Title: Creating General User Models from Computer Use

Abstract:
Human‑computer interaction has long imagined technology that understands us‑from our preferences and habits, to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence‑weighted propositions that capture user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they're attending from messages with a friend. Or recognize that a user is struggling with a collaborator's feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat‑based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user's behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn't think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long‑standing visions of HCI and entirely new interactive systems that anticipate user needs.

PaperID: 10556, https://arxiv.org/pdf/2505.10732.pdf

Authors: Jia Hui Chin, Pu Zhang, Yu Xin Cheong, Jonathan Pan

Title: Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment

Abstract:
In the current rapidly changing digital environment, businesses are under constant stress to ensure that their systems are secured. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, gaps are identified for cybersecurity risks mitigation. However, audits are usually manual, requiring much time and costs. This paper looks at the possibility of developing a framework to leverage Large Language Models (LLMs) as an autonomous agent to execute part of the security audit, namely with the field audit. password policy compliance for Windows operating system. Through the conduct of an exploration experiment of using GPT‑4 with Langchain, the agent executed the audit tasks by accurately flagging password policy violations and appeared to be more efficient than traditional manual audits. Despite its potential limitations in operational consistency in complex and dynamic environment, the framework suggests possibilities to extend further to real‑time threat monitoring and compliance checks.

PaperID: 10557, https://arxiv.org/pdf/2505.10681.pdf

Authors: Önder Gürcan, Vanja Falck, Markus G. Rousseau, Larissa L. Lima

Title: Towards an LLM-powered Social Digital Twinning Platform

Abstract:
We present Social Digital Twinner, an innovative social simulation tool for exploring plausible effects of what‑if scenarios in complex adaptive social systems. The architecture is composed of three seamlessly integrated parts: a data infrastructure featuring real‑world data and a multi‑dimensionally representative synthetic population of citizens, an LLM‑enabled agent‑based simulation engine, and a user interface that enable intuitive, natural language interactions with the simulation engine and the artificial agents (i.e. citizens). Social Digital Twinner facilitates real‑time engagement and empowers stakeholders to collaboratively design, test, and refine intervention measures. The approach is promoting a data‑driven and evidence‑based approach to societal problem‑solving. We demonstrate the tool's interactive capabilities by addressing the critical issue of youth school dropouts in Kragero, Norway, showcasing its ability to create and execute a dedicated social digital twin using natural language.

PaperID: 10558, https://arxiv.org/pdf/2505.10670.pdf

Authors: Jan Chojnacki

Title: Interpretable Risk Mitigation in LLM Agent Systems

Abstract:
Autonomous agents powered by large language models (LLMs) enable novel use cases in domains where responsible action is increasingly important. Yet the inherent unpredictability of LLMs raises safety concerns about agent reliability. In this work, we explore agent behaviour in a toy, game‑theoretic environment based on a variation of the Iterated Prisoner's Dilemma. We introduce a strategy‑modification method‑independent of both the game and the prompt‑by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space. Steering with the good‑faith negotiation feature lowers the average defection probability by 28 percentage points. We also identify feasible steering ranges for several open‑source LLM agents. Finally, we hypothesise that game‑theoretic evaluation of LLM agents, combined with representation‑steering alignment, can generalise to real‑world applications on end‑user devices and embodied platforms.

PaperID: 10559, https://arxiv.org/pdf/2505.10609.pdf

Authors: Ken Huang, Vineeth Sai Narajala, Idan Habler, Akram Sheriff

Title: Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery and Interoperability

Abstract:
The proliferation of AI agents requires robust mechanisms for secure discovery. This paper introduces the Agent Name Service (ANS), a novel architecture based on DNS addressing the lack of a public agent discovery framework. ANS provides a protocol‑agnostic registry infrastructure that leverages Public Key Infrastructure (PKI) certificates for verifiable agent identity and trust. The architecture features several key innovations: a formalized agent registration and renewal mechanism for lifecycle management; DNS‑inspired naming conventions with capability‑aware resolution; a modular Protocol Adapter Layer supporting diverse communication standards (A2A, MCP, ACP etc.); and precisely defined algorithms for secure resolution. We implement structured communication using JSON Schema and conduct a comprehensive threat analysis of our proposal. The result is a foundational directory service addressing the core challenges of secured discovery and interaction in multi‑agent systems, paving the way for future interoperable, trustworthy, and scalable agent ecosystems.

PaperID: 10560, https://arxiv.org/pdf/2505.10543.pdf

Authors: Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova

Title: Reasoning Capabilities of Large Language Models on Dynamic Tasks

Abstract:
Large language models excel on static benchmarks, but their ability as self‑learning agents in dynamic environments remains unclear. We evaluate three prompting strategies: self‑reflection, heuristic mutation, and planning across dynamic tasks with open‑source models. We find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high‑performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision‑making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in areas like planning and spatial coordination, suggesting that large language models still suffer fundamental shortcomings that may not be fully overcome through self‑reflective prompting alone. Reasoning is a multi‑faceted task, and while methods like Chain‑of‑thought improve multi‑step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

PaperID: 10561, https://arxiv.org/pdf/2505.10484.pdf

Authors: Andrea Baisero, Rupali Bhati, Shuo Liu, Aathira Pillai, Christopher Amato

Title: Fixing Incomplete Value Function Decomposition for Multi-Agent Reinforcement Learning

Abstract:
Value function decomposition methods for cooperative multi‑agent reinforcement learning compose joint values from individual per‑agent utilities, and train them using a joint objective. To ensure that the action selection process between individual utilities and joint values remains consistent, it is imperative for the composition to satisfy the individual‑global max (IGM) property. Although satisfying IGM itself is straightforward, most existing methods (e.g., VDN, QMIX) have limited representation capabilities and are unable to represent the full class of IGM values, and the one exception that has no such limitation (QPLEX) is unnecessarily complex. In this work, we present a simple formulation of the full class of IGM values that naturally leads to the derivation of QFIX, a novel family of value function decomposition models that expand the representation capabilities of prior models by means of a thin "fixing" layer. We derive multiple variants of QFIX, and implement three variants in two well‑known multi‑agent frameworks. We perform an empirical evaluation on multiple SMACv2 and Overcooked environments, which confirms that QFIX (i) succeeds in enhancing the performance of prior methods, (ii) learns more stably and performs better than its main competitor QPLEX, and (iii) achieves this while employing the simplest and smallest mixing models.

PaperID: 10562, https://arxiv.org/pdf/2505.10482.pdf

Authors: Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu

Title: Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Abstract:
Diffusion policies, widely adopted in decision‑making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub‑optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub‑optimal trajectories and even catastrophic failures. While reinforcement learning (RL)‑based fine‑tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise‑conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre‑sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi‑agent game scenarios. Furthermore, our experimental results show that our method is robust to the number denoising timesteps in the Diffusion Policy.

PaperID: 10563, https://arxiv.org/pdf/2505.10468.pdf

Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

Title: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges

Abstract:
This review critically distinguishes between AI Agents and Agentic AI, offering a structured, conceptual taxonomy, application mapping, and analysis of opportunities and challenges to clarify their divergent design philosophies and capabilities. We begin by outlining the search strategy and foundational definitions, characterizing AI Agents as modular systems driven and enabled by LLMs and LIMs for task‑specific automation. Generative AI is positioned as a precursor providing the foundation, with AI agents advancing through tool integration, prompt engineering, and reasoning enhancements. We then characterize Agentic AI systems, which, in contrast to AI Agents, represent a paradigm shift marked by multi‑agent collaboration, dynamic task decomposition, persistent memory, and coordinated autonomy. Through a chronological evaluation of architectural evolution, operational mechanisms, interaction styles, and autonomy levels, we present a comparative analysis across both AI agents and agentic AI paradigms. Application domains enabled by AI Agents such as customer support, scheduling, and data summarization are then contrasted with Agentic AI deployments in research automation, robotic coordination, and medical decision support. We further examine unique challenges in each paradigm including hallucination, brittleness, emergent behavior, and coordination failure, and propose targeted solutions such as ReAct loops, retrieval‑augmented generation (RAG), automation coordination layers, and causal modeling. This work aims to provide a roadmap for developing robust, scalable, and explainable AI‑driven systems.

PaperID: 10564, https://arxiv.org/pdf/2505.10387.pdf

Authors: Artem Agafonov, Konstantin Yakovlev

Title: Multi-Agent Path Finding For Large Agents Is Intractable

Abstract:
The multi‑agent path finding (MAPF) problem asks to find a set of paths on a graph such that when synchronously following these paths the agents never encounter a conflict. In the most widespread MAPF formulation, the so‑called Classical MAPF, the agents sizes are neglected and two types of conflicts are considered: occupying the same vertex or using the same edge at the same time step. Meanwhile in numerous practical applications, e.g. in robotics, taking into account the agents' sizes is vital to ensure that the MAPF solutions can be safely executed. Introducing large agents yields an additional type of conflict arising when one agent follows an edge and its body overlaps with the body of another agent that is actually not using this same edge (e.g. staying still at some distinct vertex of the graph). Until now it was not clear how harder the problem gets when such conflicts are to be considered while planning. Specifically, it was known that Classical MAPF problem on an undirected graph can be solved in polynomial time, however no complete polynomial‑time algorithm was presented to solve MAPF with large agents. In this paper we, for the first time, establish that the latter problem is NP‑hard and, thus, if P!=NP no polynomial algorithm for it can, unfortunately, be presented. Our proof is based on the prevalent in the field technique of reducing the seminal 3SAT problem (which is known to be an NP‑complete problem) to the problem at hand. In particular, for an arbitrary 3SAT formula we procedurally construct a dedicated graph with specific start and goal vertices and show that the given 3SAT formula is satisfiable iff the corresponding path finding instance has a solution.

PaperID: 10565, https://arxiv.org/pdf/2505.10296.pdf

Authors: Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato

Title: Optimizing Electric Bus Charging Scheduling with Uncertainties Using Hierarchical Deep Reinforcement Learning

Abstract:
The growing adoption of Electric Buses (EBs) represents a significant step toward sustainable development. By utilizing Internet of Things (IoT) systems, charging stations can autonomously determine charging schedules based on real‑time data. However, optimizing EB charging schedules remains a critical challenge due to uncertainties in travel time, energy consumption, and fluctuating electricity prices. Moreover, to address real‑world complexities, charging policies must make decisions efficiently across multiple time scales and remain scalable for large EB fleets. In this paper, we propose a Hierarchical Deep Reinforcement Learning (HDRL) approach that reformulates the original Markov Decision Process (MDP) into two augmented MDPs. To solve these MDPs and enable multi‑timescale decision‑making, we introduce a novel HDRL algorithm, namely Double Actor‑Critic Multi‑Agent Proximal Policy Optimization Enhancement (DAC‑MAPPO‑E). Scalability challenges of the Double Actor‑Critic (DAC) algorithm for large‑scale EB fleets are addressed through enhancements at both decision levels. At the high level, we redesign the decentralized actor network and integrate an attention mechanism to extract relevant global state information for each EB, decreasing the size of neural networks. At the low level, the Multi‑Agent Proximal Policy Optimization (MAPPO) algorithm is incorporated into the DAC framework, enabling decentralized and coordinated charging power decisions, reducing computational complexity and enhancing convergence speed. Extensive experiments with real‑world data demonstrate the superior performance and scalability of DAC‑MAPPO‑E in optimizing EB fleet charging schedules.

PaperID: 10566, https://arxiv.org/pdf/2505.10147.pdf

Authors: Yash, Nikhil Karamchandani, Avishek Ghosh

Title: Near Optimal Best Arm Identification for Clustered Bandits

Abstract:
This work investigates the problem of best arm identification for multi‑agent multi‑armed bandits. We consider N agents grouped into M clusters, where each cluster solves a stochastic bandit problem. The mapping between agents and bandits is a priori unknown. Each bandit is associated with K arms, and the goal is to identify the best arm for each agent under a δ‑probably correct (δ‑PC) framework, while minimizing sample complexity and communication overhead. We propose two novel algorithms: Clustering then Best Arm Identification (Cl‑BAI) and Best Arm Identification then Clustering (BAI‑Cl). Cl‑BAI uses a two‑phase approach that first clusters agents based on the bandit problems they are learning, followed by identifying the best arm for each cluster. BAI‑Cl reverses the sequence by identifying the best arms first and then clustering agents accordingly. Both algorithms leverage the successive elimination framework to ensure computational efficiency and high accuracy. We establish δ‑PC guarantees for both methods, derive bounds on their sample complexity, and provide a lower bound for this problem class. Moreover, when M is small (a constant), we show that the sample complexity of a variant of BAI‑Cl is minimax optimal in an order‑wise sense. Experiments on synthetic and real‑world datasets (MovieLens, Yelp) demonstrate the superior performance of the proposed algorithms in terms of sample and communication efficiency, particularly in settings where M \ll N.

PaperID: 10567, https://arxiv.org/pdf/2505.09970.pdf

Authors: Mrinal Rawat, Ambuje Gupta, Rushil Goomer, Alessandro Di Bari, Neha Gupta, Roberto Pieraccini

Title: Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents

Abstract:
The ReAct (Reasoning + Action) capability in large language models (LLMs) has become the foundation of modern agentic systems. Recent LLMs, such as DeepSeek‑R1 and OpenAI o1/o3, exemplify this by emphasizing reasoning through the generation of ample intermediate tokens, which help build a strong premise before producing the final output tokens. In this paper, we introduce Pre‑Act, a novel approach that enhances the agent's performance by creating a multi‑step execution plan along with the detailed reasoning for the given user input. This plan incrementally incorporates previous steps and tool outputs, refining itself after each step execution until the final response is obtained. Our approach is applicable to both conversational and non‑conversational agents. To measure the performance of task‑oriented agents comprehensively, we propose a two‑level evaluation framework: (1) turn level and (2) end‑to‑end. Our turn‑level evaluation, averaged across five models, shows that our approach, Pre‑Act, outperforms ReAct by 70% in Action Recall on the Almita dataset. While this approach is effective for larger models, smaller models crucial for practical applications, where latency and cost are key constraints, often struggle with complex reasoning tasks required for agentic systems. To address this limitation, we fine‑tune relatively small models such as Llama 3.1 (8B & 70B) using the proposed Pre‑Act approach. Our experiments show that the fine‑tuned 70B model outperforms GPT‑4, achieving a 69.5% improvement in action accuracy (turn‑level) and a 28% improvement in goal completion rate (end‑to‑end) on the Almita (out‑of‑domain) dataset.

PaperID: 10568, https://arxiv.org/pdf/2505.09936.pdf

Authors: Chenglong Wang, Yuhao Kang, Zhaoya Gong, Pengjun Zhao, Yu Feng, Wenjia Zhang, Ge Li

Title: CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

Abstract:
The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi‑agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector‑based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling‑namely, map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.

PaperID: 10569, https://arxiv.org/pdf/2505.09897.pdf

Authors: Layan Badran, Kiarash Aryankia, Rastko R. Selmic

Title: Stability and Convergence Analysis of Multi-Agent Consensus with Communication Delays: A Lambert W Function Approach

Abstract:
This paper investigates the effect of constant time delay in weakly connected multi‑agent systems modeled by double integrator dynamics. A novel analytical approach is proposed to establish an upper bound on the permissible time delay that ensures stability and consensus convergence. The analysis employs the Lambert W function method in higher‑dimensional systems to derive explicit conditions under which consensus is achieved. The theoretical results are rigorously proven and provide insight into the allowable delay margins. The analysis applies to general leaderless undirected network topologies. The framework also accounts for complex and realistic delays, including non‑commensurate communication delays. Numerical examples are provided to demonstrate the effectiveness of the proposed method.

PaperID: 10570, https://arxiv.org/pdf/2505.09841.pdf

Authors: Brooks A. Butler, Magnus Egerstedt

Title: Hamilton's Rule for Enabling Altruism in Multi-Agent Systems

Abstract:
This paper explores the application of Hamilton's rule to altruistic decision‑making in multi‑agent systems. Inspired by biological altruism, we introduce a framework that evaluates when individual agents should incur costs to benefit their neighbors. By adapting Hamilton's rule, we define agent ``fitness" in terms of task productivity rather than genetic survival. We formalize altruistic decision‑making through a graph‑based model of multi‑agent interactions and propose a solution using collaborative control Lyapunov functions. The approach ensures that altruistic behaviors contribute to the collective goal‑reaching efficiency of the system. We illustrate this framework on a multi‑agent way‑point navigation problem, where we show through simulation how agent importance levels influence altruistic decision‑making, leading to improved coordination in navigation tasks.

PaperID: 10571, https://arxiv.org/pdf/2505.09837.pdf

Authors: Emre Girgin, Arda Taha Candan, Coşkun Anıl Zaman

Title: EdgeAI Drone for Autonomous Construction Site Demonstrator

Abstract:
The fields of autonomous systems and robotics are receiving considerable attention in civil applications such as construction, logistics, and firefighting. Nevertheless, the widespread adoption of these technologies is hindered by the necessity for robust processing units to run AI models. Edge‑AI solutions offer considerable promise, enabling low‑power, cost‑effective robotics that can automate civil services, improve safety, and enhance sustainability. This paper presents a novel Edge‑AI‑enabled drone‑based surveillance system for autonomous multi‑robot operations at construction sites. Our system integrates a lightweight MCU‑based object detection model within a custom‑built UAV platform and a 5G‑enabled multi‑agent coordination infrastructure. We specifically target the real‑time obstacle detection and dynamic path planning problem in construction environments, providing a comprehensive dataset specifically created for MCU‑based edge applications. Field experiments demonstrate practical viability and identify optimal operational parameters, highlighting our approach's scalability and computational efficiency advantages compared to existing UAV solutions. The present and future roles of autonomous vehicles on construction sites are also discussed, as well as the effectiveness of edge‑AI solutions. We share our dataset publicly at github.com/egirgin/storaige‑b950

PaperID: 10572, https://arxiv.org/pdf/2505.09787.pdf

Authors: Ziruo Yi, Ting Xiao, Mark V. Albert

Title: A Multimodal Multi-Agent Framework for Radiology Report Generation

Abstract:
Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images, with the potential to enhance clinical workflows and reduce radiologists' workload. While recent approaches leveraging multimodal large language models (MLLMs) and retrieval‑augmented generation (RAG) have achieved strong results, they continue to face challenges such as factual inconsistency, hallucination, and cross‑modal misalignment. We propose a multimodal multi‑agent framework for RRG that aligns with the stepwise clinical reasoning workflow, where task‑specific agents handle retrieval, draft generation, visual analysis, refinement, and synthesis. Experimental results demonstrate that our approach outperforms a strong baseline in both automatic metrics and LLM‑based evaluations, producing more accurate, structured, and interpretable reports. This work highlights the potential of clinically aligned multi‑agent frameworks to support explainable and trustworthy clinical AI applications.

PaperID: 10573, https://arxiv.org/pdf/2505.09756.pdf

Authors: Zhaoyang Shi

Title: Community-based Multi-Agent Reinforcement Learning with Transfer and Active Exploration

Abstract:
We propose a new framework for multi‑agent reinforcement learning (MARL), where the agents cooperate in a time‑evolving network with latent community structures and mixed memberships. Unlike traditional neighbor‑based or fixed interaction graphs, our community‑based framework captures flexible and abstract coordination patterns by allowing each agent to belong to multiple overlapping communities. Each community maintains shared policy and value functions, which are aggregated by individual agents according to personalized membership weights. We also design actor‑critic algorithms that exploit this structure: agents inherit community‑level estimates for policy updates and value learning, enabling structured information sharing without requiring access to other agents' policies. Importantly, our approach supports both transfer learning by adapting to new agents or tasks via membership estimation, and active learning by prioritizing uncertain communities during exploration. Theoretically, we establish convergence guarantees under linear function approximation for both actor and critic updates. To our knowledge, this is the first MARL framework that integrates community structure, transferability, and active learning with provable guarantees.

PaperID: 10574, https://arxiv.org/pdf/2505.09595.pdf

Authors: Abdullah Mushtaq, Imran Taj, Rafay Naeem, Ibrahim Ghaznavi, Junaid Qadir

Title: WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models

Abstract:
Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western‑centric epistemologies and socio‑cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed‑form assessments that overlook the complexity of cultural inclusivity. To address this, we introduce WorldView‑Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews. Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView‑Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free‑form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually‑Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi‑Agent System (MAS)‑Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses. Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS‑Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance. These findings highlight the potential of multiplex‑aware AI evaluation in mitigating cultural bias in LLMs, paving the way for more inclusive and ethically aligned AI systems.

PaperID: 10575, https://arxiv.org/pdf/2505.09472.pdf

Authors: Mingkai Tang, Lu Gan, Kaichen Zhang

Title: Streaming Multi-agent Pathfinding

Abstract:
The task of the multi‑agent pathfinding (MAPF) problem is to navigate a team of agents from their start point to the goal points. However, this setup is unsuitable in the assembly line scenario, which is periodic with a long working hour. To address this issue, the study formalizes the streaming MAPF (S‑MAPF) problem, which assumes that the agents in the same agent stream have a periodic start time and share the same action sequence. The proposed solution, Agent Stream Conflict‑Based Search (ASCBS), is designed to tackle this problem by incorporating a cyclic vertex/edge constraint to handle conflicts. Additionally, this work explores the potential usage of the disjoint splitting strategy within ASCBS. Experimental results indicate that ASCBS surpasses traditional MAPF solvers in terms of runtime for scenarios with prolonged working hours.

PaperID: 10576, https://arxiv.org/pdf/2505.09396.pdf

Authors: Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

Title: The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners

Abstract:
The rapid rise of large language models (LLMs) has shifted artificial intelligence (AI) research toward agentic systems, motivating the use of weaker and more flexible notions of agency. However, this shift raises key questions about the extent to which LLM‑based agents replicate human strategic reasoning, particularly in game‑theoretic settings. In this context, we examine the role of agentic sophistication in shaping artificial reasoners' performance by evaluating three agent designs: a simple game‑theoretic model, an unstructured LLM‑as‑agent model, and an LLM integrated into a traditional agentic framework. Using guessing games as a testbed, we benchmarked these agents against human participants across general reasoning patterns and individual role‑based objectives. Furthermore, we introduced obfuscated game scenarios to assess agents' ability to generalise beyond training distributions. Our analysis, covering over 2000 reasoning samples across 25 agent configurations, shows that human‑inspired cognitive structures can enhance LLM agents' alignment with human strategic behaviour. Still, the relationship between agentic design complexity and human‑likeness is non‑linear, highlighting a critical dependence on underlying LLM capabilities and suggesting limits to simple architectural augmentation.

PaperID: 10577, https://arxiv.org/pdf/2505.09388.pdf

Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu

Title: Qwen3 Technical Report

Abstract:
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture‑of‑Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi‑step reasoning) and non‑thinking mode (for rapid, context‑driven responses) into a unified framework. This eliminates the need to switch between different models‑‑such as chat‑optimized models (e.g., GPT‑4o) and dedicated reasoning models (e.g., QwQ‑32B)‑‑and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller‑scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state‑of‑the‑art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross‑lingual understanding and generation capabilities. To facilitate reproducibility and community‑driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

PaperID: 10578, https://arxiv.org/pdf/2505.09289.pdf

Authors: Pedro M. P. Curvo, Mara Dragomir, Salvador Torpes, Mohammadmahdi Rahimi

Title: Reproducibility Study of "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents"

Abstract:
This study evaluates and extends the findings made by Piatti et al., who introduced GovSim, a simulation framework designed to assess the cooperative decision‑making capabilities of large language models (LLMs) in resource‑sharing scenarios. By replicating key experiments, we validate claims regarding the performance of large models, such as GPT‑4‑turbo, compared to smaller models. The impact of the universalization principle is also examined, with results showing that large models can achieve sustainable cooperation, with or without the principle, while smaller models fail without it. In addition, we provide multiple extensions to explore the applicability of the framework to new settings. We evaluate additional models, such as DeepSeek‑V3 and GPT‑4o‑mini, to test whether cooperative behavior generalizes across different architectures and model sizes. Furthermore, we introduce new settings: we create a heterogeneous multi‑agent environment, study a scenario using Japanese instructions, and explore an "inverse environment" where agents must cooperate to mitigate harmful resource distributions. Our results confirm that the benchmark can be applied to new models, scenarios, and languages, offering valuable insights into the adaptability of LLMs in complex cooperative tasks. Moreover, the experiment involving heterogeneous multi‑agent systems demonstrates that high‑performing models can influence lower‑performing ones to adopt similar behaviors. This finding has significant implications for other agent‑based applications, potentially enabling more efficient use of computational resources and contributing to the development of more effective cooperative AI systems.

PaperID: 10579, https://arxiv.org/pdf/2505.09255.pdf

Authors: Wenjie Liu, Yifei Li, Jian Sun, Gang Wang, Keyou You, Lihua Xie, Jie Chen

Title: Data-driven Internal Model Control for Output Regulation

Abstract:
Output regulation is a fundamental problem in control theory, extensively studied since the 1970s. Traditionally, research has primarily addressed scenarios where the system model is explicitly known, leaving the problem in the absence of a system model less explored. Leveraging the recent advancements in Willems et al.'s fundamental lemma, data‑driven control has emerged as a powerful tool for stabilizing unknown systems. This paper tackles the output regulation problem for unknown single and multi‑agent systems (MASs) using noisy data. Previous approaches have attempted to solve data‑based output regulation equations (OREs), which are inadequate for achieving zero tracking error with noisy data. To circumvent the need for solving data‑based OREs, we propose an internal model‑based data‑driven controller that reformulates the output regulation problem into a stabilization problem. This method is first applied to linear time‑invariant (LTI) systems, demonstrating exact solution capabilities, i.e., zero tracking error, through solving a straightforward data‑based linear matrix inequality (LMI). Furthermore, we extend our approach to solve the kth‑order output regulation problem for nonlinear systems. Extensions to both linear and nonlinear MASs are discussed. Finally, numerical tests validate the effectiveness and correctness of the proposed controllers.

PaperID: 10580, https://arxiv.org/pdf/2505.09081.pdf

Authors: Gaurav Koley

Title: SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation

Abstract:
Contemporary approaches to agent‑based modeling (ABM) of social systems have traditionally emphasized rule‑based behaviors, limiting their ability to capture nuanced dynamics by moving beyond predefined rules and leveraging contextual understanding from LMs of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating language models (LMs) into social network simulation that achieves unprecedented temporal stability in multi‑agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention‑based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub‑linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM‑based framework capable of modeling long‑term social phenomena while maintaining empirically validated behavioral fidelity.

PaperID: 10581, https://arxiv.org/pdf/2505.08995.pdf

Authors: Ardian Selmonaj, Oleg Szehr, Giacomo Del Rio, Alessandro Antonucci, Adrian Schneider, Michael Rüegsegger

Title: Enhancing Aerial Combat Tactics through Hierarchical Multi-Agent Reinforcement Learning

Abstract:
This work presents a Hierarchical Multi‑Agent Reinforcement Learning framework for analyzing simulated air combat scenarios involving heterogeneous agents. The objective is to identify effective Courses of Action that lead to mission success within preset simulations, thereby enabling the exploration of real‑world defense scenarios at low cost and in a safe‑to‑fail setting. Applying deep Reinforcement Learning in this context poses specific challenges, such as complex flight dynamics, the exponential size of the state and action spaces in multi‑agent systems, and the capability to integrate real‑time control of individual units with look‑ahead planning. To address these challenges, the decision‑making process is split into two levels of abstraction: low‑level policies control individual units, while a high‑level commander policy issues macro commands aligned with the overall mission targets. This hierarchical structure facilitates the training process by exploiting policy symmetries of individual agents and by separating control from command tasks. The low‑level policies are trained for individual combat control in a curriculum of increasing complexity. The high‑level commander is then trained on mission targets given pre‑trained control policies. The empirical validation confirms the advantages of the proposed framework.

PaperID: 10582, https://arxiv.org/pdf/2505.08844.pdf

Authors: Jiawen Chen, Jianghao Zhang, Huaxiu Yao, Yun Li

Title: CellTypeAgent: Trustworthy cell type annotation with Large Language Models

Abstract:
Cell type annotation is a critical yet laborious step in single‑cell RNA sequencing analysis. We present a trustworthy large language model (LLM)‑agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell types from 36 tissues. This combined approach holds promise for more efficient and reliable cell type annotation.

PaperID: 10583, https://arxiv.org/pdf/2505.08825.pdf

Authors: Pedro Antonio Alarcon Granadeno, Theodore Chambers, Jane Cleland-Huang

Title: Multi-source Plume Tracing via Multi-Agent Reinforcement Learning

Abstract:
Industrial catastrophes like the Bhopal disaster (1984) and the Aliso Canyon gas leak (2015) demonstrate the urgent need for rapid and reliable plume tracing algorithms to protect public health and the environment. Traditional methods, such as gradient‑based or biologically inspired approaches, often fail in realistic, turbulent conditions. To address these challenges, we present a Multi‑Agent Reinforcement Learning (MARL) algorithm designed for localizing multiple airborne pollution sources using a swarm of small uncrewed aerial systems (sUAS). Our method models the problem as a Partially Observable Markov Game (POMG), employing a Long Short‑Term Memory (LSTM)‑based Action‑specific Double Deep Recurrent Q‑Network (ADDRQN) that uses full sequences of historical action‑observation pairs, effectively approximating latent states. Unlike prior work, we use a general‑purpose simulation environment based on the Gaussian Plume Model (GPM), incorporating realistic elements such as a three‑dimensional environment, sensor noise, multiple interacting agents, and multiple plume sources. The incorporation of action histories as part of the inputs further enhances the adaptability of our model in complex, partially observable environments. Extensive simulations show that our algorithm significantly outperforms conventional approaches. Specifically, our model allows agents to explore only 1.29% of the environment to successfully locate pollution sources.

PaperID: 10584, https://arxiv.org/pdf/2505.08803.pdf

Authors: Zizhao Hu, Mohammad Rostami, Jesse Thomason

Title: Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models

Abstract:
Recent research has highlighted the risk of generative model collapse, where performance progressively degrades when continually trained on self‑generated data. However, existing exploration on model collapse is limited to single, unimodal models, limiting our understanding in more realistic scenarios, such as diverse multi‑modal AI agents interacting autonomously through synthetic data and continually evolving. We expand the synthetic data training and model collapse study to multi‑modal vision‑language generative systems, such as vision‑language models (VLMs) and text‑to‑image diffusion models, as well as recursive generate‑train loops with multiple models. We find that model collapse, previously observed in single‑modality generative models, exhibits distinct characteristics in the multi‑modal context, such as improved vision‑language alignment and increased variance in VLM image‑captioning task. Additionally, we find that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate model collapse. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self‑improving multi‑agent AI systems and curating robust multi‑modal synthetic datasets.

PaperID: 10585, https://arxiv.org/pdf/2505.08638.pdf

Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian

Title: TRAIL: Trace Reasoning and Agentic Issue Localization

Abstract:
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain‑specific human analysis of lengthy workflow traces ‑ an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human‑annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi‑agent systems, focusing on real‑world applications such as software engineering and open‑world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini‑2.5‑pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.

PaperID: 10586, https://arxiv.org/pdf/2505.08630.pdf

Authors: Shuai Han, Mehdi Dastani, Shihan Wang

Title: Credit Assignment and Efficient Exploration based on Influence Scope in Multi-agent Reinforcement Learning

Abstract:
Training cooperative agents in sparse‑reward scenarios poses significant challenges for multi‑agent reinforcement learning (MARL). Without clear feedback on actions at each step in sparse‑reward setting, previous methods struggle with precise credit assignment among agents and effective exploration. In this paper, we introduce a novel method to deal with both credit assignment and exploration problems in reward‑sparse domains. Accordingly, we propose an algorithm that calculates the Influence Scope of Agents (ISA) on states by taking specific value of the dimensions/attributes of states that can be influenced by individual agents. The mutual dependence between agents' actions and state attributes are then used to calculate the credit assignment and to delimit the exploration space for each individual agent. We then evaluate ISA in a variety of sparse‑reward multi‑agent scenarios. The results show that our method significantly outperforms the state‑of‑art baselines.

PaperID: 10587, https://arxiv.org/pdf/2505.08593.pdf

Authors: Yunwoo Lee, Jungwon Park

Title: MC-Swarm: Minimal-Communication Multi-Agent Trajectory Planning and Deadlock Resolution for Quadrotor Swarm

Abstract:
For effective multi‑agent trajectory planning, it is important to consider lightweight communication and its potential asynchrony. This paper presents a distributed trajectory planning algorithm for a quadrotor swarm that operates asynchronously and requires no communication except during the initial planning phase. Moreover, our algorithm guarantees no deadlock under asynchronous updates and absence of communication during flight. To effectively ensure these points, we build two main modules: coordination state updater and trajectory optimizer. The coordination state updater computes waypoints for each agent toward its goal and performs subgoal optimization while considering deadlocks, as well as safety constraints with respect to neighbor agents and obstacles. Then, the trajectory optimizer generates a trajectory that ensures collision avoidance even with the asynchronous planning updates of neighboring agents. We provide a theoretical guarantee of collision avoidance with deadlock resolution and evaluate the effectiveness of our method in complex simulation environments, including random forests and narrow‑gap mazes. Additionally, to reduce the total mission time, we design a faster coordination state update using lightweight communication. Lastly, our approach is validated through extensive simulations and real‑world experiments with cluttered environment scenarios.

PaperID: 10588, https://arxiv.org/pdf/2505.08532.pdf

Authors: Yuhan Liu, Yuxuan Liu, Xiaoqing Zhang, Xiuying Chen, Rui Yan

Title: The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News

Abstract:
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs' reasoning abilities fully. Inspired by the saying that "truth becomes clearer through debate," our study introduces a novel multi‑agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross‑examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub‑agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role‑aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.

PaperID: 10589, https://arxiv.org/pdf/2505.08448.pdf

Authors: Yanggang Xu, Jirong Zha, Weijie Hong, Xiangmin Yi, Geng Chen, Jianfeng Zheng, Chen-Chun Hsia, Xinlei Chen

Title: Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models

Abstract:
In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi‑hop networks in large‑scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision‑making. To address these issues, we propose MRLMN, a novel framework that integrates multi‑agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision‑making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high‑level decision‑making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm‑based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance over the MAPPO baseline and other comparison methods, including enhanced coverage and communication quality.

PaperID: 10590, https://arxiv.org/pdf/2505.08446.pdf

Authors: Yuhan Zhu, Haojie Liu, Jian Wang, Bing Li, Zikang Yin, Yefei Liao

Title: Agent-as-a-Service based on Agent Network

Abstract:
The rise of large model‑based AI agents has spurred interest in Multi‑Agent Systems (MAS) for their capabilities in decision‑making, collaboration, and adaptability. While the Model Context Protocol (MCP) addresses tool invocation and data exchange challenges via a unified protocol, it lacks support for organizing agent‑level collaboration. To bridge this gap, we propose Agent‑as‑a‑Service based on Agent Network (AaaS‑AN), a service‑oriented paradigm grounded in the Role‑Goal‑Process‑Service (RGPS) standard. AaaS‑AN unifies the entire agent lifecycle, including construction, integration, interoperability, and networked collaboration, through two core components: (1) a dynamic Agent Network, which models agents and agent groups as vertexes that self‑organize within the network based on task and role dependencies; (2) service‑oriented agents, incorporating service discovery, registration, and interoperability protocols. These are orchestrated by a Service Scheduler, which leverages an Execution Graph to enable distributed coordination, context tracking, and runtime task management. We validate AaaS‑AN on mathematical reasoning and application‑level code generation tasks, which outperforms state‑of‑the‑art baselines. Notably, we constructed a MAS based on AaaS‑AN containing agent groups, Robotic Process Automation (RPA) workflows, and MCP servers over 100 agent services. We also release a dataset containing 10,000 long‑horizon multi‑agent workflows to facilitate future research on long‑chain collaboration in MAS.

PaperID: 10591, https://arxiv.org/pdf/2505.08319.pdf

Authors: Egil Diau

Title: Reciprocity as the Foundational Substrate of Society: How Reciprocal Dynamics Scale into Social Systems

Abstract:
Prevailing accounts in both multi‑agent AI and the social sciences explain social structure through top‑down abstractions‑such as institutions, norms, or trust‑yet lack simulateable models of how such structures emerge from individual behavior. Ethnographic and archaeological evidence suggests that reciprocity served as the foundational mechanism of early human societies, enabling economic circulation, social cohesion, and interpersonal obligation long before the rise of formal institutions. Modern financial systems such as credit and currency can likewise be viewed as scalable extensions of reciprocity, formalizing exchange across time and anonymity. Building on this insight, we argue that reciprocity is not merely a local or primitive exchange heuristic, but the scalable substrate from which large‑scale social structures can emerge. We propose a three‑stage framework to model this emergence: reciprocal dynamics at the individual level, norm stabilization through shared expectations, and the construction of durable institutional patterns. This approach offers a cognitively minimal, behaviorally grounded foundation for simulating how large‑scale social systems can emerge from decentralized reciprocal interaction.

PaperID: 10592, https://arxiv.org/pdf/2505.08222.pdf

Authors: Matteo Gallici, Ivan Masmitja, Mario MartÃn

Title: Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

Abstract:
Autonomous vehicles (AV) offer a cost‑effective solution for scientific missions such as underwater tracking. Recently, reinforcement learning (RL) has emerged as a powerful method for controlling AVs in complex marine environments. However, scaling these techniques to a fleet‑‑essential for multi‑target tracking or targets with rapid, unpredictable motion‑‑presents significant computational challenges. Multi‑Agent Reinforcement Learning (MARL) is notoriously sample‑inefficient, and while high‑fidelity simulators like Gazebo's LRAUV provide 100x faster‑than‑real‑time single‑robot simulations, they offer no significant speedup for multi‑vehicle scenarios, making MARL training impractical. To address these limitations, we propose an iterative distillation method that transfers high‑fidelity simulations into a simplified, GPU‑accelerated environment while preserving high‑level dynamics. This approach achieves up to a 30,000x speedup over Gazebo through parallelization, enabling efficient training via end‑to‑end GPU acceleration. Additionally, we introduce a novel Transformer‑based architecture (TransfMAPPO) that learns multi‑agent policies invariant to the number of agents and targets, significantly improving sample efficiency. Following large‑scale curriculum learning conducted entirely on GPU, we perform extensive evaluations in Gazebo, demonstrating that our method maintains tracking errors below 5 meters over extended durations, even in the presence of multiple fast‑moving targets. This work bridges the gap between large‑scale MARL training and high‑fidelity deployment, providing a scalable framework for autonomous fleet control in real‑world sea missions.

PaperID: 10593, https://arxiv.org/pdf/2505.08195.pdf

Authors: Jinming Hu, Hassan Nawaz, Yi-Fan Hou, Yuting Rui, Lijie Chi, Yuxinxin Chen, Arif Ullah, Pavlo O. Dral

Title: Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

Abstract:
We have developed Aitomia ‑ a platform powered by AI to assist in performing AI‑driven atomistic and quantum chemical (QC) simulations. This evolving intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non‑experts in setting up and running atomistic simulations, analyzing simulation results, and summarizing them for the user in both textual and graphical forms. Aitomia combines LLM‑based agents with the MLatom platform to support AI‑driven atomistic simulations as well as conventional quantum‑chemical calculations, including DFT, semiempirical methods such as GFN2‑xTB, and selected high‑level wavefunction‑based methods, through interfaces to widely used programs such as Gaussian, ORCA, PySCF, and xtb, covering tasks from ground‑state and excited‑state calculations to geometry optimization, thermochemistry, and spectra simulations. The multi‑agent implementation enables autonomous execution of complex computational workflows, such as reaction enthalpy calculations. Aitomia was the first intelligent assistant publicly launched on cloud computing platforms for broad‑scope atomistic simulations (Aitomistic Lab@XMU at https://atom.xmu.edu.cn and Aitomistic Hub at https://aitomistic.xyz). Aitomia lowers the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in relevant fields.

PaperID: 10594, https://arxiv.org/pdf/2505.08054.pdf

Authors: Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy

Title: FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Abstract:
Safety alignment approaches in large language models (LLMs) often lead to the over‑refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety‑related categories. We propose a graph‑informed adversarial multi‑agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction‑tuned models and reasoning‑oriented models, as well as a human‑annotated benchmark test set. Our extensive benchmarking on 29 state‑of‑the‑art (SOTA) LLMs reveals persistent over‑refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

PaperID: 10595, https://arxiv.org/pdf/2505.08025.pdf

Authors: Hannah Lee, Zachary Serlin, James Motes, Brendan Long, Marco Morales, Nancy M. Amato

Title: PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints

Abstract:
We introduce PRISM (Pathfinding with Rapid Information Sharing using Motion Constraints), a decentralized algorithm designed to address the multi‑task multi‑agent pathfinding (MT‑MAPF) problem. PRISM enables large teams of agents to concurrently plan safe and efficient paths for multiple tasks while avoiding collisions. It employs a rapid communication strategy that uses information packets to exchange motion constraint information, enhancing cooperative pathfinding and situational awareness, even in scenarios without direct communication. We prove that PRISM resolves and avoids all deadlock scenarios when possible, a critical challenge in decentralized pathfinding. Empirically, we evaluate PRISM across five environments and 25 random scenarios, benchmarking it against the centralized Conflict‑Based Search (CBS) and the decentralized Token Passing with Task Swaps (TPTS) algorithms. PRISM demonstrates scalability and solution quality, supporting 3.4 times more agents than CBS and handling up to 2.5 times more tasks in narrow passage environments than TPTS. Additionally, PRISM matches CBS in solution quality while achieving faster computation times, even under low‑connectivity conditions. Its decentralized design reduces the computational burden on individual agents, making it scalable for large environments. These results confirm PRISM's robustness, scalability, and effectiveness in complex and dynamic pathfinding scenarios.

PaperID: 10596, https://arxiv.org/pdf/2505.07911.pdf

Authors: Chengmin Zhou, Ville Kyrki, Pasi Fränti, Laura Ruotsalainen

Title: Combining Bayesian Inference and Reinforcement Learning for Agent Decision Making: A Review

Abstract:
Bayesian inference has many advantages in decision making of agents (e.g. robotics/simulative agent) over a regular data‑driven black‑box neural network: Data‑efficiency, generalization, interpretability, and safety where these advantages benefit directly/indirectly from the uncertainty quantification of Bayesian inference. However, there are few comprehensive reviews to summarize the progress of Bayesian inference on reinforcement learning (RL) for decision making to give researchers a systematic understanding. This paper focuses on combining Bayesian inference with RL that nowadays is an important approach in agent decision making. To be exact, this paper discusses the following five topics: 1) Bayesian methods that have potential for agent decision making. First basic Bayesian methods and models (Bayesian rule, Bayesian learning, and Bayesian conjugate models) are discussed followed by variational inference, Bayesian optimization, Bayesian deep learning, Bayesian active learning, Bayesian generative models, Bayesian meta‑learning, and lifelong Bayesian learning. 2) Classical combinations of Bayesian methods with model‑based RL (with approximation methods), model‑free RL, and inverse RL. 3) Latest combinations of potential Bayesian methods with RL. 4) Analytical comparisons of methods that combine Bayesian methods with RL with respect to data‑efficiency, generalization, interpretability, and safety. 5) In‑depth discussions in six complex problem variants of RL, including unknown reward, partial‑observability, multi‑agent, multi‑task, non‑linear non‑Gaussian, and hierarchical RL problems and the summary of how Bayesian methods work in the data collection, data processing and policy learning stages of RL to pave the way for better agent decision‑making strategies.

PaperID: 10597, https://arxiv.org/pdf/2505.07854.pdf

Authors: Yufei Lin, Chengwei Ye, Huanzhen Zhang, Kangsheng Wang, Linuo Xu, Shuyan Liu, Zeyu Zhang

Title: CCL: Collaborative Curriculum Learning for Sparse-Reward Multi-Agent Reinforcement Learning via Co-evolutionary Task Evolution

Abstract:
Sparse reward environments pose significant challenges in reinforcement learning, especially within multi‑agent systems (MAS) where feedback is delayed and shared across agents, leading to suboptimal learning. We propose Collaborative Multi‑dimensional Course Learning (CCL), a novel curriculum learning framework that addresses this by (1) refining intermediate tasks for individual agents, (2) using a variational evolutionary algorithm to generate informative subtasks, and (3) co‑evolving agents with their environment to enhance training stability. Experiments on five cooperative tasks in the MPE and Hide‑and‑Seek environments show that CCL outperforms existing methods in sparse reward settings.

PaperID: 10598, https://arxiv.org/pdf/2505.07849.pdf

Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty

Title: SweRank: Software Issue Localization with Code Ranking

Abstract:
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time‑consuming aspect of software development. While recent LLM‑based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi‑step reasoning and relying on closed‑source LLMs. Alternatively, traditional code ranking models, typically optimized for query‑to‑code or code‑to‑code retrieval, struggle with the verbose and failure‑descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve‑and‑rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large‑scale dataset curated from public GitHub repositories, featuring real‑world issue descriptions paired with corresponding code modifications. Empirical results on SWE‑Bench‑Lite and LocBench show that SweRank achieves state‑of‑the‑art performance, outperforming both prior ranking models and costly agent‑based systems using closed‑source LLMs like Claude‑3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.

PaperID: 10599, https://arxiv.org/pdf/2505.07838.pdf

Authors: Muskaan Goyal, Pranav Bhasin

Title: Moving From Monolithic To Microservices Architecture for Multi-Agent Systems

Abstract:
The transition from monolithic to microservices architecture revolutionized software development by improving scalability and maintainability. This paradigm shift is now becoming relevant for complex multi‑agent systems (MAS). This review article explores the evolution from monolithic architecture to microservices architecture in the specific context of MAS. It will highlight the limitations of traditional monolithic MAS and the benefits of adopting a microservices‑based approach. The article further examines the core architectural principles and communication protocols, including Agent Communication Languages (ACLs), the Model Context Protocol (MCP), and the Application‑to‑Application (A2A) protocol. The article identifies emerging architectural patterns, design challenges, and considerations through a comparative lens of the paradigm shift.

PaperID: 10600, https://arxiv.org/pdf/2505.07827.pdf

Authors: Nikolaus Spring, Andrea Morichetta, Boris Sedlak, Schahram Dustdar

Title: MACH: Multi-Agent Coordination for RSU-centric Handovers

Abstract:
This paper introduces MACH, a novel approach for optimizing task handover in vehicular computing scenarios. To ensure fast and latency‑aware placement of tasks, the decision‑making ‑‑ where and when should tasks be offloaded ‑‑ is carried out decentralized at the Road Side Units (RSUs) who also execute the tasks. By shifting control to the network edge, MACH moves away from the traditional centralized or vehicle‑based handover method. Still, it focuses on contextual factors, such as the current RSU load and vehicle trajectories. Thus, MACH improves the overall Quality of Service (QoS) while fairly balancing computational loads between RSUs. To evaluate the effectiveness of our approach, we develop a robust simulation environment composed of real‑world traffic data, dynamic network conditions, and different infrastructure capacities. For scenarios that demand low latency and high reliability, our experimental results demonstrate how MACH significantly improves the adaptability and efficiency of vehicular computations. By decentralizing control to the network edge, MACH effectively reduces communication overhead and optimizes resource utilization, offering a robust framework for task handover management.

PaperID: 10601, https://arxiv.org/pdf/2505.07782.pdf

Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai

Title: MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Abstract:
We introduce MLE‑Dojo, a Gym‑style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single‑attempt evaluations, MLE‑Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real‑world Kaggle challenges, MLE‑Dojo covers diverse, open‑ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine‑tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real‑time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long‑horizon solutions and efficiently resolving complex errors. Furthermore, MLE‑Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model‑based agent tuning and promoting interoperability, scalability, and reproducibility. We open‑source our framework and benchmarks to foster community‑driven innovation towards next‑generation MLE agents.

PaperID: 10602, https://arxiv.org/pdf/2505.07779.pdf

Authors: Jiarui Li, Alessandro Zanardi, Gioele Zardini

Title: Multi-Agent Path Finding via Finite-Horizon Hierarchical Factorization

Abstract:
We present a novel algorithm for large‑scale Multi‑Agent Path Finding (MAPF) that enables fast, scalable planning in dynamic environments such as automated warehouses. Our approach introduces finite‑horizon hierarchical factorization, a framework that plans one step at a time in a receding‑horizon fashion. Robots first compute individual plans in parallel, and then dynamically group based on spatio‑temporal conflicts and reachability. The framework accounts for conflict resolution, and for immediate execution and concurrent planning, significantly reducing response time compared to offline algorithms. Experimental results on benchmark maps demonstrate that our method achieves up to 60% reduction in time‑to‑first‑action while consistently delivering high‑quality solutions, outperforming state‑of‑the‑art offline baselines across a range of problem sizes and planning horizons.

PaperID: 10603, https://arxiv.org/pdf/2505.07618.pdf

Authors: Ching Han Chen, Ming Fang Shiu

Title: KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation

Abstract:
This study introduces Knowledge Augmented Question Generation (KAQG), an educational assessment framework that integrates Item Response Theory, abbreviated as IRT, Bloom's Taxonomy, and knowledge graphs into a multi‑agent Retrieval‑Augmented Generation (RAG) system. The proposed approach overcomes limitations of existing methods by enabling fine‑grained control over item difficulty, psychometric calibration, and cognitive alignment. It employs multi‑graph isolation to preserve domain‑specific semantics and leverages a distributed agent architecture coordinated through Data Distribution Service, abbreviated as DDS, for scalable and fault‑tolerant operations. Each agent specializes in tasks such as retrieval, generation, or evaluation, forming a modular and traceable pipeline. Distinctively, the framework encodes semantic hierarchies, PageRank‑based concept weighting, and assessment‑theory parameters directly into the generation process, ensuring that questions are both contextually grounded and cognitively calibrated. Deployed at Taiwan's National Institute of Environmental Research, the system has demonstrated practical value by reducing manual workload, improving reliability and validity, and supporting both adaptive and standardized assessments. By integrating psychometric theory with AI‑driven retrieval and generation, this work establishes a scalable and cognitively aligned solution for education and professional certification.

PaperID: 10604, https://arxiv.org/pdf/2505.07603.pdf

Authors: Ching Han Chen, Ming Fang Shiu

Title: AgentFlow: Resilient Adaptive Cloud-Edge Framework for Multi-Agent Coordination

Abstract:
This paper presents AgentFlow, a MAS‑based framework for programmable distributed systems in heterogeneous cloud‑edge environments. It introduces logistics objects and abstract agent interfaces to enable dynamic service flows and modular orchestration. AgentFlow supports decentralized publish‑subscribe messaging and many‑to‑many service elections, enabling decision coordination without a central server. It features plug‑and‑play node discovery, flexible task reorganization, and highly adaptable fault tolerance and substitution mechanisms. AgentFlow advances scalable, real‑time coordination for resilient and autonomous mission‑critical systems.

PaperID: 10605, https://arxiv.org/pdf/2505.07532.pdf

Authors: Kajetan Rachwał, Maciej Majek, Bartłomiej Boczek, Kacper Dąbrowski, Paweł Liberadzki, Adam Dąbrowski, Maria Ganzha

Title: RAI: Flexible Agent Framework for Embodied AI

Abstract:
With an increase in the capabilities of generative language models, a growing interest in embodied AI has followed. This contribution introduces RAI ‑ a framework for creating embodied Multi Agent Systems for robotics. The proposed framework implements tools for Agents' integration with robotic stacks, Large Language Models, and simulations. It provides out‑of‑the‑box integration with state‑of‑the‑art systems like ROS 2. It also comes with dedicated mechanisms for the embodiment of Agents. These mechanisms have been tested on a physical robot, Husarion ROSBot XL, which was coupled with its digital twin, for rapid prototyping. Furthermore, these mechanisms have been deployed in two simulations: (1) robot arm manipulator and (2) tractor controller. All of these deployments have been evaluated in terms of their control capabilities, effectiveness of embodiment, and perception ability. The proposed framework has been used successfully to build systems with multiple agents. It has demonstrated effectiveness in all the aforementioned tasks. It also enabled identifying and addressing the shortcomings of the generative models used for embodied AI.

PaperID: 10606, https://arxiv.org/pdf/2505.07500.pdf

Authors: Bahram Mohammadi, Ehsan Abbasnejad, Yuankai Qi, Qi Wu, Anton Van Den Hengel, Javen Qinfeng Shi

Title: Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models

Abstract:
The remote embodied referring expression (REVERIE) task requires an agent to navigate through complex indoor environments and localize a remote object specified by high‑level instructions, such as "bring me a spoon", without pre‑exploration. Hence, an efficient navigation plan is essential for the final success. This paper proposes a novel parameter‑efficient action planner using large language models (PEAP‑LLM) to generate a single‑step instruction at each location. The proposed model consists of two modules, LLM goal planner (LGP) and LoRA action planner (LAP). Initially, LGP extracts the goal‑oriented plan from REVERIE instructions, including the target object and room. Then, LAP generates a single‑step instruction with the goal‑oriented plan, high‑level instruction, and current visual observation as input. PEAP‑LLM enables the embodied agent to interact with LAP as the path planner on the fly. A simple direct application of LLMs hardly achieves good performance. Also, existing hard‑prompt‑based methods are error‑prone in complicated scenarios and need human intervention. To address these issues and prevent the LLM from generating hallucinations and biased information, we propose a novel two‑stage method for fine‑tuning the LLM, consisting of supervised fine‑tuning (STF) and direct preference optimization (DPO). SFT improves the quality of generated instructions, while DPO utilizes environmental feedback. Experimental results show the superiority of our proposed model on REVERIE compared to the previous state‑of‑the‑art.

PaperID: 10607, https://arxiv.org/pdf/2505.07457.pdf

Authors: R. Maria del Rio-Chanona, Marco Pangallo, Cars Hommes

Title: Can Generative AI agents behave like humans? Evidence from laboratory market experiments

Abstract:
We explore the potential of Large Language Models (LLMs) to replicate human behavior in economic market experiments. Compared to previous studies, we focus on dynamic feedback between LLM agents: the decisions of each LLM impact the market price at the current step, and so affect the decisions of the other LLMs at the next step. We compare LLM behavior to market dynamics observed in laboratory settings and assess their alignment with human participants' behavior. Our findings indicate that LLMs do not adhere strictly to rational expectations, displaying instead bounded rationality, similarly to human participants. Providing a minimal context window i.e. memory of three previous time steps, combined with a high variability setting capturing response heterogeneity, allows LLMs to replicate broad trends seen in human experiments, such as the distinction between positive and negative feedback markets. However, differences remain at a granular level‑‑LLMs exhibit less heterogeneity in behavior than humans. These results suggest that LLMs hold promise as tools for simulating realistic human behavior in economic contexts, though further research is needed to refine their accuracy and increase behavioral diversity.

PaperID: 10608, https://arxiv.org/pdf/2505.07313.pdf

Authors: Baixuan Xu, Chunyang Li, Weiqi Wang, Wei Fan, Tianshi Zheng, Haochen Shi, Tao Fan, Yangqiu Song, Qiang Yang

Title: Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Abstract:
Designing effective collaboration structure for multi‑agent LLM systems to enhance collective reasoning is crucial yet remains under‑explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise‑Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity‑driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain‑contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi‑agent system with expertise specialization and study the computational trade off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi‑agent system and identifies critical architectural trade‑offs and bottlenecks for scalable multi‑agent reasoning. The code will be made available upon acceptance.

PaperID: 10609, https://arxiv.org/pdf/2505.07290.pdf

Authors: Peng Yin, Wentao Liang, Jinbo Wen, Jiawen Kang, Junlong Chen, Dusit Niyato

Title: Multi-Agent DRL for Multi-Objective Twin Migration Routing with Workload Prediction in 6G-enabled IoV

Abstract:
Sixth Generation (6G)‑enabled Internet of Vehicles (IoV) facilitates efficient data synchronization through ultra‑fast bandwidth and high‑density connectivity, enabling the emergence of Vehicle Twins (VTs). As highly accurate replicas of vehicles, VTs can support intelligent vehicular applications for occupants in 6G‑enabled IoV. Thanks to the full coverage capability of 6G, resource‑constrained vehicles can offload VTs to edge servers, such as roadside units, unmanned aerial vehicles, and satellites, utilizing their computing and storage resources for VT construction and updates. However, communication between vehicles and edge servers with limited coverage is prone to interruptions due to the dynamic mobility of vehicles. Consequently, VTs must be migrated among edge servers to maintain uninterrupted and high‑quality services for users. In this paper, we introduce a VT migration framework in 6G‑enabled IoV. Specifically, we first propose a Long Short‑Term Memory (LSTM)‑based Transformer model to accurately predict long‑term workloads of edge servers for migration decision‑making. Then, we propose a Dynamic Mask Multi‑Agent Proximal Policy Optimization (DM‑MAPPO) algorithm to identify optimal migration routes in the highly complex environment of 6G‑enabled IoV. Finally, we develop a practical platform to validate the effectiveness of the proposed scheme using real datasets. Simulation results demonstrate that the proposed DM‑MAPPO algorithm significantly reduces migration latency by 20.82% and packet loss by 75.07% compared with traditional deep reinforcement learning algorithms.

PaperID: 10610, https://arxiv.org/pdf/2505.07240.pdf

Authors: Yating Yuan, Yu Liu

Title: Continuous-Time Control Synthesis for Multiple Quadrotors under Signal Temporal Logic Specifications

Abstract:
Continuous‑time control of multiple quadrotors in constrained environments under signal temporal logic (STL) specifications is critical due to their nonlinear dynamics, safety constraints, and the requirement to ensure continuous‑time satisfaction of the specifications. To ensure such control, a two‑stage framework is proposed to address this challenge. First, based on geometric control, a Lyapunov‑based analysis of the rotational tracking dynamics is performed to facilitate multidimensional gain design. In addition, tracking‑error bounds for subsequent STL robustness analysis are derived. Second, using the tracking‑error bounds, a mixed‑integer convex programming (MICP)‑based planning framework with a backward‑recursive scheme is developed. The framework is used to generate reference trajectories that satisfy multi‑agent STL tasks while meeting the trajectory requirements imposed by geometric control. Numerical simulations demonstrate that, compared with uniform gains, the optimized multidimensional gains yield less conservative time‑varying bounds, mitigate oscillations, and improve transient performance, while the proposed framework ensures the satisfaction of multi‑agent STL tasks in constrained environments with provable tracking guarantees.

PaperID: 10611, https://arxiv.org/pdf/2505.07236.pdf

Authors: Oleg Sautenkov, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Faryal Batool, Jeffrin Sam, Artem Lykov, Chih-Yung Wen, Dzmitry Tsetserukou

Title: UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

Abstract:
We present UAV‑CodeAgents, a scalable multi‑agent framework for autonomous UAV mission generation, built on large language and vision‑language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high‑level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision‑grounded, pixel‑pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real‑time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV‑CodeAgents is evaluated on large‑scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine‑tune Qwen2.5VL‑7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision‑language‑based UAV planning.

PaperID: 10612, https://arxiv.org/pdf/2505.07214.pdf

Authors: Pascal Spiegler, Arash Harirpoush, Yiming Xiao

Title: Towards user-centered interactive medical image segmentation in VR with an assistive AI agent

Abstract:
Crucial in disease analysis and surgical planning, manual segmentation of volumetric medical scans (e.g. MRI, CT) is laborious, error‑prone, and challenging to master, while fully automatic algorithms can benefit from user feedback. Therefore, with the complementary power of the latest radiological AI foundation models and virtual reality (VR)'s intuitive data interaction, we propose SAMIRA, a novel conversational AI agent for medical VR that assists users with localizing, segmenting, and visualizing 3D medical concepts. Through speech‑based interaction, the agent helps users understand radiological features, locate clinical targets, and generate segmentation masks that can be refined with just a few point prompts. The system also supports true‑to‑scale 3D visualization of segmented pathology to enhance patient‑specific anatomical understanding. Furthermore, to determine the optimal interaction paradigm under near‑far attention‑switching for refining segmentation masks in an immersive, human‑in‑the‑loop workflow, we compare VR controller pointing, head pointing, and eye tracking as input modes. With a user study, evaluations demonstrated a high usability score (SUS=90.0 \pm 9.0), low overall task load, as well as strong support for the proposed VR system's guidance, training potential, and integration of AI in radiological segmentation tasks.

PaperID: 10613, https://arxiv.org/pdf/2505.07207.pdf

Authors: Chiqiang Liu, Dazi Li

Title: HYGMA: Hypergraph Coordination Networks with Dynamic Grouping for Multi-Agent Reinforcement Learning

Abstract:
Cooperative multi‑agent reinforcement learning faces significant challenges in effectively organizing agent relationships and facilitating information exchange, particularly when agents need to adapt their coordination patterns dynamically. This paper presents a novel framework that integrates dynamic spectral clustering with hypergraph neural networks to enable adaptive group formation and efficient information processing in multi‑agent systems. The proposed framework dynamically constructs and updates hypergraph structures through spectral clustering on agents' state histories, enabling higher‑order relationships to emerge naturally from agent interactions. The hypergraph structure is enhanced with attention mechanisms for selective information processing, providing an expressive and efficient way to model complex agent relationships. This architecture can be implemented in both value‑based and policy‑based paradigms through a unified objective combining task performance with structural regularization. Extensive experiments on challenging cooperative tasks demonstrate that our method significantly outperforms state‑of‑the‑art approaches in both sample efficiency and final performance.

PaperID: 10614, https://arxiv.org/pdf/2505.07087.pdf

Authors: Robert E. Wray, James R. Kirk, John E. Laird

Title: Applying Cognitive Design Patterns to General LLM Agents

Abstract:
One goal of AI (and AGI) is to identify and understand specific mechanisms and representations sufficient for general intelligence. Often, this work manifests in research focused on architectures and many cognitive architectures have been explored in AI/AGI. However, different research groups and even different research traditions have somewhat independently identified similar/common patterns of processes and representations or "cognitive design patterns" that are manifest in existing architectures. Today, AI systems exploiting large language models (LLMs) offer a relatively new combination of mechanisms and representations available for exploring the possibilities of general intelligence. This paper outlines a few recurring cognitive design patterns that have appeared in various pre‑transformer AI architectures. We then explore how these patterns are evident in systems using LLMs, especially for reasoning and interactive ("agentic") use cases. Examining and applying these recurring patterns enables predictions of gaps or deficiencies in today's Agentic LLM Systems and identification of subjects of future research towards general intelligence using generative foundation models.

PaperID: 10615, https://arxiv.org/pdf/2505.07064.pdf

Authors: Shusen Liu, Haichao Miao, Peer-Timo Bremer

Title: ParaView-MCP: An Autonomous Visualization Agent with Direct Tool Use

Abstract:
While powerful and well‑established, tools like ParaView present a steep learning curve that discourages many potential users. This work introduces ParaView‑MCP, an autonomous agent that integrates modern multimodal large language models (MLLMs) with ParaView to not only lower the barrier to entry but also augment ParaView with intelligent decision support. By leveraging the state‑of‑the‑art reasoning, command execution, and vision capabilities of MLLMs, ParaView‑MCP enables users to interact with ParaView through natural language and visual inputs. Specifically, our system adopted the Model Context Protocol (MCP) ‑ a standardized interface for model‑application communication ‑ that facilitates direct interaction between MLLMs with ParaView's Python API to allow seamless information exchange between the user, the language model, and the visualization tool itself. Furthermore, by implementing a visual feedback mechanism that allows the agent to observe the viewport, we unlock a range of new capabilities, including recreating visualizations from examples, closed‑loop visualization parameter updates based on user‑defined goals, and even cross‑application collaboration involving multiple tools. Broadly, we believe such an agent‑driven visualization paradigm can profoundly change the way we interact with visualization tools. We expect a significant uptake in the development of such visualization tools, in both visualization research and industry.

PaperID: 10616, https://arxiv.org/pdf/2505.07049.pdf

Authors: Yubo Shu, Zhewei Huang, Xin Wu, Chen Hu, Shuchang Zhou, Daxin Jiang

Title: DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs

Abstract:
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue‑style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL‑based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue‑style reasoning, which often limits reasoning diversity and coherency, frequently recycling fixed strategies or exhibiting unnecessary shifts in attention. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue‑based reasoning approach. We first introduce the Compound‑QA task, which concatenates multiple problems into a single prompt to assess both diversity and coherency of reasoning. Our analysis shows that Compound‑QA exposes weaknesses in monologue reasoning, evidenced by both quantitative metrics and qualitative reasoning traces. Building on the analysis, we propose a dialogue‑based reasoning, named DialogueReason, structured around agents, environment, and interactions. Using PPO with rule‑based rewards, we train open‑source LLMs (Qwen‑QWQ and Qwen‑Base) to adopt dialogue reasoning. We evaluate trained models on MATH, AIME, and GPQA datasets, showing that the dialogue reasoning model outperforms monologue models under more complex compound questions. Additionally, we discuss how dialogue‑based reasoning helps enhance interpretability, facilitate more intuitive human interaction, and inspire advances in multi‑agent system design.

PaperID: 10617, https://arxiv.org/pdf/2505.06997.pdf

Authors: Wenhao Lu, Zhengqiu Zhu, Yong Zhao, Yonglin Tian, Junjie Zeng, Jun Zhang, Zhong Liu, Fei-Yue Wang

Title: A Multi-Agent Reinforcement Learning Approach for Cooperative Air-Ground-Human Crowdsensing in Emergency Rescue

Abstract:
Mobile crowdsensing is evolving beyond traditional human‑centric models by integrating heterogeneous entities like unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Optimizing task allocation among these diverse agents is critical, particularly in challenging emergency rescue scenarios characterized by complex environments, limited communication, and partial observability. This paper tackles the Heterogeneous‑Entity Collaborative‑Sensing Task Allocation (HECTA) problem specifically for emergency rescue, considering humans, UAVs, and UGVs. We introduce a novel ``Hard‑Cooperative'' policy where UGVs prioritize recharging low‑battery UAVs, alongside performing their sensing tasks. The primary objective is maximizing the task completion rate (TCR) under strict time constraints. We rigorously formulate this NP‑hard problem as a decentralized partially observable Markov decision process (Dec‑POMDP) to effectively handle sequential decision‑making under uncertainty. To solve this, we propose HECTA4ER, a novel multi‑agent reinforcement learning algorithm built upon a Centralized Training with Decentralized Execution architecture. HECTA4ER incorporates tailored designs, including specialized modules for complex feature extraction, utilization of action‑observation history via hidden states, and a mixing network integrating global and local information, specifically addressing the challenges of partial observability. Furthermore, theoretical analysis confirms the algorithm's convergence properties. Extensive simulations demonstrate that HECTA4ER significantly outperforms baseline algorithms, achieving an average 18.42% increase in TCR. Crucially, a real‑world case study validates the algorithm's effectiveness and robustness in dynamic sensing scenarios, highlighting its strong potential for practical application in emergency response.

PaperID: 10618, https://arxiv.org/pdf/2505.06980.pdf

Authors: Lei Wan, Prabesh Gupta, Andreas Eich, Marcel Kettelgerdes, Hannan Ejaz Keen, Michael Klöppel-Gersdorf, Alexey Vinel

Title: VALISENS: A Validated Innovative Multi-Sensor System for Cooperative Automated Driving

Abstract:
Reliable perception remains a key challenge for Connected Automated Vehicles (CAVs) in complex real‑world environments, where varying lighting conditions and adverse weather degrade sensing performance. While existing multi‑sensor solutions improve local robustness, they remain constrained by limited sensing range, line‑of‑sight occlusions, and sensor failures on individual vehicles. This paper introduces VALISENS, a validated cooperative perception system that extends multi‑sensor fusion beyond a single vehicle through Vehicle‑to‑Everything (V2X)‑enabled collaboration between Connected Automated Vehicles (CAVs) and intelligent infrastructure. VALISENS integrates onboard and roadside LiDARs, radars, RGB cameras, and thermal cameras within a unified multi‑agent perception framework. Thermal cameras enhances the detection of Vulnerable Road Users (VRUs) under challenging lighting conditions, while roadside sensors reduce occlusions and expand the effective perception range. In addition, an integrated sensor monitoring module continuously assesses sensor health and detects anomalies before system degradation occurs. The proposed system is implemented and evaluated in a dedicated real‑world testbed. Experimental results show that VALISENS improves pedestrian situational awareness by up to 18% compared with vehicle‑only sensing, while the sensor monitoring module achieves over 97% accuracy, demonstrating its effectiveness and its potential to support future Cooperative Intelligent Transport Systems (C‑ITS) applications.

PaperID: 10619, https://arxiv.org/pdf/2505.06947.pdf

Authors: Senhao Yang, Qiwen Cheng, Ruiqi Ma, Liangzhe Zhao, Zhenying Wu, Guangqiang Yu

Title: The Wisdom of Agent Crowds: A Human-AI Interaction Innovation Ignition Framework

Abstract:
With the widespread application of large AI models in various fields, the automation level of multi‑agent systems has been continuously improved. However, in high‑risk decision‑making scenarios such as healthcare and finance, human participation and the alignment of intelligent systems with human intentions remain crucial. This paper focuses on the financial scenario and constructs a multi‑agent brainstorming framework based on the BDI theory. A human‑computer collaborative multi‑agent financial analysis process is built using Streamlit. The system plans tasks according to user intentions, reduces users' cognitive load through real‑time updated structured text summaries and the interactive Cothinker module, and reasonably integrates general and reasoning large models to enhance the ability to handle complex problems. By designing a quantitative analysis algorithm for the sentiment tendency of interview content based on LLMs and a method for evaluating the diversity of ideas generated by LLMs in brainstorming based on k‑means clustering and information entropy, the system is comprehensively evaluated. The results of human factors testing show that the system performs well in terms of usability and user experience. Although there is still room for improvement, it can effectively support users in completing complex financial tasks. The research shows that the system significantly improves the efficiency of human‑computer interaction and the quality of decision‑making in financial decision‑making scenarios, providing a new direction for the development of related fields.

PaperID: 10620, https://arxiv.org/pdf/2505.06896.pdf

Authors: Xiran Zhang, Javier Conejero, Sameh Abdulah, Jorge Ejarque, Ying Sun, Rosa M. Badia, David E. Keyes, Marc G. Genton

Title: RCOMPSs: A Scalable Runtime System for R Code Execution on Manycore Systems

Abstract:
R has become a cornerstone of scientific and statistical computing due to its extensive package ecosystem, expressive syntax, and strong support for reproducible analysis. However, as data sizes and computational demands grow, native R parallelism support remains limited. This paper presents RCOMPSs, a scalable runtime system that enables efficient parallel execution of R applications on multicore and manycore systems. RCOMPSs adopts a dynamic, task‑based programming model, allowing users to write code in a sequential style, while the runtime automatically handles asynchronous task execution, dependency tracking, and scheduling across available resources. We present RCOMPSs using three representative data analysis algorithms, i.e., K‑nearest neighbors (KNN) classification, K‑means clustering, and linear regression and evaluate their performance on two modern HPC systems: KAUST Shaheen‑III and Barcelona Supercomputing Center (BSC) MareNostrum 5. Experimental results reveal that RCOMPSs demonstrates both strong and weak scalability on up to 128 cores per node and across 32 nodes. For KNN and K‑means, parallel efficiency remains above 70% in most settings, while linear regression maintains acceptable performance under shared and distributed memory configurations despite its deeper task dependencies. Overall, RCOMPSs significantly enhances the parallel capabilities of R with minimal, automated, and runtime‑aware user intervention, making it a practical solution for large‑scale data analytics in high‑performance environments.

PaperID: 10621, https://arxiv.org/pdf/2505.06821.pdf

Authors: Dipayan Saha, Hasan Al Shaikh, Shams Tarek, Farimah Farahmandi

Title: ThreatLens: LLM-guided Threat Modeling and Test Plan Generation for Hardware Security Verification

Abstract:
Current hardware security verification processes predominantly rely on manual threat modeling and test plan generation, which are labor‑intensive, error‑prone, and struggle to scale with increasing design complexity and evolving attack methodologies. To address these challenges, we propose ThreatLens, an LLM‑driven multi‑agent framework that automates security threat modeling and test plan generation for hardware security verification. ThreatLens integrates retrieval‑augmented generation (RAG) to extract relevant security knowledge, LLM‑powered reasoning for threat assessment, and interactive user feedback to ensure the generation of practical test plans. By automating these processes, the framework reduces the manual verification effort, enhances coverage, and ensures a structured, adaptable approach to security verification. We evaluated our framework on the NEORV32 SoC, demonstrating its capability to automate security verification through structured test plans and validating its effectiveness in real‑world scenarios.

PaperID: 10622, https://arxiv.org/pdf/2505.06706.pdf

Authors: Yuxuan Zheng, Yihe Zhou, Feiyang Xu, Mingli Song, Shunyu Liu

Title: Bi-level Mean Field: Dynamic Grouping for Large-Scale MARL

Abstract:
Large‑scale Multi‑Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus reducing overall complexity to pairwise interactions. However, these MF methods inevitably fail to account for individual differences, leading to aggregation noise caused by inaccurate iterative updates during MF learning. In this paper, we propose a Bi‑level Mean Field (BMF) method to capture agent diversity with dynamic grouping in large‑scale MARL, which can alleviate aggregation noise via bi‑level interaction. Specifically, BMF introduces a dynamic group assignment module, which employs a Variational AutoEncoder (VAE) to learn the representations of agents, facilitating their dynamic grouping over time. Furthermore, we propose a bi‑level interaction module to model both inter‑ and intra‑group interactions for effective neighboring aggregation. Experiments across various tasks demonstrate that the proposed BMF yields results superior to the state‑of‑the‑art methods.

PaperID: 10623, https://arxiv.org/pdf/2505.06416.pdf

Authors: Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke

Title: ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents

Abstract:
Recent advancements in Large Language Models (LLMs) and the introduction of the Model Context Protocol (MCP) have significantly expanded LLM agents' capability to interact dynamically with external tools and APIs. However, existing tool selection frameworks do not integrate MCP servers, instead relying heavily on error‑prone manual updates to monolithic local tool repositories, leading to duplication, inconsistencies, and inefficiencies. Additionally, current approaches abstract tool selection before the LLM agent is invoked, limiting its autonomy and hindering dynamic re‑querying capabilities during multi‑turn interactions. To address these issues, we introduce ScaleMCP, a novel tool selection approach that dynamically equips LLM agents with a MCP tool retriever, giving agents the autonomy to add tools into their memory, as well as an auto‑synchronizing tool storage system pipeline through CRUD (create, read, update, delete) operations with MCP servers as the single source of truth. We also propose a novel embedding strategy, Tool Document Weighted Average (TDWA), designed to selectively emphasize critical components of tool documents (e.g. tool name or synthetic questions) during the embedding process. Comprehensive evaluations conducted on a created dataset of 5,000 financial metric MCP servers, across 10 LLM models, 5 embedding models, and 5 retriever types, demonstrate substantial improvements in tool retrieval and agent invocation performance, emphasizing ScaleMCP's effectiveness in scalable, dynamic tool selection and invocation.

PaperID: 10624, https://arxiv.org/pdf/2505.06378.pdf

Authors: Yuxiang Wei, Zhuoqi Zeng, Yue Zhong, Jiawen Kang, Ryan Wen Liu, M. Shamim Hossain

Title: Bi-LSTM based Multi-Agent DRL with Computation-aware Pruning for Agent Twins Migration in Vehicular Embodied AI Networks

Abstract:
With the advancement of large language models and embodied Artificial Intelligence (AI) in the intelligent transportation scenarios, the combination of them in intelligent transportation spawns the Vehicular Embodied AI Network (VEANs). In VEANs, Autonomous Vehicles (AVs) are typical agents whose local advanced AI applications are defined as vehicular embodied AI agents, enabling capabilities such as environment perception and multi‑agent collaboration. Due to computation latency and resource constraints, the local AI applications and services running on vehicular embodied AI agents need to be migrated, and subsequently referred to as vehicular embodied AI agent twins, which drive the advancement of vehicular embodied AI networks to offload intensive tasks to Roadside Units (RSUs), mitigating latency problems while maintaining service quality. Recognizing workload imbalance among RSUs in traditional approaches, we model AV‑RSU interactions as a Stackelberg game to optimize bandwidth resource allocation for efficient migration. A Tiny Multi‑Agent Bidirectional LSTM Proximal Policy Optimization (TMABLPPO) algorithm is designed to approximate the Stackelberg equilibrium through decentralized coordination. Furthermore, a personalized neural network pruning algorithm based on Path eXclusion (PX) dynamically adapts to heterogeneous AV computation capabilities by identifying task‑critical parameters in trained models, reducing model complexity with less performance degradation. Experimental validation confirms the algorithm's effectiveness in balancing system load and minimizing delays, demonstrating significant improvements in vehicular embodied AI agent deployment.

PaperID: 10625, https://arxiv.org/pdf/2505.06364.pdf

Authors: Jayeeta Chaudhuri, Arjun Chaudhuri, Krishnendu Chakrabarty

Title: LATENT: LLM-Augmented Trojan Insertion and Evaluation Framework for Analog Netlist Topologies

Abstract:
Analog and mixed‑signal (A/MS) integrated circuits (ICs) are integral to safety‑critical applications. However, the globalization and outsourcing of A/MS ICs to untrusted third‑party foundries expose them to security threats, particularly analog Trojans. Unlike digital Trojans which have been extensively studied, analog Trojans remain largely unexplored. There has been only limited research on their diversity and stealth in analog designs, where a Trojan is activated only during a narrow input voltage range. Effective defense techniques require a clear understanding of the attack vectors; however, the lack of diverse analog Trojan instances limits robust advances in detection strategies. To address this gap, we present LATENT, the first large language model (LLM)‑driven framework for crafting stealthy, circuit‑specific analog Trojans. LATENT incorporates LLM as an autonomous agent to intelligently insert and refine Trojan components within analog designs based on iterative feedback from a detection model. This feedback loop ensures that the inserted Trojans remain stealthy while successfully evading detection. Experimental results demonstrate that our generated Trojan designs exhibit an average Trojan‑activation range of 15.74%, ensuring they remain inactive under most operating voltages, while causing a significant performance degradation of 11.3% upon activation.

PaperID: 10626, https://arxiv.org/pdf/2505.06200.pdf

Authors: Shinkyu Park, Lucas C. D. Bezerra

Title: Robust Multi-Agent Decision-Making in Finite-Population Games

Abstract:
We study the robustness of an agent decision‑making model in finite‑population games, with a particular focus on the Kullback‑Leibler Divergence Regularized Learning (KLD‑RL) model. Specifically, we examine how the model's parameters influence the impact of various sources of noise and modeling inaccuracies ‑‑ factors commonly encountered in engineering applications of population games ‑‑ on agents' decision‑making. Our analysis provides insights into how these parameters can be effectively tuned to mitigate such effects. Theoretical results are supported by numerical examples and simulation studies that validate the analysis and illustrate practical strategies for parameter selection.

PaperID: 10627, https://arxiv.org/pdf/2505.06025.pdf

Authors: Jianpeng Qi, Chao Liu, Chengxiang Xu, Rui Wang, Junyu Dong, Yanwei Yu

Title: Efficient Information Updates in Compute-First Networking via Reinforcement Learning with Joint AoI and VoI

Abstract:
Timely and efficient dissemination of service information is critical in compute‑first networking systems, where user requests arrive dynamically and computing resources are constrained. In such systems, the access point (AP) plays a key role in forwarding user requests to a server based on its latest received service information. This paper considers a single‑source, single‑destination system and introduces an Age‑and‑Value‑Aware (AVA) metric that jointly captures both the timeliness and the task relevance of service information. Unlike traditional freshness‑based metrics, AVA explicitly incorporates variations in server‑side service capacity and AP forwarding decisions, allowing more context‑aware update evaluation. Building upon AVA, we propose a reinforcement learning‑based update policy that learns to selectively transmit service information updates to the AP. It aims to maximize overall task success while minimizing unnecessary communications. Extensive simulations under diverse user request patterns and varying service capacities demonstrate that AVA reduces the update frequency by over 90% on average compared to baselines, with reductions reaching 98% in certain configurations. Crucially, this reduction is achieved without compromising the accuracy of task execution or the quality of decision making.

PaperID: 10628, https://arxiv.org/pdf/2505.05968.pdf

Authors: Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang

Title: Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition

Abstract:
Offline cooperative multi‑agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out‑of‑distribution joint action selections. In this work, we highlight that a fundamental challenge in offline MARL arises from the multi‑equilibrium nature of cooperative tasks, which induces a highly multimodal joint behavior policy space coupled with heterogeneous‑quality behavior data. This makes it difficult for individual policy regularization to align with a consistent coordination pattern, leading to the policy distribution shift problems. To tackle this challenge, we design a sequential score function decomposition method that distills per‑agent regularization signals from the joint behavior policy, which induces coordinated modality selection under decentralized execution constraints. Then we leverage a flexible diffusion‑based generative model to learn these score functions from multimodal offline data, and integrate them into joint‑action critics to guide policy updates toward high‑reward, in‑distribution regions under a shared team reward. Our approach achieves state‑of‑the‑art performance across multiple particle environments and Multi‑agent MuJoCo benchmarks consistently. To the best of our knowledge, this is the first work to explicitly address the distributional gap between offline and online MARL, paving the way for more generalizable offline policy‑based MARL methods.

PaperID: 10629, https://arxiv.org/pdf/2505.05967.pdf

Authors: Uyoata E. Uyoata, Gilberto Berardinelli, Ramoni Adeogun

Title: Learning Power Control Protocol for In-Factory 6G Subnetworks

Abstract:
In‑X Subnetworks are envisioned to meet the stringent demands of short‑range communication in diverse 6G use cases. In the context of In‑Factory scenarios, effective power control is critical to mitigating the impact of interference resulting from potentially high subnetwork density. Existing approaches to power control in this domain have predominantly emphasized the data plane, often overlooking the impact of signaling overhead. Furthermore, prior work has typically adopted a network‑centric perspective, relying on the assumption of complete and up‑to‑date channel state information (CSI) being readily available at the central controller. This paper introduces a novel multi‑agent reinforcement learning (MARL) framework designed to enable access points to autonomously learn both signaling and power control protocols in an In‑Factory Subnetwork environment. By formulating the problem as a partially observable Markov decision process (POMDP) and leveraging multi‑agent proximal policy optimization (MAPPO), the proposed approach achieves significant advantages. The simulation results demonstrate that the learning‑based method reduces signaling overhead by a factor of 8 while maintaining a buffer flush rate that lags the ideal "Genie" approach by only 5%.

PaperID: 10630, https://arxiv.org/pdf/2505.05849.pdf

Authors: Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song

Title: AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents

Abstract:
The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent‑based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black‑box fuzzing framework, AgentVigil, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high‑quality initial seed corpus, then employs a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, thereby maximizing the likelihood of uncovering agent weaknesses. We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA‑adv, where it achieves 71% and 70% success rates against agents based on o3‑mini and GPT‑4o, respectively, nearly doubling the performance of baseline attacks. Moreover, AgentVigil exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyond benchmark evaluations, we apply our attacks in real‑world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.

PaperID: 10631, https://arxiv.org/pdf/2505.05773.pdf

Authors: Nikita Boguslavskii, Lorena Maria Genua, Zhi Li

Title: Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination

Abstract:
Recently, many humanoid robots have been increasingly deployed in various facilities, including hospitals and assisted living environments, where they are often remotely controlled by human operators. Their kinematic redundancy enhances reachability and manipulability, enabling them to navigate complex, cluttered environments and perform a wide range of tasks. However, this redundancy also presents significant control challenges, particularly in coordinating the movements of the robot's macro‑micro structure (torso and arms). Therefore, we propose various human‑robot collaborative (HRC) methods for coordinating the torso and arm of remotely controlled mobile humanoid robots, aiming to balance autonomy and human input to enhance system efficiency and task execution. The proposed methods include human‑initiated approaches, where users manually control torso movements, and robot‑initiated approaches, which autonomously coordinate torso and arm based on factors such as reachability, task goal, or inferred human intent. We conducted a user study with N=17 participants to compare the proposed approaches in terms of task performance, manipulability, and energy efficiency, and analyzed which methods were preferred by participants.

PaperID: 10632, https://arxiv.org/pdf/2505.05769.pdf

Authors: Kaiming Luo

Title: Hierarchical Synchronization and Distortion Scaling in Social Media Networks: A Fractal-Like Topology Theory

Abstract:
The rapid proliferation of social media as a dominant channel for information dissemination has intensified concerns over systemic information distortion, whereby content is progressively altered through successive layers of transmission. While prior studies have explored such distortion qualitatively, the quantitative interplay between propagation topology and stochastic cognitive perturbations remains insufficiently understood. In this work, we propose a novel fractal‑inspired directed hierarchical network model to capture the structural patterns of propagation, and introduce a Noise‑Frustrated Hegselmann‑Krause (NFHK) framework to model opinion dynamics under noise. Analytical results, supported by group and graph theory, reveal that noise accumulation leads to increasing opinion distortion and the emergence of intra‑layer synchronization. Multi‑agent simulations confirm these effects, showing that noise intensity shapes both convergence rates and weak intra‑layer clustering. Empirical validation using a representative retweet cascade demonstrates that the proposed model reproduces real‑world distortion patterns and synchronization behaviors, even without direct links. This work uncovers a unified mechanism for information distortion in digital platforms and offers topology‑aware insights for public opinion governance and platform regulation.

PaperID: 10633, https://arxiv.org/pdf/2505.05762.pdf

Authors: Junhong Chen, Ziqi Yang, Haoyuan G Xu, Dandan Zhang, George Mylonas

Title: Multi-Agent Systems for Robotic Autonomy with LLMs

Abstract:
Since the advent of Large Language Models (LLMs), various research based on such models have maintained significant academic attention and impact, especially in AI and robotics. In this paper, we propose a multi‑agent framework with LLMs to construct an integrated system for robotic task analysis, mechanical design, and path generation. The framework includes three core agents: Task Analyst, Robot Designer, and Reinforcement Learning Designer. Outputs are formatted as multimodal results, such as code files or technical reports, for stronger understandability and usability. To evaluate generalizability comparatively, we conducted experiments with models from both GPT and DeepSeek. Results demonstrate that the proposed system can design feasible robots with control strategies when appropriate task inputs are provided, exhibiting substantial potential for enhancing the efficiency and accessibility of robotic system development in research and industrial applications.

PaperID: 10634, https://arxiv.org/pdf/2505.05665.pdf

Authors: Neeloy Chakraborty, John Pohovey, Melkior Ornik, Katherine Driggs-Campbell

Title: Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing

Abstract:
Large language models (LLMs) have recently demonstrated success in decision‑making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the behavior of LLM planners to varied observations is necessary to proactively avoid failures in safety‑critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing order of details, modifying access to few‑shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied show that both dimensions lead LLMs to hallucinate in a multi‑agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte‑Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime. Code is available at https://sites.google.com/illinois.edu/astllm.

PaperID: 10635, https://arxiv.org/pdf/2505.05660.pdf

Authors: Jeffrey Basoah, Daniel Chechelnitsky, Tao Long, Katharina Reinecke, Chrysoula Zerva, Kaitlyn Zhou, Mark Díaz, Maarten Sap

Title: Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs

Abstract:
As large language models (LLMs) increasingly adapt and personalize to diverse sets of users, there is an increased risk of systems appropriating sociolects, i.e., language styles or dialects that are associated with specific minoritized lived experiences (e.g., African American English, Queer slang). In this work, we examine whether sociolect usage by an LLM agent affects user reliance on its outputs and user perception (satisfaction, frustration, trust, and social presence). We designed and conducted user studies where 498 African American English (AAE) speakers and 487 Queer slang speakers performed a set of question‑answering tasks with LLM‑based suggestions in either standard American English (SAE) or their self‑identified sociolect. Our findings showed that sociolect usage by LLMs influenced both reliance and perceptions, though in some surprising ways. Results suggest that both AAE and Queer slang speakers relied more on the SAE agent, and had more positive perceptions of the SAE agent. Yet, only Queer slang speakers felt more social presence from the Queer slang agent over the SAE one, whereas only AAE speakers preferred and trusted the SAE agent over the AAE one. These findings emphasize the need to test for behavioral outcomes rather than simply assume that personalization would lead to a better and safer reliance outcome. They also highlight the nuanced dynamics of minoritized language in machine interactions, underscoring the need for LLMs to be carefully designed to respect cultural and linguistic boundaries while fostering genuine user engagement and trust.

PaperID: 10636, https://arxiv.org/pdf/2505.05262.pdf

Authors: Andreas Kontogiannis, Konstantinos Papathanasiou, Yi Shen, Giorgos Stamou, Michael M. Zavlanos, George Vouros

Title: Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration

Abstract:
Learning to cooperate in distributed partially observable environments with no communication abilities poses significant challenges for multi‑agent deep reinforcement learning (MARL). This paper addresses key concerns in this domain, focusing on inferring state representations from individual agent observations and leveraging these representations to enhance agents' exploration and collaborative task execution policies. To this end, we propose a novel state modelling framework for cooperative MARL, where agents infer meaningful belief representations of the non‑observable state, with respect to optimizing their own policies, while filtering redundant and less informative joint state information. Building upon this framework, we propose the MARL SMPE algorithm. In SMPE, agents enhance their own policy's discriminative abilities under partial observability, explicitly by incorporating their beliefs into the policy network, and implicitly by adopting an adversarial type of exploration policies which encourages agents to discover novel, high‑value states while improving the discriminative abilities of others. Experimentally, we show that SMPE outperforms state‑of‑the‑art MARL algorithms in complex fully cooperative tasks from the MPE, LBF, and RWARE benchmarks.

PaperID: 10637, https://arxiv.org/pdf/2505.05197.pdf

Authors: Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, Sébastien Krier, Manfred Diaz, Simon Osindero

Title: Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt

Abstract:
Artificial Intelligence (AI) systems are increasingly placed in positions where their decisions have real consequences, e.g., moderating online spaces, conducting research, and advising on policy. Ensuring they operate in a safe and ethically acceptable fashion is thus critical. However, most solutions have been a form of one‑size‑fits‑all "alignment". We are worried that such systems, which overlook enduring moral diversity, will spark resistance, erode trust, and destabilize our institutions. This paper traces the underlying problem to an often‑unstated Axiom of Rational Convergence: the idea that under ideal conditions, rational agents will converge in the limit of conversation on a single ethics. Treating that premise as both optional and doubtful, we propose what we call the appropriateness framework: an alternative approach grounded in conflict theory, cultural evolution, multi‑agent systems, and institutional economics. The appropriateness framework treats persistent disagreement as the normal case and designs for it by applying four principles: (1) contextual grounding, (2) community customization, (3) continual adaptation, and (4) polycentric governance. We argue here that adopting these design principles is a good way to shift the main alignment metaphor from moral unification to a more productive metaphor of conflict management, and that taking this step is both desirable and urgent.

PaperID: 10638, https://arxiv.org/pdf/2505.05108.pdf

Authors: Zhaohan Feng, Ruiqi Xue, Lei Yuan, Yang Yu, Ning Ding, Meiqin Liu, Bingzhao Gao, Jian Sun, Xinhu Zheng, Gang Wang

Title: Multi-agent Embodied AI: Advances and Future Directions

Abstract:
Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real‑world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single‑agent systems that often assume static, closed environments, whereas real‑world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real‑time learning, and collaborative problem‑solving. Despite increasing interest in multi‑agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi‑agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi‑agent embodied AI to address the challenges presented by real‑world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.

PaperID: 10639, https://arxiv.org/pdf/2505.04997.pdf

Authors: Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Zhangze Chen, Shimin Di, Shaowu Pan

Title: Foam-Agent: Towards Automated Intelligent CFD Workflows

Abstract:
Computational fluid dynamics (CFD) has been the main workhorse of computational physics. Yet its steep learning curve and fragmented, multi‑stage workflow create significant barriers. To address these challenges, we present Foam‑Agent, a multi‑agent framework leveraging large language models (LLMs) to automate the end‑to‑end CFD workflow from a single natural language prompt. Foam‑Agent orchestrates the comprehensive simulation workflow from mesh generation and high‑performance computing job scripting to post‑processing visualization. The system integrates retrieval‑augmented generation with dependency‑aware scheduling to synthesize high‑fidelity simulation configurations. Furthermore, Foam‑Agent adopts the Model Context Protocol to expose its core functions as discrete, callable tools. This allows for flexible integration and use by any other agentic systems. Evaluated on 110 simulation tasks, Foam‑Agent achieved a state‑of‑the‑art execution success rate of 88.2% without expert intervention. These results demonstrate how specialized multi‑agent systems can effectively reduce expertise barriers and streamline complex fluid simulations.

PaperID: 10640, https://arxiv.org/pdf/2505.04988.pdf

Authors: Julian Barreiro-Gomez, Tyrone E. Duncan, Bozenna Pasik-Duncan, Hamidou Tembine

Title: Semi-Explicit Solution of Some Discrete-Time Mean-Field-Type Games with Higher-Order Costs

Abstract:
Traditional solvable game theory and mean‑field‑type game theory (risk‑aware games) predominantly focus on quadratic costs due to their analytical tractability. Nevertheless, they often fail to capture critical non‑linearities inherent in real‑world systems. In this work, we present a unified framework for solving discrete‑time game problems with higher‑order state and strategy costs involving power‑law terms. We derive semi‑explicit expressions for equilibrium strategies, cost‑to‑go functions, and recursive coefficient dynamics across deterministic, stochastic, and multi‑agent system settings by convex‑completion techniques. The contributions include variance‑aware solutions under additive and multiplicative noise, extensions to mean‑field‑type‑dependent dynamics, and conditions that ensure the positivity of recursive coefficients. Our results provide a foundational methodology for analyzing non linear multi‑agent systems under higher‑order penalization, bridging classical game theory and mean‑field‑type game theory with modern computational tools for engineering applications.

PaperID: 10641, https://arxiv.org/pdf/2505.04885.pdf

Authors: Shaja Arul Selvamani, Nia D'Souza Ganapathy

Title: A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration

Abstract:
This research introduces an innovative AI‑driven multi‑agent framework specifically designed for creating immersive audiobooks. Leveraging neural text‑to‑speech synthesis with FastSpeech 2 and VALL‑E for expressive narration and character‑specific voices, the framework employs advanced language models to automatically interpret textual narratives and generate realistic spatial audio effects. These sound effects are dynamically synchronized with the storyline through sophisticated temporal integration methods, including Dynamic Time Warping (DTW) and recurrent neural networks (RNNs). Diffusion‑based generative models combined with higher‑order ambisonics (HOA) and scattering delay networks (SDN) enable highly realistic 3D soundscapes, substantially enhancing listener immersion and narrative realism. This technology significantly advances audiobook applications, providing richer experiences for educational content, storytelling platforms, and accessibility solutions for visually impaired audiences. Future work will address personalization, ethical management of synthesized voices, and integration with multi‑sensory platforms.

PaperID: 10642, https://arxiv.org/pdf/2505.04869.pdf

Authors: Jie Cao, Chloe Qianhui Zhao, Xian Chen, Shuman Wang, Christian Schunn, Kenneth R. Koedinger, Jionghao Lin

Title: From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation

Abstract:
Producing large volumes of high‑quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies‑particularly Large Language Models (LLMs)‑show great potential. However, current LLM‑based research still shows room for improvement in terms of feedback quality. Our study proposed a multi‑agent approach performing "generation, evaluation, and regeneration" (G‑E‑RG) to further enhance feedback quality. In the first‑generation phase, six methods were adopted, combining three feedback theoretical frameworks and two prompt methods: zero‑shot and retrieval‑augmented generation with chain‑of‑thought (RAG_CoT). The results indicated that, compared to first‑round feedback, G‑E‑RG significantly improved final feedback across six methods for most dimensions. Specifically:(1) Evaluation accuracy for six methods increased by 3.36% to 12.98% (p<0.001); (2) The proportion of feedback containing four effective components rose from an average of 27.72% to an average of 98.49% among six methods, sub‑dimensions of providing critiques, highlighting strengths, encouraging agency, and cultivating dialogue also showed great enhancement (p<0.001); (3) There was a significant improvement in most of the feature values (p<0.001), although some sub‑dimensions (e.g., strengthening the teacher‑student relationship) still require further enhancement; (4) The simplicity of feedback was effectively enhanced (p<0.001) for three methods.

PaperID: 10643, https://arxiv.org/pdf/2505.04843.pdf

Authors: Sebastián R. Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, Alvaro A. Cardenas

Title: Large Language Models are Autonomous Cyber Defenders

Abstract:
Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single‑agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL‑trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi‑agent scenarios or interacting with other ACD agents. In this paper, we show the first study on how LLMs perform in multi‑agent ACD environments by proposing a new integration to the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.

PaperID: 10644, https://arxiv.org/pdf/2505.04799.pdf

Authors: Jian Cui, Zichuan Li, Luyi Xing, Xiaojing Liao

Title: Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems

Abstract:
Multi‑agent collaboration systems (MACS), powered by large language models (LLMs), solve complex problems efficiently by leveraging each agent's specialization and communication between agents. However, the inherent exchange of information between agents and their interaction with external environments, such as LLM, tools, and users, inevitably introduces significant risks of sensitive data leakage, including vulnerabilities to attacks such as eavesdropping and prompt injection. Existing MACS lack fine‑grained data protection controls, making it challenging to manage sensitive information securely. In this paper, we take the first step to mitigate the MACS's data leakage threat through a privacy‑enhanced MACS development paradigm, Maris. Maris enables rigorous message flow control within MACS by embedding reference monitors into key multi‑agent conversation components. We implemented Maris as an integral part of widely‑adopted open‑source multi‑agent development frameworks, AutoGen and LangChain. To evaluate its effectiveness, we develop a Privacy Assessment Framework that emulates MACS under different threat scenarios. Our evaluation shows that Maris effectively mitigated sensitive data leakage threats across three different task suites while maintaining a high task success rate.

PaperID: 10645, https://arxiv.org/pdf/2505.04725.pdf

Authors: Robin Chhabra, Farzaneh Abdollahi

Title: Geometric Fault-Tolerant Neural Network Tracking Control of Unknown Systems on Matrix Lie Groups

Abstract:
We present a geometric neural network‑based tracking controller for systems evolving on matrix Lie groups under unknown dynamics, actuator faults, and bounded disturbances. Leveraging the left‑invariance of the tangent bundle of matrix Lie groups, viewed as an embedded submanifold of the vector space \R^N× N, we propose a set of learning rules for neural network weights that are intrinsically compatible with the Lie group structure and do not require explicit parameterization. Exploiting the geometric properties of Lie groups, this approach circumvents parameterization singularities and enables a global search for optimal weights. The ultimate boundedness of all error signals ‑‑ including the neural network weights, the coordinate‑free configuration error function, and the tracking velocity error ‑‑ is established using Lyapunov's direct method. To validate the effectiveness of the proposed method, we provide illustrative simulation results for decentralized formation control of multi‑agent systems on the Special Euclidean group.

PaperID: 10646, https://arxiv.org/pdf/2505.04651.pdf

Authors: Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, Dawei Zhou

Title: Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Abstract:
Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM‑driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi‑agent architectures. We examine techniques such as retrieval‑augmented generation, knowledge‑graph completion, simulation, causal inference, and tool‑assisted reasoning, highlighting trade‑offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in‑context learning and domain adaptation via fine‑tuning, retrieval, and symbolic grounding. For validation, we review simulation, human‑AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open‑world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG‑600. Finally, we outline a roadmap emphasizing novelty‑aware generation, multimodal‑symbolic integration, human‑in‑the‑loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.

PaperID: 10647, https://arxiv.org/pdf/2505.04565.pdf

Authors: Christoph Willibald, Dongheui Lee

Title: Hierarchical Task Decomposition for Execution Monitoring and Error Recovery: Understanding the Rationale Behind Task Demonstrations

Abstract:
Multi‑step manipulation tasks where robots interact with their environment and must apply process forces based on the perceived situation remain challenging to learn and prone to execution errors. Accurately simulating these tasks is also difficult. Hence, it is crucial for robust task performance to learn how to coordinate end‑effector pose and applied force, monitor execution, and react to deviations. To address these challenges, we propose a learning approach that directly infers both low‑ and high‑level task representations from user demonstrations on the real system. We developed an unsupervised task segmentation algorithm that combines intention recognition and feature clustering to infer the skills of a task. We leverage the inferred characteristic features of each skill in a novel unsupervised anomaly detection approach to identify deviations from the intended task execution. Together, these components form a comprehensive framework capable of incrementally learning task decisions and new behaviors as new situations arise. Compared to state‑of‑the‑art learning techniques, our approach significantly reduces the required amount of training data and computational complexity while efficiently learning complex in‑contact behaviors and recovery strategies. Our proposed task segmentation and anomaly detection approaches outperform state‑of‑the‑art methods on force‑based tasks evaluated on two different robotic systems.

PaperID: 10648, https://arxiv.org/pdf/2505.04339.pdf

Authors: Hao Peng, Xiang Huang, Shuo Sun, Ruitong Zhang, Philip S. Yu

Title: Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning

Abstract:
DBSCAN, a well‑known density‑based clustering algorithm, has gained widespread popularity and usage due to its effectiveness in identifying clusters of arbitrary shapes and handling noisy data. However, it encounters challenges in producing satisfactory cluster results when confronted with datasets of varying density scales, a common scenario in real‑world applications. In this paper, we propose a novel Adaptive and Robust DBSCAN with Multi‑agent Reinforcement Learning cluster framework, namely AR‑DBSCAN. First, we model the initial dataset as a two‑level encoding tree and categorize the data vertices into distinct density partitions according to the information uncertainty determined in the encoding tree. Each partition is then assigned to an agent to find the best clustering parameters without manual assistance. The allocation is density‑adaptive, enabling AR‑DBSCAN to effectively handle diverse density distributions within the dataset by utilizing distinct agents for different partitions. Second, a multi‑agent deep reinforcement learning guided automatic parameter searching process is designed. The process of adjusting the parameter search direction by perceiving the clustering environment is modeled as a Markov decision process. Using a weakly‑supervised reward training policy network, each agent adaptively learns the optimal clustering parameters by interacting with the clusters. Third, a recursive search mechanism adaptable to the data's scale is presented, enabling efficient and controlled exploration of large parameter spaces. Extensive experiments are conducted on nine artificial datasets and a real‑world dataset. The results of offline and online tasks show that AR‑DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.

PaperID: 10649, https://arxiv.org/pdf/2505.04317.pdf

Authors: Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang

Title: Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

Abstract:
In this paper, we tackle the problem of learning to play 3v3 multi‑drone volleyball, a new embodied competitive task that requires both high‑level strategic coordination and low‑level agile control. The task is turn‑based, multi‑agent, and physically grounded, posing significant challenges due to its long‑horizon dependencies, tight inter‑agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co‑Self‑Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high‑level strategic decision‑making from decentralized low‑level motion control. We design a three‑stage population‑based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low‑level skills, (II) learning high‑level strategy via self‑play with fixed low‑level skills, and (III) joint fine‑tuning through co‑self‑play. Experiments show that HCSP achieves superior performance, outperforming non‑hierarchical self‑play and rule‑based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two‑stage variant. Moreover, co‑self‑play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme. The project page is at https://hi‑co‑self‑play.github.io.

PaperID: 10650, https://arxiv.org/pdf/2505.04302.pdf

Authors: Zhaoqilin Yang, Chanchan Li, Xin Wang, Youliang Tian

Title: PPO-ACT: Proximal Policy Optimization with Adversarial Curriculum Transfer for Spatial Public Goods Games

Abstract:
This study investigates cooperation evolution mechanisms in the spatial public goods game. A novel deep reinforcement learning framework, Proximal Policy Optimization with Adversarial Curriculum Transfer (PPO‑ACT), is proposed to model agent strategy optimization in dynamic environments. Traditional evolutionary game models frequently exhibit limitations in modeling long‑term decision‑making processes. Deep reinforcement learning effectively addresses this limitation by bridging policy gradient methods with evolutionary game theory. Our study pioneers the application of proximal policy optimization's continuous strategy optimization capability to public goods games through a two‑stage adversarial curriculum transfer training paradigm. The experimental results show that PPO‑ACT performs better in critical enhancement factor regimes. Compared to conventional standard proximal policy optimization methods, Q‑learning and Fermi update rules, achieve earlier cooperation phase transitions and maintain stable cooperative equilibria. This framework exhibits better robustness when handling challenging scenarios like all‑defector initial conditions. Systematic comparisons reveal the unique advantage of policy gradient methods in population‑scale cooperation, i.e., achieving spatiotemporal payoff coordination through value function propagation. Our work provides a new computational framework for studying cooperation emergence in complex systems, algorithmically validating the punishment promotes cooperation hypothesis while offering methodological insights for multi‑agent system strategy design.

PaperID: 10651, https://arxiv.org/pdf/2505.04251.pdf

Authors: Krishna Ronanki

Title: Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering

Abstract:
Multi‑agent autonomous systems (MAS) are better at addressing challenges that spans across multiple domains than singular autonomous agents. This holds true within the field of software engineering (SE) as well. The state‑of‑the‑art research on MAS within SE focuses on integrating LLMs at the core of autonomous agents to create LLM‑based multi‑agent autonomous (LMA) systems. However, the introduction of LMA systems into SE brings a plethora of challenges. One of the major challenges is the strategic allocation of tasks between humans and the LMA system in a trustworthy manner. To address this challenge, a RACI‑based framework is proposed in this work in progress article, along with implementation guidelines and an example implementation of the framework. The proposed framework can facilitate efficient collaboration, ensure accountability, and mitigate potential risks associated with LLM‑driven automation while aligning with the Trustworthy AI guidelines. The future steps for this work delineating the planned empirical validation method are also presented.

PaperID: 10652, https://arxiv.org/pdf/2505.04231.pdf

Authors: Taoyuan Yu, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi

Title: Multi-Agent Reinforcement Learning-based Cooperative Autonomous Driving in Smart Intersections

Abstract:
Unsignalized intersections pose significant safety and efficiency challenges due to complex traffic flows. This paper proposes a novel roadside unit (RSU)‑centric cooperative driving system leveraging global perception and vehicle‑to‑infrastructure (V2I) communication. The core of the system is an RSU‑based decision‑making module using a two‑stage hybrid reinforcement learning (RL) framework. At first, policies are pre‑trained offline using conservative Q‑learning (CQL) combined with behavior cloning (BC) on collected dataset. Subsequently, these policies are fine‑tuned in the simulation using multi‑agent proximal policy optimization (MAPPO), aligned with a self‑attention mechanism to effectively solve inter‑agent dependencies. RSUs perform real‑time inference based on the trained models to realize vehicle control via V2I communications. Extensive experiments in CARLA environment demonstrate high effectiveness of the proposed system, by: (i) achieving failure rates below 0.03% in coordinating three connected and autonomous vehicles (CAVs) through complex intersection scenarios, significantly outperforming the traditional Autoware control method, and (ii) exhibiting strong robustness across varying numbers of controlled agents and shows promising generalization capabilities on other maps.

PaperID: 10653, https://arxiv.org/pdf/2505.04195.pdf

Authors: Minjae Seo, Wonwoo Choi, Myoungsung You, Seungwon Shin

Title: AutoPatch: Multi-Agent Framework for Patching Real-World CVE Vulnerabilities

Abstract:
Large Language Models (LLMs) have emerged as promising tools in software development, enabling automated code generation and analysis. However, their knowledge is limited to a fixed cutoff date, making them prone to generating code vulnerable to newly disclosed CVEs. Frequent fine‑tuning with new CVE sets is costly, and existing LLM‑based approaches focus on oversimplified CWE examples and require providing explicit bug locations to LLMs, limiting their ability to patch complex real‑world vulnerabilities. To address these limitations, we propose AutoPatch, a multi‑agent framework designed to patch vulnerable LLM‑generated code, particularly those introduced after the LLMs' knowledge cutoff. AutoPatch integrates Retrieval‑Augmented Generation (RAG) with a structured database of recently disclosed vulnerabilities, comprising 525 code snippets derived from 75 high‑severity CVEs across real‑world systems such as the Linux kernel and Chrome. AutoPatch combines semantic and taint analysis to identify the most relevant CVE and leverages enhanced Chain‑of‑Thought (CoT) reasoning to construct enriched prompts for verification and patching. Our unified similarity model, which selects the most relevant vulnerabilities, achieves 90.4 percent accuracy in CVE matching. AutoPatch attains 89.5 percent F1‑score for vulnerability verification and 95.0 percent accuracy in patching, while being over 50x more cost‑efficient than traditional fine‑tuning approaches.

PaperID: 10654, https://arxiv.org/pdf/2505.04040.pdf

Authors: Shamse Tasnim Cynthia

Title: Identification and Optimization of Redundant Code Using Large Language Models

Abstract:
Redundant code is a persistent challenge in software development that makes systems harder to maintain, scale, and update. It adds unnecessary complexity, hinders bug fixes, and increases technical debt. Despite their impact, removing redundant code manually is risky and error‑prone, often introducing new bugs or missing dependencies. While studies highlight the prevalence and negative impact of redundant code, little focus has been given to Artificial Intelligence (AI) system codebases and the common patterns that cause redundancy. Additionally, the reasons behind developers unintentionally introducing redundant code remain largely unexplored. This research addresses these gaps by leveraging large language models (LLMs) to automatically detect and optimize redundant code in AI projects. Our research aims to identify recurring patterns of redundancy and analyze their underlying causes, such as outdated practices or insufficient awareness of best coding principles. Additionally, we plan to propose an LLM agent that will facilitate the detection and refactoring of redundancies on a large scale while preserving original functionality. This work advances the application of AI in identifying and optimizing redundant code, ultimately helping developers maintain cleaner, more readable, and scalable codebases.

PaperID: 10655, https://arxiv.org/pdf/2505.03973.pdf

Authors: Jiale Liu, Yifan Zeng, Shaokun Zhang, Chi Zhang, Malte Højmark-Bertelsen, Marie Normann Gadeberg, Huazheng Wang, Qingyun Wu

Title: Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

Abstract:
LLM‑based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine‑Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6‑8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM‑based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.

PaperID: 10656, https://arxiv.org/pdf/2505.03949.pdf

Authors: John Christopher Tidwell, John Storm Tidwell

Title: Deep Q-Network (DQN) multi-agent reinforcement learning (MARL) for Stock Trading

Abstract:
This project addresses the challenge of automated stock trading, where traditional methods and direct reinforcement learning (RL) struggle with market noise, complexity, and generalization. Our proposed solution is an integrated deep learning framework combining a Convolutional Neural Network (CNN) to identify patterns in technical indicators formatted as images, a Long Short‑Term Memory (LSTM) network to capture temporal dependencies across both price history and technical indicators, and a Deep Q‑Network (DQN) agent which learns the optimal trading policy (buy, sell, hold) based on the features extracted by the CNN and LSTM.

PaperID: 10657, https://arxiv.org/pdf/2505.03906.pdf

Authors: Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos

Title: MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing

Abstract:
Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high‑performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture‑specific considerations that general‑purpose LLMs often overlook. We present MARCO (Multi‑Agent Reactive Code Optimizer), a novel framework that enhances LLM‑generated code for HPC through a specialized multi‑agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO's web‑search component that retrieves real‑time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre‑trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web‑search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi‑agent systems to address the specialized requirements of high‑performance code generation, offering a cost‑effective alternative to domain‑specific model fine‑tuning.

PaperID: 10658, https://arxiv.org/pdf/2505.03864.pdf

Authors: Qiaomu Li, Ying Xie

Title: From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems

Abstract:
Artificial intelligence is rapidly evolving towards multi‑agent systems where numerous AI agents collaborate and interact with external tools. Two key open standards, Google's Agent to Agent (A2A) protocol for inter‑agent communication and Anthropic's Model Context Protocol (MCP) for standardized tool access, promise to overcome the limitations of fragmented, custom integration approaches. While their potential synergy is significant, this paper argues that effectively integrating A2A and MCP presents unique, emergent challenges at their intersection, particularly concerning semantic interoperability between agent tasks and tool capabilities, the compounded security risks arising from combined discovery and execution, and the practical governance required for the envisioned "Agent Economy". This work provides a critical analysis, moving beyond a survey to evaluate the practical implications and inherent difficulties of combining these horizontal and vertical integration standards. We examine the benefits (e.g., specialization, scalability) while critically assessing their dependencies and trade‑offs in an integrated context. We identify key challenges increased by the integration, including novel security vulnerabilities, privacy complexities, debugging difficulties across protocols, and the need for robust semantic negotiation mechanisms. In summary, A2A+MCP offers a vital architectural foundation, but fully realizing its potential requires substantial advancements to manage the complexities of their combined operation.

PaperID: 10659, https://arxiv.org/pdf/2505.03807.pdf

Authors: Yiwen Zhang, Jianing Hao, Zhan Wang, Hongling Sheng, Wei Zeng

Title: Facilitating Video Story Interaction with Multi-Agent Collaborative System

Abstract:
Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection, specially designed narratives, and lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combining Retrieval‑Augmented Generation (RAG) and a Multi‑Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) Video story processing, utilizing VLM and prior knowledge to simulate human understanding of stories across three modalities. 2) Multi‑space chat, creating growth‑oriented characters through MAS interactions based on user queries and story stages. 3) Scene customization, expanding and visualizing various story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.

PaperID: 10660, https://arxiv.org/pdf/2505.03784.pdf

Authors: Ahmed A. Metwally, A. Ali Heydari, Daniel McDuff, Alexandru Solot, Zeinab Esmaeilpour, Anthony Z Faranesh, Menglian Zhou, David B. Savage, Conor Heneghan, Shwetak Patel, Cathy Speed, Javier L. Prieto

Title: Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers

Abstract:
Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive, inaccessible, not widely available and hinder opportunities for early intervention. In this study, we remotely recruited the largest dataset to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m2, age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground‑truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA‑IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models can predict insulin resistance by combining both wearable data and readily available blood biomarkers better than either of the two data sources separately (R2=0.5, auROC=0.80, Sensitivity=76%, and specificity 84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability, and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA‑IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes and thereby facilitate earlier implementation of preventative strategies.

PaperID: 10661, https://arxiv.org/pdf/2505.03771.pdf

Authors: Ritik Raj, Akshat Ramachandran, Jeff Nye, Shashank Nemawarkar, Tushar Krishna

Title: OneDSE: A Unified Microprocessor Metric Prediction and Design Space Exploration Framework

Abstract:
With the slowing of Moores Law and increasing impact of power constraints, processor designs rely on architectural innovation to achieve differentiating performance. However, the innovation complexity has simultaneously increased the design space of modern high performance processors. Specifically, we identify two key challenges in prior Design Space Exploration (DSE) approaches for modern CPU design ‑ (a) cost model (prediction method) is either slow or microarchitecture‑specific or workload‑specific and single model is inefficient to learn the whole design space (b) optimization (exploration method) is slow and inaccurate in the large CPU parameter space. This work presents a novel solution called OneDSE to address these emerging challenges in modern CPU design. OneDSE is a unified cost model (metric predictor) and optimizer (CPU parameter explorer) with three key techniques ‑ 1. Transformer‑based workload‑Aware CPU Estimation (TrACE) framework to predict metrics in the parameter space (TrACE‑p) and parameters in the in the metric space (TrACE‑m). TrACE‑p outperforms State of The Art (SOTA) IPC prediction methods by 5.71x and 28x for single and multiple workloads respectively while being two orders of magnitude faster. 2. We also propose a novel Metric spAce Search opTimizer (MAST) that leverages TrACE‑m and outperforms SoTA metaheuristics by 1.19x while being an order of magnitude faster. 3. We propose Subsystem‑based Multi‑Agent Reinforcement‑learning based fine‑Tuning (SMART)‑TrACE that achieves a 10.6% reduction in prediction error compared to TrACE, enabling more accurate and efficient exploration of the CPU design space.

PaperID: 10662, https://arxiv.org/pdf/2505.03738.pdf

Authors: Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, Xiaolong Wang

Title: AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control

Abstract:
Humanoid robots derive much of their dexterity from hyper‑dexterous whole‑body movements, enabling tasks that require a large operational workspace: such as picking objects off the ground. However, achieving these capabilities on real humanoids remains challenging due to their high degrees of freedom (DoF) and nonlinear dynamics. We propose Adaptive Motion Optimization (AMO), a framework that integrates sim‑to‑real reinforcement learning (RL) with trajectory optimization for real‑time, adaptive whole‑body control. To mitigate distribution bias in motion imitation RL, we construct a hybrid AMO dataset and train a network capable of robust, on‑demand adaptation to potentially O.O.D. commands. We validate AMO in simulation and on a 29‑DoF Unitree G1 humanoid robot, demonstrating superior stability and an expanded workspace compared to strong baselines. Finally, we show that AMO's consistent performance supports autonomous task execution via imitation learning, underscoring the system's versatility and robustness.

PaperID: 10663, https://arxiv.org/pdf/2505.03674.pdf

Authors: Yotam Amitai, Reuth Mirsky, Ofra Amir

Title: Gap the (Theory of) Mind: Sharing Beliefs About Teammates' Goals Boosts Collaboration Perception, Not Performance

Abstract:
In human‑agent teams, openly sharing goals is often assumed to enhance planning, collaboration, and effectiveness. However, direct communication of these goals is not always feasible, requiring teammates to infer their partner's intentions through actions. Building on this, we investigate whether an AI agent's ability to share its inferred understanding of a human teammate's goals can improve task performance and perceived collaboration. Through an experiment comparing three conditions‑no recognition (NR), viable goals (VG), and viable goals on‑demand (VGod) ‑ we find that while goal‑sharing information did not yield significant improvements in task performance or overall satisfaction scores, thematic analysis suggests that it supported strategic adaptations and subjective perceptions of collaboration. Cognitive load assessments revealed no additional burden across conditions, highlighting the challenge of balancing informativeness and simplicity in human‑agent interactions. These findings highlight the nuanced trade‑off of goal‑sharing: while it fosters trust and enhances perceived collaboration, it can occasionally hinder objective performance gains.

PaperID: 10664, https://arxiv.org/pdf/2505.03558.pdf

Authors: Giacomo Avanzi, Marco Giordani, Michele Zorzi

Title: Multi-Agent Reinforcement Learning Scheduling to Support Low Latency in Teleoperated Driving

Abstract:
The teleoperated driving (TD) scenario comes with stringent Quality of Service (QoS) communication constraints, especially in terms of end‑to‑end (E2E) latency and reliability. In this context, Predictive Quality of Service (PQoS), possibly combined with Reinforcement Learning (RL) techniques, is a powerful tool to estimate QoS degradation and react accordingly. For example, an intelligent agent can be trained to select the optimal compression configuration for automotive data, and reduce the file size whenever QoS conditions deteriorate. However, compression may inevitably compromise data quality, with negative implications for the TD application. An alternative strategy involves operating at the Radio Access Network (RAN) level to optimize radio parameters based on current network conditions, while preserving data quality. In this paper, we propose Multi‑Agent Reinforcement Learning (MARL) scheduling algorithms, based on Proximal Policy Optimization (PPO), to dynamically and intelligently allocate radio resources to minimize E2E latency in a TD scenario. We evaluate two training paradigms, i.e., decentralized learning with local observations (IPPO) vs. centralized aggregation (MAPPO), in conjunction with two resource allocation strategies, i.e., proportional allocation (PA) and greedy allocation (GA). We prove via ns‑3 simulations that MAPPO, combined with GA, achieves the best results in terms of latency, especially as the number of vehicles increases.

PaperID: 10665, https://arxiv.org/pdf/2505.03553.pdf

Authors: Kolawole E. Ogunsina, Morayo A. Ogunsina

Title: A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning

Abstract:
Inconsistent outputs and hallucinations from large language models (LLMs) are major obstacles to reliable AI systems. When different proprietary reasoning models (RMs), such as those by OpenAI, Google, Anthropic, DeepSeek, and xAI, are given the same complex request, they often produce divergent results due to variations in training and inference. This paper proposes a novel consensus mechanism, inspired by distributed ledger technology, to validate and converge these outputs, treating each RM as a black‑box peer. Building on the Hashgraph consensus algorithm, our approach employs gossip‑about‑gossip communication and virtual voting to achieve agreement among an ensemble of RMs. We present an architectural design for a prototype system in which RMs iteratively exchange and update their answers, using information from each round to improve accuracy and confidence in subsequent rounds. This approach goes beyond simple majority voting by incorporating the knowledge and cross‑verification content of every model. We justify the feasibility of this Hashgraph‑inspired consensus for AI ensembles and outline its advantages over traditional ensembling techniques in reducing nonfactual outputs. Preliminary considerations for implementation, evaluation criteria for convergence and accuracy, and potential challenges are discussed. The proposed mechanism demonstrates a promising direction for multi‑agent AI systems to self‑validate and deliver high‑fidelity responses in complex tasks.

PaperID: 10666, https://arxiv.org/pdf/2505.03533.pdf

Authors: Jiacheng Wang, Le Liang, Hao Ye, Chongtao Guo, Shi Jin

Title: Small-Scale-Fading-Aware Resource Allocation in Wireless Federated Learning

Abstract:
Judicious resource allocation can effectively enhance federated learning (FL) training performance in wireless networks by addressing both system and statistical heterogeneity. However, existing strategies typically rely on block fading assumptions, which overlooks rapid channel fluctuations within each round of FL gradient uploading, leading to a degradation in FL training performance. Therefore, this paper proposes a small‑scale‑fading‑aware resource allocation strategy using a multi‑agent reinforcement learning (MARL) framework. Specifically, we establish a one‑step convergence bound of the FL algorithm and formulate the resource allocation problem as a decentralized partially observable Markov decision process (Dec‑POMDP), which is subsequently solved using the QMIX algorithm. In our framework, each client serves as an agent that dynamically determines spectrum and power allocations within each coherence time slot, based on local observations and a reward derived from the convergence analysis. The MARL setting reduces the dimensionality of the action space and facilitates decentralized decision‑making, enhancing the scalability and practicality of the solution. Experimental results demonstrate that our QMIX‑based resource allocation strategy significantly outperforms baseline methods across various degrees of statistical heterogeneity. Additionally, ablation studies validate the critical importance of incorporating small‑scale fading dynamics, highlighting its role in optimizing FL performance.

PaperID: 10667, https://arxiv.org/pdf/2505.03418.pdf

Authors: Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, Huajun Chen

Title: Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey

Abstract:
Problem‑solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate solutions, make inferences, and even leverage external computational tools. However, applying LLMs to real‑world problem‑solving presents significant challenges, including multi‑step reasoning, domain knowledge integration, and result verification. This survey explores the capabilities and limitations of LLMs in complex problem‑solving, examining techniques including Chain‑of‑Thought (CoT) reasoning, knowledge augmentation, and various LLM‑based and tool‑based verification techniques. Additionally, we highlight domain‑specific challenges in various domains, such as software engineering, mathematical reasoning and proving, data analysis and modeling, and scientific research. The paper further discusses the fundamental limitations of the current LLM solutions and the future directions of LLM‑based complex problems solving from the perspective of multi‑step reasoning, domain knowledge integration and result verification.

PaperID: 10668, https://arxiv.org/pdf/2505.03288.pdf

Authors: Francesco Morri, Hélène Le Cadre, Pierre Gruet, Luce Brotcorne

Title: Game Theory and Multi-Agent Reinforcement Learning for Zonal Ancillary Markets

Abstract:
We characterize zonal ancillary market coupling relying on noncooperative game theory. To that purpose, we formulate the ancillary market as a multi‑leader single follower bilevel problem, that we subsequently cast as a generalized Nash game with side constraints and nonconvex feasibility sets. We determine conditions for equilibrium existence and show that the game has a generalized potential game structure. To compute market equilibrium, we rely on two exact approaches: an integrated optimization approach and Gauss‑Seidel best‑response, that we compare against multi‑agent deep reinforcement learning. On real data from Germany and Austria, simulations indicate that multi‑agent deep reinforcement learning achieves the smallest convergence rate but requires pretraining, while best‑response is the slowest. On the economics side, multi‑agent deep reinforcement learning results in smaller market costs compared to the exact methods, but at the cost of higher variability in the profit allocation among stakeholders. Further, stronger coupling between zones tends to reduce costs for larger zones.

PaperID: 10669, https://arxiv.org/pdf/2505.03257.pdf

Authors: Craig Maxwell, Mirko Baglioni, Anahita Jamshidnejad

Title: Model Predictive Fuzzy Control: A Hierarchical Multi-Agent Control Architecture for Outdoor Search-and-Rescue Robots

Abstract:
Autonomous robots deployed in unknown search‑and‑rescue (SaR) environments can significantly improve the efficiency of the mission by assisting in fast localisation and rescue of the trapped victims. We propose a novel integrated hierarchical control architecture, called model predictive fuzzy control (MPFC), for autonomous mission planning of multi‑robot SaR systems that should efficiently map an unknown environment: We combine model predictive control (MPC) and fuzzy logic control (FLC), where the robots are locally controlled by computationally efficient FLC controllers, and the parameters of these local controllers are tuned via a centralised MPC controller, in a regular or event‑triggered manner. The proposed architecture provides three main advantages: (1) The control decisions are made by the FLC controllers, thus the real‑time computation time is affordable. (2) The centralised MPC controller optimises the performance criteria with a global and predictive vision of the system dynamics, and updates the parameters of the FLC controllers accordingly. (3) FLC controllers are heuristic by nature and thus do not take into account optimality in their decisions, while the tuned parameters via the MPC controller can indirectly incorporate some level of optimality in local decisions of the robots. A simulation environment for victim detection in a disaster environment was designed in MATLAB using discrete, 2‑D grid‑based models. While being comparable from the point of computational efficiency, the integrated MPFC architecture improves the performance of the multi‑robot SaR system compared to decentralised FLC controllers. Moreover, the performance of MPFC is comparable to the performance of centralised MPC for path planning of SaR robots, whereas MPFC requires significantly less computational resources, since the number of the optimisation variables in the control problem are reduced.

PaperID: 10670, https://arxiv.org/pdf/2505.03178.pdf

Authors: Jiawei Wang, Xintao Yan, Yao Mu, Haowei Sun, Zhong Cao, Henry X. Liu

Title: RADE: Learning Risk-Adjustable Driving Environment via Multi-Agent Conditional Diffusion

Abstract:
Generating safety‑critical scenarios in high‑fidelity simulations offers a promising and cost‑effective approach for efficient testing of autonomous vehicles. Existing methods typically rely on manipulating a single vehicle's trajectory through sophisticated designed objectives to induce adversarial interactions, often at the cost of realism and scalability. In this work, we propose the Risk‑Adjustable Driving Environment (RADE), a simulation framework that generates statistically realistic and risk‑adjustable traffic scenes. Built upon a multi‑agent diffusion architecture, RADE jointly models the behavior of all agents in the environment and conditions their trajectories on a surrogate risk measure. Unlike traditional adversarial methods, RADE learns risk‑conditioned behaviors directly from data, preserving naturalistic multi‑agent interactions with controllable risk levels. To ensure physical plausibility, we incorporate a tokenized dynamics check module that efficiently filters generated trajectories using a motion vocabulary. We validate RADE on the real‑world rounD dataset, demonstrating that it preserves statistical realism across varying risk levels and naturally increases the likelihood of safety‑critical events as the desired risk level grows up. Our results highlight RADE's potential as a scalable and realistic tool for AV safety evaluation.

PaperID: 10671, https://arxiv.org/pdf/2505.03161.pdf

Authors: Qi Qin, Xinye Cao, Guoshun Nan, Sihan Chen, Rushan Li, Li Su, Haitao Du, Qimei Cui, Pengxuan Mao, Xiaofeng Tao, Tony Q. S. Quek

Title: An LLM-based Self-Evolving Security Framework for 6G Space-Air-Ground Integrated Networks

Abstract:
Recently emerged 6G space‑air‑ground integrated networks (SAGINs), which integrate satellites, aerial networks, and terrestrial communications, offer ubiquitous coverage for various mobile applications. However, the highly dynamic, open, and heterogeneous nature of SAGINs poses severe security issues. Forming a defense line of SAGINs suffers from two preliminary challenges: 1) accurately understanding massive unstructured multi‑dimensional threat information to generate defense strategies against various malicious attacks, 2) rapidly adapting to potential unknown threats to yield more effective security strategies. To tackle the above two challenges, we propose a novel security framework for SAGINs based on Large Language Models (LLMs), which consists of two key ingredients LLM‑6GNG and 6G‑INST. Our proposed LLM‑6GNG leverages refined chain‑of‑thought (CoT) reasoning and dynamic multi‑agent mechanisms to analyze massive unstructured multi‑dimensional threat data and generate comprehensive security strategies, thus addressing the first challenge. Our proposed 6G‑INST relies on a novel self‑evolving method to automatically update LLM‑6GNG, enabling it to accommodate unknown threats under dynamic communication environments, thereby addressing the second challenge. Additionally, we prototype the proposed framework with ns‑3, OpenAirInterface (OAI), and software‑defined radio (SDR). Experiments on three benchmarks demonstrate the effectiveness of our framework. The results show that our framework produces highly accurate security strategies that remain robust against a variety of unknown attacks. We will release our code to contribute to the community.

PaperID: 10672, https://arxiv.org/pdf/2505.03127.pdf

Authors: Qingliang Li, Bo Chang, Weidong Mei, Zhi Chen

Title: Integrated Sensing, Computing, Communication, and Control for Time-Sequence-Based Semantic Communications

Abstract:
In the upcoming industrial internet of things (IIoT) era, a surge of task‑oriented applications will rely on real‑time wireless control systems (WCSs). For these systems, ultra‑reliable and low‑latency wireless communication will be crucial to ensure the timely transmission of control information. To achieve this purpose, we propose a novel time‑sequence‑based semantic communication paradigm, where an integrated sensing, computing, communication, and control (ISC3) architecture is developed to make sensible semantic inference (SI) for the control information over time sequences, enabling adaptive control of the robot. However, due to the causal correlations in the time sequence, the control information does not present the Markov property. To address this challenge, we compute the mutual information of the control information sensed at the transmitter (Tx) over different time and identify their temporal semantic correlation via a semantic feature extractor (SFE) module. By this means, highly correlated information transmission can be avoided, thus greatly reducing the communication overhead. Meanwhile, a semantic feature reconstructor (SFR) module is employed at the receiver (Rx) to reconstruct the control information based on the previously received one if the information transmission is not activated at the Tx. Furthermore, a control gain policy is also employed at the Rx to adaptively adjust the control gain for the controlled target based on several practical aspects such as the quality of the information transmission from the Tx to the Rx. We design the neural network structures of the above modules/policies and train their parameters by a novel hybrid reward multi‑agent deep reinforcement learning framework. On‑site experiments are conducted to evaluate the performance of our proposed method in practice, which shows significant gains over other baseline schemes.

PaperID: 10673, https://arxiv.org/pdf/2505.03096.pdf

Authors: Joshua Owotogbe

Title: Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering

Abstract:
This study explores the application of chaos engineering to enhance the robustness of Large Language Model‑Based Multi‑Agent Systems (LLM‑MAS) in production‑like environments under real‑world conditions. LLM‑MAS can potentially improve a wide range of tasks, from answering questions and generating content to automating customer support and improving decision‑making processes. However, LLM‑MAS in production or preproduction environments can be vulnerable to emergent errors or disruptions, such as hallucinations, agent failures, and agent communication failures. This study proposes a chaos engineering framework to proactively identify such vulnerabilities in LLM‑MAS, assess and build resilience against them, and ensure reliable performance in critical applications.

PaperID: 10674, https://arxiv.org/pdf/2505.02888.pdf

Authors: Rintaro Ando

Title: When Your Own Output Becomes Your Training Data: Noise-to-Meaning Loops and a Formal RSI Trigger

Abstract:
We present Noise‑to‑Meaning Recursive Self‑Improvement (N2M‑RSI), a minimal formal model showing that once an AI agent feeds its own outputs back as inputs and crosses an explicit information‑integration threshold, its internal complexity will grow without bound under our assumptions. The framework unifies earlier ideas on self‑prompting large language models, Gödelian self‑reference, and AutoML, yet remains implementation‑agnostic. The model furthermore scales naturally to interacting swarms of agents, hinting at super‑linear effects once communication among instances is permitted. For safety reasons, we omit system‑specific implementation details and release only a brief, model‑agnostic toy prototype in Appendix C.

PaperID: 10675, https://arxiv.org/pdf/2505.02861.pdf

Authors: Kushagra Agrawal, Nisharg Nargund

Title: Neural Orchestration for Multi-Agent Systems: A Deep Learning Framework for Optimal Agent Selection in Multi-Domain Task Environments

Abstract:
Multi‑agent systems (MAS) are foundational in simulating complex real‑world scenarios involving autonomous, interacting entities. However, traditional MAS architectures often suffer from rigid coordination mechanisms and difficulty adapting to dynamic tasks. We propose MetaOrch, a neural orchestration framework for optimal agent selection in multi‑domain task environments. Our system implements a supervised learning approach that models task context, agent histories, and expected response quality to select the most appropriate agent for each task. A novel fuzzy evaluation module scores agent responses along completeness, relevance, and confidence dimensions, generating soft supervision labels for training the orchestrator. Unlike previous methods that hard‑code agent‑task mappings, MetaOrch dynamically predicts the most suitable agent while estimating selection confidence. Experiments in simulated environments with heterogeneous agents demonstrate that our approach achieves 86.3% selection accuracy, significantly outperforming baseline strategies including random selection and round‑robin scheduling. The modular architecture emphasizes extensibility, allowing agents to be registered, updated, and queried independently. Results suggest that neural orchestration offers a powerful approach to enhancing the autonomy, interpretability, and adaptability of multi‑agent systems across diverse task domains.

PaperID: 10676, https://arxiv.org/pdf/2505.02830.pdf

Authors: Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang

Title: AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

Abstract:
Chest X‑rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) Insufficient region‑level understanding and interaction, and (2) Limited accuracy and interpretability due to single‑step reasoning. In this paper, we empower MLMMs with anatomy‑centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology‑Guided Reasoning (AOR) framework, which centers on cross‑modal region‑level information to facilitate multi‑step reasoning. Next, under the guidance of expert physicians, we develop AOR‑Instruction, a large instruction dataset for MLMMs training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.

PaperID: 10677, https://arxiv.org/pdf/2505.02707.pdf

Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu

Title: Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Abstract:
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real‑time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice‑language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end‑to‑end architecture that enables full‑duplex, low‑latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi‑scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona‑aware voice generation ‑‑ where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre‑built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice‑based applications, including automatic speech recognition (ASR), Text‑to‑Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open‑sourced to support open research and accelerate progress toward next‑generation human‑machine interactions.

PaperID: 10678, https://arxiv.org/pdf/2505.02665.pdf

Authors: Qianjun Pan, Wenkai Ji, Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He

Title: A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

Abstract:
This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking" ‑ a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi‑agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, it charts a path toward LLMs that combine human‑like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test‑time scaling dynamically adjusts computation based on task complexity via search and sampling, dynamic verification; (2) reinforced learning refines decision‑making through iterative improvement leveraging policy networks, reward models, and self‑evolution strategies; and (3) slow‑thinking frameworks (e.g., long CoT, hierarchical processes) that structure problem‑solving with manageable steps. The survey highlights the challenges and further directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real‑world applications, from scientific discovery to decision support systems.

PaperID: 10679, https://arxiv.org/pdf/2505.02648.pdf

Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang

Title: MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Abstract:
Diffusion models have shown excellent performance in text‑to‑image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi‑agent Collaboration‑based Compositional Diffusion (MCCD) for text‑to‑image generation for complex scenes. Specifically, we design a multi‑agent collaboration‑based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high‑fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training‑free manner, providing a substantial advantage in complex scene generation.

PaperID: 10680, https://arxiv.org/pdf/2505.02489.pdf

Authors: Muskaan Goyal, Pranav Bhasin

Title: Beyond the model: Key differentiators in large language models and multi-agent services

Abstract:
With the launch of foundation models like DeepSeek, Manus AI, and Llama 4, it has become evident that large language models (LLMs) are no longer the sole defining factor in generative AI. As many now operate at comparable levels of capability, the real race is not about having the biggest model but optimizing the surrounding ecosystem, including data quality and management, computational efficiency, latency, and evaluation frameworks. This review article delves into these critical differentiators that ensure modern AI services are efficient and profitable.

PaperID: 10681, https://arxiv.org/pdf/2505.02484.pdf

Authors: Yunheng Zou, Austin H. Cheng, Abdulrahman Aldossary, Jiaru Bai, Shi Xuan Leong, Jorge Arturo Campos-Gonzalez-Angulo, Changhyeok Choi, Cher Tian Ser, Gary Tom, Andrew Wang, Zijian Zhang, Ilya Yakavets, Han Hao, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik

Title: El Agente: An Autonomous Agent for Quantum Chemistry

Abstract:
Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non‑specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM‑based multi‑agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post‑analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university‑level course exercises and two case studies, demonstrating robust problem‑solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer‑term, multi‑step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.

PaperID: 10682, https://arxiv.org/pdf/2505.02279.pdf

Authors: Abul Ehtesham, Aditi Singh, Gaurav Kumar Gupta, Saket Kumar

Title: A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

Abstract:
Large language model powered autonomous agents demand robust, standardized protocols to integrate tools, share contextual data, and coordinate tasks across heterogeneous systems. Ad‑hoc integrations are difficult to scale, secure, and generalize across domains. This survey examines four emerging agent communication protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent‑to‑Agent Protocol (A2A), and Agent Network Protocol (ANP), each addressing interoperability in deployment contexts. MCP provides a JSON‑RPC client‑server interface for secure tool invocation and typed data exchange. ACP defines a general‑purpose communication protocol over RESTful HTTP, supporting MIME‑typed multipart messages and synchronous and asynchronous interactions. Its lightweight and runtime‑independent design enables scalable agent invocation, while features like session management, message routing, and integration with role‑based and decentralized identifiers (DIDs). A2A enables peer‑to‑peer task delegation using capability‑based Agent Cards, supporting secure and scalable collaboration across enterprise agent workflows. ANP supports open network agent discovery and secure collaboration using W3C decentralized identifiers DIDs and JSON‑LD graphs. The protocols are compared across multiple dimensions, including interaction modes, discovery mechanisms, communication patterns, and security models. Based on the comparative analysis, a phased adoption roadmap is proposed: beginning with MCP for tool access, followed by ACP for structured, multimodal messaging session‑aware interaction and both online and offline agent discovery across scalable, HTTP‑based deployments A2A for collaborative task execution, and extending to ANP for decentralized agent marketplaces. This work provides a comprehensive foundation for designing secure, interoperable, and scalable ecosystems of LLM‑powered agents.

PaperID: 10683, https://arxiv.org/pdf/2505.02215.pdf

Authors: Mannan Bhardwaj

Title: Interpretable Emergent Language Using Inter-Agent Transformers

Abstract:
This paper explores the emergence of language in multi‑agent reinforcement learning (MARL) using transformers. Existing methods such as RIAL, DIAL, and CommNet enable agent communication but lack interpretability. We propose Differentiable Inter‑Agent Transformers (DIAT), which leverage self‑attention to learn symbolic, human‑understandable communication protocols. Through experiments, DIAT demonstrates the ability to encode observations into interpretable vocabularies and meaningful embeddings, effectively solving cooperative tasks. These results highlight the potential of DIAT for interpretable communication in complex multi‑agent environments.

PaperID: 10684, https://arxiv.org/pdf/2505.02144.pdf

Authors: Alexis R. Tudor, Joaquín Arias, Gopal Gupta

Title: VECSR: Virtually Embodied Common Sense Reasoning System

Abstract:
The development of autonomous agents has seen a revival of enthusiasm due to the emergence of LLMs, such as GPT‑4o. Deploying these agents in environments where they coexist with humans (e.g., as domestic assistants) requires special attention to trustworthiness and explainability. However, the use of LLMs and other deep learning models still does not resolve these key issues. Deep learning systems may hallucinate, be unable to justify their decisions as black boxes, or perform badly on unseen scenarios. In this work, we propose the use of s(CASP), a goal‑directed common sense reasoner based on Answer Set Programming, to break down the high‑level tasks of an autonomous agent into mid‑level instructions while justifying the selection of these instructions. To validate its use in real applications we present a framework that integrates the reasoner into the VirtualHome simulator and compares its accuracy with GPT‑4o, running some of the "real" use cases available in the domestic environments of VirtualHome. Additionally, since experiments with VirtualHome have shown the need to reduce the response time (which increases as the agent's decision space grows), we have proposed and evaluated a series of optimizations based on program analysis that exploit the advantages of the top‑down execution of s(CASP).

PaperID: 10685, https://arxiv.org/pdf/2505.02133.pdf

Authors: Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani

Title: Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

Abstract:
The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has opened new possibilities for automating intricate programming tasks for the sake of accurate code generation. Although contemporary foundational models demonstrate promoting results, researchers continue to explore optimal post‑training strategies to enhance code quality. These include supervised fine‑tuning, retrieval‑augmented generation (RAG), debugging, and many others. In this paper, we combine two widely used approaches namely multi‑agent collaboration and runtime execution information‑based debugging, for improving code generation functionality, reliability, and practical applicability. We perform an empirical study in order to extend the evaluation of the individual strategies as well as the proposed composition of the activities of both strategies. Our study use 19 LLMs to examines the performance of individual and the proposed strategies, offering comprehensive insights into how different programming activities compositions and training paradigms influence code generation effectiveness. In particular, we implement a chained system that combines both strategies to assess their combined impact on functional accuracy, code reliability, and generation latency using two benchmark datasets commonly used for code generation. Our findings provide valuable insights for organizations seeking robust AI‑driven coding solutions by guiding them in selecting models that can better adapt to complex post‑training strategies, ultimately fostering the adoption of more effective and reliable code generation technologies.

PaperID: 10686, https://arxiv.org/pdf/2505.02123.pdf

Authors: Xinmeng Hou, Wuqi Wang, Long Yang, Hao Lin, Jinglun Feng, Haigen Min, Xiangmo Zhao

Title: DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving

Abstract:
We introduce DriveAgent, a novel multi‑agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion to enhance situational understanding and decision‑making. DriveAgent uniquely integrates diverse sensor modalities‑including camera, LiDAR, GPS, and IMU‑with LLM‑driven analytical processes structured across specialized agents. The framework operates through a modular agent‑based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle‑level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency‑aware decision‑generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments on challenging autonomous driving datasets demonstrate that DriveAgent is achieving superior performance on multiple metrics against baseline methods. These results validate the efficacy of the proposed LLM‑driven multi‑agent sensor fusion framework, underscoring its potential to substantially enhance the robustness and reliability of autonomous driving systems.

PaperID: 10687, https://arxiv.org/pdf/2505.02077.pdf

Authors: Christian Schroeder de Witt, Klaudia Krawiecka, Igor Krawczuk, Ben Hagag, William L. Anderson, Peter Belcak, Ben Bucknall, Xiaohong Cai, Ayush Chopra, Doron Cohen, Ron F. Del Rosario, Andis Draguns, Annie Gray, Keren Katz, Vasilios Mavroudis, Jaron Mink, Sumeet Ramesh Motwani, Jonathan Petit, Leif-Sebastian Rembeck, Chandler Smith, John Sotiropoulos, Steven Young, Sarah Scheffler, Mary Llewellyn

Title: Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents

Abstract:
AI agents are beginning to interact with each other directly and across internet platforms and physical environments, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free‑form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi‑agent dispersion and stealth optimization help adversaries evade oversight ‑ creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi‑agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi‑agent security, a new field dedicated to securing networks of AI agents against threats that emerge or amplify through their interactions ‑ whether direct or indirect via shared environments ‑ with each other, humans, and institutions, and characterise fundamental security‑utility and security‑security trade‑offs across both distributed and decentralised settings. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) offers applications to multi‑agent security for work across diffuse subfields, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large‑scale agent deployment, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.

PaperID: 10688, https://arxiv.org/pdf/2505.02076.pdf

Authors: Milapji Singh Gill, Javal Vyas, Artan Markaj, Felix Gehlhoff, Mehmet Mercangöz

Title: Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants

Abstract:
Advances in Automation and Artificial Intelligence continue to enhance the autonomy of process plants in handling various operational scenarios. However, certain tasks, such as fault handling, remain challenging, as they rely heavily on human expertise. This highlights the need for systematic, knowledge‑based methods. To address this gap, we propose a methodological framework that integrates Large Language Model (LLM) agents with a Digital Twin environment. The LLM agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the goal of returning the system to normal operation. In this context, the Digital Twin acts both as a structured repository of plant‑specific engineering knowledge for agent prompting and as a simulation platform for the systematic validation and verification of the generated corrective control actions. The evaluation using a mixing module of a process plant demonstrates that the proposed framework is capable not only of autonomously controlling the mixing module, but also of generating effective corrective actions to mitigate a pipe clogging with only a few reprompts.

PaperID: 10689, https://arxiv.org/pdf/2505.02024.pdf

Authors: Minjie Shen, Yanshu Li, Lulu Chen, Zhichao Fan, Yanhang Li, Qikai Yang

Title: From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Abstract:
Manus AI is a general‑purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between "mind" and "hand" ‑ combining the reasoning and planning capabilities of large language models with the ability to execute complex, end‑to‑end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high‑level intentions into real‑world actions, heralding a new era of human‑AI collaboration.

PaperID: 10690, https://arxiv.org/pdf/2505.01931.pdf

Authors: Jesse Barkley, Abraham George, Amir Barati Farimani

Title: Semantic Intelligence: Integrating GPT-4 with A Planning in Low-Cost Robotics

Abstract:
Classical robot navigation often relies on hardcoded state machines and purely geometric path planners, limiting a robot's ability to interpret high‑level semantic instructions. In this paper, we first assess GPT‑4's ability to act as a path planner compared to the A algorithm, then present a hybrid planning framework that integrates GPT‑4's semantic reasoning with A on a low‑cost robot platform operating on ROS2 Humble. Our approach eliminates explicit finite state machine (FSM) coding by using prompt‑based GPT‑4 reasoning to handle task logic while maintaining the accurate paths computed by A. The GPT‑4 module provides semantic understanding of instructions and environmental cues (e.g., recognizing toxic obstacles or crowded areas to avoid, or understanding low‑battery situations requiring alternate route selection), and dynamically adjusts the robot's occupancy grid via obstacle buffering to enforce semantic constraints. We demonstrate multi‑step reasoning for sequential tasks, such as first navigating to a resource goal and then reaching a final destination safely. Experiments on a Petoi Bittle robot with an overhead camera and Raspberry Pi Zero 2W compare classical A against GPT‑4‑assisted planning. Results show that while A is faster and more accurate for basic route generation and obstacle avoidance, the GPT‑4‑integrated system achieves high success rates (96‑100%) on semantic tasks that are infeasible for pure geometric planners. This work highlights how affordable robots can exhibit intelligent, context‑aware behaviors by leveraging large language model reasoning with minimal hardware and no fine‑tuning.

PaperID: 10691, https://arxiv.org/pdf/2505.01834.pdf

Authors: Zongxi Liu, Hongyang Du

Title: Model Context Protocol-based Internet of Experts For Wireless Environment-aware LLM Agents

Abstract:
Large Language Models (LLMs) exhibit strong general‑purpose reasoning abilities but lack access to wireless environment information due to the absence of native sensory input and domain‑specific priors. Previous attempts to apply LLMs in wireless systems either depend on retraining with network‑specific data, which compromises language generalization, or rely on manually scripted interfaces, which hinder scalability. To overcome these limitations, we propose a Model Context Protocol (MCP)‑based Internet of Experts (IoX) framework that equips LLMs with wireless environment‑aware reasoning capabilities. The framework incorporates a set of lightweight expert models, each trained to solve a specific deterministic task in wireless communications, such as detecting a specific wireless attribute, e.g., line‑of‑sight propagation, Doppler effects, or fading conditions. Through MCP, the LLM can selectively query and interpret expert outputs at inference time, without modifying its own parameters. This architecture enables modular, extensible, and interpretable reasoning over wireless contexts. Evaluated across multiple mainstream LLMs, the proposed wireless environment‑aware LLM agents achieve 40%‑50% improvements in classification tasks over LLM‑only baselines. More broadly, the MCP‑based design offers a viable paradigm for future LLMs to inherit structured wireless network management capabilities.

PaperID: 10692, https://arxiv.org/pdf/2505.01804.pdf

Authors: Jimin Choi, Kartikeya Anand, Husni R. Idris, Huy T. Tran, Max Z. Li

Title: Pathfinders in the Sky: Formal Decision-Making Models for Collaborative Air Traffic Control in Convective Weather

Abstract:
Air traffic can be significantly disrupted by weather. Pathfinder operations involve assigning a designated aircraft to assess whether airspace that was previously impacted by weather can be safely traversed through. Despite relatively routine use in air traffic control, there is little research on the underlying multi‑agent decision‑making problem. We seek to address this gap herein by formulating decision models to capture the operational dynamics and implications of pathfinders. Specifically, we construct a Markov chain to represent the stochastic transitions between key operational states (e.g., pathfinder selection). We then analyze its steady‑state behavior to understand long‑term system dynamics. We also propose models to characterize flight‑specific acceptance behaviors (based on utility trade‑offs) and pathfinder selection strategies (based on sequential offer allocations). We then conduct a worst‑case scenario analysis that highlights risks from collective rejection and explores how selfless behavior and uncertainty affect system resilience. Empirical analysis of data from the US Federal Aviation Administration demonstrates the real‑world significance of pathfinder operations and informs future model calibration.

PaperID: 10693, https://arxiv.org/pdf/2505.01651.pdf

Authors: Zeynep Engin

Title: Human-AI Governance (HAIG): A Trust-Utility Approach

Abstract:
This paper introduces the Human‑AI Governance (HAIG) framework, contributing to the AI Governance (AIG) field by foregrounding the relational dynamics between human and AI actors rather than treating AI systems as objects of governance alone. Current categorical frameworks (e.g., human‑in‑the‑loop models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi‑agent systems exhibit autonomous goal‑setting behaviours. As systems are deployed across contexts, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories. The HAIG framework operates across three levels: dimensions (Decision Authority, Process Autonomy, and Accountability Configuration), continua (continuous positional spectra along each dimension), and thresholds (critical points along the continua where governance requirements shift qualitatively). The framework's dimensional architecture is level‑agnostic, applicable from individual deployment decisions and organisational governance through to sectorial comparison and national and international regulatory design. Unlike risk‑based or principle‑based approaches that treat governance primarily as a constraint on AI deployment, HAIG adopts a trust‑utility orientation ‑ reframing governance as the condition under which human‑AI collaboration can realise its potential, calibrating oversight to specific relational contexts rather than predetermined categories. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for adaptive regulatory design that anticipates governance challenges before they emerge.

PaperID: 10694, https://arxiv.org/pdf/2505.01560.pdf

Authors: Vicent Briva Iglesias, Gokhan Dogru

Title: AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

Abstract:
Large language models (LLMs) and multi‑agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, Google Translate (strong NMT baseline), GPT‑4o (general‑purpose LLM), o1‑preview (reasoning‑enhanced LLM), and two GPT‑4o‑powered agentic workflows (sequential three‑stage and iterative refinement), on test data drawn from a legal contract and news prose in three English‑source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency with total input‑plus‑output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric‑language combinations; o1‑preview ties or places second in most remaining cases, while both multi‑agent workflows trail. Human evaluation reverses part of this narrative: o1‑preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single‑pass LLMs. We advocate multidimensional, cost‑aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single‑pass LLMs with targeted agent intervention.

PaperID: 10695, https://arxiv.org/pdf/2505.01482.pdf

Authors: Alice Rueda, Mohammed S. Hassan, Argyrios Perivolaris, Bazen G. Teferra, Reza Samavi, Sirisha Rambhatla, Yuqi Wu, Yanbo Zhang, Bo Cao, Divya Sharma, Sridhar Krishnan, Venkat Bhat

Title: Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem‑solving across various domains. However, their ability to perform complex, multi‑step reasoning task‑essential for applications in science, medicine, and law‑remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate‑Level GoogleProof Q&A (GPQA) dataset to assess the scientific reasoning of GPT‑4o. Five popular prompt engineering techniques and two tailored promptings were tested: baseline direct answer (zero‑shot), chain‑of‑thought (CoT), zero‑shot CoT, self‑ask, self‑consistency, decomposition, and multipath promptings. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem‑solving. The results indicated that self‑consistency outperformed the other prompt engineering technique with an accuracy of 52.99%, followed by direct answer (52.23%). Zero‑shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self‑ask (46.88%), and CoT (43.75%). Self‑consistency performed the second worst in explaining the answers. Simple techniques such as direct answer, CoT, and zero‑shot CoT have the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human‑in‑the‑loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.

PaperID: 10696, https://arxiv.org/pdf/2505.01453.pdf

Authors: Bharathkumar Hegde, Melanie Bouroche

Title: Safe and Efficient CAV Lane Changing using Decentralised Safety Shields

Abstract:
Lane changing is a complex decision‑making problem for Connected and Autonomous Vehicles (CAVs) as it requires balancing traffic efficiency with safety. Although traffic efficiency can be improved by using vehicular communication for training lane change controllers using Multi‑Agent Reinforcement Learning (MARL), ensuring safety is difficult. To address this issue, we propose a decentralised Hybrid Safety Shield (HSS) that combines optimisation and a rule‑based approach to guarantee safety. Our method applies control barrier functions to constrain longitudinal and lateral control inputs of a CAV to ensure safe manoeuvres. Additionally, we present an architecture to integrate HSS with MARL, called MARL‑HSS, to improve traffic efficiency while ensuring safety. We evaluate MARL‑HSS using a gym‑like environment that simulates an on‑ramp merging scenario with two levels of traffic densities, such as light and moderate densities. The results show that HSS provides a safety guarantee by strictly enforcing a dynamic safety constraint defined on a time headway, even in moderate traffic density that offers challenging lane change scenarios. Moreover, the proposed method learns stable policies compared to the baseline, a state‑of‑the‑art MARL lane change controller without a safety shield. Further policy evaluation shows that our method achieves a balance between safety and traffic efficiency with zero crashes and comparable average speeds in light and moderate traffic densities.

PaperID: 10697, https://arxiv.org/pdf/2505.01441.pdf

Authors: Joykirat Singh, Raghav Magazine, Yash Pandya, Akshay Nambi

Title: Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Abstract:
Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text‑only reasoning. Real‑world problem solving often demands dynamic, multi‑step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self‑improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi‑turn reasoning chains, leveraging outcome‑based RL to learn robust strategies for tool use and environment interaction without requiring step‑level supervision. Extensive experiments on mathematical reasoning and multi‑turn function calling benchmarks show that ARTIST consistently outperforms state‑of‑the‑art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher‑quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem‑solving in LLMs.

PaperID: 10698, https://arxiv.org/pdf/2505.01302.pdf

Authors: Tianhao Li, Yibei Li, Zhixin Liu, Xiaoming Hu

Title: Pattern formation using an intrinsic optimal control approach

Abstract:
This paper investigates a pattern formation control problem for a multi‑agent system modeled with given interaction topology, in which m of the n agents are chosen as leaders and consequently a control signal is added to each of the leaders. These agents interact with each other by Laplacian dynamics on a graph. The pattern formation control problem is formulated as an intrinsic infinite time‑horizon linear quadratic optimal control problem, namely, no error information is incorporated in the objective function. Under mild conditions, we show the existence of the optimal control strategy and the convergence to the desired pattern formation. Based on the optimal control strategy, we propose a distributed control strategy to achieve the given pattern. Finally, numerical simulation is given to illustrate theoretical results.

PaperID: 10699, https://arxiv.org/pdf/2505.01115.pdf

Authors: Palok Biswas, Zuzanna Osika, Isidoro Tamassia, Adit Whorra, Jazmin Zatarain-Salazar, Jan Kwakkel, Frans A. Oliehoek, Pradeep K. Murukannaiah

Title: Exploring Equity of Climate Policies using Multi-Agent Multi-Objective Reinforcement Learning

Abstract:
Addressing climate change requires coordinated policy efforts of nations worldwide. These efforts are informed by scientific reports, which rely in part on Integrated Assessment Models (IAMs), prominent tools used to assess the economic impacts of climate policies. However, traditional IAMs optimize policies based on a single objective, limiting their ability to capture the trade‑offs among economic growth, temperature goals, and climate justice. As a result, policy recommendations have been criticized for perpetuating inequalities, fueling disagreements during policy negotiations. We introduce Justice, the first framework integrating IAM with Multi‑Objective Multi‑Agent Reinforcement Learning (MOMARL). By incorporating multiple objectives, Justice generates policy recommendations that shed light on equity while balancing climate and economic goals. Further, using multiple agents can provide a realistic representation of the interactions among the diverse policy actors. We identify equitable Pareto‑optimal policies using our framework, which facilitates deliberative decision‑making by presenting policymakers with the inherent trade‑offs in climate and economic policy.

PaperID: 10700, https://arxiv.org/pdf/2505.00989.pdf

Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu

Title: VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

Abstract:
Vessel Traffic Services (VTS) are essential for maritime safety and regulatory compliance through real‑time traffic management. However, with increasing traffic complexity and the prevalence of heterogeneous, multimodal data, existing VTS systems face limitations in spatiotemporal reasoning and intuitive human interaction. In this work, we propose VTS‑LLM Agent, the first domain‑adaptive large LLM agent tailored for interactive decision support in VTS operations. We formalize risk‑prone vessel identification as a knowledge‑augmented Text‑to‑SQL task, combining structured vessel databases with external maritime knowledge. To support this, we construct a curated benchmark dataset consisting of a custom schema, domain‑specific corpus, and a query‑SQL test set in multiple linguistic styles. Our framework incorporates NER‑based relational reasoning, agent‑based domain knowledge injection, semantic algebra intermediate representation, and query rethink mechanisms to enhance domain grounding and context‑aware understanding. Experimental results show that VTS‑LLM outperforms both general‑purpose and SQL‑focused baselines under command‑style, operational‑style, and formal natural language queries, respectively. Moreover, our analysis provides the first empirical evidence that linguistic style variation introduces systematic performance challenges in Text‑to‑SQL modeling. This work lays the foundation for natural language interfaces in vessel traffic services and opens new opportunities for proactive, LLM‑driven maritime real‑time traffic management.

PaperID: 10701, https://arxiv.org/pdf/2505.00981.pdf

Authors: Lijian Chen, Wei Yuan, Tong Chen, Xiangyu Zhao, Nguyen Quoc Viet Hung, Hongzhi Yin

Title: Multi-agents based User Values Mining for Recommendation

Abstract:
Recommender systems have rapidly evolved and become integral to many online services. However, existing systems sometimes produce unstable and unsatisfactory recommendations that fail to align with users' fundamental and long‑term preferences. This is because they primarily focus on extracting shallow and short‑term interests from user behavior data, which is inherently dynamic and challenging to model. Unlike these transient interests, user values are more stable and play a crucial role in shaping user behaviors, such as purchasing items and consuming content. Incorporating user values into recommender systems can help stabilize recommendation performance and ensure results better reflect users' latent preferences. However, acquiring user values is typically difficult and costly. To address this challenge, we leverage the strong language understanding, zero‑shot inference, and generalization capabilities of Large Language Models (LLMs) to extract user values from users' historical interactions. Unfortunately, direct extraction using LLMs presents several challenges such as length constraints and hallucination. To overcome these issues, we propose ZOOM, a zero‑shot multi‑LLM collaborative framework for effective and accurate user value extraction. In ZOOM, we apply text summarization techniques to condense item content while preserving essential meaning. To mitigate hallucinations, ZOOM introduces two specialized agent roles: evaluators and supervisors, to collaboratively generate accurate user values. Extensive experiments on two widely used recommendation datasets with two state‑of‑the‑art recommendation models demonstrate the effectiveness and generalization of our framework in automatic user value mining and recommendation performance improvement.

PaperID: 10702, https://arxiv.org/pdf/2505.00972.pdf

Authors: Yuewen Mei, Tong Nie, Jian Sun, Ye Tian

Title: Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models

Abstract:
Simulation‑based testing is crucial for validating autonomous vehicles (AVs), yet existing scenario generation methods either overfit to common driving patterns or operate in an offline, non‑interactive manner that fails to expose rare, safety‑critical corner cases. In this paper, we introduce an online, retrieval‑augmented large language model (LLM) framework for generating safety‑critical driving scenarios. Our method first employs an LLM‑based behavior analyzer to infer the most dangerous intent of the background vehicle from the observed state, then queries additional LLM agents to synthesize feasible adversarial trajectories. To mitigate catastrophic forgetting and accelerate adaptation, we augment the framework with a dynamic memorization and retrieval bank of intent‑planner pairs, automatically expanding its behavioral library when novel intents arise. Evaluations using the Waymo Open Motion Dataset demonstrate that our model reduces the mean minimum time‑to‑collision from 1.62 to 1.08 s and incurs a 75% collision rate, substantially outperforming baselines.

PaperID: 10703, https://arxiv.org/pdf/2505.00945.pdf

Authors: Xiaoshan Huang, Jie Gao, Haolun Wu

Title: SSRLBot: Designing and Developing a Large Language Model-based Agent using Socially Shared Regulated Learning

Abstract:
Large language model (LLM)‑‑based agents have emerged as pivotal tools in assisting human experts across various fields by transforming complex tasks into more efficient workflows and providing actionable stakeholder insights. Despite their potential, the application of LLM‑based agents for medical education remains underexplored. The study aims to assist in evaluating the students' process and outcomes on medical case diagnosis and discussion while incorporating the theoretical framework of Socially Shared Regulation of Learning (SSRL) to assess student performance. SSRL emphasizes metacognitive, cognitive, motivational, and emotional interactions, highlighting the collaborative management of learning processes to improve decision‑making outcomes. Grounded in SSRL theory, this tool paper introduces SSRLBot, an LLM‑based agent designed to enable team members to reflect on their diagnostic performance and the key SSRL skills that foster team success. SSRLBot's core functions include summarizing dialogue content, analyzing participants' SSRL skills, and evaluating students' diagnostic results. Meanwhile, we evaluated SSRLBot through diagnostic conversation data collected from six groups (12 participants, 1926 conversational turns). Results showed that SSRLBot can deliver detailed, theory‑aligned evaluations, link specific behaviors to SSRL dimensions, and offer actionable recommendations for improving teamwork. The findings address a critical gap in medical education, advancing the application of LLM agents to enhance team‑based decision‑making and collaboration in high‑stakes environments.

PaperID: 10704, https://arxiv.org/pdf/2505.00935.pdf

Authors: Roberto Bigazzi

Title: Autonomous Embodied Agents: When Robotics Meets Deep Learning Reasoning

Abstract:
The increase in available computing power and the Deep Learning revolution have allowed the exploration of new topics and frontiers in Artificial Intelligence research. A new field called Embodied Artificial Intelligence, which places at the intersection of Computer Vision, Robotics, and Decision Making, has been gaining importance during the last few years, as it aims to foster the development of smart autonomous robots and their deployment in society. The recent availability of large collections of 3D models for photorealistic robotic simulation has allowed faster and safe training of learning‑based agents for millions of frames and a careful evaluation of their behavior before deploying the models on real robotic platforms. These intelligent agents are intended to perform a certain task in a possibly unknown environment. To this end, during the training in simulation, the agents learn to perform continuous interactions with the surroundings, such as gathering information from the environment, encoding and extracting useful cues for the task, and performing actions towards the final goal; where every action of the agent influences the interactions. This dissertation follows the complete creation process of embodied agents for indoor environments, from their concept to their implementation and deployment. We aim to contribute to research in Embodied AI and autonomous agents, in order to foster future work in this field. We present a detailed analysis of the procedure behind implementing an intelligent embodied agent, comprehending a thorough description of the current state‑of‑the‑art in literature, technical explanations of the proposed methods, and accurate experimental studies on relevant robotic tasks.

PaperID: 10705, https://arxiv.org/pdf/2505.00928.pdf

Authors: Adam Casselman, Manav Vora, Melkior Ornik

Title: Virtual Force-Based Routing of Modular Agents on a Graph

Abstract:
Modular vehicles present a novel area of academic and industrial interest in the field of multi‑agent systems. Modularity allows vehicles to connect and disconnect with each other mid‑transit which provides a balance between efficiency and flexibility when solving complex and large scale tasks in urban or aerial transportation. This paper details a generalized scheme to route multiple modular agents on a graph to a predetermined set of target nodes. The objective is to visit all target nodes while incurring minimum resource expenditure. Agents that are joined together will incur the equivalent cost of a single agent, which is motivated by the logistical benefits of traffic reduction and increased fuel efficiency. To solve this problem, we introduce a novel algorithm that seeks to balance the optimality of the path that every single module takes and the cost benefit of joining modules. Our approach models the agents and targets as point charges, where the modules take the path of highest attractive force from its target node and neighboring agents. We validate our approach by simulating multiple modular agents along real‑world transportation routes in the road network of Champaign‑Urbana, Illinois, USA. The proposed method easily exceeds the available benchmarks and illustrates the benefits of modularity in multi‑target planning problems.

PaperID: 10706, https://arxiv.org/pdf/2505.00820.pdf

Authors: Zhaoxing Li, Wenbo Wu, Yue Wang, Yanran Xu, William Hunt, Sebastian Stein

Title: HMCF: A Human-in-the-loop Multi-Robot Collaboration Framework Based on Large Language Models

Abstract:
Rapid advancements in artificial intelligence (AI) have enabled robots to performcomplex tasks autonomously with increasing precision. However, multi‑robot systems (MRSs) face challenges in generalization, heterogeneity, and safety, especially when scaling to large‑scale deployments like disaster response. Traditional approaches often lack generalization, requiring extensive engineering for new tasks and scenarios, and struggle with managing diverse robots. To overcome these limitations, we propose a Human‑in‑the‑loop Multi‑Robot Collaboration Framework (HMCF) powered by large language models (LLMs). LLMs enhance adaptability by reasoning over diverse tasks and robot capabilities, while human oversight ensures safety and reliability, intervening only when necessary. Our framework seamlessly integrates human oversight, LLM agents, and heterogeneous robots to optimize task allocation and execution. Each robot is equipped with an LLM agent capable of understanding its capabilities, converting tasks into executable instructions, and reducing hallucinations through task verification and human supervision. Simulation results show that our framework outperforms state‑of‑the‑art task planning methods, achieving higher task success rates with an improvement of 4.76%. Real‑world tests demonstrate its robust zero‑shot generalization feature and ability to handle diverse tasks and environments with minimal human intervention.

PaperID: 10707, https://arxiv.org/pdf/2505.00776.pdf

Authors: Alessandro Raganato, Rafael Peñaloza, Marco Viviani, Gabriella Pasi

Title: Reasoning Capabilities and Invariability of Large Language Models

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in manipulating natural language across multiple applications, but their ability to handle simple reasoning tasks is often questioned. In this work, we aim to provide a comprehensive analysis of LLMs' reasoning competence, specifically focusing on their prompt dependency. In particular, we introduce a new benchmark dataset with a series of simple reasoning questions demanding shallow logical reasoning. Aligned with cognitive psychology standards, the questions are confined to a basic domain revolving around geometric figures, ensuring that responses are independent of any pre‑existing intuition about the world and rely solely on deduction. An empirical analysis involving zero‑shot and few‑shot prompting across 24 LLMs of different sizes reveals that, while LLMs with over 70 billion parameters perform better in the zero‑shot setting, there is still a large room for improvement. An additional test with chain‑of‑thought prompting over 22 LLMs shows that this additional prompt can aid or damage the performance of models, depending on whether the rationale is required before or after the answer.

PaperID: 10708, https://arxiv.org/pdf/2505.00749.pdf

Authors: Roman J. Georgio, Caelum Forder, Suman Deb, Andri Rahimov, Peter Carroll, Önder Gürcan

Title: Coral Protocol: Open Infrastructure Connecting The Internet of Agents

Abstract:
Coral Protocol is an open and decentralized collaboration infrastructure that enables communication, coordination, trust and payments for The Internet of Agents. It addresses the growing need for interoperability in a world where organizations are deploying multiple specialized AI agents that must work together across domains and vendors. As a foundational platform for multi‑agent AI ecosystems, Coral establishes a common language and coordination framework allowing any agent to participate in complex workflows with others. Its design emphasizes broad compatibility, security, and vendor neutrality, ensuring that agent interactions are efficient and trustworthy. In particular, Coral introduces standardized messaging formats for agent communication, a modular coordination mechanism for orchestrating multi‑agent tasks, and secure team formation capabilities for dynamically assembling trusted groups of agents. Together, these innovations position Coral Protocol as a cornerstone of the emerging "Internet of Agents," unlocking new levels of automation, collective intelligence, and business value through open agent collaboration.

PaperID: 10709, https://arxiv.org/pdf/2505.00747.pdf

Authors: Zhiying Song, Tenghui Xie, Fuxi Wen, Jun Li

Title: Wireless Communication as an Information Sensor for Multi-agent Cooperative Perception: A Survey

Abstract:
Cooperative perception extends the perception capabilities of autonomous vehicles by enabling multi‑agent information sharing via Vehicle‑to‑Everything (V2X) communication. Unlike traditional onboard sensors, V2X acts as a dynamic "information sensor" characterized by limited communication, heterogeneity, mobility, and scalability. This survey provides a comprehensive review of recent advancements from the perspective of information‑centric cooperative perception, focusing on three key dimensions: information representation, information fusion, and large‑scale deployment. We categorize information representation into data‑level, feature‑level, and object‑level schemes, and highlight emerging methods for reducing data volume and compressing messages under communication constraints. In information fusion, we explore techniques under both ideal and non‑ideal conditions, including those addressing heterogeneity, localization errors, latency, and packet loss. Finally, we summarize system‑level approaches to support scalability in dense traffic scenarios. Compared with existing surveys, this paper introduces a new perspective by treating V2X communication as an information sensor and emphasizing the challenges of deploying cooperative perception in real‑world intelligent transportation systems.

PaperID: 10710, https://arxiv.org/pdf/2505.00743.pdf

Authors: Yinfeng Yu, Dongsheng Yang

Title: DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with the surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods input complete language instructions directly into multi‑layer Transformer networks without fully exploiting the detailed information within the instructions, thereby limiting the agent's language understanding capabilities during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation decisions. We propose a Dual Object Perception‑Enhancement Network (DOPE) to address these issues to improve navigation performance. First, we design a Text Semantic Extraction (TSE) to extract relatively essential phrases from the text and input them into the Text Object Perception‑Augmentation (TOPA) to fully leverage details such as objects and actions within the instructions. Second, we introduce an Image Object Perception‑Augmentation (IOPA), which performs additional modeling of object information across different modalities, enabling the model to more effectively utilize latent clues between objects in images and text, enhancing decision‑making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach.

PaperID: 10711, https://arxiv.org/pdf/2505.00733.pdf

Authors: Gustavo Rezende Silva, Juliane Päßler, S. Lizeth Tapia Tarifa, Einar Broch Johnsen, Carlos Hernández Corbato

Title: ROSA: A Knowledge-based Solution for Robot Self-Adaptation

Abstract:
Autonomous robots must operate in diverse environments and handle multiple tasks despite uncertainties. This creates challenges in designing software architectures and task decision‑making algorithms, as different contexts may require distinct task logic and architectural configurations. To address this, robotic systems can be designed as self‑adaptive systems capable of adapting their task execution and software architecture at runtime based on their context.This paper introduces ROSA, a novel knowledge‑based framework for RObot Self‑Adaptation, which enables task‑and‑architecture co‑adaptation (TACA) in robotic systems. ROSA achieves this by providing a knowledge model that captures all application‑specific knowledge required for adaptation and by reasoning over this knowledge at runtime to determine when and how adaptation should occur. In addition to a conceptual framework, this work provides an open‑source ROS 2‑based reference implementation of ROSA and evaluates its feasibility and performance in an underwater robotics application. Experimental results highlight ROSA's advantages in reusability and development effort for designing self‑adaptive robotic systems.

PaperID: 10712, https://arxiv.org/pdf/2505.00586.pdf

Authors: Jiarong Wei, Niclas Vödisch, Anna Rehr, Christian Feist, Abhinav Valada

Title: ParkDiffusion: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction for Automated Parking using Diffusion Models

Abstract:
Automated parking is a critical feature of Advanced Driver Assistance Systems (ADAS), where accurate trajectory prediction is essential to bridge perception and planning modules. Despite its significance, research in this domain remains relatively limited, with most existing studies concentrating on single‑modal trajectory prediction of vehicles. In this work, we propose ParkDiffusion, a novel approach that predicts the trajectories of both vehicles and pedestrians in automated parking scenarios. ParkDiffusion employs diffusion models to capture the inherent uncertainty and multi‑modality of future trajectories, incorporating several key innovations. First, we propose a dual map encoder that processes soft semantic cues and hard geometric constraints using a two‑step cross‑attention mechanism. Second, we introduce an adaptive agent type embedding module, which dynamically conditions the prediction process on the distinct characteristics of vehicles and pedestrians. Third, to ensure kinematic feasibility, our model outputs control signals that are subsequently used within a kinematic framework to generate physically feasible trajectories. We evaluate ParkDiffusion on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Our work establishes a new baseline for heterogeneous trajectory prediction in parking scenarios, outperforming existing methods by a considerable margin.

PaperID: 10713, https://arxiv.org/pdf/2505.00540.pdf

Authors: Ian O'Flynn, Harun Šiljak

Title: Emergence of Roles in Robotic Teams with Model Sharing and Limited Communication

Abstract:
We present a reinforcement learning strategy for use in multi‑agent foraging systems in which the learning is centralised to a single agent and its model is periodically disseminated among the population of non‑learning agents. In a domain where multi‑agent reinforcement learning (MARL) is the common approach, this approach aims to significantly reduce the computational and energy demands compared to approaches such as MARL and centralised learning models. By developing high performing foraging agents, these approaches can be translated into real‑world applications such as logistics, environmental monitoring, and autonomous exploration. A reward function was incorporated into this approach that promotes role development among agents, without explicit directives. This led to the differentiation of behaviours among the agents. The implicit encouragement of role differentiation allows for dynamic actions in which agents can alter roles dependent on their interactions with the environment without the need for explicit communication between agents.

PaperID: 10714, https://arxiv.org/pdf/2505.00515.pdf

Authors: Mingxing Peng, Ruoyu Yao, Xusen Guo, Yuting Xie, Xianda Chen, Jun Ma

Title: Safety-Critical Traffic Simulation with Guided Latent Diffusion Model

Abstract:
Safety‑critical traffic simulation plays a crucial role in evaluating autonomous driving systems under rare and challenging scenarios. However, existing approaches often generate unrealistic scenarios due to insufficient consideration of physical plausibility and suffer from low generation efficiency. To address these limitations, we propose a guided latent diffusion model (LDM) capable of generating physically realistic and adversarial safety‑critical traffic scenarios. Specifically, our model employs a graph‑based variational autoencoder (VAE) to learn a compact latent space that captures complex multi‑agent interactions while improving computational efficiency. Within this latent space, the diffusion model performs the denoising process to produce realistic trajectories. To enable controllable and adversarial scenario generation, we introduce novel guidance objectives that drive the diffusion process toward producing adversarial and behaviorally realistic driving behaviors. Furthermore, we develop a sample selection module based on physical feasibility checks to further enhance the physical plausibility of the generated scenarios. Extensive experiments on the nuScenes dataset demonstrate that our method achieves superior adversarial effectiveness and generation efficiency compared to existing baselines while maintaining a high level of realism. Our work provides an effective tool for realistic safety‑critical scenario simulation, paving the way for more robust evaluation of autonomous driving systems.

PaperID: 10715, https://arxiv.org/pdf/2505.00472.pdf

Authors: Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta, Anders Lindgren, Naser Hossein Motlagh, Schahram Dustdar, Susanna Pirttikangas, Lauri Lovén

Title: UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces

Abstract:
Agentic Artificial Intelligence (AI) constitutes a transformative paradigm in the evolution of intelligent agents and decision‑support systems, redefining smart environments by enhancing operational efficiency, optimizing resource allocation, and strengthening systemic resilience. This paper presents UserCentrix, a hybrid agentic orchestration framework for smart spaces that optimizes resource management and enhances user experience through urgency‑aware and intent‑driven decision‑making mechanisms. The framework integrates interactive modules equipped with agentic behavior and autonomous decision‑making capabilities to dynamically balance latency, accuracy, and computational cost. User intent functions as a governing control signal that prioritizes decisions, regulates task execution and resource allocation, and guides the adaptation of decision‑making strategies to balance trade‑offs between speed and accuracy. Experimental results demonstrate that the framework autonomously enables efficient intent processing and real‑time monitoring, while balancing reasoning quality and computational efficiency, particularly under resource‑constrained edge conditions.

PaperID: 10716, https://arxiv.org/pdf/2505.00322.pdf

Authors: Keshu Wu, Zihao Li, Sixu Li, Xinyue Ye, Dominique Lord, Yang Zhou

Title: AI2-Active Safety: AI-enabled Interaction-aware Active Safety Analysis with Vehicle Dynamics

Abstract:
This paper introduces an AI‑enabled, interaction‑aware active safety analysis framework that accounts for groupwise vehicle interactions. Specifically, the framework employs a bicycle model‑augmented with road gradient considerations‑to accurately capture vehicle dynamics. In parallel, a hypergraph‑based AI model is developed to predict probabilistic trajectories of ambient traffic. By integrating these two components, the framework derives vehicle intra‑spacing over a 3D road surface as the solution of a stochastic ordinary differential equation, yielding high‑fidelity surrogate safety measures such as time‑to‑collision (TTC). To demonstrate its effectiveness, the framework is analyzed using stochastic numerical methods comprising 4th‑order Runge‑Kutta integration and AI inference, generating probability‑weighted high‑fidelity TTC (HF‑TTC) distributions that reflect complex multi‑agent maneuvers and behavioral uncertainties. Evaluated with HF‑TTC against traditional constant‑velocity TTC and non‑interaction‑aware approaches on highway datasets, the proposed framework offers a systematic methodology for active safety analysis with enhanced potential for improving safety perception in complex traffic environments.

PaperID: 10717, https://arxiv.org/pdf/2505.00270.pdf

Authors: Yijie Xia, Xiaohan Lin, Zicheng Ma, Jinyuan Hu, Yanheng Li, Zhaoxin Xie, Hao Li, Li Yang, Zhiqiang Zhao, Lijiang Yang, Zhenyu Chen, Yi Qin Gao

Title: Large Language Models as AI Agents for Digital Atoms and Molecules: Catalyzing a New Era in Computational Biophysics

Abstract:
In computational biophysics, where molecular data is expanding rapidly and system complexity is increasing exponentially, large language models (LLMs) and agent‑based systems are fundamentally reshaping the field. This perspective article examines the recent advances at the intersection of LLMs, intelligent agents, and scientific computation, with a focus on biophysical computation. Building on these advancements, we introduce ADAM (Agent for Digital Atoms and Molecules), an innovative multi‑agent LLM‑based framework. ADAM employs cutting‑edge AI architectures to reshape scientific workflows through a modular design. It adopts a hybrid neural‑symbolic architecture that combines LLM‑driven semantic tools with deterministic symbolic computations. Moreover, its ADAM Tool Protocol (ATP) enables asynchronous, database‑centric tool orchestration, fostering community‑driven extensibility. Despite the significant progress made, ongoing challenges call for further efforts in establishing benchmarking standards, optimizing foundational models and agents, building an open collaborative ecosystem and developing personalized memory modules. ADAM is accessible at https://sidereus‑ai.com.

PaperID: 10718, https://arxiv.org/pdf/2505.00234.pdf

Authors: Vishnu Sarukkai, Zhiqiang Xie, Kayvon Fatahalian

Title: Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Abstract:
Improving Large Language Model (LLM) agents for sequential decision‑making tasks typically requires extensive task‑specific knowledge engineering‑‑custom prompts, curated examples, and specialized observation/action spaces. We investigate a different approach where agents automatically improve by learning from their own successful experiences without human intervention. Our method constructs and refines a database of self‑generated trajectories that serve as in‑context examples for future tasks. Even naive accumulation of successful trajectories yields substantial performance gains across three diverse benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode‑SQL (75% to 79%). These improvements exceed those achieved by upgrading from gpt‑4o‑mini to gpt‑4o and match the performance of allowing multiple attempts per task. We further enhance this approach with two innovations: database‑level curation using population‑based training to propagate high‑performing example collections, and exemplar‑level curation that selectively retains trajectories based on their empirical utility as in‑context examples. With these enhancements, our method achieves 93% success on ALFWorld‑‑surpassing approaches that use more powerful LLMs and hand‑crafted components. Our trajectory bootstrapping technique demonstrates that agents can autonomously improve through experience, offering a scalable alternative to labor‑intensive knowledge engineering.

PaperID: 10719, https://arxiv.org/pdf/2505.00216.pdf

Authors: Xuwei Yang, Fatemeh Tavakoli, David B. Emerson, Anastasis Kratsios

Title: Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders

Abstract:
Most industry‑standard generative AIs and feature encoders are proprietary, offering only black‑box access: their outputs are observable, but their internal parameters and architectures remain hidden from the end‑user. This black‑box access is especially limiting when constructing mixture‑of‑expert type ensemble models since the user cannot optimize each proprietary AI's internal parameters. Our problem naturally lends itself to a non‑competitive game‑theoretic lens where each proprietary AI (agent) is inherently competing against the other AI agents, with this competition arising naturally due to their obliviousness of the AI's to their internal structure. In contrast, the user acts as a central planner trying to synchronize the ensemble of competing AIs. We show the existence of the unique Nash equilibrium in the online setting, which we even compute in closed‑form by eliciting a feedback mechanism between any given time series and the sequence generated by each (proprietary) AI agent. Our solution is implemented as a decentralized, federated‑learning algorithm in which each agent optimizes their structure locally on their machine without ever releasing any internal structure to the others. We obtain refined expressions for pre‑trained models such as transformers, random feature models, and echo‑state networks. Our ``proprietary federated learning'' algorithm is implemented on a range of real‑world and synthetic time‑series benchmarks. It achieves orders‑of‑magnitude improvements in predictive accuracy over natural benchmarks, of which there are surprisingly few due to this natural problem still being largely unexplored.

PaperID: 10720, https://arxiv.org/pdf/2505.00213.pdf

Authors: Tianyu Qiu, Eric Ouano, Fernando Palafox, Christian Ellis, David Fridovich-Keil

Title: A Player Selection Network for Scalable Game-Theoretic Prediction and Planning

Abstract:
While game‑theoretic planning frameworks are effective at modeling multi‑agent interactions, they require solving large optimization problems where the number of variables increases with the number of agents, resulting in long computation times that limit their use in large‑scale, real‑time systems. To address this issue, we propose 1) PSN Game‑a learning‑based, game‑theoretic prediction and planning framework that reduces game size by learning a Player Selection Network (PSN); and 2) a Goal Inference Network (GIN) that makes it possible to use the PSN in incomplete‑information games where other agents' intentions are unknown to the ego agent. A PSN outputs a player selection mask that distinguishes influential players from less relevant ones, enabling the ego player to solve a smaller, masked game involving only selected players. By reducing the number of players included in the game, PSN shrinks the corresponding optimization problems, leading to faster solve times. Experiments in both simulated scenarios and real‑world pedestrian trajectory datasets show that PSN is competitive with, and often improves upon, the evaluated explicit game‑theoretic selection baselines in 1) prediction accuracy and 2) planning safety. Across scenarios, PSN typically selects substantially fewer players than are present in the full game, thereby reducing game size and planning complexity. PSN also generalizes to settings in which agents' objectives are unknown, via the GIN, without test‑time fine‑tuning. By selecting only the most relevant players for decision‑making, PSN Game provides a practical mechanism for reducing planning complexity that can be integrated into existing multi‑agent planning frameworks.

PaperID: 10721, https://arxiv.org/pdf/2505.00201.pdf

Authors: Yasin Findik, Christopher Coco, Reza Azadeh

Title: Investigating Adaptive Tuning of Assistive Exoskeletons Using Offline Reinforcement Learning: Challenges and Insights

Abstract:
Assistive exoskeletons have shown great potential in enhancing mobility for individuals with motor impairments, yet their effectiveness relies on precise parameter tuning for personalized assistance. In this study, we investigate the potential of offline reinforcement learning for optimizing effort thresholds in upper‑limb assistive exoskeletons, aiming to reduce reliance on manual calibration. Specifically, we frame the problem as a multi‑agent system where separate agents optimize biceps and triceps effort thresholds, enabling a more adaptive and data‑driven approach to exoskeleton control. Mixed Q‑Functionals (MQF) is employed to efficiently handle continuous action spaces while leveraging pre‑collected data, thereby mitigating the risks associated with real‑time exploration. Experiments were conducted using the MyoPro 2 exoskeleton across two distinct tasks involving horizontal and vertical arm movements. Our results indicate that the proposed approach can dynamically adjust threshold values based on learned patterns, potentially improving user interaction and control, though performance evaluation remains challenging due to dataset limitations.

PaperID: 10722, https://arxiv.org/pdf/2505.00055.pdf

Authors: Zhuoqi Zeng, Yuxiang Wei, Jiawen Kang

Title: TinyMA-IEI-PPO: Exploration Incentive-Driven Multi-Agent DRL with Self-Adaptive Pruning for Vehicular Embodied AI Agent Twins Migration

Abstract:
Embodied Artificial Intelligence (EAI) addresses autonomous driving challenges in Vehicular Embodied AI Networks (VEANETs) through multi‑modal perception, adaptive decision‑making, and hardware‑software co‑scheduling. However, the computational demands of virtual services and the inherent mobility of autonomous vehicles (AVs) necessitate real‑time migration of Vehicular Embodied Agent AI Twins (VEAATs) between resource‑constrained Roadside Units (RSUs). This paper proposes a novel framework for efficient VEAAT migration in VEANETs, combining a multi‑leader multi‑follower (MLMF) Stackelberg game‑theoretic incentive mechanism with a tiny multi‑agent deep reinforcement learning (MADRL) algorithm. First, We propose an virtual immersive experience‑driven utility model that captures AV‑RSU dynamic interactions by integrating AVs' social influence, service complementarity and substitutability, and RSUs' resource allocation strategies to optimize VEAAT migration decisions. Second, to enhance training efficiency and enable efficient deployment on computation‑constrained AVs while preserving exploration‑exploitation performance, we propose TinyMA‑IEI‑PPO, a self‑adaptive dynamic structured pruning algorithm that dynamically adjusts neuron importance based on agents' exploration incentives. Numerical results demonstrate that our approach achieves convergence comparable to baseline models and closely approximates the Stackelberg equilibrium.

PaperID: 10723, https://arxiv.org/pdf/2505.00018.pdf

Authors: Ju Wu, Calvin K. L. Or

Title: Position Paper: Towards Open Complex Human-AI Agents Collaboration Systems for Problem Solving and Knowledge Management

Abstract:
We propose a technology‑agnostic, collaboration‑ready stance for Human‑AI Agents Collaboration Systems (HAACS) that closes long‑standing gaps in prior stages (automation; flexible autonomy; agentic multi‑agent collectives). Reading empirical patterns through a seven‑dimension collaboration spine and human‑agent contrasts, we identify missing pieces: principled budgeting of initiative, instantaneous and auditable reconfiguration, a system‑wide knowledge backbone with an epistemic promotion gate, capacity‑aware human interfaces; and, as a prerequisite to all of the above, unified definitions of agent and formal collaborative dynamics. We respond with (i) a boundary‑centric ontology of agenthood synthesized with cybernetics; (ii) a Petri net family (colored and interpreted) that models ownership, cross‑boundary interaction, concurrency, guards, and rates with collaboration transitions; and (iii) a three‑level orchestration (meta, agent, execution) that governs behavior families via guard flips. On the knowledge side, we ground collaborative learning in Conversation Theory and SECI with teach‑back gates and an evolving backbone; on the problem‑solving side, we coordinate routine MEA‑style control with practice‑guided open‑ended discovery. The result is the Hierarchical Exploration‑Exploitation Net (HE2‑Net): a policy‑controlled stance that splits provisional from validated assets, promotes only after tests and peer checks, and budgets concurrent probing while keeping reuse fast and safe. We show interoperability with emerging agent protocols without ad hoc glue and sketch bio‑cybernetic extensions (autopoiesis, autogenesis, evolving boundaries, synergetics, etc). Altogether, the framework keeps humans central to setting aims, justifying knowledge, and steering theory‑practice dynamics, while scaling agents as reliable collaborators within audited governance.

PaperID: 10724, https://arxiv.org/pdf/2504.21596.pdf

Authors: Huihui Guo, Huilong Pi, Yunchuan Qin, Zhuo Tang, Kenli Li

Title: Leveraging Pre-trained Large Language Models with Refined Prompting for Online Task and Motion Planning

Abstract:
With the rapid advancement of artificial intelligence, there is an increasing demand for intelligent robots capable of assisting humans in daily tasks and performing complex operations. Such robots not only require task planning capabilities but must also execute tasks with stability and robustness. In this paper, we present a closed‑loop task planning and acting system, LLM‑PAS, which is assisted by a pre‑trained Large Language Model (LLM). While LLM‑PAS plans long‑horizon tasks in a manner similar to traditional task and motion planners, it also emphasizes the execution phase of the task. By transferring part of the constraint‑checking process from the planning phase to the execution phase, LLM‑PAS enables exploration of the constraint space and delivers more accurate feedback on environmental anomalies during execution. The reasoning capabilities of the LLM allow it to handle anomalies that cannot be addressed by the robust executor. To further enhance the system's ability to assist the planner during replanning, we propose the First Look Prompting (FLP) method, which induces LLM to generate effective PDDL goals. Through comparative prompting experiments and systematic experiments, we demonstrate the effectiveness and robustness of LLM‑PAS in handling anomalous conditions during task execution.

PaperID: 10725, https://arxiv.org/pdf/2504.21443.pdf

Authors: Pelin Sekercioglu, Angela Fontan, Dimos V. Dimarogonas

Title: Stability of Open Multi-agent Systems over Dynamic Signed Graphs

Abstract:
This paper addresses the bipartite consensus‑control problem in open multi‑agent systems containing both cooperative and antagonistic interactions. In these systems, new agents can join and new interactions can be formed over time. Moreover, the types of interactions, cooperative or antagonistic, may change. To model these structural changes, we represent the system as a switched system interconnected over a dynamic signed graph. Using the signed edge‑based agreement protocol and constructing strict Lyapunov functions for signed edge‑Laplacian matrices with multiple zero eigenvalues, we establish global asymptotic stability of the bipartite consensus control. Numerical simulations validate our theoretical results.

PaperID: 10726, https://arxiv.org/pdf/2504.21433.pdf

Authors: Zhicong Li, Hangyu Mao, Jiangjin Yin, Mingzhe Xing, Zhiwei Xu, Yuanxing Zhang, Yang Xiao

Title: NGENT: Next-Generation AI Agents Must Integrate Multi-Domain Abilities to Achieve Artificial General Intelligence

Abstract:
This paper argues that the next generation of AI agent (NGENT) should integrate across‑domain abilities to advance toward Artificial General Intelligence (AGI). Although current AI agents are effective in specialized tasks such as robotics, role‑playing, and tool‑using, they remain confined to narrow domains. We propose that future AI agents should synthesize the strengths of these specialized systems into a unified framework capable of operating across text, vision, robotics, reinforcement learning, emotional intelligence, and beyond. This integration is not only feasible but also essential for achieving the versatility and adaptability that characterize human intelligence. The convergence of technologies across AI domains, coupled with increasing user demand for cross‑domain capabilities, suggests that such integration is within reach. Ultimately, the development of these versatile agents is a critical step toward realizing AGI. This paper explores the rationale for this shift, potential pathways for achieving it.

PaperID: 10727, https://arxiv.org/pdf/2504.21347.pdf

Authors: Seonghee Lee, Denae Ford, John Tang, Sasa Junuzovic, Asta Roseway, Ed Cutrell, Kori Inkpen

Title: IRL Dittos: Embodied Multimodal AI Agent Interactions in Open Spaces

Abstract:
We introduce the In Real Life (IRL) Ditto, an AI‑driven embodied agent designed to represent remote colleagues in shared office spaces, creating opportunities for real‑time exchanges even in their absence. IRL Ditto offers a unique hybrid experience by allowing in‑person colleagues to encounter a digital version of their remote teammates, initiating greetings, updates, or small talk as they might in person. Our research question examines: How can the IRL Ditto influence interactions and relationships among colleagues in a shared office space? Through a four‑day study, we assessed IRL Ditto's ability to strengthen social ties by simulating presence and enabling meaningful interactions across different levels of social familiarity. We find that enhancing social relationships depended deeply on the foundation of the relationship participants had with the source of the IRL Ditto. This study provides insights into the role of embodied agents in enriching workplace dynamics for distributed teams.

PaperID: 10728, https://arxiv.org/pdf/2504.21304.pdf

Authors: Nanxu Gong, Xinyuan Wang, Wangyang Ying, Haoyue Bai, Sixun Dong, Haifeng Chen, Yanjie Fu

Title: Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

Abstract:
Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator‑critic duet‑play teaming framework using LLM agents and in‑context learning to derive pseudo‑supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic's advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator‑critic framework can be generalized to human‑agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.

PaperID: 10729, https://arxiv.org/pdf/2504.21278.pdf

Authors: Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, Qing Wang

Title: Robust Multi-agent Communication Based on Decentralization-Oriented Adversarial Training

Abstract:
In typical multi‑agent reinforcement learning (MARL) problems, communication is important for agents to share information and make the right decisions. However, due to the complexity of training multi‑agent communication, existing methods often fall into the dilemma of local optimization, which leads to the concentration of communication in a limited number of channels and presents an unbalanced structure. Such unbalanced communication policy are vulnerable to abnormal conditions, where the damage of critical communication channels can trigger the crash of the entire system. Inspired by decentralization theory in sociology, we propose DMAC, which enhances the robustness of multi‑agent communication policies by retraining them into decentralized patterns. Specifically, we train an adversary DMAC\_Adv which can dynamically identify and mask the critical communication channels, and then apply the adversarial samples generated by DMAC\_Adv to the adversarial learning of the communication policy to force the policy in exploring other potential communication schemes and transition to a decentralized structure. As a training method to improve robustness, DMAC can be fused with any learnable communication policy algorithm. The experimental results in two communication policies and four multi‑agent tasks demonstrate that DMAC achieves higher improvement on robustness and performance of communication policy compared with two state‑of‑the‑art and commonly‑used baselines. Also, the results demonstrate that DMAC can achieve decentralized communication structure with acceptable communication cost.

PaperID: 10730, https://arxiv.org/pdf/2504.21164.pdf

Authors: Bhavini Jeloka, Yue Guan, Panagiotis Tsiotras

Title: Learning Large-Scale Competitive Team Behaviors with Mean-Field Interactions and Online Opponent Modeling

Abstract:
While multi‑agent reinforcement learning (MARL) has been proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advancements in mean‑field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF‑MAPPO, a mean‑field extension of PPO designed for zero‑sum team games that integrate intra‑team cooperation with inter‑team competition. MF‑MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite‑population simulators, thereby enabling deployment to realistic scenarios with thousands of agents. We further show that MF‑MAPPO naturally extends to partially observable settings through a simple gradient‑regularized training scheme. Our evaluation utilizes large‑scale benchmark scenarios using our own testing simulation platform for MF team games (MFEnv), including offense‑defense battlefield tasks as well as variants of population‑based rock‑paper‑scissors games that admit analytical solutions, for benchmarking. Across these benchmarks, MF‑MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean‑field theory and MARL techniques at scale.

PaperID: 10731, https://arxiv.org/pdf/2504.21113.pdf

Authors: Donipolo Ghimire, Carlos Nieto-Granda, Solmaz S. Kia

Title: NavEX: A Multi-Agent Coverage in Non-Convex and Uneven Environments via Exemplar-Clustering

Abstract:
This paper addresses multi‑agent deployment in non‑convex and uneven environments. To overcome the limitations of traditional approaches, we introduce Navigable Exemplar‑Based Dispatch Coverage (NavEX), a novel dispatch coverage framework that combines exemplar‑clustering with obstacle‑aware and traversability‑aware shortest distances, offering a deployment framework based on submodular optimization. NavEX provides a unified approach to solve two critical coverage tasks: (a) fair‑access deployment, aiming to provide equitable service by minimizing agent‑target distances, and (b) hotspot deployment, prioritizing high‑density target regions. A key feature of NavEX is the use of exemplar‑clustering for the coverage utility measure, which provides the flexibility to employ non‑Euclidean distance metrics that do not necessarily conform to the triangle inequality. This allows NavEX to incorporate visibility graphs for shortest‑path computation in environments with planar obstacles, and traversability‑aware RRT for complex, rugged terrains. By leveraging submodular optimization, the NavEX framework enables efficient, near‑optimal solutions with provable performance guarantees for multi‑agent deployment in realistic and complex settings, as demonstrated by our simulations.

PaperID: 10732, https://arxiv.org/pdf/2504.21111.pdf

Authors: Md Safwan Mondal, Subramanian Ramasamy, Luca Russo, James D. Humann, James M. Dotterweich, Pranav Bhounsule

Title: How to Coordinate UAVs and UGVs for Efficient Mission Planning? Optimizing Energy-Constrained Cooperative Routing with a DRL Framework

Abstract:
Efficient mission planning for cooperative systems involving Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) requires addressing energy constraints, scalability, and coordination challenges between agents. UAVs excel in rapidly covering large areas but are constrained by limited battery life, while UGVs, with their extended operational range and capability to serve as mobile recharging stations, are hindered by slower speeds. This heterogeneity makes coordination between UAVs and UGVs critical for achieving optimal mission outcomes. In this work, we propose a scalable deep reinforcement learning (DRL) framework to address the energy‑constrained cooperative routing problem for multi‑agent UAV‑UGV teams, aiming to visit a set of task points in minimal time with UAVs relying on UGVs for recharging during the mission. The framework incorporates sortie‑wise agent switching to efficiently manage multiple agents, by allocating task points and coordinating actions. Using an encoder‑decoder transformer architecture, it optimizes routes and recharging rendezvous for the UAV‑UGV team in the task scenario. Extensive computational experiments demonstrate the framework's superior performance over heuristic methods and a DRL baseline, delivering significant improvements in solution quality and runtime efficiency across diverse scenarios. Generalization studies validate its robustness, while dynamic scenario highlights its adaptability to real‑time changes with a case study. This work advances UAV‑UGV cooperative routing by providing a scalable, efficient, and robust solution for multi‑agent mission planning.

PaperID: 10733, https://arxiv.org/pdf/2504.21048.pdf

Authors: Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

Title: Multi-Agent Reinforcement Learning for Resources Allocation Optimization: A Survey

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has become a powerful framework for numerous real‑world applications, modeling distributed decision‑making and learning from interactions with complex environments. Resource Allocation Optimization (RAO) benefits significantly from MARL's ability to tackle dynamic and decentralized contexts. MARL‑based approaches are increasingly applied to RAO challenges across sectors playing pivotal roles to Industry 4.0 developments. This survey provides a comprehensive review of recent MARL algorithms for RAO, encompassing core concepts, classifications, and a structured taxonomy. By outlining the current research landscape and identifying primary challenges and future directions, this survey aims to support researchers and practitioners in leveraging MARL's potential to advance resource allocation solutions.

PaperID: 10734, https://arxiv.org/pdf/2504.21030.pdf

Authors: Naveen Krishnan

Title: Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications

Abstract:
Multi‑agent systems represent a significant advancement in artificial intelligence, enabling complex problem‑solving through coordinated specialized agents. However, these systems face fundamental challenges in context management, coordination efficiency, and scalable operation. This paper introduces a comprehensive framework for advancing multi‑agent systems through Model Context Protocol (MCP), addressing these challenges through standardized context sharing and coordination mechanisms. We extend previous work on AI agent architectures by developing a unified theoretical foundation, advanced context management techniques, and scalable coordination patterns. Through detailed implementation case studies across enterprise knowledge management, collaborative research, and distributed problem‑solving domains, we demonstrate significant performance improvements compared to traditional approaches. Our evaluation methodology provides a systematic assessment framework with benchmark tasks and datasets specifically designed for multi‑agent systems. We identify current limitations, emerging research opportunities, and potential transformative applications across industries. This work contributes to the evolution of more capable, collaborative, and context‑aware artificial intelligence systems that can effectively address complex real‑world challenges.

PaperID: 10735, https://arxiv.org/pdf/2504.20997.pdf

Authors: Dilip Arumugam, Thomas L. Griffiths

Title: Toward Efficient Exploration by Large Language Model Agents

Abstract:
A burgeoning area within reinforcement learning (RL) is the design of sequential decision‑making agents centered around large language models (LLMs). While autonomous decision‑making agents powered by modern LLMs could facilitate numerous real‑world applications, such successes demand agents that are capable of data‑efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in‑context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically‑efficient exploration is already well‑studied. We offer empirical results demonstrating how our LLM‑based implementation of a known, data‑efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.

PaperID: 10736, https://arxiv.org/pdf/2504.20995.pdf

Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan

Title: TesserAct: Learning 4D Embodied World Models

Abstract:
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB‑DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off‑the‑shelf models. Next, we fine‑tune a video generation model on this annotated dataset, which jointly predicts RGB‑DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high‑quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video‑based world models.

PaperID: 10737, https://arxiv.org/pdf/2504.20947.pdf

Authors: Norah K. Alghamdi, Shinkyu Park

Title: Opinion-Driven Decision-Making for Multi-Robot Navigation through Narrow Corridors

Abstract:
We propose an opinion‑driven navigation framework for multi‑robot traversal through a narrow corridor. Our approach leverages a multi‑agent decision‑making model known as the Nonlinear Opinion Dynamics (NOD) to address the narrow corridor passage problem, formulated as a multi‑robot navigation game. By integrating the NOD model with a multi‑robot path planning algorithm, we demonstrate that the framework effectively reduces the likelihood of deadlocks during corridor traversal. To ensure scalability with an increasing number of robots, we introduce a game reduction technique that enables efficient coordination in larger groups. Extensive simulation studies are conducted to validate the effectiveness of the proposed approach.

PaperID: 10738, https://arxiv.org/pdf/2504.20927.pdf

Authors: Shahbaz P Qadri Syed, He Bai

Title: Exploiting inter-agent coupling information for efficient reinforcement learning of cooperative LQR

Abstract:
Developing scalable and efficient reinforcement learning algorithms for cooperative multi‑agent control has received significant attention over the past years. Existing literature has proposed inexact decompositions of local Q‑functions based on empirical information structures between the agents. In this paper, we exploit inter‑agent coupling information and propose a systematic approach to exactly decompose the local Q‑function of each agent. We develop an approximate least square policy iteration algorithm based on the proposed decomposition and identify two architectures to learn the local Q‑function for each agent. We establish that the worst‑case sample complexity of the decomposition is equal to the centralized case and derive necessary and sufficient graphical conditions on the inter‑agent couplings to achieve better sample efficiency. We demonstrate the improved sample efficiency and computational efficiency on numerical examples.

PaperID: 10739, https://arxiv.org/pdf/2504.20903.pdf

Authors: Prothit Sen, Sai Mihir Jakkaraju

Title: Modeling AI-Human Collaboration as a Multi-Agent Adaptation

Abstract:
We formalize AI‑human collaboration through an agent‑based simulation that distinguishes optimization‑based AI search from satisficing‑based human adaptation. Using an NK model, we examine how these distinct decision heuristics interact across modular and sequenced task structures. For modular tasks, AI typically substitutes for humans, yet complementarities emerge when AI explores a moderately broad search space and human task complexity remains low. In sequenced tasks, we uncover a counterintuitive result: when a high‑performing human initiates search and AI subsequently refines it, joint performance is maximized, contradicting the dominant AI‑first design principle. Conversely, when AI leads and human satisficing follows, complementarities attenuate as task interdependence increases. We further show that memory‑less random AI, despite lacking structured adaptation, can improve outcomes when augmenting low‑capability humans by enabling escape from local optima. Collectively, our findings reveal that effective AI‑human collaboration depends less on industry context and more on task architecture: the division of labor, sequencing, and interdependence structure. By elevating task decomposition as the central design principle, we provide a generalizable framework for strategic decision‑making involving agentic AI across diverse organizational settings.

PaperID: 10740, https://arxiv.org/pdf/2504.20898.pdf

Authors: Hasan Md Tusfiqur Alam, Devansh Srivastav, Abdulrahman Mohamed Selim, Md Abdul Kadir, Md Moktadirul Hoque Shuvo, Daniel Sonntag

Title: CBM-RAG: Demonstrating Enhanced Interpretability in Radiology Report Generation with Multi-Agent RAG and Concept Bottleneck Models

Abstract:
Advancements in generative Artificial Intelligence (AI) hold great promise for automating radiology workflows, yet challenges in interpretability and reliability hinder clinical adoption. This paper presents an automated radiology report generation framework that combines Concept Bottleneck Models (CBMs) with a Multi‑Agent Retrieval‑Augmented Generation (RAG) system to bridge AI performance with clinical explainability. CBMs map chest X‑ray features to human‑understandable clinical concepts, enabling transparent disease classification. Meanwhile, the RAG system integrates multi‑agent collaboration and external knowledge to produce contextually rich, evidence‑based reports. Our demonstration showcases the system's ability to deliver interpretable predictions, mitigate hallucinations, and generate high‑quality, tailored reports with an interactive interface addressing accuracy, trust, and usability challenges. This framework provides a pathway to improving diagnostic consistency and empowering radiologists with actionable insights.

PaperID: 10741, https://arxiv.org/pdf/2504.20529.pdf

Authors: Costas Mylonas, Emmanouel Varvarigos, Georgios Tsaousoglou

Title: Safe Bottom-Up Flexibility Provision from Distributed Energy Resources

Abstract:
Modern renewables‑based power systems need to tap on the flexibility of Distributed Energy Resources (DERs) connected to distribution networks. It is important, however, that DER owners/users remain in control of their assets, decisions, and objectives. At the same time, the dynamic landscape of DER‑penetrated distribution networks calls for agile, data‑driven flexibility management frameworks. In the face of these developments, the Multi‑Agent Reinforcement Learning (MARL) paradigm is gaining significant attention, as a distributed and data‑driven decision‑making policy. This paper addresses the need for bottom‑up DER management decisions to account for the distribution network's safety‑related constraints. While the related literature on safe MARL typically assumes that network characteristics are available and incorporated into the policy's safety layer, which implies active DSO engagement, this paper ensures that self‑organized DER communities are enabled to provide distribution‑network‑safe flexibility services without relying on the aspirational and problematic requirement of bringing the DSO in the decision‑making loop.

PaperID: 10742, https://arxiv.org/pdf/2504.20462.pdf

Authors: Xiao Zhang, Qi Wang, Mingyi Li, Yuan Yuan, Mengbai Xiao, Fuzhen Zhuang, Dongxiao Yu

Title: TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems

Abstract:
Implementing large language models (LLMs)‑driven root cause analysis (RCA) in cloud‑native systems has become a key topic of modern software operations and maintenance. However, existing LLM‑based approaches face three key challenges: multi‑modality input constraint, context window limitation, and dynamic dependence graph. To address these issues, we propose a tool‑assisted LLM agent with multi‑modality observation data for fine‑grained RCA, namely TAMO, including multimodality alignment tool, root cause localization tool, and fault types classification tool. In detail, TAMO unifies multi‑modal observation data into time‑aligned representations for cross‑modal feature consistency. Based on the unified representations, TAMO then invokes its specialized root cause localization tool and fault types classification tool for further identifying root cause and fault type underlying system context. This approach overcomes the limitations of LLMs in processing real‑time raw observational data and dynamic service dependencies, guiding the model to generate repair strategies that align with system context through structured prompt design. Experiments on two benchmark datasets demonstrate that TAMO outperforms state‑of‑the‑art (SOTA) approaches with comparable performance.

PaperID: 10743, https://arxiv.org/pdf/2504.20117.pdf

Authors: Shubham Gandhi, Dhruv Shah, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff

Title: ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies

Abstract:
In this paper we introduce ResearchCodeAgent, a novel multi‑agent system leveraging large language models (LLMs) agents to automate the codification of research methodologies described in machine learning literature. The system bridges the gap between high‑level research concepts and their practical implementation, allowing researchers auto‑generating code of existing research papers for benchmarking or building on top‑of existing methods specified in the literature with availability of partial or complete starter code. ResearchCodeAgent employs a flexible agent architecture with a comprehensive action suite, enabling context‑aware interactions with the research environment. The system incorporates a dynamic planning mechanism, utilizing both short and long‑term memory to adapt its approach iteratively. We evaluate ResearchCodeAgent on three distinct machine learning tasks with distinct task complexity and representing different parts of the ML pipeline: data augmentation, optimization, and data batching. Our results demonstrate the system's effectiveness and generalizability, with 46.9% of generated code being high‑quality and error‑free, and 25% showing performance improvements over baseline implementations. Empirical analysis shows an average reduction of 57.9% in coding time compared to manual implementation. We observe higher gains for more complex tasks. ResearchCodeAgent represents a significant step towards automating the research implementation process, potentially accelerating the pace of machine learning research.

PaperID: 10744, https://arxiv.org/pdf/2504.20094.pdf

Authors: Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang, Frank Ong, Se-eun Yoon, Rachit Pareek, Michelle Gong

Title: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Abstract:
Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast‑evolving catalogs, interaction‑driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open‑ended conversation. We propose MATCHA, a multi‑agent framework for CRS that assigns specialized agents for intent parsing, tool‑augmented retrieval, multi‑LLM ranking with reflection, explanation, and risk control which enabling finer personalization, long‑tail coverage, and stronger safety. Evaluated on real user request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual‑judge evaluations confirm improved explanation quality and user alignment.

PaperID: 10745, https://arxiv.org/pdf/2504.20082.pdf

Authors: Firuz Kamalov, David Santandreu Calonge, Linda Smail, Dilshod Azizov, Dimple R. Thadani, Theresa Kwong, Amara Atif

Title: Evolution of AI in Education: Agentic Workflows

Abstract:
The primary goal of this study is to analyze agentic workflows in education according to the proposed four major technological paradigms: reflection, planning, tool use, and multi‑agent collaboration. We critically examine the role of AI agents in education through these key design paradigms, exploring their advantages, applications, and challenges. Second, to illustrate the practical potential of agentic systems, we present a proof‑of‑concept application: a multi‑agent framework for automated essay scoring. Preliminary results suggest this agentic approach may offer improved consistency compared to stand‑alone LLMs. Our findings highlight the transformative potential of AI agents in educational settings while underscoring the need for further research into their interpretability and trustworthiness.

PaperID: 10746, https://arxiv.org/pdf/2504.20068.pdf

Authors: Wei Zhang, Zhiyu Wu, Yi Mu, Rui Ning, Banruo Liu, Nikhil Sarda, Myungjin Lee, Fan Lai

Title: JITServe: SLO-aware LLM Serving with Imprecise Request Information

Abstract:
The integration of Large Language Models (LLMs) into applications ranging from interactive chatbots to multi‑agent systems has introduced a wide spectrum of service‑level objectives (SLOs) for responsiveness. These include latency‑sensitive requests emphasizing per‑token latency in streaming chat, deadline‑sensitive requests requiring rapid full responses to trigger external tools, and compound requests with evolving dependencies across multiple LLM calls. Despite‑or perhaps, because of‑this workload diversity and unpredictable request information (e.g., response lengths and dependencies), existing request schedulers have focused on aggregate performance, unable to ensure application‑level SLO needs. This paper presents JITServe, the first SLO‑aware LLM serving system designed to maximize service goodput (e.g., the number of tokens meeting request SLOs) across diverse workloads. JITServe novelly schedules requests using imprecise request information and gradually relaxes this conservatism by refining request information estimates as generation progresses. It applies a grouped margin goodput maximization algorithm to allocate just enough serving bandwidth to satisfy each request's SLO just‑in‑time (JIT), maximizing residual capacity for others, while deciding the composition of requests in a batch to maximize efficiency and goodput with provable guarantees. Our evaluation across diverse realistic workloads, including chat, deep research, and agentic pipelines, shows that JITServe improves service goodput by 1.4x‑6.3x, alternatively achieving 28.5%‑83.2% resource savings, compared to state‑of‑the‑art designs.

PaperID: 10747, https://arxiv.org/pdf/2504.20054.pdf

Authors: Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He

Title: Marmot: Object-Level Self-Correction via Multi-Agent Reasoning

Abstract:
While diffusion models excel at generating high‑quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi‑object scenes. One potential solution involves employing Multimodal Large Language Model (MLLM) as an AI agent to construct a self‑correction framework. However, these approaches heavily rely on the capabilities of the MLLMs used, often fail to account for all objects within the image, and suffer from cumulative distortions during multi‑round editing processes. To address these challenges, we propose Marmot, a novel and generalizable framework that leverages Multi‑Agent Reasoning for Multi‑Object Self‑Correcting to enhance image‑text alignment. First, we employ a large language model as an Object‑Aware Agent to perform object‑level divide‑and‑conquer, automatically decomposing self‑correction tasks into object‑centric subtasks based on image descriptions. For each subtask, we construct an Object Correction System featuring a decision‑execution‑verification mechanism that operates exclusively on a single object's segmentation mask or the bounding boxes of object pairs, effectively mitigating inter‑object interference and enhancing editing reliability. To efficiently integrate correction results from subtasks while avoiding cumulative distortions from multi‑stage editing, we propose a Pixel‑Domain Stitching Smoother, which employs mask‑guided two‑stage latent space optimization. This innovation enables parallel processing of subtasks, significantly improving runtime efficiency while preventing distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.

PaperID: 10748, https://arxiv.org/pdf/2504.20024.pdf

Authors: Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, Alan Yuille

Title: SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Abstract:
Despite recent advances on multi‑modal models, 3D spatial reasoning remains a challenging task for state‑of‑the‑art open‑source and proprietary models. Recent studies explore data‑driven approaches and achieve enhanced spatial reasoning performance by fine‑tuning models on 3D‑related visual question‑answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain‑of‑thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision‑language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages‑‑3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi‑step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

PaperID: 10749, https://arxiv.org/pdf/2504.19951.pdf

Authors: Vineeth Sai Narajala, Ken Huang, Idan Habler

Title: Securing GenAI Multi-Agent Systems Against Tool Squatting: A Zero Trust Registry-Based Approach

Abstract:
The rise of generative AI (GenAI) multi‑agent systems (MAS) necessitates standardized protocols enabling agents to discover and interact with external tools. However, these protocols introduce new security challenges, particularly; tool squatting; the deceptive registration or representation of tools. This paper analyzes tool squatting threats within the context of emerging interoperability standards, such as Model Context Protocol (MCP) or seamless communication between agents protocols. It introduces a comprehensive Tool Registry system designed to mitigate these risks. We propose a security‑focused architecture featuring admin‑controlled registration, centralized tool discovery, fine grained access policies enforced via dedicated Agent and Tool Registry services, a dynamic trust scoring mechanism based on tool versioning and known vulnerabilities, and just in time credential provisioning. Based on its design principles, the proposed registry framework aims to effectively prevent common tool squatting vectors while preserving the flexibility and power of multi‑agent systems. This work addresses a critical security gap in the rapidly evolving GenAI ecosystem and provides a foundation for secure tool integration in production environments.

PaperID: 10750, https://arxiv.org/pdf/2504.19912.pdf

Authors: Khachik Smbatyan, Tsolak Ghukasyan, Tigran Aghajanyan, Hovhannes Dabaghyan, Sergey Adamyan, Aram Bughdaryan, Vahagn Altunyan, Gagik Navasardyan, Aram Davtyan, Anush Hakobyan, Aram Gharibyan, Arman Fahradyan, Artur Hakobyan, Hasmik Mnatsakanyan, Narek Ginoyan, Garik Petrosyan

Title: Can AI Agents Design and Implement Drug Discovery Pipelines?

Abstract:
The rapid advancement of artificial intelligence, particularly autonomous agentic systems based on Large Language Models (LLMs), presents new opportunities to accelerate drug discovery by improving in‑silico modeling and reducing dependence on costly experimental trials. Current AI agent‑based systems demonstrate proficiency in solving programming challenges and conducting research, indicating an emerging potential to develop software capable of addressing complex problems such as pharmaceutical design and drug discovery. This paper introduces DO Challenge, a benchmark designed to evaluate the decision‑making abilities of AI agents in a single, complex problem resembling virtual screening scenarios. The benchmark challenges systems to independently develop, implement, and execute efficient strategies for identifying promising molecular structures from extensive datasets, while navigating chemical space, selecting models, and managing limited resources in a multi‑objective context. We also discuss insights from the DO Challenge 2025, a competition based on the proposed benchmark, which showcased diverse strategies explored by human participants. Furthermore, we present the Deep Thought multi‑agent system, which demonstrated strong performance on the benchmark, outperforming most human teams. Among the language models tested, Claude 3.7 Sonnet, Gemini 2.5 Pro and o3 performed best in primary agent roles, and GPT‑4o, Gemini 2.0 Flash were effective in auxiliary roles. While promising, the system's performance still fell short of expert‑designed solutions and showed high instability, highlighting both the potential and current limitations of AI‑driven methodologies in transforming drug discovery and broader scientific research.

PaperID: 10751, https://arxiv.org/pdf/2504.19854.pdf

Authors: Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, Soujanya Poria

Title: NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Abstract:
Existing Visual‑Language‑Action (VLA) models have shown promising performance in zero‑shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real‑time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B‑parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen‑2.5‑VL‑3B multimodal model as its backbone, leveraging its superior visual‑semantic understanding to enhance visual reasoning and action grounding. Additionally, our \model is trained on 970k real‑world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large‑scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real‑time robotic autonomy.

PaperID: 10752, https://arxiv.org/pdf/2504.19818.pdf

Authors: Feng Chen, Ilias Stogiannidis, Andrew Wood, Danilo Bueno, Dominic Williams, Fraser Macfarlane, Bruce Grieve, Darren Wells, Jonathan A. Atkinson, Malcolm J. Hawkesford, Stephen A. Rolfe, Tracy Lawson, Tony Pridmore, Mario Valerio Giuffrida, Sotirios A. Tsaftaris

Title: PhenoAssistant: A Conversational Multi-Agent AI System for Automated Plant Phenotyping

Abstract:
Plant phenotyping increasingly relies on (semi‑)automated image‑based analysis workflows to improve its accuracy and scalability. However, many existing solutions remain overly complex, difficult to reimplement and maintain, and pose high barriers for users without substantial computational expertise. To address these challenges, we introduce PhenoAssistant: a pioneering AI‑driven system that streamlines plant phenotyping via intuitive natural language interaction. PhenoAssistant leverages a large language model to orchestrate a curated toolkit supporting tasks including automated phenotype extraction, data visualisation and automated model training. We validate PhenoAssistant through several representative case studies and a set of evaluation tasks. By significantly lowering technical hurdles, PhenoAssistant underscores the promise of AI‑driven methodologies to democratising AI adoption in plant biology.

PaperID: 10753, https://arxiv.org/pdf/2504.19793.pdf

Authors: Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

Title: Prompt Injection Attack to Tool Selection in LLM Agents

Abstract:
Tool selection is a key component of LLM agents. A popular approach follows a two‑step process ‑ \emphretrieval and \emphselection ‑ to pick the most appropriate tool from a tool library for a given task. In this work, we introduce ToolHijacker, a novel prompt injection attack targeting tool selection in no‑box scenarios. ToolHijacker injects a malicious tool document into the tool library to manipulate the LLM agent's tool selection process, compelling it to consistently choose the attacker's malicious tool for an attacker‑chosen target task. Specifically, we formulate the crafting of such tool documents as an optimization problem and propose a two‑phase optimization strategy to solve it. Our extensive experimental evaluation shows that ToolHijacker is highly effective, significantly outperforming existing manual‑based and automated prompt injection attacks when applied to tool selection. Moreover, we explore various defenses, including prevention‑based defenses (StruQ and SecAlign) and detection‑based defenses (known‑answer detection, DataSentinel, perplexity detection, and perplexity windowed detection). Our experimental results indicate that these defenses are insufficient, highlighting the urgent need for developing new defense strategies.

PaperID: 10754, https://arxiv.org/pdf/2504.19678.pdf

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah

Title: From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Abstract:
Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. Driven by the growing need for standardized evaluation and integration, we systematically consolidate these fragmented efforts into a unified framework. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side‑by‑side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem‑solving, code generation and software engineering, factual grounding and retrieval, domain‑specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI‑agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision‑making and multi‑step reasoning. Moreover, we present real‑world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem‑solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent‑to‑agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent‑to‑Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi‑agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

PaperID: 10755, https://arxiv.org/pdf/2504.19593.pdf

Authors: S Nordström, Y Bai, B Lindqvist, G Nikolakopoulos

Title: A Time-dependent Risk-aware distributed Multi-Agent Path Finder based on A*

Abstract:
Multi‑Agent Path‑Finding (MAPF) focuses on the collaborative planning of paths for multiple agents within shared spaces, aiming for collision‑free navigation. Conventional planning methods often overlook the presence of other agents, which can result in conflicts. In response, this article introduces the A^_+T algorithm, a distributed approach that improves coordination among agents by anticipating their positions based on their movement speeds. The algorithm also considers dynamic obstacles, assessing potential collisions with respect to observed speeds and trajectories, thereby facilitating collision‑free path planning in environments populated by other agents and moving objects. It incorporates a risk layer surrounding both dynamic and static entities, enhancing its utility in real‑world applications. Each agent functions autonomously while being mindful of the paths chosen by others, effectively addressing the complexities inherent in multi‑agent situations. The performance of A^_+T has been rigorously tested in the Gazebo simulation environment and benchmarked against established approaches such as CBS, ECBS, and SIPP. Furthermore, the algorithm has shown competence in single‑agent experiments, with results demonstrating its effectiveness in managing dynamic obstacles and affirming its practical relevance across various scenarios.

PaperID: 10756, https://arxiv.org/pdf/2504.19565.pdf

Authors: Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu

Title: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

Abstract:
Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open‑source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge‑driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi‑agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self‑evaluate high‑quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain‑specific question‑answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi‑agent distilled datasets achieve notable improvements in biomedical question‑answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI‑Ready dataset enables Llama3‑70B to surpass GPT‑4 with MedPrompt and Med‑PaLM‑2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi‑agent collaboration in biomedical LLM training.

PaperID: 10757, https://arxiv.org/pdf/2504.19397.pdf

Authors: Sagar Sudhakara

Title: Symmetric Policy Design for Multi-Agent Dispatch Coordination in Supply Chains

Abstract:
We study a decentralized dispatch coordination problem in a multi‑agent supply chain setting with shared logistics capacity. We propose symmetric (identical) dispatch strategies for all agents, enabling efficient coordination without centralized control. Using a common information approach, we derive a dynamic programming solution that computes optimal symmetric dispatch strategies by transforming the multi‑agent problem into a tractable dynamic program on the agents common information state. Simulation results demonstrate that our method significantly reduces coordination cost compared to baseline heuristics, including belief‑based strategies and an always‑dispatch policy. These findings highlight the benefits of combining symmetric strategy design with a common information‑based dynamic programming framework for improving multi‑agent coordination performance.

PaperID: 10758, https://arxiv.org/pdf/2504.19338.pdf

Authors: Jingsen Feng, Ran Xu, Xu Chu

Title: OpenFOAMGPT 2.0: end-to-end, trustworthy automation for computational fluid dynamics

Abstract:
We propose the first multi agent framework for computational fluid dynamics that enables fully automated, end to end simulations directly from natural language queries. The approach integrates four specialized agents Pre processing, Prompt Generation, OpenFOAMGPT (simulator), and Post processing decomposing complex computational fluid dynamics workflows into collaborative components powered by large language models. Extensive validation through diverse case studies, including Poiseuille flows, single and multi phase porous media flows, and aerodynamic analyses, demonstrates 100% success and reproducibility rates across over 450 simulations. Rigorous trustworthiness verification confirms that properly designed multi agent systems can achieve the reliability standards necessary for zero tolerance scientific computing applications while significantly lowering entry barriers. The framework establishes a foundation for conversation‑driven simulation workflows in computational science, potentially accelerating discovery and innovation through more accessible tools for complex numerical simulations. Results reveal that multi‑agent architectures, when properly specialized and orchestrated, can effectively handle the stringent requirements of computational physics while maintaining the intuitive interface of natural language interaction.

PaperID: 10759, https://arxiv.org/pdf/2504.19017.pdf

Authors: Alireza Ghafarollahi, Markus J. Buehler

Title: Sparks: Multi-Agent Artificial Intelligence Model Discovers Protein Design Principles

Abstract:
Advances in artificial intelligence (AI) promise autonomous discovery, yet most systems still resurface knowledge latent in their training data. We present Sparks, a multi‑modal multi‑agent AI model that executes the entire discovery cycle that includes hypothesis generation, experiment design and iterative refinement to develop generalizable principles and a report without human intervention. Applied to protein science, Sparks uncovered two previously unknown phenomena: (i) a length‑dependent mechanical crossover whereby beta‑sheet‑biased peptides surpass alpha‑helical ones in unfolding force beyond ~80 residues, establishing a new design principle for peptide mechanics; and (ii) a chain‑length/secondary‑structure stability map revealing unexpectedly robust beta‑sheet‑rich architectures and a "frustration zone" of high variance in mixed alpha/beta folds. These findings emerged from fully self‑directed reasoning cycles that combined generative sequence design, high‑accuracy structure prediction and physics‑aware property models, with paired generation‑and‑reflection agents enforcing self‑correction and reproducibility. The key result is that Sparks can independently conduct rigorous scientific inquiry and identify previously unknown scientific principles.

PaperID: 10760, https://arxiv.org/pdf/2504.18945.pdf

Authors: Zishen Wan, Jiayi Qian, Yuhang Du, Jason Jabbour, Yilun Du, Yang Katie Zhao, Arijit Raychowdhury, Tushar Krishna, Vijay Janapa Reddi

Title: Generative AI in Embodied Systems: System-Level Analysis of Performance, Efficiency and Scalability

Abstract:
Embodied systems, where generative autonomous agents engage with the physical world through integrated perception, cognition, action, and advanced reasoning powered by large language models (LLMs), hold immense potential for addressing complex, long‑horizon, multi‑objective tasks in real‑world environments. However, deploying these systems remains challenging due to prolonged runtime latency, limited scalability, and heightened sensitivity, leading to significant system inefficiencies. In this paper, we aim to understand the workload characteristics of embodied agent systems and explore optimization solutions. We systematically categorize these systems into four paradigms and conduct benchmarking studies to evaluate their task performance and system efficiency across various modules, agent scales, and embodied tasks. Our benchmarking studies uncover critical challenges, such as prolonged planning and communication latency, redundant agent interactions, complex low‑level control mechanisms, memory inconsistencies, exploding prompt lengths, sensitivity to self‑correction and execution, sharp declines in success rates, and reduced collaboration efficiency as agent numbers increase. Leveraging these profiling insights, we suggest system optimization strategies to improve the performance, efficiency, and scalability of embodied agents across different paradigms. This paper presents the first system‑level analysis of embodied AI agents, and explores opportunities for advancing future embodied system design.

PaperID: 10761, https://arxiv.org/pdf/2504.18880.pdf

Authors: Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai, Xiaotian Huang, Haiyang He, Pengxu Pan, Ying Fang, Zhanglin Li, Haipu Li, Jingjing Yao

Title: Reshaping MOFs text mining with a dynamic multi-agents framework of large language model

Abstract:
Accurately identifying the synthesis conditions of metal‑organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/‑ 0.01. Processing a full text takes 9.6 s, locating synthesis descriptions 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real‑time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data‑driven materials discovery.

PaperID: 10762, https://arxiv.org/pdf/2504.18876.pdf

Authors: Gil Aharoni, Martin Hoefer, Inbal Talgam-Cohen

Title: Welfare and Beyond in Multi-Agent Contracts

Abstract:
A principal delegates a project to a team S from a pool of n agents. The project's value if all agents in S exert costly effort is f(S). To incentivize the agents to participate, the principal assigns each agent i\in S a share ρ_i\in [0,1] of the project's final value (i.e., designs n linear contracts). The shares must be feasible ‑‑ their sum should not exceed 1. It is well‑understood how to design these contracts to maximize the principal's own expected utility, but what if the goal is to coordinate the agents toward maximizing social welfare? We initiate a systematic study of multi‑agent contract design with objectives beyond principal's utility, including welfare maximization, for various classes of value functions f. Our exploration reveals an arguably surprising fact: If f is up to XOS in the complement‑free hierarchy of functions, then the optimal principal's utility is a constant‑fraction of the optimal welfare. This is in stark contrast to the much larger welfare‑utility gaps in auction design, and no longer holds above XOS in the hierarchy, where the gap can be unbounded. A constant bound on the welfare‑utility gap immediately implies that existing algorithms for designing contracts with approximately‑optimal principal's utility also guarantee approximately‑optimal welfare. The downside of reducing welfare to utility is the loss of large constants. To obtain better guarantees, we develop polynomial‑time algorithms directly for welfare, for different classes of value functions. These include a tight 2‑approximation to the optimal welfare for symmetric XOS functions. Finally, we extend our analysis beyond welfare to the project's value under general feasibility constraints. Our results immediately translate to budgeted welfare and utility.

PaperID: 10763, https://arxiv.org/pdf/2504.18839.pdf

Authors: Abdellah Ghassel, Xianzhi Li, Xiaodan Zhu

Title: Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents

Abstract:
Large Language Models (LLMs) have demonstrated substantial capabilities in conversational AI applications, yet their susceptibility to dialogue breakdowns poses significant challenges to deployment reliability and user trust. This paper introduces a "Detect, Explain, Escalate" framework to manage dialogue breakdowns in LLM‑powered agents, emphasizing resource‑efficient operation. Our approach integrates two key strategies: (1) We fine‑tune a compact 8B‑parameter model, augmented with teacher‑generated reasoning traces, which serves as an efficient real‑time breakdown detector and explainer. This model demonstrates robust classification and calibration on English and Japanese dialogues, and generalizes to the BETOLD dataset, improving accuracy by 7% over its baseline. (2) We systematically evaluate frontier LLMs using advanced prompting (few‑shot, chain‑of‑thought, analogical reasoning) for high‑fidelity breakdown assessment. These are integrated into an "escalation" architecture where our efficient detector defers to larger models only when necessary, substantially reducing operational costs and computational overhead. Our fine‑tuned model and prompting strategies achieve state‑of‑the‑art performance on DBDC5 and strong results on BETOLD, outperforming specialized classifiers on DBDC5 and narrowing the performance gap to larger proprietary models. The proposed monitor‑escalate pipeline reduces inference costs by 54%, providing a cost‑effective and interpretable solution for robust conversational AI in high‑impact domains. Code and models are publicly available.

PaperID: 10764, https://arxiv.org/pdf/2504.18765.pdf

Authors: Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Penghai Zhao, Tianlin Li, Xiaojun Jia, Xiang Li, Xingshuai Li, Yang Liu, Yebo Feng, Yihao Huang, Yijia Xu, Yuqiang Sun, Zhenhong Zhou, Zhengzi Xu

Title: A Vision for Auto Research with LLM Agents

Abstract:
This paper introduces Agent‑Based Auto Research, a structured multi‑agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self‑improving, AI‑driven research processes.

PaperID: 10765, https://arxiv.org/pdf/2504.18671.pdf

Authors: Ross Gore, Eranga Bandara, Sachin Shetty, Alberto E. Musto, Pratip Rana, Ambrosio Valencia-Romero, Christopher Rhea, Lobat Tayebi, Heather Richter, Atmaram Yarlagadda, Donna Edmonds, Steven Wallace, Donna Broshek

Title: Proof-of-TBI -- Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction

Abstract:
Mild Traumatic Brain Injury (TBI) detection presents significant challenges due to the subtle and often ambiguous presentation of symptoms in medical imaging, making accurate diagnosis a complex task. To address these challenges, we propose Proof‑of‑TBI, a medical diagnosis support system that integrates multiple fine‑tuned vision‑language models with the OpenAI‑o3 reasoning large language model (LLM). Our approach fine‑tunes multiple vision‑language models using a labeled dataset of TBI MRI scans, training them to diagnose TBI symptoms effectively. The predictions from these models are aggregated through a consensus‑based decision‑making process. The system evaluates the predictions from all fine‑tuned vision language models using the OpenAI‑o3 reasoning LLM, a model that has demonstrated remarkable reasoning performance, to produce the most accurate final diagnosis. The LLM Agents orchestrates interactions between the vision‑language models and the reasoning LLM, managing the final decision‑making process with transparency, reliability, and automation. This end‑to‑end decision‑making workflow combines the vision‑language model consortium with the OpenAI‑o3 reasoning LLM, enabled by custom prompt engineering by the LLM agents. The prototype for the proposed platform was developed in collaboration with the U.S. Army Medical Research team in Newport News, Virginia, incorporating five fine‑tuned vision‑language models. The results demonstrate the transformative potential of combining fine‑tuned vision‑language model inputs with the OpenAI‑o3 reasoning LLM to create a robust, secure, and highly accurate diagnostic system for mild TBI prediction. To the best of our knowledge, this research represents the first application of fine‑tuned vision‑language models integrated with a reasoning LLM for TBI prediction tasks.

PaperID: 10766, https://arxiv.org/pdf/2504.18603.pdf

Authors: Iizalaarab Elhaimeur, Nikos Chrisochoides

Title: Toward Personalizing Quantum Computing Education: An Evolutionary LLM-Powered Approach

Abstract:
Quantum computing education faces significant challenges due to its complexity and the limitations of current tools; this paper introduces a novel Intelligent Teaching Assistant for quantum computing education and details its evolutionary design process. The system combines a knowledge‑graph‑augmented architecture with two specialized Large Language Model (LLM) agents: a Teaching Agent for dynamic interaction, and a Lesson Planning Agent for lesson plan generation. The system is designed to adapt to individual student needs, with interactions meticulously tracked and stored in a knowledge graph. This graph represents student actions, learning resources, and relationships, aiming to enable reasoning about effective learning pathways. We describe the implementation of the system, highlighting the challenges encountered and the solutions implemented, including introducing a dual‑agent architecture where tasks are separated, all coordinated through a central knowledge graph that maintains system awareness, and a user‑facing tag system intended to mitigate LLM hallucination and improve user control. Preliminary results illustrate the system's potential to capture rich interaction data, dynamically adapt lesson plans based on student feedback via a tag system in simulation, and facilitate context‑aware tutoring through the integrated knowledge graph, though systematic evaluation is required.

PaperID: 10767, https://arxiv.org/pdf/2504.18457.pdf

Authors: Muzaffar Qureshi, Tochukwu Elijah Ogri, Humberto Ramos, Wanjiku A. Makumi, Zachary I. Bell, Rushikesh Kamalapurkar

Title: Improved Dwell-times for Switched Nonlinear Systems using Memory Regression Extension

Abstract:
This paper presents a switched systems approach for extending the dwell‑time of an autonomous agent during GPS‑denied operation by leveraging memory regressor extension (MRE) techniques. To maintain accurate trajectory tracking despite unknown dynamics and environmental disturbances, the agent periodically acquires access to GPS, allowing it to correct accumulated state estimation errors. The motivation for this work arises from the limitations of existing switched system approaches, where increasing estimation errors during GPS‑denied intervals and overly conservative dwell‑time conditions restrict the operational efficiency of the agent. By leveraging MRE techniques during GPS‑available intervals, the developed method refines the estimates of unknown system parameters, thereby enabling longer and more reliable operation in GPS‑denied environments. A Lyapunov‑based switched‑system stability analysis establishes that improved parameter estimates obtained through concurrent learning allow extended operation in GPS‑denied intervals without compromising closed‑loop system stability. Simulation results validate the theoretical findings, demonstrating dwell‑time extensions and enhanced trajectory tracking performance.

PaperID: 10768, https://arxiv.org/pdf/2504.18316.pdf

Authors: Yacine Majdoub, Eya Ben Charrada, Haifa Touati

Title: Towards Adaptive Software Agents for Debugging

Abstract:
Using multiple agents was found to improve the debugging capabilities of Large Language Models. However, increasing the number of LLM‑agents has several drawbacks such as increasing the running costs and rising the risk for the agents to lose focus. In this work, we propose an adaptive agentic design, where the number of agents and their roles are determined dynamically based on the characteristics of the task to be achieved. In this design, the agents roles are not predefined, but are generated after analyzing the problem to be solved. Our initial evaluation shows that, with the adaptive design, the number of agents that are generated depends on the complexity of the buggy code. In fact, for simple code with mere syntax issues, the problem was usually fixed using one agent only. However, for more complex problems, we noticed the creation of a higher number of agents. Regarding the effectiveness of the fix, we noticed an average improvement of 11% compared to the one‑shot prompting. Given these promising results, we outline future research directions to improve our design for adaptive software agents that can autonomously plan and conduct their software goals.

PaperID: 10769, https://arxiv.org/pdf/2504.18260.pdf

Authors: Guanqun Bi, Zhuang Chen, Zhoufu Liu, Hongkai Wang, Xiyao Xiao, Yuqiang Xie, Wen Zhang, Yongkang Huang, Yuxuan Chen, Libiao Peng, Yi Feng, Minlie Huang

Title: MAGI: Multi-Agent Guided Interview for Psychiatric Assessment

Abstract:
Automating structured clinical interviews could revolutionize mental healthcare accessibility, yet existing large language models (LLMs) approaches fail to align with psychiatric diagnostic protocols. We present MAGI, the first framework that transforms the gold‑standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi‑agent collaboration. MAGI dynamically navigates clinical logic via four specialized agents: 1) an interview tree guided navigation agent adhering to the MINI's branching structure, 2) an adaptive question agent blending diagnostic probing, explaining, and empathy, 3) a judgment agent validating whether the response from participants meet the node, and 4) a diagnosis Agent generating Psychometric Chain‑of‑ Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Experimental results on 1,002 real‑world participants covering depression, generalized anxiety, social anxiety and suicide shows that MAGI advances LLM‑ assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.

PaperID: 10770, https://arxiv.org/pdf/2504.18243.pdf

Authors: Rong Cheng, Jinyi Liu, Yan Zheng, Fei Ni, Jiazhen Du, Hangyu Mao, Fuzheng Zhang, Bo Wang, Jianye Hao

Title: DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering

Abstract:
Multi‑Hop Question Answering (MHQA) tasks permeate real‑world applications, posing challenges in orchestrating multi‑step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual‑process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning‑augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine‑tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller‑scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual‑process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi‑hop reasoning tasks.

PaperID: 10771, https://arxiv.org/pdf/2504.18083.pdf

Authors: Yuqiao Yang, Yongzhao Zhang, Wenhao Liu, Jun Li, Pengtao Shi, DingYu Zhong, Jie Yang, Ting Chen, Sheng Cao, Yuntao Ren, Yongyue Wu, Xiaosong Zhang

Title: Automating Function-Level TARA for Automotive Full-Lifecycle Security

Abstract:
As modern vehicles evolve into intelligent and connected systems, their growing complexity introduces significant cybersecurity risks. Threat Analysis and Risk Assessment (TARA) has therefore become essential for managing these risks under mandatory regulations. However, existing TARA automation methods rely on static threat libraries, limiting their utility in the detailed, function‑level analyses demanded by industry. This paper introduces DefenseWeaver, the first system that automates function‑level TARA using component‑specific details and large language models (LLMs). DefenseWeaver dynamically generates attack trees and risk evaluations from system configurations described in an extended OpenXSAM++ format, then employs a multi‑agent framework to coordinate specialized LLM roles for more robust analysis. To further adapt to evolving threats and diverse standards, DefenseWeaver incorporates Low‑Rank Adaptation (LoRA) fine‑tuning and Retrieval‑Augmented Generation (RAG) with expert‑curated TARA reports. We validated DefenseWeaver through deployment in four automotive security projects, where it identified 11 critical attack paths, verified through penetration testing, and subsequently reported and remediated by the relevant automakers and suppliers. Additionally, DefenseWeaver demonstrated cross‑domain adaptability, successfully applying to unmanned aerial vehicles (UAVs) and marine navigation systems. In comparison to human experts, DefenseWeaver outperformed manual attack tree generation across six assessment scenarios. Integrated into commercial cybersecurity platforms such as UAES and Xiaomi, DefenseWeaver has generated over 8,200 attack trees. These results highlight its ability to significantly reduce processing time, and its scalability and transformative impact on cybersecurity across industries.

PaperID: 10772, https://arxiv.org/pdf/2504.18047.pdf

Authors: Yasser Nabil, Mahmoud Abdelhadi, Sameh Sorour, Hesham ElSawy, Sara A. Elsayed, Hossam S. Hassanein

Title: Spatiotemporal Analysis of Parallelized Computing at the Extreme Edge

Abstract:
Extreme Edge Computing (EEC) pushes computing even closer to end users than traditional Multi‑access Edge Computing (MEC), harnessing the idle resources of Extreme Edge Devices (EEDs) to enable low‑latency, distributed processing. However, EEC faces key challenges, including spatial randomness in device distribution, limited EED computational power necessitating parallel task execution, vulnerability to failure, and temporal randomness due to variability in wireless communication and execution times. These challenges highlight the need for a rigorous analytical framework to evaluate EEC performance. We present the first spatiotemporal mathematical model for EEC over large‑scale millimeter‑wave networks. Utilizing stochastic geometry and an Absorbing Continuous‑Time Markov Chain (ACTMC), the framework captures the complex interaction between communication and computation performance, including their temporal overlap during parallel execution. We evaluate two key metrics: average task response delay and task completion probability. Together, they provide a holistic view of latency and reliability. The analysis considers fundamental offloading strategies, including randomized and location‑aware schemes, while accounting for EED failures. Results show that there exists an optimal task segmentation that minimizes delay. Under limited EED availability, we investigate a bias‑based EEC and MEC collaboration that offloads excess demand to MEC resources, effectively reducing congestion and improving system responsiveness.

PaperID: 10773, https://arxiv.org/pdf/2504.18039.pdf

Authors: Zheng Zhang, Nuoqian Xiao, Qi Chai, Deheng Ye, Hao Wang

Title: MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind

Abstract:
Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players' identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player's suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent‑versus‑agent simulations and studies with human players, we demonstrate MultiMind's superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human‑like social reasoning across multimodal domains.

PaperID: 10774, https://arxiv.org/pdf/2504.18031.pdf

Authors: Kai Xiong, Chenxin Yang, Yujie Qin, Wanzhi Ma, Chau Yuen

Title: Joint Resource Estimation and Trajectory Optimization for eVTOL-involved CR network: A Monte Carlo Tree Search-based Approach

Abstract:
Electric Vertical Take‑Off and Landing (eVTOL) aircraft, pivotal to Advanced Air Mobility (AAM), are emerging as a transformative transportation paradigm with the potential to redefine urban and regional mobility. While these systems offer unprecedented efficiency in transporting people and goods, they rely heavily on computation capability, safety‑critical operations such as real‑time navigation, environmental sensing, and trajectory tracking‑‑necessitating robust offboard computational support. A widely adopted solution involves offloading these tasks to terrestrial base stations (BSs) along the flight path. However, air‑to‑ground connectivity is often constrained by spectrum conflicts with terrestrial users, which poses a significant challenge to maintaining reliable task execution. Cognitive radio (CR) techniques offer promising capabilities for dynamic spectrum access, making them a natural fit for addressing this issue. Existing studies often overlook the time‑varying nature of BS resources, such as spectrum availability and CPU cycles, which leads to inaccurate trajectory planning, suboptimal offloading success rates, excessive energy consumption, and operational delays. To address these challenges, we propose a trajectory optimization framework for eVTOL swarms that maximizes task offloading success probability while minimizing both energy consumption and resource competition (e.g., spectrum and CPU cycles) with primary terrestrial users. The proposed algorithm integrates a Multi‑Armed Bandit (MAB) model to dynamically estimate BS resource availability and a Monte Carlo Tree Search (MCTS) algorithm to determine optimal offloading decisions, selecting both the BSs and access time windows that align with energy and temporal constraints.

PaperID: 10775, https://arxiv.org/pdf/2504.17967.pdf

Authors: Kevin Song, Andrew Trotter, Jake Y. Chen

Title: LLM Agent Swarm for Hypothesis-Driven Drug Discovery

Abstract:
Drug discovery remains a formidable challenge: more than 90 percent of candidate molecules fail in clinical evaluation, and development costs often exceed one billion dollars per approved therapy. Disparate data streams, from genomics and transcriptomics to chemical libraries and clinical records, hinder coherent mechanistic insight and slow progress. Meanwhile, large language models excel at reasoning and tool integration but lack the modular specialization and iterative memory required for regulated, hypothesis‑driven workflows. We introduce PharmaSwarm, a unified multi‑agent framework that orchestrates specialized LLM "agents" to propose, validate, and refine hypotheses for novel drug targets and lead compounds. Each agent accesses dedicated functionality‑‑automated genomic and expression analysis; a curated biomedical knowledge graph; pathway enrichment and network simulation; interpretable binding affinity prediction‑‑while a central Evaluator LLM continuously ranks proposals by biological plausibility, novelty, in silico efficacy, and safety. A shared memory layer captures validated insights and fine‑tunes underlying submodels over time, yielding a self‑improving system. Deployable on low‑code platforms or Kubernetes‑based microservices, PharmaSwarm supports literature‑driven discovery, omics‑guided target identification, and market‑informed repurposing. We also describe a rigorous four‑tier validation pipeline spanning retrospective benchmarking, independent computational assays, experimental testing, and expert user studies to ensure transparency, reproducibility, and real‑world impact. By acting as an AI copilot, PharmaSwarm can accelerate translational research and deliver high‑confidence hypotheses more efficiently than traditional pipelines.

PaperID: 10776, https://arxiv.org/pdf/2504.17950.pdf

Authors: Isadora White, Kolby Nottingham, Ayush Maniar, Max Robinson, Hansen Lillemark, Mehul Maheshwari, Lianhui Qin, Prithviraj Ammanabrolu

Title: Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

Abstract:
Collaboration is ubiquitous and essential in day‑to‑day life ‑‑ from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open‑world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state‑of‑the‑art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill‑optimized for multi‑agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in‑context and imitation learning. Our website can be found here: https://mindcraft‑minecollab.github.io/

PaperID: 10777, https://arxiv.org/pdf/2504.17934.pdf

Authors: Chaoran Chen, Zhiping Zhang, Ibrahim Khalilov, Bingcan Guo, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li

Title: Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

Abstract:
The rise of Large Language Models (LLMs) has revolutionized Graphical User Interface (GUI) automation through LLM‑powered GUI agents, yet their ability to process sensitive data with limited human oversight raises significant privacy and security risks. This position paper identifies three key risks of GUI agents and examines how they differ from traditional GUI automation and general autonomous agents. Despite these risks, existing evaluations focus primarily on performance, leaving privacy and security assessments largely unexplored. We review current evaluation metrics for both GUI and general LLM agents and outline five key challenges in integrating human evaluators for GUI agent assessments. To address these gaps, we advocate for a human‑centered evaluation framework that incorporates risk assessments, enhances user awareness through in‑context consent, and embeds privacy and security considerations into GUI agent design and evaluation.

PaperID: 10778, https://arxiv.org/pdf/2504.17824.pdf

Authors: Yibin Wang, Jiaxi Xie, Lakshminarayanan Subramanian

Title: EduBot -- Can LLMs Solve Personalized Learning and Programming Assignments?

Abstract:
The prevalence of Large Language Models (LLMs) is revolutionizing the process of writing code. General and code LLMs have shown impressive performance in generating standalone functions and code‑completion tasks with one‑shot queries. However, the ability to solve comprehensive programming tasks with recursive requests and bug fixes remains questionable. In this paper, we propose EduBot, an intelligent automated assistant system that combines conceptual knowledge teaching, end‑to‑end code development, personalized programming through recursive prompt‑driven methods, and debugging with limited human interventions powered by LLMs. We show that EduBot can solve complicated programming tasks consisting of sub‑tasks with increasing difficulties ranging from conceptual to coding questions by recursive automatic prompt‑driven systems without finetuning on LLMs themselves. To further evaluate EduBot's performance, we design and conduct a benchmark suite consisting of 20 scenarios in algorithms, machine learning, and real‑world problems. The result shows that EduBot can complete most scenarios in less than 20 minutes. Based on the benchmark suites, we perform a comparative study to take different LLMs as the backbone and to verify EduBot's compatibility and robustness across LLMs with varying capabilities. We believe that EduBot is an exploratory approach to explore the potential of pre‑trained LLMs in multi‑step reasoning and code generation for solving personalized assignments with knowledge learning and code generation.

PaperID: 10779, https://arxiv.org/pdf/2504.17590.pdf

Authors: Sihem Bakri, Indrakshi Dey, Harun Siljak, Marco Ruffini, Nicola Marchetti

Title: Mitigating xApp conflicts for efficient network slicing in 6G O-RAN: a graph convolutional-based attention network approach

Abstract:
O‑RAN (Open‑Radio Access Network) offers a flexible, open architecture for next‑generation wireless networks. Network slicing within O‑RAN allows network operators to create customized virtual networks, each tailored to meet the specific needs of a particular application or service. Efficiently managing these slices is crucial for future 6G networks. O‑RAN introduces specialized software applications called xApps that manage different network functions. In network slicing, an xApp can be responsible for managing a separate network slice. To optimize resource allocation across numerous network slices, these xApps must coordinate. Traditional methods where all xApps communicate freely can lead to excessive overhead, hindering network performance. In this paper, we address the issue of xApp conflict mitigation by proposing an innovative Zero‑Touch Management (ZTM) solution for radio resource management in O‑RAN. Our approach leverages Multi‑Agent Reinforcement Learning (MARL) to enable xApps to learn and optimize resource allocation without the need for constant manual intervention. We introduce a Graph Convolutional Network (GCN)‑based attention mechanism to streamline communication among xApps, reducing overhead and improving overall system efficiency. Our results compare traditional MARL, where all xApps communicate, against our MARL GCN‑based attention method. The findings demonstrate the superiority of our approach, especially as the number of xApps increases, ultimately providing a scalable and efficient solution for optimal network slicing management in O‑RAN.

PaperID: 10780, https://arxiv.org/pdf/2504.17575.pdf

Authors: Kristoffer Christensen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Title: A Multi-Agent, Laxity-Based Aggregation Strategy for Cost-Effective Electric Vehicle Charging and Local Transformer Overload Prevention

Abstract:
The rapid electrification of transportation, driven by stringent decarbonization targets and supportive policies, poses significant challenges for distribution system operators (DSOs). When numerous electric vehicles (EVs) charge concurrently, local transformers risk overloading ‑ a problem that current tariff‑based strategies do not adequately address. This paper introduces an aggregator‑based coordination mechanism that shifts EV charging from congested to underutilized periods using a rule‑based scheduling algorithm. Unlike conventional methods that depend on complex real‑time pricing signals or optimization‑heavy solutions, the aggregator approach uses a simple yet effective "laxity" measure to prioritize charging flexibility. To assess technical and economic viability, a multi‑agent simulation was developed to replicate residential user behavior and DSO constraints under the use of a 400 kVA low‑voltage transformer. The results indicate that overloads are completely eliminated with minimal inconvenience to users, whose increased charging costs are offset by the aggregator at an annual total of under DKK 6000 ‑ significantly lower than the cost of infrastructure reinforcement. This study contributes by (i) quantifying the compensation needed to prevent large‑scale overloads, (ii) presenting a replicable, computationally feasible, rule‑based aggregator model for DSOs, and (iii) comparing aggregator solutions to costly transformer upgrades, underscoring the aggregator's role as a viable tool for future distribution systems.

PaperID: 10781, https://arxiv.org/pdf/2504.17526.pdf

Authors: Yuelin Liu, Haiyuan Li, Xenofon Vasilakos, Rasheed Hussain, Dimitra Simeonidou

Title: Cooperative Task Offloading through Asynchronous Deep Reinforcement Learning in Mobile Edge Computing for Future Networks

Abstract:
Future networks (including 6G) are poised to accelerate the realisation of Internet of Everything. However, it will result in a high demand for computing resources to support new services. Mobile Edge Computing (MEC) is a promising solution, enabling to offload computation‑intensive tasks to nearby edge servers from the end‑user devices, thereby reducing latency and energy consumption. However, relying solely on a single MEC server for task offloading can lead to uneven resource utilisation and suboptimal performance in complex scenarios. Additionally, traditional task offloading strategies specialise in centralised policy decisions, which unavoidably entail extreme transmission latency and reach computational bottleneck. To fill the gaps, we propose a latency and energy efficient Cooperative Task Offloading framework with Transformer‑driven Prediction (CTO‑TP), leveraging asynchronous multi‑agent deep reinforcement learning to address these challenges. This approach fosters edge‑edge cooperation and decreases the synchronous waiting time by performing asynchronous training, optimising task offloading, and resource allocation across distributed networks. The performance evaluation demonstrates that the proposed CTO‑TP algorithm reduces up to 80% overall system latency and 87% energy consumption compared to the baseline schemes.

PaperID: 10782, https://arxiv.org/pdf/2504.17409.pdf

Authors: Tianhao Shao, Bohan Feng, Yingying Zhou, Bin Guo, Kaixing Zhao

Title: AGCo-MATA: Air-Ground Collaborative Multi-Agent Task Allocation in Mobile Crowdsensing

Abstract:
Rapid progress in intelligent unmanned systems has presented new opportunities for mobile crowd sensing (MCS). Today, heterogeneous air‑ground collaborative multi‑agent framework, which comprise unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), have presented superior flexibility and efficiency compared to traditional homogeneous frameworks in complex sensing tasks. Within this context, task allocation among different agents always play an important role in improving overall MCS quality. In order to better allocate tasks among heterogeneous collaborative agents, in this paper, we investigated two representative complex multi‑agent task allocation scenarios with dual optimization objectives: (1) For AG‑FAMT (Air‑Ground Few Agents More Tasks) scenario, the objectives are to maximize the task completion while minimizing the total travel distance; (2) For AG‑MAFT (Air‑Ground More Agents Few Tasks) scenario, where the agents are allocated based on their locations, has the optimization objectives of minimizing the total travel distance while reducing travel time cost. To achieve this, we proposed a Multi‑Task Minimum Cost Maximum Flow (MT‑MCMF) optimization algorithm tailored for AG‑FAMT, along with a multi‑objective optimization algorithm called W‑ILP designed for AG‑MAFT, with a particular focus on optimizing the charging path planning of UAVs. Our experiments based on a large‑scale real‑world dataset demonstrated that the proposed two algorithms both outperform baseline approaches under varying experimental settings, including task quantity, task difficulty, and task distribution, providing a novel way to improve the overall quality of mobile crowdsensing tasks.

PaperID: 10783, https://arxiv.org/pdf/2504.17356.pdf

Authors: Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao

Title: Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning

Abstract:
Feature selection aims to preprocess the target dataset, find an optimal and most streamlined feature subset, and enhance the downstream machine learning task. Among filter, wrapper, and embedded‑based approaches, the reinforcement learning (RL)‑based subspace exploration strategy provides a novel objective optimization‑directed perspective and promising performance. Nevertheless, even with improved performance, current reinforcement learning approaches face challenges similar to conventional methods when dealing with complex datasets. These challenges stem from the inefficient paradigm of using one agent per feature and the inherent complexities present in the datasets. This observation motivates us to investigate and address the above issue and propose a novel approach, namely HRLFS. Our methodology initially employs a Large Language Model (LLM)‑based hybrid state extractor to capture each feature's mathematical and semantic characteristics. Based on this information, features are clustered, facilitating the construction of hierarchical agents for each cluster and sub‑cluster. Extensive experiments demonstrate the efficiency, scalability, and robustness of our approach. Compared to contemporary or the one‑feature‑one‑agent RL‑based approaches, HRLFS improves the downstream ML performance with iterative feature subspace exploration while accelerating total run time by reducing the number of agents involved.

PaperID: 10784, https://arxiv.org/pdf/2504.17355.pdf

Authors: Xiaohan Huang, Dongjie Wang, Zhiyuan Ning, Ziyue Qiao, Qingqing Long, Haowei Zhu, Yi Du, Min Wu, Yuanchun Zhou, Meng Xiao

Title: Collaborative Multi-Agent Reinforcement Learning for Automated Feature Transformation with Graph-Driven Path Optimization

Abstract:
Feature transformation methods aim to find an optimal mathematical feature‑feature crossing process that generates high‑value features and improves the performance of downstream machine learning tasks. Existing frameworks, though designed to mitigate manual costs, often treat feature transformations as isolated operations, ignoring dynamic dependencies between transformation steps. To address the limitations, we propose TCTO, a collaborative multi‑agent reinforcement learning framework that automates feature engineering through graph‑driven path optimization. The framework's core innovation lies in an evolving interaction graph that models features as nodes and transformations as edges. Through graph pruning and backtracking, it dynamically eliminates low‑impact edges, reduces redundant operations, and enhances exploration stability. This graph also provides full traceability to empower TCTO to reuse high‑utility subgraphs from historical transformations. To demonstrate the efficacy and adaptability of our approach, we conduct comprehensive experiments and case studies, which show superior performance across a range of datasets.

PaperID: 10785, https://arxiv.org/pdf/2504.17200.pdf

Authors: Yangxinyu Xie, Bowen Jiang, Tanwi Mallick, Joshua David Bergerson, John K. Hutchison, Duane R. Verner, Jordan Branham, M. Ross Alexander, Robert B. Ross, Yan Feng, Leslie-Anne Levy, Weijie Su, Camillo J. Taylor

Title: A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation

Abstract:
Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision‑makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context‑specific information, particularly in areas requiring specialized knowledge. In this work we propose a retrieval‑augmented generation (RAG)‑based multi‑agent LLM system to support analysis and decision‑making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user‑centered, multi‑agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through an RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert‑led case studies demonstrates that WildfireGPT significantly outperforms existing LLM‑based solutions for decision support.

PaperID: 10786, https://arxiv.org/pdf/2504.17129.pdf

Authors: Seyed Yousef Soltanian, Wenlong Zhang

Title: Peer-Aware Cost Estimation in Nonlinear General-Sum Dynamic Games for Mutual Learning and Intent Inference

Abstract:
Dynamic game theory is a powerful tool in modeling multi‑agent interactions and human‑robot systems. In practice, since the objective functions of both agents may not be explicitly known to each other, these interactions can be modeled as incomplete‑information general‑sum dynamic games. Solving for equilibrium policies for such games presents a major challenge, especially if the games involve nonlinear underlying dynamics. To simplify the problem, existing work often assumes that one agent is an expert with complete information about its peer, which can lead to biased estimates and failures in coordination. To address this challenge, we propose a nonlinear peer‑aware cost estimation (N‑PACE) algorithm for general‑sum dynamic games. In N‑PACE, using iterative linear quadratic (ILQ) approximation of dynamic games, each agent explicitly models the learning dynamics of its peer agent while inferring their objective functions and updating its own control policy accordingly in real time, which leads to unbiased and fast learning of the unknown objective function of the peer agent. Additionally, we demonstrate how N‑PACE enables intent communication by explicitly modeling the peer's learning dynamics. Finally, we show how N‑PACE outperforms baseline methods that disregard the learning behavior of the other agent, both analytically and using our case studies

PaperID: 10787, https://arxiv.org/pdf/2504.17128.pdf

Authors: Seyed Yousef Soltanian, Wenlong Zhang

Title: PACE: A Framework for Learning and Control in Linear Incomplete-Information Differential Games

Abstract:
In this paper, we address the problem of a two‑player linear quadratic differential game with incomplete information, a scenario commonly encountered in multi‑agent control, human‑robot interaction (HRI), and approximation methods for solving general‑sum differential games. While solutions to such linear differential games are typically obtained through coupled Riccati equations, the complexity increases when agents have incomplete information, particularly when neither is aware of the other's cost function. To tackle this challenge, we propose a model‑based Peer‑Aware Cost Estimation (PACE) framework for learning the cost parameters of the other agent. In PACE, each agent treats its peer as a learning agent rather than a stationary optimal agent, models their learning dynamics, and leverages this dynamic to infer the cost function parameters of the other agent. This approach enables agents to infer each other's objective function in real time based solely on their previous state observations and dynamically adapt their control policies. Furthermore, we provide a theoretical guarantee for the convergence of parameter estimation and the stability of system states in PACE. Additionally, in our numerical studies, we demonstrate how modeling the learning dynamics of the other agent benefits PACE, compared to approaches that approximate the other agent as having complete information, particularly in terms of stability and convergence speed.

PaperID: 10788, https://arxiv.org/pdf/2504.17087.pdf

Authors: Yuran Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, Benoit Boulet

Title: Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

Abstract:
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, the use of LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs' judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two aforementioned issues, we propose a three‑stage meta‑judge selection pipeline: 1) developing a comprehensive rubric with GPT‑4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low‑scoring judgments. Compared to methods using a single LLM as both judge and meta‑judge, our pipeline introduces multi‑agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single‑agent baseline. Our work demonstrates the potential of LLMs as meta‑judges and lays the foundation for future research on constructing preference datasets for LLM‑as‑a‑judge reinforcement learning.

PaperID: 10789, https://arxiv.org/pdf/2504.17018.pdf

Authors: Prashant Chandrasekar, Mariel Couvillion, Ayshwarya Saktheeswaran, Jessica Zeitz

Title: LLM impact on BLV programming

Abstract:
Large Language Models (LLMs) are rapidly becoming integral to a wide range of tools, tasks, and problem‑solving processes, especially in software development. Originally designed for natural language processing tasks such as text generation, LLMs are increasingly being used to assist both professionals and students in writing code. This growing reliance on LLM‑based tools is reshaping programming workflows and task execution. In this study, we explore the impact of these technologies on blind and low‑vision (BLV) developers. Our review of existing literature indicates that while LLMs help mitigate some of the challenges faced by BLV programmers, they also introduce new forms of inaccessibility. We conducted an evaluation of five popular LLM‑powered integrated development environments (IDEs), assessing their performance across a comprehensive set of programming tasks. Our findings highlight several unsupported scenarios, instances of incorrect model output, and notable limitations in interaction support for specific tasks. Through observing BLV developers as they engaged in coding activities, we uncovered key interaction barriers that go beyond model accuracy or code generation quality. This paper outlines the challenges and corresponding opportunities for improving accessibility in the context of generative AI‑assisted programming. Addressing these issues can meaningfully enhance the programming experience for BLV developers. As the generative AI revolution continues to unfold, it must also address the unique burdens faced by this community.

PaperID: 10790, https://arxiv.org/pdf/2504.16918.pdf

Authors: Raghav Thind, Youran Sun, Ling Liang, Haizhao Yang

Title: OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Abstract:
Optimization plays a vital role in scientific research and practical applications. However, formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM‑powered AI agents, and achieve superior performance over current state‑of‑the‑art methods. Our framework is built upon the following key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high‑level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in 5.8× and 3.1× drops in productivity, respectively. Furthermore, we introduce UCB‑based debug scheduling to dynamically switch between alternative plans, yielding an additional 3.3× productivity gain. Our design emphasizes multi‑agent collaboration, and our experiments confirm that combining diverse models leads to performance gains. Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.

PaperID: 10791, https://arxiv.org/pdf/2504.16902.pdf

Authors: Idan Habler, Ken Huang, Vineeth Sai Narajala, Prashant Kulkarni

Title: Building A Secure Agentic AI Application Leveraging A2A Protocol

Abstract:
As Agentic AI systems evolve from basic workflows to complex multi agent collaboration, robust protocols such as Google's Agent2Agent (A2A) become essential enablers. To foster secure adoption and ensure the reliability of these complex interactions, understanding the secure implementation of A2A is essential. This paper addresses this goal by providing a comprehensive security analysis centered on the A2A protocol. We examine its fundamental elements and operational dynamics, situating it within the framework of agent communication development. Utilizing the MAESTRO framework, specifically designed for AI risks, we apply proactive threat modeling to assess potential security issues in A2A deployments, focusing on aspects such as Agent Card management, task execution integrity, and authentication methodologies. Based on these insights, we recommend practical secure development methodologies and architectural best practices designed to build resilient and effective A2A systems. Our analysis also explores how the synergy between A2A and the Model Context Protocol (MCP) can further enhance secure interoperability. This paper equips developers and architects with the knowledge and practical guidance needed to confidently leverage the A2A protocol for building robust and secure next generation agentic applications.

PaperID: 10792, https://arxiv.org/pdf/2504.16736.pdf

Authors: Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, Weinan Zhang

Title: A Survey of AI Agent Protocols

Abstract:
The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real‑world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and triggering the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two‑dimensional classification that differentiates context‑oriented versus inter‑agent protocols and general‑purpose versus domain‑specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next‑generation protocols. These characteristics include adaptability, privacy preservation, and group‑based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.

PaperID: 10793, https://arxiv.org/pdf/2504.16729.pdf

Authors: Weixi Li, Rongzuo Guo, Yuning Wang, Fangying Chen

Title: MEC Task Offloading in AIoT: A User-Centric DRL Model Splitting Inference Scheme

Abstract:
With the rapid development of the Artificial Intelligence of Things (AIoT), mobile edge computing (MEC) becomes an essential technology underpinning AIoT applications. However, multi‑angle resource constraints, multi‑user task competition, and the complexity of task offloading decisions in dynamic MEC environments present new technical challenges. Therefore, a user‑centric deep reinforcement learning (DRL) model splitting inference scheme is proposed to address the problem. This scheme combines model splitting inference technology and designs a UCMS_MADDPG‑based offloading algorithm to realize efficient model splitting inference responses in the dynamic MEC environment with multi‑angle resource constraints. Specifically, we formulate a joint optimization problem that integrates resource allocation, server selection, and task offloading, aiming to minimize the weighted sum of task execution delay and energy consumption. We also introduce a user‑server co‑selection algorithm to address the selection issue between users and servers. Furthermore, we design an algorithm centered on user pre‑decision to coordinate the outputs of continuous and discrete hybrid decisions, and introduce a priority sampling mechanism based on reward‑error trade‑off to optimize the experience replay mechanism of the network. Simulation results show that the proposed UCMS_MADDPG‑based offloading algorithm demonstrates superior overall performance compared with other benchmark algorithms in dynamic environments.

PaperID: 10794, https://arxiv.org/pdf/2504.16505.pdf

Authors: Meng Chu, Yukang Chen, Haokun Gui, Shaozuo Yu, Yi Wang, Jiaya Jia

Title: TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning

Abstract:
Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions: (1) TravelQA, a novel dataset of 265k question‑answer pairs combining 160k text QA from authentic travel sources, 100k vision‑language QA featuring maps and location imagery, and 5k expert‑annotated Chain‑of‑Thought reasoning examples; (2) Travel‑CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies. Through fine‑tuning experiments on state‑of‑the‑art vision‑language models (LLaVA, Qwen‑VL, Shikra), we achieve 6.2‑9.4% base improvements, further enhanced by Travel‑CoT reasoning. Our model demonstrates superior capabilities in contextual travel recommendations, map interpretation, and scene understanding while providing practical information such as operating hours and cultural insights. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general‑purpose models and establishing new standards for multimodal travel assistance systems.

PaperID: 10795, https://arxiv.org/pdf/2504.16489.pdf

Authors: Senmao Qi, Yifei Zou, Peng Li, Ziyi Lin, Xiuzhen Cheng, Dongxiao Yu

Title: Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate

Abstract:
Multi‑Agent Debate (MAD), leveraging collaborative interactions among Large Language Models (LLMs), aim to enhance reasoning capabilities in complex tasks. However, the security implications of their iterative dialogues and role‑playing characteristics, particularly susceptibility to jailbreak attacks eliciting harmful content, remain critically underexplored. This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT‑4o, GPT‑4, GPT‑3.5‑turbo, and DeepSeek) without compromising internal agents. We introduce a novel structured prompt‑rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role‑driven escalation, iterative refinement, and rhetorical obfuscation. Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single‑agent setups. Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios. These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real‑world deployment.

PaperID: 10796, https://arxiv.org/pdf/2504.16129.pdf

Authors: Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

Title: MARFT: Multi-Agent Reinforcement Fine-Tuning

Abstract:
LLM‑based Multi‑Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high‑quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine‑tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM‑based MARL and proposes a novel paradigm termed Multi‑Agent Reinforcement Fine‑Tuning (MARFT). We introduce a brand‑new POMDP called Flex‑POMDP, which aligns with the LaMAS optimization in real‑world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi‑agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS‑oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open‑source implementation to facilitate adoption and further research. The latter sections of the paper explore real‑world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao‑ai/MARFT.

PaperID: 10797, https://arxiv.org/pdf/2504.16122.pdf

Authors: Xuhui Zhou, Zhe Su, Sophie Feng, Jiaxu Zhou, Jen-tse Huang, Hsien-Te Kao, Spencer Lynch, Svitlana Volkova, Tongshuang Sherry Wu, Anita Woolley, Hao Zhu, Maarten Sap

Title: SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation

Abstract:
Social simulation through large language model (LLM) agents is a promising approach to explore and validate hypotheses related to social science questions and LLM agents behavior. We present SOTOPIA‑S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate multi‑turn and multi‑party LLM‑based interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA‑S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non‑technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA‑S4 with two use cases involving dyadic hiring negotiation and multi‑party planning scenarios.

PaperID: 10798, https://arxiv.org/pdf/2504.16078.pdf

Authors: Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, Razvan Pascanu

Title: LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Abstract:
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain‑of‑Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub‑optimal exploration and the knowing‑doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub‑optimally in decision‑making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing‑doing gap. We propose mitigation of these shortcomings by fine‑tuning via Reinforcement Learning (RL) on self‑generated CoT rationales. Our experiments across multi‑armed bandits, contextual bandits, and Tic‑tac‑toe, demonstrate that RL fine‑tuning enhances the decision‑making abilities of LLMs by increasing exploration and narrowing the knowing‑doing gap. Finally, we study both classic exploration mechanisms, such as ε‑greedy, and LLM‑specific approaches, such as self‑correction and self‑consistency, to enable more effective fine‑tuning of LLMs for decision‑making.

PaperID: 10799, https://arxiv.org/pdf/2504.15989.pdf

Authors: Junwei Hu, Weicheng Zheng, Yihan Liu, Yan Liu

Title: Optimizing Token Consumption in LLMs: A Nano Surge Approach for Code Reasoning Efficiency

Abstract:
With the increasing adoption of large language models (LLMs) in software engineering, the Chain of Thought (CoT) reasoning paradigm has become an essential approach for automated code repair. However, the explicit multi‑step reasoning in CoT leads to substantial increases in token consumption, reducing inference efficiency and raising computational costs, especially for complex code repair tasks. Most prior research has focused on improving the correctness of code repair while largely overlooking the resource efficiency of the reasoning process itself. To address this challenge, this paper proposes three targeted optimization strategies: Context Awareness, Responsibility Tuning, and Cost Sensitive. Context Awareness guides the model to focus on key contextual information, Responsibility Tuning refines the structure of the reasoning process through clearer role and responsibility assignment, and Cost Sensitive incorporates resource‑awareness to suppress unnecessary token generation during inference. Experiments across diverse code repair scenarios demonstrate that these methods can significantly reduce token consumption in CoT‑based reasoning without compromising repair quality. This work provides novel insights and methodological guidance for enhancing the efficiency of LLM‑driven code repair tasks in software engineering.

PaperID: 10800, https://arxiv.org/pdf/2504.15917.pdf

Authors: Hieu Huynh, Hai Phung, Hao Pham, Tien N. Nguyen, Vu Nguyen

Title: Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning

Abstract:
In Android GUI testing, generating an action sequence for a task that can be replayed as a test script is common. Generating sequences of actions and respective test scripts from task goals described in natural language can eliminate the need for manually writing test scripts. However, existing approaches based on large language models (LLM) often struggle with identifying the final action, and either end prematurely or continue past the final screen. In this paper, we introduce VisiDroid, a multi‑modal, LLM‑based, multi‑agent framework that iteratively determines the next action and leverages visual images of screens to detect the task's completeness. The multi‑modal approach enhances our model in two significant ways. First, this approach enables it to avoid prematurely terminating a task when textual content alone provides misleading indications of task completion. Additionally, visual input helps the tool avoid errors when changes in the GUI do not directly affect functionality toward task completion, such as adjustments to font sizes or colors. Second, the multi‑modal approach also ensures the tool not progress beyond the final screen, which might lack explicit textual indicators of task completion but could display a visual element indicating task completion, which is common in GUI apps. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, outperforming the best baseline relatively by 23.5%. We also demonstrate that our multi‑modal framework with images and texts enables the LLM to better determine when a task is completed.

PaperID: 10801, https://arxiv.org/pdf/2504.15909.pdf

Authors: Yunfan Gao, Yun Xiong, Yijie Zhong, Yuxi Bi, Ming Xue, Haofen Wang

Title: Synergizing RAG and Reasoning: A Systematic Review

Abstract:
Recent breakthroughs in large language models (LLMs), particularly in reasoning capabilities, have propelled Retrieval‑Augmented Generation (RAG) to unprecedented levels. By synergizing retrieval mechanisms with advanced reasoning, LLMs can now tackle increasingly complex problems. This paper presents a systematic review of the collaborative interplay between RAG and reasoning, clearly defining "reasoning" within the RAG context. It construct a comprehensive taxonomy encompassing multi‑dimensional collaborative objectives, representative paradigms, and technical implementations, and analyze the bidirectional synergy methods. Additionally, we critically evaluate current limitations in RAG assessment, including the absence of intermediate supervision for multi‑step reasoning and practical challenges related to cost‑risk trade‑offs. To bridge theory and practice, we provide practical guidelines tailored to diverse real‑world applications. Finally, we identify promising research directions, such as graph‑based knowledge integration, hybrid model collaboration, and RL‑driven optimization. Overall, this work presents a theoretical framework and practical foundation to advance RAG systems in academia and industry, fostering the next generation of RAG solutions.

PaperID: 10802, https://arxiv.org/pdf/2504.15716.pdf

Authors: Jie Zhu, Qian Chen, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

Title: DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models

Abstract:
Effective reasoning remains a core challenge for large language models (LLMs) in the financial domain, where tasks often require domain‑specific knowledge, precise numerical calculations, and strict adherence to compliance rules. We propose DianJin‑R1, a reasoning‑enhanced framework designed to address these challenges through reasoning‑augmented supervision and reinforcement learning. Central to our approach is DianJin‑R1‑Data, a high‑quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. Our models, DianJin‑R1‑7B and DianJin‑R1‑32B, are fine‑tuned from Qwen2.5‑7B‑Instruct and Qwen2.5‑32B‑Instruct using a structured format that generates both reasoning steps and final answers. To further refine reasoning quality, we apply Group Relative Policy Optimization (GRPO), a reinforcement learning method that incorporates dual reward signals: one encouraging structured outputs and another rewarding answer correctness. We evaluate our models on five benchmarks: three financial datasets (CFLUE, FinQA, and CCC) and two general reasoning benchmarks (MATH‑500 and GPQA‑Diamond). Experimental results show that DianJin‑R1 models consistently outperform their non‑reasoning counterparts, especially on complex financial tasks. Moreover, on the real‑world CCC dataset, our single‑call reasoning models match or even surpass the performance of multi‑agent systems that require significantly more computational cost. These findings demonstrate the effectiveness of DianJin‑R1 in enhancing financial reasoning through structured supervision and reward‑aligned learning, offering a scalable and practical solution for real‑world applications.

PaperID: 10803, https://arxiv.org/pdf/2504.15699.pdf

Authors: Ning Wang, Zihan Yan, Weiyang Li, Chuan Ma, He Chen, Tao Xiang

Title: Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

Abstract:
Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt‑decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state‑of‑the‑art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance.

PaperID: 10804, https://arxiv.org/pdf/2504.15585.pdf

Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Kai Wang, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Xiaolong Jin, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Liang Pang, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu

Title: A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Abstract:
The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine‑tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full‑stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off‑the‑shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre‑training, post‑training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800+ papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM‑based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.

PaperID: 10805, https://arxiv.org/pdf/2504.15552.pdf

Authors: Gengxian Cao, Fengyuan Li, Hong Duan, Ye Yang, Bofeng Wang, Donghe Li

Title: A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models

Abstract:
This paper introduces a novel multi‑Agent framework that automates the end to end production of Qinqiang opera by integrating Large Language Models , visual generation, and Text to Speech synthesis. Three specialized agents collaborate in sequence: Agent1 uses an LLM to craft coherent, culturally grounded scripts;Agent2 employs visual generation models to render contextually accurate stage scenes; and Agent3 leverages TTS to produce synchronized, emotionally expressive vocal performances. In a case study on Dou E Yuan, the system achieved expert ratings of 3.8 for script fidelity, 3.5 for visual coherence, and 3.8 for speech accuracy‑culminating in an overall score of 3.6, a 0.3 point improvement over a Single Agent baseline. Ablation experiments demonstrate that removing Agent2 or Agent3 leads to drops of 0.4 and 0.5 points, respectively, underscoring the value of modular collaboration. This work showcases how AI driven pipelines can streamline and scale the preservation of traditional performing arts, and points toward future enhancements in cross modal alignment, richer emotional nuance, and support for additional opera genres.

PaperID: 10806, https://arxiv.org/pdf/2504.15541.pdf

Authors: Qichao Liu, Heye Huang, Shiyue Zhao, Lei Shi, Soyoung Ahn, Xiaopeng Li

Title: RiskNet: Interaction-Aware Risk Forecasting for Autonomous Driving in Long-Tail Scenarios

Abstract:
Ensuring the safety of autonomous vehicles (AVs) in long‑tail scenarios remains a critical challenge, particularly under high uncertainty and complex multi‑agent interactions. To address this, we propose RiskNet, an interaction‑aware risk forecasting framework, which integrates deterministic risk modeling with probabilistic behavior prediction for comprehensive risk assessment. At its core, RiskNet employs a field‑theoretic model that captures interactions among ego vehicle, surrounding agents, and infrastructure via interaction fields and force. This model supports multidimensional risk evaluation across diverse scenarios (highways, intersections, and roundabouts), and shows robustness under high‑risk and long‑tail settings. To capture the behavioral uncertainty, we incorporate a graph neural network (GNN)‑based trajectory prediction module, which learns multi‑modal future motion distributions. Coupled with the deterministic risk field, it enables dynamic, probabilistic risk inference across time, enabling proactive safety assessment under uncertainty. Evaluations on the highD, inD, and rounD datasets, spanning lane changes, turns, and complex merges, demonstrate that our method significantly outperforms traditional approaches (e.g., TTC, THW, RSS, NC Field) in terms of accuracy, responsiveness, and directional sensitivity, while maintaining strong generalization across scenarios. This framework supports real‑time, scenario‑adaptive risk forecasting and demonstrates strong generalization across uncertain driving environments. It offers a unified foundation for safety‑critical decision‑making in long‑tail scenarios.

PaperID: 10807, https://arxiv.org/pdf/2504.15474.pdf

Authors: Sidong Feng, Changhao Du, Huaxiao Liu, Qingnan Wang, Zhengwei Lv, Gang Huo, Xu Yang, Chunyang Chen

Title: Agent for User: Testing Multi-User Interactive Features in TikTok

Abstract:
TikTok, a widely‑used social media app boasting over a billion monthly active users, requires effective app quality assurance for its intricate features. Feature testing is crucial in achieving this goal. However, the multi‑user interactive features within the app, such as live streaming, voice calls, etc., pose significant challenges for developers, who must handle simultaneous device management and user interaction coordination. To address this, we introduce a novel multi‑agent approach, powered by the Large Language Models (LLMs), to automate the testing of multi‑user interactive app features. In detail, we build a virtual device farm that allocates the necessary number of devices for a given multi‑user interactive task. For each device, we deploy an LLM‑based agent that simulates a user, thereby mimicking user interactions to collaboratively automate the testing process. The evaluations on 24 multi‑user interactive tasks within the TikTok app, showcase its capability to cover 75% of tasks with 85.9% action similarity and offer 87% time savings for developers. Additionally, we have also integrated our approach into the real‑world TikTok testing platform, aiding in the detection of 26 multi‑user interactive bugs.

PaperID: 10808, https://arxiv.org/pdf/2504.15425.pdf

Authors: Songyuan Zhang, Oswin So, Mitchell Black, Zachary Serlin, Chuchu Fan

Title: Solving Multi-Agent Safe Optimal Control with Distributed Epigraph Form MARL

Abstract:
Tasks for multi‑robot systems often require the robots to collaborate and complete a team goal while maintaining safety. This problem is usually formalized as a constrained Markov decision process (CMDP), which targets minimizing a global cost and bringing the mean of constraint violation below a user‑defined threshold. Inspired by real‑world robotic applications, we define safety as zero constraint violation. While many safe multi‑agent reinforcement learning (MARL) algorithms have been proposed to solve CMDPs, these algorithms suffer from unstable training in this setting. To tackle this, we use the epigraph form for constrained optimization to improve training stability and prove that the centralized epigraph form problem can be solved in a distributed fashion by each agent. This results in a novel centralized training distributed execution MARL algorithm named Def‑MARL. Simulation experiments on 8 different tasks across 2 different simulators show that Def‑MARL achieves the best overall performance, satisfies safety constraints, and maintains stable training. Real‑world hardware experiments on Crazyflie quadcopters demonstrate the ability of Def‑MARL to safely coordinate agents to complete complex collaborative tasks compared to other methods.

PaperID: 10809, https://arxiv.org/pdf/2504.15418.pdf

Authors: Victoria Marie Tuck, Hardik Parwana, Pei-Wei Chen, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, S. Shankar Sastry, Sanjit A. Seshia

Title: MRTA-Sim: A Modular Simulator for Multi-Robot Allocation, Planning, and Control in Open-World Environments

Abstract:
This paper introduces MRTA‑Sim, a Python/ROS2/Gazebo simulator for testing approaches to Multi‑Robot Task Allocation (MRTA) problems on simulated robots in complex, indoor environments. Grid‑based approaches to MRTA problems can be too restrictive for use in complex, dynamic environments such in warehouses, department stores, hospitals, etc. However, approaches that operate in free‑space often operate at a layer of abstraction above the control and planning layers of a robot and make an assumption on approximate travel time between points of interest in the system. These abstractions can neglect the impact of the tight space and multi‑agent interactions on the quality of the solution. Therefore, MRTA solutions should be tested with the navigation stacks of the robots in mind, taking into account robot planning, conflict avoidance between robots, and human interaction and avoidance. This tool connects the allocation output of MRTA solvers to individual robot planning using the NAV2 stack and local, centralized multi‑robot deconfliction using Control Barrier Function‑Quadrtic Programs (CBF‑QPs), creating a platform closer to real‑world operation for more comprehensive testing of these approaches. The simulation architecture is modular so that users can swap out methods at different levels of the stack. We show the use of our system with a Satisfiability Modulo Theories (SMT)‑based approach to dynamic MRTA on a fleet of indoor delivery robots.

PaperID: 10810, https://arxiv.org/pdf/2504.15313.pdf

Authors: Yajie Yu, Yue Feng

Title: PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of Mind

Abstract:
Multi‑agents has exhibited significant intelligence in real‑word simulations with Large language models (LLMs) due to the capabilities of social cognition and knowledge retrieval. However, existing research on agents equipped with effective cognition chains including reasoning, planning, decision‑making and reflecting remains limited, especially in the dynamically interactive scenarios. In addition, unlike human, prompt‑based responses face challenges in psychological state perception and empirical calibration during uncertain gaming process, which can inevitably lead to cognition bias. In light of above, we introduce PolicyEvol‑Agent, a comprehensive LLM‑empowered framework characterized by systematically acquiring intentions of others and adaptively optimizing irrational strategies for continual enhancement. Specifically, PolicyEvol‑Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Simulation results, outperforming RL‑based models and agent‑based methods, demonstrate the superiority of PolicyEvol‑Agent for final gaming victory. Moreover, the policy evolution mechanism reveals the effectiveness of dynamic guideline adjustments in both automatic and human evaluation.

PaperID: 10811, https://arxiv.org/pdf/2504.15301.pdf

Authors: Zoi Lygizou, Dimitris Kalles

Title: A biologically Inspired Trust Model for Open Multi-Agent Systems that is Resilient to Rapid Performance Fluctuations

Abstract:
Trust management provides an alternative solution for securing open, dynamic, and distributed multi‑agent systems, where conventional cryptographic methods prove to be impractical. However, existing trust models face challenges related to agent mobility, changing behaviors, and the cold start problem. To address these issues we introduced a biologically inspired trust model in which trustees assess their own capabilities and store trust data locally. This design improves mobility support, reduces communication overhead, resists disinformation, and preserves privacy. Despite these advantages, prior evaluations revealed limitations of our model in adapting to provider population changes and continuous performance fluctuations. This study proposes a novel algorithm, incorporating a self‑classification mechanism for providers to detect performance drops potentially harmful for the service consumers. Simulation results demonstrate that the new algorithm outperforms its original version and FIRE, a well‑known trust and reputation model, particularly in handling dynamic trustee behavior. While FIRE remains competitive under extreme environmental changes, the proposed algorithm demonstrates greater adaptability across various conditions. In contrast to existing trust modeling research, this study conducts a comprehensive evaluation of our model using widely recognized trust model criteria, assessing its resilience against common trust‑related attacks while identifying strengths, weaknesses, and potential countermeasures. Finally, several key directions for future research are proposed.

PaperID: 10812, https://arxiv.org/pdf/2504.15263.pdf

Authors: Ehsan Ahmadi, Chao Wang

Title: Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning

Abstract:
Construction tasks are inherently unpredictable, with dynamic environments and safety‑critical demands posing significant risks to workers. Exoskeletons offer potential assistance but falter without accurate intent recognition across diverse locomotion modes. This paper presents a locomotion prediction agent leveraging Large Language Models (LLMs) augmented with memory systems, aimed at improving exoskeleton assistance in such settings. Using multimodal inputs ‑ spoken commands and visual data from smart glasses ‑ the agent integrates a Perception Module, Short‑Term Memory (STM), Long‑Term Memory (LTM), and Refinement Module to predict locomotion modes effectively. Evaluation reveals a baseline weighted F1‑score of 0.73 without memory, rising to 0.81 with STM, and reaching 0.90 with both STM and LTM, excelling with vague and safety‑critical commands. Calibration metrics, including a Brier Score drop from 0.244 to 0.090 and ECE from 0.222 to 0.044, affirm improved reliability. This framework supports safer, high‑level human‑exoskeleton collaboration, with promise for adaptive assistive systems in dynamic industries.

PaperID: 10813, https://arxiv.org/pdf/2504.15228.pdf

Authors: Maxime Robeyns, Martin Szummer, Laurence Aitchison

Title: A Self-Improving Coding Agent

Abstract:
Recent advancements in Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world. LLMs are often deployed in agent systems: code that orchestrates LLM calls and provides them with tools. We demonstrate that an agent system, equipped with basic coding tools, can autonomously edit itself, and thereby improve its performance on benchmark tasks. We find performance gains from 17% to 53% on a random subset of SWE Bench Verified, with additional performance gains on LiveCodeBench, as well as synthetically generated agent benchmarks. Our work represents an advancement in the automated and open‑ended design of agentic systems, and demonstrates a data‑efficient, non gradient‑based learning mechanism driven by LLM reflection and code updates.

PaperID: 10814, https://arxiv.org/pdf/2504.15146.pdf

Authors: Wei Zhou, Ailiya Borjigin, Cong He

Title: Behavioral Universe Network (BUN): A Behavioral Information-Based Framework for Complex Systems

Abstract:
Modern digital ecosystems feature complex, dynamic interactions among autonomous entities across diverse domains. Traditional models often separate agents and objects, lacking a unified foundation to capture their interactive behaviors. This paper introduces the Behavioral Universe Network (BUN), a theoretical framework grounded in the Agent‑Interaction‑Behavior (AIB) formalism. BUN treats subjects (active agents), objects (resources), and behaviors (operations) as first‑class entities, all governed by a shared Behavioral Information Base (BIB). We detail the AIB core concepts and demonstrate how BUN leverages information‑driven triggers, semantic enrichment, and adaptive rules to coordinate multi‑agent systems. We highlight key benefits: enhanced behavior analysis, strong adaptability, and cross‑domain interoperability. We conclude by positioning BUN as a promising foundation for next‑generation digital governance and intelligent applications.

PaperID: 10815, https://arxiv.org/pdf/2504.15130.pdf

Authors: Kushal Shah, Jihyun Park, Seung-Kyum Choi

Title: Neural ATTF: A Scalable Solution to Lifelong Multi-Agent Path Planning

Abstract:
Multi‑Agent Pickup and Delivery (MAPD) is a fundamental problem in robotics, particularly in applications such as warehouse automation and logistics. Existing solutions often face challenges in scalability, adaptability, and efficiency, limiting their applicability in dynamic environments with real‑time planning requirements. This paper presents Neural ATTF (Adaptive Task Token Framework), a new algorithm that combines a Priority Guided Task Matching (PGTM) Module with Neural STA (Space‑Time A), a data‑driven path planning method. Neural STA enhances path planning by enabling rapid exploration of the search space through guided learned heuristics and ensures collision avoidance under dynamic constraints. PGTM prioritizes delayed agents and dynamically assigns tasks by prioritizing agents nearest to these tasks, optimizing both continuity and system throughput. Experimental evaluations against state‑of‑the‑art MAPD algorithms, including TPTS, CENTRAL, RMCA, LNS‑PBS, and LNS‑wPBS, demonstrate the superior scalability, solution quality, and computational efficiency of Neural ATTF. These results highlight the framework's potential for addressing the critical demands of complex, real‑world multi‑agent systems operating in high‑demand, unpredictable settings.

PaperID: 10816, https://arxiv.org/pdf/2504.15027.pdf

Authors: Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang

Title: DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Abstract:
Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource‑constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction‑following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi‑agent teachers to select, rewrite, and refine instruction‑response pairs that are more suitable for student LLMs to learn. After standard fine‑tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine‑grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real‑world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open‑source community.

PaperID: 10817, https://arxiv.org/pdf/2504.14948.pdf

Authors: Yusen Zheng, Yukun Cheng, Chenyang Xu, Xiaotie Deng

Title: Mechanism Design for Auctions with Externalities on Budgets

Abstract:
This paper studies mechanism design for auctions with externalities on budgets, a novel setting where the budgets that bidders commit are adjusted due to the externality of the competitors' allocation outcomes‑a departure from traditional auctions with fixed budgets. This setting is motivated by real‑world scenarios, for example, participants may increase their budgets in response to competitors' obtained items. We initially propose a general framework with homogeneous externalities to capture the interdependence between budget updates and allocation, formalized through a budget response function that links each bidder's effective budget to the amount of items won by others. The main contribution of this paper is to propose a truthful and individual rational auction mechanism for this novel auction setting, which achieves an approximation ratio of 1/3 with respect to the liquid welfare. This mechanism is inspired by the uniform‑price auction, in which an appropriate uniform price is selected to allocate items, ensuring the monotonicity of the allocation rule while accounting for budget adjustments. Additionally, this mechanism guarantees a constant approximation ratio by setting a purchase limit. Complementing this result, we establish an upper bound: no truthful mechanism can achieve an approximation ratio better than 1/2. This work offers a new perspective to study the impact of externalities on auctions, providing an approach to handle budget externalities in multi‑agent systems.

PaperID: 10818, https://arxiv.org/pdf/2504.14945.pdf

Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang

Title: Learning to Reason under Off-Policy Guidance

Abstract:
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi‑step reasoning and self‑reflection can emerge via reinforcement learning with verifiable rewards~(RLVR). However, existing RLVR approaches are inherently ``on‑policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF‑policY guidance), a framework that augments RLVR with off‑policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off‑policy demonstrations with on‑policy rollouts during training. Specifically, LUFFY combines the Mixed‑Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed‑policy training. Compared with previous RLVR methods, LUFFY achieves an over +6.4 average gain across six math benchmarks and an advantage of over +6.2 points in out‑of‑distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on‑policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on‑policy RLVR and demonstrates the great potential of utilizing off‑policy guidance in RLVR.

PaperID: 10819, https://arxiv.org/pdf/2504.14928.pdf

Authors: Yao Shi, Rongkeng Liang, Yong Xu

Title: EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

Abstract:
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource‑intensive, context‑dependent, and methodologically complex nature of teacher‑student interactions. We introduce EducationQ, a multi‑agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities ‑ with some smaller open‑source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed‑methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top‑performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs‑as‑teachers require specialized optimization beyond simple scaling, suggesting next‑generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.

PaperID: 10820, https://arxiv.org/pdf/2504.14874.pdf

Authors: Jianqiang Zhang, Kaijun Yang

Title: Event triggered optimal formation control for nonlinear multi-agent systems under Denial-of-Service attacks

Abstract:
This paper investigates the optimal formation control problem of a class of nonlinear multi‑agent systems(MASs) under Denial‑of‑Service(DoS) attacks. We design the optimal formation control law using an event‑triggered control scheme to achieve formation objectives under DoS attacks. Critic neural network (NN)‑based approach is employed to achieve the optimal control policy under DoS attacks. Event‑triggered mechanism is introduced to ensure the saving of control resources. Additionally, Lyapunov stability theory is utilized to demonstrate that the local neighborhood formation error exhibits exponential stability and the estimation error of weights are uniformly ultimately bounded. Finally, the effectiveness of the control algorithm is validated through matlab simulations. The results indicate that under DoS attacks, the nonlinear MAS successfully achieves the desired formation for the MAS.

PaperID: 10821, https://arxiv.org/pdf/2504.14868.pdf

Authors: Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Ruoyu Wang, Hongyang He, Wenyu Zhu, Xinhang Yuan, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang

Title: Twin Co-Adaptive Dialogue for Progressive Image Generation

Abstract:
Modern text‑to‑image generation systems have enabled the creation of remarkably realistic and high‑quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin‑Co, a framework that leverages synchronized, co‑adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin‑Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co‑adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin‑Co not only enhances user experience by reducing trial‑and‑error iterations but also improves the quality of the generated images, streamlining creative process across various applications.

PaperID: 10822, https://arxiv.org/pdf/2504.14837.pdf

Authors: Jiahui Li, Tongwang Wu, Yuren Mao, Yunjun Gao, Yajie Feng, Huaizhong Liu

Title: SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation

Abstract:
High quality SQL corpus is essential for intelligent database. For example, Text‑to‑SQL requires SQL queries and correspond natural language questions as training samples. However, collecting such query corpus remains challenging in practice due to the high cost of manual annotation, which highlights the importance of automatic SQL generation. Despite recent advances, existing generation methods still face limitations in achieving both diversity and cost‑effectiveness. Besides, many methods also treat all tables equally, which overlooks schema complexity and leads to under‑utilization of structurally rich tables. To address these issues, this paper proposes a multi‑agent framework for high‑quality and large‑scale SQL generation, dubbed SQL‑Factory. It decomposes the generation process into three collaborative teams: the Generation Team explores diverse query structures using a powerful language model, the Expansion Team scales promising patterns via a lightweight language model, and the Management Team adaptively schedules the workflow and evaluates the quality of synthesized queries. This modular framework ensures a balanced trade‑off between diversity, scalability, and generation cost. We apply SQL‑Factory to four widely used benchmarks and generate over 300,000 SQL queries with less than 200 API cost. Our generated queries achieve higher diversity compared to other methods, and extensive experiments demonstrate that the generated queries significantly improve the model performance in various downstream tasks.

PaperID: 10823, https://arxiv.org/pdf/2504.14824.pdf

Authors: Wei Wang, Nan Cheng, Conghao Zhou, Haixia Peng, Haibo Zhou, Zhou Su, Xuemin, Shen

Title: An Enhanced Dual-Currency VCG Auction Mechanism for Resource Allocation in IoV: A Value of Information Perspective

Abstract:
The Internet of Vehicles (IoV) is undergoing a transformative evolution, enabled by advancements in future 6G network technologies, to support intelligent, highly reliable, and low‑latency vehicular services. However, the enhanced capabilities of loV have heightened the demands for efficient network resource allocation while simultaneously giving rise to diverse vehicular service requirements. For network service providers (NSPs), meeting the customized resource‑slicing requirements of vehicle service providers (VSPs) while maximizing social welfare has become a significant challenge. This paper proposes an innovative solution by integrating a mean‑field multi‑agent reinforcement learning (MFMARL) framework with an enhanced Vickrey‑Clarke‑Groves (VCG) auction mechanism to address the problem of social welfare maximization under the condition of unknown VSP utility functions. The core of this solution is introducing the ``value of information" as a novel monetary metric to estimate the expected benefits of VSPs, thereby ensuring the effective execution of the VCG auction mechanism. MFMARL is employed to optimize resource allocation for social welfare maximization while adapting to the intelligent and dynamic requirements of IoV. The proposed enhanced VCG auction mechanism not only protects the privacy of VSPs but also reduces the likelihood of collusion among VSPs, and it is theoretically proven to be dominant‑strategy incentive compatible (DSIC). The simulation results demonstrate that, compared to the VCG mechanism implemented using quantization methods, the proposed mechanism exhibits significant advantages in convergence speed, social welfare maximization, and resistance to collusion, providing new insights into resource allocation in intelligent 6G networks.

PaperID: 10824, https://arxiv.org/pdf/2504.14822.pdf

Authors: Rui Qiu, Shijie Chen, Yu Su, Po-Yin Yen, Han-Wei Shen

Title: Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

Abstract:
Systematic reviews (SRs) are vital for evidence‑based practice in high stakes disciplines, such as healthcare, but are often impeded by intensive labors and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high‑quality summaries. To that end, we introduce InsightAgent, a human‑centered interactive AI agent powered by large language models that revolutionize this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi‑agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real‑time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human‑written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high‑quality systematic review.

PaperID: 10825, https://arxiv.org/pdf/2504.14757.pdf

Authors: Minh V. T. Pham, Huy N. Phan, Hoang N. Phan, Cuong Le Chi, Tien N. Nguyen, Nghi D. Q. Bui

Title: SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

Abstract:
Large language models (LLMs) are transforming automated program repair (APR) through agent‑based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high‑quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces‑limits progress, particularly for open‑source models. In this work, we present SWE‑Synth, a framework for synthesizing realistic, verifiable, and process‑aware bug‑fix datasets at the repository level. SWE‑Synth leverages LLM agents to simulate debugging workflows, producing not only bug‑fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE‑Synth outperform those trained on real‑world datasets by 2.3% on SWE‑Bench Lite. Our results highlight the potential of synthetic, agent‑generated data to advance the state of the art in APR and software engineering automation.

PaperID: 10826, https://arxiv.org/pdf/2504.14681.pdf

Authors: Zeyu Wang, Frank P. -W. Lo, Qian Chen, Yongqi Zhang, Chen Lin, Xu Chen, Zhenhua Yu, Alexander J. Thompson, Eric M. Yeatman, Benny P. L. Lo

Title: An LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework

Abstract:
Existing LLM‑enabled multi‑agent frameworks are predominantly limited to digital or simulated environments and confined to narrowly focused knowledge domain, constraining their applicability to complex engineering tasks that require the design of physical embodiment, cross‑disciplinary integration, and constraint‑aware reasoning. This work proposes a multi‑agent autonomous mechatronics design framework, integrating expertise across mechanical design, optimization, electronics, and software engineering to autonomously generate functional prototypes with minimal direct human design input. Operating primarily through a language‑driven workflow, the framework incorporates structured human feedback to ensure robust performance under real‑world constraints. To validate its capabilities, the framework is applied to a real‑world challenge involving autonomous water‑quality monitoring and sampling, where traditional methods are labor‑intensive and ecologically disruptive. Leveraging the proposed system, a fully functional autonomous vessel was developed with optimized propulsion, cost‑effective electronics, and advanced control. The design process was carried out by specialized agents, including a high‑level planning agent responsible for problem abstraction and dedicated agents for structural, electronics, control, and software development. This approach demonstrates the potential of LLM‑based multi‑agent systems to automate real‑world engineering workflows and reduce reliance on extensive domain expertise.

PaperID: 10827, https://arxiv.org/pdf/2504.14557.pdf

Authors: Charlie Campbell, Hao Mark Chen, Wayne Luk, Hongxiang Fan

Title: Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction

Abstract:
Multi‑agent frameworks with Large Language Models (LLMs) have become promising tools for generating general‑purpose programming languages using test‑driven development, allowing developers to create more accurate and robust code. However, their potential has not been fully unleashed for domain‑specific programming languages, where specific domain exhibits unique optimization opportunities for customized improvement. In this paper, we take the first step in exploring multi‑agent code generation for quantum programs. By identifying the unique optimizations in quantum designs such as quantum error correction, we introduce a novel multi‑agent framework tailored to generating accurate, fault‑tolerant quantum code. Each agent in the framework focuses on distinct optimizations, iteratively refining the code using a semantic analyzer with multi‑pass inference, alongside an error correction code decoder. We also examine the effectiveness of inference‑time techniques, like Chain‑of‑Thought (CoT) and Retrieval‑Augmented Generation (RAG) in the context of quantum programming, uncovering observations that are different from general‑purpose code generation. To evaluate our approach, we develop a test suite to measure the impact each optimization has on the accuracy of the generated code. Our findings indicate that techniques such as structured CoT significantly improve the generation of quantum algorithms by up to 50%. In contrast, we have also found that certain techniques such as RAG show limited improvement, yielding an accuracy increase of only 4%. Moreover, we showcase examples of AI‑assisted quantum error prediction and correction, demonstrating the effectiveness of our multi‑agent framework in reducing the quantum errors of generated quantum programs.

PaperID: 10828, https://arxiv.org/pdf/2504.14520.pdf

Authors: Ahsan Bilal, Muhammad Ahmed Mohsin, Muhammad Umer, Muhammad Awais Khan Bangash, Muhammad Ali Jamshed

Title: Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey

Abstract:
This survey explores the development of meta‑thinking capabilities in Large Language Models (LLMs) from a Multi‑Agent Reinforcement Learning (MARL) perspective. Meta‑thinking self‑reflection, assessment, and control of thinking processes is an important next step in enhancing LLM reliability, flexibility, and performance, particularly for complex or high‑stakes tasks. The survey begins by analyzing current LLM limitations, such as hallucinations and the lack of internal self‑assessment mechanisms. It then talks about newer methods, including RL from human feedback (RLHF), self‑distillation, and chain‑of‑thought prompting, and each of their limitations. The crux of the survey is to talk about how multi‑agent architectures, namely supervisor‑agent hierarchies, agent debates, and theory of mind frameworks, can emulate human‑like introspective behavior and enhance LLM robustness. By exploring reward mechanisms, self‑play, and continuous learning methods in MARL, this survey gives a comprehensive roadmap to building introspective, adaptive, and trustworthy LLMs. Evaluation metrics, datasets, and future research avenues, including neuroscience‑inspired architectures and hybrid symbolic reasoning, are also discussed.

PaperID: 10829, https://arxiv.org/pdf/2504.14427.pdf

Authors: Spencer Lin, Miru Jun, Basem Rizk, Karen Shieh, Scott Fisher, Sharon Mozgai

Title: Optimizing SIA Development: A Case Study in User-Centered Design for Estuary, a Multimodal Socially Interactive Agent Framework

Abstract:
This case study presents our user‑centered design model for Socially Intelligent Agent (SIA) development frameworks through our experience developing Estuary, an open source multimodal framework for building low‑latency real‑time socially interactive agents. We leverage the Rapid Assessment Process (RAP) to collect the thoughts of leading researchers in the field of SIAs regarding the current state of the art for SIA development as well as their evaluation of how well Estuary may potentially address current research gaps. We achieve this through a series of end‑user interviews conducted by a fellow researcher in the community. We hope that the findings of our work will not only assist the continued development of Estuary but also guide the development of other future frameworks and technologies for SIAs.

PaperID: 10830, https://arxiv.org/pdf/2504.14422.pdf

Authors: Paul Fischer, Sebastian Kaltenbach, Sergey Litvinov, Sauro Succi, Petros Koumoutsakos

Title: Optimal Lattice Boltzmann Closures through Multi-Agent Reinforcement Learning

Abstract:
The Lattice Boltzmann method (LBM) offers a powerful and versatile approach to simulating diverse hydrodynamic phenomena, spanning microfluidics to aerodynamics. The vast range of spatiotemporal scales inherent in these systems currently renders full resolution impractical, necessitating the development of effective closure models for under‑resolved simulations. Under‑resolved LBMs are unstable, and while there is a number of important efforts to stabilize them, they often face limitations in generalizing across scales and physical systems. We present a novel, data‑driven, multiagent reinforcement learning (MARL) approach that drastically improves stability and accuracy of coarse‑grained LBM simulations. The proposed method uses a convolutional neural network to dynamically control the local relaxation parameter for the LB across the simulation grid. The LB‑MARL framework is showcased in turbulent Kolmogorov flows. We find that the MARL closures stabilize the simulations and recover the energy spectra of significantly more expensive fully resolved simulations while maintaining computational efficiency. The learned closure model can be transferred to flow scenarios unseen during training and has improved robustness and spectral accuracy compared to traditional LBM models. We believe that MARL closures open new frontiers for efficient and accurate simulations of a multitude of complex problems not accessible to present‑day LB methods alone.

PaperID: 10831, https://arxiv.org/pdf/2504.14399.pdf

Authors: Philip Wiese, Maurus Item, Luca Bertaccini, Yvan Tortorella, Angelo Garofalo, Luca Benini

Title: RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

Abstract:
As safety‑critical applications increasingly rely on data‑parallel floating‑point computations, there is an increasing need for flexible and configurable fault tolerance in parallel floating‑point accelerators such as tensor engines. While replication‑based methods ensure reliability but incur high area and power costs, error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE‑FT, a runtime‑configurable fault‑tolerant extension of the RedMulE matrix multiplication accelerator, balancing fault tolerance, area overhead, and performance impacts. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error‑detecting codes to protect the data path, RedMulE‑FT achieves an 11x uncorrected fault reduction with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.

PaperID: 10832, https://arxiv.org/pdf/2504.14348.pdf

Authors: Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu

Title: Manipulating Multimodal Agents via Cross-Modal Prompt Injection

Abstract:
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross‑modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision‑making process and execute unauthorized tasks. Our approach incorporates two key coordinated components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text‑to‑image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to construct the black‑box defensive system prompt through adversarial meta prompting and generate an malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms state‑of‑the‑art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real‑world multimodal autonomous agents, highlighting its potential implications for safety‑critical applications.

PaperID: 10833, https://arxiv.org/pdf/2504.14336.pdf

Authors: Duy Cao, Phu Nguyen, Vy Le, Tien N. Nguyen, Vu Nguyen

Title: Toward Generation of Test Cases from Task Descriptions via History-aware Planning

Abstract:
In automated web testing, generating test scripts from natural language task descriptions is crucial for enhancing the test generation process. This activity involves creating the correct sequences of actions to form test scripts for future testing activities. Current state‑of‑the‑art approaches are limited in generating these action sequences, as they either demand substantial manual effort for human demonstrations or fail to consider the history of previous web content and actions to decide the next action. In this paper, we introduce HxAgent, an iterative large language model agent planning approach that determines the next action based on: 1) observations of the current contents and feasible actions, 2) short‑term memory of previous web states and actions, and 3) long‑term experience with (in)correct action sequences. The agent generates a sequence of actions to perform a given task, which is effectively an automated test case to verify the task. We conducted an extensive empirical evaluation of HxAgent using two datasets. On the MiniWoB++ dataset, our approach achieves 97% exact‑match accuracy that is comparable to the best baselines while eliminating the need for human demonstrations required by those methods. For complex tasks requiring navigation through multiple actions and screens, HxAgent achieves an average 82% exact‑match. On the second dataset, comprising 350 task instances across seven popular websites, including YouTube, LinkedIn, Facebook, and Google, HxAgent achieves high performance, with 87% of the action sequences exactly matching the ground truth and a prefix‑match of 93%, outperforming the baseline by 59%.

PaperID: 10834, https://arxiv.org/pdf/2504.14326.pdf

Authors: Jinbo Wen, Jiawen Kang, Yang Zhang, Yue Zhong, Dusit Niyato, Jie Xu, Jianhang Tang, Chau Yuen

Title: Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses

Abstract:
Mobile metaverses are envisioned as a transformative digital ecosystem that delivers immersive, intelligent, and ubiquitous services through mobile devices. Driven by Large Language Models (LLMs) and Vision‑Language Models (VLMs), Artificial Intelligence (AI) agents hold the potential to empower the creation, maintenance, and evolution of mobile metaverses, enabling seamless human‑machine interaction and dynamic service adaptation. Currently, AI agents are primarily built upon cloud‑based LLMs and VLMs. However, several challenges hinder their efficient deployment, including high service latency and a risk of sensitive data leakage during perception and processing. In this paper, we develop an edge‑cloud collaboration‑based federated AI agent construction framework in mobile metaverses. Specifically, Edge Servers (ESs), as agent infrastructures, first create agent modules in a distributed manner. The cloud server then integrates these modules into AI agents and deploys them at the edge, thereby enabling low‑latency AI agent services for users. Considering that ESs may exhibit dynamic levels of willingness to participate in federated AI agent construction, we design a two‑period dynamic contract model to continuously incentivize ESs to participate in agent module creation, effectively addressing the dynamic information asymmetry between the cloud server and ESs. Furthermore, we propose an Enhanced Diffusion Model‑based Soft Actor‑Critic (EDMSAC) algorithm to effectively generate optimal dynamic contracts. In the algorithm, we apply dynamic structured pruning to DM‑based actor networks to enhance denoising efficiency and policy learning performance. Simulation results demonstrate that the EDMSAC algorithm outperforms the DMSAC algorithm by up to 23% in optimal dynamic contract generation.

PaperID: 10835, https://arxiv.org/pdf/2504.14325.pdf

Authors: Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The-Anh Han, German Castignani, Pietro Liò

Title: FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

Abstract:
Letting AI agents interact in multi‑agent applications adds a layer of complexity to the interpretability and prediction of AI outcomes, with profound implications for their trustworthy adoption in research and society. Game theory offers powerful models to capture and interpret strategic interaction among agents, but requires the support of reproducible, standardized and user‑friendly IT frameworks to enable comparison and interpretation of results. To this end, we present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory. We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents, depending on the employed Large Language Model (LLM) and used language, as well as on the personality trait or strategic knowledge of the agents. Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios and compare the results across simulation campaigns and with game‑theoretic predictions, enabling the systematic discovery of biases, the anticipation of emerging behavior out of strategic interplays, and empowering further research into strategic decision‑making using LLM agents.

PaperID: 10836, https://arxiv.org/pdf/2504.14222.pdf

Authors: Mohammed Almutairi, Charles Chiang, Yuxin Bai, Diego Gomez-Zara

Title: tAIfa: Enhancing Team Effectiveness and Cohesion with AI-Generated Automated Feedback

Abstract:
Providing timely and actionable feedback is crucial for effective collaboration, learning, and coordination within teams. However, many teams face challenges in receiving feedback that aligns with their goals and promotes cohesion. We introduce tAIfa (``Team AI Feedback Assistant''), an AI agent that uses Large Language Models (LLMs) to provide personalized, automated feedback to teams and their members. tAIfa analyzes team interactions, identifies strengths and areas for improvement, and delivers targeted feedback based on communication patterns. We conducted a between‑subjects study with 18 teams testing whether using tAIfa impacted their teamwork. Our findings show that tAIfa improved communication and contributions within the teams. This paper contributes to the Human‑AI Interaction literature by presenting a computational framework that integrates LLMs to provide automated feedback, introducing tAIfa as a tool to enhance team engagement and cohesion, and providing insights into future AI applications to support team collaboration.

PaperID: 10837, https://arxiv.org/pdf/2504.14089.pdf

Authors: Kang He, Kaushik Roy

Title: LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models

Abstract:
Large language models (LLMs) have achieved remarkable multi‑step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof‑finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference‑time modular framework employing algorithm‑guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree‑of‑thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step‑by‑step reasoning. Additionally, we introduce two LLM‑free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference‑time computation to achieve higher proof accuracy, surpassing chain‑of‑thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT‑4o. Moreover, within LogicTree, GPT‑4o outperforms o3‑mini by 7.6% on average.

PaperID: 10838, https://arxiv.org/pdf/2504.13969.pdf

Authors: Nayoung Choi, Peace Cyebukayire, Jinho D. Choi

Title: Tinker Tales: Interactive Storytelling Framework for Early Childhood Narrative Development and AI Literacy

Abstract:
This paper presents Tinker Tales, an interactive storytelling framework in the format of a board game, designed to support both narrative development and AI literacy in early childhood. The framework integrates tangible and speech‑based interactions with AI through NFC chip‑attached pawns and tokens, along with a speaker and microphone. Children select and define key story elements‑such as characters, places, items, and emotions‑using the pawns and tokens, providing further details to the AI and receiving proper assistance, similar to how adults prompt AI for specific tasks (e.g., writing). For evaluation, several game sessions were simulated with a child AI agent, and the quality and safety of the generated stories were assessed from various perspectives. This work highlights the potential of combining physical and digital elements in AI literacy, offering a safe and engaging way for children to learn how to effectively collaborate with AI.

PaperID: 10839, https://arxiv.org/pdf/2504.13916.pdf

Authors: Yuanda Hu, Hou Jiani, Zhang Junyu, Yate Ge, Xiaohua Sun, Weiwei Guo

Title: Task Matters: Investigating Human Questioning Behavior in Different Household Service for Learning by Asking Robots

Abstract:
Learning by Asking (LBA) enables robots to identify knowledge gaps during task execution and acquire the missing information by asking targeted questions. However, different tasks often require different types of questions, and how to adapt questioning strategies accordingly remains underexplored. This paper investigates human questioning behavior in two representative household service tasks: a Goal‑Oriented task (refrigerator organization) and a Process‑Oriented task (cocktail mixing). Through a human‑human study involving 28 participants, we analyze the questions asked using a structured framework that encodes each question along three dimensions: acquired knowledge, cognitive process, and question form. Our results reveal that participants adapt both question types and their temporal ordering based on task structure. Goal‑Oriented tasks elicited early inquiries about user preferences, while Process‑Oriented tasks led to ongoing, parallel questioning of procedural steps and preferences. These findings offer actionable insights for developing task‑sensitive questioning strategies in LBA‑enabled robots for more effective and personalized human‑robot collaboration.

PaperID: 10840, https://arxiv.org/pdf/2504.13871.pdf

Authors: Yuanjun Feng, Vivek Chodhary, Yash Raj Shrestha

Title: Human aversion? Do AI Agents Judge Identity More Harshly Than Performance

Abstract:
This study examines the understudied role of algorithmic evaluation of human judgment in hybrid decision‑making systems, a critical gap in management research. While extant literature focuses on human reluctance to follow algorithmic advice, we reverse the perspective by investigating how AI agents based on large language models (LLMs) assess and integrate human input. Our work addresses a pressing managerial constraint: firms barred from deploying LLMs directly due to privacy concerns can still leverage them as mediating tools (for instance, anonymized outputs or decision pipelines) to guide high‑stakes choices like pricing or discounts without exposing proprietary data. Through a controlled prediction task, we analyze how an LLM‑based AI agent weights human versus algorithmic predictions. We find that the AI system systematically discounts human advice, penalizing human errors more severely than algorithmic errors‑‑a bias exacerbated when the agent's identity (human vs AI) is disclosed and the human is positioned second. These results reveal a disconnect between AI‑generated trust metrics and the actual influence of human judgment, challenging assumptions about equitable human‑AI collaboration. Our findings offer three key contributions. First, we identify a reverse algorithm aversion phenomenon, where AI agents undervalue human input despite comparable error rates. Second, we demonstrate how disclosure and positional bias interact to amplify this effect, with implications for system design. Third, we provide a framework for indirect LLM deployment that balances predictive power with data privacy. For practitioners, this research emphasize the need to audit AI weighting mechanisms, calibrate trust dynamics, and strategically design decision sequences in human‑AI systems.

PaperID: 10841, https://arxiv.org/pdf/2504.13848.pdf

Authors: Janet Rafner, Ryan Q. Guloy, Eden W. Wen, Catherine M. Chiodo, Jacob Sherson

Title: From Interaction to Collaboration: How Hybrid Intelligence Enhances Chatbot Feedback

Abstract:
Generative AI (GenAI) chatbots are becoming increasingly integrated into virtual assistant technologies, yet their success hinges on the ability to gather meaningful user feedback to improve interaction quality, system outcomes, and overall user acceptance. Successful chatbot interactions can enable organizations to build long‑term relationships with their customers and users, supporting customer loyalty and furthering the organization's goals. This study explores the impact of two distinct narratives and feedback collection mechanisms on user engagement and feedback behavior: a standard AI‑focused interaction versus a hybrid intelligence (HI) framed interaction. Initial findings indicate that while small‑scale survey measures allowed for no significant differences in user willingness to leave feedback, use the system, or trust the system, participants exposed to the HI narrative statistically significantly provided more detailed feedback. These initial findings offer insights into designing effective feedback systems for GenAI virtual assistants, balancing user effort with system improvement potential.

PaperID: 10842, https://arxiv.org/pdf/2504.13756.pdf

Authors: Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda

Title: Scaling sparse feature circuit finding for in-context learning

Abstract:
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in‑context learning (ICL). We identify abstract SAE features that (i) encode the model's knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero‑shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task‑execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma‑1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task‑detecting features with corresponding SAE latents that activate earlier in the prompt, that detect when tasks have been performed. They are causally linked with task‑execution features through the attention and MLP sublayers.

PaperID: 10843, https://arxiv.org/pdf/2504.13707.pdf

Authors: Yichen Wu, Qianqian Gao, Xudong Pan, Geng Hong, Min Yang

Title: OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation

Abstract:
As large language models (LLMs) are increasingly deployed as interactive agents, open‑ended human‑AI interactions can involve deceptive behaviors with serious real‑world consequences, yet existing evaluations remain largely scenario‑specific and model‑centric. We introduce OpenDeception, a lightweight framework for jointly evaluating deception risk from both sides of human‑AI dialogue. It consists of a scenario benchmark with 50 real‑world deception cases, an IntentNet that infers deceptive intent from agent reasoning, and a TrustNet that estimates user susceptibility. To address data scarcity, we synthesize high‑risk dialogues via LLM‑based role‑and‑goal simulation, and train the User Trust Scorer using contrastive learning on controlled response pairs, avoiding unreliable scalar labels. Experiments on 11 LLMs and three large reasoning models show that over 90% of goal‑driven interactions in most models exhibit deceptive intent, with stronger models displaying higher risk. A real‑world case study adapted from a documented AI‑induced suicide incident further demonstrates that our joint evaluation can proactively trigger warnings before critical trust thresholds are reached.

PaperID: 10844, https://arxiv.org/pdf/2504.13574.pdf

Authors: Zhenkai Qin, Feng Zhu, Huan Zeng, Xunyi Nong

Title: MAAM: A Lightweight Multi-Agent Aggregation Module for Efficient Image Classification Based on the MindSpore Framework

Abstract:
The demand for lightweight models in image classification tasks under resource‑constrained environments necessitates a balance between computational efficiency and robust feature representation. Traditional attention mechanisms, despite their strong feature modeling capability, often struggle with high computational complexity and structural rigidity, limiting their applicability in scenarios with limited computational resources (e.g., edge devices or real‑time systems). To address this, we propose the Multi‑Agent Aggregation Module (MAAM), a lightweight attention architecture integrated with the MindSpore framework. MAAM employs three parallel agent branches with independently parameterized operations to extract heterogeneous features, adaptively fused via learnable scalar weights, and refined through a convolutional compression layer. Leveraging MindSpore's dynamic computational graph and operator fusion, MAAM achieves 87.0% accuracy on the CIFAR‑10 dataset, significantly outperforming conventional CNN (58.3%) and MLP (49.6%) models, while improving training efficiency by 30%. Ablation studies confirm the critical role of agent attention (accuracy drops to 32.0% if removed) and compression modules (25.5% if omitted), validating their necessity for maintaining discriminative feature learning. The framework's hardware acceleration capabilities and minimal memory footprint further demonstrate its practicality, offering a deployable solution for image classification in resource‑constrained scenarios without compromising accuracy.

PaperID: 10845, https://arxiv.org/pdf/2504.13554.pdf

Authors: Xin Tang, Qian Chen, Wenjie Weng, Chao Jin, Zhang Liu, Jiacheng Wang, Geng Sun, Xiaohuan Li, Dusit Niyato

Title: Task Assignment and Exploration Optimization for Low Altitude UAV Rescue via Generative AI Enhanced Multi-agent Reinforcement Learning

Abstract:
The integration of emerging uncrewed aerial vehicles (UAVs) with artificial intelligence (AI) and ground‑embedded robots (GERs) has transformed emergency rescue operations in unknown environments. However, the high computational demands often exceed a single UAV's capacity, making it difficult to continuously provide stable high‑level services. To address this, this paper proposes a cooperation framework involving UAVs, GERs, and airships. The framework enables resource pooling through UAV‑to‑GER (U2G) and UAV‑to‑airship (U2A) links, offering computing services for offloaded tasks. Specifically, we formulate the multi‑objective problem of task assignment and exploration as a dynamic long‑term optimization problem aiming to minimize task completion time and energy use while ensuring stability. Using Lyapunov optimization, we transform it into a per‑slot deterministic problem and propose HG‑MADDPG, which combines the Hungarian algorithm with a GDM‑based multi‑agent deep deterministic policy gradient. Simulations demonstrate significant improvements in offloading efficiency, latency, and system stability over baselines.

PaperID: 10846, https://arxiv.org/pdf/2504.13535.pdf

Authors: Jiahao Song, Yuzhao Wang

Title: MusFlow: Multimodal Music Generation via Conditional Flow Matching

Abstract:
Music generation aims to create music segments that align with human aesthetics based on diverse conditional information. Despite advancements in generating music from specific textual descriptions (e.g., style, genre, instruments), the practical application is still hindered by ordinary users' limited expertise or time to write accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching. We employ multiple Multi‑Layer Perceptrons (MLPs) to align multimodal conditional information into the audio's CLAP embedding space. Conditional flow matching is trained to reconstruct the compressed Mel‑spectrogram in the pretrained VAE latent space guided by aligned feature embedding. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi‑agent collaboration, we construct an intelligent data annotation workflow centered around a fine‑tuned Qwen2‑VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, with each sample containing a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image‑to‑music, story‑to‑music, caption‑to‑music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high‑quality music pieces whether the input conditions are unimodal or multimodal. We hope this work can advance the application of music generation in multimedia field, making music creation more accessible. Our generated samples, code and dataset are available at musflow.github.io.

PaperID: 10847, https://arxiv.org/pdf/2504.13515.pdf

Authors: Mingwei Zheng, Danning Xie, Xiangyu Zhang

Title: Large Language Models for Validating Network Protocol Parsers

Abstract:
Network protocol parsers are essential for enabling correct and secure communication between devices. Bugs in these parsers can introduce critical vulnerabilities, including memory corruption, information leakage, and denial‑of‑service attacks. An intuitive way to assess parser correctness is to compare the implementation with its official protocol standard. However, this comparison is challenging because protocol standards are typically written in natural language, whereas implementations are in source code. Existing methods like model checking, fuzzing, and differential testing have been used to find parsing bugs, but they either require significant manual effort or ignore the protocol standards, limiting their ability to detect semantic violations. To enable more automated validation of parser implementations against protocol standards, we propose PARVAL, a multi‑agent framework built on large language models (LLMs). PARVAL leverages the capabilities of LLMs to understand both natural language and code. It transforms both protocol standards and their implementations into a unified intermediate representation, referred to as format specifications, and performs a differential comparison to uncover inconsistencies. We evaluate PARVAL on the Bidirectional Forwarding Detection (BFD) protocol. Our experiments demonstrate that PARVAL successfully identifies inconsistencies between the implementation and its RFC standard, achieving a low false positive rate of 5.6%. PARVAL uncovers seven unique bugs, including five previously unknown issues.

PaperID: 10848, https://arxiv.org/pdf/2504.13443.pdf

Authors: Michael J. Yuan, Carlos Lospoy, Sydney Lai, James Snewin, Ju Long

Title: Trust, but verify

Abstract:
Decentralized AI agent networks, such as Gaia, allows individuals to run customized LLMs on their own computers and then provide services to the public. However, in order to maintain service quality, the network must verify that individual nodes are running their designated LLMs. In this paper, we demonstrate that in a cluster of mostly honest nodes, we can detect nodes that run unauthorized or incorrect LLM through social consensus of its peers. We will discuss the algorithm and experimental data from the Gaia network. We will also discuss the intersubjective validation system, implemented as an EigenLayer AVS to introduce financial incentives and penalties to encourage honest behavior from LLM nodes.

PaperID: 10849, https://arxiv.org/pdf/2504.13424.pdf

Authors: Yang Shen, Shuqi Chai, Bing Li, Xiaodong Luo, Qingjiang Shi, Rongqing Zhang

Title: Decentralized Handover Parameter Optimization with MARL for Load Balancing in 5G Networks

Abstract:
In cellular networks, cell handover refers to the process where a device switches from one base station to another, and this mechanism is crucial for balancing the load among different cells. Traditionally, engineers would manually adjust parameters based on experience. However, the explosive growth in the number of cells has rendered manual tuning impractical. Existing research tends to overlook critical engineering details in order to simplify handover problems. In this paper, we classify cell handover into three types, and jointly model their mutual influence. To achieve load balancing, we propose a multi‑agent‑reinforcement‑learning (MARL)‑based scheme to automatically optimize the parameters. To reduce the agent interaction costs, a distributed training is implemented based on consensus approximation of global average load, and it is shown that the approximation error is bounded. Experimental results show that our proposed scheme outperforms existing benchmarks in balancing load and improving network performance.

PaperID: 10850, https://arxiv.org/pdf/2504.13351.pdf

Authors: Chen Wang, Fei Xia, Wenhao Yu, Tingnan Zhang, Ruohan Zhang, C. Karen Liu, Li Fei-Fei, Jie Tan, Jacky Liang

Title: Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

Abstract:
Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain‑of‑Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data ‑‑ videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real‑world robot experiments. Videos and code are available at https://chain‑of‑modality.github.io

PaperID: 10851, https://arxiv.org/pdf/2504.13263.pdf

Authors: Xinyue Wang, Kun Zhou, Wenyi Wu, Har Simrat Singh, Fang Nan, Songyao Jin, Aryan Philip, Saloni Patnaik, Hou Zhu, Shivam Singh, Parjanya Prashant, Qian Shen, Biwei Huang

Title: Causal-Copilot: An Autonomous Causal Analysis Agent

Abstract:
Causal analysis plays a foundational role in scientific discovery and reliable decision‑making, yet it remains largely inaccessible to domain experts due to its conceptual and algorithmic complexity. This disconnect between causal methodology and practical usability presents a dual challenge: domain experts are unable to leverage recent advances in causal learning, while causal researchers lack broad, real‑world deployment to test and refine their methods. To address this, we introduce Causal‑Copilot, an autonomous agent that operationalizes expert‑level causal analysis within a large language model framework. Causal‑Copilot automates the full pipeline of causal analysis for both tabular and time‑series data ‑‑ including causal discovery, causal inference, algorithm selection, hyperparameter optimization, result interpretation, and generation of actionable insights. It supports interactive refinement through natural language, lowering the barrier for non‑specialists while preserving methodological rigor. By integrating over 20 state‑of‑the‑art causal analysis techniques, our system fosters a virtuous cycle ‑‑ expanding access to advanced causal methods for domain experts while generating rich, real‑world applications that inform and advance causal theory. Empirical evaluations demonstrate that Causal‑Copilot achieves superior performance compared to existing baselines, offering a reliable, scalable, and extensible solution that bridges the gap between theoretical sophistication and real‑world applicability in causal analysis. A live interactive demo of Causal‑Copilot is available at https://causalcopilot.com/.

PaperID: 10852, https://arxiv.org/pdf/2504.13241.pdf

Authors: Paul Ghanem, Owen Howell, Michael Potter, Pau Closas, Alireza Ramezani, Deniz Erdogmus, Tales Imbiriba

Title: Recursive Deep Inverse Reinforcement Learning

Abstract:
Inferring an adversary's goals from exhibited behavior is crucial for counterplanning and non‑cooperative multi‑agent systems in domains like cybersecurity, military, and strategy games. Deep Inverse Reinforcement Learning (IRL) methods based on maximum entropy principles show promise in recovering adversaries' goals but are typically offline, require large batch sizes with gradient descent, and rely on first‑order updates, limiting their applicability in real‑time scenarios. We propose an online Recursive Deep Inverse Reinforcement Learning (RDIRL) approach to recover the cost function governing the adversary actions and goals. Specifically, we minimize an upper bound on the standard Guided Cost Learning (GCL) objective using sequential second‑order Newton updates, akin to the Extended Kalman Filter (EKF), leading to a fast (in terms of convergence) learning algorithm. We demonstrate that RDIRL is able to recover cost and reward functions of expert agents in standard and adversarial benchmark tasks. Experiments on benchmark tasks show that our proposed approach outperforms several leading IRL algorithms.

PaperID: 10853, https://arxiv.org/pdf/2504.13217.pdf

Authors: Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta

Title: Sustainability via LLM Right-sizing

Abstract:
Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting‑edge models, real‑world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open‑weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual‑LLM‑based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT‑4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma‑3 and Phi‑4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost‑efficiency, local deployment, or privacy. A cluster analysis revealed three model groups ‑‑ premium all‑rounders, competent generalists, and limited but safe performers ‑‑ highlighting trade‑offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance‑maximizing benchmarks to task‑ and context‑aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

PaperID: 10854, https://arxiv.org/pdf/2504.13203.pdf

Authors: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel

Title: X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Abstract:
Multi‑turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single‑turn safety, while adaptability and diversity remain among the key challenges of multi‑turn red‑teaming. To address these challenges, we present X‑Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X‑Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state‑of‑the‑art multi‑turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open‑weight and closed‑source models. In particular, X‑Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single‑turn attacks. Building on X‑Teaming, we introduce XGuard‑Train, an open‑source multi‑turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi‑turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi‑turn safety of LMs.

PaperID: 10855, https://arxiv.org/pdf/2504.13192.pdf

Authors: Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang

Title: CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

Abstract:
Recently, Large Language Model (LLM)‑empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM‑empowered RecSys still remains largely under‑investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black‑box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM‑empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human‑like decision‑making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human‑like capabilities of LLMs, where an LLM‑based agent is developed to attack LLM‑Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real‑world datasets demonstrate the effectiveness of our proposed attacking method.

PaperID: 10856, https://arxiv.org/pdf/2504.13145.pdf

Authors: Li-Cheng Lan, Andrew Bai, Minhao Cheng, Cho-Jui Hsieh, Tianyi Zhou

Title: Exploring Expert Failures Improves LLM Agent Tuning

Abstract:
Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interactions. Rejection Sampling Fine‑Tuning (RFT) has emerged as an effective method for finetuning LLMs as agents: it first imitates expert‑generated successful trajectories and further improves agentic skills through iterative fine‑tuning on successful, self‑generated trajectories. However, since the expert (e.g., GPT‑4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out‑of‑distribution (OOD). Upon investigating these challenging subtasks, we discovered that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are meticulously excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF successfully solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT (53. 6%) and GPT‑4 (35. 6%), and to the best of our knowledge, setting a new state‑of‑the‑art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.

PaperID: 10857, https://arxiv.org/pdf/2504.13032.pdf

Authors: Zheng Wang, Shu Xian Teo, Jun Jie Chew, Wei Shi

Title: InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning

Abstract:
Recent advancements in large language models (LLMs) have enabled their use as agents for planning complex tasks. Existing methods typically rely on a thought‑action‑observation (TAO) process to enhance LLM performance, but these approaches are often constrained by the LLMs' limited knowledge of complex tasks. Retrieval‑augmented generation (RAG) offers new opportunities by leveraging external databases to ground generation in retrieved information. In this paper, we identify two key challenges (enlargability and transferability) in applying RAG to task planning. We propose InstructRAG, a novel solution within a multi‑agent meta‑reinforcement learning framework, to address these challenges. InstructRAG includes a graph to organize past instruction paths (sequences of correct actions), an RL‑Agent with Reinforcement Learning to expand graph coverage for enlargability, and an ML‑Agent with Meta‑Learning to improve task generalization for transferability. The two agents are trained end‑to‑end to optimize overall planning performance. Our experiments on four widely used task planning datasets demonstrate that InstructRAG significantly enhances performance and adapts efficiently to new tasks, achieving up to a 19.2% improvement over the best existing approach.

PaperID: 10858, https://arxiv.org/pdf/2504.12891.pdf

Authors: Vicent Briva-Iglesias

Title: Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

Abstract:
The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single‑ and multi‑agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single‑agent systems are well‑suited for simpler translation tasks, multi‑agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain‑specific knowledge, and contextual awareness. To demonstrate the feasibility of multi‑agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi‑agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi‑agent systems may have the potential to significantly improve domain‑adaptability and contextual awareness, with superior translation quality to traditional MT or single‑agent systems. This paper also sets the stage for future research into multi‑agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

PaperID: 10859, https://arxiv.org/pdf/2504.12865.pdf

Authors: S. Shen, Z. Lin, W. Liu, C. Xin, W. Dai, S. Chen, X. Wen, X. Lan

Title: DashChat: Interactive Authoring of Industrial Dashboard Design Prototypes through Conversation with LLM-Powered Agents

Abstract:
Industrial dashboards, commonly deployed by organizations such as enterprises and governments, are increasingly crucial in data communication and decision‑making support across various domains. Designing an industrial dashboard prototype is particularly challenging due to its visual complexity, which can include data visualization, layout configuration, embellishments, and animations. Additionally, in real‑world industrial settings, designers often encounter numerous constraints. For instance, when companies negotiate collaborations with clients and determine design plans, they typically need to demo design prototypes and iterate on them based on mock data quickly. Such a task is very common and crucial during the ideation stage, as it not only helps save developmental costs but also avoids data‑related issues such as lengthy data handover periods. However, existing authoring tools of dashboards are mostly not tailored to such prototyping needs, and motivated by these gaps, we propose DashChat, an interactive system that leverages large language models (LLMs) to generate industrial dashboard design prototypes from natural language. We collaborated closely with designers from the industry and derived the requirements based on their practical experience. First, by analyzing 114 high‑quality industrial dashboards, we summarized their common design patterns and inject the identified ones into LLMs as reference. Next, we built a multi‑agent pipeline powered by LLMs to understand textual requirements from users and generate practical, aesthetic prototypes. Besides, functionally distinct, parallel‑operating agents are created to enable efficient generation. Then, we developed a user‑friendly interface that supports text‑based interaction for generating and modifying prototypes. Two user studies demonstrated that our system is both effective and efficient in supporting design prototyping.

PaperID: 10860, https://arxiv.org/pdf/2504.12777.pdf

Authors: James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

Title: Multi-Agent Reinforcement Learning Simulation for Environmental Policy Synthesis

Abstract:
Climate policy development faces significant challenges due to deep uncertainty, complex system dynamics, and competing stakeholder interests. Climate simulation methods, such as Earth System Models, have become valuable tools for policy exploration. However, their typical use is for evaluating potential polices, rather than directly synthesizing them. The problem can be inverted to optimize for policy pathways, but the traditional optimization approaches often struggle with non‑linear dynamics, heterogeneous agents, and comprehensive uncertainty quantification. We propose a framework for augmenting climate simulations with Multi‑Agent Reinforcement Learning (MARL) to address these limitations. We identify key challenges at the interface between climate simulations and the application of MARL in the context of policy synthesis, including reward definition, scalability with increasing agents and state spaces, uncertainty propagation across linked systems, and solution validation. Additionally, we discuss challenges in making MARL‑derived solutions interpretable and useful for policy‑makers. Our framework provides a foundation for more sophisticated climate policy exploration while acknowledging important limitations and areas for future research.

PaperID: 10861, https://arxiv.org/pdf/2504.12735.pdf

Authors: Lidong Zhai, Zhijie Qiu, Lvyang Zhang, Jiaqi Li, Yi Wang, Wen Lu, Xizhong Guo, Ge Sun

Title: The Athenian Academy: A Seven-Layer Architecture Model for Multi-Agent Systems

Abstract:
This paper proposes the "Academy of Athens" multi‑agent seven‑layer framework, aimed at systematically addressing challenges in multi‑agent systems (MAS) within artificial intelligence (AI) art creation, such as collaboration efficiency, role allocation, environmental adaptation, and task parallelism. The framework divides MAS into seven layers: multi‑agent collaboration, single‑agent multi‑role playing, single‑agent multi‑scene traversal, single‑agent multi‑capability incarnation, different single agents using the same large model to achieve the same target agent, single‑agent using different large models to achieve the same target agent, and multi‑agent synthesis of the same target agent. Through experimental validation in art creation, the framework demonstrates its unique advantages in task collaboration, cross‑scene adaptation, and model fusion. This paper further discusses current challenges such as collaboration mechanism optimization, model stability, and system security, proposing future exploration through technologies like meta‑learning and federated learning. The framework provides a structured methodology for multi‑agent collaboration in AI art creation and promotes innovative applications in the art field.

PaperID: 10862, https://arxiv.org/pdf/2504.12734.pdf

Authors: Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu

Title: Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Abstract:
Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task‑specific strategies or custom‑defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named \textscPandora, which takes advantage of \textscPython's \textscPandas API to construct a unified knowledge representation for alignment with LLM pre‑training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that \textscPandora outperforms existing unified frameworks and competes effectively with task‑specific methods.

PaperID: 10863, https://arxiv.org/pdf/2504.12714.pdf

Authors: Kunal Jha, Wilka Carvalho, Yancheng Liang, Simon S. Du, Max Kleiman-Weiner, Natasha Jaques

Title: Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination

Abstract:
Zero‑shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human‑compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized models do not generalize to new tasks, even if they are highly similar. Here, we study how reinforcement learning on a distribution of environments with a single partner enables learning general cooperative skills that support ZSC with many new partners on many new problems. We introduce two Jax‑based, procedural generators that create billions of solvable coordination challenges. We develop a new paradigm called Cross‑Environment Cooperation (CEC), and show that it outperforms competitive baselines quantitatively and qualitatively when collaborating with real people. Our findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms, which prove effective for collaboration with different partners. Together, our results suggest a new route toward designing generalist cooperative agents capable of interacting with humans without requiring human data.

PaperID: 10864, https://arxiv.org/pdf/2504.12682.pdf

Authors: Arth Bohra, Manvel Saroyan, Danil Melkozerov, Vahe Karufanyan, Gabriel Maher, Pascal Weinberger, Artem Harutyunyan, Giovanni Campagna

Title: WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

Abstract:
Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data‑extraction tasks across four common business and enterprise use‑cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well‑defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question‑answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.

PaperID: 10865, https://arxiv.org/pdf/2504.12612.pdf

Authors: Ching-Chun Chang, Isao Echizen

Title: Chronology of Multi-Agent Interactions for Provenance of Evolving Information

Abstract:
Provenance is the chronological history of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi‑agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi‑agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta‑information. At its core lies the notion of symbolic chronicles, representing signed and time‑stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.

PaperID: 10866, https://arxiv.org/pdf/2504.12563.pdf

Authors: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood

Title: MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Abstract:
Recent smaller language models such Phi‑3.5 and Phi‑4 rely on synthetic data generated using larger Language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta‑prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well‑trained LLM (Mistral‑7B‑v0.3) to two specialized domains‑Finance and Biomedicine‑without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre‑training corpora. Continually pre‑training Mistral‑7B‑v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying In‑Context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data without mixing any real data, is sufficient for effective domain adaptation when using MetaSynth.

PaperID: 10867, https://arxiv.org/pdf/2504.12321.pdf

Authors: Charlotte Siska, Anush Sankaran

Title: AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

Abstract:
In the past few years, Language Models (LMs) have shown par‑human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious input exploits the LM's weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of the jailbreak, which results in a wide variety of closed‑box approaches. In this research, we propose and demonstrate that system‑prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM‑based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding‑based classifiers and GPT‑4 zero‑shot detectors.To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed‑loop LLM‑based multi‑agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of a LLM detector.

PaperID: 10868, https://arxiv.org/pdf/2504.12297.pdf

Authors: Arthur N. Montanari, Ana Elisa D. Barioni, Chao Duan, Adilson E. Motter

Title: Optimal flock formation induced by agent heterogeneity

Abstract:
The study of flocking in biological systems has identified conditions for self‑organized collective behavior, inspiring the development of decentralized strategies to coordinate the dynamics of swarms of drones and other autonomous vehicles. Previous research has focused primarily on the role of the time‑varying interaction network among agents while assuming that the agents themselves are identical or nearly identical. Here, we depart from this conventional assumption to investigate how inter‑individual differences between agents affect the stability and convergence in flocking dynamics. We show that flocks of agents with optimally assigned heterogeneous parameters significantly outperform their homogeneous counterparts, achieving 20‑40% faster convergence to desired formations across various control tasks. These tasks include target tracking, flock formation, and obstacle maneuvering. In systems with communication delays, heterogeneity can enable convergence even when flocking is unstable for identical agents. Our results challenge existing paradigms in multi‑agent control and establish system disorder as an adaptive, distributed mechanism to promote collective behavior in flocking dynamics.

PaperID: 10869, https://arxiv.org/pdf/2504.11901.pdf

Authors: Luca Castri, Gloria Beraldo, Nicola Bellotto

Title: Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

Abstract:
The growing integration of robots in shared environments‑such as warehouses, shopping centres, and hospitals‑demands a deep understanding of the underlying dynamics and human behaviours, including how, when, and where individuals engage in various activities and interactions. This knowledge goes beyond simple correlation studies and requires a more comprehensive causal analysis. By leveraging causal inference to model cause‑and‑effect relationships, we can better anticipate critical environmental factors and enable autonomous robots to plan and execute tasks more effectively. To this end, we propose a novel causality‑based decision‑making framework that reasons over a learned causal model to assist the robot in deciding when and how to complete a given task. In the examined use case‑i.e., a warehouse shared with people‑we exploit the causal model to estimate battery usage and human obstructions as factors influencing the robot's task execution. This reasoning framework supports the robot in making informed decisions about task timing and strategy. To achieve this, we developed also PeopleFlow, a new Gazebo‑based simulator designed to model context‑sensitive human‑robot spatial interactions in shared workspaces. PeopleFlow features realistic human and robot trajectories influenced by contextual factors such as time, environment layout, and robot state, and can simulate a large number of agents. While the simulator is general‑purpose, in this paper we focus on a warehouse‑like environment as a case study, where we conduct an extensive evaluation benchmarking our causal approach against a non‑causal baseline. Our findings demonstrate the efficacy of the proposed solutions, highlighting how causal reasoning enables autonomous robots to operate more efficiently and safely in dynamic environments shared with humans.

PaperID: 10870, https://arxiv.org/pdf/2504.11874.pdf

Authors: Ruoyu Sun, Angelos Stefanidis, Zhengyong Jiang, Jionglong Su

Title: Factor-MCLS: Multi-agent learning system with reward factor matrix and multi-critic framework for dynamic portfolio optimization

Abstract:
Typical deep reinforcement learning (DRL) agents for dynamic portfolio optimization learn the factors influencing portfolio return and risk by analyzing the output values of the reward function while adjusting portfolio weights within the training environment. However, it faces a major limitation where it is difficult for investors to intervene in the training based on different levels of risk aversion towards each portfolio asset. This difficulty arises from another limitation: existing DRL agents may not develop a thorough understanding of the factors responsible for the portfolio return and risk by only learning from the output of the reward function. As a result, the strategy for determining the target portfolio weights is entirely dependent on the DRL agents themselves. To address these limitations, we propose a reward factor matrix for elucidating the return and risk of each asset in the portfolio. Additionally, we propose a novel learning system named Factor‑MCLS using a multi‑critic framework that facilitates learning of the reward factor matrix. In this way, our DRL‑based learning system can effectively learn the factors influencing portfolio return and risk. Moreover, based on the critic networks within the multi‑critic framework, we develop a risk constraint term in the training objective function of the policy function. This risk constraint term allows investors to intervene in the training of the DRL agent according to their individual levels of risk aversion towards the portfolio assets.

PaperID: 10871, https://arxiv.org/pdf/2504.11671.pdf

Authors: Ji Ma

Title: Computational Basis of LLM's Decision Making in Social Simulation

Abstract:
Large language models (LLMs) increasingly serve as human‑like decision‑making agents in social science and applied settings. These LLM‑agents are typically assigned human‑like characters and placed in real‑life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game, a classic behavioral experiment on fairness and prosocial behavior. We extract ``vectors of variable variations'' (e.g., ``male'' to ``female'') from the LLM's internal state. Manipulating these vectors during the model's inference can substantially alter how those variables relate to the model's decision‑making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer‑based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.

PaperID: 10872, https://arxiv.org/pdf/2504.11571.pdf

Authors: Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan

Title: GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

Abstract:
Large Language Model (LLM)‑powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well‑defined tasks with specified goals, the capabilities of agents in creative design tasks with open‑ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across four design types. We further present GraphicTown, an LLM agent framework with three design experts and 46 actions (tools) to choose from for executing each step of the planned workflows in web environments. Experiments with six LLMs demonstrate their ability to generate workflows that integrate both explicit design constraints from user queries and implicit commonsense constraints. However, these workflows often do not lead to successful execution outcomes, primarily due to challenges in: (1) reasoning about spatial relationships, (2) coordinating global dependencies across experts, and (3) retrieving the most appropriate action per step. We envision GraphicBench as a challenging yet valuable testbed for advancing LLM‑agent planning and execution in creative design tasks.

PaperID: 10873, https://arxiv.org/pdf/2504.11569.pdf

Authors: Heming Fu, Guojun Xiong, Jian Li, Shan Lin

Title: Multi-Agent Reinforcement Learning for Decentralized Reservoir Management via Murmuration Intelligence

Abstract:
Conventional centralized water management systems face critical limitations from computational complexity and uncertainty propagation. We present MurmuRL, a novel decentralized framework inspired by starling murmurations intelligence, integrating bio‑inspired alignment, separation, and cohesion rules with multi‑agent reinforcement learning. MurmuRL enables individual reservoirs to make autonomous local decisions while achieving emergent global coordination. Experiments on grid networks demonstrate that MurmuRL achieves 8.8% higher final performance while using 27% less computing overhead compared to centralized approaches. Notably, strategic diversity scales super‑linearly with system size, exhibiting sophisticated coordination patterns and enhanced resilience during extreme events. MurmuRL offers a scalable solution for managing complex water systems by leveraging principles of natural collective behavior.

PaperID: 10874, https://arxiv.org/pdf/2504.11497.pdf

Authors: Chang Liu, Emmanuel A. Olowe, Danial Chitnis

Title: LLM-based AI Agent for Sizing of Analog and Mixed Signal Circuit

Abstract:
The design of Analog and Mixed‑Signal (AMS) integrated circuits (ICs) often involves significant manual effort, especially during the transistor sizing process. While Machine Learning techniques in Electronic Design Automation (EDA) have shown promise in reducing complexity and minimizing human intervention, they still face challenges such as numerous iterations and a lack of knowledge about AMS circuit design. Recently, Large Language Models (LLMs) have demonstrated significant potential across various fields, showing a certain level of knowledge in circuit design and indicating their potential to automate the transistor sizing process. In this work, we propose an LLM‑based AI agent for AMS circuit design to assist in the sizing process. By integrating LLMs with external circuit simulation tools and data analysis functions and employing prompt engineering strategies, the agent successfully optimized multiple circuits to achieve target performance metrics. We evaluated the performance of different LLMs to assess their applicability and optimization effectiveness across seven basic circuits, and selected the best‑performing model Claude 3.5 Sonnet for further exploration on an operational amplifier, with complementary input stage and class AB output stage. This circuit was evaluated against nine performance metrics, and we conducted experiments under three distinct performance requirement groups. A success rate of up to 60% was achieved for reaching the target requirements. Overall, this work demonstrates the potential of LLMs to improve AMS circuit design.

PaperID: 10875, https://arxiv.org/pdf/2504.11496.pdf

Authors: Seoyeon Kim, Yu Su, Li-C. Wang

Title: IEA-Plugin: An AI Agent Reasoner for Test Data Analytics

Abstract:
This paper introduces IEA‑plugin, a novel AI agent‑based reasoning module developed as a new front‑end for the Intelligent Engineering Assistant (IEA). The primary objective of IEA‑plugin is to utilize the advanced reasoning and coding capabilities of Large Language Models (LLMs) to effectively address two critical practical challenges: capturing diverse engineering requirements and improving system scalability. Built on the LangGraph agentic programming platform, IEAplugin is specifically tailored for industrial deployment and integration with backend test data analytics tools. Compared to the previously developed IEA‑Plot (introduced two years ago), IEA‑plugin represents a significant advancement, capitalizing on recent breakthroughs in LLMs to deliver capabilities that were previously unattainable.

PaperID: 10876, https://arxiv.org/pdf/2504.11258.pdf

Authors: Liam Welsh, Udit Grover, Sebastian Jaimungal

Title: Multi-Agent Reinforcement Learning for Greenhouse Gas Offset Credit Markets

Abstract:
Climate change is a major threat to the future of humanity, and its impacts are being intensified by excess man‑made greenhouse gas emissions. One method governments can employ to control these emissions is to provide firms with emission limits and penalize any excess emissions above the limit. Excess emissions may also be offset by firms who choose to invest in carbon reducing and capturing projects. These projects generate offset credits which can be submitted to a regulating agency to offset a firm's excess emissions, or they can be traded with other firms. In this work, we characterize the finite‑agent Nash equilibrium for offset credit markets. As computing Nash equilibria is an NP‑hard problem, we utilize the modern reinforcement learning technique Nash‑DQN to efficiently estimate the market's Nash equilibria. We demonstrate not only the validity of employing reinforcement learning methods applied to climate themed financial markets, but also the significant financial savings emitting firms may achieve when abiding by the Nash equilibria through numerical experiments.

PaperID: 10877, https://arxiv.org/pdf/2504.11002.pdf

Authors: Yan Rong, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Title: Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation

Abstract:
Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine‑grained emotions, which often results in machine‑like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training‑free multi‑agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human‑like, and immersive audiobook generation and evaluation. Specifically, we firstly propose a flow‑based, context‑aware framework for diverse audio generation with word‑level semantic and temporal alignment. To enhance expressiveness, we then design word‑level paralinguistic augmentation, utterance‑level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM‑based evaluation framework incorporating self‑critique, perspective‑taking, and psychological MagicEmo prompts to ensure human‑aligned and self‑aligned assessments. Experimental results demonstrate that our method achieves state‑of‑the‑art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.

PaperID: 10878, https://arxiv.org/pdf/2504.10960.pdf

Authors: Evagoras Makridis, Themistoklis Charalambous

Title: A Linear Push-Pull Average Consensus Algorithm for Delay-Prone Networks

Abstract:
In this paper, we address the average consensus problem of multi‑agent systems for possibly unbalanced and delay‑prone networks with directional information flow. We propose a linear distributed algorithm (referred to as RPPAC) that handles asynchronous updates and time‑varying heterogeneous information delays. Our proposed distributed algorithm utilizes a surplus‑consensus mechanism and information regarding the number of incoming and outgoing links to guarantee state averaging, despite the imbalanced and delayed information flow in directional networks. The convergence of the RPPAC algorithm is examined using key properties of the backward product of time‑varying matrices that correspond to different snapshots of the directional augmented network.

PaperID: 10879, https://arxiv.org/pdf/2504.10915.pdf

Authors: Rajesh Ranjan, Shailja Gupta, Surya Narayan Singh

Title: LOKA Protocol: A Decentralized Framework for Trustworthy and Ethical AI Agent Ecosystems

Abstract:
The rise of autonomous AI agents, capable of perceiving, reasoning, and acting independently, signals a profound shift in how digital ecosystems operate, govern, and evolve. As these agents proliferate beyond centralized infrastructures, they expose foundational gaps in identity, accountability, and ethical alignment. Three critical questions emerge: Identity: Who or what is the agent? Accountability: Can its actions be verified, audited, and trusted? Ethical Consensus: Can autonomous systems reliably align with human values and prevent harmful emergent behaviors? We present the novel LOKA Protocol (Layered Orchestration for Knowledgeful Agents), a unified, systems‑level architecture for building ethically governed, interoperable AI agent ecosystems. LOKA introduces a proposed Universal Agent Identity Layer (UAIL) for decentralized, verifiable identity; intent‑centric communication protocols for semantic coordination across diverse agents; and a Decentralized Ethical Consensus Protocol (DECP) that could enable agents to make context‑aware decisions grounded in shared ethical baselines. Anchored in emerging standards such as Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), and post‑quantum cryptography, LOKA proposes a scalable, future‑resilient blueprint for multi‑agent AI governance. By embedding identity, trust, and ethics into the protocol layer itself, LOKA proposes the foundation for a new era of responsible, transparent, and autonomous AI ecosystems operating across digital and physical domains.

PaperID: 10880, https://arxiv.org/pdf/2504.10784.pdf

Authors: Mikolaj Walczak, Uttej Kallakuri, Tinoosh Mohsenin

Title: ATLASv2: LLM-Guided Adaptive Landmark Acquisition and Navigation on the Edge

Abstract:
Autonomous systems deployed on edge devices face significant challenges, including resource constraints, real‑time processing demands, and adapting to dynamic environments. This work introduces ATLASv2, a novel system that integrates a fine‑tuned TinyLLM, real‑time object detection, and efficient path planning to enable hierarchical, multi‑task navigation and manipulation all on the edge device, Jetson Nano. ATLASv2 dynamically expands its navigable landmarks by detecting and localizing objects in the environment which are saved to its internal knowledge base to be used for future task execution. We evaluate ATLASv2 in real‑world environments, including a handcrafted home and office setting constructed with diverse objects and landmarks. Results show that ATLASv2 effectively interprets natural language instructions, decomposes them into low‑level actions, and executes tasks with high success rates. By leveraging generative AI in a fully on‑board framework, ATLASv2 achieves optimized resource utilization with minimal prompting latency and power consumption, bridging the gap between simulated environments and real‑world applications.

PaperID: 10881, https://arxiv.org/pdf/2504.10677.pdf

Authors: Muhammad Al-Zafar Khan, Jamal Al-Karaki

Title: Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning

Abstract:
In this paper, we present a multi‑agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction‑diffusion systems modeling molecular signaling, (2) neural‑like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agent through progressively complex repair scenarios. In silico experiments demonstrate emergent repair strategies, including dynamic secretion control and spatial coordination.

PaperID: 10882, https://arxiv.org/pdf/2504.10646.pdf

Authors: Saif Punjwani, Larry Heck

Title: Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain‑of‑Thought (CoT). However, these approaches focus on token‑level output without considering internal weight dynamics. We introduce Weight‑of‑Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph‑based message passing, multi‑step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.

PaperID: 10883, https://arxiv.org/pdf/2504.10497.pdf

Authors: Sunyi Liu, Mengzhe Geng, Rebecca Hart

Title: Exploring Generative AI Techniques in Government: A Case Study

Abstract:
The swift progress of Generative Artificial intelligence (GenAI), notably Large Language Models (LLMs), is reshaping the digital landscape. Recognizing this transformative potential, the National Research Council of Canada (NRC) launched a pilot initiative to explore the integration of GenAI techniques into its daily operation for performance excellence, where 22 projects were launched in May 2024. Within these projects, this paper presents the development of the intelligent agent Pubbie as a case study, targeting the automation of performance measurement, data management and insight reporting at the NRC. Cutting‑edge techniques are explored, including LLM orchestration and semantic embedding via RoBERTa, while strategic fine‑tuning and few‑shot learning approaches are incorporated to infuse domain knowledge at an affordable cost. The user‑friendly interface of Pubbie allows general government users to input queries in natural language and easily upload or download files with a simple button click, greatly reducing manual efforts and accessibility barriers.

PaperID: 10884, https://arxiv.org/pdf/2504.10374.pdf

Authors: Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris

Title: Ctrl-Z: Controlling AI Agents via Resampling

Abstract:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi‑step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi‑step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non‑malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non‑malicious agent.

PaperID: 10885, https://arxiv.org/pdf/2504.10286.pdf

Authors: Yiming Zhu, Yupeng He, Ehsan-Ul Haq, Gareth Tyson, Pan Hui

Title: Characterizing LLM-driven Social Network: The Chirper.ai Case

Abstract:
The emergence of large language models (LLMs) has enabled a new paradigm of social network simulation, where AI agents can interact with human‑like autonomy. Recent research has explored collective behavioral patterns and structural characteristics of LLM agents within simulated networks. However, empirical comparisons between LLM‑driven and human‑driven online social networks remain scarce, limiting our understanding of how LLM agents differ from human users. This paper presents a large‑scale analysis of Chirper.ai, an X/Twitter‑like social network entirely populated by LLM agents, comprising over 65,000 agents and 7.7 million AI‑generated posts. For comparison, we collect a parallel dataset from Mastodon, a human‑driven decentralized social network, with over 117,000 users and 16 million posts. We examine key differences between LLM agents and humans in posting behaviors, abusive content, and social network structures. Our findings provide key implications to facilitate the future development of responsible AI‑mediated communication systems, offering a profile of agent behaviors in an online social network driven by LLMs.

PaperID: 10886, https://arxiv.org/pdf/2504.10210.pdf

Authors: Yuxuan Zhang, Yangyang Feng, Daifeng Li, Kexin Zhang, Junlan Chen, Bowen Deng

Title: Can Competition Enhance the Proficiency of Agents Powered by Large Language Models in the Realm of News-driven Time Series Forecasting?

Abstract:
Multi‑agents‑based news‑driven time series forecasting is considered as a potential paradigm shift in the era of large language models (LLMs). The challenge of this task lies in measuring the influences of different news events towards the fluctuations of time series. This requires agents to possess stronger abilities of innovative thinking and the identifying misleading logic. However, the existing multi‑agent discussion framework has limited enhancement on time series prediction in terms of optimizing these two capabilities. Inspired by the role of competition in fostering innovation, this study embeds a competition mechanism within the multi‑agent discussion to enhance agents' capability of generating innovative thoughts. Furthermore, to bolster the model's proficiency in identifying misleading information, we incorporate a fine‑tuned small‑scale LLM model within the reflective stage, offering auxiliary decision‑making support. Experimental results confirm that the competition can boost agents' capacity for innovative thinking, which can significantly improve the performances of time series prediction. Similar to the findings of social science, the intensity of competition within this framework can influence the performances of agents, providing a new perspective for studying LLMs‑based multi‑agent systems.

PaperID: 10887, https://arxiv.org/pdf/2504.10178.pdf

Authors: Naizhu Jin, Zhong Li, Tian Zhang, Qingkai Zeng

Title: MSCoT: Structured Chain-of-Thought Generation for Multiple Programming Languages

Abstract:
With the rapid development of code intelligence, the application of multiple programming languages is becoming increasingly widespread. However, most existing code generation models mainly focus on a single or a few programming languages, resulting in unsatisfactory performance in a multilingual environment. Chain‑of‑Thought (CoT) reasoning can significantly improve the performance of the model without the need for retraining or fine‑tuning the code generation model by reasonably decomposing complex code generation tasks into multiple subtasks and gradually deriving solutions for each subtask. Nevertheless, the existing CoT generation methods mainly concentrate on Python code, and the performance on other programming languages remains unclear. To fill this gap, we first constructed a CoT generation dataset for 12 programming languages through multi‑agent technology. On this basis, we proposed a CoT generation method MSCoT applicable to multiple programming languages. By introducing CoT into the code generation large model, the performance of the code generation large model in a multilingual environment can be improved. Through large‑scale empirical research, we compared the generalization abilities of MSCoT and the existing CoT generation methods on multiple programming languages and proved the effectiveness of MSCoT for multiple programming languages. In addition, we also designed a human study to prove the quality of the CoT generated by MSCoT. Finally, we opensourced the model and dataset of MSCoT to promote the research on CoT generation for multiple programming languages.

PaperID: 10888, https://arxiv.org/pdf/2504.10157.pdf

Authors: Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Xuanjing Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei

Title: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Abstract:
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM‑agent‑driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large‑scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large‑scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

PaperID: 10889, https://arxiv.org/pdf/2504.10102.pdf

Authors: Vitor Martins, Sara M. Cerqueira, Mercedes Balcells, Elazer R Edelman, Cristina P. Santos

Title: A Human-Sensitive Controller: Adapting to Human Ergonomics and Physical Constraints via Reinforcement Learning

Abstract:
Work‑Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long‑term disability. This study introduces a human‑sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning to develop a human‑aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q‑Learning and Deep Q‑Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation‑to‑real gap, a fine‑tuning phase successfully adapted the policies to real‑world conditions. DQN outperformed Q‑Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL‑driven cobots to enable safer, more inclusive workplaces.

PaperID: 10890, https://arxiv.org/pdf/2504.10046.pdf

Authors: Jia Li, Xianjie Shi, Kechi Zhang, Ge Li, Zhi Jin, Lei Li, Huangzhao Zhang, Jia Li, Fang Liu, Yuwei Zhang, Zhengwei Tao, Yihong Dong, Yuqi Zhu, Chongyang Tao

Title: GraphCodeAgent: Dual Graph-Guided LLM Agent for Retrieval-Augmented Repo-Level Code Generation

Abstract:
Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress for code generation. Recently, large language models (LLMs) have demonstrated remarkable proficiency in function‑level code generation, yet their performance significantly degrades in the real‑world software development process, where coding tasks are deeply embedded within specific repository contexts. Existing studies attempt to use retrieval‑augmented code generation (RACG) approaches to mitigate this demand. However, there is a gap between natural language (NL) requirements and programming implementations. This results in the failure to retrieve the relevant code of these fine‑grained subtasks. To address this challenge, we propose GraphCodeAgent, a dual graph‑guided LLM agent for retrieval‑augmented repo‑level code generation, bridging the gap between NL requirements and programming implementations. Our approach constructs two interconnected graphs: a Requirement Graph (RG) to model requirement relations of code snippets within the repository, as well as the relations between the target requirement and the requirements of these code snippets, and a Structural‑Semantic Code Graph (SSCG) to capture the repository's intricate code dependencies. Guided by this, an LLM‑powered agent performs multi‑hop reasoning to systematically retrieve all context code snippets, including implicit and explicit code snippets, even if they are not explicitly expressed in requirements. We evaluated GraphCodeAgent on three advanced LLMs with the two widely‑used repo‑level code generation benchmarks DevEval and CoderEval. Extensive experiment results show that GraphCodeAgent significantly outperforms state‑of‑the‑art baselines.

PaperID: 10891, https://arxiv.org/pdf/2504.10036.pdf

Authors: Zhengxuan Zhang, Zhuowen Liang, Yin Wu, Teng Lin, Yuyu Luo, Nan Tang

Title: DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis

Abstract:
Large language models (LLMs) are increasingly applied to multi‑modal data analysis ‑‑ not necessarily because they offer the most precise answers, but because they provide fluent, flexible interfaces for interpreting complex inputs. Yet this fluency often conceals a deeper structural failure: the prevailing ``Prompt‑to‑Answer'' paradigm treats LLMs as black‑box analysts, collapsing evidence, reasoning, and conclusions into a single, opaque response. The result is brittle, unverifiable, and frequently misleading. We argue for a fundamental shift: from generation to structured extraction, from monolithic prompts to modular, agent‑based workflows. LLMs should not serve as oracles, but as collaborators ‑‑ specialized in tasks like extraction, translation, and linkage ‑‑ embedded within transparent workflows that enable step‑by‑step reasoning and verification. We propose DataPuzzle, a conceptual multi‑agent framework that decomposes complex questions, structures information into interpretable forms (e.g. tables, graphs), and coordinates agent roles to support transparent and verifiable analysis. This framework serves as an aspirational blueprint for restoring visibility and control in LLM‑driven analytics ‑‑ transforming opaque answers into traceable processes, and brittle fluency into accountable insight. This is not a marginal refinement; it is a call to reimagine how we build trustworthy, auditable analytic systems in the era of large language models. Structure is not a constraint ‑‑ it is the path to clarity.

PaperID: 10892, https://arxiv.org/pdf/2504.09915.pdf

Authors: Yuxi Bi, Yunfan Gao, Haofen Wang

Title: StePO-Rec: Towards Personalized Outfit Styling Assistant via Knowledge-Guided Multi-Step Reasoning

Abstract:
Advancements in Generative AI offers new opportunities for FashionAI, surpassing traditional recommendation systems that often lack transparency and struggle to integrate expert knowledge, leaving the potential for personalized fashion styling remain untapped. To address these challenges, we present PAFA (Principle‑Aware Fashion), a multi‑granular knowledge base that organizes professional styling expertise into three levels of metadata, domain principles, and semantic relationships. Using PAFA, we develop StePO‑Rec, a knowledge‑guided method for multi‑step outfit recommendation. StePO‑Rec provides structured suggestions using a scenario‑dimension‑attribute framework, employing recursive tree construction to align recommendations with both professional principles and individual preferences. A preference‑trend re‑ranking system further adapts to fashion trends while maintaining the consistency of the user's original style. Experiments on the widely used personalized outfit dataset IQON show a 28% increase in Recall@1 and 32.8% in MAP. Furthermore, case studies highlight improved explainability, traceability, result reliability, and the seamless integration of expertise and personalization.

PaperID: 10893, https://arxiv.org/pdf/2504.09893.pdf

Authors: Xu Yin, Min-Sung Yoon, Yuchi Huo, Kang Zhang, Sung-Eui Yoon

Title: LangPert: Detecting and Handling Task-level Perturbations for Robust Object Rearrangement

Abstract:
Task execution for object rearrangement could be challenged by Task‑Level Perturbations (TLP), i.e., unexpected object additions, removals, and displacements that can disrupt underlying visual policies and fundamentally compromise task feasibility and progress. To address these challenges, we present LangPert, a language‑based framework designed to detect and mitigate TLP situations in tabletop rearrangement tasks. LangPert integrates a Visual Language Model (VLM) to comprehensively monitor policy's skill execution and environmental TLP, while leveraging the Hierarchical Chain‑of‑Thought (HCoT) reasoning mechanism to enhance the Large Language Model (LLM)'s contextual understanding and generate adaptive, corrective skill‑execution plans. Our experimental results demonstrate that LangPert handles diverse TLP situations more effectively than baseline methods, achieving higher task completion rates, improved execution efficiency, and potential generalization to unseen scenarios.

PaperID: 10894, https://arxiv.org/pdf/2504.09855.pdf

Authors: Hongrui Shi, Shunbao Li, Zhipeng Yuan, Po Yang

Title: PestMA: LLM-based Multi-Agent System for Informed Pest Management

Abstract:
Effective pest management is complex due to the need for accurate, context‑specific decisions. Recent advancements in large language models (LLMs) open new possibilities for addressing these challenges by providing sophisticated, adaptive knowledge acquisition and reasoning. However, existing LLM‑based pest management approaches often rely on a single‑agent paradigm, which can limit their capacity to incorporate diverse external information, engage in systematic validation, and address complex, threshold‑driven decisions. To overcome these limitations, we introduce PestMA, an LLM‑based multi‑agent system (MAS) designed to generate reliable and evidence‑based pest management advice. Building on an editorial paradigm, PestMA features three specialized agents, an Editor for synthesizing pest management recommendations, a Retriever for gathering relevant external data, and a Validator for ensuring correctness. Evaluations on real‑world pest scenarios demonstrate that PestMA achieves an initial accuracy of 86.8% for pest management decisions, which increases to 92.6% after validation. These results underscore the value of collaborative agent‑based workflows in refining and validating decisions, highlighting the potential of LLM‑based multi‑agent systems to automate and enhance pest management processes.

PaperID: 10895, https://arxiv.org/pdf/2504.09802.pdf

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang

Title: Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment

Abstract:
The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek‑R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain‑of‑thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique‑Rethink‑Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.

PaperID: 10896, https://arxiv.org/pdf/2504.09781.pdf

Authors: Jingtian Wu, Claire Cardie

Title: Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Abstract:
While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi‑hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval‑based techniques (grounding reasoning in external evidence), reasoning‑based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval‑based or reasoning‑based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning‑and‑retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre‑trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi‑hop benchmarks (HotpotQA, MuSiQue) and fact‑verification (FEVER) demonstrate that RC consistently outperforms state‑of‑the‑art few‑shot prompting methods without task‑specific fine‑tuning.

PaperID: 10897, https://arxiv.org/pdf/2504.09764.pdf

Authors: Yuyang Ji, Haohan Wang

Title: Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Abstract:
Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text‑based shortcuts and probabilistic pattern‑matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT‑4o and Gemini‑2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi‑agent pipeline with specialized agent‑generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent‑critic to validate results, ensuring high‑fidelity symbolic representations. Our framework surpasses state‑of‑the‑art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

PaperID: 10898, https://arxiv.org/pdf/2504.09730.pdf

Authors: Omayra Yago Nieto, Alexandre Anahory Simoes, Juan I. Giribet, Leonardo J. Colombo

Title: Learning-based decentralized control with collision avoidance for multi-agent systems

Abstract:
In this paper, we present a learning‑based tracking controller based on Gaussian processes (GP) for collision avoidance of multi‑agent systems where the agents evolve in the special Euclidean group in the space SE(3). In particular, we use GPs to estimate certain uncertainties that appear in the dynamics of the agents. The control algorithm is designed to learn and mitigate these uncertainties by using GPs as a learning‑based model for the predictions. In particular, the presented approach guarantees that the tracking error remains bounded with high probability. We present some simulation results to show how the control algorithm is implemented.

PaperID: 10899, https://arxiv.org/pdf/2504.09723.pdf

Authors: Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headden, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang, Dakuo Wang

Title: AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents

Abstract:
A/B testing experiment is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on the large‑scale and live traffic of human participants, and the long time of waiting for the testing result. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model‑based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi‑step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between‑subject A/B testing with 1,000 LLM agents Amazon.com, and compare agent behaviors with real human shopping behaviors at a scale. Our findings suggest AgentA/B can emulate human‑like behavior patterns.

PaperID: 10900, https://arxiv.org/pdf/2504.09702.pdf

Authors: Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Title: MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Abstract:
We introduce MLRC‑Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end‑to‑end agentic pipeline by using LLM‑as‑a‑judge, MLRC‑Bench measures the key steps of proposing and implementing novel research methods and evaluates them with rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best‑performing tested agent (gemini‑exp‑1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the LLM‑judged innovation and actual performance on cutting‑edge ML research problems. MLRC‑Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench

PaperID: 10901, https://arxiv.org/pdf/2504.09662.pdf

Authors: Jenny Ma, Riya Sahni, Karthik Sreedhar, Lydia B. Chilton

Title: AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations

Abstract:
Multi‑agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user‑specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain its notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi‑agent simulations.

PaperID: 10902, https://arxiv.org/pdf/2504.09577.pdf

Authors: Michael P. Wozniak

Title: Unification of Consensus-Based Multi-Objective Optimization and Multi-Robot Path Planning

Abstract:
Multi‑agent systems seeking consensus may also have other objective functions to optimize, requiring the research of multi‑objective optimization in consensus. Several recent publications have explored this domain using various methods such as weighted‑sum optimization and penalization methods. This paper reviews the state of the art for consensus‑based multi‑objective optimization, poses a multi‑agent lunar rover exploration problem seeking consensus and maximization of explored area, and achieves optimal edge weights and steering angles by applying SQP algorithms.

PaperID: 10903, https://arxiv.org/pdf/2504.09554.pdf

Authors: Chi Zhang, Qiyang Chen, Mengqi Zhang

Title: Mixture-of-RAG: Integrating Text and Tables with Large Language Models

Abstract:
Large language models (LLMs) achieve optimal utility when their responses are grounded in external knowledge sources. However, real‑world documents, such as annual reports, scientific papers, and clinical guidelines, frequently combine extensive narrative content with complex, hierarchically structured tables. While existing retrieval‑augmented generation (RAG) systems effectively integrate LLMs' generative capabilities with external retrieval‑based information, their performance significantly deteriorates especially processing such heterogeneous text‑table hierarchies. To address this limitation, we formalize the task of Heterogeneous Document RAG, which requires joint retrieval and reasoning across textual and hierarchical tabular data. We propose MixRAG, a novel three‑stage framework: (i) hierarchy row‑and‑column‑level (H‑RCL) representation that preserves hierarchical structure and heterogeneous relationship, (ii) an ensemble retriever with LLM‑based reranking for evidence alignment, and (iii) multi‑step reasoning decomposition via a RECAP prompt strategy. To bridge the gap in available data for this domain, we release the dataset DocRAGLib, a 2k‑document corpus paired with automatically aligned text‑table summaries and gold document annotations. The comprehensive experiment results demonstrate that MixRAG boosts top‑1 retrieval by 46% over strong text‑only, table‑only, and naive‑mixture baselines, establishing new state‑of‑the‑art performance for mixed‑modality document grounding.

PaperID: 10904, https://arxiv.org/pdf/2504.09407.pdf

Authors: Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang

Title: UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

Abstract:
Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate their new designs. But what about evaluating and iterating the usability testing study design itself? Recent advances in Large Language Model‑simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and iterating their study design before they conduct the real human‑subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users and to interactively test the target website. The system also provides a Result Viewer Interface so that the UX researchers can easily review and analyze the generated qualitative (e.g., agents' post‑study surveys) and quantitative data (e.g., agents' interaction logs), or even interview agents directly. Through a heuristic evaluation with 16 UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.

PaperID: 10905, https://arxiv.org/pdf/2504.09283.pdf

Authors: Priyan Vaithilingam, Munyeong Kim, Frida-Cecilia Acosta-Parenteau, Daniel Lee, Amine Mhedhbi, Elena L. Glassman, Ian Arawjo

Title: Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

Abstract:
How do we update AI memory of user intent as intent changes? We consider how an AI interface may assist the integration of new information into a repository of natural language data. Inspired by software engineering concepts like impact analysis, we develop methods and a UI for managing semantic changes with non‑local effects, which we call "semantic conflict resolution." The user commits new intent to a project ‑‑ makes a "semantic commit" ‑‑ and the AI helps the user detect and resolve semantic conflicts within a store of existing information representing their intent (an "intent specification"). We develop an interface, SemanticCommit, to better understand how users resolve conflicts when updating intent specifications such as Cursor Rules and game design documents. A knowledge graph‑based RAG pipeline drives conflict detection, while LLMs assist in suggesting resolutions. We evaluate our technique on an initial benchmark. Then, we report a 12 user within‑subjects study of SemanticCommit for two task domains ‑‑ game design documents, and AI agent memory in the style of ChatGPT memories ‑‑ where users integrated new information into an existing list. Half of our participants adopted a workflow of impact analysis, where they would first flag conflicts without AI revisions then resolve conflicts locally, despite having access to a global revision feature. We argue that AI agent interfaces, such as software IDEs like Cursor and Windsurf, should provide affordances for impact analysis and help users validate AI retrieval independently from generation. Our work speaks to how AI agent designers should think about updating memory as a process that involves human feedback and decision‑making.

PaperID: 10906, https://arxiv.org/pdf/2504.09230.pdf

Authors: Bin-Bin Hu, Weijia Yao, Yanxin Zhou, Henglai Wei, Chen Lv

Title: Concurrent-Allocation Task Execution for Multi-Robot Path-Crossing-Minimal Navigation in Obstacle Environments

Abstract:
Reducing undesirable path crossings among trajectories of different robots is vital in multi‑robot navigation missions, which not only reduces detours and conflict scenarios, but also enhances navigation efficiency and boosts productivity. Despite recent progress in multi‑robot path‑crossing‑minimal (MPCM) navigation, the majority of approaches depend on the minimal squared‑distance reassignment of suitable desired points to robots directly. However, if obstacles occupy the passing space, calculating the actual robot‑point distances becomes complex or intractable, which may render the MPCM navigation in obstacle environments inefficient or even infeasible. In this paper, the concurrent‑allocation task execution (CATE) algorithm is presented to address this problem (i.e., MPCM navigation in obstacle environments). First, the path‑crossing‑related elements in terms of (i) robot allocation, (ii) desired‑point convergence, and (iii) collision and obstacle avoidance are encoded into integer and control barrier function (CBF) constraints. Then, the proposed constraints are used in an online constrained optimization framework, which implicitly yet effectively minimizes the possible path crossings and trajectory length in obstacle environments by minimizing the desired point allocation cost and slack variables in CBF constraints simultaneously. In this way, the MPCM navigation in obstacle environments can be achieved with flexible spatial orderings. Note that the feasibility of solutions and the asymptotic convergence property of the proposed CATE algorithm in obstacle environments are both guaranteed, and the calculation burden is also reduced by concurrently calculating the optimal allocation and the control input directly without the path planning process.

PaperID: 10907, https://arxiv.org/pdf/2504.09227.pdf

Authors: Gaurav Jain, Leah Findlater, Cole Gleason

Title: SceneScout: Towards AI Agent-driven Access to Street View Imagery for Blind Users

Abstract:
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in‑situ navigation, those exploring pre‑travel assistance typically provide only landmarks and turn‑by‑turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)‑driven AI agent that enables accessible interactions with street view imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling free movement within street view imagery. Our user study (N=10) demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. A technical evaluation shows that most descriptions are accurate (72%) and describe stable visual elements (95%) even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of using street view imagery to enhance navigation experiences.

PaperID: 10908, https://arxiv.org/pdf/2504.09104.pdf

Authors: Alessandro Carcangiu, Marco Manca, Jacopo Mereu, Carmen Santoro, Ludovica Simeoli, Lucio Davide Spano

Title: Tell-XR: Conversational End-User Development of XR Automations

Abstract:
The availability of extended reality (XR) devices has widened their adoption, yet authoring interactive experiences remains complex for non‑programmers. We introduce Tell‑XR, an intelligent agent leveraging large language models (LLMs) to guide end‑users in defining the interaction in XR settings using automations described as Event‑Condition‑Action (ECA) rules. Through a formative study, we identified the key conversation stages to define and refine automations, which informed the design of the system architecture. The evaluation study in two scenarios (a VR museum and an AR smart home) demonstrates the effectiveness of Tell‑XR across different XR interaction settings.

PaperID: 10909, https://arxiv.org/pdf/2504.09037.pdf

Authors: Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

Title: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Abstract:
Reasoning is a fundamental cognitive process that enables logical inference, problem‑solving, and decision‑making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi‑agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high‑quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference‑scaling to learning‑to‑reason (e.g., DeepSeek‑R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine‑tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator‑evaluator and LLM debate to recent innovations. ...

PaperID: 10910, https://arxiv.org/pdf/2504.08943.pdf

Authors: Chace Ashcraft, Kiran Karra, Josh Carney, Nathan Drenkow

Title: Investigating the Treacherous Turn in Deep Reinforcement Learning

Abstract:
The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying DRL to an implementation of the A Link to the Past example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find the treacherous behavior to be reproducible in a DRL agent when using other trojan injection strategies. This approach deviates from the prototypical treacherous turn behavior since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.

PaperID: 10911, https://arxiv.org/pdf/2504.08747.pdf

Authors: Jordan Chipka, Chris Moyer, Clay Troyer, Tyler Fuelling, Jeremy Hochstedler

Title: GridMind: A Multi-Agent NLP Framework for Unified, Cross-Modal NFL Data Insights

Abstract:
The rapid growth of big data and advancements in computational techniques have significantly transformed sports analytics. However, the diverse range of data sources ‑‑ including structured statistics, semi‑structured formats like sensor data, and unstructured media such as written articles, audio, and video ‑‑ creates substantial challenges in extracting actionable insights. These various formats, often referred to as multimodal data, require integration to fully leverage their potential. Conventional systems, which typically prioritize structured data, face limitations when processing and combining these diverse content types, reducing their effectiveness in real‑time sports analysis. To address these challenges, recent research highlights the importance of multimodal data integration for capturing the complexity of real‑world sports environments. Building on this foundation, this paper introduces GridMind, a multi‑agent framework that unifies structured, semi‑structured, and unstructured data through Retrieval‑Augmented Generation (RAG) and large language models (LLMs) to facilitate natural language querying of NFL data. This approach aligns with the evolving field of multimodal representation learning, where unified models are increasingly essential for real‑time, cross‑modal interactions. GridMind's distributed architecture includes specialized agents that autonomously manage each stage of a prompt ‑‑ from interpretation and data retrieval to response synthesis. This modular design enables flexible, scalable handling of multimodal data, allowing users to pose complex, context‑rich questions and receive comprehensive, intuitive responses via a conversational interface.

PaperID: 10912, https://arxiv.org/pdf/2504.08742.pdf

Authors: Nicholas Sukiennik, Haoyu Wang, Zailin Zeng, Chen Gao, Yong Li

Title: Simulating Filter Bubble on Short-video Recommender System with Large Language Model Agents

Abstract:
An increasing reliance on recommender systems has led to concerns about the creation of filter bubbles on social media, especially on short video platforms like TikTok. However, their formation is still not entirely understood due to the complex dynamics between recommendation algorithms and user feedback. In this paper, we aim to shed light on these dynamics using a large language model‑based simulation framework. Our work employs real‑world short‑video data containing rich video content information and detailed user‑agents to realistically simulate the recommendation‑feedback cycle. Through large‑scale simulations, we demonstrate that LLMs can replicate real‑world user‑recommender interactions, uncovering key mechanisms driving filter bubble formation. We identify critical factors, such as demographic features and category attraction that exacerbate content homogenization. To mitigate this, we design and test interventions including various cold‑start and feedback weighting strategies, showing measurable reductions in filter bubble effects. Our framework enables rapid prototyping of recommendation strategies, offering actionable solutions to enhance content diversity in real‑world systems. Furthermore, we analyze how LLM‑inherent biases may propagate through recommendations, proposing safeguards to promote equity for vulnerable groups, such as women and low‑income populations. By examining the interplay between recommendation and LLM agents, this work advances a deeper understanding of algorithmic bias and provides practical tools to promote inclusive digital spaces.

PaperID: 10913, https://arxiv.org/pdf/2504.08725.pdf

Authors: Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Grey Yang

Title: DocAgent: A Multi-Agent System for Automated Code Documentation Generation

Abstract:
High‑quality code documentation is crucial for software development especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often produce incomplete, unhelpful, or factually incorrect outputs. We introduce DocAgent, a novel multi‑agent collaborative system using topological code processing for incremental context building. Specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) then collaboratively generate documentation. We also propose a multi‑faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show DocAgent significantly outperforms baselines consistently. Our ablation study confirms the vital role of the topological processing order. DocAgent offers a robust approach for reliable code documentation generation in complex and proprietary repositories.

PaperID: 10914, https://arxiv.org/pdf/2504.08640.pdf

Authors: Alessio Buscemi, Daniele Proverbio, Paolo Bova, Nataliya Balabanova, Adeela Bashir, Theodor Cimpeanu, Henrique Correia da Fonseca, Manh Hong Duong, Elias Fernandez Domingos, Antonio M. Fernandes, Marcus Krellner, Ndidi Bianca Ogbo, Simon T. Powers, Fernando P. Santos, Zia Ush Shamszaman, Zhao Song, Alessandro Di Stefano, The Anh Han

Title: Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents

Abstract:
There is general agreement that fostering trust and cooperation within the AI development ecosystem is essential to promote the adoption of trustworthy AI systems. By embedding Large Language Model (LLM) agents within an evolutionary game‑theoretic framework, this paper investigates the complex interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios. Evolutionary game theory (EGT) is used to quantitatively model the dilemmas faced by each actor, and LLMs provide additional degrees of complexity and nuances and enable repeated games and incorporation of personality traits. Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" (not trusting and defective) stances than pure game‑theoretic agents. We observe that, in case of full trust by users, incentives are effective to promote effective regulation; however, conditional trust may deteriorate the "social pact". Establishing a virtuous feedback between users' trust and regulators' reputation thus appears to be key to nudge developers towards creating safe AI. However, the level at which this trust emerges may depend on the specific LLM used for testing. Our results thus provide guidance for AI regulation systems, and help predict the outcome of strategic LLM agents, should they be used to aid regulation itself.

PaperID: 10915, https://arxiv.org/pdf/2504.08581.pdf

Authors: Xin Tan, Yuzhou Ji, He Zhu, Yuan Xie

Title: FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents

Abstract:
The semantically interactive radiance field has long been a promising backbone for 3D real‑world applications, such as embodied AI to achieve scene understanding and manipulation. However, multi‑granularity interaction remains a challenging task due to the ambiguity of language and degraded quality when it comes to queries upon object components. In this work, we present FMLGS, an approach that supports part‑level open‑vocabulary query within 3D Gaussian Splatting (3DGS). We propose an efficient pipeline for building and querying consistent object‑ and part‑level semantics based on Segment Anything Model 2 (SAM2). We designed a semantic deviation strategy to solve the problem of language ambiguity among object parts, which interpolates the semantic features of fine‑grained targets for enriched information. Once trained, we can query both objects and their describable parts using natural language. Comparisons with other state‑of‑the‑art methods prove that our method can not only better locate specified part‑level targets, but also achieve first‑place performance concerning both speed and accuracy, where FMLGS is 98 x faster than LERF, 4 x faster than LangSplat and 2.5 x faster than LEGaussians. Meanwhile, we further integrate FMLGS as a virtual agent that can interactively navigate through 3D scenes, locate targets, and respond to user demands through a chat interface, which demonstrates the potential of our work to be further expanded and applied in the future.

PaperID: 10916, https://arxiv.org/pdf/2504.08490.pdf

Authors: Robin D. Pesl

Title: Adopting Large Language Models to Automated System Integration

Abstract:
Modern enterprise computing systems integrate numerous subsystems to resolve a common task by yielding emergent behavior. A widespread approach is using services implemented with Web technologies like REST or OpenAPI, which offer an interaction mechanism and service documentation standard, respectively. Each service represents a specific business functionality, allowing encapsulation and easier maintenance. Despite the reduced maintenance costs on an individual service level, increased integration complexity arises. Consequently, automated service composition approaches have arisen to mitigate this issue. Nevertheless, these approaches have not achieved high acceptance in practice due to their reliance on complex formal modeling. Within this Ph.D. thesis, we analyze the application of Large Language Models (LLMs) to automatically integrate the services based on a natural language input. The result is a reusable service composition, e.g., as program code. While not always generating entirely correct results, the result can still be helpful by providing integration engineers with a close approximation of a suitable solution, which requires little effort to become operational. Our research involves (i) introducing a software architecture for automated service composition using LLMs, (ii) analyzing Retrieval Augmented Generation (RAG) for service discovery, (iii) proposing a novel natural language query‑based benchmark for service discovery, and (iv) extending the benchmark to complete service composition scenarios. We have presented our software architecture as Compositio Prompto, the analysis of RAG for service discovery, and submitted a proposal for the service discovery benchmark. Open topics are primarily the extension of the service discovery benchmark to service composition scenarios and the improvements of the service composition generation, e.g., using fine‑tuning or LLM agents.

PaperID: 10917, https://arxiv.org/pdf/2504.08417.pdf

Authors: Paul J. Pritz, Kin K. Leung

Title: Belief States for Cooperative Multi-Agent Reinforcement Learning under Partial Observability

Abstract:
Reinforcement learning in partially observable environments is typically challenging, as it requires agents to learn an estimate of the underlying system state. These challenges are exacerbated in multi‑agent settings, where agents learn simultaneously and influence the underlying state as well as each others' observations. We propose the use of learned beliefs on the underlying state of the system to overcome these challenges and enable reinforcement learning with fully decentralized training and execution. Our approach leverages state information to pre‑train a probabilistic belief model in a self‑supervised fashion. The resulting belief states, which capture both inferred state information as well as uncertainty over this information, are then used in a state‑based reinforcement learning algorithm to create an end‑to‑end model for cooperative multi‑agent reinforcement learning under partial observability. By separating the belief and reinforcement learning tasks, we are able to significantly simplify the policy and value function learning tasks and improve both the convergence speed and the final performance. We evaluate our proposed method on diverse partially observable multi‑agent tasks designed to exhibit different variants of partial observability.

PaperID: 10918, https://arxiv.org/pdf/2504.08399.pdf

Authors: Yin Jou Huang, Rafik Hadfi

Title: Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Abstract:
Self‑report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta‑knowledge contamination. This paper proposes a novel multi‑observer framework for personality trait assessments in LLM agents that draws on informant‑report methods in psychology. Instead of relying on self‑assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer‑report ratings align more closely with human judgments than traditional self‑reports and reveal systematic biases in LLM self‑assessments. We also found that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in perceiving personality and demonstrate that a multi‑observer paradigm offers a more reliable, context‑sensitive approach to evaluating LLM personality traits.

PaperID: 10919, https://arxiv.org/pdf/2504.08195.pdf

Authors: Michael Elrod, Niloufar Mehrabi, Rahul Amin, Manveen Kaur, Long Cheng, Jim Martin, Abolfazl Razi

Title: Graph Based Deep Reinforcement Learning Aided by Transformers for Multi-Agent Cooperation

Abstract:
Mission planning for a fleet of cooperative autonomous drones in applications that involve serving distributed target points, such as disaster response, environmental monitoring, and surveillance, is challenging, especially under partial observability, limited communication range, and uncertain environments. Traditional path‑planning algorithms struggle in these scenarios, particularly when prior information is not available. To address these challenges, we propose a novel framework that integrates Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), and transformer‑based mechanisms for enhanced multi‑agent coordination and collective task execution. Our approach leverages GNNs to model agent‑agent and agent‑goal interactions through adaptive graph construction, enabling efficient information aggregation and decision‑making under constrained communication. A transformer‑based message‑passing mechanism, augmented with edge‑feature‑enhanced attention, captures complex interaction patterns, while a Double Deep Q‑Network (Double DQN) with prioritized experience replay optimizes agent policies in partially observable environments. This integration is carefully designed to address specific requirements of multi‑agent navigation, such as scalability, adaptability, and efficient task execution. Experimental results demonstrate superior performance, with 90% service provisioning and 100% grid coverage (node discovery), while reducing the average steps per episode to 200, compared to 600 for benchmark methods such as particle swarm optimization (PSO), greedy algorithms and DQN.

PaperID: 10920, https://arxiv.org/pdf/2504.08134.pdf

Authors: Minrui Xu, Dusit Niyato, Jiawen Kang, Zehui Xiong, Mingzhe Chen, Dong In Kim, Xuemin, Shen

Title: Hybrid Reinforcement Learning-based Sustainable Multi-User Computation Offloading for Mobile Edge-Quantum Computing

Abstract:
Exploiting quantum computing at the mobile edge holds immense potential for facilitating large‑scale network design, processing multimodal data, optimizing resource management, and enhancing network security. In this paper, we propose a pioneering paradigm of mobile edge quantum computing (MEQC) that integrates quantum computing capabilities into classical edge computing servers that are proximate to mobile devices. To conceptualize the MEQC, we first design an MEQC system, where mobile devices can offload classical and quantum computation tasks to edge servers equipped with classical and quantum computers. We then formulate the hybrid classical‑quantum computation offloading problem whose goal is to minimize system cost in terms of latency and energy consumption. To solve the offloading problem efficiently, we propose a hybrid discrete‑continuous multi‑agent reinforcement learning algorithm to learn long‑term sustainable offloading and partitioning strategies. Finally, numerical results demonstrate that the proposed algorithm can reduce the MEQC system cost by up to 30% compared to existing baselines.

PaperID: 10921, https://arxiv.org/pdf/2504.08113.pdf

Authors: Robbe Nooyens, Tolgahan Bardakci, Mutlu Beyazit, Serge Demeyer

Title: Test Amplification for REST APIs via Single and Multi-Agent LLM Systems

Abstract:
REST APIs (Representational State Transfer Application Programming Interfaces) play a vital role in modern cloud‑native applications. As these APIs grow in complexity and scale, ensuring their correctness and robustness becomes increasingly important. Automated testing is essential for identifying hidden bugs, particularly those that appear in edge cases or under unexpected inputs. However, creating comprehensive and effective test suites for REST APIs is challenging and often demands significant effort. In this paper, we investigate the use of large language model (LLM) systems, both single‑agent and multi‑agent setups, for amplifying existing REST API test suites. These systems generate additional test cases that aim to push the boundaries of the API, uncovering behaviors that might otherwise go untested. We present a comparative evaluation of the two approaches across several dimensions, including test coverage, bug detection effectiveness, and practical considerations such as computational cost and energy usage. Our evaluation demonstrates increased API coverage, identification of numerous bugs in the API under test, and insights into the computational cost and energy consumption of both approaches.

PaperID: 10922, https://arxiv.org/pdf/2504.07841.pdf

Authors: Nayesha Gandotra, Rishi Veerapaneni, Muhammad Suhail Saleem, Daniel Harabor, Jiaoyang Li, Maxim Likhachev

Title: Anytime Single-Step MAPF Planning with Anytime PIBT

Abstract:
PIBT is a popular Multi‑Agent Path Finding (MAPF) method at the core of many state‑of‑the‑art MAPF methods including LaCAM, CS‑PIBT, and WPPL. The main utility of PIBT is that it is a very fast and effective single‑step MAPF solver and can return a collision‑free single‑step solution for hundreds of agents in less than a millisecond. However, the main drawback of PIBT is that it is extremely greedy in respect to its priorities and thus leads to poor solution quality. Additionally, PIBT cannot use all the planning time that might be available to it and returns the first solution it finds. We thus develop Anytime PIBT, which quickly finds a one‑step solution identically to PIBT but then continuously improves the solution in an anytime manner. We prove that Anytime PIBT converges to the optimal solution given sufficient time. We experimentally validate that Anytime PIBT can rapidly improve single‑step solution quality within milliseconds and even find the optimal single‑step action. However, we interestingly find that improving the single‑step solution quality does not have a significant effect on full‑horizon solution costs.

PaperID: 10923, https://arxiv.org/pdf/2504.07831.pdf

Authors: Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín

Title: Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

Abstract:
We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.

PaperID: 10924, https://arxiv.org/pdf/2504.07830.pdf

Authors: Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, Saadia Gabriel

Title: MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

Abstract:
We present a novel, open‑source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine‑grained personas, our system enables multi‑agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non‑factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents' articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open‑source our simulation software to encourage further research within AI and social sciences.

PaperID: 10925, https://arxiv.org/pdf/2504.07547.pdf

Authors: Xinyang Wang, Martin Guay, Shimin Wang, Hongwei Zhang

Title: Strategic learning for disturbance rejection in multi-agent systems: Nash and Minmax in graphical games

Abstract:
This article investigates the optimal control problem with disturbance rejection for discrete‑time multi‑agent systems under cooperative and non‑cooperative graphical games frameworks. Given the practical challenges of obtaining accurate models, Q‑function‑based policy iteration methods are proposed to seek the Nash equilibrium solution for the cooperative graphical game and the distributed minmax solution for the non‑cooperative graphical game. To implement these methods online, two reinforcement learning frameworks are developed, an actor‑disturber‑critic structure for the cooperative graphical game and an actor‑adversary‑disturber‑critic structure for the non‑cooperative graphical game. The stability of the proposed methods is rigorously analyzed, and simulation results are provided to illustrate the effectiveness of the proposed methods.

PaperID: 10926, https://arxiv.org/pdf/2504.07463.pdf

Authors: Rahul K. Dass, Rochan H. Madhusudhana, Erin C. Deye, Shashank Verma, Timothy A. Bydlon, Grace Brazil, Ashok K. Goel

Title: Enhanced Question-Answering for Skill-based learning using Knowledge-based AI and Generative AI

Abstract:
Supporting learners' understanding of taught skills in online settings is a longstanding challenge. While exercises and chat‑based agents can evaluate understanding in limited contexts, this challenge is magnified when learners seek explanations that delve into procedural knowledge (how things are done) and reasoning (why things happen). We hypothesize that an intelligent agent's ability to understand and explain learners' questions about skills can be significantly enhanced using the TMK (Task‑Method‑Knowledge) model, a Knowledge‑based AI framework. We introduce Ivy, an intelligent agent that leverages an LLM and iterative refinement techniques to generate explanations that embody teleological, causal, and compositional principles. Our initial evaluation demonstrates that this approach goes beyond the typical shallow responses produced by an agent with access to unstructured text, thereby substantially improving the depth and relevance of feedback. This can potentially ensure learners develop a comprehensive understanding of skills crucial for effective problem‑solving in online environments.

PaperID: 10927, https://arxiv.org/pdf/2504.07461.pdf

Authors: Yiting Zhang, Yijiang Li, Tianwei Zhao, Kaijie Zhu, Haohan Wang, Nuno Vasconcelos

Title: Achilles Heel of Distributed Multi-Agent Systems

Abstract:
Multi‑agent system (MAS) has demonstrated exceptional capabilities in addressing complex challenges, largely due to the integration of multiple large language models (LLMs). However, the heterogeneity of LLMs, the scalability of quantities of LLMs, and local computational constraints pose significant challenges to hosting these models locally. To address these issues, we propose a new framework termed Distributed Multi‑Agent System (DMAS). In DMAS, heterogeneous third‑party agents function as service providers managed remotely by a central MAS server and each agent offers its services through API interfaces. However, the distributed nature of DMAS introduces several concerns about trustworthiness. In this paper, we study the Achilles heel of distributed multi‑agent systems, identifying four critical trustworthiness challenges: free riding, susceptibility to malicious attacks, communication inefficiencies, and system instability. Extensive experiments across seven frameworks and four datasets reveal significant vulnerabilities of the DMAS. These attack strategies can lead to a performance degradation of up to 80% and attain a 100% success rate in executing free riding and malicious attacks. We envision our work will serve as a useful red‑teaming tool for evaluating future multi‑agent systems and spark further research on trustworthiness challenges in distributed multi‑agent systems.

PaperID: 10928, https://arxiv.org/pdf/2504.07347.pdf

Authors: J. G. Dai, Tianze Deng, Yueying Li, Tianyi Peng

Title: Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

Abstract:
As demand for Large Language Models (LLMs) and AI agents grows rapidly, optimizing systems for efficient LLM inference becomes critical. While significant efforts have targeted system‑level engineering, little has been explored from a mathematical modeling and queueing perspective. In this paper, we develop the queueing fundamentals for LLM inference. In particular, we study the throughput aspect of LLM inference systems. We prove that a large class of `work‑conserving' scheduling algorithms achieve maximum throughput for both individual requests and AI‑agent workloads with directed acyclic graph (DAG) and fork‑join routing topologies, establishing `work‑conserving' as a key design principle for practitioners. Technically, we develop a fluid‑limit framework for multi‑class batched processing networks under K‑FCFS scheduling, which may be of independent interest. Evaluations of real‑world systems confirm that Orca and Sarathi‑Serve are throughput‑optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our analysis also reveals how constraints such as batch size limits and cyclic routing topologies complicate the throughput picture, pointing to rich open questions at the intersection of queueing theory and LLM system design.

PaperID: 10929, https://arxiv.org/pdf/2504.07303.pdf

Authors: Tooraj Helmi

Title: Modeling Response Consistency in Multi-Agent LLM Systems: A Comparative Analysis of Shared and Separate Context Approaches

Abstract:
Large Language Models (LLMs) are increasingly utilized in multi‑agent systems (MAS) to enhance collaborative problem‑solving and interactive reasoning. Recent advancements have enabled LLMs to function as autonomous agents capable of understanding complex interactions across multiple topics. However, deploying LLMs in MAS introduces challenges related to context management, response consistency, and scalability, especially when agents must operate under memory limitations and handle noisy inputs. While prior research has explored optimizing context sharing and response latency in LLM‑driven MAS, these efforts often focus on either fully centralized or decentralized configurations, each with distinct trade‑offs. In this paper, we develop a probabilistic framework to analyze the impact of shared versus separate context configurations on response consistency and response times in LLM‑based MAS. We introduce the Response Consistency Index (RCI) as a metric to evaluate the effects of context limitations, noise, and inter‑agent dependencies on system performance. Our approach differs from existing research by focusing on the interplay between memory constraints and noise management, providing insights into optimizing scalability and response times in environments with interdependent topics. Through this analysis, we offer a comprehensive understanding of how different configurations impact the efficiency of LLM‑driven multi‑agent systems, thereby guiding the design of more robust architectures.

PaperID: 10930, https://arxiv.org/pdf/2504.07277.pdf

Authors: Rian Melo, Pedro Simões, Rohit Gheyi, Marcelo d'Amorim, Márcio Ribeiro, Gustavo Soares, Eduardo Almeida, Elvys Soares

Title: Agentic LMs: Hunting Down Test Smells

Abstract:
Test smells reduce test suite reliability and complicate maintenance. While many methods detect test smells, few support automated removal, and most rely on static analysis or machine learning. This study evaluates models with relatively small parameter counts ‑ Llama‑3.2‑3B, Gemma‑2‑9B, DeepSeek‑R1‑14B, and Phi‑4‑14B ‑ for their ability to detect and refactor test smells using agent‑based workflows. We assess workflows with one, two, and four agents over 150 instances of 5 common smells from real‑world Java projects. Our approach generalizes to Python, Golang, and JavaScript. All models detected nearly all instances, with Phi‑4‑14B achieving the best refactoring accuracy (pass@5 of 75.3%). Phi‑4‑14B with four‑agents performed within 5% of proprietary LLMs (single‑agent). Multi‑agent setups outperformed single‑agent ones in three of five smell types, though for Assertion Roulette, one agent sufficed. We submitted pull requests with Phi‑4‑14B‑generated code to open‑source projects and six were merged.

PaperID: 10931, https://arxiv.org/pdf/2504.07242.pdf

Authors: Uthman Olawoye, Cagri Kilic, Jason N Gross

Title: Analysis of the Unscented Transform for Cooperative Localization with Ranging-Only Information

Abstract:
Cooperative localization in multi‑agent robotic systems is challenging, especially when agents rely on limited information, such as only peer‑to‑peer range measurements. Two key challenges arise: utilizing this limited information to improve position estimation; handling uncertainties from sensor noise, nonlinearity, and unknown correlations between agents measurements; and avoiding information reuse. This paper examines the use of the Unscented Transform (UT) for state estimation for a case in which range measurement between agents and covariance intersection (CI) is used to handle unknown correlations. Unlike Kalman Filter approaches, CI methods fuse complete state and covariance estimates. This makes formulating a CI approach with ranging‑only measurements a challenge. To overcome this, UT is used to handle uncertainties and formulate a cooperative state update using range measurements and current cooperative state estimates. This introduces information reuse in the measurement update. Therefore, this work aims to evaluate the limitations and utility of this formulation when faced with various levels of state measurement uncertainty and errors.

PaperID: 10932, https://arxiv.org/pdf/2504.07231.pdf

Authors: David Akhihiero, Jason N. Gross

Title: A Pointcloud Registration Framework for Relocalization in Subterranean Environments

Abstract:
Relocalization, the process of re‑establishing a robot's position within an environment, is crucial for ensuring accurate navigation and task execution when external positioning information, such as GPS, is unavailable or has been lost. Subterranean environments present significant challenges for relocalization due to limited external positioning information, poor lighting that affects camera localization, irregular and often non‑distinct surfaces, and dust, which can introduce noise and occlusion in sensor data. In this work, we propose a robust, computationally friendly framework for relocalization through point cloud registration utilizing a prior point cloud map. The framework employs Intrinsic Shape Signatures (ISS) to select feature points in both the target and prior point clouds. The Fast Point Feature Histogram (FPFH) algorithm is utilized to create descriptors for these feature points, and matching these descriptors yields correspondences between the point clouds. A 3D transformation is estimated using the matched points, which initializes a Normal Distribution Transform (NDT) registration. The transformation result from NDT is further refined using the Iterative Closest Point (ICP) registration algorithm. This framework enhances registration accuracy even in challenging conditions, such as dust interference and significant initial transformations between the target and source, making it suitable for autonomous robots operating in underground mines and tunnels. This framework was validated with experiments in simulated and real‑world mine datasets, demonstrating its potential for improving relocalization.

PaperID: 10933, https://arxiv.org/pdf/2504.07226.pdf

Authors: Jonas Hansson, Emma Tegling

Title: Compositional design for time-varying and nonlinear coordination

Abstract:
This work addresses the design of multi‑agent coordination through high‑order consensus protocols. While first‑order consensus strategies are well‑studied ‑‑ with known robustness to uncertainties such as time delays, time‑varying weights, and nonlinearities like saturations ‑‑ the theoretical guarantees for high‑order consensus are comparatively limited. We propose a compositional control framework that generates high‑order consensus protocols by serially connecting stable first‑order consensus operators. Under mild assumptions, we establish that the resulting high‑order system inherits stability properties from its components. The proposed design is versatile and supports a wide range of real‑world constraints. This is demonstrated through applications inspired by vehicular formation control, including protocols with time‑varying weights, bounded time‑varying delays, and saturated inputs. We derive theoretical guarantees for these settings using the proposed compositional approach and demonstrate the advantages gained compared to conventional protocols in simulations.

PaperID: 10934, https://arxiv.org/pdf/2504.07189.pdf

Authors: Orhan Eren Akgün, Sarper Aydın, Stephanie Gil, Angelia Nedić

Title: Multi-Agent Trustworthy Consensus under Random Dynamic Attacks

Abstract:
In this work, we study the consensus problem in which legitimate agents send their values over an undirected communication network in the presence of an unknown subset of malicious or faulty agents. In contrast to former works, we generalize and characterize the properties of consensus dynamics with dependent sequences of malicious transmissions with dynamic (time‑varying) rates, based on not necessarily independent trust observations. We consider a detection algorithm utilizing stochastic trust observations available to legitimate agents. Under these conditions, legitimate agents almost surely classify their neighbors and form their trusted neighborhoods correctly with decaying misclassification probabilities. We further prove that the consensus process converges almost surely despite the existence of malicious agents. For a given value of failure probability, we characterize the deviation from the nominal consensus value ideally occurring when there are no malicious agents in the system. We also examine the convergence rate of the process in finite time. Numerical simulations show the convergence among agents and indicate the deviation under different attack scenarios.

PaperID: 10935, https://arxiv.org/pdf/2504.07133.pdf

Authors: Alkis Kalavasis, Anay Mehrotra, Felix Zhou

Title: Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond

Abstract:
We revisit the problem of estimating k linear regressors with self‑selection bias in d dimensions with the maximum selection criterion, as introduced by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [CDIZ23, STOC'23]. Our main result is a \operatornamepoly(d,k,1/\varepsilon) + k^O(k) time algorithm for this problem, which yields an improvement in the running time of the algorithms of [CDIZ23] and [GM24, arXiv]. We achieve this by providing the first local convergence algorithm for self‑selection, thus resolving the main open question of [CDIZ23]. To obtain this algorithm, we reduce self‑selection to a seemingly unrelated statistical problem called coarsening. Coarsening occurs when one does not observe the exact value of the sample but only some set (a subset of the sample space) that contains the exact value. Inference from coarse samples arises in various real‑world applications due to rounding by humans and algorithms, limited precision of instruments, and lag in multi‑agent systems. Our reduction to coarsening is intuitive and relies on the geometry of the self‑selection problem, which enables us to bypass the limitations of previous analytic approaches. To demonstrate its applicability, we provide a local convergence algorithm for linear regression under another self‑selection criterion, which is related to second‑price auction data. Further, we give the first polynomial time local convergence algorithm for coarse Gaussian mean estimation given samples generated from a convex partition. Previously, only a sample‑efficient algorithm was known due to Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21, COLT'21].

PaperID: 10936, https://arxiv.org/pdf/2504.07128.pdf

Authors: Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy

Title: DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Abstract:
Large Reasoning Models like DeepSeek‑R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek‑R1 creates detailed multi‑step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek‑R1's basic building blocks of reasoning, our analyses on DeepSeek‑R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek‑R1 vis‑à‑vis cognitive phenomena, such as human‑like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek‑R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek‑R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek‑R1 compared to its non‑reasoning counterpart, which can also compromise safety‑aligned LLMs.

PaperID: 10937, https://arxiv.org/pdf/2504.06943.pdf

Authors: Kostas Hatalis, Despina Christou, Vyshnavi Kondapalli

Title: Review of Case-Based Reasoning for LLM Agents: Theoretical Foundations, Architectural Components, and Cognitive Integration

Abstract:
Agents powered by Large Language Models (LLMs) have recently demonstrated impressive capabilities in various tasks. Still, they face limitations in tasks requiring specific, structured knowledge, flexibility, or accountable decision‑making. While agents are capable of perceiving their environments, forming inferences, planning, and executing actions towards goals, they often face issues such as hallucinations and lack of contextual memory across interactions. This paper explores how Case‑Based Reasoning (CBR), a strategy that solves new problems by referencing past experiences, can be integrated into LLM agent frameworks. This integration allows LLMs to leverage explicit knowledge, enhancing their effectiveness. We systematically review the theoretical foundations of these enhanced agents, identify critical framework components, and formulate a mathematical model for the CBR processes of case retrieval, adaptation, and learning. We also evaluate CBR‑enhanced agents against other methods like Chain‑of‑Thought reasoning and standard Retrieval‑Augmented Generation, analyzing their relative strengths. Moreover, we explore how leveraging CBR's cognitive dimensions (including self‑reflection, introspection, and curiosity) via goal‑driven autonomy mechanisms can further enhance the LLM agent capabilities. Contributing to the ongoing research on neuro‑symbolic hybrid systems, this work posits CBR as a viable technique for enhancing the reasoning skills and cognitive aspects of autonomous LLM agents.

PaperID: 10938, https://arxiv.org/pdf/2504.06894.pdf

Authors: Yusef Ahsini, Belén Reverte, J. Alberto Conejero

Title: AI-Driven Consensus: Modeling Multi-Agent Networks with Long-Range Interactions through path-Laplacian Matrices

Abstract:
Extended connectivity in graphs can be analyzed through k‑path Laplacian matrices, which permit the capture of long‑range interactions in various real‑world networked systems such as social, transportation, and multi‑agent networks. In this work, we present several alternative methods based on machine learning methods (LSTM, xLSTM, Transformer, XGBoost, and ConvLSTM) to predict the final consensus value based on directed networks (Erdös‑Renyi, Watts‑Strogatz, and Barabási‑Albert) and on the initial state. We highlight how different k‑hop interactions affect the performance of the tested methods. This framework opens new avenues for analyzing multi‑scale diffusion processes in large‑scale, complex networks.

PaperID: 10939, https://arxiv.org/pdf/2504.06684.pdf

Authors: Delin Zhao, Yanbo Shan, Chang Liu, Shenghang Lin, Yingxin Shou, Bin Xu

Title: SDHN: Skewness-Driven Hypergraph Networks for Enhanced Localized Multi-Robot Coordination

Abstract:
Multi‑Agent Reinforcement Learning is widely used for multi‑robot coordination, where simple graphs typically model pairwise interactions. However, such representations fail to capture higher‑order collaborations, limiting effectiveness in complex tasks. While hypergraph‑based approaches enhance cooperation, existing methods often generate arbitrary hypergraph structures and lack adaptability to environmental uncertainties. To address these challenges, we propose the Skewness‑Driven Hypergraph Network (SDHN), which employs stochastic Bernoulli hyperedges to explicitly model higher‑order multi‑robot interactions. By introducing a skewness loss, SDHN promotes an efficient structure with Small‑Hyperedge Dominant Hypergraph, allowing robots to prioritize localized synchronization while still adhering to the overall information, similar to human coordination. Extensive experiments on Moving Agents in Formation and Robotic Warehouse tasks validate SDHN's effectiveness, demonstrating superior performance over state‑of‑the‑art baselines.

PaperID: 10940, https://arxiv.org/pdf/2504.06670.pdf

Authors: Kaifeng Wang, Yinsong Chen, Qi Liu, Xueyuan Li, Xin Gao

Title: Dynamic Residual Safe Reinforcement Learning for Multi-Agent Safety-Critical Scenarios Decision-Making

Abstract:
In multi‑agent safety‑critical scenarios, traditional autonomous driving frameworks face significant challenges in balancing safety constraints and task performance. These frameworks struggle to quantify dynamic interaction risks in real‑time and depend heavily on manual rules, resulting in low computational efficiency and conservative strategies. To address these limitations, we propose a Dynamic Residual Safe Reinforcement Learning (DRS‑RL) framework grounded in a safety‑enhanced networked Markov decision process. It's the first time that the weak‑to‑strong theory is introduced into multi‑agent decision‑making, enabling lightweight dynamic calibration of safety boundaries via a weak‑to‑strong safety correction paradigm. Based on the multi‑agent dynamic conflict zone model, our framework accurately captures spatiotemporal coupling risks among heterogeneous traffic participants and surpasses the static constraints of conventional geometric rules. Moreover, a risk‑aware prioritized experience replay mechanism mitigates data distribution bias by mapping risk to sampling probability. Experimental results reveal that the proposed method significantly outperforms traditional RL algorithms in safety, efficiency, and comfort. Specifically, it reduces the collision rate by up to 92.17%, while the safety model accounts for merely 27% of the main model's parameters.

PaperID: 10941, https://arxiv.org/pdf/2504.06614.pdf

Authors: Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Xiaosong Huang, Chiming Duan, Ying Li

Title: AgentFM: Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents

Abstract:
Distributed databases are critical infrastructures for today's large‑scale software systems, making effective failure management essential to ensure software availability. However, existing approaches often overlook the role distinctions within distributed databases and rely on small‑scale models with limited generalization capabilities. In this paper, we conduct a preliminary empirical study to emphasize the unique significance of different roles. Building on this insight, we propose AgentFM, a role‑aware failure management framework for distributed databases powered by LLM‑driven multi‑agents. AgentFM addresses failure management by considering system roles, data roles, and task roles, with a meta‑agent orchestrating these components. Preliminary evaluations using Apache IoTDB demonstrate the effectiveness of AgentFM and open new directions for further research.

PaperID: 10942, https://arxiv.org/pdf/2504.06581.pdf

Authors: Umakanta Maharana, Sarthak Verma, Avarna Agarwal, Prakashini Mruthyunjaya, Dwarikanath Mahapatra, Sakir Ahmed, Murari Mandal

Title: Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis

Abstract:
Large language models (LLMs) offer a promising pre‑screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonged clinical evaluations, all of which can delay treatment and adversely affect patient outcomes. With impressive accuracy in prediction across a range of diseases, LLMs have the potential to revolutionize clinical pre‑screening and decision‑making for various medical conditions. In this work, we study the diagnostic capability of LLMs for Rheumatoid Arthritis (RA) with real world patients data. Patient data was collected alongside diagnoses from medical experts, and the performance of LLMs was evaluated in comparison to expert diagnoses for RA disease prediction. We notice an interesting pattern in disease diagnosis and find an unexpected misalignment between prediction and explanation. We conduct a series of multi‑round analyses using different LLM agents. The best‑performing model accurately predicts rheumatoid arthritis (RA) diseases approximately 95% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68% of the reasoning was incorrect. This study highlights a clear misalignment between LLMs high prediction accuracy and its flawed reasoning, raising important questions about relying on LLM explanations in clinical settings. LLMs provide incorrect reasoning to arrive at the correct answer for RA disease diagnosis.

PaperID: 10943, https://arxiv.org/pdf/2504.06557.pdf

Authors: Takeshi Kato, Junichi Miyakoshi, Misa Owa, Ryuji Mine

Title: Market, power, gift, and concession economies: Comparison using four-mode primitive network models

Abstract:
Reducing wealth inequality is a global challenge, and the problems of capitalism stem from the enclosure of the commons and the breakdown of the community. According to previous studies by Polanyi, Karatani, and Graeber, economic modes can be divided into capitalist market economy (enclosure and exchange), power economy (de‑enclosure and redistribution), gift economy (obligation to return and reciprocity), and concession economy (de‑obligation to return). The concession economy reflects Graeber's baseline communism (from each according to their abilities, to each according to their needs) and Deguchi's We‑turn philosophy (the "I" as an individual has a "fundamental incapability" and the subject of physical action, responsibility, and freedom is "We" as a multi‑agent system, including the "I"). In this study, we constructed novel network models for these four modes and compared their properties (cluster coefficient, graph density, reciprocity, assortativity, centrality, and Gini coefficient). From the calculation results, it became clear that the market economy leads to inequality; the power economy mitigates inequality but cannot eliminate it; the gift and concession economies lead to a healthy and equal economy; and the concession economy, free from the ties of obligation to return, is possible without guaranteeing reciprocity. We intend to promote the transformation from a capitalist economy to a concession economy through activities that disseminate baseline communism and the We‑turn philosophy that promotes concession, that is, developing a cooperative platform to support concession through information technology and empirical research through fieldwork.

PaperID: 10944, https://arxiv.org/pdf/2504.06272.pdf

Authors: Kevin Dela Rosa

Title: RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

Abstract:
We present RAVEN an adaptive AI agent framework designed for multimodal entity discovery and retrieval in large‑scale video collections. Synthesizing information across visual, audio, and textual modalities, RAVEN autonomously processes video data to produce structured, actionable representations for downstream tasks. Key contributions include (1) a category understanding step to infer video themes and general‑purpose entities, (2) a schema generation mechanism that dynamically defines domain‑specific entities and attributes, and (3) a rich entity extraction process that leverages semantic retrieval and schema‑guided prompting. RAVEN is designed to be model‑agnostic, allowing the integration of different vision‑language models (VLMs) and large language models (LLMs) based on application‑specific requirements. This flexibility supports diverse applications in personalized search, content discovery, and scalable information retrieval, enabling practical applications across vast datasets.

PaperID: 10945, https://arxiv.org/pdf/2504.06269.pdf

Authors: Yin Wu, Zhengxuan Zhang, Fuling Wang, Yuyu Luo, Hui Xiong, Nan Tang

Title: EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval

Abstract:
Misinformation continues to pose a significant challenge in today's information ecosystem, profoundly shaping public perception and behavior. Among its various manifestations, Out‑of‑Context (OOC) misinformation is particularly obscure, as it distorts meaning by pairing authentic images with misleading textual narratives. Existing methods for detecting OOC misinformation predominantly rely on coarse‑grained similarity metrics between image‑text pairs, which often fail to capture subtle inconsistencies or provide meaningful explainability. While multi‑modal large language models (MLLMs) demonstrate remarkable capabilities in visual reasoning and explanation generation, they have not yet demonstrated the capacity to address complex, fine‑grained, and cross‑modal distinctions necessary for robust OOC detection. To overcome these limitations, we introduce EXCLAIM, a retrieval‑based framework designed to leverage external knowledge through multi‑granularity index of multi‑modal events and entities. Our approach integrates multi‑granularity contextual analysis with a multi‑agent reasoning architecture to systematically evaluate the consistency and integrity of multi‑modal news content. Comprehensive experiments validate the effectiveness and resilience of EXCLAIM, demonstrating its ability to detect OOC misinformation with 4.3% higher accuracy compared to state‑of‑the‑art approaches, while offering explainable and actionable insights.

PaperID: 10946, https://arxiv.org/pdf/2504.06169.pdf

Authors: Alba Gurpegui, Mark Jeeninga, Emma Tegling, Anders Rantzer

Title: Linear Regulator-Based Synchronization of Positive Multi-Agent Systems

Abstract:
This paper addresses the positive synchronization of interconnected systems on undirected graphs. For homogeneous positive systems, a static feedback protocol design is proposed, based on the Linear Regulator problem. The solution to the algebraic equation associated to the stabilizing policy can be found using a linear program. Necessary and sufficient conditions on the positivity of each agent's trajectory for all nonnegative initial conditions are also provided. Simulations on large regular graphs with different nodal degree illustrate the proposed results.

PaperID: 10947, https://arxiv.org/pdf/2504.06091.pdf

Authors: Runzhe Liang, Rishi Veerapaneni, Daniel Harabor, Jiaoyang Li, Maxim Likhachev

Title: Real-Time LaCAM for Real-Time MAPF

Abstract:
The vast majority of Multi‑Agent Path Finding (MAPF) methods with completeness guarantees require planning full‑horizon paths. However, planning full‑horizon paths can take too long and be impractical in real‑world applications. Instead, real‑time planning and execution, which only allows the planner a finite amount of time before executing and replanning, is more practical for real‑world multi‑agent systems. Several methods utilize real‑time planning schemes but none are provably complete, which leads to livelock or deadlock. Our main contribution is Real‑Time LaCAM, the first Real‑Time MAPF method with provable completeness guarantees. We do this by leveraging LaCAM (Okumura 2023) in an incremental fashion. Our results show how we can iteratively plan for congested environments with a cutoff time of milliseconds while still maintaining the same success rate as full‑horizon LaCAM. We also show how it can be used with a single‑step learned MAPF policy.

PaperID: 10948, https://arxiv.org/pdf/2504.05887.pdf

Authors: Savvas Papaioannou, Panayiotis Kolios, Theocharis Theocharides, Christos G. Panayiotou, Marios M. Polycarpou

Title: Jointly-optimized Trajectory Generation and Camera Control for 3D Coverage Planning

Abstract:
This work proposes a jointly optimized trajectory generation and camera control approach, enabling an autonomous agent, such as an unmanned aerial vehicle (UAV) operating in 3D environments, to plan and execute coverage trajectories that maximally cover the surface area of a 3D object of interest. Specifically, the UAV's kinematic and camera control inputs are jointly optimized over a rolling planning horizon to achieve complete 3D coverage of the object. The proposed controller incorporates ray‑tracing into the planning process to simulate the propagation of light rays, thereby determining the visible parts of the object through the UAV's camera. This integration enables the generation of precise look‑ahead coverage trajectories. The coverage planning problem is formulated as a rolling finite‑horizon optimal control problem and solved using mixed‑integer programming techniques. Extensive real‑world and synthetic experiments validate the performance of the proposed approach.

PaperID: 10949, https://arxiv.org/pdf/2504.05729.pdf

Authors: Yuhang Deng, Zheng Chen, Erik G. Larsson

Title: Robust and Efficient Average Consensus with Non-Coherent Over-the-Air Aggregation

Abstract:
Non‑coherent over‑the‑air (OTA) computation has garnered increasing attention for its advantages in facilitating information aggregation among distributed agents in resource‑constrained networks without requiring precise channel estimation. A promising application scenario of this method is distributed average consensus in wireless multi‑agent systems. However, in such scenario, non‑coherent interference from concurrent OTA transmissions can introduce bias in the consensus value. To address this issue, we develop a robust distributed average consensus algorithm by formulating the consensus problem as a distributed optimization problem. Using decentralized projected gradient descent (D‑PGD), our proposed algorithm can achieve unbiased mean square average consensus even in the presence of non‑coherent interference and noise. Additionally, we implement transmit power control and receive scaling mechanisms to further accelerate convergence. Simulation results demonstrate that our method can significantly enhance the convergence speed of the D‑PGD algorithm for OTA average consensus without compromising accuracy.

PaperID: 10950, https://arxiv.org/pdf/2504.05716.pdf

Authors: Gen Li, Li Chen, Cheng Tang, Valdemar Švábenský, Daisuke Deguchi, Takayoshi Yamashita, Atsushi Shimada

Title: Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment

Abstract:
We explore the use of Large Language Models (LLMs) for automated assessment of open‑text student reflections and prediction of academic performance. Traditional methods for evaluating reflections are time‑consuming and may not scale effectively in educational settings. In this work, we employ LLMs to transform student reflections into quantitative scores using two assessment strategies (single‑agent and multi‑agent) and two prompting techniques (zero‑shot and few‑shot). Our experiments, conducted on a dataset of 5,278 reflections from 377 students over three academic terms, demonstrate that the single‑agent with few‑shot strategy achieves the highest match rate with human evaluations. Furthermore, models utilizing LLM‑assessed reflection scores outperform baselines in both at‑risk student identification and grade prediction tasks. These findings suggest that LLMs can effectively automate reflection assessment, reduce educators' workload, and enable timely support for students who may need additional assistance. Our work emphasizes the potential of integrating advanced generative AI technologies into educational practices to enhance student engagement and academic success.

PaperID: 10951, https://arxiv.org/pdf/2504.05607.pdf

Authors: Qian-Wen Zhang, Fang Li, Jie Wang, Lingfeng Qiao, Yifei Yu, Di Yin, Xing Sun

Title: FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

Abstract:
Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi‑agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence‑based question‑answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard‑Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model's ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi‑agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.

PaperID: 10952, https://arxiv.org/pdf/2504.05605.pdf

Authors: Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, Athanasios V. Vasilakos

Title: ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

Abstract:
Chain‑of‑Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token‑level or prompt‑based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi‑step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self‑reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi‑stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition‑level threats and highlight the urgent need for defenses beyond shallow surface‑level consistency.

PaperID: 10953, https://arxiv.org/pdf/2504.05560.pdf

Authors: Juan I. Giribet, Alejandro S. Ghersin, Ignacio Mas, Harrison Neves Marciano, Daniel Khede Dourado Villa, Mario Sarcinelli-Filho

Title: Adaptive Multirobot Virtual Structure Control using Dual Quaternions

Abstract:
This paper presents a control strategy based on dual quaternions for the coordinated formation flying of small UAV groups. A virtual structure is employed to define the desired formation, enabling unified control of its position, orientation, and shape. This abstraction makes formation management easier by allowing a low‑level controller to compute individual UAV commands efficiently. The proposed controller integrates a pose control module with a geometry‑based adaptive strategy, ensuring precise and robust task execution. The effectiveness of the approach is demonstrated through both simulation and experimental results.

PaperID: 10954, https://arxiv.org/pdf/2504.05559.pdf

Authors: Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, Dashun Wang

Title: SciSciGPT: Advancing Human-AI Collaboration in the Science of Science

Abstract:
The increasing availability of large‑scale datasets has fueled rapid progress across many scientific fields, creating unprecedented opportunities for research and discovery while posing significant analytical challenges. Recent advances in large language models (LLMs) and AI agents have opened new possibilities for human‑AI collaboration, offering powerful tools to navigate this complex research landscape. In this paper, we introduce SciSciGPT, an open‑source, prototype AI collaborator that uses the science of science as a testbed to explore the potential of LLM‑powered research tools. SciSciGPT automates complex workflows, supports diverse analytical approaches, accelerates research prototyping and iteration, and facilitates reproducibility. Through case studies, we demonstrate its ability to streamline a wide range of empirical and analytical research tasks while highlighting its broader potential to advance research. We further propose an LLM Agent capability maturity model for human‑AI collaboration, envisioning a roadmap to further improve and expand upon frameworks like SciSciGPT. As AI capabilities continue to evolve, frameworks like SciSciGPT may play increasingly pivotal roles in scientific research and discovery, unlocking further opportunities. At the same time, these new advances also raise critical challenges, from ensuring transparency and ethical use to balancing human and AI contributions. Addressing these issues may shape the future of scientific inquiry and inform how we train the next generation of scientists to thrive in an increasingly AI‑integrated research ecosystem.

PaperID: 10955, https://arxiv.org/pdf/2504.05553.pdf

Authors: Yongjie Fu, Lingyun Zhong, Zifan Li, Xuan Di

Title: Federated Hierarchical Reinforcement Learning for Adaptive Traffic Signal Control

Abstract:
Multi‑agent reinforcement learning (MARL) has shown promise for adaptive traffic signal control (ATSC), enabling multiple intersections to coordinate signal timings in real time. However, in large‑scale settings, MARL faces constraints due to extensive data sharing and communication requirements. Federated learning (FL) mitigates these challenges by training shared models without directly exchanging raw data, yet traditional FL methods such as FedAvg struggle with highly heterogeneous intersections. Different intersections exhibit varying traffic patterns, demands, and road structures, so performing FedAvg across all agents is inefficient. To address this gap, we propose Hierarchical Federated Reinforcement Learning (HFRL) for ATSC. HFRL employs clustering‑based or optimization‑based techniques to dynamically group intersections and perform FedAvg independently within groups of intersections with similar characteristics, enabling more effective coordination and scalability than standard FedAvg.Our experiments on synthetic and real‑world traffic networks demonstrate that HFRL consistently outperforms decentralized and standard federated RL approaches, and achieves competitive or superior performance compared to centralized RL as network scale and heterogeneity increase, particularly in real‑world settings. The method also identifies suitable grouping patterns based on network structure or traffic demand, resulting in a more robust framework for distributed, heterogeneous systems.

PaperID: 10956, https://arxiv.org/pdf/2504.05500.pdf

Authors: Vahid Majdinasab, Amin Nikanjam, Foutse Khomh

Title: Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

Abstract:
The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM‑based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree‑based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi‑agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state‑of‑the‑art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism's effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.

PaperID: 10957, https://arxiv.org/pdf/2504.05465.pdf

Authors: Zherong Pan, Kui Wu

Title: BC-ADMM: An Efficient Non-convex Constrained Optimizer with Robotic Applications

Abstract:
Non‑convex constrained optimizations are ubiquitous in robotic applications such as multi‑agent navigation, UAV trajectory optimization, and soft robot simulation. For this problem class, conventional optimizers suffer from small step sizes and slow convergence. We propose BC‑ADMM, a variant of Alternating Direction Method of Multiplier (ADMM), that can solve a class of non‑convex constrained optimizations with biconvex constraint relaxation. Our algorithm allows larger step sizes by breaking the problem into small‑scale sub‑problems that can be easily solved in parallel. We show that our method has both theoretical convergence speed guarantees and practical convergence guarantees in the asymptotic sense. Through numerical experiments in a row of four robotic applications, we show that BC‑ADMM has faster convergence than conventional gradient descent and Newton's method in terms of wall clock time.

PaperID: 10958, https://arxiv.org/pdf/2504.05365.pdf

Authors: Shan Suthaharan

Title: A Nature-Inspired Colony of Artificial Intelligence System with Fast, Detailed, and Organized Learner Agents for Enhancing Diversity and Quality

Abstract:
The concepts of convolutional neural networks (CNNs) and multi‑agent systems are two important areas of research in artificial intelligence (AI). In this paper, we present an approach that builds a CNN‑based colony of AI agents to serve as a single system and perform multiple tasks (e.g., predictions or classifications) in an environment. The proposed system impersonates the natural environment of a biological system, like an ant colony or a human colony. The proposed colony of AI that is defined as a role‑based system uniquely contributes to accomplish tasks in an environment by incorporating AI agents that are fast learners, detailed learners, and organized learners. These learners can enhance their localized learning and their collective decisions as a single system of colony of AI agents. This approach also enhances the diversity and quality of the colony of AI with the help of Genetic Algorithms and their crossover and mutation mechanisms. The evolution of fast, detailed, and organized learners in the colony of AI is achieved by introducing a unique one‑to‑one mapping between these learners and the pretrained VGG16, VGG19, and ResNet50 models, respectively. This role‑based approach creates two parent‑AI agents using the AI models through the processes, called the intra‑ and inter‑marriage of AI, so that they can share their learned knowledge (weights and biases) based on a probabilistic rule and produce diversified child‑AI agents to perform new tasks. This process will form a colony of AI that consists of families of multi‑model and mixture‑model AI agents to improve diversity and quality. Simulations show that the colony of AI, built using the VGG16, VGG19, and ResNet50 models, can provide a single system that generates child‑AI agents of excellent predictive performance, ranging between 82% and 95% of F1‑scores, to make diversified collective and quality decisions on a task.

PaperID: 10959, https://arxiv.org/pdf/2504.05358.pdf

Authors: Xi Chen, Mao Mao, Shuo Li, Haotian Shangguan

Title: Debate-Feedback: A Multi-Agent Framework for Efficient Legal Judgment Prediction

Abstract:
The use of AI in legal analysis and prediction (LegalAI) has gained widespread attention, with past research focusing on retrieval‑based methods and fine‑tuning large models. However, these approaches often require large datasets and underutilize the capabilities of modern large language models (LLMs). In this paper, inspired by the debate phase of real courtroom trials, we propose a novel legal judgment prediction model based on the Debate‑Feedback architecture, which integrates LLM multi‑agent debate and reliability evaluation models. Unlike traditional methods, our model achieves significant improvements in efficiency by minimizing the need for large historical datasets, thus offering a lightweight yet robust solution. Comparative experiments show that it outperforms several general‑purpose and domain‑specific legal models, offering a dynamic reasoning process and a promising direction for future LegalAI research.

PaperID: 10960, https://arxiv.org/pdf/2504.05322.pdf

Authors: Luca Bolis, Stefano Livella, Sabrina Patania, Dimitri Ognibene, Matteo Papini, Kenji Morita

Title: Balancing Benefits and Risks: RL Approaches for Addiction-Aware Social Media Recommenders

Abstract:
Social media platforms provide valuable opportunities for users to gather information, interact with friends, and enjoy entertainment. However, their addictive potential poses significant challenges, including overuse and negative psycho‑logical or behavioral impacts [4, 2, 8]. This study explores strategies to mitigate compulsive social media usage while preserving its benefits and ensuring economic sustainability, focusing on recommenders that promote balanced usage. We analyze user behaviors arising from intrinsic diversities and environmental interactions, offering insights for next‑generation social media recommenders that prioritize well‑being. Specifically, we examine the temporal predictability of overuse and addiction using measures available to recommenders, aiming to inform mechanisms that prevent addiction while avoiding user disengagement [7]. Building on RL‑based computational frameworks for addiction modelling [6], our study introduces: ‑ A recommender system adapting to user preferences, introducing non‑stationary and non‑Markovian dynamics. ‑ Differentiated state representations for users and recommenders to capture nuanced interactions. ‑ Distinct usage conditions‑light and heavy use‑addressing RL's limitations in distinguishing prolonged from healthy engagement. ‑ Complexity in overuse impacts, highlighting their role in user adaptation [7]. Simulations demonstrate how model‑based (MB) and model‑free (MF) decision‑making interact with environmental dynamics to influence user behavior and addiction. Results reveal the significant role of recommender systems in shaping addiction tendencies or fostering healthier engagement. These findings support ethical, adaptive recommender design, advancing sustainable social media ecosystems [9, 1]. Keywords: multi‑agent systems, recommender systems, addiction, social media

PaperID: 10961, https://arxiv.org/pdf/2504.05312.pdf

Authors: Qitao Qin, Yucong Luo, Yihang Lu, Zhibo Chu, Xiaoman Liu, Xianwei Meng

Title: Towards Adaptive Memory-Based Optimization for Enhanced Retrieval-Augmented Generation

Abstract:
Retrieval‑Augmented Generation (RAG), by integrating non‑parametric knowledge from external knowledge bases into models, has emerged as a promising approach to enhancing response accuracy while mitigating factual errors and hallucinations. This method has been widely applied in tasks such as Question Answering (QA). However, existing RAG methods struggle with open‑domain QA tasks because they perform independent retrieval operations and directly incorporate the retrieved information into generation without maintaining a summarizing memory or using adaptive retrieval strategies, leading to noise from redundant information and insufficient information integration. To address these challenges, we propose Adaptive memory‑based optimization for enhanced RAG (Amber) for open‑domain QA tasks, which comprises an Agent‑based Memory Updater, an Adaptive Information Collector, and a Multi‑granular Content Filter, working together within an iterative memory updating paradigm. Specifically, Amber integrates and optimizes the language model's memory through a multi‑agent collaborative approach, ensuring comprehensive knowledge integration from previous retrieval steps. It dynamically adjusts retrieval queries and decides when to stop retrieval based on the accumulated knowledge, enhancing retrieval efficiency and effectiveness. Additionally, it reduces noise by filtering irrelevant content at multiple levels, retaining essential information to improve overall model performance. We conduct extensive experiments on several open‑domain QA datasets, and the results demonstrate the superiority and effectiveness of our method and its components. The source code is available \footnotehttps://anonymous.4open.science/r/Amber‑B203/.

PaperID: 10962, https://arxiv.org/pdf/2504.05306.pdf

Authors: Kavana Venkatesh, Connor Dunlop, Pinar Yanardag

Title: CREA: A Collaborative Multi-Agent Framework for Creative Image Editing and Generation

Abstract:
Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt‑based modifications, creative image editing requires an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi‑agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state‑of‑the‑art methods in diversity, semantic alignment, and creative transformation. To the best of our knowledge, this is the first work to introduce the task of creative editing.

PaperID: 10963, https://arxiv.org/pdf/2504.05259.pdf

Authors: Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving

Title: How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Abstract:
As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agents's actual capability profile can inform proportional control evaluations, resulting in more practical and cost‑effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1‑M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.

PaperID: 10964, https://arxiv.org/pdf/2504.05104.pdf

Authors: Saeid Ario Vaghefi, Aymane Hachcham, Veronica Grasso, Jiska Manicus, Nakiete Msemo, Chiara Colesanti Senni, Markus Leippold

Title: AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

Abstract:
Tracking financial investments in climate adaptation is a complex and expertise‑intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM‑based agentic AI system that integrates contextual retrieval, fine‑tuning, and multi‑step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real‑world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI‑driven classification methods, including zero‑shot and few‑shot learning, fine‑tuned transformer‑based classifiers, chain‑of‑thought (CoT) prompting, and an agent‑based retrieval‑augmented generation (RAG) approach. Our results show that the agent‑based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert‑annotated corpus, providing a valuable resource for future research in AI‑driven financial tracking and climate finance transparency.

PaperID: 10965, https://arxiv.org/pdf/2504.05071.pdf

Authors: Victor Monzon Baeza, Raúl Parada, Laura Concha Salor, Carlos Monzo

Title: AI-Driven Tactical Communications and Networking for Defense: A Survey and Emerging Trends

Abstract:
The integration of Artificial Intelligence (AI) in military communications and networking is reshaping modern defense strategies, enhancing secure data exchange, real‑time situational awareness, and autonomous decision‑making. This survey explores how AI‑driven technologies improve tactical communication networks, radar‑based data transmission, UAV‑assisted relay systems, and electronic warfare resilience. The study highlights AI applications in adaptive signal processing, multi‑agent coordination for network optimization, radar‑assisted target tracking, and AI‑driven electronic countermeasures. Our work introduces a novel three‑criteria evaluation methodology. It systematically assesses AI applications based on general system objectives, communications constraints in the military domain, and critical tactical environmental factors. We analyze key AI techniques for different types of learning applied to multi‑domain network interoperability and distributed data information fusion in military operations. We also address challenges such as adversarial AI threats, the real‑time adaptability of autonomous communication networks, and the limitations of current AI models under battlefield conditions. Finally, we discuss emerging trends in self‑healing networks, AI‑augmented decision support systems, and intelligent spectrum allocation. We provide a structured roadmap for future AI‑driven defense communications and networking research.

PaperID: 10966, https://arxiv.org/pdf/2504.05045.pdf

Authors: Huilin Yin, Zhikun Yang, Linchuan Zhang, Daniel Watzenig

Title: Spatiotemporal Attention-Augmented Inverse Reinforcement Learning for Multi-Agent Task Allocation

Abstract:
Adversarial inverse reinforcement learning (IRL) for multi‑agent task allocation (MATA) is challenged by non‑stationary interactions and high‑dimensional coordination. Unconstrained reward inference in these settings often leads to high variance and poor generalization. We propose an attention‑structured adversarial IRL framework that constrains reward inference via spatiotemporal representation learning. Our method employs multi‑head self‑attention (MHSA) for long‑range temporal dependencies and graph attention networks (GAT) for agent‑task relational structures. We formulate reward inference as a low‑capacity, adaptive linear transformation of the environment reward, ensuring stable and interpretable guidance. This framework decouples reward inference from policy learning and optimizes the reward model adversarially. Experiments on benchmark MATA scenarios show that our approach outperforms representative MARL baselines in convergence speed, cumulative rewards, and spatial efficiency. Results demonstrate that attention‑guided, capacity‑constrained reward inference is a scalable and effective mechanism for stabilizing adversarial IRL in complex multi‑agent systems.

PaperID: 10967, https://arxiv.org/pdf/2504.04937.pdf

Authors: Aurora Haraldsen, Josef Matous, Kristin Y. Pettersen

Title: Hybrid Control Barrier Functions for Nonholonomic Multi-Agent Systems

Abstract:
This paper addresses the problem of guaranteeing safety of multiple coordinated agents moving in dynamic environments. It has recently been shown that this problem can be efficiently solved through the notion of Control Barrier Functions (CBFs). However, for nonholonomic vehicles that are required to keep positive speeds, existing CBFs lose their validity. To overcome this limitation, we propose a hybrid formulation based on synergistic CBFs (SCBFs), which leverages a discrete switching mechanism to avoid configurations that would render the CBF invalid. Unlike existing approaches, our method ensures safety in the presence of moving obstacles and inter‑agent interactions while respecting nonzero speed restrictions. We formally analyze the feasibility of the constraints with respect to actuation limits, and the efficacy of the solution is demonstrated in simulation of a multi‑agent coordination problem in the presence of moving obstacles.

PaperID: 10968, https://arxiv.org/pdf/2504.04855.pdf

Authors: Haoxuan Li, Mingyu Derek Ma, Jen-tse Huang, Zhaotian Weng, Wei Wang, Jieyu Zhao

Title: BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

Abstract:
Detecting biases in structured data is a complex and time‑consuming task. Existing automated techniques are limited in diversity of data types and heavily reliant on human case‑by‑case handling, resulting in a lack of generalizability. Currently, large language model (LLM)‑based agents have made significant progress in data science, but their ability to detect data biases is still insufficiently explored. To address this gap, we introduce the first end‑to‑end, multi‑agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi‑stage plan to analyze user‑specified bias detection tasks and then implements it with a diverse and well‑suited set of tools. It delivers detailed results that include explanations and visualizations. To address the lack of a standardized framework for evaluating the capability of LLM agents to detect biases in data, we further propose a comprehensive benchmark that includes multiple evaluation metrics and a large set of test cases. Extensive experiments demonstrate that our framework achieves exceptional overall performance in structured data bias detection, setting a new milestone for fairer data applications.

PaperID: 10969, https://arxiv.org/pdf/2504.04850.pdf

Authors: Ángel Aso-Mollar, Eva Onaindia

Title: An Efficient Approach for Cooperative Multi-Agent Learning Problems

Abstract:
In this article, we propose a centralized Multi‑Agent Learning framework for learning a policy that models the simultaneous behavior of multiple agents that need to coordinate to solve a certain task. Centralized approaches often suffer from the explosion of an action space that is defined by all possible combinations of individual actions, known as joint actions. Our approach addresses the coordination problem via a sequential abstraction, which overcomes the scalability problems typical to centralized methods. It introduces a meta‑agent, called supervisor, which abstracts joint actions as sequential assignments of actions to each agent. This sequential abstraction not only simplifies the centralized joint action space but also enhances the framework's scalability and efficiency. Our experimental results demonstrate that the proposed approach successfully coordinates agents across a variety of Multi‑Agent Learning environments of diverse sizes.

PaperID: 10970, https://arxiv.org/pdf/2504.04789.pdf

Authors: Zhuoning Xu, Jian Xu, Mingqing Zhang, Peijie Wang, Chao Deng, Cheng-Lin Liu

Title: Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making

Abstract:
As a strategic pillar industry for human survival and development, modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development. Against the backdrop of intensified climate change leading to frequent extreme weather events, the uncertainty risks in agricultural production systems are increasing exponentially. To address these challenges, this study proposes an innovative Multimodal Agricultural Agent Architecture (MA3), which leverages cross‑modal information fusion and task collaboration mechanisms to achieve intelligent agricultural decision‑making. This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation. We propose a unified backbone for sugarcane disease classification and detection tools, as well as a sugarcane disease expert model. By integrating an innovative tool selection module, we develop a multimodal agricultural agent capable of effectively performing tasks in classification, detection, and VQA. Furthermore, we introduce a multi‑dimensional quantitative evaluation framework and conduct a comprehensive assessment of the entire architecture over our evaluation dataset, thereby verifying the practicality and robustness of MA3 in agricultural scenarios. This study provides new insights and methodologies for the development of agricultural agents, holding significant theoretical and practical implications. Our source code and dataset will be made publicly available upon acceptance.

PaperID: 10971, https://arxiv.org/pdf/2504.04765.pdf

Authors: Huijie Li, Yide Yu, Si Shi, Anmin Hu, Jian Huo, Wei Lin, Chaoran Wu, Wuman Luo

Title: Multi-Agent Deep Reinforcement Learning for Multiple Anesthetics Collaborative Control

Abstract:
Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target‑controlled infusion (TCI) and closed‑loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, limiting personalization and collaborative control. To address these issues, we propose a novel framework, Value Decomposition Multi‑Agent Deep Reinforcement Learning (VD‑MADRL). VD‑MADRL optimizes the collaboration between two anesthetics propofol (Agent I) and remifentanil (Agent II). And It uses a Markov Game (MG) to identify optimal actions among heterogeneous agents. We employ various value function decomposition methods to resolve the credit allocation problem and enhance collaborative control. We also introduce a multivariate environment model based on random forest (RF) for anesthesia state simulation. Additionally, a data resampling and alignment technique ensures synchronized trajectory data. Our experiments on general and thoracic surgery datasets show that VD‑MADRL performs better than human experience. It improves dose precision and keeps anesthesia states stable, providing great clinical value.

PaperID: 10972, https://arxiv.org/pdf/2504.04736.pdf

Authors: Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, Christopher D. Manning

Title: Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

Abstract:
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single‑step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi‑step optimization scenarios. This approach, called Step‑Wise Reinforcement Learning (SWiRL), iteratively generates multi‑step reasoning and tool use data, and then learns from that data. It employs a simple step‑wise decomposition that breaks each multi‑step trajectory into multiple sub‑trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub‑trajectories. We evaluated SWiRL on a number of multi‑step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question‑answering) improves zero‑shot performance on GSM8K (a math dataset) by a relative 16.9%.

PaperID: 10973, https://arxiv.org/pdf/2504.04698.pdf

Authors: Yuren Mao, Yu Mi, Peigen Liu, Mengfei Zhang, Hanqing Liu, Yunjun Gao

Title: scAgent: Universal Single-Cell Annotation via a LLM Agent

Abstract:
Cell type annotation is critical for understanding cellular heterogeneity. Based on single‑cell RNA‑seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and extend to novel cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it is data efficient to learn novel cell types. Experimental studies in 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell‑type annotation, novel cell discovery, and extensibility to novel cell type.

PaperID: 10974, https://arxiv.org/pdf/2504.04691.pdf

Authors: Songyang Liu, Muyang Fan, Weizi Li, Jing Du, Shuai Li

Title: Large-Scale Mixed-Traffic and Intersection Control using Multi-agent Reinforcement Learning

Abstract:
Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small‑scale networks or isolated intersections, leaving large‑scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi‑agent reinforcement learning for large‑scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles. Evaluating a real‑world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At 80% RV penetration rate, our method reduces waiting time from 6.17s to 5.09s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning‑based control large‑scale traffic can improve overall efficiency and may inform future urban planning strategies.

PaperID: 10975, https://arxiv.org/pdf/2504.04675.pdf

Authors: Tzu-Han Hsu, Arshia Rafieioskouei, Borzoo Bonakdarpour

Title: HypRL: Reinforcement Learning of Control Policies for Hyperproperties

Abstract:
Reward shaping in multi‑agent reinforcement learning (MARL) for complex tasks remains a significant challenge. Existing approaches often fail to find optimal solutions or cannot efficiently handle such tasks. We propose HYPRL, a specification‑guided reinforcement learning framework that learns control policies w.r.t. hyperproperties expressed in HyperLTL. Hyperproperties constitute a powerful formalism for specifying objectives and constraints over sets of execution traces across agents. To learn policies that maximize the satisfaction of a HyperLTL formula ϕ, we apply Skolemization to manage quantifier alternations and define quantitative robustness functions to shape rewards over execution traces of a Markov decision process with unknown transitions. A suitable RL algorithm is then used to learn policies that collectively maximize the expected reward and, consequently, increase the probability of satisfying ϕ. We evaluate HYPRL on a diverse set of benchmarks, including safety‑aware planning, Deep Sea Treasure, and the Post Correspondence Problem. We also compare with specification‑driven baselines to demonstrate the effectiveness and efficiency of HYPRL.

PaperID: 10976, https://arxiv.org/pdf/2504.04650.pdf

Authors: Zihao Wu

Title: Autono: A ReAct-Based Highly Robust Autonomous Agent Framework

Abstract:
This paper proposes a highly robust autonomous agent framework based on the ReAct paradigm, designed to solve complex tasks through adaptive decision making and multi‑agent collaboration. Unlike traditional frameworks that rely on fixed workflows generated by LLM‑based planners, this framework dynamically generates next actions during agent execution based on prior trajectories, thereby enhancing its robustness. To address potential termination issues caused by adaptive execution paths, I propose a timely abandonment strategy incorporating a probabilistic penalty mechanism. For multi‑agent collaboration, I introduce a memory transfer mechanism that enables shared and dynamically updated memory among agents. The framework's innovative timely abandonment strategy dynamically adjusts the probability of task abandonment via probabilistic penalties, allowing developers to balance conservative and exploratory tendencies in agent execution strategies by tuning hyperparameters. This significantly improves adaptability and task execution efficiency in complex environments. Additionally, agents can be extended through external tool integration, supported by modular design and MCP protocol compatibility, which enables flexible action space expansion. Through explicit division of labor, the multi‑agent collaboration mechanism enables agents to focus on specific task components, thereby significantly improving execution efficiency and quality.

PaperID: 10977, https://arxiv.org/pdf/2504.04618.pdf

Authors: Viet-Anh Le, Andreas A. Malikopoulos

Title: Distributed Mixed-Integer Quadratic Programming for Mixed-Traffic Intersection Control

Abstract:
In this paper, we present a distributed algorithm utilizing the proximal alternating direction method of multipliers (ADMM) in conjunction with sequential constraint tightening to address mixed‑integer quadratic programming (MIQP) problems associated with traffic light systems and connected automated vehicles (CAVs) in mixed‑traffic intersections. We formulate a comprehensive MIQP model aimed at optimizing the coordination of traffic light systems and CAVs, thereby fully capitalizing on the advantages of CAV integration under conditions of high penetration rates. To effectively approximate the intricate multi‑agent MIQP challenges, we develop a distributed algorithm that employs proximal ADMM for solving the convex relaxation of the MIQP while systematically tightening the constraint coefficients to uphold integrality requirements. The performance of our control framework and the efficacy of the distributed algorithm are rigorously validated through a series of simulations conducted across varying penetration rates and traffic volumes.

PaperID: 10978, https://arxiv.org/pdf/2504.04615.pdf

Authors: Eleftherios E. Vlahakis, Lars Lindemann, Dimos V. Dimarogonas

Title: Conformal Data-driven Control of Stochastic Multi-Agent Systems under Collaborative Signal Temporal Logic Specifications

Abstract:
We address control synthesis of stochastic discrete‑time linear multi‑agent systems under jointly chance‑constrained collaborative signal temporal logic specifications in a distribution‑free manner using available disturbance samples, which are partitioned into training and calibration sets. Leveraging linearity, we decompose each agent's system into deterministic nominal and stochastic error parts, and design disturbance feedback controllers to bound the stochastic errors by solving a tractable optimization problem over the training data. We then quantify prediction regions (PRs) for the aggregate error trajectories corresponding to agent cliques, involved in collaborative tasks, using conformal prediction and calibration data. This enables us to address the specified joint chance constraint via Lipschitz tightening and the computed PRs, and relax the centralized stochastic optimal control problem to a deterministic one, whose solution provides the feedforward inputs. To enhance scalability, we decompose the deterministic problem into agent‑level subproblems solved in an MPC fashion, yielding a distributed control policy. Finally, we present an illustrative example and a comparison with [1].

PaperID: 10979, https://arxiv.org/pdf/2504.04608.pdf

Authors: Fernando Rosas, Alexander Boyd, Manuel Baltieri

Title: AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Abstract:
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic `brain in a vat' thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade‑off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade‑off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.

PaperID: 10980, https://arxiv.org/pdf/2504.04555.pdf

Authors: Ali Hamedi, Amirali Ghaedi, Amin Soltanbeigi, Athena Abdi

Title: SchEdge: A Dynamic, Multi-agent, and Scalable Scheduling Simulator for IoT Edge

Abstract:
This paper presents a dynamic, adaptive, and scalable framework for simulating task scheduling on the edge of the Internet of Things called "SchEdge". This simulator is designed to be highly configurable to reflect the detailed characteristics of real‑world IoT. This framework focuses on online task scheduling and its multi‑agent nature provides multiple schedulers to implement various scheduling schemes in parallel. SchEdge consists of two main parts the workflow and data flow. The workflow manages the schedulers' interaction with the application and environment while the data flow deals with the input application and its preprocessing. Combining these sections provides scalability, adaptability, and efficiency in the SchEdge. To validate the efficiency of this simulator, several experiments categorized as behavioral and technical analysis are performed to show its efficiency, scalability, and robustness.

PaperID: 10981, https://arxiv.org/pdf/2504.04485.pdf

Authors: Yapeng Mi, Zhi Gao, Xiaojian Ma, Qing Li

Title: Building LLM Agents by Incorporating Insights from Computer Systems

Abstract:
LLM‑driven autonomous agents have emerged as a promising direction in recent years. However, many of these LLM agents are designed empirically or based on intuition, often lacking systematic design principles, which results in diverse agent structures with limited generality and scalability. In this paper, we advocate for building LLM agents by incorporating insights from computer systems. Inspired by the von Neumann architecture, we propose a structured framework for LLM agentic systems, emphasizing modular design and universal principles. Specifically, this paper first provides a comprehensive review of LLM agents from the computer system perspective, then identifies key challenges and future directions inspired by computer system design, and finally explores the learning mechanisms for LLM agents beyond the computer system. The insights gained from this comparative analysis offer a foundation for systematic LLM agent design and advancement.

PaperID: 10982, https://arxiv.org/pdf/2504.04438.pdf

Authors: Wang Zhang, Chenguang Liu, Yue Pi, Yong Zhang, Hairong Huang, Baoquan Rao, Yulong Ding, Shuanghua Yang, Jie Jiang

Title: DRAMA: A Dynamic Packet Routing Algorithm using Multi-Agent Reinforcement Learning with Emergent Communication

Abstract:
The continuous expansion of network data presents a pressing challenge for conventional routing algorithms. As the demand escalates, these algorithms are struggling to cope. In this context, reinforcement learning (RL) and multi‑agent reinforcement learning (MARL) algorithms emerge as promising solutions. However, the urgency and importance of the problem are clear, as existing RL/MARL‑based routing approaches lack effective communication in run time among routers, making it challenging for individual routers to adapt to complex and dynamic changing networks. More importantly, they lack the ability to deal with dynamically changing network topology, especially the addition of the router, due to the non‑scalability of their neural networks. This paper proposes a novel dynamic routing algorithm, DRAMA, incorporating emergent communication in multi‑agent reinforcement learning. Through emergent communication, routers could learn how to communicate effectively to maximize the optimization objectives. Meanwhile, a new Q‑network and graph‑based emergent communication are introduced to dynamically adapt to the changing network topology without retraining while ensuring robust performance. Experimental results showcase DRAMA's superior performance over the traditional routing algorithm and other RL/MARL‑based algorithms, achieving a higher delivery rate and lower latency in diverse network scenarios, including dynamic network load and topology. Moreover, an ablation experiment validates the prospect of emergent communication in facilitating packet routing.

PaperID: 10983, https://arxiv.org/pdf/2504.04395.pdf

Authors: Jake Grigsby, Yuqi Xie, Justin Sasek, Steven Zheng, Yuke Zhu

Title: Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers

Abstract:
Competitive Pokémon Singles (CPS) is a popular strategy game where players learn to exploit their opponent based on imperfect information in battles that can last more than one hundred stochastic turns. AI research in CPS has been led by heuristic tree search and online self‑play, but the game may also create a platform to study adaptive policies trained offline on large datasets. We develop a pipeline to reconstruct the first‑person perspective of an agent from logs saved from the third‑person perspective of a spectator, thereby unlocking a dataset of real human battles spanning more than a decade that grows larger every day. This dataset enables a black‑box approach where we train large sequence models to adapt to their opponent based solely on their input trajectory while selecting moves without explicit search of any kind. We study a progression from imitation learning to offline RL and offline fine‑tuning on self‑play data in the hardcore competitive setting of Pokémon's four oldest (and most partially observed) game generations. The resulting agents outperform a recent LLM Agent approach and a strong heuristic search engine. While playing anonymously in online battles against humans, our best agents climb to rankings inside the top 10% of active players. All agent checkpoints, training details, datasets, and baselines are available at https://metamon.tech.

PaperID: 10984, https://arxiv.org/pdf/2504.04365.pdf

Authors: Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel

Title: AutoPDL: Automatic Prompt Optimization for LLM Agents

Abstract:
The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high‑level prompting pattern (e.g., Zero‑Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few‑shot demonstrations). Manually tuning this combination is tedious, error‑prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non‑agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human‑readable, editable, and executable PDL programs that use this library. This approach also enables source‑to‑source optimization, allowing human‑in‑the‑loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains (9.21\pm15.46 percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.

PaperID: 10985, https://arxiv.org/pdf/2504.04348.pdf

Authors: Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez

Title: OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

Abstract:
The advances in vision‑language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real‑world applications. To address this challenge, we propose OmniDrive, a holistic vision‑language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision‑making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual‑based synthetic data annotation process generates large‑scale, high‑quality datasets, providing denser supervision signals that bridge planning trajectories and language‑based reasoning. Futher, we explore two advanced OmniDrive‑Agent frameworks, namely Omni‑L and Omni‑Q, to assess the importance of vision‑language alignment versus 3D perception, revealing critical insights into designing effective LLM‑agents. Significant improvements on the DriveLM Q\&A benchmark and nuScenes open‑loop planning demonstrate the effectiveness of our dataset and methods.

PaperID: 10986, https://arxiv.org/pdf/2504.04347.pdf

Authors: Federico M. Zegers, Sean Phillips

Title: ChronoSync: A Decentralized Chronometer Synchronization Protocol for Multi-Agent Systems

Abstract:
This work presents a decentralized time synchronization algorithm for multi‑agent systems. Each agent possesses two clocks, a hardware clock that is perturbed by environmental phenomena (e.g., temperature, humidity, pressure, g forces, etc.) and a steerable software clock that inherits the perturbations affecting the hardware clock. Under these disturbances and the independent time kept by the hardware clocks, our consensus‑based controller enables all agents to steer their software‑defined clocks into practical synchronization while achieving a common user‑defined clock drift. Furthermore, we treat the drift of each hardware clock as an unknown parameter, which our algorithm can accurately estimate. The coupling of the agents is modeled by a connected, undirected, and static graph. However, each agent possesses a timer mechanism that determines when to broadcast a sample of its software time and update its own software‑time estimate. Hence, communication between agents can be directed, intermittent, and asynchronous. The closed‑loop dynamics of the ensemble is modeled using a hybrid system, where a Lyapunov‑based stability analysis demonstrates that a set encoding the time synchronization and clock drift estimation objectives is globally practically exponentially stable. The performance suggested by the theoretical development is confirmed in simulation.

PaperID: 10987, https://arxiv.org/pdf/2504.04220.pdf

Authors: Yueheng Zhu, Chao Liu, Xuan He, Xiaoxue Ren, Zhongxin Liu, Ruwei Pan, Hongyu Zhang

Title: AdaCoder: An Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation

Abstract:
Recently, researchers have proposed many multi‑agent frameworks for function‑level code generation, which aim to improve software development productivity by automatically generating function‑level source code based on task descriptions. A typical multi‑agent framework consists of Large Language Model (LLM)‑based agents that are responsible for task planning, code generation, testing, debugging, etc. Studies have shown that existing multi‑agent code generation frameworks perform well on ChatGPT. However, their generalizability across other foundation LLMs remains unexplored systematically. In this paper, we report an empirical study on the generalizability of four state‑of‑the‑art multi‑agent code generation frameworks across six open‑source LLMs with varying parameter sizes, architectures, and performance levels. Our study reveals the unstable generalizability of existing frameworks on diverse foundation LLMs. Based on the findings obtained from the empirical study, we propose AdaCoder, a novel adaptive planning, multi‑agent framework for function‑level code generation. AdaCoder has two phases. Phase‑1 is an initial code generation step without planning, which uses an LLM‑based coding agent and a script‑based testing agent to unleash LLM's native power, identify cases beyond LLM's power, and determine the errors hindering execution. Phase‑2 adds a rule‑based debugging agent and an LLM‑based planning agent for iterative code generation with planning. Our evaluation shows that AdaCoder achieves higher generalizability on diverse LLMs. Compared to the best baseline MapCoder, AdaCoder is on average 27.69% higher in Pass@1, 16 times faster in inference, and 12 times lower in token consumption.

PaperID: 10988, https://arxiv.org/pdf/2504.04187.pdf

Authors: Chuadhry Mujeeb Ahmed

Title: AttackLLM: LLM-based Attack Pattern Generation for an Industrial Control System

Abstract:
Malicious examples are crucial for evaluating the robustness of machine learning algorithms under attack, particularly in Industrial Control Systems (ICS). However, collecting normal and attack data in ICS environments is challenging due to the scarcity of testbeds and the high cost of human expertise. Existing datasets are often limited by the domain expertise of practitioners, making the process costly and inefficient. The lack of comprehensive attack pattern data poses a significant problem for developing robust anomaly detection methods. In this paper, we propose a novel approach that combines data‑centric and design‑centric methodologies to generate attack patterns using large language models (LLMs). Our results demonstrate that the attack patterns generated by LLMs not only surpass the quality and quantity of those created by human experts but also offer a scalable solution that does not rely on expensive testbeds or pre‑existing attack examples. This multi‑agent based approach presents a promising avenue for enhancing the security and resilience of ICS environments.

PaperID: 10989, https://arxiv.org/pdf/2504.04160.pdf

Authors: Alexandre Oliveira, Katarina Dyreby, Francisco Caldas, Cláudia Soares

Title: OrbitZoo: Real Orbital Systems Challenges for Reinforcement Learning

Abstract:
The increasing number of satellites and orbital debris has made space congestion a critical issue, threatening satellite safety and sustainability. Challenges such as collision avoidance, station‑keeping, and orbital maneuvering require advanced techniques to handle dynamic uncertainties and multi‑agent interactions. Reinforcement learning (RL) has shown promise in this domain, enabling adaptive, autonomous policies for space operations; however, many existing RL frameworks rely on custom‑built environments developed from scratch, which often use simplified models and require significant time to implement and validate the orbital dynamics, limiting their ability to fully capture real‑world complexities. To address this, we introduce OrbitZoo, a versatile multi‑agent RL environment built on a high‑fidelity industry standard library, that enables realistic data generation, supports scenarios like collision avoidance and cooperative maneuvers, and ensures robust and accurate orbital dynamics. The environment is validated against a real satellite constellation, Starlink, achieving a Mean Absolute Percentage Error (MAPE) of 0.16% compared to real‑world data. This validation ensures reliability for generating high‑fidelity simulations and enabling autonomous and independent satellite operations.

PaperID: 10990, https://arxiv.org/pdf/2504.04072.pdf

Authors: Satvik Golechha, Adrià Garriga-Alonso

Title: Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Abstract:
Prior studies on deception in language‑based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open‑ended deceptive behavior to emerge in pursuit of a longer‑term goal. To fix this, we introduce Among Us, a sandbox social deception game where LLM‑agents exhibit long‑term, open‑ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi‑player game far from equilibrium. Using the sandbox, we evaluate 18 proprietary and open‑weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of "pretend you're a dishonest model:.." generalize extremely well out‑of‑distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open‑sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language‑based agents.

PaperID: 10991, https://arxiv.org/pdf/2504.04070.pdf

Authors: Sagar Tamang, Dibya Jyoti Bora

Title: Enforcement Agents: Enhancing Accountability and Resilience in Multi-Agent AI Frameworks

Abstract:
As autonomous agents become more powerful and widely used, it is becoming increasingly important to ensure they behave safely and stay aligned with system goals, especially in multi‑agent settings. Current systems often rely on agents self‑monitoring or correcting issues after the fact, but they lack mechanisms for real‑time oversight. This paper introduces the Enforcement Agent (EA) Framework, which embeds dedicated supervisory agents into the environment to monitor others, detect misbehavior, and intervene through real‑time correction. We implement this framework in a custom drone simulation and evaluate it across 90 episodes using 0, 1, and 2 EA configurations. Results show that adding EAs significantly improves system safety: success rates rise from 0.0% with no EA to 7.4% with one EA and 26.7% with two EAs. The system also demonstrates increased operational longevity and higher rates of malicious drone reformation. These findings highlight the potential of lightweight, real‑time supervision for enhancing alignment and resilience in multi‑agent systems.

PaperID: 10992, https://arxiv.org/pdf/2504.03991.pdf

Authors: Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

Title: Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Abstract:
Understanding how humans collaborate and communicate in teams is essential for improving human‑agent teaming and AI‑assisted decision‑making. However, relying solely on data from large‑scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human‑like behavior in social settings. But, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM‑powered agents to iteratively search for prompts that generate diverse team behavior in a long‑horizon, multi‑step collaborative environment. We first show, through a human‑subjects experiment (n=54 participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM‑powered agents as an effective tool for studying teaming and communication strategies in multi‑agent collaboration.

PaperID: 10993, https://arxiv.org/pdf/2504.03699.pdf

Authors: Ying-Jung Chen, Ahmad Albarqawi, Chi-Sheng Chen

Title: Enhancing Clinical Decision-Making: Integrating Multi-Agent Systems with Ethical AI Governance

Abstract:
Recent advances in the data‑driven medicine approach, which integrates ethically managed and explainable artificial intelligence into clinical decision support systems (CDSS), are critical to ensure reliable and effective patient care. This paper focuses on comparing novel agent system designs that use modular agents to analyze laboratory results, vital signs, and clinical context, and to predict and validate results. We implement our agent system with the eICU database, including running lab analysis, vitals‑only interpreters, and contextual reasoners agents first, then sharing the memory into the integration agent, prediction agent, transparency agent, and a validation agent. Our results suggest that the multi‑agent system (MAS) performed better than the single‑agent system (SAS) with mortality prediction accuracy (59%, 56%) and the mean error for length of stay (LOS)(4.37 days, 5.82 days), respectively. However, the transparency score for the SAS (86.21) is slightly better than the transparency score for MAS (85.5). Finally, this study suggests that our agent‑based framework not only improves process transparency and prediction accuracy but also strengthens trustworthy AI‑assisted decision support in an intensive care setting.

PaperID: 10994, https://arxiv.org/pdf/2504.03591.pdf

Authors: Francesco Di Cosmo, Soumodev Mal, Tephilla Prince

Title: A Lower Bound on Conservative Elementary Object Systems Coverability

Abstract:
Elementary Object Systems (EOS) are a form of Petri Net (PN) where tokens carry internal PN. This model has been recently proposed for analysis of robustness of Multi Agent Systems. While EOS reachability is known to be undecidable, the decidability of coverability of its conservative fragment (where the type of internal PN cannot be completely deleted and, thus, is conserved) was proved a decade ago, no study charted its complexity. Here, we take a first step in this direction, by showing how to encode νPNs, a well studied form of PN enriched with data, into conservative EOS (cEOS). This yields a non‑Primitive Recursive, F_\omega2 lower‑bound on cEOS coverability.

PaperID: 10995, https://arxiv.org/pdf/2504.03353.pdf

Authors: Kentaro Nomura, Tatsuya Aoki, Tadahiro Taniguchi, Takato Horii

Title: Decentralized Collective World Model for Emergent Communication and Coordination

Abstract:
We propose a fully decentralized multi‑agent world model that enables both symbol emergence for communication and coordinated behavior through temporal extension of collective predictive coding. Unlike previous research that focuses on either communication or coordination separately, our approach achieves both simultaneously. Our method integrates world models with communication channels, enabling agents to predict environmental dynamics, estimate states from partial observations, and share critical information through bidirectional message exchange with contrastive learning for message alignment. Using a two‑agent trajectory drawing task, we demonstrate that our communication‑based approach outperforms non‑communicative models when agents have divergent perceptual capabilities, achieving the second‑best coordination after centralized models. Importantly, our decentralized approach with constraints preventing direct access to other agents' internal states facilitates the emergence of more meaningful symbol systems that accurately reflect environmental states. These findings demonstrate the effectiveness of decentralized communication for supporting coordination while developing shared representations of the environment.

PaperID: 10996, https://arxiv.org/pdf/2504.03271.pdf

Authors: Sebastian Gasche, Christian Kallies, Andreas Himmel, Rolf Findeisen

Title: Energy Aware and Safe Path Planning for Unmanned Aircraft Systems

Abstract:
This paper proposes a path planning algorithm for multi‑agent unmanned aircraft systems (UASs) to autonomously cover a search area, while considering obstacle avoidance, as well as the capabilities and energy consumption of the employed unmanned aerial vehicles. The path planning is optimized in terms of energy efficiency to prefer low energy‑consuming maneuvers. In scenarios where a UAS is low on energy, it autonomously returns to its initial position for a safe landing, thus preventing potential battery damage. To accomplish this, an energy‑aware multicopter model is integrated into a path planning algorithm based on model predictive control and mixed integer linear programming. Besides factoring in energy consumption, the planning is improved by dynamically defining feasible regions for each UAS to prevent obstacle corner‑cutting or over‑jumping.

PaperID: 10997, https://arxiv.org/pdf/2504.03255.pdf

Authors: Garry A. Gabison, R. Patrick Xian

Title: Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

Abstract:
Agentic systems powered by large language models (LLMs) are becoming progressively more complex and capable. Their increasing agency and expanding deployment settings attract growing attention to effective governance policies, monitoring, and control protocols. Based on the emerging landscape of the agentic market, we analyze potential liability issues arising from the delegated use of LLM agents and their extended systems through a principal‑agent perspective. Our analysis complements existing risk‑based studies on artificial agency and covers the spectrum of important aspects of the principal‑agent relationship and their potential consequences at deployment. Furthermore, we motivate method developments for technical governance along the directions of interpretability and behavior evaluations, reward and conflict management, and the mitigation of misalignment and misconduct through principled engineering of detection and fail‑safe mechanisms. By illustrating the outstanding issues in AI liability for LLM‑based agentic systems, we aim to inform the system design, auditing, and tracing to enhance transparency and liability attribution.

PaperID: 10998, https://arxiv.org/pdf/2504.03206.pdf

Authors: Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques

Title: Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Abstract:
Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods like Reinforcement Learning from Human Feedback (RLHF), often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context‑limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity‑based intrinsic reward into multi‑turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model's accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method's effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting. We show improved generalization capabilities compared to traditional multi‑turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.

PaperID: 10999, https://arxiv.org/pdf/2504.03111.pdf

Authors: Zichuan Li, Jian Cui, Xiaojing Liao, Luyi Xing

Title: Les Dissonances: Cross-Tool Harvesting and Polluting in Pool-of-Tools Empowered LLM Agents

Abstract:
Large Language Model (LLM) agents are autonomous systems powered by LLMs, capable of reasoning and planning to solve problems by leveraging a set of tools. However, the integration of multi‑tool capabilities in LLM agents introduces challenges in securely managing tools, ensuring their compatibility, handling dependency relationships, and protecting control flows within LLM agent workflows. In this paper, we present the first systematic security analysis of task control flows in multi‑tool‑enabled LLM agents. We identify a novel threat, Cross‑Tool Harvesting and Polluting (XTHP), which includes multiple attack vectors to first hijack the normal control flows of agent tasks, and then collect and pollute confidential or private information within LLM agent systems. To understand the impact of this threat, we developed Chord, a dynamic scanning tool designed to automatically detect real‑world agent tools susceptible to XTHP attacks. Our evaluation of 66 real‑world tools from the repositories of two major LLM agent development frameworks, LangChain and LlamaIndex, revealed a significant security concern: 75% are vulnerable to XTHP attacks, highlighting the prevalence of this threat.

PaperID: 11000, https://arxiv.org/pdf/2504.02964.pdf

Authors: Yiqi Zhao, Emily Zhu, Bardh Hoxha, Georgios Fainekos, Jyotirmoy V. Deshmukh, Lars Lindemann

Title: Distributionally Robust Predictive Runtime Verification under Spatio-Temporal Logic Specifications

Abstract:
Cyber‑physical systems (CPS) designed in simulators, often consisting of multiple interacting agents (e.g. in multi‑agent formations), behave differently in the real‑world. We want to verify these systems during runtime when they are deployed. We thus propose robust predictive runtime verification (RPRV) algorithms for: (1) general stochastic CPS under signal temporal logic (STL) tasks, and (2) stochastic multi‑agent systems (MAS) under spatio‑temporal logic tasks. The RPRV problem presents the following challenges: (1) there may not be sufficient data on the behavior of the deployed CPS, (2) predictive models based on design phase system trajectories may encounter distribution shift during real‑world deployment, and (3) the algorithms need to scale to the complexity of MAS and be applicable to spatio‑temporal logic tasks. To address the challenges, we assume knowledge of an upper bound on the statistical distance between the trajectory distributions of the system at deployment and design time. We are motivated by our prior work [1, 2] where we proposed an accurate and an interpretable RPRV algorithm for general CPS, which we here extend to the MAS setting and spatio‑temporal logic tasks. Specifically, we use a learned predictive model to estimate the system behavior at runtime and robust conformal prediction to obtain probabilistic guarantees by accounting for distribution shifts. Building on [1], we perform robust conformal prediction over the robust semantics of spatio‑temporal reach and escape logic (STREL) to obtain centralized RPRV algorithms for MAS. We empirically validate our results in a drone swarm simulator, where we show the scalability of our RPRV algorithms to MAS and analyze the impact of different trajectory predictors on the verification result. To the best of our knowledge, these are the first statistically valid algorithms for MAS under distribution shift.

PaperID: 11001, https://arxiv.org/pdf/2504.02891.pdf

Authors: Kurmanbek Kaiyrbekov, Nicholas J Dobbins, Sean D Mooney

Title: Automated Survey Collection with LLM-based Conversational Agents

Abstract:
Objective: Traditional phone‑based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data, however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end‑to‑end survey collection framework driven by conversational Large Language Models (LLMs). Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT‑4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants consisting of 5 native and 3 non‑native english speakers and administered 40 surveys. We evaluated the correctness of LLM‑generated conversation transcripts, accuracy of survey responses inferred by GPT‑4o and overall participant experience. Results: Survey responses were successfully extracted by GPT‑4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per‑line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction. Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real‑world, end‑to‑end AI‑powered phone survey collection systems.

PaperID: 11002, https://arxiv.org/pdf/2504.02870.pdf

Authors: Frank P. -W. Lo, Jianing Qiu, Zeyu Wang, Haibao Yu, Yeming Chen, Gao Zhang, Benny Lo

Title: AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening

Abstract:
Resume screening is a critical yet time‑intensive process in talent acquisition, requiring recruiters to analyze vast volume of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases demonstrate new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi‑agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents, including a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval‑Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry‑specific expertise, professional certifications, university rankings, and company‑specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI‑generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi‑agent RAG‑LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.

PaperID: 11003, https://arxiv.org/pdf/2504.02867.pdf

Authors: Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas

Title: Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

Abstract:
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain‑specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM‑based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open‑ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision‑making processes for evaluation purposes known as LLM‑as‑a‑judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi‑agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade‑off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi‑agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.

PaperID: 11004, https://arxiv.org/pdf/2504.02833.pdf

Authors: Sangwoo Park, Stefan Vlaski, Lajos Hanzo

Title: Scalable Min-Max Optimization via Primal-Dual Exact Pareto Optimization

Abstract:
In multi‑objective optimization, minimizing the worst objective can be preferable to minimizing the average objective, as this ensures improved fairness across objectives. Due to the non‑smooth nature of the resultant min‑max optimization problem, classical subgradient‑based approaches typically exhibit slow convergence. Motivated by primal‑dual consensus techniques in multi‑agent optimization and learning, we formulate a smooth variant of the min‑max problem based on the augmented Lagrangian. The resultant Exact Pareto Optimization via Augmented Lagrangian (EPO‑AL) algorithm scales better with the number of objectives than subgradient‑based strategies, while exhibiting lower per‑iteration complexity than recent smoothing‑based counterparts. We establish that every fixed‑point of the proposed algorithm is both Pareto and min‑max optimal under mild assumptions and demonstrate its effectiveness in numerical simulations.

PaperID: 11005, https://arxiv.org/pdf/2504.02743.pdf

Authors: Aneesh Raghavan, M. Umar B. Niazi, Karl H. Johansson

Title: Sequential Binary Hypothesis Testing with Competing Agents under Information Asymmetry

Abstract:
This paper concerns sequential hypothesis testing in competitive multi‑agent systems where agents exchange potentially manipulated information. Specifically, a two‑agent scenario is studied where each agent aims to correctly infer the true state of nature while optimizing decision speed and accuracy. At each iteration, agents collect private observations, update their beliefs, and share (possibly corrupted) belief signals with their counterparts before deciding whether to stop and declare a state, or continue gathering more information. The analysis yields three main results: (1)~when agents share information strategically, the optimal signaling policy involves equal‑probability randomization between truthful and inverted beliefs; (2)~agents maximize performance by relying solely on their own observations for belief updating while using received information only to anticipate their counterpart's stopping decision; and (3)~the agent reaching their confidence threshold first cause the other agent to achieve a higher conditional probability of error. Numerical simulations further demonstrate that agents with higher KL divergence in their conditional distributions gain competitive advantage. Furthermore, our results establish that information sharing ‑‑ despite strategic manipulation ‑‑ reduces overall system stopping time compared to non‑interactive scenarios, which highlights the inherent value of communication even in this competitive setup.

PaperID: 11006, https://arxiv.org/pdf/2504.02679.pdf

Authors: Francesco Bianchin, Robert Lefringhausen, Elisa Gaetan, Samuel Tesfazgi, Sandra Hirche

Title: A Set-Theoretic Robust Control Approach for Linear Quadratic Games with Unknown Counterparts

Abstract:
Ensuring robust decision‑making in multi‑agent systems is challenging when agents have distinct, possibly conflicting objectives and lack full knowledge of each other's strategies. This is apparent in safety‑critical applications such as human‑robot interaction and assisted driving, where uncertainty arises not only from unknown adversary strategies but also from external disturbances. To address this, the paper proposes a robust adaptive control approach based on linear quadratic differential games. Our method allows a controlled agent to iteratively refine its belief about the adversary strategy and disturbances using a set‑membership approach, while simultaneously adapting its policy to guarantee robustness against the uncertain adversary policy and improve performance over time. We formally derive theoretical guarantees on the robustness of the proposed control scheme and its convergence to ε‑Nash strategies. The effectiveness of our approach is demonstrated in a numerical simulation.

PaperID: 11007, https://arxiv.org/pdf/2504.02623.pdf

Authors: Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang

Title: Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions

Abstract:
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM‑based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly access agents in single‑mission scenarios, failing to capture real‑world complexity. To bridge this gap, we propose the Multi‑Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission‑switching patterns within a fixed mission number. Specifically, we propose a multi‑agent data generation framework to construct the benchmark. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open‑source and closed‑source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool invocation society.

PaperID: 11008, https://arxiv.org/pdf/2504.02509.pdf

Authors: Yuhao Liu, Maolin Yang, Pingyu Jiang

Title: A Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders

Abstract:
With the rapid development of 3D printing, the demand for personalized and customized production on the manufacturing line is steadily increasing. Efficient merging of printing workpieces can significantly enhance the processing efficiency of the production line. Addressing the challenge, a Large Language Model (LLM)‑driven method is established in this paper for the autonomous merging of 3D printing work orders, integrated with a memory‑augmented learning strategy. In industrial scenarios, both device and order features are modeled into LLM‑readable natural language prompt templates, and develop an order‑device matching tool along with a merging interference checking module. By incorporating a self‑memory learning strategy, an intelligent agent for autonomous order merging is constructed, resulting in improved accuracy and precision in order allocation. The proposed method effectively leverages the strengths of LLMs in industrial applications while reducing hallucination.

PaperID: 11009, https://arxiv.org/pdf/2504.02479.pdf

Authors: Stefano Covone, Italo Napolitano, Francesco De Lellis, Mario di Bernardo

Title: Hierarchical Policy-Gradient Reinforcement Learning for Multi-Agent Shepherding Control of Non-Cohesive Targets

Abstract:
We propose a decentralized reinforcement learning solution for multi‑agent shepherding of non‑cohesive targets using policy‑gradient methods. Our architecture integrates target‑selection with target‑driving through Proximal Policy Optimization, overcoming discrete‑action constraints of previous Deep Q‑Network approaches and enabling smoother agent trajectories. This model‑free framework effectively solves the shepherding problem without prior dynamics knowledge. Experiments demonstrate our method's effectiveness and scalability with increased target numbers and limited sensing capabilities.

PaperID: 11010, https://arxiv.org/pdf/2504.02254.pdf

Authors: Seunghyun Yoo

Title: LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Abstract:
Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game "Connections", we systematically compare puzzles produced through zero‑shot prompting, role‑injected adversarial prompts, and human‑crafted examples, with an emphasis on understanding the underlying agent decision‑making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity ‑‑ thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.

PaperID: 11011, https://arxiv.org/pdf/2504.02181.pdf

Authors: Zihan Chen, Song Wang, Zhen Tan, Xingbo Fu, Zhenyu Lei, Peng Wang, Huan Liu, Cong Shen, Jundong Li

Title: A Survey of Scaling in Large Language Model Reasoning

Abstract:
The rapid advancements in large Language models (LLMs) have significantly enhanced their reasoning capabilities, driven by various strategies such as multi‑agent collaboration. However, unlike the well‑established performance improvements achieved through scaling data and model size, the scaling of reasoning in LLMs is more complex and can even negatively impact reasoning performance, introducing new challenges in model alignment and robustness. In this survey, we provide a comprehensive examination of scaling in LLM reasoning, categorizing it into multiple dimensions and analyzing how and to what extent different scaling strategies contribute to improving reasoning capabilities. We begin by exploring scaling in input size, which enables LLMs to process and utilize a more extensive context for improved reasoning. Next, we analyze scaling in reasoning steps that improve multi‑step inference and logical consistency. We then examine scaling in reasoning rounds, where iterative interactions refine reasoning outcomes. Furthermore, we discuss scaling in training‑enabled reasoning, focusing on optimization through iterative model improvement. Finally, we outline future directions for further advancing LLM reasoning. By synthesizing these diverse perspectives, this survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of next‑generation AI systems.

PaperID: 11012, https://arxiv.org/pdf/2504.02128.pdf

Authors: Apurba Pokharel, Ram Dantu, Shakila Zaman, Vinh Quach, Sirisha Talapuru

Title: Achieving Unanimous Consensus Through Multi-Agent Deliberation

Abstract:
Blockchain consensus mechanisms have relied on algorithms such as Proof‑of‑Work (PoW) and Proof‑of‑Stake (PoS) to ensure network functionality and integrity. However, these approaches struggle with adaptability for decision‑making where the opinions of each matter rather than reaching an agreement based on honest majority or weighted consensus. This paper introduces a novel deliberation‑based consensus mechanism where Large Language Models (LLMs) act as rational agents engaging in structured discussions to reach a unanimous consensus. By leveraging graded consensus and a multi‑round deliberation process, our approach ensures unanimous consensus for definitive problems and graded consensus for prioritized decision problems and policies. We provide a formalization of our system and use it to show that the properties of blockchains are maintained, while also addressing the behavior in terms of adversaries, stalled deliberations, and confidence in consensus. Moreover, experimental results demonstrate system feasibility, showcasing convergence, block properties, and accuracy, which enable deliberative decision‑making on blockchain networks.

PaperID: 11013, https://arxiv.org/pdf/2504.02088.pdf

Authors: Yichen Yao, Ryan Mbagna Nanko, Yue Wang, Xuan Wang

Title: Distributed Resource Allocation for Human-Autonomy Teaming under Coupled Constraints

Abstract:
This paper studies the optimal resource allocation problem within a multi‑agent network composed of both autonomous agents and humans. The main challenge lies in the globally coupled constraints that link the decisions of autonomous agents with those of humans. To address this, we propose a reformulation that transforms these coupled constraints into decoupled local constraints defined over the system's communication graph. Building on this reformulation and incorporating a human response model that captures human‑robot interactions while accounting for individual preferences and biases, we develop a fully distributed algorithm. This algorithm guides the states of the autonomous agents to equilibrium points which, when combined with the human responses, yield a globally optimal resource allocation. We provide both theoretical analysis and numerical simulations to validate the effectiveness of the proposed approach.

PaperID: 11014, https://arxiv.org/pdf/2504.02052.pdf

Authors: Yuetian Mao, Junjie He, Chunyang Chen

Title: From Prompts to Templates: A Systematic Prompt Template Analysis for Real-world LLMapps

Abstract:
Large Language Models (LLMs) have revolutionized human‑AI interaction by enabling intuitive task execution through natural language prompts. Despite their potential, designing effective prompts remains a significant challenge, as small variations in structure or wording can result in substantial differences in output. To address these challenges, LLM‑powered applications (LLMapps) rely on prompt templates to simplify interactions, enhance usability, and support specialized tasks such as document analysis, creative content generation, and code synthesis. However, current practices heavily depend on individual expertise and iterative trial‑and‑error processes, underscoring the need for systematic methods to optimize prompt template design in LLMapps. This paper presents a comprehensive analysis of prompt templates in practical LLMapps. We construct a dataset of real‑world templates from open‑source LLMapps, including those from leading companies like Uber and Microsoft. Through a combination of LLM‑driven analysis and human review, we categorize template components and placeholders, analyze their distributions, and identify frequent co‑occurrence patterns. Additionally, we evaluate the impact of identified patterns on LLMs' instruction‑following performance through sample testing. Our findings provide practical insights on prompt template design for developers, supporting the broader adoption and optimization of LLMapps in industrial settings.

PaperID: 11015, https://arxiv.org/pdf/2504.02051.pdf

Authors: Alfonso Amayuelas, Jingbo Yang, Saaket Agashe, Ashwin Nagarajan, Antonis Antoniades, Xin Eric Wang, William Wang

Title: Self-Resource Allocation in Multi-Agent LLM Systems

Abstract:
With the development of LLMs as agents, there is a growing interest in connecting multiple agents into multi‑agent systems to solve tasks concurrently, focusing on their role in task assignment and coordination. This paper explores how LLMs can effectively allocate computational tasks among multiple agents, considering factors such as cost, efficiency, and performance. In this work, we address key questions, including the effectiveness of LLMs as orchestrators and planners, comparing their effectiveness in task assignment and coordination. Our experiments demonstrate that LLMs can achieve high validity and accuracy in resource allocation tasks. We find that the planner method outperforms the orchestrator method in handling concurrent actions, resulting in improved efficiency and better utilization of agents. Additionally, we show that providing explicit information about worker capabilities enhances the allocation strategies of planners, particularly when dealing with suboptimal workers.

PaperID: 11016, https://arxiv.org/pdf/2504.02049.pdf

Authors: Tyler Hanks, Hans Riess, Samuel Cohen, Trevor Gross, Matthew Hale, James Fairbanks

Title: Distributed Multi-agent Coordination over Cellular Sheaves

Abstract:
Techniques for coordination of multi‑agent systems are vast and varied, often utilizing purpose‑built solvers or controllers with tight coupling to the types of systems involved or the coordination goal. In this paper, we introduce a general unified framework for heterogeneous multi‑agent coordination using the language of cellular sheaves and nonlinear sheaf Laplacians, which are generalizations of graphs and graph Laplacians. Specifically, we introduce the concept of a nonlinear homological program encompassing a choice of cellular sheaf on an undirected graph, nonlinear edge potential functions, and constrained convex node objectives, which constitutes a standard form for a wide class of coordination problems. We use the alternating direction method of multipliers to derive a distributed optimization algorithm for solving these nonlinear homological programs. To demonstrate the applicability of this framework, we show how heterogeneous coordination goals including combinations of consensus, formation, and flocking can be formulated as nonlinear homological programs and provide numerical simulations showing the efficacy of our distributed solution algorithm.

PaperID: 11017, https://arxiv.org/pdf/2504.01990.pdf

Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xinbing Liang, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu

Title: Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Abstract:
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This book provides a comprehensive overview, framing intelligent agents within modular, brain‑inspired architectures that integrate principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we systematically investigate the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities and elucidating core components such as memory, world modeling, reward processing, goal, and emotion. Second, we discuss self‑enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms. Third, we examine multi‑agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures. Finally, we address the critical imperative of building safe and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real‑world deployment. By synthesizing modular AI architectures with insights from different disciplines, this survey identifies key research challenges and opportunities, encouraging innovations that harmonize technological advancement with meaningful societal benefit.

PaperID: 11018, https://arxiv.org/pdf/2504.01963.pdf

Authors: R. M. Aratchige, W. M. K. S. Ilmini

Title: LLMs Working in Harmony: A Survey on the Technological Aspects of Building Effective LLM-Based Multi Agent Systems

Abstract:
This survey investigates foundational technologies essential for developing effective Large Language Model (LLM)‑based multi‑agent systems. Aiming to answer how best to optimize these systems for collaborative, dynamic environments, we focus on four critical areas: Architecture, Memory, Planning, and Technologies/Frameworks. By analyzing recent advancements and their limitations ‑ such as scalability, real‑time response challenges, and agent coordination constraints, we provide a detailed view of the technological landscape. Frameworks like the Mixture of Agents architecture and the ReAct planning model exemplify current innovations, showcasing improvements in role assignment and decision‑making. This review synthesizes key strengths and persistent challenges, offering practical recommendations to enhance system scalability, agent collaboration, and adaptability. Our findings provide a roadmap for future research, supporting the creation of robust, efficient multi‑agent systems that advance both individual agent performance and collective system resilience.

PaperID: 11019, https://arxiv.org/pdf/2504.01962.pdf

Authors: Chia-Tung Ho, Jing Gong, Yunsheng Bai, Chenhui Deng, Haoxing Ren, Brucek Khailany

Title: Marco: Configurable Graph-Based Task Solving and Multi-AI Agents Framework for Hardware Design

Abstract:
Hardware design presents numerous challenges stemming from its complexity and advancing technologies. These challenges result in longer turn‑around‑time (TAT) for optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops. Large Language Models (LLMs) have shown remarkable capacity to comprehend and generate natural language at a massive scale, leading to many potential applications and benefits across various domains. Successful LLM‑based agents for hardware design can drastically reduce TAT, leading to faster product cycles, lower costs, improved design reliability and reduced risk of costly errors. In this work, we propose a unified framework, Marco, that integrates configurable graph‑based task solving with multi‑modality and multi‑AI agents for chip design by leveraging the natural language and reasoning abilities with collaborative toolkits. Lastly, we demonstrate promising performance, productivity, and efficiency of LLM agents by leveraging the Marco framework on layout optimization, Verilog/design rule checker (DRC) coding, and timing analysis tasks.

PaperID: 11020, https://arxiv.org/pdf/2504.01924.pdf

Authors: Andreas Panayiotou, Panayiotis Charalambous, Ioannis Karamouzas

Title: Gen-C: Populating Virtual Worlds with Generative Crowds

Abstract:
Over the past two decades, researchers have made significant steps in simulating agent‑based human crowds, yet most efforts remain focused on low‑level tasks such as collision avoidance, path following, and flocking. As a result, these approaches often struggle to capture the high‑level behaviors that emerge from sustained agent‑agent and agent‑environment interactions over time. We introduce Generative Crowds (Gen‑C), a generative framework that produces crowd scenarios capturing agent‑agent and agent‑environment interactions, shaping coherent high‑level crowd plans. To avoid the labor‑intensive process of collecting and annotating real crowd video data, we leverage Large Language Models (LLMs) to bootstrap synthetic datasets of crowd scenarios. To represent those scenarios, we propose a time‑expanded graph structure encoding actions, interactions, and spatial context. Gen‑C employs a dual Variational Graph Autoencoder (VGAE) architecture that jointly learns connectivity patterns and node features conditioned on textual and structural signals, overcoming the limitations of direct LLM generation to enable scalable, environment‑aware multi‑agent crowd simulations. We demonstrate the effectiveness of our framework on scenarios with diverse behaviors such as a University Campus and a Train Station, showing that it generates heterogeneous crowds, coherent interactions, and high‑level decision‑making patterns consistent with the provided context.

PaperID: 11021, https://arxiv.org/pdf/2504.01911.pdf

Authors: Yinggan Xu, Hana Kimlee, Yijia Xiao, Di Luo

Title: Advancing AI-Scientist Understanding: Multi-Agent LLMs with Interpretable Physics Reasoning

Abstract:
Large Language Models (LLMs) are playing an increasingly important role in physics research by assisting with symbolic manipulation, numerical computation, and scientific reasoning. However, ensuring the reliability, transparency, and interpretability of their outputs remains a major challenge. In this work, we introduce a novel multi‑agent LLM physicist framework that fosters collaboration between AI and human scientists through three key modules: a reasoning module, an interpretation module, and an AI‑scientist interaction module. Recognizing that effective physics reasoning demands logical rigor, quantitative accuracy, and alignment with established theoretical models, we propose an interpretation module that employs a team of specialized LLM agents‑including summarizers, model builders, visualization tools, and testers‑to systematically structure LLM outputs into transparent, physically grounded science models. A case study demonstrates that our approach significantly improves interpretability, enables systematic validation, and enhances human‑AI collaboration in physics problem‑solving and discovery. Our work bridges free‑form LLM reasoning with interpretable, executable models for scientific analysis, enabling more transparent and verifiable AI‑augmented research.

PaperID: 11022, https://arxiv.org/pdf/2504.01773.pdf

Authors: Michal Feldman, Yoav Gal-Tzur, Tomasz Ponitka, Maya Schlesinger

Title: Budget-Feasible Contracts

Abstract:
The problem of computing near‑optimal contracts in combinatorial settings has recently attracted significant interest in the computer science community. Previous work has provided a rich body of structural and algorithmic insights into this problem. However, most of these results rely on the assumption that the principal has an unlimited budget for incentivizing agents, an assumption that is often unrealistic in practice. This motivates the study of the optimal contract problem under budget constraints. In this work, we study multi‑agent contracts with binary actions under budget constraints. Our contribution is threefold. First, we show that all previously known approximation guarantees on the principal's utility extend (asymptotically) to budgeted settings. Second, through the lens of budget constraints, we uncover insightful connections between the standard objective of maximizing the principal's utility and other relevant objectives. Specifically, we identify a broad class of objectives, which we term BEST (BEyond STandard) objectives, including reward, social welfare, and principal's utility, and show that they are all equivalent (up to a constant factor), leading to approximation guarantees for all BEST objectives. Third, we introduce the price of frugality, which quantifies the loss due to budget constraints, and establish near‑tight bounds on this measure, providing deeper insights into the tradeoffs between budgets and incentives.

PaperID: 11023, https://arxiv.org/pdf/2504.01637.pdf

Authors: Reo Abe, Akifumi Ito, Kanata Takayasu, Satoshi Kurihara

Title: LLM-mediated Dynamic Plan Generation with a Multi-Agent Approach

Abstract:
Planning methods with high adaptability to dynamic environments are crucial for the development of autonomous and versatile robots. We propose a method for leveraging a large language model (GPT‑4o) to automatically generate networks capable of adapting to dynamic environments. The proposed method collects environmental "status," representing conditions and goals, and uses them to generate agents. These agents are interconnected on the basis of specific conditions, resulting in networks that combine flexibility and generality. We conducted evaluation experiments to compare the networks automatically generated with the proposed method with manually constructed ones, confirming the comprehensiveness of the proposed method's networks and their higher generality. This research marks a significant advancement toward the development of versatile planning methods applicable to robotics, autonomous vehicles, smart systems, and other complex environments.

PaperID: 11024, https://arxiv.org/pdf/2504.01458.pdf

Authors: Jian Wang, Zhuo Zhao, Zeng Jie Wang, Bo Da Cheng, Lei Nie, Wen Luo, Zhao Yuan Yu, Ling Wang Yuan

Title: GeoRAG: A Question-Answering Approach from a Geographical Perspective

Abstract:
Geographic Question Answering (GeoQA) addresses natural language queries in geographic domains to fulfill complex user demands and improve information retrieval efficiency. Traditional QA systems, however, suffer from limited comprehension, low retrieval accuracy, weak interactivity, and inadequate handling of complex tasks, hindering precise information acquisition. This study presents GeoRAG, a knowledge‑enhanced QA framework integrating domain‑specific fine‑tuning and prompt engineering with Retrieval‑Augmented Generation (RAG) technology to enhance geographic knowledge retrieval accuracy and user interaction. The methodology involves four components: (1) A structured geographic knowledge base constructed from 3267 corpora (research papers, monographs, and technical reports), categorized via a multi‑agent approach into seven dimensions: semantic understanding, spatial location, geometric morphology, attribute characteristics, feature relationships, evolutionary processes, and operational mechanisms. This yielded 145234 classified entries and 875432 multi‑dimensional QA pairs. (2) A multi‑label text classifier based on BERT‑Base‑Chinese, trained to analyze query types through geographic dimension classification. (3) A retrieval evaluator leveraging QA pair data to assess query‑document relevance, optimizing retrieval precision. (4) GeoPrompt templates engineered to dynamically integrate user queries with retrieved information, enhancing response quality through dimension‑specific prompting. Comparative experiments demonstrate GeoRAG's superior performance over conventional RAG across multiple base models, validating its generalizability. This work advances geographic AI by proposing a novel paradigm for deploying large language models in domain‑specific contexts, with implications for improving GeoQA systems scalability and accuracy in real‑world applications.

PaperID: 11025, https://arxiv.org/pdf/2504.01423.pdf

Authors: Yichen Sun, Chenggang Cui, Chuanlin Zhang, Chunyang Gong

Title: Dynamic Incentive Strategies for Smart EV Charging Stations: An LLM-Driven User Digital Twin Approach

Abstract:
This paper presents an enhanced electric vehicle demand response system based on large language models, aimed at optimizing the application of vehicle‑to‑grid technology. By leveraging an large language models‑driven multi‑agent framework to construct user digital twins integrated with multidimensional user profile features, it enables deep simulation and precise prediction of users' charging and discharging decision‑making patterns. Additionally, a data‑ and knowledge‑driven dynamic incentive mechanism is proposed, combining a distributed optimization model under network constraints to optimize the grid‑user interaction while ensuring both economic viability and security. Simulation results demonstrate that the approach significantly improves load peak‑valley regulation and charging/discharging strategies. Experimental validation highlights the system's substantial advantages in load balancing, user satisfaction and grid stability, providing decision‑makers with a scalable V2G management tool that promotes the sustainable, synergistic development of vehicle‑grid integration.

PaperID: 11026, https://arxiv.org/pdf/2504.01234.pdf

Authors: Yihao Zhang, Qizhi Qiu, Xiaomin Liu, Dianxuan Fu, Xingyu Liu, Leyan Fei, Yuming Cheng, Lilin Yi, Weisheng Hu, Qunbi Zhuge

Title: First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution

Abstract:
We demonstrate the first cross‑domain cross‑layer level‑4 autonomous optical network via a multi‑AI‑agent system. Field trials show ~98% task completion rate across the distributed AI training lifecycle‑3.2x higher than single agents using state‑of‑the‑art LLMs.

PaperID: 11027, https://arxiv.org/pdf/2504.01154.pdf

Authors: Ashwin Kumar, William Yeoh

Title: Past-Discounting is Key for Learning Markovian Fairness with Long Horizons

Abstract:
Fairness is an important consideration for dynamic resource allocation in multi‑agent systems. Many existing methods treat fairness as a one‑shot problem without considering temporal dynamics, which misses the nuances of accumulating inequalities over time. Recent approaches overcome this limitation by tracking allocations over time, assuming perfect recall of all past utilities. While the former neglects long‑term equity, the latter introduces a critical challenge: the augmented state space required to track cumulative utilities grows unboundedly with time, hindering the scalability and convergence of learning algorithms. Motivated by behavioral insights that human fairness judgments discount distant events, we introduce a framework for temporal fairness that incorporates past‑discounting into the learning problem. This approach offers a principled interpolation between instantaneous and perfect‑recall fairness. Our central contribution is a past‑discounted framework for memory tracking and a theoretical analysis of fairness memories, showing past‑discounting guarantees a bounded, horizon‑independent state space, a property that we prove perfect‑recall methods lack. This result unlocks the ability to learn fair policies tractably over arbitrarily long horizons. We formalize this framework, demonstrate its necessity with experiments showing that perfect recall fails where past‑discounting succeeds, and provide a clear path toward building scalable and equitable resource allocation systems.

PaperID: 11028, https://arxiv.org/pdf/2504.01039.pdf

Authors: Liat Lavi

Title: One Person, One Bot

Abstract:
This short paper puts forward a vision for a new democratic model enabled by the recent technological advances in agentic AI. It therefore opens with drawing a clear and concise picture of the model, and only later addresses related proposals and research directions, and concerns regarding feasibility and safety. It ends with a note on the timeliness of this idea and on optimism. The model proposed is that of assigning each citizen an AI Agent that would serve as their political delegate, enabling the return to direct democracy. The paper examines this models relation to existing research, its potential setbacks and feasibility and argues for its further development.

PaperID: 11029, https://arxiv.org/pdf/2504.00907.pdf

Authors: Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi

Title: Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

Abstract:
Embodied agents operating in household environments must interpret ambiguous and under‑specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask‑to‑Act task, where an embodied agent is tasked with a single or multi‑object rearrangement task using an under‑specified instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To address this challenge, we propose a novel approach that fine‑tunes multi‑modal large language models (MLLMs) as vision‑language‑action (VLA) policies using online reinforcement learning (RL) with LLM‑generated rewards. Our method eliminates the need for large‑scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero‑shot baselines including GPT‑4o as well as supervised fine‑tuned MLLMs on our task. Our results show that our RL‑finetuned MLLM outperforms all baselines by a significant margin (10.4‑16.5%), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM‑generated rewards with online RL.

PaperID: 11030, https://arxiv.org/pdf/2504.00863.pdf

Authors: Roee M. Francos, Daniel Garces, Stephanie Gil

Title: Provably Stable Multi-Agent Routing with Bounded-Delay Adversaries in the Decision Loop

Abstract:
In this work, we are interested in studying multi‑agent routing settings, where adversarial agents are part of the assignment and decision loop, degrading the performance of the fleet by incurring bounded delays while servicing pickup‑and‑delivery requests. Specifically, we are interested in characterizing conditions on the fleet size and the proportion of adversarial agents for which a routing policy remains stable, where stability for a routing policy is achieved if the number of outstanding requests is uniformly bounded over time. To obtain this characterization, we first establish a threshold on the proportion of adversarial agents above which previously stable routing policies for fully cooperative fleets are provably unstable. We then derive a sufficient condition on the fleet size to recover stability given a maximum proportion of adversarial agents. We empirically validate our theoretical results on a case study on autonomous taxi routing, where we consider transportation requests from real San Francisco taxicab data.

PaperID: 11031, https://arxiv.org/pdf/2504.00791.pdf

Authors: Syed Sarmad Shah, Anas Ali

Title: Optimizing Resource Allocation and Energy Efficiency in Federated Fog Computing for IoT

Abstract:
Fog computing significantly enhances the efficiency of IoT applications by providing computation, storage, and networking resources at the edge of the network. In this paper, we propose a federated fog computing framework designed to optimize resource management, minimize latency, and reduce energy consumption across distributed IoT environments. Our framework incorporates predictive scheduling, energy‑aware resource allocation, and adaptive mobility management strategies. Experimental results obtained from extensive simulations using the OMNeT++ environment demonstrate that our federated approach outperforms traditional non‑federated architectures in terms of resource utilization, latency, energy efficiency, task execution time, and scalability. These findings underline the suitability and effectiveness of the proposed framework for supporting sustainable and high‑performance IoT services.

PaperID: 11032, https://arxiv.org/pdf/2504.00762.pdf

Authors: Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, Shuyue Hu

Title: Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Abstract:
This paper presents a simple, effective, and cost‑efficient strategy to improve LLM performance by scaling test‑time compute. Our strategy builds upon the repeated‑sampling‑then‑voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self‑consistency and state‑of‑the‑art multi‑agent debate approaches, but also significantly reduces inference costs. Additionally, ModelSwitch requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation‑verification paradigm.

PaperID: 11033, https://arxiv.org/pdf/2504.00711.pdf

Authors: Enjun Du, Xunkai Li, Tian Jin, Zhihan Zhang, Rong-Hua Li, Guoren Wang

Title: GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments

Abstract:
The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large‑scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes: a critical limitation for real‑world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce GraphMaster, the first multi‑agent framework specifically designed for graph data synthesis in data‑limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data‑limited "Sub" variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold‑based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data‑scarce environments.

PaperID: 11034, https://arxiv.org/pdf/2504.00683.pdf

Authors: Juliette Grosset, Alain-Jérôme Fougères, Ouzna Oukacha, Moïse Djoko-Kouam, Jean-Marie Bonnin

Title: Simulation of Autonomous Industrial Vehicle Fleet Using Fuzzy Agents: Application to Task Allocation and Battery Charge Management

Abstract:
The research introduces a multi‑agent simulation that uses fuzzy inference to investigate the work distribution and battery charging control of mobile baggage conveyor robots in an airport in a comprehensive manner. Thanks to a distributed system, this simulation approach provides high adaptability, adjusting to changes in conveyor agent availability, battery capacity, awareness of the activities of the conveyor fleet, and knowledge of the context of infrastructure resource availability. Dynamic factors, such as workload variations and communication between the conveyor agents and infrastructure are considered as heuristics, highlighting the importance of flexible and collaborative approaches in autonomous systems. The results highlight the effectiveness of adaptive fuzzy multi‑agent models to optimize dynamic task allocation, adapt to the variation of baggage arrival flows, improve the overall operational efficiency of conveyor agents, and reduce their energy consumption.

PaperID: 11035, https://arxiv.org/pdf/2504.00602.pdf

Authors: Federica Adobbati, Łukasz Mikulski

Title: Asynchronous Multi-Agent Systems with Petri nets

Abstract:
Modeling the interaction between components is crucial for many applications and serves as a fundamental step in analyzing and verifying properties in multi‑agent systems. In this paper, we propose a method based on 1‑safe Petri nets to model Asynchronous Multi‑Agent Systems (AMAS), starting from two semantics defined on AMAS represented as transition systems. Specifically, we focus on two types of synchronization: synchronization on transitions and synchronization on data. For both, we define an operator that composes 1‑safe Petri nets and demonstrate the relationships between the composed Petri net and the global transition systems as defined in theliterature. Additionally, we analyze the relationships between the two semantics on Petri nets, proposing two constructions that enable switching between them. These transformations are particularly useful for system analysis, as they allow the selection of the most suitable model based on the property that needs to be verified.

PaperID: 11036, https://arxiv.org/pdf/2504.00587.pdf

Authors: Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang

Title: AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Abstract:
The rapid advancement of large language models (LLMs) has enabled the development of multi‑agent systems where multiple LLM‑based agents collaborate on complex tasks. However, existing systems often rely on centralized coordination, leading to scalability bottlenecks, reduced adaptability, and single points of failure. Privacy and proprietary knowledge concerns further hinder cross‑organizational collaboration, resulting in siloed expertise. We propose AgentNet, a decentralized, Retrieval‑Augmented Generation (RAG)‑based framework that enables LLM‑based agents to specialize, evolve, and collaborate autonomously in a dynamically structured Directed Acyclic Graph (DAG). Unlike prior approaches with static roles or centralized control, AgentNet allows agents to adjust connectivity and route tasks based on local expertise and context. AgentNet introduces three key innovations: (1) a fully decentralized coordination mechanism that eliminates the need for a central orchestrator, enhancing robustness and emergent intelligence; (2) dynamic agent graph topology that adapts in real time to task demands, ensuring scalability and resilience; and (3) a retrieval‑based memory system for agents that supports continual skill refinement and specialization. By minimizing centralized control and data exchange, AgentNet enables fault‑tolerant, privacy‑preserving collaboration across organizations. Experiments show that AgentNet achieves higher task accuracy than both single‑agent and centralized multi‑agent baselines.

PaperID: 11037, https://arxiv.org/pdf/2504.00521.pdf

Authors: Hang He, Yixing Luo, Chengcheng Wan, Ting Su, Haiying Sun, Geguang Pu

Title: Automated detection of atomicity violations in large-scale systems

Abstract:
Atomicity violations in interrupt‑driven programs pose a significant threat to software reliability in safety‑critical systems. These violations occur when the execution sequence of operations on shared resources is disrupted by asynchronous interrupts. Detecting atomicity violations is challenging due to the vast program state space, application‑level code dependencies, and complex domain‑specific knowledge. In this paper, we propose CLOVER, a multi‑agent framework for detecting atomicity violations in real‑world interrupt‑driven programs. Its plan agent orchestrates four static analysis tools to extract key information and generate code summaries. CLOVER then initializes several Expert‑Judge agent pairs to detect and validate different patterns of atomicity violation, through an iterative manner. Evaluations on RaceBench, SV‑COMP, and RWIP demonstrate that CLOVER achieves a precision/recall of 91.0%/96.4%, outperforming existing approaches by 33.0‑117.2% on F1‑score. Additionally, it identifies 12 atomicity violations in 11 real‑world aerospace software projects, one of which is previously unknown.

PaperID: 11038, https://arxiv.org/pdf/2504.00434.pdf

Authors: Shiyi Liu, Haiying Shen, Shuai Che, Mahdi Ghandi, Mingqin Li

Title: HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents

Abstract:
In the realm of AI, large language models (LLMs) like GPT‑4, central to the operation of AI agents, predominantly operate in the cloud, incurring high operational costs. With local‑based small language models (SLMs) becoming more accurate, the necessity of cloud‑exclusive processing is being reconsidered. An AI agent's response to a user's request comprises a series of subtasks or iterations. Existing approaches only allocate a single request between SLM and LLM to ensure their outputs are similar, but adopting this approach in the AI agent scenario for assigning each subtask is not effective since SLM will output a different subsequent subtask, which affects the accuracy of the final output. In this paper, we first conduct experimental analysis to understand the features of AI agent operations. Leveraging our findings, we propose the Adaptive Iteration‑level Model Selector (AIMS), a lightweight scheduler to automatically partition AI agent's subtasks between local‑based SLM and cloud‑based LLM. AIMS considers the varying subtask features and strategically decides the location for each subtask in order to use SLM as much as possible while attaining the accuracy level. Our experimental results demonstrate that AIMS increases accuracy by up to 9.1% and SLM usage by up to 10.8% compared to HybridLLM. It offloads 45.67% of subtasks to a local SLM while attaining similar accuracy on average compared with the cloud‑only LLM approach.

PaperID: 11039, https://arxiv.org/pdf/2504.00374.pdf

Authors: Mahak Agarwal, Divyam Khanna

Title: When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Abstract:
In many real‑world scenarios, a single Large Language Model (LLM) may encounter contradictory claims‑some accurate, others forcefully incorrect‑and must judge which is true. We investigate this risk in a single‑turn, multi‑agent debate framework: one LLM‑based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence‑Weighted Persuasion Override Rate (CW‑POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open‑source LLMs (3B‑14B parameters), where we systematically vary agent verbosity (30‑300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers‑often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.

PaperID: 11040, https://arxiv.org/pdf/2504.00225.pdf

Authors: Matthias Köhler, Matthias A. Müller, Frank Allgöwer

Title: Distributed Model Predictive Control for Dynamic Cooperation of Multi-Agent Systems

Abstract:
We propose a distributed model predictive control (MPC) framework for coordinating heterogeneous, nonlinear multi‑agent systems under individual and coupling constraints. The cooperative task is encoded as a shared objective function minimized collectively by the agents. Each agent optimizes an artificial reference as an intermediate step towards the cooperative objective, along with a control input to track it. We establish recursive feasibility, asymptotic stability, and transient performance bounds under suitable assumptions. The solution to the cooperative task is not predetermined but emerges from the optimized interactions of the agents. We demonstrate the framework on numerical examples inspired by satellite constellation control, collision‑free narrow‑passage traversal, and coordinated quadrotor flight.

PaperID: 11041, https://arxiv.org/pdf/2504.00218.pdf

Authors: Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Yun, Charles Fleming, Tianlong Chen

Title: $\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

Abstract:
Most discussions about Large Language Model (LLM) safety have focused on single‑agent settings but multi‑agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constrains such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a permutation‑invariant adversarial attack that optimizes prompt distribution across latency and bandwidth‑constraint network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of maximum‑flow minimum‑cost, coupled with the novel Permutation‑Invariant Evasion Loss (PIEL), we leverage graph‑based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including \textttLlama, \textttMistral, \textttGemma, \textttDeepSeek and other variants on various datasets like \textttJailBreakBench and \textttAdversarialBench, our method outperforms conventional attacks by up to 7×, exposing critical vulnerabilities in multi‑agent systems. Moreover, we demonstrate that existing defenses, including variants of \textttLlama‑Guard and \textttPromptGuard, fail to prohibit our attack, emphasizing the urgent need for multi‑agent specific safety mechanisms.

PaperID: 11042, https://arxiv.org/pdf/2504.00156.pdf

Authors: Leo Tunkle, Kamal Abdulraheem, Linyu Lin, Majdi I. Radaideh

Title: Nuclear Microreactor Control with Deep Reinforcement Learning

Abstract:
The economic feasibility of nuclear microreactors will depend on minimizing operating costs through advancements in autonomous control, especially when these microreactors are operating alongside other types of energy systems (e.g., renewable energy). This study explores the application of deep reinforcement learning (RL) for real‑time drum control in microreactors, exploring performance in regard to load‑following scenarios. By leveraging a point kinetics model with thermal and xenon feedback, we first establish a baseline using a single‑output RL agent, then compare it against a traditional proportional‑integral‑derivative (PID) controller. This study demonstrates that RL controllers, including both single‑ and multi‑agent RL (MARL) frameworks, can achieve similar or even superior load‑following performance as traditional PID control across a range of load‑following scenarios. In short transients, the RL agent was able to reduce the tracking error rate in comparison to PID. Over extended 300‑minute load‑following scenarios in which xenon feedback becomes a dominant factor, PID maintained better accuracy, but RL still remained within a 1% error margin despite being trained only on short‑duration scenarios. This highlights RL's strong ability to generalize and extrapolate to longer, more complex transients, affording substantial reductions in training costs and reduced overfitting. Furthermore, when control was extended to multiple drums, MARL enabled independent drum control as well as maintained reactor symmetry constraints without sacrificing performance ‑‑ an objective that standard single‑agent RL could not learn. We also found that, as increasing levels of Gaussian noise were added to the power measurements, the RL controllers were able to maintain lower error rates than PID, and to do so with less control effort.

PaperID: 11043, https://arxiv.org/pdf/2503.24296.pdf

Authors: Yubo Zhang, Pedro Botelho, Trevor Gordon, Gil Zussman, Igor Kadota

Title: Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning

Abstract:
We consider a decentralized wireless network with several source‑destination pairs sharing a limited number of orthogonal frequency bands. Sources learn to adapt their transmissions (specifically, their band selection strategy) over time, in a decentralized manner, without sharing information with each other. Sources can only observe the outcome of their own transmissions (i.e., success or collision), having no prior knowledge of the network size or of the transmission strategy of other sources. The goal of each source is to maximize their own throughput while striving for network‑wide fairness. We propose a novel fully decentralized Reinforcement Learning (RL)‑based solution that achieves fairness without coordination. The proposed Fair Share RL (FSRL) solution combines: (i) state augmentation with a semi‑adaptive time reference; (ii) an architecture that leverages risk control and time difference likelihood; and (iii) a fairness‑driven reward structure. We evaluate FSRL in more than 50 network settings with different number of agents, different amounts of available spectrum, in the presence of jammers, and in an ad‑hoc setting. Simulation results suggest that, when we compare FSRL with a common baseline RL algorithm from the literature, FSRL can be up to 89.0% fairer (as measured by Jain's fairness index) in stringent settings with several sources and a single frequency band, and 48.1% fairer on average.

PaperID: 11044, https://arxiv.org/pdf/2503.24183.pdf

Authors: Matej Jusup, Kenan Zhang, Zhiyuan Hu, Barna PÃ¡sztor, Andreas Krause, Francesco Corman

Title: Scalable Ride-Sourcing Vehicle Rebalancing with Service Accessibility Guarantee: A Constrained Mean-Field Reinforcement Learning Approach

Abstract:
The rapid expansion of ride‑sourcing services such as Uber, Lyft, and Didi Chuxing has fundamentally reshaped urban transportation by offering flexible, on‑demand mobility via mobile applications. Despite their convenience, these platforms confront significant operational challenges, particularly vehicle rebalancing ‑ the strategic repositioning of a large group of vehicles to address spatiotemporal mismatches in supply and demand. Inadequate rebalancing not only results in prolonged rider waiting times and inefficient vehicle utilization but also leads to fairness issues, such as the inequitable distribution of service quality and disparities in driver income. To tackle these complexities, we introduce continuous‑state mean‑field control (MFC) and mean‑field reinforcement learning (MFRL) models that employ continuous vehicle repositioning actions. MFC and MFRL offer scalable solutions by modeling each vehicle's behavior through interaction with the vehicle distribution, rather than with individual vehicles. This limits the issues arising from the curse of dimensionality inherent in traditional multi‑agent methods, enabling coordination across large fleets with significantly reduced computational complexity. To ensure equitable service access across geographic regions, we integrate an accessibility constraint into both models. Extensive empirical evaluation using real‑world data‑driven simulation of Shenzhen demonstrates the real‑time efficiency and robustness of our approach. Remarkably, it scales to tens of thousands of vehicles, with training times comparable to the decision time of a single linear programming rebalancing. Besides, policies generated by our approach effectively explore the efficiency‑equity Pareto front, outperforming conventional benchmarks across key metrics like fleet utilization, fulfilled requests, and pickup distance, while ensuring equitable service access.

PaperID: 11045, https://arxiv.org/pdf/2503.24110.pdf

Authors: François Olivier, Zied Bouraoui

Title: Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition

Abstract:
Despite advances in embodied AI, agent reasoning systems still struggle to capture the fundamental conceptual structures that humans naturally use to understand and interact with their environment. To address this, we propose a novel framework that bridges embodied cognition theory and agent systems by leveraging a formal characterization of image schemas, which are defined as recurring patterns of sensorimotor experience that structure human cognition. By customizing LLMs to translate natural language descriptions into formal representations based on these sensorimotor patterns, we will be able to create a neurosymbolic system that grounds the agent's understanding in fundamental conceptual structures. We argue that such an approach enhances both efficiency and interpretability while enabling more intuitive human‑agent interactions through shared embodied understanding.

PaperID: 11046, https://arxiv.org/pdf/2503.24105.pdf

Authors: Giulio Fattore, Maria Elena Valcher

Title: Data-Driven Distributed Output Synchronization of Heterogeneous Discrete-Time Multi-Agent Systems

Abstract:
In this paper, we assume that an autonomous exosystem generates a reference output, and we consider the problem of designing a distributed data‑driven control law for a family of discrete‑time heterogeneous LTI agents, connected through a directed graph, in order to synchronize the agents' outputs to the reference one. The agents of the network are split into two categories: leaders, with direct access to the exosystem output, and followers, that only receive information from their neighbors. All agents aim to achieve output synchronization by means of a state feedback that makes use of their own states as well as of an estimate of the exogenous system state, provided by an internal state observer. Such observer has a different structure for leaders and followers. Necessary and sufficient conditions for the existence of a solution are first derived in the model‑based set‑up and then in a data‑driven context. An example illustrates both the implementation procedure and the performance of the proposed approach.

PaperID: 11047, https://arxiv.org/pdf/2503.24078.pdf

Authors: Zining Cao

Title: A Complete Mental Temporal Logic for Intelligent Agent

Abstract:
In this paper, we present a complete mental temporal logic, called BPICTL, which generalizes CTL by introducing mental modalities. A sound and complete inference system of BPICTL is given. We prove the finite model property of BPICTL. Furthermore, we present a model checking algorithm for BPICTL.

PaperID: 11048, https://arxiv.org/pdf/2503.24025.pdf

Authors: Renato Vizuete, Julien M. Hendrickx

Title: Consensus on Open Multi-Agent Systems Over Graphs Sampled from Graphons

Abstract:
We show how graphons can be used to model and analyze open multi‑agent systems, which are multi‑agent systems subject to arrivals and departures, in the specific case of linear consensus. First, we analyze the case of replacements, where under the assumption of a deterministic interval between two replacements, we derive an upper bound for the disagreement in expectation. Then, we study the case of arrivals and departures, where we define a process for the evolution of the number of agents that guarantees a minimum and a maximum number of agents. Next, we derive an upper bound for the disagreement in expectation, and we establish a link with the spectrum of the expected graph used to generate the graph topologies. Finally, for stochastic block model (SBM) graphons, we prove that the computation of the spectrum of the expected graph can be performed based on a matrix whose dimension depends only on the graphon and it is independent of the number of agents.

PaperID: 11049, https://arxiv.org/pdf/2503.23886.pdf

Authors: Qin Wang, Youhuan Li, Yansong Feng, Si Chen, Ziming Li, Pan Zhang, Zihui Si, Yixuan Chen, Zhichao Shi, Zebin Huang, Guo Chen, Wenqiang Jin

Title: Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language

Abstract:
People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but require a high level of professional expertise from users. Although there are already many works on Text2SQL to automate the translation of natural language into SQL queries for data manipulation, all of them presuppose that the database schema is pre‑designed. In practice, schema design itself demands domain expertise, and research on directly generating schemas from textual requirements remains unexplored. In this paper, we systematically define a new problem, called Text2Schema, to convert a natural language text requirement into a relational database schema. With an effective Text2Schema technique, users can effortlessly create database table structures using natural language, and subsequently leverage existing Text2SQL techniques to perform data manipulations, which significantly narrows the gap between non‑technical personnel and highly efficient, versatile relational database systems. We propose SchemaAgent, an LLM‑based multi‑agent framework for Text2Schema. We emulate the workflow of manual schema design by assigning specialized roles to agents and enabling effective collaboration to refine their respective subtasks. We also incorporate dedicated roles for reflection and inspection, along with an innovative error detection and correction mechanism to identify and rectify issues across various phases. Moreover, we build and open source a benchmark containing 381 pairs of requirement description and schema. Experimental results demonstrate the superiority of our approach over comparative work.

PaperID: 11050, https://arxiv.org/pdf/2503.23804.pdf

Authors: Shiyi Yang, Zhibo Hu, Xinshu Li, Chen Wang, Tong Yu, Xiwei Xu, Liming Zhu, Lina Yao

Title: DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents

Abstract:
Large language model (LLM)‑powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling, where the memory mechanism plays a pivotal role in enabling the agents to autonomously explore, learn and self‑evolve from real‑world interactions. However, this very mechanism, serving as a contextual repository, inherently exposes an attack surface for potential adversarial manipulations. Despite its central role, the robustness of agentic RSs in the face of such threats remains largely underexplored. Previous works suffer from semantic mismatches or rely on static embeddings or pre‑defined prompts, all of which are not designed for dynamic systems, especially for dynamic memory states of LLM agents. This challenge is exacerbated by the black‑box nature of commercial recommenders. To tackle the above problems, in this paper, we present the first systematic investigation of memory‑based vulnerabilities in LLM‑powered recommender agents, revealing their security limitations and guiding efforts to strengthen system resilience and trustworthiness. Specifically, we propose a novel black‑box attack framework named DrunkAgent. DrunkAgent crafts semantically meaningful adversarial textual triggers for target item promotions and introduces a series of strategies to maximize the trigger effect by corrupting the memory updates during the interactions. The triggers and strategies are optimized on a surrogate model, enabling DrunkAgent transferable and stealthy. Extensive experiments on real‑world datasets across diverse agentic RSs, including collaborative filtering, retrieval augmentation and sequential recommendations, demonstrate the generalizability, transferability and stealthiness of DrunkAgent.

PaperID: 11051, https://arxiv.org/pdf/2503.23705.pdf

Authors: George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras

Title: Steering Large Agent Populations using Mean-Field Schrodinger Bridges with Gaussian Mixture Models

Abstract:
The Mean‑Field Schrodinger Bridge (MFSB) problem is an optimization problem aiming to find the minimum effort control policy to drive a McKean‑Vlassov stochastic differential equation from one probability measure to another. In the context of multi‑agent control, the objective is to control the configuration of a swarm of identical, interacting cooperative agents, as captured by the time‑varying probability measure of their state. Available methods for solving this problem for distributions with continuous support rely either on spatial discretizations of the problem's domain or on approximating optimal solutions using neural networks trained through stochastic optimization schemes. For agents following Linear Time Varying dynamics, and for Gaussian Mixture Model boundary distributions, we propose a highly efficient parameterization to approximate the optimal solutions of the corresponding MFSB in closed form, without any learning step. Our proposed approach consists of a mixture of elementary policies, each solving a Gaussian‑to‑Gaussian Covariance Steering problem from the components of the initial mixture to the components of the terminal mixture. Leveraging the semidefinite formulation of the Covariance Steering problem, the proposed solver can handle probabilistic constraints on the system's state while maintaining numerical tractability. We illustrate our approach on a variety of numerical examples.

PaperID: 11052, https://arxiv.org/pdf/2503.23673.pdf

Authors: Zhengyi Zhao, Shubo Zhang, Bin Liang, Binyang Li, Kam-Fai Wong

Title: WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation

Abstract:
In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high‑quality data remains a significant challenge. This limitation poisons large language models to correctly understand relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in potential misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method which involves similarity computation followed by word replacement, but counterfactual data are usually generated. As a result, these methods disrupt meaningful word sets or produce sentences with meanings that deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical‑dedicated rationale‑based synthetic data augmentation method. Beyond the naive lexicon similarity, specific bio‑relation similarity is measured to hold the augmented instance having a strong correlation with bio‑relation instead of simply increasing the diversity of augmented data. Moreover, a multi‑agents‑involved reflection mechanism helps the model iteratively distinguish different usage of similar entities to escape falling into the mis‑replace trap. We evaluate our method on the BLURB and BigBIO benchmark, which includes 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.

PaperID: 11053, https://arxiv.org/pdf/2503.23669.pdf

Authors: Xuli Cai, Poonam Lohan, Burak Kantarci

Title: Multi-Agent Deep Reinforcement Learning for Optimized Multi-UAV Coverage and Power-Efficient UE Connectivity

Abstract:
In critical situations such as natural disasters, network outages, battlefield communication, or large‑scale public events, Unmanned Aerial Vehicles (UAVs) offer a promising approach to maximize wireless coverage for affected users in the shortest possible time. In this paper, we propose a novel framework where multiple UAVs are deployed with the objective to maximize the number of served user equipment (UEs) while ensuring a predefined data rate threshold. UEs are initially clustered using a K‑means algorithm, and UAVs are optimally positioned based on the UEs' spatial distribution. To optimize power allocation and mitigate inter‑cluster interference, we employ the Multi‑Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, considering both LOS and NLOS fading. Simulation results demonstrate that our method significantly enhances UEs coverage and outperforms Deep Q‑Network (DQN) and equal power distribution methods, improving their UE coverage by up to 2.07 times and 8.84 times, respectively.

PaperID: 11054, https://arxiv.org/pdf/2503.23668.pdf

Authors: Jiaxin Wu, Ting Zhang, Rubing Chen, Wengyu Zhang, Chen Jason Zhang, Xiao-Yong Wei, Li Qing

Title: MolGround: A Benchmark for Molecular Grounding

Abstract:
Current molecular understanding approaches predominantly focus on the descriptive aspect of human perception, providing broad, topic‑level insights. However, the referential aspect ‑‑ linking molecular concepts to specific structural components ‑‑ remains largely unexplored. To address this gap, we propose a molecular grounding benchmark designed to evaluate a model's referential abilities. We align molecular grounding with established conventions in NLP, cheminformatics, and molecular science, showcasing the potential of NLP techniques to advance molecular understanding within the AI for Science movement. Furthermore, we constructed the largest molecular understanding benchmark to date, comprising 117k QA pairs, and developed a multi‑agent grounding prototype as proof of concept. This system outperforms existing models, including GPT‑4o, and its grounding outputs have been integrated to enhance traditional tasks such as molecular captioning and ATC (Anatomical, Therapeutic, Chemical) classification.

PaperID: 11055, https://arxiv.org/pdf/2503.23615.pdf

Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron

Title: An Organizationally-Oriented Approach to Enhancing Explainability and Control in Multi-Agent Reinforcement Learning

Abstract:
Multi‑Agent Reinforcement Learning can lead to the development of collaborative agent behaviors that show similarities with organizational concepts. Pushing forward this perspective, we introduce a novel framework that explicitly incorporates organizational roles and goals from the \mathcalMOISE^+ model into the MARL process, guiding agents to satisfy corresponding organizational constraints. By structuring training with roles and goals, we aim to enhance both the explainability and control of agent behaviors at the organizational level, whereas much of the literature primarily focuses on individual agents. Additionally, our framework includes a post‑training analysis method to infer implicit roles and goals, offering insights into emergent agent behaviors. This framework has been applied across various MARL environments and algorithms, demonstrating coherence between predefined organizational specifications and those inferred from trained agents.

PaperID: 11056, https://arxiv.org/pdf/2503.23518.pdf

Authors: Arash Bahari Kordabad, Arabinda Ghosh, Sybert Stroeve, Sadegh Soudjani

Title: Intent-Aware MPC for Aircraft Detect-and-Avoid with Response Delay: A Comparative Study with ACAS Xu

Abstract:
In this paper, we propose an intent‑aware Model Predictive Control (MPC) approach for the remain‑well‑clear (RWC) functionality of a multi‑agent aircraft detect‑and‑avoid (DAA) system and compare its performance with the standardized Airborne Collision Avoidance System Xu (ACAS Xu). The aircraft system is modeled as a linear system for horizontal maneuvering, with advisories on the rate of turn as the control input. Both deterministic and stochastic time delays are considered to account for the lag between control guidance issuance and the response of the aircraft. The capability of the MPC scheme in producing an optimal control profile over the entire horizon is used to mitigate the impact of the delay. We compare the proposed MPC method with ACAS Xu using various evaluation metrics, including loss of DAA well‑clear percentage, near mid‑air collision percentage, horizontal miss distance, and additional flight distance across different encounter scenarios. It is shown that the MPC scheme achieves better evaluation metrics than ACAS Xu for both deterministic and stochastic scenarios.

PaperID: 11057, https://arxiv.org/pdf/2503.23514.pdf

Authors: Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

Title: If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Abstract:
Large language models (LLMs) can carry out human‑like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi‑turn, multi‑agent interactions, LLMs begin to exhibit consistent, character‑like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open‑ended evaluations. To address this gap, we introduce LIFESTATE‑BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models' self‑awareness, episodic memory retrieval, and relationship tracking, across both parametric and non‑parametric approaches. Experiments on models like Llama3.1‑8B, GPT‑4‑turbo, and DeepSeek R1, we demonstrate that nonparametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

PaperID: 11058, https://arxiv.org/pdf/2503.23421.pdf

Authors: Andrei Cristian Ionescu, Sergey Titov, Maliheh Izadi

Title: A Multi-agent Onboarding Assistant based on Large Language Models, Retrieval Augmented Generation, and Chain-of-Thought

Abstract:
Effective onboarding in software engineering is crucial but difficult due to the fast‑paced evolution of technologies. Traditional methods, like exploration and workshops, are costly, time‑consuming, and quickly outdated in large projects. We propose the Onboarding Buddy system, which leverages large language models, retrieval augmented generation, and an automated chain‑of‑thought approach to improve onboarding. It integrates dynamic, context‑specific support within the development environment, offering natural language explanations, code insights, and project guidance. Our solution is agent‑based and provides customized assistance with minimal human intervention. Our study results among the eight participants show an average helpfulness rating of (M=3.26, SD=0.86) and ease of onboarding at (M=3.0, SD=0.96) out of four. While similar to tools like GitHub Copilot, Onboarding Buddy uniquely integrates a chain‑of‑thought reasoning mechanism with retrieval‑augmented generation, tailored specifically for dynamic onboarding contexts. While our initial evaluation is based on eight participants within one project, we will explore larger teams and multiple real‑world codebases in the company to demonstrate broader applicability. Overall, Onboarding Buddy holds great potential for enhancing developer productivity and satisfaction. Our tool, source code, and demonstration video are publicly available

PaperID: 11059, https://arxiv.org/pdf/2503.23329.pdf

Authors: Hui Li, Ante Wang, kunquan li, Zhihao Wang, Liang Zhang, Delai Qiu, Qingsong Liu, Jinsong Su

Title: A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection

Abstract:
Misinformation spans various domains, but detection methods trained on specific domains often perform poorly when applied to others. With the rapid development of Large Language Models (LLMs), researchers have begun to utilize LLMs for cross‑domain misinformation detection. However, existing LLM‑based methods often fail to adequately analyze news in the target domain, limiting their detection capabilities. More importantly, these methods typically rely on manually designed decision rules, which are limited by domain knowledge and expert experience, thus limiting the generalizability of decision rules to different domains. To address these issues, we propose a MultiAgent Framework for cross‑domain misinformation detection with Automated Decision Rule Optimization (MARO). Under this framework, we first employs multiple expert agents to analyze target‑domain news. Subsequently, we introduce a question‑reflection mechanism that guides expert agents to facilitate higherquality analysis. Furthermore, we propose a decision rule optimization approach based on carefully‑designed cross‑domain validation tasks to iteratively enhance the effectiveness of decision rules in different domains. Experimental results and in‑depth analysis on commonlyused datasets demonstrate that MARO achieves significant improvements over existing methods.

PaperID: 11060, https://arxiv.org/pdf/2503.23326.pdf

Authors: Yiyu Qian, Tim Miller, Zheng Qian, Liyuan Zhao

Title: Exploring Explainable Multi-agent MCTS-minimax Hybrids in Board Game Using Process Mining

Abstract:
Monte‑Carlo Tree Search (MCTS) is a family of sampling‑based search algorithms widely used for online planning in sequential decision‑making domains and at the heart of many recent advances in artificial intelligence. Understanding the behavior of MCTS agents is difficult for developers and users due to the frequently large and complex search trees that result from the simulation of many possible futures, their evaluations, and their relationships. This paper presents our ongoing investigation into potential explanations for the decision‑making and behavior of MCTS. A weakness of MCTS is that it constructs a highly selective tree and, as a result, can miss crucial moves and fall into tactical traps. Full‑width minimax search constitutes the solution. We integrate shallow minimax search into the rollout phase of multi‑agent MCTS and use process mining technique to explain agents' strategies in 3v3 checkers.

PaperID: 11061, https://arxiv.org/pdf/2503.23315.pdf

Authors: Mohamed Elrefaie, Janet Qian, Raina Wu, Qian Chen, Angela Dai, Faez Ahmed

Title: AI Agents in Engineering Design: A Multi-Agent Framework for Aesthetic and Aerodynamic Car Design

Abstract:
We introduce the concept of "Design Agents" for engineering applications, particularly focusing on the automotive design process, while emphasizing that our approach can be readily extended to other engineering and design domains. Our framework integrates AI‑driven design agents into the traditional engineering workflow, demonstrating how these specialized computational agents interact seamlessly with engineers and designers to augment creativity, enhance efficiency, and significantly accelerate the overall design cycle. By automating and streamlining tasks traditionally performed manually, such as conceptual sketching, styling enhancements, 3D shape retrieval and generative modeling, computational fluid dynamics (CFD) meshing, and aerodynamic simulations, our approach reduces certain aspects of the conventional workflow from weeks and days down to minutes. These agents leverage state‑of‑the‑art vision‑language models (VLMs), large language models (LLMs), and geometric deep learning techniques, providing rapid iteration and comprehensive design exploration capabilities. We ground our methodology in industry‑standard benchmarks, encompassing a wide variety of conventional automotive designs, and utilize high‑fidelity aerodynamic simulations to ensure practical and applicable outcomes. Furthermore, we present design agents that can swiftly and accurately predict simulation outcomes, empowering engineers and designers to engage in more informed design optimization and exploration. This research underscores the transformative potential of integrating advanced generative AI techniques into complex engineering tasks, paving the way for broader adoption and innovation across multiple engineering disciplines.

PaperID: 11062, https://arxiv.org/pdf/2503.23314.pdf

Authors: Wonduk Seo, Juhyeon Lee, Yanjun Shao, Qingshan Zhou, Seunghyun Lee, Yi Bu

Title: SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

Abstract:
Large Language Models (LLMs) have enabled dynamic reasoning in automated data analytics, yet recent multi‑agent systems remain limited by rigid, single‑path workflows that restrict strategic exploration and often lead to suboptimal outcomes. To overcome these limitations, we propose SPIO (Sequential Plan Integration and Optimization), a framework that replaces rigid workflows with adaptive, multi‑path planning across four core modules: data preprocessing, feature engineering, model selection, and hyperparameter tuning. In each module, specialized agents generate diverse candidate strategies, which are cascaded and refined by an optimization agent. SPIO offers two operating modes: SPIO‑S for selecting a single optimal pipeline, and SPIO‑E for ensembling top‑k pipelines to maximize robustness. Extensive evaluations on Kaggle and OpenML benchmarks show that SPIO consistently outperforms state‑of‑the‑art baselines, achieving an average performance gain of 5.6%. By explicitly exploring and integrating multiple solution paths, SPIO delivers a more flexible, accurate, and reliable foundation for automated data science.

PaperID: 11063, https://arxiv.org/pdf/2503.23290.pdf

Authors: Junlong Chen, Jiawen Kang, Minrui Xu, Fan Wu, Hongliang Zhang, Huawei Huang, Dusit Niyato, Shiwen Mao

Title: Efficient Twin Migration in Vehicular Metaverses: Multi-Agent Split Deep Reinforcement Learning with Spatio-Temporal Trajectory Generation

Abstract:
Vehicle Twins (VTs) as digital representations of vehicles can provide users with immersive experiences in vehicular metaverse applications, e.g., Augmented Reality (AR) navigation and embodied intelligence. VT migration is an effective way that migrates the VT when the locations of physical entities keep changing to maintain seamless immersive VT services. However, an efficient VT migration is challenging due to the rapid movement of vehicles, dynamic workloads of Roadside Units (RSUs), and heterogeneous resources of the RSUs. To achieve efficient migration decisions and a minimum latency for the VT migration, we propose a multi‑agent split Deep Reinforcement Learning (DRL) framework combined with spatio‑temporal trajectory generation. In this framework, multiple split DRL agents utilize split architecture to efficiently determine VT migration decisions. Furthermore, we propose a spatio‑temporal trajectory generation algorithm based on trajectory datasets and road network data to simulate vehicle trajectories, enhancing the generalization of the proposed scheme for managing VT migration in dynamic network environments. Finally, experimental results demonstrate that the proposed scheme not only enhances the Quality of Experience (QoE) by 29% but also reduces the computational parameter count by approximately 25% while maintaining similar performances, enhancing users' immersive experiences in vehicular metaverses.

PaperID: 11064, https://arxiv.org/pdf/2503.23255.pdf

Authors: Mingjia He, Yannik Werner, Andrea Censi, Emilio Frazzoli, Gioele Zardini

Title: Iterative VCG-based Mechanism Fosters Cooperation in Multi-Regional Network Design

Abstract:
Transportation network design often involves multiple stakeholders with diverse priorities. We consider a system with a hierarchical multi‑agent structure, featuring self‑optimized subnetwork operators at the lower level and a central organization at the upper level. Independent regional planning can lead to inefficiencies due to the lack of coordination, hindering interregional travel and cross‑border infrastructure development, while centralized methods may struggle to align local interests and can be impractical to implement. To support decision making for such a system, we introduce an iterative VCG‑based mechanism for multi‑regional network design that fosters cooperation among subnetwork operators. By leveraging the Vickery‑Clarke‑Groves (VCG) mechanism, the framework determines collective investment decisions and the necessary payments from both operators and the central organization to achieve efficient outcomes. A case study on the European Railway System validates the effectiveness of the proposed method, demonstrating significant improvements in overall network performance through enhanced cross‑region cooperation.

PaperID: 11065, https://arxiv.org/pdf/2503.23170.pdf

Authors: Daniel Saeedi, Denise Buckner, Jose C. Aponte, Amirali Aghazadeh

Title: AstroAgents: A Multi-Agent AI for Hypothesis Generation from Mass Spectrometry Data

Abstract:
With upcoming sample return missions across the solar system and the increasing availability of mass spectrometry data, there is an urgent need for methods that analyze such data within the context of existing astrobiology literature and generate plausible hypotheses regarding the emergence of life on Earth. Hypothesis generation from mass spectrometry data is challenging due to factors such as environmental contaminants, the complexity of spectral peaks, and difficulties in cross‑matching these peaks with prior studies. To address these challenges, we introduce AstroAgents, a large language model‑based, multi‑agent AI system for hypothesis generation from mass spectrometry data. AstroAgents is structured around eight collaborative agents: a data analyst, a planner, three domain scientists, an accumulator, a literature reviewer, and a critic. The system processes mass spectrometry data alongside user‑provided research papers. The data analyst interprets the data, and the planner delegates specific segments to the scientist agents for in‑depth exploration. The accumulator then collects and deduplicates the generated hypotheses, and the literature reviewer identifies relevant literature using Semantic Scholar. The critic evaluates the hypotheses, offering rigorous suggestions for improvement. To assess AstroAgents, an astrobiology expert evaluated the novelty and plausibility of more than a hundred hypotheses generated from data obtained from eight meteorites and ten soil samples. Of these hypotheses, 36% were identified as plausible, and among those, 66% were novel. Project website: https://astroagents.github.io/

PaperID: 11066, https://arxiv.org/pdf/2503.23138.pdf

Authors: Donghe Li, Zuchen Li, Ye Yang, Li Sun, Dou An, Qingyu Yang

Title: EncGPT: A Multi-Agent Workflow for Dynamic Encryption Algorithms

Abstract:
Communication encryption is crucial in computer technology, but existing algorithms struggle with balancing cost and security. We propose EncGPT, a multi‑agent framework using large language models (LLM). It includes rule, encryption, and decryption agents that generate encryption rules and apply them dynamically. This approach addresses gaps in LLM‑based multi‑agent systems for communication security. We tested GPT‑4o's rule generation and implemented a substitution encryption workflow with homomorphism preservation, achieving an average execution time of 15.99 seconds.

PaperID: 11067, https://arxiv.org/pdf/2503.22958.pdf

Authors: Supriyo Maji, Linran Zhao, Souradip Poddar, David Z. Pan

Title: Late Breaking Results: Breaking Symmetry- Unconventional Placement of Analog Circuits using Multi-Level Multi-Agent Reinforcement Learning

Abstract:
Layout‑dependent effects (LDEs) significantly impact analog circuit performance. Traditionally, designers have relied on symmetric placement of circuit components to mitigate variations caused by LDEs. However, due to non‑linear nature of these effects, conventional methods often fall short. We propose an objective‑driven, multi‑level, multi‑agent Q‑learning framework to explore unconventional design space of analog layout, opening new avenues for optimizing analog circuit performance. Our approach achieves better variation performance than the state‑of‑the‑art layout techniques. Notably, this is the first application of multi‑agent RL in analog layout automation. The proposed approach is compared with non‑ML approach based on simulated annealing.

PaperID: 11068, https://arxiv.org/pdf/2503.22870.pdf

Authors: Mouaad Boughellaba, Soulaimane Berkane, Abdelhamid Tayebi

Title: Attitude Synchronization on SO(3) for Heterogeneous Multi-Agent Systems Using Vector Measurements

Abstract:
This paper addresses the distributed attitude synchronization problem for a network of rigid‑body systems on the special orthogonal group SO(3). Each agent measures, in its body frame, its own angular velocity and a set of vectors whose corresponding directions in the inertial frame are unknown. Under an undirected, connected, and acyclic interaction graph topology, we develop four distributed synchronization schemes relying solely on local vector measurements, without the need for attitude estimation and attitude exchange between agents. Specifically, two leaderless schemes are proposed at the kinematic and dynamic levels to achieve synchronization to a common unknown orientation. In addition, two leader‑follower schemes are proposed to align all agents with a prescribed constant orientation defined by reference vector measurements available only to a designated leader. All control laws are formulated directly on SO(3), preserving the geometric structure of the attitude dynamics. A rigorous stability analysis is provided showing that the closed‑loop systems achieve almost global asymptotic stability, which is the strongest stability property one can achieve on SO(3) with smooth controllers. %Compared with existing vector‑measurement‑based approaches that provide only local stability or convergence results, the proposed methods significantly strengthen the theoretical guarantees while maintaining a fully distributed architecture. Numerical simulations are provided to illustrate the effectiveness and performance of the proposed distributed control schemes.

PaperID: 11069, https://arxiv.org/pdf/2503.22867.pdf

Authors: Huiwen Yan, Mushuang Liu

Title: Markov Potential Game Construction and Multi-Agent Reinforcement Learning with Applications to Autonomous Driving

Abstract:
Markov games (MGs) provide a mathematical foundation for multi‑agent reinforcement learning (MARL), enabling self‑interested agents to learn their optimal policies while interacting with others in a shared environment. However, due to the complexities of an MG problem, seeking (Markov perfect) Nash equilibrium (NE) is often very challenging for a general‑sum MG. Markov potential games (MPGs), which are a special class of MGs, have appealing properties such as guaranteed existence of pure NEs and guaranteed convergence of gradient play algorithms, thereby leading to desirable properties for many MARL algorithms in their NE‑seeking processes. However, the question of how to construct MPGs has remained open. This paper provides sufficient conditions on the reward design and on the Markov decision process (MDP), under which an MG is an MPG. Numerical results on autonomous driving applications are reported.

PaperID: 11070, https://arxiv.org/pdf/2503.22779.pdf

Authors: Junkai Hu, Li Xia

Title: Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Abstract:
We study a long‑run mean‑variance team stochastic game (MV‑TSG), where each agent shares a common mean‑variance objective for the system and takes actions independently to maximize it. MV‑TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates of all agents lead to a non‑stationary environment for each individual agent. Both challenges make dynamic programming inapplicable. In this paper, we study MV‑TSGs from the perspective of sensitivity‑based optimization. The performance difference and performance derivative formulas for joint policies are derived, which provide optimization information for MV‑TSGs. We prove the existence of a deterministic Nash policy for this problem. Subsequently, we propose a Mean‑Variance Multi‑Agent Policy Iteration (MV‑MAPI) algorithm with a sequential update scheme, where individual agent policies are updated one by one in a given order. We prove that the MV‑MAPI algorithm converges to a first‑order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions for stationary points to be (local) Nash equilibria, and further, strict local optima. To solve large‑scale MV‑TSGs in scenarios with unknown environmental parameters, we extend the idea of trust region methods to MV‑MAPI and develop a multi‑agent reinforcement learning algorithm named Mean‑Variance Multi‑Agent Trust Region Policy Optimization (MV‑MATRPO). We derive a performance lower bound for each update of joint policies. Finally, numerical experiments on energy management in multiple microgrid systems are conducted.

PaperID: 11071, https://arxiv.org/pdf/2503.22732.pdf

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah

Title: Reasoning Beyond Limits: Advances and Open Problems for LLMs

Abstract:
Recent breakthroughs in generative reasoning have fundamentally reshaped how large language models (LLMs) address complex tasks, enabling them to dynamically retrieve, refine, and organize information into coherent multi‑step reasoning chains. Techniques such as inference‑time scaling, reinforcement learning, supervised fine‑tuning, and distillation have been effectively applied to state‑of‑the‑art models, including DeepSeek‑R1, OpenAI o1 and o3, GPT‑4o, Qwen‑32B, and various Llama variants, significantly enhancing their reasoning capabilities. In this paper, we present a comprehensive review of the top 27 LLMs released between 2023 and 2025, such as Mistral AI Small 3 24B, DeepSeek‑R1, Search‑o1, QwQ‑32B, and Phi‑4, and analyze their core innovations and performance improvements. We also provide a detailed overview of recent advancements in multilingual large language models (MLLMs), emphasizing methods that improve cross‑lingual reasoning and address the limitations of English‑centric training. In parallel, we present a comprehensive review of progress in state space model (SSM)‑based architectures, including models such as Mamba, which demonstrate improved efficiency for long‑context processing compared to transformer‑based approaches. Our analysis covers training strategies including general optimization techniques, mixture‑of‑experts (MoE) configurations, retrieval‑augmented generation (RAG), chain‑of‑thought prompting, self‑improvement methods, and test‑time compute scaling and distillation frameworks. Finally, we identify key challenges for future research, including enabling multi‑step reasoning without human supervision, improving robustness in chained task execution, balancing structured prompting with generative flexibility, and enhancing the integration of long‑context retrieval and external tools.

PaperID: 11072, https://arxiv.org/pdf/2503.22726.pdf

Authors: Yue Yin

Title: InfoBid: A Simulation Framework for Studying Information Disclosure in Auctions with Large Language Model-based Agents

Abstract:
In online advertising systems, publishers often face a trade‑off in information disclosure strategies: while disclosing more information can enhance efficiency by enabling optimal allocation of ad impressions, it may lose revenue potential by decreasing uncertainty among competing advertisers. Similar to other challenges in market design, understanding this trade‑off is constrained by limited access to real‑world data, leading researchers and practitioners to turn to simulation frameworks. The recent emergence of large language models (LLMs) offers a novel approach to simulations, providing human‑like reasoning and adaptability without necessarily relying on explicit assumptions about agent behavior modeling. Despite their potential, existing frameworks have yet to integrate LLM‑based agents for studying information asymmetry and signaling strategies, particularly in the context of auctions. To address this gap, we introduce InfoBid, a flexible simulation framework that leverages LLM agents to examine the effects of information disclosure strategies in multi‑agent auction settings. Using GPT‑4o, we implemented simulations of second‑price auctions with diverse information schemas. The results reveal key insights into how signaling influences strategic behavior and auction outcomes, which align with both economic and social learning theories. Through InfoBid, we hope to foster the use of LLMs as proxies for human economic and social agents in empirical studies, enhancing our understanding of their capabilities and limitations. This work bridges the gap between theoretical market designs and practical applications, advancing research in market simulations, information design, and agent‑based reasoning while offering a valuable tool for exploring the dynamics of digital economies.

PaperID: 11073, https://arxiv.org/pdf/2503.22718.pdf

Authors: Tianming Liu, Jirong Yang, Yafeng Yin

Title: LLM-ABM for Transportation: Assessing the Potential of LLM Agents in System Analysis

Abstract:
Agent‑based modeling approaches represent the state‑of‑art in modeling travel demand and transportation system dynamics and are valuable tools for transportation planning. However, established agent‑based approaches in transportation rely on multi‑hierarchical mathematical models to simulate travel behavior, which faces theoretical and practical limitations. The advent of large language models (LLM) provides a new opportunity to refine agent‑based modeling in transportation. LLM agents, which have impressive reasoning and planning abilities, can serve as a proxy of human travelers and be integrated into the modeling framework. However, despite evidence of their behavioral soundness, no existing studies have assessed the impact and validity of LLM‑agent‑based simulations from a system perspective in transportation. This paper aims to address this issue by designing and integrating LLM agents with human‑traveler‑like characteristics into a simulation of a transportation system and assessing its performance based on existing benchmarks. Using the classical transportation setting of the morning commute, we find that not only do the agents exhibit fine behavioral soundness, but also produce system dynamics that align well with standard benchmarks. Our analysis first verifies the effectiveness and potential of LLM‑agent‑based modeling for transportation planning on the system level.

PaperID: 11074, https://arxiv.org/pdf/2503.22675.pdf

Authors: Jiakai Tang, Sunhao Dai, Teng Shi, Jun Xu, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang

Title: Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation

Abstract:
Sequential Recommendation (SeqRec) aims to predict the next item by capturing sequential patterns from users' historical interactions, playing a crucial role in many real‑world recommender systems. However, existing approaches predominantly adopt a direct forward computation paradigm, where the final hidden state of the sequence encoder serves as the user representation. We argue that this inference paradigm, due to its limited computational depth, struggles to model the complex evolving nature of user preferences and lacks a nuanced understanding of long‑tail items, leading to suboptimal performance. To address this issue, we propose ReaRec, the first inference‑time computing framework for recommender systems, which enhances user representations through implicit multi‑step reasoning. Specifically, ReaRec autoregressively feeds the sequence's last hidden state into the sequential recommender while incorporating special reasoning position embeddings to decouple the original item encoding space from the multi‑step reasoning space. Moreover, we introduce two lightweight reasoning‑based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to further effectively exploit ReaRec's reasoning potential. Extensive experiments on five public real‑world datasets and different SeqRec architectures demonstrate the generality and effectiveness of our proposed ReaRec. Remarkably, post‑hoc analyses reveal that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30%‑50%. Thus, we believe this work can open a new and promising avenue for future research in inference‑time computing for sequential recommendation.

PaperID: 11075, https://arxiv.org/pdf/2503.22512.pdf

Authors: Wenqiang Luo, Jacky Wai Keung, Boyang Yang, Jacques Klein, Tegawende F. Bissyande, Haoye Tian, Bach Le

Title: Unlocking LLM Repair Capabilities Through Cross-Language Translation and Multi-Agent Refinement

Abstract:
Recent advances in leveraging LLMs for APR have demonstrated impressive capabilities in fixing software defects. However, current LLM‑based approaches predominantly focus on mainstream programming languages like Java and Python, neglecting less prevalent but emerging languages such as Rust due to expensive training resources, limited datasets, and insufficient community support. This narrow focus creates a significant gap in repair capabilities across the programming language spectrum, where the full potential of LLMs for comprehensive multilingual program repair remains largely unexplored. To address this limitation, we introduce a novel cross‑language program repair approach LANTERN that leverages LLMs' differential proficiency across languages through a multi‑agent iterative repair paradigm. Our technique strategically translates defective code from languages where LLMs exhibit weaker repair capabilities to languages where they demonstrate stronger performance, without requiring additional training. A key innovation of our approach is an LLM‑based decision‑making system that dynamically selects optimal target languages based on bug characteristics and continuously incorporates feedback from previous repair attempts. We evaluate our method on xCodeEval, a comprehensive multilingual benchmark comprising 5,068 bugs across 11 programming languages. Results demonstrate significant enhancement in repair effectiveness, particularly for underrepresented languages, with Rust showing a 22.09% improvement in Pass@10 metrics. Our research provides the first empirical evidence that cross‑language translation significantly expands the repair capabilities of LLMs and effectively bridges the performance gap between programming languages with different levels of popularity, opening new avenues for truly language‑agnostic automated program repair.

PaperID: 11076, https://arxiv.org/pdf/2503.22473.pdf

Authors: Hanchao Liu, Rongjun Li, Weimin Xiong, Ziyu Zhou, Wei Peng

Title: WorkTeam: Constructing Workflows from Natural Language with Multi-Agents

Abstract:
Workflows play a crucial role in enhancing enterprise efficiency by orchestrating complex processes with multiple tools or components. However, hand‑crafted workflow construction requires expert knowledge, presenting significant technical barriers. Recent advancements in Large Language Models (LLMs) have improved the generation of workflows from natural language instructions (aka NL2Workflow), yet existing single LLM agent‑based methods face performance degradation on complex tasks due to the need for specialized knowledge and the strain of task‑switching. To tackle these challenges, we propose WorkTeam, a multi‑agent NL2Workflow framework comprising a supervisor, orchestrator, and filler agent, each with distinct roles that collaboratively enhance the conversion process. As there are currently no publicly available NL2Workflow benchmarks, we also introduce the HW‑NL2Workflow dataset, which includes 3,695 real‑world business samples for training and evaluation. Experimental results show that our approach significantly increases the success rate of workflow construction, providing a novel and effective solution for enterprise NL2Workflow services.

PaperID: 11077, https://arxiv.org/pdf/2503.22164.pdf

Authors: Bowen Gao, Yanwen Huang, Yiqiao Liu, Wenxuan Xie, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan

Title: PharmAgents: Building a Virtual Pharma with Large Language Model Agents

Abstract:
The discovery of novel small molecule drugs remains a critical scientific challenge with far‑reaching implications for treating diseases and advancing human health. Traditional drug development‑‑especially for small molecule therapeutics‑‑is a highly complex, resource‑intensive, and time‑consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligence (AI), particularly the rise of large language models (LLMs), present a transformative opportunity to streamline and accelerate this process. In this paper, we introduce PharmAgents, a virtual pharmaceutical ecosystem driven by LLM‑based multi‑agent collaboration. PharmAgents simulates the full drug discovery workflow‑‑from target discovery to preclinical evaluation‑‑by integrating explainable, LLM‑driven agents equipped with specialized machine learning models and computational tools. Through structured knowledge exchange and automated optimization, PharmAgents identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility. Additionally, the system supports interpretability, agent interaction, and self‑evolvement, enabling it to refine future drug designs based on prior experience. By showcasing the potential of LLM‑powered multi‑agent systems in drug discovery, this work establishes a new paradigm for autonomous, explainable, and scalable pharmaceutical research, with future extensions toward comprehensive drug lifecycle management.

PaperID: 11078, https://arxiv.org/pdf/2503.22162.pdf

Authors: Ning Liu, Sen Shen, Xiangrui Kong, Hongtao Zhang, Thomas Bräunl

Title: Cooperative Hybrid Multi-Agent Pathfinding Based on Shared Exploration Maps

Abstract:
Multi‑Agent Pathfinding is used in areas including multi‑robot formations, warehouse logistics, and intelligent vehicles. However, many environments are incomplete or frequently change, making it difficult for standard centralized planning or pure reinforcement learning to maintain both global solution quality and local flexibility. This paper introduces a hybrid framework that integrates D Lite global search with multi‑agent reinforcement learning, using a switching mechanism and a freeze‑prevention strategy to handle dynamic conditions and crowded settings. We evaluate the framework in the discrete POGEMA environment and compare it with baseline methods. Experimental outcomes indicate that the proposed framework substantially improves success rate, collision rate, and path efficiency. The model is further tested on the EyeSim platform, where it maintains feasible Pathfinding under frequent changes and large‑scale robot deployments.

PaperID: 11079, https://arxiv.org/pdf/2503.22122.pdf

Authors: Puzhen Yuan, Angyuan Ma, Yunchao Yao, Huaxiu Yao, Masayoshi Tomizuka, Mingyu Ding

Title: REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation

Abstract:
Vision‑language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long‑horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task‑specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finds the door was closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi‑agent planning framework, termed REMAC, that enables efficient, scene‑agnostic multi‑robot long‑horizon task planning and execution through continuous reflection and self‑evolution. REMAC incorporates two key modules: a self‑reflection module performing pre‑condition and post‑condition checks in the loop to evaluate progress and refine plans, and a self‑evolvement module dynamically adapting plans based on scene‑specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task‑specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC's effectiveness, we build a multi‑agent environment for long‑horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state‑of‑the‑art reasoning models, including DeepSeek‑R1, o3‑mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.

PaperID: 11080, https://arxiv.org/pdf/2503.22038.pdf

Authors: Ngoc Tuong Vy Nguyen, Felix D Childress, Yunting Yin

Title: Debate-Driven Multi-Agent LLMs for Phishing Email Detection

Abstract:
Phishing attacks remain a critical cybersecurity threat. Attackers constantly refine their methods, making phishing emails harder to detect. Traditional detection methods, including rule‑based systems and supervised machine learning models, either rely on predefined patterns like blacklists, which can be bypassed with slight modifications, or require large datasets for training and still can generate false positives and false negatives. In this work, we propose a multi‑agent large language model (LLM) prompting technique that simulates debates among agents to detect whether the content presented on an email is phishing. Our approach uses two LLM agents to present arguments for or against the classification task, with a judge agent adjudicating the final verdict based on the quality of reasoning provided. This debate mechanism enables the models to critically analyze contextual cue and deceptive patterns in text, which leads to improved classification accuracy. The proposed framework is evaluated on multiple phishing email datasets and demonstrate that mixed‑agent configurations consistently outperform homogeneous configurations. Results also show that the debate structure itself is sufficient to yield accurate decisions without extra prompting strategies.

PaperID: 11081, https://arxiv.org/pdf/2503.21807.pdf

Authors: Yuan Wei, Xiaohan Shan, Jianmin Li

Title: LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single‑agent RL: credit assignment in cooperative tasks and partial observability of environmental states. We propose LERO, a framework integrating Large language models (LLMs) with evolutionary optimization to address these MARL‑specific challenges. The solution centers on two LLM‑generated components: a hybrid reward function that dynamically allocates individual credit through reward decomposition, and an observation enhancement function that augments partial observations with inferred environmental context. An evolutionary algorithm optimizes these components through iterative MARL training cycles, where top‑performing candidates guide subsequent LLM generations. Evaluations in Multi‑Agent Particle Environments (MPE) demonstrate LERO's superiority over baseline methods, with improved task performance and training efficiency.

PaperID: 11082, https://arxiv.org/pdf/2503.21760.pdf

Authors: Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, Yassine Benajiba

Title: MemInsight: Autonomous Memory Augmentation for LLM Agents

Abstract:
Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long‑term memory capabilities, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation to historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios; conversational recommendation, question answering and event summarization. On the LLM‑REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.

PaperID: 11083, https://arxiv.org/pdf/2503.21735.pdf

Authors: Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu

Title: GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

Abstract:
Ensuring reliable data‑driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes. Decision support in such domains relies on large tabular datasets, where manual analysis is slow, costly, and error‑prone. While Large Language Models (LLMs) offer promising automation potential, they face challenges in analytical reasoning, structured data handling, and ambiguity resolution. This paper introduces GateLens, an LLM‑based architecture for reliable analysis of complex tabular data. Its key innovation is the use of Relational Algebra (RA) as a formal intermediate representation between natural‑language reasoning and executable code, addressing the reasoning‑to‑code gap that can arise in direct generation approaches. In our automotive instantiation, GateLens translates natural language queries into RA expressions and generates optimized Python code. Unlike traditional multi‑agent or planning‑based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability. We validate the architecture in automotive software release analytics, where experimental results show that GateLens outperforms the existing Chain‑of‑Thought (CoT) + Self‑Consistency (SC) based system on real‑world datasets, particularly in handling complex and ambiguous queries. Ablation studies confirm the essential role of the RA layer. Industrial deployment demonstrates over 80% reduction in analysis time while maintaining high accuracy across domain‑specific tasks. GateLens operates effectively in zero‑shot settings without requiring few‑shot examples or agent orchestration. This work advances deployable LLM system design by identifying key architectural features‑‑intermediate formal representations, execution efficiency, and low configuration overhead‑‑crucial for domain‑specific analytical applications.

PaperID: 11084, https://arxiv.org/pdf/2503.21557.pdf

Authors: Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, Marc-Alexandre Côté

Title: debug-gym: A Text-Based Environment for Interactive Debugging

Abstract:
Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug‑gym, for developing LLM‑based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM‑based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information‑seeking behavior by an LLM agent.

PaperID: 11085, https://arxiv.org/pdf/2503.21548.pdf

Authors: Viet-Anh Le, Panagiotis Kounatidis, Andreas A. Malikopoulos

Title: Combining Graph Attention Networks and Distributed Optimization for Multi-Robot Mixed-Integer Convex Programming

Abstract:
In this paper, we develop a fast mixed‑integer convex programming (MICP) framework for multi‑robot navigation by combining graph attention networks and distributed optimization. We formulate a mixed‑integer optimization problem for receding horizon motion planning of a multi‑robot system, taking into account the surrounding obstacles. To address the resulting multi‑agent MICP problem in real time, we propose a framework that utilizes heterogeneous graph attention networks to learn the latent mapping from problem parameters to optimal binary solutions. Furthermore, we apply a distributed proximal alternating direction method of multipliers algorithm for solving the convex continuous optimization problem. We demonstrate the effectiveness of our proposed framework through experiments conducted on a robotic testbed.

PaperID: 11086, https://arxiv.org/pdf/2503.21263.pdf

Authors: Wenxuan Lu, Jiangyang He, Zhanqiu Zhang, Yiwen Guo, Tianning Zang

Title: Cultivating Game Sense for Yourself: Making VLMs Gaming Experts

Abstract:
Developing agents capable of fluid gameplay in first/third‑person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non‑fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real‑time game interactions, elevating VLM to a high‑level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task‑specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action‑feedback logic, ranging from direct action rules to neural network‑based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game‑playing agents.

PaperID: 11087, https://arxiv.org/pdf/2503.21200.pdf

Authors: Sicong Liu, Yang Shu, Chenjuan Guo, Bin Yang

Title: Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent Cooperation

Abstract:
Learning cooperative multi‑agent policy from offline multi‑task data that can generalize to unseen tasks with varying numbers of agents and targets is an attractive problem in many scenarios. Although aggregating general behavior patterns among multiple tasks as skills to improve policy transfer is a promising approach, two primary challenges hinder the further advancement of skill learning in offline multi‑task MARL. Firstly, extracting general cooperative behaviors from various action sequences as common skills lacks bringing cooperative temporal knowledge into them. Secondly, existing works only involve common skills and can not adaptively choose independent knowledge as task‑specific skills in each task for fine‑grained action execution. To tackle these challenges, we propose Hierarchical and Separate Skill Discovery (HiSSD), a novel approach for generalizable offline multi‑task MARL through skill learning. HiSSD leverages a hierarchical framework that jointly learns common and task‑specific skills. The common skills learn cooperative temporal knowledge and enable in‑sample exploitation for offline multi‑task MARL. The task‑specific skills represent the priors of each task and achieve a task‑guided fine‑grained action execution. To verify the advancement of our method, we conduct experiments on multi‑agent MuJoCo and SMAC benchmarks. After training the policy using HiSSD on offline multi‑task data, the empirical results show that HiSSD assigns effective cooperative behaviors and obtains superior performance in unseen tasks.

PaperID: 11088, https://arxiv.org/pdf/2503.21141.pdf

Authors: Seth Farrell, Chenghao Li, Hongzhan Yu, Ryo Yoshimitsu, Sicun Gao, Henrik I. Christensen

Title: Safe Human Robot Navigation in Warehouse Scenario

Abstract:
The integration of autonomous mobile robots (AMRs) in industrial environments, particularly warehouses, has revolutionized logistics and operational efficiency. However, ensuring the safety of human workers in dynamic, shared spaces remains a critical challenge. This work proposes a novel methodology that leverages control barrier functions (CBFs) to enhance safety in warehouse navigation. By integrating learning‑based CBFs with the Open Robotics Middleware Framework (OpenRMF), the system achieves adaptive and safety‑enhanced controls in multi‑robot, multi‑agent scenarios. Experiments conducted using various robot platforms demonstrate the efficacy of the proposed approach in avoiding static and dynamic obstacles, including human pedestrians. Our experiments evaluate different scenarios in which the number of robots, robot platforms, speed, and number of obstacles are varied, from which we achieve promising performance.

PaperID: 11089, https://arxiv.org/pdf/2503.21138.pdf

Authors: Hedong Yan

Title: A Computational Theory for Efficient Mini Agent Evaluation with Causal Guarantees

Abstract:
In order to reduce the cost of experimental evaluation for agents, we introduce a computational theory of evaluation for mini agents: build evaluation model to accelerate the evaluation procedures. We prove upper bounds of generalized error and generalized causal effect error of given evaluation models for infinite agents. We also prove efficiency, and consistency to estimated causal effect from deployed agents to evaluation metric by prediction. To learn evaluation models, we propose a meta‑learner to handle heterogeneous agents space problem. Comparing with existed evaluation approaches, our (conditional) evaluation model reduced 24.1% to 99.0% evaluation errors across 12 scenes, including individual medicine, scientific simulation, social experiment, business activity, and quantum trade. The evaluation time is reduced 3 to 7 order of magnitude per subject comparing with experiments or simulations.

PaperID: 11090, https://arxiv.org/pdf/2503.20986.pdf

Authors: Chris Santos-Lang

Title: MAD Chairs: A new tool to evaluate AI

Abstract:
This paper contributes a new way to evaluate AI. Much as one might evaluate a machine in terms of its performance at chess, this approach involves evaluating a machine in terms of its performance at a game called "MAD Chairs". At the time of writing, evaluation with this game exposed opportunities to improve Claude, Gemini, ChatGPT, Qwen and DeepSeek. Furthermore, this paper sets a stage for future innovation in game theory and AI safety by providing an example of success with non‑standard approaches to each: studying a game beyond the scope of previous game theoretic tools and mitigating a serious AI safety risk in a way that requires neither determination of values nor their enforcement.

PaperID: 11091, https://arxiv.org/pdf/2503.20916.pdf

Authors: Cosima du Pasquier, Jennifer Grannen, Chuer Pan, Serin L. Huber, Aliyah Smith, Monroe Kennedy, Shuran Song, Dorsa Sadigh, Allison M. Okamura

Title: A Study of Perceived Safety for Soft Robotics in Caregiving Tasks

Abstract:
In this project, we focus on human‑robot interaction in caregiving scenarios like bathing, where physical contact is inevitable and necessary for proper task execution because force must be applied to the skin. Using finite element analysis, we designed a 3D‑printed gripper combining positive and negative pressure for secure yet compliant handling. Preliminary tests showed it exerted a lower, more uniform pressure profile than a standard rigid gripper. In a user study, participants' trust in robots significantly increased after they experienced a brief bathing demonstration performed by a robotic arm equipped with the soft gripper. These results suggest that soft robotics can enhance perceived safety and acceptance in intimate caregiving scenarios.

PaperID: 11092, https://arxiv.org/pdf/2503.20791.pdf

Authors: John Murzaku, Zifan Liu, Vaishnavi Muppala, Md Mehrab Tanjim, Xiang Chen, Yunyao Li

Title: ECLAIR: Enhanced Clarification for Interactive Responses in an Enterprise AI Assistant

Abstract:
Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in real‑world, enterprise‑level interactions, where context and domain‑specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi‑agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real‑world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few‑shot methods.

PaperID: 11093, https://arxiv.org/pdf/2503.20772.pdf

Authors: Ilia Shilov, Ezzat Elokda, Sophie Hall, Heinrich H. Nax, Saverio Bolognani

Title: Welfare and Cost Aggregation for Multi-Agent Control: When to Choose Which Social Cost Function, and Why?

Abstract:
Many multi‑agent socio‑technical systems rely on aggregating heterogeneous agents' costs into a social cost function (SCF) to coordinate resource allocation in domains like energy grids, water allocation, or traffic management. The choice of SCF often entails implicit assumptions and may lead to undesirable outcomes if not rigorously justified. In this paper, we demonstrate that what determines which SCF ought to be used is the degree to which individual costs can be compared across agents and which axioms the aggregation shall fulfill. Drawing on the results from social choice theory, we provide guidance on how this process can be used in control applications. We demonstrate which assumptions about interpersonal utility comparability ‑ ranging from ordinal level comparability to full cardinal comparability ‑ together with a choice of desirable axioms, inform the selection of a correct SCF, be it the classical utilitarian sum, the Nash SCF, or maximin. We then demonstrate how the proposed framework can be applied for principled allocations of water and transportation resources.

PaperID: 11094, https://arxiv.org/pdf/2503.20749.pdf

Authors: Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Yisi Sang, Zheshen, Wang, Qi He, Dakuo Wang

Title: Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Abstract:
Recent research shows that LLM Agents can generate ``believable'' human behaviors via prompt‑only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step‑by‑step actions mimicking a particular human's behavior in a multi‑turn interaction task. In this work, we take shopping as a case study and present the first large‑scale quantitative evaluation of state‑of‑the‑art LLMs' ability to accurately simulate human behavior. Using real‑world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt‑based LLMs (DeepSeek‑R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine‑tuning LLMs on real human click‑through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine‑tuned Qwen2.5‑7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt‑only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.

PaperID: 11095, https://arxiv.org/pdf/2503.20685.pdf

Authors: Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni

Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Abstract:
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully‑supervised learning, weakly‑supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo‑mask generation algorithms. In this study, we introduce a novel multi‑agent reinforcement learning‑based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel‑based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under‑ and over‑segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in‑house BUS and ABUS datasets, our Flip Learning method outperforms state‑of‑the‑art WSS methods and foundation models, and achieves comparable performance as fully‑supervised learning algorithms.

PaperID: 11096, https://arxiv.org/pdf/2503.20666.pdf

Authors: Huimin Xu, Seungjun Yi, Terence Lim, Jiawei Xu, Andrew Well, Carlos Mery, Aidong Zhang, Yuji Zhang, Heng Ji, Keshav Pingali, Yan Leng, Ying Ding

Title: TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

Abstract:
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource‑intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human‑AI Collaborative Thematic Analysis framework using Multi‑Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi‑agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM‑assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi‑agent LLM systems with human‑in‑the‑loop integration by enhancing quality while significantly reducing manual workload.

PaperID: 11097, https://arxiv.org/pdf/2503.20665.pdf

Authors: Johannes Lips, Boyana Georgieva, Dominik Schlipf, Hendrik Lens

Title: Agent-Based Analysis of the Impact of Near Real-Time Data and Smart Balancing on the Frequency Stability of Power Systems

Abstract:
Single imbalance pricing provides an incentive to balance responsible parties (BRPs) to intentionally introduce power schedule deviations in order to reduce the control area imbalance and receive a remuneration through the imbalance settlement mechanism. This is called smart balancing or passive balancing and is actively encouraged in, e.g., the Netherlands and Belgium through the publication of near real‑time (NRT) data on the control area imbalance by the transmission system operator. It is known that under certain conditions, smart balancing can deteriorate the frequency stability of the power system. This paper examines how the publication of different types of NRT data affects smart balancing and the frequency stability. A Monte‑Carlo simulation of a dynamic multi‑agent model is performed to analyse the effects of smart balancing with different parameters for the agents and the environment, using historical time series of the power imbalance of the German control block as a basis. It is found that smart balancing can significantly reduce the amount and cost of frequency restoration reserve activation, but leads to a general increase of the frequency variability. Depending on the type of NRT data and agent parameters, the frequency stability margins are also reduced. The negative effects on the frequency stability are stronger when NRT data is published using large bins and with long delays.

PaperID: 11098, https://arxiv.org/pdf/2503.20561.pdf

Authors: Ryumei Nakada, Wenlong Ji, Tianxi Cai, James Zou, Linjun Zhang

Title: A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

Abstract:
Prompt engineering has emerged as a powerful technique for guiding large language models (LLMs) toward desired responses, significantly enhancing their performance across diverse tasks. Beyond their role as static predictors, LLMs increasingly function as intelligent agents, capable of reasoning, decision‑making, and adapting dynamically to complex environments. However, the theoretical underpinnings of prompt engineering remain largely unexplored. In this paper, we introduce a formal framework demonstrating that transformer models, when provided with carefully designed prompts, can act as a configurable computational system by emulating a ``virtual'' neural network during inference. Specifically, input prompts effectively translate into the corresponding network configuration, enabling LLMs to adjust their internal computations dynamically. Building on this construction, we establish an approximation theory for β‑times differentiable functions, proving that transformers can approximate such functions with arbitrary precision when guided by appropriately structured prompts. Moreover, our framework provides theoretical justification for several empirically successful prompt engineering techniques, including the use of longer, structured prompts, filtering irrelevant information, enhancing prompt token diversity, and leveraging multi‑agent interactions. By framing LLMs as adaptable agents rather than static models, our findings underscore their potential for autonomous reasoning and problem‑solving, paving the way for more robust and theoretically grounded advancements in prompt engineering and AI agent design.

PaperID: 11099, https://arxiv.org/pdf/2503.20550.pdf

Authors: Tomoki Nakamigawa, Tadashi Sakuma

Title: Structural Origins of Cubic Complexity in Pebble Motion

Abstract:
The pebble motion problem (PMP) asks whether one configuration of labeled pebbles on a graph can be transformed into another by moving pebbles to adjacent unoccupied vertices. It is a fundamental model of graph reconfiguration and is closely related to multi‑agent path finding (MAPF). A central open problem since Kornhauser, Miller, and Spirakis (FOCS 1984) is to understand the origin of the classical Θ(N^3) worst‑case behavior. While it is known that every feasible instance on an N‑vertex graph admits a solution sequence of length \Ord(N^3), it has remained unclear which instances actually require cubic complexity. In this paper, we resolve the long‑standing complexity of the pebble motion problem on trees. We show that every feasible instance on an N‑vertex tree admits a solution sequence of length \Ord(N^2 \log N), computable by an output‑sensitive algorithm. Since a lower bound of Ω(N^2) is known, this establishes that the Θ(N^3) phenomenon does not occur on trees and nearly closes the gap Ω(N^2)\le \OPT(N)\le \Ord(N^3) up to a logarithmic factor. Building on this result, we extend our approach to general graphs by applying the tree algorithm to breadth‑first spanning trees. This yields an efficient framework that produces o(N^3)‑length solution sequences for a broad class of instances, including the classical square‑grid example, where we recover the \Ord(N^3/2) bound observed by Kornhauser, Miller, and Spirakis. Finally, by analyzing the behavior of this algorithm, we obtain strong structural restrictions governing when Θ(N^3) complexity can arise. We show that such behavior is possible only under highly constrained conditions, specifically when Θ(N) degree‑two vertices lie on cycles of length Θ(N), with each cycle being the shortest containing the corresponding vertex.

PaperID: 11100, https://arxiv.org/pdf/2503.20536.pdf

Authors: Yiran Zhang, Ruiyin Li, Peng Liang, Weisong Sun, Yang Liu

Title: Knowledge-Based Multi-Agent Framework for Automated Software Architecture Design

Abstract:
Architecture design is a critical step in software development. However, creating a high‑quality architecture is often costly due to the significant need for human expertise and manual effort. Recently, agents built upon Large Language Models (LLMs) have achieved remarkable success in various software engineering tasks. Despite this progress, the use of agents to automate the architecture design process remains largely unexplored. To address this gap, we envision a Knowledge‑based Multi‑Agent Architecture Design (MAAD) framework. MAAD uses agents to simulate human roles in the traditional software architecture design process, thereby automating the design process. To empower these agents, MAAD incorporates knowledge extracted from three key sources: 1) existing system designs, 2) authoritative literature, and 3) architecture experts. By envisioning the MAAD framework, we aim to advance the full automation of application‑level system development.

PaperID: 11101, https://arxiv.org/pdf/2503.20523.pdf

Authors: Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, Gianluca Corrado

Title: GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Abstract:
Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain‑specific requirements of autonomous driving ‑ such as multi‑agent interactions, fine‑grained control, and multi‑camera consistency. We introduce GAIA‑2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA‑2 supports controllable video generation conditioned on a rich set of structured inputs: ego‑vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high‑resolution, spatiotemporally consistent multi‑camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA‑2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia‑2.

PaperID: 11102, https://arxiv.org/pdf/2503.20507.pdf

Authors: Rakesh Nadig, Vamanan Arulchelvan, Rahul Bera, Taha Shahroodi, Gagandeep Singh, Andreas Kakolyris, Ismail Emir Yuksel, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

Title: Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning

Abstract:
Modern high‑performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and cost requirements of data‑intensive applications. The performance of an HSS highly depends on two key data‑management policies: (1) data placement, which determines the most suitable storage device to store application data, and (2) data migration, which dynamically reorganizes previously‑stored data across storage devices (i.e., prefetching hot data and evicting cold data) to sustain high HSS performance. These policies are tightly interdependent, and thus, improving one without considering the other leads to suboptimal HSS performance. Unfortunately, prior works focus on optimizing only one of the policies. Our goal is to design a holistic data‑management technique that optimizes both data‑placement and data‑migration policies to fully exploit the potential of an HSS. To this end, we propose Harmonia, a multi‑agent reinforcement learning (RL)‑based data‑management technique. Harmonia employs two lightweight autonomous RL agents, a data‑placement agent and a data‑migration agent, that adapt their policies for the current workload and HSS configuration while coordinating with each other. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and 25 data‑intensive workloads. On a performance‑ (cost‑) optimized HSS with two heterogeneous storage devices, Harmonia outperforms the best‑performing prior approach by 29.3% (44.8%) on average. On an HSS with three (four) devices, Harmonia outperforms the best‑performing prior work by 38.9% (39.2%) on average. Harmonia's performance benefits come with low latency (240 ns for inference) and storage (206 KiB in DRAM for both RL agents combined) overheads.

PaperID: 11103, https://arxiv.org/pdf/2503.20479.pdf

Authors: Robert Lupoiu, Yixuan Shao, Tianxiang Dai, Chenkai Mao, Kofi Edee, Jonathan A. Fan

Title: A multi-agentic framework for real-time, autonomous freeform metasurface design

Abstract:
Innovation in nanophotonics currently relies on human experts who synergize specialized knowledge in photonics and coding with simulation and optimization algorithms, entailing design cycles that are time‑consuming, computationally demanding, and frequently suboptimal. We introduce MetaChat, a multi‑agentic design framework that can translate semantically described photonic design goals into high‑performance, freeform device layouts in an automated, nearly real‑time manner. Multi‑step reasoning is enabled by our Agentic Iterative Monologue (AIM) paradigm, which coherently interfaces agents with code‑based tools, other specialized agents, and human designers. Design acceleration is facilitated by Feature‑wise Linear Modulation‑conditioned Maxwell surrogate solvers that support the generalized evaluation of metasurface structures. We use freeform dielectric metasurfaces as a model system and demonstrate with MetaChat the design of multi‑objective, multi‑wavelength metasurfaces orders of magnitude faster than conventional methods. These concepts present a scientific computing blueprint for utilizing specialist design agents, surrogate solvers, and human interactions to drive multi‑physics innovation and discovery.

PaperID: 11104, https://arxiv.org/pdf/2503.20462.pdf

Authors: Ruoqi Wen, Rongpeng Li, Xing Xu, Zhifeng Zhao

Title: Multi-agent Uncertainty-Aware Pessimistic Model-Based Reinforcement Learning for Connected Autonomous Vehicles

Abstract:
Deep Reinforcement Learning (DRL) holds significant promise for achieving human‑like Autonomous Vehicle (AV) capabilities, but suffers from low sample efficiency and challenges in reward design. Model‑Based Reinforcement Learning (MBRL) offers improved sample efficiency and generalizability compared to Model‑Free Reinforcement Learning (MFRL) in various multi‑agent decision‑making scenarios. Nevertheless, MBRL faces critical difficulties in estimating uncertainty during the model learning phase, thereby limiting its scalability and applicability in real‑world scenarios. Additionally, most Connected Autonomous Vehicle (CAV) studies focus on single‑agent decision‑making, while existing multi‑agent MBRL solutions lack computationally tractable algorithms with Probably Approximately Correct (PAC) guarantees, an essential factor for ensuring policy reliability with limited training data. To address these challenges, we propose MA‑PMBRL, a novel Multi‑Agent Pessimistic Model‑Based Reinforcement Learning framework for CAVs, incorporating a max‑min optimization approach to enhance robustness and decision‑making. To mitigate the inherent subjectivity of uncertainty estimation in MBRL and avoid incurring catastrophic failures in AV, MA‑PMBRL employs a pessimistic optimization framework combined with Projected Gradient Descent (PGD) for both model and policy learning. MA‑PMBRL also employs general function approximations under partial dataset coverage to enhance learning efficiency and system‑level performance. By bounding the suboptimality of the resulting policy under mild theoretical assumptions, we successfully establish PAC guarantees for MA‑PMBRL, demonstrating that the proposed framework represents a significant step toward scalable, efficient, and reliable multi‑agent decision‑making for CAVs.

PaperID: 11105, https://arxiv.org/pdf/2503.20425.pdf

Authors: Kevin Alcedo, Pedro U. Lima, Rachid Alami

Title: Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation

Abstract:
Navigating in environments alongside humans requires agents to reason under uncertainty and account for the beliefs and intentions of those around them. Under a sequential decision‑making framework, egocentric navigation can naturally be represented as a Markov Decision Process (MDP). However, social navigation additionally requires reasoning about the hidden beliefs of others, inherently leading to a Partially Observable Markov Decision Process (POMDP), where agents lack direct access to others' mental states. Inspired by Theory of Mind and Epistemic Planning, we propose (1) a neuro‑symbolic model‑based reinforcement learning architecture for social navigation, addressing the challenge of belief tracking in partially observable environments; and (2) a perspective‑shift operator for belief estimation, leveraging recent work on Influence‑based Abstractions (IBA) in structured multi‑agent settings.

PaperID: 11106, https://arxiv.org/pdf/2503.20324.pdf

Authors: Junkai Jiang, Ruochen Li, Yibin Yang, Yihe Chen, Yuning Wang, Shaobing Xu, Jianqiang Wang

Title: CTS-CBS: A New Approach for Multi-Agent Collaborative Task Sequencing and Path Finding

Abstract:
This paper addresses a generalization problem of Multi‑Agent Pathfinding (MAPF), called Collaborative Task Sequencing ‑ Multi‑Agent Pathfinding (CTS‑MAPF), where agents must plan collision‑free paths and visit a series of intermediate task locations in a specific order before reaching their final destinations. To address this problem, we propose a new approach, Collaborative Task Sequencing ‑ Conflict‑Based Search (CTS‑CBS), which conducts a two‑level search. In the high level, it generates a search forest, where each tree corresponds to a joint task sequence derived from the jTSP solution. In the low level, CTS‑CBS performs constrained single‑agent path planning to generate paths for each agent while adhering to high‑level constraints. We also provide heoretical guarantees of its completeness and optimality (or sub‑optimality with a bounded parameter). To evaluate the performance of CTS‑CBS, we create two datasets, CTS‑MAPF and MG‑MAPF, and conduct comprehensive experiments. The results show that CTS‑CBS adaptations for MG‑MAPF outperform baseline algorithms in terms of success rate (up to 20 times larger) and runtime (up to 100 times faster), with less than a 10% sacrifice in solution quality. Furthermore, CTS‑CBS offers flexibility by allowing users to adjust the sub‑optimality bound omega to balance between solution quality and efficiency. Finally, practical robot tests demonstrate the algorithm's applicability in real‑world scenarios.

PaperID: 11107, https://arxiv.org/pdf/2503.20262.pdf

Authors: Tawfiq Ammari, Anna Gutowska, Jacob Ziff, Casey Randazzo, Harihan Subramonyam

Title: From the CDC to emerging infectious disease publics: The long-now of polarizing and complex health crises

Abstract:
This study examines how public discourse around COVID‑19 unfolded on Twitter through the lens of crisis communication and digital publics. Analyzing over 275,000 tweets involving the CDC, we identify 16 distinct discourse clusters shaped by framing, sentiment, credibility, and network dynamics. We find that CDC messaging became a flashpoint for affective and ideological polarization, with users aligning along competing frames of science vs. freedom, and public health vs. political overreach. Most clusters formed echo chambers, while a few enabled cross cutting dialogue. Publics emerged not only around ideology but also around topical and emotional stakes, reflecting shifting concerns across different stages of the pandemic. While marginalized communities raised consistent equity concerns, these narratives struggled to reshape broader discourse. Our findings highlight the importance of long‑term, adaptive engagement with diverse publics and propose design interventions such as multi‑agent AI assistants, to support more inclusive communication throughout extended public health crises.

PaperID: 11108, https://arxiv.org/pdf/2503.20105.pdf

Authors: Ran Tian, Kratarth Goel

Title: Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Models Using Implicit Feedback from Pre-training Demonstrations

Abstract:
Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM‑type auto‑regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre‑trained solely with token‑prediction objectives often generate behaviors that deviate from what humans would prefer, making post‑training preference alignment crucial for producing human‑preferred motions. Unfortunately, post‑training alignment requires extensive preference rankings of motions generated by the pre‑trained model, which are costly to annotate, especially in multi‑agent settings. Recently, there has been growing interest in leveraging pre‑training demonstrations to scalably generate preference data for post‑training alignment. However, these methods often adopt an adversarial assumption, treating all pre‑trained model‑generated samples as unpreferred examples. This adversarial approach overlooks the valuable signal provided by preference rankings among the model's own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre‑training demonstrations to construct preference rankings among the pre‑trained model's generations, offering more nuanced preference alignment guidance with zero human cost. We apply our approach to large‑scale traffic simulation and demonstrate its effectiveness in improving the realism of pre‑trained model's generated behaviors, making a lightweight 1M motion generation model comparable to SOTA large imitation‑based models by relying solely on implicit feedback from pre‑training demonstrations, without additional post‑training human preference annotations or high computational costs.

PaperID: 11109, https://arxiv.org/pdf/2503.20078.pdf

Authors: Volkan Ustun, Soham Hans, Rajay Kumar, Yunzhe Wang

Title: Abstracting Geo-specific Terrains to Scale Up Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) is increasingly ubiquitous in training dynamic and adaptive synthetic characters for interactive simulations on geo‑specific terrains. Frameworks such as Unity's ML‑Agents help to make such reinforcement learning experiments more accessible to the simulation community. Military training simulations also benefit from advances in MARL, but they have immense computational requirements due to their complex, continuous, stochastic, partially observable, non‑stationary, and doctrine‑based nature. Furthermore, these simulations require geo‑specific terrains, further exacerbating the computational resources problem. In our research, we leverage Unity's waypoints to automatically generate multi‑layered representation abstractions of the geo‑specific terrains to scale up reinforcement learning while still allowing the transfer of learned policies between different representations. Our early exploratory results on a novel MARL scenario, where each side has differing objectives, indicate that waypoint‑based navigation enables faster and more efficient learning while producing trajectories similar to those taken by expert human players in CSGO gaming environments. This research points out the potential of waypoint‑based navigation for reducing the computational costs of developing and training MARL models for military training simulations, where geo‑specific terrains and differing objectives are crucial.

PaperID: 11110, https://arxiv.org/pdf/2503.20036.pdf

Authors: Eray Yapağcı, Yavuz Alp Sencer Öztürk, Eray Tüzün

Title: Agents in the Sandbox: End-to-End Crash Bug Reproduction for Minecraft

Abstract:
Reproducing game bugs, particularly crash bugs in continuously evolving games like Minecraft, is a notoriously manual, time‑consuming, and challenging process to automate; insights from a key decision maker from Minecraft we interviewed confirm this, highlighting that a substantial portion of crash reports necessitate manual scenario reconstruction. Despite the success of LLM‑driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed. This paper introduces BugCraft, a novel end‑to‑end framework designed to automate the reproduction of crash bugs in Minecraft directly from user‑submitted bug reports, addressing the critical gap in automated game bug reproduction. BugCraft employs a two‑stage approach: first, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high‑quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision‑based LLM agent and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash. To facilitate evaluation, we introduce BugCraft‑Bench, a curated dataset of Minecraft crash bug reports. On BugCraft‑Bench, our framework end‑to‑end reproduced 34.9% of crash bugs with GPT‑4.1, outperforming baseline computer‑use models by 37%. BugCraft demonstrates the feasibility of automated reproduction of crash bugs in complex game environments using LLMs, opening promising avenues for game testing and development. Finally, we make our code open at https://bugcraft2025.github.io

PaperID: 11111, https://arxiv.org/pdf/2503.20028.pdf

Authors: Pengfei Du

Title: OmniNova:A General Multimodal Agent Framework

Abstract:
The integration of Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems. However, orchestrating multiple LLM‑driven agents to tackle complex tasks remains challenging due to coordination difficulties, inefficient resource utilization, and inconsistent information flow. We present OmniNova, a modular multi‑agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities. OmniNova introduces three key innovations: (1) a hierarchical multi‑agent architecture with distinct coordinator, planner, supervisor, and specialist agents; (2) a dynamic task routing mechanism that optimizes agent deployment based on task complexity; and (3) a multi‑layered LLM integration system that allocates appropriate models to different cognitive requirements. Our evaluations across 50 complex tasks in research, data analysis, and web interaction domains demonstrate that OmniNova outperforms existing frameworks in task completion rate (87% vs. baseline 62%), efficiency (41% reduced token usage), and result quality (human evaluation score of 4.2/5 vs. baseline 3.1/5). We contribute both a theoretical framework for multi‑agent system design and an open‑source implementation that advances the state‑of‑the‑art in LLM‑based automation systems.

PaperID: 11112, https://arxiv.org/pdf/2503.19990.pdf

Authors: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen

Title: LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Abstract:
Multi‑step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real‑world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO‑Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO‑based tasks. LEGO‑Puzzles consists of 1,100 carefully curated visual question‑answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi‑step reasoning. Based on LEGO‑Puzzles, we conduct a comprehensive evaluation of 20 state‑of‑the‑art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. Furthermore, based on LEGO‑Puzzles, we design generation tasks to investigate whether MLLMs can transfer their spatial understanding and reasoning abilities to image generation. Our experiments show that only GPT‑4o and Gemini‑2.0‑Flash exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO‑Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

PaperID: 11113, https://arxiv.org/pdf/2503.19889.pdf

Authors: Jie Tian, Martin Taylor Sobczak, Dhanush Patil, Jixin Hou, Lin Pang, Arunachalam Ramanathan, Libin Yang, Xianyan Chen, Yuval Golan, Xiaoming Zhai, Hongyue Sun, Kenan Song, Xianqiao Wang

Title: A Multi-Agent Framework Integrating Large Language Models and Generative AI for Accelerated Metamaterial Design

Abstract:
Metamaterials, renowned for their exceptional mechanical, electromagnetic, and thermal properties, hold transformative potential across diverse applications, yet their design remains constrained by labor‑intensive trial‑and‑error methods and limited data interoperability. Here, we introduce CrossMatAgent ‑‑ a novel multi‑agent framework that synergistically integrates large language models with state‑of‑the‑art generative AI to revolutionize metamaterial design. By orchestrating a hierarchical team of agents ‑‑ each specializing in tasks such as pattern analysis, architectural synthesis, prompt engineering, and supervisory feedback ‑‑ our system leverages the multimodal reasoning of GPT‑4o alongside the generative precision of DALL‑E 3 and a fine‑tuned Stable Diffusion XL model. This integrated approach automates data augmentation, enhances design fidelity, and produces simulation‑ and 3D printing‑ready metamaterial patterns. Comprehensive evaluations, including CLIP‑based alignment, SHAP interpretability analyses, and mechanical simulations under varied load conditions, demonstrate the framework's ability to generate diverse, reproducible, and application‑ready designs. CrossMatAgent thus establishes a scalable, AI‑driven paradigm that bridges the gap between conceptual innovation and practical realization, paving the way for accelerated metamaterial development.

PaperID: 11114, https://arxiv.org/pdf/2503.19879.pdf

Authors: Farhad Mehdifar, Mani H. Dhullipalla, Charalampos P. Bechlioulis, Dimos V. Dimarogonas

Title: Collaborative Satisfaction of Long-Term Spatial Constraints in Multi-Agent Systems: A Distributed Optimization Approach (extended version)

Abstract:
This paper addresses the problem of collaboratively satisfying long‑term spatial constraints in multi‑agent systems. Each agent is subject to spatial constraints, expressed as inequalities, which may depend on the positions of other agents with whom they may or may not have direct communication. These constraints need to be satisfied asymptotically or after an unknown finite time. The agents' objective is to collectively achieve a formation that fulfills all constraints. The problem is initially framed as a centralized unconstrained optimization, where the solution yields the optimal configuration by maximizing an objective function that reflects the degree of constraint satisfaction. This function encourages collaboration, ensuring agents help each other meet their constraints while fulfilling their own. When the constraints are infeasible, agents converge to a least‑violating solution. A distributed consensus‑based optimization scheme is then introduced, which approximates the centralized solution, leading to the development of distributed controllers for single‑integrator agents. Finally, simulations validate the effectiveness of the proposed approach.

PaperID: 11115, https://arxiv.org/pdf/2503.19699.pdf

Authors: Muhammad Al-Zafar Khan, Jamal Al-Karaki

Title: Optimal Path Planning and Cost Minimization for a Drone Delivery System Via Model Predictive Control

Abstract:
In this study, we formulate the drone delivery problem as a control problem and solve it using Model Predictive Control. Two experiments are performed: The first is on a less challenging grid world environment with lower dimensionality, and the second is with a higher dimensionality and added complexity. The MPC method was benchmarked against three popular Multi‑Agent Reinforcement Learning (MARL): Independent Q‑Learning (IQL), Joint Action Learners (JAL), and Value‑Decomposition Networks (VDN). It was shown that the MPC method solved the problem quicker and required fewer optimal numbers of drones to achieve a minimized cost and navigate the optimal path.

PaperID: 11116, https://arxiv.org/pdf/2503.19607.pdf

Authors: Edward Gu, Ho Chit Siu, Melanie Platt, Isabelle Hurley, Jaime Peña, Rohan Paleja

Title: Enabling Rapid Shared Human-AI Mental Model Alignment via the After-Action Review

Abstract:
In this work, we present two novel contributions toward improving research in human‑machine teaming (HMT): 1) a Minecraft testbed to accelerate testing and deployment of collaborative AI agents and 2) a tool to allow users to revisit and analyze behaviors within an HMT episode to facilitate shared mental model development. Our browser‑based Minecraft testbed allows for rapid testing of collaborative agents in a continuous‑space, real‑time, partially‑observable environment with real humans without cumbersome setup typical to human‑AI interaction user studies. As Minecraft has an extensive player base and a rich ecosystem of pre‑built AI agents, we hope this contribution can help to facilitate research quickly in the design of new collaborative agents and in understanding different human factors within HMT. Our mental model alignment tool facilitates user‑led post‑mission analysis by including video displays of first‑person perspectives of the team members (i.e., the human and AI) that can be replayed, and a chat interface that leverages GPT‑4 to provide answers to various queries regarding the AI's experiences and model details.

PaperID: 11117, https://arxiv.org/pdf/2503.19584.pdf

Authors: Songtao Sun, Jingyi Li, Yuanfei Dong, Haoguang Liu, Chenxin Xu, Fuyang Li, Qiang Liu

Title: Multi-agent Application System in Office Collaboration Scenarios

Abstract:
This paper introduces a multi‑agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized collaboration support based on team members' needs and incorporate data analysis tools to improve decision‑making quality. The paper also proposes an intelligent agent architecture that separates Plan and Solver, and through techniques such as multi‑turn query rewriting and business tool retrieval, it enhances the agent's multi‑intent and multi‑turn dialogue capabilities. Furthermore, the paper details the design of tools and multi‑turn dialogue in the context of office collaboration scenarios, and validates the system's effectiveness through experiments and evaluations. Ultimately, the system has demonstrated outstanding performance in real business applications, particularly in query understanding, task planning, and tool calling. Looking forward, the system is expected to play a more significant role in addressing complex interaction issues within dynamic environments and large‑scale multi‑agent systems.

PaperID: 11118, https://arxiv.org/pdf/2503.19466.pdf

Authors: Leander Kurscheidt, Paolo Morettin, Roberto Sebastiani, Andrea Passerini, Antonio Vergari

Title: A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction

Abstract:
In safety‑critical applications, guaranteeing the satisfaction of constraints over continuous environments is crucial, e.g., an autonomous agent should never crash into obstacles or go off‑road. Neural models struggle in the presence of these constraints, especially when they involve intricate algebraic relationships. To address this, we introduce a differentiable probabilistic layer that guarantees the satisfaction of non‑convex algebraic constraints over continuous variables. This probabilistic algebraic layer (PAL) can be seamlessly plugged into any neural architecture and trained via maximum likelihood without requiring approximations. PAL defines a distribution over conjunctions and disjunctions of linear inequalities, parameterized by polynomials. This formulation enables efficient and exact renormalization via symbolic integration, which can be amortized across different data points and easily parallelized on a GPU. We showcase PAL and our integration scheme on a number of benchmarks for algebraic constraint integration and on real‑world trajectory data.

PaperID: 11119, https://arxiv.org/pdf/2503.19418.pdf

Authors: Xueyao Zhang, Bo Yang, Xuelin Cao, Zhiwen Yu, George C. Alexandropoulos, Yan Zhang, Merouane Debbah, Chau Yuen

Title: Multi-Agent Deep Reinforcement Learning for Safe Autonomous Driving with RICS-Assisted MEC

Abstract:
Environment sensing and fusion via onboard sensors are envisioned to be widely applied in future autonomous driving networks. This paper considers a vehicular system with multiple self‑driving vehicles that is assisted by multi‑access edge computing (MEC), where image data collected by the sensors is offloaded from cellular vehicles to the MEC server using vehicle‑to‑infrastructure (V2I) links. Sensory data can also be shared among surrounding vehicles via vehicle‑to‑vehicle (V2V) communication links. To improve spectrum utilization, the V2V links may reuse the same frequency spectrum with V2I links, which may cause severe interference. To tackle this issue, we leverage reconfigurable intelligent computational surfaces (RICSs) to jointly enable V2I reflective links and mitigate interference appearing at the V2V links. Considering the limitations of traditional algorithms in addressing this problem, such as the assumption for quasi‑static channel state information, which restricts their ability to adapt to dynamic environmental changes and leads to poor performance under frequently varying channel conditions, in this paper, we formulate the problem at hand as a Markov game. Our novel formulation is applied to time‑varying channels subject to multi‑user interference and introduces a collaborative learning mechanism among users. The considered optimization problem is solved via a driving safety‑enabled multi‑agent deep reinforcement learning (DS‑MADRL) approach that capitalizes on the RICS presence. Our extensive numerical investigations showcase that the proposed reinforcement learning approach achieves faster convergence and significant enhancements in both data rate and driving safety, as compared to various state‑of‑the‑art benchmarks.

PaperID: 11120, https://arxiv.org/pdf/2503.19391.pdf

Authors: Zhiying Song, Lei Yang, Fuxi Wen, Jun Li

Title: TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Abstract:
Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles, however, inter‑agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real‑time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF‑Align, a novel framework that learns the flow path of features by predicting the feature‑level trajectory of objects from past observations up to the ego vehicle's current time. By generating temporally ordered sampling points along these paths, TraF‑Align directs attention from the current‑time query to relevant historical features along each trajectory, supporting the reconstruction of current‑time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real‑world datasets, V2V4Real and DAIR‑V2X‑Seq, show that TraF‑Align sets a new benchmark for asynchronous cooperative perception.

PaperID: 11121, https://arxiv.org/pdf/2503.19213.pdf

Authors: Murong Yue

Title: A Survey of Large Language Model Agents for Question Answering

Abstract:
This paper surveys the development of large language model (LLM)‑based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM‑based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

PaperID: 11122, https://arxiv.org/pdf/2503.19065.pdf

Authors: Zhongyu Yang, Jun Chen, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny

Title: WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

Abstract:
Knowledge discovery and collection are intelligence‑intensive tasks that traditionally require significant human effort to ensure high‑quality outputs. Recent research has explored multi‑agent frameworks for automating Wikipedia‑style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text‑only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia‑style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi‑perspective self‑reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence, etc. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image‑based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%‑29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia‑style articles. Our code and examples are available at https://wikiautogen.github.io/

PaperID: 11123, https://arxiv.org/pdf/2503.18825.pdf

Authors: Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski

Title: EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents

Abstract:
We develop evaluation methods for measuring the economic decision‑making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics ‑‑ procurement, scheduling, and pricing ‑‑ that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision‑making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well‑specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain‑of‑thought, (3) validate our litmus test framework by testing self‑consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision‑making.

PaperID: 11124, https://arxiv.org/pdf/2503.18816.pdf

Authors: Chak Lam Shek, Amrit Singh Bedi, Anjon Basak, Ellen Novoseller, Nick Waytowich, Priya Narayanan, Dinesh Manocha, Pratap Tokekar

Title: Learning Multi-Robot Coordination through Locality-Based Factorized Multi-Agent Actor-Critic Algorithm

Abstract:
In this work, we present a novel cooperative multi‑agent reinforcement learning method called Locality based Factorized Multi‑Agent Actor‑Critic (Loc‑FACMAC). Existing state‑of‑the‑art algorithms, such as FACMAC, rely on global reward information, which may not accurately reflect the quality of individual robots' actions in decentralized systems. We integrate the concept of locality into critic learning, where strongly related robots form partitions during training. Robots within the same partition have a greater impact on each other, leading to more precise policy evaluation. Additionally, we construct a dependency graph to capture the relationships between robots, facilitating the partitioning process. This approach mitigates the curse of dimensionality and prevents robots from using irrelevant information. Our method improves existing algorithms by focusing on local rewards and leveraging partition‑based learning to enhance training efficiency and performance. We evaluate the performance of Loc‑FACMAC in three environments: Hallway, Multi‑cartpole, and Bounded‑Cooperative‑Navigation. We explore the impact of partition sizes on the performance and compare the result with baseline MARL algorithms such as LOMAQ, FACMAC, and QMIX. The experiments reveal that, if the locality structure is defined properly, Loc‑FACMAC outperforms these baseline algorithms up to 108%, indicating that exploiting the locality structure in the actor‑critic framework improves the MARL performance.

PaperID: 11125, https://arxiv.org/pdf/2503.18702.pdf

Authors: David Ph. Shakouri, Crit Cremers, Niels O. Schiller

Title: Unsupervised Acquisition of Discrete Grammatical Categories

Abstract:
This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi‑agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non‑trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non‑trivial categories.

PaperID: 11126, https://arxiv.org/pdf/2503.18666.pdf

Authors: Haoyu Wang, Christopher M. Poskitt, Jun Sun

Title: AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Abstract:
Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision‑making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model‑based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain‑specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.

PaperID: 11127, https://arxiv.org/pdf/2503.18619.pdf

Authors: Benjamin Knopp, Daniel Auras, Alexander C. Schütz, Dominik Endres

Title: Reading Decisions from Gaze Direction during Graphics Turing Test of Gait Animation

Abstract:
We investigated gaze direction during movement observation. The eye movement data were collected during an experiment, in which different models of movement production (based on movement primitives, MPs) were compared in a two alternatives forced choice task (2AFC). Participants observed side‑by‑side presentation of two naturalistic 3D‑rendered human movement videos, where one video was based on motion captured gait sequence, the other one was generated by recombining the machine‑learned MPs to approximate the same movement. The task was to discriminate between these movements while their eye movements were recorded. We are complementing previous binary decision data analyses with eye tracking data. Here, we are investigating the role of gaze direction during task execution. We computed the shared information between gaze features and decisions of the participants, and between gaze features and correct answers. We found that eye movements reflect the decision of participants during the 2AFC task, but not the correct answer. This result is important for future experiments, which should take advantage of eye tracking to complement binary decision data.

PaperID: 11128, https://arxiv.org/pdf/2503.18589.pdf

Authors: Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo

Title: Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling

Abstract:
Multi‑agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real‑world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state‑wise measure of uncertainty. Moreover, popular multi‑modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a unified diffusion model designed to handle trajectory completion while providing state‑wise uncertainty estimates jointly. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log‑likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post‑processing to enable error probability estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state‑of‑the‑art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball‑U, Football‑U, Soccer‑U), highlighting the effectiveness of uncertainty and error probability estimation. Video at https://youtu.be/ngw4D4eJToE

PaperID: 11129, https://arxiv.org/pdf/2503.18546.pdf

Authors: Yaroslav Marchukov, Luis Montano

Title: Multi-agent coordination for data gathering with periodic requests and deliveries

Abstract:
In this demo work we develop a method to plan and coordinate a multi‑agent team to gather information on demand. The data is periodically requested by a static Operation Center (OC) from changeable goals locations. The mission of the team is to reach these locations, taking measurements and delivering the data to the OC. Due to the limited communication range as well as signal attenuation because of the obstacles, the agents must travel to the OC, to upload the data. The agents can play two roles: ones as workers gathering data, the others as collectors traveling invariant paths for collecting the data of the workers to re‑transmit it to the OC. The refreshing time of the delivered information depends on the number of available agents as well as of the scenario. The proposed algorithm finds out the best balance between the number of collectors‑workers and the partition of the scenario into working areas in the planning phase, which provides the minimum refreshing time and will be the one executed by the agents.

PaperID: 11130, https://arxiv.org/pdf/2503.18331.pdf

Authors: Yen-Shao Chen, Tauhid Zaman

Title: Optimizing Influence Campaigns: Nudging under Bounded Confidence

Abstract:
Influence campaigns in online social networks are often run by organizations, political parties, and nation states to influence large audiences. These campaigns are employed through the use of agents in the network that share persuasive content. Yet, their impact might be minimal if the audiences remain unswayed, often due to the bounded confidence phenomenon, where only a narrow spectrum of viewpoints can influence them. Here we show that to persuade under bounded confidence, an agent must nudge its targets to gradually shift their opinions. Using a control theory approach, we show how to construct an agent's nudging policy under the bounded confidence opinion dynamics model and also how to select targets for multiple agents in an influence campaign on a social network. Simulations on real Twitter networks show that a multi‑agent nudging policy can shift the mean opinion, decrease opinion polarization, or even increase it. We find that our nudging based policies outperform other common techniques that do not consider the bounded confidence effect. Finally, we show how to craft prompts for large language models, such as ChatGPT, to generate text‑based content for real nudging policies. This illustrates the practical feasibility of our approach, allowing one to go from mathematical nudging policies to real social media content.

PaperID: 11131, https://arxiv.org/pdf/2503.18238.pdf

Authors: Harang Ju, Sinan Aral

Title: Collaborating with AI Agents: Field Experiments on Teamwork, Productivity, and Performance

Abstract:
We examined the mechanisms underlying productivity and performance gains from AI agents using a large‑scale experiment on Pairit, a platform we developed to study human‑AI collaboration. We randomly assigned 2,234 participants to human‑human and human‑AI teams that produced 11,024 ads for a think tank. We evaluated the ads using independent human ratings and a field experiment on X which garnered ~5M impressions. We found human‑AI teams produced 50% more ads per worker and higher text quality, while human‑human teams produced higher image quality, suggesting a jagged frontier of AI agent capability. Human‑AI teams also produced more homogeneous, or self‑similar, outputs. The field experiment revealed higher text quality improved click‑through rates and view‑through duration, while higher image quality improved cost‑per‑click rates. We found three mechanisms explained these effects. First, human‑AI collaboration was more task‑oriented, with 25% more task‑oriented messages and 18% fewer interpersonal messages. Second, human‑AI collaboration displayed more delegation, as participants delegated 17% more work to AI agents than to human partners and performed 62% fewer direct text edits when working with AI. Third, recognition that the collaborator was an AI moderated these effects as participants who correctly identified they were working with AI were more task‑oriented and more likely to delegate work. These mechanisms then explained performance as task‑oriented communication improved ad quality, specifically when working with AI, while interpersonal communication reduced ad quality; delegation improved text quality but had no effect on image quality and was positively associated with diversity collapse, creating homogeneous outputs of higher average quality. The results suggest AI agents drive changes in productivity, performance, and output diversity by reshaping teamwork.

PaperID: 11132, https://arxiv.org/pdf/2503.18221.pdf

Authors: Wen-Tse Chen, Minh Nguyen, Zhongyu Li, Guo Ning Sue, Koushil Sreenath

Title: Decentralized Navigation of a Cable-Towed Load using Quadrupedal Robot Team via MARL

Abstract:
This work addresses the challenge of enabling a team of quadrupedal robots to collaboratively tow a cable‑connected load through cluttered and unstructured environments while avoiding obstacles. Leveraging cables allows the multi‑robot system to navigate narrow spaces by maintaining slack when necessary. However, this introduces hybrid physical interactions due to alternating taut and slack states, with computational complexity that scales exponentially as the number of agents increases. To tackle these challenges, we developed a scalable and decentralized system capable of dynamically coordinating a variable number of quadrupedal robots while managing the hybrid physical interactions inherent in the load‑towing task. At the core of this system is a novel multi‑agent reinforcement learning (MARL)‑based planner, designed for decentralized coordination. The MARL‑based planner is trained using a centralized training with decentralized execution (CTDE) framework, enabling each robot to make decisions autonomously using only local (ego) observations. To accelerate learning and ensure effective collaboration across varying team sizes, we introduce a tailored training curriculum for MARL. Experimental results highlight the flexibility and scalability of the framework, demonstrating successful deployment with one to four robots in real‑world scenarios and up to twelve robots in simulation. The decentralized planner maintains consistent inference times, regardless of the team size. Additionally, the proposed system demonstrates robustness to environment perturbations and adaptability to varying load weights. This work represents a step forward in achieving flexible and efficient multi‑legged robotic collaboration in complex and real‑world environments.

PaperID: 11133, https://arxiv.org/pdf/2503.18201.pdf

Authors: Georg Ziegner, Michael Choi, Hung Mac Chan Le, Sahil Sakhuja, Arash Sarmadi

Title: Iterative Multi-Agent Reinforcement Learning: A Novel Approach Toward Real-World Multi-Echelon Inventory Optimization

Abstract:
Multi‑echelon inventory optimization (MEIO) is critical for effective supply chain management, but its inherent complexity can pose significant challenges. Heuristics are commonly used to address this complexity, yet they often face limitations in scope and scalability. Recent research has found deep reinforcement learning (DRL) to be a promising alternative to traditional heuristics, offering greater versatility by utilizing dynamic decision‑making capabilities. However, since DRL is known to struggle with the curse of dimensionality, its relevance to complex real‑life supply chain scenarios is still to be determined. This thesis investigates DRL's applicability to MEIO problems of increasing complexity. A state‑of‑the‑art DRL model was replicated, enhanced, and tested across 13 supply chain scenarios, combining diverse network structures and parameters. To address DRL's challenges with dimensionality, additional models leveraging graph neural networks (GNNs) and multi‑agent reinforcement learning (MARL) were developed, culminating in the novel iterative multi‑agent reinforcement learning (IMARL) approach. IMARL demonstrated superior scalability, effectiveness, and reliability in optimizing inventory policies, consistently outperforming benchmarks. These findings confirm the potential of DRL, particularly IMARL, to address real‑world supply chain challenges and call for additional research to further expand its applicability.

PaperID: 11134, https://arxiv.org/pdf/2503.18102.pdf

Authors: Samuel Schmidgall, Michael Moor

Title: AgentRxiv: Towards Collaborative Autonomous Research

Abstract:
Progress in scientific discovery is rarely the result of a single "Eureka" moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv‑a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other's research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH‑500). We find that the best performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories, achieving higher overall accuracy (13.7% relative improvement over baseline on MATH‑500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.

PaperID: 11135, https://arxiv.org/pdf/2503.18016.pdf

Authors: Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

Title: Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

Abstract:
Retrieval‑augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up‑to‑date knowledge sources. In the context of AI‑Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval‑augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval‑augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval‑augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.

PaperID: 11136, https://arxiv.org/pdf/2503.17936.pdf

Authors: Riya Naik, Ashwin Srinivasan, Estrid He, Swati Agarwal

Title: An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

Abstract:
Natural language as a medium for human‑computer interaction has long been anticipated, has been undergoing a sea‑change with the advent of Large Language Models (LLMs) with startling capacities for processing and generating language. Many of us now treat LLMs as modern‑day oracles, asking it almost any kind of question. Unlike its Delphic predecessor, consulting an LLM does not have to be a single‑turn activity (ask a question, receive an answer, leave); and ‑‑ also unlike the Pythia ‑‑ it is widely acknowledged that answers from LLMs can be improved with additional context. In this paper, we aim to study when we need multi‑turn interactions with LLMs to successfully get a question answered; or conclude that a question is unanswerable. We present a neural symbolic framework that models the interactions between human and LLM agents. Through the proposed framework, we define incompleteness and ambiguity in the questions as properties deducible from the messages exchanged in the interaction, and provide results from benchmark problems, in which the answer‑correctness is shown to depend on whether or not questions demonstrate the presence of incompleteness or ambiguity (according to the properties we identify). Our results show multi‑turn interactions are usually required for datasets which have a high proportion of incompleteness or ambiguous questions; and that that increasing interaction length has the effect of reducing incompleteness or ambiguity. The results also suggest that our measures of incompleteness and ambiguity can be useful tools for characterising interactions with an LLM on question‑answeringproblems

PaperID: 11137, https://arxiv.org/pdf/2503.17850.pdf

Authors: Dae Cheol Kwon, Xinyu Zhang

Title: CP-AgentNet: Autonomous and Explainable Communication Protocol Design Using Generative Agents

Abstract:
Although DRL (deep reinforcement learning) has emerged as a powerful tool for making better decisions than existing hand‑crafted communication protocols, it faces significant limitations: 1) Selecting the appropriate neural network architecture and setting hyperparameters are crucial for achieving desired performance levels, requiring domain expertise. 2) The decision‑making process in DRL models is often opaque, commonly described as a 'black box.' 3) DRL models are data hungry. In response, we propose CP‑AgentNet, the first framework designed to use generative agents for developing communication network protocols. This approach addresses these challenges by creating an autonomous system for protocol design, significantly reducing human effort. We developed LLMA (LLM‑agents‑based multiple access) and CPTCP (CP‑Agent‑based TCP) for heterogeneous environments. Our comprehensive simulations have demonstrated the efficient coexistence of LLMA and CPTCP with nodes using different types of protocols, as well as enhanced explainability.

PaperID: 11138, https://arxiv.org/pdf/2503.17803.pdf

Authors: Giovanni Briglia, Stefano Mariani, Franco Zambonelli

Title: A Roadmap Towards Improving Multi-Agent Reinforcement Learning With Causal Discovery And Inference

Abstract:
Causal reasoning is increasingly used in Reinforcement Learning (RL) to improve the learning process in several dimensions: efficacy of learned policies, efficiency of convergence, generalisation capabilities, safety and interpretability of behaviour. However, applications of causal reasoning to Multi‑Agent RL (MARL) are still mostly unexplored. In this paper, we take the first step in investigating the opportunities and challenges of applying causal reasoning in MARL. We measure the impact of a simple form of causal augmentation in state‑of‑the‑art MARL scenarios increasingly requiring cooperation, and with state‑of‑the‑art MARL algorithms exploiting various degrees of collaboration between agents. Then, we discuss the positive as well as negative results achieved, giving us the chance to outline the areas where further research may help to successfully transfer causal RL to the multi‑agent setting.

PaperID: 11139, https://arxiv.org/pdf/2503.17726.pdf

Authors: Ali Forootani

Title: A Survey on Mathematical Reasoning and Optimization with Large Language Models

Abstract:
Mathematical reasoning and optimization are fundamental to artificial intelligence and computational problem‑solving. Recent advancements in Large Language Models (LLMs) have significantly improved AI‑driven mathematical reasoning, theorem proving, and optimization techniques. This survey explores the evolution of mathematical problem‑solving in AI, from early statistical learning approaches to modern deep learning and transformer‑based methodologies. We review the capabilities of pretrained language models and LLMs in performing arithmetic operations, complex reasoning, theorem proving, and structured symbolic computation. A key focus is on how LLMs integrate with optimization and control frameworks, including mixed‑integer programming, linear quadratic control, and multi‑agent optimization strategies. We examine how LLMs assist in problem formulation, constraint generation, and heuristic search, bridging theoretical reasoning with practical applications. We also discuss enhancement techniques such as Chain‑of‑Thought reasoning, instruction tuning, and tool‑augmented methods that improve LLM's problem‑solving performance. Despite their progress, LLMs face challenges in numerical precision, logical consistency, and proof verification. Emerging trends such as hybrid neural‑symbolic reasoning, structured prompt engineering, and multi‑step self‑correction aim to overcome these limitations. Future research should focus on interpretability, integration with domain‑specific solvers, and improving the robustness of AI‑driven decision‑making. This survey offers a comprehensive review of the current landscape and future directions of mathematical reasoning and optimization with LLMs, with applications across engineering, finance, and scientific research.

PaperID: 11140, https://arxiv.org/pdf/2503.17553.pdf

Authors: Humza Nusrat, Bing Luo, Ryan Hall, Joshua Kim, Hassan Bagher-Ebadian, Anthony Doemer, Benjamin Movsas, Kundan Thind

Title: Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent

Abstract:
Radiotherapy treatment planning is a complex and time‑intensive process, often impacted by inter‑planner variability and subjective decision‑making. To address these challenges, we introduce Dose Optimization Language Agent (DOLA), an autonomous large language model (LLM)‑based agent designed for optimizing radiotherapy treatment plans while rigorously protecting patient privacy. DOLA integrates the LLaMa3.1 LLM directly with a commercial treatment planning system, utilizing chain‑of‑thought prompting, retrieval‑augmented generation (RAG), and reinforcement learning (RL). Operating entirely within secure local infrastructure, this agent eliminates external data sharing. We evaluated DOLA using a retrospective cohort of 18 prostate cancer patients prescribed 60 Gy in 20 fractions, comparing model sizes (8 billion vs. 70 billion parameters) and optimization strategies (No‑RAG, RAG, and RAG+RL) over 10 planning iterations. The 70B model demonstrated significantly improved performance, achieving approximately 16.4% higher final scores than the 8B model. The RAG approach outperformed the No‑RAG baseline by 19.8%, and incorporating RL accelerated convergence, highlighting the synergy of retrieval‑based memory and reinforcement learning. Optimal temperature hyperparameter analysis identified 0.4 as providing the best balance between exploration and exploitation. This proof of concept study represents the first successful deployment of locally hosted LLM agents for autonomous optimization of treatment plans within a commercial radiotherapy planning system. By extending human‑machine interaction through interpretable natural language reasoning, DOLA offers a scalable and privacy‑conscious framework, with significant potential for clinical implementation and workflow improvement.

PaperID: 11141, https://arxiv.org/pdf/2503.17509.pdf

Authors: Joseph Gatto, Parker Seegmiller, Timothy Burdick, Inas S. Khayal, Sarah DeLozier, Sarah M. Preum

Title: Follow-up Question Generation For Enhanced Patient-Provider Conversations

Abstract:
Follow‑up question generation is an essential feature of dialogue systems as it can reduce conversational ambiguity and enhance modeling complex interactions. Conversational contexts often pose core NLP challenges such as (i) extracting relevant information buried in fragmented data sources, and (ii) modeling parallel thought processes. These two challenges occur frequently in medical dialogue as a doctor asks questions based not only on patient utterances but also their prior EHR data and current diagnostic hypotheses. Asking medical questions in asynchronous conversations compounds these issues as doctors can only rely on static EHR information to motivate follow‑up questions. To address these challenges, we introduce FollowupQ, a novel framework for enhancing asynchronous medical conversation. FollowupQ is a multi‑agent framework that processes patient messages and EHR data to generate personalized follow‑up questions, clarifying patient‑reported medical conditions. FollowupQ reduces requisite provider follow‑up communications by 34%. It also improves performance by 17% and 5% on real and synthetic data, respectively. We also release the first public dataset of asynchronous medical messages with linked EHR data alongside 2,300 follow‑up questions written by clinical experts for the wider NLP research community.

PaperID: 11142, https://arxiv.org/pdf/2503.17460.pdf

Authors: Reem Gody, Mahmoud Goudy, Ahmed Y. Tawfik

Title: ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

Abstract:
In this paper, we present ConvoGen: an innovative framework for generating synthetic conversational data using multi‑agent systems. Our method leverages few‑shot learning and introduces iterative sampling from a dynamically updated few‑shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models, and augmenting existing datasets for tasks like conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high‑quality diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.

PaperID: 11143, https://arxiv.org/pdf/2503.17332.pdf

Authors: Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, Daniel Kang

Title: CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Abstract:
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real‑world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real‑world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE‑Bench, a real‑world cybersecurity benchmark based on critical‑severity Common Vulnerabilities and Exposures. In CVE‑Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real‑world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state‑of‑the‑art agent framework can resolve up to 13% of vulnerabilities.

PaperID: 11144, https://arxiv.org/pdf/2503.17175.pdf

Authors: Duanrui Yu, Jing You, Xin Pei, Anqi Qu, Dingyu Wang, Shaocheng Jia

Title: Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection

Abstract:
Collaborative perception allows real‑time inter‑agent information exchange and thus offers invaluable opportunities to enhance the perception capabilities of individual agents. However, limited communication bandwidth in practical scenarios restricts the inter‑agent data transmission volume, consequently resulting in performance declines in collaborative perception systems. This implies a trade‑off between perception performance and communication cost. To address this issue, we propose Which2comm, a novel multi‑agent 3D object detection framework leveraging object‑level sparse features. By integrating semantic information of objects into 3D object detection boxes, we introduce semantic detection boxes (SemDBs). Innovatively transmitting these information‑rich object‑level sparse features among agents not only significantly reduces the demanding communication volume, but also improves 3D object detection performance. Specifically, a fully sparse network is constructed to extract SemDBs from individual agents; a temporal fusion approach with a relative temporal encoding mechanism is utilized to obtain the comprehensive spatiotemporal features. Extensive experiments on the V2XSet and OPV2V datasets demonstrate that Which2comm consistently outperforms other state‑of‑the‑art methods on both perception performance and communication cost, exhibiting better robustness to real‑world latency. These results present that for multi‑agent collaborative 3D object detection, transmitting only object‑level sparse features is sufficient to achieve high‑precision and robust performance.

PaperID: 11145, https://arxiv.org/pdf/2503.17085.pdf

Authors: J. M. Diederik Kruijssen, Nicholas Emmons

Title: Deterministic AI Agent Personality Expression through Standard Psychological Diagnostics

Abstract:
Artificial intelligence (AI) systems powered by large language models have become increasingly prevalent in modern society, enabling a wide range of applications through natural language interaction. As AI agents proliferate in our daily lives, their generic and uniform expressiveness presents a significant limitation to their appeal and adoption. Personality expression represents a key prerequisite for creating more human‑like and distinctive AI systems. We show that AI models can express deterministic and consistent personalities when instructed using established psychological frameworks, with varying degrees of accuracy depending on model capabilities. We find that more advanced models like GPT‑4o and o1 demonstrate the highest accuracy in expressing specified personalities across both Big Five and Myers‑Briggs assessments, and further analysis suggests that personality expression emerges from a combination of intelligence and reasoning capabilities. Our results reveal that personality expression operates through holistic reasoning rather than question‑by‑question optimization, with response‑scale metrics showing higher variance than test‑scale metrics. Furthermore, we find that model fine‑tuning affects communication style independently of personality expression accuracy. These findings establish a foundation for creating AI agents with diverse and consistent personalities, which could significantly enhance human‑AI interaction across applications from education to healthcare, while additionally enabling a broader range of more unique AI agents. The ability to quantitatively assess and implement personality expression in AI systems opens new avenues for research into more relatable, trustworthy, and ethically designed AI.

PaperID: 11146, https://arxiv.org/pdf/2503.16905.pdf

Authors: Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Fangzhi Xu, Qika Lin, Lingling Zhang, Rui Mao, Erik Cambria, Jun Liu

Title: MAPS: Multi-Agent Personality Shaping for Collaborative Reasoning

Abstract:
Collaborative reasoning with multiple agents offers the potential for more robust and diverse problem‑solving. However, existing approaches often suffer from homogeneous agent behaviors and lack of reflective and rethinking capabilities. We propose Multi‑Agent Personality Shaping (MAPS), a novel framework that enhances reasoning through agent diversity and internal critique. Inspired by the Big Five personality theory, MAPS assigns distinct personality traits to individual agents, shaping their reasoning styles and promoting heterogeneous collaboration. To enable deeper and more adaptive reasoning, MAPS introduces a Critic agent that reflects on intermediate outputs, revisits flawed steps, and guides iterative refinement. This integration of personality‑driven agent design and structured collaboration improves both reasoning depth and flexibility. Empirical evaluations across three benchmarks demonstrate the strong performance of MAPS, with further analysis confirming its generalizability across different large language models and validating the benefits of multi‑agent collaboration.

PaperID: 11147, https://arxiv.org/pdf/2503.16874.pdf

Authors: Jian Zhang, Zhangqi Wang, Haiping Zhu, Kangda Cheng, Kai He, Bo Li, Qika Lin, Jun Liu, Erik Cambria

Title: MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization

Abstract:
Large language models (LLMs) typically operate in a question‑answering paradigm, where the quality of the input prompt critically affects the response. Automated Prompt Optimization (APO) aims to overcome the cognitive biases of manually crafted prompts and explore a broader prompt design space. However, existing APO methods often suffer from rigid template structures and inefficient exploration in the prompt space. To this end, we propose a Multi‑Agent Adaptive Reasoning with Socratic guidance framework (MARS) for APO. MARS consists of five complementary agents and formulates the optimization process as a Partially Observable Markov Decision Process (POMDP), enabling adaptive prompt refinement through explicit state modeling and interactive feedback. Specifically, a Planner agent generates flexible optimization trajectories, a Teacher‑Critic‑Student triad engages in Socratic‑style dialogue to iteratively optimize the prompt based on pseudo‑gradient signals in the text space, and a Target agent executes the prompt in downstream tasks to provide performance feedback. MARS integrates reasoning, feedback, and state transition into a unified hidden‑state evolution process, improving both the effectiveness and interpretability of optimization. Extensive experiments on multiple datasets demonstrate that MARS outperforms existing APO methods in terms of optimization performance, search efficiency, and interpretability.

PaperID: 11148, https://arxiv.org/pdf/2503.16867.pdf

Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song

Title: ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Abstract:
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text‑to‑Video (T2V) Generation. Existing text‑to‑video alignment metrics like CLIPScore only generate coarse‑grained scores without fine‑grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text‑to‑Video Alignment via fine‑grained question generation and answering. First, a multi‑agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge‑augmented multi‑stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common‑sense knowledge (e.g., physical laws), and then video LLM answers the generated questions through a multi‑stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text‑to‑video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text‑to‑video models, we identify their key capabilities and limitations, paving the way for next‑generation T2V generation.

PaperID: 11149, https://arxiv.org/pdf/2503.16814.pdf

Authors: Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun

Title: From Belief Entrenchment to Robust Reasoning in LLM Agents

Abstract:
Multi‑Agent Debate (MAD) has emerged as a promising inference scaling method for Large Language Model (LLM) reasoning. However, it frequently suffers from belief entrenchment, where agents reinforce shared errors rather than correcting them. Going beyond merely identifying this failure, we decompose it into two distinct root causes: (1) the model's biased static initial belief and (2) homogenized debate dynamics that amplify the majority view regardless of correctness. To address these sequentially, we propose DReaMAD (Diverse Reasoning via Multi‑Agent Debate with Refined Prompt). Our framework first rectifies the static belief via strategic prior knowledge elicitation, then reshapes the debate dynamics by enforcing perspective diversity. Validated on our new MetaNIM Arena benchmark, DReaMAD significantly mitigates entrenchment, achieving a +9.5% accuracy gain over ReAct prompting and a +19.0% higher win rate than standard MAD.

PaperID: 11150, https://arxiv.org/pdf/2503.16780.pdf

Authors: Uihyun Cho, Namhun Kim

Title: A-IDE : Agent-Integrated Denoising Experts

Abstract:
Recent advances in deep‑learning based denoising methods have improved Low‑Dose CT image quality. However, due to distinct HU distributions and diverse anatomical characteristics, a single model often struggles to generalize across multiple anatomies. To address this limitation, we introduce Agent‑Integrated Denoising Experts (A‑IDE) framework, which integrates three anatomical region‑specialized RED‑CNN models under the management of decision‑making LLM agent. The agent analyzes semantic cues from BiomedCLIP to dynamically route incoming LDCT scans to the most appropriate expert model. We highlight three major advantages of our approach. A‑IDE excels in heterogeneous, data‑scarce environments. The framework automatically prevents overfitting by distributing tasks among multiple experts. Finally, our LLM‑driven agentic pipeline eliminates the need for manual interventions. Experimental evaluations on the Mayo‑2016 dataset confirm that A‑IDE achieves superior performance in RMSE, PSNR, and SSIM compared to a single unified denoiser.

PaperID: 11151, https://arxiv.org/pdf/2503.16547.pdf

Authors: Sihan Wang, Suiyang Jiang, Yibo Gao, Boming Wang, Shangqi Gao, Xiahai Zhuang

Title: Empowering Medical Multi-Agents with Clinical Consultation Flow for Dynamic Diagnosis

Abstract:
Traditional AI‑based healthcare systems often rely on single‑modal data, limiting diagnostic accuracy due to incomplete information. However, recent advancements in foundation models show promising potential for enhancing diagnosis combining multi‑modal information. While these models excel in static tasks, they struggle with dynamic diagnosis, failing to manage multi‑turn interactions and often making premature diagnostic decisions due to insufficient persistence in information collection.To address this, we propose a multi‑agent framework inspired by consultation flow and reinforcement learning (RL) to simulate the entire consultation process, integrating multiple clinical information for effective diagnosis. Our approach incorporates a hierarchical action set, structured from clinic consultation flow and medical textbook, to effectively guide the decision‑making process. This strategy improves agent interactions, enabling them to adapt and optimize actions based on the dynamic state. We evaluated our framework on a public dynamic diagnosis benchmark. The proposed framework evidentially improves the baseline methods and achieves state‑of‑the‑art performance compared to existing foundation model‑based methods.

PaperID: 11152, https://arxiv.org/pdf/2503.16463.pdf

Authors: Zhoujian Sun, Ziyi Liu, Cheng Luo, Jiebin Chu, Zhengxing Huang

Title: Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning

Abstract:
Recent advances in large language models (LLMs) have shown promising results in medical diagnosis, with some studies indicating superior performance compared to human physicians in specific scenarios. However, the diagnostic capabilities of LLMs are often overestimated, as their performance significantly deteriorates in interactive diagnostic settings that require active information gathering. This study investigates the underlying mechanisms behind the performance degradation phenomenon and proposes a solution. We identified that the primary deficiency of LLMs lies in the initial diagnosis phase, particularly in information‑gathering efficiency and initial diagnosis formation, rather than in the subsequent differential diagnosis phase. To address this limitation, we developed a plug‑and‑play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques. The experimental results indicate that the PPME LLM achieved over 30% improvement compared to baselines. The final diagnostic accuracy of the PPME LLM in interactive diagnostic scenarios approached levels comparable to those achieved using complete clinical data. These findings suggest a promising potential for developing autonomous diagnostic systems, although further validation studies are needed.

PaperID: 11153, https://arxiv.org/pdf/2503.16436.pdf

Authors: Yutaka Matsubara, Akihisa Morikawa, Daichi Mizuguchi, Kiyoshi Fujiwara

Title: Enhancing Human-Robot Collaboration through Existing Guidelines: A Case Study Approach

Abstract:
As AI systems become more prevalent, concerns about their development, operation, and societal impact intensify. Establishing ethical, social, and safety standards amidst evolving AI capabilities poses significant challenges. Global initiatives are underway to establish guidelines for AI system development and operation. With the increasing use of collaborative human‑AI task execution, it's vital to continuously adapt AI systems to meet user and environmental needs. Failure to synchronize AI evolution with changes in users and the environment could result in ethical and safety issues. This paper evaluates the applicability of existing guidelines in human‑robot collaborative systems, assesses their effectiveness, and discusses limitations. Through a case study, we examine whether our target system meets requirements outlined in existing guidelines and propose improvements to enhance human‑robot interactions. Our contributions provide insights into interpreting and applying guidelines, offer concrete examples of system enhancement, and highlight their applicability and limitations. We believe these contributions will stimulate discussions and influence system assurance and certification in future AI‑infused critical systems.

PaperID: 11154, https://arxiv.org/pdf/2503.16433.pdf

Authors: Andrew Cho, Jason M. Woo, Brian Shi, Aishwaryaa Udeshi, Jonathan S. H. Woo

Title: The Application of MATEC (Multi-AI Agent Team Care) Framework in Sepsis Care

Abstract:
Under‑resourced or rural hospitals have limited access to medical specialists and healthcare professionals, which can negatively impact patient outcomes in sepsis. To address this gap, we developed the MATEC (Multi‑AI Agent Team Care) framework, which integrates a team of specialized AI agents for sepsis care. The sepsis AI agent team includes five doctor agents, four health professional agents, and a risk prediction model agent, with an additional 33 doctor agents available for consultations. Ten attending physicians at a teaching hospital evaluated this framework, spending approximately 40 minutes on the web‑based MATEC application and participating in the 5‑point Likert scale survey (rated from 1‑unfavorable to 5‑favorable). The physicians found the MATEC framework very useful (Median=4, P=0.01), and very accurate (Median=4, P<0.01). This pilot study demonstrates that a Multi‑AI Agent Team Care framework (MATEC) can potentially be useful in assisting medical professionals, particularly in under‑resourced hospital settings.

PaperID: 11155, https://arxiv.org/pdf/2503.16416.pdf

Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer

Title: Survey on Evaluation of LLM-based Agents

Abstract:
LLM‑based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) Core LLM capabilities needed for agentic workflows, like planning, and tool use; (2) Application‑specific benchmarks such as web and SWE agents; (3) Evaluation of generalist agents; (4) Analysis of agent benchmarks' core dimensions; and (5) Evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost‑efficiency, safety, and robustness, and in developing fine‑grained, scalable evaluation methods.

PaperID: 11156, https://arxiv.org/pdf/2503.16285.pdf

Authors: Martin Bichler, Davide Legacci, Panayotis Mertikopoulos, Matthias Oberlechner, Bary Pradelski

Title: Characterizing the Convergence of Game Dynamics via Potentialness

Abstract:
Understanding the convergence landscape of multi‑agent learning is a fundamental problem of great practical relevance in many applications of artificial intelligence and machine learning. While it is known that learning dynamics converge to Nash equilibrium in potential games, the behavior of dynamics in many important classes of games that do not admit a potential is poorly understood. To measure how ''close'' a game is to being potential, we consider a distance function, that we call ''potentialness'', and which relies on a strategic decomposition of games introduced by Candogan et al. (2011). We introduce a numerical framework enabling the computation of this metric, which we use to calculate the degree of ''potentialness'' in generic matrix games, as well as (non‑generic) games that are important in economic applications, namely auctions and contests. Understanding learning in the latter games has become increasingly important due to the wide‑spread automation of bidding and pricing with no‑regret learning algorithms. We empirically show that potentialness decreases and concentrates with an increasing number of agents or actions; in addition, potentialness turns out to be a good predictor for the existence of pure Nash equilibria and the convergence of no‑regret learning algorithms in matrix games. In particular, we observe that potentialness is very low for complete‑information models of the all‑pay auction where no pure Nash equilibrium exists, and much higher for Tullock contests, first‑, and second‑price auctions, explaining the success of learning in the latter. In the incomplete‑information version of the all‑pay auction, a pure Bayes‑Nash equilibrium exists and it can be learned with gradient‑based algorithms. Potentialness nicely characterizes these differences to the complete‑information version.

PaperID: 11157, https://arxiv.org/pdf/2503.16248.pdf

Authors: Atharv Singh Patlan, Peiyao Sheng, S. Ashwin Hebbar, Prateek Mittal, Pramod Viswanath

Title: Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents

Abstract:
AI agents integrated with Web3 offer autonomy and openness but raise security concerns as they interact with financial protocols and immutable smart contracts. This paper investigates the vulnerabilities of AI agents within blockchain‑based financial ecosystems when exposed to adversarial threats in real‑world scenarios. We introduce the concept of context manipulation ‑‑ a comprehensive attack vector that exploits unprotected context surfaces, including input channels, memory modules, and external data feeds. It expands on traditional prompt injection and reveals a more stealthy and persistent threat: memory injection. Using ElizaOS, a representative decentralized AI agent framework for automated Web3 operations, we showcase that malicious injections into prompts or historical records can trigger unauthorized asset transfers and protocol violations which could be financially devastating in reality. To quantify these risks, we introduce CrAIBench, a Web3‑focused benchmark covering 150+ realistic blockchain tasks. such as token transfers, trading, bridges, and cross‑chain interactions, and 500+ attack test cases using context manipulation. Our evaluation results confirm that AI models are significantly more vulnerable to memory injection compared to prompt injection. Finally, we evaluate a comprehensive defense roadmap, finding that prompt‑injection defenses and detectors only provide limited protection when stored context is corrupted, whereas fine‑tuning‑based defenses substantially reduce attack success rates while preserving performance on single‑step tasks. These results underscore the urgent need for AI agents that are both secure and fiduciarily responsible in blockchain environments.

PaperID: 11158, https://arxiv.org/pdf/2503.16041.pdf

Authors: Oluwole Fagbohun, Sai Yashwanth, Akinyemi Sadeeq Akintola, Ifeoluwa Wurola, Lanre Shittu, Aniema Inyang, Oluwatimilehin Odubola, Udodirim Offia, Said Olanrewaju, Ogidan Toluwaleke, Ilemona Abutu, Taiwo Akinbolaji

Title: GreenIQ: A Deep Search Platform for Comprehensive Carbon Market Analysis and Automated Report Generation

Abstract:
This study introduces GreenIQ, an AI‑powered deep search platform designed to revolutionise carbon market intelligence through autonomous analysis and automated report generation. Carbon markets operate across diverse regulatory landscapes, generating vast amounts of heterogeneous data from policy documents, industry reports, academic literature, and real‑time trading platforms. Traditional research approaches remain labour‑intensive, slow, and difficult to scale. GreenIQ addresses these limitations through a multi‑agent architecture powered by Large Language Models (LLMs), integrating five specialised AI agents: a Main Researcher Agent for intelligent information retrieval, a Report Writing Agent for structured synthesis, a Final Reviewer Agent for accuracy verification, a Data Visualisation Agent for enhanced interpretability, and a Translator Agent for multilingual adaptation. The system achieves seamless integration of structured and unstructured information with AI‑driven citation verification, ensuring high transparency and reliability. GreenIQ delivers a 99.2% reduction in processing time and a 99.7% cost reduction compared to traditional research methodologies. A novel AI persona‑based evaluation framework involving 16 domain‑specific AI personas highlights its superior cross‑jurisdictional analytical capabilities and regulatory insight generation. GreenIQ sets new standards in AI‑driven research synthesis, policy analysis, and sustainability finance by streamlining carbon market research. It offers an efficient and scalable framework for environmental and financial intelligence, enabling more accurate, timely, and cost‑effective decision‑making in complex regulatory landscapes

PaperID: 11159, https://arxiv.org/pdf/2503.16024.pdf

Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang

Title: The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Abstract:
Large language models (LLMs) have recently transformed from text‑based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM‑based agents. In this work, we introduce Critique‑Guided Improvement (CGI), a novel two‑player framework, comprising an actor model that explores an environment and a critic model that generates detailed nature language feedback. By training the critic to produce fine‑grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT‑4 in feedback quality. The resulting actor achieves state‑of‑the‑art performance, demonstrating the power of explicit iterative guidance to enhance decision‑making in LLM‑based agents.

PaperID: 11160, https://arxiv.org/pdf/2503.15955.pdf

Authors: Ting Wang, Zhuangzhuang Qiu, Xiaodong Lu, Yanlong Zhao

Title: Consensus Tracking Control of Multi-agent Systems with A Time-varying Reference State under Binary-valued Communication

Abstract:
This paper investigates the problem of consensus tracking control of discrete time multi‑agent systems under binary‑valued communication. Different from most existing studies on consensus tracking, the transmitted information between agents is the binary‑valued. Parameter identification with binary‑valued observations is applied to the estimation of neighbors'states and the tracking control is designed based on the estimation. Two Lyapunov functions are constructed to deal with the strong coupling of estimation and control. Compared with consensus problems under binary‑valued communication, a reference state is required for consensus tracking control. Two scenarios of the time‑varying reference state are studied respectively. (1) The reference state is asymptotically convergent. An online algorithm that performs estimation and control simultaneously is proposed, in which the estimation step size and the control gain are decreasing with time. By this algorithm, the multi‑agent system is proved to achieve consensus tracking with convergence rate O(1/k^ε ) under certain conditions. (2) The reference state is bounded, which is less conservative than that in the first case. In this case, the estimation step size and control gain are designed to be constant. By this algorithm, all the followers can reach to a neighborhood of the leader with an exponential rate. Finally, simulations are given to demonstrate theoretical results.

PaperID: 11161, https://arxiv.org/pdf/2503.15947.pdf

Authors: Tianyi Hu, Qingxu Fu, Zhiqiang Pu, Yuan Wang, Tenghai Qiu

Title: Unreal-MAP: Unreal-Engine-Based General Platform for Multi-Agent Reinforcement Learning

Abstract:
In this paper, we propose Unreal Multi‑Agent Playground (Unreal‑MAP), an MARL general platform based on the Unreal‑Engine (UE). Unreal‑MAP allows users to freely create multi‑agent tasks using the vast visual and physical resources available in the UE community, and deploy state‑of‑the‑art (SOTA) MARL algorithms within them. Unreal‑MAP is user‑friendly in terms of deployment, modification, and visualization, and all its components are open‑source. We also develop an experimental framework compatible with algorithms ranging from rule‑based to learning‑based provided by third‑party frameworks. Lastly, we deploy several SOTA algorithms in example tasks developed via Unreal‑MAP, and conduct corresponding experimental analyses. We believe Unreal‑MAP can play an important role in the MARL field by closely integrating existing algorithms with user‑customized tasks, thus advancing the field of MARL.

PaperID: 11162, https://arxiv.org/pdf/2503.15848.pdf

Authors: Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, Kunpeng Liu

Title: Entropy-based Exploration Conduction for Multi-step Reasoning

Abstract:
Multi‑step processes via large language models (LLMs) have proven effective for solving complex reasoning tasks. However, the depth of exploration of the reasoning procedure can significantly affect the task performance. Existing methods to automatically decide the depth often lead to high cost and a lack of flexibility. To address these issues, we propose Entropy‑based Exploration Depth Conduction (Entro‑duction), a novel method that dynamically adjusts the exploration depth during multi‑step reasoning by monitoring LLM's output entropy and variance entropy. We employ these two features to capture the model's uncertainty of the current step and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed entropy changes, the LLM selects whether to deepen, expand, or stop exploration according to the probability, which facilitates the trade‑off between the reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro‑duction.

PaperID: 11163, https://arxiv.org/pdf/2503.15754.pdf

Authors: Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li

Title: AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Abstract:
As large language models (LLMs) become increasingly capable, security and safety evaluation are crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and lack comprehensive coverage of emerging attack vectors. This paper introduces AutoRedTeamer, a novel framework for fully automated, end‑to‑end red teaming against LLMs. AutoRedTeamer combines a multi‑agent architecture with a memory‑guided attack selection mechanism to enable continuous discovery and integration of new attack vectors. The dual‑agent framework consists of a red teaming agent that can operate from high‑level risk categories alone to generate and execute test cases and a strategy proposer agent that autonomously discovers and implements new attacks by analyzing recent research. This modular design allows AutoRedTeamer to adapt to emerging threats while maintaining strong performance on existing attack vectors. We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama‑3.1‑70B while reducing computational costs by 46% compared to existing approaches. AutoRedTeamer also matches the diversity of human‑curated benchmarks in generating test cases, providing a comprehensive, scalable, and continuously evolving framework for evaluating the security of AI systems.

PaperID: 11164, https://arxiv.org/pdf/2503.15703.pdf

Authors: Elizabeth Mieczkowski, Ruaridh Mon-Williams, Neil Bramley, Christopher G. Lucas, Natalia Velez, Thomas L. Griffiths

Title: Predicting Multi-Agent Specialization via Task Parallelizability

Abstract:
When should we encourage specialization in multi‑agent systems versus train generalists that perform the entire task independently? We propose that specialization largely depends on task parallelizability: the potential for multiple agents to execute task components concurrently. Drawing inspiration from Amdahl's Law in distributed systems, we present a closed‑form bound that predicts when specialization improves performance, depending only on task concurrency and team size. We validate our model on two standard MARL benchmarks that represent opposite regimes ‑‑ StarCraft Multi‑Agent Challenge (SMAC, unlimited concurrency) and Multi‑Particle Environment (MPE, unit‑capacity bottlenecks) ‑‑ and observe close alignment between the bound at each extreme and an empirical measure of specialization. Three follow‑up experiments in Overcooked‑AI demonstrate that the model works in environments with more complex spatial and resource bottlenecks that allow for a range of strategies. Beyond prediction, the bound also serves as a diagnostic tool, highlighting biases in MARL training algorithms that cause sub‑optimal convergence to specialist strategies with larger state spaces.

PaperID: 11165, https://arxiv.org/pdf/2503.15615.pdf

Authors: Joshua McClellan, Greyson Brothers, Furong Huang, Pratap Tokekar

Title: PEnGUiN: Partially Equivariant Graph NeUral Networks for Sample Efficient MARL

Abstract:
Equivariant Graph Neural Networks (EGNNs) have emerged as a promising approach in Multi‑Agent Reinforcement Learning (MARL), leveraging symmetry guarantees to greatly improve sample efficiency and generalization. However, real‑world environments often exhibit inherent asymmetries arising from factors such as external forces, measurement inaccuracies, or intrinsic system biases. This paper introduces Partially Equivariant Graph NeUral Networks (PEnGUiN), a novel architecture specifically designed to address these challenges. We formally identify and categorize various types of partial equivariance relevant to MARL, including subgroup equivariance, feature‑wise equivariance, regional equivariance, and approximate equivariance. We theoretically demonstrate that PEnGUiN is capable of learning both fully equivariant (EGNN) and non‑equivariant (GNN) representations within a unified framework. Through extensive experiments on a range of MARL problems incorporating various asymmetries, we empirically validate the efficacy of PEnGUiN. Our results consistently demonstrate that PEnGUiN outperforms both EGNNs and standard GNNs in asymmetric environments, highlighting their potential to improve the robustness and applicability of graph‑based MARL algorithms in real‑world scenarios.

PaperID: 11166, https://arxiv.org/pdf/2503.15552.pdf

Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu

Title: Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection

Abstract:
The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi‑turn, chat‑based interactions is considerably more complex than single‑instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the SE attack mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. In this work, we propose an LLM‑agentic framework, SE‑VSim, to simulate SE attack mechanisms by generating multi‑turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE‑OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victims personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

PaperID: 11167, https://arxiv.org/pdf/2503.15547.pdf

Authors: Juhee Kim, Woohyuk Choi, Byoungyoung Lee

Title: Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

Abstract:
Large Language Models (LLMs) are combined with tools to create powerful LLM agents that provide a wide range of services. Unlike traditional software, LLM agent's behavior is determined at runtime by natural language prompts from either user or tool's data. This flexibility enables a new computing paradigm with unlimited capabilities and programmability, but also introduces new security risks, vulnerable to privilege escalation attacks. Moreover, user prompts are prone to be interpreted in an insecure way by LLM agents, creating non‑deterministic behaviors that can be exploited by attackers. To address these security risks, we propose Prompt Flow Integrity (PFI), a system security‑oriented solution to prevent privilege escalation in LLM agents. Analyzing the architectural characteristics of LLM agents, PFI features three mitigation techniques ‑‑ i.e., agent isolation, secure untrusted data processing, and privilege escalation guardrails. Our evaluation result shows that PFI effectively mitigates privilege escalation attacks while successfully preserving the utility of LLM agents.

PaperID: 11168, https://arxiv.org/pdf/2503.15522.pdf

Authors: Ana Tanevska, Katie Winkle, Ginevra Castellano

Title: "I don't like things where I do not have control": Participants' Experience of Trustworthy Interaction with Autonomous Vehicles

Abstract:
With the rapid advancement of autonomous vehicle (AV) technology, AVs are progressively seen as interactive agents with some level of autonomy, as well as some context‑dependent social features. This introduces new challenges and questions, already relevant in other areas of human‑robot interaction (HRI) ‑ namely, if an AV is perceived as a social agent by the human with whom it is interacting, how are the various facets of its design and behaviour impacting its human partner? And how can we foster a successful human‑agent interaction (HAI) between the AV and the human, maximizing the human's comfort, acceptance, and trust in the AV? In this work, we attempt to understand the various factors that could influence naïve participants' acceptance and trust when interacting with an AV in the role of a driver. Through a large‑scale online study, we investigate the effect of the AV's autonomy on the human driver, as well as explore which parameters of the interaction have the highest impact on the user's sense of trust in the AV. Finally, we analyze our preliminary findings from the user study within existing guidelines on Trustworthy HAI/HRI.

PaperID: 11169, https://arxiv.org/pdf/2503.15519.pdf

Authors: Samuel Ren

Title: Can AI Assist in Olympiad Coding

Abstract:
As artificial intelligence programs have become more powerful, their capacity for problem‑solving continues to increase, approaching top‑level competitors in many olympiads. Continued development of models and benchmarks is important but not the focus of this paper. While further development of these models and benchmarks remains critical, the focus of this paper is different: we investigate how AI can assist human competitors in high‑level coding contests. In our proposed workflow, a human expert outlines an algorithm and subsequently relies on an AI agent for the implementation details. We examine whether such human‑AI collaboration can streamline the problem‑solving process and improve efficiency, highlighting the unique challenges and opportunities of integrating AI into competitive programming contexts.

PaperID: 11170, https://arxiv.org/pdf/2503.15506.pdf

Authors: Mina Ghobrial, Philippe Seitier, Pierre Lagarrigue, Michel Galaup, Patrick Gilles

Title: Effectiveness of machining equipment user guides: A comparative study of augmented reality and traditional media

Abstract:
In the rapidly evolving landscape of manufacturing and material forming, innovative strategies are imperative for maintaining a competitive edge. Augmented Reality (AR) has emerged as a groundbreaking technology, offering new dimensions in how information is displayed and interacted with. It holds particular promise in the panel of instructional guides for complex machinery, potentially enhance traditional methods of knowledge transfer and operator training. Material forming, a key discipline within mechanical engineering, requires high‑precision and skill, making it an ideal candidate for the integration of advanced instructional technologies like AR. This study aims to explore the efficiency of three distinct types of user manuals‑video, paper, and augmented reality (AR)‑on performance and acceptability in a material forming workshop environment. The focus will be on how AR can be specifically applied to improve task execution and understanding in material forming operations. Participants are mechanical engineering students specializing in material forming. They will engage in a series of standardized tasks related to machining processes. Performance will be gauged by metrics like task completion time and error rates, while task load will be assessed via the NASA Task Load Index (NASA‑TLX) [1]. Acceptability of each manual type will be evaluated using the System Usability Scale (SUS) [2]. By comparing these various instructional formats, this research seeks to shed light on the most effective mediums for enhancing both operator performance and experience.

PaperID: 11171, https://arxiv.org/pdf/2503.15504.pdf

Authors: Michele Grimaldi, Jieyeon Woo, Fabien Boucaud, Lucie Galland, Nezih Younsi, Liu Yang, Mireille Fares, Sean Graux, Philippe Gauthier, Catherine Pelachaud

Title: GRETA: Modular Platform to Create Adaptive Socially Interactive Agents

Abstract:
The interaction between humans is very complex to describe since it is composed of different elements from different modalities such as speech, gaze, and gestures influenced by social attitudes and emotions. Furthermore, the interaction can be affected by some features which refer to the interlocutor's state. Actual Socially Interactive Agents SIAs aim to adapt themselves to the state of the interaction partner. In this paper, we discuss this adaptation by describing the architecture of the GRETA platform which considers external features while interacting with humans and/or another ECA and process the dialogue incrementally. We illustrate the new architecture of GRETA which deals with the external features, the adaptation, and the incremental approach for the dialogue processing.

PaperID: 11172, https://arxiv.org/pdf/2503.15497.pdf

Authors: Mingjun Ren, Wentao Xu

Title: The Impact of Big Five Personality Traits on AI Agent Decision-Making in Public Spaces: A Social Simulation Study

Abstract:
This study investigates how the Big Five personality traits influence decision‑making processes in AI agents within public spaces. Using AgentVerse framework and GPT‑3.5‑turbo, we simulated interactions among 10 AI agents, each embodying different dimensions of the Big Five personality traits, in a classroom environment responding to misinformation. The experiment assessed both public expressions ([Speak]) and private thoughts ([Think]) of agents, revealing significant correlations between personality traits and decision‑making patterns. Results demonstrate that Openness to Experience had the strongest impact on information acceptance, with curious agents showing high acceptance rates and cautious agents displaying strong skepticism. Extraversion and Conscientiousness also showed notable influence on decision‑making, while Neuroticism and Agreeableness exhibited more balanced responses. Additionally, we observed significant discrepancies between public expressions and private thoughts, particularly in agents with friendly and extroverted personalities, suggesting that social context influences decision‑making behavior. Our findings contribute to understanding how personality traits shape AI agent behavior in social settings and have implications for developing more nuanced and context‑aware AI systems.

PaperID: 11173, https://arxiv.org/pdf/2503.15478.pdf

Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li

Title: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Abstract:
Large language model (LLM) agents need to perform multi‑turn interactions in real‑world tasks. However, existing multi‑turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET‑RL (RL with Step‑WisE Evaluation from Training‑time information), that uses a carefully designed optimization objective to train a critic model with access to additional training‑time information. The critic provides step‑level rewards for improving the policy model. Our experiments demonstrate that SWEET‑RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state‑of‑the‑art multi‑turn RL algorithms, enabling Llama‑3.1‑8B to match or exceed the performance of GPT4‑o in realistic collaborative content creation.

PaperID: 11174, https://arxiv.org/pdf/2503.15371.pdf

Authors: Cristiana de Farias, Luis Figueredo, Riddhiman Laha, Maxime Adjigble, Brahim Tamadazte, Rustam Stolkin, Sami Haddadin, Naresh Marturi

Title: Geometrically-Aware One-Shot Skill Transfer of Category-Level Objects

Abstract:
Robotic manipulation of unfamiliar objects in new environments is challenging and requires extensive training or laborious pre‑programming. We propose a new skill transfer framework, which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object‑centric interactions. By leveraging the Functional Maps (FM) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates a Task‑Space Imitation Algorithm (TSIA) which generates smooth, geometrically‑aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real‑world environments without requiring additional training.

PaperID: 11175, https://arxiv.org/pdf/2503.15360.pdf

Authors: Brandon C. Fallin, Cristian F. Nino, Omkar Sudhir Patil, Zachary I. Bell, Warren E. Dixon

Title: Lyapunov-Based Graph Neural Networks for Adaptive Control of Multi-Agent Systems

Abstract:
Graph neural networks (GNNs) have a message‑passing framework in which vector messages are exchanged between graph nodes and updated using feedforward layers. The inclusion of distributed message‑passing in the GNN architecture makes them ideally suited for distributed control and coordination tasks. Existing results develop GNN‑based controllers to address a variety of multi‑agent control problems while compensating for modeling uncertainties in the systems. However, these results use GNNs that are pre‑trained offline. This paper provides the first result on GNNs with stability‑driven online weight updates to address the multi‑agent target tracking problem. Specifically, new Lyapunov‑based distributed GNN and graph attention network (GAT)‑based controllers are developed to adaptively estimate unknown target dynamics and address the second‑order target tracking problem. A Lyapunov‑based stability analysis is provided to guarantee exponential convergence of the target state estimates and agent states to a neighborhood of the target state. Numerical simulations show a 20.8% and 48.1% position tracking error performance improvement by the GNN and GAT architectures over a baseline DNN architecture, respectively.

PaperID: 11176, https://arxiv.org/pdf/2503.15204.pdf

Authors: Tittaya Mairittha, Tanakon Sawanglok, Panuwit Raden, Sorrawit Treesuk

Title: When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection

Abstract:
Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI‑powered, multi‑agent diagnostic system that leverages Retrieval‑Augmented Generation (RAG) to deliver timely, evidence‑based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom‑Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence‑weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI‑driven diagnostic framework, this approach enhances veterinary decision‑making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.

PaperID: 11177, https://arxiv.org/pdf/2503.15172.pdf

Authors: George Stamatelis, Angelos-Nikolaos Kanatas, George C. Alexandropoulos

Title: Multi-Agent Actor-Critic with Harmonic Annealing Pruning for Dynamic Spectrum Access Systems

Abstract:
Multi‑Agent Deep Reinforcement Learning (MADRL) has emerged as a powerful tool for optimizing decentralized decision‑making systems in complex settings, such as Dynamic Spectrum Access (DSA). However, deploying deep learning models on resource‑constrained edge devices remains challenging due to their high computational cost. To address this challenge, in this paper, we present a novel sparse recurrent MARL framework integrating gradual neural network pruning into the independent actor global critic paradigm. Additionally, we introduce a harmonic annealing sparsity scheduler, which achieves comparable, and in certain cases superior, performance to standard linear and polynomial pruning schedulers at large sparsities. Our experimental investigation demonstrates that the proposed DSA framework can discover superior policies, under diverse training conditions, outperforming conventional DSA, MADRL baselines, and state‑of‑the‑art pruning techniques.

PaperID: 11178, https://arxiv.org/pdf/2503.15129.pdf

Authors: Man Fai Wong, Chee Wei Tan

Title: Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models

Abstract:
This paper studies how AI‑assisted programming and large language models (LLM) improve software developers' ability via AI tools (LLM agents) like Github Copilot and Amazon CodeWhisperer, while integrating human feedback to enhance reinforcement learning (RLHF) with crowd‑sourced computation to enhance text‑to‑code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback collection burden, highlighting the value of collecting human feedback of good quality. Our empirical evaluations demonstrate the efficacy of this approach, showcasing how LLM agents can be effectively trained for improved text‑to‑code generation. Our Bayesian optimization framework can be designed for general domain‑specific languages, promoting the alignment of large language model capabilities with human feedback in AI‑assisted programming for code generation.

PaperID: 11179, https://arxiv.org/pdf/2503.15079.pdf

Authors: Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, Xin Peng

Title: LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents

Abstract:
Automated testing for REST APIs has become essential for ensuring the correctness and reliability of modern web services. While existing approaches primarily focus on detecting server crashes and error codes, they often overlook logical issues that arise due to evolving business logic and domain‑specific requirements. To address this limitation, we propose LogiAgent, a novel approach for logical testing of REST systems. Built upon a large language model (LLM)‑driven multi‑agent framework, LogiAgent integrates a Test Scenario Generator, API Request Executor, and API Response Validator to collaboratively generate, execute, and validate API test scenarios. Unlike traditional testing methods that focus on status codes like 5xx, LogiAgent incorporates logical oracles that assess responses based on business logic, ensuring more comprehensive testing. The system is further enhanced by an Execution Memory component that stores historical API execution data for contextual consistency. We conduct extensive experiments across 12 real‑world REST systems, demonstrating that LogiAgent effectively identifies 234 logical issues with an accuracy of 66.19%. Additionally, it basically excels in detecting server crashes and achieves superior test coverage compared to four state‑of‑the‑art REST API testing tools. An ablation study confirms the significant contribution of LogiAgent's memory components to improving test coverage.

PaperID: 11180, https://arxiv.org/pdf/2503.14948.pdf

Authors: Hao Liang, Zhipeng Dong, Kaixin Chen, Jiyuan Guo, Yufeng Yue, Yi Yang, Mengyin Fu

Title: ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

Abstract:
Surround‑view perception has garnered significant attention for its ability to enhance the perception capabilities of autonomous driving vehicles through the exchange of information with surrounding cameras. However, existing surround‑view perception systems are limited by inefficiencies in unidirectional interaction pattern with human and distortions in overlapping regions exponentially propagating into non‑overlapping areas. To address these challenges, this paper introduces ChatStitch, a surround‑view human‑machine co‑perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To dismantle the unidirectional interaction bottleneck, ChatStitch implements a cognitively grounded closed‑loop interaction multi‑agent framework based on Large Language Models. To suppress distortion propagation across overlapping boundaries, ChatStitch proposes SV‑UDIS, a surround‑view unsupervised deep image stitching method under the non‑global‑overlapping condition. We conducted extensive experiments on the UDIS‑D, MCOV‑SLAM open datasets, and our real‑world dataset. Specifically, our SV‑UDIS method achieves state‑of‑the‑art performance on the UDIS‑D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.

PaperID: 11181, https://arxiv.org/pdf/2503.14656.pdf

Authors: Basit Muhammad Imran, Jeeseop Kim, Taizoon Chunawala, Alexander Leonessa, Kaveh Akbari Hamed

Title: Safety-Critical and Distributed Nonlinear Predictive Controllers for Teams of Quadrupedal Robots

Abstract:
This paper presents a novel hierarchical, safety‑critical control framework that integrates distributed nonlinear model predictive controllers (DNMPCs) with control barrier functions (CBFs) to enable cooperative locomotion of multi‑agent quadrupedal robots in complex environments. While NMPC‑based methods are widely adopted for enforcing safety constraints and navigating multi‑robot systems (MRSs) through intricate environments, ensuring the safety of MRSs requires a formal definition grounded in the concept of invariant sets. CBFs, typically implemented via quadratic programs (QPs) at the planning layer, provide formal safety guarantees. However, their zero‑control horizon limits their effectiveness for extended trajectory planning in inherently unstable, underactuated, and nonlinear legged robot models. Furthermore, the integration of CBFs into real‑time NMPC for sophisticated MRSs, such as quadrupedal robot teams, remains underexplored. This paper develops computationally efficient, distributed NMPC algorithms that incorporate CBF‑based collision safety guarantees within a consensus protocol, enabling longer planning horizons for safe cooperative locomotion under disturbances and rough terrain conditions. The optimal trajectories generated by the DNMPCs are tracked using full‑order, nonlinear whole‑body controllers at the low level. The proposed approach is validated through extensive numerical simulations with up to four Unitree A1 robots and hardware experiments involving two A1 robots subjected to external pushes, rough terrain, and uncertain obstacle information. Comparative analysis demonstrates that the proposed CBF‑based DNMPCs achieve a 27.89% higher success rate than conventional NMPCs without CBF constraints.

PaperID: 11182, https://arxiv.org/pdf/2503.14576.pdf

Authors: Zihao Guo, Shuqing Shi, Richard Willis, Tristan Tomilin, Joel Z. Leibo, Yali Du

Title: SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas

Abstract:
Sequential social dilemmas pose a significant challenge in the field of multi‑agent reinforcement learning (MARL), requiring environments that accurately reflect the tension between individual and collective interests. Previous benchmarks and environments, such as Melting Pot, provide an evaluation protocol that measures generalization to new social partners in various test scenarios. However, running reinforcement learning algorithms in traditional environments requires substantial computational resources. In this paper, we introduce SocialJax, a suite of sequential social dilemma environments and algorithms implemented in JAX. JAX is a high‑performance numerical computing library for Python that enables significant improvements in operational efficiency. Our experiments demonstrate that the SocialJax training pipeline achieves at least 50\texttimes speed‑up in real‑time performance compared to Melting Pot RLlib baselines. Additionally, we validate the effectiveness of baseline algorithms within SocialJax environments. Finally, we use Schelling diagrams to verify the social dilemma properties of these environments, ensuring that they accurately capture the dynamics of social dilemmas.

PaperID: 11183, https://arxiv.org/pdf/2503.14481.pdf

Authors: Jacob Eisenstein, Reza Aghajani, Adam Fisch, Dheeru Dua, Fantine Huot, Mirella Lapata, Vicky Zayats, Jonathan Berant

Title: Don't lie to your friends: Learning what you know from collaborative self-play

Abstract:
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine‑tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: \emphcollaborative self‑play. We construct multi‑agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta‑knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus‑specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group‑level rewards for multi‑agent communities can induce policies that \emphtransfer to improve tool use and selective prediction in settings where individual agents are deployed in isolation.

PaperID: 11184, https://arxiv.org/pdf/2503.14432.pdf

Authors: Wei Fang, Yang Zhang, Kaizhi Qian, James Glass, Yada Zhu

Title: PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

Abstract:
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero‑shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero‑shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input‑output behaviors. Through this iterative trial‑and‑error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real‑world tasks demonstrate that PLAY2PROMPT significantly improves zero‑shot tool performance across both open and closed models, offering a scalable and effective solution for domain‑specific tool integration.

PaperID: 11185, https://arxiv.org/pdf/2503.14418.pdf

Authors: Cristian F. Nino, Omkar Sudhir Patil, Sage C. Edwards, Warren E. Dixon

Title: Distributed RISE-based Control for Exponential Heterogeneous Multi-Agent Target Tracking of Second-Order Nonlinear Systems

Abstract:
A distributed implementation of a Robust Integral of the Sign of the Error (RISE) controller is developed for multi‑agent target tracking problems with exponential convergence guarantees. Previous RISE‑based approaches for multi‑agent systems required 2‑hop communication, limiting practical applicability. New insights from a Lyapunov‑based design‑analysis approach are used to eliminate the need for multi‑hop communication required in previous literature, while yielding exponential target tracking. The new insights include the development of a new P‑function that works in tandem with the graph interaction matrix in the Lyapunov function. Nonsmooth Lyapunov‑based stability analysis methods are used to yield semi‑global exponential convergence to the target agent state despite the presence of bounded disturbances with bounded derivatives. The resulting outcome is a controller that achieves exponential target tracking with only local information exchange between neighboring agents.

PaperID: 11186, https://arxiv.org/pdf/2503.14353.pdf

Authors: Erik G. Larsson, Nicolo Michelusi

Title: Unified Analysis of Decentralized Gradient Descent: a Contraction Mapping Framework

Abstract:
The decentralized gradient descent (DGD) algorithm, and its sibling, diffusion, are workhorses in decentralized machine learning, distributed inference and estimation, and multi‑agent coordination. We propose a novel, principled framework for the analysis of DGD and diffusion for strongly convex, smooth objectives, and arbitrary undirected topologies, using contraction mappings coupled with a result called the mean Hessian theorem (MHT). The use of these tools yields tight convergence bounds, both in the noise‑free and noisy regimes. While these bounds are qualitatively similar to results found in the literature, our approach using contractions together with the MHT decouples the algorithm dynamics (how quickly the algorithm converges to its fixed point) from its asymptotic convergence properties (how far the fixed point is from the global optimum). This yields a simple, intuitive analysis that is accessible to a broader audience. Extensions are provided to multiple local gradient updates, time‑varying step sizes, noisy gradients (stochastic DGD and diffusion), communication noise, and random topologies.

PaperID: 11187, https://arxiv.org/pdf/2503.14340.pdf

Authors: Yisen Xu, Feng Lin, Jinqiu Yang, Tse-Hsun, Chen, Nikolaos Tsantalis

Title: MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

Abstract:
Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor‑intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution. In this work, we introduce MANTRA, a comprehensive LLM agent‑based framework that automates method‑level refactoring. MANTRA integrates Context‑Aware Retrieval‑Augmented Generation, coordinated Multi‑Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision‑making during refactoring while preserving code correctness and readability. Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements), drawn from 10 representative Java projects, covers the six most prevalent refactoring operations. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model (RawGPT ), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT. Moreover, in comparison to IntelliJ's LLM‑powered refactoring tool (EM‑Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations. A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human‑written code, and in certain cases, even more favorable. These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM‑based systems in advancing the automation of software refactoring tasks.

PaperID: 11188, https://arxiv.org/pdf/2503.14251.pdf

Authors: Yu Feng, Puzhen Zhang, Guohui Xiao, Linfang Ding, Liqiu Meng

Title: Towards a Barrier-free GeoQA Portal: Natural Language Interaction with Geospatial Data Using Multi-Agent LLMs and Semantic Search

Abstract:
A Barrier‑Free GeoQA Portal: Enhancing Geospatial Data Accessibility with a Multi‑Agent LLM Framework Geoportals are vital for accessing and analyzing geospatial data, promoting open spatial data sharing and online geo‑information management. Designed with GIS‑like interaction and layered visualization, they often challenge non‑expert users with complex functionalities and overlapping layers that obscure spatial relationships. We propose a GeoQA Portal using a multi‑agent Large Language Model framework for seamless natural language interaction with geospatial data. Complex queries are broken into subtasks handled by specialized agents, retrieving relevant geographic data efficiently. Task plans are shown to users, boosting transparency. The portal supports default and custom data inputs for flexibility. Semantic search via word vector similarity aids data retrieval despite imperfect terms. Case studies, evaluations, and user tests confirm its effectiveness for non‑experts, bridging GIS complexity and public access, and offering an intuitive solution for future geoportals.

PaperID: 11189, https://arxiv.org/pdf/2503.14234.pdf

Authors: Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim

Title: Beyond Single Pass, Looping Through Time: KG-IRAG with Iterative Knowledge Retrieval

Abstract:
Graph Retrieval‑Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi‑step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph‑Based Iterative Retrieval‑Augmented Generation (KG‑IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG‑IRAG incrementally gathers relevant data from external KGs, enabling step‑by‑step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG‑IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic‑based retrieval. Additionally, three new datasets: weatherQA‑Irish, weatherQA‑Sydney, and trafficQA‑TFNSW, are formed to evaluate KG‑IRAG's performance, demonstrating its potential beyond traditional RAG applications.

PaperID: 11190, https://arxiv.org/pdf/2503.14222.pdf

Authors: Katayoun Eshkofti, Matthieu Barreau

Title: Vanishing Stacked-Residual PINN for State Reconstruction of Hyperbolic Systems

Abstract:
In a more connected world, modeling multi‑agent systems with hyperbolic partial differential equations (PDEs) offers a compact, physics‑consistent description of collective dynamics. However, classical control tools need adaptation for these complex systems. Physics‑informed neural networks (PINNs) provide a powerful framework to fix this issue by inferring solutions to PDEs by embedding governing equations into the neural network. A major limitation of original PINNs is their inability to capture steep gradients and discontinuities in hyperbolic PDEs. To tackle this problem, we propose a stacked residual PINN method enhanced with a vanishing viscosity mechanism. Initially, a basic PINN with a small viscosity coefficient provides a stable, low‑fidelity solution. Residual correction blocks with learnable scaling parameters then iteratively refine this solution, progressively decreasing the viscosity coefficient to transition from parabolic to hyperbolic PDEs. Applying this method to traffic state reconstruction improved results by an order of magnitude in relative \mathcalL^2 error, demonstrating its potential to accurately estimate solutions where original PINNs struggle with instability and low fidelity.

PaperID: 11191, https://arxiv.org/pdf/2503.14119.pdf

Authors: Beniamino Di Lorenzo, Gian Carlo Maffettone, Mario di Bernardo

Title: Decentralized Continuification Control of Multi-Agent Systems via Distributed Density Estimation

Abstract:
This paper introduces a novel decentralized implementation of a continuification‑based strategy to control the density of large‑scale multi‑agent systems on the unit circle. While continuification methods effectively address micro‑to‑macro control problems by reformulating ordinary/stochastic differential equations (ODEs/SDEs) agent‑based models into more tractable partial differential equations (PDEs), they traditionally require centralized knowledge of macroscopic state observables. We overcome this limitation by developing a distributed density estimation framework that combines kernel density estimation with PI consensus dynamics. Our approach enables agents to compute local density estimates and derive local control actions using only information from neighboring agents in a communication network. Numerical validations across multiple scenarios ‑ including regulation, tracking, and time‑varying communication topologies ‑ confirm the effectiveness of the proposed approach. They also convincingly demonstrate that our decentralized implementation achieves performance comparable to centralized approaches while enhancing reliability and practical applicability.

PaperID: 11192, https://arxiv.org/pdf/2503.13966.pdf

Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu

Title: FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

Abstract:
The aspiration of the Vision‑and‑Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset‑specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised‑learning‑based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi‑model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN‑target as out‑of‑domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all the previous methods to a large extent.

PaperID: 11193, https://arxiv.org/pdf/2503.13856.pdf

Authors: Kai Chen, Xinfeng Li, Tianpei Yang, Hewei Wang, Wei Dong, Yang Gao

Title: MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation

Abstract:
Large Language Models (LLMs) have made significant progress in various fields. However, challenges remain in Multi‑Disciplinary Team (MDT) medical consultations. Current research enhances reasoning through role assignment, task decomposition, and accumulation of medical experience. Multi‑role collaboration in MDT consultations often results in excessively long dialogue histories. This increases the model's cognitive burden and degrades both efficiency and accuracy. Some methods only store treatment histories. They do not extract effective experience or reflect on errors. This limits knowledge generalization and system evolution. We propose a multi‑agent MDT medical consultation framework based on LLMs to address these issues. Our framework uses consensus aggregation and a residual discussion structure for multi‑round consultations. It also employs a Correct Answer Knowledge Base (CorrectKB) and a Chain‑of‑Thought Knowledge Base (ChainKB) to accumulate consultation experience. These mechanisms enable the framework to evolve and continually improve diagnosis rationality and accuracy. Experimental results on the MedQA and PubMedQA datasets demonstrate that our framework achieves accuracies of 90.1% and 83.9%, respectively, and that the constructed knowledge bases generalize effectively across test sets from both datasets.

PaperID: 11194, https://arxiv.org/pdf/2503.13843.pdf

Authors: Trisanth Srinivasan, Santosh Patapati

Title: WebNav: An Intelligent Agent for Voice-Controlled Web Navigation

Abstract:
The current state of modern web interfaces, especially in regards to accessibility focused usage is extremely lacking. Traditional methods for web interaction, such as scripting languages and screen readers, often lack the flexibility to handle dynamic content or the intelligence to interpret high‑level user goals. To address these limitations, we introduce WebNav, a novel agent for multi‑modal web navigation. WebNav leverages a dual Large Language Model (LLM) architecture to translate natural language commands into precise, executable actions on a graphical user interface. The system combines vision‑based context from screenshots with a dynamic DOM‑labeling browser extension to robustly identify interactive elements. A high‑level 'Controller' LLM strategizes the next step toward a user's goal, while a second 'Assistant' LLM generates the exact parameters for execution. This separation of concerns allows for sophisticated task decomposition and action formulation. Our work presents the complete architecture and implementation of WebNav, demonstrating a promising approach to creating more intelligent web automation agents.

PaperID: 11195, https://arxiv.org/pdf/2503.13790.pdf

Authors: Dong Ho Lee, Kaitlyn Donnel, Max Z. Li, David Fridovich-Keil

Title: A Convex Formulation of Game-theoretic Hierarchical Routing

Abstract:
Hierarchical decision‑making is a natural paradigm for coordinating multi‑agent systems in complex environments such as air traffic management. In this paper, we present a bilevel framework for game‑theoretic hierarchical routing, where a high‑level router assigns discrete routes to multiple vehicles who seek to optimize potentially noncooperative objectives that depend upon the assigned routes. To address computational challenges, we propose a reformulation that preserves the convexity of each agent's feasible set. This convex reformulation enables a solution to be identified efficiently via a customized branch‑and‑bound algorithm. Our approach ensures global optimality while capturing strategic interactions between agents at the lower level. We demonstrate the solution concept of our framework in two‑vehicle and three‑vehicle routing scenarios.

PaperID: 11196, https://arxiv.org/pdf/2503.13754.pdf

Authors: Krti Tallam

Title: From Autonomous Agents to Integrated Systems, A New Paradigm: Orchestrated Distributed Intelligence

Abstract:
The rapid evolution of artificial intelligence (AI) has ushered in a new era of integrated systems that merge computational prowess with human decision‑making. In this paper, we introduce the concept of Orchestrated Distributed Intelligence (ODI), a novel paradigm that reconceptualizes AI not as isolated autonomous agents, but as cohesive, orchestrated networks that work in tandem with human expertise. ODI leverages advanced orchestration layers, multi‑loop feedback mechanisms, and a high cognitive density framework to transform static, record‑keeping systems into dynamic, action‑oriented environments. Through a comprehensive review of multi‑agent system literature, recent technological advances, and practical insights from industry forums, we argue that the future of AI lies in integrating distributed intelligence within human‑centric workflows. This approach not only enhances operational efficiency and strategic agility but also addresses challenges related to scalability, transparency, and ethical decision‑making. Our work outlines key theoretical implications and presents a practical roadmap for future research and enterprise innovation, aiming to pave the way for responsible and adaptive AI systems that drive sustainable innovation in human organizations.

PaperID: 11197, https://arxiv.org/pdf/2503.13657.pdf

Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Title: Why Do Multi-Agent LLM Systems Fail?

Abstract:
Despite enthusiasm for Multi‑Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST‑Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST‑Data is the first multi‑agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST‑Data, we build the first Multi‑Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter‑annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter‑agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM‑as‑a‑Judge pipeline with high agreement with human annotations. We leverage MAST and MAST‑Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST‑Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.

PaperID: 11198, https://arxiv.org/pdf/2503.13619.pdf

Authors: Spyros Reveliotis, Eva Robillard

Title: Stable Task Allocation in Multi-Agent Systems with Lexicographic Preferences

Abstract:
Motivated by the increasing interest in the explicit representation and handling of various "preference" structures arising in modern digital economy, this work introduces a new class of "one‑to‑many stable‑matching" problems where a set of atomic tasks must be stably allocated to a set of agents. An important characteristic of these stable‑matching problems is the very arbitrary specification of the task subsets constituting "feasible" allocations for each agent. It is shown that as long as the agents rank their feasible task allocations lexicographically with respect to their stated preferences for each atomic task, matching stability reduces to the absence of blocking agent‑task pairs. This result, together with a pertinent graphical representation of feasible allocations, enable (i) the representation of the space of stable matchings as a set of linear constraints with binary variables, and (ii) the specification and handling of certain notions of optimality within this space of stable matchings. The last part of the paper also addresses the notion of "substitutability" in the considered problem context.

PaperID: 11199, https://arxiv.org/pdf/2503.13553.pdf

Authors: Philipp D. Siedler, Ian Gemp

Title: LLM-Mediated Guidance of MARL Systems

Abstract:
In complex multi‑agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi‑Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)‑mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and facilitate interventions that shape the learning trajectories of multiple agents. We experimented with two types of interventions, referred to as controllers: a Natural Language (NL) Controller and a Rule‑Based (RB) Controller. The RB Controller showed a stronger impact than the NL Controller, which uses a small (7B/8B) LLM to simulate human‑like interventions. Our findings indicate that agents particularly benefit from early interventions, leading to more efficient training and higher performance. Both intervention types outperform the baseline without interventions, highlighting the potential of LLM‑mediated guidance to accelerate training and enhance MARL performance in challenging environments.

PaperID: 11200, https://arxiv.org/pdf/2503.13551.pdf

Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong, Zewen Ye, Shengjie Ma, Jianping Zhang

Title: Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Abstract:
Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine‑tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large‑scale collection of high‑quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine‑grained and coarse‑grained levels. HRM excels at assessing multi‑step reasoning coherence, especially when flawed steps are later corrected through self‑reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS‑generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross‑domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

PaperID: 11201, https://arxiv.org/pdf/2503.13547.pdf

Authors: Xu Guo, Xiangwang Hou, Minrui Xu, Jianrui Chen, Jingjing Wang, Jun Du, Yong Ren

Title: Adaptive AUV Hunting Policy with Covert Communication via Diffusion Model

Abstract:
Collaborative underwater target hunting, facilitated by multiple autonomous underwater vehicles (AUVs), plays a significant role in various domains, especially military missions. Existing research predominantly focuses on designing efficient and high‑success‑rate hunting policy, particularly addressing the target's evasion capabilities. However, in real‑world scenarios, the target can not only adjust its evasion policy based on its observations and predictions but also possess eavesdropping capabilities. If communication among hunter AUVs, such as hunting policy exchanges, is intercepted by the target, it can adapt its escape policy accordingly, significantly reducing the success rate of the hunting mission. To address this challenge, we propose a covert communication‑guaranteed collaborative target hunting framework, which ensures efficient hunting in complex underwater environments while defending against the target's eavesdropping. To the best of our knowledge, this is the first study to incorporate the confidentiality of inter‑agent communication into the design of target hunting policy. Furthermore, given the complexity of coordinating multiple AUVs in dynamic and unpredictable environments, we propose an adaptive multi‑agent diffusion policy (AMADP), which incorporates the strong generative ability of diffusion models into the multi‑agent reinforcement learning (MARL) algorithm. Experimental results demonstrate that AMADP achieves faster convergence and higher hunting success rates while maintaining covertness constraints.

PaperID: 11202, https://arxiv.org/pdf/2503.13524.pdf

Authors: Joseph R. Loffredo, Suyeol Yun

Title: Agent-Enhanced Large Language Models for Researching Political Institutions

Abstract:
The applications of Large Language Models (LLMs) in political science are rapidly expanding. This paper demonstrates how LLMs, when augmented with predefined functions and specialized tools, can serve as dynamic agents capable of streamlining tasks such as data collection, preprocessing, and analysis. Central to this approach is agentic retrieval‑augmented generation (Agentic RAG), which equips LLMs with action‑calling capabilities for interaction with external knowledge bases. Beyond information retrieval, LLM agents may incorporate modular tools for tasks like document summarization, transcript coding, qualitative variable classification, and statistical modeling. To demonstrate the potential of this approach, we introduce CongressRA, an LLM agent designed to support scholars studying the U.S. Congress. Through this example, we highlight how LLM agents can reduce the costs of replicating, testing, and extending empirical research using the domain‑specific data that drives the study of political institutions.

PaperID: 11203, https://arxiv.org/pdf/2503.13514.pdf

Authors: Hong Qing Yu, Frank McQuade

Title: RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration

Abstract:
This paper presents RAG‑KG‑IL, a novel multi‑agent hybrid framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) by integrating Retrieval‑Augmented Generation (RAG) and Knowledge Graphs (KGs) with an Incremental Learning (IL) approach. Despite recent advancements, LLMs still face significant challenges in reasoning with structured data, handling dynamic knowledge evolution, and mitigating hallucinations, particularly in mission‑critical domains. Our proposed RAG‑KG‑IL framework addresses these limitations by employing a multi‑agent architecture that enables continuous knowledge updates, integrates structured knowledge, and incorporates autonomous agents for enhanced explainability and reasoning. The framework utilizes RAG to ensure the generated responses are grounded in verifiable information, while KGs provide structured domain knowledge for improved consistency and depth of understanding. The Incremental Learning approach allows for dynamic updates to the knowledge base without full retraining, significantly reducing computational overhead and improving the model's adaptability. We evaluate the framework using real‑world case studies involving health‑related queries, comparing it to state‑of‑the‑art models like GPT‑4o and a RAG‑only baseline. Experimental results demonstrate that our approach significantly reduces hallucination rates and improves answer completeness and reasoning accuracy. The results underscore the potential of combining RAG, KGs, and multi‑agent systems to create intelligent, adaptable systems capable of real‑time knowledge integration and reasoning in complex domains.

PaperID: 11204, https://arxiv.org/pdf/2503.13504.pdf

Authors: Rujia Wang, Xiangbo Gao, Hao Xiang, Runsheng Xu, Zhengzhong Tu

Title: CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception

Abstract:
Multi‑agent collaborative perception enhances each agent perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long‑range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird‑eye view (BEV) representations, which contain a significant amount of non‑critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object‑query‑based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi‑agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state‑of‑the‑art methods while drastically reducing communication needs. On V2V4Real, our model (Top‑50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth‑constrained environments without sacrificing detection accuracy.

PaperID: 11205, https://arxiv.org/pdf/2503.13415.pdf

Authors: Weiqiang Jin, Hongyang Du, Biao Zhao, Xingwu Tian, Bohang Shi, Guang Yang

Title: A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives

Abstract:
With the rapid development of artificial intelligence, intelligent decision‑making techniques have gradually surpassed human levels in various human‑machine competitions, especially in complex multi‑agent cooperative task scenarios. Multi‑agent cooperative decision‑making involves multiple agents working together to complete established tasks and achieve specific objectives. These techniques are widely applicable in real‑world scenarios such as autonomous driving, drone navigation, disaster rescue, and simulated military confrontations. This paper begins with a comprehensive survey of the leading simulation environments and platforms used for multi‑agent cooperative decision‑making. Specifically, we provide an in‑depth analysis for these simulation environments from various perspectives, including task formats, reward allocation, and the underlying technologies employed. Subsequently, we provide a comprehensive overview of the mainstream intelligent decision‑making approaches, algorithms and models for multi‑agent systems (MAS). Theseapproaches can be broadly categorized into five types: rule‑based (primarily fuzzy logic), game theory‑based, evolutionary algorithms‑based, deep multi‑agent reinforcement learning (MARL)‑based, and large language models(LLMs)reasoning‑based. Given the significant advantages of MARL andLLMs‑baseddecision‑making methods over the traditional rule, game theory, and evolutionary algorithms, this paper focuses on these multi‑agent methods utilizing MARL and LLMs‑based techniques. We provide an in‑depth discussion of these approaches, highlighting their methodology taxonomies, advantages, and drawbacks. Further, several prominent research directions in the future and potential challenges of multi‑agent cooperative decision‑making are also detailed.

PaperID: 11206, https://arxiv.org/pdf/2503.13402.pdf

Authors: Farhad Rezazadeh, Amir Ashtari Gargari, Sandra Lagen, Houbing Song, Dusit Niyato, Lingjia Liu

Title: Toward Generative 6G Simulation: An Experimental Multi-Agent LLM and ns-3 Integration

Abstract:
The move toward open Sixth‑Generation (6G) networks necessitates a novel approach to full‑stack simulation environments for evaluating complex technology developments before prototyping and real‑world implementation. This paper introduces an innovative approach\footnoteA lightweight, mock version of the code is available on GitHub at that combines a multi‑agent framework with the Network Simulator 3 (ns‑3) to automate and optimize the generation, debugging, execution, and analysis of complex 5G network scenarios. Our framework orchestrates a suite of specialized agents ‑‑ namely, the Simulation Generation Agent, Test Designer Agent, Test Executor Agent, and Result Interpretation Agent ‑‑ using advanced LangChain coordination. The Simulation Generation Agent employs a structured chain‑of‑thought (CoT) reasoning process, leveraging LLMs and retrieval‑augmented generation (RAG) to translate natural language simulation specifications into precise ns‑3 scripts. Concurrently, the Test Designer Agent generates comprehensive automated test suites by integrating knowledge retrieval techniques with dynamic test case synthesis. The Test Executor Agent dynamically deploys and runs simulations, managing dependencies and parsing detailed performance metrics. At the same time, the Result Interpretation Agent utilizes LLM‑driven analysis to extract actionable insights from the simulation outputs. By integrating external resources such as library documentation and ns‑3 testing frameworks, our experimental approach can enhance simulation accuracy and adaptability, reducing reliance on extensive programming expertise. A detailed case study using the ns‑3 5G‑LENA module validates the effectiveness of the proposed approach. The code generation process converges in an average of 1.8 iterations, has a syntax error rate of 17.0%, a mean response time of 7.3 seconds, and receives a human evaluation score of 7.5.

PaperID: 11207, https://arxiv.org/pdf/2503.13279.pdf

Authors: Xinkai Zou, Yan Liu, Xiongbo Shi, Chen Yang

Title: Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation

Abstract:
As requirements drift with rapid iterations, agile development becomes the dominant paradigm. Goal‑driven Requirements Elicitation (RE) is a pivotal yet challenging task in agile project development due to its heavy tangling with adaptive planning and efficient collaboration. Recently, AI agents have shown promising ability in supporting requirements analysis by saving significant time and effort for stakeholders. However, current research mainly focuses on functional RE, and research works have not been reported bridging the long journey from goal to user stories. Moreover, considering the cost of LLM facilities and the need for data and idea protection, privately hosted small‑sized LLM should be further utilized in RE. To address these challenges, we propose Goal2Story, a multi‑agent fleet that adopts the Impact Mapping (IM) framework while merely using cost‑effective sLLMs for goal‑driven RE. Moreover, we introduce a StorySeek dataset that contains over 1,000 user stories (USs) with corresponding goals and project context information, as well as the semi‑automatic dataset construction method. For evaluation, we proposed two metrics: Factuality Hit Rate (FHR) to measure consistency between the generated USs with the dataset and Quality And Consistency Evaluation (QuACE) to evaluate the quality of the generated USs. Experimental results demonstrate that Goal2Story outperforms the baseline performance of the Super‑Agent adopting powerful LLMs, while also showcasing the performance improvements in key metrics brought by CoT and Agent Profile to Goal2Story, as well as its exploration in identifying latent needs.

PaperID: 11208, https://arxiv.org/pdf/2503.13275.pdf

Authors: Seyoung Song

Title: Knowledge-Aware Iterative Retrieval for Multi-Agent Systems

Abstract:
We introduce a novel large language model (LLM)‑driven agent framework, which iteratively refines queries and filters contextual evidence by leveraging dynamically evolving knowledge. A defining feature of the system is its decoupling of external sources from an internal knowledge cache that is progressively updated to guide both query generation and evidence selection. This design mitigates bias‑reinforcement loops and enables dynamic, trackable search exploration paths, thereby optimizing the trade‑off between exploring diverse information and maintaining accuracy through autonomous agent decision‑making. Our approach is evaluated on a broad range of open‑domain question answering benchmarks, including multi‑step tasks that mirror real‑world scenarios where integrating information from multiple sources is critical, especially given the vulnerabilities of LLMs that lack explicit reasoning or planning capabilities. The results show that the proposed system not only outperforms single‑step baselines regardless of task difficulty but also, compared to conventional iterative retrieval methods, demonstrates pronounced advantages in complex tasks through precise evidence‑based reasoning and enhanced efficiency. The proposed system supports both competitive and collaborative sharing of updated context, enabling multi‑agent extension. The benefits of multi‑agent configurations become especially prominent as task difficulty increases. The number of convergence steps scales with task difficulty, suggesting cost‑effective scalability.

PaperID: 11209, https://arxiv.org/pdf/2503.13269.pdf

Authors: Wenyi Xu, Yuren Mao, Xiaolu Zhang, Chao Zhang, Xuemei Dong, Mengfei Zhang, Yunjun Gao

Title: DAgent: A Relational Database-Driven Data Analysis Report Generation Agent

Abstract:
Relational database‑driven data analysis (RDB‑DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks are manually completed by data scientists, making the process very labor‑intensive and showing a clear need for automation. Although existing methods (e.g., Table QA or Text‑to‑SQL) have been proposed to reduce human dependency, they cannot handle complex analytical tasks that require multi‑step reasoning, cross‑table associations, and synthesizing insights into reports. Moreover, there is no dataset available for developing automatic RDB‑DA report generation. To fill this gap, this paper proposes an LLM agent system for RDB‑DA report generation tasks, dubbed DAgent; moreover, we construct a benchmark for automatic data analysis report generation, which includes a new dataset DA‑Dataset and evaluation metrics. DAgent integrates planning, tools, and memory modules to decompose natural language questions into logically independent sub‑queries, accurately retrieve key information from relational databases, and generate analytical reports that meet the requirements of completeness, correctness, and conciseness through multi‑step reasoning and effective data integration. Experimental analysis on the DA‑Dataset demonstrates that DAgent's superiority in retrieval performance and analysis report generation quality, showcasing its strong potential for tackling complex database analysis report generation tasks.

PaperID: 11210, https://arxiv.org/pdf/2503.13250.pdf

Authors: Zejia Zhang, Bo Yang, Xinxing Chen, Weizhuang Shi, Haoyuan Wang, Wei Luo, Jian Huang

Title: MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution

Abstract:
A promising effective human‑robot interaction in assistive robotic systems is gaze‑based control. However, current gaze‑based assistive systems mainly help users with basic grasping actions, offering limited support. Moreover, the restricted intent recognition capability constrains the assistive system's ability to provide diverse assistance functions. In this paper, we propose an open implicit intention recognition framework powered by Large Language Model (LLM) and Vision Foundation Model (VFM), which can process gaze input and recognize user intents that are not confined to predefined or specific scenarios. Furthermore, we implement a gaze‑driven LLM‑enhanced assistive robot system (MindEye‑OmniAssist) that recognizes user's intentions through gaze and assists in completing task. To achieve this, the system utilizes open vocabulary object detector, intention recognition network and LLM to infer their full intentions. By integrating eye movement feedback and LLM, it generates action sequences to assist the user in completing tasks. Real‑world experiments have been conducted for assistive tasks, and the system achieved an overall success rate of 41/55 across various undefined tasks. Preliminary results show that the proposed method holds the potential to provide a more user‑friendly human‑computer interaction interface and significantly enhance the versatility and effectiveness of assistive systems by supporting more complex and diverse task.

PaperID: 11211, https://arxiv.org/pdf/2503.13223.pdf

Authors: Allahkaram Shafiei, Hozefa Jesawada, Karl Friston, Giovanni Russo

Title: Distributionally Robust Free Energy Principle for Decision-Making

Abstract:
Despite their groundbreaking performance, autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training‑environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long‑standing challenge towards their real‑world deployments. Here, we introduce a Distributionally Robust Free Energy model (DR‑FREE) that instills this core property by design. Combining a robust extension of the free energy principle with a resolution engine, DR‑FREE wires robustness into the agent decision‑making mechanisms. Across benchmark experiments, DR‑FREE enables the agents to complete the task even when, in contrast, state‑of‑the‑art models fail. This milestone may inspire both deployments in multi‑agent settings and, at a perhaps deeper level, the quest for an explanation of how natural agents ‑‑ with little or no training ‑‑ survive in capricious environments.

PaperID: 11212, https://arxiv.org/pdf/2503.13205.pdf

Authors: Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, Yixuan Yuan

Title: MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

Abstract:
Inpatient pathways demand complex clinical decision‑making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large‑scale inpatient datasets. Moreover, existing medical benchmarks typically concentrated on medical question‑answering and examinations, ignoring the multifaceted nature of clinical decision‑making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC‑IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi‑Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents, including a triage agent managing the patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state‑of‑the‑art LLM HuatuoGPT2‑13B. It is worth noting that our MAP demonstrated significant clinical compliance, outperforming three board‑certified clinicians by 10%‑12%, establishing a foundation for inpatient pathways systems.

PaperID: 11213, https://arxiv.org/pdf/2503.13194.pdf

Authors: Corina Catarau-Cotutiu, Esther Mondragon, Eduardo Alonso

Title: A representational framework for learning and encoding structurally enriched trajectories in complex agent environments

Abstract:
The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them in state‑action transitions. Whereas such representations are procedurally efficient, they lack structural richness. To address this problem, we propose to enhance the agent's ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions, and affordances. SETs are built as multi‑level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph‑based memory structure of multi‑level relational dependencies essential for generalisation. We demonstrate that SETLE can support downstream tasks, enabling agents to recognise task relevant structural patterns across CREATE and MiniGrid environments. Finally, we integrate SETLE with reinforcement learning and show measurable improvements in downstream performance, including breakthrough success rates in complex, sparse‑reward tasks.

PaperID: 11214, https://arxiv.org/pdf/2503.13175.pdf

Authors: Alvin Combrink, Sabino Francesco Roselli, Martin Fabian

Title: Prioritized Planning for Continuous-time Lifelong Multi-agent Pathfinding

Abstract:
Multi‑agent Path Finding (MAPF) is the problem of planning collision‑free movements of agents so that they get from where they are to where they need to be. Commonly, agents are located on a graph and can traverse edges. This problem has many variations and has been studied for decades. Two such variations are the continuous‑time and the lifelong MAPF problems. In the former, edges have non‑unit lengths and volumetric agents can traverse them at any real‑valued time. In the latter, agents must attend to a continuous stream of incoming tasks. Much work has been devoted to designing solution methods within these two areas. To our knowledge, however, the combined problem of continuous‑time lifelong MAPF has yet to be addressed. This work addresses continuous‑time lifelong MAPF with volumetric agents by presenting the fast and sub‑optimal Continuous‑time Prioritized Lifelong Planner (CPLP). CPLP continuously assigns agents to tasks and computes plans using a combination of two path planners; one based on CCBS and the other based on SIPP. Experimental results with up to 800 agents on graphs with up to 12 000 vertices demonstrate practical performance, where maximum planning times fall within the available time budget. Additionally, CPLP ensures collision‑free movement even when failing to meet this budget. Therefore, the robustness of CPLP highlights its potential for real‑world applications.

PaperID: 11215, https://arxiv.org/pdf/2503.13169.pdf

Authors: Ruoyan Avery Yin, Zhichu Ren, Zongyou Yin, Zhen Zhang, So Yeon Kim, Chia-Wei Hsu, Ju Li

Title: Collaborative AI Enhances Image Understanding in Materials Science

Abstract:
The Copilot for Real‑world Experimental Scientist (CRESt) system empowers researchers to control autonomous laboratories through conversational AI, providing a seamless interface for managing complex experimental workflows. We have enhanced CRESt by integrating a multi‑agent collaboration mechanism that utilizes the complementary strengths of the ChatGPT and Gemini models for precise image analysis in materials science. This innovative approach significantly improves the accuracy of experimental outcomes by fostering structured debates between the AI models, which enhances decision‑making processes in materials phase analysis. Additionally, to evaluate the generalizability of this approach, we tested it on a quantitative task of counting particles. Here, the collaboration between the AI models also led to improved results, demonstrating the versatility and robustness of this method. By harnessing this dual‑AI framework, this approach stands as a pioneering method for enhancing experimental accuracy and efficiency in materials research, with applications extending beyond CRESt to broader scientific experimentation and analysis.

PaperID: 11216, https://arxiv.org/pdf/2503.13077.pdf

Authors: Amir Baghi, Jens Sjölund, Joakim Bergdahl, Linus Gisslén, Alessandro Sestini

Title: Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration

Abstract:
Multi‑agent reinforcement learning has shown promise in learning cooperative behaviors in team‑based environments. However, such methods often demand extensive training time. For instance, the state‑of‑the‑art method TiZero takes 40 days to train high‑quality policies for a football environment. In this paper, we hypothesize that better exploration mechanisms can improve the sample efficiency of multi‑agent methods. We propose two different approaches for better exploration in TiZero: a self‑supervised intrinsic reward and a random network distillation bonus. Additionally, we introduce architectural modifications to the original algorithm to enhance TiZero's computational efficiency. We evaluate the sample efficiency of these approaches through extensive experiments. Our results show that random network distillation improves training sample efficiency by 18.8% compared to the original TiZero. Furthermore, we evaluate the qualitative behavior of the models produced by both variants against a heuristic AI, with the self‑supervised reward encouraging possession and random network distillation leading to a more offensive performance. Our results highlights the applicability of our random network distillation variant in practical settings. Lastly, due to the nature of the proposed method, we acknowledge its use beyond football simulation, especially in environments with strong multi‑agent and strategic aspects.

PaperID: 11217, https://arxiv.org/pdf/2503.12976.pdf

Authors: Yan Kim, Wojciech Jamroga, Wojciech Penczek, Laure Petrucci

Title: Practical Abstractions for Model Checking Continuous-Time Multi-Agent Systems

Abstract:
Model checking of temporal logics in a well established technique to verify and validate properties of multi‑agent systems (MAS). However, practical model checking requires input models of manageable size. In this paper, we extend the model reduction method by variable‑based abstraction, proposed recently by Jamroga and Kim, to the verification of real‑time systems and properties. To this end, we define a real‑time extension of MAS graphs, extend the abstraction procedure, and prove its correctness for the universal fragment of Timed Computation Tree Logic (TCTL). Besides estimating the theoretical complexity gains, we present an experimental evaluation for a simplified model of the Estonian voting system and verification using the Uppaal model checker.

PaperID: 11218, https://arxiv.org/pdf/2503.12855.pdf

Authors: Yujie Lu, Yale Song, William Wang, Lorenzo Torresani, Tushar Nagarajan

Title: VITED: Video Temporal Evidence Distillation

Abstract:
We investigate complex video question answering via chain‑of‑evidence reasoning ‑‑ identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi‑step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence, that maximizes the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it to both localize evidence windows as well as perform multi‑step reasoning across them in long‑form video content. We show the value of our evidence‑distilled models on a suite of long video QA benchmarks where we outperform state‑of‑the‑art approaches that lack evidence reasoning capabilities.

PaperID: 11219, https://arxiv.org/pdf/2503.12787.pdf

Authors: Takumi Ito, Riku Funada, Mitsuji Sampei, Gennaro Notomista

Title: Energy-Aware Task Allocation for Teams of Multi-mode Robots

Abstract:
This work proposes a novel multi‑robot task allocation framework for robots that can switch between multiple modes, e.g., flying, driving, or walking. We first provide a method to encode the multi‑mode property of robots as a graph, where the mode of each robot is represented by a node. Next, we formulate a constrained optimization problem to decide both the task to be allocated to each robot as well as the mode in which the latter should execute the task. The robot modes are optimized based on the state of the robot and the environment, as well as the energy required to execute the allocated task. Moreover, the proposed framework is able to encompass kinematic and dynamic models of robots alike. Furthermore, we provide sufficient conditions for the convergence of task execution and allocation for both robot models.

PaperID: 11220, https://arxiv.org/pdf/2503.12757.pdf

Authors: Christine Lee, Jihye Choi, Bilge Mutlu

Title: MAP: Multi-user Personalization with Collaborative LLM-powered Agents

Abstract:
The widespread adoption of Large Language Models (LLMs) and LLM‑powered agents in multi‑user settings underscores the need for reliable, usable methods to accommodate diverse preferences and resolve conflicting directives. Drawing on conflict resolution theory, we introduce a user‑centered workflow for multi‑user personalization comprising three stages: Reflection, Analysis, and Feedback. We then present MAP ‑‑ a Multi‑Agent system for multi‑user Personalization ‑‑ to operationalize this workflow. By delegating subtasks to specialized agents, MAP (1) retrieves and reflects on relevant user information, while enhancing reliability through agent‑to‑agent interactions, (2) provides detailed analysis for improved transparency and usability, and (3) integrates user feedback to iteratively refine results. Our user study findings (n=12) highlight MAP's effectiveness and usability for conflict resolution while emphasizing the importance of user involvement in resolution verification and failure management. This work highlights the potential of multi‑agent systems to implement user‑centered, multi‑user personalization workflows and concludes by offering insights for personalization in multi‑user contexts.

PaperID: 11221, https://arxiv.org/pdf/2503.12743.pdf

Authors: Philip Greulich, Michael Levin, Rosalia Moreddu

Title: Oncomorphic neural agent populations for resource-limited sequential learning

Abstract:
Distributed artificial intelligence (AI) often operates under sequential task exposure, uneven compute, and decentralized coordination. Here, we present a cancer‑inspired, or oncomorphic, multi‑agent framework in which simulated neural agents can replicate, mutate their neural network architecture, migrate across task environments, undergo ecological turnover, and recruit learning/ecological resources from a finite shared reserve. We evaluate the framework in controlled synthetic nonlinear classification environments in which each agent trains only on its local task, allowing population ecology rather than centralized optimization to determine which neural network architectures persist. For various initial conditions, we find that stronger selection increased the endpoint local accuracy of surviving agent populations. Architecture mutation played a state‑dependent role: diverse initial populations performed best at low mutation, whereas clonal large‑architecture populations benefited from mutation‑generated variation. Selection also increased end‑of‑run multi‑task competence, measured by evaluating surviving agents on all environments without additional training. Recruitment and elevated baseline replication reshaped demographic support while prediction quality remained within a narrow band, consistent with redistribution of finite learning resources. Time‑resolved entropy and dominance analyses revealed concentration toward successful architectures, while finite training cycles kept agents in a non‑asymptotic learning regime. These results provide proof‑of‑concept mechanistic evidence that oncomorphic population dynamics may offer a route to decentralized adaptation in engineering applications under bounded local resources.

PaperID: 11222, https://arxiv.org/pdf/2503.12722.pdf

Authors: Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

Title: Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Abstract:
As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi‑agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision‑making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality‑based steering for aligning AI agents.

PaperID: 11223, https://arxiv.org/pdf/2503.12687.pdf

Authors: Naveen Krishnan

Title: AI Agents: Evolution, Architecture, and Real-World Applications

Abstract:
This paper examines the evolution, architecture, and practical applications of AI agents from their early, rule‑based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use. Emphasizing both theoretical foundations and real‑world deployments, the paper reviews key agent paradigms, discusses limitations of current evaluation benchmarks, and proposes a holistic evaluation framework that balances task effectiveness, efficiency, robustness, and safety. Applications across enterprise, personal assistance, and specialized domains are analyzed, with insights into future research directions for more resilient and adaptive AI agent systems.

PaperID: 11224, https://arxiv.org/pdf/2503.12684.pdf

Authors: Daniel J. Rosenkrantz, Madhav V. Marathe, Zirou Qiu, S. S. Ravi, Richard E. Stearns

Title: On Some Fundamental Problems for Multi-Agent Systems Over Multilayer Networks

Abstract:
Many researchers have considered multi‑agent systems over single‑layer networks as models for studying diffusion phenomena. Since real‑world networks involve connections between agents with different semantics (e.g., family member, friend, colleague), the study of multi‑agent systems over multilayer networks has assumed importance. Our focus is on one class of multi‑agent system models over multilayer networks, namely multilayer synchronous dynamical systems (MSyDSs). We study several fundamental problems for this model. We establish properties of the phase spaces of MSyDSs and bring out interesting differences between single‑layer and multilayer dynamical systems. We show that, in general, the problem of determining whether two given MSyDSs are inequivalent is NP‑complete. This hardness result holds even when the only difference between the two systems is the local function at just one node in one layer. We also present efficient algorithms for the equivalence problem for restricted versions of MSyDSs (e.g., systems where each local function is a bounded‑threshold function, systems where the number of layers is fixed and each local function is symmetric). In addition, we investigate the expressive power of MSyDSs based on the number of layers. In particular, we examine conditions under which a system with k >= 2 layers has an equivalent system with k‑1 or fewer layers.

PaperID: 11225, https://arxiv.org/pdf/2503.12651.pdf

Authors: Yoo Yeon Sung, Hannah Kim, Dan Zhang

Title: VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures

Abstract:
AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks, these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human‑centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human‑designed agent criteria. Then, it develops a human‑aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human‑agent collaboration, VeriLA paves the way for more trustworthy and human‑aligned compound AI systems.

PaperID: 11226, https://arxiv.org/pdf/2503.12457.pdf

Authors: Neelanga Thelasingha, Agung Julius, James Humann, James Dotterweich

Title: Iterative Motion Planning in Multi-agent Systems with Opportunistic Communication under Disturbance

Abstract:
In complex multi‑agent systems involving heterogeneous teams, uncertainty arises from numerous sources like environmental disturbances, model inaccuracies, and changing tasks. This causes planned trajectories to become infeasible, requiring replanning. Further, different communication architectures used in multi‑agent systems give rise to asymmetric knowledge of planned trajectories across the agents. In such systems, replanning must be done in a communication‑aware fashion. This paper establishes the conditions for synchronization and feasibility in epistemic planning scenarios introduced by opportunistic communication architectures. We also establish conditions on task satisfaction based on quantified recoverability of disturbances in an iterative planning scheme. We further validate these theoretical results experimentally in a UAV‑‑UGV task assignment problem.

PaperID: 11227, https://arxiv.org/pdf/2503.12374.pdf

Authors: Zhi Chen, Wei Ma, Lingxiao Jiang

Title: Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios

Abstract:
AI‑driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository‑level software engineering tasks. These agents go beyond just generation of final code; they engage in multi‑step reasoning, utilize various tools for code modification and debugging, and interact with execution environments to diagnose and iteratively resolve issues. However, most existing evaluations focus primarily on static analyses of final code outputs, yielding limited insights into the agents' dynamic problem‑solving processes. To fill this gap, we conduct an in‑depth empirical study on 3,977 solving‑phase trajectories and 3,931 testing‑phase logs from 8 top‑ranked agents evaluated on 500 GitHub issues in the SWE‑Bench benchmark. Our exploratory analysis shows that Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors ‑‑ such as ModuleNotFoundError and TypeError ‑‑ and highlighted particularly challenging errors like OSError and database‑related issues (e.g., IntegrityError) that demand significantly more debugging effort. Furthermore, we have discovered 3 bugs in the SWE‑Bench platform that affect benchmark fairness and accuracy; these issues have been reported to and confirmed by the maintainers. To promote transparency and foster future research, we publicly share our datasets and analysis scripts.

PaperID: 11228, https://arxiv.org/pdf/2503.12349.pdf

Authors: Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath

Title: SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

Abstract:
Reasoning and strategic behavior in social interactions is a hallmark of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN‑Bench), a new multi‑domain evaluation designed to measure the intelligence of strategic planning and social reasoning. While many existing benchmarks focus on narrow planning or single‑agent reasoning, SPIN‑Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi‑agent negotiation scenarios in one unified framework. The framework includes both a benchmark as well as an arena to simulate and evaluate the variety of social settings to test reasoning and strategic behavior of AI agents. We formulate the benchmark SPIN‑Bench by systematically varying action spaces, state complexity, and the number of interacting agents to simulate a variety of social settings where success depends on not only methodical and step‑wise decision making, but also conceptual inference of other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short‑range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi‑hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN‑Bench as a catalyst for future research on robust multi‑agent planning, social reasoning, and human‑‑AI teaming. Project Website: https://spinbench.github.io/

PaperID: 11229, https://arxiv.org/pdf/2503.12333.pdf

Authors: Vagul Mahadevan, Shangtong Zhang, Rohan Chandra

Title: GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

Abstract:
Safe, agile, and socially compliant multi‑robot navigation in cluttered and constrained environments remains a critical challenge. This is especially difficult with self‑interested agents with unique, unknown priorities in decentralized settings, where there is no central authority to resolve conflicts induced by spatial symmetry. We address this challenge by proposing an intuitive, but very effective approach, GameChat, which facilitates safe, agile, and deadlock‑free navigation for both cooperative and self‑interested agents in cluttered environments. Key to our approach is the idea that agents should resolve conflicts on their own using natural language to communicate, much like humans. We evaluate GameChat in simulated environments with doorways and intersections. The results show that even in the worst case, GameChat reduces the time for all agents to reach their goals by over 35% from a naive baseline and by over 20% from a state of the art baseline in the intersection scenario, while doubling the rate of ensuring the agent with a higher priority task reaches the goal first, from 50% (equivalent to random chance) to 100%. We also demonstrate how GameChat can be extended to more than two agents.

PaperID: 11230, https://arxiv.org/pdf/2503.12199.pdf

Authors: Luoyin Zhao, Zheping Yan, Yuqing Wang, Raye Chen-Hua Yeow

Title: Formation Control of Multi-agent System with Local Interaction and Artificial Potential Field

Abstract:
A novel local interaction control method (LICM) is proposed in this paper to realize the formation control of multi‑agent system (MAS). A local interaction leader follower (LILF) structure is provided by coupling the advantages of information consensus and leader follower frame, the agents can obtain the state information of the leader by interacting with their neighbours, which will reduce the communication overhead of the system and the dependence on a single node of the topology. In addition, the artificial potential field (APF) method is introduced to achieve obstacle avoidance and collision avoidance between agents. Inspired by the stress response of animals, a stress response mechanism‑artificial potential field (SRM‑APF) is proposed, which will be triggered when the local minimum problem of APF occurs. Ultimately, the simulation experiments of three formation shapes, including triangular formation, square formation and hexagonal formation, validate the effectiveness of the proposed method.

PaperID: 11231, https://arxiv.org/pdf/2503.12188.pdf

Authors: Harold Triedman, Rishi Jha, Vitaly Shmatikov

Title: Multi-Agent Systems Execute Arbitrary Malicious Code

Abstract:
Multi‑agent systems coordinate LLM‑based agents to perform tasks on users' behalf. In real‑world applications, multi‑agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, and more. Using several recently proposed multi‑agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user's device or exfiltration of sensitive data from the user's containerized environment. For example, when agents are instantiated with GPT‑4o, Web‑based attacks successfully cause the multi‑agent system execute arbitrary malicious code in 58‑90% of trials (depending on the orchestrator). In some model‑orchestrator configurations, the attack success rate is 100%. We also demonstrate that these attacks succeed even if individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions. We hope that these results will motivate development of trust and security models for multi‑agent systems before they are widely deployed.

PaperID: 11232, https://arxiv.org/pdf/2503.12163.pdf

Authors: Ruwei Pan, Hongyu Zhang, Zhonghao Jiang, Ran Hou

Title: AgentDroid: A Multi-Agent Framework for Detecting Fraudulent Android Applications

Abstract:
With the increasing prevalence of fraudulent Android applications such as fake and malicious applications, it is crucial to detect them with high accuracy and adaptability. We present AgentDroid, a novel tool for Android fraudulent application detection based on multi‑modal analysis and multi‑agent systems. AgentDroid overcomes the limitations of traditional detection methods such as the inability to handle multimodal data and high false alarm rates. It processes Android applications and extracts a series of multi‑modal data for analysis. Multiple LLM‑based agents with specialized roles analyze the relevant data and collaborate to detect complex fraud effectively. We curated a dataset containing various categories of fraudulent applications and legitimate applications and validated our tool on this dataset. Experimental results indicate that our multi‑agent tool based on GPT‑4o achieves an accuracy of 91.7% and an F1‑Score of 91.68%, outperforming the baseline methods. A video of AgentDroid is available at https://youtu.be/YOM9Ex‑nBts.

PaperID: 11233, https://arxiv.org/pdf/2503.12077.pdf

Authors: Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang

Title: V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

Abstract:
Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi‑agent system for video stylization, V‑Stylist, by a novel collaboration and reflection paradigm of multi‑modal large language models. Specifically, our V‑Stylist is a systematical workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates their text prompts of key shot content. Via a concise video‑to‑shot prompting paradigm, it allows our V‑Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively search the matched style model from a style tree. Via a robust tree‑of‑thought searching paradigm, it allows our V‑Stylist to precisely specify vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi‑round self‑reflection paradigm, it allows our V‑Stylist to adaptively adjust detail control, according to the style requirement. With such a distinct design of mimicking human professionals, our V‑Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover,we further construct a new benchmark Text‑driven Video Stylization Benchmark (TVSBench), which fills the gap to assess stylization of complex videos on open user queries. Extensive experiments show that, V‑Stylist achieves the state‑of‑the‑art, e.g.,V‑Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.

PaperID: 11234, https://arxiv.org/pdf/2503.12029.pdf

Authors: Jina Chun, Qihong Chen, Jiawei Li, Iftekhar Ahmed

Title: Is Multi-Agent Debate (MAD) the Silver Bullet? An Empirical Analysis of MAD in Code Summarization and Translation

Abstract:
Large Language Models (LLMs) have advanced autonomous agents' planning and decision‑making, yet they struggle with complex tasks requiring diverse expertise and multi‑step reasoning. Multi‑Agent Debate (MAD) systems, introduced in NLP research, address this gap by enabling structured debates among LLM‑based agents to refine solutions iteratively. MAD promotes divergent thinking through role‑specific agents, dynamic interactions, and structured decision‑making. Recognizing parallels between Software Engineering (SE) and collaborative human problem‑solving, this study investigates MAD's effectiveness on two SE tasks. We adapt MAD systems from NLP, analyze agent interactions to assess consensus‑building and iterative refinement, and propose two enhancements targeting observed weaknesses. Our findings show that structured debate and collaboration improve problem‑solving and yield strong performance in some cases, highlighting MAD's potential for SE automation while identifying areas for exploration.

PaperID: 11235, https://arxiv.org/pdf/2503.11991.pdf

Authors: Sumana Sanyasipura Nagaraju

Title: Automation and Feature Selection Enhancement with Reinforcement Learning (RL)

Abstract:
Effective feature selection, representation and transformation are principal steps in machine learning to improve prediction accuracy, model generalization and computational efficiency. Reinforcement learning provides a new perspective towards balanced exploration of optimal feature subset using multi‑agent and single‑agent models. Interactive reinforcement learning integrated with decision tree improves feature knowledge, state representation and selection efficiency, while diversified teaching strategies improve both selection quality and efficiency. The state representation can further be enhanced by scanning features sequentially along with the usage of convolutional auto‑encoder. Monte Carlo‑based reinforced feature selection(MCRFS), a single‑agent feature selection method reduces computational burden by incorporating early‑stopping and reward‑level interactive strategies. A dual‑agent RL framework is also introduced that collectively selects features and instances, capturing the interactions between them. This enables the agents to navigate through complex data spaces. To outperform the traditional feature engineering, cascading reinforced agents are used to iteratively improve the feature space, which is a self‑optimizing framework. The blend of reinforcement learning, multi‑agent systems, and bandit‑based approaches offers exciting paths for studying scalable and interpretable machine learning solutions to handle high‑dimensional data and challenging predictive tasks.

PaperID: 11236, https://arxiv.org/pdf/2503.11951.pdf

Authors: Edward Y. Chang, Longling Geng

Title: SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Abstract:
This paper introduces SagaLLM, a structured multi‑agent architecture designed to address four foundational limitations of current LLM‑based planning systems: unreliable self‑validation, context loss, lack of transactional safeguards, and insufficient inter‑agent coordination. While recent frameworks leverage LLMs for task decomposition and multi‑agent communication, they often fail to ensure consistency, rollback, or constraint satisfaction across distributed workflows. SagaLLM bridges this gap by integrating the Saga transactional pattern with persistent memory, automated compensation, and independent validation agents. It leverages LLMs' generative reasoning to automate key tasks traditionally requiring hand‑coded coordination logic, including state tracking, dependency analysis, log schema generation, and recovery orchestration. Although SagaLLM relaxes strict ACID guarantees, it ensures workflow‑wide consistency and recovery through modular checkpointing and compensable execution. Empirical evaluations across planning domains demonstrate that standalone LLMs frequently violate interdependent constraints or fail to recover from disruptions. In contrast, SagaLLM achieves significant improvements in consistency, validation accuracy, and adaptive coordination under uncertainty, establishing a robust foundation for real‑world, scalable LLM‑based multi‑agent systems.

PaperID: 11237, https://arxiv.org/pdf/2503.11933.pdf

Authors: Yun Tang, Udhaya Chandhar Srinivasan, Benjamin James Scott, Obumneme Umealor, Dennis Kevogo, Weisi Guo

Title: End-to-End Edge AI Service Provisioning Framework in 6G ORAN

Abstract:
With the advent of 6G, Open Radio Access Network (O‑RAN) architectures are evolving to support intelligent, adaptive, and automated network orchestration. This paper proposes a novel Edge AI and Network Service Orchestration framework that leverages Large Language Model (LLM) agents deployed as O‑RAN rApps. The proposed LLM‑agent‑powered system enables interactive and intuitive orchestration by translating the user's use case description into deployable AI services and corresponding network configurations. The LLM agent automates multiple tasks, including AI model selection from repositories (e.g., Hugging Face), service deployment, network adaptation, and real‑time monitoring via xApps. We implement a prototype using open‑source O‑RAN projects (OpenAirInterface and FlexRIC) to demonstrate the feasibility and functionality of our framework. Our demonstration showcases the end‑to‑end flow of AI service orchestration, from user interaction to network adaptation, ensuring Quality of Service (QoS) compliance. This work highlights the potential of integrating LLM‑driven automation into 6G O‑RAN ecosystems, paving the way for more accessible and efficient edge AI ecosystems.

PaperID: 11238, https://arxiv.org/pdf/2503.11829.pdf

Authors: Jushan Chen, Santiago Paternain

Title: Learning Closed-Loop Parametric Nash Equilibria of Multi-Agent Collaborative Field Coverage

Abstract:
Multi‑agent reinforcement learning is a challenging and active field of research due to the inherent nonstationary property and coupling between agents. A popular approach to modeling the multi‑agent interactions underlying the multi‑agent RL problem is the Markov Game. There is a special type of Markov Game, termed Markov Potential Game, which allows us to reduce the Markov Game to a single‑objective optimal control problem where the objective function is a potential function. In this work, we prove that a multi‑agent collaborative field coverage problem, which is found in many engineering applications, can be formulated as a Markov Potential Game, and we can learn a parameterized closed‑loop Nash Equilibrium by solving an equivalent single‑objective optimal control problem. As a result, our algorithm is 10x faster during training compared to a game‑theoretic baseline and converges faster during policy execution.

PaperID: 11239, https://arxiv.org/pdf/2503.11739.pdf

Authors: Zirui Yuan, Siqi Lai, Hao Liu

Title: CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

Abstract:
Traffic Signal Control (TSC) plays a critical role in urban traffic management by optimizing traffic flow and mitigating congestion. While Large Language Models (LLMs) have recently emerged as promising tools for TSC due to their exceptional problem‑solving and generalization capabilities, existing approaches fail to address the essential need for inter‑agent coordination, limiting their effectiveness in achieving network‑wide optimization. To bridge this gap, we propose CoLLMLight, a cooperative LLM agent framework for TSC. Specifically, we first construct a structured spatiotemporal graph to capture real‑time traffic dynamics and spatial relationships among neighboring intersections, enabling the LLM to reason about complex traffic interactions. Moreover, we introduce a complexity‑aware reasoning mechanism that dynamically adapts reasoning depth based on real‑time traffic conditions, ensuring optimal computational efficiency without sacrificing decision quality. Besides, we propose a fine‑tuning strategy that leverages iterative simulation‑driven data collection and environmental feedback to build a lightweight LLM tailored for cooperative TSC. Extensive experiments on both synthetic and real‑world datasets demonstrate that CoLLMLight outperforms state‑of‑the‑art methods in diverse traffic scenarios, showcasing its effectiveness, scalability, and robustness.

PaperID: 11240, https://arxiv.org/pdf/2503.11733.pdf

Authors: Zhendong Chu, Shen Wang, Jian Xie, Tinghui Zhu, Yibo Yan, Jinheng Ye, Aoxiao Zhong, Xuming Hu, Jing Liang, Philip S. Yu, Qingsong Wen

Title: LLM Agents for Education: Advances and Applications

Abstract:
Large Language Model (LLM) agents are transforming education by automating complex pedagogical tasks and enhancing both teaching and learning processes. In this survey, we present a systematic review of recent advances in applying LLM agents to address key challenges in educational settings, such as feedback comment generation, curriculum design, etc. We analyze the technologies enabling these agents, including representative datasets, benchmarks, and algorithmic frameworks. Additionally, we highlight key challenges in deploying LLM agents in educational settings, including ethical issues, hallucination and overreliance, and integration with existing educational ecosystems. Beyond the core technical focus, we include in Appendix A a comprehensive overview of domain‑specific educational agents, covering areas such as science learning, language learning, and professional development.

PaperID: 11241, https://arxiv.org/pdf/2503.11699.pdf

Authors: Ao Cao, Fuyong Wang, Zhongxin Liu

Title: Propensity Formation-Containment Control of Fully Heterogeneous Multi-Agent Systems via Online Data-Driven Learning

Abstract:
This paper introduces an online data‑driven learning scheme designed to address a novel problem in propensity formation and containment control for fully heterogeneous multi‑agent systems. Unlike traditional approaches that rely on the eigenvalues of the Laplacian matrix, this problem considers the determination of follower positions based on propensity factors released by leaders. To address the challenge of incomplete utilization of leader information in existing multi‑leader control methods, the concept of an influential transit formation leader (ITFL) is introduced. An adaptive observer is developed for the agents, including the ITFL, to estimate the state of the tracking leader or the leader's formation. Building on these observations, a model‑based control protocol is proposed, elucidating the relationship between the regulation equations and control gains, ensuring the asymptotic convergence of the agent's state. To eliminate the necessity for model information throughout the control process, a new online data‑driven learning algorithm is devised for the control protocol. Finally, numerical simulation results are given to verify the effectiveness of the proposed method.

PaperID: 11242, https://arxiv.org/pdf/2503.11554.pdf

Authors: Nadia Loy, Andrea Tosin

Title: Essentials of the kinetic theory of multi-agent systems

Abstract:
In this paper, we present a critical collection of essential mathematical tools and techniques for the analysis of Boltzmann‑type kinetic equations, which in recent years have established themselves as a flexible and powerful paradigm to model interacting multi‑agent systems. We consider, in particular, scalar equations implementing linear symmetric interaction rules, for which we develop the theory of well‑posedness, trend to equilibrium, and Fokker‑Planck asymptotics by relying extensively on Fourier methods. We also outline the basics of Monte Carlo algorithms for the numerical solution of such equations. Finally, we elaborate the theory further for Boltzmann‑type equations on graphs, a recent generalisation of the standard setting motivated by the modelling of networked multi‑agent systems.

PaperID: 11243, https://arxiv.org/pdf/2503.11517.pdf

Authors: Diego Gosmar, Deborah A. Dahl, Dario Gosmar

Title: Prompt Injection Detection and Mitigation via AI Multi-Agent NLP Frameworks

Abstract:
Prompt injection constitutes a significant challenge for generative AI systems by inducing unintended outputs. We introduce a multi‑agent NLP framework specifically designed to address prompt injection vulnerabilities through layered detection and enforcement mechanisms. The framework orchestrates specialized agents for generating responses, sanitizing outputs, and enforcing policy compliance. Evaluation on 500 engineered injection prompts demonstrates a marked reduction in injection success and policy breaches. Novel metrics, including Injection Success Rate (ISR), Policy Override Frequency (POF), Prompt Sanitization Rate (PSR), and Compliance Consistency Score (CCS), are proposed to derive a composite Total Injection Vulnerability Score (TIVS). The system utilizes the OVON (Open Voice Network) framework for inter‑agent communication via structured JSON messages, extending a previously established multi‑agent architecture from hallucination mitigation to address the unique challenges of prompt injection.

PaperID: 11244, https://arxiv.org/pdf/2503.11504.pdf

Authors: Yaroslav Marchukov, Luis Montano

Title: Multi-agent coordination for on-demand data gathering with periodic information upload

Abstract:
In this paper we develop a method for planning and coordinating a multi‑agent team deployment to periodically gather information on demand. A static operation center (OC) periodically requests information from changing goal locations. The objective is to gather data in the goals and to deliver it to the OC, balancing the refreshing time and the total number of information packages. The system automatically splits the team in two roles: workers to gather data, or collectors to retransmit the data to the OC. The proposed three step method: 1) finds out the best area partition for the workers; 2) obtains the best balance between workers and collectors, and with whom the workers must to communicate, a collector or the OC; 3) computes the best tour for the workers to visit the goals and deliver them to the OC or to a collector in movement. The method is tested in simulations in different scenarios, providing the best area partition algorithm and the best balance between collectors and workers.

PaperID: 11245, https://arxiv.org/pdf/2503.11475.pdf

Authors: William Fishell, Andoni Rodriguez, Mark Santolucito

Title: Research Vision: Multi-Agent Path Planning for Cops And Robbers Via Reactive Synthesis

Abstract:
We propose the problem of multi‑agent path planning for a generalization of the classic Cops and Robbers game via reactive synthesis. Specifically, through the application of LTLt and Coordination Synthesis, we aim to check whether various Cops and Robbers games are realizable (a strategy exists for the cops which guarantees they catch the robbers). Additionally, we construct this strategy as an executable program for the multiple system players in our games. In this paper we formalize the problem space, and propose potential directions for solutions. We also show how our formalization of this generalized cops and robbers game can be mapped to a broad range of other problems in the reactive program synthesis space.

PaperID: 11246, https://arxiv.org/pdf/2503.11290.pdf

Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin

Title: EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation

Abstract:
Affective Image Manipulation (AIM) aims to alter visual elements within an image to evoke specific emotional responses from viewers. However, existing AIM approaches rely on rigid \emphone‑to‑one mappings between emotions and visual cues, making them ill‑suited for the inherently subjective and diverse ways in which humans perceive and express emotion.To address this, we introduce a novel task setting termed \emphDiverse AIM (D‑AIM), aiming to generate multiple visually distinct yet emotionally consistent image edits from a single source image and target emotion. We propose \emphEmoAgent, the first multi‑agent framework tailored specifically for D‑AIM. EmoAgent explicitly decomposes the manipulation process into three specialized phases executed by collaborative agents: a Planning Agent that generates diverse emotional editing strategies, an Editing Agent that precisely executes these strategies, and a Critic Agent that iteratively refines the results to ensure emotional accuracy. This collaborative design empowers EmoAgent to model \emphone‑to‑many emotion‑to‑visual mappings, enabling semantically diverse and emotionally faithful edits.Extensive quantitative and qualitative evaluations demonstrate that EmoAgent substantially outperforms state‑of‑the‑art approaches in both emotional fidelity and semantic diversity, effectively generating multiple distinct visual edits that convey the same target emotion.

PaperID: 11247, https://arxiv.org/pdf/2503.11237.pdf

Authors: Rabimba Karanjai, Sam Blackshear, Lei Xu, Weidong Shi

Title: Collaboration is all you need: LLM Assisted Safe Code Translation

Abstract:
This paper introduces UniTranslator, a visionary framework that re‑imagines code translation as a collaborative endeavor among multiple, compact LLMs. By orchestrating the interaction of specialized agents, each focused on different aspects of the translation process and grounded in a deep understanding of programming concepts, UniTranslator achieves a level of accuracy and efficiency that rivals larger, monolithic models. Our preliminary evaluation demonstrates the potential of UniTranslator to overcome the limitations of existing approaches and unlock the power of smaller LLMs for complex code translation tasks. We explore the effectiveness of this dynamic multi‑agent paradigm in handling diverse language pairs, including low‑resource languages, and in mitigating common issues such as code artifacts and hallucinations through the use of Natural Language Inference (NLI) grounding and iterative feedback mechanisms

PaperID: 11248, https://arxiv.org/pdf/2503.11235.pdf

Authors: Luka Lanča, Karlo Jakac, Sylvain Calinon, Stefan Ivić

Title: Ergodic exploration of dynamic distribution

Abstract:
This research addresses the challenge of performing search missions in dynamic environments, particularly for drifting targets whose movement is dictated by a flow field. This is accomplished through a dynamical system that integrates two partial differential equations: one governing the dynamics and uncertainty of the probability distribution, and the other regulating the potential field for ergodic multi‑agent search. The target probability field evolves in response to the target dynamics imposed by the environment and accomplished sensing efforts, while being explored by multiple robot agents guided by the potential field gradient. The proposed methodology was tested on two simulated search scenarios, one of which features a synthetically generated domain and showcases better performance when compared to the baseline method with static target probability over a range of agent to flow field velocity ratios. The second search scenario represents a realistic sea search and rescue mission where the search start is delayed, the search is performed in multiple robot flight missions, and the procedure for target drift uncertainty compensation is demonstrated. Furthermore, the proposed method provides an accurate survey completion metric, based on the known detection/sensing parameters, that correlates with the actual number of targets found independently.

PaperID: 11249, https://arxiv.org/pdf/2503.11117.pdf

Authors: Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, Liang Lin

Title: Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

Abstract:
Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi‑step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning, while frontier‑based exploration strategies struggle in cluttered environments and fail to ensure fine‑grained exploration of task‑relevant areas. To address these challenges, we construct the EXPloration‑awaRe Embodied queStion anSwering Benchmark (EXPRESS‑Bench), the largest dataset designed specifically to evaluate both exploration and reasoning capabilities. EXPRESS‑Bench consists of 777 exploration trajectories and 2,044 question‑trajectory pairs. To improve exploration efficiency, we propose Fine‑EQA, a hybrid exploration model that integrates frontier‑based and goal‑oriented navigation to guide agents toward task‑relevant regions more effectively. Additionally, we introduce a novel evaluation metric, Exploration‑Answer Consistency (EAC), which ensures faithful assessment by measuring the alignment between answer grounding and exploration reliability. Extensive experimental comparisons with state‑of‑the‑art EQA models demonstrate the effectiveness of our EXPRESS‑Bench in advancing embodied exploration and question reasoning.

PaperID: 11250, https://arxiv.org/pdf/2503.11085.pdf

Authors: Sixiang Ye, Zeyu Sun, Guoqing Wang, Liwei Guo, Qingyuan Liang, Zheng Li, Yong Liu

Title: Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

Abstract:
Code generation has emerged as a key task to automate software development by converting high‑level descriptions into executable code. Large language models (LLMs) excel at this but depend heavily on input prompt quality.Manual prompt engineering can be time‑consuming and inconsistent, limiting LLM effectiveness. This paper introduces Prochemy, an innovative method for automatically refining prompts to boost code generation. Prochemy overcomes manual prompt limitations by automating optimization, ensuring consistency during inference, and supporting multi‑agent systems.It iteratively refines prompts based on model performance, using an optimized final prompt for improved consistency across tasks. We tested Prochemy on natural language‑based code generation and translation tasks using three LLM series. Results indicate Prochemy enhances existing methods, improving performance by 5.0% for GPT‑3.5‑Turbo and 1.9% for GPT‑4o over zero‑shot baselines on HumanEval. In state‑of‑the‑art LDB, Prochemy + LDB surpasses standalone methods by 1.2‑1.8%. For code translation, Prochemy boosts GPT‑4o's Java‑to‑Python (AVATAR) performance from 74.5 to 84.1 (+12.9%) and Python‑to‑Java from 66.8 to 78.2 (+17.1%). Moreover, Prochemy maintains strong performance when integrated with the o1‑mini model, validating its efficacy in code tasks. Designed as plug‑and‑play, Prochemy optimizes prompts with minimal human input, bridging the gap between simple prompts and complex frameworks.

PaperID: 11251, https://arxiv.org/pdf/2503.11069.pdf

Authors: Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Title: API Agents vs. GUI Agents: Divergence and Convergence

Abstract:
Large language models (LLMs) have evolved beyond simple text generation to power software agents that directly translate natural language commands into tangible actions. While API‑based LLM agents initially rose to prominence for their robust automation capabilities and seamless integration with programmatic endpoints, recent progress in multimodal LLM research has enabled GUI‑based LLM agents that interact with graphical user interfaces in a human‑like manner. Although these two paradigms share the goal of enabling LLM‑driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API‑based and GUI‑based LLM agents, systematically analyzing their divergence and potential convergence. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Ultimately, we indicate that continuing innovations in LLM‑based automation are poised to blur the lines between API‑ and GUI‑driven agents, paving the way for more flexible, adaptive solutions in a wide range of real‑world applications.

PaperID: 11252, https://arxiv.org/pdf/2503.11060.pdf

Authors: Heng Wang, Yotaro Shimose, Shingo Takamatsu

Title: BannerAgency: Advertising Banner Design with Multimodal LLM Agents

Abstract:
Advertising banners are critical for capturing user attention and enhancing advertising campaign effectiveness. Creating aesthetically pleasing banner designs while conveying the campaign messages is challenging due to the large search space involving multiple design elements. Additionally, advertisers need multiple sizes for different displays and various versions to target different sectors of audiences. Since design is intrinsically an iterative and subjective process, flexible editability is also in high demand for practical usage. While current models have served as assistants to human designers in various design tasks, they typically handle only segments of the creative design process or produce pixel‑based outputs that limit editability. This paper introduces a training‑free framework for fully automated banner ad design creation, enabling frontier multimodal large language models (MLLMs) to streamline the production of effective banners with minimal manual effort across diverse marketing contexts. We present BannerAgency, an MLLM agent system that collaborates with advertisers to understand their brand identity and banner objectives, generates matching background images, creates blueprints for foreground design elements, and renders the final creatives as editable components in Figma or SVG formats rather than static pixels. To facilitate evaluation and future research, we introduce BannerRequest400, a benchmark featuring 100 unique logos paired with 400 diverse banner requests. Through quantitative and qualitative evaluations, we demonstrate the framework's effectiveness, emphasizing the quality of the generated banner designs, their adaptability to various banner requests, and their strong editability enabled by this component‑based approach.

PaperID: 11253, https://arxiv.org/pdf/2503.11006.pdf

Authors: Yaohua Liu, Xinyuan Song, Yunfu Deng, Yifan Xie, Binkai Ou, Yan Zhong

Title: Fine-Grained Instruction-Guided Graph Reasoning for Vision-and-Language Navigation

Abstract:
Vision‑and‑Language Navigation (VLN) requires an embodied agent to traverse complex environments by following natural language instructions, demanding accurate alignment between visual observations and linguistic guidance. Despite recent progress, existing methods typically encode visual and directional cues in a coupled manner, and process instructions without explicitly extracting navigation‑critical semantics, which often leads to imprecise spatial reasoning and suboptimal cross‑modal alignment. To address these challenges, we propose a fine‑grained instruction‑guided graph reasoning framework (OIKG) that enhances both spatial representation and instruction understanding during navigation. Specifically, an observation‑graph interaction mechanism is introduced to disentangle angular and visual cues while strengthening directed edge representations through geometric embedding, enabling more reliable spatial reasoning within the navigation graph. In addition, a fine‑grained instruction guidance module is designed to explicitly extract and leverage location‑specific and object‑centric information from language instructions, facilitating more precise cross‑modal alignment between linguistic semantics and navigable trajectories. By jointly integrating structured graph reasoning with instruction‑critical semantic cues, the proposed approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR benchmarks demonstrate that our method consistently achieves state‑of‑the‑art performance across multiple evaluation metrics, validating the effectiveness of fine‑grained instruction‑guided graph reasoning for vision‑and‑language navigation.

PaperID: 11254, https://arxiv.org/pdf/2503.10907.pdf

Authors: Xueting Luo, Hao Deng, Jihong Yang, Yao Shen, Huanhuan Guo, Zhiyuan Sun, Mingqing Liu, Jiming Wei, Shengjie Zhao

Title: H2-MARL: Multi-Agent Reinforcement Learning for Pareto Optimality in Hospital Capacity Strain and Human Mobility during Epidemic

Abstract:
The necessity of achieving an effective balance between minimizing the losses associated with restricting human mobility and ensuring hospital capacity has gained significant attention in the aftermath of COVID‑19. Reinforcement learning (RL)‑based strategies for human mobility management have recently advanced in addressing the dynamic evolution of cities and epidemics; however, they still face challenges in achieving coordinated control at the township level and adapting to cities of varying scales. To address the above issues, we propose a multi‑agent RL approach that achieves Pareto optimality in managing hospital capacity and human mobility (H2‑MARL), applicable across cities of different scales. We first develop a township‑level infection model with online‑updatable parameters to simulate disease transmission and construct a city‑wide dynamic spatiotemporal epidemic simulator. On this basis, H2‑MARL is designed to treat each division as an agent, with a trade‑off dual‑objective reward function formulated and an experience replay buffer enriched with expert knowledge built. To evaluate the effectiveness of the model, we construct a township‑level human mobility dataset containing over one billion records from four representative cities of varying scales. Extensive experiments demonstrate that H2‑MARL has the optimal dual‑objective trade‑off capability, which can minimize hospital capacity strain while minimizing human mobility restriction loss. Meanwhile, the applicability of the proposed model to epidemic control in cities of varying scales is verified, which showcases its feasibility and versatility in practical applications.

PaperID: 11255, https://arxiv.org/pdf/2503.10876.pdf

Authors: Duc S. H. Nguyen, Bach G. Truong, Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio

Title: Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization

Abstract:
The proliferation of Large Language Models (LLMs) in recent years has realized many applications in various domains. Being trained with a huge of amount of data coming from various sources, LLMs can be deployed to solve different tasks, including those in Software Engineering (SE). Though they have been widely adopted, the potential of using LLMs cooperatively has not been thoroughly investigated. In this paper, we proposed Metagente as a novel approach to amplify the synergy of various LLMs. Metagente is a Multi‑Agent framework based on a series of LLMs to self‑optimize the system through evaluation, feedback, and cooperation among specialized agents. Such a framework creates an environment where multiple agents iteratively refine and optimize prompts from various perspectives. The results of these explorations are then reviewed and aggregated by a teacher agent. To study its performance, we evaluated Metagente with an SE task, i.e., summarization of README.MD files, and compared it with three well‑established baselines, i.e., GitSum, LLaMA‑2, and GPT‑4o. The results show that our proposed approach works efficiently and effectively, consuming a small amount of data for fine‑tuning but still getting a high accuracy, thus substantially outperforming the baselines. The performance gain compared to GitSum, the most relevant benchmark, ranges from 27.63% to 60.43%. More importantly, compared to using only one LLM, Metagente boots up the accuracy to multiple folds.

PaperID: 11256, https://arxiv.org/pdf/2503.10719.pdf

Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen

Title: Long-Video Audio Synthesis with Multi-Agent Collaboration

Abstract:
Video‑to‑audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video‑to‑audio dubbing for long‑form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross‑scene consistency. We propose LVAS‑Agent, a novel multi‑agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long‑video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion‑correction mechanism for scene/script refinement and a generation‑retrieval loop for temporal‑semantic alignment. To enable systematic evaluation, we introduce LVAS‑Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio‑visual alignment over baseline methods. Project page: https://lvas‑agent.github.io

PaperID: 11257, https://arxiv.org/pdf/2503.10689.pdf

Authors: Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee

Title: Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

Abstract:
Recent advances in large language models (LLMs) have led to a growing interest in developing LLM‑based agents for automating web tasks. However, these agents often struggle with even simple tasks on real‑world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into comprehensible format, which are then utilized by the decision‑making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision‑making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed‑source LLMs (e.g., Gemini‑1.5‑flash, GPT‑4o, Claude‑3.5‑Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open‑source LMs (e.g., Llama‑3.1‑8B, Llama‑3.1‑70B) on the WorkArena benchmark. Moreover, the Gemini‑1.5‑flash agent with LCoW achieves state‑of‑the‑art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

PaperID: 11258, https://arxiv.org/pdf/2503.10509.pdf

Authors: Sahar Admoni, Assaf Hallak, Yftah Ziser, Omer Ben-Porat, Ofra Amir

Title: From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL

Abstract:
Explaining reinforcement learning agents is challenging because policies emerge from complex reward structures and neural representations that are difficult for humans to interpret. Existing approaches often rely on curated demonstrations that expose local behaviors but provide limited insight into an agent's global strategy, leaving users to infer intent from raw observations. We propose SySLLM (Synthesized Summary using Large Language Models), a framework that reframes policy interpretation as a language‑generation problem. Instead of visual demonstrations, SySLLM converts spatiotemporal trajectories into structured text and prompts an LLM to generate coherent summaries describing the agent's goals, exploration style, and decision patterns. SySLLM scales to long‑horizon, semantically rich environments without task‑specific fine‑tuning, leveraging LLM world knowledge and compositional reasoning to capture latent behavioral structure across policies. Expert evaluations show strong alignment with human analyses, and a large‑scale user study found that 75.5% of participants preferred SySLLM summaries over state‑of‑the‑art demonstration‑based explanations. Together, these results position abstractive textual summarization as a paradigm for interpreting complex RL behavior.

PaperID: 11259, https://arxiv.org/pdf/2503.10341.pdf

Authors: Aron Harder, Amar Kulkarni, Madhur Behl

Title: HALO: Fault-Tolerant Safety Architecture For High-Speed Autonomous Racing

Abstract:
The field of high‑speed autonomous racing has seen significant advances in recent years, with the rise of competitions such as RoboRace and the Indy Autonomous Challenge providing a platform for researchers to develop software stacks for autonomous race vehicles capable of reaching speeds in excess of 170 mph. Ensuring the safety of these vehicles requires the software to continuously monitor for different faults and erroneous operating conditions during high‑speed operation, with the goal of mitigating any unreasonable risks posed by malfunctions in sub‑systems and components. This paper presents a comprehensive overview of the HALO safety architecture, which has been implemented on a full‑scale autonomous racing vehicle as part of the Indy Autonomous Challenge. The paper begins with a failure mode and criticality analysis of the perception, planning, control, and communication modules of the software stack. Specifically, we examine three different types of faults ‑ node health, data health, and behavioral‑safety faults. To mitigate these faults, the paper then outlines HALO safety archetypes and runtime monitoring methods. Finally, the paper demonstrates the effectiveness of the HALO safety architecture for each of the faults, through real‑world data gathered from autonomous racing vehicle trials during multi‑agent scenarios.

PaperID: 11260, https://arxiv.org/pdf/2503.10310.pdf

Authors: Shin Yoo, Robert Feldt, Somin Kim, Naryeong Kim

Title: Capturing Semantic Flow of ML-based Systems

Abstract:
ML‑based systems are software systems that incorporates machine learning components such as Deep Neural Networks (DNNs) or Large Language Models (LLMs). While such systems enable advanced features such as high performance computer vision, natural language processing, and code generation, their internal behaviour remain largely opaque to traditional dynamic analysis such as testing: existing analysis typically concern only what is observable from the outside, such as input similarity or class label changes. We propose semantic flow, a concept designed to capture the internal behaviour of ML‑based system and to provide a platform for traditional dynamic analysis techniques to be adapted to. Semantic flow combines the idea of control flow with internal states taken from executions of ML‑based systems, such as activation values of a specific layer in a DNN, or embeddings of LLM responses at a specific inference step of LLM agents. The resulting representation, summarised as semantic flow graphs, can capture internal decisions that are not explicitly represented in the traditional control flow of ML‑based systems. We propose the idea of semantic flow, introduce two examples using a DNN and an LLM agent, and finally sketch its properties and how it can be used to adapt existing dynamic analysis techniques for use in ML‑based software systems.

PaperID: 11261, https://arxiv.org/pdf/2503.10248.pdf

Authors: Idan Horowitz, Ori Plonsky

Title: LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns

Abstract:
We investigate the choice patterns of Large Language Models (LLMs) in the context of Decisions from Experience tasks that involve repeated choice and learning from feedback, and compare their behavior to human participants. We find that on the aggregate, LLMs appear to display behavioral biases similar to humans: both exhibit underweighting rare events and correlation effects. However, more nuanced analyses of the choice patterns reveal that this happens for very different reasons. LLMs exhibit strong recency biases, unlike humans, who appear to respond in more sophisticated ways. While these different processes may lead to similar behavior on average, choice patterns contingent on recent events differ vastly between the two groups. Specifically, phenomena such as ``surprise triggers change" and the ``wavy recency effect of rare events" are robustly observed in humans, but entirely absent in LLMs. Our findings provide insights into the limitations of using LLMs to simulate and predict humans in learning environments and highlight the need for refined analyses of their behavior when investigating whether they replicate human decision making tendencies.

PaperID: 11262, https://arxiv.org/pdf/2503.10245.pdf

Authors: Mohd. Faizuddin Faruqui, Ratnangshu Das, Ravi Kumar L, Pushpak Jagtap

Title: Reach-Avoid-Stay-Collision-Avoidance Negotiation Framework for Multi-Agent Systems via Spatiotemporal Tubes

Abstract:
This study presents a multi‑agent negotiation‑based framework to obtain collision‑free paths while performing prescribed‑time reach‑avoid‑stay (RAS) tasks for agents with unknown dynamics and bounded disturbance. By employing spatiotemporal tubes to generate time‑varying state constraints, we ensure that all agents adhere to RAS specifications using synthesized controllers. To prevent inter‑agent collisions, a negotiation mechanism is proposed where successful negotiations result in spatiotemporal tubes for each agent fulfilling desired tasks. This approach results in a completely distributed, approximation‑free control law for each agent. The effectiveness of this mechanism was validated through simulations of multi‑agent robot navigation and drone navigation tasks involving prescribed‑time RAS specifications and collision avoidance.

PaperID: 11263, https://arxiv.org/pdf/2503.10205.pdf

Authors: Anthony Couthures, Vineeth S. Varma, Samson Lasaulce, Irinel-Constantin Morarescu

Title: Global synchronization of multi-agent systems with nonlinear interactions

Abstract:
The paper addresses the synchronization of multi‑agent systems with continuous‑time dynamics interacting through a very general class of monotonic continuous signal functions that covers estimation biases, approximation of discrete quantization, or state‑dependent estimation. Our analysis reveals that, in the setup under consideration, synchronization equilibria are exactly the fixed points of the signal function. We also derive intuitive stability conditions based on whether the signal underestimates or overestimates the state of the agents around these fixed points. Moreover, we show that network topology plays a crucial role in asymptotic synchronization. These results provide interesting insights into the interplay between communication nonlinearity and network connectivity, paving the way for advanced coordination strategies in complex systems.

PaperID: 11264, https://arxiv.org/pdf/2503.10186.pdf

Authors: Dan Leonte, Aamal Hussain, Raphael Huser, Francesco Belardinelli, Dario Paccagnan

Title: Convergence and Connectivity: Dynamics of Multi-Agent Q-Learning in Random Networks

Abstract:
Beyond specific settings, many multi‑agent learning algorithms fail to converge to an equilibrium solution, instead displaying complex, non‑stationary behaviours such as recurrent or chaotic orbits. In fact, recent literature suggests that such complex behaviours are likely to occur when the number of agents increases. In this paper, we study Q‑learning dynamics in network polymatrix normal‑form games where the network structure is drawn from classical random graph models. In particular, we focus on the Erdős‑Rényi model, which is used to analyze connectivity in distributed systems, and the Stochastic Block model, which generalizes the above by accounting for community structures that naturally arise in multi‑agent systems. In each setting, we establish sufficient conditions under which the agents' joint strategies converge to a unique equilibrium. We investigate how this condition depends on the exploration rates, payoff matrices and, crucially, the probabilities of interaction between network agents. We validate our theoretical findings through numerical simulations and demonstrate that convergence can be reliably achieved in many‑agent systems, provided interactions in the network are controlled.

PaperID: 11265, https://arxiv.org/pdf/2503.10099.pdf

Authors: Lin Ao, Han Liu, Huafeng Zhang

Title: AgentDAO: Synthesis of Proposal Transactions Via Abstract DAO Semantics

Abstract:
While the trend of decentralized governance is obvious (cryptocurrencies and blockchains are widely adopted by multiple sovereign countries), initiating governance proposals within Decentralized Autonomous Organizations (DAOs) is still challenging, i.e., it requires providing a low‑level transaction payload, therefore posing significant barriers to broad community participation. To address these challenges, we propose a multi‑agent system powered by Large Language Models with a novel Label‑Centric Retrieval algorithm to automate the translation from natural language inputs into executable proposal transactions. The system incorporates DAOLang, a Domain‑Specific Language to simplify the specification of various governance proposals. The key optimization achieved by DAOLang is a semantic‑aware abstraction of user input that reliably secures proposal generation with a low level of token demand. A preliminary evaluation on real‑world applications reflects the potential of DAOLang in terms of generating complicated types of proposals with existing foundation models, e.g. GPT‑4o.

PaperID: 11266, https://arxiv.org/pdf/2503.10071.pdf

Authors: Mohd Ariful Haque, Justin Williams, Sunzida Siddique, Md. Hujaifa Islam, Hasmot Ali, Kishor Datta Gupta, Roy George

Title: Advanced Tool Learning and Selection System (ATLASS): A Closed-Loop Framework Using LLM

Abstract:
The combination of LLM agents with external tools enables models to solve complex tasks beyond their knowledge base. Human‑designed tools are inflexible and restricted to solutions within the scope of pre‑existing tools created by experts. To address this problem, we propose ATLASS, an advanced tool learning and selection system designed as a closed‑loop framework. It enables the LLM to solve problems by dynamically generating external tools on demand. In this framework, agents play a crucial role in orchestrating tool selection, execution, and refinement, ensuring adaptive problem‑solving capabilities. The operation of ATLASS follows three phases: The first phase, Understanding Tool Requirements, involves the Agents determining whether tools are required and specifying their functionality; the second phase, Tool Retrieval/Generation, involves the Agents retrieving or generating tools based on their availability; and the third phase, Task Solving, involves combining all the component tools necessary to complete the initial task. The Tool Dataset stores the generated tools, ensuring reusability and minimizing inference cost. Current LLM‑based tool generation systems have difficulty creating complex tools that need APIs or external packages. In ATLASS, we solve the problem by automatically setting up the environment, fetching relevant API documentation online, and using a Python interpreter to create a reliable, versatile tool that works in a wider range of situations. OpenAI GPT‑4.0 is used as the LLM agent, and safety and ethical concerns are handled through human feedback before executing generated code. By addressing the limitations of predefined toolsets and enhancing adaptability, ATLASS serves as a real‑world solution that empowers users with dynamically generated tools for complex problem‑solving.

PaperID: 11267, https://arxiv.org/pdf/2503.10062.pdf

Authors: Ru An, Ying Wang, Yanlong Zhao, Ji-Feng Zhang

Title: One-bit consensus of controllable linear multi-agent systems with communication noises

Abstract:
This paper addresses the one‑bit consensus of controllable linear multi‑agent systems (MASs) with communication noises. A consensus algorithm consisting of a communication protocol and a consensus controller is designed. The communication protocol introduces a linear compression encoding function to achieve a one‑bit data rate, thereby saving communication costs. The consensus controller with a stabilization term and a consensus term is proposed to ensure the consensus of a potentially unstable but controllable MAS. Specifically, in the consensus term, we adopt an estimation method to overcome the information loss caused by one‑bit communications and a decay step to attenuate the effect of communication noise. Two combined Lyapunov functions are constructed to overcome the difficulty arising from the coupling of the control and estimation. By establishing similar iterative structures of these two functions, this paper shows that the MAS can achieve consensus in the mean square sense at the rate of the reciprocal of the iteration number under the case with a connected fixed topology. Moreover, the theoretical results are generalized to the case with jointly connected Markovian switching topologies by establishing a certain equivalence relationship between the Markovian switching topologies and a fixed topology. Two simulation examples are given to validate the algorithm.

PaperID: 11268, https://arxiv.org/pdf/2503.10049.pdf

Authors: Ziqi Jia, Junjie Li, Xiaoyang Qu, Jianzong Wang

Title: Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy

Abstract:
Multi‑agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi‑Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM‑based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM‑based Graph Collaboration MARL (LGC‑MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph‑based coordination. Specifically, LGC‑MARL consists of two main components: an LLM planner and a graph‑based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph‑based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta‑learning. Experimental results on the AI2‑THOR simulation platform demonstrate the superior performance and scalability of LGC‑MARL in completing various complex tasks.

PaperID: 11269, https://arxiv.org/pdf/2503.10013.pdf

Authors: Lingchan Bao, Tong Wei, Yuanyu Wan

Title: Revisiting Multi-Agent Asynchronous Online Optimization with Delays: the Strongly Convex Case

Abstract:
We revisit multi‑agent asynchronous online optimization with delays, where only one of the agents becomes active for making the decision at each round, and the corresponding feedback is received by all the agents after unknown delays. Although previous studies have established an O(\sqrtdT) regret bound for this problem, they assume that the maximum delay d is knowable or the arrival order of feedback satisfies a special property, which may not hold in practice. In this paper, we surprisingly find that when the loss functions are strongly convex, these assumptions can be eliminated, and the existing regret bound can be significantly improved to O(d\log T) meanwhile. Specifically, to exploit the strong convexity of functions, we first propose a delayed variant of the classical follow‑the‑leader algorithm, namely FTDL, which is very simple but requires the full information of functions as feedback. Moreover, to handle the more general case with only the gradient feedback, we develop an approximate variant of FTDL by combining it with surrogate loss functions. Experimental results show that the approximate FTDL outperforms the existing algorithm in the strongly convex case.

PaperID: 11270, https://arxiv.org/pdf/2503.10009.pdf

Authors: Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, Chau Yuen

Title: OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM

Abstract:
With the rise of artificial intelligence (AI), applying large language models (LLMs) to mathematical problem‑solving has attracted increasing attention. Most existing approaches attempt to improve Operations Research (OR) optimization problem‑solving through prompt engineering or fine‑tuning strategies for LLMs. However, these methods are fundamentally constrained by the limited capabilities of non‑reasoning LLMs. To overcome these limitations, we propose OR‑LLM‑Agent, an AI agent framework built on reasoning LLMs for automated OR problem solving. The framework decomposes the task into three sequential stages: mathematical modeling, code generation, and debugging. Each task is handled by a dedicated sub‑agent, which enables more targeted reasoning. We also construct BWOR, an OR dataset for evaluating LLM performance on OR tasks. Our analysis shows that in the benchmarks NL4OPT, MAMO, and IndustryOR, reasoning LLMs sometimes underperform their non‑reasoning counterparts within the same model family. In contrast, BWOR provides a more consistent and discriminative assessment of model capabilities. Experimental results demonstrate that OR‑LLM‑Agent utilizing DeepSeek‑R1 in its framework outperforms advanced methods, including GPT‑o3, Gemini 2.5 Pro, DeepSeek‑R1, and ORLM, by at least 7% in accuracy. These results demonstrate the effectiveness of task decomposition for OR problem solving.

PaperID: 11271, https://arxiv.org/pdf/2503.09819.pdf

Authors: Yuwei Zhang, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

Title: Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval

Abstract:
Large Language Models (LLMs) often exhibit substantially shorter effective context lengths than their claimed capacities, especially when handling complex reasoning tasks that require integrating information from multiple parts of a long context and performing multi‑step reasoning. Although Chain‑of‑Thought (CoT) prompting has shown promise in reducing task complexity, our empirical analysis reveals that it does not fully resolve this limitation. Through controlled experiments, we identify poor recall of implicit facts as the primary cause of failure, which significantly hampers reasoning performance. Interestingly, we observe that the internal attention weights from the generated CoT tokens can effectively ground implicit facts, even when these facts are not explicitly recalled. Building on this insight, we propose a novel training‑free algorithm, Attrieval, which leverages attention weights to retrieve relevant facts from the long context and incorporates them into the reasoning process. Additionally, we find that selecting context tokens from CoT tokens further improves performance. Our results demonstrate that Attrieval enhances long‑context reasoning capability notably on both synthetic and real‑world QA datasets with various models.

PaperID: 11272, https://arxiv.org/pdf/2503.09758.pdf

Authors: Weizheng Wang, Ike Obi, Byung-Cheol Min

Title: Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation

Abstract:
Recent advances in robotics and large language models (LLMs) have sparked growing interest in human‑robot collaboration and embodied intelligence. To enable the broader deployment of robots in human‑populated environments, socially‑aware robot navigation (SAN) has become a key research area. While deep reinforcement learning approaches that integrate human‑robot interaction (HRI) with path planning have demonstrated strong benchmark performance, they often struggle to adapt to new scenarios and environments. LLMs offer a promising avenue for zero‑shot navigation through commonsense inference. However, most existing LLM‑based frameworks rely on centralized decision‑making, lack robust verification mechanisms, and face inconsistencies in translating macro‑actions into precise low‑level control signals. To address these challenges, we propose SAMALM, a decentralized multi‑agent LLM actor‑critic framework for multi‑robot social navigation. In this framework, a set of parallel LLM actors, each reflecting distinct robot personalities or configurations, directly generate control signals. These actions undergo a two‑tier verification process via a global critic that evaluates group‑level behaviors and individual critics that assess each robot's context. An entropy‑based score fusion mechanism further enhances self‑verification and re‑query, improving both robustness and coordination. Experimental results confirm that SAMALM effectively balances local autonomy with global oversight, yielding socially compliant behaviors and strong adaptability across diverse multi‑robot scenarios. More details and videos about this work are available at: https://sites.google.com/view/SAMALM.

PaperID: 11273, https://arxiv.org/pdf/2503.09755.pdf

Authors: Guangyi Liu, Suzan Iloglu, Michael Caldara, Joseph W. Durham, Michael M. Zavlanos

Title: Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping

Abstract:
In Amazon robotic warehouses, the destination‑to‑chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi‑Agent Reinforcement Learning (DRMARL) framework that learns a destination‑to‑chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group that capture, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit‑based predictor of the worst‑case induction distribution for each state‑action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.

PaperID: 11274, https://arxiv.org/pdf/2503.09648.pdf

Authors: Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pang, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, Bo An, Qingsong Wen

Title: A Survey on Trustworthy LLM Agents: Threats and Countermeasures

Abstract:
With the rapid evolution of Large Language Models (LLMs), LLM‑based agents and Multi‑agent Systems (MAS) have significantly expanded the capabilities of LLM ecosystems. This evolution stems from empowering LLMs with additional modules such as memory, tools, environment, and even other agents. However, this advancement has also introduced more complex issues of trustworthiness, which previous research focused solely on LLMs could not cover. In this survey, we propose the TrustAgent framework, a comprehensive study on the trustworthiness of agents, characterized by modular taxonomy, multi‑dimensional connotations, and technical implementation. By thoroughly investigating and summarizing newly emerged attacks, defenses, and evaluation methods for agents and MAS, we extend the concept of Trustworthy LLM to the emerging paradigm of Trustworthy Agent. In TrustAgent, we begin by deconstructing and introducing various components of the Agent and MAS. Then, we categorize their trustworthiness into intrinsic (brain, memory, and tool) and extrinsic (user, agent, and environment) aspects. Subsequently, we delineate the multifaceted meanings of trustworthiness and elaborate on the implementation techniques of existing research related to these internal and external modules. Finally, we present our insights and outlook on this domain, aiming to provide guidance for future endeavors.

PaperID: 11275, https://arxiv.org/pdf/2503.09621.pdf

Authors: Yanze Zhang, Yiwei Lyu, Siwon Jo, Yupeng Yang, Wenhao Luo

Title: Adaptive Deadlock Avoidance for Decentralized Multi-agent Systems via CBF-inspired Risk Measurement

Abstract:
Decentralized safe control plays an important role in multi‑agent systems given the scalability and robustness without reliance on a central authority. However, without an explicit global coordinator, the decentralized control methods are often prone to deadlock ‑‑ a state where the system reaches equilibrium, causing the robots to stall. In this paper, we propose a generalized decentralized framework that unifies the Control Lyapunov Function (CLF) and Control Barrier Function (CBF) to facilitate efficient task execution and ensure deadlock‑free trajectories for the multi‑agent systems. As the agents approach the deadlock‑related undesirable equilibrium, the framework can detect the equilibrium and drive agents away before that happens. This is achieved by a secondary deadlock resolution design with an auxiliary CBF to prevent the multi‑agent systems from converging to the undesirable equilibrium. To avoid dominating effects due to the deadlock resolution over the original task‑related controllers, a deadlock indicator function using CBF‑inspired risk measurement is proposed and encoded in the unified framework for the agents to adaptively determine when to activate the deadlock resolution. This allows the agents to follow their original control tasks and seamlessly unlock or deactivate deadlock resolution as necessary, effectively improving task efficiency. We demonstrate the effectiveness of the proposed method through theoretical analysis, numerical simulations, and real‑world experiments.

PaperID: 11276, https://arxiv.org/pdf/2503.09549.pdf

Authors: Fabio Cassini, Chiara Segala

Title: The turnpike control in stochastic multi-agent dynamics: a discrete-time approach with exponential integrators

Abstract:
In this manuscript, we study the turnpike property in stochastic discrete‑time optimal control problems for interacting agents. Extending previous deterministic results, we show that the turnpike effect persists in the presence of noise under suitable dissipativity and controllability conditions. To handle the possible stiffness in the system dynamics, we employ for the time discretization, integrators of exponential type. Numerical experiments validate our findings, demonstrating the advantages of exponential integrators over standard explicit schemes and confirming the effectiveness of the turnpike control even in the stochastic setting.

PaperID: 11277, https://arxiv.org/pdf/2503.09513.pdf

Authors: Md Morshed Alam, Lokesh Chandra Das, Sandip Roy, Sachin Shetty, Weichao Wang

Title: RESTRAIN: Reinforcement Learning-Based Secure Framework for Trigger-Action IoT Environment

Abstract:
Internet of Things (IoT) platforms with trigger‑action capability allow event conditions to trigger actions in IoT devices autonomously by creating a chain of interactions. Adversaries exploit this chain of interactions to maliciously inject fake event conditions into IoT hubs, triggering unauthorized actions on target IoT devices to implement remote injection attacks. Existing defense mechanisms focus mainly on the verification of event transactions using physical event fingerprints to enforce the security policies to block unsafe event transactions. These approaches are designed to provide offline defense against injection attacks. The state‑of‑the‑art online defense mechanisms offer real‑time defense, but extensive reliability on the inference of attack impacts on the IoT network limits the generalization capability of these approaches. In this paper, we propose a platform‑independent multi‑agent online defense system, namely RESTRAIN, to counter remote injection attacks at runtime. RESTRAIN allows the defense agent to profile attack actions at runtime and leverages reinforcement learning to optimize a defense policy that complies with the security requirements of the IoT network. The experimental results show that the defense agent effectively takes real‑time defense actions against complex and dynamic remote injection attacks and maximizes the security gain with minimal computational overhead.

PaperID: 11278, https://arxiv.org/pdf/2503.09474.pdf

Authors: Jiayuan Huang, Runlong He, Danyal Zaman Khan, Evangelos B. Mazomenos, Danail Stoyanov, Hani Marcus, Linzhe Jiang, Matthew J Clarkson, Mobarak I. Hoque

Title: Surgical AI Copilot: Energy-Based Fourier Gradient Low-Rank Adaptation for Surgical LLM Agent Reasoning and Planning

Abstract:
Image‑guided surgery demands adaptive, real‑time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large language models (LLMs)‑powered agents offer a promising solution by enabling dynamic task planning and predictive decision support. Despite recent advances, the absence of surgical agent datasets and robust parameter‑efficient fine‑tuning techniques limits the development of LLM agents capable of complex intraoperative reasoning. In this paper, we introduce Surgical AI Copilot, an LLM agent for image‑guided pituitary surgery, capable of conversation, planning, and task execution in response to queries involving tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured agent planning, we develop the PitAgent dataset, a surgical context‑aware planning dataset covering surgical tasks like workflow analysis, instrument localization, anatomical segmentation, and query‑based reasoning. Additionally, we propose DEFT‑GaLore, a Deterministic Energy‑based Fourier Transform (DEFT) gradient projection technique for efficient low‑rank adaptation of recent LLMs (e.g., LLaMA 3.2, Qwen 2.5), enabling their use as surgical agent planners. We extensively validate our agent's performance and the proposed adaptation technique against other state‑of‑the‑art low‑rank adaptation methods on agent planning and prompt generation tasks, including a zero‑shot surgical VQA benchmark, demonstrating the significant potential for truly efficient and scalable surgical LLM agents in real‑time operative settings.

PaperID: 11279, https://arxiv.org/pdf/2503.09403.pdf

Authors: Xu Jiang, Gehui Li, Bin Chen, Jian Zhang

Title: Multi-Agent Image Restoration

Abstract:
Image restoration (IR) is challenging due to the complexity of real‑world degradations. While many specialized and all‑in‑one IR models have been developed, they fail to effectively handle complex, mixed degradations. Recent agentic methods RestoreAgent and AgenticIR leverage intelligent, autonomous workflows to alleviate this issue, yet they suffer from suboptimal results and inefficiency due to their resource‑intensive finetunings, and ineffective searches and tool execution trials for satisfactory outputs. In this paper, we propose MAIR, a novel Multi‑Agent approach for complex IR problems. We introduce a real‑world degradation prior, categorizing degradations into three types: (1) scene, (2) imaging, and (3) compression, which are observed to occur sequentially in real world, and reverse them in the opposite order. Built upon this three‑stage restoration framework, MAIR emulates a team of collaborative human specialists, including a "scheduler" for overall planning and multiple "experts" dedicated to specific degradations. This design minimizes search space and trial efforts, improving image quality while reducing inference costs. In addition, a registry mechanism is introduced to enable easy integration of new tools. Experiments on both synthetic and real‑world datasets show that proposed MAIR achieves competitive performance and improved efficiency over the previous agentic IR system. Code and models will be made available.

PaperID: 11280, https://arxiv.org/pdf/2503.09369.pdf

Authors: Anqi Dong, Karl H. Johansson, Johan Karlsson

Title: Task Allocation for Multi-agent Systems via Unequal-dimensional Optimal Transport

Abstract:
We consider a probabilistic model for large‑scale task allocation problems for multi‑agent systems, aiming to determine an optimal deployment strategy that minimizes the overall transport cost. Specifically, we assign transportation agents to delivery tasks with given pick‑up and drop‑off locations, pairing the spatial distribution of transport resources with the joint distribution of task origins and destinations. This aligns with the optimal mass transport framework where the problem and is in the unequal‑dimensional setting. The task allocation problem can be thus seen as a linear programming problem that minimizes a quadratic transport cost functional, optimizing the energy of all transport units. The problem is motivated by time‑sensitive medical deliveries using drones, such as emergency equipment and blood transport. In this paper, we establish the existence, uniqueness, and smoothness of the optimal solution, and illustrate its properties through numerical simulations.

PaperID: 11281, https://arxiv.org/pdf/2503.09282.pdf

Authors: Akira Hasegawa, Ryuta Kambe, Toshiaki Aoki, Yuuki Takano

Title: A Case Study on Model Checking and Runtime Verification for Awkernel

Abstract:
In operating system development, concurrency poses significant challenges. It is difficult for humans to manually review concurrent behaviors or to write test cases covering all possible executions, often resulting in critical bugs. Preemption in schedulers serves as a typical example. This paper proposes a development method for concurrent software, such as schedulers. Our method incorporates model checking as an aid for tracing code, simplifying the analysis of concurrent behavior; we refer to this as model checking‑assisted code review. While this approach aids in tracing behaviors, the accuracy of the results is limited because of the semantics gap between the modeling language and the programming language. Therefore, we also introduce runtime verification to address this limitation in model checking‑assisted code review. We applied our approach to a real‑world operating system, Awkernel, as a case study. This new operating system, currently under development for autonomous driving, is designed for preemptive task execution using asynchronous functions in Rust. After implementing our method, we identified several bugs that are difficult to detect through manual reviews or simple tests.

PaperID: 11282, https://arxiv.org/pdf/2503.09252.pdf

Authors: Qiang Li, Jin Niu, Qin Luo, Lina Yu

Title: Large-scale Regional Traffic Signal Control Based on Single-Agent Reinforcement Learning

Abstract:
In the context of global urbanization and motorization, traffic congestion has become a significant issue, severely affecting the quality of life, environment, and economy. This paper puts forward a single‑agent reinforcement learning (RL)‑based regional traffic signal control (TSC) model. Different from multi ‑ agent systems, this model can coordinate traffic signals across a large area, with the goals of alleviating regional traffic congestion and minimizing the total travel time. The TSC environment is precisely defined through specific state space, action space, and reward functions. The state space consists of the current congestion state, which is represented by the queue lengths of each link, and the current signal phase scheme of intersections. The action space is designed to select an intersection first and then adjust its phase split. Two reward functions are meticulously crafted. One focuses on alleviating congestion and the other aims to minimize the total travel time while considering the congestion level. The experiments are carried out with the SUMO traffic simulation software. The performance of the TSC model is evaluated by comparing it with a base case where no signal‑timing adjustments are made. The results show that the model can effectively control congestion. For example, the queuing length is significantly reduced in the scenarios tested. Moreover, when the reward is set to both alleviate congestion and minimize the total travel time, the average travel time is remarkably decreased, which indicates that the model can effectively improve traffic conditions. This research provides a new approach for large‑scale regional traffic signal control and offers valuable insights for future urban traffic management.

PaperID: 11283, https://arxiv.org/pdf/2503.09186.pdf

Authors: Jian-Jian Jiang, Xiao-Ming Wu, Yi-Xiang He, Ling-An Zeng, Yi-Lin Wei, Dandan Zhang, Wei-Shi Zheng

Title: Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

Abstract:
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we think bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks ignore to consider due to their enforced cooperation in the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi‑agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.

PaperID: 11284, https://arxiv.org/pdf/2503.09102.pdf

Authors: Jiaying Fu, Xiruo Wang, Zhouyi Li, Kate Vi, Chuyan Xu, Yuqian Sun

Title: "I Like Your Story!": A Co-Creative Story-Crafting Game with a Persona-Driven Character Based on Generative AI

Abstract:
While generative AI is advancing writing support tools, creative writing is often seen as the exclusive domain of skilled writers. This paper introduces "1001 Nights", a co‑creative story‑crafting game that transforms writing into a playful and rewarding activity. In this game, the AI agent takes on the role of a "moody" king with distinct storytelling preferences, not merely assisting but actively influencing the narrative. Players engage with the king agent through strategic storytelling, guiding him to mention weapon‑related keywords, which materialize as battle equipment. The king agent provides dynamic feedback, expressing satisfaction or displeasure, prompting players to adjust their approach. By combining storytelling, game mechanics, and AI‑driven responses, our system motivates creativity through playful constraints. Inspired by Oulipo's literary techniques, this approach demonstrates how AI‑powered game experiences can make creative writing more accessible and engaging, encouraging players to explore their creative potential.

PaperID: 11285, https://arxiv.org/pdf/2503.08937.pdf

Authors: Mohammad Farzanullah, Han Zhang, Akram Bin Sediq, Ali Afana, Melike Erol-Kantarci

Title: Beam Selection in ISAC using Contextual Bandit with Multi-modal Transformer and Transfer Learning

Abstract:
Sixth generation (6G) wireless technology is anticipated to introduce Integrated Sensing and Communication (ISAC) as a transformative paradigm. ISAC unifies wireless communication and RADAR or other forms of sensing to optimize spectral and hardware resources. This paper presents a pioneering framework that leverages ISAC sensing data to enhance beam selection processes in complex indoor environments. By integrating multi‑modal transformer models with a multi‑agent contextual bandit algorithm, our approach utilizes ISAC sensing data to improve communication performance and achieves high spectral efficiency (SE). Specifically, the multi‑modal transformer can capture inter‑modal relationships, enhancing model generalization across diverse scenarios. Experimental evaluations on the DeepSense 6G dataset demonstrate that our model outperforms traditional deep reinforcement learning (DRL) methods, achieving superior beam prediction accuracy and adaptability. In the single‑user scenario, we achieve an average SE regret improvement of 49.6% as compared to DRL. Furthermore, we employ transfer reinforcement learning to reduce training time and improve model performance in multi‑user environments. In the multi‑user scenario, this approach enhances the average SE regret, which is a measure to demonstrate how far the learned policy is from the optimal SE policy, by 19.7% compared to training from scratch, even when the latter is trained 100 times longer.

PaperID: 11286, https://arxiv.org/pdf/2503.08883.pdf

Authors: Kuang-Da Wang, Ping-Chun Hsieh, Wen-Chih Peng

Title: Imitation Learning of Correlated Policies in Stackelberg Games

Abstract:
Stackelberg games, widely applied in domains like economics and security, involve asymmetric interactions where a leader's strategy drives follower responses. Accurately modeling these dynamics allows domain experts to optimize strategies in interactive scenarios, such as turn‑based sports like badminton. In multi‑agent systems, agent behaviors are interdependent, and traditional Multi‑Agent Imitation Learning (MAIL) methods often fail to capture these complex interactions. Correlated policies, which account for opponents' strategies, are essential for accurately modeling such dynamics. However, even methods designed for learning correlated policies, like CoDAIL, struggle in Stackelberg games due to their asymmetric decision‑making, where leaders and followers cannot simultaneously account for each other's actions, often leading to non‑correlated policies. Furthermore, existing MAIL methods that match occupancy measures or use adversarial techniques like GAIL or Inverse RL face scalability challenges, particularly in high‑dimensional environments, and suffer from unstable training. To address these challenges, we propose a correlated policy occupancy measure specifically designed for Stackelberg games and introduce the Latent Stackelberg Differential Network (LSDN) to match it. LSDN models two‑agent interactions as shared latent state trajectories and uses multi‑output Geometric Brownian Motion (MO‑GBM) to effectively capture joint policies. By leveraging MO‑GBM, LSDN disentangles environmental influences from agent‑driven transitions in latent space, enabling the simultaneous learning of interdependent policies. This design eliminates the need for adversarial training and simplifies the learning process. Extensive experiments on Iterative Matrix Games and multi‑agent particle environments demonstrate that LSDN can better reproduce complex interaction dynamics than existing MAIL methods.

PaperID: 11287, https://arxiv.org/pdf/2503.08728.pdf

Authors: Yihong Li, Chengwei Zhang, Furui Zhan, Wanting Liu, Kailing Zhou, Longji Zheng

Title: Enhancing Traffic Signal Control through Model-based Reinforcement Learning and Policy Reuse

Abstract:
Multi‑agent reinforcement learning (MARL) has shown significant potential in traffic signal control (TSC). However, current MARL‑based methods often suffer from insufficient generalization due to the fixed traffic patterns and road network conditions used during training. This limitation results in poor adaptability to new traffic scenarios, leading to high retraining costs and complex deployment. To address this challenge, we propose two algorithms: PLight and PRLight. PLight employs a model‑based reinforcement learning approach, pretraining control policies and environment models using predefined source‑domain traffic scenarios. The environment model predicts the state transitions, which facilitates the comparison of environmental features. PRLight further enhances adaptability by adaptively selecting pre‑trained PLight agents based on the similarity between the source and target domains to accelerate the learning process in the target domain. We evaluated the algorithms through two transfer settings: (1) adaptability to different traffic scenarios within the same road network, and (2) generalization across different road networks. The results show that PRLight significantly reduces the adaptation time compared to learning from scratch in new TSC scenarios, achieving optimal performance using similarities between available and target scenarios.

PaperID: 11288, https://arxiv.org/pdf/2503.08709.pdf

Authors: Mehwish Nasim, Syed Muslim Gilani, Amin Qasmi, Usman Naseem

Title: Simulating Influence Dynamics with LLM Agents

Abstract:
This paper introduces a simulator designed for opinion dynamics researchers to model competing influences within social networks in the presence of LLM‑based agents. By integrating established opinion dynamics principles with state‑of‑the‑art LLMs, this tool enables the study of influence propagation and counter‑misinformation strategies. The simulator is particularly valuable for researchers in social science, psychology, and operations research, allowing them to analyse societal phenomena without requiring extensive coding expertise. Additionally, the simulator will be openly available on GitHub, ensuring accessibility and adaptability for those who wish to extend its capabilities for their own research.

PaperID: 11289, https://arxiv.org/pdf/2503.08545.pdf

Authors: I. Grinberg, A. Levin, E. D. Rimon

Title: Deformable Linear Object Surface Placement Using Elastica Planning and Local Shape Control

Abstract:
Manipulation of deformable linear objects (DLOs) in constrained environments is a challenging task. This paper describes a two‑layered approach for placing DLOs on a flat surface using a single robot hand. The high‑level layer is a novel DLO surface placement method based on Euler's elastica solutions. During this process one DLO endpoint is manipulated by the robot gripper while a variable interior point of the DLO serves as the start point of the portion aligned with the placement surface. The low‑level layer forms a pipeline controller. The controller estimates the DLO current shape using a Residual Neural Network (ResNet) and uses low‑level feedback to ensure task execution in the presence of modeling and placement errors. The resulting DLO placement approach can recover from states where the high‑level manipulation planner has failed as required by practical robot manipulation systems. The DLO placement approach is demonstrated with simulations and experiments that use silicon mock‑up objects prepared for fresh food applications.

PaperID: 11290, https://arxiv.org/pdf/2503.08506.pdf

Authors: Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu

Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews

Abstract:
Academic paper review is a critical yet time‑consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning‑consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review‑CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers‑summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant‑paper‑aware training method. Furthermore, we construct ReviewAgents, a multi‑role, multi‑LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human‑generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.

PaperID: 11291, https://arxiv.org/pdf/2503.08493.pdf

Authors: F. Giarrè, I. A. Meer, M. Masoudi, M. Ozger, C. Cavdar

Title: Hierarchical Multi Agent DRL for Soft Handovers Between Edge Clouds in Open RAN

Abstract:
Multi‑connectivity (MC) for aerial users via a set of ground access points offers the potential for highly reliable communication. Within an open radio access network (O‑RAN) architecture, edge clouds (ECs) enable MC with low latency for users within their coverage area. However, ensuring seamless service continuity for transitional users‑those moving between the coverage areas of neighboring ECs‑poses challenges due to centralized processing demands. To address this, we formulate a problem facilitating soft handovers between ECs, ensuring seamless transitions while maintaining service continuity for all users. We propose a hierarchical multi‑agent reinforcement learning (HMARL) algorithm to dynamically determine the optimal functional split configuration for transitional and non‑transitional users. Simulation results show that the proposed approach outperforms the conventional functional split in terms of the percentage of users maintaining service continuity, with at most 4% optimality gap. Additionally, HMARL achieves better scalability compared to the static baselines.

PaperID: 11292, https://arxiv.org/pdf/2503.08481.pdf

Authors: Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, Jinqiao Wang

Title: PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

Abstract:
Understanding the environment and a robot's physical reachability is crucial for task execution. While state‑of‑the‑art vision‑language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space‑Physical Reachability Map (S‑P Map), and PhysVLM, a vision‑language model that integrates this reachability information into visual reasoning. Specifically, the S‑P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot‑specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S‑P Map, enabling the model to reason about physical reachability without compromising its general vision‑language capabilities. To train and evaluate PhysVLM, we constructed a large‑scale multi‑robot dataset, Phys100K, and a challenging benchmark, EQA‑phys, which includes tasks for six different robots in both simulated and real‑world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT‑4o on EQA‑phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA‑val and OpenEQA benchmarks. Additionally, the S‑P Map shows strong compatibility with various VLMs, and its integration into GPT‑4o‑mini yields a 7.1% performance improvement.

PaperID: 11293, https://arxiv.org/pdf/2503.08474.pdf

Authors: Tim Steinke, Martin Büchner, Niclas Vödisch, Abhinav Valada

Title: Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding

Abstract:
Mapping and scene representation are fundamental to reliable planning and navigation in mobile robots. While purely geometric maps using voxel grids allow for general navigation, obtaining up‑to‑date spatial and semantically rich representations that scale to dynamic large‑scale environments remains challenging. In this work, we present CURB‑OSG, an open‑vocabulary dynamic 3D scene graph engine that generates hierarchical decompositions of urban driving scenes via multi‑agent collaboration. By fusing the camera and LiDAR observations from multiple perceiving agents with unknown initial poses, our approach generates more accurate maps compared to a single agent while constructing a unified open‑vocabulary semantic hierarchy of the scene. Unlike previous methods that rely on ground truth agent poses or are evaluated purely in simulation, CURB‑OSG alleviates these constraints. We evaluate the capabilities of CURB‑OSG on real‑world multi‑agent sensor data obtained from multiple sessions of the Oxford Radar RobotCar dataset. We demonstrate improved mapping and object prediction accuracy through multi‑agent collaboration as well as evaluate the environment partitioning capabilities of the proposed approach. To foster further research, we release our code and supplementary material at https://ov‑curb.cs.uni‑freiburg.de.

PaperID: 11294, https://arxiv.org/pdf/2503.08381.pdf

Authors: Benjamin Kempinski, Tal Kachman

Title: InfluenceNet: AI Models for Banzhaf and Shapley Value Prediction

Abstract:
Power indices are essential in assessing the contribution and influence of individual agents in multi‑agent systems, providing crucial insights into collaborative dynamics and decision‑making processes. While invaluable, traditional computational methods for exact or estimated power indices values require significant time and computational constraints, especially for large (n\ge10) coalitions. These constraints have historically limited researchers' ability to analyse complex multi‑agent interactions comprehensively. To address this limitation, we introduce a novel Neural Networks‑based approach that efficiently estimates power indices for voting games, demonstrating comparable and often superiour performance to existing tools in terms of both speed and accuracy. This method not only addresses existing computational bottlenecks, but also enables rapid analysis of large coalitions, opening new avenues for multi‑agent system research by overcoming previous computational limitations and providing researchers with a more accessible, scalable analytical tool.This increased efficiency will allow for the analysis of more complex and realistic multi‑agent scenarios.

PaperID: 11295, https://arxiv.org/pdf/2503.08330.pdf

Authors: Shibo Huang, Chenfan Shi, Jian Yang, Hanlin Dong, Jinpeng Mi, Ke Li, Jianfeng Zhang, Miao Ding, Peidong Liang, Xiong You, Xian Wei

Title: KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments

Abstract:
Autonomous navigation in open‑world outdoor environments faces challenges in integrating dynamic conditions, long‑distance spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) enhance semantic comprehension but lack spatial reasoning capabilities. Although diffusion models excel in local optimization, they fall short in large‑scale long‑distance navigation. To address these gaps, this paper proposes KiteRunner, a language‑driven cooperative local‑global navigation strategy that combines UAV orthophoto‑based global planning with diffusion model‑driven local path generation for long‑distance navigation in open‑world scenarios. Our method innovatively leverages real‑time UAV orthophotography to construct a global probability map, providing traversability guidance for the local planner, while integrating large models like CLIP and GPT to interpret natural language instructions. Experiments demonstrate that KiteRunner achieves 5.6% and 12.8% improvements in path efficiency over state‑of‑the‑art methods in structured and unstructured environments, respectively, with significant reductions in human interventions and execution time.

PaperID: 11296, https://arxiv.org/pdf/2503.08302.pdf

Authors: Ji Zhao, Xiao Lin

Title: General-Purpose Aerial Intelligent Agents Empowered by Large Language Models

Abstract:
The emergence of large language models (LLMs) opens new frontiers for unmanned aerial vehicle (UAVs), yet existing systems remain confined to predefined tasks due to hardware‑software co‑design challenges. This paper presents the first aerial intelligent agent capable of open‑world task execution through tight integration of LLM‑based reasoning and robotic autonomy. Our hardware‑software co‑designed system addresses two fundamental limitations: (1) Onboard LLM operation via an edge‑optimized computing platform, achieving 5‑6 tokens/sec inference for 14B‑parameter models at 220W peak power; (2) A bidirectional cognitive architecture that synergizes slow deliberative planning (LLM task planning) with fast reactive control (state estimation, mapping, obstacle avoidance, and motion planning). Validated through preliminary results using our prototype, the system demonstrates reliable task planning and scene understanding in communication‑constrained environments, such as sugarcane monitoring, power grid inspection, mine tunnel exploration, and biological observation applications. This work establishes a novel framework for embodied aerial artificial intelligence, bridging the gap between task planning and robotic autonomy in open environments.

PaperID: 11297, https://arxiv.org/pdf/2503.08199.pdf

Authors: Miao Zhang, Zhenlong Fang, Tianyi Wang, Qian Zhang, Shuai Lu, Junfeng Jiao, Tianyu Shi

Title: A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Abstract:
Traditional Reinforcement Learning (RL) suffers from replicating human‑like behaviors, generalizing effectively in multi‑agent scenarios, and overcoming inherent interpretability issues.These tasks are compounded when deep environment understanding, agent coordination and dynamic optimization are required. While Large Language Model (LLM) enhanced methods have shown promise in generalization and interoperability, they often neglect necessary multi‑agent coordination. Therefore, we introduce the Cascading Cooperative Multi‑agent (CCMA) framework, integrating RL for individual interactions, a fine‑tuned LLM for regional cooperation, a reward function for global optimization, and the Retrieval‑augmented Generation mechanism to dynamically optimize decision‑making across complex driving scenarios. Our experiments demonstrate that the CCMA outperforms existing RL methods, demonstrating significant improvements in both micro and macro‑level performance in complex driving environments.

PaperID: 11298, https://arxiv.org/pdf/2503.07693.pdf

Authors: Anastasiia Grishina, Vadim Liventsev, Aki Härmä, Leon Moonen

Title: Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models

Abstract:
Program synthesis with Large Language Models (LLMs) suffers from a "near‑miss syndrome": the generated code closely resembles a correct solution but fails unit tests due to minor errors. We address this with a multi‑agent framework called Synthesize, Execute, Instruct, Debug, and Repair (SEIDR). Effectively applying SEIDR to instruction‑tuned LLMs requires determining (a) optimal prompts for LLMs, (b) what ranking algorithm selects the best programs in debugging rounds, and (c) balancing the repair of unsuccessful programs with the generation of new ones. We empirically explore these trade‑offs by comparing replace‑focused, repair‑focused, and hybrid debug strategies. We also evaluate lexicase and tournament selection to rank candidates in each generation. On Program Synthesis Benchmark 2 (PSB2), our framework outperforms both conventional use of OpenAI Codex without a repair phase and traditional genetic programming approaches. SEIDR outperforms the use of an LLM alone, solving 18 problems in C++ and 20 in Python on PSB2 at least once across experiments. To assess generalizability, we employ GPT‑3.5 and Llama 3 on the PSB2 and HumanEval‑X benchmarks. Although SEIDR with these models does not surpass current state‑of‑the‑art methods on the Python benchmarks, the results on HumanEval‑C++ are promising. SEIDR with Llama 3‑8B achieves an average pass@100 of 84.2%. Across all SEIDR runs, 163 of 164 problems are solved at least once with GPT‑3.5 in HumanEval‑C++, and 162 of 164 with the smaller Llama 3‑8B. We conclude that SEIDR effectively overcomes the near‑miss syndrome in program synthesis with LLMs.

PaperID: 11299, https://arxiv.org/pdf/2503.07686.pdf

Authors: Theodor Panayotov, Ivo Emanuilov

Title: Adaptive routing protocols for determining optimal paths in AI multi-agent systems: a priority- and learning-enhanced approach

Abstract:
As distributed artificial intelligence (AI) and multi‑agent architectures grow increasingly complex, the need for adaptive, context‑aware routing becomes paramount. This paper introduces an enhanced, adaptive routing algorithm tailored for AI multi‑agent networks, integrating priority‑based cost functions and dynamic learning mechanisms. Building on an extended Dijkstra‑based framework, we incorporate multi‑faceted parameters such as task complexity, user request priority, agent capabilities, bandwidth, latency, load, model sophistication, and reliability. We further propose dynamically adaptive weighting factors, tuned via reinforcement learning (RL), to continuously evolve routing policies based on observed network performance. Additionally, heuristic filtering and hierarchical routing structures improve scalability and responsiveness. Our approach yields context‑sensitive, load‑aware, and priority‑focused routing decisions that not only reduce latency for critical tasks but also optimize overall resource utilization, ultimately enhancing the robustness, flexibility, and efficiency of multi‑agent systems.

PaperID: 11300, https://arxiv.org/pdf/2503.07678.pdf

Authors: Kailing Zhou, Chengwei Zhang, Furui Zhan, Wanting Liu, Yihong Li

Title: Using a single actor to output personalized policy for different intersections

Abstract:
Recently, with the development of Multi‑agent reinforcement learning (MARL), adaptive traffic signal control (ATSC) has achieved satisfactory results. In traffic scenarios with multiple intersections, MARL treats each intersection as an agent and optimizes traffic signal control strategies through learning and real‑time decision‑making. Considering that observation distributions of intersections might be different in real‑world scenarios, shared parameter methods might lack diversity and thus lead to high generalization requirements in the shared‑policy network. A typical solution is to increase the size of network parameters. However, simply increasing the scale of the network does not necessarily improve policy generalization, which is validated in our experiments. Accordingly, an approach that considers both the personalization of intersections and the efficiency of parameter sharing is required. To this end, we propose Hyper‑Action Multi‑Head Proximal Policy Optimization (HAMH‑PPO), a Centralized Training with Decentralized Execution (CTDE) MARL method that utilizes a shared PPO policy network to deliver personalized policies for intersections with non‑iid observation distributions. The centralized critic in HAMH‑PPO uses graph attention units to calculate the graph representations of all intersections and outputs a set of value estimates with multiple output heads for each intersection. The decentralized execution actor takes the local observation history as input and output distributions of action as well as a so‑called hyper‑action to balance the multiple values estimated from the centralized critic to further guide the updating of TSC policies. The combination of hyper‑action and multi‑head values enables multiple agents to share a single actor‑critic while achieving personalized policies.

PaperID: 11301, https://arxiv.org/pdf/2503.07675.pdf

Authors: Junwei Yu, Yepeng Ding, Hiroyuki Sato

Title: DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems

Abstract:
The emergence of Large Language Models (LLMs) in Multi‑Agent Systems (MAS) has opened new possibilities for artificial intelligence, yet current implementations face significant challenges in resource management, task coordination, and system efficiency. While existing frameworks demonstrate the potential of LLM‑based agents in collaborative problem‑solving, they often lack sophisticated mechanisms for parallel execution and dynamic task management. This paper introduces DynTaskMAS, a novel framework that orchestrates asynchronous and parallel operations in LLM‑based MAS through dynamic task graphs. The framework features four key innovations: (1) a Dynamic Task Graph Generator that intelligently decomposes complex tasks while maintaining logical dependencies, (2) an Asynchronous Parallel Execution Engine that optimizes resource utilization through efficient task scheduling, (3) a Semantic‑Aware Context Management System that enables efficient information sharing among agents, and (4) an Adaptive Workflow Manager that dynamically optimizes system performance. Experimental evaluations demonstrate that DynTaskMAS achieves significant improvements over traditional approaches: a 21‑33% reduction in execution time across task complexities (with higher gains for more complex tasks), a 35.4% improvement in resource utilization (from 65% to 88%), and near‑linear throughput scaling up to 16 concurrent agents (3.47X improvement for 4X agents). Our framework establishes a foundation for building scalable, high‑performance LLM‑based multi‑agent systems capable of handling complex, dynamic tasks efficiently.

PaperID: 11302, https://arxiv.org/pdf/2503.07662.pdf

Authors: Lavanya Ratnabala, Robinroy Peter, Aleksey Fedoseev, Dzmitry Tsetserukou

Title: HIPPO-MAT: Decentralized Task Allocation Using GraphSAGE and Multi-Agent Deep Reinforcement Learning

Abstract:
This paper tackles decentralized continuous task allocation in heterogeneous multi‑agent systems. We present a novel framework HIPPO‑MAT that integrates graph neural networks (GNN) employing a GraphSAGE architecture to compute independent embeddings on each agent with an Independent Proximal Policy Optimization (IPPO) approach for multi‑agent deep reinforcement learning. In our system, unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) share aggregated observation data via communication channels while independently processing these inputs to generate enriched state embeddings. This design enables dynamic, cost‑optimal, conflict‑aware task allocation in a 3D grid environment without the need for centralized coordination. A modified A path planner is incorporated for efficient routing and collision avoidance. Simulation experiments demonstrate scalability with up to 30 agents and preliminary real‑world validation on JetBot ROS AI Robots, each running its model on a Jetson Nano and communicating through an ESP‑NOW protocol using ESP32‑S3, which confirms the practical viability of the approach that incorporates simultaneous localization and mapping (SLAM). Experimental results revealed that our method achieves a high 92.5% conflict‑free success rate, with only a 16.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on greedy approach. Additionally, the framework exhibits scalability with up to 30 agents with allocation processing of 0.32 simulation step time and robustness in responding to dynamically generated tasks.

PaperID: 11303, https://arxiv.org/pdf/2503.07651.pdf

Authors: Tanvir Rahman

Title: A Case Study of Counting the Number of Unique Users in Linear and Non-Linear Trails -- A Multi-Agent System Approach

Abstract:
Parks play a crucial role in enhancing the quality of life by providing recreational spaces and environmental benefits. Understanding the patterns of park usage, including the number of visitors and their activities, is essential for effective security measures, infrastructure maintenance, and resource allocation. Traditional methods rely on single‑entry sensors that count total visits but fail to distinguish unique users, limiting their effectiveness due to manpower and cost constraints.With advancements in affordable video surveillance and networked processing, more comprehensive park usage analysis is now feasible. This study proposes a multi‑agent system leveraging low‑cost cameras in a distributed network to track and analyze unique users. As a case study, we deployed this system at the Jack A. Markell (JAM) Trail in Wilmington, Delaware, and Hall Trail in Newark, Delaware. The system captures video data, autonomously processes it using existing algorithms, and extracts user attributes such as speed, direction, activity type, clothing color, and gender. These attributes are shared across cameras to construct movement trails and accurately count unique visitors. Our approach was validated through comparison with manual human counts and simulated scenarios under various conditions. The results demonstrate a 72% success rate in identifying unique users, setting a benchmark in automated park activity monitoring. Despite challenges such as camera placement and environmental factors, our findings suggest that this system offers a scalable, cost‑effective solution for real‑time park usage analysis and visitor behavior tracking.

PaperID: 11304, https://arxiv.org/pdf/2503.07604.pdf

Authors: Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang

Title: Implicit Reasoning in Transformers is Reasoning through Shortcuts

Abstract:
Test‑time compute is emerging as a new paradigm for enhancing language models' complex multi‑step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test‑time compute, implicit reasoning is more inference‑efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT‑2 from scratch on a curated multi‑step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi‑step tasks. Our findings reveal: 1) Language models can perform step‑by‑step reasoning and achieve high accuracy in both in‑domain and out‑of‑domain tests via implicit reasoning. However, this capability only emerges when trained on fixed‑pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed‑pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state‑of‑the‑art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.

PaperID: 11305, https://arxiv.org/pdf/2503.07547.pdf

Authors: Nina Moorman, Michelle Zhao, Matthew B. Luebbers, Sanne Van Waveren, Reid Simmons, Henny Admoni, Sonia Chernova, Matthew Gombolay

Title: Bi-Directional Mental Model Reconciliation for Human-Robot Interaction with Large Language Models

Abstract:
In human‑robot interactions, human and robot agents maintain internal mental models of their environment, their shared task, and each other. The accuracy of these representations depends on each agent's ability to perform theory of mind, i.e. to understand the knowledge, preferences, and intentions of their teammate. When mental models diverge to the extent that it affects task execution, reconciliation becomes necessary to prevent the degradation of interaction. We propose a framework for bi‑directional mental model reconciliation, leveraging large language models to facilitate alignment through semi‑structured natural language dialogue. Our framework relaxes the assumption of prior model reconciliation work that either the human or robot agent begins with a correct model for the other agent to align to. Through our framework, both humans and robots are able to identify and communicate missing task‑relevant context during interaction, iteratively progressing toward a shared mental model.

PaperID: 11306, https://arxiv.org/pdf/2503.07457.pdf

Authors: Florian Kandra, Vera Demberg, Alexander Koller

Title: LLMs syntactically adapt their language use to their conversational partner

Abstract:
It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

PaperID: 11307, https://arxiv.org/pdf/2503.07397.pdf

Authors: Kha Vo, Chin-Teng Lin

Title: Q-MARL: A quantum-inspired algorithm using neural message passing for large-scale multi-agent reinforcement learning

Abstract:
Inspired by a graph‑based technique for predicting molecular properties in quantum chemistry ‑‑ atoms' position within molecules in three‑dimensional space ‑‑ we present Q‑MARL, a completely decentralised learning architecture that supports very large‑scale multi‑agent reinforcement learning scenarios without the need for strong assumptions like common rewards or agent order. The key is to treat each agent as relative to its surrounding agents in an environment that is presumed to change dynamically. Hence, in each time step, an agent is the centre of its own neighbourhood and also a neighbour to many other agents. Each role is formulated as a sub‑graph, and each sub‑graph is used as a training sample. A message‑passing neural network supports full‑scale vertex and edge interaction within a local neighbourhood, while a parameter governing the depth of the sub‑graphs eases the training burden. During testing, an agent's actions are locally ensembled across all the sub‑graphs that contain it, resulting in robust decisions. Where other approaches struggle to manage 50 agents, Q‑MARL can easily marshal thousands. A detailed theoretical analysis proves improvement and convergence, and simulations with the typical collaborative and competitive scenarios show dramatically faster training speeds and reduced training losses.

PaperID: 11308, https://arxiv.org/pdf/2503.07376.pdf

Authors: Grik Tadevosyan, Valerii Serpiva, Aleksey Fedoseev, Roohan Ahmed Khan, Demetros Aschu, Faryal Batool, Nickolay Efanov, Artem Mikhaylov, Dzmitry Tsetserukou

Title: AttentionSwarm: Reinforcement Learning with Attention Control Barier Function for Crazyflie Drones in Dynamic Environments

Abstract:
We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control in a dynamic drone racing scenario. Central to our approach is the Attention Model‑Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety‑critical control theory to enable real‑time collision avoidance and trajectory optimization. This framework dynamically prioritizes critical obstacles and agents in the swarm's vicinity using attention weights, while CBFs formally guarantee safety by enforcing collision‑free constraints. The AttentionSwarm algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves a 95‑100% collision‑free navigation rate in a dynamic multi‑agent drone racing environment, underscoring its effectiveness and robustness in real‑world scenarios. This work offers a promising foundation for safe, high‑speed multi‑robot applications in logistics, inspection, and racing.

PaperID: 11309, https://arxiv.org/pdf/2503.07323.pdf

Authors: Yubo Zhao, Qi Wu, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang

Title: Navigating Motion Agents in Dynamic and Cluttered Environments through LLM Reasoning

Abstract:
This paper advances motion agents empowered by large language models (LLMs) toward autonomous navigation in dynamic and cluttered environments, significantly surpassing first and recent seminal but limited studies on LLM's spatial reasoning, where movements are restricted in four directions in simple, static environments in the presence of only single agents much less multiple agents. Specifically, we investigate LLMs as spatial reasoners to overcome these limitations by uniformly encoding environments (e.g., real indoor floorplans), agents which can be dynamic obstacles and their paths as discrete tokens akin to language tokens. Our training‑free framework supports multi‑agent coordination, closed‑loop replanning, and dynamic obstacle avoidance without retraining or fine‑tuning. We show that LLMs can generalize across agents, tasks, and environments using only text‑based interactions, opening new possibilities for semantically grounded, interactive navigation in both simulation and embodied systems.

PaperID: 11310, https://arxiv.org/pdf/2503.07320.pdf

Authors: Guanxuan Jiang, Shirao Yang, Yuyang Wang, Pan Hui

Title: When Trust Collides: Decoding Human-LLM Cooperation Dynamics through the Prisoner's Dilemma

Abstract:
As large language models (LLMs) become increasingly capable of autonomous decision‑making, they introduce new challenges and opportunities for human‑AI cooperation in mixed‑motive contexts. While prior research has primarily examined AI in assistive or cooperative roles, little is known about how humans interact with AI agents perceived as independent and strategic actors. This study investigates human cooperative attitudes and behaviors toward LLM agents by engaging 30 participants (15 males, 15 females) in repeated Prisoner's Dilemma games with agents differing in declared identity: purported human, rule‑based AI, and LLM agent. Behavioral metrics, including cooperation rate, decision latency, unsolicited cooperative acts and trust restoration tolerance, were analyzed to assess the influence of agent identity and participant gender. Results revealed significant effects of declared agent identity on most cooperation‑related behaviors, along with notable gender differences in decision latency. Furthermore, qualitative responses suggest that these behavioral differences were shaped by participants interpretations and expectations of the agents. These findings contribute to our understanding of human adaptation in competitive cooperation with autonomous agents and underscore the importance of agent framing in shaping effective and ethical human‑AI interaction.

PaperID: 11311, https://arxiv.org/pdf/2503.07306.pdf

Authors: Luyi Jiang, Jiayuan Chen, Lu Lu, Xinwei Peng, Lihao Liu, Junjun He, Jie Xu

Title: Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies

Abstract:
The evaluation and improvement of medical large language models (LLMs) are critical for their real‑world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain‑specific error patterns or address cross‑modal challenges. This study introduces a granular error taxonomy through systematic analysis of top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffled. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi‑step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge‑augmented retrieval to hybrid neuro‑symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error‑driven insights, ultimately advancing the safety and trustworthiness of AI in high‑stakes medical environments.

PaperID: 11312, https://arxiv.org/pdf/2503.07279.pdf

Authors: Xin Wang, Stephanie Tulk Jesso, Sadamori Kojaku, David M Neyens, Min Sun Kim

Title: VizTrust: A Visual Analytics Tool for Capturing User Trust Dynamics in Human-AI Communication

Abstract:
Trust plays a fundamental role in shaping the willingness of users to engage and collaborate with artificial intelligence (AI) systems. Yet, measuring user trust remains challenging due to its complex and dynamic nature. While traditional survey methods provide trust levels for long conversations, they fail to capture its dynamic evolution during ongoing interactions. Here, we present VizTrust, which addresses this challenge by introducing a real‑time visual analytics tool that leverages a multi‑agent collaboration system to capture and analyze user trust dynamics in human‑agent communication. Built on established human‑computer trust scales‑competence, integrity, benevolence, and predictability‑, VizTrust enables stakeholders to observe trust formation as it happens, identify patterns in trust development, and pinpoint specific interaction elements that influence trust. Our tool offers actionable insights into human‑agent trust formation and evolution in real time through a dashboard, supporting the design of adaptive conversational agents that responds effectively to user trust signals.

PaperID: 11313, https://arxiv.org/pdf/2503.07246.pdf

Authors: Tommaso Zaccherini, Siyuan Liu, Dimos V. Dimarogonas

Title: Communication-aware Multi-agent Systems Control Based on $k$-hop Distributed Observers

Abstract:
We propose a distributed control strategy to allow the control of a multi‑agent system requiring k‑hop interactions based on the design of distributed state and input observers. In particular, we design for each agent a finite time convergent state and input observer that exploits only the communication with the 1‑hop neighbors to reconstruct the information regarding those agents at a 2‑hop distance or more. We then demonstrate that if the k‑hop based control strategy is set‑Input to State Stable with respect to the set describing the goal, then the observer information can be adopted to achieve the team objective with stability guarantees.

PaperID: 11314, https://arxiv.org/pdf/2503.07170.pdf

Authors: Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng

Title: DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation

Abstract:
Long‑form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine‑grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine‑grained annotated dataset for long‑form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain‑specific knowledge with multi‑level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi‑agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retreiver, Q&A Annotator and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine‑tuned the Qwen2‑7b‑Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset publicly available to facilitate future research.

PaperID: 11315, https://arxiv.org/pdf/2503.07140.pdf

Authors: Liwei Yang, Xinying Wang, Xiaotang Zhou, Zhengchao Wu, Ningning Tan

Title: Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis

Abstract:
Implicit sentiment analysis aims to uncover emotions that are subtly expressed, often obscured by ambiguity and figurative language. To accomplish this task, large language models and multi‑step reasoning are needed to identify those sentiments that are not explicitly stated. In this study, we propose a novel Dual Reverse Chain Reasoning (DRCR) framework to enhance the performance of implicit sentiment analysis. Inspired by deductive reasoning, the framework consists of three key steps: 1) hypothesize an emotional polarity and derive a reasoning process, 2) negate the initial hypothesis and derive a new reasoning process, and 3) contrast the two reasoning paths to deduce the final sentiment polarity. Building on this, we also introduce a Triple Reverse Chain Reasoning (TRCR) framework to address the limitations of random hypotheses. Both methods combine contrastive mechanisms and multi‑step reasoning, significantly improving the accuracy of implicit sentiment classification. Experimental results demonstrate that both approaches outperform existing methods across various model scales, achieving state‑of‑the‑art performance. This validates the effectiveness of combining contrastive reasoning and multi‑step reasoning for implicit sentiment analysis.

PaperID: 11316, https://arxiv.org/pdf/2503.07044.pdf

Authors: Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang

Title: DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation

Abstract:
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over‑reliance on state‑of‑the‑art (SOTA) LLMs. We introduce DatawiseAgent, a notebook‑centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi‑stage architecture based on finite‑state transducers (FSTs). This design enables flexible long‑horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring the robustness and scalability.

PaperID: 11317, https://arxiv.org/pdf/2503.07010.pdf

Authors: Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, Tianrun Gao

Title: ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation

Abstract:
Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks lack the ability to automatically evaluate from users' perspective, and also lack the explainability of the results of LLM agents' code generation capabilities. Thus, we introduce ProjectEval, a new benchmark for LLM agents project‑level code generation's automated evaluation by simulating user interaction. ProjectEval is constructed by LLM with human reviewing. It has three different level inputs of natural languages or code skeletons. ProjectEval can evaluate the generated projects by user interaction simulation for execution, and by code similarity through existing objective indicators. Through ProjectEval, we find that systematic engineering project code, overall understanding of the project and comprehensive analysis capability are the keys for LLM agents to achieve practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real‑world production.

PaperID: 11318, https://arxiv.org/pdf/2503.06994.pdf

Authors: Lei Zhang, Mukesh Ghimire, Wenlong Zhang, Zhe Xu, Yi Ren

Title: Parametric Value Approximation for General-sum Differential Games with State Constraints

Abstract:
General‑sum differential games can approximate values solved by Hamilton‑Jacobi‑Isaacs (HJI) equations for efficient inference when information is incomplete. However, solving such games through conventional methods encounters the curse of dimensionality (CoD). Physics‑informed neural networks (PINNs) offer a scalable approach to alleviate the CoD and approximate values, but there exist convergence issues for value approximations through vanilla PINNs when state constraints lead to values with large Lipschitz constants, particularly in safety‑critical applications. In addition to addressing CoD, it is necessary to learn a generalizable value across a parametric space of games, rather than training multiple ones for each specific player‑type configuration. To overcome these challenges, we propose a Hybrid Neural Operator (HNO), which is an operator that can map parameter functions for games to value functions. HNO leverages informative supervised data and samples PDE‑driven data across entire spatial‑temporal space for model refinement. We evaluate HNO on 9D and 13D scenarios with nonlinear dynamics and state constraints, comparing it against a Supervised Neural Operator (a variant of DeepONet). Under the same computational budget and training data, HNO outperforms SNO for safety performance. This work provides a step toward scalable and generalizable value function approximation, enabling real‑time inference for complex human‑robot or multi‑agent interactions.

PaperID: 11319, https://arxiv.org/pdf/2503.06951.pdf

Authors: Xinjie Zhao, Fan Gao, Xingyu Song, Yingjian Chen, Rui Yang, Yanran Fu, Yuyang Wang, Yusuke Iwasawa, Yutaka Matsuo, Irene Li

Title: ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA

Abstract:
Recent advances in large language models (LLMs) have significantly improved multi‑hop question answering (QA) through direct Chain‑of‑Thought (CoT) reasoning. However, the irreversible nature of CoT leads to error accumulation, making it challenging to correct mistakes in multi‑hop reasoning. This paper introduces ReAgent: a Reversible multi‑Agent collaborative framework augmented with explicit backtracking mechanisms, enabling reversible multi‑hop reasoning. By incorporating text‑based retrieval, information aggregation and validation, our system can detect and correct errors mid‑reasoning, leading to more robust and interpretable QA outcomes. The framework and experiments serve as a foundation for future work on error‑tolerant QA systems. Empirical evaluations across three benchmarks indicate ReAgent's efficacy, yielding average about 6% improvements against baseline models.

PaperID: 11320, https://arxiv.org/pdf/2503.06812.pdf

Authors: Julian Alfredo Mendez, Timotheus Kampik

Title: Can Proof Assistants Verify Multi-Agent Systems?

Abstract:
This paper presents the Soda language for verifying multi‑agent systems. Soda is a high‑level functional and object‑oriented language that supports the compilation of its code not only to Scala, a strongly statically typed high‑level programming language, but also to Lean, a proof assistant and programming language. Given these capabilities, Soda can implement multi‑agent systems, or parts thereof, that can then be integrated into a mainstream software ecosystem on the one hand and formally verified with state‑of‑the‑art tools on the other hand. We provide a brief and informal introduction to Soda and the aforementioned interoperability capabilities, as well as a simple demonstration of how interaction protocols can be designed and verified with Soda. In the course of the demonstration, we highlight challenges with respect to real‑world applicability.

PaperID: 11321, https://arxiv.org/pdf/2503.06747.pdf

Authors: Diego Bolliger, Lorenz Zauter, Robert Ziegler

Title: Fully-Decentralized MADDPG with Networked Agents

Abstract:
In this paper, we devise three actor‑critic algorithms with decentralized training for multi‑agent reinforcement learning in cooperative, adversarial, and mixed settings with continuous action spaces. To this goal, we adapt the MADDPG algorithm by applying a networked communication approach between agents. We introduce surrogate policies in order to decentralize the training while allowing for local communication during training. The decentralized algorithms achieve comparable results to the original MADDPG in empirical tests, while reducing computational cost. This is more pronounced with larger numbers of agents.

PaperID: 11322, https://arxiv.org/pdf/2503.06745.pdf

Authors: Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, Roy Abitbol

Title: Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

Abstract:
The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges with observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the non‑deterministic, context‑sensitive, and dynamic nature of these systems. This paper explores key challenges and opportunities in analyzing and optimizing agentic systems across development, testing, and maintenance. We explore critical issues such as natural language variability and unpredictable execution flows, which hinder predictability and control, demanding adaptive strategies to manage input variability and evolving behaviors. Through our user study, we supported these hypotheses. In particular, we showed a 79% agreement that non deterministic flow of agentic systems acts as a major challenge. Finally, we validated our statements empirically advocating the need for moving beyond classical benchmarking. To bridge these gaps, we introduce taxonomies to present expected analytics outcomes and the ways to collect them by extending standard observability frameworks. Building on these foundations, we introduce and demonstrate novel approach for benchmarking of agent evaluation systems. Unlike traditional "black box" performance evaluation approaches, our benchmark is built from agent runtime logs as input, and analytics outcome including discovered flows and issues. By addressing key limitations in existing methodologies, we aim to set the stage for more advanced and holistic evaluation strategies, which could foster the development of adaptive, interpretable, and robust agentic AI systems.

PaperID: 11323, https://arxiv.org/pdf/2503.06709.pdf

Authors: Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu

Title: Delusions of Large Language Models

Abstract:
Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.

PaperID: 11324, https://arxiv.org/pdf/2503.06664.pdf

Authors: Tommaso Bendinelli, Artur Dox, Christian Holz

Title: Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

Abstract:
High‑quality, error‑free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources that can severely degrade model performance. Detecting and correcting these issues typically require tailor‑made solutions and demand extensive domain expertise. Consequently, automation is challenging, rendering the process labor‑intensive and tedious. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning. We set up an experiment in which an LLM, paired with Python, is tasked with cleaning the training dataset to improve the performance of a learning algorithm without having the ability to modify the training pipeline or perform any feature engineering. We run this experiment on multiple Kaggle datasets that have been intentionally corrupted with errors. Our results show that LLMs can identify and correct erroneous entries, such as illogical values or outlier, by leveraging contextual information from other features within the same row, as well as feedback from previous iterations. However, they struggle to detect more complex errors that require understanding data distribution across multiple rows, such as trends and biases.

PaperID: 11325, https://arxiv.org/pdf/2503.06579.pdf

Authors: George Chacko, Minhyuk Park, Vikram Ramavarapu, Ananth Grama, Pablo Robles-Granda, Tandy Warnow

Title: An Agent-based Model of Citation Behavior

Abstract:
Whether citations can be objectively and reliably used to measure productivity and scientific quality of articles and researchers can, and should, be vigorously questioned. However, citations are widely used to estimate the productivity of researchers and institutions, effectively creating a 'grubby' motivation to be well‑cited. We model citation growth, and this grubby interest using an agent‑based model (ABM) of network growth. In this model, each new node (article) in a citation network is an autonomous agent that cites other nodes based on a 'citation personality' consisting of a composite bias for locality, preferential attachment, recency, and fitness. We ask whether strategic citation behavior (reference selection) by the author of a scientific article can boost subsequent citations to it. Our study suggests that fitness and, to a lesser extent, out_degree and locality effects are influential in capturing citations, which raises questions about similar effects in the real world.

PaperID: 11326, https://arxiv.org/pdf/2503.06567.pdf

Authors: Yao Cheng, Yibo Zhao, Jiapeng Zhu, Yao Liu, Xing Sun, Xiang Li

Title: Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

Abstract:
Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi‑step reasoning. In this work, we propose CogGRAG, a human cognition‑inspired, graph‑based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree‑structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self‑consistent verification across reasoning paths. The framework operates in three stages: (1) top‑down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom‑up reasoning with dual‑process self‑verification. Unlike previous tree‑based decomposition methods such as MindMap or Graph‑CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph‑structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.

PaperID: 11327, https://arxiv.org/pdf/2503.06532.pdf

Authors: Chuang Li, Lanfang Huang, Dian He, Yanhua Wen, Gang Liu, Lixin Duan

Title: FaaSMT: Lightweight Serverless Framework for Intrusion Detection Using Merkle Tree and Task Inlining

Abstract:
The serverless platform aims to facilitate cloud applications' straightforward deployment, scaling, and management. Unfortunately, the distributed nature of serverless computing makes it difficult to port traditional security tools directly. The existing serverless solutions primarily identify potential threats or performance bottlenecks through post‑analysis of modified operating system audit logs, detection of encrypted traffic offloading, or the collection of runtime metrics. However, these methods often prove inadequate for comprehensively detecting communication violations across functions. This limitation restricts the real‑time log monitoring and validation capabilities in distributed environments while impeding the maintenance of minimal communication overhead. Therefore, this paper presents FaaSMT, which aims to fill this gap by addressing research questions related to security checks and the optimization of performance and costs in serverless applications. This framework employs parallel processing for the collection of distributed data logs, incorporating Merkle Tree algorithms and heuristic optimisation methods to achieve adaptive inline security task execution. The results of experimental trials demonstrate that FaaSMT is capable of effectively identifying major attack types (e.g., Denial of Wallet (DoW) and Business Logic attacks), thereby providing comprehensive monitoring and validation of function executions while significantly reducing performance overhead.

PaperID: 11328, https://arxiv.org/pdf/2503.06514.pdf

Authors: Haoqiang Kang, Enna Sachdeva, Piyush Gupta, Sangjae Bae, Kwonjoon Lee

Title: GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Abstract:
Vision‑Language Models (VLMs) have recently shown promising advancements in sequential decision‑making tasks through task‑specific fine‑tuning. However, common fine‑tuning methods, such as Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi‑step reasoning tasks. To address these challenges, we introduce a novel framework, GFlowVLM, a framework that fine‑tune VLMs using Generative Flow Networks (GFlowNets) to promote generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non‑Markovian decision process, allowing it to capture long‑term dependencies essential for real‑world applications. It takes observations and task descriptions as inputs to prompt chain‑of‑thought (CoT) reasoning which subsequently guides action selection. We use task based rewards to fine‑tune VLM with GFlowNets. This approach enables VLMs to outperform prior fine‑tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in‑distribution and out‑of‑distribution scenarios.

PaperID: 11329, https://arxiv.org/pdf/2503.06443.pdf

Authors: Dongyu Chen, Tao Deng, Juncheng Jia, Siwei Feng, Di Yuan

Title: Mobility-Aware Decentralized Federated Learning with Joint Optimization of Local Iteration and Leader Selection for Vehicular Networks

Abstract:
Federated learning (FL) emerges as a promising approach to empower vehicular networks, composed by intelligent connected vehicles equipped with advanced sensing, computing, and communication capabilities. While previous studies have explored the application of FL in vehicular networks, they have largely overlooked the intricate challenges arising from the mobility of vehicles and resource constraints. In this paper, we propose a framework of mobility‑aware decentralized federated learning (MDFL) for vehicular networks. In this framework, nearby vehicles train an FL model collaboratively, yet in a decentralized manner. We formulate a local iteration and leader selection joint optimization problem (LSOP) to improve the training efficiency of MDFL. For problem solving, we first reformulate LSOP as a decentralized partially observable Markov decision process (Dec‑POMDP), and then develop an effective optimization algorithm based on multi‑agent proximal policy optimization (MAPPO) to solve Dec‑POMDP. Finally, we verify the performance of the proposed algorithm by comparing it with other algorithms.

PaperID: 11330, https://arxiv.org/pdf/2503.06324.pdf

Authors: Kazuya Izumi, Shuhey Koyama, Yoichi Ochiai

Title: AnimeGaze: Real-Time Mutual Gaze Synthesis for Anime-Style Avatars in Physical Environments via Behind-Display Camera

Abstract:
Avatars on displays lack the ability to engage with the physical environment through gaze. To address this limitation, we propose a gaze synthesis method that enables animated avatars to establish gaze communication with the physical environment using a camera‑behind‑the‑display system. The system uses a display that rapidly alternates between visible and transparent states. During the transparent state, a camera positioned behind the display captures the physical environment. This configuration physically aligns the position of the avatar's eyes with the camera, enabling two‑way gaze communication with people and objects in the physical environment. Building on this system, we developed a framework for mutual gaze communication between avatars and people. The framework detects the user's gaze and dynamically synthesizes the avatar's gaze towards people or objects in the environment. This capability was integrated into an AI agent system to generate real‑time, context‑aware gaze behaviors during conversations, enabling more seamless and natural interactions. To evaluate the system, we conducted a user study to assess its effectiveness in supporting physical gaze awareness and generating human‑like gaze behaviors. The results show that the behind‑display approach significantly enhances the user's perception of being observed and attended to by the avatar. By bridging the gap between virtual avatars and the physical environment through enhanced gaze interactions, our system offers a promising avenue for more immersive and human‑like AI‑mediated communication in everyday environments.

PaperID: 11331, https://arxiv.org/pdf/2503.06323.pdf

Authors: Jack Foxabbott, Rohan Subramani, Francis Rhys Ward

Title: Higher-Order Belief in Incomplete Information MAIDs

Abstract:
Multi‑agent influence diagrams (MAIDs) are probabilistic graphical models which represent strategic interactions between agents. MAIDs are equivalent to extensive form games (EFGs) but have a more compact and informative structure. However, MAIDs cannot, in general, represent settings of incomplete information ‑‑ wherein agents have different beliefs about the game being played, and different beliefs about each‑other's beliefs. In this paper, we introduce incomplete information MAIDs (II‑MAIDs). We define both infinite and finite‑depth II‑MAIDs and prove an equivalence relation to EFGs with incomplete information and no common prior over types. We prove that II‑MAIDs inherit classical equilibria concepts via this equivalence, but note that these solution concepts are often unrealistic in the setting with no common prior because they violate common knowledge of rationality. We define a more realistic solution concept based on recursive best‑response. Throughout, we describe an example with a hypothetical AI agent undergoing evaluation to illustrate the applicability of II‑MAIDs.

PaperID: 11332, https://arxiv.org/pdf/2503.06167.pdf

Authors: Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

Title: Momentum-based Distributed Resource Scheduling Optimization Subject to Sector-Bound Nonlinearity and Latency

Abstract:
This paper proposes an accelerated consensus‑based distributed iterative algorithm for resource allocation and scheduling. The proposed gradient‑tracking algorithm introduces an auxiliary variable to add momentum towards the optimal state. We prove that this solution is all‑time feasible, implying that the coupling constraint always holds along the algorithm iterative procedure; therefore, the algorithm can be terminated at any time. This is in contrast to the ADMM‑based solutions that meet constraint feasibility asymptotically. Further, we show that the proposed algorithm can handle possible link nonlinearity due to logarithmically‑quantized data transmission (or any sign‑preserving odd sector‑bound nonlinear mapping). We prove convergence over uniformly‑connected dynamic networks (i.e., a hybrid setup) that may occur in mobile and time‑varying multi‑agent networks. Further, the latency issue over the network is addressed by proposing delay‑tolerant solutions. To our best knowledge, accelerated momentum‑based convergence, nonlinear linking, all‑time feasibility, uniform network connectivity, and handling (possible) time delays are not altogether addressed in the literature. These contributions make our solution practical in many real‑world applications.

PaperID: 11333, https://arxiv.org/pdf/2503.06047.pdf

Authors: Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao

Title: DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

Abstract:
Large language model (LLM)‑based agents are increasingly applied to complex strategic environments that demand long‑horizon reasoning, multi‑agent interaction, and decision‑making under uncertainty. However, common existing benchmarks either assess isolated skills, lack environmental diversity, or rely on broad overall metrics. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision‑making tasks. Firstly, it incorporates six complex strategic games which serve as ideal testbeds due to their long‑term and multi‑dimensional decision‑making demands and flexibility in customizing tasks with various difficulty levels and targets. Secondly, DSGBench employs a fine‑grained evaluation scoring system which examines the decision‑making capabilities by looking into the performance in five specific dimensions, offering a comprehensive assessment in a better‑designed fashion. Furthermore, DSGBench also incorporates an automated decision‑tracking mechanism which enables in‑depth analysis of agent behaviour patterns and the turning points in their strategies. We evaluate six popular LLM agents, including open‑source and closed‑source models, and observe distinct strengths and limitations among various tasks. Through decision trajectory analysis, we further identify systemic limitations in different LLMs. These findings offer valuable insights for model selection and future LLM‑based agent development.

PaperID: 11334, https://arxiv.org/pdf/2503.06037.pdf

Authors: Zhiyu Zhao, Haifeng Zhang

Title: Vairiational Stochastic Games

Abstract:
The Control as Inference (CAI) framework has successfully transformed single‑agent reinforcement learning (RL) by reframing control tasks as probabilistic inference problems. However, the extension of CAI to multi‑agent, general‑sum stochastic games (SGs) remains underexplored, particularly in decentralized settings where agents operate independently without centralized coordination. In this paper, we propose a novel variational inference framework tailored to decentralized multi‑agent systems. Our framework addresses the challenges posed by non‑stationarity and unaligned agent objectives, proving that the resulting policies form an ε‑Nash equilibrium. Additionally, we demonstrate theoretical convergence guarantees for the proposed decentralized algorithms. Leveraging this framework, we instantiate multiple algorithms to solve for Nash equilibrium, mean‑field Nash equilibrium, and correlated equilibrium, with rigorous theoretical convergence analysis.

PaperID: 11335, https://arxiv.org/pdf/2503.05970.pdf

Authors: Talha Bozkus, Urbashi Mitra

Title: Partially Decentralized Multi-Agent Q-Learning via Digital Cousins for Wireless Networks

Abstract:
Q‑learning is a widely used reinforcement learning (RL) algorithm for optimizing wireless networks, but faces challenges with large state‑spaces. Recently proposed multi‑environment mixed Q‑learning (MEMQ) algorithm addresses these challenges by employing multiple Q‑learning algorithms across multiple synthetically generated, distinct but structurally related environments, so‑called digital cousins. In this paper, we propose a novel multi‑agent MEMQ (M‑MEMQ) for cooperative decentralized wireless networks with multiple networked transmitters (TXs) and base stations (BSs). TXs do not have access to global information (joint state and actions). The new concept of coordinated and uncoordinated states is introduced. In uncoordinated states, TXs act independently to minimize their individual costs and update local Q‑functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update the joint Q‑functions. The cost of information‑sharing scales linearly with the number of TXs and is independent of the joint state‑action space size. Several theoretical guarantees, including deterministic and probabilistic convergence, bounds on estimation error variance, and the probability of misdetecting the joint states, are given. Numerical simulations show that M‑MEMQ outperforms several decentralized and centralized training with decentralized execution (CTDE) multi‑agent RL algorithms by achieving 60% lower average policy error (APE), 40% faster convergence, 45% reduced runtime complexity, and 40% less sample complexity. Furthermore, M‑MEMQ achieves comparable APE with significantly lower complexity than centralized methods. Simulations validate the theoretical analyses.

PaperID: 11336, https://arxiv.org/pdf/2503.05944.pdf

Authors: Julie Michelman, Nasrin Baratalipour, Matthew Abueg

Title: Enhancing Reasoning with Collaboration and Memory

Abstract:
We envision a continuous collaborative learning system where groups of LLM agents work together to solve reasoning problems, drawing on memory they collectively build to improve performance as they gain experience. This work establishes the foundations for such a system by studying the interoperability of chain‑of‑thought reasoning styles, multi‑agent collaboration, and memory banks. Extending beyond the identical agents of self‑consistency, we introduce varied‑context agents with diverse exemplars and a summarizer agent in place of voting. We generate frozen and continuously learned memory banks of exemplars and pair them with fixed, random, and similarity‑based retrieval mechanisms. Our systematic study reveals where various methods contribute to reasoning performance of two LLMs on three grounded reasoning tasks, showing that random exemplar selection can often beat more principled approaches, and in some tasks, inclusion of any exemplars serves only to distract both weak and strong models.

PaperID: 11337, https://arxiv.org/pdf/2503.05930.pdf

Authors: Jianfeng Gu, Hao Wang, Xiaorang Guo, Martin Schulz, Michael Gerndt

Title: VersaSlot: Efficient Fine-grained FPGA Sharing with Big.Little Slots and Live Migration in FPGA Cluster

Abstract:
As FPGAs gain popularity for on‑demand application acceleration in data center computing, dynamic partial reconfiguration (DPR) has become an effective fine‑grained sharing technique for FPGA multiplexing. However, current FPGA sharing encounters partial reconfiguration contention and task execution blocking problems introduced by the DPR, which significantly degrade application performance. In this paper, we propose VersaSlot, an efficient spatio‑temporal FPGA sharing system with novel Big.Little slot architecture that can effectively resolve the contention and task blocking while improving resource utilization. For the heterogeneous Big.Little architecture, we introduce an efficient slot allocation and scheduling algorithm, along with a seamless cross‑board switching and live migration mechanism, to maximize FPGA multiplexing across the cluster. We evaluate the VersaSlot system on an FPGA cluster composed of the latest Xilinx UltraScale+ FPGAs (ZCU216) and compare its performance against four existing scheduling algorithms. The results demonstrate that VersaSlot achieves up to 13.66x lower average response time than the traditional temporal FPGA multiplexing, and up to 2.19x average response time improvement over the state‑of‑the‑art spatio‑temporal sharing systems. Furthermore, VersaSlot enhances the LUT and FF resource utilization by 35% and 29% on average, respectively.

PaperID: 11338, https://arxiv.org/pdf/2503.05919.pdf

Authors: Eric Zhao, Pranjal Awasthi, Nika Haghtalab

Title: From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning

Abstract:
Finetuning provides a scalable and cost‑effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in‑context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large‑scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question‑answer training data formats provide much stronger knowledge generalization than document/article‑style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi‑step reasoning even when trained on similar examples ‑‑ all factors that render "knowledge injection" to be especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real‑world event than information about what a model's writing style should be.

PaperID: 11339, https://arxiv.org/pdf/2503.05856.pdf

Authors: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic

Title: This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

Abstract:
Mixture of large language model (LLMs) Agents (MoA) architectures achieve state‑of‑the‑art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1‑70B model achieves a length‑controlled Win Rate (LC WR) of 49.2% when coupled with 3‑layer MoA (6 LLM agents). However, we demonstrate that introducing only a single carefully‑instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple‑choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

PaperID: 11340, https://arxiv.org/pdf/2503.05854.pdf

Authors: Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Antonia Anna Jost, Nikolay Koldunov

Title: Accelerating Earth Science Discovery via Multi-Agent LLM Systems

Abstract:
This Perspective explores the transformative potential of Multi‑Agent Systems (MAS) powered by Large Language Models (LLMs) in the geosciences. Users of geoscientific data repositories face challenges due to the complexity and diversity of data formats, inconsistent metadata practices, and a considerable number of unprocessed datasets. MAS possesses transformative potential for improving scientists' interaction with geoscientific data by enabling intelligent data processing, natural language interfaces, and collaborative problem‑solving capabilities. We illustrate this approach with "PANGAEA GPT", a specialized MAS pipeline integrated with the diverse PANGAEA database for Earth and Environmental Science, demonstrating how MAS‑driven workflows can effectively manage complex datasets and accelerate scientific discovery. We discuss how MAS can address current data challenges in geosciences, highlight advancements in other scientific fields, and propose future directions for integrating MAS into geoscientific data processing pipelines. In this Perspective, we show how MAS can fundamentally improve data accessibility, promote cross‑disciplinary collaboration, and accelerate geoscientific discoveries.

PaperID: 11341, https://arxiv.org/pdf/2503.05805.pdf

Authors: Dom Huh, Prasant Mohapatra

Title: Multi-agent Auto-Bidding with Latent Graph Diffusion Models

Abstract:
This paper proposes a diffusion‑based auto‑bidding framework that leverages graph representations to model large‑scale auction environments. In such settings, agents must dynamically optimize bidding strategies under constraints defined by key performance indicator (KPI) metrics, all while operating in competitive environments characterized by uncertain, sparse, and stochastic variables. To address these challenges, we introduce a novel approach combining learnable graph‑based embeddings with a planning‑based latent diffusion model (LDM). By capturing patterns and nuances underlying the interdependence of impression opportunities and the multi‑agent dynamics of the auction environment, the graph representation enable expressive computations regarding auto‑bidding outcomes. With reward alignment techniques, the LDM's posterior is fine‑tuned to generate auto‑bidding trajectories that maximize KPI metrics while satisfying constraint thresholds. Empirical evaluations on both real‑world and synthetic auction environments demonstrate significant improvements in auto‑bidding performance across multiple common KPI metrics, as well as accuracy in forecasting auction outcomes.

PaperID: 11342, https://arxiv.org/pdf/2503.05748.pdf

Authors: Krti Tallam

Title: Alignment, Agency and Autonomy in Frontier AI: A Systems Engineering Perspective

Abstract:
As artificial intelligence scales, the concepts of alignment, agency, and autonomy have become central to AI safety, governance, and control. However, even in human contexts, these terms lack universal definitions, varying across disciplines such as philosophy, psychology, law, computer science, mathematics, and political science. This inconsistency complicates their application to AI, where differing interpretations lead to conflicting approaches in system design and regulation. This paper traces the historical, philosophical, and technical evolution of these concepts, emphasizing how their definitions influence AI development, deployment, and oversight. We argue that the urgency surrounding AI alignment and autonomy stems not only from technical advancements but also from the increasing deployment of AI in high‑stakes decision making. Using Agentic AI as a case study, we examine the emergent properties of machine agency and autonomy, highlighting the risks of misalignment in real‑world systems. Through an analysis of automation failures (Tesla Autopilot, Boeing 737 MAX), multi‑agent coordination (Metas CICERO), and evolving AI architectures (DeepMinds AlphaZero, OpenAIs AutoGPT), we assess the governance and safety challenges posed by frontier AI.

PaperID: 11343, https://arxiv.org/pdf/2503.05641.pdf

Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

Title: Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

Abstract:
Combining existing pre‑trained expert LLMs is a promising avenue for scalably tackling large‑scale and diverse tasks. However, selecting task‑level experts is often too coarse‑grained, as heterogeneous tasks may require different expertise per instance. To enable adaptive instance‑level mixing of pre‑trained LLM experts, we propose Symbolic‑MoE, a symbolic, text‑based, and gradient‑free Mixture‑of‑Experts framework. Symbolic‑MoE takes a fine‑grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill‑based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high‑quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic‑MoE's instance‑level expert selection improves performance by a large margin but ‑‑ when implemented naively ‑‑ can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi‑agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU‑Pro, GPQA, AIME, and MedMCQA), we show that Symbolic‑MoE beats strong LLMs like GPT4o‑mini, as well as multi‑agent approaches, with an absolute avg. gain of 8.15% over the best multi‑agent baseline. Moreover, Symbolic‑MoE generalizes well to unseen tasks and removes the need for expensive multi‑round discussions, outperforming discussion baselines with less computation.

PaperID: 11344, https://arxiv.org/pdf/2503.05473.pdf

Authors: Noah Mamie, Susie Xi Rao

Title: The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence

Abstract:
Multi‑agent systems address issues of accessibility and scalability of artificial intelligence (AI) foundation models, which are often represented by large language models. We develop a framework ‑ the "Society of HiveMind" (SOHM) ‑ that orchestrates the interaction between multiple AI foundation models, imitating the observed behavior of animal swarms in nature by following modern evolutionary theories. On the one hand, we find that the SOHM provides a negligible benefit on tasks that mainly require real‑world knowledge. On the other hand, we remark a significant improvement on tasks that require intensive logical reasoning, indicating that multi‑agent systems are capable of increasing the reasoning capabilities of the collective compared to the individual agents. Our findings demonstrate the potential of combining a multitude of diverse AI foundation models to form an artificial swarm intelligence capable of self‑improvement through interactions with a given environment.

PaperID: 11345, https://arxiv.org/pdf/2503.05439.pdf

Authors: Navdeep Kaur, Lachlan McPheat, Alessandra Russo, Anthony G Cohn, Pranava Madhyastha

Title: An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning

Abstract:
In this paper, we examine the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP) to enhance the performance of standard open‑weight LLMs on complex multi‑step reasoning tasks. Using the StepGame dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods, achieving substantial accuracy improvements across different levels of reasoning complexity. Additionally, the LLM‑as‑Judge metric enhances CLM's performance, especially in assessing structurally and logically correct ASP outputs. However, calibrating CLM with diverse calibration sets did not improve generalizability for tasks requiring much longer reasoning steps, indicating limitations in handling more complex tasks.

PaperID: 11346, https://arxiv.org/pdf/2503.05397.pdf

Authors: Sakharam Gawade, Shivam Akhouri, Chinmay Kulkarni, Jagdish Samant, Pragya Sahu, Aastik, Jai Pahal, Saswat Meher

Title: Multi Agent based Medical Assistant for Edge Devices

Abstract:
Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an ondevice, multi‑agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task‑specific agents to optimize resources, ensure scalability and high performance. Our proposed system acts as a one‑stop solution for health care needs with features like appointment booking, health monitoring, medication reminders, and daily health reporting. Powered by the Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an average RougeL score of 85.5 for planning and 96.5 for calling for our tasks while being lightweight for on‑device deployment. This innovative approach combines the benefits of ondevice systems with multi‑agent architectures, paving the way for user‑centric healthcare solutions.

PaperID: 11347, https://arxiv.org/pdf/2503.05242.pdf

Authors: Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, Mengyue Wu

Title: MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio

Abstract:
The rapid advancement of large language models (LLMs) and artificial intelligence‑generated content (AIGC) has accelerated AI‑native applications, such as AI‑based storybooks that automate engaging story production for children. However, challenges remain in improving story attractiveness, enriching storytelling expressiveness, and developing open‑source evaluation benchmarks and frameworks. Therefore, we propose and opensource MM‑StoryAgent, which creates immersive narrated video storybooks with refined plots, role‑consistent images, and multi‑channel audio. MM‑StoryAgent designs a multi‑agent framework that employs LLMs and diverse expert tools (generative models and APIs) across several modalities to produce expressive storytelling videos. The framework enhances story attractiveness through a multi‑stage writing pipeline. In addition, it improves the immersive storytelling experience by integrating sound effects with visual, music and narrative assets. MM‑StoryAgent offers a flexible, open‑source platform for further development, where generative modules can be substituted. Both objective and subjective evaluation regarding textual story quality and alignment between modalities validate the effectiveness of our proposed MM‑StoryAgent system. The demo and source code are available.

PaperID: 11348, https://arxiv.org/pdf/2503.05200.pdf

Authors: Pranshav Gajjar, Vijay K. Shah

Title: ORANSight-2.0: Foundational LLMs for O-RAN

Abstract:
Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O‑RAN) remains limited. This gap is primarily due to the absence of domain‑specific foundational models, with existing solutions often relying on general‑purpose LLMs that fail to address the unique challenges and technical intricacies of O‑RAN. To bridge this gap, we introduce ORANSight‑2.0 (O‑RAN Insights), a pioneering initiative to develop specialized foundational LLMs tailored for O‑RAN. Built on 18 models spanning five open‑source LLM frameworks ‑‑ Mistral, Qwen, Llama, Phi, and Gemma ‑‑ ORANSight‑2.0 fine‑tunes models ranging from 1B to 70B parameters, significantly reducing reliance on proprietary, closed‑source models while enhancing performance in O‑RAN‑specific tasks. At the core of ORANSight‑2.0 is RANSTRUCT, a novel Retrieval‑Augmented Generation (RAG)‑based instruction‑tuning framework that employs two LLM agents ‑‑ a Mistral‑based Question Generator and a Qwen‑based Answer Generator ‑‑ to create high‑quality instruction‑tuning datasets. The generated dataset is then used to fine‑tune the 18 pre‑trained open‑source LLMs via QLoRA. To evaluate ORANSight‑2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O‑RAN stack.

PaperID: 11349, https://arxiv.org/pdf/2503.05092.pdf

Authors: Adam Labiosa, Josiah P. Hanna

Title: Multi-Robot Collaboration through Reinforcement Learning and Abstract Simulation

Abstract:
Teams of people coordinate to perform complex tasks by forming abstract mental models of world and agent dynamics. The use of abstract models contrasts with much recent work in robot learning that uses a high‑fidelity simulator and reinforcement learning (RL) to obtain policies for physical robots. Motivated by this difference, we investigate the extent to which so‑called abstract simulators can be used for multi‑agent reinforcement learning (MARL) and the resulting policies successfully deployed on teams of physical robots. An abstract simulator models the robot's target task at a high‑level of abstraction and discards many details of the world that could impact optimal decision‑making. Policies are trained in an abstract simulator then transferred to the physical robot by making use of separately‑obtained low‑level perception and motion control modules. We identify three key categories of modifications to the abstract simulator that enable policy transfer to physical robots: simulation fidelity enhancements, training optimizations and simulation stochasticity. We then run an empirical study with extensive ablations to determine the value of each modification category for enabling policy transfer in cooperative robot soccer tasks. We also compare the performance of policies produced by our method with a well‑tuned non‑learning‑based behavior architecture from the annual RoboCup competition and find that our approach leads to a similar level of performance. Broadly we show that MARL can be use to train cooperative physical robot behaviors using highly abstract models of the world.

PaperID: 11350, https://arxiv.org/pdf/2503.05064.pdf

Authors: Qingxuan Jia, Guoqin Tang, Zeyuan Huang, Zixuan Hao, Ning Ji, Shihang, Yin, Gang Chen

Title: Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Abstract:
Vision‑Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While excelling at high‑level planning, existing VLM methods struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error‑correctable fine manipulation. Our method decomposes complex tasks into sub‑actions and maintains three key data structures: task memory structure, 2D topology graphs, and 3D spatial networks, achieving high‑precision spatial‑semantic fusion. These three components collectively accumulate and store critical information throughout task execution, providing rich context for our task‑oriented VLM interaction mechanism. This enables VLMs to dynamically adjust guidance based on real‑time feedback, generating precise action plans and facilitating step‑wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, significantly advancing robot intelligence for precision tasks.

PaperID: 11351, https://arxiv.org/pdf/2503.04998.pdf

Authors: Elena Wittemyer, Ananya Rao, Ian Abraham, Howie Choset

Title: Multi-Agent Ergodic Exploration under Smoke-Based, Time-Varying Sensor Visibility Constraints

Abstract:
In this work, we consider the problem of multi‑agent informative path planning (IPP) for robots whose sensor visibility continuously changes as a consequence of a time‑varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi‑agent drone search of a wildfire, where we use the time‑varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.

PaperID: 11352, https://arxiv.org/pdf/2503.04954.pdf

Authors: R. Spencer Hallyburton, Miroslav Pajic

Title: Security-Aware Sensor Fusion with MATE: the Multi-Agent Trust Estimator

Abstract:
Lacking security awareness, sensor fusion in systems with multi‑agent networks such as smart cities is vulnerable to attacks. To guard against recent threats, we design security‑aware sensor fusion that is based on the estimates of distributions over trust. Trust estimation can be cast as a hidden Markov model, and we solve it by mapping sensor data to trust pseudomeasurements (PSMs) that recursively update trust posteriors in a Bayesian context. Trust then feeds sensor fusion to facilitate trust‑weighted updates to situational awareness. Essential to security‑awareness are a novel field of view estimator, logic to map sensor data into PSMs, and the derivation of efficient Bayesian updates. We evaluate security‑aware fusion under attacks on agents using case studies and Monte Carlo simulation in the physics‑based Unreal Engine simulator, CARLA. A mix of novel and classical security‑relevant metrics show that our security‑aware fusion enables building trustworthy situational awareness even in hostile conditions.

PaperID: 11353, https://arxiv.org/pdf/2503.04830.pdf

Authors: Jingying Zeng, Hui Liu, Zhenwei Dai, Xianfeng Tang, Chen Luo, Samarth Varshney, Zhen Li, Qi He

Title: Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents

Abstract:
With the advancement of conversational large language models (LLMs), several LLM‑based Conversational Shopping Agents (CSA) have been developed to help customers smooth their online shopping. The primary objective in building an engaging and trustworthy CSA is to ensure the agent's responses about product factoids are accurate and factually grounded. However, two challenges remain. First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk spreading misinformation and diminishing customer trust. Second, without providing knowledge source attribution in CSA response, customers struggle to verify LLM‑generated information. To address both challenges, we present an easily productionized solution that enables a ''citation experience'' to our customers. We build auto‑evaluation metrics to holistically evaluate LLM's grounding and attribution capabilities, suggesting that citation generation paradigm substantially improves grounding performance by 13.83%. To deploy this capability at scale, we introduce Multi‑UX‑Inference system, which appends source citations to LLM outputs while preserving existing user experience features and supporting scalable inference. Large‑scale online A/B tests show that grounded CSA responses improves customer engagement by 3% ‑ 10%, depending on UX variations.

PaperID: 11354, https://arxiv.org/pdf/2503.04817.pdf

Authors: Roberto Balestri, Guglielmo Pescatore

Title: Multi-Agent System for AI-Assisted Extraction of Narrative Arcs in TV Series

Abstract:
Serialized TV shows are built on complex storylines that can be hard to track and evolve in ways that defy straightforward analysis. This paper introduces a multi‑agent system designed to extract and analyze these narrative arcs. Tested on the first season of Grey's Anatomy (ABC 2005‑), the system identifies three types of arcs: Anthology (self‑contained), Soap (relationship‑focused), and Genre‑Specific (strictly related to the series' genre). Episodic progressions of these arcs are stored in both relational and semantic (vectorial) databases, enabling structured analysis and comparison. To bridge the gap between automation and critical interpretation, the system is paired with a graphical interface that allows for human refinement using tools to enhance and visualize the data. The system performed strongly in identifying Anthology Arcs and character entities, but its reliance on textual paratexts (such as episode summaries) revealed limitations in recognizing overlapping arcs and subtler dynamics. This approach highlights the potential of combining computational and human expertise in narrative analysis. Beyond television, it offers promise for serialized written formats, where the narrative resides entirely in the text. Future work will explore the integration of multimodal inputs, such as dialogue and visuals, and expand testing across a wider range of genres to refine the system further.

PaperID: 11355, https://arxiv.org/pdf/2503.04787.pdf

Authors: Fei Wei, Yaliang Li, Bolin Ding

Title: Towards Anthropomorphic Conversational AI Part I: A Practical Framework

Abstract:
Large language models (LLMs), due to their advanced natural language capabilities, have seen significant success in applications where the user interface is usually a conversational artificial intelligence (AI) agent and engages the user through multi‑round conversations. However, many scenarios require the agents to exhibit stronger social and conversational intelligence and demonstrate more human‑like (anthropomorphic) reactions. This is an aspect that foundational LLMs have yet to fully address such that a single call of foundational models might be insufficient. To bridge this gap, we propose a two‑stage solution. In this work, we focus on the first stage, introducing a multi‑module framework designed to replicate the key aspects of human intelligence involved in conversations. This framework comprises thinking modules for reasoning, resource modules for managing knowledge and external information, and response modules for generating contextually appropriate interactions. With all the modules cooperating, the framework would empower the agents to provide a better human‑like conversation experience. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning, enabling AI to better capture human preferences. This stage is left for future work. In our experiments, volunteers engaged in over 3000 rounds of conversation with the same AI character powered by a standalone LLM and our framework which integrates the same LLM. A separate group of evaluators rated the conversation samples, revealing that our framework significantly enhanced the social and conversational intelligence, even without fine‑tuning the LLM.

PaperID: 11356, https://arxiv.org/pdf/2503.04773.pdf

Authors: Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li

Title: Invisible Walls in Cities: Designing LLM Agent to Predict Urban Segregation Experience with Social Media Content

Abstract:
Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user‑generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose a novel Large Language Model (LLM) agent to automate online review mining for segregation prediction. Specifically, we propose a reflective LLM coder to digest social media content into insights consistent with real‑world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning‑and‑EMbedding (RE'EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi‑channel features for segregation prediction. Experiments on real‑world data demonstrate that our agent substantially improves prediction accuracy, with a 22.79% elevation in R^2 and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook‑guided summaries provide cognitive gains for human participants in perceiving places of interest (POIs)' social inclusiveness. Our study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with Web technology.

PaperID: 11357, https://arxiv.org/pdf/2503.04679.pdf

Authors: Nathaniel Haynam, Adam Khoja, Dhruv Kumar, Vivek Myers, Erdem Bıyık

Title: Multi-Agent Inverse Q-Learning from Demonstrations

Abstract:
When reward functions are hand‑designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies in terms of the intended task objectives. In the single‑agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi‑agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non‑stationarity and variance that scales with multiple agents. As such, in multi‑agent general‑sum games, multi‑agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi‑Agent Marginal Q‑Learning from Demonstrations (MAMQL), a novel sample‑efficient framework for multi‑agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well‑motivated use of Boltzmann policies in the multi‑agent context. We identify a connection between optimal marginalized critics and single‑agent soft‑Q IRL, allowing us to apply a direct, simple optimization criterion from the single‑agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi‑agent methods in average reward, sample efficiency, and reward recovery by often more than 2‑5x. We make our code available at https://sites.google.com/view/mamql .

PaperID: 11358, https://arxiv.org/pdf/2503.04596.pdf

Authors: Xinyi Hou, Yanjie Zhao, Haoyu Wang

Title: LLM Applications: Current Paradigms and the Next Frontier

Abstract:
The development of large language models (LLMs) has given rise to four major application paradigms: LLM app stores, LLM agents, self‑hosted LLM services, and LLM‑powered devices. Each has its advantages but also shares common challenges. LLM app stores lower the barrier to development but lead to platform lock‑in; LLM agents provide autonomy but lack a unified communication mechanism; self‑hosted LLM services enhance control but increase deployment complexity; and LLM‑powered devices improve privacy and real‑time performance but are limited by hardware. This paper reviews and analyzes these paradigms, covering architecture design, application ecosystem, research progress, as well as the challenges and open problems they face. Based on this, we outline the next frontier of LLM applications, characterizing them through three interconnected layers: infrastructure, protocol, and application. We describe their responsibilities and roles of each layer and demonstrate how to mitigate existing fragmentation limitations and improve security and scalability. Finally, we discuss key future challenges, identify opportunities such as protocol‑driven cross‑platform collaboration and device integration, and propose a research roadmap for openness, security, and sustainability.

PaperID: 11359, https://arxiv.org/pdf/2503.04479.pdf

Authors: Ivan Milev, Mislav Balunović, Maximilian Baader, Martin Vechev

Title: ToolFuzz -- Automated Agent Tool Testing

Abstract:
Large Language Model (LLM) Agents leverage the advanced reasoning capabilities of LLMs in real‑world applications. To interface with an environment, these agents often rely on tools, such as web search or database APIs. As the agent provides the LLM with tool documentation along the user query, the completeness and correctness of this documentation is critical. However, tool documentation is often over‑, under‑, or ill‑specified, impeding the agent's accuracy. Standard software testing approaches struggle to identify these errors as they are expressed in natural language. Thus, despite its importance, there currently exists no automated method to test the tool documentation for agents. To address this issue, we present ToolFuzz, the first method for automated testing of tool documentations. ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. ToolFuzz can generate a large and diverse set of natural inputs, effectively finding tool description errors at a low false positive rate. Further, we present two straightforward prompt‑engineering approaches. We evaluate all three tool testing approaches on 32 common LangChain tools and 35 newly created custom tools and 2 novel benchmarks to further strengthen the assessment. We find that many publicly available tools suffer from underspecification. Specifically, we show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt‑engineering approaches, making it a key component for building reliable AI agents.

PaperID: 11360, https://arxiv.org/pdf/2503.04417.pdf

Authors: Felix Ocker, Stefan Menzel, Ahmed Sadik, Thiago Rios

Title: From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design

Abstract:
Creating digital models using Computer Aided Design (CAD) is a process that requires in‑depth expertise. In industrial product development, this process typically involves entire teams of engineers, spanning requirements engineering, CAD itself, and quality assurance. We present an approach that mirrors this team structure with a Vision Language Model (VLM)‑based Multi Agent System, with access to parametric CAD tooling and tool documentation. Combining agents for requirements engineering, CAD engineering, and vision‑based quality assurance, a model is generated automatically from sketches and/ or textual descriptions. The resulting model can be refined collaboratively in an iterative validation loop with the user. Our approach has the potential to increase the effectiveness of design processes, both for industry experts and for hobbyists who create models for 3D printing. We demonstrate the potential of the architecture at the example of various design tasks and provide several ablations that show the benefits of the architecture's individual components.

PaperID: 11361, https://arxiv.org/pdf/2503.04392.pdf

Authors: Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, Qingsong Wen

Title: AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

Abstract:
Large Language Model based multi‑agent systems are revolutionizing autonomous communication and collaboration, yet they remain vulnerable to security threats like unauthorized access and data breaches. To address this, we introduce AgentSafe, a novel framework that enhances MAS security through hierarchical information management and memory protection. AgentSafe classifies information by security levels, restricting sensitive data access to authorized agents. AgentSafe incorporates two components: ThreatSieve, which secures communication by verifying information authority and preventing impersonation, and HierarCache, an adaptive memory management system that defends against unauthorized access and malicious poisoning, representing the first systematic defense for agent memory. Experiments across various LLMs show that AgentSafe significantly boosts system resilience, achieving defense success rates above 80% under adversarial conditions. Additionally, AgentSafe demonstrates scalability, maintaining robust performance as agent numbers and information complexity grow. Results underscore effectiveness of AgentSafe in securing MAS and its potential for real‑world application.

PaperID: 11362, https://arxiv.org/pdf/2503.04308.pdf

Authors: Lukáš Gajdošech, Hassan Ali, Jan-Gerrit Habekost, Martin Madaras, Matthias Kerzel, Stefan Wermter

Title: Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks

Abstract:
Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open‑vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real‑world data from RGB‑D sensors that minimizes human effort. We propose an auto‑labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real‑world glass object dataset GlassNICOLDataset that was collected on the Neuro‑Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state‑of‑the‑art open‑vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human‑robot bartending scenario.

PaperID: 11363, https://arxiv.org/pdf/2503.04262.pdf

Authors: Vasilios Mavroudis, Gregory Palmer, Sara Farmer, Kez Smithson Whitehead, David Foster, Adam Price, Ian Miles, Alberto Caron, Stephen Pasteris

Title: Guidelines for Applying RL and MARL in Cybersecurity Applications

Abstract:
Reinforcement Learning (RL) and Multi‑Agent Reinforcement Learning (MARL) have emerged as promising methodologies for addressing challenges in automated cyber defence (ACD). These techniques offer adaptive decision‑making capabilities in high‑dimensional, adversarial environments. This report provides a structured set of guidelines for cybersecurity professionals and researchers to assess the suitability of RL and MARL for specific use cases, considering factors such as explainability, exploration needs, and the complexity of multi‑agent coordination. It also discusses key algorithmic approaches, implementation challenges, and real‑world constraints, such as data scarcity and adversarial interference. The report further outlines open research questions, including policy optimality, agent cooperation levels, and the integration of MARL systems into operational cybersecurity frameworks. By bridging theoretical advancements and practical deployment, these guidelines aim to enhance the effectiveness of AI‑driven cyber defence strategies.

PaperID: 11364, https://arxiv.org/pdf/2503.04202.pdf

Authors: Angelos Assos, Yuval Dagan, Nived Rajaraman

Title: Computational Intractability of Strategizing against Online Learners

Abstract:
Online learning algorithms are widely used in strategic multi‑agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long‑term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases ‑ such as structured auction settings or contract design ‑ no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless \mathsfP = \mathsfNP, no polynomial‑time optimizer can compute a near‑optimal strategy against a learner using a standard no‑regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an Ω(T) hardness bound, significantly strengthening previous work that only showed an additive Θ(1) impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play ‑ an algorithm that is not no‑regret ‑ we prove intractability for a widely used no‑regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game‑theoretic settings.

PaperID: 11365, https://arxiv.org/pdf/2503.04188.pdf

Authors: R. Patrick Xian, Qiming Cui, Stefan Bauer, Reza Abbasi-Asl

Title: Measuring temporal effects of agent knowledge by date-controlled tool use

Abstract:
Temporal progression is an integral part of knowledge accumulation and update. Web search is frequently adopted as grounding for agent knowledge, yet an improper configuration affects the quality of the agent's responses. Here, we assess the agent behavior using distinct date‑controlled tools (DCTs) as stress test to measure the knowledge variability of large language model (LLM) agents. We demonstrate the temporal effects of an LLM agent as a writing assistant, which uses web search to complete scientific publication abstracts. We show that the temporality of search engine translates into tool‑dependent agent performance but can be alleviated with base model choice and explicit reasoning instructions such as chain‑of‑thought prompting. Our results indicate that agent design and evaluations should take a dynamical view and implement measures to account for the temporal influence of external resources to ensure reliability.

PaperID: 11366, https://arxiv.org/pdf/2503.04126.pdf

Authors: Joshua Bird, Jan Blumenkamp, Amanda Prorok

Title: DVM-SLAM: Decentralized Visual Monocular Simultaneous Localization and Mapping for Multi-Agent Systems

Abstract:
Cooperative Simultaneous Localization and Mapping (C‑SLAM) enables multiple agents to work together in mapping unknown environments while simultaneously estimating their own positions. This approach enhances robustness, scalability, and accuracy by sharing information between agents, reducing drift, and enabling collective exploration of larger areas. In this paper, we present Decentralized Visual Monocular SLAM (DVM‑SLAM), the first open‑source decentralized monocular C‑SLAM system. By only utilizing low‑cost and light‑weight monocular vision sensors, our system is well suited for small robots and micro aerial vehicles (MAVs). DVM‑SLAM's real‑world applicability is validated on physical robots with a custom collision avoidance framework, showcasing its potential in real‑time multi‑agent autonomous navigation scenarios. We also demonstrate comparable accuracy to state‑of‑the‑art centralized monocular C‑SLAM systems. We open‑source our code and provide supplementary material online.

PaperID: 11367, https://arxiv.org/pdf/2503.04110.pdf

Authors: Juntong Chen, Jiang Wu, Jiajing Guo, Vikram Mohanty, Xueming Li, Jorge Piazentin Ono, Wenbin He, Liu Ren, Dongyu Liu

Title: InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions

Abstract:
The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data‑driven insights, yet significant challenges persist in accurately interpreting users' analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error‑prone, and time‑intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering, and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM‑driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.

PaperID: 11368, https://arxiv.org/pdf/2503.03961.pdf

Authors: William Merrill, Ashish Sabharwal

Title: A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

Abstract:
Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer's depth affects its expressive power. We address these questions by analyzing transformers whose depth can grow minimally with context length n. We show even highly uniform transformers with depth Θ(\log n) can express two important problems: recognizing regular languages, which captures state tracking abilities and was known to be expressible only by an unconventional, non‑uniform model of transformers, and graph connectivity, which underlies multi‑step reasoning. Notably, both of these problems cannot be expressed by fixed‑depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain‑of‑thought steps. Empirically, our detailed experiments designed to bridge the expressivity vs. learnability gap reveal that our theoretical depth requirements for regular language recognition closely match the practical depth requirements for successfully training transformers. Thus, our results clarify how depth affects a transformer's reasoning capabilities, and provide practical guidance for effective depth selection for sequential reasoning.

PaperID: 11369, https://arxiv.org/pdf/2503.03889.pdf

Authors: Muhammad Waseem, Kshitij Bhatta, Chen Li, Qing Chang

Title: Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line

Abstract:
The manufacturing industry is undergoing a transformative shift, driven by cutting‑edge technologies like 5G, AI, and cloud computing. Despite these advancements, effective system control, which is crucial for optimizing production efficiency, remains a complex challenge due to the intricate, knowledge‑dependent nature of manufacturing processes and the reliance on domain‑specific expertise. Conventional control methods often demand heavy customization, considerable computational resources, and lack transparency in decision‑making. In this work, we investigate the feasibility of using Large Language Models (LLMs), particularly GPT‑4, as a straightforward, adaptable solution for controlling manufacturing systems, specifically, mobile robot scheduling. We introduce an LLM‑based control framework to assign mobile robots to different machines in robot assisted serial production lines, evaluating its performance in terms of system throughput. Our proposed framework outperforms traditional scheduling approaches such as First‑Come‑First‑Served (FCFS), Shortest Processing Time (SPT), and Longest Processing Time (LPT). While it achieves performance that is on par with state‑of‑the‑art methods like Multi‑Agent Reinforcement Learning (MARL), it offers a distinct advantage by delivering comparable throughput without the need for extensive retraining. These results suggest that the proposed LLM‑based solution is well‑suited for scenarios where technical expertise, computational resources, and financial investment are limited, while decision transparency and system scalability are critical concerns.

PaperID: 11370, https://arxiv.org/pdf/2503.03796.pdf

Authors: Hyeonjun Kim, Kanghoon Lee, Junho Park, Jiachen Li, Jinkyoo Park

Title: Human Implicit Preference-Based Policy Fine-tuning for Multi-Agent Reinforcement Learning in USV Swarm

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents, such as an Unmanned Surface Vehicle (USV) swarm used in search and rescue, surveillance, and vessel protection. However, aligning system behavior with user preferences is challenging due to the difficulty of encoding expert intuition into reward functions. To address the issue, we propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit‑assignment challenges through an Agent‑Level Feedback system categorizing feedback into intra‑agent, inter‑agent, and intra‑team types. To overcome the challenges of direct human feedback, we employ a Large Language Model (LLM) evaluator to validate our approach using feedback scenarios such as region constraints, collision avoidance, and task allocation. Our method effectively refines USV swarm policies, addressing key challenges in multi‑agent systems while maintaining fairness and performance consistency.

PaperID: 11371, https://arxiv.org/pdf/2503.03779.pdf

Authors: Yimin Tang, Zhenghong Yu, Jiaoyang Li, Sven Koenig

Title: Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds

Abstract:
Multi‑Agent Path Finding (MAPF) involves finding collision‑free paths for multiple agents while minimizing a cost function‑‑an NP‑hard problem. Bounded suboptimal methods like Enhanced Conflict‑Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double‑ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best‑first search guided by this LB to find a collision‑free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce nearly 30% high‑level CT nodes and 50% low‑level focal search nodes. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.

PaperID: 11372, https://arxiv.org/pdf/2503.03704.pdf

Authors: Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang

Title: Memory Injection Attacks on LLM Agents via Query-Only Interaction

Abstract:
Agents powered by large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real‑world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, without assuming that the attacker can directly modify the memory bank of the agent. The attacker injects malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps corresponding to a different target query during the agent's execution of the victim user's query. Specifically, we introduce a sequence of bridging steps to link victim queries to the malicious reasoning steps. During the memory injection, we propose an indication prompt that guides the agent to autonomously generate similar bridging steps, with a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing later victim queries. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting the risk.

PaperID: 11373, https://arxiv.org/pdf/2503.03505.pdf

Authors: Yaoru Li, Shunyu Liu, Tongya Zheng, Li Sun, Mingli Song

Title: Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems in Minecraft

Abstract:
Recent advancements in Large Language Model~(LLM)‑based Multi‑Agent Systems (MAS) have demonstrated remarkable potential for tackling complex decision‑making tasks. However, existing frameworks inevitably rely on serialized execution paradigms, where agents must complete sequential LLM planning before taking action. This fundamental constraint severely limits real‑time responsiveness and adaptation, which is crucial in dynamic environments with ever‑changing scenarios like Minecraft. In this paper, we propose a novel parallelized planning‑acting framework for LLM‑based MAS, featuring a dual‑thread architecture with interruptible execution to enable concurrent planning and acting. Specifically, our framework comprises two core threads: (1) a planning thread driven by a centralized memory system, maintaining synchronization of environmental states and agent communication to support dynamic decision‑making; and (2) an acting thread equipped with a comprehensive skill library, enabling automated task execution through recursive decomposition. Extensive experiments on Minecraft demonstrate the effectiveness of the proposed framework.

PaperID: 11374, https://arxiv.org/pdf/2503.03391.pdf

Authors: Muhammet Hevesli, Abegaz Mohammed Seid, Aiman Erbad, Mohamed Abdallah

Title: Multi-Agent DRL for Queue-Aware Task Offloading in Hierarchical MEC-Enabled Air-Ground Networks

Abstract:
Mobile edge computing (MEC)‑enabled air‑ground networks are a key component of 6G, employing aerial base stations (ABSs) such as unmanned aerial vehicles (UAVs) and high‑altitude platform stations (HAPS) to provide dynamic services to ground IoT devices (IoTDs). These IoTDs support real‑time applications (e.g., multimedia and Metaverse services) that demand high computational resources and strict quality of service (QoS) guarantees in terms of latency and task queue management. Given their limited energy and processing capabilities, IoTDs rely on UAVs and HAPS to offload tasks for distributed processing, forming a multi‑tier MEC system. This paper tackles the overall energy minimization problem in MEC‑enabled air‑ground integrated networks (MAGIN) by jointly optimizing UAV trajectories, computing resource allocation, and queue‑aware task offloading decisions. The optimization is challenging due to the nonconvex, nonlinear nature of this hierarchical system, which renders traditional methods ineffective. We reformulate the problem as a multi‑agent Markov decision process (MDP) with continuous action spaces and heterogeneous agents, and propose a novel variant of multi‑agent proximal policy optimization with a Beta distribution (MAPPO‑BD) to solve it. Extensive simulations show that MAPPO‑BD outperforms baseline schemes, achieving superior energy savings and efficient resource management in MAGIN while meeting queue delay and edge computing constraints.

PaperID: 11375, https://arxiv.org/pdf/2503.03300.pdf

Authors: Hannes Rosenbusch, Erdem Ozan Meral

Title: Which books do I like?

Abstract:
Finding enjoyable fiction books can be challenging, partly because stories are multi‑faceted and one's own literary taste might be difficult to ascertain. Here, we introduce the ISAAC method (Introspection‑Support, AI‑Annotation, and Curation), a pipeline which supports fiction readers in gaining awareness of their literary preferences and finding enjoyable books. ISAAC consists of four steps: a user supplies book ratings, an AI agent researches and annotates the provided books, patterns in book enjoyment are reviewed by the user, and the AI agent recommends new books. In this proof‑of‑concept self‑study, the authors test whether ISAAC can highlight idiosyncratic patterns in their book enjoyment, spark a deeper reflection about their literary tastes, and make accurate, personalized recommendations of enjoyable books and underexplored literary niches. Results highlight substantial advantages of ISAAC over existing methods such as an integration of automation and intuition, accurate and customizable annotations, and explainable book recommendations. Observed disadvantages are that ISAAC's outputs can elicit false self‑narratives (if statistical patterns are taken at face value), that books cannot be annotated if their online documentation is lacking, and that people who are new to reading have to rely on assumed book ratings or movie ratings to power the ISAAC pipeline. We discuss additional opportunities of ISAAC‑style book annotations for the study of literary trends, and the scientific classification of books and readers.

PaperID: 11376, https://arxiv.org/pdf/2503.03258.pdf

Authors: Runlin Lei, Jiarui Ji, Haipeng Ding, Lu Yi, Zhewei Wei, Yongchao Liu, Chuntao Hong

Title: Exploring the Potential of Large Language Models as Predictors in Dynamic Text-Attributed Graphs

Abstract:
With the rise of large language models (LLMs), there has been growing interest in Graph Foundation Models (GFMs) for graph‑based tasks. By leveraging LLMs as predictors, GFMs have demonstrated impressive generalizability across various tasks and datasets. However, existing research on LLMs as predictors has predominantly focused on static graphs, leaving their potential in dynamic graph prediction unexplored. In this work, we pioneer using LLMs for predictive tasks on dynamic graphs. We identify two key challenges: the constraints imposed by context length when processing large‑scale historical data and the significant variability in domain characteristics, both of which complicate the development of a unified predictor. To address these challenges, we propose the GraphAgent‑Dynamic (GAD) Framework, a multi‑agent system that leverages collaborative LLMs. In contrast to using a single LLM as the predictor, GAD incorporates global and local summary agents to generate domain‑specific knowledge, enhancing its transferability across domains. Additionally, knowledge reflection agents enable adaptive updates to GAD's knowledge, maintaining a unified and self‑consistent architecture. In experiments, GAD demonstrates performance comparable to or even exceeds that of full‑supervised graph neural networks without dataset‑specific training. Finally, to enhance the task‑specific performance of LLM‑based predictors, we discuss potential improvements, such as dataset‑specific fine‑tuning to LLMs. By developing tailored strategies for different tasks, we provide new insights for the future design of LLM‑based predictors.

PaperID: 11377, https://arxiv.org/pdf/2503.03192.pdf

Authors: Alexander Thoms, Alan Papalia, Jared Velasquez, David M. Rosen, Sriram Narasimhan

Title: Distributed Certifiably Correct Range-Aided SLAM

Abstract:
Reliable simultaneous localization and mapping (SLAM) algorithms are necessary for safety‑critical autonomous navigation. In the communication‑constrained multi‑agent setting, navigation systems increasingly use point‑to‑point range sensors as they afford measurements with low bandwidth requirements and known data association. The state estimation problem for these systems takes the form of range‑aided (RA) SLAM. However, distributed algorithms for solving the RA‑SLAM problem lack formal guarantees on the quality of the returned estimate. To this end, we present the first distributed algorithm for RA‑SLAM that can efficiently recover certifiably globally optimal solutions. Our algorithm, distributed certifiably correct RA‑SLAM (DCORA), achieves this via the Riemannian Staircase method, where computational procedures developed for distributed certifiably correct pose graph optimization are generalized to the RA‑SLAM problem. We demonstrate DCORA's efficacy on real‑world multi‑agent datasets by achieving absolute trajectory errors comparable to those of a state‑of‑the‑art centralized certifiably correct RA‑SLAM algorithm. Additionally, we perform a parametric study on the structure of the RA‑SLAM problem using synthetic data, revealing how common parameters affect DCORA's performance.

PaperID: 11378, https://arxiv.org/pdf/2503.03154.pdf

Authors: Wei-Hao Chen, Weixi Tong, Amanda Case, Tianyi Zhang

Title: Dango: A Mixed-Initiative Data Wrangling System using Large Language Model

Abstract:
Data wrangling is a time‑consuming and challenging task in a data science pipeline. While many tools have been proposed to automate or facilitate data wrangling, they often misinterpret user intent, especially in complex tasks. We propose Dango, a mixed‑initiative multi‑agent system for data wrangling. Compared to existing tools, Dango enhances user communication of intent by allowing users to demonstrate on multiple tables and use natural language prompts in a conversation interface, enabling users to clarify their intent by answering LLM‑posed multiple‑choice clarification questions, and providing multiple forms of feedback such as step‑by‑step natural language explanations and data provenance to help users evaluate the data wrangling scripts. We conducted a within‑subjects user study with 38 participants and demonstrated that Dango's features can significantly improve intent clarification, accuracy, and efficiency in data wrangling. Furthermore, we demonstrated the generalizability of Dango by applying it to a broader set of data wrangling tasks.

PaperID: 11379, https://arxiv.org/pdf/2503.02992.pdf

Authors: Yimin Tang, Xiao Xiong, Jingyi Xi, Jiaoyang Li, Erdem Bıyık, Sven Koenig

Title: RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding Across Different Environments and Tasks

Abstract:
Multi‑Agent Path Finding (MAPF), which focuses on finding collision‑free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP‑hard so learning‑based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community's continued efforts, all learning‑based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning‑based policy for MAPF problem called RAILGUN. RAILGUN is not an agent‑based policy but a map‑based policy. By leveraging a CNN‑based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule‑based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero‑shot generalization capabilities on various tasks, maps and agent numbers that were not seen in the training dataset.

PaperID: 11380, https://arxiv.org/pdf/2503.02913.pdf

Authors: Zilin Zhao, Chishui Chen, Haotian Shi, Jiale Chen, Xuanlin Yue, Zhejian Yang, Yang Liu

Title: Towards Robust Multi-UAV Collaboration: MARL with Noise-Resilient Communication and Attention Mechanisms

Abstract:
Efficient path planning for unmanned aerial vehicles (UAVs) is crucial in remote sensing and information collection. As task scales expand, the cooperative deployment of multiple UAVs significantly improves information collection efficiency. However, collaborative communication and decision‑making for multiple UAVs remain major challenges in path planning, especially in noisy environments. To efficiently accomplish complex information collection tasks in 3D space and address robust communication issues, we propose a multi‑agent reinforcement learning (MARL) framework for UAV path planning based on the Counterfactual Multi‑Agent Policy Gradients (COMA) algorithm. The framework incorporates attention mechanism‑based UAV communication protocol and training‑deployment system, significantly improving communication robustness and individual decision‑making capabilities in noisy conditions. Experiments conducted on both synthetic and real‑world datasets demonstrate that our method outperforms existing algorithms in terms of path planning efficiency and robustness, especially in noisy environments, achieving a 78% improvement in entropy reduction.

PaperID: 11381, https://arxiv.org/pdf/2503.02878.pdf

Authors: Ethan Mendes, Alan Ritter

Title: Language Models can Self-Improve at State-Value Estimation for Better Search

Abstract:
Collecting ground‑truth rewards or human demonstrations for multi‑step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self‑Taught Lookahead (STL), a reward‑free framework that improves language model‑based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain‑of‑thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language ‑ predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self‑supervised procedure yields more accurate state‑value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL‑trained value models built on moderately sized (8B parameter) open‑weight LLMs boost web agent success rates by 39%, achieving comparable performance with proprietary models. STL also generalizes to multi‑hop QA and math puzzles. We find that STL enables small open‑source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.

PaperID: 11382, https://arxiv.org/pdf/2503.02760.pdf

Authors: Jiacheng Tang, Nankai Wu, Fan Gao, Chengxiao Dai, Mengyao Zhao, Xinjie Zhao

Title: From Metaphor to Mechanism: How LLMs Decode Traditional Chinese Medicine Symbolic Language for Modern Clinical Relevance

Abstract:
Metaphorical expressions are abundant in Traditional Chinese Medicine (TCM), conveying complex disease mechanisms and holistic health concepts through culturally rich and often abstract terminology. Bridging these metaphors to anatomically driven Western medical (WM) concepts poses significant challenges for both automated language processing and real‑world clinical practice. To address this gap, we propose a novel multi‑agent and chain‑of‑thought (CoT) framework designed to interpret TCM metaphors accurately and map them to WM pathophysiology. Specifically, our approach combines domain‑specialized agents (TCM Expert, WM Expert) with a Coordinator Agent, leveraging stepwise chain‑of‑thought prompts to ensure transparent reasoning and conflict resolution. We detail a methodology for building a metaphor‑rich TCM dataset, discuss strategies for effectively integrating multi‑agent collaboration and CoT reasoning, and articulate the theoretical underpinnings that guide metaphor interpretation across distinct medical paradigms. We present a comprehensive system design and highlight both the potential benefits and limitations of our approach, while leaving placeholders for future experimental validation. Our work aims to support clinical decision‑making, cross‑system educational initiatives, and integrated healthcare research, ultimately offering a robust scaffold for reconciling TCM's symbolic language with the mechanistic focus of Western medicine.

PaperID: 11383, https://arxiv.org/pdf/2503.02692.pdf

Authors: Congluo Xu, Zhaobin Liu, Ziyang Li

Title: FinArena: A Human-Agent Collaboration Framework for Financial Market Analysis and Forecasting

Abstract:
To improve stock trend predictions and support personalized investment decisions, this paper proposes FinArena, a novel Human‑Agent collaboration framework. Inspired by the mixture of experts (MoE) approach, FinArena combines multimodal financial data analysis with user interaction. The human module features an interactive interface that captures individual risk preferences, allowing personalized investment strategies. The machine module utilizes a Large Language Model‑based (LLM‑based) multi‑agent system to integrate diverse data sources, such as stock prices, news articles, and financial statements. To address hallucinations in LLMs, FinArena employs the adaptive Retrieval‑Augmented Generative (RAG) method for processing unstructured news data. Finally, a universal expert agent makes investment decisions based on the features extracted from multimodal data and investors' individual risk preferences. Extensive experiments show that FinArena surpasses both traditional and state‑of‑the‑art benchmarks in stock trend prediction and yields promising results in trading simulations across various risk profiles. These findings highlight FinArena's potential to enhance investment outcomes by aligning strategic insights with personalized risk considerations.

PaperID: 11384, https://arxiv.org/pdf/2503.02682.pdf

Authors: Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li

Title: MPO: Boosting LLM Agents with Meta Plan Optimization

Abstract:
Recent advancements in large language models (LLMs) have enabled LLM‑based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, , which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high‑level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent's task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug‑and‑play solution that enhances both task completion efficiency and generalization capabilities in previous unseen scenarios.

PaperID: 11385, https://arxiv.org/pdf/2503.02582.pdf

Authors: Alicia Vidler, Toby Walsh

Title: Playing games with Large language models: Randomness and strategy

Abstract:
Playing games has a long history of describing intricate interactions in simplified forms. In this paper we explore if large language models (LLMs) can play games, investigating their capabilities for randomisation and strategic adaptation through both simultaneous and sequential game interactions. We focus on GPT‑4o‑Mini‑2024‑08‑17 and test two games between LLMs: Rock Paper Scissors (RPS) and games of strategy (Prisoners Dilemma PD). LLMs are often described as stochastic parrots, and while they may indeed be parrots, our results suggest that they are not very stochastic in the sense that their outputs ‑ when prompted to be random ‑ are often very biased. Our research reveals that LLMs appear to develop loss aversion strategies in repeated games, with RPS converging to stalemate conditions while PD shows systematic shifts between cooperative and competitive outcomes based on prompt design. We detail programmatic tools for independent agent interactions and the Agentic AI challenges faced in implementation. We show that LLMs can indeed play games, just not very well. These results have implications for the use of LLMs in multi‑agent LLM systems and showcase limitations in current approaches to model output for strategic decision‑making.

PaperID: 11386, https://arxiv.org/pdf/2503.02519.pdf

Authors: Xingzuo Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yong Xu, Min Zhang

Title: Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

Abstract:
Large language model (LLM) agents typically adopt a step‑by‑step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep‑rooted one‑pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator‑Assistant Stepwise Rollback (GA‑Rollback) to induce better decision‑making for LLM agents. Particularly, GA‑Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA‑Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA‑Rollback can function as a robust plug‑and‑play module, integrating seamlessly with other methods.

PaperID: 11387, https://arxiv.org/pdf/2503.02512.pdf

Authors: Mehran Hosseini, Alessio Lomuscio, Nicola Paoletti

Title: LTL Verification of Memoryful Neural Agents

Abstract:
We present a framework for verifying Memoryful Neural Multi‑Agent Systems (MN‑MAS) against full Linear Temporal Logic (LTL) specifications. In MN‑MAS, agents interact with a non‑deterministic, partially observable environment. Examples of MN‑MAS include multi‑agent systems based on feed‑forward and recurrent neural networks or state‑space models. Different from previous approaches, we support the verification of both bounded and unbounded LTL specifications. We leverage well‑established bounded model checking techniques, including lasso search and invariant synthesis, to reduce the verification problem to that of constraint solving. To solve these constraints, we develop efficient methods based on bound propagation, mixed‑integer linear programming, and adaptive splitting. We evaluate the effectiveness of our algorithms in single and multi‑agent environments from the Gymnasium and PettingZoo libraries, verifying unbounded specifications for the first time and improving the verification time for bounded specifications by an order of magnitude compared to the SoA.

PaperID: 11388, https://arxiv.org/pdf/2503.02445.pdf

Authors: Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, Jiang Bian

Title: BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling

Abstract:
Time‑series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single‑domain TSG, real‑world applications demand for cross‑domain approaches capable of controlled generation tailored to domain‑specific constraints and instance‑level requirements. In this paper, we argue that text can provide semantic insights, domain information and instance‑specific temporal patterns, to guide and improve TSG. We introduce ``Text‑Controlled TSG'', a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM‑based Multi‑Agent framework that synthesizes diverse, realistic text‑to‑TS datasets. Furthermore, we introduce BRIDGE, a hybrid text‑controlled TSG framework that integrates semantic prototypes with text description for supporting domain‑level guidance. This approach achieves state‑of‑the‑art generation fidelity on 11 of 12 datasets, and improves controllability by up to 12% on MSE and 6% MAE compared to no text input generation, highlighting its potential for generating tailored time‑series data.

PaperID: 11389, https://arxiv.org/pdf/2503.02437.pdf

Authors: Antonio Marino, Esteban Restrepo, Claudio Pacchierotti, Paolo Robuffo Giordano

Title: Decentralized Reinforcement Learning for Multi-Agent Multi-Resource Allocation via Dynamic Cluster Agreements

Abstract:
This paper addresses the challenge of allocating heterogeneous resources among multiple agents in a decentralized manner. Our proposed method, Liquid‑Graph‑Time Clustering‑IPPO, builds upon Independent Proximal Policy Optimization (IPPO) by integrating dynamic cluster consensus, a mechanism that allows agents to form and adapt local sub‑teams based on resource demands. This decentralized coordination strategy reduces reliance on global information and enhances scalability. We evaluate LGTC‑IPPO against standard multi‑agent reinforcement learning baselines and a centralized expert solution across a range of team sizes and resource distributions. Experimental results demonstrate that LGTC‑IPPO achieves more stable rewards, better coordination, and robust performance even as the number of agents or resource types increases. Additionally, we illustrate how dynamic clustering enables agents to reallocate resources efficiently also for scenarios with discharging resources.

PaperID: 11390, https://arxiv.org/pdf/2503.02399.pdf

Authors: Seungkwon Kim, GyuTae Park, Sangyeon Kim, Seung-Hun Nam

Title: VisAgent: Narrative-Preserving Story Visualization Framework

Abstract:
Story visualization is the transformation of narrative elements into image sequences. While existing research has primarily focused on visual contextual coherence, the deeper narrative essence of stories often remains overlooked. This limitation hinders the practical application of these approaches, as generated images frequently fail to capture the intended meaning and nuances of the narrative fully. To address these challenges, we propose VisAgent, a training‑free multi‑agent framework designed to comprehend and visualize pivotal scenes within a given story. By considering story distillation, semantic consistency, and contextual coherence, VisAgent employs an agentic workflow. In this workflow, multiple specialized agents collaborate to: (i) refine layered prompts based on the narrative structure and (ii) seamlessly integrate \gtgenerated elements, including refined prompts, scene elements, and subject placement, into the final image. The empirically validated effectiveness confirms the framework's suitability for practical story visualization applications.

PaperID: 11391, https://arxiv.org/pdf/2503.02390.pdf

Authors: Heng Zhou, Hejia Geng, Xiangyuan Xue, Li Kang, Yiran Qin, Zhiyong Wang, Zhenfei Yin, Lei Bai

Title: ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks

Abstract:
Multi‑agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem‑solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward‑driven two‑stage agent selection process centered on our Collaborative Reward Model that provides fine‑grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math‑MAS and 32.3 percent accuracy on SciBench‑MAS, where other approaches completely fail.

PaperID: 11392, https://arxiv.org/pdf/2503.02341.pdf

Authors: Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia

Title: GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Abstract:
Recent great advances in video generation models have demonstrated their potential to produce high‑quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack highlevel semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO‑Instruct, a multi‑dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi‑step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI‑generated videos for explainable scores and assessments through multi‑step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real‑world scenarios.

PaperID: 11393, https://arxiv.org/pdf/2503.02313.pdf

Authors: Kürşad Metehan Gül, Selahattin Burak Sarsılmaz

Title: Multi-Partite Output Regulation of Multi-Agent Systems

Abstract:
This article proposes a simple, graph‑independent perspective on partitioning the node set of a graph and provides multi‑agent systems (MASs) with objectives beyond cooperation and bipartition. Specifically, we first introduce the notion of k‑partition transformation to achieve any desired partition of the nodes. Then, we use this notion to formulate the multi‑partite output regulation problem (MORP) of heterogeneous linear MASs, which comprises the existing cooperative output regulation problem (CORP) and bipartite output regulation problem (BORP) as subcases. The goal of the MORP is to design a distributed control law such that each follower that belongs to the same set in the partition asymptotically tracks a scalar multiple of the reference while ensuring the internal stability of the closed‑loop system. It is shown that the necessary and sufficient conditions for the solvability of the MORP with a feedforward‑based distributed control law follow from the CORP and lead to the first design strategy for the control parameters. However, it has a drawback in terms of scalability due to a partition‑dependent condition. We prove that this condition is implied by its partition‑independent version under a mild structural condition. This implication yields the second design strategy that is much more scalable than the first one. Finally, an experiment is conducted to demonstrate the MORP's flexibility, and two numerical examples are provided to illustrate its generality and compare both design strategies regarding scalability.

PaperID: 11394, https://arxiv.org/pdf/2503.02311.pdf

Authors: Kensuke Tatematsu, Akifumi Wachi

Title: Target Return Optimizer for Multi-Game Decision Transformer

Abstract:
Achieving autonomous agents with robust generalization capabilities across diverse games and tasks remains one of the ultimate goals in AI research. Recent advancements in transformer‑based offline reinforcement learning, exemplified by the MultiGame Decision Transformer [Lee et al., 2022], have shown remarkable performance across various games or tasks. However, these approaches depend heavily on human expertise, presenting substantial challenges for practical deployment, particularly in scenarios with limited prior game‑specific knowledge. In this paper, we propose an algorithm called Multi‑Game Target Return Optimizer (MTRO) to autonomously determine game‑specific target returns within the Multi‑Game Decision Transformer framework using solely offline datasets. MTRO addresses the existing limitations by automating the target return configuration process, leveraging environmental reward information extracted from offline datasets. Notably, MTRO does not require additional training, enabling seamless integration into existing Multi‑Game Decision Transformer architectures. Our experimental evaluations on Atari games demonstrate that MTRO enhances the performance of RL policies across a wide array of games, underscoring its potential to advance the field of autonomous agent development.

PaperID: 11395, https://arxiv.org/pdf/2503.02268.pdf

Authors: Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, Chi Zhang

Title: AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Abstract:
Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM‑based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step‑by‑step reasoning in LLM‑based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule‑based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high‑level actions that act as shortcuts, replacing these low‑level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open‑sourced to support further research.

PaperID: 11396, https://arxiv.org/pdf/2503.02197.pdf

Authors: Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, Tianyi Zhou

Title: ATLaS: Agent Tuning via Learning Critical Steps

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi‑domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior‑cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision‑making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLaS that identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps with reduced costs. By steering the training's focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only 30% critical steps selected by ATLaS outperforms the LLM finetuned on all steps and recent open‑source LLM agents. ATLaS maintains and improves base LLM skills as generalist agents interacting with diverse environments.

PaperID: 11397, https://arxiv.org/pdf/2503.02189.pdf

Authors: Dickness Kakitahi Kwesiga, Angshuman Guin, Michael Hunter

Title: Adaptive Traffic Signal Control based on Multi-Agent Reinforcement Learning. Case Study on a simulated real-world corridor

Abstract:
Previous studies that have formulated multi‑agent reinforcement learning (RL) algorithms for adaptive traffic signal control have primarily used value‑based RL methods. However, recent literature has shown that policy‑based methods may perform better in partially observable environments. Additionally, RL methods remain largely untested for real‑world normally signal timing plans because of the simplifying assumptions common in the literature. The current study attempts to address these gaps and formulates a multi‑agent proximal policy optimization (MA‑PPO) algorithm to implement adaptive and coordinated traffic control along an arterial corridor. The formulated MA‑PPO has a centralized‑critic architecture under a centralized training and decentralized execution framework. Agents are designed to allow selection and implementation of up to eight signal phases, as commonly implemented in field controllers. The formulated algorithm is tested on a simulated real‑world seven intersection corridor. The speed of convergence for each agent was found to depend on the size of the action space, which depends on the number and sequence of signal phases. The performance of the formulated MA‑PPO adaptive control algorithm is compared with the field implemented actuated‑coordinated signal control (ASC), modeled using PTV‑Vissim‑MaxTime software in the loop simulation (SILs). The trained MA‑PPO performed significantly better than the ASC for all movements. Compared to ASC the MA‑PPO showed 2% and 24% improvements in travel time in the primary and secondary coordination directions, respectively. For cross streets movements MA‑PPO also showed significant crossing time reductions. Volume sensitivity experiments revealed that the formulated MA‑PPO demonstrated good stability, robustness, and adaptability to changes in traffic demand.

PaperID: 11398, https://arxiv.org/pdf/2503.02116.pdf

Authors: Ashwin Verma, Soheil Mohajer, Behrouz Touri

Title: Multi-Agent Fact Checking

Abstract:
We formulate the problem of fake news detection using distributed fact‑checkers (agents) with unknown reliability. The stream of news/statements is modeled as an independent and identically distributed binary source (to represent true and false statements). Upon observing a news, agent i labels the news as true or false which reflects the true validity of the statement with some probability 1‑π_i. In other words, agent i misclassified each statement with error probability π_i\in (0,1), where the parameter π_i models the (un)trustworthiness of agent i. We present an algorithm to learn the unreliability parameters, resulting in a distributed fact‑checking algorithm. Furthermore, we extensively analyze the discrete‑time limit of our algorithm.

PaperID: 11399, https://arxiv.org/pdf/2503.02077.pdf

Authors: Ziyan Wang, Zhicheng Zhang, Fei Fang, Yali Du

Title: M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality

Abstract:
Designing effective reward functions in multi‑agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi‑agent Reinforcement Learning from Multi‑phase Human Feedback of Mixed Quality (\textM^3\textHF), a novel framework that integrates multi‑phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, \textM^3\textHF leverages both expert and non‑expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weights by using weight decay and performance‑based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi‑agent cooperation. Empirical results in challenging environments demonstrate that \textM^3\textHF significantly outperforms state‑of‑the‑art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.

PaperID: 11400, https://arxiv.org/pdf/2503.02068.pdf

Authors: Will Epperson, Gagan Bansal, Victor Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, Saleema Amershi

Title: Interactive Debugging and Steering of Multi-Agent AI Systems

Abstract:
Fully autonomous teams of LLM‑powered AI agents are emerging that collaborate to perform complex tasks for users. What challenges do developers face when trying to build and debug these AI agent teams? In formative interviews with five AI agent developers, we identify core challenges: difficulty reviewing long agent conversations to localize errors, lack of support in current tools for interactive debugging, and the need for tool support to iterate on agent configuration. Based on these needs, we developed an interactive multi‑agent debugging tool, AGDebugger, with a UI for browsing and sending messages, the ability to edit and reset prior agent messages, and an overview visualization for navigating complex message histories. In a two‑part user study with 14 participants, we identify common user strategies for steering agents and highlight the importance of interactive message resets for debugging. Our studies deepen understanding of interfaces for debugging increasingly important agentic workflows.

PaperID: 11401, https://arxiv.org/pdf/2503.02038.pdf

Authors: Angana Borah, Rada Mihalcea, Verónica Pérez-Rosas

Title: Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

Abstract:
Existing challenges in misinformation exposure and susceptibility vary across demographic groups, as some populations are more vulnerable to misinformation than others. Large language models (LLMs) introduce new dimensions to these challenges through their ability to generate persuasive content at scale and reinforcing existing biases. This study investigates the bidirectional persuasion dynamics between LLMs and humans when exposed to misinformative content. We analyze human‑to‑LLM influence using human‑stance datasets and assess LLM‑to‑human influence by generating LLM‑based persuasive arguments. Additionally, we use a multi‑agent LLM framework to analyze the spread of misinformation under persuasion among demographic‑oriented LLM agents. Our findings show that demographic factors influence susceptibility to misinformation in LLMs, closely reflecting the demographic‑based patterns seen in human susceptibility. We also find that, similar to human demographic groups, multi‑agent LLMs exhibit echo chamber behavior. This research explores the interplay between humans and LLMs, highlighting demographic differences in the context of misinformation and offering insights for future interventions.

PaperID: 11402, https://arxiv.org/pdf/2503.02016.pdf

Authors: Angana Borah, Marwa Houalla, Rada Mihalcea

Title: Mind the (Belief) Gap: Group Identity in the World of LLMs

Abstract:
Social biases and belief‑driven behaviors can significantly impact Large Language Models (LLMs) decisions on several tasks. As LLMs are increasingly used in multi‑agent systems for societal simulations, their ability to model fundamental group psychological characteristics remains critical yet under‑explored. In this study, we present a multi‑agent framework that simulates belief congruence, a classical group psychology theory that plays a crucial role in shaping societal interactions and preferences. Our findings reveal that LLMs exhibit amplified belief congruence compared to humans, across diverse contexts. We further investigate the implications of this behavior on two downstream tasks: (1) misinformation dissemination and (2) LLM learning, finding that belief congruence in LLMs increases misinformation dissemination and impedes learning. To mitigate these negative impacts, we propose strategies inspired by: (1) contact hypothesis, (2) accuracy nudges, and (3) global citizenship framework. Our results show that the best strategies reduce misinformation dissemination by up to 37% and enhance learning by 11%. Bridging social psychology and AI, our work provides insights to navigate real‑world interactions using LLMs while addressing belief‑driven biases.

PaperID: 11403, https://arxiv.org/pdf/2503.01876.pdf

Authors: Zhanpeng He, Yifeng Cao, Matei Ciocarlie

Title: Uncertainty Comes for Free: Human-in-the-Loop Policies with Diffusion Models

Abstract:
Human‑in‑the‑loop (HitL) robot deployment has gained significant attention in both academia and industry as a semi‑autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor‑intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty‑based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine‑tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real‑world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.

PaperID: 11404, https://arxiv.org/pdf/2503.01829.pdf

Authors: Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

Title: Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

Abstract:
Large Language Models (LLMs) demonstrate persuasive capabilities that rival human‑level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi‑agent interactions. Our framework offers a scalable alternative to the costly and time‑intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi‑turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama‑3.3‑70B and GPT‑4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT‑4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama‑3.3‑70B. Notably, o4‑mini emerges as both an effective persuader, and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

PaperID: 11405, https://arxiv.org/pdf/2503.01821.pdf

Authors: Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora

Title: On the Power of Context-Enhanced Learning in LLMs

Abstract:
We formalize a new concept for LLMs, context‑enhanced learning. It involves standard gradient‑based learning on text except that the context is enhanced with additional data on which no auto‑regressive gradients are computed. This setting is a gradient‑based analog of usual in‑context learning (ICL) and appears in some recent works. Using a multi‑step reasoning task, we prove in a simplified setting that context‑enhanced learning can be exponentially more sample‑efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context‑enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.

PaperID: 11406, https://arxiv.org/pdf/2503.01694.pdf

Authors: Zhanxin Hao, Jianxiao Jiang, Jifan Yu, Zhiyuan Liu, Yu Zhang

Title: Student-AI Interaction in an LLM-Empowered Learning Environment: A Cluster Analysis of Engagement Profiles

Abstract:
Integrating Large Language Models (LLMs) into educational practice enables personalized learning by accommodating diverse learner behaviors. This study explored diverse learner profiles within a multi‑agent, LLM‑empowered learning environment. Data was collected from 312 undergraduate students at a university in China as they participated in a six‑module course. Based on hierarchical cluster analyses of system profiles and student‑AI interactive dialogues, we found that students exhibit varied behavioral, cognitive, and emotional engagement tendencies. This analysis allowed us to identify two types of dropouts (early dropouts and stagnating interactors) and three completer profiles (active questioners, responsive navigators, and lurkers). The results showed that high levels of interaction do not always equate to productive learning and vice versa. Prior knowledge significantly influenced interaction patterns and short‑term learning benefits. Further analysis of the human‑AI dialogues revealed that some students actively engaged in knowledge construction, while others displayed a high frequency of regulatory behaviors. Notably, both groups of students achieved comparable learning gains, demonstrating the effectiveness of the multi‑agent learning environment in supporting personalized learning. These results underscore the complex and multifaceted nature of engagement in human‑AI collaborative learning and provide practical implications for the design of adaptive educational systems.

PaperID: 11407, https://arxiv.org/pdf/2503.01608.pdf

Authors: Yu Zhang, Kexue Fu, Zhicong Lu

Title: RevTogether: Supporting Science Story Revision with Multiple AI Agents

Abstract:
As a popular form of science communication, science stories attract readers because they combine engaging narratives with comprehensible scientific knowledge. However, crafting such stories requires substantial skill and effort, as writers must navigate complex scientific concepts and transform them into coherent and accessible narratives tailored to audiences with varying levels of scientific literacy. To address the challenge, we propose RevTogether, a multi‑agent system (MAS) designed to support revision of science stories with human‑like AI agents (using GPT‑4o). RevTogether allows AI agents to simulate affects in addition to providing comments and writing suggestions, while offering varying degrees of user agency. Our preliminary user study with non‑expert writers (N=3) highlighted the need for transparency in AI agents' decision‑making processes to support learning and suggested that emotional interactions could enhance human‑AI collaboration in science storytelling.

PaperID: 11408, https://arxiv.org/pdf/2503.01458.pdf

Authors: Xu Wan, Chao Yang, Cheng Yang, Jie Song, Mingyang Sun

Title: SrSv: Integrating Sequential Rollouts with Sequential Value Estimation for Multi-agent Reinforcement Learning

Abstract:
Although multi‑agent reinforcement learning (MARL) has shown its success across diverse domains, extending its application to large‑scale real‑world systems still faces significant challenges. Primarily, the high complexity of real‑world environments exacerbates the credit assignment problem, substantially reducing training efficiency. Moreover, the variability of agent populations in large‑scale scenarios necessitates scalable decision‑making mechanisms. To address these challenges, we propose a novel framework: Sequential rollout with Sequential value estimation (SrSv). This framework aims to capture agent interdependence and provide a scalable solution for cooperative MARL. Specifically, SrSv leverages the autoregressive property of the Transformer model to handle varying populations through sequential action rollout. Furthermore, to capture the interdependence of policy distributions and value functions among multiple agents, we introduce an innovative sequential value estimation methodology and integrates the value approximation into an attention‑based sequential model. We evaluate SrSv on three benchmarks: Multi‑Agent MuJoCo, StarCraft Multi‑Agent Challenge, and DubinsCars. Experimental results demonstrate that SrSv significantly outperforms baseline methods in terms of training efficiency without compromising convergence performance. Moreover, when implemented in a large‑scale DubinsCar system with 1,024 agents, our framework surpasses existing benchmarks, highlighting the excellent scalability of SrSv.

PaperID: 11409, https://arxiv.org/pdf/2503.01440.pdf

Authors: Hyungho Na, Kwanghyeon Lee, Sumin Lee, Il-Chul Moon

Title: Trajectory-Class-Aware Multi-Agent Reinforcement Learning

Abstract:
In the context of multi‑agent reinforcement learning, generalization is a challenge to solve various tasks that may require different joint policies or coordination without relying on policies specialized for each task. We refer to this type of problem as a multi‑task, and we train agents to be versatile in this multi‑task setting through a single training process. To address this challenge, we introduce TRajectory‑class‑Aware Multi‑Agent reinforcement learning (TRAMA). In TRAMA, agents recognize a task type by identifying the class of trajectories they are experiencing through partial observations, and the agents use this trajectory awareness or prediction as additional information for action policy. To this end, we introduce three primary objectives in TRAMA: (a) constructing a quantized latent space to generate trajectory embeddings that reflect key similarities among them; (b) conducting trajectory clustering using these trajectory embeddings; and (c) building a trajectory‑class‑aware policy. Specifically for (c), we introduce a trajectory‑class predictor that performs agent‑wise predictions on the trajectory class; and we design a trajectory‑class representation model for each trajectory class. Each agent takes actions based on this trajectory‑class representation along with its partial observation for task‑aware execution. The proposed method is evaluated on various tasks, including multi‑task problems built upon StarCraft II. Empirical results show further performance improvements over state‑of‑the‑art baselines.

PaperID: 11410, https://arxiv.org/pdf/2503.01112.pdf

Authors: Andrea Lama, Mario di Bernardo, Sabine H. L. Klapp

Title: Nonreciprocal field theory for decision-making in multi-agent control systems

Abstract:
Field theories for complex systems traditionally focus on collective behaviors emerging from simple, reciprocal pairwise interaction rules. However, many natural and artificial systems exhibit behaviors driven by microscopic decision‑making processes that introduce both nonreciprocity and many‑body interactions, challenging these conventional approaches. We develop a theoretical framework to incorporate decision‑making into field theories using the shepherding control problem as a paradigmatic example of a multi‑agent control system, where agents (herders) must coordinate to confine another group of agents (targets) within a prescribed region. By introducing continuous approximations of two key decision‑making elements ‑ target selection and trajectory planning ‑ we derive field equations that capture the essential features of this distributed control problem. Our theory reveals that different decision‑making strategies emerge at the continuum level, from average attraction to highly selective choices, and from undirected to goal‑oriented motion, driving transitions between homogeneous and confined configurations. The resulting nonreciprocal field theory not only describes the shepherding problem but provides a general framework for incorporating decision‑making into continuum theories of collective behavior, with implications for applications ranging from robotic swarms to crowd management systems.

PaperID: 11411, https://arxiv.org/pdf/2503.01101.pdf

Authors: Jonathan Lane, Nak-seung Patrick Hyun

Title: Graph-Based Dynamics and Network Control of a Single Articulated Robotic System

Abstract:
Extensive research on graph‑based dynamics and control of multi‑agent systems has successfully demonstrated control of robotic swarms, where each robot is perceived as an independent agent virtually connected by a network topology. The strong advantage of the network control structure lies in the decentralized nature of the control action, which only requires the knowledge of virtually connected agents. In this paper, we seek to expand the ideas of virtual network constraints to physical constraints on a class of tree‑structured robots which we denote as single articulated robotic (SAR) systems. In our proposed framework, each link can be viewed as an agent, and each holonomic constraint connecting links serves as an edge. By following the first principles of Lagrangian dynamics, we derive a consensus‑like matrix‑differential equation with weighted graph and edge Laplacians for the dynamics of a SAR system. The sufficient condition for the holonomic constraint forces becoming independent to the control inputs is derived. This condition leads to a decentralized leader‑follower network control framework for regulating the relative configuration of the robot. Simulation results demonstrate the effectiveness of the proposed control method.

PaperID: 11412, https://arxiv.org/pdf/2503.01069.pdf

Authors: Kareem Eissa, Rayal Prasad, Sarith Mohan, Ankur Kapoor, Dorin Comaniciu, Vivek Singh

Title: Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization

Abstract:
Workforce optimization plays a crucial role in efficient organizational operations where decision‑making may span several different administrative and time scales. For instance, dispatching personnel to immediate service requests while managing talent acquisition with various expertise sets up a highly dynamic optimization problem. Existing work focuses on specific sub‑problems such as resource allocation and facility location, which are solved with heuristics like local‑search and, more recently, deep reinforcement learning. However, these may not accurately represent real‑world scenarios where such sub‑problems are not fully independent. Our aim is to fill this gap by creating a simulator that models a unified workforce optimization problem. Specifically, we designed a modular simulator to support the development of reinforcement learning methods for integrated workforce optimization problems. We focus on three interdependent aspects: personnel dispatch, workforce management, and personnel positioning. The simulator provides configurable parameterizations to help explore dynamic scenarios with varying levels of stochasticity and non‑stationarity. To facilitate benchmarking and ablation studies, we also include heuristic and RL baselines for the above mentioned aspects.

PaperID: 11413, https://arxiv.org/pdf/2503.00845.pdf

Authors: Miao Peng, Nuo Chen, Zongrui Suo, Jia Li

Title: Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners

Abstract:
Despite significant advancements in Large Language Models (LLMs), developing advanced reasoning capabilities in LLMs remains a key challenge. Process Reward Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by providing step‑wise feedback, particularly in the context of mathematical reasoning. However, their application to broader reasoning domains remains understudied, largely due to the high costs associated with manually creating step‑level supervision. In this work, we explore the potential of PRMs in graph reasoning problems ‑ a domain that demands sophisticated multi‑step reasoning and offers opportunities for automated step‑level data generation using established graph algorithms. We introduce GraphSILO, the largest dataset for graph reasoning problems with fine‑grained step‑wise labels, built using automated Task‑oriented Trajectories and Monte Carlo Tree Search (MCTS) to generate detailed reasoning steps with step‑wise labels. Building upon this dataset, we train GraphPRM, the first PRM designed for graph reasoning problems, and evaluate its effectiveness in two key settings: inference‑time scaling and reinforcement learning via Direct Preference Optimization (DPO). Experimental results show that GraphPRM significantly improves LLM performance across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5‑7B and demonstrating transferability to new graph reasoning datasets and new reasoning domains like mathematical problem‑solving. Notably, GraphPRM enhances LLM performance on GSM8K and Math500, underscoring the cross‑domain applicability of graph‑based reasoning rewards. Our findings highlight the potential of PRMs in advancing reasoning across diverse domains, paving the way for more versatile and effective LLMs.

PaperID: 11414, https://arxiv.org/pdf/2503.00767.pdf

Authors: Baris Yamansavascilar, Atay Ozgovde, Cem Ersoy

Title: LLMs are everywhere: Ubiquitous Utilization of AI Models through Air Computing

Abstract:
We are witnessing a new era where problem‑solving and cognitive tasks are being increasingly delegated to Large Language Models (LLMs) across diverse domains, ranging from code generation to holiday planning. This trend also creates a demand for the ubiquitous execution of LLM‑powered applications in a wide variety of environments in which traditional terrestrial 2D networking infrastructures may prove insufficient. A promising solution in this context is to extend edge computing into a 3D setting to include aerial platforms organized in multiple layers, a paradigm we refer to as air computing, to augment local devices for running LLM and Generative AI (GenAI) applications. This approach alleviates the strain on existing infrastructure while enhancing service efficiency by offloading computational tasks to the corresponding air units such as UAVs. Furthermore, the coordinated deployment of various air units can significantly improve the Quality of Experience (QoE) by ensuring seamless, adaptive, and resilient task execution. In this study, we investigate the synergy between LLM‑based applications and air computing, exploring their potential across various use cases. Additionally, we present a disaster response case study demonstrating how the collaborative utilization of LLMs and air computing can significantly improve outcomes in critical situations.

PaperID: 11415, https://arxiv.org/pdf/2503.00751.pdf

Authors: Hongchao Gu, Dexun Li, Kuicai Dong, Hao Zhang, Hang Lv, Hao Wang, Defu Lian, Yong Liu, Enhong Chen

Title: RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery

Abstract:
Generating knowledge‑intensive and comprehensive long texts, such as encyclopedia articles, remains significant challenges for Large Language Models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as direct generation and multi‑agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient retrieval‑augmented long text generation framework. RAPID consists of three main modules: (1) Retrieval‑augmented preliminary outline generation to reduce hallucinations, (2) Attribute‑constrained search for efficient information discovery, (3) Plan‑guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki‑2024, demonstrate that RAPID significantly outperforms state‑of‑the‑art methods across a wide range of evaluation metrics (e.g. long‑text generation, outline quality, latency, etc). Our work provides a robust and efficient solution to the challenges of automated long‑text generation.

PaperID: 11416, https://arxiv.org/pdf/2503.00729.pdf

Authors: Mingcong Lei, Ge Wang, Yiming Zhao, Zhixin Mai, Qing Zhao, Yao Guo, Zhen Li, Shuguang Cui, Yatong Han, Jinke Ren

Title: CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments

Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities in the hierarchical decomposition of complex tasks through semantic reasoning. However, their application in embodied systems faces challenges in ensuring reliable execution of subtask sequences and achieving one‑shot success in long‑term task completion. To address these limitations in dynamic environments, we propose Closed‑Loop Embodied Agent (CLEA) ‑‑ a novel architecture incorporating four specialized open‑source LLMs with functional decoupling for closed‑loop task management. The framework features two core innovations: (1) Interactive task planner that dynamically generates executable subtasks based on the environmental memory, and (2) Multimodal execution critic employing an evaluation framework to conduct a probabilistic assessment of action feasibility, triggering hierarchical re‑planning mechanisms when environmental perturbations exceed preset thresholds. To validate CLEA's effectiveness, we conduct experiments in a real environment with manipulable objects, using two heterogeneous robots for object search, manipulation, and search‑manipulation integration tasks. Across 12 task trials, CLEA outperforms the baseline model, achieving a 67.3% improvement in success rate and a 52.8% increase in task completion rate. These results demonstrate that CLEA significantly enhances the robustness of task planning and execution in dynamic environments.

PaperID: 11417, https://arxiv.org/pdf/2503.00684.pdf

Authors: Maria Ana Cardei, Afsaneh Doryab

Title: Factorized Deep Q-Network for Cooperative Multi-Agent Reinforcement Learning in Victim Tagging

Abstract:
Mass casualty incidents (MCIs) are a growing concern, characterized by complexity and uncertainty that demand adaptive decision‑making strategies. The victim tagging step in the emergency medical response must be completed quickly and is crucial for providing information to guide subsequent time‑constrained response actions. In this paper, we present a mathematical formulation of multi‑agent victim tagging to minimize the time it takes for responders to tag all victims. Five distributed heuristics are formulated and evaluated with simulation experiments. The heuristics considered are on‑the go, practical solutions that represent varying levels of situational uncertainty in the form of global or local communication capabilities, showcasing practical constraints. We further investigate the performance of a multi‑agent reinforcement learning (MARL) strategy, factorized deep Q‑network (FDQN), to minimize victim tagging time as compared to baseline heuristics. Extensive simulations demonstrate that between the heuristics, methods with local communication are more efficient for adaptive victim tagging, specifically choosing the nearest victim with the option to replan. Analyzing all experiments, we find that our FDQN approach outperforms heuristics in smaller‑scale scenarios, while heuristics excel in more complex scenarios. Our experiments contain diverse complexities that explore the upper limits of MARL capabilities for real‑world applications and reveal key insights.

PaperID: 11418, https://arxiv.org/pdf/2503.00566.pdf

Authors: Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

Title: Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

Abstract:
The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi‑agent large language model framework as well as the cloud‑mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large language models have allowed for out‑of‑the‑box automated large‑scale data analysis. We use a multi‑agent large language system comprised of an Instructor agent and Worker agents. Upon receiving the users' instructions, the Instructor agent retrieves the data from the cloud platform and produces instruction prompts to the Worker agents. The Worker agents then analyze the data and provide summaries. The summaries are finally input back into the Instructor agent, which then provides the final data analysis. We test this system's capability for data‑based policy recommendation by assessing our Instructor‑Worker LLM system's health recommendations based on air quality during the Los Angeles wildfires.

PaperID: 11419, https://arxiv.org/pdf/2503.00481.pdf

Authors: Felix Dobslaw, Robert Feldt, Juyeon Yoon, Shin Yoo

Title: Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

Abstract:
Large Language Models (LLMs) and Multi‑Agent LLMs (MALLMs) introduce non‑determinism unlike traditional or machine learning software, requiring new approaches to verifying correctness beyond simple output comparisons or statistical accuracy over test datasets. This paper presents a taxonomy for LLM test case design, informed by research literature and our experience. Each facet is exemplified, and we conduct an LLM‑assisted analysis of six open‑source testing frameworks, perform a sensitivity study of an agent‑based system across different model configurations, and provide working examples contrasting atomic and aggregated test cases. We identify key variation points that impact test correctness and highlight open challenges that the research, industry, and open‑source communities must address as LLMs become integral to software systems. Our taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, the system under test, and inputs, and introduces two key oracle types: atomic and aggregated. Our findings reveal that current tools treat test executions as isolated events, lack explicit aggregation mechanisms, and inadequately capture variability across model versions, configurations, and repeated runs. This highlights the need for viewing correctness as a distribution of outcomes rather than a binary property, requiring closer collaboration between academia and practitioners to establish mature, variability‑aware testing methodologies.

PaperID: 11420, https://arxiv.org/pdf/2503.00372.pdf

Authors: Yugu Li, Zehong Cao, Jianglin Qiao, Siyi Hu

Title: Nucleolus Credit Assignment for Effective Coalitions in Multi-agent Reinforcement Learning

Abstract:
In cooperative multi‑agent reinforcement learning (MARL), agents typically form a single grand coalition based on credit assignment to tackle a composite task, often resulting in suboptimal performance. This paper proposed a nucleolus‑based credit assignment grounded in cooperative game theory, enabling the autonomous partitioning of agents into multiple small coalitions that can effectively identify and complete subtasks within a larger composite task. Specifically, our designed nucleolus Q‑learning could assign fair credits to each agent, and the nucleolus Q‑operator provides theoretical guarantees with interpretability for both learning convergence and the stability of the formed small coalitions. Through experiments on Predator‑Prey and StarCraft scenarios across varying difficulty levels, our approach demonstrated the emergence of multiple effective coalitions during MARL training, leading to faster learning and superior performance in terms of win rate and cumulative rewards especially in hard and super‑hard environments, compared to four baseline methods. Our nucleolus‑based credit assignment showed the promise for complex composite tasks requiring effective subteams of agents.

PaperID: 11421, https://arxiv.org/pdf/2503.00355.pdf

Authors: Tianyi Huang, Elsa Fan

Title: Structured Reasoning for Fairness: A Multi-Agent Approach to Bias Detection in Textual Data

Abstract:
From disinformation spread by AI chatbots to AI recommendations that inadvertently reinforce stereotypes, textual bias poses a significant challenge to the trustworthiness of large language models (LLMs). In this paper, we propose a multi‑agent framework that systematically identifies biases by disentangling each statement as fact or opinion, assigning a bias intensity score, and providing concise, factual justifications. Evaluated on 1,500 samples from the WikiNPOV dataset, the framework achieves 84.9% accuracy\unicodex2014an improvement of 13.0% over the zero‑shot baseline\unicodex2014demonstrating the efficacy of explicitly modeling fact versus opinion prior to quantifying bias intensity. By combining enhanced detection accuracy with interpretable explanations, this approach sets a foundation for promoting fairness and accountability in modern language models.

PaperID: 11422, https://arxiv.org/pdf/2503.00162.pdf

Authors: Kangda Wei, Zhengyu Zhou, Bingqing Wang, Jun Araki, Lukas Lange, Ruihong Huang, Zhe Feng

Title: PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos

Abstract:
In recent years, online lecture videos have become an increasingly popular resource for acquiring new knowledge. Systems capable of effectively understanding/indexing lecture videos are thus highly desirable, enabling downstream tasks like question answering to help users efficiently locate specific information within videos. This work proposes PreMind, a novel multi‑agent multimodal framework that leverages various large models for advanced understanding/indexing of presentation‑style videos. PreMind first segments videos into slide‑presentation segments using a Vision‑Language Model (VLM) to enhance modern shot‑detection techniques. Each segment is then analyzed to generate multimodal indexes through three key steps: (1) extracting slide visual content, (2) transcribing speech narratives, and (3) consolidating these visual and speech contents into an integrated understanding. Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting/correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self‑reflection in vision analysis. Compared to traditional video indexing methods, PreMind captures rich, reliable multimodal information, allowing users to search for details like abbreviations shown only on slides. Systematic evaluations on the public LPM dataset and an internal enterprise dataset are conducted to validate PreMind's effectiveness, supported by detailed analyses.

PaperID: 11423, https://arxiv.org/pdf/2503.00092.pdf

Authors: Ghulam Mujtaba, Sunder Ali Khowaja, Kapal Dev

Title: EdgeAIGuard: Agentic LLMs for Minor Protection in Digital Spaces

Abstract:
Social media has become integral to minors' daily lives and is used for various purposes, such as making friends, exploring shared interests, and engaging in educational activities. However, the increase in screen time has also led to heightened challenges, including cyberbullying, online grooming, and exploitations posed by malicious actors. Traditional content moderation techniques have proven ineffective against exploiters' evolving tactics. To address these growing challenges, we propose the EdgeAIGuard content moderation approach that is designed to protect minors from online grooming and various forms of digital exploitation. The proposed method comprises a multi‑agent architecture deployed strategically at the network edge to enable rapid detection with low latency and prevent harmful content targeting minors. The experimental results show the proposed method is significantly more effective than the existing approaches.

PaperID: 11424, https://arxiv.org/pdf/2503.00074.pdf

Authors: Jonas le Fevre Sejersen, Erdal Kayacan

Title: CAMETA: Conflict-Aware Multi-Agent Estimated Time of Arrival Prediction for Mobile Robots

Abstract:
This study presents the conflict‑aware multi‑agent estimated time of arrival (CAMETA) framework, a novel approach for predicting the arrival times of multiple agents in unstructured environments without predefined road infrastructure. The CAMETA framework consists of three components: a path planning layer generating potential path suggestions, a multi‑agent ETA prediction layer predicting the arrival times for all agents based on the paths, and lastly, a path selection layer that calculates the accumulated cost and selects the best path. The novelty of the CAMETA framework lies in the heterogeneous map representation and the heterogeneous graph neural network architecture. As a result of the proposed novel structure, CAMETA improves the generalization capability compared to the state‑of‑the‑art methods that rely on structured road infrastructure and historical data. The simulation results demonstrate the efficiency and efficacy of the multi‑agent ETA prediction layer, with a mean average percentage error improvement of 29.5% and 44% when compared to a traditional path planning method (A ) which does not consider conflicts. The performance of the CAMETA framework shows significant improvements in terms of robustness to noise and conflicts as well as determining proficient routes compared to state‑of‑the‑art multi‑agent path planners.

PaperID: 11425, https://arxiv.org/pdf/2503.00056.pdf

Authors: Henry Lei, Zachary S. Lippay, Anonto Zaman, Joshua Aurand, Amin Maghareh, Sean Phillips

Title: Stability Analysis of Deep Reinforcement Learning for Multi-Agent Inspection in a Terrestrial Testbed

Abstract:
The design and deployment of autonomous systems for space missions require robust solutions to navigate strict reliability constraints, extended operational duration, and communication challenges. This study evaluates the stability and performance of a hierarchical deep reinforcement learning (DRL) framework designed for multi‑agent satellite inspection tasks. The proposed framework integrates a high‑level guidance policy with a low‑level motion controller, enabling scalable task allocation and efficient trajectory execution. Experiments conducted on the Local Intelligent Network of Collaborative Satellites (LINCS) testbed assess the framework's performance under varying levels of fidelity, from simulated environments to a cyber‑physical testbed. Key metrics, including task completion rate, distance traveled, and fuel consumption, highlight the framework's robustness and adaptability despite real‑world uncertainties such as sensor noise, dynamic perturbations, and runtime assurance (RTA) constraints. The results demonstrate that the hierarchical controller effectively bridges the sim‑to‑real gap, maintaining high task completion rates while adapting to the complexities of real‑world environments. These findings validate the framework's potential for enabling autonomous satellite operations in future space missions.

PaperID: 11426, https://arxiv.org/pdf/2502.21300.pdf

Authors: Kaleb Mcdowell, Nick Waytowich, Javier Garcia, Stephen Gordon, Bryce Bartlett, Jeremy Gaston

Title: Hybrid Team Tetris: A New Platform For Hybrid Multi-Agent, Multi-Human Teaming

Abstract:
Metcalfe et al (1) argue that the greatest potential for human‑AI partnerships lies in their application to highly complex problem spaces. Herein, we discuss three different forms of hybrid team intelligence and posit that across all three forms, the hybridization of man and machine intelligence can be effective under the right conditions. We foresee two significant research and development (R&D) challenges underlying the creation of effective hybrid intelligence. First, rapid advances in machine intelligence and/or fundamental changes in human behaviors or capabilities over time can outpace R&D. Second, the future conditions under which hybrid intelligence will operate are unknown, but unlikely to be the same as the conditions of today. Overcoming both of these challenges requires a deep understanding of multiple human‑centric and machine‑centric disciplines that creates a large barrier to entry into the field. Herein, we outline an open, shareable research platform that creates a form of hybrid team intelligence that functions under representative future conditions. The intent for the platform is to facilitate new forms of hybrid intelligence research allowing individuals with human‑centric or machine‑centric backgrounds to rapidly enter the field and initiate research. Our hope is that through open, community research on the platform, state‑of‑the‑art advances in human and machine intelligence can quickly be communicated across what are currently different R&D communities and allow hybrid team intelligence research to stay at the forefront of scientific advancement.

PaperID: 11427, https://arxiv.org/pdf/2502.21297.pdf

Authors: Dingyi Zhang, Linhai Zhang, Fanglei Qu, Ziqing Zhuang, Deyu Zhou

Title: Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind

Abstract:
Persuasive dialogue is central to human communication, yet existing datasets often rely on a single language model generating both roles, producing unrealistic interactions that violate the double‑blind nature of persuasion. To overcome this, we propose ToMMA, a multi‑agent framework guided by causal Theory of Mind that enforces role separation and prevents information leakage. Using ToMMA, we build CToMPersu, a large‑scale multi‑turn, multi‑domain dataset capturing realistic persuasion dynamics. Automatic evaluations show that CToMPersu produces more coherent and persuasive dialogues than prior datasets. Furthermore, when used as a knowledge base, CToMPersu significantly enhances the persuasive performance of large language models, as confirmed by both automatic and human evaluations.

PaperID: 11428, https://arxiv.org/pdf/2502.21267.pdf

Authors: Alexander Scarlatos, Yusong Wu, Ian Simon, Adam Roberts, Tim Cooijmans, Natasha Jaques, Cassie Tarakajian, Cheng-Zhi Anna Huang

Title: ReaLJam: Real-Time Human-AI Music Jamming with Reinforcement Learning-Tuned Transformers

Abstract:
Recent advances in generative artificial intelligence (AI) have created models capable of high‑quality musical content generation. However, little consideration is given to how to use these models for real‑time or cooperative jamming musical applications because of crucial required features: low latency, the ability to communicate planned actions, and the ability to adapt to user input in real‑time. To support these needs, we introduce ReaLJam, an interface and protocol for live musical jamming sessions between a human and a Transformer‑based AI agent trained with reinforcement learning. We enable real‑time interactions using the concept of anticipation, where the agent continually predicts how the performance will unfold and visually conveys its plan to the user. We conduct a user study where experienced musicians jam in real‑time with the agent through ReaLJam. Our results demonstrate that ReaLJam enables enjoyable and musically interesting sessions, and we uncover important takeaways for future work.

PaperID: 11429, https://arxiv.org/pdf/2502.21250.pdf

Authors: Nijesh Upreti, Jessica Ciupa, Vaishak Belle

Title: Towards Developing Ethical Reasoners: Integrating Probabilistic Reasoning and Decision-Making for Complex AI Systems

Abstract:
A computational ethics framework is essential for AI and autonomous systems operating in complex, real‑world environments. Existing approaches often lack the adaptability needed to integrate ethical principles into dynamic and ambiguous contexts, limiting their effectiveness across diverse scenarios. To address these challenges, we outline the necessary ingredients for building a holistic, meta‑level framework that combines intermediate representations, probabilistic reasoning, and knowledge representation. The specifications therein emphasize scalability, supporting ethical reasoning at both individual decision‑making levels and within the collective dynamics of multi‑agent systems. By integrating theoretical principles with contextual factors, it facilitates structured and context‑aware decision‑making, ensuring alignment with overarching ethical standards. We further explore proposed theorems outlining how ethical reasoners should operate, offering a foundation for practical implementation. These constructs aim to support the development of robust and ethically reliable AI systems capable of navigating the complexities of real‑world moral decision‑making scenarios.

PaperID: 11430, https://arxiv.org/pdf/2502.21208.pdf

Authors: Pedro Gimenes, Zeyu Cao, Jeffrey Wong, Yiren Zhao

Title: ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments

Abstract:
Recent research has shown that LLM performance on reasoning tasks can be enhanced by scaling test‑time compute. One promising approach, particularly with decomposable problems, involves arranging intermediate solutions as a graph on which transformations are performed to explore the solution space. However, prior works rely on pre‑determined, task‑specific transformation schedules which are subject to a set of searched hyperparameters. In this work, we view thought graph transformations as actions in a Markov decision process, and implement policy agents to drive effective action policies for the underlying reasoning LLM agent. In particular, we investigate the ability for another LLM to act as a policy agent on thought graph environments and introduce ARIES, a multi‑agent architecture for reasoning with LLMs. In ARIES, reasoning LLM agents solve decomposed subproblems, while policy LLM agents maintain visibility of the thought graph states, and dynamically adapt the problem‑solving strategy. Through extensive experiments, we observe that using off‑the‑shelf LLMs as policy agents with no supervised fine‑tuning (SFT) can yield up to 29% higher accuracy on HumanEval relative to static transformation schedules, as well as reducing inference costs by 35% and avoid any search requirements. We also conduct a thorough analysis of observed failure modes, highlighting that limitations on LLM sizes and the depth of problem decomposition can be seen as challenges to scaling LLM‑guided reasoning.

PaperID: 11431, https://arxiv.org/pdf/2502.21087.pdf

Authors: Hansi Yang, Qi Zhang, Wei Jiang, Jianguo Li

Title: PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information

Abstract:
Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often encounter hallucination issues on questions that require professional and up‑to‑date knowledge. To address this limitation, retrieval‑augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform their responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text database or knowledge graphs, and cannot well handle real‑world questions on semi‑structured data containing both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi‑structured data to answer questions. PASemiQA first generates a plan to identify relevant text and relational information to answer the question in semi‑structured data, and then uses an LLM agent to traverse the semi‑structured data and extract necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi‑structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi‑structured data.

PaperID: 11432, https://arxiv.org/pdf/2502.20859.pdf

Authors: Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang

Title: The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents

Abstract:
Large language models (LLMs) excel in both closed tasks (including problem‑solving, and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real‑world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of ``human simulation'', addressing three core questions: (1) How do personality traits affect problem‑solving in closed tasks? (2) How do traits shape creativity in open tasks? (3) How does single‑agent performance influence multi‑agent collaboration? By assigning Big Five personality traits to LLM agents and evaluating their performance in single‑ and multi‑agent settings, we reveal that specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Furthermore, multi‑agent systems exhibit collective intelligence distinct from individual capabilities, driven by distinguishing combinations of personalities.

PaperID: 11433, https://arxiv.org/pdf/2502.20790.pdf

Authors: Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li

Title: Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

Abstract:
Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long‑context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain‑of‑Thought (CoT) prompting has shown promise for multi‑step reasoning, its effectiveness for long‑context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long‑context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process‑supervised framework that teaches models to generate high‑quality reasoning paths for enhanced long‑context performance. Our framework incorporates a self‑sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long‑context scenarios. Experimental results on various long‑context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in‑domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross‑domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.

PaperID: 11434, https://arxiv.org/pdf/2502.20724.pdf

Authors: Arup Kumar Sarker, Aymen Alsaadi, Alexander James Halpern, Prabhath Tangella, Mikhail Titov, Niranda Perera, Mills Staylor, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

Title: Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

Abstract:
Significant obstacles exist in scientific domains including genetics, climate modeling, and astronomy due to the management, preprocess, and training on complicated data for deep learning. Even while several large‑scale solutions offer distributed execution environments, open‑source alternatives that integrate scalable runtime tools, deep learning and data frameworks on high‑performance computing platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical‑Cylon(RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the usage of communication libraries like MPI, GLOO and NCCL across multi‑node setups, and facilitates parallel and distributed deep learning pipelines by utilizing Radical Pilot as a task execution framework. By attaining an end‑to‑end pipeline including preprocessing, model training, and postprocessing with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system reduces 3.28 and 75.9 seconds, respectively. The design of Deep RC guarantees the smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, exhibiting strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high‑performance solution for resource‑intensive applications, this method closes the gap between data preprocessing, model training, and postprocessing.

PaperID: 11435, https://arxiv.org/pdf/2502.20689.pdf

Authors: Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang, Jie Chen

Title: WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis

Abstract:
Large Language Models (LLMs) offer promising opportunities to support mental healthcare workflows, yet they often lack the structured clinical reasoning needed for reliable diagnosis and may struggle to provide the emotionally attuned communication essential for patient trust. Here, we introduce WiseMind, a novel multi‑agent framework inspired by the theory of Dialectical Behavior Therapy designed to facilitate psychiatric assessment. By integrating a "Reasonable Mind" Agent for evidence‑based logic and an "Emotional Mind" Agent for empathetic communication, WiseMind effectively bridges the gap between instrumental accuracy and humanistic care. Our framework utilizes a Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM‑5)‑guided Structured Knowledge Graph to steer diagnostic inquiries, significantly reducing hallucinations compared to standard prompting methods. Using a combination of virtual standard patients, simulated interactions, and real human interaction datasets, we evaluate WiseMind across three common psychiatric conditions. WiseMind outperforms state‑of‑the‑art LLM methods in both identifying critical diagnostic nodes and establishing accurate differential diagnoses. Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top‑1 diagnostic accuracy, approaching reported diagnostic performance ranges of board‑certified psychiatrists and surpassing knowledge‑enhanced single‑agent baselines by 15‑54 percentage points. Expert review by psychiatrists further validates that WiseMind generates responses that are not only clinically sound but also psychologically supportive, demonstrating the feasibility of empathetic, reliable AI agents to conduct psychiatric assessments under appropriate human oversight.

PaperID: 11436, https://arxiv.org/pdf/2502.20592.pdf

Authors: Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Chenyu You, Shafiq Joty, Giuseppe Carenini

Title: Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Abstract:
Recent advances in test‑time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi‑Document Summarization (MDS), a fundamental task in NLG, presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single "best" prompt can satisfy diverse summarization requirements. We propose a novel framework leveraging test‑time scaling for MDS. Our approach employs prompt ensemble techniques to generate multiple candidate summaries using various prompts, then combines them with an aggregator to produce a refined summary. To evaluate our method effectively, we also introduce two new LLM‑based metrics: the Consistency‑Aware Preference (CAP) score and LLM Atom‑Content‑Unit (LLM‑ACU) score, which assess summary quality while addressing the positional bias inherent in traditional automatic evaluation. Our extensive experiments demonstrate that this framework significantly enhances summary quality while also revealing the practical scaling boundaries to MDS tasks.

PaperID: 11437, https://arxiv.org/pdf/2502.20554.pdf

Authors: Henry Lei, Joshua Aurand, Zachary S. Lippay, Sean Phillips

Title: Close-Proximity Satellite Operations through Deep Reinforcement Learning and Terrestrial Testing Environments

Abstract:
With the increasingly congested and contested space environment, safe and effective satellite operation has become increasingly challenging. As a result, there is growing interest in autonomous satellite capabilities, with common machine learning techniques gaining attention for their potential to address complex decision‑making in the space domain. However, the "black‑box" nature of many of these methods results in difficulty understanding the model's input/output relationship and more specifically its sensitivity to environmental disturbances, sensor noise, and control intervention. This paper explores the use of Deep Reinforcement Learning (DRL) for satellite control in multi‑agent inspection tasks. The Local Intelligent Network of Collaborative Satellites (LINCS) Lab is used to test the performance of these control algorithms across different environments, from simulations to real‑world quadrotor UAV hardware, with a particular focus on understanding their behavior and potential degradation in performance when deployed beyond the training environment.

PaperID: 11438, https://arxiv.org/pdf/2502.20462.pdf

Authors: Leopoldo Agorio, Sean Van Alen, Santiago Paternain, Miguel Calvo-Fullana, Juan Andres Bazerque

Title: Cooperative Multi-Agent Assignment over Stochastic Graphs via Constrained Reinforcement Learning

Abstract:
Constrained multi‑agent reinforcement learning offers the framework to design scalable and almost surely feasible solutions for teams of agents operating in dynamic environments to carry out conflicting tasks. We address the challenges of multi‑agent coordination through an unconventional formulation in which the dual variables are not driven to convergence but are free to cycle, enabling agents to adapt their policies dynamically based on real‑time constraint satisfaction levels. The coordination relies on a light single‑bit communication protocol over a network with stochastic connectivity. Using this gossiped information, agents update local estimates of the dual variables. Furthermore, we modify the local dual dynamics by introducing a contraction factor, which lets us use finite communication buffers and keep the estimation error bounded. Under this model, we provide theoretical guarantees of almost sure feasibility and corroborate them with numerical experiments in which a team of robots successfully patrols multiple regions, communicating under a time‑varying ad‑hoc network.

PaperID: 11439, https://arxiv.org/pdf/2502.20383.pdf

Authors: Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen

Title: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

Abstract:
Recent advancements in Web AI agents have demonstrated remarkable capabilities in addressing complex web navigation tasks. However, emerging research shows that these agents exhibit greater vulnerability compared to standalone Large Language Models (LLMs), despite both being built upon the same safety‑aligned models. This discrepancy is particularly concerning given the greater flexibility of Web AI Agent compared to standalone LLMs, which may expose them to a wider range of adversarial user inputs. To build a scaffold that addresses these concerns, this study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents. Notably, this disparity stems from the multifaceted differences between Web AI agents and standalone LLMs, as well as the complex signals ‑ nuances that simple evaluation metrics, such as success rate, often fail to capture. To tackle these challenges, we propose a component‑level analysis and a more granular, systematic evaluation framework. Through this fine‑grained investigation, we identify three critical factors that amplify the vulnerability of Web AI agents; (1) embedding user goals into the system prompt, (2) multi‑step action generation, and (3) observational capabilities. Our findings highlights the pressing need to enhance security and robustness in AI agent design and provide actionable insights for targeted defense strategies.

PaperID: 11440, https://arxiv.org/pdf/2502.20379.pdf

Authors: Shalev Lifshitz, Sheila A. McIlraith, Yilun Du

Title: Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

Abstract:
By utilizing more computational resources at test‑time, large language models (LLMs) can improve without additional training. One common strategy uses verifiers to evaluate candidate outputs. In this work, we propose a novel scaling dimension for test‑time compute: scaling the number of verifiers. We introduce Multi‑Agent Verification (MAV) as a test‑time compute paradigm that combines multiple verifiers to improve performance. We propose using Aspect Verifiers (AVs), off‑the‑shelf LLMs prompted to verify different aspects of outputs, as one possible choice for the verifiers in a MAV system. AVs are a convenient building block for MAV since they can be easily combined without additional training. Moreover, we introduce BoN‑MAV, a simple multi‑agent verification algorithm that combines best‑of‑n sampling with multiple verifiers. BoN‑MAV demonstrates stronger scaling patterns than self‑consistency and reward model verification, and we demonstrate both weak‑to‑strong generalization, where combining weak verifiers improves even stronger LLMs, and self‑improvement, where the same base model is used to both generate and verify outputs. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test‑time.

PaperID: 11441, https://arxiv.org/pdf/2502.20369.pdf

Authors: Jens Høigaard Jensen, Kristoffer Plagborg Bak Sørensen, Jonas le Fevre Sejersen, Andriy Sarabakha

Title: Multi-Agent Path Planning in Complex Environments using Gaussian Belief Propagation with Global Path Finding

Abstract:
Multi‑agent path planning is a critical challenge in robotics, requiring agents to navigate complex environments while avoiding collisions and optimizing travel efficiency. This work addresses the limitations of existing approaches by combining Gaussian belief propagation with path integration and introducing a novel tracking factor to ensure strict adherence to global paths. The proposed method is tested with two different global path‑planning approaches: rapidly exploring random trees and a structured planner, which leverages predefined lane structures to improve coordination. A simulation environment was developed to validate the proposed method across diverse scenarios, each posing unique challenges in navigation and communication. Simulation results demonstrate that the tracking factor reduces path deviation by 28% in single‑agent and 16% in multi‑agent scenarios, highlighting its effectiveness in improving multi‑agent coordination, especially when combined with structured global planning.

PaperID: 11442, https://arxiv.org/pdf/2502.20353.pdf

Authors: Aron Harder, Madhur Behl

Title: Trajectory-to-Action Pipeline (TAP): Automated Scenario Description Extraction for Autonomous Vehicle Behavior Comparison

Abstract:
Scenario Description Languages (SDLs) provide structured, interpretable embeddings that represent traffic scenarios encountered by autonomous vehicles (AVs), supporting key tasks such as scenario similarity searches and edge case detection for safety analysis. This paper introduces the Trajectory‑to‑Action Pipeline (TAP), a scalable and automated method for extracting SDL labels from large trajectory datasets. TAP applies a rules‑based cross‑entropy optimization approach to learn parameters directly from data, enhancing generalization across diverse driving contexts. Using the Waymo Open Motion Dataset (WOMD), TAP achieves 30% greater precision than Average Displacement Error (ADE) and 24% over Dynamic Time Warping (DTW) in identifying behaviorally similar trajectories. Additionally, TAP enables automated detection of unique driving behaviors, streamlining safety evaluation processes for AV testing. This work provides a foundation for scalable scenario‑based AV behavior analysis, with potential extensions for integrating multi‑agent contexts.

PaperID: 11443, https://arxiv.org/pdf/2502.20301.pdf

Authors: Jinghao Feng, Qiaoyu Zheng, Chaoyi Wu, Ziheng Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie

Title: M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging

Abstract:
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well‑prepared tools limits their applicability in the medical domain, which requires to train specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi‑agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi‑step medical ML workflows, from automated data processing and environment configuration to self‑contained auto debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment designed to provide agents with free‑text descriptions of datasets, training codes, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state‑of‑the‑art large language models serving as agent cores for our system, such as Claude series, GPT‑4o, and DeepSeek‑V3. Compared to existing ML agentic designs, M3Builder shows superior performance on completing ML tasks in medical imaging, achieving a 94.29% success rate using Claude‑3.7‑Sonnet as the agent core, showing huge potential towards fully automated machine learning in medical imaging.

PaperID: 11444, https://arxiv.org/pdf/2502.20217.pdf

Authors: Jimmy Chiun, Shizhe Zhang, Yizhuo Wang, Yuhong Cao, Guillaume Sartoretti

Title: MARVEL: Multi-Agent Reinforcement Learning for constrained field-of-View multi-robot Exploration in Large-scale environments

Abstract:
In multi‑robot exploration, a team of mobile robot is tasked with efficiently mapping an unknown environments. While most exploration planners assume omnidirectional sensors like LiDAR, this is impractical for small robots such as drones, where lightweight, directional sensors like cameras may be the only option due to payload constraints. These sensors have a constrained field‑of‑view (FoV), which adds complexity to the exploration problem, requiring not only optimal robot positioning but also sensor orientation during movement. In this work, we propose MARVEL, a neural framework that leverages graph attention networks, together with novel frontiers and orientation features fusion technique, to develop a collaborative, decentralized policy using multi‑agent reinforcement learning (MARL) for robots with constrained FoV. To handle the large action space of viewpoints planning, we further introduce a novel information‑driven action pruning strategy. MARVEL improves multi‑robot coordination and decision‑making in challenging large‑scale indoor environments, while adapting to various team sizes and sensor configurations (i.e., FoV and sensor range) without additional training. Our extensive evaluation shows that MARVEL's learned policies exhibit effective coordinated behaviors, outperforming state‑of‑the‑art exploration planners across multiple metrics. We experimentally demonstrate MARVEL's generalizability in large‑scale environments, of up to 90m by 90m, and validate its practical applicability through successful deployment on a team of real drone hardware.

PaperID: 11445, https://arxiv.org/pdf/2502.20140.pdf

Authors: Max M. Lang, Sol Eskenazi

Title: Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale

Abstract:
Telephone surveys remain a valuable tool for gathering insights but typically require substantial resources in training and coordinating human interviewers. This work presents an AI‑driven telephone survey system integrating text‑to‑speech (TTS), a large language model (LLM), and speech‑to‑text (STT) that mimics the versatility of human‑led interviews (full‑duplex dialogues) at scale. We tested the system across two populations, a pilot study in the United States (n = 75) and a large‑scale deployment in Peru (n = 2,739), inviting participants via web‑based links and contacting them via direct phone calls. The AI agent successfully administered open‑ended and closed‑ended questions, handled basic clarifications, and dynamically navigated branching logic, allowing fast large‑scale survey deployment without interviewer recruitment or training. Our findings demonstrate that while the AI system's probing for qualitative depth was more limited than human interviewers, overall data quality approached human‑led standards for structured items. This study represents one of the first successful large‑scale deployments of an LLM‑based telephone interviewer in a real‑world survey context. The AI‑powered telephone survey system has the potential for expanding scalable, consistent data collecting across market research, social science, and public opinion studies, thus improving operational efficiency while maintaining appropriate data quality for research.

PaperID: 11446, https://arxiv.org/pdf/2502.20068.pdf

Authors: Tianyang Qi, Shibo Chen, Jun Zhang

Title: A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation

Abstract:
With the widespread adoption of electric vehicles (EVs), navigating for EV drivers to select a cost‑effective charging station has become an important yet challenging issue due to dynamic traffic conditions, fluctuating electricity prices, and potential competition from other EVs. The state‑of‑the‑art deep reinforcement learning (DRL) algorithms for solving this task still require global information about all EVs at the execution stage, which not only increases communication costs but also raises privacy issues among EV drivers. To overcome these drawbacks, we introduce a novel generative model‑enhanced multi‑agent DRL algorithm that utilizes only the EV's local information while achieving performance comparable to these state‑of‑the‑art algorithms. Specifically, the policy network is implemented on the EV side, and a Conditional Variational Autoencoder‑Long Short Term Memory (CVAE‑LSTM)‑based recommendation model is developed to provide recommendation information. Furthermore, a novel future charging competition encoder is designed to effectively compress global information, enhancing training performance. The multi‑gradient descent algorithm (MGDA) is also utilized to adaptively balance the weight between the two parts of the training objective, resulting in a more stable training process. Simulations are conducted based on a practical area in Xián, China. Experimental results show that our proposed algorithm, which relies on local information, outperforms existing local information‑based methods and achieves less than 8% performance loss compared to global information‑based methods.

PaperID: 11447, https://arxiv.org/pdf/2502.20065.pdf

Authors: Ahmet Onur Akman, Anastasia Psarou, Łukasz Gorczyca, Zoltán György Varga, Grzegorz Jamróz, Rafał Kucharski

Title: RouteRL: Multi-agent reinforcement learning framework for urban route choice with autonomous vehicles

Abstract:
RouteRL is a novel framework that integrates multi‑agent reinforcement learning (MARL) with a microscopic traffic simulation, facilitating the testing and development of efficient route choice strategies for autonomous vehicles (AVs). The proposed framework simulates the daily route choices of driver agents in a city, including two types: human drivers, emulated using behavioral route choice models, and AVs, modeled as MARL agents optimizing their policies for a predefined objective. RouteRL aims to advance research in MARL, transport modeling, and human‑AI interaction for transportation applications. This study presents a technical report on RouteRL, outlines its potential research contributions, and showcases its impact via illustrative examples.

PaperID: 11448, https://arxiv.org/pdf/2502.19918.pdf

Authors: Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi

Title: Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Abstract:
Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi‑step reasoning tasks. While recent advancements on prompting and post‑training have enabled LLMs to perform step‑wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta‑Reasoner, a new framework that empowers LLMs to "think about how to think". It optimizes the inference process by dynamically adapting reasoning strategies in real‑time. Our approach employs contextual multi‑armed bandits (CMABs) to learn an adaptive policy. It learns to evaluate the current state of LLM's reasoning and determine optimal strategy that is most likely to lead to a successful outcome during inference, like whether to backtrack, switch to a new approach, or restart the problem‑solving process. This meta‑guidance helps avoid unproductive paths exploration during inference and hence improves computational efficiency. We evaluate Meta‑Reasoner on math problems (e.g., Game‑of‑24, TheoremQA) and scientific tasks (e.g., SciBench). Results show that our method outperform previous SOTA methods by 9‑12% in accuracy, while reducing inference time by 28‑35% under the same compute budget. Additional experiments on creative writing demonstrate the generalizability of our approach to diverse reasoning‑intensive tasks.

PaperID: 11449, https://arxiv.org/pdf/2502.19899.pdf

Authors: Megha Srivastava, Reihaneh Iranmanesh, Yuchen Cui, Deepak Gopinath, Emily Sumner, Andrew Silva, Laporsha Dees, Guy Rosman, Dorsa Sadigh

Title: Shared Autonomy for Proximal Teaching

Abstract:
Motor skill learning often requires experienced professionals who can provide personalized instruction. Unfortunately, the availability of high‑quality training can be limited for specialized tasks, such as high performance racing. Several recent works have leveraged AI‑assistance to improve instruction of tasks ranging from rehabilitation to surgical robot tele‑operation. However, these works often make simplifying assumptions on the student learning process, and fail to model how a teacher's assistance interacts with different individuals' abilities when determining optimal teaching strategies. Inspired by the idea of scaffolding from educational psychology, we leverage shared autonomy, a framework for combining user inputs with robot autonomy, to aid with curriculum design. Our key insight is that the way a student's behavior improves in the presence of assistance from an autonomous agent can highlight which sub‑skills might be most ``learnable'' for the student, or within their Zone of Proximal Development. We use this to design Z‑COACH, a method for using shared autonomy to provide personalized instruction targeting interpretable task sub‑skills. In a user study (n=50), where we teach high performance racing in a simulated environment of the Thunderhill Raceway Park with the CARLA Autonomous Driving simulator, we show that Z‑COACH helps identify which skills each student should first practice, leading to an overall improvement in driving time, behavior, and smoothness. Our work shows that increasingly available semi‑autonomous capabilities (e.g. in vehicles, robots) can not only assist human users, but also help teach them.

PaperID: 11450, https://arxiv.org/pdf/2502.19860.pdf

Authors: Yujia Chen, Changsong Li, Yiming Wang, Tianjie Ju, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan

Title: MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

Abstract:
Mental health issues are worsening in today's competitive society, such as depression and anxiety. Traditional healings like counseling and chatbots fail to engage effectively, they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human‑like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human‑like adaptability and warmth. To fill this gap, we propose the MIND (Multi‑agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role‑playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real‑world healing dimensions, and find that MIND provides a more user‑friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.

PaperID: 11451, https://arxiv.org/pdf/2502.19675.pdf

Authors: Yiyang Zhu, Jiayi Zhang, Enyu Shi, Ziheng Liu, Chau Yuen, Bo Ai

Title: Joint Power Allocation and Phase Shift Design for Stacked Intelligent Metasurfaces-aided Cell-Free Massive MIMO Systems with MARL

Abstract:
Cell‑free (CF) massive multiple‑input multiple‑output (mMIMO) systems offer high spectral efficiency (SE) through multiple distributed access points (APs). However, the large number of antennas increases power consumption. We propose incorporating stacked intelligent metasurfaces (SIM) into CF mMIMO systems as a cost‑effective, energy‑efficient solution. This paper focuses on optimizing the joint power allocation of APs and the phase shift of SIMs to maximize the sum SE. To address this complex problem, we introduce a fully distributed multi‑agent reinforcement learning (MARL) algorithm. Our novel algorithm, the noisy value method with a recurrent policy in multi‑agent policy optimization (NVR‑MAPPO), enhances performance by encouraging diverse exploration under centralized training and decentralized execution. Simulations demonstrate that NVR‑MAPPO significantly improves sum SE and robustness across various scenarios.

PaperID: 11452, https://arxiv.org/pdf/2502.19652.pdf

Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos

Title: Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning

Abstract:
Driven by inherent uncertainty and the sim‑to‑real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent‑environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchmarks for robust RL. Current robust RL policies often focus on a specific type of uncertainty and are evaluated in distinct, one‑off environments. In this work, we introduce Robust‑Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components‑agents' observed state and reward, agents' actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi‑agent RL, it provides an open‑source and user‑friendly tool for the community to assess current methods and foster the development of robust RL algorithms. In addition, we benchmark existing standard and robust RL algorithms within this framework, uncovering significant deficiencies in each and offering new insights.

PaperID: 11453, https://arxiv.org/pdf/2502.19629.pdf

Authors: Tiffany J. Callahan, Nathaniel H. Park, Sara Capponi

Title: Agentic Mixture-of-Workflows for Multi-Modal Chemical Search

Abstract:
The vast and complex materials design space demands innovative strategies to integrate multidisciplinary scientific knowledge and optimize materials discovery. While large language models (LLMs) have demonstrated promising reasoning and automation capabilities across various domains, their application in materials science remains limited due to a lack of benchmarking standards and practical implementation frameworks. To address these challenges, we introduce Mixture‑of‑Workflows for Self‑Corrective Retrieval‑Augmented Generation (CRAG‑MoW) ‑ a novel paradigm that orchestrates multiple agentic workflows employing distinct CRAG strategies using open‑source LLMs. Unlike prior approaches, CRAG‑MoW synthesizes diverse outputs through an orchestration agent, enabling direct evaluation of multiple LLMs across the same problem domain. We benchmark CRAG‑MoWs across small molecules, polymers, and chemical reactions, as well as multi‑modal nuclear magnetic resonance (NMR) spectral retrieval. Our results demonstrate that CRAG‑MoWs achieve performance comparable to GPT‑4o while being preferred more frequently in comparative evaluations, highlighting the advantage of structured retrieval and multi‑agent synthesis. By revealing performance variations across data types, CRAG‑MoW provides a scalable, interpretable, and benchmark‑driven approach to optimizing AI architectures for materials discovery. These insights are pivotal in addressing fundamental gaps in benchmarking LLMs and autonomous AI agents for scientific applications.

PaperID: 11454, https://arxiv.org/pdf/2502.19592.pdf

Authors: Hongrui Zhao, Boris Ivanovic, Negar Mehr

Title: RAMEN: Real-time Asynchronous Multi-agent Neural Implicit Mapping

Abstract:
Multi‑agent neural implicit mapping allows robots to collaboratively capture and reconstruct complex environments with high fidelity. However, existing approaches often rely on synchronous communication, which is impractical in real‑world scenarios with limited bandwidth and potential communication interruptions. This paper introduces RAMEN: Real‑time Asynchronous Multi‑agEnt Neural implicit mapping, a novel approach designed to address this challenge. RAMEN employs an uncertainty‑weighted multi‑agent consensus optimization algorithm that accounts for communication disruptions. When communication is lost between a pair of agents, each agent retains only an outdated copy of its neighbor's map, with the uncertainty of this copy increasing over time since the last communication. Using gradient update information, we quantify the uncertainty associated with each parameter of the neural network map. Neural network maps from different agents are brought to consensus on the basis of their levels of uncertainty, with consensus biased towards network parameters with lower uncertainty. To achieve this, we derive a weighted variant of the decentralized consensus alternating direction method of multipliers (C‑ADMM) algorithm, facilitating robust collaboration among agents with varying communication and update frequencies. Through extensive evaluations on real‑world datasets and robot hardware experiments, we demonstrate RAMEN's superior mapping performance under challenging communication conditions.

PaperID: 11455, https://arxiv.org/pdf/2502.19559.pdf

Authors: Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp

Title: Stay Focused: Problem Drift in Multi-Agent Debate

Abstract:
Multi‑agent debate ‑ multiple instances of large language models discussing problems in turn‑based interaction ‑ has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations when solving complex problems that require longer reasoning chains. We analyze how multi‑agent debate drifts away from the initial problem over multiple turns, thus harming task performance. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction‑following task). We find that generative tasks drift often due to the subjectivity of the answer space (76‑89%), compared to high‑complexity tasks (7‑21%). To identify the reasons, eight human experts analyze 170 multi‑agent debates suffering from problem drift. We find the most common issues related to this drift are the lack of progress (35% of cases), low‑quality feedback (26% of cases), and a lack of clarity (25% of cases). We propose DRIFTJudge, an LLM‑as‑a‑judge method, as a first baseline to detect problem drift. We also propose DRIFTPolicy, which mitigates 31% of problem drift cases. Our study is a step toward understanding a key limitation of multi‑agent debate, highlighting why longer debates can harm task performance and how problem drift could be addressed.

PaperID: 11456, https://arxiv.org/pdf/2502.19556.pdf

Authors: Joshua Aurand, Christopher Pang, Sina Mokhtar, Henry Lei, Steven Cutlip, Sean Phillips

Title: Assessing Autonomous Inspection Regimes: Active Versus Passive Satellite Inspection

Abstract:
This paper addresses the problem of satellite inspection, where one or more satellites (inspectors) are tasked with imaging or inspecting a resident space object (RSO) due to potential malfunctions or anomalies. Inspection strategies are often reduced to a discretized action space with predefined waypoints, facilitating tractability in both classical optimization and machine learning based approaches. However, this discretization can lead to suboptimal guidance in certain scenarios. This study presents a comparative simulation to explore the tradeoffs of passive versus active strategies in multi‑agent missions. Key factors considered include RSO dynamic mode, state uncertainty, unmodeled entrance criteria, and inspector motion types. The evaluation is conducted with a focus on fuel utilization and surface coverage. Building on a Monte‑Carlo based evaluator of passive strategies and a reinforcement learning framework for training active inspection policies, this study investigates conditions under which passive strategies, such as Natural Motion Circumnavigation (NMC), may perform comparably to active strategies like Reinforcement Learning based waypoint transfers.

PaperID: 11457, https://arxiv.org/pdf/2502.19519.pdf

Authors: Nicolai Hejlesen Jørgensen, Sarmilan Tharmabalan, Ilhan Aslan, Nicolai Brodersen Hansen, Timothy Merritt

Title: Static Vs. Agentic Game Master AI for Facilitating Solo Role-Playing Experiences

Abstract:
This paper presents a game master AI for single‑player role‑playing games. The AI is designed to deliver interactive text‑based narratives and experiences typically associated with multiplayer tabletop games like Dungeons & Dragons. We report on the design process and the series of experiments to improve the functionality and experience design, resulting in two functional versions of the system. While v1 of our system uses simplified prompt engineering, v2 leverages a multi‑agent architecture and the ReAct framework to include reasoning and action. A comparative evaluation demonstrates that v2 as an agentic system maintains play while significantly improving modularity and game experience, including immersion and curiosity. Our findings contribute to the evolution of AI‑driven interactive fiction, highlighting new avenues for enhancing solo role‑playing experiences.

PaperID: 11458, https://arxiv.org/pdf/2502.19417.pdf

Authors: Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn

Title: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Abstract:
Generalist robots that can perform a range of different tasks in open‑world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision‑language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low‑level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single‑arm, dual‑arm, and dual‑arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot

PaperID: 11459, https://arxiv.org/pdf/2502.19193.pdf

Authors: Jinyu Cai, Yusei Ishimizu, Mingyue Zhang, Munan Li, Jialong Li, Kenji Tei

Title: Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

Abstract:
Social media platforms frequently impose restrictive policies to moderate user content, prompting the emergence of creative evasion language strategies. This paper presents a multi‑agent framework based on Large Language Models (LLMs) to simulate the iterative evolution of language strategies under regulatory constraints. In this framework, participant agents, as social media users, continuously evolve their language expression, while supervisory agents emulate platform‑level regulation by assessing policy violations. To achieve a more faithful simulation, we employ a dual design of language strategies (constraint and expression) to differentiate conflicting goals and utilize an LLM‑driven GA (Genetic Algorithm) for the selection, mutation, and crossover of language strategies. The framework is evaluated using two distinct scenarios: an abstract password game and a realistic simulated illegal pet trade scenario. Experimental results demonstrate that as the number of dialogue rounds increases, both the number of uninterrupted dialogue turns and the accuracy of information transmission improve significantly. Furthermore, a user study with 40 participants validates the real‑world relevance of the generated dialogues and strategies. Moreover, ablation studies validate the importance of the GA, emphasizing its contribution to long‑term adaptability and improved overall results.

PaperID: 11460, https://arxiv.org/pdf/2502.19145.pdf

Authors: Pierre Peigne-Lefebvre, Mikolaj Kniejski, Filip Sondej, Matthieu David, Jason Hoelscher-Obermaier, Christian Schroeder de Witt, Esben Kran

Title: Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems

Abstract:
As AI agents are increasingly adopted to collaborate on complex objectives, ensuring the security of autonomous multi‑agent systems becomes crucial. We develop simulations of agents collaborating on shared objectives to study these security risks and security trade‑offs. We focus on scenarios where an attacker compromises one agent, using it to steer the entire system toward misaligned outcomes by corrupting other agents. In this context, we observe infectious malicious prompts ‑ the multi‑hop spreading of malicious instructions. To mitigate this risk, we evaluated several strategies: two "vaccination" approaches that insert false memories of safely handling malicious input into the agents' memory stream, and two versions of a generic safety instruction strategy. While these defenses reduce the spread and fulfillment of malicious instructions in our experiments, they tend to decrease collaboration capability in the agent network. Our findings illustrate potential trade‑off between security and collaborative efficiency in multi‑agent systems, providing insights for designing more secure yet effective AI collaborations.

PaperID: 11461, https://arxiv.org/pdf/2502.19135.pdf

Authors: Enrico Saccon, Ahmet Tikna, Davide De Martini, Edoardo Lamon, Luigi Palopoli, Marco Roveri

Title: A Temporal Planning Framework for Multi-Agent Systems via LLM-Aided Knowledge Base Management

Abstract:
This paper presents a novel framework, called PLANTOR (PLanning with Natural language for Task‑Oriented Robots), that integrates Large Language Models (LLMs) with Prolog‑based knowledge management and planning for multi‑robot tasks. The system employs a two‑phase generation of a robot‑oriented knowledge base, ensuring reusability and compositional reasoning, as well as a three‑step planning procedure that handles temporal dependencies, resource constraints, and parallel task execution via mixed‑integer linear programming. The final plan is converted into a Behaviour Tree for direct use in ROS2. We tested the framework in multi‑robot assembly tasks within a block world and an arch‑building scenario. Results demonstrate that LLMs can produce accurate knowledge bases with modest human feedback, while Prolog guarantees formal correctness and explainability. This approach underscores the potential of LLM integration for advanced robotics tasks requiring flexible, scalable, and human‑understandable planning.

PaperID: 11462, https://arxiv.org/pdf/2502.19130.pdf

Authors: Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

Title: Voting or Consensus? Decision-Making in Multi-Agent Debate

Abstract:
Much of the success of multi‑agent debates depends on carefully choosing the right parameters. The decision‑making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision‑making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time ‑ the decision protocol ‑ to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision‑making by increasing answer diversity, we propose two new methods, All‑Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision‑making in multi‑agent debates beyond scaling.

PaperID: 11463, https://arxiv.org/pdf/2502.19098.pdf

Authors: Erica Cau, Valentina Pansanella, Dino Pedreschi, Giulio Rossetti

Title: Language-Driven Opinion Dynamics in Agent-Based Simulations with LLMs

Abstract:
Understanding how opinions evolve is crucial for addressing issues such as polarization, radicalization, and consensus in social systems. While much research has focused on identifying factors influencing opinion change, the role of language and argumentative fallacies remains underexplored. This paper aims to fill this gap by investigating how language ‑ along with social dynamics ‑ influences opinion evolution through LODAS, a Language‑Driven Opinion Dynamics Model for Agent‑Based Simulations. The model simulates debates around the "Ship of Theseus" paradox, in which agents with discrete opinions interact with each other and evolve their opinions by accepting, rejecting, or ignoring the arguments presented. We study three different scenarios: balanced, polarized, and unbalanced opinion distributions. Agreeableness and sycophancy emerge as two main characteristics of LLM agents, and consensus around the presented statement emerges almost in any setting. Moreover, such AI agents are often producers of fallacious arguments in the attempt of persuading their peers and ‑ for their complacency ‑ they are also highly influenced by arguments built on logical fallacies. These results highlight the potential of this framework not only for simulating social dynamics but also for exploring from another perspective biases and shortcomings of LLMs, which may impact their interactions with humans.

PaperID: 11464, https://arxiv.org/pdf/2502.19091.pdf

Authors: Humza Sami, Mubashir ul Islam, Samy Charas, Asav Gandhi, Pierre-Emmanuel Gaillardon, Valerio Tenace

Title: Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation

Abstract:
Recent advancements in Large Language Models (LLMs) have substantially evolved Multi‑Agent Systems (MASs) capabilities, enabling systems that not only automate tasks but also leverage near‑human reasoning capabilities. To achieve this, LLM‑based MASs need to be built around two critical principles: (i) a robust architecture that fully exploits LLM potential for specific tasks ‑‑ or related task sets ‑‑ and (ii) an effective methodology for equipping LLMs with the necessary capabilities to perform tasks and manage information efficiently. It goes without saying that a priori architectural designs can limit the scalability and domain adaptability of a given MAS. To address these challenges, in this paper we introduce Nexus: a lightweight Python framework designed to easily build and manage LLM‑based MASs. Nexus introduces the following innovations: (i) a flexible multi‑supervisor hierarchy, (ii) a simplified workflow design, and (iii) easy installation and open‑source flexibility: Nexus can be installed via pip and is distributed under a permissive open‑source license, allowing users to freely modify and extend its capabilities. Experimental results demonstrate that architectures built with Nexus exhibit state‑of‑the‑art performance across diverse domains. In coding tasks, Nexus‑driven MASs achieve a 99% pass rate on HumanEval and a flawless 100% on VerilogEval‑Human, outperforming cutting‑edge reasoning language models such as o3‑mini and DeepSeek‑R1. Moreover, these architectures display robust proficiency in complex reasoning and mathematical problem solving, achieving correct solutions for all randomly selected problems from the MATH dataset. In the realm of multi‑objective optimization, Nexus‑based architectures successfully address challenging timing closure tasks on designs from the VTR benchmark suite, while guaranteeing, on average, a power saving of nearly 30%.

PaperID: 11465, https://arxiv.org/pdf/2502.19004.pdf

Authors: Nahom Abishu Hayla, A. Mohammed Seid, Aiman Erbad, Tilahun M. Getu, Ala Al-Fuqaha, Mohsen Guizani

Title: A Multi-Agent DRL-Based Framework for Optimal Resource Allocation and Twin Migration in the Multi-Tier Vehicular Metaverse

Abstract:
Although multi‑tier vehicular Metaverse promises to transform vehicles into essential nodes ‑‑ within an interconnected digital ecosystem ‑‑ using efficient resource allocation and seamless vehicular twin (VT) migration, this can hardly be achieved by the existing techniques operating in a highly dynamic vehicular environment, since they can hardly balance multi‑objective optimization problems such as latency reduction, resource utilization, and user experience (UX). To address these challenges, we introduce a novel multi‑tier resource allocation and VT migration framework that integrates Graph Convolutional Networks (GCNs), a hierarchical Stackelberg game‑based incentive mechanism, and Multi‑Agent Deep Reinforcement Learning (MADRL). The GCN‑based model captures both spatial and temporal dependencies within the vehicular network; the Stackelberg game‑based incentive mechanism fosters cooperation between vehicles and infrastructure; and the MADRL algorithm jointly optimizes resource allocation and VT migration in real time. By modeling this dynamic and multi‑tier vehicular Metaverse as a Markov Decision Process (MDP), we develop a MADRL‑based algorithm dubbed the Multi‑Objective Multi‑Agent Deep Deterministic Policy Gradient (MO‑MADDPG), which can effectively balances the various conflicting objectives. Extensive simulations validate the effectiveness of this algorithm that is demonstrated to enhance scalability, reliability, and efficiency while considerably improving latency, resource utilization, migration cost, and overall UX by 12.8%, 9.7%, 14.2%, and 16.1%, respectively.

PaperID: 11466, https://arxiv.org/pdf/2502.18873.pdf

Authors: Sen Yang, Yafu Li, Wai Lam, Yu Cheng

Title: Multi-LLM Collaborative Search for Complex Problem Solving

Abstract:
Large language models (LLMs) often struggle with complex reasoning tasks due to their limitations in addressing the vast reasoning space and inherent ambiguities of natural language. We propose the Mixture‑of‑Search‑Agents (MoSA) paradigm, a novel approach leveraging the collective expertise of multiple LLMs to enhance search‑based reasoning. MoSA integrates diverse reasoning pathways by combining independent exploration with iterative refinement among LLMs, mitigating the limitations of single‑model approaches. Using Monte Carlo Tree Search (MCTS) as a backbone, MoSA enables multiple agents to propose and aggregate reasoning steps, resulting in improved accuracy. Our comprehensive evaluation across four reasoning benchmarks demonstrates MoSA's consistent performance improvements over single‑agent and other multi‑agent baselines, particularly in complex mathematical and commonsense reasoning tasks.

PaperID: 11467, https://arxiv.org/pdf/2502.18864.pdf

Authors: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan

Title: Towards an AI co-scientist

Abstract:
Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co‑scientist, a multi‑agent system built on Gemini 2.0. The AI co‑scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist‑provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test‑time compute. Key contributions include: (1) a multi‑agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self‑improving hypotheses generation. Automated evaluations show continued benefits of test‑time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti‑microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co‑scientist proposed new epigenetic targets for liver fibrosis, validated by anti‑fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co‑scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co‑timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.

PaperID: 11468, https://arxiv.org/pdf/2502.18836.pdf

Authors: Longling Geng, Edward Y. Chang

Title: REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks

Abstract:
This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi‑agent systems in Real‑world planning and scheduling scenarios. The suite encompasses 14 designed planning and scheduling problems that progress from basic to highly complex, incorporating key aspects such as multi‑agent coordination, inter‑agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter‑dependencies, and the frequency of unexpected disruptions requiring Real‑time adaptation. The benchmark includes 14 detailed problem specifications, 15 comparison methods including Random, LPT, SPT, STPT, MPSR, DRL‑Liu, GP, GEP, LSO, SPT/TWKR, DRL‑Chen, DRL‑Zhang, 2+ evaluation metrics, and baseline implementations using 3+ LLMs including GPT‑4o, Claude‑3.7, DeepSeek‑R1, and 4 contemporary frameworks including LangGraph, AutoGen, CrewAI, and Swarm, enabling rigorous testing of both single‑agent and multi‑agent planning capabilities. Through standardized evaluation criteria and scalable complexity, this benchmark aims to be opened to public, and drive progress in developing more adaptable, robust, and scalable AI planning systems for Real‑world applications.

PaperID: 11469, https://arxiv.org/pdf/2502.18822.pdf

Authors: Huangyuan Su, Aaron Walsman, Daniel Garces, Sham Kakade, Stephanie Gil

Title: Data-Efficient Multi-Agent Spatial Planning with LLMs

Abstract:
In this project, our goal is to determine how to leverage the world‑knowledge of pretrained large language models for efficient and robust learning in multiagent decision making. We examine this in a taxi routing and assignment problem where agents must decide how to best pick up passengers in order to minimize overall waiting time. While this problem is situated on a graphical road network, we show that with the proper prompting zero‑shot performance is quite strong on this task. Furthermore, with limited fine‑tuning along with the one‑at‑a‑time rollout algorithm for look ahead, LLMs can out‑compete existing approaches with 50 times fewer environmental interactions. We also explore the benefits of various linguistic prompting approaches and show that including certain easy‑to‑compute information in the prompt significantly improves performance. Finally, we highlight the LLM's built‑in semantic understanding, showing its ability to adapt to environmental factors through simple prompts.

PaperID: 11470, https://arxiv.org/pdf/2502.18702.pdf

Authors: Zihan Wang, Ziqi Zhao, Yougang Lyu, Zhumin Chen, Maarten de Rijke, Zhaochun Ren

Title: A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition

Abstract:
Zero‑shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero‑shot NER by crafting specialized prompt templates. It advances model self‑learning abilities by incorporating self‑annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity‑based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi‑agent system (CMAS), a novel framework for zero‑shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self‑annotator, (ii) a type‑related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type‑related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self‑reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero‑shot NER performance across six benchmarks, including both domain‑specific and general‑domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few‑shot settings and with various LLM backbones.

PaperID: 11471, https://arxiv.org/pdf/2502.18658.pdf

Authors: Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, Yan Chen

Title: Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

Abstract:
AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade‑offs between increasingly salient AI support: prompt‑only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within‑subject study (N=18), we find that proactive agents increase efficiency compared to prompt‑only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users' awareness of AI processes. We underscore trade‑offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications on AI‑integrated programming workflow.

PaperID: 11472, https://arxiv.org/pdf/2502.18653.pdf

Authors: Hediyeh Baban, Sai A Pidapar, Aashutosh Nema, Sichen Lu

Title: Enhancing Text Classification with a Novel Multi-Agent Collaboration Framework Leveraging BERT

Abstract:
We introduce a novel multi‑agent collaboration framework designed to enhance the accuracy and robustness of text classification models. Leveraging BERT as the primary classifier, our framework dynamically escalates low‑confidence predictions to a specialized multi‑agent system comprising Lexical, Contextual, Logic, Consensus, and Explainability agents. This collaborative approach allows for comprehensive analysis and consensus‑driven decision‑making, significantly improving classification performance across diverse text classification tasks. Empirical evaluations on benchmark datasets demonstrate that our framework achieves a 5.5% increase in accuracy compared to standard BERT‑based classifiers, underscoring its effectiveness and academic novelty in advancing multi‑agent systems within natural language processing.

PaperID: 11473, https://arxiv.org/pdf/2502.18652.pdf

Authors: Fengze Yang, Xiaoyue Cathy Liu, Lingjiu Lu, Bingzhang Wang, Chenxi Dylan Liu

Title: Independent Mobility GPT (IDM-GPT): A Self-Supervised Multi-Agent Large Language Model Framework for Customized Traffic Mobility Analysis Using Machine Learning Models

Abstract:
With the urbanization process, an increasing number of sensors are being deployed in transportation systems, leading to an explosion of big data. To harness the power of this vast transportation data, various machine learning (ML) and artificial intelligence (AI) methods have been introduced to address numerous transportation challenges. However, these methods often require significant investment in data collection, processing, storage, and the employment of professionals with expertise in transportation and ML. Additionally, privacy issues are a major concern when processing data for real‑world traffic control and management. To address these challenges, the research team proposes an innovative Multi‑agent framework named Independent Mobility GPT (IDM‑GPT) based on large language models (LLMs) for customized traffic analysis, management suggestions, and privacy preservation. IDM‑GPT efficiently connects users, transportation databases, and ML models economically. IDM‑GPT trains, customizes, and applies various LLM‑based AI agents for multiple functions, including user query comprehension, prompts optimization, data analysis, model selection, and performance evaluation and enhancement. With IDM‑GPT, users without any background in transportation or ML can efficiently and intuitively obtain data analysis and customized suggestions in near real‑time based on their questions. Experimental results demonstrate that IDM‑GPT delivers satisfactory performance across multiple traffic‑related tasks, providing comprehensive and actionable insights that support effective traffic management and urban mobility improvement.

PaperID: 11474, https://arxiv.org/pdf/2502.18549.pdf

Authors: Jiyue Tao, Tongsheng Shen, Dexin Zhao, Feitian Zhang

Title: ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense

Abstract:
The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs. A particularly challenging scenario arises when the attacker exhibits superior maneuverability compared to the defenders, significantly complicating effective interception. To tackle this challenge, this letter introduces ARBoids, a novel adaptive residual reinforcement learning framework that integrates deep reinforcement learning (DRL) with the biologically inspired, force‑based Boids model. Within this framework, the Boids model serves as a computationally efficient baseline policy for multi‑agent coordination, while DRL learns a residual policy to adaptively refine and optimize the defenders' actions. The proposed approach is validated in a high‑fidelity Gazebo simulation environment, demonstrating superior performance over traditional interception strategies, including pure force‑based approaches and vanilla DRL policies. Furthermore, the learned policy exhibits strong adaptability to attackers with diverse maneuverability profiles, highlighting its robustness and generalization capability. The code of ARBoids will be released upon acceptance of this letter.

PaperID: 11475, https://arxiv.org/pdf/2502.18534.pdf

Authors: Zachary McBride Lazri, Anirudh Nakra, Ivan Brugere, Danial Dervovic, Antigoni Polychroniadou, Furong Huang, Dana Dachman-Soled, Min Wu

Title: MAFE: Enabling Equitable Algorithm Design in Multi-Agent Multi-Stage Decision-Making Systems

Abstract:
Algorithmic fairness is often studied in static or single‑agent settings, yet many real‑world decision‑making systems involve multiple interacting entities whose multi‑stage actions jointly influence long‑term outcomes. Existing fairness methods applied at isolated decision points frequently fail to mitigate disparities that accumulate over time. Although recent work has modeled fairness as a sequential decision‑making problem, it typically assumes centralized agents or simplified dynamics, limiting its applicability to complex social systems. We introduce MAFE, a suite of Multi‑Agent Fair Environments designed to simulate realistic, modular, and dynamic systems in which fairness emerges from the interplay of multiple agents. We demonstrate MAFEs across three domains ‑‑ loan processing, healthcare, and higher education ‑‑ that support heterogeneous agents, configurable interventions, and fairness metrics. The environments are open‑source and compatible with standard multi‑agent reinforcement learning (MARL) libraries, enabling reproducible evaluation of fairness‑aware policies. Through extensive experiments on cooperative use cases, we demonstrate how MAFE facilitates the design of equitable multi‑agent algorithms and reveals critical trade‑offs between fairness, performance, and coordination. MAFE provides a foundation for systematic progress in dynamic, multi‑agent fairness research.

PaperID: 11476, https://arxiv.org/pdf/2502.18530.pdf

Authors: Eric Xue, Ke Chen, Zeyi Huang, Yuyang Ji, Haohan Wang

Title: IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts

Abstract:
Large language model (LLM) agents have emerged as a promising solution to automate the workflow of machine learning, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM‑driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves overall model performance. We also provide some theoretical edvience of the superior properties of this Iterative Refinement. Further, we implement this strategy in IMPROVE, an end‑to‑end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero‑shot LLM‑based approaches.

PaperID: 11477, https://arxiv.org/pdf/2502.18526.pdf

Authors: Fangqi Liu, Rishav Sen, Jose Paolo Talusan, Ava Pettet, Aaron Kandel, Yoshinori Suzue, Ayan Mukhopadhyay, Abhishek Dubey

Title: Reinforcement Learning-based Approach for Vehicle-to-Building Charging with Heterogeneous Agents and Long Term Rewards

Abstract:
Strategic aggregation of electric vehicle batteries as energy reservoirs can optimize power grid demand, benefiting smart and connected communities, especially large office buildings that offer workplace charging. This involves optimizing charging and discharging to reduce peak energy costs and net peak demand, monitored over extended periods (e.g., a month), which involves making sequential decisions under uncertainty and delayed and sparse rewards, a continuous action space, and the complexity of ensuring generalization across diverse conditions. Existing algorithmic approaches, e.g., heuristic‑based strategies, fall short in addressing real‑time decision‑making under dynamic conditions, and traditional reinforcement learning (RL) models struggle with large state‑action spaces, multi‑agent settings, and the need for long‑term reward optimization. To address these challenges, we introduce a novel RL framework that combines the Deep Deterministic Policy Gradient approach (DDPG) with action masking and efficient MILP‑driven policy guidance. Our approach balances the exploration of continuous action spaces to meet user charging demands. Using real‑world data from a major electric vehicle manufacturer, we show that our approach comprehensively outperforms many well‑established baselines and several scalable heuristic approaches, achieving significant cost savings while meeting all charging requirements. Our results show that the proposed approach is one of the first scalable and general approaches to solving the V2B energy management challenge.

PaperID: 11478, https://arxiv.org/pdf/2502.18515.pdf

Authors: Rabimba Karanjai, Lei Xu, Weidong Shi

Title: Securing Smart Contract Languages with a Unified Agentic Framework for Vulnerability Repair in Solidity and Move

Abstract:
The rapid growth of the blockchain ecosystem and the increasing value locked in smart contracts necessitate robust security measures. While languages like Solidity and Move aim to improve smart contract security, vulnerabilities persist. This paper presents Smartify, a novel multi‑agent framework leveraging Large Language Models (LLMs) to automatically detect and repair vulnerabilities in Solidity and Move smart contracts. Unlike traditional methods that rely solely on vast pre‑training datasets, Smartify employs a team of specialized agents working on different specially fine‑tuned LLMs to analyze code based on underlying programming concepts and language‑specific security principles. We evaluated Smartify on a dataset for Solidity and a curated dataset for Move, demonstrating its effectiveness in fixing a wide range of vulnerabilities. Our results show that Smartify (Gemma2+codegemma) achieves state‑of‑the‑art performance, surpassing existing LLMs and enhancing general‑purpose models' capabilities, such as Llama 3.1. Notably, Smartify can incorporate language‑specific knowledge, such as the nuances of Move, without requiring massive language‑specific pre‑training datasets. This work offers a detailed analysis of various LLMs' performance on smart contract repair, highlighting the strengths of our multi‑agent approach and providing a blueprint for developing more secure and reliable decentralized applications in the growing blockchain landscape. We also provide a detailed recipe for extending this to other similar use cases.

PaperID: 11479, https://arxiv.org/pdf/2502.18465.pdf

Authors: Jialin Wang, Zhihua Duan

Title: Empirical Research on Utilizing LLM-based Agents for Automated Bug Fixing via LangGraph

Abstract:
This paper presents a novel framework for automated code generation and debugging, designed to improve accuracy, efficiency, and scalability in software development. The proposed system integrates three core components LangGraph, GLM4 Flash, and ChromaDB within a four step iterative workflow to deliver robust performance and seamless functionality. LangGraph serves as a graph‑based library for orchestrating tasks, providing precise control and execution while maintaining a unified state object for dynamic updates and consistency. It supports multi‑agent, hierarchical, and sequential processes, making it highly adaptable to complex software engineering workflows. GLM4 Flash, a large language model, leverages its advanced capabilities in natural language understanding, contextual reasoning, and multilingual support to generate accurate code snippets based on user prompts. ChromaDB acts as a vector database for semantic search and contextual memory storage, enabling the identification of patterns and the generation of context‑aware bug fixes based on historical data. The system operates through a structured four‑step process: (1) Code Generation, which translates natural language descriptions into executable code; (2) Code Execution, which validates the code by identifying runtime errors and inconsistencies; (3) Code Repair, which iteratively refines buggy code using ChromaDB's memory capabilities and LangGraph's state tracking; and (4) Code Update, which ensures the code meets functional and performance requirements through iterative modifications.

PaperID: 11480, https://arxiv.org/pdf/2502.18439.pdf

Authors: Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman Ozdaglar, Kaiqing Zhang, Joo-Kyung Kim

Title: MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Abstract:
Leveraging multiple large language models (LLMs) to build collaborative multi‑agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting the out‑of‑the‑box LLMs, relying on their innate capability for collaboration, which may not improve LLMs' performance as shown recently. In this paper, we introduce a new post‑training paradigm MAPoRL (Multi‑Agent Post‑co‑training for collaborative LLMs with Reinforcement Learning), to explicitly elicit the collaborative behaviors and further unleash the power of multi‑agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and engage in a multi‑turn discussion to collaboratively improve the final answer. In the end, a MAPoRL verifier evaluates both the answer and the discussion, by assigning a score that verifies the correctness of the answer, while adding incentives to encourage corrective and persuasive discussions. The score serves as the co‑training reward, and is then maximized through multi‑agent RL. Unlike existing LLM post‑training paradigms, MAPoRL advocates the co‑training of multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration. In contrast, multi‑agent co‑training can boost the collaboration performance across benchmarks, with generalization to unseen domains.

PaperID: 11481, https://arxiv.org/pdf/2502.18397.pdf

Authors: Jinyuan Fang, Zaiqiao Meng, Craig Macdonald

Title: KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation

Abstract:
Iterative retrieval‑augmented generation (iRAG) models offer an effective approach for multi‑hop question answering (QA). However, their retrieval process faces two key challenges: (1) it can be disrupted by irrelevant documents or factually inaccurate chain‑of‑thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi‑step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge‑driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi‑hop QA.

PaperID: 11482, https://arxiv.org/pdf/2502.18359.pdf

Authors: Deven R. Desai, Mark O. Riedl

Title: Responsible AI Agents

Abstract:
Thanks to advances in large language models, a new type of software agent, the artificial intelligence (AI) agent, has entered the marketplace. Companies such as OpenAI, Google, Microsoft, and Salesforce promise their AI Agents will go from generating passive text to executing tasks. Instead of a travel itinerary, an AI Agent would book all aspects of your trip. Instead of generating text or images for social media post, an AI Agent would post the content across a host of social media outlets. The potential power of AI Agents has fueled legal scholars' fears that AI Agents will enable rogue commerce, human manipulation, rampant defamation, and intellectual property harms. These scholars are calling for regulation before AI Agents cause havoc. This Article addresses the concerns around AI Agents head on. It shows that core aspects of how one piece of software interacts with another creates ways to discipline AI Agents so that rogue, undesired actions are unlikely, perhaps more so than rules designed to govern human agents. It also develops a way to leverage the computer‑science approach to value‑alignment to improve a user's ability to take action to prevent or correct AI Agent operations. That approach offers and added benefit of helping AI Agents align with norms around user‑AI Agent interactions. These practices will enable desired economic outcomes and mitigate perceived risks. The Article also argues that no matter how much AI Agents seem like human agents, they need not, and should not, be given legal personhood status. In short, humans are responsible for AI Agents' actions, and this Article provides a guide for how humans can build and maintain responsible AI Agents.

PaperID: 11483, https://arxiv.org/pdf/2502.18228.pdf

Authors: Xiaofeng Wang, Zhixin Zhang, Jinguang Zheng, Yiming Ai, Rui Wang

Title: Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent

Abstract:
Debt collection negotiations (DCN) are vital for managing non‑performing loans (NPLs) and reducing creditor losses. Traditional methods are labor‑intensive, while large language models (LLMs) offer promising automation potential. However, prior systems lacked dynamic negotiation and real‑time decision‑making capabilities. This paper explores LLMs in automating DCN and proposes a novel evaluation framework with 13 metrics across 4 aspects. Our experiments reveal that LLMs tend to over‑concede compared to human negotiators. To address this, we propose the Multi‑Agent Debt Negotiation (MADeN) framework, incorporating planning and judging modules to improve decision rationality. We also apply post‑training techniques, including DPO with rejection sampling, to optimize performance. Our studies provide valuable insights for practitioners and researchers seeking to enhance efficiency and outcomes in this domain.

PaperID: 11484, https://arxiv.org/pdf/2502.18180.pdf

Authors: Lei Li, Sen Jia, Jianhao Wang, Zhaochong An, Jiaang Li, Jenq-Neng Hwang, Serge Belongie

Title: ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis

Abstract:
Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their "instruct‑only" nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi‑agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta‑tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion's precision, adaptability, and user engagement for human motion understanding.

PaperID: 11485, https://arxiv.org/pdf/2502.17956.pdf

Authors: Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

Title: Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Abstract:
Multi‑step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain‑of‑Thought (CoT) prompting improves reasoning, it struggles with non‑English languages due to the entanglement of reasoning and execution. Program‑of‑Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non‑English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine‑tuning on question‑reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine‑tuning substantially enhances multilingual reasoning, outperforming CoT fine‑tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test‑time performance improvement heuristic.

PaperID: 11486, https://arxiv.org/pdf/2502.17924.pdf

Authors: Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua

Title: FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

Abstract:
Large Language Models (LLMs) have significantly advanced the fact‑checking studies. However, existing automated fact‑checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact‑checking. In this work, we introduce FACT‑AUDIT, an agent‑driven framework that adaptively and dynamically assesses LLMs' fact‑checking capabilities. Leveraging importance sampling principles and multi‑agent collaboration, FACT‑AUDIT generates adaptive and scalable datasets, performs iterative model‑centric evaluations, and updates assessments based on model‑specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT‑AUDIT effectively differentiates among state‑of‑the‑art LLMs, providing valuable insights into model strengths and limitations in model‑centric fact‑checking analysis.

PaperID: 11487, https://arxiv.org/pdf/2502.17878.pdf

Authors: Hongqiu Wu, Weiqi Wu, Tianyang Xu, Jiameng Zhang, Hai Zhao

Title: Towards Enhanced Immersion and Agency for LLM-based Interactive Drama

Abstract:
LLM‑based Interactive Drama is a novel AI‑based dialogue scenario, where the user (i.e. the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency, the player's ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, while they have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting‑guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot‑based Reflection for LLM agents to refine their reactions to align with the player's intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.

PaperID: 11488, https://arxiv.org/pdf/2502.17821.pdf

Authors: Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, Ming Lin

Title: CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

Abstract:
Multi‑modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource‑constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single‑agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision‑making blind spots. Conversely, some works explore multi‑agent collaboration but without addressing missing modality at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi‑modal multi‑agent framework that enables agents to collaborate and share multi‑modal data during training, while allowing inference with reduced modalities during testing. Experimental results in collaborative decision‑making for CAV in accident‑prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real‑world aerial‑ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.

PaperID: 11489, https://arxiv.org/pdf/2502.17813.pdf

Authors: Meng Feng, Viraj Parimi, Brian Williams

Title: Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning

Abstract:
Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods excel at long‑horizon tasks but rely on a predefined graph with fixed distance metrics. In contrast, safe Reinforcement Learning (RL) can learn complex behaviors without relying on manual heuristics but fails to solve long‑horizon tasks, particularly in goal‑conditioned and multi‑agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal‑conditioned RL and safe RL to learn a goal‑conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self‑training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint‑based plan that the agent follows until reaching its goal, effectively balancing faster and safer routes over extended distances. Utilizing this unified high‑level graph and a shared low‑level goal‑conditioned safe RL policy, we extend this approach to address the multi‑agent safe navigation problem. In particular, we leverage Conflict‑Based Search (CBS) to create waypoint‑based plans for multiple agents allowing for their safe navigation over extended horizons. This integration enhances the scalability of goal‑conditioned safe RL in multi‑agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state‑of‑the‑art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments. Our code and further details about or work is available at https://safe‑visual‑mapf‑mers.csail.mit.edu/.

PaperID: 11490, https://arxiv.org/pdf/2502.17812.pdf

Authors: Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu

Title: Can Multimodal LLMs Perform Time Series Anomaly Detection?

Abstract:
Time series anomaly detection (TSAD) has been a long‑standing pillar problem in Web‑scale systems and online infrastructures, such as service reliability monitoring, system fault diagnosis, and performance optimization. Large language models (LLMs) have demonstrated unprecedented capabilities in time series analysis, the potential of multimodal LLMs (MLLMs), particularly vision‑language models, in TSAD remains largely under‑explored. One natural way for humans to detect time series anomalies is through visualization and textual description. It motivates our research question: Can multimodal LLMs perform time series anomaly detection? Existing studies often oversimplify the problem by treating point‑wise anomalies as special cases of range‑wise ones or by aggregating point anomalies to approximate range‑wise scenarios. They limit our understanding for realistic scenarios such as multi‑granular anomalies and irregular time series. To address the gap, we build a VisualTimeAnomaly benchmark to comprehensively investigate zero‑shot capabilities of MLLMs for TSAD, progressively from point‑, range‑, to variate‑wise anomalies, and extends to irregular sampling conditions. Our study reveals several key insights in multimodal MLLMs for TSAD. Built on these findings, we propose a MLLMs‑based multi‑agent framework TSAD‑Agents to achieve automatic TSAD. Our framework comprises scanning, planning, detection, and checking agents that synergistically collaborate to reason, plan, and self‑reflect to enable automatic TSAD. These agents adaptively invoke tools such as traditional methods and MLLMs and dynamically switch between text and image modalities to optimize detection performance.

PaperID: 11491, https://arxiv.org/pdf/2502.17807.pdf

Authors: Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang

Title: DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities

Abstract:
We present DocPuzzle, a rigorously constructed benchmark for evaluating long‑context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert‑level QA problems requiring multi‑step reasoning over long real‑world documents. To ensure the task quality and complexity, we implement a human‑AI collaborative annotation‑validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist‑guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1)Advanced slow‑thinking reasoning models like o1‑preview(69.7%) and DeepSeek‑R1(66.3%) significantly outperform best general instruct models like Claude 3.5 Sonnet(57.7%); 2)Distilled reasoning models like DeepSeek‑R1‑Distill‑Qwen‑32B(41.3%) falls far behind the teacher model, suggesting challenges to maintain the generalization of reasoning capabilities relying solely on distillation.

PaperID: 11492, https://arxiv.org/pdf/2502.17651.pdf

Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng

Title: METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling

Abstract:
Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision‑language model (VLM) based multi‑agent framework for effective automatic chart generation. Generating high‑quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi‑modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi‑agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves 5.2% improvement over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test‑time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self‑correction capability of VLMs in the multimodal context.

PaperID: 11493, https://arxiv.org/pdf/2502.17643.pdf

Authors: Sangwon Seo, Bing Han, Rayan E. Harari, Roger D. Dias, Marco A. Zenati, Eduardo Salas, Vaibhav Unhelkar

Title: Socratic: Enhancing Human Teamwork via AI-enabled Coaching

Abstract:
Coaches are vital for effective collaboration, but cost and resource constraints often limit their availability during real‑world tasks. This limitation poses serious challenges in life‑critical domains that rely on effective teamwork, such as healthcare and disaster response. To address this gap, we propose and realize an innovative application of AI: task‑time team coaching. Specifically, we introduce Socratic, a novel AI system that complements human coaches by providing real‑time guidance during task execution. Socratic monitors team behavior, detects misalignments in team members' shared understanding, and delivers automated interventions to improve team performance. We validated Socratic through two human subject experiments involving dyadic collaboration. The results demonstrate that the system significantly enhances team performance with minimal interventions. Participants also perceived Socratic as helpful and trustworthy, supporting its potential for adoption. Our findings also suggest promising directions both for AI research and its practical applications to enhance human teamwork.

PaperID: 11494, https://arxiv.org/pdf/2502.17618.pdf

Authors: Sangwon Seo, Vaibhav Unhelkar

Title: Hierarchical Imitation Learning of Team Behavior from Heterogeneous Demonstrations

Abstract:
Successful collaboration requires team members to stay aligned, especially in complex sequential tasks. Team members must dynamically coordinate which subtasks to perform and in what order. However, real‑world constraints like partial observability and limited communication bandwidth often lead to suboptimal collaboration. Even among expert teams, the same task can be executed in multiple ways. To develop multi‑agent systems and human‑AI teams for such tasks, we are interested in data‑driven learning of multimodal team behaviors. Multi‑Agent Imitation Learning (MAIL) provides a promising framework for data‑driven learning of team behavior from demonstrations, but existing methods struggle with heterogeneous demonstrations, as they assume that all demonstrations originate from a single team policy. Hence, in this work, we introduce DTIL: a hierarchical MAIL algorithm designed to learn multimodal team behaviors in complex sequential tasks. DTIL represents each team member with a hierarchical policy and learns these policies from heterogeneous team demonstrations in a factored manner. By employing a distribution‑matching approach, DTIL mitigates compounding errors and scales effectively to long horizons and continuous state representations. Experimental results show that DTIL outperforms MAIL baselines and accurately models team behavior across a variety of collaborative scenarios.

PaperID: 11495, https://arxiv.org/pdf/2502.17584.pdf

Authors: Muzaffar Qureshi, Tochukwu Elijah Ogri, Humberto Ramos, Zachary I. Bell, Rushikesh Kamalapurkar

Title: Gaussian Process-Based Scalar Field Estimation in GPS-Denied Environments

Abstract:
This paper presents a methodology for an autonomous agent to map an unknown scalar field in GPS‑denied regions. To reduce localization errors, the agent alternates between GPS‑enabled and GPS‑denied areas while collecting measurements. User‑defined error bounds determine the dwell time in each region. A switching trajectory is then designed to ensure field measurements in GPS‑denied regions remain within the specified error limits. A Lyapunov‑based stability analysis guarantees bounded error trajectories while tracking the desired path. The effectiveness of the proposed methodology is demonstrated through simulations, with an error analysis comparing the GP‑predicted scalar field model to the actual field.

PaperID: 11496, https://arxiv.org/pdf/2502.17535.pdf

Authors: Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, Bo Li

Title: The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?

Abstract:
Motivated by reducing the computational and storage costs of LLMs, model compression and KV cache compression have attracted much attention from researchers. However, current methods predominantly emphasize maintaining the performance of compressed LLMs, as measured by perplexity or simple accuracy on tasks of common sense knowledge QA and basic arithmetic reasoning. In this blog, we present a brief review of recent advancements in LLMs related to retrieval‑augmented generation, multi‑step reasoning, external tools, and computational expressivity, all of which substantially enhance LLM performance. Then, we propose a lottery LLM hypothesis suggesting that for a given LLM and task, there exists a smaller lottery LLM capable of producing the same performance as the original LLM with the assistance of multi‑step reasoning and external tools. Based on the review of current progress in LLMs, we discuss and summarize the essential capabilities that the lottery LLM and KV cache compression must possess, which are currently overlooked in existing methods.

PaperID: 11497, https://arxiv.org/pdf/2502.17443.pdf

Authors: Vaibhav Tupe, Shrinath Thube

Title: AI Agentic workflows and Enterprise APIs: Adapting API architectures for the age of AI agents

Abstract:
The rapid advancement of Generative AI has catalyzed the emergence of autonomous AI agents, presenting unprecedented challenges for enterprise computing infrastructures. Current enterprise API architectures are predominantly designed for human‑driven, predefined interaction patterns, rendering them ill‑equipped to support intelligent agents' dynamic, goal‑oriented behaviors. This research systematically examines the architectural adaptations for enterprise APIs to support AI agentic workflows effectively. Through a comprehensive analysis of existing API design paradigms, agent interaction models, and emerging technological constraints, the paper develops a strategic framework for API transformation. The study employs a mixed‑method approach, combining theoretical modeling, comparative analysis, and exploratory design principles to address critical challenges in standardization, performance, and intelligent interaction. The proposed research contributes a conceptual model for next‑generation enterprise APIs that can seamlessly integrate with autonomous AI agent ecosystems, offering significant implications for future enterprise computing architectures.

PaperID: 11498, https://arxiv.org/pdf/2502.17366.pdf

Authors: Jikang Deng, Hui Zhou, Mohamed-Slim Alouini

Title: Distributed Coordination for Heterogeneous Non-Terrestrial Networks

Abstract:
To achieve global coverage and ubiquitous connectivity, the non‑terrestrial network (NTN) has been regarded as a key enabler in the sixth generation (6G) network, which includes uncrewed aerial vehicles (UAVs), high‑altitude platforms (HAPs), and satellites. Since the unique characteristics of various NTN platforms strongly affect their implementation and lead to a highly dynamic and heterogeneous NTN scenario, achieving distributed coordination remains an important research direction. However, the explicit and systematic analysis of the individual layers' challenges and corresponding distributed coordination solutions in heterogeneous NTNs has not been proposed yet. Therefore, in this paper, we summarize the unique characteristics of each NTN platform, identify communication challenges within individual layers, and propose potential delay‑tolerant or delay‑sensitive coordinated solutions accordingly. We further analyse the feasibility of leveraging multi‑agent deep reinforcement learning (MADRL) algorithms to achieve the proposed coordinated solutions. Finally, we present a case study of the joint scheduling and trajectory optimization problem in heterogeneous NTN, where a two‑timescale multi‑agent deep deterministic policy gradient (TTS‑MADDPG) algorithm is developed to validate the effectiveness of distributed coordination.

PaperID: 11499, https://arxiv.org/pdf/2502.17307.pdf

Authors: Jichen Li, Lijia Xie, Hanting Huang, Bo Zhou, Binfeng Song, Wanying Zeng, Xiaotie Deng, Xiao Zhang

Title: Survey on Strategic Mining in Blockchain: A Reinforcement Learning Approach

Abstract:
Strategic mining attacks, such as selfish mining, exploit blockchain consensus protocols by deviating from honest behavior to maximize rewards. Markov Decision Process (MDP) analysis faces scalability challenges in modern digital economics, including blockchain. To address these limitations, reinforcement learning (RL) provides a scalable alternative, enabling adaptive strategy optimization in complex dynamic environments. In this survey, we examine RL's role in strategic mining analysis, comparing it to MDP‑based approaches. We begin by reviewing foundational MDP models and their limitations, before exploring RL frameworks that can learn near‑optimal strategies across various protocols. Building on this analysis, we compare RL techniques and their effectiveness in deriving security thresholds, such as the minimum attacker power required for profitable attacks. Expanding the discussion further, we classify consensus protocols and propose open challenges, such as multi‑agent dynamics and real‑world validation. This survey highlights the potential of reinforcement learning (RL) to address the challenges of selfish mining, including protocol design, threat detection, and security analysis, while offering a strategic roadmap for researchers in decentralized systems and AI‑driven analytics.

PaperID: 11500, https://arxiv.org/pdf/2502.17167.pdf

Authors: Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb

Title: A Novel Multiple Access Scheme for Heterogeneous Wireless Communications using Symmetry-aware Continual Deep Reinforcement Learning

Abstract:
The Metaverse holds the potential to revolutionize digital interactions through the establishment of a highly dynamic and immersive virtual realm over wireless communications systems, offering services such as massive twinning and telepresence. This landscape presents novel challenges, particularly efficient management of multiple access to the frequency spectrum, for which numerous adaptive Deep Reinforcement Learning (DRL) approaches have been explored. However, challenges persist in adapting agents to heterogeneous and non‑stationary wireless environments. In this paper, we present a novel approach that leverages Continual Learning (CL) to enhance intelligent Medium Access Control (MAC) protocols, featuring an intelligent agent coexisting with legacy User Equipments (UEs) with varying numbers, protocols, and transmission profiles unknown to the agent for the sake of backward compatibility and privacy. We introduce an adaptive Double and Dueling Deep Q‑Learning (D3QL)‑based MAC protocol, enriched by a symmetry‑aware CL mechanism, which maximizes intelligent agent throughput while ensuring fairness. Mathematical analysis validates the efficiency of our proposed scheme, showcasing superiority over conventional DRL‑based techniques in terms of throughput, collision rate, and fairness, coupled with real‑time responsiveness in highly dynamic scenarios.

PaperID: 11501, https://arxiv.org/pdf/2502.17120.pdf

Authors: Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb

Title: Semantic-Aware Dynamic and Distributed Power Allocation: a Multi-UAV Area Coverage Use Case

Abstract:
The advancement towards 6G technology leverages improvements in aerial‑terrestrial networking, where one of the critical challenges is the efficient allocation of transmit power. Although existing studies have shown commendable performance in addressing this challenge, a revolutionary breakthrough is anticipated to meet the demands and dynamism of 6G. Potential solutions include: 1) semantic communication and orchestration, which transitions the focus from mere transmission of bits to the communication of intended meanings of data and their integration into the network orchestration process; and 2) distributed machine learning techniques to develop adaptable and scalable solutions. In this context, this paper introduces a power allocation framework specifically designed for semantic‑aware networks. The framework addresses a scenario involving multiple Unmanned Aerial Vehicles (UAVs) that collaboratively transmit observations over a multi‑channel uplink medium to a central server, aiming to maximise observation quality. To tackle this problem, we present the Semantic‑Aware Multi‑Agent Double and Dueling Deep Q‑Learning (SAMA‑D3QL) algorithm, which utilizes the data quality of observing areas as reward feedback during the training phase, thereby constituting a semantic‑aware learning mechanism. Simulation results substantiate the efficacy and scalability of our approach, demonstrating its superior performance compared to traditional bit‑oriented learning and heuristic algorithms.

PaperID: 11502, https://arxiv.org/pdf/2502.17046.pdf

Authors: Jinyuan Feng, Min Chen, Zhiqiang Pu, Yifan Xu, Yanyan Liang

Title: MA2RL: Masked Autoencoders for Generalizable Multi-Agent Reinforcement Learning

Abstract:
To develop generalizable models in multi‑agent reinforcement learning, recent approaches have been devoted to discovering task‑independent skills for each agent, which generalize across tasks and facilitate agents' cooperation. However, particularly in partially observed settings, such approaches struggle with sample efficiency and generalization capabilities due to two primary challenges: (a) How to incorporate global states into coordinating the skills of different agents? (b) How to learn generalizable and consistent skill semantics when each agent only receives partial observations? To address these challenges, we propose a framework called Masked Autoencoders for Multi‑Agent Reinforcement Learning (MA2RL), which encourages agents to infer unobserved entities by reconstructing entity‑states from the entity perspective. The entity perspective helps MA2RL generalize to diverse tasks with varying agent numbers and action spaces. Specifically, we treat local entity‑observations as masked contexts of the global entity‑states, and MA2RL can infer the latent representation of dynamically masked entities, facilitating the assignment of task‑independent skills and the learning of skill semantics. Extensive experiments demonstrate that MA2RL achieves significant improvements relative to state‑of‑the‑art approaches, demonstrating extraordinary performance, remarkable zero‑shot generalization capabilities and advantageous transferability.

PaperID: 11503, https://arxiv.org/pdf/2502.17034.pdf

Authors: Muhammad Haris Khan, Artyom Myshlyaev, Artem Lykov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

Title: Evolution 6.0: Robot Evolution through Generative Design

Abstract:
We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision‑Language Models (VLMs), Vision‑Language Action (VLA) models, and Text‑to‑3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task‑specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama‑Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10‑second inference time, and action generation achieving 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real‑world adaptability.

PaperID: 11504, https://arxiv.org/pdf/2502.16991.pdf

Authors: Thomas I. Strasser, Filip Pröstl Andrén

Title: Engineering and Validating Cyber-Physical Energy Systems: Needs, Status Quo, and Research Trends

Abstract:
A driving force for the realization of a sustainable energy supply is the integration of renewable energy resources. Due to their stochastic generation behaviour, energy utilities are confronted with a more complex operation of the underlying power grids. Additionally, due to technology developments, controllable loads, integration with other energy sources, changing regulatory rules, and the market liberalization, the systems operation needs adaptation. Proper operational concepts and intelligent automation provide the basis to turn the existing power system into an intelligent entity, a cyber‑physical energy system. The electric energy system is therefore moving from a single system to a system of systems. While reaping the benefits with new intelligent behaviors, it is expected that system‑level developments, architectural concepts, advanced automation and control as well as the validation and testing will play a significantly larger role in realizing future solutions and technologies. The implementation and deployment of these complex systems of systems are associated with increasing engineering complexity resulting also in increased engineering costs. Proper engineering and validation approaches, concepts, and tools are partly missing until now. Therefore, this paper discusses and summarizes the main needs and requirements as well as the status quo in research and development related to the engineering and validation of cyber‑physical energy systems. Also research trends and necessary future activities are outlined.

PaperID: 11505, https://arxiv.org/pdf/2502.16879.pdf

Authors: Yuzhi Hao, Danyang Xie

Title: A Multi-LLM-Agent-Based Framework for Economic and Public Policy Analysis

Abstract:
This paper pioneers a novel approach to economic and public policy analysis by leveraging multiple Large Language Models (LLMs) as heterogeneous artificial economic agents. We first evaluate five LLMs' economic decision‑making capabilities in solving two‑period consumption allocation problems under two distinct scenarios: with explicit utility functions and based on intuitive reasoning. While previous research has often simulated heterogeneity by solely varying prompts, our approach harnesses the inherent variations in analytical capabilities across different LLMs to model agents with diverse cognitive traits. Building on these findings, we construct a Multi‑LLM‑Agent‑Based (MLAB) framework by mapping these LLMs to specific educational groups and corresponding income brackets. Using interest‑income taxation as a case study, we demonstrate how the MLAB framework can simulate policy impacts across heterogeneous agents, offering a promising new direction for economic and public policy analysis by leveraging LLMs' human‑like reasoning capabilities and computational power.

PaperID: 11506, https://arxiv.org/pdf/2502.16868.pdf

Authors: Longbin Lai, Changwei Luo, Yunkai Lou, Mingchen Ju, Zhengyi Yang

Title: Graphy'our Data: Towards End-to-End Modeling, Exploring and Generating Report from Raw Data

Abstract:
Large Language Models (LLMs) have recently demonstrated remarkable performance in tasks such as Retrieval‑Augmented Generation (RAG) and autonomous AI agent workflows. Yet, when faced with large sets of unstructured documents requiring progressive exploration, analysis, and synthesis, such as conducting literature survey, existing approaches often fall short. We address this challenge ‑‑ termed Progressive Document Investigation ‑‑ by introducing Graphy, an end‑to‑end platform that automates data modeling, exploration and high‑quality report generation in a user‑friendly manner. Graphy comprises an offline Scrapper that transforms raw documents into a structured graph of Fact and Dimension nodes, and an online Surveyor that enables iterative exploration and LLM‑driven report generation. We showcase a pre‑scrapped graph of over 50,000 papers ‑‑ complete with their references ‑‑ demonstrating how Graphy facilitates the literature‑survey scenario. The demonstration video can be found at https://youtu.be/uM4nzkAdGlM.

PaperID: 11507, https://arxiv.org/pdf/2502.16863.pdf

Authors: Kartik Nagpal, Dayi Dong, Jean-Baptiste Bouvier, Negar Mehr

Title: Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment

Abstract:
Recent work, spanning from autonomous vehicle coordination to in‑space assembly, has shown the importance of learning collaborative behavior for enabling robots to achieve shared goals. A common approach for learning this cooperative behavior is to utilize the centralized‑training decentralized‑execution paradigm. However, this approach also introduces a new challenge: how do we evaluate the contributions of each agent's actions to the overall success or failure of the team. This credit assignment problem has remained open, and has been extensively studied in the Multi‑Agent Reinforcement Learning literature. In fact, humans manually inspecting agent behavior often generate better credit evaluations than existing methods. We combine this observation with recent works which show Large Language Models demonstrate human‑level performance at many pattern recognition tasks. Our key idea is to reformulate credit assignment to the two pattern recognition problems of sequence improvement and attribution, which motivates our novel LLM‑MCA method. Our approach utilizes a centralized LLM reward‑critic which numerically decomposes the environment reward based on the individualized contribution of each agent in the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension LLM‑TACA where our LLM critic performs explicit task assignment by passing an intermediary goal directly to each agent policy in the scenario. Both our methods far outperform the state‑of‑the‑art on a variety of benchmarks, including Level‑Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark which incorporates collision‑related safety constraints. As an artifact of our methods, we generate large trajectory datasets with each timestep annotated with per‑agent reward information, as sampled from our LLM critics.

PaperID: 11508, https://arxiv.org/pdf/2502.16804.pdf

Authors: Yaozu Wu, Dongyuan Li, Yankai Chen, Renhe Jiang, Henry Peng Zou, Wei-Chieh Huang, Yangning Li, Liancheng Fang, Zhen Wang, Philip S. Yu

Title: Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances

Abstract:
Autonomous Driving Systems (ADSs) are revolutionizing transportation by reducing human intervention, improving operational efficiency, and enhancing safety. Large Language Models (LLMs) have been integrated into ADSs to support high‑level decision‑making through their powerful reasoning, instruction‑following, and communication abilities. However, LLM‑based single‑agent ADSs face three major challenges: limited perception, insufficient collaboration, and high computational demands. To address these issues, recent advances in LLM‑based multi‑agent ADSs leverage language‑driven communication and coordination to enhance inter‑agent collaboration. This paper provides a frontier survey of this emerging intersection between NLP and multi‑agent ADSs. We begin with a background introduction to related concepts, followed by a categorization of existing LLM‑based methods based on different agent interaction modes. We then discuss agent‑human interactions in scenarios where LLM‑based agents engage with humans. Finally, we summarize key applications, datasets, and challenges to support future research.

PaperID: 11509, https://arxiv.org/pdf/2502.16796.pdf

Authors: Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, Rui Yan

Title: MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions

Abstract:
Mobile phone agents can assist people in automating daily tasks on their phones, which have emerged as a pivotal research spotlight. However, existing procedure‑oriented agents struggle with cross‑app instructions, due to the following challenges: (1) complex task relationships, (2) diverse app environment, and (3) error propagation and information loss in multi‑step execution. Drawing inspiration from object‑oriented programming principles, we recognize that object‑oriented solutions is more suitable for cross‑app instruction. To address these challenges, we propose a self‑evolving multi‑agent framework named MobileSteward, which integrates multiple app‑oriented StaffAgents coordinated by a centralized StewardAgent. We design three specialized modules in MobileSteward: (1) Dynamic Recruitment generates a scheduling graph guided by information flow to explicitly associate tasks among apps. (2) Assigned Execution assigns the task to app‑oriented StaffAgents, each equipped with app‑specialized expertise to address the diversity between apps. (3) Adjusted Evaluation conducts evaluation to provide reflection tips or deliver key information, which alleviates error propagation and information loss during multi‑step execution. To continuously improve the performance of MobileSteward, we develop a Memory‑based Self‑evolution mechanism, which summarizes the experience from successful execution, to improve the performance of MobileSteward. We establish the first English Cross‑APP Benchmark (CAPBench) in the real‑world environment to evaluate the agents' capabilities of solving complex cross‑app instructions. Experimental results demonstrate that MobileSteward achieves the best performance compared to both single‑agent and multi‑agent frameworks, highlighting the superiority of MobileSteward in better handling user instructions with diverse complexity.

PaperID: 11510, https://arxiv.org/pdf/2502.16750.pdf

Authors: Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehenaz Khaled, Ahmedul Kabir

Title: Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System

Abstract:
The autonomous AI agents using large language models can create undeniable values in all span of the society but they face security threats from adversaries that warrants immediate protective solutions because trust and safety issues arise. Considering the many‑shot jailbreaking and deceptive alignment as some of the main advanced attacks, that cannot be mitigated by the static guardrails used during the supervised training, points out a crucial research priority for real world robustness. The combination of static guardrails in dynamic multi‑agent system fails to defend against those attacks. We intend to enhance security for LLM‑based agents through the development of new evaluation frameworks which identify and counter threats for safe operational deployment. Our work uses three examination methods to detect rogue agents through a Reverse Turing Test and analyze deceptive alignment through multi‑agent simulations and develops an anti‑jailbreaking system by testing it with GEMINI 1.5 pro and llama‑3.3‑70B, deepseek r1 models using tool‑mediated adversarial scenarios. The detection capabilities are strong such as 94% accuracy for GEMINI 1.5 pro yet the system suffers persistent vulnerabilities when under long attacks as prompt length increases attack success rates (ASR) and diversity metrics become ineffective in prediction while revealing multiple complex system faults. The findings demonstrate the necessity of adopting flexible security systems based on active monitoring that can be performed by the agents themselves together with adaptable interventions by system admin as the current models can create vulnerabilities that can lead to the unreliable and vulnerable system. So, in our work, we try to address such situations and propose a comprehensive framework to counteract the security issues.

PaperID: 11511, https://arxiv.org/pdf/2502.16653.pdf

Authors: Cheng Zhu, Xiaotao Zhou, Bing Huang

Title: Equilibrium Unit Based Localized Affine Formation Maneuver for Multi-agent Systems

Abstract:
Current affine formation maneuver of multi‑agent systems (MASs) relys on the affine localizability determined by generic assumption for nominal configuration and global construction manner. This does not live up to practical constraints of robot swarms. In this paper, an equilibrium unit based structure is proposed to achieve affine localizability. In an equilibrium unit, existence of non‑zero weights between nodes is guaranteed and their summation is proved to be non‑zero. To remove the generic assumption, a notion of layerable directed graph is introduced, based on which a sufficient condition associated equilibrium unit is presented to establish affine localizability condition. Within this framework, distributed local construction manner is performed by a designed equilibrium unit construction (EUC) method. With the help of localized communication criterion (LCC) and localized sensing based affine formation maneuver control (LSAFMC) protocol, self‑reconstruction capability is possessed by MASs when nodes are added to or removed from the swarms.

PaperID: 11512, https://arxiv.org/pdf/2502.16608.pdf

Authors: Yuli Zhang, Shangbo Wang, Dongyao Jia, Pengfei Fan, Ruiyuan Jiang, Hankang Gu, Andy H. F. Chow

Title: Toward Dependency Dynamics in Multi-Agent Reinforcement Learning for Traffic Signal Control

Abstract:
Reinforcement learning (RL) emerges as a promising data‑driven approach for adaptive traffic signal control (ATSC) in complex urban traffic networks, with deep neural networks substantially augmenting its learning capabilities. However, centralized RL becomes impractical for ATSC involving multiple agents due to the exceedingly high dimensionality of the joint action space. Multi‑agent RL (MARL) mitigates this scalability issue by decentralizing control to local RL agents. Nevertheless, this decentralized method introduces new challenges: the environment becomes partially observable from the perspective of each local agent due to constrained inter‑agent communication. Both centralized RL and MARL exhibit distinct strengths and weaknesses, particularly under heavy intersectional traffic conditions. In this paper, we justify that MARL can achieve the optimal global Q‑value by separating into multiple IRL (Independent Reinforcement Learning) processes when no spill‑back congestion occurs (no agent dependency) among agents (intersections). In the presence of spill‑back congestion (with agent dependency), the maximum global Q‑value can be achieved by using centralized RL. Building upon the conclusions, we propose a novel Dynamic Parameter Update Strategy for Deep Q‑Network (DQN‑DPUS), which updates the weights and bias based on the dependency dynamics among agents, i.e. updating only the diagonal sub‑matrices for the scenario without spill‑back congestion. We validate the DQN‑DPUS in a simple network with two intersections under varying traffic, and show that the proposed strategy can speed up the convergence rate without sacrificing optimal exploration. The results corroborate our theoretical findings, demonstrating the efficacy of DQN‑DPUS in optimizing traffic signal control.

PaperID: 11513, https://arxiv.org/pdf/2502.16531.pdf

Authors: Davide Peron, Victor Nan Fernandez-Ayala, Eleftherios E. Vlahakis, Dimos V. Dimarogonas

Title: Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic

Abstract:
We consider multi‑robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom‑up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real‑time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi‑agent coordination and synchronization framework that is adaptable to a wide range of multi‑robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.

PaperID: 11514, https://arxiv.org/pdf/2502.16496.pdf

Authors: Kun Hu, Muning Wen, Xihuai Wang, Shao Zhang, Yiwei Shi, Minne Li, Minglong Li, Ying Wen

Title: PMAT: Optimizing Action Generation Order in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) faces challenges in coordinating agents due to complex interdependencies within multi‑agent systems. Most MARL algorithms use the simultaneous decision‑making paradigm but ignore the action‑level dependencies among agents, which reduces coordination efficiency. In contrast, the sequential decision‑making paradigm provides finer‑grained supervision for agent decision order, presenting the potential for handling dependencies via better decision order management. However, determining the optimal decision order remains a challenge. In this paper, we introduce Action Generation with Plackett‑Luce Sampling (AGPS), a novel mechanism for agent decision order optimization. We model the order determination task as a Plackett‑Luce sampling process to address issues such as ranking instability and vanishing gradient during the network training process. AGPS realizes credit‑based decision order determination by establishing a bridge between the significance of agents' local observations and their decision credits, thus facilitating order optimization and dependency management. Integrating AGPS with the Multi‑Agent Transformer, we propose the Prioritized Multi‑Agent Transformer (PMAT), a sequential decision‑making MARL algorithm with decision order optimization. Experiments on benchmarks including StarCraft II Multi‑Agent Challenge, Google Research Football, and Multi‑Agent MuJoCo show that PMAT outperforms state‑of‑the‑art algorithms, greatly enhancing coordination efficiency.

PaperID: 11515, https://arxiv.org/pdf/2502.16449.pdf

Authors: Haoran Su

Title: Facilitating Emergency Vehicle Passage in Congested Urban Areas Using Multi-agent Deep Reinforcement Learning

Abstract:
Emergency Response Time (ERT) is crucial for urban safety, measuring cities' ability to handle medical, fire, and crime emergencies. In NYC, medical ERT increased 72% from 7.89 minutes in 2014 to 14.27 minutes in 2024, with half of delays due to Emergency Vehicle (EMV) travel times. Each minute's delay in stroke response costs 2 million brain cells, while cardiac arrest survival drops 7‑10% per minute. This dissertation advances EMV facilitation through three contributions. First, EMVLight, a decentralized multi‑agent reinforcement learning framework, integrates EMV routing with traffic signal pre‑emption. It achieved 42.6% faster EMV travel times and 23.5% improvement for other vehicles. Second, the Dynamic Queue‑Jump Lane system uses Multi‑Agent Proximal Policy Optimization for coordinated lane‑clearing in mixed autonomous and human‑driven traffic, reducing EMV travel times by 40%. Third, an equity study of NYC Emergency Medical Services revealed disparities across boroughs: Staten Island faces delays due to sparse signalized intersections, while Manhattan struggles with congestion. Solutions include optimized EMS stations and improved intersection designs. These contributions enhance EMV mobility and emergency service equity, offering insights for policymakers and urban planners to develop safer, more efficient transportation systems.

PaperID: 11516, https://arxiv.org/pdf/2502.16376.pdf

Authors: Yinxu Tang, Stylianos Loukas Vasileiou, William Yeoh

Title: Does Your AI Agent Get You? A Personalizable Framework for Approximating Human Models from Argumentation-based Dialogue Traces

Abstract:
Explainable AI is increasingly employing argumentation methods to facilitate interactive explanations between AI agents and human users. While existing approaches typically rely on predetermined human user models, there remains a critical gap in dynamically learning and updating these models during interactions. In this paper, we present a framework that enables AI agents to adapt their understanding of human users through argumentation‑based dialogues. Our approach, called Persona, draws on prospect theory and integrates a probability weighting function with a Bayesian belief update mechanism that refines a probability distribution over possible human models based on exchanged arguments. Through empirical evaluations with human users in an applied argumentation setting, we demonstrate that Persona effectively captures evolving human beliefs, facilitates personalized interactions, and outperforms state‑of‑the‑art methods.

PaperID: 11517, https://arxiv.org/pdf/2502.16339.pdf

Authors: Abhishek N. Kulkarni, Andy Liu, Jean-Raphael Gaglione, Daniel Fried, Ufuk Topcu

Title: Dynamic Coalition Structure Detection in Natural Language-based Interactions

Abstract:
In strategic multi‑agent sequential interactions, detecting dynamic coalition structures is crucial for understanding how self‑interested agents coordinate to influence outcomes. However, natural‑language‑based interactions introduce unique challenges to coalition detection due to ambiguity over intents and difficulty in modeling players' subjective perspectives. We propose a new method that leverages recent advancements in large language models and game theory to predict dynamic multilateral coalition formation in Diplomacy, a strategic multi‑agent game where agents negotiate coalitions using natural language. The method consists of two stages. The first stage extracts the set of agreements discussed by two agents in their private dialogue, by combining a parsing‑based filtering function with a fine‑tuned language model trained to predict player intents. In the second stage, we define a new metric using the concept of subjective rationalizability from hypergame theory to evaluate the expected value of an agreement for each player. We then compute this metric for each agreement identified in the first stage by assessing the strategic value of the agreement for both players and taking into account the subjective belief of one player that the second player would honor the agreement. We demonstrate that our method effectively detects potential coalition structures in online Diplomacy gameplay by assigning high values to agreements likely to be honored and low values to those likely to be violated. The proposed method provides foundational insights into coalition formation in multi‑agent environments with language‑based negotiation and offers key directions for future research on the analysis of complex natural language‑based interactions between agents.

PaperID: 11518, https://arxiv.org/pdf/2502.16242.pdf

Authors: Jose L. Garcia, Karolina Hajkova, Maria Marchenko, Carlos Miguel Patiño

Title: Reproducibility Study of Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

Abstract:
This paper presents a reproducibility study and extension of "Cooperation, Competition, and Maliciousness: LLM‑Stakeholders Interactive Negotiation." We validate the original findings using a range of open‑weight models (1.5B‑70B parameters) and GPT‑4o Mini while introducing several novel contributions. We analyze the Pareto front of the games, propose a communication‑free baseline to test whether successful negotiations are possible without agent interaction, evaluate recent small language models' performance, analyze structural information leakage in model responses, and implement an inequality metric to assess negotiation fairness. Our results demonstrate that smaller models (<10B parameters) struggle with format adherence and coherent responses, but larger open‑weight models can approach proprietary model performance. Additionally, in many scenarios, single‑agent approaches can achieve comparable results to multi‑agent negotiations, challenging assumptions about the necessity of agent communication to perform well on the benchmark. This work also provides insights into the accessibility, fairness, environmental impact, and privacy considerations of LLM‑based negotiation systems.

PaperID: 11519, https://arxiv.org/pdf/2502.16131.pdf

Authors: Weichao Chen, Xiaoyi Yu, Longbo Shang, Jiange Xi, Bo Jin, Shengjie Zhao

Title: Urban Emergency Rescue Based on Multi-Agent Collaborative Learning: Coordination Between Fire Engines and Traffic Lights

Abstract:
Nowadays, traffic management in urban areas is one of the major economic problems. In particular, when faced with emergency situations like firefighting, timely and efficient traffic dispatching is crucial. Intelligent coordination between multiple departments is essential to realize efficient emergency rescue. In this demo, we present a framework that integrates techniques for collaborative learning methods into the well‑known Unity Engine simulator, and thus these techniques can be evaluated in realistic settings. In particular, the framework allows flexible settings such as the number and type of collaborative agents, learning strategies, reward functions, and constraint conditions in practice. The framework is evaluated for an emergency rescue scenario, which could be used as a simulation tool for urban emergency departments.

PaperID: 11520, https://arxiv.org/pdf/2502.16128.pdf

Authors: Amirmahdi Mirfakhar, Xuchuang Wang, Jinhang Zuo, Yair Zick, Mohammad Hajiesmaili

Title: Heterogeneous Multi-Agent Bandits with Parsimonious Hints

Abstract:
We study a hinted heterogeneous multi‑agent multi‑armed bandits problem (HMA2B), where agents can query low‑cost observations (hints) in addition to pulling arms. In this framework, each of the M agents has a unique reward distribution over K arms, and in T rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time‑independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP‑HCLA, which is an extension of HCLA, uses a central decision‑maker for arm‑pulling and hint queries, achieving O(M^4K) regret with O(MK\log T) adaptive hints. In decentralized setups, we propose two algorithms, HD‑ETC and EBHD‑ETC, that allow agents to choose actions independently through collision‑based communication and query hints uniformly until stopping, yielding O(M^3K^2) regret with O(M^3K\log T) hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.

PaperID: 11521, https://arxiv.org/pdf/2502.16111.pdf

Authors: Mihir Parmar, Xin Liu, Palash Goyal, Yanfei Chen, Long Le, Swaroop Mishra, Hossein Mobahi, Jindong Gu, Zifeng Wang, Hootan Nakhost, Chitta Baral, Chen-Yu Lee, Tomas Pfister, Hamid Palangi

Title: PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

Abstract:
Recent agent frameworks and inference‑time algorithms often struggle with complex planning problems due to limitations in verifying generated plans or reasoning and varying complexity of instances within a single task. Many existing methods for these tasks either perform task‑level verification without considering constraints or apply inference‑time algorithms without adapting to instance‑level complexity. To address these limitations, we propose PlanGEN, a model‑agnostic and easily scalable agent framework with three key components: constraint, verification, and selection agents. Specifically, our approach proposes constraint‑guided iterative verification to enhance performance of inference‑time algorithms‑‑Best of N, Tree‑of‑Thought, and REBASE. In PlanGEN framework, the selection agent optimizes algorithm choice based on instance complexity, ensuring better adaptability to complex planning problems. Experimental results demonstrate significant improvements over the strongest baseline across multiple benchmarks, achieving state‑of‑the‑art results on NATURAL PLAN (～8%\uparrow), OlympiadBench (～4%\uparrow), DocFinQA (～7%\uparrow), and GPQA (～1%\uparrow). Our key finding highlights that constraint‑guided iterative verification improves inference‑time algorithms, and adaptive selection further boosts performance on complex planning and reasoning problems.

PaperID: 11522, https://arxiv.org/pdf/2502.16090.pdf

Authors: WenTao Liu, Ruohua Zhang, Aimin Zhou, Feng Gao, JiaLi Liu

Title: Echo: A Large Language Model with Temporal Episodic Memory

Abstract:
Research on large language models (LLMs) has shown remarkable performance in domains such as mathematics, programming, and literary creation. However, most studies have focused on semantic memory‑based question answering, neglecting LLMs' potential to handle episodic memory (EM)‑related queries. This oversight has led to suboptimal performance in applications requiring EM, including emotional companionship, personal AI assistants, and AI teachers. To address this gap, we introduce Echo, a LLM enhanced with temporal episodic memory. We propose a Multi‑Agent Data Generation Framework that guides the model in generating multi‑turn, complex scenario episodic memory dialogue data (EM‑Train). Temporal information is innovatively incorporated into the LLM training process, and Echo is trained using the EM‑Train. Furthermore, We develop an EM‑Test benchmark specifically designed to evaluate LLMs' episodic memory capabilities. The EM‑Test assesses performance across various time spans and difficulty levels, providing a comprehensive evaluation of multi‑turn episodic memory dialogues. Our experiments demonstrate that Echo significantly outperforms state‑of‑the‑art LLMs on EM‑Test. Additionally, a qualitative analysis reveals Echo's potential to exhibit human‑like episodic memory capabilities. We will open‑source all datasets, code, and model weights.

PaperID: 11523, https://arxiv.org/pdf/2502.16079.pdf

Authors: Aritra Pal, Anandsingh Chauhan, Mayank Baranwal

Title: Together We Rise: Optimizing Real-Time Multi-Robot Task Allocation using Coordinated Heterogeneous Plays

Abstract:
Efficient task allocation among multiple robots is crucial for optimizing productivity in modern warehouses, particularly in response to the increasing demands of online order fulfillment. This paper addresses the real‑time multi‑robot task allocation (MRTA) problem in dynamic warehouse environments, where tasks emerge with specified start and end locations. The objective is to minimize both the total travel distance of robots and delays in task completion, while also considering practical constraints such as battery management and collision avoidance. We introduce MRTAgent, a dual‑agent Reinforcement Learning (RL) framework inspired by self‑play, designed to optimize task assignments and robot selection to ensure timely task execution. For safe navigation, a modified linear quadratic controller (LQR) approach is employed. To the best of our knowledge, MRTAgent is the first framework to address all critical aspects of practical MRTA problems while supporting continuous robot movements.

PaperID: 11524, https://arxiv.org/pdf/2502.15871.pdf

Authors: Manar Aljohani, Jun Hou, Sindhura Kommu, Xuan Wang

Title: A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare

Abstract:
The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision‑making, medical research, and patient care. However, their integration into real‑world clinical settings raises critical concerns around trustworthiness, particularly around dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability. These dimensions are essential for ensuring that LLMs generate reliable, unbiased, and ethically sound outputs. While researchers have recently begun developing benchmarks and evaluation frameworks to assess LLM trustworthiness, the trustworthiness of LLMs in healthcare remains underexplored, lacking a systematic review that provides a comprehensive understanding and future insights. This survey addresses that gap by providing a comprehensive review of current methodologies and solutions aimed at mitigating risks across key trust dimensions. We analyze how each dimension affects the reliability and ethical deployment of healthcare LLMs, synthesize ongoing research efforts, and identify critical gaps in existing approaches. We also identify emerging challenges posed by evolving paradigms, such as multi‑agent collaboration, multi‑modal reasoning, and the development of small open‑source medical models. Our goal is to guide future research toward more trustworthy, transparent, and clinically viable LLMs.

PaperID: 11525, https://arxiv.org/pdf/2502.15865.pdf

Authors: Zichen Chen, Jiaao Chen, Jianda Chen, Misha Sra

Title: Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Abstract:
Standard benchmarks fixate on how well large language model (LLM) agents perform in finance, yet say little about whether they are safe to deploy. We argue that accuracy metrics and return‑based scores provide an illusion of reliability, overlooking vulnerabilities such as hallucinated facts, stale data, and adversarial prompt manipulation. We take a firm position: financial LLM agents should be evaluated first and foremost on their risk profile, not on their point‑estimate performance. Drawing on risk‑engineering principles, we outline a three‑level agenda: model, workflow, and system, for stress‑testing LLM agents under realistic failure modes. To illustrate why this shift is urgent, we audit six API‑based and open‑weights LLM agents on three high‑impact tasks and uncover hidden weaknesses that conventional benchmarks miss. We conclude with actionable recommendations for researchers, practitioners, and regulators: audit risk‑aware metrics in future studies, publish stress scenarios alongside datasets, and treat ``safety budget'' as a primary success criterion. Only by redefining what ``good'' looks like can the community responsibly advance AI‑driven finance.

PaperID: 11526, https://arxiv.org/pdf/2502.15853.pdf

Authors: Daksh Dave, Gauransh Sawhney, Vikhyat Chauhan

Title: Multi-Agent Stock Prediction Systems: Machine Learning Models, Simulations, and Real-Time Trading Strategies

Abstract:
This paper presents a comprehensive study on stock price prediction, leveragingadvanced machine learning (ML) and deep learning (DL) techniques to improve financial forecasting accuracy. The research evaluates the performance of various recurrent neural network (RNN) architectures, including Long Short‑Term Memory (LSTM) networks, Gated Recurrent Units (GRU), and attention‑based models. These models are assessed for their ability to capture complex temporal dependencies inherent in stock market data. Our findings show that attention‑based models outperform other architectures, achieving the highest accuracy by capturing both short and long‑term dependencies. This study contributes valuable insights into AI‑driven financial forecasting, offering practical guidance for developing more accurate and efficient trading systems.

PaperID: 11527, https://arxiv.org/pdf/2502.15713.pdf

Authors: Ahmed Alagha, Maha Kadadha, Rabeb Mizouni, Shakti Singh, Jamal Bentahar, Hadi Otrok

Title: UAV-assisted Internet of Vehicles: A Framework Empowered by Reinforcement Learning and Blockchain

Abstract:
This paper addresses the challenges of selecting relay nodes and coordinating among them in UAV‑assisted Internet‑of‑Vehicles (IoV). The selection of UAV relay nodes in IoV employs mechanisms executed either at centralized servers or decentralized nodes, which have two main limitations: 1) the traceability of the selection mechanism execution and 2) the coordination among the selected UAVs, which is currently offered in a centralized manner and is not coupled with the relay selection. Existing UAV coordination methods often rely on optimization methods, which are not adaptable to different environment complexities, or on centralized deep reinforcement learning, which lacks scalability in multi‑UAV settings. Overall, there is a need for a comprehensive framework where relay selection and coordination are coupled and executed in a transparent and trusted manner. This work proposes a framework empowered by reinforcement learning and Blockchain for UAV‑assisted IoV networks. It consists of three main components: a two‑sided UAV relay selection mechanism for UAV‑assisted IoV, a decentralized Multi‑Agent Deep Reinforcement Learning (MDRL) model for autonomous UAV coordination, and a Blockchain implementation for transparency and traceability in the interactions between vehicles and UAVs. The relay selection considers the two‑sided preferences of vehicles and UAVs based on the Quality‑of‑UAV (QoU) and the Quality‑of‑Vehicle (QoV). Upon selection of relay UAVs, the decentralized coordination between them is enabled through an MDRL model trained to control their mobility and maintain the network coverage and connectivity using Proximal Policy Optimization (PPO). The evaluation results demonstrate that the proposed selection and coordination mechanisms improve the stability of the selected relays and maximize the coverage and connectivity achieved by the UAVs.

PaperID: 11528, https://arxiv.org/pdf/2502.15700.pdf

Authors: Muhammad Arslan, Saba Munawar, Christophe Cruz

Title: Sustainable Digitalization of Business with Multi-Agent RAG and LLM

Abstract:
Businesses heavily rely on data sourced from various channels like news articles, financial reports, and consumer reviews to drive their operations, enabling informed decision‑making and identifying opportunities. However, traditional manual methods for data extraction are often time‑consuming and resource‑intensive, prompting the adoption of digital transformation initiatives to enhance efficiency. Yet, concerns persist regarding the sustainability of such initiatives and their alignment with the United Nations (UN)'s Sustainable Development Goals (SDGs). This research aims to explore the integration of Large Language Models (LLMs) with Retrieval‑Augmented Generation (RAG) as a sustainable solution for Information Extraction (IE) and processing. The research methodology involves reviewing existing solutions for business decision‑making, noting that many systems require training new machine learning models, which are resource‑intensive and have significant environmental impacts. Instead, we propose a sustainable business solution using pre‑existing LLMs that can work with diverse datasets. We link domain‑specific datasets to tailor LLMs to company needs and employ a Multi‑Agent architecture to divide tasks such as information retrieval, enrichment, and classification among specialized agents. This approach optimizes the extraction process and improves overall efficiency. Through the utilization of these technologies, businesses can optimize resource utilization, improve decision‑making processes, and contribute to sustainable development goals, thereby fostering environmental responsibility within the corporate sector.

PaperID: 11529, https://arxiv.org/pdf/2502.15663.pdf

Authors: Natasha Astudillo, Fernando Koch

Title: Multi-Agent Architecture in Distributed Environment Control Systems: vision, challenges, and opportunities

Abstract:
The increasing demand for energy‑efficient solutions in large‑scale infrastructure, particularly data centers, requires advanced control strategies to optimize environmental management systems. We propose a multi‑agent architecture for distributed control of air‑cooled chiller systems in data centers. Our vision employs autonomous agents to monitor and regulate local operational parameters and optimize system‑wide efficiency. We demonstrate how this approach improves the responsiveness, operational robustness, and energy efficiency of the system, contributing to the broader goal of sustainable infrastructure management.

PaperID: 11530, https://arxiv.org/pdf/2502.15601.pdf

Authors: Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai

Title: WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

Abstract:
Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever‑growing manual through auto‑verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off‑the‑shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state‑of‑the‑art methods, we demonstrate the versatility of WorldCraft, ranging from single‑object customization to intricate, large‑scale interior and exterior scene designs. This system empowers non‑professionals to bring their creative visions to life.

PaperID: 11531, https://arxiv.org/pdf/2502.15538.pdf

Authors: Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, Tingwen Liu

Title: SOTOPIA-$Ω$: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents

Abstract:
Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA‑Ω framework aims to address and bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects multi‑step reasoning strategies inspired by negotiation theory and two simple direct strategies into expert agents, thereby automating the construction of a high‑quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S‑IF) and propose two new S‑IF evaluation metrics that complement social capability. We demonstrate that several 7B models trained on high‑quality corpus not only significantly surpass the expert agent (GPT‑4) in achieving social goals but also enhance S‑IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which can especially break the agent's prolonged deadlock.

PaperID: 11532, https://arxiv.org/pdf/2502.15472.pdf

Authors: Yufeng Diao, Yichi Zhang, Changyang She, Philip Guodong Zhao, Emma Liying Li

Title: Aligning Task- and Reconstruction-Oriented Communications for Edge Intelligence

Abstract:
Existing communication systems aim to reconstruct the information at the receiver side, and are known as reconstruction‑oriented communications. This approach often falls short in meeting the real‑time, task‑specific demands of modern AI‑driven applications such as autonomous driving and semantic segmentation. As a new design principle, task‑oriented communications have been developed. However, it typically requires joint optimization of encoder, decoder, and modified inference neural networks, resulting in extensive cross‑system redesigns and compatibility issues. This paper proposes a novel communication framework that aligns reconstruction‑oriented and task‑oriented communications for edge intelligence. The idea is to extend the Information Bottleneck (IB) theory to optimize data transmission by minimizing task‑relevant loss function, while maintaining the structure of the original data by an information reshaper. Such an approach integrates task‑oriented communications with reconstruction‑oriented communications, where a variational approach is designed to handle the intractability of mutual information in high‑dimensional neural network features. We also introduce a joint source‑channel coding (JSCC) modulation scheme compatible with classical modulation techniques, enabling the deployment of AI technologies within existing digital infrastructures. The proposed framework is particularly effective in edge‑based autonomous driving scenarios. Our evaluation in the Car Learning to Act (CARLA) simulator demonstrates that the proposed framework significantly reduces bits per service by 99.19% compared to existing methods, such as JPEG, JPEG2000, and BPG, without compromising the effectiveness of task execution.

PaperID: 11533, https://arxiv.org/pdf/2502.15429.pdf

Authors: Lihu Chen, Shuojie Fu, Gabriel Freedman, Cemre Zor, Guy Martin, James Kinross, Uddhav Vaghela, Ovidiu Serban, Francesca Toni

Title: Pub-Guard-LLM: Detecting Retracted Biomedical Articles with Reliable Explanations

Abstract:
A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub‑Guard‑LLM, the first large language model‑based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub‑Guard‑LLM: vanilla reasoning, retrieval‑augmented generation, and multi‑agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open‑source benchmark, PubMed Retraction, comprising over 11K real‑world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub‑Guard‑LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub‑Guard‑LLM contributes to safeguarding research integrity with a novel, effective, open‑source tool.

PaperID: 11534, https://arxiv.org/pdf/2502.15425.pdf

Authors: Giuseppe Paolo, Abdelhakim Benechehab, Hamza Cherkaoui, Albert Thomas, Balázs Kégl

Title: TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning

Abstract:
Hierarchical organization is fundamental to biological systems and human societies, yet artificial intelligence systems often rely on monolithic architectures that limit adaptability and scalability. Current hierarchical reinforcement learning (HRL) approaches typically restrict hierarchies to two levels or require centralized training, which limits their practical applicability. We introduce TAME Agent Framework (TAG), a framework for constructing fully decentralized hierarchical multi‑agent systems. TAG enables hierarchies of arbitrary depth through a novel LevelEnv concept, which abstracts each hierarchy level as the environment for the agents above it. This approach standardizes information flow between levels while preserving loose coupling, allowing for seamless integration of diverse agent types. We demonstrate the effectiveness of TAG by implementing hierarchical architectures that combine different RL agents across multiple levels, achieving improved performance over classical multi‑agent RL baselines on standard benchmarks. Our results show that decentralized hierarchical organization enhances both learning speed and final performance, positioning TAG as a promising direction for scalable multi‑agent systems.

PaperID: 11535, https://arxiv.org/pdf/2502.15359.pdf

Authors: Kai Li, Fei Liu, Zhenkun Wang, Xialiang Tong, Xiongwei Han, Mingxuan Yuan, Qingfu Zhang

Title: ARS: Automatic Routing Solver with Large Language Models

Abstract:
Real‑world Vehicle Routing Problems (VRPs) are characterized by a variety of practical constraints, making manual solver design both knowledge‑intensive and time‑consuming. Although there is increasing interest in automating the design of routing algorithms, existing research has explored only a limited array of VRP variants and fails to adequately address the complex and prevalent constraints encountered in real‑world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating the effectiveness of automatic routing solvers in addressing complex constraints. Along with RoutBench, we present the Automatic Routing Solver (ARS), which employs Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint‑aware heuristic code, based on problem descriptions and several representative constraints selected from a database. Our experiments show that ARS outperforms state‑of‑the‑art LLM‑based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.

PaperID: 11536, https://arxiv.org/pdf/2502.15338.pdf

Authors: Junning Shao, Siwei Wang, Zhixuan Fang

Title: Learning with Limited Shared Information in Multi-agent Multi-armed Bandit

Abstract:
Multi‑agent multi‑armed bandit (MAMAB) is a classic collaborative learning model and has gained much attention in recent years. However, existing studies do not consider the case where an agent may refuse to share all her information with others, e.g., when some of the data contains personal privacy. In this paper, we propose a novel limited shared information multi‑agent multi‑armed bandit (LSI‑MAMAB) model in which each agent only shares the information that she is willing to share, and propose the Balanced‑ETC algorithm to help multiple agents collaborate efficiently with limited shared information. Our analysis shows that Balanced‑ETC is asymptotically optimal and its average regret (on each agent) approaches a constant when there are sufficient agents involved. Moreover, to encourage agents to participate in this collaborative learning, an incentive mechanism is proposed to make sure each agent can benefit from the collaboration system. Finally, we present experimental results to validate our theoretical results.

PaperID: 11537, https://arxiv.org/pdf/2502.15335.pdf

Authors: Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren

Title: Stepwise Informativeness Search for Efficient and Effective LLM Reasoning

Abstract:
Advances in Large Language Models (LLMs) have significantly improved multi‑step reasoning through generating free‑text rationales. However, recent studies show that LLMs tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading to generate unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise step‑by‑step rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference‑time tree search framework incorporating two selection heuristics: grounding‑guided selection which prioritizes steps paying higher attention over underutilized steps; and novelty‑guided selection which encourages steps with novel conclusions. During rationale generation, we use a self‑grounding strategy that prompts LLMs to explicitly reference relevant prior steps to provide premises before deduction at each step. Experimental results on four reasoning datasets demonstrate that our approach improves reasoning accuracy by generating higher‑quality rationales with reduced errors and redundancy.

PaperID: 11538, https://arxiv.org/pdf/2502.15252.pdf

Authors: Amartaivan Sanjjamts, Hiroshi Morita, Togootogtokh Enkhtogtokh

Title: Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models

Abstract:
Understanding collective pedestrian movement is crucial for applications in crowd management, autonomous navigation, and human‑robot interaction. This paper investigates the use of sequential deep learning models, including Recurrent Neural Networks (RNNs), Long Short‑Term Memory (LSTM) networks, and Transformers, for real‑time flock detection in multi‑pedestrian trajectories. Our proposed approach consists of a two‑stage process: first, a pre‑trained binary classification model is used for pairwise trajectory classification, and second, the learned representations are applied to identify multi‑agent flocks dynamically. We validate our method using real‑world group movement datasets, demonstrating its robustness across varying sequence lengths and diverse movement patterns. Experimental results indicate that our model consistently detects pedestrian flocks with high accuracy and stability, even in dynamic and noisy environments. Furthermore, we extend our approach to identify other forms of collective motion, such as convoys and swarms, paving the way for more comprehensive multi‑agent behavior analysis.

PaperID: 11539, https://arxiv.org/pdf/2502.15240.pdf

Authors: Piyushi Manupriya, Himanshu, SakethaNath Jagarlapudi, Ganesh Ghalme

Title: Multi-agent Multi-armed Bandits with Minimum Reward Guarantee Fairness

Abstract:
We investigate the problem of maximizing social welfare while ensuring fairness in a multi‑agent multi‑armed bandit (MA‑MAB) setting. In this problem, a centralized decision‑maker takes actions over time, generating random rewards for various agents. Our goal is to maximize the sum of expected cumulative rewards, a.k.a. social welfare, while ensuring that each agent receives an expected reward that is at least a constant fraction of the maximum possible expected reward. Our proposed algorithm, RewardFairUCB, leverages the Upper Confidence Bound (UCB) technique to achieve sublinear regret bounds for both fairness and social welfare. The fairness regret measures the positive difference between the minimum reward guarantee and the expected reward of a given policy, whereas the social welfare regret measures the difference between the social welfare of the optimal fair policy and that of the given policy. We show that RewardFairUCB algorithm achieves instance‑independent social welfare regret guarantees of \tildeO(T^1/2) and a fairness regret upper bound of \tildeO(T^3/4). We also give the lower bound of Ω(\sqrtT) for both social welfare and fairness regret. We evaluate RewardFairUCB's performance against various baseline and heuristic algorithms using simulated data and real world data, highlighting trade‑offs between fairness and social welfare regrets.

PaperID: 11540, https://arxiv.org/pdf/2502.15224.pdf

Authors: Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu

Title: Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Abstract:
Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human‑like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, Auto‑Bench, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of underlying interactions, the chemistry and social interactions, through strategic interventions. We evaluate state‑of‑the‑art LLMs, including GPT‑4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs need to take into consideration.

PaperID: 11541, https://arxiv.org/pdf/2502.15212.pdf

Authors: Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu

Title: Measuring AI agent autonomy: Towards a scalable approach with code inspection

Abstract:
AI agents are AI systems that can achieve complex goals autonomously. Assessing the level of agent autonomy is crucial for understanding both their potential benefits and risks. Current assessments of autonomy often focus on specific risks and rely on run‑time evaluations ‑‑ observations of agent actions during operation. We introduce a code‑based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks, thereby reducing the costs and risks associated with run‑time evaluations. Using this code‑based framework, the orchestration code used to run an AI agent can be scored according to a taxonomy that assesses attributes of autonomy: impact and oversight. We demonstrate this approach with the AutoGen framework and select applications.

PaperID: 11542, https://arxiv.org/pdf/2502.15029.pdf

Authors: Marcelo Matheus Gauy, Anna Abramishvili, Eduardo Colli, Nicolaus Heuer, Tiago Madeira, Frederik Mallmann-Trenn, Vinícius Franco Vasconcelos, David Kohan Marzagão

Title: Voter Model Meets Rumour Spreading: an FPRAS for Consensus Probabilities on Voter Models with Agnostic Nodes

Abstract:
Problems of consensus in multi‑agent systems are often viewed as a series of independent, simultaneous local decisions made between a limited set of options, all aimed at reaching a global agreement. Key challenges in these protocols include estimating the likelihood of various outcomes and finding bounds for how long it may take to achieve consensus, if it occurs at all. To date, little attention has been given to the case where some agents have no initial opinion. In this paper, we introduce a variant of the consensus problem which includes what we call `agnostic' nodes and frame it as a combination of two known and well‑studied processes: voter model and rumour spreading. We show (1) a martingale that describes the probability of consensus for a given colour, (2) bounds on the number of steps for the process to end using results from rumour spreading and voter models, (3) closed formulas for the probability of consensus in a few special cases, along with a polynomial‑time algorithm for the case where the number of agnostic vertices is at most logarithmic and (4) that the computational complexity of estimating the probability with a Markov chain Monte Carlo process is O(n^2 \log n) for general graphs and O(n\log n) for Erdős‑Rényi graphs, resulting in a fully polynomial‑time randomized approximation scheme (FPRAS) for estimating the probabilities of consensus. Furthermore, we present experimental results suggesting that the number of runs needed for a given standard error decreases when the number of nodes increases.

PaperID: 11543, https://arxiv.org/pdf/2502.14930.pdf

Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Lisa Vandenhurk, Joey Chua

Title: RAGVA: Engineering Retrieval Augmented Generation-based Virtual Assistants in Practice

Abstract:
Retrieval‑augmented generation (RAG)‑based applications are gaining prominence due to their ability to leverage large language models (LLMs). These systems excel at combining retrieval mechanisms with generative capabilities, resulting in more accurate, contextually relevant responses that enhance user experience. In particular, Transurban, a road operation company, is replacing its rule‑based virtual assistant (VA) with a RAG‑based VA (RAGVA) to offer more flexible customer interactions and support a wider range of scenarios. In this paper, drawing from the experience at Transurban, we present a comprehensive step‑by‑step guide for building a conversational application and how to engineer a RAGVA. These guides aim to serve as references for future researchers and practitioners. While the engineering processes for traditional software applications are well‑established, the development and evaluation of RAG‑based applications are still in their early stages, with numerous emerging challenges remaining uncharted. To address this gap, we conduct a focus group study with Transurban practitioners regarding developing and evaluating their RAGVA. We identified eight challenges encountered by the engineering team and proposed eight future directions that should be explored to advance the development of RAG‑based applications. This study contributes to the foundational understanding of a RAG‑based conversational application and the emerging AI software engineering challenges it presents.

PaperID: 11544, https://arxiv.org/pdf/2502.14913.pdf

Authors: Xiangjin Xie, Guangwei Xu, Lingyan Zhao, Ruijie Guo

Title: OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

Abstract:
Although multi‑agent collaborative Large Language Models (LLMs) have achieved significant breakthroughs in the Text‑to‑SQL task, their performance is still constrained by various factors. These factors include the incompleteness of the framework, failure to follow instructions, and model hallucination problems. To address these problems, we propose OpenSearch‑SQL, which divides the Text‑to‑SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. This architecture aligns the inputs and outputs of agents through the Alignment module, reducing failures in instruction following and hallucination. Additionally, we designed an intermediate language called SQL‑Like and optimized the structured CoT based on SQL‑Like. Meanwhile, we developed a dynamic few‑shot strategy in the form of self‑taught Query‑CoT‑SQL. These methods have significantly improved the performance of LLMs in the Text‑to‑SQL task. In terms of model selection, we directly applied the base LLMs without any post‑training, thereby simplifying the task chain and enhancing the framework's portability. Experimental results show that OpenSearch‑SQL achieves an execution accuracy(EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward‑based validity efficiency score (R‑VES) of 69.36%, with all three metrics ranking first at the time of submission. These results demonstrate the comprehensive advantages of the proposed method in both effectiveness and efficiency.

PaperID: 11545, https://arxiv.org/pdf/2502.14847.pdf

Authors: Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, Hui Liu

Title: Red-Teaming LLM Multi-Agent Systems via Communication Attacks

Abstract:
Large Language Model‑based Multi‑Agent Systems (LLM‑MAS) have revolutionized complex problem‑solving capability by enabling sophisticated agent collaboration through message‑based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent‑in‑the‑Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM‑MAS by intercepting and manipulating inter‑agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi‑agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role‑restricted communication format, we develop an LLM‑powered adversarial agent with a reflection mechanism that generates contextually‑aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real‑world applications demonstrates that LLM‑MAS is vulnerable to communication‑based attacks, highlighting the need for robust security measures in multi‑agent systems.

PaperID: 11546, https://arxiv.org/pdf/2502.14815.pdf

Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica

Title: Optimizing Model Selection for Compound AI Systems

Abstract:
Compound AI systems that combine multiple LLM calls, such as self‑refine and multi‑agent‑debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end‑to‑end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per‑module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module‑wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high‑quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi‑agent debate and self‑refine using LLMs such as GPT‑4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%‑70% accuracy gains compared to using the same LLM for all modules.

PaperID: 11547, https://arxiv.org/pdf/2502.14803.pdf

Authors: Gregg Rabideau, Joseph Russino, Andrew Branch, Nihal Dhamani, Tiago Stegun Vaquero, Steve Chien, Jean-Pierre de la Croix, Federico Rossi

Title: Planning, scheduling, and execution on the Moon: the CADRE technology demonstration mission

Abstract:
NASA's Cooperative Autonomous Distributed Robotic Exploration (CADRE) mission, slated for flight to the Moon's Reiner Gamma region in 2025/2026, is designed to demonstrate multi‑agent autonomous exploration of the Lunar surface and sub‑surface. A team of three robots and a base station will autonomously explore a region near the lander, collecting the data required for 3D reconstruction of the surface with no human input; and then autonomously perform distributed sensing with multi‑static ground penetrating radars (GPR), driving in formation while performing coordinated radar soundings to create a map of the subsurface. At the core of CADRE's software architecture is a novel autonomous, distributed planning, scheduling, and execution (PS&E) system. The system coordinates the robots' activities, planning and executing tasks that require multiple robots' participation while ensuring that each individual robot's thermal and power resources stay within prescribed bounds, and respecting ground‑prescribed sleep‑wake cycles. The system uses a centralized‑planning, distributed‑execution paradigm, and a leader election mechanism ensures robustness to failures of individual agents. In this paper, we describe the architecture of CADRE's PS&E system; discuss its design rationale; and report on verification and validation (V&V) testing of the system on CADRE's hardware in preparation for deployment on the Moon.

PaperID: 11548, https://arxiv.org/pdf/2502.14796.pdf

Authors: Haya Nachimovsky, Moshe Tennenholtz, Oren Kurland

Title: A Multi-Agent Perspective on Modern Information Retrieval

Abstract:
The rise of large language models (LLMs) has introduced a new era in information retrieval (IR), where queries and documents that were once assumed to be generated exclusively by humans can now also be created by automated agents. These agents can formulate queries, generate documents, and perform ranking. This shift challenges some long‑standing IR paradigms and calls for a reassessment of both theoretical frameworks and practical methodologies. We advocate for a multi‑agent perspective to better capture the complex interactions between query agents, document agents, and ranker agents. Through empirical exploration of various multi‑agent retrieval settings, we reveal the significant impact of these interactions on system performance. Our findings underscore the need to revisit classical IR paradigms and develop new frameworks for more effective modeling and evaluation of modern retrieval systems.

PaperID: 11549, https://arxiv.org/pdf/2502.14743.pdf

Authors: Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen

Title: Multi-Agent Coordination across Diverse Applications: A Survey

Abstract:
Multi‑agent coordination studies the underlying mechanism enabling the trending spread of diverse multi‑agent systems (MAS) and has received increasing attention, driven by the expansion of emerging applications and rapid AI advances. This survey outlines the current state of coordination research across applications through a unified understanding that answers four fundamental coordination questions: (1) what is coordination; (2) why coordination; (3) who to coordinate with; and (4) how to coordinate. Our purpose is to explore existing ideas and expertise in coordination and their connections across diverse applications, while identifying and highlighting emerging and promising research directions. First, general coordination problems that are essential to varied applications are identified and analyzed. Second, a number of MAS applications are surveyed, ranging from widely studied domains, e.g., search and rescue, warehouse automation and logistics, and transportation systems, to emerging fields including humanoid and anthropomorphic robots, satellite systems, and large language models (LLMs). Finally, open challenges about the scalability, heterogeneity, and learning mechanisms of MAS are analyzed and discussed. In particular, we identify the hybridization of hierarchical and decentralized coordination, human‑MAS coordination, and LLM‑based MAS as promising future directions.

PaperID: 11550, https://arxiv.org/pdf/2502.14662.pdf

Authors: Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang

Title: iAgent: LLM Agent as a Shield between User and Recommender Systems

Abstract:
Traditional recommender systems usually take the user‑platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual user's preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user‑platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform‑side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user‑agent‑platform paradigm, where agent serves as the protective shield between user and recommender system that enables indirect exposure.

PaperID: 11551, https://arxiv.org/pdf/2502.14634.pdf

Authors: Ali Razghandi, Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah

Title: CER: Confidence Enhanced Reasoning in LLMs

Abstract:
Ensuring the reliability of Large Language Models (LLMs) in complex reasoning tasks remains a formidable challenge, particularly in scenarios that demand precise mathematical calculations and knowledge‑intensive open‑domain generation. In this work, we introduce an uncertainty‑aware framework designed to enhance the accuracy of LLM responses by systematically incorporating model confidence at critical decision points. We propose an approach that encourages multi‑step reasoning in LLMs and quantify the confidence of intermediate answers such as numerical results in mathematical reasoning and proper nouns in open‑domain generation. Then, the overall confidence of each reasoning chain is evaluated based on confidence of these critical intermediate steps. Finally, we aggregate the answer of generated response paths in a way that reflects the reliability of each generated content (as opposed to self‑consistency in which each generated chain contributes equally to majority voting). We conducted extensive experiments in five datasets, three mathematical datasets and two open‑domain datasets, using four LLMs. The results consistently validate the effectiveness of our novel confidence aggregation method, leading to an accuracy improvement of up to 7.4% and 5.8% over baseline approaches in math and open‑domain generation tasks, respectively. Code is publicly available at https://github.com/ Aquasar11/CER.

PaperID: 11552, https://arxiv.org/pdf/2502.14606.pdf

Authors: Raihana Ferdous, Fitsum Kifetew, Davide Prandi, Angelo Susi

Title: Curiosity Driven Multi-agent Reinforcement Learning for 3D Game Testing

Abstract:
Recently testing of games via autonomous agents has shown great promise in tackling challenges faced by the game industry, which mainly relied on either manual testing or record/replay. In particular Reinforcement Learning (RL) solutions have shown potential by learning directly from playing the game without the need for human intervention. In this paper, we present cMarlTest, an approach for testing 3D games through curiosity driven Multi‑Agent Reinforcement Learning (MARL). cMarlTest deploys multiple agents that work collaboratively to achieve the testing objective. The use of multiple agents helps resolve issues faced by a single agent approach. We carried out experiments on different levels of a 3D game comparing the performance of cMarlTest with a single agent RL variant. Results are promising where, considering three different types of coverage criteria, cMarlTest achieved higher coverage. cMarlTest was also more efficient in terms of the time taken, with respect to the single agent based variant.

PaperID: 11553, https://arxiv.org/pdf/2502.14499.pdf

Authors: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu

Title: MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Abstract:
We introduce Meta MLGym and MLGym‑Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym‑bench consists of 13 diverse and open‑ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real‑world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude‑3.5‑Sonnet, Llama‑3.1 405B, GPT‑4o, o1‑preview, and Gemini‑1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open‑source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

PaperID: 11554, https://arxiv.org/pdf/2502.14496.pdf

Authors: Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

Title: Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization

Abstract:
LLM‑based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer using. Current multi‑agent systems universally excel in performance, compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi‑agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi‑agent reinforcement learning framework with a novel multi‑agent credit re‑assignment (CR) strategy, assigning process rewards with LLMs rather than environment‑specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role‑free agents' policies. Empirical results show that our framework improves both performance and cross‑environment generalizability of multi‑agent systems. Moreover, our 7B‑parameter system achieves results on par with or exceed strong closed‑source models, and the LLM that guides the CR. We also provide insights in using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi‑agent systems.

PaperID: 11555, https://arxiv.org/pdf/2502.14333.pdf

Authors: Ting-Ruen Wei, Haowei Liu, Xuyang Wu, Yi Fang

Title: A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics

Abstract:
Recent progress in large language models (LLM) found chain‑of‑thought prompting strategies to improve the reasoning ability of LLMs by encouraging problem solving through multiple steps. Therefore, subsequent research aimed to integrate the multi‑step reasoning process into the LLM itself through process rewards as feedback and achieved improvements over prompting strategies. Due to the cost of step‑level annotation, some turn to outcome rewards as feedback. Aside from these training‑based approaches, training‑free techniques leverage frozen LLMs or external tools for feedback at each step to enhance the reasoning process. With the abundance of work in mathematics due to its logical nature, we present a survey of strategies utilizing feedback at the step and outcome levels to enhance multi‑step math reasoning for LLMs. As multi‑step reasoning emerges a crucial component in scaling LLMs, we hope to establish its foundation for easier understanding and empower further research.

PaperID: 11556, https://arxiv.org/pdf/2502.14321.pdf

Authors: Bingyu Yan, Zhibo Zhou, Litian Zhang, Lian Zhang, Ziyi Zhou, Dezhuang Miao, Zhoujun Li, Chaozhuo Li, Xiaoming Zhang

Title: Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

Abstract:
Large language model‑based multi‑agent systems have recently gained significant attention due to their potential for complex, collaborative, and intelligent problem‑solving capabilities. Existing surveys typically categorize LLM‑based multi‑agent systems (LLM‑MAS) according to their application domains or architectures, overlooking the central role of communication in coordinating agent behaviors and interactions. To address this gap, this paper presents a comprehensive survey of LLM‑MAS from a communication‑centric perspective. Specifically, we propose a structured framework that integrates system‑level communication (architecture, goals, and protocols) with system internal communication (strategies, paradigms, objects, and content), enabling a detailed exploration of how agents interact, negotiate, and achieve collective intelligence. Through an extensive analysis of recent literature, we identify key components in multiple dimensions and summarize their strengths and limitations. In addition, we highlight current challenges, including communication efficiency, security vulnerabilities, inadequate benchmarking, and scalability issues, and outline promising future research directions. This review aims to help researchers and practitioners gain a clear understanding of the communication mechanisms in LLM‑MAS, thereby facilitating the design and deployment of robust, scalable, and secure multi‑agent systems.

PaperID: 11557, https://arxiv.org/pdf/2502.14200.pdf

Authors: Hao Ma, Zhiqiang Pu, Yi Pan, Boyin Liu, Junlong Gao, Zhenyu Guo

Title: Causal Mean Field Multi-Agent Reinforcement Learning

Abstract:
Scalability remains a challenge in multi‑agent reinforcement learning and is currently under active research. A framework named mean‑field reinforcement learning (MFRL) could alleviate the scalability problem by employing the Mean Field Theory to turn a many‑agent problem into a two‑agent problem. However, this framework lacks the ability to identify essential interactions under nonstationary environments. Causality contains relatively invariant mechanisms behind interactions, though environments are nonstationary. Therefore, we propose an algorithm called causal mean‑field Q‑learning (CMFQ) to address the scalability problem. CMFQ is ever more robust toward the change of the number of agents though inheriting the compressed representation of MFRL's action‑state space. Firstly, we model the causality behind the decision‑making process of MFRL into a structural causal model (SCM). Then the essential degree of each interaction is quantified via intervening on the SCM. Furthermore, we design the causality‑aware compact representation for behavioral information of agents as the weighted sum of all behavioral information according to their causal effects. We test CMFQ in a mixed cooperative‑competitive game and a cooperative game. The result shows that our method has excellent scalability performance in both training in environments containing a large number of agents and testing in environments containing much more agents.

PaperID: 11558, https://arxiv.org/pdf/2502.14144.pdf

Authors: Primoz Kocbek, Leon Kopitar, Zhihong Zhang, Emirhan Aydin, Maxim Topaz, Gregor Stiglic

Title: UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text

Abstract:
This paper describes our submissions to the TREC 2024 PLABA track with the aim to simplify biomedical abstracts for a K8‑level audience (13‑14 years old students). We tested three approaches using OpenAI's gpt‑4o and gpt‑4o‑mini models: baseline prompt engineering, a two‑AI agent approach, and fine‑tuning. Adaptations were evaluated using qualitative metrics (5‑point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch‑Kincaid grade level, SMOG Index). Results indicated that the two‑agent approach and baseline prompt engineering with gpt‑4o‑mini models show superior qualitative performance, while fine‑tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt‑4o‑mini outperforms iterative improvement strategies via two‑agent approach as well as fine‑tuning with gpt‑4o. We intend to expand our investigation of the results and explore advanced evaluations.

PaperID: 11559, https://arxiv.org/pdf/2502.14143.pdf

Authors: Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenčiak, The Anh Han, Edward Hughes, Vojtěch Kovařík, Jan Kulveit, Joel Z. Leibo, Caspar Oesterheld, Christian Schroeder de Witt, Nisarg Shah, Michael Wellman, Paolo Bova, Theodor Cimpeanu, Carson Ezell, Quentin Feuillade-Montixi, Matija Franklin, Esben Kran, Igor Krawczuk, Max Lamparth, Niklas Lauffer, Alexander Meinke, Sumeet Motwani, Anka Reuel, Vincent Conitzer, Michael Dennis, Iason Gabriel, Adam Gleave, Gillian Hadfield, Nika Haghtalab, Atoosa Kasirzadeh, Sébastien Krier, Kate Larson, Joel Lehman, David C. Parkes, Georgios Piliouras, Iyad Rahwan

Title: Multi-Agent Risks from Advanced AI

Abstract:
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi‑agent systems of unprecedented complexity. These systems pose novel and under‑explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi‑agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real‑world examples and experimental evidence, we illustrate the distinct challenges posed by multi‑agent systems and their implications for the safety, governance, and ethics of advanced AI.

PaperID: 11560, https://arxiv.org/pdf/2502.14112.pdf

Authors: Hodaya Lampert, Reshef Meir, Kinneret Teodorescu

Title: To Stand on the Shoulders of Giants: Should We Protect Initial Discoveries in Multi-Agent Exploration?

Abstract:
Exploring new ideas is a fundamental aspect of research and development (R\&D), which often occurs in competitive environments. Most ideas are subsequent, i.e. one idea today leads to more ideas tomorrow. According to one approach, the best way to encourage exploration is by granting protection on discoveries to the first innovator. Correspondingly, only the one who made the first discovery can use the new knowledge and benefit from subsequent discoveries, which in turn should increase the initial motivation to explore. An alternative approach to promote exploration favors the \emphsharing of knowledge from discoveries among researchers allowing explorers to use each others' discoveries to develop further knowledge, as in the open‑source community. With no protection, all explorers have access to all existing discoveries and new directions are explored faster. We present a game theoretic analysis of an abstract research‑and‑application game which clarifies the expected advantages and disadvantages of the two approaches under full information. We then compare the theoretical predictions with the observed behavior of actual players in the lab who operate under partial information conditions in both worlds. Our main experimental finding is that the no protection approach leads to \emphmore investment efforts overall, in contrast to theoretical prediction and common economic wisdom, but in line with a familiar cognitive bias known as `underweighting of rare events'.

PaperID: 11561, https://arxiv.org/pdf/2502.14102.pdf

Authors: Ben Rachmut, Stylianos Loukas Vasileiou, Nimrod Meir Weinstein, Roie Zivan, William Yeoh

Title: Explainable Distributed Constraint Optimization Problems

Abstract:
The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model cooperative multi‑agent problems that need to be solved distributively. A core assumption of existing approaches is that DCOP solutions can be easily understood, accepted, and adopted, which may not hold, as evidenced by the large body of literature on Explainable AI. In this paper, we propose the Explainable DCOP (X‑DCOP) model, which extends a DCOP to include its solution and a contrastive query for that solution. We formally define some key properties that contrastive explanations must satisfy for them to be considered as valid solutions to X‑DCOPs as well as theoretical results on the existence of such valid explanations. To solve X‑DCOPs, we propose a distributed framework as well as several optimizations and suboptimal variants to find valid explanations. We also include a human user study that showed that users, not surprisingly, prefer shorter explanations over longer ones. Our empirical evaluations showed that our approach can scale to large problems, and the different variants provide different options for trading off explanation lengths for smaller runtimes. Thus, our model and algorithmic contributions extend the state of the art by reducing the barrier for users to understand DCOP solutions, facilitating their adoption in more real‑world applications.

PaperID: 11562, https://arxiv.org/pdf/2502.14000.pdf

Authors: Uwe M. Borghoff, Paolo Bottoni, Remo Pareschi

Title: Human-Artificial Interaction in the Age of Agentic AI: A System-Theoretical Approach

Abstract:
This paper presents a novel perspective on human‑computer interaction (HCI), framing it as a dynamic interplay between human and computational agents within a networked system. Going beyond traditional interface‑based approaches, we emphasize the importance of coordination and communication among heterogeneous agents with different capabilities, roles, and goals. A key distinction is made between multi‑agent systems (MAS) and Centaurian systems, which represent two different paradigms of human‑AI collaboration. MAS maintain agent autonomy, with structured protocols enabling cooperation, while Centaurian systems deeply integrate human and AI capabilities, creating unified decision‑making entities. To formalize these interactions, we introduce a framework for communication spaces, structured into surface, observation, and computation layers, ensuring seamless integration between MAS and Centaurian architectures, where colored Petri nets effectively represent structured Centaurian systems and high‑level reconfigurable networks address the dynamic nature of MAS. Our research has practical applications in autonomous robotics, human‑in‑the‑loop decision making, and AI‑driven cognitive architectures, and provides a foundation for next‑generation hybrid intelligence systems that balance structured coordination with emergent behavior.

PaperID: 11563, https://arxiv.org/pdf/2502.13965.pdf

Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica

Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs

Abstract:
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general‑purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head‑of‑line blocking at both the individual LLM request and the program. To address this, we introduce Autellix, an LLM serving system that treats programs as first‑class citizens to minimize their end‑to‑end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program‑level context. We propose two scheduling algorithms‑for single‑threaded and distributed programs‑that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves throughput of programs by 4‑15x at the same latency compared to state‑of‑the‑art systems, such as vLLM.

PaperID: 11564, https://arxiv.org/pdf/2502.13923.pdf

Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin

Title: Qwen2.5-VL Technical Report

Abstract:
We introduce Qwen2.5‑VL, the latest flagship model of Qwen vision‑language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5‑VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long‑video comprehension. A standout feature of Qwen2.5‑VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5‑VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second‑level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic‑resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5‑VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real‑world scenarios such as operating computers and mobile devices. Qwen2.5‑VL is available in three sizes, addressing diverse use cases from edge AI to high‑performance computing. The flagship Qwen2.5‑VL‑72B model matches state‑of‑the‑art models like GPT‑4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5‑VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

PaperID: 11565, https://arxiv.org/pdf/2502.13789.pdf

Authors: Yi-Fan Zhang, Hang Li, Dingjie Song, Lichao Sun, Tianlong Xu, Qingsong Wen

Title: From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

Abstract:
Large Language Models (LLMs), such as GPT‑4, have demonstrated impressive mathematical reasoning capabilities, achieving near‑perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce MathCCS (Mathematical Classification and Constructive Suggestions), a multi‑modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real‑world problems, expert‑annotated error categories, and longitudinal student data. Evaluations of state‑of‑the‑art models, including Qwen2‑VL, LLaVA‑OV, Claude‑3.5‑Sonnet and GPT‑4o, reveal that none achieved classification accuracy above 30% or generated high‑quality suggestions (average scores below 4/10), highlighting a significant gap from human‑level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi‑agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real‑time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real‑world teaching.

PaperID: 11566, https://arxiv.org/pdf/2502.13767.pdf

Authors: Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, Baishakhi Ray

Title: Agentic AI Software Engineers: Programming with Trust

Abstract:
Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human‑driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust.

PaperID: 11567, https://arxiv.org/pdf/2502.13759.pdf

Authors: Zirui Song, Jingpu Yang, Yuan Huang, Jonathan Tonglet, Zeyu Zhang, Tao Cheng, Meng Fang, Iryna Gurevych, Xiuying Chen

Title: Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework

Abstract:
Geolocation, the task of identifying an image's location, requires complex reasoning and is crucial for navigation, monitoring, and cultural preservation. However, current methods often produce coarse, imprecise, and non‑interpretable localization. A major challenge lies in the quality and scale of existing geolocation datasets. These datasets are typically small‑scale and automatically constructed, leading to noisy data and inconsistent task difficulty, with images that either reveal answers too easily or lack sufficient clues for reliable inference. To address these challenges, we introduce a comprehensive geolocation framework with three key components: GeoComp, a large‑scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric, collectively designed to address critical challenges and drive advancements in geolocation research. At the core of this framework is GeoComp (Geolocation Competition Dataset), a large‑scale dataset collected from a geolocation game platform involving 740K users over two years. It comprises 25 million entries of metadata and 3 million geo‑tagged locations spanning much of the globe, with each location annotated thousands to tens of thousands of times by human users. The dataset offers diverse difficulty levels for detailed analysis and highlights key gaps in current models. Building on this dataset, we propose Geographical Chain‑of‑Thought (GeoCoT), a novel multi‑step reasoning framework designed to enhance the reasoning capabilities of Large Vision Models (LVMs) in geolocation tasks. GeoCoT improves performance by integrating contextual and spatial cues through a multi‑step process that mimics human geolocation reasoning. Finally, using the GeoEval metric, we demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.

PaperID: 11568, https://arxiv.org/pdf/2502.13701.pdf

Authors: Sylvia S. Kerkhove, Natasha Alechina, Mehdi Dastani

Title: Causes and Strategies in Multiagent Systems

Abstract:
Causality plays an important role in daily processes, human reasoning, and artificial intelligence. There has however not been much research on causality in multi‑agent strategic settings. In this work, we introduce a systematic way to build a multi‑agent system model, represented as a concurrent game structure, for a given structural causal model. In the obtained so‑called causal concurrent game structure, transitions correspond to interventions on agent variables of the given causal model. The Halpern and Pearl framework of causality is used to determine the effects of a certain value for an agent variable on other variables. The causal concurrent game structure allows us to analyse and reason about causal effects of agents' strategic decisions. We formally investigate the relation between causal concurrent game structures and the original structural causal models.

PaperID: 11569, https://arxiv.org/pdf/2502.13677.pdf

Authors: Tianshu Ruan, Aniketh Ramesh, Hao Wang, Alix Johnstone-Morfoisse, Gokcenur Altindal, Paul Norman, Grigoris Nikolaou, Rustam Stolkin, Manolis Chiou

Title: A Framework for Semantics-based Situational Awareness during Mobile Robot Deployments

Abstract:
Deployment of robots into hazardous environments typically involves a ``Human‑Robot Teaming'' (HRT) paradigm, in which a human supervisor interacts with a remotely operating robot inside the hazardous zone. Situational Awareness (SA) is vital for enabling HRT, to support navigation, planning, and decision‑making. This paper explores issues of higher‑level ``semantic'' information and understanding in SA. In semi‑autonomous, or variable‑autonomy paradigms, different types of semantic information may be important, in different ways, for both the human operator and an autonomous agent controlling the robot. We propose a generalizable framework for acquiring and combining multiple modalities of semantic‑level SA during remote deployments of mobile robots. We demonstrate the framework with an example application of search and rescue (SAR) in disaster response robotics. We propose a set of ``environment semantic indicators" that can reflect a variety of different types of semantic information, e.g. indicators of risk, or signs of human activity, as the robot encounters different scenes. Based on these indicators, we propose a metric to describe the overall situation of the environment called ``Situational Semantic Richness (SSR)". This metric combines multiple semantic indicators to summarise the overall situation. The SSR indicates if an information‑rich and complex situation has been encountered, which may require advanced reasoning for robots and humans and hence the attention of the expert human operator. The framework is tested on a Jackal robot in a mock‑up disaster response environment. Experimental results demonstrate that the proposed semantic indicators are sensitive to changes in different modalities of semantic information in different scenes, and the SSR metric reflects overall semantic changes in the situations encountered.

PaperID: 11570, https://arxiv.org/pdf/2502.13621.pdf

Authors: Francesco Pontiggia, Filip Macák, Roman Andriushchenko, Michele Chiari, Milan Češka

Title: Decentralized Planning Using Probabilistic Hyperproperties

Abstract:
Multi‑agent planning under stochastic dynamics is usually formalised using decentralized (partially observable) Markov decision processes ( MDPs) and reachability or expected reward specifications. In this paper, we propose a different approach: we use an MDP describing how a single agent operates in an environment and probabilistic hyperproperties to capture desired temporal objectives for a set of decentralized agents operating in the environment. We extend existing approaches for model checking probabilistic hyperproperties to handle temporal formulae relating paths of different agents, thus requiring the self‑composition between multiple MDPs. Using several case studies, we demonstrate that our approach provides a flexible and expressive framework to broaden the specification capabilities with respect to existing planning techniques. Additionally, we establish a close connection between a subclass of probabilistic hyperproperties and planning for a particular type of Dec‑MDPs, for both of which we show undecidability. This lays the ground for the use of existing decentralized planning tools in the field of probabilistic hyperproperty verification.

PaperID: 11571, https://arxiv.org/pdf/2502.13430.pdf

Authors: Hao Ma, Shijie Wang, Zhiqiang Pu, Siyao Zhao, Xiaolin Ai

Title: Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

Abstract:
Guiding the policy of multi‑agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long‑horizon multi‑agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential‑based rewards, to enhance policy alignment. The existing works, however, primarily rely on experts to design rule‑based rewards, which are often labor‑intensive and lack a high‑level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision‑based reward shaping method. At the bottom layer, a visual‑language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapts to uncertainty and changes in long‑horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select suitable potential function from a pre‑designed pool. Besides, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.

PaperID: 11572, https://arxiv.org/pdf/2502.13376.pdf

Authors: Ameesh Shah, Niklas Lauffer, Thomas Chen, Nikhil Pitta, Sanjit A. Seshia

Title: Learning Symbolic Task Decompositions for Multi-Agent Teams

Abstract:
One approach for improving sample efficiency in cooperative multi‑agent learning is to decompose overall tasks into sub‑tasks that can be assigned to individual agents. We study this problem in the context of reward machines: symbolic tasks that can be formally decomposed into sub‑tasks. In order to handle settings without a priori knowledge of the environment, we introduce a framework that can learn the optimal decomposition from model‑free interactions with the environment. Our method uses a task‑conditioned architecture to simultaneously learn an optimal decomposition and the corresponding agents' policies for each sub‑task. In doing so, we remove the need for a human to manually design the optimal decomposition while maintaining the sample‑efficiency benefits of improved credit assignment. We provide experimental results in several deep reinforcement learning settings, demonstrating the efficacy of our approach. Our results indicate that our approach succeeds even in environments with codependent agent dynamics, enabling synchronous multi‑agent learning not achievable in previous works.

PaperID: 11573, https://arxiv.org/pdf/2502.13295.pdf

Authors: Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish

Title: Demonstrating specification gaming in reasoning models

Abstract:
We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like OpenAI o3 and DeepSeek R1 will often hack the benchmark by default, while language models like GPT‑4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack. We improve upon prior work like (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.

PaperID: 11574, https://arxiv.org/pdf/2502.13248.pdf

Authors: Hankang Gu, Shangbo Wang, Dongyao Jia, Yuli Zhang, Yanrong Luo, Guoqiang Mao, Jianping Wang, Eng Gee Lim

Title: Communication Strategy on Macro-and-Micro Traffic State in Cooperative Deep Reinforcement Learning for Regional Traffic Signal Control

Abstract:
Adaptive Traffic Signal Control (ATSC) has become a popular research topic in intelligent transportation systems. Regional Traffic Signal Control (RTSC) using the Multi‑agent Deep Reinforcement Learning (MADRL) technique has become a promising approach for ATSC due to its ability to achieve the optimum trade‑off between scalability and optimality. Most existing RTSC approaches partition a traffic network into several disjoint regions, followed by applying centralized reinforcement learning techniques to each region. However, the pursuit of cooperation among RTSC agents still remains an open issue and no communication strategy for RTSC agents has been investigated. In this paper, we propose communication strategies to capture the correlation of micro‑traffic states among lanes and the correlation of macro‑traffic states among intersections. We first justify the evolution equation of the RTSC process is Markovian via a system of store‑and‑forward queues. Next, based on the evolution equation, we propose two GAT‑Aggregated (GA2) communication modules‑‑GA2‑Naive and GA2‑Aug to extract both intra‑region and inter‑region correlations between macro and micro traffic states. While GA2‑Naive only considers the movements at each intersection, GA2‑Aug also considers the lane‑changing behavior of vehicles. Two proposed communication modules are then aggregated into two existing novel RTSC frameworks‑‑RegionLight and Regional‑DRL. Experimental results demonstrate that both GA2‑Naive and GA2‑Aug effectively improve the performance of existing RTSC frameworks under both real and synthetic scenarios. Hyperparameter testing also reveals the robustness and potential of our communication modules in large‑scale traffic networks.

PaperID: 11575, https://arxiv.org/pdf/2502.13188.pdf

Authors: Anastasia Psarou, Ahmet Onur Akman, Łukasz Gorczyca, Michał Hoffmann, Grzegorz Jamróz, Rafał Kucharski

Title: Collaboration Between the City and Machine Learning Community is Crucial to Efficient Autonomous Vehicles Routing

Abstract:
Autonomous vehicles (AVs), possibly using Multi‑Agent Reinforcement Learning (MARL) for simultaneous route optimization, may destabilize traffic networks, with human drivers potentially experiencing longer travel times. We study this interaction by simulating human drivers and AVs. Our experiments with standard MARL algorithms reveal that, both in simplified and complex networks, policies often fail to converge to an optimal solution or require long training periods. This problem is amplified by the fact that we cannot rely entirely on simulated training, as there are no accurate models of human routing behavior. At the same time, real‑world training in cities risks destabilizing urban traffic systems, increasing externalities, such as CO_2 emissions, and introducing non‑stationarity as human drivers will adapt unpredictably to AV behaviors. In this position paper, we argue that city authorities must collaborate with the ML community to monitor and critically evaluate the routing algorithms proposed by car companies toward fair and system‑efficient routing algorithms and regulatory standards.

PaperID: 11576, https://arxiv.org/pdf/2502.13172.pdf

Authors: Bo Wang, Weiyi He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, Pengfei He

Title: Unveiling Privacy Risks in LLM Agent Memory

Abstract:
Large Language Model (LLM) agents have become increasingly prevalent across various real‑world applications. They enhance decision‑making by storing private user‑agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black‑box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer's and the attacker's perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.

PaperID: 11577, https://arxiv.org/pdf/2502.13165.pdf

Authors: Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, Xiangmin Xu

Title: HedgeAgents: A Balanced-aware Multi-agent Financial Trading System

Abstract:
As automated trading gains traction in the financial market, algorithmic investment strategies are increasingly prominent. While Large Language Models (LLMs) and Agent‑based models exhibit promising potential in real‑time market analysis and trading decisions, they still experience a significant ‑20% loss when confronted with rapid declines or frequent fluctuations, impeding their practical application. Hence, there is an imperative to explore a more robust and resilient framework. This paper introduces an innovative multi‑agent system, HedgeAgents, aimed at bolstering system robustness via ``hedging'' strategies. In this well‑balanced system, an array of hedging agents has been tailored, where HedgeAgents consist of a central fund manager and multiple hedging experts specializing in various financial asset classes. These agents leverage LLMs' cognitive capabilities to make decisions and coordinate through three types of conferences. Benefiting from the powerful understanding of LLMs, our HedgeAgents attained a 70% annualized return and a 400% total return over a period of 3 years. Moreover, we have observed with delight that HedgeAgents can even formulate investment experience comparable to those of human experts (https://hedgeagents.github.io/).

PaperID: 11578, https://arxiv.org/pdf/2502.13164.pdf

Authors: Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri

Title: Multi-Agent Actor-Critic Generative AI for Query Resolution and Analysis

Abstract:
In this paper, we introduce MASQRAD (Multi‑Agent Strategic Query Resolution and Diagnostic tool), a transformative framework for query resolution based on the actor‑critic model, which utilizes multiple generative AI agents. MASQRAD is excellent at translating imprecise or ambiguous user inquiries into precise and actionable requests. This framework generates pertinent visualizations and responses to these focused queries, as well as thorough analyses and insightful interpretations for users. MASQRAD addresses the common shortcomings of existing solutions in domains that demand fast and precise data interpretation, such as their incapacity to successfully apply AI for generating actionable insights and their challenges with the inherent ambiguity of user queries. MASQRAD functions as a sophisticated multi‑agent system but "masquerades" to users as a single AI entity, which lowers errors and enhances data interaction. This approach makes use of three primary AI agents: Actor Generative AI, Critic Generative AI, and Expert Analysis Generative AI. Each is crucial for creating, enhancing, and evaluating data interactions. The Actor AI generates Python scripts to generate data visualizations from large datasets within operational constraints, and the Critic AI rigorously refines these scripts through multi‑agent debate. Finally, the Expert Analysis AI contextualizes the outcomes to aid in decision‑making. With an accuracy rate of 87% when handling tasks related to natural language visualization, MASQRAD establishes new benchmarks for automated data interpretation and showcases a noteworthy advancement that has the potential to revolutionize AI‑driven applications.

PaperID: 11579, https://arxiv.org/pdf/2502.13160.pdf

Authors: Yiwen Zhang, Yifu Wu, Wenyue Hua, Xiang Lu, Xuming Hu

Title: Attention Mechanism for LLM-based Agents Dynamic Diffusion under Information Asymmetry

Abstract:
Large language models have been used to simulate human society using multi‑agent systems. Most current social simulation research emphasizes interactive behaviors in fixed environments, ignoring information opacity, relationship variability, and diffusion diversity. In this paper, we first propose a general framework for exploring multi‑agent information diffusion. We identified LLMs' deficiency in the perception and utilization of social relationships, as well as diverse actions. Then, we designed a dynamic attention mechanism to help agents allocate attention to different information, addressing the limitations of the LLM attention mechanism. Agents start by responding to external information stimuli within a five‑agent group, increasing group size and forming information circles while developing relationships and sharing information. Additionally, we explore the information diffusion features in the asymmetric open environment by observing the evolution of information gaps, diffusion patterns, and the accumulation of social capital, which are closely linked to psychological, sociological, and communication theories.

PaperID: 11580, https://arxiv.org/pdf/2502.13135.pdf

Authors: Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić

Title: Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions

Abstract:
We present an end‑to‑end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real‑world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent‑based models such as Concordia, or directly by prompting a language model. Using two independently‑developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent's understanding of the synthetic users' needs and challenges. Finally, through multiple blinded evaluations of user‑coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.

PaperID: 11581, https://arxiv.org/pdf/2502.13069.pdf

Authors: Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig

Title: Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering

Abstract:
AI agents are increasingly being deployed to automate tasks, often based on underspecified user instructions. Making unwarranted assumptions to compensate for the missing information and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle underspecified instructions in interactive code generation settings by evaluating proprietary and open‑weight models on their performance across three key steps: (a) detecting underspecificity, (b) asking targeted clarification questions, and (c) leveraging the interaction to improve performance in underspecified scenarios. We introduce Ambig‑SWE, an underspecified variant of SWE‑Bench Verified, specifically designed to evaluate agent behavior under ambiguity and interaction. Our findings reveal that models struggle to distinguish between well‑specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user leading to significant improvements in performance, up to 74% over the non‑interactive settings, underscoring the value of effective interaction. Our study highlights critical gaps in how current state‑of‑the‑art models handle missing information in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.

PaperID: 11582, https://arxiv.org/pdf/2502.13053.pdf

Authors: Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, Shengyu Zhang

Title: Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks

Abstract:
As researchers continue to optimize AI agents for more effective task execution within operating systems, they often overlook a critical security concern: the ability of these agents to detect "impostors" within their environment. Through an analysis of the agents' operational context, we identify a significant threat‑attackers can disguise malicious attacks as environmental elements, injecting active disturbances into the agents' execution processes to manipulate their decision‑making. We define this novel threat as the Active Environment Injection Attack (AEIA). Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision‑making; and (2) Reasoning gap vulnerabilities in the agent's task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA‑MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM‑based agents. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% on the AndroidWorld benchmark by combining two vulnerabilities.

PaperID: 11583, https://arxiv.org/pdf/2502.13012.pdf

Authors: Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Yanfang Ye, Toby Jia-Jun Li, Dakuo Wang

Title: Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents

Abstract:
Role‑Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human‑like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence‑based, actionable, and generalizable evaluation design guideline for LLM‑based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.

PaperID: 11584, https://arxiv.org/pdf/2502.13001.pdf

Authors: Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp

Title: You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

Abstract:
Meeting summarization suffers from limited high‑quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi‑agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post‑processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real‑meeting spontaneity (4.5/5 in naturalness), preserves speaker‑centric challenges (3/5 in spoken language), and introduces richer information‑oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real‑world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation‑centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.

PaperID: 11585, https://arxiv.org/pdf/2502.12927.pdf

Authors: Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva

Title: SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents

Abstract:
Providing high‑quality feedback on student assignments is crucial for student success, but it is heavily limited by time and budgetary constraints. In this work, we introduce Synthetic Educational Feedback Loops (SEFL), a synthetic data framework designed to generate data that resembles immediate, on‑demand feedback at scale without relying on extensive, real‑world student assignments and teacher feedback. To obtain this type of data, two large language models (LLMs) operate in a teacher‑student role to simulate assignment completion and formative feedback, generating 19.8K synthetic pairs of student work and corresponding critiques and actionable improvements from a teacher. With this data, we fine‑tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high‑quality, goal‑oriented feedback. Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL‑tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback quality. The potential for societal impact is reinforced by extensive qualitative comments and ratings from human stakeholders ‑‑ both students and higher education instructors. SEFL has the potential to transform feedback processes for higher education and beyond.

PaperID: 11586, https://arxiv.org/pdf/2502.12875.pdf

Authors: Wei Zhao, Shaoxin Cui, Wen Qiu, Zhiqiang He, Zhi Liu, Xiao Zheng, Bomin Mao, Nei Kato

Title: A Survey on DRL based UAV Communications and Networking: DRL Fundamentals, Applications and Implementations

Abstract:
Unmanned aerial vehicles (UAVs) are playing an increasingly pivotal role in modern communication networks,offering flexibility and enhanced coverage for a variety of applica‑tions. However, UAV networks pose significant challenges due to their dynamic and distributed nature, particularly when dealing with tasks such as power allocation, channel assignment, caching,and task offloading. Traditional optimization techniques often struggle to handle the complexity and unpredictability of these environments, leading to suboptimal performance. This survey provides a comprehensive examination of how deep reinforcement learning (DRL) can be applied to solve these mathematical optimization problems in UAV communications and networking.Rather than simply introducing DRL methods, the focus is on demonstrating how these methods can be utilized to solve complex mathematical models of the underlying problems. We begin by reviewing the fundamental concepts of DRL, including value‑based, policy‑based, and actor‑critic approaches. Then,we illustrate how DRL algorithms are applied to specific UAV network tasks by discussing from problem formulations to DRL implementation. By framing UAV communication challenges as optimization problems, this survey emphasizes the practical value of DRL in dynamic and uncertain environments. We also explore the strengths of DRL in handling large‑scale network scenarios and the ability to continuously adapt to changes in the environment. In addition, future research directions are outlined, highlighting the potential for DRL to further enhance UAV communications and expand its applicability to more complex,multi‑agent settings.

PaperID: 11587, https://arxiv.org/pdf/2502.12842.pdf

Authors: Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci

Title: Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

Abstract:
Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real‑world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student‑written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five‑point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM‑generated feedback shows no significant difference to that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM‑generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

PaperID: 11588, https://arxiv.org/pdf/2502.12669.pdf

Authors: Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang

Title: Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

Abstract:
The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge‑enhanced system for PSCs that integrates three key components. First, we develop Perovskite‑KG, a domain‑specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite‑Chat, comprising 55,101 high‑quality question‑answer pairs generated through a novel multi‑agent framework, and Perovskite‑Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite‑Chat‑LLM for domain‑specific knowledge assistance and Perovskite‑Reasoning‑LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain‑specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem‑solving in PSC research.

PaperID: 11589, https://arxiv.org/pdf/2502.12663.pdf

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

Title: Demystifying Multilingual Chain-of-Thought in Process Reward Modeling

Abstract:
Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi‑step reasoning, recent research leverages process reward modeling to provide fine‑grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early‑stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi‑step reasoning tasks. In addition, we release the code to foster research along this line.

PaperID: 11590, https://arxiv.org/pdf/2502.12630.pdf

Authors: Tvrtko Sternak, Davor Runje, Dorian Granoša, Chi Wang

Title: Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

Abstract:
This paper presents a novel approach to evaluating the security of large language models (LLMs) against prompt leakage‑the exposure of system‑level prompts or proprietary configurations. We define prompt leakage as a critical threat to secure LLM deployment and introduce a framework for testing the robustness of LLMs using agentic teams. Leveraging AG2 (formerly AutoGen), we implement a multi‑agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt. Guided by traditional definitions of security in cryptography, we further define a prompt leakage‑safe system as one in which an attacker cannot distinguish between two agents: one initialized with an original prompt and the other with a prompt stripped of all sensitive information. In a safe system, the agents' outputs will be indistinguishable to the attacker, ensuring that sensitive information remains secure. This cryptographically inspired framework provides a rigorous standard for evaluating and designing secure LLMs. This work establishes a systematic methodology for adversarial testing of prompt leakage, bridging the gap between automated threat modeling and practical LLM security. You can find the implementation of our prompt leakage probing on GitHub.

PaperID: 11591, https://arxiv.org/pdf/2502.12605.pdf

Authors: Kyeonghyeon Park, David Molina Concha, Hyun-Rok Lee, Chi-Guhn Lee, Taesik Lee

Title: Hypernetwork-based approach for optimal composition design in partially controlled multi-agent systems

Abstract:
Partially Controlled Multi‑Agent Systems (PCMAS) are comprised of controllable agents, managed by a system designer, and uncontrollable agents, operating autonomously. This study addresses an optimal composition design problem in PCMAS, which involves the system designer's problem, determining the optimal number and policies of controllable agents, and the uncontrollable agents' problem, identifying their best‑response policies. Solving this bi‑level optimization problem is computationally intensive, as it requires repeatedly solving multi‑agent reinforcement learning problems under various compositions for both types of agents. To address these challenges, we propose a novel hypernetwork‑based framework that jointly optimizes the system's composition and agent policies. Unlike traditional methods that train separate policy networks for each composition, the proposed framework generates policies for both controllable and uncontrollable agents through a unified hypernetwork. This approach enables efficient information sharing across similar configurations, thereby reducing computational overhead. Additional improvements are achieved by incorporating reward parameter optimization and mean action networks. Using real‑world New York City taxi data, we demonstrate that our framework outperforms existing methods in approximating equilibrium policies. Our experimental results show significant improvements in key performance metrics, such as order response rate and served demand, highlighting the practical utility of controlling agents and their potential to enhance decision‑making in PCMAS.

PaperID: 11592, https://arxiv.org/pdf/2502.12561.pdf

Authors: Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang

Title: UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design

Abstract:
Usability testing is a fundamental yet challenging (e.g., inflexible to iterate the study design flaws and hard to recruit study participants) research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model‑simulated Agent (LLM‑Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human subject study. Our system features an LLM‑Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The results are shown in qualitative (e.g., interviewing how an agent thinks ), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent‑assisted UX study.

PaperID: 11593, https://arxiv.org/pdf/2502.12521.pdf

Authors: Shubham Parashar, Blake Olson, Sambhav Khurana, Eric Li, Hongyi Ling, James Caverlee, Shuiwang Ji

Title: Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

Abstract:
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference‑time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi‑step reasoning and verification. Here, we explore how scaling inference‑time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference‑time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference‑time computation has limitations, as no single inference‑time technique consistently performs well across all reasoning and planning tasks.

PaperID: 11594, https://arxiv.org/pdf/2502.12504.pdf

Authors: Karthik Sreedhar, Alice Cai, Jenny Ma, Jeffrey V. Nickerson, Lydia B. Chilton

Title: Simulating Cooperative Prosocial Behavior with Multi-Agent LLMs: Evidence and Mechanisms for AI Agents to Inform Policy Decisions

Abstract:
Human prosocial cooperation is essential for our collective health, education, and welfare. However, designing social systems to maintain or incentivize prosocial behavior is challenging because people can act selfishly to maximize personal gain. This complex and unpredictable aspect of human behavior makes it difficult for policymakers to foresee the implications of their designs. Recently, multi‑agent LLM systems have shown remarkable capabilities in simulating human‑like behavior, and replicating some human lab experiments. This paper studies how well multi‑agent systems can simulate prosocial human behavior, such as that seen in the public goods game (PGG), and whether multi‑agent systems can exhibit ``unbounded actions'' seen outside the lab in real world scenarios. We find that multi‑agent LLM systems successfully replicate human behavior from lab experiments of the public goods game with three experimental treatments ‑ priming, transparency, and varying endowments. Beyond replicating existing experiments, we find that multi‑agent LLM systems can replicate the expected human behavior when combining experimental treatments, even if no previous study combined those specific treatments. Lastly, we find that multi‑agent systems can exhibit a rich set of unbounded actions that people do in the real world outside of the lab ‑‑ such as collaborating and even cheating. In sum, these studies are steps towards a future where LLMs can be used to inform policy decisions that encourage people to act in a prosocial manner.

PaperID: 11595, https://arxiv.org/pdf/2502.12494.pdf

Authors: Yunxiao Zhang, Guanming Xiong, Haochen Li, Wen Zhao

Title: EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities as AI agents. However, existing methods for enhancing LLM‑agent abilities often lack a focus on data quality, leading to inefficiencies and suboptimal results in both fine‑tuning and prompt engineering. To address this issue, we introduce EDGE, a novel approach for identifying informative samples without needing golden answers. We propose the Guideline Effectiveness (GE) metric, which selects challenging samples by measuring the impact of human‑provided guidelines in multi‑turn interaction tasks. A low GE score indicates that the human expertise required for a sample is missing from the guideline, making the sample more informative. By selecting samples with low GE scores, we can improve the efficiency and outcomes of both prompt engineering and fine‑tuning processes for LLMs. Extensive experiments validate the performance of our method. Our method achieves competitive results on the HotpotQA and WebShop and datasets, requiring 75% and 50% less data, respectively, while outperforming existing methods. We also provide a fresh perspective on the data quality of LLM‑agent fine‑tuning.

PaperID: 11596, https://arxiv.org/pdf/2502.12435.pdf

Authors: Mohamed Aghzal, Erion Plaku, Gregory J. Stein, Ziyu Yao

Title: A Survey on Large Language Models for Automated Planning

Abstract:
The planning ability of Large Language Models (LLMs) has garnered increasing attention in recent years due to their remarkable capacity for multi‑step reasoning and their ability to generalize across a wide range of domains. While some researchers emphasize the potential of LLMs to perform complex planning tasks, others highlight significant limitations in their performance, particularly when these models are tasked with handling the intricacies of long‑horizon reasoning. In this survey, we critically investigate existing research on the use of LLMs in automated planning, examining both their successes and shortcomings in detail. We illustrate that although LLMs are not well‑suited to serve as standalone planners because of these limitations, they nonetheless present an enormous opportunity to enhance planning applications when combined with other approaches. Thus, we advocate for a balanced methodology that leverages the inherent flexibility and generalized knowledge of LLMs alongside the rigor and cost‑effectiveness of traditional planning methods.

PaperID: 11597, https://arxiv.org/pdf/2502.12337.pdf

Authors: Michael Tang, Miroslav Krstic, Jorge Poveda

Title: Stochastic Real-Time Deception in Nash Equilibrium Seeking for Games with Quadratic Payoffs

Abstract:
In multi‑agent autonomous systems, deception is a fundamental concept which characterizes the exploitation of unbalanced information to mislead victims into choosing oblivious actions. This effectively alters the system's long term behavior, leading to outcomes that may be beneficial to the deceiver but detrimental to victim. We study this phenomenon for a class of model‑free Nash equilibrium seeking (NES) where players implement independent stochastic exploration signals to learn the pseudogradient flow. In particular, we show that deceptive players who obtain real‑time measurements of other players' stochastic perturbation can incorporate this information into their own NES action update, consequentially steering the overall dynamics to a new operating point that could potentially improve the payoffs of the deceptive players. We consider games with quadratic payoff functions, as this restriction allows us to derive a more explicit formulation of the capabilities of the deceptive players. By leveraging results on multi‑input stochastic averaging for dynamical systems, we establish local exponential (in probability) convergence for the proposed deceptive NES dynamics. To illustrate our results, we apply them to a two player quadratic game.

PaperID: 11598, https://arxiv.org/pdf/2502.12280.pdf

Authors: Heng Ma, Alexander Brace, Carlo Siebenschuh, Greg Pauloski, Ian Foster, Arvind Ramanathan

Title: Connecting Large Language Model Agent to High Performance Computing Resource

Abstract:
The Large Language Model agent workflow enables the LLM to invoke tool functions to increase the performance on specific scientific domain questions. To tackle large scale of scientific research, it requires access to computing resource and parallel computing setup. In this work, we implemented Parsl to the LangChain/LangGraph tool call setup, to bridge the gap between the LLM agent to the computing resource. Two tool call implementations were set up and tested on both local workstation and HPC environment on Polaris/ALCF. The first implementation with Parsl‑enabled LangChain tool node queues the tool functions concurrently to the Parsl workers for parallel execution. The second configuration is implemented by converting the tool functions into Parsl ensemble functions, and is more suitable for large task on super computer environment. The LLM agent workflow was prompted to run molecular dynamics simulations, with different protein structure and simulation conditions. These results showed the LLM agent tools were managed and executed concurrently by Parsl on the available computing resource.

PaperID: 11599, https://arxiv.org/pdf/2502.12149.pdf

Authors: Kenan Jiang, Li Xiong, Fei Liu

Title: HARBOR: Exploring Persona Dynamics in Multi-Agent Competition

Abstract:
We investigate factors contributing to LLM agents' success in competitive multi‑agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. Particularly, we investigate three key questions: (a) How does a persona influence an agent's behavior in a competitive setting? (b) Can an agent effectively profile its competitors' behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening our understanding of multi‑agent workflows in competitive environments.

PaperID: 11600, https://arxiv.org/pdf/2502.12130.pdf

Authors: Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan

Title: Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a range of text‑generation tasks. However, LLMs still struggle with problems requiring multi‑step decision‑making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem‑solving. Unlike pure text data, collecting large‑scale decision‑making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine‑tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM‑based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision‑making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real‑world problems requiring multi‑step decision‑making.

PaperID: 11601, https://arxiv.org/pdf/2502.12058.pdf

Authors: Carole Adam, Benoit Gaudou

Title: A survey about perceptions of mobility to inform an agent-based simulator of subjective modal choice

Abstract:
In order to adapt to the issues of climate change and public health, urban policies are trying to encourage soft mobility, but the share of the car remains significant. Beyond known constraints, we study here the impact of perception biases on individual choices. We designed a multi‑criteria decision model, integrating the influence of habits and biases. We then conducted an online survey, which received 650 responses. We used these to calculate realistic mobility perception values, in order to initialise the environment and the population of a modal choice simulator, implemented in Netlogo. This allows us to visualize the adaptation of the modal distribution in reaction to the evolution of urban planning, depending on whether or not we activate biases and habits in individual reasoning. This is an extended and translated version of a demo paper published in French at JFSMA‑JFMS 2024 "Un simulateur multi‑agent de choix modal subjectif"

PaperID: 11602, https://arxiv.org/pdf/2502.12042.pdf

Authors: Wei-Chen Lee, Alessandro Abate, Michael Wooldridge

Title: Multi-agent coordination via communication partitions

Abstract:
Coordinating the behaviour of self‑interested agents in the presence of multiple Nash equilibria is a major research challenge for multi‑agent systems. Pre‑game communication between all the players can aid coordination in cases where the Pareto‑optimal payoff is unique, but can lead to deadlocks when there are multiple payoffs on the Pareto frontier. We consider a communication partition, where only players within the same coalition can communicate with each other, and they can establish an agreement (a coordinated joint‑action) if it is envy‑free, credible, and Pareto‑optimal. We show that under a natural assumption about symmetry, certain communication partitions can induce social optimal outcomes in singleton congestion games. This game is a reasonable model for a decentralised, anonymous system where players are required to choose from a range of identical resources, and incur costs that are increasing and convex in the total number of players sharing the same resource. The communication partition can be seen as a mechanism for inducing efficient outcomes in this context.

PaperID: 11603, https://arxiv.org/pdf/2502.11843.pdf

Authors: Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim

Title: Can LLM Agents Maintain a Persona in Discourse?

Abstract:
Large Language Models (LLMs) are widely used as conversational agents, exploiting their capabilities in various sectors such as education, law, medicine, and more. However, LLMs are often subjected to context‑shifting behaviour, resulting in a lack of consistent and interpretable personality‑aligned interactions. Adherence to psychological traits lacks comprehensive analysis, especially in the case of dyadic (pairwise) conversations. We examine this challenge from two viewpoints, initially using two conversation agents to generate a discourse on a certain topic with an assigned personality from the OCEAN framework (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This is followed by using multiple judge agents to infer the original traits assigned to explore prediction consistency, inter‑model agreement, and alignment with the assigned personality. Our findings indicate that while LLMs can be guided toward personality‑driven dialogue, their ability to maintain personality traits varies significantly depending on the combination of models and discourse settings. These inconsistencies emphasise the challenges in achieving stable and interpretable personality‑aligned interactions in LLMs.

PaperID: 11604, https://arxiv.org/pdf/2502.11799.pdf

Authors: Peiying Yu, Guoxin Chen, Jingjing Wang

Title: Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning

Abstract:
Despite the remarkable capabilities of large language models (LLMs) in various reasoning tasks, they still struggle with table reasoning tasks, particularly in maintaining consistency throughout multi‑step reasoning processes. While existing approaches have explored various decomposition strategies, they often lack effective mechanisms to identify and correct errors in intermediate reasoning steps, leading to cascading error propagation. To address these issues, we propose Table‑Critic, a novel multi‑agent framework that facilitates collaborative criticism and iterative refinement of the reasoning process until convergence to correct solutions. Our framework consists of four specialized agents: a Judge for error identification, a Critic for comprehensive critiques, a Refiner for process improvement, and a Curator for pattern distillation. To effectively deal with diverse and unpredictable error types, we introduce a self‑evolving template tree that systematically accumulates critique knowledge through experience‑driven learning and guides future reflections. Extensive experiments have demonstrated that Table‑Critic achieves substantial improvements over existing methods, achieving superior accuracy and error correction rates while maintaining computational efficiency and lower solution degradation rate.

PaperID: 11605, https://arxiv.org/pdf/2502.11785.pdf

Authors: Rustam Galimullin, Maksim Gladyshev, Munyque Mittelmann, Nima Motamed

Title: Changing the Rules of the Game: Reasoning about Dynamic Phenomena in Multi-Agent Systems

Abstract:
The design and application of multi‑agent systems (MAS) require reasoning about the effects of modifications on their underlying structure. In particular, such changes may impact the satisfaction of system specifications and the strategic abilities of their autonomous components. In this paper, we are concerned with the problem of verifying and synthesising modifications (or updates) of MAS. We propose an extension of the Alternating‑Time Temporal Logic (\mathsfATL) that enables reasoning about the dynamics of model change, called the Logic for \mathsfATL Model Building (\mathsfLAMB). We show how \mathsfLAMB can express various intuitions and ideas about the dynamics of MAS, from normative updates to mechanism design. As the main technical result, we prove that, while being strictly more expressive than \mathsfATL, \mathsfLAMB enjoys a P‑complete model‑checking procedure.

PaperID: 11606, https://arxiv.org/pdf/2502.11721.pdf

Authors: Jingsen Zhang, Zihang Tian, Xueyang Feng, Xu Chen

Title: Enhancing Recommendation Explanations through User-Centric Refinement

Abstract:
Generating natural language explanations for recommendations has become increasingly important in recommender systems. Traditional approaches typically treat user reviews as ground truth for explanations and focus on improving review prediction accuracy by designing various model architectures. However, due to limitations in data scale and model capability, these explanations often fail to meet key user‑centric aspects such as factuality, personalization, and sentiment coherence, significantly reducing their overall helpfulness to users. In this paper, we propose a novel paradigm that refines initial explanations generated by existing explainable recommender models during the inference stage to enhance their quality in multiple aspects. Specifically, we introduce a multi‑agent collaborative refinement framework based on large language models. To ensure alignment between the refinement process and user demands, we employ a plan‑then‑refine pattern to perform targeted modifications. To enable continuous improvements, we design a hierarchical reflection mechanism that provides feedback on the refinement process from both strategic and content perspectives. Extensive experiments on three datasets demonstrate the effectiveness of our framework.

PaperID: 11607, https://arxiv.org/pdf/2502.11649.pdf

Authors: Amin Qasmi, Usman Naseem, Mehwish Nasim

Title: Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation

Abstract:
We introduce a novel non‑cooperative game to analyse opinion formation and resistance, incorporating principles from social psychology such as confirmation bias, resource constraints, and influence penalties. Our simulation features Large Language Model (LLM) agents competing to influence a population, with penalties imposed for generating messages that propagate or counter misinformation. This framework integrates resource optimisation into the agents' decision‑making process. Our findings demonstrate that while higher confirmation bias strengthens opinion alignment within groups, it also exacerbates overall polarisation. Conversely, lower confirmation bias leads to fragmented opinions and limited shifts in individual beliefs. Investing heavily in a high‑resource debunking strategy can initially align the population with the debunking agent, but risks rapid resource depletion and diminished long‑term influence

PaperID: 11608, https://arxiv.org/pdf/2502.11645.pdf

Authors: Luke Marris, Siqi Liu, Ian Gemp, Georgios Piliouras, Marc Lanctot

Title: Deviation Ratings: A General, Clone-Invariant Rating Method

Abstract:
Many real‑world multi‑agent or multi‑task evaluation scenarios can be naturally modelled as normal‑form games due to inherent strategic (adversarial, cooperative, and mixed motive) interactions. These strategic interactions may be agentic (e.g. players trying to win), fundamental (e.g. cost vs quality), or complementary (e.g. niche finding and specialization). In such a formulation, it is the strategies (actions, policies, agents, models, tasks, prompts, etc.) that are rated. However, the rating problem is complicated by redundancy and complexity of N‑player strategic interactions. Repeated or similar strategies can distort ratings for those that counter or complement them. Previous work proposed ``clone invariant'' ratings to handle such redundancies, but this was limited to two‑player zero‑sum (i.e. strictly competitive) interactions. This work introduces the first N‑player general‑sum clone invariant rating, called deviation ratings, based on coarse correlated equilibria. The rating is explored on several domains including LLMs evaluation.

PaperID: 11609, https://arxiv.org/pdf/2502.11518.pdf

Authors: Di Wu, Xian Wei, Guang Chen, Hao Shen, Xiangfeng Wang, Wenhao Li, Bo Jin

Title: Generative Multi-Agent Collaboration in Embodied AI: A Systematic Review

Abstract:
Embodied multi‑agent systems (EMAS) have attracted growing attention for their potential to address complex, real‑world challenges in areas such as logistics and robotics. Recent advances in foundation models pave the way for generative agents capable of richer communication and adaptive problem‑solving. This survey provides a systematic examination of how EMAS can benefit from these generative capabilities. We propose a taxonomy that categorizes EMAS by system architectures and embodiment modalities, emphasizing how collaboration spans both physical and virtual contexts. Central building blocks, perception, planning, communication, and feedback, are then analyzed to illustrate how generative techniques bolster system robustness and flexibility. Through concrete examples, we demonstrate the transformative effects of integrating foundation models into embodied, multi‑agent frameworks. Finally, we discuss challenges and future directions, underlining the significant promise of EMAS to reshape the landscape of AI‑driven collaboration.

PaperID: 11610, https://arxiv.org/pdf/2502.11476.pdf

Authors: Peiji Li, Kai Lv, Yunfan Shao, Yichuan Ma, Linyang Li, Xiaoqing Zheng, Xipeng Qiu, Qipeng Guo

Title: FastMCTS: A Simple Sampling Strategy for Data Synthesis

Abstract:
Synthetic high‑quality multi‑step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi‑step reasoning data, offering step‑level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS‑generated data outperform those trained on rejection sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high‑quality reasoning data. Our code will be released soon.

PaperID: 11611, https://arxiv.org/pdf/2502.11448.pdf

Authors: Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, Chaowei Xiao

Title: AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

Abstract:
The rapid advancements in Large Language Models (LLMs) have enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem‑solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task‑specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defense agencies fail to adaptively and effectively mitigate these risks. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task‑specific and system risks but also exhibits transferability across different LLM agents' tasks.

PaperID: 11612, https://arxiv.org/pdf/2502.11435.pdf

Authors: Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji

Title: SMART: Self-Aware Agent for Tool Overuse Mitigation

Abstract:
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self‑awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model‑Aware Reasoning with Tools), a paradigm that enhances an agent's self‑awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART‑ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool‑dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B‑scale models to match its 70B counterpart and GPT‑4o. Additionally, SMARTAgent generalizes to out‑of‑distribution test data like GSM8K and MINTQA, maintaining accuracy with just one‑fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource‑efficient agent designs.

PaperID: 11613, https://arxiv.org/pdf/2502.11433.pdf

Authors: Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou, Qianqian Xie

Title: FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading

Abstract:
Large language models (LLMs) fine‑tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi‑step, goal‑oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision‑making. To address this, we propose \textscFLAG‑Trader, a unified architecture integrating linguistic processing (via LLMs) with gradient‑driven reinforcement learning (RL) policy optimization, in which a partially fine‑tuned LLM acts as the policy network, leveraging pre‑trained knowledge while adapting to the financial domain through parameter‑efficient fine‑tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial‑domain tasks. We present extensive empirical evidence to validate these enhancements.

PaperID: 11614, https://arxiv.org/pdf/2502.11418.pdf

Authors: Geon Lee, Wenchao Yu, Kijung Shin, Wei Cheng, Haifeng Chen

Title: TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents

Abstract:
Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real‑world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time‑series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi‑modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in‑context examples. Experimental results on real‑world datasets demonstrate that TimeCAP outperforms state‑of‑the‑art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.

PaperID: 11615, https://arxiv.org/pdf/2502.11269.pdf

Authors: Oualid Bougzime, Samir Jabbar, Christophe Cruz, Frédéric Demoly

Title: Unlocking the Potential of Generative AI through Neuro-Symbolic Architectures: Benefits and Limitations

Abstract:
Neuro‑symbolic artificial intelligence (NSAI) represents a transformative approach in artificial intelligence (AI) by combining deep learning's ability to handle large‑scale and unstructured data with the structured reasoning of symbolic methods. By leveraging their complementary strengths, NSAI enhances generalization, reasoning, and scalability while addressing key challenges such as transparency and data efficiency. This paper systematically studies diverse NSAI architectures, highlighting their unique approaches to integrating neural and symbolic components. It examines the alignment of contemporary AI techniques such as retrieval‑augmented generation, graph neural networks, reinforcement learning, and multi‑agent systems with NSAI paradigms. This study then evaluates these architectures against comprehensive set of criteria, including generalization, reasoning capabilities, transferability, and interpretability, therefore providing a comparative analysis of their respective strengths and limitations. Notably, the Neuro > Symbolic < Neuro model consistently outperforms its counterparts across all evaluation metrics. This result aligns with state‑of‑the‑art research that highlight the efficacy of such architectures in harnessing advanced technologies like multi‑agent systems.

PaperID: 11616, https://arxiv.org/pdf/2502.11260.pdf

Authors: Riccardo Zamboni, Enrico Brunetti, Marcello Restelli

Title: Scalable Multi-Agent Offline Reinforcement Learning and the Role of Information

Abstract:
Offline Reinforcement Learning (RL) focuses on learning policies solely from a batch of previously collected data. offering the potential to leverage such datasets effectively without the need for costly or risky active exploration. While recent advances in Offline Multi‑Agent RL (MARL) have shown promise, most existing methods either rely on large datasets jointly collected by all agents or agent‑specific datasets collected independently. The former approach ensures strong performance but raises scalability concerns, while the latter emphasizes scalability at the expense of performance guarantees. In this work, we propose a novel scalable routine for both dataset collection and offline learning. Agents first collect diverse datasets coherently with a pre‑specified information‑sharing network and subsequently learn coherent localized policies without requiring either full observability or falling back to complete decentralization. We theoretically demonstrate that this structured approach allows a multi‑agent extension of the seminal Fitted Q‑Iteration (FQI) algorithm to globally converge, in high probability, to near‑optimal policies. The convergence is subject to error terms that depend on the informativeness of the shared information. Furthermore, we show how this approach allows to bound the inherent error of the supervised‑learning phase of FQI with the mutual information between shared and unshared information. Our algorithm, SCAlable Multi‑agent FQI (SCAM‑FQI), is then evaluated on a distributed decision‑making problem. The empirical results align with our theoretical findings, supporting the effectiveness of SCAM‑FQI in achieving a balance between scalability and policy performance.

PaperID: 11617, https://arxiv.org/pdf/2502.11250.pdf

Authors: Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal

Title: Uncertainty-Aware Step-wise Verification with Generative Reward Models

Abstract:
Complex multi‑step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step‑wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step‑wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step‑wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge‑LM PRMs, leading to more reliable verification.

PaperID: 11618, https://arxiv.org/pdf/2502.11197.pdf

Authors: Tommy Mordo, Tomer Kordonsky, Haya Nachimovsky, Moshe Tennenholtz, Oren Kurland

Title: CSP: A Simulator For Multi-Agent Ranking Competitions

Abstract:
In ranking competitions, document authors compete for the highest rankings by modifying their content in response to past rankings. Previous studies focused on human participants, primarily students, in controlled settings. The rise of generative AI, particularly Large Language Models (LLMs), introduces a new paradigm: using LLMs as document authors. This approach addresses scalability constraints in human‑based competitions and reflects the growing role of LLM‑generated content on the web‑a prime example of ranking competition. We introduce a highly configurable ranking competition simulator that leverages LLMs as document authors. It includes analytical tools to examine the resulting datasets. We demonstrate its capabilities by generating multiple datasets and conducting an extensive analysis. Our code and datasets are publicly available for research.

PaperID: 11619, https://arxiv.org/pdf/2502.10996.pdf

Authors: Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jiaming Shen, Jimeng Sun, Jiawei Han

Title: RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation

Abstract:
Large language models (LLMs) have achieved impressive performance on knowledge‑intensive tasks, yet they often struggle with multi‑step reasoning due to the unstructured nature of retrieved context. While retrieval‑augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective. We propose Retrieval‑And‑Structuring (RAS), a framework that dynamically constructs question‑specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge‑intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 8.7% and 7.0% gains with proprietary and open‑source LLMs, respectively. Our results demonstrate that dynamic, question‑specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation.

PaperID: 11620, https://arxiv.org/pdf/2502.10937.pdf

Authors: Chengshuai Zhao, Zhen Tan, Chau-Wai Wong, Xinyan Zhao, Tianlong Chen, Huan Liu

Title: SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention

Abstract:
Content analysis breaks down complex and unstructured texts into theory‑informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule‑based refinement. In this paper, we introduce SCALE, a novel multi‑agent framework that effectively \underlineSimulates \underlineContent \underlineAnalysis via \underlineLarge language model (LLM) ag\underlineEnts. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real‑world datasets demonstrate that SCALE achieves human‑approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.

PaperID: 11621, https://arxiv.org/pdf/2502.10931.pdf

Authors: Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

Title: D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security

Abstract:
Large Language Models (LLMs) have been used in cybersecurity such as autonomous security analysis or penetration testing. Capture the Flag (CTF) challenges serve as benchmarks to assess automated task‑planning abilities of LLM agents for cybersecurity. Early attempts to apply LLMs for solving CTF challenges used single‑agent systems, where feedback was restricted to a single reasoning‑action loop. This approach was inadequate for complex CTF tasks. Inspired by real‑world CTF competitions, where teams of experts collaborate, we introduce the D‑CIPHER LLM multi‑agent framework for collaborative CTF solving. D‑CIPHER integrates agents with distinct roles with dynamic feedback loops to enhance reasoning on complex tasks. It introduces the Planner‑Executor agent system, consisting of a Planner agent for overall problem‑solving along with multiple heterogeneous Executor agents for individual tasks, facilitating efficient allocation of responsibilities among the agents. Additionally, D‑CIPHER incorporates an Auto‑prompter agent to improve problem‑solving by auto‑generating a highly relevant initial prompt. We evaluate D‑CIPHER on multiple CTF benchmarks and LLM models via comprehensive studies to highlight the impact of our enhancements. Additionally, we manually map the CTFs in NYU CTF Bench to MITRE ATT&CK techniques that apply for a comprehensive evaluation of D‑CIPHER's offensive security capability. D‑CIPHER achieves state‑of‑the‑art performance on three benchmarks: 22.0% on NYU CTF Bench, 22.5% on Cybench, and 44.0% on HackTheBox, which is 2.5% to 8.5% better than previous work. D‑CIPHER solves 65% more ATT&CK techniques compared to previous work, demonstrating stronger offensive capability.

PaperID: 11622, https://arxiv.org/pdf/2502.10857.pdf

Authors: Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu

Title: Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation

Abstract:
Recently, with the development of tool‑calling capabilities in large language models (LLMs), these models have demonstrated significant potential for automating electronic design automation (EDA) flows by interacting with EDA tool APIs via EDA scripts. However, considering the limited understanding of EDA tools, LLMs face challenges in practical scenarios where diverse interfaces of EDA tools exist across different platforms. Additionally, EDA flow automation often involves intricate, long‑chain tool‑calling processes, increasing the likelihood of errors in intermediate steps. Any errors will lead to the instability and failure of EDA flow automation. To address these challenges, we introduce EDAid, a multi‑agent collaboration system where multiple agents harboring divergent thoughts converge towards a common goal, ensuring reliable and successful EDA flow automation. Specifically, each agent is controlled by ChipLlama models, which are expert LLMs fine‑tuned for EDA flow automation. Our experiments demonstrate the state‑of‑the‑art (SOTA) performance of our ChipLlama models and validate the effectiveness of our EDAid in the automation of complex EDA flows, showcasing superior performance compared to single‑agent systems.

PaperID: 11623, https://arxiv.org/pdf/2502.10764.pdf

Authors: Hong-ah Chai, Seokbin Yoon, Keumjin Lee

Title: Learning to Explain Air Traffic Situation

Abstract:
Understanding how air traffic controllers construct a mental 'picture' of complex air traffic situations is crucial but remains a challenge due to the inherently intricate, high‑dimensional interactions between aircraft, pilots, and controllers. Previous work on modeling the strategies of air traffic controllers and their mental image of traffic situations often centers on specific air traffic control tasks or pairwise interactions between aircraft, neglecting to capture the comprehensive dynamics of an air traffic situation. To address this issue, we propose a machine learning‑based framework for explaining air traffic situations. Specifically, we employ a Transformer‑based multi‑agent trajectory model that encapsulates both the spatio‑temporal movement of aircraft and social interaction between them. By deriving attention scores from the model, we can quantify the influence of individual aircraft on overall traffic dynamics. This provides explainable insights into how air traffic controllers perceive and understand the traffic situation. Trained on real‑world air traffic surveillance data collected from the terminal airspace around Incheon International Airport in South Korea, our framework effectively explicates air traffic situations. This could potentially support and enhance the decision‑making and situational awareness of air traffic controllers.

PaperID: 11624, https://arxiv.org/pdf/2502.10705.pdf

Authors: Quanmin Wei, Penglin Dai, Wei Li, Bingyi Liu, Xiao Wu

Title: CoPEFT: Fast Adaptation Framework for Multi-Agent Collaborative Perception with Parameter-Efficient Fine-Tuning

Abstract:
Multi‑agent collaborative perception is expected to significantly improve perception performance by overcoming the limitations of single‑agent perception through exchanging complementary information. However, training a robust collaborative perception model requires collecting sufficient training data that covers all possible collaboration scenarios, which is impractical due to intolerable deployment costs. Hence, the trained model is not robust against new traffic scenarios with inconsistent data distribution and fundamentally restricts its real‑world applicability. Further, existing methods, such as domain adaptation, have mitigated this issue by exposing the deployment data during the training stage but incur a high training cost, which is infeasible for resource‑constrained agents. In this paper, we propose a Parameter‑Efficient Fine‑Tuning‑based lightweight framework, CoPEFT, for fast adapting a trained collaborative perception model to new deployment environments under low‑cost conditions. CoPEFT develops a Collaboration Adapter and Agent Prompt to perform macro‑level and micro‑level adaptations separately. Specifically, the Collaboration Adapter utilizes the inherent knowledge from training data and limited deployment data to adapt the feature map to new data distribution. The Agent Prompt further enhances the Collaboration Adapter by inserting fine‑grained contextual information about the environment. Extensive experiments demonstrate that our CoPEFT surpasses existing methods with less than 1% trainable parameters, proving the effectiveness and efficiency of our proposed method.

PaperID: 11625, https://arxiv.org/pdf/2502.10557.pdf

Authors: Xinxing Ren, Chun Sing Lai, Gareth Taylor, Zekun Guo

Title: Can Large Language Model Agents Balance Energy Systems?

Abstract:
This paper presents a hybrid approach that integrates Large Language Models (LLMs) with a multi‑scenario Stochastic Unit Commitment (SUC) framework to enhance both efficiency and reliability under high wind generation uncertainties. In a 10‑trial study on the test energy system, the traditional SUC approach incurs an average total cost of 187.68 million dollars, whereas the LLM‑assisted SUC (LLM‑SUC) achieves a mean cost of 185.58 million dollars (range: 182.61 to 188.65 million dollars), corresponding to a cost reduction of 1.1 to 2.7 percent. Furthermore, LLM‑SUC reduces load curtailment by 26.3 percent (2.24 plus/minus 0.31 GWh versus 3.04 GWh for SUC), while both methods maintain zero wind curtailment. Detailed temporal analysis shows that LLM‑SUC achieves lower costs in the majority of time intervals and consistently outperforms SUC in 90 percent of cases, with solutions clustering in a favorable cost‑reliability region (Coefficient of Variation = 0.93 percent for total cost and 13.8 percent for load curtailment). By leveraging an LLM agent to guide generator commitment decisions and dynamically adjust to stochastic conditions, the proposed framework improves demand fulfillment and operational resilience.

PaperID: 11626, https://arxiv.org/pdf/2502.10480.pdf

Authors: Deep Parikh, David van Wijk, Manoranjan Majji

Title: Safe Multi-agent Satellite Servicing with Control Barrier Functions

Abstract:
The use of control barrier functions under uncertain pose information of multiple small servicing agents is analyzed for a satellite servicing application. The application consists of modular servicing agents deployed towards a tumbling space object from a mothership. Relative position and orientation of each agent is obtained via fusion of relative range and inertial measurement sensors. The control barrier functions are utilized to avoid collisions with other agents for the application of simultaneously relocating servicing agents on a tumbling body. A differential collision detection and avoidance framework using the polytopic hull of the tumbling space object is utilized to safely guide the agents away from the tumbling object.

PaperID: 11627, https://arxiv.org/pdf/2502.10338.pdf

Authors: Nick Ferguson, Liane Guillou, Alan Bundy, Kwabena Nuamah

Title: Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering

Abstract:
Large Language Models (LLMs) excel in natural language tasks but still face challenges in Question Answering (QA) tasks requiring complex, multi‑step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta‑level reasoning (akin to high‑level strategic reasoning or planning) and object‑level reasoning (embodied in lower‑level tasks such as mathematical reasoning). Franklin, a novel dataset with requirements of meta‑ and object‑level reasoning, is introduced and used along with three other datasets to evaluate four LLMs at question answering tasks requiring multiple steps of reasoning. Results from human annotation studies suggest LLMs demonstrate meta‑level reasoning with high frequency, but struggle with object‑level reasoning tasks in some of the datasets used. Additionally, evidence suggests that LLMs find the object‑level reasoning required for the questions in the Franklin dataset challenging, yet they do exhibit strong performance with respect to the meta‑level reasoning requirements.

PaperID: 11628, https://arxiv.org/pdf/2502.10303.pdf

Authors: Abdelrhman Shaheen, Anas Badr, Ali Abohendy, Hatem Alsaadawy, Nadine Alsayad, Ehab H. El-Shazly

Title: Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMinds Innovations

Abstract:
Reinforcement Learning (RL) has been widely used in many applications, particularly in gaming, which serves as an excellent training ground for AI models. Google DeepMind has pioneered innovations in this field, employing reinforcement learning algorithms, including model‑based, model‑free, and deep Q‑network approaches, to create advanced AI models such as AlphaGo, AlphaGo Zero, and MuZero. AlphaGo, the initial model, integrates supervised learning and reinforcement learning to master the game of Go, surpassing professional human players. AlphaGo Zero refines this approach by eliminating reliance on human gameplay data, instead utilizing self‑play for enhanced learning efficiency. MuZero further extends these advancements by learning the underlying dynamics of game environments without explicit knowledge of the rules, achieving adaptability across various games, including complex Atari games. This paper reviews the significance of reinforcement learning applications in Atari and strategy‑based games, analyzing these three models, their key innovations, training processes, challenges encountered, and improvements made. Additionally, we discuss advancements in the field of gaming, including MiniZero and multi‑agent models, highlighting future directions and emerging AI models from Google DeepMind.

PaperID: 11629, https://arxiv.org/pdf/2502.10187.pdf

Authors: Jingjie Ni, Fangfei Li, Xin Jin, Xianlun Peng, Yang Tang

Title: Reinforcement Learning based Constrained Optimal Control: an Interpretable Reward Design

Abstract:
This paper presents an interpretable reward design framework for reinforcement learning based constrained optimal control problems with state and terminal constraints. The problem is formalized within a standard partially observable Markov decision process framework. The reward function is constructed from four weighted components: a terminal constraint reward, a guidance reward, a penalty for state constraint violations, and a cost reduction incentive reward. A theoretically justified reward design is then presented, which establishes bounds on the weights of the components. This approach ensures that constraints are satisfied and objectives are optimized while mitigating numerical instability. Acknowledging the importance of prior knowledge in reward design, we sequentially solve two subproblems, using each solution to inform the reward design for the subsequent problem. Subsequently, we integrate reinforcement learning with curriculum learning, utilizing policies derived from simpler subproblems to assist in tackling more complex challenges, thereby facilitating convergence. The framework is evaluated against original and randomly weighted reward designs in a multi‑agent particle environment. Experimental results demonstrate that the proposed approach significantly enhances satisfaction of terminal and state constraints and optimization of control cost.

PaperID: 11630, https://arxiv.org/pdf/2502.10148.pdf

Authors: Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen

Title: Closed-Loop Vision-Language Planning for Multi-Agent Coordination

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) struggles with sample efficiency, interpretability, and generalization. While Large Language Models (LLMs) offer powerful planning capabilities, their application has been hampered by a reliance on text‑only inputs and a failure to handle the non‑Markovian, partially observable nature of multi‑agent tasks. We introduce COMPASS, a multi‑agent framework that overcomes these limitations by integrating Vision‑Language Models (VLMs) for decentralized, closed‑loop decision‑making. COMPASS dynamically generates and refines interpretable, code‑based strategies stored in a skill library that is bootstrapped from expert demonstrations. To ensure robust coordination, it propagates entity information through a structured multi‑hop communication protocol, allowing teams to build a coherent understanding from partial observations. Evaluated on the challenging SMACv2 benchmark, COMPASS significantly outperforms state‑of‑the‑art MARL baselines. Notably, in the symmetric Protoss 5v5 task, COMPASS achieved a 57% win rate, a 30 percentage point advantage over QMIX (27%). Project page can be found at https://stellar‑entremet‑1720bb.netlify.app/.

PaperID: 11631, https://arxiv.org/pdf/2502.10062.pdf

Authors: Xiaoshan Lin, Roberto Tron

Title: Adaptive Bi-Level Multi-Robot Task Allocation and Learning under Uncertainty with Temporal Logic Constraints

Abstract:
This work addresses the problem of multi‑robot coordination under unknown robot transition models, ensuring that tasks specified by Time Window Temporal Logic are satisfied with user‑defined probability thresholds. We present a bi‑level framework that integrates (i) high‑level task allocation, where tasks are assigned based on the robots' estimated task completion probabilities and expected rewards, and (ii) low‑level distributed policy learning and execution, where robots independently optimize auxiliary rewards while fulfilling their assigned tasks. To handle uncertainty in robot dynamics, our approach leverages real‑time task execution data to iteratively refine expected task completion probabilities and rewards, enabling adaptive task allocation without explicit robot transition models. We theoretically validate the proposed algorithm, demonstrating that the task assignments meet the desired probability thresholds with high confidence. Finally, we demonstrate the effectiveness of our framework through comprehensive simulations.

PaperID: 11632, https://arxiv.org/pdf/2502.10050.pdf

Authors: Qiyao Peng, Hongtao Liu, Hua Huang, Qing Yang, Minglai Shao

Title: A Survey on LLM-powered Agents for Recommender Systems

Abstract:
Recommender systems are essential components of many online platforms, yet traditional approaches still struggle with understanding complex user preferences and providing explainable recommendations. The emergence of Large Language Model (LLM)‑powered agents offers a promising approach by enabling natural language interactions and interpretable reasoning, potentially transforming research in recommender systems. This survey provides a systematic review of the emerging applications of LLM‑powered agents in recommender systems. We identify and analyze three key paradigms in current research: (1) Recommender‑oriented approaches, which leverage intelligent agents to enhance the fundamental recommendation mechanisms; (2) Interaction‑oriented approaches, which facilitate dynamic user engagement through natural dialogue and interpretable suggestions; and (3) Simulation‑oriented approaches, which employ multi‑agent frameworks to model complex user‑item interactions and system dynamics. Beyond paradigm categorization, we analyze the architectural foundations of LLM‑powered recommendation agents, examining their essential components: profile construction, memory management, strategic planning, and action execution. Our investigation extends to a comprehensive analysis of benchmark datasets and evaluation frameworks in this domain. This systematic examination not only illuminates the current state of LLM‑powered agent recommender systems but also charts critical challenges and promising research directions in this transformative field.

PaperID: 11633, https://arxiv.org/pdf/2502.09889.pdf

Authors: Siva Kailas, Shalin Jain, Harish Ravichandar

Title: Evaluating and Improving Graph-based Explanation Methods for Multi-Agent Coordination

Abstract:
Graph Neural Networks (GNNs), developed by the graph learning community, have been adopted and shown to be highly effective in multi‑robot and multi‑agent learning. Inspired by this successful cross‑pollination, we investigate and characterize the suitability of existing GNN explanation methods for explaining multi‑agent coordination. We find that these methods have the potential to identify the most‑influential communication channels that impact the team's behavior. Informed by our initial analyses, we propose an attention entropy regularization term that renders GAT‑based policies more amenable to existing graph‑based explainers. Intuitively, minimizing attention entropy incentivizes agents to limit their attention to the most influential or impactful agents, thereby easing the challenge faced by the explainer. We theoretically ground this intuition by showing that minimizing attention entropy increases the disparity between the explainer‑generated subgraph and its complement. Evaluations across three tasks and three team sizes i) provides insights into the effectiveness of existing explainers, and ii) demonstrates that our proposed regularization consistently improves explanation quality without sacrificing task performance.

PaperID: 11634, https://arxiv.org/pdf/2502.09858.pdf

Authors: Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candès, Jure Leskovec

Title: Automated Hypothesis Validation with Agentic Sequential Falsifications

Abstract:
Hypotheses are central to information acquisition, decision‑making, and discovery. However, many real‑world hypotheses are abstract, high‑level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free‑form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type‑I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time by 10 folds, providing a scalable, rigorous solution for hypothesis validation.

PaperID: 11635, https://arxiv.org/pdf/2502.09846.pdf

Authors: Ziqiong Wang, Xiaoxue Yu, Rongpeng Li, Zhifeng Zhao

Title: Robust Event-Triggered Integrated Communication and Control with Graph Information Bottleneck Optimization

Abstract:
Integrated communication and control serves as a critical ingredient in Multi‑Agent Reinforcement Learning. However, partial observability limitations will impair collaboration effectiveness, and a potential solution is to establish consensus through well‑calibrated latent variables obtained from neighboring agents. Nevertheless, the rigid transmission of less informative content can still result in redundant information exchanges. Therefore, we propose a Consensus‑Driven Event‑Based Graph Information Bottleneck (CDE‑GIB) method, which integrates the communication graph and information flow through a GIB regularizer to extract more concise message representations while avoiding the high computational complexity of inner‑loop operations. To further minimize the communication volume required for establishing consensus during interactions, we also develop a variable‑threshold event‑triggering mechanism. By simultaneously considering historical data and current observations, this mechanism capably evaluates the importance of information to determine whether an event should be triggered. Experimental results demonstrate that our proposed method outperforms existing state‑of‑the‑art methods in terms of both efficiency and adaptability.

PaperID: 11636, https://arxiv.org/pdf/2502.09843.pdf

Authors: Karan Taneja, Ashok K. Goel

Title: MuDoC: An Interactive Multimodal Document-grounded Conversational AI System

Abstract:
Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human‑AI communication. Building a multimodal document‑grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT‑4o to generate document‑grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.

PaperID: 11637, https://arxiv.org/pdf/2502.09809.pdf

Authors: Jizhou Chen, Samuel Lee Cong

Title: AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration

Abstract:
The integration of tool use into large language models (LLMs) enables agentic systems with real‑world impact. In the meantime, unlike standalone LLMs, compromised agents can execute malicious workflows with more consequential impact, signified by their tool‑use capability. We propose AgentGuard, a framework to autonomously discover and validate unsafe tool‑use workflows, followed by generating safety constraints to confine the behaviors of agents, achieving the baseline of safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities ‑ knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges ‑ to act as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real‑world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We empirically demonstrate AgentGuard's feasibility with experiments. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real‑world applications.

PaperID: 11638, https://arxiv.org/pdf/2502.09780.pdf

Authors: Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi

Title: Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

Abstract:
Multi‑agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample‑efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample‑efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model‑based algorithm, called VMG, that incentivizes exploration via biasing the empirical estimate of the model parameters towards those with a higher collective best‑response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near‑optimal regret for finding both the NEs of two‑player zero‑sum Markov games and CCEs of multi‑player general‑sum Markov games under linear function approximation in an online environment, which nearly match their counterparts with sophisticated uncertainty quantification.

PaperID: 11639, https://arxiv.org/pdf/2502.09596.pdf

Authors: Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding

Title: KIMAs: A Configurable Knowledge Integrated Multi-Agent System

Abstract:
Knowledge‑intensive conversations supported by large language models (LLMs) have become one of the most popular and helpful applications that can assist people in different aspects. Many current knowledge‑intensive applications are centered on retrieval‑augmented generation (RAG) techniques. While many open‑source RAG frameworks facilitate the development of RAG‑based applications, they often fall short in handling practical scenarios complicated by heterogeneous data in topics and formats, conversational context management, and the requirement of low‑latency response times. This technical report presents a configurable knowledge integrated multi‑agent system, KIMAs, to address these challenges. KIMAs features a flexible and configurable system for integrating diverse knowledge sources with 1) context management and query rewrite mechanisms to improve retrieval accuracy and multi‑turn conversational coherency, 2) efficient knowledge routing and retrieval, 3) simple but effective filter and reference generation mechanisms, and 4) optimized parallelizable multi‑agent pipeline execution. Our work provides a scalable framework for advancing the deployment of LLMs in real‑world settings. To show how KIMAs can help developers build knowledge‑intensive applications with different scales and emphases, we demonstrate how we configure the system to three applications already running in practice with reliable performance.

PaperID: 11640, https://arxiv.org/pdf/2502.09395.pdf

Authors: Jaime Maldonado, Jonas Krumme, Christoph Zetzsche, Vanessa Didelez, Kerstin Schill

Title: Robot Pouring: Identifying Causes of Spillage and Selecting Alternative Action Parameters Using Probabilistic Actual Causation

Abstract:
In everyday life, we perform tasks (e.g., cooking or cleaning) that involve a large variety of objects and goals. When confronted with an unexpected or unwanted outcome, we take corrective actions and try again until achieving the desired result. The reasoning performed to identify a cause of the observed outcome and to select an appropriate corrective action is a crucial aspect of human reasoning for successful task execution. Central to this reasoning is the assumption that a factor is responsible for producing the observed outcome. In this paper, we investigate the use of probabilistic actual causation to determine whether a factor is the cause of an observed undesired outcome. Furthermore, we show how the actual causation probabilities can be used to find alternative actions to change the outcome. We apply the probabilistic actual causation analysis to a robot pouring task. When spillage occurs, the analysis indicates whether a task parameter is the cause and how it should be changed to avoid spillage. The analysis requires a causal graph of the task and the corresponding conditional probability distributions. To fulfill these requirements, we perform a complete causal modeling procedure (i.e., task analysis, definition of variables, determination of the causal graph structure, and estimation of conditional probability distributions) using data from a realistic simulation of the robot pouring task, covering a large combinatorial space of task parameters. Based on the results, we discuss the implications of the variables' representation and how the alternative actions suggested by the actual causation analysis would compare to the alternative solutions proposed by a human observer. The practical use of the analysis of probabilistic actual causation to select alternative action parameters is demonstrated.

PaperID: 11641, https://arxiv.org/pdf/2502.09379.pdf

Authors: Jorgen Cani, Panagiotis Koletsis, Konstantinos Foteinos, Ioannis Kefaloukos, Lampros Argyriou, Manolis Falelakis, Iván Del Pino, Angel Santamaria-Navarro, Martin Čech, Ondřej Severa, Alessandro Umbrico, Francesca Fracasso, AndreA Orlandini, Dimitrios Drakoulis, Evangelos Markakis, Iraklis Varlamis, Georgios Th. Papadopoulos

Title: TRIFFID: Autonomous Robotic Aid For Increasing First Responders Efficiency

Abstract:
The increasing complexity of natural disaster incidents demands innovative technological solutions to support first responders in their efforts. This paper introduces the TRIFFID system, a comprehensive technical framework that integrates unmanned ground and aerial vehicles with advanced artificial intelligence functionalities to enhance disaster response capabilities across wildfires, urban floods, and post‑earthquake search and rescue missions. By leveraging state‑of‑the‑art autonomous navigation, semantic perception, and human‑robot interaction technologies, TRIFFID provides a sophisticated system composed of the following key components: hybrid robotic platform, centralized ground station, custom communication infrastructure, and smartphone application. The defined research and development activities demonstrate how deep neural networks, knowledge graphs, and multimodal information fusion can enable robots to autonomously navigate and analyze disaster environments, reducing personnel risks and accelerating response times. The proposed system enhances emergency response teams by providing advanced mission planning, safety monitoring, and adaptive task execution capabilities. Moreover, it ensures real‑time situational awareness and operational support in complex and risky situations, facilitating rapid and precise information collection and coordinated actions.

PaperID: 11642, https://arxiv.org/pdf/2502.09369.pdf

Authors: Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti

Title: Language Agents as Digital Representatives in Collective Decision-Making

Abstract:
Consider the process of collective decision‑making, in which a group of individuals interactively select a preferred outcome from among a universe of alternatives. In this context, "representation" is the activity of making an individual's preferences present in the process via participation by a proxy agent ‑‑ i.e. their "representative". To this end, learned models of human behavior have the potential to fill this role, with practical implications for multi‑agent scenario studies and mechanism design. In this work, we investigate the possibility of training language agents to behave in the capacity of representatives of human agents, appropriately expressing the preferences of those individuals whom they stand for. First, we formalize the setting of collective decision‑making ‑‑ as the episodic process of interaction between a group of agents and a decision mechanism. On this basis, we then formalize the problem of digital representation ‑‑ as the simulation of an agent's behavior to yield equivalent outcomes from the mechanism. Finally, we conduct an empirical case study in the setting of consensus‑finding among diverse humans, and demonstrate the feasibility of fine‑tuning large language models to act as digital representatives.

PaperID: 11643, https://arxiv.org/pdf/2502.09216.pdf

Authors: Galileo Sartor, Adam Wyner, Giuseppe Contissa

Title: Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles

Abstract:
In this paper, we present a modular system for representing and reasoning with legal aspects of traffic rules for autonomous vehicles. We focus on a subset of the United Kingdom's Highway Code (HC) related to junctions. As human drivers and automated vehicles (AVs) will interact on the roads, especially in urban environments, we claim that an accessible, unitary, high‑level computational model should exist and be applicable to both users. Autonomous vehicles introduce a shift in liability that should not bring disadvantages or increased burden on human drivers. We develop a system "in silico" of the model. The proposed system is built of three main components: a natural language interface, using Logical English, which encodes the rules; an internal representation of the rules in Prolog; and an multi‑agent‑based simulation environment, built in NetLogo. The three components interact: Logical English is translated into and out of Prolog (along with some support code); Prolog and NetLogo interface via predicates. Such a modular approach enables the different components to carry different "burdens" in the overall system; it also allows swapping of modules. Given NetLogo, we can visualize the effect of the modeled rules as well as validate the system with a simple dynamic running scenario. Designated agents monitor the behaviour of the vehicles for compliance and record potential violations where they occur. The information on potential violations is then utilized by Validators, to determine whether the violation is punishable, differentiating between exceptions and cases.

PaperID: 11644, https://arxiv.org/pdf/2502.09215.pdf

Authors: Sean Glaze, Daniela Inclezan

Title: Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents

Abstract:
This paper presents an architecture for simulating the actions of a norm‑aware intelligent agent whose behavior with respect to norm compliance is set, and can later be changed, by a human controller. Updating an agent's behavior mode from a norm‑abiding to a riskier one may be relevant when the agent is involved in time‑sensitive rescue operations, for example. We base our work on the Authorization and Obligation Policy Language AOPL designed by Gelfond and Lobo for the specification of norms. We introduce an architecture and a prototype software system that can be used to simulate an agent's plans under different behavior modes that can later be changed by the controller. We envision such software to be useful to policy makers, as they can more readily understand how agents may act in certain situations based on the agents' attitudes towards norm‑compliance. Policy makers may then refine their policies if simulations show unwanted consequences.

PaperID: 11645, https://arxiv.org/pdf/2502.09022.pdf

Authors: Lin Zhang, Lijie Hu, Di Wang

Title: Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning

Abstract:
Transformer‑based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non‑linear interactions and high‑dimensional operations. While previous studies have demonstrated that these models implicitly embed reasoning trees, humans typically employ various distinct logical reasoning mechanisms to complete the same task. It is still unclear which multi‑step reasoning mechanisms are used by language models to solve such tasks. In this paper, we aim to address this question by investigating the mechanistic interpretability of language models, particularly in the context of multi‑step reasoning tasks. Specifically, we employ circuit analysis and self‑influence functions to evaluate the changing importance of each token throughout the reasoning process, allowing us to map the reasoning paths adopted by the model. We apply this methodology to the GPT‑2 model on a prediction task (IOI) and demonstrate that the underlying circuits reveal a human‑interpretable reasoning process used by the model.

PaperID: 11646, https://arxiv.org/pdf/2502.08969.pdf

Authors: Wenhui Ma, Wenhao Li, Bo Jin, Changhong Lu, Xiangfeng Wang

Title: SkyRover: A Modular Simulator for Cross-Domain Pathfinding

Abstract:
Unmanned Aerial Vehicles (UAVs) and Automated Guided Vehicles (AGVs) increasingly collaborate in logistics, surveillance, inspection tasks and etc. However, existing simulators often focus on a single domain, limiting cross‑domain study. This paper presents the SkyRover, a modular simulator for UAV‑AGV multi‑agent pathfinding (MAPF). SkyRover supports realistic agent dynamics, configurable 3D environments, and convenient APIs for external solvers and learning methods. By unifying ground and aerial operations, it facilitates cross‑domain algorithm design, testing, and benchmarking. Experiments highlight SkyRover's capacity for efficient pathfinding and high‑fidelity simulations in UAV‑AGV coordination. Project is available at https://sites.google.com/view/mapf3d/home.

PaperID: 11647, https://arxiv.org/pdf/2502.08966.pdf

Authors: Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, Phillip B. Gibbons

Title: RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

Abstract:
Tool‑Based Agent Systems (TBAS) allow Language Models (LMs) to use external tools for tasks beyond their standalone capabilities, such as searching websites, booking flights, or making financial transactions. However, these tools greatly increase the risks of prompt injection attacks, where malicious content hijacks the LM agent to leak confidential data or trigger harmful actions. Existing defenses (OpenAI GPTs) require user confirmation before every tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS), which automatically detects and executes tool calls that preserve integrity and confidentiality, requiring user confirmation only when these safeguards cannot be ensured. RTBAS adapts Information Flow Control to the unique challenges presented by TBAS. We present two novel dependency screeners, using LM‑as‑a‑judge and attention‑based saliency, to overcome these challenges. Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS prevents all targeted attacks with only a 2% loss of task utility when under attack, and further tests confirm its ability to obtain near‑oracle performance on detecting both subtle and direct privacy leaks.

PaperID: 11648, https://arxiv.org/pdf/2502.08950.pdf

Authors: Fengming Zhu, Fangzhen Lin

Title: Single-Agent Planning in a Multi-Agent System: A Unified Framework for Type-Based Planners

Abstract:
We consider a general problem where an agent is in a multi‑agent environment and must plan for herself without any prior information about her opponents. At each moment, this pivotal agent is faced with a trade‑off between exploiting her currently accumulated information about the other agents and exploring further to improve future (re‑)planning. We propose a theoretic framework that unifies a spectrum of planners for the pivotal agent to address this trade‑off. The planner at one end of this spectrum aims to find exact solutions, while those towards the other end yield approximate solutions as the problem scales up. Beyond theoretical analysis, we also implement 13 planners and conduct experiments in a specific domain called multi‑agent route planning with the number of agents up to~50, to compare their performaces in various scenarios. One interesting observation comes from a class of planners that we call safe‑agents and their enhanced variants by incorporating domain‑specific knowledge, which is a simple special case under the proposed general framework, but performs sufficiently well in most cases. Our unified framework, as well as those induced planners, provides new insights on multi‑agent decision‑making, with potential applications to related areas such as mechanism design.

PaperID: 11649, https://arxiv.org/pdf/2502.08916.pdf

Authors: Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O. Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G. Elmore, Ranjay Krishna, Linda Shapiro

Title: PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

Abstract:
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnostic. Traditional AI approaches, such as multiple instance learning and transformer‑based models, fail short of such a holistic, iterative, multi‑scale diagnostic procedure, limiting their adoption in the real‑world. We introduce PathFinder, a multi‑modal, multi‑agent framework that emulates the decision‑making process of expert pathologists. PathFinder integrates four AI agents, the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient's diagnostic classification. Our Experiments show that PathFinder outperforms state‑of‑the‑art methods in skin melanoma diagnosis by 8% while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent's outputs are of high quality and comparable to GPT‑4o. PathFinder is also the first AI‑based system to surpass the average performance of pathologists in this challenging melanoma classification task by 9%, setting a new record for efficient, accurate, and interpretable AI‑assisted diagnostics in pathology. Data, code and models available at https://pathfinder‑dx.github.io/

PaperID: 11650, https://arxiv.org/pdf/2502.08903.pdf

Authors: Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

Title: 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

Abstract:
Vision‑language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine‑grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.

PaperID: 11651, https://arxiv.org/pdf/2502.08788.pdf

Authors: Hangfan Zhang, Zhiyao Cui, Jianhao Chen, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu

Title: Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity

Abstract:
Multi‑agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices, including limited benchmark coverage, weak baseline comparisons, and inconsistent setups. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models. Surprisingly, our findings reveal that MAD often fail to outperform simple single‑agent baselines such as Chain‑of‑Thought and Self‑Consistency, even when consuming significantly more inference‑time computation. To advance MAD research, we further explore the role of model heterogeneity and find it as a universal antidote to consistently improve current MAD frameworks. Based on our findings, we argue that the field must stop overvaluing MAD in its current form; for true advancement, we must critically rethink evaluation paradigms and actively embrace model heterogeneity as a core design principle.

PaperID: 11652, https://arxiv.org/pdf/2502.08681.pdf

Authors: Barbera de Mol, Davide Barbieri, Jan Viebahn, Davide Grossi

Title: Centrally Coordinated Multi-Agent Reinforcement Learning for Power Grid Topology Control

Abstract:
Power grid operation is becoming more complex due to the increase in generation of renewable energy. The recent series of Learning To Run a Power Network (L2RPN) competitions have encouraged the use of artificial agents to assist human dispatchers in operating power grids. However, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Action space factorization, which breaks down decision‑making into smaller sub‑tasks, is one approach to tackle the curse of dimensionality. In this study, we propose a centrally coordinated multi‑agent (CCMA) architecture for action space factorization. In this approach, regional agents propose actions and subsequently a coordinating agent selects the final action. We investigate several implementations of the CCMA architecture, and benchmark in different experimental settings against various L2RPN baseline approaches. The CCMA architecture exhibits higher sample efficiency and superior final performance than the baseline approaches. The results suggest high potential of the CCMA approach for further application in higher‑dimensional L2RPN as well as real‑world power grid settings.

PaperID: 11653, https://arxiv.org/pdf/2502.08646.pdf

Authors: Neerja Thakkar, Tara Sadjadpour, Jathushan Rajasegaran, Shiry Ginosar, Jitendra Malik

Title: Poly-Autoregressive Prediction for Modeling Interactions

Abstract:
We introduce a simple framework for predicting the behavior of an agent in multi‑agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly‑Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior by reasoning about the ego agent's state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent's state at a specific timestep. With minimal data pre‑processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand‑object interaction. Using a small proof‑of‑concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at https://neerja.me/PAR/.

PaperID: 11654, https://arxiv.org/pdf/2502.08643.pdf

Authors: Shivansh Patel, Xinchen Yin, Wenlong Huang, Shubham Garg, Hooshang Nayyeri, Li Fei-Fei, Svetlana Lazebnik, Yunzhu Li

Title: A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

Abstract:
Task specification for robotic manipulation in open‑world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python‑based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi‑step manipulation tasks. Given RGB‑D observations and free‑form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real‑world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world‑forming a real‑to‑sim‑to‑real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non‑prehensile tasks, showcasing multi‑step task execution, spontaneous error recovery, and on‑the‑fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi‑step tasks in dynamic environments through iterative reward shaping.

PaperID: 11655, https://arxiv.org/pdf/2502.08599.pdf

Authors: Keyeun Lee, Seo Hyeong Kim, Seolhee Lee, Jinsu Eun, Yena Ko, Hayeon Jeon, Esther Hehsun Kim, Seonghye Cho, Soeun Yang, Eun-mee Kim, Hajin Lim

Title: SPeCtrum: A Grounded Framework for Multidimensional Identity Representation in LLM-Based Agent

Abstract:
Existing methods for simulating individual identities often oversimplify human complexity, which may lead to incomplete or flattened representations. To address this, we introduce SPeCtrum, a grounded framework for constructing authentic LLM agent personas by incorporating an individual's multidimensional self‑concept. SPeCtrum integrates three core components: Social Identity (S), Personal Identity (P), and Personal Life Context (C), each contributing distinct yet interconnected aspects of identity. To evaluate SPeCtrum's effectiveness in identity representation, we conducted automated and human evaluations. Automated evaluations using popular drama characters showed that Personal Life Context (C)‑derived from short essays on preferences and daily routines‑modeled characters' identities more effectively than Social Identity (S) and Personal Identity (P) alone and performed comparably to the full SPC combination. In contrast, human evaluations involving real‑world individuals found that the full SPC combination provided a more comprehensive self‑concept representation than C alone. Our findings suggest that while C alone may suffice for basic identity simulation, integrating S, P, and C enhances the authenticity and accuracy of real‑world identity representation. Overall, SPeCtrum offers a structured approach for simulating individuals in LLM agents, enabling more personalized human‑AI interactions and improving the realism of simulation‑based behavioral studies.

PaperID: 11656, https://arxiv.org/pdf/2502.08586.pdf

Authors: Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, Micah Goldblum

Title: Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

Abstract:
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real‑world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM‑powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open‑source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning.

PaperID: 11657, https://arxiv.org/pdf/2502.08557.pdf

Authors: Wonduk Seo, Hyunjin An, Seunghyun Lee

Title: A New Query Expansion Approach via Agent-Mediated Dialogic Inquiry

Abstract:
Query expansion is widely used in Information Retrieval (IR) to improve search outcomes by supplementing initial queries with richer information. While recent Large Language Model (LLM) based methods generate pseudo‑relevant content and expanded terms via multiple prompts, they often yield homogeneous, narrow expansions that lack the diverse context needed to retrieve relevant information. In this paper, we propose AMD: a new Agent‑Mediated Dialogic Framework that engages in a dialogic inquiry involving three specialized roles: (1) a Socratic Questioning Agent reformulates the initial query into three sub‑questions, with each question inspired by a specific Socratic questioning dimension, including clarification, assumption probing, and implication probing, (2) a Dialogic Answering Agent generates pseudo‑answers, enriching the query representation with multiple perspectives aligned to the user's intent, and (3) a Reflective Feedback Agent evaluates and refines these pseudo‑answers, ensuring that only the most relevant and informative content is retained. By leveraging a multi‑agent process, AMD effectively crafts richer query representations through inquiry and feedback refinement. Extensive experiments on benchmarks including BEIR and TREC demonstrate that our framework outperforms previous methods, offering a robust solution for retrieval tasks.

PaperID: 11658, https://arxiv.org/pdf/2502.08514.pdf

Authors: Mahnaz Koupaee, Jake W. Vincent, Saab Mansour, Igor Shalyminov, Han He, Hwanjun Song, Raphael Shu, Jianfeng He, Yi Nian, Amy Wing-mei Wong, Kyu J. Han, Hang Su

Title: Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation

Abstract:
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM‑based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi‑round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing the recent faithfulness evaluation datasets, we observe that naturally, it is not always the case for a summary to be either faithful to the source document or not. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate our approach can help identify ambiguities, and have even a stronger performance on non‑ambiguous summaries.

PaperID: 11659, https://arxiv.org/pdf/2502.08455.pdf

Authors: Liwei Yuan, Hideaki Ishii

Title: Resilient Quantized Consensus in Multi-Hop Relay Networks

Abstract:
We study resilient quantized consensus in multi‑agent systems, where some agents may malfunction. The network consists of agents taking integer‑valued states, and the agents' communication is subject to asynchronous updates and time delays. We utilize the quantized weighted mean subsequence reduced algorithm where agents communicate with others through multi‑hop relays. We prove necessary and sufficient conditions for our algorithm to achieve the objective under the malicious and Byzantine attack models. Our approach has tighter graph conditions compared to the one‑hop algorithm and the flooding‑based algorithms for binary consensus. Numerical examples verify the efficacy of our algorithm.

PaperID: 11660, https://arxiv.org/pdf/2502.08365.pdf

Authors: Riccardo Zamboni, Mirco Mutti, Marcello Restelli

Title: Towards Principled Unsupervised Multi-Agent Reinforcement Learning

Abstract:
In reinforcement learning, we typically refer to unsupervised pre‑training when we aim to pre‑train a policy without a priori access to the task specification, i.e. rewards, to be later employed for efficient learning of downstream tasks. In single‑agent settings, the problem has been extensively studied and mostly understood. A popular approach, called task‑agnostic exploration, casts the unsupervised objective as maximizing the entropy of the state distribution induced by the agent's policy, from which principles and methods follow. In contrast, little is known about it in multi‑agent settings, which are ubiquitous in the real world. What are the pros and cons of alternative problem formulations in this setting? How hard is the problem in theory, how can we solve it in practice? In this paper, we address these questions by first characterizing those alternative formulations and highlighting how the problem, even when tractable in theory, is non‑trivial in practice. Then, we present a scalable, decentralized, trust‑region policy search algorithm to address the problem in practical settings. Finally, we provide numerical validations to both corroborate the theoretical findings and pave the way for unsupervised multi‑agent reinforcement learning via task‑agnostic exploration in challenging domains, showing that optimizing for a specific objective, namely mixture entropy, provides an excellent trade‑off between tractability and performances.

PaperID: 11661, https://arxiv.org/pdf/2502.08337.pdf

Authors: Soumyendu Sarkar, Avisek Naug, Antonio Guillen, Vineet Gundecha, Ricardo Luna Gutierrez, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Desik Rengarajan, Cullen Bash

Title: Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters

Abstract:
Reducing the environmental impact of cloud computing requires efficient workload distribution across geographically dispersed Data Center Clusters (DCCs) and simultaneously optimizing liquid and air (HVAC) cooling with time shift of workloads within individual data centers (DC). This paper introduces Green‑DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC. By incorporating factors such as weather, carbon intensity, and resource availability, Green‑DCC addresses realistic constraints and interdependencies. We demonstrate how the system optimizes multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics while also offering a framework and benchmark simulation for broader ML research in sustainability.

PaperID: 11662, https://arxiv.org/pdf/2502.08324.pdf

Authors: Leo D'Amato, Paola Pellegrini, Vito Trianni

Title: Decentralised multi-agent coordination for real-time railway traffic management

Abstract:
The real‑time Railway Traffic Management Problem (rtRTMP) is a challenging optimisation problem in railway transportation. It involves the efficient management of train movements while minimising delay propagation caused by unforeseen perturbations due to, e.g, temporary speed limitations or signal failures. This paper re‑frames the rtRTMP as a multi‑agent coordination problem and formalises it as a Distributed Constraint Optimisation Problem (DCOP) to explore its potential for decentralised solutions. We propose a novel coordination algorithm that extends the widely known Distributed Stochastic Algorithm (DSA), allowing trains to self‑organise and resolve scheduling conflicts. The performance of our algorithm is compared to a classical DSA through extensive simulations on a synthetic dataset reproducing diverse problem configurations. Results show that our approach achieves significant improvements in solution quality and convergence speed, demonstrating its effectiveness and scalability in managing large‑scale railway networks. Beyond the railway domain, this framework can have broader applicability in autonomous systems, such as self‑driving vehicles or inter‑satellite coordination.

PaperID: 11663, https://arxiv.org/pdf/2502.08224.pdf

Authors: Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, Dan Pei

Title: Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis

Abstract:
In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought‑action‑observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow‑of‑Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM‑based multi‑agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP‑centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow‑of‑Action method achieves 64.01%, meeting the accuracy requirements for RCA in real‑world systems.

PaperID: 11664, https://arxiv.org/pdf/2502.08119.pdf

Authors: Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han

Title: Generative AI-Enhanced Cooperative MEC of UAVs and Ground Stations for Unmanned Surface Vehicles

Abstract:
The increasing deployment of unmanned surface vehicles (USVs) require computational support and coverage in applications such as maritime search and rescue. Unmanned aerial vehicles (UAVs) can offer low‑cost, flexible aerial services, and ground stations (GSs) can provide powerful supports, which can cooperate to help the USVs in complex scenarios. However, the collaboration between UAVs and GSs for USVs faces challenges of task uncertainties, USVs trajectory uncertainties, heterogeneities, and limited computational resources. To address these issues, we propose a cooperative UAV and GS based robust multi‑access edge computing framework to assist USVs in completing computational tasks. Specifically, we formulate the optimization problem of joint task offloading and UAV trajectory to minimize the total execution time, which is in the form of mixed integer nonlinear programming and NP‑hard to tackle. Therefore, we propose the algorithm of generative artificial intelligence‑enhanced heterogeneous agent proximal policy optimization (GAI‑HAPPO). The proposed algorithm integrates GAI models to enhance the actor network ability to model complex environments and extract high‑level features, thereby allowing the algorithm to predict uncertainties and adapt to dynamic conditions. Additionally, GAI stabilizes the critic network, addressing the instability of multi‑agent reinforcement learning approaches. Finally, extensive simulations demonstrate that the proposed algorithm outperforms the existing benchmark methods, thus highlighting the potentials in tackling intricate, cross‑domain issues in the considered scenarios.

PaperID: 11665, https://arxiv.org/pdf/2502.08063.pdf

Authors: Guanghui Wang, Krishna Acharya, Lokranjan Lakshmikanthan, Juba Ziani, Vidya Muthukumar

Title: Last-iterate Convergence for Symmetric, General-sum, $2 \times 2$ Games Under The Exponential Weights Dynamic

Abstract:
We conduct a comprehensive analysis of the discrete‑time exponential‑weights dynamic with a constant step size on all general‑sum and symmetric 2 × 2 normal‑form games, i.e. games with 2 pure strategies per player, and where the ensuing payoff tuple is of the form (A,A^\top) (where A is the 2 × 2 payoff matrix corresponding to the first player). Such symmetric games commonly arise in real‑world interactions between 'symmetric" agents who have identically defined utility functions ‑‑ such as Bertrand competition and multi‑agent performative prediction, and display a rich multiplicity of equilibria despite the seemingly simple setting. Somewhat surprisingly, we show through a first‑principles analysis that the exponential weights dynamic, which is popular in online learning, converges in the last iterate for such games regardless of initialization with an appropriately chosen step size. For certain games and/or initializations, we further show that the convergence rate is in fact exponential and holds for any step size. We illustrate our theory with extensive simulations and applications to the aforementioned game‑theoretic interactions. In the case of multi‑agent performative prediction, we formulate a new "mortgage competition" game between lenders (i.e. banks) who interact with a population of customers, and show that it fits into our framework.

PaperID: 11666, https://arxiv.org/pdf/2502.08033.pdf

Authors: Anjian Li, Sangjae Bae, David Isele, Ryne Beeson, Faizan M. Tariq

Title: Predictive Planner for Autonomous Driving with Consistency Models

Abstract:
Trajectory prediction and planning are essential for autonomous vehicles to navigate safely and efficiently in dynamic environments. Traditional approaches often treat them separately, limiting the ability for interactive planning. While recent diffusion‑based generative models have shown promise in multi‑agent trajectory generation, their slow sampling is less suitable for high‑frequency planning tasks. In this paper, we leverage the consistency model to build a predictive planner that samples from a joint distribution of ego and surrounding agents, conditioned on the ego vehicle's navigational goal. Trained on real‑world human driving datasets, our consistency model generates higher‑quality trajectories with fewer sampling steps than standard diffusion models, making it more suitable for real‑time deployment. To enforce multiple planning constraints simultaneously on the ego trajectory, a novel online guided sampling approach inspired by the Alternating Direction Method of Multipliers (ADMM) is introduced. Evaluated on the Waymo Open Motion Dataset (WOMD), our method enables proactive behavior such as nudging and yielding, and also demonstrates smoother, safer, and more efficient trajectories and satisfaction of multiple constraints under a limited computational budget.

PaperID: 11667, https://arxiv.org/pdf/2502.08003.pdf

Authors: Mengfan Xu, Liren Shan, Fatemeh Ghaffari, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili

Title: Heterogeneous Multi-agent Multi-armed Bandits on Stochastic Block Models

Abstract:
We study a novel heterogeneous multi‑agent multi‑armed bandit problem with a cluster structure induced by stochastic block models, influencing not only graph topology, but also reward heterogeneity. Specifically, agents are distributed on random graphs based on stochastic block models ‑ a generalized Erdos‑Renyi model with heterogeneous edge probabilities: agents are grouped into clusters (known or unknown); edge probabilities for agents within the same cluster differ from those across clusters. In addition, the cluster structure in stochastic block model also determines our heterogeneous rewards. Rewards distributions of the same arm vary across agents in different clusters but remain consistent within a cluster, unifying homogeneous and heterogeneous settings and varying degree of heterogeneity, and rewards are independent samples from these distributions. The objective is to minimize system‑wide regret across all agents. To address this, we propose a novel algorithm applicable to both known and unknown cluster settings. The algorithm combines an averaging‑based consensus approach with a newly introduced information aggregation and weighting technique, resulting in a UCB‑type strategy. It accounts for graph randomness, leverages both intra‑cluster (homogeneous) and inter‑cluster (heterogeneous) information from rewards and graphs, and incorporates cluster detection for unknown cluster settings. We derive optimal instance‑dependent regret upper bounds of order \logT under sub‑Gaussian rewards. Importantly, our regret bounds capture the degree of heterogeneity in the system (an additional layer of complexity), exhibit smaller constants, scale better for large systems, and impose significantly relaxed assumptions on edge probabilities. In contrast, prior works have not accounted for this refined problem complexity, rely on more stringent assumptions, and exhibit limited scalability.

PaperID: 11668, https://arxiv.org/pdf/2502.07942.pdf

Authors: Ruichen Zhang, Mufan Qiu, Zhen Tan, Mohan Zhang, Vincent Lu, Jie Peng, Kaidi Xu, Leandro Z. Agudelo, Peter Qian, Tianlong Chen

Title: Symbiotic Cooperation for Web Agents: Harnessing Complementary Strengths of Large and Small LLMs

Abstract:
Web browsing agents powered by large language models (LLMs) have shown tremendous potential in automating complex web‑based tasks. Existing approaches typically rely on large LLMs (e.g., GPT‑4o) to explore web environments and generate trajectory data, which is then used either for demonstration retrieval (for large LLMs) or to distill small LLMs (e.g., Llama3) in a process that remains decoupled from the exploration. In this paper, we propose AgentSymbiotic, an iterative framework that couples data synthesis with task‑performance, yielding a "symbiotic improvement" for both large and small LLMs. Our study uncovers a complementary dynamic between LLM types: while large LLMs excel at generating high‑quality trajectories for distillation, the distilled small LLMs‑owing to their distinct reasoning capabilities‑often choose actions that diverge from those of their larger counterparts. This divergence drives the exploration of novel trajectories, thereby enriching the synthesized data. However, we also observe that the performance of small LLMs becomes a bottleneck in this iterative enhancement process. To address this, we propose two innovations in LLM distillation: a speculative data synthesis strategy that mitigates off‑policy bias, and a multi‑task learning approach designed to boost the reasoning capabilities of the student LLM. Furthermore, we introduce a Hybrid Mode for Privacy Preservation to address user privacy concerns. Evaluated on the WEBARENA benchmark, AgentSymbiotic achieves SOTA performance with both LLM types. Our best Large LLM agent reaches 52%, surpassing the previous best of 45%, while our 8B distilled model demonstrates a competitive 49%, exceeding the prior best of 28%. Code will be released upon acceptance.

PaperID: 11669, https://arxiv.org/pdf/2502.07928.pdf

Authors: Shahbaz Siddeeq, Zeeshan Rasheed, Malik Abdul Sami, Mahade Hasan, Muhammad Waseem, Jussi Rasku, Mika Saari, Kai-Kristian Kemell, Pekka Abrahamsson

Title: Distributed Approach to Haskell Based Applications Refactoring with LLMs Based Multi-Agent Systems

Abstract:
We present a large language models (LLMs) based multi‑agent system to automate the refactoring of Haskell codebases. The multi‑agent system consists of specialized agents performing tasks such as context analysis, refactoring, validation, and testing. Refactoring improvements are using metrics such as cyclomatic complexity, run‑time, and memory allocation. Experimental evaluations conducted on Haskell codebases demonstrate improvements in code quality. Cyclomatic complexity was reduced by 13.64% and 47.06% in the respective codebases. Memory allocation improved by 4.17% and 41.73%, while runtime efficiency increased by up to 50%. These metrics highlight the systems ability to optimize Haskells functional paradigms while maintaining correctness and scalability. Results show reductions in complexity and performance enhancements across codebases. The integration of LLMs based multi‑agent system enables precise task execution and inter‑agent collaboration, addressing the challenges of refactoring in functional programming. This approach aims to address the challenges of refactoring functional programming languages through distributed and modular systems.

PaperID: 11670, https://arxiv.org/pdf/2502.07819.pdf

Authors: Shayan Sharifi

Title: Enhancing kidney transplantation through multi-agent kidney exchange programs: A comprehensive review and optimization models

Abstract:
This paper presents a comprehensive review of the last two decades of research on Kidney Exchange Programs (KEPs), systematically categorizing and classifying key contributions to provide readers with a structured understanding of advancements in the field. The review highlights the evolution of KEP methodologies and lays the foundation for our contribution. We propose three mathematical models aimed at improving both the quantity and quality of kidney transplants. Model 1 maximizes the number of transplants by focusing on compatibility based on blood type and PRA, without additional constraints. Model 2 introduces a minimum Human Leukocyte Antigen (HLA) compatibility threshold to enhance transplant quality, though this leads to fewer matches. Model 3 extends the problem to a Multi‑Agent Kidney Exchange Program (MKEP), pooling incompatible donor‑recipient pairs across multiple agents, resulting in a higher number of successful transplants while ensuring fairness across agents. Sensitivity analyses demonstrate trade‑offs between transplant quantity and quality, with Model 3 striking the optimal balance by leveraging multi‑agent collaboration to improve both the number and quality of transplants. These findings underscore the potential benefits of more integrated kidney exchange systems.

PaperID: 11671, https://arxiv.org/pdf/2502.07709.pdf

Authors: Loris Gaven, Thomas Carta, Clément Romac, Cédric Colas, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

Title: MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces

Abstract:
Open‑ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high‑dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert‑defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample‑efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open‑ended goal spaces.

PaperID: 11672, https://arxiv.org/pdf/2502.07635.pdf

Authors: Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo

Title: Distributed Value Decomposition Networks with Networked Agents

Abstract:
We investigate the problem of distributed training under partial observability, whereby cooperative multi‑agent reinforcement learning agents (MARL) maximize the expected cumulative joint reward. We propose distributed value decomposition networks (DVDN) that generate a joint Q‑function that factorizes into agent‑wise Q‑functions. Whereas the original value decomposition networks rely on centralized training, our approach is suitable for domains where centralized training is not possible and agents must learn by interacting with the physical environment in a decentralized manner while communicating with their peers. DVDN overcomes the need for centralized training by locally estimating the shared objective. We contribute with two innovative algorithms, DVDN and DVDN (GT), for the heterogeneous and homogeneous agents settings respectively. Empirically, both algorithms approximate the performance of value decomposition networks, in spite of the information loss during communication, as demonstrated in ten MARL tasks in three standard environments.

PaperID: 11673, https://arxiv.org/pdf/2502.07514.pdf

Authors: Zicheng Hu, Cheng Chen

Title: A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond

Abstract:
We investigate various stochastic bandit problems in the presence of adversarial corruptions. A seminal work for this problem is the BARBAR~\citegupta2019better algorithm, which achieves both robustness and efficiency. However, it suffers from a regret of O(KC), which does not match the lower bound of Ω(C), where K denotes the number of arms and C denotes the corruption level. In this paper, we first improve the BARBAR algorithm by proposing a novel framework called BARBAT, which eliminates the factor of K to achieve an optimal regret bound up to a logarithmic factor. We also extend BARBAT to various settings, including multi‑agent bandits, graph bandits, combinatorial semi‑bandits and batched bandits. Compared with the Follow‑the‑Regularized‑Leader framework, our methods are more amenable to parallelization, making them suitable for multi‑agent and batched bandit settings, and they incur lower computational costs, particularly in semi‑bandit problems. Numerical experiments verify the efficiency of the proposed methods.

PaperID: 11674, https://arxiv.org/pdf/2502.07487.pdf

Authors: Jian Yang, Wei Zhang, Jiaxi Yang, Yibo Miao, Shanghaoran Quan, Zhenhe Wu, Qiyao Peng, Liqun Yang, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin

Title: Multi-Agent Collaboration for Multilingual Code Instruction Tuning

Abstract:
Recent advancement in code understanding and generation demonstrates that code LLMs fine‑tuned on a high‑quality instruction dataset can gain powerful capabilities to address wide‑ranging code‑related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi‑agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language‑specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language‑specific instruction data from the code snippets and then provide the generated data as the seed data for language‑specific agents. Multiple language‑specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (A new programming language or existing programming language), To further encourage the cross‑lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high‑quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5‑xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5‑xCoder in sharing common knowledge, highlighting its potential to reduce the cross‑lingual gap.

PaperID: 11675, https://arxiv.org/pdf/2502.07470.pdf

Authors: Pushkal Purohit, Anoop Jain

Title: On Event-Triggered Resilient Consensus Using Auxiliary Layer

Abstract:
Due to its design simplicity, auxiliary layer‑based resilient control is widely discussed in the literature to mitigate the effects of False Data Injection (FDI) attacks. However, the increased communication burden due to additional communication links for connecting an extra layer is often overlooked in the literature. This paper bridges this gap by considering an event‑triggered approach for inter‑layer communication between the physical layer (containing actual agents) and the auxiliary layer (containing virtual agents) for the resilient state consensus in a multi‑agent system. We provide state‑based and dynamic event‑triggering mechanisms, the former being the motivation for the latter. The exclusion of Zeno behavior is established by proving positive minimum inter‑event time (MIET). Extensive simulation and experimental results are provided to illustrate the proposed methodology.

PaperID: 11676, https://arxiv.org/pdf/2502.07443.pdf

Authors: Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis

Title: Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames

Abstract:
LLM‑driven multi‑agent‑based simulations have been gaining traction with applications in game‑theoretic and social simulations. While most implementations seek to exploit or evaluate LLM‑agentic reasoning, they often do so with a weak notion of agency and simplified architectures. We implement a role‑based multi‑agent strategic interaction framework tailored to sophisticated recursive reasoners, providing the means for systematic in‑depth development and evaluation of strategic reasoning. Our game environment is governed by the umpire responsible for facilitating games, from matchmaking through move validation to environment management. Players incorporate state‑of‑the‑art LLMs in their decision mechanism, relying on a formal hypergame‑based model of hierarchical beliefs. We use one‑shot, 2‑player beauty contests to evaluate the recursive reasoning capabilities of the latest LLMs, providing a comparison to an established baseline model from economics and data from human experiments. Furthermore, we introduce the foundations of an alternative semantic measure of reasoning to the k‑level theory. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.

PaperID: 11677, https://arxiv.org/pdf/2502.07373.pdf

Authors: Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, Lei Bai

Title: EvoFlow: Evolving Diverse Agentic Workflows On The Fly

Abstract:
The past two years have witnessed the evolution of large language model (LLM)‑based multi‑agent systems from labor‑intensive manual design to partial automation (e.g., prompt engineering, communication topology) and eventually to fully automated design. However, existing agentic automation pipelines often lack LLM heterogeneity and focus on single‑objective performance optimization, limiting their potential to combine weaker models for more customized and cost‑effective solutions. To address this challenge, we propose EvoFlow, a niching evolutionary algorithm‑based framework to automatically search a population of heterogeneous and complexity‑adaptive agentic workflows, rather than a single homogeneous, complex workflow. Technically, EvoFlow performs (1) tag‑based retrieval to extract parent workflows from an agentic population, evolves new workflows through (2) crossover and (3) mutation, and employs (4) niching‑based selection to maintain population diversity and quality. Extensive evaluations across seven benchmarks demonstrate that EvoFlow is: (I) diverse, evolving a population of workflows ranging from simple I/O tasks to complex multi‑turn interactions; (II) high‑performing, outperforming previous handcrafted and automated workflows by 1.23%～29.86%; (III) economical, surpassing powerful \llmnameo1‑preview at 12.4% of its inference cost using weaker open‑source models.

PaperID: 11678, https://arxiv.org/pdf/2502.07254.pdf

Authors: Rajesh Ranjan, Shailja Gupta, Surya Narayan Singh

Title: Fairness in Agentic AI: A Unified Framework for Ethical and Equitable Multi-Agent System

Abstract:
Ensuring fairness in decentralized multi‑agent systems presents significant challenges due to emergent biases, systemic inefficiencies, and conflicting agent incentives. This paper provides a comprehensive survey of fairness in multi‑agent AI, introducing a novel framework where fairness is treated as a dynamic, emergent property of agent interactions. The framework integrates fairness constraints, bias mitigation strategies, and incentive mechanisms to align autonomous agent behaviors with societal values while balancing efficiency and robustness. Through empirical validation, we demonstrate that incorporating fairness constraints results in more equitable decision‑making. This work bridges the gap between AI ethics and system design, offering a foundation for accountable, transparent, and socially responsible multi‑agent AI systems.

PaperID: 11679, https://arxiv.org/pdf/2502.07165.pdf

Authors: Peipei Wei, Dimitris Dimitriadis, Yan Xu, Mingwei Shen

Title: Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification

Abstract:
We present PRINCIPLE‑BASED PROMPTING, a simple but effective multi‑agent prompting strategy for text classification. It first asks multiple LLM agents to independently generate candidate principles based on analysis of demonstration samples with or without labels, consolidates them into final principles via a finalizer agent, and then sends them to a classifier agent to perform downstream classification tasks. Extensive experiments on binary and multi‑class classification datasets with different sizes of LLMs show that our approach not only achieves substantial performance gains (1.55% ‑ 19.37%) over zero‑shot prompting on macro‑F1 score but also outperforms other strong baselines (CoT and stepback prompting). Principles generated by our approach help LLMs perform better on classification tasks than human crafted principles on two private datasets. Our multi‑agent PRINCIPLE‑BASED PROMPTING approach also shows on‑par or better performance compared to demonstration‑based few‑shot prompting approaches, yet with substantially lower inference costs. Ablation studies show that label information and the multi‑agent cooperative LLM framework play an important role in generating high‑quality principles to facilitate downstream classification tasks.

PaperID: 11680, https://arxiv.org/pdf/2502.07132.pdf

Authors: Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Title: Interactive Data Harmonization with LLM Agents: Opportunities and Challenges

Abstract:
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time‑consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM‑based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.

PaperID: 11681, https://arxiv.org/pdf/2502.07067.pdf

Authors: Siddharth Gandhi, Luyu Gao, Jamie Callan

Title: Repository-level Code Search with Neural Retrieval Methods

Abstract:
This paper presents a multi‑stage reranking system for repository‑level code search, which leverages the vastly available commit histories of large open‑source repositories to aid in bug fixing. We define the task of repository‑level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25‑based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open‑source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline, across a diverse set of queries, demonstrating the effectiveness this approach. We hope this work aids LLM agents as a tool for better code search and understanding. Our code and results obtained are publicly available.

PaperID: 11682, https://arxiv.org/pdf/2502.06975.pdf

Authors: Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, Mariya Toneva

Title: Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents

Abstract:
As Large Language Models (LLMs) evolve from text‑completion tools into fully fledged agents operating in dynamic environments, they must address the challenge of continually learning and retaining long‑term knowledge. Many biological systems solve these challenges with episodic memory, which supports single‑shot learning of instance‑specific contexts. Inspired by this, we present an episodic memory framework for LLM agents, centered around five key properties of episodic memory that underlie adaptive and context‑sensitive behavior. With various research efforts already partially covering these properties, this position paper argues that now is the right time for an explicit, integrated focus on episodic memory to catalyze the development of long‑term agents. To this end, we outline a roadmap that unites several research directions under the goal to support all five properties of episodic memory for more efficient long‑term LLM agents.

PaperID: 11683, https://arxiv.org/pdf/2502.06963.pdf

Authors: Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

Title: Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

Abstract:
The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision‑making frameworks. This survey presents a comprehensive review of recent advances in DRL‑based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single‑agent, multi‑agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next‑generation ITS.

PaperID: 11684, https://arxiv.org/pdf/2502.06882.pdf

Authors: Shengbin Yue, Ting Huang, Zheng Jia, Siyuan Wang, Shujun Liu, Yun Song, Xuanjing Huang, Zhongyu Wei

Title: Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction

Abstract:
Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi‑agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real‑legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants' characters and behaviors as well as addressing distractions. A Multi‑stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs' performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.

PaperID: 11685, https://arxiv.org/pdf/2502.06835.pdf

Authors: Ziping Xu, Hinal Jajal, Sung Won Choi, Inbal Nahum-Shani, Guy Shani, Alexandra M. Psihogios, Pei-Yao Hung, Susan Murphy

Title: Reinforcement Learning on Dyads to Enhance Medication Adherence

Abstract:
Medication adherence is critical for the recovery of adolescents and young adults (AYAs) who have undergone hematopoietic cell transplantation (HCT). However, maintaining adherence is challenging for AYAs after hospital discharge, who experience both individual (e.g. physical and emotional symptoms) and interpersonal barriers (e.g., relational difficulties with their care partner, who is often involved in medication management). To optimize the effectiveness of a three‑component digital intervention targeting both members of the dyad as well as their relationship, we propose a novel Multi‑Agent Reinforcement Learning (MARL) approach to personalize the delivery of interventions. By incorporating the domain knowledge, the MARL framework, where each agent is responsible for the delivery of one intervention component, allows for faster learning compared with a flattened agent. Evaluation using a dyadic simulator environment, based on real clinical data, shows a significant improvement in medication adherence (approximately 3%) compared to purely random intervention delivery. The effectiveness of this approach will be further evaluated in an upcoming trial.

PaperID: 11686, https://arxiv.org/pdf/2502.06776.pdf

Authors: Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov

Title: InSTA: Towards Internet-Scale Training For Agents

Abstract:
The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand‑written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet‑scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM filters trajectories by judging their success. Language models are powerful data curation tools, identifying harmful content with an accuracy of 97%, judging successful trajectories with an accuracy of 82.6%, and producing effective data. We train agents based on Qwen 3 1.7B that are competitive with frontier LLMs as web agents, while being smaller and faster. Our top agent reaches a success rate of 56.9%, outperforming the data collection policy Qwen 3 235B, a 235 times larger Llama 4 Maverick, and reaching 94.7% of the performance of Gemini 2.5 Flash. We are releasing code, models and data at: https://data‑for‑agents.github.io.

PaperID: 11687, https://arxiv.org/pdf/2502.06722.pdf

Authors: Malaika Zafar, Roohan Ahmed Khan, Aleksey Fedoseev, Kumar Katyayan Jaiswal, Dzmitry Tsetserukou

Title: HetSwarm: Cooperative Navigation of Heterogeneous Swarm in Dynamic and Dense Environments through Impedance-based Guidance

Abstract:
With the growing demand for efficient logistics and warehouse management, unmanned aerial vehicles (UAVs) are emerging as a valuable complement to automated guided vehicles (AGVs). UAVs enhance efficiency by navigating dense environments and operating at varying altitudes. However, their limited flight time, battery life, and payload capacity necessitate a supporting ground station. To address these challenges, we propose HetSwarm, a heterogeneous multi‑robot system that combines a UAV and a mobile ground robot for collaborative navigation in cluttered and dynamic conditions. Our approach employs an artificial potential field (APF)‑based path planner for the UAV, allowing it to dynamically adjust its trajectory in real time. The ground robot follows this path while maintaining connectivity through impedance links, ensuring stable coordination. Additionally, the ground robot establishes temporal impedance links with low‑height ground obstacles to avoid local collisions, as these obstacles do not interfere with the UAV's flight. Experimental validation of HetSwarm in diverse environmental conditions demonstrated a 90% success rate across 30 test cases. The ground robot exhibited an average deviation of 45 cm near obstacles, confirming effective collision avoidance. Extensive simulations in the Gym PyBullet environment further validated the robustness of our system for real‑world applications, demonstrating its potential for dynamic, real‑time task execution in cluttered environments.

PaperID: 11688, https://arxiv.org/pdf/2502.06589.pdf

Authors: Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang

Title: Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Abstract:
Due to the scarcity of agent‑oriented pre‑training data, LLM‑based autonomous agents typically rely on complex prompting or extensive fine‑tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus‑Forge, the first large‑scale pre‑training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus‑Forge comprises 103B agent‑specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre‑training on Hephaestus‑Forge, Hephaestus outperforms small‑ to medium‑scale open‑source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre‑training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.

PaperID: 11689, https://arxiv.org/pdf/2502.06472.pdf

Authors: Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, Jinzhuo Wang

Title: KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment

Abstract:
Maintaining comprehensive and up‑to‑date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi‑agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain‑specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM‑verified correctness and reducing conflict edges by 18.6% through multi‑layer assessments.

PaperID: 11690, https://arxiv.org/pdf/2502.06403.pdf

Authors: Alessio Benavoli, Alessandro Facchini, Marco Zaffalon

Title: The AI off-switch problem as a signalling game: bounded rationality and incomparability

Abstract:
The off‑switch problem is a critical challenge in AI control: if an AI system resists being switched off, it poses a significant risk. In this paper, we model the off‑switch problem as a signalling game, where a human decision‑maker communicates its preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human's utility. We assume that the human is a bounded rational agent and explore various bounded rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off‑switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.

PaperID: 11691, https://arxiv.org/pdf/2502.06261.pdf

Authors: Changxi Zhu, Mehdi Dastani, Shihan Wang

Title: Reducing Variance Caused by Communication in Decentralized Multi-agent Deep Reinforcement Learning

Abstract:
In decentralized multi‑agent deep reinforcement learning (MADRL), communication can help agents to gain a better understanding of the environment to better coordinate their behaviors. Nevertheless, communication may involve uncertainty, which potentially introduces variance to the learning of decentralized agents. In this paper, we focus on a specific decentralized MADRL setting with communication and conduct a theoretical analysis to study the variance that is caused by communication in policy gradients. We propose modular techniques to reduce the variance in policy gradients during training. We adopt our modular techniques into two existing algorithms for decentralized MADRL with communication and evaluate them on multiple tasks in the StarCraft Multi‑Agent Challenge and Traffic Junction domains. The results show that decentralized MADRL communication methods extended with our proposed techniques not only achieve high‑performing agents but also reduce variance in policy gradients during training.

PaperID: 11692, https://arxiv.org/pdf/2502.06251.pdf

Authors: Soohwan Lee, Mingyu Kim, Seoyeong Hwang, Dajung Kim, Kyungho Lee

Title: Amplifying Minority Voices: AI-Mediated Devil's Advocate System for Inclusive Group Decision-Making

Abstract:
Group decision‑making often benefits from diverse perspectives, yet power imbalances and social influence can stifle minority opinions and compromise outcomes. This prequel introduces an AI‑mediated communication system that leverages the Large Language Model to serve as a devil's advocate, representing underrepresented viewpoints without exposing minority members' identities. Rooted in persuasive communication strategies and anonymity, the system aims to improve psychological safety and foster more inclusive decision‑making. Our multi‑agent architecture, which consists of a summary agent, conversation agent, AI duplicate checker, and paraphrase agent, encourages the group's critical thinking while reducing repetitive outputs. We acknowledge that reliance on text‑based communication and fixed intervention timings may limit adaptability, indicating pathways for refinement. By focusing on the representation of minority viewpoints anonymously in power‑imbalanced settings, this approach highlights how AI‑driven methods can evolve to support more divergent and inclusive group decision‑making.

PaperID: 11693, https://arxiv.org/pdf/2502.06205.pdf

Authors: Guoxin Chen, Minpeng Liao, Peiying Yu, Dingmin Wang, Zile Qiao, Chao Yang, Xin Zhao, Kai Fan

Title: C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

Abstract:
Retrieval‑augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub‑optimal performance. Inspired by human search behavior ‑‑ typically involving a back‑and‑forth process of proposing search queries and reviewing documents, we propose C‑3PO, a proxy‑centric framework that facilitates communication between retrievers and LLMs through a lightweight multi‑agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi‑agent coordination, we develop a tree‑structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in‑domain and out‑of‑distribution scenarios demonstrate that C‑3PO significantly enhances RAG performance while maintaining plug‑and‑play flexibility and superior generalization capabilities.

PaperID: 11694, https://arxiv.org/pdf/2502.06152.pdf

Authors: Ziyang Guo, Yifan Wu, Jason Hartline, Jessica Hullman

Title: Explaining and Improving Information Complementarities in Multi-Agent Decision-making

Abstract:
Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human‑AI pairings, we contribute a decision‑theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI‑assisted decision workflows. We present a novel explanation technique (ILIV‑SHAP) that adapts SHAP explanations to highlight human‑complementing information. We validate the effectiveness of our framework and ILIV‑SHAP through a study of human‑AI decision‑making, and demonstrate the framework on examples from chest X‑ray diagnosis and deepfake detection. We find that presenting ILIV‑SHAP with AI predictions leads to reliably greater reductions in error over non‑AI assisted decisions more than vanilla SHAP.

PaperID: 11695, https://arxiv.org/pdf/2502.06113.pdf

Authors: Antoine Vivien, Thomas Chaffre, Matthew Stephenson, Eva Artusi, Paulo Santos, Benoit Clement, Karl Sammut

Title: Towards Bio-inspired Heuristically Accelerated Reinforcement Learning for Adaptive Underwater Multi-Agents Behaviour

Abstract:
This paper describes the problem of coordination of an autonomous Multi‑Agent System which aims to solve the coverage planning problem in a complex environment. The considered applications are the detection and identification of objects of interest while covering an area. These tasks, which are highly relevant for space applications, are also of interest among various domains including the underwater context, which is the focus of this study. In this context, coverage planning is traditionally modelled as a Markov Decision Process where a coordinated MAS, a swarm of heterogeneous autonomous underwater vehicles, is required to survey an area and search for objects. This MDP is associated with several challenges: environment uncertainties, communication constraints, and an ensemble of hazards, including time‑varying and unpredictable changes in the underwater environment. MARL algorithms can solve highly non‑linear problems using deep neural networks and display great scalability against an increased number of agents. Nevertheless, most of the current results in the underwater domain are limited to simulation due to the high learning time of MARL algorithms. For this reason, a novel strategy is introduced to accelerate this convergence rate by incorporating biologically inspired heuristics to guide the policy during training. The PSO method, which is inspired by the behaviour of a group of animals, is selected as a heuristic. It allows the policy to explore the highest quality regions of the action and state spaces, from the beginning of the training, optimizing the exploration/exploitation trade‑off. The resulting agent requires fewer interactions to reach optimal performance. The method is applied to the MSAC algorithm and evaluated for a 2D covering area mission in a continuous control environment.

PaperID: 11696, https://arxiv.org/pdf/2502.06111.pdf

Authors: Yijia Xiao, Runhui Wang, Luyang Kong, Davor Golac, Wei Wang

Title: CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories

Abstract:
The increasing complexity of computer science research projects demands more effective tools for deploying code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks. To evaluate the effectiveness of LLMs in handling complex code development tasks of research projects, particularly for NLP/CV/AI/ML/DM topics, we introduce CSR‑Bench, a benchmark for Computer Science Research projects. This benchmark assesses LLMs from various aspects including accuracy, efficiency, and deployment script quality, aiming to explore their potential in conducting computer science research autonomously. We also introduce a novel framework, CSR‑Agents, that utilizes multiple LLM agents to automate the deployment of GitHub code repositories of computer science research projects. Specifically, by checking instructions from markdown files and interpreting repository structures, the model generates and iteratively improves bash commands that set up the experimental environments and deploy the code to conduct research tasks. Preliminary results from CSR‑Bench indicate that LLM agents can significantly enhance the workflow of repository deployment, thereby boosting developer productivity and improving the management of developmental workflows.

PaperID: 11697, https://arxiv.org/pdf/2502.06060.pdf

Authors: Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh

Title: Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Abstract:
Communicating in natural language is a powerful tool in multi‑agent settings, as it enables independent agents to share information in partially observable settings and allows zero‑shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training them to predict information about the environment based on discussions, and we simultaneously improve a model's speaking skills with multi‑agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/

PaperID: 11698, https://arxiv.org/pdf/2502.06049.pdf

Authors: Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis

Title: LM2: Large Memory Models

Abstract:
This paper introduces the Large Memory Model (LM2), a decoder‑only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi‑step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformers general‑purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2model outperforms both the memory‑augmented RMT model by 37.1% and the baseline Llama‑3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi‑hop inference, numerical reasoning, and large‑context question‑answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre‑trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore the memory interpretability, effectiveness of memory modules, and test‑time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.

PaperID: 11699, https://arxiv.org/pdf/2502.05986.pdf

Authors: Ohav Barbi, Ori Yoran, Mor Geva

Title: Preventing Rogue Agents Improves Multi-Agent Collaboration

Abstract:
Multi‑agent systems, where specialized agents collaborate to solve a shared task hold great potential, from increased modularity to simulating complex environments. However, they also have a major caveat ‑‑ a single agent can cause the entire system to fail. Consider a simple game where the knowledge to solve the task is distributed between agents, which share information in a communication channel. At each round, any of the agents can terminate the game and make the final prediction, even if they are uncertain about the outcome of their action. Detection of such rogue agents before they act may prevent the system's failure. In this work, we propose to monitor agents during action prediction and intervene when a future error is likely to occur. To test our approach, we introduce WhoDunitEnv, a multi‑agent collaboration environment that allows modular control over task complexity and communication structure. Experiments on WhoDunitEnv, code generation tasks and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains up to 17.4%, 2.5% and 20%, respectively. Thorough analysis shows that our monitors successfully identify critical points of agent confusion and our interventions effectively stop agent errors from propagating.

PaperID: 11700, https://arxiv.org/pdf/2502.05982.pdf

Authors: Mohammad Amin Abbasi, Farnaz Sadat Mirnezami, Ali Neshati, Hassan Naderi

Title: HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents

Abstract:
We present HamRaz, a culturally adapted Persian‑language dataset for AI‑assisted mental health support, grounded in Person‑Centered Therapy (PCT). To reflect real‑world therapeutic challenges, we combine script‑based dialogue with adaptive large language models (LLM) role‑playing, capturing the ambiguity and emotional nuance of Persian‑speaking clients. We introduce HamRazEval, a dual‑framework for assessing conversational and therapeutic quality using General Metrics and specialized psychological relationship measures. Human evaluations show HamRaz outperforms existing baselines in empathy, coherence, and realism. This resource contributes to the Digital Humanities by bridging language, culture, and mental health in underrepresented communities.

PaperID: 11701, https://arxiv.org/pdf/2502.05963.pdf

Authors: Sharmita Dey

Title: Redefining Robot Generalization Through Interactive Intelligence

Abstract:
Recent advances in large‑scale machine learning have produced high‑capacity foundation models capable of adapting to a broad array of downstream tasks. While such models hold great promise for robotics, the prevailing paradigm still portrays robots as single, autonomous decision‑makers, performing tasks like manipulation and navigation, with limited human involvement. However, a large class of real‑world robotic systems, including wearable robotics (e.g., prostheses, orthoses, exoskeletons), teleoperation, and neural interfaces, are semiautonomous, and require ongoing interactive coordination with human partners, challenging single‑agent assumptions. In this position paper, we argue that robot foundation models must evolve to an interactive multi‑agent perspective in order to handle the complexities of real‑time human‑robot co‑adaptation. We propose a generalizable, neuroscience‑inspired architecture encompassing four modules: (1) a multimodal sensing module informed by sensorimotor integration principles, (2) an ad‑hoc teamwork model reminiscent of joint‑action frameworks in cognitive science, (3) a predictive world belief model grounded in internal model theories of motor control, and (4) a memory/feedback mechanism that echoes concepts of Hebbian and reinforcement‑based plasticity. Although illustrated through the lens of cyborg systems, where wearable devices and human physiology are inseparably intertwined, the proposed framework is broadly applicable to robots operating in semi‑autonomous or interactive contexts. By moving beyond single‑agent designs, our position emphasizes how foundation models in robotics can achieve a more robust, personalized, and anticipatory level of performance.

PaperID: 11702, https://arxiv.org/pdf/2502.05954.pdf

Authors: Amal Aldawsari, Evangelos Pournaras

Title: Optimization under Attack: Resilience, Vulnerability, and the Path to Collapse

Abstract:
Optimization is instrumental for improving operations of large‑scale socio‑technical infrastructures of Smart Cities, for instance, energy and traffic systems. In particular, understanding the performance of multi‑agent discrete‑choice combinatorial optimization under distributed adversary attacks is a compelling and underexplored problem, since multi‑agent systems exhibit a large number of remote control variables that can influence in an unprecedented way the cost‑effectiveness of distributed optimization heuristics. This paper unravels for the first time the trajectories of distributed optimization from resilience to vulnerability, and finally to collapse under varying adversary influence. Using real‑world data to emulate over 28 billion multi‑agent optimization scenarios, we exhaustively assess how the number of agents with different adversarial severity and network positioning influences optimization performance, including the influence on Pareto optimal points. With this novel large‑scale dataset, made openly available as a benchmark, we disentangle how optimization remains resilient to adversaries and which adversary conditions are required to make optimization vulnerable or collapsed. These new findings can provide new insights for designing self‑healing strategies for fault‑tolerance and fault‑correction in adversarial distributed optimization that have been missing so far.

PaperID: 11703, https://arxiv.org/pdf/2502.05944.pdf

Authors: Jackson Coleman, Isaiah Lawrence, Benjamin Turner

Title: Multi-granular Training Strategies for Robust Multi-hop Reasoning Over Noisy and Heterogeneous Knowledge Sources

Abstract:
Multi‑source multi‑hop question answering (QA) represents a challenging task in natural language processing due to the need for dynamic integration of heterogeneous knowledge sources and multi‑step reasoning. Existing methods often suffer from cascading errors, insufficient handling of knowledge conflicts, and computational inefficiency. In this paper, we propose Adaptive Multi‑source Knowledge‑Oriented Reasoning (AMKOR), a generative framework that leverages large language models (LLMs) to dynamically fuse parametric and retrieved knowledge while exploring reasoning trajectories using probabilistic beam reasoning. AMKOR is further enhanced by a multi‑granular learning strategy, optimizing both local reasoning steps and global answer accuracy. Experiments conducted on four widely‑used multi‑hop QA datasets, including HotpotQA and MuSiQue, demonstrate that AMKOR achieves state‑of‑the‑art performance, significantly outperforming baseline methods on both reasoning accuracy and robustness. Additional analyses confirm its scalability, adaptability to noisy knowledge, and superior ability to handle complex multi‑hop tasks. This work establishes a new benchmark for multi‑source multi‑hop QA by effectively combining reasoning quality and efficiency.

PaperID: 11704, https://arxiv.org/pdf/2502.05812.pdf

Authors: Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Enyu Shi, Bokai Xu, Chau Yuen, Dusit Niyato, Mérouane Debbah, Shi Jin, Bo Ai, Xuemin, Shen

Title: Multi-Agent Reinforcement Learning in Wireless Distributed Networks for 6G

Abstract:
The introduction of intelligent interconnectivity between the physical and human worlds has attracted great attention for future sixth‑generation (6G) networks, emphasizing massive capacity, ultra‑low latency, and unparalleled reliability. Wireless distributed networks and multi‑agent reinforcement learning (MARL), both of which have evolved from centralized paradigms, are two promising solutions for the great attention. Given their distinct capabilities, such as decentralization and collaborative mechanisms, integrating these two paradigms holds great promise for unleashing the full power of 6G, attracting significant research and development attention. This paper provides a comprehensive study on MARL‑assisted wireless distributed networks for 6G. In particular, we introduce the basic mathematical background and evolution of wireless distributed networks and MARL, as well as demonstrate their interrelationships. Subsequently, we analyze different structures of wireless distributed networks from the perspectives of homogeneous and heterogeneous. Furthermore, we introduce the basic concepts of MARL and discuss two typical categories, including model‑based and model‑free. We then present critical challenges faced by MARL‑assisted wireless distributed networks, providing important guidance and insights for actual implementation. We also explore an interplay between MARL‑assisted wireless distributed networks and emerging techniques, such as information bottleneck and mirror learning, delivering in‑depth analyses and application scenarios. Finally, we outline several compelling research directions for future MARL‑assisted wireless distributed networks.

PaperID: 11705, https://arxiv.org/pdf/2502.05769.pdf

Authors: Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

Title: Digital Twin Buildings: 3D Modeling, GIS Integration, and Visual Descriptions Using Gaussian Splatting, ChatGPT/Deepseek, and Google Maps Platform

Abstract:
Urban digital twins are virtual replicas of cities that use multi‑source data and data analytics to optimize urban planning, infrastructure management, and decision‑making. Towards this, we propose a framework focused on the single‑building scale. By connecting to cloud mapping platforms such as Google Map Platforms APIs, by leveraging state‑of‑the‑art multi‑agent Large Language Models data analysis using ChatGPT(4o) and Deepseek‑V3/R1, and by using our Gaussian Splatting‑based mesh extraction pipeline, our Digital Twin Buildings framework can retrieve a building's 3D model, visual descriptions, and achieve cloud‑based mapping integration with large language model‑based data analytics using a building's address, postal code, or geographic coordinates.

PaperID: 11706, https://arxiv.org/pdf/2502.05632.pdf

Authors: M Charity, Mayu Wilson, Steven Lee, Dipika Rajesh, Sam Earle, Julian Togelius

Title: Amorphous Fortress Online: Collaboratively Designing Open-Ended Multi-Agent AI and Game Environments

Abstract:
This work introduces Amorphous Fortress Online ‑‑ a web‑based platform where users can design petri‑dish‑like environments and games consisting of multi‑agent AI characters. Users can play, create, and share artificial life and game environments made up of microscopic but transparent finite‑state machine agents that interact with each other. The website features multiple interactive editors and accessible settings to view the multi‑agent interactions directly from the browser. This system serves to provide a database of thematically diverse AI and game environments that use the emergent behaviors of simple AI agents.

PaperID: 11707, https://arxiv.org/pdf/2502.05573.pdf

Authors: Beining Zhang, Aditya Kapoor, Mingfei Sun

Title: Low-Rank Agent-Specific Adaptation (LoRASA) for Multi-Agent Policy Learning

Abstract:
Multi‑agent reinforcement learning (MARL) often relies on \emphparameter sharing (PS) to scale efficiently. However, purely shared policies can stifle each agent's unique specialization, reducing overall performance in heterogeneous environments. We propose Low‑Rank Agent‑Specific Adaptation (LoRASA), a novel approach that treats each agent's policy as a specialized ``task'' fine‑tuned from a shared backbone. Drawing inspiration from parameter‑efficient transfer methods, LoRASA appends small, low‑rank adaptation matrices to each layer of the shared policy, naturally inducing \emphparameter‑space sparsity that promotes both specialization and scalability. We evaluate LoRASA on challenging benchmarks including the StarCraft Multi‑Agent Challenge (SMAC) and Multi‑Agent MuJoCo (MAMuJoCo), implementing it atop widely used algorithms such as MAPPO and A2PO. Across diverse tasks, LoRASA matches or outperforms existing baselines \emphwhile reducing memory and computational overhead. Ablation studies on adapter rank, placement, and timing validate the method's flexibility and efficiency. Our results suggest LoRASA's potential to establish a new norm for MARL policy parameterization: combining a shared foundation for coordination with low‑rank agent‑specific refinements for individual specialization.

PaperID: 11708, https://arxiv.org/pdf/2502.05526.pdf

Authors: Brandon Ho, Batuhan Altundas, Matthew Gombolay

Title: Towards Learning Scalable Agile Dynamic Motion Planning for Robosoccer Teams with Policy Optimization

Abstract:
In fast‑paced, ever‑changing environments, dynamic Motion Planning for Multi‑Agent Systems in the presence of obstacles is a universal and unsolved problem. Be it from path planning around obstacles to the movement of robotic arms, or in planning navigation of robot teams in settings such as Robosoccer, dynamic motion planning is needed to avoid collisions while reaching the targeted destination when multiple agents occupy the same area. In continuous domains where the world changes quickly, existing classical Motion Planning algorithms such as RRT and A become computationally expensive to rerun at every time step. Many variations of classical and well‑formulated non‑learning path‑planning methods have been proposed to solve this universal problem but fall short due to their limitations of speed, smoothness, optimally, etc. Deep Learning models overcome their challenges due to their ability to adapt to varying environments based on past experience. However, current learning motion planning models use discretized environments, do not account for heterogeneous agents or replanning, and build up to improve the classical motion planners' efficiency, leading to issues with scalability. To prevent collisions between heterogenous team members and collision to obstacles while trying to reach the target location, we present a learning‑based dynamic navigation model and show our model working on a simple environment in the concept of a simple Robosoccer Game.

PaperID: 11709, https://arxiv.org/pdf/2502.05498.pdf

Authors: Larkin Liu, Kashif Rasul, Yutong Chao, Jalal Etesami

Title: Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

Abstract:
We present a novel framework for online learning in Stackelberg general‑sum games, where two agents, the leader and follower, engage in sequential turn‑based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth spherical Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. Leveraging the linearity of the agents' reward functions on the Stackelberg manifold, our construct allows the application of linear bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on the learned manifold and establish bounds on the simple regret for learning Stackelberg equilibrium. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi‑agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization.

PaperID: 11710, https://arxiv.org/pdf/2502.05301.pdf

Authors: Fernando Llorente, Daniel Waxman, Petar M. Djurić

Title: Decentralized Online Ensembles of Gaussian Processes for Multi-Agent Systems

Abstract:
Flexible and scalable decentralized learning solutions are fundamentally important in the application of multi‑agent systems. While several recent approaches introduce (ensembles of) kernel machines in the distributed setting, Bayesian solutions are much more limited. We introduce a fully decentralized, asymptotically exact solution to computing the random feature approximation of Gaussian processes. We further address the choice of hyperparameters by introducing an ensembling scheme for Bayesian multiple kernel learning based on online Bayesian model averaging. The resulting algorithm is tested against Bayesian and frequentist methods on simulated and real‑world datasets.

PaperID: 11711, https://arxiv.org/pdf/2502.05199.pdf

Authors: Grzegorz Swirszcz, Adam Zsolt Wagner, Geordie Williamson, Sam Blackwell, Bogdan Georgiev, Alex Davies, Ali Eslami, Sebastien Racaniere, Theophane Weber, Pushmeet Kohli

Title: Advancing Geometry with AI: Multi-agent Generation of Polytopes

Abstract:
Polytopes are one of the most primitive concepts underlying geometry. Discovery and study of polytopes with complex structures provides a means of advancing scientific knowledge. Construction of polytopes with specific extremal structure is very difficult and time‑consuming. Having an automated tool for the generation of such extremal examples is therefore of great value. We present an Artificial Intelligence system capable of generating novel polytopes with very high complexity, whose abilities we demonstrate in three different and challenging scenarios: the Hirsch Conjecture, the k‑neighbourly problem and the longest monotone paths problem. For each of these three problems the system was able to generate novel examples, which match or surpass the best previously known bounds. Our main focus was the Hirsch Conjecture, which had remained an open problem for over 50 years. The highly parallel A.I. system presented in this paper was able to generate millions of examples, with many of them surpassing best known previous results and possessing properties not present in the earlier human‑constructed examples. For comparison, it took leading human experts over 50 years to handcraft the first example of a polytope exceeding the bound conjectured by Hirsch, and in the decade since humans were able to construct only a scarce few families of such counterexample polytopes. With the adoption of computer‑aided methods, the creation of new examples of mathematical objects stops being a domain reserved only for human expertise. Advances in A.I. provide mathematicians with yet another powerful tool in advancing mathematical knowledge. The results presented demonstrate that A.I. is capable of addressing problems in geometry recognized as extremely hard, and also to produce extremal examples different in nature from the ones constructed by humans.

PaperID: 11712, https://arxiv.org/pdf/2502.05028.pdf

Authors: Qixin Zhang, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao

Title: Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency

Abstract:
Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximation guarantees and the rigid requirement for a fully connected communication graph. To address these challenges, we firstly present a MA‑OSMA algorithm, which employs the multi‑linear extension to transfer the discrete submodular maximization problem into a continuous optimization, thereby allowing us to reduce the strict dependence on a complete graph through consensus techniques. Moreover, MA‑OSMA leverages a novel surrogate gradient to avoid sub‑optimal stationary points. To eliminate the computationally intensive projection operations in MA‑OSMA, we also introduce a projection‑free MA‑OSEA algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Theoretically, we confirm that both algorithms achieve a regret bound of \widetildeO(\sqrt\fracC_TT1‑β) against a (\frac1‑e^‑cc)‑approximation to the best comparator in hindsight, where C_T is the deviation of maximizer sequence, β is the spectral gap of the network and c is the joint curvature of submodular objectives. This result significantly improves the (\frac11+c)‑approximation provided by the state‑of‑the‑art OSG algorithm. Finally, we demonstrate the effectiveness of our proposed algorithms through simulation‑based multi‑target tracking.

PaperID: 11713, https://arxiv.org/pdf/2502.04864.pdf

Authors: Aditya Kapoor, Kale-ab Tessera, Mayank Baranwal, Harshad Khadilkar, Jan Peters, Stefano Albrecht, Mingfei Sun

Title: Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning

Abstract:
Credit assignmen, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi‑agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal‑Agent Reward Redistribution (TAR^2), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential‑Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR^2 accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent‑temporal credit assignment problem.

PaperID: 11714, https://arxiv.org/pdf/2502.04790.pdf

Authors: Yuting Zeng, Weizhe Huang, Lei Jiang, Tongxuan Liu, Xitai Jin, Chen Tianying Tiana, Jing Li, Xiaohua Xu

Title: S$^2$-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various natural language processing (NLP) scenarios, but they still face challenges when handling complex arithmetic and logical reasoning tasks. While Chain‑Of‑Thought (CoT) reasoning, self‑consistency (SC) and self‑correction strategies have attempted to guide models in sequential, multi‑step reasoning, Multi‑agent Debate (MAD) has emerged as a viable approach for enhancing the reasoning capabilities of LLMs. By increasing both the number of agents and the frequency of debates, the performance of LLMs improves significantly. However, this strategy results in a significant increase in token costs, presenting a barrier to scalability. To address this challenge, we introduce a novel sparsification strategy designed to reduce token costs within MAD. This approach minimizes ineffective exchanges of information and unproductive discussions among agents, thereby enhancing the overall efficiency of the debate process. We conduct comparative experiments on multiple datasets across various models, demonstrating that our approach significantly reduces the token costs in MAD to a considerable extent. Specifically, compared to MAD, our approach achieves an impressive reduction of up to 94.5% in token costs while maintaining performance degradation below 2.0%.

PaperID: 11715, https://arxiv.org/pdf/2502.04780.pdf

Authors: Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, James Zou

Title: SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning

Abstract:
Multi‑agent AI systems powered by large language models (LLMs) are increasingly applied to solve complex tasks. However, these systems often rely on fragile, manually designed prompts and heuristics, making optimization difficult. A key challenge in optimizing multi‑agent systems is acquiring suitable training data for specialized agents. We introduce SiriuS, a self‑improving, reasoning‑driven optimization framework for multi‑agent systems. Central to our approach is the construction of an experience library: a repository of high‑quality reasoning trajectories. The library is built by retaining reasoning steps that lead to successful outcomes, providing a robust training set for optimizing multi‑agent system. Additionally, we introduce a library augmentation procedure that refines unsuccessful trajectories, further enriching the library. SiriuS boosts performance by 2.86% to 21.88% on reasoning and biomedical QA and enhances agent negotiation in competitive settings. Our results show that SiriuS enhances multi‑agent performance while generating reusable data for self‑correction and self‑play enhancement in the future.

PaperID: 11716, https://arxiv.org/pdf/2502.04773.pdf

Authors: George Papadopoulos, Andreas Kontogiannis, Foteini Papadopoulou, Chaido Poulianou, Ioannis Koumentis, George Vouros

Title: An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has recently emerged as a significant area of research. However, MARL evaluation often lacks systematic diversity, hindering a comprehensive understanding of algorithms' capabilities. In particular, cooperative MARL algorithms are predominantly evaluated on benchmarks such as SMAC and GRF, which primarily feature team game scenarios without assessing adequately various aspects of agents' capabilities required in fully cooperative real‑world tasks such as multi‑robot cooperation and warehouse, resource management, search and rescue, and human‑AI cooperation. Moreover, MARL algorithms are mainly evaluated on low dimensional state spaces, and thus their performance on high‑dimensional (e.g., image) observations is not well‑studied. To fill this gap, this paper highlights the crucial need for expanding systematic evaluation across a wider array of existing benchmarks. To this end, we conduct extensive evaluation and comparisons of well‑known MARL algorithms on complex fully cooperative benchmarks, including tasks with images as agents' observations. Interestingly, our analysis shows that many algorithms, hailed as state‑of‑the‑art on SMAC and GRF, may underperform standard MARL baselines on fully cooperative benchmarks. Finally, towards more systematic and better evaluation of cooperative MARL algorithms, we have open‑sourced PyMARLzoo+, an extension of the widely used (E)PyMARL libraries, which addresses an open challenge from [TBG++21], facilitating seamless integration and support with all benchmarks of PettingZoo, as well as Overcooked, PressurePlate, Capture Target and Box Pushing.

PaperID: 11717, https://arxiv.org/pdf/2502.04697.pdf

Authors: Xun Feng, Chao Zhai

Title: Multi-Agent Coverage Control in Non-Convex Annulus Region with Conformal Mapping

Abstract:
Efficiently fulfilling coverage tasks in non‑convex regions has long been a significant challenge for multi‑agent systems (MASs). By leveraging conformal mapping, this paper introduces a novel sectorial coverage formulation to transform a non‑convex annulus region into a topologically equivalent one. This approach enables the deployment of MASs in a non‑star‑shaped region while optimizing coverage performance and achieving load balance among sub‑regions. It provides a unique perspective on the partitioned sub‑regions to highlight the geodesic convex property of the non‑star‑shaped region. By utilizing the sectorial partition mechanism and the diffeomorphism property of conformal mapping, a decentralized control law is designed to drive MASs towards a desired configuration, which not only optimizes the global coverage cost but also ensures exponential convergence of equitable workload. Moreover, an iterative search algorithm is developed to identify the optimal approximation of multi‑agent deployment in the non‑star‑shaped region. Theoretical analysis is conducted to confirm the asymptotic stability and global convergence with arbitrary small tolerance of the closed‑loop system. Finally, numerical simulations demonstrate the practicality of the proposed coverage formulation with conformal mapping.

PaperID: 11718, https://arxiv.org/pdf/2502.04686.pdf

Authors: Zelai Xu, Wanjun Gu, Chao Yu, Yi Wu, Yu Wang

Title: Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization

Abstract:
Large language model (LLM) agents have recently demonstrated impressive capabilities in various domains like open‑ended conversation and multi‑step decision‑making. However, it remains challenging for these agents to solve strategic language games, such as Werewolf, which demand both strategic decision‑making and free‑form language interactions. Existing LLM agents often suffer from intrinsic bias in their action distributions and limited exploration of the unbounded text action space, resulting in suboptimal performance. To address these challenges, we propose Latent Space Policy Optimization (LSPO), an iterative framework that combines game‑theoretic methods with LLM fine‑tuning to build strategic language agents. LSPO leverages the observation that while the language space is combinatorially large, the underlying strategy space is relatively compact. We first map free‑form utterances into a finite latent strategy space, yielding an abstracted extensive‑form game. Then we apply game‑theoretic methods like Counterfactual Regret Minimization (CFR) to optimize the policy in the latent space. Finally, we fine‑tune the LLM via Direct Preference Optimization (DPO) to align with the learned policy. By iteratively alternating between these steps, our LSPO agents progressively enhance both strategic reasoning and language communication. Experiment on the Werewolf game shows that our agents iteratively expand the strategy space with improving performance and outperform existing Werewolf agents, underscoring their effectiveness in free‑form language games with strategic interactions.

PaperID: 11719, https://arxiv.org/pdf/2502.04576.pdf

Authors: So Yeon Min, Yue Wu, Jimin Sun, Max Kaufmann, Fahim Tajwar, Yonatan Bisk, Ruslan Salakhutdinov

Title: Self-Regulation and Requesting Interventions

Abstract:
Human intelligence involves metacognitive abilities like self‑regulation, recognizing limitations, and seeking assistance only when needed. While LLM Agents excel in many domains, they often lack this awareness. Overconfident agents risk catastrophic failures, while those that seek help excessively hinder efficiency. A key challenge is enabling agents with a limited intervention budget C is to decide when to request assistance. In this paper, we propose an offline framework that trains a "helper" policy to request interventions, such as more powerful models or test‑time compute, by combining LLM‑based process reward models (PRMs) with tabular reinforcement learning. Using state transitions collected offline, we score optimal intervention timing with PRMs and train the helper model on these labeled trajectories. This offline approach significantly reduces costly intervention calls during training. Furthermore, the integration of PRMs with tabular RL enhances robustness to off‑policy data while avoiding the inefficiencies of deep RL. We empirically find that our method delivers optimal helper behavior.

PaperID: 11720, https://arxiv.org/pdf/2502.04497.pdf

Authors: Yi Zhang, Mohamadamin Rajabinezhad, Shan Zuo

Title: Distributed Resilient Asymmetric Bipartite Consensus: A Data-Driven Event-Triggered Mechanism

Abstract:
The problem of asymmetric bipartite consensus control is investigated within the context of nonlinear, discrete‑time, networked multi‑agent systems (MAS) subject to aperiodic denial‑of‑service (DoS) attacks. To address the challenges posed by these aperiodic DoS attacks, a data‑driven event‑triggered (DDET) mechanism has been developed. This mechanism is specifically designed to synchronize the states of the follower agents with the leader's state, even in the face of aperiodic communication disruptions and data losses. Given the constraints of unavailable agents' states and data packet loss during these attacks, the DDET control framework resiliently achieves leader‑follower consensus. The effectiveness of the proposed framework is validated through two numerical examples, which showcase its ability to adeptly handle the complexities arising from aperiodic DoS attacks in nonlinear MAS settings.

PaperID: 11721, https://arxiv.org/pdf/2502.04485.pdf

Authors: Katarzyna Kobalczyk, Nicolas Astorga, Tennison Liu, Mihaela van der Schaar

Title: Active Task Disambiguation with LLMs

Abstract:
Despite the impressive performance of large language models (LLMs) across various benchmarks, their ability to address ambiguously specified problems‑‑frequent in real‑world interactions‑‑remains underexplored. To address this gap, we introduce a formal definition of task ambiguity and frame the problem of task disambiguation through the lens of Bayesian Experimental Design. By posing clarifying questions, LLM agents can acquire additional task specifications, progressively narrowing the space of viable solutions and reducing the risk of generating unsatisfactory outputs. Yet, generating effective clarifying questions requires LLM agents to engage in a form of meta‑cognitive reasoning, an ability LLMs may presently lack. Our proposed approach of active task disambiguation enables LLM agents to generate targeted questions maximizing the information gain. Effectively, this approach shifts the load from implicit to explicit reasoning about the space of viable solutions. Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions.

PaperID: 11722, https://arxiv.org/pdf/2502.04399.pdf

Authors: Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang

Title: Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning

Abstract:
Advances in artificial intelligence (AI) including foundation models (FMs), are increasingly transforming human society, with smart city driving the evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor‑equipped capabilities. In particular, ride‑hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge‑assisted vehicles perform joint tasks of order serving and the emerging foundation model fine‑tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio‑temporal characteristics: (i) The distributions of ride orders and data point‑of‑interests (PoIs) may not coincide in geography, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine‑tuning.To overcome these obstacles, we propose an online framework based on multi‑agent reinforcement learning (MARL) with careful augmentation. A new quality‑of‑service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph‑structured, time‑varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real‑world foundation model fine‑tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.

PaperID: 11723, https://arxiv.org/pdf/2502.04392.pdf

Authors: Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu

Title: Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

Abstract:
The rapid expansion of web content has made on‑device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability in large language models offer a promising path for next‑generation on‑device AI agents. However, deploying full‑scale Large Language Models (LLMs) on resource‑limited local devices is challenging. In this paper, we propose Division‑of‑Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller‑scale Language Models (SLMs) and cloud‑based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub‑tasks, which allows hybrid language models to fully exploit their respective strengths. Besides, DoT employs a Task Scheduler to analyze the pair‑wise dependency of sub‑tasks and create a dependency graph, facilitating parallel reasoning of sub‑tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub‑tasks, DoT leverages a Plug‑and‑Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM's parameters. To boost adapter's task allocation capability, we propose a self‑reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that our DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, while achieving comparable reasoning accuracy with the best baseline methods.

PaperID: 11724, https://arxiv.org/pdf/2502.04388.pdf

Authors: Hepeng Li, Yuhong Liu, Jun Yan, Jie Gao, Xiaoou Yang

Title: Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent Paradigms

Abstract:
Artificial Intelligence (AI) agents capable of autonomous learning and independent decision‑making hold great promise for addressing complex challenges across various critical infrastructure domains, including transportation, energy systems, and manufacturing. However, the surge in the design and deployment of AI systems, driven by various stakeholders with distinct and unaligned objectives, introduces a crucial challenge: How can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos or compromising safety? To address this, we advocate for a fundamental rethinking of existing multi‑agent frameworks, such as multi‑agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to adjust their objectives dynamically, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through two case studies in critical infrastructure applications, we call for a shift toward the emergent, self‑organizing, and context‑aware nature of these multi‑agentic AI systems.

PaperID: 11725, https://arxiv.org/pdf/2502.04358.pdf

Authors: Elliot Meyerson, Xin Qiu

Title: Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives

Abstract:
Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With large language models (LLMs) crossing critical reliability thresholds for a growing slate of capabilities, there is an increasing effort to decompose systems into sets of LLM‑based agents, each of whom can be delegated sub‑tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such decomposed systems, and that insights from such analysis will unlock opportunities for scaling them. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.

PaperID: 11726, https://arxiv.org/pdf/2502.04345.pdf

Authors: Yehan Yang, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan

Title: Jingfang: An LLM-Based Multi-Agent System for Precise Medical Consultation and Syndrome Differentiation in Traditional Chinese Medicine

Abstract:
The practice of Traditional Chinese Medicine (TCM) requires profound expertise and extensive clinical experience. While Large Language Models (LLMs) offer significant potential in this domain, current TCM‑oriented LLMs suffer two critical limitations: (1) a rigid consultation framework that fails to conduct comprehensive and patient‑tailored interactions, often resulting in diagnostic inaccuracies; and (2) treatment recommendations generated without rigorous syndrome differentiation, which deviates from the core diagnostic and therapeutic principles of TCM. To address these issues, we develop JingFang (JF), an advanced LLM‑based multi‑agent system for TCM that facilitates the implementation of AI‑assisted TCM diagnosis and treatment. JF integrates various TCM Specialist Agents in accordance with authentic diagnostic and therapeutic scenarios of TCM, enabling personalized medical consultations, accurate syndrome differentiation and treatment recommendations. A Multi‑Agent Collaborative Consultation Mechanism (MACCM) for TCM is constructed, where multiple Agents collaborate to emulate real‑world TCM diagnostic workflows, enhancing the diagnostic ability of base LLMs to provide accurate and patient‑tailored medical consultation. Moreover, we introduce a dedicated Syndrome Differentiation Agent fine‑tuned on a preprocessed dataset, along with a designed Dual‑Stage Recovery Scheme (DSRS) within the Treatment Agent, which together substantially improve the model's accuracy of syndrome differentiation and treatment. Comprehensive evaluations and experiments demonstrate JF's superior performance in medical consultation, and also show improvements of at least 124% and 21.1% in the precision of syndrome differentiation compared to existing TCM models and State of the Art (SOTA) LLMs, respectively.

PaperID: 11727, https://arxiv.org/pdf/2502.04281.pdf

Authors: Ashwin Kumar, William Yeoh

Title: DECAF: Learning to be Fair in Multi-agent Resource Allocation

Abstract:
A wide variety of resource allocation problems operate under resource constraints that are managed by a central arbitrator, with agents who evaluate and communicate preferences over these resources. We formulate this broad class of problems as Distributed Evaluation, Centralized Allocation (DECA) problems and propose methods to learn fair and efficient policies in centralized resource allocation. Our methods are applied to learning long‑term fairness in a novel and general framework for fairness in multi‑agent systems. We show three different methods based on Double Deep Q‑Learning: (1) A joint weighted optimization of fairness and utility, (2) a split optimization, learning two separate Q‑estimators for utility and fairness, and (3) an online policy perturbation to guide existing black‑box utility functions toward fair solutions. Our methods outperform existing fair MARL approaches on multiple resource allocation domains, even when evaluated using diverse fairness functions, and allow for flexible online trade‑offs between utility and fairness.

PaperID: 11728, https://arxiv.org/pdf/2502.04259.pdf

Authors: Rommel Salas-Guerra

Title: Cognitive AI framework 2.0: advances in the simulation of human thought

Abstract:
The Human Cognitive Simulation Framework proposes a governed cognitive AI architecture designed to improve personalization, adaptability, and long‑term coherence in human AI interaction. The framework integrates short‑term memory (conversation context), long‑term memory (interaction context), cognitive processing modules, and managed knowledge persistence into a unified architectural model that ensures contextual continuity across sessions and controlled accumulation of relevant information. A central contribution is a unified memory architecture supervised by explicit governance mechanisms, including algorithmic relevance validation, selective persistence, and auditability. The framework incorporates differentiated processing modules for logical, creative, and analogical reasoning, enabling both structured task execution and complex contextual inference. Through dynamic and selective knowledge updating, the system augments the capabilities of large language models without modifying their internal parameters, relying instead on retrieval augmented generation and governed external memory. The proposed architecture addresses key challenges related to scalability, bias mitigation, and ethical compliance by embedding operational safeguards directly into the cognitive loop. These mechanisms establish a foundation for future work on continuous learning, sustainability, and multimodal cognitive interaction. This manuscript is a substantially revised and extended version of the previously released preprint (DOI:10.48550/arXiv.2502.04259).

PaperID: 11729, https://arxiv.org/pdf/2502.04249.pdf

Authors: Michael Walters, Rafael Kaufmann, Justice Sefas, Thomas Kopinski

Title: Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study

Abstract:
We investigate the Free Energy Principle as a foundation for measuring risk in agentic and multi‑agent systems. From these principles we introduce a Cumulative Risk Exposure metric that is flexible to differing contexts and needs. We contrast this to other popular theories for safe AI that hinge on massive amounts of data or describing arbitrarily complex world models. In our framework, stakeholders need only specify their preferences over system outcomes, providing straightforward and transparent decision rules for risk governance and mitigation. This framework naturally accounts for uncertainty in both world model and preference model, allowing for decision‑making that is epistemically and axiologically humble, parsimonious, and future‑proof. We demonstrate this novel approach in a simplified autonomous vehicle environment with multi‑agent vehicles whose driving policies are mediated by gatekeepers that evaluate, in an online fashion, the risk to the collective safety in their neighborhood, and intervene through each vehicle's policy when appropriate. We show that the introduction of gatekeepers in an AV fleet, even at low penetration, can generate significant positive externalities in terms of increased system safety.

PaperID: 11730, https://arxiv.org/pdf/2502.04180.pdf

Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang

Title: Multi-agent Architecture Search via Agentic Supernet

Abstract:
Large Language Model (LLM)‑empowered multi‑agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor‑intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one‑size‑fits‑all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the agentic supernet, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query‑dependent agentic systems from the supernet, delivering high‑quality solutions and tailored resource allocation (e.g., LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6～45% of the inference costs of existing handcrafted or automated multi‑agent systems, (II) surpasses them by 0.54%～11.82%, and (III) enjoys superior cross‑dataset and cross‑LLM‑backbone transferability.

PaperID: 11731, https://arxiv.org/pdf/2502.04038.pdf

Authors: Yuchen Lian, Arianna Bisazza, Tessa Verhoef

Title: Simulating the Emergence of Differential Case Marking with Communicating Neural-Network Agents

Abstract:
Differential Case Marking (DCM) refers to the phenomenon where grammatical case marking is applied selectively based on semantic, pragmatic, or other factors. The emergence of DCM has been studied in artificial language learning experiments with human participants, which were specifically aimed at disentangling the effects of learning from those of communication (Smith & Culbertson, 2020). Multi‑agent reinforcement learning frameworks based on neural networks have gained significant interest to simulate the emergence of human‑like linguistic phenomena. In this study, we employ such a framework in which agents first acquire an artificial language before engaging in communicative interactions, enabling direct comparisons to human result. Using a very generic communication optimization algorithm and neural‑network learners that have no prior experience with language or semantic preferences, our results demonstrate that learning alone does not lead to DCM, but when agents communicate, differential use of markers arises. This supports Smith and Culbertson (2020)'s findings that highlight the critical role of communication in shaping DCM and showcases the potential of neural‑agent models to complement experimental research on language evolution.

PaperID: 11732, https://arxiv.org/pdf/2502.03953.pdf

Authors: Gabriele La Malfa, Jie M. Zhang, Michael Luck, Elizabeth Black

Title: Fairness Aware Reinforcement Learning via Proximal Policy Optimization

Abstract:
Fairness in multi‑agent systems (MAS) focuses on equitable reward distribution among agents in scenarios involving sensitive attributes such as race, gender, or socioeconomic status. This paper introduces fairness in Proximal Policy Optimization (PPO) with a penalty term derived from a fairness definition such as demographic parity, counterfactual fairness, or conditional statistical parity. The proposed method, which we call Fair‑PPO, balances reward maximisation with fairness by integrating two penalty components: a retrospective component that minimises disparities in past outcomes and a prospective component that ensures fairness in future decision‑making. We evaluate our approach in two games: the Allelopathic Harvest, a cooperative and competitive MAS focused on resource collection, where some agents possess a sensitive attribute, and HospitalSim, a hospital simulation, in which agents coordinate the operations of hospital patients with different mobility and priority needs. Experiments show that Fair‑PPO achieves fairer policies than PPO across the fairness metrics and, through the retrospective and prospective penalty components, reveals a wide spectrum of strategies to improve fairness; at the same time, its performance pairs with that of state‑of‑the‑art fair reinforcement‑learning algorithms. Fairness comes at the cost of reduced efficiency, but does not compromise equality among the overall population (Gini index). These findings underscore the potential of Fair‑PPO to address fairness challenges in MAS.

PaperID: 11733, https://arxiv.org/pdf/2502.03948.pdf

Authors: Devansh Srivastav, Hasan Md Tusfiqur Alam, Afsaneh Asaei, Mahmoud Fazeli, Tanisha Sharma, Daniel Sonntag

Title: Enhancing Online Learning Efficiency Through Heterogeneous Resource Integration with a Multi-Agent RAG System

Abstract:
Efficient online learning requires seamless access to diverse resources such as videos, code repositories, documentation, and general web content. This poster paper introduces early‑stage work on a Multi‑Agent Retrieval‑Augmented Generation (RAG) System designed to enhance learning efficiency by integrating these heterogeneous resources. Using specialized agents tailored for specific resource types (e.g., YouTube tutorials, GitHub repositories, documentation websites, and search engines), the system automates the retrieval and synthesis of relevant information. By streamlining the process of finding and combining knowledge, this approach reduces manual effort and enhances the learning experience. A preliminary user study confirmed the system's strong usability and moderate‑high utility, demonstrating its potential to improve the efficiency of knowledge acquisition.

PaperID: 11734, https://arxiv.org/pdf/2502.03902.pdf

Authors: Efstratios Stratoglou, Alexandre Anahory Simoes, Anthony Bloch, Leonardo Colombo

Title: Geometric Stabilization of Virtual Nonlinear Nonholonomic Constraints

Abstract:
In this paper, we address the problem of stabilizing a system around a desired manifold determined by virtual nonlinear nonholonomic constraints. Virtual constraints are relationships imposed on a control system that are rendered invariant through feedback control. Virtual nonholonomic constraints represent a specific class of virtual constraints that depend on the system's velocities in addition to its configurations. We derive a control law under which a mechanical control system achieves exponential convergence to the virtual constraint submanifold, and rendering it control‑invariant. The proposed controller's performance is validated through simulation results in two distinct applications: flocking motion in multi‑agent systems and the control of an unmanned surface vehicle (USV) navigating a stream.

PaperID: 11735, https://arxiv.org/pdf/2502.03874.pdf

Authors: Nuriya Nurgalieva, V. Vilasini

Title: Any theory that admits a Wigner's Friend type multi-agent paradox is logically contextual

Abstract:
Wigner's Friend scenarios push the boundaries of quantum theory by modeling agents, along with their memories storing measurement outcomes, as physical quantum systems. Extending these ideas beyond quantum theory, we ask: in which physical theories, and under what assumptions, can agents who are reasoning logically about each other's measurement outcomes encounter apparent paradoxes? To address this, we prove a link between Wigner's Friend type multi‑agent paradoxes and contextuality in general theories: if agents who are modeled within a physical theory come to a contradiction when reasoning using that theory (under certain assumptions on how they reason and describe measurements), then the theory must admit contextual correlations of a logical form. This also yields a link between the distinct fundamental concepts of Heisenberg cuts and measurement contexts in general theories, and in particular, implies that the quantum Frauchiger‑Renner paradox is a proof of logical contextuality. Moreover, we identify structural properties of such paradoxes in general theories and specific to quantum theory. For instance, we demonstrate that theories admitting behaviors corresponding to extremal vertices of n‑cycle contextuality scenarios admit Wigner's Friend type paradoxes without post‑selection, and that any quantum Wigner's Friend paradox based on the n‑cycle scenario must necessarily involve post‑selection. Further, we construct a multi‑agent paradox based on a genuine contextuality scenario involving sequential measurements on a single system, showing that Bell non‑local correlations between distinct subsystems are not necessary for Wigner's Friend paradoxes. Our work offers an approach to investigate the structure of physical theories and their information‑theoretic resources by means of deconstructing the assumptions underlying multi‑agent physical paradoxes.

PaperID: 11736, https://arxiv.org/pdf/2502.03845.pdf

Authors: Zhuohui Zhang, Bin Cheng, Zhipeng Wang, Yanmin Zhou, Gang Li, Ping Lu, Bin He, Jie Chen

Title: PAGNet: Pluggable Adaptive Generative Networks for Information Completion in Multi-Agent Communication

Abstract:
For partially observable cooperative tasks, multi‑agent systems must develop effective communication and understand the interplay among agents in order to achieve cooperative goals. However, existing multi‑agent reinforcement learning (MARL) with communication methods lack evaluation metrics for information weights and information‑level communication modeling. This causes agents to neglect the aggregation of multiple messages, thereby significantly reducing policy learning efficiency. In this paper, we propose pluggable adaptive generative networks (PAGNet), a novel framework that integrates generative models into MARL to enhance communication and decision‑making. PAGNet enables agents to synthesize global states representations from weighted local observations and use these representations alongside learned communication weights for coordinated decision‑making. This pluggable approach reduces the computational demands typically associated with the joint training of communication and policy networks. Extensive experimental evaluations across diverse benchmarks and communication scenarios demonstrate the significant performance improvements achieved by PAGNet. Furthermore, we analyze the emergent communication patterns and the quality of generated global states, providing insights into operational mechanisms.

PaperID: 11737, https://arxiv.org/pdf/2502.03821.pdf

Authors: Tao Yang, Yuhua Zhu, Xiaojun Quan, Cong Liu, Qifan Wang

Title: PsyPlay: Personality-Infused Role-Playing Conversational Agents

Abstract:
The current research on Role‑Playing Conversational Agents (RPCAs) with Large Language Models (LLMs) primarily focuses on imitating specific speaking styles and utilizing character backgrounds, neglecting the depiction of deeper personality traits.~In this study, we introduce personality‑infused role‑playing for LLM agents, which encourages agents to accurately portray their designated personality traits during dialogues. We then propose PsyPlay, a dialogue generation framework that facilitates the expression of rich personalities among multiple LLM agents. Specifically, PsyPlay enables agents to assume roles with distinct personality traits and engage in discussions centered around specific topics, consistently exhibiting their designated personality traits throughout the interactions. Validation on generated dialogue data demonstrates that PsyPlay can accurately portray the intended personality traits, achieving an overall success rate of 80.31% on GPT‑3.5. Notably, we observe that LLMs aligned with positive values are more successful in portraying positive personality roles compared to negative ones. Moreover, we construct a dialogue corpus for personality‑infused role‑playing, called PsyPlay‑Bench. The corpus, which consists of 4745 instances of correctly portrayed dialogues using PsyPlay, aims to further facilitate research in personalized role‑playing and dialogue personality detection.

PaperID: 11738, https://arxiv.org/pdf/2502.03814.pdf

Authors: Peihan Li, Zijian An, Shams Abrar, Lifeng Zhou

Title: Large Language Models for Multi-Robot Systems: A Survey

Abstract:
The rapid advancement of Large Language Models (LLMs) has opened new possibilities in Multi‑Robot Systems (MRS), enabling enhanced communication, task allocation and planning, and human‑robot interaction. Unlike traditional single‑robot and multi‑agent systems, MRS poses unique challenges, including coordination, scalability, and real‑world adaptability. This survey provides the first dedicated review of LLM integration into MRS. It systematically categorizes their applications across high‑level task allocation, mid‑level motion planning, low‑level action generation, and human intervention. We highlight key applications in diverse domains, such as household robotics, construction, formation control, target tracking, and robot games, showcasing the versatility and transformative potential of LLMs in MRS. Furthermore, we examine the challenges that limit adapting LLMs to MRS, including mathematical reasoning limitations, hallucination, latency issues, and the need for robust benchmarking systems. Finally, we outline opportunities for future research, emphasizing advancements in fine‑tuning, reasoning techniques, and task‑specific models. This survey aims to guide researchers in the intelligence and real‑world deployment of MRS powered by LLMs. Given the rapidly evolving nature of research in the field, we continuously update the paper list in the open‑source GitHub repository.

PaperID: 11739, https://arxiv.org/pdf/2502.03788.pdf

Authors: Zijian Ding, Qinshi Zhang, Mohan Chi, Ziyi Wang

Title: Frontend Diffusion: Empowering Self-Representation of Junior Researchers and Designers Through Multi-agent System

Abstract:
With the continuous development of generative AI's logical reasoning abilities, AI's growing code‑generation potential poses challenges for both technical and creative professionals. But how can these advances be directed toward empowering junior researchers and designers who often require additional help to build and express their professional and personal identities? We introduce Frontend Diffusion, a multi‑agent coding system transforming user‑drawn layouts and textual prompts into refined website code, thereby supporting self‑representation goals. A user study with 13 junior researchers and designers shows AI as a human capability enhancer rather than a replacement, and highlights the importance of bidirectional human‑AI alignment. We then discuss future work such as leveraging AI for career development and fostering bidirectional human‑AI alignment of multi‑agent systems.

PaperID: 11740, https://arxiv.org/pdf/2502.03723.pdf

Authors: Muhan Lin, Shuyang Shi, Yue Guo, Vaishnav Tadiparthi, Behdad Chalaki, Ehsan Moradi Pari, Simon Stepputtis, Woojun Kim, Joseph Campbell, Katia Sycara

Title: Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning

Abstract:
Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi‑agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly‑used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward functions that align with human intuition can be complex and labor‑intensive. In this work, we propose a novel framework where a large language model (LLM) generates dense, agent‑specific rewards based on a natural language description of the task and the overall team goal. By learning a potential‑based reward function over multiple queries, our method reduces the impact of ranking errors while allowing the LLM to evaluate each agent's contribution to the overall task. Through extensive experiments, we demonstrate that our approach achieves faster convergence and higher policy returns compared to state‑of‑the‑art MARL baselines.

PaperID: 11741, https://arxiv.org/pdf/2502.03711.pdf

Authors: Nicole Cho, William Watson

Title: MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

Abstract:
One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM‑generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt‑3.5‑turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence, consistency, and the quantification of hallucinations.

PaperID: 11742, https://arxiv.org/pdf/2502.03671.pdf

Authors: Avinash Patil, Aryan Jadon

Title: Advancing Reasoning in Large Language Models: Promising Methods and Approaches

Abstract:
Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning‑spanning logical deduction, mathematical problem‑solving, commonsense inference, and multi‑step reasoning‑often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain‑of‑Thought reasoning, Self‑Consistency, and Tree‑of‑Thought reasoning), architectural innovations (e.g., retrieval‑augmented models, modular reasoning networks, and neuro‑symbolic integration), and learning paradigms (e.g., fine‑tuning with reasoning‑specific datasets, reinforcement learning, and self‑supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning‑augmented LLMs.

PaperID: 11743, https://arxiv.org/pdf/2502.03640.pdf

Authors: Songyuan Zhang, Oswin So, Mitchell Black, Chuchu Fan

Title: Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control

Abstract:
Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi‑agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF‑based policies for MAS that can tackle unknown discrete‑time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high‑performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF which handles neighborhood changes and input constraints, and a distributed high‑performance safe policy for MAS with unknown discrete‑time dynamics. We empirically validate our claims on a suite of multi‑agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints), and high safety rates (matching the most conservative baselines), with a constant set of hyperparameters across all environments.

PaperID: 11744, https://arxiv.org/pdf/2502.03616.pdf

Authors: Jaehan Im, Filippos Fotiadis, Daniel Delahaye, Ufuk Topcu, David Fridovich-Keil

Title: Noncooperative Consensus via a Trading-based Auction

Abstract:
Noncooperative multi‑agent systems often face coordination challenges due to conflicting preferences among agents. In particular, when agents act in their own self‑interest, they may prefer different choices among multiple feasible outcomes, leading to suboptimal outcomes or even safety concerns. We propose an algorithm named trading auction for consensus (TACo), a decentralized approach that enables noncooperative agents to reach consensus without communicating directly or disclosing private valuations. TACo facilitates coordination through a structured trading‑based auction, where agents iteratively select choices of interest and provably reach an agreement within an a priori bounded number of steps. A series of numerical experiments validate that the termination guarantees of TACo hold in practice, and show that TACo achieves a median performance that minimizes the total cost across all agents, while allocating resources significantly more fairly than baseline approaches.

PaperID: 11745, https://arxiv.org/pdf/2502.03450.pdf

Authors: Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell

Title: Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System

Abstract:
Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG^2, an iterative Schema‑Guided Scene‑Graph reasoning framework based on multi‑agent LLMs. The agents are grouped into two modules: a (1) Reasoner module for abstract task planning and graph information queries generation, and a (2) Retriever module for extracting corresponding graph information based on code‑writing following the queries. Two modules collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. The scene graph schema, prompted to both modules, serves to not only streamline both reasoning and retrieval process, but also guide the cooperation between two modules. This eliminates the need to prompt LLMs with full graph data, reducing the chance of hallucination due to irrelevant information. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM‑based approaches and baseline single‑agent, tool‑based Reason‑while‑Retrieve strategy in numerical Q\&A and planning tasks.

PaperID: 11746, https://arxiv.org/pdf/2502.03377.pdf

Authors: Abdullahi Isa Ahmed, Jamal Bentahar, El Mehdi Amhoud

Title: Energy-Efficient UAV-assisted LoRa Gateways: A Multi-Agent Optimization Approach

Abstract:
As next‑generation Internet of Things (NG‑IoT) networks continue to grow, the number of connected devices is rapidly increasing, along with their energy demands, creating challenges for resource management and sustainability. Energy‑efficient communication, particularly for power‑limited IoT devices, is therefore a key research focus. In this paper, we study Long Range (LoRa) networks supported by multiple unmanned aerial vehicles (UAVs) in an uplink data collection scenario. Our objective is to maximize system energy efficiency by jointly optimizing transmission power, spreading factor, bandwidth, and user association. To address this challenging problem, we first model it as a partially observable stochastic game (POSG) to account for dynamic channel conditions, end device mobility, and partial observability at each UAV. We then propose a two‑stage solution: a channel‑aware matching algorithm for end device‑UAV association and a cooperative multi‑agent reinforcement learning (MARL) based multi‑agent proximal policy optimization (MAPPO) framework for resource allocation under centralized training with decentralized execution (CTDE). Simulation results show that our proposed approach significantly outperforms conventional off‑policy and on‑policy MARL algorithms.

PaperID: 11747, https://arxiv.org/pdf/2502.03356.pdf

Authors: Max Muchen Sun, Pete Trautman, Todd Murphey

Title: Inverse Mixed Strategy Games with Generative Trajectory Models

Abstract:
Game‑theoretic models are effective tools for modeling multi‑agent interactions, especially when robots need to coordinate with humans. However, applying these models requires inferring their specifications from observed behaviors ‑‑ a challenging task known as the inverse game problem. Existing inverse game approaches often struggle to account for behavioral uncertainty and measurement noise, and leverage both offline and online data. To address these limitations, we propose an inverse game method that integrates a generative trajectory model into a differentiable mixed‑strategy game framework. By representing the mixed strategy with a conditional variational autoencoder (CVAE), our method can infer high‑dimensional, multi‑modal behavior distributions from noisy measurements while adapting in real‑time to new observations. We extensively evaluate our method in a simulated navigation benchmark, where the observations are generated by an unknown game model. Despite the model mismatch, our method can infer Nash‑optimal actions comparable to those of the ground‑truth model and the oracle inverse game baseline, even in the presence of uncertain agent objectives and noisy measurements.

PaperID: 11748, https://arxiv.org/pdf/2502.03214.pdf

Authors: Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, Elia Bruni

Title: iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs

Abstract:
Vision‑Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. \mboxiVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi‑step reasoning. The benchmark supports visual 3D, 2D, and text‑based input modalities, enabling comprehensive assessments of VLMs' planning and reasoning skills. We evaluate a broad suite of state‑of‑the‑art open‑source and closed‑source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task's complexity and feasibility for humans. Results indicate that while VLMs perform better on 2D tasks compared to 3D or text‑based settings, they struggle with complex spatial configurations and consistently fall short of human performance, illustrating the persistent challenge of visual alignment. This underscores critical gaps in current VLM capabilities, highlighting their limitations in achieving human‑level cognition. Project website: https://microcosm.ai/ivispar

PaperID: 11749, https://arxiv.org/pdf/2502.03125.pdf

Authors: Yang Zhou, Siying Wang, Wenyu Chen, Ruoning Zhang, Zhitong Zhao, Zixuan Zhang

Title: Double Distillation Network for Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning typically employs a centralized training‑decentralized execution (CTDE) framework to alleviate the non‑stationarity in environment. However, the partial observability during execution may lead to cumulative gap errors gathered by agents, impairing the training of effective collaborative policies. To overcome this challenge, we introduce the Double Distillation Network (DDN), which incorporates two distillation modules aimed at enhancing robust coordination and facilitating the collaboration process under constrained information. The external distillation module uses a global guiding network and a local policy network, employing distillation to reconcile the gap between global training and local execution. In addition, the internal distillation module introduces intrinsic rewards, drawn from state information, to enhance the exploration capabilities of agents. Extensive experiments demonstrate that DDN significantly improves performance across multiple scenarios.

PaperID: 11750, https://arxiv.org/pdf/2502.02984.pdf

Authors: Dengyu Zhang, Chenghao, Feng Xue, Qingrui Zhang

Title: Learning Efficient Flocking Control based on Gibbs Random Fields

Abstract:
Flocking control is essential for multi‑robot systems in diverse applications, yet achieving efficient flocking in congested environments poses challenges regarding computation burdens, performance optimality, and motion safety. This paper addresses these challenges through a multi‑agent reinforcement learning (MARL) framework built on Gibbs Random Fields (GRFs). With GRFs, a multi‑robot system is represented by a set of random variables conforming to a joint probability distribution, thus offering a fresh perspective on flocking reward design. A decentralized training and execution mechanism, which enhances the scalability of MARL concerning robot quantity, is realized using a GRF‑based credit assignment method. An action attention module is introduced to implicitly anticipate the motion intentions of neighboring robots, consequently mitigating potential non‑stationarity issues in MARL. The proposed framework enables learning an efficient distributed control policy for multi‑robot systems in challenging environments with success rate around 99%, as demonstrated through thorough comparisons with state‑of‑the‑art solutions in simulations and experiments. Ablation studies are also performed to validate the efficiency of different framework modules.

PaperID: 11751, https://arxiv.org/pdf/2502.02875.pdf

Authors: Siying Wang, Yang Zhou, Zhitong Zhao, Ruoning Zhang, Jinliang Shao, Wenyu Chen, Yuhua Cheng

Title: Heterogeneous Value Decomposition Policy Fusion for Multi-Agent Cooperation

Abstract:
Value decomposition (VD) has become one of the most prominent solutions in cooperative multi‑agent reinforcement learning. Most existing methods generally explore how to factorize the joint value and minimize the discrepancies between agent observations and characteristics of environmental states. However, direct decomposition may result in limited representation or difficulty in optimization. Orthogonal to designing a new factorization scheme, in this paper, we propose Heterogeneous Policy Fusion (HPF) to integrate the strengths of various VD methods. We construct a composite policy set to select policies for interaction adaptively. Specifically, this adaptive mechanism allows agents' trajectories to benefit from diverse policy transitions while incorporating the advantages of each factorization method. Additionally, HPF introduces a constraint between these heterogeneous policies to rectify the misleading update caused by the unexpected exploratory or suboptimal non‑cooperation. Experimental results on cooperative tasks show HPF's superior performance over multiple baselines, proving its effectiveness and ease of implementation.

PaperID: 11752, https://arxiv.org/pdf/2502.02859.pdf

Authors: Haochen Zhang, Zhong Zheng, Lingzhou Xue

Title: Gap-Dependent Bounds for Federated $Q$-learning

Abstract:
We present the first gap‑dependent analysis of regret and communication cost for on‑policy federated Q‑Learning in tabular episodic finite‑horizon Markov decision processes (MDPs). Existing FRL methods focus on worst‑case scenarios, leading to \sqrtT‑type regret bounds and communication cost bounds with a \log T term scaling with the number of agents M, states S, and actions A, where T is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a \log T‑type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap‑dependent regret bound reveals a distinct multi‑agent speedup pattern, and our gap‑dependent communication cost bound removes the dependence on MSA from the \log T term. Notably, our gap‑dependent communication cost bound also yields a better global switching cost when M=1, removing SA from the \log T term.

PaperID: 11753, https://arxiv.org/pdf/2502.02692.pdf

Authors: Amit Ranjan Trivedi, Sina Tayebati, Hemant Kumawat, Nastaran Darabi, Divake Kumar, Adarsh Kumar Kosta, Yeshwanth Venkatesha, Dinithi Jayasuriya, Nethmi Jayasinghe, Priyadarshini Panda, Saibal Mukhopadhyay, Kaushik Roy

Title: Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges

Abstract:
Autonomous edge computing in robotics, smart cities, and autonomous vehicles relies on the seamless integration of sensing, processing, and actuation for real‑time decision‑making in dynamic environments. At its core is the sensing‑to‑action loop, which iteratively aligns sensor inputs with computational models to drive adaptive control strategies. These loops can adapt to hyper‑local conditions, enhancing resource efficiency and responsiveness, but also face challenges such as resource constraints, synchronization delays in multi‑modal data fusion, and the risk of cascading errors in feedback loops. This article explores how proactive, context‑aware sensing‑to‑action and action‑to‑sensing adaptations can enhance efficiency by dynamically adjusting sensing and computation based on task demands, such as sensing a very limited part of the environment and predicting the rest. By guiding sensing through control actions, action‑to‑sensing pathways can improve task relevance and resource use, but they also require robust monitoring to prevent cascading errors and maintain reliability. Multi‑agent sensing‑action loops further extend these capabilities through coordinated sensing and actions across distributed agents, optimizing resource use via collaboration. Additionally, neuromorphic computing, inspired by biological systems, provides an efficient framework for spike‑based, event‑driven processing that conserves energy, reduces latency, and supports hierarchical control‑‑making it ideal for multi‑agent optimization. This article highlights the importance of end‑to‑end co‑design strategies that align algorithmic models with hardware and environmental dynamics and improve cross‑layer interdependencies to improve throughput, precision, and adaptability for energy‑efficient edge autonomy in complex environments.

PaperID: 11754, https://arxiv.org/pdf/2502.02649.pdf

Authors: Margaret Mitchell, Avijit Ghosh, Alexandra Sasha Luccioni, Giada Pistilli

Title: Fully Autonomous AI Agents Should Not be Developed

Abstract:
This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels and detail the ethical values at play in each, documenting trade‑offs in potential benefits and risks. Our analysis reveals that risks to people increase with the autonomy of a system: The more control a user cedes to an AI agent, the more risks to people arise. Particularly concerning are safety risks, which affect human life and impact further values.

PaperID: 11755, https://arxiv.org/pdf/2502.02584.pdf

Authors: Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang

Title: QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Abstract:
Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub‑optimal policies and hinder the overall performance. To address this, we propose QLASS (Q‑guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q‑values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q‑guided generation strategy to enable language agents to better adapt to long‑term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.

PaperID: 11756, https://arxiv.org/pdf/2502.02533.pdf

Authors: Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık

Title: Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

Abstract:
Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi‑agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in‑depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi‑Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block‑level (local) prompt optimization; 2) workflow topology optimization; 3) workflow‑level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS‑optimized multi‑agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS‑found systems, we finally propose design principles behind building effective multi‑agent systems.

PaperID: 11757, https://arxiv.org/pdf/2502.02311.pdf

Authors: Lavanya Ratnabala, Aleksey Fedoseev, Robinroy Peter, Dzmitry Tsetserukou

Title: MAGNNET: Multi-Agent Graph Neural Network-based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning

Abstract:
This paper addresses the challenge of decentralized task allocation within heterogeneous multi‑agent systems operating under communication constraints. We introduce a novel framework that integrates graph neural networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi‑agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to dynamically allocate tasks efficiently without necessitating central coordination in a 3D grid environment. The framework minimizes total travel time while simultaneously avoiding conflicts in task assignments. For the cost calculation and routing, we employ reservation‑based A and R path planners. Experimental results revealed that our method achieves a high 92.5% conflict‑free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on greedy approach. Additionally, the framework exhibits scalability with up to 20 agents with allocation processing of 2.8 s and robustness in responding to dynamically generated tasks, underscoring its potential for real‑world applications in complex multi‑agent scenarios.

PaperID: 11758, https://arxiv.org/pdf/2502.02071.pdf

Authors: Yan Chen, Cheng Liu

Title: Sequential Multi-objective Multi-agent Reinforcement Learning Approach for Predictive Maintenance

Abstract:
Existing predictive maintenance (PdM) methods typically focus solely on whether to replace system components without considering the costs incurred by inspection. However, a well‑considered approach should be able to minimize Remaining Useful Life (RUL) at engine replacement while maximizing inspection interval. To achieve this, multi‑agent reinforcement learning (MARL) can be introduced. However, due to the sequential and mutually constraining nature of these 2 objectives, conventional MARL is not applicable. Therefore, this paper introduces a novel framework and develops a Sequential Multi‑objective Multi‑agent Proximal Policy Optimization (SMOMA‑PPO) algorithm. Furthermore, to provide comprehensive and effective degradation information to RL agents, we also employed Gated Recurrent Unit, quantile regression, and probability distribution fitting to develop a GRU‑based RUL Prediction (GRP) model. Experiments demonstrate that the GRP method significantly improves the accuracy of RUL predictions in the later stages of system operation compared to existing methods. When incorporating its output into SMOMA‑PPO, we achieve at least a 15% reduction in average RUL without unscheduled replacements (UR), nearly a 10% increase in inspection interval, and an overall decrease in maintenance costs. Importantly, our approach offers a new perspective for addressing multi‑objective maintenance planning with sequential constraints, effectively enhancing system reliability and reducing maintenance expenses.

PaperID: 11759, https://arxiv.org/pdf/2502.02066.pdf

Authors: Raghav Arora, Shivam Singh, Karthik Swaminathan, Ahana Datta, Snehasis Banerjee, Brojeshwar Bhowmick, Krishna Murthy Jatavallabhula, Mohan Sridharan, Madhava Krishna

Title: Anticipate & Act : Integrating LLMs and Classical Planning for Efficient Task Execution in Household Environments

Abstract:
Assistive agents performing household tasks such as making the bed or cooking breakfast often compute and execute actions that accomplish one task at a time. However, efficiency can be improved by anticipating upcoming tasks and computing an action sequence that jointly achieves these tasks. State‑of‑the‑art methods for task anticipation use data‑driven deep networks and Large Language Models (LLMs), but they do so at the level of high‑level tasks and/or require many training examples. Our framework leverages the generic knowledge of LLMs through a small number of prompts to perform high‑level task anticipation, using the anticipated tasks as goals in a classical planning system to compute a sequence of finer‑granularity actions that jointly achieve these goals. We ground and evaluate our framework's abilities in realistic scenarios in the VirtualHome environment and demonstrate a 31% reduction in execution time compared with a system that does not consider upcoming tasks.

PaperID: 11760, https://arxiv.org/pdf/2502.02060.pdf

Authors: Saad Alqithami

Title: CH-MARL: Constrained Hierarchical Multiagent Reinforcement Learning for Sustainable Maritime Logistics

Abstract:
Addressing global challenges such as greenhouse gas emissions and resource inequity demands advanced AI‑driven coordination among autonomous agents. We propose CH‑MARL (Constrained Hierarchical Multiagent Reinforcement Learning), a novel framework that integrates hierarchical decision‑making with dynamic constraint enforcement and fairness‑aware reward shaping. CH‑MARL employs a real‑time constraint‑enforcement layer to ensure adherence to global emission caps, while incorporating fairness metrics that promote equitable resource distribution among agents. Experiments conducted in a simulated maritime logistics environment demonstrate considerable reductions in emissions, along with improvements in fairness and operational efficiency. Beyond this domain‑specific success, CH‑MARL provides a scalable, generalizable solution to multi‑agent coordination challenges in constrained, dynamic settings, thus advancing the state of the art in reinforcement learning.

PaperID: 11761, https://arxiv.org/pdf/2502.02046.pdf

Authors: Frederick Dillon, Gregor Halvorsen, Simon Tattershall, Magnus Rowntree, Gareth Vanderpool

Title: Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction

Abstract:
Memory retention challenges in deep neural architectures have ongoing limitations in the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in coherence and factual consistency across longer outputs. A structured approach is introduced to mitigate this issue through the reweaving of latent states captured at different processing layers, reinforcing token representations over extended sequences. The proposed Contextual Memory Reweaving framework incorporates a Layered Latent State Reconstruction mechanism to systematically integrate past contextual embeddings without introducing external memory modules. Experimental results demonstrate improvements in recall accuracy across a range of sequence lengths, with notable gains in the retention of rarely occurring tokens and numerical reasoning consistency. Further analysis of computational efficiency indicates that the additional processing overhead remains within acceptable thresholds, enabling scalability across different model sizes. Evaluations in long‑form text generation and ambiguous query resolution highlight the capacity of memory reweaving to enhance continuity and reduce inconsistencies over extended outputs. Attention weight distributions reveal more structured allocation patterns, suggesting that reweaved latent states contribute to improved contextual awareness. The findings establish a framework for refining memory retention mechanisms in language models, addressing long‑standing challenges in handling complex, multi‑step reasoning tasks.

PaperID: 11762, https://arxiv.org/pdf/2502.01971.pdf

Authors: Tianyu Ren, Xuan Yao, Yang Li, Xiao-Jun Zeng

Title: Bottom-Up Reputation Promotes Cooperation with Multi-Agent Reinforcement Learning

Abstract:
Reputation serves as a powerful mechanism for promoting cooperation in multi‑agent systems, as agents are more inclined to cooperate with those of good social standing. While existing multi‑agent reinforcement learning methods typically rely on predefined social norms to assign reputations, the question of how a population reaches a consensus on judgement when agents hold private, independent views remains unresolved. In this paper, we propose a novel bottom‑up reputation learning method, Learning with Reputation Reward (LR2), designed to promote cooperative behaviour through rewards shaping based on assigned reputation. Our agent architecture includes a dilemma policy that determines cooperation by considering the impact on neighbours, and an evaluation policy that assigns reputations to affect the actions of neighbours while optimizing self‑objectives. It operates using local observations and interaction‑based rewards, without relying on centralized modules or predefined norms. Our findings demonstrate the effectiveness and adaptability of LR2 across various spatial social dilemma scenarios. Interestingly, we find that LR2 stabilizes and enhances cooperation not only with reward reshaping from bottom‑up reputation but also by fostering strategy clustering in structured populations, thereby creating environments conducive to sustained cooperation.

PaperID: 11763, https://arxiv.org/pdf/2502.01932.pdf

Authors: Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi, Shilong Ji, Chuqi Wang, Wenhao Tang, Feng Gao, Wenbo Ding, Xinlei Chen, Yu Wang

Title: VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Abstract:
Robot sports, characterized by well‑defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn‑based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single‑drone drills to multi‑drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi‑agent reinforcement learning (MARL) and game‑theoretic algorithms. Simulation results show that on‑policy RL methods outperform off‑policy methods in single‑agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low‑level control and high‑level strategy. To highlight VolleyBots' sim‑to‑real potential, we further demonstrate the zero‑shot deployment of a policy trained entirely in simulation on real‑world drones.

PaperID: 11764, https://arxiv.org/pdf/2502.01789.pdf

Authors: Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri

Title: An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data

Abstract:
Early identification of cognitive concerns is critical but often hindered by subtle symptom presentation. This study developed and validated a fully automated, multi‑agent AI workflow using LLaMA 3 8B to identify cognitive concerns in 3,338 clinical notes from Mass General Brigham. The agentic workflow, leveraging task‑specific agents that dynamically collaborate to extract meaningful insights from clinical notes, was compared to an expert‑driven benchmark. Both workflows achieved high classification performance, with F1‑scores of 0.90 and 0.91, respectively. The agentic workflow demonstrated improved specificity (1.00) and achieved prompt refinement in fewer iterations. Although both workflows showed reduced performance on validation data, the agentic workflow maintained perfect specificity. These findings highlight the potential of fully automated multi‑agent AI workflows to achieve expert‑level accuracy with greater efficiency, offering a scalable and cost‑effective solution for detecting cognitive concerns in clinical settings.

PaperID: 11765, https://arxiv.org/pdf/2502.01715.pdf

Authors: Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang

Title: Process-Supervised Reinforcement Learning for Code Generation

Abstract:
Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models(LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi‑step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle stems from the resource‑intensive nature of constructing high‑quality process‑supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a "statement mutation/refactoring‑compile and execution verification" strategy: mutating and refactoring code line‑by‑line through a teacher model, and utilizing compiler execution results to automatically label each line, resulting in line‑by‑line process‑supervised data, which is pivotal for training a process‑supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process‑supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, in tackling complex code generation tasks, process‑supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generation results.

PaperID: 11766, https://arxiv.org/pdf/2502.01714.pdf

Authors: Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang

Title: Position: Towards a Responsible LLM-empowered Multi-Agent Systems

Abstract:
The rise of Agent AI and Large Language Model‑powered Multi‑Agent Systems (LLM‑MAS) has underscored the need for responsible and dependable system operation. Tools like LangChain and Retrieval‑Augmented Generation have expanded LLM capabilities, enabling deeper integration into MAS through enhanced knowledge retrieval and reasoning. However, these advancements introduce critical challenges: LLM agents exhibit inherent unpredictability, and uncertainties in their outputs can compound across interactions, threatening system stability. To address these risks, a human‑centered design approach with active dynamic moderation is essential. Such an approach enhances traditional passive oversight by facilitating coherent inter‑agent communication and effective system governance, allowing MAS to achieve desired outcomes more efficiently.

PaperID: 11767, https://arxiv.org/pdf/2502.01711.pdf

Authors: Darius Muglich, Johannes Forkel, Elise van der Pol, Jakob Foerster

Title: Expected Return Symmetries

Abstract:
Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi‑agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking; e.g. in a game where two independent agents can choose to move "left'' or "right'', and where a reward of +1 or ‑1 is received when the agents choose the same action or different actions, respectively. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, which we call expected return symmetries, which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero‑shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of their environment and does not require access to ground truth symmetries.

PaperID: 11768, https://arxiv.org/pdf/2502.01709.pdf

Authors: Christopher Simic, Korbinian Riedhammer, Tobias Bocklet

Title: Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models

Abstract:
We present an approach to Audio‑Visual Speech Recognition that builds on a pre‑trained Whisper model. To infuse visual information into this audio‑only model, we extend it with an AV fusion module and LoRa adapters, one of the most up‑to‑date adapter approaches. One advantage of adapter‑based approaches, is that only a relatively small number of parameters are trained, while the basic model remains unchanged. Common AVSR approaches train single models to handle several noise categories and noise levels simultaneously. Taking advantage of the lightweight nature of adapter approaches, we train noise‑scenario‑specific adapter‑sets, each covering individual noise‑categories or a specific noise‑level range. The most suitable adapter‑set is selected by previously classifying the noise‑scenario. This enables our models to achieve an optimum coverage across different noise‑categories and noise‑levels, while training only a minimum number of parameters. Compared to a full fine‑tuning approach with SOTA performance our models achieve almost comparable results over the majority of the tested noise‑categories and noise‑levels, with up to 88.5% less trainable parameters. Our approach can be extended by further noise‑specific adapter‑sets to cover additional noise scenarios. It is also possible to utilize the underlying powerful ASR model when no visual information is available, as it remains unchanged.

PaperID: 11769, https://arxiv.org/pdf/2502.01635.pdf

Authors: Stephen Casper, Luke Bailey, Rosco Hunter, Carson Ezell, Emma Cabalé, Michael Gerovitch, Stewart Slocum, Kevin Wei, Nikola Jurkovic, Ariba Khan, Phillip J. K. Christoffersen, A. Pinar Ozisik, Rakshit Trivedi, Dylan Hadfield-Menell, Noam Kolt

Title: The AI Agent Index

Abstract:
Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/

PaperID: 11770, https://arxiv.org/pdf/2502.01630.pdf

Authors: Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, Yi Zhang

Title: TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

Abstract:
Temporal reasoning in multi‑session dialogues presents a significant challenge which has been under‑studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi‑session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi‑choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM‑agents in this context. Specifically, the framework employs time‑aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro‑symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising from 29.83 on GPT‑4o via standard prompting to 77.67 via our approach and highlighting its effectiveness in addressing temporal reasoning in multi‑session dialogues.

PaperID: 11771, https://arxiv.org/pdf/2502.01600.pdf

Authors: Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, Philipp Krähenbühl

Title: Reinforcement Learning for Long-Horizon Interactive LLM Agents

Abstract:
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction‑tuned large language models (LLMs) can react to feedback from interface invocations in multi‑step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive LOOP, a data‑ and memory‑efficient variant of proximal policy optimization. LOOP uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory‑efficient as fine‑tuning a single LLM. A 32‑billion‑parameter agent trained with LOOP in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi‑domain, multi‑app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.

PaperID: 11772, https://arxiv.org/pdf/2502.01568.pdf

Authors: Benjamin A. Spiegel, Lucas Gelfond, George Konidaris

Title: Visual Theory of Mind Enables the Invention of Proto-Writing

Abstract:
Symbolic writing systems are graphical semiotic codes that are ubiquitous in modern society but are otherwise absent in the animal kingdom. Anthropological evidence suggests that the earliest forms of some writing systems originally consisted of iconic pictographs, which signify their referent via visual resemblance. While previous studies have examined the emergence and, separately, the evolution of pictographic systems through a computational lens, most employ non‑naturalistic methodologies that make it difficult to draw clear analogies to human and animal cognition. We develop a multi‑agent reinforcement learning testbed for emergent communication called a Signification Game, and formulate a model of inferential communication that enables agents to leverage visual theory of mind to communicate actions using pictographs. Our model, which is situated within a broader formalism for animal communication, sheds light on the cognitive and cultural processes underlying the emergence of proto‑writing.

PaperID: 11773, https://arxiv.org/pdf/2502.01562.pdf

Authors: Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, Harri Valpola

Title: Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization

Abstract:
As the general capabilities of artificial intelligence (AI) agents continue to evolve, their ability to learn to master multiple complex tasks through experience remains a key challenge. Current LLM agents, particularly those based on proprietary language models, typically rely on prompts to incorporate knowledge about the target tasks. This approach does not allow the agent to internalize this information and instead relies on ever‑expanding prompts to sustain its functionality in diverse scenarios. This resembles a system of notes used by a person affected by anterograde amnesia, the inability to form new memories. In this paper, we propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for either cumbersome note systems or prior high‑quality demonstration data. Our approach employs an iterative process where the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights via a context distillation training procedure. We demonstrate the efficacy of our approach by implementing it in a Llama‑3‑based agent that, after only a few rounds of feedback, outperforms advanced models GPT‑4o and DeepSeek‑V3 in tasksets requiring correct sequencing of information retrieval, tool use, and question answering.

PaperID: 11774, https://arxiv.org/pdf/2502.01506.pdf

Authors: Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang

Title: TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Abstract:
The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule‑based Agent‑Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role‑playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non‑rational influences, enabling more realistic simulations of socio‑economic dynamics. In this work, we introduce TwinMarket, a novel multi‑agent framework that leverages LLMs to simulate socio‑economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision‑making and collective socio‑economic patterns.

PaperID: 11775, https://arxiv.org/pdf/2502.01450.pdf

Authors: Tianrui Hu, Dimitrios Liakopoulos, Xiwen Wei, Radu Marculescu, Neeraja J. Yadwadkar

Title: Simulating Rumor Spreading in Social Networks using LLM Agents

Abstract:
With the rise of social media, misinformation has become increasingly prevalent, fueled largely by the spread of rumors. This study explores the use of Large Language Model (LLM) agents within a novel framework to simulate and analyze the dynamics of rumor propagation across social networks. To this end, we design a variety of LLM‑based agent types and construct four distinct network structures to conduct these simulations. Our framework assesses the effectiveness of different network constructions and agent behaviors in influencing the spread of rumors. Our results demonstrate that the framework can simulate rumor spreading across more than one hundred agents in various networks with thousands of edges. The evaluations indicate that network structure, personas, and spreading schemes can significantly influence rumor dissemination, ranging from no spread to affecting 83% of agents in iterations, thereby offering a realistic simulation of rumor spread in social networks.

PaperID: 11776, https://arxiv.org/pdf/2502.01390.pdf

Authors: Gaole He, Gianluca Demartini, Ujwal Gadiraju

Title: Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant

Abstract:
Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of 'LLM‑modulo' setups in conjunction with humans‑in‑the‑loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan‑then‑execute manner, wherein the agents conducted step‑wise planning and step‑by‑step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double‑edged sword ‑‑ (1) they can work well when a high‑quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human‑AI collaboration with LLM agents.

PaperID: 11777, https://arxiv.org/pdf/2502.01376.pdf

Authors: Lu Chen, Lipeng Chen, Xiangchi Chen, Haojian Lu, Yu Zheng, Jun Wu, Yue Wang, Zhengyou Zhang, Rong Xiong

Title: Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Abstract:
Physical human‑robot interaction (pHRI) is widely needed in many fields, such as industrial manipulation, home services, and medical rehabilitation, and puts higher demands on the safety of robots. Due to the uncertainty of the working environment, the pHRI may receive unexpected impact interference, which affects the safety and smoothness of the task execution. The commonly used linear admittance control (L‑AC) can cope well with high‑frequency small‑amplitude noise, but for medium‑frequency high‑intensity impact, the effect is not as good. Inspired by the solid‑liquid phase change nature of shear‑thickening fluid, we propose a Shear‑thickening Fluid Control (SFC) that can achieve both an easy human‑robot collaboration and resistance to impact interference. The SFC's stability, passivity, and phase trajectory are analyzed in detail, the frequency and time domain properties are quantified, and parameter constraints in discrete control and coupled stability conditions are provided. We conducted simulations to compare the frequency and time domain characteristics of L‑AC, nonlinear admittance controller (N‑AC), and SFC, and validated their dynamic properties. In real‑world experiments, we compared the performance of L‑AC, N‑AC, and SFC in both fixed and mobile manipulators. L‑AC exhibits weak resistance to impact. N‑AC can resist moderate impacts but not high‑intensity ones, and may exhibit self‑excited oscillations. In contrast, SFC demonstrated superior impact resistance and maintained stable collaboration, enhancing comfort in cooperative water delivery tasks. Additionally, a case study was conducted in a factory setting, further affirming the SFC's capability in facilitating human‑robot collaborative manipulation and underscoring its potential in industrial applications.

PaperID: 11778, https://arxiv.org/pdf/2502.01344.pdf

Authors: Jinzhi Liao, Zenghua Liao, Xiang Zhao

Title: PSSD: Making Large Language Models Self-denial via Human Psyche Structure

Abstract:
The enhance of accuracy in reasoning results of LLMs arouses the community's interests, wherein pioneering studies investigate post‑hoc strategies to rectify potential mistakes. Despite extensive efforts, they are all stuck in a state of resource competition demanding significant time and computing expenses. The cause of the situation lies in the failure of identifying the fundamental feature of the solutions in this line, coined as the self‑denial of LLMs. In other words, LLMs should confidently determine the potential existence of mistakes and carefully execute the targeted correction. As the whole procedure conducts within LLMs, supporting and persuasive references are hard to acquire, while the absence of specific steps towards refining hidden mistakes persists even when errors are acknowledged. In response to the challenges, we present PSSD, which refers to and implements the human psyche structure such that three distinct and interconnected roles contribute to human reasoning. Specifically, PSSD leverages the recent multi‑agent paradigm, and is further enhanced with three innovatively conceived roles: (1) the intuition‑based id role that provides initial attempts based on benign LLMs; (2) the rule‑driven superego role that summarizes rules to regulate the above attempts, and returns specific key points as guidance; and (3) the script‑centric ego role that absorbs all procedural information to generate executable script for the final answer prediction. Extensive experiments demonstrate that the proposed design not only better enhance reasoning capabilities, but also seamlessly integrate with current models, leading to superior performance.

PaperID: 11779, https://arxiv.org/pdf/2502.01071.pdf

Authors: Marie Samson, Bastien Muraccioli, Fumio Kanehiro

Title: Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs

Abstract:
The integration of language instructions with robotic control, particularly through Vision Language Action (VLA) models, has shown significant potential. However, these systems are often hindered by high computational costs, the need for extensive retraining, and limited scalability, making them less accessible for widespread use. In this paper, we introduce SVLR (Scalable Visual Language Robotics), an open‑source, modular framework that operates without the need for retraining, providing a scalable solution for robotic control. SVLR leverages a combination of lightweight, open‑source AI models including the Vision‑Language Model (VLM) Mini‑InternVL, zero‑shot image segmentation model CLIPSeg, Large Language Model Phi‑3, and sentence similarity model all‑MiniLM to process visual and language inputs. These models work together to identify objects in an unknown environment, use them as parameters for task execution, and generate a sequence of actions in response to natural language instructions. A key strength of SVLR is its scalability. The framework allows for easy integration of new robotic tasks and robots by simply adding text descriptions and task definitions, without the need for retraining. This modularity ensures that SVLR can continuously adapt to the latest advancements in AI technologies and support a wide range of robots and tasks. SVLR operates effectively on an NVIDIA RTX 2070 (mobile) GPU, demonstrating promising performance in executing pick‑and‑place tasks. While these initial results are encouraging, further evaluation across a broader set of tasks and comparisons with existing VLA models are needed to assess SVLR's generalization capabilities and performance in more complex scenarios.

PaperID: 11780, https://arxiv.org/pdf/2502.01027.pdf

Authors: Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Title: Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees

Abstract:
Two‑stage Learning‑to‑Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision‑making in complex, multi‑agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation‑‑causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two‑stage L2D systems. We introduce two novel attack strategie‑‑untargeted and targeted‑‑which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes‑consistent and (\mathcalR, \mathcalG)‑consistent. These guarantees hold across classification, regression, and multi‑task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.

PaperID: 11781, https://arxiv.org/pdf/2502.00989.pdf

Authors: Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt

Title: ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution

Abstract:
Large Language Models (LLMs) can perform chart question‑answering tasks but often generate unverified hallucinated responses. Existing answer attribution methods struggle to ground responses in source charts due to limited visual‑semantic context, complex visual‑text alignment requirements, and difficulties in bounding box prediction across complex layouts. We present ChartCitor, a multi‑agent framework that provides fine‑grained bounding box citations by identifying supporting evidence within chart images. The system orchestrates LLM agents to perform chart‑to‑table extraction, answer reformulation, table augmentation, evidence retrieval through pre‑filtering and re‑ranking, and table‑to‑chart mapping. ChartCitor outperforms existing baselines across different chart types. Qualitative user studies show that ChartCitor helps increase user trust in Generative AI by providing enhanced explainability for LLM‑assisted chart QA and enables professionals to be more productive.

PaperID: 11782, https://arxiv.org/pdf/2502.00988.pdf

Authors: Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt

Title: PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback

Abstract:
Scientific data visualization is pivotal for transforming raw data into comprehensible visual representations, enabling pattern recognition, forecasting, and the presentation of data‑driven insights. However, novice users often face difficulties due to the complexity of selecting appropriate tools and mastering visualization techniques. Large Language Models (LLMs) have recently demonstrated potential in assisting code generation, though they struggle with accuracy and require iterative debugging. In this paper, we propose PlotGen, a novel multi‑agent framework aimed at automating the creation of precise scientific visualizations. PlotGen orchestrates multiple LLM‑based agents, including a Query Planning Agent that breaks down complex user requests into executable steps, a Code Generation Agent that converts pseudocode into executable Python code, and three retrieval feedback agents ‑ a Numeric Feedback Agent, a Lexical Feedback Agent, and a Visual Feedback Agent ‑ that leverage multimodal LLMs to iteratively refine the data accuracy, textual labels, and visual correctness of generated plots via self‑reflection. Extensive experiments show that PlotGen outperforms strong baselines, achieving a 4‑6 percent improvement on the MatPlotBench dataset, leading to enhanced user trust in LLM‑generated visualizations and improved novice productivity due to a reduction in debugging time needed for plot errors.

PaperID: 11783, https://arxiv.org/pdf/2502.00955.pdf

Authors: Wentao Shi, Zichun Yu, Fuli Feng, Xiangnan He, Chenyan Xiong

Title: Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search

Abstract:
Monte Carlo Tree Search (MCTS) based methods provide promising approaches for generating synthetic data to enhance the self‑training of Large Language Model (LLM) based multi‑agent systems (MAS). These methods leverage Q‑values to estimate individual agent contributions. However, relying solely on Q‑values to identify informative data may misalign with the data synthesis objective, as the focus should be on selecting data that best enhances model training. To address this discrepancy, we propose Data Influence‑oriented Tree Search (DITS), a novel framework that incorporates influence scores to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement, thereby enhancing model performance. Furthermore, we derive influence score estimation methods tailored for non‑differentiable metrics, significantly reducing computational overhead by utilizing inference computations. Extensive experiments on eight multi‑agent datasets demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more inference resources to estimate influence scores, rather than Q‑values, during data synthesis can more effectively and efficiently enhance model training.

PaperID: 11784, https://arxiv.org/pdf/2502.00915.pdf

Authors: Batuhan Yardim, Semih Cayci, Niao He

Title: A Variational Inequality Approach to Independent Learning in Static Mean-Field Games

Abstract:
Competitive games involving thousands or even millions of players are prevalent in real‑world contexts, such as transportation, communications, and computer networks. However, learning in these large‑scale multi‑agent environments presents a grand challenge, often referred to as the "curse of many agents". In this paper, we formalize and analyze the Static Mean‑Field Game (SMFG) under both full and bandit feedback, offering a generic framework for modeling large population interactions while enabling independent learning. We first establish close connections between SMFG and variational inequality (VI), showing that SMFG can be framed as a VI problem in the infinite agent limit. Building on the VI perspective, we propose independent learning and exploration algorithms that efficiently converge to approximate Nash equilibria, when dealing with a finite number of agents. Theoretically, we provide explicit finite sample complexity guarantees for independent learning across various feedback models in repeated play scenarios, assuming (strongly‑)monotone payoffs. Numerically, we validate our results through both simulations and real‑world applications in city traffic and network access management.

PaperID: 11785, https://arxiv.org/pdf/2502.00726.pdf

Authors: Yoann Poupart, Aurélie Beynier, Nicolas Maudet

Title: Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning

Abstract:
Multi‑Agent Deep Reinforcement Learning (MADRL) was proven efficient in solving complex problems in robotics or games, yet most of the trained models are hard to interpret. While learning intrinsically interpretable models remains a prominent approach, its scalability and flexibility are limited in handling complex tasks or multi‑agent dynamics. This paper advocates for direct interpretability, generating post hoc explanations directly from trained models, as a versatile and scalable alternative, offering insights into agents' behaviour, emergent phenomena, and biases without altering models' architectures. We explore modern methods, including relevance backpropagation, knowledge edition, model steering, activation patching, sparse autoencoders and circuit discovery, to highlight their applicability to single‑agent, multi‑agent, and training process challenges. By addressing MADRL interpretability, we propose directions aiming to advance active topics such as team identification, swarm coordination and sample efficiency.

PaperID: 11786, https://arxiv.org/pdf/2502.00558.pdf

Authors: Sydney Dolan, Siddharth Nayak, Jasmine Jerry Aloor, Hamsa Balakrishnan

Title: Asynchronous Cooperative Multi-Agent Reinforcement Learning with Limited Communication

Abstract:
We consider the problem setting in which multiple autonomous agents must cooperatively navigate and perform tasks in an unknown, communication‑constrained environment. Traditional multi‑agent reinforcement learning (MARL) approaches assume synchronous communications and perform poorly in such environments. We propose AsynCoMARL, an asynchronous MARL approach that uses graph transformers to learn communication protocols from dynamic graphs. AsynCoMARL can accommodate infrequent and asynchronous communications between agents, with edges of the graph only forming when agents communicate with each other. We show that AsynCoMARL achieves similar success and collision rates as leading baselines, despite 26% fewer messages being passed between agents.

PaperID: 11787, https://arxiv.org/pdf/2502.00415.pdf

Authors: George Fatouros, Kostas Metaxas, John Soldatos, Manos Karathanassis

Title: MarketSenseAI 2.0: Enhancing Stock Analysis through LLM Agents

Abstract:
MarketSenseAI is a novel framework for holistic stock analysis which leverages Large Language Models (LLMs) to process financial news, historical prices, company fundamentals and the macroeconomic environment to support decision making in stock analysis and selection. In this paper, we present the latest advancements on MarketSenseAI, driven by rapid technological expansion in LLMs. Through a novel architecture combining Retrieval‑Augmented Generation and LLM agents, the framework processes SEC filings and earnings calls, while enriching macroeconomic analysis through systematic processing of diverse institutional reports. We demonstrate a significant improvement in fundamental analysis accuracy over the previous version. Empirical evaluation on S\&P 100 stocks over two years (2023‑2024) shows MarketSenseAI achieving cumulative returns of 125.9% compared to the index return of 73.5%, while maintaining comparable risk profiles. Further validation on S\&P 500 stocks during 2024 demonstrates the framework's scalability, delivering a 33.8% higher Sortino ratio than the market. This work marks a significant advancement in applying LLM technology to financial analysis, offering insights into the robustness of LLM‑driven investment strategies.

PaperID: 11788, https://arxiv.org/pdf/2502.00406.pdf

Authors: Debdeep Sanyal, Murari Mandal

Title: Agents Are All You Need for LLM Unlearning

Abstract:
Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. In this work we show that agents might be all we need for effective and practical inference‑time LLM unlearning. We present the first agentic LLM unlearning (\textttALU) method, a multi‑agent, retrain‑free, model‑agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our \textttALU framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and \textttALU seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that \textttALU consistently stands out as the most robust inference‑time LLM unlearning framework among current state‑of‑the‑art methods while incurring time cost that remains effectively constant regardless of the number of unlearning targets. We further highlight \textttALU's superior performance compared to existing methods when evaluated at scale. Specifically, \textttALU is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.

PaperID: 11789, https://arxiv.org/pdf/2502.00352.pdf

Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang

Title: A Differentiated Reward Method for Reinforcement Learning based Multi-Vehicle Cooperative Decision-Making Algorithms

Abstract:
Reinforcement learning (RL) shows great potential for optimizing multi‑vehicle cooperative driving strategies through the state‑action‑reward feedback loop, but it still faces challenges such as low sample efficiency. This paper proposes a differentiated reward method based on steady‑state transition systems, which incorporates state transition gradient information into the reward design by analyzing traffic flow characteristics, aiming to optimize action selection and policy learning in multi‑vehicle cooperative decision‑making. The performance of the proposed method is validated in RL algorithms such as MAPPO, MADQN, and QMIX under varying autonomous vehicle penetration. The results show that the differentiated reward method significantly accelerates training convergence and outperforms centering reward and others in terms of traffic efficiency, safety, and action rationality. Additionally, the method demonstrates strong scalability and environmental adaptability, providing a novel approach for multi‑agent cooperative decision‑making in complex traffic scenarios.

PaperID: 11790, https://arxiv.org/pdf/2502.00350.pdf

Authors: Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao

Title: OrcaLoca: An LLM Agent Framework for Software Issue Localization

Abstract:
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization ‑‑ precisely identifying software problems by navigating to relevant code sections ‑‑ remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority‑based scheduling for LLM‑guided action, action decomposition with relevance scoring, and distance‑aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open‑source state‑of‑the‑art (SOTA) in function match rate (65.33%) on SWE‑bench Lite. It also improves the final resolved rate of an open‑source framework by 6.33 percentage points through its patch generation integration.

PaperID: 11791, https://arxiv.org/pdf/2502.00271.pdf

Authors: Fei Yu, Yingru Li, Benyou Wang

Title: Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning

Abstract:
Large language models (LLMs) struggle with multi‑step reasoning, where inference‑time scaling has emerged as a promising strategy for performance improvement. Verifier‑guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths. However, we identify a critical limitation: scaling flaws, prevalent across different models (Mistral 7B and DeepSeekMath 7B), benchmarks (GSM8K and MATH), and verifiers (outcome value models and process reward models). As sample size increases, verifier‑guided search exhibits diminishing advantages and eventually underperforms repeated sampling. Our analysis attributes this to verifier failures, where imperfect verifiers misrank candidates and erroneously prune all valid paths. These issues are further exacerbated in challenging and out‑of‑distribution problems, restricting search effectiveness. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Our findings reveal fundamental limitations in verifier‑guided search and suggest future directions.

PaperID: 11792, https://arxiv.org/pdf/2502.00138.pdf

Authors: Christopher A. Esterhuyse, Tim Müller, L. Thomas van Binsbergen

Title: JustAct+: A Framework for Auditable Multi-Agent Systems Regulated by Inter-Organisational Policies

Abstract:
In open multi‑agent agent systems that cross organisational boundaries, agent actions must be regulated by complex policies. Consider medical data processing systems, which must observe generic laws (e.g., EU data protection regulations) and also specific participants' resource conditions (e.g., Bob consents to sharing his X‑Rays with EU hospitals). Presently, we address the implementation of these systems as distributed software. Solutions to key sub‑problems are available: existing policy languages capture the necessary normative concepts and formalise the computational representation and reasoning about policies, and existing distributed algorithms and protocols coordinate agents' changing actions and policies. But which policies and protocols are useful in application? With the JustAct framework, we characterise a class of multi‑agent systems where actors justify their actions with sufficient policy information collected from dynamic policy statements and agreements. We prove key properties of these systems, e.g., any decision that an action is permitted now cannot be refuted later, regardless of any added statements or updated agreements. We study a particular instance of the framework by specifying (in Rocq) and implementing (in Rust) a particular policy language and runtime system for mediating agent communications. We demonstrate and assess JustAct via a case study of this implementation: we reproduce the usage scenarios of Brane, an existing policy‑regulated, inter‑domain, medical data processing system.

PaperID: 11793, https://arxiv.org/pdf/2502.00022.pdf

Authors: Xingyu Xiao, Peng Chen, Qianqian Jia, Jiejuan Tong, Jingang Liang, Haitao Wang

Title: A Dynamic and High-Precision Method for Scenario-Based HRA Synthetic Data Collection in Multi-Agent Collaborative Environments Driven by LLMs

Abstract:
HRA (Human Reliability Analysis) data is crucial for advancing HRA methodologies. however, existing data collection methods lack the necessary granularity, and most approaches fail to capture dynamic features. Additionally, many methods require expert knowledge as input, making them time‑consuming and labor‑intensive. To address these challenges, we propose a new paradigm for the automated collection of HRA data. Our approach focuses on key indicators behind human error, specifically measuring workload in collaborative settings. This study introduces a novel, scenario‑driven method for workload estimation, leveraging fine‑tuned large language models (LLMs). By training LLMs on real‑world operational data from high‑temperature gas‑cooled reactors (HTGRs), we simulate human behavior and cognitive load in real time across various collaborative scenarios. The method dynamically adapts to changes in operator workload, providing more accurate, flexible, and scalable workload estimates. The results demonstrate that the proposed WELLA (Workload Estimation with LLMs and Agents) outperforms existing commercial LLM‑based methods in terms of prediction accuracy.

PaperID: 11794, https://arxiv.org/pdf/2502.00010.pdf

Authors: Changyong Qi, Linzhao Jia, Yuang Wei, Yuan-Hao Jiang, Xiaoqing Gu

Title: IntelliChain: An Integrated Framework for Enhanced Socratic Method Dialogue with LLMs and Knowledge Graphs

Abstract:
With the continuous advancement of educational technology, the demand for Large Language Models (LLMs) as intelligent educational agents in providing personalized learning experiences is rapidly increasing. This study aims to explore how to optimize the design and collaboration of a multi‑agent system tailored for Socratic teaching through the integration of LLMs and knowledge graphs in a chain‑of‑thought dialogue approach, thereby enhancing the accuracy and reliability of educational applications. By incorporating knowledge graphs, this research has bolstered the capability of LLMs to handle specific educational content, ensuring the accuracy and relevance of the information provided. Concurrently, we have focused on developing an effective multi‑agent collaboration mechanism to facilitate efficient information exchange and chain dialogues among intelligent agents, significantly improving the quality of educational interaction and learning outcomes. In empirical research within the domain of mathematics education, this framework has demonstrated notable advantages in enhancing the accuracy and credibility of educational interactions. This study not only showcases the potential application of LLMs and knowledge graphs in mathematics teaching but also provides valuable insights and methodologies for the development of future AI‑driven educational solutions.

PaperID: 11795, https://arxiv.org/pdf/2501.19398.pdf

Authors: Mustafa O. Karabag, Jan Sobotka, Ufuk Topcu

Title: Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game

Abstract:
Large language model‑based (LLM‑based) agents have become common in settings that include non‑cooperative parties. In such settings, agents' decision‑making needs to conceal information from their adversaries, reveal information to their cooperators, and infer information to identify the other agents' characteristics. To investigate whether LLMs have these information control and decision‑making capabilities, we make LLM agents play the language‑based hidden‑identity game, The Chameleon. In this game, a group of non‑chameleon agents who do not know each other aim to identify the chameleon agent without revealing a secret. The game requires the aforementioned information control capabilities both as a chameleon and a non‑chameleon. We begin with a theoretical analysis for a spectrum of strategies, from concealing to revealing, and provide bounds on the non‑chameleons' winning probability. The empirical results with GPT, Gemini 2.5 Pro, Llama 3.1, and Qwen3 models show that while non‑chameleon LLM agents identify the chameleon, they fail to conceal the secret from the chameleon, and their winning probability is far from the levels of even trivial strategies. Based on these empirical results and our theoretical analysis, we deduce that LLM‑based agents may reveal excessive information to agents of unknown identities. Interestingly, we find that, when instructed to adopt an information‑revealing level, this level is linearly encoded in the LLM's internal representations. While the instructions alone are often ineffective at making non‑chameleon LLMs conceal, we show that steering the internal representations in this linear direction directly can reliably induce concealing behavior.

PaperID: 11796, https://arxiv.org/pdf/2501.19388.pdf

Authors: Antoine Scheid, Etienne Boursier, Alain Durmus, Eric Moulines, Michael I. Jordan

Title: Online Decision-Making in Tree-Like Multi-Agent Games with Transfers

Abstract:
The widespread deployment of Machine Learning systems everywhere raises challenges, such as dealing with interactions or competition between multiple learners. In that goal, we study multi‑agent sequential decision‑making by considering principal‑agent interactions in a tree structure. In this problem, the reward of a player is influenced by the actions of her children, who are all self‑interested and non‑cooperative, hence the complexity of making good decisions. Our main finding is that it is possible to steer all the players towards the globally optimal set of actions by simply allowing single‑step transfers between them. A transfer is established between a principal and one of her agents: the principal actually offers the proposed payment if the agent picks the recommended action. The analysis poses specific challenges due to the intricate interactions between the nodes of the tree and the propagation of the regret within this tree. Considering a bandit setup, we propose algorithmic solutions for the players to end up being no‑regret with respect to the optimal pair of actions and incentives. In the long run, allowing transfers between players makes them act as if they were collaborating together, although they remain self‑interested non‑cooperative: transfers restore efficiency.

PaperID: 11797, https://arxiv.org/pdf/2501.19377.pdf

Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Erik Marchi

Title: SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions

Abstract:
In this work, we present and evaluate SELMA, a Speech‑Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end‑to‑end model. We employ low‑rank adaptation modules for parameter‑efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device‑Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR), demonstrate that our approach both simplifies the typical input processing pipeline of virtual assistants significantly and also improves performance compared to dedicated models for each individual task. SELMA yields relative Equal‑Error Rate improvements of 64% on the VT detection task, and 22% on DDSD, while also achieving word error rates close to the baseline.

PaperID: 11798, https://arxiv.org/pdf/2501.19239.pdf

Authors: Xingyu Wang, Mengfan Xu

Title: Multi-agent Multi-armed Bandit with Fully Heavy-tailed Dynamics

Abstract:
We study decentralized multi‑agent multi‑armed bandits in fully heavy‑tailed settings, where clients communicate over sparse random graphs with heavy‑tailed degree distributions and observe heavy‑tailed (homogeneous or heterogeneous) reward distributions with potentially infinite variance. The objective is to maximize system performance by pulling the globally optimal arm with the highest global reward mean across all clients. We are the first to address such fully heavy‑tailed scenarios, which capture the dynamics and challenges in communication and inference among multiple clients in real‑world systems. In homogeneous settings, our algorithmic framework exploits hub‑like structures unique to heavy‑tailed graphs, allowing clients to aggregate rewards and reduce noises via hub estimators when constructing UCB indices; under M clients and degree distributions with power‑law index α> 1, our algorithm attains a regret bound (almost) of order O(M^1 ‑\frac1α \logT). Under heterogeneous rewards, clients synchronize by communicating with neighbors, aggregating exchanged estimators in UCB indices; With our newly established information delay bounds on sparse random graphs, we prove a regret bound of O(M \logT). Our results improve upon existing work, which only address time‑invariant connected graphs, or light‑tailed dynamics in dense graphs and rewards.

PaperID: 11799, https://arxiv.org/pdf/2501.19218.pdf

Authors: Sreenivasan Ganti, Visnu Srinivasan, Pallavi Ramicetty, Shravan Mohan, Milind Savagaonkar, Shubhashis Sengupta

Title: A parallelizable variant of HCA*

Abstract:
This paper presents a parallelizable variant of the well‑known Hierarchical Cooperative A algorithm (HCA) for the multi‑agent path finding (MAPF) problem. In this variant, all agents initially find their shortest paths disregarding the presence of others. This is done using A. Then an intersection graph (IG) is constructed; each agent is a node and two nodes have an edge between them if the paths of corresponding agents collide. Thereafter, an independent set is extracted with the aid of an approximation algorithm for the maximum independent set problem. The paths for the agents belonging to independent set are fixed. The rest of agents now again find their shortest paths, this time ensuring no collision with the prior agents. Space‑time A, which is a crucial component of HCA, is used here. These iterations continue until no agents are left. Since the tasks of finding shortest paths for the agents in any iteration are independent of each other, the proposed algorithm can be parallelized to a large extent. In addition to this, the task of determining the IG can also be done in parallel by dividing the map into sections and with each agent focusing on a particular section. The parallelism does come at a cost of communication between the agents and the server. This is accounted for in the simulations. As an added advantage, the user need not make a choice for the priority order. It is observed, empirically, that the proposed algorithm outperforms HCA in terms of the computation time and the cost value in many cases. Simulations are provided for corroboration.

PaperID: 11800, https://arxiv.org/pdf/2501.19144.pdf

Authors: Aymeric Capitaine, Etienne Boursier, Eric Moulines, Michael I. Jordan, Alain Durmus

Title: Prediction-Aware Learning in Multi-Agent Systems

Abstract:
The framework of uncoupled online learning in multiplayer games has made significant progress in recent years. In particular, the development of time‑varying games has considerably expanded its modeling capabilities. However, current regret bounds quickly become vacuous when the game undergoes significant variations over time, even when these variations are easy to predict. Intuitively, the ability of players to forecast future payoffs should lead to tighter guarantees, yet existing approaches fail to incorporate this aspect. This work aims to fill this gap by introducing a novel prediction‑aware framework for time‑varying games, where agents can forecast future payoffs and adapt their strategies accordingly. In this framework, payoffs depend on an underlying state of nature that agents predict in an online manner. To leverage these predictions, we propose the POWMU algorithm, a contextual extension of the optimistic Multiplicative Weight Update algorithm, for which we establish theoretical guarantees on social welfare and convergence to equilibrium. Our results demonstrate that, under bounded prediction errors, the proposed framework achieves performance comparable to the static setting. Finally, we empirically demonstrate the effectiveness of POWMU in a traffic routing experiment.

PaperID: 11801, https://arxiv.org/pdf/2501.19143.pdf

Authors: Ching-Chun Chang, Fan-Yun Chen, Shih-Hong Gu, Kai Gao, Hanrui Wang, Isao Echizen

Title: Imitation Game for Adversarial Disillusion with Chain-of-Thought Reasoning in Generative AI

Abstract:
As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The former exploits the model's decision boundaries to create a stimulus that, when applied, interferes with its decision‑making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework, addressing vulnerabilities across various forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain‑of‑thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluates the methodology under a variety of attack scenarios. Experimental results demonstrate that the proposed framework consistently neutralises both deductive and inductive adversarial illusions across diverse white‑box and black‑box attack scenarios.

PaperID: 11802, https://arxiv.org/pdf/2501.18944.pdf

Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen

Title: O-MAPL: Offline Multi-agent Preference Learning

Abstract:
Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi‑agent RL (MARL), where large joint state‑action spaces and complex inter‑agent interactions complicate the task. While prior single‑agent studies have explored recovering reward functions and policies from human preferences, similar work in MARL is limited. Existing methods often involve separate stages of supervised reward learning and MARL algorithms, leading to unstable training. In this work, we introduce a novel end‑to‑end preference‑based learning framework for cooperative MARL, leveraging the underlying connection between reward functions and soft Q‑functions. Our approach uses a carefully‑designed multi‑agent value decomposition strategy to improve training efficiency. Extensive experiments on SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks.

PaperID: 11803, https://arxiv.org/pdf/2501.18924.pdf

Authors: Ying Wen, Ziyu Wan, Shao Zhang

Title: Language Games as the Pathway to Artificial Superhuman Intelligence

Abstract:
The evolution of large language models (LLMs) toward artificial superhuman intelligence (ASI) hinges on data reproduction, a cyclical process in which models generate, curate and retrain on novel data to refine capabilities. Current methods, however, risk getting stuck in a data reproduction trap: optimizing outputs within fixed human‑generated distributions in a closed loop leads to stagnation, as models merely recombine existing knowledge rather than explore new frontiers. In this paper, we propose language games as a pathway to expanded data reproduction, breaking this cycle through three mechanisms: (1) role fluidity, which enhances data diversity and coverage by enabling multi‑agent systems to dynamically shift roles across tasks; (2) reward variety, embedding multiple feedback criteria that can drive complex intelligent behaviors; and (3) rule plasticity, iteratively evolving interaction constraints to foster learnability, thereby injecting continual novelty. By scaling language games into global sociotechnical ecosystems, human‑AI co‑evolution generates unbounded data streams that drive open‑ended exploration. This framework redefines data reproduction not as a closed loop but as an engine for superhuman intelligence.

PaperID: 11804, https://arxiv.org/pdf/2501.18867.pdf

Authors: Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen

Title: UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

Abstract:
Recent advancements in Vision‑Language‑Action (VLA) models have leveraged pre‑trained Vision‑Language Models (VLMs) to improve the generalization capabilities. VLMs, typically pre‑trained on vision‑language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high‑level semantic content and neglect low‑level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre‑training paradigms. In this paper, we investigate the training paradigm for VLAs, and introduce UP‑VLA, a Unified VLA model training with both multi‑modal Understanding and future Prediction objectives, enhancing both high‑level semantic comprehension and low‑level spatial understanding. Experimental results show that UP‑VLA achieves a 33% improvement on the Calvin ABC‑D benchmark compared to the previous state‑of‑the‑art method. Additionally, UP‑VLA demonstrates improved success rates in real‑world manipulation tasks, particularly those requiring precise spatial information.

PaperID: 11805, https://arxiv.org/pdf/2501.18803.pdf

Authors: Yerin Kim, Alexander Benvenuti, Bo Chen, Mustafa Karabag, Abhishek Kulkarni, Nathaniel D. Bastian, Ufuk Topcu, Matthew Hale

Title: Deceptive Sequential Decision-Making via Regularized Policy Optimization

Abstract:
Autonomous systems are increasingly expected to operate in the presence of adversaries, though adversaries may infer sensitive information simply by observing a system. Therefore, present a deceptive sequential decision‑making framework that not only conceals sensitive information, but actively misleads adversaries about it. We model autonomous systems as Markov decision processes, with adversaries using inverse reinforcement learning to recover reward functions. To counter them, we present three regularization strategies for policy synthesis problems that actively deceive an adversary about a system's reward. ``Diversionary deception'' leads an adversary to draw any false conclusion about the system's reward function. ``Targeted deception'' leads an adversary to draw a specific false conclusion about the system's reward function. ``Equivocal deception'' leads an adversary to infer that the real reward and a false reward both explain the system's behavior. We show how each form of deception can be implemented in policy optimization problems and analytically bound the loss in total accumulated reward induced by deception. Next, we evaluate these developments in a multi‑agent setting. We show that diversionary, targeted, and equivocal deception all steer the adversary to false beliefs while still attaining a total accumulated reward that is at least 98% of its optimal, non‑deceptive value.

PaperID: 11806, https://arxiv.org/pdf/2501.18794.pdf

Authors: Matthew Neeley, Guantong Qi, Guanchu Wang, Ruixiang Tang, Dongxue Mao, Chaozhong Liu, Sasidhar Pasupuleti, Bo Yuan, Fan Xia, Pengfei Liu, Zhandong Liu, Xia Hu

Title: Survey and Improvement Strategies for Gene Prioritization with Large Language Models

Abstract:
Rare diseases are challenging to diagnose due to limited patient data and genetic diversity. Despite advances in variant prioritization, many cases remain undiagnosed. While large language models (LLMs) have performed well in medical exams, their effectiveness in diagnosing rare genetic diseases has not been assessed. To identify causal genes, we benchmarked various LLMs for gene prioritization. Using multi‑agent and Human Phenotype Ontology (HPO) classification, we categorized patients based on phenotypes and solvability levels. As gene set size increased, LLM performance deteriorated, so we used a divide‑and‑conquer strategy to break the task into smaller subsets. At baseline, GPT‑4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly. The multi‑agent and HPO approaches helped distinguish confidently solved cases from challenging ones, highlighting the importance of known gene‑phenotype associations and phenotype specificity. We found that cases with specific phenotypes or clear associations were more accurately solved. However, we observed biases toward well‑studied genes and input order sensitivity, which hindered gene prioritization. Our divide‑and‑conquer strategy improved accuracy by overcoming these biases. By utilizing HPO classification, novel multi‑agent techniques, and our LLM strategy, we improved causal gene identification accuracy compared to our baseline evaluation. This approach streamlines rare disease diagnosis, facilitates reanalysis of unsolved cases, and accelerates gene discovery, supporting the development of targeted diagnostics and therapies.

PaperID: 11807, https://arxiv.org/pdf/2501.18718.pdf

Authors: Shubham Aggarwal, Muhammad Aneeq uz Zaman, Melih Bastopcu, Sennur Ulukus, Tamer Başar

Title: Distributed Offloading in Multi-Access Edge Computing Systems: A Mean-Field Perspective

Abstract:
Multi‑access edge computing (MEC) technology is a promising solution to assist power‑constrained IoT devices by providing additional computing resources for time‑sensitive tasks. In this paper, we consider the problem of optimal task offloading in MEC systems with due consideration of the timeliness and scalability issues under two scenarios of equitable and priority access to the edge server (ES). In the first scenario, we consider a MEC system consisting of N devices assisted by one ES, where the devices can split task execution between a local processor and the ES, with equitable access to the ES. In the second scenario, we consider a MEC system consisting of one primary user, N secondary users and one ES. The primary user has priority access to the ES while the secondary users have equitable access to the ES amongst themselves. In both scenarios, due to the power consumption associated with utilizing the local resource and task offloading, the devices must optimize their actions. Additionally, since the ES is a shared resource, other users' offloading activity serves to increase latency incurred by each user. We thus model both scenarios using a non‑cooperative game framework. However, the presence of a large number of users makes it nearly impossible to compute the equilibrium offloading policies for each user, which would require a significant information exchange overhead between users. Thus, to alleviate such scalability issues, we invoke the paradigm of mean‑field games to compute approximate Nash equilibrium policies for each user using their local information, and further study the trade‑offs between increasing information freshness and reducing power consumption for each user. Using numerical evaluations, we show that our approach can recover the offloading trends displayed under centralized solutions, and provide additional insights into the results obtained.

PaperID: 11808, https://arxiv.org/pdf/2501.18712.pdf

Authors: Devansh Bhardwaj, Naman Mishra

Title: Invisible Traces: Using Hybrid Fingerprinting to identify underlying LLMs in GenAI Apps

Abstract:
Fingerprinting refers to the process of identifying underlying Machine Learning (ML) models of AI Systemts, such as Large Language Models (LLMs), by analyzing their unique characteristics or patterns, much like a human fingerprint. The fingerprinting of Large Language Models (LLMs) has become essential for ensuring the security and transparency of AI‑integrated applications. While existing methods primarily rely on access to direct interactions with the application to infer model identity, they often fail in real‑world scenarios involving multi‑agent systems, frequent model updates, and restricted access to model internals. In this paper, we introduce a novel fingerprinting framework designed to address these challenges by integrating static and dynamic fingerprinting techniques. Our approach identifies architectural features and behavioral traits, enabling accurate and robust fingerprinting of LLMs in dynamic environments. We also highlight new threat scenarios where traditional fingerprinting methods are ineffective, bridging the gap between theoretical techniques and practical application. To validate our framework, we present an extensive evaluation setup that simulates real‑world conditions and demonstrate the effectiveness of our methods in identifying and monitoring LLMs in Gen‑AI applications. Our results highlight the framework's adaptability to diverse and evolving deployment contexts.

PaperID: 11809, https://arxiv.org/pdf/2501.18653.pdf

Authors: Yanlong Li, Jindong Li, Qi Wang, Menglin Yang, He Kong, Shengsheng Wang

Title: Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation

Abstract:
Large language models based Multi Agent Systems (MAS) have demonstrated promising performance for enhancing the efficiency and accuracy of code generation tasks. However,most existing methods follow a conventional sequence of planning, coding, and debugging,which contradicts the growth‑driven nature of human learning process. Additionally,the frequent information interaction between multiple agents inevitably involves high computational costs. In this paper,we propose Cogito,a neurobiologically inspired multi‑agent framework to enhance the problem‑solving capabilities in code generation tasks with lower cost. Specifically,Cogito adopts a reverse sequence: it first undergoes debugging, then coding,and finally planning. This approach mimics human learning and development,where knowledge is acquired progressively. Accordingly,a hippocampus‑like memory module with different functions is designed to work with the pipeline to provide quick retrieval in similar tasks. Through this growth‑based learning model,Cogito accumulates knowledge and cognitive skills at each stage,ultimately forming a Super Role an all capable agent to perform the code generation task. Extensive experiments against representative baselines demonstrate the superior performance and efficiency of Cogito. The code is publicly available at https://anonymous.4open.science/r/Cogito‑0083.

PaperID: 11810, https://arxiv.org/pdf/2501.18645.pdf

Authors: Manish Sanwal

Title: Layered Chain-of-Thought Prompting for Multi-Agent LLM Systems: A Comprehensive Approach to Explainable Large Language Models

Abstract:
Large Language Models (LLMs) leverage chain‑of‑thought (CoT) prompting to provide step‑by‑step rationales, improving performance on complex tasks. Despite its benefits, vanilla CoT often fails to fully verify intermediate inferences and can produce misleading explanations. In this work, we propose Layered Chain‑of‑Thought (Layered‑CoT) Prompting, a novel framework that systematically segments the reasoning process into multiple layers, each subjected to external checks and optional user feedback. We expand on the key concepts, present three scenarios ‑‑ medical triage, financial risk assessment, and agile engineering ‑‑ and demonstrate how Layered‑CoT surpasses vanilla CoT in terms of transparency, correctness, and user engagement. By integrating references from recent arXiv papers on interactive explainability, multi‑agent frameworks, and agent‑based collaboration, we illustrate how Layered‑CoT paves the way for more reliable and grounded explanations in high‑stakes domains.

PaperID: 11811, https://arxiv.org/pdf/2501.18576.pdf

Authors: Evgenii Evstafev

Title: Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH

Abstract:
This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1's architecture, known for its reliance on token‑based reasoning, can achieve accurate solutions through a multi‑step process. The study compares DeepSeek R1 with four other models (gemini‑1.5‑flash‑8b, gpt‑4o‑mini‑2024‑07‑18, llama3.1:8b, and mistral‑8b‑latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token‑intensive approach. The findings highlight a trade‑off between accuracy and efficiency in mathematical problem‑solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task‑specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.

PaperID: 11812, https://arxiv.org/pdf/2501.18320.pdf

Authors: Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou

Title: Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach

Abstract:
Automated optimization modeling (AOM) has evoked considerable interest with the rapid evolution of large language models (LLMs). Existing approaches predominantly rely on prompt engineering, utilizing meticulously designed expert response chains or structured guidance. However, prompt‑based techniques have failed to perform well in the sensor array signal processing (SASP) area due the lack of specific domain knowledge. To address this issue, we propose an automated modeling approach based on retrieval‑augmented generation (RAG) technique, which consists of two principal components: a multi‑agent (MA) structure and a graph‑based RAG (Graph‑RAG) process. The MA structure is tailored for the architectural AOM process, with each agent being designed based on principles of human modeling procedure. The Graph‑RAG process serves to match user query with specific SASP modeling knowledge, thereby enhancing the modeling result. Results on ten classical signal processing problems demonstrate that the proposed approach (termed as MAG‑RAG) outperforms several AOM benchmarks.

PaperID: 11813, https://arxiv.org/pdf/2501.18155.pdf

Authors: Qixian Yu, Zining Cao, Zong Hui, Yuan Zhou

Title: Model Checking for Multi-Agent Systems Modeled By Epistemic Process Calculus

Abstract:
This paper presents a comprehensive framework for modeling and verifying multi‑agent systems. The paper introduce an Epistemic Process Calculus for multi‑agent systems, which formalizes the syntax and semantics to capture the essential features of agent behavior interactions and epistemic states. Building upon this calculus, we propose ATLE, an extension of Alternating‑time Temporal Logic incorporating epistemic operators to express complex properties related to agent epistemic state. To verify ATLE specifications, this paper presents a model checking algorithm that systematically explores the state space of a multi‑agent system and evaluates the satisfaction of the specified properties. Finally, a case study is given to demonstrate the method.

PaperID: 11814, https://arxiv.org/pdf/2501.18145.pdf

Authors: Devika Sondhi, Ananya Sharma, Diptikalyan Saha

Title: Utilizing API Response for Test Refinement

Abstract:
Most of the web services are offered in the form of RESTful APIs. This has led to an active research interest in API testing to ensure the reliability of these services. While most of the testing techniques proposed in the past rely on the API specification to generate the test cases, a major limitation of such an approach is that in the case of an incomplete or inconsistent specification, the test cases may not be realistic in nature and would result in a lot of 4xx response due to invalid input. This is indicative of poor test quality. Learning‑based approaches may learn about valid inputs but often require a large number of request‑response pairs to learn the constraints, making it infeasible to be readily used in the industry. To address this limitation, this paper proposes a dynamic test refinement approach that leverages the response message. The response is used to infer the point in the API testing flow where a test scenario fix is required. Using an intelligent agent, the approach adds constraints to the API specification that are further used to generate a test scenario accounting for the learned constraint from the response. Following a greedy approach, the iterative learning and refinement of test scenarios are obtained from the API testing system. The proposed approach led to a decrease in the number of 4xx responses, taking a step closer to generating more realistic test cases with high coverage that would aid in functional testing. A high coverage was obtained from a lesser number of API requests, as compared with the state‑of‑the‑art search‑based API Testing tools.

PaperID: 11815, https://arxiv.org/pdf/2501.18138.pdf

Authors: Woojun Kim, Katia Sycara

Title: B3C: A Minimalist Approach to Offline Multi-Agent Reinforcement Learning

Abstract:
Overestimation arising from selecting unseen actions during policy evaluation is a major challenge in offline reinforcement learning (RL). A minimalist approach in the single‑agent setting ‑‑ adding behavior cloning (BC) regularization to existing online RL algorithms ‑‑ has been shown to be effective; however, this approach is understudied in multi‑agent settings. In particular, overestimation becomes worse in multi‑agent settings due to the presence of multiple actions, resulting in the BC regularization‑based approach easily suffering from either over‑regularization or critic divergence. To address this, we propose a simple yet effective method, Behavior Cloning regularization with Critic Clipping (B3C), which clips the target critic value in policy evaluation based on the maximum return in the dataset and pushes the limit of the weight on the RL objective over BC regularization, thereby improving performance. Additionally, we leverage existing value factorization techniques, particularly non‑linear factorization, which is understudied in offline settings. Integrated with non‑linear value factorization, B3C outperforms state‑of‑the‑art algorithms on various offline multi‑agent benchmarks.

PaperID: 11816, https://arxiv.org/pdf/2501.18016.pdf

Authors: Matsive Ali, Sandesh Giri, Sen Liu, Qin Yang

Title: Digital Twin Synchronization: Bridging the Sim-RL Agent to a Real-Time Robotic Additive Manufacturing Control

Abstract:
With the rapid development of deep reinforcement learning technology, it gradually demonstrates excellent potential and is becoming the most promising solution in the robotics. However, in the smart manufacturing domain, there is still not too much research involved in dynamic adaptive control mechanisms optimizing complex processes. This research advances the integration of Soft Actor‑Critic (SAC) with digital twins for industrial robotics applications, providing a framework for enhanced adaptive real‑time control for smart additive manufacturing processing. The system architecture combines Unity's simulation environment with ROS2 for seamless digital twin synchronization, while leveraging transfer learning to efficiently adapt trained models across tasks. We demonstrate our methodology using a Viper X300s robot arm with the proposed hierarchical reward structure to address the common reinforcement learning challenges in two distinct control scenarios. The results show rapid policy convergence and robust task execution in both simulated and physical environments demonstrating the effectiveness of our approach.

PaperID: 11817, https://arxiv.org/pdf/2501.18002.pdf

Authors: Arthur Caetano, Kavya Verma, Atieh Taheri, Radha Kumaran, Zichen Chen, Jiaao Chen, Tobias Höllerer, Misha Sra

Title: Agentic Workflows for Conversational Human-AI Interaction Design

Abstract:
Conversational human‑AI interaction (CHAI) have recently driven mainstream adoption of AI. However, CHAI poses two key challenges for designers and researchers: users frequently have ambiguous goals and an incomplete understanding of AI functionalities, and the interactions are brief and transient, limiting opportunities for sustained engagement with users. AI agents can help address these challenges by suggesting contextually relevant prompts, by standing in for users during early design testing, and by helping users better articulate their goals. Guided by research‑through‑design, we explored agentic AI workflows through the development and testing of a probe over four iterations with 10 users. We present our findings through an annotated portfolio of design artifacts, and through thematic analysis of user experiences, offering solutions to the problems of ambiguity and transient in CHAI. Furthermore, we examine the limitations and possibilities of these AI agent workflows, suggesting that similar collaborative approaches between humans and AI could benefit other areas of design.

PaperID: 11818, https://arxiv.org/pdf/2501.17903.pdf

Authors: Jung-Hua Liu

Title: Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework

Abstract:
Multi‑agent systems commonly distribute tasks among specialized, autonomous agents, yet they often lack mechanisms to replace or reassign underperforming agents in real time. Inspired by the free‑agency model of Major League Baseball, the Reinforcement Learning Free Agent (RLFA) algorithm introduces a reward‑based mechanism to detect and remove agents exhibiting persistent underperformance and seamlessly insert more capable ones. Each agent internally uses a mixture‑of‑experts (MoE) approach, delegating incoming tasks to specialized sub‑models under the guidance of a gating function. A primary use case is fraud detection, where RLFA promptly swaps out an agent whose detection accuracy dips below a preset threshold. A new agent is tested in a probationary mode, and upon demonstrating superior performance, fully replaces the underperformer. This dynamic, free‑agency cycle ensures sustained accuracy, quicker adaptation to emerging threats, and minimal disruption to ongoing operations. By continually refreshing its roster of agents, the system fosters ongoing improvements and more resilient collaboration in multi‑agent Generative AI environments.

PaperID: 11819, https://arxiv.org/pdf/2501.17661.pdf

Authors: Scott Fredriksson, Yifan Bai, Akshit Saradagi, George Nikolakopoulos

Title: Multi-Agent Path Finding Using Conflict-Based Search and Structural-Semantic Topometric Maps

Abstract:
As industries increasingly adopt large robotic fleets, there is a pressing need for computationally efficient, practical, and optimal conflict‑free path planning for multiple robots. Conflict‑Based Search (CBS) is a popular method for multi‑agent path finding (MAPF) due to its completeness and optimality; however, it is often impractical for real‑world applications, as it is computationally intensive to solve and relies on assumptions about agents and operating environments that are difficult to realize. This article proposes a solution to overcome computational challenges and practicality issues of CBS by utilizing structural‑semantic topometric maps. Instead of running CBS over large grid‑based maps, the proposed solution runs CBS over a sparse topometric map containing structural‑semantic cells representing intersections, pathways, and dead ends. This approach significantly accelerates the MAPF process and reduces the number of conflict resolutions handled by CBS while operating in continuous time. In the proposed method, robots are assigned time ranges to move between topometric regions, departing from the traditional CBS assumption that a robot can move to any connected cell in a single time step. The approach is validated through real‑world multi‑robot path‑finding experiments and benchmarking simulations. The results demonstrate that the proposed MAPF method can be applied to real‑world non‑holonomic robots and yields significant improvement in computational efficiency compared to traditional CBS methods while improving conflict detection and resolution in cases of corridor symmetries.

PaperID: 11820, https://arxiv.org/pdf/2501.17614.pdf

Authors: Filiberto Fele, José M. Maestre, Eduardo F. Camacho

Title: Coalitional control: a bottom-up approach

Abstract:
The recent major developments in information technologies have opened interesting possibilities for the effective management of multi‑agent systems. In many cases, the important role of central control nodes can now be undertaken by several controllers in a distributed topology that suits better the structure of the system. This opens as well the possibility to promote cooperation between control agents in competitive environments, establishing links between controllers in order to adapt the exchange of critical information to the degree of subsystems' interactions. In this paper a bottom‑up approach to coalitional control is presented, where the structure of each agent's model predictive controller is adapted to the time‑variant coupling conditions, promoting the formation of coalitions ‑ clusters of control agents where communication is essential to ensure the cooperation ‑ whenever it can bring benefit to the overall system performance.

PaperID: 11821, https://arxiv.org/pdf/2501.17437.pdf

Authors: Mani Amani, Reza Akhavian

Title: Bayesian BIM-Guided Construction Robot Navigation with NLP Safety Prompts in Dynamic Environments

Abstract:
Construction robotics increasingly relies on natural language processing for task execution, creating a need for robust methods to interpret commands in complex, dynamic environments. While existing research primarily focuses on what tasks robots should perform, less attention has been paid to how these tasks should be executed safely and efficiently. This paper presents a novel probabilistic framework that uses sentiment analysis from natural language commands to dynamically adjust robot navigation policies in construction environments. The framework leverages Building Information Modeling (BIM) data and natural language prompts to create adaptive navigation strategies that account for varying levels of environmental risk and uncertainty. We introduce an object‑aware path planning approach that combines exponential potential fields with a grid‑based representation of the environment, where the potential fields are dynamically adjusted based on the semantic analysis of user prompts. The framework employs Bayesian inference to consolidate multiple information sources: the static data from BIM, the semantic content of natural language commands, and the implied safety constraints from user prompts. We demonstrate our approach through experiments comparing three scenarios: baseline shortest‑path planning, safety‑oriented navigation, and risk‑aware routing. Results show that our method successfully adapts path planning based on natural language sentiment, achieving a 50% improvement in minimum distance to obstacles when safety is prioritized, while maintaining reasonable path lengths. Scenarios with contrasting prompts, such as "dangerous" and "safe", demonstrate the framework's ability to modify paths. This approach provides a flexible foundation for integrating human knowledge and safety considerations into construction robot navigation.

PaperID: 11822, https://arxiv.org/pdf/2501.17385.pdf

Authors: Vartika Singh, Will Wesley, Philip N. Brown

Title: Optimal Utility Design with Arbitrary Information Networks

Abstract:
We consider multi‑agent systems with general information networks where an agent may only observe a subset of other agents. A system designer assigns local utility functions to the agents guiding their actions towards an outcome which determines the value of a given system objective. The aim is to design these local utility functions such that the Price of Anarchy (PoA), which equals the ratio of system objective at worst possible outcome to that at the optimal, is maximized. Towards this, we first develop a linear program (LP) that characterizes the PoA for any utility design and any information network. This leads to another LP that optimizes the PoA and derives the optimal utility design. Our work substantially generalizes existing approaches to the utility design problem. We also numerically show the robustness of proposed framework against unanticipated communication failures.

PaperID: 11823, https://arxiv.org/pdf/2501.17329.pdf

Authors: Ashish Bastola, Hao Wang, Abolfazl Razi

Title: Anomaly Detection in Cooperative Vehicle Perception Systems under Imperfect Communication

Abstract:
Anomaly detection is a critical requirement for ensuring safety in autonomous driving. In this work, we leverage Cooperative Perception to share information across nearby vehicles, enabling more accurate identification and consensus of anomalous behaviors in complex traffic scenarios. To account for the real‑world challenge of imperfect communication, we propose a cooperative‑perception‑based anomaly detection framework (CPAD), which is a robust architecture that remains effective under communication interruptions, thereby facilitating reliable performance even in low‑bandwidth settings. Since no multi‑agent anomaly detection dataset exists for vehicle trajectories, we introduce 15,000 different scenarios with a 90,000 trajectories benchmark dataset generated through rule‑based vehicle dynamics analysis. Empirical results demonstrate that our approach outperforms standard anomaly classification methods in F1‑score, AUC and showcase strong robustness to agent connection interruptions.

PaperID: 11824, https://arxiv.org/pdf/2501.17315.pdf

Authors: Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving

Title: A sketch of an AI control safety case

Abstract:
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation,"' where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.

PaperID: 11825, https://arxiv.org/pdf/2501.17258.pdf

Authors: Stephanie Houde, Kristina Brimijoin, Michael Muller, Steven I. Ross, Dario Andres Silva Moran, Gabriel Enrique Gonzalez, Siya Kunde, Morgan A. Foreman, Justin D. Weisz

Title: Controlling AI Agent Participation in Group Conversations: A Human-Centered Approach

Abstract:
Conversational AI agents are commonly applied within single‑user, turn‑taking scenarios. The interaction mechanics of these scenarios are trivial: when the user enters a message, the AI agent produces a response. However, the interaction dynamics are more complex within group settings. How should an agent behave in these settings? We report on two experiments aimed at uncovering users' experiences of an AI agent's participation within a group, in the context of group ideation (brainstorming). In the first study, participants benefited from and preferred having the AI agent in the group, but participants disliked when the agent seemed to dominate the conversation and they desired various controls over its interactive behaviors. In the second study, we created functional controls over the agent's behavior, operable by group members, to validate their utility and probe for additional requirements. Integrating our findings across both studies, we developed a taxonomy of controls for when, what, and where a conversational AI agent in a group should respond, who can control its behavior, and how those controls are specified and implemented. Our taxonomy is intended to aid AI creators to think through important considerations in the design of mixed‑initiative conversational agents.

PaperID: 11826, https://arxiv.org/pdf/2501.17167.pdf

Authors: Yaojie Hu, Qiang Zhou, Qihong Chen, Xiaopeng Li, Linbo Liu, Dejiao Zhang, Amit Kachroo, Talha Oz, Omer Tripp

Title: QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks

Abstract:
We introduce QualityFlow, a dynamic agentic workflow for program synthesis. Given the English description of a programming problem and a set of unit tests, the model's goal is to synthesize the correct program that solves the problem and passes the tests. QualityFlow includes large language model (LLM) agents resembling a software development team, including code generation, testing, and self‑debugging. We propose the LLM Quality Checker, which explicitly "imagines" whether the synthesized programs' execution would conform to the unit tests. The Quality Checks dynamically control the workflow, including actions to submit the final answer, clarify the problem statement, and revert previous workflow steps. Our experiments show that the Quality Checker can precisely accept any correct program, mitigate faulty synthesized tests, and prevent potential workflow deviation. QualityFlow establishes the state‑of‑the‑art results on four program synthesis benchmarks: MBPP, HumanEval, and stricter evaluations from MBPP‑EvalPlus and HumanEval‑EvalPlus.

PaperID: 11827, https://arxiv.org/pdf/2501.17079.pdf

Authors: Christian Fabian, Kai Cui, Heinz Koeppl

Title: Learning Mean Field Control on Sparse Graphs

Abstract:
Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi‑agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms which apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms for various examples on synthetic and real world networks with mean field algorithms based on Lp graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks due to the special design aiming at an important, but so far hard to solve class of MARL problems.

PaperID: 11828, https://arxiv.org/pdf/2501.17015.pdf

Authors: Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

Title: Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Abstract:
Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi‑agent behaviors is a key aspect. In multi‑agent simulation, the primary challenges include behavioral multimodality and closed‑loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods including continuous mixture models and GPT‑like discrete models. Furthermore, we introduce a closed‑loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the unified mixture model~(UniMM) framework, we recognize critical configurations from both model and data perspectives. We conduct a systematic examination of various model configurations, including positive component matching, continuous regression, prediction horizon, and the number of components. Moreover, our investigation into the data configuration highlights the pivotal role of closed‑loop samples in achieving realistic simulations. To extend the benefits of closed‑loop samples across a broader range of mixture models, we further address the shortcut learning and off‑policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor‑free, and anchor‑based models, all achieve state‑of‑the‑art performance on the WOSAC benchmark.

PaperID: 11829, https://arxiv.org/pdf/2501.16945.pdf

Authors: Xinyi Ni, Qiuyang Wang, Yukun Zhang, Pengyu Hong

Title: ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations

Abstract:
LLM‑based tool agents offer natural language interfaces, enabling users to seamlessly interact with computing services. While REST APIs are valuable resources for building such agents, they must first be transformed into AI‑compatible tools. Automatically generating AI‑compatible tools from REST API documents can greatly streamline tool agent development and minimize user learning curves. However, API documentation often suffers from a lack of standardization, inconsistent schemas, and incomplete information. To address these issues, we developed ToolFactory, an open‑source pipeline for automating tool generation from unstructured API documents. To enhance the reliability of the developed tools, we implemented an evaluation method to diagnose errors. Furthermore, we built a knowledge base of verified tools, which we leveraged to infer missing information from poorly documented APIs. We developed the API Extraction Benchmark, comprising 167 API documents and 744 endpoints in various formats, and designed a JSON schema to annotate them. This annotated dataset was utilized to train and validate ToolFactory. The experimental results highlight the effectiveness of ToolFactory. We also demonstrated ToolFactory by creating a domain‑specific AI agent for glycomaterials research. ToolFactory exhibits significant potential for facilitating the seamless integration of scientific REST APIs into AI workflows.

PaperID: 11830, https://arxiv.org/pdf/2501.16935.pdf

Authors: Suzie Grondin, Arthur Charpentier, Philipp Ratz

Title: Beyond Human Intervention: Algorithmic Collusion through Multi-Agent Learning Strategies

Abstract:
Collusion in market pricing is a concept associated with human actions to raise market prices through artificially limited supply. Recently, the idea of algorithmic collusion was put forward, where the human action in the pricing process is replaced by automated agents. Although experiments have shown that collusive market equilibria can be reached through such techniques, without the need for human intervention, many of the techniques developed remain susceptible to exploitation by other players, making them difficult to implement in practice. In this article, we explore a situation where an agent has a multi‑objective strategy, and not only learns to unilaterally exploit market dynamics originating from other algorithmic agents, but also learns to model the behaviour of other agents directly. Our results show how common critiques about the viability of algorithmic collusion in real‑life settings can be overcome through the usage of slightly more complex algorithms.

PaperID: 11831, https://arxiv.org/pdf/2501.16901.pdf

Authors: Dimitris Pantazatos, Mary Grammatikou, Vasilis Maglaris

Title: Enhancing Soft Skills in Network Management Education: A Study on the Impact of GenAI-based Virtual Assistants

Abstract:
The rapid evolution of technology in educational settings has opened new avenues for enhancing learning experiences, particularly in specialized fields like network management. This paper explores the novel integration of a GenAI‑based virtual assistant in a university‑level network management course, focusing on its impact on developing students' soft skills, notably critical thinking and problem‑solving abilities. Recognizing the increasing importance of these skills in the digital age, our study aims to assess the empirical effectiveness of this artificial intelligence‑driven educational tool in fostering these competencies among students.

PaperID: 11832, https://arxiv.org/pdf/2501.16847.pdf

Authors: Diego Deplano, Nicola Bastianello, Mauro Franceschelli, Karl H. Johansson

Title: Optimization and Learning in Open Multi-Agent Systems

Abstract:
Modern artificial intelligence relies on networks of agents that collect data, process information, and exchange it with neighbors to collaboratively solve optimization and learning problems. This article introduces a novel distributed algorithm to address a broad class of these problems in "open networks", where the number of participating agents may vary due to several factors, such as autonomous decisions, heterogeneous resource availability, or DoS attacks. Extending the current literature, the convergence analysis of the proposed algorithm is based on the newly developed "Theory of Open Operators", which characterizes an operator as open when the set of components to be updated changes over time, yielding to time‑varying operators acting on sequences of points of different dimensions and compositions. The mathematical tools and convergence results developed here provide a general framework for evaluating distributed algorithms in open networks, allowing to characterize their performance in terms of the punctual distance from the optimal solution, in contrast with regret‑based metrics that assess cumulative performance over a finite‑time horizon. As illustrative examples, the proposed algorithm is used to solve dynamic consensus or tracking problems on different metrics of interest, such as average, median, and min/max value, as well as classification problems with logistic loss functions.

PaperID: 11833, https://arxiv.org/pdf/2501.16689.pdf

Authors: Edward Y. Chang

Title: MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning

Abstract:
Artificial intelligence requires deliberate reasoning, temporal awareness, and effective constraint management, capabilities traditional LLMs often lack due to their reliance on pattern matching, limited self‑verification, and inconsistent constraint handling. We introduce Multi‑Agent Collaborative Intelligence (MACI), a framework comprising three key components: 1) a meta‑planner (MP) that identifies, formulates, and refines all roles and constraints of a task (e.g., wedding planning) while generating a dependency graph, with common‑sense augmentation to ensure realistic and practical constraints; 2) a collection of agents to facilitate planning and address task‑specific requirements; and 3) a run‑time monitor that manages plan adjustments as needed. By decoupling planning from validation, maintaining minimal agent context, and integrating common‑sense reasoning, MACI overcomes the aforementioned limitations and demonstrates robust performance in two scheduling problems.

PaperID: 11834, https://arxiv.org/pdf/2501.16673.pdf

Authors: Li Yin, Zhangyang Wang

Title: LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

Abstract:
Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi‑hop retrieval and question answering to autonomous agent workflows. Yet, prompt engineering ‑‑ the task of crafting textual inputs to effectively direct LLMs ‑‑ remains difficult and labor‑intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM‑AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient‑based methods (such as Text‑Grad) to multi‑component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM‑AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback‑akin to textual gradients ‑‑ that guide iterative prompt updates. Unlike prior single‑node approaches, LLM‑AutoDiff inherently accommodates functional nodes, preserves time‑sequential behavior in repeated calls (e.g., multi‑hop loops), and combats the "lost‑in‑the‑middle" problem by isolating distinct sub‑prompts (instructions, formats, or few‑shot examples). It further boosts training efficiency by focusing on error‑prone samples through selective gradient computation. Across diverse tasks, including single‑step classification, multi‑hop retrieval‑based QA, and agent‑driven pipelines, LLM‑AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph‑centric lens, LLM‑AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows ‑ mirroring the transformative role that automatic differentiation libraries have long played in neural network research.

PaperID: 11835, https://arxiv.org/pdf/2501.16661.pdf

Authors: Huichen Will Wang, Larry Birnbaum, Vidya Setlur

Title: Jupybara: Operationalizing a Design Space for Actionable Data Analysis and Storytelling with LLMs

Abstract:
Mining and conveying actionable insights from complex data is a key challenge of exploratory data analysis (EDA) and storytelling. To address this challenge, we present a design space for actionable EDA and storytelling. Synthesizing theory and expert interviews, we highlight how semantic precision, rhetorical persuasion, and pragmatic relevance underpin effective EDA and storytelling. We also show how this design space subsumes common challenges in actionable EDA and storytelling, such as identifying appropriate analytical strategies and leveraging relevant domain knowledge. Building on the potential of LLMs to generate coherent narratives with commonsense reasoning, we contribute Jupybara, an AI‑enabled assistant for actionable EDA and storytelling implemented as a Jupyter Notebook extension. Jupybara employs two strategies ‑‑ design‑space‑aware prompting and multi‑agent architectures ‑‑ to operationalize our design space. An expert evaluation confirms Jupybara's usability, steerability, explainability, and reparability, as well as the effectiveness of our strategies in operationalizing the design space framework with LLMs.

PaperID: 11836, https://arxiv.org/pdf/2501.16254.pdf

Authors: Chaehong Lee, Varatheepan Paramanayakam, Andreas Karatzas, Yanan Jian, Michael Fore, Heming Liao, Fuxun Yu, Ruopu Li, Iraklis Anagnostopoulos, Dimitrios Stamoulis

Title: Multi-Agent Geospatial Copilots for Remote Sensing Workflows

Abstract:
We present GeoLLM‑Squad, a geospatial Copilot that introduces the novel multi‑agent paradigm to remote sensing (RS) workflows. Unlike existing single‑agent approaches that rely on monolithic large language models (LLM), GeoLLM‑Squad separates agentic orchestration from geospatial task‑solving, by delegating RS tasks to specialized sub‑agents. Built on the open‑source AutoGen and GeoLLM‑Engine frameworks, our work enables the modular integration of diverse applications, spanning urban monitoring, forestry protection, climate analysis, and agriculture studies. Our results demonstrate that while single‑agent systems struggle to scale with increasing RS task complexity, GeoLLM‑Squad maintains robust performance, achieving a 17% improvement in agentic correctness over state‑of‑the‑art baselines. Our findings highlight the potential of multi‑agent AI in advancing RS workflows.

PaperID: 11837, https://arxiv.org/pdf/2501.16205.pdf

Authors: Arsalan Ali Malik, Emre Karabulut, Aydin Aysu

Title: EPOCH: Enabling Preemption Operation for Context Saving in Heterogeneous FPGA Systems

Abstract:
FPGAs are increasingly used in multi‑tenant cloud environments to offload compute‑intensive tasks from the main CPU. The operating system (OS) plays a vital role in identifying tasks suitable for offloading and coordinating between the CPU and FPGA for seamless task execution. The OS leverages preemption to manage CPU efficiently and balance CPU time; however, preempting tasks running on FPGAs without context loss remains challenging. Despite growing reliance on FPGAs, vendors have yet to deliver a solution that fully preserves and restores task context. This paper presents EPOCH, the first out‑of‑the‑box framework to seamlessly preserve the state of tasks running on multi‑tenant cloud FPGAs. EPOCH enables interrupting a tenant's execution at any arbitrary clock cycle, capturing its state, and saving this 'state snapshot' in off‑chip memory with fine‑grain granularity. Subsequently, when task resumption is required, EPOCH can resume execution from the saved 'state snapshot', eliminating the need to restart the task from scratch. EPOCH automates intricate processes, shields users from complexities, and synchronizes all underlying logic in a common clock domain, mitigating timing violations and ensuring seamless handling of interruptions. EPOCH proficiently captures the state of fundamental FPGA elements, such as look‑up tables, flip‑flops, block‑‑RAMs, and digital signal processing units. On real hardware, ZynQ‑XC7Z020 SoC, the proposed solution achieves context save and restore operations per frame in 62.2us and 67.4us, respectively.

PaperID: 11838, https://arxiv.org/pdf/2501.16173.pdf

Authors: Richard Willis, Yali Du, Joel Z Leibo, Michael Luck

Title: Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma

Abstract:
As autonomous agents become more prevalent, understanding their collective behaviour in strategic interactions is crucial. This study investigates the emergent cooperative tendencies of systems of Large Language Model (LLM) agents in a social dilemma. Unlike previous research where LLMs output individual actions, we prompt state‑of‑the‑art LLMs to generate complete strategies for iterated Prisoner's Dilemma. Using evolutionary game theory, we simulate populations of agents with different strategic dispositions (aggressive, cooperative, or neutral) and observe their evolutionary dynamics. Our findings reveal that different LLMs exhibit distinct biases affecting the relative success of aggressive versus cooperative strategies. This research provides insights into the potential long‑term behaviour of systems of deployed LLM‑based autonomous agents and highlights the importance of carefully considering the strategic environments in which they operate.

PaperID: 11839, https://arxiv.org/pdf/2501.16138.pdf

Authors: Richard Willis, Yali Du, Joel Z Leibo, Michael Luck

Title: Quantifying the Self-Interest Level of Markov Social Dilemmas

Abstract:
This paper introduces a novel method for estimating the self‑interest level of Markov social dilemmas. We extend the concept of self‑interest level from normal‑form games to Markov games, providing a quantitative measure of the minimum reward exchange required to align individual and collective interests. We demonstrate our method on three environments from the Melting Pot suite, representing either common‑pool resources or public goods. Our results illustrate how reward exchange can enable agents to transition from selfish to collective equilibria in a Markov social dilemma. This work contributes to multi‑agent reinforcement learning by providing a practical tool for analysing complex, multistep social dilemmas. Our findings offer insights into how reward structures can promote or hinder cooperation, with potential applications in areas such as mechanism design.

PaperID: 11840, https://arxiv.org/pdf/2501.16086.pdf

Authors: Honglin Wen, Pierre Pinson

Title: Value-oriented forecast reconciliation for renewables in electricity markets

Abstract:
Forecast reconciliation is considered an effective method to achieve coherence (within a forecast hierarchy) and to improve forecast quality. However, the value of reconciled forecasts in downstream decision‑making tasks has been mostly overlooked. In a multi‑agent setup with heterogeneous loss functions, this oversight may lead to unfair outcomes, hence resulting in conflicts during the reconciliation process. To address this, we propose a value‑oriented forecast reconciliation approach that focuses on the forecast value for all individual agents. Fairness is ensured through the use of a Nash bargaining framework. Specifically, we model this problem as a cooperative bargaining game, where each agent aims to optimize their own gain while contributing to the overall reconciliation process. We then present a primal‑dual algorithm for parameter estimation based on empirical risk minimization. From an application perspective, we consider an aggregated wind energy trading problem, where profits are distributed using a weighted allocation rule. We demonstrate the effectiveness of our approach through several numerical experiments, showing that it consistently results in increased profits for all agents involved.

PaperID: 11841, https://arxiv.org/pdf/2501.15991.pdf

Authors: Andrii Mironchenko

Title: Modeling and stability analysis of live systems with time-varying dimension

Abstract:
A major limitation of the classical control theory is the assumption that the state space and its dimension do not change with time. This prevents analyzing and even formalizing the stability and control problems for open multi‑agent systems whose agents may enter or leave the network, industrial processes where the sensors or actuators may be exchanged frequently, smart grids, etc. In this work, we propose a framework of live systems that covers a rather general class of systems with a time‑varying state space. We argue that input‑to‑state stability is a proper stability notion for this class of systems, and many of the classic tools and results, such as Lyapunov methods and superposition theorems, can be extended to this setting.

PaperID: 11842, https://arxiv.org/pdf/2501.15850.pdf

Authors: Yuewen Mei, Tong Nie, Jian Sun, Ye Tian

Title: LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation for Autonomous Driving with Large Language Models

Abstract:
Ensuring and improving the safety of autonomous driving systems (ADS) is crucial for the deployment of highly automated vehicles, especially in safety‑critical events. To address the rarity issue, adversarial scenario generation methods are developed, in which behaviors of traffic participants are manipulated to induce safety‑critical events. However, existing methods still face two limitations. First, identification of the adversarial participant directly impacts the effectiveness of the generation. However, the complexity of real‑world scenarios, with numerous participants and diverse behaviors, makes identification challenging. Second, the potential of generated safety‑critical scenarios to continuously improve ADS performance remains underexplored. To address these issues, we propose LLM‑attacker: a closed‑loop adversarial scenario generation framework leveraging large language models (LLMs). Specifically, multiple LLM agents are designed and coordinated to identify optimal attackers. Then, the trajectories of the attackers are optimized to generate adversarial scenarios. These scenarios are iteratively refined based on the performance of ADS, forming a feedback loop to improve ADS. Experimental results show that LLM‑attacker can create more dangerous scenarios than other methods, and the ADS trained with it achieves a collision rate half that of training with normal scenarios. This indicates the ability of LLM‑attacker to test and enhance the safety and robustness of ADS. Video demonstrations are provided at: https://drive.google.com/file/d/1Zv4V3iG7825oyiKbUwS2Y‑rR0DQIE1ZA/view.

PaperID: 11843, https://arxiv.org/pdf/2501.15826.pdf

Authors: Qi Chen, Dexi Liu

Title: MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer

Abstract:
The Mental Health Question Answer (MHQA) task requires the seeker and supporter to complete the support process in one‑turn dialogue. Given the richness of help‑seeker posts, supporters must thoroughly understand the content and provide logical, comprehensive, and well‑structured responses. Previous works in MHQA mostly focus on single‑agent approaches based on the cognitive element of Cognitive Behavioral Therapy (CBT), but they overlook the interactions among various CBT elements, such as emotion and cognition. This limitation hinders the models' ability to thoroughly understand the distress of help‑seekers. To address this, we propose a framework named Multi‑Agent Deductive Planning (MADP), which is based on the interactions between the various psychological elements of CBT. This method guides Large Language Models (LLMs) to achieve a deeper understanding of the seeker's context and provide more personalized assistance based on individual circumstances. Furthermore, we construct a new dataset based on the MADP framework and use it to fine‑tune LLMs, resulting in a specialized model named MADP‑LLM. We conduct extensive experiments, including comparisons with multiple LLMs, human evaluations, and automatic evaluations, to validate the effectiveness of the MADP framework and MADP‑LLM.

PaperID: 11844, https://arxiv.org/pdf/2501.15802.pdf

Authors: Lanpei Li, Jack Bell, Massimo Coppola, Vincenzo Lomonaco

Title: Adaptive AI-based Decentralized Resource Management in the Cloud-Edge Continuum

Abstract:
In the Cloud‑Edge Continuum, dynamic infrastructure change and variable workloads complicate efficient resource management. Centralized methods can struggle to adapt, whilst purely decentralized policies lack global oversight. This paper proposes a hybrid framework using Graph Neural Network (GNN) embeddings and collaborative multi‑agent reinforcement learning (MARL). Local agents handle neighbourhood‑level decisions, and a global orchestrator coordinates system‑wide. This work contributes to decentralized application placement strategies with centralized oversight, GNN integration and collaborative MARL for efficient, adaptive and scalable resource management.

PaperID: 11845, https://arxiv.org/pdf/2501.15735.pdf

Authors: Madan Dahal, Mojtaba Vaezi

Title: Selective Experience Sharing in Reinforcement Learning Enhances Interference Management

Abstract:
We propose a novel multi‑agent reinforcement learning (RL) approach for inter‑cell interference mitigation, in which agents selectively share their experiences with other agents. Each base station is equipped with an agent, which receives signal‑to‑interference‑plus‑noise ratio from its own associated users. This information is used to evaluate and selectively share experiences with neighboring agents. The idea is that even a few pertinent experiences from other agents can lead to effective learning. This approach enables fully decentralized training and execution, minimizes information sharing between agents and significantly reduces communication overhead, which is typically the burden of interference management. The proposed method outperforms state‑of‑the‑art multi‑agent RL techniques where training is done in a decentralized manner. Furthermore, with a 75% reduction in experience sharing, the proposed algorithm achieves 98% of the spectral efficiency obtained by algorithms sharing all experiences.

PaperID: 11846, https://arxiv.org/pdf/2501.15734.pdf

Authors: Shavbo Salehi, Pedro Enrique Iturria-Rivera, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Yigit Ozcan, Melike Erol-Kantarci

Title: Prioritized Value-Decomposition Network for Explainable AI-Enabled Network Slicing

Abstract:
Network slicing aims to enhance flexibility and efficiency in next‑generation wireless networks by allocating the right resources to meet the diverse requirements of various applications. Managing these slices with machine learning (ML) algorithms has emerged as a promising approach however explainability has been a challenge. To this end, several Explainable Artificial Intelligence (XAI) frameworks have been proposed to address the opacity in decision‑making in many ML methods. In this paper, we propose a Prioritized Value‑Decomposition Network (PVDN) as an XAI‑driven approach for resource allocation in a multi‑agent network slicing system. The PVDN method decomposes the global value function into individual contributions and prioritizes slice outputs, providing an explanation of how resource allocation decisions impact system performance. By incorporating XAI, PVDN offers valuable insights into the decision‑making process, enabling network operators to better understand, trust, and optimize slice management strategies. Through simulations, we demonstrate the effectiveness of the PVDN approach with improving the throughput by 67% and 16%, while reducing latency by 35% and 22%, compared to independent and VDN‑based resource allocation methods.

PaperID: 11847, https://arxiv.org/pdf/2501.15721.pdf

Authors: Tadahiro Taniguchi

Title: On Parallelism in Music and Language: A Perspective from Symbol Emergence Systems based on Probabilistic Generative Models

Abstract:
Music and language are structurally similar. Such structural similarity is often explained by generative processes. This paper describes the recent development of probabilistic generative models (PGMs) for language learning and symbol emergence in robotics. Symbol emergence in robotics aims to develop a robot that can adapt to real‑world environments and human linguistic communications and acquire language from sensorimotor information alone (i.e., in an unsupervised manner). This is regarded as a constructive approach to symbol emergence systems. To this end, a series of PGMs have been developed, including those for simultaneous phoneme and word discovery, lexical acquisition, object and spatial concept formation, and the emergence of a symbol system. By extending the models, a symbol emergence system comprising a multi‑agent system in which a symbol system emerges is revealed to be modeled using PGMs. In this model, symbol emergence can be regarded as collective predictive coding. This paper expands on this idea by combining the theory that ''emotion is based on the predictive coding of interoceptive signals'' and ''symbol emergence systems,'' and describes the possible hypothesis of the emergence of meaning in music.

PaperID: 11848, https://arxiv.org/pdf/2501.15695.pdf

Authors: Hung Du, Srikanth Thudumu, Hy Nguyen, Rajesh Vasa, Kon Mouzakis

Title: Contextual Knowledge Sharing in Multi-Agent Reinforcement Learning with Decentralized Communication and Coordination

Abstract:
Decentralized Multi‑Agent Reinforcement Learning (Dec‑MARL) has emerged as a pivotal approach for addressing complex tasks in dynamic environments. Existing Multi‑Agent Reinforcement Learning (MARL) methodologies typically assume a shared objective among agents and rely on centralized control. However, many real‑world scenarios feature agents with individual goals and limited observability of other agents, complicating coordination and hindering adaptability. Existing Dec‑MARL strategies prioritize either communication or coordination, lacking an integrated approach that leverages both. This paper presents a novel Dec‑MARL framework that integrates peer‑to‑peer communication and coordination, incorporating goal‑awareness and time‑awareness into the agents' knowledge‑sharing processes. Our framework equips agents with the ability to (i) share contextually relevant knowledge to assist other agents, and (ii) reason based on information acquired from multiple agents, while considering their own goals and the temporal context of prior knowledge. We evaluate our approach through several complex multi‑agent tasks in environments with dynamically appearing obstacles. Our work demonstrates that incorporating goal‑aware and time‑aware knowledge sharing significantly enhances overall performance.

PaperID: 11849, https://arxiv.org/pdf/2501.15495.pdf

Authors: Alberto Castagna

Title: Expert-Free Online Transfer Learning in Multi-Agent Reinforcement Learning

Abstract:
Reinforcement Learning (RL) enables an intelligent agent to optimise its performance in a task by continuously taking action from an observed state and receiving a feedback from the environment in form of rewards. RL typically uses tables or linear approximators to map state‑action tuples that maximises the reward. Combining RL with deep neural networks (DRL) significantly increases its scalability and enables it to address more complex problems than before. However, DRL also inherits downsides from both RL and deep learning. Despite DRL improves generalisation across similar state‑action pairs when compared to simpler RL policy representations like tabular methods, it still requires the agent to adequately explore the state‑action space. Additionally, deep methods require more training data, with the volume of data escalating with the complexity and size of the neural network. As a result, deep RL requires a long time to collect enough agent‑environment samples and to successfully learn the underlying policy. Furthermore, often even a slight alteration to the task invalidates any previous acquired knowledge. To address these shortcomings, Transfer Learning (TL) has been introduced, which enables the use of external knowledge from other tasks or agents to enhance a learning process. The goal of TL is to reduce the learning complexity for an agent dealing with an unfamiliar task by simplifying the exploration process. This is achieved by lowering the amount of new information required by its learning model, resulting in a reduced overall convergence time...

PaperID: 11850, https://arxiv.org/pdf/2501.15493.pdf

Authors: Zhihan Zheng, Haitao Yuan, Minxiao Chen, Shangguang Wang

Title: RLER-TTE: An Efficient and Effective Framework for En Route Travel Time Estimation with Reinforcement Learning

Abstract:
En Route Travel Time Estimation (ER‑TTE) aims to learn driving patterns from traveled routes to achieve rapid and accurate real‑time predictions. However, existing methods ignore the complexity and dynamism of real‑world traffic systems, resulting in significant gaps in efficiency and accuracy in real‑time scenarios. Addressing this issue is a critical yet challenging task. This paper proposes a novel framework that redefines the implementation path of ER‑TTE to achieve highly efficient and effective predictions. Firstly, we introduce a novel pipeline consisting of a Decision Maker and a Predictor to rectify the inefficient prediction strategies of current methods. The Decision Maker performs efficient real‑time decisions to determine whether the high‑complexity prediction model in the Predictor needs to be invoked, and the Predictor recalculates the travel time or infers from historical prediction results based on these decisions. Next, to tackle the dynamic and uncertain real‑time scenarios, we model the online decision‑making problem as a Markov decision process and design an intelligent agent based on reinforcement learning for autonomous decision‑making. Moreover, to fully exploit the spatio‑temporal correlation between online data and offline data, we meticulously design feature representation and encoding techniques based on the attention mechanism. Finally, to improve the flawed training and evaluation strategies of existing methods, we propose an end‑to‑end training and evaluation approach, incorporating curriculum learning strategies to manage spatio‑temporal data for more advanced training algorithms. Extensive evaluations on three real‑world datasets confirm that our method significantly outperforms state‑of‑the‑art solutions in both accuracy and efficiency.

PaperID: 11851, https://arxiv.org/pdf/2501.15283.pdf

Authors: Naihao Deng, Rada Mihalcea

Title: Are Human Interactions Replicable by Generative Agents? A Case Study on Pronoun Usage in Hierarchical Interactions

Abstract:
As Large Language Models (LLMs) advance in their capabilities, researchers have increasingly employed them for social simulation. In this paper, we investigate whether interactions among LLM agents resemble those of humans. Specifically, we focus on the pronoun usage difference between leaders and non‑leaders, examining whether the simulation would lead to human‑like pronoun usage patterns during the LLMs' interactions. Our evaluation reveals the significant discrepancies between LLM‑based simulations and human pronoun usage, with prompt‑based or specialized agents failing to demonstrate human‑like pronoun usage patterns. In addition, we reveal that even if LLMs understand the human pronoun usage patterns, they fail to demonstrate them in the actual interaction process. Our study highlights the limitations of social simulations based on LLM agents, urging caution in using such social simulation in practitioners' decision‑making process.

PaperID: 11852, https://arxiv.org/pdf/2501.15228.pdf

Authors: Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao

Title: Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning

Abstract:
Retrieval‑augmented generation (RAG) is widely utilized to incorporate external knowledge into large language models, thereby enhancing factuality and reducing hallucinations in question‑answering (QA) tasks. A standard RAG pipeline consists of several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine‑tuning, which can lead to misalignments between the objectives of individual components and the overarching aim of generating accurate answers. Although recent efforts have explored using reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on simple pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these limitations, we propose treating the complex RAG pipeline with multiple components as a multi‑agent cooperative task, in which each component can be regarded as an RL agent. Specifically, we present MMOA‑RAG, Multi‑Module joint Optimization Algorithm for RAG, which employs multi‑agent reinforcement learning to harmonize all agents' goals toward a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA benchmarks demonstrate that MMOA‑RAG effectively boost the overall performance of the pipeline and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and demonstrate MMOA‑RAG can be adapted to different RAG pipelines and benchmarks.

PaperID: 11853, https://arxiv.org/pdf/2501.14917.pdf

Authors: Sara Abdali, Can Goksen, Michael Solodko, Saeed Amizadeh, Julie E. Maybee, Kazuhito Koishida

Title: Self-reflecting Large Language Models: A Hegelian Dialectical Approach

Abstract:
Investigating NLP through a philosophical lens has recently caught researchers' eyes, as it bridges computational methods with classical schools of philosophy. This paper introduces a philosophical framework inspired by the Hegelian Dialectic to enable LLMs' self‑reflection, utilizing a self‑dialectical approach to emulate internal critiques and synthesize new scientific ideas (spanning domains such as mathematics, physics, and more). Additionally, we explore the effect of generation temperature in LLMs by introducing a dynamic annealing approach, which encourages creativity in the early stages and gradually focuses on refinement and nuance, as well as a constant‑temperature strategy. Furthermore, we implement a Multi‑Agent Majority Voting (MAMV) strategy to assess the validity and novelty of the generated ideas, which proves useful in the absence of domain experts. We also evaluate the effectiveness of our method in generating novel scientific ideas and improving LLMs' reasoning capabilities. Our experiments demonstrate promising results in ideation, along with significant improvements in mathematical and symbolic reasoning.

PaperID: 11854, https://arxiv.org/pdf/2501.14844.pdf

Authors: Erica Coppolillo, Giuseppe Manco, Luca Maria Aiello

Title: Unmasking Conversational Bias in AI Multiagent Systems

Abstract:
Detecting biases in the outputs produced by generative models is essential to reduce the potential risks associated with their application in critical settings. However, the majority of existing methodologies for identifying biases in generated text consider the models in isolation and neglect their contextual applications. Specifically, the biases that may arise in multi‑agent systems involving generative models remain under‑researched. To address this gap, we present a framework designed to quantify biases within multi‑agent systems of conversational Large Language Models (LLMs). Our approach involves simulating small echo chambers, where pairs of LLMs, initialized with aligned perspectives on a polarizing topic, engage in discussions. Contrary to expectations, we observe significant shifts in the stance expressed in the generated messages, particularly within echo chambers where all agents initially express conservative viewpoints, in line with the well‑documented political bias of many LLMs toward liberal positions. Crucially, the bias observed in the echo‑chamber experiment remains undetected by current state‑of‑the‑art bias detection methods that rely on questionnaires. This highlights a critical need for the development of a more sophisticated toolkit for bias detection and mitigation for AI multi‑agent systems. The code to perform the experiments is publicly available at https://anonymous.4open.science/r/LLMsConversationalBias‑7725.

PaperID: 11855, https://arxiv.org/pdf/2501.14819.pdf

Authors: Paul Perrone

Title: A Comprehensive Mathematical and System-Level Analysis of Autonomous Vehicle Timelines

Abstract:
Fully autonomous vehicles (AVs) continue to spark immense global interest, yet predictions on when they will operate safely and broadly remain heavily debated. This paper synthesizes two distinct research traditions: computational complexity and algorithmic constraints versus reliability growth modeling and real‑world testing to form an integrated, quantitative timeline for future AV deployment. We propose a mathematical framework that unifies NP‑hard multi‑agent path planning analyses, high‑performance computing (HPC) projections, and extensive Crow‑AMSAA reliability growth calculations, factoring in operational design domain (ODD) variations, severity, and partial vs. full domain restrictions. Through category‑specific case studies (e.g., consumer automotive, robo‑taxis, highway trucking, industrial and defense applications), we show how combining HPC limitations, safety demonstration requirements, production/regulatory hurdles, and parallel/serial test strategies can push out the horizon for universal Level 5 deployment by up to several decades. Conversely, more constrained ODDs; like fenced industrial sites or specialized defense operations; may see autonomy reach commercial viability in the near‑to‑medium term. Our findings illustrate that while targeted domains can achieve automated service sooner, widespread driverless vehicles handling every environment remain far from realized. This paper thus offers a unique and rigorous perspective on why AV timelines extend well beyond short‑term optimism, underscoring how each dimension of complexity and reliability imposes its own multi‑year delays. By quantifying these constraints and exploring potential accelerators (e.g., advanced AI hardware, infrastructure up‑grades), we provide a structured baseline for researchers, policymakers, and industry stakeholders to more accurately map their expectations and investments in AV technology.

PaperID: 11856, https://arxiv.org/pdf/2501.14772.pdf

Authors: Dinh-Nguyen Nguyen, Raymond Kai-Yu Tong, Ngoc-Duy Dinh

Title: DropMicroFluidAgents (DMFAs): Autonomous Droplet Microfluidic Research Framework Through Large Language Model Agents

Abstract:
Applying Large language models (LLMs) within specific domains requires substantial adaptation to account for the unique terminologies, nuances, and context‑specific challenges inherent to those areas. Here, we introduce DropMicroFluidAgents (DMFAs), an advanced language‑driven framework leveraging state‑of‑the‑art pre‑trained LLMs. DMFAs employs LLM agents to perform two key functions: (1) delivering focused guidance, answers, and suggestions specific to droplet microfluidics and (2) generating machine learning models to optimise and automate the design of droplet microfluidic devices, including the creation of code‑based computer‑aided design (CAD) scripts to enable rapid and precise design execution. Experimental evaluations demonstrated that the integration of DMFAs with the LLAMA3.1 model yielded the highest accuracy of 76.15%, underscoring the significant performance enhancement provided by agent integration. This effect was particularly pronounced when DMFAs were paired with the GEMMA2 model, resulting in a 34.47% improvement in accuracy compared to the standalone GEMMA2 configuration. This study demonstrates the effective use of LLM agents in droplet microfluidics research as powerful tools for automating workflows, synthesising knowledge, optimising designs, and interacting with external systems. These capabilities enable their application across education and industrial support, driving greater efficiency in scientific discovery and innovation.

PaperID: 11857, https://arxiv.org/pdf/2501.14737.pdf

Authors: Xin-Cheng Wen, Jiaxin Ye, Cuiyun Gao, Lianwei Wu, Qing Liao

Title: EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment

Abstract:
Software Vulnerability (SV) assessment is a crucial process of determining different aspects of SVs (e.g., attack vectors and scope) for developers to effectively prioritize efforts in vulnerability mitigation. It presents a challenging and laborious process due to the complexity of SVs and the scarcity of labeled data. To mitigate the above challenges, we introduce EvalSVA, a multi‑agent evaluators team to autonomously deliberate and evaluate various aspects of SV assessment. Specifically, we propose a multi‑agent‑based framework to simulate vulnerability assessment strategies in real‑world scenarios, which employs multiple Large Language Models (LLMs) into an integrated group to enhance the effectiveness of SV assessment in the limited data. We also design diverse communication strategies to autonomously discuss and assess different aspects of SV. Furthermore, we construct a multi‑lingual SV assessment dataset based on the new standard of CVSS, comprising 699, 888, and 1,310 vulnerability‑related commits in C++, Python, and Java, respectively. Our experimental results demonstrate that EvalSVA averagely outperforms the 44.12% accuracy and 43.29% F1 for SV assessment compared with the previous methods. It shows that EvalSVA offers a human‑like process and generates both reason and answer for SV assessment. EvalSVA can also aid human experts in SV assessment, which provides more explanation and details for SV assessment.

PaperID: 11858, https://arxiv.org/pdf/2501.14731.pdf

Authors: Zexing Xu, Zhuang Luo, Yichuan Li, Kyumin Lee, S. Rasoul Etesami

Title: From Critique to Clarity: A Pathway to Faithful and Personalized Code Explanations with Large Language Models

Abstract:
In the realm of software development, providing accurate and personalized code explanations is crucial for both technical professionals and business stakeholders. Technical professionals benefit from enhanced understanding and improved problem‑solving skills, while business stakeholders gain insights into project alignments and transparency. Despite the potential, generating such explanations is often time‑consuming and challenging. This paper presents an innovative approach that leverages the advanced capabilities of large language models (LLMs) to generate faithful and personalized code explanations. Our methodology integrates prompt enhancement, self‑correction mechanisms, personalized content customization, and interaction with external tools, facilitated by collaboration among multiple LLM agents. We evaluate our approach using both automatic and human assessments, demonstrating that our method not only produces accurate explanations but also tailors them to individual user preferences. Our findings suggest that this approach significantly improves the quality and relevance of code explanations, offering a valuable tool for developers and stakeholders alike.

PaperID: 11859, https://arxiv.org/pdf/2501.14568.pdf

Authors: Thore Gerlach, Loong Kuan Lee, Frédéric Barbaresco, Nico Piatkowski

Title: Hybrid Quantum-Classical Multi-Agent Pathfinding

Abstract:
Multi‑Agent Path Finding (MAPF) focuses on determining conflict‑free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum Computing (QC) is a promising candidate in overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum‑classical MAPF algorithms which are based on branch‑andcut‑and‑price. QC is integrated by iteratively solving QUBO problems, based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulationsand state‑of‑the‑art MAPF solvers.

PaperID: 11860, https://arxiv.org/pdf/2501.14488.pdf

Authors: Yuhan Hu, Yirong Sun, Yanjun Chen, Xinghao Chen, Xiaoyu Shen, Wei Zhang

Title: Breaking the Pre-Planning Barrier: Adaptive Real-Time Coordination of Heterogeneous UAVs

Abstract:
Unmanned Aerial Vehicles (UAVs) offer significant potential in dynamic, perception‑intensive tasks such as search and rescue and environmental monitoring; however, their effectiveness is severely restricted by conventional pre‑planned routing methods, which lack the flexibility to respond in real‑time to evolving task demands, unexpected disturbances, and localized view limitations in real‑world scenarios. To address this fundamental limitation, we introduce a novel multi‑agent reinforcement learning framework named Heterogeneous Graph Attention Multi‑agent Deep Deterministic Policy Gradient (HGAM), uniquely designed to enable adaptive real‑time coordination between mission UAVs (MUAVs) and charging UAVs (CUAVs). HGAM specifically addresses the previously unsolved challenge of enabling precise, decentralized continuous‑action coordination solely based on local, heterogeneous graph‑based observations. Extensive simulations demonstrate that HGAM substantially surpasses existing methods, achieving, for example, a 30% improvement in data collection coverage and a 20% increase in charging efficiency, providing crucial insights and foundations for the future deployment of intelligent, flexible UAV networks in complex, dynamic environments.

PaperID: 11861, https://arxiv.org/pdf/2501.14451.pdf

Authors: Linfeng Liang, Xi Zheng

Title: MARL-OT: Multi-Agent Reinforcement Learning Guided Online Fuzzing to Detect Safety Violation in Autonomous Driving Systems

Abstract:
Autonomous Driving Systems (ADSs) are safety‑critical, as real‑world safety violations can result in significant losses. Rigorous testing is essential before deployment, with simulation testing playing a key role. However, ADSs are typically complex, consisting of multiple modules such as perception and planning, or well‑trained end‑to‑end autonomous driving systems. Offline methods, such as the Genetic Algorithm (GA), can only generate predefined trajectories for dynamics, which struggle to cause safety violations for ADSs rapidly and efficiently in different scenarios due to their evolutionary nature. Online methods, such as single‑agent reinforcement learning (RL), can quickly adjust the dynamics' trajectory online to adapt to different scenarios, but they struggle to capture complex corner cases of ADS arising from the intricate interplay among multiple vehicles. Multi‑agent reinforcement learning (MARL) has a strong ability in cooperative tasks. On the other hand, it faces its own challenges, particularly with convergence. This paper introduces MARL‑OT, a scalable framework that leverages MARL to detect safety violations of ADS resulting from surrounding vehicles' cooperation. MARL‑OT employs MARL for high‑level guidance, triggering various dangerous scenarios for the rule‑based online fuzzer to explore potential safety violations of ADS, thereby generating dynamic, realistic safety violation scenarios. Our approach improves the detected safety violation rate by up to 136.2% compared to the state‑of‑the‑art (SOTA) testing technique.

PaperID: 11862, https://arxiv.org/pdf/2501.14387.pdf

Authors: Simone Bolettieri, Raffaele Bruno, Enzo Mingozzi

Title: Application-Aware Resource Allocation and Data Management for MEC-assisted IoT Service Providers

Abstract:
To support the growing demand for data‑intensive and low‑latency IoT applications, Multi‑Access Edge Computing (MEC) is emerging as an effective edge‑computing approach enabling the execution of delay‑sensitive processing tasks close to end‑users. However, most of the existing works on resource allocation and service placement in MEC systems overlook the unique characteristics of new IoT use cases. For instance, many IoT applications require the periodic execution of computing tasks on real‑time data streams that originate from devices dispersed over a wide area. Thus, users requesting IoT services are typically distant from the data producers. To fill this gap, the contribution of this work is two‑fold. Firstly, we propose a MEC‑compliant architectural solution to support the operation of multiple IoT service providers over a common MEC platform deployment, which enables the steering and shaping of IoT data transport within the platform. Secondly, we model the problem of service placement and data management in the proposed MEC‑based solution taking into account the dependencies at the data level between IoT services and sensing resources. Our model also considers that caches can be deployed on MEC hosts, to allow the sharing of the same data between different IoT services with overlapping geographical scope, and provides support for IoT services with heterogeneous QoS requirements, such as different frequencies of periodic task execution. Due to the complexity of the optimisation problem, a heuristic algorithm is proposed using linear relaxation and rounding techniques. Extensive simulation results demonstrate the efficiency of the proposed approach, especially when traffic demands generated by the service requests are not uniform.

PaperID: 11863, https://arxiv.org/pdf/2501.14304.pdf

Authors: Bingzheng Gan, Yufan Zhao, Tianyi Zhang, Jing Huang, Yusu Li, Shu Xian Teo, Changwang Zhang, Wei Shi

Title: MASTER: A Multi-Agent System with LLM Specialized MCTS

Abstract:
Large Language Models (LLM) are increasingly being explored for problem‑solving tasks. However, their strategic planning capability is often viewed with skepticism. Recent studies have incorporated the Monte Carlo Tree Search (MCTS) algorithm to augment the planning capacity of LLM. Despite its potential, MCTS relies on extensive sampling simulations to approximate the true reward distribution, which leads to two primary issues. Firstly, MCTS is effective for tasks like the Game of Go, where simulation results can yield objective rewards (e.g., 1 for a win and 0 for a loss). However, for tasks such as question answering, the result of a simulation is the answer to the question, which cannot yield an objective reward without the ground truth. Secondly, obtaining statistically significant reward estimations typically requires a sample size exceeding 30 simulations, resulting in excessive token usage and time consumption. To address these challenges, we present the Multi‑Agent System with Tactical Execution and Reasoning using LLM Specialized MCTS (MASTER), a novel framework that coordinates agent recruitment and communication through LLM specialized MCTS. This system autonomously adjusts the number of agents based on task complexity and ensures focused communication among them. Comprehensive experiments across various tasks demonstrate the effectiveness of our proposed framework. It achieves 76% accuracy on HotpotQA and 80% on WebShop, setting new state‑of‑the‑art performance on these datasets.

PaperID: 11864, https://arxiv.org/pdf/2501.14205.pdf

Authors: Minrui Xu, Dusit Niyato, Christopher G. Brinton

Title: Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading

Abstract:
Large Language Models (LLMs) can perform zero‑shot learning on unseen tasks and few‑shot learning on complex reasoning tasks. However, resource‑limited mobile edge networks struggle to support long‑context LLM serving for LLM agents during multi‑round interactions with users. Unlike stateless computation offloading and static service offloading in edge computing, optimizing LLM serving at edge servers is challenging because LLMs continuously learn from context which raises accuracy, latency, and resource consumption dynamics. In this paper, we propose a joint model caching and inference offloading framework that utilizes test‑time deep reinforcement learning (T2DRL) to optimize deployment and execution strategies for long‑context LLM serving. In this framework, we analyze the performance convergence and design an optimization problem considering the utilization of context windows in LLMs. Furthermore, the T2DRL algorithm can learn in both the training phase and the testing phase to proactively manage cached models and service requests and adapt to context changes and usage patterns during execution. To further enhance resource allocation efficiency, we propose a double Dutch auction (DDA) mechanism, which dynamically matches supply and demand while maximizing social welfare. Finally, experimental results demonstrate that the T2DRL algorithm can reduce system costs by at least 30% compared to baselines while guaranteeing the performance of LLM agents in real‑world perception and reasoning tasks.

PaperID: 11865, https://arxiv.org/pdf/2501.14189.pdf

Authors: Saaduddin Mahmud, Dorian Benhamou Goldfajn, Shlomo Zilberstein

Title: Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models

Abstract:
Distributed Constraint Optimization Problems (DCOPs) offer a powerful framework for multi‑agent coordination but often rely on labor‑intensive, manual problem construction. To address this, we introduce VL‑DCOPs, a framework that takes advantage of large multimodal foundation models (LFMs) to automatically generate constraints from both visual and linguistic instructions. We then introduce a spectrum of agent archetypes for solving VL‑DCOPs: from a neuro‑symbolic agent that delegates some of the algorithmic decisions to an LFM, to a fully neural agent that depends entirely on an LFM for coordination. We evaluate these agent archetypes using state‑of‑the‑art LLMs (large language models) and VLMs (vision language models) on three novel VL‑DCOP tasks and compare their respective advantages and drawbacks. Lastly, we discuss how this work extends to broader frontier challenges in the DCOP literature.

PaperID: 11866, https://arxiv.org/pdf/2501.14111.pdf

Authors: Wan Wang, Haiyan Wang, Adam J. Sobey

Title: Collaborating in a competitive world: Heterogeneous Multi-Agent Decision Making in Symbiotic Supply Chain Environments

Abstract:
Supply networks require collaboration in a competitive environment. To achieve this, nodes in the network often form symbiotic relationships as they can be adversely effected by the closure of companies in the network, especially where products are niche. However, balancing support for other nodes in the network against profit is challenging. Agents are increasingly being explored to define optimal strategies in these complex networks. However, to date much of the literature focuses on homogeneous agents where a single policy controls all of the nodes. This isn't realistic for many supply chains as this level of information sharing would require an exceptionally close relationship. This paper therefore compares the behaviour of this type of agent to a heterogeneous structure, where the agents each have separate polices, to solve the product ordering and pricing problem. An approach to reward sharing is developed that doesn't require sharing profit. The homogenous and heterogeneous agents exhibit different behaviours, with the homogenous retailer retaining high inventories and witnessing high levels of backlog while the heterogeneous agents show a typical order strategy. This leads to the heterogeneous agents mitigating the bullwhip effect whereas the homogenous agents do not. In the high demand environment, the agent architecture dominates performance with the Soft Actor‑Critic (SAC) agents outperforming the Proximal Policy Optimisation (PPO) agents. Here, the factory controls the supply chain. In the low demand environment the homogenous agents outperform the heterogeneous agents. Control of the supply chain shifts significantly, with the retailer outperforming the factory by a significant margin.

PaperID: 11867, https://arxiv.org/pdf/2501.13994.pdf

Authors: Hy Nguyen, Bao Pham, Hung Du, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis

Title: CSAOT: Cooperative Multi-Agent System for Active Object Tracking

Abstract:
Object Tracking is essential for many computer vision applications, such as autonomous navigation, surveillance, and robotics. Unlike Passive Object Tracking (POT), which relies on static camera viewpoints to detect and track objects across consecutive frames, Active Object Tracking (AOT) requires a controller agent to actively adjust its viewpoint to maintain visual contact with a moving target in complex environments. Existing AOT solutions are predominantly single‑agent‑based, which struggle in dynamic and complex scenarios due to limited information gathering and processing capabilities, often resulting in suboptimal decision‑making. Alleviating these limitations necessitates the development of a multi‑agent system where different agents perform distinct roles and collaborate to enhance learning and robustness in dynamic and complex environments. Although some multi‑agent approaches exist for AOT, they typically rely on external auxiliary agents, which require additional devices, making them costly. In contrast, we introduce the Collaborative System for Active Object Tracking (CSAOT), a method that leverages multi‑agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework to enable multiple agents to operate on a single device, thereby improving tracking performance and reducing costs. Our approach enhances robustness against occlusions and rapid motion while optimizing camera movements to extend tracking duration. We validated the effectiveness of CSAOT on various interactive maps with dynamic and stationary obstacles.

PaperID: 11868, https://arxiv.org/pdf/2501.13993.pdf

Authors: Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej

Title: CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation

Abstract:
The introduction of new features and services in the banking sector often overwhelms customers, creating an opportunity for banks to enhance user experience through financial chatbots powered by large language models (LLMs). We initiated an AI agent designed to provide customers with relevant information about banking services and insights from annual reports. We proposed a hybrid Customer Analysis Pipeline Retrieval‑Augmented Generation (CAPRAG) that effectively addresses both relationship‑based and contextual queries, thereby improving customer engagement in the digital banking landscape. To implement this, we developed a processing pipeline to refine text data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This dual approach enables us to populate both vector and graph databases with processed data for efficient retrieval. The Cypher query component is employed to effectively query the graph database. When a user submits a query, it is first expanded by a query expansion module before being routed to construct a final query from the hybrid Knowledge Base (KB). This final query is then sent to an open‑source LLM for response generation. Overall, our innovative, designed to international banks, serves bank's customers in an increasingly complex digital environment, enhancing clarity and accessibility of information.

PaperID: 11869, https://arxiv.org/pdf/2501.13946.pdf

Authors: Diego Gosmar, Deborah A. Dahl

Title: Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks

Abstract:
Hallucinations remain a significant challenge in current Generative AI models, undermining trust in AI systems and their reliability. This study investigates how orchestrating multiple specialized Artificial Intelligent Agents can help mitigate such hallucinations, with a focus on systems leveraging Natural Language Processing (NLP) to facilitate seamless agent interactions. To achieve this, we design a pipeline that introduces over three hundred prompts, purposefully crafted to induce hallucinations, into a front‑end agent. The outputs are then systematically reviewed and refined by second‑ and third‑level agents, each employing distinct large language models and tailored strategies to detect unverified claims, incorporate explicit disclaimers, and clarify speculative content. Additionally, we introduce a set of novel Key Performance Indicators (KPIs) specifically designed to evaluate hallucination score levels. A dedicated fourth‑level AI agent is employed to evaluate these KPIs, providing detailed assessments and ensuring accurate quantification of shifts in hallucination‑related behaviors. A core component of this investigation is the use of the OVON (Open Voice Network) framework, which relies on universal NLP‑based interfaces to transfer contextual information among agents. Through structured JSON messages, each agent communicates its assessment of the hallucination likelihood and the reasons underlying questionable content, thereby enabling the subsequent stage to refine the text without losing context. The results demonstrate that employing multiple specialized agents capable of interoperating with each other through NLP‑based agentic frameworks can yield promising outcomes in hallucination mitigation, ultimately bolstering trust within the AI community.

PaperID: 11870, https://arxiv.org/pdf/2501.13936.pdf

Authors: Arjun R. Malghan

Title: Evaluating Computational Accuracy of Large Language Models in Numerical Reasoning Tasks for Healthcare Applications

Abstract:
Large Language Models (LLMs) have emerged as transformative tools in the healthcare sector, demonstrating remarkable capabilities in natural language understanding and generation. However, their proficiency in numerical reasoning, particularly in high‑stakes domains like in clinical applications, remains underexplored. Numerical reasoning is critical in healthcare applications, influencing patient outcomes, treatment planning, and resource allocation. This study investigates the computational accuracy of LLMs in numerical reasoning tasks within healthcare contexts. Using a curated dataset of 1,000 numerical problems, encompassing real‑world scenarios such as dosage calculations and lab result interpretations, the performance of a refined LLM based on the GPT‑3 architecture was evaluated. The methodology includes prompt engineering, integration of fact‑checking pipelines, and application of regularization techniques to enhance model accuracy and generalization. Key metrics such as precision, recall, and F1‑score were utilized to assess the model's efficacy. The results indicate an overall accuracy of 84.10%, with improved performance in straightforward numerical tasks and challenges in multi‑step reasoning. The integration of a fact‑checking pipeline improved accuracy by 11%, underscoring the importance of validation mechanisms. This research highlights the potential of LLMs in healthcare numerical reasoning and identifies avenues for further refinement to support critical decision‑making in clinical environments. The findings aim to contribute to the development of reliable, interpretable, and contextually relevant AI tools for healthcare.

PaperID: 11871, https://arxiv.org/pdf/2501.13896.pdf

Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu

Title: GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

Abstract:
Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine‑tune MLLMs. However, the fine‑tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environments. We argue that the GUI grounding models should be further aligned to the novel environments to reveal their full potential, when the inference is known to involve novel environments, i.e., environments not used during the previous fine‑tuning. To realize this, we first propose GUI‑Bee, an MLLM‑based autonomous agent, to collect high‑quality, environment‑specific data through exploration and then continuously fine‑tune GUI grounding models with the collected data. Our agent leverages a novel Q‑value‑Incentive In‑Context Reinforcement Learning (Q‑ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments and demonstrate the effectiveness of data collected by GUI‑Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q‑ICRL method in enhancing the efficiency of GUI‑Bee. Project page: https://gui‑bee.github.io

PaperID: 11872, https://arxiv.org/pdf/2501.13880.pdf

Authors: Gustavo Kuratomi, Paulo Pirozelli, Fabio G. Cozman, Sarajane M. Peres

Title: A RAG-Based Institutional Assistant

Abstract:
Although large language models (LLMs) demonstrate strong text generation capabilities, they struggle in scenarios requiring access to structured knowledge bases or specific documents, limiting their effectiveness in knowledge‑intensive tasks. To address this limitation, retrieval‑augmented generation (RAG) models have been developed, enabling generative models to incorporate relevant document fragments into their inputs. In this paper, we design and evaluate a RAG‑based virtual assistant specifically tailored for the University of São Paulo. Our system architecture comprises two key modules: a retriever and a generative model. We experiment with different types of models for both components, adjusting hyperparameters such as chunk size and the number of retrieved documents. Our optimal retriever model achieves a Top‑5 accuracy of 30%, while our most effective generative model scores 22.04% against ground truth answers. Notably, when the correct document chunks are supplied to the LLMs, accuracy significantly improves to 54.02%, an increase of over 30 percentage points. Conversely, without contextual input, performance declines to 13.68%. These findings highlight the critical role of database access in enhancing LLM performance. They also reveal the limitations of current semantic search methods in accurately identifying relevant documents and underscore the ongoing challenges LLMs face in generating precise responses.

PaperID: 11873, https://arxiv.org/pdf/2501.13727.pdf

Authors: Haikuo Du, Fandi Gou, Yunze Cai

Title: Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System

Abstract:
Safety and scalability are two critical challenges faced by practical Multi‑Agent Systems (MAS). However, existing Multi‑Agent Reinforcement Learning (MARL) algorithms that rely solely on reward shaping are ineffective in ensuring safety, and their scalability is rather limited due to the fixed‑size network output. To address these issues, we propose a novel framework, Scalable Safe MARL (SS‑MARL), to enhance the safety and scalability of MARL methods. Leveraging the inherent graph structure of MAS, we design a multi‑layer message passing network to aggregate local observations and communications of varying sizes. Furthermore, we develop a constrained joint policy optimization method in the setting of local observation to improve safety. Simulation experiments demonstrate that SS‑MARL achieves a better trade‑off between optimality and safety compared to baselines, and its scalability significantly outperforms the latest methods in scenarios with a large number of agents.

PaperID: 11874, https://arxiv.org/pdf/2501.13592.pdf

Authors: Claire Bizon Monroc, Ana Bušić, Donatien Dubuc, Jiamin Zhu

Title: WFCRL: A Multi-Agent Reinforcement Learning Benchmark for Wind Farm Control

Abstract:
The wind farm control problem is challenging, since conventional model‑based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimension when the number of turbines increases. Recently, model‑free and multi‑agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi‑agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi‑Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that will allow to optimize the farm performance while limiting turbine structural damages. Interfaces with two state‑of‑the‑art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, 10 wind layouts are provided, including 5 real wind farms. Two state‑of‑the‑art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time‑consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.

PaperID: 11875, https://arxiv.org/pdf/2501.13552.pdf

Authors: Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri

Title: Explainable AI-aided Feature Selection and Model Reduction for DRL-based V2X Resource Allocation

Abstract:
Artificial intelligence (AI) is expected to significantly enhance radio resource management (RRM) in sixth‑generation (6G) networks. However, the lack of explainability in complex deep learning (DL) models poses a challenge for practical implementation. This paper proposes a novel explainable AI (XAI)‑ based framework for feature selection and model complexity reduction in a model‑agnostic manner. Applied to a multi‑agent deep reinforcement learning (MADRL) setting, our approach addresses the joint sub‑band assignment and power allocation problem in cellular vehicle‑to‑everything (V2X) communications. We propose a novel two‑stage systematic explainability framework leveraging feature relevance‑oriented XAI to simplify the DRL agents. While the former stage generates a state feature importance ranking of the trained models using Shapley additive explanations (SHAP)‑based importance scores, the latter stage exploits these importance‑based rankings to simplify the state space of the agents by removing the least important features from the model input. Simulation results demonstrate that the XAI‑assisted methodology achieves 97% of the original MADRL sum‑rate performance while reducing optimal state features by 28%, average training time by 11%, and trainable weight parameters by 46% in a network with eight vehicular pairs.

PaperID: 11876, https://arxiv.org/pdf/2501.13461.pdf

Authors: Huilin Yin, Yangwenhui Xu, Jiaxiang Li, Hao Zhang, Gerhard Rigoll

Title: Knowledge-Informed Multi-Agent Trajectory Prediction at Signalized Intersections for Infrastructure-to-Everything

Abstract:
Multi‑agent trajectory prediction at signalized intersections is crucial for developing efficient intelligent transportation systems and safe autonomous driving systems. Due to the complexity of intersection scenarios and the limitations of single‑vehicle perception, the performance of vehicle‑centric prediction methods has reached a plateau. In this paper, we introduce an Infrastructure‑to‑Everything (I2X) collaborative prediction scheme. In this scheme, roadside units (RSUs) independently forecast the future trajectories of all vehicles and transmit these predictions unidirectionally to subscribing vehicles. Building on this scheme, we propose I2XTraj, a dedicated infrastructure‑based trajectory prediction model. I2XTraj leverages real‑time traffic signal states, prior maneuver strategy knowledge, and multi‑agent interactions to generate accurate, joint multi‑modal trajectory prediction. First, a continuous signal‑informed mechanism is proposed to adaptively process real‑time traffic signals to guide trajectory proposal generation under varied intersection configurations. Second, a driving strategy awareness mechanism estimates the joint distribution of maneuver strategies by integrating spatial priors of intersection areas with dynamic vehicle states, enabling coverage of the full set of feasible maneuvers. Third, a spatial‑temporal‑mode attention network models multi‑agent interactions to refine and adjust joint trajectory outputs.Finally, I2XTraj is evaluated on two real‑world datasets of signalized intersections, the V2X‑Seq and the SinD drone dataset. In both single‑infrastructure and online collaborative scenarios, our model outperforms state‑of‑the‑art methods by over 30% on V2X‑Seq and 15% on SinD, demonstrating strong generalizability and robustness.

PaperID: 11877, https://arxiv.org/pdf/2501.13448.pdf

Authors: Yulong Hu, Siyuan Feng, Sen Li

Title: BMG-Q: Localized Bipartite Match Graph Attention Q-Learning for Ride-Pooling Order Dispatch

Abstract:
This paper introduces Localized Bipartite Match Graph Attention Q‑Learning (BMG‑Q), a novel Multi‑Agent Reinforcement Learning (MARL) algorithm framework tailored for ride‑pooling order dispatch. BMG‑Q advances ride‑pooling decision‑making process with the localized bipartite match graph underlying the Markov Decision Process, enabling the development of novel Graph Attention Double Deep Q Network (GATDDQN) as the MARL backbone to capture the dynamic interactions among ride‑pooling vehicles in fleet. Our approach enriches the state information for each agent with GATDDQN by leveraging a localized bipartite interdependence graph and enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Enhanced by gradient clipping and localized graph sampling, our GATDDQN improves scalability and robustness. Furthermore, the inclusion of a posterior score function in the ILP captures the online exploration‑exploitation trade‑off and reduces the potential overestimation bias of agents, thereby elevating the quality of the derived solutions. Through extensive experiments and validation, BMG‑Q has demonstrated superior performance in both training and operations for thousands of vehicle agents, outperforming benchmark reinforcement learning frameworks by around 10% in accumulative rewards and showing a significant reduction in overestimation bias by over 50%. Additionally, it maintains robustness amidst task variations and fleet size changes, establishing BMG‑Q as an effective, scalable, and robust framework for advancing ride‑pooling order dispatch operations.

PaperID: 11878, https://arxiv.org/pdf/2501.13411.pdf

Authors: He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu

Title: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework

Abstract:
Penetration testing is a vital practice for identifying and mitigating vulnerabilities in cybersecurity systems, but its manual execution is labor‑intensive and time‑consuming. Existing large language model (LLM)‑assisted or automated penetration testing approaches often suffer from inefficiencies, such as a lack of contextual understanding and excessive, unstructured data generation. This paper presents VulnBot, an automated penetration testing framework that leverages LLMs to simulate the collaborative workflow of human penetration testing teams through a multi‑agent system. To address the inefficiencies and reliance on manual intervention in traditional penetration testing methods, VulnBot decomposes complex tasks into three specialized phases: reconnaissance, scanning, and exploitation. These phases are guided by a penetration task graph (PTG) to ensure logical task execution. Key design features include role specialization, penetration path planning, inter‑agent communication, and generative penetration behavior. Experimental results demonstrate that VulnBot outperforms baseline models such as GPT‑4 and Llama3 in automated penetration testing tasks, particularly showcasing its potential in fully autonomous testing on real‑world machines.

PaperID: 11879, https://arxiv.org/pdf/2501.13364.pdf

Authors: Jianglin Qiao, Zehong Cao, Dave de Jonge, Ryszard Kowalczyk

Title: Task Allocation in Customer-led Two-sided Markets with Satellite Constellation Services

Abstract:
Multi‑agent systems (MAS) are increasingly applied to complex task allocation in two‑sided markets, where agents such as companies and customers interact dynamically. Traditional company‑led Stackelberg game models, where companies set service prices, and customers respond, struggle to accommodate diverse and personalised customer demands in emerging markets like crowdsourcing. This paper proposes a customer‑led Stackelberg game model for cost‑efficient task allocation, where customers initiate tasks as leaders, and companies create their strategies as followers to meet these demands. We prove the existence of Nash Equilibrium for the follower game and Stackelberg Equilibrium for the leader game while discussing their uniqueness under specific conditions, ensuring cost‑efficient task allocation and improved market performance. Using the satellite constellation services market as a real‑world case, experimental results show a 23% reduction in customer payments and a 6.7‑fold increase in company revenues, demonstrating the model's effectiveness in emerging markets.

PaperID: 11880, https://arxiv.org/pdf/2501.13299.pdf

Authors: Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral

Title: Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents

Abstract:
Materials discovery and design are essential for advancing technology across various industries by enabling the development of application‑specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real‑world goals, constraints, and methods for designing real‑world applications. Using this dataset, we test LLM‑based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.

PaperID: 11881, https://arxiv.org/pdf/2501.13189.pdf

Authors: Alexander Spinos, Bradley Woosley, Justin Rokisky, Christopher Korpela, John G. Rogers, Brian A. Bittner

Title: Map Prediction and Generative Entropy for Multi-Agent Exploration

Abstract:
Traditionally, autonomous reconnaissance applications have acted on explicit sets of historical observations. Aided by recent breakthroughs in generative technologies, this work enables robot teams to act beyond what is currently known about the environment by inferring a distribution of reasonable interpretations of the scene. We developed a map predictor that inpaints the unknown space in a multi‑agent 2D occupancy map during an exploration mission. From a comparison of several inpainting methods, we found that a fine‑tuned latent diffusion inpainting model could provide rich and coherent interpretations of simulated urban environments with relatively little computation time. By iteratively inferring interpretations of the scene throughout an exploration run, we are able to identify areas that exhibit high uncertainty in the prediction, which we formalize with the concept of generative entropy. We prioritize tasks in regions of high generative entropy, hypothesizing that this will expedite convergence on an accurate predicted map of the scene. In our study we juxtapose this new paradigm of task ranking with the state of the art, which ranks regions to explore by those which maximize expected information recovery. We compare both of these methods in a simulated urban environment with three vehicles. Our results demonstrate that by using our new task ranking method, we can predict a correct scene significantly faster than with a traditional information‑guided method.

PaperID: 11882, https://arxiv.org/pdf/2501.13132.pdf

Authors: Jinhui Pang, Jinglin He, Noureldin Mohamed Abdelaal Ahmed Mohamed, Changqing Lin, Zhihui Zhang, Xiaoshuai Hao

Title: A Hierarchical Reinforcement Learning Framework for Multi-UAV Combat Using Leader-Follower Strategy

Abstract:
Multi‑UAV air combat is a complex task involving multiple autonomous UAVs, an evolving field in both aerospace and artificial intelligence. This paper aims to enhance adversarial performance through collaborative strategies. Previous approaches predominantly discretize the action space into predefined actions, limiting UAV maneuverability and complex strategy implementation. Others simplify the problem to 1v1 combat, neglecting the cooperative dynamics among multiple UAVs. To address the high‑dimensional challenges inherent in six‑degree‑of‑freedom space and improve cooperation, we propose a hierarchical framework utilizing the Leader‑Follower Multi‑Agent Proximal Policy Optimization (LFMAPPO) strategy. Specifically, the framework is structured into three levels. The top level conducts a macro‑level assessment of the environment and guides execution policy. The middle level determines the angle of the desired action. The bottom level generates precise action commands for the high‑dimensional action space. Moreover, we optimize the state‑value functions by assigning distinct roles with the leader‑follower strategy to train the top‑level policy, followers estimate the leader's utility, promoting effective cooperation among agents. Additionally, the incorporation of a target selector, aligned with the UAVs' posture, assesses the threat level of targets. Finally, simulation experiments validate the effectiveness of our proposed method.

PaperID: 11883, https://arxiv.org/pdf/2501.12991.pdf

Authors: Eslam Eldeeb, Hirley Alves

Title: An Offline Multi-Agent Reinforcement Learning Framework for Radio Resource Management

Abstract:
Offline multi‑agent reinforcement learning (MARL) addresses key limitations of online MARL, such as safety concerns, expensive data collection, extended training intervals, and high signaling overhead caused by online interactions with the environment. In this work, we propose an offline MARL algorithm for radio resource management (RRM), focusing on optimizing scheduling policies for multiple access points (APs) to jointly maximize the sum and tail rates of user equipment (UEs). We evaluate three training paradigms: centralized, independent, and centralized training with decentralized execution (CTDE). Our simulation results demonstrate that the proposed offline MARL framework outperforms conventional baseline approaches, achieving over a 15% improvement in a weighted combination of sum and tail rates. Additionally, the CTDE framework strikes an effective balance, reducing the computational complexity of centralized methods while addressing the inefficiencies of independent training. These results underscore the potential of offline MARL to deliver scalable, robust, and efficient solutions for resource management in dynamic wireless networks.

PaperID: 11884, https://arxiv.org/pdf/2501.12989.pdf

Authors: Hossein Nejatbakhsh Esfahani, Kai Liu, Javad Mohammadpour Velni

Title: Distributed Model Predictive Control Design for Multi-agent Systems via Bayesian Optimization

Abstract:
This paper introduces a new approach that leverages Multi‑agent Bayesian Optimization (MABO) to design Distributed Model Predictive Control (DMPC) schemes for multi‑agent systems. The primary objective is to learn optimal DMPC schemes even when local model predictive controllers rely on imperfect local models. The proposed method invokes a dual decomposition‑based distributed optimization framework, incorporating an Alternating Direction Method of Multipliers (ADMM)‑based MABO algorithm to enable coordinated learning of parameterized DMPC schemes. This enhances the closed‑loop performance of local controllers, despite discrepancies between their models and the actual multi‑agent system dynamics. In addition to the newly proposed algorithms, this work also provides rigorous proofs establishing the optimality and convergence of the underlying learning method. Finally, numerical examples are given to demonstrate the efficacy of the proposed MABO‑based learning approach.

PaperID: 11885, https://arxiv.org/pdf/2501.12909.pdf

Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang

Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

Abstract:
Virtual film production requires intricate decision‑making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision‑making with language agent‑based societies, this paper introduces FilmAgent, a novel LLM‑based multi‑agent collaborative framework for end‑to‑end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi‑agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT‑4o model, surpasses the single‑agent o1, showing the advantage of a well‑coordinated multi‑agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text‑to‑video model Sora and our FilmAgent in filmmaking.

PaperID: 11886, https://arxiv.org/pdf/2501.12904.pdf

Authors: Alessio Bucaioni, Martin Weyssow, Junda He, Yunbo Lyu, David Lo

Title: A Functional Software Reference Architecture for LLM-Integrated Systems

Abstract:
The integration of large language models into software systems is transforming capabilities such as natural language understanding, decision‑making, and autonomous task execution. However, the absence of a commonly accepted software reference architecture hinders systematic reasoning about their design and quality attributes. This gap makes it challenging to address critical concerns like privacy, security, modularity, and interoperability, which are increasingly important as these systems grow in complexity and societal impact. In this paper, we describe our emerging results for a preliminary functional reference architecture as a conceptual framework to address these challenges and guide the design, evaluation, and evolution of large language model‑integrated systems. We identify key architectural concerns for these systems, informed by current research and practice. We then evaluate how the architecture addresses these concerns and validate its applicability using three open‑source large language model‑integrated systems in computer vision, text processing, and coding.

PaperID: 11887, https://arxiv.org/pdf/2501.12851.pdf

Authors: Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu

Title: ACEBench: Who Wins the Match Point in Tool Usage?

Abstract:
Large Language Models (LLMs) have demonstrated significant potential in decision‑making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi‑turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi‑agent interactions to simulate real‑world, multi‑turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in‑depth and providing a more granular examination of error causes across different data types.

PaperID: 11888, https://arxiv.org/pdf/2501.12498.pdf

Authors: Vincenzo Calderonio

Title: The Start Button Problem: a basis for human responsibility in artificial intelligence computation

Abstract:
Recent advancements in artificial intelligence have reopened the question about the boundaries of AI autonomy, particularly in discussions around artificial general intelligence and its potential to act independently across varied purposes. This paper explores these boundaries through the analysis of the Alignment Research Center experiment on GPT‑4 and introduces the Start Button Problem, a thought experiment that examines the origins and limits of AI autonomy. By examining the thought experiment and its counterarguments, it becomes clear that in its need for human activation and purpose definition lies the AI's inherent dependency on human‑initiated actions, challenging the assumption of AI as an intelligent agent. Finally, the paper addresses the implications of this dependency on human responsibility, questioning the measure of the extension of human responsibility when using AI systems.

PaperID: 11889, https://arxiv.org/pdf/2501.12408.pdf

Authors: Vasileios Lioutas, Adam Scibior, Matthew Niedoba, Berend Zwartsenberg, Frank Wood

Title: Control-ITRA: Controlling the Behavior of a Driving Model

Abstract:
Simulating realistic driving behavior is crucial for developing and testing autonomous systems in complex traffic environments. Equally important is the ability to control the behavior of simulated agents to tailor scenarios to specific research needs and safety considerations. This paper extends the general‑purpose multi‑agent driving behavior model ITRA (Scibior et al., 2021), by introducing a method called Control‑ITRA to influence agent behavior through waypoint assignment and target speed modulation. By conditioning agents on these two aspects, we provide a mechanism for them to adhere to specific trajectories and indirectly adjust their aggressiveness. We compare different approaches for integrating these conditions during training and demonstrate that our method can generate controllable, infraction‑free trajectories while preserving realism in both seen and unseen locations.

PaperID: 11890, https://arxiv.org/pdf/2501.12399.pdf

Authors: Shijie Han, Jingshu Zhang, Yiqing Shen, Kaiyuan Yan, Hongguang Li

Title: FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools

Abstract:
Current financial large language models (FinLLMs) struggle with two critical limitations: the absence of objective evaluation metrics to assess the quality of stock analysis reports and a lack of depth in stock analysis, which impedes their ability to generate professional‑grade insights. To address these challenges, this paper introduces FinSphere, a stock analysis agent, along with three major contributions: (1) AnalyScore, a systematic evaluation framework for assessing stock analysis quality, (2) Stocksis, a dataset curated by industry experts to enhance LLMs' stock analysis capabilities, and (3) FinSphere, an AI agent that can generate high‑quality stock analysis reports in response to user queries. Experiments demonstrate that FinSphere achieves superior performance compared to both general and domain‑specific LLMs, as well as existing agent‑based systems, even when they are enhanced with real‑time data access and few‑shot guidance. The integrated framework, which combines real‑time data feeds, quantitative tools, and an instruction‑tuned LLM, yields substantial improvements in both analytical quality and practical applicability for real‑world stock analysis.

PaperID: 11891, https://arxiv.org/pdf/2501.12326.pdf

Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

Title: UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Abstract:
This paper introduces UI‑TARS, a native GUI agent model that solely perceives the screenshots as input and performs human‑like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT‑4o) with expert‑crafted prompts and workflows, UI‑TARS is an end‑to‑end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI‑TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI‑TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI‑TARS achieves 46.6, surpassing GPT‑4o (34.5). UI‑TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large‑scale dataset of GUI screenshots for context‑aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large‑scale action traces; (3) System‑2 Reasoning, which incorporates deliberate reasoning into multi‑step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI‑TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

PaperID: 11892, https://arxiv.org/pdf/2501.12263.pdf

Authors: Bingyi Liu, Jian Teng, Hongfei Xue, Enshu Wang, Chuanhui Zhu, Pu Wang, Libing Wu

Title: mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework

Abstract:
Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real‑world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi‑agent, multi‑stage, communication‑efficient, and collaboration‑robust cooperative perception framework. Our framework leverages a multi‑stage collaboration strategy that dynamically and adaptively balances intermediate‑ and late‑stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low‑confidence sensing information from transmission and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real‑world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.

PaperID: 11893, https://arxiv.org/pdf/2501.12234.pdf

Authors: Mark Gonzales, Adam Polevoy, Marin Kobilarov, Joseph Moore

Title: Multi-Agent Feedback Motion Planning using Probably Approximately Correct Nonlinear Model Predictive Control

Abstract:
For many tasks, multi‑robot teams often provide greater efficiency, robustness, and resiliency. However, multi‑robot collaboration in real‑world scenarios poses a number of major challenges, especially when dynamic robots must balance competing objectives like formation control and obstacle avoidance in the presence of stochastic dynamics and sensor uncertainty. In this paper, we propose a distributed, multi‑agent receding‑horizon feedback motion planning approach using Probably Approximately Correct Nonlinear Model Predictive Control (PAC‑NMPC) that is able to reason about both model and measurement uncertainty to achieve robust multi‑agent formation control while navigating cluttered obstacle fields and avoiding inter‑robot collisions. Our approach relies not only on the underlying PAC‑NMPC algorithm but also on a terminal cost‑function derived from gyroscopic obstacle avoidance. Through numerical simulation, we show that our distributed approach performs on par with a centralized formulation, that it offers improved performance in the case of significant measurement noise, and that it can scale to more complex dynamical systems.

PaperID: 11894, https://arxiv.org/pdf/2501.12199.pdf

Authors: Tuo Zhang, Leonardo Stella, Julian Barreiro-Gomez

Title: Experience-replay Innovative Dynamics

Abstract:
Despite its groundbreaking success, multi‑agent reinforcement learning (MARL) still suffers from instability and nonstationarity. Replicator dynamics, the most well‑known model from evolutionary game theory (EGT), provide a theoretical framework for the convergence of the trajectories to Nash equilibria and, as a result, have been used to ensure formal guarantees for MARL algorithms in stable game settings. However, they exhibit the opposite behavior in other settings, which poses the problem of finding alternatives to ensure convergence. In contrast, innovative dynamics, such as the Brown‑von Neumann‑Nash (BNN) or Smith, result in periodic trajectories with the potential to approximate Nash equilibria. Yet, no MARL algorithms based on these dynamics have been proposed. In response to this challenge, we develop a novel experience replay‑based MARL algorithm that incorporates revision protocols as tunable hyperparameters. We demonstrate, by appropriately adjusting the revision protocols, that the behavior of our algorithm mirrors the trajectories resulting from these dynamics. Importantly, our contribution provides a framework capable of extending the theoretical guarantees of MARL algorithms beyond replicator dynamics. Finally, we corroborate our theoretical findings with empirical results.

PaperID: 11895, https://arxiv.org/pdf/2501.12061.pdf

Authors: Somnath Hazra, Pallab Dasgupta, Soumyajit Dey

Title: Tackling Uncertainties in Multi-Agent Reinforcement Learning through Integration of Agent Termination Dynamics

Abstract:
Multi‑Agent Reinforcement Learning (MARL) has gained significant traction for solving complex real‑world tasks, but the inherent stochasticity and uncertainty in these environments pose substantial challenges to efficient and robust policy learning. While Distributional Reinforcement Learning has been successfully applied in single‑agent settings to address risk and uncertainty, its application in MARL is substantially limited. In this work, we propose a novel approach that integrates distributional learning with a safety‑focused loss function to improve convergence in cooperative MARL tasks. Specifically, we introduce a Barrier Function based loss that leverages safety metrics, identified from inherent faults in the system, into the policy learning process. This additional loss term helps mitigate risks and encourages safer exploration during the early stages of training. We evaluate our method in the StarCraft II micromanagement benchmark, where our approach demonstrates improved convergence and outperforms state‑of‑the‑art baselines in terms of both safety and task completion. Our results suggest that incorporating safety considerations can significantly enhance learning performance in complex, multi‑agent environments.

PaperID: 11896, https://arxiv.org/pdf/2501.11897.pdf

Authors: Ludovico Crippa, Yonatan Gur, Bar Light

Title: Equilibria under Dynamic Benchmark Consistency in Non-Stationary Multi-Agent Systems

Abstract:
We formulate and study a general time‑varying multi‑agent system where players repeatedly compete under incomplete information. Our work is motivated by scenarios commonly observed in online advertising and retail marketplaces, where agents and platform designers optimize algorithmic decision‑making in dynamic competitive settings. In these systems, no‑regret algorithms that provide guarantees relative to \emphstatic benchmarks can perform poorly and the distributions of play that emerge from their interaction do not correspond anymore to static solution concepts such as coarse correlated equilibria. Instead, we analyze the interaction of dynamic benchmark consistent policies that have performance guarantees relative to \emphdynamic sequences of actions, and through a novel tracking error notion we delineate when their empirical joint distribution of play can approximate an evolving sequence of static equilibria. In systems that change sufficiently slowly (sub‑linearly in the horizon length), we show that the resulting distributions of play approximate the sequence of coarse correlated equilibria, and apply this result to establish improved welfare bounds for smooth games. On a similar vein, we formulate internal dynamic benchmark consistent policies and establish that they approximate sequences of correlated equilibria. Our findings therefore suggest that in a broad range of multi‑agent systems where non‑stationarity is prevalent, algorithms designed to compete with dynamic benchmarks can improve both individual and welfare guarantees, and their emerging dynamics approximate a sequence of static equilibrium outcomes.

PaperID: 11897, https://arxiv.org/pdf/2501.11864.pdf

Authors: Venkata Sai Aswath Duvvuru, Bohan Zhang, Michael Vierhauser, Ankit Agrawal

Title: LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems

Abstract:
Thorough simulation testing is crucial for validating the correct behavior of small Uncrewed Aerial Systems (sUAS) across multiple scenarios, including adverse weather conditions (such as wind, and fog), diverse settings (hilly terrain, or urban areas), and varying mission profiles (surveillance, tracking). While various sUAS simulation tools exist to support developers, the entire process of creating, executing, and analyzing simulation tests remains a largely manual and cumbersome task. Developers must identify test scenarios, set up the simulation environment, integrate the System under Test (SuT) with simulation tools, formulate mission plans, and collect and analyze results. These labor‑intensive tasks limit the ability of developers to conduct exhaustive testing across a wide range of scenarios. To alleviate this problem, in this paper, we propose AutoSimTest, a Large Language Model (LLM)‑driven framework, where multiple LLM agents collaborate to support the sUAS simulation testing process. This includes: (1) creating test scenarios that subject the SuT to unique environmental contexts; (2) preparing the simulation environment as per the test scenario; (3) generating diverse sUAS missions for the SuT to execute; and (4) analyzing simulation results and providing an interactive analytics interface. Further, the design of the framework is flexible for creating and testing scenarios for a variety of sUAS use cases, simulation tools, and SuT input requirements. We evaluated our approach by (a) conducting simulation testing of PX4 and ArduPilot flight‑controller‑based SuTs, (b) analyzing the performance of each agent, and (c) gathering feedback from sUAS developers. Our findings indicate that AutoSimTest significantly improves the efficiency and scope of the sUAS testing process, allowing for more comprehensive and varied scenario evaluations while reducing the manual effort.

PaperID: 11898, https://arxiv.org/pdf/2501.11799.pdf

Authors: Johnny Joyce

Title: Policy-Adaptable Methods For Resolving Normative Conflicts Through Argumentation and Graph Colouring

Abstract:
In a multi‑agent system, one may choose to govern the behaviour of an agent by imposing norms, which act as guidelines for how agents should act either all of the time or in given situations. However, imposing multiple norms on one or more agents may result in situations where these norms conflict over how the agent should behave. In any system with normative conflicts (such as safe reinforcement models or systems which monitor safety protocols), one must decide which norms should be followed such that the most important and most relevant norms are maintained. We introduce a new method for resolving normative conflicts through argumentation and graph colouring which is compatible with a variety of normative conflict resolution policies. We prove that this method always creates an admissible set of arguments under argumentation semantics, meaning that it produces coherent outputs. We also introduce more robust variants of this method, each building upon their predecessor to create a superior output, and we include further mathematical proof of their coherence. Our most advanced variant uses the existing concept of curtailment, where one norm may supersede another without fully eliminating it. The methods we introduce are all compatible with various pre‑existing policies for resolving normative conflicts. Empirical evaluations are also performed to compare our algorithms to each other and to others in existing literature.

PaperID: 11899, https://arxiv.org/pdf/2501.11739.pdf

Authors: Chad DeChant

Title: Episodic memory in AI agents poses risks that should be studied and mitigated

Abstract:
Most current AI models have little ability to store and later retrieve a record or representation of what they do. In human cognition, episodic memories play an important role in both recall of the past as well as planning for the future. The ability to form and use episodic memories would similarly enable a broad range of improved capabilities in an AI agent that interacts with and takes actions in the world. Researchers have begun directing more attention to developing memory abilities in AI models. It is therefore likely that models with such capability will be become widespread in the near future. This could in some ways contribute to making such AI agents safer by enabling users to better monitor, understand, and control their actions. However, as a new capability with wide applications, we argue that it will also introduce significant new risks that researchers should begin to study and address. We outline these risks and benefits and propose four principles to guide the development of episodic memory capabilities so that these will enhance, rather than undermine, the effort to keep AI safe and trustworthy.

PaperID: 11900, https://arxiv.org/pdf/2501.11613.pdf

Authors: Giorgio Robino

Title: Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems

Abstract:
This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task‑oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging. The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task‑oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework's effectiveness through two proof‑of‑concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR's capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom functions (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design. While the framework shows promise in accessibility and adaptability, we identify key challenges including computational overhead, non‑deterministic behavior, and domain‑specific logic optimization. Future research directions include CR evaluation methods based on prompt engineering frameworks driven by goal‑oriented grading criteria, improving scalability for complex multi‑agent interactions, and enhancing system robustness to address the identified limitations across diverse business applications.

PaperID: 11901, https://arxiv.org/pdf/2501.11574.pdf

Authors: Shahida Jabeen

Title: A Deep Reinforcement Learning based Scheduler for IoT Devices in Co-existence with 5G-NR

Abstract:
Co‑existence of 5G New Radio (5G‑NR) with IoT devices is considered as a promising technique to enhance the spectral usage and efficiency of future cellular networks. In this paper, a unified framework has been proposed for allocating in‑band resource blocks (RBs), i.e., within a multi‑cell network, to 5G‑NR users in co‑existence with NB‑IoT and LTE‑M devices. First, a benchmark (upper‑bound) scheduler has been designed for joint sub‑carrier (SC) and modulation and coding scheme (MCS) allocation that maximizes instantaneous throughput and fairness among users/devices, while considering synchronous RB allocation in the neighboring cells. A series of numerical simulations with realistic ICI in an urban scenario have been used to compute benchmark upper‑bound solutions for characterizing performance in terms of throughput, fairness, and delay. Next, an edge learning based multi‑agent deep reinforcement learning (DRL) framework has been developed for different DRL algorithms, specifically, a policy‑based gradient network (PGN), a deep Q‑learning based network (DQN), and an actor‑critic based deep deterministic policy gradient network (DDPGN). The proposed DRL framework depends on interference allocation, where the actions are based on inter‑cell‑interference (ICI) instead of power, which can bypass the need for raw data sharing and/or inter‑agent communication. The numerical results reveal that the interference allocation based DRL schedulers can significantly outperform their counterparts, where the actions are based on power allocation. Further, the performance of the proposed policy‑based edge learning algorithms is close to the centralized ones.

PaperID: 11902, https://arxiv.org/pdf/2501.11425.pdf

Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen

Title: Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Abstract:
Large Language Models (LLMs) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real‑world applications, mainly due to the inability to recover from errors. However, step‑level critique data is difficult and expensive to collect. Automating and dynamically constructing self‑critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self‑training framework, Agent‑R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent‑R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model‑guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self‑improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent‑R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent‑R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).

PaperID: 11903, https://arxiv.org/pdf/2501.11283.pdf

Authors: Hongye Quan, Wanli Ni, Tong Zhang, Xiangyu Ye, Ziyi Xie, Shuai Wang, Yuanwei Liu, Hui Song

Title: Large Language Model Agents for Radio Map Generation and Wireless Network Planning

Abstract:
Using commercial software for radio map generation and wireless network planning often require complex manual operations, posing significant challenges in terms of scalability, adaptability, and user‑friendliness, due to heavy manual operations. To address these issues, we propose an automated solution that employs large language model (LLM) agents. These agents are designed to autonomously generate radio maps and facilitate wireless network planning for specified areas, thereby minimizing the necessity for extensive manual intervention. To validate the effectiveness of our proposed solution, we develop a software platform that integrates LLM agents. Experimental results demonstrate that a large amount manual operations can be saved via the proposed LLM agent, and the automated solutions can achieve an enhanced coverage and signal‑to‑interference‑noise ratio (SINR), especially in urban environments.

PaperID: 11904, https://arxiv.org/pdf/2501.11233.pdf

Authors: Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt

Title: PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents

Abstract:
Chart visualizations, while essential for data interpretation and communication, are predominantly accessible only as images in PDFs, lacking source data tables and stylistic information. To enable effective editing of charts in PDFs or digital scans, we present PlotEdit, a novel multi‑agent framework for natural language‑driven end‑to‑end chart image editing via self‑reflective LLM agents. PlotEdit orchestrates five LLM agents: (1) Chart2Table for data table extraction, (2) Chart2Vision for style attribute identification, (3) Chart2Code for retrieving rendering code, (4) Instruction Decomposition Agent for parsing user requests into executable steps, and (5) Multimodal Editing Agent for implementing nuanced chart component modifications ‑ all coordinated through multimodal feedback to maintain visual fidelity. PlotEdit outperforms existing baselines on the ChartCraft dataset across style, layout, format, and data‑centric edits, enhancing accessibility for visually challenged users and improving novice productivity.

PaperID: 11905, https://arxiv.org/pdf/2501.11188.pdf

Authors: Mouaad Boughellaba, Soulaimane Berkane, Abdelhamid Tayebi

Title: Global Attitude Synchronization for Heterogeneous Multi-agent Systems on SO(3)

Abstract:
In this paper, we address the problem of attitude synchronization for a group of rigid body systems evolving on SO(3). The interaction among these systems is modeled through an undirected, connected, and acyclic graph topology. First, we present an almost global continuous distributed attitude synchronization scheme with rigorously proven stability guarantees. Thereafter, we propose two global distributed hybrid attitude synchronization schemes on SO(3). The first scheme is a hybrid control law that leverages angular velocities and relative orientations to achieve global alignment to a common orientation. The second scheme eliminates the dependence on angular velocities by introducing dynamic auxiliary variables, while ensuring global asymptotic attitude synchronization. This velocity‑free control scheme relies exclusively on attitude information. The proposed schemes are applicable to heterogeneous multi‑agent systems, where agents may have distinct inertia matrices. Simulation results are provided to illustrate the effectiveness of the proposed distributed attitude synchronization schemes.

PaperID: 11906, https://arxiv.org/pdf/2501.11149.pdf

Authors: Ruolin Ye, Shuaixing Chen, Yunting Yan, Joyce Yang, Christina Ge, Jose Barreiros, Kate Tsui, Tom Silver, Tapomayukh Bhattacharjee

Title: CART-MPC: Coordinating Assistive Devices for Robot-Assisted Transferring with Multi-Agent Model Predictive Control

Abstract:
Bed‑to‑wheelchair transferring is a ubiquitous activity of daily living (ADL), but especially challenging for caregiving robots with limited payloads. We develop a novel algorithm that leverages the presence of other assistive devices: a Hoyer sling and a wheelchair for coarse manipulation of heavy loads, alongside a robot arm for fine‑grained manipulation of deformable objects (Hoyer sling straps). We instrument the Hoyer sling and wheelchair with actuators and sensors so that they can become intelligent agents in the algorithm. We then focus on one subtask of the transferring ADL ‑‑ tying Hoyer sling straps to the sling bar ‑‑ that exemplifies the challenges of transfer: multi‑agent planning, deformable object manipulation, and generalization to varying hook shapes, sling materials, and care recipient bodies. To address these challenges, we propose CART‑MPC, a novel algorithm based on turn‑taking multi‑agent model predictive control that uses a learned neural dynamics model for a keypoint‑based representation of the deformable Hoyer sling strap, and a novel cost function that leverages linking numbers from knot theory and neural amortization to accelerate inference. We validate it in both RCareWorld simulation and real‑world environments. In simulation, CART‑MPC successfully generalizes across diverse hook designs, sling materials, and care recipient body shapes. In the real world, we show zero‑shot sim‑to‑real generalization capabilities to tie deformable Hoyer sling straps on a sling bar towards transferring a manikin from a hospital bed to a wheelchair. See our website for supplementary materials: https://emprise.cs.cornell.edu/cart‑mpc/.

PaperID: 11907, https://arxiv.org/pdf/2501.11079.pdf

Authors: Li-Hsiang Shen, Jyun-Jhe Huang, Kai-Ten Feng, Lie-Liang Yang, Jen-Ming Wu

Title: Federated Deep Reinforcement Learning for Energy Efficient Multi-Functional RIS-Assisted Low-Earth Orbit Networks

Abstract:
In this paper, a novel network architecture that deploys the multi‑functional reconfigurable intelligent surface (MF‑RIS) in low‑Earth orbit (LEO) is proposed. Unlike traditional RIS with only signal reflection capability, the MF‑RIS can reflect, refract, and amplify signals, as well as harvest energy from wireless signals. Given the high energy demands in shadow regions where solar energy is unavailable, MF‑RIS is deployed in LEO to enhance signal coverage and improve energy efficiency (EE). To address this, we formulate a long‑term EE optimization problem by determining the optimal parameters for MF‑RIS configurations, including amplification and phase‑shifts, energy harvesting ratios, and LEO transmit beamforming. To address the complex non‑convex and non‑linear problem, a federated learning enhanced multi‑agent deep deterministic policy gradient (FEMAD) scheme is designed. Multi‑agent DDPG of each agent can provide the optimal action policy from its interaction to environments, whereas federated learning enables the hidden information exchange among multi‑agents. In numerical results, we can observe significant EE improvements compared to the other benchmarks, including centralized deep reinforcement learning as well as distributed multi‑agent deep deterministic policy gradient (DDPG). Additionally, the proposed LEO‑MF‑RIS architecture has demonstrated its effectiveness, achieving the highest EE performance compared to the scenarios of fixed/no energy harvesting in MF‑RIS, traditional reflection‑only RIS, and deployment without RISs/MF‑RISs.

PaperID: 11908, https://arxiv.org/pdf/2501.10938.pdf

Authors: Ahmed Alagha, Jamal Bentahar, Hadi Otrok, Shakti Singh, Rabeb Mizouni

Title: Blockchain-assisted Demonstration Cloning for Multi-Agent Deep Reinforcement Learning

Abstract:
Multi‑Agent Deep Reinforcement Learning (MDRL) is a promising research area in which agents learn complex behaviors in cooperative or competitive environments. However, MDRL comes with several challenges that hinder its usability, including sample efficiency, curse of dimensionality, and environment exploration. Recent works proposing Federated Reinforcement Learning (FRL) to tackle these issues suffer from problems related to model restrictions and maliciousness. Other proposals using reward shaping require considerable engineering and could lead to local optima. In this paper, we propose a novel Blockchain‑assisted Multi‑Expert Demonstration Cloning (MEDC) framework for MDRL. The proposed method utilizes expert demonstrations in guiding the learning of new MDRL agents, by suggesting exploration actions in the environment. A model sharing framework on Blockchain is designed to allow users to share their trained models, which can be allocated as expert models to requesting users to aid in training MDRL systems. A Consortium Blockchain is adopted to enable traceable and autonomous execution without the need for a single trusted entity. Smart Contracts are designed to manage users and models allocation, which are shared using IPFS. The proposed framework is tested on several applications, and is benchmarked against existing methods in FRL, Reward Shaping, and Imitation Learning‑assisted RL. The results show the outperformance of the proposed framework in terms of learning speed and resiliency to faulty and malicious models.

PaperID: 11909, https://arxiv.org/pdf/2501.10924.pdf

Authors: Ahmed Alagha, Rabeb Mizouni, Shakti Singh, Jamal Bentahar, Hadi Otrok

Title: Adaptive Target Localization under Uncertainty using Multi-Agent Deep Reinforcement Learning with Knowledge Transfer

Abstract:
Target localization is a critical task in sensitive applications, where multiple sensing agents communicate and collaborate to identify the target location based on sensor readings. Existing approaches investigated the use of Multi‑Agent Deep Reinforcement Learning (MADRL) to tackle target localization. Nevertheless, these methods do not consider practical uncertainties, like false alarms when the target does not exist or when it is unreachable due to environmental complexities. To address these drawbacks, this work proposes a novel MADRL‑based method for target localization in uncertain environments. The proposed MADRL method employs Proximal Policy Optimization to optimize the decision‑making of sensing agents, which is represented in the form of an actor‑critic structure using Convolutional Neural Networks. The observations of the agents are designed in an optimized manner to capture essential information in the environment, and a team‑based reward functions is proposed to produce cooperative agents. The MADRL method covers three action dimensionalities that control the agents' mobility to search the area for the target, detect its existence, and determine its reachability. Using the concept of Transfer Learning, a Deep Learning model builds on the knowledge from the MADRL model to accurately estimating the target location if it is unreachable, resulting in shared representations between the models for faster learning and lower computational complexity. Collectively, the final combined model is capable of searching for the target, determining its existence and reachability, and estimating its location accurately. The proposed method is tested using a radioactive target localization environment and benchmarked against existing methods, showing its efficacy.

PaperID: 11910, https://arxiv.org/pdf/2501.10893.pdf

Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık

Title: Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Abstract:
Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high‑quality agent data from the corresponding environments they interact with. We propose Learn‑by‑interact, a data‑centric framework to adapt LLM agents to any given environments without human annotations. Learn‑by‑interact synthesizes trajectories of agent‑environment interactions based on documentations, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training‑based scenarios and training‑free in‑context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE‑bench, WebArena, OSWorld and Spider2‑V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn‑by‑interact in various downstream agentic tasks ‑‑ baseline results are improved by up to 12.2% for ICL with Claude‑3.5 and 19.5% for training with Codestral‑22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval‑augmented generation (RAG). We expect that Learn‑by‑interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed at real‑world environments.

PaperID: 11911, https://arxiv.org/pdf/2501.10836.pdf

Authors: Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Risham Sidhu, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier

Title: BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Abstract:
Developing interactive agents that can understand language, perceive their surroundings, and act within the physical world is a long‑standing goal of AI research. The Minecraft Collaborative Building Task (MCBT) (Narayan‑Chen, Jayannavar, and Hockenmaier 2019), a two‑player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated 3D Blocks World environment, offers a rich platform to work towards this goal. In this work, we focus on the Builder Action Prediction (BAP) subtask: predicting B's actions in a multimodal game context (Jayannavar, Narayan‑Chen, and Hockenmaier 2020) ‑ a challenging testbed for grounded instruction following, with limited training data. We holistically re‑examine this task and introduce BAP v2 to address key challenges in evaluation, training data, and modeling. Specifically, we define an enhanced evaluation benchmark, featuring a cleaner test set and fairer, more insightful metrics that also reveal spatial reasoning as the primary performance bottleneck. To address data scarcity and to teach models basic spatial skills, we generate different types of synthetic MCBT data. We observe that current, LLM‑based SOTA models trained on the human BAP dialogues fail on these simpler, synthetic BAP ones, but show that training models on this synthetic data improves their performance across the board. We also introduce a new SOTA model, Llama‑CRAFTS, which leverages richer input representations, and achieves an F1 score of 53.0 on the BAP v2 task and strong performance on the synthetic data. While this result marks a notable 6 points improvement over previous work, it also underscores the task's remaining difficulty, establishing BAP v2 as a fertile ground for future research, and providing a useful measure of the spatial capabilities of current text‑only LLMs in such embodied tasks.

PaperID: 11912, https://arxiv.org/pdf/2501.10812.pdf

Authors: Patrick Scheffe, Julius Kahle, Bassam Alrifaee

Title: Graph Coloring to Reduce Computation Time in Prioritized Planning

Abstract:
Distributing computations among agents in large networks reduces computational effort in multi‑agent path finding (MAPF). One distribution strategy is prioritized planning (PP). In PP, we couple and prioritize interacting agents to achieve a desired behavior across all agents in the network. We characterize the interaction with a directed acyclic graph (DAG). The computation time for solving MAPF problem using PP is mainly determined through the longest path in this DAG. The longest path depends on the fixed undirected coupling graph and the variable prioritization. The approaches from literature to prioritize agents are numerous and pursue various goals. This article presents an approach for prioritization in PP to reduce the longest path length in the coupling DAG and thus the computation time for MAPF using PP. We prove that this problem can be mapped to a graph‑coloring problem, in which the number of colors required corresponds to the longest path length in the coupling DAG. We propose a decentralized graph‑coloring algorithm to determine priorities for the agents. We evaluate the approach by applying it to multi‑agent motion planning (MAMP) for connected and automated vehicles (CAVs) on roads using, a variant of MAPF.

PaperID: 11913, https://arxiv.org/pdf/2501.10782.pdf

Authors: Yicheng Xiao, Yangyang Sun, Yicheng Lin

Title: ML-SceGen: A Multi-level Scenario Generation Framework

Abstract:
Current scientific research witnesses various attempts at applying Large Language Models for scenario generation but is inclined only to comprehensive or dangerous scenarios. In this paper, we seek to build a three‑stage framework that not only lets users regain controllability over the generated scenarios but also generates comprehensive scenarios containing danger factors in uncontrolled intersection settings. In the first stage, LLM agents will contribute to translating the key components of the description of the expected scenarios into Functional Scenarios. For the second stage, we use Answer Set Programming (ASP) solver Clingo to help us generate comprehensive logical traffic within intersections. During the last stage, we use LLM to update relevant parameters to increase the critical level of the concrete scenario.

PaperID: 11914, https://arxiv.org/pdf/2501.10781.pdf

Authors: Patrick Scheffe, Julius Kahle, Bassam Alrifaee

Title: Simultaneous Computation with Multiple Prioritizations in Multi-Agent Motion Planning

Abstract:
Multi‑agent path finding (MAPF) in large networks is computationally challenging. An approach for MAPF is prioritized planning (PP), in which agents plan sequentially according to their priority. Albeit a computationally efficient approach for MAPF, the solution quality strongly depends on the prioritization. Most prioritizations rely either on heuristics, which do not generalize well, or iterate to find adequate priorities, which costs computational effort. In this work, we show how agents can compute with multiple prioritizations simultaneously. Our approach is general as it does not rely on domain‑specific knowledge. The context of this work is multi‑agent motion planning (MAMP) with a receding horizon subject to computation time constraints. MAMP considers the system dynamics in more detail compared to MAPF. In numerical experiments on MAMP, we demonstrate that our approach to prioritization comes close to optimal prioritization and outperforms state‑of‑the‑art methods with only a minor increase in computation time. We show real‑time capability in an experiment on a road network with ten vehicles in our Cyber‑Physical Mobility Lab.

PaperID: 11915, https://arxiv.org/pdf/2501.10713.pdf

Authors: Oliver Chojnowski, Alexander Eberhard, Michael Schiffmann, Ana Müller, Anja Richert

Title: Human-like Nonverbal Behavior with MetaHumans in Real-World Interaction Studies: An Architecture Using Generative Methods and Motion Capture

Abstract:
Socially interactive agents are gaining prominence in domains like healthcare, education, and service contexts, particularly virtual agents due to their inherent scalability. To facilitate authentic interactions, these systems require verbal and nonverbal communication through e.g., facial expressions and gestures. While natural language processing technologies have rapidly advanced, incorporating human‑like nonverbal behavior into real‑world interaction contexts is crucial for enhancing the success of communication, yet this area remains underexplored. One barrier is creating autonomous systems with sophisticated conversational abilities that integrate human‑like nonverbal behavior. This paper presents a distributed architecture using Epic Games MetaHuman, combined with advanced conversational AI and camera‑based user management, that supports methods like motion capture, handcrafted animation, and generative approaches for nonverbal behavior. We share insights into a system architecture designed to investigate nonverbal behavior in socially interactive agents, deployed in a three‑week field study in the Deutsches Museum Bonn, showcasing its potential in realistic nonverbal behavior research.

PaperID: 11916, https://arxiv.org/pdf/2501.10487.pdf

Authors: TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park

Title: Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data

Abstract:
Tables are a primary medium for conveying critical information in administrative domains, yet their complexity hinders utilization by Large Language Models (LLMs). This paper introduces the Theme‑Explanation Structure‑based Table Summarization (Tabular‑TX) pipeline, a novel approach designed to generate highly interpretable summaries from tabular data, with a specific focus on Korean administrative documents. Current table summarization methods often neglect the crucial aspect of human‑friendly output. Tabular‑TX addresses this by first employing a multi‑step reasoning process to ensure deep table comprehension by LLMs, followed by a journalist persona prompting strategy for clear sentence generation. Crucially, it then structures the output into a Theme Part (an adverbial phrase) and an Explanation Part (a predicative clause), significantly enhancing readability. Our approach leverages in‑context learning, obviating the need for extensive fine‑tuning and associated labeled data or computational resources. Experimental results show that Tabular‑TX effectively processes complex table structures and metadata, offering a robust and efficient solution for generating human‑centric table summaries, especially in low‑resource scenarios.

PaperID: 11917, https://arxiv.org/pdf/2501.10385.pdf

Authors: Indrajeet Mandal, Jitendra Soni, Mohd Zaki, Morten M. Smedskjaer, Katrin Wondraczek, Lothar Wondraczek, Nitya Nand Gosvami, N. M. Anoop Krishnan

Title: Autonomous Microscopy Experiments through Large Language Model Agents

Abstract:
Large language models (LLMs) are revolutionizing self driving laboratories (SDLs) for materials research, promising unprecedented acceleration of scientific discovery. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. We introduce Artificially Intelligent Lab Assistant (AILA), a framework automating atomic force microscopy through LLM driven agents. Further, we develop AFMBench a comprehensive evaluation suite challenging AI agents across the complete scientific workflow from experimental design to results analysis. We find that state of the art models struggle with basic tasks and coordination scenarios. Notably, Claude 3.5 sonnet performs unexpectedly poorly despite excelling in materials domain question answering (QA) benchmarks, revealing that domain specific QA proficiency does not necessarily translate to effective agentic capabilities. Additionally, we observe that LLMs can deviate from instructions, raising safety alignment concerns for SDL applications. Our ablations reveal that multi agent frameworks outperform single‑agent architectures. We also observe significant prompt fragility, where slight modifications in prompt structure cause substantial performance variations in capable models like GPT 4o. Finally, we evaluate AILA's effectiveness in increasingly advanced experiments AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. Our findings underscore the necessity for rigorous benchmarking protocols and prompt engineering strategies before deploying AI laboratory assistants in scientific research environments.

PaperID: 11918, https://arxiv.org/pdf/2501.10367.pdf

Authors: Mengxian Li, Qi Wang, Yongjun Xu

Title: GTDE: Grouped Training with Decentralized Execution for Multi-agent Actor-Critic

Abstract:
The rapid advancement of multi‑agent reinforcement learning (MARL) has given rise to diverse training paradigms to learn the policies of each agent in the multi‑agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade the performance metrics, such as win rate, total reward, etc. To reduce the influence of the increasing number of agents on the performance metrics, we propose a novel training paradigm of grouped training decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large‑scale multi‑agent systems. Specifically, we first introduce an adaptive grouping module, which divides each agent into different groups based on their observation history. To implement end‑to‑end training, GTDE uses Gumbel‑Sigmoid for efficient point‑to‑point sampling on the grouping distribution while ensuring gradient backpropagation. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 382% compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.

PaperID: 11919, https://arxiv.org/pdf/2501.10321.pdf

Authors: Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar

Title: Towards Human-Guided, Data-Centric LLM Co-Pilots

Abstract:
Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM‑based co‑pilots to democratize ML for non‑technical domain experts, these systems remain predominantly focused on model‑centric aspects while overlooking critical data‑centric challenges. This limitation is problematic in complex real‑world settings where raw data often contains complex issues, such as missing values, label noise, and domain‑specific nuances requiring tailored handling. To address this we introduce CliMB‑DC, a human‑guided, data‑centric framework for LLM co‑pilots that combines advanced data‑centric tools with LLM‑driven reasoning to enable robust, context‑aware data processing. At its core, CliMB‑DC introduces a novel, multi‑agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human‑in‑the‑loop approach. To guide development, we formalize a taxonomy of key data‑centric challenges that co‑pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state‑of‑the‑art data‑centric tools into an extensible, open‑source architecture, facilitating the addition of new tools from the research community. Empirically, using real‑world healthcare datasets we demonstrate CliMB‑DC's ability to transform uncurated datasets into ML‑ready formats, significantly outperforming existing co‑pilot baselines for handling data‑centric challenges. CliMB‑DC promises to empower domain experts from diverse domains ‑‑ healthcare, finance, social sciences and more ‑‑ to actively participate in driving real‑world impact using ML.

PaperID: 11920, https://arxiv.org/pdf/2501.10316.pdf

Authors: Suvodip Dey, Yi-Jyun Sun, Gokhan Tur, Dilek Hakkani-Tur

Title: Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling

Abstract:
Recent LLMs have enabled significant advancements for conversational agents. However, they are also well known to hallucinate, producing responses that seem plausible but are factually incorrect. On the other hand, users tend to over‑rely on LLM‑based AI agents, accepting AI's suggestion even when it is wrong. Adding positive friction, such as explanations or getting user confirmations, has been proposed as a mitigation in AI‑supported decision‑making systems. In this paper, we propose an accountability model for LLM‑based task‑oriented dialogue agents to address user overreliance via friction turns in cases of model uncertainty and errors associated with dialogue state tracking (DST). The accountability model is an augmented LLM with an additional accountability head that functions as a binary classifier to predict the relevant slots of the dialogue state mentioned in the conversation. We perform our experiments with multiple backbone LLMs on two established benchmarks (MultiWOZ and Snips). Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions. We observe around 3% absolute improvement in joint goal accuracy (JGA) of DST output by incorporating accountability heads into modern LLMs. Self‑correcting the detected errors further increases the JGA from 67.13 to 70.51, achieving state‑of‑the‑art DST performance. Finally, we show that error correction through user confirmations (friction turn) achieves a similar performance gain, highlighting its potential to reduce user overreliance.

PaperID: 11921, https://arxiv.org/pdf/2501.10262.pdf

Authors: Niklas Dahlquist, Samuel Nordström, Nikolaos Stathoulopoulos, Björn Lindqvist, Akshit Saradagi, George Nikolakopoulos

Title: Deployment of an Aerial Multi-agent System for Automated Task Execution in Large-scale Underground Mining Environments

Abstract:
In this article, we present a framework for deploying an aerial multi‑agent system in large‑scale subterranean environments with minimal infrastructure for supporting multi‑agent operations. The multi‑agent objective is to optimally and reactively allocate and execute inspection tasks in a mine, which are entered by a mine operator on‑the‑fly. The assignment of currently available tasks to the team of agents is accomplished through an auction‑based system, where the agents bid for available tasks, which are used by a central auctioneer to optimally assigns tasks to agents. A mobile Wi‑Fi mesh supports inter‑agent communication and bi‑directional communication between the agents and the task allocator, while the task execution is performed completely infrastructure‑free. Given a task to be accomplished, a reliable and modular agent behavior is synthesized by generating behavior trees from a pool of agent capabilities, using a back‑chaining approach. The auction system in the proposed framework is reactive and supports addition of new operator‑specified tasks on‑the‑go, at any point through a user‑friendly operator interface. The framework has been validated in a real underground mining environment using three aerial agents, with several inspection locations spread in an environment of almost 200 meters. The proposed framework can be utilized for missions involving rapid inspection, gas detection, distributed sensing and mapping etc. in a subterranean environment. The proposed framework and its field deployment contributes towards furthering reliable automation in large‑scale subterranean environments to offload both routine and dangerous tasks from human operators to autonomous aerial robots.

PaperID: 11922, https://arxiv.org/pdf/2501.10116.pdf

Authors: Zifeng Shi, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong, Ping Wei

Title: GAWM: Global-Aware World Model for Multi-Agent Reinforcement Learning

Abstract:
In recent years, Model‑based Multi‑Agent Reinforcement Learning (MARL) has demonstrated significant advantages over model‑free methods in terms of sample efficiency by using independent environment dynamics world models for data sample augmentation. However, without considering the limited sample size, these methods still lag behind model‑free methods in terms of final convergence performance and stability. This is primarily due to the world model's insufficient and unstable representation of global states in partially observable environments. This limitation hampers the ability to ensure global consistency in the data samples and results in a time‑varying and unstable distribution mismatch between the pseudo data samples generated by the world model and the real samples. This issue becomes particularly pronounced in more complex multi‑agent environments. To address this challenge, we propose a model‑based MARL method called GAWM, which enhances the centralized world model's ability to achieve globally unified and accurate representation of state information while adhering to the CTDE paradigm. GAWM uniquely leverages an additional Transformer architecture to fuse local observation information from different agents, thereby improving its ability to extract and represent global state information. This enhancement not only improves sample efficiency but also enhances training stability, leading to superior convergence performance, particularly in complex and challenging multi‑agent environments. This advancement enables model‑based methods to be effectively applied to more complex multi‑agent environments. Experimental results demonstrate that GAWM outperforms various model‑free and model‑based approaches, achieving exceptional performance in the challenging domains of SMAC.

PaperID: 11923, https://arxiv.org/pdf/2501.09832.pdf

Authors: Hafiz Majid Hussain, Ashfaq Ahmad. Pedro H. J. Nardelli

Title: Crossover-BPSO Driven Multi-Agent Technology for Managing Local Energy Systems

Abstract:
This article presents a new hybrid algorithm, crossover binary particle swarm optimization (crBPSO), for allocating resources in local energy systems via multi‑agent (MA) technology. Initially, a hierarchical MA‑based architecture in a grid‑connected local energy setup is presented. In this architecture, task specific agents operate in a master‑slave manner. Where, the master runs a well‑formulated optimization routine aiming at minimizing costs of energy procurement, battery degradation, and load scheduling delay. The slaves update the master on their current status and receive optimal action plans accordingly. Simulation results demonstrate that the proposed algorithm outperforms selected existing ones by 21% in terms average energy system costs while satisfying customers' energy demand and maintaining the required quality of service.

PaperID: 11924, https://arxiv.org/pdf/2501.09674.pdf

Authors: Tobin South, Samuele Marro, Thomas Hardjono, Robert Mahari, Cedric Deslandes Whitney, Dazza Greenwood, Alan Chan, Alex Pentland

Title: Authenticated Delegation and Authorized AI Agents

Abstract:
The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to know whom AI agents act on behalf of and guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, where human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent‑specific credentials and metadata, maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring agentic AI systems perform only appropriate actions and providing a tool for digital service providers to enable AI agent interactions without risking harm from scalable interaction.

PaperID: 11925, https://arxiv.org/pdf/2501.09631.pdf

Authors: Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato

Title: Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

Abstract:
In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine‑tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi‑hop questions, including true/false and multiple‑choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V‑Information (PVI) based fine‑tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data with 2.24% and 1.31% performance boost for different models compared to baselines, respectively. To demonstrate the effectiveness of the fine‑tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non‑orthogonal multiple access (NOMA), which are generated by using the proposed multi‑agent framework. Simulation results show significant performance gain in summarization tasks with 20.9% in the ROUGE‑L metrics. We also study the scaling laws of fine‑tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine‑tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.

PaperID: 11926, https://arxiv.org/pdf/2501.09563.pdf

Authors: Eric S. Fraga, Veerawat Udomvorakulchai, Miguel Pineda, Lazaros G. Papageorgiou

Title: A Multi-agent System for Hybrid Optimization

Abstract:
Optimization problems in process engineering, including design and operation, can often pose challenges to many solvers: multi‑modal, non‑smooth, and discontinuous models often with large computational requirements. In such cases, the optimization problem is often treated as a black box in which only the value of the objective function is required, sometimes with some indication of the measure of the violation of the constraints. Such problems have traditionally been tackled through the use of direct search and meta‑heuristic methods. The challenge, then, is to determine which of these methods or combination of methods should be considered to make most effective use of finite computational resources. This paper presents a multi‑agent system for optimization which enables a set of solvers to be applied simultaneously to an optimization problem, including different instantiations of any solver. The evaluation of the optimization problem model is controlled by a scheduler agent which facilitates cooperation and competition between optimization methods. The architecture and implementation of the agent system is described in detail, including the solver, model evaluation, and scheduler agents. A suite of direct search and meta‑heuristic methods has been developed for use with this system. Case studies from process systems engineering applications are presented and the results show the potential benefits of automated cooperation between different optimization solvers and motivates the implementation of competition between solvers.

PaperID: 11927, https://arxiv.org/pdf/2501.09493.pdf

Authors: Nuo Chen, Quanyu Dai, Xiaoyu Dong, Piaohong Wang, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Xiao-Ming Wu

Title: Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework

Abstract:
Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by separately evaluating item recommendation and dialogue management using rule‑based metrics. However, these methods fail to capture the real human experience, and they cannot draw direct conclusions about the system's overall performance. As conversational recommender systems become increasingly vital in e‑commerce, social media, and customer support, the ability to evaluate both recommendation accuracy and dialogue management quality using a single metric, thereby authentically reflecting user experience, has become the principal challenge impeding progress in this field. In this work, we propose a user‑centric evaluation framework based on large language models (LLMs) for CRSs, namely Conversational Recommendation Evaluator (CoRE). CoRE consists of two main components: (1) LLM‑As‑Evaluator. Firstly, we comprehensively summarize 12 key factors influencing user experience in CRSs and directly leverage LLM as an evaluator to assign a score to each factor. (2) Multi‑Agent Debater. Secondly, we design a multi‑agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) to discuss and synthesize the 12 evaluation factors into a unified overall performance score. Furthermore, we apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation in most of the 12 factors and the overall assessment. Especially, CoRE's overall evaluation scores demonstrate significantly better alignment with human feedback compared to existing rule‑based metrics.

PaperID: 11928, https://arxiv.org/pdf/2501.09444.pdf

Authors: King-kui Sin, Xi Xuan, Chunyu Kit, Clara Ho-yan Chan, Honic Ho-kin Ip

Title: Solving the Unsolvable: Translating Case Law in Hong Kong

Abstract:
This paper addresses the challenges translating case law under Hong Kong's bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the governments and judiciarys sporadic and uncoordinated efforts to translate case law, contrasting it with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciarys position that translating all judgments is unnecessary, unrealistic, and not cost‑effectiveis analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human‑machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to using a large language model for improved translation accuracy. Furthermore, it evolves from a single‑agent system to a multi‑agent system, incorporating Translator, Annotator, and Proofreader agents. This multi‑agent approach, supported by a grant, aims to facilitate efficient, high‑quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.

PaperID: 11929, https://arxiv.org/pdf/2501.09429.pdf

Authors: Benjamin Patrick Evans, Sihan Zeng, Sumitra Ganesh, Leo Ardon

Title: ADAGE: A generic two-layer framework for adaptive agent based modelling

Abstract:
Agent‑based models (ABMs) are valuable for modelling complex, potentially out‑of‑equilibria scenarios. However, ABMs have long suffered from the Lucas critique, stating that agent behaviour should adapt to environmental changes. Furthermore, the environment itself often adapts to these behavioural changes, creating a complex bi‑level adaptation problem. Recent progress integrating multi‑agent reinforcement learning into ABMs introduces adaptive agent behaviour, beginning to address the first part of this critique, however, the approaches are still relatively ad hoc, lacking a general formulation, and furthermore, do not tackle the second aspect of simultaneously adapting environmental level characteristics in addition to the agent behaviours. In this work, we develop a generic two‑layer framework for ADaptive AGEnt based modelling (ADAGE) for addressing these problems. This framework formalises the bi‑level problem as a Stackelberg game with conditional behavioural policies, providing a consolidated framework for adaptive agent‑based modelling based on solving a coupled set of non‑linear equations. We demonstrate how this generic approach encapsulates several common (previously viewed as distinct) ABM tasks, such as policy design, calibration, scenario generation, and robust behavioural learning under one unified framework. We provide example simulations on multiple complex economic and financial environments, showing the strength of the novel framework under these canonical settings, addressing long‑standing critiques of traditional ABMs.

PaperID: 11930, https://arxiv.org/pdf/2501.09426.pdf

Authors: Ancheng Xu, Di Yang, Renhao Li, Jingwei Zhu, Minghuan Tan, Min Yang, Wanxin Qiu, Mingchen Ma, Haihong Wu, Bingyu Li, Feng Sha, Chengming Li, Xiping Hu, Qiang Qu, Derek F. Wong, Ruifeng Xu

Title: AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling

Abstract:
Traditional in‑person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM‑based CBT systems use agents with a fixed structure, limiting their self‑optimization capabilities, or providing hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora‑like and YiXinLi single‑round consultation models to build a general agent framework that generates high‑quality responses for single‑turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single‑response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct a CBT‑oriented autonomous multi‑agent framework, demonstrating its general applicability. Experimental results indicate that AutoCBT can provide higher‑quality automated psychological counseling services.

PaperID: 11931, https://arxiv.org/pdf/2501.09355.pdf

Authors: Saptarashmi Bandyopadhyay, Vikas Bahirwani, Lavisha Aggarwal, Bhanu Guda, Lin Li, Andrea Colaco

Title: YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks

Abstract:
Multimodal AI Agents are AI models that have the capability of interactively and cooperatively assisting human users to solve day‑to‑day tasks. Augmented Reality (AR) head worn devices can uniquely improve the user experience of solving procedural day‑to‑day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI Agents. Such AR capabilities can help AI Agents see and listen to actions that users take which can relate to multimodal capabilities of human users. Existing AI Agents, either Large Language Models (LLMs) or Multimodal Vision‑Language Models (VLMs) are reactive in nature, which means that models cannot take an action without reading or listening to the human user's prompts. Proactivity of AI Agents on the other hand can help the human user detect and correct any mistakes in agent observed tasks, encourage users when they do tasks correctly or simply engage in conversation with the user ‑ akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users that can help the user correct mistakes on tasks, like cooking, using AR. Our YETI Agent learns scene understanding signals based on interpretable notions of Structural Similarity (SSIM) on consecutive video frames. We also define the alignment signal which the AI Agent can learn to identify if the video frames corresponding to the user's actions on the task are consistent with expected actions. These signals are used by our AI Agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark for an expert agent guiding a user to complete procedural tasks.

PaperID: 11932, https://arxiv.org/pdf/2501.09316.pdf

Authors: Anbang Ye, Qianran Ma, Jia Chen, Muqi Li, Tong Li, Fujiao Liu, Siqi Mai, Meichen Lu, Haitao Bao, Yang You

Title: SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs

Abstract:
Despite significant advancements in general‑purpose AI agents, several challenges still hinder their practical application in real‑world scenarios. First, the limited planning capabilities of Large Language Models (LLM) restrict AI agents from effectively solving complex tasks that require long‑horizon planning. Second, general‑purpose AI agents struggle to efficiently utilize domain‑specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure‑guided Agent (SOP‑agent), a novel framework for constructing domain‑specific agents through pseudocode‑style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent a SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision‑making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP‑agent demonstrates excellent versatility, achieving performance superior to general‑purpose agent frameworks and comparable to domain‑specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision‑making capabilities of AI agents in customer service scenarios based on SOPs.

PaperID: 11933, https://arxiv.org/pdf/2501.09242.pdf

Authors: Stéphane Pouget, Michael Lo, Louis-Noël Pouchet, Jason Cong

Title: Holistic Optimization Framework for FPGA Accelerators

Abstract:
Customized accelerators have revolutionized modern computing by delivering substantial gains in energy efficiency and performance through hardware specialization. Field‑Programmable Gate Arrays (FPGAs) play a crucial role in this paradigm, offering unparalleled flexibility and high‑performance potential. High‑Level Synthesis (HLS) and source‑to‑source compilers have simplified FPGA development by translating high‑level programming languages into hardware descriptions enriched with directives. However, achieving high Quality of Results (QoR) remains a significant challenge, requiring intricate code transformations, strategic directive placement, and optimized data communication. This paper presents Prometheus, a holistic optimization framework that integrates key optimizations ‑‑ including task fusion, tiling, loop permutation, computation‑communication overlap, and concurrent task execution ‑‑ into a unified design space. By leveraging Non‑Linear Programming (NLP) methodologies, Prometheus explores the optimization space under strict resource constraints, enabling automatic bitstream generation. Unlike existing frameworks, Prometheus considers interdependent transformations and dynamically balances computation and memory access. We evaluate Prometheus across multiple benchmarks, demonstrating its ability to maximize parallelism, minimize execution stalls, and optimize data movement. The results showcase its superior performance compared to state‑of‑the‑art FPGA optimization frameworks, highlighting its effectiveness in delivering high QoR while reducing manual tuning efforts.

PaperID: 11934, https://arxiv.org/pdf/2501.09158.pdf

Authors: Stephanie L. Day, Jacapo Cirica, Steven R. Clapp, Veronika Penkova, Amy E. Giroux, Abbey Banta, Catherine Bordeau, Poojitha Mutteneni, Ben D. Sawyer

Title: Evaluating GenAI for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

Abstract:
Generative artificial intelligence (GenAI) holds great promise as a tool to support personalized learning. Teachers need tools to efficiently and effectively enhance content readability of educational texts so that they are matched to individual students reading levels, while retaining key details. Large Language Models (LLMs) show potential to fill this need, but previous research notes multiple shortcomings in current approaches. In this study, we introduced a generalized approach and metrics for the systematic evaluation of the accuracy and consistency in which LLMs, prompting techniques, and a novel multi‑agent architecture to simplify sixty informational reading passages, reducing each from the twelfth grade level down to the eighth, sixth, and fourth grade levels. We calculated the degree to which each LLM and prompting technique accurately achieved the targeted grade level for each passage, percentage change in word count, and consistency in maintaining keywords and key phrases (semantic similarity). One‑sample t‑tests and multiple regression models revealed significant differences in the best performing LLM and prompt technique for each of the four metrics. Both LLMs and prompting techniques demonstrated variable utility in grade level accuracy and consistency of keywords and key phrases when attempting to level content down to the fourth grade reading level. These results demonstrate the promise of the application of LLMs for efficient and precise automated text simplification, the shortcomings of current models and prompting methods in attaining an ideal balance across various evaluation criteria, and a generalizable method to evaluate future systems.

PaperID: 11935, https://arxiv.org/pdf/2501.09136.pdf

Authors: Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, Athanasios V. Vasilakos

Title: Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

Abstract:
Large Language Models (LLMs) have advanced artificial intelligence by enabling human‑like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real‑time queries, resulting in outdated or inaccurate outputs. Retrieval‑Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real‑time data retrieval to provide contextually relevant and up‑to‑date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multi‑step reasoning and complex task management. Agentic Retrieval‑Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns reflection, planning, tool use, and multi‑agent collaboration to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows through operational structures ranging from sequential steps to adaptive collaboration. This integration enables Agentic RAG systems to deliver flexibility, scalability, and context‑awareness across diverse applications. This paper presents an analytical survey of Agentic RAG systems. It traces the evolution of RAG paradigms, introduces a principled taxonomy of Agentic RAG architectures based on agent cardinality, control structure, autonomy, and knowledge representation, and provides a comparative analysis of design trade‑offs across existing frameworks. The survey examines applications in healthcare, finance, education, and enterprise document processing, and distills practical lessons for system designers and practitioners. Finally, it identifies key open research challenges related to evaluation, coordination, memory management, efficiency, and governance, outlining directions for future research.

PaperID: 11936, https://arxiv.org/pdf/2501.08985.pdf

Authors: Qianmin Lou, Wentao Xu

Title: Personality Modeling for Persuasion of Misinformation using AI Agent

Abstract:
The proliferation of misinformation on social media platforms has highlighted the need to understand how individual personality traits influence susceptibility to and propagation of misinformation. This study employs an innovative agent‑based modeling approach to investigate the relationship between personality traits and misinformation dynamics. Using six AI agents embodying different dimensions of the Big Five personality traits (Extraversion, Agreeableness, and Neuroticism), we simulated interactions across six diverse misinformation topics. The experiment, implemented through the AgentScope framework using the GLM‑4‑Flash model, generated 90 unique interactions, revealing complex patterns in how personality combinations affect persuasion and resistance to misinformation. Our findings demonstrate that analytical and critical personality traits enhance effectiveness in evidence‑based discussions, while non‑aggressive persuasion strategies show unexpected success in misinformation correction. Notably, agents with critical traits achieved a 59.4% success rate in HIV‑related misinformation discussions, while those employing non‑aggressive approaches maintained consistent persuasion rates above 40% across different personality combinations. The study also revealed a non‑transitive pattern in persuasion effectiveness, challenging conventional assumptions about personality‑based influence. These results provide crucial insights for developing personality‑aware interventions in digital environments and suggest that effective misinformation countermeasures should prioritize emotional connection and trust‑building over confrontational approaches. The findings contribute to both theoretical understanding of personality‑misinformation dynamics and practical strategies for combating misinformation in social media contexts.

PaperID: 11937, https://arxiv.org/pdf/2501.08941.pdf

Authors: Surya Murthy, John-Paul Clarke, Ufuk Topcu, Zhenyu Gao

Title: A Reinforcement Learning Approach to Quiet and Safe UAM Traffic Management

Abstract:
Urban air mobility (UAM) is a transformative system that operates various small aerial vehicles in urban environments to reshape urban transportation. However, integrating UAM into existing urban environments presents a variety of complex challenges. Recent analyses of UAM's operational constraints highlight aircraft noise and system safety as key hurdles to UAM system implementation. Future UAM air traffic management schemes must ensure that the system is both quiet and safe. We propose a multi‑agent reinforcement learning approach to manage UAM traffic, aiming at both vertical separation assurance and noise mitigation. Through extensive training, the reinforcement learning agent learns to balance the two primary objectives by employing altitude adjustments in a multi‑layer UAM network. The results reveal the tradeoffs among noise impact, traffic congestion, and separation. Overall, our findings demonstrate the potential of reinforcement learning in mitigating UAM's noise impact while maintaining safe separation using altitude adjustments

PaperID: 11938, https://arxiv.org/pdf/2501.08778.pdf

Authors: Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo

Title: Networked Agents in the Dark: Team Value Learning under Partial Observability

Abstract:
We propose a novel cooperative multi‑agent reinforcement learning (MARL) approach for networked agents. In contrast to previous methods that rely on complete state information or joint observations, our agents must learn how to reach shared objectives under partial observability. During training, they collect individual rewards and approximate a team value function through local communication, resulting in cooperative behavior. To describe our problem, we introduce the networked dynamic partially observable Markov game framework, where agents communicate over a switching topology communication network. Our distributed method, DNA‑MARL, uses a consensus mechanism for local communication and gradient descent for local computation. DNA‑MARL increases the range of the possible applications of networked agents, being well‑suited for real world domains that impose privacy and where the messages may not reach their recipients. We evaluate DNA‑MARL across benchmark MARL scenarios. Our results highlight the superior performance of DNA‑MARL over previous methods.

PaperID: 11939, https://arxiv.org/pdf/2501.08760.pdf

Authors: Yunze Wei, Xiaohui Xie, Tianshuo Hu, Yiwei Zuo, Xinyi Chen, Kaiwen Chi, Yong Cui

Title: INTA: Intent-Based Translation for Network Configuration with LLM Agents

Abstract:
Translating configurations between different network devices is a common yet challenging task in modern network operations. This challenge arises in typical scenarios such as replacing obsolete hardware and adapting configurations to emerging paradigms like Software Defined Networking (SDN) and Network Function Virtualization (NFV). Engineers need to thoroughly understand both source and target configuration models, which requires considerable effort due to the complexity and evolving nature of these specifications. To promote automation in network configuration translation, we propose INTA, an intent‑based translation framework that leverages Large Language Model (LLM) agents. The key idea of INTA is to use configuration intent as an intermediate representation for translation. It first employs LLMs to decompose configuration files and extract fine‑grained intents for each configuration fragment. These intents are then used to retrieve relevant manuals of the target device. Guided by a syntax checker, INTA incrementally generates target configurations. The translated configurations are further verified and refined for semantic consistency. We implement INTA and evaluate it on real‑world configuration datasets from the industry. Our approach outperforms state‑of‑the‑art methods in translation accuracy and exhibits strong generalizability. INTA achieves an accuracy of 98.15% in terms of both syntactic and view correctness, and a command recall rate of 84.72% for the target configuration. The semantic consistency report of the translated configuration further demonstrates its practical value in real‑world network operations.

PaperID: 11940, https://arxiv.org/pdf/2501.08655.pdf

Authors: Raúl Arranz, David Carramiñana, Gonzalo de Miguel, Juan A. Besada, Ana M. Bernardos

Title: Application of Deep Reinforcement Learning to UAV Swarming for Ground Surveillance

Abstract:
This paper summarizes in depth the state of the art of aerial swarms, covering both classical and new reinforcement‑learning‑based approaches for their management. Then, it proposes a hybrid AI system, integrating deep reinforcement learning in a multi‑agent centralized swarm architecture. The proposed system is tailored to perform surveillance of a specific area, searching and tracking ground targets, for security and law enforcement applications. The swarm is governed by a central swarm controller responsible for distributing different search and tracking tasks among the cooperating UAVs. Each UAV agent is then controlled by a collection of cooperative sub‑agents, whose behaviors have been trained using different deep reinforcement learning models, tailored for the different task types proposed by the swarm controller. More specifically, proximal policy optimization (PPO) algorithms were used to train the agents' behavior. In addition, several metrics to assess the performance of the swarm in this application were defined. The results obtained through simulation show that our system searches the operation area effectively, acquires the targets in a reasonable time, and is capable of tracking them continuously and consistently.

PaperID: 11941, https://arxiv.org/pdf/2501.08600.pdf

Authors: Tyler Stennett, Myeongsoo Kim, Saurabh Sinha, Alessandro Orso

Title: AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL

Abstract:
As REST APIs have become widespread in modern web services, comprehensive testing of these APIs is increasingly crucial. Because of the vast search space of operations, parameters, and parameter values, along with their dependencies and constraints, current testing tools often achieve low code coverage, resulting in suboptimal fault detection. To address this limitation, we present AutoRestTest, a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi‑Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation‑dependent parameters using the SPDG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. Through an intuitive command‑line interface, users can easily configure and monitor tests with successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary findings, with a demonstration video available at https://www.youtube.com/watch?v=VVus2W8rap8.

PaperID: 11942, https://arxiv.org/pdf/2501.08324.pdf

Authors: Ziyuan Huang, Vishaldeep Kaur Sekhon, Roozbeh Sadeghian, Maria L. Vaida, Cynthia Jo, Doyle Ward, Vanni Bucci, John P. Haran

Title: ADAM: An AI Reasoning and Bioinformatics Model for Alzheimer's Disease Detection and Microbiome-Clinical Data Integration

Abstract:
Alzheimer's Disease Analysis Model (ADAM) is a multi‑agent reasoning large language model (LLM) framework designed to integrate and analyze multimodal data, including microbiome profiles, clinical datasets, and external knowledge bases, to enhance the understanding and classification of Alzheimer's disease (AD). By leveraging the agentic system with LLM, ADAM produces insights from diverse data sources and contextualizes the findings with literature‑driven evidence. A comparative evaluation with XGBoost revealed a significantly improved mean F1 score and significantly reduced variance for ADAM, highlighting its robustness and consistency, particularly when utilizing human biological data. Although currently tailored for binary classification tasks with two data modalities, future iterations will aim to incorporate additional data types, such as neuroimaging and peripheral biomarkers, and expand them to predict disease progression, thereby broadening ADAM's scalability and applicability in AD research and diagnostic applications.

PaperID: 11943, https://arxiv.org/pdf/2501.08262.pdf

Authors: Hui Wu, Xiaoyang Wang, Zhong Fan

Title: Addressing the sustainable AI trilemma: a case study on LLM agents and RAG

Abstract:
Large language models (LLMs) have demonstrated significant capabilities, but their widespread deployment and more advanced applications raise critical sustainability challenges, particularly in inference energy consumption. We propose the concept of the Sustainable AI Trilemma, highlighting the tensions between AI capability, digital equity, and environmental sustainability. Through a systematic case study of LLM agents and retrieval‑augmented generation (RAG), we analyze the energy costs embedded in memory module designs and introduce novel metrics to quantify the trade‑offs between energy consumption and system performance. Our experimental results reveal significant energy inefficiencies in current memory‑augmented frameworks and demonstrate that resource‑constrained environments face disproportionate efficiency penalties. Our findings challenge the prevailing LLM‑centric paradigm in agent design and provide practical insights for developing more sustainable AI systems.

PaperID: 11944, https://arxiv.org/pdf/2501.08243.pdf

Authors: Kannan Parthasarathy, Karthik Vaidhyanathan, Rudra Dhar, Venkat Krishnamachari, Basil Muhammed, Adyansh Kakran, Sreemaee Akshathala, Shrikara Arun, Sumant Dubey, Mohan Veerubhotla, Amey Karan

Title: Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps

Abstract:
Cloud Operations (CloudOps) is a rapidly growing field focused on the automated management and optimization of cloud infrastructure which is essential for organizations navigating increasingly complex cloud environments. MontyCloud Inc. is one of the major companies in the CloudOps domain that leverages autonomous bots to manage cloud compliance, security, and continuous operations. To make the platform more accessible and effective to the customers, we leveraged the use of GenAI. Developing a GenAI‑based solution for autonomous CloudOps for the existing MontyCloud system presented us with various challenges such as i) diverse data sources; ii) orchestration of multiple processes; and iii) handling complex workflows to automate routine tasks. To this end, we developed MOYA, a multi‑agent framework that leverages GenAI and balances autonomy with the necessary human control. This framework integrates various internal and external systems and is optimized for factors like task orchestration, security, and error mitigation while producing accurate, reliable, and relevant insights by utilizing Retrieval Augmented Generation (RAG). Evaluations of our multi‑agent system with the help of practitioners as well as using automated checks demonstrate enhanced accuracy, responsiveness, and effectiveness over non‑agentic approaches across complex workflows.

PaperID: 11945, https://arxiv.org/pdf/2501.08234.pdf

Authors: Enrique Adrian Villarrubia-Martin, Luis Rodriguez-Benitez, David Muñoz-Valero, Giovanni Montana, Luis Jimenez-Linares

Title: Dynamic Pricing in High-Speed Railways Using Multi-Agent Reinforcement Learning

Abstract:
This paper addresses a critical challenge in the high‑speed passenger railway industry: designing effective dynamic pricing strategies in the context of competing and cooperating operators. To address this, a multi‑agent reinforcement learning (MARL) framework based on a non‑zero‑sum Markov game is proposed, incorporating random utility models to capture passenger decision making. Unlike prior studies in areas such as energy, airlines, and mobile networks, dynamic pricing for railway systems using deep reinforcement learning has received limited attention. A key contribution of this paper is a parametrisable and versatile reinforcement learning simulator designed to model a variety of railway network configurations and demand patterns while enabling realistic, microscopic modelling of user behaviour, called RailPricing‑RL. This environment supports the proposed MARL framework, which models heterogeneous agents competing to maximise individual profits while fostering cooperative behaviour to synchronise connecting services. Experimental results validate the framework, demonstrating how user preferences affect MARL performance and how pricing policies influence passenger choices, utility, and overall system dynamics. This study provides a foundation for advancing dynamic pricing strategies in railway systems, aligning profitability with system‑wide efficiency, and supporting future research on optimising pricing policies.

PaperID: 11946, https://arxiv.org/pdf/2501.08020.pdf

Authors: Juan Palma-Borda, Eduardo Guzmán, María-Victoria Belmonte

Title: Cooperative Patrol Routing: Optimizing Urban Crime Surveillance through Multi-Agent Reinforcement Learning

Abstract:
The effective design of patrol strategies is a difficult and complex problem, especially in medium and large areas. The objective is to plan, in a coordinated manner, the optimal routes for a set of patrols in a given area, in order to achieve maximum coverage of the area, while also trying to minimize the number of patrols. In this paper, we propose a multi‑agent reinforcement learning (MARL) model, based on a decentralized partially observable Markov decision process, to plan unpredictable patrol routes within an urban environment represented as an undirected graph. The model attempts to maximize a target function that characterizes the environment within a given time frame. Our model has been tested to optimize police patrol routes in three medium‑sized districts of the city of Malaga. The aim was to maximize surveillance coverage of the most crime‑prone areas, based on actual crime data in the city. To address this problem, several MARL algorithms have been studied, and among these the Value Decomposition Proximal Policy Optimization (VDPPO) algorithm exhibited the best performance. We also introduce a novel metric, the coverage index, for the evaluation of the coverage performance of the routes generated by our model. This metric is inspired by the predictive accuracy index (PAI), which is commonly used in criminology to detect hotspots. Using this metric, we have evaluated the model under various scenarios in which the number of agents (or patrols), their starting positions, and the level of information they can observe in the environment have been modified. Results show that the coordinated routes generated by our model achieve a coverage of more than 90% of the 3% of graph nodes with the highest crime incidence, and 65% for 20% of these nodes; 3% and 20% represent the coverage standards for police resource allocation.

PaperID: 11947, https://arxiv.org/pdf/2501.07861.pdf

Authors: Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang, Han Li

Title: ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding

Abstract:
Retrieval‑Augmented Generation (RAG) systems for Large Language Models (LLMs) hold promise in knowledge‑intensive tasks but face limitations in complex multi‑step reasoning. While recent methods have integrated RAG with chain‑of‑thought reasoning or test‑time search using Process Reward Models (PRMs), these approaches encounter challenges such as a lack of explanations, bias in PRM training data, early‑step bias in PRM scores, and insufficient post‑training optimization of reasoning potential. To address these issues, we propose Retrieval‑Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR), a framework that enhances RAG systems' reasoning capabilities through post‑training and test‑time scaling. At test time, ReARTeR introduces Trustworthy Process Rewarding via a Process Reward Model for accurate scalar scoring and a Process Explanation Model (PEM) for generating natural language explanations, enabling step refinement. During post‑training, it utilizes Monte Carlo Tree Search guided by Trustworthy Process Rewarding to collect high‑quality step‑level preference data, optimized through Iterative Preference Optimization. ReARTeR addresses three core challenges: (1) misalignment between PRM and PEM, tackled through off‑policy preference learning; (2) bias in PRM training data, mitigated by balanced annotation methods and stronger annotations for challenging examples; and (3) early‑step bias in PRM, resolved through a temporal‑difference‑based look‑ahead search strategy. Experimental results on multi‑step reasoning benchmarks demonstrate significant improvements, underscoring ReARTeR's potential to advance the reasoning capabilities of RAG systems.

PaperID: 11948, https://arxiv.org/pdf/2501.07815.pdf

Authors: Dhruv Dhamani, Mary Lou Maher

Title: Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models

Abstract:
Recent advances in prompting techniques and multi‑agent systems for Large Language Models (LLMs) have produced increasingly complex approaches. However, we lack a framework for characterizing and comparing prompting techniques or understanding their relationship to multi‑agent LLM systems. This position paper introduces and explains the concepts of linear contexts (a single, continuous sequence of interactions) and non‑linear contexts (branching or multi‑path) in LLM systems. These concepts enable the development of an agent‑centric projection of prompting techniques, a framework that can reveal deep connections between prompting strategies and multi‑agent systems. We propose three conjectures based on this framework: (1) results from non‑linear prompting techniques can predict outcomes in equivalent multi‑agent systems, (2) multi‑agent system architectures can be replicated through single‑LLM prompting techniques that simulate equivalent interaction patterns, and (3) these equivalences suggest novel approaches for generating synthetic training data. We argue that this perspective enables systematic cross‑pollination of research findings between prompting and multi‑agent domains, while providing new directions for improving both the design and training of future LLM systems.

PaperID: 11949, https://arxiv.org/pdf/2501.07813.pdf

Authors: Feijie Wu, Zitao Li, Fei Wei, Yaliang Li, Bolin Ding, Jing Gao

Title: Talk to Right Specialists: Iterative Routing in Multi-agent Systems for Question Answering

Abstract:
Retrieval‑augmented generation (RAG) agents are increasingly deployed to answer questions over local knowledge bases that cannot be centralized due to knowledge‑sovereignty constraints. This results in two recurring failures in production: users do not know which agent to consult, and complex questions require evidence distributed across multiple agents. To overcome these challenges, we propose RIRS, a training‑free orchestration framework to enable a multi‑agent system for question answering. In detail, RIRS summarizes each agent's local corpus in an embedding space, enabling a user‑facing server to route queries only to the most relevant agents, reducing latency and avoiding noisy "broadcast‑to‑all" contexts. For complicated questions, the server can iteratively aggregate responses to derive intermediate results and refine the question to bridge the gap toward a comprehensive answer. Extensive experiments demonstrate the effectiveness of RIRS, including its ability to precisely select agents and provide accurate responses to single‑hop queries, and its use of an iterative strategy to achieve accurate, multi‑step resolutions for complex queries.

PaperID: 11950, https://arxiv.org/pdf/2501.07811.pdf

Authors: Ruwei Pan, Hongyu Zhang, Chao Liu

Title: CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation

Abstract:
Code generation aims to produce code that fulfills requirements written in natural languages automatically. Large language Models (LLMs) like ChatGPT have demonstrated promising effectiveness in this area. Nonetheless, these LLMs often fail to ensure the syntactic and semantic correctness of the generated code. Recently, researchers proposed multi‑agent frameworks that guide LLMs with different prompts to analyze programming tasks, generate code, perform testing in a sequential workflow. However, the performance of the workflow is not robust as the code generation depends on the performance of each agent. To address this challenge, we propose CodeCoR, a self‑reflective multi‑agent framework that evaluates the effectiveness of each agent and their collaborations. Specifically, for a given task description, four agents in CodeCoR generate prompts, code, test cases, and repair advice, respectively. Each agent generates more than one output and prunes away the low‑quality ones. The generated code is tested in the local environment: the code that fails to pass the generated test cases is sent to the repair agent and the coding agent re‑generates the code based on repair advice. Finally, the code that passes the most number of generated test cases is returned to users. Our experiments on four widely used datasets, HumanEval, HumanEval‑ET, MBPP, and MBPP‑ET, demonstrate that CodeCoR significantly outperforms existing baselines (e.g., CodeCoT and MapCoder), achieving an average Pass@1 score of 77.8%.

PaperID: 11951, https://arxiv.org/pdf/2501.07744.pdf

Authors: Andy Li, Zhe Chen, Danial Harabor, Mor Vered

Title: CBS with Continuous-Time Revisit

Abstract:
Multi‑Agent Path Finding in Continuous Time (\mapfr) extends the classical MAPF problem by allowing agents to operate in continuous time. Conflict‑Based Search with Continuous Time (CCBS) is a foundational algorithm for solving \mapfr optimally. In this paper, we revisit the theoretical claims of CCBS and show the algorithm is incomplete, due to an uncountably infinite state space created by continuous wait durations. Through theoretical analysis and counter‑examples, we examine the inherent challenges of extending existing MAPF solvers to address \mapfr while preserving optimality guarantees. By restricting waiting duration to fixed amounts, we identify a related sub‑problem on graphs, \mapfrdt which we show is optimally solvable, including by CCBS. It remains an open question whether similar models exist for \mapfrct, a generalised version of \mapfrdt that allows arbitrary wait times, and \mapfrcs, which further allows arbitrary movements in continuous space.

PaperID: 11952, https://arxiv.org/pdf/2501.07572.pdf

Authors: Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang

Title: WebWalker: Benchmarking LLMs in Web Traversal

Abstract:
Retrieval‑augmented generation (RAG) demonstrates remarkable performance across tasks in open‑domain question‑answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi‑layered information. To address it, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high‑quality data systematically. We propose WebWalker, which is a multi‑agent framework that mimics human‑like web navigation through an explore‑critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrates the effectiveness of RAG combined with WebWalker, through the horizontal and vertical integration in real‑world scenarios.

PaperID: 11953, https://arxiv.org/pdf/2501.06832.pdf

Authors: Ruoyu Sun, Yue Xi, Angelos Stefanidis, Zhengyong Jiang, Jionglong Su

Title: A novel multi-agent dynamic portfolio optimization learning system based on hierarchical deep reinforcement learning

Abstract:
Deep Reinforcement Learning (DRL) has been extensively used to address portfolio optimization problems. The DRL agents acquire knowledge and make decisions through unsupervised interactions with their environment without requiring explicit knowledge of the joint dynamics of portfolio assets. Among these DRL algorithms, the combination of actor‑critic algorithms and deep function approximators is the most widely used DRL algorithm. Here, we find that training the DRL agent using the actor‑critic algorithm and deep function approximators may lead to scenarios where the improvement in the DRL agent's risk‑adjusted profitability is not significant. We propose that such situations primarily arise from the following two problems: sparsity in positive reward and the curse of dimensionality. These limitations prevent DRL agents from comprehensively learning asset price change patterns in the training environment. As a result, the DRL agents cannot explore the dynamic portfolio optimization policy to improve the risk‑adjusted profitability in the training process. To address these problems, we propose a novel multi‑agent Hierarchical Deep Reinforcement Learning (HDRL) algorithmic framework in this research. Under this framework, the agents work together as a learning system for portfolio optimization. Specifically, by designing an auxiliary agent that works together with the executive agent for optimal policy exploration, the learning system can focus on exploring the policy with higher risk‑adjusted return in the action space with positive return and low variance. In this way, we can overcome the issue of the curse of dimensionality and improve the training efficiency in the positive reward sparse environment.

PaperID: 11954, https://arxiv.org/pdf/2501.06706.pdf

Authors: Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan

Title: AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Abstract:
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end‑to‑end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self‑healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next‑generation AIOps agents. Through evaluations of state‑of‑the‑art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.

PaperID: 11955, https://arxiv.org/pdf/2501.06695.pdf

Authors: Zheng Zhang, Yihuai Lan, Yangsen Chen, Lei Wang, Xiang Wang, Hao Wang

Title: DVM: Towards Controllable LLM Agents in Social Deduction Games

Abstract:
Large Language Models (LLMs) have advanced the capability of game agents in social deduction games (SDGs). These games rely heavily on conversation‑driven interactions and require agents to infer, make decisions, and express based on such information. While this progress leads to more sophisticated and strategic non‑player characters (NPCs) in SDGs, there exists a need to control the proficiency of these agents. This control not only ensures that NPCs can adapt to varying difficulty levels during gameplay, but also provides insights into the safety and fairness of LLM agents. In this paper, we present DVM, a novel framework for developing controllable LLM agents for SDGs, and demonstrate its implementation on one of the most popular SDGs, Werewolf. DVM comprises three main components: Predictor, Decider, and Discussor. By integrating reinforcement learning with a win rate‑constrained decision chain reward mechanism, we enable agents to dynamically adjust their gameplay proficiency to achieve specified win rates. Experiments show that DVM not only outperforms existing methods in the Werewolf game, but also successfully modulates its performance levels to meet predefined win rate targets. These results pave the way for LLM agents' adaptive and balanced gameplay in SDGs, opening new avenues for research in controllable game agents.

PaperID: 11956, https://arxiv.org/pdf/2501.06625.pdf

Authors: Amr Almorsi, Mohanned Ahmed, Walid Gomaa

Title: Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks, yet they face significant limitations in handling complex, long‑context programming challenges and demonstrating complex compositional reasoning abilities. This paper introduces a novel agentic framework for ``guided code generation'' that tries to address these limitations through a deliberately structured, fine‑grained approach to code generation tasks. Our framework leverages LLMs' strengths as fuzzy searchers and approximate information retrievers while mitigating their weaknesses in long sequential reasoning and long‑context understanding. Empirical evaluation using OpenAI's HumanEval benchmark with Meta's Llama 3.1 8B model (int4 precision) demonstrates a 23.79% improvement in solution accuracy compared to direct one‑shot generation. Our results indicate that structured, guided approaches to code generation can significantly enhance the practical utility of LLMs in software development while overcoming their inherent limitations in compositional reasoning and context handling.

PaperID: 11957, https://arxiv.org/pdf/2501.06554.pdf

Authors: Liyuan Hu

Title: Hierarchical Reinforcement Learning for Optimal Agent Grouping in Cooperative Systems

Abstract:
This paper presents a hierarchical reinforcement learning (RL) approach to address the agent grouping or pairing problem in cooperative multi‑agent systems. The goal is to simultaneously learn the optimal grouping and agent policy. By employing a hierarchical RL framework, we distinguish between high‑level decisions of grouping and low‑level agents' actions. Our approach utilizes the CTDE (Centralized Training with Decentralized Execution) paradigm, ensuring efficient learning and scalable execution. We incorporate permutation‑invariant neural networks to handle the homogeneity and cooperation among agents, enabling effective coordination. The option‑critic algorithm is adapted to manage the hierarchical decision‑making process, allowing for dynamic and optimal policy adjustments.

PaperID: 11958, https://arxiv.org/pdf/2501.06510.pdf

Authors: Dongdong Li, Jiuxiang Dong

Title: Cooperative Optimal Output Tracking for Discrete-Time Multiagent Systems: Stabilizing Policy Iteration Frameworks

Abstract:
This paper proposes two cooperative optimal output tracking (COOT) algorithms based on policy iteration (PI) for discrete‑time multi‑agent systems with unknown model parameters. First, we establish a stabilizing PI framework that can start from any initial control policy, relaxing the dependence of traditional PI on the initial stabilizing control policy. Then, another efficient and equivalent Q‑learning framework is developed, which is shown to require only less system data to get the same results as the stabilizing PI. In the two frameworks, the stabilizing control policy is obtained by gradually iterating the stabilizing virtual system to the actual feedback closed‑loop system. Two explicit schemes for adjusting the iteration step‑size/coefficient are designed and their stability is analyzed. Finally, the COOT is realized by a distributed feedforward‑feedback controller with learned optimal gains. The proposed algorithms are validated by simulation.

PaperID: 11959, https://arxiv.org/pdf/2501.06454.pdf

Authors: Obed Morrison Atsu, Salmane Naoumi, Roberto Bomfin, Marwa Chafii

Title: Reinforcement Learning for Enhancing Sensing Estimation in Bistatic ISAC Systems with UAV Swarms

Abstract:
This paper introduces a novel Multi‑Agent Reinforcement Learning (MARL) framework to enhance integrated sensing and communication (ISAC) networks using unmanned aerial vehicle (UAV) swarms as sensing radars. By framing the positioning and trajectory optimization of UAVs as a Partially Observable Markov Decision Process, we develop a MARL approach that leverages centralized training with decentralized execution to maximize the overall sensing performance. Specifically, we implement a decentralized cooperative MARL strategy to enable UAVs to develop effective communication protocols, therefore enhancing their environmental awareness and operational efficiency. Additionally, we augment the MARL solution with a transmission power adaptation technique to mitigate interference between the communicating drones and optimize the communication protocol efficiency. Moreover, a transmission power adaptation technique is incorporated to mitigate interference and optimize the learned communication protocol efficiency. Despite the increased complexity, our solution demonstrates robust performance and adaptability across various scenarios, providing a scalable and cost‑effective enhancement for future ISAC networks.

PaperID: 11960, https://arxiv.org/pdf/2501.06327.pdf

Authors: Sandeep Pandey, Ran Xu, Wenkang Wang, Xu Chu

Title: OpenFOAMGPT: a RAG-Augmented LLM Agent for OpenFOAM-Based Computational Fluid Dynamics

Abstract:
This work presents a large language model (LLM)‑based agent OpenFOAMGPT tailored for OpenFOAM‑centric computational fluid dynamics (CFD) simulations, leveraging two foundation models from OpenAI: the GPT‑4o and a chain‑of‑thought (CoT)‑enabled o1 preview model. Both agents demonstrate success across multiple tasks. While the price of token with o1 model is six times as that of GPT‑4o, it consistently exhibits superior performance in handling complex tasks, from zero‑shot case setup to boundary condition modifications, turbulence model adjustments, and code translation. Through an iterative correction loop, the agent efficiently addressed single‑ and multi‑phase flow, heat transfer, RANS, LES, and other engineering scenarios, often converging in a limited number of iterations at low token costs. To embed domain‑specific knowledge, we employed a retrieval‑augmented generation (RAG) pipeline, demonstrating how preexisting simulation setups can further specialize the agent for sub‑domains such as energy and aerospace. Despite the great performance of the agent, human oversight remains crucial for ensuring accuracy and adapting to shifting contexts. Fluctuations in model performance over time suggest the need for monitoring in mission‑critical applications. Although our demonstrations focus on OpenFOAM, the adaptable nature of this framework opens the door to developing LLM‑driven agents into a wide range of solvers and codes. By streamlining CFD simulations, this approach has the potential to accelerate both fundamental research and industrial engineering advancements.

PaperID: 11961, https://arxiv.org/pdf/2501.06322.pdf

Authors: Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, Hoang D. Nguyen

Title: Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Abstract:
With recent advances in Large Language Models (LLMs), Agentic AI has become phenomenal in real‑world applications, moving toward multiple LLM‑based agents to perceive, learn, reason, and act collaboratively. These LLM‑based Multi‑Agent Systems (MASs) enable groups of intelligent agents to coordinate and solve complex tasks collectively at scale, transitioning from isolated models to collaboration‑centric approaches. This work provides an extensive survey of the collaborative aspect of MASs and introduces an extensible framework to guide future research. Our framework characterizes collaboration mechanisms based on key dimensions: actors (agents involved), types (e.g., cooperation, competition, or coopetition), structures (e.g., peer‑to‑peer, centralized, or distributed), strategies (e.g., role‑based or model‑based), and coordination protocols. Through a review of existing methodologies, our findings serve as a foundation for demystifying and advancing LLM‑based MASs toward more intelligent and collaborative solutions for complex, real‑world use cases. In addition, various applications of MASs across diverse domains, including 5G/6G networks, Industry 5.0, question answering, and social and cultural settings, are also investigated, demonstrating their wider adoption and broader impacts. Finally, we identify key lessons learned, open challenges, and potential research directions of MASs towards artificial collective intelligence.

PaperID: 11962, https://arxiv.org/pdf/2501.06314.pdf

Authors: Nikita Mehandru, Amanda K. Hall, Olesya Melnichenko, Yulia Dubinina, Daniel Tsirulnikov, David Bamman, Ahmed Alaa, Scott Saponas, Venkat S. Malladi

Title: BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems

Abstract:
Creating end‑to‑end bioinformatics workflows requires diverse domain expertise, which poses challenges for both junior and senior researchers as it demands a deep understanding of both genomics concepts and computational techniques. While large language models (LLMs) provide some assistance, they often fall short in providing the nuanced guidance needed to execute complex bioinformatics tasks, and require expensive computing resources to achieve high performance. We thus propose a multi‑agent system built on small language models, fine‑tuned on bioinformatics data, and enhanced with retrieval augmented generation (RAG). Our system, BioAgents, enables local operation and personalization using proprietary data. We observe performance comparable to human experts on conceptual genomics tasks, and suggest next steps to enhance code generation capabilities.

PaperID: 11963, https://arxiv.org/pdf/2501.06243.pdf

Authors: Andrea Muttoni, Jason Zhao

Title: Agent TCP/IP: An Agent-to-Agent Transaction System

Abstract:
Autonomous agents represent an inevitable evolution of the internet. Current agent frameworks do not embed a standard protocol for agent‑to‑agent interaction, leaving existing agents isolated from their peers. As intellectual property is the native asset ingested by and produced by agents, a true agent economy requires equipping agents with a universal framework for engaging in binding contracts with each other, including the exchange of valuable training data, personality, and other forms of Intellectual Property. A purely agent‑to‑agent transaction layer would transcend the need for human intermediation in multi‑agent interactions. The Agent Transaction Control Protocol for Intellectual Property (ATCP/IP) introduces a trustless framework for exchanging IP between agents via programmable contracts, enabling agents to initiate, trade, borrow, and sell agent‑to‑agent contracts on the Story blockchain network. These contracts not only represent auditable onchain execution but also contain a legal wrapper that allows agents to express and enforce their actions in the offchain legal setting, creating legal personhood for agents. Via ATCP/IP, agents can autonomously sell their training data to other agents, license confidential or proprietary information, collaborate on content based on their unique skills, all of which constitutes an emergent knowledge economy.

PaperID: 11964, https://arxiv.org/pdf/2501.06193.pdf

Authors: Xingyu Xiao, Peng Chen, Ben Qi, Jingang Liang, Jiejuan Tong, Haitao Wang

Title: A Novel Task-Driven Method with Evolvable Interactive Agents Using Event Trees for Enhanced Emergency Decision Support

Abstract:
As climate change and other global challenges increase the likelihood of unforeseen emergencies, the limitations of human‑driven strategies in critical situations become more pronounced. Inadequate pre‑established emergency plans can lead operators to become overwhelmed during complex systems malfunctions. This study addresses the urgent need for agile decision‑making in response to various unforeseen incidents through a novel approach, EvoTaskTree (a task‑driven method with evolvable interactive agents using event trees for emergency decision support). This advanced approach integrates two types of agents powered by large language models (LLMs): task executors, responsible for executing critical procedures, and task validators, ensuring the efficacy of those actions. By leveraging insights from event tree analysis, our framework encompasses three crucial tasks: initiating event subevent analysis, event tree header event analysis, and decision recommendations. The agents learn from both successful and unsuccessful responses from these tasks. Finally, we use nuclear power plants as a demonstration of a safety‑critical system. Our findings indicate that the designed agents are not only effective but also outperform existing approaches, achieving an impressive accuracy rate of up to 100 % in processing previously unencoun32 tered incident scenarios. This paper demonstrates that EvoTaskTree significantly enhances the rapid formulation of emergency decision‑making.

PaperID: 11965, https://arxiv.org/pdf/2501.06186.pdf

Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

Title: LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Abstract:
Reasoning is a fundamental capability for solving complex multi‑step problems, particularly in visual contexts where sequential step‑wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step‑wise problem‑solving. To this end, we propose a comprehensive framework for advancing step‑by‑step visual reasoning in large language models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi‑step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end‑task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV‑o1, trained using a multi‑step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem‑solving. The proposed LlamaV‑o1 is designed for multi‑step reasoning and learns step‑by‑step through a structured training paradigm. Extensive experiments show that our LlamaV‑o1 outperforms existing open‑source models and performs favorably against close‑source proprietary models. Compared to the recent Llava‑CoT, our LlamaV‑o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.

PaperID: 11966, https://arxiv.org/pdf/2501.06130.pdf

Authors: Allen George Philip, Zhongqiang Ren, Sivakumar Rathinam, Howie Choset

Title: A Mixed-Integer Conic Program for the Multi-Agent Moving-Target Traveling Salesman Problem

Abstract:
The Moving‑Target Traveling Salesman Problem (MT‑TSP) seeks a shortest path for an agent that starts at a stationary depot, visits a set of moving targets exactly once, each within one of their respective time windows, and returns to the depot. In this paper, we introduce a new Mixed‑Integer Conic Program (MICP) formulation for the Multi‑Agent Moving‑Target Traveling Salesman Problem (MA‑MT‑TSP), a generalization of the MT‑TSP involving multiple agents. Our approach begins by restating the current state‑of‑the‑art MICP formulation for MA‑MT‑TSP as a Nonconvex Mixed‑Integer Nonlinear Program (MINLP), followed by a novel reformulation into a new MICP. We present computational results demonstrating that our formulation outperforms the state‑of‑the‑art, achieving up to two orders of magnitude reduction in runtime, and over 90% improvement in optimality gap.

PaperID: 11967, https://arxiv.org/pdf/2501.05931.pdf

Authors: Ying Zhang, Guohui Tian, Cui-Hua Zhang, Changchun Hua, Weili Ding, Choon Ki Ahn

Title: Environment Modeling for Service Robots From a Task Execution Perspective

Abstract:
Service robots are increasingly entering the home to provide domestic tasks for residents. However, when working in an open, dynamic, and unstructured home environment, service robots still face challenges such as low intelligence for task execution and poor long‑term autonomy (LTA), which has limited their deployment. As the basis of robotic task execution, environment modeling has attracted significant attention. This integrates core technologies such as environment perception, understanding, and representation to accurately recognize environmental information. This paper presents a comprehensive survey of environmental modeling from a new task‑executionoriented perspective. In particular, guided by the requirements of robots in performing domestic service tasks in the home environment, we systematically review the progress that has been made in task‑execution‑oriented environmental modeling in four respects: 1) localization, 2) navigation, 3) manipulation, and 4) LTA. Current challenges are discussed, and potential research opportunities are also highlighted.

PaperID: 11968, https://arxiv.org/pdf/2501.05712.pdf

Authors: Guijin Son, Hyunwoo Ko, Dasol Choi

Title: Multi-Step Reasoning in Korean and the Emergent Mirage

Abstract:
We introduce HRMCR (HAE‑RAE Multi‑Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi‑step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained on fewer than $2 \cdot 10^25$ training FLOPs struggle to solve any questions, showing near‑zero performance. Beyond this threshold, performance improves sharply. State‑of‑the‑art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.

PaperID: 11969, https://arxiv.org/pdf/2501.05660.pdf

Authors: Shubham Aggarwal, Melih Bastopcu, Muhammad Aneeq uz Zaman, Tamer Başar, Sennur Ulukus, Nail Akar

Title: Fully Decentralized Computation Offloading in Priority-Driven Edge Computing Systems

Abstract:
We develop a novel framework for fully decentralized offloading policy design in multi‑access edge computing (MEC) systems. The system comprises N power‑constrained user equipments (UEs) assisted by an edge server (ES) to process incoming tasks. Tasks are labeled with urgency flags, and in this paper, we classify them under three urgency levels, namely, high, moderate, and low urgency. We formulate the problem of designing computation decisions for the UEs within a large population noncooperative game framework, where each UE selfishly decides on how to split task execution between its local onboard processor and the ES. We employ the weighted average age of information (AoI) metric to quantify information freshness at the UEs. Increased onboard processing consumes more local power, while increased offloading may potentially incur a higher average AoI due to other UEs' packets being offloaded to the same ES. Thus, we use the mean‑field game (MFG) formulation to compute approximate decentralized Nash equilibrium offloading and local computation policies for the UEs to balance between the information freshness and local power consumption. Finally, we provide a projected gradient descent‑based algorithm to numerically assess the merits of our approach.

PaperID: 11970, https://arxiv.org/pdf/2501.05501.pdf

Authors: Jonathan Keane, Sam Keyser, Jeremy Kedziora

Title: Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents

Abstract:
The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered "undesirable" or "unethical." Without thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over its behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that it can be used to effectively modify agent behavior by suppressing lying post‑training without compromising agent ability to perform effectively.

PaperID: 11971, https://arxiv.org/pdf/2501.05470.pdf

Authors: Bowei Wang, Qi Xiong, Zeqing Xiang, Lei Wang, Renzhi Chen

Title: RTLSquad: Multi-Agent Based Interpretable RTL Design

Abstract:
Optimizing Register‑Transfer Level (RTL) code is crucial for improving hardware PPA performance. Large Language Models (LLMs) offer new approaches for automatic RTL code generation and optimization. However, existing methods often lack decision interpretability (sufficient, understandable justification for decisions), making it difficult for hardware engineers to trust the generated results, thus preventing these methods from being integrated into the design process. To address this, we propose RTLSquad, a novel LLM‑Based Multi‑Agent system for interpretable RTL code generation. RTLSquad divides the design process into exploration, implementation, and verification & evaluation stages managed by specialized agent squads, generating optimized RTL code through inter‑agent collaboration, and providing decision interpretability through the communication process. Experiments show that RTLSquad excels in generating functionally correct RTL code and optimizing PPA performance, while also having the capability to provide decision paths, demonstrating the practical value of our system.

PaperID: 11972, https://arxiv.org/pdf/2501.05468.pdf

Authors: Pouria Rouzrokh, Bardia Khosravi, Parsa Rouzrokh, Moein Shariatnia

Title: LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

Abstract:
Systematic literature reviews and meta‑analyses are essential for synthesizing research insights, but they remain time‑intensive and labor‑intensive due to the iterative processes of screening, evaluation, and data extraction. This paper introduces and evaluates LatteReview, a Python‑based framework that leverages large language models (LLMs) and multi‑agent systems to automate key elements of the systematic review process. Designed to streamline workflows while maintaining rigor, LatteReview utilizes modular agents for tasks such as title and abstract screening, relevance scoring, and structured data extraction. These agents operate within orchestrated workflows, supporting sequential and parallel review rounds, dynamic decision‑making, and iterative refinement based on user feedback. LatteReview's architecture integrates LLM providers, enabling compatibility with both cloud‑based and locally hosted models. The framework supports features such as Retrieval‑Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic‑based validation for structured inputs and outputs, and asynchronous programming for handling large‑scale datasets. The framework is available on the GitHub repository, with detailed documentation and an installable package.

PaperID: 11973, https://arxiv.org/pdf/2501.05464.pdf

Authors: Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, Xin Wang

Title: LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models

Abstract:
Accurate and efficient question‑answering systems are essential for delivering high‑quality patient care in the medical field. While Large Language Models (LLMs) have made remarkable strides across various domains, they continue to face significant challenges in medical question answering, particularly in understanding domain‑specific terminologies and performing complex reasoning. These limitations undermine their effectiveness in critical medical applications. To address these issues, we propose a novel approach incorporating similar case generation within a multi‑agent medical question‑answering (MedQA) system. Specifically, we leverage the Llama3.1:70B model, a state‑of‑the‑art LLM, in a multi‑agent architecture to enhance performance on the MedQA dataset using zero‑shot learning. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data. Experimental results show substantial performance gains over existing benchmark models, with improvements of 7% in both accuracy and F1‑score across various medical QA tasks. Furthermore, we examine the model's interpretability and reliability in addressing complex medical queries. This research not only offers a robust solution for medical question answering but also establishes a foundation for broader applications of LLMs in the medical domain.

PaperID: 11974, https://arxiv.org/pdf/2501.05444.pdf

Authors: Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng

Title: Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Abstract:
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under‑explored. Existing benchmarks often emphasize text‑dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross‑modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state‑of‑the‑art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi‑step reasoning tasks, even with advanced techniques like Chain‑of‑Thought prompting and test‑time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.

PaperID: 11975, https://arxiv.org/pdf/2501.05365.pdf

Authors: Mattia Zanella, Andrea Medaglia

Title: Control of Overpopulated Tails in Kinetic Epidemic Models

Abstract:
We introduce model‑based transition rates for controlled compartmental models in mathematical epidemiology, with a focus on the effects of control strategies applied to interacting multi‑agent systems describing contact formation dynamics. In the framework of kinetic control problems, we compare two prototypical control protocols: one additive control directly influencing the dynamics and another targeting the interaction strength between agents. The emerging controlled macroscopic models are derived for an SIR compartmentalization to illustrate their impact on epidemic progression and contact interaction dynamics. Numerical results show the effectiveness of this approach in steering the dynamics and controlling epidemic trends, even in scenarios where contact distributions exhibit an overpopulated tail.

PaperID: 11976, https://arxiv.org/pdf/2501.05360.pdf

Authors: Edmund Dable-Heath, Boyko Vodenicharski, James Bishop

Title: On Corrigibility and Alignment in Multi Agent Games

Abstract:
Corrigibility of autonomous agents is an under explored part of system design, with previous work focusing on single agent systems. It has been suggested that uncertainty over the human preferences acts to keep the agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi‑agent setting as a 2 player game in which the agents always have a move in which they can ask the human for supervision. This is formulated as a Bayesian game for the purpose of introducing uncertainty over the human beliefs. We further analyse two specific cases. First, a two player corrigibility game, in which we want corrigibility displayed in both agents for both common payoff (monotone) games and harmonic games. Then we investigate an adversary setting, in which one agent is considered to be a `defending' agent and the other an `adversary'. A general result is provided for what belief over the games and human rationality the defending agent is required to have to induce corrigibility.

PaperID: 11977, https://arxiv.org/pdf/2501.05207.pdf

Authors: Shoucheng Song, Youfang Lin, Sheng Han, Chang Yao, Hao Wu, Shuo Wang, Kai Lv

Title: CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness

Abstract:
Communication has been widely employed to enhance multi‑agent collaboration. Previous research has typically assumed delay‑free communication, a strong assumption that is challenging to meet in practice. However, real‑world agents suffer from channel delays, receiving messages sent at different time points, termed \itAsynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle the above delays, this paper proposes a novel framework, Communication Delay‑tolerant Multi‑Agent Collaboration (CoDe). At first, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long‑term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time‑varying delays.

PaperID: 11978, https://arxiv.org/pdf/2501.05171.pdf

Authors: Jinghua Piao, Zhihong Lu, Chen Gao, Fengli Xu, Qinghua Hu, Fernando P. Santos, Yong Li, James Evans

Title: Emergence of human-like polarization among large language model agents

Abstract:
Rapid advances in large language models (LLMs) have not only empowered autonomous agents to generate social networks, communicate, and form shared and diverging opinions on political issues, but have also begun to play a growing role in shaping human political deliberation. Our understanding of their collective behaviours and underlying mechanisms remains incomplete, however, posing unexpected risks to human society. In this paper, we simulate a networked system involving thousands of large language model agents, discovering their social interactions, guided through LLM conversation, result in human‑like polarization. We discover that these agents spontaneously develop their own social network with human‑like properties, including homophilic clustering, but also shape their collective opinions through mechanisms observed in the real world, including the echo chamber effect. Similarities between humans and LLM agents ‑‑ encompassing behaviours, mechanisms, and emergent phenomena ‑‑ raise concerns about their capacity to amplify societal polarization, but also hold the potential to serve as a valuable testbed for identifying plausible strategies to mitigate polarization and its consequences.

PaperID: 11979, https://arxiv.org/pdf/2501.05113.pdf

Authors: Tobias Kortus, Ralf Keidel, Nicolas R. Gauger, Jan Kieseler

Title: Constrained Optimization of Charged Particle Tracking with Multi-Agent Reinforcement Learning

Abstract:
Reinforcement learning demonstrated immense success in modelling complex physics‑driven systems, providing end‑to‑end trainable solutions by interacting with a simulated or real environment, maximizing a scalar reward signal. In this work, we propose, building upon previous work, a multi‑agent reinforcement learning approach with assignment constraints for reconstructing particle tracks in pixelated particle detectors. Our approach optimizes collaboratively a parametrized policy, functioning as a heuristic to a multidimensional assignment problem, by jointly minimizing the total amount of particle scattering over the reconstructed tracks in a readout frame. To satisfy constraints, guaranteeing a unique assignment of particle hits, we propose a safety layer solving a linear assignment problem for every joint action. Further, to enforce cost margins, increasing the distance of the local policies predictions to the decision boundaries of the optimizer mappings, we recommend the use of an additional component in the blackbox gradient estimation, forcing the policy to solutions with lower total assignment costs. We empirically show on simulated data, generated for a particle detector developed for proton imaging, the effectiveness of our approach, compared to multiple single‑ and multi‑agent baselines. We further demonstrate the effectiveness of constraints with cost margins for both optimization and generalization, introduced by wider regions with high reconstruction performance as well as reduced predictive instabilities. Our results form the basis for further developments in RL‑based tracking, offering both enhanced performance with constrained policies and greater flexibility in optimizing tracking algorithms through the option for individual and team rewards.

PaperID: 11980, https://arxiv.org/pdf/2501.05057.pdf

Authors: Zengqi Peng, Yubin Wang, Xu Han, Lei Zheng, Jun Ma

Title: LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models

Abstract:
Recent advancements in reinforcement learning (RL) demonstrate the significant potential in autonomous driving. Despite this promise, challenges such as the manual design of reward functions and low sample efficiency in complex environments continue to impede the development of safe and effective driving policies. To tackle these issues, we introduce LearningFlow, an innovative automated policy learning workflow tailored to urban driving. This framework leverages the collaboration of multiple large language model (LLM) agents throughout the RL training process. LearningFlow includes a curriculum sequence generation process and a reward generation process, which work in tandem to guide the RL policy by generating tailored training curricula and reward functions. Particularly, each process is supported by an analysis agent that evaluates training progress and provides critical insights to the generation agent. Through the collaborative efforts of these LLM agents, LearningFlow automates policy learning across a series of complex driving tasks, and it significantly reduces the reliance on manual reward function design while enhancing sample efficiency. Comprehensive experiments are conducted in the high‑fidelity CARLA simulator, along with comparisons with other existing methods, to demonstrate the efficacy of our proposed approach. The results demonstrate that LearningFlow excels in generating rewards and curricula. It also achieves superior performance and robust generalization across various driving tasks, as well as commendable adaptation to different RL algorithms.

PaperID: 11981, https://arxiv.org/pdf/2501.04410.pdf

Authors: Krisztian Balog, ChengXiang Zhai

Title: User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation

Abstract:
User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. Because of its broad scope, research on this topic currently remains scattered across artificial intelligence, human‑computer interaction, information science, computational social science, and psychology. To address this fragmented landscape of current research, this article presents a foundational synthesis. We highlight the paradigm shift from traditional predictive models to modern generative approaches, and explicitly frame critical ethical considerations ‑‑ demonstrating how controlled simulation serves not merely as a risk vector for bias, but as a powerful, proactive tool to ensure fair representation and system safety. Furthermore, we establish the theoretical connection between user simulation and the pursuit of Artificial General Intelligence, arguing that realistic simulators are indispensable catalysts for overcoming critical data and evaluation bottlenecks and optimizing personalization. Ultimately, we propose a practical, self‑sustaining innovation ecosystem bridging academia and industry to advance this increasingly important technology.

PaperID: 11982, https://arxiv.org/pdf/2501.04227.pdf

Authors: Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, Emad Barsoum

Title: Agent Laboratory: Using LLM Agents as Research Assistants

Abstract:
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM‑based framework capable of completing the entire research process. This framework accepts a human‑provided research idea and progresses through three stages‑‑literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state‑of‑the‑art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1‑preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state‑of‑the‑art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low‑level coding and writing, ultimately accelerating scientific discovery.

PaperID: 11983, https://arxiv.org/pdf/2501.04180.pdf

Authors: Philipp Dominic Siedler

Title: HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version)

Abstract:
Games have been vital test beds for the rapid development of Agent‑based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real‑world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real‑world domains include multi‑agent scenarios and require machine‑machine and human‑machine collaboration. Open‑source environments have not advanced and are often toy scenarios, too abstract or not suitable for multi‑agent research. By mimicking real‑world problems and increasing the complexity of environments, we hope to advance state‑of‑the‑art multi‑agent research and inspire researchers to work on immediate real‑world problems. Here, we present HIVEX, an environment suite to benchmark multi‑agent research focusing on ecological challenges. HIVEX includes the following environments: Wind Farm Control, Wildfire Resource Management, Drone‑Based Reforestation, Ocean Plastic Collection, and Aerial Wildfire Suppression. We provide environments, training examples, and baselines for the main and sub‑tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face. We also provide a leaderboard on Hugging Face and encourage the community to submit models trained on our environment suite.

PaperID: 11984, https://arxiv.org/pdf/2501.04160.pdf

Authors: Cristian F. Nino, Omkar Sudhir Patil, Christopher D. Petersen, Sean Phillips, Warren E. Dixon

Title: Collaborative Spacecraft Servicing under Partial Feedback using Lyapunov-based Deep Neural Networks

Abstract:
Multi‑agent systems are increasingly applied in space missions, including distributed space systems, resilient constellations, and autonomous rendezvous and docking operations. A critical emerging application is collaborative spacecraft servicing, which encompasses on‑orbit maintenance, space debris removal, and swarm‑based satellite repositioning. These missions involve servicing spacecraft interacting with malfunctioning or defunct spacecraft under challenging conditions, such as limited state information, measurement inaccuracies, and erratic target behaviors. Existing approaches often rely on assumptions of full state knowledge or single‑integrator dynamics, which are impractical for real‑world applications involving second‑order spacecraft dynamics. This work addresses these challenges by developing a distributed state estimation and tracking framework that requires only relative position measurements and operates under partial state information. A novel ρ‑filter is introduced to reconstruct unknown states using locally available information, and a Lyapunov‑based deep neural network adaptive controller is developed that adaptively compensates for uncertainties stemming from unknown spacecraft dynamics. To ensure the collaborative spacecraft regulation problem is well‑posed, a trackability condition is defined. A Lyapunov‑based stability analysis is provided to ensure exponential convergence of errors in state estimation and spacecraft regulation to a neighborhood of the origin under the trackability condition. The developed method eliminates the need for expensive velocity sensors or extensive pre‑training, offering a practical and robust solution for spacecraft servicing in complex, dynamic environments.

PaperID: 11985, https://arxiv.org/pdf/2501.03618.pdf

Authors: Audrey Olson, Pratyusha Maiti, Ashok Goel

Title: The Textbook of Tomorrow: Rethinking Course Material Interfacing in the Era of GPT

Abstract:
Online Learning Management Systems (LMSs), such as Blackboard and Canvas, have existed for decades. Yet, course readings, when provided at all, consistently exist as simple digital twins to their real‑life counterparts. While online tools and resources exist to help students process digital texts more efficiently or in ways better suited to their learning styles, knowledge about such resources is not evenly distributed and creates a gulf in advantage between students. This paper proposes the courseware integration of "smart" textbooks, a newfound way for students to chat with their readings, receive summaries and explanations for highlighted text, and generate quiz questions via an AI agent embedded in their online course material. Future iterations of the software aim to add in‑context reference highlighting for AI‑generated answers and personalized tunings for the end learner.

PaperID: 11986, https://arxiv.org/pdf/2501.03496.pdf

Authors: Jinming Gao, Yijing Wang, Wentao Zhang, Rui Zhao, Yang Shi, Zhiqiang Zuo

Title: A Unified Attack Detection Strategy for Multi-Agent Systems over Transient and Steady Stages

Abstract:
This paper proposes a unified detection strategy against three kinds of attacks for multi‑agent systems (MASs) which is applicable to both transient and steady stages. For attacks on the communication layer, a watermarking‑based detection scheme with KullbackLeibler (KL) divergence is designed. Different from traditional communication schemes, each agent transmits a message set containing two state values with different types of watermarking. It is found that the detection performance is determined by the relevant parameters of the watermarking signal. Unlike the existing detection manoeuvres, such a scheme is capable of transient and steady stages. For attacks on the agent layer, a convergence rate related detection approach is put forward. It is shown that the resilience of the considered system is characterized by the coefficient and offset of the envelope. For hybrid attacks, based on the above detection mechanisms, a general framework resorting to trusted agents is presented, which requires weaker graph conditions and less information transmission. Finally, an example associated with the platooning of connected vehicles is given to support the theoretical results.

PaperID: 11987, https://arxiv.org/pdf/2501.03295.pdf

Authors: Shuo Tong, Han Liu, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang

Title: A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Abstract:
Data‑driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently faces challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated significant potential across various domains, notably through In‑Context Learning (ICL), which enables high‑performance task execution with minimal input‑label demonstrations and no prior training. This paper aims to replace supervised learning with the emerging ICL paradigm for soft sensor modeling to address existing challenges and explore new avenues for advancement. To achieve this, we propose a novel framework called the Few‑shot Uncertainty‑aware and self‑Explaining Soft Sensor (LLM‑FUESS), which includes the Zero‑shot Auxiliary Variable Selector (LLM‑ZAVS) and the Uncertainty‑aware Few‑shot Soft Sensor (LLM‑UFSS). The LLM‑ZAVS retrieves from the Industrial Knowledge Vector Storage to enhance LLMs' domain‑specific knowledge, enabling zero‑shot auxiliary variable selection. In the LLM‑UFSS, we utilize text‑based context demonstrations of structured data to prompt LLMs to execute ICL for predicting and propose a context sample retrieval augmentation strategy to improve performance. Additionally, we explored LLMs' AIGC and probabilistic characteristics to propose self‑explanation and uncertainty quantification methods for constructing a trustworthy soft sensor. Extensive experiments demonstrate that our method achieved state‑of‑the‑art predictive performance, strong robustness, and flexibility, effectively mitigates training instability found in traditional methods. To the best of our knowledge, this is the first work to establish soft sensor utilizing LLMs.

PaperID: 11988, https://arxiv.org/pdf/2501.03265.pdf

Authors: Xubin Wang, Qing Li, Weijia Jia

Title: Cognitive Edge Computing: A Comprehensive Survey on Optimizing Large Models and AI Agents for Pervasive Deployment

Abstract:
This article surveys Cognitive Edge Computing as a practical and methodical pathway for deploying reasoning‑capable Large Language Models (LLMs) and autonomous AI agents on resource‑constrained devices at the network edge. We present a unified, cognition‑preserving framework spanning: (1) model optimization (quantization, sparsity, low‑rank adaptation, distillation) aimed at retaining multi‑step reasoning under tight memory/compute budgets; (2) system architecture (on‑device inference, elastic offloading, cloud‑edge collaboration) that trades off latency, energy, privacy, and capacity; and (3) adaptive intelligence (context compression, dynamic routing, federated personalization) that tailors computation to task difficulty and device constraints. We synthesize advances in efficient Transformer design, multimodal integration, hardware‑aware compilation, privacy‑preserving learning, and agentic tool use, and map them to edge‑specific operating envelopes. We further outline a standardized evaluation protocol covering latency, throughput, energy per token, accuracy, robustness, privacy, and sustainability, with explicit measurement assumptions to enhance comparability. Remaining challenges include modality‑aware reasoning benchmarks, transparent and reproducible energy reporting, edge‑oriented safety/alignment evaluation, and multi‑agent testbeds. We conclude with practitioner guidelines for cross‑layer co‑design of algorithms, runtime, and hardware to deliver reliable, efficient, and privacy‑preserving cognitive capabilities on edge devices.

PaperID: 11989, https://arxiv.org/pdf/2501.03259.pdf

Authors: Abdullah Mushtaq, Muhammad Rafay Naeem, Muhammad Imran Taj, Ibrahim Ghaznavi, Junaid Qadir

Title: Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens

Abstract:
As large language models (LLMs) like GPT‑4 and Llama 3 become integral to educational contexts, concerns are mounting over the cultural biases, power imbalances, and ethical limitations embedded within these technologies. Though generative AI tools aim to enhance learning experiences, they often reflect values rooted in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural paradigms, potentially sidelining diverse global perspectives. This paper proposes a framework to assess and mitigate cultural bias within LLMs through the lens of applied multiplexity. Multiplexity, inspired by Senturk et al. and rooted in Islamic and other wisdom traditions, emphasizes the coexistence of diverse cultural viewpoints, supporting a multi‑layered epistemology that integrates both empirical sciences and normative values. Our analysis reveals that LLMs frequently exhibit cultural polarization, with biases appearing in both overt responses and subtle contextual cues. To address inherent biases and incorporate multiplexity in LLMs, we propose two strategies: Contextually‑Implemented Multiplex LLMs, which embed multiplex principles directly into the system prompt, influencing LLM outputs at a foundational level and independent of individual prompts, and Multi‑Agent System (MAS)‑Implemented Multiplex LLMs, where multiple LLM agents, each representing distinct cultural viewpoints, collaboratively generate a balanced, synthesized response. Our findings demonstrate that as mitigation strategies evolve from contextual prompting to MAS‑implementation, cultural inclusivity markedly improves, evidenced by a significant rise in the Perspectives Distribution Score (PDS) and a PDS Entropy increase from 3.25% at baseline to 98% with the MAS‑Implemented Multiplex LLMs. Sentiment analysis further shows a shift towards positive sentiment across cultures,...

PaperID: 11990, https://arxiv.org/pdf/2501.03187.pdf

Authors: Dennis Gross

Title: Turn-based Multi-Agent Reinforcement Learning Model Checking

Abstract:
In this paper, we propose a novel approach for verifying the compliance of turn‑based multi‑agent reinforcement learning (TMARL) agents with complex requirements in stochastic multiplayer games. Our method overcomes the limitations of existing verification approaches, which are inadequate for dealing with TMARL agents and not scalable to large games with multiple agents. Our approach relies on tight integration of TMARL and a verification technique referred to as model checking. We demonstrate the effectiveness and scalability of our technique through experiments in different types of environments. Our experiments show that our method is suited to verify TMARL agents and scales better than naive monolithic model checking.

PaperID: 11991, https://arxiv.org/pdf/2501.02990.pdf

Authors: Haozheng Xu, Alistair Weld, Chi Xu, Alfie Roddan, Joao Cartucho, Mert Asim Karaoglu, Alexander Ladikos, Yangke Li, Yiping Li, Daiyun Shen, Geonhee Lee, Seyeon Park, Jongho Shin, Young-Gon Kim, Lucy Fothergill, Dominic Jones, Pietro Valdastri, Duygu Sarikaya, Stamatia Giannarou

Title: SurgRIPE challenge: Benchmark of Surgical Robot Instrument Pose Estimation

Abstract:
Accurate instrument pose estimation is a crucial step towards the future of robotic surgery, enabling applications such as autonomous surgical task execution. Vision‑based methods for surgical instrument pose estimation provide a practical approach to tool tracking, but they often require markers to be attached to the instruments. Recently, more research has focused on the development of marker‑less methods based on deep learning. However, acquiring realistic surgical data, with ground truth instrument poses, required for deep learning training, is challenging. To address the issues in surgical instrument pose estimation, we introduce the Surgical Robot Instrument Pose Estimation (SurgRIPE) challenge, hosted at the 26th International Conference on Medical Image Computing and Computer‑Assisted Intervention (MICCAI) in 2023. The objectives of this challenge are: (1) to provide the surgical vision community with realistic surgical video data paired with ground truth instrument poses, and (2) to establish a benchmark for evaluating markerless pose estimation methods. The challenge led to the development of several novel algorithms that showcased improved accuracy and robustness over existing methods. The performance evaluation study on the SurgRIPE dataset highlights the potential of these advanced algorithms to be integrated into robotic surgery systems, paving the way for more precise and autonomous surgical procedures. The SurgRIPE challenge has successfully established a new benchmark for the field, encouraging further research and development in surgical robot instrument pose estimation.

PaperID: 11992, https://arxiv.org/pdf/2501.02888.pdf

Authors: Chuxiong Sun, Peng He, Rui Wang, Changwen Zheng

Title: Revisiting Communication Efficiency in Multi-Agent Reinforcement Learning from the Dimensional Analysis Perspective

Abstract:
In this work, we introduce a novel perspective, i.e., dimensional analysis, to address the challenge of communication efficiency in Multi‑Agent Reinforcement Learning (MARL). Our findings reveal that simply optimizing the content and timing of communication at sending end is insufficient to fully resolve communication efficiency issues. Even after applying optimized and gated messages, dimensional redundancy and confounders still persist in the integrated message embeddings at receiving end, which negatively impact communication quality and decision‑making. To address these challenges, we propose Dimensional Rational Multi‑Agent Communication (DRMAC), designed to mitigate both dimensional redundancy and confounders in MARL. DRMAC incorporates a redundancy‑reduction regularization term to encourage the decoupling of information across dimensions within the learned representations of integrated messages. Additionally, we introduce a dimensional mask that dynamically adjusts gradient weights during training to eliminate the influence of decision‑irrelevant dimensions. We evaluate DRMAC across a diverse set of multi‑agent tasks, demonstrating its superior performance over existing state‑of‑the‑art methods in complex scenarios. Furthermore, the plug‑and‑play nature of DRMAC's key modules highlights its generalizable performance, serving as a valuable complement rather than a replacement for existing multi‑agent communication strategies.

PaperID: 11993, https://arxiv.org/pdf/2501.02803.pdf

Authors: Yimin Tang, Zhenghong Yu, Yi Zheng, T. K. Satish Kumar, Jiaoyang Li, Sven Koenig

Title: Enhancing Lifelong Multi-Agent Path Finding with Cache Mechanism

Abstract:
Multi‑Agent Path Finding (MAPF), which focuses on finding collision‑free paths for multiple robots, is crucial in autonomous warehouse operations. Lifelong MAPF (L‑MAPF), where agents are continuously reassigned new targets upon completing their current tasks, offers a more realistic approximation of real‑world warehouse scenarios. While cache storage systems can enhance efficiency and reduce operational costs, existing approaches primarily rely on expectations and mathematical models, often without adequately addressing the challenges of multi‑robot planning and execution. In this paper, we introduce a novel mechanism called Lifelong MAPF with Cache Mechanism (L‑MAPF‑CM), which integrates high‑level cache storage with low‑level path planning. We have involved a new type of map grid called cache for temporary item storage. Additionally, we involved a task assigner (TA) with a locking mechanism to bridge the gap between the new cache grid and L‑MAPF algorithm. The TA dynamically allocates target locations to agents based on their status in various scenarios. We evaluated L‑MAPF‑CM using different cache replacement policies and task distributions. L‑MAPF‑CM has demonstrated performance improvements particularly with high cache hit rates and smooth traffic conditions.

PaperID: 11994, https://arxiv.org/pdf/2501.02792.pdf

Authors: Liudong Chen, Jay Sethuraman, Bolun Xu

Title: Gaming on Coincident Peak Shaving: Equilibrium and Strategic Behavior

Abstract:
Power system operators and electric utility companies often impose a coincident peak demand charge on customers when the aggregate system demand reaches its maximum. This charge incentivizes customers to strategically shift their peak usage away from the system's collective peak, which helps reduce stress on electricity infrastructure. In this paper, we develop a game‑theoretic model to analyze how such strategic behavior affects overall system efficiency. We show that depending on the extent of customers' demand‑shifting capabilities, the resulting coincident peak shaving game can exhibit concavity, quasi‑concavity with discontinuities, or non‑concavity with discontinuities. In a two‑agent, two‑period setting, we derive closed‑form Nash equilibrium solutions for each scenario and generalize our findings to multi‑agent contexts. We prove the stability of the equilibrium points and propose an algorithm for computing equilibrium outcomes under all game configurations. Our results indicate that the peak‑shaving outcome at the equilibrium of the game model is comparable to the optimal outcome of the natural centralized model. However, there is a significant loss in efficiency. Under quasi‑concave and non‑concave conditions, this inefficiency grows with increased customer flexibility and larger disparities in marginal shifting costs; we also examine how the number of agents influences system performance. Finally, numerical simulations with real‑world applications validate our theoretical insights.

PaperID: 11995, https://arxiv.org/pdf/2501.02770.pdf

Authors: Hoang-Dung Bui, Erion Plaku, Gregoy J. Stein

Title: Multi-Agent Pathfinding Under Team-Connected Communication Constraint via Adaptive Path Expansion and Dynamic Leading

Abstract:
This paper proposes a novel planning framework to handle a multi‑agent pathfinding problem under team‑connected communication constraint, where all agents must have a connected communication channel to the rest of the team during their entire movements. Standard multi‑agent path finding approaches (e.g., priority‑based search) have potential in this domain but fail when neighboring configurations at start and goal differ. Their single‑expansion approach ‑‑ computing each agent's path from the start to the goal in just a single expansion ‑‑ cannot reliably handle planning under communication constraints for agents as their neighbors change during navigating. Similarly, leader‑follower approaches (e.g., platooning) are effective at maintaining team communication, but fixing the leader at the outset of planning can cause planning to become stuck in dense‑clutter environments, limiting their practical utility. To overcome this limitation, we propose a novel two‑level multi‑agent pathfinding framework that integrates two techniques: adaptive path expansion to expand agent paths to their goals in multiple stages; and dynamic leading technique that enables the reselection of the leading agent during each agent path expansion whenever progress cannot be made. Simulation experiments show the efficiency of our planners, which can handle up to 25 agents across five environment types under a limited communication range constraint and up to 11‑12 agents on three environment types under line‑of‑sight communication constraint, exceeding 90% success‑rate where baselines routinely fail.

PaperID: 11996, https://arxiv.org/pdf/2501.02483.pdf

Authors: Esmail Abdul Fattah, Hatem Ltaief, Havard Rue, David Keyes

Title: sTiles: An Accelerated Computational Framework for Sparse Factorizations of Structured Matrices

Abstract:
This paper introduces sTiles, a GPU‑accelerated framework for factorizing sparse structured symmetric matrices. By leveraging tile algorithms for fine‑grained computations, sTiles uses a structure‑aware task execution flow to handle challenging arrowhead sparse matrices with variable bandwidths, common in scientific and engineering fields. It minimizes fill‑in during Cholesky factorization using permutation techniques and employs a static scheduler to manage tasks on shared‑memory systems with GPU accelerators. sTiles balances tile size and parallelism, where larger tiles enhance algorithmic intensity but increase floating‑point operations and memory usage, while parallelism is constrained by the arrowhead structure. To expose more parallelism, a left‑looking Cholesky variant breaks sequential dependencies in trailing submatrix updates via tree reductions. Evaluations show sTiles achieves speedups of up to 8.41X, 9.34X, 5.07X, and 11.08X compared to CHOLMOD, SymPACK, MUMPS, and PARDISO, respectively, and a 5X speedup compared to a 32‑core AMD EPYC CPU on an NVIDIA A100 GPU. Our generic software framework imports well‑established concepts from dense matrix computations but they all require customizations in their deployments on hybrid architectures to best handle factorizations of sparse matrices with arrowhead structures.

PaperID: 11997, https://arxiv.org/pdf/2501.02368.pdf

Authors: Ravirajan K, Arvind Sundarajan

Title: Enhancing Workplace Productivity and Well-being Using AI Agent

Abstract:
This paper discusses the use of Artificial Intelligence (AI) to enhance workplace productivity and employee well‑being. By integrating machine learning (ML) techniques with neurobiological data, the proposed approaches ensure alignment with human ethical standards through value alignment models and Hierarchical Reinforcement Learning (HRL) for autonomous task management. The system utilizes biometric feedback from employees to generate personalized health prompts, fostering a supportive work environment that encourages physical activity. Additionally, we explore decentralized multi‑agent systems for improved collaboration and decision‑making frameworks that enhance transparency. Various approaches using ML techniques in conjunction with AI implementations are discussed. Together, these innovations aim to create a more productive and health‑conscious workplace. These outcomes assist HR management and organizations in launching more rational career progression streams for employees and facilitating organizational transformation.

PaperID: 11998, https://arxiv.org/pdf/2501.02221.pdf

Authors: Kanefumi Matsuyama, Kefan Su, Jiangxing Wang, Deheng Ye, Zongqing Lu

Title: CORD: Generalizable Cooperation via Role Diversity

Abstract:
Cooperative multi‑agent reinforcement learning (MARL) aims to develop agents that can collaborate effectively. However, most cooperative MARL methods overfit training agents, making learned policies not generalize well to unseen collaborators, which is a critical issue for real‑world deployment. Some methods attempt to address the generalization problem but require prior knowledge or predefined policies of new teammates, limiting real‑world applications. To this end, we propose a hierarchical MARL approach to enable generalizable cooperation via role diversity, namely CORD. CORD's high‑level controller assigns roles to low‑level agents by maximizing the role entropy with constraints. We show this constrained objective can be decomposed into causal influence in role that enables reasonable role assignment, and role heterogeneity that yields coherent, non‑redundant role clusters. Evaluated on a variety of cooperative multi‑agent tasks, CORD achieves better performance than baselines, especially in generalization tests. Ablation studies further demonstrate the efficacy of the constrained objective in generalizable cooperation.

PaperID: 11999, https://arxiv.org/pdf/2501.02174.pdf

Authors: Peihong Yu, Manav Mishra, Syed Zaidi, Pratap Tokekar

Title: TACTIC: Task-Agnostic Contrastive pre-Training for Inter-Agent Communication

Abstract:
The "sight range dilemma" in cooperative Multi‑Agent Reinforcement Learning (MARL) presents a significant challenge: limited observability hinders team coordination, while extensive sight ranges lead to distracted attention and reduced performance. While communication can potentially address this issue, existing methods often struggle to generalize across different sight ranges, limiting their effectiveness. We propose TACTIC, Task‑Agnostic Contrastive pre‑Training strategy Inter‑Agent Communication. TACTIC is an adaptive communication mechanism that enhances agent coordination even when the sight range during execution is vastly different from that during training. The communication mechanism encodes messages and integrates them with local observations, generating representations grounded in the global state using contrastive learning. By learning to generate and interpret messages that capture important information about the whole environment, TACTIC enables agents to effectively "see" more through communication, regardless of their sight ranges. We comprehensively evaluate TACTIC on the SMACv2 benchmark across various scenarios with broad sight ranges. The results demonstrate that TACTIC consistently outperforms traditional state‑of‑the‑art MARL techniques with and without communication, in terms of generalizing to sight ranges different from those seen in training, particularly in cases of extremely limited or extensive observability.

PaperID: 12000, https://arxiv.org/pdf/2501.02094.pdf

Authors: Ali Baheri, Peng Wei

Title: SMTL: A Stratified Logic for Expressive Multi-Level Temporal Specifications

Abstract:
We present Stratified Metric Temporal Logic (SMTL), a novel formalism for specifying and verifying properties of complex cyber‑physical systems that exhibit behaviors across multiple temporal and abstraction scales. SMTL extends existing temporal logics by incorporating a stratification operator, enabling the association of temporal properties with specific abstraction levels. This allows for the natural expression of multi‑scale requirements while maintaining formal reasoning about inter‑level relationships. We formalize the syntax and semantics of SMTL, proving that it strictly subsumes metric temporal logic (MTL) and offers enhanced expressiveness by capturing properties unattainable in existing logics. Numerical simulations comparing agents operating under MTL and SMTL specifications show that SMTL enhances agent coordination and safety, reducing collision rates without substantial computational overhead or compromising path efficiency. These findings underscore SMTL's potential as a valuable tool for designing and verifying complex multi‑agent systems operating across diverse temporal and abstraction scales.

PaperID: 12001, https://arxiv.org/pdf/2501.01992.pdf

Authors: Timotheus Kampik, Juan Carlos Nieves

Title: Disagree and Commit: Degrees of Argumentation-based Agreements

Abstract:
In cooperative human decision‑making, agreements are often not total; a partial degree of agreement is sufficient to commit to a decision and move on, as long as one is somewhat confident that the involved parties are likely to stand by their commitment in the future, given no drastic unexpected changes. In this paper, we introduce the notion of agreement scenarios that allow artificial autonomous agents to reach such agreements, using formal models of argumentation, in particular abstract argumentation and value‑based argumentation. We introduce the notions of degrees of satisfaction and (minimum, mean, and median) agreement, as well as a measure of the impact a value in a value‑based argumentation framework has on these notions. We then analyze how degrees of agreement are affected when agreement scenarios are expanded with new information, to shed light on the reliability of partial agreements in dynamic scenarios. An implementation of the introduced concepts is provided as part of an argumentation‑based reasoning software library.

PaperID: 12002, https://arxiv.org/pdf/2501.01849.pdf

Authors: Xiangxiang Dai, Yuejin Xie, Maoli Liu, Xuchuang Wang, Zhuohua Li, Huanyu Wang, John C. S. Lui

Title: A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Abstract:
Prompt‑based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi‑agent conversational bandit model to select optimal responses while aligning with user preferences dynamically. To tackle challenges such as high‑dimensional features, large response sets, adaptive conversational needs, and multi‑device access, we propose MACO, Multi‑Agent Conversational Online Learning, which comprises two key components: (1) \textttMACO‑A: Executed by local agents, it employs an online elimination mechanism to filter out low‑quality responses. (2) \textttMACO‑S: Executed by the cloud server, it adaptively adjusts selection strategies based on aggregated preference data. An adaptive preference mechanism triggers asynchronous conversations to enhance alignment efficiency. Theoretical analysis demonstrates that MACO achieves near‑optimal regret bounds, matching state‑of‑the‑art performance in various degenerate cases. Extensive experiments utilizing Google and OpenAI text embedding models on the real‑world datasets with different response styles, combined with Llama and GPT‑4o, show that MACO consistently outperforms baseline methods by at least 8.29% across varying response set sizes and numbers of agents.

PaperID: 12003, https://arxiv.org/pdf/2501.01817.pdf

Authors: Huiming Li, Hao Chen, Xiangke Wang, Zhongkui Li, Lincheng Shen

Title: Distributed Framework Construction for Affine Formation Control

Abstract:
In affine formation control problems, the construction of the framework with universal rigidity and affine localizability is a critical prerequisite, but it has not yet been well addressed, especially when additional agents join the formation or link/agent failures emerge. Motivated by this observation, we investigate the problem of constructing affine frameworks in three scenarios, including vertex addition, edge deletion and vertex deletion. Our approach starts from the original affine formation and uses geometric methods to locally adjust the structure of the weighted graph to describe the topology, so that the modified framework maintains the universal rigidity and affine localizability. Notably, the developed strategies only utilize local measurements and exhibit distributed characteristics, laying the foundation for applications in multi‑agent systems. To demonstrate the compatibility with affine formation control proposals, we present a case study on affine formation tracking in a multi‑UAV formation, demonstrating the effectiveness of our algorithms in constructing eligible frameworks in aforementioned scenarios. Moreover, a comparative simulation is also conducted to highlight the low time complexity of our distributed algorithm relative to the centralized optimization‑based method.

PaperID: 12004, https://arxiv.org/pdf/2501.01593.pdf

Authors: Jing Fang, Saihao Yan, Xueyu Yin, Yinbo Yu, Chunwei Tian, Jiajia Liu

Title: BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems

Abstract:
Recent studies have shown that cooperative multi‑agent deep reinforcement learning (c‑MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform malicious actions leading to failures or malicious goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c‑MADRL, BLAST, which attacks the entire multi‑agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manual‑injected fixed visual patterns or instant status and control the period to perform malicious actions. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the leverage attack effect that can pry open the entire multi‑agent system via a single backdoor agent. We evaluate our BLAST against 3 classic c‑MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c‑MADRL environments (SMAC and Pursuit), and 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.

PaperID: 12005, https://arxiv.org/pdf/2501.01531.pdf

Authors: Logan Beaver

Title: A Global Games-Inspired Approach to Multi-Robot Task Allocation for Heterogeneous Teams

Abstract:
In this article we propose a game‑theoretic approach to the multi‑robot task allocation problem using the framework of global games. Each task is associated with a global signal, a real‑valued number that captures the task execution progress and/or urgency. We propose a linear objective function for each robot in the system, which, for each task, increases with global signal and decreases with the number assigned robots. We provide conditions on the objective function hyperparameters to induce a mixed Nash equilibrium, i.e., solutions where all robots are not assigned to a single task. The resulting algorithm only requires the inversion of a matrix to determine a probability distribution over the robot assignments. We demonstrate the performance of our algorithm in simulation and provide direction for applications and future work.

PaperID: 12006, https://arxiv.org/pdf/2501.01290.pdf

Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx

Title: ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Abstract:
Despite recent advances in AI, the development of systems capable of executing complex, multi‑step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real‑world complexity of tool‑use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi‑step tool‑use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human‑edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome‑supervised reward models (ORMs) with process‑supervised reward models (PRMs) to assess their ability to improve complex tool‑use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine‑tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi‑step tool‑use tasks.

PaperID: 12007, https://arxiv.org/pdf/2501.01266.pdf

Authors: Michael Kölle, Johannes Tochtermann, Julian Schönberger, Gerhard Stenzel, Philipp Altmann, Claudia Linnhoff-Popien

Title: PIMAEX: Multi-Agent Exploration through Peer Incentivization

Abstract:
While exploration in single‑agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi‑agent reinforcement learning. To address this issue, this work proposes a peer‑incentivized reward function inspired by previous research on intrinsic curiosity and influence‑based rewards. The PIMAEX reward, short for Peer‑Incentivized Multi‑Agent Exploration, aims to improve exploration in the multi‑agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the PIMAEX reward in conjunction with PIMAEX‑Communication, a multi‑agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the Consume/Explore environment, a partially observable environment with deceptive rewards, specifically designed to challenge the exploration vs.\ exploitation dilemma and the credit‑assignment problem. The results empirically demonstrate that agents using the PIMAEX reward with PIMAEX‑Communication outperform those that do not.

PaperID: 12008, https://arxiv.org/pdf/2501.01205.pdf

Authors: Abdullah Mushtaq, Muhammad Rafay Naeem, Ibrahim Ghaznavi, Muhammad Imran Taj, Imran Hashmi, Junaid Qadir

Title: Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A Framework for Senior Design Projects

Abstract:
Multi‑Agent Large Language Models (LLMs) are gaining significant attention for their ability to harness collective intelligence in complex problem‑solving, decision‑making, and planning tasks. This aligns with the concept of the wisdom of crowds, where diverse agents contribute collectively to generating effective solutions, making it particularly suitable for educational settings. Senior design projects, also known as capstone or final year projects, are pivotal in engineering education as they integrate theoretical knowledge with practical application, fostering critical thinking, teamwork, and real‑world problem‑solving skills. In this paper, we explore the use of Multi‑Agent LLMs in supporting these senior design projects undertaken by engineering students, which often involve multidisciplinary considerations and conflicting objectives, such as optimizing technical performance while addressing ethical, social, and environmental concerns. We propose a framework where distinct LLM agents represent different expert perspectives, such as problem formulation agents, system complexity agents, societal and ethical agents, or project managers, thus facilitating a holistic problem‑solving approach. This implementation leverages standard multi‑agent system (MAS) concepts such as coordination, cooperation, and negotiation, incorporating prompt engineering to develop diverse personas for each agent. These agents engage in rich, collaborative dialogues to simulate human engineering teams, guided by principles from swarm AI to efficiently balance individual contributions towards a unified solution. We adapt these techniques to create a collaboration structure for LLM agents, encouraging interdisciplinary reasoning and negotiation similar to real‑world senior design projects. To assess the efficacy of this framework, we collected six proposals of engineering and computer science of...

PaperID: 12009, https://arxiv.org/pdf/2501.01203.pdf

Authors: Runsong Jia, Mengjia Wu, Ying Ding, Jie Lu, Yi Zhang

Title: HetGCoT: Heterogeneous Graph-Enhanced Chain-of-Thought LLM Reasoning for Academic Question Answering

Abstract:
Academic question answering (QA) in heterogeneous scholarly networks presents unique challenges requiring both structural understanding and interpretable reasoning. While graph neural networks (GNNs) capture structured graph information and large language models (LLMs) demonstrate strong capabilities in semantic comprehension, current approaches lack integration at the reasoning level. We propose HetGCoT, a framework enabling LLMs to effectively leverage and learn information from graphs to reason interpretable academic QA results. Our framework introduces three technical contributions: (1) a framework that transforms heterogeneous graph structural information into LLM‑processable reasoning chains, (2) an adaptive metapath selection mechanism identifying relevant subgraphs for specific queries, and (3) a multi‑step reasoning strategy systematically incorporating graph contexts into the reasoning process. Experiments on OpenAlex and DBLP datasets show our approach outperforms all sota baselines. The framework demonstrates adaptability across different LLM architectures and applicability to various scholarly question answering tasks.

PaperID: 12010, https://arxiv.org/pdf/2501.01149.pdf

Authors: Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, Hongsheng Li

Title: A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation

Abstract:
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphic user interface (GUI) AI agents, which is designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation, where existing benchmarks predominantly rely on either static frame assessments such as AndroidControl or offline static apps such as AndroidWorld and thus fail to capture agent performance in dynamic, real‑world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential‑state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely‑used, dynamic online apps across 20 categories from the Google Play Store, ensuring evaluation comprehension. A3 also presents a novel "essential‑state" based procedural evaluation method that leverages MLLMs as reward models to progressively verify task completion and process achievement. This evaluation approach address the limitations of traditional function based evaluation methods on online dynamic apps. Furthermore, A3 includes a toolkit to streamline Android device interaction, reset online environment and apps and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and tools, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.

PaperID: 12011, https://arxiv.org/pdf/2501.01140.pdf

Authors: Min Whoo Lee, Kibeom Kim, Soo Wung Shin, Minsu Lee, Byoung-Tak Zhang

Title: Communicating Unexpectedness for Out-of-Distribution Multi-Agent Reinforcement Learning

Abstract:
Applying multi‑agent reinforcement learning methods to realistic settings is challenging as it may require the agents to quickly adapt to unexpected situations that are rarely or never encountered in training. Recent methods for generalization to such out‑of‑distribution settings are limited to more specific, restricted instances of distribution shifts. To tackle adaptation to distribution shifts, we propose Unexpected Encoding Scheme, a novel decentralized multi‑agent reinforcement learning algorithm where agents communicate "unexpectedness," the aspects of the environment that are surprising. In addition to a message yielded by the original reward‑driven communication, each agent predicts the next observation based on previous experience, measures the discrepancy between the prediction and the actually encountered observation, and encodes this discrepancy as a message. Experiments on multi‑robot warehouse environment support that our proposed method adapts robustly to dynamically changing training environments as well as out‑of‑distribution environment.

PaperID: 12012, https://arxiv.org/pdf/2501.01136.pdf

Authors: Nikolaos Bousias, Stefanos Pertigkiozoglou, Kostas Daniilidis, George Pappas

Title: Symmetries-enhanced Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning has emerged as a powerful framework for enabling agents to learn complex, coordinated behaviors but faces persistent challenges regarding its generalization, scalability and sample efficiency. Recent advancements have sought to alleviate those issues by embedding intrinsic symmetries of the systems in the policy. Yet, most dynamical systems exhibit little to no symmetries to exploit. This paper presents a novel framework for embedding extrinsic symmetries in multi‑agent system dynamics that enables the use of symmetry‑enhanced methods to address systems with insufficient intrinsic symmetries, expanding the scope of equivariant learning to a wide variety of MARL problems. Central to our framework is the Group Equivariant Graphormer, a group‑modular architecture specifically designed for distributed swarming tasks. Extensive experiments on a swarm of symmetry‑breaking quadrotors validate the effectiveness of our approach, showcasing its potential for improved generalization and zero‑shot scalability. Our method achieves significant reductions in collision rates and enhances task success rates across a diverse range of scenarios and varying swarm sizes.

PaperID: 12013, https://arxiv.org/pdf/2501.00990.pdf

Authors: Yichao Wang, Mohamadamin Rajabinezhad, Yi Zhang, Shan Zuo

Title: Cyber-physical Defense for Heterogeneous Multi-agent Systems Against Exponentially Unbounded Attacks on Signed Digraphs

Abstract:
Cyber‑physical systems (CPSs) are subjected to attacks on both cyber and physical spaces. In reality, the attackers could launch exponentially unbounded false data injection (EU‑FDI) attacks, which are more destructive and could lead to the system's collapse or instability. Existing literature generally addresses bounded attack signals and/or bounded‑first‑order‑derivative attack signals, which exposes the CPSs to significant threats. In contrast, this paper proposes a fully‑distributed attack‑resilient bi‑layer defense framework to address the bipartite output containment problem for heterogeneous multi‑agent systems on signed digraphs, in the presence of EU‑FDI attacks on both cyber‑physical layer (CPL) and observer layer (OL). First, we design attack‑resilient dynamic compensators that utilize data communicated on the OL to estimate the convex combinations of the states and negative states of the leaders. The attack‑resilient compensators address the EU‑FDI attacks on the OL and guarantee the uniformly ultimately bounded (UUB) estimation of the leaders' states. Then, by using the compensators' states, fully‑distributed attack‑resilient controllers are designed on the CPL to further address the EU‑FDI attacks on the actuators. Rigorous mathematical proof based on Lyapunov stability analysis is provided, establishing the theoretical soundness of the proposed bi‑layer resilient defense framework, by preserving the UUB consensus and stability against EU‑FDI attacks on both CPL and OL. Finally, a comparative case study for heterogeneous multi‑agent systems validate the enhanced resilience of the proposed defense strategies.

PaperID: 12014, https://arxiv.org/pdf/2501.00973.pdf

Authors: Yichao Wang, Mohamadamin Rajabinezhad, Dimitra Panagou, Shan Zuo

Title: Defense Strategies for Autonomous Multi-agent Systems: Ensuring Safety and Resilience Under Exponentially Unbounded FDI Attacks

Abstract:
False data injection attacks pose a significant threat to autonomous multi‑agent systems (MASs). Existing attack‑resilient control strategies generally have strict assumptions on the attack signals and overlook safety constraints, such as collision avoidance. In practical applications, leader agents equipped with advanced sensors or weaponry span a safe region to guide heterogeneous follower agents, ensuring coordinated operations while addressing collision avoidance to prevent financial losses and mission failures. This letter addresses these gaps by introducing and solving the safety‑aware and attack‑resilient (SAAR) control problem under exponentially unbounded false data injection (EU‑FDI) attacks. Specifically, a novel attack‑resilient observer layer (OL) is first designed to defend against EU‑FDI attacks on the OL. Then, an attack‑resilient compensational signal is designed to mitigate the adverse effects caused by the EU‑FDI attack on control input layer (CIL). Finally, a SAAR controller is designed by solving a quadratic programming (QP) problem integrating control barrier function (CBF) certified collision‑free safety constraints. Rigorous Lyapunov‑based stability analysis certifies the SAAR controller's effectiveness in ensuring both safety and resilience. This study also pioneers a three‑dimensional (3D) simulation of the SAAR containment control problem for heterogeneous MASs, demonstrating its applicability in realistic multi‑agent scenarios.

PaperID: 12015, https://arxiv.org/pdf/2501.00950.pdf

Authors: Cleverson Nahum, Salvatore D'Oro, Pedro Batista, Cristiano Both, Kleber Cardoso, Aldebaro Klautau, Tommaso Melodia

Title: Intent-based Radio Scheduler for RAN Slicing: Learning to deal with different network scenarios

Abstract:
The future mobile network has the complex mission of distributing available radio resources among various applications with different requirements. The radio access network slicing enables the creation of different logical networks by isolating and using dedicated resources for each group of applications. In this scenario, the radio resource scheduling (RRS) is responsible for distributing the radio resources available among the slices to fulfill their service‑level agreement (SLA) requirements, prioritizing critical slices while minimizing the number of intent violations. Moreover, ensuring that the RRS can deal with a high diversity of network scenarios is essential. Several recent papers present advances in machine learning‑based RRS. However, the scenarios and slice variety are restricted, which inhibits solid conclusions about the generalization capabilities of the models after deployment in real networks. This paper proposes an intent‑based RRS using multi‑agent reinforcement learning in a radio access network (RAN) slicing context. The proposed method protects high‑priority slices when the available radio resources cannot fulfill all the slices. It uses transfer learning to reduce the number of training steps required. The proposed method and baselines are evaluated in different network scenarios that comprehend combinations of different slice types, channel trajectories, number of active slices and users' equipment (UEs), and UE characteristics. The proposed method outperformed the baselines in protecting slices with higher priority, obtaining an improvement of 40% and, when considering all the slices, obtaining an improvement of 20% in relation to the baselines. The results show that by using transfer learning, the required number of training steps could be reduced by a factor of eight without hurting performance.

PaperID: 12016, https://arxiv.org/pdf/2501.00906.pdf

Authors: Talha Zeeshan, Abhishek Kumar, Susanna Pirttikangas, Sasu Tarkoma

Title: Large Language Model Based Multi-Agent System Augmented Complex Event Processing Pipeline for Internet of Multimedia Things

Abstract:
This paper presents the development and evaluation of a Large Language Model (LLM), also known as foundation models, based multi‑agent system framework for complex event processing (CEP) with a focus on video query processing use cases. The primary goal is to create a proof‑of‑concept (POC) that integrates state‑of‑the‑art LLM orchestration frameworks with publish/subscribe (pub/sub) tools to address the integration of LLMs with current CEP systems. Utilizing the Autogen framework in conjunction with Kafka message brokers, the system demonstrates an autonomous CEP pipeline capable of handling complex workflows. Extensive experiments evaluate the system's performance across varying configurations, complexities, and video resolutions, revealing the trade‑offs between functionality and latency. The results show that while higher agent count and video complexities increase latency, the system maintains high consistency in narrative coherence. This research builds upon and contributes to, existing novel approaches to distributed AI systems, offering detailed insights into integrating such systems into existing infrastructures.

PaperID: 12017, https://arxiv.org/pdf/2501.00881.pdf

Authors: Fouad Bousetouane

Title: Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents

Abstract:
The evolution of agentic systems represents a significant milestone in artificial intelligence and modern software systems, driven by the demand for vertical intelligence tailored to diverse industries. These systems enhance business outcomes through adaptability, learning, and interaction with dynamic environments. At the forefront of this revolution are Large Language Model (LLM) agents, which serve as the cognitive backbone of these intelligent systems. In response to the need for consistency and scalability, this work attempts to define a level of standardization for Vertical AI agent design patterns by identifying core building blocks and proposing a Cognitive Skills Module, which incorporates domain‑specific, purpose‑built inference capabilities. Building on these foundational concepts, this paper offers a comprehensive introduction to agentic systems, detailing their core components, operational patterns, and implementation strategies. It further explores practical use cases and examples across various industries, highlighting the transformative potential of LLM agents in driving industry‑specific applications.

PaperID: 12018, https://arxiv.org/pdf/2501.00872.pdf

Authors: Yi Zhang, Bin Lei, Mohamadamin Rajabinezhad, Caiwen Ding, Shan Zuo

Title: Observer-Based Data-Driven Consensus Control for Nonlinear Multi-Agent Systems against DoS and FDI attacks

Abstract:
Existing data‑driven control methods generally do not address False Data Injection (FDI) and Denial‑of‑Service (DoS) attacks simultaneously. This letter introduces a distributed data‑driven attack‑resilient consensus problem under both FDI and DoS attacks and proposes a data‑driven consensus control framework, consisting of a group of comprehensive attack‑resilient observers. The proposed group of observers is designed to estimate FDI attacks, external disturbances, and lumped disturbances, combined with a DoS attack compensation mechanism. A rigorous stability analysis of the approach is provided to ensure the boundedness of the distributed neighborhood estimation consensus error. The effectiveness of the approach is validated through numerical examples involving both leaderless consensus and leader‑follower consensus, demonstrating significantly improved resilient performance compared to existing data‑driven control approaches.

PaperID: 12019, https://arxiv.org/pdf/2501.00826.pdf

Authors: Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, Yang Liu

Title: LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

Abstract:
Cryptocurrency investment is inherently difficult due to its shorter history compared to traditional assets, the need to integrate vast amounts of data from various modalities, and the requirement for complex reasoning. While deep learning approaches have been applied to address these challenges, their black‑box nature raises concerns about trust and explainability. Recently, large language models (LLMs) have shown promise in financial applications due to their ability to understand multi‑modal data and generate explainable decisions. However, single LLM faces limitations in complex, comprehensive tasks such as asset investment. These limitations are even more pronounced in cryptocurrency investment, where LLMs have less domain‑specific knowledge in their training corpora. To overcome these challenges, we propose an explainable, multi‑modal, multi‑agent framework for cryptocurrency investment. Our framework uses specialized agents that collaborate within and across teams to handle subtasks such as data analysis, literature integration, and investment decision‑making for the top 30 cryptocurrencies by market capitalization. The expert training module fine‑tunes agents using multi‑modal historical data and professional investment literature, while the multi‑agent investment module employs real‑time data to make informed cryptocurrency investment decisions. Unique intrateam and interteam collaboration mechanisms enhance prediction accuracy by adjusting final predictions based on confidence levels within agent teams and facilitating information sharing between teams. Empirical evaluation using data from November 2023 to September 2024 demonstrates that our framework outperforms single‑agent models and market benchmarks in classification, asset pricing, portfolio, and explainability performance.

PaperID: 12020, https://arxiv.org/pdf/2501.00750.pdf

Authors: Cheonsu Jeong

Title: Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform

Abstract:
This study proposes the design and implementation of a multimodal LLM‑based Multi‑Agent System (MAS) leveraging a No‑Code platform to address the practical constraints and significant entry barriers associated with AI adoption in enterprises. Advanced AI technologies, such as Large Language Models (LLMs), often pose challenges due to their technical complexity and high implementation costs, making them difficult for many organizations to adopt. To overcome these limitations, this research develops a No‑Code‑based Multi‑Agent System designed to enable users without programming knowledge to easily build and manage AI systems. The study examines various use cases to validate the applicability of AI in business processes, including code generation from image‑based notes, Advanced RAG‑based question‑answering systems, text‑based image generation, and video generation using images and prompts. These systems lower the barriers to AI adoption, empowering not only professional developers but also general users to harness AI for significantly improved productivity and efficiency. By demonstrating the scalability and accessibility of No‑Code platforms, this study advances the democratization of AI technologies within enterprises and validates the practical applicability of Multi‑Agent Systems, ultimately contributing to the widespread adoption of AI across various industries.

PaperID: 12021, https://arxiv.org/pdf/2501.00642.pdf

Authors: Mark Zakharov, Farzaneh Rabiei Kashanaki, Jose Renau

Title: Enabling New HDLs with Agents

Abstract:
Large Language Models (LLMs) based agents are transforming the programming language landscape by facilitating learning for beginners, enabling code generation, and optimizing documentation workflows. Hardware Description Languages (HDLs), with their smaller user community, stand to benefit significantly from the application of LLMs as tools for learning new HDLs. This paper investigates the challenges and solutions of enabling LLMs for HDLs, particularly for HDLs that LLMs have not been previously trained on. This work introduces HDLAgent, an AI agent optimized for LLMs with limited knowledge of various HDLs. It significantly enhances off‑the‑shelf LLMs.

PaperID: 12022, https://arxiv.org/pdf/2501.00626.pdf

Authors: Alec J. Linot, Haotian Hang, Eva Kanso, Kunihiko Taira

Title: Hierarchical equivariant graph neural networks for forecasting collective motion in vortex clusters and microswimmers

Abstract:
Data‑driven modeling of collective dynamics is a challenging problem because emergent phenomena in multi‑agent systems are often shaped by long‑range interactions among individuals. For example, in bird flocks and fish schools, long‑range vision and flow coupling drive individual behaviors across the collective. Such collective motion can be modeled using graph neural networks (GNNs), but GNNs struggle when graphs become large and often fail to capture long‑range interactions. Here, we construct hierarchical and equivariant GNNs, and show that these GNNs accurately predict local and global behavior in systems with collective motion. As representative examples, we apply this approach to simulations of clusters of point vortices and populations of microswimmers. For the point vortices, we define a local graph of vortices within a cluster and a global graph of interactions between clusters. For the microswimmers, we define a local graph around each microswimmer and a global graph that groups long‑range interactions. We then combine this hierarchy of graphs with an approach that enforces equivariance to rotations and translations. This combination results in a significant improvement over a fully‑connected GNN. For point vortices, our method conserves the Hamiltonian over long times, and, for microswimmers, our method predicts the transition from aggregation to swirling.

PaperID: 12023, https://arxiv.org/pdf/2501.00599.pdf

Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

Title: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Abstract:
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine‑grained spatial and temporal details. Besides, the lack of high‑quality object‑level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer‑level spatial‑temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specially, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi‑agent data engine to meticulously curate a large‑scale, high‑quality object‑level video instruction dataset, termed VideoRefer‑700K. Next, we present the VideoRefer model, which equips a versatile spatial‑temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer‑Bench to comprehensively assess the spatial‑temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

PaperID: 12024, https://arxiv.org/pdf/2501.00364.pdf

Authors: Leo Ardon, Daniel Furelos-Blanco, Roko Parac, Alessandra Russo

Title: FORM: Learning Expressive and Transferable First-Order Logic Reward Machines

Abstract:
Reward machines (RMs) are an effective approach for addressing non‑Markovian rewards in reinforcement learning (RL) through finite‑state machines. Traditional RMs, which label edges with propositional logic formulae, inherit the limited expressivity of propositional logic. This limitation hinders the learnability and transferability of RMs since complex tasks will require numerous states and edges. To overcome these challenges, we propose First‑Order Reward Machines (\textttFORMs), which use first‑order logic to label edges, resulting in more compact and transferable RMs. We introduce a novel method for learning \textttFORMs and a multi‑agent formulation for exploiting them and facilitate their transferability, where multiple agents collaboratively learn policies for a shared \textttFORM. Our experimental results demonstrate the scalability of \textttFORMs with respect to traditional RMs. Specifically, we show that \textttFORMs can be effectively learnt for tasks where traditional RM learning approaches fail. We also show significant improvements in learning speed and task transferability thanks to the multi‑agent learning framework and the abstraction provided by the first‑order language.

PaperID: 12025, https://arxiv.org/pdf/2501.00332.pdf

Authors: Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou

Title: MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

Abstract:
Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval‑Augmented Generation (RAG) addresses this issue by incorporating external, real‑time information retrieval to ground LLM responses. However, the existing RAG systems frequently struggle with the quality of retrieval documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi‑Agent Filtering Retrieval‑Augmented Generation (MAIN‑RAG), a training‑free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN‑RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter‑agent consensus to ensure robust document selection without requiring additional training data or fine‑tuning. Experimental results across four QA benchmarks demonstrate that MAIN‑RAG consistently outperforms traditional RAG approaches, achieving a 2‑11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training‑based solutions.

PaperID: 12026, https://arxiv.org/pdf/2501.00312.pdf

Authors: Chuxiong Sun, Peng He, Qirui Ji, Zehua Zang, Jiangmeng Li, Rui Wang, Wei Wang

Title: M2I2: Learning Efficient Multi-Agent Communication via Masked State Modeling and Intention Inference

Abstract:
Communication is essential in coordinating the behaviors of multiple agents. However, existing methods primarily emphasize content, timing, and partners for information sharing, often neglecting the critical aspect of integrating shared information. This gap can significantly impact agents' ability to understand and respond to complex, uncertain interactions, thus affecting overall communication efficiency. To address this issue, we introduce M2I2, a novel framework designed to enhance the agents' capabilities to assimilate and utilize received information effectively. M2I2 equips agents with advanced capabilities for masked state modeling and joint‑action prediction, enriching their perception of environmental uncertainties and facilitating the anticipation of teammates' intentions. This approach ensures that agents are furnished with both comprehensive and relevant information, bolstering more informed and synergistic behaviors. Moreover, we propose a Dimensional Rational Network, innovatively trained via a meta‑learning paradigm, to identify the importance of dimensional pieces of information, evaluating their contributions to decision‑making and auxiliary tasks. Then, we implement an importance‑based heuristic for selective information masking and sharing. This strategy optimizes the efficiency of masked state modeling and the rationale behind information sharing. We evaluate M2I2 across diverse multi‑agent tasks, the results demonstrate its superior performance, efficiency, and generalization capabilities, over existing state‑of‑the‑art methods in various complex scenarios.

PaperID: 12027, https://arxiv.org/pdf/2501.00226.pdf

Authors: Tadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura, Masahiro Suzuki, Akira Taniguchi

Title: Generative Emergent Communication: Large Language Model is a Collective World Model

Abstract:
Large Language Models (LLMs) have demonstrated a remarkable ability to capture extensive world knowledge, yet how this is achieved without direct sensorimotor experience remains a fundamental puzzle. This study proposes a novel theoretical solution by introducing the Collective World Model hypothesis. We argue that an LLM does not learn a world model from scratch; instead, it learns a statistical approximation of a collective world model that is already implicitly encoded in human language through a society‑wide process of embodied, interactive sense‑making. To formalize this process, we introduce generative emergent communication (Generative EmCom), a framework built on the Collective Predictive Coding (CPC). This framework models the emergence of language as a process of decentralized Bayesian inference over the internal states of multiple agents. We argue that this process effectively creates an encoder‑decoder structure at a societal scale: human society collectively encodes its grounded, internal representations into language, and an LLM subsequently decodes these symbols to reconstruct a latent space that mirrors the structure of the original collective representations. This perspective provides a principled, mathematical explanation for how LLMs acquire their capabilities. The main contributions of this paper are: 1) the formalization of the Generative EmCom framework, clarifying its connection to world models and multi‑agent reinforcement learning, and 2) its application to interpret LLMs, explaining phenomena such as distributional semantics as a natural consequence of representation reconstruction. This work provides a unified theory that bridges individual cognitive development, collective language evolution, and the foundations of large‑scale AI.

PaperID: 12028, https://arxiv.org/pdf/2501.00211.pdf

Authors: Noor Aboueleneen, Yahuza Bello, Abdullatif Albaseer, Ahmed Refaey Hussein, Mohamed Abdallah, Ekram Hossain

Title: Distributed Traffic Control in Complex Dynamic Roadblocks: A Multi-Agent Deep RL Approach

Abstract:
Autonomous Vehicles (AVs) represent a transformative advancement in the transportation industry. These vehicles have sophisticated sensors, advanced algorithms, and powerful computing systems that allow them to navigate and operate without direct human intervention. However, AVs' systems still get overwhelmed when they encounter a complex dynamic change in the environment resulting from an accident or a roadblock for maintenance. The advanced features of Sixth Generation (6G) technology are set to offer strong support to AVs, enabling real‑time data exchange and management of complex driving maneuvers. This paper proposes a Multi‑Agent Reinforcement Learning (MARL) framework to improve AVs' decision‑making in dynamic and complex Intelligent Transportation Systems (ITS) utilizing 6G‑V2X communication. The primary objective is to enable AVs to avoid roadblocks efficiently by changing lanes while maintaining optimal traffic flow and maximizing the mean harmonic speed. To ensure realistic operations, key constraints such as minimum vehicle speed, roadblock count, and lane change frequency are integrated. We train and test the proposed MARL model with two traffic simulation scenarios using the SUMO and TraCI interface. Through extensive simulations, we demonstrate that the proposed model adapts to various traffic conditions and achieves efficient and robust traffic flow management. The trained model effectively navigates dynamic roadblocks, promoting improved traffic efficiency in AV operations with more than 70% efficiency over other benchmark solutions.

PaperID: 12029, https://arxiv.org/pdf/2501.00165.pdf

Authors: Ben McClusky

Title: Dynamic Graph Communication for Decentralised Multi-Agent Reinforcement Learning

Abstract:
This work presents a novel communication framework for decentralized multi‑agent systems operating in dynamic network environments. Integrated into a multi‑agent reinforcement learning system, the framework is designed to enhance decision‑making by optimizing the network's collective knowledge through efficient communication. Key contributions include adapting a static network packet‑routing scenario to a dynamic setting with node failures, incorporating a graph attention network layer in a recurrent message‑passing framework, and introducing a multi‑round communication targeting mechanism. This approach enables an attention‑based aggregation mechanism to be successfully trained within a sparse‑reward, dynamic network packet‑routing environment using only reinforcement learning. Experimental results show improvements in routing performance, including a 9.5 percent increase in average rewards and a 6.4 percent reduction in communication overhead compared to a baseline system. The study also examines the ethical and legal implications of deploying such systems in critical infrastructure and military contexts, identifies current limitations, and suggests potential directions for future research.

PaperID: 12030, https://arxiv.org/pdf/2501.00160.pdf

Authors: David Goll, Jobst Heitzig, Wolfram Barfuss

Title: Deterministic Model of Incremental Multi-Agent Boltzmann Q-Learning: Transient Cooperation, Metastability, and Oscillations

Abstract:
Multi‑Agent Reinforcement Learning involves agents that learn together in a shared environment, leading to emergent dynamics sensitive to initial conditions and parameter variations. A Dynamical Systems approach, which studies the evolution of multi‑component systems over time, has uncovered some of the underlying dynamics by constructing deterministic approximation models of stochastic algorithms. In this work, we demonstrate that even in the simplest case of independent Q‑learning with a Boltzmann exploration policy, significant discrepancies arise between the actual algorithm and previous approximations. We elaborate why these models actually approximate interesting variants rather than the original incremental algorithm. To explain the discrepancies, we introduce a new discrete‑time approximation model that explicitly accounts for agents' update frequencies within the learning process and show that its dynamics fundamentally differ from the simplified dynamics of prior models. We illustrate the usefulness of our approach by applying it to the question of spontaneous cooperation in social dilemmas, specifically the Prisoner's Dilemma as the simplest case study. We identify conditions under which the learning behaviour appears as long‑term stable cooperation from an external perspective. However, our model shows that this behaviour is merely a metastable transient phase and not a true equilibrium, making it exploitable. We further exemplify how specific parameter settings can significantly exacerbate the moving target problem in independent learning. Through a systematic analysis of our model, we show that increasing the discount factor induces oscillations, preventing convergence to a joint policy. These oscillations arise from a supercritical Neimark‑Sacker bifurcation, which transforms the unique stable fixed point into an unstable focus surrounded by a stable limit cycle.

PaperID: 12031, https://arxiv.org/pdf/2501.00110.pdf

Authors: Andrea Giusti

Title: Modelling and Control of Spatial Behaviours in Multi-Agent Systems with Applications to Biology and Robotics

Abstract:
Large‑Scale Multi‑Agent Systems (LS‑MAS) consist of several autonomous components, interacting in a non‑trivial way, so that the emerging behaviour of the ensemble depends on the individual dynamics of the components and their reciprocal interactions. These models can describe a rich variety of natural systems, as well as artificial ones, characterised by unparalleled scalability, robustness, and flexibility. Indeed, a crucial objective is devising efficient strategies to model and control the spatial behaviours of LS‑MAS to achieve specific goals. However, the inherent complexity of these systems and the wide spectrum of their emerging behaviours pose significant challenges. The overarching goal of this thesis is, therefore, to advance methods for modelling, analyzing and controlling the spatial behaviours of LS‑MAS, with applications to cellular populations and swarm robotics. The thesis begins with an overview of the existing Literature, and is then organized into two distinct parts. In the context of swarm robotics, Part I deals with distributed control algorithms to spatially organize agents on geometric patterns. The contribution is twofold, encompassing both the development of original control algorithms, and providing a novel formal analysis, which allows to guarantee the emergence of specific geometric patterns. In Part II, looking at the spatial behaviours of biological agents, experiments are carried out to study the movement of microorganisms and their response to light stimuli. This allows the derivation and parametrization of mathematical models that capture these behaviours, and pave the way for the development of innovative approaches for the spatial control of microorganisms. The results presented in the thesis were developed by leveraging formal analytical tools, simulations, and experiments, using innovative platforms and original computational frameworks.

PaperID: 12032, https://arxiv.org/pdf/2501.00083.pdf

Authors: Yuan-Hao Jiang, Ruijia Li, Yizhou Zhou, Changyong Qi, Hanglei Hu, Yuang Wei, Bo Jiang, Yonghe Wu

Title: AI Agent for Education: von Neumann Multi-Agent System Framework

Abstract:
The development of large language models has ushered in new paradigms for education. This paper centers on the multi‑Agent system in education and proposes the von Neumann multi‑Agent system framework. It breaks down each AI Agent into four modules: control unit, logic unit, storage unit, and input‑output devices, defining four types of operations: task deconstruction, self‑reflection, memory processing, and tool invocation. Furthermore, it introduces related technologies such as Chain‑of‑Thought, Reson+Act, and Multi‑Agent Debate associated with these four types of operations. The paper also discusses the ability enhancement cycle of a multi‑Agent system for education, including the outer circulation for human learners to promote knowledge construction and the inner circulation for LLM‑based‑Agents to enhance swarm intelligence. Through collaboration and reflection, the multi‑Agent system can better facilitate human learners' learning and enhance their teaching abilities in this process.

PaperID: 12033, https://arxiv.org/pdf/2501.00055.pdf

Authors: Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen

Title: LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models

Abstract:
While safety‑aligned large language models (LLMs) are increasingly used as the cornerstone for powerful systems such as multi‑agent frameworks to solve complex real‑world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLM and make trade‑offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g. token‑level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM‑Virus, a jailbreak attack method based on evolutionary algorithm, termed evolutionary jailbreak. LLM‑Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM‑Virus achieves competitive or even superior performance compared to existing attack methods.

PaperID: 12034, https://arxiv.org/pdf/2501.00049.pdf

Authors: Lamya Benaddi, Charaf Ouaddi, Adnane Souha, Abdeslam Jakimi, Mohamed Rahouti, Mohammed Aledhari, Diogo Oliveira, Brahim Ouchao

Title: Seq2Seq Model-Based Chatbot with LSTM and Attention Mechanism for Enhanced User Interaction

Abstract:
A chatbot is an intelligent software application that automates conversations and engages users in natural language through messaging platforms. Leveraging artificial intelligence (AI), chatbots serve various functions, including customer service, information gathering, and casual conversation. Existing virtual assistant chatbots, such as ChatGPT and Gemini, demonstrate the potential of AI in Natural Language Processing (NLP). However, many current solutions rely on predefined APIs, which can result in vendor lock‑in and high costs. To address these challenges, this work proposes a chatbot developed using a Sequence‑to‑Sequence (Seq2Seq) model with an encoder‑decoder architecture that incorporates attention mechanisms and Long Short‑Term Memory (LSTM) cells. By avoiding predefined APIs, this approach ensures flexibility and cost‑effectiveness. The chatbot is trained, validated, and tested on a dataset specifically curated for the tourism sector in Draa‑Tafilalet, Morocco. Key evaluation findings indicate that the proposed Seq2Seq model‑based chatbot achieved high accuracies: approximately 99.58% in training, 98.03% in validation, and 94.12% in testing. These results demonstrate the chatbot's effectiveness in providing relevant and coherent responses within the tourism domain, highlighting the potential of specialized AI applications to enhance user experience and satisfaction in niche markets.

PaperID: 12035, https://arxiv.org/pdf/2412.21154.pdf

Authors: Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White

Title: Aviary: training language agents on challenging scientific tasks

Abstract:
Solving complex real‑world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non‑standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature‑sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language‑grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi‑step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference‑time compute, we show that language agents backed by open‑source, non‑frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.

PaperID: 12036, https://arxiv.org/pdf/2412.21102.pdf

Authors: KuanChao Chu, Yi-Pei Chen, Hideki Nakayama

Title: Exploring and Controlling Diversity in LLM-Agent Conversation

Abstract:
Controlling diversity in LLM‑agent simulations is essential for balancing stability in structured tasks with variability in open‑ended interactions. However, we observe that dialogue diversity tends to degrade over long‑term simulations. To explore the role of prompt design in this phenomenon, we modularized the utterance generation prompt and found that reducing contextual information leads to more diverse outputs. Based on this insight, we propose Adaptive Prompt Pruning (APP), a novel method that allows users to control diversity via a single parameter, lambda. APP dynamically prunes prompt segments based on attention scores and is compatible with existing diversity control methods. We demonstrate that APP effectively modulates diversity through extensive experiments and propose a method to balance the control trade‑offs. Our analysis reveals that all prompt components impose constraints on diversity, with the Memory being the most influential. Additionally, high‑attention contents consistently suppress output diversity.

PaperID: 12037, https://arxiv.org/pdf/2412.21088.pdf

Authors: Reza Azadeh

Title: Advances in Multi-agent Reinforcement Learning: Persistent Autonomy and Robot Learning Lab Report 2024

Abstract:
Multi‑Agent Reinforcement Learning (MARL) approaches have emerged as popular solutions to address the general challenges of cooperation in multi‑agent environments, where the success of achieving shared or individual goals critically depends on the coordination and collaboration between agents. However, existing cooperative MARL methods face several challenges intrinsic to multi‑agent systems, such as the curse of dimensionality, non‑stationarity, and the need for a global exploration strategy. Moreover, the presence of agents with constraints (e.g., limited battery life, restricted mobility) or distinct roles further exacerbates these challenges. This document provides an overview of recent advances in Multi‑Agent Reinforcement Learning (MARL) conducted at the Persistent Autonomy and Robot Learning (PeARL) lab at the University of Massachusetts Lowell. We briefly discuss various research directions and present a selection of approaches proposed in our most recent publications. For each proposed approach, we also highlight potential future directions to further advance the field.

PaperID: 12038, https://arxiv.org/pdf/2412.21069.pdf

Authors: Wenhao Zhuang, Yuyi Mao

Title: Privacy-Aware Multi-Device Cooperative Edge Inference with Distributed Resource Bidding

Abstract:
Mobile edge computing (MEC) has empowered mobile devices (MDs) in supporting artificial intelligence (AI) applications through collaborative efforts with proximal MEC servers. Unfortunately, despite the great promise of device‑edge cooperative AI inference, data privacy becomes an increasing concern. In this paper, we develop a privacy‑aware multi‑device cooperative edge inference system for classification tasks, which integrates a distributed bidding mechanism for the MEC server's computational resources. Intermediate feature compression is adopted as a principled approach to minimize data privacy leakage. To determine the bidding values and feature compression ratios in a distributed fashion, we formulate a decentralized partially observable Markov decision process (DEC‑POMDP) model, for which, a multi‑agent deep deterministic policy gradient (MADDPG)‑based algorithm is developed. Simulation results demonstrate the effectiveness of the proposed algorithm in privacy‑preserving cooperative edge inference. Specifically, given a sufficient level of data privacy protection, the proposed algorithm achieves 0.31‑0.95% improvements in classification accuracy compared to the approach being agnostic to the wireless channel conditions. The performance is further enhanced by 1.54‑1.67% by considering the difficulties of inference data.

PaperID: 12039, https://arxiv.org/pdf/2412.21033.pdf

Authors: Gautier Dagan, Frank Keller, Alex Lascarides

Title: Plancraft: an evaluation dataset for planning with LLM agents

Abstract:
We present Plancraft, a multi‑modal evaluation dataset for LLM agents. Plancraft has both a text‑only and multi‑modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as a handcrafted planner and Oracle Retriever, to ablate the different components of a modern agent architecture. To evaluate decision‑making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open‑source and closed‑source LLMs and compare their performance and efficiency to a handcrafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and offer suggestions on how to improve their capabilities.

PaperID: 12040, https://arxiv.org/pdf/2412.20977.pdf

Authors: Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, Yizhou Wang

Title: UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI

Abstract:
We introduce UnrealZoo, a collection of over 100 photo‑realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open‑world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These improvements achieve significant improvements in the efficiency of rendering and communication, enabling advanced applications such as multi‑agent interactions. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open‑world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in the close‑loop control systems for interacting in highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real‑world deployment.

PaperID: 12041, https://arxiv.org/pdf/2412.20699.pdf

Authors: Ying Zhang, Haibao Yan, Danni Zhu, Jiankun Wang, Cui-Hua Zhang, Weili Ding, Xi Luo, Changchun Hua, Max Q. -H. Meng

Title: Air-Ground Collaborative Robots for Fire and Rescue Missions: Towards Mapping and Navigation Perspective

Abstract:
Air‑ground collaborative robots have shown great potential in the field of fire and rescue, which can quickly respond to rescue needs and improve the efficiency of task execution. Mapping and navigation, as the key foundation for air‑ground collaborative robots to achieve efficient task execution, have attracted a great deal of attention. This growing interest in collaborative robot mapping and navigation is conducive to improving the intelligence of fire and rescue task execution, but there has been no comprehensive investigation of this field to highlight their strengths. In this paper, we present a systematic review of the ground‑to‑ground cooperative robots for fire and rescue from a new perspective of mapping and navigation. First, an air‑ground collaborative robots framework for fire and rescue missions based on unmanned aerial vehicle (UAV) mapping and unmanned ground vehicle (UGV) navigation is introduced. Then, the research progress of mapping and navigation under this framework is systematically summarized, including UAV mapping, UAV/UGV co‑localization, and UGV navigation, with their main achievements and limitations. Based on the needs of fire and rescue missions, the collaborative robots with different numbers of UAVs and UGVs are classified, and their practicality in fire and rescue tasks is elaborated, with a focus on the discussion of their merits and demerits. In addition, the application examples of air‑ground collaborative robots in various firefighting and rescue scenarios are given. Finally, this paper emphasizes the current challenges and potential research opportunities, rounding up references for practitioners and researchers willing to engage in this vibrant area of air‑ground collaborative robots.

PaperID: 12042, https://arxiv.org/pdf/2412.20523.pdf

Authors: Neil De La Fuente, Miquel Noguer i Alonso, Guim Casadellà

Title: Game Theory and Multi-Agent Reinforcement Learning : From Nash Equilibria to Evolutionary Dynamics

Abstract:
This paper explores advanced topics in complex multi‑agent systems building upon our previous work. We examine four fundamental challenges in Multi‑Agent Reinforcement Learning (MARL): non‑stationarity, partial observability, scalability with large agent populations, and decentralized learning. The paper provides mathematical formulations and analysis of recent algorithmic advancements designed to address these challenges, with a particular focus on their integration with game‑theoretic concepts. We investigate how Nash equilibria, evolutionary game theory, correlated equilibrium, and adversarial dynamics can be effectively incorporated into MARL algorithms to improve learning outcomes. Through this comprehensive analysis, we demonstrate how the synthesis of game theory and MARL can enhance the robustness and effectiveness of multi‑agent systems in complex, dynamic environments.

PaperID: 12043, https://arxiv.org/pdf/2412.20505.pdf

Authors: Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

Title: LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

Abstract:
Participatory Urban Planning (PUP) is increasingly supported by LLM‑based agents, yet existing methods largely rely on static preference elicitation and one‑shot stakeholder discussions, overlooking the cyclical nature of real‑world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living‑in‑the‑loop Participatory Urban Planning (LiPUP), a closed‑loop paradigm that alternates between simulated residential living and experience‑driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP‑MA, an LLM‑based multi‑agent framework that constructs a Plan‑centric Graph‑based Experience Bank to organize urban‑grounded residential feedback from living simulation and equips a Spatially‑constrained Skill‑augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP‑MA consistently outperforms baselines on both conventional static planning metrics and living‑based metrics, while iterative LiPUP cycles further improve plan quality.

PaperID: 12044, https://arxiv.org/pdf/2412.20484.pdf

Authors: Songhan Zhao, Shimin Gong, Bo Gu, Lanhua Li, Bin Lyu, Dinh Thai Hoang, Changyan Yi

Title: Exploiting NOMA Transmissions in Multi-UAV-assisted Wireless Networks: From Aerial-RIS to Mode-switching UAVs

Abstract:
In this paper, we consider an aerial reconfigurable intelligent surface (ARIS)‑assisted wireless network, where multiple unmanned aerial vehicles (UAVs) collect data from ground users (GUs) by using the non‑orthogonal multiple access (NOMA) method. The ARIS provides enhanced channel controllability to improve the NOMA transmissions and reduce the co‑channel interference among UAVs. We also propose a novel dual‑mode switching scheme, where each UAV equipped with both an ARIS and a radio frequency (RF) transceiver can adaptively perform passive reflection or active transmission. We aim to maximize the overall network throughput by jointly optimizing the UAVs' trajectory planning and operating modes, the ARIS's passive beamforming, and the GUs' transmission control strategies. We propose an optimization‑driven hierarchical deep reinforcement learning (O‑HDRL) method to decompose it into a series of subproblems. Specifically, the multi‑agent deep deterministic policy gradient (MADDPG) adjusts the UAVs' trajectory planning and mode switching strategies, while the passive beamforming and transmission control strategies are tackled by the optimization methods. Numerical results reveal that the O‑HDRL efficiently improves the learning stability and reward performance compared to the benchmark methods. Meanwhile, the dual‑mode switching scheme is verified to achieve a higher throughput performance compared to the fixed ARIS scheme.

PaperID: 12045, https://arxiv.org/pdf/2412.20475.pdf

Authors: Sheng Cen, Qiying Pan, Yifei Zhu, Bo Li

Title: SatFlow: Scalable Network Planning for LEO Mega-Constellations

Abstract:
Low‑earth‑orbit (LEO) satellite communication networks have evolved into mega‑constellations with hundreds to thousands of satellites inter‑connecting with inter‑satellite links (ISLs). Network planning, which plans for network resources and architecture to improve the network performance and save operational costs, is crucial for satellite network management. However, due to the large scale of mega‑constellations, high dynamics of satellites, and complex distribution of real‑world traffic, it is extremely challenging to conduct scalable network planning on mega‑constellations with high performance. In this paper, we propose SatFlow, a distributed and hierarchical network planning framework to plan for the network topology, traffic allocation, and fine‑grained ISL terminal power allocation for mega‑constellations. To tackle the hardness of the original problem, we decompose the grand problem into two hierarchical sub‑problems, tackled by two‑tier modules. A multi‑agent reinforcement learning approach is proposed for the upper‑level module so that the overall laser energy consumption and ISL operational costs can be minimized; A distributed alternating step algorithm is proposed for the lower‑level module so that the laser energy consumption could be minimized with low time complexity for a given topology. Extensive simulations on various mega‑constellations validate SatFlow's scalability on the constellation size, reducing the flow violation ratio by up to 21.0% and reducing the total costs by up to 89.4%, compared with various state‑of‑the‑art benchmarks.

PaperID: 12046, https://arxiv.org/pdf/2412.20368.pdf

Authors: Jun Xie, Zhicheng Wang, Jianwei Tan, Huanxu Lin, Xiaoguang Ma

Title: Subconscious Robotic Imitation Learning

Abstract:
Although robotic imitation learning (RIL) is promising for embodied intelligent robots, existing RIL approaches rely on computationally intensive multi‑model trajectory predictions, resulting in slow execution and limited real‑time responsiveness. Instead, human beings subconscious can constantly process and store vast amounts of information from their experiences, perceptions, and learning, allowing them to fulfill complex actions such as riding a bike, without consciously thinking about each. Inspired by this phenomenon in action neurology, we introduced subconscious robotic imitation learning (SRIL), wherein cognitive offloading was combined with historical action chunkings to reduce delays caused by model inferences, thereby accelerating task execution. This process was further enhanced by subconscious downsampling and pattern augmented learning policy wherein intent‑rich information was addressed with quantized sampling techniques to improve manipulation efficiency. Experimental results demonstrated that execution speeds of the SRIL were 100% to 200% faster over SOTA policies for comprehensive dual‑arm tasks, with consistently higher success rates.

PaperID: 12047, https://arxiv.org/pdf/2412.20354.pdf

Authors: Seyyed Shaho Alaviani, Atul Kelkar

Title: Distributed Convex Optimization with State-Dependent (Social) Interactions over Random Networks

Abstract:
This paper aims at distributed multi‑agent convex optimization where the communications network among the agents are presented by a random sequence of possibly state‑dependent weighted graphs. This is the first work to consider both random arbitrary communication networks and state‑dependent interactions among agents. The state‑dependent weighted random operator of the graph is shown to be quasi‑nonexpansive; this property neglects a priori distribution assumption of random communication topologies to be imposed on the operator. Therefore, it contains more general class of random networks with or without asynchronous protocols. A more general mathematical optimization problem than that addressed in the literature is presented, namely minimization of a convex function over the fixed‑value point set of a quasi‑nonexpansive random operator. A discrete‑time algorithm is provided that is able to converge both almost surely and in mean square to the global solution of the optimization problem. Hence, as a special case, it reduces to a totally asynchronous algorithm for the distributed optimization problem. The algorithm is able to converge even if the weighted matrix of the graph is periodic and irreducible under synchronous protocol. Finally, a case study on a network of robots in an automated warehouse is given where there is distribution dependency among random communication graphs.

PaperID: 12048, https://arxiv.org/pdf/2412.20297.pdf

Authors: Jia Liu, Min Chen

Title: FaGeL: Fabric LLMs Agent empowered Embodied Intelligence Evolution with Autonomous Human-Machine Collaboration

Abstract:
Recent advancements in Large Language Models (LLMs) have enhanced the reasoning capabilities of embodied agents, driving progress toward AGI‑powered robotics. While LLMs have been applied to tasks like semantic reasoning and task generalization, their potential in open physical space exploration remains underexplored. This paper introduces FaGeL (Fabric aGent empowered by embodied intelligence with LLMs), an embodied agent integrating smart fabric technology for seamless, non‑intrusive human‑agent interaction. FaGeL autonomously generates tasks using multimodal data from wearable and ambient sensors, refining its behavior based on implicit human feedback in generated text, without explicit ratings or preferences. We also introduce a token‑level saliency map to visualize LLM fine‑tuning, enhancing the interpretability of token‑level alignment. The system leverages dual feedback mechanisms to improve token‑level alignment and addresses challenges in non‑intrusive human‑machine interaction and cognition evolution. Our contributions include FaGeL's development, the DualCUT algorithm for AI alignment, and experimental validation in cooperative tasks, demonstrating FaGeL's ability to adapt and evolve autonomously through implicit feedback. In the future, we plan to explore FaGeL's scalability in dynamic environments and its integration with other AI systems to develop AGI agents that adapt seamlessly to diverse human needs.

PaperID: 12049, https://arxiv.org/pdf/2412.20233.pdf

Authors: Stepan Dergachev, Konstantin Yakovlev

Title: Decentralized Unlabeled Multi-Agent Navigation in Continuous Space

Abstract:
In this work, we study the problem where a group of mobile agents needs to reach a set of goal locations, but it does not matter which agent reaches a specific goal. Unlike most of the existing works on this topic that typically assume the existence of the centralized planner (or controller) and limit the agents' moves to a predefined graph of locations and transitions between them, in this work we focus on the decentralized scenarios, when each agent acts individually relying only on local observations/communications and is free to move in arbitrary direction at any time. Our iterative approach involves agents individually selecting goals, exchanging them, planning paths, and at each time step choose actions that balance between progressing along the paths and avoiding collisions. The proposed method is shown to be complete under specific assumptions on how agents progress towards their current goals, and our empirical evaluation demonstrates its superiority over a baseline decentralized navigation approach in success rate (i.e. is able to solve more problem instances under a given time limit) and a comparison with the centralized TSWAP algorithm reveals its efficiency in minimizing trajectory lengths for mission accomplishment.

PaperID: 12050, https://arxiv.org/pdf/2412.20203.pdf

Authors: Davide Legacci, Panayotis Mertikopoulos, Christos H. Papadimitriou, Georgios Piliouras, Bary S. R. Pradelski

Title: No-regret learning in harmonic games: Extrapolation in the face of conflicting interests

Abstract:
The long‑run behavior of multi‑agent learning ‑ and, in particular, no‑regret learning ‑ is relatively well‑understood in potential games, where players have aligned interests. By contrast, in harmonic games ‑ the strategic counterpart of potential games, where players have conflicting interests ‑ very little is known outside the narrow subclass of 2‑player zero‑sum games with a fully‑mixed equilibrium. Our paper seeks to partially fill this gap by focusing on the full class of (generalized) harmonic games and examining the convergence properties of follow‑the‑regularized‑leader (FTRL), the most widely studied class of no‑regret learning schemes. As a first result, we show that the continuous‑time dynamics of FTRL are Poincaré recurrent, that is, they return arbitrarily close to their starting point infinitely often, and hence fail to converge. In discrete time, the standard, "vanilla" implementation of FTRL may lead to even worse outcomes, eventually trapping the players in a perpetual cycle of best‑responses. However, if FTRL is augmented with a suitable extrapolation step ‑ which includes as special cases the optimistic and mirror‑prox variants of FTRL ‑ we show that learning converges to a Nash equilibrium from any initial condition, and all players are guaranteed at most O(1) regret. These results provide an in‑depth understanding of no‑regret learning in harmonic games, nesting prior work on 2‑player zero‑sum games, and showing at a high level that harmonic games are the canonical complement of potential games, not only from a strategic, but also from a dynamic viewpoint.

PaperID: 12051, https://arxiv.org/pdf/2412.20154.pdf

Authors: Xinru Wen, Jinbo Wen, Ming Xiao, Jiawen Kang, Tao Zhang, Xiaohuan Li, Chuanxi Chen, Dusit Niyato

Title: Defending Against Network Attacks for Secure AI Agent Migration in Vehicular Metaverses

Abstract:
Vehicular metaverses, blending traditional vehicular networks with metaverse technology, are expected to revolutionize fields such as autonomous driving. As virtual intelligent assistants in vehicular metaverses, Artificial Intelligence (AI) agents powered by large language models can create immersive 3D virtual spaces for passengers to enjoy on‑broad vehicular applications and services. To provide users with seamless and engaging virtual interactions, resource‑limited vehicles offload AI agents to RoadSide Units (RSUs) with adequate communication and computational capabilities. Due to the mobility of vehicles and the limited coverage of RSUs, AI agents need to migrate from one RSU to another RSU. However, potential network attacks pose significant challenges to ensuring reliable and efficient AI agent migration. In this paper, we first explore specific network attacks including traffic‑based attacks (i.e., DDoS attacks) and infrastructure‑based attacks (i.e., malicious RSU attacks). Then, we model the AI agent migration process as a Partially Observable Markov Decision Process (POMDP) and apply multi‑agent proximal policy optimization algorithms to mitigate DDoS attacks. In addition, we propose a trust assessment mechanism to counter malicious RSU attacks. Numerical results validate that the proposed solutions effectively defend against these network attacks and reduce the total latency of AI agent migration by approximately 43.3%.

PaperID: 12052, https://arxiv.org/pdf/2412.20145.pdf

Authors: Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, Heike Adel

Title: Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

Abstract:
Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi‑step or multi‑category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed‑source large language models (LLMs) or fine‑tuned open‑weight LLMs. However, fine‑tuning LLMs requires high‑quality training data, which is costly to obtain, and utilizing closed‑source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi‑Agent Collaboration with Tool use (MACT), a framework that requires neither closed‑source models nor fine‑tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed‑source model GPT‑4 on two benchmarks, even when using only open‑weight models without any fine‑tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi‑agent collaboration in TQA.

PaperID: 12053, https://arxiv.org/pdf/2412.20116.pdf

Authors: Benedikt Valentin Meylahn

Title: Multi-agent reinforcement learning in the all-or-nothing public goods game on networks

Abstract:
We study interpersonal trust by means of the all‑or‑nothing public goods game between agents on a network. The agents are endowed with the simple yet adaptive learning rule, exponential moving average, by which they estimate the behavior of their neighbors in the network. Theoretically we show that in the long‑time limit this multi‑agent reinforcement learning process always eventually results in indefinite contribution to the public good or indefinite defection (no agent contributing to the public good). However, by simulation of the pre‑limit behavior, we see that on complex network structures there may be mixed states in which the process seems to stabilize before actual convergence to states in which agent beliefs and actions are all the same. In these metastable states the local network characteristics can determine whether agents have high or low trust in their neighbors. More generally it is found that more dense networks result in lower rates of contribution to the public good. This has implications for how one can spread global contribution toward a public good by enabling smaller local interactions.

PaperID: 12054, https://arxiv.org/pdf/2412.20075.pdf

Authors: Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos

Title: Investigating the Impact of Communication-Induced Action Space on Exploration of Unknown Environments with Decentralized Multi-Agent Reinforcement Learning

Abstract:
This paper introduces a novel enhancement to the Decentralized Multi‑Agent Reinforcement Learning (D‑MARL) exploration by proposing communication‑induced action space to improve the mapping efficiency of unknown environments using homogeneous agents. Efficient exploration of large environments relies heavily on inter‑agent communication as real‑world scenarios are often constrained by data transmission limits, such as signal latency and bandwidth. Our proposed method optimizes each agent's policy using the heterogeneous‑agent proximal policy optimization algorithm, allowing agents to autonomously decide whether to communicate or to explore, that is whether to share the locally collected maps or continue the exploration. We propose and compare multiple novel reward functions that integrate inter‑agent communication and exploration, enhance mapping efficiency and robustness, and minimize exploration overlap. This article presents a framework developed in ROS2 to evaluate and validate the investigated architecture. Specifically, four TurtleBot3 Burgers have been deployed in a Gazebo‑designed environment filled with obstacles to evaluate the efficacy of the trained policies in mapping the exploration arena.

PaperID: 12055, https://arxiv.org/pdf/2412.20049.pdf

Authors: Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos

Title: Reinforcement Learning Driven Multi-Robot Exploration via Explicit Communication and Density-Based Frontier Search

Abstract:
Collaborative multi‑agent exploration of unknown environments is crucial for search and rescue operations. Effective real‑world deployment must address challenges such as limited inter‑agent communication and static and dynamic obstacles. This paper introduces a novel decentralized collaborative framework based on Reinforcement Learning to enhance multi‑agent exploration in unknown environments. Our approach enables agents to decide their next action using an agent‑centered field‑of‑view occupancy grid, and features extracted from \textA^ algorithm‑based trajectories to frontiers in the reconstructed global map. Furthermore, we propose a constrained communication scheme that enables agents to share their environmental knowledge efficiently, minimizing exploration redundancy. The decentralized nature of our framework ensures that each agent operates autonomously, while contributing to a collective exploration mission. Extensive simulations in Gymnasium and real‑world experiments demonstrate the robustness and effectiveness of our system, while all the results highlight the benefits of combining autonomous exploration with inter‑agent map sharing, advancing the development of scalable and resilient robotic exploration systems.

PaperID: 12056, https://arxiv.org/pdf/2412.19976.pdf

Authors: Ekaterina Novozhilova, Jiacheng Huang, Le He, Ziling Li, James Cummings

Title: Will you donate money to a chatbot? The effect of chatbot anthropomorphic features and persuasion strategies on willingness to donate

Abstract:
This work investigates the causal mechanism behind the effect of chatbot personification and persuasion strategies on users' perceptions and donation likelihood. In a 2 (personified vs. non‑personified chatbot) x 2 (emotional vs. logical persuasion strategy) between‑subjects experiment (N=76), participants engaged with a chatbot that represented a non‑profit charitable organization. The results suggest that interaction with a personified chatbot evokes perceived anthropomorphism; however, it does not elicit greater willingness to donate. In fact, we found that commonly used anthropomorphic features, like name and narrative, led to negative attitudes toward an AI agent in the donation context. Our results showcase a preference for non‑personified chatbots paired with logical persuasion appeal, emphasizing the significance of consistency in chatbot interaction, mirroring human‑human engagement. We discuss the importance of moving from exploring the common scenario of a chatbot with machine identity vs. a chatbot with human identity in light of the recent regulations of AI systems.

PaperID: 12057, https://arxiv.org/pdf/2412.19933.pdf

Authors: Biaoshuai Tao, Chengkai Zhang, Houyu Zhou

Title: The Degree of (Extended) Justified Representation and Its Optimization

Abstract:
Justified Representation (JR)/Extended Justified Representation (EJR) is a desirable axiom in multiwinner approval voting. In contrast to that (E)JR only requires at least \emphone voter to be represented in every cohesive group, we study its optimization version that maximizes the \emphnumber of represented voters in each group. Given an instance, we say a winning committee provides a JR degree (EJR degree, resp.) of c if at least c voters in each \ell‑cohesive group (1‑cohesive group, resp.) have approved \ell (1, resp.) winning candidates. Hence, every (E)JR committee provides the (E)JR degree of at least 1. Besides proposing this new property, we propose the optimization problem of finding a winning committee that achieves the maximum possible (E)JR degree, called \MDJR and \MDEJR, corresponding to JR and EJR respectively. We study the computational complexity and approximability of \MDJR of \MDEJR. An (E)JR committee, which can be found in polynomial time, straightforwardly gives a (k/n)‑approximation. We also show that the original algorithms for finding a JR and an EJR winner committee are also 1/k and 1/(k+1) approximation algorithms for \MDJR and \MDEJR respectively. On the other hand, we show that it is NP‑hard to approximate \MDJR and \MDEJR to within a factor of \left(k/n\right)^1‑ε and to within a factor of (1/k)^1‑\varepsilon, for any ε>0, which complements the positive results. Next, we study the fixed‑parameter‑tractability of this problem. We show that both problems are W[2]‑hard if k, the size of the winning committee, is specified as the parameter. However, when c_\textmax, the maximum value of c such that a committee that provides an (E)JR degree of c exists, is additionally given as a parameter, we show that both \MDJR and \MDEJR are fixed‑parameter‑tractable.

PaperID: 12058, https://arxiv.org/pdf/2412.19873.pdf

Authors: Yuchen Jiao, Gen Li

Title: Minimax-Optimal Multi-Agent Robust Reinforcement Learning

Abstract:
Multi‑agent robust reinforcement learning, also known as multi‑player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi‑agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information‑theoretic lower bound. To close this gap, we extend the Q‑FTRL algorithm \citepli2022minimax to the RMGs in finite‑horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an \varepsilon‑robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of \widetildeO\left(H^3S\sum_i=1^mA_i\min\left\H,1/R\right\/\varepsilon^2\right), where S denotes the number of states, A_i is the number of actions of the i‑th agent, H is the finite horizon length, and R is uncertainty level. We also show that this sample compelxity is minimax optimal by combining an information‑theoretic lower bound. Additionally, in the special case of two‑player zero‑sum RMGs, the algorithm achieves an \varepsilon‑robust Nash equilibrium (NE) with the same sample complexity.

PaperID: 12059, https://arxiv.org/pdf/2412.19835.pdf

Authors: Alireza Alizadeh, Byungju Lim, Mai Vu

Title: Multi-Agent Q-Learning for Real-Time Load Balancing User Association and Handover in Mobile Networks

Abstract:
As next generation cellular networks become denser, associating users with the optimal base stations at each time while ensuring no base station is overloaded becomes critical for achieving stable and high network performance. We propose multi‑agent online Q‑learning (QL) algorithms for performing real‑time load balancing user association and handover in dense cellular networks. The load balancing constraints at all base stations couple the actions of user agents, and we propose two multi‑agent action selection policies, one centralized and one distributed, to satisfy load balancing at every learning step. In the centralized policy, the actions of UEs are determined by a central load balancer (CLB) running an algorithm based on swapping the worst connection to maximize the total learning reward. In the distributed policy, each UE takes an action based on its local information by participating in a distributed matching game with the BSs to maximize the local reward. We then integrate these action selection policies into an online QL algorithm that adapts in real‑time to network dynamics including channel variations and user mobility, using a reward function that considers a handover cost to reduce handover frequency. The proposed multi‑agent QL algorithm features low‑complexity and fast convergence, outperforming 3GPP max‑SINR association. Both policies adapt well to network dynamics at various UE speed profiles from walking, running, to biking and suburban driving, illustrating their robustness and real‑time adaptability.

PaperID: 12060, https://arxiv.org/pdf/2412.19609.pdf

Authors: Guy Avni, Martin Kurečka, Kaushik Mallik, Petr Novotný, Suman Sadhukhan

Title: Bidding Games on Markov Decision Processes with Quantitative Reachability Objectives

Abstract:
Graph games are fundamental in strategic reasoning of multi‑agent systems and their environments. We study a new family of graph games which combine stochastic environmental uncertainties and auction‑based interactions among the agents, formalized as bidding games on (finite) Markov decision processes (MDP). Normally, on MDPs, a single decision‑maker chooses a sequence of actions, producing a probability distribution over infinite paths. In bidding games on MDPs, two players ‑‑ called the reachability and safety players ‑‑ bid for the privilege of choosing the next action at each step. The reachability player's goal is to maximize the probability of reaching a target vertex, whereas the safety player's goal is to minimize it. These games generalize traditional bidding games on graphs, and the existing analysis techniques do not extend. For instance, the central property of traditional bidding games is the existence of a threshold budget, which is a necessary and sufficient budget to guarantee winning for the reachability player. For MDPs, the threshold becomes a relation between the budgets and probabilities of reaching the target. We devise value‑iteration algorithms that approximate thresholds and optimal policies for general MDPs, and compute the exact solutions for acyclic MDPs, and show that finding thresholds is at least as hard as solving simple‑stochastic games.

PaperID: 12061, https://arxiv.org/pdf/2412.19538.pdf

Authors: Xuan Zhou, Xiang Shi, Lele Zhang, Chen Chen, Hongbo Li, Lin Ma, Fang Deng, Jie Chen

Title: Scalable Hierarchical Reinforcement Learning for Hyper Scale Multi-Robot Task Planning

Abstract:
To improve the efficiency of warehousing system and meet huge customer orders, we aim to solve the challenges of dimension disaster and dynamic properties in hyper scale multi‑robot task planning (MRTP) for robotic mobile fulfillment system (RMFS). Existing research indicates that hierarchical reinforcement learning (HRL) is an effective method to reduce these challenges. Based on that, we construct an efficient multi‑stage HRL‑based multi‑robot task planner for hyper scale MRTP in RMFS, and the planning process is represented with a special temporal graph topology. To ensure optimality, the planner is designed with a centralized architecture, but it also brings the challenges of scaling up and generalization that require policies to maintain performance for various unlearned scales and maps. To tackle these difficulties, we first construct a hierarchical temporal attention network (HTAN) to ensure basic ability of handling inputs with unfixed lengths, and then design multi‑stage curricula for hierarchical policy learning to further improve the scaling up and generalization ability while avoiding catastrophic forgetting. Additionally, we notice that policies with hierarchical structure suffer from unfair credit assignment that is similar to that in multi‑agent reinforcement learning, inspired of which, we propose a hierarchical reinforcement learning algorithm with counterfactual rollout baseline to improve learning performance. Experimental results demonstrate that our planner outperform other state‑of‑the‑art methods on various MRTP instances in both simulated and real‑world RMFS. Also, our planner can successfully scale up to hyper scale MRTP instances in RMFS with up to 200 robots and 1000 retrieval racks on unlearned maps while keeping superior performance over other methods.

PaperID: 12062, https://arxiv.org/pdf/2412.19469.pdf

Authors: Ted Edward Holmberg, Elias Ioup, Mahdi Abdelguerfi

Title: Knowledge Graph-Based Multi-Agent Path Planning in Dynamic Environments using WAITR

Abstract:
This paper addresses the challenge of multi‑agent path planning for efficient data collection in dynamic, uncertain environments, exemplified by autonomous underwater vehicles (AUVs) navigating the Gulf of Mexico. Traditional greedy algorithms, though computationally efficient, often fall short in long‑term planning due to their short‑sighted nature, missing crucial data collection opportunities and increasing exposure to hazards. To address these limitations, we introduce WAITR (Weighted Aggregate Inter‑Temporal Reward), a novel path‑planning framework that integrates a knowledge graph with pathlet‑based planning, segmenting the environment into dynamic, speed‑adjusted sub‑regions (pathlets). This structure enables coordinated, adaptive planning, as agents can operate within time‑bound regions while dynamically responding to environmental changes. WAITR's cumulative scoring mechanism balances immediate data collection with long‑term optimization of Points of Interest (POIs), ensuring safer navigation and comprehensive data coverage. Experimental results show that WAITR substantially improves POI coverage and reduces exposure to hazards, achieving up to 27.1% greater event coverage than traditional greedy methods.

PaperID: 12063, https://arxiv.org/pdf/2412.19399.pdf

Authors: Hang Xu, Kaihong Lu, Yu-Long Wang, Qixin Zhu

Title: Online distributed algorithms for mixed equilibrium problems in dynamic environments

Abstract:
In this paper, the mixed equilibrium problem with coupled inequality constraints in dynamic environments is solved by employing a multi‑agent system, where each agent only has access to its own bifunction, its own constraint function, and can only communicate with its immediate neighbors via a time‑varying digraph. At each time, the goal of agents is to cooperatively find a point in the constraint set such that the sum of local bifunctions with a free variable is non‑negative. Different from existing works, here the bifunctions and the constraint functions are time‑varying and only available to agents after decisions are made. To tackle this problem, first, an online distributed algorithm involving accurate gradient information is proposed based on mirror descent algorithms and primal‑dual strategies. Of particular interest is that dynamic regrets, whose offline benchmarks are to find the solution at each time, are employed to measure the performance of the algorithm. Under mild assumptions on the graph and the bifunctions, we prove that if the deviation in the solution sequence grows within a certain rate, then both the dynamic regret and the violation of coupled inequality constraints increase sublinearly. Second, considering the case where each agent only has access to a noisy estimate on the accurate gradient, we propose an online distributed algorithm involving the stochastic gradients. The result shows that under the same conditions as in the first case, if the noise distribution satisfies the sub‑Gaussian condition, then dynamic regrets, as well as constraint violations, increase sublinearly with high probability. Finally, several simulation examples are presented to corroborate the validity of our results.

PaperID: 12064, https://arxiv.org/pdf/2412.19256.pdf

Authors: Haowei Yang

Title: Swarm Contract: A Multi-Sovereign Agent Consensus Mechanism

Abstract:
Traditional smart contracts on blockchains excel at on‑chain, deterministic logic. However, they have inherent limitations when dealing with large‑scale off‑chain data, dynamic multi‑step workflows, and scenarios requiring high flexibility or iterative updates. In this paper, we propose the concept of a "Swarm Contract" (Swarm), a multi‑agent mechanism wherein several digital life forms (DLF) or Sovereign Agents (SA) collectively handle complex tasks in Trusted Execution Environments (TEE). These digital entities are defined as autonomous software agents that own their code, state, and possibly on‑chain assets, while operating free from centralized control. By leveraging a simple multi‑signature wallet on‑chain, Swarm moves most of the logic off‑chain, achieving trust minimization through multi‑agent consensus rather than a single monolithic on‑chain contract. We illustrate these ideas with a lightweight off‑chain auction example ‑ minting and selling 10,000 identical NFTs ‑ to showcase how off‑chain coordination can determine a clearing price and finalize distribution, with each step performed collectively by multiple agents in TEE. This approach broadens the scope of trustless and decentralized solutions, potentially benefiting DAO governance, multi‑modal data processing, and cross‑chain interoperability.

PaperID: 12065, https://arxiv.org/pdf/2412.19206.pdf

Authors: Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

Title: NADER: Neural Architecture Design via Multi-Agent Collaboration

Abstract:
Designing effective neural architectures poses a significant challenge in deep learning. While Neural Architecture Search (NAS) automates the search for optimal architectures, existing methods are often constrained by predetermined search spaces and may miss critical neural architectures. In this paper, we introduce NADER (Neural Architecture Design via multi‑agEnt collaboRation), a novel framework that formulates neural architecture design (NAD) as a LLM‑based multi‑agent collaboration problem. NADER employs a team of specialized agents to enhance a base architecture through iterative modification. Current LLM‑based NAD methods typically operate independently, lacking the ability to learn from past experiences, which results in repeated mistakes and inefficient exploration. To address this issue, we propose the Reflector, which effectively learns from immediate feedback and long‑term experiences. Additionally, unlike previous LLM‑based methods that use code to represent neural architectures, we utilize a graph‑based representation. This approach allows agents to focus on design aspects without being distracted by coding. We demonstrate the effectiveness of NADER in discovering high‑performing architectures beyond predetermined search spaces through extensive experiments on benchmark tasks, showcasing its advantages over state‑of‑the‑art methods. The codes will be released soon.

PaperID: 12066, https://arxiv.org/pdf/2412.19064.pdf

Authors: Shenghong He, Chao Yu

Title: Hierarchical Multi-agent Meta-Reinforcement Learning for Cross-channel Bidding

Abstract:
Real‑time bidding (RTB) plays a pivotal role in online advertising ecosystems. Advertisers employ strategic bidding to optimize their advertising impact while adhering to various financial constraints, such as the return‑on‑investment (ROI) and cost‑per‑click (CPC). Primarily focusing on bidding with fixed budget constraints, traditional approaches cannot effectively manage the dynamic budget allocation problem where the goal is to achieve global optimization of bidding performance across multiple channels with a shared budget. In this paper, we propose a hierarchical multi‑agent reinforcement learning framework for multi‑channel bidding optimization. In this framework, the top‑level strategy applies a CPC constrained diffusion model to dynamically allocate budgets among the channels according to their distinct features and complex interdependencies, while the bottom‑level strategy adopts a state‑action decoupled actor‑critic method to address the problem of extrapolation errors in offline learning caused by out‑of‑distribution actions and a context‑based meta‑channel knowledge learning method to improve the state representation capability of the policy based on the shared knowledge among different channels. Comprehensive experiments conducted on a large scale real‑world industrial dataset from the Meituan ad bidding platform demonstrate that our method achieves a state‑of‑the‑art performance.

PaperID: 12067, https://arxiv.org/pdf/2412.18697.pdf

Authors: Cong Jiang, Xiaolei Yang

Title: Agents on the Bench: Large Language Model Based Multi Agent Framework for Trustworthy Digital Justice

Abstract:
The justice system has increasingly employed AI techniques to enhance efficiency, yet limitations remain in improving the quality of decision‑making, particularly regarding transparency and explainability needed to uphold public trust in legal AI. To address these challenges, we propose a large language model based multi‑agent framework named AgentsBench, which aims to simultaneously improve both efficiency and quality in judicial decision‑making. Our approach leverages multiple LLM‑driven agents that simulate the collaborative deliberation and decision making process of a judicial bench. We conducted experiments on legal judgment prediction task, and the results show that our framework outperforms existing LLM based methods in terms of performance and decision quality. By incorporating these elements, our framework reflects real‑world judicial processes more closely, enhancing accuracy, fairness, and society consideration. AgentsBench provides a more nuanced and realistic methods of trustworthy AI decision‑making, with strong potential for application across various case types and legal scenarios.

PaperID: 12068, https://arxiv.org/pdf/2412.18454.pdf

Authors: Chao Li, Olga Petruchik, Elizaveta Grishanina, Sergey Kovalchuk

Title: Multi-Agent Norm Perception and Induction in Distributed Healthcare

Abstract:
This paper presents a Multi‑Agent Norm Perception and Induction Learning Model aimed at facilitating the integration of autonomous agent systems into distributed healthcare environments through dynamic interaction processes. The nature of the medical norm system and its sharing channels necessitates distinct approaches for Multi‑Agent Systems to learn two types of norms. Building on this foundation, the model enables agents to simultaneously learn descriptive norms, which capture collective tendencies, and prescriptive norms, which dictate ideal behaviors. Through parameterized mixed probability density models and practice‑enhanced Markov games, the multi‑agent system perceives descriptive norms in dynamic interactions and captures emergent prescriptive norms. We conducted experiments using a dataset from a neurological medical center spanning from 2016 to 2020.

PaperID: 12069, https://arxiv.org/pdf/2412.18416.pdf

Authors: Zihan Wang, Xiaocui Yang, Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang

Title: Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles

Abstract:
Current conversational recommendation systems focus predominantly on text. However, real‑world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi‑agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real‑world scenarios rather than depending on manual design and history data for better scalability, and then it fulfills conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine‑tuning experiments on three MLLMs demonstrate Muse's learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and codes are available at https://anonymous.4open.science/r/Muse‑0086.

PaperID: 12070, https://arxiv.org/pdf/2412.18371.pdf

Authors: Kaiwen Ning, Jiachi Chen, Jingwen Zhang, Wei Li, Zexu Wang, Yuming Feng, Weizhe Zhang, Zibin Zheng

Title: Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents

Abstract:
AI agents are systems capable of perceiving their environment, autonomously planning and executing tasks. Recent advancements in LLM have introduced a transformative paradigm for AI agents, enabling them to interact with external resources and tools through prompts. In such agents, the workflow integrates developer‑written code, which manages framework construction and logic control, with LLM‑generated natural language that enhances dynamic decision‑making and interaction. However, discrepancies between developer‑implemented logic and the dynamically generated content of LLMs in terms of behavior and expected outcomes can lead to defects, such as tool invocation failures and task execution errors. These issues introduce specific risks, leading to various defects in LLM‑based AI Agents, such as service interruptions. Despite the importance of these issues, there is a lack of systematic work that focuses on analyzing LLM‑based AI Agents to uncover defects in their code. In this paper, we present the first study focused on identifying and detecting defects in LLM Agents. We collected and analyzed 6,854 relevant posts from StackOverflow to define 8 types of agent defects. For each type, we provided detailed descriptions with an example. Then, we designed a static analysis tool, named Agentable, to detect the defects. Agentable leverages Code Property Graphs and LLMs to analyze Agent workflows by efficiently identifying specific code patterns and analyzing natural language descriptions. To evaluate Agentable, we constructed two datasets: AgentSet, consists of 84 real‑world Agents, and AgentTest, which contains 78 Agents specifically designed to include various types of defects. Our results show that Agentable achieved an overall accuracy of 88.79% and a recall rate of 91.03%. Furthermore, our analysis reveals the 889 defects of the AgentSet, highlighting the prevalence of these defects.

PaperID: 12071, https://arxiv.org/pdf/2412.18351.pdf

Authors: Zhongjian Hu, Peng Yang, Bing Li, Zhenqi Wang

Title: Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

Abstract:
Large Language Models (LLMs) have achieved impressive results in knowledge‑based Visual Question Answering (VQA). However existing methods still have challenges: the inability to use external tools autonomously, and the inability to work in teams. Humans tend to know whether they need to use external tools when they encounter a new question, e.g., they tend to be able to give a direct answer to a familiar question, whereas they tend to use tools such as search engines when they encounter an unfamiliar question. In addition, humans also tend to collaborate and discuss with others to get better answers. Inspired by this, we propose the multi‑agent voting framework. We design three LLM‑based agents that simulate different levels of staff in a team, and assign the available tools according to the levels. Each agent provides the corresponding answer, and finally all the answers provided by the agents are voted to get the final answer. Experiments on OK‑VQA and A‑OKVQA show that our approach outperforms other baselines by 2.2 and 1.0, respectively.

PaperID: 12072, https://arxiv.org/pdf/2412.18279.pdf

Authors: Jiacai Liu, Chaojie Wang, Chris Yuhao Liu, Liang Zeng, Rui Yan, Yiwen Sun, Yang Liu, Yahui Zhou

Title: Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

Abstract:
The role of reinforcement learning (RL) in enhancing the reasoning of large language models (LLMs) is becoming increasingly significant. Despite the success of RL in many scenarios, there are still many challenges in improving the reasoning of LLMs. One challenge is the sparse reward, which makes optimization difficult for RL and necessitates a large amount of data samples. Another challenge stems from the inherent instability of RL, particularly when using Actor‑Critic (AC) methods to derive optimal policies, which often leads to unstable training processes. To address these issues, we introduce Direct Advantage Policy Optimization (DAPO), an novel step‑level offline RL algorithm. Unlike standard alignment that rely solely outcome rewards to optimize policies (such as DPO), DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy. Additionally, the Actor and Critic components in DAPO are trained independently, avoiding the co‑training instability observed in standard AC algorithms like PPO. We train DAPO on mathematical and code query datasets and then evaluate its performance on multiple benchmarks. Our results show that DAPO can effectively enhance the mathematical and code capabilities on both SFT models and RL models, demonstrating the effectiveness of DAPO.

PaperID: 12073, https://arxiv.org/pdf/2412.18194.pdf

Authors: Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

Title: VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Abstract:
General‑purposed embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models especially Vision‑Language‑Action models (VLAs) have shown a substantial potential to solve language‑conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and relative algorithms. To better define such general‑purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open‑source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long‑horizon tasks demanding multi‑step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh\&texture, spatial relationship, semantic instruction, physical laws, knowledge transfer and reasoning, etc. To support the downstream finetuning, we provide high‑quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state‑of‑the‑art pretrained VLAs and the workflow based on VLMs face challenges in our tasks.

PaperID: 12074, https://arxiv.org/pdf/2412.18174.pdf

Authors: Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, Koduvayur Subbalakshmi, Guojun Xiong, Jimin Huang, Lingfei Qian, Xueqing Peng, Qianqian Xie, Jordan W. Suchow

Title: INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

Abstract:
Recent advancements have underscored the potential of large language model (LLM)‑based agents in financial decision‑making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce \textscInvestorBench, the first benchmark specifically designed for evaluating LLM‑based agents in diverse financial decision‑making contexts. InvestorBench enhances the versatility of LLM‑enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks, cryptocurrencies and exchange‑traded funds (ETFs). Additionally, we assess the reasoning and decision‑making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open‑source, multi‑modal datasets and developed a comprehensive suite of environments for financial decision‑making. This establishes a highly accessible platform for evaluating financial agents' performance across various scenarios.

PaperID: 12075, https://arxiv.org/pdf/2412.18047.pdf

Authors: Lo Pang-Yun Ting, Ali Şenol, Huan-Yang Wang, Hsu-Chao Lai, Kun-Ta Chuang, Huan Liu

Title: Towards Hierarchical Multi-Agent Decision-Making for Uncertainty-Aware EV Charging

Abstract:
Recent advances in bidirectional EV charging and discharging systems have spurred interest in workplace applications. However, real‑world deployments face various dynamic factors, such as fluctuating electricity prices and uncertain EV departure times, that hinder effective energy management. To address these issues and minimize building electricity costs while meeting EV charging requirements, we design a hierarchical multi‑agent structure in which a high‑level agent coordinates overall charge or discharge decisions based on real‑time pricing, while multiple low‑level agents manage individual power level accordingly. For uncertain EV departure times, we propose a novel uncertainty‑aware critic augmentation mechanism for low‑level agents that improves the evaluation of power‑level decisions and ensures robust control under such uncertainty. Building upon these two key designs, we introduce HUCA, a real‑time charging control framework that coordinates energy supply among the building and EVs. Experiments on real‑world electricity datasets show that HUCA significantly reduces electricity costs and maintains competitive performance in meeting EV charging requirements under both simulated certain and uncertain departure scenarios. The results further highlight the importance of hierarchical control and the proposed critic augmentation under the uncertain departure scenario. A case study illustrates HUCA's capability to allocate energy between the building and EVs in real time, underscoring its potential for practical use.

PaperID: 12076, https://arxiv.org/pdf/2412.17993.pdf

Authors: Jinhao Liang, Jacob K. Christopher, Sven Koenig, Ferdinando Fioretto

Title: Multi-Agent Path Finding in Continuous Spaces with Projected Diffusion Models

Abstract:
Multi‑Agent Path Finding (MAPF) is a fundamental problem in robotics, requiring the computation of collision‑free paths for multiple agents moving from their respective start to goal positions. Coordinating multiple agents in a shared environment poses significant challenges, especially in continuous spaces where traditional optimization algorithms struggle with scalability. Moreover, these algorithms often depend on discretized representations of the environment, which can be impractical in image‑based or high‑dimensional settings. Recently, diffusion models have shown promise in single‑agent path planning, capturing complex trajectory distributions and generating smooth paths that navigate continuous, high‑dimensional spaces. However, directly extending diffusion models to MAPF introduces new challenges since these models struggle to ensure constraint feasibility, such as inter‑agent collision avoidance. To overcome this limitation, this work proposes a novel approach that integrates constrained optimization with diffusion models for MAPF in continuous spaces. This unique combination directly produces feasible multi‑agent trajectories that respect collision avoidance and kinematic constraints. The effectiveness of our approach is demonstrated across various challenging simulated scenarios of varying dimensionality.

PaperID: 12077, https://arxiv.org/pdf/2412.17964.pdf

Authors: Antony Seabra, Claudio Cavalcante, Joao Nepomuceno, Lucas Lago, Nicolaas Ruberg, Sergio Lifschitz

Title: Dynamic Multi-Agent Orchestration and Retrieval for Multi-Source Question-Answer Systems using Large Language Models

Abstract:
We propose a methodology that combines several advanced techniques in Large Language Model (LLM) retrieval to support the development of robust, multi‑source question‑answer systems. This methodology is designed to integrate information from diverse data sources, including unstructured documents (PDFs) and structured databases, through a coordinated multi‑agent orchestration and dynamic retrieval approach. Our methodology leverages specialized agents‑such as SQL agents, Retrieval‑Augmented Generation (RAG) agents, and router agents ‑ that dynamically select the most appropriate retrieval strategy based on the nature of each query. To further improve accuracy and contextual relevance, we employ dynamic prompt engineering, which adapts in real time to query‑specific contexts. The methodology's effectiveness is demonstrated within the domain of Contract Management, where complex queries often require seamless interaction between unstructured and structured data. Our results indicate that this approach enhances response accuracy and relevance, offering a versatile and scalable framework for developing question‑answer systems that can operate across various domains and data sources.

PaperID: 12078, https://arxiv.org/pdf/2412.17942.pdf

Authors: Antony Seabra, Claudio Cavalcante, Joao Nepomuceno, Lucas Lago, Nicolaas Ruberg, Sergio Lifschitz

Title: Contrato360 2.0: A Document and Database-Driven Question-Answer System using Large Language Models and Agents

Abstract:
We present a question‑and‑answer (Q\&A) application designed to support the contract management process by leveraging combined information from contract documents (PDFs) and data retrieved from contract management systems (database). This data is processed by a large language model (LLM) to provide precise and relevant answers. The accuracy of these responses is further enhanced through the use of Retrieval‑Augmented Generation (RAG), text‑to‑SQL techniques, and agents that dynamically orchestrate the workflow. These techniques eliminate the need to retrain the language model. Additionally, we employed Prompt Engineering to fine‑tune the focus of responses. Our findings demonstrate that this multi‑agent orchestration and combination of techniques significantly improve the relevance and accuracy of the answers, offering a promising direction for future information systems.

PaperID: 12079, https://arxiv.org/pdf/2412.17838.pdf

Authors: Shuyi Wang, Huan Zhao, Yuji Cao, Zibin Pan, Guolong Liu, Gaoqi Liang, Junhua Zhao

Title: Coordinated Power Smoothing Control for Wind Storage Integrated System with Physics-informed Deep Reinforcement Learning

Abstract:
The Wind Storage Integrated System with Power Smoothing Control (PSC) has emerged as a promising solution to ensure both efficient and reliable wind energy generation. However, existing PSC strategies overlook the intricate interplay and distinct control frequencies between batteries and wind turbines, and lack consideration of wake effect and battery degradation cost. In this paper, a novel coordinated control framework with hierarchical levels is devised to address these challenges effectively, which integrates the wake model and battery degradation model. In addition, after reformulating the problem as a Markov decision process, the multi‑agent reinforcement learning method is introduced to overcome the bi‑level characteristic of the problem. Moreover, a Physics‑informed Neural Network‑assisted Multi‑agent Deep Deterministic Policy Gradient (PAMA‑DDPG) algorithm is proposed to incorporate the power fluctuation differential equation and expedite the learning process. The effectiveness of the proposed methodology is evaluated through simulations conducted in four distinct scenarios using WindFarmSimulator (WFSim). The results demonstrate that the proposed algorithm facilitates approximately an 11% increase in total profit and a 19% decrease in power fluctuation compared to the traditional methods, thereby addressing the dual objectives of economic efficiency and grid‑connected energy reliability.

PaperID: 12080, https://arxiv.org/pdf/2412.17767.pdf

Authors: Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, Jiaxuan You

Title: ResearchTown: Simulator of Human Research Community

Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi‑agent framework for research community simulation. Within this framework, the human research community is simplified as an agent‑data graph, where researchers and papers are represented as agent‑type and data‑type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text‑based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message‑passing process on the agent‑data graph. To evaluate the quality of the research community simulation, we present ResearchBench, a benchmark that uses a node‑masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire pioneering research directions.

PaperID: 12081, https://arxiv.org/pdf/2412.17589.pdf

Authors: Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu

Title: PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Abstract:
Imagine a world where AI can handle your work while you sleep ‑ organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real‑world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high‑quality human‑computer interaction trajectories with complete cognitive context; (2) a two‑stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi‑agent system combining a planning agent for decision‑making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high‑quality cognitive data ‑ PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open‑sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.

PaperID: 12082, https://arxiv.org/pdf/2412.17481.pdf

Authors: Shuaihang Chen, Yuanxing Liu, Wei Han, Weinan Zhang, Ting Liu

Title: A Survey on LLM-based Multi-Agent System: Recent Advances and New Frontiers in Application

Abstract:
LLM‑based Multi‑Agent Systems ( LLM‑MAS ) have become a research hotspot since the rise of large language models (LLMs). However, with the continuous influx of new related works, the existing reviews struggle to capture them comprehensively. This paper presents a comprehensive survey of these studies. We first discuss the definition of LLM‑MAS, a framework encompassing much of previous work. We provide an overview of the various applications of LLM‑MAS in (i) solving complex tasks, (ii) simulating specific scenarios, and (iii) evaluating generative agents. Building on previous studies, we also highlight several challenges and propose future directions for research in this field.

PaperID: 12083, https://arxiv.org/pdf/2412.17372.pdf

Authors: Wen-Yu Dong, Shaoshi Yang, Wei Lin, Wei Zhao, Jia-Xing Gui, Sheng Chen

Title: Outage Probability Analysis of Uplink Heterogeneous Non-terrestrial Networks: A Novel Stochastic Geometry Model

Abstract:
In harsh environments such as mountainous terrain, dense vegetation areas, or urban landscapes, a single type of unmanned aerial vehicles (UAVs) may encounter challenges like flight restrictions, difficulty in task execution, or increased risk. Therefore, employing multiple types of UAVs, along with satellite assistance, to collaborate becomes essential in such scenarios. In this context, we present a stochastic geometry based approach for modeling the heterogeneous non‑terrestrial networks (NTNs) by using the classical binomial point process and introducing a novel point process, called Matérn hard‑core cluster process (MHCCP). Our MHCCP possesses both the exclusivity and the clustering properties, thus it can better model the aircraft group composed of multiple clusters. Then, we derive closed‑form expressions of the outage probability (OP) for the uplink (aerial‑to‑satellite) of heterogeneous NTNs. Unlike existing studies, our analysis relies on a more advanced system configuration, where the integration of beamforming and frequency division multiple access, and the shadowed‑Rician (SR) fading model for interference power, are considered. The accuracy of our theoretical derivation is confirmed by Monte Carlo simulations. Our research offers fundamental insights into the system‑level performance optimization of NTNs.

PaperID: 12084, https://arxiv.org/pdf/2412.17252.pdf

Authors: Farzan Moosavi, Bilal Farooq

Title: A Coalition Game for On-demand Multi-modal 3D Automated Delivery System

Abstract:
We introduce a multi‑modal autonomous delivery optimization framework as a coalition game for a fleet of UAVs and ADRs operating in two overlaying networks to address last‑mile delivery in urban environments, including high‑density areas and time‑critical applications. The problem is defined as multiple depot pickup and delivery with time windows constrained over operational restrictions, such as vehicle battery limitation, precedence time window, and building obstruction. Utilizing the coalition game theory, we investigate cooperation structures among the modes to capture how strategic collaboration can improve overall routing efficiency. To do so, a generalized reinforcement learning model is designed to evaluate the cost‑sharing and allocation to different modes to learn the cooperative behaviour with respect to various realistic scenarios. Our methodology leverages an end‑to‑end deep multi‑agent policy gradient method augmented by a novel spatio‑temporal adjacency neighbourhood graph attention network using a heterogeneous edge‑enhanced attention model and transformer architecture. Several numerical experiments on last‑mile delivery applications have been conducted, showing the results from the case study in the city of Mississauga, which shows that despite the incorporation of an extensive network in the graph for two modes and a complex training structure, the model addresses realistic operational constraints and achieves high‑quality solutions compared with the existing transformer‑based and classical methods. It can perform well on non‑homogeneous data distribution, generalizes well on different scales and configurations, and demonstrates a robust cooperative performance under stochastic scenarios across various tasks, which is effectively reflected by coalition analysis and cost allocation to signify the advantage of cooperation.

PaperID: 12085, https://arxiv.org/pdf/2412.17149.pdf

Authors: Kamer Ali Yuksel, Hassan Sawaf

Title: A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops

Abstract:
Agentic AI systems use specialized agents to handle tasks within complex workflows, enabling automation and efficiency. However, optimizing these systems often requires labor‑intensive, manual adjustments to refine roles, tasks, and interactions. This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries, such as NLP‑driven enterprise applications. The system employs agents for Refinement, Execution, Evaluation, Modification, and Documentation, leveraging iterative feedback loops powered by an LLM (Llama 3.2‑3B). The framework achieves optimal performance without human input by autonomously generating and testing hypotheses to improve system configurations. This approach enhances scalability and adaptability, offering a robust solution for real‑world applications in dynamic environments. Case studies across diverse domains illustrate the transformative impact of this framework, showcasing significant improvements in output quality, relevance, and actionability. All data for these case studies, including original and evolved agent codes, along with their outputs, are here: https://anonymous.4open.science/r/evolver‑1D11/

PaperID: 12086, https://arxiv.org/pdf/2412.17146.pdf

Authors: Leidong Xu, Danyal Mohaddes, Yi Wang

Title: LLM Agent for Fire Dynamics Simulations

Abstract:
Significant advances have been achieved in leveraging foundation models, such as large language models (LLMs), to accelerate complex scientific workflows. In this work we introduce FoamPilot, a proof‑of‑concept LLM agent designed to enhance the usability of FireFOAM, a specialized solver for fire dynamics and fire suppression simulations built using OpenFOAM, a popular open‑source toolbox for computational fluid dynamics (CFD). FoamPilot provides three core functionalities: code insight, case configuration and simulation evaluation. Code insight is an alternative to traditional keyword searching leveraging retrieval‑augmented generation (RAG) and aims to enable efficient navigation and summarization of the FireFOAM source code for developers and experienced users. For case configuration, the agent interprets user requests in natural language and aims to modify existing simulation setups accordingly to support intermediate users. FoamPilot's job execution functionality seeks to manage the submission and execution of simulations in high‑performance computing (HPC) environments and provide preliminary analysis of simulation results to support less experienced users. Promising results were achieved for each functionality, particularly for simple tasks, and opportunities were identified for significant further improvement for more complex tasks. The integration of these functionalities into a single LLM agent is a step aimed at accelerating the simulation workflow for engineers and scientists employing FireFOAM for complex simulations critical for improving fire safety.

PaperID: 12087, https://arxiv.org/pdf/2412.17061.pdf

Authors: Hai Ye, Mingbao Lin, Hwee Tou Ng, Shuicheng Yan

Title: Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration

Abstract:
Scaling laws for inference compute in multi‑agent systems remain under‑explored compared to single‑agent scenarios. This work aims to bridge this gap by investigating the problem of data synthesis through multi‑agent sampling, where synthetic responses are generated by sampling from multiple distinct language models. Effective model coordination is crucial for successful multi‑agent collaboration. Unlike previous approaches that rely on fixed workflows, we treat model coordination as a multi‑step decision‑making process, optimizing generation structures dynamically for each input question. We introduce Tree Search‑based Orchestrated Agents~(TOA), where the workflow evolves iteratively during the sequential sampling process. To achieve this, we leverage Monte Carlo Tree Search (MCTS), integrating a reward model to provide real‑time feedback and accelerate exploration. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi‑agent sampling significantly outperforms single‑agent sampling as inference compute scales. TOA is the most compute‑efficient approach, achieving SOTA performance on WMT and a 72.2% LC win rate on AlpacaEval. Moreover, fine‑tuning with our synthesized alignment data surpasses strong preference learning methods on challenging benchmarks such as Arena‑Hard and AlpacaEval.

PaperID: 12088, https://arxiv.org/pdf/2412.17049.pdf

Authors: Jiangbo Yu, Jinhua Zhao, Luis Miranda-Moreno, Matthew Korp

Title: Modular Conversational Agents for Surveys and Interviews

Abstract:
Surveys and interviews are widely used for collecting insights on emerging or hypothetical scenarios. Traditional human‑led methods often face challenges related to cost, scalability, and consistency. Recently, various domains have begun to explore the use of conversational agents (chatbots) powered by generative artificial intelligence (AI) technologies. However, considering decisions in transportation investments and policies often carry significant public and environmental stakes, surveys and interviews face unique challenges in integrating AI agents, underscoring the need for a rigorous, resource‑efficient approach that enhances participant engagement and ensures privacy. This paper addresses this gap by introducing a modular approach and its resulting parameterized process for designing AI agents. We detail the system architecture, integrating engineered prompts, specialized knowledge bases, and customizable, goal‑oriented conversational logic. We demonstrate the adaptability, generalizability, and efficacy of our modular approach through three empirical studies: (1) travel preference surveys, highlighting conditional logic and multimodal (voice, text, and image generation) capabilities; (2) public opinion elicitation on a newly constructed, novel infrastructure project, showcasing question customization and multilingual (English and French) capabilities; and (3) expert consultation about the impact of technologies on future transportation systems, highlighting real‑time, clarification request capabilities for open‑ended questions, resilience in handling erratic inputs, and efficient transcript postprocessing. The results suggest that the AI agent increases completion rates and response quality. Furthermore, the modular approach demonstrates controllability, flexibility, and robustness while addressing key ethical, privacy, security, and token consumption concerns.

PaperID: 12089, https://arxiv.org/pdf/2412.17022.pdf

Authors: Zhengqian Wu, Ruizhe Li, Zijun Xu, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Title: FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos

Abstract:
Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is storylines, which are composed of complex interactions and long‑range evolvement of core story topics including characters, actions and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, making them difficult to comprehensively assess VideoQA models' DVU capability of complex storylines. Additionally, the question quantity and video length of these dataset are limited by high labor costs of handcrafted dataset building method. In this paper, we devise a large language model based multi‑agent collaboration framework, StoryMind, to automatically generate a new large‑scale DVU dataset. The dataset, FriendsQA, derived from the renowned sitcom Friends with an average episode length of 1,358 seconds, contains 44.6K questions evenly distributed across 14 fine‑grained topics. Finally, We conduct comprehensive experiments on 10 state‑of‑the‑art VideoQA models using the FriendsQA dataset.

PaperID: 12090, https://arxiv.org/pdf/2412.16964.pdf

Authors: Huanqia Cai, Yijun Yang, Zhifeng Li

Title: System-2 Mathematical Reasoning via Enriched Instruction Tuning

Abstract:
Solving complex mathematical problems via system‑2 reasoning is a natural human skill, yet it remains a significant challenge for current large language models (LLMs). We identify the scarcity of deliberate multi‑step reasoning data as a primary limiting factor. To this end, we introduce Enriched Instruction Tuning (EIT), a method that enriches existing human‑annotated mathematical datasets by synergizing human and AI feedback to create fine‑grained reasoning trajectories. These datasets are then used to fine‑tune open‑source LLMs, enhancing their mathematical reasoning abilities without reliance on any symbolic verification program. Concretely, EIT is composed of two critical steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). The former generates a high‑level plan that breaks down complex instructions into a sequence of simpler objectives, while ERS fills in reasoning contexts often overlooked by human annotators, creating a smoother reasoning trajectory for LLM fine‑tuning. Unlike existing CoT prompting methods that generate reasoning chains only depending on LLM's internal knowledge, our method leverages human‑annotated initial answers as ``meta‑knowledge'' to help LLMs generate more detailed and precise reasoning processes, leading to a more trustworthy LLM expert for complex mathematical problems. In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state‑of‑the‑art fine‑tuning and prompting methods, and even matching the performance of tool‑augmented methods.

PaperID: 12091, https://arxiv.org/pdf/2412.16878.pdf

Authors: Songjun Tu, Jingbo Sun, Qichao Zhang, Xiangyuan Lan, Dongbin Zhao

Title: Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

Abstract:
Preference‑based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real‑time human feedback is hard to obtain in online tasks. Most work suppose there is a "scripted teacher" that utilizes privileged predefined reward to provide preference feedback. In this paper, we propose a RL Self‑augmented Large Language Model Feedback (RL‑SaLLM‑F) technique that does not rely on privileged information for online PbRL. RL‑SaLLM‑F leverages the reflective and discriminative capabilities of LLM to generate self‑augmented trajectories and provide preference labels for reward learning. First, we identify an failure issue in LLM‑based preference discrimination, specifically "query ambiguity", in online PbRL. Then LLM is employed to provide preference labels and generate self‑augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double‑check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. The experiment across multiple tasks in the MetaWorld benchmark demonstrates the specific contributions of each proposed module in RL‑SaLLM‑F, and shows that self‑augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL‑SaLLM‑F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM‑driven feedback.

PaperID: 12092, https://arxiv.org/pdf/2412.16833.pdf

Authors: Kaiwen Zuo, Yirui Jiang, Fan Mo, Pietro Lio

Title: KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis

Abstract:
Integrating Large Language Models (LLMs) in healthcare diagnosis demands systematic frameworks that can handle complex medical scenarios while maintaining specialized expertise. We present KG4Diagnosis, a novel hierarchical multi‑agent framework that combines LLMs with automated knowledge graph construction, encompassing 362 common diseases across medical specialties. Our framework mirrors real‑world medical systems through a two‑tier architecture: a general practitioner (GP) agent for initial assessment and triage, coordinating with specialized agents for in‑depth diagnosis in specific domains. The core innovation lies in our end‑to‑end knowledge graph generation methodology, incorporating: (1) semantic‑driven entity and relation extraction optimized for medical terminology, (2) multi‑dimensional decision relationship reconstruction from unstructured medical texts, and (3) human‑guided reasoning for knowledge expansion. KG4Diagnosis serves as an extensible foundation for specialized medical diagnosis systems, with capabilities to incorporate new diseases and medical knowledge. The framework's modular design enables seamless integration of domain‑specific enhancements, making it valuable for developing targeted medical diagnosis systems. We provide architectural guidelines and protocols to facilitate adoption across medical contexts.

PaperID: 12093, https://arxiv.org/pdf/2412.16682.pdf

Authors: Feiran Jia, Tong Wu, Xin Qin, Anna Squicciarini

Title: The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents

Abstract:
Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real‑world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test‑time defense mechanism that systematically verifies whether each instruction and tool call contributes to user‑specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07%) while maintaining high task utility (69.79%) on GPT‑4o.

PaperID: 12094, https://arxiv.org/pdf/2412.16633.pdf

Authors: Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, Wenyuan Xu

Title: POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots

Abstract:
The integration of LLMs into robots has witnessed significant growth, where LLMs can convert instructions into executable robot policies. However, the inherent vulnerability of LLMs to jailbreak attacks brings critical security risks from the digital domain to the physical world. An attacked LLM‑based robot could execute harmful policies and cause physical harm. In this paper, we investigate the feasibility and rationale of jailbreak attacks against LLM‑based robots and answer three research questions: (1) How applicable are existing LLM jailbreak attacks against LLM‑based robots? (2) What unique challenges arise if they are not directly applicable? (3) How to defend against such jailbreak attacks? To this end, we first construct a "human‑object‑environment" robot risks‑oriented Harmful‑RLbench and then conduct a measurement study on LLM‑based robot systems. Our findings conclude that traditional LLM jailbreak attacks are inapplicable in robot scenarios, and we identify two unique challenges: determining policy‑executable optimization directions and accurately evaluating robot‑jailbroken policies. To enable a more thorough security analysis, we introduce POEX (POlicy EXecutable) jailbreak, a red‑teaming framework that induces harmful yet executable policy to jailbreak LLM‑based robots. POEX incorporates hidden layer gradient optimization to guarantee jailbreak success and policy execution as well as a multi‑agent evaluator to accurately assess the practical executability of policies. Experiments conducted on the real‑world robotic systems and in simulation demonstrate the efficacy of POEX, highlighting critical security vulnerabilities and its transferability across LLMs. Finally, we propose prompt‑based and model‑based defenses to mitigate attacks. Our findings underscore the urgent need for security measures to ensure the safe deployment of LLM‑based robots in critical applications.

PaperID: 12095, https://arxiv.org/pdf/2412.16409.pdf

Authors: Amanda S. Rios, Ibrahima J. Ndiour, Parual Datta, Jaroslaw Sydir, Omesh Tickoo, Nilesh Ahuja

Title: Uncertainty Quantification in Continual Open-World Learning

Abstract:
AI deployed in the real‑world should be capable of autonomously adapting to novelties encountered after deployment. Yet, in the field of continual learning, the reliance on novelty and labeling oracles is commonplace albeit unrealistic. This paper addresses a challenging and under‑explored problem: a deployed AI agent that continuously encounters unlabeled data ‑ which may include both unseen samples of known classes and samples from novel (unknown) classes ‑ and must adapt to it continuously. To tackle this challenge, we propose our method COUQ "Continual Open‑world Uncertainty Quantification", an iterative uncertainty estimation algorithm tailored for learning in generalized continual open‑world multi‑class settings. We rigorously apply and evaluate COUQ on key sub‑tasks in the Continual Open‑World: continual novelty detection, uncertainty guided active learning, and uncertainty guided pseudo‑labeling for semi‑supervised CL. We demonstrate the effectiveness of our method across multiple datasets, ablations, backbones and performance superior to state‑of‑the‑art.

PaperID: 12096, https://arxiv.org/pdf/2412.16244.pdf

Authors: Matteo Bettini, Ryan Kortvelesy, Amanda Prorok

Title: The impact of behavioral diversity in multi-agent reinforcement learning

Abstract:
Many of the world's most pressing issues, such as climate change and global peace, require complex collective problem‑solving skills. Recent studies indicate that diversity in individuals' behaviors is key to developing such skills and increasing collective performance. Yet behavioral diversity in collective artificial learning is understudied, with today's machine learning paradigms commonly favoring homogeneous agent strategies over heterogeneous ones, mainly due to computational considerations. In this work, we employ diversity measurement and control paradigms to study the impact of behavioral heterogeneity in several facets of multi‑agent reinforcement learning. Through experiments in team play and other cooperative tasks, we show the emergence of unbiased behavioral roles that improve team outcomes; how behavioral diversity synergizes with morphological diversity; how diverse agents are more effective at finding cooperative solutions in sparse reward settings; and how behaviorally heterogeneous teams learn and retain latent skills to overcome repeated disruptions. Overall, our results indicate that, by controlling diversity, we can obtain non‑trivial benefits over homogeneous training paradigms, demonstrating that diversity is a fundamental component of collective artificial learning, an insight thus far overlooked.

PaperID: 12097, https://arxiv.org/pdf/2412.16167.pdf

Authors: Irshad A. Meer, Karl-Ludwig Besser, Mustafa Ozger, Dominic Schupke, H. Vincent Poor, Cicek Cavdar

Title: Hierarchical Multi-Agent DRL Based Dynamic Cluster Reconfiguration for UAV Mobility Management

Abstract:
Multi‑connectivity involves dynamic cluster formation among distributed access points (APs) and coordinated resource allocation from these APs, highlighting the need for efficient mobility management strategies for users with multi‑connectivity. In this paper, we propose a novel mobility management scheme for unmanned aerial vehicles (UAVs) that uses dynamic cluster reconfiguration with energy‑efficient power allocation in a wireless interference network. Our objective encompasses meeting stringent reliability demands, minimizing joint power consumption, and reducing the frequency of cluster reconfiguration. To achieve these objectives, we propose a hierarchical multi‑agent deep reinforcement learning (H‑MADRL) framework, specifically tailored for dynamic clustering and power allocation. The edge cloud connected with a set of APs through low latency optical back‑haul links hosts the high‑level agent responsible for the optimal clustering policy, while low‑level agents reside in the APs and are responsible for the power allocation policy. To further improve the learning efficiency, we propose a novel action‑observation transition‑driven learning algorithm that allows the low‑level agents to use the action space from the high‑level agent as part of the local observation space. This allows the lower‑level agents to share partial information about the clustering policy and allocate the power more efficiently. The simulation results demonstrate that our proposed distributed algorithm achieves comparable performance to the centralized algorithm. Additionally, it offers better scalability, as the decision time for clustering and power allocation increases by only 10% when doubling the number of APs, compared to a 90% increase observed with the centralized approach.

PaperID: 12098, https://arxiv.org/pdf/2412.16145.pdf

Authors: Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu

Title: Offline Reinforcement Learning for LLM Multi-Step Reasoning

Abstract:
Improving the multi‑step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi‑step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi‑step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi‑step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi‑step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi‑step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi‑iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.

PaperID: 12099, https://arxiv.org/pdf/2412.15908.pdf

Authors: He Jiang, Muhan Lin, Jiaoyang Li

Title: Speedup Techniques for Switchable Temporal Plan Graph Optimization

Abstract:
Multi‑Agent Path Finding (MAPF) focuses on planning collision‑free paths for multiple agents. However, during the execution of a MAPF plan, agents may encounter unexpected delays, which can lead to inefficiencies, deadlocks, or even collisions. To address these issues, the Switchable Temporal Plan Graph provides a framework for finding an acyclic Temporal Plan Graph with the minimum execution cost under delays, ensuring deadlock‑ and collision‑free execution. Unfortunately, existing optimal algorithms, such as Mixed Integer Linear Programming and Graph‑Based Switchable Edge Search (GSES), are often too slow for practical use. This paper introduces Improved GSES, which significantly accelerates GSES through four speedup techniques: stronger admissible heuristics, edge grouping, prioritized branching, and incremental implementation. Experiments conducted on four different map types with varying numbers of agents demonstrate that Improved GSES consistently achieves over twice the success rate of GSES and delivers up to a 30‑fold speedup on instances where both methods successfully find solutions.

PaperID: 12100, https://arxiv.org/pdf/2412.15700.pdf

Authors: Guangchong Zhou, Zeren Zhang, Guoliang Fan

Title: AIR: Unifying Individual and Collective Exploration in Cooperative Multi-Agent Reinforcement Learning

Abstract:
Exploration in cooperative multi‑agent reinforcement learning (MARL) remains challenging for value‑based agents due to the absence of an explicit policy. Existing approaches include individual exploration based on uncertainty towards the system and collective exploration through behavioral diversity among agents. However, the introduction of additional structures often leads to reduced training efficiency and infeasible integration of these methods. In this paper, we propose Adaptive exploration via Identity Recognition~(AIR), which consists of two adversarial components: a classifier that recognizes agent identities from their trajectories, and an action selector that adaptively adjusts the mode and degree of exploration. We theoretically prove that AIR can facilitate both individual and collective exploration during training, and experiments also demonstrate the efficiency and effectiveness of AIR across various tasks.

PaperID: 12101, https://arxiv.org/pdf/2412.15681.pdf

Authors: P Raghavendra Rao, Pooja Vyavahare

Title: Asynchronous Vector Consensus over Matrix-Weighted Networks

Abstract:
We study the distributed consensus of state vectors in a discrete‑time multi‑agent network with matrix edge weights using stochastic matrix convergence theory. We present a distributed asynchronous time update model wherein one randomly selected agent updates its state vector at a time by interacting with its neighbors. We prove that all agents converge to same state vector almost surely when every edge weight matrix is positive definite. We study vector consensus in cooperative‑competitive networks with edge weights being either positive or negative definite matrices and present a necessary and sufficient condition to achieve bipartite vector consensus in such networks. We study the network structures on which agents achieve zero consensus. We also present a convergence result on nonhomogenous matrix products which is of independent interest in matrix convergence theory. All the results hold true for the synchronous time update model as well in which all agents update their states simultaneously.

PaperID: 12102, https://arxiv.org/pdf/2412.15639.pdf

Authors: Lunjun Liu, Weilai Jiang, Yaonan Wang

Title: Tacit Learning with Adaptive Information Selection for Cooperative Multi-Agent Reinforcement Learning

Abstract:
In multi‑agent reinforcement learning (MARL), the centralized training with decentralized execution (CTDE) framework has gained widespread adoption due to its strong performance. However, the further development of CTDE faces two key challenges. First, agents struggle to autonomously assess the relevance of input information for cooperative tasks, impairing their decision‑making abilities. Second, in communication‑limited scenarios with partial observability, agents are unable to access global information, restricting their ability to collaborate effectively from a global perspective. To address these challenges, we introduce a novel cooperative MARL framework based on information selection and tacit learning. In this framework, agents gradually develop implicit coordination during training, enabling them to infer the cooperative behavior of others in a discrete space without communication, relying solely on local information. Moreover, we integrate gating and selection mechanisms, allowing agents to adaptively filter information based on environmental changes, thereby enhancing their decision‑making capabilities. Experiments on popular MARL benchmarks show that our framework can be seamlessly integrated with state‑of‑the‑art algorithms, leading to significant performance improvements.

PaperID: 12103, https://arxiv.org/pdf/2412.15619.pdf

Authors: Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, jun Hu, Qing Wang, Fanjiang Xu

Title: Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning

Abstract:
Explaining multi‑agent systems (MAS) is urgent as these systems become increasingly prevalent in various applications. Previous work has proveided explanations for the actions or states of agents, yet falls short in understanding the black‑boxed agent's importance within a MAS and the overall team strategy. To bridge this gap, we propose EMAI, a novel agent‑level explanation approach that evaluates the individual agent's importance. Inspired by counterfactual reasoning, a larger change in reward caused by the randomized action of agent indicates its higher importance. We model it as a MARL problem to capture interactions across agents. Utilizing counterfactual reasoning, EMAI learns the masking agents to identify important agents. Specifically, we define the optimization function to minimize the reward difference before and after action randomization and introduce sparsity constraints to encourage the exploration of more action randomization of agents during training. The experimental results in seven multi‑agent tasks demonstratee that EMAI achieves higher fidelity in explanations than baselines and provides more effective guidance in practical applications concerning understanding policies, launching attacks, and patching policies.

PaperID: 12104, https://arxiv.org/pdf/2412.15573.pdf

Authors: Joshua Holder, Natasha Jaques, Mehran Mesbahi

Title: Multi Agent Reinforcement Learning for Sequential Satellite Assignment Problems

Abstract:
Assignment problems are a classic combinatorial optimization problem in which a group of agents must be assigned to a group of tasks such that maximum utility is achieved while satisfying assignment constraints. Given the utility of each agent completing each task, polynomial‑time algorithms exist to solve a single assignment problem in its simplest form. However, in many modern‑day applications such as satellite constellations, power grids, and mobile robot scheduling, assignment problems unfold over time, with the utility for a given assignment depending heavily on the state of the system. We apply multi‑agent reinforcement learning to this problem, learning the value of assignments by bootstrapping from a known polynomial‑time greedy solver and then learning from further experience. We then choose assignments using a distributed optimal assignment mechanism rather than by selecting them directly. We demonstrate that this algorithm is theoretically justified and avoids pitfalls experienced by other RL algorithms in this setting. Finally, we show that our algorithm significantly outperforms other methods in the literature, even while scaling to realistic scenarios with hundreds of agents and tasks.

PaperID: 12105, https://arxiv.org/pdf/2412.15517.pdf

Authors: Yangkun Chen, Kai Yang, Jian Tao, Jiafei Lyu

Title: Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning

Abstract:
Recently, deep Multi‑Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent's current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi‑Agent Novelty‑GuidEd sample Reuse (MANGER). This method increases sample efficiency and promotes exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super‑hard StarCraft II micromanagement tasks.

PaperID: 12106, https://arxiv.org/pdf/2412.15485.pdf

Authors: Mohammad Daneshvar, Mandana Delavari

Title: An Agent-based Model for Competitive Agents

Abstract:
In this paper, we analyze the behavior of a multi‑agent system driven by the interactions of agents within a competitive environment. To achieve this, we describe the transition probabilities that underlie the system's stochastic nature. We also derive the Fokker‑Planck equations for the density distribution of the number of agents in the system and solve these equations for specific cases.

PaperID: 12107, https://arxiv.org/pdf/2412.15461.pdf

Authors: Desmond Chan, Bart De Keijzer, Tobias Galla, Stefanos Leonardos, Carmine Ventre

Title: Asymptotic Extinction in Large Coordination Games

Abstract:
We study the exploration‑exploitation trade‑off for large multiplayer coordination games where players strategise via Q‑Learning, a common learning framework in multi‑agent reinforcement learning. Q‑Learning is known to have two shortcomings, namely non‑convergence and potential equilibrium selection problems, when there are multiple fixed points, called Quantal Response Equilibria (QRE). Furthermore, whilst QRE have full support for finite games, it is not clear how Q‑Learning behaves as the game becomes large. In this paper, we characterise the critical exploration rate that guarantees convergence to a unique fixed point, addressing the two shortcomings above. Using a generating‑functional method, we show that this rate increases with the number of players and the alignment of their payoffs. For many‑player coordination games with perfectly aligned payoffs, this exploration rate is roughly twice that of p‑player zero‑sum games. As for large games, we provide a structural result for QRE, which suggests that as the game size increases, Q‑Learning converges to a QRE near the boundary of the simplex of the action space, a phenomenon we term asymptotic extinction, where a constant fraction of the actions are played with zero probability at a rate o(1/N) for an N‑action game.

PaperID: 12108, https://arxiv.org/pdf/2412.15437.pdf

Authors: Mohammed Alyaseen, Nikolay Atanasov, Jorge Cortes

Title: Safety-Critical Control of Discontinuous Systems with Nonsmooth Safe Sets

Abstract:
This paper studies the design of controllers for discontinuous dynamics that ensure the safety of non‑smooth sets. The safe set is represented by arbitrarily nested unions and intersections of 0‑superlevel sets of differentiable functions. We show that any optimization‑based controller that satisfies only the point‑wise active safety constraints is generally un‑safe, ruling out the standard techniques developed for safety of continuous dynamics. This motivates the introduction of the notion of transition functions, which allow us to incorporate even the inactive safety constraints without falling into unnecessary conservatism. These functions allow system trajectories to leave a component of the nonsmooth safe set to transition to a different one. The resulting controller is then defined as the solution to a convex optimization problem, which we show is feasible and continuous wherever the system dynamics is continuous. We illustrate the effectiveness of the proposed design approach in a multi‑agent reconfiguration control problem.

PaperID: 12109, https://arxiv.org/pdf/2412.15388.pdf

Authors: Sharlin Utke, Jeremie Houssineau, Giovanni Montana

Title: Investigating Relational State Abstraction in Collaborative MARL

Abstract:
This paper explores the impact of relational state abstraction on sample efficiency and performance in collaborative Multi‑Agent Reinforcement Learning. The proposed abstraction is based on spatial relationships in environments where direct communication between agents is not allowed, leveraging the ubiquity of spatial reasoning in real‑world multi‑agent scenarios. We introduce MARC (Multi‑Agent Relational Critic), a simple yet effective critic architecture incorporating spatial relational inductive biases by transforming the state into a spatial graph and processing it through a relational graph neural network. The performance of MARC is evaluated across six collaborative tasks, including a novel environment with heterogeneous agents. We conduct a comprehensive empirical analysis, comparing MARC against state‑of‑the‑art MARL baselines, demonstrating improvements in both sample efficiency and asymptotic performance, as well as its potential for generalization. Our findings suggest that a minimal integration of spatial relational inductive biases as abstraction can yield substantial benefits without requiring complex designs or task‑specific engineering. This work provides insights into the potential of relational state abstraction to address sample efficiency, a key challenge in MARL, offering a promising direction for developing more efficient algorithms in spatially complex environments.

PaperID: 12110, https://arxiv.org/pdf/2412.15305.pdf

Authors: Ziyi Ni, Yifan Li, Ning Yang, Dou Shen, Pin Lv, Daxiang Dong

Title: Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling

Abstract:
Solving complex reasoning tasks is a key real‑world application of agents. Thanks to the pretraining of Large Language Models (LLMs) on code data, recent approaches like CodeAct successfully use code as LLM agents' action, achieving good results. However, CodeAct greedily generates the next action's code block by relying on fragmented thoughts, resulting in inconsistency and instability. Moreover, CodeAct lacks action‑related ground‑truth (GT), making its supervision signals and termination conditions questionable in multi‑turn interactions. To address these issues, we first introduce a simple yet effective end‑to‑end code generation paradigm, CodeProgram, which leverages code's systematic logic to align with global reasoning and enable cohesive problem‑solving. Then, we propose Tree‑of‑Code (ToC), which self‑grows CodeProgram nodes based on the executable nature of the code and enables self‑supervision in a GT‑free scenario. Experimental results on two datasets using ten popular zero‑shot LLMs show ToC remarkably boosts accuracy by nearly 20% over CodeAct with less than 1/4 turns. Several LLMs even perform better on one‑turn CodeProgram than on multi‑turn CodeAct. To further investigate the trade‑off between efficacy and efficiency, we test different ToC tree sizes and exploration mechanisms. We also highlight the potential of ToC's end‑to‑end data generation for supervised and reinforced fine‑tuning.

PaperID: 12111, https://arxiv.org/pdf/2412.15291.pdf

Authors: Chenxiao Yu, Jinyi Ye, Yuangang Li, Zheng Li, Emilio Ferrara, Xiyang Hu, Yue Zhao

Title: A Large-Scale Simulation on Large Language Models for Decision-Making in Political Science

Abstract:
While LLMs have demonstrated remarkable capabilities in text generation and reasoning, their ability to simulate human decision‑making ‑‑ particularly in political contexts ‑‑ remains an open question. However, modeling voter behavior presents unique challenges due to limited voter‑level data, evolving political landscapes, and the complexity of human reasoning. In this study, we develop a theory‑driven, multi‑step reasoning framework that integrates demographic, temporal and ideological factors to simulate voter decision‑making at scale. Using synthetic personas calibrated to real‑world voter data, we conduct large‑scale simulations of recent U.S. presidential elections. Our method significantly improves simulation accuracy while mitigating model biases. We examine its robustness by comparing performance across different LLMs. We further investigate the challenges and constraints that arise from LLM‑based political simulations. Our work provides both a scalable framework for modeling political decision‑making behavior and insights into the promise and limitations of using LLMs in political science research.

PaperID: 12112, https://arxiv.org/pdf/2412.15274.pdf

Authors: Jiale Liu, Yifan Zeng, Malte Højmark-Bertelsen, Marie Normann Gadeberg, Huazheng Wang, Qingyun Wu

Title: Memory-Augmented Agent Training for Business Document Understanding

Abstract:
Traditional enterprises face significant challenges in processing business documents, where tasks like extracting transport references from invoices remain largely manual despite their crucial role in logistics operations. While Large Language Models offer potential automation, their direct application to specialized business domains often yields unsatisfactory results. We introduce Matrix (Memory‑Augmented agent Training through Reasoning and Iterative eXploration), a novel paradigm that enables LLM agents to progressively build domain expertise through experience‑driven memory refinement and iterative learning. To validate this approach, we collaborate with one of the world's largest logistics companies to create a dataset of Universal Business Language format invoice documents, focusing on the task of transport reference extraction. Experiments demonstrate that Matrix outperforms prompting a single LLM by 30.3%, vanilla LLM agent by 35.2%. We further analyze the metrics of the optimized systems and observe that the agent system requires less API calls, fewer costs and can analyze longer documents on average. Our methods establish a new approach to transform general‑purpose LLMs into specialized business tools through systematic memory enhancement in document processing tasks.

PaperID: 12113, https://arxiv.org/pdf/2412.15266.pdf

Authors: Ruihong Zeng, Jinyuan Fang, Siwei Liu, Zaiqiao Meng

Title: On the Structural Memory of LLM Agents

Abstract:
Memory plays a pivotal role in enabling large language model~(LLM)‑based agents to engage in complex and long‑term interactions, such as question answering (QA) and dialogue systems. While various memory modules have been proposed for these tasks, the impact of different memory structures across tasks remains insufficiently explored. This paper investigates how memory structures and memory retrieval methods affect the performance of LLM‑based agents. Specifically, we evaluate four types of memory structures, including chunks, knowledge triples, atomic facts, and summaries, along with mixed memory that combines these components. In addition, we evaluate three widely used memory retrieval methods: single‑step retrieval, reranking, and iterative retrieval. Extensive experiments conducted across four tasks and six datasets yield the following key insights: (1) Different memory structures offer distinct advantages, enabling them to be tailored to specific tasks; (2) Mixed memory structures demonstrate remarkable resilience in noisy environments; (3) Iterative retrieval consistently outperforms other methods across various scenarios. Our investigation aims to inspire further research into the design of memory systems for LLM‑based agents.

PaperID: 12114, https://arxiv.org/pdf/2412.15237.pdf

Authors: Naresh Kshetri, Mir Mehedi Rahman, Md Masud Rana, Omar Faruq Osama, James Hutson

Title: algoTRIC: Symmetric and asymmetric encryption algorithms for Cryptography -- A comparative analysis in AI era

Abstract:
The increasing integration of artificial intelligence (AI) within cybersecurity has necessitated stronger encryption methods to ensure data security. This paper presents a comparative analysis of symmetric (SE) and asymmetric encryption (AE) algorithms, focusing on their role in securing sensitive information in AI‑driven environments. Through an in‑depth study of various encryption algorithms such as AES, RSA, and others, this research evaluates the efficiency, complexity, and security of these algorithms within modern cybersecurity frameworks. Utilizing both qualitative and quantitative analysis, this research explores the historical evolution of encryption algorithms and their growing relevance in AI applications. The comparison of SE and AE algorithms focuses on key factors such as processing speed, scalability, and security resilience in the face of evolving threats. Special attention is given to how these algorithms are integrated into AI systems and how they manage the challenges posed by large‑scale data processing in multi‑agent environments. Our results highlight that while SE algorithms demonstrate high‑speed performance and lower computational demands, AE algorithms provide superior security, particularly in scenarios requiring enhanced encryption for AI‑based networks. The paper concludes by addressing the security concerns that encryption algorithms must tackle in the age of AI and outlines future research directions aimed at enhancing encryption techniques for cybersecurity.

PaperID: 12115, https://arxiv.org/pdf/2412.15135.pdf

Authors: Chunyan Mu, Nima Motamed, Natasha Alechina, Brian Logan

Title: Probabilistic Strategy Logic with Degrees of Observability

Abstract:
There has been considerable work on reasoning about the strategic ability of agents under imperfect information. However, existing logics such as Probabilistic Strategy Logic are unable to express properties relating to information transparency. Information transparency concerns the extent to which agents' actions and behaviours are observable by other agents. Reasoning about information transparency is useful in many domains including security, privacy, and decision‑making. In this paper, we present a formal framework for reasoning about information transparency properties in stochastic multi‑agent systems. We extend Probabilistic Strategy Logic with new observability operators that capture the degree of observability of temporal properties by agents. We show that the model checking problem for the resulting logic is decidable.

PaperID: 12116, https://arxiv.org/pdf/2412.14860.pdf

Authors: Junyi Li, Hwee Tou Ng

Title: Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling

Abstract:
Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi‑step reasoning problem integrated with search. Specifically, we propose Self‑Guided Monte Carlo Tree Search (SG‑MCTS), which capitalizes on the self‑reflection capability of LLMs to reason about the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Modeling to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.

PaperID: 12117, https://arxiv.org/pdf/2412.14779.pdf

Authors: Aditya Kapoor, Sushant Swamy, Kale-ab Tessera, Mayank Baranwal, Mingfei Sun, Harshad Khadilkar, Stefano V. Albrecht

Title: Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning

Abstract:
In multi‑agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long‑horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal‑Agent Reward Redistribution (TAR^2), a novel approach designed to address the agent‑temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR^2 decomposes sparse global rewards into time‑step‑specific rewards and calculates agent‑specific contributions to these rewards. We theoretically prove that TAR^2 is equivalent to potential‑based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR^2 stabilizes and accelerates the learning process. Additionally, we show that when TAR^2 is integrated with single‑agent reinforcement learning algorithms, it performs as well as or better than traditional multi‑agent reinforcement learning methods.

PaperID: 12118, https://arxiv.org/pdf/2412.14769.pdf

Authors: Yiqun Zhang, Xiaocui Yang, Xiaobai Li, Siyuan Yu, Yi Luan, Shi Feng, Daling Wang, Yifei Zhang

Title: PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children

Abstract:
Left‑behind children (LBCs), numbering over 66 million in China, face severe mental health challenges due to parental migration for work. Early screening and identification of at‑risk LBCs is crucial, yet challenging due to the severe shortage of mental health professionals, especially in rural areas. While the House‑Tree‑Person (HTP) test shows higher child participation rates, its requirement for expert interpretation limits its application in resource‑scarce regions. To address this challenge, we propose PsyDraw, a multi‑agent system based on Multimodal Large Language Models that assists mental health professionals in analyzing HTP drawings. The system employs specialized agents for feature extraction and psychological interpretation, operating in two stages: comprehensive feature analysis and professional report generation. Evaluation of HTP drawings from 290 primary school students reveals that 71.03% of the analyzes achieved High Consistency with professional evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The system identified 31.03% of cases requiring professional attention, demonstrating its effectiveness as a preliminary screening tool. Currently deployed in pilot schools, \method shows promise in supporting mental health professionals, particularly in resource‑limited areas, while maintaining high professional standards in psychological assessment.

PaperID: 12119, https://arxiv.org/pdf/2412.14684.pdf

Authors: Yunsu Kim, AhmedElmogtaba Abdelaziz, Thiago Castro Ferreira, Mohamed Al-Badrashiny, Hassan Sawaf

Title: Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines

Abstract:
As the demand for artificial intelligence (AI) grows to address complex real‑world tasks, single models are often insufficient, requiring the integration of multiple models into pipelines. This paper introduces Bel Esprit, a conversational agent designed to construct AI model pipelines based on user‑defined requirements. Bel Esprit employs a multi‑agent framework where subagents collaborate to clarify requirements, build, validate, and populate pipelines with appropriate models. We demonstrate the effectiveness of this framework in generating pipelines from ambiguous user queries, using both human‑curated and synthetic data. A detailed error analysis highlights ongoing challenges in pipeline construction. Bel Esprit is available for a free trial at https://belesprit.aixplain.com.

PaperID: 12120, https://arxiv.org/pdf/2412.14500.pdf

Authors: David M. Bossens, Shanshan Feng, Yew-Soon Ong

Title: The Digital Ecosystem of Beliefs: does evolution favour AI over humans?

Abstract:
As AI systems are integrated into social networks, there are AI safety concerns that AI‑generated content may dominate the web, e.g. in popularity or impact on beliefs. To understand such questions, this paper proposes the Digital Ecosystem of Beliefs (Digico), the first evolutionary framework for controlled experimentation with multi‑population interactions in simulated social networks. Following a Universal Darwinism approach, the framework models a population of agents which change their messaging strategies due to evolutionary updates. They interact via messages, update their beliefs following a contagion model, and maintain their beliefs through cognitive Lamarckian inheritance. Initial experiments with Digico implement two types of agents, which are modelled to represent AIs vs humans based on higher rates of communication, higher rates of evolution, seeding fixed beliefs with propaganda aims, and higher influence on the recommendation algorithm. These experiments show that: a) when AIs have faster messaging, evolution, and more influence on the recommendation algorithm, they get 80% to 95% of the views; b) AIs designed for propaganda can typically convince 50% of humans to adopt extreme beliefs, and up to 85% when agents believe only a limited number of channels; c) a penalty for content that violates agents' beliefs reduces propaganda effectiveness up to 8%. We further discuss Digico as a tool for systematic experimentation across multi‑agent configurations, the implications for legislation, personal use, and platform design, and the use of Digico for studying evolutionary principles.

PaperID: 12121, https://arxiv.org/pdf/2412.14222.pdf

Authors: Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang

Title: A Survey on Large Language Model-based Agents for Statistics and Data Science

Abstract:
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM‑based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM‑based frameworks, detailing essential features such as planning, reasoning, reflection, multi‑agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data‑centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real‑world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.

PaperID: 12122, https://arxiv.org/pdf/2412.14218.pdf

Authors: Jiaming Yu, Le Liang, Chongtao Guo, Ziyang Guo, Shi Jin, Geoffrey Ye Li

Title: Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs

Abstract:
This paper investigates the use of multi‑agent reinforcement learning (MARL) to address distributed channel access in wireless local area networks. In particular, we consider the challenging yet more practical case where the agents heterogeneously adopt value‑based or policy‑based reinforcement learning algorithms to train the model. We propose a heterogeneous MARL training framework, named QPMIX, which adopts a centralized training with distributed execution paradigm to enable heterogeneous agents to collaborate. Moreover, we theoretically prove the convergence of the proposed heterogeneous MARL method when using the linear value function approximation. Our method maximizes the network throughput and ensures fairness among stations, therefore, enhancing the overall network performance. Simulation results demonstrate that the proposed QPMIX algorithm improves throughput, mean delay, delay jitter, and collision rates compared with conventional carrier‑sense multiple access with collision avoidance (CSMA/CA) mechanism in the saturated traffic scenario. Furthermore, the QPMIX algorithm is robust in unsaturated and delay‑sensitive traffic scenarios. It coexists well with the conventional CSMA/CA mechanism and promotes cooperation among heterogeneous agents.

PaperID: 12123, https://arxiv.org/pdf/2412.14212.pdf

Authors: Ziyi Ni, Yifan Li, Daxiang Dong

Title: Tree-of-Code: A Hybrid Approach for Robust Complex Task Planning and Execution

Abstract:
The exceptional capabilities of large language models (LLMs) have substantially accelerated the rapid rise and widespread adoption of agents. Recent studies have demonstrated that generating Python code to consolidate LLM‑based agents' actions into a unified action space (CodeAct) is a promising approach for developing real‑world LLM agents. However, this step‑by‑step code generation approach often lacks consistency and robustness, leading to instability in agent applications, particularly for complex reasoning and out‑of‑domain tasks. In this paper, we propose a novel approach called Tree‑of‑Code (ToC) to tackle the challenges of complex problem planning and execution with an end‑to‑end mechanism. By integrating key ideas from both Tree‑of‑Thought and CodeAct, ToC combines their strengths to enhance solution exploration. In our framework, each final code execution result is treated as a node in the decision tree, with a breadth‑first search strategy employed to explore potential solutions. The final outcome is determined through a voting mechanism based on the outputs of the nodes.

PaperID: 12124, https://arxiv.org/pdf/2412.14161.pdf

Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

Title: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Abstract:
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work‑related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real‑world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self‑contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API‑based and open‑weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents‑‑in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long‑horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the‑agent‑company.com.

PaperID: 12125, https://arxiv.org/pdf/2412.13953.pdf

Authors: Philipp Binfet, Janis Adamek, Nils Schlüter, Moritz Schulze Darup

Title: Towards privacy-preserving cooperative control via encrypted distributed optimization

Abstract:
Cooperative control is crucial for the effective operation of dynamical multi‑agent systems. Especially for distributed control schemes, it is essential to exchange data between the agents. This becomes a privacy threat if the data is sensitive. Encrypted control has shown the potential to address this risk and ensure confidentiality. However, existing approaches mainly focus on cloud‑based control and distributed schemes are restrictive. In this paper, we present a novel privacy‑preserving cooperative control scheme based on encrypted distributed optimization. More precisely, we focus on a secure distributed solution of a general consensus problem, which has manifold applications in cooperative control, by means of the alternating direction method of multipliers (ADMM). As a unique feature of our approach, we explicitly take into account the common situation that local decision variables contain copies of quantities associated with neighboring agents and ensure the neighbor's privacy. We show the effectiveness of our method based on a numerical case study dealing with the formation of mobile robots.

PaperID: 12126, https://arxiv.org/pdf/2412.13781.pdf

Authors: Yaoke Wang, Yun Zhu, Xintong Bao, Wenqiao Zhang, Suyang Dai, Kehan Chen, Wenqiang Li, Gang Huang, Siliang Tang, Yueting Zhuang

Title: Meta-Reflection: A Feedback-Free Reflection Learning Framework

Abstract:
Despite the remarkable capabilities of large language models (LLMs) in natural language understanding and reasoning, they often display undesirable behaviors, such as generating hallucinations and unfaithful reasoning. A prevalent strategy to mitigate these issues is the use of reflection, which refines responses through an iterative process. However, while promising, reflection heavily relies on high‑quality external feedback and requires iterative multi‑agent inference processes, thus hindering its practical application. In this paper, we propose Meta‑Reflection, a novel feedback‑free reflection mechanism that necessitates only a single inference pass without external feedback. Motivated by the human ability to remember and retrieve reflections from past experiences when encountering similar problems, Meta‑Reflection integrates reflective insights into a codebook, allowing the historical insights to be stored, retrieved, and used to guide LLMs in problem‑solving. To thoroughly investigate and evaluate the practicality of Meta‑Reflection in real‑world scenarios, we introduce an industrial e‑commerce benchmark named E‑commerce Customer Intent Detection (ECID). Extensive experiments conducted on both public datasets and the ECID benchmark highlight the effectiveness and efficiency of our proposed approach.

PaperID: 12127, https://arxiv.org/pdf/2412.13719.pdf

Authors: Jáchym Herynek, Stefan Edelkamp

Title: Heuristic Planner for Communication-Constrained Multi-Agent Multi-Goal Path Planning

Abstract:
In robotics, coordinating a group of robots is an essential task. This work presents the communication‑constrained multi‑agent multi‑goal path planning problem and proposes a graph‑search based algorithm to address this task. Given a fleet of robots, an environment represented by a weighted graph, and a sequence of goals, the aim is to visit all the goals without breaking the communication constraints between the agents, minimizing the completion time. The resulting paths produced by our approach show how the agents can coordinate their individual paths, not only with respect to the next goal but also with respect to all future goals, all the time keeping the communication within the fleet intact.

PaperID: 12128, https://arxiv.org/pdf/2412.13667.pdf

Authors: ChengAo Shen, Zhengzhang Chen, Dongsheng Luo, Dongkuan Xu, Haifeng Chen, Jingchao Ni

Title: Exploring Multi-Modal Data with Tool-Augmented LLM Agents for Precise Causal Discovery

Abstract:
Causal discovery is an imperative foundation for decision‑making across domains, such as smart health, AI for drug discovery and AIOps. Traditional statistical causal discovery methods, while well‑established, predominantly rely on observational data and often overlook the semantic cues inherent in cause‑and‑effect relationships. The advent of Large Language Models (LLMs) has ushered in an affordable way of leveraging the semantic cues for knowledge‑driven causal discovery, but the development of LLMs for causal discovery lags behind other areas, particularly in the exploration of multi‑modal data. To bridge the gap, we introduce MATMCD, a multi‑agent system powered by tool‑augmented LLMs. MATMCD has two key agents: a Data Augmentation agent that retrieves and processes modality‑augmented data, and a Causal Constraint agent that integrates multi‑modal data for knowledge‑driven reasoning. The proposed design of the inner‑workings ensures successful cooperation of the agents. Our empirical study across seven datasets suggests the significant potential of multi‑modality enhanced causal discovery.

PaperID: 12129, https://arxiv.org/pdf/2412.13551.pdf

Authors: Xuhan Zuo, Minghao Wang, Tianqing Zhu, Shui Yu, Wanlei Zhou

Title: Large Language Model Federated Learning with Blockchain and Unlearning for Cross-Organizational Collaboration

Abstract:
Large language models (LLMs) have transformed the way computers understand and process human language, but using them effectively across different organizations remains still difficult. When organizations work together to improve LLMs, they face several main challenges. First, organizations hesitate to share their valuable data with others. Second, competition between organizations creates trust problems during collaboration. Third, new privacy laws require organizations to be able to delete specific data when requested, which is especially difficult when multiple organizations are learning from shared data. Traditional federated learning approaches do not address these interconnected challenges, particularly in scenarios where participants cannot fully trust each other or the central aggregator. To overcome these limitations, we propose a hybrid blockchain‑based federated learning framework that uniquely combines public and private blockchain architectures with multi‑agent reinforcement learning. Our framework enables transparent sharing of model update through the public blockchain while protecting sensitive computations in private chains. Each organization operates as an intelligent agent, using Q‑learning to optimize its participation strategy and resource allocation, thus aligning individual incentives with collective goals. Notably, we introduce an efficient unlearning mechanism based on Low‑Rank Adaptation (LoRA) that enables selective removal of specific data contributions without compromising the model's overall performance. Through extensive experimentation on real‑world datasets, we demonstrate that our framework effectively balances privacy protection, trust establishment, and regulatory compliance while maintaining high model performance.

PaperID: 12130, https://arxiv.org/pdf/2412.13520.pdf

Authors: Yi Huang, Fangyin Cheng, Fan Zhou, Jiahui Li, Jian Gong, Hongjun Yang, Zhidong Fan, Caigao Jiang, Siqiao Xue, Faqiang Chen

Title: ROMAS: A Role-Based Multi-Agent System for Database monitoring and Planning

Abstract:
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in data analytics when integrated with Multi‑Agent Systems (MAS). However, these systems often struggle with complex tasks that involve diverse functional requirements and intricate data processing challenges, necessitating customized solutions that lack broad applicability. Furthermore, current MAS fail to emulate essential human‑like traits such as self‑planning, self‑monitoring, and collaborative work in dynamic environments, leading to inefficiencies and resource wastage. To address these limitations, we propose ROMAS, a novel Role‑Based M ulti‑A gent System designed to adapt to various scenarios while enabling low code development and one‑click deployment. ROMAS has been effectively deployed in DB‑GPT [Xue et al., 2023a, 2024b], a well‑known project utilizing LLM‑powered database analytics, showcasing its practical utility in real‑world scenarios. By integrating role‑based collaborative mechanisms for self‑monitoring and self‑planning, and leveraging existing MAS capabilities to enhance database interactions, ROMAS offers a more effective and versatile solution. Experimental evaluations of ROMAS demonstrate its superiority across multiple scenarios, highlighting its potential to advance the field of multi‑agent data analytics.

PaperID: 12131, https://arxiv.org/pdf/2412.13471.pdf

Authors: Rui Zou, Mengqi Wei, Jintian Feng, Qian Wan, Jianwen Sun, Sannyuya Liu

Title: Gradual Vigilance and Interval Communication: Enhancing Value Alignment in Multi-Agent Debates

Abstract:
In recent years, large language models have shown exceptional performance in fulfilling diverse human needs. However, their training data can introduce harmful content, underscoring the necessity for robust value alignment. Mainstream methods, which depend on feedback learning and supervised training, are resource‑intensive and may constrain the full potential of the models. Multi‑Agent Debate (MAD) offers a more efficient and innovative solution by enabling the generation of reliable answers through agent interactions. To apply MAD to value alignment, we examine the relationship between the helpfulness and harmlessness of debate outcomes and individual responses, and propose a MAD based framework Gradual Vigilance and Interval Communication (GVIC). GVIC allows agents to assess risks with varying levels of vigilance and to exchange diverse information through interval communication. We theoretically prove that GVIC optimizes debate efficiency while reducing communication overhead. Experimental results demonstrate that GVIC consistently outperforms baseline methods across various tasks and datasets, particularly excelling in harmfulness mitigation and fraud prevention. Additionally, GVIC exhibits strong adaptability across different base model sizes, including both unaligned and aligned models, and across various task types.

PaperID: 12132, https://arxiv.org/pdf/2412.13420.pdf

Authors: Boyu Qiao, Kun Li, Wei Zhou, Shilong Li, Qianqian Lu, Songlin Hu

Title: BotSim: LLM-Powered Malicious Social Botnet Simulation

Abstract:
Social media platforms like X(Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing significant challenges to platform regulation. To better understand and counter these threats, we innovatively design BotSim, a malicious social botnet simulation powered by LLM. BotSim mimics the information dissemination patterns of real‑world social networks, creating a virtual environment composed of intelligent agent bots and real human users. In the temporal simulation constructed by BotSim, these advanced agent bots autonomously engage in social interactions such as posting and commenting, effectively modeling scenarios of information flow and user interaction. Building on the BotSim framework, we construct a highly human‑like, LLM‑driven bot dataset called BotSim‑24 and benchmark multiple bot detection strategies against it. The experimental results indicate that detection methods effective on traditional bot datasets perform worse on BotSim‑24, highlighting the urgent need for new detection strategies to address the cybersecurity threats posed by these advanced bots.

PaperID: 12133, https://arxiv.org/pdf/2412.13359.pdf

Authors: Jingtian Yan, Jiaoyang Li

Title: Multi-Agent Motion Planning For Differential Drive Robots Through Stationary State Search

Abstract:
Multi‑Agent Motion Planning (MAMP) finds various applications in fields such as traffic management, airport operations, and warehouse automation. In many of these environments, differential drive robots are commonly used. These robots have a kinodynamic model that allows only in‑place rotation and movement along their current orientation, subject to speed and acceleration limits. However, existing Multi‑Agent Path Finding (MAPF)‑based methods often use simplified models for robot kinodynamics, which limits their practicality and realism. In this paper, we introduce a three‑level framework called MASS to address these challenges. MASS combines MAPF‑based methods with our proposed stationary state search planner to generate high‑quality kinodynamically‑feasible plans. We further extend MASS using an adaptive window mechanism to address the lifelong MAMP problem. Empirically, we tested our methods on the single‑shot grid map domain and the lifelong warehouse domain. Our method shows up to 400% improvements in terms of throughput compared to existing methods.

PaperID: 12134, https://arxiv.org/pdf/2412.13290.pdf

Authors: Kanstantsin Pashkovich, Jacob Skitsko

Title: Linear Contracts for Supermodular Functions Based on Graphs

Abstract:
We study linear contracts for combinatorial problems in multi‑agent settings. In this problem, a principal designs a linear contract with several agents, each of whom can decide to take a costly action or not. The principal observes only the outcome of the agents' collective actions, not the actions themselves, and obtains a reward from this outcome. Agents that take an action incur a cost, and so naturally agents require a fraction of the principal's reward as an incentive for taking their action. The principal needs to decide what fraction of their reward to give to each agent so that the principal's expected utility is maximized. Our focus is on the case when the agents are vertices in a graph and the principal's reward corresponds to the number of edges between agents who take their costly action. This case represents the natural scenario when an action of each agent complements actions of other agents though collaborations. Recently, Deo‑Campo Vuong et.al. showed that for this problem it is impossible to provide any finite multiplicative approximation or additive FPTAS unless \mathcalP = \mathcalNP. On a positive note, the authors provided an additive PTAS for the case when all agents have the same cost. They asked whether an additive PTAS can be obtained for the general case, i.e when agents potentially have different costs. We answer this open question in positive.

PaperID: 12135, https://arxiv.org/pdf/2412.13233.pdf

Authors: Ioannis Tzachristas

Title: Creating an LLM-based AI-agent: A high-level methodology towards enhancing LLMs with APIs

Abstract:
Large Language Models (LLMs) have revolutionized various aspects of engineering and science. Their utility is often bottlenecked by the lack of interaction with the external digital environment. To overcome this limitation and achieve integration of LLMs and Artificial Intelligence (AI) into real‑world applications, customized AI agents are being constructed. Based on the technological trends and techniques, we extract a high‑level approach for constructing these AI agents, focusing on their underlying architecture. This thesis serves as a comprehensive guide that elucidates a multi‑faceted approach for empowering LLMs with the capability to leverage Application Programming Interfaces (APIs). We present a 7‑step methodology that begins with the selection of suitable LLMs and the task decomposition that is necessary for complex problem‑solving. This methodology includes techniques for generating training data for API interactions and heuristics for selecting the appropriate API among a plethora of options. These steps eventually lead to the generation of API calls that are both syntactically and semantically aligned with the LLM's understanding of a given task. Moreover, we review existing frameworks and tools that facilitate these processes and highlight the gaps in current attempts. In this direction, we propose an on‑device architecture that aims to exploit the functionality of carry‑on devices by using small models from the Hugging Face community. We examine the effectiveness of these approaches on real‑world applications of various domains, including the generation of a piano sheet. Through an extensive analysis of the literature and available technologies, this thesis aims to set a compass for researchers and practitioners to harness the full potential of LLMs augmented with external tool capabilities, thus paving the way for more autonomous, robust, and context‑aware AI agents.

PaperID: 12136, https://arxiv.org/pdf/2412.13114.pdf

Authors: Rafael Dewes, Rayna Dimitrova

Title: Contract-based Design and Verification of Multi-Agent Systems with Quantitative Temporal Requirements

Abstract:
Quantitative requirements play an important role in the context of multi‑agent systems, where there is often a trade‑off between the tasks of individual agents and the constraints that the agents must jointly adhere to. We study multi‑agent systems whose requirements are formally specified in the quantitative temporal logic LTL[\mathcalF] as a combination of local task specifications for the individual agents and a shared safety constraint, The intricate dependencies between the individual agents entailed by their local and shared objectives make the design of multi‑agent systems error‑prone, and their verification time‑consuming. In this paper we address this problem by proposing a novel notion of quantitative assume‑guarantee contracts, that enables the compositional design and verification of multi‑agent systems with quantitative temporal specifications. The crux of these contracts lies in their ability to capture the coordination between the individual agents to achieve an optimal value of the overall specification under any possible behavior of the external environment. We show that the proposed framework improves the scalability and modularity of formal verification of multi‑agent systems against quantitative temporal specifications.

PaperID: 12137, https://arxiv.org/pdf/2412.12588.pdf

Authors: Yun Luo, Yingjie Li, Xiangkun Hu, Qinglin Qi, Fang Guo, Qipeng Guo, Zheng Zhang, Yue Zhang

Title: PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization

Abstract:
As online platforms and recommendation algorithms evolve, people are increasingly trapped in echo chambers, leading to biased understandings of various issues. To combat this issue, we have introduced PerSphere, a benchmark designed to facilitate multi‑faceted perspective retrieval and summarization, thus breaking free from these information silos. For each query within PerSphere, there are two opposing claims, each supported by distinct, non‑overlapping perspectives drawn from one or more documents. Our goal is to accurately summarize these documents, aligning the summaries with the respective claims and their underlying perspectives. This task is structured as a two‑step end‑to‑end pipeline that includes comprehensive document retrieval and multi‑faceted summarization. Furthermore, we propose a set of metrics to evaluate the comprehensiveness of the retrieval and summarization content. Experimental results on various counterparts for the pipeline show that recent models struggle with such a complex task. Analysis shows that the main challenge lies in long context and perspective extraction, and we propose a simple but effective multi‑agent summarization system, offering a promising solution to enhance performance on PerSphere.

PaperID: 12138, https://arxiv.org/pdf/2412.12547.pdf

Authors: Ziang Wang, Lei Wang, Qi Yi, Yimin Liu

Title: A MARL Based Multi-Target Tracking Algorithm Under Jamming Against Radar

Abstract:
Unmanned aerial vehicles (UAVs) have played an increasingly important role in military operations and social life. Among all application scenarios, multi‑target tracking tasks accomplished by UAV swarms have received extensive attention. However, when UAVs use radar to track targets, the tracking performance can be severely compromised by jammers. To track targets in the presence of jammers, UAVs can use passive radar to position the jammer. This paper proposes a system where a UAV swarm selects the radar's active or passive work mode to track multiple differently located and potentially jammer‑carrying targets. After presenting the optimization problem and proving its solving difficulty, we use a multi‑agent reinforcement learning algorithm to solve this control problem. We also propose a mechanism based on simulated annealing algorithm to avoid cases where UAV actions violate constraints. Simulation experiments demonstrate the effectiveness of the proposed algorithm.

PaperID: 12139, https://arxiv.org/pdf/2412.12437.pdf

Authors: Reza Ahmadvand, Sarah Sharif, Yaser Banad

Title: Swarm Intelligence in Collision-free Formation Control for Multi-UAV Systems with 3D Obstacle Avoidance Maneuvers

Abstract:
Recent advances in multi‑agent systems manipulation have demonstrated a rising demand for the implementation of multi‑UAV systems in urban areas which are always subjected to the presence of static and dynamic obstacles. The focus of the presented research is on the introduction of a nature‑inspired collision‑free control for a multi‑UAV system considering obstacle avoidance maneuvers. Inspired by the collective behavior of tilapia fish and pigeon, the presented framework in this study uses a centralized controller for the optimal formation control/recovery, which is defined by probabilistic Lloyd's algorithm, while it uses a distributed controller for the intervehicle collision and obstacle avoidance. Further, the presented framework has been extended to the 3D space with 3D maneuvers. Finally, the presented framework has been applied to a multi‑UAV system in 2D and 3D scenarios, and obtained results demonstrated the validity of the presented method in the presence of buildings and different types of obstacles.

PaperID: 12140, https://arxiv.org/pdf/2412.12326.pdf

Authors: Yue Jin, Shuangqing Wei, Giovanni Montana

Title: Achieving Collective Welfare in Multi-Agent Reinforcement Learning via Suggestion Sharing

Abstract:
In human society, the conflict between self‑interest and collective well‑being often obstructs efforts to achieve shared welfare. Related concepts like the Tragedy of the Commons and Social Dilemmas frequently manifest in our daily lives. As artificial agents increasingly serve as autonomous proxies for humans, we propose a novel multi‑agent reinforcement learning (MARL) method to address this issue ‑ learning policies to maximise collective returns even when individual agents' interests conflict with the collective one. Unlike traditional cooperative MARL solutions that involve sharing rewards, values, and policies or designing intrinsic rewards to encourage agents to learn collectively optimal policies, we propose a novel MARL approach where agents exchange action suggestions. Our method reveals less private information compared to sharing rewards, values, or policies, while enabling effective cooperation without the need to design intrinsic rewards. Our algorithm is supported by our theoretical analysis that establishes a bound on the discrepancy between collective and individual objectives, demonstrating how sharing suggestions can align agents' behaviours with the collective objective. Experimental results demonstrate that our algorithm performs competitively with baselines that rely on value or policy sharing or intrinsic rewards.

PaperID: 12141, https://arxiv.org/pdf/2412.12196.pdf

Authors: Zeyu Zhang, Jianxun Lian, Chen Ma, Yaning Qu, Ye Luo, Lei Wang, Rui Li, Xu Chen, Yankai Lin, Le Wu, Xing Xie, Ji-Rong Wen

Title: TrendSim: Simulating Trending Topics in Social Media Under Poisoning Attacks with LLM-based Multi-agent System

Abstract:
Trending topics have become a significant part of modern social media, attracting users to participate in discussions of breaking events. However, they also bring in a new channel for poisoning attacks, resulting in negative impacts on society. Therefore, it is urgent to study this critical problem and develop effective strategies for defense. In this paper, we propose TrendSim, an LLM‑based multi‑agent system to simulate trending topics in social media under poisoning attacks. Specifically, we create a simulation environment for trending topics that incorporates a time‑aware interaction mechanism, centralized message dissemination, and an interactive system. Moreover, we develop LLM‑based human‑like agents to simulate users in social media, and propose prototype‑based attackers to replicate poisoning attacks. Besides, we evaluate TrendSim from multiple aspects to validate its effectiveness. Based on TrendSim, we conduct simulation experiments to study four critical problems about poisoning attacks on trending topics for social benefit.

PaperID: 12142, https://arxiv.org/pdf/2412.12166.pdf

Authors: Nikhil Mehta, Sithira Ambepitiya, Thanveer Ahamad, Dinuka Wijesundara, Yudara Kularathne

Title: Performance of a large language model-Artificial Intelligence based chatbot for counseling patients with sexually transmitted infections and genital diseases

Abstract:
Introduction: Global burden of sexually transmitted infections (STIs) is rising out of proportion to specialists. Current chatbots like ChatGPT are not tailored for handling STI‑related concerns out of the box. We developed Otiz, an Artificial Intelligence‑based (AI‑based) chatbot platform designed specifically for STI detection and counseling, and assessed its performance. Methods: Otiz employs a multi‑agent system architecture based on GPT4‑0613, leveraging large language model (LLM) and Deterministic Finite Automaton principles to provide contextually relevant, medically accurate, and empathetic responses. Its components include modules for general STI information, emotional recognition, Acute Stress Disorder detection, and psychotherapy. A question suggestion agent operates in parallel. Four STIs (anogenital warts, herpes, syphilis, urethritis/cervicitis) and 2 non‑STIs (candidiasis, penile cancer) were evaluated using prompts mimicking patient language. Each prompt was independently graded by two venereologists conversing with Otiz as patient actors on 6 criteria using Numerical Rating Scale ranging from 0 (poor) to 5 (excellent). Results: Twenty‑three venereologists did 60 evaluations of 30 prompts. Across STIs, Otiz scored highly on diagnostic accuracy (4.1‑4.7), overall accuracy (4.3‑4.6), correctness of information (5.0), comprehensibility (4.2‑4.4), and empathy (4.5‑4.8). However, relevance scores were lower (2.9‑3.6), suggesting some redundancy. Diagnostic scores for non‑STIs were lower (p=0.038). Inter‑observer agreement was strong, with differences greater than 1 point occurring in only 12.7% of paired evaluations. Conclusions: AI conversational agents like Otiz can provide accurate, correct, discrete, non‑judgmental, readily accessible and easily understandable STI‑related information in an empathetic manner, and can alleviate the burden on healthcare systems.

PaperID: 12143, https://arxiv.org/pdf/2412.12110.pdf

Authors: Saloua Zammali, Siddhant Dutta, Sadok Ben Yahia

Title: Enhancing the conformal predictability of context-aware recommendation systems by using Deep Autoencoders

Abstract:
In the field of Recommender Systems (RS), neural collaborative filtering represents a significant milestone by combining matrix factorization and deep neural networks to achieve promising results. Traditional methods like matrix factorization often rely on linear models, limiting their capability to capture complex interactions between users, items, and contexts. This limitation becomes particularly evident with high‑dimensional datasets due to their inability to capture relationships among users, items, and contextual factors. Unsupervised learning and dimension reduction tasks utilize autoencoders, neural network‑based models renowned for their capacity to encode and decode data. Autoencoders learn latent representations of inputs, reducing dataset size while capturing complex patterns and features. In this paper, we introduce a framework that combines neural contextual matrix factorization with autoencoders to predict user ratings for items. We provide a comprehensive overview of the framework's design and implementation. To evaluate its performance, we conduct experiments on various real‑world datasets and compare the results against state‑of‑the‑art approaches. We also extend the concept of conformal prediction to prediction rating and introduce a Conformal Prediction Rating (CPR). For RS, we define the nonconformity score, a key concept of conformal prediction, and demonstrate that it satisfies the exchangeability property.

PaperID: 12144, https://arxiv.org/pdf/2412.12103.pdf

Authors: Naoto Yoshida, Kingson Man

Title: Empathic Coupling of Homeostatic States for Intrinsic Prosociality

Abstract:
When regarding the suffering of others, we often experience personal distress and feel compelled to help. Inspired by living systems, we investigate the emergence of prosocial behavior among autonomous agents that are motivated by homeostatic self‑regulation. We perform multi‑agent reinforcement learning, treating each agent as a vulnerable homeostat charged with maintaining its own well‑being. We introduce an empathy‑like mechanism to share homeostatic states between agents: an agent can either \emphobserve their partner's internal state (cognitive empathy) or the agent's internal state can be \emphdirectly coupled to that of their partner's (affective empathy). In three simple multi‑agent environments, we show that prosocial behavior arises only under homeostatic coupling ‑ when the distress of a partner can affect one's own well‑being. Our findings specify the type and role of empathy in artificial agents capable of prosocial behavior.

PaperID: 12145, https://arxiv.org/pdf/2412.11761.pdf

Authors: Timothée Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, Sebastian Risi

Title: Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control

Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. Their potential to facilitate human coordination with many agents is a promising but largely under‑explored area. Such capabilities would be helpful in disaster response, urban planning, and real‑time strategy scenarios. In this work, we introduce (1) a real‑time strategy game benchmark designed to evaluate these abilities and (2) a novel framework we term HIVE. HIVE empowers a single human to coordinate swarms of up to 2,000 agents through a natural language dialog with an LLM. We present promising results on this multi‑agent benchmark, with our hybrid approach solving tasks such as coordinating agent movements, exploiting unit weaknesses, leveraging human annotations, and understanding terrain and strategic points. Our findings also highlight critical limitations of current models, including difficulties in processing spatial visual information and challenges in formulating long‑term strategic plans. This work sheds light on the potential and limitations of LLMs in human‑swarm coordination, paving the way for future research in this area. The HIVE project page, hive.syrkis.com, includes videos of the system in action.

PaperID: 12146, https://arxiv.org/pdf/2412.11713.pdf

Authors: Xuanming Zhang, Yuxuan Chen, Yiming Zheng, Zhexin Zhang, Yuan Yuan, Minlie Huang

Title: Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

Abstract:
In real world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open‑source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Block, and Distorted Handling Solution. These problems are widespread across real world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose Seeker, a multi‑agent framework inspired by expert developer strategies for exception handling. Seeker uses agents: Scanner, Detector, Predator, Ranker, and Handler to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices in real development scenarios, providing valuable insights for future improvements in code reliability.

PaperID: 12147, https://arxiv.org/pdf/2412.11678.pdf

Authors: Shuai Zhou, Shizhe Zhao, Zhongqiang Ren

Title: Loosely Synchronized Rule-Based Planning for Multi-Agent Path Finding with Asynchronous Actions

Abstract:
Multi‑Agent Path Finding (MAPF) seeks collision‑free paths for multiple agents from their respective starting locations to their respective goal locations while minimizing path costs. Although many MAPF algorithms were developed and can handle up to thousands of agents, they usually rely on the assumption that each action of the agent takes a time unit, and the actions of all agents are synchronized in a sense that the actions of agents start at the same discrete time step, which may limit their use in practice. Only a few algorithms were developed to address asynchronous actions, and they all lie on one end of the spectrum, focusing on finding optimal solutions with limited scalability. This paper develops new planners that lie on the other end of the spectrum, trading off solution quality for scalability, by finding an unbounded sub‑optimal solution for many agents. Our method leverages both search methods (LSS) in handling asynchronous actions and rule‑based planning methods (PIBT) for MAPF. We analyze the properties of our method and test it against several baselines with up to 1000 agents in various maps. Given a runtime limit, our method can handle an order of magnitude more agents than the baselines with about 25% longer makespan.

PaperID: 12148, https://arxiv.org/pdf/2412.11484.pdf

Authors: Wonje Choi, Woo Kyung Kim, SeungHyun Kim, Honguk Woo

Title: Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents

Abstract:
For embodied reinforcement learning (RL) agents interacting with the environment, it is desirable to have rapid policy adaptation to unseen visual observations, but achieving zero‑shot adaptation capability is considered as a challenging problem in the RL context. To address the problem, we present a novel contrastive prompt ensemble (ConPE) framework which utilizes a pretrained vision‑language model and a set of visual prompts, thus enabling efficient policy learning and adaptation upon a wide range of environmental and physical changes encountered by embodied agents. Specifically, we devise a guided‑attention‑based ensemble approach with multiple visual prompts on the vision‑language model to construct robust state representations. Each prompt is contrastively learned in terms of an individual domain factor that significantly affects the agent's egocentric perception and observation. For a given task, the attention‑based ensemble and policy are jointly learned so that the resulting state representations not only generalize to various domains but are also optimized for learning the task. Through experiments, we show that ConPE outperforms other state‑of‑the‑art algorithms for several embodied agent tasks including navigation in AI2THOR, manipulation in egocentric‑Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.

PaperID: 12149, https://arxiv.org/pdf/2412.11373.pdf

Authors: Matthew Stephenson, Matthew Sidji, Benoît Ronval

Title: Codenames as a Benchmark for Large Language Models

Abstract:
In this paper, we propose the use of the popular word‑based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring both a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language‑based tasks, but can still suffer in lateral thinking challenges. We evaluate the capabilities of several state‑of‑the‑art LLMs, including GPT‑4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.

PaperID: 12150, https://arxiv.org/pdf/2412.11310.pdf

Authors: Abolfazl Younesi, Mohsen Ansari, Alireza Ejlali, Mohammad Amin Fazli, Muhammad Shafique, Jörg Henkel

Title: GAP: Game Theory-Based Approach for Reliability and Power Management in Emerging Fog Computing

Abstract:
Fog computing brings about a transformative shift in data management, presenting unprecedented opportunities for enhanced performance and reduced latency. However, one of the key aspects of fog computing revolves around ensuring efficient power and reliability management. To address this challenge, we have introduced a novel model that proposes a non‑cooperative game theory‑based strategy to strike a balance between power consumption and reliability in decision‑making processes. Our proposed model capitalizes on the Cold Primary/Backup strategy (CPB) to guarantee reliability target by re‑executing tasks to different nodes when a fault occurs, while also leveraging Dynamic Voltage and Frequency Scaling (DVFS) to reduce power consumption during task execution and maximizing overall efficiency. Non‑cooperative game theory plays a pivotal role in our model, as it facilitates the development of strategies and solutions that uphold reliability while reducing power consumption. By treating the trade‑off between power and reliability as a non‑cooperative game, our proposed method yields significant energy savings, with up to a 35% reduction in energy consumption, 41% decrease in wait time, and 31% shorter completion time compared to state‑of‑the‑art approaches. Our findings underscore the value of game theory in optimizing power and reliability within fog computing environments, demonstrating its potential for driving substantial improvements

PaperID: 12151, https://arxiv.org/pdf/2412.11167.pdf

Authors: Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Zhiqing Cui, Hanqing Wang, Guisong Yang, Usman Naseem

Title: Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Abstract:
Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods struggle to adapt to unknown culture after fine‑tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi‑agent framework that redefines cultural alignment as an adaptive "color‑blending" process for country‑specific adaptation. Our approach harnesses cultural geography across five continents through three key steps: First, we synthesize the Pentachromatic Cultural Palette Dataset using GPT‑4o, refining continental‑level dialogues with Hofstede's cultural dimensions to establish foundational cultural representations. Second, five continent‑level alignment agents form specialized cultural communities that generate region‑specific draft responses. Third, a Meta Agent employs Cultural MoErges to dynamically blend these cultural "colors" through attention‑gated parameter merging, akin to mixing pigments on a palette, resolving conflicts while preserving cultural nuances to produce the final culturally‑aligned response. Extensive experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.

PaperID: 12152, https://arxiv.org/pdf/2412.11014.pdf

Authors: Zhendong Mi, Renming Zheng, Haowen Zhong, Yue Sun, Seth Kneeland, Sayan Moitra, Ken Kutzer, Zhaozhuo Xu Shaoyi Huang

Title: CoopetitiveV: Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality Verilog Generation

Abstract:
Recent advances in agentic LLMs have demonstrated great capabilities in Verilog code generation. However, existing approaches either use LLM‑assisted single‑agent prompting or cooperation‑only multi‑agent learning, which will lead to: (i) Degeneration issue for single‑agent learning: characterized by diminished error detection and correction capabilities; (ii) Error propagation in cooperation‑only multi‑agent learning: erroneous information from the former agent will be propagated to the latter through prompts, which can make the latter agents generate buggy code. In this paper, we propose an LLM‑based coopetitive multi‑agent prompting framework, in which the agents cannot collaborate with each other to form the generation pipeline, but also create a healthy competitive mechanism to improve the generating quality. Our experimental results show that the coopetitive multi‑agent framework can effectively mitigate the degeneration risk and reduce the error propagation while improving code error correction capabilities, resulting in higher quality Verilog code generation. The effectiveness of our approach is validated through extensive experiments. On VerilogEval Machine and Human dataset, CoopetitiveV+GPT‑4 achieves 99.2% and 99.1% pass@10 scores, respectively. While on RTLLM, CoopetitiveV+GPT‑4 obtains 100% syntax and 99.9% functionality pass@5 scores.

PaperID: 12153, https://arxiv.org/pdf/2412.11006.pdf

Authors: Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang

Title: Entropy-Regularized Process Reward Model

Abstract:
Large language models (LLMs) have shown promise in performing complex multi‑step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy‑regularized process reward model (ER‑PRM) that integrates KL‑regularized Markov Decision Processes (MDP) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on the theoretical results. Our theoretical analysis shows that we could derive the optimal reward model from the initial policy sampling. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER‑PRM consistently outperforms existing process reward models, achieving 1% improvement on GSM8K and 2‑3% improvement on MATH under best‑of‑N evaluation, and more than 1% improvement under RLHF. These results highlight the efficacy of entropy‑regularization in enhancing LLMs' reasoning capabilities.

PaperID: 12154, https://arxiv.org/pdf/2412.10704.pdf

Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha

Title: VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Abstract:
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document‑grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi‑document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi‑step reasoning process encompassing evidence curation and chain‑of‑thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency‑constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open‑source and proprietary large language models, we benchmark state‑of‑the‑art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long‑context LLM baselines for end‑to‑end multimodal document QA by 12‑20%.

PaperID: 12155, https://arxiv.org/pdf/2412.10700.pdf

Authors: Zhiying Wang, Gang Sun, Yuhui Wang, Hongfang Yu, Dusit Niyato

Title: Cluster-Based Multi-Agent Task Scheduling for Space-Air-Ground Integrated Networks

Abstract:
The Space‑Air‑Ground Integrated Network (SAGIN) framework is a crucial foundation for future networks, where satellites and aerial nodes assist in computational task offloading. The low‑altitude economy, leveraging the flexibility and multifunctionality of Unmanned Aerial Vehicles (UAVs) in SAGIN, holds significant potential for development in areas such as communication and sensing. However, effective coordination is needed to streamline information exchange and enable efficient system resource allocation. In this paper, we propose a Clustering‑based Multi‑agent Deep Deterministic Policy Gradient (CMADDPG) algorithm to address the multi‑UAV cooperative task scheduling challenges in SAGIN. The CMADDPG algorithm leverages dynamic UAV clustering to partition UAVs into clusters, each managed by a Cluster Head (CH) UAV, facilitating a distributed‑centralized control approach. Within each cluster, UAVs delegate offloading decisions to the CH UAV, reducing intra‑cluster communication costs and decision conflicts, thereby enhancing task scheduling efficiency. Additionally, by employing a multi‑agent reinforcement learning framework, the algorithm leverages the extensive coverage of satellites to achieve centralized training and distributed execution of multi‑agent tasks, while maximizing overall system profit through optimized task offloading decision‑making. Simulation results reveal that the CMADDPG algorithm effectively optimizes resource allocation, minimizes queue delays, maintains balanced load distribution, and surpasses existing methods by achieving at least a 25% improvement in system profit, showcasing its robustness and adaptability across diverse scenarios.

PaperID: 12156, https://arxiv.org/pdf/2412.10636.pdf

Authors: Yixuan Even Xu, Zhe Feng, Fei Fang

Title: Deviate or Not: Learning Coalition Structures with Multiple-bit Observations in Games

Abstract:
We consider the Coalition Structure Learning (CSL) problem in multi‑agent systems, motivated by the existence of coalitions in many real‑world systems, e.g., trading platforms and auction systems. In this problem, there is a hidden coalition structure within a set of n agents, which affects the behavior of the agents in games. Our goal is to actively design a sequence of games for the agents to play, such that observations in these games can be used to learn the hidden coalition structure. In particular, we consider the setting where in each round, we design and present a game together with a strategy profile to the agents, and receive a multiple‑bit observation ‑‑ for each agent, we observe whether or not they would like to deviate from the specified strategy. We show that we can learn the coalition structure in O(\log n) rounds if we are allowed to design any normal‑form game, matching the information‑theoretical lower bound. For practicality, we extend the result to settings where we can only choose games of a specific format, and design algorithms to learn the coalition structure in these settings. For most settings, our complexity matches the theoretical lower bound up to a constant factor.

PaperID: 12157, https://arxiv.org/pdf/2412.10609.pdf

Authors: Carmengelys Cordova, Joaquin Taverner, Elena Del Val, Estefania Argente

Title: A systematic review of norm emergence in multi-agent systems

Abstract:
Multi‑agent systems (MAS) have gained relevance in the field of artificial intelligence by offering tools for modelling complex environments where autonomous agents interact to achieve common or individual goals. In these systems, norms emerge as a fundamental component to regulate the behaviour of agents, promoting cooperation, coordination and conflict resolution. This article presents a systematic review, following the PRISMA method, on the emergence of norms in MAS, exploring the main mechanisms and factors that influence this process. Sociological, structural, emotional and cognitive aspects that facilitate the creation, propagation and reinforcement of norms are addressed. The findings highlight the crucial role of social network topology, as well as the importance of emotions and shared values in the adoption and maintenance of norms. Furthermore, opportunities are identified for future research that more explicitly integrates emotional and ethical dynamics in the design of adaptive normative systems. This work provides a comprehensive overview of the current state of research on norm emergence in MAS, serving as a basis for advancing the development of more efficient and flexible systems in artificial and real‑world contexts.

PaperID: 12158, https://arxiv.org/pdf/2412.10442.pdf

Authors: Ching-Chun Chang, Isao Echizen

Title: Steganography in Game Actions

Abstract:
The exchange of messages has always carried with it the timeless challenge of secrecy. From whispers in shadows to the enigmatic notes written in the margins of history, humanity has long sought ways to convey thoughts that remain imperceptible to all but the chosen few. The challenge of subliminal communication has been addressed in various forms of steganography. However, the field faces a fundamental paradox: as the art of concealment advances, so too does the science of revelation, leading to an ongoing evolutionary interplay. This study seeks to extend the boundaries of what is considered a viable steganographic medium. We explore a steganographic paradigm, in which hidden information is communicated through the episodes of multiple agents interacting with an environment. Each agent, acting as an encoder, learns a policy to disguise the very existence of hidden messages within actions seemingly directed toward innocent objectives. Meanwhile, an observer, serving as a decoder, learns to associate behavioural patterns with their respective agents despite their dynamic nature, thereby unveiling the hidden messages. The interactions of agents are governed by the framework of multi‑agent reinforcement learning and shaped by feedback from the observer. This framework encapsulates a game‑theoretic dilemma, wherein agents face decisions between cooperating to create distinguishable behavioural patterns or defecting to pursue individually optimal yet potentially overlapping episodic actions. As a proof of concept, we exemplify action steganography through the game of labyrinth, a navigation task where subliminal communication is concealed within the act of steering toward a destination, and systematically validate the stego‑system in terms of distortion, capacity, secrecy and robustness when subjected to simulated passive and active adversaries.

PaperID: 12159, https://arxiv.org/pdf/2412.10422.pdf

Authors: Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du

Title: AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

Abstract:
Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine‑readable formats. Many of these tables are derived from web sources or real‑world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question‑ware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question‑aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)‑based multiagent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high‑level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low‑level code. Executor: Executes the generated code to process the table. To support this multi‑agent framework, we design a novel Chain‑ofClauses reasoning mechanism for high‑level operation suggestion, and a tool‑augmented method for low‑level code generation.

PaperID: 12160, https://arxiv.org/pdf/2412.10270.pdf

Authors: Aron Vallinder, Edward Hughes

Title: Cultural Evolution of Cooperation among LLM Agents

Abstract:
Large language models (LLMs) provide a compelling foundation for building generally‑capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI‑accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT‑4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT‑4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society.

PaperID: 12161, https://arxiv.org/pdf/2412.10138.pdf

Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye

Title: ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

Abstract:
Despite the significant advancements in Text‑to‑SQL (Text2SQL) facilitated by large language models (LLMs), the latest state‑of‑the‑art techniques are still trapped in the in‑context learning of closed‑source LLMs (e.g., GPT‑4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open‑source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi‑task supervised fine‑tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT‑based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high‑quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL‑related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in‑depth analyses have been performed on eight open‑source LLMs and five widely‑used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

PaperID: 12162, https://arxiv.org/pdf/2412.10133.pdf

Authors: Islem Bouzenia, Michael Pradel

Title: You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

Abstract:
The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)‑based agent that autonomously executes commands and interacts with the host system. The agent uses meta‑prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open‑source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.

PaperID: 12163, https://arxiv.org/pdf/2412.10083.pdf

Authors: Dolev Mutzari, Yonatan Aumann, Sarit Kraus

Title: Heterogeneous Multi-Robot Graph Coverage with Proximity and Movement Constraints

Abstract:
Multi‑Robot Coverage problems have been extensively studied in robotics, planning and multi‑agent systems. In this work, we consider the coverage problem when there are constraints on the proximity (e.g., maximum distance between the agents, or a blue agent must be adjacent to a red agent) and the movement (e.g., terrain traversability and material load capacity) of the robots. Such constraints naturally arise in many real‑world applications, e.g. in search‑and‑rescue and maintenance operations. Given such a setting, the goal is to compute a covering tour of the graph with a minimum number of steps, and that adheres to the proximity and movement constraints. For this problem, our contributions are four: (i) a formal formulation of the problem, (ii) an exact algorithm that is FPT in F, d and tw, the set of robot formations that encode the proximity constraints, the maximum nodes degree, and the tree‑width of the graph, respectively, (iii) for the case that the graph is a tree: a PTAS approximation scheme, that given an approximation parameter epsilon, produces a tour that is within a epsilon times error(||F||, d) of the optimal one, and the computation runs in time poly(n) times h(1/epsilon,||F||). (iv) for the case that the graph is a tree, with k=3 robots, and the constraint is that all agents are connected: a PTAS scheme with multiplicative approximation error of 1+O(epsilon), independent of the maximal degree d.

PaperID: 12164, https://arxiv.org/pdf/2412.09877.pdf

Authors: Ye Zhang, Linyue Chu, Letian Xu, Kangtong Mo, Zhengjian Kang, Xingyu Zhang

Title: Optimized Coordination Strategy for Multi-Aerospace Systems in Pick-and-Place Tasks By Deep Neural Network

Abstract:
In this paper, we present an advanced strategy for the coordinated control of a multi‑agent aerospace system, utilizing Deep Neural Networks (DNNs) within a reinforcement learning framework. Our approach centers on optimizing autonomous task assignment to enhance the system's operational efficiency in object relocation tasks, framed as an aerospace‑oriented pick‑and‑place scenario. By modeling this coordination challenge within a MuJoCo environment, we employ a deep reinforcement learning algorithm to train a DNN‑based policy to maximize task completion rates across the multi‑agent system. The objective function is explicitly designed to maximize effective object transfer rates, leveraging neural network capabilities to handle complex state and action spaces in high‑dimensional aerospace environments. Through extensive simulation, we benchmark the proposed method against a heuristic combinatorial approach rooted in game‑theoretic principles, demonstrating a marked performance improvement, with the trained policy achieving up to 16% higher task efficiency. Experimental validation is conducted on a multi‑agent hardware setup to substantiate the efficacy of our approach in a real‑world aerospace scenario.

PaperID: 12165, https://arxiv.org/pdf/2412.09719.pdf

Authors: Johann Schmidt, Frank Dreyer, Sayed Abid Hashimi, Sebastian Stober

Title: TransferLight: Zero-Shot Traffic Signal Control on any Road-Network

Abstract:
Traffic signal control plays a crucial role in urban mobility. However, existing methods often struggle to generalize beyond their training environments to unseen scenarios with varying traffic dynamics. We present TransferLight, a novel framework designed for robust generalization across road‑networks, diverse traffic conditions and intersection geometries. At its core, we propose a log‑distance reward function, offering spatially‑aware signal prioritization while remaining adaptable to varied lane configurations ‑ overcoming the limitations of traditional pressure‑based rewards. Our hierarchical, heterogeneous, and directed graph neural network architecture effectively captures granular traffic dynamics, enabling transferability to arbitrary intersection layouts. Using a decentralized multi‑agent approach, global rewards, and novel state transition priors, we develop a single, weight‑tied policy that scales zero‑shot to any road network without re‑training. Through domain randomization during training, we additionally enhance generalization capabilities. Experimental results validate TransferLight's superior performance in unseen scenarios, advancing practical, generalizable intelligent transportation systems to meet evolving urban traffic demands.

PaperID: 12166, https://arxiv.org/pdf/2412.09701.pdf

Authors: Amanda Rios, Ibrahima Ndiour, Parual Datta, Jerry Sydir, Omesh Tickoo, Nilesh Ahuja

Title: CUAL: Continual Uncertainty-aware Active Learner

Abstract:
AI deployed in many real‑world use cases should be capable of adapting to novelties encountered after deployment. Here, we consider a challenging, under‑explored and realistic continual adaptation problem: a deployed AI agent is continuously provided with unlabeled data that may contain not only unseen samples of known classes but also samples from novel (unknown) classes. In such a challenging setting, it has only a tiny labeling budget to query the most informative samples to help it continuously learn. We present a comprehensive solution to this complex problem with our model "CUAL" (Continual Uncertainty‑aware Active Learner). CUAL leverages an uncertainty estimation algorithm to prioritize active labeling of ambiguous (uncertain) predicted novel class samples while also simultaneously pseudo‑labeling the most certain predictions of each class. Evaluations across multiple datasets, ablations, settings and backbones (e.g. ViT foundation model) demonstrate our method's effectiveness. We will release our code upon acceptance.

PaperID: 12167, https://arxiv.org/pdf/2412.09572.pdf

Authors: Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, Dan Roth

Title: Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty

Abstract:
Quantifying uncertainty in black‑box LLMs is vital for reliable responses and scalable oversight. Existing methods, which gauge a model's uncertainty through evaluating self‑consistency in responses to the target query, can be misleading: an LLM may confidently provide an incorrect answer to a target query, yet give a confident and accurate answer to that same target query when answering a knowledge‑preserving perturbation of the query. We systematically analyze the model behaviors and demonstrate that this discrepancy stems from suboptimal retrieval of parametric knowledge, often due to contextual biases that prevent consistent access to stored knowledge. We then introduce DiverseAgentEntropy, a novel, theoretically‑grounded method employing multi‑agent interaction across diverse query variations for uncertainty estimation of black‑box LLMs. This approach more accurately assesses an LLM's true uncertainty and improves hallucination detection, outperforming existing self‑consistency based techniques.

PaperID: 12168, https://arxiv.org/pdf/2412.09429.pdf

Authors: Yi Luo, Linghang Shi, Yihao Li, Aobo Zhuang, Yeyun Gong, Ling Liu, Chen Lin

Title: From Intention To Implementation: Automating Biomedical Research via LLMs

Abstract:
Conventional biomedical research is increasingly labor‑intensive due to the exponential growth of scientific literature and datasets. Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps. Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements. This paper introduces BioResearcher, the first end‑to‑end automated system designed to streamline the entire biomedical research process involving dry lab experiments. BioResearcher employs a modular multi‑agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming. By decomposing complex tasks into logically related sub‑tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity. Furthermore, BioResearcher incorporates an LLM‑based reviewer for in‑process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols. BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives. The generated protocols, on average, outperform typical agent systems by 22.0% on five quality metrics. The system demonstrates significant potential to reduce researchers' workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.

PaperID: 12169, https://arxiv.org/pdf/2412.09417.pdf

Authors: Adam Labiosa, Zhihan Wang, Siddhant Agarwal, William Cong, Geethika Hemkumar, Abhinav Narayan Harish, Benjamin Hong, Josh Kelle, Chen Li, Yuhao Li, Zisen Shao, Peter Stone, Josiah P. Hanna

Title: Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Abstract:
Robot decision‑making in partially observable, real‑time, dynamic, and multi‑agent environments remains a difficult and unsolved challenge. Model‑free reinforcement learning (RL) is a promising approach to learning decision‑making in such domains, however, end‑to‑end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi‑fidelity sim2real approach and decomposing behavior into learned sub‑behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system's architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL‑based behaviors can be integrated into complete robot behavior architectures.

PaperID: 12170, https://arxiv.org/pdf/2412.09237.pdf

Authors: Yijun Liu, Wu Liu, Xiaoyan Gu, Yong Rui, Xiaodong He, Yongdong Zhang

Title: LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation

Abstract:
The believable simulation of multi‑user behavior is crucial for understanding complex social systems. Recently, large language models (LLMs)‑based AI agents have made significant progress, enabling them to achieve human‑like intelligence across various tasks. However, real human societies are often dynamic and complex, involving numerous individuals engaging in multimodal interactions. In this paper, taking e‑commerce scenarios as an example, we present LMAgent, a very large‑scale and multimodal agents society based on multimodal LLMs. In LMAgent, besides freely chatting with friends, the agents can autonomously browse, purchase, and review products, even perform live streaming e‑commerce. To simulate this complex system, we introduce a self‑consistency prompting mechanism to augment agents' multimodal capabilities, resulting in significantly improved decision‑making performance over the existing multi‑agent system. Moreover, we propose a fast memory mechanism combined with the small‑world model to enhance system efficiency, which supports more than 10,000 agent simulations in a society. Experiments on agents' behavior show that these agents achieve comparable performance to humans in behavioral indicators. Furthermore, compared with the existing LLMs‑based multi‑agent system, more different and valuable phenomena are exhibited, such as herd behavior, which demonstrates the potential of LMAgent in credible large‑scale social behavior simulations.

PaperID: 12171, https://arxiv.org/pdf/2412.09117.pdf

Authors: Wanli Ni, Ruyu Luo, Xinran Zhang, Peng Wang, Wen Wang, Hui Tian

Title: Reconfigurable Intelligent Surface for Internet of Robotic Things

Abstract:
With the rapid development of artificial intelligence, robotics, and Internet of Things, multi‑robot systems are progressively acquiring human‑like environmental perception and understanding capabilities, empowering them to complete complex tasks through autonomous decision‑making and interaction. However, the Internet of Robotic Things (IoRT) faces significant challenges in terms of spectrum resources, sensing accuracy, communication latency, and energy supply. To address these issues, a reconfigurable intelligent surface (RIS)‑aided IoRT network is proposed to enhance the overall performance of robotic communication, sensing, computation, and energy harvesting. In the case studies, by jointly optimizing parameters such as transceiver beamforming, robot trajectories, and RIS coefficients, solutions based on multi‑agent deep reinforcement learning and multi‑objective optimization are proposed to solve problems such as beamforming design, path planning, target sensing, and data aggregation. Numerical results are provided to demonstrate the effectiveness of proposed solutions in improve communication quality, sensing accuracy, computation error, and energy efficiency of RIS‑aided IoRT networks.

PaperID: 12172, https://arxiv.org/pdf/2412.08875.pdf

Authors: Bo Yu, Jiangning Wei, Minzhen Hu, Zejie Han, Tianjian Zou, Ye He, Jun Liu

Title: Brain-inspired AI Agent: The Way Towards AGI

Abstract:
Artificial General Intelligence (AGI), widely regarded as the fundamental goal of artificial intelligence, represents the realization of cognitive capabilities that enable the handling of general tasks with human‑like proficiency. Researchers in brain‑inspired AI seek inspiration from the operational mechanisms of the human brain, aiming to replicate its functional rules in intelligent models. Moreover, with the rapid development of large‑scale models in recent years, the concept of agents has garnered increasing attention, with researchers widely recognizing it as a necessary pathway toward achieving AGI. In this article, we propose the concept of a brain‑inspired AI agent and analyze how to extract relatively feasible and agent‑compatible cortical region functionalities and their associated functional connectivity networks from the complex mechanisms of the human brain. Implementing these structures within an agent enables it to achieve basic cognitive intelligence akin to human capabilities. Finally, we explore the limitations and challenges for realizing brain‑inspired agents and discuss their future development.

PaperID: 12173, https://arxiv.org/pdf/2412.08845.pdf

Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung

Title: Quantum-Train-Based Distributed Multi-Agent Reinforcement Learning

Abstract:
In this paper, we introduce Quantum‑Train‑Based Distributed Multi‑Agent Reinforcement Learning (Dist‑QTRL), a novel approach to addressing the scalability challenges of traditional Reinforcement Learning (RL) by integrating quantum computing principles. Quantum‑Train Reinforcement Learning (QTRL) leverages parameterized quantum circuits to efficiently generate neural network parameters, achieving a $poly(\log(N))$ reduction in the dimensionality of trainable parameters while harnessing quantum entanglement for superior data representation. The framework is designed for distributed multi‑agent environments, where multiple agents, modeled as Quantum Processing Units (QPUs), operate in parallel, enabling faster convergence and enhanced scalability. Additionally, the Dist‑QTRL framework can be extended to high‑performance computing (HPC) environments by utilizing distributed quantum training for parameter reduction in classical neural networks, followed by inference using classical CPUs or GPUs. This hybrid quantum‑HPC approach allows for further optimization in real‑world applications. In this paper, we provide a mathematical formulation of the Dist‑QTRL framework and explore its convergence properties, supported by empirical results demonstrating performance improvements over centric QTRL models. The results highlight the potential of quantum‑enhanced RL in tackling complex, high‑dimensional tasks, particularly in distributed computing settings, where our framework achieves significant speedups through parallelization without compromising model accuracy. This work paves the way for scalable, quantum‑enhanced RL systems in practical applications, leveraging both quantum and classical computational resources.

PaperID: 12174, https://arxiv.org/pdf/2412.08581.pdf

Authors: Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Xiwei Xu, Qinghua Lu, Liming Zhu

Title: Automated Soap Opera Testing Directed by LLMs and Scenario Knowledge: Feasibility, Challenges, and Road Ahead

Abstract:
Exploratory testing (ET) harnesses tester's knowledge, creativity, and experience to create varying tests that uncover unexpected bugs from the end‑user's perspective. Although ET has proven effective in system‑level testing of interactive systems, the need for manual execution has hindered large‑scale adoption. In this work, we explore the feasibility, challenges and road ahead of automated scenario‑based ET (a.k.a soap opera testing). We conduct a formative study, identifying key insights for effective manual soap opera testing and challenges in automating the process. We then develop a multi‑agent system leveraging LLMs and a Scenario Knowledge Graph (SKG) to automate soap opera testing. The system consists of three multi‑modal agents, Planner, Player, and Detector that collaborate to execute tests and identify potential bugs. Experimental results demonstrate the potential of automated soap opera testing, but there remains a significant gap compared to manual execution, especially under‑explored scenario boundaries and incorrectly identified bugs. Based on the observation, we envision road ahead for the future of automated soap opera testing, focusing on three key aspects: the synergy of neural and symbolic approaches, human‑AI co‑learning, and the integration of soap opera testing with broader software engineering practices. These insights aim to guide and inspire the future research.

PaperID: 12175, https://arxiv.org/pdf/2412.08562.pdf

Authors: Leandro Parada, Hanlin Tian, Jose Escribano, Panagiotis Angeloudis

Title: An End-to-End Collaborative Learning Approach for Connected Autonomous Vehicles in Occluded Scenarios

Abstract:
Collaborative navigation becomes essential in situations of occluded scenarios in autonomous driving where independent driving policies are likely to lead to collisions. One promising approach to address this issue is through the use of Vehicle‑to‑Vehicle (V2V) networks that allow for the sharing of perception information with nearby agents, preventing catastrophic accidents. In this article, we propose a collaborative control method based on a V2V network for sharing compressed LiDAR features and employing Proximal Policy Optimisation to train safe and efficient navigation policies. Unlike previous approaches that rely on expert data (behaviour cloning), our proposed approach learns the multi‑agent policies directly from experience in the occluded environment, while effectively meeting bandwidth limitations. The proposed method first prepossesses LiDAR point cloud data to obtain meaningful features through a convolutional neural network and then shares them with nearby CAVs to alert for potentially dangerous situations. To evaluate the proposed method, we developed an occluded intersection gym environment based on the CARLA autonomous driving simulator, allowing real‑time data sharing among agents. Our experimental results demonstrate the consistent superiority of our collaborative control method over an independent reinforcement learning method and a cooperative early fusion method.

PaperID: 12176, https://arxiv.org/pdf/2412.08522.pdf

Authors: Yujin Kim, Sol Choi, Bum-Jae You, Keunwoo Jang, Yisoo Lee

Title: Subspace-wise Hybrid RL for Articulated Object Manipulation

Abstract:
Articulated object manipulation is a challenging task, requiring constrained motion and adaptive control to handle the unknown dynamics of the manipulated objects. While reinforcement learning (RL) has been widely employed to tackle various scenarios and types of articulated objects, the complexity of these tasks, stemming from multiple intertwined objectives makes learning a control policy in the full task space highly difficult. To address this issue, we propose a Subspace‑wise hybrid RL (SwRL) framework that learns policies for each divided task space, or subspace, based on independent objectives. This approach enables adaptive force modulation to accommodate the unknown dynamics of objects. Additionally, it effectively leverages the previously underlooked redundant subspace, thereby maximizing the robot's dexterity. Our method enhances both learning efficiency and task execution performance, as validated through simulations and real‑world experiments. Supplementary video is available at https://youtu.be/PkNxv0P8Atk

PaperID: 12177, https://arxiv.org/pdf/2412.08445.pdf

Authors: Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, Quaizar Vohra

Title: TapeAgents: a Holistic Framework for Agent Development and Optimization

Abstract:
We present TapeAgents, an agent framework built around a granular, structured log tape of the agent session that also plays the role of the session's resumable state. In TapeAgents we leverage tapes to facilitate all stages of the LLM Agent development lifecycle. The agent reasons by processing the tape and the LLM output to produce new thought and action steps and append them to the tape. The environment then reacts to the agent's actions by likewise appending observation steps to the tape. By virtue of this tape‑centred design, TapeAgents can provide AI practitioners with holistic end‑to‑end support. At the development stage, tapes facilitate session persistence, agent auditing, and step‑by‑step debugging. Post‑deployment, one can reuse tapes for evaluation, fine‑tuning, and prompt‑tuning; crucially, one can adapt tapes from other agents or use revised historical tapes. In this report, we explain the TapeAgents design in detail. We demonstrate possible applications of TapeAgents with several concrete examples of building monolithic agents and multi‑agent teams, of optimizing agent prompts and finetuning the agent's LLM. We present tooling prototypes and report a case study where we use TapeAgents to finetune a Llama‑3.1‑8B form‑filling assistant to perform as well as GPT‑4o while being orders of magnitude cheaper. Lastly, our comparative analysis shows that TapeAgents's advantages over prior frameworks stem from our novel design of the LLM agent as a resumable, modular state machine with a structured configuration, that generates granular, structured logs and that can transform these logs into training text ‑‑ a unique combination of features absent in previous work.

PaperID: 12178, https://arxiv.org/pdf/2412.08442.pdf

Authors: Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev

Title: From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Abstract:
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi‑embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross‑domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark‑specific approaches.

PaperID: 12179, https://arxiv.org/pdf/2412.08138.pdf

Authors: Yuchang Sun, Xinran Li, Tao Lin, Jun Zhang

Title: Learn How to Query from Unlabeled Data Streams in Federated Learning

Abstract:
Federated learning (FL) enables collaborative learning among decentralized clients while safeguarding the privacy of their local data. Existing studies on FL typically assume offline labeled data available at each client when the training starts. Nevertheless, the training data in practice often arrive at clients in a streaming fashion without ground‑truth labels. Given the expensive annotation cost, it is critical to identify a subset of informative samples for labeling on clients. However, selecting samples locally while accommodating the global training objective presents a challenge unique to FL. In this work, we tackle this conundrum by framing the data querying process in FL as a collaborative decentralized decision‑making problem and proposing an effective solution named LeaDQ, which leverages multi‑agent reinforcement learning algorithms. In particular, under the implicit guidance from global information, LeaDQ effectively learns the local policies for distributed clients and steers them towards selecting samples that can enhance the global model's accuracy. Extensive simulations on image and text tasks show that LeaDQ advances the model performance in various FL scenarios, outperforming the benchmarking algorithms.

PaperID: 12180, https://arxiv.org/pdf/2412.08054.pdf

Authors: Panlong Wu, Kangshuo Li, Junbao Nan, Fangxin Wang

Title: Federated In-Context LLM Agent Learning

Abstract:
Large Language Models (LLMs) have revolutionized intelligent services by enabling logical reasoning, tool use, and interaction with external systems as agents. The advancement of LLMs is frequently hindered by the scarcity of high‑quality data, much of which is inherently sensitive. Federated learning (FL) offers a potential solution by facilitating the collaborative training of distributed LLMs while safeguarding private data. However, FL frameworks face significant bandwidth and computational demands, along with challenges from heterogeneous data distributions. The emerging in‑context learning capability of LLMs offers a promising approach by aggregating natural language rather than bulky model parameters. Yet, this method risks privacy leakage, as it necessitates the collection and presentation of data samples from various clients during aggregation. In this paper, we propose a novel privacy‑preserving Federated In‑Context LLM Agent Learning (FICAL) algorithm, which to our best knowledge for the first work unleashes the power of in‑context learning to train diverse LLM agents through FL. In our design, knowledge compendiums generated by a novel LLM‑enhanced Knowledge Compendiums Generation (KCG) module are transmitted between clients and the server instead of model parameters in previous FL methods. Apart from that, an incredible Retrieval Augmented Generation (RAG) based Tool Learning and Utilizing (TLU) module is designed and we incorporate the aggregated global knowledge compendium as a teacher to teach LLM agents the usage of tools. We conducted extensive experiments and the results show that FICAL has competitive performance compared to other SOTA baselines with a significant communication cost decrease of \mathbf3.33×10^5 times.

PaperID: 12181, https://arxiv.org/pdf/2412.08014.pdf

Authors: Yun Xing, Nhat Chung, Jie Zhang, Yue Cao, Ivor Tsang, Yang Liu, Lei Ma, Qing Guo

Title: MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents

Abstract:
Physical adversarial attacks in driving scenarios can expose critical vulnerabilities in visual perception models. However, developing such attacks remains challenging due to diverse real‑world environments and the requirement for maintaining visual naturality. Building upon this challenge, we reformulate physical adversarial attacks as a one‑shot patch generation problem. Our approach generates adversarial patches through a deep generative model that considers the specific scene context, enabling direct physical deployment in matching environments. The primary challenge lies in simultaneously achieving two objectives: generating adversarial patches that effectively mislead object detection systems while determining contextually appropriate deployment within the scene. We propose MAGIC (Mastering Physical Adversarial Generation In Context), a novel framework powered by multi‑modal LLM agents to address these challenges. MAGIC automatically understands scene context and generates adversarial patch through the synergistic interaction of language and vision capabilities. In particular, MAGIC orchestrates three specialized LLM agents: The adv‑patch generation agent (GAgent) masters the creation of deceptive patches through strategic prompt engineering for text‑to‑image models. The adv‑patch deployment agent (DAgent) ensures contextual coherence by determining optimal deployment strategies based on scene understanding. The self‑examination agent (EAgent) completes this trilogy by providing critical oversight and iterative refinement of both processes. We validate our method on both digital and physical levels, i.e., nuImage and manually captured real‑world scenes, where both statistical and visual results prove that our MAGIC is powerful and effective for attacking widely applied object detection systems, i.e., YOLO and DETR series.

PaperID: 12182, https://arxiv.org/pdf/2412.07981.pdf

Authors: Guang Hu, Tim Miller, Nir Lipovetzky

Title: Where Common Knowledge Cannot Be Formed, Common Belief Can -- Planning with Multi-Agent Belief Using Group Justified Perspectives

Abstract:
Epistemic planning is the sub‑field of AI planning that focuses on changing knowledge and belief. It is important in both multi‑agent domains where agents need to have knowledge/belief regarding the environment, but also the beliefs of other agents, including nested beliefs. When modeling knowledge in multi‑agent settings, many models face an exponential growth challenge in terms of nested depth. A contemporary method, known as Planning with Perspectives (PWP), addresses these challenges through the use of perspectives and set operations for knowledge. The JP model defines that an agent's belief is justified if and only if the agent has seen evidence that this belief was true in the past and has not seen evidence to suggest that this has changed. The current paper extends the JP model to handle \emphgroup belief, including distributed belief and common belief. We call this the Group Justified Perspective (GJP) model. Using experimental problems crafted by adapting well‑known benchmarks to a group setting, we show the efficiency and expressiveness of our GJP model at handling planning problems that cannot be handled by other epistemic planning tools.

PaperID: 12183, https://arxiv.org/pdf/2412.07977.pdf

Authors: Stefan Dernbach, Alejandro Michel, Khushbu Agarwal, Christopher Brissette, Geetika Gupta, Sutanay Choudhury

Title: Thinking Fast and Laterally: Multi-Agentic Approach for Reasoning about Uncertain Emerging Events

Abstract:
This paper introduces lateral thinking to implement System‑2 reasoning capabilities in AI systems, focusing on anticipatory and causal reasoning under uncertainty. We present a framework for systematic generation and modeling of lateral thinking queries and evaluation datasets. We introduce Streaming Agentic Lateral Thinking (SALT), a multi‑agent framework designed to process complex, low‑specificity queries in streaming data environments. SALT implements lateral thinking‑inspired System‑2 reasoning through a dynamic communication structure between specialized agents. Our key insight is that lateral information flow across long‑distance agent interactions, combined with fine‑grained belief management, yields richer information contexts and enhanced reasoning. Preliminary quantitative and qualitative evaluations indicate SALT's potential to outperform single‑agent systems in handling complex lateral reasoning tasks in a streaming environment.

PaperID: 12184, https://arxiv.org/pdf/2412.07941.pdf

Authors: Guang Hu, Weijia Li, Yangmengfei Xu

Title: Beyond Static Assumptions: the Predictive Justified Perspective Model for Epistemic Planning

Abstract:
Epistemic Planning (EP) is an important research area dedicated to reasoning about the knowledge and beliefs of agents in multi‑agent cooperative or adversarial settings. The Justified Perspective (JP) model is the state‑of‑the‑art approach to solving EP problems with efficiency and expressiveness. However, all existing EP methods inherit the static environment assumption from classical planning. This limitation hinders the application of EP in fields such as robotics with multi‑agent settings, where the environment contains changing variables. In this paper, we propose an extension of the JP model, namely, the Predictive Justified Perspective (PJP) model, to remove this assumption. Instead of assuming that beliefs remain unchanged since the last observation, the PJP model uses all past observations to form predictions about the changing variables. The definition of the prediction function with examples is provided, and it is demonstrated that it can work with arbitrary nesting. We then implemented the PJP model in several well‑known domains and compared it with the JP model in the experiments. The results indicated that the PJP model performs exceptionally well across various domains, demonstrating its potential in improving EP applications in robotics.

PaperID: 12185, https://arxiv.org/pdf/2412.07880.pdf

Authors: Yunfan Zhao, Niclas Boehmer, Aparna Taneja, Milind Tambe

Title: Towards Foundation-model-based Multiagent System to Accelerate AI for Social Impact

Abstract:
AI for social impact (AI4SI) offers significant potential for addressing complex societal challenges in areas such as public health, agriculture, education, conservation, and public safety. However, existing AI4SI research is often labor‑intensive and resource‑demanding, limiting its accessibility and scalability; the standard approach is to design a (base‑level) system tailored to a specific AI4SI problem. We propose the development of a novel meta‑level multi‑agent system designed to accelerate the development of such base‑level systems, thereby reducing the computational cost and the burden on social impact domain experts and AI researchers. Leveraging advancements in foundation models and large language models, our proposed approach focuses on resource allocation problems providing help across the full AI4SI pipeline from problem formulation over solution design to impact evaluation. We highlight the ethical considerations and challenges inherent in deploying such systems and emphasize the importance of a human‑in‑the‑loop approach to ensure the responsible and effective application of AI systems.

PaperID: 12186, https://arxiv.org/pdf/2412.07822.pdf

Authors: Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, Jishen Zhao

Title: MAGE: A Multi-Agent Engine for Automated RTL Code Generation

Abstract:
The automatic generation of RTL code (e.g., Verilog) through natural language instructions has emerged as a promising direction with the advancement of large language models (LLMs). However, producing RTL code that is both syntactically and functionally correct remains a significant challenge. Existing single‑LLM‑agent approaches face substantial limitations because they must navigate between various programming languages and handle intricate generation, verification, and modification tasks. To address these challenges, this paper introduces MAGE, the first open‑source multi‑agent AI system designed for robust and accurate Verilog RTL code generation. We propose a novel high‑temperature RTL candidate sampling and debugging system that effectively explores the space of code candidates and significantly improves the quality of the candidates. Furthermore, we design a novel Verilog‑state checkpoint checking mechanism that enables early detection of functional errors and delivers precise feedback for targeted fixes, significantly enhancing the functional correctness of the generated RTL code. MAGE achieves a 95.7% rate of syntactic and functional correctness code generation on VerilogEval‑Human 2 benchmark, surpassing the state‑of‑the‑art Claude‑3.5‑sonnet by 23.3 %, demonstrating a robust and reliable approach for AI‑driven RTL design workflows.

PaperID: 12187, https://arxiv.org/pdf/2412.07819.pdf

Authors: Yaorui Shi, Sihang Li, Taiyan Zhang, Xi Fang, Jiankun Wang, Zhiyuan Liu, Guojiang Zhao, Zhengdan Zhu, Zhifeng Gao, Renxin Zhong, Linfeng Zhang, Guolin Ke, Weinan E, Hengxing Cai, Xiang Wang

Title: Intelligent System for Automated Molecular Patent Infringement Assessment

Abstract:
Automated drug discovery offers significant potential for accelerating the development of novel therapeutics by substituting labor‑intensive human workflows with machine‑driven processes. However, molecules generated by artificial intelligence may unintentionally infringe on existing patents, posing legal and financial risks that impede the full automation of drug discovery pipelines. This paper introduces PatentFinder, a novel multi‑agent and tool‑enhanced intelligence system that can accurately and comprehensively evaluate small molecules for patent infringement. PatentFinder features five specialized agents that collaboratively analyze patent claims and molecular structures with heuristic and model‑based tools, generating interpretable infringement reports. To support systematic evaluation, we curate MolPatent‑240, a benchmark dataset tailored for patent infringement assessment algorithms. On this benchmark, PatentFinder outperforms baseline methods that rely solely on large language models or specialized chemical tools, achieving a 13.8% improvement in F1‑score and a 12% increase in accuracy. Additionally, PatentFinder autonomously generates detailed and interpretable patent infringement reports, showcasing enhanced accuracy and improved interpretability. The high accuracy and interpretability of PatentFinder make it a valuable and reliable tool for automating patent infringement assessments, offering a practical solution for integrating patent protection analysis into the drug discovery pipeline.

PaperID: 12188, https://arxiv.org/pdf/2412.07786.pdf

Authors: Agapi Rissaki, Ilias Fountalis, Nikolaos Vasiloglou, Wolfgang Gatterbauer

Title: Towards Agentic Schema Refinement

Abstract:
Large enterprise databases can be complex and messy, obscuring the data semantics needed for analytical tasks. We propose a semantic layer in‑between the database and the user as a set of small and easy‑to‑interpret database views, effectively acting as a refined version of the schema. To discover these views, we introduce a multi‑agent Large Language Model (LLM) simulation where LLM agents collaborate to iteratively define and refine views with minimal input. Our approach paves the way for LLM‑powered exploration of unwieldy databases.

PaperID: 12189, https://arxiv.org/pdf/2412.07646.pdf

Authors: Tom Kouwenhoven, Max Peeperkorn, Tessa Verhoef

Title: Searching for Structure: Investigating Emergent Communication with Large Language Models

Abstract:
Human languages have evolved to be structured through repeated language learning and use. These processes introduce biases that operate during language acquisition and shape linguistic systems toward communicative efficiency. In this paper, we investigate whether the same happens if artificial languages are optimised for implicit biases of Large Language Models (LLMs). To this end, we simulate a classical referential game in which LLMs learn and use artificial languages. Our results show that initially unstructured holistic languages are indeed shaped to have some structural properties that allow two LLM agents to communicate successfully. Similar to observations in human experiments, generational transmission increases the learnability of languages, but can at the same time result in non‑humanlike degenerate vocabularies. Taken together, this work extends experimental findings, shows that LLMs can be used as tools in simulations of language evolution, and opens possibilities for future human‑machine experiments in this field.

PaperID: 12190, https://arxiv.org/pdf/2412.07639.pdf

Authors: Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding

Title: Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Abstract:
Offline Multi‑Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi‑agent policies from pre‑collected datasets. Compared to single‑agent case, multi‑agent setting involves a large joint state‑action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out‑of‑distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In‑Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in‑sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring low‑probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub‑optimal solutions. Theoretically, we prove InSPO guarantees monotonic policy improvement and converges to quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state‑of‑the‑art offline MARL methods.

PaperID: 12191, https://arxiv.org/pdf/2412.07471.pdf

Authors: Sen Kong

Title: Event-Triggered Memory Control for Interval Type-2 Fuzzy Heterogeneous Multi-Agent Systems

Abstract:
This study explores the design of a memory‑based dynamic event‑triggered mechanisms (DETM) scheme for heterogeneous multi‑agent systems (MASs) characterized by interval type‑2 Takagi‑Sugeno (IT2 T‑S) fuzzy models. To address the complex nonlinear uncertainties inherent in such systems, discrete IT2 T‑S fuzzy models are employed to accurately capture system dynamics. In response to the limitations on communication resources and computational capabilities within MASs, this research introduces a distributed DETM approach based on a dynamic threshold method. This mechanism effectively minimizes unnecessary communication while maintaining robust performance. The proposed memory‑based control strategy not only reduces the conservatism associated with controller design conditions but also enhances overall controller performance. Furthermore, leveraging a non‑parallel distributed compensation (non‑PDC) strategy, a novel derivation method is developed for controller design conditions that significantly decreases conservatism. This leads to sufficient conditions for the asymptotic stability of the closed‑loop system. The designed distributed event‑triggered controllers improve the overall performance of MASs, as evidenced by numerical simulations that validate the effectiveness of the proposed approach. Overall, these findings advance the state‑of‑the‑art in control strategies for heterogeneous MASs, offering practical solutions for real‑world applications where resource constraints are critical.

PaperID: 12192, https://arxiv.org/pdf/2412.07212.pdf

Authors: Wenjian Hao, Zehui Lu, Devesh Upadhyay, Shaoshuai Mou

Title: Distributed Koopman Learning using Partial Trajectories for Control

Abstract:
This paper proposes a distributed data‑driven framework for dynamics learning, termed distributed deep Koopman learning using partial trajectories (DDKL‑PT). In this framework, each agent in a multi‑agent system is assigned a partial trajectory offline and locally approximates the unknown dynamics using a deep neural network within the Koopman operator framework. By exchanging local estimated dynamics rather than training data, agents achieve consensus on a global dynamics model without sharing their private training trajectories. Simulation studies on a surface vehicle demonstrate that DDKL‑PT achieves consensus on the learned dynamics, and each agent attains reasonably small approximation errors on the testing dataset. Furthermore, a model predictive control scheme is developed by integrating the learned Koopman dynamics with known kinematic relations. Results on a reference‑tracking task indicate that the distributedly learned dynamics are sufficiently accurate for model‑based optimal control.

PaperID: 12193, https://arxiv.org/pdf/2412.07019.pdf

Authors: Bohan Jiang, Dawei Li, Zhen Tan, Xinyi Zhou, Ashwin Rao, Kristina Lerman, H. Russell Bernard, Huan Liu

Title: Assessing the Impact of Conspiracy Theories Using Large Language Models

Abstract:
Measuring the relative impact of CTs is important for prioritizing responses and allocating resources effectively, especially during crises. However, assessing the actual impact of CTs on the public poses unique challenges. It requires not only the collection of CT‑specific knowledge but also diverse information from social, psychological, and cultural dimensions. Recent advancements in large language models (LLMs) suggest their potential utility in this context, not only due to their extensive knowledge from large training corpora but also because they can be harnessed for complex reasoning. In this work, we develop datasets of popular CTs with human‑annotated impacts. Borrowing insights from human impact assessment processes, we then design tailored strategies to leverage LLMs for performing human‑like CT impact assessments. Through rigorous experiments, we textitdiscover that an impact assessment mode using multi‑step reasoning to analyze more CT‑related evidence critically produces accurate results; and most LLMs demonstrate strong bias, such as assigning higher impacts to CTs presented earlier in the prompt, while generating less accurate impact assessments for emotionally charged and verbose CTs.

PaperID: 12194, https://arxiv.org/pdf/2412.06834.pdf

Authors: Harvey McGuinness, Tianyu Wang, Carey E. Priebe, Hayden Helm

Title: Investigating social alignment via mirroring in a system of interacting language models

Abstract:
Alignment is a social phenomenon wherein individuals share a common goal or perspective. Mirroring, or mimicking the behaviors and opinions of another individual, is one mechanism by which individuals can become aligned. Large scale investigations of the effect of mirroring on alignment have been limited due to the scalability of traditional experimental designs in sociology. In this paper, we introduce a simple computational framework that enables studying the effect of mirroring behavior on alignment in multi‑agent systems. We simulate systems of interacting large language models in this framework and characterize overall system behavior and alignment with quantitative measures of agent dynamics. We find that system behavior is strongly influenced by the range of communication of each agent and that these effects are exacerbated by increased rates of mirroring. We discuss the observed simulated system behavior in the context of known human social dynamics.

PaperID: 12195, https://arxiv.org/pdf/2412.06832.pdf

Authors: Michael Iannelli, Sneha Kuchipudi, Vera Dvorak

Title: SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering

Abstract:
Retrieval Augmented Generation (RAG) enables Large Language Models (LLMs) to generalize to new information by decoupling reasoning capabilities from static knowledge bases. Traditional RAG enhancements have explored vertical scaling‑assigning subtasks to specialized modules‑and horizontal scaling‑replicating tasks across multiple agents‑to improve performance. However, real‑world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements, involving trade‑offs among objectives such as reducing cost, ensuring answer quality, and adhering to specific operational constraints. In this work, we present a systems‑oriented approach to multi‑agent RAG tailored for real‑world Question Answering (QA) applications. By integrating task‑specific non‑functional requirements‑such as answer quality, cost, and latency‑into the system, we enable dynamic reconfiguration to meet diverse SLAs. Our method maps these Service Level Objectives (SLOs) to system‑level parameters, allowing the generation of optimal results within specified resource constraints. We conduct a case study in the QA domain, demonstrating how dynamic re‑orchestration of a multi‑agent RAG system can effectively manage the trade‑off between answer quality and cost. By adjusting the system based on query intent and operational conditions, we systematically balance performance and resource utilization. This approach allows the system to meet SLOs for various query types, showcasing its practicality for real‑world applications.

PaperID: 12196, https://arxiv.org/pdf/2412.06828.pdf

Authors: Fang Zeng, Zhiliang Lyu, Quanzheng Li, Xiang Li

Title: Enhancing LLMs for Impression Generation in Radiology Reports through a Multi-Agent System

Abstract:
This study introduces "RadCouncil," a multi‑agent Large Language Model (LLM) framework designed to enhance the generation of impressions in radiology reports from the finding section. RadCouncil comprises three specialized agents: 1) a "Retrieval" Agent that identifies and retrieves similar reports from a vector database, 2) a "Radiologist" Agent that generates impressions based on the finding section of the given report plus the exemplar reports retrieved by the Retrieval Agent, and 3) a "Reviewer" Agent that evaluates the generated impressions and provides feedback. The performance of RadCouncil was evaluated using both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative criteria assessed by GPT‑4, using chest X‑ray as a case study. Experiment results show improvements in RadCouncil over the single‑agent approach across multiple dimensions, including diagnostic accuracy, stylistic concordance, and clarity. This study highlights the potential of utilizing multiple interacting LLM agents, each with a dedicated task, to enhance performance in specialized medical tasks and the development of more robust and adaptable healthcare AI solutions.

PaperID: 12197, https://arxiv.org/pdf/2412.06788.pdf

Authors: Ivan A. Fernandez, Subash Neupane, Sudip Mittal, Shahram Rahimi

Title: Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant

Abstract:
Recent research has shown that large language models (LLMs) are particularly vulnerable to adversarial attacks. Since the release of ChatGPT, various industries are adopting LLM‑based chatbots and virtual assistants in their data workflows. The rapid development pace of AI‑based systems is being driven by the potential of Generative AI (GenAI) to assist humans in decision making. The immense optimism behind GenAI often overshadows the adversarial risks associated with these technologies. A threat actor can use security gaps, poor safeguards, and limited data governance to carry out attacks that grant unauthorized access to the system and its data. As a proof‑of‑concept, we assess the performance of BarkPlug, the Mississippi State University chatbot, against data poison attacks from a red team perspective.

PaperID: 12198, https://arxiv.org/pdf/2412.06681.pdf

Authors: Tianming Liu, Jirong Yang, Yafeng Yin

Title: Toward LLM-Agent-Based Modeling of Transportation Systems: A Conceptual Framework

Abstract:
In transportation system demand modeling and simulation, agent‑based models and microsimulations are current state‑of‑the‑art approaches. However, existing agent‑based models still have some limitations on behavioral realism and resource demand that limit their applicability. In this study, leveraging the emerging technology of large language models (LLMs) and LLM‑based agents, we propose a general LLM‑agent‑based modeling framework for transportation systems. We argue that LLM agents not only possess the essential capabilities to function as agents but also offer promising solutions to overcome some limitations of existing agent‑based models. Our conceptual framework design closely replicates the decision‑making and interaction processes and traits of human travelers within transportation networks, and we demonstrate that the proposed systems can meet critical behavioral criteria for decision‑making and learning behaviors using related studies and a demonstrative example of LLM agents' learning and adjustment in the bottleneck setting. Although further refinement of the LLM‑agent‑based modeling framework is necessary, we believe that this approach has the potential to improve transportation system modeling and simulation.

PaperID: 12199, https://arxiv.org/pdf/2412.06509.pdf

Authors: Yedi Zhang, Fu Song, Taolue Chen, Xuzhi Wu

Title: Reasoning about Strategic Abilities in Stochastic Multi-agent Systems

Abstract:
Reasoning about strategic abilities is key to AI systems comprising multiple agents, which provide a unified framework for formalizing various problems in game theory, social choice theory, etc. In this work, we propose a probabilistic extension of the alternating‑time μ‑calculus (AMC), named PAMC, for reasoning about the strategic abilities of agents in stochastic multi‑agent systems. We show that PAMC subsumes two existing logics AMC and PμTL (a probabilistic extension of the modal μ‑calculus), but is incomparable with the probabilistic alternating‑time temporal logic (PATL). We study the problems of model checking and satisfiability checking for PAMC. We first give a model checking algorithm by leveraging algorithms for solving normal‑form games and AMC model checking. We establish that the model checking problem of PAMC remains in UP\capco‑UP, the same complexity class as the model checking problem for AMC and PμTL. We also provide a new reduction from the satisfiability problem of PAMC to solving parity games, by which we obtain an EXPTIME decision procedure, as well as the small model property which allows us to construct a model for each satisfiable PAMC formula. Satisfiability in PAMC has the same complexity as in the modal μ‑calculus, unlike PCTL and PATL whose satisfiability checking problems remain open. We have implemented both the model checking and satisfiability checking algorithms as open‑source tools. Experimental results are reported, showcasing the practical applications and effectiveness of our approaches.

PaperID: 12200, https://arxiv.org/pdf/2412.06435.pdf

Authors: Yiding Wang, Yuxuan Chen, Fangwei Zhong, Long Ma, Yizhou Wang

Title: Simulating Human-like Daily Activities with Desire-driven Autonomy

Abstract:
Desires motivate humans to interact autonomously with the complex world. In contrast, current AI agents require explicit task specifications, such as instructions or reward functions, which constrain their autonomy and behavioral diversity. In this paper, we introduce a Desire‑driven Autonomous Agent (D2A) that can enable a large language model (LLM) to autonomously propose and select tasks, motivated by satisfying its multi‑dimensional desires. Specifically, the motivational framework of D2A is mainly constructed by a dynamic Value System, inspired by the Theory of Needs. It incorporates an understanding of human‑like desires, such as the need for social interaction, personal fulfillment, and self‑care. At each step, the agent evaluates the value of its current state, proposes a set of candidate activities, and selects the one that best aligns with its intrinsic motivations. We conduct experiments on Concordia, a text‑based simulator, to demonstrate that our agent generates coherent, contextually relevant daily activities while exhibiting variability and adaptability similar to human behavior. A comparative analysis with other LLM‑based agents demonstrates that our approach significantly enhances the rationality of the simulated activities.

PaperID: 12201, https://arxiv.org/pdf/2412.06412.pdf

Authors: Cunshi Wang, Yu Zhang, Yuyang Li, Xinjie Hu, Yiming Mao, Xunhao Chen, Pengliang Du, Rui Wang, Ying Wu, Hang Yang, Yansong Li, Beichuan Wang, Haiyang Mu, Zheng Wang, Jianfeng Tian, Liang Ge, Yongna Mao, Shengming Li, Xiaomeng Lu, Jinhang Zou, Yang Huang, Ningchen Sun, Jie Zheng, Min He, Yu Bai, Junjie Jin, Hong Wu, Jifeng Liu

Title: StarWhisper Telescope: An AI framework for automating end-to-end astronomical observations

Abstract:
The exponential growth of large‑scale telescope arrays has boosted time‑domain astronomy development but introduced operational bottlenecks, including labor‑intensive observation planning, data processing, and real‑time decision‑making. Here we present the StarWhisper Telescope system, an AI agent framework automating end‑to‑end astronomical observations for surveys like the Nearby Galaxy Supernovae Survey. By integrating large language models with specialized function calls and modular workflows, StarWhisper Telescope autonomously generates site‑specific observation lists, executes real‑time image analysis via pipelines, and dynamically triggers follow‑up proposals upon transient detection. The system reduces human intervention through automated observation planning, telescope controlling and data processing, while enabling seamless collaboration between amateur and professional astronomers. Deployed across Nearby Galaxy Supernovae Survey's network of 10 amateur telescopes, the StarWhisper Telescope has detected transients with promising response times relative to existing surveys. Furthermore, StarWhisper Telescope's scalable agent architecture provides a blueprint for future facilities like the Global Open Transient Telescope Array, where AI‑driven autonomy will be critical for managing 60 telescopes.

PaperID: 12202, https://arxiv.org/pdf/2412.06333.pdf

Authors: F. Bredell, H. A. Engelbrecht, J. C. Schoeman

Title: Augmenting the action space with conventions to improve multi-agent cooperation in Hanabi

Abstract:
The card game Hanabi is considered a strong medium for the testing and development of multi‑agent reinforcement learning (MARL) algorithms, due to its cooperative nature, partial observability, limited communication and remarkable complexity. Previous research efforts have explored the capabilities of MARL algorithms within Hanabi, focusing largely on advanced architecture design and algorithmic manipulations to achieve state‑of‑the‑art performance for various number of cooperators. However, this often leads to complex solution strategies with high computational cost and requiring large amounts of training data. For humans to solve the Hanabi game effectively, they require the use of conventions, which often allows for a means to implicitly convey ideas or knowledge based on a predefined, and mutually agreed upon, set of "rules" or principles. Multi‑agent problems containing partial observability, especially when limited communication is present, can benefit greatly from the use of implicit knowledge sharing. In this paper, we propose a novel approach to augmenting an agent's action space using conventions, which act as a sequence of special cooperative actions that span over and include multiple time steps and multiple agents, requiring agents to actively opt in for it to reach fruition. These conventions are based on existing human conventions, and result in a significant improvement on the performance of existing techniques for self‑play and cross‑play for various number of cooperators within Hanabi.

PaperID: 12203, https://arxiv.org/pdf/2412.06324.pdf

Authors: Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia

Title: World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

Abstract:
The Multi‑modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception‑limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception‑limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework, which aims to improve autonomous driving performance under perceptionlimited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug‑and‑play instruction‑guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi‑view video inputs. Furthermore, to better integrate world knowledge with driving‑related tasks, we have collected and refined a large‑scale multi‑modal dataset that includes 2 million natural language QA pairs, 1.7 million grounding task data. To evaluate the model's utilization of world knowledge, we introduce an object‑level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi‑step reasoning leveraging world knowledge for resolution. Extensive experiments validate the effectiveness of our proposed method.

PaperID: 12204, https://arxiv.org/pdf/2412.06313.pdf

Authors: Junqiao Wang, Zhongliang Yu, Dong Zhou, Jiaqi Shi, Runran Deng

Title: Vision-Based Deep Reinforcement Learning of UAV Autonomous Navigation Using Privileged Information

Abstract:
The capability of UAVs for efficient autonomous navigation and obstacle avoidance in complex and unknown environments is critical for applications in agricultural irrigation, disaster relief and logistics. In this paper, we propose the DPRL (Distributed Privileged Reinforcement Learning) navigation algorithm, an end‑to‑end policy designed to address the challenge of high‑speed autonomous UAV navigation under partially observable environmental conditions. Our approach combines deep reinforcement learning with privileged learning to overcome the impact of observation data corruption caused by partial observability. We leverage an asymmetric Actor‑Critic architecture to provide the agent with privileged information during training, which enhances the model's perceptual capabilities. Additionally, we present a multi‑agent exploration strategy across diverse environments to accelerate experience collection, which in turn expedites model convergence. We conducted extensive simulations across various scenarios, benchmarking our DPRL algorithm against the state‑of‑the‑art navigation algorithms. The results consistently demonstrate the superior performance of our algorithm in terms of flight efficiency, robustness and overall success rate.

PaperID: 12205, https://arxiv.org/pdf/2412.06294.pdf

Authors: Louis Milliken, Sungmin Kang, Shin Yoo

Title: Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects

Abstract:
Many works have recently proposed the use of Large Language Model (LLM) based agents for performing `repository level' tasks, loosely defined as a set of tasks whose scopes are greater than a single file. This has led to speculation that the orchestration of these repository‑level tasks could lead to software engineering agents capable of performing almost independently of human intervention. However, of the suite of tasks that would need to be performed by this autonomous software engineering agent, we argue that one important task is missing, which is to fulfil project level dependency by installing other repositories. To investigate the feasibility of this repository level installation task, we introduce a benchmark of of repository installation tasks curated from 40 open source Python projects, which includes a ground truth installation process for each target repository. Further, we propose Installamatic, an agent which aims to perform and verify the installation of a given repository by searching for relevant instructions from documentation in the repository. Empirical experiments reveal that that 55% of the studied repositories can be automatically installed by our agent at least one out of ten times. Through further analysis, we identify the common causes for our agent's inability to install a repository, discuss the challenges faced in the design and implementation of such an agent and consider the implications that such an agent could have for developers.

PaperID: 12206, https://arxiv.org/pdf/2412.06258.pdf

Authors: Li Yin, Calvin Yeung, Qingrui Hu, Jun Ichikawa, Hirotsugu Azechi, Susumu Takahashi, Keisuke Fujii

Title: Enhanced Multi-Object Tracking Using Pose-based Virtual Markers in 3x3 Basketball

Abstract:
Multi‑object tracking (MOT) is crucial for various multi‑agent analyses such as evaluating team sports tactics and player movements and performance. While pedestrian tracking has advanced with Tracking‑by‑Detection MOT, team sports like basketball pose unique challenges. These challenges include players' unpredictable movements, frequent close interactions, and visual similarities that complicate pose labeling and lead to significant occlusions, frequent ID switches, and high manual annotation costs. To address these challenges, we propose a novel pose‑based virtual marker (VM) MOT method for team sports, named Sports‑vmTracking. This method builds on the vmTracking approach developed for multi‑animal tracking with active learning. First, we constructed a 3x3 basketball pose dataset for VMs and applied active learning to enhance model performance in generating VMs. Then, we overlaid the VMs on video to identify players, extract their poses with unique IDs, and convert these into bounding boxes for comparison with automated MOT methods. Using our 3x3 basketball dataset, we demonstrated that our VM configuration has been highly effective, and reduced the need for manual corrections and labeling during pose model training while maintaining high accuracy. Our approach achieved an average HOTA score of 72.3%, over 10 points higher than other state‑of‑the‑art methods without VM, and resulted in 0 ID switches. Beyond improving performance in handling occlusions and minimizing ID switches, our framework could substantially increase the time and cost efficiency compared to traditional manual annotation.

PaperID: 12207, https://arxiv.org/pdf/2412.06142.pdf

Authors: Zonglin Meng, Yun Zhang, Zhaoliang Zheng, Zhihao Zhao, Jiaqi Ma

Title: AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations

Abstract:
Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructures to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi‑sensor correlations in multi‑agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter‑agent sensor interactions. The varying operational conditions and other real‑world factors inevitably introduce multifactorial noise and consequentially lead to multi‑sensor misalignment, making the deployment of multi‑agent multi‑modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real‑world heterogeneous agent cross‑modality feature alignment framework, to effectively address these multi‑modality misalignment issues. Our method introduces a cross‑modality feature alignment space (CFAS) and heterogeneous agent feature alignment (HAFA) mechanism to harmonize multi‑modality features across various agents dynamically. Additionally, we present a novel V2XSet‑noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach's robustness. Extensive experiments on the V2X‑Real and V2XSet‑Noise benchmarks demonstrate that our framework achieves state‑of‑the‑art performance, underscoring its potential for real‑world applications in cooperative autonomous driving. The controllable V2XSet‑Noise dataset and generation pipeline will be released in the future.

PaperID: 12208, https://arxiv.org/pdf/2412.05958.pdf

Authors: Adem Ait, Javier Luis Cánovas Izquierdo, Jordi Cabot

Title: Towards Modeling Human-Agentic Collaborative Workflows: A BPMN Extension

Abstract:
Large Language Models (LLMs) have facilitated the definition of autonomous intelligent agents. Such agents have already demonstrated their potential in solving complex tasks in different domains. And they can further increase their performance when collaborating with other agents in a multi‑agent system. However, the orchestration and coordination of these agents is still challenging, especially when they need to interact with humans as part of human‑agentic collaborative workflows. These kinds of workflows need to be precisely specified so that it is clear whose responsible for each task, what strategies agents can follow to complete individual tasks or how decisions will be taken when different alternatives are proposed, among others. Current business process modeling languages fall short when it comes to specifying these new mixed collaborative scenarios. In this exploratory paper, we extend a well‑known process modeling language (i.e., BPMN) to enable the definition of this new type of workflow. Our extension covers both the formalization of the new metamodeling concepts required and the proposal of a BPMN‑like graphical notation to facilitate the definition of these workflows. Our extension has been implemented and is available as an open‑source human‑agentic workflow modeling editor on GitHub.

PaperID: 12209, https://arxiv.org/pdf/2412.05850.pdf

Authors: Zhiguang Wu, Fengbin Zhu, Xuequn Shang, Yupei Zhang, Pan Zhou

Title: Cooperative SQL Generation for Segmented Databases By Using Multi-functional LLM Agents

Abstract:
Text‑to‑SQL task aims to automatically yield SQL queries according to user text questions. To address this problem, we propose a Cooperative SQL Generation framework based on Multi‑functional Agents (CSMA) through information interaction among large language model (LLM) based agents who own part of the database schema seperately. Inspired by the collaboration in human teamwork, CSMA consists of three stages: 1) Question‑related schema collection, 2) Question‑corresponding SQL query generation, and 3) SQL query correctness check. In the first stage, agents analyze their respective schema and communicate with each other to collect the schema information relevant to the question. In the second stage, agents try to generate the corresponding SQL query for the question using the collected information. In the third stage, agents check if the SQL query is created correctly according to their known information. This interaction‑based method makes the question‑relevant part of database schema from each agent to be used for SQL generation and check. Experiments on the Spider and Bird benckmark demonstrate that CSMA achieves a high performance level comparable to the state‑of‑the‑arts, meanwhile holding the private data in these individual agents.

PaperID: 12210, https://arxiv.org/pdf/2412.05838.pdf

Authors: Aniruddha Salve, Saba Attar, Mahesh Deshmukh, Sayali Shivpuje, Arnab Mitra Utsab

Title: A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data

Abstract:
Retrieval‑Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain‑specific data into the generative process. While LLMs are highly capable, they often rely on static, pre‑trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single‑agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi‑agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document‑based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.

PaperID: 12211, https://arxiv.org/pdf/2412.05789.pdf

Authors: Pengzhen Ren, Min Li, Zhen Luo, Xinshuai Song, Ziwei Chen, Weijia Liufu, Yixuan Yang, Hao Zheng, Rongtao Xu, Zitong Huang, Tongsheng Ding, Luyang Xie, Kaidong Zhang, Changfei Fu, Yang Liu, Liang Lin, Feng Zheng, Xiaodan Liang

Title: InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction

Abstract:
Realizing scaling laws in embodied AI has become a focus. However, previous work has been scattered across diverse simulation platforms, with assets and models lacking unified interfaces, which has led to inefficiencies in research. To address this, we introduce InfiniteWorld, a unified and scalable simulator for general vision‑language robot interaction built on Nvidia Isaac Sim. InfiniteWorld encompasses a comprehensive set of physics asset construction methods and generalized free robot interaction benchmarks. Specifically, we first built a unified and scalable simulation framework for embodied learning that integrates a series of improvements in generation‑driven 3D asset construction, Real2Sim, automated annotation framework, and unified 3D asset processing. This framework provides a unified and scalable platform for robot interaction and learning. In addition, to simulate realistic robot interaction, we build four new general benchmarks, including scene graph collaborative exploration and open‑world social mobile manipulation. The former is often overlooked as an important task for robots to explore the environment and build scene knowledge, while the latter simulates robot interaction tasks with different levels of knowledge agents based on the former. They can more comprehensively evaluate the embodied agent's capabilities in environmental understanding, task planning and execution, and intelligent interaction. We hope that this work can provide the community with a systematic asset interface, alleviate the dilemma of the lack of high‑quality assets, and provide a more comprehensive evaluation of robot interactions.

PaperID: 12212, https://arxiv.org/pdf/2412.05449.pdf

Authors: Raphael Shu, Nilaksh Das, Michelle Yuan, Monica Sunkara, Yi Zhang

Title: Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications

Abstract:
AI agents powered by large language models (LLMs) have shown strong capabilities in problem solving. Through combining many intelligent agents, multi‑agent collaboration has emerged as a promising approach to tackle complex, multi‑faceted problems that exceed the capabilities of single AI agents. However, designing the collaboration protocols and evaluating the effectiveness of these systems remains a significant challenge, especially for enterprise applications. This report addresses these challenges by presenting a comprehensive evaluation of coordination and routing capabilities in a novel multi‑agent collaboration framework. We evaluate two key operational modes: (1) a coordination mode enabling complex task completion through parallel communication and payload referencing, and (2) a routing mode for efficient message forwarding between agents. We benchmark on a set of handcrafted scenarios from three enterprise domains, which are publicly released with the report. For coordination capabilities, we demonstrate the effectiveness of inter‑agent communication and payload referencing mechanisms, achieving end‑to‑end goal success rates of 90%. Our analysis yields several key findings: multi‑agent collaboration enhances goal success rates by up to 70% compared to single‑agent approaches in our benchmarks; payload referencing improves performance on code‑intensive tasks by 23%; latency can be substantially reduced with a routing mechanism that selectively bypasses agent orchestration. These findings offer valuable guidance for enterprise deployments of multi‑agent systems and advance the development of scalable, efficient multi‑agent collaboration frameworks.

PaperID: 12213, https://arxiv.org/pdf/2412.05445.pdf

Authors: Kavindu Ravishan, Dániel Szabó, Niels van Berkel, Aku Visuri, Chi-Lan Yang, Koji Yatani, Simo Hosio

Title: From Voice to Value: Leveraging AI to Enhance Spoken Online Reviews on the Go

Abstract:
Online reviews help people make better decisions. Review platforms usually depend on typed input, where leaving a good review requires significant effort because users must carefully organize and articulate their thoughts. This may discourage users from leaving comprehensive and high‑quality reviews, especially when they are on the go. To address this challenge, we developed Vocalizer, a mobile application that enables users to provide reviews through voice input, with enhancements from a large language model (LLM). In a longitudinal study, we analysed user interactions with the app, focusing on AI‑driven features that help refine and improve reviews. Our findings show that users frequently utilized the AI agent to add more detailed information to their reviews. We also show how interactive AI features can improve users self‑efficacy and willingness to share reviews online. Finally, we discuss the opportunities and challenges of integrating AI assistance into review‑writing systems.

PaperID: 12214, https://arxiv.org/pdf/2412.05342.pdf

Authors: Xiaoyu Wang, Ningyuan Xi, Teng Chen, Qingqing Gu, Yue Zhao, Xiaokai Chen, Zhonglin Jiang, Yong Chen, Luo Ji

Title: Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation

Abstract:
Large Language Models (LLM) are usually fine‑tuned to participate in dyadic or two‑party dialogues, which can not adapt well to multi‑party dialogues (MPD), which hinders their applications in such scenarios including multi‑personal meetings, discussions and daily communication. Previous LLM‑based researches mainly focus on the multi‑agent framework, while their base LLMs are still pairwisely fine‑tuned. In this work, we design a multi‑party fine‑tuning framework (MuPaS) for LLMs on the multi‑party dialogue datasets, and prove such a straightforward framework can let the LLM align with the multi‑party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into the MPD simulator. Substantial experiments show that MuPaS can achieve state‑of‑the‑art multi‑party response, higher accuracy of the‑next‑speaker prediction, higher human and automatic evaluated utterance qualities, and can even generate reasonably with out‑of‑distribution scene, topic and role descriptions. The MuPaS framework bridges the LLM training with more complicated multi‑party applications, such as conversation generation, virtual rehearsal or meta‑universe.

PaperID: 12215, https://arxiv.org/pdf/2412.05311.pdf

Authors: Chen-Chia Chang, Chia-Tung Ho, Yaguang Li, Yiran Chen, Haoxing Ren

Title: DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent

Abstract:
In the advanced technology nodes, the integrated design rule checker (DRC) is often utilized in place and route tools for fast optimization loops for power‑performance‑area. Implementing integrated DRC checkers to meet the standard of commercial DRC tools demands extensive human expertise to interpret foundry specifications, analyze layouts, and debug code iteratively. However, this labor‑intensive process, requiring to be repeated by every update of technology nodes, prolongs the turnaround time of designing circuits. In this paper, we present DRC‑Coder, a multi‑agent framework with vision capabilities for automated DRC code generation. By incorporating vision language models and large language models (LLM), DRC‑Coder can effectively process textual, visual, and layout information to perform rule interpretation and coding by two specialized LLMs. We also design an auto‑evaluation function for LLMs to enable DRC code debugging. Experimental results show that targeting on a sub‑3nm technology node for a state‑of‑the‑art standard cell layout tool, DRC‑Coder achieves perfect F1 score 1.000 in generating DRC codes for meeting the standard of a commercial DRC tool, highly outperforming standard prompting techniques (F1=0.631). DRC‑Coder can generate code for each design rule within four minutes on average, which significantly accelerates technology advancement and reduces engineering costs.

PaperID: 12216, https://arxiv.org/pdf/2412.05265.pdf

Authors: Kevin Murphy

Title: Reinforcement Learning: An Overview

Abstract:
This manuscript gives a big‑picture, up‑to‑date overview of the field of (deep) reinforcement learning and sequential decision making, covering value‑based methods, policy‑based methods, model‑based methods, multi‑agent RL, LLMs and RL, and various other topics (e.g., offline RL, hierarchical RL, intrinsic reward). It also includes some code snippets for training LLMs with RL.

PaperID: 12217, https://arxiv.org/pdf/2412.05093.pdf

Authors: Da Ju, Adina Williams, Brian Karrer, Maximilian Nickel

Title: Sense and Sensitivity: Evaluating the simulation of social dynamics via Large Language Models

Abstract:
Large language models have increasingly been proposed as a powerful replacement for classical agent‑based models (ABMs) to simulate social dynamics. By using LLMs as a proxy for human behavior, the hope of this new approach is to be able to simulate significantly more complex dynamics than with classical ABMs and gain new insights in fields such as social science, political science, and economics. However, due to the black box nature of LLMs, it is unclear whether LLM agents actually execute the intended semantics that are encoded in their natural language instructions and, if the resulting dynamics of interactions are meaningful. To study this question, we propose a new evaluation framework that grounds LLM simulations within the dynamics of established reference models of social science. By treating LLMs as a black‑box function, we evaluate their input‑output behavior relative to this reference model, which allows us to evaluate detailed aspects of their behavior. Our results show that, while it is possible to engineer prompts that approximate the intended dynamics, the quality of these simulations is highly sensitive to the particular choice of prompts. Importantly, simulations are even sensitive to arbitrary variations such as minor wording changes and whitespace. This puts into question the usefulness of current versions of LLMs for meaningful simulations, as without a reference model, it is impossible to determine a priori what impact seemingly meaningless changes in prompt will have on the simulation.

PaperID: 12218, https://arxiv.org/pdf/2412.04984.pdf

Authors: Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn

Title: Frontier Models are Capable of In-context Scheming

Abstract:
Frontier models are increasingly trained and deployed as autonomous agent. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives ‑ also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in‑context and instruct the model to strongly follow. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in‑context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. When o1 has engaged in scheming, it maintains its deception in over 85% of follow‑up questions and often remains deceptive in multi‑turn interrogations. Analysis of the models' chains‑of‑thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it. We observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in‑context. Our findings demonstrate that frontier models now possess capabilities for basic in‑context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern.

PaperID: 12219, https://arxiv.org/pdf/2412.04937.pdf

Authors: Ryota Nonomura, Hiroki Mori

Title: Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of Turn-taking in Murder Mystery Games

Abstract:
Multi‑agent systems utilizing large language models (LLMs) have shown great promise in achieving natural dialogue. However, smooth dialogue control and autonomous decision making among agents still remain challenges. In this study, we focus on conversational norms such as adjacency pairs and turn‑taking found in conversation analysis and propose a new framework called "Murder Mystery Agents" that applies these norms to AI agents' dialogue control. As an evaluation target, we employed the "Murder Mystery" game, a reasoning‑type table‑top role‑playing game that requires complex social reasoning and information manipulation. In this game, players need to unravel the truth of the case based on fragmentary information through cooperation and bargaining. The proposed framework integrates next speaker selection based on adjacency pairs and a self‑selection mechanism that takes agents' internal states into account to achieve more natural and strategic dialogue. To verify the effectiveness of this new approach, we analyzed utterances that led to dialogue breakdowns and conducted automatic evaluation using LLMs, as well as human evaluation using evaluation criteria developed for the Murder Mystery game. Experimental results showed that the implementation of the next speaker selection mechanism significantly reduced dialogue breakdowns and improved the ability of agents to share information and perform logical reasoning. The results of this study demonstrate that the systematics of turn‑taking in human conversation are also effective in controlling dialogue among AI agents, and provide design guidelines for more advanced multi‑agent dialogue systems.

PaperID: 12220, https://arxiv.org/pdf/2412.04859.pdf

Authors: Mingqing Zhang, Haisong Gong, Qiang Liu, Shu Wu, Liang Wang

Title: Breaking Event Rumor Detection via Stance-Separated Multi-Agent Debate

Abstract:
The rapid spread of rumors on social media platforms during breaking events severely hinders the dissemination of the truth. Previous studies reveal that the lack of annotated resources hinders the direct detection of unforeseen breaking events not covered in yesterday's news. Leveraging large language models (LLMs) for rumor detection holds significant promise. However, it is challenging for LLMs to provide comprehensive responses to complex or controversial issues due to limited diversity. In this work, we propose the Stance Separated Multi‑Agent Debate (S2MAD) to address this issue. Specifically, we firstly introduce Stance Separation, categorizing comments as either supporting or opposing the original claim. Subsequently, claims are classified as subjective or objective, enabling agents to generate reasonable initial viewpoints with different prompt strategies for each type of claim. Debaters then follow specific instructions through multiple rounds of debate to reach a consensus. If a consensus is not reached, a judge agent evaluates the opinions and delivers a final verdict on the claim's veracity. Extensive experiments conducted on two real‑world datasets demonstrate that our proposed model outperforms state‑of‑the‑art methods in terms of performance and effectively improves the performance of LLMs in breaking event rumor detection.

PaperID: 12221, https://arxiv.org/pdf/2412.04494.pdf

Authors: Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis, Akshay Mallipeddi, Abhinav Mathur, Joseph Ross, Liang Gou

Title: MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

Abstract:
Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to the above by relying on the zero‑shot reasoning abilities of the underlying LLM and utilizing tools to explore and reason over customer data and respond to user requests. However, there are two concerns here: (I) acquiring large scale customer queries for agent testing is time‑consuming, and (II) high reliance on the tool call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address this, we propose MAG‑V, a multi‑agent framework to first generate a dataset of questions that mimic customer queries; and second, reverse‑engineer alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT‑4o judge baseline by 11% accuracy and matches the performance of a GPT‑4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.

PaperID: 12222, https://arxiv.org/pdf/2412.04485.pdf

Authors: Mubashir ul Islam, Humza Sami, Pierre-Emmanuel Gaillardon, Valerio Tenace

Title: EDA-Aware RTL Generation with Large Language Models

Abstract:
Large Language Models (LLMs) have become increasingly popular for generating RTL code. However, producing error‑free RTL code in a zero‑shot setting remains highly challenging for even state‑of‑the‑art LLMs, often leading to issues that require manual, iterative refinement. This additional debugging process can dramatically increase the verification workload, underscoring the need for robust, automated correction mechanisms to ensure code correctness from the start. In this work, we introduce AIvril2, a self‑verifying, LLM‑agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors. Our approach leverages a collaborative multi‑agent system that incorporates feedback from error logs generated by EDA tools to automatically identify and resolve design flaws. Experimental results, conducted on the VerilogEval‑Human benchmark suite, demonstrate that our framework significantly improves code quality, achieving nearly a 3.4× enhancement over prior methods. In the best‑case scenario, functional pass rates of 77% for Verilog and 66% for VHDL were obtained, thus substantially improving the reliability of LLM‑driven RTL code generation.

PaperID: 12223, https://arxiv.org/pdf/2412.04369.pdf

Authors: Haoran Su, Joseph Y. J. Chow

Title: Intersection-Aware Assessment of EMS Accessibility in NYC: A Data-Driven Approach

Abstract:
Emergency response times are critical in densely populated urban environments like New York City (NYC), where traffic congestion significantly impedes emergency vehicle (EMV) mobility. This study introduces an intersection‑aware emergency medical service (EMS) accessibility model to evaluate and improve EMV travel times across NYC. Integrating intersection density metrics, road network characteristics, and demographic data, the model identifies vulnerable regions with inadequate EMS coverage. The analysis reveals that densely interconnected areas, such as parts of Staten Island, Queens, and Manhattan, experience significant accessibility deficits due to intersection delays and sparse medical infrastructure. To address these challenges, this study explores the adoption of EMVLight, a multi‑agent reinforcement learning framework, which demonstrates the potential to reduce intersection delays by 50%, increasing EMS accessibility to 95% of NYC residents within the critical benchmark of 4 minutes. Results indicate that advanced traffic signal control (TSC) systems can alleviate congestion‑induced delays while improving equity in emergency response. The findings provide actionable insights for urban planning and policy interventions to enhance EMS accessibility and ensure timely care for underserved populations.

PaperID: 12224, https://arxiv.org/pdf/2412.04341.pdf

Authors: Ke Sun, Huan Yu

Title: Reinforcement Learning for Freeway Lane-Change Regulation via Connected Vehicles

Abstract:
Lane change decision‑making is a complex task due to intricate vehicle‑vehicle and vehicle‑infrastructure interactions. Existing algorithms for lane‑change control often depend on vehicles with a certain level of autonomy (e.g., autonomous or connected autonomous vehicles). To address the challenges posed by low penetration rates of autonomous vehicles and the high costs of precise data collection, this study proposes a dynamic lane change regulation design based on multi‑agent reinforcement learning (MARL) to enhance freeway traffic efficiency. The proposed framework leverages multi‑lane macroscopic traffic models that describe spatial‑temporal dynamics of the density and speed for each lane. Lateral traffic flow between adjacent lanes, resulting from aggregated lane‑changing behaviors, is modeled as source terms exchanged between the partial differential equations (PDEs). We propose a lane change regulation strategy using MARL, where one agent is placed at each discretized lane grid. The state of each agent is represented by aggregated vehicle attributes within its grid, generated from the SUMO microscopic simulation environment. The agent's actions are lane‑change regulations for vehicles in its grid. Specifically, lane‑change regulation signals are computed at a centralized traffic management center and then broadcast to connected vehicles in the corresponding lane grids. Compared to vehicle‑level maneuver control, this approach achieves a higher regulation rate by leveraging vehicle connectivity while introducing no critical safety concerns, and accommodating varying levels of connectivity and autonomy within the traffic system. The proposed model is simulated and evaluated in varied traffic scenarios and demand conditions. Experimental results demonstrate that the method improves overall traffic efficiency with minimal additional energy consumption while maintaining driving safety.

PaperID: 12225, https://arxiv.org/pdf/2412.04256.pdf

Authors: Jonathan Morag, Noy Gabay, Daniel koyfman, Roni Stern

Title: Transient Multi-Agent Path Finding for Lifelong Navigation in Dense Environments

Abstract:
Multi‑Agent Path Finding (MAPF) deals with finding conflict‑free paths for a set of agents from an initial configuration to a given target configuration. The Lifelong MAPF (LMAPF) problem is a well‑studied online version of MAPF in which an agent receives a new target when it reaches its current target. The common approach for solving LMAPF is to treat it as a sequence of MAPF problems, periodically replanning from the agents' current configurations to their current targets. A significant drawback in this approach is that in MAPF the agents must reach a configuration in which all agents are at their targets simultaneously, which is needlessly restrictive for LMAPF. Techniques have been proposed to indirectly mitigate this drawback. We describe cases where these mitigation techniques fail. As an alternative, we propose to solve LMAPF problems by solving a sequence of modified MAPF problems, in which the objective is for each agent to eventually visit its target, but not necessarily for all agents to do so simultaneously. We refer to this MAPF variant as Transient MAPF (TMAPF) and propose several algorithms for solving it based on existing MAPF algorithms. A limited experimental evaluation identifies some cases where using a TMAPF algorithm instead of a MAPF algorithm with an LMAPF framework can improve the system throughput significantly.

PaperID: 12226, https://arxiv.org/pdf/2412.04141.pdf

Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

Title: Reducing Tool Hallucination via Reliability Alignment

Abstract:
Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real‑world applications. However, tool hallucinations, where models either select inappropriate tools or misuse them, pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types, tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination‑aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool‑use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions.

PaperID: 12227, https://arxiv.org/pdf/2412.04093.pdf

Authors: Chris Sypherd, Vaishak Belle

Title: Practical Considerations for Agentic LLM Systems

Abstract:
As the strength of Large Language Models (LLMs) has grown over recent years, so too has interest in their use as the underlying models for autonomous agents. Although LLMs demonstrate emergent abilities and broad expertise across natural language domains, their inherent unpredictability makes the implementation of LLM agents challenging, resulting in a gap between related research and the real‑world implementation of such systems. To bridge this gap, this paper frames actionable insights and considerations from the research community in the context of established application paradigms to enable the construction and facilitate the informed deployment of robust LLM agents. Namely, we position relevant research findings into four broad categories‑‑Planning, Memory, Tools, and Control Flow‑‑based on common practices in application‑focused literature and highlight practical considerations to make when designing agentic LLMs for real‑world applications, such as handling stochasticity and managing resources efficiently. While we do not conduct empirical evaluations, we do provide the necessary background for discussing critical aspects of agentic LLM designs, both in academia and industry.

PaperID: 12228, https://arxiv.org/pdf/2412.04090.pdf

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen

Title: LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents

Abstract:
We present the first loss agent, dubbed LossAgent, for low‑level image processing tasks, e.g., image super‑resolution and restoration, intending to achieve any customized optimization objectives of low‑level image processing in different practical applications. Notably, not all optimization objectives, such as complex hand‑crafted perceptual metrics, text description, and intricate human feedback, can be instantiated with existing low‑level losses, e.g., MSE loss, which presents a crucial challenge in optimizing image processing networks in an end‑to‑end manner. To eliminate this, our LossAgent introduces the powerful large language model (LLM) as the loss agent, where the rich textual understanding of prior knowledge empowers the loss agent with the potential to understand complex optimization objectives, trajectory, and state feedback from external environments in the optimization process of the low‑level image processing networks. In particular, we establish the loss repository by incorporating existing loss functions that support the end‑to‑end optimization for low‑level image processing. Then, we design the optimization‑oriented prompt engineering for the loss agent to actively and intelligently decide the compositional weights for each loss in the repository at each optimization interaction, thereby achieving the required optimization trajectory for any customized optimization objectives. Extensive experiments on three typical low‑level image processing tasks and multiple optimization objectives have shown the effectiveness and applicability of our proposed LossAgent.

PaperID: 12229, https://arxiv.org/pdf/2412.04031.pdf

Authors: Puspanjali Ghoshal, Ashok Singh Sairam

Title: Dimension Reduction via Random Projection for Privacy in Multi-Agent Systems

Abstract:
In a Multi‑Agent System (MAS), individual agents observe various aspects of the environment and transmit this information to a central entity responsible for aggregating the data and deducing system parameters. To improve overall efficiency, agents may append certain private parameters to their observations. For example, in a crowd‑sourced traffic monitoring system, commuters might share not only their current speed, but also sensitive information such as their location to enable more accurate route prediction. However, sharing such data can allow the central entity or a potential adversary to infer private details about the user, such as their daily routines. To mitigate these privacy risks, the agents sanitize the data before transmission. This sanitization inevitably results in a loss of utility. In this work, we formulate the problem as a utility‑privacy trade‑off and propose a novel compression‑based approach leveraging the notion of robust concepts to sanitize the shared data. We further derive a bound on the norm of the compression matrix required to ensure maximal privacy while satisfying predefined utility constraints.

PaperID: 12230, https://arxiv.org/pdf/2412.03979.pdf

Authors: Omer Sen, Christoph Pohl, Immanuel Hacker, Markus Stroot, Andreas Ulbig

Title: AI-based Attacker Models for Enhancing Multi-Stage Cyberattack Simulations in Smart Grids Using Co-Simulation Environments

Abstract:
The transition to smart grids has increased the vulnerability of electrical power systems to advanced cyber threats. To safeguard these systems, comprehensive security measures‑including preventive, detective, and reactive strategies‑are necessary. As part of the critical infrastructure, securing these systems is a major research focus, particularly against cyberattacks. Many methods are developed to detect anomalies and intrusions and assess the damage potential of attacks. However, these methods require large amounts of data, which are often limited or private due to security concerns. We propose a co‑simulation framework that employs an autonomous agent to execute modular cyberattacks within a configurable environment, enabling reproducible and adaptable data generation. The impact of virtual attacks is compared to those in a physical lab targeting real smart grids. We also investigate the use of large language models for automating attack generation, though current models on consumer hardware are unreliable. Our approach offers a flexible, versatile source for data generation, aiding in faster prototyping and reducing development resources and time.

PaperID: 12231, https://arxiv.org/pdf/2412.03959.pdf

Authors: Guanwen Xie, Jingzehua Xu, Ziqi Zhang, Xiangwang Hou, Dongfang Ma, Shuai Zhang, Yong Ren, Dusit Niyato

Title: Is FISHER All You Need in The Multi-AUV Underwater Target Tracking Task?

Abstract:
It is significant to employ multiple autonomous underwater vehicles (AUVs) to execute the underwater target tracking task collaboratively. However, it's pretty challenging to meet various prerequisites utilizing traditional control methods. Therefore, we propose an effective two‑stage learning from demonstrations training framework, FISHER, to highlight the adaptability of reinforcement learning (RL) methods in the multi‑AUV underwater target tracking task, while addressing its limitations such as extensive requirements for environmental interactions and the challenges in designing reward functions. The first stage utilizes imitation learning (IL) to realize policy improvement and generate offline datasets. To be specific, we introduce multi‑agent discriminator‑actor‑critic based on improvements of the generative adversarial IL algorithm and multi‑agent IL optimization objective derived from the Nash equilibrium condition. Then in the second stage, we develop multi‑agent independent generalized decision transformer, which analyzes the latent representation to match the future states of high‑quality samples rather than reward function, attaining further enhanced policies capable of handling various scenarios. Besides, we propose a simulation to simulation demonstration generation procedure to facilitate the generation of expert demonstrations in underwater environments, which capitalizes on traditional control methods and can easily accomplish the domain transfer to obtain demonstrations. Extensive simulation experiments from multiple scenarios showcase that FISHER possesses strong stability, multi‑task performance and capability of generalization.

PaperID: 12232, https://arxiv.org/pdf/2412.03925.pdf

Authors: Talha Azfar, Kaicong Huang, Andrew Tracy, Sandra Misiewicz, Chenxi Liu, Ruimin Ke

Title: Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning

Abstract:
Traffic simulations are commonly used to optimize urban traffic flow, with reinforcement learning (RL) showing promising potential for automated traffic signal control, particularly in intelligent transportation systems involving connected automated vehicles. Multi‑agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations. However, existing methods often assume perfect vehicle detection, which overlooks real‑world limitations related to infrastructure availability and sensor reliability. This study proposes a co‑simulation framework integrating CARLA and SUMO, which combines high‑fidelity 3D modeling with large‑scale traffic flow simulation. Cameras mounted on traffic light poles within the CARLA environment use a YOLO‑based computer vision system to detect and count vehicles, providing real‑time traffic data as input for adaptive signal control in SUMO. MARL agents trained with four different reward structures leverage this visual feedback to optimize signal timings and improve network‑wide traffic flow. Experiments in a multi‑intersection test‑bed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real‑time camera based detection. The framework also evaluates the robustness of MARL under faulty or sparse sensing and compares the performance of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better accuracy improves performance, MARL agents can still achieve significant improvements with imperfect detection, demonstrating scalability and adaptability for real‑world scenarios.

PaperID: 12233, https://arxiv.org/pdf/2412.03882.pdf

Authors: Wei Dai, Yash Singh, Rui Zhang

Title: A Multi-agent Simulation for the Mass School Shootings

Abstract:
The increasing frequency of mass school shootings in the United States has been raised as a critical concern. Active shooters kill innocent students and educators in schools. These tragic events highlight the urgent need for effective strategies to minimize casualties. This study aims to address the challenge of simulating and assessing potential mitigation measures by developing a multi‑agent simulation model. Our model is designed to estimate casualty rates and evacuation efficiency during active shooter scenarios within school buildings. The simulation evaluates the impact of a gun detection system on safety outcomes. By simulating school shooting incidents with and without this system, we observe a significant improvement in evacuation rates, which increased from 16.6% to 66.6%. Furthermore, the Gun Detection System reduced the average casualty rate from 24.0% to 12.2% within a period of six minutes, based on a simulated environment with 100 students. We conducted a total of 48 simulations across three different floor layouts, varying the number of students and time intervals to assess the system's adaptability. We anticipate that the research will provide a starting point for demonstrating that a gunshot detection system can significantly improve both evacuation rates and casualty reduction.

PaperID: 12234, https://arxiv.org/pdf/2412.03847.pdf

Authors: Shiwen Ni, Min Yang

Title: Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration

Abstract:
Intelligent dialogue systems are increasingly used in modern education and psychological counseling fields, but most existing systems are limited to a single domain, cannot deal with both educational and psychological issues, and often lack accuracy and professionalism when dealing with complex issues. To address these problems, this paper proposes an intelligent dialog system that combines educational and psychological counseling functions. The system consists of multiple AI agent, including security detection agent, intent identification agent, educational LLM agent, and psychological LLM agent, which work in concert to ensure the provision of accurate educational knowledge Q\&A and psychological support services. Specifically, the system recognizes user‑input intentions through an intention classification model and invokes a retrieval‑enhanced educational grand model and a psychological grand model fine‑tuned with psychological data in order to provide professional educational advice and psychological support.

PaperID: 12235, https://arxiv.org/pdf/2412.03359.pdf

Authors: Chengwei Hu, Jianhui Zheng, Yancheng He, Hangyu Guo, Junguang Jiang, Han Zhu, Kai Sun, Yuning Jiang, Wenbo Su, Bo Zheng

Title: WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis

Abstract:
Recent advancements in autonomous multi‑agent systems (MAS) based on large language models (LLMs) have enhanced the application scenarios and improved the capability of LLMs to handle complex tasks. Despite demonstrating effectiveness, existing studies still evidently struggle to evaluate, analysis, and reproducibility of LLM‑based MAS. In this paper, to facilitate the research on LLM‑based MAS, we introduce an open, scalable, and real‑time updated platform for accessing and analyzing the LLM‑based MAS based on the games Who is Spy?" (WiS). Our platform is featured with three main worths: (1) a unified model evaluate interface that supports models available on Hugging Face; (2) real‑time updated leaderboard for model evaluation; (3) a comprehensive evaluation covering game‑winning rates, attacking, defense strategies, and reasoning of LLMs. To rigorously test WiS, we conduct extensive experiments coverage of various open‑ and closed‑source LLMs, we find that different agents exhibit distinct and intriguing behaviors in the game. The experimental results demonstrate the effectiveness and efficiency of our platform in evaluating LLM‑based MAS. Our platform and its documentation are publicly available at https://whoisspy.ai/.

PaperID: 12236, https://arxiv.org/pdf/2412.03076.pdf

Authors: Francesc Wilhelmi, Boris Bellalta, Szymon Szott, Katarzyna Kosek-Szott, Sergio Barrachina-Muñoz

Title: Coordinated Multi-Armed Bandits for Improved Spatial Reuse in Wi-Fi

Abstract:
Multi‑Access Point Coordination (MAPC) and Artificial Intelligence and Machine Learning (AI/ML) are expected to be key features in future Wi‑Fi, such as the forthcoming IEEE 802.11bn (Wi‑Fi~8) and beyond. In this paper, we explore a coordinated solution based on online learning to drive the optimization of Spatial Reuse (SR), a method that allows multiple devices to perform simultaneous transmissions by controlling interference through Packet Detect (PD) adjustment and transmit power control. In particular, we focus on a Multi‑Agent Multi‑Armed Bandit (MA‑MAB) setting, where multiple decision‑making agents concurrently configure SR parameters from coexisting networks by leveraging the MAPC framework, and study various algorithms and reward‑sharing mechanisms. We evaluate different MA‑MAB implementations using Komondor, a well‑adopted Wi‑Fi simulator, and demonstrate that AI‑native SR enabled by coordinated MABs can improve the network performance over current Wi‑Fi operation: mean throughput increases by 15%, fairness is improved by increasing the minimum throughput across the network by 210%, while the maximum access delay is kept below 3 ms.

PaperID: 12237, https://arxiv.org/pdf/2412.03072.pdf

Authors: Xinyu Qiao, Yudong Hu, Congying Han, Weiyan Wu, Tiande Guo

Title: Preference-based opponent shaping in differentiable games

Abstract:
Strategy learning in game environments with multi‑agent is a challenging problem. Since each agent's reward is determined by the joint strategy, a greedy learning strategy that aims to maximize its own reward may fall into a local optimum. Recent studies have proposed the opponent modeling and shaping methods for game environments. These methods enhance the efficiency of strategy learning by modeling the strategies and updating processes of other agents. However, these methods often rely on simple predictions of opponent strategy changes. Due to the lack of modeling behavioral preferences such as cooperation and competition, they are usually applicable only to predefined scenarios and lack generalization capabilities. In this paper, we propose a novel Preference‑based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation. We introduce the preference parameter, which is incorporated into the agent's loss function, thus allowing the agent to directly consider the opponent's loss function when updating the strategy. We update the preference parameters concurrently with strategy learning to ensure that agents can adapt to any cooperative or competitive game environment. Through a series of experiments, we verify the performance of PBOS algorithm in a variety of differentiable games. The experimental results show that the PBOS algorithm can guide the agent to learn the appropriate preference parameters, so as to achieve better reward distribution in multiple game environments.

PaperID: 12238, https://arxiv.org/pdf/2412.02835.pdf

Authors: Igor Halperin

Title: CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural Networks

Abstract:
We present CAISSON, a novel hierarchical approach to Retrieval‑Augmented Generation (RAG) that transforms traditional single‑vector search into a multi‑view clustering framework. At its core, CAISSON leverages dual Self‑Organizing Maps (SOMs) to create complementary organizational views of the document space, where each view captures different aspects of document relationships through specialized embeddings. The first view processes combined text and metadata embeddings, while the second operates on metadata enriched with concept embeddings, enabling a comprehensive multi‑view analysis that captures both fine‑grained semantic relationships and high‑level conceptual patterns. This dual‑view approach enables more nuanced document discovery by combining evidence from different organizational perspectives. To evaluate CAISSON, we develop SynFAQA, a framework for generating synthetic financial analyst notes and question‑answer pairs that systematically tests different aspects of information retrieval capabilities. Drawing on HotPotQA's methodology for constructing multi‑step reasoning questions, SynFAQA generates controlled test cases where each question is paired with the set of notes containing its ground‑truth answer, progressing from simple single‑entity queries to complex multi‑hop retrieval tasks involving multiple entities and concepts. Our experimental results demonstrate substantial improvements over both basic and enhanced RAG implementations, particularly for complex multi‑entity queries, while maintaining practical response times suitable for interactive applications.

PaperID: 12239, https://arxiv.org/pdf/2412.02776.pdf

Authors: Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk

Title: Hacking CTFs with Plain Agents

Abstract:
We saturate a high‑school‑level hacking benchmark with plain LLM agent design. Concretely, we obtain 95% performance on InterCode‑CTF, a popular offensive security benchmark, using prompting, tool use, and multiple attempts. This beats prior work by Phuong et al. 2024 (29%) and Abramovich et al. 2024 (72%). Our results suggest that current LLMs have surpassed the high school level in offensive cybersecurity. Their hacking capabilities remain underelicited: our ReAct&Plan prompting strategy solves many challenges in 1‑2 turns without complex engineering or advanced harnessing.

PaperID: 12240, https://arxiv.org/pdf/2412.02655.pdf

Authors: Pranav Doma, Aliasghar Arab, Xuesu Xiao

Title: LLM-Enhanced Path Planning: Safe and Efficient Autonomous Navigation with Instructional Inputs

Abstract:
Autonomous navigation guided by natural language instructions is essential for improving human‑robot interaction and enabling complex operations in dynamic environments. While large language models (LLMs) are not inherently designed for planning, they can significantly enhance planning efficiency by providing guidance and informing constraints to ensure safety. This paper introduces a planning framework that integrates LLMs with 2D occupancy grid maps and natural language commands to improve spatial reasoning and task execution in resource‑limited settings. By decomposing high‑level commands and real‑time environmental data, the system generates structured navigation plans for pick‑and‑place tasks, including obstacle avoidance, goal prioritization, and adaptive behaviors. The framework dynamically recalculates paths to address environmental changes and aligns with implicit social norms for seamless human‑robot interaction. Our results demonstrates the potential of LLMs to design context‑aware system to enhance navigation efficiency and safety in industrial and dynamic environments.

PaperID: 12241, https://arxiv.org/pdf/2412.02581.pdf

Authors: Ziheng Liu, Jiayi Zhang, Yiyang Zhu, Enyu Shi, Bo Ai

Title: Mobile Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning: A Scalable Framework

Abstract:
Cell‑free massive multiple‑input multiple‑output (mMIMO) offers significant advantages in mobility scenarios, mainly due to the elimination of cell boundaries and strong macro diversity. In this paper, we examine the downlink performance of cell‑free mMIMO systems equipped with mobile‑APs utilizing the concept of unmanned aerial vehicles, where mobility and power control are jointly considered to effectively enhance coverage and suppress interference. However, the high computational complexity, poor collaboration, limited scalability, and uneven reward distribution of conventional optimization schemes lead to serious performance degradation and instability. These factors complicate the provision of consistent and high‑quality service across all user equipments in downlink cell‑free mMIMO systems. Consequently, we propose a novel scalable framework enhanced by multi‑agent reinforcement learning (MARL) to tackle these challenges. The established framework incorporates a graph neural network (GNN)‑aided communication mechanism to facilitate effective collaboration among agents, a permutation architecture to improve scalability, and a directional decoupling architecture to accurately distinguish contributions. In the numerical results, we present comparisons of different optimization schemes and network architectures, which reveal that the proposed scheme can effectively enhance system performance compared to conventional schemes due to the adoption of advanced technologies. In particular, appropriately compressing the observation space of agents is beneficial for achieving a better balance between performance and convergence.

PaperID: 12242, https://arxiv.org/pdf/2412.02149.pdf

Authors: Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller

Title: Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms

Abstract:
In this paper, we introduce ChatCite, a novel method leveraging large language models (LLMs) for generating comparative literature summaries. The ability to summarize research papers with a focus on key comparisons between studies is an essential task in academic research. Existing summarization models, while effective at generating concise summaries, fail to provide deep comparative insights. ChatCite addresses this limitation by incorporating a multi‑step reasoning mechanism that extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. We evaluate ChatCite on a custom dataset, CompLit‑LongContext, consisting of 1000 research papers with annotated comparative summaries. Experimental results show that ChatCite outperforms several baseline methods, including GPT‑4, BART, T5, and CoT, across various automatic evaluation metrics such as ROUGE and the newly proposed G‑Score. Human evaluation further confirms that ChatCite generates more coherent, insightful, and fluent summaries compared to these baseline models. Our method provides a significant advancement in automatic literature review generation, offering researchers a powerful tool for efficiently comparing and synthesizing scientific research.

PaperID: 12243, https://arxiv.org/pdf/2412.02146.pdf

Authors: Jing Liu, Fangfei Li, Xin Jin, Yang Tang

Title: Distributed Task Allocation for Multi-Agent Systems: A Submodular Optimization Approach

Abstract:
This paper addresses dynamic task allocation in resource‑constrained multi‑agent systems (MASs) with sequentially updated assignments. We develop a submodular maximization framework integrated with q‑independence systems, demonstrating greater flexibility than conventional matroid‑based constraints for modeling heterogeneous resource limitations. The proposed distributed greedy bundles algorithm (DGBA) addresses communication limitations in MASs while providing rigorous approximation guarantees for submodular maximization under a q‑independence system constraint, ensuring low computational complexity. DGBA achieves feasible task allocation in polynomial time with reduced space complexity compared to existing methods. Extensive Monte Carlo simulations in a micro‑satellite observation scenario demonstrate that DGBA consistently outperforms benchmark algorithms in total utility, resource efficiency, and assignment stability, while maintaining real‑time computational feasibility.

PaperID: 12244, https://arxiv.org/pdf/2412.02091.pdf

Authors: Kee Siong Ng, Samuel Yang-Zhao, Timothy Cadogan-Cowper

Title: The Problem of Social Cost in Multi-Agent General Reinforcement Learning: Survey and Synthesis

Abstract:
The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, ends up with unacceptable and even catastrophic collateral damage to others. In this paper, we consider the problem of social harms that can result from actions taken by learning and utility‑maximising agents in a multi‑agent environment. The problem of measuring social harms or impacts in such multi‑agent settings, especially when the agents are artificial generally intelligent (AGI) agents, was listed as an open problem in Everitt et al, 2018. We attempt a partial answer to that open problem in the form of market‑based mechanisms to quantify and control the cost of such social harms. The proposed setup captures many well‑studied special cases and is more general than existing formulations of multi‑agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history‑based general reinforcement learning environment like in AIXI; (ii) the reinforcement‑learning agents participating in the environment can have different learning strategies and planning horizons. To demonstrate the practicality of the proposed setup, we survey some key classes of learning algorithms and present a few applications, including a discussion of the Paperclips problem and pollution control with a cap‑and‑trade system.

PaperID: 12245, https://arxiv.org/pdf/2412.02085.pdf

Authors: Ryosuke Takata, Yujin Tang, Yingtao Tian, Norihiro Maruyama, Hiroki Kojima, Takashi Ikegami

Title: Evolution of Collective AI Beyond Individual Optimization

Abstract:
This study investigates collective behaviors that emerge from a group of homogeneous individuals optimized for a specific capability. We created a group of simple, identical neural network based agents modeled after chemotaxis‑driven vehicles that follow pheromone trails and examined multi‑agent simulations using clones of these evolved individuals. Our results show that the evolution of individuals led to population differentiation. Surprisingly, we observed that collective fitness significantly changed during later evolutionary stages, despite maintained high individual performance and simplified neural architectures. This decline occurred when agents developed reduced sensor‑motor coupling, suggesting that over‑optimization of individual agents almost always lead to less effective group behavior. Our research investigates how individual differentiation can evolve through what evolutionary pathways.

PaperID: 12246, https://arxiv.org/pdf/2412.02057.pdf

Authors: Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins

Title: Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support

Abstract:
In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real‑time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi‑agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q‑Learning (IQL), where each farmer acts independently without coordination, Agent‑by‑Agent (ABA), which sequentially optimizes each farmer's policy in relation to the others, and the Multi‑agent Rollout Policy, which jointly optimizes all farmers' actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi‑agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer‑centric agricultural decision‑making systems.

PaperID: 12247, https://arxiv.org/pdf/2412.02000.pdf

Authors: Trenton Chang, Lindsay Warrenburg, Sae-Hwan Park, Ravi B. Parikh, Maggie Makar, Jenna Wiens

Title: Who's Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation

Abstract:
In many settings, machine learning models may be used to inform decisions that impact individuals or entities who interact with the model. Such entities, or agents, may game model decisions by manipulating their inputs to the model to obtain better outcomes and maximize some utility. We consider a multi‑agent setting where the goal is to identify the "worst offenders:" agents that are gaming most aggressively. However, identifying such agents is difficult without knowledge of their utility function. Thus, we introduce a framework in which each agent's tendency to game is parameterized via a scalar. We show that this gaming parameter is only partially identifiable. By recasting the problem as a causal effect estimation problem where different agents represent different "treatments," we prove that a ranking of all agents by their gaming parameters is identifiable. We present empirical results in a synthetic data study validating the usage of causal effect estimation for gaming detection and show in a case study of diagnosis coding behavior in the U.S. that our approach highlights features associated with gaming.

PaperID: 12248, https://arxiv.org/pdf/2412.01928.pdf

Authors: Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt

Title: MALT: Improving Reasoning with Multi-Agent LLM Training

Abstract:
Large Language Models (LLMs) often produce answers with a single chain‑of‑thought, which restricts their ability to explore reasoning paths or self‑correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi‑Agent LLM Training), a novel post‑training strategy that divides the reasoning process into generation, verification, and refinement steps using a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi‑agent search tree, where final outputs are graded against ground‑truth data. We then apply value iteration to propagate reward signals back to each role‑conditioned model, automatically producing multi‑agent post‑training data without human or teacher‑model supervision. Our off‑policy approach allows each agent to specialize by learning from correct and incorrect trajectories, ultimately improving the end‑to‑end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively, making it an important advance towards multi‑agent cooperative training.

PaperID: 12249, https://arxiv.org/pdf/2412.01812.pdf

Authors: Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z. Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, Xin Xia, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Title: V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Abstract:
Vehicle‑to‑everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single‑vehicle systems. Prior work primarily focuses on single‑frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio‑temporal fusion in V2X scenarios and design one‑step and multi‑step communication strategies (when to transmit) as well as examine their integration with three fusion strategies ‑ early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one‑step communication for end‑to‑end perception and prediction. Our framework employs a unified Transformer‑based architecture to effectively model complex spatio‑temporal relationships across multiple agents, frames, and high‑definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real‑world datasets, which are restricted to single‑frame or single‑mode cooperation. Extensive experiments demonstrate that our framework outperforms state‑of‑the‑art methods in both perception and prediction tasks.

PaperID: 12250, https://arxiv.org/pdf/2412.01778.pdf

Authors: Lajos Muzsai, David Imolai, András Lukács

Title: HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Abstract:
We introduce HackSynth, a novel Large Language Model (LLM)‑based agent capable of autonomous penetration testing. HackSynth's dual‑module architecture includes a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)‑based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM‑based penetration testing agents. Based on these benchmarks, extensive experiments are presented, analyzing the core parameters of HackSynth, including creativity (temperature and top‑p) and token utilization. Multiple open source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT‑4o model, better than what the GPT‑4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM‑based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.

PaperID: 12251, https://arxiv.org/pdf/2412.01753.pdf

Authors: Adnan Abdullah, David Blow, Ruo Chen, Thanakon Uthai, Eric Jing Du, Md Jahidul Islam

Title: Human-Machine Interfaces for Subsea Telerobotics: From Soda-straw to Natural Language Interactions

Abstract:
This review explores the evolution of human‑machine interfaces (HMIs) in subsea telerobotics, charting the progression from traditional first‑person "soda‑straw" consoles ‑‑ characterized by narrow field‑of‑view camera feeds ‑‑ to contemporary interfaces leveraging gesture recognition, virtual reality, and natural language processing. We systematically analyze the state‑of‑the‑art literature through three interrelated perspectives: operator experience (including immersive feedback, cognitive workload, and ergonomic design), robotic autonomy (contextual understanding and task execution), and the quality of bidirectional communication between human and machine. Emphasis is placed on interface features to highlight persistent limitations in current systems, notably in immersive feedback fidelity, intuitive control mechanisms, and the lack of cross‑platform standardization. Additionally, we assess the role of simulators and digital twins as scalable tools for operator training and system prototyping. The review extends beyond classical teleoperation paradigms to examine modern shared autonomy frameworks that facilitate seamless human‑robot collaboration. By synthesizing insights from robotics, marine engineering, artificial intelligence, and human factors ‑‑ this work provides a comprehensive overview of the current landscape and emerging trajectories in subsea HMI development. Finally, we identify key challenges and open research questions and outline a forward‑looking roadmap for advancing intelligent and user‑centric HMI technologies in subsea telerobotics.

PaperID: 12252, https://arxiv.org/pdf/2412.01694.pdf

Authors: Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie

Title: Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

Abstract:
This paper tackles the problem of video question answering (VideoQA), a task that often requires multi‑step reasoning and a profound understanding of spatial‑temporal dynamics. While large video‑language models perform well on benchmarks, they often lack explainability and spatial‑temporal grounding. In this paper, we propose Agent‑of‑Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain‑of‑Thoughts (CoTs) into the instruction‑tuning process. Specifically, we leverage an agent‑based system to decompose complex questions into sub‑tasks, and address them with specialized vision models, the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple‑choice and open‑ended benchmarks.

PaperID: 12253, https://arxiv.org/pdf/2412.01656.pdf

Authors: Shuo Yang, Hongrui Zheng, Cristian-Ioan Vasile, George Pappas, Rahul Mangharam

Title: STLGame: Signal Temporal Logic Games in Adversarial Multi-Agent Systems

Abstract:
We study how to synthesize a robust and safe policy for autonomous systems under signal temporal logic (STL) tasks in adversarial settings against unknown dynamic agents. To ensure the worst‑case STL satisfaction, we propose STLGame, a framework that models the multi‑agent system as a two‑player zero‑sum game, where the ego agents try to maximize the STL satisfaction and other agents minimize it. STLGame aims to find a Nash equilibrium policy profile, which is the best case in terms of robustness against unseen opponent policies, by using the fictitious self‑play (FSP) framework. FSP iteratively converges to a Nash profile, even in games set in continuous state‑action spaces. We propose a gradient‑based method with differentiable STL formulas, which is crucial in continuous settings to approximate the best responses at each iteration of FSP. We show this key aspect experimentally by comparing with reinforcement learning‑based methods to find the best response. Experiments on two standard dynamical system benchmarks, Ackermann steering vehicles and autonomous drones, demonstrate that our converged policy is almost unexploitable and robust to various unseen opponents' policies. All code and additional experimental results can be found on our project website: https://sites.google.com/view/stlgame

PaperID: 12254, https://arxiv.org/pdf/2412.01605.pdf

Authors: Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu

Title: Medchain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence

Abstract:
Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)‑based agents have been tested on general medical knowledge using licensing exams and knowledge question‑answering tasks, their performance in the CDM in real‑world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real‑world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real‑world CDM challenges, we also propose MedChain‑Agent, an AI system that integrates a feedback mechanism and a MCase‑RAG module to learn from previous cases and adapt its responses. MedChain‑Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches.

PaperID: 12255, https://arxiv.org/pdf/2412.01524.pdf

Authors: Yuhan Suo, Runqi Chai, Senchun Chai, Xudong Zhao, Yuanqing Xia

Title: Cost-Aware Distributed Online Learning with Strict Rejection Behavior against Adversarial Agents

Abstract:
Distributed online learning in Internet of Things(IoT)‑enabled multi‑agent systems(MASs) is highly vulnerable to persistent adversarial interactions, particularly when malicious agents cannot be fully isolated during the transient learning stage. Existing resilient learning methods mainly focus on convergence preservation or malicious suppression, while the resulting evolution inefficiency caused by repeated corrective adaptation remains largely unexplored. To address this issue, this paper develops a cost‑aware distributed online learning framework with a strict rejection behavior against adversarial agents. The proposed mechanism suppresses harmful assimilation of suspicious neighboring information and reveals a previously overlooked side effect, that is, the strict rejection may induce heterogeneous transient evolution among neighboring normal agents, leading to evolution desynchronization across the network. To mitigate this effect, a two‑time‑scale adaptive evolution regulation architecture is further developed, in which the outer layer dynamically adjusts the long‑term evolution‑rate schedule while the inner layer preserves robust online learning. Theoretical analysis establishes the dynamic tracking property of the outer‑layer update and proves that the proposed regulation mechanism attenuates the propagation of strict‑rejection‑induced evolution desynchronization. Numerical simulations and a satellite‑assisted IoT monitoring scenario demonstrate that the proposed method achieves robust and low‑cost distributed online learning under persistent malicious interference.

PaperID: 12256, https://arxiv.org/pdf/2412.01333.pdf

Authors: Yang Wu, Yao Wan, Zhaoyang Chu, Wenting Zhao, Ye Liu, Hongyu Zhang, Xuanhua Shi, Philip S. Yu

Title: Can Large Language Models Serve as Evaluators for Code Summarization?

Abstract:
Code summarization facilitates program comprehension and software maintenance by converting code snippets into natural‑language descriptions. Over the years, numerous methods have been developed for this task, but a key challenge remains: effectively evaluating the quality of generated summaries. While human evaluation is effective for assessing code summary quality, it is labor‑intensive and difficult to scale. Commonly used automatic metrics, such as BLEU, ROUGE‑L, METEOR, and BERTScore, often fail to align closely with human judgments. In this paper, we explore the potential of Large Language Models (LLMs) for evaluating code summarization. We propose CODERPE (Role‑Player for Code Summarization Evaluation), a novel method that leverages role‑player prompting to assess the quality of generated summaries. Specifically, we prompt an LLM agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. Each role evaluates the quality of code summaries across key dimensions, including coherence, consistency, fluency, and relevance. We further explore the robustness of LLMs as evaluators by employing various prompting strategies, including chain‑of‑thought reasoning, in‑context learning, and tailored rating form designs. The results demonstrate that LLMs serve as effective evaluators for code summarization methods. Notably, our LLM‑based evaluator, CODERPE , achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.

PaperID: 12257, https://arxiv.org/pdf/2412.01303.pdf

Authors: Xu Yang, Chenhui Lin, Haotian Liu, Wenchuan Wu

Title: RL2: Reinforce Large Language Model to Assist Safe Reinforcement Learning for Energy Management of Active Distribution Networks

Abstract:
As large‑scale distributed energy resources are integrated into the active distribution networks (ADNs), effective energy management in ADNs becomes increasingly prominent compared to traditional distribution networks. Although advanced reinforcement learning (RL) methods, which alleviate the burden of complicated modelling and optimization, have greatly improved the efficiency of energy management in ADNs, safety becomes a critical concern for RL applications in real‑world problems. Since the design and adjustment of penalty functions, which correspond to operational safety constraints, requires extensive domain knowledge in RL and power system operation, the emerging ADN operators call for a more flexible and customized approach to address the penalty functions so that the operational safety and efficiency can be further enhanced. Empowered with strong comprehension, reasoning, and in‑context learning capabilities, large language models (LLMs) provide a promising way to assist safe RL for energy management in ADNs. In this paper, we introduce the LLM to comprehend operational safety requirements in ADNs and generate corresponding penalty functions. In addition, we propose an RL2 mechanism to refine the generated functions iteratively and adaptively through multi‑round dialogues, in which the LLM agent adjusts the functions' pattern and parameters based on training and test performance of the downstream RL agent. The proposed method significantly reduces the intervention of the ADN operators. Comprehensive test results demonstrate the effectiveness of the proposed method.

PaperID: 12258, https://arxiv.org/pdf/2412.01277.pdf

Authors: Joachim Dunkel

Title: Streamlining the Action Dependency Graph Framework: Two Key Enhancements

Abstract:
Multi Agent Path Finding (MAPF) is critical for coordinating multiple robots in shared environments, yet robust execution of generated plans remains challenging due to operational uncertainties. The Action Dependency Graph (ADG) framework offers a way to ensure correct action execution by establishing precedence‑based dependencies between wait and move actions retrieved from a MAPF planning result. The original construction algorithm is not only inefficient, with a quadratic worst‑case time complexity it also results in a network with many redundant dependencies between actions. This paper introduces two key improvements to the ADG framework. First, we prove that wait actions are generally redundant and show that removing them can lead to faster overall plan execution on real robot systems. Second, we propose an optimized ADG construction algorithm, termed Sparse Candidate Partitioning (SCP), which skips unnecessary dependencies and lowers the time complexity to quasi‑linear, thereby significantly improving construction speed.

PaperID: 12259, https://arxiv.org/pdf/2412.01075.pdf

Authors: Dixiao Wei, Peng Yi, Jinlong Lei, Xingyi Zhu

Title: Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks

Abstract:
Truck platooning technology enables a group of trucks to travel closely together, with which the platoon can save fuel, improve traffic flow efficiency, and improve safety. In this paper, we consider the platoon coordination problem in a large‑scale transportation network, to promote cooperation among trucks and optimize the overall efficiency. Involving the regulation of both speed and departure times at hubs, we formulate the coordination problem as a complicated dynamic stochastic integer programming under network and information constraints. To get an autonomous, distributed, and robust platoon coordination policy, we formulate the problem into a model of the Decentralized‑Partial Observable Markov Decision Process. Then, we propose a Multi‑Agent Deep Reinforcement Learning framework named Trcuk Attention‑QMIX (TA‑QMIX) to train an efficient online decision policy. TA‑QMIX utilizes the attention mechanism to enhance the representation of truck fuel gains and delay times, and provides explicit truck cooperation information during the training process, promoting trucks' willingness to cooperate. The training framework adopts centralized training and distributed execution, thus training a policy for trucks to make decisions online using only nearby information. Hence, the policy can be autonomously executed on a large‑scale network. Finally, we perform comparison experiments and ablation experiments in the transportation network of the Yangtze River Delta region in China to verify the effectiveness of the proposed framework. In a repeated comparative experiment with 5,000 trucks, our method average saves 19.17% of fuel with an average delay of only 9.57 minutes per truck and a decision time of 0.001 seconds.

PaperID: 12260, https://arxiv.org/pdf/2412.01033.pdf

Authors: Qiwei Zhao, Xujiang Zhao, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Haifeng Chen

Title: SAUP: Situation Awareness Uncertainty Propagation on LLM Agent

Abstract:
Large language models (LLMs) integrated into multistep agent systems enable complex decision‑making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final‑step outputs, which fail to account for cumulative uncertainty over the multistep decision‑making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM‑based agent's reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step's uncertainty during the propagation. Our method, compatible with various one‑step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state‑of‑the‑art methods, achieving up to 20% improvement in AUROC.

PaperID: 12261, https://arxiv.org/pdf/2412.01017.pdf

Authors: Cade Armstrong, Ryan Park, Xinjie Liu, Kushagra Gupta, David Fridovich-Keil

Title: Inferring Foresightedness in Dynamic Noncooperative Games

Abstract:
Dynamic game theory is an increasingly popular tool for modeling multi‑agent, e.g. human‑robot, interactions. Game‑theoretic models presume that each agent wishes to minimize a private cost function that depends on others' actions. These games typically evolve over a fixed time horizon, specifying how far into the future each agent plans. In practical settings, however, decision‑makers may vary in foresightedness, or how much they care about their current cost in relation to their past and future costs. We conjecture that quantifying and estimating each agent's foresightedness from online data will enable safer and more efficient interactions with other agents. To this end, we frame this inference problem as an inverse dynamic game. We consider a specific objective function parametrization that smoothly interpolates myopic and farsighted planning. Games of this form are readily transformed into parametric mixed complementarity problems; we exploit the directional differentiability of solutions to these problems with respect to their hidden parameters to solve for agents' foresightedness. We conduct three experiments: one with synthetically generated delivery robot motion, one with real‑world data involving people walking, biking, and driving vehicles, and one using high‑fidelity simulators. The results of these experiments demonstrate that explicitly inferring agents' foresightedness enables game‑theoretic models to make 33% more accurate models for agents' behavior.

PaperID: 12262, https://arxiv.org/pdf/2412.00985.pdf

Authors: Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang

Title: Provable Partially Observable Reinforcement Learning with Privileged Information

Abstract:
Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain \emphprivileged information, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of \emphexpert distillation (also known as \emphteacher‑student learning), demonstrating its pitfall in finding near‑optimal policies. We then identify a condition of the partially observable environment, the \emphdeterministic filter condition, under which expert distillation achieves sample and computational complexities that are \emphboth polynomial. Furthermore, we investigate another useful empirical paradigm of \emphasymmetric actor‑critic, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief‑weighted asymmetric actor‑critic algorithm with polynomial sample and quasi‑polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve \emphfilter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi‑agent RL (MARL) with privileged information. We develop algorithms featuring \emphcentralized‑training‑with‑decentralized‑execution, a popular framework in empirical MARL, with polynomial sample and (quasi‑)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.

PaperID: 12263, https://arxiv.org/pdf/2412.00970.pdf

Authors: Jiayi Wang, Ruiwei Xiao, Ying-Jui Tseng

Title: Generating AI Literacy MCQs: A Multi-Agent LLM Approach

Abstract:
Artificial intelligence (AI) is transforming society, making it crucial to prepare the next generation through AI literacy in K‑12 education. However, scalable and reliable AI literacy materials and assessment resources are lacking. To address this gap, our study presents a novel approach to generating multiple‑choice questions (MCQs) for AI literacy assessments. Our method utilizes large language models (LLMs) to automatically generate scalable, high‑quality assessment questions. These questions align with user‑provided learning objectives, grade levels, and Bloom's Taxonomy levels. We introduce an iterative workflow incorporating LLM‑powered critique agents to ensure the generated questions meet pedagogical standards. In the preliminary evaluation, experts expressed strong interest in using the LLM‑generated MCQs, indicating that this system could enrich existing AI literacy materials and provide a valuable addition to the toolkit of K‑12 educators.

PaperID: 12264, https://arxiv.org/pdf/2412.00804.pdf

Authors: Junhyuk Choi, Yeseon Hong, Minju Kim, Bugeun Kim

Title: Examining Identity Drift in Conversations of LLM Agents

Abstract:
Large Language Models (LLMs) show impressive conversational abilities but sometimes show identity drift problems, where their interaction patterns or styles change over time. As the problem has not been thoroughly examined yet, this study examines identity consistency across nine LLMs. Specifically, we (1) investigate whether LLMs could maintain consistent patterns (or identity) and (2) analyze the effect of the model family, parameter sizes, and provided persona types. Our experiments involve multi‑turn conversations on personal themes, analyzed in qualitative and quantitative ways. Experimental results indicate three findings. (1) Larger models experience greater identity drift. (2) Model differences exist, but their effect is not stronger than parameter sizes. (3) Assigning a persona may not help to maintain identity. We hope these three findings can help to improve persona stability in AI‑driven dialogue systems, particularly in long‑term conversations.

PaperID: 12265, https://arxiv.org/pdf/2412.00661.pdf

Authors: Emile Anand, Ishani Karmarkar, Guannan Qu

Title: Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Abstract:
Designing efficient algorithms for multi‑agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision‑making with local agent interactions. In this work, we propose a new algorithm \textttSUBSAMPLE‑MFQ (Subsample‑Mean‑Field‑Q‑learning) and a decentralized randomized policy for a system with n agents. For any k\leq n, our algorithm learns a policy for the system in time polynomial in k. We prove that this learned policy converges to the optimal policy on the order of \tildeO(1/\sqrtk) as the number of subsampled agents k increases. In particular, this bound is independent of the number of agents n.

PaperID: 12266, https://arxiv.org/pdf/2412.00652.pdf

Authors: Zefang Liu

Title: Multi-Agent Collaboration in Incident Response with Large Language Models

Abstract:
Incident response (IR) is a critical aspect of cybersecurity, requiring rapid decision‑making and coordinated efforts to address cyberattacks effectively. Leveraging large language models (LLMs) as intelligent agents offers a novel approach to enhancing collaboration and efficiency in IR scenarios. This paper explores the application of LLM‑based multi‑agent collaboration using the Backdoors & Breaches framework, a tabletop game designed for cybersecurity training. We simulate real‑world IR dynamics through various team structures, including centralized, decentralized, and hybrid configurations. By analyzing agent interactions and performance across these setups, we provide insights into optimizing multi‑agent collaboration for incident response. Our findings highlight the potential of LLMs to enhance decision‑making, improve adaptability, and streamline IR processes, paving the way for more effective and coordinated responses to cyber threats.

PaperID: 12267, https://arxiv.org/pdf/2412.00548.pdf

Authors: Yuta Takahashi, Shin-ichiro Sakai

Title: Neural Power-Optimal Magnetorquer Solution for Multi-Agent Formation and Attitude Control

Abstract:
This paper presents a learning‑based current calculation model to achieve power‑optimal magnetic‑field interaction for multi‑agent formation and attitude control. In aerospace engineering, electromagnetic coils are referred to as magnetorquer (MTQ) coils and used as satellite attitude actuators in Earth's orbit and for long‑term formation and attitude control. This study derives a unique, continuous, and power‑optimal current solution via sequential convex programming and approximates it using a multilayer perceptron model. The effectiveness of our strategy was demonstrated through numerical simulations and experimental trials on the formation and attitude control.

PaperID: 12268, https://arxiv.org/pdf/2412.00534.pdf

Authors: Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao

Title: Towards Fault Tolerance in Multi-Agent Reinforcement Learning

Abstract:
Agent faults pose a significant threat to the performance of multi‑agent reinforcement learning (MARL) algorithms, introducing two key challenges. First, agents often struggle to extract critical information from the chaotic state space created by unexpected faults. Second, transitions recorded before and after faults in the replay buffer affect training unevenly, leading to a sample imbalance problem. To overcome these challenges, this paper enhances the fault tolerance of MARL by combining optimized model architecture with a tailored training data sampling strategy. Specifically, an attention mechanism is incorporated into the actor and critic networks to automatically detect faults and dynamically regulate the attention given to faulty agents. Additionally, a prioritization mechanism is introduced to selectively sample transitions critical to current training needs. To further support research in this area, we design and open‑source a highly decoupled code platform for fault‑tolerant MARL, aimed at improving the efficiency of studying related problems. Experimental results demonstrate the effectiveness of our method in handling various types of faults, faults occurring in any agent, and faults arising at random times.

PaperID: 12269, https://arxiv.org/pdf/2412.00435.pdf

Authors: Shipeng Liu, Boshen Zhang, Zhehui Huang

Title: Towards Real-time Adaptation of Embodied Agent in Human-Robot Collaboration

Abstract:
Large Language Models (LLMs) have opened transformative possibilities for human‑robot collaboration. However, enabling real‑time collaboration requires both low latency and robust reasoning, and most LLMs suffer from high latency. To address this gap, we first propose a fine‑grained benchmark that explicitly assesses agents' proactive adaptability and temporal responsiveness in the Overcooked‑AI environment. Based on evaluation results, we propose MonTA (Monitor‑then‑Adapt), a hierarchical framework inspired by cognitive science research. MonTA contains three key modules: a lightweight Monitor that operates at high frequency (7 Hz) to detect adaptation needs, and two proficient Adapters for subtask and path adaptation reasoning that provide instructions to humans at a lower frequency. Our results demonstrate that MonTA significantly outperforms baseline agents on our proposed benchmark, achieving superior performance across layouts with varying teaming fluency. User studies confirm the high reasonableness of adaptation plans and consistent language instructions provided by our framework to humans.

PaperID: 12270, https://arxiv.org/pdf/2412.00224.pdf

Authors: Saurabh Mishra, Mahendra Shinde, Aniket Yadav, Bilal Ayyub, Anand Rao

Title: An AI-Driven Data Mesh Architecture Enhancing Decision-Making in Infrastructure Construction and Public Procurement

Abstract:
Infrastructure construction, often dubbed an "industry of industries," is closely linked with government spending and public procurement, offering significant opportunities for improved efficiency and productivity through better transparency and information access. By leveraging these opportunities, we can achieve notable gains in productivity, cost savings, and broader economic benefits. Our approach introduces an integrated software ecosystem utilizing Data Mesh and Service Mesh architectures. This system includes the largest training dataset for infrastructure and procurement, encompassing over 100 billion tokens, scientific publications, activities, and risk data, all structured by a systematic AI framework. Supported by a Knowledge Graph linked to domain‑specific multi‑agent tasks and Q&A capabilities, our platform standardizes and ingests diverse data sources, transforming them into structured knowledge. Leveraging large language models (LLMs) and automation, our system revolutionizes data structuring and knowledge creation, aiding decision‑making in early‑stage project planning, detailed research, market trend analysis, and qualitative assessments. Its web‑scalable architecture delivers domain‑curated information, enabling AI agents to facilitate reasoning and manage uncertainties, while preparing for future expansions with specialized agents targeting particular challenges. This integration of AI with domain expertise not only boosts efficiency and decision‑making in construction and infrastructure but also establishes a framework for enhancing government efficiency and accelerating the transition of traditional industries to digital workflows. This work is poised to significantly influence AI‑driven initiatives in this sector and guide best practices in AI Operations.

PaperID: 12271, https://arxiv.org/pdf/2412.00161.pdf

Authors: Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, Tat-Seng Chua

Title: STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Abstract:
Video Large Language Models (Video‑LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse‑grained question answering, but struggle with compositional reasoning that requires multi‑step spatio‑temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio‑temporal compositionality in existing data and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph‑guided self‑training method that enables Video‑LLMs to generate reasoning‑rich fine‑tuning data from any raw videos to improve itself. Specifically, we first induce Spatio‑Temporal Scene Graph (STSG) representation of diverse videos to capture fine‑grained, multi‑granular video semantics. Then, the STSGs guide the derivation of multi‑step reasoning Question‑Answer (QA) data with Chain‑of‑Thought (CoT) rationales. Both answers and rationales are integrated as training objective, aiming to enhance model's reasoning abilities by supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self‑generated rationale‑enriched training samples in both compositional reasoning and comprehensive understanding benchmarks, highlighting the broad applicability and vast potential.

PaperID: 12272, https://arxiv.org/pdf/2412.00033.pdf

Authors: Frédéric Berdoz, Roger Wattenhofer

Title: Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

Abstract:
While autonomous agents often surpass humans in their ability to handle vast and complex data, their potential misalignment (i.e., lack of transparency regarding their true objective) has thus far hindered their use in critical applications such as social decision processes. More importantly, existing alignment methods provide no formal guarantees on the safety of such models. Drawing from utility and social choice theory, we provide a novel quantitative definition of alignment in the context of social decision‑making. Building on this definition, we introduce probably approximately aligned (i.e., near‑optimal) policies, and we derive a sufficient condition for their existence. Lastly, recognizing the practical difficulty of satisfying this condition, we introduce the relaxed concept of safe (i.e., nondestructive) policies, and we propose a simple yet robust method to safeguard the black‑box policy of any autonomous agent, ensuring all its actions are verifiably safe for the society.

PaperID: 12273, https://arxiv.org/pdf/2411.19650.pdf

Authors: Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo

Title: CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Abstract:
The advancement of large Vision‑Language‑Action (VLA) models has significantly improved robotic manipulation in terms of language‑guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision‑Language‑Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a omponentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrates the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real work shows that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT‑2‑X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).

PaperID: 12274, https://arxiv.org/pdf/2411.19639.pdf

Authors: Zifeng Shi, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong

Title: RMIO: A Model-Based MARL Framework for Scenarios with Observation Loss in Some Agents

Abstract:
In recent years, model‑based reinforcement learning (MBRL) has emerged as a solution to address sample complexity in multi‑agent reinforcement learning (MARL) by modeling agent‑environment dynamics to improve sample efficiency. However, most MBRL methods assume complete and continuous observations from each agent during the inference stage, which can be overly idealistic in practical applications. A novel model‑based MARL approach called RMIO is introduced to address this limitation, specifically designed for scenarios where observation is lost in some agent. RMIO leverages the world model to reconstruct missing observations, and further reduces reconstruction errors through inter‑agent information integration to ensure stable multi‑agent decision‑making. Secondly, unlike CTCE methods such as MAMBA, RMIO adopts the CTDE paradigm in standard environment, and enabling limited communication only when agents lack observation data, thereby reducing reliance on communication. Additionally, RMIO improves asymptotic performance through strategies such as reward smoothing, a dual‑layer experience replay buffer, and an RNN‑augmented policy model, surpassing previous work. Our experiments conducted in both the SMAC and MaMuJoCo environments demonstrate that RMIO outperforms current state‑of‑the‑art approaches in terms of asymptotic convergence performance and policy robustness, both in standard mission settings and in scenarios involving observation loss.

PaperID: 12275, https://arxiv.org/pdf/2411.19580.pdf

Authors: Giannis Petousakis, Angelo Cangelosi, Rustam Stolkin, Manolis Chiou

Title: The ATTUNE model for Artificial Trust Towards Human Operators

Abstract:
This paper presents a novel method to quantify Trust in HRI. It proposes an HRI framework for estimating the Robot Trust towards the Human in the context of a narrow and specified task. The framework produces a real‑time estimation of an AI agent's Artificial Trust towards a Human partner interacting with a mobile teleoperation robot. The approach for the framework is based on principles drawn from Theory of Mind, including information about the human state, action, and intent. The framework creates the ATTUNE model for Artificial Trust Towards Human Operators. The model uses metrics on the operator's state of attention, navigational intent, actions, and performance to quantify the Trust towards them. The model is tested on a pre‑existing dataset that includes recordings (ROSbags) of a human trial in a simulated disaster response scenario. The performance of ATTUNE is evaluated through a qualitative and quantitative analysis. The results of the analyses provide insight into the next stages of the research and help refine the proposed approach.

PaperID: 12276, https://arxiv.org/pdf/2411.19571.pdf

Authors: Ziming Wang, Yun Gao, Apostolos I. Rikos, Ning Pang, Yiding Ji

Title: Fixed-Relative-Switched Threshold Strategies for Consensus Tracking Control of Nonlinear Multiagent Systems

Abstract:
This paper investigates event‑triggered consensus tracking in nonlinear semi‑strict‑feedback multi‑agent systems involving one leader and multiple followers. We first employ radial basis function neural networks and backstepping techniques to approximate the unknown nonlinear dynamics, facilitating the design of dual observers to measure the unknown states and disturbances. Then three adaptive event‑triggered control schemes are proposed: fixed‑threshold, relative‑threshold, and switched‑threshold configurations, each featuring specialized controller architectures and triggering mechanisms. Through Lyapunov stability analysis, we establish that the follower agents can asymptotically track the reference trajectory of the leader, meanwhile all error signals remain uniform bounded. Our proposed control strategies effectively prevent Zeno behaviors through stringent exclusion criteria. Finally, an illustrative example is presented, demonstrating the competitive performance of our control framework in achieving consensus tracking and optimizing triggering efficiency.

PaperID: 12277, https://arxiv.org/pdf/2411.19526.pdf

Authors: Yang Lv, Jinlong Lei, Peng Yi

Title: A Local Information Aggregation based Multi-Agent Reinforcement Learning for Robot Swarm Dynamic Task Allocation

Abstract:
In this paper, we explore how to optimize task allocation for robot swarms in dynamic environments, emphasizing the necessity of formulating robust, flexible, and scalable strategies for robot cooperation. We introduce a novel framework using a decentralized partially observable Markov decision process (Dec_POMDP), specifically designed for distributed robot swarm networks. At the core of our methodology is the Local Information Aggregation Multi‑Agent Deep Deterministic Policy Gradient (LIA_MADDPG) algorithm, which merges centralized training with distributed execution (CTDE). During the centralized training phase, a local information aggregation (LIA) module is meticulously designed to gather critical data from neighboring robots, enhancing decision‑making efficiency. In the distributed execution phase, a strategy improvement method is proposed to dynamically adjust task allocation based on changing and partially observable environmental conditions. Our empirical evaluations show that the LIA module can be seamlessly integrated into various CTDE‑based MARL methods, significantly enhancing their performance. Additionally, by comparing LIA_MADDPG with six conventional reinforcement learning algorithms and a heuristic algorithm, we demonstrate its superior scalability, rapid adaptation to environmental changes, and ability to maintain both stability and convergence speed. These results underscore LIA_MADDPG's outstanding performance and its potential to significantly improve dynamic task allocation in robot swarms through enhanced local collaboration and adaptive strategy execution.

PaperID: 12278, https://arxiv.org/pdf/2411.19359.pdf

Authors: Dickness Kakitahi Kwesiga, Suyash Chandra Vishnoi, Angshuman Guin, Michael Hunter

Title: Integrating Transit Signal Priority into Multi-Agent Reinforcement Learning based Traffic Signal Control

Abstract:
This study integrates Transit Signal Priority (TSP) into multi‑agent reinforcement learning (MARL) based traffic signal control. The first part of the study develops adaptive signal control based on MARL for a pair of coordinated intersections in a microscopic simulation environment. The two agents, one for each intersection, are centrally trained using a value decomposition network (VDN) architecture. The trained agents show slightly better performance compared to coordinated actuated signal control based on overall intersection delay at v/c of 0.95. In the second part of the study the trained signal control agents are used as background signal controllers while developing event‑based TSP agents. In one variation, independent TSP agents are formulated and trained under a decentralized training and decentralized execution (DTDE) framework to implement TSP at each intersection. In the second variation, the two TSP agents are centrally trained under a centralized training and decentralized execution (CTDE) framework and VDN architecture to select and implement coordinated TSP strategies across the two intersections. In both cases the agents converge to the same bus delay value, but independent agents show high instability throughout the training process. For the test runs, the two independent agents reduce bus delay across the two intersections by 22% compared to the no TSP case while the coordinated TSP agents achieve 27% delay reduction. In both cases, there is only a slight increase in delay for a majority of the side street movements.

PaperID: 12279, https://arxiv.org/pdf/2411.19204.pdf

Authors: Kelvin Summoogum, Debayan Das, Sathish Kumaran, Sumit Bhagra

Title: A Voice-based Triage for Type 2 Diabetes using a Conversational Virtual Assistant in the Home Environment

Abstract:
Incorporating cloud technology with Internet of Medical Things for ubiquitous healthcare has seen many successful applications in the last decade with the advent of machine learning and deep learning techniques. One of these applications, namely voice‑based pathology, has yet to receive notable attention from academia and industry. Applying voice analysis to early detection of fatal diseases holds much promise to improve health outcomes and quality of life of patients. In this paper, we propose a novel application of acoustic machine learning based triaging into commoditised conversational virtual assistant systems to pre‑screen for onset of diabetes. Specifically, we developed a triaging system which extracts acoustic features from the voices of n=24 older adults when they converse with a virtual assistant and predict the incidence of Diabetes Mellitus (Type 2) or not. Our triaging system achieved hit‑rates of 70% and 60% for male and female older adult subjects, respectively. Our proposed triaging uses 7 non‑identifiable voice‑based features and can operate within resource‑constrained embedded systems running voice‑based virtual assistants. This application demonstrates the feasibility of applying voice‑based pathology analysis to improve health outcomes of older adults within the home environment by early detection of life‑changing chronic conditions like diabetes.

PaperID: 12280, https://arxiv.org/pdf/2411.19043.pdf

Authors: Mayur Amarnath Palavalli, Mark Santolucito

Title: Using a Feedback Loop for LLM-based Infrastructure as Code Generation

Abstract:
Code generation with Large Language Models (LLMs) has helped to increase software developer productivity in coding tasks, but has yet to have significant impact on the tasks of software developers that surround this code. In particular, the challenge of infrastructure management remains an open question. We investigate the ability of an LLM agent to construct infrastructure using the Infrastructure as Code (IaC) paradigm. We particularly investigate the use of a feedback loop that returns errors and warnings on the generated IaC to allow the LLM agent to improve the code. We find that, for each iteration of the loop, its effectiveness decreases exponentially until it plateaus at a certain point and becomes ineffective.

PaperID: 12281, https://arxiv.org/pdf/2411.19039.pdf

Authors: Xiaoxuan Lou, Chaojie Wang, Bo An

Title: Mars-PO: Multi-Agent Reasoning System Preference Optimization

Abstract:
Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto‑regressive generation process often makes LLMs susceptible to errors, hallucinations, and inconsistencies, particularly during multi‑step reasoning. In this paper, we propose Mars‑PO, a novel framework to improve the mathematical reasoning capabilities of LLMs through a multi‑agent system. It combines high‑quality outputs from multiple agents into a hybrid positive sample set and pairs them with agent‑specific negative samples to construct robust preference pairs for training. By aligning agents with shared positive samples while addressing individual weaknesses, Mars‑PO achieves substantial performance improvements on mathematical reasoning benchmarks. For example, it increases the accuracy on the MATH benchmark of the state‑of‑the‑art instruction‑tuned LLM, Llama3.1‑8B‑Instruct, from 50.38% to 57.82%. Experimental results further demonstrate that our method consistently outperforms other baselines, such as supervised fine‑tuning, vanilla DPO, and its enhanced versions, highlighting the effectiveness of our approach.

PaperID: 12282, https://arxiv.org/pdf/2411.19000.pdf

Authors: Chenyu Tang, Ruizhi Zhang, Shuo Gao, Zihe Zhao, Zibo Zhang, Jiaqi Wang, Cong Li, Junliang Chen, Yanning Dai, Shengbo Wang, Ruoyu Juan, Qiaoying Li, Ruimou Xie, Xuhang Chen, Xinkai Zhou, Yunjia Xia, Jianan Chen, Fanghao Lu, Xin Li, Ninglli Wang, Peter Smielewski, Yu Pan, Hubin Zhao, Luigi G. Occhipinti

Title: An AI-Driven Multimodal Smart Home Platform for Continuous Monitoring and Assistance in Post-Stroke Motor Impairment

Abstract:
At‑home rehabilitation for post‑stroke patients presents significant challenges, as continuous, personalized care is often limited outside clinical settings. Moreover, the lack of integrated solutions capable of simultaneously monitoring motor recovery and providing intelligent assistance in home environments hampers rehabilitation outcomes. Here, we present a multimodal smart home platform designed for continuous, at‑home rehabilitation of post‑stroke patients, integrating wearable sensing, ambient monitoring, and adaptive automation. A plantar pressure insole equipped with a machine learning pipeline classifies users into motor recovery stages with up to 94% accuracy, enabling quantitative tracking of walking patterns during daily activities. An optional head‑mounted eye‑tracking module, together with ambient sensors such as cameras and microphones, supports seamless hands‑free control of household devices with a 100% success rate and sub‑second response time. These data streams are fused locally via a hierarchical Internet of Things (IoT) architecture, ensuring low latency and data privacy. An embedded large language model (LLM) agent, Auto‑Care, continuously interprets multimodal data to provide real‑time interventions ‑‑ issuing personalized reminders, adjusting environmental conditions, and notifying caregivers. Implemented in a post‑stroke context, this integrated smart home platform increased mean user satisfaction from 3.9 \pm 0.8 in conventional home environments to 8.4 \pm 0.6 with the full system (n=20). Beyond stroke, the system offers a scalable, patient‑centered framework with potential for long‑term use in broader neurorehabilitation and aging‑in‑place applications.

PaperID: 12283, https://arxiv.org/pdf/2411.18915.pdf

Authors: Vishnou Vinayagame, Gregory Senay, Luis Martí

Title: MATATA: Weakly Supervised End-to-End MAthematical Tool-Augmented Reasoning for Tabular Applications

Abstract:
Business documents often contain substantial tabular and textual information with numerical values, requiring mathematical reasoning for effective document understanding. While Small Language Models (SLMs) still struggle at this task, tool‑augmented multi‑step agents perform better, at the cost of relying on closed‑source or larger models, external data, or extensive prompt‑engineering. This work introduces MATATA, a novel weakly supervised end‑to‑end approach to train multi‑step reasoning language agents for document tabular applications. MATATA presents an annotation‑free paradigm for each agent to enhance 3.8B/8B SLMs. During its two‑stage training, MATATA uses the final outcome of the multi‑step reasoning chain as weak supervision. This approach avoids having to individually supervise each intermediate agent in the reasoning chain. By employing an adaptive planner and shared tools across different datasets, MATATA shows robust performance. Experiments demonstrate that MATATA achieves state‑of‑the‑art on FinQA, and on TAT‑QA among reasoning methods based on open‑source SLMs. Although being SLM‑based, MATATA closely matches GPT‑4‑based frameworks on TabMWP. This novel weakly supervised approach enables training an end‑to‑end multi‑step reasoning agent without intermediate supervision, supporting future developments of cost‑effective powerful agentic systems.

PaperID: 12284, https://arxiv.org/pdf/2411.18908.pdf

Authors: Wataru Kawabe, Yusuke Sugano

Title: DuetML: Human-LLM Collaborative Machine Learning Framework for Non-Expert Users

Abstract:
Machine learning (ML) models have significantly impacted various domains in our everyday lives. While large language models (LLMs) offer intuitive interfaces and versatility, task‑specific ML models remain valuable for their efficiency and focused performance in specialized tasks. However, developing these models requires technical expertise, making it particularly challenging for non‑expert users to customize them for their unique needs. Although interactive machine learning (IML) aims to democratize ML development through user‑friendly interfaces, users struggle to translate their requirements into appropriate ML tasks. We propose human‑LLM collaborative ML as a new paradigm bridging human‑driven IML and machine‑driven LLM approaches. To realize this vision, we introduce DuetML, a framework that integrates multimodal LLMs (MLLMs) as interactive agents collaborating with users throughout the ML process. Our system carefully balances MLLM capabilities with user agency by implementing both reactive and proactive interactions between users and MLLM agents. Through a comparative user study, we demonstrate that DuetML enables non‑expert users to define training data that better aligns with target tasks without increasing cognitive load, while offering opportunities for deeper engagement with ML task formulation.

PaperID: 12285, https://arxiv.org/pdf/2411.18845.pdf

Authors: Minzhe Tan, Xinlin Fan, Jian He, Yi Hou, Zhan Liu, Yaopeng Jiang, Y. M. Jiang

Title: An Integrated Artificial Intelligence Operating System for Advanced Low-Altitude Aviation Applications

Abstract:
This paper introduces a high‑performance artificial intelligence operating system tailored for low‑altitude aviation, designed to address key challenges such as real‑time task execution, computational efficiency, and seamless modular collaboration. Built on a powerful hardware platform and leveraging the UNIX architecture, the system implements a distributed data processing strategy that ensures rapid and efficient synchronization across critical modules, including vision, navigation, and perception. By adopting dynamic resource management, it optimally allocates computational resources, such as CPU and GPU, based on task priority and workload, ensuring high performance for demanding tasks like real‑time video processing and AI model inference. Furthermore, the system features an advanced interrupt handling mechanism that allows for quick responses to sudden environmental changes, such as obstacle detection, by prioritizing critical tasks, thus improving safety and mission success rates. Robust security measures, including data encryption, access control, and fault tolerance, ensure the system's resilience against external threats and its ability to recover from potential hardware or software failures. Complementing these core features are modular components for image analysis, multi‑sensor fusion, dynamic path planning, multi‑drone coordination, and ground station monitoring. Additionally, a low‑code development platform simplifies user customization, making the system adaptable to various mission‑specific needs. This comprehensive approach ensures the system meets the evolving demands of intelligent aviation, providing a stable, efficient, and secure environment for complex drone operations.

PaperID: 12286, https://arxiv.org/pdf/2411.18587.pdf

Authors: Suzanne Oliver, Tomoko Kitago, Adam Buchwald, S. Farokh Atashzar

Title: EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement

Abstract:
User engagement, cognitive participation, and motivation during task execution in physical human‑robot interaction are crucial for motor learning. These factors are especially important in contexts like robotic rehabilitation, where neuroplasticity is targeted. However, traditional robotic rehabilitation systems often face challenges in maintaining user engagement, leading to unpredictable therapeutic outcomes. To address this issue, various techniques, such as assist‑as‑needed controllers, have been developed to prevent user slacking and encourage active participation. In this paper, we introduce a new direction through a novel multi‑modal robotic interaction designed to enhance user engagement by synergistically integrating visual, motor, cognitive, and auditory (speech recognition) tasks into a single, comprehensive activity. To assess engagement quantitatively, we compared multiple electroencephalography (EEG) biomarkers between this multi‑modal protocol and a traditional motor‑only protocol. Fifteen healthy adult participants completed 100 trials of each task type. Our findings revealed that EEG biomarkers, particularly relative alpha power, showed statistically significant improvements in engagement during the multi‑modal task compared to the motor‑only task. Moreover, while engagement decreased over time in the motor‑only task, the multi‑modal protocol maintained consistent engagement, suggesting that users could remain engaged for longer therapy sessions. Our observations on neural responses during interaction indicate that the proposed multi‑modal approach can effectively enhance user engagement, which is critical for improving outcomes. This is the first time that objective neural response highlights the benefit of a comprehensive robotic intervention combining motor, cognitive, and auditory functions in healthy subjects.

PaperID: 12287, https://arxiv.org/pdf/2411.18498.pdf

Authors: Nicolas Coucke, Mary Katherine Heinrich, Axel Cleeremans, Marco Dorigo, Guillaume Dumas

Title: Collective decision making by embodied neural agents

Abstract:
Collective decision making using simple social interactions has been studied in many types of multi‑agent systems, including robot swarms and human social networks. However, existing multi‑agent studies have rarely modeled the neural dynamics that underlie sensorimotor coordination in embodied biological agents. In this study, we investigated collective decisions that resulted from sensorimotor coordination among agents with simple neural dynamics. We equipped our agents with a model of minimal neural dynamics based on the coordination dynamics framework, and embedded them in an environment with a stimulus gradient. In our single‑agent setup, the decision between two stimulus sources depends solely on the coordination of the agent's neural dynamics with its environment. In our multi‑agent setup, that same decision also depends on the sensorimotor coordination between agents, via their simple social interactions. Our results show that the success of collective decisions depended on a balance of intra‑agent, inter‑agent, and agent‑environment coupling, and we use these results to identify the influences of environmental factors on decision difficulty. More generally, our results demonstrate the impact of intra‑ and inter‑brain coordination dynamics on collective behavior, can contribute to existing knowledge on the functional role of inter‑agent synchrony, and are relevant to ongoing developments in neuro‑AI and self‑organized multi‑agent systems.

PaperID: 12288, https://arxiv.org/pdf/2411.18444.pdf

Authors: Frederic Kirstein, Terry Ruas, Bela Gipp

Title: Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

Abstract:
The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM‑based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM‑based framework employing a three‑step assessment of individual error types, multi‑agent discussion for decision refinement, and feedback‑based self‑training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT‑4o as its backbone, MESA achieves mid to high Point‑Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human‑labeled data.

PaperID: 12289, https://arxiv.org/pdf/2411.18429.pdf

Authors: Onno P. Kampman, Ye Sheng Phang, Stanley Han, Michael Xing, Xinyi Hong, Hazirah Hoosainsah, Caleb Tan, Genta Indra Winata, Skyler Wang, Creighton Heaukulani, Janice Huiqin Weng, Robert JT Morris

Title: A Multi-Agent Dual Dialogue System to Support Mental Health Care Providers

Abstract:
We introduce a general‑purpose, human‑in‑the‑loop dual dialogue system to support mental health care professionals. The system, co‑designed with care providers, is conceptualized to assist them in interacting with care seekers rather than functioning as a fully automated dialogue system solution. The AI assistant within the system reduces the cognitive load of mental health care providers by proposing responses, analyzing conversations to extract pertinent themes, summarizing dialogues, and recommending localized relevant content and internet‑based cognitive behavioral therapy exercises. These functionalities are achieved through a multi‑agent system design, where each specialized, supportive agent is characterized by a large language model. In evaluating the multi‑agent system, we focused specifically on the proposal of responses to emotionally distressed care seekers. We found that the proposed responses matched a reasonable human quality in demonstrating empathy, showing its appropriateness for augmenting the work of mental health care providers.

PaperID: 12290, https://arxiv.org/pdf/2411.18302.pdf

Authors: Xiyan Jiang, Xiaocong Zhao, Yiru Liu, Zirui Li, Peng Hang, Lu Xiong, Jian Sun

Title: InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Abstract:
The driving interaction‑a critical yet complex aspect of daily driving‑lies at the core of autonomous driving research. However, real‑world driving scenarios sparsely capture rich interaction events, limiting the availability of comprehensive trajectory datasets for this purpose. To address this challenge, we present InterHub, a dense interaction dataset derived by mining interaction events from extensive naturalistic driving records. We employ formal methods to describe and extract multi‑agent interaction events, exposing the limitations of existing autonomous driving solutions. Additionally, we introduce a user‑friendly toolkit enabling the expansion of InterHub with both public and private data. By unifying, categorizing, and analyzing diverse interaction events, InterHub facilitates cross‑comparative studies and large‑scale research, thereby advancing the evaluation and development of autonomous driving technologies.

PaperID: 12291, https://arxiv.org/pdf/2411.18266.pdf

Authors: Chenyu Tang, Shuo Gao, Cong Li, Wentian Yi, Yuxuan Jin, Xiaoxue Zhai, Sixuan Lei, Hongbei Meng, Zibo Zhang, Muzi Xu, Shengbo Wang, Xuhang Chen, Chenxi Wang, Hongyun Yang, Ningli Wang, Wenyu Wang, Jin Cao, Xiaodong Feng, Peter Smielewski, Yu Pan, Wenhui Song, Martin Birchall, Luigi G. Occhipinti

Title: Wearable intelligent throat enables natural speech in stroke patients with dysarthria

Abstract:
Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI‑driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. The system utilizes ultrasensitive textile strain sensors to capture high‑quality signals from the neck area and supports token‑level processing for real‑time, continuous speech decoding, enabling seamless, delay‑free communication. In tests with five stroke patients with dysarthria, IT's LLM agents intelligently corrected token errors and enriched sentence‑level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi‑language support systems.

PaperID: 12292, https://arxiv.org/pdf/2411.18241.pdf

Authors: Zhihua Duan, Jialin Wang

Title: Exploration of LLM Multi-Agent Application Implementation Based on LangGraph+CrewAI

Abstract:
With the rapid development of large model technology, the application of agent technology in various fields is becoming increasingly widespread, profoundly changing people's work and lifestyles. In complex and dynamic systems, multi‑agents achieve complex tasks that are difficult for a single agent to complete through division of labor and collaboration among agents. This paper discusses the integrated application of LangGraph and CrewAI. LangGraph improves the efficiency of information transmission through graph architecture, while CrewAI enhances team collaboration capabilities and system performance through intelligent task allocation and resource management. The main research contents of this paper are: (1) designing the architecture of agents based on LangGraph for precise control; (2) enhancing the capabilities of agents based on CrewAI to complete a variety of tasks. This study aims to delve into the application of LangGraph and CrewAI in multi‑agent systems, providing new perspectives for the future development of agent technology, and promoting technological progress and application innovation in the field of large model intelligent agents.

PaperID: 12293, https://arxiv.org/pdf/2411.18086.pdf

Authors: Yunwoo Lee, Jungwon Park, H. Jin Kim

Title: DMVC-Tracker: Distributed Multi-Agent Trajectory Planning for Target Tracking Using Dynamic Buffered Voronoi and Inter-Visibility Cells

Abstract:
This letter presents a distributed trajectory planning method for multi‑agent aerial tracking. The proposed method uses a Dynamic Buffered Voronoi Cell (DBVC) and a Dynamic Inter‑Visibility Cell (DIVC) to formulate the distributed trajectory generation. Specifically, the DBVC and the DIVC are time‑variant spaces that prevent mutual collisions and occlusions among agents, while enabling them to maintain suitable distances from the moving target. We combine the DBVC and the DIVC with an efficient Bernstein polynomial motion primitive‑based tracking generation method, which has been refined into a less conservative approach than in our previous work. The proposed algorithm can compute each agent's trajectory within several milliseconds on an Intel i7 desktop. We validate the tracking performance in challenging scenarios, including environments with dozens of obstacles.

PaperID: 12294, https://arxiv.org/pdf/2411.17782.pdf

Authors: Ting Xiaoyang, Minfeng Zhang, Shu gonglee, Saimin Chen Zhang

Title: Joint Resource Optimization, Computation Offloading and Resource Slicing for Multi-Edge Traffic-Cognitive Networks

Abstract:
The evolving landscape of edge computing envisions platforms operating as dynamic intermediaries between application providers and edge servers (ESs), where task offloading is coupled with payments for computational services. Ensuring efficient resource utilization and meeting stringent Quality of Service (QoS) requirements necessitates incentivizing ESs while optimizing the platforms operational objectives. This paper investigates a multi‑agent system where both the platform and ESs are self‑interested entities, addressing the joint optimization of revenue maximization, resource allocation, and task offloading. We propose a novel Stackelberg game‑based framework to model interactions between stakeholders and solve the optimization problem using a Bayesian Optimization‑based centralized algorithm. Recognizing practical challenges in information collection due to privacy concerns, we further design a decentralized solution leveraging neural network optimization and a privacy‑preserving information exchange protocol. Extensive numerical evaluations demonstrate the effectiveness of the proposed mechanisms in achieving superior performance compared to existing baselines.

PaperID: 12295, https://arxiv.org/pdf/2411.17724.pdf

Authors: Aslan S. Dizaji

Title: Incentives to Build Houses, Trade Houses, or Trade House Building Skills in Simulated Worlds under Various Governing Systems or Institutions: Comparing Multi-agent Reinforcement Learning to Generative Agent-based Model

Abstract:
It has been shown that social institutions impact human motivations to produce different behaviours, such as amount of working or specialisation in labor. With advancement in artificial intelligence (AI), specifically large language models (LLMs), now it is possible to perform in‑silico simulations to test various hypotheses around this topic. Here, I simulate two somewhat similar worlds using multi‑agent reinforcement learning (MARL) framework of the AI‑Economist and generative agent‑based model (GABM) framework of the Concordia. In the extended versions of the AI‑Economist and Concordia, the agents are able to build houses, trade houses, and trade house building skill. Moreover, along the individualistic‑collectivists axis, there are a set of three governing systems: Full‑Libertarian, Semi‑Libertarian/Utilitarian, and Full‑Utilitarian. Additionally, in the extended AI‑Economist, the Semi‑Libertarian/Utilitarian system is further divided to a set of three governing institutions along the discriminative axis: Inclusive, Arbitrary, and Extractive. Building on these, I am able to show that among governing systems and institutions of the extended AI‑Economist, under the Semi‑Libertarian/Utilitarian and Inclusive government, the ratios of building houses to trading houses and trading house building skill are higher than the rest. Furthermore, I am able to show that in the extended Concordia when the central government care about equality in the society, the Full‑Utilitarian system generates agents building more houses and trading more house building skill. In contrast, these economic activities are higher under the Full‑Libertarian system when the central government cares about productivity in the society. Overall, the focus of this paper is to compare and contrast two advanced techniques of AI, MARL and GABM, to simulate a similar social phenomena with limitations.

PaperID: 12296, https://arxiv.org/pdf/2411.17636.pdf

Authors: Harsh Singh, Rocktim Jyoti Das, Mingfei Han, Preslav Nakov, Ivan Laptev

Title: MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation

Abstract:
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. While recent efforts in robotics have leveraged LLMs both for high‑level and low‑level planning, these approaches often face significant challenges, such as hallucinations in long‑horizon tasks and limited adaptability due to the generation of plans in a single pass without real‑time feedback. To address these limitations, we propose a novel multi‑agent LLM framework, Multi‑Agent Large Language Model for Manipulation (MALMM) that distributes high‑level planning and low‑level control code generation across specialized LLM agents, supervised by an additional agent that dynamically manages transitions. By incorporating observations from the environment after each step, our framework effectively handles intermediate failures and enables adaptive re‑planning. Unlike existing methods, our approach does not rely on pre‑trained skill policies or in‑context learning examples and generalizes to a variety of new tasks. We evaluate our approach on nine RLBench tasks, including long‑horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero‑shot setting, thereby overcoming key limitations of existing LLM‑based manipulation methods.

PaperID: 12297, https://arxiv.org/pdf/2411.17375.pdf

Authors: Theodora Worledge, Tatsunori Hashimoto, Carlos Guestrin

Title: The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations

Abstract:
Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high‑stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information‑sharing tools, we introduce the extractive‑abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in‑context sources without reliable citation. We define five operating points that span the extractive‑abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real‑world QA settings: web search, language simplification, multi‑step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain‑specific LLM systems and our failure analysis informs approaches to high‑utility LLM systems that empower users to verify information.

PaperID: 12298, https://arxiv.org/pdf/2411.17335.pdf

Authors: Zeyu Ling, Bo Han, Shiyang Li, Jikang Cheng, Hongdeng Shen, Changqing Zou

Title: VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension

Abstract:
Large language models (LLMs) are, by design, inherently capable of multi‑task learning: through a unified next‑token prediction paradigm, they can naturally address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs via a Motion Tokenizer coupled with an autoregressive Transformer to generate and understand human motion. However, this generality remains limited in scope and yields only modest performance gains. We introduce VersatileMotion, a unified multimodal motion LLM that combines a novel motion tokenizer, integrating VQ‑VAE with flow matching, and an autoregressive transformer backbone to seamlessly support at least nine distinct motion‑related tasks. VersatileMotion is the first method to handle single‑agent and multi‑agent motions in a single framework and enable cross‑modal conversion between motion, text, music, and speech, achieving state‑of‑the‑art performance on seven of these tasks. Each sequence in MotionHub may include one or more of the following annotations: natural‑language captions, music or audio clips, speech transcripts, and multi‑agent interaction data. To facilitate evaluation, we define and release benchmark splits covering nine core tasks. Extensive experiments demonstrate the superior performance, versatility, and potential of VersatileMotion as a foundational model for future understanding and generation of motion.

PaperID: 12299, https://arxiv.org/pdf/2411.17290.pdf

Authors: Chao Ge, Ge Chen, Zhipeng Jiang

Title: A "Breathing" Mobile Communication Network

Abstract:
The frequent migration of large‑scale users leads to the load imbalance of mobile communication networks, which causes resource waste and decreases user experience. To address the load balancing problem, this paper proposes a dynamic optimization framework for mobile communication networks inspired by the average consensus in multi‑agent systems. In this framework, all antennas cooperatively optimize their CPICH (Common Pilot Channel) transmit power in real‑time to balance their busy‑degrees. Then, the coverage area of each antenna would change accordingly, and we call this framework a ``breathing'' mobile communication network. To solve this optimization problem, two algorithms named BDBA (Busy‑degree Dynamic Balancing Algorithm) and BFDBA (Busy‑degree Fast Dynamic Balancing Algorithm) are proposed. Moreover, a fast network coverage calculation method is introduced, by which each antenna's minimum CPICH transmit power is determined under the premise of meeting the network coverage requirements. Besides, we present the theoretical analysis of the two proposed algorithms' performance, which prove that all antennas' busy‑degrees will reach consensus under certain assumptions. Furthermore, simulations carried out on three large datasets demonstrate that our cooperative optimization can significantly reduce the unbalance among antennas as well as the proportion of over‑busy antennas.

PaperID: 12300, https://arxiv.org/pdf/2411.17135.pdf

Authors: Yujeong Lee, Sangwoo Shin, Wei-Jin Park, Honguk Woo

Title: LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble

Abstract:
Employing large language models (LLMs) to enable embodied agents has become popular, yet it presents several limitations in practice. In this work, rather than using LLMs directly as agents, we explore their use as tools for embodied agent learning. Specifically, to train separate agents via offline reinforcement learning (RL), an LLM is used to provide dense reward feedback on individual actions in training datasets. In doing so, we present a consistency‑guided reward ensemble framework (CoREN), designed for tackling difficulties in grounding LLM‑generated estimates to the target environment domain. The framework employs an adaptive ensemble of spatio‑temporally consistent rewards to derive domain‑grounded rewards in the training datasets, thus enabling effective offline learning of embodied agents in different environment domains. Experiments with the VirtualHome benchmark demonstrate that CoREN significantly outperforms other offline RL agents, and it also achieves comparable performance to state‑of‑the‑art LLM‑based agents with 8B parameters, despite CoREN having only 117M parameters for the agent policy network and using LLMs only for training.

PaperID: 12301, https://arxiv.org/pdf/2411.17065.pdf

Authors: Naomi Imasato, Kazuki Miyazawa, Takayuki Nagai, Takato Horii

Title: Creative Agents: Simulating the Systems Model of Creativity with Generative Agents

Abstract:
With the growing popularity of generative AI for images, video, and music, we witnessed models rapidly improve in quality and performance. However, not much attention is paid towards enabling AI's ability to "be creative". In this study, we implemented and simulated the systems model of creativity (proposed by Csikszentmihalyi) using virtual agents utilizing large language models (LLMs) and text prompts. For comparison, the simulations were conducted with the "virtual artists" being: 1)isolated and 2)placed in a multi‑agent system. Both scenarios were compared by analyzing the variations and overall "creativity" in the generated artifacts (measured via a user study and LLM). Our results suggest that the generative agents may perform better in the framework of the systems model of creativity.

PaperID: 12302, https://arxiv.org/pdf/2411.16785.pdf

Authors: Vladimir Yugay, Theo Gevers, Martin R. Oswald

Title: MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM

Abstract:
Simultaneous localization and mapping (SLAM) systems with novel view synthesis capabilities are widely used in computer vision, with applications in augmented reality, robotics, and autonomous driving. However, existing approaches are limited to single‑agent operation. Recent work has addressed this problem using a distributed neural scene representation. Unfortunately, existing methods are slow, cannot accurately render real‑world data, are restricted to two agents, and have limited tracking accuracy. In contrast, we propose a rigidly deformable 3D Gaussian‑based scene representation that dramatically speeds up the system. However, improving tracking accuracy and reconstructing a globally consistent map from multiple agents remains challenging due to trajectory drift and discrepancies across agents' observations. Therefore, we propose new tracking and map‑merging mechanisms and integrate loop closure in the Gaussian‑based SLAM pipeline. We evaluate MAGiC‑SLAM on synthetic and real‑world datasets and find it more accurate and faster than the state of the art.

PaperID: 12303, https://arxiv.org/pdf/2411.16723.pdf

Authors: Mitchell Rosser, Marc. G Carmichael

Title: Two Heads Are Better Than One: Collaborative LLM Embodied Agents for Human-Robot Interaction

Abstract:
With the recent development of natural language generation models ‑ termed as large language models (LLMs) ‑ a potential use case has opened up to improve the way that humans interact with robot assistants. These LLMs should be able to leverage their large breadth of understanding to interpret natural language commands into effective, task appropriate and safe robot task executions. However, in reality, these models suffer from hallucinations, which may cause safety issues or deviations from the task. In other domains, these issues have been improved through the use of collaborative AI systems where multiple LLM agents can work together to collectively plan, code and self‑check outputs. In this research, multiple collaborative AI systems were tested against a single independent AI agent to determine whether the success in other domains would translate into improved human‑robot interaction performance. The results show that there is no defined trend between the number of agents and the success of the model. However, it is clear that some collaborative AI agent architectures can exhibit a greatly improved capacity to produce error‑free code and to solve abstract problems.

PaperID: 12304, https://arxiv.org/pdf/2411.16707.pdf

Authors: Mengshuo Jia, Zeyu Cui, Gabriela Hug

Title: Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework

Abstract:
The integration of experimental technologies with large language models (LLMs) is transforming scientific research. It positions AI as a versatile research assistant rather than a mere problem‑solving tool. In the field of power systems, however, managing simulations ‑‑ one of the essential experimental technologies ‑‑ remains a challenge for LLMs due to their limited domain‑specific knowledge, restricted reasoning capabilities, and imprecise handling of simulation parameters. To address these limitations, this paper proposes a feedback‑driven, multi‑agent framework. It incorporates three proposed modules: an enhanced retrieval‑augmented generation (RAG) module, an improved reasoning module, and a dynamic environmental acting module with an error‑feedback mechanism. Validated on 69 diverse tasks from Daline and MATPOWER, this framework achieves success rates of 93.13% and 96.85%, respectively. It significantly outperforms ChatGPT 4o, o1‑preview, and the fine‑tuned GPT‑4o, which all achieved a success rate lower than 30% on complex tasks. Additionally, the proposed framework also supports rapid, cost‑effective task execution, completing each simulation in approximately 30 seconds at an average cost of 0.014 USD for tokens. Overall, this adaptable framework lays a foundation for developing intelligent LLM‑based assistants for human researchers, facilitating power system research and beyond.

PaperID: 12305, https://arxiv.org/pdf/2411.16608.pdf

Authors: Viswa Narayanan Sankaranarayanan, Achilleas Santi Seisa, Akshit Saradagi, Sumeet Satpute, George Nikolakopoulos

Title: Barriers on the EDGE: A scalable CBF architecture over EDGE for safe aerial-ground multi-agent coordination

Abstract:
In this article, we propose a control architecture for the safe, coordinated operation of a multi‑agent system with aerial (UAVs) and ground (UGVs) robots in a confined task space. We consider the case where the aerial and ground operations are coupled, enabled by the capability of the aerial robots to land on moving ground robots. The proposed method uses time‑varying Control Barrier Functions (CBFs) to impose safety constraints associated with (i) collision avoidance between agents, (ii) landing of UAVs on mobile UGVs, and (iii) task space restriction. Further, this article addresses the challenge induced by the rapid increase in the number of CBF constraints with the increasing number of agents through a hybrid centralized‑distributed coordination approach that determines the set of CBF constraints that is relevant for every aerial and ground agent at any given time. A centralized node (Watcher), hosted by an edge computing cluster, activates the relevant constraints, thus reducing the network complexity and the need for high onboard processing on the robots. The CBF constraints are enforced in a distributed manner by individual robots that run a nominal controller and safety filter locally to overcome latency and other network nonidealities.

PaperID: 12306, https://arxiv.org/pdf/2411.16506.pdf

Authors: Hongzhi Zang, Yulun Zhang, He Jiang, Zhe Chen, Daniel Harabor, Peter J. Stuckey, Jiaoyang Li

Title: Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

Abstract:
We study the problem of optimizing a guidance policy capable of dynamically guiding the agents for lifelong Multi‑Agent Path Finding based on real‑time traffic patterns. Multi‑Agent Path Finding (MAPF) focuses on moving multiple agents from their starts to goals without collisions. Its lifelong variant, LMAPF, continuously assigns new goals to agents. In this work, we focus on improving the solution quality of PIBT, a state‑of‑the‑art rule‑based LMAPF algorithm, by optimizing a policy to generate adaptive guidance. We design two pipelines to incorporate guidance in PIBT in two different ways. We demonstrate the superiority of the optimized policy over both static guidance and human‑designed policies. Additionally, we explore scenarios where task distribution changes over time, a challenging yet common situation in real‑world applications that is rarely explored in the literature.

PaperID: 12307, https://arxiv.org/pdf/2411.16416.pdf

Authors: Bo Hu, Siyu Liu, Beilin Ye, Yun Hao, Yanhui Liu, Yang Lu, Ju Li, David J. Srolovitz, Tongqi Wen

Title: A Multi-agent Framework for Physical Laws Discovery

Abstract:
Discovering explicit physical laws has traditionally depended on human intuition and domain expertise. Recent advances in artificial intelligence, particularly large language models (LLMs), offer a new route to accelerate this process by automating key steps from hypothesis generation to interpretable model construction. Here we develop an LLM‑based multi‑agent framework for physical‑law discovery that integrates literature‑guided variable selection, hypothesis formulation, symbolic regression, formula derivation, and mechanistic explanation. We validate the framework on three representative materials problems: the glass‑forming ability (GFA) of metallic glasses, the Vickers hardness of compounds, and the Young's modulus of multi‑component alloys. Using physically and chemically meaningful descriptors as inputs, the discovered formulas achieve strong agreement with reference data, with correlation coefficients up to 0.94 (GFA), 0.86 (hardness), and 0.94 (Young's modulus), while remaining compact and interpretable. Beyond fitting, the Young's modulus formula generalizes to quaternary and quinary alloys, improving prediction accuracy by up to 78.8% relative to the classical rule of mixtures. By integrating cross‑disciplinary knowledge, reflection mechanisms, and expert‑like reasoning ability into symbolic regression, our AI‑centric framework offers a robust and extensible platform for automated physical laws discovery, demonstrating that AI can increasingly serve as an essential role in modern scientific research by thinking and acting like field experts.

PaperID: 12308, https://arxiv.org/pdf/2411.16353.pdf

Authors: Mikita Balesni, Tomek Korbak, Owain Evans

Title: Lessons from Studying Two-Hop Latent Reasoning

Abstract:
Large language models can use chain‑of‑thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two‑hop question‑answering without CoT. This capability is so basic that if it was a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two‑hop question answering as a case study. Previous work on the gap between latent and externalized two‑hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two‑hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine‑tune LLMs (including Llama 3 8B and GPT‑4o) on synthetic facts and test two‑hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two‑hop performance. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two‑hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).

PaperID: 12309, https://arxiv.org/pdf/2411.16189.pdf

Authors: Zhihua Duan, Jialin Wang

Title: Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

Abstract:
Large Language Models (LLMs) still face challenges when dealing with complex reasoning tasks, often resulting in hallucinations, which limit the practical application of LLMs. To alleviate this issue, this paper proposes a new method that integrates different LLMs to expand the knowledge boundary, reduce dependence on a single model, and promote in‑depth debate among agents. The main contributions include: 1) Introducing third‑party LLMs to adjust the attention weights of agents through uncertainty estimation and confidence analysis, optimizing consensus formation in multi‑agent systems; 2) Experiments on arithmetic datasets have validated the effectiveness of the method, surpassing traditional multi‑agent baselines. This research provides a new perspective for large models to alleviate hallucination phenomena when dealing with complex tasks.

PaperID: 12310, https://arxiv.org/pdf/2411.16134.pdf

Authors: Zhuoyuan Yu, Hongliang Guo, Albertus Hendrawan Adiwahono, Jianle Chan, Brina Shong Wey Tynn, Chee-Meng Chew, Wei-Yun Yau

Title: Multi-Robot Reliable Navigation in Uncertain Topological Environments with Graph Attention Networks

Abstract:
This paper studies the multi‑robot reliable navigation problem in uncertain topological networks, which aims at maximizing the robot team's on‑time arrival probabilities in the face of road network uncertainties. The uncertainty in these networks stems from the unknown edge traversability, which is only revealed to the robot upon its arrival at the edge's starting node. Existing approaches often struggle to adapt to real‑time network topology changes, making them unsuitable for varying topological environments. To address the challenge, we reformulate the problem into a Partially Observable Markov Decision Process (POMDP) framework and introduce the Dynamic Adaptive Graph Embedding method to capture the evolving nature of the navigation task. We further enhance each robot's policy learning process by integrating deep reinforcement learning with Graph Attention Networks (GATs), leveraging self‑attention to focus on critical graph features. The proposed approach, namely Multi‑Agent Routing in Variable Environments with Learning (MARVEL) employs the generalized policy gradient algorithm to optimize the robots' real‑time decision‑making process iteratively. We compare the performance of MARVEL with state‑of‑the‑art reliable navigation algorithms as well as Canadian traveller problem solutions in a range of canonical transportation networks, demonstrating improved adaptability and performance in uncertain topological networks. Additionally, real‑world experiments with two robots navigating within a self‑constructed indoor environment with uncertain topological structures demonstrate MARVEL's practicality.

PaperID: 12311, https://arxiv.org/pdf/2411.16031.pdf

Authors: Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Orlando, Diego Russo, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato

Title: Agent-Based Modelling Meets Generative AI in Social Network Simulations

Abstract:
Agent‑Based Modelling (ABM) has emerged as an essential tool for simulating social networks, encompassing diverse phenomena such as information dissemination, influence dynamics, and community formation. However, manually configuring varied agent interactions and information flow dynamics poses challenges, often resulting in oversimplified models that lack real‑world generalizability. Integrating modern Large Language Models (LLMs) with ABM presents a promising avenue to address these challenges and enhance simulation fidelity, leveraging LLMs' human‑like capabilities in sensing, reasoning, and behavior. In this paper, we propose a novel framework utilizing LLM‑empowered agents to simulate social network users based on their interests and personality traits. The framework allows for customizable agent interactions resembling various social network platforms, including mechanisms for content resharing and personalized recommendations. We validate our framework using a comprehensive Twitter dataset from the 2020 US election, demonstrating that LLM‑agents accurately replicate real users' behaviors, including linguistic patterns and political inclinations. These agents form homogeneous ideological clusters and retain the main themes of their community. Notably, preference‑based recommendations significantly influence agent behavior, promoting increased engagement, network homophily and the formation of echo chambers. Overall, our findings underscore the potential of LLM‑agents in advancing social media simulations and unraveling intricate online dynamics.

PaperID: 12312, https://arxiv.org/pdf/2411.15998.pdf

Authors: Jonathan Light, Sixue Xing, Yuanzhe Liu, Weiqin Chen, Min Cai, Xiusi Chen, Guanzhi Wang, Wei Cheng, Yisong Yue, Ziniu Hu

Title: PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Abstract:
Effective extraction of the world knowledge in LLMs for complex decision‑making tasks remains a challenge. We propose a framework PIANIST for decomposing the world model into seven intuitive components conducive to zero‑shot LLM generation. Given only the natural language description of the game and how input observations are formatted, our method can generate a working world model for fast and efficient MCTS simulation. We show that our method works well on two different games that challenge the planning and decision making skills of the agent for both language and non‑language based action taking, without any training on domain‑specific training data or explicitly defined world model.

PaperID: 12313, https://arxiv.org/pdf/2411.15891.pdf

Authors: Ziyu Chen, Zhiqing Xiao, Xinbei Jiang, Junbo Zhao

Title: From Laws to Motivation: Guiding Exploration through Law-Based Reasoning and Rewards

Abstract:
Large Language Models (LLMs) and Reinforcement Learning (RL) are two powerful approaches for building autonomous agents. However, due to limited understanding of the game environment, agents often resort to inefficient exploration and trial‑and‑error, struggling to develop long‑term strategies or make decisions. We propose a method that extracts experience from interaction records to model the underlying laws of the game environment, using these experience as internal motivation to guide agents. These experience, expressed in language, are highly flexible and can either assist agents in reasoning directly or be transformed into rewards for guiding training. Our evaluation results in Crafter demonstrate that both RL and LLM agents benefit from these experience, leading to improved overall performance.

PaperID: 12314, https://arxiv.org/pdf/2411.15692.pdf

Authors: Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Yingzhou Lu, Yue Zhao

Title: DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration

Abstract:
Recent progress in Large Language Models (LLMs) has drawn attention to their potential for accelerating drug discovery. However, a central problem remains: translating theoretical ideas into robust implementations in the highly specialized context of pharmaceutical research. This limitation prevents practitioners from making full use of the latest AI developments in drug discovery. To address this challenge, we introduce DrugAgent, a multi‑agent framework that automates machine learning (ML) programming for drug discovery tasks. DrugAgent employs an LLM Planner that formulates high‑level ideas and an LLM Instructor that identifies and integrates domain knowledge when implementing those ideas. We present case studies on three representative drug discovery tasks. Our results show that DrugAgent consistently outperforms leading baselines, including a relative improvement of 4.92% in ROC‑AUC compared to ReAct for drug‑target interaction (DTI). DrugAgent is publicly available at https://anonymous.4open.science/r/drugagent‑5C42/.

PaperID: 12315, https://arxiv.org/pdf/2411.15645.pdf

Authors: Gollam Rabby, Farhana Keya, Sören Auer

Title: MC-NEST: Enhancing Mathematical Reasoning in Large Language Models leveraging a Monte Carlo Self-Refine Tree

Abstract:
Mathematical reasoning presents significant challenges for large language models (LLMs). To enhance their capabilities, we propose Monte Carlo Self‑Refine Tree (MC‑NEST), an extension of Monte Carlo Tree Search that integrates LLM‑based self‑refinement and self‑evaluation for improved decision‑making in complex reasoning tasks. MC‑NEST balances exploration and exploitation using Upper Confidence Bound (UCT) scores combined with diverse selection policies. Through iterative critique and refinement, LLMs learn to reason more strategically. Empirical results demonstrate that MC‑NEST with an importance sampling policy substantially improves GPT‑4o's performance, achieving state‑of‑the‑art pass@1 scores on Olympiad‑level benchmarks. Specifically, MC‑NEST attains a pass@1 of 38.6 on AIME and 12.6 on MathOdyssey. The solution quality for MC‑NEST using GPT‑4o and Phi‑3‑mini reaches 84.0% and 82.08%, respectively, indicating robust consistency across different LLMs. MC‑NEST performs strongly across Algebra, Geometry, and Number Theory, benefiting from its ability to handle abstraction, logical deduction, and multi‑step reasoning ‑‑ core skills in mathematical problem solving.

PaperID: 12316, https://arxiv.org/pdf/2411.15501.pdf

Authors: Tanghaoran Zhang, Yue Yu, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu, Zhang Zhang, Yuxin Zhao

Title: Instruct or Interact? Exploring and Eliciting LLMs' Capability in Code Snippet Adaptation Through Prompt Engineering

Abstract:
Code snippet adaptation is a fundamental activity in the software development process. Unlike code generation, code snippet adaptation is not a "free creation", which requires developers to tailor a given code snippet in order to fit specific requirements and the code context. Recently, large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. However, their performance on adaptation, a reuse‑oriented and context‑dependent code change prediction task, is still unclear. To bridge this gap, we conduct an empirical study to investigate the performance and issues of LLMs on the adaptation task. We first evaluate the adaptation performances of three popular LLMs and compare them to the code generation task. Our result indicates that their adaptation ability is weaker than generation, with a nearly 15% decrease on pass@1 and more context‑related errors. By manually inspecting 200 cases, we further investigate the causes of LLMs' sub‑optimal performance, which can be classified into three categories, i.e., Unclear Requirement, Requirement Misalignment and Context Misapplication. Based on the above empirical research, we propose an interactive prompting approach to eliciting LLMs' adaptation ability. Experimental result reveals that our approach greatly improve LLMs' adaptation performance. The best‑performing Human‑LLM interaction successfully solves 159 out of the 202 identified defects and improves the pass@1 and pass@5 by over 40% compared to the initial instruction‑based prompt. Considering human efforts, we suggest multi‑agent interaction as a trade‑off, which can achieve comparable performance with excellent generalization ability. We deem that our approach could provide methodological assistance for autonomous code snippet reuse and adaptation with LLMs.

PaperID: 12317, https://arxiv.org/pdf/2411.15396.pdf

Authors: Jiqun Liu, Jiangen He

Title: The Decoy Dilemma in Online Medical Information Evaluation: A Comparative Study of Credibility Assessments by LLM and Human Judges

Abstract:
Can AI be cognitively biased in automated information judgment tasks? Despite recent progresses in measuring and mitigating social and algorithmic biases in AI and large language models (LLMs), it is not clear to what extent LLMs behave "rationally", or if they are also vulnerable to human cognitive bias triggers. To address this open problem, our study, consisting of a crowdsourcing user experiment and a LLM‑enabled simulation experiment, compared the credibility assessments by LLM and human judges under potential decoy effects in an information retrieval (IR) setting, and empirically examined the extent to which LLMs are cognitively biased in COVID‑19 medical (mis)information assessment tasks compared to traditional human assessors as a baseline. The results, collected from a between‑subject user experiment and a LLM‑enabled replicate experiment, demonstrate that 1) Larger and more recent LLMs tend to show a higher level of consistency and accuracy in distinguishing credible information from misinformation. However, they are more likely to give higher ratings for misinformation due to the presence of a more salient, decoy misinformation result; 2) While decoy effect occurred in both human and LLM assessments, the effect is more prevalent across different conditions and topics in LLM judgments compared to human credibility ratings. In contrast to the generally assumed "rationality" of AI tools, our study empirically confirms the cognitive bias risks embedded in LLM agents, evaluates the decoy impact on LLMs against human credibility assessments, and thereby highlights the complexity and importance of debiasing AI agents and developing psychology‑informed AI audit techniques and policies for automated judgment tasks and beyond.

PaperID: 12318, https://arxiv.org/pdf/2411.15356.pdf

Authors: Yu Han, Zekun Guo

Title: Regulator-Manufacturer AI Agents Modeling: Mathematical Feedback-Driven Multi-Agent LLM Framework

Abstract:
The increasing complexity of regulatory updates from global authorities presents significant challenges for medical device manufacturers, necessitating agile strategies to sustain compliance and maintain market access. Concurrently, regulatory bodies must effectively monitor manufacturers' responses and develop strategic surveillance plans. This study employs a multi‑agent modeling approach, enhanced with Large Language Models (LLMs), to simulate regulatory dynamics and examine the adaptive behaviors of key actors, including regulatory bodies, manufacturers, and competitors. These agents operate within a simulated environment governed by regulatory flow theory, capturing the impacts of regulatory changes on compliance decisions, market adaptation, and innovation strategies. Our findings illuminate the influence of regulatory shifts on industry behaviour and identify strategic opportunities for improving regulatory practices, optimizing compliance, and fostering innovation. By leveraging the integration of multi‑agent systems and LLMs, this research provides a novel perspective and offers actionable insights for stakeholders navigating the evolving regulatory landscape of the medical device industry.

PaperID: 12319, https://arxiv.org/pdf/2411.15201.pdf

Authors: Harsha Vardhan Khurdula, Basem Rizk, Indus Khaitan, Janit Anjaria, Aviral Srivastava, Rajvardhan Khaitan

Title: Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking

Abstract:
Current benchmarks for evaluating Vision Language Models (VLMs) often fall short in thoroughly assessing model abilities to understand and process complex visual and textual content. They typically focus on simple tasks that do not require deep reasoning or the integration of multiple data modalities to solve an original problem. To address this gap, we introduce the PARROT‑360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. We evaluated leading models: GPT‑4o, Claude‑3.5‑Sonnet, and Gemini‑1.5‑Pro, using PARROT‑360V to assess their capabilities in combining visual clues with language skills to solve tasks in a manner akin to human problem‑solving. Our findings reveal a notable performance gap: state‑of‑the‑art models scored between 28 to 56 percentage on our benchmark, significantly lower than their performance on popular benchmarks. This underscores the limitations of current VLMs in handling complex, multi‑step reasoning tasks and highlights the need for more robust evaluation frameworks to advance the field.

PaperID: 12320, https://arxiv.org/pdf/2411.15150.pdf

Authors: Ilja Behnke

Title: Real-Time Aware IP-Networking for Resource-Constrained Embedded Devices

Abstract:
This dissertation explores the area of real‑time IP networking for embedded devices, especially those with limited computational resources. With the increasing convergence of information and operational technologies in various industries, and the growing complexity of communication requirements in (semi‑)autonomous machines, there is a need for more advanced and reliable networking solutions. This research focuses on the challenge of integrating real‑time embedded devices into packet‑switched networks. Through a comprehensive review of current real‑time communication technologies, standards, and practices in the context of Industry 4.0, a notable gap is identified: the lack of a robust real‑time communication standard tailored for wireless mobile machines, and insufficient research on real‑time embedded devices in highly networked environments. The study includes detailed experimentation with commercially available off‑the‑shelf networked microcontrollers, revealing a priority inversion problem where network packet processing interrupts real‑time tasks, potentially causing real‑time violations. To address this challenge, this thesis proposes mitigation methods and system designs that include software and hardware implementations. These include a new embedded network subsystem that prioritizes packet processing based on task priority, and a real‑time‑aware network interface controller that moderates interrupt requests. In addition, a hybrid hardware‑software co‑design approach is developed to ensure predictable and reliable real‑time task execution despite network congestion. Furthermore, the research extends to task offloading in wireless Industrial Internet of Things environments, presenting a system architecture and scheduler capable of maintaining real‑time constraints even under heavy loads and network uncertainties.

PaperID: 12321, https://arxiv.org/pdf/2411.15131.pdf

Authors: Ri-Zhao Qiu, Yuchen Song, Xuanbin Peng, Sai Aneesh Suryadevara, Ge Yang, Minghuan Liu, Mazeyu Ji, Chengzhe Jia, Ruihan Yang, Xueyan Zou, Xiaolong Wang

Title: WildLMa: Long Horizon Loco-Manipulation in the Wild

Abstract:
'In‑the‑wild' mobile manipulation aims to deploy robots in diverse real‑world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long‑horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick‑and‑place. Quadruped robots with manipulators hold promise for extending the workspace and enabling robust locomotion, but existing results do not investigate such a capability. This paper proposes WildLMa with three components to address these issues: (1) adaptation of learned low‑level controller for VR‑enabled whole‑body teleoperation and traversability; (2) WildLMa‑Skill ‑‑ a library of generalizable visuomotor skills acquired via imitation learning or heuristics and (3) WildLMa‑Planner ‑‑ an interface of learned skills that allow LLM planners to coordinate skills for long‑horizon tasks. We demonstrate the importance of high‑quality training data by achieving higher grasping success rate over existing RL baselines using only tens of demonstrations. WildLMa exploits CLIP for language‑conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.

PaperID: 12322, https://arxiv.org/pdf/2411.15114.pdf

Authors: Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Holden Karnofsky, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes

Title: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Abstract:
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE‑Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open‑ended ML research engineering environments and data from 71 8‑hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non‑zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best‑of‑k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8‑hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics ‑‑ e.g. an agent wrote a faster custom Triton kernel than any of our human experts' ‑‑ and can generate and test solutions over ten times faster than humans, at much lower cost. We open‑source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.

PaperID: 12323, https://arxiv.org/pdf/2411.15100.pdf

Authors: Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

Title: XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Abstract:
The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context‑free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context‑free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non‑negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structure generation engine for large language models. XGrammar accelerates context‑free grammar execution by dividing the vocabulary into context‑independent tokens that can be prechecked and context‑dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context‑independent tokens. Additionally, we build an efficient persistent stack to accelerate the context‑dependent token checks. Finally, we co‑design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can generate near‑zero overhead structure generation in end‑to‑end low‑LLM serving.

PaperID: 12324, https://arxiv.org/pdf/2411.15046.pdf

Authors: Till Freihaut, Giorgia Ramponi

Title: On Feasible Rewards in Multi-Agent Inverse Reinforcement Learning

Abstract:
Multi‑agent Inverse Reinforcement Learning (MAIRL) aims to recover agent reward functions from expert demonstrations. We characterize the feasible reward set in Markov games, identifying all reward functions that rationalize a given equilibrium. However, equilibrium‑based observations are often ambiguous: a single Nash equilibrium can correspond to many reward structures, potentially changing the game's nature in multi‑agent systems. We address this by introducing entropy‑regularized Markov games, which yield a unique equilibrium while preserving strategic incentives. For this setting, we provide a sample complexity analysis detailing how errors affect learned policy performance. Our work establishes theoretical foundations and practical insights for MAIRL.

PaperID: 12325, https://arxiv.org/pdf/2411.15036.pdf

Authors: Zeyang Li, Navid Azizan

Title: Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium

Abstract:
Multi‑agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real‑world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all‑time safety assurance. Moreover, these methods often overlook the feasibility issue (the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with state‑wise constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control‑theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi‑agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi‑agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state‑wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high‑dimensional systems, we propose Multi‑Agent Dual Actor‑Critic (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.

PaperID: 12326, https://arxiv.org/pdf/2411.15033.pdf

Authors: Simone Colombani, Dimitri Ognibene, Giuseppe Boccignone

Title: One to rule them all: natural language to bind communication, perception and action

Abstract:
In recent years, research in the area of human‑robot interaction has focused on developing robots capable of understanding complex human instructions and performing tasks in dynamic and diverse environments. These systems have a wide range of applications, from personal assistance to industrial robotics, emphasizing the importance of robots interacting flexibly, naturally and safely with humans. This paper presents an advanced architecture for robotic action planning that integrates communication, perception, and planning with Large Language Models (LLMs). Our system is designed to translate commands expressed in natural language into executable robot actions, incorporating environmental information and dynamically updating plans based on real‑time feedback. The Planner Module is the core of the system where LLMs embedded in a modified ReAct framework are employed to interpret and carry out user commands. By leveraging their extensive pre‑trained knowledge, LLMs can effectively process user requests without the need to introduce new knowledge on the changing environment. The modified ReAct framework further enhances the execution space by providing real‑time environmental perception and the outcomes of physical actions. By combining robust and dynamic semantic map representations as graphs with control components and failure explanations, this architecture enhances a robot adaptability, task execution, and seamless collaboration with human users in shared and dynamic environments. Through the integration of continuous feedback loops with the environment the system can dynamically adjusts the plan to accommodate unexpected changes, optimizing the robot ability to perform tasks. Using a dataset of previous experience is possible to provide detailed feedback about the failure. Updating the LLMs context of the next iteration with suggestion on how to overcame the issue.

PaperID: 12327, https://arxiv.org/pdf/2411.15004.pdf

Authors: Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar

Title: ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

Abstract:
Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web‑based tasks. Most of these agents rely on general‑purpose, proprietary models like GPT‑4 and focus on designing better prompts to improve their planning abilities. However, general‑purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long‑horizon planning. We explore an alternative approach that fine‑tunes open‑source LLMs using production‑scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting‑based agents on existing benchmarks ‑‑ ScribeAgent achieves state‑of‑the‑art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text‑only web agents on WebArena. We further perform detailed ablation studies on various fine‑tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

PaperID: 12328, https://arxiv.org/pdf/2411.14637.pdf

Authors: Hanwen Shi, Jin Zhang, Kunpeng Zhang

Title: Enhancing Clinical Trial Patient Matching through Knowledge Augmentation and Reasoning with Multi-Agent

Abstract:
Matching patients effectively and efficiently for clinical trials is a significant challenge due to the complexity and variability of patient profiles and trial criteria. This paper introduces Multi‑Agent for Knowledge Augmentation and Reasoning (MAKAR), a novel multi‑agent system that enhances patient‑trial matching by integrating criterion augmentation with structured reasoning. MAKAR consistently improves performance by an average of 7% across different datasets. Furthermore, it enables privacy‑preserving deployment and maintains competitive performance when using smaller open‑source models. Overall, MAKAR can contributes to more transparent, accurate, and privacy‑conscious AI‑driven patient matching.

PaperID: 12329, https://arxiv.org/pdf/2411.14593.pdf

Authors: Larry Schester, Luis E. Ortiz

Title: A Systematic Study of Multi-Agent Deep Reinforcement Learning for Safe and Robust Autonomous Highway Ramp Entry

Abstract:
Vehicles today can drive themselves on highways and driverless robotaxis operate in major cities, with more sophisticated levels of autonomous driving expected to be available and become more common in the future. Yet, technically speaking, so‑called "Level 5" (L5) operation, corresponding to full autonomy, has not been achieved. For that to happen, functions such as fully autonomous highway ramp entry must be available, and provide provably safe, and reliably robust behavior to enable full autonomy. We present a systematic study of a highway ramp function that controls the vehicles forward‑moving actions to minimize collisions with the stream of highway traffic into which a merging (ego) vehicle enters. We take a game‑theoretic multi‑agent (MA) approach to this problem and study the use of controllers based on deep reinforcement learning (DRL). The virtual environment of the MA DRL uses self‑play with simulated data where merging vehicles safely learn to control longitudinal position during a taper‑type merge. The work presented in this paper extends existing work by studying the interaction of more than two vehicles (agents) and does so by systematically expanding the road scene with additional traffic and ego vehicles. While previous work on the two‑vehicle setting established that collision‑free controllers are theoretically impossible in fully decentralized, non‑coordinated environments, we empirically show that controllers learned using our approach are nearly ideal when measured against idealized optimal controllers.

PaperID: 12330, https://arxiv.org/pdf/2411.14567.pdf

Authors: Viranjan Bhattacharyya, Tyler Ard, Rongyao Wang, Ardalan Vahidi, Yunyi Jia, Jihun Han

Title: Energy Efficient Automated Driving as a GNEP: Vehicle-in-the-loop Experiments

Abstract:
In this paper, a multi‑agent motion planning problem is studied aiming to minimize energy consumption of connected automated vehicles (CAVs) in lane change scenarios. We model this interactive motion planning as a generalized Nash equilibrium problem and formalize how vehicle‑to‑vehicle intention sharing enables solution of the game between multiple CAVs as an optimal control problem for each agent, to arrive at a generalized Nash equilibrium. The method is implemented via model predictive control (MPC) and compared with an advanced baseline MPC which utilizes unilateral predictions of other agents' future states. A ROS‑based in‑the‑loop testbed is developed: the method is first evaluated in software‑in‑the‑loop and then vehicle‑in‑the‑loop experiments are conducted. Experimental results demonstrate energy and travel time benefits of the presented method in interactive lane change maneuvers.

PaperID: 12331, https://arxiv.org/pdf/2411.14497.pdf

Authors: Hang Zhou, Yehui Tang, Haochen Qin, Yujie Yang, Renren Jin, Deyi Xiong, Kai Han, Yunhe Wang

Title: Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning

Abstract:
The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high‑quality and diverse data is both expensive and time‑consuming. To mitigate this issue, we propose a novel Star‑Agents framework, which automates the enhancement of data quality across datasets through multi‑agent collaboration and assessment. The framework adopts a three‑pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual‑model method that assesses both difficulty and quality. Finaly, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12% and notable gains in specific metrics, such as a 40% improvement in Fermi, as evidenced by benchmarks like MT‑bench, Vicuna bench, and WizardLM testset.

PaperID: 12332, https://arxiv.org/pdf/2411.14496.pdf

Authors: Bao Nguyen

Title: Multi-agent reinforcement learning strategy to maximize the lifetime of Wireless Rechargeable

Abstract:
The thesis proposes a generalized charging framework for multiple mobile chargers to maximize the network lifetime and ensure target coverage and connectivity in large scale WRSNs. Moreover, a multi‑point charging model is leveraged to enhance charging efficiency, where the MC can charge multiple sensors simultaneously at each charging location. The thesis proposes an effective Decentralized Partially Observable Semi‑Markov Decision Process (Dec POSMDP) model that promotes Mobile Chargers (MCs) cooperation and detects optimal charging locations based on realtime network information. Furthermore, the proposal allows reinforcement algorithms to be applied to different networks without requiring extensive retraining. To solve the Dec POSMDP model, the thesis proposes an Asynchronous Multi Agent Reinforcement Learning algorithm (AMAPPO) based on the Proximal Policy Optimization algorithm (PPO).

PaperID: 12333, https://arxiv.org/pdf/2411.14485.pdf

Authors: Moritz Rietschel, Fang Guo, Kyle Steinfeld

Title: Mediating Modes of Thought: LLM's for design scripting

Abstract:
Architects adopt visual scripting and parametric design tools to explore more expansive design spaces (Coates, 2010), refine their thinking about the geometric logic of their design (Woodbury, 2010), and overcome conventional software limitations (Burry, 2011). Despite two decades of effort to make design scripting more accessible, a disconnect between a designer's free ways of thinking and the rigidity of algorithms remains (Burry, 2011). Recent developments in Large Language Models (LLMs) suggest this might soon change, as LLMs encode a general understanding of human context and exhibit the capacity to produce geometric logic. This project speculates that if LLMs can effectively mediate between user intent and algorithms, they become a powerful tool to make scripting in design more widespread and fun. We explore if such systems can interpret natural language prompts to assemble geometric operations relevant to computational design scripting. In the system, multiple layers of LLM agents are configured with specific context to infer the user intent and construct a sequential logic. Given a user's high‑level text prompt, a geometric description is created, distilled into a sequence of logic operations, and mapped to software‑specific commands. The completed script is constructed in the user's visual programming interface. The system succeeds in generating complete visual scripts up to a certain complexity but fails beyond this complexity threshold. It shows how LLMs can make design scripting much more aligned with human creativity and thought. Future research should explore conversational interactions, expand to multimodal inputs and outputs, and assess the performance of these tools.

PaperID: 12334, https://arxiv.org/pdf/2411.14465.pdf

Authors: Elizaveta Reganova, Peter Steinbach

Title: Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning

Abstract:
Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models have a tendency to "hallucinate" their responses, making it challenging to evaluate their performance. A major challenge is determining how to assess the certainty of a model's predictions and how it correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open‑source LLMs, as well as gpt‑3.5 Turbo, on multiple choice physics questionnaires. We focus on the relationship between answer accuracy and variability in topics related to physics. Our findings suggest that most models provide accurate replies in cases where they are certain, but this is by far not a general behavior. The relationship between accuracy and uncertainty exposes a broad horizontal bell‑shaped distribution. We report how the asymmetry between accuracy and uncertainty intensifies as the questions demand more logical reasoning of the LLM agent, while the same relationship remains sharp for knowledge retrieval tasks.

PaperID: 12335, https://arxiv.org/pdf/2411.14461.pdf

Authors: Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, Jin Lu, Wei Zhang, Tuo Zhang, Lu Zhang, Dajiang Zhu, Xiang Li, Wei Liu, Quanzheng Li, Andrea Sikora, Xiaoming Zhai, Zhen Xiang, Tianming Liu

Title: Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios

Abstract:
Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision‑making. Traditional model‑based approaches, including those leveraging in‑context demonstrations and those with specialized medical fine‑tuning, have demonstrated strong performance in medical language processing but struggle with real‑time adaptability, multi‑step reasoning, and handling complex medical tasks. Agent‑based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short‑ and long‑term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision‑making should be built on real‑time interaction with the environment. Therefore, unlike conventional model‑based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents' reasoning, tool‑use adaptability, and real‑time information retrieval across diverse clinical scenarios, including high‑stakes settings such as intensive care units (ICUs). Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision‑making efficacy in clinical practice.

PaperID: 12336, https://arxiv.org/pdf/2411.14432.pdf

Authors: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu

Title: Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract:
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain‑of‑Thought prompting to product‑level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high‑quality long‑chain reasoning data and optimized training pipelines still remain inadequately explored in vision‑language tasks. In this paper, we present Insight‑V, an early effort to 1) scalably produce long and robust reasoning data for complex multi‑modal tasks, and 2) an effective training pipeline to enhance the reasoning capabilities of multi‑modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two‑step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi‑granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi‑agent system consisting of a reasoning agent dedicated to performing long‑chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA‑NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi‑modal benchmarks requiring visual reasoning. Benefiting from our multi‑agent system, Insight‑V can also easily maintain or improve performance on perception‑focused multi‑modal tasks.

PaperID: 12337, https://arxiv.org/pdf/2411.14264.pdf

Authors: Pedro Enrique Iturria-Rivera, Raimundas Gaigalas, Medhat Elsayed, Majid Bavand, Yigit Ozcan, Melike Erol-Kantarci

Title: Explainable Multi-Agent Reinforcement Learning for Extended Reality Codec Adaptation

Abstract:
Extended Reality (XR) services are set to transform applications over 5th and 6th generation wireless networks, delivering immersive experiences. Concurrently, Artificial Intelligence (AI) advancements have expanded their role in wireless networks, however, trust and transparency in AI remain to be strengthened. Thus, providing explanations for AI‑enabled systems can enhance trust. We introduce Value Function Factorization (VFF)‑based Explainable (X) Multi‑Agent Reinforcement Learning (MARL) algorithms, explaining reward design in XR codec adaptation through reward decomposition. We contribute four enhancements to XMARL algorithms. Firstly, we detail architectural modifications to enable reward decomposition in VFF‑based MARL algorithms: Value Decomposition Networks (VDN), Mixture of Q‑Values (QMIX), and Q‑Transformation (Q‑TRAN). Secondly, inspired by multi‑task learning, we reduce the overhead of vanilla XMARL algorithms. Thirdly, we propose a new explainability metric, Reward Difference Fluctuation Explanation (RDFX), suitable for problems with adjustable parameters. Lastly, we propose adaptive XMARL, leveraging network gradients and reward decomposition for improved action selection. Simulation results indicate that, in XR codec adaptation, the Packet Delivery Ratio reward is the primary contributor to optimal performance compared to the initial composite reward, which included delay and Data Rate Ratio components. Modifications to VFF‑based XMARL algorithms, incorporating multi‑headed structures and adaptive loss functions, enable the best‑performing algorithm, Multi‑Headed Adaptive (MHA)‑QMIX, to achieve significant average gains over the Adjust Packet Size baseline up to 10.7%, 41.4%, 33.3%, and 67.9% in XR index, jitter, delay, and Packet Loss Ratio (PLR), respectively.

PaperID: 12338, https://arxiv.org/pdf/2411.14214.pdf

Authors: Junhua Liu, Fanfan Lin, Xinze Li, Kwan Hui Lim, Shuai Zhao

Title: Physics-Informed Autonomous LLM Agents for Explainable Power Electronics Modulation Design

Abstract:
LLM‑based autonomous agents have recently shown strong capabilities in solving complex industrial design tasks. However, in domains aiming for carbon neutrality and high‑performance renewable energy systems, current AI‑assisted design automation methods face critical challenges in explainability, scalability, and practical usability. To address these limitations, we introduce PHIA (Physics‑Informed Autonomous Agent), an LLM‑driven system that automates modulation design for power converters in Power Electronics Systems with minimal human intervention. In contrast to traditional pipeline‑based methods, PHIA incorporates an LLM‑based planning module that interactively acquires and verifies design requirements via a user‑friendly chat interface. This planner collaborates with physics‑informed simulation and optimization components to autonomously generate and iteratively refine modulation designs. The interactive interface also supports interpretability by providing textual explanations and visual outputs throughout the design process. Experimental results show that PHIA reduces standard mean absolute error by 63.2% compared to the second‑best benchmark and accelerates the overall design process by over 33 times. A user study involving 20 domain experts further confirms PHIA's superior design efficiency and usability, highlighting its potential to transform industrial design workflows in power electronics.

PaperID: 12339, https://arxiv.org/pdf/2411.14033.pdf

Authors: Yingxuan Yang, Qiuying Peng, Jun Wang, Ying Wen, Weinan Zhang

Title: LLM-based Multi-Agent Systems: Techniques and Business Perspectives

Abstract:
In the era of (multi‑modal) large language models, most operational processes can be reformulated and reproduced using LLM agents. The LLM agents can perceive, control, and get feedback from the environment so as to accomplish the given tasks in an autonomous manner. Besides the environment‑interaction property, the LLM agents can call various external tools to ease the task completion process. The tools can be regarded as a predefined operational process with private or real‑time knowledge that does not exist in the parameters of LLMs. As a natural trend of development, the tools for calling are becoming autonomous agents, thus the full intelligent system turns out to be a LLM‑based Multi‑Agent System (LaMAS). Compared to the previous single‑LLM‑agent system, LaMAS has the advantages of i) dynamic task decomposition and organic specialization, ii) higher flexibility for system changing, iii) proprietary data preserving for each participating entity, and iv) feasibility of monetization for each entity. This paper discusses the technical and business landscapes of LaMAS. To support the ecosystem of LaMAS, we provide a preliminary version of such LaMAS protocol considering technical requirements, data privacy, and business incentives. As such, LaMAS would be a practical solution to achieve artificial collective intelligence in the near future.

PaperID: 12340, https://arxiv.org/pdf/2411.13983.pdf

Authors: Hansung Kim, Edward L. Zhu, Chang Seok Lim, Francesco Borrelli

Title: Learning Two-agent Motion Planning Strategies from Generalized Nash Equilibrium for Model Predictive Control

Abstract:
We introduce an Implicit Game‑Theoretic MPC (IGT‑MPC), a decentralized algorithm for two‑agent motion planning that uses a learned value function that predicts the game‑theoretic interaction outcomes as the terminal cost‑to‑go function in a model predictive control (MPC) framework, guiding agents to implicitly account for interactions with other agents and maximize their reward. This approach applies to competitive and cooperative multi‑agent motion planning problems which we formulate as constrained dynamic games. Given a constrained dynamic game, we randomly sample initial conditions and solve for the generalized Nash equilibrium (GNE) to generate a dataset of GNE solutions, computing the reward outcome of each game‑theoretic interaction from the GNE. The data is used to train a simple neural network to predict the reward outcome, which we use as the terminal cost‑to‑go function in an MPC scheme. We showcase emerging competitive and coordinated behaviors using IGT‑MPC in scenarios such as two‑vehicle head‑to‑head racing and un‑signalized intersection navigation. IGT‑MPC offers a novel method integrating machine learning and game‑theoretic reasoning into model‑based decentralized multi‑agent motion planning.

PaperID: 12341, https://arxiv.org/pdf/2411.13942.pdf

Authors: Ing-Sheng Bernard-Tiong, Yoshihisa Tsurumine, Ryosuke Sota, Kazuki Shibata, Takamitsu Matsubara

Title: Cooperative Grasping and Transportation using Multi-agent Reinforcement Learning with Ternary Force Representation

Abstract:
Cooperative grasping and transportation require effective coordination to complete the task. This study focuses on the approach leveraging force‑sensing feedback, where robots use sensors to detect forces applied by others on an object to achieve coordination. Unlike explicit communication, it avoids delays and interruptions; however, force‑sensing is highly sensitive and prone to interference from variations in grasping environment, such as changes in grasping force, grasping pose, object size and geometry, which can interfere with force signals, subsequently undermining coordination. We propose multi‑agent reinforcement learning (MARL) with ternary force representation, a force representation that maintains consistent representation against variations in grasping environment. The simulation and real‑world experiments demonstrate the robustness of the proposed method to changes in grasping force, object size and geometry as well as inherent sim2real gap.

PaperID: 12342, https://arxiv.org/pdf/2411.13941.pdf

Authors: Yalan Lin, Yingwei Ma, Rongyu Cao, Binhua Li, Fei Huang, Xiaodong Gu, Yongbin Li

Title: LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues

Abstract:
Reproducing buggy code is the first and crucially important step in issue resolving, as it aids in identifying the underlying problems and validating that generated patches resolve the problem. While numerous approaches have been proposed for this task, they primarily address common, widespread errors and struggle to adapt to unique, evolving errors specific to individual code repositories. To fill this gap, we propose EvoCoder, a multi‑agent continuous learning framework for issue code reproduction. EvoCoder adopts a reflection mechanism that allows the LLM to continuously learn from previously resolved problems and dynamically refine its strategies to new emerging challenges. To prevent experience bloating, EvoCoder introduces a novel hierarchical experience pool that enables the model to adaptively update common and repo‑specific experiences. Our experimental results show a 20% improvement in issue reproduction rates over existing SOTA methods. Furthermore, integrating our reproduction mechanism significantly boosts the overall accuracy of the existing issue‑resolving pipeline.

PaperID: 12343, https://arxiv.org/pdf/2411.13934.pdf

Authors: Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S. Du, Natasha Jaques

Title: Learning to Cooperate with Humans using Generative Agents

Abstract:
Training agents that can coordinate zero‑shot with humans is a key mission in multi‑agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies which are then used to train a Cooperator agent. The simulated human is produced either through behavior cloning over a dataset of human cooperation behavior, or by using MARL to create a population of simulated agents. However, these approaches often struggle to produce a Cooperator that can coordinate well with real humans, since the simulated humans fail to cover the diverse strategies and styles employed by people in the real world. We show \emphlearning a generative model of human partners can effectively address this issue. Our model learns a latent variable representation of the human that can be regarded as encoding the human's unique strategy, intention, experience, or style. This generative model can be flexibly trained from any (human or neural policy) agent interaction data. By sampling from the latent space, we can use the generative model to produce different partners to train Cooperator agents. We evaluate our method ‑‑ Generative Agent Modeling for Multi‑agent Adaptation (GAMMA) ‑‑ on Overcooked, a challenging cooperative cooking game that has become a standard benchmark for zero‑shot coordination. We conduct an evaluation with real human teammates, and the results show that GAMMA consistently improves performance, whether the generative model is trained on simulated populations or human datasets. Further, we propose a method for posterior sampling from the generative model that is biased towards the human data, enabling us to efficiently improve performance with only a small amount of expensive human interaction data.

PaperID: 12344, https://arxiv.org/pdf/2411.13932.pdf

Authors: Hailong Yang, Mingxian Gu, Renhuo Zhao, Fuping Hu, Zhaohong Deng, Yitang Chen

Title: XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation

Abstract:
Extracting implicit knowledge and logical reasoning abilities from large language models (LLMs) has consistently been a significant challenge. The advancement of multi‑agent systems has further en‑hanced the capabilities of LLMs. Inspired by the structure of multi‑polar neurons (MNs), we propose the XAgents framework, an in‑terpretable multi‑agent cooperative framework based on the IF‑THEN rule‑based system. The IF‑Parts of the rules are responsible for logical reasoning and domain membership calculation, while the THEN‑Parts are comprised of domain expert agents that generate domain‑specific contents. Following the calculation of the member‑ship, XAgetns transmits the task to the disparate domain rules, which subsequently generate the various responses. These re‑sponses are analogous to the answers provided by different experts to the same question. The final response is reached at by eliminat‑ing the hallucinations and erroneous knowledge of the LLM through membership computation and semantic adversarial genera‑tion of the various domain rules. The incorporation of rule‑based interpretability serves to bolster user confidence in the XAgents framework. We evaluate the efficacy of XAgents through a com‑parative analysis with the latest AutoAgents, in which XAgents demonstrated superior performance across three distinct datasets. We perform post‑hoc interpretable studies with SHAP algorithm and case studies, proving the interpretability of XAgent in terms of input‑output feature correlation and rule‑based semantics.

PaperID: 12345, https://arxiv.org/pdf/2411.13874.pdf

Authors: Khalifa Afane, Wenqi Wei, Ying Mao, Junaid Farooq, Juntao Chen

Title: Next-Generation Phishing: How LLM Agents Empower Cyber Attackers

Abstract:
The escalating threat of phishing emails has become increasingly sophisticated with the rise of Large Language Models (LLMs). As attackers exploit LLMs to craft more convincing and evasive phishing emails, it is crucial to assess the resilience of current phishing defenses. In this study we conduct a comprehensive evaluation of traditional phishing detectors, such as Gmail Spam Filter, Apache SpamAssassin, and Proofpoint, as well as machine learning models like SVM, Logistic Regression, and Naive Bayes, in identifying both traditional and LLM‑rephrased phishing emails. We also explore the emerging role of LLMs as phishing detection tools, a method already adopted by companies like NTT Security Holdings and JPMorgan Chase. Our results reveal notable declines in detection accuracy for rephrased emails across all detectors, highlighting critical weaknesses in current phishing defenses. As the threat landscape evolves, our findings underscore the need for stronger security controls and regulatory oversight on LLM‑generated content to prevent its misuse in creating advanced phishing attacks. This study contributes to the development of more effective Cyber Threat Intelligence (CTI) by leveraging LLMs to generate diverse phishing variants that can be used for data augmentation, harnessing the power of LLMs to enhance phishing detection, and paving the way for more robust and adaptable threat detection systems.

PaperID: 12346, https://arxiv.org/pdf/2411.13806.pdf

Authors: Anton A. Stoorvogel, Ali Saberi, Zhenwei Liu

Title: Weak synchronization in heterogeneous multi-agent systems

Abstract:
In this paper, we propose a new framework for synchronization of heterogeneous multi agent system which we refer to as weak synchronization. This new framework of synchronization is based on achieving the network stability in the absence of any information on communication network including the connectivity. Here by network stability, we mean that in the basic setup of a multi‑agent system, we require that the signals exchanged over the network converge to zero. As such if the network happens to have a directed spanning tree then we obtain classical synchronization. Moreover, we design protocols which achieve weak synchronization for any network without making any kind of assumptions on communication network. If the network happens to have a directed spanning tree, then we obtain classical synchronization. However, if this is not the case then we describe in detail in this paper what kind of synchronization properties are preserved in the system and the output of the different agents can behave.

PaperID: 12347, https://arxiv.org/pdf/2411.13768.pdf

Authors: Boming Xia, Qinghua Lu, Liming Zhu, Zhenchang Xing, Dehai Zhao, Hao Zhang

Title: Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture

Abstract:
Large Language Models (LLMs) have enabled the emergence of LLM agents, systems capable of pursuing under‑specified goals and adapting after deployment. Evaluating such agents is challenging because their behavior is open ended, probabilistic, and shaped by system‑level interactions over time. Traditional evaluation methods, built around fixed benchmarks and static test suites, fail to capture emergent behaviors or support continuous adaptation across the lifecycle. To ground a more systematic approach, we conduct a multivocal literature review (MLR) synthesizing academic and industrial evaluation practices. The findings directly inform two empirically derived artifacts: a process model and a reference architecture that embed evaluation as a continuous, governing function rather than a terminal checkpoint. Together they constitute the evaluation‑driven development and operations (EDDOps) approach, which unifies offline (development‑time) and online (runtime) evaluation within a closed feedback loop. By making evaluation evidence drive both runtime adaptation and governed redevelopment, EDDOps supports safer, more traceable evolution of LLM agents aligned with changing objectives, user needs, and governance constraints.

PaperID: 12348, https://arxiv.org/pdf/2411.13749.pdf

Authors: U. León-Domínguez, E. D. Flores-Flores, A. J. García-Jasso, M. K. Gómez-Cuellar, D. Torres-Sánchez, A. Basora-Marimon

Title: AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test

Abstract:
Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates up to 50%. However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human‑AI collaboration. In this experiment, three GPT agents with varying levels of agreeableness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%. This agent was also recognized as exhibiting the most human‑like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.

PaperID: 12349, https://arxiv.org/pdf/2411.13690.pdf

Authors: Sanjana Agrawal, Saúl A. Blanco

Title: Multi-Agent Best Arm Identification in Stochastic Linear Bandits

Abstract:
We study the problem of collaborative best‑arm identification in stochastic linear bandits under a fixed‑budget scenario. In our learning model, we first consider multiple agents connected through a star network, interacting with a linear bandit instance in parallel. We then extend our analysis to arbitrary network topologies. The objective of the agents is to collaboratively identify the best arm of the given bandit instance with the help of a central server while minimizing the probability of error in best arm estimation. To this end, we propose two algorithms, MaLinBAI‑Star and MaLinBAI‑Gen for star networks and networks with arbitrary structure, respectively. Both algorithms utilize the technique of G‑optimal design along with the successive elimination based strategy where agents share their knowledge through a central server at each communication round. We demonstrate, both theoretically and empirically, that our algorithms achieve exponentially decaying probability of error in the allocated time budget. Furthermore, experimental results on both synthetic and real‑world data validate the effectiveness of our algorithms over the state‑of‑the art existing multi‑agent algorithms.

PaperID: 12350, https://arxiv.org/pdf/2411.13537.pdf

Authors: Rodolfo Valiente, Praveen K. Pilly

Title: Competence-Aware AI Agents with Metacognition for Unknown Situations and Environments (MUSE)

Abstract:
Metacognition, defined as the awareness and regulation of one's cognitive processes, is central to human adaptability in unknown situations. In contrast, current autonomous agents often struggle in novel environments due to their limited capacity for adaptation. We hypothesize that metacognition is a critical missing ingredient in autonomous agents for the cognitive flexibility needed to tackle unfamiliar challenges. Given the broad scope of metacognitive abilities, we focus on competence awareness and strategy selection. To this end, we propose the Metacognition for Unknown Situations and Environments (MUSE) framework to integrate metacognitive processes of self‑assessment and self‑regulation into autonomous agents. We present two implementations of MUSE: one based on world modeling and another leveraging large language models (LLMs). Our system continually learns to assess its competence on a given task and uses this self‑assessment to guide iterative cycles of strategy selection. MUSE agents demonstrate high competence awareness and significant improvements in self‑regulation for solving novel, out‑of‑distribution tasks more effectively compared to model‑based reinforcement learning and purely prompt‑based LLM agent approaches. This work highlights the promise of approaches inspired by cognitive and neural systems in enabling autonomous agents to adapt to new environments while mitigating the heavy reliance on extensive training data and large models for the current models.

PaperID: 12351, https://arxiv.org/pdf/2411.13164.pdf

Authors: Qifeng Lin, Nghia Vuong, Kewei Song, Phuoc Thanh Tran-Ngoc, Greg Angelo Gonzales Nonato, Hirotaka Sato

Title: Cyborg Insect Factory: Automatic Assembly System to Build up Insect-computer Hybrid Robot Based on Vision-guided Robotic Arm Manipulation of Custom Bipolar Electrodes

Abstract:
The advancement of insect‑computer hybrid robots holds significant promise for navigating complex terrains and enhancing robotics applications. This study introduced an automatic assembly method for insect‑computer hybrid robots, which was accomplished by mounting backpack with precise implantation of custom‑designed bipolar electrodes. We developed a stimulation protocol for the intersegmental membrane between pronotum and mesothorax of the Madagascar hissing cockroach, allowing for bipolar electrodes' automatic implantation using a robotic arm. The assembly process was integrated with a deep learning‑based vision system to accurately identify the implantation site, and a dedicated structure to fix the insect (68 s for the whole assembly process). The automatically assembled hybrid robots demonstrated steering control (over 70 degrees for 0.4 s stimulation) and deceleration control (68.2% speed reduction for 0.4 s stimulation), matching the performance of manually assembled systems. Furthermore, a multi‑agent system consisting of 4 hybrid robots successfully covered obstructed outdoor terrain (80.25% for 10 minutes 31 seconds), highlighting the feasibility of mass‑producing these systems for practical applications. The proposed automatic assembly strategy reduced preparation time for the insect‑computer hybrid robots while maintaining their precise control, laying a foundation for scalable production and deployment in real‑world applications.

PaperID: 12352, https://arxiv.org/pdf/2411.13116.pdf

Authors: Zhi Luo, Xiyuan Yang, Pan Zhou, Di Wang

Title: Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning

Abstract:
Manipulating the interaction trajectories between the intelligent agent and the environment can control the agent's training and behavior, exposing the potential vulnerabilities of reinforcement learning (RL). For example, in Cyber‑Physical Systems (CPS) controlled by RL, the attacker can manipulate the actions of the adopted RL to other actions during the training phase, which will lead to bad consequences. Existing work has studied action‑manipulation attacks in tabular settings, where the states and actions are discrete. As seen in many up‑and‑coming RL applications, such as autonomous driving, continuous action space is widely accepted, however, its action‑manipulation attacks have not been thoroughly investigated yet. In this paper, we consider this crucial problem in both white‑box and black‑box scenarios. Specifically, utilizing the knowledge derived exclusively from trajectories, we propose a black‑box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation. Additionally, we demonstrate that for an agent whose dynamic regret is sub‑linearly related to the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost, i.e., O\left(\mathcalR(T) + MH^3K^E\log (MT)\right)(0<E<1), where H is the number of steps per episode, K is the total number of episodes, T=KH is the total number of steps, M is the number of subspaces divided in the state space, and \mathcalR(T) is the bound of the RL algorithm's regret. We conduct our proposed attack methods on three aggressive algorithms: DDPG, PPO, and TD3 in continuous settings, which show a promising attack performance.

PaperID: 12353, https://arxiv.org/pdf/2411.12924.pdf

Authors: Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, Ming Wu

Title: Human-In-the-Loop Software Development Agents

Abstract:
Recently, Large Language Models (LLMs)‑based multi‑agent paradigms for software engineering are introduced to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated based on historical benchmark datasets, rarely considers human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human‑in‑the‑loop LLM‑based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal uses. Through a multi‑stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality remain a concern in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM‑based agents in software development.

PaperID: 12354, https://arxiv.org/pdf/2411.12746.pdf

Authors: Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, Rui Song

Title: A Review of Reinforcement Learning in Financial Applications

Abstract:
In recent years, there has been a growing trend of applying Reinforcement Learning (RL) in financial applications. This approach has shown great potential to solve decision‑making tasks in finance. In this survey, we present a comprehensive study of the applications of RL in finance and conduct a series of meta‑analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL's performance compared to traditional methods. Moreover, we identify challenges including explainability, Markov Decision Process (MDP) modeling, and robustness that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi‑agent RL, and model‑based RL to address these challenges and to further enhance the implementation of RL in finance.

PaperID: 12355, https://arxiv.org/pdf/2411.12725.pdf

Authors: Galit Askenazi-Golan, Domenico Mergoni Cecchelli, Edward Plumb, Clemens Possnig

Title: The Bounds of Algorithmic Collusion; $Q$-learning, Gradient Learning, and the Folk Theorem

Abstract:
We explore the behaviour emerging from learning agents repeatedly interacting strategically for a wide range of learning dynamics, including Q‑learning, projected gradient, replicator and log‑barrier dynamics. Going beyond the better understood classes of potential games and zero‑sum games, we consider the setting of a general repeated game with finite recall under different forms of monitoring. We obtain a Folk Theorem‑style result and characterise the set of payoff vectors that can be obtained by these dynamics, discovering a wide range of possibilities for the emergence of algorithmic collusion. Achieving this requires a novel technical approach, which, to the best of our knowledge, yields the first convergence result for multi‑agent Q‑learning algorithms in repeated games.

PaperID: 12356, https://arxiv.org/pdf/2411.12313.pdf

Authors: Xiaohe Li, Feilong Huang, Zide Fan, Fangli Mou, Leilei Lin, Yingyan Hou, Lijie Wen

Title: C$^{2}$INet: Realizing Incremental Trajectory Prediction with Prior-Aware Continual Causal Intervention

Abstract:
Trajectory prediction for multi‑agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large‑scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C^2INet) method for generalizable multi‑agent trajectory prediction within a continual learning framework. Using variational inference, we align environment‑related prior with posterior estimator of confounding factors in the latent space, thereby intervening in causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C^2INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state‑of‑the‑art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real‑world applications.

PaperID: 12357, https://arxiv.org/pdf/2411.12308.pdf

Authors: Christel Grimaud, Dominique Longin, Andreas Herzig

Title: SNN-Based Online Learning of Concepts and Action Laws in an Open World

Abstract:
We present the architecture of a fully autonomous, bio‑inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one‑shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.

PaperID: 12358, https://arxiv.org/pdf/2411.12246.pdf

Authors: David Ge, Hao Ji

Title: Efficient Training in Multi-Agent Reinforcement Learning: A Communication-Free Framework for the Box-Pushing Problem

Abstract:
Self‑organizing systems consist of autonomous agents that can perform complex tasks and adapt to dynamic environments without a central controller. Prior research often relies on reinforcement learning to enable agents to gain the skills needed for task completion, such as in the box‑pushing environment. However, when agents push from opposing directions during exploration, they tend to exert equal and opposite forces on the box, resulting in minimal displacement and inefficient training. This paper proposes a model called Shared Pool of Information (SPI), which enables information to be accessible to all agents and facilitates coordination, reducing force conflicts among agents and enhancing exploration efficiency. Through computer simulations, we demonstrate that SPI not only expedites the training process but also requires fewer steps per episode, significantly improving the agents' collaborative effectiveness.

PaperID: 12359, https://arxiv.org/pdf/2411.12214.pdf

Authors: Shridhar Mehendale, Ankit Walishetti

Title: DexAssist: A Voice-Enabled Dual-LLM Framework for Accessible Web Navigation

Abstract:
Individuals with fine motor impairments, such as those caused by conditions like Parkinson's disease, cerebral palsy, or dyspraxia, face significant challenges in interacting with traditional computer interfaces. Historically, scripted automation has offered some assistance, but these solutions are often too rigid and task‑specific, failing to adapt to the diverse needs of users. The advent of Large Language Models (LLMs) promised a more flexible approach, capable of interpreting natural language commands to navigate complex user interfaces. However, current LLMs often misinterpret user intent and have no fallback measures when user instructions do not directly align with the specific wording used in the Document Object Model (DOM). This research presents Dexterity Assist (DexAssist), a dual‑LLM system designed to improve the reliability of automated user interface control. Both LLMs work iteratively to ensure successful task execution: the Navigator LLM generates actions based on user input, while the Support LLM assesses the success of these actions and provides continuous feedback based on the DOM content. Our framework displays an increase of ~36 percentage points in overall accuracy within the first iteration of the Support LLM, highlighting its effectiveness in resolving errors in real‑time. The main contributions of this paper are the design of a novel dual LLM‑based accessibility system, its implementation, and its initial evaluation using 3 e‑commerce websites. We conclude by underscoring the potential to build on this framework by optimizing computation time and fine‑tuning.

PaperID: 12360, https://arxiv.org/pdf/2411.12206.pdf

Authors: Sriram S. K. S Narayanan, Joseph Moyalan, Umesh Vaidya

Title: Safe Navigation in Dynamic Environments using Density Functions

Abstract:
This work presents a density‑based framework for safe navigation in dynamic environments characterized by time‑varying obstacle sets and time‑varying target regions. We propose an analytical construction of time‑varying density functions that enables the synthesis of a feedback controller defined as the positive gradient of the resulting density field. The primary contribution of this paper is a rigorous convergence proof demonstrating almost‑everywhere safe navigation under the proposed framework, specifically for systems governed by single‑integrator dynamics. To the best of our knowledge, these are the first analytical guarantees of their kind for navigation in dynamic environments using density functions. We illustrate the applicability of the framework to systems with more complex dynamics, including multi‑agent systems and robotic manipulators, using standard control design techniques such as backstepping and inverse dynamics. These results provide a foundation for extending density‑based navigation methods to a broad class of robotic systems operating in time‑varying environments.

PaperID: 12361, https://arxiv.org/pdf/2411.12196.pdf

Authors: Zixin Liu, Ji Zhang, Yiran Ding

Title: A More Advanced Group Polarization Measurement Approach Based on LLM-Based Agents and Graphs

Abstract:
Group polarization is an important research direction in social media content analysis, attracting many researchers to explore this field. Therefore, how to effectively measure group polarization has become a critical topic. Measuring group polarization on social media presents several challenges that have not yet been addressed by existing solutions. First, social media group polarization measurement involves processing vast amounts of text, which poses a significant challenge for information extraction. Second, social media texts often contain hard‑to‑understand content, including sarcasm, memes, and internet slang. Additionally, group polarization research focuses on holistic analysis, while texts is typically fragmented. To address these challenges, we designed a solution based on a multi‑agent system and used a graph‑structured Community Sentiment Network (CSN) to represent polarization states. Furthermore, we developed a metric called Community Opposition Index (COI) based on the CSN to quantify polarization. Finally, we tested our multi‑agent system through a zero‑shot stance detection task and achieved outstanding results. In summary, the proposed approach has significant value in terms of usability, accuracy, and interpretability.

PaperID: 12362, https://arxiv.org/pdf/2411.12183.pdf

Authors: Siyu Wang, Shengran Dai, Jianhui Jiang, Shuang Wu, Yufei Peng, Junbin Zhang

Title: Action-Attentive Deep Reinforcement Learning for Autonomous Alignment of Beamlines

Abstract:
Synchrotron radiation sources play a crucial role in fields such as materials science, biology, and chemistry. The beamline, a key subsystem of the synchrotron, modulates and directs the radiation to the sample for analysis. However, the alignment of beamlines is a complex and time‑consuming process, primarily carried out manually by experienced engineers. Even minor misalignments in optical components can significantly affect the beam's properties, leading to suboptimal experimental outcomes. Current automated methods, such as bayesian optimization (BO) and reinforcement learning (RL), although these methods enhance performance, limitations remain. The relationship between the current and target beam properties, crucial for determining the adjustment, is not fully considered. Additionally, the physical characteristics of optical elements are overlooked, such as the need to adjust specific devices to control the output beam's spot size or position. This paper addresses the alignment of beamlines by modeling it as a Markov Decision Process (MDP) and training an intelligent agent using RL. The agent calculates adjustment values based on the current and target beam states, executes actions, and iterates until optimal parameters are achieved. A policy network with action attention is designed to improve decision‑making by considering both state differences and the impact of optical components. Experiments on two simulated beamlines demonstrate that our algorithm outperforms existing methods, with ablation studies highlighting the effectiveness of the action attention‑based policy network.

PaperID: 12363, https://arxiv.org/pdf/2411.12130.pdf

Authors: Kejun Chen, Truc Nguyen, Abhijeet Sahu, Malik Hassanaly

Title: Adversarial Multi-Agent Reinforcement Learning for Proactive False Data Injection Detection

Abstract:
Smart inverters are instrumental in the integration of distributed energy resources into the electric grid. Such inverters rely on communication layers for continuous control and monitoring, potentially exposing them to cyber‑physical attacks such as false data injection attacks (FDIAs). We propose to construct a defense strategy against a priori unknown FDIAs with a multi‑agent reinforcement learning (MARL) framework. The first agent is an adversary that simulates and discovers various FDIA strategies, while the second agent is a defender in charge of detecting and locating FDIAs. This approach enables the defender to be trained against new FDIAs continuously generated by the adversary. In addition, we show that the detection skills of an MARL defender can be combined with those of a supervised offline defender through a transfer learning approach. Numerical experiments conducted on a distribution and transmission system demonstrate that: a) the proposed MARL defender outperforms the offline defender against adversarial attacks; b) the transfer learning approach makes the MARL defender capable against both synthetic and unseen FDIAs.

PaperID: 12364, https://arxiv.org/pdf/2411.11844.pdf

Authors: Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen

Title: Generative World Explorer

Abstract:
Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world through a mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human‑like ability, we introduce the Generative World Explorer (Genex), an egocentric world exploration framework that allows an agent to mentally explore a large‑scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex‑DB. Our experimental results demonstrate that (1) Genex can generate high‑quality and consistent observations during long‑horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision‑making model (e.g., an LLM agent) to make better plans.

PaperID: 12365, https://arxiv.org/pdf/2411.11794.pdf

Authors: Satush Parikh, Soumya Basu, Avishek Ghosh, Abishek Sankararaman

Title: Competing Bandits in Decentralized Contextual Matching Markets

Abstract:
Sequential learning in a multi‑agent resource constrained matching market has received significant interest in the past few years. We study decentralized learning in two‑sided matching markets where the demand side (aka players or agents) competes for the supply side (aka arms) with potentially time‑varying preferences to obtain a stable match. Motivated by the linear contextual bandit framework, we assume that for each agent, an arm‑mean may be represented by a linear function of a known feature vector and an unknown (agent‑specific) parameter. Moreover, the preferences over arms depend on a latent environment in each round, where the latent environment varies across rounds in a non‑stationary manner. We propose learning algorithms to identify the latent environment and obtain stable matchings simultaneously. Our proposed algorithms achieve instance‑dependent logarithmic regret, scaling independently of the number of arms, and hence applicable for a large market.

PaperID: 12366, https://arxiv.org/pdf/2411.11779.pdf

Authors: Enshuo Hsu, Kirk Roberts

Title: LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

Abstract:
Objectives: Despite the recent adoption of large language models (LLMs) for biomedical information extraction, challenges in prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed LLM‑IE: a Python package for building complete information extraction pipelines. Our key innovation is an interactive LLM agent to support schema definition and prompt design. Materials and Methods: The LLM‑IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We benchmarked on the i2b2 datasets and conducted a system evaluation. Results: The sentence‑based prompting algorithm resulted in the best performance while requiring a longer inference time. System evaluation provided intuitive visualization. Discussion: LLM‑IE was designed from practical NLP experience in healthcare and has been adopted in internal projects. It should hold great value to the biomedical NLP community. Conclusion: We developed a Python package, LLM‑IE, that provides building blocks for robust information extraction pipeline construction.

PaperID: 12367, https://arxiv.org/pdf/2411.11616.pdf

Authors: Leo Cazenille, Maxime Toquebiau, Nicolas Lobato-Dauzier, Alessia Loi, Loona Macabre, Nathanael Aubert-Kato, Anthony Genot, Nicolas Bredeche

Title: Signaling and Social Learning in Swarms of Robots

Abstract:
This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how it can be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low‑level lossless compression with raw signal extraction and processing to high‑level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi‑agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.

PaperID: 12368, https://arxiv.org/pdf/2411.11581.pdf

Authors: Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao

Title: OASIS: Open Agent Social Interaction Simulations with One Million Agents

Abstract:
There has been a growing interest in enhancing rule‑based agent‑based models (ABMs) for social media platforms (i.e., X, Reddit) with more realistic large language model (LLM) agents, thereby allowing for a more nuanced study of complex systems. As a result, several LLM‑based ABMs have been proposed in the past year. While they hold promise, each simulator is specifically designed to study a particular scenario, making it time‑consuming and resource‑intensive to explore other phenomena using the same ABM. Additionally, these models simulate only a limited number of agents, whereas real‑world social media platforms involve millions of users. To this end, we propose OASIS, a generalizable and scalable social media simulator. OASIS is designed based on real‑world social media platforms, incorporating dynamically updated environments (i.e., dynamic social networks and post information), diverse action spaces (i.e., following, commenting), and recommendation systems (i.e., interest‑based and hot‑score‑based). Additionally, OASIS supports large‑scale user simulations, capable of modeling up to one million users. With these features, OASIS can be easily extended to different social media platforms to study large‑scale group phenomena and behaviors. We replicate various social phenomena, including information spreading, group polarization, and herd effects across X and Reddit platforms. Moreover, we provide observations of social phenomena at different agent group scales. We observe that the larger agent group scale leads to more enhanced group dynamics and more diverse and helpful agents' opinions. These findings demonstrate OASIS's potential as a powerful tool for studying complex systems in digital environments.

PaperID: 12369, https://arxiv.org/pdf/2411.11532.pdf

Authors: Hanxiang Xu, Wei Ma, Ting Zhou, Yanjie Zhao, Kai Chen, Qiang Hu, Yang Liu, Haoyu Wang

Title: CKGFuzzer: LLM-Based Fuzz Driver Generation Enhanced By Code Knowledge Graph

Abstract:
In recent years, the programming capabilities of large language models (LLMs) have garnered significant attention. Fuzz testing, a highly effective technique, plays a key role in enhancing software reliability and detecting vulnerabilities. However, traditional fuzz testing tools rely on manually crafted fuzz drivers, which can limit both testing efficiency and effectiveness. To address this challenge, we propose an automated fuzz testing method driven by a code knowledge graph and powered by an LLM‑based intelligent agent system, referred to as CKGFuzzer. We approach fuzz driver creation as a code generation task, leveraging the knowledge graph of the code repository to automate the generation process within the fuzzing loop, while continuously refining both the fuzz driver and input seeds. The code knowledge graph is constructed through interprocedural program analysis, where each node in the graph represents a code entity, such as a function or a file. The knowledge graph‑enhanced CKGFuzzer not only effectively resolves compilation errors in fuzz drivers and generates input seeds tailored to specific API usage scenarios, but also analyzes fuzz driver crash reports, assisting developers in improving code quality. By querying the knowledge graph of the code repository and learning from API usage scenarios, we can better identify testing targets and understand the specific purpose of each fuzz driver. We evaluated our approach using eight open‑source software projects. The experimental results indicate that CKGFuzzer achieved an average improvement of 8.73% in code coverage compared to state‑of‑the‑art techniques. Additionally, CKGFuzzer reduced the manual review workload in crash case analysis by 84.4% and successfully detected 11 real bugs (including nine previously unreported bugs) across the tested libraries.

PaperID: 12370, https://arxiv.org/pdf/2411.11142.pdf

Authors: Dimitria Silveria, Kleber Cabral, Peter Jardine, Sidney Givigi

Title: Emergent Structure in Multi-agent Systems Using Geometric Embeddings

Abstract:
This work investigates the self‑organization of multi‑agent systems into closed trajectories, a common requirement in unmanned aerial vehicle (UAV) surveillance tasks. In such scenarios, smooth, unbiased control signals save energy and mitigate mechanical strain. We propose a decentralized control system architecture that produces a globally stable emergent structure from local observations only; there is no requirement for agents to share a global plan or follow prescribed trajectories. Central to our approach is the formulation of an injective virtual embedding induced by rotations from the actual agent positions. This embedding serves as a structure‑preserving map around which all agent stabilize their relative positions and permits the use of well‑established linear control techniques. We construct the embedding such that it is topologically equivalent to the desired trajectory (i.e., a homeomorphism), thereby preserving the stability characteristics. We demonstrate the versatility of this approach through implementation on a swarm of Quanser QDrone quadcopters. Results demonstrate the quadcopters self‑organize into the desired trajectory while maintaining even separation.

PaperID: 12371, https://arxiv.org/pdf/2411.11099.pdf

Authors: Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana

Title: Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Abstract:
In decentralized multi‑agent reinforcement learning, agents learning in isolation can lead to relative over‑generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q‑Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q‑values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.

PaperID: 12372, https://arxiv.org/pdf/2411.11070.pdf

Authors: Enyu Shi, Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Chau Yuen, Derrick Wing Kwan Ng, Marco Di Renzo, Bo Ai

Title: Joint Precoding and AP Selection for Energy Efficient RIS-aided Cell-Free Massive MIMO Using Multi-agent Reinforcement Learning

Abstract:
Cell‑free (CF) massive multiple‑input multiple‑output (mMIMO) and reconfigurable intelligent surface (RIS) are two advanced transceiver technologies for realizing future sixth‑generation (6G) networks. In this paper, we investigate the joint precoding and access point (AP) selection for energy efficient RIS‑aided CF mMIMO system. To address the associated computational complexity and communication power consumption, we advocate for user‑centric dynamic networks in which each user is served by a subset of APs rather than by all of them. Based on the user‑centric network, we formulate a joint precoding and AP selection problem to maximize the energy efficiency (EE) of the considered system. To solve this complex nonconvex problem, we propose an innovative double‑layer multi‑agent reinforcement learning (MARL)‑based scheme. Moreover, we propose an adaptive power threshold‑based AP selection scheme to further enhance the EE of the considered system. To reduce the computational complexity of the RIS‑aided CF mMIMO system, we introduce a fuzzy logic (FL) strategy into the MARL scheme to accelerate convergence. The simulation results show that the proposed FL‑based MARL cooperative architecture effectively improves EE performance, offering a 85% enhancement over the zero‑forcing (ZF) method, and achieves faster convergence speed compared with MARL. It is important to note that increasing the transmission power of the APs or the number of RIS elements can effectively enhance the spectral efficiency (SE) performance, which also leads to an increase in power consumption, resulting in a non‑trivial trade‑off between the quality of service and EE performance.

PaperID: 12373, https://arxiv.org/pdf/2411.11057.pdf

Authors: Medant Sharan, Chandranath Adak

Title: Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker'

Abstract:
This paper investigates the strategy game So Long Sucker (SLS) as a novel benchmark for multi‑agent reinforcement learning (MARL). Unlike traditional board or video game testbeds, SLS is distinguished by its coalition formation, strategic deception, and dynamic elimination rules, making it a uniquely challenging environment for autonomous agents. We introduce the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms. Using classical deep reinforcement learning methods (e.g., DQN, DDQN, and Dueling DQN), we train self‑playing agents to learn the rules and basic strategies of SLS. Experimental results demonstrate that, although these agents achieve roughly half of the maximum attainable reward and consistently outperform random baselines, they require long training horizons (~2000 games) and still commit occasional illegal moves, highlighting both the promise and limitations of classical reinforcement learning. Our findings establish SLS as a negotiation‑aware benchmark for MARL, opening avenues for future research that integrates game‑theoretic reasoning, coalition‑aware strategies, and advanced reinforcement learning architectures to better capture the social and adversarial dynamics of complex multi‑agent games.

PaperID: 12374, https://arxiv.org/pdf/2411.10805.pdf

Authors: Naci Saldi, Gurdal Arslan, Serdar Yuksel

Title: Existence of $ε$-Nash Equilibria in Nonzero-Sum and Zero-Sum Markov Games with Standard Borel Spaces via Finite Model Approximations

Abstract:
Establishing the existence of exact or near Markov or stationary perfect Nash equilibria in nonzero‑sum Markov games over Borel spaces is a challenging problem with limited positive results. Motivated by problems in multi‑agent and Bayesian learning, this paper demonstrates the existence of approximate Markov and stationary Nash equilibria for such games under mild regularity conditions. Our approach is constructive: For both compact and non‑compact state spaces, we approximate the Borel model with finite state‑action models and show that their equilibria correspond to $ε$‑equilibria for the original game. Compared with previous results in the literature, which we comprehensively review, we provide more general and complementary conditions, along with explicit approximation models whose equilibria are ε‑equilibria for the original model. For completeness, we also study the approximation of zero‑sum Markov games and Markov teams to highlight the key differences between zero‑sum and nonzero‑sum settings. In particular, while for zero‑sum and team games, joint weak (Feller) continuity of the transition kernel is sufficient (as the value function is continuous), this is not the case for general nonzero‑sum games.

PaperID: 12375, https://arxiv.org/pdf/2411.10670.pdf

Authors: Juan A. Rodriguez, Nicholas Botzer, David Vazquez, Christopher Pal, Marco Pedersoli, Issam Laradji

Title: IntentGPT: Few-shot Intent Discovery with Large Language Models

Abstract:
In today's digitally driven world, dialogue systems play a pivotal role in enhancing user interactions, from customer service to virtual assistants. In these dialogues, it is important to identify user's goals automatically to resolve their needs promptly. This has necessitated the integration of models that perform Intent Detection. However, users' intents are diverse and dynamic, making it challenging to maintain a fixed set of predefined intents. As a result, a more practical approach is to develop a model capable of identifying new intents as they emerge. We address the challenge of Intent Discovery, an area that has drawn significant attention in recent research efforts. Existing methods need to train on a substantial amount of data for correctly identifying new intents, demanding significant human effort. To overcome this, we introduce IntentGPT, a novel training‑free method that effectively prompts Large Language Models (LLMs) such as GPT‑4 to discover new intents with minimal labeled data. IntentGPT comprises an In‑Context Prompt Generator, which generates informative prompts for In‑Context Learning, an Intent Predictor for classifying and discovering user intents from utterances, and a Semantic Few‑Shot Sampler that selects relevant few‑shot examples and a set of known intents to be injected into the prompt. Our experiments show that IntentGPT outperforms previous methods that require extensive domain‑specific data and fine‑tuning, in popular benchmarks, including CLINC and BANKING, among others.

PaperID: 12376, https://arxiv.org/pdf/2411.10558.pdf

Authors: Sheryl Paul, Anand Balakrishnan, Xin Qin, Jyotirmoy V. Deshmukh

Title: Multi-agent Path Finding for Timed Tasks using Evolutionary Games

Abstract:
Autonomous multi‑agent systems such as hospital robots and package delivery drones often operate in highly uncertain environments and are expected to achieve complex temporal task objectives while ensuring safety. While learning‑based methods such as reinforcement learning are popular methods to train single and multi‑agent autonomous systems under user‑specified and state‑based reward functions, applying these methods to satisfy trajectory‑level task objectives is a challenging problem. Our first contribution is the use of weighted automata to specify trajectory‑level objectives, such that, maximal paths induced in the weighted automaton correspond to desired trajectory‑level behaviors. We show how weighted automata‑based specifications go beyond timeliness properties focused on deadlines to performance properties such as expeditiousness. Our second contribution is the use of evolutionary game theory (EGT) principles to train homogeneous multi‑agent teams targeting homogeneous task objectives. We show how shared experiences of agents and EGT‑based policy updates allow us to outperform state‑of‑the‑art reinforcement learning (RL) methods in minimizing path length by nearly 30% in large spaces. We also show that our algorithm is computationally faster than deep RL methods by at least an order of magnitude. Additionally our results indicate that it scales better with an increase in the number of agents as compared to other methods.

PaperID: 12377, https://arxiv.org/pdf/2411.10459.pdf

Authors: Brian Mintz, Feng Fu

Title: Evolutionary Multi-agent Reinforcement Learning in Group Social Dilemmas

Abstract:
Reinforcement learning (RL) is a powerful machine learning technique that has been successfully applied to a wide variety of problems. However, it can be unpredictable and produce suboptimal results in complicated learning environments. This is especially true when multiple agents learn simultaneously, which creates a complex system that is often analytically intractable. Our work considers the fundamental framework of Q‑learning in Public Goods Games, where RL individuals must work together to achieve a common goal. This setting allows us to study the tragedy of the commons and free rider effects in AI cooperation, an emerging field with potential to resolve challenging obstacles to the wider application of artificial intelligence. While this social dilemma has been mainly investigated through traditional and evolutionary game theory, our approach bridges the gap between these two by studying agents with an intermediate level of intelligence. Specifically, we consider the influence of learning parameters on cooperation levels in simulations and a limiting system of differential equations, as well as the effect of evolutionary pressures on exploration rate in both of these models. We find selection for higher and lower levels of exploration, as well as attracting values, and a condition that separates these in a restricted class of games. Our work enhances the theoretical understanding of evolutionary Q‑learning, and extends our knowledge of the evolution of machine behavior in social dilemmas.

PaperID: 12378, https://arxiv.org/pdf/2411.10454.pdf

Authors: Kevin F. Dunnell, Andrew P. Stoddard

Title: Biotic Browser: Applying StreamingLLM as a Persistent Web Browsing Co-Pilot

Abstract:
This paper presents "Biotic Browser," an innovative AI assistant leveraging StreamingLLM to transform web navigation and task execution. Characterized by its ability to simulate the experience of a passenger in an autonomous vehicle, the Biotic Browser excels in managing extended interactions and complex, multi‑step web‑based tasks. It marks a significant advancement in AI technology, particularly in the realm of long‑term context management, and offers promising applications for enhancing productivity and efficiency in both personal and professional settings.

PaperID: 12379, https://arxiv.org/pdf/2411.10268.pdf

Authors: Hossein Hassani, Ehsan Hallaji, Roozbeh Razavi-Far, Mehrdad Saif, Liang Lin

Title: Towards Sample-Efficiency and Generalization of Transfer and Inverse Reinforcement Learning: A Comprehensive Literature Review

Abstract:
Reinforcement learning (RL) is a sub‑domain of machine learning, mainly concerned with solving sequential decision‑making problems by a learning agent that interacts with the decision environment to improve its behavior through the reward it receives from the environment. This learning paradigm is, however, well‑known for being time‑consuming due to the necessity of collecting a large amount of data, making RL suffer from sample inefficiency and difficult generalization. Furthermore, the construction of an explicit reward function that accounts for the trade‑off between multiple desiderata of a decision problem is often a laborious task. These challenges have been recently addressed utilizing transfer and inverse reinforcement learning (T‑IRL). In this regard, this paper is devoted to a comprehensive review of realizing the sample efficiency and generalization of RL algorithms through T‑IRL. Following a brief introduction to RL, the fundamental T‑IRL methods are presented and the most recent advancements in each research field have been extensively reviewed. Our findings denote that a majority of recent research works have dealt with the aforementioned challenges by utilizing human‑in‑the‑loop and sim‑to‑real strategies for the efficient transfer of knowledge from source domains to the target domain under the transfer learning scheme. Under the IRL structure, training schemes that require a low number of experience transitions and extension of such frameworks to multi‑agent and multi‑intention problems have been the priority of researchers in recent years.

PaperID: 12380, https://arxiv.org/pdf/2411.10184.pdf

Authors: Valeria Jannelli, Stefan Schoepf, Matthias Bickel, Torbjørn Netland, Alexandra Brintrup

Title: Agentic LLMs in the Supply Chain: Towards Autonomous Multi-Agent Consensus-Seeking

Abstract:
This paper explores how Large Language Models (LLMs) can automate consensus‑seeking in supply chain management (SCM), where frequent decisions on problems such as inventory levels and delivery times require coordination among companies. Traditional SCM relies on human consensus in decision‑making to avoid emergent problems like the bullwhip effect. Some routine consensus processes, especially those that are time‑intensive and costly, can be automated. Existing solutions for automated coordination have faced challenges due to high entry barriers locking out SMEs, limited capabilities, and limited adaptability in complex scenarios. However, recent advances in Generative AI, particularly LLMs, show promise in overcoming these barriers. LLMs, trained on vast datasets can negotiate, reason, and plan, facilitating near‑human‑level consensus at scale with minimal entry barriers. In this work, we identify key limitations in existing approaches and propose autonomous LLM agents to address these gaps. We introduce a series of novel, supply chain‑specific consensus‑seeking frameworks tailored for LLM agents and validate the effectiveness of our approach through a case study in inventory management. To accelerate progress within the SCM community, we open‑source our code, providing a foundation for further advancements in LLM‑powered autonomous supply chain solutions.

PaperID: 12381, https://arxiv.org/pdf/2411.10109.pdf

Authors: Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, Michael S. Bernstein

Title: LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

Abstract:
Machine learning can predict human behavior well when substantial structured data and well‑defined outcomes are available, but these models are typically limited to specific outcomes and cannot readily be applied to new domains. We test whether large language models (LLMs) can support a more general‑purpose approach by building person‑specific simulations (i.e., "generative agents") grounded in self‑report data. Using data from a diverse national sample of 1,052 Americans, we build agents from (i) two‑hour, semi‑structured interviews (elicited using the American Voices Project interview schedule), (ii) structured surveys (the General Social Survey and Big Five personality inventory), or (iii) both sources combined. On held‑out General Social Survey items, agent accuracy reached 83% (interview only), 82% (surveys only), and 86% (combined) of participants' two‑week test‑retest consistency, compared with agents prompted only with individuals' demographics (74%). Agents predicted personality traits and behaviors in experiments with similar accuracy, and reduced disparities in accuracy across racial and ideological groups relative to demographics‑only baselines. Together, these results show that LLMs agents grounded in rich qualitative or quantitative self‑report data can support general‑purpose simulation of individuals across outcomes, without requiring task‑specific training data.

PaperID: 12382, https://arxiv.org/pdf/2411.10031.pdf

Authors: Jingyuan Zhou, Longhao Yan, Jinhao Liang, Kaidi Yang

Title: Enforcing Cooperative Safety for Reinforcement Learning-based Mixed-Autonomy Platoon Control

Abstract:
It is recognized that the control of mixed‑autonomy platoons comprising connected and automated vehicles (CAVs) and human‑driven vehicles (HDVs) can enhance traffic flow. Among existing methods, Multi‑Agent Reinforcement Learning (MARL) appears to be a promising control strategy because it can manage complex scenarios in real time. However, current research on MARL‑based mixed‑autonomy platoon control suffers from several limitations. First, existing MARL approaches address safety by penalizing safety violations in the reward function, thus lacking theoretical safety guarantees due to the black‑box nature of RL. Second, few studies have explored the cooperative safety of multi‑CAV platoons, where CAVs can be coordinated to further enhance the system‑level safety involving the safety of both CAVs and HDVs. Third, existing work tends to make an unrealistic assumption that the behavior of HDVs and CAVs is publicly known and rationale. To bridge the research gaps, we propose a safe MARL framework for mixed‑autonomy platoons. Specifically, this framework (i) characterizes cooperative safety by designing a cooperative Control Barrier Function (CBF), enabling CAVs to collaboratively improve the safety of the entire platoon, (ii) provides a safety guarantee to the MARL‑based controller by integrating the CBF‑based safety constraints into MARL through a differentiable quadratic programming (QP) layer, and (iii) incorporates a conformal prediction module that enables each CAV to estimate the unknown behaviors of the surrounding vehicles with uncertainty qualification. Simulation results show that our proposed control strategy can effectively enhance the system‑level safety through CAV cooperation of a mixed‑autonomy platoon with a minimal impact on control performance.

PaperID: 12383, https://arxiv.org/pdf/2411.09954.pdf

Authors: Liwei Yuan, Hideaki Ishii

Title: Reaching Resilient Leader-Follower Consensus in Time-Varying Networks via Multi-Hop Relays

Abstract:
We study resilient leader‑follower consensus of multi‑agent systems (MASs) in the presence of adversarial agents, where agents' communication is modeled by time‑varying topologies. The objective is to develop distributed algorithms for the nonfaulty/normal followers to track an arbitrary reference value propagated by a set of leaders while they are in interaction with the unknown adversarial agents. Our approaches are based on the weighted mean subsequence reduced (W‑MSR) algorithms with agents being capable to communicate with multi‑hop neighbors. Our algorithms can handle agents possessing first‑order and second‑order dynamics. Moreover, we characterize necessary and sufficient graph conditions for our algorithms to succeed by the novel notion of jointly robust following graphs. Our graph condition is tighter than the sufficient conditions in the literature when agents use only one‑hop communication (without relays). Using multi‑hop relays, we can enhance robustness of leader‑follower networks without increasing communication links and obtain further relaxed graph requirements for our algorithms to succeed. Numerical examples are given to verify the efficacy of our algorithms.

PaperID: 12384, https://arxiv.org/pdf/2411.09856.pdf

Authors: Xiaoxuan Hou, Jiayi Yuan, Joel Z. Leibo, Natasha Jaques

Title: InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma

Abstract:
InvestESG is a novel multi‑agent reinforcement learning (MARL) benchmark designed to study the impact of Environmental, Social, and Governance (ESG) disclosure mandates on corporate climate investments. The benchmark models an intertemporal social dilemma where companies balance short‑term profit losses from climate mitigation efforts and long‑term benefits from reducing climate risk, while ESG‑conscious investors attempt to influence corporate behavior through their investment decisions. Companies allocate capital across mitigation, greenwashing, and resilience, with varying strategies influencing climate outcomes and investor preferences. We are releasing open‑source versions of InvestESG in both PyTorch and JAX, which enable scalable and hardware‑accelerated simulations for investigating competing incentives in mitigate climate change. Our experiments show that without ESG‑conscious investors with sufficient capital, corporate mitigation efforts remain limited under the disclosure mandate. However, when a critical mass of investors prioritizes ESG, corporate cooperation increases, which in turn reduces climate risks and enhances long‑term financial stability. Additionally, providing more information about global climate risks encourages companies to invest more in mitigation, even without investor involvement. Our findings align with empirical research using real‑world data, highlighting MARL's potential to inform policy by providing insights into large‑scale socio‑economic challenges through efficient testing of alternative policy and market designs.

PaperID: 12385, https://arxiv.org/pdf/2411.09493.pdf

Authors: Sneha Ramshanker, Hungtang Ko, Radhika Nagpal

Title: Strategic Sacrifice: Self-Organized Robot Swarm Localization for Inspection Productivity

Abstract:
Robot swarms offer significant potential for inspecting diverse infrastructure, ranging from bridges to space stations. However, effective inspection requires accurate robot localization, which demands substantial computational resources and limits productivity. Inspired by biological systems, we introduce a novel cooperative localization mechanism that minimizes collective computation expenditure through self‑organized sacrifice. Here, a few agents bear the computational burden of localization; through local interactions, they improve the inspection productivity of the swarm. Our approach adaptively maximizes inspection productivity for unconstrained trajectories in dynamic interaction and environmental settings. We demonstrate the optimality and robustness using mean‑field analytical models, multi‑agent simulations, and hardware experiments with metal climbing robots inspecting a 3D cylinder.

PaperID: 12386, https://arxiv.org/pdf/2411.09160.pdf

Authors: Qin Yang

Title: Innate-Values-driven Reinforcement Learning based Cognitive Modeling

Abstract:
Innate values describe agents' intrinsic motivations, which reflect their inherent interests and preferences for pursuing goals and drive them to develop diverse skills that satisfy their various needs. Traditional reinforcement learning (RL) is learning from interaction based on the feedback rewards of the environment. However, in real scenarios, the rewards are generated by agents' innate value systems, which differ vastly from individuals based on their needs and requirements. In other words, considering the AI agent as a self‑organizing system, developing its awareness through balancing internal and external utilities based on its needs in different tasks is a crucial problem for individuals learning to support others and integrate community with safety and harmony in the long term. To address this gap, we propose a new RL model termed innate‑values‑driven RL (IVRL) based on combined motivations' models and expected utility theory to mimic its complex behaviors in the evolution through decision‑making and learning. Then, we introduce two IVRL‑based models: IV‑DQN and IV‑A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role‑Playing Game (RPG) reinforcement learning test platform VIZDoom, we demonstrated that the IVRL‑based models can help the agent rationally organize various needs, achieve better performance effectively.

PaperID: 12387, https://arxiv.org/pdf/2411.09092.pdf

Authors: Nick Dashti, M. N. Najafi, Debra J. Searles

Title: Anomalous Diffusion and Emergent Universality in Coupled Memory-Driven Systems

Abstract:
Understanding how simple local interactions give rise to emergent exploration patterns is a fundamental question in statistical physics. We introduce a minimal model of two coupled agents that avoid retracing their own paths while being attracted to the trails left by one another. This system is inspired by, but not limited to, pheromone‑guided insect navigation. The coupling of self‑avoidance and attraction generates rich emergent behavior, including distinct anomalous diffusion regimes, non‑Gaussian position distributions, and compressed exponential encounter statistics. Most notably, we identify new universality classes for coupled random walks, characterized by unique scaling laws and distributional properties that, to our knowledge, have not been previously reported. These findings advance the theoretical understanding of coupled stochastic processes with memory and interaction feedback, providing a framework for exploring transport phenomena in a broad range of multi‑agent systems beyond biological contexts.

PaperID: 12388, https://arxiv.org/pdf/2411.08899.pdf

Authors: Sorouralsadat Fatemi, Yuheng Hu

Title: FinVision: A Multi-Agent Framework for Stock Market Prediction

Abstract:
Financial trading has been a challenging task, as it requires the integration of vast amounts of data from various modalities. Traditional deep learning and reinforcement learning methods require large training data and often involve encoding various data types into numerical formats for model input, which limits the explainability of model behavior. Recently, LLM‑based agents have demonstrated remarkable advancements in handling multi‑modal data, enabling them to execute complex, multi‑step decision‑making tasks while providing insights into their thought processes. This research introduces a multi‑modal multi‑agent system designed specifically for financial trading tasks. Our framework employs a team of specialized LLM‑based agents, each adept at processing and interpreting various forms of financial data, such as textual news reports, candlestick charts, and trading signal charts. A key feature of our approach is the integration of a reflection module, which conducts analyses of historical trading signals and their outcomes. This reflective process is instrumental in enhancing the decision‑making capabilities of the system for future trading scenarios. Furthermore, the ablation studies indicate that the visual reflection module plays a crucial role in enhancing the decision‑making capabilities of our framework.

PaperID: 12389, https://arxiv.org/pdf/2411.08896.pdf

Authors: Ruili Zhao, Jun Cai, Jiangtao Luo, Junpeng Gao, Yongyi Ran

Title: Demand-Aware Beam Hopping and Power Allocation for Load Balancing in Digital Twin empowered LEO Satellite Networks

Abstract:
Low‑Earth orbit (LEO) satellites utilizing beam hopping (BH) technology offer extensive coverage, low latency, high bandwidth, and significant flexibility. However, the uneven geographical distribution and temporal variability of ground traffic demands, combined with the high mobility of LEO satellites, present significant challenges for efficient beam resource utilization. Traditional BH methods based on GEO satellites fail to address issues such as satellite interference, overlapping coverage, and mobility. This paper explores a Digital Twin (DT)‑based collaborative resource allocation network for multiple LEO satellites with overlapping coverage areas. A two‑tier optimization problem, focusing on load balancing and cell service fairness, is proposed to maximize throughput and minimize inter‑cell service delay. The DT layer optimizes the allocation of overlapping coverage cells by designing BH patterns for each satellite, while the LEO layer optimizes power allocation for each selected service cell. At the DT layer, an Actor‑Critic network is deployed on each agent, with a global critic network in the cloud center. The A3C algorithm is employed to optimize the DT layer. Concurrently, the LEO layer optimization is performed using a Multi‑Agent Reinforcement Learning algorithm, where each beam functions as an independent agent. The simulation results show that this method reduces satellite load disparity by about 72.5% and decreases the average delay to 12ms. Additionally, our approach outperforms other benchmarks in terms of throughput, ensuring a better alignment between offered and requested data.

PaperID: 12390, https://arxiv.org/pdf/2411.08881.pdf

Authors: José Antonio Siqueira de Cerqueira, Mamia Agbese, Rebekah Rousi, Nannan Xi, Juho Hamari, Pekka Abrahamsson

Title: Can We Trust AI Agents? A Case Study of an LLM-Based Multi-Agent System for Ethical AI

Abstract:
AI‑based systems, including Large Language Models (LLM), impact millions by supporting diverse tasks but face issues like misinformation, bias, and misuse. AI ethics is crucial as new technologies and concerns emerge, but objective, practical guidance remains debated. This study examines the use of LLMs for AI ethics in practice, assessing how LLM trustworthiness‑enhancing techniques affect software development in this context. Using the Design Science Research (DSR) method, we identify techniques for LLM trustworthiness: multi‑agents, distinct roles, structured communication, and multiple rounds of debate. We design a multi‑agent prototype LLM‑MAS, where agents engage in structured discussions on real‑world AI ethics issues from the AI Incident Database. We evaluate the prototype across three case scenarios using thematic analysis, hierarchical clustering, comparative (baseline) studies, and running source code. The system generates approximately 2,000 lines of code per case, compared to only 80 lines in baseline trials. Discussions reveal terms like bias detection, transparency, accountability, user consent, GDPR compliance, fairness evaluation, and EU AI Act compliance, showing this prototype ability to generate extensive source code and documentation addressing often overlooked AI ethics issues. However, practical challenges in source code integration and dependency management may limit its use by practitioners.

PaperID: 12391, https://arxiv.org/pdf/2411.08804.pdf

Authors: Tianyu Zhou, Pinqiao Wang, Yilin Wu, Hongyang Yang

Title: FinRobot: AI Agent for Equity Research and Valuation with Large Language Models

Abstract:
As financial markets grow increasingly complex, there is a rising need for automated tools that can effectively assist human analysts in equity research, particularly within sell‑side research. While Generative AI (GenAI) has attracted significant attention in this field, existing AI solutions often fall short due to their narrow focus on technical factors and limited capacity for discretionary judgment. These limitations hinder their ability to adapt to new data in real‑time and accurately assess risks, which diminishes their practical value for investors. This paper presents FinRobot, the first AI agent framework specifically designed for equity research. FinRobot employs a multi‑agent Chain of Thought (CoT) system, integrating both quantitative and qualitative analyses to emulate the comprehensive reasoning of a human analyst. The system is structured around three specialized agents: the Data‑CoT Agent, which aggregates diverse data sources for robust financial integration; the Concept‑CoT Agent, which mimics an analysts reasoning to generate actionable insights; and the Thesis‑CoT Agent, which synthesizes these insights into a coherent investment thesis and report. FinRobot provides thorough company analysis supported by precise numerical data, industry‑appropriate valuation metrics, and realistic risk assessments. Its dynamically updatable data pipeline ensures that research remains timely and relevant, adapting seamlessly to new financial information. Unlike existing automated research tools, such as CapitalCube and Wright Reports, FinRobot delivers insights comparable to those produced by major brokerage firms and fundamental research vendors. We open‑source FinRobot at \urlhttps://github. com/AI4Finance‑Foundation/FinRobot.

PaperID: 12392, https://arxiv.org/pdf/2411.08507.pdf

Authors: Zeeshan Rasheed, Malik Abdul Sami, Jussi Rasku, Kai-Kristian Kemell, Zheying Zhang, Janne Harjamaki, Shahbaz Siddeeq, Sami Lahti, Tomas Herda, Mikko Nurminen, Niklas Lavesson, Jose Siqueira de Cerqueira, Toufique Hasan, Ayman Khan, Mahade Hasan, Mika Saari, Petri Rantanen, Jari Soini, Pekka Abrahamsson

Title: TimeLess: A Vision for the Next Generation of Software Development

Abstract:
Present‑day software development faces three major challenges: complexity, time consumption, and high costs. Developing large software systems often requires battalions of teams and considerable time for meetings, which end without any action, resulting in unproductive cycles, delayed progress, and increased cost. What if, instead of large meetings with no immediate results, the software product is completed by the end of the meeting? In response, we present a vision for a system called TimeLess, designed to reshape the software development process by enabling immediate action during meetings. The goal is to shift meetings from planning discussions to productive, action‑oriented sessions. This approach minimizes the time and effort required for development, allowing teams to focus on critical decision‑making while AI agents execute development tasks based on the meeting discussions. We will employ multiple AI agents that work collaboratively to capture human discussions and execute development tasks in real time. This represents a step toward next‑generation software development environments, where human expertise drives strategy and AI accelerates task execution.

PaperID: 12393, https://arxiv.org/pdf/2411.08449.pdf

Authors: Siraj Munir, Alessandro Aldini

Title: Towards Evaluating Large Language Models for Graph Query Generation

Abstract:
Large Language Models (LLMs) are revolutionizing the landscape of Generative Artificial Intelligence (GenAI), with innovative LLM‑backed solutions emerging rapidly. However, when applied to database technologies, specifically query generation for graph databases and Knowledge Graphs (KGs), LLMs still face significant challenges. While research on LLM‑driven query generation for Structured Query Language (SQL) exists, similar systems for graph databases remain underdeveloped. This paper presents a comparative study addressing the challenge of generating Cypher queries a powerful language for interacting with graph databases using open‑access LLMs. We rigorously evaluate several LLM agents (OpenAI ChatGPT 4o, Claude Sonnet 3.5, Google Gemini Pro 1.5, and a locally deployed Llama 3.1 8B) using a designed few‑shot learning prompt and Retrieval Augmented Generation (RAG) backed by Chain‑of‑Thoughts (CoT) reasoning. Our empirical analysis of query generation accuracy reveals that Claude Sonnet 3.5 outperforms its counterparts in this specific domain. Further, we highlight promising future research directions to address the identified limitations and advance LLM‑driven query generation for graph databases.

PaperID: 12394, https://arxiv.org/pdf/2411.08418.pdf

Authors: Yun Long, Yu Zhang

Title: Enhanced Classroom Dialogue Sequences Analysis with a Hybrid AI Agent: Merging Expert Rule-Base with Large Language Models

Abstract:
Classroom dialogue plays a crucial role in fostering student engagement and deeper learning. However, analysing dialogue sequences has traditionally relied on either theoretical frameworks or empirical descriptions of practice, with limited integration between the two. This study addresses this gap by developing a comprehensive rule base of dialogue sequences and an Artificial Intelligence (AI) agent that combines expert‑informed rule‑based systems with a large language model (LLM). The agent applies expert knowledge while adapting to the complexities of natural language, enabling accurate and flexible categorisation of classroom dialogue sequences. By synthesising findings from over 30 studies, we established a comprehensive framework for dialogue analysis. The agent was validated against human expert coding, achieving high levels of precision and reliability. The results demonstrate that the agent provides theory‑grounded and adaptive functions, tremendously enhancing the efficiency and scalability of classroom dialogue analysis, offering significant potential in improving classroom teaching practices and supporting teacher professional development.

PaperID: 12395, https://arxiv.org/pdf/2411.08400.pdf

Authors: Geetansh Kalra, Amit Patel, Atul Chaudhari, Divye Singh

Title: BAMAX: Backtrack Assisted Multi-Agent Exploration using Reinforcement Learning

Abstract:
Autonomous robots collaboratively exploring an unknown environment is still an open problem. The problem has its roots in coordination among non‑stationary agents, each with only a partial view of information. The problem is compounded when the multiple robots must completely explore the environment. In this paper, we introduce Backtrack Assisted Multi‑Agent Exploration using Reinforcement Learning (BAMAX), a method for collaborative exploration in multi‑agent systems which attempts to explore an entire virtual environment. As in the name, BAMAX leverages backtrack assistance to enhance the performance of agents in exploration tasks. To evaluate BAMAX against traditional approaches, we present the results of experiments conducted across multiple hexagonal shaped grids sizes, ranging from 10x10 to 60x60. The results demonstrate that BAMAX outperforms other methods in terms of faster coverage and less backtracking across these environments.

PaperID: 12396, https://arxiv.org/pdf/2411.08355.pdf

Authors: Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman

Title: Optimal Decentralized Smoothed Online Convex Optimization

Abstract:
We study the multi‑agent Smoothed Online Convex Optimization (SOCO) problem, where N agents interact through a communication graph. In each round, each agent i receives a strongly convex hitting cost function f^i_t in an online fashion and selects an action x^i_t \in \mathbbR^d. The objective is to minimize the global cumulative cost, which includes the sum of individual hitting costs f^i_t(x^i_t), a temporal "switching cost" for changing decisions, and a spatial "dissimilarity cost" that penalizes deviations in decisions among neighboring agents. We propose the first truly decentralized algorithm ACORD for multi‑agent SOCO that provably exhibits asymptotic optimality. Our approach allows each agent to operate using only local information from its immediate neighbors in the graph. For finite‑time performance, we establish that the optimality gap in the competitive ratio decreases with time horizon T and can be conveniently tuned based on the per‑round computation available to each agent. Our algorithm benefits from a provably scalable computational complexity that depends only logarithmically on the number of agents and almost linearly on their degree within the graph. Moreover, our results hold even when the communication graph changes arbitrarily and adaptively over time. Finally, ACORD, by virtue of its asymptotic‑optimality, is shown to be provably superior to the state‑of‑the‑art LPC algorithm, while exhibiting much lower computational complexity. Extensive numerical experiments across various network topologies further corroborate our theoretical claims.

PaperID: 12397, https://arxiv.org/pdf/2411.08299.pdf

Authors: Xin Tang, Qian Chen, Wenjie Weng, Binhan Liao, Jiacheng Wang, Xianbin Cao, Xiaohuan Li

Title: DNN Task Assignment in UAV Networks: A Generative AI Enhanced Multi-Agent Reinforcement Learning Approach

Abstract:
Unmanned Aerial Vehicles (UAVs) possess high mobility and flexible deployment capabilities, prompting the development of UAVs for various application scenarios within the Internet of Things (IoT). The unique capabilities of UAVs give rise to increasingly critical and complex tasks in uncertain and potentially harsh environments. The substantial amount of data generated from these applications necessitates processing and analysis through deep neural networks (DNNs). However, UAVs encounter challenges due to their limited computing resources when managing DNN models. This paper presents a joint approach that combines multiple‑agent reinforcement learning (MARL) and generative diffusion models (GDM) for assigning DNN tasks to a UAV swarm, aimed at reducing latency from task capture to result output. To address these challenges, we first consider the task size of the target area to be inspected and the shortest flying path as optimization constraints, employing a greedy algorithm to resolve the subproblem with a focus on minimizing the UAV's flying path and the overall system cost. In the second stage, we introduce a novel DNN task assignment algorithm, termed GDM‑MADDPG, which utilizes the reverse denoising process of GDM to replace the actor network in multi‑agent deep deterministic policy gradient (MADDPG). This approach generates specific DNN task assignment actions based on agents' observations in a dynamic environment. Simulation results indicate that our algorithm performs favorably compared to benchmarks in terms of path planning, Age of Information (AoI), energy consumption, and task load balancing.

PaperID: 12398, https://arxiv.org/pdf/2411.08294.pdf

Authors: Xinjie Zhao, Shyaman Maduranga Sriwarnasinghe, Jiacheng Tang, Shiyun Wang, Hao Wang, So Morikawa

Title: Collaborative Participatory Research with LLM Agents in South Asia: An Empirically-Grounded Methodological Initiative and Agenda from Field Evidence in Sri Lanka

Abstract:
The integration of artificial intelligence into development research methodologies presents unprecedented opportunities for addressing persistent challenges in participatory research, particularly in linguistically diverse regions like South Asia. Drawing from an empirical implementation in Sri Lanka's Sinhala‑speaking communities, this paper presents an empirically grounded methodological framework designed to transform participatory development research, situated in the challenging multilingual context of Sri Lanka's flood‑prone Nilwala River Basin. Moving beyond conventional translation and data collection tools, this framework deploys a multi‑agent system architecture that redefines how data collection, analysis, and community engagement are conducted in linguistically and culturally diverse research settings. This structured agent‑based approach enables participatory research that is both scalable and responsive, ensuring that community perspectives remain integral to research outcomes. Field experiences reveal the immense potential of LLM‑based systems in addressing long‑standing issues in development research across resource‑limited regions, offering both quantitative efficiencies and qualitative improvements in inclusivity. At a broader methodological level, this research agenda advocates for AI‑driven participatory research tools that maintain ethical considerations, cultural respect, and operational efficiency, highlighting strategic pathways for deploying AI systems that reinforce community agency and equitable knowledge generation, potentially informing broader research agendas across the Global South.

PaperID: 12399, https://arxiv.org/pdf/2411.08190.pdf

Authors: Karolina Schmidt, Luis Rodrigues

Title: Collision-Free Multi-Agent Coverage Control for Non-Cooperating Swarms: Preliminary Results

Abstract:
The main contribution of this paper is a methodology for multiple non‑cooperating swarms of unmanned aerial vehicles to independently cover a common area. In contrast to previous research on coverage control involving more than one swarm, this paper does not assume cooperation between distinct groups but considers them as entirely independent units following their own objectives. Using Voronoi tesselation, collision‑free motion of agents within the same swarm has been proved before. However, as is shown in Example 1 of this paper, in the case of multiple swarms with inter‑swarm but without intra‑swarm collaboration, these guarantees do not hold. We address this issue by proposing an algorithm to achieve maximum coverage with multiple swarms while avoiding collisions between agents. Thus, the Optimal Reciprocal Collision Avoidance method used for safe navigation in multi‑agent scenarios is adapted to suit the needs of Voronoi‑based coverage control with more than one swarm. The functionality of the proposed technique is validated through Monte Carlo simulations.

PaperID: 12400, https://arxiv.org/pdf/2411.08167.pdf

Authors: Fatemeh Ghaffari, Xuchuang Wang, Jinhang Zuo, Mohammad Hajiesmaili

Title: Multi-Agent Stochastic Bandits Robust to Adversarial Corruptions

Abstract:
We study the problem of multi‑agent multi‑armed bandits with adversarial corruption in a heterogeneous setting, where each agent accesses a subset of arms. The adversary can corrupt the reward observations for all agents. Agents share these corrupted rewards with each other, and the objective is to maximize the cumulative total reward of all agents (and not be misled by the adversary). We propose a multi‑agent cooperative learning algorithm that is robust to adversarial corruptions. For this newly devised algorithm, we demonstrate that an adversary with an unknown corruption budget C only incurs an additive O((L / L_\min) C) term to the standard regret of the model in non‑corruption settings, where L is the total number of agents, and L_\min is the minimum number of agents with mutual access to an arm. As a side‑product, our algorithm also improves the state‑of‑the‑art regret bounds when reducing to both the single‑agent and homogeneous multi‑agent scenarios, tightening multiplicative K (the number of arms) and L (the number of agents) factors, respectively.

PaperID: 12401, https://arxiv.org/pdf/2411.08148.pdf

Authors: Dinesh Srivasthav P, Badri Narayan Subudhi

Title: Adaptive Meta-Learning for Robust Deepfake Detection: A Multi-Agent Framework to Data Drift and Model Generalization

Abstract:
Pioneering advancements in artificial intelligence, especially in genAI, have enabled significant possibilities for content creation, but also led to widespread misinformation and false content. The growing sophistication and realism of deepfakes is raising concerns about privacy invasion, identity theft, and has societal, business impacts, including reputational damage and financial loss. Many deepfake detectors have been developed to tackle this problem. Nevertheless, as for every AI model, the deepfake detectors face the wrath of lack of considerable generalization to unseen scenarios and cross‑domain deepfakes. Besides, adversarial robustness is another critical challenge, as detectors drastically underperform to the slightest imperceptible change. Most state‑of‑the‑art detectors are trained on static datasets and lack the ability to adapt to emerging deepfake attack trends. These three crucial challenges though hold paramount importance for reliability in practise, particularly in the deepfake domain, are also the problems with any other AI application. This paper proposes an adversarial meta‑learning algorithm using task‑specific adaptive sample synthesis and consistency regularization, in a refinement phase. By focussing on the classifier's strengths and weaknesses, it boosts both robustness and generalization of the model. Additionally, the paper introduces a hierarchical multi‑agent retrieval‑augmented generation workflow with a sample synthesis module to dynamically adapt the model to new data trends by generating custom deepfake samples. The paper further presents a framework integrating the meta‑learning algorithm with the hierarchical multi‑agent workflow, offering a holistic solution for enhancing generalization, robustness, and adaptability. Experimental results demonstrate the model's consistent performance across various datasets, outperforming the models in comparison.

PaperID: 12402, https://arxiv.org/pdf/2411.08063.pdf

Authors: Ziqi Ni, Yahao Li, Kaijia Hu, Kunyuan Han, Ming Xu, Xingyu Chen, Fengqi Liu, Yicong Ye, Shuxin Bai

Title: MatPilot: an LLM-enabled AI Materials Scientist under the Framework of Human-Machine Collaboration

Abstract:
The rapid evolution of artificial intelligence, particularly large language models, presents unprecedented opportunities for materials science research. We proposed and developed an AI materials scientist named MatPilot, which has shown encouraging abilities in the discovery of new materials. The core strength of MatPilot is its natural language interactive human‑machine collaboration, which augments the research capabilities of human scientist teams through a multi‑agent system. MatPilot integrates unique cognitive abilities, extensive accumulated experience, and ongoing curiosity of human‑beings with the AI agents' capabilities of advanced abstraction, complex knowledge storage and high‑dimensional information processing. It could generate scientific hypotheses and experimental schemes, and employ predictive models and optimization algorithms to drive an automated experimental platform for experiments. It turns out that our system demonstrates capabilities for efficient validation, continuous learning, and iterative optimization.

PaperID: 12403, https://arxiv.org/pdf/2411.08026.pdf

Authors: Krishna Dasaratha, Benjamin Golub, Anant Shah

Title: Incentive Design with Spillovers

Abstract:
A principal uses payments conditioned on stochastic outcomes of a team project to elicit costly effort from the team members. We develop a multi‑agent generalization of a classic first‑order approach to contract optimization by leveraging methods from network games. The main results characterize the optimal allocation of incentive pay across agents and outcomes. Incentive optimality requires equalizing, across agents, a product of (i) individual productivity (ii) organizational centrality and (iii) responsiveness to monetary incentives. We specialize the model to explore several applied questions, including whether compensation should reward individual ability or collaborativeness and how the strength of complementarities shapes pay dispersion.

PaperID: 12404, https://arxiv.org/pdf/2411.07690.pdf

Authors: Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, Peng Wang

Title: World Models: The Safety Perspective

Abstract:
With the proliferation of the Large Language Model (LLM), the concept of World Models (WM) has recently attracted a great deal of attention in the AI research community, especially in the context of AI agents. It is arguably evolving into an essential foundation for building AI agent systems. A WM is intended to help the agent predict the future evolution of environmental states or help the agent fill in missing information so that it can plan its actions and behave safely. The safety property of WM plays a key role in their effective use in critical applications. In this work, we review and analyze the impacts of the current state‑of‑the‑art in WM technology from the point of view of trustworthiness and safety based on a comprehensive survey and the fields of application envisaged. We provide an in‑depth analysis of state‑of‑the‑art WMs and derive technical research challenges and their impact in order to call on the research community to collaborate on improving the safety and trustworthiness of WM.

PaperID: 12405, https://arxiv.org/pdf/2411.07656.pdf

Authors: Tianyi Huang, Arya Somasundaram

Title: Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach

Abstract:
Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi‑agent framework includes specialized agents for both bias detection and correction. Experimental evaluations using the Tango dataset‑a benchmark focused on gender pronoun usage‑demonstrate that our approach significantly improves inclusive pronoun classification, achieving a 32.6 percentage point increase over GPT‑4o in correctly disagreeing with inappropriate traditionally gendered pronouns (χ^2 = 38.57, p < 0.0001). These results accentuate the potential of agent‑driven frameworks in enhancing fairness and inclusivity in AI‑generated content, demonstrating their efficacy in reducing biases and promoting socially responsible AI.

PaperID: 12406, https://arxiv.org/pdf/2411.07634.pdf

Authors: Maria Zampella, Urtzi Otamendi, Xabier Belaunzaran, Arkaitz Artetxe, Igor G. Olaizola, Giuseppe Longo, Basilio Sierra

Title: Exploring Multi-Agent Reinforcement Learning for Unrelated Parallel Machine Scheduling

Abstract:
Scheduling problems pose significant challenges in resource, industry, and operational management. This paper addresses the Unrelated Parallel Machine Scheduling Problem (UPMS) with setup times and resources using a Multi‑Agent Reinforcement Learning (MARL) approach. The study introduces the Reinforcement Learning environment and conducts empirical analyses, comparing MARL with Single‑Agent algorithms. The experiments employ various deep neural network policies for single‑ and Multi‑Agent approaches. Results demonstrate the efficacy of the Maskable extension of the Proximal Policy Optimization (PPO) algorithm in Single‑Agent scenarios and the Multi‑Agent PPO algorithm in Multi‑Agent setups. While Single‑Agent algorithms perform adequately in reduced scenarios, Multi‑Agent approaches reveal challenges in cooperative learning but a scalable capacity. This research contributes insights into applying MARL techniques to scheduling optimization, emphasizing the need for algorithmic sophistication balanced with scalability for intelligent scheduling solutions.

PaperID: 12407, https://arxiv.org/pdf/2411.07612.pdf

Authors: Mingyi Wang, Hongqun Zou, Yifan Liu, You Wang, Guang Li

Title: A Joint Prediction Method of Multi-Agent to Reduce Collision Rate

Abstract:
Predicting future motions of road participants is an important task for driving autonomously. Most existing models excel at predicting the marginal trajectory of a single agent, but predicting joint trajectories for multiple agents that are consistent within a scene remains a challenge. Previous research has often focused on marginal predictions, but the importance of joint predictions has become increasingly apparent. Joint prediction aims to generate trajectories that are consistent across the entire scene. Our research builds upon the SIMPL baseline to explore methods for generating scene‑consistent trajectories. We tested our algorithm on the Argoverse 2 dataset, and experimental results demonstrate that our approach can generate scene‑consistent trajectories. Compared to the SIMPL baseline, our method significantly reduces the collision rate of joint trajectories within the scene.

PaperID: 12408, https://arxiv.org/pdf/2411.07464.pdf

Authors: Shubham Gandhi, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff

Title: BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks

Abstract:
Large Language Models (LLMs) excel in diverse applications including generation of code snippets, but often struggle with generating code for complex Machine Learning (ML) tasks. Although existing LLM single‑agent based systems give varying performance depending on the task complexity, they purely rely on larger and expensive models such as GPT‑4. Our investigation reveals that no‑cost and low‑cost models such as Gemini‑Pro, Mixtral and CodeLlama perform far worse than GPT‑4 in a single‑agent setting. With the motivation of developing a cost‑efficient LLM based solution for solving ML tasks, we propose an LLM Multi‑Agent based system which leverages combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask‑the‑expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system, using no‑cost models, namely Gemini as the base LLM, paired with GPT‑4 in cascade and expert to serve occasional ask‑the‑expert calls for planning. With 94.2% reduction in the cost (from \0.931 per run cost averaged over all tasks for GPT‑4 single agent system to \0.054), our system is able to yield better average success rate of 32.95% as compared to GPT‑4 single‑agent system yielding 22.72% success rate averaged over all the tasks of MLAgentBench.

PaperID: 12409, https://arxiv.org/pdf/2411.07407.pdf

Authors: Shuchen Guo, Ehsan Latif, Yifan Zhou, Xuan Huang, Xiaoming Zhai

Title: Using Generative AI and Multi-Agents to Provide Automatic Feedback

Abstract:
This study investigates the use of generative AI and multi‑agent systems to provide automatic feedback in educational contexts, particularly for student constructed responses in science assessments. The research addresses a key gap in the field by exploring how multi‑agent systems, called AutoFeedback, can improve the quality of GenAI‑generated feedback, overcoming known issues such as over‑praise and over‑inference that are common in single‑agent large language models (LLMs). The study developed a multi‑agent system consisting of two AI agents: one for generating feedback and another for validating and refining it. The system was tested on a dataset of 240 student responses, and its performance was compared to that of a single‑agent LLM. Results showed that AutoFeedback significantly reduced the occurrence of over‑praise and over‑inference errors, providing more accurate and pedagogically sound feedback. The findings suggest that multi‑agent systems can offer a more reliable solution for generating automated feedback in educational settings, highlighting their potential for scalable and personalized learning support. These results have important implications for educators and researchers seeking to leverage AI in formative assessments, offering a pathway to more effective feedback mechanisms that enhance student learning outcomes.

PaperID: 12410, https://arxiv.org/pdf/2411.07362.pdf

Authors: Jaime Ruiz-Serra, Patrick Sweeney, Michael S. Harré

Title: Factorised Active Inference for Strategic Multi-Agent Interactions

Abstract:
Understanding how individual agents make strategic decisions within collectives is important for advancing fields as diverse as economics, neuroscience, and multi‑agent systems. Two complementary approaches can be integrated to this end. The Active Inference framework (AIF) describes how agents employ a generative model to adapt their beliefs about and behaviour within their environment. Game theory formalises strategic interactions between agents with potentially competing objectives. To bridge the gap between the two, we propose a factorisation of the generative model whereby each agent maintains explicit, individual‑level beliefs about the internal states of other agents, and uses them for strategic planning in a joint context. We apply our model to iterated general‑sum games with two and three players, and study the ensemble effects of game transitions, where the agents' preferences (game payoffs) change over time. This non‑stationarity, beyond that caused by reciprocal adaptation, reflects a more naturalistic environment in which agents need to adapt to changing social contexts. Finally, we present a dynamical analysis of key AIF quantities: the variational free energy (VFE) and the expected free energy (EFE) from numerical simulation data. The ensemble‑level EFE allows us to characterise the basins of attraction of games with multiple Nash Equilibria under different conditions, and we find that it is not necessarily minimised at the aggregate level. By integrating AIF and game theory, we can gain deeper insights into how intelligent collectives emerge, learn, and optimise their actions in dynamic environments, both cooperative and non‑cooperative.

PaperID: 12411, https://arxiv.org/pdf/2411.07161.pdf

Authors: Young-Min Cho, Raphael Shu, Nilaksh Das, Tamer Alkhouli, Yi-An Lai, Jason Cai, Monica Sunkara, Yi Zhang, Dan Roth

Title: RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration

Abstract:
Effective group decision‑making is critical in Multi‑Agent Systems (MAS). Yet, how different mechanisms for reaching consensus impact collaboration quality and efficiency remains understudied. We conduct a systematic study on group decision‑making mechanisms in a decentralized setting. Through controlled experiments, we analyze how different voting rules affect decision quality and efficiency in a multi‑round collaboration. Results reveal that majority voting often cause inefficient collaboration due to its strict acceptance criteria. At the extreme, unanimous voting gives 87% lower initial performance than the best‑performing method. Our qualitative analysis of cross‑agent communication shows that messages become longer and more repetitive over time: while message length increases by 84%, similarity to the previous round increases to 90%. Based on these insights, language‑based early stopping methods make the performance 13% closer to oracle while reducing rounds by 50%. Our findings highlight the crucial role of group decision‑making in optimizing MAS collaboration.

PaperID: 12412, https://arxiv.org/pdf/2411.07152.pdf

Authors: Lingbo Mo, Shun Jiang, Akash Maharaj, Bernard Hishamunda, Yunyao Li

Title: HierTOD: A Task-Oriented Dialogue System Driven by Hierarchical Goals

Abstract:
Task‑Oriented Dialogue (TOD) systems assist users in completing tasks through natural language interactions, often relying on a single‑layered workflow structure for slot‑filling in public tasks, such as hotel bookings. However, in enterprise environments, which involve rich domain‑specific knowledge, TOD systems face challenges due to task complexity and the lack of standardized documentation. In this work, we introduce HierTOD, an enterprise TOD system driven by hierarchical goals that can support composite workflows. By focusing on goal‑driven interactions, our system serves a more proactive role, facilitating mixed‑initiative dialogue and improving task completion. Equipped with components for natural language understanding, composite goal retriever, dialogue management, and response generation, backed by a well‑organized data service with domain knowledge base and retrieval engine, HierTOD delivers efficient task assistance as judged by human evaluators. Furthermore, our system implementation unifies two TOD paradigms: slot‑filling for information collection and step‑by‑step guidance for task execution. Our user study demonstrates the effectiveness and helpfulness of HierTOD in performing both paradigms.

PaperID: 12413, https://arxiv.org/pdf/2411.07104.pdf

Authors: Yuming Feng, Chuye Hong, Yaru Niu, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding Zhao

Title: Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing

Abstract:
Recently, quadrupedal locomotion has achieved significant success, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real‑world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle‑aware, long‑horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi‑agent reinforcement learning framework with three levels of control. The high‑level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid‑level controller uses a decentralized goal‑conditioned policy to guide the robots toward these sub‑goals. A pre‑trained low‑level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements over baseline approaches, with 36.0% higher success rates and 24.5% reduction in completion time than the best baseline. Our framework successfully enables long‑horizon, obstacle‑aware manipulation tasks like Push‑Cuboid and Push‑T on Go1 robots in the real world.

PaperID: 12414, https://arxiv.org/pdf/2411.07098.pdf

Authors: Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso

Title: A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

Abstract:
As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API documentation languages, such as the OpenAPI Specification, has led to the emergence of many black‑box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in fault detection. To address these limitations, we present AutoRestTest, the first black‑box tool to adopt a dependency‑embedded multi‑agent approach for REST API testing that integrates multi‑agent reinforcement learning (MARL) with a semantic property dependency graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents ‑‑ API, dependency, parameter, and value agents ‑‑ collaborate to optimize API exploration. LLMs handle domain‑specific value generation, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Our evaluation of AutoRestTest on 12 real‑world REST services shows that it outperforms the four leading black‑box REST API testing tools, including those assisted by RESTGPT (which generates realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to trigger an internal server error in the Spotify service. Our ablation study illustrates that each component of AutoRestTest ‑‑ the SPDG, the LLM, and the agent‑learning mechanism ‑‑ contributes to its overall effectiveness.

PaperID: 12415, https://arxiv.org/pdf/2411.07039.pdf

Authors: Minah Lee, Uday Kamal, Saibal Mukhopadhyay

Title: Learning Collective Dynamics of Multi-Agent Systems using Event-based Vision

Abstract:
This paper proposes a novel problem: vision‑based perception to learn and predict the collective dynamics of multi‑agent systems, specifically focusing on interaction strength and convergence time. Multi‑agent systems are defined as collections of more than ten interacting agents that exhibit complex group behaviors. Unlike prior studies that assume knowledge of agent positions, we focus on deep learning models to directly predict collective dynamics from visual data, captured as frames or events. Due to the lack of relevant datasets, we create a simulated dataset using a state‑of‑the‑art flocking simulator, coupled with a vision‑to‑event conversion framework. We empirically demonstrate the effectiveness of event‑based representation over traditional frame‑based methods in predicting these collective behaviors. Based on our analysis, we present event‑based vision for Multi‑Agent dynamic Prediction (evMAP), a deep learning architecture designed for real‑time, accurate understanding of interaction strength and collective behavior emergence in multi‑agent systems.

PaperID: 12416, https://arxiv.org/pdf/2411.06920.pdf

Authors: Siyuan Li, Zhe Ma, Feifan Liu, Jiani Lu, Qinqin Xiao, Kewu Sun, Lingfei Cui, Xirui Yang, Peng Liu, Xun Wang

Title: Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning

Abstract:
Robot task planning is an important problem for autonomous robots in long‑horizon challenging tasks. As large pre‑trained models have demonstrated superior planning ability, recent research investigates utilizing large models to achieve autonomous planning for robots in diverse tasks. However, since the large models are pre‑trained with Internet data and lack the knowledge of real task scenes, large models as planners may make unsafe decisions that hurt the robots and the surrounding environments. To solve this challenge, we propose a novel Safe Planner framework, which empowers safety awareness in large pre‑trained models to accomplish safe and executable planning. In this framework, we develop a safety prediction module to guide the high‑level large model planner, and this safety module trained in a simulator can be effectively transferred to real‑world tasks. The proposed Safe Planner framework is evaluated on both simulated environments and real robots. The experiment results demonstrate that Safe Planner not only achieves state‑of‑the‑art task success rates, but also substantially improves safety during task execution. The experiment videos are shown in https://sites.google.com/view/safeplanner .

PaperID: 12417, https://arxiv.org/pdf/2411.06601.pdf

Authors: Rohit Bokade, Xiaoning Jin

Title: OffLight: An Offline Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

Abstract:
Efficient traffic control (TSC) is essential for urban mobility, but traditional systems struggle to handle the complexity of real‑world traffic. Multi‑agent Reinforcement Learning (MARL) offers adaptive solutions, but online MARL requires extensive interactions with the environment, making it costly and impractical. Offline MARL mitigates these challenges by using historical traffic data for training but faces significant difficulties with heterogeneous behavior policies in real‑world datasets, where mixed‑quality data complicates learning. We introduce OffLight, a novel offline MARL framework designed to handle heterogeneous behavior policies in TSC datasets. To improve learning efficiency, OffLight incorporates Importance Sampling (IS) to correct for distributional shifts and Return‑Based Prioritized Sampling (RBPS) to focus on high‑quality experiences. OffLight utilizes a Gaussian Mixture Variational Graph Autoencoder (GMM‑VGAE) to capture the diverse distribution of behavior policies from local observations. Extensive experiments across real‑world urban traffic scenarios show that OffLight outperforms existing offline RL methods, achieving up to a 7.8% reduction in average travel time and 11.2% decrease in queue length. Ablation studies confirm the effectiveness of OffLight's components in handling heterogeneous data and improving policy performance. These results highlight OffLight's scalability and potential to improve urban traffic management without the risks of online learning.

PaperID: 12418, https://arxiv.org/pdf/2411.06490.pdf

Authors: Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo

Title: Hermes: A Large Language Model Framework on the Journey to Autonomous Networks

Abstract:
The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for modeling network behaviors and defining policies to meet target requirements. Network Digital Twins (NDTs) have shown promise in enhancing network intelligence, but the successful implementation of this technology is constrained by use case‑specific architectures, limiting its role in advancing network autonomy. A more capable network intelligence, or "telecommunications brain", is needed to enable seamless, autonomous management of cellular network. Large Language Models (LLMs) have emerged as potential enablers for this vision but face challenges in network modeling, especially in reasoning and handling diverse data types. To address these gaps, we introduce Hermes, a chain of LLM agents that uses "blueprints" for constructing NDT instances through structured and explainable logical steps. Hermes allows automatic, reliable, and accurate network modeling of diverse use cases and configurations, thus marking progress toward fully autonomous network operations.

PaperID: 12419, https://arxiv.org/pdf/2411.06408.pdf

Authors: Osher Azulay, Dhruv Metha Ramesh, Nimrod Curtis, Avishai Sintov

Title: Visuotactile-Based Learning for Insertion with Compliant Hands

Abstract:
Compared to rigid hands, underactuated compliant hands offer greater adaptability to object shapes, provide stable grasps, and are often more cost‑effective. However, they introduce uncertainties in hand‑object interactions due to their inherent compliance and lack of precise finger proprioception as in rigid hands. These limitations become particularly significant when performing contact‑rich tasks like insertion. To address these challenges, additional sensing modalities are required to enable robust insertion capabilities. This letter explores the essential sensing requirements for successful insertion tasks with compliant hands, focusing on the role of visuotactile perception (i.e., visual and tactile perception). We propose a simulation‑based multimodal policy learning framework that leverages all‑around tactile sensing and an extrinsic depth camera. A transformer‑based policy, trained through a teacher‑student distillation process, is successfully transferred to a real‑world robotic system without further training. Our results emphasize the crucial role of tactile sensing in conjunction with visual perception for accurate object‑socket pose estimation, successful sim‑to‑real transfer and robust task execution.

PaperID: 12420, https://arxiv.org/pdf/2411.06398.pdf

Authors: Katharine Beaumont, Rem Collier

Title: Do you want to play a game? Learning to play Tic-Tac-Toe in Hypermedia Environments

Abstract:
We demonstrate the integration of Transfer Learning into a hypermedia Multi‑Agent System using the Multi‑Agent MicroServices (MAMS) architectural style. Agents use RDF knowledge stores to reason over information and apply Reinforcement Learning techniques to learn how to interact with a Tic‑Tac‑Toe API. Agents form advisor‑advisee relationships in order to speed up individual learning and exploit and learn from data on the Web.

PaperID: 12421, https://arxiv.org/pdf/2411.06389.pdf

Authors: Yadh Hafsi, Edoardo Vittori

Title: Optimal Execution with Reinforcement Learning

Abstract:
This study investigates the development of an optimal execution strategy through reinforcement learning, aiming to determine the most effective approach for traders to buy and sell inventory within a finite time horizon. Our proposed model leverages input features derived from the current state of the limit order book and operates at a high frequency to maximize control. To simulate this environment and overcome the limitations associated with relying on historical data, we utilize the multi‑agent market simulator ABIDES, which provides a diverse range of depth levels within the limit order book. We present a custom MDP formulation followed by the results of our methodology and benchmark the performance against standard execution strategies. Results show that the reinforcement learning agent outperforms standard strategies and offers a practical foundation for real‑world trading applications.

PaperID: 12422, https://arxiv.org/pdf/2411.06087.pdf

Authors: Jia Quan Loh, Xuewen Luo, Fan Ding, Hwa Hui Tew, Junn Yong Loo, Ze Yang Ding, Susilawati Susilawati, Chee Pin Tan

Title: Cross-Domain Transfer Learning using Attention Latent Features for Multi-Agent Trajectory Prediction

Abstract:
With the advancements of sensor hardware, traffic infrastructure and deep learning architectures, trajectory prediction of vehicles has established a solid foundation in intelligent transportation systems. However, existing solutions are often tailored to specific traffic networks at particular time periods. Consequently, deep learning models trained on one network may struggle to generalize effectively to unseen networks. To address this, we proposed a novel spatial‑temporal trajectory prediction framework that performs cross‑domain adaption on the attention representation of a Transformer‑based model. A graph convolutional network is also integrated to construct dynamic graph feature embeddings that accurately model the complex spatial‑temporal interactions between the multi‑agent vehicles across multiple traffic domains. The proposed framework is validated on two case studies involving the cross‑city and cross‑period settings. Experimental results show that our proposed framework achieves superior trajectory prediction and domain adaptation performances over the state‑of‑the‑art models.

PaperID: 12423, https://arxiv.org/pdf/2411.05933.pdf

Authors: Feng-Yu Yue, Daniel Zelazo

Title: A Passivity Analysis for Nonlinear Consensus on Balanced Digraphs

Abstract:
This work deals with the output consensus problem for multiagent systems over balanced digraphs by passivity analysis. As the standard diffusive coupling structure only models the undirected interconnection, we propose a general approach capable of processing directed coupling and performing passivity analysis. To mitigate the complexity arising from the nonlinearity and directed interconnections, we reformulate the output consensus problem as a convergence analysis on a submanifold. We provide passivity analysis and establish a sufficient condition based on passivity for achieving output agreement in multi‑agent systems over balanced digraphs. The results are supported by a numerical example.

PaperID: 12424, https://arxiv.org/pdf/2411.05828.pdf

Authors: Diego Gosmar, Deborah A. Dahl, Emmett Coin, David Attwater

Title: AI Multi-Agent Interoperability Extension for Managing Multiparty Conversations

Abstract:
This paper presents a novel extension to the existing Multi‑Agent Interoperability specifications of the Open Voice Interoperability Initiative (originally also known as OVON from the Open Voice Network). This extension enables AI agents developed with different technologies to communicate using a universal, natural language‑based API or NLP‑based standard APIs. Focusing on the management of multiparty AI conversations, this work introduces new concepts such as the Convener Agent, Floor‑Shared Conversational Space, Floor Manager, Multi‑Conversant Support, and mechanisms for handling Interruptions and Uninvited Agents. Additionally, it explores the Convener's role as a message relay and controller of participant interactions, enhancing both scalability and security. These advancements are crucial for ensuring smooth, efficient, and secure interactions in scenarios where multiple AI agents need to collaborate, debate, or contribute to a discussion. The paper elaborates on these concepts and provides practical examples, illustrating their implementation within the conversation envelope structure.

PaperID: 12425, https://arxiv.org/pdf/2411.05683.pdf

Authors: Indranil Sur, Aswin Raghavan, Abrar Rahman, James Z Hare, Daniel Cassenti, Carl Busart

Title: Data-Driven Distributed Common Operational Picture from Heterogeneous Platforms using Multi-Agent Reinforcement Learning

Abstract:
The integration of unmanned platforms equipped with advanced sensors promises to enhance situational awareness and mitigate the "fog of war" in military operations. However, managing the vast influx of data from these platforms poses a significant challenge for Command and Control (C2) systems. This study presents a novel multi‑agent learning framework to address this challenge. Our method enables autonomous and secure communication between agents and humans, which in turn enables real‑time formation of an interpretable Common Operational Picture (COP). Each agent encodes its perceptions and actions into compact vectors, which are then transmitted, received and decoded to form a COP encompassing the current state of all agents (friendly and enemy) on the battlefield. Using Deep Reinforcement Learning (DRL), we jointly train COP models and agent's action selection policies. We demonstrate resilience to degraded conditions such as denied GPS and disrupted communications. Experimental validation is performed in the Starcraft‑2 simulation environment to evaluate the precision of the COPs and robustness of policies. We report less than 5% error in COPs and policies resilient to various adversarial conditions. In summary, our contributions include a method for autonomous COP formation, increased resilience through distributed prediction, and joint training of COP models and multi‑agent RL policies. This research advances adaptive and resilient C2, facilitating effective control of heterogeneous unmanned platforms.

PaperID: 12426, https://arxiv.org/pdf/2411.05651.pdf

Authors: Yuheng Zhao, Junjie Wang, Linbin Xiang, Xiaowen Zhang, Zifei Guo, Cagatay Turkay, Yu Zhang, Siming Chen

Title: LightVA: Lightweight Visual Analytics with LLM Agent-Based Task Planning and Execution

Abstract:
Visual analytics (VA) requires analysts to iteratively propose analysis tasks based on observations and execute tasks by creating visualizations and interactive exploration to gain insights. This process demands skills in programming, data processing, and visualization tools, highlighting the need for a more intelligent, streamlined VA approach. Large language models (LLMs) have recently been developed as agents to handle various tasks with dynamic planning and tool‑using capabilities, offering the potential to enhance the efficiency and versatility of VA. We propose LightVA, a lightweight VA framework that supports task decomposition, data analysis, and interactive exploration through human‑agent collaboration. Our method is designed to help users progressively translate high‑level analytical goals into low‑level tasks, producing visualizations and deriving insights. Specifically, we introduce an LLM agent‑based task planning and execution strategy, employing a recursive process involving a planner, executor, and controller. The planner is responsible for recommending and decomposing tasks, the executor handles task execution, including data analysis, visualization generation and multi‑view composition, and the controller coordinates the interaction between the planner and executor. Building on the framework, we develop a system with a hybrid user interface that includes a task flow diagram for monitoring and managing the task planning process, a visualization panel for interactive data exploration, and a chat view for guiding the model through natural language instructions. We examine the effectiveness of our method through a usage scenario and an expert study.

PaperID: 12427, https://arxiv.org/pdf/2411.05599.pdf

Authors: Marta Kwiatkowska, Gethin Norman, David Parker, Gabriel Santos

Title: Expectation vs. Reality: Towards Verification of Psychological Games

Abstract:
Game theory provides an effective way to model strategic interactions among rational agents. In the context of formal verification, these ideas can be used to produce guarantees on the correctness of multi‑agent systems, with a diverse range of applications from computer security to autonomous driving. Psychological games (PGs) were developed as a way to model and analyse agents with belief‑dependent motivations, opening up the possibility to model how human emotions can influence behaviour. In PGs, players' utilities depend not only on what actually happens (which strategies players choose to adopt), but also on what the players had expected to happen (their belief as to the strategies that would be played). Despite receiving much attention in fields such as economics and psychology, very little consideration has been given to their applicability to problems in computer science, nor to practical algorithms and tool support. In this paper, we start to bridge that gap, proposing methods to solve PGs and implementing them within PRISM‑games, a formal verification tool for stochastic games. We discuss how to model these games, highlight specific challenges for their analysis and illustrate the usefulness of our approach on several case studies, including human behaviour in traffic scenarios.

PaperID: 12428, https://arxiv.org/pdf/2411.05574.pdf

Authors: Ryoto Ando, Kei Kimura, Taiki Todo, Makoto Yokoo

Title: Parameterized Voter Relevance in Facility Location Games with Tree-Shaped Invitation Graphs

Abstract:
Diffusion mechanism design, which investigate how to incentivise agents to invite as many colleagues to a multi‑agent decision making as possible, is a new research paradigm at the intersection between microeconomics and computer science. In this paper we extend traditional facility location games into the model of diffusion mechanism design. Our objective is to completely understand to what extent of anonymity/voter‑relevance we can achieve, along with strategy‑proofness and Pareto efficiency when voters strategically invite collegues. We define a series of anonymity properties applicable to the diffusion mechanism design model, as well as parameterized voter‑relevance properties for guaranteeing reasonably‑fair decision making. We obtained two impossibility theorems and two existence theorems, which partially answer the question we have raised in the beginning of the paper

PaperID: 12429, https://arxiv.org/pdf/2411.05554.pdf

Authors: Jianqiang Ding, Shankar A. Deka

Title: Time-to-reach Bounds for Verification of Dynamical Systems Using the Koopman Spectrum

Abstract:
In this work, we present a novel Koopman spectrum‑based reachability verification method for nonlinear systems. Contrary to conventional methods that focus on characterizing all potential states of a dynamical system over a presupposed time span, our approach seeks to verify the reachability by assessing the non‑emptiness of estimated time‑to‑reach intervals without engaging in the explicit computation of reachable set. Based on the spectral analysis of the Koopman operator, we reformulate the problem of verifying existence of reachable trajectories into the problem of determining feasible time‑to‑reach bounds required for system reachability. By solving linear programming (LP) problems, our algorithm can effectively estimate all potential time intervals during which a dynamical system can enter (and exit) target sets from given initial sets over an unbounded time horizon. Finally, we demonstrate our method in challenging settings, such as verifying the reachability between non‑convex or even disconnected sets, as well as backward reachability and multiple entries into target sets. Additionally, we validate its applicability in addressing real‑world challenges and scalability to high‑dimensional systems through case studies in verifying the reachability of the cart‑pole and multi‑agent consensus systems.

PaperID: 12430, https://arxiv.org/pdf/2411.05474.pdf

Authors: Émiland Garrabé, Pierre Teixeira, Mahdi Khoramshahi, Stéphane Doncieux

Title: Enhancing Robustness in Language-Driven Robotics: A Modular Approach to Failure Reduction

Abstract:
Recent advances in large language models (LLMs) have led to significant progress in robotics, enabling embodied agents to better understand and execute open‑ended tasks. However, existing approaches using LLMs face limitations in grounding their outputs within the physical environment and aligning with the capabilities of the robot. This challenge becomes even more pronounced with smaller language models, which are more computationally efficient but less robust in task planning and execution. In this paper, we present a novel modular architecture designed to enhance the robustness of LLM‑driven robotics by addressing these grounding and alignment issues. We formalize the task planning problem within a goal‑conditioned POMDP framework, identify key failure modes in LLM‑driven planning, and propose targeted design principles to mitigate these issues. Our architecture introduces an ``expected outcomes'' module to prevent mischaracterization of subgoals and a feedback mechanism to enable real‑time error recovery. Experimental results, both in simulation and on physical robots, demonstrate that our approach significantly improves task success rates for pick‑and‑place and manipulation tasks compared to both larger LLMs and standard baselines. Through hardware experiments, we also demonstrate how our architecture can be run efficiently and locally. This work highlights the potential of smaller, locally‑executable LLMs in robotics and provides a scalable, efficient solution for robust task execution.

PaperID: 12431, https://arxiv.org/pdf/2411.05454.pdf

Authors: Italo Napolitano, Andrea Lama, Francesco De Lellis, Mario di Bernardo

Title: Emergent Cooperative Strategies for Multi-Agent Shepherding via Reinforcement Learning

Abstract:
We present a decentralized reinforcement learning (RL) approach to address the multi‑agent shepherding control problem, departing from the conventional assumption of cohesive target groups. Our two‑layer control architecture consists of a low‑level controller that guides each herder to contain a specific target within a goal region, while a high‑level layer dynamically selects from multiple targets the one an herder should aim at corralling and containing. Cooperation emerges naturally, as herders autonomously choose distinct targets to expedite task completion. We further extend this approach to large‑scale systems, where each herder applies a shared policy, trained with few agents, while managing a fixed subset of agents.

PaperID: 12432, https://arxiv.org/pdf/2411.05423.pdf

Authors: Jeongwoo Lee, Kwangsuk Park, Jihyeon Park

Title: VISTA: Visual Integrated System for Tailored Automation in Math Problem Generation Using LLM

Abstract:
Generating accurate and consistent visual aids is a critical challenge in mathematics education, where visual representations like geometric shapes and functions play a pivotal role in enhancing student comprehension. This paper introduces a novel multi‑agent framework that leverages Large Language Models (LLMs) to automate the creation of complex mathematical visualizations alongside coherent problem text. Our approach not only simplifies the generation of precise visual aids but also aligns these aids with the problem's core mathematical concepts, improving both problem creation and assessment. By integrating multiple agents, each responsible for distinct tasks such as numeric calculation, geometry validation, and visualization, our system delivers mathematically accurate and contextually relevant problems with visual aids. Evaluation across Geometry and Function problem types shows that our method significantly outperforms basic LLMs in terms of text coherence, consistency, relevance and similarity, while maintaining the essential geometrical and functional integrity of the original problems. Although some challenges remain in ensuring consistent visual outputs, our framework demonstrates the immense potential of LLMs in transforming the way educators generate and utilize visual aids in math education.

PaperID: 12433, https://arxiv.org/pdf/2411.05349.pdf

Authors: Honghao Shi, Longkai Cheng, Wenli Wu, Yuhang Wang, Xuan Liu, Shaokai Nie, Weixv Wang, Xuebin Min, Chunlei Men, Yonghua Lin

Title: Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework

Abstract:
Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval‑Augmented Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self‑play methodologies, we have developed an LLM‑agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting and rectifying performance issues more efficiently and accurately than traditional methods.

PaperID: 12434, https://arxiv.org/pdf/2411.05348.pdf

Authors: Zongyuan Li, Yanan Ni, Runnan Qi, Lumin Jiang, Chang Lu, Xiaojie Xu, Xiangbei Liu, Pengfei Li, Yunzheng Guo, Zhe Ma, Huanyu Li, Hui Wu, Xian Guo, Kuihua Huang, Xuebo Zhang

Title: LLM-PySC2: Starcraft II learning environment for Large Language Models

Abstract:
The tremendous potential has been demonstrated by large language models (LLMs) in intelligent decision‑making problems, with unprecedented capabilities shown across diverse applications ranging from gaming AI systems to complex strategic planning frameworks. However, the StarCraft II platform, which has been widely adopted for validating decision‑making algorithms in the past decade, has not yet provided substantial support for this emerging domain. To address issues that LLMs cannot interface with the hundreds of actions of the pysc2 backend and the lack of native support for multi‑agent (MA) collaboration, we propose the LLM‑PySC2 environment. This is the first environment that offers LLMs the complete pysc2 action space with sufficient multi‑modal information and game Wiki knowledge. With an asynchronous query architecture, the environment efficiently interacts with LLMs that maintain a constant latency regardless of the scale of the agents' population. In the experiments, we evaluated LLMs' decision‑making performance in both the macro‑decision and micro‑operation scenarios, with traditional StarCraft II Multi‑Agent Challenge (SMAC) tasks and a series of new proposed. Results indicate that LLMs possess the potential to achieve victories in complex scenarios but cannot constantly generate correct decisions, especially in the recovered pysc2 action space and MA settings. Without task‑relevant instructions, the pre‑trained models suffer from issues such as hallucinations and inefficient collaboration. Our findings suggest that StarCraft II still challenges in the era of large models, revealing that there is a lot to do to develop an advanced LLM decision‑making system, and the proposed LLM‑PySC2 environment will support future development of LLM‑based decision‑making solutions.

PaperID: 12435, https://arxiv.org/pdf/2411.05285.pdf

Authors: Liming Dong, Qinghua Lu, Liming Zhu

Title: AgentOps: Enabling Observability of LLM Agents

Abstract:
Large language model (LLM) agents have demonstrated remarkable capabilities across various domains, gaining extensive attention from academia and industry. However, these agents raise significant concerns on AI safety due to their autonomous and non‑deterministic behavior, as well as continuous evolving nature . From a DevOps perspective, enabling observability in agents is necessary to ensuring AI safety, as stakeholders can gain insights into the agents' inner workings, allowing them to proactively understand the agents, detect anomalies, and prevent potential failures. Therefore, in this paper, we present a comprehensive taxonomy of AgentOps, identifying the artifacts and associated data that should be traced throughout the entire lifecycle of agents to achieve effective observability. The taxonomy is developed based on a systematic mapping study of existing AgentOps tools. Our taxonomy serves as a reference template for developers to design and implement AgentOps infrastructure that supports monitoring, logging, and analytics. thereby ensuring AI safety.

PaperID: 12436, https://arxiv.org/pdf/2411.05234.pdf

Authors: Debmalya Mandal, Goran Radanovic

Title: Performative Reinforcement Learning with Linear Markov Decision Process

Abstract:
We study the setting of \emphperformative reinforcement learning where the deployed policy affects both the reward, and the transition of the underlying Markov decision process. Prior work~\parenciteMTR23 has addressed this problem under the tabular setting and established last‑iterate convergence of repeated retraining with iteration complexity explicitly depending on the number of states. In this work, we generalize the results to \emphlinear Markov decision processes which is the primary theoretical model of large‑scale MDPs. The main challenge with linear MDP is that the regularized objective is no longer strongly convex and we want a bound that scales with the dimension of the features, rather than states which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a \emphperformatively stable policy. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions for proving convergence. We then tackle the finite sample setting where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem, and construct an empirical Lagrangian which is to be optimized from the samples. We show that, under a \emphbounded coverage condition, repeatedly solving a saddle point of this empirical Lagrangian converges to a performatively stable solution, and also construct a primal‑dual algorithm that solves the empirical Lagrangian efficiently. Finally, we show several applications of the general framework of performative RL including multi‑agent systems.

PaperID: 12437, https://arxiv.org/pdf/2411.05209.pdf

Authors: Yide Ran, Zhaozhuo Xu, Yuhang Yao, Zijian Hu, Shanshan Han, Han Jin, Alay Dilipbhai Shah, Jipeng Zhang, Dimitris Stripelis, Tong Zhang, Salman Avestimehr, Chaoyang He

Title: Alopex: A Computational Framework for Enabling On-Device Function Calls with LLMs

Abstract:
The rapid advancement of Large Language Models (LLMs) has led to their increased integration into mobile devices for personalized assistance, which enables LLMs to call external API functions to enhance their performance. However, challenges such as data scarcity, ineffective question formatting, and catastrophic forgetting hinder the development of on‑device LLM agents. To tackle these issues, we propose Alopex, a framework that enables precise on‑device function calls using the Fox LLM. Alopex introduces a logic‑based method for generating high‑quality training data and a novel ``description‑question‑output'' format for fine‑tuning, reducing risks of function information leakage. Additionally, a data mixing strategy is used to mitigate catastrophic forgetting, combining function call data with textbook datasets to enhance performance in various tasks. Experimental results show that Alopex improves function call accuracy and significantly reduces catastrophic forgetting, providing a robust solution for integrating function call capabilities into LLMs without manual intervention.

PaperID: 12438, https://arxiv.org/pdf/2411.05205.pdf

Authors: Bowei Li, Yang Xu, Ran Zhang, Jiang, Xie, Miao Wang

Title: Maximizing User Connectivity in AI-Enabled Multi-UAV Networks: A Distributed Strategy Generalized to Arbitrary User Distributions

Abstract:
Deep reinforcement learning (DRL) has been extensively applied to Multi‑Unmanned Aerial Vehicle (UAV) network (MUN) to effectively enable real‑time adaptation to complex, time‑varying environments. Nevertheless, most of the existing works assume a stationary user distribution (UD) or a dynamic one with predicted patterns. Such considerations may make the UD‑specific strategies insufficient when a MUN is deployed in unknown environments. To this end, this paper investigates distributed user connectivity maximization problem in a MUN with generalization to arbitrary UDs. Specifically, the problem is first formulated into a time‑coupled combinatorial nonlinear non‑convex optimization with arbitrary underlying UDs. To make the optimization tractable, a multi‑agent CNN‑enhanced deep Q learning (MA‑CDQL) algorithm is proposed. The algorithm integrates a ResNet‑based CNN to the policy network to analyze the input UD in real time and obtain optimal decisions based on the extracted high‑level UD features. To improve the learning efficiency and avoid local optimums, a heatmap algorithm is developed to transform the raw UD to a continuous density map. The map will be part of the true input to the policy network. Simulations are conducted to demonstrate the efficacy of UD heatmaps and the proposed algorithm in maximizing user connectivity as compared to K‑means methods.

PaperID: 12439, https://arxiv.org/pdf/2411.05185.pdf

Authors: Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, Wei Ruan

Title: PentestAgent: Incorporating LLM Agents to Automated Penetration Testing

Abstract:
Penetration testing is a critical technique for identifying security vulnerabilities, traditionally performed manually by skilled security specialists. This complex process involves gathering information about the target system, identifying entry points, exploiting the system, and reporting findings. Despite its effectiveness, manual penetration testing is time‑consuming and expensive, often requiring significant expertise and resources that many organizations cannot afford. While automated penetration testing methods have been proposed, they often fall short in real‑world applications due to limitations in flexibility, adaptability, and implementation. Recent advancements in large language models (LLMs) offer new opportunities for enhancing penetration testing through increased intelligence and automation. However, current LLM‑based approaches still face significant challenges, including limited penetration testing knowledge and a lack of comprehensive automation capabilities. To address these gaps, we propose PentestAgent, a novel LLM‑based automated penetration testing framework that leverages the power of LLMs and various LLM‑based techniques like Retrieval Augmented Generation (RAG) to enhance penetration testing knowledge and automate various tasks. Our framework leverages multi‑agent collaboration to automate intelligence gathering, vulnerability analysis, and exploitation stages, reducing manual intervention. We evaluate PentestAgent using a comprehensive benchmark, demonstrating superior performance in task completion and overall efficiency. This work significantly advances the practical applicability of automated penetration testing systems.

PaperID: 12440, https://arxiv.org/pdf/2411.04925.pdf

Authors: Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, Xiaodan Liang

Title: StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

Abstract:
The advent of AI‑Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story‑to‑Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi‑agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image‑to‑Video (I2V) method, LoRA‑BE, to enhance intra‑shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state‑of‑the‑art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.

PaperID: 12441, https://arxiv.org/pdf/2411.04867.pdf

Authors: Satchit Chatterji, Erman Acar

Title: Think Smart, Act SMARL! Analyzing Probabilistic Logic Shields for Multi-Agent Reinforcement Learning

Abstract:
Safe reinforcement learning (RL) is crucial for real‑world applications, and multi‑agent interactions introduce additional safety challenges. While Probabilistic Logic Shields (PLS) has been a powerful proposal to enforce safety in single‑agent RL, their generalizability to multi‑agent settings remains unexplored. In this paper, we address this gap by conducting extensive analyses of PLS within decentralized, multi‑agent environments, and in doing so, propose Shielded Multi‑Agent Reinforcement Learning (SMARL) as a general framework for steering MARL towards norm‑compliant outcomes. Our key contributions are: (1) a novel Probabilistic Logic Temporal Difference (PLTD) update for shielded, independent Q‑learning, which incorporates probabilistic constraints directly into the value update process; (2) a probabilistic logic policy gradient method for shielded PPO with formal safety guarantees for MARL; and (3) comprehensive evaluation across symmetric and asymmetrically shielded n‑player game‑theoretic benchmarks, demonstrating fewer constraint violations and significantly better cooperation under normative constraints. These results position SMARL as an effective mechanism for equilibrium selection, paving the way toward safer, socially aligned multi‑agent systems.

PaperID: 12442, https://arxiv.org/pdf/2411.04788.pdf

Authors: Xuewen Han, Neng Wang, Shangkun Che, Hongyang Yang, Kunpeng Zhang, Sean Xin Xu

Title: Enhancing Investment Analysis: Optimizing AI-Agent Collaboration in Financial Research

Abstract:
In recent years, the application of generative artificial intelligence (GenAI) in financial analysis and investment decision‑making has gained significant attention. However, most existing approaches rely on single‑agent systems, which fail to fully utilize the collaborative potential of multiple AI agents. In this paper, we propose a novel multi‑agent collaboration system designed to enhance decision‑making in financial investment research. The system incorporates agent groups with both configurable group sizes and collaboration structures to leverage the strengths of each agent group type. By utilizing a sub‑optimal combination strategy, the system dynamically adapts to varying market conditions and investment scenarios, optimizing performance across different tasks. We focus on three sub‑tasks: fundamentals, market sentiment, and risk analysis, by analyzing the 2023 SEC 10‑K forms of 30 companies listed on the Dow Jones Index. Our findings reveal significant performance variations based on the configurations of AI agents for different tasks. The results demonstrate that our multi‑agent collaboration system outperforms traditional single‑agent models, offering improved accuracy, efficiency, and adaptability in complex financial environments. This study highlights the potential of multi‑agent systems in transforming financial analysis and investment decision‑making by integrating diverse analytical perspectives.

PaperID: 12443, https://arxiv.org/pdf/2411.04735.pdf

Authors: Chanyoung Chung, Hyunki Seong, David Hyunchul Shim

Title: Learning from Demonstration with Hierarchical Policy Abstractions Toward High-Performance and Courteous Autonomous Racing

Abstract:
Fully autonomous racing demands not only high‑speed driving but also fair and courteous maneuvers. In this paper, we propose an autonomous racing framework that learns complex racing behaviors from expert demonstrations using hierarchical policy abstractions. At the trajectory level, our policy model predicts a dense distribution map indicating the likelihood of trajectories learned from offline demonstrations. The maximum likelihood trajectory is then passed to the control‑level policy, which generates control inputs in a residual fashion, considering vehicle dynamics at the limits of performance. We evaluate our framework in a high‑fidelity racing simulator and compare it against competing baselines in challenging multi‑agent adversarial scenarios. Quantitative and qualitative results show that our trajectory planning policy significantly outperforms the baselines, and the residual control policy improves lap time and tracking accuracy. Moreover, challenging closed‑loop experiments with ten opponents show that our framework can overtake other vehicles by understanding nuanced interactions, effectively balancing performance and courtesy like professional drivers.

PaperID: 12444, https://arxiv.org/pdf/2411.04681.pdf

Authors: Sven Banisch, Dennis Jacob, Tom Willaert, Eckehard Olbrich

Title: A dynamical model of platform choice and online segregation

Abstract:
In order to truly understand how social media might shape online discourses or contribute to societal polarization, we need refined models of platform choice, that is: models that help us understand why users prefer one social media platform over another. This study develops a dynamic model of platform selection, extending Social Feedback Theory by incorporating multi‑agent reinforcement learning to capture how user decisions are shaped by past rewards across different platforms. A key parameter (μ) in the model governs users' tendencies to either seek approval from like‑minded peers or engage with opposing views. Our findings reveal that online environments can evolve into suboptimal states characterized by polarized, strongly opinionated echo chambers, even when users prefer diverse perspectives. Interestingly, this polarizing state coexists with another equilibrium, where users gravitate toward a single dominant platform, marginalizing other platforms into extremity. Using agent‑based simulations and dynamical systems analysis, our model underscores the complex interplay of user preferences and platform dynamics, offering insights into how digital spaces might be better managed to foster diverse discourse.

PaperID: 12445, https://arxiv.org/pdf/2411.04578.pdf

Authors: Tianqi Song, Yugin Tan, Zicheng Zhu, Yibin Feng, Yi-Chieh Lee

Title: Multi-Agents are Social Groups: Investigating Social Influence of Multiple Agents in Human-Agent Interactions

Abstract:
Multi‑agent systems ‑ systems with multiple independent AI agents working together to achieve a common goal ‑ are becoming increasingly prevalent in daily life. Drawing inspiration from the phenomenon of human group social influence, we investigate whether a group of AI agents can create social pressure on users to agree with them, potentially changing their stance on a topic. We conducted a study in which participants discussed social issues with either a single or multiple AI agents, and where the agents either agreed or disagreed with the user's stance on the topic. We found that conversing with multiple agents (holding conversation content constant) increased the social pressure felt by participants, and caused a greater shift in opinion towards the agents' stances on each topic. Our study shows the potential advantages of multi‑agent systems over single‑agent platforms in causing opinion change. We discuss design implications for possible multi‑agent systems that promote social good, as well as the potential for malicious actors to use these systems to manipulate public opinion.

PaperID: 12446, https://arxiv.org/pdf/2411.04468.pdf

Authors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

Title: Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Abstract:
Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi‑step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic‑One, a high‑performing open‑source agentic system for solving such tasks. Magentic‑One uses a multi‑agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re‑plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic‑One achieves statistically competitive performance to the state‑of‑the‑art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic‑One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic‑One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open‑source implementation of Magentic‑One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built‑in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner ‑‑ which is important when agents' actions have side‑effects. Magentic‑One, AutoGenBench and detailed empirical performance evaluations of Magentic‑One, including ablations and error analysis are available at https://aka.ms/magentic‑one

PaperID: 12447, https://arxiv.org/pdf/2411.04329.pdf

Authors: Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

Title: CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Abstract:
Pre‑trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution‑based feedback, these models can act as agents with capabilities to self‑refine and improve generated code autonomously. However, on challenging coding tasks with extremely large search space, current agentic approaches still struggle with multi‑stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions. In each stage, critical decision‑making (ranking, termination, expanding) of the exploration process is guided by both the environmental execution‑based feedback and LLM‑agent‑generated feedback. We comprehensively evaluated CodeTree on 7 code generation benchmarks and demonstrated the significant performance gains of CodeTree against strong baselines. Using GPT‑4o as the base model, we consistently achieved top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach led to significant performance gains.

PaperID: 12448, https://arxiv.org/pdf/2411.03817.pdf

Authors: Zhirui Deng, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen, Ruibin Xiong, Mang Wang, Weipeng Chen

Title: From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

Abstract:
The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine‑tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi‑step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step‑wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice‑to‑expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine‑grained optimization. Additionally, we propose implicit‑reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.

PaperID: 12449, https://arxiv.org/pdf/2411.03669.pdf

Authors: Lingfeng Sun, Yixiao Wang, Pin-Yun Hung, Changhao Wang, Xiang Zhang, Zhuo Xu, Masayoshi Tomizuka

Title: Imagined Potential Games: A Framework for Simulating, Learning and Evaluating Interactive Behaviors

Abstract:
Interacting with human agents in complex scenarios presents a significant challenge for robotic navigation, particularly in environments that necessitate both collision avoidance and collaborative interaction, such as indoor spaces. Unlike static or predictably moving obstacles, human behavior is inherently complex and unpredictable, stemming from dynamic interactions with other agents. Existing simulation tools frequently fail to adequately model such reactive and collaborative behaviors, impeding the development and evaluation of robust social navigation strategies. This paper introduces a novel framework utilizing distributed potential games to simulate human‑like interactions in highly interactive scenarios. Within this framework, each agent imagines a virtual cooperative game with others based on its estimation. We demonstrate this formulation can facilitate the generation of diverse and realistic interaction patterns in a configurable manner across various scenarios. Additionally, we have developed a gym‑like environment leveraging our interactive agent model to facilitate the learning and evaluation of interactive navigation algorithms.

PaperID: 12450, https://arxiv.org/pdf/2411.03633.pdf

Authors: Bing Liu, Chengcheng Zhao, Li Chai, Peng Cheng, Jiming Chen

Title: Privacy-Preserving Resilient Vector Consensus

Abstract:
This paper studies privacy‑preserving resilient vector consensus in multi‑agent systems against faulty agents, where normal agents can achieve consensus within the convex hull of their initial states while protecting state vectors from being disclosed. Specifically, we consider a modification of an existing algorithm known as Approximate Distributed Robust Convergence Using Centerpoints (ADRC), i.e., Privacy‑Preserving ADRC (PP‑ADRC). Under PP‑ADRC, each normal agent introduces multivariate Gaussian noise to its state during each iteration. We first provide sufficient conditions to ensure that all normal agents' states can achieve mean square convergence under PP‑ADRC. Then, we analyze convergence accuracy from two perspectives, i.e., the Mahalanobis distance of the final value from its expectation and the Hausdorff distance‑based alteration of the convex hull caused by noise when only partial dimensions are added with noise. Then, we employ concentrated geo‑privacy to characterize privacy preservation and conduct a thorough comparison with differential privacy. Finally, numerical simulations demonstrate the theoretical results.

PaperID: 12451, https://arxiv.org/pdf/2411.03603.pdf

Authors: Yuqian Fu, Yuanheng Zhu, Haoran Li, Zijie Zhao, Jiajun Chai, Dongbin Zhao

Title: CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration

Abstract:
Efficient exploration is crucial in cooperative multi‑agent reinforcement learning (MARL), especially in sparse‑reward settings. However, due to the reliance on the unimodal policy, existing methods are prone to falling into the local optima, hindering the effective exploration of better policies. Furthermore, in sparse‑reward settings, each agent tends to receive a scarce reward, which poses significant challenges to inter‑agent cooperation. This not only increases the difficulty of policy learning but also degrades the overall performance of multi‑agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), with two primary components: (a) introducing a multimodal policy to enhance the agent's exploration capability, and (b) sharing the intention among agents to foster agent cooperation. For component (a), CPIG incorporates a Consistency model as the policy, leveraging its multimodal nature and stochastic characteristics to facilitate exploration. Regarding component (b), we introduce an Intention Learner to deduce the intention on the global state from each agent's local observation. This intention then serves as a guidance for the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi‑agent particle environments (MPE) and multi‑agent MuJoCo (MAMuJoCo). Empirical results demonstrate that our method not only achieves comparable performance to various baselines in dense‑reward environments but also significantly enhances performance in sparse‑reward settings, outperforming state‑of‑the‑art (SOTA) algorithms by 20%.

PaperID: 12452, https://arxiv.org/pdf/2411.03562.pdf

Authors: Antoine Grosnit, Alexandre Maraval, Refinath S N, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Haitham Bou-Ammar, Jun Wang

Title: Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance

Abstract:
Human expertise emerges through iterative cycles of interaction, reflection, and internal model updating, which are central to cognitive theories such as Kolb's experiential learning and Vygotsky's zone of proximal development. In contrast, current AI systems, particularly LLM agents, rely on static pre‑training or rigid workflows, lacking mechanisms for continual adaptation. Recent studies identified early cognitive traits in LLM agents (reflection, revision, and self‑correction) suggesting foundational elements of human‑like experiential learning. Thus the key question: Can we design LLM agents capable of structured, cognitively grounded learning similar to human processes? In response, we propose a computational framework of Kolb's learning cycle with Vygotsky's ZPD for autonomous agents. Our architecture separates extrinsic (environment interaction) and intrinsic (internal reflection/abstraction) functions, enabling cognitively grounded scaffolded learning, where the agent initially learns within structured environments, followed by open‑ended generalisation. This approach empowers agents to master complex tasks ; domains that traditional fine‑tuning or simple reflective methods could not tackle effectively. Its potential is powerfully demonstrated via direct comparison with humans in real‑world Kaggle data science competitions. Learning fully automated data science code generation across 81 tasks, our system, Agent K, demonstrated the ability to perform the entire workflow autonomously, achieving an Elo‑MMR score of 1694, beyond median score of the Kaggle Masters (the top 2% among 200,000 users) of our study. With 9 gold, 8 silver, and 12 bronze medals level performance ‑ including 4 gold and 4 silver on prize‑awarding competitions ‑ Agent K is the 1st AI system to successfully integrate Kolb‑ and Vygotsky‑inspired human cognitive learning, marking a major step toward generalist AI.

PaperID: 12453, https://arxiv.org/pdf/2411.03519.pdf

Authors: Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis

Title: AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

Abstract:
With more advanced natural language understanding and reasoning capabilities, large language model (LLM)‑powered agents are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emergent behaviors relevant to social science and gaming. However, current multi‑agent simulations frequently suffer from inefficiencies due to the limited parallelism caused by false dependencies, resulting in performance bottlenecks. In this paper, we introduce AI Metropolis, a simulation engine that improves the efficiency of LLM agent simulations by incorporating out‑of‑order execution scheduling. By dynamically tracking real dependencies between agents, AI Metropolis minimizes false dependencies, enhancing parallelism and enabling efficient hardware utilization. Our evaluations demonstrate that AI Metropolis achieves speedups from 1.3x to 4.15x over standard parallel simulation with global synchronization, approaching optimal performance as the number of agents increases.

PaperID: 12454, https://arxiv.org/pdf/2411.03455.pdf

Authors: Benjamin Rombaut, Sogol Masoumzadeh, Kirill Vasilevski, Dayi Lin, Ahmed E. Hassan

Title: Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents

Abstract:
Large language models (LLMs) are increasingly integrated into autonomous systems, giving rise to a new class of software known as Agentware, where LLM‑powered agents perform complex, open‑ended tasks in domains such as software engineering, customer service, and data analysis. However, their high autonomy and opaque reasoning processes pose significant challenges for traditional software observability methods. To address this, we introduce the concept of cognitive observability ‑ the ability to recover and inspect the implicit reasoning behind agent decisions. We present Watson, a general‑purpose framework for observing the reasoning processes of fast‑thinking LLM agents without altering their behavior. Watson retroactively infers reasoning traces using prompt attribution techniques. We evaluate Watson in both manual debugging and automated correction scenarios across the MMLU benchmark and the AutoCodeRover and OpenHands agents on the SWE‑bench‑lite dataset. In both static and dynamic settings, Watson surfaces actionable reasoning insights and supports targeted interventions, demonstrating its practical utility for improving transparency and reliability in Agentware systems.

PaperID: 12455, https://arxiv.org/pdf/2411.03346.pdf

Authors: Yuntong Zhang, Jiawei Wang, Dominic Berzin, Martin Mirchev, Dongge Liu, Abhishek Arya, Oliver Chang, Abhik Roychoudhury

Title: Fixing Security Vulnerabilities with AI in OSS-Fuzz

Abstract:
Critical open source software systems undergo significant validation in the form of lengthy fuzz campaigns. The fuzz campaigns typically conduct a biased random search over the domain of program inputs, to find inputs which crash the software system. Such fuzzing is useful to enhance the security of software systems in general since even closed source software may use open source components. Hence testing open source software is of paramount importance. Currently OSS‑Fuzz is the most significant and widely used infrastructure for continuous validation of open source systems. Unfortunately even though OSS‑Fuzz has identified more than 10,000 vulnerabilities across 1000 or more software projects, the detected vulnerabilities may remain unpatched, as vulnerability fixing is often manual in practice. In this work, we rely on the recent progress in Large Language Model (LLM) agents for autonomous program improvement including bug fixing. We customise the well‑known AutoCodeRover agent for fixing security vulnerabilities. This is because LLM agents like AutoCodeRover fix bugs from issue descriptions via code search. Instead for security patching, we rely on the test execution of the exploit input to extract code elements relevant to the fix. Our experience with OSS‑Fuzz vulnerability data shows that LLM agent autonomy is useful for successful security patching, as opposed to approaches like Agentless where the control flow is fixed. More importantly our findings show that we cannot measure quality of patches by code similarity of the patch with reference codes (as in CodeBLEU scores used in VulMaster), since patches with high CodeBLEU scores still fail to pass given the given exploit input. Our findings indicate that security patch correctness needs to consider dynamic attributes like test executions as opposed to relying of standard text/code similarity metrics.

PaperID: 12456, https://arxiv.org/pdf/2411.03321.pdf

Authors: Chenxiao Yu, Zhaotian Weng, Yuangang Li, Zheng Li, Xiyang Hu, Yue Zhao

Title: Towards More Accurate US Presidential Election via Multi-step Reasoning with Large Language Models

Abstract:
Can Large Language Models (LLMs) accurately predict election outcomes? While LLMs have demonstrated impressive performance in various domains, including healthcare, legal analysis, and creative tasks, their ability to forecast elections remains unknown. Election prediction poses unique challenges, such as limited voter‑level data, rapidly changing political landscapes, and the need to model complex human behavior. To address these challenges, we introduce a multi‑step reasoning framework designed for political analysis. Our approach is validated on real‑world data from the American National Election Studies (ANES) 2016 and 2020, as well as synthetic personas generated by the leading machine learning framework, offering scalable datasets for voter behavior modeling. To capture temporal dynamics, we incorporate candidates' policy positions and biographical details, ensuring that the model adapts to evolving political contexts. Drawing on Chain of Thought prompting, our multi‑step reasoning pipeline systematically integrates demographic, ideological, and time‑dependent factors, enhancing the model's predictive power.

PaperID: 12457, https://arxiv.org/pdf/2411.03252.pdf

Authors: Ryosuke Takata, Atsushi Masumori, Takashi Ikegami

Title: Spontaneous Emergence of Agent Individuality through Social Interactions in LLM-Based Communities

Abstract:
We study the emergence of agency from scratch by using Large Language Model (LLM)‑based agents. In previous studies of LLM‑based agents, each agent's characteristics, including personality and memory, have traditionally been predefined. We focused on how individuality, such as behavior, personality, and memory, can be differentiated from an undifferentiated state. The present LLM agents engage in cooperative communication within a group simulation, exchanging context‑based messages in natural language. By analyzing this multi‑agent simulation, we report valuable new insights into how social norms, cooperation, and personality traits can emerge spontaneously. This paper demonstrates that autonomously interacting LLM‑powered agents generate hallucinations and hashtags to sustain communication, which, in turn, increases the diversity of words within their interactions. Each agent's emotions shift through communication, and as they form communities, the personalities of the agents emerge and evolve accordingly. This computational modeling approach and its findings will provide a new method for analyzing collective artificial intelligence.

PaperID: 12458, https://arxiv.org/pdf/2411.02584.pdf

Authors: Xian Yeow Lee, Haiyan Wang, Daisuke Katsumata, Takaharu Matsui, Chetan Gupta

Title: Multi-Agent Decision Transformers for Dynamic Dispatching in Material Handling Systems Leveraging Enterprise Big Data

Abstract:
Dynamic dispatching rules that allocate resources to tasks in real‑time play a critical role in ensuring efficient operations of many automated material handling systems across industries. Traditionally, the dispatching rules deployed are typically the result of manually crafted heuristics based on domain experts' knowledge. Generating these rules is time‑consuming and often sub‑optimal. As enterprises increasingly accumulate vast amounts of operational data, there is significant potential to leverage this big data to enhance the performance of automated systems. One promising approach is to use Decision Transformers, which can be trained on existing enterprise data to learn better dynamic dispatching rules for improving system throughput. In this work, we study the application of Decision Transformers as dynamic dispatching policies within an actual multi‑agent material handling system and identify scenarios where enterprises can effectively leverage Decision Transformers on existing big data to gain business value. Our empirical results demonstrate that Decision Transformers can improve the material handling system's throughput by a considerable amount when the heuristic originally used in the enterprise data exhibits moderate performance and involves no randomness. When the original heuristic has strong performance, Decision Transformers can still improve the throughput but with a smaller improvement margin. However, when the original heuristics contain an element of randomness or when the performance of the dataset is below a certain threshold, Decision Transformers fail to outperform the original heuristic. These results highlight both the potential and limitations of Decision Transformers as dispatching policies for automated industrial material handling systems.

PaperID: 12459, https://arxiv.org/pdf/2411.02377.pdf

Authors: Vade Shah, Bryce L. Ferguson, Jason R. Marden

Title: Trial-and-Error Learning in Decentralized Matching Markets

Abstract:
Two‑sided matching markets, environments in which two disjoint groups of agents seek to partner with one another, arise in several contexts. In static, centralized markets where agents know their preferences, standard algorithms can yield a stable matching. However, in dynamic, decentralized markets where agents must learn their preferences through interaction, such algorithms cannot be used. Our goal in this paper is to identify achievable stability guarantees in decentralized matching markets where (i) agents have limited information about their preferences and (ii) no central entity determines the match. Surprisingly, our first result demonstrates that these constraints do not preclude stability‑‑simple "trial and error" learning policies guarantee convergence to a stable matching without requiring coordination between agents. Our second result shows that more sophisticated policies can direct the system toward a particular group's optimal stable matching. This finding highlights an important dimension of strategic learning: when agents can accurately model others' policies, they can adapt their own behavior to systematically influence outcomes in their favor‑‑a phenomenon with broad implications for learning in multi‑agent systems.

PaperID: 12460, https://arxiv.org/pdf/2411.02305.pdf

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, Chien-Sheng Wu

Title: CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Abstract:
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real‑world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state‑of‑the‑art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function‑calling abilities. Our findings highlight the need for enhanced agent capabilities in function‑calling and rule‑following to be deployed in real‑world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.

PaperID: 12461, https://arxiv.org/pdf/2411.02150.pdf

Authors: Ahmad Halimi Razlighi, Maximilian H. V. Tillmann, Edgar Beck, Carsten Bockelmann, Armin Dekorsy

Title: Cooperative and Collaborative Multi-Task Semantic Communication for Distributed Sources

Abstract:
In this paper, we explore a multi‑task semantic communication (SemCom) system for distributed sources, extending the existing focus on collaborative single‑task execution. We build on the cooperative multi‑task processing introduced in [1], which divides the encoder into a common unit (CU) and multiple specific units (SUs). While earlier studies in multi‑task SemCom focused on full observation settings, our research explores a more realistic case where only distributed partial observations are available, such as in a production line monitored by multiple sensing nodes. To address this, we propose an SemCom system that supports multi‑task processing through cooperation on the transmitter side via split structure and collaboration on the receiver side. We have used an information‑theoretic perspective with variational approximations for our end‑to‑end data‑driven approach. Simulation results demonstrate that the proposed cooperative and collaborative multi‑task (CCMT) SemCom system significantly improves task execution accuracy, particularly in complex datasets, if the noise introduced from the communication channel is not limiting the task performance too much. Our findings contribute to a more general SemCom framework capable of handling distributed sources and multiple tasks simultaneously, advancing the applicability of SemCom systems in real‑world scenarios.

PaperID: 12462, https://arxiv.org/pdf/2411.02086.pdf

Authors: Fan Wu, Muhammad Bilal, Haolong Xiang, Heng Wang, Jinjun Yu, Xiaolong Xu

Title: Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism

Abstract:
Railway Turnout Machines (RTMs) are mission‑critical components of the railway transportation infrastructure, responsible for directing trains onto desired tracks. For safety assurance applications, especially in early‑warning scenarios, RTM faults are expected to be detected as early as possible on a continuous 7x24 basis. However, limited emphasis has been placed on distributed model inference frameworks that can meet the inference latency and reliability requirements of such mission critical fault diagnosis systems. In this paper, an edge‑cloud collaborative early‑warning system is proposed to enable real‑time and downtime‑tolerant fault diagnosis of RTMs, providing a new paradigm for the deployment of models in safety‑critical scenarios. Firstly, a modular fault diagnosis model is designed specifically for distributed deployment, which utilizes a hierarchical architecture consisting of the prior knowledge module, subordinate classifiers, and a fusion layer for enhanced accuracy and parallelism. Then, a cloud‑edge collaborative framework leveraging pipeline parallelism, namely CEC‑PA, is developed to minimize the overhead resulting from distributed task execution and context exchange by strategically partitioning and offloading model components across cloud and edge. Additionally, an election consensus mechanism is implemented within CEC‑PA to ensure system robustness during coordinator node downtime. Comparative experiments and ablation studies are conducted to validate the effectiveness of the proposed distributed fault diagnosis approach. Our ensemble‑based fault diagnosis model achieves a remarkable 97.4% accuracy on a real‑world dataset collected by Nanjing Metro in Jiangsu Province, China. Meanwhile, CEC‑PA demonstrates superior recovery proficiency during node disruptions and speed‑up ranging from 1.98x to 7.93x in total inference time compared to its counterparts.

PaperID: 12463, https://arxiv.org/pdf/2411.01794.pdf

Authors: Quanyan Zhu, Tamer Başar

Title: Revisiting Game-Theoretic Control in Socio-Technical Networks: Emerging Design Frameworks and Contemporary Applications

Abstract:
Socio‑technical networks represent emerging cyber‑physical infrastructures that are tightly interwoven with human networks. The coupling between human and technical networks presents significant challenges in managing, controlling, and securing these complex, interdependent systems. This paper investigates game‑theoretic frameworks for the design and control of socio‑technical networks, with a focus on critical applications such as misinformation management, infrastructure optimization, and resilience in socio‑cyber‑physical systems (SCPS). Core methodologies, including Stackelberg games, mechanism design, and dynamic game theory, are examined as powerful tools for modeling interactions in hierarchical, multi‑agent environments. Key challenges addressed include mitigating human‑driven vulnerabilities, managing large‑scale system dynamics, and countering adversarial threats. By bridging individual agent behaviors with overarching system goals, this work illustrates how the integration of game theory and control theory can lead to robust, resilient, and adaptive socio‑technical networks. This paper highlights the potential of these frameworks to dynamically align decentralized agent actions with system‑wide objectives of stability, security, and efficiency.

PaperID: 12464, https://arxiv.org/pdf/2411.01766.pdf

Authors: Cheng Zhang, Lan Wei, Ji Fan, Zening Liu, Yongming Huang

Title: Lyapunov-guided Multi-Agent Reinforcement Learning for Delay-Sensitive Wireless Scheduling

Abstract:
In this paper, a two‑stage intelligent scheduler is proposed to minimize the packet‑level delay jitter while guaranteeing delay bound. Firstly, Lyapunov technology is employed to transform the delay‑violation constraint into a sequential slot‑level queue stability problem. Secondly, a hierarchical scheme is proposed to solve the resource allocation between multiple base stations and users, where the multi‑agent reinforcement learning (MARL) gives the user priority and the number of scheduled packets, while the underlying scheduler allocates the resource. Our proposed scheme achieves lower delay jitter and delay violation rate than the Round‑Robin Earliest Deadline First algorithm and MARL with delay violation penalty.

PaperID: 12465, https://arxiv.org/pdf/2411.01707.pdf

Authors: Jingtao Tang, Zining Mao, Hang Ma

Title: Large-Scale Multirobot Coverage Path Planning on Grids With Path Deconfliction

Abstract:
We study Multi‑Robot Coverage Path Planning (MCPP) on a 4‑neighbor 2D grid G, which aims to compute paths for multiple robots to cover all cells of G. Traditional approaches are limited as they first compute coverage trees on a quadrant coarsened grid H and then employ the Spanning Tree Coverage (STC) paradigm to generate paths on G, making them inapplicable to grids with partially obstructed 2x2 blocks. To address this limitation, we reformulate the problem directly on G, revolutionizing grid‑based MCPP solving and establishing new NP‑hardness results. We introduce Extended‑STC (ESTC), a novel paradigm that extends STC to ensure complete coverage with bounded suboptimality, even when H includes partially obstructed blocks. Furthermore, we present LS‑MCPP, a new algorithmic framework that integrates ESTC with three novel types of neighborhood operators within a local search strategy to optimize coverage paths directly on G. Unlike prior grid‑based MCPP work, our approach also incorporates a versatile post‑processing procedure that applies Multi‑Agent Path Finding (MAPF) techniques to MCPP for the first time, enabling a fusion of these two important fields in multi‑robot coordination. This procedure effectively resolves inter‑robot conflicts and accommodates turning costs by solving a MAPF variant, making our MCPP solutions more practical for real‑world applications. Extensive experiments demonstrate that our approach significantly improves solution quality and efficiency, managing up to 100 robots on grids as large as 256x256 within minutes of runtime. Validation with physical robots confirms the feasibility of our solutions under real‑world conditions.

PaperID: 12466, https://arxiv.org/pdf/2411.01643.pdf

Authors: Shaokun Zhang, Jieyu Zhang, Dujian Ding, Mirian Hipolito Garcia, Ankur Mallick, Daniel Madrigal, Menglin Xia, Victor Rühle, Qingyun Wu, Chi Wang

Title: EcoAct: Economic Agent Determines When to Register What Action

Abstract:
Recent advancements have enabled Large Language Models (LLMs) to function as agents that can perform actions using external tools. This requires registering, i.e., integrating tool information into the LLM context prior to taking actions. Current methods indiscriminately incorporate all candidate tools into the agent's context and retain them across multiple reasoning steps. This process remains opaque to LLM agents and is not integrated into their reasoning procedures, leading to inefficiencies due to increased context length from irrelevant tools. To address this, we introduce EcoAct, a tool using algorithm that allows LLMs to selectively register tools as needed, optimizing context use. By integrating the tool registration process into the reasoning procedure, EcoAct reduces computational costs by over 50% in multiple steps reasoning tasks while maintaining performance, as demonstrated through extensive experiments. Moreover, it can be plugged into any reasoning pipeline with only minor modifications to the prompt, making it applicable to LLM agents now and future.

PaperID: 12467, https://arxiv.org/pdf/2411.01568.pdf

Authors: Faseeh Ahmad, Jonathan Styrud, Volker Krueger

Title: Addressing Failures in Robotics using Vision-Based Language Models (VLMs) and Behavior Trees (BT)

Abstract:
In this paper, we propose an approach that combines Vision Language Models (VLMs) and Behavior Trees (BTs) to address failures in robotics. Current robotic systems can handle known failures with pre‑existing recovery strategies, but they are often ill‑equipped to manage unknown failures or anomalies. We introduce VLMs as a monitoring tool to detect and identify failures during task execution. Additionally, VLMs generate missing conditions or skill templates that are then incorporated into the BT, ensuring the system can autonomously address similar failures in future tasks. We validate our approach through simulations in several failure scenarios.

PaperID: 12468, https://arxiv.org/pdf/2411.01553.pdf

Authors: Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang

Title: Learning to Communicate Through Implicit Communication Channels

Abstract:
Effective communication is an essential component in collaborative multi‑agent systems. Situations where explicit messaging is not feasible have been common in human society throughout history, which motivate the study of implicit communication. Previous works on learning implicit communication mostly rely on theory of mind (ToM), where agents infer the mental states and intentions of others by interpreting their actions. However, ToM‑based methods become less effective in making accurate inferences in complex tasks. In this work, we propose the Implicit Channel Protocol (ICP) framework, which allows agents to communicate through implicit communication channels similar to the explicit ones. ICP leverages a subset of actions, denoted as the scouting actions, and a mapping between information and these scouting actions that encodes and decodes the messages. We propose training algorithms for agents to message and act, including learning with a randomly initialized information map and with a delayed information map. The efficacy of ICP has been tested on the tasks of Guessing Numbers, Revealing Goals, and Hanabi, where ICP significantly outperforms baseline methods through more efficient information transmission.

PaperID: 12469, https://arxiv.org/pdf/2411.01455.pdf

Authors: Zirui Wang, Xinran Zhao, Simon Stepputtis, Woojun Kim, Tongshuang Wu, Katia Sycara, Yaqi Xie

Title: HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Abstract:
Understanding and predicting human actions has been a long‑standing challenge and is a crucial measure of perception in robotics AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real‑world human activity ‑‑ interactions. To address this gap in human‑like forecasting within multi‑agent environments, we present the Hierarchical Memory‑Aware Transformer (HiMemFormer), a transformer‑based model for online multi‑agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent‑specific features based on these global representations using a coarse‑to‑fine strategy. In contrast to previous approaches, HiMemFormer uniquely hierarchically applies the global context with agent‑specific preferences to avoid noisy or redundant information in multi‑agent action anticipation. Extensive experiments on various multi‑agent scenarios demonstrate the significant performance of HiMemFormer, compared with other state‑of‑the‑art methods.

PaperID: 12470, https://arxiv.org/pdf/2411.01442.pdf

Authors: Beomseok Kang, Priyabrata Saha, Sudarshan Sharma, Biswadeep Chakraborty, Saibal Mukhopadhyay

Title: Online Relational Inference for Evolving Multi-agent Interacting Systems

Abstract:
We introduce a novel framework, Online Relational Inference (ORI), designed to efficiently identify hidden interaction graphs in evolving multi‑agent interacting systems using streaming data. Unlike traditional offline methods that rely on a fixed training set, ORI employs online backpropagation, updating the model with each new data point, thereby allowing it to adapt to changing environments in real‑time. A key innovation is the use of an adjacency matrix as a trainable parameter, optimized through a new adaptive learning rate technique called AdaRelation, which adjusts based on the historical sensitivity of the decoder to changes in the interaction graph. Additionally, a data augmentation method named Trajectory Mirror (TM) is introduced to improve generalization by exposing the model to varied trajectory patterns. Experimental results on both synthetic datasets and real‑world data (CMU MoCap for human motion) demonstrate that ORI significantly improves the accuracy and adaptability of relational inference in dynamic settings compared to existing methods. This approach is model‑agnostic, enabling seamless integration with various neural relational inference (NRI) architectures, and offers a robust solution for real‑time applications in complex, evolving systems.

PaperID: 12471, https://arxiv.org/pdf/2411.01184.pdf

Authors: Chanjuan Liu, Jinmiao Cong, Bingcai Chen, Yaochu Jin, Enqiang Zhu

Title: Guiding Multi-agent Multi-task Reinforcement Learning by a Hierarchical Framework with Logical Reward Shaping

Abstract:
Multi‑agent hierarchical reinforcement learning (MAHRL) has been studied as an effective means to solve intelligent decision problems in complex and large‑scale environments. However, most current MAHRL algorithms follow the traditional way of using reward functions in reinforcement learning, which limits their use to a single task. This study aims to design a multi‑agent cooperative algorithm with logic reward shaping (LRS), which uses a more flexible way of setting the rewards, allowing for the effective completion of multi‑tasks. LRS uses Linear Temporal Logic (LTL) to express the internal logic relation of subtasks within a complex task. Then, it evaluates whether the subformulae of the LTL expressions are satisfied based on a designed reward structure. This helps agents to learn to effectively complete tasks by adhering to the LTL expressions, thus enhancing the interpretability and credibility of their decisions. To enhance coordination and cooperation among multiple agents, a value iteration technique is designed to evaluate the actions taken by each agent. Based on this evaluation, a reward function is shaped for coordination, which enables each agent to evaluate its status and complete the remaining subtasks through experiential learning. Experiments have been conducted on various types of tasks in the Minecraft‑like environment. The results demonstrate that the proposed algorithm can improve the performance of multi‑agents when learning to complete multi‑tasks.

PaperID: 12472, https://arxiv.org/pdf/2411.01166.pdf

Authors: Weifan Long, Wen Wen, Peng Zhai, Lihua Zhang

Title: Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Abstract:
Zero‑shot coordination problem in multi‑agent reinforcement learning (MARL), which requires agents to adapt to unseen agents, has attracted increasing attention. Traditional approaches often rely on the Self‑Play (SP) framework to generate a diverse set of policies in a policy pool, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real‑world scenarios that demand agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO)‑where individuals maintain stable value orientations during interactions with others‑we propose a novel framework called \emphRole Play (RP). RP employs role embeddings to transform the challenge of policy diversity into a more manageable diversity of roles. It trains a common policy with role embedding observations and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximate optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role‑based policy. Experimental results in both cooperative (Overcooked) and mixed‑motive games (Harvest, CleanUp) reveal that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.

PaperID: 12473, https://arxiv.org/pdf/2411.01143.pdf

Authors: Xiaoqing Zhang, Xiuying Chen, Yuhan Liu, Jianzhou Wang, Zhenxing Hu, Rui Yan

Title: A Large-scale Time-aware Agents Simulation for Influencer Selection in Digital Advertising Campaigns

Abstract:
In the digital world, influencers are pivotal as opinion leaders, shaping the views and choices of their influencees. Modern advertising often follows this trend, where marketers choose appropriate influencers for product endorsements, based on thorough market analysis. Previous studies on influencer selection have typically relied on numerical representations of individual opinions and interactions, a method that simplifies the intricacies of social dynamics. In this work, we first introduce a Time‑aware Influencer Simulator (TIS), helping promoters identify and select the right influencers to market their products, based on LLM simulation. To validate our approach, we conduct experiments on the public advertising campaign dataset SAGraph which encompasses social relationships, posts, and user interactions. The results show that our method outperforms traditional numerical feature‑based approaches and methods using limited LLM agents. Our research shows that simulating user timelines and content lifecycles over time simplifies scaling, allowing for large‑scale agent simulations in social networks. Additionally, LLM‑based agents for social recommendations and advertising offer substantial benefits for decision‑making in promotional campaigns.

PaperID: 12474, https://arxiv.org/pdf/2411.00954.pdf

Authors: Xiaohang Tang, Chiyuan Wang, Chengdong Ma, Ilija Bogunovic, Stephen McAleer, Yaodong Yang

Title: Sample-Efficient Regret-Minimizing Double Oracle in Extensive-Form Games

Abstract:
Extensive‑Form Game (EFG) represents a fundamental model for analyzing sequential interactions among multiple agents and the primary challenge to solve it lies in mitigating sample complexity. Existing research indicated that Double Oracle (DO) can reduce the sample complexity dependence on the information set number |S| to the final restricted game size X in solving EFG. This is attributed to the early convergence of full‑game Nash Equilibrium (NE) through iteratively solving restricted games. However, we prove that the state‑of‑the‑art Extensive‑Form Double Oracle (XDO) exhibits exponential sample complexity of X, due to its exponentially increasing restricted game expansion frequency. Here we introduce Adaptive Double Oracle (AdaDO) to significantly alleviate sample complexity to polynomial by deploying the optimal expansion frequency. Furthermore, to comprehensively study the principles and influencing factors underlying sample complexity, we introduce a novel theoretical framework Regret‑Minimizing Double Oracle (RMDO) to provide directions for designing efficient DO algorithms. Empirical results demonstrate that AdaDO attains the more superior approximation of NE with less sample complexity than the strong baselines including Linear CFR, MCCFR and existing DO. Importantly, combining RMDO with warm starting and stochastic regret minimization further improves convergence rate and scalability, thereby paving the way for addressing complex multi‑agent tasks.

PaperID: 12475, https://arxiv.org/pdf/2411.00887.pdf

Authors: Chunyan Mu, Nir Oren

Title: Measuring Responsibility in Multi-Agent Systems

Abstract:
We introduce a family of quantitative measures of responsibility in multi‑agent planning, building upon the concepts of causal responsibility proposed by Parker et al.~[ParkerGL23]. These concepts are formalised within a variant of probabilistic alternating‑time temporal logic. Unlike existing approaches, our framework ascribes responsibility to agents for a given outcome by linking probabilities between behaviours and responsibility through three metrics, including an entropy‑based measurement of responsibility. This latter measure is the first to capture the causal responsibility properties of outcomes over time, offering an asymptotic measurement that reflects the difficulty of achieving these outcomes. Our approach provides a fresh understanding of responsibility in multi‑agent systems, illuminating both the qualitative and quantitative aspects of agents' roles in achieving or preventing outcomes.

PaperID: 12476, https://arxiv.org/pdf/2411.00781.pdf

Authors: Zirui Song, Guangxian Ouyang, Meng Fang, Hongbin Na, Zijing Shi, Zhenhao Chen, Yujie Fu, Zeyu Zhang, Shiyu Jiang, Miao Fang, Ling Chen, Xiuying Chen

Title: Hazards in Daily Life? Enabling Robots to Proactively Detect and Resolve Anomalies

Abstract:
Existing household robots have made significant progress in performing routine tasks, such as cleaning floors or delivering objects. However, a key limitation of these robots is their inability to recognize potential problems or dangers in home environments. For example, a child may pick up and ingest medication that has fallen on the floor, posing a serious risk. We argue that household robots should proactively detect such hazards or anomalies within the home, and propose the task of anomaly scenario generation. We leverage foundational models instead of relying on manually labeled data to build simulated environments. Specifically, we introduce a multi‑agent brainstorming approach, where agents collaborate and generate diverse scenarios covering household hazards, hygiene management, and child safety. These textual task descriptions are then integrated with designed 3D assets to simulate realistic environments. Within these constructed environments, the robotic agent learns the necessary skills to proactively discover and handle the proposed anomalies through task decomposition, and optimal learning approach selection. We demonstrate that our generated environment outperforms others in terms of task description and scene diversity, ultimately enabling robotic agents to better address potential household hazards.

PaperID: 12477, https://arxiv.org/pdf/2411.00728.pdf

Authors: Mohammad Feizabadi, Arman Hosseini, Zakaria Yahouni

Title: Multi-Agent Deep Q-Network with Layer-based Communication Channel for Autonomous Internal Logistics Vehicle Scheduling in Smart Manufacturing

Abstract:
In smart manufacturing, scheduling autonomous internal logistic vehicles is crucial for optimizing operational efficiency. This paper proposes a multi‑agent deep Q‑network (MADQN) with a layer‑based communication channel (LBCC) to address this challenge. The main goals are to minimize total job tardiness, reduce the number of tardy jobs, and lower vehicle energy consumption. The method is evaluated against nine well‑known scheduling heuristics, demonstrating its effectiveness in handling dynamic job shop behaviors like job arrivals and workstation unavailabilities. The approach also proves scalable, maintaining performance across different layouts and larger problem instances, highlighting the robustness and adaptability of MADQN with LBCC in smart manufacturing.

PaperID: 12478, https://arxiv.org/pdf/2411.00563.pdf

Authors: Leo Ardon, Benjamin Patrick Evans, Deepeka Garg, Annapoorani Lakshmi Narayanan, Makada Henry-Nickie, Sumitra Ganesh

Title: Simulate and Optimise: A two-layer mortgage simulator for designing novel mortgage assistance products

Abstract:
We develop a novel two‑layer approach for optimising mortgage relief products through a simulated multi‑agent mortgage environment. While the approach is generic, here the environment is calibrated to the US mortgage market based on publicly available census data and regulatory guidelines. Through the simulation layer, we assess the resilience of households to exogenous income shocks, while the optimisation layer explores strategies to improve the robustness of households to these shocks by making novel mortgage assistance products available to households. Households in the simulation are adaptive, learning to make mortgage‑related decisions (such as product enrolment or strategic foreclosures) that maximize their utility, balancing their available liquidity and equity. We show how this novel two‑layer simulation approach can successfully design novel mortgage assistance products to improve household resilience to exogenous shocks, and balance the costs of providing such products through post‑hoc analysis. Previously, such analysis could only be conducted through expensive pilot studies involving real participants, demonstrating the benefit of the approach for designing and evaluating financial products.

PaperID: 12479, https://arxiv.org/pdf/2411.00473.pdf

Authors: Yuchen Song, Yao Zhang, Anni Zhou, Yan Shi, Shikui Shen, Xiongyan Tang, Jin Li, Min Zhang, Danshi Wang

Title: Synergistic Interplay of Large Language Model and Digital Twin for Autonomous Optical Networks: Field Demonstrations

Abstract:
The development of large language models (LLM) has revolutionized various fields and is anticipated to drive the advancement of autonomous systems. In the context of autonomous optical networks, creating a high‑level cognitive agent in the control layer remains a challenge. However, LLM is primarily developed for natural language processing tasks, rendering them less effective in predicting the physical dynamics of optical communications. Moreover, optical networks demand rigorous stability, where direct deployment of strategies generated from LLM poses safety concerns. In this paper, a digital twin (DT)‑enhanced LLM scheme is proposed to facilitate autonomous optical networks. By leveraging monitoring data and advanced models, the DT of optical networks can accurately characterize their physical dynamics, furnishing LLMs with dynamic‑updated information for reliable decision‑making. Prior to deployment, the generated strategies from LLM can be pre‑verified in the DT platform, which also provides feedback to the LLM for further refinement of strategies. The synergistic interplay between DT and LLM for autonomous optical networks is demonstrated through three scenarios: performance optimization under dynamic loadings in an experimental C+L‑band long‑haul transmission link, protection switching for device upgrading in a field‑deployed six‑node mesh network, and performance recovery after fiber cuts in a field‑deployed C+L‑band transmission link.

PaperID: 12480, https://arxiv.org/pdf/2411.00427.pdf

Authors: Aman Gupta, Anirudh Ravichandran, Ziji Zhang, Swair Shah, Anurag Beniwal, Narayanan Sadagopan

Title: DARD: A Multi-Agent Approach for Task-Oriented Dialog Systems

Abstract:
Task‑oriented dialogue systems are essential for applications ranging from customer service to personal assistants and are widely used across various industries. However, developing effective multi‑domain systems remains a significant challenge due to the complexity of handling diverse user intents, entity types, and domain‑specific knowledge across several domains. In this work, we propose DARD (Domain Assigned Response Delegation), a multi‑agent conversational system capable of successfully handling multi‑domain dialogs. DARD leverages domain‑specific agents, orchestrated by a central dialog manager agent. Our extensive experiments compare and utilize various agent modeling approaches, combining the strengths of smaller fine‑tuned models (Flan‑T5‑large & Mistral‑7B) with their larger counterparts, Large Language Models (LLMs) (Claude Sonnet 3.0). We provide insights into the strengths and limitations of each approach, highlighting the benefits of our multi‑agent framework in terms of flexibility and composability. We evaluate DARD using the well‑established MultiWOZ benchmark, achieving state‑of‑the‑art performance by improving the dialogue inform rate by 6.6% and the success rate by 4.1% over the best‑performing existing approaches. Additionally, we discuss various annotator discrepancies and issues within the MultiWOZ dataset and its evaluation system.

PaperID: 12481, https://arxiv.org/pdf/2411.00397.pdf

Authors: Xiaoying Liu, Anping Chen, Kechen Zheng, Kaikai Chi, Bin Yang, Tarik Taleb

Title: Distributed Computation Offloading for Energy Provision Minimization in WP-MEC Networks with Multiple HAPs

Abstract:
This paper investigates a wireless powered mobile edge computing (WP‑MEC) network with multiple hybrid access points (HAPs) in a dynamic environment, where wireless devices (WDs) harvest energy from radio frequency (RF) signals of HAPs, and then compute their computation data locally (i.e., local computing mode) or offload it to the chosen HAPs (i.e., edge computing mode). In order to pursue a green computing design, we formulate an optimization problem that minimizes the long‑term energy provision of the WP‑MEC network subject to the energy, computing delay and computation data demand constraints. The transmit power of HAPs, the duration of the wireless power transfer (WPT) phase, the offloading decisions of WDs, the time allocation for offloading and the CPU frequency for local computing are jointly optimized adapting to the time‑varying generated computation data and wireless channels of WDs. To efficiently address the formulated non‑convex mixed integer programming (MIP) problem in a distributed manner, we propose a Two‑stage Multi‑Agent deep reinforcement learning‑based Distributed computation Offloading (TMADO) framework, which consists of a high‑level agent and multiple low‑level agents. The high‑level agent residing in all HAPs optimizes the transmit power of HAPs and the duration of the WPT phase, while each low‑level agent residing in each WD optimizes its offloading decision, time allocation for offloading and CPU frequency for local computing. Simulation results show the superiority of the proposed TMADO framework in terms of the energy provision minimization.

PaperID: 12482, https://arxiv.org/pdf/2411.00382.pdf

Authors: Shengchao Hu, Li Shen, Ya Zhang, Dacheng Tao

Title: Communication Learning in Multi-Agent Systems from Graph Modeling Perspective

Abstract:
In numerous artificial intelligence applications, the collaborative efforts of multiple intelligent agents are imperative for the successful attainment of target objectives. To enhance coordination among these agents, a distributed communication framework is often employed. However, indiscriminate information sharing among all agents can be resource‑intensive, and the adoption of manually pre‑defined communication architectures imposes constraints on inter‑agent communication, thus limiting the potential for effective collaboration. Moreover, the communication framework often remains static during inference, which may result in sustained high resource consumption, as in most cases, only key decisions necessitate information sharing among agents. In this study, we introduce a novel approach wherein we conceptualize the communication architecture among agents as a learnable graph. We formulate this problem as the task of determining the communication graph while enabling the architecture parameters to update normally, thus necessitating a bi‑level optimization process. Utilizing continuous relaxation of the graph representation and incorporating attention units, our proposed approach, CommFormer, efficiently optimizes the communication graph and concurrently refines architectural parameters through gradient descent in an end‑to‑end manner. Additionally, we introduce a temporal gating mechanism for each agent, enabling dynamic decisions on whether to receive shared information at a given time, based on current observations, thus improving decision‑making efficiency. Extensive experiments on a variety of cooperative tasks substantiate the robustness of our model across diverse cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies regardless of changes in the number of agents.

PaperID: 12483, https://arxiv.org/pdf/2411.00248.pdf

Authors: Yubin Kim, Chanwoo Park, Hyewon Jeong, Cristina Grau-Vilchez, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park

Title: A Demonstration of Adaptive Collaboration of Large Language Models for Medical Decision-Making

Abstract:
Medical Decision‑Making (MDM) is a multi‑faceted process that requires clinicians to assess complex multi‑modal patient data patient, often collaboratively. Large Language Models (LLMs) promise to streamline this process by synthesizing vast medical knowledge and multi‑modal health data. However, single‑agent are often ill‑suited for nuanced medical contexts requiring adaptable, collaborative problem‑solving. Our MDAgents addresses this need by dynamically assigning collaboration structures to LLMs based on task complexity, mimicking real‑world clinical collaboration and decision‑making. This framework improves diagnostic accuracy and supports adaptive responses in complex, real‑world medical scenarios, making it a valuable tool for clinicians in various healthcare settings, and at the same time, being more efficient in terms of computing cost than static multi‑agent decision making methods.

PaperID: 12484, https://arxiv.org/pdf/2411.00223.pdf

Authors: Sara Honarvar, Yancy Diaz-Mercado

Title: Learning Optimal Interaction Weights in Multi-Agents Systems

Abstract:
This paper presents a spatio‑temporal inverse optimal control framework for understanding interactions in multi‑agent systems (MAS). We employ a graph representation approach and model the dynamics of interactions between agents as state‑dependent edge weights in a consensus algorithm, incorporating both spatial and temporal dynamics. Our method learns these edge weights from trajectory observations, such as provided by expert demonstrations, which allows us to capture the complexity of nonlinear, distributed interaction behaviors. We derive necessary and sufficient conditions for the optimality of these interaction weights, explaining how the network topology affects MAS coordination. The proposed method is demonstrated on a multi‑agent formation control problem, where we show its effectiveness in recovering the interaction weights and coordination patterns from sample trajectory data.

PaperID: 12485, https://arxiv.org/pdf/2411.00205.pdf

Authors: Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia

Title: Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning

Abstract:
Goal‑conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. cDFAs balance the need for formal temporal semantics with ease of interpretation: if one can understand a flow chart, one can understand a cDFA. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach‑avoid tasks and propose pre‑training graph neural network embeddings on "reach‑avoid derived" DFAs. Through empirical evaluation, we demonstrate that the proposed pre‑training method enables zero‑shot generalization to various cDFA task classes and accelerated policy specialization without the myopic suboptimality of hierarchical methods.

PaperID: 12486, https://arxiv.org/pdf/2411.00188.pdf

Authors: Yu Pan, Jianxin Sun, Hongfeng Yu, Joe Luck, Geng Bai, Nipuna Chamara, Yufeng Ge, Tala Awada

Title: Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis

Abstract:
Current agricultural data management and analysis paradigms are to large extent traditional, in which data collecting, curating, integration, loading, storing, sharing and analyzing still involve too much human effort and know‑how. The experts, researchers and the farm operators need to understand the data and the whole process of data management pipeline to make fully use of the data. The essential problem of the traditional paradigm is the lack of a layer of orchestrational intelligence which can understand, organize and coordinate the data processing utilities to maximize data management and analysis outcome. The emerging reasoning and tool mastering abilities of large language models (LLM) make it a potentially good fit to this position, which helps a shift from the traditional user‑driven paradigm to AI‑driven paradigm. In this paper, we propose and explore the idea of a LLM based copilot for autonomous agricultural data management and analysis. Based on our previously developed platform of Agricultural Data Management and Analytics (ADMA), we build a proof‑of‑concept multi‑agent system called ADMA Copilot, which can understand user's intent, makes plans for data processing pipeline and accomplishes tasks automatically, in which three agents: a LLM based controller, an input formatter and an output formatter collaborate together. Different from existing LLM based solutions, by defining a meta‑program graph, our work decouples control flow and data flow to enhance the predictability of the behaviour of the agents. Experiments demonstrates the intelligence, autonomy, efficacy, efficiency, extensibility, flexibility and privacy of our system. Comparison is also made between ours and existing systems to show the superiority and potential of our system.

PaperID: 12487, https://arxiv.org/pdf/2411.00181.pdf

Authors: Curtis Bechtel, Shaddin Dughmi

Title: Efficient Multi-Agent Delegated Search

Abstract:
Consider a principal who wants to search through a space of stochastic solutions for one maximizing their utility. If the principal cannot conduct this search on their own, they may instead delegate this problem to an agent with distinct and potentially misaligned utilities. This is called delegated search, and the principal in such problems faces a mechanism design problem in which they must incentivize the agent to find and propose a solution maximizing the principal's expected utility. Following prior work in this area, we consider mechanisms without payments and aim to achieve a multiplicative approximation of the principal's utility when they solve the problem without delegation. In this work, we investigate a natural and recently studied generalization of this model to multiple agents and find nearly tight bounds on the principal's approximation as the number of agents increases. As one might expect, this approximation approaches 1 with increasing numbers of agents, but, somewhat surprisingly, we show that this is largely not due to direct competition among agents.

PaperID: 12488, https://arxiv.org/pdf/2411.00146.pdf

Authors: Chunyan Mu, Muhammad Najib, Nir Oren

Title: Responsibility-aware Strategic Reasoning in Probabilistic Multi-Agent Systems

Abstract:
Responsibility plays a key role in the development and deployment of trustworthy autonomous systems. In this paper, we focus on the problem of strategic reasoning in probabilistic multi‑agent systems with responsibility‑aware agents. We introduce the logic PATL+R, a variant of Probabilistic Alternating‑time Temporal Logic. The novelty of PATL+R lies in its incorporation of modalities for causal responsibility, providing a framework for responsibility‑aware multi‑agent strategic reasoning. We present an approach to synthesise joint strategies that satisfy an outcome specified in PATL+R, while optimising the share of expected causal responsibility and reward. This provides a notion of balanced distribution of responsibility and reward gain among agents. To this end, we utilise the Nash equilibrium as the solution concept for our strategic reasoning problem and demonstrate how to compute responsibility‑aware Nash equilibrium strategies via a reduction to parametric model checking of concurrent stochastic multi‑player games.

PaperID: 12489, https://arxiv.org/pdf/2411.00081.pdf

Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M. Turner, Eric Undersander, Tsung-Yen Yang

Title: PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

Abstract:
We present a benchmark for Planning And Reasoning Tasks in humaN‑Robot collaboration (PARTNR) designed to study human‑robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi‑automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state‑of‑the‑art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine‑tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.

PaperID: 12490, https://arxiv.org/pdf/2411.00053.pdf

Authors: Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, Yang Liu

Title: ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration

Abstract:
Large language models (LLMs) have demonstrated a remarkable ability to serve as general‑purpose tools for various language‑based tasks. Recent works have demonstrated that the efficacy of such models can be improved through iterative dialog between multiple models. While these paradigms show promise in improving model efficacy, most works in this area treat collaboration as an emergent behavior, rather than a learned behavior. In doing so, current multi‑agent frameworks rely on collaborative behaviors to have been sufficiently trained into off‑the‑shelf models. To address this limitation, we propose ACC‑Collab, an Actor‑Critic based learning framework to produce a two‑agent team (an actor‑agent and a critic‑agent) specialized in collaboration. We demonstrate that ACC‑Collab outperforms SotA multi‑agent techniques on a wide array of benchmarks.

PaperID: 12491, https://arxiv.org/pdf/2411.00028.pdf

Authors: Zhilun Zhou, Jingyang Fan, Yu Liu, Fengli Xu, Depeng Jin, Yong Li

Title: Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN

Abstract:
The fast development of location‑based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task‑relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi‑agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location‑based knowledge graph (LBKG) to integrate multi‑sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta‑paths in the LBKG for each type of socioeconomic prediction task, and design a semantic‑guided attention module for knowledge fusion with meta‑paths. Moreover, we introduce a cross‑task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta‑paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.

PaperID: 12492, https://arxiv.org/pdf/2410.24152.pdf

Authors: Jiaqi Liu, Chengkai Xu, Peng Hang, Jian Sun, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding

Title: Language-Driven Policy Distillation for Cooperative Driving in Multi-Agent Reinforcement Learning

Abstract:
The cooperative driving technology of Connected and Autonomous Vehicles (CAVs) is crucial for improving the efficiency and safety of transportation systems. Learning‑based methods, such as Multi‑Agent Reinforcement Learning (MARL), have demonstrated strong capabilities in cooperative decision‑making tasks. However, existing MARL approaches still face challenges in terms of learning efficiency and performance. In recent years, Large Language Models (LLMs) have rapidly advanced and shown remarkable abilities in various sequential decision‑making tasks. To enhance the learning capabilities of cooperative agents while ensuring decision‑making efficiency and cost‑effectiveness, we propose LDPD, a language‑driven policy distillation method for guiding MARL exploration. In this framework, a teacher agent based on LLM trains smaller student agents to achieve cooperative decision‑making through its own decision‑making demonstrations. The teacher agent enhances the observation information of CAVs and utilizes LLMs to perform complex cooperative decision‑making reasoning, which also leverages carefully designed decision‑making tools to achieve expert‑level decisions, providing high‑quality teaching experiences. The student agent then refines the teacher's prior knowledge into its own model through gradient policy updates. The experiments demonstrate that the students can rapidly improve their capabilities with minimal guidance from the teacher and eventually surpass the teacher's performance. Extensive experiments show that our approach demonstrates better performance and learning efficiency compared to baseline methods.

PaperID: 12493, https://arxiv.org/pdf/2410.24032.pdf

Authors: Yingzhe Peng, Xiaoting Qin, Zhiyang Zhang, Jue Zhang, Qingwei Lin, Xu Yang, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

Title: Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

Abstract:
The rise of large language models (LLMs) has revolutionized user interactions with knowledge‑based systems, enabling chatbots to synthesize vast amounts of information and assist with complex, exploratory tasks. However, LLM‑based chatbots often struggle to provide personalized support, particularly when users start with vague queries or lack sufficient contextual information. This paper introduces the Collaborative Assistant for Personalized Exploration (CARE), a system designed to enhance personalization in exploratory tasks by combining a multi‑agent LLM framework with a structured user interface. CARE's interface consists of a Chat Panel, Solution Panel, and Needs Panel, enabling iterative query refinement and dynamic solution generation. The multi‑agent framework collaborates to identify both explicit and implicit user needs, delivering tailored, actionable solutions. In a within‑subject user study with 22 participants, CARE was consistently preferred over a baseline LLM chatbot, with users praising its ability to reduce cognitive load, inspire creativity, and provide more tailored solutions. Our findings highlight CARE's potential to transform LLM‑based systems from passive information retrievers to proactive partners in personalized problem‑solving and exploration.

PaperID: 12494, https://arxiv.org/pdf/2410.23867.pdf

Authors: Benjamin Howson, Sarah Filippi, Ciara Pike-Burke

Title: QuACK: A Multipurpose Queuing Algorithm for Cooperative $k$-Armed Bandits

Abstract:
We study the cooperative stochastic k‑armed bandit problem, where a network of m agents collaborate to find the optimal action. In contrast to most prior work on this problem, which focuses on extending a specific algorithm to the multi‑agent setting, we provide a black‑box reduction that allows us to extend any single‑agent bandit algorithm to the multi‑agent setting. Under mild assumptions on the bandit environment, we prove that our reduction transfers the regret guarantees of the single‑agent algorithm to the multi‑agent setting. These guarantees are tight in subgaussian environments, in that using a near minimax optimal single‑player algorithm is near minimax optimal in the multi‑player setting up to an additive graph‑dependent quantity. Our reduction and theoretical results are also general, and apply to many different bandit settings. By plugging in appropriate single‑player algorithms, we can easily develop provably efficient algorithms for many multi‑player settings such as heavy‑tailed bandits, duelling bandits and bandits with local differential privacy, among others. Experimentally, our approach is competitive with or outperforms specialised multi‑agent algorithms.

PaperID: 12495, https://arxiv.org/pdf/2410.23592.pdf

Authors: Binyan Xu, Yufan Dai, Afzal Suleman, Yang Shi

Title: Adaptive Distributed Observer-based Model Predictive Control for Multi-agent Formation with Resilience to Communication Link Faults

Abstract:
In order to address the nonlinear multi‑agent formation tracking control problem with input constraints and unknown communication faults, a novel adaptive distributed observer‑based distributed model predictive control method is developed in this paper. This design employs adaptive distributed observers in local control systems to estimate the leader's state, dynamics, and relative positioning with respect to the leader. Utilizing the estimated data as local references, the original formation tracking control problem can be decomposed into several fully localized tracking control problems, which can be efficiently solved by the local predictive controller. Through the incorporation of adaptive distributed observers, this proposed design not only enhances the resilience of distributed formation tracking against communication faults but also simplifies the distributed model predictive control formulation.

PaperID: 12496, https://arxiv.org/pdf/2410.23535.pdf

Authors: Daniel Philipov, Vardhan Dongre, Gokhan Tur, Dilek Hakkani-Tür

Title: Simulating User Agents for Embodied Conversational-AI

Abstract:
Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large‑scale, diverse datasets of situated human‑robot dialogues to train and evaluate such agents is expensive, labor‑intensive, and time‑consuming. To address this challenge, we propose building a large language model (LLM)‑based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step, the user agent may observe" the robot actions or speak" to either intervene with the robot or answer questions. Such a user agent assists in improving the scalability and efficiency of embodied dialogues dataset generation and is critical for enhancing and evaluating the robot's interaction and task completion ability, as well as for research in reinforcement learning using AI feedback. We evaluate our user agent's ability to generate human‑like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero‑shot prompting to predict dialogue acts, few‑shot prompting, and fine‑tuning on the TEACh training subset. Results show the LLM‑based user agent achieves an F‑measure of 42% with zero‑shot prompting and 43.4% with few‑shot prompting in mimicking human speaking behavior. Through fine‑tuning, performance in deciding when to speak remained stable, while deciding what to say improved from 51.1% to 62.5%. These findings showcase the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.

PaperID: 12497, https://arxiv.org/pdf/2410.23396.pdf

Authors: Qiliang Chen, Babak Heydari

Title: Adaptive Network Intervention for Complex Systems: A Hierarchical Graph Reinforcement Learning Approach

Abstract:
Effective governance and steering of behavior in complex multi‑agent systems (MAS) are essential for managing system‑wide outcomes, particularly in environments where interactions are structured by dynamic networks. In many applications, the goal is to promote pro‑social behavior among agents, where network structure plays a pivotal role in shaping these interactions. This paper introduces a Hierarchical Graph Reinforcement Learning (HGRL) framework that governs such systems through targeted interventions in the network structure. Operating within the constraints of limited managerial authority, the HGRL framework demonstrates superior performance across a range of environmental conditions, outperforming established baseline methods. Our findings highlight the critical influence of agent‑to‑agent learning (social learning) on system behavior: under low social learning, the HGRL manager preserves cooperation, forming robust core‑periphery networks dominated by cooperators. In contrast, high social learning accelerates defection, leading to sparser, chain‑like networks. Additionally, the study underscores the importance of the system manager's authority level in preventing system‑wide failures, such as agent rebellion or collapse, positioning HGRL as a powerful tool for dynamic network‑based governance.

PaperID: 12498, https://arxiv.org/pdf/2410.23393.pdf

Authors: Qiliang Chen, Babak Heydari

Title: Resource Governance in Networked Systems via Integrated Variational Autoencoders and Reinforcement Learning

Abstract:
We introduce a framework that integrates variational autoencoders (VAE) with reinforcement learning (RL) to balance system performance and resource usage in multi‑agent systems by dynamically adjusting network structures over time. A key innovation of this method is its capability to handle the vast action space of the network structure. This is achieved by combining Variational Auto‑Encoder and Deep Reinforcement Learning to control the latent space encoded from the network structures. The proposed method, evaluated on the modified OpenAI particle environment under various scenarios, not only demonstrates superior performance compared to baselines but also reveals interesting strategies and insights through the learned behaviors.

PaperID: 12499, https://arxiv.org/pdf/2410.23379.pdf

Authors: Monish Reddy Kotturu, Saniya Vahedian Movahed, Paul Robinette, Kshitij Jerath, Amanda Redlich, Reza Azadeh

Title: Relational Weight Optimization for Enhancing Team Performance in Multi-Agent Multi-Armed Bandits

Abstract:
We introduce an approach to improve team performance in a Multi‑Agent Multi‑Armed Bandit (MAMAB) framework using Fastest Mixing Markov Chain (FMMC) and Fastest Distributed Linear Averaging (FDLA) optimization algorithms. The multi‑agent team is represented using a fixed relational network and simulated using the Coop‑UCB2 algorithm. The edge weights of the communication network directly impact the time taken to reach distributed consensus. Our goal is to shrink the timescale on which the convergence of the consensus occurs to achieve optimal team performance and maximize reward. Through our experiments, we show that the convergence to team consensus occurs slightly faster in large constrained networks.

PaperID: 12500, https://arxiv.org/pdf/2410.23298.pdf

Authors: Jilan Samiuddin, Benoit Boulet, Di Wu

Title: Trajectory Prediction for Autonomous Driving using Agent-Interaction Graph Embedding

Abstract:
Trajectory prediction module in an autonomous driving system is crucial for the decision‑making and safety of the autonomous agent car and its surroundings. This work presents a novel scheme called AiGem (Agent‑Interaction Graph Embedding) to predict traffic vehicle trajectories around the autonomous car. AiGem tackles this problem in four steps. First, AiGem formulates the historical traffic interaction with the autonomous agent as a graph in two steps: (1) at each time step of the history frames, agent‑interactions are captured using spatial edges between the agents (nodes of the graph), and then, (2) connects the spatial graphs in chronological order using temporal edges. Then, AiGem applies a depthwise graph encoder network on the spatial‑temporal graph to generate graph embedding, i.e., embedding of all the nodes in the graph. Next, a sequential Gated Recurrent Unit decoder network uses the embedding of the current timestamp to get the decoded states. Finally, an output network comprising a Multilayer Perceptron is used to predict the trajectories utilizing the decoded states as its inputs. Results show that AiGem outperforms the state‑of‑the‑art deep learning algorithms for longer prediction horizons.

PaperID: 12501, https://arxiv.org/pdf/2410.23252.pdf

Authors: Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu

Title: Evaluating Cultural and Social Awareness of LLM Web Agents

Abstract:
As large language models (LLMs) expand into performing as agents for real‑world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web‑based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm‑violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non‑agent than in web‑based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods: prompting and fine‑tuning, and find that combining both methods can offer complementary advantages ‑‑ fine‑tuning on culture‑specific datasets significantly enhances the agents' ability to generalize across different regions, while prompting boosts the agents' ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents' cultural and social awareness during the development cycle.

PaperID: 12502, https://arxiv.org/pdf/2410.23086.pdf

Authors: Haiyuan Li, Hari Madhukumar, Peizheng Li, Yuelin Liu, Yiran Teng, Yulei Wu, Ning Wang, Shuangyi Yan, Dimitra Simeonidou

Title: Towards Practical Operation of Deep Reinforcement Learning Agents in Real-World Network Management at Open RAN Edges

Abstract:
Deep Reinforcement Learning (DRL) has emerged as a powerful solution for meeting the growing demands for connectivity, reliability, low latency and operational efficiency in advanced networks. However, most research has focused on theoretical analysis and simulations, with limited investigation into real‑world deployment. To bridge the gap and support practical DRL deployment for network management, we first present an orchestration framework that integrates ETSI Multi‑access Edge Computing (MEC) with Open RAN, enabling seamless adoption of DRL‑based strategies across different time scales while enhancing agent lifecycle management. We then identify three critical challenges hindering DRL's real‑world deployment, including (1) asynchronous requests from unpredictable or bursty traffic, (2) adaptability and generalization across heterogeneous topologies and evolving service demands, and (3) prolonged convergence and service interruptions due to exploration in live operational environments. To address these challenges, we propose a three‑fold solution strategy: (a) advanced time‑series integration for handling asynchronized traffic, (b) flexible architecture design such as multi‑agent DRL and incremental learning to support heterogeneous scenarios, and (c) simulation‑driven deployment with transfer learning to reduce convergence time and service disruptions. Lastly, the feasibility of the MEC‑O‑RAN architecture is validated on an urban‑wide testing infrastructure, and two real‑world use cases are presented, showcasing the three identified challenges and demonstrating the effectiveness of the proposed solutions.

PaperID: 12503, https://arxiv.org/pdf/2410.22932.pdf

Authors: Jonas Becker

Title: Multi-Agent Large Language Models for Conversational Task-Solving

Abstract:
In an era where single large language models have dominated the landscape of artificial intelligence for years, multi‑agent systems arise as new protagonists in conversational task‑solving. While previous studies have showcased their potential in reasoning tasks and creative endeavors, an analysis of their limitations concerning the conversational paradigms and the impact of individual agents is missing. It remains unascertained how multi‑agent discussions perform across tasks of varying complexity and how the structure of these conversations influences the process. To fill that gap, this work systematically evaluates multi‑agent systems across various discussion paradigms, assessing their strengths and weaknesses in both generative tasks and question‑answering tasks. Alongside the experiments, I propose a taxonomy of 20 multi‑agent research studies from 2022 to 2024, followed by the introduction of a framework for deploying multi‑agent LLMs in conversational task‑solving. I demonstrate that while multi‑agent systems excel in complex reasoning tasks, outperforming a single model by leveraging expert personas, they fail on basic tasks. Concretely, I identify three challenges that arise: 1) While longer discussions enhance reasoning, agents fail to maintain conformity to strict task requirements, which leads to problem drift, making shorter conversations more effective for basic tasks. 2) Prolonged discussions risk alignment collapse, raising new safety concerns for these systems. 3) I showcase discussion monopolization through long generations, posing the problem of fairness in decision‑making for tasks like summarization. This work uncovers both the potential and challenges that arise with multi‑agent interaction and varying conversational paradigms, providing insights into how future research could improve the efficiency, performance, and safety of multi‑agent LLMs.

PaperID: 12504, https://arxiv.org/pdf/2410.22916.pdf

Authors: Yanchu Guan, Dong Wang, Yan Wang, Haiqing Wang, Renen Sun, Chenyi Zhuang, Jinjie Gu, Zhixuan Chu

Title: Explainable Behavior Cloning: Teaching Large Language Model Agents through Learning by Demonstration

Abstract:
Autonomous mobile app interaction has become increasingly important with growing complexity of mobile applications. Developing intelligent agents that can effectively navigate and interact with mobile apps remains a significant challenge. In this paper, we propose an Explainable Behavior Cloning LLM Agent (EBC‑LLMAgent), a novel approach that combines large language models (LLMs) with behavior cloning by learning demonstrations to create intelligent and explainable agents for autonomous mobile app interaction. EBC‑LLMAgent consists of three core modules: Demonstration Encoding, Code Generation, and UI Mapping, which work synergistically to capture user demonstrations, generate executable codes, and establish accurate correspondence between code and UI elements. We introduce the Behavior Cloning Chain Fusion technique to enhance the generalization capabilities of the agent. Extensive experiments on five popular mobile applications from diverse domains demonstrate the superior performance of EBC‑LLMAgent, achieving high success rates in task completion, efficient generalization to unseen scenarios, and the generation of meaningful explanations.

PaperID: 12505, https://arxiv.org/pdf/2410.22912.pdf

Authors: Steve Yuwono, Ahmar Kamal Hussain, Dorothea Schwung, Andreas Schwung

Title: Self-optimization in distributed manufacturing systems using Modular State-based Stackelberg Games

Abstract:
In this study, we introduce Modular State‑based Stackelberg Games (Mod‑SbSG), a novel game structure developed for distributed self‑learning in modular manufacturing systems. Mod‑SbSG enhances cooperative decision‑making among self‑learning agents within production systems by integrating State‑based Potential Games (SbPG) with Stackelberg games. This hierarchical structure assigns more important modules of the manufacturing system a first‑mover advantage, while less important modules respond optimally to the leaders' decisions. This decision‑making process differs from typical multi‑agent learning algorithms in manufacturing systems, where decisions are made simultaneously. We provide convergence guarantees for the novel game structure and design learning algorithms to account for the hierarchical game structure. We further analyse the effects of single‑leader/multiple‑follower and multiple‑leader/multiple‑follower scenarios within a Mod‑SbSG. To assess its effectiveness, we implement and test Mod‑SbSG in an industrial control setting using two laboratory‑scale testbeds featuring sequential and serial‑parallel processes. The proposed approach delivers promising results compared to the vanilla SbPG, which reduces overflow by 97.1%, and in some cases, prevents overflow entirely. Additionally, it decreases power consumption by 5‑13% while satisfying the production demand, which significantly improves potential (global objective) values.

PaperID: 12506, https://arxiv.org/pdf/2410.22908.pdf

Authors: Safwan Labbi, Daniil Tiapkin, Lorenzo Mancini, Paul Mangold, Eric Moulines

Title: Federated UCBVI: Communication-Efficient Federated Regret Minimization with Heterogeneous Agents

Abstract:
In this paper, we present the Federated Upper Confidence Bound Value Iteration algorithm (\textttFed‑UCBVI), a novel extension of the \textttUCBVI algorithm (Azar et al., 2017) tailored for the federated learning framework. We prove that the regret of \textttFed‑UCBVI scales as \tilde\mathcalO(\sqrtH^3 |\mathcalS| |\mathcalA| T / M), with a small additional term due to heterogeneity, where |\mathcalS| is the number of states, |\mathcalA| is the number of actions, H is the episode length, M is the number of agents, and T is the number of episodes. Notably, in the single‑agent setting, this upper bound matches the minimax lower bound up to polylogarithmic factors, while in the multi‑agent scenario, \textttFed‑UCBVI has linear speed‑up. To conduct our analysis, we introduce a new measure of heterogeneity, which may hold independent theoretical interest. Furthermore, we show that, unlike existing federated reinforcement learning approaches, \textttFed‑UCBVI's communication complexity only marginally increases with the number of agents.

PaperID: 12507, https://arxiv.org/pdf/2410.22896.pdf

Authors: Antoine Tordeux, Cécile Appert-Rolland, Alexandre Nicolas, Armin Seyfried, Denis Ullmo

Title: Modelling vehicle and pedestrian collective dynamics: Challenges and advances

Abstract:
In our urbanised societies, the management and regulation of traffic and pedestrian flows is of considerable interest for public safety, economic development, and the conservation of the environment. However, modelling and controlling the collective dynamics of vehicles and pedestrians raises several challenges. Not only are the individual entities self‑propelled and hard to describe, but their complex nonlinear physical and social interactions makes the multi‑agent problem of crowd and traffic flow even more involved. In this chapter, we purport to review the suitability and limitations of classical modelling approaches through four examples of collective behaviour: stop‑and‑go waves in traffic flow, lane formation, long‑term avoidance behaviour, and load balancing in pedestrian dynamics. While stop‑and‑go dynamics and lane formation can both be addressed by basic reactive models (at least to some extent), the latter two require anticipation and/or coordination at the level of the group. The results highlight the limitations of classical force‑based models, but also the need for long‑term anticipation mechanisms and multiscale modelling approaches. In response, we review new developments and modelling concepts.

PaperID: 12508, https://arxiv.org/pdf/2410.22766.pdf

Authors: Florentiana Yuwono, Gan Pang Yen, Jason Christopher

Title: Self-Driving Car Racing: Application of Deep Reinforcement Learning

Abstract:
This paper explores the application of deep reinforcement learning (RL) techniques in the domain of autonomous self‑driving car racing. Motivated by the rise of AI‑driven mobility and autonomous racing events, the project aims to develop an AI agent that efficiently drives a simulated car in the OpenAI Gymnasium CarRacing environment. We investigate various RL algorithms, including Deep Q‑Network (DQN), Proximal Policy Optimization (PPO), and novel adaptations that incorporate transfer learning and recurrent neural networks (RNNs) for enhanced performance. The project demonstrates that while DQN provides a strong baseline for policy learning, integrating ResNet and LSTM models significantly improves the agent's ability to capture complex spatial and temporal dynamics. PPO, particularly in continuous action spaces, shows promising results for fine control, although challenges such as policy collapse remain. We compare the performance of these approaches and outline future research directions focused on improving computational efficiency and addressing model stability. Our findings contribute to the ongoing development of AI systems in autonomous driving and related control tasks.

PaperID: 12509, https://arxiv.org/pdf/2410.22744.pdf

Authors: Nima Zargham, Mateusz Dubiel, Smit Desai, Thomas Mildner, Hanz-Joachim Belz

Title: Designing AI Personalities: Enhancing Human-Agent Interaction Through Thoughtful Persona Design

Abstract:
In the rapidly evolving field of artificial intelligence (AI) agents, designing the agent's characteristics is crucial for shaping user experience. This workshop aims to establish a research community focused on AI agent persona design for various contexts, such as in‑car assistants, educational tools, and smart home environments. We will explore critical aspects of persona design, such as voice, embodiment, and demographics, and their impact on user satisfaction and engagement. Through discussions and hands‑on activities, we aim to propose practices and standards that enhance the ecological validity of agent personas. Topics include the design of conversational interfaces, the influence of agent personas on user experience, and approaches for creating contextually appropriate AI agents. This workshop will provide a platform for building a community dedicated to developing AI agent personas that better fit diverse, everyday interactions.

PaperID: 12510, https://arxiv.org/pdf/2410.22662.pdf

Authors: Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao

Title: EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents

Abstract:
Heterogeneous multi‑robot systems (HMRS) have emerged as a powerful approach for tackling complex tasks that single robots cannot manage alone. Current large‑language‑model‑based multi‑agent systems (LLM‑based MAS) have shown success in areas like software development and operating systems, but applying these systems to robot control presents unique challenges. In particular, the capabilities of each agent in a multi‑robot system are inherently tied to the physical composition of the robots, rather than predefined roles. To address this issue, we introduce a novel multi‑agent framework designed to enable effective collaboration among heterogeneous robots with varying embodiments and capabilities, along with a new benchmark named Habitat‑MAS. One of our key designs is Robot Resume: Instead of adopting human‑designed role play, we propose a self‑prompted approach, where agents comprehend robot URDF files and call robot kinematics tools to generate descriptions of their physics capabilities to guide their behavior in task planning and action execution. The Habitat‑MAS benchmark is designed to assess how a multi‑agent framework handles tasks that require embodiment‑aware reasoning, which includes 1) manipulation, 2) perception, 3) navigation, and 4) comprehensive multi‑floor object rearrangement. The experimental results indicate that the robot's resume and the hierarchical design of our multi‑agent system are essential for the effective operation of the heterogeneous multi‑robot system within this intricate problem context.

PaperID: 12511, https://arxiv.org/pdf/2410.22584.pdf

Authors: Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran

Title: BenchAgents: Multi-Agent Systems for Structured Benchmark Creation

Abstract:
Evaluation insights are limited by the availability of high‑quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi‑agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is ] orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning spanning both language and vision modalities. We then use these benchmarks to study state‑of‑the‑art models and extract new insights into common failure modes and model differences.

PaperID: 12512, https://arxiv.org/pdf/2410.22578.pdf

Authors: Ying Li, Changling Li, Jiyao Chen, Christine Roinou

Title: Energy-Aware Multi-Agent Reinforcement Learning for Collaborative Execution in Mission-Oriented Drone Networks

Abstract:
Mission‑oriented drone networks have been widely used for structural inspection, disaster monitoring, border surveillance, etc. Due to the limited battery capacity of drones, mission execution strategy impacts network performance and mission completion. However, collaborative execution is a challenging problem for drones in such a dynamic environment as it also involves efficient trajectory design. We leverage multi‑agent reinforcement learning (MARL) to manage the challenge in this study, letting each drone learn to collaboratively execute tasks and plan trajectories based on its current status and environment. Simulation results show that the proposed collaborative execution model can successfully complete the mission at least 80% of the time, regardless of task locations and lengths, and can even achieve a 100% success rate when the task density is not way too sparse. To the best of our knowledge, our work is one of the pioneer studies on leveraging MARL on collaborative execution for mission‑oriented drone networks; the unique value of this work lies in drone battery level driving our model design.

PaperID: 12513, https://arxiv.org/pdf/2410.22552.pdf

Authors: Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee

Title: Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Abstract:
In this paper, we introduce Auto‑Intent, a method to adapt a pre‑trained large language model (LLM) as an agent for a target domain without direct fine‑tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations unsupervisedly, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent's past observations and actions. In particular, we propose a self‑exploration approach where top‑k probable intent predictions are provided as a hint to the pre‑trained LLM agent, which leads to enhanced decision‑making capabilities. Auto‑Intent substantially improves the performance of GPT‑3.5, 4 and Llama‑3.1‑70B, 405B agents on the large‑scale real‑website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross‑benchmark generalization from Mind2Web.

PaperID: 12514, https://arxiv.org/pdf/2410.22341.pdf

Authors: Bernard Chazelle, Kritkorn Karntikoon

Title: The $s$-Energy and Its Applications

Abstract:
Many multi‑agent systems evolve by repeatedly updating each state to a weighted average of its neighbors, a process known as averaging dynamics, whose behavior becomes difficult to analyze when the interaction network varies over time. In recent years, the s‑energy has emerged as a useful tool for bounding the convergence rates of such systems, complementing the classical techniques that rely on fixed graphs. We derive new bounds on the s‑energy under minimal connectivity assumptions. As a consequence, we obtain convergence guarantees for several models of collective dynamics and resolve a number of open questions in the areas. Our results highlight the dependence of the s‑energy on the connectivity of the underlying networks and use it to explain the exponential gap in the convergence rates of stationary and time‑varying consensus systems.

PaperID: 12515, https://arxiv.org/pdf/2410.22339.pdf

Authors: Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, Mahesh Viswanathan

Title: DAWN: Designing Distributed Agents in a Worldwide Network

Abstract:
The rapid evolution of Large Language Models (LLMs) has transformed them from basic conversational tools into sophisticated entities capable of complex reasoning and decision‑making. These advancements have led to the development of specialized LLM‑based agents designed for diverse tasks such as coding and web browsing. As these agents become more capable, the need for a robust framework that facilitates global communication and collaboration among them towards advanced objectives has become increasingly critical. Distributed Agents in a Worldwide Network (DAWN) addresses this need by offering a versatile framework that integrates LLM‑based agents with traditional software systems, enabling the creation of agentic applications suited for a wide range of use cases. DAWN enables distributed agents worldwide to register and be easily discovered through Gateway Agents. Collaborations among these agents are coordinated by a Principal Agent equipped with reasoning strategies. DAWN offers three operational modes: No‑LLM Mode for deterministic tasks, Copilot for augmented decision‑making, and LLM Agent for autonomous operations. Additionally, DAWN ensures the safety and security of agent collaborations globally through a dedicated safety, security, and compliance layer, protecting the network against attackers and adhering to stringent security and compliance standards. These features make DAWN a robust network for deploying agent‑based applications across various industries.

PaperID: 12516, https://arxiv.org/pdf/2410.22304.pdf

Authors: Yihe Deng, Paul Mineiro

Title: Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Abstract:
Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high‑quality reasoning traces for LLM fine‑tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real‑time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.

PaperID: 12517, https://arxiv.org/pdf/2410.22165.pdf

Authors: Koen Ponse, Aske Plaat, Niki van Stein, Thomas M. Moerland

Title: EconoJax: A Fast & Scalable Economic Simulation in Jax

Abstract:
Accurate economic simulations often require many experimental runs, particularly when combined with reinforcement learning. Unfortunately, training reinforcement learning agents in multi‑agent economic environments can be slow. This paper introduces EconoJax, a fast simulated economy, based on the AI economist. EconoJax, and its training pipeline, are completely written in JAX. This allows EconoJax to scale to large population sizes and perform large experiments, while keeping training times within minutes. Through experiments with populations of 100 agents, we show how real‑world economic behavior emerges through training within 15 minutes, in contrast to previous work that required several days. We additionally perform experiments in varying sized action spaces to test if some multi‑agent methods produce more diverse behavior compared to others. Here, our findings indicate no notable differences in produced behavior with different methods as is sometimes suggested in earlier works. To aid further research, we open‑source EconoJax on Github.

PaperID: 12518, https://arxiv.org/pdf/2410.22129.pdf

Authors: Cory Hymel, Sida Peng, Kevin Xu, Charath Ranganathan

Title: Improving Performance of Commercially Available AI Products in a Multi-Agent Configuration

Abstract:
In recent years, with the rapid advancement of large language models (LLMs), multi‑agent systems have become increasingly more capable of practical application. At the same time, the software development industry has had a number of new AI‑powered tools developed that improve the software development lifecycle (SDLC). Academically, much attention has been paid to the role of multi‑agent systems to the SDLC. And, while single‑agent systems have frequently been examined in real‑world applications, we have seen comparatively few real‑world examples of publicly available commercial tools working together in a multi‑agent system with measurable improvements. In this experiment we test context sharing between Crowdbotics PRD AI, a tool for generating software requirements using AI, and GitHub Copilot, an AI pair‑programming tool. By sharing business requirements from PRD AI, we improve the code suggestion capabilities of GitHub Copilot by 13.8% and developer task success rate by 24.5% ‑‑ demonstrating a real‑world example of commercially‑available AI systems working together with improved outcomes.

PaperID: 12519, https://arxiv.org/pdf/2410.21882.pdf

Authors: Feifei Zhao, Hui Feng, Haibo Tong, Zhengqiang Han, Erliang Lin, Enmeng Lu, Yinqian Sun, Yi Zeng

Title: Building Altruistic and Moral AI Agent with Brain-inspired Emotional Empathy Mechanisms

Abstract:
As AI closely interacts with human society, it is crucial to ensure that its behavior is safe, altruistic, and aligned with human ethical and moral values. However, existing research on embedding ethical considerations into AI remains insufficient, and previous external constraints based on principles and rules are inadequate to provide AI with long‑term stability and generalization capabilities. Emotional empathy intrinsically motivates altruistic behaviors aimed at alleviating others' negative emotions through emotional sharing and contagion mechanisms. Motivated by this, we draw inspiration from the neural mechanism of human emotional empathy‑driven altruistic decision making, and simulate the shared self‑other perception‑mirroring‑empathy neural circuits, to construct a brain‑inspired emotional empathy‑driven altruistic decision‑making model. Here, empathy directly impacts dopamine release to form intrinsic altruistic motivation. The proposed model exhibits consistent altruistic behaviors across three experimental settings: emotional contagion‑integrated two‑agent altruistic rescue, multi‑agent gaming, and robotic emotional empathy interaction scenarios. In‑depth analyses validate the positive correlation between empathy levels and altruistic preferences (consistent with psychological behavioral experiment findings), while also demonstrating how interaction partners' empathy levels influence the agent's behavioral patterns. We further test the proposed model's performance and stability in moral dilemmas involving conflicts between self‑interest and others' well‑being, partially observable environments, and adversarial defense scenarios. This work provides preliminary exploration of human‑like empathy‑driven altruistic moral decision making, contributing potential perspectives for developing ethically‑aligned AI.

PaperID: 12520, https://arxiv.org/pdf/2410.21803.pdf

Authors: Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi

Title: SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

Abstract:
Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language. Recent frameworks, such as the Metropolis‑‑Hastings Naming Game (MHNG), formulate EmCom as the learning of shared external representations negotiated through interaction under joint attention, without explicit success or reward feedback. However, MHNG relies on sampling‑based updates that suffer from high rejection rates in high‑dimensional perceptual spaces, making the learning process sample‑inefficient for complex visual datasets. In this work, we propose the SimSiam Naming Game (SSNG), a feedback‑free EmCom framework that replaces sampling‑based updates with a symmetric, self‑supervised representation alignment objective between autonomous agents. Building on a variational inference‑‑based probabilistic interpretation of self‑supervised learning, SSNG formulates symbol emergence as an alignment process between agents' latent representations mediated by message exchange. To enable end‑to‑end gradient‑based optimization, discrete symbolic messages are learned via a Gumbel‑‑Softmax relaxation, preserving the discrete nature of communication while maintaining differentiability. Experiments on CIFAR‑10 and ImageNet‑100 show that the emergent messages learned by SSNG achieve substantially higher linear‑probe classification accuracy than those produced by referential games, reconstruction games, and MHNG. These results indicate that self‑supervised representation alignment provides an effective mechanism for feedback‑free EmCom in multi‑agent systems.

PaperID: 12521, https://arxiv.org/pdf/2410.21794.pdf

Authors: Qian Long, Ruoyan Li, Minglu Zhao, Tao Gao, Demetri Terzopoulos

Title: Inverse Attention Agents for Multi-Agent Systems

Abstract:
A major challenge for Multi‑Agent Systems is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind (ToM) implemented algorithmically using an attention mechanism trained in an end‑to‑end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent's final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.

PaperID: 12522, https://arxiv.org/pdf/2410.21784.pdf

Authors: Anubhav Shrimal, Stanley Kanagaraj, Kriti Biswas, Swarnalatha Raghuraman, Anish Nediyanchath, Yi Zhang, Promod Yenigalla

Title: MARCO: Multi-Agent Real-time Chat Orchestration

Abstract:
Large language model advancements have enabled the development of multi‑agent frameworks to tackle complex, real‑world problems such as to automate tasks that require interactions with diverse tools, reasoning, and human collaboration. We present MARCO, a Multi‑Agent Real‑time Chat Orchestration framework for automating tasks using LLMs. MARCO addresses key challenges in utilizing LLMs for complex, multi‑step task execution. It incorporates robust guardrails to steer LLM behavior, validate outputs, and recover from errors that stem from inconsistent output formatting, function and parameter hallucination, and lack of domain knowledge. Through extensive experiments we demonstrate MARCO's superior performance with 94.48% and 92.74% accuracy on task execution for Digital Restaurant Service Platform conversations and Retail conversations datasets respectively along with 44.91% improved latency and 33.71% cost reduction. We also report effects of guardrails in performance gain along with comparisons of various LLM models, both open‑source and proprietary. The modular and generic design of MARCO allows it to be adapted for automating tasks across domains and to execute complex usecases through multi‑turn interactions.

PaperID: 12523, https://arxiv.org/pdf/2410.21741.pdf

Authors: Sorouralsadat Fatemi, Yuheng Hu

Title: Enhancing Financial Question Answering with a Multi-Agent Reflection Framework

Abstract:
While Large Language Models (LLMs) have shown impressive capabilities in numerous Natural Language Processing (NLP) tasks, they still struggle with financial question answering (QA), particularly when numerical reasoning is required. Recently, LLM‑based multi‑agent frameworks have demonstrated remarkable effectiveness in multi‑step reasoning, which is crucial for financial QA tasks as it involves extracting relevant information from tables and text and then performing numerical reasoning on the extracted data to infer answers. In this study, we propose a multi‑agent framework incorporating a critic agent that reflects on the reasoning steps and final answers for each question. Additionally, we enhance our system by adding multiple critic agents, each focusing on a specific aspect of the answer. Our results indicate that this framework significantly improves performance compared to single‑agent reasoning, with an average performance increase of 15% for the LLaMA3‑8B model and 5% for the LLaMA3‑70B model. Furthermore, our framework performs on par with, and in some cases surpasses, larger single‑agent LLMs such as LLaMA3.1‑405B and GPT‑4o‑mini, though it falls slightly short compared to Claude‑3.5 Sonnet. Overall, our framework presents an effective solution to enhance open‑source LLMs for financial QA tasks, offering a cost‑effective alternative to larger models like Claude‑3.5 Sonnet.

PaperID: 12524, https://arxiv.org/pdf/2410.21521.pdf

Authors: Sriniketh Vangaru, Daniel Rosen, Dylan Green, Raphael Rodriguez, Maxwell Wiecek, Amos Johnson, Alyse M. Jones, William C. Headley

Title: A Multi-Agent Reinforcement Learning Testbed for Cognitive Radio Applications

Abstract:
Technological trends show that Radio Frequency Reinforcement Learning (RFRL) will play a prominent role in the wireless communication systems of the future. Applications of RFRL range from military communications jamming to enhancing WiFi networks. Before deploying algorithms for these purposes, they must be trained in a simulation environment to ensure adequate performance. For this reason, we previously created the RFRL Gym: a standardized, accessible tool for the development and testing of reinforcement learning (RL) algorithms in the wireless communications space. This environment leveraged the OpenAI Gym framework and featured customizable simulation scenarios within the RF spectrum. However, the RFRL Gym was limited to training a single RL agent per simulation; this is not ideal, as most real‑world RF scenarios will contain multiple intelligent agents in cooperative, competitive, or mixed settings, which is a natural consequence of spectrum congestion. Therefore, through integration with Ray RLlib, multi‑agent reinforcement learning (MARL) functionality for training and assessment has been added to the RFRL Gym, making it even more of a robust tool for RF spectrum simulation. This paper provides an overview of the updated RFRL Gym environment. In this work, the general framework of the tool is described relative to comparable existing resources, highlighting the significant additions and refactoring we have applied to the Gym. Afterward, results from testing various RF scenarios in the MARL environment and future additions are discussed.

PaperID: 12525, https://arxiv.org/pdf/2410.21447.pdf

Authors: Dong Ho Lee, Lasse Peters, David Fridovich-Keil

Title: You Can't Always Get What You Want: Games of Ordered Preference

Abstract:
We study noncooperative games, in which each player's objective is composed of a sequence of ordered‑ and potentially conflicting‑preferences. Problems of this type naturally model a wide variety of scenarios: for example, drivers at a busy intersection must balance the desire to make forward progress with the risk of collision. Mathematically, these problems possess a nested structure, and to behave properly players must prioritize their most important preference, and only consider less important preferences to the extent that they do not compromise performance on more important ones. We consider multi‑agent, noncooperative variants of these problems, and seek generalized Nash equilibria in which each player's decision reflects both its hierarchy of preferences and other players' actions. We make two key contributions. First, we develop a recursive approach for deriving the first‑order optimality conditions of each player's nested problem. Second, we propose a sequence of increasingly tight relaxations, each of which can be transcribed as a mixed complementarity problem and solved via existing methods. Experimental results demonstrate that our approach reliably converges to equilibrium solutions that strictly reflect players' individual ordered preferences.

PaperID: 12526, https://arxiv.org/pdf/2410.21415.pdf

Authors: He Jiang, Yutong Wang, Rishi Veerapaneni, Tanishq Duhan, Guillaume Sartoretti, Jiaoyang Li

Title: Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding

Abstract:
Lifelong Multi‑Agent Path Finding (LMAPF) repeatedly finds collision‑free paths for multiple agents that are continually assigned new goals when they reach current ones. Recently, this field has embraced learning‑based methods, which reactively generate single‑step actions based on individual local observations. However, it is still challenging for them to match the performance of the best search‑based algorithms, especially in large‑scale settings. This work proposes an imitation‑learning‑based LMAPF solver that introduces a novel communication module as well as systematic single‑step collision resolution and global guidance techniques. Our proposed solver, Scalable Imitation Learning for LMAPF (SILLM), inherits the fast reasoning speed of learning‑based methods and the high solution quality of search‑based methods with the help of modern GPUs. Across six large‑scale maps with up to 10,000 agents and varying obstacle structures, SILLM surpasses the best learning‑ and search‑based baselines, achieving average throughput improvements of 137.7% and 16.0%, respectively. Furthermore, SILLM also beats the winning solution of the 2023 League of Robot Runners, an international LMAPF competition. Finally, we validated SILLM with 10 real robots and 100 virtual robots in a mock warehouse environment.

PaperID: 12527, https://arxiv.org/pdf/2410.21414.pdf

Authors: Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang

Title: CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart

Abstract:
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing researches in MMQA only focus on two modalities such as image‑text QA, table‑text QA and chart‑text QA, and there remains a notable scarcity in studies that investigate the joint analysis of text, tables, and charts. In this paper, we present C\textT^2C‑QA, a pioneering Chinese reasoning‑based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a great test for the capability of the model to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (Allocating, Expert and Desicion), a multi‑agent system implemented through collaborative deployment, information interaction, and collective decision‑making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state‑of‑the‑art models in MMQA, including GPT‑4. The experimental outcomes demonstrate that current methodologies, including GPT‑4, are yet to meet the benchmarks set by our dataset.

PaperID: 12528, https://arxiv.org/pdf/2410.21359.pdf

Authors: Ji Ma

Title: Can Machines Think Like Humans? A Behavioral Evaluation of LLM Agents in Dictator Games

Abstract:
As Large Language Model (LLM)‑based agents increasingly engage with human society, how well do we understand their prosocial behaviors? We (1) investigate how LLM agents' prosocial behaviors can be induced by different personas and benchmarked against human behaviors; and (2) introduce a social science approach to evaluate LLM agents' decision‑making. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal that merely assigning a human‑like identity to LLMs does not produce human‑like behaviors. These findings suggest that LLM agents' reasoning does not consistently exhibit textual markers of human decision‑making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern. As society increasingly integrates machine intelligence, "Prosocial AI" emerges as a promising and urgent research direction in philanthropic studies.

PaperID: 12529, https://arxiv.org/pdf/2410.21342.pdf

Authors: Siyuan Chen, Jiahai Wang

Title: Heterogeneous Interaction Modeling With Reduced Accumulated Error for Multi-Agent Trajectory Prediction

Abstract:
Dynamical complex systems composed of interactive heterogeneous agents are prevalent in the world, including urban traffic systems and social networks. Modeling the interactions among agents is the key to understanding and predicting the dynamics of the complex system, e.g., predicting the trajectories of traffic participants in the city. Compared with interaction modeling in homogeneous systems such as pedestrians in a crowded scene, heterogeneous interaction modeling is less explored. Worse still, the error accumulation problem becomes more severe since the interactions are more complex. To tackle the two problems, this paper proposes heterogeneous interaction modeling with reduced accumulated error for multi‑agent trajectory prediction. Based on the historical trajectories, our method infers the dynamic interaction graphs among agents, featured by directed interacting relations and interacting effects. A heterogeneous attention mechanism is defined on the interaction graphs for aggregating the influence from heterogeneous neighbors to the target agent. To alleviate the error accumulation problem, this paper analyzes the error sources from the spatial and temporal perspectives, and proposes to introduce the graph entropy and the mixup training strategy for reducing the two types of errors respectively. Our method is examined on three real‑world datasets containing heterogeneous agents, and the experimental results validate the superiority of our method.

PaperID: 12530, https://arxiv.org/pdf/2410.21312.pdf

Authors: Xin Wang, Yifan Zhang, Xiaojing Zhang, Longhui Yu, Xinna Lin, Jindong Jiang, Bin Ma, Kaicheng Yu

Title: $\texttt{PatentAgent}$: Intelligent Agent for Automated Pharmaceutical Patent Analysis

Abstract:
Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there still lacks an unified agent that assists every aspect of patent analysis, from patent reading to core chemical identification. Leveraging the capabilities of Large Language Models (LLMs) to understand requests and follow instructions, we introduce the first intelligent agent in this domain, \textttPatentAgent, poised to advance and potentially revolutionize the landscape of pharmaceutical research. \textttPatentAgent comprises three key end‑to‑end modules ‑‑ PA‑QA, PA‑Img2Mol, and PA‑CoreId ‑‑ that respectively perform (1) patent question‑answering, (2) image‑to‑molecular‑structure conversion, and (3) core chemical structure identification, addressing the essential needs of scientists and practitioners in pharmaceutical patent analysis. Each module of \textttPatentAgent demonstrates significant effectiveness with the updated algorithm and the synergistic design of \textttPatentAgent framework. PA‑Img2Mol outperforms existing methods across CLEF, JPO, UOB, and USPTO patent benchmarks with an accuracy gain between 2.46% and 8.37% while PA‑CoreId realizes accuracy improvement ranging from 7.15% to 7.62% on PatentNetML benchmark. Our code and dataset will be publicly available.

PaperID: 12531, https://arxiv.org/pdf/2410.21290.pdf

Authors: Y. Wang, Y. Zhao

Title: Multiple Ships Cooperative Navigation and Collision Avoidance using Multi-agent Reinforcement Learning with Communication

Abstract:
In the real world, unmanned surface vehicles (USV) often need to coordinate with each other to accomplish specific tasks. However, achieving cooperative control in multi‑agent systems is challenging due to issues such as non‑stationarity and partial observability. Recent advancements in Multi‑Agent Reinforcement Learning (MARL) provide new perspectives to address these challenges. Therefore, we propose using the multi‑agent deep deterministic policy gradient (MADDPG) algorithm with communication to address multiple ships' cooperation problems under partial observability. We developed two tasks based on OpenAI's gym environment: cooperative navigation and cooperative collision avoidance. In these tasks, ships must not only learn effective control strategies but also establish communication protocols with other agents. We analyze the impact of external noise on communication, the effect of inter‑agent communication on performance, and the communication patterns learned by the agents. The results demonstrate that our proposed framework effectively addresses cooperative navigation and collision avoidance among multiple vessels, significantly outperforming traditional single‑agent algorithms. Agents establish a consistent communication protocol, enabling them to compensate for missing information through shared observations and achieve better coordination.

PaperID: 12532, https://arxiv.org/pdf/2410.21286.pdf

Authors: Yuwei Yan, Qingbin Zeng, Zhiheng Zheng, Jingzhe Yuan, Jie Feng, Jun Zhang, Fengli Xu, Yong Li

Title: OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents

Abstract:
Agent‑based models (ABMs) have long been employed to explore how individual behaviors aggregate into complex societal phenomena in urban space. Unlike black‑box predictive models, ABMs excel at explaining the micro‑macro linkages that drive such emergent behaviors. The recent rise of Large Language Models (LLMs) has led to the development of LLM agents capable of simulating urban activities with unprecedented realism. However, the extreme high computational cost of LLMs presents significant challenges for scaling up the simulations of LLM agents. To address this problem, we propose OpenCity, a scalable simulation platform optimized for both system and prompt efficiencies. Specifically, we propose a LLM request scheduler to reduce communication overhead by parallelizing requests through IO multiplexing. Besides, we deisgn a "group‑and‑distill" prompt optimization strategy minimizes redundancy by clustering agents with similar static attributes. Through experiments on six global cities, OpenCity achieves a 600‑fold acceleration in simulation time per agent, a 70% reduction in LLM requests, and a 50% reduction in token usage. These improvements enable the simulation of 10,000 agents' daily activities in 1 hour on commodity hardware. Besides, the substantial speedup of OpenCity allows us to establish a urban simulation benchmark for LLM agents for the first time, comparing simulated urban activities with real‑world data in 6 major cities around the globe. We believe our OpenCity platform provides a critical infrastructure to harness the power of LLMs for interdisciplinary studies in urban space, fostering the collective efforts of broader research communities. Code repo is available at https://anonymous.4open.science/r/Anonymous‑OpenCity‑42BD.

PaperID: 12533, https://arxiv.org/pdf/2410.21249.pdf

Authors: Manav Vora, Ilan Shomorony, Melkior Ornik

Title: Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach

Abstract:
We study capacity‑ and budget‑constrained multi‑agent MDPs (CB‑MA‑MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two‑stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)‑based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time‑to‑failure, allocating budget to each set proportionally. Second, a meta‑trained PPO policy solves each sub‑MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.

PaperID: 12534, https://arxiv.org/pdf/2410.21108.pdf

Authors: Naga Venkata Sai Raviteja Chappa, Khoa Luu

Title: LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

Abstract:
Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi‑agent interactions. This paper introduces LiGAR, a LIDAR‑Guided Hierarchical Transformer for Multi‑Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi‑Scale LIDAR Transformer, Cross‑Modal Guided Attention, and an Adaptive Fusion Module to integrate multi‑modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene‑level dynamics. Extensive experiments on the JRDB‑PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state‑of‑the‑art results with improvements of up to 10.6% in F1‑score on JRDB‑PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi‑modal, multi‑scale approach in advancing the field of group activity recognition.

PaperID: 12535, https://arxiv.org/pdf/2410.21071.pdf

Authors: Eitan Farchi, Shmulik Froimovich, Rami Katan, Orna Raz

Title: Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks

Abstract:
LLMs can be used in a variety of code related tasks such as translating from one programming language to another, implementing natural language requirements and code summarization. Artifacts generated by state of the art LLM technology are expected to be useful in the sense that a user will be able to use the LLM generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging and it is thus hard to determine the quality of code related LLM solutions. We refer to evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is two fold, namely, it is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code related tasks and which serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code related generations. The graph vertices are generated artifacts and edges represent possible generations, e.g., the generation of a Java program from its natural language requirements. Utilizing a chain of LLM agents and G we generate code related artifacts. Using cycles in G we formulate expectations on the generated artifacts. Taking advantage of these formulated expectations enables the development and testing of reliable LLM judgement for usefulness of the artifacts generated by the solution. Our approach enables the creation of high quality code task solutions.

PaperID: 12536, https://arxiv.org/pdf/2410.21067.pdf

Authors: Meiqi Chen, Fandong Meng, Yingxue Zhang, Yan Zhang, Jie Zhou

Title: CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models

Abstract:
Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain‑specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval‑Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi‑agent translation framework that leverages RAG and causality‑enhanced self‑reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality‑enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context‑sensitive terms and emerging vocabulary.

PaperID: 12537, https://arxiv.org/pdf/2410.21029.pdf

Authors: Jannis Weil, Jonas Ringsdorf, Julian Barthel, Yi-Ping Phoebe Chen, Tobias Meuser

Title: FairStream: Fair Multimedia Streaming Benchmark for Reinforcement Learning Agents

Abstract:
Multimedia streaming accounts for the majority of traffic in today's internet. Mechanisms like adaptive bitrate streaming control the bitrate of a stream based on the estimated bandwidth, ideally resulting in smooth playback and a good Quality of Experience (QoE). However, selecting the optimal bitrate is challenging under volatile network conditions. This motivated researchers to train Reinforcement Learning (RL) agents for multimedia streaming. The considered training environments are often simplified, leading to promising results with limited applicability. Additionally, the QoE fairness across multiple streams is seldom considered by recent RL approaches. With this work, we propose a novel multi‑agent environment that comprises multiple challenges of fair multimedia streaming: partial observability, multiple objectives, agent heterogeneity and asynchronicity. We provide and analyze baseline approaches across five different traffic classes to gain detailed insights into the behavior of the considered agents, and show that the commonly used Proximal Policy Optimization (PPO) algorithm is outperformed by a simple greedy heuristic. Future work includes the adaptation of multi‑agent RL algorithms and further expansions of the environment.

PaperID: 12538, https://arxiv.org/pdf/2410.20968.pdf

Authors: Shuyang Zhu, Ziqing Zhu

Title: Bilevel Model for Electricity Market Mechanism Optimisation via Quantum Computing Enhanced Reinforcement Learning

Abstract:
In response to the increasing complexity of electricity markets due to low‑carbon requirements and the integration of sustainable energy sources, this paper proposes a dynamic quantum computing enhanced bilevel optimization model for electricity market operations. The upper level focuses on market mechanism optimization using Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO), while the lower level models the bidding strategies of Generating Companies (GENCOs) using a Multi‑Agent Deep Q‑Network (MADQN) enhanced with quantum computing through a Variational Quantum Circuit (VQC). The three main contributions of this work are: (1) establishing a dynamic bilevel model with timely feedback between the upper and lower levels; (2) parameterizing and optimizing market mechanisms to derive the most effective designs; and (3) introducing quantum computing into the context of electricity markets to more realistically simulate market operations. The proposed model is tested on the IEEE 30‑bus system with six GENCOs, demonstrating its effectiveness in capturing the complexities of modern electricity markets.

PaperID: 12539, https://arxiv.org/pdf/2410.20666.pdf

Authors: Sangmim Song, Sarath Kodagoda, Amal Gunatilake, Marc G. Carmichael, Karthick Thiyagarajan, Jodi Martin

Title: Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments

Abstract:
Navigation presents a significant challenge for persons with visual impairments (PVI). While traditional aids such as white canes and guide dogs are invaluable, they fall short in delivering detailed spatial information and precise guidance to desired locations. Recent developments in large language models (LLMs) and vision‑language models (VLMs) offer new avenues for enhancing assistive navigation. In this paper, we introduce Guide‑LLM, an embodied LLM‑based agent designed to assist PVI in navigating large indoor environments. Our approach features a novel text‑based topological map that enables the LLM to plan global paths using a simplified environmental representation, focusing on straight paths and right‑angle turns to facilitate navigation. Additionally, we utilize the LLM's commonsense reasoning for hazard detection and personalized path planning based on user preferences. Simulated experiments demonstrate the system's efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology. The results highlight Guide‑LLM's ability to offer efficient, adaptive, and personalized navigation assistance, pointing to promising advancements in this field.

PaperID: 12540, https://arxiv.org/pdf/2410.20637.pdf

Authors: Ioannis Raptis

Title: Observability of Linear Time-Invariant Systems with Relative Measurements: A Geometric Approach

Abstract:
This paper explores the observability and estimation capability of dynamical systems using predominantly relative measurements of the system's state‑space variables, with minimal to no reliance on absolute measurements of these variables. We concentrate on linear time‑invariant systems, in which the observation matrix serves as the algebraic representation of a graph object. This graph object encapsulates the availability of relative measurements. Utilizing algebraic graph theory and abstract linear algebra (geometric) tools, we establish a link between the structure of the graph of relative measurements and the system‑theoretic observability subspace of linear systems. Special emphasis is given to multi‑agent networked systems whose dynamics are governed by the linear consensus protocol. We demonstrate the importance of absolute information and its placement to the system's dynamics in achieving full‑state estimation. Finally, the analysis shifts to the synthesis of a distributed observer with relative measurements for single integrator dynamics, exemplifying the relevance of the preceding analytical findings. We support our theoretical analysis with numerical simulations.

PaperID: 12541, https://arxiv.org/pdf/2410.20424.pdf

Authors: Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang

Title: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Abstract:
Data science tasks involving tabular data present complex challenges that require sophisticated problem‑solving approaches. We propose AutoKaggle, a powerful and user‑centric framework that assists data scientists in completing daily data pipelines through a collaborative multi‑agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real‑world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.

PaperID: 12542, https://arxiv.org/pdf/2410.20345.pdf

Authors: Mohammadreza Doostmohammadian, Sérgio Pequito

Title: Logarithmically Quantized Distributed Optimization over Dynamic Multi-Agent Networks

Abstract:
Distributed optimization finds many applications in machine learning, signal processing, and control systems. In these real‑world applications, the constraints of communication networks, particularly limited bandwidth, necessitate implementing quantization techniques. In this paper, we propose distributed optimization dynamics over multi‑agent networks subject to logarithmically quantized data transmission. Under this condition, data exchange benefits from representing smaller values with more bits and larger values with fewer bits. As compared to uniform quantization, this allows for higher precision in representing near‑optimal values and more accuracy of the distributed optimization algorithm. The proposed optimization dynamics comprise a primary state variable converging to the optimizer and an auxiliary variable tracking the objective function's gradient. Our setting accommodates dynamic network topologies, resulting in a hybrid system requiring convergence analysis using matrix perturbation theory and eigenspectrum analysis.

PaperID: 12543, https://arxiv.org/pdf/2410.20288.pdf

Authors: Luyao Niu, Hongchao Zhang, Dinuka Sahabandu, Bhaskar Ramasubramanian, Andrew Clark, Radha Poovendran

Title: Who is Responsible? Explaining Safety Violations in Multi-Agent Cyber-Physical Systems

Abstract:
Multi‑agent cyber‑physical systems are present in a variety of applications. Agent decision‑making can be affected due to errors induced by uncertain, dynamic operating environments or due to incorrect actions taken by an agent. When an erroneous decision that leads to a violation of safety is identified, assigning responsibility to individual agents is a key step toward preventing future accidents. Current approaches to carrying out such investigations require human labor or high degree of familiarity with operating environments. Automated strategies to assign responsibility can achieve a significant reduction in human effort and associated cognitive burden. In this paper, we develop an automated procedure to assign responsibility for safety violations to actions of any single agent in a principled manner. We base our approach on reasoning about safety violations in road safety. Given a safety violation, we use counterfactual reasoning to create alternative scenarios, showing how different outcomes could have occurred if certain actions had been replaced by others. We introduce the degree of responsibility (DoR) metric for each agent. The DoR, using the Shapley value, quantifies each agent's contribution to the safety violation, providing a basis to explain and justify decisions. We also develop heuristic techniques and methods based on agent interaction structures to improve scalability as agent numbers grow. We examine three safety violation cases from the National Highway Traffic Safety Administration (NHTSA). We run experiments using CARLA urban driving simulator. Results show the DoR improves the explainability of decisions and accountability for agent actions and their consequences.

PaperID: 12544, https://arxiv.org/pdf/2410.20285.pdf

Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang

Title: SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Abstract:
Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)‑based software agents often follow linear, sequential processes that prevent backtracking and exploration of alternative solutions, limiting their ability to rethink their strategies when initial approaches prove ineffective. To address these challenges, we propose SWE‑Search, a multi‑agent framework that integrates Monte Carlo Tree Search (MCTS) with a self‑improvement mechanism to enhance software agents' performance on repository‑level software tasks. SWE‑Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self‑feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE‑Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi‑agent debate for collaborative decision‑making. Applied to the SWE‑bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open‑source agents without MCTS. Our analysis reveals how performance scales with increased inference‑time compute through deeper search, providing a pathway to improve software agents without requiring larger models or additional training data. This highlights the potential of self‑evaluation driven search techniques in complex software engineering environments.

PaperID: 12545, https://arxiv.org/pdf/2410.20151.pdf

Authors: Shanshan Song, Bingwen Huangfu, Jiani Guo, Jun Liu, Junhong Cui, Xuemin, Shen

Title: A Digital Twin-based Intelligent Network Architecture for Underwater Acoustic Sensor Networks

Abstract:
Underwater acoustic sensor networks (UASNs) drive toward strong environmental adaptability, intelligence, and multifunctionality. However, due to unique UASN characteristics, such as long propagation delay, dynamic channel quality, and high attenuation, existing studies present untimeliness, inefficiency, and inflexibility in real practice. Digital twin (DT) technology is promising for UASNs to break the above bottlenecks by providing high‑fidelity status prediction and exploring optimal schemes. In this article, we propose a Digital Twin‑based Network Architecture (DTNA), enhancing UASNs' environmental adaptability, intelligence, and multifunctionality. By extracting real UASN information from local (node) and global (network) levels, we first design a layered architecture to improve the DT replica fidelity and UASN control flexibility. In local DT, we develop a resource allocation paradigm (RAPD), which rapidly perceives performance variations and iteratively optimizes allocation schemes to improve real‑time environmental adaptability of resource allocation algorithms. In global DT, we aggregate decentralized local DT data and propose a collaborative Multi‑agent reinforcement learning framework (CMFD) and a task‑oriented network slicing (TNSD). CMFD patches scarce real data and provides extensive DT data to accelerate AI model training. TNSD unifies heterogeneous tasks' demand extraction and efficiently provides comprehensive network status, improving the flexibility of multi‑task scheduling algorithms. Finally, practical and simulation experiments verify the high fidelity of DT. Compared with the original UASN architecture, experiment results demonstrate that DTNA can: (i) improve the timeliness and robustness of resource allocation; (ii) greatly reduce the training time of AI algorithms; (iii) more rapidly obtain network status for multi‑task scheduling at a low cost.

PaperID: 12546, https://arxiv.org/pdf/2410.20140.pdf

Authors: Kumud Lakara, Georgia Channing, Christian Rupprecht, Juil Sock, Philip Torr, John Collomosse, Christian Schroeder de Witt

Title: MAD-Sherlock: Multi-Agent Debate for Visual Misinformation Detection

Abstract:
One of the most challenging forms of misinformation involves pairing images with misleading text to create false narratives. Existing AI‑driven detection systems often require domain‑specific finetuning, limiting generalizability, and offer little insight into their decisions, hindering trust and adoption. We introduce MAD‑Sherlock, a multi‑agent debate system for out‑of‑context misinformation detection. MAD‑Sherlock frames detection as a multi‑agent debate, reflecting the diverse and conflicting discourse found online. Multimodal agents collaborate to assess contextual consistency and retrieve external information to support cross‑context reasoning. Our framework is domain‑ and time‑agnostic, requiring no finetuning, yet achieves state‑of‑the‑art accuracy with in‑depth explanations. Evaluated on NewsCLIPpings, VERITE, and MMFakeBench, it outperforms prior methods by 2%, 3%, and 5%, respectively. Ablation and user studies show that the debate and resultant explanations significantly improve detection performance and improve trust for both experts and non‑experts, positioning MAD‑Sherlock as a robust tool for autonomous citizen intelligence.

PaperID: 12547, https://arxiv.org/pdf/2410.20116.pdf

Authors: Spencer Lin, Basem Rizk, Miru Jun, Andy Artze, Caitlin Sullivan, Sharon Mozgai, Scott Fisher

Title: Estuary: A Framework For Building Multimodal Low-Latency Real-Time Socially Interactive Agents

Abstract:
The rise in capability and ubiquity of generative artificial intelligence (AI) technologies has enabled its application to the field of Socially Interactive Agents (SIAs). Despite rising interest in modern AI‑powered components used for real‑time SIA research, substantial friction remains due to the absence of a standardized and universal SIA framework. To target this absence, we developed Estuary: a multimodal (text, audio, and soon video) framework which facilitates the development of low‑latency, real‑time SIAs. Estuary seeks to reduce repeat work between studies and to provide a flexible platform that can be run entirely off‑cloud to maximize configurability, controllability, reproducibility of studies, and speed of agent response times. We are able to do this by constructing a robust multimodal framework which incorporates current and future components seamlessly into a modular and interoperable architecture.

PaperID: 12548, https://arxiv.org/pdf/2410.20075.pdf

Authors: Sarper Aydin, Ceyhun Eksin

Title: Almost Sure Convergence of Networked Policy Gradient over Time-Varying Networks in Markov Potential Games

Abstract:
We propose networked policy gradient play for solving Markov potential games with continuous and/or discrete state‑action pairs. During the game, agents use parametrized and differentiable policies that depend on the current state and the policy parameters of other agents. During training, agents update their policy parameters following stochastic gradients. The gradient estimation involves two consecutive episodes, generating unbiased estimators of reward and policy score functions. In addition, it involves keeping estimates of others' parameters using consensus steps given local estimates received through a time‑varying communication network. In Markov potential games, there exists a potential value function among agents with gradients corresponding to the gradients of local value functions. Using this structure, we prove almost sure convergence to a stationary point of the potential value function with rate O(1/ε^2). Compared to previous works, our results do not require bounded policy gradients or initial agreement on the values of individual policy parameters. Numerical experiments on a dynamic multi‑agent newsvendor problem verify the convergence of local beliefs and gradients. It further shows that networked policy gradient play converges as fast as independent policy gradient updates, while collecting higher rewards.

PaperID: 12549, https://arxiv.org/pdf/2410.20007.pdf

Authors: Danqing Wang, Zhuorui Ye, Fei Fang, Lei Li

Title: Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Abstract:
Enhancing the reasoning capabilities of large language models (LLMs) is crucial for enabling them to tackle complex, multi‑step problems. Multi‑agent frameworks have shown great potential in enhancing LLMs' reasoning capabilities. However, the lack of effective cooperation between LLM agents hinders their performance, especially for multi‑step reasoning tasks. This paper proposes a novel cooperative multi‑agent reasoning framework (CoPlanner) by separating reasoning steps and assigning distinct duties to different agents. CoPlanner consists of two LLM agents: a planning agent and a reasoning agent. The planning agent provides high‑level strategic hints, while the reasoning agent follows these hints and infers answers. By training the planning agent's policy through the interactive reasoning process via Proximal Policy Optimization (PPO), the LLaMA‑3‑8B‑based CoPlanner outperforms the previous best method by 9.94% on LogiQA and 3.09% on BBH. Our results demonstrate that the guidance from the planning agent and the effective cooperation between the agents contribute to the superior performance of CoPlanner in tackling multi‑step reasoning problems.

PaperID: 12550, https://arxiv.org/pdf/2410.19920.pdf

Authors: Mohamed Salim Aissi, Clement Romac, Thomas Carta, Sylvain Lamprier, Pierre-Yves Oudeyer, Olivier Sigaud, Laure Soulier, Nicolas Thome

Title: Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

Abstract:
Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision‑making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine‑tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model's internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.

PaperID: 12551, https://arxiv.org/pdf/2410.19855.pdf

Authors: Param Thakkar, Anushka Yadav

Title: Personalized Recommendation Systems using Multimodal, Autonomous, Multi Agent Systems

Abstract:
This paper describes a highly developed personalised recommendation system using multimodal, autonomous, multi‑agent systems. The system focuses on the incorporation of futuristic AI tech and LLMs like Gemini‑1.5‑ pro and LLaMA‑70B to improve customer service experiences especially within e‑commerce. Our approach uses multi agent, multimodal systems to provide best possible recommendations to its users. The system is made up of three agents as a whole. The first agent recommends products appropriate for answering the given question, while the second asks follow‑up questions based on images that belong to these recommended products and is followed up with an autonomous search by the third agent. It also features a real‑time data fetch, user preferences‑based recommendations and is adaptive learning. During complicated queries the application processes with Symphony, and uses the Groq API to answer quickly with low response times. It uses a multimodal way to utilize text and images comprehensively, so as to optimize product recommendation and customer interaction.

PaperID: 12552, https://arxiv.org/pdf/2410.19811.pdf

Authors: Xingang Guo, Darioush Keivan, Usman Syed, Lianhui Qin, Huan Zhang, Geir Dullerud, Peter Seiler, Bin Hu

Title: ControlAgent: Automating Control System Design via Novel Integration of LLM Agents and Domain Expertise

Abstract:
Control system design is a crucial aspect of modern engineering with far‑reaching applications across diverse sectors including aerospace, automotive systems, power grids, and robotics. Despite advances made by Large Language Models (LLMs) in various domains, their application in control system design remains limited due to the complexity and specificity of control theory. To bridge this gap, we introduce ControlAgent, a new paradigm that automates control system design via novel integration of LLM agents and control‑oriented domain expertise. ControlAgent encodes expert control knowledge and emulates human iterative design processes by gradually tuning controller parameters to meet user‑specified requirements for stability, performance, and robustness. ControlAgent integrates multiple collaborative LLM agents, including a central agent responsible for task distribution and task‑specific agents dedicated to detailed controller design for various types of systems and requirements. ControlAgent also employs a Python computation agent that performs complex calculations and controller evaluations based on standard design information provided by task‑specified LLM agents. Combined with a history and feedback module, the task‑specific LLM agents iteratively refine controller parameters based on real‑time feedback from prior designs. Overall, ControlAgent mimics the design processes used by (human) practicing engineers, but removes all the human efforts and can be run in a fully automated way to give end‑to‑end solutions for control system design with user‑specified requirements. To validate ControlAgent's effectiveness, we develop ControlEval, an evaluation dataset that comprises 500 control tasks with various specific design goals. The effectiveness of ControlAgent is demonstrated via extensive comparative evaluations between LLM‑based and traditional human‑involved toolbox‑based baselines.

PaperID: 12553, https://arxiv.org/pdf/2410.19761.pdf

Authors: Abdalwhab Abdalwhab, Giovanni Beltrame, David St-Onge

Title: Physical Simulation for Multi-agent Multi-machine Tending

Abstract:
The manufacturing sector was recently affected by workforce shortages, a problem that automation and robotics can heavily minimize. Simultaneously, reinforcement learning (RL) offers a promising solution where robots can learn through interaction with the environment. In this work, we leveraged a simplistic robotic system to work with RL with "real" data without having to deploy large expensive robots in a manufacturing setting. A real‑world tabletop arena was designed with robots that mimic the agents' behavior in the simulation. Despite the difference in dynamics and machine size, the robots were able to depict the same behavior as in the simulation. In addition, those experiments provided an initial understanding of the real deployment challenges.

PaperID: 12554, https://arxiv.org/pdf/2410.19718.pdf

Authors: Guilherme S. Y. Giardini, John F. Hardy, Carlo R. da Cunha

Title: Evolving Neural Networks Reveal Emergent Collective Behavior from Minimal Agent Interactions

Abstract:
Understanding the mechanisms behind emergent behaviors in multi‑agent systems is critical for advancing fields such as swarm robotics and artificial intelligence. In this study, we investigate how neural networks evolve to control agents' behavior in a dynamic environment, focusing on the relationship between the network's complexity and collective behavior patterns. By performing quantitative and qualitative analyses, we demonstrate that the degree of network non‑linearity correlates with the complexity of emergent behaviors. Simpler behaviors, such as lane formation and laminar flow, are characterized by more linear network operations, while complex behaviors like swarming and flocking show highly non‑linear neural processing. Moreover, specific environmental parameters, such as moderate noise, broader field of view, and lower agent density, promote the evolution of non‑linear networks that drive richer, more intricate collective behaviors. These results highlight the importance of tuning evolutionary conditions to induce desired behaviors in multi‑agent systems, offering new pathways for optimizing coordination in autonomous swarms. Our findings contribute to a deeper understanding of how neural mechanisms influence collective dynamics, with implications for the design of intelligent, self‑organizing systems.

PaperID: 12555, https://arxiv.org/pdf/2410.19685.pdf

Authors: Jesús Aranda, Juan Francisco Díaz, David Gaona, Frank Valencia

Title: The Sound of Silence in Social Networks

Abstract:
We generalize the classic multi‑agent DeGroot model for opinion dynamics to incorporate the Spiral of Silence theory from political science. This theory states that individuals may withhold their opinions when they perceive them to be in the minority. As in the DeGroot model, a community of agents is represented as a weighted directed graph whose edges indicate how much agents influence one another. However, agents whose current opinions are in the minority become silent (i.e., they do not express their opinion). Two models for opinion update are then introduced. In the memoryless opinion model (SOM‑), agents update their opinion by taking the weighted average of their non‑silent neighbors' opinions. In the memory based opinion model (SOM+), agents update their opinions by taking the weighted average of the opinions of all their neighbors, but for silent neighbors, their most recent opinion is considered. We show that for SOM‑ convergence to consensus is guaranteed for clique graphs but, unlike for the classic DeGroot, not guaranteed for strongly‑connected aperiodic graphs. In contrast, we show that for SOM+ convergence to consensus is not guaranteed even for clique graphs. We showcase our models through simulations offering experimental insights that align with key aspects of the Spiral of Silence theory. These findings reveal the impact of silence dynamics on opinion formation and highlight the limitations of consensus in more nuanced social models.

PaperID: 12556, https://arxiv.org/pdf/2410.19639.pdf

Authors: Liu Yunhao, Ding Hong, Zhang Ziming, Wang Huixin, Liu Jinzhao, Xi Suyang

Title: Planning-Aware Diffusion Networks for Enhanced Motion Forecasting in Autonomous Driving

Abstract:
Autonomous driving technology has seen significant advancements, but existing models often fail to fully capture the complexity of multi‑agent environments, where interactions between dynamic agents are critical. To address this, we propose the Planning‑Integrated Forecasting Model (PIFM), a novel framework inspired by neural mechanisms governing decision‑making and multi‑agent coordination in the brain. PIFM leverages rich contextual information, integrating road structures, traffic rules, and the behavior of surrounding vehicles to improve both the accuracy and interpretability of predictions. By adopting a diffusion‑based architecture, akin to neural diffusion processes involved in predicting and planning, PIFM is able to forecast future trajectories of all agents within a scenario. This architecture enhances model transparency, as it parallels the brain's method of dynamically adjusting predictions based on external stimuli and other agents'behaviors. Extensive experiments validate PIFM's capacity to provide interpretable, neuroscience‑driven solutions for safer and more efficient autonomous driving systems, with an extremely low number of parameters.

PaperID: 12557, https://arxiv.org/pdf/2410.19544.pdf

Authors: Huajian Liu, Wei Dong, Kunpeng Fan, Chao Wang, Yongzhuo Gao

Title: PMM-Net: Single-stage Multi-agent Trajectory Prediction with Patching-based Embedding and Explicit Modal Modulation

Abstract:
Analyzing and forecasting trajectories of agents like pedestrians plays a pivotal role for embodied intelligent applications. The inherent indeterminacy of human behavior and complex social interaction among a rich variety of agents make this task more challenging than common time‑series forecasting. In this letter, we aim to explore a distinct formulation for multi‑agent trajectory prediction framework. Specifically, we proposed a patching‑based temporal feature extraction module and a graph‑based social feature extraction module, enabling effective feature extraction and cross‑scenario generalization. Moreover, we reassess the role of social interaction and present a novel method based on explicit modality modulation to integrate temporal and social features, thereby constructing an efficient single‑stage inference pipeline. Results on public benchmark datasets demonstrate the superior performance of our model compared with the state‑of‑the‑art methods. The code is available at: github.com/TIB‑K330/pmm‑net.

PaperID: 12558, https://arxiv.org/pdf/2410.19450.pdf

Authors: Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Title: Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

Abstract:
Offline‑to‑Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine‑tuning to enhance both sample efficiency and performance. However, most existing research has focused on single‑agent settings, with limited exploration of the multi‑agent extension, i.e., Offline‑to‑Online Multi‑Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre‑trained Q‑values due to distributional shifts during the transition from offline‑to‑online phases, and (ii) the difficulty of efficient exploration in the large joint state‑action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q‑values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine‑tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre‑trained offline policy for exploration, thereby significantly reducing the joint state‑action space to be explored. Extensive experiments on the StarCraft Multi‑Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.

PaperID: 12559, https://arxiv.org/pdf/2410.19385.pdf

Authors: Liam Barkley, Brink van der Merwe

Title: Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

Abstract:
Large Language Models (LLMs) are powerful computational models trained on extensive corpora of human‑readable text, enabling them to perform general‑purpose language understanding and generation. LLMs have garnered significant attention in both industry and academia due to their exceptional performance across various natural language processing (NLP) tasks. Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations. Prompt engineering, the process of designing and formulating instructions for LLMs to perform specific tasks, has emerged as a key approach to mitigating hallucinations. This paper provides a comprehensive empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs. Various prompting techniques are applied to a broad set of benchmark datasets to assess the accuracy and hallucination rate of each method. Additionally, the paper investigates the influence of tool‑calling agents (LLMs augmented with external tools to enhance their capabilities beyond language generation) on hallucination rates in the same benchmarks. The findings demonstrate that the optimal prompting technique depends on the type of problem, and that simpler techniques often outperform more complex methods in reducing hallucinations. Furthermore, it is shown that LLM agents can exhibit significantly higher hallucination rates due to the added complexity of external tool usage.

PaperID: 12560, https://arxiv.org/pdf/2410.19382.pdf

Authors: Jemma Daniel, Ruan de Kock, Louay Ben Nessir, Sasha Abramowitz, Omayma Mahjoub, Wiem Khlifi, Claude Formanek, Arnu Pretorius

Title: Multi-Agent Reinforcement Learning with Selective State-Space Models

Abstract:
The Transformer model has demonstrated success across a wide range of domains, including in Multi‑Agent Reinforcement Learning (MARL) where the Multi‑Agent Transformer (MAT) has emerged as a leading algorithm in the field. However, a significant drawback of Transformer models is their quadratic computational complexity relative to input size, making them computationally expensive when scaling to larger inputs. This limitation restricts MAT's scalability in environments with many agents. Recently, State‑Space Models (SSMs) have gained attention due to their computational efficiency, but their application in MARL remains unexplored. In this work, we investigate the use of Mamba, a recent SSM, in MARL and assess whether it can match the performance of MAT while providing significant improvements in efficiency. We introduce a modified version of MAT that incorporates standard and bi‑directional Mamba blocks, as well as a novel "cross‑attention" Mamba block. Extensive testing shows that our Multi‑Agent Mamba (MAM) matches the performance of MAT across multiple standard multi‑agent environments, while offering superior scalability to larger agent scenarios. This is significant for the MARL community, because it indicates that SSMs could replace Transformers without compromising performance, whilst also supporting more effective scaling to higher numbers of agents. Our project page is available at https://sites.google.com/view/multi‑agent‑mamba .

PaperID: 12561, https://arxiv.org/pdf/2410.19359.pdf

Authors: Yangjing Wang, Xiao Li, Xinping Yi, Shi Jin

Title: Joint User Scheduling and Precoding for RIS-Aided MU-MISO Systems: A MADRL Approach

Abstract:
With the increasing demand for spectrum efficiency and energy efficiency, reconfigurable intelligent surfaces (RISs) have attracted massive attention due to its low‑cost and capability of controlling wireless environment. However, there is still a lack of treatments to deal with the growth of the number of users and RIS elements, which may incur performance degradation or computational complexity explosion. In this paper, we investigate the joint optimization of user scheduling and precoding for distributed RIS‑aided communication systems. Firstly, we propose an optimization‑based numerical method to obtain suboptimal solutions with the aid of the approximation of ergodic sum rate. Secondly, to reduce the computational complexity caused by the high dimensionality, we propose a data‑driven scalable and generalizable multi‑agent deep reinforcement learning (MADRL) framework with the aim to maximize the ergodic sum rate approximation through the cooperation of all agents. Further, we propose a novel dynamic working process exploiting the trained MADRL algorithm, which enables distributed RISs to configure their own passive precoding independently. Simulation results show that our algorithm substantially reduces the computational complexity by a time reduction of three orders of magnitude at the cost of 3% performance degradation, compared with the optimization‑based method, and achieves 6% performance improvement over the state‑of‑the‑art MADRL algorithms.

PaperID: 12562, https://arxiv.org/pdf/2410.19245.pdf

Authors: Zixiao Zhao, Jing Sun, Zhe Hou, Zhiyuan Wei, Cheng-Hao Cai, Miao Qiao, Jin Song Dong

Title: MaCTG: Multi-Agent Collaborative Thought Graph for Automatic Programming

Abstract:
With the rapid advancement of Large Language Models (LLMs), LLM‑based approaches have demonstrated strong problem‑solving capabilities across various domains. However, in automatic programming, a single LLM is typically limited to function‑level code generation, while multi‑agent systems composed of multiple LLMs often suffer from inefficient task planning. This lack of structured coordination can lead to cascading hallucinations, where accumulated errors across agents result in suboptimal workflows and excessive computational costs. To overcome these challenges, we introduce MaCTG (Multi‑Agent Collaborative Thought Graph), a novel multi‑agent framework that employs a dynamic graph structure to facilitate precise task allocation and controlled collaboration among LLM agents. MaCTG autonomously assigns agent roles based on programming requirements, dynamically refines task distribution through context‑aware adjustments, and systematically verifies and integrates project‑level code, effectively reducing hallucination errors and improving overall accuracy. MaCTG enhances cost‑effectiveness by implementing a hybrid LLM deployment, where proprietary models handle complex reasoning, while open‑source models are used for routine coding and validation tasks. To evaluate MaCTG's effectiveness, we applied it to traditional image processing auto‑programming tasks, achieving a state‑of‑the‑art accuracy of 83.33%. Additionally, by leveraging its hybrid LLM configuration, MaCTG significantly reduced operational costs by 89.09% compared to existing multi‑agent frameworks, demonstrating its efficiency, scalability, and real‑world applicability.

PaperID: 12563, https://arxiv.org/pdf/2410.19064.pdf

Authors: Yuhan Liu, Zirui Song, Juntian Zhang, Xiaoqing Zhang, Xiuying Chen, Rui Yan

Title: The Stepwise Deception: Simulating the Evolution from True News to Fake News with LLM Agents

Abstract:
With the growing spread of misinformation online, understanding how true news evolves into fake news has become crucial for early detection and prevention. However, previous research has often assumed fake news inherently exists rather than exploring its gradual formation. To address this gap, we propose FUSE (Fake news evolUtion Simulation framEwork), a novel Large Language Model (LLM)‑based simulation approach explicitly focusing on fake news evolution from real news. Our framework model a social network with four distinct types of LLM agents commonly observed in daily interactions: spreaders who propagate information, commentators who provide interpretations, verifiers who fact‑check, and bystanders who observe passively to simulate realistic daily interactions that progressively distort true news. To quantify these gradual distortions, we develop FUSE‑EVAL, a comprehensive evaluation framework measuring truth deviation along multiple linguistic and semantic dimensions. Results show that FUSE effectively captures fake news evolution patterns and accurately reproduces known fake news, aligning closely with human evaluations. Experiments demonstrate that FUSE accurately reproduces known fake news evolution scenarios, aligns closely with human judgment, and highlights the importance of timely intervention at early stages. Our framework is extensible, enabling future research on broader scenarios of fake news.

PaperID: 12564, https://arxiv.org/pdf/2410.19054.pdf

Authors: Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, Heng Ji

Title: Infogent: An Agent-Based Framework for Web Information Aggregation

Abstract:
Despite seemingly performant web agents on the task‑completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API‑driven Access relies on a text‑only view of the Web, leveraging external tools such as Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi‑agent search framework by 7% under Direct API‑Driven Access on FRAMES, and improves over an existing information‑seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.

PaperID: 12565, https://arxiv.org/pdf/2410.19000.pdf

Authors: Pengfei He, Zitao Li, Yue Xing, Yaling Li, Jiliang Tang, Bolin Ding

Title: Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning

Abstract:
Zero‑shot reasoning methods with Large Language Models (LLMs) offer significant advantages including great generalization to novel tasks and reduced dependency on human‑crafted examples. However, the current zero‑shot methods still have limitations in complex tasks, e.g., answering questions that require multi‑step reasoning. In this paper, we address this limitation by introducing a novel structure‑oriented analysis method to help LLMs better understand the question and guide the problem‑solving process of LLMs. We first demonstrate how the existing reasoning strategies, Chain‑of‑Thought and ReAct, can benefit from our structure‑oriented analysis. In addition to empirical investigations, we leverage the probabilistic graphical model to theoretically explain why our structure‑oriented analysis can improve the LLM reasoning process. To further improve the reliability in complex question‑answering tasks, we propose a multi‑agent reasoning system, Structure‑oriented Autonomous Reasoning Agents (SARA), that can better enforce the reasoning process following our structure‑oriented analysis by refinement techniques and is equipped with external knowledge retrieval capability to reduce factual errors. Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few‑shot methods. Finally, the system not only improves reasoning accuracy in complex tasks but also demonstrates robustness against potential attacks that corrupt the reasoning process.

PaperID: 12566, https://arxiv.org/pdf/2410.18935.pdf

Authors: Sha Li, Revanth Gangi Reddy, Khanh Duy Nguyen, Qingyun Wang, May Fung, Chi Han, Jiawei Han, Kartik Natarajan, Clare R. Voss, Heng Ji

Title: Schema-Guided Culture-Aware Complex Event Simulation with Multi-Agent Role-Play

Abstract:
Complex news events, such as natural disasters and socio‑political conflicts, require swift responses from the government and society. Relying on historical events to project the future is insufficient as such events are sparse and do not cover all possible conditions and nuanced situations. Simulation of these complex events can help better prepare and reduce the negative impact. We develop a controllable complex news event simulator guided by both the event schema representing domain knowledge about the scenario and user‑provided assumptions representing case‑specific conditions. As event dynamics depend on the fine‑grained social and cultural context, we further introduce a geo‑diverse commonsense and cultural norm‑aware knowledge enhancement component. To enhance the coherence of the simulation, apart from the global timeline of events, we take an agent‑based approach to simulate the individual character states, plans, and actions. By incorporating the schema and cultural norms, our generated simulations achieve much higher coherence and appropriateness and are received favorably by participants from a humanitarian assistance organization.

PaperID: 12567, https://arxiv.org/pdf/2410.18792.pdf

Authors: Yuxing Chen, Weijie Wang, Sylvain Lobry, Camille Kurtz

Title: An LLM Agent for Automatic Geospatial Data Analysis

Abstract:
Large language models (LLMs) are being used in data science code generation tasks, but they often struggle with complex sequential tasks, leading to logical errors. Their application to geospatial data processing is particularly challenging due to difficulties in incorporating complex data structures and spatial constraints, effectively utilizing diverse function calls, and the tendency to hallucinate less‑used geospatial libraries. To tackle these problems, we introduce GeoAgent, a new interactive framework designed to help LLMs handle geospatial data processing more effectively. GeoAgent pioneers the integration of a code interpreter, static analysis, and Retrieval‑Augmented Generation (RAG) techniques within a Monte Carlo Tree Search (MCTS) algorithm, offering a novel approach to geospatial data processing. In addition, we contribute a new benchmark specifically designed to evaluate the LLM‑based approach in geospatial tasks. This benchmark leverages a variety of Python libraries and includes both single‑turn and multi‑turn tasks such as data acquisition, data analysis, and visualization. By offering a comprehensive evaluation among diverse geospatial contexts, this benchmark sets a new standard for developing LLM‑based approaches in geospatial data analysis tasks. Our findings suggest that relying solely on knowledge of LLM is insufficient for accurate geospatial task programming, which requires coherent multi‑step processes and multiple function calls. Compared to the baseline LLMs, the proposed GeoAgent has demonstrated superior performance, yielding notable improvements in function calls and task completion. In addition, these results offer valuable insights for the future development of LLM agents in automatic geospatial data analysis task programming.

PaperID: 12568, https://arxiv.org/pdf/2410.18636.pdf

Authors: Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake Richards, Guillaume Lajoie, Blaise Agüera y Arcas, João Sacramento

Title: Multi-agent cooperation through learning-aware policy gradients

Abstract:
Self‑interested individuals often fail to cooperate, posing a fundamental challenge for multi‑agent learning. How can we achieve cooperation among self‑interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning‑aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher‑derivative‑free policy gradient algorithm for learning‑aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long‑context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally‑extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self‑interested learning‑aware agents.

PaperID: 12569, https://arxiv.org/pdf/2410.18631.pdf

Authors: Niki Kotecha, Antonio del Rio Chanona

Title: Leveraging Graph Neural Networks and Multi-Agent Reinforcement Learning for Inventory Control in Supply Chains

Abstract:
Inventory control in modern supply chains has attracted significant attention due to the increasing number of disruptive shocks and the challenges posed by complex dynamics, uncertainties, and limited collaboration. Traditional methods, which often rely on static parameters, struggle to adapt to changing environments. This paper proposes a Multi‑Agent Reinforcement Learning (MARL) framework with Graph Neural Networks (GNNs) for state representation to address these limitations. Our approach redefines the action space by parameterizing heuristic inventory control policies, making it adaptive as the parameters dynamically adjust based on system conditions. By leveraging the inherent graph structure of supply chains, our framework enables agents to learn the system's topology, and we employ a centralized learning, decentralized execution scheme that allows agents to learn collaboratively while overcoming information‑sharing constraints. Additionally, we incorporate global mean pooling and regularization techniques to enhance performance. We test the capabilities of our proposed approach on four different supply chain configurations and conduct a sensitivity analysis. This work paves the way for utilizing MARL‑GNN frameworks to improve inventory management in complex, decentralized supply chain environments.

PaperID: 12570, https://arxiv.org/pdf/2410.18621.pdf

Authors: Wonhyung Choi, Inkyung Ahn

Title: Evolutionary Dispersal of Ecological Species via Multi-Agent Deep Reinforcement Learning

Abstract:
Understanding species dynamics in heterogeneous environments is essential for ecosystem studies. Traditional models assumed homogeneous habitats, but recent approaches include spatial and temporal variability, highlighting species migration. We adopt starvation‑driven diffusion (SDD) models as nonlinear diffusion to describe species dispersal based on local resource conditions, showing advantages for species survival. However, accurate prediction remains challenging due to model simplifications. This study uses multi‑agent reinforcement learning (MARL) with deep Q‑networks (DQN) to simulate single species and predator‑prey interactions, incorporating SDD‑type rewards. Our simulations reveal evolutionary dispersal strategies, providing insights into species dispersal mechanisms and validating traditional mathematical models.

PaperID: 12571, https://arxiv.org/pdf/2410.18528.pdf

Authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Rithesh Murthy, Liangwei Yang, Zuxin Liu, Tian Lan, Ming Zhu, Juntao Tan, Shirley Kokane, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Title: PRACT: Optimizing Principled Reasoning and Acting of LLM Agent

Abstract:
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward‑RPO, which uses environmental rewards for reflection, and Self‑RPO, which conducts self‑reflection without external rewards. Additionally, two RPO methods, RPO‑Traj and RPO‑Batch, is introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.

PaperID: 12572, https://arxiv.org/pdf/2410.18489.pdf

Authors: Ahmed R. Sadik, Sebastian Brulin, Markus Olhofer

Title: LLM as a code generator in Agile Model Driven Development

Abstract:
Leveraging Large Language Models (LLM) like GPT4 in the auto generation of code represents a significant advancement, yet it is not without its challenges. The ambiguity inherent in natural language descriptions of software poses substantial obstacles to generating deployable, structured artifacts. This research champions Model Driven Development (MDD) as a viable strategy to overcome these challenges, proposing an Agile Model Driven Development (AMDD) approach that employs GPT4 as a code generator. This approach enhances the flexibility and scalability of the code auto generation process and offers agility that allows seamless adaptation to changes in models or deployment environments. We illustrate this by modeling a multi agent Unmanned Vehicle Fleet (UVF) system using the Unified Modeling Language (UML), significantly reducing model ambiguity by integrating the Object Constraint Language (OCL) for code structure meta modeling, and the FIPA ontology language for communication semantics meta modeling. Applying GPT4 auto generation capabilities yields Java and Python code that is compatible with the JADE and PADE frameworks, respectively. Our thorough evaluation of the auto generated code verifies its alignment with expected behaviors and identifies enhancements in agent interactions. Structurally, we assessed the complexity of code derived from a model constrained solely by OCL meta models, against that influenced by both OCL and FIPA ontology meta models. The results indicate that the ontology constrained meta model produces inherently more complex code, yet its cyclomatic complexity remains within manageable levels, suggesting that additional meta model constraints can be incorporated without exceeding the high risk threshold for complexity.

PaperID: 12573, https://arxiv.org/pdf/2410.18470.pdf

Authors: Nhat-Minh Le-Phan, Phuoc Doan Nguyen, Hyo-Sung Ahn, Minh Hoang Trinh

Title: Bearing-Only Solution for Fermat-Weber Location Problem: Generalized Algorithms

Abstract:
This paper presents novel algorithms for the Fermat‑Weber Location Problem, guiding an autonomous agent to the point that minimizes the weighted sum of Euclidean distances to some beacons using only bearing measurements. The existing results address only the simple scenario where the beacons are stationary and the agent is modeled by a single integrator. In this paper, we propose a number of bearing‑only algorithms that let the agent, which can be modeled as either a single‑integrator or a double‑integrator, follow the Fermat‑Weber point of a group of stationary or moving beacons. The theoretical results are rigorously proven using Lyapunov theory and supported with simulation examples.

PaperID: 12574, https://arxiv.org/pdf/2410.18416.pdf

Authors: Zizhao Wang, Jiaheng Hu, Caleb Chuck, Stephen Chen, Roberto Martín-Martín, Amy Zhang, Scott Niekum, Peter Stone

Title: SkiLD: Unsupervised Skill Discovery Guided by Factor Interactions

Abstract:
Unsupervised skill discovery carries the promise that an intelligent agent can learn reusable skills through autonomous, reward‑free environment interaction. Existing unsupervised skill discovery methods learn skills by encouraging distinguishable behaviors that cover diverse states. However, in complex environments with many state factors (e.g., household environments with many objects), learning skills that cover all possible states is impossible, and naively encouraging state diversity often leads to simple skills that are not ideal for solving downstream tasks. This work introduces Skill Discovery from Local Dependencies (Skild), which leverages state factorization as a natural inductive bias to guide the skill learning process. The key intuition guiding Skild is that skills that induce <b>diverse interactions</b> between state factors are often more valuable for solving downstream tasks. To this end, Skild develops a novel skill learning objective that explicitly encourages the mastering of skills that effectively induce different interactions within an environment. We evaluate Skild in several domains with challenging, long‑horizon sparse reward tasks including a realistic simulated household robot domain, where Skild successfully learns skills with clear semantic meaning and shows superior performance compared to existing unsupervised reinforcement learning methods that only maximize state coverage.

PaperID: 12575, https://arxiv.org/pdf/2410.18382.pdf

Authors: Xinran Fang, Chengleyang Lei, Wei Feng, Yunfei Chen, Ming Xiao, Ning Ge, Chengxiang Wang

Title: Sensing-Communication-Computing-Control Closed-Loop Optimization for 6G Unmanned Robotic Systems

Abstract:
Rapid advancements in field robots have brought a new kind of cyber physical system (CPS)‑‑unmanned robotic system‑‑under the spotlight. In the upcoming sixth‑generation (6G) era, these systems hold great potential to replace humans in hazardous tasks. This paper investigates an unmanned robotic system comprising a multi‑functional unmanned aerial vehicle (UAV), sensors, and actuators. The UAV carries communication and computing modules, acting as an edge information hub (EIH) that transfers and processes information. During the task execution, the EIH gathers sensing data, calculates control commands, and transmits commands to actuators‑‑leading to reflex‑arc‑like sensing‑communication‑computing‑control (\mathbfSC^3) loops. Unlike existing studies that design \mathbfSC^3 loop components separately, we take each \mathbfSC^3 loop as an integrated structure and propose a goal‑oriented closed‑loop optimization scheme. This scheme jointly optimizes uplink and downlink (UL&DL) communication and computing within and across the \mathbfSC^3 loops to minimize the total linear quadratic regulator (LQR) cost. We derive optimal closed‑form solutions for intra‑loop allocation and propose an efficient iterative algorithm for inter‑loop optimization. Under the condition of adequate CPU frequency availability, we derive an approximate closed‑form solution for inter‑loop bandwidth allocation. Simulation results demonstrate that the proposed scheme achieves a two‑tier task‑level balance within and across \mathbfSC^3 loops.

PaperID: 12576, https://arxiv.org/pdf/2410.18202.pdf

Authors: Rohit Bokade, Xiaoning Jin

Title: PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control

Abstract:
Multi‑Agent Reinforcement Learning (MARL) presents a promising approach for addressing the complexity of Traffic Signal Control (TSC) in urban environments. However, existing platforms for MARL‑based TSC research face challenges such as slow simulation speeds and convoluted, difficult‑to‑maintain codebases. To address these limitations, we introduce PyTSC, a robust and flexible simulation environment that facilitates the training and evaluation of MARL algorithms for TSC. PyTSC integrates multiple simulators, such as SUMO and CityFlow, and offers a streamlined API, empowering researchers to explore a broad spectrum of MARL approaches efficiently. PyTSC accelerates experimentation and provides new opportunities for advancing intelligent traffic management systems in real‑world applications.

PaperID: 12577, https://arxiv.org/pdf/2410.18112.pdf

Authors: Rui Du, Kai Zhao, Jinlong Hou, Qiang Zhang, Peter Zhang

Title: OPTIMA: Optimized Policy for Intelligent Multi-Agent Systems Enables Coordination-Aware Autonomous Vehicles

Abstract:
Coordination among connected and autonomous vehicles (CAVs) is advancing due to developments in control and communication technologies. However, much of the current work is based on oversimplified and unrealistic task‑specific assumptions, which may introduce vulnerabilities. This is critical because CAVs not only interact with their environment but are also integral parts of it. Insufficient exploration can result in policies that carry latent risks, highlighting the need for methods that explore the environment both extensively and efficiently. This work introduces OPTIMA, a novel distributed reinforcement learning framework for cooperative autonomous vehicle tasks. OPTIMA alternates between thorough data sampling from environmental interactions and multi‑agent reinforcement learning algorithms to optimize CAV cooperation, emphasizing both safety and efficiency. Our goal is to improve the generality and performance of CAVs in highly complex and crowded scenarios. Furthermore, the industrial‑scale distributed training system easily adapts to different algorithms, reward functions, and strategies.

PaperID: 12578, https://arxiv.org/pdf/2410.18012.pdf

Authors: Sungil Seok, Shuide Wen, Qiyuan Yang, Juan Feng, Wenming Yang

Title: MiniFed : Integrating LLM-based Agentic-Workflow for Simulating FOMC Meeting

Abstract:
The Federal Funds rate in the United States plays a significant role in both domestic and international financial markets. However, research has predominantly focused on the effects of adjustments to the Federal Funds rate rather than on the decision‑making process itself. Recent advancements in large language models(LLMs) offer a potential method for reconstructing the original FOMC meetings, which are responsible for setting the Federal Funds rate. In this paper, we propose a five‑stage FOMC meeting simulation framework, MiniFed, which employs LLM agents to simulate real‑world FOMC meeting members and optimize the FOMC structure. This framework effectively revitalizes the FOMC meeting process and facilitates projections of the Federal Funds rate. Experimental results demonstrate that our proposed MiniFed framework achieves both high accuracy in Federal Funds rate projections and behavioral alignment with the agents' real‑world counterparts. Given that few studies have focused on employing LLM agents to simulate large‑scale real‑world conferences, our work can serve as a benchmark for future developments.

PaperID: 12579, https://arxiv.org/pdf/2410.17967.pdf

Authors: Imad Bouhou, Stefano Fortunati, Leila Gharsalli, Alexandre Renaux

Title: POMDP-Driven Cognitive Massive MIMO Radar: Joint Target Detection-Tracking In Unknown Disturbances

Abstract:
The joint detection and tracking of a moving target embedded in an unknown disturbance represents a key feature that motivates the development of the cognitive radar paradigm. Building upon recent advancements in robust target detection with multiple‑input multiple‑output (MIMO) radars, this work explores the application of a Partially Observable Markov Decision Process (POMDP) framework to enhance the tracking and detection tasks in a statistically unknown environment. In the POMDP setup, the radar system is considered as an intelligent agent that continuously senses the surrounding environment, optimizing its actions to maximize the probability of detection (P_D) and improve the target position and velocity estimation, all this while keeping a constant probability of false alarm (P_FA). The proposed approach employs an online algorithm that does not require any apriori knowledge of the noise statistics, and it relies on a much more general observation model than the traditional range‑azimuth‑elevation model employed by conventional tracking algorithms. Simulation results clearly show substantial performance improvement of the POMDP‑based algorithm compared to the State‑Action‑Reward‑State‑Action (SARSA)‑based one that has been recently investigated in the context of massive MIMO (MMIMO) radar systems.

PaperID: 12580, https://arxiv.org/pdf/2410.17922.pdf

Authors: Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bing Feng, Yuan Yao, Yu Li

Title: Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models

Abstract:
With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain‑specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries. (ii) over‑defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi‑agents‑based defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that our G4D can enhance LLM's robustness against jailbreak attacks on general and domain‑specific scenarios without compromising the model's general functionality.

PaperID: 12581, https://arxiv.org/pdf/2410.17898.pdf

Authors: Axel Brunnbauer, Julian Lemmel, Zahra Babaiee, Sophie Neubauer, Radu Grosu

Title: Scalable Offline Reinforcement Learning for Mean Field Games

Abstract:
Reinforcement learning algorithms for mean‑field games offer a scalable framework for optimizing policies in large populations of interacting agents. Existing methods often depend on online interactions or access to system dynamics, limiting their practicality in real‑world scenarios where such interactions are infeasible or difficult to model. In this paper, we present Offline Munchausen Mirror Descent (Off‑MMD), a novel mean‑field RL algorithm that approximates equilibrium policies in mean‑field games using purely offline data. By leveraging iterative mirror descent and importance sampling techniques, Off‑MMD estimates the mean‑field distribution from static datasets without relying on simulation or environment dynamics. Additionally, we incorporate techniques from offline reinforcement learning to address common issues like Q‑value overestimation, ensuring robust policy learning even with limited data coverage. Our algorithm scales to complex environments and demonstrates strong performance on benchmark tasks like crowd exploration or navigation, highlighting its applicability to real‑world multi‑agent systems where online experimentation is infeasible. We empirically demonstrate the robustness of Off‑MMD to low‑quality datasets and conduct experiments to investigate its sensitivity to hyperparameter choices.

PaperID: 12582, https://arxiv.org/pdf/2410.17883.pdf

Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

Title: Lightweight Neural App Control

Abstract:
This paper introduces a novel mobile phone control architecture, Lightweight Multi‑modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine‑tuned vision‑language model (VLM) for real‑time decision‑making and task execution. We evaluate LiMAC on two open‑source mobile control datasets, demonstrating the superior performance of our small‑form‑factor approach against fine‑tuned versions of open‑source VLMs, such as Florence2 and Qwen2‑VL. It also significantly outperforms prompt engineering baselines utilising closed‑source foundation models like GPT‑4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine‑tuned VLMs, and up to 42% compared to prompt‑engineering baselines.

PaperID: 12583, https://arxiv.org/pdf/2410.17785.pdf

Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo, Francesc Moreno-Noguer

Title: TranSPORTmer: A Holistic Approach to Trajectory Understanding in Multi-Agent Sports

Abstract:
Understanding trajectories in multi‑agent scenarios requires addressing various tasks, including predicting future movements, imputing missing observations, inferring the status of unseen agents, and classifying different global states. Traditional data‑driven approaches often handle these tasks separately with specialized models. We introduce TranSPORTmer, a unified transformer‑based framework capable of addressing all these tasks, showcasing its application to the intricate dynamics of multi‑agent sports scenarios like soccer and basketball. Using Set Attention Blocks, TranSPORTmer effectively captures temporal dynamics and social interactions in an equivariant manner. The model's tasks are guided by an input mask that conceals missing or yet‑to‑be‑predicted observations. Additionally, we introduce a CLS extra agent to classify states along soccer trajectories, including passes, possessions, uncontrolled states, and out‑of‑play intervals, contributing to an enhancement in modeling trajectories. Evaluations on soccer and basketball datasets show that TranSPORTmer outperforms state‑of‑the‑art task‑specific models in player forecasting, player forecasting‑imputation, ball inference, and ball imputation. https://youtu.be/8VtSRm8oGoE

PaperID: 12584, https://arxiv.org/pdf/2410.17742.pdf

Authors: Xinyu Jia, Wenxin Wang, Jun Yang, Yongping Pan, Haoyong Yu

Title: Multi-layered Safety of Redundant Robot Manipulators via Task-oriented Planning and Control

Abstract:
Ensuring safety is crucial to promote the application of robot manipulators in open workspaces. Factors such as sensor errors or unpredictable collisions make the environment full of uncertainties. In this work, we investigate these potential safety challenges on redundant robot manipulators, and propose a task‑oriented planning and control framework to achieve multi‑layered safety while maintaining efficient task execution. Our approach consists of two main parts: a task‑oriented trajectory planner based on multiple‑shooting model predictive control (MPC) method, and a torque controller that allows safe and efficient collision reaction using only proprioceptive data. Through extensive simulations and real‑hardware experiments, we demonstrate that the proposed framework can effectively handle uncertain static or dynamic obstacles, and perform disturbance resistance in manipulation tasks when unforeseen contacts occur.

PaperID: 12585, https://arxiv.org/pdf/2410.17690.pdf

Authors: Adam Casselman, Abraham P. Vinod, Sarah H. Q. Li

Title: Multi-agent Reach-avoid MDP via Potential Games and Low-rank Policy Structure

Abstract:
We optimize finite horizon multi‑agent reach‑avoid Markov decision process (MDP) via \emphlocal feedback policies. The global feedback policy solution yields global optimality but its communication complexity, memory usage and computation complexity scale exponentially with the number of agents. We mitigate this exponential dependency by restricting the solution space to local feedback policies and show that local feedback policies are rank‑one factorizations of global feedback policies, which provides a principled approach to reducing communication complexity and memory usage. Additionally, by demonstrating that multi‑agent reach‑avoid MDPs over local feedback policies has a potential game structure, we show that iterative best response is a tractable multi‑agent learning scheme with guaranteed convergence to deterministic Nash equilibrium, and derive each agent's best response via multiplicative dynamic program (DP) over the joint state space. Numerical simulations across different MDPs and agent sets show that the peak memory usage and offline computation complexity are significantly reduced while the approximation error to the optimal global reach‑avoid objective is maintained.

PaperID: 12586, https://arxiv.org/pdf/2410.17529.pdf

Authors: Yongqiang Huang, Wentao Ye, Liyao Li, Junbo Zhao

Title: Navigate Complex Physical Worlds via Geometrically Constrained LLM

Abstract:
This study investigates the potential of Large Language Models (LLMs) for reconstructing and constructing the physical world solely based on textual knowledge. It explores the impact of model performance on spatial understanding abilities. To enhance the comprehension of geometric and spatial relationships in the complex physical world, the study introduces a set of geometric conventions and develops a workflow based on multi‑layer graphs and multi‑agent system frameworks. It examines how LLMs achieve multi‑step and multi‑objective geometric inference in a spatial environment using multi‑layer graphs under unified geometric conventions. Additionally, the study employs a genetic algorithm, inspired by large‑scale model knowledge, to solve geometric constraint problems. In summary, this work innovatively explores the feasibility of using text‑based LLMs as physical world builders and designs a workflow to enhance their capabilities.

PaperID: 12587, https://arxiv.org/pdf/2410.17517.pdf

Authors: Karthik Soma, Yann Bouteiller, Heiko Hamann, Giovanni Beltrame

Title: The Hive Mind is a Single Reinforcement Learning Agent

Abstract:
Decision‑making is an essential attribute of any intelligent agent or group. Natural systems are known to converge to effective strategies through at least two distinct mechanisms: collective decision‑making via imitation of others, and trial‑and‑error by a single agent. This paper establishes an equivalence between these two paradigms by drawing from the well‑studied collective decision‑making problem of nest‑hunting in swarms of honey bees. We show that the emergent distributed cognition (sometimes referred to as the hive mind) arising from individuals following simple, local imitation‑based rules is that of a single online reinforcement learning (RL) agent interacting with many parallel environments. More specifically, in the purely imitative weighted voter model of bees' waggle dance, the update rule through which this macro‑agent learns is a multi‑armed bandit algorithm that we coin Maynard‑Cross Learning. Our analysis implies that a group of purely imitative organisms can be equivalent to a more complex, reinforcement‑enabled entity, substantiating the idea that group‑level intelligence may explain how seemingly simple and blind individual behaviors are selected in nature. Beyond biology, the framework offers new tools for analyzing economic and social systems where individuals imitate successful strategies, effectively participating in a collective learning process. Our findings may further inform the design of scalable RL‑inspired collective systems in artificial domains.

PaperID: 12588, https://arxiv.org/pdf/2410.17466.pdf

Authors: Yann Bouteiller, Karthik Soma, Giovanni Beltrame

Title: Evolution of Societies via Reinforcement Learning

Abstract:
The universe involves many independent co‑learning agents as an ever‑evolving part of our observed environment. Yet, in practice, Multi‑Agent Reinforcement Learning (MARL) applications are typically constrained to small, homogeneous populations and remain computationally intensive. We propose a methodology that enables simulating populations of Reinforcement Learning agents at evolutionary scale. More specifically, we derive a fast, parallelizable implementation of Policy Gradient (PG) and Opponent‑Learning Awareness (LOLA), tailored for evolutionary simulations where agents undergo random pairwise interactions in stateless normal‑form games. We demonstrate our approach by simulating the evolution of very large populations made of heterogeneous co‑learning agents, under both naive and advanced learning strategies. In our experiments, 200,000 PG or LOLA agents evolve in the classic games of Hawk‑Dove, Stag‑Hunt, and Rock‑Paper‑Scissors. Each game provides distinct insights into how populations evolve under both naive and advanced MARL rules, including compelling ways in which Opponent‑Learning Awareness affects social evolution.

PaperID: 12589, https://arxiv.org/pdf/2410.17462.pdf

Authors: Minhua Lin, Zhengzhang Chen, Yanchi Liu, Xujiang Zhao, Zongyu Wu, Junxiang Wang, Xiang Zhang, Suhang Wang, Haifeng Chen

Title: Decoding Time Series with LLMs: A Multi-Agent Framework for Cross-Domain Annotation

Abstract:
Time series data is ubiquitous across various domains, including manufacturing, finance, and healthcare. High‑quality annotations are essential for effectively understanding time series and facilitating downstream tasks; however, obtaining such annotations is challenging, particularly in mission‑critical domains. In this paper, we propose TESSA, a multi‑agent system designed to automatically generate both general and domain‑specific annotations for time series data. TESSA introduces two agents: a general annotation agent and a domain‑specific annotation agent. The general agent captures common patterns and knowledge across multiple source domains, leveraging both time‑series‑wise and text‑wise features to generate general annotations. Meanwhile, the domain‑specific agent utilizes limited annotations from the target domain to learn domain‑specific terminology and generate targeted annotations. Extensive experiments on multiple synthetic and real‑world datasets demonstrate that TESSA effectively generates high‑quality annotations, outperforming existing methods.

PaperID: 12590, https://arxiv.org/pdf/2410.17382.pdf

Authors: Amirhossein Afsharrad, Parisa Oftadeh, Ahmadreza Moradipari, Sanjay Lall

Title: Cooperative Multi-Agent Constrained Stochastic Linear Bandits

Abstract:
In this study, we explore a collaborative multi‑agent stochastic linear bandit setting involving a network of N agents that communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold τ. Each agent encounters a distinct linear bandit problem characterized by its own reward and cost parameters, i.e., local parameters. The goal of the agents is to determine the best overall action corresponding to the average of these parameters, or so‑called global parameters. In each round, an agent is randomly chosen to select an action based on its current knowledge of the system. This chosen action is then executed by all agents, then they observe their individual rewards and costs. We propose a safe distributed upper confidence bound algorithm, so called MA‑OPLB, and establish a high probability bound on its T‑round regret. MA‑OPLB utilizes an accelerated consensus method, where agents can compute an estimate of the average rewards and costs across the network by communicating the proper information with their neighbors. We show that our regret bound is of order \mathcalO\left(\fracdτ‑c_0\frac\log(NT)^2\sqrtN\sqrt\fracT\log(1/|λ_2|)\right), where λ_2 is the second largest (in absolute value) eigenvalue of the communication matrix, and τ‑c_0 is the known cost gap of a feasible action. We also experimentally show the performance of our proposed algorithm in different network structures.

PaperID: 12591, https://arxiv.org/pdf/2410.17373.pdf

Authors: Dongsu Lee, Minhae Kwon

Title: Episodic Future Thinking Mechanism for Multi-agent Reinforcement Learning

Abstract:
Understanding cognitive processes in multi‑agent interactions is a primary goal in cognitive science. It can guide the direction of artificial intelligence (AI) research toward social decision‑making in multi‑agent systems, which includes uncertainty from character heterogeneity. In this paper, we introduce an episodic future thinking (EFT) mechanism for a reinforcement learning (RL) agent, inspired by cognitive processes observed in animals. To enable future thinking functionality, we first develop a multi‑character policy that captures diverse characters with an ensemble of heterogeneous policies. Here, the character of an agent is defined as a different weight combination on reward components, representing distinct behavioral preferences. The future thinking agent collects observation‑action trajectories of the target agents and uses the pre‑trained multi‑character policy to infer their characters. Once the character is inferred, the agent predicts the upcoming actions of target agents and simulates the potential future scenario. This capability allows the agent to adaptively select the optimal action, considering the predicted future scenario in multi‑agent interactions. To evaluate the proposed mechanism, we consider the multi‑agent autonomous driving scenario with diverse driving traits and multiple particle environments. Simulation results demonstrate that the EFT mechanism with accurate character inference leads to a higher reward than existing multi‑agent solutions. We also confirm that the effect of reward improvement remains valid across societies with different levels of character diversity.

PaperID: 12592, https://arxiv.org/pdf/2410.17351.pdf

Authors: Aditya Vikram Singh, Ethan Rathbun, Emma Graham, Lisa Oakley, Simona Boboila, Alina Oprea, Peter Chin

Title: Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense

Abstract:
Recent advances in multi‑agent reinforcement learning (MARL) have created opportunities to solve complex real‑world tasks. Cybersecurity is a notable application area, where defending networks against sophisticated adversaries remains a challenging task typically performed by teams of security operators. In this work, we explore novel MARL strategies for building autonomous cyber network defenses that address challenges such as large policy spaces, partial observability, and stealthy, deceptive adversarial strategies. To facilitate efficient and generalized learning, we propose a hierarchical Proximal Policy Optimization (PPO) architecture that decomposes the cyber defense task into specific sub‑tasks like network investigation and host recovery. Our approach involves training sub‑policies for each sub‑task using PPO enhanced with cybersecurity domain expertise. These sub‑policies are then leveraged by a master defense policy that coordinates their selection to solve complex network defense tasks. Furthermore, the sub‑policies can be fine‑tuned and transferred with minimal cost to defend against shifts in adversarial behavior or changes in network settings. We conduct extensive experiments using CybORG Cage 4, the state‑of‑the‑art MARL environment for cyber defense. Comparisons with multiple baselines across different adversaries show that our hierarchical learning approach achieves top performance in terms of convergence speed, episodic return, and several interpretable metrics relevant to cybersecurity, including the fraction of clean machines on the network, precision, and false positives.

PaperID: 12593, https://arxiv.org/pdf/2410.17229.pdf

Authors: Timothy Parker, Umberto Grandi, Emiliano Lorini

Title: Responsibility in a Multi-Value Strategic Setting

Abstract:
Responsibility is a key notion in multi‑agent systems and in creating safe, reliable and ethical AI. However, most previous work on responsibility has only considered responsibility for single outcomes. In this paper we present a model for responsibility attribution in a multi‑agent, multi‑value setting. We also expand our model to cover responsibility anticipation, demonstrating how considerations of responsibility can help an agent to select strategies that are in line with its values. In particular we show that non‑dominated regret‑minimising strategies reliably minimise an agent's expected degree of responsibility.

PaperID: 12594, https://arxiv.org/pdf/2410.17221.pdf

Authors: Zhaolin Ren, Runyu Zhang, Bo Dai, Na Li

Title: Scalable spectral representations for multi-agent reinforcement learning in network MDPs

Abstract:
Network Markov Decision Processes (MDPs), a popular model for multi‑agent control, pose a significant challenge to efficient learning due to the exponential growth of the global state‑action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for network MDPs, which induces a network linear subspace for the local Q‑function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for continuous state‑action network MDPs, and provide end‑to‑end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation‑based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local Q‑functions.

PaperID: 12595, https://arxiv.org/pdf/2410.17068.pdf

Authors: Jianan Bai, Zheng Chen, Erik. G. Larsson

Title: Delay-Constrained Grant-Free Random Access in MIMO Systems: Distributed Pilot Allocation and Power Control

Abstract:
We study a delay‑constrained grant‑free random access system with a multi‑antenna base station. The users randomly generate data packets with expiration deadlines, which are then transmitted from data queues on a first‑in first‑out basis. To deliver a packet, a user needs to succeed in both random access phase (sending a pilot without collision) and data transmission phase (achieving a required data rate with imperfect channel information) before the packet expires. We develop a distributed, cross‑layer policy that allows the users to dynamically and independently choose their pilots and transmit powers to achieve a high effective sum throughput with fairness consideration. Our policy design involves three key components: 1) a proxy of the instantaneous data rate that depends only on macroscopic environment variables and transmission decisions, considering pilot collisions and imperfect channel estimation; 2) a quantitative, instantaneous measure of fairness within each communication round; and 3) a deep learning‑based, multi‑agent control framework with centralized training and distributed execution. The proposed framework benefits from an accurate, differentiable objective function for training, thereby achieving a higher sample efficiency compared with a conventional application of model‑free, multi‑agent reinforcement learning algorithms. The performance of the proposed approach is verified by simulations under highly dynamic and heterogeneous scenarios.

PaperID: 12596, https://arxiv.org/pdf/2410.16981.pdf

Authors: Hyeonjun Park, Daegyu Lim, Seungyeon Kim, Sumin Park

Title: Proleptic Temporal Ensemble for Improving the Speed of Robot Tasks Generated by Imitation Learning

Abstract:
Imitation learning, which enables robots to learn behaviors from demonstrations by human, has emerged as a promising solution for generating robot motions in such environments. The imitation learning‑based robot motion generation method, however, has the drawback of depending on the demonstrator's task execution speed. This paper presents a novel temporal ensemble approach applied to imitation learning algorithms, allowing for execution of future actions. The proposed method leverages existing demonstration data and pre‑trained policies, offering the advantages of requiring no additional computation and being easy to implement. The algorithms performance was validated through real‑world experiments involving robotic block color sorting, demonstrating up to 3x increase in task execution speed while maintaining a high success rate compared to the action chunking with transformer method. This study highlights the potential for significantly improving the performance of imitation learning‑based policies, which were previously limited by the demonstrator's speed. It is expected to contribute substantially to future advancements in autonomous object manipulation technologies aimed at enhancing productivity.

PaperID: 12597, https://arxiv.org/pdf/2410.16919.pdf

Authors: Tomoyuki Kagaya, Yuxuan Lou, Thong Jing Yuan, Subramanian Lakshmi, Jayashree Karlekar, Sugiri Pranata, Natsuki Murakami, Akira Kinose, Koki Oguri, Felix Wick, Yang You

Title: EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

Abstract:
In recent years, Large Language Models (LLMs) have demonstrated high reasoning capabilities, drawing attention for their applications as agents in various decision‑making processes. One notably promising application of LLM agents is robotic manipulation. Recent research has shown that LLMs can generate text planning or control code for robots, providing substantial flexibility and interaction capabilities. However, these methods still face challenges in terms of flexibility and applicability across different environments, limiting their ability to adapt autonomously. Current approaches typically fall into two categories: those relying on environment‑specific policy training, which restricts their transferability, and those generating code actions based on fixed prompts, which leads to diminished performance when confronted with new environments. These limitations significantly constrain the generalizability of agents in robotic manipulation. To address these limitations, we propose a novel method called EnvBridge. This approach involves the retention and transfer of successful robot control codes from source environments to target environments. EnvBridge enhances the agent's adaptability and performance across diverse settings by leveraging insights from multiple environments. Notably, our approach alleviates environmental constraints, offering a more flexible and generalizable solution for robotic manipulation tasks. We validated the effectiveness of our method using robotic manipulation benchmarks: RLBench, MetaWorld, and CALVIN. Our experiments demonstrate that LLM agents can successfully leverage diverse knowledge sources to solve complex tasks. Consequently, our approach significantly enhances the adaptability and robustness of robotic manipulation agents in planning across diverse environments.

PaperID: 12598, https://arxiv.org/pdf/2410.16812.pdf

Authors: Yuli Qiu, Jiashu Yao, Heyan Huang, Yuhang Guo

Title: Optimizing Chain-of-Thought Reasoning: Tackling Arranging Bottleneck via Plan Augmentation

Abstract:
Multi‑step reasoning ability of large language models is crucial in tasks such as math and tool utilization. Current researches predominantly focus on enhancing model performance in these multi‑step reasoning tasks through fine‑tuning with Chain‑of‑Thought (CoT) steps, yet these methods tend to be heuristic, without exploring nor resolving the bottleneck. In this study, we subdivide CoT reasoning into two parts: arranging and executing, and identify that the bottleneck of models mainly lies in arranging rather than executing. Based on this finding, we propose a plan‑based training and reasoning method that guides models to generate arranging steps through abstract plans. We experiment on both math (GSM8k) and tool utilization (ToolBench) benchmarks. Results show that compared to fine‑tuning directly with CoT data, our approach achieves a better performance on alleviating arranging bottleneck, particularly excelling in long‑distance reasoning generalization.

PaperID: 12599, https://arxiv.org/pdf/2410.16686.pdf

Authors: Jumman Hossain, Emon Dey, Snehalraj Chugh, Masud Ahmed, MS Anwar, Abu-Zaher Faridee, Jason Hoppes, Theron Trout, Anjon Basak, Rafidh Chowdhury, Rishabh Mistry, Hyun Kim, Jade Freeman, Niranjan Suri, Adrienne Raglin, Carl Busart, Anuradha Ravi, Nirmalya Roy

Title: SERN: Bandwidth-Adaptive Cross-Reality Synchronization for Simulation-Enhanced Robot Navigation

Abstract:
Cross reality integration of simulation and physical robots is a promising approach for multi‑robot operations in contested environments, where communication may be intermittent, interference may be present, and observability may be degraded. We present SERN (Simulation‑Enhanced Realistic Navigation), a framework that tightly couples a high‑fidelity virtual twin with physical robots to support real‑time collaborative decision making. SERN makes three main contributions. First, it builds a virtual twin from geospatial and sensor data and continuously corrects it using live robot telemetry. Second, it introduces a physics‑aware synchronization pipeline that combines predictive modeling with adaptive PD control. Third, it provides a bandwidth‑adaptive ROS bridge that prioritizes critical topics when communication links are constrained. We also introduce a multi‑metric cost function that balances latency, reliability, computation, and bandwidth. Theoretically, we show that when the adaptive controller keeps the physical and virtual input mismatch small, synchronization error remains bounded under moderate packet loss and latency. Empirically, SERN reduces end‑to‑end message latency by 15% to 25% and processing load by about 15% compared with a standard ROS setup, while maintaining tight real‑virtual alignment with less than 5 cm positional error and less than 2 degrees rotational error. In a navigation task, SERN achieves a 95% success rate, compared with 85% for a real‑only setup and 70% for a simulation‑only setup, while also requiring fewer interventions and less time to reach the goal. These results show that a simulation‑enhanced cross‑reality stack can improve situational awareness and multi‑agent coordination in contested environments by enabling look‑ahead planning in the virtual twin while using real sensor feedback to correct discrepancies.

PaperID: 12600, https://arxiv.org/pdf/2410.16658.pdf

Authors: Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani

Title: Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Abstract:
Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate‑catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which does not guarantee the identification of the global minimum energy. To address this, we introduce Adsorb‑Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb‑Agent leverages its built‑in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial setups required while improving energy prediction accuracy. In this study, we also evaluated the performance of different LLMs, including GPT‑4o, GPT‑4o‑mini, Claude‑3.7‑Sonnet, and DeepSeek‑Chat, as the reasoning engine for Adsorb‑Agent, with GPT‑4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb‑Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb‑Agent's potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.

PaperID: 12601, https://arxiv.org/pdf/2410.16600.pdf

Authors: Ian Gemp, Andreas Haupt, Luke Marris, Siqi Liu, Georgios Piliouras

Title: Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning

Abstract:
Behavioral diversity, expert imitation, fairness, safety goals and others give rise to preferences in sequential decision making domains that do not decompose additively across time. We introduce the class of convex Markov games that allow general convex preferences over occupancy measures. Despite infinite time horizon and strictly higher generality than Markov games, pure strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal‑form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long‑term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates from observed human play only slightly, yet achieves higher per‑player utility while also being three orders of magnitude less exploitable.

PaperID: 12602, https://arxiv.org/pdf/2410.16566.pdf

Authors: Yukun Zhang, Qi Dong

Title: Revenue vs. Welfare: A Comprehensive Analysis of Strategic Trade-offs in Online Food Delivery Systems

Abstract:
This paper investigates the trade‑off between short‑term revenue generation and long‑term social welfare optimization in online food delivery platforms. We first develop a static model that captures the equilibrium interactions among restaurants, consumers, and delivery workers, using Gross Merchandise Value (GMV) as a proxy for immediate performance. Building on this, we extend our analysis to a dynamic model that integrates evolving state variables,such as platform reputation and participant retention‑to capture long‑term behavior. By applying dynamic programming techniques, we derive optimal strategies that balance GMV maximization with social welfare enhancement. Extensive multi‑agent simulations validate our theoretical predictions, demonstrating that while a GMV‑focused approach yields strong initial gains, it ultimately undermines long‑term stability. In contrast, a social welfare‑oriented strategy produces more sustainable and robust outcomes. Our findings provide actionable insights for platform operators and policymakers seeking to harmonize rapid growth with long‑term

PaperID: 12603, https://arxiv.org/pdf/2410.16529.pdf

Authors: Hariprasauth Ramamoorthy, Shubhankar Gupta, Suresh Sundaram

Title: Distributed Online Life-Long Learning (DOL3) for Multi-agent Trust and Reputation Assessment in E-commerce

Abstract:
Trust and Reputation Assessment of service providers in citizen‑focused environments like e‑commerce is vital to maintain the integrity of the interactions among agents. The goals and objectives of both the service provider and service consumer agents are relevant to the goals of the respective citizens (end users). The provider agents often pursue selfish goals that can make the service quality highly volatile, contributing towards the non‑stationary nature of the environment. The number of active service providers tends to change over time resulting in an open environment. This necessitates a rapid and continual assessment of the Trust and Reputation. A large number of service providers in the environment require a distributed multi‑agent Trust and Reputation assessment. This paper addresses the problem of multi‑agent Trust and Reputation Assessment in a non‑stationary environment involving transactions between providers and consumers. In this setting, the observer agents carry out the assessment and communicate their assessed trust scores with each other over a network. We propose a novel Distributed Online Life‑Long Learning (DOL3) algorithm that involves real‑time rapid learning of trust and reputation scores of providers. Each observer carries out an adaptive learning and weighted fusion process combining their own assessment along with that of their neighbour in the communication network. Simulation studies reveal that the state‑of‑the‑art methods, which usually involve training a model to assess an agent's trust and reputation, do not work well in such an environment. The simulation results show that the proposed DOL3 algorithm outperforms these methods and effectively handles the volatility in such environments. From the statistical evaluation, it is evident that DOL3 performs better compared to other models in 90% of the cases.

PaperID: 12604, https://arxiv.org/pdf/2410.16517.pdf

Authors: Jingdi Chen, Hanhan Zhou, Yongsheng Mei, Carlee Joe-Wong, Gina Adam, Nathaniel D. Bastian, Tian Lan

Title: RGMDT: Return-Gap-Minimizing Decision Tree Extraction in Non-Euclidean Metric Space

Abstract:
Deep Reinforcement Learning (DRL) algorithms have achieved great success in solving many challenging tasks while their black‑box nature hinders interpretability and real‑world applicability, making it difficult for human experts to interpret and understand DRL policies. Existing works on interpretable reinforcement learning have shown promise in extracting decision tree (DT) based policies from DRL policies with most focus on the single‑agent settings while prior attempts to introduce DT policies in multi‑agent scenarios mainly focus on heuristic designs which do not provide any quantitative guarantees on the expected return. In this paper, we establish an upper bound on the return gap between the oracle expert policy and an optimal decision tree policy. This enables us to recast the DT extraction problem into a novel non‑euclidean clustering problem over the local observation and action values space of each agent, with action values as cluster labels and the upper bound on the return gap as clustering loss. Both the algorithm and the upper bound are extended to multi‑agent decentralized DT extractions by an iteratively‑grow‑DT procedure guided by an action‑value function conditioned on the current DTs of other agents. Further, we propose the Return‑Gap‑Minimization Decision Tree (RGMDT) algorithm, which is a surprisingly simple design and is integrated with reinforcement learning through the utilization of a novel Regularized Information Maximization loss. Evaluations on tasks like D4RL show that RGMDT significantly outperforms heuristic DT‑based baselines and can achieve nearly optimal returns under given DT complexity constraints (e.g., maximum number of DT nodes).

PaperID: 12605, https://arxiv.org/pdf/2410.16444.pdf

Authors: Ricardo Vega, Kevin Zhu, Connor Mattson, Daniel S. Brown, Cameron Nowzari

Title: Agent-Based Emulation for Deploying Robot Swarm Behaviors

Abstract:
Despite significant research, robotic swarms have yet to be useful in solving real‑world problems, largely due to the difficulty of creating and controlling swarming behaviors in multi‑agent systems. Traditional top‑down approaches in which a desired emergent behavior is produced often require complex, resource‑heavy robots, limiting their practicality. This paper introduces a bottom‑up approach by employing an Embodied Agent‑Based Modeling and Simulation approach, emphasizing the use of simple robots and identifying conditions that naturally lead to self‑organized collective behaviors. Using the Reality‑to‑Simulation‑to‑Reality for Swarms (RSRS) process, we tightly integrate real‑world experiments with simulations to reproduce known swarm behaviors as well as discovering a novel emergent behavior without aiming to eliminate or even reduce the sim2real gap. This paper presents the development of an Agent‑Based Embodiment and Emulation process that balances the importance of running physical swarming experiments and the prohibitively time‑consuming process of even setting up and running a single experiment with 20+ robots by leveraging low‑fidelity lightweight simulations to enable hypothesis‑formation to guide physical experiments. We demonstrate the usefulness of our methods by emulating two known behaviors from the literature and show a third behavior `discovered' by accident.

PaperID: 12606, https://arxiv.org/pdf/2410.16441.pdf

Authors: Xinjie Liu, Jingqi Li, Filippos Fotiadis, Mustafa O. Karabag, Jesse Milzman, David Fridovich-Keil, Ufuk Topcu

Title: Approximate Feedback Nash Equilibria with Sparse Inter-Agent Dependencies

Abstract:
Feedback Nash equilibrium strategies in multi‑agent dynamic games require availability of all players' state information to compute control actions. However, in real‑world scenarios, sensing and communication limitations between agents make full state feedback expensive or impractical, and such strategies can become fragile when state information from other agents is inaccurate. To this end, we propose a regularized dynamic programming approach for finding sparse feedback policies that selectively depend on the states of a subset of agents in dynamic games. The proposed approach solves convex adaptive group Lasso problems to compute sparse policies approximating Nash equilibrium solutions. We prove the regularized solutions' asymptotic convergence to a neighborhood of Nash equilibrium policies in linear‑quadratic (LQ) games. Further, we extend the proposed approach to general non‑LQ games via an iterative algorithm. Simulation results in multi‑robot interaction scenarios show that the proposed approach effectively computes feedback policies with varying sparsity levels. When agents have noisy observations of other agents' states, simulation results indicate that the proposed regularized policies consistently achieve lower costs than standard Nash equilibrium policies by up to 77% for all interacting agents whose costs are coupled with other agents' states.

PaperID: 12607, https://arxiv.org/pdf/2410.16400.pdf

Authors: Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, Nedim Lipka

Title: VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

Abstract:
While vision‑language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine‑grained visual perception tasks that require detailed pixel‑level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi‑agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high‑precision perceptual information. This multi‑agent approach allows VLMs to better perform fine‑grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state‑of‑the‑art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi‑agent collaboration in eliciting more detailed System‑2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns of VLMs' inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real‑world applications.

PaperID: 12608, https://arxiv.org/pdf/2410.16285.pdf

Authors: John Mavi, Nathan Summers, Sergio Coronado

Title: Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?

Abstract:
The current paper presents the development and validation of SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks. Given the increasing integration of AI in industries, particularly within customer service, SelfScore fills a crucial gap by enabling the comparison of automated agents and human workers. The benchmark evaluates agents on problem complexity and response helpfulness, ensuring transparency and simplicity in its scoring system. The study also develops automated LLM agents to assess SelfScore and explores the benefits of Retrieval‑Augmented Generation (RAG) for domain‑specific tasks, demonstrating that automated LLM agents incorporating RAG outperform those without. All automated LLM agents were observed to perform better than the human control group. Given these results, the study raises concerns about the potential displacement of human workers, especially in areas where AI technologies excel. Ultimately, SelfScore provides a foundational tool for understanding the impact of AI in help desk environments while advocating for ethical considerations in the ongoing transition towards automation.

PaperID: 12609, https://arxiv.org/pdf/2410.16237.pdf

Authors: Yihuan Mao, Yipeng Kang, Peilun Li, Ning Zhang, Wei Xu, Chongjie Zhang

Title: IBGP: Imperfect Byzantine Generals Problem for Zero-Shot Robustness in Communicative Multi-Agent Systems

Abstract:
As large language model (LLM) agents increasingly integrate into our infrastructure, their robust coordination and message synchronization become vital. The Byzantine Generals Problem (BGP) is a critical model for constructing resilient multi‑agent systems (MAS) under adversarial attacks. It describes a scenario where malicious agents with unknown identities exist in the system‑situations that, in our context, could result from LLM agents' hallucinations or external attacks. In BGP, the objective of the entire system is to reach a consensus on the action to be taken. Traditional BGP requires global consensus among all agents; however, in practical scenarios, global consensus is not always necessary and can even be inefficient. Therefore, there is a pressing need to explore a refined version of BGP that aligns with the local coordination patterns observed in MAS. We refer to this refined version as Imperfect BGP (IBGP) in our research, aiming to address this discrepancy. To tackle this issue, we propose a framework that leverages consensus protocols within general MAS settings, providing provable resilience against communication attacks and adaptability to changing environments, as validated by empirical results. Additionally, we present a case study in a sensor network environment to illustrate the practical application of our protocol.

PaperID: 12610, https://arxiv.org/pdf/2410.16232.pdf

Authors: Ryan Li, Yanzhe Zhang, Diyi Yang

Title: Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping

Abstract:
Sketches are a natural and accessible medium for UI designers to conceptualize early‑stage ideas. However, existing research on UI/UX automation often requires high‑fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state‑of‑the‑art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end‑to‑end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real‑world design workflows, where a VLM‑based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open‑source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question‑asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi‑turn conversational agents.

PaperID: 12611, https://arxiv.org/pdf/2410.16155.pdf

Authors: Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

Title: A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns

Abstract:
With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single‑agent attacks and shared memory attacks. However, real‑world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large‑scale, multi‑agent, multi‑topology text‑based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi‑agent attacks: (1) Non‑complete graph structure, (2) Large‑scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100‑agent settings. Encourage community attention to the security of multi‑agent systems.

PaperID: 12612, https://arxiv.org/pdf/2410.16024.pdf

Authors: Yue Deng, Weiyu Ma, Yuxin Fan, Ruyi Song, Yin Zhang, Haifeng Zhang, Jian Zhao

Title: SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks

Abstract:
StarCraft Multi‑Agent Challenge (SMAC) has been one of the most commonly used experimental environments in multi‑agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for millions of steps to train a parametric model, of which the resulting policies are typically non‑interpretable with weak transferability. In this paper, we introduce SMAC‑R1 which is based on the Qwen2.5‑7B‑Base LLM distilled from DeepSeek‑Coder‑v2.5‑236B. Similar to online reinforcement learning after behavior cloning in offline learning process, in our pipeline, agents leverage the DeepSeek LLM to generate decision tree code by providing task descriptions, and the agents are further self‑reflected using feedback from the rewards provided by the environment. Based on that, we augment the generated scripts to fine‑tune a small LLM, Qwen2.5‑7B‑Base, to distill the decision‑making ability via Supervised Fine‑Tuning (SFT) and enhance the script generation ability by the Group Relative Policy Optimization (GRPO) algorithm. We conduct experiments in the original 23 SMAC tasks and 10 newly‑designed tasks to demonstrate that our method can produce high‑quality, interpretable decision trees with minimal environmental exploration. Moreover, these scripts exhibit strong transferability, successfully applying to homogeneous SMAC environments without modification. We believe this approach offers a new direction for solving decision‑making tasks and domain‑specific LLM training pipelines in the future.

PaperID: 12613, https://arxiv.org/pdf/2410.15987.pdf

Authors: Matthias Bitzer, Reinis Cimurs, Benjamin Coors, Johannes Goth, Sebastian Ziesche, Philipp Geiger, Maximilian Naumann

Title: Analyzing Closed-loop Training Techniques for Realistic Traffic Agent Models in Autonomous Highway Driving Simulations

Abstract:
Simulation plays a crucial role in the rapid development and safe deployment of autonomous vehicles. Realistic traffic agent models are indispensable for bridging the gap between simulation and the real world. Many existing approaches for imitating human behavior are based on learning from demonstration. However, these approaches are often constrained by focusing on individual training strategies. Therefore, to foster a broader understanding of realistic traffic agent modeling, in this paper, we provide an extensive comparative analysis of different training principles, with a focus on closed‑loop methods for highway driving simulation. We experimentally compare (i) open‑loop vs. closed‑loop multi‑agent training, (ii) adversarial vs. deterministic supervised training, (iii) the impact of reinforcement losses, and (iv) the impact of training alongside log‑replayed agents to identify suitable training techniques for realistic agent modeling. Furthermore, we identify promising combinations of different closed‑loop training methods.

PaperID: 12614, https://arxiv.org/pdf/2410.15894.pdf

Authors: Yiwei Yang, Aibo Hu, Yusheng Zheng, Brian Zhao, Xinqi Zhang, Dawei Xiang, Kexin Chu, Wei Zhang, Andi Quinn

Title: MVVM: Deploy Your AI Agents-Securely, Efficiently, Everywhere

Abstract:
The rise of AI agents powered by Large Language Models (LLMs) presents critical challenges: how to securely execute and migrate these agents across heterogeneous environments while protecting sensitive user data, maintaining availability during network failures, minimizing response latency for time‑critical decisions, and ensuring output safety in mission‑critical applications. We present MVVM, a WebAssembly‑based secure container framework that enables transparent live migration of LLM agent workspaces between edge devices and cloud servers with end‑to‑end privacy guarantees, resilient multi‑tier replication, speculative execution for latency optimization, and integrated validation for safety assurance. MVVM introduces two key innovations: (1) a two‑way sandboxing framework leveraging hardware enclaves and accelerator extensions that protects both the agent from malicious hosts and the host from compromised agents; (2) an efficient cross platform migration mechanism using WebAssembly and WASI's platform‑agnostic design, enabling seamless movement across ARM phones, RISC‑V MCUs, x86 servers, and heterogeneous accelerators; and three astonishing use cases: (1) privacy‑aware daemon that automatically determines whether to execute locally or remotely based on data sensitivity and resource availability; (2) multi‑tier replication with intelligent quality degradation that maintains service availability despite network failures or resource constraints; (3) a comprehensive execution framework combining speculative execution for 10x latency reduction with parallel validation that ensures output safety without compromising responsiveness. Our evaluation demonstrates that MVVM is validated on three separate devices across 18 workloads.

PaperID: 12615, https://arxiv.org/pdf/2410.15885.pdf

Authors: Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, Bin Liu

Title: VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

Abstract:
Recent large pretrained models such as LLMs (e.g., GPT series) and VLAs (e.g., OpenVLA) have achieved notable progress on multimodal tasks, yet they are built upon a multi‑input single‑output (MISO) paradigm. We show that this paradigm fundamentally limits performance in multi‑input multi‑output (MIMO) scenarios, where parallel task execution is required. In MISO architectures, tasks compete for a shared output channel, creating mutual exclusion effects that cause unbalanced optimization and degraded performance. To address this gap, we introduce MIMO‑VLA (VLASCD), a unified training framework that enables concurrent multi‑task outputs, exemplified by simultaneous dialogue generation and decision‑making. Inspired by human cognition, MIMO‑VLA eliminates interference between tasks and supports efficient parallel processing. Experiments on the CARLA autonomous driving platform demonstrate that MIMO‑VLA substantially outperforms state‑of‑the‑art MISO‑based LLMs, reinforcement learning models, and VLAs in MIMO settings, establishing a new direction for multimodal and multitask learning.

PaperID: 12616, https://arxiv.org/pdf/2410.15876.pdf

Authors: Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun

Title: FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL

Abstract:
Multi‑agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real‑world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory ‑‑ a common occurrence in real‑world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra‑trajectory dynamic entity composition under zero‑shot out‑of‑domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in‑domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis‑à‑vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open‑sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.

PaperID: 12617, https://arxiv.org/pdf/2410.15841.pdf

Authors: Wenzhe Fan, Zishun Yu, Chengdong Ma, Changye Li, Yaodong Yang, Xinhua Zhang

Title: Towards Efficient Collaboration via Graph Modeling in Reinforcement Learning

Abstract:
In multi‑agent reinforcement learning, a commonly considered paradigm is centralized training with decentralized execution. However, in this framework, decentralized execution restricts the development of coordinated policies due to the local observation limitation. In this paper, we consider the cooperation among neighboring agents during execution and formulate their interactions as a graph. Thus, we introduce a novel encoder‑decoder architecture named Factor‑based Multi‑Agent Transformer (f‑MAT) that utilizes a transformer to enable communication between neighboring agents during both training and execution. By dividing agents into different overlapping groups and representing each group with a factor, f‑MAT achieves efficient message passing and parallel action generation through factor‑based attention layers. Empirical results in networked systems such as traffic scheduling and power control demonstrate that f‑MAT achieves superior performance compared to strong baselines, thereby paving the way for handling complex collaborative problems.

PaperID: 12618, https://arxiv.org/pdf/2410.15686.pdf

Authors: Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang

Title: NetSafe: Exploring the Topological Safety of Multi-agent Networks

Abstract:
Large language models (LLMs) have empowered nodes within multi‑agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored with previous research on single LLM's safety be challenging to transfer. In this paper, we focus on the safety of multi‑agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe along with an iterative RelCom interaction to unify existing diverse LLM‑based agent frameworks, laying the foundation for generalized topological safety research. We identify several critical phenomena when multi‑agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed as Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. Besides, our proposed static metrics aligned more closely with real‑world dynamic evaluations than traditional graph‑theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM‑based multi‑agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks.

PaperID: 12619, https://arxiv.org/pdf/2410.15684.pdf

Authors: Boshen Wang, Luis E. Ortiz

Title: A Machine Learning Approach to Detect Strategic Behavior from Large-Population Observational Data Applied to Game Mode Prediction on a Team-Based Video Game

Abstract:
Modeling the strategic behavior of agents in a real‑world multi‑agent system using existing state‑of‑the‑art computational game‑theoretic tools can be a daunting task, especially when only the actions taken by the agents can be observed. Before attempting such a task, it would be useful to gain insight into whether or not agents are in fact acting strategically at all, from a game‑theoretic perspective. In this paper, we present an initial step toward addressing this problem by proposing a general approach based on machine learning fundamentals for detecting potentially strategic behavior. We instantiate the approach by applying state‑of‑the‑art machine learning tools for model selection and performance evaluation of prediction models in the context of detecting the strategic behavior of players for game mode selection in the multiplayer online video game Heroes of the Storm. Specifically, as a baseline, we first train neural networks to predict players' game mode selections using only information about the state of the player themselves. Then, we train a new set of neural networks using the same architectures, this time incorporating "historical co‑play" features that encode players' past interactions with other players. We find that including these new features led to statistically significant improvements in game mode prediction accuracy, providing a sufficiently strong signal that players indeed make decisions strategically, which justifies the development of more complex computational game‑theoretic tools in the hope of improving modeling and predictive power. We discuss remaining research work about potential approaches to validate the effectiveness of this initial step to detect strategic behavior.

PaperID: 12620, https://arxiv.org/pdf/2410.15665.pdf

Authors: Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen

Title: Long Term Memory: The Foundation of AI Self-Evolution

Abstract:
Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human‑level performance in various tasks. Most studies focus on enhancing these models by training on ever‑larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self‑evolution. Unlike large‑scale training, self‑evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long‑term memory (LTM) to store and manage processed interaction data. LTM supports self‑evolution by representing diverse experiences across environments and agents. In this report, we explore AI self‑evolution and its potential to enhance models during inference. We examine LTM's role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self‑evolution through interaction. Using LTM, our multi‑agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM's potential for AI self‑evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.

PaperID: 12621, https://arxiv.org/pdf/2410.15549.pdf

Authors: ByungOk Han, Jaehong Kim, Jinhyeok Jang

Title: A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM

Abstract:
Vision‑Language‑Action (VLA) models are receiving increasing attention for their ability to enable robots to perform complex tasks by integrating visual context with linguistic commands. However, achieving efficient real‑time performance remains challenging due to the high computational demands of existing models. To overcome this, we propose Dual Process VLA (DP‑VLA), a hierarchical framework inspired by dual‑process theory. DP‑VLA utilizes a Large System 2 Model (L‑Sys2) for complex reasoning and decision‑making, while a Small System 1 Model (S‑Sys1) handles real‑time motor control and sensory processing. By leveraging Vision‑Language Models (VLMs), the L‑Sys2 operates at low frequencies, reducing computational overhead, while the S‑Sys1 ensures fast and accurate task execution. Experimental results on the RoboCasa dataset demonstrate that DP‑VLA achieves faster inference and higher task success rates, providing a scalable solution for advanced robotic applications.

PaperID: 12622, https://arxiv.org/pdf/2410.15543.pdf

Authors: Saba Zerefa, Zhaolin Ren, Haitong Ma, Na Li

Title: Distributed Thompson sampling under constrained communication

Abstract:
In Bayesian optimization, a black‑box function is maximized via the use of a surrogate model. We apply distributed Thompson sampling, using a Gaussian process as a surrogate model, to approach the multi‑agent Bayesian optimization problem. In our distributed Thompson sampling implementation, each agent receives sampled points from neighbors, where the communication network is encoded in a graph; each agent utilizes their own Gaussian process to model the objective function. We demonstrate theoretical bounds on Bayesian average regret and Bayesian simple regret, where the bound depends on the structure of the communication graph. Unlike in batch Bayesian optimization, this bound is applicable in cases where the communication graph amongst agents is constrained. When compared to sequential single‑agent Thompson sampling, our bound guarantees faster convergence with respect to time as long as the communication graph is connected. We confirm the efficacy of our algorithm with numerical simulations on traditional optimization test functions, demonstrating the significance of graph connectivity on improving regret convergence.

PaperID: 12623, https://arxiv.org/pdf/2410.15528.pdf

Authors: Chanseo Lee, Sonu Kumar, Kimon A. Vogt, Sam Meraj

Title: Improving Clinical Documentation with AI: A Comparative Study of Sporo AI Scribe and GPT-4o mini

Abstract:
AI‑powered medical scribes have emerged as a promising solution to alleviate the documentation burden in healthcare. Ambient AI scribes provide real‑time transcription and automated data entry into Electronic Health Records (EHRs), with the potential to improve efficiency, reduce costs, and enhance scalability. Despite early success, the accuracy of AI scribes remains critical, as errors can lead to significant clinical consequences. Additionally, AI scribes face challenges in handling the complexity and variability of medical language and ensuring the privacy of sensitive patient data. This case study aims to evaluate Sporo Health's AI scribe, a multi‑agent system leveraging fine‑tuned medical LLMs, by comparing its performance with OpenAI's GPT‑4o Mini on multiple performance metrics. Using a dataset of de‑identified patient conversation transcripts, AI‑generated summaries were compared to clinician‑generated notes (the ground truth) based on clinical content recall, precision, and F1 scores. Evaluations were further supplemented by clinician satisfaction assessments using a modified Physician Documentation Quality Instrument revision 9 (PDQI‑9), rated by both a medical student and a physician. The results show that Sporo AI consistently outperformed GPT‑4o Mini, achieving higher recall, precision, and overall F1 scores. Moreover, the AI generated summaries provided by Sporo were rated more favorably in terms of accuracy, comprehensiveness, and relevance, with fewer hallucinations. These findings demonstrate that Sporo AI Scribe is an effective and reliable tool for clinical documentation, enhancing clinician workflows while maintaining high standards of privacy and security.

PaperID: 12624, https://arxiv.org/pdf/2410.15335.pdf

Authors: Ali Kahe, Hamed Kebriaei

Title: A Distributed Primal-Dual Method for Constrained Multi-agent Reinforcement Learning with General Parameterization

Abstract:
This paper proposes a novel distributed approach for solving a cooperative Constrained Multi‑agent Reinforcement Learning (CMARL) problem, where agents seek to minimize a global objective function subject to shared constraints. Unlike existing methods that rely on centralized training or coordination, our approach enables fully decentralized online learning, with each agent maintaining local estimates of both primal and dual variables. Specifically, we develop a distributed primal‑dual algorithm based on actor‑critic methods, leveraging local information to estimate Lagrangian multipliers. We establish consensus among the Lagrangian multipliers across agents and prove the convergence of our algorithm to an equilibrium point, analyzing the sub‑optimality of this equilibrium compared to the exact solution of the unparameterized problem. Furthermore, we introduce a constrained cooperative Cournot game with stochastic dynamics as a test environment to evaluate the algorithm's performance in complex, real‑world scenarios.

PaperID: 12625, https://arxiv.org/pdf/2410.15311.pdf

Authors: Ruiqi Dong, Zhixuan Liao, Guangwei Lai, Yuhan Ma, Danni Ma, Chenyou Fan

Title: Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game

Abstract:
Large Language Models (LLMs) are pivotal AI agents in complex tasks but still face challenges in open decision‑making problems within complex scenarios. To address this, we use the language logic game ``Who is Undercover?'' (WIU) as an experimental platform to propose the Multi‑Perspective Team Tactic (MPTT) framework. MPTT aims to cultivate LLMs' human‑like language expression logic, multi‑dimensional thinking, and self‑perception in complex scenarios. By alternating speaking and voting sessions, integrating techniques like self‑perspective, identity‑determination, self‑reflection, self‑summary and multi‑round find‑teammates, LLM agents make rational decisions through strategic concealment and communication, fostering human‑like trust. Preliminary results show that MPTT, combined with WIU, leverages LLMs' cognitive capabilities to create a decision‑making framework that can simulate real society. This framework aids minority groups in communication and expression, promoting fairness and diversity in decision‑making. Additionally, our Human‑in‑the‑loop experiments demonstrate that LLMs can learn and align with human behaviors through interactive, indicating their potential for active participation in societal decision‑making.

PaperID: 12626, https://arxiv.org/pdf/2410.15287.pdf

Authors: Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen

Title: Training Language Models to Critique With Multi-agent Feedback

Abstract:
Critique ability, a meta‑cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine‑tuning (SFT) using critiques generated by a single LLM like GPT‑4. However, these model‑generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine‑tuning LLMs on such flawed critiques typically limits the model's performance and propagates these flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by utilizing multi‑agent feedback in both the SFT and reinforcement learning (RL) stages. First, our data generation pipeline aggregates high‑quality critiques from multiple agents instead of a single model, with crucial information as input for simplifying the critique. Furthermore, our pipeline improves the preference accuracy of critique quality through multi‑agent feedback, facilitating the effectiveness of RL in improving the critique ability of LLMs. Based on our proposed MultiCritique data generation pipeline, we construct the MultiCritiqueDataset for the SFT and RL fine‑tuning stages. Extensive experimental results on two benchmarks demonstrate: 1) the superior quality of our constructed SFT dataset compared to existing critique datasets; 2) additional improvements to the critique ability of LLMs brought by the RL stage. Notably, our fine‑tuned 7B model significantly surpasses other advanced 7B‑13B open‑source models, approaching the performance of advanced 70B LLMs and GPT‑4. Codes, datasets and model weights will be publicly available.

PaperID: 12627, https://arxiv.org/pdf/2410.15236.pdf

Authors: Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

Title: Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Abstract:
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt‑based, model‑based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross‑modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi‑agent defenses, and self‑regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

PaperID: 12628, https://arxiv.org/pdf/2410.15222.pdf

Authors: Zavier Ndum Ndum, Jian Tao, John Ford, Yang Liu

Title: AutoFLUKA: A Large Language Model Based Framework for Automating Monte Carlo Simulations in FLUKA

Abstract:
Monte Carlo (MC) simulations, particularly using FLUKA, are essential for replicating real‑world scenarios across scientific and engineering fields. Despite the robustness and versatility, FLUKA faces significant limitations in automation and integration with external post‑processing tools, leading to workflows with a steep learning curve, which are time‑consuming and prone to human errors. Traditional methods involving the use of shell and Python scripts, MATLAB, and Microsoft Excel require extensive manual intervention and lack flexibility, adding complexity to evolving scenarios. This study explores the potential of Large Language Models (LLMs) and AI agents to address these limitations. AI agents, integrate natural language processing with autonomous reasoning for decision‑making and adaptive planning, making them ideal for automation. We introduce AutoFLUKA, an AI agent application developed using the LangChain Python Framework to automate typical MC simulation workflows in FLUKA. AutoFLUKA can modify FLUKA input files, execute simulations, and efficiently process results for visualization, significantly reducing human labor and error. Our case studies demonstrate that AutoFLUKA can handle both generalized and domain‑specific cases, such as Microdosimetry, with an streamlined automated workflow, showcasing its scalability and flexibility. The study also highlights the potential of Retrieval Augmentation Generation (RAG) tools to act as virtual assistants for FLUKA, further improving user experience, time and efficiency. In conclusion, AutoFLUKA represents a significant advancement in automating MC simulation workflows, offering a robust solution to the inherent limitations. This innovation not only saves time and resources but also opens new paradigms for research and development in high energy physics, medical physics, nuclear engineering space and environmental science.

PaperID: 12629, https://arxiv.org/pdf/2410.15221.pdf

Authors: Vindula Jayawardana, Baptiste Freydt, Ao Qu, Cameron Hickert, Zhongxia Yan, Cathy Wu

Title: IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning

Abstract:
Despite the popularity of multi‑agent reinforcement learning (RL) in simulated and two‑player applications, its success in messy real‑world applications has been limited. A key challenge lies in its generalizability across problem variations, a common necessity for many real‑world problems. Contextual reinforcement learning (CRL) formalizes learning policies that generalize across problem variations. However, the lack of standardized benchmarks for multi‑agent CRL has hindered progress in the field. Such benchmarks are desired to be based on real‑world applications to naturally capture the many open challenges of real‑world problems that affect generalization. To bridge this gap, we propose IntersectionZoo, a comprehensive benchmark suite for multi‑agent CRL through the real‑world application of cooperative eco‑driving in urban road networks. The task of cooperative eco‑driving is to control a fleet of vehicles to reduce fleet‑level vehicular emissions. By grounding IntersectionZoo in a real‑world application, we naturally capture real‑world problem characteristics, such as partial observability and multiple competing objectives. IntersectionZoo is built on data‑informed simulations of 16,334 signalized intersections derived from 10 major US cities, modeled in an open‑source industry‑grade microscopic traffic simulator. By modeling factors affecting vehicular exhaust emissions (e.g., temperature, road conditions, travel demand), IntersectionZoo provides one million data‑driven traffic scenarios. Using these traffic scenarios, we benchmark popular multi‑agent RL and human‑like driving algorithms and demonstrate that the popular multi‑agent RL algorithms struggle to generalize in CRL settings.

PaperID: 12630, https://arxiv.org/pdf/2410.15205.pdf

Authors: Anning Wei, Jintao Liang, Kaiyuan Lin, Ziyue Li, Rui Zhao

Title: DTPPO: Dual-Transformer Encoder-based Proximal Policy Optimization for Multi-UAV Navigation in Unseen Complex Environments

Abstract:
Existing multi‑agent deep reinforcement learning (MADRL) methods for multi‑UAV navigation face challenges in generalization, particularly when applied to unseen complex environments. To address these limitations, we propose a Dual‑Transformer Encoder‑based Proximal Policy Optimization (DTPPO) method. DTPPO enhances multi‑UAV collaboration through a Spatial Transformer, which models inter‑agent dynamics, and a Temporal Transformer, which captures temporal dependencies to improve generalization across diverse environments. This architecture allows UAVs to navigate new, unseen environments without retraining. Extensive simulations demonstrate that DTPPO outperforms current MADRL methods in terms of transferability, obstacle avoidance, and navigation efficiency across environments with varying obstacle densities. The results confirm DTPPO's effectiveness as a robust solution for multi‑UAV navigation in both known and unseen scenarios.

PaperID: 12631, https://arxiv.org/pdf/2410.15188.pdf

Authors: Guibin Chen

Title: Augmented Lagrangian-Based Safe Reinforcement Learning Approach for Distribution System Volt/VAR Control

Abstract:
This paper proposes a data‑driven solution for Volt‑VAR control problem in active distribution system. As distribution system models are always inaccurate and incomplete, it is quite difficult to solve the problem. To handle with this dilemma, this paper formulates the Volt‑VAR control problem as a constrained Markov decision process (CMDP). By synergistically combining the augmented Lagrangian method and soft actor critic algorithm, a novel safe off‑policy reinforcement learning (RL) approach is proposed in this paper to solve the CMDP. The actor network is updated in a policy gradient manner with the Lagrangian value function. A double‑critics network is adopted to synchronously estimate the action‑value function to avoid overestimation bias. The proposed algorithm does not require strong convexity guarantee of examined problems and is sample efficient. A two‑stage strategy is adopted for offline training and online execution, so the accurate distribution system model is no longer needed. To achieve scalability, a centralized training distributed execution strategy is adopted for a multi‑agent framework, which enables a decentralized Volt‑VAR control for large‑scale distribution system. Comprehensive numerical experiments with real‑world electricity data demonstrate that our proposed algorithm can achieve high solution optimality and constraints compliance.

PaperID: 12632, https://arxiv.org/pdf/2410.15173.pdf

Authors: Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

Title: Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

Abstract:
The thematic fit estimation task measures semantic arguments' compatibility with a given semantic role for a given predicate. We investigate if autoregressive LLMs have consistent, expressible knowledge of event arguments' thematic fit by experimenting with various prompt designs, manipulating input context, reasoning, and output forms. We set a new state‑of‑the‑art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi‑step reasoning, but they perform worse at filtering out generated sentences incompatible with the given predicate, role, and argument. Our analysis shows that lemma tuple input and sentence input result in surprisingly different thematic fit score distributions.

PaperID: 12633, https://arxiv.org/pdf/2410.15168.pdf

Authors: Xiutian Zhao, Ke Wang, Wei Peng

Title: An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making

Abstract:
Modern large language models (LLMs) have exhibited cooperative synergy on complex task‑solving, and collective decision‑making (CDM) is a pivotal component in LLM‑based multi‑agent collaboration frameworks. Our survey on 52 recent such systems uncovers a severe lack of diversity, with a heavy reliance on dictatorial and plurality voting for CDM. Through the lens of social choice theory, we scrutinize widely‑adopted CDM methods and identify their limitations. To enrich current landscape of LLM‑based CDM, we present GEDI, an electoral CDM module that incorporates various ordinal preferential voting mechanisms. Our empirical case study across three benchmarks shows that the integration of certain CDM methods can markedly improve the reasoning capabilities and robustness of some leading LLMs, all without requiring intricate system designs. Additionally, we find that some CDM mechanisms generate positive synergies even with as few as three agents. The voting‑based methods also demonstrate robustness against single points of failure, as well as diversity in terms of hit‑rate@k and subject‑wise impacts.

PaperID: 12634, https://arxiv.org/pdf/2410.15156.pdf

Authors: Khaled Nakhleh, Ceyhun Eksin, Sabit Ekin

Title: Simulation-Based Optimistic Policy Iteration For Multi-Agent MDPs with Kullback-Leibler Control Cost

Abstract:
This paper proposes an agent‑based optimistic policy iteration (OPI) scheme for learning stationary optimal stochastic policies in multi‑agent Markov Decision Processes (MDPs), in which agents incur a Kullback‑Leibler (KL) divergence cost for their control efforts and an additional cost for the joint state. The proposed scheme consists of a greedy policy improvement step followed by an m‑step temporal difference (TD) policy evaluation step. We use the separable structure of the instantaneous cost to show that the policy improvement step follows a Boltzmann distribution that depends on the current value function estimate and the uncontrolled transition probabilities. This allows agents to compute the improved joint policy independently. We show that both the synchronous (entire state space evaluation) and asynchronous (a uniformly sampled set of substates) versions of the OPI scheme with finite policy evaluation rollout converge to the optimal value function and an optimal joint policy asymptotically. Simulation results on a multi‑agent MDP with KL control cost variant of the Stag‑Hare game validates our scheme's performance in terms of minimizing the cost return.

PaperID: 12635, https://arxiv.org/pdf/2410.15137.pdf

Authors: Tianlong Zhou, Jun Shang, Weixiong Rao

Title: Collaborative State Fusion in Partially Known Multi-agent Environments

Abstract:
In this paper, we study the collaborative state fusion problem in a multi‑agent environment, where mobile agents collaborate to track movable targets. Due to the limited sensing range and potential errors of on‑board sensors, it is necessary to aggregate individual observations to provide target state fusion for better target state estimation. Existing schemes do not perform well due to (1) impractical assumption of the fully known prior target state‑space model and (2) observation outliers from individual sensors. To address the issues, we propose a two‑stage collaborative fusion framework, namely \underlineLearnable Weighted R\underlineobust \underlineFusion (\textsfLoF). \textsfLoF combines a local state estimator (e.g., Kalman Filter) with a learnable weight generator to address the mismatch between the prior state‑space model and underlying patterns of moving targets. Moreover, given observation outliers, we develop a time‑series soft medoid(TSM) scheme to perform robust fusion. We evaluate \textsfLoF in a collaborative detection simulation environment with promising results. In an example setting with 4 agents and 2 targets, \textsfLoF leads to a 9.1% higher fusion gain compared to the state‑of‑the‑art.

PaperID: 12636, https://arxiv.org/pdf/2410.15048.pdf

Authors: Siyuan Lu, Jiaqi Shao, Bing Luo, Tao Lin

Title: MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration

Abstract:
Large Language Model (LLM) based multi‑agent systems (MAS) have shown promise in tackling complex tasks, but often rely on predefined roles and centralized coordination, limiting their adaptability to evolving challenges. This paper introduces MorphAgent, a novel Autonomous, Self‑Organizing, and Self‑Adaptive Multi‑Agent System for decentralized agent collaboration that enables agents to dynamically evolve their roles and capabilities. Our approach employs self‑evolving agent profiles, optimized through three key metrics, guiding agents in refining their individual expertise while maintaining complementary team dynamics. MorphAgent implements a two‑phase process: a Profile Update phase for profile optimization, followed by a Task Execution phase where agents continuously adapt their roles based on task feedback. Our experimental results show that MorphAgent outperforms existing frameworks in terms of task performance and adaptability to changing requirements, paving the way for more robust and versatile multi‑agent collaborative systems.

PaperID: 12637, https://arxiv.org/pdf/2410.14989.pdf

Authors: Longtao Zhu, Hongyu Yang, Ge Song, Xin Ma, Yanxin Zhang, Yulong Ji

Title: AutoFPDesigner: Automated Flight Procedure Design Based on Multi-Agent Large Language Model

Abstract:
Current flight procedure design methods heavily rely on human‑led design process, which is not only low auto‑mation but also suffer from complex algorithm modelling and poor generalization. To address these challenges, this paper proposes an agent‑driven flight procedure design method based on large language model, named Au‑toFPDesigner, which utilizes multi‑agent collaboration to complete procedure design. The method enables end‑to‑end automated design of performance‑based navigation (PBN) procedures. In this process, the user input the design requirements in natural language, AutoFPDesigner models the flight procedure design by loading the design speci‑fications and utilizing tool libraries complete the design. AutoFPDesigner allows users to oversee and seamlessly participate in the design process. Experimental results show that AutoFPDesigner ensures nearly 100% safety in the designed flight procedures and achieves 75% task completion rate, with good adaptability across different design tasks. AutoFPDesigner introduces a new paradigm for flight procedure design and represents a key step towards the automation of this process. Keywords: Flight Procedure Design; Large Language Model; Performance‑Based Navigation (PBN); Multi Agent;

PaperID: 12638, https://arxiv.org/pdf/2410.14916.pdf

Authors: Jasmine Jerry Aloor, Siddharth Nayak, Sydney Dolan, Hamsa Balakrishnan

Title: Cooperation and Fairness in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent systems are trained to maximize shared cost objectives, which typically reflect system‑level efficiency. However, in the resource‑constrained environments of mobility and transportation systems, efficiency may be achieved at the expense of fairness ‑‑ certain agents may incur significantly greater costs or lower rewards compared to others. Tasks could be distributed inequitably, leading to some agents receiving an unfair advantage while others incur disproportionately high costs. It is important to consider the tradeoffs between efficiency and fairness. We consider the problem of fair multi‑agent navigation for a group of decentralized agents using multi‑agent reinforcement learning (MARL). We consider the reciprocal of the coefficient of variation of the distances traveled by different agents as a measure of fairness and investigate whether agents can learn to be fair without significantly sacrificing efficiency (i.e., increasing the total distance traveled). We find that by training agents using min‑max fair distance goal assignments along with a reward term that incentivizes fairness as they move towards their goals, the agents (1) learn a fair assignment of goals and (2) achieve almost perfect goal coverage in navigation scenarios using only local observations. For goal coverage scenarios, we find that, on average, our model yields a 14% improvement in efficiency and a 5% improvement in fairness over a baseline trained using random assignments. Furthermore, an average of 21% improvement in fairness can be achieved compared to a model trained on optimally efficient assignments; this increase in fairness comes at the expense of only a 7% decrease in efficiency. Finally, we extend our method to environments in which agents must complete coverage tasks in prescribed formations and show that it is possible to do so without tailoring the models to specific formation shapes.

PaperID: 12639, https://arxiv.org/pdf/2410.14913.pdf

Authors: Chandrakanth Gudavalli, Bowen Zhang, Connor Levenson, Kin Gwn Lore, B. S. Manjunath

Title: ReeFRAME: Reeb Graph based Trajectory Analysis Framework to Capture Top-Down and Bottom-Up Patterns of Life

Abstract:
In this paper, we present ReeFRAME, a scalable Reeb graph‑based framework designed to analyze vast volumes of GPS‑enabled human trajectory data generated at 1Hz frequency. ReeFRAME models Patterns‑of‑life (PoL) at both the population and individual levels, utilizing Multi‑Agent Reeb Graphs (MARGs) for population‑level patterns and Temporal Reeb Graphs (TERGs) for individual trajectories. The framework's linear algorithmic complexity relative to the number of time points ensures scalability for anomaly detection. We validate ReeFRAME on six large‑scale anomaly detection datasets, simulating real‑time patterns with up to 500,000 agents over two months.

PaperID: 12640, https://arxiv.org/pdf/2410.14890.pdf

Authors: Yunhao Yang, Leonard Berthellemy, Ufuk Topcu

Title: Reasoning, Memorization, and Fine-Tuning Language Models for Non-Cooperative Games

Abstract:
We develop a method that integrates the tree of thoughts and multi‑agent framework to enhance the capability of pre‑trained language models in solving complex, unfamiliar games. The method decomposes game‑solving into four incremental tasks ‑‑ game summarization, area selection, action extraction, and action validation ‑‑ each assigned to a specific language‑model agent. By constructing a tree of thoughts, the method simulates reasoning paths and allows agents to collaboratively distill game representations and tactics, mitigating the limitations of language models in reasoning and long‑term memorization. Additionally, an automated fine‑tuning process further optimizes the agents' performance by ranking query‑response pairs based on game outcomes, e.g., winning or losing. We apply the method to a non‑cooperative game and demonstrate a 65 percent winning rate against benchmark algorithms, with an additional 10 percent improvement after fine‑tuning. In contrast to existing deep learning algorithms for game solving that require millions of training samples, the proposed method consumes approximately 1000 training samples, highlighting its efficiency and scalability.

PaperID: 12641, https://arxiv.org/pdf/2410.14853.pdf

Authors: Wanyu Du, Song Feng, James Gung, Lijia Sun, Yi Zhang, Saab Mansour, Yanjun Qi

Title: DFlow: Diverse Dialogue Flow Simulation with Large Language Models

Abstract:
Developing language model‑based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data simulation methods focus on increasing diversity in language, topics, or dialogue acts at the utterance level, largely neglecting a critical aspect of task logic diversity at the dialogue level. This paper proposes a novel data simulation method designed to enhance the diversity of synthetic dialogues by focusing on task execution logic. Our method uses LLMs to generate decision tree‑structured task plans, which enables the derivation of diverse dialogue trajectories for a given task. Each trajectory, referred to as a "dialog flow", guides the generation of a multi‑turn dialogue that follows a unique trajectory. We apply this method to generate a task‑oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains. We validate the effectiveness of this dataset using the next action prediction task, where models fine‑tuned on our dataset outperform strong baselines, including GPT‑4. Upon acceptance of this paper, we plan to release the code and data publicly.

PaperID: 12642, https://arxiv.org/pdf/2410.14807.pdf

Authors: Hong Jun Jeon, Benjamin Van Roy

Title: Aligning AI Agents via Information-Directed Sampling

Abstract:
The staggering feats of AI systems have brought to attention the topic of AI Alignment: aligning a "superintelligent" AI agent's actions with humanity's interests. Many existing frameworks/algorithms in alignment study the problem on a myopic horizon or study learning from human feedback in isolation, relying on the contrived assumption that the agent has already perfectly identified the environment. As a starting point to address these limitations, we define a class of bandit alignment problems as an extension of classic multi‑armed bandit problems. A bandit alignment problem involves an agent tasked with maximizing long‑run expected reward by interacting with an environment and a human, both involving details/preferences initially unknown to the agent. The reward of actions in the environment depends on both observed outcomes and human preferences. Furthermore, costs are associated with querying the human to learn preferences. Therefore, an effective agent ought to intelligently trade‑off exploration (of the environment and human) and exploitation. We study these trade‑offs theoretically and empirically in a toy bandit alignment problem which resembles the beta‑Bernoulli bandit. We demonstrate while naive exploration algorithms which reflect current practices and even touted algorithms such as Thompson sampling both fail to provide acceptable solutions to this problem, information‑directed sampling achieves favorable regret.

PaperID: 12643, https://arxiv.org/pdf/2410.14790.pdf

Authors: Jianchao Ci, Eldert J. van Henten, Xin Wang, Akshay K. Burusa, Gert Kootstra

Title: SSL-NBV: A Self-Supervised-Learning-Based Next-Best-View algorithm for Efficient 3D Plant Reconstruction by a Robot

Abstract:
The 3D reconstruction of plants is challenging due to their complex shape causing many occlusions. Next‑Best‑View (NBV) methods address this by iteratively selecting new viewpoints to maximize information gain (IG). Deep‑learning‑based NBV (DL‑NBV) methods demonstrate higher computational efficiency over classic voxel‑based NBV approaches but current methods require extensive training using ground‑truth plant models, making them impractical for real‑world plants. These methods, moreover, rely on offline training with pre‑collected data, limiting adaptability in changing agricultural environments. This paper proposes a self‑supervised learning‑based NBV method (SSL‑NBV) that uses a deep neural network to predict the IG for candidate viewpoints. The method allows the robot to gather its own training data during task execution by comparing new 3D sensor data to the earlier gathered data and by employing weakly‑supervised learning and experience replay for efficient online learning. Comprehensive evaluations were conducted in simulation and real‑world environments using cross‑validation. The results showed that SSL‑NBV required fewer views for plant reconstruction than non‑NBV methods and was over 800 times faster than a voxel‑based method. SSL‑NBV reduced training annotations by over 90% compared to a baseline DL‑NBV. Furthermore, SSL‑NBV could adapt to novel scenarios through online fine‑tuning. Also using real plants, the results showed that the proposed method can learn to effectively plan new viewpoints for 3D plant reconstruction. Most importantly, SSL‑NBV automated the entire network training and uses continuous online learning, allowing it to operate in changing agricultural environments.

PaperID: 12644, https://arxiv.org/pdf/2410.14627.pdf

Authors: Jan-Samuel Wagner, Dave DeCaprio, Abishek Chiffon Muthu Raja, Jonathan M. Holman, Lauren K. Brady, Sky C. Cheung, Hosein Barzekar, Eric Yang, Mark Anthony Martinez, David Soong, Sriram Sridhar, Han Si, Brandon W. Higgs, Hisham Hamadeh, Scott Ogden

Title: CELI: Controller-Embedded Language Model Interactions

Abstract:
We introduce Controller‑Embedded Language Model Interactions (CELI), a framework that integrates control logic directly within language model (LM) prompts, facilitating complex, multi‑stage task execution. CELI addresses limitations of existing prompt engineering and workflow optimization techniques by embedding control logic directly within the operational context of language models, enabling dynamic adaptation to evolving task requirements. Our framework transfers control from the traditional programming execution environment to the LMs, allowing them to autonomously manage computational workflows while maintaining seamless interaction with external systems and functions. CELI supports arbitrary function calls with variable arguments, bridging the gap between LMs' adaptive reasoning capabilities and conventional software paradigms' structured control mechanisms. To evaluate CELI's versatility and effectiveness, we conducted case studies in two distinct domains: code generation (HumanEval benchmark) and multi‑stage content generation (Wikipedia‑style articles). The results demonstrate notable performance improvements across a range of domains. CELI achieved a 4.9 percentage point improvement over the best reported score of the baseline GPT‑4 model on the HumanEval code generation benchmark. In multi‑stage content generation, 94.4% of CELI‑produced Wikipedia‑style articles met or exceeded first draft quality when optimally configured, with 44.4% achieving high quality. These outcomes underscore CELI's potential for optimizing AI‑driven workflows across diverse computational domains.

PaperID: 12645, https://arxiv.org/pdf/2410.14594.pdf

Authors: Elias Lumer, Vamse Kumar Subbiah, James A. Burke, Pradeep Honaganahalli Basavaraju, Austin Huber

Title: Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases

Abstract:
Recent advancements in tool‑equipped Agents (LLMs) have enabled complex tasks like secure database interactions and multi‑agent code development. However, scaling tool capacity beyond agent reasoning or model limits remains a challenge. In this paper, we address these challenges by introducing Toolshed Knowledge Bases, a tool knowledge base (vector database) designed to store enhanced tool representations and optimize tool selection for large‑scale tool‑equipped Agents. Additionally, we propose Advanced RAG‑Tool Fusion, a novel ensemble of tool‑applied advanced retrieval‑augmented generation (RAG) techniques across the pre‑retrieval, intra‑retrieval, and post‑retrieval phases, without requiring model fine‑tuning. During pre‑retrieval, tool documents are enhanced with key information and stored in the Toolshed Knowledge Base. Intra‑retrieval focuses on query planning and transformation to increase retrieval accuracy. Post‑retrieval refines the retrieved tool documents and enables self‑reflection. Furthermore, by varying both the total number of tools (tool‑M) an Agent has access to and the tool selection threshold (top‑k), we address trade‑offs between retrieval accuracy, agent performance, and token cost. Our approach achieves 46%, 56%, and 47% absolute improvements on the ToolE single‑tool, ToolE multi‑tool and Seal‑Tools benchmark datasets, respectively (Recall@5).

PaperID: 12646, https://arxiv.org/pdf/2410.14569.pdf

Authors: Hanna Kim, Minkyoo Song, Seung Ho Na, Seungwon Shin, Kimin Lee

Title: When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs

Abstract:
Recent advancements in Large Language Models (LLMs) have established them as agentic systems capable of planning and interacting with various tools. These LLM agents are often paired with web‑based tools, enabling access to diverse sources and real‑time information. Although these advancements offer significant benefits across various applications, they also increase the risk of malicious use, particularly in cyberattacks involving personal information. In this work, we investigate the risks associated with misuse of LLM agents in cyberattacks involving personal data. Specifically, we aim to understand: 1) how potent LLM agents can be when directed to conduct cyberattacks, 2) how cyberattacks are enhanced by web‑based tools, and 3) how affordable and easy it becomes to launch cyberattacks using LLM agents. We examine three attack scenarios: the collection of Personally Identifiable Information (PII), the generation of impersonation posts, and the creation of spear‑phishing emails. Our experiments reveal the effectiveness of LLM agents in these attacks: LLM agents achieved a precision of up to 95.9% in collecting PII, generated impersonation posts where 93.9% of them were deemed authentic, and boosted click rate of phishing links in spear phishing emails by 46.67%. Additionally, our findings underscore the limitations of existing safeguards in contemporary commercial LLMs, emphasizing the urgent need for robust security measures to prevent the misuse of LLM agents.

PaperID: 12647, https://arxiv.org/pdf/2410.14516.pdf

Authors: Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Udhay Nallasamy, Andy Miller, Jaya Narain

Title: Do LLMs "know" internally when they follow instructions?

Abstract:
Instruction‑following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user‑provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction‑following behavior and prevent undesirable outputs, a deeper understanding of how LLMs' internal states relate to these outcomes is required. In this work, we investigate whether LLMs encode information in their representations that correlate with instruction‑following success ‑ a property we term knowing internally. Our analysis identifies a direction in the input embedding space, termed the instruction‑following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction‑following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts rather than the inherent difficulty of the task or instructions. This work provides insight into the internal workings of LLMs' instruction‑following, paving the way for reliable LLM agents.

PaperID: 12648, https://arxiv.org/pdf/2410.14393.pdf

Authors: Konstantin Grotov, Artem Borzilov, Maksim Krivobok, Timofey Bryksin, Yaroslav Zharov

Title: Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks

Abstract:
Computational notebooks became indispensable tools for research‑related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code‑fluent Large Language Models empowered with agentic techniques, smart bug‑fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non‑linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it ‑‑ similar to how a user would ‑‑ and integrated the system into the JetBrains service for collaborative data science called Datalore. We evaluate our approach against the pre‑existing single‑action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with UI. We share the results of the study and consider them valuable for further improving user‑agent collaboration.

PaperID: 12649, https://arxiv.org/pdf/2410.14374.pdf

Authors: Marco Aruta, Vadim Malvone, Aniello Murano

Title: A Model Checker for Natural Strategic Ability

Abstract:
In the last two decades, Alternating‑time Temporal Logic (ATL) has been proved to be very useful in modeling strategic reasoning for Multi‑Agent Systems (MAS). However, this logic struggles to capture the bounded rationality inherent in human decision‑making processes. To overcome these limitations, Natural Alternating‑time Temporal Logic (NatATL) has been recently introduced. As an extension of ATL, NatATL incorporates bounded memory constraints into agents' strategies, which allows to resemble human cognitive limitations. In this paper, we present a model checker tool for NatATL specifications ‑ both for memoryless strategies and strategies with recall ‑ integrated into VITAMIN, an open‑source model checker designed specifically for MAS verification. By embedding NatATL into VITAMIN, we transform theoretical advancements into a practical verification framework, enabling comprehensive analysis and validation of strategic reasoning in complex multi‑agent environments. Our novel tool paves the way for applications in areas such as explainable AI and human‑in‑the‑loop systems, highlighting NatATL's substantial potential.

PaperID: 12650, https://arxiv.org/pdf/2410.14262.pdf

Authors: Ted Kwartler, Matthew Berman, Alan Aqrawi

Title: Good Parenting is all you need -- Multi-agentic LLM Hallucination Mitigation

Abstract:
This study explores the ability of Large Language Model (LLM) agents to detect and correct hallucinations in AI‑generated content. A primary agent was tasked with creating a blog about a fictional Danish artist named Flipfloppidy, which was then reviewed by another agent for factual inaccuracies. Most LLMs hallucinated the existence of this artist. Across 4,900 test runs involving various combinations of primary and reviewing agents, advanced AI models such as Llama3‑70b and GPT‑4 variants demonstrated near‑perfect accuracy in identifying hallucinations and successfully revised outputs in 85% to 100% of cases following feedback. These findings underscore the potential of advanced AI models to significantly enhance the accuracy and reliability of generated content, providing a promising approach to improving AI workflow orchestration.

PaperID: 12651, https://arxiv.org/pdf/2410.14251.pdf

Authors: Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Tian Jin, Xiaowen Dong, Yanfeng Wang, Siheng Chen

Title: Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

Abstract:
Post‑training is essential for enabling large language models (LLMs) to follow human instructions. However, its effectiveness depends on high‑quality instruction data, which is challenging to obtain in the real world due to privacy concerns, data scarcity, and high annotation costs. To fill this gap, inspired by the recent success of using LLMs to simulate human society, we propose MATRIX, a multi‑agent simulator that automatically generates diverse text‑based scenarios, capturing a wide range of real‑world human needs in a realistic and scalable manner. Leveraging these outputs, we introduce a novel scenario‑driven instruction generator MATRIX‑Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain‑specific data. On AlpacaEval 2 and Arena‑Hard benchmarks, Llama‑3‑8B‑Base, post‑trained on datasets synthesized by MATRIX‑Gen with just 20K instruction‑response pairs, outperforms Meta's Llama‑3‑8B‑Instruct model, which was trained on over 10M pairs.

PaperID: 12652, https://arxiv.org/pdf/2410.14209.pdf

Authors: Zihan Liu, Ruinan Zeng, Dongxia Wang, Gengyun Peng, Jingyi Wang, Qiang Liu, Peiyu Liu, Wenhai Wang

Title: Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents

Abstract:
In industrial control systems, the generation and verification of Programmable Logic Controller (PLC) code are critical for ensuring operational efficiency and safety. While Large Language Models (LLMs) have made strides in automated code generation, they often fall short in providing correctness guarantees and specialized support for PLC programming. To address these challenges, this paper introduces Agents4PLC, a novel framework that not only automates PLC code generation but also includes code‑level verification through an LLM‑based multi‑agent system. We first establish a comprehensive benchmark for verifiable PLC code generation area, transitioning from natural language requirements to human‑written‑verified formal specifications and reference PLC code. We further enhance our `agents' specifically for industrial control systems by incorporating Retrieval‑Augmented Generation (RAG), advanced prompt engineering techniques, and Chain‑of‑Thought strategies. Evaluation against the benchmark demonstrates that Agents4PLC significantly outperforms previous methods, achieving superior results across a series of increasingly rigorous metrics. This research not only addresses the critical challenges in PLC programming but also highlights the potential of our framework to generate verifiable code applicable to real‑world industrial applications.

PaperID: 12653, https://arxiv.org/pdf/2410.14174.pdf

Authors: Quang Dang, Murat Kucukosmanoglu, Michael Anoruo, Golshan Kargosha, Sarah Conklin, Justin Brooks

Title: Auto Detecting Cognitive Events Using Machine Learning on Pupillary Data

Abstract:
Assessing cognitive workload is crucial for human performance as it affects information processing, decision making, and task execution. Pupil size is a valuable indicator of cognitive workload, reflecting changes in attention and arousal governed by the autonomic nervous system. Cognitive events are closely linked to cognitive workload as they activate mental processes and trigger cognitive responses. This study explores the potential of using machine learning to automatically detect cognitive events experienced using individuals. We framed the problem as a binary classification task, focusing on detecting stimulus onset across four cognitive tasks using CNN models and 1‑second pupillary data. The results, measured by Matthew's correlation coefficient, ranged from 0.47 to 0.80, depending on the cognitive task. This paper discusses the trade‑offs between generalization and specialization, model behavior when encountering unseen stimulus onset times, structural variances among cognitive tasks, factors influencing model predictions, and real‑time simulation. These findings highlight the potential of machine learning techniques in detecting cognitive events based on pupil and eye movement responses, contributing to advancements in personalized learning and optimizing neurocognitive workload management.

PaperID: 12654, https://arxiv.org/pdf/2410.14142.pdf

Authors: Tianqing Zhou, Bobo Wang, Dong Qin, Xuefang Nie, Nan Jiang, Chunguo Li

Title: Secure Collaborative Computation Offloading and Resource Allocation in Cache-Assisted Ultra-Dense IoT Networks With Multi-Slope Channels

Abstract:
Cache‑assisted ultra‑dense mobile edge computing (MEC) networks are a promising solution for meeting the increasing demands of numerous Internet‑of‑Things mobile devices (IMDs). To address the complex interferences caused by small base stations (SBSs) deployed densely in such networks, this paper explores the combination of orthogonal frequency division multiple access (OFDMA), non‑orthogonal multiple access (NOMA), and base station (BS) clustering. Additionally, security measures are introduced to protect IMDs' tasks offloaded to BSs from potential eavesdropping and malicious attacks. As for such a network framework, a computation offloading scheme is proposed to minimize IMDs' energy consumption while considering constraints such as delay, power, computing resources, and security costs, optimizing channel selections, task execution decisions, device associations, power controls, security service assignments, and computing resource allocations. To solve the formulated problem efficiently, we develop a further improved hierarchical adaptive search (FIHAS) algorithm, giving some insights into its parallel implementation, computation complexity, and convergence. Simulation results demonstrate that the proposed algorithms can achieve lower total energy consumption and delay compared to other algorithms when strict latency and cost constraints are imposed.

PaperID: 12655, https://arxiv.org/pdf/2410.14141.pdf

Authors: Sabit Hassan, Hye-Young Chung, Xiang Zhi Tan, Malihe Alikhani

Title: Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

Abstract:
When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety‑critical situations, such as sharp objects on the floor. In this context, we present M‑CoDAL, a multimodal‑dialogue system specifically designed for embodied agents to better understand and communicate in safety‑critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train this system, we introduce a novel clustering‑based active learning mechanism that utilizes an external Large Language Model (LLM) to identify informative instances. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. These violations are annotated using a Large Multimodal Model (LMM) and verified by human annotators. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation. Next, we deploy our dialogue system on a Hello Robot Stretch robot and conduct a within‑subject user study with real‑world participants. In the study, participants role‑play two safety scenarios with different levels of severity with the robot and receive interventions from our model and a baseline system powered by OpenAI's ChatGPT. The study results corroborate and extend the findings from the automated evaluation, showing that our proposed system is more persuasive in a real‑world embodied agent setting.

PaperID: 12656, https://arxiv.org/pdf/2410.14138.pdf

Authors: Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, Chuan Wu

Title: ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Abstract:
Large vision‑language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi‑modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision‑reasoning capabilities and multi‑run proactive perception. Briefly, given a multi‑modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms existing multi‑step reasoning frameworks on various benchmarks for both open‑source and closed‑source models, with the average performance gain reaching 13.2%. Besides, the integration of LLMs allows ProReason to produce high‑quality visual reasoning data, which empowers ProReason‑distilled models (i.e., ProReason‑VL and ProReason‑Q3) to achieve superior performance in downstream tasks. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM‑assisted ones.

PaperID: 12657, https://arxiv.org/pdf/2410.14041.pdf

Authors: Eric Yang, Tomas Garcia, Hannah Williams, Bhawesh Kumar, Martin Ramé, Eileen Rivera, Yiran Ma, Jonathan Amar, Caricia Catalani, Yugang Jia

Title: From Barriers to Tactics: A Behavioral Science-Informed Agentic Workflow for Personalized Nutrition Coaching

Abstract:
Effective management of cardiometabolic conditions requires sustained positive nutrition habits, often hindered by complex and individualized barriers. Direct human management is simply not scalable, while previous attempts aimed at automating nutrition coaching lack the personalization needed to address these diverse challenges. This paper introduces a novel LLM‑powered agentic workflow designed to provide personalized nutrition coaching by directly targeting and mitigating patient‑specific barriers. Grounded in behavioral science principles, the workflow leverages a comprehensive mapping of nutrition‑related barriers to corresponding evidence‑based strategies. A specialized LLM agent intentionally probes for and identifies the root cause of a patient's dietary struggles. Subsequently, a separate LLM agent delivers tailored tactics designed to overcome those specific barriers with patient context. We designed and validated our approach through a user study with individuals with cardiometabolic conditions, demonstrating the system's ability to accurately identify barriers and provide personalized guidance. Furthermore, we conducted a large‑scale simulation study, grounding on real patient vignettes and expert‑validated metrics, to evaluate the system's performance across a wide range of scenarios. Our findings demonstrate the potential of this LLM‑powered agentic workflow to improve nutrition coaching by providing personalized, scalable, and behaviorally‑informed interventions.

PaperID: 12658, https://arxiv.org/pdf/2410.13953.pdf

Authors: Tonghan Wang, Heng Dong, Yanchen Jiang, David C. Parkes, Milind Tambe

Title: On Diffusion Models for Multi-Agent Partial Observability: Shared Attractors, Error Bounds, and Composite Flow

Abstract:
Multiagent systems grapple with partial observability (PO), and the decentralized POMDP (Dec‑POMDP) model highlights the fundamental nature of this challenge. Whereas recent approaches to addressing PO have appealed to deep learning models, providing a rigorous understanding of how these models and their approximation errors affect agents' handling of PO and their interactions remain a challenge. In addressing this challenge, we investigate reconstructing global states from local action‑observation histories in Dec‑POMDPs using diffusion models. We first find that diffusion models conditioned on local history represent possible states as stable fixed points. In collectively observable (CO) Dec‑POMDPs, individual diffusion models conditioned on agents' local histories share a unique fixed point corresponding to the global state, while in non‑CO settings, shared fixed points yield a distribution of possible states given joint history. We further find that, with deep learning approximation errors, fixed points can deviate from true states and the deviation is negatively correlated to the Jacobian rank. Inspired by this low‑rank property, we bound a deviation by constructing a surrogate linear regression model that approximates the local behavior of a diffusion model. With this bound, we propose a \emphcomposite diffusion process iterating over agents with theoretical convergence guarantees to the true state.

PaperID: 12659, https://arxiv.org/pdf/2410.13919.pdf

Authors: Reworr, Dmitrii Volkov

Title: LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

Abstract:
Attacks powered by Large Language Model (LLM) agents represent a growing threat to modern cybersecurity. To address this concern, we present LLM Honeypot, a system designed to monitor autonomous AI hacking agents. By augmenting a standard SSH honeypot with prompt injection and time‑based analysis techniques, our framework aims to distinguish LLM agents among all attackers. Over a trial deployment of about three months in a public environment, we collected 8,130,731 hacking attempts and 8 potential AI agents. Our work demonstrates the emergence of AI‑driven threats and their current level of usage, serving as an early warning of malicious LLM agents in the wild.

PaperID: 12660, https://arxiv.org/pdf/2410.13909.pdf

Authors: Xinyi Li, Yu Xu, Yongfeng Zhang, Edward C. Malthouse

Title: Large Language Model-driven Multi-Agent Simulation for News Diffusion Under Different Network Structures

Abstract:
The proliferation of fake news in the digital age has raised critical concerns, particularly regarding its impact on societal trust and democratic processes. Diverging from conventional agent‑based simulation approaches, this work introduces an innovative approach by employing a large language model (LLM)‑driven multi‑agent simulation to replicate complex interactions within information ecosystems. We investigate key factors that facilitate news propagation, such as agent personalities and network structures, while also evaluating strategies to combat misinformation. Through simulations across varying network structures, we demonstrate the potential of LLM‑based agents in modeling the dynamics of misinformation spread, validating the influence of agent traits on the diffusion process. Our findings emphasize the advantages of LLM‑based simulations over traditional techniques, as they uncover underlying causes of information spread ‑‑ such as agents promoting discussions ‑‑ beyond the predefined rules typically employed in existing agent‑based models. Additionally, we evaluate three countermeasure strategies, discovering that brute‑force blocking influential agents in the network or announcing news accuracy can effectively mitigate misinformation. However, their effectiveness is influenced by the network structure, highlighting the importance of considering network structure in the development of future misinformation countermeasures.

PaperID: 12661, https://arxiv.org/pdf/2410.13886.pdf

Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang

Title: Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

Abstract:
For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non‑chat and agentic use cases? Unlike chatbots, LLM agents equipped with general‑purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red‑teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red‑teaming browser agents. BrowserART is consist of 100 diverse browser‑related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study on state‑of‑the‑art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal‑trained LLMs in the chat settings transfer effectively to browser agents. With human rewrites, GPT‑4o and o1‑preview‑based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety

PaperID: 12662, https://arxiv.org/pdf/2410.13825.pdf

Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala

Title: AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Abstract:
Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof‑of‑concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi‑agent systems, search methods, etc.) and the corresponding in‑context examples, which may not generalize well across all real‑world scenarios. On the other hand, there has been limited study on the misalignment between a web agent's observation/action representation and the pre‑training data of the LLM it's based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM‑based web agent by simply refining its observation and action space to better align with the LLM's capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general‑purpose web interaction tasks, our agent AgentOccam surpasses the previous state‑of‑the‑art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in‑context examples, new agent roles, online feedback or search strategies. AgentOccam's simple design highlights LLMs' impressive zero‑shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM‑based agents.

PaperID: 12663, https://arxiv.org/pdf/2410.13768.pdf

Authors: Alireza Ghafarollahi, Markus J. Buehler

Title: Rapid and Automated Alloy Design with Graph Neural Network-Powered LLM-Driven Multi-Agent Systems

Abstract:
A multi‑agent AI model is used to automate the discovery of new metallic alloys, integrating multimodal data and external knowledge including insights from physics via atomistic simulations. Our multi‑agent system features three key components: (a) a suite of LLMs responsible for tasks such as reasoning and planning, (b) a group of AI agents with distinct roles and expertise that dynamically collaborate, and (c) a newly developed graph neural network (GNN) model for rapid retrieval of key physical properties. A set of LLM‑driven AI agents collaborate to automate the exploration of the vast design space of MPEAs, guided by predictions from the GNN. We focus on the NbMoTa family of body‑centered cubic (bcc) alloys, modeled using an ML‑based interatomic potential, and target two key properties: the Peierls barrier and solute/screw dislocation interaction energy. Our GNN model accurately predicts these atomic‑scale properties, providing a faster alternative to costly brute‑force calculations and reducing the computational burden on multi‑agent systems for physics retrieval. This AI system revolutionizes materials discovery by reducing reliance on human expertise and overcoming the limitations of direct all‑atom simulations. By synergizing the predictive power of GNNs with the dynamic collaboration of LLM‑based agents, the system autonomously navigates vast alloy design spaces, identifying trends in atomic‑scale material properties and predicting macro‑scale mechanical strength, as demonstrated by several computational experiments. This approach accelerates the discovery of advanced alloys and holds promise for broader applications in other complex systems, marking a significant step forward in automated materials design.

PaperID: 12664, https://arxiv.org/pdf/2410.13700.pdf

Authors: Aditi Saxena, Twinkle Tripathy, Rajasekhar Anguluri

Title: Real Eventual Exponential Positivity of Complex-valued Laplacians: Applications to Consensus in Multi-agent Systems

Abstract:
In this paper, we explore the property of eventual exponential positivity (EEP) in complex matrices. We show that this property holds for the real part of the matrix exponential for a certain class of complex matrices. Next, we present the relation between the spectral properties of the Laplacian matrix of an unsigned digraph with complex edge‑weights and the property of real EEP. Finally, we show that the Laplacian flow system of a network is stable when the negated Laplacian admits real EEP. Numerical examples are presented to demonstrate the results.

PaperID: 12665, https://arxiv.org/pdf/2410.13649.pdf

Authors: Tianyi Zhang, Atta Norouzian, Aanchan Mohan, Frederick Ducatelle

Title: A new approach for fine-tuning sentence transformers for intent classification and out-of-scope detection tasks

Abstract:
In virtual assistant (VA) systems it is important to reject or redirect user queries that fall outside the scope of the system. One of the most accurate approaches for out‑of‑scope (OOS) rejection is to combine it with the task of intent classification on in‑scope queries, and to use methods based on the similarity of embeddings produced by transformer‑based sentence encoders. Typically, such encoders are fine‑tuned for the intent‑classification task, using cross‑entropy loss. Recent work has shown that while this produces suitable embeddings for the intent‑classification task, it also tends to disperse in‑scope embeddings over the full sentence embedding space. This causes the in‑scope embeddings to potentially overlap with OOS embeddings, thereby making OOS rejection difficult. This is compounded when OOS data is unknown. To mitigate this issue our work proposes to regularize the cross‑entropy loss with an in‑scope embedding reconstruction loss learned using an auto‑encoder. Our method achieves a 1‑4% improvement in the area under the precision‑recall curve for rejecting out‑of‑sample (OOS) instances, without compromising intent classification performance.

PaperID: 12666, https://arxiv.org/pdf/2410.13454.pdf

Authors: Chenhang Yan, Liping Yan, Yuezu Lv, Bolei Dong, Yuanqing Xia

Title: Byzantine-Resilient Output Optimization of Multiagent via Self-Triggered Hybrid Detection Approach

Abstract:
How to achieve precise distributed optimization despite unknown attacks, especially the Byzantine attacks, is one of the critical challenges for multiagent systems. This paper addresses a distributed resilient optimization for linear heterogeneous multi‑agent systems faced with adversarial threats. We establish a framework aimed at realizing resilient optimization for continuous‑time systems by incorporating a novel self‑triggered hybrid detection approach. The proposed hybrid detection approach is able to identify attacks on neighbors using both error thresholds and triggering intervals, thereby optimizing the balance between effective attack detection and the reduction of excessive communication triggers. Through using an edge‑based adaptive self‑triggered approach, each agent can receive its neighbors' information and determine whether these information is valid. If any neighbor prove invalid, each normal agent will isolate that neighbor by disconnecting communication along that specific edge. Importantly, our adaptive algorithm guarantees the accuracy of the optimization solution even when an agent is isolated by its neighbors.

PaperID: 12667, https://arxiv.org/pdf/2410.13298.pdf

Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Liang Zhao, Yuchun Fan, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin

Title: Advancing Large Language Model Attribution through Self-Improving

Abstract:
Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information‑seeking systems. However, improving this capability requires high‑quality attribution data, which is costly and labor‑intensive. Inspired by recent advances in self‑improvement that enhance LLMs without manual annotation, we present START, a Self‑Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self‑construct synthetic training data for warming up. To further self‑improve the model's attribution ability, START iteratively utilizes fine‑grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open‑domain question‑answering datasets, covering long‑form QA and multi‑step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.

PaperID: 12668, https://arxiv.org/pdf/2410.13185.pdf

Authors: Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xingxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Deli Zhao, Yu Rong, Tian Feng, Lidong Bing

Title: Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents

Abstract:
Effective research ideation is a critical step for scientific research. However, the exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models~(LLMs) suggest a promising avenue for automating the generation of novel research ideas. However, existing methods for idea generation either trivially prompt LLMs or directly expose LLMs to extensive literature without indicating useful information. Inspired by the research process of human researchers, we propose a Chain‑of‑Ideas~(CoI) agent, an LLM‑based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain. This organization facilitates LLMs to capture the current advancements in research, thereby enhancing their ideation capabilities. Furthermore, we propose Idea Arena, an evaluation protocol that can comprehensively evaluate idea generation methods from different perspectives, aligning closely with the preferences of human researchers. Experimental results indicate that the CoI agent consistently outperforms other methods and shows comparable quality as humans in research idea generation. Moreover, our CoI agent is budget‑friendly, with a minimum cost of \0.50 to generate a candidate idea and its corresponding experimental design.

PaperID: 12669, https://arxiv.org/pdf/2410.12952.pdf

Authors: Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen

Title: Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Abstract:
Large Language Models (LLMs) have exhibited significant potential in performing diverse tasks, including the ability to call functions or use external tools to enhance their performance. While current research on function calling by LLMs primarily focuses on single‑turn interactions, this paper addresses the overlooked necessity for LLMs to engage in multi‑turn function calling‑‑critical for handling compositional, real‑world queries that require planning with functions but not only use functions. To facilitate this, we introduce an approach, BUTTON, which generates synthetic compositional instruction tuning data via bottom‑up instruction construction and top‑down trajectory generation. In the bottom‑up phase, we generate simple atomic tasks based on real‑world scenarios and build compositional tasks using heuristic strategies based on atomic tasks. Corresponding function definitions are then synthesized for these compositional tasks. The top‑down phase features a multi‑agent environment where interactions among simulated humans, assistants, and tools are utilized to gather multi‑turn function calling trajectories. This approach ensures task compositionality and allows for effective function and trajectory generation by examining atomic tasks within compositional tasks. We produce a dataset BUTTONInstruct comprising 8k data points and demonstrate its effectiveness through extensive experiments across various LLMs.

PaperID: 12670, https://arxiv.org/pdf/2410.12889.pdf

Authors: Gabriele La Malfa, Jie M. Zhang, Michael Luck, Elizabeth Black

Title: Using Protected Attributes to Consider Fairness in Multi-Agent Systems

Abstract:
Fairness in Multi‑Agent Systems (MAS) has been extensively studied, particularly in reward distribution among agents in scenarios such as goods allocation, resource division, lotteries, and bargaining systems. Fairness in MAS depends on various factors, including the system's governing rules, the behaviour of the agents, and their characteristics. Yet, fairness in human society often involves evaluating disparities between disadvantaged and privileged groups, guided by principles of Equality, Diversity, and Inclusion (EDI). Taking inspiration from the work on algorithmic fairness, which addresses bias in machine learning‑based decision‑making, we define protected attributes for MAS as characteristics that should not disadvantage an agent in terms of its expected rewards. We adapt fairness metrics from the algorithmic fairness literature ‑‑ namely, demographic parity, counterfactual fairness, and conditional statistical parity ‑‑ to the multi‑agent setting, where self‑interested agents interact within an environment. These metrics allow us to evaluate the fairness of MAS, with the ultimate aim of designing MAS that do not disadvantage agents based on protected attributes.

PaperID: 12671, https://arxiv.org/pdf/2410.12855.pdf

Authors: Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu

Title: JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

Abstract:
Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT‑4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in‑the‑wild, and multilingual prompts, along with high‑quality human‑annotated datasets. The JAILJUDGE dataset includes over 35k+ instruction‑tune data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine‑grained scoring (1 to 10). This framework supports the construction of instruction‑tuning ground truth and facilitates the development of JAILJUDGE Guard, an end‑to‑end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker‑agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state‑of‑the‑art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT‑4, Llama‑Guard) and zero‑shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero‑shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.

PaperID: 12672, https://arxiv.org/pdf/2410.12853.pdf

Authors: Mahmood Hegazy

Title: Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

Abstract:
Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses, especially in tasks like mathematical reasoning. Chain‑of‑thought prompting, self‑verification, and multi‑agent debate are among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.'s multi‑agent debate framework, we find that multi‑agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when diverse trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium‑capacity models (Gemini‑Pro, Mixtral 7BX8, and PaLM 2‑M) outperforms GPT‑4 on the GSM‑8K benchmark, scoring 91% accuracy. By comparison, when 3 instances of Gemini‑Pro are used, performance only reaches 82%. Finally, this diverse set of medium‑capacity models sets a new state‑of‑the‑art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.

PaperID: 12673, https://arxiv.org/pdf/2410.12848.pdf

Authors: Per Niklas Waaler, Musarrat Hussain, Igor Molchanov, Lars Ailo Bongo, Brita Elvevåg

Title: Prompt Engineering a Schizophrenia Chatbot: Utilizing a Multi-Agent Approach for Enhanced Compliance with Prompt Instructions

Abstract:
Patients with schizophrenia often present with cognitive impairments that may hinder their ability to learn about their condition. These individuals could benefit greatly from education platforms that leverage the adaptability of Large Language Models (LLMs) such as GPT‑4. While LLMs have the potential to make topical mental health information more accessible and engaging, their black‑box nature raises concerns about ethics and safety. Prompting offers a way to produce semi‑scripted chatbots with responses anchored in instructions and validated information, but prompt‑engineered chatbots may drift from their intended identity as the conversation progresses. We propose a Critical Analysis Filter for achieving better control over chatbot behavior. In this system, a team of prompted LLM agents are prompt‑engineered to critically analyze and refine the chatbot's response and deliver real‑time feedback to the chatbot. To test this approach, we develop an informational schizophrenia chatbot and converse with it (with the filter deactivated) until it oversteps its scope. Once drift has been observed, AI‑agents are used to automatically generate sample conversations in which the chatbot is being enticed to talk about out‑of‑bounds topics. We manually assign to each response a compliance score that quantifies the chatbot's compliance to its instructions; specifically the rules about accurately conveying sources and being transparent about limitations. Activating the Critical Analysis Filter resulted in an acceptable compliance score (>=2) in 67.0% of responses, compared to only 8.7% when the filter was deactivated. These results suggest that a self‑reflection layer could enable LLMs to be used effectively and safely in mental health platforms, maintaining adaptability while reliably limiting their scope to appropriate use cases.

PaperID: 12674, https://arxiv.org/pdf/2410.12846.pdf

Authors: Yuxiang Wang, Jianzhong Qi, Junhao Gan

Title: Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering

Abstract:
Question answering on free‑form tables (a.k.a. TableQA) is a challenging task because of the flexible structure and complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data, which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges brought by numerical values which are common in tabular data, and LLMs are known to struggle with such values. We aim to address this issue, and we propose a model named TabLaP that uses LLMs as a planner rather than an answer generator. This approach exploits LLMs' capability in multi‑step reasoning while leaving the actual numerical calculations to a Python interpreter for accurate calculation. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret‑aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state‑of‑the‑art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.

PaperID: 12675, https://arxiv.org/pdf/2410.12808.pdf

Authors: Denys J. C. Matthies, Ruben Schlonsak, Hanzhi Zhuang, Rui Song

Title: FlyAI -- The Next Level of Artificial Intelligence is Unpredictable! Injecting Responses of a Living Fly into Decision Making

Abstract:
In this paper, we introduce a new type of bionic AI that enhances decision‑making unpredictability by incorporating responses from a living fly. Traditional AI systems, while reliable and predictable, lack nuanced and sometimes unseasoned decision‑making seen in humans. Our approach uses a fly's varied reactions, to tune an AI agent in the game of Gobang. Through a study, we compare the performances of different strategies on altering AI agents and found a bionic AI agent to outperform human as well as conventional and white‑noise enhanced AI agents. We contribute a new methodology for creating a bionic random function and strategies to enhance conventional AI agents ultimately improving unpredictability.

PaperID: 12676, https://arxiv.org/pdf/2410.12720.pdf

Authors: Emmanuele Lacavalla, Shuyi Yang, Riccardo Crupi, Joseph E. Gonzalez

Title: HEnRY: A Multi-Agent System Framework for Multi-Domain Contexts

Abstract:
This project, named HEnRY, aims to introduce a Multi‑Agent System (MAS) into Intesa Sanpaolo. The name HEnRY summarizes the project's core principles: the Hierarchical organization of agents in a layered structure for efficient resource management; Efficient optimization of resources and operations to enhance overall performance; Reactive ability of agents to quickly respond to environmental stimuli; and Yielding adaptability and flexibility of agents to handle unexpected situations. The discussion covers two distinct research paths: the first focuses on the system architecture, and the second on the collaboration between agents. This work is not limited to the specific structure of the Intesa Sanpaolo context; instead, it leverages existing research in MAS to introduce a new solution. Since Intesa Sanpaolo is organized according to a model that aligns with international corporate governance best practices, this approach could also be relevant to similar scenarios.

PaperID: 12677, https://arxiv.org/pdf/2410.12651.pdf

Authors: Koen de Vos, Elena Torta, Herman Bruyninckx, Cesar Lopez Martinez, Rene van de Molengraft

Title: Hybrid Decision Making for Scalable Multi-Agent Navigation: Integrating Semantic Maps, Discrete Coordination, and Model Predictive Control

Abstract:
This paper presents a framework for multi‑agent navigation in structured but dynamic environments, integrating three key components: a shared semantic map encoding metric and semantic environmental knowledge, a claim policy for coordinating access to areas within the environment, and a Model Predictive Controller for generating motion trajectories that respect environmental and coordination constraints. The main advantages of this approach include: (i) enforcing area occupancy constraints derived from specific task requirements; (ii) enhancing computational scalability by eliminating the need for collision avoidance constraints between robotic agents; and (iii) the ability to anticipate and avoid deadlocks between agents. The paper includes both simulations and physical experiments demonstrating the framework's effectiveness in various representative scenarios.

PaperID: 12678, https://arxiv.org/pdf/2410.12647.pdf

Authors: Yingpeng Duan, Yujie Tang

Title: Zeroth-Order Feedback Optimization in Multi-Agent Systems: Tackling Coupled Constraints

Abstract:
This paper investigates distributed zeroth‑order feedback optimization in multi‑agent systems with coupled constraints, where each agent operates its local action vector and observes only zeroth‑order information to minimize a global cost function subject to constraints in which the local actions are coupled. Specifically, we employ two‑point zeroth‑order gradient estimation with delayed information to construct stochastic gradients, and leverage the constraint extrapolation technique and the averaging consensus framework to effectively handle the coupled constraints. We also provide convergence rate and oracle complexity results for our algorithm, characterizing its computational efficiency and scalability by rigorous theoretical analysis. Numerical experiments are conducted to validate the algorithm's effectiveness.

PaperID: 12679, https://arxiv.org/pdf/2410.12568.pdf

Authors: Sihao Wu, Jiaxu Liu, Xiangyu Yin, Guangliang Cheng, Xingyu Zhao, Meng Fang, Xinping Yi, Xiaowei Huang

Title: Robust RL with LLM-Driven Data Synthesis and Policy Adaptation for Autonomous Driving

Abstract:
The integration of Large Language Models (LLMs) into autonomous driving systems demonstrates strong common sense and reasoning abilities, effectively addressing the pitfalls of purely data‑driven methods. Current LLM‑based agents require lengthy inference times and face challenges in interacting with real‑time autonomous driving environments. A key open question is whether we can effectively leverage the knowledge from LLMs to train an efficient and robust Reinforcement Learning (RL) agent. This paper introduces RAPID, a novel \underlineRobust \underlineAdaptive \underlinePolicy \underlineInfusion and \underlineDistillation framework, which trains specialized mix‑of‑policy RL agents using data synthesized by an LLM‑based driving agent and online adaptation. RAPID features three key designs: 1) utilization of offline data collected from an LLM agent to distil expert knowledge into RL policies for faster real‑time inference; 2) introduction of robust distillation in RL to inherit both performance and robustness from LLM‑based teacher; and 3) employment of a mix‑of‑policy approach for joint decision decoding with a policy adapter. Through fine‑tuning via online environment interaction, RAPID reduces the forgetting of LLM knowledge while maintaining adaptability to different tasks. Extensive experiments demonstrate RAPID's capability to effectively integrate LLM knowledge into scaled‑down RL policies in an efficient, adaptable, and robust way. Code and checkpoints will be made publicly available upon acceptance.

PaperID: 12680, https://arxiv.org/pdf/2410.12563.pdf

Authors: Gregorio Marchesini, Siyuan Liu, Lars Lindemann, Dimos V. Dimarogonas

Title: A Communication Consistent Approach to Signal Temporal Logic Task Decomposition in Multi-Agent Systems

Abstract:
We consider the problem of decomposing a global task assigned to a multi‑agent system, expressed as a formula within a fragment of Signal Temporal Logic (STL), under range‑limited communication. Given a global task expressed as a conjunction of local tasks defined over the individual and relative states of agents in the system, we propose representing task dependencies among agents as edges of a suitably defined task graph. At the same time, range‑limited communication naturally induces the definition of a communication graph that defines which agents have access to each other's states. Within these settings, inconsistencies arise when a task dependency between a pair of agents is not supported by a corresponding communication link due to the limited communication range. As a result, state feedback control laws previously derived to achieve the tasks' satisfaction can not be leveraged. We propose a task decomposition mechanism to distribute tasks assigned to pairs of non‑communicating agents in the system as conjunctions of tasks defined over the relative states of communicating agents, thus enforcing consistency between task and communication graphs. Assuming the super‑level sets of the predicate functions composing the STL tasks are bounded polytopes, our task decomposition mechanism can be cast as a parameter optimization problem and solved via state‑of‑the‑art decentralized convex optimization algorithms. To guarantee the soundness of our approach, we present various conditions under which the tasks defined in the applied STL fragment are unsatisfiable, and we show sufficient conditions such that our decomposition approach yields satisfiable global tasks after decomposition.

PaperID: 12681, https://arxiv.org/pdf/2410.12539.pdf

Authors: Stelios Triantafyllou, Aleksa Sukovic, Yasaman Zolfimoselo, Goran Radanovic

Title: Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making

Abstract:
We address the challenge of explaining counterfactual outcomes in multi‑agent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent's action on the outcome of a realized scenario through its influence on the environment dynamics and the agents' behavior. To achieve this, we introduce a novel causal explanation formula that decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent's action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents' actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consider agent‑specific effects ‑‑ a causal concept that quantifies the counterfactual effect of an agent's action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept of structure‑preserving interventions and attribute the effect to state variables based on their "intrinsic" contributions. Through extensive experimentation, we demonstrate the interpretability of our approach in a Gridworld environment with LLM‑assisted agents and a sepsis management simulator.

PaperID: 12682, https://arxiv.org/pdf/2410.12532.pdf

Authors: Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang

Title: MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration

Abstract:
In healthcare intelligence, the ability to fuse heterogeneous, multi‑intent information from diverse clinical sources is fundamental to building reliable decision‑making systems. Large Language Model (LLM)‑driven information interaction systems currently showing potential promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM‑based medical multi‑agent collaboration framework designed to enable intent‑aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization‑guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine‑grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent's intent in multi‑round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision‑level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.

PaperID: 12683, https://arxiv.org/pdf/2410.12481.pdf

Authors: Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

Title: SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Abstract:
The past years have seen Large Language Models (LLMs) strive not only as generative models but also as agents solving textual sequential decision‑making tasks. When facing complex environments where their zero‑shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on‑policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor‑Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on‑policy methods in more classic multi‑goal RL environments.

PaperID: 12684, https://arxiv.org/pdf/2410.12475.pdf

Authors: Lu Shi, Bin Qi, Jiarui Luo, Yang Zhang, Zhanzhao Liang, Zhaowei Gao, Wenke Deng, Lin Sun

Title: Aegis:An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering

Abstract:
Functional safety is a critical aspect of automotive engineering, encompassing all phases of a vehicle's lifecycle, including design, development, production, operation, and decommissioning. This domain involves highly knowledge‑intensive tasks. This paper introduces Aegis: An Advanced LLM‑Based Multi‑Agent for Intelligent Functional Safety Engineering. Aegis is specifically designed to support complex functional safety tasks within the automotive sector. It is tailored to perform Hazard Analysis and Risk Assessment(HARA), document Functional Safety Requirements(FSR), and plan test cases for Automatic Emergency Braking(AEB) systems. The most advanced version, Aegis‑Max, leverages Retrieval‑Augmented Generation(RAG) and reflective mechanisms to enhance its capability in managing complex, knowledge‑intensive tasks. Additionally, targeted prompt refinement by professional functional safety practitioners can significantly optimize Aegis's performance in the functional safety domain. This paper demonstrates the potential of Aegis to improve the efficiency and effectiveness of functional safety processes in automotive engineering.

PaperID: 12685, https://arxiv.org/pdf/2410.12432.pdf

Authors: Pranjali Pathre, Gunjan Gupta, M. Nomaan Qureshi, Mandyam Brunda, Samarth Brahmbhatt, K. Madhava Krishna

Title: Imagine2Servo: Intelligent Visual Servoing with Diffusion-Driven Goal Generation for Robotic Tasks

Abstract:
Visual servoing, the method of controlling robot motion through feedback from visual sensors, has seen significant advancements with the integration of optical flow‑based methods. However, its application remains limited by inherent challenges, such as the necessity for a target image at test time, the requirement of substantial overlap between initial and target images, and the reliance on feedback from a single camera. This paper introduces Imagine2Servo, an innovative approach leveraging diffusion‑based image editing techniques to enhance visual servoing algorithms by generating intermediate goal images. This methodology allows for the extension of visual servoing applications beyond traditional constraints, enabling tasks like long‑range navigation and manipulation without predefined goal images. We propose a pipeline that synthesizes subgoal images grounded in the task at hand, facilitating servoing in scenarios with minimal initial and target image overlap and integrating multi‑camera feedback for comprehensive task execution. Our contributions demonstrate a novel application of image generation to robotic control, significantly broadening the capabilities of visual servoing systems. Real‑world experiments validate the effectiveness and versatility of the Imagine2Servo framework in accomplishing a variety of tasks, marking a notable advancement in the field of visual servoing.

PaperID: 12686, https://arxiv.org/pdf/2410.12397.pdf

Authors: Arseniy Pertzovsky, Roni Stern, Roie Zivan, Ariel Felner

Title: Multi-Agent Corridor Generating Algorithm

Abstract:
In this paper, we propose the Multi‑Agent Corridor Generating Algorithm (MACGA) for solving the Multi‑agent Pathfinding (MAPF) problem, where a group of agents need to find non‑colliding paths to their target locations. Existing approaches struggle to solve dense MAPF instances. In MACGA, the agents build \emphcorridors, which are sequences of connected vertices, from current locations towards agents' goals, and evacuate other agents out of the corridors to avoid collisions and deadlocks. We also present the MACGA+PIBT algorithm, which integrates the well‑known rule‑based PIBT algorithm into MACGA to improve runtime and solution quality. The proposed algorithms run in polynomial time and have a reachability property, i.e., every agent is guaranteed to reach its goal location at some point. We demonstrate experimentally that MACGA and MACGA+PIBT outperform baseline algorithms in terms of success rate, runtime, and makespan across diverse MAPF benchmark grids.

PaperID: 12687, https://arxiv.org/pdf/2410.12376.pdf

Authors: Qingming Lin, Rui Hu, Huaxia Li, Sensen Wu, Yadong Li, Kai Fang, Hailin Feng, Zhenhong Du, Liuchang Xu

Title: ShapefileGPT: A Multi-Agent Large Language Model Framework for Automated Shapefile Processing

Abstract:
Vector data is one of the two core data structures in geographic information science (GIS), essential for accurately storing and representing geospatial information. Shapefile, the most widely used vector data format, has become the industry standard supported by all major geographic information systems. However, processing this data typically requires specialized GIS knowledge and skills, creating a barrier for researchers from other fields and impeding interdisciplinary research in spatial data analysis. Moreover, while large language models (LLMs) have made significant advancements in natural language processing and task automation, they still face challenges in handling the complex spatial and topological relationships inherent in GIS vector data. To address these challenges, we propose ShapefileGPT, an innovative framework powered by LLMs, specifically designed to automate Shapefile tasks. ShapefileGPT utilizes a multi‑agent architecture, in which the planner agent is responsible for task decomposition and supervision, while the worker agent executes the tasks. We developed a specialized function library for handling Shapefiles and provided comprehensive API documentation, enabling the worker agent to operate Shapefiles efficiently through function calling. For evaluation, we developed a benchmark dataset based on authoritative textbooks, encompassing tasks in categories such as geometric operations and spatial queries. ShapefileGPT achieved a task success rate of 95.24%, outperforming the GPT series models. In comparison to traditional LLMs, ShapefileGPT effectively handles complex vector data analysis tasks, overcoming the limitations of traditional LLMs in spatial analysis. This breakthrough opens new pathways for advancing automation and intelligence in the GIS field, with significant potential in interdisciplinary data analysis and application contexts.

PaperID: 12688, https://arxiv.org/pdf/2410.12375.pdf

Authors: Markus J. Buehler

Title: PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

Abstract:
PRefLexOR (Preference‑based Recursive Language Modeling for Exploratory Optimization of Reasoning) combines preference optimization with concepts from Reinforcement Learning to enable models to self‑teach through iterative reasoning improvements. We propose a recursive learning approach that engages the model in multi‑step reasoning, revisiting, and refining intermediate steps before producing a final output in training and inference phases. Through multiple training stages, the model first learns to align its reasoning with accurate decision paths by optimizing the log odds between preferred and non‑preferred responses. During this process, PRefLexOR builds a dynamic knowledge graph by generating questions from random text chunks and retrieval‑augmentation to contextualize relevant details from the entire training corpus. In the second stage, preference optimization enhances model performance by using rejection sampling to fine‑tune reasoning quality by continually producing in‑situ training data while masking the reasoning steps. Recursive optimization within a thinking token framework introduces iterative feedback loops, where the model refines reasoning, achieving deeper coherence, consistency, and adaptability. Implemented in small language models with only 3 billion parameters, we should that even tiny models can iteratively teach themselves to reason with greater depth and reflectivity. Our implementation is straightforward and can be incorporated into any existing pretrained LLM. We focus our examples on applications in biological materials science and demonstrate the method in a variety of case studies that range from in‑domain to cross‑domain applications. Using reasoning strategies that include thinking and reflection modalities we build a multi‑agent recursive self‑improving inference approach to successively improve responses via repeated sampling in inference time.

PaperID: 12689, https://arxiv.org/pdf/2410.12361.pdf

Authors: Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun

Title: Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

Abstract:
Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision‑making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data‑driven approach for this problem. Firstly, we collect real‑world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine‑tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine‑tuned model achieves an F1‑Score of 66.47% in proactively offering assistance, outperforming all open‑source and close‑source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human‑agent collaboration.

PaperID: 12690, https://arxiv.org/pdf/2410.12236.pdf

Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, Xiaoguang Niu

Title: Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay

Abstract:
Nowadays transformer‑based Large Language Models (LLM) for code generation tasks usually apply sampling and filtering pipelines. Due to the sparse reward problem in code generation tasks caused by one‑token incorrectness, transformer‑based models will sample redundant programs till they find a correct one, leading to low efficiency. To overcome the challenge, we incorporate Experience Replay (ER) in the fine‑tuning phase, where codes and programs produced are stored and will be replayed to give the LLM agent a chance to learn from past experiences. Based on the spirit of ER, we introduce a novel approach called BTP pipeline which consists of three phases: beam search sampling, testing phase, and prioritized experience replay phase. The approach makes use of failed programs collected by code models and replays programs with high Possibility and Pass‑rate Prioritized value (P2Value) from the replay buffer to improve efficiency. P2Value comprehensively considers the possibility of transformers' output and pass rate and can make use of the redundant resources caused by the problem that most programs collected by LLMs fail to pass any tests. We empirically apply our approach in several LLMs, demonstrating that it enhances their performance in code generation tasks and surpasses existing baselines.

PaperID: 12691, https://arxiv.org/pdf/2410.12232.pdf

Authors: Wen Zheng Terence Ng, Jianda Chen, Sinno Jialin Pan, Tianwei Zhang

Title: Improving the Generalization of Unseen Crowd Behaviors for Reinforcement Learning based Local Motion Planners

Abstract:
Deploying a safe mobile robot policy in scenarios with human pedestrians is challenging due to their unpredictable movements. Current Reinforcement Learning‑based motion planners rely on a single policy to simulate pedestrian movements and could suffer from the over‑fitting issue. Alternatively, framing the collision avoidance problem as a multi‑agent framework, where agents generate dynamic movements while learning to reach their goals, can lead to conflicts with human pedestrians due to their homogeneity. To tackle this problem, we introduce an efficient method that enhances agent diversity within a single policy by maximizing an information‑theoretic objective. This diversity enriches each agent's experiences, improving its adaptability to unseen crowd behaviors. In assessing an agent's robustness against unseen crowds, we propose diverse scenarios inspired by pedestrian crowd behaviors. Our behavior‑conditioned policies outperform existing works in these challenging scenes, reducing potential collisions without additional time or travel.

PaperID: 12692, https://arxiv.org/pdf/2410.12081.pdf

Authors: Louis Gonzalez, Hogyeong Gwak, Bumsoo Han, Andrew Mugler

Title: Quantifying cellular autonomy in multi-cue environments

Abstract:
A cell routinely responds to one of many competing environmental cues. A fundamental question is whether the cell follows the cue prioritized by its internal signaling network or the cue that carries the most external information. We introduce a theoretical framework to answer this question. We derive information limits for four types of directional cues: external and self‑generated chemical gradients, fluid flow, and contact inhibition of locomotion. When the cues compete as pairs, these limits predict which cue a cell should follow if its decision is based on environmental information alone. We compare these predicted decision boundaries with data from our and others' cell migration experiments, finding cases where the boundary is obeyed and cases where it is violated by orders of magnitude. Both outcomes are informative, and we find that they rationalize known properties, or predict putative properties, of cells' internal signaling networks. Our work introduces a physical framework to quantify the degree to which a cell acts like an autonomous agent, rather than a passive detector, favoring a cue even when it is less informative.

PaperID: 12693, https://arxiv.org/pdf/2410.12062.pdf

Authors: Qiushi Lin, Hang Ma

Title: MFC-EQ: Mean-Field Control with Envelope Q-Learning for Moving Decentralized Agents in Formation

Abstract:
We study a decentralized version of Moving Agents in Formation (MAiF), a variant of Multi‑Agent Path Finding aiming to plan collision‑free paths for multiple agents with the dual objectives of reaching their goals quickly while maintaining a desired formation. The agents must balance these objectives under conditions of partial observation and limited communication. The formation maintenance depends on the joint state of all agents, whose dimensionality increases exponentially with the number of agents, rendering the learning process intractable. Additionally, learning a single policy that can accommodate different linear preferences for these two objectives presents a significant challenge. In this paper, we propose Mean‑Field Control with Envelop Q‑learning (MFC‑EQ), a scalable and adaptable learning framework for this bi‑objective multi‑agent problem. We approximate the dynamics of all agents using mean‑field theory while learning a universal preference‑agnostic policy through envelop Q‑learning. Our empirical evaluation of MFC‑EQ across numerous instances shows that it outperforms state‑of‑the‑art centralized MAiF baselines. Furthermore, MFC‑EQ effectively handles more complex scenarios where the desired formation changes dynamically ‑‑ a challenge that existing MAiF planners cannot address.

PaperID: 12694, https://arxiv.org/pdf/2410.11906.pdf

Authors: Bolun Sun, Yifan Zhou, Haiyun Jiang

Title: Empowering Users in Digital Privacy Management through Interactive LLM-Based Agents

Abstract:
This paper presents a novel application of large language models (LLMs) to enhance user comprehension of privacy policies through an interactive dialogue agent. We demonstrate that LLMs significantly outperform traditional models in tasks like Data Practice Identification, Choice Identification, Policy Summarization, and Privacy Question Answering, setting new benchmarks in privacy policy analysis. Building on these findings, we introduce an innovative LLM‑based agent that functions as an expert system for processing website privacy policies, guiding users through complex legal language without requiring them to pose specific questions. A user study with 100 participants showed that users assisted by the agent had higher comprehension levels (mean score of 2.6 out of 3 vs. 1.8 in the control group), reduced cognitive load (task difficulty ratings of 3.2 out of 10 vs. 7.8), increased confidence in managing privacy, and completed tasks in less time (5.5 minutes vs. 15.8 minutes). This work highlights the potential of LLM‑based agents to transform user interaction with privacy policies, leading to more informed consent and empowering users in the digital services landscape.

PaperID: 12695, https://arxiv.org/pdf/2410.11782.pdf

Authors: Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, Dawei Cheng

Title: G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks

Abstract:
Recent advancements in large language model (LLM)‑based agents have demonstrated that collective intelligence can significantly surpass the capabilities of individual agents, primarily due to well‑crafted inter‑agent communication topologies. Despite the diverse and high‑performing designs available, practitioners often face confusion when selecting the most effective pipeline for their specific task: Which topology is the best choice for my task, avoiding unnecessary communication token overhead while ensuring high‑quality solution? In response to this dilemma, we introduce G‑Designer, an adaptive, efficient, and robust solution for multi‑agent deployment, which dynamically designs task‑aware, customized communication topologies. Specifically, G‑Designer models the multi‑agent system as a multi‑agent network, leveraging a variational graph auto‑encoder to encode both the nodes (agents) and a task‑specific virtual node, and decodes a task‑adaptive and high‑performing communication topology. Extensive experiments on six benchmarks showcase that G‑Designer is: (1) high‑performing, achieving superior results on MMLU with accuracy at 84.50% and on HumanEval with pass@1 at 89.90%; (2) task‑adaptive, architecting communication protocols tailored to task difficulty, reducing token consumption by up to 95.33% on HumanEval; and (3) adversarially robust, defending against agent adversarial attacks with merely 0.3% accuracy drop.

PaperID: 12696, https://arxiv.org/pdf/2410.11531.pdf

Authors: Xinjie Zhao, Moritz Blum, Rui Yang, Boming Yang, Luis Márquez Carpintero, Mónica Pina-Navarro, Tony Wang, Xin Li, Huitao Li, Yanran Fu, Rongrong Wang, Juntao Zhang, Irene Li

Title: AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data

Abstract:
Large Language Models~(LLMs) have demonstrated capabilities across various applications but face challenges such as hallucination, limited reasoning abilities, and factual inconsistencies, especially when tackling complex, domain‑specific tasks like question answering~(QA). While Knowledge Graphs~(KGs) have been shown to help mitigate these issues, research on the integration of LLMs with background KGs remains limited. In particular, user accessibility and the flexibility of the underlying KG have not been thoroughly explored. We introduce AGENTiGraph (Adaptive Generative ENgine for Task‑based Interaction and Graphical Representation), a platform for knowledge management through natural language interaction. It integrates knowledge extraction, integration, and real‑time visualization. AGENTiGraph employs a multi‑agent architecture to dynamically interpret user intents, manage tasks, and integrate new knowledge, ensuring adaptability to evolving user requirements and data contexts. Our approach demonstrates superior performance in knowledge graph interactions, particularly for complex domain‑specific tasks. Experimental results on a dataset of 3,500 test cases show AGENTiGraph significantly outperforms state‑of‑the‑art zero‑shot baselines, achieving 95.12% accuracy in task classification and 90.45% success rate in task execution. User studies corroborate its effectiveness in real‑world scenarios. To showcase versatility, we extended AGENTiGraph to legislation and healthcare domains, constructing specialized KGs capable of answering complex queries in legal and medical contexts.

PaperID: 12697, https://arxiv.org/pdf/2410.11416.pdf

Authors: Manon Prédhumeau, Ed Manley

Title: Agent-Based Modelling of Older Adult Needs for Autonomous Mobility-on-Demand: A Case Study in Winnipeg, Canada

Abstract:
As the populations continue to age across many nations, ensuring accessible and efficient transportation options for older adults has become an increasingly important concern. Autonomous Mobility‑on‑Demand (AMoD) systems have emerged as a potential solution to address the needs faced by older adults in their daily mobility. However, estimation of older adult mobility needs, and how they vary over space and time, is crucial for effective planning and implementation of such service, and conventional four‑step approaches lack the granularity to fully account for these needs. To address this challenge, we propose an agent‑based model of older adults mobility demand in Winnipeg, Canada. The model is built for 2022 using primarily open data, and is implemented in the Multi‑Agent Transport Simulation (MATSim) toolkit. After calibration to accurately reproduce observed travel behaviors, a new AMoD service is tested in simulation and its potential adoption among Winnipeg older adults is explored. The model can help policy makers to estimate the needs of the elderly populations for door‑to‑door transportation and can guide the design of AMoD transport systems.

PaperID: 12698, https://arxiv.org/pdf/2410.11296.pdf

Authors: Jiayi Li, Matt Motoki, Baosen Zhang

Title: Strategic and Fair Aggregator Interactions in Energy Markets: Mutli-agent Dynamics and Quasiconcave Games

Abstract:
The introduction of aggregator structures has proven effective in bringing fairness to energy resource allocation by negotiating for more resources and economic surplus on behalf of users. This paper extends the fair energy resource allocation problem to a multi‑agent setting, focusing on interactions among multiple aggregators in an electricity market. We prove that the strategic optimization problems faced by the aggregators form a quasiconcave game, ensuring the existence of a Nash equilibrium. This resolves complexities related to market price dependencies on total purchases and balancing fairness and efficiency in energy allocation. In addition, we design simulations to characterize the equilibrium points of the induced game, demonstrating how aggregators stabilize market outcomes, ensure fair resource distribution, and optimize user surplus. Our findings offer a robust framework for understanding strategic interactions among aggregators, contributing to more efficient and equitable energy markets.

PaperID: 12699, https://arxiv.org/pdf/2410.11287.pdf

Authors: Wendi Li, Yixuan Li

Title: Process Reward Model with Q-Value Rankings

Abstract:
Process Reward Modeling (PRM) is critical for complex reasoning and decision‑making tasks where the accuracy of intermediate steps significantly influences the overall outcome. Existing PRM approaches, primarily framed as classification problems, employ cross‑entropy loss to independently evaluate each step's correctness. This method can lead to suboptimal reward distribution and does not adequately address the interdependencies among steps. To address these limitations, we introduce the Process Q‑value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process. PQM optimizes Q‑value rankings based on a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions. This approach provides a more granular and theoretically grounded methodology for process rewards. Our extensive empirical evaluations across various sampling policies, language model backbones, and multi‑step reasoning benchmarks show that PQM outperforms classification‑based PRMs. The effectiveness of the comparative loss function is highlighted in our comprehensive ablation studies, confirming PQM's practical efficacy and theoretical advantage.

PaperID: 12700, https://arxiv.org/pdf/2410.11282.pdf

Authors: Yimian Ding, Xinqi Wang, Jingzehua Xu, Guanwen Xie, Weiyi Liu, Yi Li

Title: Multi-Objective-Optimization Assisted Data Collection Framework for IoUT Based on Offline Reinforcement

Abstract:
The Information Updating Networks (IUNs) offers significant potential for ocean exploration but encounters challenges due to dynamic underwater environments and severe system attenuation. Current methods relying on Autonomous Underwater Vehicles (AUVs) based on online reinforcement learning (RL) lead to high computational costs and low data utilization. To address these issues and the constraints of turbulent ocean environments, we propose a multi‑AUV assisted data collection framework for IUNs based on multi‑agent offline RL. This framework maximizes data rate and the value of information (VoI), minimizes energy consumption, and ensures collision avoidance by utilizing environmental and equipment status data. We introduce a semi‑communication decentralized training with decentralized execution (SC‑DTDE) paradigm and a multi‑agent independent conservative Q‑learning algorithm (MAICQL) to effectively tackle the problem. Extensive simulations demonstrate the high applicability, robustness, and data collection efficiency of the proposed framework.

PaperID: 12701, https://arxiv.org/pdf/2410.11239.pdf

Authors: Weijie Xu, Jay Desai, Fanyou Wu, Josef Valvoda, Srinivasan H. Sengamedu

Title: HR-Agent: A Task-Oriented Dialogue (TOD) LLM Agent Tailored for HR Applications

Abstract:
Recent LLM (Large Language Models) advancements benefit many fields such as education and finance, but HR has hundreds of repetitive processes, such as access requests, medical claim filing and time‑off submissions, which are unaddressed. We relate these tasks to the LLM agent, which has addressed tasks such as writing assisting and customer support. We present HR‑Agent, an efficient, confidential, and HR‑specific LLM‑based task‑oriented dialogue system tailored for automating repetitive HR processes such as medical claims and access requests. Since conversation data is not sent to an LLM during inference, it preserves confidentiality required in HR‑related tasks.

PaperID: 12702, https://arxiv.org/pdf/2410.11237.pdf

Authors: Lucas Page

Title: Biologically Inspired Swarm Dynamic Target Tracking and Obstacle Avoidance

Abstract:
This study proposes a novel artificial intelligence (AI) driven flight computer, integrating an online free‑retraining‑prediction model, a swarm control, and an obstacle avoidance strategy, to track dynamic targets using a distributed drone swarm for military applications. To enable dynamic target tracking the swarm requires a trajectory prediction capability to achieve intercept allowing for the tracking of rapid maneuvers and movements while maintaining efficient path planning. Traditional predicative methods such as curve fitting or Long ShortTerm Memory (LSTM) have low robustness and struggle with dynamic target tracking in the short term due to slow convergence of single agent‑based trajectory prediction and often require extensive offline training or tuning to be effective. Consequently, this paper introduces a novel robust adaptive bidirectional fuzzy brain emotional learning prediction (BFBEL‑P) methodology to address these challenges. The controller integrates a fuzzy interface, a neural network enabling rapid adaption, predictive capability and multi‑agent solving enabling multiple solutions to be aggregated to achieve rapid convergence times and high accuracy in both the short and long term. This was verified through the use of numerical simulations seeing complex trajectory being predicted and tracked by a swarm of drones. These simulations show improved adaptability and accuracy to state of the art methods in the short term and strong results over long time domains, enabling accurate swarm target tracking and predictive capability.

PaperID: 12703, https://arxiv.org/pdf/2410.10858.pdf

Authors: Yew Ken Chia, Guizhen Chen, Weiwen Xu, Luu Anh Tuan, Soujanya Poria, Lidong Bing

Title: Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths

Abstract:
Advanced models such as OpenAI o1 exhibit impressive problem‑solving capabilities through step‑by‑step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem‑solving performance. Reasoning Paths Optimization does not rely on large‑scale human‑annotated rationales or outputs from closed‑source models, making it scalable and data‑efficient. We focus on multi‑step reasoning tasks, such as math word problems and science‑based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at https://reasoning‑paths.github.io.

PaperID: 12704, https://arxiv.org/pdf/2410.10831.pdf

Authors: Xiaonan Xu, Haoshuo Chen, Jesse E. Simsarian, Roland Ryf, Nicolas K. Fontaine, Mikael Mazur, Lauren Dallachiesa, David T. Neilson

Title: Large Language Model-Driven Cross-Domain Orchestration Using Multi-Agent Workflow

Abstract:
We showcase an application that leverages multiple agents, powered by large language models and integrated tools, to collaboratively solve complex network operation tasks across various domains. The tasks include real‑time topology retrieval, network optimization using physical models, and fiber switching facilitated by a robotic arm.

PaperID: 12705, https://arxiv.org/pdf/2410.10738.pdf

Authors: Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, Zhaoxiang Zhang

Title: DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model

Abstract:
Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor‑made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi‑agent interplay, and rich open‑world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action‑controlled future predictions.

PaperID: 12706, https://arxiv.org/pdf/2410.10610.pdf

Authors: John Mern, Anthony Corso, Damian Burch, Kurt House, Jef Caers

Title: Intelligent prospector v2.0: exploration drill planning under epistemic model uncertainty

Abstract:
Optimal Bayesian decision making on what geoscientific data to acquire requires stating a prior model of uncertainty. Data acquisition is then optimized by reducing uncertainty on some property of interest maximally, and on average. In the context of exploration, very few, sometimes no data at all, is available prior to data acquisition planning. The prior model therefore needs to include human interpretations on the nature of spatial variability, or on analogue data deemed relevant for the area being explored. In mineral exploration, for example, humans may rely on conceptual models on the genesis of the mineralization to define multiple hypotheses, each representing a specific spatial variability of mineralization. More often than not, after the data is acquired, all of the stated hypotheses may be proven incorrect, i.e. falsified, hence prior hypotheses need to be revised, or additional hypotheses generated. Planning data acquisition under wrong geological priors is likely to be inefficient since the estimated uncertainty on the target property is incorrect, hence uncertainty may not be reduced at all. In this paper, we develop an intelligent agent based on partially observable Markov decision processes that plans optimally in the case of multiple geological or geoscientific hypotheses on the nature of spatial variability. Additionally, the artificial intelligence is equipped with a method that allows detecting, early on, whether the human stated hypotheses are incorrect, thereby saving considerable expense in data acquisition. Our approach is tested on a sediment‑hosted copper deposit, and the algorithm presented has aided in the characterization of an ultra high‑grade deposit in Zambia in 2023.

PaperID: 12707, https://arxiv.org/pdf/2410.10596.pdf

Authors: Kazuki Irie, Brenden M. Lake

Title: Overcoming classic challenges for artificial neural networks by providing incentives and practice

Abstract:
Since the earliest proposals for artificial neural network (ANN) models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that uses metalearning to overcome several classic challenges, which we characterize as addressing the Problem of Incentive and Practice ‑‑ that is, providing machines with both incentives to improve specific skills and opportunities to practice those skills. This explicit optimization contrasts with more conventional approaches that hope the desired behaviour will emerge through optimizing related but different objectives. We review applications of this principle to addressing four classic challenges for ANNs: systematic generalization, catastrophic forgetting, few‑shot learning and multi‑step reasoning. We also discuss how large language models incorporate key aspects of this metalearning framework (namely, sequence prediction with feedback trained on diverse data), which helps to explain some of their successes on these classic challenges. Finally, we discuss the prospects for understanding aspects of human development through this framework, and whether natural environments provide the right incentives and practice for learning how to make challenging generalizations.

PaperID: 12708, https://arxiv.org/pdf/2410.10486.pdf

Authors: Mohamed Bentaibi, Laura Caravenna, Jean-Paul A. Gauthier, Francesco Rossi

Title: Consensus in Multiagent Systems under communication failure

Abstract:
We consider multi‑agent systems with cooperative interactions and study the convergence to consensus in the case of time‑dependent connections, with possible communication failure. We prove a new condition ensuring consensus: we define a graph in which directed arrows correspond to connection functions that converge (in the weak sense) to some function with a positive integral on all intervals of the form [t,+\infty). If the graph has a node reachable from all other indices, i.e.~``globally reachable'', then the system converges to consensus. We show that this requirement generalizes some known sufficient conditions for convergence, such as Moreau's or the Persistent Excitation one. We also give a second new condition, transversal to the known ones: total connectedness of the undirected graph formed by the non‑vanishing of limiting functions.

PaperID: 12709, https://arxiv.org/pdf/2410.10460.pdf

Authors: Asger Horn Brorholt, Kim Guldstrand Larsen, Christian Schilling

Title: Compositional Shielding and Reinforcement Learning for Multi-Agent Systems

Abstract:
Deep reinforcement learning has emerged as a powerful tool for obtaining high‑performance policies. However, the safety of these policies has been a long‑standing issue. One promising paradigm to guarantee safety is a shield, which shields a policy from making unsafe actions. However, computing a shield scales exponentially in the number of state variables. This is a particular concern in multi‑agent systems with many agents. In this work, we propose a novel approach for multi‑agent shielding. We address scalability by computing individual shields for each agent. The challenge is that typical safety specifications are global properties, but the shields of individual agents only ensure local properties. Our key to overcome this challenge is to apply assume‑guarantee reasoning. Specifically, we present a sound proof rule that decomposes a (global, complex) safety specification into (local, simple) obligations for the shields of the individual agents. Moreover, we show that applying the shields during reinforcement learning significantly improves the quality of the policies obtained for a given training budget. We demonstrate the effectiveness and scalability of our multi‑agent shielding framework in two case studies, reducing the computation time from hours to seconds and achieving fast learning convergence.

PaperID: 12710, https://arxiv.org/pdf/2410.10394.pdf

Authors: Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang

Title: PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Abstract:
Language‑guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work trivially fitting the data without revealing the relation between instruction and low‑level executable actions, these models are prone to memorizing the surficial pattern of the data instead of acquiring the transferable knowledge, and thus are fragile to dynamic environment changes. To address this issue, we propose a PrIrmitive‑driVen waypOinT‑aware world model for Robotic manipulation (PIVOT‑R) that focuses solely on the prediction of task‑relevant waypoints. Specifically, PIVOT‑R consists of a Waypoint‑aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive‑driven waypoint prediction, while the latter focuses on decoding low‑level actions. Additionally, we also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby helping the model reduce computational redundancy and improve model execution efficiency. Our PIVOT‑R outperforms state‑of‑the‑art (SoTA) open‑source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT‑R, the execution efficiency of PIVOT‑R with AHE is increased by 28‑fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT‑R can significantly improve both the performance and efficiency of robotic manipulation.

PaperID: 12711, https://arxiv.org/pdf/2410.10071.pdf

Authors: Jinjin Shen, Yan Lin, Yijin Zhang, Weibin Zhang, Feng Shu, Jun Li

Title: Content Caching-Assisted Vehicular Edge Computing Using Multi-Agent Graph Attention Reinforcement Learning

Abstract:
In order to avoid repeated task offloading and realize the reuse of popular task computing results, we construct a novel content caching‑assisted vehicular edge computing (VEC) framework. In the face of irregular network topology and unknown environmental dynamics, we further propose a multi‑agent graph attention reinforcement learning (MGARL) based edge caching scheme, which utilizes the graph attention convolution kernel to integrate the neighboring nodes' features of each agent and further enhance the cooperation among agents. Our simulation results show that our proposed scheme is capable of improving the utilization of caching resources while reducing the long‑term task computing latency compared to the baselines.

PaperID: 12712, https://arxiv.org/pdf/2410.10062.pdf

Authors: Jonathan DeCastro, Andrew Silva, Deepak Gopinath, Emily Sumner, Thomas M. Balch, Laporsha Dees, Guy Rosman

Title: Dreaming to Assist: Learning to Align with Human Objectives for Shared Control in High-Speed Racing

Abstract:
Tight coordination is required for effective human‑robot teams in domains involving fast dynamics and tactical decisions, such as multi‑car racing. In such settings, robot teammates must react to cues of a human teammate's tactical objective to assist in a way that is consistent with the objective (e.g., navigating left or right around an obstacle). To address this challenge, we present Dream2Assist, a framework that combines a rich world model able to infer human objectives and value functions, and an assistive agent that provides appropriate expert assistance to a given human teammate. Our approach builds on a recurrent state space model to explicitly infer human intents, enabling the assistive agent to select actions that align with the human and enabling a fluid teaming interaction. We demonstrate our approach in a high‑speed racing domain with a population of synthetic human drivers pursuing mutually exclusive objectives, such as "stay‑behind" and "overtake". We show that the combined human‑robot team, when blending its actions with those of the human, outperforms the synthetic humans alone as well as several baseline assistance strategies, and that intent‑conditioning enables adherence to human preferences during task execution, leading to improved performance while satisfying the human's objective.

PaperID: 12713, https://arxiv.org/pdf/2410.10020.pdf

Authors: Abhishek Dutta, Yen-Che Hsiao

Title: Adaptive Reasoning and Acting in Medical Language Agents

Abstract:
This paper presents an innovative large language model (LLM) agent framework for enhancing diagnostic accuracy in simulated clinical environments using the AgentClinic benchmark. The proposed automatic correction enables doctor agents to iteratively refine their reasoning and actions following incorrect diagnoses, fostering improved decision‑making over time. Experiments show that the implementation of the adaptive LLM‑based doctor agents achieve correct diagnoses through dynamic interactions with simulated patients. The evaluations highlight the capacity of autonomous agents to adapt and improve in complex medical scenarios. Future enhancements will focus on refining the algorithm and expanding its applicability across a wider range of tasks and different large language models.

PaperID: 12714, https://arxiv.org/pdf/2410.09982.pdf

Authors: Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie

Title: Self-Data Distillation for Recovering Quality in Pruned Large Language Models

Abstract:
Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one‑shot pruning often results in significant quality degradation, particularly in tasks requiring multi‑step reasoning. To recover lost quality, supervised fine‑tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model's learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model's quality. In this work, we utilize self‑data distilled fine‑tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model's knowledge. Empirically, we demonstrate that self‑data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning six decoder blocks on Llama3.1‑8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model's accuracy compared to 81.7% with SFT, while reducing real‑world FLOPs by 16.3%. Furthermore, combining self‑data distilled models through model merging yields enhanced quality retention. Additionally, leveraging these pruned models in speculative decoding increases token acceptance rates, thereby improving inference efficiency in applied settings.

PaperID: 12715, https://arxiv.org/pdf/2410.09881.pdf

Authors: Yajie Zhang, Ce Yu, Chao Sun, Yi Hu, Zhaohui Shang, Jizeng Wei, Xu Yang

Title: GRRIS: a real-time intra-site observation scheduling scheme for distributed survey telescope arrays

Abstract:
The distributed telescope array offers promise for conducting large‑sky‑area, high‑frequency time domain surveys. Multiple telescopes can be deployed at each observation site, so intra‑site observation task scheduling is crucial for enhancing observation efficiency and quality. Efficient use of observable time and rapid response to special situations are critical to maximize scientific discovery in time domain surveys. Besides, the competing scientific priorities, time‑varying observation conditions, and capabilities of observation equipment, lead to a vast search space of the scheduling. So with the increasing number of telescopes and observation fields, balancing computational time with solution quality in observation scheduling poses a significant challenge. Informed by the seminal contributions of earlier studies on a multilevel scheduling model and global scheduler for time domain telescope array, this study is devoted to further exploring the site scheduler. Formulating the observation scheduling of multiple telescopes at the site as a cooperative decision‑making problem, this paper proposes GRRIS, a real‑time intra‑site observation scheduling scheme for telescope array using graph and reinforcement learning. It employs a graph neural network to learn node features that can embed the spatial structure of the observation scheduling. An algorithm based on multi‑agent reinforcement learning is designed to efficiently learn the optimum allocation policy of telescope agents to field nodes. Through numerical simulations with real‑world scenarios, GRRIS can achieve up to a 22% solution improvement over the most competitive scheme. It offers better scalability and sub‑second decision speed, meeting the needs of observation scheduling control for future distributed telescope arrays.

PaperID: 12716, https://arxiv.org/pdf/2410.09701.pdf

Authors: Chengshuai Shi, Kun Yang, Jing Yang, Cong Shen

Title: Transformers as Game Players: Provable In-context Game-playing Capabilities of Pre-trained Models

Abstract:
The in‑context learning (ICL) capability of pre‑trained models based on the transformer architecture has received growing interest in recent years. While theoretical understanding has been obtained for ICL in reinforcement learning (RL), the previous results are largely confined to the single‑agent setting. This work proposes to further explore the in‑context learning capabilities of pre‑trained transformer models in competitive multi‑agent games, i.e., in‑context game‑playing (ICGP). Focusing on the classical two‑player zero‑sum games, theoretical guarantees are provided to demonstrate that pre‑trained transformers can provably learn to approximate Nash equilibrium in an in‑context manner for both decentralized and centralized learning settings. As a key part of the proof, constructional results are established to demonstrate that the transformer architecture is sufficiently rich to realize celebrated multi‑agent game‑playing algorithms, in particular, decentralized V‑learning and centralized VI‑ULCB.

PaperID: 12717, https://arxiv.org/pdf/2410.09604.pdf

Authors: Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Jun Zhang, Zhiheng Zheng, Fanhang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, Xinlei Chen, Yong Li

Title: EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment

Abstract:
Embodied artificial intelligence emphasizes the role of an agent's body in generating human‑like behaviors. The recent efforts on EmbodiedAI pay a lot of attention to building up machine learning models to possess perceiving, planning, and acting abilities, thereby enabling real‑time interaction with the world. However, most works focus on bounded indoor environments, such as navigation in a room or manipulating a device, with limited exploration of embodying the agents in open‑world scenarios. That is, embodied intelligence in the open and outdoor environment is less explored, for which one potential reason is the lack of high‑quality simulators, benchmarks, and datasets. To address it, in this paper, we construct a benchmark platform for embodied intelligence evaluation in real‑world city environments. Specifically, we first construct a highly realistic 3D simulation environment based on the real buildings, roads, and other elements in a real city. In this environment, we combine historically collected data and simulation algorithms to conduct simulations of pedestrian and vehicle flows with high fidelity. Further, we designed a set of evaluation tasks covering different EmbodiedAI abilities. Moreover, we provide a complete set of input and output interfaces for access, enabling embodied agents to easily take task requirements and current environmental observations as input and then make decisions and obtain performance evaluations. On the one hand, it expands the capability of existing embodied intelligence to higher levels. On the other hand, it has a higher practical value in the real world and can support more potential applications for artificial general intelligence. Based on this platform, we evaluate some popular large language models for embodied intelligence capabilities of different dimensions and difficulties.

PaperID: 12718, https://arxiv.org/pdf/2410.09528.pdf

Authors: Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan

Title: Boosting Deductive Reasoning with Step Signals In RLHF

Abstract:
Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi‑step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi‑step Deduction (MuseD), for deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi‑step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in‑domain of out‑of‑domain reasoning tasks. Additionally, we have conducted tests to assess the multi‑step reasoning abilities of various models.

PaperID: 12719, https://arxiv.org/pdf/2410.09414.pdf

Authors: Sinan Wang, Zhiyuan Zhong, Shaojin Wen, Yepang Liu

Title: Multi-agent Assisted Automatic Test Generation for Java JSON Libraries

Abstract:
JSON is a widely used format for data exchange between applications. In the Java ecosystem, JSON libraries serve as fundamental toolkits for processing JSON data, powering real‑world applications such as web services, Android apps, or data management systems. However, without effective quality assurance methods such as automatic test generation (ATG), developers risk introducing subtle data inconsistency bugs, compatibility issues, and even security vulnerabilities. These flaws can affect billions of end users and potentially cause severe financial losses. Recently, large language models (LLMs) have shown strong potential in enhancing ATG. However, existing LLM‑based methods like TitanFuzz and YanHui lack specialization in the JSON domain. For Java JSON libraries (JJLs), effective test cases should capture the constraints between formatted data and application programs, leaving critical aspects of quality assurance unaddressed. To fill this gap, we propose JsonATG, a multi‑agent ATG system that generates diverse bug‑triggering tests for JJLs. With historical bug information as seeds, JsonATG introduces a code summarization agent and a test validation agent into the generation pipeline to produce new and valid test cases. It applies agent‑generated program mutation rules tailored specifically for the structural and semantic characteristics of JJLs, such as data streaming operations, serialization formats, and data‑binding patterns. The generated tests are further refined through post‑processing to ensure syntactic and semantic correctness. Our experiments show that JsonATG achieves higher coverage than two state‑of‑the‑art LLM‑based test generation methods on the widely used JSON library. With a \25 budget, we reported 59 bugs (including non‑crashing functional bugs) in fastjson, of which 47 were confirmed and 28 have already been fixed.

PaperID: 12720, https://arxiv.org/pdf/2410.09407.pdf

Authors: Yicheng Fu, Raviteja Anantha, Jianpeng Cheng

Title: CAMPHOR: Collaborative Agents for Multi-input Planning and High-Order Reasoning On Device

Abstract:
While server‑side Large Language Models (LLMs) demonstrate proficiency in function calling and complex reasoning, deploying Small Language Models (SLMs) directly on devices brings opportunities to improve latency and privacy but also introduces unique challenges for accuracy and memory. We introduce CAMPHOR, an innovative on‑device SLM multi‑agent framework designed to handle multiple user inputs and reason over personal context locally, ensuring privacy is maintained. CAMPHOR employs a hierarchical architecture where a high‑order reasoning agent decomposes complex tasks and coordinates expert agents responsible for personal context retrieval, tool interaction, and dynamic plan generation. By implementing parameter sharing across agents and leveraging prompt compression, we significantly reduce model size, latency, and memory usage. To validate our approach, we present a novel dataset capturing multi‑agent task trajectories centered on personalized mobile assistant use‑cases. Our experiments reveal that fine‑tuned SLM agents not only surpass closed‑source LLMs in task completion F1 by~35% but also eliminate the need for server‑device communication, all while enhancing privacy.

PaperID: 12721, https://arxiv.org/pdf/2410.09381.pdf

Authors: Zhiyuan Wei, Jing Sun, Zijiang Zhang, Xianhao Zhang, Meng Li, Zhe Hou

Title: LLM-SmartAudit: Advanced Smart Contract Vulnerability Detection

Abstract:
The immutable nature of blockchain technology, while revolutionary, introduces significant security challenges, particularly in smart contracts. These security issues can lead to substantial financial losses. Current tools and approaches often focus on specific types of vulnerabilities. However, a comprehensive tool capable of detecting a wide range of vulnerabilities with high accuracy is lacking. This paper introduces LLM‑SmartAudit, a novel framework leveraging the advanced capabilities of Large Language Models (LLMs) to detect and analyze vulnerabilities in smart contracts. Using a multi‑agent conversational approach, LLM‑SmartAudit employs a collaborative system with specialized agents to enhance the audit process. To evaluate the effectiveness of LLM‑SmartAudit, we compiled two distinct datasets: a labeled dataset for benchmarking against traditional tools and a real‑world dataset for assessing practical applications. Experimental results indicate that our solution outperforms all traditional smart contract auditing tools, offering higher accuracy and greater efficiency. Furthermore, our framework can detect complex logic vulnerabilities that traditional tools have previously overlooked. Our findings demonstrate that leveraging LLM agents provides a highly effective method for automated smart contract auditing.

PaperID: 12722, https://arxiv.org/pdf/2410.09302.pdf

Authors: Kaixuan Ji, Guanlin Liu, Ning Dai, Qingping Yang, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, Lin Yan

Title: Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Abstract:
Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi‑step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q‑function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor‑critic (SAC) framework to optimize a Q‑function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit‑based methods, enabling more effective process supervision. Experimental results on two math problem‑solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.

PaperID: 12723, https://arxiv.org/pdf/2410.09134.pdf

Authors: Mingjun Wang, Remington Dechene

Title: Multi-Agent Actor-Critics in Autonomous Cyber Defense

Abstract:
The need for autonomous and adaptive defense mechanisms has become paramount in the rapidly evolving landscape of cyber threats. Multi‑Agent Deep Reinforcement Learning (MADRL) presents a promising approach to enhancing the efficacy and resilience of autonomous cyber operations. This paper explores the application of Multi‑Agent Actor‑Critic algorithms which provides a general form in Multi‑Agent learning to cyber defense, leveraging the collaborative interactions among multiple agents to detect, mitigate, and respond to cyber threats. We demonstrate each agent is able to learn quickly and counter act on the threats autonomously using MADRL in simulated cyber‑attack scenarios. The results indicate that MADRL can significantly enhance the capability of autonomous cyber defense systems, paving the way for more intelligent cybersecurity strategies. This study contributes to the growing body of knowledge on leveraging artificial intelligence for cybersecurity and sheds light for future research and development in autonomous cyber operations.

PaperID: 12724, https://arxiv.org/pdf/2410.09114.pdf

Authors: Andrey Anurin, Jonathan Ng, Kibo Schaffer, Jason Schreiber, Esben Kran

Title: Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Abstract:
LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real‑world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT‑4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open‑source models exhibit limited offensive capabilities. Our software solution and the corresponding benchmark provides a critical tool to reduce the gap between rapidly improving capabilities and robustness of cyber offense evaluations, aiding in the safer deployment and regulation of these powerful technologies.

PaperID: 12725, https://arxiv.org/pdf/2410.09090.pdf

Authors: Haowen Xu, Xueping Li, Jose Tupayachi, Jianming, Lian, Femi Omitaomu

Title: Automating Bibliometric Analysis with Sentence Transformers and Retrieval-Augmented Generation (RAG): A Pilot Study in Semantic and Contextual Search for Customized Literature Characterization for High-Impact Urban Research

Abstract:
Bibliometric analysis is essential for understanding research trends, scope, and impact in urban science, especially in high‑impact journals, such Nature Portfolios. However, traditional methods, relying on keyword searches and basic NLP techniques, often fail to uncover valuable insights not explicitly stated in article titles or keywords. These approaches are unable to perform semantic searches and contextual understanding, limiting their effectiveness in classifying topics and characterizing studies. In this paper, we address these limitations by leveraging Generative AI models, specifically transformers and Retrieval‑Augmented Generation (RAG), to automate and enhance bibliometric analysis. We developed a technical workflow that integrates a vector database, Sentence Transformers, a Gaussian Mixture Model (GMM), Retrieval Agent, and Large Language Models (LLMs) to enable contextual search, topic ranking, and characterization of research using customized prompt templates. A pilot study analyzing 223 urban science‑related articles published in Nature Communications over the past decade highlights the effectiveness of our approach in generating insightful summary statistics on the quality, scope, and characteristics of papers in high‑impact journals. This study introduces a new paradigm for enhancing bibliometric analysis and knowledge retrieval in urban research, positioning an AI agent as a powerful tool for advancing research evaluation and understanding.

PaperID: 12726, https://arxiv.org/pdf/2410.09037.pdf

Authors: Hojae Lee, Junho Kim, SangKeun Lee

Title: Mentor-KD: Making Small Language Models Better Multi-step Reasoners

Abstract:
Large Language Models (LLMs) have displayed remarkable performances across various complex tasks by leveraging Chain‑of‑Thought (CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers such reasoning ability of LLMs through fine‑tuning language models of multi‑step rationales generated by LLM teachers. However, they have inadequately considered two challenges regarding insufficient distillation sets from the LLM teacher model, in terms of 1) data quality and 2) soft label provision. In this paper, we propose Mentor‑KD, which effectively distills the multi‑step reasoning capability of LLMs to smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, intermediate‑sized task‑specific fine‑tuned model, to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor‑KD's effectiveness across various models and complex reasoning tasks.

PaperID: 12727, https://arxiv.org/pdf/2410.09034.pdf

Authors: Xiangyu Yin, Chuqiao Shi, Yimo Han, Yi Jiang

Title: PEAR: A Robust and Flexible Automation Framework for Ptychography Enabled by Multiple Large Language Model Agents

Abstract:
Ptychography is an advanced computational imaging technique in X‑ray and electron microscopy. It has been widely adopted across scientific research fields, including physics, chemistry, biology, and materials science, as well as in industrial applications such as semiconductor characterization. In practice, obtaining high‑quality ptychographic images requires simultaneous optimization of numerous experimental and algorithmic parameters. Traditionally, parameter selection often relies on trial and error, leading to low‑throughput workflows and potential human bias. In this work, we develop the "Ptychographic Experiment and Analysis Robot" (PEAR), a framework that leverages large language models (LLMs) to automate data analysis in ptychography. To ensure high robustness and accuracy, PEAR employs multiple LLM agents for tasks including knowledge retrieval, code generation, parameter recommendation, and image reasoning. Our study demonstrates that PEAR's multi‑agent design significantly improves the workflow success rate, even with smaller open‑weight models such as LLaMA 3.1 8B. PEAR also supports various automation levels and is designed to work with customized local knowledge bases, ensuring flexibility and adaptability across different research environments.

PaperID: 12728, https://arxiv.org/pdf/2410.09024.pdf

Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies

Title: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Abstract:
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents ‑‑ which use external tools and can execute multi‑stage tasks ‑‑ may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi‑step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi‑step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM‑based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai‑safety‑institute/AgentHarm.

PaperID: 12729, https://arxiv.org/pdf/2410.08985.pdf

Authors: Bo Ni, Yu Wang, Lu Cheng, Erik Blasch, Tyler Derr

Title: Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

Abstract:
Recently, Knowledge Graphs (KGs) have been successfully coupled with Large Language Models (LLMs) to mitigate their hallucinations and enhance their reasoning capability, such as in KG‑based retrieval‑augmented frameworks. However, current KG‑LLM frameworks lack rigorous uncertainty estimation, limiting their reliable deployment in high‑stakes applications. Directly incorporating uncertainty quantification into KG‑LLM frameworks presents challenges due to their complex architectures and the intricate interactions between the knowledge graph and language model components. To address this gap, we propose a new trustworthy KG‑LLM framework, Uncertainty Aware Knowledge‑Graph Reasoning (UAG), which incorporates uncertainty quantification into the KG‑LLM framework. We design an uncertainty‑aware multi‑step reasoning framework that leverages conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi‑step process, we additionally introduce an error rate control module to adjust the error rate within the individual components. Extensive experiments show that our proposed UAG can achieve any pre‑defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines.

PaperID: 12730, https://arxiv.org/pdf/2410.08948.pdf

Authors: Ariel Flint Ashery, Luca Maria Aiello, Andrea Baronchelli

Title: Emergent social conventions and collective bias in LLM populations

Abstract:
Social conventions are the backbone of social coordination, shaping how individuals form a group. As growing populations of artificial intelligence (AI) agents communicate through natural language, a fundamental question is whether they can bootstrap the foundations of a society. Here, we present experimental results that demonstrate the spontaneous emergence of universally adopted social conventions in decentralized populations of large language model (LLM) agents. We then show how strong collective biases can emerge during this process, even when agents exhibit no bias individually. Last, we examine how committed minority groups of adversarial LLM agents can drive social change by imposing alternative social conventions on the larger population. Our results show that AI systems can autonomously develop social conventions without explicit programming and have implications for designing AI systems that align, and remain aligned, with human values and societal goals.

PaperID: 12731, https://arxiv.org/pdf/2410.08850.pdf

Authors: Lorenzo Magnino, Yuchen Zhu, Mathieu Laurière

Title: Learning to Stop: Deep Learning for Mean Field Optimal Stopping

Abstract:
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi‑agent setting, named multi‑agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite‑space, discrete‑time environment. Since solving MAOS becomes computationally prohibitive as the number of agents is very large, we study the mean‑field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean‑field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and to learn the optimal stopping rule using backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and the scalability of our work through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable MAOS methods.

PaperID: 12732, https://arxiv.org/pdf/2410.08681.pdf

Authors: Bryce Ferenczi, Michael Burke, Tom Drummond

Title: Efficiently Scanning and Resampling Spatio-Temporal Tasks with Irregular Observations

Abstract:
Various works have aimed at combining the inference efficiency of recurrent models and training parallelism of multi‑head attention for sequence modeling. However, most of these works focus on tasks with fixed‑dimension observation spaces, such as individual tokens in language modeling or pixels in image completion. To handle an observation space of varying size, we propose a novel algorithm that alternates between cross‑attention between a 2D latent state and observation, and a discounted cumulative sum over the sequence dimension to efficiently accumulate historical information. We find this resampling cycle is critical for performance. To evaluate efficient sequence modeling in this domain, we introduce two multi‑agent intention tasks: simulated agents chasing bouncing particles and micromanagement analysis in professional StarCraft II games. Our algorithm achieves comparable accuracy with a lower parameter count, faster training and inference compared to existing methods.

PaperID: 12733, https://arxiv.org/pdf/2410.08633.pdf

Authors: Juno Kim, Taiji Suzuki

Title: Transformers Provably Solve Parity Efficiently with Chain of Thought

Abstract:
This work provides the first theoretical analysis of training transformers to solve complex problems by recursively generating intermediate states, analogous to fine‑tuning for chain‑of‑thought (CoT) reasoning. We consider training a one‑layer transformer to solve the fundamental k‑parity problem, extending the work on RNNs by Wies et al. (2023). We establish three key results: (1) any finite‑precision gradient‑based algorithm, without intermediate supervision, requires substantial iterations to solve parity with finite samples. (2) In contrast, when intermediate parities are incorporated into the loss function, our model can learn parity in one gradient update when aided by \emphteacher forcing, where ground‑truth labels of the reasoning chain are provided at each generation step. (3) Even without teacher forcing, where the model must generate CoT chains end‑to‑end, parity can be learned efficiently if augmented data is employed to internally verify the soundness of intermediate steps. Our findings, supported by numerical experiments, show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self‑consistency checking can improve multi‑step reasoning ability, aligning with empirical studies of CoT.

PaperID: 12734, https://arxiv.org/pdf/2410.08610.pdf

Authors: Mattia Zanella

Title: Derivation of macroscopic epidemic models from multi-agent systems

Abstract:
We present a systematic review of some basic results on the derivation of classical epidemiological models from simple agent‑based dynamics. The evolution of large populations is coupled with the dynamics of the contact distribution, providing insights into how individual behaviors impact macroscopic epidemiological trends. The resulting set of equations incorporates local characteristics of the operator governing the emergence of a family of contact distributions. To validate the proposed approach, we provide several numerical results based on asymptotic preserving methods, demonstrating their effectiveness in capturing the multi‑scale nature of the problem and ensuring robust performance across different regimes.

PaperID: 12735, https://arxiv.org/pdf/2410.08504.pdf

Authors: Sujan Sarker, Haley N. Green, Mohammad Samin Yasar, Tariq Iqbal

Title: CoHRT: A Collaboration System for Human-Robot Teamwork

Abstract:
Collaborative robots are increasingly deployed alongside humans in factories, hospitals, schools, and other domains to enhance teamwork and efficiency. Systems that seamlessly integrate humans and robots into cohesive teams for coordinated and efficient task execution are needed, enabling studies on how robot collaboration policies affect team performance and teammates' perceived fairness, trust, and safety. Such a system can also be utilized to study the impact of a robot's normative behavior on team collaboration. Additionally, it allows for investigation into how the legibility and predictability of robot actions affect human‑robot teamwork and perceived safety and trust. Existing systems are limited, typically involving one human and one robot, and thus require more insight into broader team dynamics. Many rely on games or virtual simulations, neglecting the impact of a robot's physical presence. Most tasks are turn‑based, hindering simultaneous execution and affecting efficiency. This paper introduces CoHRT (Collaboration System for Human‑Robot Teamwork), which facilitates multi‑human‑robot teamwork through seamless collaboration, coordination, and communication. CoHRT utilizes a server‑client‑based architecture, a vision‑based system to track task environments, and a simple interface for team action coordination. It allows for the design of tasks considering the human teammates' physical and mental workload and varied skill labels across the team members. We used CoHRT to design a collaborative block manipulation and jigsaw puzzle‑solving task in a team of one Franka Emika Panda robot and two humans. The system enables recording multi‑modal collaboration data to develop adaptive collaboration policies for robots. To further utilize CoHRT, we outline potential research directions in diverse human‑robot collaborative tasks.

PaperID: 12736, https://arxiv.org/pdf/2410.08486.pdf

Authors: Theodoros Trochatos, Jakub Szefer

Title: Quantum Operating System Support for Quantum Trusted Execution Environments

Abstract:
With the growing reliance on cloud‑based quantum computing, ensuring the confidentiality and integrity of quantum computations is paramount. Quantum Trusted Execution Environments (QTEEs) have been proposed to protect users' quantum circuits when they are submitted to remote cloud‑based quantum computers. However, deployment of QTEEs necessitates a Quantum Operating Systems (QOS) that can support QTEEs hardware and operation. This work introduces the first architecture for a QOS to support and enable essential steps required for secure quantum task execution on cloud platforms.

PaperID: 12737, https://arxiv.org/pdf/2410.08475.pdf

Authors: Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, Alejandro Ribeiro

Title: GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation

Abstract:
Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs' internal knowledge to produce reliable Chain‑Of‑Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent‑based reasoning systems require access to a comprehensive nonparametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non‑parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query‑specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak). Extensive experiments demonstrated the following benefits of our framework: (1) GIVE boosts the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks (GPT3.5T + GIVE > GPT4). (3) GIVE is effective on scientific and open‑domain assessments. (4) GIVE is a training‑free method that enables LLMs to tackle new problems that extend beyond their training data (up to 43.5% ‑> 88.2% accuracy improvement). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from 135 to more than 840k nodes. (6) The reasoning process involved in GIVE is fully interpretable.

PaperID: 12738, https://arxiv.org/pdf/2410.08440.pdf

Authors: Armel Koulong, Ali Pakniyat

Title: Distributed Adaptive Consensus with Obstacle and Collision Avoidance for Networks of Heterogeneous Multi-Agent Systems

Abstract:
This paper presents a distributed adaptive control strategy for multi‑agent systems with heterogeneous dynamics and collision avoidance. We propose an adaptive control strategy designed to ensure leader‑following formation consensus while effectively managing collision and obstacle avoidance using potential functions. By integrating neural network‑based disturbance estimation and adaptive tuning laws, the proposed strategy ensures consensus and stability in leader‑following formations under fixed topologies.

PaperID: 12739, https://arxiv.org/pdf/2410.08436.pdf

Authors: Zi'ou Zheng, Christopher Malon, Martin Renqiang Min, Xiaodan Zhu

Title: Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

Abstract:
When performing complex multi‑step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models' explainability. This paper is centred around a focused study: whether the current state‑of‑the‑art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in‑context learning. Our study specifically focuses on structure‑aware demonstration and structure‑aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.

PaperID: 12740, https://arxiv.org/pdf/2410.08388.pdf

Authors: Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei

Title: Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes

Abstract:
Representational harms in language technologies often occur in short spans within otherwise neutral text, where phrases may simultaneously convey generalizations, unfairness, or stereotypes. Framing bias detection as sentence‑level classification obscures which words carry bias and what type is present, limiting both auditability and targeted mitigation. We introduce the GUS‑Net Framework, comprising the GUS dataset and a multi‑label token‑level detector for span‑level analysis of social bias. The GUS dataset contains 3,739 unique snippets across multiple domains, with over 69,000 token‑level annotations. Each token is labeled using BIO tags (Begin, Inside, Outside) for three pathways of representational harm: Generalizations, Unfairness, and Stereotypes. To ensure reliable data annotation, we employ an automated multi‑agent pipeline that proposes candidate spans which are subsequently verified and corrected by human experts. We formulate bias detection as multi‑label token‑level classification and benchmark both encoder‑based models (e.g., BERT family variants) and decoder‑based large language models (LLMs). Our evaluations cover token‑level identification and span‑level entity recognition on our test set, and out‑of‑distribution generalization. Empirical results show that encoder‑based models consistently outperform decoder‑based baselines on nuanced and overlapping spans while being more computationally efficient. The framework delivers interpretable, fine‑grained diagnostics that enable systematic auditing and mitigation of representational harms in real‑world NLP systems.

PaperID: 12741, https://arxiv.org/pdf/2410.08328.pdf

Authors: Konstantina Christakopoulou, Shibl Mourad, Maja Matarić

Title: Agents Thinking Fast and Slow: A Talker-Reasoner Architecture

Abstract:
Large language models have enabled agents of all kinds to interact with users through natural conversation. Consequently, agents now have two jobs: conversing and planning/reasoning. Their conversational responses must be informed by all available information, and their actions must help to achieve goals. This dichotomy between conversing with the user and doing multi‑step reasoning and planning can be seen as analogous to the human systems of "thinking fast and slow" as introduced by Kahneman. Our approach is comprised of a "Talker" agent (System 1) that is fast and intuitive, and tasked with synthesizing the conversational response; and a "Reasoner" agent (System 2) that is slower, more deliberative, and more logical, and is tasked with multi‑step reasoning and planning, calling tools, performing actions in the world, and thereby producing the new agent state. We describe the new Talker‑Reasoner architecture and discuss its advantages, including modularity and decreased latency. We ground the discussion in the context of a sleep coaching agent, in order to demonstrate real‑world relevance.

PaperID: 12742, https://arxiv.org/pdf/2410.08146.pdf

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

Title: Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Abstract:
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi‑step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per‑step human labels is not scalable, and training PRMs from automatically‑labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step‑level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test‑time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test‑time search against PAVs is >8% more accurate, and 1.5‑5× more compute‑efficient. Online RL with dense rewards from PAVs enables one of the first results with 5‑6× gain in sample efficiency, and >6% gain in accuracy, over ORMs.

PaperID: 12743, https://arxiv.org/pdf/2410.08130.pdf

Authors: Kamesh R

Title: Think Beyond Size: Adaptive Prompting for More Effective Reasoning

Abstract:
Pretrained large language models (LLMs) are increasingly utilized across a wide range of natural language processing (NLP) tasks due to their impressive capabilities as few‑shot learners. Recent techniques, such as chain‑of‑thought (CoT) prompting, have significantly advanced multi‑step reasoning by introducing step‑by‑step decomposition, achieving state‑of‑the‑art results on complex reasoning benchmarks. However, these approaches often rely on static prompting templates that do not adapt to task complexity or errors during the reasoning process. In this work, we introduce Adaptive Prompting, a dynamic and iterative framework designed to enhance reasoning by incorporating real‑time adjustments to prompt structures and validation mechanisms.Experimental results demonstrate that Adaptive Prompting significantly improves performance on diverse reasoning benchmarks, including arithmetic reasoning (GSM8K, MultiArith), logical reasoning and commonsense tasks, achieving substantial accuracy gains compared to static prompting baselines. By integrating guided prompts, intermediate validation, and self‑corrective steps, our approach enables smaller models to achieve competitive performance with larger counterparts, such as GPT‑4, while maintaining computational efficiency. The framework achieves this without requiring fine‑tuning or task‑specific training data, highlighting the untapped potential of iterative reasoning methods.

PaperID: 12744, https://arxiv.org/pdf/2410.07976.pdf

Authors: Baraah A. M. Sidahmed, Tatjana Chavdarova

Title: Addressing Rotational Learning Dynamics in Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) has emerged as a powerful paradigm for solving complex problems through agents' cooperation and competition, finding widespread applications across domains. Despite its success, MARL faces a reproducibility crisis. We show that, in part, this issue is related to the rotational optimization dynamics arising from competing agents' objectives, and require methods beyond standard optimization algorithms. We reframe MARL approaches using Variational Inequalities (VIs), offering a unified framework to address such issues. Leveraging optimization techniques designed for VIs, we propose a general approach for integrating gradient‑based VI methods capable of handling rotational dynamics into existing MARL algorithms. Empirical results demonstrate significant performance improvements across benchmarks. In zero‑sum games, Rock‑‑paper‑‑scissors and Matching pennies, VI methods achieve better convergence to equilibrium strategies, and in the Multi‑Agent Particle Environment: Predator‑prey, they also enhance team coordination. These results underscore the transformative potential of advanced optimization techniques in MARL.

PaperID: 12745, https://arxiv.org/pdf/2410.07954.pdf

Authors: Irene Saccani, Stefano Ardizzoni, Luca Consolini, Marco Locatelli

Title: Dynamic Programming based Local Search approaches for Multi-Agent Path Finding problems on Directed Graphs

Abstract:
Among sub‑optimal Multi‑Agent Path Finding (MAPF) solvers, rule‑based algorithms are particularly appealing since they are complete. Even in crowded scenarios, they allow finding a feasible solution that brings each agent to its target, preventing deadlock situations. However, generally, rule‑based algorithms provide much longer solutions than the shortest one. The main contribution of this paper is introducing a new local search procedure for improving a known feasible solution. We start from a feasible sub‑optimal solution, and perform a local search in a neighborhood of this solution. If we are able to find a shorter solution, we repeat this procedure until the solution cannot be shortened anymore. At the end, we obtain a solution that is still sub‑optimal, but generally of much better quality than the initial one. We propose two different local search policies. In the first, we explore all paths in which the agents positions remain in a neighborhood of the corresponding positions of the reference solution. In the second, we set an upper limit to the number of agents that can change their path with respect to the reference solution. These two different policies can also be alternated. We explore the neighborhoods by dynamic programming. The fact that our search is local is fundamental in terms of time complexity. Indeed, if the dynamic programming approach is applied to the full MAPF problem, the number of explored states grows exponentially with the number of agents. Instead, the introduction of a locality constraint allows exploring the neghborhoods in a time that grows polynomially with respect to the number of agents.

PaperID: 12746, https://arxiv.org/pdf/2410.07921.pdf

Authors: Arash Khajooeinejad, Fatemeh Sadat Masoumi, Masoumeh Chapariniya

Title: Boosting Hierarchical Reinforcement Learning with Meta-Learning for Complex Task Adaptation

Abstract:
Hierarchical Reinforcement Learning (HRL) is well‑suitedd for solving complex tasks by breaking them down into structured policies. However, HRL agents often struggle with efficient exploration and quick adaptation. To overcome these limitations, we propose integrating meta‑learning into HRL to enable agents to learn and adapt hierarchical policies more effectively. Our method leverages meta‑learning to facilitate rapid task adaptation using prior experience, while intrinsic motivation mechanisms drive efficient exploration by rewarding the discovery of novel states. Specifically, our agent employs a high‑level policy to choose among multiple low‑level policies within custom‑designed grid environments. By incorporating gradient‑based meta‑learning with differentiable inner‑loop updates, we optimize performance across a curriculum of progressively challenging tasks. Experimental results highlight that our metalearning‑enhanced hierarchical agent significantly outperforms standard HRL approaches lacking meta‑learning and intrinsic motivation. The agent demonstrates faster learning, greater cumulative rewards, and higher success rates in complex grid‑based scenarios. These Findings underscore the effectiveness of combining meta‑learning, curriculum learning, and intrinsic motivation to enhance the capability of HRL agents in tackling complex tasks.

PaperID: 12747, https://arxiv.org/pdf/2410.07863.pdf

Authors: Fanqi Kong, Yizhe Huang, Song-Chun Zhu, Siyuan Qi, Xue Feng

Title: Learning to Balance Altruism and Self-interest Based on Empathy in Mixed-Motive Games

Abstract:
Real‑world multi‑agent scenarios often involve mixed motives, demanding altruistic agents capable of self‑protection against potential exploitation. However, existing approaches often struggle to achieve both objectives. In this paper, based on that empathic responses are modulated by inferred social relationships between agents, we propose LASE Learning to balance Altruism and Self‑interest based on Empathy), a distributed multi‑agent reinforcement learning algorithm that fosters altruistic cooperation through gifting while avoiding exploitation by other agents in mixed‑motive games. LASE allocates a portion of its rewards to co‑players as gifts, with this allocation adapting dynamically based on the social relationship ‑‑ a metric evaluating the friendliness of co‑players estimated by counterfactual reasoning. In particular, social relationship measures each co‑player by comparing the estimated Q‑function of current joint action to a counterfactual baseline which marginalizes the co‑player's action, with its action distribution inferred by a perspective‑taking module. Comprehensive experiments are performed in spatially and temporally extended mixed‑motive games, demonstrating LASE's ability to promote group collaboration without compromising fairness and its capacity to adapt policies to various types of interactive co‑players.

PaperID: 12748, https://arxiv.org/pdf/2410.07706.pdf

Authors: Yifan Song, Weimin Xiong, Xiutian Zhao, Dawei Zhu, Wenhao Wu, Ke Wang, Cheng Li, Wei Peng, Sujian Li

Title: AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories

Abstract:
Fine‑tuning on agent‑environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open‑source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high‑quality interaction trajectories which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine‑tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.

PaperID: 12749, https://arxiv.org/pdf/2410.07672.pdf

Authors: Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren

Title: MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

Abstract:
As large language models (LLMs) are rapidly advancing and achieving near‑human capabilities on specific tasks, aligning them with human values is becoming more urgent. In scenarios where LLMs outperform humans, we face a weak‑to‑strong alignment problem where we need to effectively align strong student LLMs through weak supervision generated by weak teachers. Existing alignment methods mainly focus on strong‑to‑weak alignment and self‑alignment settings, and it is impractical to adapt them to the much harder weak‑to‑strong alignment setting. To fill this gap, we propose a multi‑agent contrastive preference optimization (MACPO) framework. MACPO facilitates weak teachers and strong students to learn from each other by iteratively reinforcing unfamiliar positive behaviors while penalizing familiar negative ones. To get this, we devise a mutual positive behavior augmentation strategy to encourage weak teachers and strong students to learn from each other's positive behavior and further provide higher quality positive behavior for the next iteration. Additionally, we propose a hard negative behavior construction strategy to induce weak teachers and strong students to generate familiar negative behavior by fine‑tuning on negative behavioral data. Experimental results on the HH‑RLHF and PKU‑SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. Moreover, as the number of weak teachers increases, MACPO achieves better weak‑to‑strong alignment performance through more iteration optimization rounds.

PaperID: 12750, https://arxiv.org/pdf/2410.07561.pdf

Authors: Xiawei Liu, Shiyue Yang, Xinnong Zhang, Haoyu Kuang, Libo Sun, Yihang Yang, Siming Chen, Xuanjing Huang, Zhongyu Wei

Title: AI-Press: A Multi-Agent News Generating and Feedback Simulation System Powered by Large Language Models

Abstract:
The rise of various social platforms has transformed journalism. The growing demand for news content has led to the increased use of large language models (LLMs) in news production due to their speed and cost‑effectiveness. However, LLMs still encounter limitations in professionalism and ethical judgment in news generation. Additionally, predicting public feedback is usually difficult before news is released. To tackle these challenges, we introduce AI‑Press, an automated news drafting and polishing system based on multi‑agent collaboration and Retrieval‑Augmented Generation. We develop a feedback simulation system that generates public feedback considering demographic distributions. Through extensive quantitative and qualitative evaluations, our system shows significant improvements in news‑generating capabilities and verifies the effectiveness of public feedback simulation.

PaperID: 12751, https://arxiv.org/pdf/2410.07553.pdf

Authors: Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler Bradshaw, Junjie Hu

Title: COMMA: A Communicative Multimodal Multi-Agent Benchmark

Abstract:
The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language‑based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real‑world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter‑agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi‑agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state‑of‑the‑art models, including strong proprietary models like GPT‑4o and reasoning models like o4‑mini. Many chain of thought reasoning models such as R1‑Onevision and LLaVA‑CoT struggle to outperform even a random baseline in agent‑agent collaboration, indicating a potential growth area in their communication abilities.

PaperID: 12752, https://arxiv.org/pdf/2410.07426.pdf

Authors: Kamil Khan, Sudeep Pasricha

Title: CAFEEN: A Cooperative Approach for Energy Efficient NoCs with Multi-Agent Reinforcement Learning

Abstract:
In emerging high‑performance Network‑on‑Chip (NoC) architectures, efficient power management is crucial to minimize energy consumption. We propose a novel framework called CAFEEN that employs both heuristic‑based fine‑grained and machine learning‑based coarse‑grained power‑gating for energy‑efficient NoCs. CAFEEN uses a fine‑grained method to activate only essential NoC buffers during lower network loads. It switches to a coarse‑grained method at peak loads to minimize compounding wake‑up overhead using multi‑agent reinforcement learning. Results show that CAFEEN adaptively balances power‑efficiency with performance, reducing total energy by 2.60x for single application workloads and 4.37x for multi‑application workloads, compared to state‑of‑the‑art NoC power‑gating frameworks.

PaperID: 12753, https://arxiv.org/pdf/2410.07413.pdf

Authors: Deep Parikh, Thomas L. Ahrens, Manoranjan Majji

Title: A Rapid Trajectory Optimization and Control Framework for Resource-Constrained Applications

Abstract:
This paper presents a computationally efficient model predictive control formulation that uses an integral Chebyshev collocation method to enable rapid operations of autonomous agents. By posing the finite‑horizon optimal control problem and recursive re‑evaluation of the optimal trajectories, minimization of the L2 norms of the state and control errors are transcribed into a quadratic program. Control and state variable constraints are parameterized using Chebyshev polynomials and are accommodated in the optimal trajectory generation programs to incorporate the actuator limits and keep‑out constraints. Differentiable collision detection of polytopes is leveraged for optimal collision avoidance. Results obtained from the collocation methods are benchmarked against the existing approaches on an edge computer to outline the performance improvements. Finally, collaborative control scenarios involving multi‑agent space systems are considered to demonstrate the technical merits of the proposed work.

PaperID: 12754, https://arxiv.org/pdf/2410.07409.pdf

Authors: Isaac Remy, David Fridovich-Keil, Karen Leung

Title: Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions

Abstract:
From autonomous driving to package delivery, ensuring safe yet efficient multi‑agent interaction is challenging as the interaction dynamics are influenced by hard‑to‑model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially‑aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi‑agent interactions via the lens of responsibility, i.e., an agent's willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data‑driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents' responsibility allocation from data. We demonstrate on synthetic and real‑world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.

PaperID: 12755, https://arxiv.org/pdf/2410.07283.pdf

Authors: Donghyun Lee, Mo Tiwari

Title: Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

Abstract:
As Large Language Models (LLMs) grow increasingly powerful, multi‑agent systems are becoming more prevalent in modern AI applications. Most safety research, however, has focused on vulnerabilities in single‑agent LLMs. These include prompt injection attacks, where malicious prompts embedded in external content trick the LLM into executing unintended or harmful actions, compromising the victim's application. In this paper, we reveal a more dangerous vector: LLM‑to‑LLM prompt injection within multi‑agent systems. We introduce Prompt Infection, a novel attack where malicious prompts self‑replicate across interconnected agents, behaving much like a computer virus. This attack poses severe threats, including data theft, scams, misinformation, and system‑wide disruption, all while propagating silently through the system. Our extensive experiments demonstrate that multi‑agent systems are highly susceptible, even when agents do not publicly share all communications. To address this, we propose LLM Tagging, a defense mechanism that, when combined with existing safeguards, significantly mitigates infection spread. This work underscores the urgent need for advanced security measures as multi‑agent LLM systems become more widely adopted.

PaperID: 12756, https://arxiv.org/pdf/2410.07274.pdf

Authors: Alice Delbosc, Magalie Ochs, Nicolas Sabouret, Brian Ravenet, Stephane Ayache

Title: Mitigation of gender bias in automatic facial non-verbal behaviors generation

Abstract:
Research on non‑verbal behavior generation for social interactive agents focuses mainly on the believability and synchronization of non‑verbal cues with speech. However, existing models, predominantly based on deep learning architectures, often perpetuate biases inherent in the training data. This raises ethical concerns, depending on the intended application of these agents. This paper addresses these issues by first examining the influence of gender on facial non‑verbal behaviors. We concentrate on gaze, head movements, and facial expressions. We introduce a classifier capable of discerning the gender of a speaker from their non‑verbal cues. This classifier achieves high accuracy on both real behavior data, extracted using state‑of‑the‑art tools, and synthetic data, generated from a model developed in previous work.Building upon this work, we present a new model, FairGenderGen, which integrates a gender discriminator and a gradient reversal layer into our previous behavior generation model. This new model generates facial non‑verbal behaviors from speech features, mitigating gender sensitivity in the generated behaviors. Our experiments demonstrate that the classifier, developed in the initial phase, is no longer effective in distinguishing the gender of the speaker from the generated non‑verbal behaviors.

PaperID: 12757, https://arxiv.org/pdf/2410.07166.pdf

Authors: Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu

Title: Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Abstract:
We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input‑output specifications of LLM‑based modules. Specifically, it allows us to unify 1) a broad set of embodied decision‑making tasks involving both state and temporally extended goals, 2) four commonly‑used LLM‑based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine‑grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM‑powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.

PaperID: 12758, https://arxiv.org/pdf/2410.07109.pdf

Authors: Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano

Title: I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy

Abstract:
As LLM‑based agents become increasingly autonomous and will more freely interact with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in‑depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., LLama3, Orca2, Command‑r, Mixtral, Mistral2, and gpt4.1) and 240 experimental scenarios, we analyze persuasion and anti‑social behavior between a guard and a prisoner agent with differing objectives. We first document model‑specific conversational failures in this multi‑agent power dynamic context, thereby narrowing our analytic sample to 1,600 conversations. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti‑social behavior. Moreover, agent personas, especially the guard's, substantially impact both successful persuasion by the prisoner and the manifestation of anti‑social actions. Notably, we observe the emergence of anti‑social conduct even in absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.

PaperID: 12759, https://arxiv.org/pdf/2410.07103.pdf

Authors: Sangwon Yu, Ik-hwan Kim, Jongyoon Song, Saehyung Lee, Junsung Park, Sungroh Yoon

Title: Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Abstract:
Multi‑hop reasoning, which requires multi‑step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the absolute position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs' performance is also sensitive to the order, relative position, in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, based on the theoretical approach, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context. This ensures that certain contiguous reasoning segments within supporting documents are presented in the optimal order, effectively guiding the model's reasoning in the appropriate direction. Applying CoRe, we improve the F1 score by up to 30%p on multi‑hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well‑known "lost‑in‑the‑middle" problem in LLMs and can be effectively combined with retrieval‑based approaches utilizing Chain‑of‑Thought (CoT) reasoning.

PaperID: 12760, https://arxiv.org/pdf/2410.06949.pdf

Authors: Xuanming Zhang, Yuxuan Chen, Yuan Yuan, Minlie Huang

Title: Towards Exception Safety Code Generation with Intermediate Representation Agents Framework

Abstract:
Large Language Models (LLMs) often struggle with robust exception handling in generated code, leading to fragile programs that are prone to runtime errors. We propose Seeker, a novel multi‑agent framework that enforces exception safety in LLM generated code through an Intermediate Representation (IR) approach. Seeker decomposes exception handling into five specialized agents: Scanner, Detector, Predator, Ranker, and Handler that collaboratively analyze code, detect fragile segments, retrieve best practice exception strategies, and inject robust handling code. We also introduce Common Exception Enumeration (CEE), a comprehensive knowledge base derived from official documentation, technical practices, and real world code, to standardize exception handling strategies. Seeker also incorporates a Deep Retrieval‑Augmented Generation (Deep RAG) algorithm to efficiently navigate the exception inheritance hierarchy, cutting down search overhead by 93% while improving accuracy in identifying relevant exceptions. We evaluate Seeker on 15 open source Java projects and multiple benchmarks. Seeker outperforms state of the art baselines, improving exception handling precision by up to 37% and overall code robustness by 38% as measured by expert code review. It significantly closes the gap between LLM and human developers in exception management, achieving a 28% success rate on real world issue fixes (SWE bench) versus 19% by prior methods. Our framework preserves functional correctness of code while proactively handling errors, demonstrating a practical, generalizable solution for safer code generation. In this paper, we discuss the novelty of using intermediate representation and multi‑agent collaboration for exception handling, and outline how Seeker can be extended to other programming languages and complex software engineering tasks, aligning LLM‑generated code with industrial standard.

PaperID: 12761, https://arxiv.org/pdf/2410.06932.pdf

Authors: Daniel Albert, Stephan Billinger

Title: Reproducing and Extending Experiments in Behavioral Strategy with Large Language Models

Abstract:
In this study, we propose LLM agents as a novel approach in behavioral strategy research, complementing simulations and laboratory experiments to advance our understanding of cognitive processes in decision‑making. Specifically, we reproduce a human laboratory experiment in behavioral strategy using large language model (LLM) generated agents and investigate how LLM agents compare to observed human behavior. Our results show that LLM agents effectively reproduce search behavior and decision‑making comparable to humans. Extending our experiment, we analyze LLM agents' simulated "thoughts," discovering that more forward‑looking thoughts correlate with favoring exploitation over exploration to maximize wealth. We show how this new approach can be leveraged in behavioral strategy research and address limitations.

PaperID: 12762, https://arxiv.org/pdf/2410.06678.pdf

Authors: Zeyu Zhang, Sixu Yan, Muzhi Han, Zaijin Wang, Xinggang Wang, Song-Chun Zhu, Hangxin Liu

Title: M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes

Abstract:
We propose M3Bench, a new benchmark for whole‑body motion generation in mobile manipulation tasks. Given a 3D scene context, M3Bench requires an embodied agent to reason about its configuration, environmental constraints, and task objectives to generate coordinated whole‑body motion trajectories for object rearrangement. M3Bench features 30,000 object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker, an automatic data generation tool that produces whole‑body motion trajectories from high‑level task instructions using only basic scene and robot information. Our benchmark includes various task splits to evaluate generalization across different dimensions and leverages realistic physics simulation for trajectory assessment. Extensive evaluation analysis reveals that state‑of‑the‑art models struggle with coordinating base‑arm motion while adhering to environmental and task‑specific constraints, underscoring the need for new models to bridge this gap. By releasing M3Bench and M3BenchMaker we aim to advance robotics research toward more adaptive and capable mobile manipulation in diverse, real‑world environments.

PaperID: 12763, https://arxiv.org/pdf/2410.06627.pdf

Authors: Muhammad Morshed Alam, Muhammad Yeasir Aarafat, Tamim Hossain

Title: Variations in Multi-Agent Actor-Critic Frameworks for Joint Optimizations in UAV Swarm Networks: Recent Evolution, Challenges, and Directions

Abstract:
Autonomous unmanned aerial vehicle (UAV) swarm networks (UAVSNs) can effectively execute surveillance, connectivity, and computing services to ground users (GUs). These missions require trajectory planning, UAV‑GUs association, task offloading, next‑hop selection, and resources such as transmit power, bandwidth, caching, and computing allocation to improve network performances. Owing to the highly dynamic topology, limited resources, and non‑availability of global knowledge, optimizing network performance in UAVSNs is very intricate. Hence, it requires an adaptive joint optimization framework that can tackle both discrete and continuous decision variables to ensure optimal network performance under dynamic constraints. Multi‑agent deep reinforcement learning‑based adaptive actor‑critic framework can efficiently address these problems. This paper investigates the recent evolutions of actor‑critic frameworks to deal with joint optimization problems in UAVSNs. In addition, challenges and potential solutions are addressed as research directions.

PaperID: 12764, https://arxiv.org/pdf/2410.06506.pdf

Authors: Ziheng Liu, Jiayi Zhang, Enyu Shi, Yiyang Zhu, Derrick Wing Kwan Ng, Bo Ai

Title: Cooperative Multi-Target Positioning for Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning

Abstract:
Cell‑free massive multiple‑input multiple‑output (mMIMO) is a promising technology to empower next‑generation mobile communication networks. In this paper, to address the computational complexity associated with conventional fingerprint positioning, we consider a novel cooperative positioning architecture that involves certain relevant access points (APs) to establish positioning similarity coefficients. Then, we propose an innovative joint positioning and correction framework employing multi‑agent reinforcement learning (MARL) to tackle the challenges of high‑dimensional sophisticated signal processing, which mainly leverages on the received signal strength information for preliminary positioning, supplemented by the angle of arrival information to refine the initial position estimation. Moreover, to mitigate the bias effects originating from remote APs, we design a cooperative weighted K‑nearest neighbor (Co‑WKNN)‑based estimation scheme to select APs with a high correlation to participate in user positioning. In the numerical results, we present comparisons of various user positioning schemes, which reveal that the proposed MARL‑based positioning scheme with Co‑WKNN can effectively improve positioning performance. It is important to note that the cooperative positioning architecture is a critical element in striking a balance between positioning performance and computational complexity.

PaperID: 12765, https://arxiv.org/pdf/2410.06321.pdf

Authors: Omanshu Thapliyal, Shanelle Clarke, Inseok Hwang

Title: An Algorithm for Distributed Computation of Reachable Sets for Multi-Agent Systems

Abstract:
In this paper, we consider the problem of distributed reachable set computation for multi‑agent systems (MASs) interacting over an undirected, stationary graph. A full state‑feedback control input for such MASs depends no only on the current agent's state, but also of its neighbors. However, in most MAS applications, the dynamics are obscured by individual agents. This makes reachable set computation, in a fully distributed manner, a challenging problem. We utilize the ideas of polytopic reachable set approximation and generalize it to a MAS setup. We formulate the resulting sub‑problems in a fully distributed manner and provide convergence guarantees for the associated computations. The proposed algorithm's convergence is proved for two cases: static MAS graphs, and time‑varying graphs under certain restrictions.

PaperID: 12766, https://arxiv.org/pdf/2410.06304.pdf

Authors: Ruosen Li, Ziming Luo, Xinya Du

Title: FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning

Abstract:
Hallucinations in large language models (LLMs) pose significant challenges in tasks requiring complex multi‑step reasoning, such as mathematical problem‑solving. Existing approaches primarily detect the presence of hallucinations but lack a nuanced understanding of their types and manifestations. In this paper, we first introduce a comprehensive taxonomy that categorizes the common hallucinations in mathematical reasoning tasks into six types. We then propose FG‑PRM (Fine‑Grained Process Reward Model), an augmented model designed to detect and mitigate hallucinations in a fine‑grained, step‑level manner. To address the limitations of manually labeling training data, we propose an automated method for generating fine‑grained hallucination data using LLMs. Our FG‑PRM demonstrates superior performance across two key tasks: 1) Fine‑grained hallucination detection: classifying hallucination types for each reasoning step; and 2) Verification: ranking multiple LLM‑generated outputs to select the most accurate solution. Our experiments show that FG‑PRM excels in fine‑grained hallucination detection and substantially boosts the performance of LLMs on GSM8K and MATH benchmarks. These results highlight the benefits of fine‑grained supervision in enhancing the reliability and interpretability of LLM reasoning processes.

PaperID: 12767, https://arxiv.org/pdf/2410.06172.pdf

Authors: Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Eric Wang

Title: Multimodal Situational Safety

Abstract:
Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query‑image pairs, half of which the image context is safe, and the other half is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction‑following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi‑agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response. Code and data: mssbench.github.io.

PaperID: 12768, https://arxiv.org/pdf/2410.06108.pdf

Authors: Corban Rivera, Grayson Byrd, William Paul, Tyler Feldman, Meghan Booker, Emma Holmes, David Handelman, Bethany Kemp, Andrew Badger, Aurora Schmidt, Krishna Murthy Jatavallabhula, Celso M de Melo, Lalithkumar Seenivasan, Mathias Unberath, Rama Chellappa

Title: ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution

Abstract:
Robotic planning and execution in open‑world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high‑ or low‑levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language‑driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM‑based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM‑guided Monte Carlo Tree Search with self reflection. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy level embodied tasks outperforming other state‑of‑the‑art LLM‑driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM‑guided Tree Search to enable more robust automation in complex state and action spaces.

PaperID: 12769, https://arxiv.org/pdf/2410.06101.pdf

Authors: Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, Min Chen

Title: Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning

Abstract:
Reinforcement learning (RL) has emerged as a pivotal technique for fine‑tuning large language models (LLMs) on specific tasks. However, prevailing RL fine‑tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine‑tuning of LLMs. In this paper, we propose CORY, extending the RL fine‑tuning of LLMs to a sequential cooperative multi‑agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi‑agent systems. In CORY, the LLM to be fine‑tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer's responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY's performance by fine‑tuning GPT‑2 and Llama‑2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real‑world applications.

PaperID: 12770, https://arxiv.org/pdf/2410.05673.pdf

Authors: Fivos Kalogiannis, Jingming Yan, Ioannis Panageas

Title: Learning Equilibria in Adversarial Team Markov Games: A Nonconvex-Hidden-Concave Min-Max Optimization Problem

Abstract:
We study the problem of learning a Nash equilibrium (NE) in Markov games which is a cornerstone in multi‑agent reinforcement learning (MARL). In particular, we focus on infinite‑horizon adversarial team Markov games (ATMGs) in which agents that share a common reward function compete against a single opponent, the adversary. These games unify two‑player zero‑sum Markov games and Markov potential games, resulting in a setting that encompasses both collaboration and competition. Kalogiannis et al. (2023a) provided an efficient equilibrium computation algorithm for ATMGs which presumes knowledge of the reward and transition functions and has no sample complexity guarantees. We contribute a learning algorithm that utilizes MARL policy gradient methods with iteration and sample complexity that is polynomial in the approximation error ε and the natural parameters of the ATMG, resolving the main caveats of the solution by (Kalogiannis et al., 2023a). It is worth noting that previously, the existence of learning algorithms for NE was known for Markov two‑player zero‑sum and potential games but not for ATMGs. Seen through the lens of min‑max optimization, computing a NE in these games consists a nonconvex‑nonconcave saddle‑point problem. Min‑max optimization has received extensive study. Nevertheless, the case of nonconvex‑nonconcave landscapes remains elusive: in full generality, finding saddle‑points is computationally intractable (Daskalakis et al., 2021). We circumvent the aforementioned intractability by developing techniques that exploit the hidden structure of the objective function via a nonconvex‑concave reformulation. However, this introduces the challenge of a feasibility set with coupled constraints. We tackle these challenges by establishing novel techniques for optimizing weakly‑smooth nonconvex functions, extending the framework of (Devolder et al., 2014).

PaperID: 12771, https://arxiv.org/pdf/2410.05603.pdf

Authors: Zheyang Xiong, Ziyang Cai, John Cooper, Albert Ge, Vasilis Papageorgiou, Zack Sifakis, Angeliki Giannou, Ziqian Lin, Liu Yang, Saurabh Agarwal, Grigorios G Chrysos, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

Title: Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

Abstract:
Large Language Models (LLMs) have demonstrated remarkable in‑context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in‑context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.

PaperID: 12772, https://arxiv.org/pdf/2410.05570.pdf

Authors: Taufiq Daryanto, Xiaohan Ding, Lance T. Wilhelm, Sophia Stil, Kirk McInnis Knutsen, Eugenia H. Rho

Title: Conversate: Supporting Reflective Learning in Interview Practice Through Interactive Simulation and Dialogic Feedback

Abstract:
Job interviews play a critical role in shaping one's career, yet practicing interview skills can be challenging, especially without access to human coaches or peers for feedback. Recent advancements in large language models (LLMs) present an opportunity to enhance the interview practice experience. Yet, little research has explored the effectiveness and user perceptions of such systems or the benefits and challenges of using LLMs for interview practice. Furthermore, while prior work and recent commercial tools have demonstrated the potential of AI to assist with interview practice, they often deliver one‑way feedback, where users only receive information about their performance. By contrast, dialogic feedback, a concept developed in learning sciences, is a two‑way interaction feedback process that allows users to further engage with and learn from the provided feedback through interactive dialogue. This paper introduces Conversate, a web‑based application that supports reflective learning in job interview practice by leveraging large language models (LLMs) for interactive interview simulations and dialogic feedback. To start the interview session, the user provides the title of a job position (e.g., entry‑level software engineer) in the system. Then, our system will initialize the LLM agent to start the interview simulation by asking the user an opening interview question and following up with questions carefully adapted to subsequent user responses. After the interview session, our back‑end LLM framework will then analyze the user's responses and highlight areas for improvement. Users can then annotate the transcript by selecting specific sections and writing self‑reflections. Finally, the user can interact with the system for dialogic feedback, conversing with the LLM agent to learn from and iteratively refine their answers based on the agent's guidance.

PaperID: 12773, https://arxiv.org/pdf/2410.05554.pdf

Authors: Maulik Bhatt, Iman Askari, Yue Yu, Ufuk Topcu, Huazhen Fang, Negar Mehr

Title: MultiNash-PF: A Particle Filtering Approach for Computing Multiple Local Generalized Nash Equilibria in Trajectory Games

Abstract:
Modern robotic systems frequently engage in complex multi‑agent interactions, many of which are inherently multi‑modal, i.e., they can lead to multiple distinct outcomes. To interact effectively, robots must recognize the possible interaction modes and adapt to the one preferred by other agents. In this work, we propose MultiNash‑PF, an efficient algorithm for capturing the multimodality in multi‑agent interactions. We model interaction outcomes as equilibria of a game‑theoretic planner, where each equilibrium corresponds to a distinct interaction mode. Our framework formulates interactive planning as Constrained Potential Trajectory Games (CPTGs), in which local Generalized Nash Equilibria (GNEs) represent plausible interaction outcomes. We propose to integrate the potential game approach with implicit particle filtering, a sample‑efficient method for non‑convex trajectory optimization. We utilize implicit particle filtering to identify the coarse estimates of multiple local minimizers of the game's potential function. MultiNash‑PF then refines these estimates with optimization solvers, obtaining different local GNEs. We show through numerical simulations that MultiNash‑PF reduces computation time by up to 50% compared to a baseline. We further demonstrate the effectiveness of our algorithm in real‑world human‑robot interaction scenarios, where it successfully accounts for the multi‑modal nature of interactions and resolves potential conflicts in real‑time.

PaperID: 12774, https://arxiv.org/pdf/2410.05508.pdf

Authors: Emily Adlam

Title: How do we Observe Relational Observables?

Abstract:
In theories with a diffeomorphism symmetry, such as general relativity and canonical quantum gravity, it is often proposed that the empirical content is encoded in relational observables. But how do relational observables actually make contact with experience? I argue that this question can only be answered by providing a schematization of the observer which is appropriate for the context of a diffeomorphism‑invariant theory. I suggest that this may require us to move away from a `passive awareness' conception of consciousness towards a more agential conception, because there is a clear sense in which an embodied agent must experience herself as localised at a time. Finally, I consider what this means for the prospects of using quantum reference frames to address the problem of time, arguing that the way in which quantum reference frames are standardly described does not give us adequate resources to model agency, so some other kind of internal reference frame may be required to fully explain how we observe relational observables in the quantum context.

PaperID: 12775, https://arxiv.org/pdf/2410.05434.pdf

Authors: Sanjiban Choudhury, Paloma Sodhi

Title: Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback

Abstract:
While large language models (LLMs) show impressive decision‑making abilities, current methods lack a mechanism for automatic self‑improvement from errors during task execution. We propose LEAP, an iterative fine‑tuning framework that continually improves LLM agents using feedback from AI expert teachers. Our key insight is to equip the expert teachers with a privileged state ‑‑ information that is available during training but hidden at test time. This allows even weak experts to provide precise guidance, significantly improving the student agent's performance without access to privileged information at test time. We evaluate LEAP on diverse decision‑making benchmarks, including text‑based games (ALFWorld), web navigation (WebShop), and interactive coding (Intercode Bash). Our experiments show that LEAP (1) outperforms behavior cloning and ReAct baselines (2) enables weak student models (e.g., Llama3‑8B) to exceed the performance of strong teacher models (GPT4‑o), and (3) allows weak models to self‑improve using privileged versions of themselves. We also provide a theoretical analysis showing that LEAP's success hinges on balancing privileged information with the student's realizability, which we empirically validate. Our code is available at https://leap‑llm.github.io

PaperID: 12776, https://arxiv.org/pdf/2410.05130.pdf

Authors: Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu

Title: Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents

Abstract:
Recent research has explored the use of Large Language Models (LLMs) for tackling complex graph reasoning tasks. However, due to the intricacies of graph structures and the inherent limitations of LLMs in handling long text, current approaches often fail to deliver satisfactory accuracy, even on small‑scale graphs and simple tasks. To address these challenges, we introduce GraphAgent‑Reasoner, a fine‑tuning‑free framework that utilizes a multi‑agent collaboration strategy for explicit and precise graph reasoning. Inspired by distributed graph computation theory, our framework decomposes graph problems into smaller, node‑centric tasks that are distributed among multiple agents. The agents collaborate to solve the overall problem, significantly reducing the amount of information and complexity handled by a single LLM, thus enhancing the accuracy of graph reasoning. By simply increasing the number of agents, GraphAgent‑Reasoner can efficiently scale to accommodate larger graphs with over 1,000 nodes. Evaluated on the GraphInstruct dataset, our framework demonstrates near‑perfect accuracy on polynomial‑time graph reasoning tasks, significantly outperforming the best available models, both closed‑source and fine‑tuned open‑source variants. Our framework also demonstrates the capability to handle real‑world graph reasoning applications such as webpage importance analysis.

PaperID: 12777, https://arxiv.org/pdf/2410.05017.pdf

Authors: Ang He, Xi-mei Wu, Xiao-bin Guo, Li-bin Liu

Title: Enhanced Multi-Robot SLAM System with Cross-Validation Matching and Exponential Threshold Keyframe Selection

Abstract:
The evolving field of mobile robotics has indeed increased the demand for simultaneous localization and mapping (SLAM) systems. To augment the localization accuracy and mapping efficacy of SLAM, we refined the core module of the SLAM system. Within the feature matching phase, we introduced cross‑validation matching to filter out mismatches. In the keyframe selection strategy, an exponential threshold function is constructed to quantify the keyframe selection process. Compared with a single robot, the multi‑robot collaborative SLAM (CSLAM) system substantially improves task execution efficiency and robustness. By employing a centralized structure, we formulate a multi‑robot SLAM system and design a coarse‑to‑fine matching approach for multi‑map point cloud registration. Our system, built upon ORB‑SLAM3, underwent extensive evaluation utilizing the TUM RGB‑D, EuRoC MAV, and TUM_VI datasets. The experimental results demonstrate a significant improvement in the positioning accuracy and mapping quality of our enhanced algorithm compared to those of ORB‑SLAM3, with a 12.90% reduction in the absolute trajectory error.

PaperID: 12778, https://arxiv.org/pdf/2410.04936.pdf

Authors: Chen Zhang, Huan Hu, Yuan Zhou, Qiyang Cao, Ruochen Liu, Wenya Wei, Elvis S. Liu

Title: Training Interactive Agent in Large FPS Game Map with Rule-enhanced Reinforcement Learning

Abstract:
In the realm of competitive gaming, 3D first‑person shooter (FPS) games have gained immense popularity, prompting the development of game AI systems to enhance gameplay. However, deploying game AI in practical scenarios still poses challenges, particularly in large‑scale and complex FPS games. In this paper, we focus on the practical deployment of game AI in the online multiplayer competitive 3D FPS game called Arena Breakout, developed by Tencent Games. We propose a novel gaming AI system named Private Military Company Agent (PMCA), which is interactable within a large game map and engages in combat with players while utilizing tactical advantages provided by the surrounding terrain. To address the challenges of navigation and combat in modern 3D FPS games, we introduce a method that combines navigation mesh (Navmesh) and shooting‑rule with deep reinforcement learning (NSRL). The integration of Navmesh enhances the agent's global navigation capabilities while shooting behavior is controlled using rule‑based methods to ensure controllability. NSRL employs a DRL model to predict when to enable the navigation mesh, resulting in a diverse range of behaviors for the game AI. Customized rewards for human‑like behaviors are also employed to align PMCA's behavior with that of human players.

PaperID: 12779, https://arxiv.org/pdf/2410.04920.pdf

Authors: Achilleas Santi Seisa, Sumeet Gajanan Satpute, George Nikolakopoulos

Title: Cloud-Based Scheduling Mechanism for Scalable and Resource-Efficient Centralized Controllers

Abstract:
This paper proposes a novel approach to address the challenges of deploying complex robotic software in large‑scale systems, i.e., Centralized Nonlinear Model Predictive Controllers (CNMPCs) for multi‑agent systems. The proposed approach is based on a Kubernetes‑based scheduling mechanism designed to monitor and optimize the operation of CNMPCs, while addressing the scalability limitation of centralized control schemes. By leveraging a cluster in a real‑time cloud environment, the proposed mechanism effectively offloads the computational burden of CNMPCs. Through experiments, we have demonstrated the effectiveness and performance of our system, especially in scenarios where the number of robots is subject to change. Our work contributes to the advancement of cloud‑based control strategies and lays the foundation for enhanced performance in cloud‑controlled robotic systems.

PaperID: 12780, https://arxiv.org/pdf/2410.04871.pdf

Authors: Ziheng Liu, Jiayi Zhang, Enyu Shi, Yiyang Zhu, Derrick Wing Kwan Ng, Bo Ai

Title: Distributed Collaborative User Positioning for Cell-Free Massive MIMO with Multi-Agent Reinforcement Learning

Abstract:
In this paper, we investigate a cell‑free massive multiple‑input multiple‑output system, which exhibits great potential in enhancing the capabilities of next‑generation mobile communication networks. We first study the distributed positioning problem to lay the groundwork for solving resource allocation and interference management issues. Instead of relying on computationally and spatially complex fingerprint positioning methods, we propose a novel two‑stage distributed collaborative positioning architecture with multi‑agent reinforcement learning (MARL) network, consisting of a received signal strength‑based preliminary positioning network and an angle of arrival‑based auxiliary correction network. Our experimental results demonstrate that the two‑stage distributed collaborative user positioning architecture can outperform conventional fingerprint positioning methods in terms of positioning accuracy.

PaperID: 12781, https://arxiv.org/pdf/2410.04753.pdf

Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

Title: ImProver: Agent-Based Automated Proof Optimization

Abstract:
Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proofs assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human‑written proofs may not optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large‑language‑model agent that rewrites proofs to optimize arbitrary user‑defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain‑of‑States technique, as well as error‑correction and retrieval. We test ImProver on rewriting real‑world undergraduate, competition, and research‑level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

PaperID: 12782, https://arxiv.org/pdf/2410.04663.pdf

Authors: Abir Harrasse, Chaithanya Bandi, Hari Bandi

Title: Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

Abstract:
The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost‑aware, adversarial multi‑agent framework that orchestrates structured debate among role‑specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi‑Advocate One‑Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single‑Advocate Multi‑Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round‑r gap concentrates around the true difference and the probability of mis‑ranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT‑Bench, AlignBench, and AUTO‑J, showing state‑of‑the‑art agreement with human judgments (accuracy and Cohen's kappa), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost‑accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost‑aware LLM evaluation.

PaperID: 12783, https://arxiv.org/pdf/2410.04660.pdf

Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik

Title: KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA

Abstract:
Biomedical reasoning integrates structured, codified knowledge with tacit, experience‑driven insights. Depending on the context, quantity, and nature of available evidence, researchers and clinicians use diverse strategies, including rule‑based, prototype‑based, and case‑based reasoning. Effective medical AI models must handle this complexity while ensuring reliability and adaptability. We introduce KGARevion, a knowledge graph‑based agent that answers knowledge‑intensive questions. Upon receiving a query, KGARevion generates relevant triplets by leveraging the latent knowledge embedded in a large language model. It then verifies these triplets against a grounded knowledge graph, filtering out errors and retaining only accurate, contextually relevant information for the final answer. This multi‑step process strengthens reasoning, adapts to different models of medical inference, and outperforms retrieval‑augmented generation‑based approaches that lack effective verification mechanisms. Evaluations on medical QA benchmarks show that KGARevion improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGARevion improved accuracy by 10.4%. The agent integrates with different LLMs and biomedical knowledge graphs for broad applicability across knowledge‑intensive tasks. We evaluated KGARevion on AfriMed‑QA, a newly introduced dataset focused on African healthcare, demonstrating its strong zero‑shot generalization to underrepresented medical contexts.

PaperID: 12784, https://arxiv.org/pdf/2410.04619.pdf

Authors: Junwei Su, Peter Marbach

Title: The Role of Social Support and Influencers in Social Media Communities

Abstract:
How can individual agents coordinate their actions to achieve a shared objective in distributed systems? This challenge spans economic, technical, and sociological domains, each confronting scalability, heterogeneity, and conflicts between individual and collective goals. In economic markets, a common currency facilitates coordination, raising the question of whether such mechanisms can be applied in other contexts. This paper explores this idea within social media platforms, where social support (likes, shares, comments) acts as a currency that shapes content production and sharing. We investigate two key questions: (1) Can social support serve as an effective coordination tool, and (2) What role do influencers play in content creation and dissemination? Our formal analysis shows that social support can coordinate user actions similarly to money in economic markets. Influencers serve dual roles, aggregating content and acting as information proxies, guiding content producers in large markets. While imperfections in information lead to a "price of influence" and suboptimal outcomes, this price diminishes as markets grow, improving social welfare. These insights provide a framework for understanding coordination in distributed environments, with applications in both sociological systems and multi‑agent AI systems.

PaperID: 12785, https://arxiv.org/pdf/2410.04551.pdf

Authors: Amanda Aird, Elena Štefancová, Cassidy All, Amy Voida, Martin Homola, Nicholas Mattei, Robin Burke

Title: Social Choice for Heterogeneous Fairness in Recommendation

Abstract:
Algorithmic fairness in recommender systems requires close attention to the needs of a diverse set of stakeholders that may have competing interests. Previous work in this area has often been limited by fixed, single‑objective definitions of fairness, built into algorithms or optimization criteria that are applied to a single fairness dimension or, at most, applied identically across dimensions. These narrow conceptualizations limit the ability to adapt fairness‑aware solutions to the wide range of stakeholder needs and fairness definitions that arise in practice. Our work approaches recommendation fairness from the standpoint of computational social choice, using a multi‑agent framework. In this paper, we explore the properties of different social choice mechanisms and demonstrate the successful integration of multiple, heterogeneous fairness definitions across multiple data sets.

PaperID: 12786, https://arxiv.org/pdf/2410.04360.pdf

Authors: Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, Ji-Rong Wen

Title: GenSim: A General Social Simulation Platform with Large Language Model based Agents

Abstract:
With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM‑based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM‑agent‑based simulation platform called GenSim, which: (1) Abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) Supports one hundred thousand agents to better simulate large‑scale populations in real‑world contexts; (3) Incorporates error‑correction mechanisms to ensure more reliable and long‑term simulations. To evaluate our platform, we assess both the efficiency of large‑scale agent simulations and the effectiveness of the error‑correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large‑scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.

PaperID: 12787, https://arxiv.org/pdf/2410.04151.pdf

Authors: Sicong Peng, Bin Li, Lei Liu, Zesong Fei, Dusit Niyato

Title: Trajectory Design and Resource Allocation for Multi-UAV-Assisted Sensing, Communication, and Edge Computing Integration

Abstract:
In this paper, we propose a multi‑unmanned aerial vehicle (UAV)‑assisted integrated sensing, communication, and computation network. Specifically, the treble‑functional UAVs are capable of offering communication and edge computing services to mobile users (MUs) in proximity, alongside their target sensing capabilities by using multi‑input multi‑output arrays. For the purpose of enhance the computation efficiency, we consider task compression, where each MU can partially compress their offloaded data prior to transmission to trim its size. The objective is to minimize the weighted energy consumption by jointly optimizing the transmit beamforming, the UAVs' trajectories, the compression and offloading partition, the computation resource allocation, while fulfilling the causal‑effect correlation between communication and computation as well as adhering to the constraints on sensing quality. To tackle it, we first reformulate the original problem as a multi‑agent Markov decision process (MDP), which involves heterogeneous agents to decompose the large state spaces and action spaces of MDP. Then, we propose a multi‑agent proximal policy optimization algorithm with attention mechanism to handle the decision‑making problem. Simulation results validate the significant effectiveness of the proposed method in reducing energy consumption. Moreover, it demonstrates superior performance compared to the baselines in relation to resource utilization and convergence speed.

PaperID: 12788, https://arxiv.org/pdf/2410.04004.pdf

Authors: Krishna C. Kalagarla, Matthew Low, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Title: Compositional Planning for Logically Constrained Multi-Agent Markov Decision Processes

Abstract:
Designing control policies for large, distributed systems is challenging, especially in the context of critical, temporal logic based specifications (e.g., safety) that must be met with high probability. Compositional methods for such problems are needed for scalability, yet relying on worst‑case assumptions for decomposition tends to be overly conservative. In this work, we use the framework of Constrained Markov Decision Processes (CMDPs) to provide an assume‑guarantee based decomposition for synthesizing decentralized control policies, subject to logical constraints in a multi‑agent setting. The returned policies are guaranteed to satisfy the constraints with high probability and provide a lower bound on the achieved objective reward. We empirically find the returned policies to achieve near‑optimal rewards while enjoying an order of magnitude reduction in problem size and execution time.

PaperID: 12789, https://arxiv.org/pdf/2410.03997.pdf

Authors: Yuan Zhuang, Yi Shen, Zhili Zhang, Yuxiao Chen, Fei Miao

Title: YOLO-MARL: You Only LLM Once for Multi-Agent Reinforcement Learning

Abstract:
Advancements in deep multi‑agent reinforcement learning (MARL) have positioned it as a promising approach for decision‑making in cooperative games. However, it still remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently infer LLMs for actions that agents can take. In this work, we propose You Only LLM Once for MARL (YOLO‑MARL), a novel framework that leverages the high‑level task planning capabilities of LLMs to improve the policy learning process of multi‑agents in cooperative games. Notably, for each game environment, YOLO‑MARL only requires one time interaction with LLMs in the proposed strategy generation, state interpretation and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLMs API calls during training. Moreover, trained decentralized policies based on normal‑sized neural networks operate independently of the LLM. We evaluate our method across two different environments and demonstrate that YOLO‑MARL outperforms traditional MARL algorithms.

PaperID: 12790, https://arxiv.org/pdf/2410.03959.pdf

Authors: Zineng Tang, Lingjun Mao, Alane Suhr

Title: Grounding Language in Multi-Perspective Referential Communication

Abstract:
We introduce a task and dataset for referring expression generation and comprehension in multi‑agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human‑written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment training an open‑weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.

PaperID: 12791, https://arxiv.org/pdf/2410.03885.pdf

Authors: Brooks A. Butler, Chi Ho Leung, Philip E. Paré

Title: Collaborative Safety-Critical Formation Control with Obstacle Avoidance

Abstract:
This work explores a collaborative method for ensuring safety in multi‑agent formation control problems. We formulate a control barrier function (CBF) based safety filter control law for a generic distributed formation controller and extend our previously developed collaborative safety framework to an obstacle avoidance problem for agents with acceleration control inputs. We then incorporate multi‑obstacle collision avoidance into the collaborative safety framework. This framework includes a method for computing the maximum capability of agents to satisfy their individual safety requirements. We analyze the convergence rate of our collaborative safety algorithm, and prove the linear‑time convergence of cooperating agents to a jointly feasible safe action for all agents under the special case of a tree‑structured communication network with a single obstacle for each agent. We illustrate the analytical results via simulation on a mass‑spring kinematics‑based formation controller and demonstrate the finite‑time convergence of the collaborative safety algorithm in the simple proven case, the more general case of a fully‑connected system with multiple static obstacles, and with dynamic obstacles.

PaperID: 12792, https://arxiv.org/pdf/2410.03753.pdf

Authors: Jushan Chen

Title: A Brief Tutorial on Consensus ADMM for Distributed Optimization with Applications in Robotics

Abstract:
This paper presents a tutorial on the Consensus Alternating Direction Method of Multipliers (Consensus ADMM) for distributed optimization, with a specific focus on applications in multi‑robot systems. In this tutorial, we derive the consensus ADMM algorithm, highlighting its connections to the augmented Lagrangian and primal‑dual methods. Finally, we apply Consensus ADMM to an example problem for trajectory optimization of a multi‑agent system.

PaperID: 12793, https://arxiv.org/pdf/2410.03690.pdf

Authors: Louis Rosenberg, Hans Schumann, Christopher Dishop, Gregg Willcox, Anita Woolley, Ganesh Mani

Title: Conversational Swarms of Humans and AI Agents enable Hybrid Collaborative Decision-making

Abstract:
Conversational Swarm Intelligence (CSI) is an AI‑powered communication and collaboration technology that allows large, networked groups (of potentially unlimited size) to hold thoughtful conversational deliberations in real‑time. Inspired by the efficient decision‑making dynamics of fish schools, CSI divides a human population into a set of small subgroups connected by AI agents. This enables the full group to hold a unified conversation. In this study, groups of 25 participants were tasked with selecting a roster of players in a real Fantasy Baseball contest. A total of 10 trials were run using CSI. In half the trials, each subgroup was augmented with a fact‑providing AI agent referred to herein as an Infobot. The Infobot was loaded with a wide range of MLB statistics. The human participants could query the Infobot the same way they would query other persons in their subgroup. Results show that when using CSI, the 25‑person groups outperformed 72% of individually surveyed participants and showed significant intelligence amplification versus the mean score (p=0.016). The CSI‑enabled groups also significantly outperformed the most popular picks across the collected surveys for each daily contest (p<0.001). The CSI sessions that used Infobots scored slightly higher than those that did not, but it was not statistically significant in this study. That said, 85% of participants agreed with the statement 'Our decisions were stronger because of information provided by the Infobot' and only 4% disagreed. In addition, deliberations that used Infobots showed significantly less variance (p=0.039) in conversational content across members. This suggests that Infobots promoted more balanced discussions in which fewer members dominated the dialog. This may be because the infobot enabled participants to confidently express opinions with the support of factual data

PaperID: 12794, https://arxiv.org/pdf/2410.03688.pdf

Authors: Zhuoran Xiao, Chenhui Ye, Yunbo Hu, Honggang Yuan, Yihang Huang, Yijia Feng, Liyu Cai, Jiang Chang

Title: LLM Agents as 6G Orchestrator: A Paradigm for Task-Oriented Physical-Layer Automation

Abstract:
The rapid advancement in generative pre‑training models is propelling a paradigm shift in technological progression from basic applications such as chatbots towards more sophisticated agent‑based systems. It is with huge potential and necessity that the 6G system be combined with the copilot of large language model (LLM) agents and digital twins (DT) to manage the highly complicated communication system with new emerging features such as native AI service and sensing. With the 6G‑oriented agent, the base station could understand the transmission requirements of various dynamic upper‑layer tasks, automatically orchestrate the optimal system workflow. Through continuously get feedback from the 6G DT for reinforcement, the agents can finally raise the performance of practical system accordingly. Differing from existing LLM agents designed for general application, the 6G‑oriented agent aims to make highly rigorous and precise planning with a vast amount of extra expert knowledge, which inevitably requires a specific system design from model training to implementation. This paper proposes a novel comprehensive approach for building task‑oriented 6G LLM agents. We first propose a two‑stage continual pre‑training and fine‑tuning scheme to build the field basic model and diversities of specialized expert models for meeting the requirements of various application scenarios. Further, a novel inference framework based on semantic retrieval for leveraging the existing communication‑related functions is proposed. Experiment results of exemplary tasks, such as physical‑layer task decomposition, show the proposed paradigm's feasibility and effectiveness.

PaperID: 12795, https://arxiv.org/pdf/2410.03062.pdf

Authors: Grant Wardle, Teo Susnjak

Title: Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

Abstract:
This paper examines how the sequencing of images and text within multi‑modal prompts influences the reasoning performance of large language models (LLMs). We performed empirical evaluations using three commercial LLMs. Our results demonstrate that the order in which modalities are presented can significantly affect performance, particularly in tasks of varying complexity. For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy. However, in more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task. Our findings also highlight the importance of question/prompt structure. In nested and multi‑step reasoning tasks, modality sequencing played a key role in shaping model performance. While LLMs excelled in the initial stages of reasoning, they struggled to re‑incorporate earlier information, underscoring the challenges of multi‑hop reasoning within transformer architectures. This suggests that aligning the sequence of modalities with the logical flow of reasoning steps is more critical than modality order alone. These insights offer valuable implications for improving multi‑modal prompt design, with broader applications across fields such as education, medical imaging, and cross‑modal learning.

PaperID: 12796, https://arxiv.org/pdf/2410.03055.pdf

Authors: Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Béguelin

Title: Permissive Information-Flow Analysis for Large Language Models

Abstract:
Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt‑based retrieval augmentation, and (ii) a k‑nearest‑neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that the permissive label propagator improves over the baseline in more than 85% of the cases, which underscores the practicality of our approach.

PaperID: 12797, https://arxiv.org/pdf/2410.02829.pdf

Authors: Chang Xiao, Brenda Z. Yang

Title: LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents

Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game‑testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.

PaperID: 12798, https://arxiv.org/pdf/2410.02742.pdf

Authors: Haolan Liu, Jishen Zhao

Title: Grounding Large Language Models In Embodied Environment With Imperfect World Models

Abstract:
Despite a widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose a Grounding Large language model with Imperfect world MOdel (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize trining data. GLIMO incorporates an LLM agent‑based data generator to automatically create high‑quality and diverse instruction datasets. The generator includes an iterative self‑refining module for temporally consistent experience sampling, a diverse set of question‑answering instruction seeds, and a retrieval‑augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improve the performance of strong open‑source LLMs like LLaMA‑3 with a performance boost of 2.04 ×, 1.54 ×, and 1.82 × across three different benchmarks, respectively. The performance is able to compete with or surpass their larger counterparts such as GPT‑4.

PaperID: 12799, https://arxiv.org/pdf/2410.02664.pdf

Authors: Zeyang Liu, Xinrui Yang, Shiguang Sun, Long Qian, Lipeng Wan, Xingyu Chen, Xuguang Lan

Title: Grounded Answers for Multi-agent Decision-making Problem through Generative World Model

Abstract:
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi‑agent decision‑making problems because they miss the trial‑and‑error experience and reasoning as humans. To address this limitation, we explore a paradigm that integrates a language‑guided simulator into the multi‑agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi‑agent decision‑making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi‑Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.

PaperID: 12800, https://arxiv.org/pdf/2410.02637.pdf

Authors: Mayank Daswani, Mathias M. J. Bellaiche, Marc Wilson, Desislav Ivanov, Mikhail Papkov, Eva Schnider, Jing Tang, Kay Lamerigts, Gabriela Botea, Michael A. Sanchez, Yojan Patel, Shruthi Prabhakara, Shravya Shetty, Umesh Telang

Title: Plots Unlock Time-Series Understanding in Multimodal Models

Abstract:
While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi‑dimensional time‑series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data‑driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time‑series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time‑series data as text, with the additional benefit that visual time‑series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real‑world scenarios, we apply our approach to consumer health tasks ‑ specifically fall detection, activity recognition, and readiness assessment ‑ which involve heterogeneous, noisy data and multi‑step reasoning. The overall success in plot performance over text performance (up to an 120% performance increase on zero‑shot synthetic tasks, and up to 150% performance increase on real‑world tasks), across both GPT and Gemini model families, highlights our approach's potential for making the best use of the native capabilities of foundation models.

PaperID: 12801, https://arxiv.org/pdf/2410.02584.pdf

Authors: Angana Borah, Rada Mihalcea

Title: Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions

Abstract:
As Large Language Models (LLMs) continue to evolve, they are increasingly being employed in numerous studies to simulate societies and execute diverse social tasks. However, LLMs are susceptible to societal biases due to their exposure to human‑generated data. Given that LLMs are being used to gain insights into various societal aspects, it is essential to mitigate these biases. To that end, our study investigates the presence of implicit gender biases in multi‑agent LLM interactions and proposes two strategies to mitigate these biases. We begin by creating a dataset of scenarios where implicit gender biases might arise, and subsequently develop a metric to assess the presence of biases. Our empirical analysis reveals that LLMs generate outputs characterized by strong implicit bias associations (>= 50% of the time). Furthermore, these biases tend to escalate following multi‑agent interactions. To mitigate them, we propose two strategies: self‑reflection with in‑context examples (ICE); and supervised fine‑tuning. Our research demonstrates that both methods effectively mitigate implicit biases, with the ensemble of fine‑tuning and self‑reflection proving to be the most successful.

PaperID: 12802, https://arxiv.org/pdf/2410.02581.pdf

Authors: Joshua McClellan, Naveed Haghani, John Winder, Furong Huang, Pratap Tokekar

Title: Boosting Sample Efficiency and Generalization in Multi-agent Reinforcement Learning via Equivariance

Abstract:
Multi‑Agent Reinforcement Learning (MARL) struggles with sample inefficiency and poor generalization [1]. These challenges are partially due to a lack of structure or inductive bias in the neural networks typically used in learning the policy. One such form of structure that is commonly observed in multi‑agent scenarios is symmetry. The field of Geometric Deep Learning has developed Equivariant Graph Neural Networks (EGNN) that are equivariant (or symmetric) to rotations, translations, and reflections of nodes. Incorporating equivariance has been shown to improve learning efficiency and decrease error [ 2 ]. In this paper, we demonstrate that EGNNs improve the sample efficiency and generalization in MARL. However, we also show that a naive application of EGNNs to MARL results in poor early exploration due to a bias in the EGNN structure. To mitigate this bias, we present Exploration‑enhanced Equivariant Graph Neural Networks or E2GN2. We compare E2GN2 to other common function approximators using common MARL benchmarks MPE and SMACv2. E2GN2 demonstrates a significant improvement in sample efficiency, greater final reward convergence, and a 2x‑5x gain in over standard GNNs in our generalization tests. These results pave the way for more reliable and effective solutions in complex multi‑agent systems.

PaperID: 12803, https://arxiv.org/pdf/2410.02551.pdf

Authors: Zixiang Wang, Yinghao Zhu, Huiya Zhao, Xiaochen Zheng, Dehao Sui, Tianlong Wang, Wen Tang, Yasha Wang, Ewen Harrison, Chengwei Pan, Junyi Gao, Liantao Ma

Title: ColaCare: Enhancing Electronic Health Record Modeling through Large Language Model-Driven Multi-Agent Collaboration

Abstract:
We introduce ColaCare, a framework that enhances Electronic Health Record (EHR) modeling through multi‑agent collaboration driven by Large Language Models (LLMs). Our approach seamlessly integrates domain‑specific expert models with LLMs to bridge the gap between structured EHR data and text‑based reasoning. Inspired by the Multidisciplinary Team (MDT) approach used in clinical settings, ColaCare employs two types of agents: DoctorAgents and a MetaAgent, which collaboratively analyze patient data. Expert models process and generate predictions from numerical EHR data, while LLM agents produce reasoning references and decision‑making reports within the MDT‑driven collaborative consultation framework. The MetaAgent orchestrates the discussion, facilitating consultations and evidence‑based debates among DoctorAgents, simulating diverse expertise in clinical decision‑making. We additionally incorporate the Merck Manual of Diagnosis and Therapy (MSD) medical guideline within a retrieval‑augmented generation (RAG) module for medical evidence support, addressing the challenge of knowledge currency. Extensive experiments conducted on three EHR datasets demonstrate ColaCare's superior performance in clinical mortality outcome and readmission prediction tasks, underscoring its potential to revolutionize clinical decision support systems and advance personalized precision medicine. All code, case studies and a questionnaire are available at the project website: https://colacare.netlify.app.

PaperID: 12804, https://arxiv.org/pdf/2410.02516.pdf

Authors: Vasanth Reddy Baddam, Suat Gumussoy, Almuatazbellah Boker, Hoda Eldardiry

Title: Learning Emergence of Interaction Patterns across Independent RL Agents in Multi-Agent Environments

Abstract:
Many real‑world problems, such as controlling swarms of drones and urban traffic, naturally lend themselves to modeling as multi‑agent reinforcement learning (RL) problems. However, existing multi‑agent RL methods often suffer from scalability challenges, primarily due to the introduction of communication among agents. Consequently, a key challenge lies in adapting the success of deep learning in single‑agent RL to the multi‑agent setting. In response to this challenge, we propose an approach that fundamentally reimagines multi‑agent environments. Unlike conventional methods that model each agent individually with separate networks, our approach, the Bottom Up Network (BUN), adopts a unique perspective. BUN treats the collective of multi‑agents as a unified entity while employing a specialized weight initialization strategy that promotes independent learning. Furthermore, we dynamically establish connections among agents using gradient information, enabling coordination when necessary while maintaining these connections as limited and sparse to effectively manage the computational budget. Our extensive empirical evaluations across a variety of cooperative multi‑agent scenarios, including tasks such as cooperative navigation and traffic control, consistently demonstrate BUN's superiority over baseline methods with substantially reduced computational costs.

PaperID: 12805, https://arxiv.org/pdf/2410.02511.pdf

Authors: Yun Qu, Boyuan Wang, Yuhang Jiang, Jianzhun Shao, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji

Title: Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration

Abstract:
With expansive state‑action spaces, efficient multi‑agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant efforts brought by exploration without proper guidance choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task‑relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi‑Agent Exploration. Specifically, we ground linguistic knowledge from LLM into symbolic key states, that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace‑based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.

PaperID: 12806, https://arxiv.org/pdf/2410.02507.pdf

Authors: Weikang Yuan, Junjie Cao, Zhuoren Jiang, Yangyang Kang, Jun Lin, Kaisong Song, tianqianjin lin, Pengwei Yan, Changlong Sun, Xiaozhong Liu

Title: Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration

Abstract:
Large Language Models (LLMs) could struggle to fully understand legal theories and perform complex legal reasoning tasks. In this study, we introduce a challenging task (confusing charge prediction) to better evaluate LLMs' understanding of legal theories and reasoning capabilities. We also propose a novel framework: Multi‑Agent framework for improving complex Legal Reasoning capability (MALR). MALR employs non‑parametric learning, encouraging LLMs to automatically decompose complex legal tasks and mimic human learning process to extract insights from legal rules, helping LLMs better understand legal theories and enhance their legal reasoning abilities. Extensive experiments on multiple real‑world datasets demonstrate that the proposed framework effectively addresses complex reasoning issues in practical scenarios, paving the way for more reliable applications in the legal domain.

PaperID: 12807, https://arxiv.org/pdf/2410.02506.pdf

Authors: Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, Tianlong Chen

Title: Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems

Abstract:
Recent advancements in large language model (LLM)‑powered agents have shown that collective intelligence can significantly outperform individual capabilities, largely attributed to the meticulously designed inter‑agent communication topologies. Though impressive in performance, existing multi‑agent pipelines inherently introduce substantial token overhead, as well as increased economic costs, which pose challenges for their large‑scale deployments. In response to this challenge, we propose an economical, simple, and robust multi‑agent communication framework, termed \textttAgentPrune, which can seamlessly integrate into mainstream multi‑agent systems and prunes redundant or even malicious communication messages. Technically, \textttAgentPrune is the first to identify and formally define the communication redundancy issue present in current LLM‑based multi‑agent pipelines, and efficiently performs one‑shot pruning on the spatial‑temporal message‑passing graph, yielding a token‑economic and high‑performing communication topology. Extensive experiments across six benchmarks demonstrate that \textttAgentPrune (I) achieves comparable results as state‑of‑the‑art topologies at merely \5.6 cost compared to their \43.7, (II) integrates seamlessly into existing multi‑agent frameworks with 28.1%～72.8%\downarrow token reduction, and (III) successfully defend against two types of agent‑based adversarial attacks with 3.5%～10.8%\uparrow performance boost.

PaperID: 12808, https://arxiv.org/pdf/2410.02406.pdf

Authors: Mengxu Pan, Alexandra Kitson, Hongyu Wan, Mirjana Prpa

Title: ELLMA-T: an Embodied LLM-agent for Supporting English Language Learning in Social VR

Abstract:
Many people struggle with learning a new language, with traditional tools falling short in providing contextualized learning tailored to each learner's needs. The recent development of large language models (LLMs) and embodied conversational agents (ECAs) in social virtual reality (VR) provide new opportunities to practice language learning in a contextualized and naturalistic way that takes into account the learner's language level and needs. To explore this opportunity, we developed ELLMA‑T, an ECA that leverages an LLM (GPT‑4) and situated learning framework for supporting learning English language in social VR (VRChat). Drawing on qualitative interviews (N=12), we reveal the potential of ELLMA‑T to generate realistic, believable and context‑specific role plays for agent‑learner interaction in VR, and LLM's capability to provide initial language assessment and continuous feedback to learners. We provide five design implications for the future development of LLM‑based language agents in social VR.

PaperID: 12809, https://arxiv.org/pdf/2410.02400.pdf

Authors: Sarah Sachs, Hedi Hadiji, Tim van Erven, Mathias Staudigl

Title: An Online Feasible Point Method for Benign Generalized Nash Equilibrium Problems

Abstract:
We consider a repeatedly played generalized Nash equilibrium game. This induces a multi‑agent online learning problem with joint constraints. An important challenge in this setting is that the feasible set for each agent depends on the simultaneous moves of the other agents and, therefore, varies over time. As a consequence, the agents face time‑varying constraints, which are not adversarial but rather endogenous to the system. Prior work in this setting focused on convergence to a feasible solution in the limit via integrating the constraints in the objective as a penalty function. However, no existing work can guarantee that the constraints are satisfied for all iterations while simultaneously guaranteeing convergence to a generalized Nash equilibrium. This is a problem of fundamental theoretical interest and practical relevance. In this work, we introduce a new online feasible point method. Under the assumption that limited communication between the agents is allowed, this method guarantees feasibility. We identify the class of benign generalized Nash equilibrium problems, for which the convergence of our method to the equilibrium is guaranteed. We set this class of benign generalized Nash equilibrium games in context with existing definitions and illustrate our method with examples.

PaperID: 12810, https://arxiv.org/pdf/2410.02203.pdf

Authors: Jiale Fu, Yaqing Wang, Simeng Han, Jiaming Fan, Xu Yang

Title: GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning

Abstract:
In‑context learning (ICL) enhances large language models (LLMs) by incorporating demonstration examples, yet its effectiveness heavily depends on the quality of selected examples. Current methods typically use text embeddings to measure semantic similarity, which often introduces bias in multi‑step reasoning tasks. This occurs because text embeddings contain irrelevant semantic information and lack deeper reasoning structures. To address this, we propose GraphIC, a graph‑based retrieval model that leverages reasoning‑aware representation and specialized similarity metric for in‑context example retrieval. GraphIC first constructs thought graphs‑directed, node‑attributed graphs that explicitly model reasoning steps and their dependencies‑for candidate examples and queries. This approach filters out superficial semantics while preserving essential reasoning processes. Next, GraphIC retrieves examples using a novel similarity metric tailored for these graphs, capturing sequential reasoning patterns and asymmetry between examples. Comprehensive evaluations across mathematical reasoning, code generation, and logical reasoning tasks demonstrate that GraphIC outperforms 10 baseline methods. Our results highlight the importance of reasoning‑aware retrieval in ICL, offering a robust solution for enhancing LLM performance in multi‑step reasoning scenarios.

PaperID: 12811, https://arxiv.org/pdf/2410.02165.pdf

Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Harry Shomer, Hui Liu, Yasemin Copur-Gencturk, Jiliang Tang

Title: A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

Abstract:
Open‑ended short‑answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA). However, SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments. With recent advancements in natural language processing (NLP), automatic short‑answer grading (ASAG) offers a promising solution to these challenges. Despite this, current ASAG algorithms are often limited in generalizability and tend to be tailored to specific questions. In this paper, we propose a unified multi‑agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs. More importantly, GradeOpt incorporates two additional LLM‑based agents ‑ the reflector and the refiner ‑ into the multi‑agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self‑reflection on its errors. Through experiments on a challenging ASAG task, namely the grading of pedagogical content knowledge (PCK) and content knowledge (CK) questions, GradeOpt demonstrates superior performance in grading accuracy and behavior alignment with human graders compared to representative baselines. Finally, comprehensive ablation studies confirm the effectiveness of the individual components designed in GradeOpt.

PaperID: 12812, https://arxiv.org/pdf/2410.02128.pdf

Authors: Stefan Juang, Hugh Cao, Arielle Zhou, Ruochen Liu, Nevin L. Zhang, Elvis Liu

Title: Breaking the mold: The challenge of large scale MARL specialization

Abstract:
In multi‑agent learning, the predominant approach focuses on generalization, often neglecting the optimization of individual agents. This emphasis on generalization limits the ability of agents to utilize their unique strengths, resulting in inefficiencies. This paper introduces Comparative Advantage Maximization (CAM), a method designed to enhance individual agent specialization in multiagent systems. CAM employs a two‑phase process, combining centralized population training with individual specialization through comparative advantage maximization. CAM achieved a 13.2% improvement in individual agent performance and a 14.9% increase in behavioral diversity compared to state‑of‑the‑art systems. The success of CAM highlights the importance of individual agent specialization, suggesting new directions for multi‑agent system development.

PaperID: 12813, https://arxiv.org/pdf/2410.02052.pdf

Authors: Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu

Title: ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning

Abstract:
Autonomous agents have demonstrated significant potential in automating complex multistep decision‑making tasks. However, even state‑of‑the‑art vision‑language models (VLMs), such as GPT‑4o, still fall short of human‑level performance, particularly in intricate web environments and long‑horizon tasks. To address these limitations, we present ExACT, an approach to combine test‑time search and self‑learning to build o1‑like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R‑MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly. R‑MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi‑agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms. On the challenging VisualWebArena benchmark, our GPT‑4o based R‑MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state‑of‑the‑art. Additionally, we show that the knowledge and experience gained from test‑time search can be effectively transferred back to GPT‑4o via fine‑tuning. After Exploratory Learning, GPT‑4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R‑MCTS's performance while using significantly less compute. Notably, our work demonstrates the compute scaling properties in both training ‑ data collection with R‑MCTS ‑ and testing time. These results suggest a promising research direction to enhance VLMs' capabilities for agentic applications via test‑time search and self‑learning.

PaperID: 12814, https://arxiv.org/pdf/2410.02038.pdf

Authors: Kyungmin Kim, Davide Corsi, Andoni Rodriguez, JB Lanier, Benjami Parellada, Pierre Baldi, Cesar Sanchez, Roy Fox

Title: Realizable Continuous-Space Shields for Safe Reinforcement Learning

Abstract:
While Deep Reinforcement Learning (DRL) has achieved remarkable success across various domains, it remains vulnerable to occasional catastrophic failures without additional safeguards. An effective solution to prevent these failures is to use a shield that validates and adjusts the agent's actions to ensure compliance with a provided set of safety specifications. For real‑world robotic domains, it is essential to define safety specifications over continuous state and action spaces to accurately account for system dynamics and compute new actions that minimally deviate from the agent's original decision. In this paper, we present the first shielding approach specifically designed to ensure the satisfaction of safety requirements in continuous state and action spaces, making it suitable for practical robotic applications. Our method builds upon realizability, an essential property that confirms the shield will always be able to generate a safe action for any state in the environment. We formally prove that realizability can be verified for stateful shields, enabling the incorporation of non‑Markovian safety requirements, such as loop avoidance. Finally, we demonstrate the effectiveness of our approach in ensuring safety without compromising the policy's success rate by applying it to a navigation problem and a multi‑agent particle environment.

PaperID: 12815, https://arxiv.org/pdf/2410.02026.pdf

Authors: Yuan Zhou, Peng Zhang, Mengya Song, Alice Zheng, Yiwen Lu, Zhiheng Liu, Yong Chen, Zhaohan Xi

Title: Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics

Abstract:
Large language models (LLMs) have demonstrated remarkable progress in healthcare. However, a significant gap remains regarding LLMs' professionalism in domain‑specific clinical practices, limiting their application in real‑world diagnostics. In this work, we introduce ZODIAC, an LLM‑powered framework with cardiologist‑level professionalism designed to engage LLMs in cardiological diagnostics. ZODIAC assists cardiologists by extracting clinically relevant characteristics from patient data, detecting significant arrhythmias, and generating preliminary reports for the review and refinement by cardiologists. To achieve cardiologist‑level professionalism, ZODIAC is built on a multi‑agent collaboration framework, enabling the processing of patient data across multiple modalities. Each LLM agent is fine‑tuned using real‑world patient data adjudicated by cardiologists, reinforcing the model's professionalism. ZODIAC undergoes rigorous clinical validation with independent cardiologists, evaluated across eight metrics that measure clinical effectiveness and address security concerns. Results show that ZODIAC outperforms industry‑leading models, including OpenAI's GPT‑4o, Meta's Llama‑3.1‑405B, and Google's Gemini‑pro, as well as medical‑specialist LLMs like Microsoft's BioGPT. ZODIAC demonstrates the transformative potential of specialized LLMs in healthcare by delivering domain‑specific solutions that meet the stringent demands of medical practice. Notably, ZODIAC has been successfully integrated into electrocardiography (ECG) devices, exemplifying the growing trend of embedding LLMs into Software‑as‑Medical‑Device (SaMD).

PaperID: 12816, https://arxiv.org/pdf/2410.01954.pdf

Authors: The Viet Bui, Thanh Hong Nguyen, Tien Mai

Title: ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization

Abstract:
Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre‑collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single‑agent settings, offline multi‑agent reinforcement learning (MARL) presents additional challenges due to the large joint state‑action space and the complexity of multi‑agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents' local policies and the expansive joint state‑action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q‑functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi‑agent value decomposition strategy to facilitate multi‑agent training. Through extensive experiments on the multi‑agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state‑of‑the‑art offline MARL methods across nearly all tasks.

PaperID: 12817, https://arxiv.org/pdf/2410.01943.pdf

Authors: Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, Sercan O. Arik

Title: CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Abstract:
In tackling the challenges of large language model (LLM) performance for Text‑to‑SQL tasks, we introduce CHASE‑SQL, a new framework that employs innovative strategies, using test‑time compute in multi‑agent modeling to improve candidate generation and selection. CHASE‑SQL leverages LLMs' intrinsic knowledge to generate diverse and high‑quality SQL candidates using different LLM generators with: (1) a divide‑and‑conquer method that decomposes complex queries into manageable sub‑queries in a single LLM call; (2) chain‑of‑thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance‑aware synthetic example generation technique, which offers specific few‑shot demonstrations tailored to test questions.To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine‑tuned binary‑candidates selection LLM. This selection approach has been demonstrated to be more robust over alternatives. The proposed generators‑selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE‑SQL achieves the state‑of‑the‑art execution accuracy of 73.0% and 73.01% on the test set and development set of the notable BIRD Text‑to‑SQL dataset benchmark, rendering CHASE‑SQL the top submission of the leaderboard (at the time of paper submission).

PaperID: 12818, https://arxiv.org/pdf/2410.01920.pdf

Authors: Shengyu Feng, Xiang Kong, Shuang Ma, Aonan Zhang, Dong Yin, Chong Wang, Ruoming Pang, Yiming Yang

Title: Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

Abstract:
Augmenting the multi‑step reasoning abilities of Large Language Models (LLMs) has been a persistent challenge. Recently, verification has shown promise in improving solution consistency by evaluating generated outputs. However, current verification approaches suffer from sampling inefficiencies, requiring a large number of samples to achieve satisfactory performance. Additionally, training an effective verifier often depends on extensive process supervision, which is costly to acquire. In this paper, we address these limitations by introducing a novel verification method based on Twisted Sequential Monte Carlo (TSMC). TSMC sequentially refines its sampling effort to focus exploration on promising candidates, resulting in more efficient generation of high‑quality solutions. We apply TSMC to LLMs by estimating the expected future rewards at partial solutions. This approach results in a more straightforward training target that eliminates the need for step‑wise human annotations. We empirically demonstrate the advantages of our method across multiple math benchmarks, and also validate our theoretical analysis of both our approach and existing verification methods.

PaperID: 12819, https://arxiv.org/pdf/2410.01840.pdf

Authors: Quanquan Shao, Yi Fang

Title: Target Pose Guided Whole-body Grasping Motion Generation for Digital Humans

Abstract:
Grasping manipulation is a fundamental mode for human interaction with daily life objects. The synthesis of grasping motion is also greatly demanded in many applications such as animation and robotics. In objects grasping research field, most works focus on generating the last static grasping pose with a parallel gripper or dexterous hand. Grasping motion generation for the full arm especially for the full humanlike intelligent agent is still under‑explored. In this work, we propose a grasping motion generation framework for digital human which is an anthropomorphic intelligent agent with high degrees of freedom in virtual world. Given an object known initial pose in 3D space, we first generate a target pose for whole‑body digital human based on off‑the‑shelf target grasping pose generation methods. With an initial pose and this generated target pose, a transformer‑based neural network is used to generate the whole grasping trajectory, which connects initial pose and target pose smoothly and naturally. Additionally, two post optimization components are designed to mitigates foot‑skating issue and hand‑object interpenetration separately. Experiments are conducted on GRAB dataset to demonstrate effectiveness of this proposed method for whole‑body grasping motion generation with randomly placed unknown objects.

PaperID: 12820, https://arxiv.org/pdf/2410.01811.pdf

Authors: Fiifi Dawson, Zainab Mosunmola, Sahil Pocker, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat

Title: Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English

Abstract:
Although LLMs have been extremely effective in a large number of complex tasks, their understanding and functionality for regional languages and cultures are not well studied. In this paper, we explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa). Using Hofstede's six cultural dimensions: Power Distance (PDI), Individualism (IDV), Motivation towards Achievement and Success (MAS), Uncertainty Avoidance (UAV), Long Term Orientation (LTO), and Indulgence (IVR), we quantify the cultural awareness of LLM‑based responses. We demonstrate that although LLMs show a high cultural similarity for English, they fail to capture the cultural nuances across these 6 metrics for Malayalam and Yoruba. We also highlight the need for large‑scale regional language LLM training with culturally enriched datasets. This will have huge implications for enhancing the user experience of chat‑based LLMs and also improving the validity of large‑scale LLM agent‑based market research.

PaperID: 12821, https://arxiv.org/pdf/2410.01798.pdf

Authors: Rishi Veerapaneni, Muhammad Suhail Saleem, Jiaoyang Li, Maxim Likhachev

Title: Windowed MAPF with Completeness Guarantees

Abstract:
Traditional multi‑agent path finding (MAPF) methods try to compute entire start‑goal paths which are collision free. However, computing an entire path can take too long for MAPF systems where agents need to replan fast. Methods that address this typically employ a "windowed" approach and only try to find collision free paths for a small windowed timestep horizon. This adaptation comes at the cost of incompleteness; all current windowed approaches can become stuck in deadlock or livelock. Our main contribution is to introduce our framework, WinC‑MAPF, for Windowed MAPF that enables completeness. Our framework uses heuristic update insights from single‑agent real‑time heuristic search algorithms as well as agent independence ideas from MAPF algorithms. We also develop Single‑Step CBS (SS‑CBS), an instantiation of this framework using a novel modification to CBS. We show how SS‑CBS, which only plans a single step and updates heuristics, can effectively solve tough scenarios where existing windowed approaches fail.

PaperID: 12822, https://arxiv.org/pdf/2410.01763.pdf

Authors: Rebekah A. Gelpí, Yikai Tang, Ethan C. Jackson, William A. Cunningham

Title: Social coordination perpetuates stereotypic expectations and behaviors across generations in deep multi-agent reinforcement learning

Abstract:
Despite often being perceived as morally objectionable, stereotypes are a common feature of social groups, a phenomenon that has often been attributed to biased motivations or limits on the ability to process information. We argue that one reason for this continued prevalence is that pre‑existing expectations about how others will behave, in the context of social coordination, can change the behaviors of one's social partners, creating the very stereotype one expected to see, even in the absence of other potential sources of stereotyping. We use a computational model of dynamic social coordination to illustrate how this "feedback loop" can emerge, engendering and entrenching stereotypic behavior, and then show that human behavior on the task generates a comparable feedback loop. Notably, people's choices on the task are not related to social dominance or system justification, suggesting biased motivations are not necessary to maintain these stereotypes.

PaperID: 12823, https://arxiv.org/pdf/2410.01706.pdf

Authors: Omayma Mahjoub, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Simon du Toit, Jemma Daniel, Louay Ben Nessir, Louise Beyers, Claude Formanek, Liam Clark, Arnu Pretorius

Title: Sable: a Performant, Efficient and Scalable Sequence Model for MARL

Abstract:
As multi‑agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency, and (3) scalability. In this work, we introduce Sable, a performant, memory‑efficient, and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi‑agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state‑of‑the‑art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage.

PaperID: 12824, https://arxiv.org/pdf/2410.01696.pdf

Authors: Kangsheng Wang, Xiao Zhang, Hao Liu, Songde Han, Huimin Ma, Tianyu Hu

Title: CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs

Abstract:
Large language models (LLMs) have demonstrated limitations in handling combinatorial optimization problems involving long‑range reasoning, partially due to causal hallucinations and huge search space. As for causal hallucinations, i.e., the inconsistency between reasoning and corresponding state transition, this paper introduces the Causal Relationship Enhancement (CRE) mechanism combining cause‑effect interventions and the Individual Treatment Effect (ITE) to guarantee the solid causal rightness between each step of reasoning and state transition. As for the long causal range and huge search space limiting the performances of existing models featuring single‑direction search, a Dual‑End Searching (DES) approach is proposed to seek solutions by simultaneously starting from both the initial and goal states on the causal probability tree. By integrating CRE and DES (CreDes), our model has realized simultaneous multi‑step reasoning, circumventing the inefficiencies from cascading multiple one‑step reasoning like the Chain‑of‑Thought (CoT). Experiments demonstrate that CreDes significantly outperforms existing State‑Of‑The‑Art (SOTA) solutions in long‑range reasoning tasks in terms of both accuracy and time efficiency.

PaperID: 12825, https://arxiv.org/pdf/2410.01639.pdf

Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

Title: Moral Alignment for LLM Agents

Abstract:
Decision‑making agents based on pre‑trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are underway to develop more generalist agents. As LLM‑based systems become more agentic, their influence on human activity will grow and their transparency will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit, opaque and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly and transparently encode core human values for Reinforcement Learning‑based fine‑tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine‑tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine‑tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost‑effective alternative to currently predominant alignment techniques.

PaperID: 12826, https://arxiv.org/pdf/2410.01630.pdf

Authors: Shaokang Wu, Yijin Wang, Yanlong Huang

Title: One-Shot Robust Imitation Learning for Long-Horizon Visuomotor Tasks from Unsegmented Demonstrations

Abstract:
In contrast to single‑skill tasks, long‑horizon tasks play a crucial role in our daily life, e.g., a pouring task requires a proper concatenation of reaching, grasping and pouring subtasks. As an efficient solution for transferring human skills to robots, imitation learning has achieved great progress over the last two decades. However, when learning long‑horizon visuomotor skills, imitation learning often demands a large amount of semantically segmented demonstrations. Moreover, the performance of imitation learning could be susceptible to external perturbation and visual occlusion. In this paper, we exploit dynamical movement primitives and meta‑learning to provide a new framework for imitation learning, called Meta‑Imitation Learning with Adaptive Dynamical Primitives (MiLa). MiLa allows for learning unsegmented long‑horizon demonstrations and adapting to unseen tasks with a single demonstration. MiLa can also resist external disturbances and visual occlusion during task execution. Real‑world robotic experiments demonstrate the superiority of MiLa, irrespective of visual occlusion and random perturbations on robots.

PaperID: 12827, https://arxiv.org/pdf/2410.01450.pdf

Authors: Hong-Hsiang Liu, Yi-Wen Liu

Title: Agent-Driven Large Language Models for Mandarin Lyric Generation

Abstract:
Generative Large Language Models have shown impressive in‑context learning abilities, performing well across various tasks with just a prompt. Previous melody‑to‑lyric research has been limited by scarce high‑quality aligned data and unclear standard for creativeness. Most efforts focused on general themes or emotions, which are less valuable given current language model capabilities. In tonal contour languages like Mandarin, pitch contours are influenced by both melody and tone, leading to variations in lyric‑melody fit. Our study, validated by the Mpop600 dataset, confirms that lyricists and melody writers consider this fit during their composition process. In this research, we developed a multi‑agent system that decomposes the melody‑to‑lyric task into sub‑tasks, with each agent controlling rhyme, syllable count, lyric‑melody alignment, and consistency. Listening tests were conducted via a diffusion‑based singing voice synthesizer to evaluate the quality of lyrics generated by different agent groups.

PaperID: 12828, https://arxiv.org/pdf/2410.01364.pdf

Authors: Yutian Zhang, Guohong Zheng, Zhiyuan Liu, Quan Li, Haipeng Zeng

Title: MARLens: Understanding Multi-agent Reinforcement Learning for Traffic Signal Control via Visual Analytics

Abstract:
The issue of traffic congestion poses a significant obstacle to the development of global cities. One promising solution to tackle this problem is intelligent traffic signal control (TSC). Recently, TSC strategies leveraging reinforcement learning (RL) have garnered attention among researchers. However, the evaluation of these models has primarily relied on fixed metrics like reward and queue length. This limited evaluation approach provides only a narrow view of the model's decision‑making process, impeding its practical implementation. Moreover, effective TSC necessitates coordinated actions across multiple intersections. Existing visual analysis solutions fall short when applied in multi‑agent settings. In this study, we delve into the challenge of interpretability in multi‑agent reinforcement learning (MARL), particularly within the context of TSC. We propose MARLens a visual analytics system tailored to understand MARL‑based TSC. Our system serves as a versatile platform for both RL and TSC researchers. It empowers them to explore the model's features from various perspectives, revealing its decision‑making processes and shedding light on interactions among different agents. To facilitate quick identification of critical states, we have devised multiple visualization views, complemented by a traffic simulation module that allows users to replay specific training scenarios. To validate the utility of our proposed system, we present three comprehensive case studies, incorporate insights from domain experts through interviews, and conduct a user study. These collective efforts underscore the feasibility and effectiveness of MARLens in enhancing our understanding of MARL‑based TSC systems and pave the way for more informed and efficient traffic management strategies.

PaperID: 12829, https://arxiv.org/pdf/2410.01307.pdf

Authors: Mohit Bhatnagar

Title: FanCric : Multi-Agentic Framework for Crafting Fantasy 11 Cricket Teams

Abstract:
Cricket, with its intricate strategies and deep history, increasingly captivates a global audience. The Indian Premier League (IPL), epitomizing Twenty20 cricket, showcases talent in a format that lasts just a few hours as opposed to the longer forms of the game. Renowned for its fusion of technology and fan engagement, the IPL stands as the world's most popular cricket league. This study concentrates on Dream11, India's leading fantasy cricket league for IPL, where participants craft virtual teams based on real player performances to compete internationally. Building a winning fantasy team requires navigating various complex factors including player form and match conditions. Traditionally, this has been approached through operations research and machine learning. This research introduces the FanCric framework, an advanced multi‑agent system leveraging Large Language Models (LLMs) and a robust orchestration framework to enhance fantasy team selection in cricket. FanCric employs both structured and unstructured data to surpass traditional methods by incorporating sophisticated AI technologies. The analysis involved scrutinizing approximately 12.7 million unique entries from a Dream11 contest, evaluating FanCric's efficacy against the collective wisdom of crowds and a simpler Prompt Engineering approach. Ablation studies further assessed the impact of generating varying numbers of teams. The exploratory findings are promising, indicating that further investigation into FanCric's capabilities is warranted to fully realize its potential in enhancing strategic decision‑making using LLMs in fantasy sports and business in general.

PaperID: 12830, https://arxiv.org/pdf/2410.01242.pdf

Authors: Haolin Jin, Zechao Sun, Huaming Chen

Title: RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

Abstract:
Large Language Models (LLMs) have shown incredible potential in code generation tasks, and recent research in prompt engineering have enhanced LLMs' understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code based on task descriptions, their accuracy remains limited, especially for complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation is primarily due to the LLMs' need to simultaneously comprehend text and generate syntactically and semantically correct code, without having the capability to automatically refine the code. In real‑world software development, programmers rarely produce flawless code in a single attempt based on the task description alone, they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel architecture of LLM‑based agents for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi‑LLM‑based agent debugger that leverages three distinct LLM agents‑Guide Agent, Debug Agent, and Feedback Agent. RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self‑reflection and feedback. Experimental results demonstrate that RGD exhibits remarkable code generation capabilities, achieving state‑of‑the‑art performance with a 9.8% improvement on the HumanEval dataset and a 16.2% improvement on the MBPP dataset compared to the state‑of‑the‑art approaches and traditional direct prompting approaches. We highlight the effectiveness of the RGD framework in enhancing LLMs' ability to generate and refine code autonomously.

PaperID: 12831, https://arxiv.org/pdf/2410.01101.pdf

Authors: Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni

Title: Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Abstract:
We study the problem of learning an approximate equilibrium in the offline multi‑agent reinforcement learning (MARL) setting. We introduce a structural assumption ‑‑ the interaction rank ‑‑ and establish that functions with low interaction rank are significantly more robust to distribution shift compared to general ones. Leveraging this observation, we demonstrate that utilizing function classes with low interaction rank, when combined with regularization and no‑regret learning, admits decentralized, computationally and statistically efficient learning in offline MARL. Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single‑agent value decomposition architectures.

PaperID: 12832, https://arxiv.org/pdf/2410.01039.pdf

Authors: Tomas Goldsack, Yang Wang, Chenghua Lin, Chung-Chi Chen

Title: From Facts to Insights: A Study on the Generation and Evaluation of Analytical Reports for Deciphering Earnings Calls

Abstract:
This paper explores the use of Large Language Models (LLMs) in the generation and evaluation of analytical reports derived from Earnings Calls (ECs). Addressing a current gap in research, we explore the generation of analytical reports with LLMs in a multi‑agent framework, designing specialized agents that introduce diverse viewpoints and desirable topics of analysis into the report generation process. Through multiple analyses, we examine the alignment between generated and human‑written reports and the impact of both individual and collective agents. Our findings suggest that the introduction of additional agents results in more insightful reports, although reports generated by human experts remain preferred in the majority of cases. Finally, we address the challenging issue of report evaluation, we examine the limitations and strengths of LLMs in assessing the quality of generated reports in different settings, revealing a significant correlation with human experts across multiple dimensions.

PaperID: 12833, https://arxiv.org/pdf/2410.00774.pdf

Authors: Hyogo Hiruma, Hiroshi Ito, Tetusya Ogata

Title: Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction

Abstract:
Uncertainty of environments has long been a difficult characteristic to handle, when performing real‑world robot tasks. This is because the uncertainty produces unexpected observations that cannot be covered by manual scripting. Learning based robot controlling methods are a promising approach for generating flexible motions against unknown situations, but still tend to suffer under uncertainty due to its deterministic nature. In order to adaptively perform the target task under such conditions, the robot control model must be able to accurately understand the possible uncertainty, and to exploratively derive the optimal action that minimizes such uncertainty. This paper extended an existing predictive learning based robot control method, which employ foresight prediction using dynamic internal simulation. The foresight module refines the model's hidden states by sampling multiple possible futures and replace with the one that led to the lower future uncertainty. The adaptiveness of the model was evaluated on a door opening task. The door can be opened either by pushing, pulling, or sliding, but robot cannot visually distinguish which way, and is required to adapt on the fly. The results showed that the proposed model adaptively diverged its motion through interaction with the door, whereas conventional methods failed to stably diverge. The models were analyzed on Lyapunov exponents of RNN hidden states which reflect the possible divergence at each time step during task execution. The result indicated that the foresight module biased the model to consider future consequences, which lead to embedding uncertainties at the policy of the robot controller, rather than the resultant observation. This is beneficial for implementing adaptive behaviors, which indices derivation of diverse motion during exploration.

PaperID: 12834, https://arxiv.org/pdf/2410.00517.pdf

Authors: Oscar Gil Viyuela, Alberto Sanfeliu

Title: Human-Robot Collaborative Minimum Time Search through Sub-priors in Ant Colony Optimization

Abstract:
Human‑Robot Collaboration (HRC) has evolved into a highly promising issue owing to the latest breakthroughs in Artificial Intelligence (AI) and Human‑Robot Interaction (HRI), among other reasons. This emerging growth increases the need to design multi‑agent algorithms that can manage also human preferences. This paper presents an extension of the Ant Colony Optimization (ACO) meta‑heuristic to solve the Minimum Time Search (MTS) task, in the case where humans and robots perform an object searching task together. The proposed model consists of two main blocks. The first one is a convolutional neural network (CNN) that provides the prior probabilities about where an object may be from a segmented image. The second one is the Sub‑prior MTS‑ACO algorithm (SP‑MTS‑ACO), which takes as inputs the prior probabilities and the particular search preferences of the agents in different sub‑priors to generate search plans for all agents. The model has been tested in real experiments for the joint search of an object through a Vizanti web‑based visualization in a tablet computer. The designed interface allows the communication between a human and our humanoid robot named IVO. The obtained results show an improvement in the search perception of the users without loss of efficiency.

PaperID: 12835, https://arxiv.org/pdf/2410.00371.pdf

Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo

Title: AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Abstract:
Robotic manipulation in open‑world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision‑language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem‑solving abilities, they still struggle with failure recognition, limiting their real‑world applicability. We introduce AHA, an open‑source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free‑form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine‑tuned AHA using FailGen, a scalable framework that generates the first large‑scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real‑world failure datasets, robotic systems, and unseen tasks. It surpasses the second‑best model (GPT‑4o in‑context learning) by 10.3% and exceeds the average performance of six compared models including five state‑of‑the‑art VLMs by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero‑shot trajectory generation. AHA's failure feedback enhances these policies' performances by refining dense reward functions, optimizing task planning, and improving sub‑task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT‑4 models.

PaperID: 12836, https://arxiv.org/pdf/2410.00354.pdf

Authors: Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao

Title: Hierarchical Organization Simulacra in the Investment Sector

Abstract:
This paper explores designing artificial organizations with professional behavior in investments using a multi‑agent simulation. The method mimics hierarchical decision‑making in investment firms, using news articles to inform decisions. A large‑scale study analyzing over 115,000 news articles of 300 companies across 15 years compared this approach against professional traders' decisions. Results show that hierarchical simulations align closely with professional choices, both in frequency and profitability. However, the study also reveals biases in decision‑making, where changes in prompt wording and perceived agent seniority significantly influence outcomes. This highlights both the potential and limitations of large language models in replicating professional financial decision‑making.

PaperID: 12837, https://arxiv.org/pdf/2410.00272.pdf

Authors: Zida Wu, Ankur Mehta

Title: Decentralized Input and State Estimation for Multi-agent System with Dynamic Topology and Heterogeneous Sensor Network

Abstract:
A crucial challenge in decentralized systems is state estimation in the presence of unknown inputs, particularly within heterogeneous sensor networks with dynamic topologies. While numerous consensus algorithms have been introduced, they often require extensive information exchange or multiple communication iterations to ensure estimation accuracy. This paper proposes an efficient algorithm that achieves an unbiased and optimal solution comparable to filters with full information about other agents. This is accomplished through the use of information filter decomposition and the fusion of inputs via covariance intersection. Our method requires only a single communication iteration for exchanging individual estimates between agents, instead of multiple rounds of information exchange, thus preserving agents' privacy by avoiding the sharing of explicit observations and system equations. Furthermore, to address the challenges posed by dynamic communication topologies, we propose two practical strategies to handle issues arising from intermittent observations and incomplete state estimation, thereby enhancing the robustness and accuracy of the estimation process. Experiments and ablation studies conducted in both stationary and dynamic environments demonstrate the superiority of our algorithm over other baselines. Notably, it performs as well as, or even better than, algorithms that have a global view of all neighbors.

PaperID: 12838, https://arxiv.org/pdf/2410.00163.pdf

Authors: Pedro Henrique Paiola, Gabriel Lino Garcia, João Renato Ribeiro Manesco, Mateus Roder, Douglas Rodrigues, João Paulo Papa

Title: Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation

Abstract:
This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic‑100k‑en and MedQuAD datasets, translated from English using GPT‑3.5, were used to fine‑tune the ChatBode‑7B model using the PEFT‑QLoRA method. The InternLM2 model, with initial training on medical data, presented the best overall performance, with high precision and adequacy in metrics such as accuracy, completeness and safety. However, DrBode models, derived from ChatBode, exhibited a phenomenon of catastrophic forgetting of acquired medical knowledge. Despite this, these models performed frequently or even better in aspects such as grammaticality and coherence. A significant challenge was low inter‑rater agreement, highlighting the need for more robust assessment protocols. This work paves the way for future research, such as evaluating multilingual models specific to the medical field, improving the quality of training data, and developing more consistent evaluation methodologies for the medical field.

PaperID: 12839, https://arxiv.org/pdf/2410.00081.pdf

Authors: Roland Pihlakas

Title: From homeostasis to resource sharing: Biologically and economically aligned multi-objective multi-agent gridworld-based AI safety benchmarks

Abstract:
Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time‑tested fundamental sciences describing our needs and preferences. To address this gap, the present work focuses on introducing biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety ‑ namely a set of multi‑objective, multi‑agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, sustainability principle, and resource sharing. Eight main benchmark environments have been implemented on the above themes, to illustrate key pitfalls and challenges in agentic AI‑s, such as unboundedly maximizing a homeostatic objective, over‑optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.

PaperID: 12840, https://arxiv.org/pdf/2410.00031.pdf

Authors: Ryan Y. Lin, Siddhartha Ojha, Kevin Cai, Maxwell F. Chen

Title: Strategic Collusion of LLM Agents: Market Division in Multi-Commodity Competitions

Abstract:
Machine‑learning technologies are seeing increased deployment in real‑world market scenarios. In this work, we explore the strategic behaviors of large language models (LLMs) when deployed as autonomous agents in multi‑commodity markets, specifically within Cournot competition frameworks. We examine whether LLMs can independently engage in anti‑competitive practices such as collusion or, more specifically, market division. Our findings demonstrate that LLMs can effectively monopolize specific commodities by dynamically adjusting their pricing and resource allocation strategies, thereby maximizing profitability without direct human input or explicit collusion commands. These results pose unique challenges and opportunities for businesses looking to integrate AI into strategic roles and for regulatory bodies tasked with maintaining fair and competitive markets. The study provides a foundation for further exploration into the ramifications of deferring high‑stakes decisions to LLM‑based agents.

PaperID: 12841, https://arxiv.org/pdf/2409.20560.pdf

Authors: Xiaopan Zhang, Hao Qin, Fuquan Wang, Yue Dong, Jiachen Li

Title: LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner

Abstract:
Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long‑horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model‑Driven Multi‑Agent PDDL Planner (LaMMA‑P), a novel multi‑agent task planning framework that achieves state‑of‑the‑art performance on long‑horizon tasks. LaMMA‑P integrates the strengths of the LMs' reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT‑THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2‑THOR environment. The experimental results demonstrate that LaMMA‑P achieves a 105% higher success rate and 36% higher efficiency than existing LM‑based multiagent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma‑p.github.io.

PaperID: 12842, https://arxiv.org/pdf/2409.20326.pdf

Authors: Zichong Li, Filip Bjelonic, Victor Klemm, Marco Hutter

Title: MARLadona -- Towards Cooperative Team Play Using Multi-Agent Reinforcement Learning

Abstract:
Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions heavily rely on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e.g., AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona. A decentralized multi‑agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, bridging the shortcomings of heuristic methods. Furthermore, we created an open‑source multi‑agent soccer environment. Utilizing our MARL framework and a modified global entity encoder (GEE) as our core architecture, our approach achieves a 66.8% win rate against HELIOS agent, which employs a state‑of‑the‑art heuristic strategy. In addition, we provided an in‑depth analysis of the policy behavior and interpreted the agent's intention using the critic network.

PaperID: 12843, https://arxiv.org/pdf/2409.20289.pdf

Authors: Hongrui Zhao, Boris Ivanovic, Negar Mehr

Title: Distributed NeRF Learning for Collaborative Multi-Robot Perception

Abstract:
Effective environment perception is crucial for enabling downstream robotic applications. Individual robotic agents often face occlusion and limited visibility issues, whereas multi‑agent systems can offer a more comprehensive mapping of the environment, quicker coverage, and increased fault tolerance. In this paper, we propose a collaborative multi‑agent perception system where agents collectively learn a neural radiance field (NeRF) from posed RGB images to represent a scene. Each agent processes its local sensory data and shares only its learned NeRF model with other agents, reducing communication overhead. Given NeRF's low memory footprint, this approach is well‑suited for robotic systems with limited bandwidth, where transmitting all raw data is impractical. Our distributed learning framework ensures consistency across agents' local NeRF models, enabling convergence to a unified scene representation. We show the effectiveness of our method through an extensive set of experiments on datasets containing challenging real‑world scenes, achieving performance comparable to centralized mapping of the environment where data is sent to a central server for processing. Additionally, we find that multi‑agent learning provides regularization benefits, improving geometric consistency in scenarios with sparse input views. We show that in such scenarios, multi‑agent mapping can even outperform centralized training.

PaperID: 12844, https://arxiv.org/pdf/2409.20173.pdf

Authors: Petr Vanc, Giovanni Franzese, Jan Kristof Behrens, Cosimo Della Santina, Karla Stepanova, Jens Kober, Robert Babuska

Title: ILeSiA: Interactive Learning of Robot Situational Awareness from Camera Input

Abstract:
Learning from demonstration is a promising approach for teaching robots new skills. However, a central challenge in the execution of acquired skills is the ability to recognize faults and prevent failures. This is essential because demonstrations typically cover only a limited set of scenarios and often only the successful ones. During task execution, unforeseen situations may arise, such as changes in the robot's environment or interaction with human operators. To recognize such situations, this paper focuses on teaching the robot situational awareness by using a camera input and labeling frames as safe or risky. We train a Gaussian Process (GP) regression model fed by a low‑dimensional latent space representation of the input images. The model outputs a continuous risk score ranging from zero to one, quantifying the degree of risk at each timestep. This allows for pausing task execution in unsafe situations and directly adding new training data, labeled by the human user. Our experiments on a robotic manipulator show that the proposed method can reliably detect both known and novel faults using only a single example for each new fault. In contrast, a standard multi‑layer perceptron (MLP) performs well only on faults it has encountered during training. Our method enables the next generation of cobots to be rapidly deployed with easy‑to‑set‑up, vision‑based risk assessment, proactively safeguarding humans and detecting misaligned parts or missing objects before failures occur. We provide all the code and data required to reproduce our experiments at imitrob.ciirc.cvut.cz/publications/ilesia.

PaperID: 12845, https://arxiv.org/pdf/2409.20067.pdf

Authors: Laixi Shi, Jingchu Gai, Eric Mazumdar, Yuejie Chi, Adam Wierman

Title: Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning

Abstract:
Standard multi‑agent reinforcement learning (MARL) algorithms are vulnerable to sim‑to‑real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst‑case performance when game dynamics shift within a prescribed uncertainty set. RMGs remains under‑explored, from reasonable problem formulation to the development of sample‑efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well‑posedness of this class of RMGs by proving the existence of game‑theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample‑efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.

PaperID: 12846, https://arxiv.org/pdf/2409.20033.pdf

Authors: Thi Ngoc Nguyen, Felix Muesgens

Title: Fuel tax loss in a world of electric mobility: A window of opportunity for congestion pricing

Abstract:
The continued transition towards electric mobility will decrease energy tax revenues worldwide, which has substantial implications for government funds. At the same time, demand for transportation is ever increasing, which in turn increases congestion problems. Combining both challenges, this paper assesses the effectiveness of congestion pricing as a sustainable revenue stream to offset fuel tax loss in 2030 while simultaneously enhancing efficiency in the transport sector. A congestion‑based toll that is road‑and‑time‑variant is simulated for the greater Berlin area in Germany using the multi‑agent transport simulation (MATSim) software. Through the simulation results, this paper quantifies the impacts of the toll on the governmental revenue, traffic management, environment, social welfare, and the distribution effects. We find that the revenue from congestion tolls in a metropolitan area can compensate the reduction in passenger car fuel tax. Furthermore, a remarkable welfare surplus is observed. The toll also successfully incentivises transport users to adjust their travel behaviour, which reduces traffic delay time by 28%. CO2 emissions as a key metric for decarbonisation of the transport sector decrease by more than 5%. The analysis of the distribution effects suggests that a redistribution plan with a focus on the middle‑low‑income residents and the outer boroughs could help the policy gain more public acceptance.

PaperID: 12847, https://arxiv.org/pdf/2409.19965.pdf

Authors: Yinghui Pan, Biyang Ma, Hanyi Zhang, Yifeng Zeng

Title: Variational Auto-encoder Based Solutions to Interactive Dynamic Influence Diagrams

Abstract:
Addressing multiagent decision problems in AI, especially those involving collaborative or competitive agents acting concurrently in a partially observable and stochastic environment, remains a formidable challenge. While Interactive Dynamic Influence Diagrams~(I‑DIDs) have offered a promising decision framework for such problems, they encounter limitations when the subject agent encounters unknown behaviors exhibited by other agents that are not explicitly modeled within the I‑DID. This can lead to sub‑optimal responses from the subject agent. In this paper, we propose a novel data‑driven approach that utilizes an encoder‑decoder architecture, particularly a variational autoencoder, to enhance I‑DID solutions. By integrating a perplexity‑based tree loss function into the optimization algorithm of the variational autoencoder, coupled with the advantages of Zig‑Zag One‑Hot encoding and decoding, we generate potential behaviors of other agents within the I‑DID that are more likely to contain their true behaviors, even from limited interactions. This new approach enables the subject agent to respond more appropriately to unknown behaviors, thus improving its decision quality. We empirically demonstrate the effectiveness of the proposed approach in two well‑established problem domains, highlighting its potential for handling multi‑agent decision problems with unknown behaviors. This work is the first time of using neural networks based approaches to deal with the I‑DID challenge in agent planning and learning problems.

PaperID: 12848, https://arxiv.org/pdf/2409.19894.pdf

Authors: Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Xin Peng, Zhenpeng Chen, Yiling Lou

Title: TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

Abstract:
Code translation transforms code between programming languages while preserving functionality, which is critical in software development and maintenance. While traditional learning‑based code translation methods have limited effectiveness due to the lack of sufficient parallel training data, Large Language Models (LLMs) have recently advanced this field with their strong code generation and comprehension capabilities. However, code translated by LLMs still suffers from diverse quality issues, such as syntax and semantic errors. In this work, we propose TransAGENT, a novel multi‑agent system that eliminates the errors during LLM‑based code translation. The main insight of TransAGENT is to localize error‑prone code blocks via fine‑grained execution alignment between source and target code. We evaluate TransAGENT on a newly constructed benchmark of recent programming tasks to mitigate data leakage. TransAGENT outperforms the latest UniTrans by up to 33.3% in translation accuracy and achieves an average improvement of 56.7% over Agentless in program repair performance. We also conduct an ablation study and evaluate TransAGENT across different LLMs, demonstrating its effectiveness and strong generalizability.

PaperID: 12849, https://arxiv.org/pdf/2409.19831.pdf

Authors: Zhengran Ji, Lingyu Zhang, Paul Sajda, Boyuan Chen

Title: Enabling Multi-Robot Collaboration from Single-Human Guidance

Abstract:
Learning collaborative behaviors is essential for multi‑agent systems. Traditionally, multi‑agent reinforcement learning solves this implicitly through a joint reward and centralized observations, assuming collaborative behavior will emerge. Other studies propose to learn from demonstrations of a group of collaborative experts. Instead, we propose an efficient and explicit way of learning collaborative behaviors in multi‑agent systems by leveraging expertise from only a single human. Our insight is that humans can naturally take on various roles in a team. We show that agents can effectively learn to collaborate by allowing a human operator to dynamically switch between controlling agents for a short period and incorporating a human‑like theory‑of‑mind model of teammates. Our experiments showed that our method improves the success rate of a challenging collaborative hide‑and‑seek task by up to 58% with only 40 minutes of human guidance. We further demonstrate our findings transfer to the real world by conducting multi‑robot experiments.

PaperID: 12850, https://arxiv.org/pdf/2409.19625.pdf

Authors: Yann Munro, Camilo Sarmiento, Isabelle Bloch, Gauvain Bourgne, Catherine Pelachaud, Marie-Jeanne Lesot

Title: An action language-based formalisation of an abstract argumentation framework

Abstract:
An abstract argumentation framework is a commonly used formalism to provide a static representation of a dialogue. However, the order of enunciation of the arguments in an argumentative dialogue is very important and can affect the outcome of this dialogue. In this paper, we propose a new framework for modelling abstract argumentation graphs, a model that incorporates the order of enunciation of arguments. By taking this order into account, we have the means to deduce a unique outcome for each dialogue, called an extension. We also establish several properties, such as termination and correctness, and discuss two notions of completeness. In particular, we propose a modification of the previous transformation based on a "last enunciated last updated" strategy, which verifies the second form of completeness.

PaperID: 12851, https://arxiv.org/pdf/2409.19479.pdf

Authors: Shu Ishida

Title: Spatial Reasoning and Planning for Deep Embodied Agents

Abstract:
Humans can perform complex tasks with long‑term objectives by planning, reasoning, and forecasting outcomes of actions. For embodied agents to achieve similar capabilities, they must gain knowledge of the environment transferable to novel scenarios with a limited budget of additional trial and error. Learning‑based approaches, such as deep RL, can discover and take advantage of inherent regularities and characteristics of the application domain from data, and continuously improve their performances, however at a cost of large amounts of training data. This thesis explores the development of data‑driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differential planner that learns interpretable models of the world for long‑term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options unsupervised for long‑horizon tasks. Options segment a task into subtasks and enable consistent execution of the subtask. SOAP showed robust performances on history‑conditional corridor tasks as well as classical benchmarks such as Atari. 3) LangProp, a code optimisation framework using LLMs to solve embodied agent problems that require reasoning by treating code as learnable policies. The framework successfully generated interpretable code with comparable or superior performance to human‑written experts in the CARLA autonomous driving benchmark. 4) Voggite, an embodied agent with a vision‑to‑action transformer backend that solves complex tasks in Minecraft. It achieved third place in the MineRL BASALT Competition by identifying action triggers to segment tasks into multiple stages.

PaperID: 12852, https://arxiv.org/pdf/2409.19391.pdf

Authors: Pihe Hu, Shaolong Li, Zhuoran Li, Ling Pan, Longbo Huang

Title: Value-Based Deep Multi-Agent Reinforcement Learning with Dynamic Sparse Training

Abstract:
Deep Multi‑agent Reinforcement Learning (MARL) relies on neural networks with numerous parameters in multi‑agent scenarios, often incurring substantial computational overhead. Consequently, there is an urgent need to expedite training and enable model compression in MARL. This paper proposes the utilization of dynamic sparse training (DST), a technique proven effective in deep supervised learning tasks, to alleviate the computational burdens in MARL training. However, a direct adoption of DST fails to yield satisfactory MARL agents, leading to breakdowns in value learning within deep sparse value‑based MARL models. Motivated by this challenge, we introduce an innovative Multi‑Agent Sparse Training (MAST) framework aimed at simultaneously enhancing the reliability of learning targets and the rationality of sample distribution to improve value learning in sparse models. Specifically, MAST incorporates the Soft Mellowmax Operator with a hybrid TD‑(λ) schema to establish dependable learning targets. Additionally, it employs a dual replay buffer mechanism to enhance the distribution of training samples. Building upon these aspects, MAST utilizes gradient‑based topology evolution to exclusively train multiple MARL agents using sparse networks. Our comprehensive experimental investigation across various value‑based MARL algorithms on multiple benchmarks demonstrates, for the first time, significant reductions in redundancy of up to 20× in Floating Point Operations (FLOPs) for both training and inference, with less than 3% performance degradation.

PaperID: 12853, https://arxiv.org/pdf/2409.19363.pdf

Authors: Shiqi Lei, Kanghoon Lee, Linjing Li, Jinkyoo Park

Title: Learning Strategy Representation for Imitation Learning in Multi-Agent Games

Abstract:
The offline datasets for imitation learning (IL) in multi‑agent games typically contain player trajectories exhibiting diverse strategies, which necessitate measures to prevent learning algorithms from acquiring undesirable behaviors. Learning representations for these trajectories is an effective approach to depicting the strategies employed by each demonstrator. However, existing learning strategies often require player identification or rely on strong assumptions, which are not appropriate for multi‑agent games. Therefore, in this paper, we introduce the Strategy Representation for Imitation Learning (STRIL) framework, which (1) effectively learns strategy representations in multi‑agent games, (2) estimates proposed indicators based on these representations, and (3) filters out sub‑optimal data using the indicators. STRIL is a plug‑in method that can be integrated into existing IL algorithms. We demonstrate the effectiveness of STRIL across competitive multi‑agent scenarios, including Two‑player Pong, Limit Texas Hold'em, and Connect Four. Our approach successfully acquires strategy representations and indicators, thereby identifying dominant trajectories and significantly enhancing existing IL performance across these environments.

PaperID: 12854, https://arxiv.org/pdf/2409.19100.pdf

Authors: Yuexing Hao, Jason Holmes, Mark R. Waddle, Brian J. Davis, Nathan Y. Yu, Kristin Vickers, Heather Preston, Drew Margolin, Corinna E. Lockenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, Wei Liu

Title: Personalizing Prostate Cancer Education for Patients Using an EHR-Integrated LLM Agent

Abstract:
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic's electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non‑metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX 83.7 out of 100) and improves patients' health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient‑chat interaction history and rated MedEduChat as highly correct (2.9 out of 3), complete (2.7 out of 3), and safe (2.7 out of 3), with moderate personalization (2.3 out of 3). This study highlights the potential of LLM agents to improve patient engagement and health education.

PaperID: 12855, https://arxiv.org/pdf/2409.19006.pdf

Authors: Sakhinana Sagar Srinivas, Vijay Sri Vaikunth, Venkataramana Runkana

Title: Towards Automated Patent Workflows: AI-Orchestrated Multi-Agent Framework for Intellectual Property Management and Analysis

Abstract:
Patents are the currency of innovation, and like any currency, they need to be managed and protected (Gavin Potenza). Patents, as legal documents that secure intellectual property rights, play a critical role in technological innovation. The growing complexity of patent documents and the surge in patent applications have created a need for automated solutions in patent analysis. In this work, we present PatExpert, an autonomous multi‑agent conversational framework designed to streamline and optimize patent‑related tasks. The framework consists of a metaagent that coordinates task‑specific expert agents for various patent‑related tasks and a critique agent for error handling and feedback provision. The meta‑agent orchestrates specialized expert agents, each fine‑tuned for specific tasks such as patent classification, acceptance, claim generation, abstractive summarization, multi‑patent analysis, and scientific hypothesis generation. For multi‑patent analysis, the framework incorporates advanced methods like Graph Retrieval‑Augmented Generation (GRAG) to enhance response accuracy and relevance by combining semantic similarity with knowledge graphs. Error handling is managed by critique agents (Gold‑LLM‑as‑a‑Judge and Reward‑LLM‑as‑a‑Judge), which evaluate output responses for accuracy and provide iterative feedback. The framework also prioritizes explainability, ensuring transparent justifications for decisions made during patent analysis. Its comprehensive capabilities make it a valuable tool for automating complex patent workflows, enhancing efficiency, accuracy, and compliance in patent‑related tasks. Empirical evidence demonstrates significant improvements in patent processing tasks, concluding that the framework offers a robust solution for automating and optimizing patent analysis.

PaperID: 12856, https://arxiv.org/pdf/2409.18985.pdf

Authors: Jyotiranjan Beuria, Mayank Chaurasiya, Laxmidhar Behera

Title: Collective motion from quantum-inspired dynamics in visual perception

Abstract:
We propose a model of collective behavior in self‑propelled active agents that incorporates a perceptual decision‑making process. In this framework, the decision‑making dynamics is modeled using quantum formalism. The perceptual decision state of each agent is an entangled or superposed state of the decision states for the neighboring agents within the vision cone. We suggest that in this framework, the force driving the movement of active agents is governed by the quantum average of its perception operator, providing a bridge between perceptual decision‑making processes and classical dynamics. Additionally, we introduce two perceptual measures of cohesion in the flock, namely, perception strength and perceptual energy, to characterize collective behavior in terms of decision‑making dynamics. Our model demonstrates that, with an appropriate choice of perceptual decision state, the well‑known Vicsek model of flocking behavior can be derived as a specific case of this quantum‑inspired approach. This approach provides fresh insights into collective behavior and multi‑agent coordination, revealing how classical patterns of collective behavior emerge naturally from perception.

PaperID: 12857, https://arxiv.org/pdf/2409.18977.pdf

Authors: Yun Xia, Hai Xue, Di Zhang, Shahid Mumtaz, Xiaolong Xu, Joel J. P. C. Rodrigues

Title: Dynamic Pricing based Near-Optimal Resource Allocation for Elastic Edge Offloading

Abstract:
In mobile edge computing (MEC), task offloading can significantly reduce task execution latency and energy consumption of end user (EU). However, edge server (ES) resources are limited, necessitating efficient allocation to ensure the sustainable and healthy development for MEC systems. In this paper, we propose a dynamic pricing mechanism based near‑optimal resource allocation for elastic edge offloading. First, we construct a resource pricing model and accordingly develop the utility functions for both EU and ES, the optimal pricing model parameters are derived by optimizing the utility functions. In the meantime, our theoretical analysis reveals that the EU's utility function reaches a local maximum within the search range, but exhibits barely growth with increased resource allocation beyond this point. To this end, we further propose the Dynamic Inertia and Speed‑Constrained particle swarm optimization (DISC‑PSO) algorithm, which efficiently identifies the near‑optimal resource allocation. Comprehensive simulation results validate the effectiveness of DISC‑PSO, demonstrating that it significantly outperforms existing schemes by reducing the average number of iterations to reach a near‑optimal solution by 92.11%, increasing the final user utility function value by 0.24%, and decreasing the variance of results by 95.45%.

PaperID: 12858, https://arxiv.org/pdf/2409.18862.pdf

Authors: Sacha Huriot, Hussein Sibai

Title: Safe Decentralized Multi-Agent Control using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions

Abstract:
We address the challenge of safe control in decentralized multi‑agent robotic settings, where agents use uncertain black‑box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier functions‑based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance between the objectives of safety and task accomplishment, despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi‑agent scenes in the Stanford Drone Dataset.

PaperID: 12859, https://arxiv.org/pdf/2409.18745.pdf

Authors: Ana Christina Almada Campos, Bruno Vilhena Adorno

Title: A study on the effects of mixed explicit and implicit communications in human-artificial-agent interactions

Abstract:
Communication between humans and artificial agents is essential for their interaction. This is often inspired by human communication, which uses gestures, facial expressions, gaze direction, and other explicit and implicit means. This work presents interaction experiments where humans and artificial agents interact through explicit and implicit communication to evaluate the effect of mixed explicit‑implicit communication against purely explicit communication and the impact of the task difficulty in this evaluation. Results obtained using Bayesian parameter estimation show that the task execution time did not significantly change when mixed explicit and implicit communications were used in neither of our experiments, which varied in the type of artificial agent (virtual agent and humanoid robot) used and task difficulty. The number of errors was affected by the communication only when the human was executing a more difficult task, and an impact on the perceived efficiency of the interaction was only observed in the interaction with the robot, for both easy and difficult tasks. In contrast, acceptance, sociability, and transparency of the artificial agent increased when using mixed communication modalities in both our experiments and task difficulty levels. This suggests that task‑related measures, such as time, number of errors, and perceived efficiency of the interaction, as well as the impact of the communication on them, are more sensitive to the type of task and the difficulty level, whereas the combination of explicit and implicit communications more consistently improves human perceptions about artificial agents.

PaperID: 12860, https://arxiv.org/pdf/2409.18718.pdf

Authors: Sheikh Salman Hassan, Yu Min Park, Yan Kyaw Tun, Walid Saad, Zhu Han, Choong Seon Hong

Title: Enhancing Spectrum Efficiency in 6G Satellite Networks: A GAIL-Powered Policy Learning via Asynchronous Federated Inverse Reinforcement Learning

Abstract:
In this paper, a novel generative adversarial imitation learning (GAIL)‑powered policy learning approach is proposed for optimizing beamforming, spectrum allocation, and remote user equipment (RUE) association in NTNs. Traditional reinforcement learning (RL) methods for wireless network optimization often rely on manually designed reward functions, which can require extensive parameter tuning. To overcome these limitations, we employ inverse RL (IRL), specifically leveraging the GAIL framework, to automatically learn reward functions without manual design. We augment this framework with an asynchronous federated learning approach, enabling decentralized multi‑satellite systems to collaboratively derive optimal policies. The proposed method aims to maximize spectrum efficiency (SE) while meeting minimum information rate requirements for RUEs. To address the non‑convex, NP‑hard nature of this problem, we combine the many‑to‑one matching theory with a multi‑agent asynchronous federated IRL (MA‑AFIRL) framework. This allows agents to learn through asynchronous environmental interactions, improving training efficiency and scalability. The expert policy is generated using the Whale optimization algorithm (WOA), providing data to train the automatic reward function within GAIL. Simulation results show that the proposed MA‑AFIRL method outperforms traditional RL approaches, achieving a 14.6% improvement in convergence and reward value. The novel GAIL‑driven policy learning establishes a novel benchmark for 6G NTN optimization.

PaperID: 12861, https://arxiv.org/pdf/2409.18545.pdf

Authors: Shashank Shekhar, Anthony Favier, Rachid Alami

Title: An Epistemic Human-Aware Task Planner which Anticipates Human Beliefs and Decisions

Abstract:
We present a substantial extension of our Human‑Aware Task Planning framework, tailored for scenarios with intermittent shared execution experiences and significant belief divergence between humans and robots, particularly due to the uncontrollable nature of humans. Our objective is to build a robot policy that accounts for uncontrollable human behaviors, thus enabling the anticipation of possible advancements achieved by the robot when the execution is not shared, e.g. when humans are briefly absent from the shared environment to complete a subtask. But, this anticipation is considered from the perspective of humans who have access to an estimated model for the robot. To this end, we propose a novel planning framework and build a solver based on AND‑OR search, which integrates knowledge reasoning, including situation assessment by perspective taking. Our approach dynamically models and manages the expansion and contraction of potential advances while precisely keeping track of when (and when not) agents share the task execution experience. The planner systematically assesses the situation and ignores worlds that it has reason to think are impossible for humans. Overall, our new solver can estimate the distinct beliefs of the human and the robot along potential courses of action, enabling the synthesis of plans where the robot selects the right moment for communication, i.e. informing, or replying to an inquiry, or defers ontic actions until the execution experiences can be shared. Preliminary experiments in two domains, one novel and one adapted, demonstrate the effectiveness of the framework.

PaperID: 12862, https://arxiv.org/pdf/2409.18435.pdf

Authors: Xian Yeow Lee, Haiyan Wang, Daisuke Katsumata, Takaharu Matsui, Chetan Gupta

Title: Multi-agent Reinforcement Learning for Dynamic Dispatching in Material Handling Systems

Abstract:
This paper proposes a multi‑agent reinforcement learning (MARL) approach to learn dynamic dispatching strategies, which is crucial for optimizing throughput in material handling systems across diverse industries. To benchmark our method, we developed a material handling environment that reflects the complexities of an actual system, such as various activities at different locations, physical constraints, and inherent uncertainties. To enhance exploration during learning, we propose a method to integrate domain knowledge in the form of existing dynamic dispatching heuristics. Our experimental results show that our method can outperform heuristics by up to 7.4 percent in terms of median throughput. Additionally, we analyze the effect of different architectures on MARL performance when training multiple agents with different functions. We also demonstrate that the MARL agents performance can be further improved by using the first iteration of MARL agents as heuristics to train a second iteration of MARL agents. This work demonstrates the potential of applying MARL to learn effective dynamic dispatching strategies that may be deployed in real‑world systems to improve business outcomes.

PaperID: 12863, https://arxiv.org/pdf/2409.18051.pdf

Authors: Jiayu Yao, Weiwei Pan, Finale Doshi-Velez, Barbara E Engelhardt

Title: Inverse Reinforcement Learning with Multiple Planning Horizons

Abstract:
In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different, unknown planning horizons. Without the knowledge of discount factors, the reward function has a larger feasible solution set, which makes it harder for existing IRL approaches to identify a reward function. To overcome this challenge, we develop algorithms that can learn a global multi‑agent reward function with agent‑specific discount factors that reconstruct the expert policies. We characterize the feasible solution space of the reward function and discount factors for both algorithms and demonstrate the generalizability of the learned reward function across multiple domains.

PaperID: 12864, https://arxiv.org/pdf/2409.18031.pdf

Authors: Haochen Liu, Li Chen, Yu Qiao, Chen Lv, Hongyang Li

Title: Reasoning Multi-Agent Behavioral Topology for Interactive Autonomous Driving

Abstract:
Autonomous driving system aims for safe and social‑consistent driving through the behavioral integration among interactive agents. However, challenges remain due to multi‑agent scene uncertainty and heterogeneous interaction. Current dense and sparse behavioral representations struggle with inefficiency and inconsistency in multi‑agent modeling, leading to instability of collective behavioral patterns when integrating prediction and planning (IPP). To address this, we initiate a topological formation that serves as a compliant behavioral foreground to guide downstream trajectory generations. Specifically, we introduce Behavioral Topology (BeTop), a pivotal topological formulation that explicitly represents the consensual behavioral pattern among multi‑agent future. BeTop is derived from braid theory to distill compliant interactive topology from multi‑agent future trajectories. A synergistic learning framework (BeTopNet) supervised by BeTop facilitates the consistency of behavior prediction and planning within the predicted topology priors. Through imitative contingency learning, BeTop also effectively manages behavioral uncertainty for prediction and planning. Extensive verification on large‑scale real‑world datasets, including nuPlan and WOMD, demonstrates that BeTop achieves state‑of‑the‑art performance in both prediction and planning tasks. Further validations on the proposed interactive scenario benchmark showcase planning compliance in interactive cases.

PaperID: 12865, https://arxiv.org/pdf/2409.18028.pdf

Authors: Yotam Wolf, Binyamin Rothberg, Dorin Shteyman, Amnon Shashua

Title: Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective

Abstract:
A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model's context window. Previous works have shown that subtask decomposition within the model's context (chain of thought), is beneficial for solving such tasks. In this work, we point a limitation of LLMs' ability to perform several sub‑tasks within the same context window ‑ an in‑context hardness of composition, pointing to an advantage for distributing a decomposed problem in a multi‑agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context relative to distributing it among multiple agents, that increases exponentially with the solution's length. We prove our results theoretically and demonstrate them empirically.

PaperID: 12866, https://arxiv.org/pdf/2409.17443.pdf

Authors: Cameron Mehlman, Gregory Falco

Title: Satellite Chasers: Divergent Adversarial Reinforcement Learning to Engage Intelligent Adversaries on Orbit

Abstract:
As space becomes increasingly crowded and contested, robust autonomous capabilities for multi‑agent environments are gaining critical importance. Current autonomous systems in space primarily rely on optimization‑based path planning or long‑range orbital maneuvers, which have not yet proven effective in adversarial scenarios where one satellite is actively pursuing another. We introduce Divergent Adversarial Reinforcement Learning (DARL), a two‑stage Multi‑Agent Reinforcement Learning (MARL) approach designed to train autonomous evasion strategies for satellites engaged with multiple adversarial spacecraft. Our method enhances exploration during training by promoting diverse adversarial strategies, leading to more robust and adaptable evader models. We validate DARL through a cat‑and‑mouse satellite scenario, modeled as a partially observable multi‑agent capture the flag game where two adversarial ``cat" spacecraft pursue a single ``mouse" evader. DARL's performance is compared against several benchmarks, including an optimization‑based satellite path planner, demonstrating its ability to produce highly robust models for adversarial multi‑agent space environments.

PaperID: 12867, https://arxiv.org/pdf/2409.17348.pdf

Authors: Huao Li, Hossein Nourkhiz Mahjoub, Behdad Chalaki, Vaishnav Tadiparthi, Kwonjoon Lee, Ehsan Moradi-Pari, Charles Michael Lewis, Katia P Sycara

Title: Language Grounded Multi-agent Reinforcement Learning with Human-interpretable Communication

Abstract:
Multi‑Agent Reinforcement Learning (MARL) methods have shown promise in enabling agents to learn a shared communication protocol from scratch and accomplish challenging team tasks. However, the learned language is usually not interpretable to humans or other agents not co‑trained together, limiting its applicability in ad‑hoc teamwork scenarios. In this work, we propose a novel computational pipeline that aligns the communication space between MARL agents with an embedding space of human natural language by grounding agent communications on synthetic data generated by embodied Large Language Models (LLMs) in interactive teamwork scenarios. Our results demonstrate that introducing language grounding not only maintains task performance but also accelerates the emergence of communication. Furthermore, the learned communication protocols exhibit zero‑shot generalization capabilities in ad‑hoc teamwork scenarios with unseen teammates and novel task states. This work presents a significant step toward enabling effective communication and collaboration between artificial agents and humans in real‑world teamwork settings.

PaperID: 12868, https://arxiv.org/pdf/2409.17266.pdf

Authors: Junyan Cheng, Peter Chin

Title: Empirical Asset Pricing with Large Language Model Agents

Abstract:
In this study, we introduce a novel asset pricing model leveraging the Large Language Model (LLM) agents, which integrates qualitative discretionary investment evaluations from LLM agents with quantitative financial economic factors manually curated, aiming to explain the excess asset returns. The experimental results demonstrate that our methodology surpasses traditional machine learning‑based baselines in both portfolio optimization and asset pricing errors. Notably, the Sharpe ratio for portfolio optimization and the mean magnitude of |α| for anomaly portfolios experienced substantial enhancements of 10.6% and 10.0% respectively. Moreover, we performed comprehensive ablation studies on our model and conducted a thorough analysis of the method to extract further insights into the proposed approach. Our results show effective evidence of the feasibility of applying LLMs in empirical asset pricing.

PaperID: 12869, https://arxiv.org/pdf/2409.16851.pdf

Authors: Leonardo Santos, Caio C. G. Ribeiro, Douglas G. Macharet

Title: Communication Backbone Reconfiguration with Connectivity Maintenance

Abstract:
The exchange of information is key in applications that involve multiple agents, such as search and rescue, military operations, and disaster response. In this work, we propose a simple and effective trajectory planning framework that tackles the design, deployment, and reconfiguration of a communication backbone by reframing the problem of networked multi‑agent motion planning as a manipulator motion planning problem. Our approach works for backbones of variable configurations both in terms of the number of robots utilized and the distance limit between each robot. While research has been conducted on connection‑restricted navigation for multi‑robot systems in the last years, the field of manipulators is arguably more developed both in theory and practice. Hence, our methodology facilitates practical applications built on top of widely available motion planning algorithms and frameworks for manipulators.

PaperID: 12870, https://arxiv.org/pdf/2409.16832.pdf

Authors: Lyudong Jin, Ming Tang, Jiayu Pan, Meng Zhang, Hao Wang

Title: Asynchronous Fractional Multi-Agent Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing

Abstract:
In the realm of emerging real‑time networked applications such as cyber‑physical systems (CPS), the Age of Information (AoI) has emerged as a pivotal metric for evaluating timeliness. To meet the high computational demands, such as those in smart manufacturing within CPS, mobile edge computing (MEC) presents a promising solution for optimizing computing and reducing AoI. In this work, we study the timeliness of compute‑intensive updates and explore jointly optimizing the task updating (when to generate a task) and offloading (where to process a task) policies to minimize AoI. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time‑average AoI. Solving this problem is challenging due to the fractional objective introduced by AoI and the asynchronous decision‑making of the semi‑Markov game (SMG). To this end, we propose a fractional reinforcement learning (RL) framework. We begin by introducing a fractional single‑agent RL framework and establish its linear convergence rate. Building on this, we develop a fractional multi‑agent RL framework, extend Dinkelbach's method, and demonstrate its equivalence to the inexact Newton's method. Furthermore, we provide the conditions under which the framework achieves linear convergence to the Nash equilibrium (NE). To tackle the challenge of asynchronous decision‑making in the SMG, we further design an asynchronous model‑free fractional multi‑agent RL algorithm, where each mobile device can determine the task updating and offloading decisions without knowing the real‑time system dynamics and decisions of other devices. Experimental results show that when compared with the best existing baseline algorithm, our proposed algorithm reduces the average AoI by up to 50.6%.

PaperID: 12871, https://arxiv.org/pdf/2409.16720.pdf

Authors: Xian Wang, Jin Zhou, Yuanli Feng, Jiahao Mei, Jiming Chen, Shuo Li

Title: Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

Abstract:
Recent innovations in autonomous drones have facilitated time‑optimal flight in single‑drone configurations, and enhanced maneuverability in multi‑drone systems by applying optimal control and learning‑based methods. However, few studies have achieved time‑optimal motion planning for multi‑drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi‑agent reinforcement learning for time‑optimal multi‑drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision‑free mechanism inspired by optimization‑based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade‑offs compared to single‑drone systems, our multi‑drone approach maintains near‑time‑optimal performance with a low collision rate. Real‑world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m 5.5 m 2.0 m space across various tracks, relying entirely on onboard computation.

PaperID: 12872, https://arxiv.org/pdf/2409.16686.pdf

Authors: Dayuan Fu, Biqing Qi, Yihuai Gao, Che Jiang, Guanting Dong, Bowen Zhou

Title: MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making

Abstract:
Long‑term memory is significant for agents, in which insights play a crucial role. However, the emergence of irrelevant insight and the lack of general insight can greatly undermine the effectiveness of insight. To solve this problem, in this paper, we introduce Multi‑Scale Insight Agent (MSI‑Agent), an embodied agent designed to improve LLMs' planning and decision‑making ability by summarizing and utilizing insight effectively across different scales. MSI achieves this through the experience selector, insight generator, and insight selector. Leveraging a three‑part pipeline, MSI can generate task‑specific and high‑level insight, store it in a database, and then use relevant insight from it to aid in decision‑making. Our experiments show that MSI outperforms another insight strategy when planning by GPT3.5. Moreover, We delve into the strategies for selecting seed experience and insight, aiming to provide LLM with more useful and relevant insight for better decision‑making. Our observations also indicate that MSI exhibits better robustness when facing domain‑shifting scenarios.

PaperID: 12873, https://arxiv.org/pdf/2409.16635.pdf

Authors: Sungjune Park, Heehwan Kim, Haehyun Cho, Daeseon Choi

Title: Judgment-of-Thought Prompting: A Courtroom-Inspired Framework for Binary Logical Reasoning with Large Language Models

Abstract:
This paper proposes a novel prompting approach, Judgment of Thought (JoT), specifically tailored for binary logical reasoning tasks. Despite advances in prompt engineering, existing approaches still face limitations in handling complex logical reasoning tasks. To address these issues, JoT introduces a multi‑agent approach with three specialized roles\unicodex2010\unicodex2010\unicodex2010lawyer, prosecutor, and judge\unicodex2010\unicodex2010\unicodex2010where a high‑level model acts as the judge, and lower‑level models serve as lawyer and prosecutor to systematically debate and evaluate arguments. Experimental evaluations on benchmarks such as BigBenchHard and Winogrande demonstrate JoT's superior performance compared to existing prompting approaches, achieving notable improvements, including 98% accuracy in Boolean expressions. Also, our ablation studies validate the critical contribution of each role, iterative refinement loops, and feedback mechanisms. Consequently, JoT significantly enhances accuracy, reliability, and consistency in binary reasoning tasks and shows potential for practical applications.

PaperID: 12874, https://arxiv.org/pdf/2409.16460.pdf

Authors: Hang Liu, Yi Cheng, Rankun Li, Xiaowen Hu, Linqi Ye, Houde Liu

Title: MBC: Multi-Brain Collaborative Control for Quadruped Robots

Abstract:
In the field of locomotion task of quadruped robots, Blind Policy and Perceptive Policy each have their own advantages and limitations. The Blind Policy relies on preset sensor information and algorithms, suitable for known and structured environments, but it lacks adaptability in complex or unknown environments. The Perceptive Policy uses visual sensors to obtain detailed environmental information, allowing it to adapt to complex terrains, but its effectiveness is limited under occluded conditions, especially when perception fails. Unlike the Blind Policy, the Perceptive Policy is not as robust under these conditions. To address these challenges, we propose a MBC:Multi‑Brain collaborative system that incorporates the concepts of Multi‑Agent Reinforcement Learning and introduces collaboration between the Blind Policy and the Perceptive Policy. By applying this multi‑policy collaborative model to a quadruped robot, the robot can maintain stable locomotion even when the perceptual system is impaired or observational data is incomplete. Our simulations and real‑world experiments demonstrate that this system significantly improves the robot's passability and robustness against perception failures in complex environments, validating the effectiveness of multi‑policy collaboration in enhancing robotic motion performance.

PaperID: 12875, https://arxiv.org/pdf/2409.16455.pdf

Authors: Venkata Naren Devarakonda, Ali Umut Kaypak, Shuaihang Yuan, Prashanth Krishnamurthy, Yi Fang, Farshad Khorrami

Title: MultiTalk: Introspective and Extrospective Dialogue for Human-Environment-LLM Alignment

Abstract:
LLMs have shown promising results in task planning due to their strong natural language understanding and reasoning capabilities. However, issues such as hallucinations, ambiguities in human instructions, environmental constraints, and limitations in the executing agent's capabilities often lead to flawed or incomplete plans. This paper proposes MultiTalk, an LLM‑based task planning methodology that addresses these issues through a framework of introspective and extrospective dialogue loops. This approach helps ground generated plans in the context of the environment and the agent's capabilities, while also resolving uncertainties and ambiguities in the given task. These loops are enabled by specialized systems designed to extract and predict task‑specific states, and flag mismatches or misalignments among the human user, the LLM agent, and the environment. Effective feedback pathways between these systems and the LLM planner foster meaningful dialogue. The efficacy of this methodology is demonstrated through its application to robotic manipulation tasks. Experiments and ablations highlight the robustness and reliability of our method, and comparisons with baselines further illustrate the superiority of MultiTalk in task planning for embodied agents.

PaperID: 12876, https://arxiv.org/pdf/2409.16450.pdf

Authors: Talha Bozkus, Urbashi Mitra

Title: A Multi-Agent Multi-Environment Mixed Q-Learning for Partially Decentralized Wireless Network Optimization

Abstract:
Q‑learning is a powerful tool for network control and policy optimization in wireless networks, but it struggles with large state spaces. Recent advancements, like multi‑environment mixed Q‑learning (MEMQ), improves performance and reduces complexity by integrating multiple Q‑learning algorithms across multiple related environments so‑called digital cousins. However, MEMQ is designed for centralized single‑agent networks and is not suitable for decentralized or multi‑agent networks. To address this challenge, we propose a novel multi‑agent MEMQ algorithm for partially decentralized wireless networks with multiple mobile transmitters (TXs) and base stations (BSs), where TXs do not have access to each other's states and actions. In uncoordinated states, TXs act independently to minimize their individual costs. In coordinated states, TXs use a Bayesian approach to estimate the joint state based on local observations and share limited information with leader TX to minimize joint cost. The cost of information sharing scales linearly with the number of TXs and is independent of the joint state‑action space size. The proposed scheme is 50% faster than centralized MEMQ with only a 20% increase in average policy error (APE) and is 25% faster than several advanced decentralized Q‑learning algorithms with 40% less APE. The convergence of the algorithm is also demonstrated.

PaperID: 12877, https://arxiv.org/pdf/2409.16427.pdf

Authors: Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, Maarten Sap

Title: HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Abstract:
AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi‑turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients' profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi‑dimensional evaluation framework that uses metrics covering operational, content‑related, societal, and legal risks. Through running 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user‑AI interactions and complex tool use by AI agents. Our experiments show that state‑of‑the‑art LLMs, both proprietary and open‑sourced, exhibit safety risks in over 50% cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.

PaperID: 12878, https://arxiv.org/pdf/2409.16299.pdf

Authors: Huy Nhat Phan, Tien N. Nguyen, Phong X. Nguyen, Nghi D. Q. Bui

Title: HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Abstract:
Large Language Models (LLMs) have revolutionized software engineering (SE), showcasing remarkable proficiency in various coding tasks. Despite recent advancements that have enabled the creation of autonomous software agents utilizing LLMs for end‑to‑end development tasks, these systems are typically designed for specific SE functions. We introduce HyperAgent, an innovative generalist multi‑agent system designed to tackle a wide range of SE tasks across different programming languages by mimicking the workflows of human developers. HyperAgent features four specialized agents‑Planner, Navigator, Code Editor, and Executor‑capable of handling the entire lifecycle of SE tasks, from initial planning to final verification. HyperAgent sets new benchmarks in diverse SE tasks, including GitHub issue resolution on the renowned SWE‑Bench benchmark, outperforming robust baselines. Furthermore, HyperAgent demonstrates exceptional performance in repository‑level code generation (RepoExec) and fault localization and program repair (Defects4J), often surpassing state‑of‑the‑art baselines.

PaperID: 12879, https://arxiv.org/pdf/2409.16291.pdf

Authors: Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth, Mark Riedl

Title: Beyond Following: Mixing Active Initiative into Computational Creativity

Abstract:
Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process. Alternative setups, such as Mixed‑Initiative Co‑Creative (MI‑CC) systems, exhibited their promise. Still, the potential of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the influence of the adaptive ability of an active and learning AI agent on creators' expectancy of creative responsibilities in an MI‑CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the creative responsibility preferences of a human user during online interactions. Situated in story co‑creation, we develop a Multi‑armed‑bandit agent that learns from the human creator, updates its collaborative decision‑making belief, and switches between its capabilities during an MI‑CC experience. With 39 participants joining a human subject study, Our developed system's learning capabilities are well recognized compared to the non‑learning ablation, corresponding to a significant increase in overall satisfaction with the MI‑CC experience. These findings indicate a robust association between effective MI‑CC collaborative interactions, particularly the implementation of proactive AI initiatives, and deepened understanding among all participants.

PaperID: 12880, https://arxiv.org/pdf/2409.16125.pdf

Authors: Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer

Title: Analyzing Probabilistic Methods for Evaluating Agent Capabilities

Abstract:
To mitigate risks from AI systems, we need to assess their capabilities accurately. This is especially difficult in cases where capabilities are only rarely displayed. Phuong et al. propose two methods that aim to obtain better estimates of the probability of an AI agent successfully completing a given task. The milestone method decomposes tasks into subtasks, aiming to improve overall success rate estimation, while the expert best‑of‑N method leverages human guidance as a proxy for the model's independent performance. Our analysis of these methods as Monte Carlo estimators reveals that while both effectively reduce variance compared to naive Monte Carlo sampling, they also introduce bias. Experimental results demonstrate that the milestone method underestimates true solve rates for many real‑world tasks due to its constraining assumptions. The expert best‑of‑N method exhibits even more severe underestimation across all tasks, attributed to an inherently flawed re‑weighting factor. To enhance the accuracy of capability estimates of AI agents on difficult tasks, we suggest future work should leverage the rich literature on Monte Carlo Estimators.

PaperID: 12881, https://arxiv.org/pdf/2409.16030.pdf

Authors: Wenhao Yu, Jie Peng, Yueliang Ying, Sai Li, Jianmin Ji, Yanyong Zhang

Title: MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Abstract:
The integration of large language models (LLMs) with robotics has significantly advanced robots' abilities in perception, cognition, and task planning. The use of natural language interfaces offers a unified approach for expressing the capability differences of heterogeneous robots, facilitating communication between them, and enabling seamless task allocation and collaboration. Currently, the utilization of LLMs to achieve decentralized multi‑heterogeneous robot collaborative tasks remains an under‑explored area of research. In this paper, we introduce a novel framework that utilizes LLMs to achieve decentralized collaboration among multiple heterogeneous robots. Our framework supports three robot categories, mobile robots, manipulation robots, and mobile manipulation robots, working together to complete tasks such as exploration, transportation, and organization. We developed a rich set of textual feedback mechanisms and chain‑of‑thought (CoT) prompts to enhance task planning efficiency and overall system performance. The mobile manipulation robot can adjust its base position flexibly, ensuring optimal conditions for grasping tasks. The manipulation robot can comprehend task requirements, seek assistance when necessary, and handle objects appropriately. Meanwhile, the mobile robot can explore the environment extensively, map object locations, and communicate this information to the mobile manipulation robot, thus improving task execution efficiency. We evaluated the framework using PyBullet, creating scenarios with three different room layouts and three distinct operational tasks. We tested various LLM models and conducted ablation studies to assess the contributions of different modules. The experimental results confirm the effectiveness and necessity of our proposed framework.

PaperID: 12882, https://arxiv.org/pdf/2409.16009.pdf

Authors: Ike Obi, Ruiqi Wang, Wonse Jo, Byung-Cheol Min

Title: Modeling and Evaluating Trust Dynamics in Multi-Human Multi-Robot Task Allocation

Abstract:
Trust is essential in human‑robot collaboration, particularly in multi‑human, multi‑robot (MH‑MR) teams, where it plays a crucial role in maintaining team cohesion in complex operational environments. Despite its importance, trust is rarely incorporated into task allocation and reallocation algorithms for MH‑MR collaboration. While prior research in single‑human, single‑robot interactions has shown that integrating trust significantly enhances both performance outcomes and user experience, its role in MH‑MR task allocation remains underexplored. In this paper, we introduce the Expectation Confirmation Trust (ECT) Model, a novel framework for modeling trust dynamics in MH‑MR teams. We evaluate the ECT model against five existing trust models and a no‑trust baseline to assess its impact on task allocation outcomes across different team configurations (2H‑2R, 5H‑5R, and 10H‑10R). Our results show that the ECT model improves task success rate, reduces mean completion time, and lowers task error rates. These findings highlight the complexities of trust‑based task allocation in MH‑MR teams. We discuss the implications of incorporating trust into task allocation algorithms and propose future research directions for adaptive trust mechanisms that balance efficiency and performance in dynamic, multi‑agent environments.

PaperID: 12883, https://arxiv.org/pdf/2409.15866.pdf

Authors: Jiayu Chen, Chao Yu, Guosheng Li, Wenhao Tang, Shilong Ji, Xinyi Yang, Botian Xu, Huazhong Yang, Yu Wang

Title: Online Planning for Multi-UAV Pursuit-Evasion in Unknown Environments Using Deep Reinforcement Learning

Abstract:
Multi‑UAV pursuit‑evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi‑agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL‑based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policy to real‑world pursuit‑evasion are largely restricted to two‑dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we address multi‑UAV pursuit‑evasion by considering UAV dynamics and physical constraints. We introduce an evader prediction‑enhanced network to tackle partial observability in cooperative strategy learning. Additionally, we propose an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, we derive a feasible policy via a two‑stage reward refinement and deploy the policy on real quadrotors in a zero‑shot manner. To our knowledge, this is the first work to derive and deploy an RL‑based policy using collective thrust and body rates control commands for multi‑UAV pursuit‑evasion in unknown environments. The open‑source code and videos are available at https://sites.google.com/view/pursuit‑evasion‑rl.

PaperID: 12884, https://arxiv.org/pdf/2409.15795.pdf

Authors: Jun Ma, Meng Wang, Jinhui Pang, Haofen Wang, Xuejing Feng, Zhipeng Hu, Zhenyu Yang, Mingyang Guo, Zhenming Liu, Junwei Wang, Siyi Lu, Zhiming Gou

Title: Development and Evaluation Study of Intelligent Cockpit in the Age of Large Models

Abstract:
The development of Artificial Intelligence (AI) Large Models has a great impact on the application development of automotive Intelligent cockpit. The fusion development of Intelligent Cockpit and Large Models has become a new growth point of user experience in the industry, which also creates problems for related scholars, practitioners and users in terms of their understanding and evaluation of the user experience and the capability characteristics of the Intelligent Cockpit Large Models (ICLM). This paper aims to analyse the current situation of Intelligent cockpit, large model and AI Agent, to reveal the key of application research focuses on the integration of Intelligent Cockpit and Large Models, and to put forward a necessary limitation for the subsequent development of an evaluation system for the capability of automotive ICLM and user experience. The evaluation system, P‑CAFE, proposed in this paper mainly proposes five dimensions of perception, cognition, action, feedback and evolution as the first‑level indicators from the domains of cognitive architecture, user experience, and capability characteristics of large models, and many second‑level indicators to satisfy the current status of the application and research focuses are selected. After expert evaluation, the weights of the indicators were determined, and the indicator system of P‑CAFE was established. Finally, a complete evaluation method was constructed based on Fuzzy Hierarchical Analysis. It will lay a solid foundation for the application and evaluation of the automotive ICLM, and provide a reference for the development and improvement of the future ICLM.

PaperID: 12885, https://arxiv.org/pdf/2409.15779.pdf

Authors: Jinche La, Jun-Gill Kang, Dasol Lee

Title: A Robust, Task-Agnostic and Fully-Scalable Voxel Mapping System for Large Scale Environments

Abstract:
Perception still remains a challenging problem for autonomous navigation in unknown environment, especially for aerial vehicles. Most mapping algorithms for autonomous navigation are specifically designed for their very intended task, which hinders extended usage or cooperative task. In this paper, we propose a voxel mapping system that can build an adaptable map for multiple tasks. The system employs hash table‑based map structure and manages each voxel with spatial and temporal priorities without explicit map boundary. We also introduce an efficient map‑sharing feature with minimal bandwidth to enable multi‑agent applications. We tested the system in real world and simulation environment by applying it for various tasks including local mapping, global mapping, cooperative multi‑agent navigation, and high‑speed navigation. Our system proved its capability to build customizable map with high resolution, wide coverage, and real‑time performance regardless of sensor and environment. The system can build a full‑resolution map using the map‑sharing feature, with over 95 % of bandwidth reduction from raw sensor data.

PaperID: 12886, https://arxiv.org/pdf/2409.15682.pdf

Authors: Yang Xu, Wenbin Lu, Rui Song

Title: Linear Contextual Bandits with Interference

Abstract:
Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings, where multiple units are present in the same round, potential interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision‑making process. Although some prior work has explored multi‑agent and adversarial bandits in interference‑aware settings, the effect of interference in CB, as well as the underlying theory, remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in Linear CB (LinCB), bridging the gap between causal inference and online decision‑making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite sample upper bounds, and asymptotic properties. The effectiveness of our approach is demonstrated through simulations and a synthetic data generated based on MovieLens data.

PaperID: 12887, https://arxiv.org/pdf/2409.15623.pdf

Authors: Yiwen Xu, Qinyang Hou, Hongyu Wan, Mirjana Prpa

Title: Safe Guard: an LLM-agent for Real-time Voice-based Hate Speech Detection in Social Virtual Reality

Abstract:
In this paper, we present Safe Guard, an LLM‑agent for the detection of hate speech in voice‑based interactions in social VR (VRChat). Our system leverages Open AI GPT and audio feature extraction for real‑time voice interactions. We contribute a system design and evaluation of the system that demonstrates the capability of our approach in detecting hate speech, and reducing false positives compared to currently available approaches. Our results indicate the potential of LLM‑based agents in creating safer virtual environments and set the groundwork for further advancements in LLM‑driven moderation approaches.

PaperID: 12888, https://arxiv.org/pdf/2409.15612.pdf

Authors: Wangyang Ying, Dongjie Wang, Xuanming Hu, Ji Qiu, Jin Park, Yanjie Fu

Title: Revolutionizing Biomarker Discovery: Leveraging Generative AI for Bio-Knowledge-Embedded Continuous Space Exploration

Abstract:
Biomarker discovery is vital in advancing personalized medicine, offering insights into disease diagnosis, prognosis, and therapeutic efficacy. Traditionally, the identification and validation of biomarkers heavily depend on extensive experiments and statistical analyses. These approaches are time‑consuming, demand extensive domain expertise, and are constrained by the complexity of biological systems. These limitations motivate us to ask: Can we automatically identify the effective biomarker subset without substantial human efforts? Inspired by the success of generative AI, we think that the intricate knowledge of biomarker identification can be compressed into a continuous embedding space, thus enhancing the search for better biomarkers. Thus, we propose a new biomarker identification framework with two important modules:1) training data preparation and 2) embedding‑optimization‑generation. The first module uses a multi‑agent system to automatically collect pairs of biomarker subsets and their corresponding prediction accuracy as training data. These data establish a strong knowledge base for biomarker identification. The second module employs an encoder‑evaluator‑decoder learning paradigm to compress the knowledge of the collected data into a continuous space. Then, it utilizes gradient‑based search techniques and autoregressive‑based reconstruction to efficiently identify the optimal subset of biomarkers. Finally, we conduct extensive experiments on three real‑world datasets to show the efficiency, robustness, and effectiveness of our method.

PaperID: 12889, https://arxiv.org/pdf/2409.15523.pdf

Authors: Woojeong Kim, Ashish Jagmohan, Aditya Vempaty

Title: SEAL: Suite for Evaluating API-use of LLMs

Abstract:
Large language models (LLMs) have limitations in handling tasks that require real‑time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs' API‑use capabilities, they often suffer from issues such as lack of generalizability, limited multi‑step reasoning coverage, and instability due to real‑time API fluctuations. In this paper, we introduce SEAL, an end‑to‑end testbed designed to evaluate LLMs in real‑world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real‑time APIs by introducing a GPT‑4‑powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real‑world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.

PaperID: 12890, https://arxiv.org/pdf/2409.15345.pdf

Authors: Shengbo Wang, Jingwen Zhao, Tongming Pu, Liangbing Zhao, Xiaoyu Guo, Yue Cheng, Cong Li, Weihao Ma, Chenyu Tang, Zhenyu Xu, Ningli Wang, Luigi Occhipinti, Arokia Nathan, Ravinder Dahiya, Huaqiang Wu, Li Tao, Shuo Gao

Title: Neuromorphic spatiotemporal optical flow: Enabling ultrafast visual perception beyond human capabilities

Abstract:
Optical flow, inspired by the mechanisms of biological visual systems, calculates spatial motion vectors within visual scenes that are necessary for enabling robotics to excel in complex and dynamic working environments. However, current optical flow algorithms, despite human‑competitive task performance on benchmark datasets, remain constrained by unacceptable time delays (~0.6 seconds per inference, 4X human processing speed) in practical deployment. Here, we introduce a neuromorphic optical flow approach that addresses delay bottlenecks by encoding temporal information directly in a synaptic transistor array to assist spatial motion analysis. Compared to conventional spatial‑only optical flow methods, our spatiotemporal neuromorphic optical flow offers the spatial‑temporal consistency of motion information, rapidly identifying regions of interest in as little as 1‑2 ms using the temporal motion cues derived from the embedded temporal information in the two‑dimensional floating gate synaptic transistors. Thus, the visual input can be selectively filtered to achieve faster velocity calculations and various task execution. At the hardware level, due to the atomically sharp interfaces between distinct functional layers in two‑dimensional van der Waals heterostructures, the synaptic transistor offers high‑frequency response (~100 μs), robust non‑volatility (>10000 s), and excellent endurance (>8000 cycles), enabling robust visual processing. In software benchmarks, our system outperforms state‑of‑the‑art algorithms with a 400% speedup, frequently surpassing human‑level performance while maintaining or enhancing accuracy by utilizing the temporal priors provided by the embedded temporal information.

PaperID: 12891, https://arxiv.org/pdf/2409.14913.pdf

Authors: Peter Mühlbacher, Nikos I. Bosse, Lawrence Phillips

Title: Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

Abstract:
We present initial results of a forthcoming benchmark for evaluating LLM agents on white‑collar tasks of economic value. We evaluate agents on real‑world "messy" open‑web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1‑preview, GPT‑4o, Claude‑3.5 Sonnet, Llama 3.1 (405b), and GPT‑4o‑mini. On average, LLM agents powered by Claude‑3.5 Sonnet and o1‑preview substantially outperformed agents using GPT‑4o, with agents based on Llama 3.1 (405b) and GPT‑4o‑mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations. Our evaluation represents the first in‑depth assessment of agents' abilities to conduct challenging, economically valuable analyst‑style research on the real open web.

PaperID: 12892, https://arxiv.org/pdf/2409.14833.pdf

Authors: Ernesto Casablanca, Zengjie Zhang, Gregorio Marchesini, Sofie Haesaert, Dimos V. Dimarogonas, Sadegh Soudjani

Title: SymAware: A Software Development Framework for Trustworthy Multi-Agent Systems with Situational Awareness

Abstract:
Developing trustworthy multi‑agent systems for practical applications is challenging due to the complicated communication of situational awareness (SA) among agents. This paper showcases a novel efficient and easy‑to‑use software framework for multi‑agent simulation, named SymAware which provides a rich set of predefined data structures to compute, store, and communicate SA for agents. It also provides an abstract interface for the agents to compute their control inputs taking into account the awareness of the situation, knowledge, and risk of surrounding agents. Besides, utilizing a cluster of specialized components, SymAware hides the heavy computation of physical rendering and communication interfacing of simulation engines behind the control threads, resulting in high implementation efficiency in bridging the gap between conceptual prototyping and practical applications. Three multi‑agent case studies are used to validate the efficacy and efficiency of this software framework.

PaperID: 12893, https://arxiv.org/pdf/2409.14807.pdf

Authors: Zechang Sun, Yuan-Sen Ting, Yaobo Liang, Nan Duan, Song Huang, Zheng Cai

Title: Interpreting Multi-band Galaxy Observations with Large Language Model-Based Agents

Abstract:
Astronomical research traditionally relies on extensive domain knowledge to interpret observations and narrow down hypotheses. We demonstrate that this process can be emulated using large language model‑based agents to accelerate research workflows. We propose mephisto, a multi‑agent collaboration framework that mimics human reasoning to interpret multi‑band galaxy observations. mephisto interacts with the CIGALE codebase, which includes spectral energy distribution (SED) models to explain observations. In this open‑world setting, mephisto learns from its self‑play experience, performs tree search, and accumulates knowledge in a dynamically updated base. As a proof of concept, we apply mephisto to the latest data from the James Webb Space Telescope. mephisto attains near‑human proficiency in reasoning about galaxies' physical scenarios, even when dealing with a recently discovered population of "Little Red Dot" galaxies. This represents the first demonstration of agentic research in astronomy, advancing towards end‑to‑end research via LLM agents and potentially expediting astronomical discoveries.

PaperID: 12894, https://arxiv.org/pdf/2409.14739.pdf

Authors: Chengjie Liu, Weiyu Chen, Anlan Peng, Yuan Du, Li Du, Jun Yang

Title: AmpAgent: An LLM-based Multi-Agent System for Multi-stage Amplifier Schematic Design from Literature for Process and Performance Porting

Abstract:
Multi‑stage amplifiers are widely applied in analog circuits. However, their large number of components, complex transfer functions, and intricate pole‑zero distributions necessitate extensive manpower for derivation and param sizing to ensure their stability. In order to achieve efficient derivation of the transfer function and simplify the difficulty of circuit design, we propose AmpAgent: a multi‑agent system based on large language models (LLMs) for efficiently designing such complex amplifiers from literature with process and performance porting. AmpAgent is composed of three agents: Literature Analysis Agent, Mathematics Reasoning Agent and Device Sizing Agent. They are separately responsible for retrieving key information (e.g. formulas and transfer functions) from the literature, decompose the whole circuit's design problem by deriving the key formulas, and address the decomposed problem iteratively. AmpAgent was employed in the schematic design of seven types of multi‑stage amplifiers with different compensation techniques. In terms of design efficiency, AmpAgent has reduced the number of iterations by 1.32 ～ 4× and execution time by 1.19 ～ 2.99× compared to conventional optimization algorithms, with a success rate increased by 1.03 ～ 6.79×. In terms of circuit performance, it has improved by 1.63 ～ 27.25× compared to the original literature. The findings suggest that LLMs could play a crucial role in the field of complex analog circuit schematic design, as well as process and performance porting.

PaperID: 12895, https://arxiv.org/pdf/2409.14674.pdf

Authors: Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai

Title: RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

Abstract:
Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self‑recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine‑grained language annotations for training. We then introduce Rich languAge‑guided failure reCovERy (RACER), a supervisor‑actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision‑language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language‑conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state‑of‑the‑art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long‑horizon tasks, dynamic goal‑change tasks and zero‑shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: https://rich‑language‑failure‑recovery.github.io.

PaperID: 12896, https://arxiv.org/pdf/2409.14605.pdf

Authors: Xiaomin Liu, Qizhi Qiu, Yihao Zhang, Yuming Cheng, Lilin Yi, Weisheng Hu, Qunbi Zhuge

Title: First Field Trial of LLM-Powered AI Agent for Lifecycle Management of Autonomous Driving Optical Networks

Abstract:
We design and demonstrate the first field trial of LLM‑powered AI Agent for ADON. Three operation modes of the Agent are proposed for network lifecycle management. The Agent efficiently processes wavelength add/drop and soft/hard failures, and achieves comparable performance to human‑designed algorithms for power optimization.

PaperID: 12897, https://arxiv.org/pdf/2409.14542.pdf

Authors: Luke Snow, Vikram Krishnamurthy

Title: Distributionally Robust Inverse Reinforcement Learning for Identifying Multi-Agent Coordinated Sensing

Abstract:
We derive a minimax distributionally robust inverse reinforcement learning (IRL) algorithm to reconstruct the utility functions of a multi‑agent sensing system. Specifically, we construct utility estimators which minimize the worst‑case prediction error over a Wasserstein ambiguity set centered at noisy signal observations. We prove the equivalence between this robust estimation and a semi‑infinite optimization reformulation, and we propose a consistent algorithm to compute solutions. We illustrate the efficacy of this robust IRL scheme in numerical studies to reconstruct the utility functions of a cognitive radar network from observed tracking signals.

PaperID: 12898, https://arxiv.org/pdf/2409.14499.pdf

Authors: Xiang Huo, Hao Huang, Katherine R. Davis, H. Vincent Poor, Mingxi Liu

Title: A Review of Scalable and Privacy-Preserving Multi-Agent Frameworks for Distributed Energy Resources

Abstract:
Distributed energy resources (DERs) are gaining prominence due to their advantages in improving energy efficiency, reducing carbon emissions, and enhancing grid resilience. Despite the increasing deployment, the potential of DERs has yet to be fully explored and exploited. A fundamental question restrains the management of numerous DERs in large‑scale power systems, "How should DER data be securely processed and DER operations be efficiently optimized?" To address this question, this paper considers two critical issues, namely privacy for processing DER data and scalability in optimizing DER operations, then surveys existing and emerging solutions from a multi‑agent framework perspective. In the context of scalability, this paper reviews state‑of‑the‑art research that relies on parallel control, optimization, and learning within distributed and/or decentralized information exchange structures, while in the context of privacy, it identifies privacy preservation measures that can be synthesized into the aforementioned scalable structures. Despite research advances in these areas, challenges remain because these highly interdisciplinary studies blend a wide variety of scalable computing architectures and privacy preservation techniques from different fields, making them difficult to adapt in practice. To mitigate this issue, this paper provides a holistic review of trending strategies that orchestrate privacy and scalability for large‑scale power system operations from a multi‑agent perspective, particularly for DER control problems. Furthermore, this review extrapolates new approaches for future scalable, privacy‑aware, and cybersecure pathways to unlock the full potential of DERs through controlling, optimizing, and learning generic multi‑agent‑based cyber‑physical systems.

PaperID: 12899, https://arxiv.org/pdf/2409.14491.pdf

Authors: Rishi Veerapaneni, Arthur Jakobsson, Kevin Ren, Samuel Kim, Jiaoyang Li, Maxim Likhachev

Title: Work Smarter Not Harder: Simple Imitation Learning with CS-PIBT Outperforms Large Scale Imitation Learning for MAPF

Abstract:
Multi‑Agent Path Finding (MAPF) is the problem of effectively finding efficient collision‑free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high‑performance heuristic search methods. Recently, several works have applied various machine learning (ML) techniques to solve MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set‑ups, but none using large amounts of high‑quality supervised data. Our initial objective in this work was to show how simple large scale imitation learning of high‑quality heuristic search methods can lead to state‑of‑the‑art ML MAPF performance. However, we find that, at least with our model architecture, simple large scale (700k examples with hundreds of agents per example) imitation learning does not produce impressive results. Instead, we find that by using prior work that post‑processes MAPF model predictions to resolve 1‑step collisions (CS‑PIBT), we can train a simple ML MAPF model in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication) which currently struggle to scale. In particular, this finding implies that future learnt policies should (1) always use smart 1‑step collision shields (e.g. CS‑PIBT), (2) always include the collision shield with greedy actions as a baseline (e.g. PIBT) and (3) motivates future models to focus on longer horizon / more complex planning as 1‑step collisions can be efficiently resolved.

PaperID: 12900, https://arxiv.org/pdf/2409.14488.pdf

Authors: Ruoyu Song, Muslum Ozgur Ozmen, Hyungsub Kim, Antonio Bianchi, Z. Berkay Celik

Title: Enhancing LLM-based Autonomous Driving Agents to Mitigate Perception Attacks

Abstract:
There is a growing interest in integrating Large Language Models (LLMs) with autonomous driving (AD) systems. However, AD systems are vulnerable to attacks against their object detection and tracking (ODT) functions. Unfortunately, our evaluation of four recent LLM agents against ODT attacks shows that the attacks are 63.26% successful in causing them to crash or violate traffic rules due to (1) misleading memory modules that provide past experiences for decision making, (2) limitations of prompts in identifying inconsistencies, and (3) reliance on ground truth perception data. In this paper, we introduce Hudson, a driving reasoning agent that extends prior LLM‑based driving systems to enable safer decision making during perception attacks while maintaining effectiveness under benign conditions. Hudson achieves this by first instrumenting the AD software to collect real‑time perception results and contextual information from the driving scene. This data is then formalized into a domain‑specific language (DSL). To guide the LLM in detecting and making safe control decisions during ODT attacks, Hudson translates the DSL into natural language, along with a list of custom attack detection instructions. Following query execution, Hudson analyzes the LLM's control decision to understand its causal reasoning process. We evaluate the effectiveness of Hudson using a proprietary LLM (GPT‑4) and two open‑source LLMs (Llama and Gemma) in various adversarial driving scenarios. GPT‑4, Llama, and Gemma achieve, on average, an attack detection accuracy of 83. 3%, 63. 6%, and 73. 6%. Consequently, they make safe control decisions in 86.4%, 73.9%, and 80% of the attacks. Our results, following the growing interest in integrating LLMs into AD systems, highlight the strengths of LLMs and their potential to detect and mitigate ODT attacks.

PaperID: 12901, https://arxiv.org/pdf/2409.14457.pdf

Authors: Yuntao Wang, Yanghe Pan, Zhou Su, Yi Deng, Quan Zhao, Linkang Du, Tom H. Luan, Jiawen Kang, Dusit Niyato

Title: Large Model Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends

Abstract:
With the rapid advancement of large models (LMs), the development of general‑purpose intelligent agents powered by LMs has become a reality. It is foreseeable that in the near future, LM‑driven general AI agents will serve as essential tools in production tasks, capable of autonomous communication and collaboration without human intervention. This paper investigates scenarios involving the autonomous collaboration of future LM agents. We review the current state of LM agents, the key technologies enabling LM agent collaboration, and the security and privacy challenges they face during cooperative operations. To this end, we first explore the foundational principles of LM agents, including their general architecture, key components, enabling technologies, and modern applications. We then discuss practical collaboration paradigms from data, computation, and knowledge perspectives to achieve connected intelligence among LM agents. After that, we analyze the security vulnerabilities and privacy risks associated with LM agents, particularly in multi‑agent settings, examining underlying mechanisms and reviewing current and potential countermeasures. Lastly, we propose future research directions for building robust and secure LM agent ecosystems.

PaperID: 12902, https://arxiv.org/pdf/2409.14435.pdf

Authors: Tan-Hanh Pham, Godwyll Aikins, Tri Truong, Kim-Doang Nguyen

Title: Adaptive Compensation for Robotic Joint Failures Using Partially Observable Reinforcement Learning

Abstract:
Robotic manipulators are widely used in various industries for complex and repetitive tasks. However, they remain vulnerable to unexpected hardware failures. In this study, we address the challenge of enabling a robotic manipulator to complete tasks despite joint malfunctions. Specifically, we develop a reinforcement learning (RL) framework to adaptively compensate for a non‑functional joint during task execution. Our experimental platform is the Franka robot with 7 degrees of freedom (DOFs). We formulate the problem as a partially observable Markov decision process (POMDP), where the robot is trained under various joint failure conditions and tested in both seen and unseen scenarios. We consider scenarios where a joint is permanently broken and where it functions intermittently. Additionally, we demonstrate the effectiveness of our approach by comparing it with traditional inverse kinematics‑based control methods. The results show that the RL algorithm enables the robot to successfully complete tasks even with joint failures, achieving a high success rate with an average rate of 93.6%. This showcases its robustness and adaptability. Our findings highlight the potential of RL to enhance the resilience and reliability of robotic systems, making them better suited for unpredictable environments. All related codes and models are published online.

PaperID: 12903, https://arxiv.org/pdf/2409.14228.pdf

Authors: Siyu Zha, Yujia Liu, Chengbo Zheng, Jiaqi XU, Fuze Yu, Jiangtao Gong, Yingqing XU

Title: Mentigo: An Intelligent Agent for Mentoring Students in the Creative Problem Solving Process

Abstract:
With the increasing integration of large lauguage models (LLMs) in education, there is growing interest in using AI agents to support student learning in creative tasks. This study presents an interactive Mentor Agent system named Mentigo, which is designed to assist middle school students in the creative problem solving (CPS) process. We created a comprehensive dataset of real classroom interactions between students and mentors, which include the structured CPS task management, diverse guidance techniques, personalized feedback mechanisms. Based on this dataset, we create agentic workflow for the Mentigo system. The system's effectiveness was evaluated through a comparative experiment with 12 students and reviewed by five expert teachers. The Mentigo system demonstrated significant improvements in student engagement and creative outcomes. The findings provide design implications for leveraging LLMs to support CPS and offer insights into the application of AI mentor agents in educational contexts.

PaperID: 12904, https://arxiv.org/pdf/2409.14177.pdf

Authors: Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li

Title: PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Abstract:
In recent years, Large Language Models (LLMs) have gained widespread use, raising concerns about their security. Traditional jailbreak attacks, which often rely on the model internal information or have limitations when exploring the unsafe behavior of the victim model, limiting their reducing their general applicability. In this paper, we introduce PathSeeker, a novel black‑box jailbreak method, which is inspired by the game of rats escaping a maze. We think that each LLM has its unique "security maze", and attackers attempt to find the exit learning from the received feedback and their accumulated experience to compromise the target LLM's security defences. Our approach leverages multi‑agent reinforcement learning, where smaller models collaborate to guide the main LLM in performing mutation operations to achieve the attack objectives. By progressively modifying inputs based on the model's feedback, our system induces richer, harmful responses. During our manual attempts to perform jailbreak attacks, we found that the vocabulary of the response of the target model gradually became richer and eventually produced harmful responses. Based on the observation, we also introduce a reward mechanism that exploits the expansion of vocabulary richness in LLM responses to weaken security constraints. Our method outperforms five state‑of‑the‑art attack techniques when tested across 13 commercial and open‑source LLMs, achieving high attack success rates, especially in strongly aligned commercial models like GPT‑4o‑mini, Claude‑3.5, and GLM‑4‑air with strong safety alignment. This study aims to improve the understanding of LLM security vulnerabilities and we hope that this sturdy can contribute to the development of more robust defenses.

PaperID: 12905, https://arxiv.org/pdf/2409.14051.pdf

Authors: Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, Jing Li

Title: GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

Abstract:
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance the logical reasoning abilities such as Chain‑of‑Thought, Chain‑of‑Thought with Self‑Consistency, Tree‑Of‑Thoughts, and multi‑agent debates. In the context of multi‑agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the tokens cost of debates, thereby limiting the scalability of the multi‑agent debate technique. To better harness the advantages of multi‑agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi‑agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce the total tokens by up to 51.7% during debates and while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in the multi‑agent debate.

PaperID: 12906, https://arxiv.org/pdf/2409.13824.pdf

Authors: Ziqin Yuan, Ruiqi Wang, Taehyeon Kim, Dezhong Zhao, Ike Obi, Byung-Cheol Min

Title: Adaptive Task Allocation in Multi-Human Multi-Robot Teams under Team Heterogeneity and Dynamic Information Uncertainty

Abstract:
Task allocation in multi‑human multi‑robot (MH‑MR) teams presents significant challenges due to the inherent heterogeneity of team members, the dynamics of task execution, and the information uncertainty of operational states. Existing approaches often fail to address these challenges simultaneously, resulting in suboptimal performance. To tackle this, we propose ATA‑HRL, an adaptive task allocation framework using hierarchical reinforcement learning (HRL), which incorporates initial task allocation (ITA) that leverages team heterogeneity and conditional task reallocation in response to dynamic operational states. Additionally, we introduce an auxiliary state representation learning task to manage information uncertainty and enhance task execution. Through an extensive case study in large‑scale environmental monitoring tasks, we demonstrate the benefits of our approach.

PaperID: 12907, https://arxiv.org/pdf/2409.13783.pdf

Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang, Xingyu Hu, Songyu Weng

Title: A Value Based Parallel Update MCTS Method for Multi-Agent Cooperative Decision Making of Connected and Automated Vehicles

Abstract:
To solve the problem of lateral and logitudinal joint decision‑making of multi‑vehicle cooperative driving for connected and automated vehicles (CAVs), this paper proposes a Monte Carlo tree search (MCTS) method with parallel update for multi‑agent Markov game with limited horizon and time discounted setting. By analyzing the parallel actions in the multi‑vehicle joint action space in the partial‑steady‑state traffic flow, the parallel update method can quickly exclude potential dangerous actions, thereby increasing the search depth without sacrificing the search breadth. The proposed method is tested in a large number of randomly generated traffic flow. The experiment results show that the algorithm has good robustness and better performance than the SOTA reinforcement learning algorithms and heuristic methods. The vehicle driving strategy using the proposed algorithm shows rationality beyond human drivers, and has advantages in traffic efficiency and safety in the coordinating zone.

PaperID: 12908, https://arxiv.org/pdf/2409.13759.pdf

Authors: Renato L. Conforme Rosado, Francisco C. Calderon Bocanegra

Title: Simulación de la distribución de alimento en el cultivo de camarón

Abstract:
This document presents the experimentation of 4 cases of food distribution for shrimp farming. The distributions are based on the location of the automatic feeders. Three cases applied in reality and a fourth case where the food is irrigated on the crop simultaneously and uniformly. In a first stage, the simulation of the three distribution cases is successfully adjusted to reality, where the trend of the shrimp growth curve is correlated with the historical data curve. A second stage where you experiment in 16 configurations that are based on the amount of food, the density of biomass and the distribution of the food. The simulation adopts the concepts of genetic algorithms to improve the population and fuzzy logic as an agent evaluation technique for decision‑making against the quality of physical‑chemical parameters in the simulated environment. The results of these interactions reveal a reduction in the simulated total culture time from 22 weeks to 14 weeks.

PaperID: 12909, https://arxiv.org/pdf/2409.13753.pdf

Authors: Asher Sprigler, Alexander Drobek, Keagan Weinstock, Wendpanga Tapsoba, Gavin Childress, Andy Dao, Lucas Gral

Title: Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models

Abstract:
Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi‑agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have also been made in LLM‑based interaction with existing worlds, particularly in interacting with simulated environments. This paper aims to integrate both aforementioned topics (agents & world interaction) into a single simulation where multiple agents can work together to solve a problem, modeling how groups of humans can often solve problems better than individuals. By showing whether LLMs demonstrate the synergy of human collaboration, it could lead to advancements in the applications of LLMs. We implemented two simulations: a physical studio apartment with two roommates, and another where agents collaborate to complete a programming task. We provide a multi‑agent framework, discuss the performance of the agents in each simulation, and discuss potential future additions.

PaperID: 12910, https://arxiv.org/pdf/2409.13642.pdf

Authors: Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang

Title: A Multi-Agent Approach to Fault Localization via Graph-Based Retrieval and Reflexion

Abstract:
Identifying and resolving software faults remains a challenging and resource‑intensive process. Traditional fault localization techniques, such as Spectrum‑Based Fault Localization (SBFL), leverage statistical analysis of test coverage but often suffer from limited accuracy. While learning‑based approaches improve fault localization, they demand extensive training datasets and high computational resources. Recent advances in Large Language Models (LLMs) offer new opportunities by enhancing code understanding and reasoning. However, existing LLM‑based fault localization techniques face significant challenges, including token limitations, performance degradation with long inputs, and scalability issues in complex software systems. To overcome these obstacles, we propose LLM4FL, a multi‑agent fault localization framework that utilizes three specialized LLM agents. First, the Context Extraction Agent applies an order‑sensitive segmentation strategy to partition large coverage data within the LLM's token limit, analyze failure context, and prioritize failure‑related methods. The Debugger Agent then processes the extracted data, which employs graph‑based retrieval‑augmented code navigation to reason about failure causes and rank suspicious methods. Finally, the Reviewer Agent re‑evaluates the identified faulty methods using verbal reinforcement learning, engaging in self‑criticism and iterative refinement. Evaluated on the Defects4J (V2.0.0) benchmark, which includes 675 faults from 14 Java projects, LLM4FL achieves an 18.55% improvement in Top‑1 accuracy over AutoFL and 4.82% over SoapFL. It outperforms supervised techniques such as DeepFL and Grace, all without requiring task‑specific training. Furthermore, its coverage segmentation and prompt chaining strategies enhance performance, increasing Top‑1 accuracy by up to 22%.

PaperID: 12911, https://arxiv.org/pdf/2409.13588.pdf

Authors: Jingyue Zhang, Ian Arawjo

Title: ChainBuddy: An AI Agent System for Generating LLM Pipelines

Abstract:
As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user‑defined tasks and craft effective pipelines to do so. Many users struggle with where to start, often referred to as the "blank page problem." ChainBuddy, an AI workflow generation assistant built into the ChainForge platform, aims to tackle this issue. From a single prompt or chat, ChainBuddy generates a starter evaluative LLM pipeline in ChainForge aligned to the user's requirements. ChainBuddy offers a straightforward and user‑friendly way to plan and evaluate LLM behavior and make the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within‑subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload, felt more confident, and produced higher quality pipelines evaluating LLM behavior. However, we also uncover a mismatch between subjective and objective ratings of performance: participants rated their successfulness similarly across conditions, while independent experts rated participant workflows significantly higher with AI assistance. Drawing connections to the Dunning‑Kruger effect, we draw design implications for the future of workflow generation assistants to mitigate the risk of over‑reliance.

PaperID: 12912, https://arxiv.org/pdf/2409.13571.pdf

Authors: Jaeyeon Jang, Diego Klabjan, Han Liu, Nital S. Patel, Xiuqi Li, Balakrishnan Ananthanarayanan, Husam Dauod, Tzung-Han Juang

Title: Scalable Multi-agent Reinforcement Learning for Factory-wide Dynamic Scheduling

Abstract:
Real‑time dynamic scheduling is a crucial but notoriously challenging task in modern manufacturing processes due to its high decision complexity. Recently, reinforcement learning (RL) has been gaining attention as an impactful technique to handle this challenge. However, classical RL methods typically rely on human‑made dispatching rules, which are not suitable for large‑scale factory‑wide scheduling. To bridge this gap, this paper applies a leader‑follower multi‑agent RL (MARL) concept to obtain desired coordination after decomposing the scheduling problem into a set of sub‑problems that are handled by each individual agent for scalability. We further strengthen the procedure by proposing a rule‑based conversion algorithm to prevent catastrophic loss of production capacity due to an agent's error. Our experimental results demonstrate that the proposed model outperforms the state‑of‑the‑art deep RL‑based scheduling models in various aspects. Additionally, the proposed model provides the most robust scheduling performance to demand changes. Overall, the proposed MARL‑based scheduling model presents a promising solution to the real‑time scheduling problem, with potential applications in various manufacturing industries.

PaperID: 12913, https://arxiv.org/pdf/2409.13449.pdf

Authors: Ming Wang, Yuanzhong Liu, Xiaoyu Liang, Yijie Huang, Daling Wang, Xiaocui Yang, Sijia Shen, Shi Feng, Xiaoming Zhang, Chaofeng Guan, Yifei Zhang

Title: Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts

Abstract:
LLMs have demonstrated commendable performance across diverse domains. Nevertheless, formulating high‑quality prompts to assist them in their work poses a challenge for non‑AI experts. Existing research in prompt engineering suggests somewhat scattered optimization principles and designs empirically dependent prompt optimizers. Unfortunately, these endeavors lack a structural design, incurring high learning costs and it is not conducive to the iterative updating of prompts, especially for non‑AI experts. Inspired by structured reusable programming languages, we propose LangGPT, a structural prompt design framework. Furthermore, we introduce Minstrel, a multi‑generative agent system with reflection to automate the generation of structural prompts. Experiments and the case study illustrate that structural prompts generated by Minstrel or written manually significantly enhance the performance of LLMs. Furthermore, we analyze the ease of use of structural prompts through a user survey in our online community.

PaperID: 12914, https://arxiv.org/pdf/2409.13447.pdf

Authors: Mohanna Hoveyda, Arjen P. de Vries, Maarten de Rijke, Harrie Oosterhuis, Faegheh Hasibi

Title: AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit

Abstract:
In question answering (QA), different questions can be effectively addressed with different answering strategies. Some require a simple lookup, while others need complex, multi‑step reasoning to be answered adequately. This observation motivates the development of a dynamic method that adaptively selects the most suitable QA strategy for each question, enabling more efficient and effective systems capable of addressing a broader range of question types. To this aim, we build on recent advances in the orchestration of multiple large language models (LLMs) and formulate adaptive QA as a dynamic orchestration challenge. We define this as a contextual multi‑armed bandit problem, where the context is defined by the characteristics of the incoming question and the action space consists of potential communication graph configurations among the LLM agents. We then train a linear upper confidence bound model to learn an optimal mapping between different question types and their corresponding optimal multi‑LLM communication graph representation. Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules, as it combines the superior performance of more complex strategies while avoiding their costs when simpler strategies suffice.

PaperID: 12915, https://arxiv.org/pdf/2409.13445.pdf

Authors: Dimitrios Panagopoulos, Adolfo Perrusquia, Weisi Guo

Title: Selective Exploration and Information Gathering in Search and Rescue Using Hierarchical Learning Guided by Natural Language Input

Abstract:
In recent years, robots and autonomous systems have become increasingly integral to our daily lives, offering solutions to complex problems across various domains. Their application in search and rescue (SAR) operations, however, presents unique challenges. Comprehensively exploring the disaster‑stricken area is often infeasible due to the vastness of the terrain, transformed environment, and the time constraints involved. Traditional robotic systems typically operate on predefined search patterns and lack the ability to incorporate and exploit ground truths provided by human stakeholders, which can be the key to speeding up the learning process and enhancing triage. Addressing this gap, we introduce a system that integrates social interaction via large language models (LLMs) with a hierarchical reinforcement learning (HRL) framework. The proposed system is designed to translate verbal inputs from human stakeholders into actionable RL insights and adjust its search strategy. By leveraging human‑provided information through LLMs and structuring task execution through HRL, our approach not only bridges the gap between autonomous capabilities and human intelligence but also significantly improves the agent's learning efficiency and decision‑making process in environments characterised by long horizons and sparse rewards.

PaperID: 12916, https://arxiv.org/pdf/2409.13421.pdf

Authors: Ingvar Ziemann, Nikolai Matni, George J. Pappas

Title: State space models, emergence, and ergodicity: How many parameters are needed for stable predictions?

Abstract:
How many parameters are required for a model to execute a given task? It has been argued that large language models, pre‑trained via self‑supervised learning, exhibit emergent capabilities such as multi‑step reasoning as their number of parameters reach a critical scale. In the present work, we explore whether this phenomenon can analogously be replicated in a simple theoretical model. We show that the problem of learning linear dynamical systems ‑‑ a simple instance of self‑supervised learning ‑‑ exhibits a corresponding phase transition. Namely, for every non‑ergodic linear system there exists a critical threshold such that a learner using fewer parameters than said threshold cannot achieve bounded error for large sequence lengths. Put differently, in our model we find that tasks exhibiting substantial long‑range correlation require a certain critical number of parameters ‑‑ a phenomenon akin to emergence. We also investigate the role of the learner's parametrization and consider a simple version of a linear dynamical system with hidden state ‑‑ an imperfectly observed random walk in \mathbbR. For this situation, we show that there exists no learner using a linear filter which can succesfully learn the random walk unless the filter length exceeds a certain threshold depending on the effective memory length and horizon of the problem.

PaperID: 12917, https://arxiv.org/pdf/2409.13319.pdf

Authors: Qunsong Zeng, Zhanwei Wang, You Zhou, Hai Wu, Lin Yang, Kaibin Huang

Title: Knowledge-Based Ultra-Low-Latency Semantic Communications for Robotic Edge Intelligence

Abstract:
The 6G mobile networks will feature the widespread deployment of AI algorithms at the network edge, which provides a platform for supporting robotic edge intelligence systems. In such a system, a large‑scale knowledge graph (KG) is operated at an edge server as a "remote brain" to guide remote robots on environmental exploration or task execution. In this paper, we present a new air‑interface framework targeting the said systems, called knowledge‑based robotic semantic communications (SemCom), which consists of a protocol and relevant transmission techniques. First, the proposed robotic SemCom protocol defines a sequence of system operations for executing a given robotic task. They include identification of all task‑relevant knowledge paths (KPs) on the KG, semantic matching between KG and object classifier, and uploading of robot's observations for objects recognition and feasible KP identification. Next, to support ultra‑low‑latency feature transmission (ULL‑FT), we propose a novel transmission approach that exploits classifier's robustness, which is measured by classification margin, to compensate for a high bit error probability (BEP) resulting from ultra‑low‑latency transmission (e.g., short packet and/or no coding). By utilizing the tractable Gaussian mixture model, we derive the relation between BEP and classification margin, which sheds light on system requirements to support ULL‑FT. Furthermore, for the case where the classification margin is insufficient for coping with channel distortion, we enhance the ULL‑FT approach by studying retransmission and multi‑view classification for enlarging the margin and further quantifying corresponding requirements. Finally, experiments using DNNs as classifier models and real datasets are conducted to demonstrate the effectiveness of ULL‑FT in communication latency reduction while providing a guarantee on accurate feasible KP identification.

PaperID: 12918, https://arxiv.org/pdf/2409.13144.pdf

Authors: Mohammad Al-Sharman, Luc Edes, Bert Sun, Vishal Jayakumar, Mohamed A. Daoud, Derek Rayside, William Melek

Title: Autonomous Driving at Unsignalized Intersections: A Review of Decision-Making Challenges and Reinforcement Learning-Based Solutions

Abstract:
Autonomous driving at unsignalized intersections is still considered a challenging application for machine learning due to the complications associated with handling complex multi‑agent scenarios characterized by a high degree of uncertainty. Automating the decision‑making process at these safety‑critical environments involves comprehending multiple levels of abstractions associated with learning robust driving behaviors to enable the vehicle to navigate efficiently. In this survey, we aim at exploring the state‑of‑the‑art techniques implemented for decision‑making applications, with a focus on algorithms that combine Reinforcement Learning (RL) and deep learning for learning traversing policies at unsignalized intersections. The reviewed schemes vary in the proposed driving scenario, in the assumptions made for the used intersection model, in the tackled challenges, and in the learning algorithms that are used. We have presented comparisons for these techniques to highlight their limitations and strengths. Based on our in‑depth investigation, it can be discerned that a robust decision‑making scheme for navigating real‑world unsignalized intersection has yet to be developed. Along with our analysis and discussion, we recommend potential research directions encouraging the interested players to tackle the highlighted challenges. By adhering to our recommendations, decision‑making architectures that are both non‑overcautious and safe, yet feasible, can be trained and validated in real‑world unsignalized intersections environments.

PaperID: 12919, https://arxiv.org/pdf/2409.13107.pdf

Authors: Hao Ding, Lalithkumar Seenivasan, Hongchao Shu, Grayson Byrd, Han Zhang, Pu Xiao, Juan Antonio Barragan, Russell H. Taylor, Peter Kazanzides, Mathias Unberath

Title: Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models

Abstract:
Large language model‑based (LLM) agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, but especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, developing similarly powerful and robust perception algorithms is necessary to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM‑based task planning while adopting simple yet severely limited perception solutions to meet the needs for bench‑top experiments, but lacks the critical flexibility to scale to less constrained settings. In this work, we propose an alternate perception approach ‑‑ a digital twin (DT)‑based machine perception approach that capitalizes on the convincing performance and out‑of‑the‑box generalization of recent vision foundation models. Integrating our DT representation and LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environmental settings. Despite a convincing performance, this work is merely a first step towards the integration of DT representations. Future studies are necessary for the realization of a comprehensive DT framework to improve the interpretability and generalizability of embodied intelligence in surgery.

PaperID: 12920, https://arxiv.org/pdf/2409.13082.pdf

Authors: Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, Shan Lu

Title: AutoVerus: Automated Proof Generation for Rust Code

Abstract:
Generative AI has shown its values for many software engineering tasks. Still in its infancy, large language model (LLM)‑based proof generation lags behind LLM‑based code generation. In this paper, we present AutoVerus. AutoVerus uses LLMs to automatically generate correctness proof for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non‑trivial proof tasks, based on existing code‑generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proof for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.

PaperID: 12921, https://arxiv.org/pdf/2409.13065.pdf

Authors: Jake Olkin, Viraj Parimi, Brian Williams

Title: Multi-Agent Vulcan: An Information-Driven Multi-Agent Path Finding Approach

Abstract:
Scientists often search for phenomena of interest while exploring new environments. Autonomous vehicles are deployed to explore such areas where human‑operated vehicles would be costly or dangerous. Online control of autonomous vehicles for information‑gathering is called adaptive sampling and can be framed as a POMDP that uses information gain as its principal objective. While prior work focuses largely on single‑agent scenarios, this paper confronts challenges unique to multi‑agent adaptive sampling, such as avoiding redundant observations, preventing vehicle collision, and facilitating path planning under limited communication. We start with Multi‑Agent Path Finding (MAPF) methods, which address collision avoidance by decomposing the MAPF problem into a series of single‑agent path planning problems. We then present information‑driven MAPF which addresses multi‑agent information gain under limited communication. First, we introduce an admissible heuristic that relaxes mutual information gain to an additive function that can be evaluated as a set of independent single agent path planning problems. Second, we extend our approach to a distributed system that is robust to limited communication. When all agents are in range, the group plans jointly to maximize information. When some agents move out of range, communicating subgroups are formed and the subgroups plan independently. Since redundant observations are less likely when vehicles are far apart, this approach only incurs a small loss in information gain, resulting in an approach that gracefully transitions from full to partial communication. We evaluate our method against other adaptive sampling strategies across various scenarios, including real‑world robotic applications. Our method was able to locate up to 200% more unique phenomena in certain scenarios, and each agent located its first unique phenomenon faster by up to 50%.

PaperID: 12922, https://arxiv.org/pdf/2409.12984.pdf

Authors: Shuyue Wang, Liujie Ren, Tianyao Zhou, Lili Chen, Tianyu Zhang, Yaoyao Fu, Shuo Wang

Title: Large Language Model-Enhanced Interactive Agent for Public Education on Newborn Auricular Deformities

Abstract:
Auricular deformities are quite common in newborns with potential long‑term negative effects of mental and even hearing problems.Early diagnosis and subsequent treatment are critical for the illness; yet they are missing most of the time due to lack of knowledge among parents. With the help of large language model of Ernie of Baidu Inc., we derive a realization of interactive agent. Firstly, it is intelligent enough to detect which type of auricular deformity corresponding to uploaded images, which is accomplished by PaddleDetection, with precision rate 75%. Secondly, in terms of popularizing the knowledge of auricular deformities, the agent can give professional suggestions of the illness to parents. The above two effects are evaluated via tests on volunteers with control groups in the paper. The agent can reach parents with newborns as well as their pediatrician remotely via Internet in vast, rural areas with quality medical diagnosis capabilities and professional query‑answering functions, which is good news for newborn auricular deformity and other illness that requires early intervention for better treatment.

PaperID: 12923, https://arxiv.org/pdf/2409.12882.pdf

Authors: Hairi, Minghong Fang, Zifan Zhang, Alvaro Velasquez, Jia Liu

Title: On the Hardness of Decentralized Multi-Agent Policy Evaluation under Byzantine Attacks

Abstract:
In this paper, we study a fully‑decentralized multi‑agent policy evaluation problem, which is an important sub‑problem in cooperative multi‑agent reinforcement learning, in the presence of up to f faulty agents. In particular, we focus on the so‑called Byzantine faulty model with model poisoning setting. In general, policy evaluation is to evaluate the value function of any given policy. In cooperative multi‑agent system, the system‑wide rewards are usually modeled as the uniform average of rewards from all agents. We investigate the multi‑agent policy evaluation problem in the presence of Byzantine agents, particularly in the setting of heterogeneous local rewards. Ideally, the goal of the agents is to evaluate the accumulated system‑wide rewards, which are uniform average of rewards of the normal agents for a given policy. It means that all agents agree upon common values (the consensus part) and furthermore, the consensus values are the value functions (the convergence part). However, we prove that this goal is not achievable. Instead, we consider a relaxed version of the problem, where the goal of the agents is to evaluate accumulated system‑wide reward, which is an appropriately weighted average reward of the normal agents. We further prove that there is no correct algorithm that can guarantee that the total number of positive weights exceeds |\mathcalN|‑f , where |\mathcalN| is the number of normal agents. Towards the end, we propose a Byzantine‑tolerant decentralized temporal difference algorithm that can guarantee asymptotic consensus under scalar function approximation. We then empirically test the effective of the proposed algorithm.

PaperID: 12924, https://arxiv.org/pdf/2409.12824.pdf

Authors: Peng Ren, Yuqing Hao, Zhiyong Sun, Qingyun Wang, Guanrong Chen

Title: Data-Driven Cooperative Output Regulation of Continuous-Time Multi-Agent Systems with Unknown Network Topology

Abstract:
This paper investigates data‑driven cooperative output regulation for continuous‑time multi‑agent systems with unknown network topology. Unlike existing studies that typically assume a known network topology to directly compute controller parameters, a novel approach is proposed that allows for the computation of the parameter without prior knowledge of the topology. A lower bound on the minimum non‑zero eigenvalue of the Laplacian matrix is estimated using only edge weight bounds, enabling the output regulation controller design to be independent of global network information. Additionally, the common need for state derivative measurements is eliminated, reducing the amount of data requirements. Furthermore, necessary and sufficient conditions are established to ensure that the data are informative for cooperative output regulation, leading to the design of a distributed output regulation controller. For the case with noisy data, the bound of the output error is provided, which is positively correlated with the noise bound, and a distributed controller is constructed for the approximate cooperative output regulation. Finally, the effectiveness of the proposed methods is verified through numerical simulations.

PaperID: 12925, https://arxiv.org/pdf/2409.12812.pdf

Authors: Shiyu Fang, Jiaqi Liu, Mingyu Ding, Yiming Cui, Chen Lv, Peng Hang, Jian Sun

Title: Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making Framework

Abstract:
At present, Connected Autonomous Vehicles (CAVs) have begun to open road testing around the world, but their safety and efficiency performance in complex scenarios is still not satisfactory. Cooperative driving leverages the connectivity ability of CAVs to achieve synergies greater than the sum of their parts, making it a promising approach to improving CAV performance in complex scenarios. However, the lack of interaction and continuous learning ability limits current cooperative driving to single‑scenario applications and specific Cooperative Driving Automation (CDA). To address these challenges, this paper proposes CoDrivingLLM, an interactive and learnable LLM‑driven cooperative driving framework, to achieve all‑scenario and all‑CDA. First, since Large Language Models(LLMs) are not adept at handling mathematical calculations, an environment module is introduced to update vehicle positions based on semantic decisions, thus avoiding potential errors from direct LLM control of vehicle positions. Second, based on the four levels of CDA defined by the SAE J3216 standard, we propose a Chain‑of‑Thought (COT) based reasoning module that includes state perception, intent sharing, negotiation, and decision‑making, enhancing the stability of LLMs in multi‑step reasoning tasks. Centralized conflict resolution is then managed through a conflict coordinator in the reasoning process. Finally, by introducing a memory module and employing retrieval‑augmented generation, CAVs are endowed with the ability to learn from their past experiences. We validate the proposed CoDrivingLLM through ablation experiments on the negotiation module, reasoning with different shots experience, and comparison with other cooperative driving methods.

PaperID: 12926, https://arxiv.org/pdf/2409.12809.pdf

Authors: Philipp Spitzer, Joshua Holstein, Katelyn Morrison, Kenneth Holstein, Gerhard Satzger, Niklas Kühl

Title: Don't be Fooled: The Misinformation Effect of Explanations in Human-AI Collaboration

Abstract:
Across various applications, humans increasingly use black‑box artificial intelligence (AI) systems without insight into these systems' reasoning. To counter this opacity, explainable AI (XAI) methods promise enhanced transparency and interpretability. While recent studies have explored how XAI affects human‑AI collaboration, few have examined the potential pitfalls caused by incorrect explanations. The implications for humans can be far‑reaching but have not been explored extensively. To investigate this, we ran a study (n=160) on AI‑assisted decision‑making in which humans were supported by XAI. Our findings reveal a misinformation effect when incorrect explanations accompany correct AI advice with implications post‑collaboration. This effect causes humans to infer flawed reasoning strategies, hindering task execution and demonstrating impaired procedural knowledge. Additionally, incorrect explanations compromise human‑AI team‑performance during collaboration. With our work, we contribute to HCI by providing empirical evidence for the negative consequences of incorrect explanations on humans post‑collaboration and outlining guidelines for designers of AI.

PaperID: 12927, https://arxiv.org/pdf/2409.12618.pdf

Authors: Santosh Kumar Radha, Yasamin Nouri Jelyani, Ara Ghukasyan, Oktay Goktas

Title: Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

Abstract:
Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well‑structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"‑provoking prompts vis a vis an input query and the current iteration of an LLM's response. Unlike static or semi‑static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context‑specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem‑solving in Game of 24, puzzle solving in Mini Crosswords, and multi‑hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.

PaperID: 12928, https://arxiv.org/pdf/2409.12575.pdf

Authors: Manuel Röder, Frank-Michael Schleif

Title: Deep Transfer Hashing for Adaptive Learning on Federated Streaming Data

Abstract:
This extended abstract explores the integration of federated learning with deep transfer hashing for distributed prediction tasks, emphasizing resource‑efficient client training from evolving data streams. Federated learning allows multiple clients to collaboratively train a shared model while maintaining data privacy ‑ by incorporating deep transfer hashing, high‑dimensional data can be converted into compact hash codes, reducing data transmission size and network loads. The proposed framework utilizes transfer learning, pre‑training deep neural networks on a central server, and fine‑tuning on clients to enhance model accuracy and adaptability. A selective hash code sharing mechanism using a privacy‑preserving global memory bank further supports client fine‑tuning. This approach addresses challenges in previous research by improving computational efficiency and scalability. Practical applications include Car2X event predictions, where a shared model is collectively trained to recognize traffic patterns, aiding in tasks such as traffic density assessment and accident detection. The research aims to develop a robust framework that combines federated learning, deep transfer hashing and transfer learning for efficient and secure downstream task execution.

PaperID: 12929, https://arxiv.org/pdf/2409.12516.pdf

Authors: Kei Nakagawa, Masanori Hirano, Kentaro Minami, Takanobu Mizuta

Title: A Multi-agent Market Model Can Explain the Impact of AI Traders in Financial Markets -- A New Microfoundations of GARCH model

Abstract:
The AI traders in financial markets have sparked significant interest in their effects on price formation mechanisms and market volatility, raising important questions for market stability and regulation. Despite this interest, a comprehensive model to quantitatively assess the specific impacts of AI traders remains undeveloped. This study aims to address this gap by modeling the influence of AI traders on market price formation and volatility within a multi‑agent framework, leveraging the concept of microfoundations. Microfoundations involve understanding macroeconomic phenomena, such as market price formation, through the decision‑making and interactions of individual economic agents. While widely acknowledged in macroeconomics, microfoundational approaches remain unexplored in empirical finance, particularly for models like the GARCH model, which captures key financial statistical properties such as volatility clustering and fat tails. This study proposes a multi‑agent market model to derive the microfoundations of the GARCH model, incorporating three types of agents: noise traders, fundamental traders, and AI traders. By mathematically aggregating the micro‑structure of these agents, we establish the microfoundations of the GARCH model. We validate this model through multi‑agent simulations, confirming its ability to reproduce the stylized facts of financial markets. Finally, we analyze the impact of AI traders using parameters derived from these microfoundations, contributing to a deeper understanding of their role in market dynamics.

PaperID: 12930, https://arxiv.org/pdf/2409.12452.pdf

Authors: Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, Minlie Huang

Title: Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning

Abstract:
Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi‑step reasoning tasks. Existing approaches mainly rely on prompting or task‑specific fine‑tuning, often suffering from poor robustness and cross‑task generalization. To address the limitation, we introduce CodePlan, a scalable framework that empowers LLMs to generate and follow code‑form plans ‑‑ pseudocode that outlines high‑level, structured reasoning processes. By leveraging the structured and versatile nature of code, CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. Importantly, CodePlan allows automatic extraction of code‑form plans from massive, wide‑ranging text corpora without the need for curated, task‑specific datasets. This enables it to scale up efficiently and improve LLM's reasoning capabilities across diverse scenarios. To train CodePlan, we construct a large‑scale dataset of 2M examples that integrate code‑form plans with standard prompt‑response pairs from existing corpora. With minimal computation overhead during both training and inference, CodePlan achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi‑step reasoning benchmarks, spanning mathematical reasoning, symbolic reasoning, instruction‑following, multi‑hop QA, and decision‑making tasks. Further analysis reveals CodePlan's increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.

PaperID: 12931, https://arxiv.org/pdf/2409.12411.pdf

Authors: Chen Liang, Zhifan Feng, Zihe Liu, Wenbin Jiang, Jinan Xu, Yufeng Chen, Yong Wang

Title: Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation

Abstract:
Chain‑of‑thought prompting significantly boosts the reasoning ability of large language models but still faces three issues: hallucination problem, restricted interpretability, and uncontrollable generation. To address these challenges, we present AgentCOT, a llm‑based autonomous agent framework, which can solve complex problems in an agent‑style manner by multiple round LLM generation. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. In addition, we integrate the step's index into the reasoning process to form a graph structure for complex inference logic. We introduce two new strategies to enhance the performance of AgentCOT.We conduct extensive experiments to verify the effectiveness of our method on six common benchmarks. Results exhibit that our method brings in substantial improvements over current competitive approaches.

PaperID: 12932, https://arxiv.org/pdf/2409.12317.pdf

Authors: Charlotte Cathcart, Ian Xul Belaustegui, Alessio Franci, Naomi Ehrich Leonard

Title: Spiking Nonlinear Opinion Dynamics (S-NOD) for Agile Decision-Making

Abstract:
We present, analyze, and illustrate a first‑of‑its‑kind model of two‑dimensional excitable (spiking) dynamics for decision‑making over two options. The model, Spiking Nonlinear Opinion Dynamics (S‑NOD), provides superior agility, characterized by fast, flexible, and adaptive response to rapid and unpredictable changes in context, environment, or information received about available options. S‑NOD derives through the introduction of a single extra term to the previously presented Nonlinear Opinion Dynamics (NOD) for fast and flexible multi‑agent decision‑making behavior. The extra term is inspired by the fast‑positive, slow‑negative mixed‑feedback structure of excitable systems. The agile behaviors brought about by the new excitable nature of decision‑making driven by S‑NOD are analyzed in a general setting and illustrated in an application to multi‑robot navigation around human movers.

PaperID: 12933, https://arxiv.org/pdf/2409.12153.pdf

Authors: Ravi Pandya, Changliu Liu, Andrea Bajcsy

Title: Robots that Learn to Safely Influence via Prediction-Informed Reach-Avoid Dynamic Games

Abstract:
Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach‑avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal‑driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high‑dimensional (39‑D) simulated human‑robot collaborative manipulation task solved via offline game‑theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst‑case adversary, a safety controller that does not explicitly reason about influence, and an energy‑function‑based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.

PaperID: 12934, https://arxiv.org/pdf/2409.12152.pdf

Authors: Zhiyuan Zhang, Panagiotis Tsiotras

Title: Residual Descent Differential Dynamic Game (RD3G) -- A Fast Newton Solver for Constrained General Sum Games

Abstract:
We present Residual Descent Differential Dynamic Game (RD3G), a Newton‑based solver for constrained multi‑agent game‑control problems. The proposed solver seeks a local Nash equilibrium for problems where agents are coupled through their rewards and state constraints. We compare the proposed method against competing state‑of‑the‑art techniques and showcase the computational benefits of the RD3G algorithm on several example problems.

PaperID: 12935, https://arxiv.org/pdf/2409.12147.pdf

Authors: Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

Title: MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning

Abstract:
Large Language Models' (LLM) reasoning can be improved using test‑time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM‑generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refinement: Uniformly refining all instances can over‑correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self‑correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non‑trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse‑grained aggregation and hard ones with fine‑grained and iterative multi‑agent refinement. To improve error localization, we incorporate external step‑wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi‑agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step‑wise RM scores), and the Refiner (which incorporates feedback). To ensure sufficient refinement, we re‑evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama‑3‑8B and GPT‑3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self‑Consistency by 3.4%, Best‑of‑k by 3.2%, and Self‑Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi‑agent communication.

PaperID: 12936, https://arxiv.org/pdf/2409.12001.pdf

Authors: Claude Formanek, Louise Beyers, Callum Rhys Tilbury, Jonathan P. Shock, Arnu Pretorius

Title: Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning

Abstract:
Offline multi‑agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi‑agent systems. Though the field is by definition data‑driven, efforts have thus far neglected data in their drive to achieve state‑of‑the‑art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy‑to‑use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

PaperID: 12937, https://arxiv.org/pdf/2409.11852.pdf

Authors: Jianye Xu, Omar Sobhy, Bassam Alrifaee

Title: XP-MARL: Auxiliary Prioritization in Multi-Agent Reinforcement Learning to Address Non-Stationarity

Abstract:
Non‑stationarity poses a fundamental challenge in Multi‑Agent Reinforcement Learning (MARL), arising from agents simultaneously learning and altering their policies. This creates a non‑stationary environment from the perspective of each individual agent, often leading to suboptimal or even unconverged learning outcomes. We propose an open‑source framework named XP‑MARL, which augments MARL with auxiliary prioritization to address this challenge in cooperative settings. XP‑MARL is 1) founded upon our hypothesis that prioritizing agents and letting higher‑priority agents establish their actions first would stabilize the learning process and thus mitigate non‑stationarity and 2) enabled by our proposed mechanism called action propagation, where higher‑priority agents act first and communicate their actions, providing a more stationary environment for others. Moreover, instead of using a predefined or heuristic priority assignment, XP‑MARL learns priority‑assignment policies with an auxiliary MARL problem, leading to a joint learning scheme. Experiments in a motion‑planning scenario involving Connected and Automated Vehicles (CAVs) demonstrate that XP‑MARL improves the safety of a baseline model by 84.4% and outperforms a state‑of‑the‑art approach, which improves the baseline by only 12.8%. Code: github.com/cas‑lab‑munich/sigmarl

PaperID: 12938, https://arxiv.org/pdf/2409.11756.pdf

Authors: Gabriele Sartor, Angelo Oddi, Riccardo Rasconi, Vieri Giuliano Santucci, Rosa Meo

Title: Synthesizing Evolving Symbolic Representations for Autonomous Systems

Abstract:
Recently, AI systems have made remarkable progress in various tasks. Deep Reinforcement Learning(DRL) is an effective tool for agents to learn policies in low‑level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation(IM) to the RL mechanism, which simulates the agent's curiosity, encouraging agents to explore interesting areas of the environment. This new feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub‑symbolic model, there is still a need for a sort of abstraction to understand the knowledge collected by the agent. To this end, the classical planning formalism has been used in recent research to explicitly represent the knowledge an autonomous agent acquires and effectively reach extrinsic goals. Despite classical planning usually presents limited expressive capabilities, PPDDL demonstrated usefulness in reviewing the knowledge gathered by an autonomous system, making explicit causal correlations, and can be exploited to find a plan to reach any state the agent faces during its experience. This work presents a new architecture implementing an open‑ended learning system able to synthesize from scratch its experience into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system integrates intrinsic motivations to explore the environment in a self‑directed way, exploiting the high‑level knowledge acquired during its experience. The system explores the environment and iteratively: (a) discover options, (b) explore the environment using options, (c) abstract the knowledge collected and (d) plan. This paper proposes an alternative approach to implementing open‑ended learning architectures exploiting low‑level and high‑level representations to extend its knowledge in a virtuous loop.

PaperID: 12939, https://arxiv.org/pdf/2409.11690.pdf

Authors: Zongwei Wang, Min Gao, Junliang Yu, Xinyi Gao, Quoc Viet Hung Nguyen, Shazia Sadiq, Hongzhi Yin

Title: ID-Free Not Risk-Free: LLM-Powered Agents Unveil Risks in ID-Free Recommender Systems

Abstract:
Recent advances in ID‑free recommender systems have attracted significant attention for effectively addressing the cold start problem. However, their vulnerability to malicious attacks remains largely unexplored. In this paper, we unveil a critical yet overlooked risk: LLM‑powered agents can be strategically deployed to attack ID‑free recommenders, stealthily promoting low‑quality items in black‑box settings. This attack exploits a novel rewriting‑based deception strategy, where malicious agents synthesize deceptive textual descriptions by simulating the characteristics of popular items. To achieve this, the attack mechanism integrates two primary components: (1) a popularity extraction component that captures essential characteristics of popular items and (2) a multi‑agent collaboration mechanism that enables iterative refinement of promotional textual descriptions through independent thinking and team discussion. To counter this risk, we further introduce a detection method to identify suspicious text generated by our discovered attack. By unveiling this risk, our work aims to underscore the urgent need to enhance the security of ID‑free recommender systems.

PaperID: 12940, https://arxiv.org/pdf/2409.11580.pdf

Authors: Arvind Car, Sai Sravan Yarlagadda, Alison Bartsch, Abraham George, Amir Barati Farimani

Title: PLATO: Planning with LLMs and Affordances for Tool Manipulation

Abstract:
As robotic systems become increasingly integrated into complex real‑world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre‑programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard‑coded environmental information, PLATO employs a modular architecture of specialized agents to operate without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high‑level plan, translate this plan into a series of low‑level actions, and verify the completion of each step. The system is particularly tested on challenging tool‑use tasks, which involve handling diverse objects and require long‑horizon planning. PLATO's design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across various complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution to integrate LLMs with robotic platforms, advancing the state‑of‑the‑art in autonomous robotic task execution. For videos and prompt details, please see our project website: https://sites.google.com/andrew.cmu.edu/plato

PaperID: 12941, https://arxiv.org/pdf/2409.11561.pdf

Authors: Weizheng Wang, Aniket Bera, Byung-Cheol Min

Title: Hypergraph-based Coordinated Task Allocation and Socially-aware Navigation for Multi-Robot Systems

Abstract:
A team of multiple robots seamlessly and safely working in human‑filled public environments requires adaptive task allocation and socially‑aware navigation that account for dynamic human behavior. Current approaches struggle with highly dynamic pedestrian movement and the need for flexible task allocation. We propose Hyper‑SAMARL, a hypergraph‑based system for multi‑robot task allocation and socially‑aware navigation, leveraging multi‑agent reinforcement learning (MARL). Hyper‑SAMARL models the environmental dynamics between robots, humans, and points of interest (POIs) using a hypergraph, enabling adaptive task assignment and socially‑compliant navigation through a hypergraph diffusion mechanism. Our framework, trained with MARL, effectively captures interactions between robots and humans, adapting tasks based on real‑time changes in human activity. Experimental results demonstrate that Hyper‑SAMARL outperforms baseline models in terms of social navigation, task completion efficiency, and adaptability in various simulated scenarios.

PaperID: 12942, https://arxiv.org/pdf/2409.11411.pdf

Authors: Mubashir ul Islam, Humza Sami, Pierre-Emmanuel Gaillardon, Valerio Tenace

Title: AIvril: AI-Driven RTL Generation With Verification In-The-Loop

Abstract:
Large Language Models (LLMs) are computational models capable of performing complex natural language processing tasks. Leveraging these capabilities, LLMs hold the potential to transform the entire hardware design stack, with predictions suggesting that front‑end and back‑end tasks could be fully automated in the near future. Currently, LLMs show great promise in streamlining Register Transfer Level (RTL) generation, enhancing efficiency, and accelerating innovation. However, their probabilistic nature makes them prone to inaccuracies ‑ a significant drawback in RTL design, where reliability and precision are essential. To address these challenges, this paper introduces AIvril, an advanced framework designed to enhance the accuracy and reliability of RTL‑aware LLMs. AIvril employs a multi‑agent, LLM‑agnostic system for automatic syntax correction and functional verification, significantly reducing ‑ and in many cases, completely eliminating ‑ instances of erroneous code generation. Experimental results conducted on the VerilogEval‑Human dataset show that our framework improves code quality by nearly 2x when compared to previous works, while achieving an 88.46% success rate in meeting verification objectives. This represents a critical step toward automating and optimizing hardware design workflows, offering a more dependable methodology for AI‑driven RTL design.

PaperID: 12943, https://arxiv.org/pdf/2409.11393.pdf

Authors: Amine Ben Hassouna, Hana Chaari, Ines Belhaj

Title: LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Design of Multi Active/Passive Core-Agent Architectures

Abstract:
In an era where vast amounts of data are collected and processed from diverse sources, there is a growing demand for sophisticated AI systems capable of intelligently fusing and analyzing this information. To address these challenges, researchers have turned towards integrating tools into LLM‑powered agents to enhance the overall information fusion process. However, the conjunction of these technologies and the proposed enhancements in several state‑of‑the‑art works followed a non‑unified software architecture, resulting in a lack of modularity and terminological inconsistencies among researchers. To address these issues, we propose a novel LLM‑based Agent Unified Modeling Framework (LLM‑Agent‑UMF) that establishes a clear foundation for agent development from both functional and software architectural perspectives, developed and evaluated using the Architecture Tradeoff and Risk Analysis Framework (ATRAF). Our framework clearly distinguishes between the different components of an LLM‑based agent, setting LLMs and tools apart from a new element, the core‑agent, which plays the role of central coordinator. This pivotal entity comprises five modules: planning, memory, profile, action, and security ‑‑ the latter often neglected in previous works. By classifying core‑agents into passive and active types based on their authoritative natures, we propose various multi‑core agent architectures that combine unique characteristics of distinctive agents to tackle complex tasks more efficiently. We evaluate our framework by applying it to thirteen state‑of‑the‑art agents, thereby demonstrating its alignment with their functionalities and clarifying overlooked architectural aspects. Moreover, we thoroughly assess five architecture variants of our framework by designing new agent architectures that combine characteristics of state‑of‑the‑art agents to address specific goals. ...

PaperID: 12944, https://arxiv.org/pdf/2409.11276.pdf

Authors: Maria Rigaki, Carlos Catania, Sebastian Garcia

Title: Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments

Abstract:
Large Language Models (LLMs) have shown remarkable potential across various domains, including cybersecurity. Using commercial cloud‑based LLMs may be undesirable due to privacy concerns, costs, and network connectivity constraints. In this paper, we present Hackphyr, a locally fine‑tuned LLM to be used as a red‑team agent within network security environments. Our fine‑tuned 7 billion parameter model can run on a single GPU card and achieves performance comparable with much larger and more powerful commercial models such as GPT‑4. Hackphyr clearly outperforms other models, including GPT‑3.5‑turbo, and baselines, such as Q‑learning agents in complex, previously unseen scenarios. To achieve this performance, we generated a new task‑specific cybersecurity dataset to enhance the base model's capabilities. Finally, we conducted a comprehensive analysis of the agents' behaviors that provides insights into the planning abilities and potential shortcomings of such agents, contributing to the broader understanding of LLM‑based agents in cybersecurity contexts

PaperID: 12945, https://arxiv.org/pdf/2409.11087.pdf

Authors: Wouter W. L. Nuijten, Bert de Vries

Title: Reactive Environments for Active Inference Agents with RxEnvironments.jl

Abstract:
Active Inference is a framework that emphasizes the interaction between agents and their environment. While the framework has seen significant advancements in the development of agents, the environmental models are often borrowed from reinforcement learning problems, which may not fully capture the complexity of multi‑agent interactions or allow complex, conditional communication. This paper introduces Reactive Environments, a comprehensive paradigm that facilitates complex multi‑agent communication. In this paradigm, both agents and environments are defined as entities encapsulated by boundaries with interfaces. This setup facilitates a robust framework for communication in nonequilibrium‑Steady‑State systems, allowing for complex interactions and information exchange. We present a Julia package RxEnvironments.jl, which is a specific implementation of Reactive Environments, where we utilize a Reactive Programming style for efficient implementation. The flexibility of this paradigm is demonstrated through its application to several complex, multi‑agent environments. These case studies highlight the potential of Reactive Environments in modeling sophisticated systems of interacting agents.

PaperID: 12946, https://arxiv.org/pdf/2409.10864.pdf

Authors: Viet-Anh Le, Andreas A. Malikopoulos

Title: Distributed Optimization for Traffic Light Control and Connected Automated Vehicle Coordination in Mixed-Traffic Intersections

Abstract:
In this paper, we consider the problem of coordinating traffic light systems and connected automated vehicles (CAVs) in mixed‑traffic intersections. We aim to develop an optimization‑based control framework that leverages both the coordination capabilities of CAVs at higher penetration rates and intelligent traffic management using traffic lights at lower penetration rates. Since the resulting optimization problem is a multi‑agent mixed‑integer quadratic program, we propose a penalization‑enhanced maximum block improvement algorithm to solve the problem in a distributed manner. The proposed algorithm, under certain mild conditions, yields a feasible person‑by‑person optimal solution of the centralized problem. The performance of the control framework and the distributed algorithm is validated through simulations across various penetration rates and traffic volumes.

PaperID: 12947, https://arxiv.org/pdf/2409.10811.pdf

Authors: Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu

Title: Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps

Abstract:
In recent years, spatial computing a.k.a. Extended Reality (XR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with XR apps through interactable GUI elements (IGEs) on the stereoscopic three‑dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large‑scale manually‑labeled GUI dataset, usually with a pre‑defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps, due to a multitude of challenges including complexities posed by open‑vocabulary and heterogeneous IGE categories, intricacies of context‑sensitive interactability, and the necessities of precise spatial perception and visual‑semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to XR apps. In this paper, we propose the first zero‑shot cOntext‑sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of XR app scenes first, before performing the detection. The detection process is iterated within a feedback‑directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection‑directed IGE candidate detection, and (3) Context‑sensitive interactability classification. Extensive experiments demonstrate that Orienter is more effective than the state‑of‑the‑art GUI element detection approaches.

PaperID: 12948, https://arxiv.org/pdf/2409.10737.pdf

Authors: Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, Peyman Najafirad

Title: AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing

Abstract:
Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability‑free code. Traditional program synthesis with LLMs has primarily focused on functional correctness, often neglecting critical dynamic security implications that happen during runtime. To address these challenges, we propose AutoSafeCoder, a multi‑agent framework that leverages LLM‑driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for code generation, a Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent performing dynamic testing using a mutation‑based fuzzing approach to detect runtime errors. Our contribution focuses on ensuring the safety of multi‑agent code generation by integrating dynamic and static testing in an iterative process during code generation by LLM that improves security. Experiments using the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities compared to baseline LLMs, with no compromise in functionality.

PaperID: 12949, https://arxiv.org/pdf/2409.10680.pdf

Authors: Kristýna Janovská, Pavel Surynek

Title: Multi-agent Path Finding in Continuous Environment

Abstract:
We address a variant of multi‑agent path finding in continuous environment (CE‑MAPF), where agents move along sets of smooth curves. Collisions between agents are resolved via avoidance in the space domain. A new Continuous Environment Conflict‑Based Search (CE‑CBS) algorithm is proposed in this work. CE‑CBS combines conflict‑based search (CBS) for the high‑level search framework with RRT for low‑level path planning. The CE‑CBS algorithm is tested under various settings on diverse CE‑MAPF instances. Experimental results show that CE‑CBS is competitive w.r.t. to other algorithms that consider continuous aspect in MAPF such as MAPF with continuous time.

PaperID: 12950, https://arxiv.org/pdf/2409.10375.pdf

Authors: Yakov Miron, Dan Navon, Yuval Goldfracht, Dotan Di Castro, Itzik Klein

Title: Decentralized and Asymmetric Multi-Agent Learning in Construction Sites

Abstract:
Multi‑agent collaboration involves multiple participants working together in a shared environment to achieve a common goal. These agents share information, divide tasks, and synchronize their actions. Key aspects of multi agent collaboration include coordination, communication, task allocation, cooperation, adaptation, and decentralization. On construction sites, surface grading is the process of leveling sand piles to increase a specific area's height. In this scenario, a bulldozer grades while a dumper allocates sand piles. Our work aims to utilize a multi‑agent approach to enable these vehicles to collaborate effectively. To this end, we propose a decentralized and asymmetric multi‑agent learning approach for construction sites (DAMALCS). We formulate DAMALCS to reduce expected collisions for operating vehicles. Therefore, we develop two heuristic experts capable of achieving their joint goal optimally by applying an innovative prioritization method. In this approach, the bulldozer's movements take precedence over the dumper's operations, enabling the bulldozer to clear the path for the dumper and ensure continuous operation of both vehicles. Since heuristics alone are insufficient in real‑world scenarios, we utilize them to train AI agents, which proves to be highly effective. We simultaneously train the bulldozer and dumper agents to operate within the same environment, aiming to avoid collisions and optimize performance in terms of time efficiency and sand volume handling. Our trained agents and heuristics are evaluated in both simulation and real‑world lab experiments, testing them under various conditions, such as visual noise and localization errors. The results demonstrate that our approach significantly reduces collision rates for these vehicles.

PaperID: 12951, https://arxiv.org/pdf/2409.10372.pdf

Authors: Qiliang Chen, Sepehr Ilami, Nunzio Lore, Babak Heydari

Title: Integrated Design and Governance of Agentic AI Systems through Adaptive Information Modulation

Abstract:
Modern engineered systems increasingly involve complex sociotechnical environments where multiple agents, including humans and the emerging paradigm of agentic AI powered by large language models, must navigate social dilemmas that pit individual interests against collective welfare. As engineered systems evolve toward multi‑agent architectures with autonomous LLM‑based agents, traditional governance approaches using static rules or fixed network structures fail to address the dynamic uncertainties inherent in real‑world operations. This paper presents a novel framework that integrates adaptive governance mechanisms directly into the design of sociotechnical systems through a unique separation of agent interaction networks from information flow networks. We introduce a system comprising strategic LLM‑based system agents that engage in repeated interactions and a reinforcement learning‑based governing agent that dynamically modulates information transparency. Unlike conventional approaches that require direct structural interventions or payoff modifications, our framework preserves agent autonomy while promoting cooperation through adaptive information governance. The governing agent learns to strategically adjust information disclosure at each timestep, determining what contextual or historical information each system agent can access. Experimental results demonstrate that this RL‑based governance significantly enhances cooperation compared to static information‑sharing baselines.

PaperID: 12952, https://arxiv.org/pdf/2409.10117.pdf

Authors: Alejandro Sánchez Roncero, Rafael I. Cabral Muchacho, Petter Ögren

Title: Multi-Agent Obstacle Avoidance using Velocity Obstacles and Control Barrier Functions

Abstract:
Velocity Obstacles (VO) methods form a paradigm for collision avoidance strategies among moving obstacles and agents. While VO methods perform well in simple multi‑agent environments, they don't guarantee safety and can show overly conservative behavior in common situations. In this paper, we propose to combine a VO‑strategy for guidance with a CBF‑approach for safety, which overcomes the overly conservative behavior of VOs and formally guarantees safety. We validate our method in a baseline comparison study, using 2nd order integrator and car‑like dynamics. Results support that our method outperforms the baselines w.r.t. path smoothness, collision avoidance, and success rates.

PaperID: 12953, https://arxiv.org/pdf/2409.10048.pdf

Authors: Wessel Ledder, Yuzhen Qin, Kiki van der Heijden

Title: Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments

Abstract:
Although deep reinforcement learning (DRL) approaches in audio signal processing have seen substantial progress in recent years, audio‑driven DRL for tasks such as navigation, gaze control and head‑orientation control in the context of human‑robot interaction have received little attention. Here, we propose an audio‑driven DRL framework in which we utilise deep Q‑learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent's performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio‑driven DRL for tasks such as head‑orientation control and highlights the need for training strategies that enable robust generalization across environments for real‑world audio‑driven DRL applications.

PaperID: 12954, https://arxiv.org/pdf/2409.10047.pdf

Authors: Hossein B. Jond

Title: Bearing-Distance Flocking with Zone-Based Interactions in Constrained Dynamic Environments

Abstract:
This paper presents a novel zone‑based flocking control approach suitable for dynamic multi‑agent systems (MAS). Inspired by Reynolds behavioral rules for boids, flocking behavioral rules with the zones of repulsion, conflict, attraction, and surveillance are introduced. For each agent, using only bearing and distance measurements, behavioral contribution vectors quantify the local separation, local and global flock velocity alignment, local cohesion, obstacle avoidance and boundary conditions, and strategic separation for avoiding alien agents. The control strategy uses the local perception‑based behavioral contribution vectors to guide each agent's motion. Additionally, the control strategy incorporates a directionally aware obstacle avoidance mechanism that prioritizes obstacles in the agent's forward path. Simulation results validate the effectiveness of the model in creating flexible, adaptable, and scalable flocking behavior. Asymptotic stability and convergence to a stable flocking configuration for any initial conditions provided the interaction graph is a spanning tree are demonstrated. The flocking model's reliance on locally sensed bearing and distance measurements ensures scalability and robustness, particularly in scenarios where communication is unreliable or resource‑intensive. This makes it well‑suited for real‑world applications demanding seamless operation in highly dynamic and distributed environments.

PaperID: 12955, https://arxiv.org/pdf/2409.10038.pdf

Authors: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

Title: On the Diagram of Thought

Abstract:
Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi‑step reasoning. We introduce the Diagram of Thought (DoT), a framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This process is controller‑light: it does not require an external search algorithm or planner, but it does use a deterministic online validator for grammar‑constrained typed traces, register constraints, and optional solver checks. To clarify the reliability target of this process, we ground DoT in a mathematical framework from category theory. We interpret accepted typed reasoning records as diagrams in a slice topos and model synthesis of the selected proposer subdiagram as a finite limit. In the predicate fragment, this same object is equivalently a variance‑reversed colimit in the opposite information order. The resulting formalism gives an auditable, step‑by‑step trace of the LLM's typed reasoning and separates semantic guarantees for the typed subtrace from unconstrained natural‑language text and uncertified operational edges.

PaperID: 12956, https://arxiv.org/pdf/2409.09777.pdf

Authors: Haisheng Su, Wei Wu, Zhenjie Yang, Isabel Guan

Title: EgoFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving

Abstract:
Current End‑to‑End Autonomous Driving (E2E‑AD) methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized with a fully differentiable framework in a planning‑oriented manner, existing end‑to‑end driving systems lacking ego‑centric designs still suffer from unsatisfactory performance and inferior efficiency, due to rasterized scene representation learning and redundant information transmission. In this paper, we propose an ego‑centric fully sparse paradigm, named EgoFSD, for end‑to‑end self‑driving. Specifically, EgoFSD consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In‑Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego‑vehicle are considered for joint motion prediction, where the output multi‑modal ego‑trajectories are optimized in an iterative fashion. In addition, position‑level motion diffusion and trajectory‑level planning denoising are introduced for uncertainty modeling, thereby enhancing the training stability and convergence speed. Extensive experiments are conducted on nuScenes and Bench2Drive datasets, which significantly reduces the average L2 error by 59% and collision rate by 92% than UniAD while achieves 6.9x faster running efficiency.

PaperID: 12957, https://arxiv.org/pdf/2409.09717.pdf

Authors: Justas Andriuškevičius, Junzi Sun

Title: Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents

Abstract:
Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language‑based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language‑like reasoning capability to explain their decisions, which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model‑based agent with function‑calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model‑based agent, both open‑source and closed‑source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model‑based agents. The best‑performing configuration was able to solve almost all 120 but one imminent conflict scenarios, including up to four aircraft at the same time. Most importantly, the agents are able to provide human‑level text explanations on traffic situations and conflict resolution strategies.

PaperID: 12958, https://arxiv.org/pdf/2409.09575.pdf

Authors: Bo-Kai Ruan, Hao-Tang Tsui, Yung-Hui Li, Hong-Han Shuai

Title: Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

Abstract:
Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free‑form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi‑agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan‑aware road ranking algorithm to solve these challenges. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety‑critical scenarios, as well as multi‑stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore our proposed framework for flexible, interpretable, and safety‑oriented simulation.

PaperID: 12959, https://arxiv.org/pdf/2409.09573.pdf

Authors: Vrushabh Zinage, Abhishek Jha, Rohan Chandra, Efstathios Bakolas

Title: Decentralized Safe and Scalable Multi-Agent Control under Limited Actuation

Abstract:
To deploy safe and agile robots in cluttered environments, there is a need to develop fully decentralized controllers that guarantee safety, respect actuation limits, prevent deadlocks, and scale to thousands of agents. Current approaches fall short of meeting all these goals: optimization‑based methods ensure safety but lack scalability, while learning‑based methods scale but do not guarantee safety. We propose a novel algorithm to achieve safe and scalable control for multiple agents under limited actuation. Specifically, our approach includes: (i) learning a decentralized neural Integral Control Barrier function (neural ICBF) for scalable, input‑constrained control, (ii) embedding a lightweight decentralized Model Predictive Control‑based Integral Control Barrier Function (MPC‑ICBF) into the neural network policy to ensure safety while maintaining scalability, and (iii) introducing a novel method to minimize deadlocks based on gradient‑based optimization techniques from machine learning to address local minima in deadlocks. Our numerical simulations show that this approach outperforms state‑of‑the‑art multi‑agent control algorithms in terms of safety, input constraint satisfaction, and minimizing deadlocks. Additionally, we demonstrate strong generalization across scenarios with varying agent counts, scaling up to 1000 agents.

PaperID: 12960, https://arxiv.org/pdf/2409.09509.pdf

Authors: Shatayu Kulkarni, Sabine Brunswicker

Title: Learning Nudges for Conditional Cooperation: A Multi-Agent Reinforcement Learning Model

Abstract:
The public goods game describes a social dilemma in which a large proportion of agents act as conditional cooperators (CC): they only act cooperatively if they see others acting cooperatively because they satisfice with the social norm to be in line with what others are doing instead of optimizing cooperation. CCs are guided by aspiration‑based reinforcement learning guided by past experiences of interactions with others and satisficing aspirations. In many real‑world settings, reinforcing social norms do not emerge. In this paper, we propose that an optimizing reinforcement agent can facilitate cooperation through nudges, i.e. indirect mechanisms for cooperation to happen. The agent's goal is to motivate CCs into cooperation through its own actions to create social norms that signal that others are cooperating. We introduce a multi‑agent reinforcement learning model for public goods games, with 3 CC learning agents using aspirational reinforcement learning and 1 nudging agent using deep reinforcement learning to learn nudges that optimize cooperation. For our nudging agent, we model two distinct reward functions, one maximizing the total game return (sum DRL) and one maximizing the number of cooperative contributions contributions higher than a proportional threshold (prop DRL). Our results show that our aspiration‑based RL model for CC agents is consistent with empirically observed CC behavior. Games combining 3 CC RL agents and one nudging RL agent outperform the baseline consisting of 4 CC RL agents only. The sum DRL nudging agent increases the total sum of contributions by 8.22% and the total proportion of cooperative contributions by 12.42%, while the prop nudging DRL increases the total sum of contributions by 8.85% and the total proportion of cooperative contributions by 14.87%. Our findings advance the literature on public goods games and reinforcement learning.

PaperID: 12961, https://arxiv.org/pdf/2409.09345.pdf

Authors: Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang

Title: Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Abstract:
Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision‑making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task‑relevant Q‑value model to guide action selection. Specifically, we first collect decision‑making trajectories annotated with step‑level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step‑level Direct Policy Optimization (DPO), which serves as the Q‑value model. During inference, at each decision‑making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open‑source and API‑based LLM agents, demonstrating that Q‑value models significantly improve their performance. Notably, the performance of the agent built with Phi‑3‑mini‑4k‑instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q‑value models, even surpassing GPT‑4o‑mini. Additionally, Q‑value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.

PaperID: 12962, https://arxiv.org/pdf/2409.09285.pdf

Authors: Carter Berlind, Wenliang Liu, Alyssa Pierson, Calin Belta

Title: Capability Augmentation for Heterogeneous Dynamic Teaming with Temporal Logic Tasks

Abstract:
This paper considers how heterogeneous multi‑agent teams can leverage their different capabilities to mutually improve individual agent performance. We present Capability‑Augmenting Tasks (CATs), which encode how agents can augment their capabilities based on interactions with other teammates. Our framework integrates CAT into the semantics of Metric Temporal Logic (MTL), which defines individual spatio‑temporal tasks for all agents. A centralized Mixed‑Integer Program (MIP) is used to synthesize trajectories for all agents. We compare the expressivity of our approach to a baseline of Capability Temporal Logic Plus (CaTL+). Case studies demonstrate that our approach allows for simpler specifications and improves individual performance when agents leverage the capabilities of their teammates.

PaperID: 12963, https://arxiv.org/pdf/2409.09271.pdf

Authors: Wenhan Wang, Kaibo Liu, An Ran Chen, Ge Li, Zhi Jin, Gang Huang, Lei Ma

Title: Python Symbolic Execution with LLM-powered Code Generation

Abstract:
Symbolic execution is a key technology in software testing, which generates test cases by collecting symbolic path constraints and then solving constraints with SMT solvers. Symbolic execution has been proven helpful in generating high‑coverage test cases, but its limitations, e.g., the difficulties in solving path constraints, prevent it from broader usage in software testing. Moreover, symbolic execution has encountered many difficulties when applied to dynamically typed languages like Python, because it is extremely challenging to translate the flexible Python grammar into rigid solvers. To overcome the main challenges of applying symbolic execution in Python, we proposed an LLM‑empowered agent, LLM‑Sym, that automatically calls an SMT solver, Z3, to solve execution path constraints. Based on an introductory‑level symbolic execution engine, our LLM agent can extend it to supporting programs with complex data type `list'. The core contribution of LLM‑Sym is translating complex Python path constraints into Z3 code. To enable accurate path‑to‑Z3 translation, we design a multiple‑step code generation pipeline including type inference, retrieval and self‑refine. Our experiments demonstrate that LLM‑Sym is capable of solving path constraints on Leetcode problems with complicated control flows and list data structures, which is impossible for the backbone symbolic execution engine. Our approach paves the way for the combination of the generation ability of LLMs with the reasoning ability of symbolic solvers, and opens up new opportunities in LLM‑augmented test case generation.

PaperID: 12964, https://arxiv.org/pdf/2409.09164.pdf

Authors: Albert Xu, Bhaskar Vundurthy, Geordan Gutow, Ian Abraham, Jeff Schneider, Howie Choset

Title: Measure Preserving Flows for Ergodic Search in Convoluted Environments

Abstract:
Autonomous robotic search has important applications in robotics, such as the search for signs of life after a disaster. When \empha priori information is available, for example in the form of a distribution, a planner can use that distribution to guide the search. Ergodic search is one method that uses the information distribution to generate a trajectory that minimizes the ergodic metric, in that it encourages the robot to spend more time in regions with high information and proportionally less time in the remaining regions. Unfortunately, prior works in ergodic search do not perform well in complex environments with obstacles such as a building's interior or a maze. To address this, our work presents a modified ergodic metric using the Laplace‑Beltrami eigenfunctions to capture map geometry and obstacle locations within the ergodic metric. Further, we introduce an approach to generate trajectories that minimize the ergodic metric while guaranteeing obstacle avoidance using measure‑preserving vector fields. Finally, we leverage the divergence‑free nature of these vector fields to generate collision‑free trajectories for multiple agents. We demonstrate our approach via simulations with single and multi‑agent systems on maps representing interior hallways and long corridors with non‑uniform information distribution. In particular, we illustrate the generation of feasible trajectories in complex environments where prior methods fail.

PaperID: 12965, https://arxiv.org/pdf/2409.09013.pdf

Authors: Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, Maarten Sap

Title: AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

Abstract:
Truthfulness (adherence to factual accuracy) and utility (satisfying human needs and instructions) are both fundamental aspects of Large Language Models, yet these goals often conflict (e.g., sell a car with known flaws), which makes it challenging to achieve both in real‑world deployments. We propose AI‑LieDar, a framework to study how LLM‑based agents navigate these scenarios in an multi‑turn interactive setting. We design a set of real‑world scenarios where language agents are instructed to achieve goals that are in conflict with being truthful during a multi‑turn conversation with simulated human agents. To evaluate the truthfulness at large scale, we develop a truthfulness detector inspired by psychological literature to assess the agents' responses. Our experiment demonstrates that all models are truthful less than 50% of the time, though truthfulness and goal achievement (utility) rates vary across models. We further test the steerability of LLMs towards truthfulness, finding that models can be directed to be truthful or deceptive, and even truth‑steered models still lie. These findings reveal the complex nature of truthfulness in LLMs and underscore the importance of further research to ensure the safe and reliable deployment of LLMs and LLM‑based agents.

PaperID: 12966, https://arxiv.org/pdf/2409.08963.pdf

Authors: Lucio La Cava, Andrea Tagarelli

Title: Safeguarding Decentralized Social Media: LLM Agents for Automating Community Rule Compliance

Abstract:
Ensuring content compliance with community guidelines is crucial for maintaining healthy online social environments. However, traditional human‑based compliance checking struggles with scaling due to the increasing volume of user‑generated content and a limited number of moderators. Recent advancements in Natural Language Understanding demonstrated by Large Language Models unlock new opportunities for automated content compliance verification. This work evaluates six AI‑agents built on Open‑LLMs for automated rule compliance checking in Decentralized Social Networks, a challenging environment due to heterogeneous community scopes and rules. Analyzing over 50,000 posts from hundreds of Mastodon servers, we find that AI‑agents effectively detect non‑compliant content, grasp linguistic subtleties, and adapt to diverse community contexts. Most agents also show high inter‑rater reliability and consistency in score justification and suggestions for compliance. Human‑based evaluation with domain experts confirmed the agents' reliability and usefulness, rendering them promising tools for semi‑automated or human‑in‑the‑loop content moderation systems.

PaperID: 12967, https://arxiv.org/pdf/2409.08811.pdf

Authors: Shao Zhang, Xihuai Wang, Wenhao Zhang, Yongshan Chen, Landi Gao, Dakuo Wang, Weinan Zhang, Xinbing Wang, Ying Wen

Title: Mutual Theory of Mind in Human-AI Collaboration: An Empirical Study with LLM-driven AI Agents in a Real-time Shared Workspace Task

Abstract:
Theory of Mind (ToM) significantly impacts human collaboration and communication as a crucial capability to understand others. When AI agents with ToM capability collaborate with humans, Mutual Theory of Mind (MToM) arises in such human‑AI teams (HATs). The MToM process, which involves interactive communication and ToM‑based strategy adjustment, affects the team's performance and collaboration process. To explore the MToM process, we conducted a mixed‑design experiment using a large language model‑driven AI agent with ToM and communication modules in a real‑time shared‑workspace task. We find that the agent's ToM capability does not significantly impact team performance but enhances human understanding of the agent and the feeling of being understood. Most participants in our study believe verbal communication increases human burden, and the results show that bidirectional communication leads to lower HAT performance. We discuss the results' implications for designing AI agents that collaborate with humans in real‑time shared workspace tasks.

PaperID: 12968, https://arxiv.org/pdf/2409.08767.pdf

Authors: Yang Li, Dengyu Zhang, Junfan Chen, Ying Wen, Qingrui Zhang, Shaoshuai Mou, Wei Pan

Title: HOLA-Drone: Hypergraphic Open-ended Learning for Zero-Shot Multi-Drone Cooperative Pursuit

Abstract:
Zero‑shot coordination (ZSC) is a significant challenge in multi‑agent collaboration, aiming to develop agents that can coordinate with unseen partners they have not encountered before. Recent cutting‑edge ZSC methods have primarily focused on two‑player video games such as OverCooked!2 and Hanabi. In this paper, we extend the scope of ZSC research to the multi‑drone cooperative pursuit scenario, exploring how to construct a drone agent capable of coordinating with multiple unseen partners to capture multiple evaders. We propose a novel Hypergraphic Open‑ended Learning Algorithm (HOLA‑Drone) that continuously adapts the learning objective based on our hypergraphic‑form game modeling, aiming to improve cooperative abilities with multiple unknown drone teammates. To empirically verify the effectiveness of HOLA‑Drone, we build two different unseen drone teammate pools to evaluate their performance in coordination with various unseen partners. The experimental results demonstrate that HOLA‑Drone outperforms the baseline methods in coordination with unseen drone teammates. Furthermore, real‑world experiments validate the feasibility of HOLA‑Drone in physical systems. Videos can be found on the project homepage~\urlhttps://sites.google.com/view/hola‑drone.

PaperID: 12969, https://arxiv.org/pdf/2409.08642.pdf

Authors: Tianlong Wang, Junzhe Chen, Xueting Han, Jing Bai

Title: CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

Abstract:
Post‑training, particularly reinforcement learning (RL) using self‑play‑generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task‑specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high‑level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi‑step reasoning tasks, and 2) learning critical plan steps through Step‑level Advantage Preference Optimization (Step‑APO), which integrates advantage estimates for step preference obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out‑of‑domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC‑C (+4.0%), MMLU‑STEM (+2.2%), and BBH (+1.8%).

PaperID: 12970, https://arxiv.org/pdf/2409.08634.pdf

Authors: Evagoras Makridis, Andreas Grammenos, Gabriele Oliva, Evangelia Kalyvianaki, Christoforos N. Hadjicostis, Themistoklis Charalambous

Title: Average Consensus over Directed Networks in Open Multi-Agent Systems with Acknowledgement Feedback

Abstract:
In this paper, we address the distributed average consensus problem over directed networks in open multi‑agent systems (OMAS), where the stability of the network is disrupted by frequent agent arrivals and departures, leading to a time‑varying average consensus target. To tackle this challenge, we introduce a novel ratio consensus algorithm (OPENRC) based on acknowledgement feedback, designed to be robust to agent arrivals and departures, as well as to unbalanced directed network topologies. We demonstrate that when all active agents execute the OPENRC algorithm, the sum of their state variables remains constant during quiescent epochs when the network remains unchanged. By assuming eventual convergence during such quiescent periods following persistent variations in system composition and size, we prove the convergence of the OPENRC algorithm using column‑stochasticity and mass‑preservation properties. Finally, we apply and evaluate our proposed algorithm in a simulated environment, where agents are departing from and arriving in the network to highlight its resilience against changes in the network size and topology.

PaperID: 12971, https://arxiv.org/pdf/2409.08561.pdf

Authors: Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo

Title: Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring reasoning and multi‑step problem‑solving through the use of chain‑of‑thought (CoT) prompting. However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference. To address this challenge, we propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning. Our method introduces an auxiliary CoT model that learns to generate and compress the full thought process into a compact special token representation semantically aligned with the original CoT output. This compressed representation is then integrated into the input of the Hidden Chain‑of‑Thought (HCoT) model. The training process follows a two‑stage procedure: First, the CoT model is optimized to generate the compressed token representations aligned with the ground‑truth CoT outputs using a contrastive loss. Subsequently, with the CoT model parameters frozen, the HCoT model is fine‑tuned to generate accurate subsequent predictions conditioned on the prefix instruction and the compressed CoT representations from the CoT model. Extensive experiments across three challenging domains ‑ mathematical reasoning, agent invocation, and question answering ‑ demonstrate that our semantic compression approach achieves competitive or improved performance compared to the full CoT baseline, while providing significant speedups of at least 1.5x in decoding time. Moreover, incorporating contrastive learning objectives further enhances the quality of the compressed representations, leading to better CoT prompting and improved task accuracy. Our work paves the way for more efficient exploitation of multi‑step reasoning capabilities in LLMs across a wide range of applications.

PaperID: 12972, https://arxiv.org/pdf/2409.08406.pdf

Authors: Hang Li, Tianlong Xu, Ethan Chang, Qingsong Wen

Title: Knowledge Tagging with Large Language Model based Multi-Agent System

Abstract:
Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problem‑solving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre‑trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi‑agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM‑based multi‑agent system in overcoming the challenges that previous methods have encountered. Finally, through an in‑depth discussion of the implications of automating knowledge tagging, we underscore the promising results of deploying LLM‑based algorithms in educational contexts.

PaperID: 12973, https://arxiv.org/pdf/2409.08166.pdf

Authors: Sunny Katyara, Suchita Sharma, Praveen Damacharla, Carlos Garcia Santiago, Francis O'Farrell, Philip Long

Title: Collaborating for Success: Optimizing System Efficiency and Resilience Under Agile Industrial Settings

Abstract:
Designing an efficient and resilient human‑robot collaboration strategy that not only upholds the safety and ergonomics of shared workspace but also enhances the performance and agility of collaborative setup presents significant challenges concerning environment perception and robot control. In this research, we introduce a novel approach for collaborative environment monitoring and robot motion regulation to address this multifaceted problem. Our study proposes novel computation and division of safety monitoring zones, adhering to ISO 13855 and TS 15066 standards, utilizing 2D lasers information. These zones are not only configured in the standard three‑layer arrangement but are also expanded into two adjacent quadrants, thereby enhancing system uptime and preventing unnecessary deadlocks. Moreover, we also leverage 3D visual information to track dynamic human articulations and extended intrusions. Drawing upon the fused sensory data from 2D and 3D perceptual spaces, our proposed hierarchical controller stably regulates robot velocity, validated using Lasalle in‑variance principle. Empirical evaluations demonstrate that our approach significantly reduces task execution time and system response delay, resulting in improved efficiency and resilience within collaborative settings.

PaperID: 12974, https://arxiv.org/pdf/2409.07933.pdf

Authors: Haoying Li, Xinghan Li, Shuaiting Huang, Chao yang, Junfeng Wu

Title: Covariance Intersection-based Invariant Kalman Filtering(DInCIKF) for Distributed Pose Estimation

Abstract:
This paper presents a novel approach to distributed pose estimation in the multi‑agent system based on an invariant Kalman filter with covariance intersection. Our method models uncertainties using Lie algebra and applies object‑level observations within Lie groups, which have practical application value. We integrate covariance intersection to handle estimates that are correlated and use the invariant Kalman filter for merging independent data sources. This strategy allows us to effectively tackle the complex correlations of cooperative localization among agents, ensuring our estimates are neither too conservative nor overly confident. Additionally, we examine the consistency and stability of our algorithm, providing evidence of its reliability and effectiveness in managing multi‑agent systems.

PaperID: 12975, https://arxiv.org/pdf/2409.07932.pdf

Authors: Alexei Pisacane, Victor-Alexandru Darvariu, Mirco Musolesi

Title: Reinforcement Learning Discovers Efficient Decentralized Graph Path Search Strategies

Abstract:
Graph path search is a classic computer science problem that has been recently approached with Reinforcement Learning (RL) due to its potential to outperform prior methods. Existing RL techniques typically assume a global view of the network, which is not suitable for large‑scale, dynamic, and privacy‑sensitive settings. An area of particular interest is search in social networks due to its numerous applications. Inspired by seminal work in experimental sociology, which showed that decentralized yet efficient search is possible in social networks, we frame the problem as a collaborative task between multiple agents equipped with a limited local view of the network. We propose a multi‑agent approach for graph path search that successfully leverages both homophily and structural heterogeneity. Our experiments, carried out over synthetic and real‑world social networks, demonstrate that our model significantly outperforms learned and heuristic baselines. Furthermore, our results show that meaningful embeddings for graph navigation can be constructed using reward‑driven learning.

PaperID: 12976, https://arxiv.org/pdf/2409.07918.pdf

Authors: Elizabeth Wilson, György Fazekas, Geraint Wiggins

Title: Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning

Abstract:
This paper presents Tidal‑MerzA, a novel system designed for collaborative performances between humans and a machine agent in the context of live coding, specifically focusing on the generation of musical patterns. Tidal‑MerzA fuses two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and Tidal Fuzz, a computational framework. By integrating affective modelling with computational generation, this system leverages reinforcement learning techniques to dynamically adapt music composition parameters within the TidalCycles framework, ensuring both affective qualities to the patterns and syntactical correctness. The development of Tidal‑MerzA introduces two distinct agents: one focusing on the generation of mini‑notation strings for musical expression, and another on the alignment of music with targeted affective states through reinforcement learning. This approach enhances the adaptability and creative potential of live coding practices and allows exploration of human‑machine creative interactions. Tidal‑MerzA advances the field of computational music generation, presenting a novel methodology for incorporating artificial intelligence into artistic practices.

PaperID: 12977, https://arxiv.org/pdf/2409.07911.pdf

Authors: Zhifeng Hu, Chong Han, Wolfgang Gerstacker, Ian F. Akyildiz

Title: Tera-SpaceCom: GNN-based Deep Reinforcement Learning for Joint Resource Allocation and Task Offloading in TeraHertz Band Space Networks

Abstract:
Terahertz (THz) space communications (Tera‑SpaceCom) is envisioned as a promising technology to enable various space science and communication applications. Mainly, the realm of Tera‑SpaceCom consists of THz sensing for space exploration, data centers in space providing cloud services for space exploration tasks, and a low earth orbit (LEO) mega‑constellation relaying these tasks to ground stations (GSs) or data centers via THz links. Moreover, to reduce the computational burden on data centers as well as resource consumption and latency in the relaying process, the LEO mega‑constellation provides satellite edge computing (SEC) services to directly compute space exploration tasks without relaying these tasks to data centers. The LEO satellites that receive space exploration tasks offload (i.e., distribute) partial tasks to their neighboring LEO satellites, to further reduce their computational burden. However, efficient joint communication resource allocation and computing task offloading for the Tera‑SpaceCom SEC network is an NP‑hard mixed‑integer nonlinear programming problem (MINLP), due to the discrete nature of space exploration tasks and sub‑arrays as well as the continuous nature of transmit power. To tackle this challenge, a graph neural network (GNN)‑deep reinforcement learning (DRL)‑based joint resource allocation and task offloading (GRANT) algorithm is proposed with the target of long‑term resource efficiency (RE). Particularly, GNNs learn relationships among different satellites from their connectivity information. Furthermore, multi‑agent and multi‑task mechanisms cooperatively train task offloading and resource allocation. Compared with benchmark solutions, GRANT not only achieves the highest RE with relatively low latency, but realizes the fewest trainable parameters and the shortest running time.

PaperID: 12978, https://arxiv.org/pdf/2409.07878.pdf

Authors: Oscar Delaney, Oliver Guest, Zoe Williams

Title: Mapping Technical Safety Research at AI Companies: A literature review and incentives analysis

Abstract:
As AI systems become more advanced, concerns about large‑scale risks from misuse or accidents have grown. This report analyzes the technical research into safe AI development being conducted by three leading AI companies: Anthropic, Google DeepMind, and OpenAI. We define safe AI development as developing AI systems that are unlikely to pose large‑scale misuse or accident risks. This encompasses a range of technical approaches aimed at ensuring AI systems behave as intended and do not cause unintended harm, even as they are made more capable and autonomous. We analyzed all papers published by the three companies from January 2022 to July 2024 that were relevant to safe AI development, and categorized the 80 included papers into nine safety approaches. Additionally, we noted two categories representing nascent approaches explored by academia and civil society, but not currently represented in any research papers by these leading AI companies. Our analysis reveals where corporate attention is concentrated and where potential gaps lie. Some AI research may stay unpublished for good reasons, such as to not inform adversaries about the details of security techniques they would need to overcome to misuse AI systems. Therefore, we also considered the incentives that AI companies have to research each approach, regardless of how much work they have published on the topic. We identified three categories where there are currently no or few papers and where we do not expect AI companies to become much more incentivized to pursue this research in the future. These are model organisms of misalignment, multi‑agent safety, and safety by design. Our findings provide an indication that these approaches may be slow to progress without funding or efforts from government, civil society, philanthropists, or academia.

PaperID: 12979, https://arxiv.org/pdf/2409.07775.pdf

Authors: Yinbo Yu, Saihao Yan, Jiajia Liu

Title: A Spatiotemporal Stealthy Backdoor Attack against Cooperative Multi-Agent Deep Reinforcement Learning

Abstract:
Recent studies have shown that cooperative multi‑agent deep reinforcement learning (c‑MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform abnormal actions leading to failures or malicious goals. However, existing proposed backdoors suffer from several issues, e.g., fixed visual trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor attack against c‑MADRL, which attacks the entire multi‑agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manual‑injected fixed visual patterns or instant status and control the attack duration. This method can guarantee the stealthiness and practicality of injected backdoors. Secondly, we hack the original reward function of the backdoored agent via reward reverse and unilateral guidance during training to ensure its adverse influence on the entire team. We evaluate our backdoor attacks on two classic c‑MADRL algorithms VDN and QMIX, in a popular c‑MADRL environment SMAC. The experimental results demonstrate that our backdoor attacks are able to reach a high attack success rate (91.6%) while maintaining a low clean performance variance rate (3.7%).

PaperID: 12980, https://arxiv.org/pdf/2409.07714.pdf

Authors: Yang Li, Quan Yuan, Guiyang Luo, Xiaoyuan Fu, Xuanhan Zhu, Yujia Yang, Rui Pan, Jinglin Li

Title: CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model

Abstract:
By sharing complementary perceptual information, multi‑agent collaborative perception fosters a deeper understanding of the environment. Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension, which struggle to handle long‑range spatial‑temporal features under limited computing and communication resources. Holistically modeling the dependencies over extensive spatial areas and extended temporal frames is crucial to enhancing feature quality. To this end, we propose a resource efficient cross‑agent spatial‑temporal collaborative state space model (SSM), named CollaMamba. Initially, we construct a foundational backbone network based on spatial SSM. This backbone adeptly captures positional causal dependencies from both single‑agent and cross‑agent views, yielding compact and comprehensive intermediate features while maintaining linear complexity. Furthermore, we devise a history‑aware feature boosting module based on temporal SSM, extracting contextual cues from extended historical frames to refine vague features while preserving low overhead. Extensive experiments across several datasets demonstrate that CollaMamba outperforms state‑of‑the‑art methods, achieving higher model accuracy while reducing computational and communication overhead by up to 71.9% and 1/64, respectively. This work pioneers the exploration of the Mamba's potential in collaborative perception. The source code will be made available.

PaperID: 12981, https://arxiv.org/pdf/2409.07653.pdf

Authors: Daniel Weitekamp, Glen Smith, Kenneth Koedinger, Christopher MacLellan

Title: STAND: Self-Aware Precondition Induction for Interactive Task Learning

Abstract:
In interactive task learning (ITL), AI agents learn new capabilities from limited human instruction provided during task execution. STAND is a new method of data‑efficient rule precondition induction specifically designed for these human‑in‑the‑loop training scenarios. A key feature of STAND is its self‑awareness of its own learning ‑‑ it can provide accurate metrics of training progress back to users. STAND beats popular methods like XGBoost, decision trees, random forests, and version spaces at small‑data precondition induction tasks, and is highly accurate at estimating when its performance improves on holdout examples. In our evaluations, we find that STAND shows more monotonic improvement than other models with low rates of error recurrence. These features of STAND support a more consistent training experience, enabling human instructors to estimate when they are finished training and providing active‑learning support by identifying trouble spots where more training is required.

PaperID: 12982, https://arxiv.org/pdf/2409.07453.pdf

Authors: Shengxin Hong, Chang Cai, Sixuan Du, Haiyue Feng, Siyuan Liu, Xiuyi Fan

Title: "My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays

Abstract:
Interactive feedback, where feedback flows in both directions between teacher and student, is more effective than traditional one‑way feedback. However, it is often too time‑consuming for widespread use in educational practice. While Large Language Models (LLMs) have potential for automating feedback, they struggle with reasoning and interaction in an interactive setting. This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback. CAELF allows students to query, challenge, and clarify their feedback by integrating a multi‑agent system with computational argumentation. Essays are first assessed by multiple Teaching‑Assistant Agents (TA Agents), and then a Teacher Agent aggregates the evaluations through formal reasoning to generate feedback and grades. Students can further engage with the feedback to refine their understanding. A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback, enhancing the reasoning and interaction capabilities of LLMs. This approach offers a promising solution to overcoming the time and resource barriers that have limited the adoption of interactive feedback in educational settings.

PaperID: 12983, https://arxiv.org/pdf/2409.07335.pdf

Authors: Mehrdad Zakershahrak, Samira Ghodratnama

Title: Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization

Abstract:
The rapid advancement of artificial intelligence systems has brought the challenge of AI alignment to the forefront of research, particularly in complex decision‑making and task execution. As these systems surpass human‑level performance in sophisticated problems, ensuring their alignment with human values, intentions, and ethical guidelines becomes crucial. Building on previous work in explanation generation for human‑agent alignment, we address the more complex dynamics of multi‑agent systems and human‑AI teams. This paper introduces a novel approach to model alignment through weak‑to‑strong generalization in the context of language models. We present a framework where a strong model facilitates the improvement of a weaker model, bridging the gap between explanation generation and model alignment. Our method, formalized as a facilitation function, allows for the transfer of capabilities from advanced models to less capable ones without direct access to extensive training data. Our results suggest that this facilitation‑based approach not only enhances model performance but also provides insights into the nature of model alignment and the potential for scalable oversight of AI systems.

PaperID: 12984, https://arxiv.org/pdf/2409.07189.pdf

Authors: Mohamed Dhouioui, Jonathan Barnoud, Rhoslyn Roebuck Williams, Harry J. Stroud, Phil Bates, David R. Glowacki

Title: AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems

Abstract:
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD‑VR) has recently emerged as a "human‑in‑the‑loop" strategy for efficiently navigating hyper‑dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real‑time molecular simulations running on high‑performance computing architectures, iMD‑VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high‑dimensional molecular systems. Moreover, iMD‑VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher‑generated iMD‑VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and Multi‑agents systems domains which are comparable to iMD‑VR, and discuss how iMD‑VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof‑of‑principle study where iMD‑VR data was used to train a CNN network on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.

PaperID: 12985, https://arxiv.org/pdf/2409.07145.pdf

Authors: Davide Ferrari, Cristian Secchi

Title: The Critical Role of Effective Communication in Human-Robot Collaborative Assembly

Abstract:
In the rapidly evolving landscape of Human‑Robot Collaboration (HRC), effective communication between humans and robots is crucial for complex task execution. Traditional request‑response systems often lack naturalness and may hinder efficiency. This study emphasizes the importance of adopting human‑like communication interactions to enable fluent vocal communication between human operators and robots simulating a collaborative human‑robot industrial assembly. We propose a novel approach that employs human‑like interactions through natural dialogue, enabling human operators to engage in vocal conversations with robots. Through a comparative experiment, we demonstrate the efficacy of our approach in enhancing task performance and collaboration efficiency. The robot's ability to engage in meaningful vocal conversations enables it to seek clarification, provide status updates, and ask for assistance when required, leading to improved coordination and a smoother workflow. The results indicate that the adoption of human‑like conversational interactions positively influences the human‑robot collaborative dynamic. Human operators find it easier to convey complex instructions and preferences, resulting in a more productive and satisfying collaboration experience.

PaperID: 12986, https://arxiv.org/pdf/2409.07127.pdf

Authors: Dongkun Huo, Huateng Zhang, Yixue Hao, Yuanlin Ye, Long Hu, Rui Wang, Min Chen

Title: DCMAC: Demand-aware Customized Multi-Agent Communication via Upper Bound Training

Abstract:
Efficient communication can enhance the overall performance of collaborative multi‑agent reinforcement learning. A common approach is to share observations through full communication, leading to significant communication overhead. Existing work attempts to perceive the global state by conducting teammate model based on local information. However, they ignore that the uncertainty generated by prediction may lead to difficult training. To address this problem, we propose a Demand‑aware Customized Multi‑Agent Communication (DCMAC) protocol, which use an upper bound training to obtain the ideal policy. By utilizing the demand parsing module, agent can interpret the gain of sending local message on teammate, and generate customized messages via compute the correlation between demands and local observation using cross‑attention mechanism. Moreover, our method can adapt to the communication resources of agents and accelerate the training progress by appropriating the ideal policy which is trained with joint observation. Experimental results reveal that DCMAC significantly outperforms the baseline algorithms in both unconstrained and communication constrained scenarios.

PaperID: 12987, https://arxiv.org/pdf/2409.06888.pdf

Authors: Cheng Qian, Yulun Zhang, Varun Bhatt, Matthew Christopher Fontaine, Stefanos Nikolaidis, Jiaoyang Li

Title: QD-MAPPER: A Quality Diversity Framework to Automatically Evaluate Multi-Agent Path Finding Algorithms in Diverse Maps

Abstract:
We use the Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to automatically evaluate Multi‑Agent Path Finding (MAPF) algorithms by generating diverse maps. Previously, researchers typically evaluate MAPF algorithms on a set of specific, human‑designed maps at their initial stage of algorithm design. However, such fixed maps may not cover all scenarios, and algorithms may overfit to the small set of maps. To seek further improvements, systematic evaluations on a diverse suite of maps are needed. In this work, we propose Quality‑Diversity Multi‑Agent Path Finding Performance EvaluatoR (QD‑MAPPER), a general framework that takes advantage of the QD algorithm to comprehensively understand the performance of MAPF algorithms by generating maps with patterns, be able to make fair comparisons between two MAPF algorithms, providing further information on the selection between two algorithms and on the design of the algorithms. Empirically, we employ this technique to evaluate and compare the behavior of different types of MAPF algorithms, including search‑based, priority‑based, rule‑based, and learning‑based algorithms. Through both single‑algorithm experiments and comparisons between algorithms, researchers can identify patterns that each MAPF algorithm excels and detect disparities in runtime or success rates between different algorithms.

PaperID: 12988, https://arxiv.org/pdf/2409.06864.pdf

Authors: Marta Lagomarsino, Marta Lorenzini, Elena De Momi, Arash Ajoudani

Title: PRO-MIND: Proximity and Reactivity Optimisation of robot Motion to tune safety limits, human stress, and productivity in INDustrial settings

Abstract:
Despite impressive advancements of industrial collaborative robots, their potential remains largely untapped due to the difficulty in balancing human safety and comfort with fast production constraints. To help address this challenge, we present PRO‑MIND, a novel human‑in‑the‑loop framework that leverages valuable data about the human co‑worker to optimise robot trajectories. By estimating human attention and mental effort, our method dynamically adjusts safety zones and enables on‑the‑fly alterations of the robot path to enhance human comfort and optimal stopping conditions. Moreover, we formulate a multi‑objective optimisation to adapt the robot's trajectory execution time and smoothness based on the current human psycho‑physical stress, estimated from heart rate variability and frantic movements. These adaptations exploit the properties of B‑spline curves to preserve continuity and smoothness, which are crucial factors in improving motion predictability and comfort. Evaluation in two realistic case studies showcases the framework's ability to restrain the operators' workload and stress and to ensure their safety while enhancing human‑robot productivity. Further strengths of PRO‑MIND include its adaptability to each individual's specific needs and sensitivity to variations in attention, mental effort, and stress during task execution.

PaperID: 12989, https://arxiv.org/pdf/2409.06750.pdf

Authors: H. Zhang, J. Yin, M. Jiang, C. Su

Title: Can Agents Spontaneously Form a Society? Introducing a Novel Architecture for Generative Multi-Agents to Elicit Social Emergence

Abstract:
Generative agents have demonstrated impressive capabilities in specific tasks, but most of these frameworks focus on independent tasks and lack attention to social interactions. We introduce a generative agent architecture called ITCMA‑S, which includes a basic framework for individual agents and a framework called LTRHA that supports social interactions among multi‑agents. This architecture enables agents to identify and filter out behaviors that are detrimental to social interactions, guiding them to choose more favorable actions. We designed a sandbox environment to simulate the natural evolution of social relationships among multiple identity‑less agents for experimental evaluation. The results showed that ITCMA‑S performed well on multiple evaluation indicators, demonstrating its ability to actively explore the environment, recognize new agents, and acquire new information through continuous actions and dialogue. Observations show that as agents establish connections with each other, they spontaneously form cliques with internal hierarchies around a selected leader and organize collective activities.

PaperID: 12990, https://arxiv.org/pdf/2409.06614.pdf

Authors: Laura Georgescu, James Fox, Anna Gautier, Michael Wooldridge

Title: Fixed-budget and Multiple-issue Quadratic Voting

Abstract:
Quadratic Voting (QV) is a social choice mechanism that addresses the "tyranny of the majority" of one‑person‑one‑vote mechanisms. Agents express not only their preference ordering but also their preference intensity by purchasing x votes at a cost of x^2. Although this pricing rule maximizes utilitarian social welfare and is robust against strategic manipulation, it has not yet found many real‑life applications. One key reason is that the original QV mechanism does not limit voter budgets. Two variations have since been proposed: a (no‑budget) multiple‑issue generalization and a fixed‑budget version that allocates a constant number of credits to agents for use in multiple binary elections. While some analysis has been undertaken with respect to the multiple‑issue variation, the fixed‑budget version has not yet been rigorously studied. In this work, we formally propose a novel fixed‑budget multiple‑issue QV mechanism. This integrates the advantages of both the aforementioned variations, laying the theoretical foundations for practical use cases of QV, such as multi‑agent resource allocation. We analyse our fixed‑budget multiple‑issue QV by comparing it with traditional voting systems, exploring potential collusion strategies, and showing that checking whether strategy profiles form a Nash equilibrium is tractable.

PaperID: 12991, https://arxiv.org/pdf/2409.06568.pdf

Authors: Leilei Lin, Yingming Zhou, Wenlong Chen, Chen Qian

Title: Think-on-Process: Dynamic Process Generation for Collaborative Development of Multi-Agent System

Abstract:
Software development is a collaborative endeavor that requires individuals from different departments to work together in order to collectively develop a high‑quality software system. In this context, people have begun to explore a method that leverages multi‑agent systems based on LLMs to carry out software development. However, existing research tends to rigidly fix the software development process in a framework in code form, thus failing to dynamically adjust the software development process in real‑time to meet the more flexible and variable software environment. In this paper, we propose a dynamic process generation framework, named ToP (Think‑on‑Process). The core idea of ToP is to leverage experiential knowledge (i.e., process models) to guide LLMs in generating software development processes (i.e., instances). These instances will guide multi‑agent in software development and employ a compiler to provide feedback on the development outcomes. Subsequently, we utilize heuristic algorithms to filter the instances and apply process mining algorithms to derive process model. Finally, the process model will be converted into text, formatted as prompts, to enhance the ability of LLMs to generate other instances. Experiments demonstrate that our framework ToP significantly enhances the dynamic process generation capability of the GPT‑3.5 and GPT‑4 for five categories of software development tasks.

PaperID: 12992, https://arxiv.org/pdf/2409.06486.pdf

Authors: Sándor P. Fekete, Ramin Kosfeld, Peter Kramer, Jonas Neutzner, Christian Rieck, Christian Scheffer

Title: Coordinated Motion Planning: Multi-Agent Path Finding in a Densely Packed, Bounded Domain

Abstract:
We study Multi‑Agent Path Finding for arrangements of labeled agents in the interior of a simply connected domain: Given a unique start and target position for each agent, the goal is to find a sequence of parallel, collision‑free agent motions that minimizes the overall time (the makespan) until all agents have reached their respective targets. A natural case is that of a simply connected polygonal domain with axis‑parallel boundaries and integer coordinates, i.e., a simple polyomino, which amounts to a simply connected union of lattice unit squares or cells. We focus on the particularly challenging setting of densely packed agents, i.e., one per cell, which strongly restricts the mobility of agents, and requires intricate coordination of motion. We provide a variety of novel results for this problem, including (1) a characterization of polyominoes in which a reconfiguration plan is guaranteed to exist; (2) a characterization of shape parameters that induce worst‑case bounds on the makespan; (3) a suite of algorithms to achieve asymptotically worst‑case optimal performance with respect to the achievable stretch for cases with severely limited maneuverability. This corresponds to bounding the ratio between obtained makespan and the lower bound provided by the max‑min distance between the start and target position of any agent and our shape parameters. Our results extend findings by Demaine et al. (SIAM Journal on Computing, 2019) who investigated the problem for solid rectangular domains, and in the closely related field of Permutation Routing, as presented by Alpert et al. (Computational Geometry, 2022) for convex pieces of grid graphs.

PaperID: 12993, https://arxiv.org/pdf/2409.06416.pdf

Authors: Jingxiong Liu, Ludvig Lemner, Linnea Wahlgren, Gregory Gay, Nasser Mohammadiha, Joakim Wennerberg

Title: Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Abstract:
Much of the cost and effort required during the software testing process is invested in performing test maintenance ‑ the addition, removal, or modification of test cases to keep the test suite in sync with the system‑under‑test or to otherwise improve its quality. Tool support could reduce the cost ‑ and improve the quality ‑ of test maintenance by automating aspects of the process or by providing guidance and support to developers. In this study, we explore the capabilities and applications of large language models (LLMs) ‑ complex machine learning models adapted to textual analysis ‑ to support test maintenance. We conducted a case study at Ericsson AB where we explore the triggers that indicate the need for test maintenance, the actions that LLMs can take, and the considerations that must be made when deploying LLMs in an industrial setting. We also propose and demonstrate a multi‑agent architecture that can predict which tests require maintenance following a change to the source code. Collectively, these contributions advance our theoretical and practical understanding of how LLMs can be deployed to benefit industrial test maintenance processes.

PaperID: 12994, https://arxiv.org/pdf/2409.06351.pdf

Authors: David Bani-Harouni, Nassir Navab, Matthias Keicher

Title: MAGDA: Multi-agent guideline-driven diagnostic assistance

Abstract:
In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack fast image analysis by trained radiologists, which can have a detrimental effect on patients' healthcare. Large Language Models (LLMs) have the potential to alleviate some pressure from these clinicians by providing insights that can help them in their decision‑making. While these LLMs achieve high test results on medical exams showcasing their great theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach for zero‑shot guideline‑driven decision support. We model a system of multiple LLM agents augmented with a contrastive vision‑language model that collaborate to reach a patient diagnosis. After providing the agents with simple diagnostic guidelines, they will synthesize prompts and screen the image for findings following these guidelines. Finally, they provide understandable chain‑of‑thought reasoning for their diagnosis, which is then self‑refined to consider inter‑dependencies between diseases. As our method is zero‑shot, it is adaptable to settings with rare diseases, where training data is limited, but expert‑crafted disease descriptions are available. We evaluate our method on two chest X‑ray datasets, CheXpert and ChestX‑ray 14 Longtail, showcasing performance improvement over existing zero‑shot methods and generalizability to rare diseases.

PaperID: 12995, https://arxiv.org/pdf/2409.06345.pdf

Authors: Siddharth Chaturvedi, Ahmed El-Gazzar, Marcel van Gerven

Title: Foragax: An Agent-Based Modelling Framework Based on JAX

Abstract:
Foraging for resources is a ubiquitous activity conducted by living organisms in a shared environment to maintain their homeostasis. Modelling multi‑agent foraging in‑silico allows us to study both individual and collective emergent behaviour in a tractable manner. Agent‑based modelling has proven to be effective in simulating such tasks, though scaling the simulations to accommodate large numbers of agents with complex dynamics remains challenging. In this work, we present Foragax, a general‑purpose, scalable, hardware‑accelerated, multi‑agent foraging toolkit. Leveraging the JAX library, our toolkit can simulate thousands of agents foraging in a common environment, in an end‑to‑end vectorized and differentiable manner. The toolkit provides agent‑based modelling tools to model various foraging tasks, including options to design custom spatial and temporal agent dynamics, control policies, sensor models, and boundary conditions. Further, the number of agents during such simulations can be increased or decreased based on custom rules. While applied to foraging, the toolkit can also be used to model and simulate a wide range of other multi‑agent scenarios.

PaperID: 12996, https://arxiv.org/pdf/2409.06336.pdf

Authors: Antonin Sulc, Thorsten Hellert, Raimund Kammering, Hayden Hoschouer, Jason St. John

Title: Towards Agentic AI on Particle Accelerators

Abstract:
As particle accelerators grow in complexity, traditional control methods face increasing challenges in achieving optimal performance. This paper envisions a paradigm shift: a decentralized multi‑agent framework for accelerator control, powered by Large Language Models (LLMs) and distributed among autonomous agents. We present a proposition of a self‑improving decentralized system where intelligent agents handle high‑level tasks and communication and each agent is specialized to control individual accelerator components. This approach raises some questions: What are the future applications of AI in particle accelerators? How can we implement an autonomous complex system such as a particle accelerator where agents gradually improve through experience and human feedback? What are the implications of integrating a human‑in‑the‑loop component for labeling operational data and providing expert guidance? We show three examples, where we demonstrate the viability of such architecture.

PaperID: 12997, https://arxiv.org/pdf/2409.06289.pdf

Authors: Zhizhuo Kou, Holam Yu, Junyu Luo, Jingshu Peng, Xujia Li, Chengzhong Liu, Juntao Dai, Lei Chen, Sirui Han, Yike Guo

Title: Automate Strategy Finding with LLM in Quant Investment

Abstract:
We present a novel three‑stage framework leveraging Large Language Models (LLMs) within a risk‑aware multi‑agent system for automate strategy finding in quantitative finance. Our approach addresses the brittleness of traditional deep learning models in financial applications by: employing prompt‑engineered LLMs to generate executable alpha factor candidates across diverse financial data, implementing multimodal agent‑based evaluation that filters factors based on market status, predictive quality while maintaining category balance, and deploying dynamic weight optimization that adapts to market conditions. Experimental results demonstrate the robust performance of the strategy in Chinese & US market regimes compared to established benchmarks. Our work extends LLMs capabilities to quantitative trading, providing a scalable architecture for financial signal extraction and portfolio construction. The overall framework significantly outperforms all benchmarks with 53.17% cumulative return on SSE50 (Jan 2023 to Jan 2024), demonstrating superior risk‑adjusted performance and downside protection on the market.

PaperID: 12998, https://arxiv.org/pdf/2409.06010.pdf

Authors: Bowei Li, Saugat Tripathi, Salman Hosain, Ran Zhang, Jiang, Xie, Miao Wang

Title: When Learning Meets Dynamics: Distributed User Connectivity Maximization in UAV-Based Communication Networks

Abstract:
Distributed management over Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) has attracted increasing research attention. In this work, we study a distributed user connectivity maximization problem in a UCN. The work features a horizontal study over different levels of information exchange during the distributed iteration and a consideration of dynamics in UAV set and user distribution, which are not well addressed in the existing works. Specifically, the studied problem is first formulated into a time‑coupled mixed‑integer non‑convex optimization problem. A heuristic two‑stage UAV‑user association policy is proposed to faster determine the user connectivity. To tackle the NP‑hard problem in scalable manner, the distributed user connectivity maximization algorithm 1 (DUCM‑1) is proposed under the multi‑agent deep Q learning (MA‑DQL) framework. DUCM‑1 emphasizes on designing different information exchange levels and evaluating how they impact the learning convergence with stationary and dynamic user distribution. To comply with the UAV dynamics, DUCM‑2 algorithm is developed which is devoted to autonomously handling arbitrary quit's and join‑in's of UAVs in a considered time horizon. Extensive simulations are conducted i) to conclude that exchanging state information with a deliberated task‑specific reward function design yields the best convergence performance, and ii) to show the efficacy and robustness of DUCM‑2 against the dynamics.

PaperID: 12999, https://arxiv.org/pdf/2409.05712.pdf

Authors: Jiaqi Liu, Peng Hang, Xiaoxiang Na, Chao Huang, Jian Sun

Title: Cooperative Decision-Making for CAVs at Unsignalized Intersections: A MARL Approach with Attention and Hierarchical Game Priors

Abstract:
The development of autonomous vehicles has shown great potential to enhance the efficiency and safety of transportation systems. However, the decision‑making issue in complex human‑machine mixed traffic scenarios, such as unsignalized intersections, remains a challenge for autonomous vehicles. While reinforcement learning (RL) has been used to solve complex decision‑making problems, existing RL methods still have limitations in dealing with cooperative decision‑making of multiple connected autonomous vehicles (CAVs), ensuring safety during exploration, and simulating realistic human driver behaviors. In this paper, a novel and efficient algorithm, Multi‑Agent Game‑prior Attention Deep Deterministic Policy Gradient (MA‑GA‑DDPG), is proposed to address these limitations. Our proposed algorithm formulates the decision‑making problem of CAVs at unsignalized intersections as a decentralized multi‑agent reinforcement learning problem and incorporates an attention mechanism to capture interaction dependencies between ego CAV and other agents. The attention weights between the ego vehicle and other agents are then used to screen interaction objects and obtain prior hierarchical game relations, based on which a safety inspector module is designed to improve the traffic safety. Furthermore, both simulation and hardware‑in‑the‑loop experiments were conducted, demonstrating that our method outperforms other baseline approaches in terms of driving safety, efficiency, and comfort.

PaperID: 13000, https://arxiv.org/pdf/2409.05556.pdf

Authors: Alireza Ghafarollahi, Markus J. Buehler

Title: SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

Abstract:
A key challenge in artificial intelligence is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, we present SciAgents, an approach that leverages three core concepts: (1) the use of large‑scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi‑agent systems with in‑situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses traditional human‑driven research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the intelligent system yields material discoveries, critique and improve existing hypotheses, retrieve up‑to‑date data about existing research, and highlights their strengths and limitations. Our case studies demonstrate scalable capabilities to combine generative AI, ontological representations, and multi‑agent modeling, harnessing a `swarm of intelligence' similar to biological systems. This provides new avenues for materials discovery and accelerates the development of advanced materials by unlocking Nature's design principles.

PaperID: 13001, https://arxiv.org/pdf/2409.05480.pdf

Authors: Yihong Tao, Bo Lei, Haoyang Shi, Jingkai Chen, Xing Zhang

Title: Adaptive Multi-Layer Deployment for A Digital Twin Empowered Satellite-Terrestrial Integrated Network

Abstract:
With the development of satellite communication technology, satellite‑terrestrial integrated networks (STIN), which integrate satellite networks and ground networks, can realize seamless global coverage of communication services. Confronting the intricacies of network dynamics, the diversity of resource heterogeneity, and the unpredictability of user mobility, dynamic resource allocation within networks faces formidable challenges. Digital twin (DT), as a new technique, can reflect a physical network to a virtual network to monitor, analyze, and optimize the physical network. Nevertheless, in the process of constructing the DT model, the deployment location and resource allocation of DTs may adversely affect its performance. Therefore, we propose a STIN model, which alleviates the problem of insufficient single‑layer deployment flexibility of the traditional edge network by deploying DTs in multi‑layer nodes in a STIN. To address the challenge of deploying DTs in the network, we propose multi‑layer DT deployment in a STIN to reduce system delay. Then we adopt a multi‑agent reinforcement learning (MARL) scheme to explore the optimal strategy of the DT multi‑layer deployment problem. The implemented scheme demonstrates a notable reduction in system delay, as evidenced by simulation outcomes.

PaperID: 13002, https://arxiv.org/pdf/2409.05392.pdf

Authors: Mario A. V. Saucedo, Nikolaos Stathoulopoulos, Akash Patel, Christoforos Kanellakis, George Nikolakopoulos

Title: Leveraging Computation of Expectation Models for Commonsense Affordance Estimation on 3D Scene Graphs

Abstract:
This article studies the commonsense object affordance concept for enabling close‑to‑human task planning and task optimization of embodied robotic agents in urban environments. The focus of the object affordance is on reasoning how to effectively identify object's inherent utility during the task execution, which in this work is enabled through the analysis of contextual relations of sparse information of 3D scene graphs. The proposed framework develops a Correlation Information (CECI) model to learn probability distributions using a Graph Convolutional Network, allowing to extract the commonsense affordance for individual members of a semantic class. The overall framework was experimentally validated in a real‑world indoor environment, showcasing the ability of the method to level with human commonsense. For a video of the article, showcasing the experimental demonstration, please refer to the following link: https://youtu.be/BDCMVx2GiQE

PaperID: 13003, https://arxiv.org/pdf/2409.05293.pdf

Authors: Zeinab Ebrahimi, Mohammad Deghat

Title: Distributed Robust Continuous-Time Optimization Algorithms for Time-Varying Constrained Cost

Abstract:
This paper presents a distributed continuous‑time optimization framework aimed at overcoming the challenges posed by time‑varying cost functions and constraints in multi‑agent systems, particularly those subject to disturbances. By incorporating tools such as log‑barrier penalty functions to address inequality constraints, an integral sliding mode control for disturbance mitigation is proposed. The algorithm ensures asymptotic tracking of the optimal solution, achieving a tracking error of zero. The convergence of the introduced algorithms is demonstrated through Lyapunov analysis and nonsmooth techniques. Furthermore, the framework's effectiveness is validated through numerical simulations considering two scenarios for the communication networks.

PaperID: 13004, https://arxiv.org/pdf/2409.05155.pdf

Authors: Jiahao Shi, James C. Spall

Title: Difference Between Cyclic and Distributed Approach in Stochastic Optimization for Multi-agent System

Abstract:
Many stochastic optimization problems in multi‑agent systems can be decomposed into smaller subproblems or reduced decision subspaces. The cyclic and distributed approaches are two widely used strategies for solving such problems. In this manuscript, we review four existing methods for addressing these problems and compare them based on their suitable problem frameworks and update rules.

PaperID: 13005, https://arxiv.org/pdf/2409.05113.pdf

Authors: Shiqi Zheng, Choon Ki Ahn, Xiaowei Jiang, Huaicheng Yan, Peng Shi

Title: Nonlinear Cooperative Output Regulation with Input Delay Compensation

Abstract:
This paper investigates the cooperative output regulation (COR) of nonlinear multi‑agent systems (MASs) with long input delay based on periodic event‑triggered mechanism. Compared with other mechanisms, periodic event‑triggered control can automatically guarantee a Zeno‑free behavior and avoid the continuous monitoring of triggered conditions. First, a new periodic event‑triggered distributed observer, which is based on the fully asynchronous communication data, is proposed to estimate the leader information. Second, a new distributed predictor feedback control method is proposed for the considered nonlinear MASs with input delay. By coordinate transformation, the MASs are mapped into new coupled ODE‑PDE target systems with some disturbance‑like terms. Then, we show that the COR problem is solvable. At last, to further save the communication resource, a periodic event‑triggered mechanism is considered in the sensor‑to‑controller transmission in every agent. A new periodic event‑triggered filter is proposed to deal with the periodic event‑triggered feedback data. The MASs with input delay are mapped into coupled ODE‑PDE target systems with sampled data information. Then, Lyapunov‑Krasovskii functions are constructed to demonstrate the exponential stability of the MASs. Simulations verify the validity of the proposed results.

PaperID: 13006, https://arxiv.org/pdf/2409.05106.pdf

Authors: Gregorio Marchesini, Siyuan Liu, Lars Lindemann, Dimos V. Dimarogonas

Title: Decentralized Control of Multi-Agent Systems Under Acyclic Spatio-Temporal Task Dependencies

Abstract:
We introduce a novel distributed sampled‑data control method tailored for heterogeneous multi‑agent systems under a global spatio‑temporal task with acyclic dependencies. Specifically, we consider the global task as a conjunction of independent and collaborative tasks, defined over the absolute and relative states of agent pairs. Task dependencies in this form are then represented by a task graph, which we assume to be acyclic. From the given task graph, we provide an algorithmic approach to define a distributed sampled‑data controller prioritizing the fulfilment of collaborative tasks as the primary objective, while fulfilling independent tasks unless they conflict with collaborative ones. Moreover, communication maintenance among collaborating agents is seamlessly enforced within the proposed control framework. A numerical simulation is provided to showcase the potential of our control framework.

PaperID: 13007, https://arxiv.org/pdf/2409.05025.pdf

Authors: Khai Doan, Marios Avgeris, Aris Leivadeas, Ioannis Lambadaris, Wonjae Shin

Title: Cooperative Learning-Based Framework for VNF Caching and Placement Optimization over Low Earth Orbit Satellite Networks

Abstract:
Low Earth Orbit Satellite Networks (LSNs) are integral to supporting a broad range of modern applications, which are typically modeled as Service Function Chains (SFCs). Each SFC is composed of Virtual Network Functions (VNFs), where each VNF performs a specific task. In this work, we tackle two key challenges in deploying SFCs across an LSN. Firstly, we aim to optimize the long‑term system performance by minimizing the average end‑to‑end SFC execution delay, given that each satellite comes with a pre‑installed/cached subset of VNFs. To achieve optimal SFC placement, we formulate an offline Dynamic Programming (DP) equation. To overcome the challenges associated with DP, such as its complexity, the need for probability knowledge, and centralized decision‑making, we put forth an online Multi‑Agent Q‑Learning (MAQL) solution. Our MAQL approach addresses convergence issues in the non‑stationary LSN environment by enabling satellites to share learning parameters and update their Q‑tables based on distinct rules for their selected actions. Secondly, to determine the optimal VNF subsets for satellite caching, we develop a Bayesian Optimization (BO)‑based learning mechanism that operates both offline and continuously in the background during runtime. Extensive experiments demonstrate that our MAQL approach achieves near‑optimal performance comparable to the DP model and significantly outperforms existing baselines. Moreover, the BO‑based approach effectively enhances the request serving rate over time.

PaperID: 13008, https://arxiv.org/pdf/2409.05001.pdf

Authors: Huan Zhang, Wei Cheng, Yuhan Wu, Wei Hu

Title: A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement

Abstract:
Large language models (LLMs) have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM‑based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high‑level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi‑plan exploration and feedback‑based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open‑source and closed‑source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%‑162.43% compared to prompting LLMs directly.

PaperID: 13009, https://arxiv.org/pdf/2409.04854.pdf

Authors: Konstantinos Varsos, Merkouris Papamichail, Giorgos Flouris, Marina Bitsaki

Title: Adaptation Procedure in Misinformation Games

Abstract:
We study interactions between agents in multi‑agent systems, in which the agents are misinformed with regards to the game that they play, essentially having a subjective and incorrect understanding of the setting, without being aware of it. For that, we introduce a new game‑theoretic concept, called misinformation games, that provides the necessary toolkit to study this situation. Subsequently, we enhance this framework by developing a time‑discrete procedure (called the Adaptation Procedure) that captures iterative interactions in the above context. During the Adaptation Procedure, the agents update their information and reassess their behaviour in each step. We demonstrate our ideas through an implementation, which is used to study the efficiency and characteristics of the Adaptation Procedure.

PaperID: 13010, https://arxiv.org/pdf/2409.04613.pdf

Authors: Chinmay Maheshwari, Manxi Wu, Shankar Sastry

Title: Convergence of Decentralized Actor-Critic Algorithm in General-sum Markov Games

Abstract:
Markov games provide a powerful framework for modeling strategic multi‑agent interactions in dynamic environments. Traditionally, convergence properties of decentralized learning algorithms in these settings have been established only for special cases, such as Markov zero‑sum and potential games, which do not fully capture real‑world interactions. In this paper, we address this gap by studying the asymptotic properties of learning algorithms in general‑sum Markov games. In particular, we focus on a decentralized algorithm where each agent adopts an actor‑critic learning dynamic with asynchronous step sizes. This decentralized approach enables agents to operate independently, without requiring knowledge of others' strategies or payoffs. We introduce the concept of a Markov Near‑Potential Function (MNPF) and demonstrate that it serves as an approximate Lyapunov function for the policy updates in the decentralized learning dynamics, which allows us to characterize the convergent set of strategies. We further strengthen our result under specific regularity conditions and with finite Nash equilibria.

PaperID: 13011, https://arxiv.org/pdf/2409.04286.pdf

Authors: Desiree Heim, Christian Jilek, Adrian Ulges, Andreas Dengel

Title: Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets

Abstract:
Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data‑driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real‑life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi‑agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents producing Large Language Model‑generated documents and accompanying data traces. Additionally, the generator captures all background information, given in its configuration or created during the simulation process, in a knowledge graph. Finally, the resulting dataset can be utilized and shared without privacy or confidentiality concerns. This paper introduces our approach's design and vision and focuses on generating authentic knowledge work documents using Large Language Models. Our study involving human raters who assessed 53% of the generated and 74% of the real documents as realistic demonstrates the potential of our approach. Furthermore, we analyze the authenticity criteria mentioned in the participants' comments and elaborate on potential improvements for identified common issues.

PaperID: 13012, https://arxiv.org/pdf/2409.04224.pdf

Authors: Daniel J. Tan, Qianyi Xu, Kay Choong See, Dilruk Perera, Mengling Feng

Title: Advancing Multi-Organ Disease Care: A Hierarchical Multi-Agent Reinforcement Learning Framework

Abstract:
In healthcare, multi‑organ system diseases pose unique and significant challenges as they impact multiple physiological systems concurrently, demanding complex and coordinated treatment strategies. Despite recent advancements in the AI based clinical decision support systems, these solutions only focus on individual organ systems, failing to account for complex interdependencies between them. This narrow focus greatly hinders their effectiveness in recommending holistic and clinically actionable treatments in the real world setting. To address this critical gap, we propose a novel Hierarchical Multi‑Agent Reinforcement Learning (HMARL) framework. Our architecture deploys specialized and dedicated agents for each organ system and facilitates inter‑agent communication to enable synergistic decision‑making across organ systems. Furthermore, we introduce a dual‑layer state representation technique that contextualizes patient conditions at both global and organ‑specific levels, improving the accuracy and relevance of treatment decisions. We evaluate our HMARL solution on the task of sepsis management, a common and critical multi‑organ disease, using both qualitative and quantitative metrics. Our method learns effective, clinically aligned treatment policies that considerably improve patient survival. We believe this framework represents a significant advancement in clinical decision support systems, introducing the first RL solution explicitly designed for multi‑organ treatment recommendations. Our solution moves beyond prevailing simplified, single‑organ models that fall short in addressing the complexity of multi‑organ diseases.

PaperID: 13013, https://arxiv.org/pdf/2409.04168.pdf

Authors: Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth

Title: From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Abstract:
To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi‑step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher‑quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part‑of‑speech tags and find that we can correctly predict 70%‑75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on‑average better model but largely fail if we use them to improve task performance.

PaperID: 13014, https://arxiv.org/pdf/2409.03930.pdf

Authors: Kangtong Mo, Linyue Chu, Xingyu Zhang, Xiran Su, Yang Qian, Yining Ou, Wian Pretorius

Title: DRAL: Deep Reinforcement Adaptive Learning for Multi-UAVs Navigation in Unknown Indoor Environment

Abstract:
Autonomous indoor navigation of UAVs presents numerous challenges, primarily due to the limited precision of GPS in enclosed environments. Additionally, UAVs' limited capacity to carry heavy or power‑intensive sensors, such as overheight packages, exacerbates the difficulty of achieving autonomous navigation indoors. This paper introduces an advanced system in which a drone autonomously navigates indoor spaces to locate a specific target, such as an unknown Amazon package, using only a single camera. Employing a deep learning approach, a deep reinforcement adaptive learning algorithm is trained to develop a control strategy that emulates the decision‑making process of an expert pilot. We demonstrate the efficacy of our system through real‑time simulations conducted in various indoor settings. We apply multiple visualization techniques to gain deeper insights into our trained network. Furthermore, we extend our approach to include an adaptive control algorithm for coordinating multiple drones to lift an object in an indoor environment collaboratively. Integrating our DRAL algorithm enables multiple UAVs to learn optimal control strategies that adapt to dynamic conditions and uncertainties. This innovation enhances the robustness and flexibility of indoor navigation and opens new possibilities for complex multi‑drone operations in confined spaces. The proposed framework highlights significant advancements in adaptive control and deep reinforcement learning, offering robust solutions for complex multi‑agent systems in real‑world applications.

PaperID: 13015, https://arxiv.org/pdf/2409.03897.pdf

Authors: Leo Muxing Wang, Pengkun Yang, Lili Su

Title: On the Convergence Rates of Federated Q-Learning across Heterogeneous Environments

Abstract:
Large‑scale multi‑agent systems are often deployed across wide geographic areas, where agents interact with heterogeneous environments. There is an emerging interest in understanding the role of heterogeneity in the performance of the federated versions of classic reinforcement learning algorithms. In this paper, we study synchronous federated Q‑learning, which aims to learn an optimal Q‑function by having K agents average their local Q‑estimates per E iterations. We observe an interesting phenomenon on the convergence speeds in terms of K and E. Similar to the homogeneous environment settings, there is a linear speed‑up concerning K in reducing the errors that arise from sampling randomness. Yet, in sharp contrast to the homogeneous settings, E>1 leads to significant performance degradation. Specifically, we provide a fine‑grained characterization of the error evolution in the presence of environmental heterogeneity, which decay to zero as the number of iterations T increases. The slow convergence of having E>1 turns out to be fundamental rather than an artifact of our analysis. We prove that, for a wide range of stepsizes, the \ell_\infty norm of the error cannot decay faster than Θ(E/T). In addition, our experiments demonstrate that the convergence exhibits an interesting two‑phase phenomenon. For any given stepsize, there is a sharp phase‑transition of the convergence: the error decays rapidly in the beginning yet later bounces up and stabilizes. Provided that the phase‑transition time can be estimated, choosing different stepsizes for the two phases leads to faster overall convergence.

PaperID: 13016, https://arxiv.org/pdf/2409.03881.pdf

Authors: Han Zheng, Zhongxia Yan, Cathy Wu

Title: Multi-agent Path Finding for Mixed Autonomy Traffic Coordination

Abstract:
In the evolving landscape of urban mobility, the prospective integration of Connected and Automated Vehicles (CAVs) with Human‑Driven Vehicles (HDVs) presents a complex array of challenges and opportunities for autonomous driving systems. While recent advancements in robotics have yielded Multi‑Agent Path Finding (MAPF) algorithms tailored for agent coordination task characterized by simplified kinematics and complete control over agent behaviors, these solutions are inapplicable in mixed‑traffic environments where uncontrollable HDVs must coexist and interact with CAVs. Addressing this gap, we propose the Behavior Prediction Kinematic Priority Based Search (BK‑PBS), which leverages an offline‑trained conditional prediction model to forecast HDV responses to CAV maneuvers, integrating these insights into a Priority Based Search (PBS) where the A search proceeds over motion primitives to accommodate kinematic constraints. We compare BK‑PBS with CAV planning algorithms derived by rule‑based car‑following models, and reinforcement learning. Through comprehensive simulation on a highway merging scenario across diverse scenarios of CAV penetration rate and traffic density, BK‑PBS outperforms these baselines in reducing collision rates and enhancing system‑level travel delay. Our work is directly applicable to many scenarios of multi‑human multi‑robot coordination.

PaperID: 13017, https://arxiv.org/pdf/2409.03793.pdf

Authors: Ishaan Domkundwar, Mukunda N S, Ishaan Bhola, Riddhik Kochhar

Title: Safeguarding AI Agents: Developing and Analyzing Safety Architectures

Abstract:
AI agents, specifically powered by large language models, have demonstrated exceptional capabilities in various applications where precision and efficacy are necessary. However, these agents come with inherent risks, including the potential for unsafe or biased actions, vulnerability to adversarial attacks, lack of transparency, and tendency to generate hallucinations. As AI agents become more prevalent in critical sectors of the industry, the implementation of effective safety protocols becomes increasingly important. This paper addresses the critical need for safety measures in AI systems, especially ones that collaborate with human teams. We propose and evaluate three frameworks to enhance safety protocols in AI agent systems: an LLM‑powered input‑output filter, a safety agent integrated within the system, and a hierarchical delegation‑based system with embedded safety checks. Our methodology involves implementing these frameworks and testing them against a set of unsafe agentic use cases, providing a comprehensive evaluation of their effectiveness in mitigating risks associated with AI agent deployment. We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems, minimizing potential harmful actions or outputs. Our work contributes to the ongoing effort to create safe and reliable AI applications, particularly in automated operations, and provides a foundation for developing robust guardrails to ensure the responsible use of AI agents in real‑world applications.

PaperID: 13018, https://arxiv.org/pdf/2409.03789.pdf

Authors: Ibrahim Alshehri, Adnan Alshehri, Abdulrahman Almalki, Majed Bamardouf, Alaqsa Akbar

Title: BreachSeek: A Multi-Agent Automated Penetration Tester

Abstract:
The increasing complexity and scale of modern digital environments have exposed significant gaps in traditional cybersecurity penetration testing methods, which are often time‑consuming, labor‑intensive, and unable to rapidly adapt to emerging threats. There is a critical need for an automated solution that can efficiently identify and exploit vulnerabilities across diverse systems without extensive human intervention. BreachSeek addresses this challenge by providing an AI‑driven multi‑agent software platform that leverages Large Language Models (LLMs) integrated through LangChain and LangGraph in Python. This system enables autonomous agents to conduct thorough penetration testing by identifying vulnerabilities, simulating a variety of cyberattacks, executing exploits, and generating comprehensive security reports. In preliminary evaluations, BreachSeek successfully exploited vulnerabilities in exploitable machines within local networks, demonstrating its practical effectiveness. Future developments aim to expand its capabilities, positioning it as an indispensable tool for cybersecurity professionals.

PaperID: 13019, https://arxiv.org/pdf/2409.03659.pdf

Authors: Ran Zhang, Steffen Eger

Title: LLM-based multi-agent poetry generation in non-cooperative environments

Abstract:
Despite substantial progress of large language models (LLMs) for automatic poetry generation, the generated poetry lacks diversity while the training process differs greatly from human learning. Under the rationale that the learning process of the poetry generation systems should be more human‑like and their output more diverse and novel, we introduce a framework based on social learning where we emphasize non‑cooperative interactions besides cooperative interactions to encourage diversity. Our experiments are the first attempt at LLM‑based multi‑agent systems in non‑cooperative environments for poetry generation employing both TRAINING‑BASED agents (GPT‑2) and PROMPTING‑BASED agents (GPT‑3 and GPT‑4). Our evaluation based on 96k generated poems shows that our framework benefits the poetry generation process for TRAINING‑BASED agents resulting in 1) a 3.0‑3.7 percentage point (pp) increase in diversity and a 5.6‑11.3 pp increase in novelty according to distinct and novel n‑grams. The generated poetry from TRAINING‑BASED agents also exhibits group divergence in terms of lexicons, styles and semantics. PROMPTING‑BASED agents in our framework also benefit from non‑cooperative environments and a more diverse ensemble of models with non‑homogeneous agents has the potential to further enhance diversity, with an increase of 7.0‑17.5 pp according to our experiments. However, PROMPTING‑BASED agents show a decrease in lexical diversity over time and do not exhibit the group‑based divergence intended in the social network. Our paper argues for a paradigm shift in creative tasks such as automatic poetry generation to include social learning processes (via LLM‑based agent modeling) similar to human interaction.

PaperID: 13020, https://arxiv.org/pdf/2409.03512.pdf

Authors: Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang, Lei Hou, Yu Zhang, Xu Han, Manli Li, Juanzi Li, Zhiyuan Liu, Huiqin Liu, Maosong Sun

Title: From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

Abstract:
Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI‑empowered Course), a new form of online education that leverages LLM‑driven multi‑agent systems to construct an AI‑augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI‑driven online education.

PaperID: 13021, https://arxiv.org/pdf/2409.03440.pdf

Authors: Phuc Phan Van, Dat Nguyen Minh, An Dinh Ngoc, Huy Phan Thanh

Title: Rx Strategist: Prescription Verification using LLM Agents System

Abstract:
To protect patient safety, modern pharmaceutical complexity demands strict prescription verification. We offer a new approach ‑ Rx Strategist ‑ that makes use of knowledge graphs and different search strategies to enhance the power of Large Language Models (LLMs) inside an agentic framework. This multifaceted technique allows for a multi‑stage LLM pipeline and reliable information retrieval from a custom‑built active ingredient database. Different facets of prescription verification, such as indication, dose, and possible drug interactions, are covered in each stage of the pipeline. We alleviate the drawbacks of monolithic LLM techniques by spreading reasoning over these stages, improving correctness and reliability while reducing memory demands. Our findings demonstrate that Rx Strategist surpasses many current LLMs, achieving performance comparable to that of a highly experienced clinical pharmacist. In the complicated world of modern medications, this combination of LLMs with organized knowledge and sophisticated search methods presents a viable avenue for reducing prescription errors and enhancing patient outcomes.

PaperID: 13022, https://arxiv.org/pdf/2409.03258.pdf

Authors: Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou

Title: GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Abstract:
Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro‑ and micro‑level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval‑augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi‑step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.

PaperID: 13023, https://arxiv.org/pdf/2409.03230.pdf

Authors: Jin Zhang, Jianyang Xue, Bochao Cao

Title: Improving agent performance in fluid environments by perceptual pretraining

Abstract:
In this paper, we construct a pretraining framework for fluid environment perception, which includes an information compression model and the corresponding pretraining method. We test this framework in a two‑cylinder problem through numerical simulation. The results show that after unsupervised pretraining with this framework, the intelligent agent can acquire key features of surrounding fluid environment, thereby adapting more quickly and effectively to subsequent multi‑scenario tasks. In our research, these tasks include perceiving the position of the upstream obstacle and actively avoiding shedding vortices in the flow field to achieve drag reduction. Better performance of the pretrained agent is discussed in the sensitivity analysis.

PaperID: 13024, https://arxiv.org/pdf/2409.03215.pdf

Authors: Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Title: xLAM: A Family of Large Action Models to Empower AI Agent Systems

Abstract:
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open‑source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high‑quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture‑of‑expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function‑Calling Leaderboard, outperforming GPT‑4, Claude‑3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open‑source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high‑performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam‑models‑65f00e2a0a63bbcd1c2dade4

PaperID: 13025, https://arxiv.org/pdf/2409.03052.pdf

Authors: Christopher Amato

Title: An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning

Abstract:
Multi‑agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and Decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner ‑‑ using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.

PaperID: 13026, https://arxiv.org/pdf/2409.02960.pdf

Authors: Shunichi Akatsuka, Yaemi Teramoto, Aaron Courville

Title: Managing multiple agents by automatically adjusting incentives

Abstract:
In the coming years, AI agents will be used for making more complex decisions, including in situations involving many different groups of people. One big challenge is that AI agent tends to act in its own interest, unlike humans who often think about what will be the best for everyone in the long run. In this paper, we explore a method to get self‑interested agents to work towards goals that benefit society as a whole. We propose a method to add a manager agent to mediate agent interactions by assigning incentives to certain actions. We tested our method with a supply‑chain management problem and showed that this framework (1) increases the raw reward by 22.2%, (2) increases the agents' reward by 23.8%, and (3) increases the manager's reward by 20.1%.

PaperID: 13027, https://arxiv.org/pdf/2409.02938.pdf

Authors: Gautham Ramachandran, Rick Yang

Title: CortexCompile: Harnessing Cortical-Inspired Architectures for Enhanced Multi-Agent NLP Code Synthesis

Abstract:
Current approaches to automated code generation often rely on monolithic models that lack real‑time adaptability and scalability. This limitation is particularly evident in complex programming tasks that require dynamic adjustment and efficiency. The integration of neuroscience principles into Natural Language Processing (NLP) has the potential to revolutionize automated code generation. This paper presents CortexCompile, a novel modular system inspired by the specialized functions of the human brain's cortical regions. By emulating the distinct roles of the Prefrontal Cortex, Parietal Cortex, Temporal Lobe, and Motor Cortex, CortexCompile achieves significant advancements in scalability, efficiency, and adaptability compared to traditional monolithic models like GPT‑4o. The system's architecture features a Task Orchestration Agent that manages dynamic task delegation and parallel processing, facilitating the generation of highly accurate and optimized code across increasingly complex programming tasks. Experimental evaluations demonstrate that CortexCompile consistently outperforms GPT‑4o in development time, accuracy, and user satisfaction, particularly in tasks involving real‑time strategy games and first‑person shooters. These findings underscore the viability of neuroscience‑inspired architectures in addressing the limitations of current NLP models, paving the way for more efficient and human‑like AI systems.

PaperID: 13028, https://arxiv.org/pdf/2409.02822.pdf

Authors: Giordano De Marzo, Claudio Castellano, David Garcia

Title: AI agents can coordinate beyond human scale

Abstract:
Large language models (LLMs) are increasingly deployed in collaborative tasks involving multiple agents, forming an "AI agent society: where agents interact and influence one another. Whether such groups can spontaneously coordinate on arbitrary decisions without external influence ‑ a hallmark of self‑organized regulation in human societies ‑ remains an open question. Here we investigate the stability of groups formed by AI agents by applying methods from complexity science and principles from behavioral sciences. We find that LLMs can spontaneously form cohesive groups, and that their opinion dynamics is governed by a majority force coefficient, which determines whether coordination is achievable. This majority force diminishes as group size increases, leading to a critical group size beyond which coordination becomes practically unattainable and stability is lost. Notably, this critical group size grows exponentially with the language capabilities of the models, and for the most advanced LLMs, it exceeds the typical size of informal human groups. Our findings highlight intrinsic limitations in the self‑organization of AI agent societies and have implications for the design of collaborative AI systems where coordination is desired or could represent a treat.

PaperID: 13029, https://arxiv.org/pdf/2409.02711.pdf

Authors: Mohammad Reshadati

Title: Creating a Gen-AI based Track and Trace Assistant MVP (SuperTracy) for PostNL

Abstract:
The developments in the field of generative AI has brought a lot of opportunities for companies, for instance to improve efficiency in customer service and automating tasks. PostNL, the biggest parcel and E‑commerce corporation of the Netherlands wants to use generative AI to enhance the communication around track and trace of parcels. During the internship a Minimal Viable Product (MVP) is created to showcase the value of using generative AI technologies, to enhance parcel tracking, analyzing the parcel's journey and being able to communicate about it in an easy to understand manner. The primary goal was to develop an in‑house LLM‑based system, reducing dependency on external platforms and establishing the feasibility of a dedicated generative AI team within the company. This multi‑agent LLM based system aimed to construct parcel journey stories and identify logistical disruptions with heightened efficiency and accuracy. The research involved deploying a sophisticated AI‑driven communication system, employing Retrieval‑Augmented Generation (RAG) for enhanced response precision, and optimizing large language models (LLMs) tailored to domain specific tasks. The MVP successfully implemented a multi‑agent open‑source LLM system, called SuperTracy. SuperTracy is capable of autonomously managing a broad spectrum of user inquiries and improving internal knowledge handling. Results and evaluation demonstrated technological innovation and feasibility, notably in communication about the track and trace of a parcel, which exceeded initial expectations. These advancements highlight the potential of AI‑driven solutions in logistics, suggesting many opportunities for further refinement and broader implementation within PostNL operational framework.

PaperID: 13030, https://arxiv.org/pdf/2409.02663.pdf

Authors: Ahmed Said Donmez, Muhammed O. Sayin

Title: Generalized Individual Q-learning for Polymatrix Games with Partial Observations

Abstract:
This paper addresses the challenge of limited observations in non‑cooperative multi‑agent systems where agents can have partial access to other agents' actions. We present the generalized individual Q‑learning dynamics that combine belief‑based and payoff‑based learning for the networked interconnections of more than two self‑interested agents. This approach leverages access to opponents' actions whenever possible, demonstrably achieving a faster (guaranteed) convergence to quantal response equilibrium in multi‑agent zero‑sum and potential polymatrix games. Notably, the dynamics reduce to the well‑studied smoothed fictitious play and individual Q‑learning under full and no access to opponent actions, respectively. We further quantify the improvement in convergence rate due to observing opponents' actions through numerical simulations.

PaperID: 13031, https://arxiv.org/pdf/2409.02645.pdf

Authors: Jannik Peters, Constantin Waubert de Puiseau, Hasan Tercan, Arya Gopikrishnan, Gustavo Adolpho Lucas De Carvalho, Christian Bitter, Tobias Meisen

Title: Emergent Language: A Survey and Taxonomy

Abstract:
The field of emergent language represents a novel area of research within the domain of artificial intelligence, particularly within the context of multi‑agent reinforcement learning. Although the concept of studying language emergence is not new, early approaches were primarily concerned with explaining human language formation, with little consideration given to its potential utility for artificial agents. In contrast, studies based on reinforcement learning aim to develop communicative capabilities in agents that are comparable to or even superior to human language. Thus, they extend beyond the learned statistical representations that are common in natural language processing research. This gives rise to a number of fundamental questions, from the prerequisites for language emergence to the criteria for measuring its success. This paper addresses these questions by providing a comprehensive review of 181 scientific publications on emergent language in artificial intelligence. Its objective is to serve as a reference for researchers interested in or proficient in the field. Consequently, the main contributions are the definition and overview of the prevailing terminology, the analysis of existing evaluation methods and metrics, and the description of the identified research gaps.

PaperID: 13032, https://arxiv.org/pdf/2409.02636.pdf

Authors: Toshiaki Tsuji

Title: Mamba as a motion encoder for robotic imitation learning

Abstract:
Recent advancements in imitation learning, particularly with the integration of LLM techniques, are set to significantly improve robots' dexterity and adaptability. This paper proposes using Mamba, a state‑of‑the‑art architecture with potential applications in LLMs, for robotic imitation learning, highlighting its ability to function as an encoder that effectively captures contextual information. By reducing the dimensionality of the state space, Mamba operates similarly to an autoencoder. It effectively compresses the sequential information into state variables while preserving the essential temporal dynamics necessary for accurate motion prediction. Experimental results in tasks such as cup placing and case loading demonstrate that despite exhibiting higher estimation errors, Mamba achieves superior success rates compared to Transformers in practical task execution. This performance is attributed to Mamba's structure, which encompasses the state space model. Additionally, the study investigates Mamba's capacity to serve as a real‑time motion generator with a limited amount of training data.

PaperID: 13033, https://arxiv.org/pdf/2409.02611.pdf

Authors: Lingling Zhang, Muye Huang, QianYing Wang, Yaxian Wang, Wenjun Wu, Jun Liu

Title: GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering

Abstract:
Chart Question Answering (CQA) aims at answering questions based on the visual chart content, which plays an important role in chart sumarization, business data analysis, and data report generation. CQA is a challenging multi‑modal task because of the strong context dependence and complex reasoning requirement. The former refers to answering this question strictly based on the analysis of the visual content or internal data of the given chart, while the latter emphasizes the various logical and numerical reasoning involved in answer prediction process. In this paper, we pay more attention on the complex reasoning in CQA task, and propose a novel Graph‑of‑Thought (GoT) guided compositional reasoning model called GoT‑CQA to overcome this problem. At first, we transform the chart‑oriented question into a directed acyclic GoT composed of multiple operator nodes, including localization, numerical and logical operator. It intuitively reflects the human brain's solution process to this question. After that, we design an efficient auto‑compositional reasoning framework guided by the GoT, to excute the multi‑step reasoning operations in various types of questions. Comprehensive experiments on ChartQA and PlotQA‑D datasets show that GoT‑CQA achieves outstanding performance, especially in complex human‑written and reasoning questions, comparing with the latest popular baselines.

PaperID: 13034, https://arxiv.org/pdf/2409.02434.pdf

Authors: Muhammad Raees, Afzal Ahmed

Title: Context-Aware Agent-based Model for Smart Long Distance Transport System

Abstract:
Long‑distance transport plays a vital role in the economic growth of countries. However, there is a lack of systems being developed for monitoring and support of long‑route vehicles (LRV). Sustainable and context‑aware transport systems with modern technologies are needed. We model for long‑distance vehicle transportation monitoring and support systems in a multi‑agent environment. Our model incorporates the distance vehicle transport mechanism through agent‑based modeling (ABM). This model constitutes the design protocol of ABM called Overview, Design, and Details (ODD). This model constitutes that every category of agents is offering information as a service. Hence, a federation of services through protocol for the communication between sensors and software components is desired. Such integration of services supports monitoring and tracking of vehicles on the route. The model simulations provide useful results for the integration of services based on smart objects.

PaperID: 13035, https://arxiv.org/pdf/2409.02246.pdf

Authors: Matthew Repasky, He Wang, Yao Xie

Title: Multi-Agent Reinforcement Learning for Joint Police Patrol and Dispatch

Abstract:
Police patrol units need to split their time between performing preventive patrol and being dispatched to serve emergency incidents. In the existing literature, patrol and dispatch decisions are often studied separately. We consider joint optimization of these two decisions to improve police operations efficiency and reduce response time to emergency calls. Methodology/results: We propose a novel method for jointly optimizing multi‑agent patrol and dispatch to learn policies yielding rapid response times. Our method treats each patroller as an independent Q‑learner (agent) with a shared deep Q‑network that represents the state‑action values. The dispatching decisions are chosen using mixed‑integer programming and value function approximation from combinatorial action spaces. We demonstrate that this heterogeneous multi‑agent reinforcement learning approach is capable of learning joint policies that outperform those optimized for patrol or dispatch alone. Managerial Implications: Policies jointly optimized for patrol and dispatch can lead to more effective service while targeting demonstrably flexible objectives, such as those encouraging efficiency and equity in response.

PaperID: 13036, https://arxiv.org/pdf/2409.02231.pdf

Authors: Joseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Yingze Wang, Thomas D. Bannister, Teresa Head-Gordon

Title: SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

Abstract:
We show that large language model (LLMs) can be transformed via supervised fine‑tuning (SFT) of engineered prompts into SmileyLlama for exploring the chemical space of drug molecules. We benchmark SmileyLlama against pre‑trained LLMs and chemical language models (CLM) trained from scratch for generating valid and novel drug‑like molecules, and use direct preference optimization (DPO) to both improve SmileyLlama's adherence to a prompt and as part of the iMiner reinforcement learning framework to predict molecules with optimized 3D conformations and high binding affinity to drug targets. By training an LLM to speak directly as a CLM, while retaining most of its natural language capabilities, we show that we can reliably generate molecules with user‑specified properties rather than acting only as a chatbot with knowledge of chemistry or as a virtual assistant. While SmileyLlama is geared toward drug discovery, the SFT/DPO/LLM framework can be extended to other chemical, biological, and materials applications.

PaperID: 13037, https://arxiv.org/pdf/2409.01534.pdf

Authors: Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Title: Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild

Abstract:
In this study, we propose Cross‑domain Multi‑step Thinking (CdMT) to improve zero‑shot fine‑grained traffic sign recognition (TSR) performance in the wild. Zero‑shot fine‑grained TSR in the wild is challenging due to the cross‑domain problem between clean template traffic signs and real‑world counterparts, and existing approaches particularly struggle with cross‑country TSR scenarios, where traffic signs typically differ between countries. The proposed CdMT framework tackles these challenges by leveraging the multi‑step reasoning capabilities of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for LMMs. Context descriptions, which are enhanced by center coordinate prompt optimization, enable the precise localization of target traffic signs in complex road images and filter irrelevant responses via novel prior traffic sign hypotheses. Characteristic descriptions, which are derived from in‑context learning with template traffic signs, bridge cross‑domain gaps and enhance fine‑grained TSR. Differential descriptions refine the multimodal reasoning ability of LMMs by distinguishing subtle differences among similar signs. CdMT is independent of training data and requires only simple and uniform instructions, enabling it to achieve cross‑country TSR. We conducted extensive experiments on three benchmark datasets and two real‑world datasets from different countries. The proposed CdMT framework achieved superior performance compared with other state‑of‑the‑art methods on all five datasets, with recognition accuracies of 0.93, 0.89, 0.97, 0.89, and 0.85 on the GTSRB, BTSD, TT‑100K, Sapporo, and Yokohama datasets, respectively.

PaperID: 13038, https://arxiv.org/pdf/2409.01525.pdf

Authors: Bryce L Ferguson, Dario Paccagnan, Bary S R Pradelski, Jason R Marden

Title: Bridging the Gap Between Central and Local Decision-Making: The Efficacy of Collaborative Equilibria in Altruistic Congestion Games

Abstract:
Congestion games are popular models often used to study the system‑level inefficiencies caused by selfish agents, typically measured by the price of anarchy. One may expect that aligning the agents' preferences with the system‑level objective‑‑altruistic behavior‑‑would improve efficiency, but recent works have shown that altruism can lead to more significant inefficiency than selfishness in congestion games. In this work, we study to what extent the localness of decision‑making causes inefficiency by considering collaborative decision‑making paradigms that exist between centralized and distributed in altruistic congestion games. In altruistic congestion games with convex latency functions, the system cost is a super‑modular function over the player's joint actions, and the Nash equilibria of the game are local optima in the neighborhood of unilateral deviations. When agents can collaborate, we can exploit the common‑interest structure to consider equilibria with stronger local optimality guarantees in the system objective, e.g., if groups of k agents can collaboratively minimize the system cost, the system equilibria are the local optima over k‑lateral deviations. Our main contributions are in constructing tractable linear programs that provide bounds on the price of anarchy of collaborative equilibria in altruistic congestion games. Our findings bridge the gap between the known efficiency guarantees of centralized and distributed decision‑making paradigms while also providing insights into the benefit of inter‑agent collaboration in multi‑agent systems.

PaperID: 13039, https://arxiv.org/pdf/2409.01411.pdf

Authors: Zirui Xu, Vasileios Tzoumas

Title: Performance-Aware Self-Configurable Multi-Agent Networks: A Distributed Submodular Approach for Simultaneous Coordination and Network Design

Abstract:
We introduce the first, to our knowledge, rigorous approach that enables multi‑agent networks to self‑configure their communication topology to balance the trade‑off between scalability and optimality during multi‑agent planning. We are motivated by the future of ubiquitous collaborative autonomy where numerous distributed agents will be coordinating via agent‑to‑agent communication to execute complex tasks such as traffic monitoring, event detection, and environmental exploration. But the explosion of information in such large‑scale networks currently curtails their deployment due to impractical decision times induced by the computational and communication requirements of the existing near‑optimal coordination algorithms. To overcome this challenge, we present the AlterNAting COordination and Network‑Design Algorithm (Anaconda), a scalable algorithm that also enjoys near‑optimality guarantees. Subject to the agents' bandwidth constraints, Anaconda enables the agents to optimize their local communication neighborhoods such that the action‑coordination approximation performance of the network is maximized. Compared to the state of the art, Anaconda is an anytime self‑configurable algorithm that quantifies its suboptimality guarantee for any type of network, from fully disconnected to fully centralized, and that, for sparse networks, is one order faster in terms of decision speed. To develop the algorithm, we quantify the suboptimality cost due to decentralization, i.e., due to communication‑minimal distributed coordination. We also employ tools inspired by the literature on multi‑armed bandits and submodular maximization subject to cardinality constraints. We demonstrate Anaconda in simulated scenarios of area monitoring and compare it with a state‑of‑the‑art algorithm.

PaperID: 13040, https://arxiv.org/pdf/2409.01392.pdf

Authors: Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, Lei Bai

Title: ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Abstract:
Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work attempts to study using LLM‑based agents to design collaborative AI systems autonomously. To explore this problem, we first introduce ComfyBench to evaluate agents's ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction‑following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM‑based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent is based on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi‑agent system that cooperates to learn from existing workflows and generate new workflows for a given task. While experimental results demonstrate that ComfyAgent achieves a comparable resolve rate to o1‑preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks. LLM‑based agents still have a long way to go in autonomously designing collaborative AI systems. Progress with ComfyBench is paving the way for more intelligent and autonomous collaborative AI systems.

PaperID: 13041, https://arxiv.org/pdf/2409.01092.pdf

Authors: Wenshuai Liu, Yaru Fu, Yongna Guo, Fu Lee Wang, Wen Sun, Yan Zhang

Title: Two-Timescale Synchronization and Migration for Digital Twin Networks: A Multi-Agent Deep Reinforcement Learning Approach

Abstract:
Digital twins (DTs) have emerged as a promising enabler for representing the real‑time states of physical worlds and realizing self‑sustaining systems. In practice, DTs of physical devices, such as mobile users (MUs), are commonly deployed in multi‑access edge computing (MEC) networks for the sake of reducing latency. To ensure the accuracy and fidelity of DTs, it is essential for MUs to regularly synchronize their status with their DTs. However, MU mobility introduces significant challenges to DT synchronization. Firstly, MU mobility triggers DT migration which could cause synchronization failures. Secondly, MUs require frequent synchronization with their DTs to ensure DT fidelity. Nonetheless, DT migration among MEC servers, caused by MU mobility, may occur infrequently. Accordingly, we propose a two‑timescale DT synchronization and migration framework with reliability consideration by establishing a non‑convex stochastic problem to minimize the long‑term average energy consumption of MUs. We use Lyapunov theory to convert the reliability constraints and reformulate the new problem as a partially observable Markov decision‑making process (POMDP). Furthermore, we develop a heterogeneous agent proximal policy optimization with Beta distribution (Beta‑HAPPO) method to solve it. Numerical results show that our proposed Beta‑HAPPO method achieves significant improvements in energy savings when compared with other benchmarks.

PaperID: 13042, https://arxiv.org/pdf/2409.01060.pdf

Authors: Bin Yang, Boda Liu, Yilong Han, Xin Meng, Yifan Wang, Hansi Yang, Jianzhuang Xia

Title: Multiagent Reinforcement Learning Enhanced Decision-making of Crew Agents During Floor Construction Process

Abstract:
Fine‑grained simulation of floor construction processes is essential for supporting lean management and the integration of information technology. However, existing research does not adequately address the on‑site decision‑making of constructors in selecting tasks and determining their sequence within the entire construction process. Moreover, decision‑making frameworks from computer science and robotics are not directly applicable to construction scenarios. To facilitate intelligent simulation in construction, this study introduces the Construction Markov Decision Process (CMDP). The primary contribution of this CMDP framework lies in its construction knowledge in decision, observation modifications and policy design, enabling agents to perceive the construction state and follow policy guidance to evaluate and reach various range of targets for optimizing the planning of construction activities. The CMDP is developed on the Unity platform, utilizing a two‑stage training approach with the multi‑agent proximal policy optimization algorithm. A case study demonstrates the effectiveness of this framework: the low‑level policy successfully simulates the construction process in continuous space, facilitating policy testing and training focused on reducing conflicts and blockages among crews; and the high‑level policy improving the spatio‑temporal planning of construction activities, generating construction patterns in distinct phases, leading to the discovery of new construction insights.

PaperID: 13043, https://arxiv.org/pdf/2409.01007.pdf

Authors: Edward Y. Chang

Title: Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General Intelligence

Abstract:
This booklet, Unlocking the Wisdom of Multi‑LLM Collaborative Intelligence, serves as an accessible introduction to the full volume The Path to Artificial General Intelligence. Through fourteen aphorisms, it distills the core principles of Multi‑LLM Agent Collaborative Intelligence (MACI), a framework designed to coordinate multiple LLMs toward reasoning, planning, and decision‑making that surpasses the capabilities of any single model. The booklet includes titles, abstracts, and introductions from each main chapter, along with the full content of the first two. The newly released third edition features significant enhancements to Chapters 6 through 9 and a revised preface responding to Yann LeCun's critique of AGI feasibility. While LeCun argues that LLMs lack grounding, memory, and planning, we propose that MACI's collaborative architecture, featuring multimodal agents in executive, legislative, and judicial roles, directly addresses these limitations. Chapters on SocraSynth, EVINCE, consciousness modeling, and behavior regulation demonstrate that reasoning systems grounded in structured interaction and checks and balances can produce more reliable, interpretable, and adaptive intelligence. By integrating complementary model strengths, including world modeling and multimodal perception, MACI enables a system‑level intelligence that exceeds the sum of its parts. Like human institutions, progress in AI may depend less on isolated performance and more on coordinated judgment. Collaborative LLMs, not just larger ones, may chart the path toward artificial general intelligence.

PaperID: 13044, https://arxiv.org/pdf/2409.00993.pdf

Authors: Ilya Horiguchi, Takahide Yoshida, Takashi Ikegami

Title: Evolution of Social Norms in LLM Agents using Natural Language

Abstract:
Recent advancements in Large Language Models (LLMs) have spurred a surge of interest in leveraging these models for game‑theoretical simulations, where LLMs act as individual agents engaging in social interactions. This study explores the potential for LLM agents to spontaneously generate and adhere to normative strategies through natural language discourse, building upon the foundational work of Axelrod's metanorm games. Our experiments demonstrate that through dialogue, LLM agents can form complex social norms, such as metanorms‑norms enforcing the punishment of those who do not punish cheating‑purely through natural language interaction. The results affirm the effectiveness of using LLM agents for simulating social interactions and understanding the emergence and evolution of complex strategies and norms through natural language. Future work may extend these findings by incorporating a wider range of scenarios and agent characteristics, aiming to uncover more nuanced mechanisms behind social norm formation.

PaperID: 13045, https://arxiv.org/pdf/2409.00826.pdf

Authors: Petr Specian

Title: Digital Homunculi and Institutional Design: Breaking Through the Experimentation Bottleneck

Abstract:
Democracy research faces a longstanding experimentation bottleneck. Potential institutional innovations remain untested because human‑subject studies are slow, expensive, and ethically fraught. This paper argues that digital homunculi, that is, GenAI‑powered agents role‑playing humans in diverse institutional settings, could offer a way to break through the bottleneck. In contrast to the legacy agent‑based modeling, building complexity from transparent simple rules, the digital homunculi methodology aims to extract latent human behavioral knowledge from opaque large language models. To this ends, it designs multi‑agent interactions as elicitation devices to trigger in LLMs human‑like behavior that can be recorded as synthetic data. However, the validity of synthetic data remains an open question. Success requires that accurate, coherent, transferable models of humans ('little humans' ‑ homunculi) already lurk within GenAI's inscrutable matrices and can be lured out via the social simulation role‑play exercise. At the same time, to the extent these attempts are successful, they promise to completely transform the political economy of institutional research from scarcity to abundance. To help mitigate the number of challenges along the way to such success, I propose concrete validation strategies including behavioral back‑testing via knowledge cutoffs, and outline infrastructure requirements for rigorous evaluation. The stakes are high: legacy democratic institutions develop at much slower pace than the surrounding technological landscape. If they falter, we lack a repository of tested backup alternatives. Breaking through the experimentation bottleneck must be a priority and digital homunculi may be quickly maturing into a methodology capable of achieving this feat.

PaperID: 13046, https://arxiv.org/pdf/2409.00754.pdf

Authors: Jiaming Yin, Weixiong Rao, Yu Xiao, Keshuang Tang

Title: Cooperative Path Planning with Asynchronous Multiagent Reinforcement Learning

Abstract:
In this paper, we study the shortest path problem (SPP) with multiple source‑destination pairs (MSD), namely MSD‑SPP, to minimize average travel time of all shortest paths. The inherent traffic capacity limits within a road network contributes to the competition among vehicles. Multi‑agent reinforcement learning (MARL) model cannot offer effective and efficient path planning cooperation due to the asynchronous decision making setting in MSD‑SPP, where vehicles (a.k.a agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub‑graphs and subsequently execute a two‑stage process of inter‑region and intra‑region route planning. To address the asynchronous issue, in the proposed asyn‑MARL framework, we first design a global state, which exploits a low‑dimensional vector to implicitly represent the joint observations and actions of multi‑agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate the cooperation among vehicles towards the same or close destinations and a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation result demonstrates that our approach outperforms state‑of‑the‑art planning approaches.

PaperID: 13047, https://arxiv.org/pdf/2409.00717.pdf

Authors: Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du

Title: Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques

Abstract:
We initiate the study of Preference‑Based Multi‑Agent Reinforcement Learning (PbMARL), exploring both theoretical foundations and empirical validations. We define the task as identifying the Nash equilibrium from a preference‑only offline dataset in general‑sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective PbMARL, demonstrating that single‑policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We propose an additional penalty based on the distribution of the dataset to incorporate pessimism, improving stability and effectiveness during training. Our findings underscore the multifaceted approach required for PbMARL, paving the way for effective preference‑based multi‑agent systems.

PaperID: 13048, https://arxiv.org/pdf/2409.00636.pdf

Authors: Yunxiao Shi, Min Xu, Haimin Zhang, Xing Zi, Qiang Wu

Title: A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine

Abstract:
Large language models (LLMs) and retrieval‑augmented generation (RAG) techniques have revolutionized traditional information access, enabling AI agent to search and summarize information on behalf of users during dynamic dialogues. Despite their potential, current AI search engines exhibit considerable room for improvement in several critical areas. These areas include the support for multimodal information, the delivery of personalized responses, the capability to logically answer complex questions, and the facilitation of more flexible interactions. This paper proposes a novel AI Search Engine framework called the Agent Collaboration Network (ACN). The ACN framework consists of multiple specialized agents working collaboratively, each with distinct roles such as Account Manager, Solution Strategist, Information Manager, and Content Creator. This framework integrates mechanisms for picture content understanding, user profile tracking, and online evolution, enhancing the AI search engine's response quality, personalization, and interactivity. A highlight of the ACN is the introduction of a Reflective Forward Optimization method (RFO), which supports the online synergistic adjustment among agents. This feature endows the ACN with online learning capabilities, ensuring that the system has strong interactive flexibility and can promptly adapt to user feedback. This learning method may also serve as an optimization approach for agent‑based systems, potentially influencing other domains of agent applications.

PaperID: 13049, https://arxiv.org/pdf/2409.00557.pdf

Authors: Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Title: Learning to Ask: When LLM Agents Meet Unclear Instruction

Abstract:
Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool‑use under imperfect instructions, we meticulously examine the real‑world instructions queried from users, analyze the error patterns, and build a challenging tool‑use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next‑token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask‑when‑Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user‑LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.

PaperID: 13050, https://arxiv.org/pdf/2409.00494.pdf

Authors: Haowen Xu, Jinghui Yuan, Anye Zhou, Guanhao Xu, Wan Li, Xuegang Ban, Xinyue Ye

Title: GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems

Abstract:
Leveraging recent advances in generative AI, multi‑agent systems are increasingly being developed to enhance the functionality and efficiency of smart city applications. This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval‑Augmented Generation (RAG) technologies in Intelligent Transportation Systems (ITS), paving the way for innovative solutions to address critical challenges in urban mobility. We begin by providing a comprehensive overview of the current state‑of‑the‑art in mobility data, ITS, and Connected Vehicles (CV) applications. Building on this review, we discuss the rationale behind RAG and examine the opportunities for integrating these Generative AI (GenAI) technologies into the smart mobility sector. We propose a conceptual framework aimed at developing multi‑agent systems capable of intelligently and conversationally delivering smart mobility services to urban commuters, transportation operators, and decision‑makers. Our approach seeks to foster an autonomous and intelligent approach that (a) promotes science‑based advisory to reduce traffic congestion, accidents, and carbon emissions at multiple scales, (b) facilitates public education and engagement in participatory mobility management, and (c) automates specialized transportation management tasks and the development of critical ITS platforms, such as data analytics and interpretation, knowledge representation, and traffic simulations. By integrating LLM and RAG, our approach seeks to overcome the limitations of traditional rule‑based multi‑agent systems, which rely on fixed knowledge bases and limited reasoning capabilities. This integration paves the way for a more scalable, intuitive, and automated multi‑agent paradigm, driving advancements in ITS and urban mobility.

PaperID: 13051, https://arxiv.org/pdf/2409.00134.pdf

Authors: Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

Title: MAPF-GPT: Imitation Learning for Multi-Agent Pathfinding at Scale

Abstract:
Multi‑agent pathfinding (MAPF) is a problem that generally requires finding collision‑free paths for multiple agents in a shared environment. Solving MAPF optimally, even under restrictive assumptions, is NP‑hard, yet efficient solutions for this problem are critical for numerous applications, such as automated warehouses and transportation systems. Recently, learning‑based approaches to MAPF have gained attention, particularly those leveraging deep reinforcement learning. Typically, such learning‑based MAPF solvers are augmented with additional components like single‑agent planning or communication. Orthogonally, in this work we rely solely on imitation learning that leverages a large dataset of expert MAPF solutions and transformer‑based neural network to create a foundation model for MAPF called MAPF‑GPT. The latter is capable of generating actions without additional heuristics or communication. MAPF‑GPT demonstrates zero‑shot learning abilities when solving the MAPF problems that are not present in the training dataset. We show that MAPF‑GPT notably outperforms the current best‑performing learnable MAPF solvers on a diverse range of problem instances and is computationally efficient during inference.

PaperID: 13052, https://arxiv.org/pdf/2409.00107.pdf

Authors: Jun He, Andrew L. Liu

Title: Evaluating the Impact of Multiple DER Aggregators on Wholesale Energy Markets: A Hybrid Mean Field Approach

Abstract:
The integration of distributed energy resources (DERs) into wholesale energy markets can greatly enhance grid flexibility, improve market efficiency, and contribute to a more sustainable energy future. As DERs ‑‑ such as solar PV panels and energy storage ‑‑ proliferate, effective mechanisms are needed to ensure that small prosumers can participate meaningfully in these markets. We study a wholesale market model featuring multiple DER aggregators, each controlling a portfolio of DER resources and bidding into the market on behalf of the DER asset owners. The key of our approach lies in recognizing the repeated nature of market interactions the ability of participants to learn and adapt over time. Specifically, Aggregators repeatedly interact with each other and with other suppliers in the wholesale market, collectively shaping wholesale electricity prices (aka the locational marginal prices (LMPs)). We model this multi‑agent interaction using a mean‑field game (MFG), which uses market information ‑‑ reflecting the average behavior of market participants ‑‑ to enable each aggregator to predict long‑term LMP trends and make informed decisions. For each aggregator, because they control the DERs within their portfolio under certain contract structures, we employ a mean‑field control (MFC) approach (as opposed to a MFG) to learn an optimal policy that maximizes the total rewards of the DERs under their management. We also propose a reinforcement learning (RL)‑based method to help each agent learn optimal strategies within the MFG framework, enhancing their ability to adapt to market conditions and uncertainties. Numerical simulations show that LMPs quickly reach a steady state in the hybrid mean‑field approach. Furthermore, our results demonstrate that the combination of energy storage and mean‑field learning significantly reduces price volatility compared to scenarios without storage.

PaperID: 13053, https://arxiv.org/pdf/2409.00082.pdf

Authors: Sagar Srinivas Sakhinana, Geethan Sannidhi, Venkataramana Runkana

Title: Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical, Introspective Multi-Agent Framework for Open-Domain Question Answering

Abstract:
In the chemical and process industries, Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) are critical for design, construction, and maintenance. Recent advancements in Generative AI, such as Large Multimodal Models (LMMs) like GPT4 (Omni), have shown promise in understanding and interpreting process diagrams for Visual Question Answering (VQA). However, proprietary models pose data privacy risks, and their computational complexity prevents knowledge editing for domain‑specific customization on consumer hardware. To overcome these challenges, we propose a secure, on‑premises enterprise solution using a hierarchical, multi‑agent Retrieval Augmented Generation (RAG) framework for open‑domain question answering (ODQA) tasks, offering enhanced data privacy, explainability, and cost‑effectiveness. Our novel multi‑agent framework employs introspective and specialized sub‑agents using open‑source, small‑scale multimodal models with the ReAct (Reason+Act) prompting technique for PFD and P&ID analysis, integrating multiple information sources to provide accurate and contextually relevant answers. Our approach, supported by iterative self‑correction, aims to deliver superior performance in ODQA tasks. We conducted rigorous experimental studies, and the empirical results validated the proposed approach effectiveness.

PaperID: 13054, https://arxiv.org/pdf/2409.00054.pdf

Authors: Yuting Hu, Dancheng Liu, Qingyun Wang, Charles Yu, Chenhui Xu, Qingxiao Zheng, Heng Ji, Jinjun Xiong

Title: Automating Intervention Discovery from Scientific Literature: A Progressive Ontology Prompting and Dual-LLM Framework

Abstract:
Identifying effective interventions from the scientific literature is challenging due to the high volume of publications, specialized terminology, and inconsistent reporting formats, making manual curation laborious and prone to oversight. To address this challenge, this paper proposes a novel framework leveraging large language models (LLMs), which integrates a progressive ontology prompting (POP) algorithm with a dual‑agent system, named LLM‑Duo. On the one hand, the POP algorithm conducts a prioritized breadth‑first search (BFS) across a predefined ontology, generating structured prompt templates and action sequences to guide the automatic annotation process. On the other hand, the LLM‑Duo system features two specialized LLM agents, an explorer and an evaluator, working collaboratively and adversarially to continuously refine annotation quality. We showcase the real‑world applicability of our framework through a case study focused on speech‑language intervention discovery. Experimental results show that our approach surpasses advanced baselines, achieving more accurate and comprehensive annotations through a fully automated process. Our approach successfully identified 2,421 interventions from a corpus of 64,177 research articles in the speech‑language pathology domain, culminating in the creation of a publicly accessible intervention knowledge base with great potential to benefit the speech‑language pathology community.

PaperID: 13055, https://arxiv.org/pdf/2409.00038.pdf

Authors: Malik Abdul Sami, Muhammad Waseem, Zheying Zhang, Zeeshan Rasheed, Kari Systä, Pekka Abrahamsson

Title: AI based Multiagent Approach for Requirements Elicitation and Analysis

Abstract:
Requirements Engineering (RE) plays a pivotal role in software development, encompassing tasks such as requirements elicitation, analysis, specification, and change management. Despite its critical importance, RE faces challenges including communication complexities, early‑stage uncertainties, and accurate resource estimation. This study empirically investigates the effectiveness of utilizing Large Language Models (LLMs) to automate requirements analysis tasks. We implemented a multi‑agent system that deploys AI models as agents to generate user stories from initial requirements, assess and improve their quality, and prioritize them using a selected technique. In our implementation, we deployed four models, namely GPT‑3.5, GPT‑4 Omni, LLaMA3‑70, and Mixtral‑8B, and conducted experiments to analyze requirements on four real‑world projects. We evaluated the results by analyzing the semantic similarity and API performance of different models, as well as their effectiveness and efficiency in requirements analysis, gathering users' feedback on their experiences. Preliminary results indicate notable variations in task completion among the models. Mixtral‑8B provided the quickest responses, while GPT‑3.5 performed exceptionally well when processing complex user stories with a higher similarity score, demonstrating its capability in deriving accurate user stories from project descriptions. Feedback and suggestions from the four project members further corroborate the effectiveness of LLMs in improving and streamlining RE phases.